●Post #1: .v/ Directories
●Post #2: User-Scoped Encrypted Service Credentials
●Post #3: X_SYSTEMD_UNIT_ACTIVE= sd_notify() Messages
●Post #4: System-wide ProtectSystem=
●Post #5: run0 as sudo Replacement
●Post #6: System Credentials
●Post #7: Unprivileged DDI Mounts + Unprivileged systemd-nspawn
●Post #8: ssh into systemd-homed Accounts
●Post #9: systemd-vmspawn
●Post #10: Mutable systemd-sysext
●Post #11: Network Device Ownership
●Post #12: systemctl sleep
●Post #13: systemd-ssh-generator
●Post #14: systemd-cryptenroll without device argument
●Post #15: dlopen() ELF Metadata
●Post #16: Capsules
I intend to do a similar series of posts for the next systemd
release (v257), hence if you haven't left tech Twitter for Mastodon yet, now is
the opportunity.
And while I have you: note that the All Systems Go 2024 Conference
(Berlin) Call for Papers ends 😲 THIS WEEK 🤯!
Hence, HURRY, and get your submissions in
now, for the best
low-level Linux userspace conference around!
It has been a while since mkosi was last discussed on this blog. Some years ago, I took over development and
there's been a huge amount of changes and improvements since then. So I
figure this is a good time to re-introduce mkosi.
mkosi stands for Make Operating
System Image. It generates OS images that can be used for a variety of
purposes.
If you prefer watching a video over reading a blog post, you can also
watch my presentation on mkosi at All Systems Go 2023.
mkosi was originally written as a tool to simplify hacking on systemd
and for experimenting with images using many of the new concepts being
introduced in systemd at the time. In the meantime, it has evolved into
a general purpose image builder that can be used in a multitude of
scenarios.
Instructions to install mkosi can be found in its
readme. We
recommend running the latest version to take advantage of all the latest
features and bug fixes. You'll also need bubblewrap and the package
manager of your favorite distribution to get started.
At its core, the workflow of mkosi can be divided into 3 steps:
1. Generate an OS tree for some distribution by installing a set of packages.
2. Package up that OS tree in a variety of output formats.
3. (Optionally) Boot the resulting image in qemu or systemd-nspawn.
Images can be built for any of the following distributions:
●Fedora Linux
●Ubuntu
●openSUSE
●Debian
●Arch Linux
●CentOS Stream
●RHEL
●Rocky Linux
●Alma Linux
And the following output formats are supported:
●GPT disk images built with systemd-repart
●Tar archives
●CPIO archives (for building initramfs images)
●USIs (Unified System Images which are full OS images packed in a UKI)
●Sysext, confext and portable images
●Directory trees
For example, to build an Arch Linux GPT disk image and boot it in
qemu, you can run the following command:
$ mkosi -d arch -p systemd -p udev -p linux -t disk qemu
To instead boot the image in systemd-nspawn, replace qemu with boot:
$ mkosi -d arch -p systemd -p udev -p linux -t disk boot
The actual image can be found in the current working directory named
image.raw. However, using a separate output directory is recommended,
which is as simple as running mkdir mkosi.output.
To rebuild the image after it's already been built once, add -f to the
command line before the verb. Any arguments passed
after the verb are forwarded to either systemd-nspawn or qemu
itself. To build the image without booting it, pass build instead of
boot or qemu, or don't pass a verb at all.
By default, the disk image will have an appropriately sized root
partition and an ESP partition, but the partition layout and contents
can be fully customized using systemd-repart by creating partition
definition files in mkosi.repart/. This allows you to customize the
partitions as you see fit:
●The root partition can be encrypted.
●Partition sizes can be customized.
●Partitions can be protected with signed dm-verity.
●You can opt out of having a root partition and only have a /usr
partition instead.
●You can add various other partitions, e.g. an XBOOTLDR partition or a
swap partition.
●...
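For instance, a minimal sketch of two such definition files (names, sizes and options here are illustrative; see the systemd-repart documentation for everything that can be configured):
mkosi.repart/00-esp.conf:
[Partition]
Type=esp
Format=vfat
SizeMinBytes=512M
SizeMaxBytes=512M
mkosi.repart/10-root.conf:
[Partition]
Type=root
Format=ext4
CopyFiles=/
Encrypt=key-file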
As part of building the image, we'll run various tools such as
systemd-sysusers, systemd-firstboot, depmod, systemd-hwdb and
more to make sure the image is set up correctly.
mkosi supports configuration files
where the same settings that can be specified on the command line can be
written down.
For example, the command we used above can be written down in a
configuration file mkosi.conf:
[Distribution]
Distribution=arch
[Output]
Format=disk
[Content]
Packages=
        systemd
        udev
        linux
Like systemd, mkosi uses INI configuration files. We also support
dropins which can be placed in mkosi.conf.d. Configuration files can
also be conditionalized using the [Match] section. For example, to
only install a specific package on Arch Linux, you can write the
following to mkosi.conf.d/10-arch.conf:
[Match]
Distribution=arch
[Content]
Packages=pacman
Because not everything you need will be supported in mkosi, we support
running scripts at various points during the image build process where
all extra image customization can be done. For example, if it is found,
mkosi.postinst is called after packages have been installed. Scripts
are executed on the host system by default (in a sandbox), but can be
executed inside the image by suffixing the script with .chroot, so if
mkosi.postinst.chroot is found it will be executed inside the image.
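As a sketch, a hypothetical mkosi.postinst.chroot that enables a few services inside the image could look like this (the service names are just examples):
#!/bin/sh
set -e
# This runs inside the image, after packages have been installed.
systemctl enable systemd-networkd.service systemd-resolved.service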
To add extra files to the image, you can place them in mkosi.extra in
the source directory and they will be automatically copied into the
image after packages have been installed.
mkosi will automatically
generate a UEFI/BIOS bootable image. As mkosi is a systemd project, it
will always build
UKIs
(Unified Kernel Images), except if the image is BIOS-only (since UKIs
cannot be used on BIOS). The initramfs is built like a regular image by
installing distribution packages and packaging them up in a CPIO archive
instead of a disk image. Specifically, we do not use dracut,
mkinitcpio or initramfs-tools to generate the initramfs from the
host system. ukify is used to assemble all the individual components
into a UKI.
If you don't want mkosi to generate a bootable image, you can set
Bootable=no to explicitly disable this logic.
A key feature of mkosi for development is that we can
build our source code against the image we're building and install it
into the image we're building. mkosi supports this via build scripts.
If a script named mkosi.build (or mkosi.build.chroot) is found,
we'll execute it as part of the build. Any files put by the build script
into $DESTDIR will be installed into the image. Required build
dependencies can be installed using the BuildPackages= setting. These
packages are installed into an overlay which is put on top of the image
when running the build script so the build packages are available when
running the build script but don't end up in the final image.
An example mkosi.build.chroot script for a project using meson could
look as follows:
#!/bin/sh
meson setup "$BUILDDIR" "$SRCDIR"
ninja -C "$BUILDDIR"
if [ "$WITH_TESTS" = "1" ]; then
    meson test -C "$BUILDDIR"
fi
meson install -C "$BUILDDIR"
Now, every time the image is built, the build script will be executed
and the results will be installed into the image.
The $BUILDDIR environment variable points to a directory that can be
used as the build directory for build artifacts to allow for incremental
builds if the build system supports it.
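To persist that directory across builds, current mkosi conventions (analogous to mkosi.cache and mkosi.output; double-check the documentation for your version) only require creating it next to the configuration, or configuring it explicitly via the BuildDirectory= setting:
$ mkdir mkosi.builddir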
Of course, downloading all packages from scratch every time and
re-installing them again every time the image is built is rather slow,
so mkosi supports two modes of caching to speed things up.
The first caching mode caches all downloaded packages so they don't have
to be downloaded again on subsequent builds. Enabling this is as simple
as running mkdir mkosi.cache.
The second mode of caching caches the image after all packages have been
installed but before running the build script. On subsequent builds,
mkosi will copy the cache instead of reinstalling all packages from
scratch. This mode can be enabled using the Incremental= setting.
While there is some rudimentary cache invalidation, the cache can also
forcibly be rebuilt by specifying -ff on the command line instead of
-f.
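Putting the two caching modes together, a typical development loop could look like this (-i is the short option for Incremental=; a sketch, flags may differ slightly across mkosi versions):
$ mkdir mkosi.cache   # cache downloaded packages across builds
$ mkosi -i -f         # rebuild, reusing the cached post-installation image
$ mkosi -i -ff        # force even the cached image to be rebuilt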
Note that when running on a btrfs filesystem, mkosi will automatically
use subvolumes for the cached images which can be snapshotted on
subsequent builds for even faster rebuilds. We'll also use reflinks to
do copy-on-write copies where possible.
With this setup, by running mkosi -f qemu in the systemd repository,
it takes about 40 seconds to go from a source code change to a root
shell in a virtual machine running the latest systemd with your change
applied. This makes it very easy to test changes to systemd in a safe
environment without risk of breaking your host system.
Of course, while 40 seconds is not a very long time, it's still more
than we'd like, especially if all we're doing is modifying the kernel
command line. That's why we have the KernelCommandLineExtra= option to
configure kernel command line options that are passed to the container
or virtual machine at runtime instead of being embedded into the image.
These extra kernel command line options are picked up when the image is
booted with qemu's direct kernel boot (using -append), but also when
booting a disk image in UEFI mode (using SMBIOS). The same applies to
systemd credentials (using the Credentials= setting). These settings
allow configuring the image without having to rebuild it, which means
that you only have to run mkosi qemu or mkosi boot again afterwards
to apply the new settings.
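As a sketch, a drop-in like mkosi.conf.d/20-runtime.conf could carry these settings (the [Host] section name is what current mkosi versions use; double-check the documentation for your version, and note the credential name/value here is purely illustrative):
[Host]
KernelCommandLineExtra=systemd.log_level=debug
Credentials=mycredential=myvalue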
By using newuidmap/newgidmap and systemd-repart, mkosi is able to
build images without needing root privileges. As long as proper subuid
and subgid mappings are set up for your user in /etc/subuid and
/etc/subgid, you can run mkosi as your regular user without having
to switch to root.
Note that as of the writing of this blog post this only applies to the
build and qemu verbs. Booting the image in a systemd-nspawn
container with mkosi boot still needs root privileges. We're hoping to
fix this in a future systemd release.
Regardless of whether you're running mkosi with root or without root,
almost every tool we execute is invoked in a sandbox to isolate as much
of the build process from the host as possible. For example, /etc and
/var from the host are not available in this sandbox, to avoid host
configuration inadvertently affecting the build.
Because systemd-repart can build disk images without loop devices,
mkosi can run from almost any environment, including containers. All
that's needed is a UID range with 65536 UIDs available, either via
running as the root user or via /etc/subuid and newuidmap. In a
future systemd release, we're hoping to provide an alternative to
newuidmap and /etc/subuid to allow running mkosi from all
containers, even those with only a single UID available.
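Such a mapping is a single line per user in each of /etc/subuid and /etc/subgid, for example (the range values are illustrative; any otherwise unused range of 65536 IDs works):
alice:524288:65536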
If the tools needed to build the image are missing or too old on the host, mkosi can first build a tools image for you that
contains all required tools to build the actual image. This can be
enabled by adding ToolsTree=default to your mkosi configuration.
Building a tools image does not require a recent version of systemd.
In the systemd mkosi configuration, we automatically use a tools tree if
we detect your distribution does not have the minimum required systemd
version installed.
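Concretely, that is a one-line addition, e.g. in mkosi.conf (the [Host] section is where current mkosi versions place this setting; check the documentation for your version):
[Host]
ToolsTree=default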
mkosi also supports profiles, which can be defined in the mkosi.profiles/ directory. The profile
to use can be selected using the Profile= setting (or --profile=) on
the command line. A profile allows you to bundle various settings behind
a single recognizable name. Profiles can also be matched on if you want
to apply some settings only to a few profiles.
For example, you could have a bootable profile that sets
Bootable=yes, adds the linux and systemd-boot packages and
configures Format=disk to end up with a bootable disk image when
passing --profile bootable on the mkosi command line.
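A hedged sketch of such a profile, e.g. in mkosi.profiles/bootable.conf (the package names mirror the example above and are distribution specific):
[Output]
Format=disk

[Content]
Bootable=yes
Packages=
        linux
        systemd-boot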
To build a system extension with mkosi, we need a base image on top of
which we can build our extension.
To keep things manageable, we'll make use of mkosi's support for
building multiple images so that we can build our base image and system
extension in one go.
We start by creating a temporary directory with a base configuration
file mkosi.conf with some shared settings:
[Output]
OutputDirectory=mkosi.output
CacheDirectory=mkosi.cache
Now let's continue with the base image definition by writing the
following to mkosi.images/base/mkosi.conf:
[Output]
Format=directory
[Content]
CleanPackageMetadata=no
Packages=systemd
         udev
We use the directory output format here instead of the disk output
so that we can build our extension without needing root privileges.
Now that we have our base image, we can define a sysext that builds on
top of it by writing the following to mkosi.images/btrfs/mkosi.conf:
[Config]
Dependencies=base
[Output]
Format=sysext
Overlay=yes
[Content]
BaseTrees=%O/base
Packages=btrfs-progs
The BaseTrees= setting points to our base image and Overlay=yes instructs mkosi
to only package the files added on top of the base tree.
We can't sign the extension image without a key. We can generate one
by running mkosi genkey which will generate files that are
automatically picked up when building the image.
Finally, you can build the base image and the extensions by running
mkosi -f. You'll find btrfs.raw in mkosi.output, which is the
extension image.
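To try the extension out on a running system, you could copy it into place and merge it (a sketch; see the systemd-sysext man page for the directories it searches):
$ sudo cp mkosi.output/btrfs.raw /var/lib/extensions/
$ sudo systemd-sysext merge
$ systemd-sysext status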
mkosi supports a number of other useful features, for example:
●To sign images for SecureBoot, place a key and certificate in mkosi.key and mkosi.crt and enable the
SecureBoot= setting. You can also run mkosi genkey to have mkosi
generate a key and certificate itself.
●The Ephemeral= setting can be enabled to boot the image in an
ephemeral copy that is thrown away when the container or virtual
machine exits.
●ShimBootloader= and BiosBootloader= settings are available to
configure shim and grub installation if needed.
●mkosi can boot directory trees in a virtual machine using virtiofsd. This
is very useful for quickly rebuilding an image and booting it as the
image does not have to be packed up as a disk image.
●...
There are many more features that we won't go over in detail in this
blog post. Learn more about those by reading the
documentation.
Finally, here are some more resources related to mkosi and
related tooling:
●Github repository
●Building RHEL and RHEL UBI images with mkosi
●My presentation on systemd-repart at ASG 2023
●mkosi's Matrix channel.
●systemd's mkosi configuration
●mkosi's mkosi configuration
This post takes a look at /boot/ and the ESP, and how this could be improved.
How Linux distributions traditionally have been setting up their
“boot” file systems has been varying to some degree, but the most
common choice has been to have a separate partition mounted to
/boot/. Usually the partition is formatted as a Linux file system
such as ext2/ext3/ext4. The partition contains the kernel images, the
initrd and various boot loader resources. Some distributions, like
Debian and Ubuntu, also store ancillary files associated with the
kernel here, such as kconfig or System.map. Such a traditional
boot partition is only defined within the context of the distribution,
and typically not immediately recognizable as such when looking just
at the partition table (i.e. it uses the generic Linux partition type
UUID).
With the arrival of UEFI a new partition relevant for boot appeared,
the EFI System Partition (ESP). This partition is defined by the
firmware environment, but typically accessed by Linux to install or
update boot loaders. The choice of file system is not up to Linux, but
effectively mandated by the UEFI specifications: vFAT. In theory it
could be formatted as other file systems too. However, this would
require the firmware to support file systems other than vFAT. This is
rare and firmware specific though, as vFAT is the only file system
mandated by the UEFI specification. In other words, vFAT is the only
file system which is guaranteed to be universally supported.
There’s a major overlap of the type of the data typically stored in
the ESP and in the traditional boot partition mentioned earlier: a
variety of boot loader resources as well as kernels/initrds.
Unlike the traditional boot partition, the ESP is easily recognizable
in the partition table via its GPT partition type UUID. The ESP is
also a shared resource: all OSes installed on the same disk will
share it and put their boot resources into it (as opposed to the
traditional boot partition, of which there is one per installed Linux
OS, and only that one will put resources there).
To summarize, the most common setup on typical Linux distributions is
something like this:
| Type | Linux Mount Point | File System Choice |
|---|---|---|
| Linux “Boot” Partition | /boot/ | Any Linux File System, typically ext2/ext3/ext4 |
| ESP | /boot/efi/ | vFAT |
Note that in this setup the /boot/efi/ mount point is nested
below the /boot/ mount point. This effectively means that to access
the ESP the Boot partition must exist and be mounted first. A system
with just an ESP and without a Boot partition hence doesn’t fit well
into the current model. The Boot partition will also have to carry an
empty “efi” directory that can be used as the inner mount point, and
serves no other purpose.
Given that the traditional boot partition and the ESP may carry
similar data (i.e. boot loader resources, kernels, initrds) one may
wonder why they are separate concepts. Historically, this was the
easiest way to make the pre-UEFI way how Linux systems were booted
compatible with UEFI: conceptually, the ESP can be seen as just a
minor addition to the status quo ante that way. Today, primarily two
reasons remain:
Some distributions see a benefit in support for complex Linux file
system concepts such as hardlinks, symlinks, SELinux labels/extended
attributes and so on when storing boot loader resources. – I
personally believe that making use of features in the boot file
systems that the firmware environment cannot really make sense of is
very clearly not advisable. The UEFI file system APIs know no
symlinks, and what is SELinux to UEFI anyway? Moreover, putting more
than the absolute minimum of simple data files into such file
systems immediately raises questions about how to authenticate them
comprehensively (including all fancy metadata) cryptographically on
use (see below).
On real-life systems that ship with non-Linux OSes the ESP often
comes pre-installed with a size too small to carry multiple Linux
kernels and initrds. As growing the size of an existing ESP is
problematic (for example, because there’s no space available
immediately after the ESP, or because some low-quality firmware
reacts badly to the ESP changing size) placing the kernel in a
separate, secondary partition (i.e. the boot partition) circumvents
these space issues.
Both the ESP and the /boot/ partition should be considered untrusted: any code or
essential data read from them must be authenticated cryptographically
before use. And even more, the file system structures themselves are
also untrusted. The file system driver reading them must be careful
not to be exploitable by a rogue file system image. Effectively this
means a simple file system (for which a driver can be more easily
validated and reviewed) is generally a better choice than a complex
file system (Linux file system communities made it pretty clear that
robustness against rogue file system images is outside of their scope
and not what is being tested for).
Some approaches tried to address the fact that boot partitions are
untrusted territory by encrypting them via a mechanism compatible with
LUKS, and adding decryption capabilities to the boot loader so it can
access it. This misses the point though, as encryption does not imply
authentication, and only authentication is typically desired. The boot
loader and kernel code are typically Open Source anyway, and hence
there’s little value in attempting to keep secret what is already
public knowledge. Moreover, encryption implies the existence of an
encryption key. Physically typing in the decryption key on a keyboard
might still be acceptable on desktop systems with a single human user
in front, but outside of that scenario unlock via TPM, PKCS#11 or
network services are typically required. And even on the desktop FIDO2
unlocking is probably the future. Implementing all the technologies
these unlocking mechanisms require in the boot loader is not
realistic, unless the boot loader shall become a full OS on its own as
it would require subsystems for FIDO2, PKCS#11, USB, Bluetooth,
networking, smart card access, and so on.
Also note that vFAT cannot compete with file systems such as btrfs
when it comes to data safety guarantees: it's not a journaled file
system, and does not use CoW or any form of checksumming. This means when
used for the system boot process we need to be particularly careful
when accessing it, and in particular when making changes to it (i.e.,
trying to keep changes local to single sectors). It is essential to
use write patterns that minimize the chance of file system
corruption. Checking the file system (“fsck”) before modification
(and probably also reading) is important, as is ensuring the file
system is put into a “clean” state as quickly as possible after each
modification.
Code quality of the firmware in typical systems is known to not always
be great. When relying on the file system driver included in the
firmware it’s hence a good idea to limit use to operations that have a
better chance to be correctly implemented. For example, when writing
from the UEFI environment it might be wise to avoid any operation that
requires allocation algorithms, but instead focus on access patterns
that only override already written data, and do not require allocation
of new space for the data.
Besides write access from the boot loader code (as described above)
these file systems will require write access from the OS, to
facilitate boot loader and kernel/initrd updates. These types of
accesses are generally not fully random accesses (i.e., never partial
file updates) but usually mean adding new files as whole, and removing
old files as a whole. Existing files are typically not modified once
created, though they might be replaced wholly by newer versions.
Prefer relying on the ESP alone, which is discovered and mounted
automatically by systemd-gpt-auto-generator. Use
XBOOTLDR only if you have to, i.e., when dealing with systems that
lack UEFI (and where the ESP hence has no value) or to address the
mentioned size issues with the ESP. Note that unlike the traditional
boot partition the XBOOTLDR partition is a shared resource, i.e.,
shared between multiple parallel Linux OS installations on the same
disk. Because of this it is typically wise to place a per-OS
directory at the top of the XBOOTLDR file system to avoid conflicts.
Use vFAT for both partitions, it’s the only thing
universally understood among relevant firmwares and Linux. It’s
simple enough to be useful for untrusted storage. Or to say this
differently: writing a file system driver that is not easily
vulnerable to rogue disk images is much easier for vFAT than for
let’s say btrfs. – But the choice of vFAT implies some care needs to
be taken to address the data safety issues it brings, see below.
Mount the two partitions via the “automount”
logic. For example, via systemd’s
automount
units, with a very short idle time-out (one second or so). This
improves data safety immensely, as the file systems will remain
mounted (and thus possibly in a “dirty” state) only for very short
periods of time, when they are actually accessed – and all that
while the fact that they are not mounted continuously is mostly not
noticeable for applications as the file system paths remain
continuously around. Given that the backing file system (vFAT) has
poor data safety properties, it is essential to shorten the window of
unclean file system state as much as possible. In fact, this is
what the aforementioned systemd-gpt-auto-generator
logic actually does by default.
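If you prefer configuring these mounts explicitly rather than relying on systemd-gpt-auto-generator, an illustrative /etc/fstab sketch of such automounts could look like this (UUIDs are placeholders):
UUID=XXXX-XXXX  /efi   vfat  umask=0077,noauto,x-systemd.automount,x-systemd.idle-timeout=1s  0 2
UUID=YYYY-YYYY  /boot  vfat  umask=0077,noauto,x-systemd.automount,x-systemd.idle-timeout=1s  0 2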
Whenever mounting one of the two partitions, do a file system check
(fsck; in fact this is also what
systemd-gpt-auto-generator does by default, hooked into
the automount logic, to run on first access). This ensures that even
if the file system is in an unclean state it is restored to be clean
when needed, i.e., on first access.
Do not mount the two partitions nested, i.e., no
more /boot/efi/. First of all, as mentioned above, it
should be possible (and is desirable) to only have one of the
two. Hence it is simply a bad idea to require the other as well,
just to be able to mount it. More importantly though, by nesting
them, automounting is complicated, as it is necessary to trigger the
first automount to establish the second automount, which defeats the
point of automounting them in the first place. Use the two distinct
mount points /efi/ (for the ESP) and
/boot/ (for XBOOTLDR) instead. You might have guessed,
but that too is what systemd-gpt-auto-generator does by
default.
When making additions or updates to ESP/XBOOTLDR from the OS make
sure to create a file and write it in full, then
syncfs() the whole file system, then rename to give it
its final name, and syncfs() again. Proceed similarly when
removing files.
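In shell terms the pattern looks roughly like this (sync -f invokes syncfs() on the file system containing the given file; paths are hypothetical and error handling is omitted):
cp vmlinuz-new /boot/mydistro/vmlinuz.tmp
sync -f /boot/mydistro/vmlinuz.tmp     # syncfs() the whole file system
mv /boot/mydistro/vmlinuz.tmp /boot/mydistro/vmlinuz
sync -f /boot/mydistro/vmlinuz         # syncfs() again after the rename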
When writing from the boot loader environment/UEFI to ESP/XBOOTLDR,
do not append to files or create new files. Instead overwrite
already allocated file contents (for example to maintain a random
seed file) or rename already allocated files to include information
in the file name (and ideally do not increase the file name in
length; for example to maintain boot counters).
Consider adopting
UKIs,
which minimize the number of files that need to be updated on the
ESP/XBOOTLDR during OS/kernel updates (ideally down to 1)
Consider adopting
systemd-boot,
which minimizes the number of files that need to be updated on boot
loader updates (ideally down to 1)
Consider removing any mention of ESP/XBOOTLDR from
/etc/fstab, and just let
systemd-gpt-auto-generator do its thing.
Stop implementing file systems, complex storage, disk encryption, …
in your boot loader.
By following the recommendations above you gain:
Simplicity: only one file system implementation, typically only
one partition and mount point
Robust auto-discovery of all partitions, no need to even
configure /etc/fstab
Data safety guarantees as good as possible, given the
circumstances
To summarize this in a table:
| Type | Linux Mount Point | File System Choice | Automount |
|---|---|---|---|
| ESP | /efi/ | vFAT | yes |
| XBOOTLDR | /boot/ | vFAT | yes |
Let's first look at the status quo ante of the boot process on typical
Linux distributions, involving the shim binary, Linux,
initrds, UEFI Firmware, PE binaries, and SecureBoot.
Distributions generate initrds locally, and
they are unsigned, thus not protected through SecureBoot (since that
would require local SecureBoot key enrollment, which is generally
not done), nor TPM PCRs.
Boot chain is typically Firmware →
shim → grub → Linux kernel →
initrd (dracut or similar) → root file system
Firmware’s UEFI SecureBoot protects shim, shim’s key management
protects grub and kernel. No code signing protects initrd. initrd
acquires the key for encrypted root fs from the user (or
TPM/FIDO2/PKCS11).
shim/grub/kernel is measured into TPM PCR 4, among other stuff
EFI TPM event log reports measured data into TPM PCRs, and can be
used to reconstruct and validate state of TPM PCRs from the used
resources.
No userspace components are typically measured, except for what IMA
measures
New kernels require locally generating new boot loader scripts and
generating a new initrd each time. OS updates thus mean fragile
generation of multiple resources and copying multiple files into the
boot partition.
Problems with the status quo ante:
initrd typically unlocks root file system encryption, but is not
protected whatsoever, and trivial to attack and modify offline
OS updates are brittle: PCR values of grub are very hard to
pre-calculate, as grub measures chosen control flow path, not just
code images. PCR values vary wildly, and OS provided resources are
not measured into separate PCRs. Grub’s PCR measurements might be
useful up to a point to reason about the boot after the fact, for
the most basic remote attestation purposes, but useless for
calculating them ahead of time during the OS build process (which
would be desirable to be able to bind secrets to future expected PCR
state, for example to bind secrets to an OS in a way that they remain
accessible even after that OS is updated).
Updates of a boot loader are not robust, require multi-file updates
of ESP and boot partition, and regeneration of boot scripts
No rollback protection (no way to cryptographically invalidate
access to TPM-bound secrets on OS updates)
Remote attestation of running software is needlessly complex since
initrds are generated locally and thus basically are guaranteed to
vary on each system.
Locking resources maintained by arbitrary user apps to TPM state
(PCRs) is not realistic for general purpose systems, since PCRs will
change on every OS update, and there’s no mechanism to re-enroll
each such resource before every OS update, and remove the old
enrollment after the update.
There is no concept to cryptographically invalidate/revoke secrets
for an older OS version once updated to a new OS version. An
attacker thus can always access the secrets generated on old OSes if
they manage to exploit an old version of the OS — even if a newer
version already has been deployed.
Goals of the new design:
Provide a fully signed execution path from firmware to
userspace, no exceptions
Provide a fully measured execution path from firmware to
userspace, no exceptions
Separate out TPM PCR assignments, by “owner” of measured
resources, so that resources can be bound to them in a fine-grained
fashion.
Allow easy pre-calculation of expected PCR values based on
booted kernel/initrd, configuration, local identity of the system
Rollback protection
Simple & robust updates: one updated file per concept
Updates without requiring re-enrollment/local preparation of the
TPM-protected resources (no more “brittle” PCR hashes that must be
propagated into every TPM-protected resource on each OS update)
System ready for easy remote attestation, to prove validity of
booted OS, configuration and local identity
Ability to bind secrets to specific phases of the boot, e.g. the
root fs encryption key should be retrievable from the TPM only in
the initrd, but not after the host transitioned into the root fs.
Reasonably secure, automatic, unattended unlocking of disk
encryption secrets should be possible.
“Democratize” use of PCR policies by defining PCR register meanings,
and making binding to them robust against updates, so that
external projects can safely and securely bind their own data to
them (or use them for remote attestation) without risking breakage
whenever the OS is updated.
Build around TPM 2.0 (with graceful fallback for TPM-less
systems if desired, but TPM 1.2 support is out of scope)
Considered attack scenarios and considerations:
Evil Maid: neither online nor offline (i.e. “at rest”) physical
access to a storage device should enable an attacker to read the
user’s plaintext data on disk (confidentiality); nor should such
access allow undetected modification/backdooring of user data or OS
(integrity), or exfiltration of secrets.
TPMs are assumed to be reasonably “secure”, i.e. can securely
store/encrypt secrets. Communication to TPM is not “secure” though
and must be protected on the wire.
Similarly, the CPU is assumed to be reasonably “secure”.
SecureBoot is assumed to be reasonably “secure” to permit validated
boot up to and including shim+boot loader+kernel (but see discussion
below)
All user data must be encrypted and authenticated. All vendor and
administrator data must be authenticated.
It is assumed all software involved regularly contains
vulnerabilities and requires frequent updates to address them, plus
regular revocation of old versions.
It is further assumed that key material used for signing code by the
OS vendor can reasonably be kept secure (via use of HSM, and
similar, where secret key information never leaves the signing
hardware) and does not require frequent roll-over.
A UKI combines the following resources in a single UEFI PE binary:
A UEFI boot stub program (i.e. sd-stub,
see below)
The Linux kernel to boot in the .linux PE section
The initrd that the kernel shall unpack and invoke in the
.initrd PE section
A kernel command line string, in the .cmdline PE
section
Optionally, information describing the OS this kernel is intended
for, in the .osrel PE section (derived from
/etc/os-release of the booted OS). This is useful for
presentation of the UKI in the boot loader menu, and ordering it
against other entries, using the included version information.
Optionally, information describing kernel release information
(i.e. uname -r output) in the .uname PE
section. This is also useful for presentation of the UKI in the
boot loader menu, and ordering it against other entries.
Optionally, a boot splash to bring to screen before transitioning
into the Linux kernel in the .splash PE section
Optionally, a compiled Devicetree database file, for systems which
need it, in the .dtb PE section
Optionally, the public key in PEM format that matches the
signatures of the .pcrsig PE section (see below), in a
.pcrpkey PE section.
Optionally, a JSON file encoding expected PCR 11 hash values seen
from userspace once the UKI has booted up, along with signatures
of these expected PCR 11 hash values, matching a specific public
key in the .pcrsig PE section. (Note: we use plural
for “values” and “signatures” here, as this JSON file will
typically carry a separate value and signature for each PCR bank
for PCR 11, i.e. one pair of value and signature for the SHA1
bank, and another pair for the SHA256 bank, and so on. This
ensures when enrolling or unlocking a TPM-bound secret we’ll
always have a signature around matching the banks available
locally (after all, which banks the local hardware supports is up
to the hardware). For the sake of simplifying this already overly
complex topic, we’ll pretend in the rest of the text there was
only one PCR signature per UKI we have to care about, even if this
is not actually the case.)
Given UKIs are regular UEFI PE files, they can thus be signed as one
for SecureBoot, protecting all of the individual resources listed
above at once, and their combination. Standard Linux tools such as
sbsigntool and pesign can be used to sign
UKI files.
UKIs wrap all of the above data in a single file, hence all of the
above components can be updated in one go through single file atomic
updates, which is useful given that the primary expected storage place
for these UKIs is the UEFI System Partition (ESP), which is a vFAT
file system, with its limited data safety guarantees.
UKIs can be generated via a single, relatively simple objcopy
invocation, that glues the listed components together, generating one
PE binary that then can be signed for SecureBoot. (For details on
building these, see below.)
Note that the primary location to place UKIs in is the EFI System
Partition (or an otherwise firmware accessible file system). This
typically means a VFAT file system of some form. Hence an effective
UKI size limit of 4GiB is in place, as that’s the largest file size a
FAT32 file system supports.
When the UKI is invoked, the stub performs the following steps before
handing control to the Linux kernel stored in its .linux PE section:
The PE sections listed are searched for in the invoked UKI the stub
is part of, and superficially validated (i.e. general file format is
in order).
All PE sections listed above of the invoked UKI are measured into
TPM PCR 11. This TPM PCR is expected to be all zeroes before the UKI
initializes. Pre-calculation is thus very straight-forward if the
resources included in the PE image are known. (Note: as a single
exception the .pcrsig PE section is excluded from this measurement,
as it is supposed to carry the expected result of the measurement, and
thus cannot also be input to it, see below for further details about
this section.)
If the .splash PE section is included in the UKI it is brought onto the screen
If the .dtb PE section is included in the UKI it is activated
using the Devicetree UEFI “fix-up” protocol
If a command line was passed from the boot loader to the UKI
executable it is discarded if SecureBoot is enabled and the command
line from the .cmdline section is used instead. If SecureBoot is disabled and a
command line was passed it is used in place of the one from
.cmdline. Either way the used command line is measured into TPM
PCR 12. (This of course removes any flexibility of control of the
kernel command line of the local user. In many scenarios this is
probably considered beneficial, but in others it is not, and some
flexibility might be desired. Thus, this concept probably needs to
be extended sooner or later, to allow more flexible kernel command
line policies to be enforced via definitions embedded into the
UKI. For example: allowing definition of multiple kernel command
lines the user/boot menu can select one from; allowing additional
allowlisted parameters to be specified; or even optionally allowing
any verification of the kernel command line to be turned off even
in SecureBoot mode. It would then be up to the builder of the UKI
to decide on the policy of the kernel command line.)
It will set a couple of volatile EFI variables to inform userspace
about executed TPM PCR measurements (and which PCR registers were
used), and other execution properties. (For example: the EFI
variable StubPcrKernelImage in the
4a67b082-0a4c-41cf-b6c7-440b29bb8c4f vendor namespace indicates
the PCR register used for the UKI measurement, i.e. the value
“11”).
An initrd cpio archive is dynamically synthesized from the
.pcrsig and .pcrpkey PE section data (this is later passed to
the invoked Linux kernel as additional initrd, to be overlaid with
the main initrd from the .initrd section). These files are later
available in the /.extra/ directory in the initrd context.
The Linux kernel from the .linux PE section is invoked with
a combined initrd that is composed from the blob from the .initrd
PE section, the dynamically generated initrd containing the
.pcrsig and .pcrpkey PE sections, and possibly some additional
components like sysexts or syscfgs.
As described above, the stub measures into two TPM PCRs: TPM PCR 11
shall contain measurements of all UKI components (i.e. what the
signatures in the .pcrsig PE section cover, see above). This
PCR will also contain measurements of the boot phase once userspace
takes over (see below).
TPM PCR 12 shall contain measurements of the used kernel command
line. (Plus potentially other forms of
parameterization/configuration passed into the UKI, not discussed in
this document)
On top of that we intend to define two more PCR registers like this:
TPM PCR 15 shall contain measurements of the volume encryption
key of the root file system of the OS.
[TPM PCR 13 shall contain measurements of additional extension
images for the initrd, to enable a modularized initrd – not covered
by this document]
(See the Linux TPM PCR
Registry
for an overview how these four PCRs fit into the list of Linux PCR
assignments.)
For all four PCRs the assumption is that they are zero before the UKI
initializes, and only the data that the UKI and the OS measure into
them is included. This makes pre-calculating them straightforward:
given a specific set of UKI components, it is immediately clear what
PCR values can be expected in PCR 11 once the UKI booted up. Given a
kernel command line (and other parameterization/configuration) it is
clear what PCR values are expected in PCR 12.
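The systemd-measure tool (see the component list below) can perform this pre-calculation; a hedged sketch of an invocation (file names are illustrative, and the options shown take paths to the respective UKI components):
$ systemd-measure calculate \
      --linux=vmlinuz \
      --initrd=initrd.cpio.zst \
      --cmdline=cmdline.txt \
      --osrel=os-release \
      --bank=sha256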
Note that these four PCRs are defined by the conceptual “owner” of the
resources measured into them. PCR 11 only contains resources the OS
vendor controls. Thus it is straight-forward for the OS vendor to
pre-calculate and then cryptographically sign the expected values for
PCR 11. The PCR 11 values will be identical on all systems that run
the same version of the UKI. PCR 12 only contains resources the
administrator controls, thus the administrator can pre-calculate
PCR values, and they will be correct on all instances of the OS that
use the same parameters/configuration. PCR 15 only contains resources
inherently local to the local system, i.e. the cryptographic key
material that encrypts the root file system of the OS.
Separating out these three roles does not imply these actually need to
be separate when used. However the assumption is that in many popular
environments these three roles should be separate.
By separating out these PCRs by the owner’s role, it becomes
straightforward to remotely attest, individually, on the software that
runs on a node (PCR 11), the configuration it uses (PCR 12) or the
identity of the system (PCR 15). Moreover, it becomes straightforward
to robustly and securely encrypt data so that it can only be unlocked
on a specific set of systems that share the same OS, or the same
configuration, or have a specific identity – or a combination thereof.
Note that the mentioned PCRs are so far not typically used on generic
Linux-based operating systems, to our knowledge. Windows uses them,
but given that Windows and Linux should typically not be included in
the same boot process this should be unproblematic, as Windows’ use of
these PCRs should thus not conflict with ours.
To summarize:
| PCR | Purpose | Owner | Expected Value before UKI boot | Pre-Calculable |
|---|---|---|---|---|
| 11 | Measurement of UKI components and boot phases | OS Vendor | Zero | Yes (at UKI build time) |
| 12 | Measurement of kernel command line, additional kernel runtime configuration such as systemd credentials, systemd syscfg images | Administrator | Zero | Yes (when system configuration is assembled) |
| 13 | System Extension Images of initrd (and possibly more) | (Administrator) | Zero | Yes |
| 15 | Measurement of root file system volume key (Possibly later more: measurement of root file system UUIDs and labels and of the machine ID /etc/machine-id) | Local System | Zero | Yes (after first boot, once all such IDs are determined) |
The signatures over the expected PCR 11 values end up in the .pcrsig PE section. The public
key part will end up in the .pcrpkey PE section.
Typically the key pair for the PCR 11 signatures should be chosen with
a narrow focus, reused for exactly one specific OS (e.g. “Fedora
Desktop Edition”) and the series of UKIs that belong to it (all the
way through all the versions of the OS). The SecureBoot signature key
can be used with a broader focus, if desired. By keeping the PCR 11
signature key narrow in focus one can ensure that secrets bound to the
signature key can only be unlocked on the narrow set of UKIs desired.
When userspace wants to enroll disk encryption for a specific UKI, it
looks for the public key passed to the
initrd in the /.extra/ directory (which as discussed above
originates in the .pcrpkey PE section of the UKI). The relevant
userspace component (e.g. systemd) is then responsible for
generating a random key to be used as symmetric encryption key for
the storage volume (let’s call it the disk encryption key here, or
DEK). The TPM is then used to encrypt (“seal”) the DEK with its
internal Storage Root Key (TPM SRK). A TPM2 policy is bound to the
encrypted DEK. The policy enforces that the DEK may only be
decrypted if a valid signature is provided that matches the state of
PCR 11 and the public key provided in the /.extra/ directory of
the initrd. The plaintext DEK key is passed to the kernel to
implement disk encryption (e.g. LUKS/dm-crypt). (Alternatively,
hardware disk encryption can be used too, i.e. Intel MKTME, AMD SME
or even OPAL, all of which are outside of the scope of this
document.) The TPM-encrypted version of the DEK which the TPM
returned is written to the encrypted volume’s superblock.
When userspace wants to unlock disk encryption on a specific
UKI, it looks for the signature data passed to the initrd in the
/.extra/ directory (which as discussed above originates in the
.pcrsig PE section of the UKI). It then reads the encrypted
version of the DEK from the superblock of the encrypted volume. The
signature and the encrypted DEK are then passed to the TPM. The TPM
then checks if the current PCR 11 state matches the supplied
signature from the .pcrsig section and the public key used during
enrollment. If all checks out it decrypts (“unseals”) the DEK and
passes it back to the OS, where it is then passed to the kernel
which implements the symmetric part of disk encryption.
Note that in this scheme the encrypted volume’s DEK is not bound
to specific literal PCR hash values, but to a public key which is
expected to sign PCR hash values.
Also note that the state of PCR 11 only matters during unlocking. It
is not used or checked when enrolling.
In this scenario:
Input to the TPM part of the enrollment process are the TPM’s
internal SRK, the plaintext DEK provided by the OS, and the public
key later used for signing expected PCR values, also provided by the
OS. – Output is the encrypted (“sealed”) DEK.
Input to the TPM part of the unlocking process are the TPM’s
internal SRK, the current TPM PCR 11 values, the public key used
during enrollment, a signature that matches both these PCR values
and the public key, and the encrypted DEK. – Output is the plaintext
(“unsealed”) DEK.
Note that sealing/unsealing is done entirely on the TPM chip, the host
OS just provides the inputs (well, only the inputs that the TPM chip
doesn’t know already on its own), and receives the outputs. With the
exception of the plaintext DEK, none of the inputs/outputs are
sensitive, and can safely be stored in the open. On the wire the
plaintext DEK is protected via TPM parameter encryption (not discussed
in detail here because though important not in scope for this
document).
TPM PCR 11 is the most important of the mentioned PCRs, and its use is
thus explained in detail here. The other mentioned PCRs can be used in
similar ways, but signatures/public keys must be provided via other
means.
This scheme builds on the functionality Linux’ LUKS2
provides, i.e. key management supporting multiple slots, and the
ability to embed arbitrary metadata in the encrypted volume’s
superblock. Note that this means the TPM2-based logic explained here
doesn’t have to be the only way to unlock an encrypted volume. For
example, in many setups it is wise to enroll both this TPM-based
mechanism and an additional “recovery key” (i.e. a high-entropy
computer generated passphrase the user can provide manually in case
they lose access to the TPM and need to access their data), of which
either can be used to unlock the volume.
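With the systemd tooling listed below, enrolling both the TPM2/PCR 11 signature mechanism and a recovery key could look roughly like this (a sketch; the device path and the public key location are illustrative):
$ systemd-cryptenroll --recovery-key /dev/sda2
$ systemd-cryptenroll --tpm2-device=auto \
      --tpm2-public-key=/.extra/tpm2-pcr-public-key.pem \
      --tpm2-public-key-pcrs=11 \
      /dev/sda2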
To make the boot phases distinguishable, specific words are measured
into PCR 11 at the following milestones of the boot process:
When the initrd initializes (“initrd-enter”)
When the initrd transitions into the root file system (“initrd-leave”)
When the early boot phase of the OS on the root file system has
completed, i.e. all storage and file systems have been set up and
mounted, immediately before regular services are started
(“sysinit”)
When the OS on the root file system completed the boot process far
enough to allow unprivileged users to log in (“complete”)
When the OS begins shut down (“shutdown”)
When the service manager is mostly finished with shutting down and
is about to pass control to the final phase of the shutdown logic
(“final”)
By measuring these additional words into PCR 11 the distinct phases of
the boot process can be distinguished in a relatively straight-forward
fashion and the expected PCR values in each phase can be determined.
The phases are measured into PCR 11 (as opposed to some other PCR)
mostly because available PCRs are scarce, and the boot phases defined
are typically specific to a chosen OS, and hence fit well with the
other data measured into PCR 11: the UKI which is also specific to the
OS. The OS vendor generates both the UKI and defines the boot phases,
and thus can safely and reliably pre-calculate/sign the expected PCR
values for each phase of the boot.
The individual components of a UKI are generated with existing tools,
e.g. the initrd via dracut and similar. Once the basic components
(.linux, .initrd, .cmdline, .splash, .dtb, .osrel,
.uname) have been acquired the combination process works roughly
like this:
The expected PCR 11 hashes (and signatures for them) for the UKI
are calculated. The tool for that takes all basic UKI components
and a signing key as input, and generates a JSON object as output
that includes both the literal expected PCR hash values and a
signature for them. (For all selected TPM2 banks)
The EFI stub binary is now combined with the basic components, the
generated JSON PCR signature object from the first step (in the
.pcrsig section) and the public key for it (in the .pcrpkey
section). This is done via a simple “objcopy” invocation
resulting in a single UKI PE binary.
The resulting EFI PE binary is then signed for SecureBoot (via a
tool such as
sbsign
or similar).
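On recent systemd versions these steps are wrapped by the ukify tool mentioned earlier; a hedged sketch of an invocation (file names are illustrative, and the exact options differ between systemd versions):
$ ukify build \
      --linux=vmlinuz \
      --initrd=initrd.cpio.zst \
      --cmdline="quiet rw" \
      --pcr-private-key=pcr-key.pem \
      --pcr-public-key=pcr-key.pub.pem \
      --secureboot-private-key=sb-key.pem \
      --secureboot-certificate=sb-cert.pem \
      --output=mykernel.efi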
Note that the UKI model implies pre-built initrds. How to generate
these (and securely extend and parameterize them) is outside of the
scope of this document, but a related document will be provided
highlighting these concepts.
The concepts described above are implemented by a number of components
in systemd:
The
systemd-stub
(or short: sd-stub) component implements the discussed UEFI stub
program
The
systemd-measure
tool can be used to pre-calculate expected PCR 11 values given the
UKI components and can sign the result, as discussed in the UKI
Image Generation section above.
The
systemd-cryptenroll
and
systemd-cryptsetup
tools can be used to bind a LUKS2 encrypted file system volume to a
TPM and PCR 11 public key/signatures, according to the scheme
described above. (The two components also implement a “recovery
key” concept, as discussed above)
The
systemd-pcrphase
component measures specific words into PCR 11 at the discussed
phases of the boot process.
The
systemd-creds
tool may be used to encrypt/decrypt data objects called
“credentials” that can be passed into services and booted systems,
and are automatically decrypted (if needed) immediately before
service invocation. Encryption is typically bound to the local TPM,
to ensure the data cannot be recovered elsewhere.
Note that
systemd-stub
(i.e. the UEFI code glued into the UKI) is distinct from
systemd-boot
(i.e. the UEFI boot loader that can manage multiple UKIs and other
boot menu items and implements automatic fallback, an interactive menu
and a programmatic interface for the OS among other things). One can
be used without the other – both sd-stub without sd-boot and vice
versa – though they integrate nicely if used in combination.
Note that the mechanisms described are relatively generic, and can be
implemented and consumed by other software too; systemd should be
considered a reference implementation, though one that has found
comprehensive adoption across Linux distributions.
Some concepts discussed above are currently not
implemented. Specifically:
The rollback protection logic is currently not implemented.
The mentioned measurement of the root file system volume key to PCR
15 is implemented, but not merged into the systemd main branch yet.
initrd and other resources. See above.
SecureBoot
A mechanism where every software component involved in the boot
process is cryptographically signed and checked against a set of
public keys stored in the mainboard hardware, implemented in firmware,
before it is used.
Measured Boot
A boot process where each component measures (i.e., hashes and extends
into a TPM PCR, see above) the next component it will pass control to
before doing so. This serves two purposes: it can be used to bind
security policy for encrypted secrets to the resulting PCR values (or
signatures thereof, see above), and it can be used to reason about
used software after the fact, for example for the purpose of remote
attestation.
initrd
Short for “initial RAM disk”, which – strictly speaking – is a
misnomer today, because no RAM disk is involved anymore, but a tmpfs
file system instance. Also known as “initramfs”, which is also
misleading, given the file system is not ramfs anymore, but tmpfs
(both of which are in-memory file systems on Linux, with different
semantics). The initrd is passed to the Linux kernel and is
basically a file system tree in a cpio archive. The kernel unpacks the
image into a tmpfs (i.e., into an in-memory file system), and then
executes a binary from it. It thus contains the binaries for the first
userspace code the kernel invokes. Typically, the initrd’s job is to
find the actual root file system, unlock it (if encrypted), and
transition into it.
UEFI
Short for “Unified Extensible Firmware Interface”, it is a widely
adopted standard for PC firmware, with native support for SecureBoot
and Measured Boot.
EFI
More or less synonymous with UEFI, IRL.
Shim
A boot component originating in the Linux world, which in a way
extends the public key database SecureBoot maintains (which is under
control from Microsoft) with a second layer (which is under control of
the Linux distributions and of the owner of the physical device).
PE
Portable Executable; a file format for executable binaries,
originally from the Windows world, but also used by UEFI firmware. PE
files may contain code and data, categorized in labeled “sections”.
ESP
EFI System Partition; a special partition on a storage
medium in which the firmware is able to look for UEFI PE binaries
to execute at boot.
HSM
Hardware Security Module; a piece of hardware that can generate and
store secret cryptographic keys, and execute operations with them,
without the keys leaving the hardware (though this is
configurable). TPMs can act as HSMs.
DEK
Disk Encryption Key; a symmetric cryptographic key used for
unlocking disk encryption, i.e. passed to LUKS/dm-crypt for activating
an encrypted storage volume.
LUKS2
Linux Unified Key Setup Version 2; a specification for a superblock
for encrypted volumes widely used on Linux. LUKS2 is the default
on-disk format for the cryptsetup suite of tools. It provides
flexible key management with multiple independent key slots and allows
embedding arbitrary metadata in a JSON format in the superblock.