planet.freedesktop.org
October 04, 2021

I’m Bad At Blogging

I’m responsible enough to admit that I’m bad at blogging.

I’ve said repeatedly that I’m going to blog more often, and then I go and do the complete opposite.

I don’t know why I’m like this, but here we are, and now it’s time for another blog post.

What’s Been Happening

In short: not a lot.

All the features I’ve previously blogged about have landed, and zink is once again in “release mode” until the branchpoint next week to avoid having to rush patches in at the last second. This means there probably won’t be any interesting zink patches at all until then.

We’re in a good spot though, and I’m pleased with the state of the driver for this release. You probably still won’t be using it to play any OpenGL games you pick up from the Winter Steam Sale, but potentially those days aren’t too far off.

With that said, I do have to actually blog about something technical for once, so let’s roll the dice and see what it’s going to be.

ARB_bindless_texture

We did it. We got a good roll.

This is actually a cool extension for an implementation deep dive because of how (relatively) simple Vulkan makes it to handle.

First, an overview: What is ARB_bindless_texture?

This is an extension used by only the most elite GL apps to enable texture streaming, namely the ability to continually add more images into the rendering pipeline either for sampling or shader write operations. An image is bound to a “handle”, and from there, it can be made “resident” at any time to use it in shaders. This is different from the general GL methodology where an image must be explicitly bound to a specific slot (instead each image has its own slot), and it allows for both greater flexibility and more images to be in use at any given time.
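For a sense of what this looks like from the application side, here’s a hedged sketch of the GL calls involved (not taken from any particular app); the 64-bit handle is what gets passed to shaders, e.g. through a UBO, instead of binding the texture to a unit:

/* sketch only: assumes a context with ARB_bindless_texture and a GL
 * loader (e.g. libepoxy) that resolves the ARB entry points */
#include <epoxy/gl.h>

static GLuint64
make_texture_bindless(GLuint tex)
{
   /* get a unique 64-bit handle for this texture */
   GLuint64 handle = glGetTextureHandleARB(tex);
   /* mark it resident so shaders can sample it at any time */
   glMakeTextureHandleResidentARB(handle);
   /* the handle (not a texture unit) is what the shader consumes */
   return handle;
}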

At the implementation level, this actually amounts to three distinct features:

  • the ability to track and manage unique “handles” for each image that can be made resident
  • the ability to access these images from shaders
  • the ability to pass these images between shader stages as normal I/O

In zink, I tackled these in the order I’ve listed.

Handle Management

This wouldn’t have been (as) possible without one very special, very awful Vulkan extension.

You knew this was coming.

VK_EXT_descriptor_indexing.

That’s right, it’s a requirement for this, but not for the reason you might think. Zink has no need for the impossibly large descriptor sets enabled by this extension, but I did need the other features it provides:

  • VK_DESCRIPTOR_BINDING_UPDATE_AFTER_BIND_BIT - enables binding a “bindless” descriptor set once and then performing updates on it without needing to have multiple sets
  • VK_DESCRIPTOR_BINDING_PARTIALLY_BOUND_BIT - enables invalidating deleted members of an active set and leaving them as garbage values in the descriptor set as long as they won’t be accessed in shaders (don’t worry, this is totally safe)
  • VK_DESCRIPTOR_BINDING_UPDATE_UNUSED_WHILE_PENDING_BIT - enables updating members of an active set that aren’t currently in use by any shaders

With these, it becomes possible to implement bindless textures using the existing Gallium convention:

  • create a u_idalloc instance to track and generate integer handle IDs
  • map these handle IDs to slots in a largish (1024-entry) descriptor array
  • dynamically update the slots in the set as textures are made resident/not-resident
  • return handle IDs to the u_idalloc pool once they are destroyed and the image is no longer in use

This creates a cycle where a handle ID is allocated, an image is bound to that slot in the descriptor array, the image can be unbound, the handle ID is deleted, and then finally the ID is recycled, all while only binding and updating a single descriptor set as draws continue.
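For illustration, here’s a rough Vulkan-side sketch of what that looks like (not zink’s actual code; the binding number and the MAX_BINDLESS constant are placeholders): a 1024-entry descriptor binding created with the three flags above, plus the per-handle update that slots an image into the array.

#include <vulkan/vulkan.h>

#define MAX_BINDLESS 1024 /* placeholder; matches the handle pool size above */

static VkDescriptorSetLayout
create_bindless_layout(VkDevice dev)
{
   const VkDescriptorBindingFlags flags =
      VK_DESCRIPTOR_BINDING_UPDATE_AFTER_BIND_BIT |
      VK_DESCRIPTOR_BINDING_PARTIALLY_BOUND_BIT |
      VK_DESCRIPTOR_BINDING_UPDATE_UNUSED_WHILE_PENDING_BIT;

   const VkDescriptorSetLayoutBinding binding = {
      .binding = 0,
      .descriptorType = VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER,
      .descriptorCount = MAX_BINDLESS,
      .stageFlags = VK_SHADER_STAGE_ALL,
   };

   const VkDescriptorSetLayoutBindingFlagsCreateInfo binding_flags = {
      .sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_BINDING_FLAGS_CREATE_INFO,
      .bindingCount = 1,
      .pBindingFlags = &flags,
   };

   const VkDescriptorSetLayoutCreateInfo info = {
      .sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO,
      .pNext = &binding_flags,
      .flags = VK_DESCRIPTOR_SET_LAYOUT_CREATE_UPDATE_AFTER_BIND_POOL_BIT,
      .bindingCount = 1,
      .pBindings = &binding,
   };

   VkDescriptorSetLayout layout;
   vkCreateDescriptorSetLayout(dev, &info, NULL, &layout);
   return layout;
}

/* write an image into slot `handle_id` of the (already bound) bindless set;
 * UPDATE_AFTER_BIND makes this legal even while command buffers using the
 * set are in flight, as long as the slot itself isn't accessed */
static void
update_bindless_slot(VkDevice dev, VkDescriptorSet set,
                     uint32_t handle_id, const VkDescriptorImageInfo *info)
{
   const VkWriteDescriptorSet write = {
      .sType = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET,
      .dstSet = set,
      .dstBinding = 0,
      .dstArrayElement = handle_id,
      .descriptorCount = 1,
      .descriptorType = VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER,
      .pImageInfo = info,
   };
   vkUpdateDescriptorSets(dev, 1, &write, 0, NULL);
}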

Shader Access

Now that the images are accessible to the GPU in the bindless descriptor array, shaders will have to be updated to read them.

In NIR, bindless instructions come in two variants:

  • nir_intrinsic_bindless_image_*
  • nir_instr_type_tex with nir_tex_src_texture_handle

These have their own unique semantics that I didn’t bother to look into; I only needed to do completely normal array derefs, so what I actually needed was just to rewrite them back into normal-er instructions.

For the image intrinsics, that ended up being the following snippet:

nir_intrinsic_instr *instr = nir_instr_as_intrinsic(in);

nir_intrinsic_op op;
#define OP_SWAP(OP) \
case nir_intrinsic_bindless_image_##OP: \
   op = nir_intrinsic_image_deref_##OP; \
   break;


/* convert bindless intrinsics to deref intrinsics */
switch (instr->intrinsic) {
OP_SWAP(atomic_add)
OP_SWAP(atomic_and)
OP_SWAP(atomic_comp_swap)
OP_SWAP(atomic_dec_wrap)
OP_SWAP(atomic_exchange)
OP_SWAP(atomic_fadd)
OP_SWAP(atomic_fmax)
OP_SWAP(atomic_fmin)
OP_SWAP(atomic_imax)
OP_SWAP(atomic_imin)
OP_SWAP(atomic_inc_wrap)
OP_SWAP(atomic_or)
OP_SWAP(atomic_umax)
OP_SWAP(atomic_umin)
OP_SWAP(atomic_xor)
OP_SWAP(format)
OP_SWAP(load)
OP_SWAP(order)
OP_SWAP(samples)
OP_SWAP(size)
OP_SWAP(store)
default:
   return false;
}

/* pick the bindless array variable matching the image's dimensionality,
 * creating it on first use */
enum glsl_sampler_dim dim = nir_intrinsic_image_dim(instr);
nir_variable *var = dim == GLSL_SAMPLER_DIM_BUF ? bindless_buffer_array : bindless_image_array;
if (!var)
   var = create_bindless_image(b->shader, dim);
instr->intrinsic = op;
b->cursor = nir_before_instr(in);
/* replace the handle src with a deref into the bindless array */
nir_deref_instr *deref = nir_build_deref_var(b, var);
if (glsl_type_is_array(var->type))
   deref = nir_build_deref_array(b, deref, nir_u2uN(b, instr->src[0].ssa, 32));
nir_instr_rewrite_src_ssa(in, &instr->src[0], &deref->dest.ssa);

In short, swap the intrinsic back to a regular image one, then rewrite the image src as a deref of a bindless image variable (which is just image[1024]). In long…it’s the same thing. It’s actually that simple.
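For reference, here is a hedged sketch of what a helper like create_bindless_image might look like. This is not zink’s actual implementation; the descriptor set and binding numbers are placeholders.

static nir_variable *
create_bindless_image(nir_shader *shader, enum glsl_sampler_dim dim)
{
   /* image[1024]: one slot per possible bindless handle */
   const struct glsl_type *image = glsl_image_type(dim, false, GLSL_TYPE_FLOAT);
   nir_variable *var = nir_variable_create(shader, nir_var_uniform,
                                           glsl_array_type(image, 1024, 0),
                                           "bindless_image");
   var->data.descriptor_set = 1; /* placeholder: whichever set holds the bindless array */
   var->data.binding = 0;        /* placeholder binding index */
   return var;
}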

The tex instruction is where things get trickier.

nir_variable *var = tex->sampler_dim == GLSL_SAMPLER_DIM_BUF ? bindless_buffer_array : bindless_texture_array;
if (!var)
   var = create_bindless_texture(b->shader, tex);
b->cursor = nir_before_instr(in);
nir_deref_instr *deref = nir_build_deref_var(b, var);
if (glsl_type_is_array(var->type))
   deref = nir_build_deref_array(b, deref, nir_u2uN(b, tex->src[idx].src.ssa, 32));
nir_instr_rewrite_src_ssa(in, &tex->src[idx].src, &deref->dest.ssa);

This part is the same as the image rewrite: just rewriting the instruction as a deref.

This part, however, is different:

unsigned needed_components = glsl_get_sampler_coordinate_components(glsl_without_array(var->type));
unsigned c = nir_tex_instr_src_index(tex, nir_tex_src_coord);
unsigned coord_components = nir_src_num_components(tex->src[c].src);
if (coord_components < needed_components) {
   nir_ssa_def *def = nir_pad_vector(b, tex->src[c].src.ssa, needed_components);
   nir_instr_rewrite_src_ssa(in, &tex->src[c].src, def);
   tex->coord_components = needed_components;
}

The thing about bindless textures is that by the time zink sees them, they have no dimensionality. They’re just textures in an array, regardless of whether they’re 1D, 2D, 3D, or arrayed. This means the variables used for derefs might not have the right number of coordinate components, or the instructions using them might not have the right number. To fix this, an extra cleanup is needed here to match up the number of components with the variable being used.

With all of that in place, basic bindless operations are working.

But wait…

Shader I/O

This was the tricky part. According to the spec, it now becomes legal to have an image or a sampler as an input or an output in a shader.

But is it really, truly necessary to pass images between the shaders?

No. No it isn’t.

nir_deref_instr *src_deref = nir_src_as_deref(instr->src[0]);
nir_variable *var = nir_deref_instr_get_variable(src_deref);
if (var->data.bindless)
   return false;
if (var->data.mode != nir_var_shader_in && var->data.mode != nir_var_shader_out)
   return false;
if (!glsl_type_is_image(var->type) && !glsl_type_is_sampler(var->type))
   return false;

/* demote the variable from image/sampler to a plain 64-bit integer:
 * what actually flows between stages is just the handle */
var->type = glsl_int64_t_type();
var->data.bindless = 1;
b->cursor = nir_before_instr(in);
nir_deref_instr *deref = nir_build_deref_var(b, var);
if (instr->intrinsic == nir_intrinsic_load_deref) {
   /* replace the image load with an i64 load and point all users at it */
   nir_ssa_def *def = nir_load_deref(b, deref);
   nir_instr_rewrite_src_ssa(in, &instr->src[0], def);
   nir_ssa_def_rewrite_uses(&instr->dest.ssa, def);
} else {
   nir_store_deref(b, deref, instr->src[1].ssa, nir_intrinsic_write_mask(instr));
}
nir_instr_remove(in);
nir_instr_remove(&src_deref->instr);

Bindless shader I/O is really just passing array indices that masquerade as images. If they’re rewritten back to integer types, that all goes away, and they become regular I/O that needs no additional handling.

Just This Once

The translation to Vulkan made everything incredibly easy. I didn’t need any special hacks or corner case behavior, and I didn’t have to spend time reading code from other drivers to figure out what the hell I was doing wrong. Validation even works for it!

Truly miraculous.

October 01, 2021

Wim Taymans

Wim Taymans laying out the vision for the future of Linux multimedia


PipeWire has already made great strides forward in terms of improving the audio handling situation on Linux, but one of the original goals was to also bring along the video side of the house. In fact in the first few releases of Fedora Workstation where we shipped PipeWire we solely enabled it as a tool to handle screen sharing for Wayland and Flatpaks. So with PipeWire having stabilized a lot for audio, we now feel the time has come to go back to the video side of PipeWire and work to improve the state of the art for video capture handling under Linux. Wim Taymans did a presentation to our team inside Red Hat on the 30th of September talking about the current state of the world and where we need to go to move forward. I thought the information and ideas in his presentation deserved wider distribution, so this blog post is building on that presentation to share it more widely and also hopefully rally the community to support us in this endeavour.

The current state of video capture, usually webcams, handling on Linux is basically the v4l2 kernel API. It has served us well for a lot of years, but we believe that just like you don’t write audio applications directly to the ALSA API anymore, you should not write video applications directly to the v4l2 kernel API anymore either. With PipeWire we can offer a lot more flexibility, security and power for video handling, just like it does for audio. The v4l2 API is an open/ioctl/mmap/read/write/close based API, meant for a single application to access at a time. There is a library called libv4l2, but nobody uses it because it causes more problems than it solves (no mmap, slow conversions, quirks). But there is no need to rely on the kernel API anymore as there are GStreamer and PipeWire plugins for v4l2 allowing you to access it using the GStreamer or PipeWire API instead. So our goal is not to replace v4l2, just as it is not our goal to replace ALSA; v4l2 and ALSA remain the kernel driver layer for video and audio.

It is also worth considering that new cameras are getting more and more complicated, and thus configuring them is getting more complicated too. Driving this change is a new set of cameras on the way, often called MIPI cameras, as they adhere to the API standards set by the MIPI Alliance. Partly driven by this, v4l2 is in active development, with a codec API addition (stateful/stateless), DMABUF, a request API, and also a Media Controller (MC) graph with nodes, ports and links of processing blocks. This means that the threshold for an application developer to use these APIs directly is getting very high, in addition to the aforementioned issues of single application access, the security issues of direct kernel access and so on.

libcamera logo


Libcamera is meant to be the userland library for v4l2.


Of course we are not the only ones seeing the growing complexity of cameras as a challenge for developers, and thus libcamera has been developed to make interacting with these cameras easier. Libcamera provides a unified API for setup and capture for cameras, it hides the complexity of modern camera devices, and it is supported on ChromeOS, Android and Linux.
One way to describe libcamera is as the Mesa of cameras. Libcamera provides hooks to run (out-of-process) vendor extensions, for example for image processing or enhancement. Using libcamera is considered pretty much a requirement for embedded systems these days, and newer Intel chips will also have IPUs configurable with media controllers.

Libcamera is still under heavy development upstream and does not yet have a stable ABI, but they did add a .so version very recently, which will make packaging in Fedora and elsewhere a lot simpler. In fact we have builds in Fedora ready now. Libcamera also ships with a set of GStreamer plugins, which means you should in theory be able to get for instance Cheese working through libcamera (although, as we will go into, we think this is the wrong approach).

Before I go further an important thing to be aware of here is that unlike on ALSA, where PipeWire can provide a virtual ALSA device to provide backwards compatibility with older applications using the ALSA API directly, there is no such option possible for v4l2. So application developers will need to port to something new here, be that libcamera or PipeWire. So what do we feel is the right way forward?

Ideal Linux Multimedia Stack

How we envision the Linux multimedia stack going forward


Above you see an illustration of what we believe should be how the stack looks going forward. If you made this drawing of what the current state is, then thanks to our backwards compatibility with ALSA, PulseAudio and Jack, all the applications would be pointing at PipeWire for their audio handling like they are in the illustration you see above, but all the video handling from most applications would be pointing directly at v4l2 in this diagram. At the same time we don’t want applications to port to libcamera either, as it doesn’t offer a lot of the flexibility that using PipeWire will; instead what we propose is that all applications target PipeWire in combination with the video camera portal API. Be aware that the video portal is not an alternative to or an abstraction of the PipeWire API, it is just a way to set up the connection to PipeWire that has the added bonus of working if your application is shipping as a Flatpak or another type of desktop container. PipeWire would then be in charge of talking to libcamera or v4l2 for video, just like PipeWire is in charge of talking with ALSA on the audio side. Having PipeWire be the central hub means we get a lot of the same advantages for video that we get for audio. For instance, as the application developer you interact with PipeWire regardless of whether what you want is a screen capture, a camera feed or a video being played back. Multiple applications can share the same camera, and at the same time there are security protections in place to avoid the camera being used without your knowledge to spy on you. And we can also have patchbay applications that support video pipelines and not just audio, like Carla provides for Jack applications. To be clear this feature will not come for ‘free’ from Jack patchbays since Jack only does audio, but hopefully new PipeWire patchbays like Helvum can add video support.

So what about GStreamer, you might ask. Well, GStreamer is a great way to write multimedia applications and we strongly recommend it, but we do not recommend your GStreamer application using the v4l2 or libcamera plugins; instead we recommend that you use the PipeWire plugins. This is of course a little different from the audio side, where PipeWire supports the PulseAudio and Jack APIs and thus you don’t need to port, but by targeting the PipeWire plugins in GStreamer your GStreamer application will get the full PipeWire feature set.

So what is our plan of action?
So we will start putting the pieces in place for this step by step in Fedora Workstation. We have already started on this by working on the libcamera support in PipeWire and packaging libcamera for Fedora. We will set it up so that PipeWire has the option to switch between v4l2 and libcamera, so that most users can keep using v4l2 through PipeWire for the time being, while we work with upstream and the community to mature libcamera and its PipeWire backend. We will also enable device discoverer for PipeWire.

We are also working on maturing the GStreamer elements for PipeWire for the video capture usecase, as we expect a lot of application developers will just be using GStreamer as opposed to targeting PipeWire directly. We will start with Cheese as our initial testbed for this work as it is a fairly simple application, using it as a proof of concept for having an application use PipeWire for camera access. We are still trying to decide if we will make Cheese speak directly with PipeWire, or have it talk to PipeWire through the pipewiresrc GStreamer plugin, as both have their pros and cons in the context of testing and verifying this.

We will also start working with the Chromium and Firefox projects to have them use the Camera portal and PipeWire for camera support just like we did work with them through WebRTC for the screen sharing support using PipeWire.

There are a few major items we are still trying to decide upon in terms of the interaction between PipeWire and the Camera portal API. It would be tempting to see if we can hide the Camera portal API behind the PipeWire API, or failing that at least hide it for people using the GStreamer plugin. That way all applications get the portal support for free when porting to GStreamer, instead of requiring use of the Camera portal API as a second step. On the other hand, you need to set up the screen sharing portal yourself, so it would probably make things more consistent if we left it to application developers to do for camera access too.

What do we want from the community here?
The first step is just to help us with testing as we roll this out in Fedora Workstation and Cheese. While libcamera was written motivated by MIPI cameras, all webcams are meant to work through it, and thus all webcams are meant to work with PipeWire using the libcamera backend. At the moment that is not the case, and thus community testing and feedback is critical for helping us and the libcamera community to mature libcamera. We hope that by allowing you to easily configure PipeWire to use the libcamera backend (and switch back after you are done testing) we can get a lot of you to test and let us know which cameras are not working well yet.

A little further down the road, please start planning to move any application you maintain or contribute to away from the v4l2 API and towards PipeWire. If your application is a GStreamer application the transition should be fairly simple, going from the v4l2 plugins to the pipewire plugins, but beyond that you should familiarize yourself with the Camera portal API and the PipeWire API for accessing cameras.

For further news and information on PipeWire follow our @PipeWireP twitter account and for general news and information about what we are doing in Fedora Workstation make sure to follow me on twitter @cfkschaller.

September 27, 2021
As part of the hw-enablement for Bay and Cherry Trail devices which I do as a side project, it is sometimes useful to play with the Android which comes pre-installed on some of these devices.

Sometimes the Android-X86 boot-loader (kernelflinger) is locked and the standard "Developer-Options" -> "Enable OEM Unlock" -> "Run 'fastboot oem unlock'" sequence does not work (e.g. I got the unlock yes/no dialog, and could move between yes and no, but I could not actually confirm the choice).

Luckily there is an alternative: kernelflinger checks an "OEMLock" EFI variable to see if the device is locked or not. Like with some of my previous adventures changing hidden BIOS settings, this EFI variable is hidden from the OS as soon as the OS calls ExitBootServices, but we can use the same modified grub to change this EFI variable. After booting from a USB stick with the relevant grub binary installed as "EFI/BOOT/BOOTX64.EFI" or "BOOTIA32.EFI", entering the following command on the grub cmdline will unlock the bootloader:

setup_var_cv OEMLock 0 1 1

Disabling dm-verity support is pretty easy on these devices because they can just boot a regular Linux distro from a USB drive. Note that booting a regular Linux distro may cause the Android "system" partition to get auto-mounted, after which dm-verity checks will fail! Once we have a regular Linux distro running, step 1 is to find out which partition is the android_boot partition. To do this, run as root:

blkid /dev/mmcblk?p#

Replace the ? with the mmcblk number of the internal eMMC and then try 1 to n for #, until one of the partitions is reported as having 'PARTLABEL="android_boot"'. Usually "mmcblk?p3" is the one you want, so you could try that first.

Now make an image of the partition by running e.g.:

dd if=/dev/mmcblk1p3 of=android_boot.img

And then copy the "android_boot.img" file to another computer. On this computer, extract the boot image and then the initrd like this:

abootimg -x android_boot.img
mkdir initrd
cd initrd
zcat ../initrd.img | cpio -i


Now edit the fstab file and remove "verify" from the line for the system partition. After this, update android_boot.img like this:

find . | cpio -o -H newc -R 0.0 | gzip -9 > ../initrd.img
cd ..
abootimg -u android_boot.img -r initrd.img


The easiest way to test the new image is using fastboot. Boot the tablet into Android and connect it to the PC, then run:

adb reboot bootloader
fastboot boot android_boot.img


And then from an "adb shell" run "cat /fstab" and verify that the "verify" option is gone now. After this you can (optionally) dd the new android_boot.img back to the android_boot partition to make the change permanent.

Note if Android is not booting you can force the bootloader to enter fastboot mode on the next boot by downloading this file and then under regular Linux running the following command as root:

cat LoaderEntryOneShot > /sys/firmware/efi/efivars/LoaderEntryOneShot-4a67b082-0a4c-41cf-b6c7-440b29bb8c4f
September 24, 2021

Fedora Workstation
So I have spoken about what our vision for Fedora Workstation is quite a few times before, but I feel it is often useful to get back to it as we progress with our overall effort. So if you have read some of my blog posts about Fedora Workstation over the last 5 years, be aware that there is probably little new in here for you. If you haven’t read them, however, this is hopefully a useful primer on what we are trying to achieve with Fedora Workstation.

The first few years after we launched Fedora Workstation in 2014 we focused a lot on establishing a good culture around what we were doing with Fedora, making sure that it was a good day to day desktop driver for people, and not just a great place to develop the operating system itself. I think it was Fedora Project Lead Matthew Miller who phrased it very well when he said that we want to be Leading Edge, not Bleeding Edge. We also took a good look at the operating system from an overall stance and tried to map out where Linux tended to fall short as a desktop operating system, and also tried to ask ourselves what our core audience would and should be. We refocused our efforts on being a great Operating System for all kinds of developers, but I think it is fair to say that we decided that was too narrow a wording, as our efforts are truly to reach makers of all kinds like graphics artists and musicians, in addition to coders. So I thought I would go through our key pillar efforts and talk about where they are at and where they are going.

Flatpak

Flatpak logo
One of the first things we concluded was that our story for people who wanted to deploy applications to our platform was really bad. The main challenge was that the platform was moving very fast and it was a big overhead for application developers to keep on top of the changes. In addition to that, since the Linux desktop is so fragmented, the application developers would have to deal with the fact that there were 20 different variants of this platform, all moving at a different pace. The way Linux applications were packaged, with each dependency being packaged independently of the application, created pains on both sides: for the application developer it meant the world kept moving underneath them with limited control, and for the distributions it meant packaging pains, as different applications that all depended on the same library might work or fail with different versions of a given library. So we concluded we needed a system which allowed us to decouple the application from the host OS, to let application developers update their platform at a pace of their own choosing and at the same time unify the platform in the sense that the application should be able to run without problems on the latest Fedora releases, the latest RHEL releases or the latest versions of any other distribution out there. As we looked at it we realized there were some security downsides compared to the existing model, since the OS vendor would not be in charge of keeping all libraries up to date and secure, so sandboxing the applications ended up being a critical requirement.

At the time Alexander Larsson was working on bringing Docker to RHEL and Fedora, so we tasked him with designing the new application model. The initial idea was to see if we could adjust Docker containers to the desktop usecase, but Docker containers as they stood at that time were very unsuited for the purpose of hosting desktop applications, and our experience working with the Docker upstream at the time was that they were not very welcoming to our contributions. So in light of how major the changes we would need to implement were, and the unlikelihood of getting them accepted upstream, Alex started on what would become Flatpak. Another major technology that was coincidentally being developed at the same time was OSTree by Colin Walters. To this day I think the best description of OSTree is that it functions as a git for binaries, meaning it allows you a simple way to maintain and update your binary applications with minimally sized updates. It also provides some disk deduplication, which we felt was important due to the duplication of libraries and so on that containers bring with them. Finally, another major design decision Alex made was that the runtime/baseimage should be hosted outside the container, to make it possible to update the runtime independently of the application with relevant security updates etc.

Today there is a thriving community around Flatpaks, with the center of activity being Flathub, the Flatpak application repository. In Fedora Workstation 35 you should start seeing Flatpaks from Flathub being offered as long as you have 3rd party repositories enabled. Also underway is an effort led by Owen Taylor to integrate Flatpak building into the internal tools we use at Red Hat for putting RHEL together, with the goal of switching over to Flatpaks as our primary application delivery method for desktop applications in RHEL and to help us bridge the Fedora and RHEL application ecosystems.

You can follow the latest news from Flatpak through the official Flatpak twitter account.

Silverblue

So another major issue we decided needed improvement was that of OS upgrades (as opposed to application updates). The model pursued by Linux distros since their inception is one of shipping their OS as a large collection of independently packaged libraries. This setup is inherently fragile and requires a lot of quality engineering and testing to avoid problems, but even then things sometimes fail, especially in a fast moving OS like Fedora. A lot of configuration changes and updates have traditionally been done through scripts and similar, making rollback to an older version in cases where there is a problem also very challenging. Adventurous developers could also have made changes to their own copy of the OS that would break the upgrade later on. So thanks to all the great efforts to test and verify upgrades they usually go well for most users, but we wanted something even more sturdy. So the idea came up to move to an image based OS model, similar to what people had gotten used to on their phones. And OSTree once again became the technology we chose to do this, especially considering it was being used in Red Hat’s first foray into image based operating systems for servers (the server effort later got rolled into CoreOS as part of Red Hat acquiring CoreOS).

The idea is that you ship the core operating system as a singular image, and then to upgrade you just replace that image with a new image, and thus the risks of problems are greatly reduced. On top of that, each of those images can be tested and verified as a whole by your QE and test teams. Of course we realized that a subset of people would still want to be able to tweak their OS, but once again OSTree came to our rescue, as it allows developers to layer further RPMs on top of the OS image, including replacing current system libraries with for instance newer ones. The great thing about OSTree layering is that once you are done testing/using the layered RPMs you can, with a very simple command, just drop them again and go back to the upstream image. So combined with applications being shipped as Flatpaks this would create an OS that is a lot more sturdy, secure and simple to update, and with a lot lower chance of an OS update breaking any of your applications. On top of that OSTree allows us to do easy OS rollbacks, so if the latest update somehow doesn’t work for you, you can quickly roll back while waiting for the issue you are having to be fixed upstream. And hence Fedora Silverblue was born as the vehicle for us to develop and evolve an image based desktop operating system.

You can follow our efforts around Silverblue through the official Silverblue twitter account.

Toolbx

Toolbox with RHEL

Toolbox pet container with RHEL UBI


So Flatpak helped us address a lot of the gaps for making a better desktop OS on the application side, and Silverblue was the vehicle for our vision on the OS side, but we realized that we also needed some way for all kinds of developers to be able to easily take advantage of the great resource that is the Fedora RPM package universe and the wider tools universe out there. We needed something that provided people with a great terminal experience. We had already been working on various smaller improvements to the terminal for a while, but we realized we needed something a lot more substantial. Accessing an immutable OS like Silverblue through a terminal window tends to be quite limiting, so that is usually not what you want to do, and you also don’t want to rely on OSTree layering for running all your development tools and so on, as that is going to be potentially painful when you upgrade your OS.
Luckily the container revolution happening in the Linux world pointed us to the solution here too, as while containers were rolled out the concept of ‘pet containers’ was also born. The idea of a pet container is that unlike general containers (sometimes referred to as cattle containers), pet containers are containers that you care about on an individual level, like your personal development environment. In fact pet containers even improve on how we used to do things, as they allow you to very easily maintain different environments for different projects. So for instance if you have two projects, hosted in two separate pet containers, where the two projects depend on two different versions of python, then containers make that simple, as they ensure that there is no risk of one of your projects ‘contaminating’ the others with its dependencies, yet at the same time allow you to grab RPMs or other kinds of packages from upstream resources and install them in your container. In fact while inside your pet container the world feels a lot like it always has when on the Linux command line. Thanks to the great effort of Dan Walsh and his team we had a growing number of easy to use container tools available to us, like podman. Podman is developed with the primary usecase being running and deploying your containers at scale, managed by OpenShift and Kubernetes. But it also gave us the foundation we needed for Debarshi Ray to kick off the Toolbx project to ensure that we had an easy to use tool for creating and managing pet containers. As a bonus Toolbx allows us to achieve another important goal, to allow Fedora Workstation users to develop applications against RHEL in a simple and straightforward manner, because Toolbx allows you to create RHEL containers just as easily as it allows you to create Fedora containers.

You can follow our efforts around Toolbox on the official Toolbox twitter account.

Wayland

Ok, so between Flatpak, Silverblue and Toolbox we have the vision clear for how to create a robust OS, with a great story for application developers to maintain and deliver applications for it, and Toolbox providing a great developer story on top of this OS. But we also looked at the technical state of the Linux desktop and realized that there were some serious deficits we needed to address. One of the first ones we saw was the state of graphics, where X.org had served us well for many decades, but its age was showing and adding new features as they came in was becoming more and more painful. Kristian Høgsberg had started work on an alternative to X while still at Red Hat called Wayland, an effort he and a team of engineers were pushing forward at Intel. There was a general agreement in the wider community that Wayland was the way forward, but apart from Intel there was little serious development effort being put into moving it forward. On top of that, Canonical at the time had decided to go off on their own and develop their own alternative architecture in competition with X.org and Wayland. So as we were seeing a lot of things happening on the graphics horizon, like HiDPI, and were also getting requests to come up with a way to make Linux desktops more secure, we decided to team up with Intel and get Wayland into a truly usable state on the desktop. So we put many of our top developers, like Olivier Fourdan, Adam Jackson and Jonas Ådahl, on working on maturing Wayland as quickly as possible.
As things would have it we also ended up getting a lot of collaboration and development help coming in from the embedded sector, where companies such as Collabora were helping to deploy systems with Wayland onto various kinds of embedded devices and contributing fixes and improvements back up to Wayland (and Weston). To be honest I have to admit we did not fully appreciate what a herculean task it would end up being to get Wayland production ready for the desktop, and it took us quite a few Fedora releases before we decided it was ready to go. As you might imagine, dealing with 30 years of technical debt is no easy thing to pay down, and while we kept moving forward at a steady pace there always seemed to be a new batch of issues to be resolved, but we managed to do so, not just by maturing Wayland, but also by porting major applications, such as Martin Stransky porting Firefox and Caolan McNamara porting LibreOffice over to Wayland. At the end of the day I think what saw us through to success was the incredible collaboration happening upstream between a large host of individual contributors and companies, and having the support of the X.org community. And even when we had the whole thing put together there were still practical issues to overcome, like how we had to keep defaulting to X.org in Fedora when people installed the binary NVidia driver, because that driver did not work with XWayland, the X backwards compatibility layer in Wayland. Luckily that is now in the process of becoming a thing of the past, with the latest NVidia driver updates supporting XWayland and us working closely with NVidia to ensure the driver and windowing stack work well together.

PipeWire

Pipewire in action

Example of PipeWire running


So now we had a clear vision for the OS and a much improved and much more secure graphics stack in the form of Wayland, but we realized that all the new security features brought in by Flatpak and Wayland also made certain things like desktop capturing/remoting and web camera access a lot harder. Security is great and critical, but just like the old joke about the most secure computer being the one that is turned off, we realized that we needed to make sure these things kept working, but in a secure and better manner. Thankfully we have GStreamer co-creator Wim Taymans on the team and he thought he could come up with a pulseaudio equivalent for video that would allow us to offer screen capture and webcam access in a convenient and secure manner.
As Wim was prototyping what we called PulseVideo at the time, we also started discussing the state of audio on Linux. Wim had contributed to PulseAudio to add a security layer to it, to make it harder, for instance, for a rogue application to eavesdrop on you using your microphone, but since it was not part of the original design it wasn’t a great solution. At the same time we talked about how our vision for Fedora Workstation was to make it the natural home for all kinds of makers, which included musicians, but how the separateness of the pro-audio community was getting in the way of that, especially due to the uneasy co-existence of PulseAudio on the consumer side and Jack for the pro-audio side. As part of his development effort Wim came to the conclusion that he could make the core logic of his new project so fast and versatile that it should be able to deal with the low latency requirements of the pro-audio community and also serve its purpose well on the consumer audio and video side. Having audio and video in one shared system would also be an improvement for us in terms of dealing with combined audio and video sources, as guaranteeing audio/video sync for instance had often been a challenge in the past. So Wim’s effort evolved into what we today call PipeWire, which I am going to be brave enough to say has been one of the most successful launches of a major new Linux system component we have ever done. Replacing two old sound servers while at the same time adding video support is no small feat, but Wim is working very hard on fixing bugs as quickly as they come in and ensuring users have a great experience with PipeWire. And at the same time we are very happy that PipeWire now provides us with the ability to offer musicians and sound engineers a new home in Fedora Workstation.

You can follow our efforts on PipeWire on the PipeWire twitter account.

Hardware support and firmware

In parallel with everything mentioned above we were looking at the hardware landscape surrounding desktop Linux. One of the first things we realized was horribly broken was firmware support under Linux. More and more of the hardware smarts was being found in the firmware, yet firmware access under Linux and the firmware update story were basically non-existent. As we were discussing this problem internally, Peter Jones, who is our representative on the UEFI standards committee, pointed out that we were probably better poised to actually do something about this problem than ever, since UEFI was causing the firmware update process on most laptops and workstations to become standardized. So we teamed Peter up with Richard Hughes and out of that collaboration fwupd and LVFS were born. And in the years since we launched that, we have gone from having next to no firmware updates available on Linux (and the little we had only available through painful processes like burning bootable CDs etc.) to now having a lot of hardware getting firmware update support, with more being added almost on a weekly basis.
For the latest and greatest news around LVFS the best source of information is Richard Hughes’ twitter account.

In parallel to this Adam Jackson worked on glvnd, which provided us with a way to have multiple OpenGL implementations on the same system. For those who have been using Linux for a while, I am sure you remember the pain of the NVidia driver and Mesa fighting over who provided OpenGL on your system, as it was all tied to a specific .so name. There were a lot of hacks being used out there to deal with that situation, of varying degrees of fragility, but with the advent of glvnd nobody has to care about that problem anymore.

We also decided that we needed to have a part of the team dedicated to looking at what was happening in the market and work on covering important gaps. And with gaps I mean fixing the things that keep the hardware vendors from being able to properly support Linux, not writing drivers for them. Instead we have been working closely with Dell and Lenovo to ensure that their suppliers provide drivers for their hardware, and when needed we work to provide a framework for them to plug their hardware into. This has led to a series of small but important improvements, like getting the fingerprint reader stack on Linux to a state where hardware vendors can actually support it, bringing Thunderbolt support to Linux through Bolt, support for high definition and gaming mice through the libratbag project, support in the Linux kernel for the new laptop privacy screen feature, improved power management support through the power profiles daemon, and now recently hiring a dedicated engineer to get HDR support fully in place in Linux.

Summary

So to summarize. We are of course not over the finish line with our vision yet. Silverblue is a fantastic project, but we are not yet ready to declare it the official version of Fedora Workstation, mostly because we want to give the community more time to embrace the Flatpak application model and for developers to embrace the pet container model. Especially applications like IDEs that cross the boundary between being in their own Flatpak sandbox while also interacting with things in your pet container and calling out to system tools like gdb need more work, but Christian Hergert has already done great work solving the problem in GNOME Builder while Owen Taylor has put together support for using Visual Studio Code with pet containers. So hopefully the wider universe of IDEs will follow suit; in the meantime one would need to call them from the command line from inside the pet container.

The good thing here is that Flatpaks and Toolbox also work great on traditional Fedora Workstation. You can get the full benefit of both technologies even on a traditional distribution, so we can allow for a soft and easy transition.

So for anyone who made it this far, apologies for this becoming a little novel; that was not my intention when I started writing it :)

Feel free to follow my personal twitter account for more general news and updates on what we are doing around Fedora Workstation.

Last week we had our most loved annual conference: X.Org Developers Conference 2021. As a reminder, due to the COVID-19 situation in Europe (and its respective restrictions on travel and events), we kept it virtual again this year… which is a pity, as the planned venue was Gdańsk, a very beautiful city (see picture below if you don’t believe me!) in Poland. Let’s see if we can finally have an XDC there!

XDC 2021

This year we had a very strong program. There were talks covering all aspects of the open-source graphics stack: from the kernel (including an Outreachy talk about VKMS) and Mesa drivers of all kinds, to input, libraries, X.org security and Wayland robustness… we had talks about testing drivers, debugging them, our infra at freedesktop.org, and even Vulkan specs (such as Vulkan Video and VK_EXT_multi_draw) and their support in the open-source graphics stack. Definitely a very complete program that is very interesting to all open-source developers working in this area. You can watch all the talks here or here, and the slides have already been uploaded to the program.

On behalf of the Call For Papers Committee, I would like to thank all speakers for their talks… this conference won’t make sense without you!

Big shout-out to the XDC 2021 organizers (Intel), represented by Radosław Szwichtenberg, Ryszard Knop and Maciej Ramotowski. They did an awesome job of running a very smooth conference. I can tell you that they promptly fixed any issue that happened, all of it behind the scenes, so that attendees did not even notice anything most of the time! That is what good conference organizers do!

XDC 2021 Organizers
Can I invite you to a drink at least? You really deserve it!

If you want to know more details about what this virtual conference entailed, just watch Ryszard’s talk at XDC (info, video) or you can reuse their materials for future conferences. That’s very useful info for future conference organizers!

Talking about our streaming platforms, the big novelty this year was the use of media.ccc.de as a privacy-friendly alternative to our traditional YouTube setup (last year we got feedback about this). Media.ccc.de is an open-source platform that respects your privacy and we hope it worked fine for all attendees. Our stats indicate that ~50% of our audience connected to it during the three days of the conference. That’s awesome!

Last but not least, we couldn’t make this conference without our sponsors. We are very lucky to have on board Intel as our Platinum sponsor and organizer, our Gold sponsors (Google, NVIDIA, ARM, Microsoft and AMD), our Silver sponsors (Igalia, Collabora, The Linux Foundation), our Bronze sponsors (Gitlab and Khronos Group) and our Supporters (C3VOC). Big thank you from the X.Org community!

XDC 2021 Sponsors

Feedback

We would like to hear from you and learn about what worked and what needs to be improved for future editions of XDC! Share your experience with us!

We have sent an email asking for feedback to different mailing lists (for example this). Don’t hesitate to send an email to the X.Org Foundation board with all your feedback!

XDC 2022 announced!

X.Org Developers Conference 2022 has been announced! Jeremy White, from Codeweavers, gave a lightning talk presenting next year’s edition! Next year XDC will not be alone… WineConf 2022 is going to be organized by Codeweavers as well and co-located with XDC!

Save the dates! October 4-5-6, 2022 in Minneapolis, Minnesota, USA.

XDC 2022: Minneapolis, Minnesota, USA Image from Wikipedia. License CC BY-SA 4.0.

XDC 2023 hosting proposals

Have you enjoyed XDC 2021? Do you think you can do it better? ;-) We are looking for organizers for XDC 2023 (most likely in Europe but we are open to other places).

We know this is a decision that takes time (triggering internal discussions, looking for volunteers, budget, a venue suitable for the event, etc.). Therefore, we encourage potentially interested parties to start the internal discussions now, so any questions they have can be answered before we open the call for proposals for XDC 2023 at some point next year. Please read what is required to organize this conference and feel free to contact me or the X.Org Foundation board for more info if needed.

Final acknowledgment

I would like to thank Igalia for all the support I got when I decided to run for re-election this year in the X.Org Foundation board and to allow me to participate in XDC organization during my work hours. It’s amazing that our Free Software and collaboration values are still present after 20 years rocking in the free world!

Igalia 20th anniversary

September 23, 2021

After a nine year hiatus, a new version of the X Input Protocol is out. Credit for the work goes to Povilas Kanapickas, who also implemented support for XI 2.4 in the various pieces of the stack [0]. So let's have a look.

X has had touch events since XI 2.2 (2012) but those were only really useful for direct touch devices (read: touchscreens). There were accommodations for indirect touch devices like touchpads but they were never used. The synaptics driver set the required bits for a while but it was dropped in 2015 because ... it was complicated to make use of and no-one seemed to actually use it anyway. Meanwhile, the rest of the world moved on and touchpad gestures are now prevalent. They've been standard in MacOS for ages, in Windows for almost ages and - with recent GNOME releases - now feature prominently on the Linux desktop as well. They have been part of libinput and the Wayland protocol for years (and even recently gained a new set of "hold" gestures). Meanwhile, X was left behind in the dust or mud, depending on your local climate.

XI 2.4 fixes this: it adds pinch and swipe gestures to the XI2 protocol and makes those available to supporting clients [2]. Notable here is that the interpretation of gestures is left to the driver [1]. The server takes the gestures and does the required state handling but otherwise has no say in what constitutes a gesture. This is of course no different to e.g. 2-finger scrolling on a touchpad where the server just receives scroll events and passes them on accordingly.

XI 2.4 gesture events are quite similar to touch events in that they are processed as a sequence of begin/update/end with both types having their own event types. So the events you will receive are e.g. XIGesturePinchBegin or XIGestureSwipeUpdate. As with touch events, a client must select for all three (begin/update/end) on a window. Only one gesture can exist at any time, so if you are a multi-tasking octopus prepare to be disappointed.

Because gestures are tied to an indirect-touch device, the location they apply at is wherever the cursor is currently positioned. In that, they work similar to button presses, and passive grabs apply as expected too. So long-term the window manager will likely want a passive grab on the root window for swipe gestures while applications will implement pinch-to-zoom as you'd expect.

In terms of API there are no surprises. libXi 1.8 is the version to implement the new features and there we have a new XIGestureClassInfo returned by XIQueryDevice and of course the two events: XIGesturePinchEvent and XIGestureSwipeEvent. Grabbing is done via e.g. XIGrabSwipeGestureBegin, so for those of you with XI2 experience this will all look familiar. For those of you without - it's probably no longer worth investing time into becoming an XI2 expert.
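As a quick illustration (a minimal sketch, not from the post, assuming the usual XI2 event-selection pattern and the constants from xorgproto 2021.5), selecting for the new gesture events looks like this - and remember that all three of begin/update/end must be selected:

#include <X11/Xlib.h>
#include <X11/extensions/XInput2.h>

/* requires libXi 1.8 and xorgproto 2021.5 headers */
static void
select_gestures(Display *dpy, Window win)
{
   unsigned char mask[XIMaskLen(XI_LASTEVENT)] = {0};
   XIEventMask evmask = {
      .deviceid = XIAllMasterDevices,
      .mask_len = sizeof(mask),
      .mask = mask,
   };

   XISetMask(mask, XI_GesturePinchBegin);
   XISetMask(mask, XI_GesturePinchUpdate);
   XISetMask(mask, XI_GesturePinchEnd);
   XISetMask(mask, XI_GestureSwipeBegin);
   XISetMask(mask, XI_GestureSwipeUpdate);
   XISetMask(mask, XI_GestureSwipeEnd);

   XISelectEvents(dpy, win, &evmask, 1);
}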

Overall, it's a nice addition to the protocol and it will help getting the X server slightly closer to Wayland for a widely-used feature. Once GTK, mutter and all the other pieces in the stack are in place, it will just work for any (GTK) application that supports gestures under Wayland already. The same will be true for Qt I expect.

X server 21.1 will be out in a few weeks, xf86-input-libinput 1.2.0 is already out and so are xorgproto 2021.5 and libXi 1.8.

[0] In addition to taking on the Xorg release, so clearly there are no limits here
[1] More specifically: it's done by libinput since neither xf86-input-evdev nor xf86-input-synaptics will ever see gestures being implemented
[2] Hold gestures missed out on the various deadlines

September 22, 2021

Xorg is about to be released.

And it's a release without Xwayland.

And... wait, what?

Let's unwind this a bit, and ideally you should come away with a better understanding of Xorg vs Xwayland, and possibly even Wayland itself.

Heads up: if you are familiar with X, the below is simplified to the point it hurts. Sorry about that, but as an X developer you're probably good at coping with pain.

Let's go back to the 1980s, when fashion was weird and there were still reasons to be optimistic about the future. Because this is a thought exercise, we go back with full hindsight 20/20 vision and, ideally, the winning Lotto numbers in case we have some time for some self-indulgence.

If we were to implement an X server from scratch, we'd come away with a set of components: libxprotocol, which handles the actual protocol wire format parsing and provides a C API to access it (quite like libxcb, actually). That one will just be the protocol-to-code conversion layer.

We'd have a libxserver component which handles all the state management required for an X server to actually behave like an X server (nothing in the X protocol requires an X server to display anything). That library has a few entry points for abstract input events (pointer and keyboard, because this is the 80s after all) and a few exit points for rendered output.

libxserver uses libxprotocol but that's an implementation detail, we can ignore the protocol for the rest of the post.

Let's create a github organisation and host those two libraries. We now have: http://github.com/x/libxserver and http://github.com/x/libxprotocol [1].

Now, to actually implement a working functional X server, our new project would link against libxserver and hook into this library's API points. For input, you'd use libinput and pass those events through, for output you'd use the modesetting driver that knows how to scream at the hardware until something finally shows up. This is somewhere between outrageously simplified and unacceptably wrong but it'll do for this post.

Your X server has to handle a lot of the hardware-specifics but other than that it's a wrapper around libxserver which does the work of ... well, being an X server.

Our stack looks like this:


+--------------------------+
| xserver    [libxserver]  |--------[ X client ]
|                          |
| [libinput] [modesetting] |
+--------------------------+
|          kernel          |
+--------------------------+
Hooray, we have re-implemented Xorg. Or rather, XFree86, because we're 20 years from all the pent-up frustration that caused the Xorg fork. Let's host this project on http://github.com/x/xorg

Now, let's say instead of physical display devices, we want to render into a framebuffer, and we have no input devices.


+--------------------------+
| xserver    [libxserver]  |--------[ X client ]
|                          |
|        [write()]         |
+--------------------------+
|       some buffer        |
+--------------------------+
This is basically Xvfb or, if you are writing out PostScript, Xprint. Let's host those on github too, we're accumulating quite a set of projects here.

Now, let's say those buffers are allocated elsewhere and we're just rendering to them. And those buffers are passed to us via an IPC protocol, like... Wayland!


+--------------------------+
| xserver    [libxserver]  |--------[ X client ]
|                          |
| input events    [render] |
+--------------------------+
|                          |
+--------------------------+
|    Wayland compositor    |
+--------------------------+
And voila, we have Xwayland. If you swap out the protocol you can have Xquartz (X on Macos) or Xwin (X on Windows) or Xnext/Xephyr (X on X) or Xvnc (X over VNC). The principle is always the same.

Fun fact: the Wayland compositor doesn't need to run on the hardware, you can play display server matryoshka until you run out of turtles.

In our glorious revisioned past all these are distinct projects, re-using libxserver and some external libraries where needed. Depending on the project, things may be very simple or get very complex; it depends on how we render things.

But in the end, we have several independent projects all providing us with an X server process - the specific X bits are done in libxserver though. We can release Xwayland without having to release Xorg or Xvfb.

libxserver won't need a lot of releases, the behaviour is largely specified by the protocol requirements and once you're done implementing it, it'll be quite a slow-moving project.

Ok, now, fast forward to 2021, lose some hindsight, hope, and attitude and - oh, we have exactly the above structure. Except that it's not spread across multiple independent repos on github, it's all sitting in the same git directory: our Xorg, Xwayland, Xvfb, etc. are all sitting in hw/$name, and libxserver is basically the rest of the repo.

A traditional X server release was a tag in that git directory. An XWayland-only release is basically an rm -rf hw/*-but-not-xwayland followed by a tag, an Xorg-only release is basically an rm -rf hw/*-but-not-xfree86 [2].

In theory, we could've moved all these out into separate projects a while ago but the benefits are small and no-one has the time for that anyway.

So there you have it - you can have Xorg-only or XWayland-only releases without the world coming to an end.

Now, for the "Xorg is dead" claims - it's very likely that the current release will be the last Xorg release. [3] There is little interest in an X server that runs on hardware, or rather: there's little interest in the effort required to push out releases. Povilas did a great job in getting this one out but again, it's likely this is the last release. [4]

Xwayland - very different, it'll hang around for a long time because it's "just" a protocol translation layer. And of course the interest is there, so we have volunteers to do the releases.

So basically: expect Xwayland releases, and be surprised (but not confused) by Xorg releases.

[1] Github of course doesn't exist yet because we're in the 80s. Time-travelling is complicated.
[2] Historical directory name, just accept it.
[3] Just like the previous release...
[4] At least until the next volunteer steps up. Turns out the problem "no-one wants to work on this" is easily fixed by "me! me! I want to work on this". A concept that is apparently quite hard to understand in the peanut gallery.

The Strange State of Authenticated Boot and Disk Encryption on Generic Linux Distributions

TL;DR: Linux has been supporting Full Disk Encryption (FDE) and technologies such as UEFI SecureBoot and TPMs for a long time. However, the way they are set up by most distributions is not as secure as they should be, and in some ways quite frankly weird. In fact, right now, your data is probably more secure if stored on current ChromeOS, Android, Windows or MacOS devices, than it is on typical Linux distributions.

Generic Linux distributions (i.e. Debian, Fedora, Ubuntu, …) adopted Full Disk Encryption (FDE) more than 15 years ago, with the LUKS/cryptsetup infrastructure. It was a big step forward to a more secure environment. Almost ten years ago the big distributions started adding UEFI SecureBoot to their boot process. Support for Trusted Platform Modules (TPMs) has been added to the distributions a long time ago as well — but even though many PCs/laptops these days have TPM chips on-board it's generally not used in the default setup of generic Linux distributions.

How these technologies currently fit together on generic Linux distributions doesn't really make too much sense to me — and falls short of what they could actually deliver. In this story I'd like to have a closer look at why I think that, and what I propose to do about it.

The Basic Technologies

Let's have a closer look at what these technologies actually deliver:

  1. LUKS/dm-crypt/cryptsetup provide disk encryption, and optionally data authentication. Disk encryption means that reading the data in clear-text form is only possible if you possess a secret of some form, usually a password/passphrase. Data authentication means that no one can make changes to the data on disk unless they possess a secret of some form. Most distributions only enable the former though — the latter is a more recent addition to LUKS/cryptsetup, and is not used by default on most distributions (though it probably should be). Closely related to LUKS/dm-crypt is dm-verity (which can authenticate immutable volumes) and dm-integrity (which can authenticate writable volumes, among other things).

  2. UEFI SecureBoot provides mechanisms for authenticating boot loaders and other pre-OS binaries before they are invoked. If those boot loaders then authenticate the next step of booting in a similar fashion there's a chain of trust which can ensure that only code that has some level of trust associated with it will run on the system. Authentication of boot loaders is done via cryptographic signatures: the OS/boot loader vendors cryptographically sign their boot loader binaries. The cryptographic certificates that may be used to validate these signatures are then signed by Microsoft, and since Microsoft's certificates are basically built into all of today's PCs and laptops this will provide some basic trust chain: if you want to modify the boot loader of a system you must have access to the private key used to sign the code (or to the private keys further up the certificate chain).

  3. TPMs do many things. For this text we'll focus on one facet: they can be used to protect secrets (for example for use in disk encryption, see above), that are released only if the code that booted the host can be authenticated in some form. This works roughly like this: every component that is used during the boot process (i.e. code, certificates, configuration, …) is hashed with a cryptographic hash function before it is used. The resulting hash is written to some small volatile memory the TPM maintains that is write-only (the so-called Platform Configuration Registers, "PCRs"): each step of the boot process will write hashes of the resources needed by the next part of the boot process into these PCRs. The PCRs cannot be written freely: the hashes written are combined with what is already stored in the PCRs (also through hashing), and the result of that then replaces the previous value. Effectively this means: only if every component involved in the boot matches expectations do the hash values exposed in the TPM PCRs match the expected values too. And if you then use those values to unlock the secrets you want to protect you can guarantee that the key is only released to the OS if the expected OS and configuration is booted. The process of hashing the components of the boot process and writing that to the TPM PCRs is called "measuring". What's also important to mention is that the secrets are not only protected by these PCR values but also encrypted with a "seed key" that is generated on the TPM chip itself, and cannot leave the TPM (at least so goes the theory). The idea is that you cannot read out a TPM's seed key, and thus you cannot duplicate the chip: unless you possess the original, physical chip you cannot retrieve the secret it might be able to unlock for you. Finally, TPMs can enforce a limit on unlock attempts per time ("anti-hammering"): this makes it hard to brute force things: if you can only execute a certain number of unlock attempts within some specific time then brute forcing will be prohibitively slow.
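To make the PCR mechanism a bit more tangible, here's what simply reading out the current measurements can look like with the tpm2-tools package installed (a small illustration only; which PCR banks and indices are populated varies by firmware):

    # Read the SHA-256 bank of a few PCRs discussed in this text:
    # PCR 0 holds firmware code measurements, PCR 4 the boot loader,
    # PCR 7 the SecureBoot certificate/policy state (relevant further below).
    tpm2_pcrread sha256:0,4,7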

How Linux Distributions use these Technologies

As mentioned already, Linux distributions adopted the first two of these technologies widely, the third one not so much.

So typically, here's how the boot process of Linux distributions works these days:

  1. The UEFI firmware invokes a piece of code called "shim" (which is stored in the EFI System Partition — the "ESP" — of your system), that more or less is just a list of certificates compiled into code form. The shim is signed with the aforementioned Microsoft key, that is built into all PCs/laptops. This list of certificates then can be used to validate the next step of the boot process. The shim is measured by the firmware into the TPM. (Well, the shim can do a bit more than what I describe here, but this is outside of the focus of this article.)

  2. The shim then invokes a boot loader (often Grub) that is signed by a private key owned by the distribution vendor. The boot loader is stored in the ESP as well, plus some other places (i.e. possibly a separate boot partition). The corresponding certificate is included in the list of certificates built into the shim. The boot loader components are also measured into the TPM.

  3. The boot loader then invokes the kernel and passes it an initial RAM disk image (initrd), which contains initial userspace code. The kernel itself is signed by the distribution vendor too. It's also validated via the shim. The initrd is not validated, though (!). The kernel is measured into the TPM, the initrd sometimes too.

  4. The kernel unpacks the initrd image, and invokes what is contained in it. Typically, the initrd then asks the user for a password for the encrypted root file system. The initrd then uses that to set up the encrypted volume. No code authentication or TPM measurements take place.

  5. The initrd then transitions into the root file system. No code authentication or TPM measurements take place.

  6. When the OS itself is up the user is prompted for their user name, and their password. If correct, this will unlock the user account: the system is now ready to use. At this point no code authentication, no TPM measurements take place. Moreover, the user's password is not used to unlock any data, it's used only to allow or deny the login attempt — the user's data has already been decrypted a long time ago, by the initrd, as mentioned above.

What you'll notice here of course is that code validation happens for the shim, the boot loader and the kernel, but not for the initrd or the main OS code anymore. TPM measurements might go one step further: the initrd is measured sometimes too, if you are lucky. Moreover, you might notice that the disk encryption password and the user password are inquired by code that is not validated, and is thus not safe from external manipulation. You might also notice that even though TPM measurements of boot loader/OS components are done nothing actually ever makes use of the resulting PCRs in the typical setup.

Attack Scenarios

Of course, before determining whether the setup described above makes sense or not, one should have an idea what one actually intends to protect against.

The most basic attack scenario to focus on is probably that you want to be reasonably sure that if someone steals your laptop that contains all your data then this data remains confidential. The model described above probably delivers that to some degree: the full disk encryption when used with a reasonably strong password should make it hard for the laptop thief to access the data. The data is as secure as the password used is strong. The attacker might attempt to brute force the password, thus if the password is not chosen carefully the attacker might be successful.

Two more interesting attack scenarios go something like this:

  1. Instead of stealing your laptop the attacker takes the harddisk from your laptop while you aren't watching (e.g. while you went for a walk and left it at home or in your hotel room), makes a copy of it, and then puts it back. You'll never notice they did that. The attacker then analyzes the data in their lab, maybe trying to brute force the password. In this scenario you won't even know that your data is at risk, because for you nothing changed — unlike in the basic scenario above. If the attacker manages to break your password they have full access to the data included on it, i.e. everything you have stored on it so far, but not necessarily to what you are going to store on it later. This scenario is worse than the basic one mentioned above, for the simple fact that you won't know that you might be attacked. (This scenario could be extended further: maybe the attacker has a chance to watch you type in your password or so, effectively lowering the password strength.)

  2. Instead of stealing your laptop the attacker takes the harddisk from your laptop while you aren't watching, inserts backdoor code on it, and puts it back. In this scenario you won't know your data is at risk, because physically everything is as before. What's really bad though is that the attacker gets access to anything you do on your laptop, both the data already on it, and whatever you will do in the future.

I think in particular this backdoor attack scenario is something we should be concerned about. We know for a fact that attacks like that happen all the time (Pegasus, industry espionage, …), hence we should make them hard.

Are we Safe?

So, does the scheme so far implemented by generic Linux distributions protect us against the latter two scenarios? Unfortunately not at all. Because distributions set up disk encryption the way they do, and only bind it to a user password, an attacker can easily duplicate the disk, and then attempt to brute force your password. What's worse: since code authentication ends at the kernel — and the initrd is not authenticated anymore — backdooring is trivially easy: an attacker can change the initrd any way they want, without having to fight any kind of protections. And given that FDE unlocking is implemented in the initrd, and it's the initrd that asks for the encryption password, things are just too easy: an attacker could trivially insert some code that picks up the FDE password as you type it in and sends it wherever they want. And not just that: since once they are in they are in, they can do anything they like for the rest of the system's lifecycle, with full privileges — including installing backdoors for versions of the OS or kernel that are installed on the device in the future, so that their backdoor remains open for as long as they like.

That is sad of course. It's particularly sad given that the other popular OSes all address this much better. ChromeOS, Android, Windows and MacOS all have way better built-in protections against attacks like this. And it's why one can certainly claim that your data is probably better protected right now if you store it on those OSes than it is on generic Linux distributions.

(Yeah, I know that there are some niche distros which do this better, and some hackers hack their own. But I care about general purpose distros here, i.e. the big ones, that most people base their work on.)

Note that there are more problems with the current setup. For example, it's really weird that during boot the user is queried for an FDE password which actually protects their data, and then once the system is up they are queried again – now asking for a username, and another password. And the weird thing is that this second authentication that appears to be user-focused doesn't really protect the user's data anymore — at that moment the data is already unlocked and accessible. The username/password query is supposed to be useful in multi-user scenarios of course, but how does that make any sense, given that these multiple users would all have to know a disk encryption password that unlocks the whole thing during the FDE step, and thus they have access to every user's data anyway if they make an offline copy of the harddisk?

Can we do better?

Of course we can, and that is what this story is actually supposed to be about.

Let's first figure out what the minimal issues we should fix are (at least in my humble opinion):

  1. The initrd must be authenticated before being booted into. (And measured unconditionally.)

  2. The OS binary resources (i.e. /usr/) must be authenticated before being booted into. (But don't need to be encrypted, since everyone has the same anyway, there's nothing to hide here.)

  3. The OS configuration and state (i.e. /etc/ and /var/) must be encrypted, and authenticated before they are used. The encryption key should be bound to the TPM device; i.e. system data should be locked to a security concept belonging to the system, not the user.

  4. The user's home directory (i.e. /home/lennart/ and similar) must be encrypted and authenticated. The unlocking key should be bound to a user password or user security token (FIDO2 or PKCS#11 token); i.e. user data should be locked to a security concept belonging to the user, not the system.

Or to summarize this differently:

  1. Every single component of the boot process and OS needs to be authenticated, i.e. all of shim (done), boot loader (done), kernel (done), initrd (missing so far), OS binary resources (missing so far), OS configuration and state (missing so far), the user's home (missing so far).

  2. Encryption is necessary for the OS configuration and state (bound to TPM), and for the user's home directory (bound to a user password or user security token).

In Detail

Let's see how we can achieve the above in more detail.

How to Authenticate the initrd

At the moment initrds are generated on the installed host via scripts (dracut and similar) that try to figure out a minimal set of binaries and configuration data to build an initrd that contains just enough to be able to find and set up the root file system. What is included in the initrd hence depends highly on the individual installation and its configuration. Pretty likely no two initrds generated that way will be fully identical due to this. This model clearly has benefits: the initrds generated this way are very small and minimal, and support exactly what is necessary for the system to boot, no less and no more. It comes with serious drawbacks too though: the generation process is fragile and sometimes more akin to black magic than following clear rules: the generator script natively has to understand a myriad of storage stacks to determine what needs to be included and what not. It also means that authenticating the image is hard: given that each individual host gets a different specialized initrd, we cannot just sign the initrd with the vendor key like we sign the kernel. If we want to keep this design we'd have to figure out some other mechanism (e.g. a per-host signature key – that is generated locally; or by authenticating it with a message authentication code bound to the TPM). While these approaches are certainly thinkable, I am not convinced they are actually a good idea: locally and dynamically generated per-host initrds are something we probably should move away from.

If we move away from locally generated initrds, things become a lot simpler. If the distribution vendor generates the initrds on their build systems then it can be attached to the kernel image itself, and thus be signed and measured along with the kernel image, without any further work. This simplicity is simply lovely. Besides robustness and reproducibility this gives us an easy route to authenticated initrds.
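As a rough sketch (not a prescription) of what that gluing can look like with today's tooling: the systemd EFI stub lets you pack kernel, initrd, command line and os-release into a single EFI binary, which the vendor can then sign as one unit. File names, paths and the section offsets below are illustrative only; sbsign comes from the sbsigntools package.

    # Combine kernel + vendor-built initrd + command line into one EFI binary
    # using the systemd stub (offsets follow the commonly documented recipe):
    objcopy \
        --add-section .osrel=/etc/os-release      --change-section-vma .osrel=0x20000 \
        --add-section .cmdline=kernel-cmdline.txt --change-section-vma .cmdline=0x30000 \
        --add-section .linux=vmlinuz              --change-section-vma .linux=0x2000000 \
        --add-section .initrd=initrd.cpio.zst     --change-section-vma .initrd=0x3000000 \
        /usr/lib/systemd/boot/efi/linuxx64.efi.stub vendor-kernel.efi

    # Sign the combined image once with the vendor's SecureBoot signing key:
    sbsign --key vendor-db.key --cert vendor-db.crt \
           --output vendor-kernel.efi.signed vendor-kernel.efi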

But of course, nothing is really that simple: working with vendor-generated initrds means that we can't adjust them anymore to the specifics of the individual host: if we pre-build the initrds and include them in the kernel image in immutable fashion then it becomes harder to support complex, more exotic storage or to parameterize it with local network server information, credentials, passwords, and so on. Now, for my simple laptop use-case these things don't matter, there's no need to extend/parameterize things, laptops and their setups are not that wildly different. But what to do about the cases where we want both: extensibility to cover for less common storage subsystems (iscsi, LVM, multipath, drivers for exotic hardware…) and parameterization?

Here's a proposal how to achieve that: let's build a basic initrd into the kernel as suggested, but then do two things to make this scheme both extensible and parameterizable, without compromising security.

  1. Let's define a way how the basic initrd can be extended with additional files, which are stored in separate "extension images". The basic initrd should be able to discover these extension images, authenticate them and then activate them, thus extending the initrd with additional resources on-the-fly.

  2. Let's define a way how we can safely pass additional parameters to the kernel/initrd (and actually the rest of the OS, too) in an authenticated (and possibly encrypted) fashion. Parameters in this context can be anything specific to the local installation, i.e. server information, security credentials, certificates, SSH server keys, or even just the root password that shall be able to unlock the root account in the initrd …

In such a scheme we should be able to deliver everything we are looking for:

  1. We'll have a full trust chain for the code: the boot loader will authenticate and measure the kernel and basic initrd. The initrd extension images will then be authenticated by the basic initrd image.

  2. We'll have authentication for all the parameters passed to the initrd.

This so far sounds very unspecific? Let's make it more specific by looking closer at the components I'd suggest to be used for this logic:

  1. The systemd suite has for a few months now contained a subsystem implementing system extensions (v248). System extensions are ultimately just disk images (for example a squashfs file system in a GPT envelope) that can extend an underlying OS tree. Extending in this regard means they simply add additional files and directories into the OS tree, i.e. below /usr/. For a longer explanation see systemd-sysext(8). When a system extension is activated it is simply mounted and then merged into the main /usr/ tree via a read-only overlayfs mount. What's particularly nice about them in the context we are discussing here is that the extension images may carry dm-verity authentication data, and PKCS#7 signatures (once this is merged, that is, i.e. v250).

  2. The systemd suite also contains a concept called service "credentials". These are small pieces of information passed to services in a secure way. One key feature of these credentials is that they can be encrypted and authenticated in a very simple way with a key bound to the TPM (v250). See LoadCredentialEncrypted= and systemd-creds(1) for details. They are great for safely storing SSL private keys and similar on your system, but they also come in handy for parameterizing initrds: an encrypted credential is just a file that can only be decoded if the right TPM is around with the right PCR values set. (A small sketch of how a service consumes such a credential follows after this list.)

  3. The systemd suite contains a component called systemd-stub(7). It's an EFI stub, i.e. a small piece of code that is attached to a kernel image, and turns the kernel image into a regular EFI binary that can be directly executed by the firmware (or a boot loader). This stub has a number of nice features (for example, it can show a boot splash before invoking the Linux kernel itself and such). Once this work is merged (v250) the stub will support one more feature: it will automatically search for system extension image files and credential files next to the kernel image file, measure them and pass them on to the main initrd of the host.
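To make the credentials idea from item 2 a little more concrete, here's a minimal sketch of how a TPM-bound credential is created and then consumed by a service; the credential name, file paths and the daemon are made up for illustration, and the directives themselves are documented in systemd-creds(1) and systemd.exec(5):

    # Encrypt a secret so that only this host's TPM (with the expected PCR
    # values) can decrypt it again; "webserver.key" is just an example name.
    systemd-creds encrypt --tpm2-device=auto --name=webserver.key \
        plaintext.key /etc/mydaemon/webserver.key.cred

    # In the consuming unit, reference the encrypted credential; systemd
    # decrypts it at service start and hands it to the service read-only:
    #
    #   [Service]
    #   LoadCredentialEncrypted=webserver.key:/etc/mydaemon/webserver.key.cred
    #   ExecStart=/usr/bin/mydaemon
    #
    # At runtime the service finds the decrypted file at
    # $CREDENTIALS_DIRECTORY/webserver.key.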

Putting this together we have a nice way to provide fully authenticated kernel images, initrd images and initrd extension images, as well as encrypted and authenticated parameters via the credentials logic.

How would a distribution actually make use of this? A distribution vendor would pre-build the basic initrd, glue it into the kernel image, and sign that as a whole. Then, for each supported extension of the basic initrd (e.g. one for iscsi support, one for LVM, one for multipath, …), the vendor would use a tool such as mkosi to build an extension image, i.e. a GPT disk image containing the files in squashfs format, a Verity partition that authenticates it, plus a PKCS#7 signature partition that validates the root hash for the dm-verity partition, and that can be checked against a key provided by the boot loader or main initrd. Then, any parameters for the initrd will be encrypted using systemd-creds encrypt -T. The resulting encrypted credentials and the initrd extension images are then simply placed next to the kernel image in the ESP (or boot partition). Done.
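Sketched as commands, the per-host parameterization part of that workflow could look roughly like this; the file names are invented, and the exact drop-in location next to the kernel image is an assumption that should be checked against the systemd-stub(7) documentation of the version actually shipped:

    # Encrypt a host-specific parameter (say, a root password for the initrd)
    # so that only this machine's TPM can unlock it:
    systemd-creds encrypt --tpm2-device=auto --name=rootpw rootpw.txt rootpw.cred

    # Drop the credential and a pre-built initrd extension image next to the
    # signed kernel in the ESP, where the stub is supposed to pick them up:
    install -D rootpw.cred      /efi/EFI/Linux/vendor-kernel.efi.extra.d/rootpw.cred
    install -D iscsi-sysext.raw /efi/EFI/Linux/vendor-kernel.efi.extra.d/iscsi-sysext.raw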

This checks all boxes: everything is authenticated and measured, the credentials also encrypted. Things remain extensible and modular, can be pre-built by the vendor, and installation is as simple as dropping in one file for each extension and/or credential.

How to Authenticate the Binary OS Resources

Let's now have a look how to authenticate the Binary OS resources, i.e. the stuff you find in /usr/, i.e. the stuff traditionally shipped to the user's system via RPMs or DEBs.

I think there are three relevant ways how to authenticate this:

  1. Make /usr/ a dm-verity volume. dm-verity is a concept implemented in the Linux kernel that provides authenticity to read-only block devices: every read access is cryptographically verified against a top-level hash value. This top-level hash is typically a 256-bit value that you can either encode in the kernel image you are using, or cryptographically sign (which is particularly nice once this is merged). I think this is actually the best approach since it makes the /usr/ tree entirely immutable in a very simple way. However, this also means that the whole of /usr/ needs to be updated at once, i.e. the traditional rpm/apt based update logic cannot work in this mode. (A short veritysetup sketch follows after this list.)

  2. Make /usr/ a dm-integrity volume. dm-integrity is a concept provided by the Linux kernel that offers integrity guarantees to writable block devices, i.e. in some ways it can be considered to be a bit like dm-verity while permitting write access. It can be used in three ways, one of which I think is particularly relevant here. The first way is with a simple hash function in "stand-alone" mode: this is not too interesting here, it just provides greater data safety for file systems that don't hash check their files' data on their own. The second way is in combination with dm-crypt, i.e. with disk encryption. In this case it adds authenticity to confidentiality: only if you know the right secret can you read and make changes to the data, and any attempt to make changes without knowing this secret key will be detected as an IO error on the next read by those in possession of the secret (more about this below). The third way is the one I think is most interesting here: in "stand-alone" mode, but with a keyed hash function (e.g. HMAC). What's this good for? This provides authenticity without encryption: if you make changes to the disk without knowing the secret this will be noticed on the next read attempt of the data and result in IO errors. This mode provides what we want (authenticity) and doesn't do what we don't need (encryption). Of course, the secret key for the HMAC must be provided somehow, I think ideally by the TPM.

  3. Make /usr/ a dm-crypt (LUKS) + dm-integrity volume. This provides both authenticity and encryption. The latter isn't typically needed for /usr/ given that it generally contains no secret data: anyone can download the binaries off the Internet anyway, and the sources too. By encrypting this you'll waste CPU cycles, but beyond that it doesn't hurt much. (Admittedly, some people might want to hide the precise set of packages they have installed, since it of course does reveal a bit of information about you: i.e. what you are working on, maybe what your job is – think: if you are a hacker you have hacking tools installed – and similar). Going this way might simplify things in some cases, as it means you don't have to distinguish "OS binary resources" (i.e /usr/) and "OS configuration and state" (i.e. /etc/ + /var/, see below), and just make it the same volume. Here too, the secret key must be provided somehow, I think ideally by the TPM.
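For the dm-verity variant from item 1, the basic mechanics look roughly like this (a hand-rolled sketch using cryptsetup's veritysetup; in a real build the hash tree would be generated as part of image creation and the root hash embedded in, or signed along with, the kernel image):

    # Build the dm-verity hash tree for a read-only /usr image; this prints
    # the root hash that everything else is anchored to.
    veritysetup format usr.img usr.verity

    # At boot, set up the authenticated block device; any offline tampering
    # with usr.img then shows up as I/O errors on read.
    veritysetup open usr.img usr usr.verity <root-hash-printed-above>
    mount -o ro /dev/mapper/usr /usr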

All three approaches are valid. The first approach has my primary sympathies, but for distributions not willing to abandon client-side updates via RPM/dpkg it is not an option, in which case I would propose one of the other two approaches.

The LUKS encryption key (and in the case of dm-integrity standalone mode the key for the keyed hash function) should be bound to the TPM. Why the TPM for this? You could also use a user password, or a FIDO2 or PKCS#11 security token — but I think the TPM is the right choice, mainly to reduce the requirement for repeated authentication, i.e. having to first provide the disk encryption password and then log in, providing yet another password. It should be possible for the system to boot up unattended, with only a single authentication prompt needed to unlock the user's data properly. The TPM provides a way to do this in a reasonably safe and fully unattended way. Also, when we stop considering just the laptop use-case for a moment: on servers interactive disk encryption prompts don't make much sense — the fact that TPMs can provide secrets without user interaction, and thus can work in entirely unattended environments, is quite desirable. Note that crypttab(5) as implemented by systemd (v248) provides native support for authentication via password, via TPM2, via PKCS#11 or via FIDO2, so the choice is ultimately all yours.
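Concretely, with a current systemd this can be as simple as the following; the device path is a placeholder, and the details are in systemd-cryptenroll(1) and crypttab(5):

    # Enroll a TPM2-bound key into an existing LUKS2 volume, tied to the
    # SecureBoot policy PCR (more on that PCR choice later in this text):
    systemd-cryptenroll --tpm2-device=auto --tpm2-pcrs=7 /dev/disk/by-partlabel/root

    # /etc/crypttab: unlock the volume via the TPM at boot, with no
    # interactive password prompt:
    #
    #   root  /dev/disk/by-partlabel/root  -  tpm2-device=auto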

How to Encrypt/Authenticate OS Configuration and State

Let's now look at the OS configuration and state, i.e. the stuff in /etc/ and /var/. It probably makes sense to not consider these two hierarchies independently but instead just consider this to be the root file system. If the OS binary resources are in a separate file system it is then mounted onto the /usr/ sub-directory of the root file system.

The OS configuration and state (or: root file system) should be both encrypted and authenticated: it might contain secret keys, user passwords, privileged logs and similar. This data matters and contains plenty of data that should remain confidential.

The encryption of choice here is dm-crypt (LUKS) + dm-integrity, similar to what was discussed above, again with the key bound to the TPM.

If the OS binary resources are protected the same way it is safe to merge these two volumes and have a single partition for both (see above).

How to Encrypt/Authenticate the User's Home Directory

The data in the user's home directory should be encrypted, and bound to the user's preferred token of authentication (i.e. a password or FIDO2/PKCS#11 security token). As mentioned, in the traditional mode of operation the user's home directory is not individually encrypted, but only encrypted because FDE is in use. The encryption key for that is a system-wide key though, not a per-user key. And I think that's a problem, as mentioned (and probably not even generally understood by our users). We should correct that and ensure that the user's password is what unlocks the user's data.

In the systemd suite we provide a service systemd-homed(8) (v245) that implements this in a safe way: each user gets its own LUKS volume stored in a loopback file in /home/, and this is enough to synthesize a user account. The encryption password for this volume is the user's account password, thus it's really the password provided at login time that unlocks the user's data. systemd-homed also supports other mechanisms of authentication, in particular PKCS#11/FIDO2 security tokens. It also provides support for other storage back-ends (such as fscrypt), but I'd always suggest to use the LUKS back-end since it's the only one providing the comprehensive confidentiality guarantees one wants for a UNIX-style home directory.
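As a small illustration of that model (the user name and size are made up; see homectl(1) for the full range of options):

    # Create a user whose home directory lives in its own LUKS volume,
    # unlocked with the account password at login time:
    homectl create lennart --storage=luks --disk-size=100G

    # Later, additionally bind unlocking to a FIDO2 security token:
    homectl update lennart --fido2-device=auto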

Note that there's one special caveat here: if the user's home directory (e.g. /home/lennart/) is encrypted and authenticated, what about the file system this data is stored on, i.e. /home/ itself? If that dir is part of the root file system this would result in double encryption: first the data is encrypted with the TPM root file system key, and then again with the per-user key. Such double encryption is a waste of resources, and unnecessary. I'd thus suggest to make /home/ its own dm-integrity volume with a HMAC, keyed by the TPM. This means the data stored directly in /home/ will be authenticated but not encrypted. That's good not only for performance, but also has practical benefits: it allows extracting the encrypted volume of the various users in case the TPM key is lost, as a way to recover from dead laptops or similar.
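A rough sketch of what such a standalone dm-integrity setup with a keyed hash could look like; flag spellings should be double-checked against integritysetup(8), and the HMAC key comes from a plain file here only for brevity, whereas in the proposed design it would be supplied by the TPM:

    # Format the /home/ partition for standalone integrity protection with
    # an HMAC (key from a file here, ideally TPM-provided in practice):
    integritysetup format /dev/disk/by-partlabel/home \
        --integrity hmac-sha256 --integrity-key-file hmac.key --integrity-key-size 64

    # Open it; unauthorized offline modifications then surface as I/O errors.
    integritysetup open /dev/disk/by-partlabel/home home \
        --integrity hmac-sha256 --integrity-key-file hmac.key --integrity-key-size 64
    mkfs.ext4 /dev/mapper/home    # first use only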

Why authenticate /home/, if it only contains per-user home directories that are authenticated on their own anyway? That's a valid question: it's because the kernel file system maintainers made clear that Linux file system code is not considered safe against rogue disk images, and is not tested for that; this means before you mount anything you need to establish trust in some way because otherwise there's a risk that the act of mounting might exploit your kernel.

Summary of Resources and their Protections

So, let's now put this all together. Here's a table showing the various resources we deal with, and how I think they should be protected (in my idealized world).

Resource | Needs Authentication | Needs Encryption | Suggested Technology | Validation/Encryption Keys/Certificates acquired via | Stored where
Shim | yes | no | SecureBoot signature verification | firmware certificate database | ESP
Boot loader | yes | no | ditto | firmware certificate database/shim | ESP/boot partition
Kernel | yes | no | ditto | ditto | ditto
initrd | yes | no | ditto | ditto | ditto
initrd parameters | yes | yes | systemd TPM encrypted credentials | TPM | ditto
initrd extensions | yes | no | systemd-sysext with Verity+PKCS#7 signatures | firmware/initrd certificate database | ditto
OS binary resources | yes | no | dm-verity | root hash linked into kernel image, or firmware/initrd certificate database | top-level partition
OS configuration and state | yes | yes | dm-crypt (LUKS) + dm-integrity | TPM | top-level partition
/home/ itself | yes | no | dm-integrity with HMAC | TPM | top-level partition
User home directories | yes | yes | dm-crypt (LUKS) + dm-integrity in loopback files | User password/FIDO2/PKCS#11 security token | loopback file inside /home partition

This should provide all the desired guarantees: everything is authenticated, and the individualized per-host or per-user data is also encrypted. No double encryption takes place. The encryption keys/verification certificates are stored/bound to the most appropriate infrastructure.

Does this address the three attack scenarios mentioned earlier? I think so, yes. The basic attack scenario I described is addressed by the fact that /var/, /etc/ and /home/*/ are encrypted. Brute forcing the former two is harder than in the status quo ante model, since a high-entropy key is used instead of one derived from a user-provided password. Moreover, the "anti-hammering" logic of the TPM will make brute forcing prohibitively slow. The home directories are protected by the user's password or ideally a personal FIDO2/PKCS#11 security token in this model. Of course, a password isn't better security-wise than the status quo ante. But given the FIDO2/PKCS#11 support built into systemd-homed it should be easier to lock down the home directories securely.

Binding encryption of /var/ and /etc/ to the TPM also addresses the first of the two more advanced attack scenarios: a copy of the harddisk is useless without the physical TPM chip, since the seed key is sealed into that. (And even if the attacker had the chance to watch you type in your password, it won't help unless they possess access to the TPM chip.) For the home directory this attack is not addressed as long as a plain password is used. However, since binding home directories to FIDO2/PKCS#11 tokens is built into systemd-homed things should be safe here too — provided the user actually possesses and uses such a device.

The backdoor attack scenario is addressed by the fact that every resource in play now is authenticated: it's hard to backdoor the OS if there's no component that isn't verified by signature keys or TPM secrets the attacker hopefully doesn't know.

For general purpose distributions that focus on updating the OS per RPM/dpkg the idealized model above won't work out, since (as mentioned) this implies an immutable /usr/, and thus requires updating /usr/ via an atomic update operation. For such distros a setup like the following is probably more realistic, but see above.

Resource | Needs Authentication | Needs Encryption | Suggested Technology | Validation/Encryption Keys/Certificates acquired via | Stored where
Shim | yes | no | SecureBoot signature verification | firmware certificate database | ESP
Boot loader | yes | no | ditto | firmware certificate database/shim | ESP/boot partition
Kernel | yes | no | ditto | ditto | ditto
initrd | yes | no | ditto | ditto | ditto
initrd parameters | yes | yes | systemd TPM encrypted credentials | TPM | ditto
initrd extensions | yes | no | systemd-sysext with Verity+PKCS#7 signatures | firmware/initrd certificate database | ditto
OS binary resources, configuration and state | yes | yes | dm-crypt (LUKS) + dm-integrity | TPM | top-level partition
/home/ itself | yes | no | dm-integrity with HMAC | TPM | top-level partition
User home directories | yes | yes | dm-crypt (LUKS) + dm-integrity in loopback files | User password/FIDO2/PKCS#11 security token | loopback file inside /home partition

This means there's only one root file system that contains all of /etc/, /var/ and /usr/.

Recovery Keys

When binding encryption to TPMs one problem that arises is what strategy to adopt if the TPM is lost, due to hardware failure: if I need the TPM to unlock my encrypted volume, what do I do if I need the data but lost the TPM?

The answer here is supporting recovery keys (this is similar to how other OSes approach this). Recovery keys are pretty much the same concept as passwords. The main difference is that they are computer-generated rather than user-chosen. Because of that they typically have much higher entropy (which makes them more annoying to type in, i.e. you want to use them only when you must, not day-to-day). By having higher entropy they are useful in combination with TPM, FIDO2 or PKCS#11 based unlocking: unlike a combination with passwords they do not compromise the higher strength of protection that TPM/FIDO2/PKCS#11 based unlocking is supposed to provide.

Current versions of systemd-cryptenroll(1) implement a recovery key concept in an attempt to address this problem. You may enroll any combination of TPM chips, PKCS#11 tokens, FIDO2 tokens, recovery keys and passwords on the same LUKS volume. When enrolling a recovery key it is generated and shown on screen both in text form and as a QR code you can scan off screen if you like. The idea is to write down/store this recovery key in a safe place so that you can use it when you need it. Note that such recovery keys can be entered wherever a LUKS password is requested, i.e. after generation they behave pretty much the same as a regular password.
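In practice that is a single command (see systemd-cryptenroll(1)); the device path is again a placeholder:

    # Generate and enroll a high-entropy recovery key next to the existing
    # TPM2/password enrollments; it is shown once (also as a QR code) so you
    # can write it down and store it somewhere safe.
    systemd-cryptenroll --recovery-key /dev/disk/by-partlabel/root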

TPM PCR Brittleness

Locking devices to TPMs and enforcing a PCR policy with this (i.e. configuring the TPM key to be unlockable only if certain PCRs match certain values, and thus requiring the OS to be in a certain state) brings a problem with it: TPM PCR brittleness. If the key you want to unlock with the TPM requires the OS to be in a specific state (i.e. that all OS components' hashes match certain expectations or similar) then doing OS updates might have the effect of making your key inaccessible: the OS updates will cause the code to change, and thus the hashes of the code, and thus certain PCRs. (Thankfully, you enrolled a recovery key, as described above, so this doesn't mean you lost your data, right?).

To address this I'd suggest three strategies:

  1. Most importantly: don't actually use the TPM PCRs that contain code hashes. There are actually multiple PCRs defined, each containing measurements of different aspects of the boot process. My recommendation is to bind keys to PCR 7 only, a PCR that contains measurements of the UEFI SecureBoot certificate databases. Thus, the keys will remain accessible as long as these databases remain the same, and updates to code will not affect them (updates to the certificate databases will, and they do happen too, though hopefully much less frequently than code updates). Does this reduce security? Not much, no, because the code that's run is after all not just measured but also validated via code signatures, and those signatures are validated with the aforementioned certificate databases. Thus binding an encrypted TPM key to PCR 7 should enforce a similar level of trust in the boot/OS code as binding it to a PCR with hashes of specific versions of that code. I.e. using PCR 7 means you say "any code signed by these vendors is allowed to unlock my key" while using a PCR that contains code hashes means "only this exact version of my code may access my key". (See the command sketch after this list.)

  2. Use LUKS key management to enroll multiple versions of the TPM keys in relevant volumes, to support multiple versions of the OS code (or multiple versions of the certificate database, as discussed above). Specifically: whenever an update is done that might result in changing the relevant PCRs, pre-calculate the new PCRs, and enroll them in an additional LUKS slot on the relevant volumes. This means that the unlocking keys tied to the TPM remain accessible in both states of the system. Eventually, once rebooted after the update, remove the old slots.

  3. If these two strategies didn't work out (maybe because the OS/firmware was updated outside of OS control, or the update mechanism was aborted at the wrong time) and the TPM PCRs changed unexpectedly, and the user now needs to use their recovery key to regain access to the OS, let's handle this gracefully and automatically reenroll the current TPM PCRs at boot, after the recovery key checked out, so that for future boots everything is in order again.
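As a hedged illustration of how strategies 1 and 3 map onto the systemd tooling (strategy 2's pre-calculation of future PCR values isn't shown, only the simpler re-enroll-after-the-fact variant; the device path is a placeholder):

    # Strategy 1: bind the enrollment to PCR 7 (SecureBoot certificates) only,
    # so routine code updates don't invalidate the key.
    systemd-cryptenroll --tpm2-device=auto --tpm2-pcrs=7 /dev/disk/by-partlabel/root

    # Strategy 3, roughly: after PCRs changed unexpectedly and the recovery
    # key got you back in, drop the stale TPM2 enrollment and re-enroll
    # against the current PCR state.
    systemd-cryptenroll --wipe-slot=tpm2 /dev/disk/by-partlabel/root
    systemd-cryptenroll --tpm2-device=auto --tpm2-pcrs=7 /dev/disk/by-partlabel/root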

Other approaches can work too: for example, some OSes simply remove TPM PCR policy protection of disk encryption keys altogether immediately before OS or firmware updates, and then reenable it right after. Of course, this opens a time window where the key bound to the TPM is much less protected than people might assume. I'd try to avoid such a scheme if possible.

Anything Else?

So, given that we are talking about idealized systems: I personally actually think the ideal OS would be much simpler, and thus more secure than this:

I'd try to ditch the Shim, and instead focus on enrolling the distribution vendor keys directly in the UEFI firmware certificate list. This is actually supported by all firmwares too. This has various benefits: it's no longer necessary to bind everything to Microsoft's root key, you can just enroll your own stuff and thus make sure that only what you want to trust is trusted, and nothing else. To make an approach like this easier, we have been working on doing automatic enrollment of these keys from the systemd-boot boot loader, see this work in progress for details. This way the firmware will authenticate the boot loader/kernel/initrd without any further component for this in place.

I'd also not bother with a separate boot partition, and just use the ESP for everything. The ESP is required anyway by the firmware, and is good enough for storing the few files we need.

FAQ

Can I implement all of this in my distribution today?

Probably not. While the big issues have mostly been addressed there's a lot of integration work still missing. As you might have seen I linked some PRs that haven't even been merged into our tree yet, and definitely not been released yet or even entered the distributions.

Will this show up in Fedora/Debian/Ubuntu soon?

I don't know. I am making a proposal how these things might work, and am working on getting various building blocks for this into shape. What the distributions do is up to them. But even if they don't follow the recommendations I make 100%, or don't want to use the building blocks I propose I think it's important they start thinking about this, and yes, I think they should be thinking about defaulting to setups like this.

Work for measuring/signing initrds on Fedora has been started, here's a slide deck with some information about it.

But isn't a TPM evil?

Some corners of the community tried (unfortunately successfully to some degree) to paint TPMs/Trusted Computing/SecureBoot as generally evil technologies that stop us from using our systems the way we want. That idea is rubbish though, I think. We should focus on what it can deliver for us (and that's a lot I think, see above), and appreciate the fact we can actually use it to kick out perceived evil empires from our devices instead of being subjected to them. Yes, the way SecureBoot/TPMs are defined puts you in the driver seat if you want — and you may enroll your own certificates to keep out everything you don't like.

What if my system doesn't have a TPM?

TPMs are becoming quite ubiquitous, in particular as the upcoming Windows versions will require them. In general I think we should focus on modern, fully equipped systems when designing all this, and then find fall-backs for more limited systems. Frankly it feels as if so far the design approach for all this was the other way round: try to make the new stuff work like the old rather than the old like the new (I mean, to me it appears this thinking is the main raison d'être for the Grub boot loader).

More specifically, on the systems where we have no TPM we ultimately cannot provide the same security guarantees as for those which have. So depending on the resource to protect we should fall back to different TPM-less mechanisms. For example, if we have no TPM then the root file system should probably be encrypted with a user provided password, typed in at boot as before. And for the encrypted boot credentials we probably should simply not encrypt them, and place them in the ESP unencrypted.

Effectively this means: without TPM you'll still get protection regarding the basic attack scenario, as before, but not the other two.

What if my system doesn't have UEFI?

Many of the mechanisms explained above taken individually do not require UEFI. But of course the chain of trust suggested above requires something like UEFI SecureBoot. If your system lacks UEFI it's probably best to find work-alikes to the technologies suggested above, but I doubt I'll be able to help you there.

rpm/dpkg already cryptographically validates all packages at installation time (gpg), why would I need more than that?

This type of package validation happens once: at the moment of installation (or update) of the package, but not anymore when the data installed is actually used. Thus when an attacker manages to modify the package data after installation and before use they can make any change they like without this ever being noticed. Such package download validation does address certain attack scenarios (i.e. man-in-the-middle attacks on network downloads), but it doesn't protect you from attackers with physical access, as described in the attack scenarios above.

Systems such as ostree aren't better than rpm/dpkg regarding this BTW, their data is not validated on use either, but only during download or when processing tree checkouts.

The key point here is that the scheme explained above provides offline protection for the data "at rest" — even someone with physical access to your device cannot easily make changes that aren't noticed on next use. rpm/dpkg/ostree provide online protection only: as long as the system remains up, and all OS changes are done through the intended program code-paths, and no one has physical access, everything should be good. In today's world I am sure this is not good enough though. As mentioned, most modern OSes provide offline protection for the data at rest in one way or another. Generic Linux distributions are terribly behind on this.

This is all so desktop/laptop focused, what about servers?

I am pretty sure servers should provide similar security guarantees as outlined above. In a way servers are a much simpler case: there are no users and no interactivity. Thus the discussion of /home/ and what it contains and of user passwords doesn't matter. However, the authenticated initrd and the unattended TPM-based encryption I think are very important for servers too, in a trusted data center environment. It provides security guarantees so far not given by Linux server OSes.

I'd like to help with this, or discuss/comment on this

Submit patches or reviews through GitHub. General discussion about this is best done on the systemd mailing list.

September 20, 2021

ES: But Why

I got a request recently to fix up the WebGL Aquarium demo. I’ve had this bookmarked for a while since it’s one of the only test cases for GL_EXT_multisampled_render_to_texture I’m aware of, at least when running Chrome in EGL mode.

Naturally, I decided to do both at once since this would be yet another extension that no native desktop driver in Mesa currently supports.

Transient Multisampling

The idea behind this extension is that on tilers, a single-sample image can be temporarily treated as multisampled without needing the extra memory bandwidth that multisampled images require. The multisampled rendering image is transient, meaning it isn’t loaded or written to, so it can be lazily allocated and then discarded.

Vulkan has similar mechanisms: loadOp and storeOp set to VK_ATTACHMENT_LOAD_OP_DONT_CARE with an image allocated using ZINK_HEAP_DEVICE_LOCAL_LAZY and VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT, then adding a resolve attachment for writeout. Simple enough.

Unfortunately, this doesn’t quite work.

As Danylo Piliaiev, RenderDoc master and Finder Of Bad Pixels, was quick to point out, this approach will discard any previous data the texture had, resulting in bad pixels. Lots of bad pixels, in fact.

The solution, which I hate, is that the “transient” multisampled image now gets a full-texture draw before its “transient” renderpass to initialize it, then is loaded with VK_ATTACHMENT_LOAD_OP_LOAD, effectively making it not transient at all.

Sorry tilers.

No frames for you.

aquarium.gif

But it does work and gives solid performance (-20% or so while recording smh screen recorders git gud), so I can’t be too mad.

September 16, 2021

I am back with another status update on raytracing in RADV. And the good news is that things are finally starting to come together. After ~9 months of on-and-off work we now have games working with raytracing. Working on the first try after getting all the required functionality in place was Control:

Control with raytracing on RADV

After poking for a long time at CTS and demos it is really nice to see the fruits of one's work.

The piece that I added recently was copy/compaction and serialization of acceleration structures, which involved a bunch of shader writing, handling another type of query and dealing with indirect dispatches. (Since of course the API doesn’t give the input size on the CPU. No idea how this API should be secured …)

What games?

I did try 5 games:

  1. Quake 2 RTX (Vulkan): works. This was working already on my previous update.
  2. Control (D3D): works. Pretty much just works. Runs at maybe 30-50% of RT performance on Windows.
  3. Metro Exodus (Vulkan): works. Needs one workaround and is very finicky in WSI but otherwise works fine. Runs at 20-25% of RT performance on Windows.
  4. Ghostrunner (D3D): Does not work. This really needs per shadergroup compilation instead of just mashing all the shaders together as I get shaders now with 1 million NIR instructions, which is a pain to debug.
  5. Doom Eternal (Vulkan): Does not work. The raytracing option in the menu stays grayed out and at this point I’m at a loss what is required to make the game allow enabling RT.

If anybody could tell me how to get Doom Eternal to allow RT I’d appreciate it.

What is next?

Of course the support is far from done. Some things to still make progress on:

  1. Upstreaming what I have. Samuel has been busy reviewing my MRs and I think there is a good chance that what I have now will make it into 21.3.
  2. Improve the pipeline compilation model to hopefully make ghostrunner work.
  3. Improved BVH building. The current BVH is really naive, which is likely one of the big performance factors.
  4. Improve traversal.
  5. Move on to stuff needed for DXR 1.1 like VK_KHR_ray_query.

P.S. If you haven’t seen it yet, Jason Ekstrand from Intel recently gave a talk about how Intel implements raytracing. Nice showcase of how you can provide some more involved hardware implementation than RDNA2 does.

Been some time since my last update, so I felt it was time to flex my blog writing muscles again and provide some updates on some of the things we are working on in Fedora in preparation for Fedora Workstation 35. This is not meant to be a comprehensive what's-new article about Fedora Workstation 35, more a listing of some of the things we are doing as part of the Red Hat desktop team.

NVidia support for Wayland
One thing we spent a lot of effort on for a long time now is getting full support for the NVidia binary driver under Wayland. It has been a recurring topic in our bi-weekly calls with the NVidia engineering team ever since we started looking at moving to Wayland. There has been basic binary driver support for some time, meaning you could run a native Wayland session on top of the binary driver, but the critical missing piece was that you could not get support for accelerated graphics when running applications through XWayland, our X.org compatibility layer. Which basically meant that any application requiring 3D support and which wasn’t a native Wayland application yet wouldn’t work. So over the last Months we been having a great collaboration with NVidia around closing this gap, with them working closely with us in fixing issues in their driver while we have been fixing bugs and missing pieces in the rest of the stack. We been reporting and discussing issues back and forth allowing us a very quickly turnaround on issues as we find them which of course all resulted in the NVidia 470.42.01 driver with XWayland support. I am sure we will find new corner cases that needs to be resolved in the coming Months, but I am equally sure we will be able to quickly resolve them due to the close collaboration we have now established with NVidia. And I know some people will wonder why we spent so much time working with NVidia around their binary driver, but the reality is that NVidia is the market leader, especially in the professional Linux workstation space, and there are lot of people who either would end up not using Linux or using Linux with X without it, including a lot of Red Hat customers and Fedora users. And that is what I and my team are here for at the end of the day, to make sure Red Hat customers are able to get their job done using their Linux systems.

Lightweight kiosk mode
One of the wonderful things about open source is the constant flow of code and innovation between all the different parts of the ecosystem. For instance, one thing we on the RHEL side have often been asked about over the last few years is a lightweight and simple-to-use solution for people wanting to run single-application setups, like information boards, ATM machines, cash registers, information kiosks and so on. For many use cases people felt that running a full GNOME 3 desktop underneath their application was either too resource hungry or created a risk that people accidentally end up in the desktop session. At the same time, from our viewpoint as a development team we didn't want a completely separate stack for this use case, as that would just increase our maintenance burden by having to do a lot of things twice. So to solve this problem Ray Strode spent some time writing what we call GNOME Kiosk mode, which makes setting up a simple session running a single application easy, without running things like the GNOME shell, tracker, evolution etc. This gives you a window manager with full support for the latest technologies such as compositing, libinput and Wayland, but coming in at about 18MB, which is about 71MB less than a minimal GNOME 3 desktop session. You can read more about the new Kiosk mode and how to use it in this great blog post from our savvy Edge Computing Product Manager Ben Breard. The kiosk mode session described in Ben's article about RHEL will be available with Fedora Workstation 35.

High-definition mouse wheel support
A major part of what we do is making sure that Red Hat Enterprise Linux customers and Fedora users get hardware support on par with what you find on other operating systems. We try our best to work with our hardware partners, like Lenovo, to ensure that such hardware support arrives day-and-date with when those features are enabled on other systems, but some things end up taking longer for various reasons. Support for high-definition mouse wheels was one of those. Peter Hutterer, our resident input expert, put together a great blog post explaining the history and status of high-definition mouse wheel support. As Peter points out in his blog post, the feature is not yet fully supported under Wayland, but we hope to close that gap in time for Fedora Workstation 35.


Mouse with HiRes scroll wheel

PipeWire
I feel I can’t do one of these posts without talking about latest developments in PipeWire, our unified audio and video server. Wim Taymans keeps working with rapidly growing PipeWire community to fix issues as they are reported and add new features to PipeWire. Most recently Wims focus has been on implementing support for S/PDIF passthrough support over both S/PDIF and HDMI connections. This will allow us to send undecoded data over such connections which is critical for working well with surround sound systems and soundbars. Also the PipeWire community has been working hard on further improving the Bluetooth support with bluetooth battery status support for head-set profile and using Apple extensions. aptX-LL and FastStream codec support was also added. And of course a huge amount of bug fixes, it turns out that when you replace two different sound servers that has been around for close to two decades there are a lot of corner cases to cover :). Make sure to check out two latest release notes for 0.3.35 and for 0.3.36 for details.

Screenshot of Easyeffects

EasyEffects is a great example of a cool new application built with PipeWire

Privacy screen
Another feature that we have been working on as a result of our Lenovo partnership is privacy screen support. For those not familiar with this technology, it basically allows you to reduce the readability of your screen when viewed from the side, so that if you are using your laptop at a coffee shop, for instance, a person sitting close by will have a much harder time reading what is on your screen. Hans de Goede has been shepherding the kernel side of this forward, working with Marco Trevisan from Canonical on the userspace part (which also makes this a nice example of cross-company collaboration), allowing you to turn this feature on or off. This feature is not likely to fully land in time for Fedora Workstation 35, so we are looking at whether we will bring it in as an update to Fedora Workstation 35 or make it a Fedora Workstation 36 feature.

Penny

Zink inside the penny


As most of you know, the future of 3D graphics on Linux is the Vulkan API from the Khronos Group. This doesn't mean that OpenGL is going away anytime soon though, as there is a large host of applications out there using this API, and for certain types of 3D graphics development developers might still choose OpenGL over Vulkan. Of course for us that creates a bit of a challenge, because maintaining two 3D graphics interfaces is a lot of work, even with the great help and contributions from the hardware makers themselves. So we have been eyeing the Zink project, which aims at re-implementing OpenGL on top of Vulkan, for a while as a potential candidate for solving our long-term need to support the OpenGL API without drowning us in work while doing so. The big advantage of Zink is that it allows us to support one shared OpenGL implementation across all hardware and then focus our hardware support efforts on the Vulkan drivers. As part of this effort Adam Jackson has been working on a project called Penny.

Zink implements OpenGL in terms of Vulkan as far as the drawing itself is concerned, but presenting that drawing to the rest of the system is currently system-specific (GLX). For hardware that already has a Mesa driver, we use GBM. On NVIDIA's Vulkan (and probably on any other binary stack on Linux, and probably also on setups like WSL or macOS + MoltenVK) we download the image from the GPU back to the CPU and then use the same software upload/display path as llvmpipe, which, as you can imagine, is Not Fast.

Penny aims to extend Zink by replacing both of those paths, and instead using the various Vulkan WSI extensions to manage presentation. Even for the GBM case this should enable higher performance since zink will have more information about the rendering pipeline (multisampling in particular is poorly handled atm). Future window system integration work can focus on Vulkan, with EGL and GLX getting features “for free” once they’re enabled in Vulkan.

3rd party software cleanup
Over time we have been working on making more and more 3rd party software easy to consume in Fedora Workstation. The problem we discovered, though, was that because this had been done over time, with changing requirements and expectations, the functionality did not behave in a very intuitive way, and there were also new questions that needed to be answered. So Allan Day and Owen Taylor spent some time this cycle reviewing all the bits and pieces of this functionality and worked to clean it up. The goal is that when you enable third-party repositories in Fedora Workstation 35 it behaves in a much more predictable and understandable way, and also includes a lot of applications from Flathub. Yes, that is correct: you should be able to install a lot of applications from Flathub in Fedora Workstation 35 without having to first visit the Flathub website to enable it; instead they will show up once you have turned the knob for general 3rd party application support.

Power profiles
Another item we spent quite a bit of time on for Fedora Workstation 35 is making sure we integrate the Power Profiles work that Bastien Nocera has been doing as part of our collaboration with Lenovo. Power Profiles is basically a feature that allows your system to behave in a smarter way when it comes to power consumption and thus prolongs your battery life. So for instance when we notice you are getting low on battery we can offer to switch you into a strong power-saving mode to prolong how long you can use the system until you can recharge. There is a more in-depth explanation of Power Profiles in the official README.

Wayland
I usually end up talking about Wayland in these posts too, but I expect to do that less going forward, as we have now covered all the major gaps we saw between Wayland and X.org. Jonas Ådahl got the headless support merged, which was one of our big missing pieces, and as mentioned above Olivier Fourdan, Jonas and others worked with NVidia on getting the binary driver with XWayland support working with GNOME Shell. Of course, this being software, we are never truly done; there will be new issues discovered, random bugs that need to be fixed, and new features that need to be implemented. We already have our next big team focus in place, HDR support, which will need work from the graphics drivers, up through Mesa, into the window manager and the GUI toolkits, and in the applications themselves. We have been investigating and trying out some things for a while already, but we are now ready to make this a main focus for the team. In fact we will soon be posting a new job listing for a full-time engineer to work on HDR vertically through the stack, so keep an eye out for that if you are interested in working on this. The job will be open to candidates who wish to work remotely, so as long as Red Hat has a business presence in the country you live in, we should be able to offer you the job if you are the right candidate for us. Update: the job listing is now online for our HDR engineer.

BTW, if you want to see future updates and keep on top of other happenings from Fedora and Red Hat in the desktop space, make sure to follow me on twitter.

September 13, 2021

Zink Is Over: This Time I’m Serious.

Look.

I know what you’re gonna say, and maybe I did just say zink was done a week or two ago.

I’m not saying I didn’t.

But that was practically last year at the speed with which zink’s codebase moves and its developer community sits in my office eating cookies between Mesa builds, and it was also before I set off on my journey to make the rest of those zany Phoronix benchmark games run instead of crashing or whatever.

What do we got left on that list anyway?

Metro: Last Light Redux

Oh you want some Metro? We got Metro at home.

metro.png

HITMAN

hitman.png

Agent 47, I’m gonna pretend I didn’t see that. Pull yourself together.

Basemark: High Settings

basemark.png

It’s uh… Mangohud’s slowing me down.

Bioshock Infinite

I bet you’re wondering where this one is, huh.

Warhammer 40,000: Dawn of War

dow3-fail.png

Easy as that, ju—Wait, what?

This game requires ARB_bindless_texture just to run? Is this a joke? Even fucking DOOM 2016, the final boss of OpenGL, doesn’t require bindless textures.

Fine.

Totally fine.

Not at all a problem, and I’m sure it’ll be easy to do.

Definitely no reason why only two Mesa drivers total implement it other than it being some trivial switch that everyone forgot to flip, right?

Probably just a config value here, or maybe a couple lines of code there…

Ignore all the validation errors because descriptor indexing isn’t accurately supported…

Add some null checks…

Fire up ASAN to fix a random stack explosion

File a piglit ticket because two of the eighty unit tests for the extension are bugged and these are quite literally the only unit tests available…

dow3.png

Kapow, first try, expect it in zink-wip later today-ish.

It’s just that easy.

If you disagree, you are nitpicking and biased.

September 01, 2021

I've been working on portals recently and one of the issues for me was that the documentation just didn't quite hit the sweet spot. At least the bits I found were either too high-level or too implementation-specific. So here's a set of notes on how a portal works, in the hope that this is actually correct.

First, Portals are supposed to be a way for sandboxed applications (flatpaks) to trigger functionality they don't have direct access to. The prime example: opening a file without the application having access to $HOME. This is done by the applications talking to portals instead of implementing the functionality themselves.

There is really only one portal process: /usr/libexec/xdg-desktop-portal, started as a systemd user service. That process owns a DBus bus name (org.freedesktop.portal.Desktop) and an object on that name (/org/freedesktop/portal/desktop). You can see that bus name and object with D-Feet, from DBus' POV there's nothing special about it. What makes it the portal is simply that the application running inside the sandbox can talk to that DBus name and thus call the various methods. Obviously the xdg-desktop-portal needs to run outside the sandbox to do its things.

There are multiple portal interfaces, all available on that one object. Those interfaces have names like org.freedesktop.portal.FileChooser (to open/save files). The xdg-desktop-portal implements those interfaces and thus handles any method calls on them. So where an application is sandboxed, it doesn't implement the functionality itself; instead it calls e.g. the OpenFile() method on the org.freedesktop.portal.FileChooser interface. It then gets an fd back and can read the content of that file without needing full access to the file system.

Some interfaces are fully handled within xdg-desktop-portal. For example, the Camera portal checks a few things internally and pops up a dialog for the user to confirm access if needed [1], but otherwise nothing else is involved in this specific method call.

Other interfaces have a backend "implementation" DBus interface. For example, the org.freedesktop.portal.FileChooser interface has an org.freedesktop.impl.portal.FileChooser (notice the "impl") counterpart. xdg-desktop-portal does not implement those impl.portals; instead it routes the DBus calls to the respective "impl.portal". Your sandboxed application calls OpenFile(), and xdg-desktop-portal then calls OpenFile() on org.freedesktop.impl.portal.FileChooser. That interface returns a value, xdg-desktop-portal extracts it and returns it back to the application in response to the original OpenFile() call.

What provides those impl.portals doesn't matter to xdg-desktop-portal, and this is where things are hot-swappable [2]. There are GTK- and Qt-specific portals provided by xdg-desktop-portal-gtk and xdg-desktop-portal-kde, and another one is provided by GNOME Shell directly. You can check the files in /usr/share/xdg-desktop-portal/portals/ to see which impl.portal is provided on which bus name. The reason those impl.portals exist is so they can be native to the desktop environment - regardless of what application you're running, and with a generic xdg-desktop-portal, you see the native file chooser dialog for your desktop environment.

So the full call sequence is:

  • At startup, xdg-desktop-portal parses the /usr/share/xdg-desktop-portal/portals/*.portal files to know which impl.portal interface is provided on which bus name
  • The application calls OpenFile() on the org.freedesktop.portal.FileChooser interface on the object path /org/freedesktop/portal/desktop. It can do so because the bus name this object sits on is not restricted by the sandbox
  • xdg-desktop-portal receives that call. This is a portal with an impl.portal backend, so xdg-desktop-portal calls OpenFile() on the bus name that provides the org.freedesktop.impl.portal.FileChooser interface (as previously established by reading the *.portal files)
  • Assuming xdg-desktop-portal-gtk provides that portal at the moment, that process now pops up a GTK FileChooser dialog that runs outside the sandbox. User selects a file
  • xdg-desktop-portal-gtk sends back the fd for the file to the xdg-desktop-portal, and the impl.portal parts are done
  • xdg-desktop-portal receives that fd and sends it back as reply to the OpenFile() method in the normal portal
  • The application receives the fd and can read the file now
A few details here aren't fully correct, but it's correct enough to understand the sequence - the exact details depend on the method call anyway.

Finally: because of DBus restrictions, the various methods in the portal interfaces don't just reply with values. Instead, the xdg-desktop-portal creates a new org.freedesktop.portal.Request object and returns the object path for that. Once that's done the method is complete from DBus' POV. When the actual return value arrives (e.g. the fd), that value is passed via a signal on that Request object, which is then destroyed. This roundabout way is done for purely technical reasons, regular DBus methods would time out while the user picks a file path.
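
To make that sequence a bit more concrete, here is a minimal, untested sketch (my own illustration, not code from the portal documentation) of an OpenFile() call with GDBus. It glosses over the race between making the call and subscribing to the signal; a real client passes a handle_token option and computes the Request object path up front, as described in the portal docs.

#include <gio/gio.h>

/* Called when the Request object emits its Response signal with the
 * actual result of the OpenFile() call. */
static void
on_response (GDBusConnection *conn, const char *sender, const char *path,
             const char *iface, const char *signal, GVariant *params,
             gpointer user_data)
{
  guint32 response;
  g_variant_get (params, "(u@a{sv})", &response, NULL);
  g_print ("portal response code: %u\n", response);
  g_main_loop_quit (user_data);
}

int
main (void)
{
  GMainLoop *loop = g_main_loop_new (NULL, FALSE);
  GDBusConnection *bus = g_bus_get_sync (G_BUS_TYPE_SESSION, NULL, NULL);

  /* OpenFile(parent_window, title, options) returns a Request object path,
   * not the file descriptor itself. */
  GVariant *ret = g_dbus_connection_call_sync (
      bus, "org.freedesktop.portal.Desktop", "/org/freedesktop/portal/desktop",
      "org.freedesktop.portal.FileChooser", "OpenFile",
      g_variant_new ("(ssa{sv})", "", "Pick a file", NULL),
      G_VARIANT_TYPE ("(o)"), G_DBUS_CALL_FLAGS_NONE, -1, NULL, NULL);

  const char *request_path;
  g_variant_get (ret, "(&o)", &request_path);

  /* The real result arrives as a Response signal on that Request object. */
  g_dbus_connection_signal_subscribe (bus, "org.freedesktop.portal.Desktop",
                                      "org.freedesktop.portal.Request",
                                      "Response", request_path, NULL,
                                      G_DBUS_SIGNAL_FLAGS_NONE,
                                      on_response, loop, NULL);
  g_main_loop_run (loop);
  return 0;
}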

Anyway. Maybe this helps someone understanding how the portal bits fit together.

[1] it does so using another portal but let's ignore that
[2] not really hot-swappable though. You need to restart xdg-desktop-portal but not your host. So luke-warm-swappable only

Edit Sep 01: clarify that it's not GTK/Qt providing the portals, but xdg-desktop-portal-gtk and -kde

August 31, 2021

Let me talk here about how we implemented support for performance counters in the Mesa V3D driver, the OpenGL driver used by the Raspberry Pi 4. For reference, the implementation is very similar to the one already available (not done by me, by the way) for VC4, the OpenGL driver for the Raspberry Pi 3 and prior devices, which is also part of Mesa. If you are already familiar with how this is implemented in VC4, then this will mostly be a refresher.

First of all, what are these performance counters? Most processors nowadays contain some hardware facilities to get measurements about what is happening inside the processor. And of course graphics processors aren't different. In this case, the graphics chips used by Raspberry Pi devices (manufactured by Broadcom) can record a bunch of different graphics-related parameters: how many quads are passing or failing depth/stencil tests, how many clock cycles are spent on doing vertex/fragment shading, hits/misses in the GPU cache, and many other values. In fact, with the V3D driver it is possible to measure around 87 different parameters, and up to 32 of them simultaneously. Quite a few fewer in VC4, though. But still a lot.

On a hardware level, using these counters is just a matter of writing and reading some GPU registers. First, write the registers to select what we want to measure, then a few more to start the measurement, and finally read other registers containing the results. But of course, much like we don't expect users to write GPU assembly code, we don't expect users to write GPU registers directly. Moreover, even Mesa drivers such as V3D can't interact directly with the hardware; this is done through the kernel, which is the one that can use the hardware directly, via its DRM subsystem. In the case of V3D (and the same applies to VC4, and in general to any other driver), we have a driver in user-space (whether the OpenGL driver, V3D, or the Vulkan driver, V3DV) and a kernel driver in kernel-space, unsurprisingly also called V3D. The user-space driver is in charge of translating all the commands and options created with the OpenGL API or other APIs into batches of commands to be executed by the GPU, which are submitted to the kernel driver as DRM jobs. The kernel then does the proper actions to send these to the GPU for execution, including touching the proper registers. Thus, if we want to implement support for the performance counters, we need to modify the code in two places: the kernel and the (user-space) driver.

Implementation in the kernel

Here we need to think about how to deal with the GPU and the registers to make the performance counters work, as well as the API we provide to user-space to use them. As mentioned before, the approach we are following here is the same as the one used in the VC4 driver: performance counter monitors. That is, the user-space driver creates one or more monitors, specifying for each monitor what counters it is interested in (up to 32 simultaneously, the hardware limit). The kernel returns a unique identifier for each monitor, which can be used later to do the measurement, query the results, and finally destroy the monitor when done.

In this case, there isn't an explicit start/stop of the measurement. Rather, every time the driver wants to measure a job, it includes the identifier of the monitor it wants to use for that job, if any. Before submitting a job to the GPU, the kernel checks if the job has a monitor identifier attached. If so, it checks whether the previous job executed by the GPU was also using the same monitor identifier, in which case it doesn't need to do anything other than send the job to the GPU, as the required performance counters are already enabled. If the monitor is different, then it first needs to read the current counter values (through the proper GPU registers), adding them to the current monitor, stop the measurement, configure the counters for the new monitor, start the measurement again, and finally submit the new job to the GPU. In this process, if it turns out there wasn't a monitor under execution before, then it only needs to execute the last steps.
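
To illustrate that switching logic, here is a small compilable sketch with made-up helper names (accumulate_counters() and friends stand in for the real register accesses); the actual implementation lives in the V3D kernel driver and differs in detail.

#include <stdbool.h>

struct perfmon_dev {
    unsigned active_perfmon;    /* 0 == no monitor currently active */
};

/* hypothetical stand-ins for the real hardware/register helpers */
static void accumulate_counters(unsigned perfmon) { (void)perfmon; /* read HW counters into this monitor */ }
static void stop_counters(void)                   { /* stop the HW counters */ }
static void program_counters(unsigned perfmon)    { (void)perfmon; /* select counters for this monitor */ }
static void start_counters(void)                  { /* start the HW counters */ }
static void submit_to_gpu(void)                   { /* hand the job to the GPU */ }

static void submit_job(struct perfmon_dev *dev, unsigned job_perfmon)
{
    if (job_perfmon != dev->active_perfmon) {
        /* a different monitor (or none) was active: flush its values first */
        if (dev->active_perfmon) {
            accumulate_counters(dev->active_perfmon);
            stop_counters();
        }
        /* then configure and start counting for the new monitor, if any */
        if (job_perfmon) {
            program_counters(job_perfmon);
            start_counters();
        }
        dev->active_perfmon = job_perfmon;
    }
    /* same monitor as the previous job: nothing to do, just submit */
    submit_to_gpu();
}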

The reason to do all this is that multiple applications can be executing at the same time, some using (different) performance counters, and most of them probably not using performance counters at all. But the performance counter values of one application shouldn’t affect any other application so we need to make sure we don’t mix up the counters between applications. Keeping the values in their respective monitors helps to accomplish this. There is still a small requirement in the user-space driver to help with accomplishing this, but in general, this is how we avoid the mixing.

If you want to take a look at the full implementation, it is available in a single commit.

Implementation in the driver

Once we have a way to create and manage the monitors, using them in the driver is quite easy: as mentioned before, we only need to create a monitor with the counters we are interested in and attach it to the job to be submitted to the kernel. In order to make things easier, we keep a mirror-like version of the monitor inside the driver.

This approach is adequate when you are developing the driver and can add code directly to it to check performance. But what about the final user, who is writing an OpenGL application and wants to check how to improve its performance, or find any bottlenecks in it? We want the user to have a way to do this through OpenGL.

Fortunately, there is in fact a way to do this through OpenGL: the GL_AMD_performance_monitor extension. This OpenGL extension provides an API to query what counters the hardware supports, to create monitors, to start and stop them, and to retrieve the values. It looks very similar to what we have described so far, except for an important difference: the user needs to start and stop the monitors explicitly. We will explain later why this is necessary. But the key point here is that when we start a monitor, this means that from that moment on, until stopping it, any job created and submitted to the kernel will have the identifier of that monitor attached. This implies that only one monitor can be enabled in the application at the same time. But this isn’t a problem, as this restriction is part of the extension.
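
From the application side, using the extension looks roughly like the sketch below. This is my own minimal illustration rather than code from the driver: the entry points are the real GL_AMD_performance_monitor functions, but error handling, counter enumeration and result parsing are heavily trimmed, and in a real program the function pointers are obtained via eglGetProcAddress()/glXGetProcAddress().

#define GL_GLEXT_PROTOTYPES
#include <GL/gl.h>
#include <GL/glext.h>

void sample_one_frame(void)
{
    GLuint group, counter, monitor;
    GLint num_groups, num_counters, max_active;

    /* ask the driver what it exposes (only the first group/counter here) */
    glGetPerfMonitorGroupsAMD(&num_groups, 1, &group);
    glGetPerfMonitorCountersAMD(group, &num_counters, &max_active, 1, &counter);

    /* create a monitor and select the counters we care about */
    glGenPerfMonitorsAMD(1, &monitor);
    glSelectPerfMonitorCountersAMD(monitor, GL_TRUE, group, 1, &counter);

    /* explicit start/stop: everything submitted in between is measured */
    glBeginPerfMonitorAMD(monitor);
    /* ... draw calls ... */
    glEndPerfMonitorAMD(monitor);

    /* poll until the pending jobs have actually executed on the GPU */
    GLuint available = 0;
    while (!available)
        glGetPerfMonitorCounterDataAMD(monitor, GL_PERFMON_RESULT_AVAILABLE_AMD,
                                       sizeof(available), &available, NULL);

    /* results come back as (group, counter, value) triples */
    GLuint result[16];
    glGetPerfMonitorCounterDataAMD(monitor, GL_PERFMON_RESULT_AMD,
                                   sizeof(result), result, NULL);

    glDeletePerfMonitorsAMD(1, &monitor);
}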

Our driver does not implement this API directly, but through "queries", which are then used by the Gallium subsystem in Mesa to implement the extension. For reference, the V3D driver (as well as VC4) is implemented as part of the Gallium subsystem. The Gallium part basically handles all the hardware-independent OpenGL functionality and just requires the driver to implement a set of hook functions. If the driver implements the proper functions, then Gallium exposes the right extension (in this case, the GL_AMD_performance_monitor extension).

For our case, it requires the driver to implement functions to return which counters are available, to create or destroy a query (in this case, the query is the same as the monitor), start and stop the query, and once it is finished, to get the results back.

At this point, I would like to explain a bit better what it implies to stop the monitor and get the results back. As explained earlier, stopping the monitor or query means that from that moment on, any new job submitted to the kernel (and thus to the GPU) won't have a performance monitor identifier attached, and hence won't be measured. But it is important to know that the driver submits jobs to the kernel at its own pace, and these aren't executed immediately; the GPU needs time to execute the jobs, so the kernel puts the arriving jobs in a queue to be submitted to the GPU. This means that when the user stops the monitor, there could still be jobs in the queue that haven't been executed yet and are thus pending to be measured.

And how do we know that the jobs have been executed by the GPU? The hook function that implements getting the query results has a "wait" parameter, which tells whether the function needs to wait for all the pending jobs to be measured to finish executing. If it doesn't wait but there are pending jobs, it just returns, telling the caller this fact. This allows the caller to do other work in the meantime and query again later, instead of blocking until all the jobs have been executed. This is implemented through sync objects. Every time a job is sent to the kernel, there's a sync object that is used to signal when the job has finished executing. This is mainly used as a way to synchronize jobs. In our case, when the user finalizes the query we save the sync object of the last submitted job, and we use it to know when this last job has been executed.
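
A rough sketch of how that "wait" parameter behaves, with hypothetical helper names standing in for the real sync-object and kernel plumbing (this is not the actual V3D code, just the shape of the logic described above):

#include <stdbool.h>

struct perf_query {
    void *last_job_sync;               /* sync object of the last submitted job */
    unsigned long long values[32];     /* accumulated counter values */
};

/* hypothetical stand-ins for the real sync-object wait and kernel query helpers */
static bool sync_wait(void *sync, bool block) { (void)sync; return block; }
static void read_monitor_values(struct perf_query *q) { (void)q; }

static bool get_query_result(struct perf_query *q, bool wait,
                             unsigned long long *out, unsigned n)
{
    /* If the caller doesn't want to block and the last job using this
     * monitor hasn't executed yet, just report "not ready yet". */
    if (!sync_wait(q->last_job_sync, wait))
        return false;

    /* All jobs attached to this monitor have executed: fetch the
     * accumulated counter values from the kernel. */
    read_monitor_values(q);
    for (unsigned i = 0; i < n; i++)
        out[i] = q->values[i];
    return true;
}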

There are quite a few details I’m not covering here. If you are interested though, you can take a look at the merge request.

Gallium HUD

So far we have seen how the performance counters are implemented and how to use them. In all cases it requires writing code to create the monitor/query, start/stop it, and query back the results, either in the driver itself or in the application through the GL_AMD_performance_monitor extension [1].

But what if we want to get some general measurements without adding code to the application or the driver? Fortunately, there is an environment variable, GALLIUM_HUD, that, when set correctly, will show some graphs on top of the application with the measured counters.

Using it is very easy; set it to "help" to learn how to use it, as well as to get a list of the available counters for the current hardware.

As an example:

$ env GALLIUM_HUD=L2T-CLE-reads,TLB-quads-passing-z-and-stencil-test,QPU-total-active-clk-cycles-vertex-coord-shading scorched3d

You will see:

Performance Counters in Scorched 3D

Bear in mind that to be able to use this you will need a kernel that supports performance counters for V3D. At the moment of writing this, no kernel has been released yet with this support. If you don't want to wait for it, you can download the patch, apply it to your Raspberry Pi kernel (it has been tested on the 5.12 branch), and build and install it.

  [1] All this is for the case of using OpenGL; if your application uses Vulkan, there are other similar extensions, which are not yet implemented in our V3DV driver at the moment of writing this post.

Gut Ding braucht Weile (good things take time). Almost three years ago, we added high-resolution wheel scrolling to the kernel (v5.0). The desktop stack however was first lagging and eventually left behind (except for an update a year ago or so, see here). However, I'm happy to announce that thanks to José Expósito's efforts, we have now pushed it across the line. So - in a socially distanced manner and masked up to your eyebrows - gather round children, for it is storytime.

Historical History

In the beginning, there was the wheel detent. Or rather there were 24 of them, dividing a 360 degree [1] rotation of a wheel into a neat set of 15 degree clicks. libinput exposed those wheel clicks as part of the "pointer axis" namespace and you could get the click count with libinput_event_pointer_get_axis_discrete() (announced here). The degree value is exposed via libinput_event_pointer_get_axis_value(). Other scroll backends (finger-scrolling or button-based scrolling) expose the pixel-precise value via that same function.

In a "recent" Microsoft Windows version (Vista!), MS added the ability for wheels to trigger more than 24 clicks per rotation. The MS Windows API now treats one "traditional" wheel click as a value of 120, anything finer-grained will be a fraction thereof. You may have a mouse that triggers quarter-wheel clicks, each sending a value of 30. This makes for smoother scrolling and is supported(-ish) by a lot of mice introduced in the last 10 years [2]. Obviously, three small scrolls are nicer than one large scroll, so the UX is less bad than before.

Now it's time for libinput to catch up with Windows Vista! For $reasons, the existing pointer axis API could not be changed to accommodate the high-res values, so a new API was added for scroll events. Read on for the details, you will believe what happens next.

Out with the old, in with the new

As of libinput 1.19, libinput has three new events: LIBINPUT_EVENT_POINTER_SCROLL_WHEEL, LIBINPUT_EVENT_POINTER_SCROLL_FINGER, and LIBINPUT_EVENT_POINTER_SCROLL_CONTINUOUS. These events reflect, perhaps unsurprisingly, scroll movements of a wheel, a finger, or along a continuous axis (e.g. button scrolling). And they replace the old event LIBINPUT_EVENT_POINTER_AXIS. Those familiar with libinput will notice that the new event names now encode the scroll source in the event name. This makes them slightly more flexible and saves callers an extra call.

In terms of actual API, the new events come with two new functions. The first is libinput_event_pointer_get_scroll_value(). For the FINGER and CONTINUOUS events, the value returned is in "pixels" [3]. For the new WHEEL events, the value is in degrees. IOW this is a drop-in replacement for the old libinput_event_pointer_get_axis_value() function. The second call is libinput_event_pointer_get_scroll_value_v120() which, for WHEEL events, also returns the 120-based logical units the kernel uses. libinput_event_pointer_has_axis() returns true if the given axis has a value, just as before. With those three calls you now get the data for the new events.
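
As a hedged illustration (my own sketch, not taken from the libinput docs), handling the new events could look like this; the old LIBINPUT_EVENT_POINTER_AXIS case is simply ignored once you handle the new ones, as covered in the next section:

#include <libinput.h>
#include <stdio.h>

static void
handle_scroll_event(struct libinput_event *event)
{
    struct libinput_event_pointer *p;

    switch (libinput_event_get_type(event)) {
    case LIBINPUT_EVENT_POINTER_SCROLL_WHEEL:
        p = libinput_event_get_pointer_event(event);
        if (libinput_event_pointer_has_axis(p, LIBINPUT_POINTER_AXIS_SCROLL_VERTICAL)) {
            /* degrees, the drop-in replacement for the old axis value ... */
            double degrees = libinput_event_pointer_get_scroll_value(p,
                                 LIBINPUT_POINTER_AXIS_SCROLL_VERTICAL);
            /* ... and the 120-based logical units, 120 == one wheel detent */
            double v120 = libinput_event_pointer_get_scroll_value_v120(p,
                                 LIBINPUT_POINTER_AXIS_SCROLL_VERTICAL);
            printf("wheel: %.2f degrees, %.0f (v120)\n", degrees, v120);
        }
        break;
    case LIBINPUT_EVENT_POINTER_SCROLL_FINGER:
    case LIBINPUT_EVENT_POINTER_SCROLL_CONTINUOUS:
        p = libinput_event_get_pointer_event(event);
        if (libinput_event_pointer_has_axis(p, LIBINPUT_POINTER_AXIS_SCROLL_VERTICAL)) {
            double pixels = libinput_event_pointer_get_scroll_value(p,
                                LIBINPUT_POINTER_AXIS_SCROLL_VERTICAL);
            printf("scroll: %.2f normalized pixels\n", pixels);
        }
        break;
    case LIBINPUT_EVENT_POINTER_AXIS:
        /* old-style event, ignore it if you handle the new ones */
        break;
    default:
        break;
    }
}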

Backwards compatibility

To ensure backwards compatibility, libinput generates both old and new events so the rule for callers is: if you want to support the new events, just ignore the old ones completely. libinput also guarantees new events even on pre-5.0 kernels. This makes the old and new code easy to ifdef out, and once you get past the immediate event handling the code paths are virtually identical.

When, oh when?

These changes have been merged into the libinput main branch and will be part of libinput 1.19. Which is due to be released over the next month or so, so feel free to work backwards from that for your favourite distribution.

Having said that, libinput is merely the lowest block in the Jenga tower that is the desktop stack. José linked to the various MRs in the upstream libinput MR, so if you're on your seat's edge waiting for e.g. GTK to get this, well, there's an MR for that.

[1] That's degrees of an angle, not Fahrenheit
[2] As usual, on a significant number of those you'll need to know whatever proprietary protocol the vendor deemed to be important IP. Older MS mice stand out here because they use straight HID.
[3] libinput doesn't really have a concept of pixels, but it has a normalized pixel that movements are defined as. Most callers take that as real pixels except for the high-resolution displays where it's appropriately scaled.

Zink Is Over

A while ago I blogged about finishing up ES 3.2. Then I didn’t mention it again because…well, I suppose I’ve only blogged four times since then, but I’m going to pretend this was part of my master plan to make everyone forget so I could build hype again.

The hype is here: Zink can now* run ES 3.2 apps.

  • “now” is a variable unit of time subject to CI not trying to drown itself at the nearest pub, literally melt itself to slag, or hurl itself off a cliff the instant daniels takes his eyes off it

What Does This Mean For The Future?

I know I’ve said this a few times previously, and we all had a good laugh, but this time I mean it.

Zink is done.

The final boss has been beaten, there’s no more versions to support, no extensions left on my todo list, definitely no bugs remaining, and performance can’t possibly improve further.

If you think you’ve found a zink bug, report it to whoever wrote the test or app you’re running, because the only thing I plan on doing for the rest of 2021 is playing Cyberpunk 2077 on Lavapipe.

Right after it finishes loading.

August 30, 2021

The Struggle Continues

Everyone's seen the Phoronix benchmark numbers by now, and though there's a lot of confusion over how to calculate the percentage increase between "game did not run a year ago" and "game runs", it seems like a couple people out there at Big Triangle are starting to take us seriously.

With that said, even my parents are asking me what the deal is with this one result in particular:

ohno.png

Performance isn’t supposed to go down. Everyone knows this. The version numbers go up and so does the performance as long as it’s not Javascript-based.

Enraged, I sprinted to my computer and searched for tesseract game, which gave me the entirely wrong result, but I eventually did manage to find the right one. I fired up zink-wip, certain that this would end up being some bug I’d already fixed.

Unfortunately, this was not the case.

tesseract-bad.png

I vowed not to sleep, rebase, leave my office, or even run another application until this was resolved, so you can imagine how pleased I am to be writing this post after spending way too much time getting to the bottom of everything.

Speculation Interlude

Full disclosure: I didn’t actually check why performance went down. I’m pretty sure it’s just the result of having improved buffer mapping to be better in most cases, which ended up hurting this case.

But Why

…is the performance so bad?

A quick profile revealed that this was down to a Gallium component called vbuf, used for translating the vertex buffers and attributes specified by the application into ones that drivers can actually support. The component itself is fine; the problem is that, ideally, it's not something you ever want to be hitting when you care about performance.

Consider the usual sequence of drawing a frame:

  • generate and upload vertex data
  • bind some descriptors
  • maybe throw in a query or two if you need some spice
  • draw
  • repeat until frame is done

This is all great and normal, but what would happen—just hypothetically of course—if instead it looked like this:

  • generate and upload vertex data
  • stall and read vertex data
  • rewrite vertex data in another format and reupload
  • bind some descriptors
  • maybe throw in a query or two if you need some spice
  • draw
  • repeat until frame is done

Suddenly the driver is now stalling multiple times per frame on top of doing lots of CPU work!

Incidentally, this is (almost certainly) why performance appeared to have regressed: the vertex buffer is now device-local and can’t be mapped directly, so it has to be copied to a new buffer before it can be read, which is even slower.

Just AMD Problems

DISCLAIMER: We’re going deep into meme territory now, so let’s all dial down the seriousness about a thousand notches before posting about how much I hate AMD or whatever.

vertexattribmeme.png

Unlike cool hardware, AMD opts to not support features which might be less performant. I assume this is in the hopes that developers will Make The Right Choice and not use those features, but obviously developers are gonna develop, and so it is that Tesseract-The-Game-But-Not-The-One-On-Steam uses 3-component vertex attributes that aren’t supported by AMD hardware, necessitating the use of vbuf to translate them to 4-component attributes that can be safely used.

Decomposition

The vertex buffer format at work here was R8G8B8_SNORM, which is a perfectly cromulent format as long as you hate yourself. A shader would read this as a vec4, which, by the power of buffer robustness, gets translated to vec4(x, y, z, 1.0) because the w component is missing.

The approach I took to solving this was to decompose the vertex attribute into three separate R8_SNORM attributes, as this single-component format is wimpy enough for AMD to handle. Thus, a vertex input state containing three separate attributes including this one would now contain five, as the original R8G8B8_SNORM one is split into three, each reading a single component at an offset to simulate the original attribute.

The tricky part to this is that it requires a vertex shader prolog and variant in order to successfully split the shader’s input in such a way that the read value is the same. It also requires a NIR pass. Let’s check out the NIR pass since this blog has gone for way too long without seeing any real work:

struct decompose_state {
  nir_variable **split;
  bool needs_w;
};

static bool
decompose_attribs(nir_shader *nir, uint32_t decomposed_attrs, uint32_t decomposed_attrs_without_w)
{
   uint32_t bits = 0;
   nir_foreach_variable_with_modes(var, nir, nir_var_shader_in)
      bits |= BITFIELD_BIT(var->data.driver_location);
   bits = ~bits;
   u_foreach_bit(location, decomposed_attrs | decomposed_attrs_without_w) {
      nir_variable *split[5];
      struct decompose_state state;
      state.split = split;
      nir_variable *var = nir_find_variable_with_driver_location(nir, nir_var_shader_in, location);
      assert(var);
      split[0] = var;
      bits |= BITFIELD_BIT(var->data.driver_location);
      const struct glsl_type *new_type = glsl_type_is_scalar(var->type) ? var->type : glsl_get_array_element(var->type);
      unsigned num_components = glsl_get_vector_elements(var->type);
      state.needs_w = (decomposed_attrs_without_w & BITFIELD_BIT(location)) != 0 && num_components == 4;
      for (unsigned i = 0; i < (state.needs_w ? num_components - 1 : num_components); i++) {
         split[i+1] = nir_variable_clone(var, nir);
         split[i+1]->name = ralloc_asprintf(nir, "%s_split%u", var->name, i);
         if (decomposed_attrs_without_w & BITFIELD_BIT(location))
            split[i+1]->type = !i && num_components == 4 ? var->type : new_type;
         else
            split[i+1]->type = new_type;
         split[i+1]->data.driver_location = ffs(bits) - 1;
         bits &= ~BITFIELD_BIT(split[i+1]->data.driver_location);
         nir_shader_add_variable(nir, split[i+1]);
      }
      var->data.mode = nir_var_shader_temp;
      nir_shader_instructions_pass(nir, lower_attrib, nir_metadata_dominance, &state);
   }
   nir_fixup_deref_modes(nir);
   NIR_PASS_V(nir, nir_remove_dead_variables, nir_var_shader_temp, NULL);
   optimize_nir(nir);
   return true;
}

First, the base of the pass; two masks are provided, one for attributes that are being fully split (i.e., four components) and one for attributes that have fewer than four components and thus need to have a w component added, as in the Tesseract case. Each variable in the mask is split into four, with slightly different behavior for the ones needing a w and the ones that don’t.

The new variables are all given new driver locations matching the ones given to the split attributes for the vertex input pipeline state, and the decompose_state is passed along to the per-instruction part of the pass:

static bool
lower_attrib(nir_builder *b, nir_instr *instr, void *data)
{
   struct decompose_state *state = data;
   nir_variable **split = state->split;
   if (instr->type != nir_instr_type_intrinsic)
      return false;
   nir_intrinsic_instr *intr = nir_instr_as_intrinsic(instr);
   if (intr->intrinsic != nir_intrinsic_load_deref)
      return false;
   nir_deref_instr *deref = nir_src_as_deref(intr->src[0]);
   nir_variable *var = nir_deref_instr_get_variable(deref);
   if (var != split[0])
      return false;
   unsigned num_components = glsl_get_vector_elements(split[0]->type);
   b->cursor = nir_after_instr(instr);
   nir_ssa_def *loads[4];
   for (unsigned i = 0; i < (state->needs_w ? num_components - 1 : num_components); i++)
      loads[i] = nir_load_deref(b, nir_build_deref_var(b, split[i+1]));
   if (state->needs_w) {
      loads[3] = nir_channel(b, loads[0], 3);
      loads[0] = nir_channel(b, loads[0], 0);
   }
   nir_ssa_def *new_load = nir_vec(b, loads, num_components);
   nir_ssa_def_rewrite_uses(&intr->dest.ssa, new_load);
   nir_instr_remove_v(instr);
   return true;
}

The existing variable is passed along with the new variable array. Where the original is loaded, instead the new variables are all loaded in sequence and assembled into a vec matching the length of the original one. For attributes needing a w component, the first new variable is loaded as a vec4 so that the w component can be reused naturally. Then the original load instruction is removed, and with it, the original variable and its brokenness.

Immediate Results

Sort of.

tesseract-semifixed.png

The frames were definitely there, but the graphics…

Occlusion Queries

It turns out there’s almost zero coverage for occlusion queries in Vulkan’s CTS. There’s surprisingly little coverage for most query-related things, in fact, which means it wasn’t too surprising when it turned out that there were RADV query bugs at play. What was surprising was how they manifested, but that was about par for anything that reads garbage memory.

A simple one-liner later (just kidding, this fucken thing took like 4 days to find) and, magically, things were happening:

tesseract-fixed.png

We Did It.

A big thanks to Bas Nieuwenhuizen for consulting along the way even despite being so busy getting a RADV raytracing MR up and, as always, preparing his next blog post.

August 25, 2021

A year ago, I first announced libei - a library to support emulated input. After an initial spurt of development, it was left mostly untouched until a few weeks ago. Since then, another flurry of changes has been added, including some initial integration into GNOME's mutter. So, let's see what has changed.

A Recap

First, a short recap of what libei is: it's a transport layer for emulated input events to allow any application to control the pointer, type, etc. But, unlike the XTEST extension in X, libei allows the compositor to be in control of clients, the devices they can emulate, and the input events as well. So it's safer than XTEST but also a lot more flexible. libei already supports touch and smooth scrolling events, something XTest doesn't have or is struggling with.

Terminology refresher: libei is the client library (used by an application wanting to emulate input), EIS is the Emulated Input Server, i.e. the part that typically runs in the compositor.

Server-side Devices

So what has changed recently: first, the whole approach has been flipped on its head - now a libei client connects to the EIS implementation and "binds" to the seats the EIS implementation provides. The EIS implementation then provides input devices to the client. In the simplest case, that's just a relative pointer, but we have capabilities for absolute pointers, keyboards and touch as well. The plan for the future is to add gestures and tablet support too. Possibly joysticks, but I haven't really thought about that in detail yet.

So basically, the initial conversation with an EIS implementation goes like this:

  • Client: Hello, I am $NAME
  • Server: Hello, I have "seat0" and "seat1"
  • Client: Bind to "seat0" for pointer, keyboard and touch
  • Server: Here is a pointer device
  • Server: Here is a keyboard device
  • Client: Send relative motion event 10/2 through the pointer device
Notice how the touch device is missing? The capabilities the client binds to are just what the client wants, the server doesn't need to actually give the client a device for that capability.

One of the design choices for libei is that devices are effectively static. If something changes on the EIS side, the device is removed and a new device is created with the new data. This applies for example to regions and keymaps (see below), so libei clients need to be able to re-create their internal states whenever the screen or the keymap changes.

Device Regions

Devices can now have regions attached to them, also provided by the EIS implementation. These regions define areas reachable by the device and are required for clients such as Barrier. On a dual-monitor setup you may have one device with two regions or two devices with one region (representing one monitor), it depends on the EIS implementation. But either way, as libei client you will know that there is an area and you will know how to reach any given pixel on that area. Since the EIS implementation decides the regions, it's possible to have areas that are unreachable by emulated input (though I'm struggling a bit for a real-world use-case).

So basically, the conversation with an EIS implementation goes like this:

  • Client: Hello, I am $NAME
  • Server: Hello, I have "seat0" and "seat1"
  • Client: Bind to "seat0" for absolute pointer
  • Server: Here is an abs pointer device with regions 1920x1080@0,0, 1080x1920@1920,0
  • Server: Here is an abs pointer device with regions 1920x1080@0,0
  • Server: Here is an abs pointer device with regions 1080x1920@1920,0
  • Client: Send abs position 100/100 through the second device
Notice how we have three absolute devices? A client emulating a tablet that is mapped to a screen could just use the third device. As with everything, the server decides what devices are created and the clients have to figure out what they want to do and how to do it.

Perhaps unsurprisingly, the use of regions make libei clients windowing-system independent. The Barrier EI support WIP no longer has any Wayland-specific code in it. In theory, we could implement EIS in the X server and libei clients would work against that unmodified.

Keymap handling

The keymap handling has been changed so the keymap too is provided by the EIS implementation now, effectively in the same way as the Wayland compositor provides the keymap to Wayland clients. This means a client knows what keycodes to send, it can handle the state to keep track of things, etc. Using Barrier as an example again - if you want to generate an "a", you need to look up the keymap to figure out which keycode generates an A, then you can send that through libei to actually press the key.

Admittedly, this is quite messy. XKB (and specifically libxkbcommon) does not make it easy to go from a keysym to a keycode. The existing Barrier X code is full of corner cases with XKB already; I expect those to be necessary for the EI support as well.
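
For illustration only, a naive keysym-to-keycode lookup with libxkbcommon could look like the sketch below; real code (Barrier included) additionally has to deal with shift levels, modifiers and groups, which is exactly where those corner cases live.

#include <xkbcommon/xkbcommon.h>

static xkb_keycode_t
find_keycode_for_keysym(struct xkb_keymap *keymap, struct xkb_state *state,
                        xkb_keysym_t wanted)
{
    xkb_keycode_t min = xkb_keymap_min_keycode(keymap);
    xkb_keycode_t max = xkb_keymap_max_keycode(keymap);

    for (xkb_keycode_t kc = min; kc <= max; kc++) {
        /* which keysym does this keycode produce in the current state? */
        if (xkb_state_key_get_one_sym(state, kc) == wanted)
            return kc;
    }
    return XKB_KEYCODE_INVALID;
}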

Scrolling

Scroll events have four types: pixel-based scrolling, discrete scrolling, and scroll stop/cancel events. The first should be obvious, discrete scrolling is for mouse wheels. It uses the same 120-based API that Windows (and the kernel) use, so it's compatible with high-resolution wheel mice. The scroll stop event notifies an EIS implementation that the scroll interaction has stopped (e.g. lifting fingers off) which in turn may start kinetic scrolling - just like the libinput/Wayland scroll stop events. The scroll cancel event notifies the EIS implementation that scrolling really has stopped and no kinetic scrolling should be triggered. There's no equivalent in libinput/Wayland for this yet but it helps to get the hook in place.

Emulation "Transactions"

This has fairly little functional effect, but interactions with an EIS server are now sandwiched in a start/stop emulating pair. While this doesn't matter for one-shot tools like xdotool, it does matter for things like Barrier which can send the start emulating event when the pointer enters the local window. This again allows the EIS implementation to provide some visual feedback to the user. To correct the example from above, the sequence is actually:

  • ...
  • Server: Here is a pointer device
  • Client: Start emulating
  • Client: Send relative motion event 10/2 through the pointer device
  • Client: Send relative motion event 1/4 through the pointer device
  • Client: Stop emulating

Properties

Finally, there is now a generic property API, something copied from PipeWire. Properties are simple key/value string pairs and cover those things that aren't in the immediate API. One example here: the portal can set things like "ei.application.appid" to the Flatpak's appid. Properties can be locked down, and only libei itself can set properties before the initial connection. This makes them reliable enough for the EIS implementation to make decisions based on their values. Just like with PipeWire, the list of useful properties will grow over time. It's too early to tell what is really needed.

Repositories

Now, for the actual demo bits: I've added enough support to Barrier, XWayland, Mutter and GNOME Shell that I can control a GNOME on Wayland session through Barrier (note: the controlling host still needs to run X since we don't have the ability to capture input events under Wayland yet). The keymap handling in Barrier is nasty but it's enough to show that it can work.

GNOME Shell has a rudimentary UI, again just to show what works:

The status icon shows ... if libei clients are connected, it changes to !!! while the clients are emulating events. Clients are listed by name and can be disconnected at will. I am not a designer, this is just a PoC to test the hooks.

Note how xdotool is listed in this screenshot: that tool is unmodified, it's the XWayland libei implementation that allows it to work and show up correctly.

The various repositories are in the "wip/ei" branch of:

And of course libei itself.

Where to go from here? The last weeks were driven by rapid development, so there's plenty of test cases to be written to make sure the new code actually works as intended. That's easy enough. Looking at the Flatpak integration is another big ticket item, once the portal details are sorted all the pieces are (at least theoretically) in place. That aside, improving the integrations into the various systems above is obviously what's needed to get this working OOTB on the various distributions. Right now it's all very much in alpha stage and I could use help with all of those (unless you're happy to wait another year or so...). Do ping me if you're interested to work on any of this.

August 23, 2021

Hi all, hope you all are doing fine!

Finally, today part 5 of my Outreachy Saga came out; week 9 was on 7/19/21, so as you can see I'm a little late too ( ;P )... This week had the theme: "Career opportunities / Career Goals"

When I read the Outreachy organizers' email, I had an anxiety crisis: I started to think about what I want to do after the internship and what my career goals are, and I panicked... The Imposter Syndrome hit hard, and it still haunts my thoughts; it has been very challenging (as my therapist says) to work with this feeling of not being good enough to apply for a job opening, or thinking that my resume is worthless...

But week 11 arrived #SPOILERALERT with the theme "Making connections", and by talking to some people I got to know their experiences in companies that work with free software and their contributions to it. I could feel that I am on the right path, that this is the area I want to work in. So let's get back to the topic of today's post!!

What am I looking for?

I'm looking for a job, preferably remote, which can be full or part-time!

But I am also open to other opportunities where I can improve my CV and, preferably, continue working with the Linux kernel. The end of my Outreachy internship is fast approaching (or rather, it's the day after tomorrow (o_o) ), so after August 24th I will be available to work full or part-time. I currently live in Fundão, Portugal, and I am open to remote positions based anywhere in the world, along with the possibility of international relocation (my dream is to live in Canada!!! 😜).

What types of work would you like to contribute to?

I would like to continue working with the Linux Kernel, and I also really like Embedded Systems (I've played a little with Raspberry, Beagle Bone, and Arduino)

What tools or skills do you have that would help you with that work?

You know... I have a lot of difficulty answering this question because I always think I don't have many skills, but I'll do my best!!!

During my Outreachy internship, I created vkms_config_show(), a function which aims to print the data in drm_debugfs_create_files(), and I also started to learn how to debug the code. I have already worked a little with the Coccinelle tool. And as I mentioned earlier, I've already used Raspberry Pi, BeagleBone, and Arduino.

What languages do you speak, and at what school grade level?

Portuguese (Native), English (intermediate)

And reviewing now my CV and Linkedin, I could see that I've already done a lot!

I have experience with remote work with people from different parts of the world. I was the president of DAECOMP (the Academic Directory of Computer Engineering, which is what we call our students' association), where I needed to interact with my classmates to find out which improvements they thought the course needed, and I also needed to interact with the campus management and our professors to get those improvements made to the course and the campus. I organized free software dissemination events. I was also a flute tutor (yes, I study music!!!) and a robotics tutor, working with young people and children.

Well, I'll stop here... To see more about my experience, feel free to visit my Linkedin

Once again thank you for following me so far, my adventure with Outreachy is almost over, every day has been a lot of learning! Please feel free to comment! And stay tuned to the next chapters of this Saga!!!

Take care and have a great day!

August 18, 2021

We Back

Just a quick update today while I dip my toes back into the blogosphere to remind myself that it’s not so scary.

Remember when I blogged about how nice it would be to have a suballocator all those months ago?

Now it’s landed, and it’s nice indeed to have a suballocator.

Remember when everyone wanted GL 4.6 compatibility contexts so they could play Feral ports of their favorite games? zink-wip did that 6 months ago.

What this all means is that it’s more or less open testing season on zink. I’ve already got a sizable number of tickets open for various Steam games based on zink-wip testing, but this is hardly conclusive.

What games work for you?

What games don’t work?

I’m not gonna test them all myself, so get out there and start crashing.

August 11, 2021

Here I’m playing “Spelunky 2” on my laptop and simultaneously replaying the same Vulkan calls on an ARM board with Adreno GPU running the open source Turnip Vulkan driver. Hint: it’s an x64 Windows game that doesn’t run on ARM.

The bottom right is the game I’m playing on my laptop, the top left is GFXReconstruct immediately replaying Vulkan calls from the game on ARM board.

How is it done? And why would it be useful for debugging? Read below!


Debugging issues a driver faces with real-world applications requires the ability to capture and replay graphics API calls. However, for mobile GPUs it becomes even more challenging, since for a Vulkan driver the main "source" of real-world workloads is x86-64 apps that run via Wine + DXVK, mainly games which were made for desktop x86-64 Windows and do not run on ARM. Efforts are being made to run these apps on ARM, but it is still a work in progress. And we want to test the drivers NOW.

The obvious solution would be to run those applications on an x86-64 machine, capturing all Vulkan calls, and then replay those calls on a second machine where we cannot run the app. This way it would be possible to test the driver even without running the application directly on it.

The main trouble is that Vulkan calls made on one GPU + driver combo are not generally compatible with another GPU + driver combo, sometimes not even within one GPU vendor. There are different memory capabilities (VkPhysicalDeviceMemoryProperties), different memory requirements for buffers and images, different extensions available, and different optional features supported. It is easier with OpenGL but there are also some incompatibilities there.
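
A quick way to see this incompatibility for yourself is to dump the memory types on both machines and compare. The snippet below is just my own standalone check (not part of GFXReconstruct) using plain Vulkan calls:

#include <vulkan/vulkan.h>
#include <stdio.h>

int main(void)
{
    /* bare-bones instance, no layers or extensions */
    VkInstanceCreateInfo ici = { .sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO };
    VkInstance instance;
    if (vkCreateInstance(&ici, NULL, &instance) != VK_SUCCESS)
        return 1;

    /* just grab the first physical device */
    uint32_t count = 1;
    VkPhysicalDevice phys;
    vkEnumeratePhysicalDevices(instance, &count, &phys);

    /* these tables differ between GPUs, which is what breaks naive replay */
    VkPhysicalDeviceMemoryProperties props;
    vkGetPhysicalDeviceMemoryProperties(phys, &props);

    for (uint32_t i = 0; i < props.memoryTypeCount; i++)
        printf("type %u: heap %u, flags 0x%x\n", i,
               props.memoryTypes[i].heapIndex,
               props.memoryTypes[i].propertyFlags);

    vkDestroyInstance(instance, NULL);
    return 0;
}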

There are two open-source vendor-agnostic tools for capturing Vulkan calls: RenderDoc (captures single frame) and GFXReconstruct (captures multiple frames). RenderDoc at the moment isn’t suitable for the task of capturing applications on desktop GPUs and replaying on mobile because it doesn’t translate memory type and requirements (see issue #814). GFXReconstruct on the other hand has the necessary features for this.

I’ll show a couple of tricks with GFXReconstruct I’m using to test things on Turnip.


Capturing with GFXReconstruct

At this point you either have the application itself or, if it doesn't use Vulkan, a trace of its calls that could be translated to Vulkan. There are detailed instructions on how to use GFXReconstruct to capture a trace on a desktop OS. However, there is no clear instruction on how to do this on Android (see issue #534); fortunately there is one in Android's documentation:

Android how-to (click me)
For Android 9 you should copy the layers into the application which will be traced
For Android 10+ it's easier to copy them to com.lunarg.gfxreconstruct.replay
You should have a userdebug build of Android, or probably a rooted Android

# Push GFXReconstruct layer to the device
adb push libVkLayer_gfxreconstruct.so /sdcard/

# Since there is no APK for the capture layer,
# copy the layer to e.g. the folder of com.lunarg.gfxreconstruct.replay
adb shell run-as com.lunarg.gfxreconstruct.replay cp /sdcard/libVkLayer_gfxreconstruct.so .

# Enable layers
adb shell settings put global enable_gpu_debug_layers 1

# Specify target application
adb shell settings put global gpu_debug_app <package_name>

# Specify layer list (from top to bottom)
adb shell settings put global gpu_debug_layers VK_LAYER_LUNARG_gfxreconstruct

# Specify packages to search for layers
adb shell settings put global gpu_debug_layer_app com.lunarg.gfxreconstruct.replay

If the target application doesn't have the rights to write to external storage, you should change where the capture file is created:

adb shell "setprop debug.gfxrecon.capture_file '/data/data/<target_app_folder>/files/'"


However, when you try to replay the captured trace on another GPU, it will most likely result in an error:

[gfxrecon] FATAL - API call vkCreateDevice returned error value VK_ERROR_EXTENSION_NOT_PRESENT that does not match the result from the capture file: VK_SUCCESS.  Replay cannot continue.
Replay has encountered a fatal error and cannot continue: the specified extension does not exist

Or other errors/crashes. Fortunately, we can limit the capabilities of the desktop GPU with VK_LAYER_LUNARG_device_simulation.

When simulating another GPU, VK_LAYER_LUNARG_device_simulation should be told to intersect the capabilities of both GPUs, making the capture compatible with both of them. This can be achieved with the recently added environment variables:

VK_DEVSIM_MODIFY_EXTENSION_LIST=whitelist
VK_DEVSIM_MODIFY_FORMAT_LIST=whitelist
VK_DEVSIM_MODIFY_FORMAT_PROPERTIES=whitelist

The whitelist name is rather confusing because it essentially means "intersection".

One would also need a JSON file which describes the target GPU's capabilities; this can be obtained by running:

vulkaninfo -j &> <device_name>.json

The final command to capture a trace would be:

VK_LAYER_PATH=<path/to/device-simulation-layer>:<path/to/gfxreconstruct-layer> \
VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_gfxreconstruct:VK_LAYER_LUNARG_device_simulation \
VK_DEVSIM_FILENAME=<device_name>.json \
VK_DEVSIM_MODIFY_EXTENSION_LIST=whitelist \
VK_DEVSIM_MODIFY_FORMAT_LIST=whitelist \
VK_DEVSIM_MODIFY_FORMAT_PROPERTIES=whitelist \
<the_app>

Replaying with GFXReconstruct

gfxrecon-replay -m rebind --skip-failed-allocations <trace_name>.gfxr
  • -m: enable memory translation for replay on GPUs with memory types that are not compatible with the capture GPU’s memory types
    • rebind: change memory allocation behavior based on resource usage and replay memory properties; resources may be bound to different allocations with different offsets
  • --skip-failed-allocations: skip vkAllocateMemory, vkAllocateCommandBuffers, and vkAllocateDescriptorSets calls that failed during capture

Without these options replay would fail.

Now you can easily test any app/game on your ARM board, if you have enough RAM =) I even successfully ran a capture of “Metro Exodus” on Turnip.

But what if you want to test something that requires interactivity?

Or what if you don’t want to save a huge trace to disk, one which could grow to tens of gigabytes if the application runs for a considerable amount of time?

During recording, GFXReconstruct just appends calls to a file; there are no additional post-processing steps. Given that, the next logical step is to skip writing to disk entirely and send the Vulkan calls over the network!

This would allow us to interact with the application and immediately see the results on another device with a different GPU. And so I hacked together crude support for over-the-network replay.

The only difference from ordinary tracing is that instead of a file we now have to specify a network address of the target device:

VK_LAYER_PATH=<path/to/device-simulation-layer>:<path/to/gfxreconstruct-layer> \
    ...
GFXRECON_CAPTURE_FILE="<ip>:<port>" \
<the_app>

And on the target device:

while true; do gfxrecon-replay -m rebind --sfa ":<port>"; done

Why while true? It is common for DXVK to call vkCreateInstance several times, leading to the creation of several traces. When replaying over the network we therefore want gfxrecon-replay to restart immediately when one trace ends so it is ready for the next.

You may want to bring the FPS down to match the capabilities of the lower-power GPU in order to prevent constant hiccups. This can be done either with libstrangle or with mangohud:

  • stranglevk -f 10
  • MANGOHUD_CONFIG=fps_limit=10 mangohud

You have seen the result at the start of the post.

August 10, 2021

I’ve been silent here for quite some time, so here is a quick summary of some of the new functionality we have been exposing in V3DV, the Vulkan driver for Raspberry Pi 4, over the last few months:

  • VK_KHR_bind_memory2
  • VK_KHR_copy_commands2
  • VK_KHR_dedicated_allocation
  • VK_KHR_descriptor_update_template
  • VK_KHR_device_group
  • VK_KHR_device_group_creation
  • VK_KHR_external_fence
  • VK_KHR_external_fence_capabilities
  • VK_KHR_external_fence_fd
  • VK_KHR_external_semaphore
  • VK_KHR_external_semaphore_capabilities
  • VK_KHR_external_semaphore_fd
  • VK_KHR_get_display_properties2
  • VK_KHR_get_memory_requirements2
  • VK_KHR_get_surface_capabilities2
  • VK_KHR_image_format_list
  • VK_KHR_incremental_present
  • VK_KHR_maintenance2
  • VK_KHR_maintenance3
  • VK_KHR_multiview
  • VK_KHR_relaxed_block_layout
  • VK_KHR_sampler_mirror_clamp_to_edge
  • VK_KHR_storage_buffer_storage_class
  • VK_KHR_uniform_buffer_standard_layout
  • VK_KHR_variable_pointers
  • VK_EXT_custom_border_color
  • VK_EXT_external_memory_dma_buf
  • VK_EXT_index_type_uint8
  • VK_EXT_physical_device_drm

Besides that list of extensions, we have also added basic support for Vulkan subgroups (this is a Vulkan 1.1 feature) and Geometry Shaders (we use this to implement multiview).

I think we now meet most (if not all) of the Vulkan 1.1 mandatory feature requirements, but we still need to check this properly and we also need to start doing Vulkan 1.1 CTS runs and fix test failures. In any case, the bottom line is that Vulkan 1.1 should be fairly close now.

August 05, 2021

Just about a year after the original announcement, I think it's time to see the progress on power-profiles-daemon.

Note that I would still recommend you read the up-to-date project README if you have questions about why this project was necessary, and why a new project was started rather than building on an existing one.

 The project was born out of the need to make a firmware feature available to end-users for a number of lines of Lenovo laptops for them to be fully usable on Fedora. For that, I worked with Mark Pearson from Lenovo, who wrote the initial kernel support for the feature and served as our link to the Lenovo firmware team, and Hans de Goede, who worked on making the kernel interfaces more generic.

More generic, but in a good way

 With the initial kernel support written for (select) Lenovo laptops, Hans implemented a more generic interface called platform_profile. This interface is now the one that power-profiles-daemon will integrate with, and means that it also supports a number of Microsoft Surface, HP, Lenovo's own Ideapad laptops, and maybe Razer laptops soon.

 The next item to make more generic is Lenovo's "lap detection" which still relies on a custom driver interface. This should be soon transformed into a generic proximity sensor, which will mean I get to work some more on iio-sensor-proxy.

Working those interactions

 power-profiles-daemon landed in a number of distributions, sometimes enabled by default, sometimes not enabled by default (sigh, the less said about that the better), which fortunately meant that we had some early feedback available.

 The goal was always to have the user in control, but we still needed to think carefully about how the UI would look and how users would interact with it when a profile was temporarily unavailable, or the system started a "power saver" mode because battery was running out.

 The latter is something that David Redondo's work on the "HoldProfile" API made possible. Software can programmatically switch to the power-saver or performance profile for the duration of a command. This is useful to switch to the Performance profile when running a compilation (eg. powerprofilesctl jhbuild --no-interact build gnome-shell), or for gnome-settings-daemon to set the power-saver profile when low on battery.

 The aforementioned David Redondo and Kai Uwe Broulik also worked on the KDE interface to power-profiles-daemon, as Florian Müllner implemented the gnome-shell equivalent.

Promised by me, delivered by somebody else :)

 I took this opportunity to update the Power panel in Settings, which shows off the temporary switch to the performance mode, and the setting to automatically switch to power-saver when low on battery.

Low-Power, everywhere

 Talking of which, while it's important for the system to know that it's targeting power-saving behaviour, it's also pretty useful for applications to try and behave better.
 
 Maybe you've already integrated with "low memory" events using GLib, but thanks to Patrick Griffis you can be an even better ecosystem citizen and monitor whether the system is in "Power Saver" mode, adjusting your application's behaviour accordingly.
 
 This feature will be available in GLib 2.70 along with documentation of useful steps to take. GNOME Software will already be using this functionality to avoid large automated downloads when energy saving is needed.
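
 As a rough idea of what that integration looks like from an application, here is a minimal sketch in C using GIO's GPowerProfileMonitor as documented for GLib 2.70; the callback and what it does are purely illustrative:

#include <gio/gio.h>

/* Illustrative callback: fires whenever the "power-saver-enabled"
 * property changes, e.g. to pause background downloads. */
static void
on_power_saver_changed (GObject *monitor, GParamSpec *pspec, gpointer data)
{
  gboolean saving =
    g_power_profile_monitor_get_power_saver_enabled (G_POWER_PROFILE_MONITOR (monitor));
  g_message ("power saver %s", saving ? "enabled" : "disabled");
}

int
main (void)
{
  GPowerProfileMonitor *monitor = g_power_profile_monitor_dup_default ();
  g_signal_connect (monitor, "notify::power-saver-enabled",
                    G_CALLBACK (on_power_saver_changed), NULL);

  GMainLoop *loop = g_main_loop_new (NULL, FALSE);
  g_main_loop_run (loop);
  return 0;
}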

Availability

 The majority of the above features are available in the GNOME 41 development branches and should get to your favourite GNOME-friendly distribution for their next release, such as Fedora 35.

August 04, 2021

 I've been chasing a crocus misrendering bug shown in a Qt trace.


The bottom image is crocus, vs 965 on top. This only happened on Gen4->5, so Ironlake and GM45 were my test machines. I burned a lot of time trying to work this out. I trimmed the traces down, dumped a stupendous amount of batchbuffers, turned off UBO push constants, dumped all the index and vertex buffers, and tried some RGBx changes, but nothing jumped out at me, except that the vertex shaders produced were different.

However they were different for many reasons, due to the optimization pipelines the mesa state tracker runs vs the 965 driver. Inputs and UBO loads were in different places so there was a lot of noise in the shaders.

I ported the trace to a piglit GL application so I could hack more easily on the shaders and GL; with that I trimmed it down even further (even if I did burn some time on a misplaced */+ typo).

Using the ported app, I removed all uniform buffer loads and then split the vertex shader in half (it was quite large, but had two chunks). I finally then could spot the difference in the NIR shaders.

What stood out was that the 965 shader had an if which the crocus shader had converted to a bcsel. This is part of a peephole optimization that mesa/st calls, and sure enough removing that call fixed the rendering. But why? It is a valid optimization.

In a parallel thread on another part of the planet, Ian Romanick filed a MR to mesa https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/12191 fixing a bug in the gen4/5 fs backend with conditional selects. This was something he noticed while debugging elsewhere. However his fix was for the fragment shader backend, and my bug was in the vec4 vertex shader backend. I tracked down where the same changes were needed in the vec4 backend and tested a fix on top of his branch, and the misrendering disappeared.

It's a strange coincidence we both started hitting the same bug in different backends in the same week via different tests, but he's definitely saved me a lot of pain in working this out! Hopefully we can combine them and get it merged this week.

Also thanks to Angelo on the initial MR for testing crocus with some real workloads.

August 03, 2021

Hi all, hope you all are doing fine!

Finally, today part 4 of my Outreachy Saga came out; the mid-point was on 5/7/21, so as you can see I'm really late. This week had the theme “Modifying Expectations”.

But why did it take me so long to post? First, I had to internalize the topic a lot, because in my head I thought that when I reached this point I would have achieved all the goals I had proposed at the beginning of the internship. But when the mid-point arrived, it seemed to me that I hadn't done anything and that my internship was going to end with me not having fulfilled expectations.

The project aimed at 2 tasks:

  • Clean up the debugfs support
  • Remove custom dumb_map_offset implementations

During the development of the first task, we found that it could not be carried out as intended, so it needed to be restructured and resulted in:

  • Create vkms_config_show()

    • a function which aims to print the data in drm_debugfs_create_files()
    • It has already been reviewed and approved to be part of the drm-misc tree

During the development of this function, I came across an improvement in the code:

  • Replace macro in vkms_release()

    • It has already been reviewed and approved to be part of the drm-misc tree

As part of this week's assignments, I needed to talk to my advisors about the internship progress and review the schedule. During our conversation, I could see/understand that I had managed to achieve one of the goals, as presented above (I thought I hadn't achieved anything!!), and we also realized that I was not going to be able to do the second task, as it was in another context and could take a long time to understand how to solve it.

Thus, for the second half of the internship, it was decided to convert vkms_config_debugfs into the structure proposed by Wambui Karuga, and that is what I'm working on now.

During this time I'm learning that Linux kernel development is not linear and that several things can happen (setup problems again, breaking the kernel, not knowing what to do...), so I realized that one of the goals of Outreachy is learning how to contribute to and work with the project I've chosen.

So I started to take advantage of my journey to learn how to contribute as much as I can to the Linux kernel, and since I identified a lot with the development, maybe I can find a job to keep working on and contributing to Linux kernel development.

Thank you for following me so far, please feel free to comment! And stay tuned to the next chapters of this Saga!!!

Take care and have a great day!

July 30, 2021

Deeper Into Software

I don’t feel like blogging about zink today, so here’s more about everyone’s favorite software implementation of Vulkan.

The existing LLVMpipe architecture works like this from a top-down view:

  • mesa / st - this is the GL/Gallium state tracker
  • llvmpipe - this is the Gallium driver
  • gallivm - this is the LLVM program compiler
  • llvm - this is where the fragment shader runs

In short, everything is for the purpose of compiling LLVM programs which will draw/compute the desired result.

Lavapipe makes a slight change:

  • lavapipe - this is the Vulkan state tracker
  • llvmpipe - this is the Gallium driver
  • gallivm - this is the LLVM program compiler
  • llvm - this is where the fragment shader runs

It’s that simple.

Thus, any time a new feature is added to Lavapipe, what’s actually being done is plumbing that Vulkan feature through some number of layers to change how LLVM is executed. Some features, like samplerAnisotropy, require significant work at the gallivm layer just to toggle a boolean flag at the lavapipe level.

Other changes, like KHR_timeline_semaphores, are entirely contained in Lavapipe.

What Are Timeline Semaphores?

Vulkan has a number of mechanisms for synchronization, including fences, events, and binary semaphores, all of which serve a specific purpose. For more concrete details on all of them, please read the blog of an actual expert.

The best and most awesome (don’t @ me, it’s not debatable) of these synchronization methods, however, is the timeline semaphore.

A timeline semaphore is an object that can be used to signal and wait on specific integer-assigned points in command execution, also known as timelines. Each queue submission can be accompanied by an array of timeline semaphores to wait on and an array to signal; command buffers in a given submission will wait before executing, then signal after they’re done. This enables parallel code design where one thread can assemble command buffers and submit them, and the GPU can be made to pause at certain points for buffers/images referenced to become populated by another thread before continuing with execution.
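
For reference, the application side of this looks roughly like the fragment below, using the core Vulkan 1.2 timeline semaphore API; the device, queue, and cmdbuf handles are assumed to exist already and error handling is omitted:

#include <vulkan/vulkan.h>

void submit_and_wait(VkDevice device, VkQueue queue, VkCommandBuffer cmdbuf)
{
    /* Create a timeline semaphore starting at value 0. */
    VkSemaphoreTypeCreateInfo type_info = {
        .sType = VK_STRUCTURE_TYPE_SEMAPHORE_TYPE_CREATE_INFO,
        .semaphoreType = VK_SEMAPHORE_TYPE_TIMELINE,
        .initialValue = 0,
    };
    VkSemaphoreCreateInfo create_info = {
        .sType = VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO,
        .pNext = &type_info,
    };
    VkSemaphore sem;
    vkCreateSemaphore(device, &create_info, NULL, &sem);

    /* Submit a command buffer that signals point 1 when it finishes. */
    uint64_t signal_value = 1;
    VkTimelineSemaphoreSubmitInfo timeline_info = {
        .sType = VK_STRUCTURE_TYPE_TIMELINE_SEMAPHORE_SUBMIT_INFO,
        .signalSemaphoreValueCount = 1,
        .pSignalSemaphoreValues = &signal_value,
    };
    VkSubmitInfo submit = {
        .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
        .pNext = &timeline_info,
        .commandBufferCount = 1,
        .pCommandBuffers = &cmdbuf,
        .signalSemaphoreCount = 1,
        .pSignalSemaphores = &sem,
    };
    vkQueueSubmit(queue, 1, &submit, VK_NULL_HANDLE);

    /* Any thread can now block on the host until point 1 is reached. */
    VkSemaphoreWaitInfo wait = {
        .sType = VK_STRUCTURE_TYPE_SEMAPHORE_WAIT_INFO,
        .semaphoreCount = 1,
        .pSemaphores = &sem,
        .pValues = &signal_value,
    };
    vkWaitSemaphores(device, &wait, UINT64_MAX);
}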

Typically, semaphores are managed through signals which pass through the kernel and hardware, meaning that “waiting” on a timeline is really just waiting on an ioctl (DRM_IOCTL_SYNCOBJ_TIMELINE_WAIT) to signal that the specified timeline id has occurred, which requires no additional host-side synchronization. Things get a bit trickier in software, however, as the kernel is not involved, so everything must be managed in the driver.

Lavapipe And Timelines

This was a todo item sitting on the list for a while because it was tricky to handle. The most visible problems here were:

  • connecting timeline identifiers with queue submissions; timelines only need to be monotonic, not sequential, meaning that using something like a sliding array wouldn’t be very efficient
  • the actual synchronization when threads are involved

After some thought and deliberation about my life choices up to this point, I decided to tackle this implementation. The methodology I selected was to add a monotonic counter to the internal command buffer submission and then create a series of per-object timeline “links” which would serve to match the counter to the timeline identifier. This would enable each timeline semaphore to maintain a singly-linked list of links each time they were submitted, and the list could then be pruned at any given time—referenced against the internal counter—to update the “current” timeline id and then evaluate whether a specified wait condition had passed. In the case where the condition had not passed, the timeline link could also store a handle to the fence from the llvmpipe queue submission that could be waited on directly.
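
To make the shape of that bookkeeping more concrete, here is a rough sketch of the data layout described above. The names and fields are illustrative only, not the actual Lavapipe structures:

#include <stdint.h>
#include <threads.h>

/* Opaque handle to whatever the queue submission hands back for waiting;
 * in Lavapipe this would be the llvmpipe fence mentioned above. */
struct queue_fence;

/* One "link" is recorded every time the semaphore appears in the signal
 * array of a submission. */
struct timeline_link {
    struct timeline_link *next; /* singly-linked list, oldest first */
    uint64_t point;             /* timeline value this submission signals */
    uint64_t submit_id;         /* driver-internal monotonic submit counter */
    struct queue_fence *fence;  /* can be waited on directly if needed */
};

struct timeline_semaphore {
    mtx_t lock;                  /* waits can come from many threads */
    uint64_t current;            /* highest value known to be signaled */
    struct timeline_link *links; /* pending links, pruned against the
                                    internal counter on each wait/query */
};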

Did it work?

Almost on the first try, actually.

But then I ran into a wall in CI while running piglit tests through zink.

It turns out that the CTS tests are considerably less aggressive than the piglit ones for things like this: specifically, there don’t appear to be any cases where a single timeline has 16 threads all trying to wait on it at different values, iterating thousands of times over the course of a couple seconds.

Oops.

But that’s now taken care of, and conformance never felt so good.

The road to Vulkan 1.2 continues!

July 29, 2021

This is a title

I’m back.

Where did I go?

My birthday passed recently, so I gifted myself a couple weeks off from blogging. Feels good.

For today, this is a Lavapipe blog.

What’s New With Lavapipe?

Lots.

Let’s check out what conformant features were added just in July:

  • EXT_line_rasterization
  • EXT_vertex_input_dynamic_state
  • EXT_extended_dynamic_state2
  • EXT_color_write_enable
  • features.strictLines
  • features.shaderStorageImageExtendedFormats
  • features.shaderStorageImageReadWithoutFormat
  • features.samplerAnisotropy
  • KHR_timeline_semaphores

Also under the hood now is a new 2D rasterizer from VMWare which yields “a 2x to 3x performance improvement for 2D workloads”.

Why Aren’t You Using Lavapipe Yet?

Have a big Vulkan-using project? Do you constantly have to worry about breakages from all manner of patches being merged without testing? Can’t afford or too lazy to set up and maintain actual hardware for testing?

Why not Lavapipe?

Seriously, why not? If there’s features missing that you need for your project, open tickets so we know what to work on.

July 28, 2021

Part 1, Part 2, Part 3

After getting thouroughly nerd-sniped a few weeks back, we now have FreeBSD support through qemu in the freedesktop.org ci-templates. This is possible through the qemu image generation we have had for quite a while now. So let's see how we can easily add a FreeBSD VM (or other distributions) to our gitlab CI pipeline:


.freebsd:
  variables:
    FDO_DISTRIBUTION_VERSION: '13.0'
    FDO_DISTRIBUTION_TAG: 'freebsd.0' # some value for humans to read

build-image:
  extends:
    - .freebsd
    - .fdo.qemu-build@freebsd
  variables:
    FDO_DISTRIBUTION_PACKAGES: "curl wget"
Now, so far this may all seem quite familiar. And indeed, this is almost exactly the same process as for normal containers (see Part 1); the only difference is the .fdo.qemu-build base template. Using this template means we build an image babushka: our desired BSD image is actually a QEMU RAW image sitting inside another generic container image. That latter image only exists to start the QEMU image and set up the environment if need be; you don't need to care which distribution it runs (Fedora for now).

Because of the nesting, we need to handle this accordingly in our script: tag for the actual test job - we need to start the image and make sure our jobs actually run inside it. The templates set up an ssh alias "vm" for this, and the vmctl script helps to do things on the VM:


test-build:
  extends:
    - .freebsd
    - .fdo.distribution-image@freebsd
  script:
    # start our QEMU image
    - /app/vmctl start

    # copy our current working directory to the VM
    # (this is a yaml multiline command to work around the colon)
    - |
      scp -r $PWD vm:

    # Run the build commands on the VM and if they succeed, create a .success file
    - /app/vmctl exec "cd $CI_PROJECT_NAME; meson builddir; ninja -C builddir" && touch .success || true

    # Copy results back to our run container so we can include them in artifacts:
    - |
      scp -r vm:$CI_PROJECT_NAME/builddir .

    # kill the VM
    - /app/vmctl stop

    # Now that we have cleaned up: if our build job before
    # failed, exit with an error
    - "[[ -e .success ]] || exit 1"
Now, there's a bit to unpack, but with the comments above it should be fairly obvious what is happening. We start the VM, copy our working directory over, and then run a command on the VM before cleaning up. The reason we use touch .success is simple: it allows us to copy things out and clean up before actually failing the job.

Obviously, if you want to build on any other distribution, you just swap the freebsd out for fedora or whatever - the process is the same. libinput has been using Fedora qemu images for ages now.

July 27, 2021

Thanks to the work done by José Expósito, libinput 1.19 will ship with a new type of gesture: Hold Gestures. So far libinput supported swipe (moving multiple fingers in the same direction) and pinch (moving fingers towards each other or away from each other). These gestures are well-known, commonly used, and familiar to most users. For example, GNOME 40 has recently increased its use of touchpad gestures to switch between workspaces, etc. But swipe and pinch gestures require movement - it was not possible (for callers) to detect fingers resting on the touchpad that don't move.

This gap is now filled by Hold gestures. These are triggered when a user puts fingers down on the touchpad, without moving the fingers. This allows for some new interactions and we had two specific ones in mind: hold-to-click, a common interaction on older touchscreen interfaces where holding a finger in place eventually triggers the context menu. On a touchpad, a three-finger hold could zoom in, or do dictionary lookups, or kill a kitten. Whatever matches your user interface most, I guess.

The second interaction was the ability to stop kinetic scrolling. libinput does not actually provide kinetic scrolling, it merely provides the information needed in the client to do it there: specifically, it tells the caller when a finger was lifted off a touchpad at the end of a scroll movement. It's up to the caller (usually: the toolkit) to implement the kinetic scrolling effects. One missing piece was that while libinput provided information about lifting the fingers, it didn't provide information about putting fingers down again later - a common way to stop scrolling on other systems.

Hold gestures are intended to address this: a hold gesture triggered after a flick with two fingers can now be used by callers (read: toolkits) to stop scrolling.

Now, one important thing about hold gestures is that they will generate a lot of false positives, so be careful how you implement them. The vast majority of interactions with the touchpad will trigger some movement - once that movement hits a certain threshold the hold gesture will be cancelled and libinput sends out the movement events. Those events may be tiny (depending on touchpad sensitivity) so getting the balance right for the aforementioned hold-to-click gesture is up to the caller.
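
For callers, handling these events might look roughly like the sketch below. It assumes the hold-gesture event types added in libinput 1.19; the kinetic-scroll flag and the hold-to-click hook are purely illustrative:

#include <libinput.h>
#include <stdbool.h>

/* Illustrative sketch of a caller reacting to hold gestures: stop an
 * ongoing kinetic scroll on HOLD_BEGIN, and only act on HOLD_END if the
 * hold was not cancelled by finger movement. */
static void
handle_gesture_event(struct libinput_event *ev, bool *kinetic_scroll_active)
{
    struct libinput_event_gesture *gesture;

    switch (libinput_event_get_type(ev)) {
    case LIBINPUT_EVENT_GESTURE_HOLD_BEGIN:
        /* Fingers are resting on the touchpad: stop any fling animation. */
        *kinetic_scroll_active = false;
        break;
    case LIBINPUT_EVENT_GESTURE_HOLD_END:
        gesture = libinput_event_get_gesture_event(ev);
        /* A cancelled hold means the fingers started moving, i.e. one of
         * the false positives mentioned above - ignore it. */
        if (!libinput_event_gesture_get_cancelled(gesture) &&
            libinput_event_gesture_get_finger_count(gesture) == 1) {
            /* e.g. trigger hold-to-click / a context menu here */
        }
        break;
    default:
        break;
    }
}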

As usual, the required bits to get hold gestures into the wayland protocol are either in the works, mid-flight or merge-ready so expect this to hit the various repositories over the medium-term future.

July 26, 2021

I have not talked about raytracing in RADV for a while, but after some procrastination and being focused on some other things, I recently got back to it and reached my next milestone.

In particular, I have been hacking away at CTS and got to a point where dEQP-VK.ray_tracing.* runs to completion without crashes or hangs. Furthermore, I got the pass rate to 90% of non-skipped tests. So we’re finally getting somewhere close to usable.

As a further demonstration that it is usable, my fixes for CTS also fixed the corruption issues in Quake 2 RTX (the GitHub version), delivering this image:

Q2RTX on RADV

Of course not everything is perfect yet. Besides the not-yet-100% CTS pass rate, it currently has about half the Windows performance at 4k, and we still have some feature gaps to make it really usable for most games.

Why is it slow?

TL;DR: Because I haven’t optimized it yet and have taken every shortcut imaginable.

AMD raytracing primer

Raytracing with Vulkan works in two steps:

  1. You build a giant acceleration structure that contains all your geometry. Typically this ends up being some kind of tree, usually a Bounding Volume Hierarchy (BVH).
  2. Then you trace rays through the acceleration structure you just built, using some traversal shader.

With RDNA2, AMD started accelerating this by adding an instruction that performs an intersection test between a ray and a single BVH node, where the BVH node can be either

  • a triangle
  • a box node specifying 4 AABBs

Of course this isn’t quite enough to deal with all the geometry types in Vulkan, so we also add two more node types:

  • an AABB box
  • an instance of another BVH combined with a transformation matrix

Building the BVH

With a search tree like a BVH it is very possible to build trees that are next to useless. As an example, consider a very unbalanced binary search tree. We can have similarly bad things happen with a BVH, including it being unbalanced or having overlapping bounding volumes.

And my implementation is the simplest thing possible: the input geometry becomes the leaves in exactly the same order, and then internal nodes are created just as you’d draw them. That is probably decently fast at building the BVH, but it surely results in a terrible BVH to actually traverse.

BVH traversal

After we have built the BVH we can start tracing some rays. In rough pseudocode, the current implementation is:

stack = empty
insert root node into stack
while stack is not empty:

   node = pop a node from the stack

   if we left the bottom level BVH:
      reset ray origin/direction to initial origin/direction

   result = amd_intersect(ray, node)
   switch node type:
      triangle:
         if result is a hit:
            load some node data
            process hit
      box node:
         for each box hit:
            push child node on stack
      custom node 1 (instance):
         load node data
         push the root node of the bottom BVH on the stack
         apply transformation matrix to ray origin/direction
      custom node 2 (AABB geometry):
         load node data
         process hit

We already knew there were inherently going to be some difficulties:

  • We have a poor BVH so we’re going to do way more iterations than needed.
  • Calling shaders as a result of hits is going to result in some divergence.

Furthermore, this also clearly shows some difficulties with how we approached the intersection instruction. The advantages of the intersection instruction are that it avoids divergence when computing collisions if we have different node types in a subgroup, and that it is cheaper when only a few lanes are active. (A single CU can process one ray/node intersection per cycle, modulo memory latency, while it can process an ALU instruction on 64 lanes per cycle.)

However, even if it avoids the divergence in the collision computation, we still introduce a ton of divergence in the processing of the results of the intersection. So we are still doing pretty badly here.

A fast GPU traversal stack needs some work too

Another thing to note is our traversal stack size. According to the Vulkan specification, a bottom-level acceleration structure should support 2^24 - 1 triangles and a top-level acceleration structure should support 2^24 - 1 bottom-level structures. Combined, that is roughly 2^48 leaves, so with a tree that has 4 children per internal node we can end up with a tree depth of about log4(2^48) = 24 levels.

In each internal-node iteration of our loop we pop one element and push up to 4, a net gain of up to 3 entries per level, so at the deepest level of traversal we could end up with a 72-entry stack. Assuming these are 32-bit node identifiers, that ends up being 288 bytes of stack per lane, or ~18 KiB per 64-lane workgroup (the minimum which could possibly keep a CU busy with an ALU-only workload). Given that we have 64 KiB of LDS per CU (yes, I am using LDS, since there is no divergent dynamic register addressing), that leaves only 3 workgroups per CU, leaving very few options for parallelism between different hardware execution units (e.g. the ALU and the texture unit that executes the ray intersections) or for latency hiding of memory operations.

So ideally we get this stack size down significantly.

Where do we go next?

The first step is to get CTS passing and an initial merge request into upstream Mesa. As a follow-on, I’d like to get a minimal prototype going for some DXR 1.0 games with vkd3d-proton, just to make sure we have the right feature coverage.

After that we’ll have to do all the traversal optimizations. I’ll probably implement a bunch of instrumentation so I actually have a clue on what to optimize. This is where having some runnable games really helps get the right idea about performance bottlenecks.

Finally, with some luck better shaders to build a BVH will materialize as well.

July 22, 2021

If you want to write an X application, you need to use some library that speaks the X11 protocol. For a long time this meant libX11, often called xlib, which - like most things about X - is a fantastic bit of engineering that is very much a product of its time with some confusing baroque bits. Overall it does a very nice job of hiding the icky details of the protocol from the application developer.

One of the details it hides has to do with how resource IDs are allocated in X. A resource ID (an XID, in the jargon) is a 32-bit (well, really 29-bit) integer that names a resource - window, colormap, what have you. Those 29 bits are split up netmask/hostmask style, where the top 8 or so uniquely identify the client, and the rest identify the resource belonging to that client. When you create a window in X, what you really tell the server is "I want a window that's initially this size, this background color (etc.) and from now on when I say (my client id + 17) I mean that window." This is great for performance because it means resource allocation is assumed to succeed and you don't have to wait for a reply from the server.

Key to all this is that in xlib the XID is the return value from the call that issues the resource creation request. Internally the request gets queued into the protocol's write buffer, but the client can march ahead and issue the next few commands as if creation had succeeded - because it probably did, and if it didn't you're probably going to crash anyway.

So to allocate XIDs the client just marches forward through its XID range. What happens when you hit the end of the range? Before X11R4, you'd crash, because xlib doesn't keep track of which XIDs it's allocated, just the lowest one it hasn't allocated yet. Starting in R4 the server added an extension called XC-MISC that lets the client ask the server for a list of unused XIDs, so when xlib hits the end of the range it can request a new range from the server.

But. UI programming tends to want threads, and xlib is perhaps not the most thread-friendly. So XCB was invented, which sacrifices some of xlib's ease of use for a more direct binding to the protocol and (in theory) an explicitly thread-safe design. We then modified xlib and XCB to coexist in the same process, using the same I/O buffers, reply and event management, etc.

This literal reflection of the protocol into the API has consequences. In XCB, unlike xlib, XID generation is an explicit step. The client first calls into XCB to allocate the XID, and then passes that XID to the creation request in order to give the resource a name.

Which... sorta ruins that whole thread-safety thing.

Let's say you call xcb_generate_id in thread A and the XID it returns is the last one in your range. Then thread B schedules in and tries to allocate another XID. You'll ask the server for a new range, but since thread A hasn't called its resource creation request yet, from the server's perspective that "allocated" XID looks like it's still free! So now, whichever thread issues their resource creation request second will get BadIDChoice thrown at them if the other thread's resource hasn't been destroyed in the interim.

A library that was supposed to be about thread safety baked a thread safety hazard into the API. Good work, team.

How do you fix this without changing the API? Maybe you could keep a bitmap on the client side that tracks XID allocation; that's only like 256KB worst case, you can grow it dynamically, and most clients don't create more than a few dozen resources anyway. Make xcb_generate_id consult that bitmap for the first unallocated ID, and mark it used when it returns. Then track every resource destruction request and zero it back out of the bitmap. You'd only need XC-MISC if some other client destroyed one of your resources and you were otherwise completely out of XIDs.
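
As a rough sketch of that idea (illustrative only, not actual XCB code; the structure and helper names are made up, and real code would grow the bitmap lazily and keep a search hint rather than scanning from zero each time):

#include <stdint.h>

/* Client-side XID bookkeeping: one bit per ID in the range the server
 * handed us at connection setup (resource-id-base / resource-id-mask). */
struct xid_allocator {
    uint32_t base;   /* resource-id-base from the setup reply */
    uint32_t mask;   /* resource-id-mask from the setup reply */
    uint32_t shift;  /* position of the mask's lowest set bit */
    uint8_t *bitmap; /* (mask >> shift) + 1 bits, ~256KB worst case */
};

static uint32_t xid_alloc(struct xid_allocator *a)
{
    uint32_t count = (a->mask >> a->shift) + 1;

    for (uint32_t i = 0; i < count; i++) {
        if (!(a->bitmap[i >> 3] & (1u << (i & 7)))) {
            a->bitmap[i >> 3] |= 1u << (i & 7);
            return a->base | (i << a->shift);
        }
    }
    /* Range exhausted: this is the point where you would fall back to
     * an XC-MISC range request. */
    return 0;
}

/* Called from every annotated resource-destruction request. */
static void xid_free(struct xid_allocator *a, uint32_t xid)
{
    uint32_t i = (xid & a->mask) >> a->shift;
    a->bitmap[i >> 3] &= ~(1u << (i & 7));
}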

And you can implement this, except. One, XCB has zero idea what a resource destruction request is, that's simply not in the protocol description. Not a big deal, you can fix that, there's only like forty destructors you'd need to annotate. But then two, that would only catch resource destruction calls that flow through XCB's protocol binding API, which xlib does not, xlib instead pushes raw data through xcb_writev. So now you need to modify every client library (libXext, libGL, ...) to inform XCB about resource destruction.

Which is doable. Tedious. But doable.

I think.

I feel a little weird writing about this because: surely I can't be the first person to notice this.

July 21, 2021

Debugging programs using printf statements is not a technique that everybody appreciates. However, it can be quite useful and sometimes necessary depending on the situation. My past work on air traffic control software involved using several forms of printf debugging many times. The distributed and time-sensitive nature of the system being studied made it inconvenient or simply impossible to reproduce some issues and situations if one of the processes was stalled while it was being debugged.

In the context of Vulkan and graphics in general, printf debugging can be useful to see what shader programs are doing, but some people may not be aware it’s possible to “print” values from shaders. In Vulkan, shader programs are normally created in a high level language like GLSL or HLSL and then compiled to SPIR-V, which is then passed down to the driver and compiled to the GPU’s native instruction set. That final binary, many times outside the control of user applications, runs in a quite closed and highly parallel environment without many options to observe what’s happening and without text input and output facilities. Fortunately, tools like glslang can generate some debug information when compiling shaders to SPIR-V and other tools like Nsight can use that information to let you debug shaders being run.

Still, being able to print the values of different expressions inside a shader can be an easy way to debug issues. With the arrival of Ray Tracing, this is even more useful than before. In ray tracing pipelines, the shaders being executed and resources being used are chosen based on the scene geometry, the origin and the direction of the ray being traced. printf debugging can let you see where you are and what you’re using. So how do you print values from shaders?

Vulkan’s debug printf is implemented as part of the Validation Layers and the general procedure is well documented. If you were to implement this kind of mechanism yourself, you’d likely use a storage buffer to save the different values you want to print while shader invocations are running and, later, you’d go over the contents of that buffer and print the associated message with each value or values. And that is, essentially, what debug printf does but in a very convenient and automated way so that you don’t have to deal with the gory details and corner cases.

In a GLSL shader, simply:

  1. Enable the GL_EXT_debug_printf extension.

  2. Sprinkle your code with debugPrintfEXT() calls.

  3. Use the Vulkan Configurator that’s part of the SDK or manually edit vk_layer_settings.txt for your app enabling VK_VALIDATION_FEATURE_ENABLE_DEBUG_PRINTF_EXT.

  4. Normally, disable other validation features so as not to get too much output.

  5. Take a look at the debug report or debug utils info messages containing printf results, or set printf_to_stdout to true so printf messages are sent to stdout directly.

You can find an example shader in the validation layers test code. The debug printf feature has helped me a lot in the past, so I wanted to make sure it’s widely known and used.

Due to the observer effect, you may end up in situations where your code works correctly when enabling debug printf but incorrectly without it. This may be due to multiple reasons but one of the main ones I’ve encountered is improper synchronization. When debug printf is used, the layers use additional synchronization primitives to sync the contents of auxiliary buffers, which can mask synchronization bugs present in the app.

Finally, RenderDoc 1.14, released at the end of May, also supports Vulkan’s shader printf statements and will let you take a look at the print statements produced during a draw call. Furthermore, the print statements don’t have to be present in the original shader. You can also use the shader edit system to insert them on the fly and use them to debug the results of a particular shader invocation. Isn’t that awesome? Great work by Baldur Karlsson as always.

PS: As a happy coincidence, just yesterday LunarG published a white paper on Vulkan’s debug printf with additional information on this excellent feature. Be sure to check it out!

In order to expose OpenGL 4.6, the last missing feature in llvmpipe is anisotropic texture filtering. Adding support for this also allows lavapipe to expose the Vulkan samplerAnisotropy feature.

I started writing anisotropic support > 6 months ago. At the time we were trying to deprecate the classic swrast driver, and someone pointed out it had support for anisotropic filtering. This support had also been ported to the softpipe driver, but never to llvmpipe.

I had also considered porting swiftshader's anisotropic support, but since I was told the softpipe code was functional and had users, I based my llvmpipe port on that.

Porting the code to llvmpipe means rewriting it to generate LLVM IR using the llvmpipe vector processing code. This is a lot messier than just writing linear processing code, and when I thought I had it working, it passed the GL CTS but failed the VK CTS. The results also looked worse to my eye than I'd have thought acceptable, and softpipe seemed to be just as bad.

Once I swung back around to this, I decided to port the VK CTS test to GL and run it on the softpipe and llvmpipe code. Initially llvmpipe had some more bugs to solve, especially around how the mipmap levels were being chosen, but once I'd finished aligning softpipe and llvmpipe I started digging into why the softpipe code wasn't as nice as I expected.

The softpipe code was based on an implementation of an Elliptical Weighted Average (EWA) filter, described in the paper "Creating Raster Omnimax Images from Multiple Perspective Views Using the Elliptical Weighted Average Filter". I sat down with the paper and the softpipe code and eventually found the one line where they diverged.[1] This turned out to be a bug introduced in a refactoring 5 years ago, and nobody had noticed or tracked it down.

I then ported the same fix to my llvmpipe code, and the VK CTS passes. I also optimized the llvmpipe code a bit to avoid doing pointless sampling and cleaned things up. This code landed today.[2]

For GL 4.6 there are still some fixes needed in other areas.

[1] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11917

[2] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8804

July 20, 2021

After a month of reverse-engineering, we’re excited to release documentation on the Valhall instruction set, available as a PDF. The findings are summarized in an XML architecture description for machine consumption. In tandem with the documentation, we’ve developed a Valhall assembler and disassembler as a reverse-engineering aid.

Valhall is the fourth Arm® Mali™ architecture and the fifth Mali instruction set. It is implemented in the Arm® Mali™-G78, the most recently released Mali hardware, and Valhall will continue to be implemented in Mali products yet to come.

Each architecture represents a paradigm shift from the last. Midgard generalizes the Utgard pixel processor to support compute shaders by unifying the shader stages, adding general purpose memory access, and supporting integers of various bit sizes. Bifrost scalarizes Midgard, transitioning away from the fixed 4-channel vector (vec4) architecture of Utgard and Midgard to instead rely on warp-based execution for parallelism, better using the hardware on modern workloads. Valhall linearizes Bifrost, removing the Very Long Instruction Word mechanisms of its predecessors. Valhall replaces the compiler’s static scheduling with hardware dynamic scheduling, trading additional control hardware for higher average performance. That means padding with “no operation” instructions is no longer required, which may decrease code size, promising better instruction cache use.

All information in this post and the linked PDF and XML is published in good faith and for general information purposes only. We do not make any warranties about the completeness, reliability and accuracy of this information. Any action you take upon the information you find here is strictly at your own risk. We will not be liable for any losses and/or damages in connection with the use of this information.

While we strive to make the information as accurate as possible, we make no claims, promises, or guarantees about its accuracy, completeness, or adequacy. We expressly disclaim liability for content, errors and omissions in this information.

Let’s dig in.

Getting started

In June, Collabora procured an International edition of the Samsung Galaxy S21 phone, powered by a system-on-chip with Mali G78. Although Arm announced Valhall with the Mali G77 in May 2019, roll out has been slow due to the COVID-19 pandemic. At the time of writing, there are not yet Linux friendly devices with a Valhall chip, forcing use of a locked down Android device. There’s a silver lining: we have a head start on the reverse-engineering, so by the time hacker-friendly devices arrive with Valhall GPUs, we can have open source drivers ready.

Android complicates reverse-engineering (though not as much as macOS). On Linux, we can compile a library on the device to intercept data sent to the GPU. On Android, we must cross-compile from a desktop with the Android Native Development Kit, ironically software that doesn’t run on Arm processors. Further, where on Linux we can track the standard system calls, Android device drivers replace the standard open() system call with a complicated Android-only “binder” interface. Adapting the library to support binder would be gnarly, but do we have to? We could sprinkle in one little hack anywhere we see a file descriptor without the file name.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define MALI0 "/dev/mali0"

/* Resolve the file name behind a descriptor via /proc and check whether
 * it is the Mali device node. */
bool is_mali(int fd)
{
    char in[128] = { 0 }, out[128] = { 0 };
    snprintf(in, sizeof(in), "/proc/self/fd/%d", fd);

    int count = readlink(in, out, sizeof(out) - 1);
    return count == strlen(MALI0) && strncmp(out, MALI0, count) == 0;
}

Now we can hook the Mali ioctl() calls without tracing binder and easily dump graphics memory.
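
For illustration, the resulting hook might look roughly like this in an LD_PRELOAD interposer; this is a sketch only, dump_mali_ioctl() is a hypothetical stand-in, and the real wrapper library does considerably more:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdarg.h>
#include <stdbool.h>

bool is_mali(int fd); /* the helper shown above */

/* Hypothetical dumper for interesting calls; not part of any real library. */
void dump_mali_ioctl(int fd, unsigned long request, void *arg);

int ioctl(int fd, unsigned long request, ...)
{
    static int (*real_ioctl)(int, unsigned long, ...);

    if (!real_ioctl)
        real_ioctl = (int (*)(int, unsigned long, ...))dlsym(RTLD_NEXT, "ioctl");

    /* The kernel ioctl ABI takes at most one pointer-sized argument. */
    va_list ap;
    va_start(ap, request);
    void *arg = va_arg(ap, void *);
    va_end(ap);

    if (is_mali(fd))
        dump_mali_ioctl(fd, request, arg);

    return real_ioctl(fd, request, arg);
}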

We’re interested in the new instruction set, so we’re looking for the compiled shader binaries in memory. There’s a chicken-and-egg problem: we need to find the shaders to reverse-engineer them, but we need to reverse-engineer the shaders to know what to look for. Fortunately, there’s an escape hatch. The proprietary Mali drivers allow an OpenGL application to query the compiled binary with the ARM_mali_program_binary extension, returning a file in the Mali Binary Shader format. That format was reverse-engineered years ago by Connor Abbott for earlier Mali architectures, and the basic structure is unchanged in Valhall. Our task is simple: compile a test shader, dump both GPU memory and the Mali Binary Shader, and find the common section. Searching for the common bytes produces an address in executable graphics memory, in this case 0x7f0002de00. Searching for that address in turn finds the “shader program descriptor” which references it.

18 00 00 80 00 10 00 00 00 DE 02 00 7F 00 00 00

Another search shows this descriptor’s address in the payload of an index-driven vertex shading job for graphics or a compute job for OpenCL. Those jobs contain the Job Manager header introduced a decade ago for Midgard, so we understand them well: they form a linked list of jobs, and only the first job is passed to the kernel. The kernel interface has a “job chain” parameter on the submit system call taking a GPU address. We understand the kernel interface well as it is open source due to kernel licensing requirements.

With each layer identified, we teach the wrapper library to chase the pointers and dump every shader executed, enabling us to reverse-engineer the new instruction set and develop a disassembler.

Instruction set reconnaissance

Reverse-engineering in the dark is possible, but it’s easier to have some light. While waiting for the Valhall phone to arrive, I read everything Arm made public about the instruction set, particularly this article from Anandtech. Without lifting a finger, that article tells us Valhall is…

  • Warp-based, like Bifrost, but with 16 threads per warp instead of Bifrost’s 4/8.
  • Isomorphic to Bifrost on the instruction level (“operational equivalence”).
  • Regularly encoded.
  • Flat, lacking Bifrost’s clause and tuple packaging.

It also says that Valhall has a 16KB instruction cache, holding 2048 instructions. Since Valhall has a regular encoding, we divide 16384 bytes by 2048 instructions to find a Valhall instruction is 8 bytes. Our first attempt at a “disassembler” can print hex dumps of every 8 bytes on a line; our calculation ensures that is the correct segmentation.

From here on, reverse-engineering is iterative. We have a baseline level of knowledge, and we want to grow that knowledge. To do so, we feed test programs into the proprietary driver and observe the output, then perturb the input program to see how the output changes.

As we discover new facts about the architecture, we update our disassembler, demonstrating new knowledge and separating the known from the unknown. Ideally, we encode these facts in a machine-readable file forming a single reference for the architecture. From this file, we can generate a disassembler, an assembler, an instruction encoder, and documentation. For Valhall, I use an XML file, resembling Bifrost’s equivalent XML.

Filling out this file is usually straightforward though tedious. Modern APIs are large, so there is a great deal of effort required to map the API requirements to the hardware features.

However, some hardware features do not map to any API. Here are subtler tales from reversing Valhall.

Dependency slots

Arithmetic is faster than memory access, so modern processors execute arithmetic in parallel with pending memory accesses. Modern GPU architectures require the compiler to manage this mechanism by analyzing the program and instructing the hardware to wait for the results before they’re needed.

For this purpose, Bifrost uses an explicit scoreboarding system. Bifrost groups up to 16 instructions together in a clause, and each clause has a fixed header. The compiler assigns a “dependency slot” between 0 and 7 to each clause, specified in the header. Each clause can wait on any set of slots, specified with another 8-bits in the clause header. Specifying dependencies per-clause is a compromise between precision and code size.

We expect Valhall to feature a similar scheme, but Valhall doesn’t have clauses or clause headers, so where does it specify this info?

Studying compiled shaders, we see the last byte of every instruction is usually zero. But when the result of a memory access is first read, the previous instruction has a bit set in the last byte. Which bit is set depends on the number of memory accesses in flight, so it seems the last byte encodes a dependency wait. The memory access instructions themselves are often zero in their last bytes, so it doesn’t look like the last byte is used to encode the dependency slot – but executing many memory access instructions at once and comparing the bits, we see a single 2-bit field stands out as differing. The dependency slot is specified inside the instruction, not in the metadata.

What makes this design practical? Two factors.

One, only the waits need to be specified in general. Arithmetic instructions don’t need a dependency slot, since they complete immediately. The longest message-passing instruction is shorter than the longest arithmetic instruction, so there is space in the instruction itself to specify waits only when needed.

Two, the performance gain from adding extra slots levels off quickly. Valhall cuts back on Bifrost’s 8 slots (6 general purpose). Instead it has 4 or 5 slots, with only 3 general purpose, saving 4 bits in every instruction.

This story exemplifies a general pattern: Valhall is a flattening of Bifrost. Alternatively, Bifrost is “Valhall with clauses”, although that description is an anachronism. Why does Bifrost have clauses, and why does Valhall remove them? The pattern in this story of dependency waits generalizes to answer the question: grouping many instructions into Bifrost clauses allows the hardware to amortize operations like dependency waits and reduce the hardware gate count of the shader core. However, clauses add substantial encoding overhead, compiler complexity, and imprecision. Bifrost optimizes for die space; Valhall optimizes for performance.

The missing modifier

Hardware features that are unused by the proprietary driver are a perennial challenge for reverse-engineering. However, we have a complete Bifrost reference at our disposal, and Valhall instructions are usually equivalent to Bifrost. Special instructions and modes from Bifrost cast a shadow on Valhall, showing where there are gaps in our knowledge. Sometimes these gaps are impractical to close, short of brute-forcing the encoding space. Other times we can transfer knowledge and make good guesses.

Consider the Cross Lane PERmute instruction, CLPER, which takes a register and the index of another lane in the warp, and returns the value of the register in the specified lane. CLPER is a “subgroup operation”, required for Vulkan and used to implement screen-space derivatives in fragment shaders. On Bifrost, the CLPER instruction is defined as:

<ins name="+CLPER.i32" mask="0xfc000" exact="0x7c000">
  <src start="0" mask="0x7"/>
  <src start="3"/>
  <mod name="lane_op" start="6" size="2">
    <opt>none</opt>
    <opt>xor</opt>
    <opt>accumulate</opt>
    <opt>shift</opt>
  </mod>
  <mod name="subgroup" start="8" size="2">
    <opt>subgroup2</opt>
    <opt>subgroup4</opt>
    <opt>subgroup8</opt>
  </mod>
  <mod name="inactive_result" start="10" size="4">
    <opt>zero</opt>
    <opt>umax</opt>
    ....
    <opt>v2infn</opt>
    <opt>v2inf</opt>
  </mod>
</ins>

We expect a similar definition for Valhall. One modification is needed: Valhall warps contain 16 threads, so there should be a subgroup16 option after subgroup8, with the natural binary encoding 11. Looking at a binary Valhall CLPER instruction, we see the bit pair 11 in the position corresponding to the subgroup field. Similarly, experimenting with different subgroup operations in OpenCL lets us figure out the lane_op field. We end up with an instruction definition like:

<ins name="CLPER.u32" title="Cross-lane permute" dests="1" opcode="0xA0" opcode2="0xF">
  <src/>
  <src widen="true"/>
  <subgroup/>
  <lane_op/>
</ins>

Notice we do not specify the encoding in the Valhall XML, since Valhall encoding is regular. Also notice we lack the inactive_result modifier. On Bifrost, inactive_result specifies the value returned if the program attempts to access an inactive lane. We may guess Valhall has the same mechanism, but that modifier is not directly controllable by current APIs. How do we proceed?

If we can run code on the device, we can experiment with the instruction. Inactive lanes may be caused by divergent control flow, where one lane in the thread branches but another lane does not, forcing the hardware to execute only part of the warp. After reverse-engineering Valhall’s branch instructions, we can construct a situation where a single lane is active and the rest are inactive. Then we insert a CLPER instruction with extra bits set, store the result to main memory, and print the result. This assembly program does the trick:

# Elect a single lane
BRANCHZ.reconverge.id lane_id, offset:3

# Try to read a value from an inactive thread
CLPER.u32 r0, r0, 0x01000000.b3, inactive_result:VALUE

# Store the value
STORE.i32.slot0.reconverge @r0, u0, offset:0

# End shader
NOP.return

With the assembler we’re writing, we can assemble this compute kernel. How do we run it on the device without knowing the GPU data structures required to dispatch compute shaders? We make use of another classic reverse-engineering technique: instead of writing the initialization code ourselves, piggyback off the proprietary driver. Our wrapper library allows us to access graphics memory before the driver submits work to the hardware. We use this to read the memory, but we may also modify it. We already identified the shader program descriptor, so we can inject our own shaders. From here, we can jury-rig a script to execute arbitrary shader binaries on the device in the context of an OpenCL application running under the proprietary driver.

Putting it together, we find the inactive_result bits in the CLPER encoding and write one more script to dump all values.

for ((i = 0 ; i < 16 ; i++)); do
  sed -e "s/VALUE/$i/" shader.asm | python3 asm.py shader.bin
  adb push shader.bin /data/local/tmp/
  adb shell 'REPLACE=/data/local/tmp/shader.bin '\
    'LD_PRELOAD=/data/local/tmp/panwrap.so '\
    '/data/local/tmp/test-opencl'
done

The script’s output contains sixteen possibilities – and they line up perfectly with Bifrost’s sixteen options. Success.

Next steps

There’s more to learn about Valhall, but we’ve reverse-engineered enough to develop a Valhall compiler. As Valhall is a simplification of Bifrost, and we’ve already developed a free and open source compiler for Bifrost, this task is within reach. Indeed, adapting the Bifrost compiler to Valhall will require refactoring but little new development.

Mali G78 does bring changes beyond the instruction set. The data structures are changed to reduce Vulkan driver overhead. For example, the monolithic “Renderer State Descriptor” on Bifrost is split into a “Shader Program Descriptor” and a “Depth Stencil Descriptor”, so changes to the depth/stencil state no longer require the driver to re-emit shader state. True, the changes require more reverse-engineering. Fortunately, many data structures are adapted from Bifrost requiring few changes to the Mesa driver.

Overall, supporting Valhall in Mesa is within reach. If you’re designing a Linux-friendly device with Valhall and looking for open source drivers, please reach out!

Originally posted on Collabora’s blog

July 15, 2021

Some days ago my Igalia colleague Adrián Pérez pointed us to mold, a new drop-in replacement for existing Unix linkers created by the original author of LLVM lld. While mold is pretty new and does not aim to be 100% compatible with GNU ld, GNU gold or LLVM lld (at least as of the time I’m writing this), I noticed the benchmark table in its README file also painted a pretty picture about the performance of lld, if inferior to that of mold.

In my job at Igalia I work most of the time on VK-GL-CTS, Vulkan and OpenGL’s Conformance Test Suite, which contains thousands of tests for OpenGL and Vulkan. These tests are provided by different executable files, and the Vulkan tests I focus on are contained in a binary called deqp-vk. When built with debug information, deqp-vk can be quite large: a recent build, for example, takes 369 MB on my drive. But the worst part is that linking the binary typically takes around 25 seconds on my work laptop.

$ time cmakebuild.sh --target deqp-vk
  [6/6] Linking CXX executable external/vulkancts/modules/vulkan/deqp-vk

  real    0m25.137s
  user    0m22.280s
  sys     0m3.440s

I had never paid much attention to the linker before, always relying on the default choice in Fedora or any other distribution. However, I decided to install lld, which has an official package, and gave it a try. You Will Not Believe What Happened Next.

$ time cmakebuild.sh --target deqp-vk
  [6/6] Linking CXX executable external/vulkancts/modules/vulkan/deqp-vk

  real    0m2.622s
  user    0m5.456s
  sys     0m1.764s

lld is capable of correctly linking deqp-vk in 1/10th of the time the default linker (GNU ld) takes to do the same job. If you want to try lld yourself you have several options. Ideally, you’d be able to run update-alternatives --set ld /usr/bin/lld as root but that option is notably not available in Fedora. There was a proposal to make that work but it never materialized, so it cannot be made the default system-wide linker.

However, depending on the build system used by a particular project, there should be a way to make it use lld instead of /usr/bin/ld. For example, VK-GL-CTS uses CMake, which invokes the compiler to link executable files, instead of calling the linker directly, which would be unusual. Both GCC and Clang can be passed -fuse-ld=lld as a command line option to use lld instead of the default linker. That flag should be added to CMake’s CMAKE_EXE_LINKER_FLAGS variable, either by reconfiguring an existing project with, for example, ccmake, or by adding the flag to the LDFLAGS environment variable before running CMake on a build directory for the first time.

I’m looking forward to using the mold linker and its multithreading capabilities in the future. In the meantime, I’m very happy to have tried lld. It’s not often that a tooling change as simple as this one gives me such a clear advantage.

July 09, 2021

It Happened.

ablend.png

That’s right.

Zink(-wip) now fully supports GL_KHR_blend_equation_advanced, which means ES 3.2 is a go (once my local CI clears me to push today’s snapshot).

And all it took was one brief exchange with a top Mesa reviewer who is incidentally rumored to be undergoing training deep in the mountains to become an expert BBQ master on the extremely professional #zink channel on OFTC:

That's the thing. You can totally do it in Zink.

My mind was blown.

Why hadn’t I thought of that sooner?

I could just…do it? Just like that? And then it’d be done?

Truly the experts are on a different level from us mortals.

So now it’s done, and that means zink is finished. I don’t expect there will be any more work to do now that the final boss has been defeated. Don’t even bother trying to file bug reports.

You may not like it, but this is what peak Friday looks like.

July 07, 2021

The Unsung Heroes

This is going to be less of a technical post and more of a have you thought about post from me personally (usual disclaimer: this post represents only my views). With that said, I think this is more important than the average post here, meaning that expectations should be set somewhere between I need to stop everything else I’m doing until I finish reading and this is the most important event in my life.

Let’s talk about open source. No, Open Source. The idea of it.

How Does Open Source Work?

Those of you who are veterans are rolling your eyes. Another post about the glory of Open Source.

The thing about Open Source is that it’s sort of whatever you make of it. At its core, it’s about getting people together to solve a problem—community building. Whether that community is large or small, the goal is the same: write some quality software.

To that end, you’ve got your usual corporate powerpoint slide of community roles:

  • maintainers
  • developers
  • reviewers
  • whatever other buzzwords are currently relevant

In Mesa, the maintainer and developer roles are mostly the same among core contributors: these are the people who write the code that gets posted about on all the news sites.

The reviewer is a bit more mysterious though. Who are reviewers, and what separates them from the others?

WD-40

Reviewers are the grease that makes the project work. There’s really no other way of saying it.

Outside of a few components of Mesa that are effectively the wild west, without any form of oversight or approval needed for changes to be landed, every driver and utility in the tree requires that changes undergo review before they land. This means that each and every patch which affects code or build has to have a person stop everything else they’re doing and physically scroll through each patch, line-by-line, then add a Reviewed-by or Acked-by tag.

If you’re unclear as to the meanings of these tags, consider it like you’re going skydiving with someone you’ve never met before who has been in charge of preparing your parachute:

  • Reviewed-by means “I triple-checked your parachute as well as your reserve, and I’m as certain as a human is capable of being that everything is how it should be”
  • Acked-by means “Hey, I grabbed this already-packed parachute off the hanger and gave it a once-over; you’ll probably be fine”

It’s then up to the developer to decide whether to merge the code based on the feedback given to them by the reviewer.

This, of course, assumes they get feedback at all.

Balance

Too often on news sites (and in certain corporate metrics) you’ll see something like “Patches McCodesAlot, working for GreatCodingCompany, authored the most code changes for this release cycle (9001 patches), which is over 100x more than the next highest contributor.”

The manager at a company sees this and thinks “I’ll send this up the chain. We should poach Patches so we can have greater control over this project which underpins our entire business strategy. Also it’ll make my powerpoint pie charts look rad.”

The casual reader sees this and says “Wow, Patches is awesome! Without Patches, I probably couldn’t even play Fororantwatch on my Linux gaming desktop!”

But how do the patches that Patches writes get merged into the release? Unless Patches works exclusively in one of the undermaintained areas of the project, in which case it’s unlikely that their work is being widely used, the odds are that someone’s pulling a huge lift on the review side to enable all of those patches landing into the repository.

This is the job of the reviewer.

A Thanks

As this Mesa release cycle starts to wind down, I hope that readers of this blog and news sites can take a moment to look past Patches McCodesAlot and see the people who make it possible for Patches to land so many damn patches.

At the time of this post, this is what the top 10 reviewers managed to accomplish over the past few months:

Number of Reviews Reviewer Name Corporate Affiliation
91 Erik Faye-Lund Collabora
94 Samuel Pitoiset Valve
99 Alejandro Piñeiro Igalia
115 Kenneth Graunke Intel
116 Bas Nieuwenhuizen Blogger
121 Lionel Landwerlin Intel
128 Adam Jackson Red Hat
140 Marek Olšák AMD
176 Jason Ekstrand Intel
300 Dave Airlie Red Hat

Summed up, that’s over 1300 patches reviewed! For perspective, that’s around 30% of all the patches in this release, and it’s about 70% of the total number of patches that zink has received in the course of its existence.

Looking at it another way though, this is over 1300 patches that other people wrote which were able to land because these people took the time to look over the proposed changes—to triple-check the parachutes, as it were.

So thanks, Mesa reviewers. The project wouldn’t exist without all of you (and your generous employers, who should be blasting these metrics in the press when they talk about being good Open Source citizens).

But Also

I’d be remiss if I didn’t also mention the people working on Mesa CI. There’s no patch counts or review counts or anything to recognize everyone hard at work here, but CI is what keeps the triangles blasting out of your GPUs looking how they should.

Thanks, CI team. You’re awesome.

According to a recent metric, the Mesa CI infrastructure only had a 0.6% accidental failure rate. That’s pretty good considering how many thousands of jobs run every day.

June 28, 2021

This year, I decided to participate as speaker in esLibre 2021 conference. esLibre is a Spanish free software conference that covers a lot of different topics related to open-source projects: from the technical point of view to its social impact.

This year the conference had talks about game development with Godot, KDE, LibreOffice, Free Software in Universities among many others. Check out the program.

esLibre 2021

This is my first time participating in this conference and I enjoyed it a lot. Huge applause to the organization team for the enormous work of putting this edition together, for helping out the speakers with different testing days, and for their kindness in answering any question from me and other attendees. They did a superb job!

My talk was an introduction to Mesa where I covered things like where Mesa sits in the open-source graphics stack, a summary of what it does, the drivers implemented in Mesa, how our community is organized, and how to contribute to it. If you know Spanish, you can check it out here (PDF). But in case you want an English version of it, this talk is very similar to the one I gave at Ubucon Europe 2018.

My esLibre talk was recorded as well! I’ll update this post with the link to the recording once it is publicly available.

Enjoy it!

Introduction to Mesa

Hi all, hope you all are doing fine!

Today it's part 3 of my Outreachy Saga. It's been 5 weeks of my Outreachy internship, and everything is not sailing as smoothly as I would like! Why?? Because I had a little problem with my setup and was stuck for 2 days, unable to work until I managed to redo my setup correctly. As I said in my introduction post, one thing that I'm learning at my internship is "learning", because not everything goes as I would like; sometimes it is necessary to stop, breathe, and redo everything, and after redoing everything, it is so rewarding when things start to flow.

Today my week’s blog will be focusing on the Linux Kernel Community at which I’m interning and the project on which I’m working. So, let’s get started!

What is Linux Kernel?

A little context: the core of an Operating System (OS) is the kernel, which is responsible for integrating the computer's physical devices (hardware) with its programs (software). In a Linux OS this core is known as the Linux kernel; it is open source and freely available to the community. As I said in my introduction blog https://open-sourceress.com/outreachy-introduction/, the community is a set of people and companies that want to collaborate on the development of the system.

Thanks to these contributions, the Linux kernel has grown a lot: with over 8 million lines of code and well over 1000 contributors to each release, it is one of the largest and most active free software projects in existence. The kernel codebase has been logically broken down into a set of subsystems: networking, architecture-specific support (x86, ARM, MIPS, ...), memory management, video devices, real-time systems, among others. This makes it a little easier to manage contributions to the kernel, as most subsystems have a designated maintainer who verifies and accepts contributions before they are incorporated into the Linux kernel mainline.

About my project at Linux Kernel – “Improvements to DRI-devel (aka kernel GPU subsystem)“

In laptops, tablets, phones, and lots of other places, the GPU/display uses more silicon die space than everything else combined (humans are mostly visual people, after all). dri-devel (and the wider set of projects under the X.org Foundation's umbrella) is the community that makes this all work and shine.

In my project, I would like to create new features and better understand how the DRM core works. To achieve this goal, I chose these tasks: clean up the debugfs support and remove the custom dumb_map_offset implementations.

How can you contribute to Linux Kernel?

Anyone can contribute to the development of the kernel: just develop a patch, send it to the relevant subsystem's mailing list, wait for the community's feedback, fix whatever it takes, and that's it.

But yes, I know well that starting to contribute to the kernel is scary, especially for anyone who is a noob (beginner, newbie) in the Free Software development world and doesn't know where to start.

But there are several things and initiatives to help, for example:

Internet courses and materials:

A beginners guide to linux kernel development

Kernel newbies

First Patch tutorial

Write and Submit your first Linux kernel Patch

Internship programs:

Outreachy

Outreachy is a paid, remote internship program. Outreachy's goal is to support people from groups underrepresented in tech. We help newcomers to free software and open source make their first contributions. Outreachy provides internships to open source work. People apply from all around the world. Interns work remotely and are not required to move. Interns are paid a stipend of $5,500 USD for the three month internship. Interns have a $500 USD travel stipend to attend conferences or events. Interns work with experienced mentors from open source communities. Outreachy internship projects may include programming, user experience, documentation, illustration, graphical design, or data science. Interns often find employment after their internship with Outreachy sponsors or in jobs that use the skills they learned during their internship.

GSoC

Google Summer of Code is a global program focused on bringing more student developers into open source software development. Students work with an open source organization on a 10 week programming project during their break from school.

Study groups:

In Brazil, I met 2 of these groups

In Campinas - LKCamp

In São Paulo - FLUSP

It's scary, I know, but as you can see there are several initiatives and resources to help you start contributing to the Linux kernel. So don't be afraid: try to contribute to the Linux kernel and ask the community for help; there will always be someone who can help you!

Ah!!! And I almost forgot: if you need help, you can send me a message. I'm also just starting out in this world of kernel contribution, but I'll do my best to help. Besides showing my Outreachy internship progress, my goal with this blog is also to create content that helps beginners contribute to and develop the kernel, both in English and in Portuguese (my native language).

Thank you for following me so far, please feel free to comment! And stay tuned to the next chapters of this Saga!!!

Take care and have a great day!

June 23, 2021

BREAKING: THIS IS NO LONGER A ZINK BLOG

For today, at least.

Today, this blog is a Gallium blog. And it’s a momentous day indeed.

We all know what this is:

portal2-title.png

It’s a screenshot of Portal 2 with the Gallium HUD activated and VSync disabled.

But what driver is that underneath?

Well, for today’s blog it’s RadeonSI, the reference implementation of a Gallium driver.

And why is this, I can hear you all asking.

What if I told you that this screenshot with 10% higher FPS is also Portal 2 with VSync disabled on RadeonSI using one trick that graphics developers WON’T TELL YOU:

portal2-nine-title.png

Interested?

Coming Soon (Maybe, And Also Maybe Requiring Some Early 2000s-era Elbow Grease From Anyone Wanting To Try): Native Linux Source Games On Gallium Nine

We did it.

By assembling an elite team of individuals with a few minutes to spare here and there over the past week, including:

  • Josh Ashton, expert spammer of 🐸 emojis
  • Axel Davy, expert struct packer
  • Me, expert blogger
  • Is Such A Thing Even Possible? Why Yes, Yes It Is.

it is now (technically) possible to run DXVK-compatible Source Engine games through Gallium’s Nine state tracker, providing a native D3D9 runtime.

Is your Portal 2 in-game FPS sad and barely even 500 like this screenshot?

portal2-ingame.png

Why not jack it up to more than TWICE THAT NUMBER* with riced out, GPU-fan-shredding technology that Mesa Gallium drivers have been shipping for years?

portal2-nine-ingame.png

Disclaimer*

This post does not represent any form of official statement or address from Valve and is only a small project that was started out of boredom while I waited for CTS runs to finish.

This post also does not make any claims or statements regarding performance on other drivers, or performance comparisons using alternative graphics emulation layers, though whew, it sure would be interesting to see what those kinds of numbers look like!

June 21, 2021

The Khronos Group has released today a new version of the Vulkan specification that includes the VK_EXT_multi_draw extension. This new extension has been championed by Mike Blumenkrantz, contracted by Valve to work on Zink, an OpenGL implementation that’s part of Mesa and runs on top of Vulkan. Mike has been working very hard to make OpenGL-on-Vulkan performant and better, and came up with this extension to close an existing gap between the two APIs. As part of the ongoing collaboration between Igalia and Valve, I had the chance to participate in the release process by reviewing the specification text in depth, providing feedback and fixes, and writing a set of CTS tests to check conformance for drivers implementing the extension. As you can see in the contributors list, VK_EXT_multi_draw had input and feedback from more vendors. Special mention to Jason Ekstrand from Intel, who provided an initial review of the text, and Piers Daniell from NVIDIA, who was also involved since the early stages.

Thanks to VK_EXT_multi_draw, Vulkan will have equivalents to the glMultiDrawArrays and glMultiDrawElements functions from OpenGL. They’re called vkCmdDrawMultiEXT and vkCmdDrawMultiIndexedEXT. These two new functions allow recording a batch of draw commands in a command buffer using a single call, and they can be used in situations where an application would be recording a high number of draws without changing state. Although Vulkan already had mechanisms that allowed applications to record batches of draw commands in the form of indirect draws, these need the array of draw parameters to reside in a GPU-accessible buffer. VK_EXT_multi_draw, on the other hand, lets applications provide arrays of draw parameters using CPU memory.
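
To make that concrete, here is a minimal, hypothetical C sketch (the helper name record_batched_draws is made up for illustration) of recording a thousand non-indexed draws with a single call. It assumes VK_EXT_multi_draw has been enabled on the device and that the entry point was loaded with vkGetDeviceProcAddr.

#include <vulkan/vulkan.h>

/* Hypothetical helper, not from the extension text: batch 1000 draws into one
 * recorded command instead of 1000 vkCmdDraw calls. */
static void record_batched_draws(VkCommandBuffer cmd,
                                 PFN_vkCmdDrawMultiEXT cmd_draw_multi)
{
    VkMultiDrawInfoEXT draws[1000];
    for (uint32_t i = 0; i < 1000; i++) {
        draws[i].firstVertex = i * 3; /* illustrative parameters */
        draws[i].vertexCount = 3;
    }

    /* The parameter array lives in plain CPU memory, unlike indirect draws,
     * which require a GPU-accessible buffer. */
    cmd_draw_multi(cmd,
                   1000,                        /* drawCount */
                   draws,                       /* pVertexInfo */
                   1,                           /* instanceCount */
                   0,                           /* firstInstance */
                   sizeof(VkMultiDrawInfoEXT)); /* stride between entries */
}

vkCmdDrawMultiIndexedEXT works analogously, taking an array of VkMultiDrawIndexedInfoEXT entries.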

vkCmdDrawMultiEXT is essentially equivalent to calling vkCmdDraw multiple times in a row, and vkCmdDrawMultiIndexedEXT does the same for vkCmdDrawIndexed. To improve application performance and reduce CPU overhead, Vulkan drivers are allowed and encouraged to omit checks for API function arguments provided by applications (these correctness checks are provided by the Vulkan Validation Layers mainly during application development), and thanks to mechanisms like primary and secondary command buffers, Vulkan makes it possible to prepare sequences of commands for the GPU to execute using multiple threads and CPU cores. In this situation, you may be wondering how much of an improvement the new functions provide apart from saving a few microseconds processing some function calls. In other words, what’s the practical difference between calling vkCmdDraw a thousand times and batching a thousand draws using vkCmdDrawMultiEXT?

The answer is that most of the overhead of recording a draw command doesn’t come from having to call a function, but from the checks the implementation has to run when recording the command. These checks may not be related to correctness, but to additional actions and options that may need to be taken depending on the state of the command buffer in the moment the draw command is recorded. For example, see the calls to radv_before_draw when RADV processes a draw command (note: RADV is Mesa’s super nice free software Vulkan driver for AMD cards). These checks only need to run once when using the new functions. In benchmark-like scenarios using real drivers, Mike has been able to verify that, while the overhead varies per driver and some of them are lightweight and have minimal overhead, some mainstream drivers can double their draw call processing rate when using VK_EXT_multi_draw.

Mike has work-in-progress implementations for Mesa’s ANV and RADV drivers (the Vulkan drivers for Intel and AMD GPUs, respectively) which pass conformance and will hopefully land soon in Mesa’s main branch, and more drivers are expected to ship support for the extension in the near future.

We Did It

After months and months of the construction crews hammering away, VK_EXT_multi_draw has now been released for general use.

Will this suddenly make zink the fastest GPU driver in history?

Obviously.

Long-time readers will recall that I memed about this extension some time ago, and the numbers in a synthetic benchmark targeted at exactly this feature are phenomenal.

For more on the topic, we go to our Senior Multidraw Correspondent and my personal Khronos BFF, Ricardo Garcia, who has been following this story since the beginning.

June 18, 2021

Fast Friday

In short, an issue was filed recently about getting the Nine state tracker working with zink.

Was it the first? No..

Was it the first one this year? Yes.

Thus began a minutes-long, helter-skelter sequence of events to get Nine up and running, spread out over the course of a day or two. In need of a skilled finagler knowledgeable in the mysterium of Gallium state trackers, I contacted the only developer I know with a rockstar name, Axel Davy. We set out at dawn, and I strapped on my parachute. It was almost immediately that I heard a familiar call: there’s a build issue.

Next stop was crashing in unimplemented interface methods, with a stopover in flailing about wildly in TGSI (what even is this) before I arrived at my target:

nine.png

Ah, glorious triangles.

June 16, 2021

Hi all, hope you all are doing fine!

It's been 3 weeks since I started the Outreachy internship, I've done a lot but at the same time, I don't think I've done anything.

The first week was that week of setting up my machine, fighting with IRC to be able to send messages, and sending some information necessary for the Outreachy organizers. I also needed to configure my blog's RSS feed (yes, back when I was in doubt whether I wanted to work with backend or frontend, I decided to learn how to develop a blog). As I use Gatsby as the base of the blog, it was relatively easy to configure the RSS (Hooray!! One thing worked \o/)

To do my setup, my mentor Melissa gave me 2 tutorials as a base:

Setting up your QEMU VM

How to compile and install the Linux Kernel

And I needed to redo them a few times to understand how they worked (it's on my GIANT to-do list: a tutorial with my steps explaining where I had problems, one day it will come out...), because I was going to use a virtual machine to run the tests and see if I hadn't broken the kernel too much. After it was configured, I needed to test that everything was right, and for that I used this tutorial:

Experiment-one-iio-dummy

Ok, setup working and now?? I still needed to configure a couple of things: VKMS (a software-only model of a KMS driver that is useful for testing and for running X (or similar) on headless machines) and IGT (a test suite used specifically for debugging and development of the DRM drivers). For this I used the tutorial:

VKMS

I was stuck for a few days on this task: the tests failed, but why??? A configuration error? A tool installation error??

Nooo! It was my own mistake... I hadn't read the tutorial properly, and I missed the message that said:

“The tests need to be run without a composer, so you need to switch to text-only mode”

For that I only needed to do:

sudo systemctl isolate multi-user.target 

Ready! Solved, tests working \o/ and now what?

Now my task for the next few days is to “create a debugfs file for vkms using drm_state_dump()”, but that's a subject for the next post.

Thank you for following me so far, please feel free to comment! And stay tuned to the next chapters of this Saga called Outreachy!!

Take care and have a great day!

I Said I Would

A long, long time ago in a month far, far away I said I was going to blog about some improvements I’d been working on for zink. I blogged about some of them, but one was conspicuously absent from the original list:

  • make zink usable for gaming

There’s a lot that goes into this item. The post you’re reading now isn’t about to go so far as to claim that zink(-wip) is usable for gaming. No, that day is still far, far away. But this post is going to be the first step.

To begin with, a riddle: what change was made to zink between these two screenshots?

tr-slow.png

tr-zoom.png

That’s right, I put the punchline in the title.

A suballocator.

What Is A Suballocator?

A suballocator is a mechanism by which small blocks of memory are suballocated out of a larger one. For example, if I want to allocate a 64-byte chunk of memory, I could allocate it directly and get my block, or I could allocate a 4096-byte chunk of memory and then take 64 bytes out of it.
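
As a toy illustration of the concept (a generic sketch of the idea, not zink's or Gallium's actual code), here is what handing out 64-byte slices from a single 4096-byte slab might look like, where the slab stands in for one big, mapped-once driver allocation:

#include <stdint.h>
#include <stdio.h>

/* Generic sketch: one 4096-byte "slab" stands in for a single mapped driver
 * allocation, and 64-byte slices are handed out of it, so 64 tiny buffers
 * cost one allocation and one mapped region instead of 64.
 * 4096 / 64 = 64 slices, which conveniently fits in one 64-bit free mask. */

#define SLAB_SIZE  4096
#define SLICE_SIZE 64

struct slab {
    uint8_t  memory[SLAB_SIZE];
    uint64_t free_mask; /* bit i set == slice i is free */
};

static void slab_init(struct slab *s)
{
    s->free_mask = ~0ull; /* all 64 slices free */
}

/* Returns a 64-byte slice, or NULL if the slab is exhausted. */
static void *slab_alloc(struct slab *s)
{
    if (!s->free_mask)
        return NULL;
    int idx = __builtin_ctzll(s->free_mask); /* lowest free slice (GCC/Clang) */
    s->free_mask &= ~(1ull << idx);
    return s->memory + (size_t)idx * SLICE_SIZE;
}

static void slab_free(struct slab *s, void *ptr)
{
    size_t idx = ((uint8_t *)ptr - s->memory) / SLICE_SIZE;
    s->free_mask |= 1ull << idx;
}

int main(void)
{
    struct slab s;
    slab_init(&s);

    void *a = slab_alloc(&s);
    void *b = slab_alloc(&s);
    printf("two 64-byte buffers from one allocation, %d bytes apart\n",
           (int)((uint8_t *)b - (uint8_t *)a));

    slab_free(&s, b);
    slab_free(&s, a);
    return 0;
}

The real driver additionally has to pick an appropriately sized slab per request and avoid reusing slices the GPU is still reading, which is where the machinery described below comes in.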

When performance is involved, it's important to consider the time-cost of allocations, and so yes, it's useful to have already allocated another 63 instances of 64 bytes when I need a second one, but there's another, deeper issue that's also necessary to address, especially as it relates to gaming: 32-bit environments.

In a 32-bit process, the address space available is limited to 4GB regardless of how much memory is physically present, and some of that space is reserved for system resources and unavailable for general use. Any time a buffer or image is mapped by the driver in a process, this uses up address space in order to create an addressable region of memory that can be read or written. Once all the address space has been used up, no other resources can be mapped, and it becomes impossible to continue normal operations.

In short, the game crashes.

In Vulkan, and just generally in driver work, it's important to keep allocation sizes aligned to the preference of the hardware for a given usage; this amounts to minMemoryMapAlignment, which is 4096 bytes on many drivers. Similarly, vkGetBufferMemoryRequirements and vkGetImageMemoryRequirements return aligned memory sizes, so even if only 64 bytes are needed, 4096 bytes must still be allocated—4032 bytes unused. This ends up wasting tons of memory when an app is allocating lots of smaller regions, and it further wastes address space, since Vulkan prohibits memory from being mapped multiple times, meaning that each 64-byte buffer also wastes an additional 4032 bytes of address space.

While 4k of memory may seem like a small amount, and why would anyone ever need more than 256kb memory anyway, these allocations all add up, fast enough that zink runs out of address space in a 32-bit game like Tomb Raider within a couple of minutes.

Playable?

Probably not.

The Solution, As Always

If you’re working in Mesa, you basically have two options when you come across a new problem: delete some code or copy some code. It’s not often that I come across an issue which can’t be resolved by one of the two.

In this case, I had known for quite a while that the solution was going to be copying some code. Thus I entered the realm of Gallium's awesome auxiliary/pipebuffer, a fearsome component that had only been leveraged by one driver.

zink_bo.png

Yup, it was time to throw more galaxybrain.jpg code into the blender and see what came out. Ultimately, I was able to repurpose a lot of the core calculation code for sizing allocations, which saved me from having to do any kind of thinking or maffs. This let me cut down my suballocator implementation to a little under 700 lines, leaving much, much, much more space for bugs activities.

At a high level, here’s an overview of aux/pb:

  • call pb_cache_init to set up a memory cache
  • initialize slab allocators with pb_slabs_init
  • when allocating a new resource, determine if it can be slab allocated; if yes, use pb_slab_alloc to reuse/reclaim a slab allocation, otherwise manually allocate new memory
  • when destroying a resource, use pb_reference_with_winsys

There’s more under the hood, but it mostly boils down to filling in the interface functions to manage detecting whether resources are busy or can be reclaimed for reuse. The actual caching/reclaiming/reusing are all handled by aux/pb, meaning I was free to go about breaking everything with all the leftover time that I had.

Cultured users of zink-wip can now enjoy massively improved performance (and have already been enjoying it for the past month) in many apps. The rest of you get to sit around and watch while I bang my head against CI while ajax showers me with memes.

June 14, 2021

There’s a lot that has happened in the world of Zink since my last update, so let’s see if I can bring you up to date on the most important stuff.

Upstream development

Gosh, when I last blogged about Zink, it hadn’t even landed upstream in Mesa yet! Well, by now it’s been upstream for quite a while, and most development has moved there.

At the time of writing, we have merged 606 merge-requests labeled “zink”. The current tip of Mesa’s main branch totals 1717 commits touching the src/gallium/drivers/zink/ sub-folder, written by 42 different contributors. That’s pretty awesome in my eyes; Zink has truly become a community project!

Another noteworthy change is that Mike Blumenkrantz has come aboard the project, and has churned out an incredible amount of improvements to Zink! He got hired by Valve to work on Zink (among other things), and is now the most prolific contributor, with more than twice as many commits as I have written.

If you want a job in Open Source graphics, Zink has a proven track-record as a job-creator! :smile:

In addition to Mike, there’s some other awesome people who have been helping out lately.

Half-Life 2 running with Zink.

OpenGL 4.6 support

Thanks to a lot of hard work by Mike, assisted by Dave Airlie and Adam Jackson, both of Red Hat, Zink is now able to expose the OpenGL 4.6 (Core Profile) feature set, given enough Vulkan features! :tada:

Please note that this doesn’t mean that Zink is yet a conformant implementation; there are some details left to be ironed out before we can claim that. In particular, we need to pass the conformance tests and submit a conformance report to Khronos. We’re not there yet.

I’m also happy to see that Zink is currently at the top of MesaMatrix (together with LLVMpipe, i965 and RadeonSI), reporting a total of 160 OpenGL extensions at the time of writing!

In theory, that means you can run any OpenGL application you can think of on top of Zink. Mike is hard at work testing the entire Steam game library, and things are working pretty well.

Is this the end of the line for Zink? Are we done now? Not at all! :laughing:

OpenGL compatibility profile

We’re still stuck at OpenGL 3.0 for compatibility contexts, mainly due to lack of testing. There’s a lot of features that need to work together in relatively complicated ways for this to work for us.

Note that this only matters for applications that rely on legacy OpenGL features. Modern OpenGL programs get OpenGL 4.6 support, as mentioned previously.

I don’t think this is going to be a big deal to enable, but I haven’t spent time on it.

OpenGL ES 3.1 support

Similar to the OpenGL 4.6 support, we’re now able to expose the OpenGL ES 3.1 feature set. This is again thanks to a lot of hard work by Mike and the gang.

Why not OpenGL ES 3.2? This comes down to the GL_KHR_blend_equation_advanced feature. Mike blogged about the issue a while ago.

Lavapipe and continuous integration

To prevent regressions, we’ve started testing Zink on the Mesa CI system for every change. This is made possible thanks to Lavapipe, a Vulkan software implementation in Mesa that reuses the rasterizer from LLVMpipe.

This means we can run tests on virtual cloud machines without having to depend on unreliable hardware. :robot:

At the time of writing, we’re only exposing OpenGL 4.1 on top of Lavapipe, due to some lacking features. But we have patches in the works to bring this up to OpenGL 4.5, and OpenGL 4.6 probably won’t be far off when that lands.

Windows support

Basic support for Zink on Microsoft Windows has landed. This isn’t particularly useful at the moment, because we need better window-system integration to get anywhere near reasonable performance. But it’s there.

macOS support

Thanks to work by Duncan Hopkins of The Foundry, there’s also some support for macOS. This uses MoltenVK as the Vulkan implementation, meaning that we also support the Vulkan Portability Extension to some degree.

This support isn’t quite as drop-in as on other platforms, because it’s completely lacking window-system integration. But it seems to work for the use-cases they have at The Foundry, so it’s worth mentioning as well.

Driver support

Beyond this, Igalia has brought up Zink on the V3DV driver, and I’ve heard some whispers that there’s some people running Zink on top of Turnip, an open-source Vulkan driver for recent Qualcomm Adreno GPUs.

I’ve heard some people have some success getting things running on NVIDIA, but there’s a few obvious problems in the way there due to the lack of proper DRI support… Which brings us to:

Window System Integration

Another awesome new development is that Adam is working on Penny. So, what’s Penny?

Penny is another way of bringing up Zink, on systems without DRI support. It works as a dedicated GLX integration that uses the VK_KHR_swapchain extension to integrate properly with the native Vulkan driver’s window-system integration instead of Mesa baking its own.

This solves a lot of small, nasty issues in the DRI code-path. I’ll say the magic “implicit synchronization” word, and hope that scares away anyone wondering what it’s about.

Performance

A lot more has happened on the performance front as well, again all thanks to Mike. However, much of this is still out-of-tree, and waiting in Mike’s zink-wip branch.

So instead, I suggest you check out Mike’s blog for the latest performance information (and much more up-to-date info on Zink). There’s been a lot going on, and I’m sure there’s even more to come!

Closing words

I think this should cover the most interesting bits of development.

On a personal note, I recently became a dad for the first time, and as a result I’ll be away for a while on paternity leave, starting early this fall. Luckily, Zink is in good hands with Mike and the rest of the upstream community taking care of things.

I would like to again plug Mike’s blog as a great source of Zink-related news, if you’re not already following it. He posts a lot more frequently than I do, and he’s also an epic meme master, so it’s all great fun!

(Added a section on load/store pairs on June 14th)

This question probably seems absurd. An unoptimized memcpy is a simple loop that copies bytes. How hard can that be? Well...

There's a fascinating thread on llvm-dev started by George Mitenkov proposing a new family of "byte" types. I found the proposal and discussion difficult to follow. In my humble opinion, this is because the proposal touches some rather subtle and underspecified aspects of LLVM IR semantics, and rather than address those fundamentals systematically, it jumps right into the minutiae of the instruction set. I look forward to seeing how the proposal evolves. In the meantime, this article is a byproduct of me attempting to digest the problem space.

Here is a fairly natural way to (attempt to) implement memcpy in LLVM IR:

define void @memcpy(i8* %dst, i8* %src, i64 %n) {
entry:
  %dst.end = getelementptr i8, i8* %dst, i64 %n
  %isempty = icmp eq i64 %n, 0
  br i1 %isempty, label %out, label %loop

loop:
  %src.loop = phi i8* [ %src, %entry ], [ %src.next, %loop ]
  %dst.loop = phi i8* [ %dst, %entry ], [ %dst.next, %loop ]
  %ch = load i8, i8* %src.loop
  store i8 %ch, i8* %dst.loop
  %src.next = getelementptr i8, i8* %src.loop, i64 1
  %dst.next = getelementptr i8, i8* %dst.loop, i64 1
  %done = icmp eq i8* %dst.next, %dst.end
  br i1 %done, label %out, label %loop

out:
  ret void
}

Unfortunately, the copy that is written to the destination is not a perfect copy of the source.

Hold on, I hear you think, each byte of memory holds one of 256 possible bit patterns, and this bit pattern is perfectly copied by the `load`/`store` sequence! The catch is that in LLVM's model of execution, a byte of memory can in fact hold more than just one of those 256 values. For example, a byte of memory can be poison, which means that there are at least 257 possible values. Poison is forwarded perfectly by the code above, so that's fine. The trouble starts because of pointer provenance.


What and why is pointer provenance?

From a machine perspective, a pointer is just an integer that is interpreted as a memory address.

For the compiler, alias analysis -- that is, the ability to prove that different pointers point at different memory addresses -- is crucial for optimization. One basic tool in the alias analysis toolbox is to recognize that if pointers point into different "memory objects" -- different stack or heap allocations -- then they cannot alias.

Unfortunately, many pointers are obtained via getelementptr (GEP) using dynamic (non-constant) indices. These dynamic indices could be such that the resulting pointer points into a different memory object than the base pointer. This makes it nearly impossible to determine at compile time whether two pointers point into the same memory object or not.

Which is why there is a rule which says (among other things) that if a pointer P obtained via GEP ends up going out-of-bounds and pointing into a different memory object than the pointer on which the GEP was based, then dereferencing P is undefined behavior even though the pointer's memory address is valid from the machine perspective.

As a corollary, a situation is possible in which there are two pointers whose underlying memory address is identical but whose provenance is different. In that case, it's possible that one of them can be dereferenced while dereferencing the other is undefined behavior.

This only makes sense if, in the formal semantics of LLVM IR, pointer values carry more information than just an integer interpreted as a memory address. They also carry provenance information, which is essentially the set of memory objects that can be accessed via this pointer and any pointers derived from it.


Bytes in memory carry provenance information

What is the provenance of a pointer that results from a load instruction? In a clean operational semantics, the load must derive this provenance from the values stored in memory.

If bytes of memory can only hold one of 256 bit patterns (or poison), that doesn't give us much to work with. We could say that the provenance of the pointer is "empty", meaning the pointer cannot be used to access any memory objects -- but that's clearly useless. Or we could say that the provenance of the pointer is "all", meaning the pointer (or pointers derived from it) can be freely used to access all memory objects, assuming the underlying address is adjusted appropriately. That isn't much better.[0]

Instead, we must say that -- as far as LLVM IR semantics are concerned -- each byte of memory holds pointer provenance information in addition to its i8 content. The provenance information in memory is written by pointer store, and pointer load uses it to reconstruct the original provenance of the loaded pointer.

What happens to provenance information in non-pointer load/store? A load can simply ignore the additional information in memory. For store, I see 3 possible choices:

1. Leave the provenance information that already happens to be in memory unmodified.
2. Set the provenance to "empty".
3. Set the provenance to "all".

Looking back at our attempt to implement memcpy, there is no choice which results in a perfect copy. All of the choices lose provenance information.

Without major changes to LLVM IR, only the last choice is potentially viable because it is the only choice that allows dereferencing pointers that are loaded from the memcpy destination.

Should we care about losing provenance information?

Without major changes to LLVM IR, we can only implement a memcpy that loses provenance information during the copy.

So what? Alias analysis around memcpy and code like it ends up being conservative, but reasonable people can argue that this doesn't matter. The burden of evidence lies on whoever wants to make a large change here in order to improve alias analysis.

That said, we cannot just call it a day and go (or stay) home either, because there are related correctness issues in LLVM today, e.g. bug 37469 mentioned in the initial email of that llvm-dev thread.

Here's a simpler example of a correctness issue using our hand-coded memcpy:

define i32 @sample(i32** %pp) {
  %tmp = alloca i32*
  %pp.8 = bitcast i32** %pp to i8*
  %tmp.8 = bitcast i32** %tmp to i8*
  call void @memcpy(i8* %tmp.8, i8* %pp.8, i64 8)
  %p = load i32*, i32** %tmp
  %x = load i32, i32* %p
  ret i32 %x
}

A transform that should be possible is to eliminate the memcpy and temporary allocation:

define i32 @sample(i32** %pp) {
  %p = load i32*, i32** %pp
  %x = load i32, i32* %p
  ret i32 %x
}

This transform is incorrect because it introduces undefined behavior.

To see why, remember that this is the world where we agree that integer stores write an "all" provenance to memory, so %p in the original program has "all" provenance. In the transformed program, this may no longer be the case. If @sample is called with a pointer that was obtained through an out-of-bounds GEP whose resulting address just happens to fall into a different memory object, then the transformed program has undefined behavior where the original program didn't.

We could fix this correctness issue by introducing an unrestrict instruction which elevates a pointer's provenance to the "all" provenance:

define i32 @sample(i32** %pp) {
  %p = load i32*, i32** %pp
  %q = unrestrict i32* %p
  %x = load i32, i32* %q
  ret i32 %x
}

Here, %q has "all" provenance and therefore no undefined behavior is introduced.

I believe that (at least for address spaces that are well-behaved?) it would be correct to fold inttoptr(ptrtoint(x)) to unrestrict(x). The two are really the same.

For that reason, unrestrict could also be used to fix the above-mentioned bug 37469. Several folks in the bug's discussion stated the opinion that the bug is caused by incorrect store forwarding that should be weakened via inttoptr(ptrtoint(x)). unrestrict(x) is simply a clearer spelling of the same idea.


A dead end: integers cannot have provenance information

A natural thought at this point is that the situation could be improved by adding provenance information to integers. This is technically correct: our hand-coded memcpy would then produce a perfect copy of the memory contents.

However, we would get into serious trouble elsewhere because global value numbering (GVN) and similar transforms become incorrect: two integers could compare equal using the icmp instruction, but still be different because of different provenance. Replacing one by the other could result in miscompilation.

GVN is important enough that adding provenance information to integers is a no-go.

I suspect that the unrestrict instruction would allow us to apply GVN to pointers, at the cost of making later alias analysis more conservative and sprinkling unrestrict instructions that may inhibit other transforms. I have no idea what the trade-off is on that.


The "byte" types: accurate representation of memory contents

With all the above in mind, I can see the first-principles appeal of the proposed "byte" types. They allow us to represent the contents of memory accurately in SSA values, and so they fill a real gap in the expressiveness of LLVM IR.

That said, the software development cost of adding a whole new family of types to LLVM is very high, so it better be justified by more than just aesthetics.

Our hand-coded memcpy can be turned into a perfect copier with straightforward replacement of i8 by b8:

define void @memcpy(b8* %dst, b8* %src, i64 %n) {
entry:
  %dst.end = getelementptr b8, b8* %dst, i64 %n
  %isempty = icmp eq i64 %n, 0
  br i1 %isempty, label %out, label %loop

loop:
  %src.loop = phi b8* [ %src, %entry ], [ %src.next, %loop ]
  %dst.loop = phi b8* [ %dst, %entry ], [ %dst.next, %loop ]
  %ch = load b8, b8* %src.loop
  store b8 %ch, b8* %dst.loop
  %src.next = getelementptr b8, b8* %src.loop, i64 1
  %dst.next = getelementptr b8, b8* %dst.loop, i64 1
  %done = icmp eq b8* %dst.next, %dst.end
  br i1 %done, label %out, label %loop

out:
  ret void
}

Looking at the concrete choices made in the proposal, I disagree with some of them.

Memory should not be typed. In the proposal, storing an integer always results in different memory contents than storing a pointer (regardless of its provenance), and implicitly trying to mix pointers and integers is declared to be undefined behavior. In other words, a sequence such as:

store i64 %x, i64* %p
%q = bitcast i64* %p to i8**
%y = load i8*, i8** %q

... is undefined behavior under the proposal instead of being effectively inttoptr(%x). That seems fine for C/C++, but is it going to be fine for other frontends?

The corresponding distinction between bytes-as-integers and bytes-as-pointers complicates the proposal overall, e.g. it forces them to add a bytecast instruction.

Conversely, the benefits of the distinction are unclear to me. One benefit appears to be guaranteed non-aliasing between pointer and non-pointer memory accesses, but that is a form of type-based alias analysis which in LLVM should idiomatically be done via TBAA metadata. (Update: see the addendum for another potential argument in favor of typed memory.)

So let's keep memory untyped, please.

Bitwise poison in byte values makes me really nervous due to the arbitrary deviation from how poison works in other types. I don't see any justification for it in the proposal. I can kind of see how one could be motivated by implementing memcpy with vector intrinsics operating on, for example, <8 x b32>, but a simpler solution would be to just use <32 x b8> instead. And if poison is indeed bitwise, then certainly pointer provenance would also have to be bitwise!

Finally, no design discussion is complete without a little bit of bike-shedding. I believe the name "byte" is inspired by C++'s std::byte, but given that types such as b256 are possible, this name would forever be a source of confusion. Naming is hard, and I think we should at least try to look for a better one. Let me kick off the brainstorming by suggesting we think of them as "memory content" values, because that's what they are. The types could be spelled m8, m32, etc. in IR assembly.

A variation: adding a pointer provenance type

In the llvm-dev thread, Jeroen Dobbelaere points out work being done to introduce explicit `ptr_provenance` operands on certain instructions, in service of C99's restrict keyword. I haven't properly digested this work, but it inspired the thoughts of this section.

Values of the proposed byte types have both a bit pattern and a pointer provenance. Do we really need to have both pieces of information in the same SSA value? We could instead split them up into an integer bit pattern value and a pointer provenance value with an explicit provenance type. Loads of integers could read out the provenance information stored in memory and provide it as a secondary result. Similarly, stores of integers could accept the desired provenance to be stored in memory as a secondary data operand. This would allow us to write a perfect memcpy by replacing the core load/store sequence with something like:

%ch, %provenance = load_with_provenance i8, i8* %src
store_with_provenance i8 %ch, provenance %provenance, i8* %dst

The syntax and instruction names in the example are very much straw men. Don't take them too seriously, especially because LLVM IR doesn't currently allow multiple result values.

Interestingly, this split allows the derivation of pointer provenance to follow a different path than the calculation of the pointer's bit pattern. This in turn allows us in principle to perform GVN on pointers without being conservative for alias analysis.

One of the steps in bug 37469 is not quite GVN, but morally similar. Simplifying a lot, the original program sequence:

%ch1 = load i8, i8* %p1
%ch2 = load i8, i8* %p2
%eq = icmp eq i8 %ch1, %ch2
%ch = select i1 %eq, i8 %ch1, i8 %ch2
store i8 %ch, i8* %p3

... is transformed into:

%ch2 = load i8, i8* %p2
store i8 %ch2, i8* %p3

This is correct for the bit patterns being loaded and stored, but the program also indirectly relies on pointer provenance of the data. Of course, there is no pointer provenance information being copied here because i8 only holds a bit pattern. However, with the "byte" proposal, all the i8s would be replaced by b8s, and then the transform becomes incorrect because it changes the provenance information.

If we split the proposed use of b8 into a use of i8 and explicit provenance values, the original program becomes:

%ch1, %prov1 = load_with_provenance i8, i8* %p1
%ch2, %prov2 = load_with_provenance i8, i8* %p2
%eq = icmp eq i8 %ch1, %ch2
%ch = select i1 %eq, i8 %ch1, i8 %ch2
%prov = select i1 %eq, provenance %prov1, provenance %prov2
store_with_provenance i8 %ch, provenance %prov, i8* %p3

This could be transformed into something like:

%prov1 = load_only_provenance i8* %p1
%ch2, %prov2 = load_with_provenance i8, i8* %p2
%prov = merge provenance %prov1, %prov2
store_with_provenance i8 %ch2, provenance %prov, i8* %p3

... which is just as good for code generation but loses only very little provenance information.

Aside: loop idioms

Without major changes to LLVM IR, a perfect memcpy cannot be implemented because pointer provenance information is lost.

Nevertheless, one could still define the @llvm.memcpy intrinsic to be a perfect copy. This helps memcpys in the original source program be less conservative in terms of alias analysis. However, it also makes it incorrect to replace a memcpy loop idiom with a use of @llvm.memcpy: without adding unrestrict instructions, the replacement may introduce undefined behavior; and there is no way to bound the locations where such unrestricts may be needed.

We could augment @llvm.memcpy with an immediate argument that selects its provenance behavior.

In any case, one can argue that bug 37469 is really a bug in the loop idiom recognizer. It boils down to the details of how everything is defined, and unfortunately, these weird corner cases are currently underspecified in the LangRef.

Conclusion

We started with the question of whether memcpy can be implemented in LLVM IR. The answer is a qualified Yes. It is possible, but the resulting copy is imperfect because pointer provenance information is lost. This has surprising implications which in turn happen to cause real miscompilation bugs -- although those bugs could be fixed even without a perfect memcpy.

The "byte" proposal has a certain aesthetic appeal because it fixes a real gap in the expressiveness of LLVM IR, but its software engineering cost is large and I object to some of its details. There are also alternatives to consider.

The miscompilation bugs obviously need to be fixed, but they can be fixed much less intrusively, albeit at the cost of more conservative alias analysis in the affected places. It is not clear to me whether improving alias analysis justifies the more complex solutions.

I would like to understand better how all of this interacts with the C99 restrict work. That work introduces mechanisms for explicitly talking about pointer provenance in the IR, which may allow us to kill two birds with one stone.

In any case, this is a fascinating topic and discussion, and I feel like we're only at the beginning.


Addendum: storing back previously loaded integers

(Added this section on June 14th)

Harald van Dijk on Phabricator and Ralf Jung on llvm-dev, referring to a Rust issue, explicitly and implicitly point out a curious issue with loading and storing integers.

Here is Harald's example:

define i8* @f(i8* %p) {
  %buf = alloca i8*
  %buf.i32 = bitcast i8** %buf to i32*
  store i8* %p, i8** %buf
  %i = load i32, i32* %buf.i32
  store i32 %i, i32* %buf.i32
  %q = load i8*, i8** %buf
  ret i8* %q
}

There is a pair of load/store of i32 which is fully redundant from a machine perspective and so we'd like to optimize that away, after which it becomes obvious that the function really just returns %p -- at least as far as bit patterns are concerned.

However, in a world where memory is untyped but has provenance information, this optimization is incorrect because it can introduce undefined behavior: the load/store of i32 resets the provenance information in memory to "all", so that the original function returns an unrestricted version of %p. This is no longer the case after the optimization.

There are at least two possible ways of resolving this conflict.

We could define memory to be typed, in the sense that each byte of memory remembers whether it was most recently stored as a pointer or a non-pointer. A load with the wrong type returns poison. In that case, the example above returns poison before the optimization (because %i is guaranteed to be poison). After the optimization it returns non-poison, which is an acceptable refinement, so the optimization is correct.

The alternative is to keep memory untyped and say that directly eliminating the i32 store in the example is incorrect.

We are facing a tradeoff that depends on how important that optimization is for performance.

Two observations to that end. First, the more common case of dead store elimination is one where there are multiple stores to the same address in a row, and we remove all but the last one of them. That more common optimization is unaffected by provenance issues either way.

Second, we can still perform store forwarding / peephole optimization across such load/store pairs, as long as we are careful to introduce unrestrict where needed. The example above can be optimized via store forwarding to:

define i8* @f(i8* %p) {
  %buf = alloca i8*
  %buf.i32 = bitcast i8** %buf to i32*
  store i8* %p, i8** %buf
  %i = load i32, i32* %buf.i32
  store i32 %i, i32* %buf.i32
  %q = unrestrict i8* %p
  ret i8* %q
}

We can then dead-code eliminate the bulk of the function and obtain:

define i8* @f(i8* %p) {
  %q = unrestrict i8* %p
  ret i8* %q
}

... which is as good as it can possibly get.

So, there is a good chance that preventing this particular optimization is relatively cheap in terms of code quality, and the gain in overall design simplicity may well be worth it.




[0] We could also say that the loaded pointer's provenance is magically the memory object that happens to be at the referenced memory address. Either way, provenance would become a useless no-op in most cases. For example, mem2reg would have to insert unrestrict instructions (defined later) everywhere because pointers become effectively "unrestricted" when loaded from alloca'd memory.

June 13, 2021

In an earlier article I showed how reading from VRAM with the CPU can be very slow. It turns out, however, that there are ways to make it less slow.

The key to this are instructions with non-temporal hints, in particular VMOVNTDQA. The Intel Instruction Manual says the following about this instruction:

“MOVNTDQA loads a double quadword from the source operand (second operand) to the destination operand (first operand) using a non-temporal hint if the memory source is WC (write combining) memory type. For WC memory type, the nontemporal hint may be implemented by loading a temporary internal buffer with the equivalent of an aligned cache line without filling this data to the cache. Any memory-type aliased lines in the cache will be snooped and flushed. Subsequent MOVNTDQA reads to unread portions of the WC cache line will receive data from the temporary internal buffer if data is available. “ (Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2)

This sounds perfect for our VRAM and WC system memory buffers: we typically only read 16 bytes per instruction, and this allows us to read entire cachelines at a time.

It turns out that Mesa already implements a streaming memcpy using these instructions, so all we had to do was throw that into our benchmark and write a corresponding memcpy that does non-temporal stores to benchmark writing to these memory regions.
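
For reference, here is roughly what such a copy looks like with SSE intrinsics. This is a simplified sketch rather than Mesa's actual implementation: it assumes 16-byte-aligned pointers, a size that is a multiple of 64 bytes, and a CPU with SSE4.1 (compile with -msse4.1).

#include <immintrin.h>
#include <stddef.h>

/* Simplified sketch of a streaming copy out of WC/uncached memory.
 * MOVNTDQA (_mm_stream_load_si128) pulls a whole cache line into the CPU's
 * write-combining buffers without filling the cache, and MOVNTDQ
 * (_mm_stream_si128) streams the data back out the same way. */
static void streaming_memcpy(void *restrict dst, const void *restrict src, size_t n)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    for (size_t i = 0; i < n / 64; i++) {
        /* Four 16-byte non-temporal loads cover one 64-byte cache line. */
        __m128i a = _mm_stream_load_si128((__m128i *)&s[i * 4 + 0]);
        __m128i b = _mm_stream_load_si128((__m128i *)&s[i * 4 + 1]);
        __m128i c = _mm_stream_load_si128((__m128i *)&s[i * 4 + 2]);
        __m128i e = _mm_stream_load_si128((__m128i *)&s[i * 4 + 3]);

        _mm_stream_si128(&d[i * 4 + 0], a);
        _mm_stream_si128(&d[i * 4 + 1], b);
        _mm_stream_si128(&d[i * 4 + 2], c);
        _mm_stream_si128(&d[i * 4 + 3], e);
    }
    _mm_sfence(); /* make the non-temporal stores globally visible */
}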

As a reminder, we look into three allocation types that are exposed by the amdgpu Linux kernel driver:

  • VRAM. This lives on the GPU and is mapped with Uncacheable Speculative Write Combining (USWC) on the CPU. This means that accesses from the CPU are not cached, but writes can be write-combined.

  • Cacheable system memory. This is system memory that has caching enabled on the CPU, and there is cache snooping to ensure the memory is coherent between the CPU and GPU (up to the top-level caches; the GPU caches do not participate in the coherence).

  • USWC system memory. This is system memory that is mapped with Uncacheable Speculative Write Combining on the CPU. This can lead to slight performance benefits compared to cacheable system memory due to lack of cache snooping.

Furthermore this still uses a RX 6800 XT + a 2990WX with 4 channel 3200 MT/s RAM.

method (MiB/s)                VRAM   Cacheable System Memory   USWC System Memory
read via memcpy                 15                     11488                  137
write via memcpy             10028                     18249                11480
read via streaming memcpy      756                      6719                 4409
write via streaming memcpy   10550                     14737                11652

Using this memcpy implementation we get significantly better performance in uncached memory situations: 50x for VRAM and 26x for USWC system memory. If this is a significant bottleneck in your workload, this can be a game-changer. Or, if you were using SDMA to avoid this hit, you might be able to do things at significantly lower latency. That said, it is not at a level where the copy cost no longer matters: for big copies, using DMA can still be a significant win.

Note that I initially gave an explanation of why the non-temporal loads should be faster, but the performance increases are significantly above what something that merely fiddles with loading entire cachelines would achieve. I have not dug into the why of the performance increase.

DMA performance

I have been claiming DMA is faster for CPU readbacks of VRAM in both this article and the previous article on the topic. One might ask how fast DMA is then. To demonstrate this I benchmarked VRAM<->Cacheable System Memory copies using the SDMA hardware block on Radeon GPUs.

Note that there is a significant overhead per copy here due to submitting work to the GPU, so I will show results vs. copy size. The rate is measured while doing a wait after each individual copy and taking the wall clock time, as these use cases tend to be latency sensitive and hence batching is not too interesting.

copy size   copy from VRAM (MiB/s)   copy to VRAM (MiB/s)
4 KiB                           62                     63
16 KiB                         245                    240
64 KiB                         953                   1015
256 KiB                       3106                   3082
1 MiB                         6715                   7281
4 MiB                         9737                  11636
16 MiB                       12129                  12158
64 MiB                       13041                  12975
256 MiB                      13429                  13387

This shows that for reads DMA is faster than a normal memcpy at 4 KiB and faster than a streaming memcpy at 64 KiB. Of course one still needs to do the CPU access at that point, but at both these thresholds, even with an additional CPU memcpy, the total process should still be fast with DMA.

June 10, 2021

TL;DR: Tag your GPT partitions with the right, descriptive partition types, and the world will become a better place.

A number of years ago we started the Discoverable Partitions Specification which defines GPT partition type UUIDs and partition flags for the various partitions Linux systems typically deal with. Before the specification all Linux partitions usually just used the same type, basically saying "Hey, I am a Linux partition" and not much else. With this specification the GPT partition type, flags and label system becomes a lot more expressive, as it can tell you:

  1. What kind of data a partition contains (i.e. is this swap data, a file system or Verity data?)
  2. What the purpose/mount point of a partition is (i.e. is this a /home/ partition or a root file system?)
  3. What CPU architecture a partition is intended for (i.e. is this a root partition for x86-64 or for aarch64?)
  4. Shall this partition be mounted automatically? (i.e. without specifically being configured via /etc/fstab)
  5. And if so, shall it be mounted read-only?
  6. And if so, shall the file system be grown to its enclosing partition size, if smaller?
  7. Which partition contains the newer version of the same data (i.e. multiple root file systems, with different versions)

By embedding all of this information inside the GPT partition table, disk images become self-descriptive: without requiring any other source of information (such as /etc/fstab), if you look at a compliant GPT disk image it is clear how the image is put together and how it should be used and mounted. This self-descriptiveness in particular breaks one philosophical weirdness of traditional Linux installations: the original source of information about which file system is the root file system is typically embedded in the root file system itself, in /etc/fstab. Thus, in a way, in order to know what the root file system is you need to know what the root file system is. 🤯 🤯 🤯

(Of course, the way this recursion is traditionally broken up is by then copying the root file system information from /etc/fstab into the boot loader configuration, resulting in a situation where the primary source of information for this — i.e. /etc/fstab — is actually mostly irrelevant, and the secondary source — i.e. the copy in the boot loader — becomes the configuration that actually matters.)

Today, the GPT partition type UUIDs defined by the specification have been adopted quite widely, by distributions and their installers, as well as a variety of partitioning tools and other tools.

In this article I want to highlight how the various tools the systemd project provides make use of the concepts the specification introduces.

But before we start with that, let's underline why tagging partitions with these descriptive partition type UUIDs (and the associated partition flags) is a good thing, besides the philosophical points made above.

  1. Simplicity: in particular OS installers become simpler — adjusting /etc/fstab as part of the installation is not necessary anymore, as the partitioning step already put all information into place for assembling the system properly at boot. i.e. installing doesn't mean that you always have to get both fdisk and /etc/fstab into place; the former suffices entirely.

  2. Robustness: since partition tables mostly remain static after installation the chance of corruption is much lower than if the data is stored in file systems (e.g. in /etc/fstab). Moreover by associating the metadata directly with the objects it describes the chance of things getting out of sync is reduced. (i.e. if you lose /etc/fstab, or forget to rerun your initrd builder you still know what a partition is supposed to be just by looking at it.)

  3. Programmability: if partitions are self-descriptive it's much easier to automatically process them with various tools. In fact, this blog story is mostly about that: various systemd tools can naturally process disk images prepared like this.

  4. Alternative entry points: on traditional disk images, the boot loader needs to be told which kernel command line option root= to use, which then provides access to the root file system, where /etc/fstab is then found which describes the rest of the file systems. Where precisely root= is configured for the boot loader highly depends on the boot loader and distribution used, and is typically encoded in a Turing complete programming language (Grub…). This makes it very hard to automatically determine the right root file system to use, to implement alternative entry points to the system. By alternative entry points I mean other ways to boot the disk image, specifically for running it as a systemd-nspawn container — but this extends to other mechanisms where the boot loader may be bypassed to boot up the system, for example qemu when configured without a boot loader.

  5. User friendliness: it's simply a lot nicer for the user looking at a partition table if the partition table explains what is what, instead of just saying "Hey, this is a Linux partition!" and nothing else.

Uses for the concept

Now that we have cleared up the Why?, let's have a closer look at how this is currently used and exposed in systemd's various components.

Use #1: Running a disk image in a container

If a disk image follows the Discoverable Partition Specification then systemd-nspawn has all it needs to just boot it up. Specifically, if you have a GPT disk image in a file foobar.raw and you want to boot it up in a container, just run systemd-nspawn -i foobar.raw -b, and that's it (you can specify a block device like /dev/sdb too if you like). It becomes easy and natural to prepare disk images that can be booted either on a physical machine, inside a virtual machine manager or inside such a container manager: the necessary meta-information is included in the image, easily accessible before actually looking into its file systems.

Use #2: Booting an OS image on bare-metal without /etc/fstab or kernel command line root=

If a disk image follows the specification, in many cases you can remove /etc/fstab (or never even install it) — as the basic information needed is already included in the partition table. The systemd-gpt-auto-generator logic implements automatic discovery of the root file system as well as all auxiliary file systems. (Note that the former requires an initrd that uses systemd; some more conservative distributions unfortunately do not support that yet.) Effectively this means you can boot up a kernel/initrd with an entirely empty kernel command line, and the initrd will automatically find the root file system (by looking for a suitably marked partition on the same drive the EFI System Partition was found on).

(Note that if /etc/fstab or root= exist and contain relevant information, they always take precedence over the automatic logic. This is in particular useful to tweak things by specifying additional mount options and such.)

Use #3: Mounting a complex disk image for introspection or manipulation

The systemd-dissect tool may be used to introspect and manipulate OS disk images that implement the specification. If you pass the path to a disk image (or block device) it will extract various bits of useful information from the image (e.g. what OS is this? what partitions to mount?) and display it.

With the --mount switch a disk image (or block device) can be mounted to some location. This is useful for looking at what is inside it, or changing its contents. This will dissect the image and then automatically mount all contained file systems matching their GPT partition description to the right places, so that you could subsequently chroot into it. (But why chroot if you can just use systemd-nspawn? 😎)

Use #4: Copying files in and out of a disk image

The systemd-dissect tool also has two switches --copy-from and --copy-to which allow copying files out of or into a compliant disk image, taking all included file systems and the resulting mount hierarchy into account.

Use #5: Running services directly off a disk image

The RootImage= setting in service unit files accepts paths to compliant disk images (or block device nodes), and can mount them automatically, running service binaries directly off them (in chroot() style). In fact, this is the base for the Portable Service concept of systemd.

Use #6: Provisioning disk images

systemd provides various tools that can run operations provisioning disk images in an "offline" mode. Specifically:

systemd-tmpfiles

With the --image= switch systemd-tmpfiles can directly operate on a disk image, and for example create all directories and other inodes defined in its declarative configuration files included in the image. This can be useful for example to set up the /var/ or /etc/ tree according to such configuration before first boot.

systemd-sysusers

Similarly, the --image= switch of systemd-sysusers tells the tool to read the declarative system user specifications included in the image and synthesize system users from them, writing them to the /etc/passwd (and related) files in the image. This is useful for provisioning these users before the first boot, for example to ensure UID/GID numbers are pre-allocated and such allocations are not delayed until first boot.

systemd-machine-id-setup

The --image= switch of systemd-machine-id-setup may be used to provision a fresh machine ID into /etc/machine-id of a disk image, before first boot.

systemd-firstboot

The --image= switch of systemd-firstboot may be used to set various basic system settings (such as the root password, locale information, hostname, …) on the specified disk image, before booting it up.

Use #7: Extracting log information

The journalctl switch --image= may be used to show the journal log data included in a disk image (or, as usual, the specified block device). This is very useful for analyzing failed systems offline, as it gives direct access to the logs without any further, manual analysis.

Use #8: Automatic repartitioning/growing of file systems

The systemd-repart tool may be used to repartition a disk or image in a declarative and additive way. One primary use-case for it is to run during boot on physical or VM systems to grow the root file system to the disk size, or to add, format, encrypt and populate additional partitions at boot.

With its --image= switch the tool may operate on compliant disk images in an offline mode of operation: it will then read the partition definitions that shall be grown or created off the image itself, and then apply them to the image. This is particularly useful in combination with the --size= switch, which allows growing disk images to the specified size.

Specifically, consider the following work-flow: you download a minimized disk image foobar.raw that contains only the minimized root file system (and maybe an ESP, if you want to boot it on bare-metal, too). You then run systemd-repart --image=foobar.raw --size=15G to enlarge the image to 15G, based on the declarative rules defined in the repart.d/ drop-in files included in the image (this means this can grow the root partition, and/or add in more partitions, for example for /srv or so, maybe encrypted with a locally generated key or so). Then, you proceed to boot it up with systemd-nspawn --image=foobar.raw -b, making use of the full 15G.

Versioning + Multi-Arch

Disk images implementing this specification can carry OS executables in one of three ways:

  1. Only a root file system

  2. Only a /usr/ file system (in which case the root file system is automatically picked as tmpfs).

  3. Both a root and a /usr/ file system (in which case the two are combined, the /usr/ file system mounted into the root file system, and the former possibly in read-only fashion)

They may also contain OS executables for different architectures, permitting "multi-arch" disk images that can safely boot up on multiple CPU architectures. As the root and /usr/ partition type UUIDs are specific to architectures this is easily done by including one such partition for x86-64, and another for aarch64. If the image is now used on an x86-64 system automatically the former partition is used, on aarch64 the latter.

Moreover, these OS executables may be contained in different versions, to implement a simple versioning scheme: when tools such as systemd-nspawn or systemd-gpt-auto-generator dissect a disk image, and they find two or more root or /usr/ partitions of the same type UUID, they will automatically pick the one whose GPT partition label (a 36 character free-form string every GPT partition may have) is the newest according to strverscmp() (OK, truth be told, we don't use strverscmp() as-is, but a modified version with some more modern syntax and semantics, but conceptually identical).
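
As a toy illustration of that label comparison, here is the same idea using plain glibc strverscmp() rather than the modified variant mentioned above:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>

int main(void)
{
   /* two root partitions carrying different OS versions in their GPT labels */
   const char *a = "fedora_34.1", *b = "fedora_34.10";
   /* strverscmp() orders embedded numbers numerically, so "34.10" > "34.1" */
   const char *newest = strverscmp(a, b) >= 0 ? a : b;
   printf("newest: %s\n", newest);
   return 0;
}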

This logic makes it possible to implement a very simple and natural A/B update scheme: an updater can drop multiple versions of the OS into separate root or /usr/ partitions, always updating the partition label to the version included therein once the download is complete. All of the tools described here will then honour this, and always automatically pick the newest version of the OS.

Verity

When building modern OS appliances, security is highly relevant. Specifically, offline security matters: an attacker with physical access should have a difficult time modifying the OS in a way that isn't noticed. i.e. think of a car or a cell network base station: these appliances are usually parked/deployed in environments attackers can get physical access to. It's essential that in this case the OS itself is sufficiently protected, so that the attacker cannot just mount the OS file system image, make modifications (inserting a backdoor, spying software or similar) and have the system otherwise continue to run without this being immediately detected.

A great way to implement offline security is via Linux's dm-verity subsystem: it allows securely binding immutable disk IO to a single, short trusted hash value: if an attacker manages to modify the disk image offline, the modified disk image won't match the trusted hash anymore and will no longer be trusted (depending on policy this then just results in IO errors being generated, or automatic reboot/power-off).

The Discoverable Partitions Specification declares how to include Verity validation data in disk images, and how to relate it to the file systems it protects, thus making it very easy to deploy and work with such protected images. For example, systemd-nspawn supports a --root-hash= switch, which accepts the Verity root hash and will then automatically assemble dm-verity with it, automatically matching up the payload and verity partitions. (Alternatively, just place a .roothash file next to the image file.)

Future

The above already is a powerful tool set for working with disk images. However, there are some more areas I'd like to extend this logic to:

bootctl

Similar to the other tools mentioned above, bootctl (which is a tool to interface with the boot loader, and install/update systemd's own EFI boot loader sd-boot) should learn a --image= switch, to make installation of the boot loader on disk images easy and natural. It would automatically find the ESP and other relevant partitions in the image, and copy the boot loader binaries into them (or update them).

coredumpctl

Similar to the existing journalctl --image= logic the coredumpctl tool should also gain an --image= switch for extracting coredumps from compliant disk images. The combination of journalctl --image= and coredumpctl --image= would make it exceptionally easy to work with OS disk images of appliances and extracting logging and debugging information from them after failures.

And that's all for now. Please refer to the specification and the man pages for further details. If your distribution's installer does not yet tag the GPT partitions it creates with the right GPT type UUIDs, consider asking them to do so.

Thank you for your time.

June 09, 2021

Memes

We’ve all been there. No matter how 10x someone is or feels, everyone has had a moment where abruptly they say to themselves, HOW THE FUCK DO THREADS EVEN WORK?

This may be precipitated by any number of events, including, but not limited to:

  • forgetting a lock
  • forgetting to unlock
  • missing an unlock at an early return
  • forgetting to initialize a lock
  • forgetting to spawn a thread
  • forgetting to signal a conditional
  • forgetting to initialize a conditional
  • running the test case with the wrong driver

I’m not going to say that I’ve been there recently.

I’m not going to say that it was today, nor am I going to state, on the record, that at least one existing zink-wip snapshot may or may not be affected by an issue which may or may not be on the above list.

I’m not going to say any of these things.

What I am going to do is talk about a new oom handler I’ve been working on to handle the dreaded spec@!opengl 1.1@streaming-texture-leak case from piglit.

The Case

This test is annoying in that it is effectively a test of a driver’s ability to throttle itself when an app is generating and using $infinity textures without ever explicitly triggering a flush.

In short, it’s:

for (i = 0; i < 5000; i++) {
   glGenTextures(1, &texture);
   glBindTexture(GL_TEXTURE_2D, texture);
   glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, TEX_SIZE, TEX_SIZE, 0, GL_RGBA, GL_UNSIGNED_BYTE, tex_buffer);
   piglit_draw_rect_tex(0, 0, piglit_width, piglit_height, 0, 0, 1, 1);
   glDeleteTextures(1, &texture);
}

The textures are “deleted”, yes, but because they’re in use, the driver can’t actually delete them at the time of the call, meaning they can only truly be deleted once they are no longer in use by the GPU. At some iteration, this will begin to oom the GPU, and the driver will have to determine how to handle things.

The Zink Case

At present, mainline zink uses a hammer-and-nail methodology that I came up with last year: the total amount of GPU memory in use by resources in a given cmdbuf is tracked, and that amount is tracked per-context. If the in-use context memory exceeds a threshold of the total VRAM, the driver stalls, thereby freeing up all the resources that are in use so they can be recycled into new ones.

There’s a number of problems with this approach, but the biggest one is that it fails to account for cases like an AAA game that just uses as much memory as it can in order to optimize performance/resolution/graphics. I discovered such a case some time ago while running Tomb Raider, and then I set out to improve things since it was costing me about 10% of my perf on the title screen.

The annoying part of this problem is that the piglit test is a very uncommon case, and it’s tricky to handle it in a way that doesn’t also impact other cases which appear similar but need to not get memory-clamped. As a result, it’s tough to really do anything based on “overall” memory usage.

In the end, what I decided on was using the per-cmdbuf memory usage counter to trigger a check for completed cmdbufs on submit, iterating over all the pending ones to check whether they’ve completed, resetting them and freeing associated resources when possible. This yields good memory reclaiming behavior for problem cases while leaving games like Tomb Raider untouched and definitely not deadlocking or anything like that.
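
As a toy model of that approach (all names invented for illustration; zink's real code hangs this off its batch states and timeline ids, but the shape is the same):

#include <stdint.h>

/* Each submitted cmdbuf remembers how much memory its resources pin and
 * which fence/timeline value the GPU signals when it completes. */
struct batch {
   struct batch *next;
   uint64_t timeline_id;
   uint64_t resource_bytes;
   void (*free_resources)(struct batch *b);
};

/* Called on submit when the memory counter crosses a threshold: reclaim
 * every batch the GPU has already finished, which finally lets "deleted"
 * GL textures actually die, without ever blocking the application. */
static void
sweep_completed(struct batch **pending, uint64_t completed_timeline,
                uint64_t *pinned_bytes)
{
   while (*pending) {
      struct batch *b = *pending;
      if (b->timeline_id > completed_timeline)
         break;                 /* still in flight; batches complete in order */
      *pinned_bytes -= b->resource_bytes;
      b->free_resources(b);
      *pending = b->next;
   }
}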

June 02, 2021

Remember When…

I said I’d be blogging every day about some changes? And that was a month ago or however long it’s been? And we all had a good chuckle at the idea that I could blog every day like how things used to be?

Yeah, I remember that too.

Anyway, Bas still hasn’t blogged, so let’s check the blogenda:

  • handwaving about C++ draw templates
  • some obscure vbuf thing
  • shower
  • make zink usable for gaming
  • complain about construction
  • improve shader caching
  • this week’s queue rewrite
  • some other stuff
  • suballocator?

I guess it’s that time of the week again because the schedule says it’s time to talk about this week’s (or whenever it was) major rewrite of zink’s queue handling. But first, only 90s kids will remember that time I blogged about a major queue rewrite and was excited to almost be hitting 70% of native performance.

Synchronization

A common use of GL for big games is using multiple GL contexts to parallelize work. There’s a lot of tricky restrictions for this, both on the app side and the driver side, but this is sort of the closest thing to multiple cmdbufs that GL provides.

We all recall how zink now uses a monotonic queue: upon commencing recording, each cmdbuf gets tagged with a 32bit integer id that doubles as a timeline semaphore id for fencing. The queue iterates, the cmdbuf counter increments, queue submission is done per-context in a thread, the GPU gets triangles, everyone is happy.
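
For reference, here is roughly what waiting on one of those ids looks like at the Vulkan level (a sketch, not zink's actual code). The property that matters for the rest of this post is that the spec requires signal operations on a timeline semaphore to use strictly increasing values:

#include <vulkan/vulkan.h>

/* Block until the GPU has signaled timeline value `batch_id` on `sem`. */
static VkResult
wait_for_batch(VkDevice dev, VkSemaphore sem, uint64_t batch_id, uint64_t timeout_ns)
{
   const VkSemaphoreWaitInfo info = {
      .sType = VK_STRUCTURE_TYPE_SEMAPHORE_WAIT_INFO,
      .semaphoreCount = 1,
      .pSemaphores = &sem,
      .pValues = &batch_id,
   };
   return vkWaitSemaphores(dev, &info, timeout_ns);
}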

But how well does that work out with multiple contexts?

Pretty well, it turns out, as long as you’re using a Vulkan driver that doesn’t actually check to ensure you’re using monotonic ids for your timeline values. Let’s check out a totally hypothetical scenario that isn’t just Steam:

  • Have contexts A and B
  • Context A starts recording, gets id 1 (nonzero id club represent)
  • Context B starts recording, gets id 2
  • Context A finishes recording, submits cmdbuf
  • Context B finishes recording, submits cmdbuf
  • Timeline wait on id 1
  • Timeline wait on id 2

So far so good. But then we get past the “Checking for updates” window:

  • Context A starts recording, gets id 3
  • Context B starts recording, gets id 4
  • Context B finishes recording, submits cmdbuf
  • Context A finishes recording, submits cmdbuf
  • Timeline wait on id 3
  • Timeline wait on id 4

thonking.png

So now context B’s submit thread is dumping cmdbuf 4’s triangles into the GPU, then context A’s submit thread is also trying to dump cmdbuf 3’s triangles into the GPU, but the wait order for the timeline is still A -> B, meaning that the values are not monotonic.

Will any drivers care?

Magic 8-ball says no, no drivers care about this and everything still works fine. That’s cool and interesting, but probably it’d be better to not do that.

This Time It’s Definitely Fixed

The problem here is two problems:

  • the queue submission thread is context-based when it should be screen based
  • cmdbufs get an id when they start recording, not when they get submitted

The first problem is easy to fix: just deduplicate the thread and move the struct member.

The second one is trickier because everything in zink relies on cmdbufs getting an id as soon as they become active. This is done so that any resources written to by a given cmdbuf can have their usage tracked for synchronization purposes, e.g., reading back a buffer only after all its writes have landed.

The problem is further complicated by zink not having a great API barrier around the “usage” value for a resource: parts of the codebase read the integer value directly instead of going through a wrapper API. The latter would enable replacing the mechanism with whatever I wanted, so I decided to start by creating such a wrapper based on this:

struct zink_batch_usage {
   uint32_t usage;
   bool unflushed;
};

This is the existing struct zink_batch_usage but now with a bool value indicating that this cmdbuf is yet to be flushed. Each cmdbuf batch now has this sub-struct inlined onto it, and resources in zink can take references (pointers) to a specific cmdbuf’s usage struct. Because batches are never destroyed, this means the wrapper API can always dereference the struct to determine how to synchronize the usage: if it’s unflushed, it can flush or sync the flush thread; if it’s real, pending usage, it can safely wait on that usage as a timeline value and guarantee monotonic ordering.
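
The predicate helpers used in the wrapper functions below are the boring part; a plausible sketch of them (the real implementations may differ in the details):

static inline bool
zink_batch_usage_exists(const struct zink_batch_usage *u)
{
   /* usage == 0 and not unflushed means no cmdbuf has touched the resource */
   return u && (u->usage || u->unflushed);
}

static inline bool
zink_batch_usage_is_unflushed(const struct zink_batch_usage *u)
{
   return u && u->unflushed;
}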

bool
zink_screen_usage_check_completion(struct zink_screen *screen, const struct zink_batch_usage *u)
{
   if (!zink_batch_usage_exists(u))
      return true;
   if (zink_batch_usage_is_unflushed(u))
      return false;

   return zink_screen_batch_id_wait(screen, u->usage, 0);
}

bool
zink_batch_usage_check_completion(struct zink_context *ctx, const struct zink_batch_usage *u)
{
   if (!zink_batch_usage_exists(u))
      return true;
   if (zink_batch_usage_is_unflushed(u))
      return false;
   return zink_check_batch_completion(ctx, u->usage);
}

void
zink_batch_usage_wait(struct zink_context *ctx, const struct zink_batch_usage *u)
{
   if (!zink_batch_usage_exists(u))
      return;
   if (zink_batch_usage_is_unflushed(u))
      zink_fence_wait(&ctx->base);
   else
      zink_wait_on_batch(ctx, u->usage);
}

Now things render exactly the same, but with a truly monotonic queue underneath that’s conformant to specifications.

June 01, 2021

As the President of the GNOME Foundation Board of Directors, I’m really pleased to see the number and breadth of candidates we have for this year’s election. Thank you to everyone who has submitted their candidacy and volunteered their time to support the Foundation. Allan has recently blogged about how the board has been evolving, and I wanted to follow that post by talking about where the GNOME Foundation is in terms of its strategy. This may be helpful as people consider which candidates might bring the best skills to shape the Foundation’s next steps.

Around three years ago, the Foundation received a number of generous donations, and Rosanna (Director of Operations) gave a presentation at GUADEC about her and Neil’s (Executive Director, essentially the CEO of the Foundation) plans to use these funds to transform the Foundation. We would grow our activities, increasing the pace of events, outreach, development and infrastructure that supported the GNOME project and the wider desktop ecosystem – and, crucially, would grow our funding to match this increased level of activity.

I think it’s fair to say that half of this has been a great success – we’ve got a larger staff team than GNOME has ever had before. We’ve widened the GNOME software ecosystem to include related apps and projects under the GNOME Circle banner, we’ve helped get GTK 4 out of the door, run a wider-reaching program in the Community Engagement Challenge, and consistently supported better infrastructure for both GNOME and the Linux app community in Flathub.

Aside from another grant from Endless (note: my employer), our fundraising hasn’t caught up with this pace of activities. As a result, the Board recently approved a budget for this financial year which will spend more funds from our reserves than we expect to raise in income. Due to our reserves policy, this is essentially the last time we can do this: over the next 6-12 months we need to either raise more money, or start spending less.

For clarity – the Foundation is fit and well from a financial perspective – we have a very healthy bank balance, and a very conservative “12 month run rate” reserve policy to handle fluctuations in income. If we do have to slow down some of our activities, we will return to a “steady state” where our regular individual donations and corporate contributions can support a smaller staff team that supports the events and infrastructure we’ve come to rely on.

However, this isn’t what the Board wants to do – the previous and current boards were unanimous in their support of the idea that we should be ambitious: try to do more in the world and bring the benefits of GNOME to more people. We want to take our message of trusted, affordable and accessible computing to the wider world.

Typically, a lot of the activities of the Foundation have been very inwards-facing – supporting and engaging with either the existing GNOME or Open Source communities. This is a very restricted audience in terms of fundraising – many corporate actors in our community already support GNOME hugely in terms of both financial and in-kind contributions, and many OSS users are already supporters either through volunteer contributions or donating to those nonprofits that they feel are most relevant and important to them.

To raise funds from new sources, the Foundation needs to take the message and ideals of GNOME and Open Source software to new, wider audiences that we can help. We’ve been developing themes such as affordability, privacy/trust and education as promising areas for new programs that broaden our impact. The goal is to find projects and funding that allow us to both invest in the GNOME community and find new ways for FOSS to benefit people who aren’t already in our community.

Bringing it back to the election, I’d like to make clear that I see this – reaching the outside world, and finding funding to support that – as the main priority and responsibility of the Board for the next term. GNOME Foundation elections are a slightly unusual process that “filters” our board nominees by being existing Foundation members, which means that candidates already work inside our community when they stand for election. If you’re a candidate and are already active in the community – THANK YOU – you’re doing great work, keep doing it! That said, you don’t need to be a Director to achieve things within our community or gain the support of the Foundation: being a community leader is already a fantastic and important role.

The Foundation really needs support from the Board to make a success of the next 12-18 months. We need to understand our financial situation and the trade-offs we have to make, and help to define the strategy with the Executive Director so that we can launch some new programs that will broaden our impact – and funding – for the future. As people cast their votes, I’d like people to think about what kind of skills – building partnerships, commercial background, familiarity with finances, experience in nonprofit / impact spaces, etc – will help the Board make the Foundation as successful as it can be during the next term.

I Hate Construction.

Specifically when it’s right outside my house and starts at 5:30AM with heavy machinery moving around.

With this said, I’m overdue for a post, and if I don’t set a good example by continuing to blog, why would anyone else? PS. BAS IT’S TIME.

Let’s see what’s on the agenda:

  • handwaving about C++ draw templates
  • some obscure vbuf thing
  • shower
  • make zink usable for gaming
  • complain about construction
  • improve shader caching

Looks like the next thing on the list is shader caching.

The Art Of The Cache

If you’re a long-time zink connoisseur, or if you’re just a casual reader of the blog, you know that zink has a shader cache.

But did you know that it doesn’t actually do anything at present?

Indeed, it was to my chagrin that, upon diving back into my slapdash pipeline cache implementation, I discovered that it was doing absolutely nothing. And this was a different nothing than that one time I didn’t actually pass the cache back to the vulkan driver! Yes, this was the nothing of I have a cache, why am I still compiling a hundred pipelines per frame? that the occasional lucky developer runs into every now and again.

But hwhy? Who would do such a thing?

spideymeme.jpg

Past recriminations aside, how does a shader/pipeline cache work, anyway? The gist of it in most Mesa drivers is that a shader gets cached based on its text representation, enabling matching shaders across programs to use the same cache entry.
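
In terms of Mesa's shared util/disk_cache helpers, the common pattern looks roughly like this (a simplified sketch assuming the disk_cache_compute_key / disk_cache_get / disk_cache_put interface; compile_shader() is a hypothetical stand-in for the driver's compile path):

#include "util/disk_cache.h"

void *compile_shader(const char *text, size_t *size); /* hypothetical */

static void *
lookup_or_compile(struct disk_cache *cache, const char *shader_text,
                  size_t text_len, size_t *blob_size)
{
   cache_key key;
   disk_cache_compute_key(cache, shader_text, text_len, key);

   /* cache hit: reuse the previously compiled binary */
   void *blob = disk_cache_get(cache, key, blob_size);
   if (blob)
      return blob;

   /* cache miss: compile and store for next time */
   blob = compile_shader(shader_text, blob_size);
   disk_cache_put(cache, key, blob, *blob_size, NULL);
   return blob;
}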

After noting the success of Steam’s fossilize-based single file cache, I decided to use a single file for zink’s shader cache.

Oops.

The problem in this case was that I was just jamming all the pipelines into a single file, written once at program exit, and expecting the Vulkan driver to figure things out.

But what if the program didn’t exit cleanly? Or what if the write failed for some reason?

In short, the pipeline cache was mostly being written as a big block of garbage data. Not very useful.

Next-Level Caching Technique

Clearly I needed to reeducate myself in the ways of managing a cache, something that, in my former life as a GUI expert, I did routinely, but that I could no longer comprehend now that I only speak bitfields and command buffers.

I sought out the reclusive Timothy Arceri, a well-known sage in many esoteric, arcane arts, and, as I recall it, purveyor of great wisdom such as (paraphrased because the original text has been lost to the ages): We Both Know The GLSL Compiler Code For Uniform Blocks Is Unfathomable, Why Do You Insist On Attempting To Modify It?

The answers I received from my sojourn were swift and concise:

Stop that. Fossilize caching wasn’t meant to work that way.

My thoughts whirling, confidence badly shaken, I stumbled and fell from the summit of the mountain and dashed my heretical cache implementation against the solid foundation of git rebase -i.

What had I been thinking?

It was back to the charts for me, and this time I had a number of different goals:

  • go back to multi-file caching (since it’s the only option)
  • smaller caches
  • more frequent updates
  • fully async

Turns out this wasn’t actually as hard as expected?

More Flowcharts (Fulfilling Image Quota For Graphics Blog)

Because we’re all big Vulkan adults, we do big Vulkan pipeline caches instead of wimpy OpenGL shader caches, keyed on the whole group of shaders that make up a pipeline rather than on a single shader.

This has the added benefit of providing all the state variants for a given shader pipeline, saving additional lookups and ensuring that all the potential compiled pipelines are available at once. Furthermore, because there’s a (very) short delay between knowing what shaders are grouped together and needing the actual compiled pipeline, I can dump this all into a thread and handle the lookup while I update descriptors #2021 ASYNC THREADS++++++ BAYBEEEEEE.

But also, also, the one thing to absolutely not ever forget or else it’ll be really embarrassing is to ensure that you add your driver’s sha1 hash to your disk cache lookup, otherwise the whole thing explodes and @tarceri will frown down upon you.
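
At the Vulkan level, each of those per-shader-group caches is just a VkPipelineCache whose opaque blob gets round-tripped through the disk cache, keyed on the shader group plus that driver sha1. A rough sketch of the round trip (not zink's actual code):

#include <stdlib.h>
#include <vulkan/vulkan.h>

/* Pull the opaque blob out of a pipeline cache so it can be written to disk. */
static void *
serialize_pipeline_cache(VkDevice dev, VkPipelineCache cache, size_t *size)
{
   vkGetPipelineCacheData(dev, cache, size, NULL);   /* query the size first */
   void *blob = malloc(*size);
   if (!blob || vkGetPipelineCacheData(dev, cache, size, blob) != VK_SUCCESS) {
      free(blob);
      return NULL;
   }
   return blob;
}

/* On the next run, prime a fresh cache with the stored blob so pipeline
 * creation becomes a lookup instead of a compile. */
static VkPipelineCache
create_primed_cache(VkDevice dev, const void *blob, size_t size)
{
   const VkPipelineCacheCreateInfo info = {
      .sType = VK_STRUCTURE_TYPE_PIPELINE_CACHE_CREATE_INFO,
      .initialDataSize = size,
      .pInitialData = blob,
   };
   VkPipelineCache cache = VK_NULL_HANDLE;
   vkCreatePipelineCache(dev, &info, NULL, &cache);
   return cache;
}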

May 24, 2021

Hi all, hope you all are doing fine!

I am Beatriz Carvalho, Brazilian, living in Fundão, Portugal. I graduated in computer engineering from Unipampa in Brazil. I work mostly with C and Python, and I am learning JavaScript, CSS, and other things to create this site... I like Harry Potter, Lord of the Rings, One Piece, The Witcher... I also like to drink wine and some cocktails, and last but not least: I love cats, I have two: Ophélia and Cisco.

I've just been selected as an Outreachy intern for the Linux Kernel, working with my mentors Melissa Wen and Daniel Vetter on the project "Improvements to DRI-devel (aka kernel GPU subsystem)".

As an Outreachy intern, my first step is to say out loud, for everyone to see, my core values. The Outreachy organizers make available a list with some values, and going through the list made me realize the things I value most in life as an individual; once I started to work on the Linux Kernel, these values caught my attention:

Community  

I grew up inside a religious community, where my family and I always tried our best to help people and the community in general. When I started studying the Linux Kernel, I saw this concept reinforced, because there you have enthusiast groups (people who want to contribute voluntarily to development) working alongside companies from all over the world on the development of the kernel, contributing to its evolution and adapting it to different platforms, making the Linux Kernel one of the biggest free and open-source projects. Another thing I consider one of the most important in a community is the chance to learn from one another, especially through feedback during the code review process. And of course, the community is a great place to get to know new (awesome) people, to find job opportunities and (why not) make friends!

Learning  

Another value that catches my attention in the Kernel is the importance of always learning and trying to get a better understanding of how the project works. That means leaving the comfort zone, because the projects are often huge, and sometimes you will have to "burn some (or many) neurons" to understand what you need to do; on the other hand, it is really cool and rewarding when things start to work.

Responsibility  

And finally, a value I am learning is the responsibility of working on a project as big and complex as the kernel, because a change you make in a driver can impact thousands (or would it be millions?) of people who use it.

During my graduation, I attended several events about open source software, for example tcheLinux and FISL. I always wished to contribute to the community, but finding suitable materials/tutorials for beginners was really hard, mostly because, despite being considered beginner material, they always required previous knowledge that was beyond my current skills.

Then, in 2019, I took part in LKCamp (a Linux Kernel study group), where I learned more about the Linux Kernel, the step-by-step of how to contribute to the community, and how we could contribute to Open Source Software through internship programs such as Google Summer of Code (GSoC) and Outreachy. At the time, I got really excited about it but couldn't participate in the second stage of the selection.

So, this year I gave my best in the selection process, made some patches and wrote my internship plan together with my mentors. And now that I have been selected for the Outreachy program, I can't believe that I have got this opportunity to be a part of it!

Now I need to control my anxiety and keep the imposter syndrome in check in order to get the best out of this opportunity, absorb all I can, and hopefully get a job to continue working with the kernel.

Thank you for accompanying me so far, please feel free to comment! And stay tuned for the next chapters of this Saga called Outreachy!!

Take care and have a great day!

Stop The Optimizing

I had planned to write more posts about some optimizations and whatever other cool stuff I’ve been working on.

I had planned to make more zink-wip snapshots.

I did shower; stop spamming frog emotes at me.

But I also encountered a bug so bizarre, so infuriating, so esoteric, that I need to take a bit of a victory lap now that I’ve successfully corralled it. So let’s get into a real, vintage SGC blog post like we used to have back when SGC was a good blog and dive into what it took to fix a bug that took me four full days to resolve.

The Problem

In the course of writing a suballocator, I ran zero tests, as is my way. When the coding fugue ended, I stumbled weakly to my local CI script and managed to hit the Enter key before collapsing into a restless sleep where I was chased by angry triangles. Things were different when I awoke; namely I now had a lot of failing tests.

But I fixed them, because that’s what driver developers do.

All except one, which I assumed was a flake after running it a few times and seeing no failures.

This was how I got to know the horror of dEQP-GLES3.functional.vertex_array_objects.all_attributes.

The test itself is awful to debug. It generates GL_MAX_VERTEX_ATTRIBS vertex attributes to use with the maximum number of vertex buffers and does a series of draws, verifying the results. Normal enough.

Except the attributes are completely randomized, even whether they’re enabled, so no two runs are the same.

And the bisect hit just right.

The Basics Of Problem Solving

When a new bug is found with a driver, the first question is usually “Did this used to work?” followed quickly by “When did it start?” if the first answer was yes. The reason for this is that determining exactly when a problem began and what caused it to manifest gives the developer some vague starting point for determining what is happening to cause a bug.

So it was that I embarked on a bisect to figure out why dEQP-GLES3.functional.vertex_array_objects.all_attributes was suddenly failing. But obviously I couldn’t just bisect for this test. No, no, that would be far too easy.

This test only fails if run in conjunction with a series of other tests. Thus my deqp caselist file:

dEQP-GLES3.functional.vertex_array_objects.all_attributes
dEQP-GLES3.functional.vertex_arrays.single_attribute.first.byte.first6_offset1_stride2_quads256
dEQP-GLES3.functional.vertex_arrays.single_attribute.output_types.int.components2_ivec4_quads256
dEQP-GLES3.functional.vertex_arrays.single_attribute.usages.static_copy.stride32_fixed_quads1

The problem test always runs last, so something was clearly going on over time that was causing the failure. Armed with this knowledge, and so sure that this would end up being some trivial one-liner that I could fix in a few minutes, I set up my startpoint and endpoint for the bisect and went to work.

Zink By Bisection

Generally speaking, I assume every bug I find is going to be a zink bug. Just by the numbers, that’s usually the case, but then also it’s just always the case. It was therefore no surprise that my bisect landed on a certain commit:

commit 6b13e7cede95504ce8309744d8b9d83c7dbab7c9
Author: Mike Blumenkrantz <michael.blumenkrantz@gmail.com>
Date:   Mon May 17 08:44:02 2021 -0400

    try better map flags

diff --git a/src/gallium/drivers/zink/zink_resource.c b/src/gallium/drivers/zink/zink_resource.c
index 55f37380d9f..121f6f0076e 100644
--- a/src/gallium/drivers/zink/zink_resource.c
+++ b/src/gallium/drivers/zink/zink_resource.c
@@ -1201,7 +1201,7 @@ buffer_transfer_map(struct zink_context *ctx, struct zink_resource *res, unsigne
          /* At this point, the buffer is always idle (we checked it above). */
          usage |= PIPE_MAP_UNSYNCHRONIZED;
       }
-   } else if ((usage & PIPE_MAP_READ) && !(usage & PIPE_MAP_PERSISTENT)) {
+   } else if (((usage & PIPE_MAP_READ) && !(usage & PIPE_MAP_PERSISTENT)) || !res->obj->host_visible) {
       assert(!(usage & (TC_TRANSFER_MAP_THREADED_UNSYNC | PIPE_MAP_THREAD_SAFE)));
       if (usage & PIPE_MAP_DONTBLOCK) {
          /* sparse/device-local will always need to wait since it has to copy */
@@ -1209,7 +1209,7 @@ buffer_transfer_map(struct zink_context *ctx, struct zink_resource *res, unsigne
             return NULL;
          if (!zink_resource_usage_check_completion(ctx, res, ZINK_RESOURCE_ACCESS_WRITE))
             return NULL;
-      } else if (!res->obj->host_visible) {
+      } else if (!res->obj->host_visible || res->base.b.usage != PIPE_USAGE_STAGING) {
          trans->staging_res = pipe_buffer_create(&screen->base, PIPE_BIND_LINEAR, PIPE_USAGE_STAGING, box->x + box->width);
          if (!trans->staging_res)
             return NULL;
@@ -1218,8 +1218,12 @@ buffer_transfer_map(struct zink_context *ctx, struct zink_resource *res, unsigne
          zink_copy_buffer(ctx, NULL, staging_res, res, box->x, box->x, box->width);
          res = staging_res;
          zink_fence_wait(&ctx->base);
-      } else
-         zink_resource_usage_wait(ctx, res, ZINK_RESOURCE_ACCESS_WRITE);
+      } else {
+         if (!(usage & PIPE_MAP_WRITE))
+            zink_resource_usage_wait(ctx, res, ZINK_RESOURCE_ACCESS_WRITE);
+         else
+            zink_resource_usage_wait(ctx, res, ZINK_RESOURCE_ACCESS_RW);
+      }
    }
 
    if (!ptr) {

As clearly explained by my laconic commit log, this patch aims to improve non-persistent buffer mappings by forcing non-staging resources to use a snooped staging resource. For more details on why this is desirable, check out this encyclopedia of wisdom on the topic, written by RADV co-founder and Commander Of The Rays, Bas Nieuwenhuizen.

But somehow this small patch was breaking the test, so I set out to investigate.

Isolation

Once a problem area is identified, it’s usually helpful to try and isolate the exact hunks of a patch which cause the problem. In this case, I had three distinct and only vaguely-related hunks, so it was an ideal case for this strategy. The middle hunk ended up being the culprit:

@@ -1209,7 +1209,7 @@ buffer_transfer_map(struct zink_context *ctx, struct zink_resource *res, unsigne
             return NULL;
          if (!zink_resource_usage_check_completion(ctx, res, ZINK_RESOURCE_ACCESS_WRITE))
             return NULL;
-      } else if (!res->obj->host_visible) {
+      } else if (!res->obj->host_visible || res->base.b.usage != PIPE_USAGE_STAGING) {
          trans->staging_res = pipe_buffer_create(&screen->base, PIPE_BIND_LINEAR, PIPE_USAGE_STAGING, box->x + box->width);
          if (!trans->staging_res)
             return NULL;

It seemed a bit odd to me, but nothing that stood out as impossible; perhaps there was some manner of issue with buffer copying offsets for setting up the staging resource, or some synchronization issue, or whatever. There were options, and I now knew that the problem was caused by setting up a staging buffer. Further printfs revealed that this conditional was only hit for read access, so it was now narrowed down even further.

Initial Testing

Was it a buffer offset problem with copying the data for the staging resource?

Well.

No.

As interesting as it would’ve been for that to have been the case, there’s zero chance that this one test case was invoking a magical offset that wasn’t also triggered in other cases. If the general buffer copying code here was broken, it was probably broken everywhere in zink, so there would’ve been many, many more failures. There was only this one case, however, and deeper investigation confirmed this, as I directly mapped both buffers and compared the data ranges, which matched.

Synchronization it was, then, and I can hear the disembodied voice of Dave Airlie now shouting “Barriers!” before vanishing off into the ether.

First, I tried adding more GPU stalls. Like, lots more. Like, so many that the test took minutes to complete. There was no change, however. Just for hahas, I even added some usleep calls around.

Still nothing.

At this point I was seriously stumped. By now I’d fully instrumented all of the buffer access codepaths with asserts to verify the mapped contents matched the real buffer contents in all cases, and none of the asserts were ever hit.

But if it wasn’t actually an issue with synchronizing the staging buffer, what could it be?

I decided to check the test with ANV at this point, it being the case that I always run CTS against lavapipe to avoid killing my session in case I’ve foolished and added some code which triggers hangs, and…

And the test passed with ANV.

confused_nick_young.jpg

This was a real thinker, so I went to get a second opinion from Bas and RADV. RADV told me that ANV didn’t know what it was talking about, and the test was definitely failing, so I went with that answer because it seemed more sane.

As a final idea, I did the truly unthinkable: I threw in a malloc call, allocated some host memory, and copied the map contents directly into that buffer.

And leaked it.

Yes, I know, I know, We Don’t Do That, but it was just this one time. Just a little bit. Just to see if I could valg—Of course valgrind crashes when running anything in lavapipe due to unimplemented instructions, so why did I bother?

Getting Deeper

There comes a time when saying We Need To Go Deeper isn’t just a meme. That time was now, and I was about to get, as they say in technical terms when such depth is approached, deep as fuck.

Continuing to experiment with my memory leaking, the conditional block in question had by now degenerated into spaghetti:

         trans->staging_res = pipe_buffer_create(&screen->base, PIPE_BIND_LINEAR, PIPE_USAGE_STAGING, box->x + box->width);
         if (!trans->staging_res)
            return NULL;
         struct zink_resource *staging_res = zink_resource(trans->staging_res);
         trans->offset = staging_res->obj->offset;
         uint8_t *p = map_resource(screen, res);
         trans->data = malloc(box->x + box->width);
         trans->data2 = malloc(box->x + box->width);
         memset(trans->data, 0, box->x + box->width);
         memset(trans->data2, 0, box->x + box->width);
         memcpy(trans->data, p, box->x + box->width);
         memcpy(trans->data2, p, box->x + box->width);
         printf("SIZE NEEDED %u\n", box->x + box->width);
      for (unsigned i = 0; i < box->x + box->width; i++) {
         uint8_t *map = res->obj->map;
         assert(trans->data[i] == trans->data2[i]);
         assert(map[i] == trans->data2[i]);
         printf("MAP[%u] = %u\n", i, trans->data2[i]);
      }
         //zink_copy_buffer(ctx, NULL, staging_res, res, box->x, box->x, MIN2(box->width + 4, res->base.b.width0-box->x));
         //zink_copy_buffer(ctx, NULL, staging_res, res, box->x, box->x, box->width);
         ptr = trans->data;

         res = staging_res;
         zink_fence_wait(&ctx->base);

Obviously I’m gonna double buffer my memory leak so I can verify that it’s not secretly being modified on unmap (it wasn’t), and then also verify that the data matches before returning the pointer. And print it all, of course, because if you can actually read your terminal when you reach this sort of depth in the course of a debugging session, probably you’re doing it wrong.

But the time had come to start applying hacks elsewhere: namely the test itself. Being a random test case made it impossible to figure out what was going on between runs, but I’d determined one thing of interest: no matter what, unless I returned the direct mapping for the buffer, the test failed.

Let’s see what Mr. Crowbar had to say about that though when I applied him to the CTS case:

diff --git a/modules/gles3/functional/es3fVertexArrayObjectTests.cpp b/modules/gles3/functional/es3fVertexArrayObjectTests.cpp
index 82578b1ce..e231c4b1a 100644
--- a/modules/gles3/functional/es3fVertexArrayObjectTests.cpp
+++ b/modules/gles3/functional/es3fVertexArrayObjectTests.cpp
@@ -765,14 +765,14 @@ void MultiVertexArrayObjectTest::init (void)
 		m_spec.buffers.push_back(shortCoordBuffer48);
 
 		m_spec.state.attributes.push_back(Attribute());
-		m_spec.state.attributes[attribNdx].enabled		= (m_random.getInt(0, 4) == 0) ? GL_FALSE : GL_TRUE;
-		m_spec.state.attributes[attribNdx].size			= m_random.getInt(2,4);
-		m_spec.state.attributes[attribNdx].stride		= 2*m_random.getInt(1, 3);
+		m_spec.state.attributes[attribNdx].enabled		= GL_TRUE;
+		m_spec.state.attributes[attribNdx].size			= (attribNdx % 2) + 2;
+		m_spec.state.attributes[attribNdx].stride	= 2 * ((attribNdx % 2) + 1);
 		m_spec.state.attributes[attribNdx].type			= GL_SHORT;
-		m_spec.state.attributes[attribNdx].integer		= m_random.getBool();
-		m_spec.state.attributes[attribNdx].divisor		= m_random.getInt(0, 1);
-		m_spec.state.attributes[attribNdx].offset		= 2*m_random.getInt(0, 2);
-		m_spec.state.attributes[attribNdx].normalized	= m_random.getBool();
+		m_spec.state.attributes[attribNdx].integer		= attribNdx % 3 == 1;
+		m_spec.state.attributes[attribNdx].divisor		= 0;
+		m_spec.state.attributes[attribNdx].offset		= attribNdx % 5;
+		m_spec.state.attributes[attribNdx].normalized	= attribNdx % 3 == 1;
 		m_spec.state.attributes[attribNdx].bufferNdx	= attribNdx+1;
 
 		if (attribNdx == 0)
@@ -783,14 +783,14 @@ void MultiVertexArrayObjectTest::init (void)
 		}
 
 		m_spec.vao.attributes.push_back(Attribute());
-		m_spec.vao.attributes[attribNdx].enabled		= (m_random.getInt(0, 4) == 0) ? GL_FALSE : GL_TRUE;
-		m_spec.vao.attributes[attribNdx].size			= m_random.getInt(2,4);
-		m_spec.vao.attributes[attribNdx].stride			= 2*m_random.getInt(1, 3);
+		m_spec.vao.attributes[attribNdx].enabled		= GL_TRUE;
+		m_spec.vao.attributes[attribNdx].size			= (attribNdx % 2) + 2;
+		m_spec.vao.attributes[attribNdx].stride			= 2 * ((attribNdx % 2) + 1);
 		m_spec.vao.attributes[attribNdx].type			= GL_SHORT;
-		m_spec.vao.attributes[attribNdx].integer		= m_random.getBool();
-		m_spec.vao.attributes[attribNdx].divisor		= m_random.getInt(0, 1);
-		m_spec.vao.attributes[attribNdx].offset			= 2*m_random.getInt(0, 2);
-		m_spec.vao.attributes[attribNdx].normalized		= m_random.getBool();
+		m_spec.vao.attributes[attribNdx].integer		= attribNdx % 3 == 1;
+		m_spec.vao.attributes[attribNdx].divisor		= 0;
+		m_spec.vao.attributes[attribNdx].offset			= attribNdx % 5;
+		m_spec.vao.attributes[attribNdx].normalized		= attribNdx % 3 == 1;
 		m_spec.vao.attributes[attribNdx].bufferNdx		= attribCount - attribNdx;
 
 		if (attribNdx == 0)

Now I had a consistently failing test (as long as I ran it with the other test cases so it didn’t feel too lonely and accidentally pass) with consistent data, and I was dumping it all to logs that I could compare if I returned the direct pointer for the map to legitimately pass the test.

Naturally the output data that I was printing matched. It’d be pretty weird if it didn’t considering all the asserts that I had in the code, right? Hah, yeah, that’d be… That’d be pretty weird, all right…

The Forgotten Depths

By this point I had determined that it was a specific range of buffer mappings causing the problem, specifically those sized between 50 and 100 bytes. I also knew that these buffers were being mapped by u_vbuf, also known colloquially as the hinterlands of Gallium, an obscure component used to handle translating unsupported vertex buffer formats.

Veteran Mesa developers reading along are going full sensiblechuckle.gif right now, but I’ll request that we continue our no spoiler policy.

If the buffer contents were the same as the mapped contents but the test was still failing, then there had to be a reason for that. I fumbled my way over to the vertex attribute translator and fingerpainted in a printf to dump the translated vertex attributes. This enabled me to diff between a good run and a bad run.

It was then that I made a bewildering discovery.

Any time I had a 96-byte buffer map, the attributes starting at offset 92 didn’t match in cases when the test failed.

This was another thinker, so I decided to enhance my memory leaks a bit to copy more buffer since this was all 4096-aligned and it wasn’t like I was going to be copying out of bounds. This was when things started to get really weird.

Returning a copy of the requested 96 bytes of the buffer failed the test, but returning 100 bytes passed it.

Uh-oh.

Now that I took a closer look at those vertex attribs, I realized that the ones which were failing were the ones that were read from bytes 96 and 97 of the buffer. The buffer which only had 96 bytes mapped, meaning that only the range of [0..95] was valid…

At Last

Resolution. What I had tripped over was a buffer overrun, one that was undetectable through normal means because of reasons like:

  • this is a GPU buffer, so tools which would normally catch buffer overruns wouldn’t detect it
  • this is u_vbuf, which is code that’s generally known to work pretty well given that it’s 10+ years old and is widely used and tested
  • RadeonSI is likely the only other driver which uses the same sorts of buffer mapping optimizations, and it doesn’t use u_vbuf

Iteration on various fixes finally yielded a patch that was upstreamable; the crux of the problem here was that the stride of vertex attributes was being used to calculate the size of the region to map, but the stride only determines the number of bytes between elements, not their size. For example, if the stride was 4 bytes but the element was 8 bytes, the overrun would be 4 bytes for the last element. The solution was to calculate the offset of the last element being mapped, then add the size of the element using the attribute’s format block size, which guarantees that the last attribute won’t be truncated.
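
To make that concrete, here’s a minimal sketch of the corrected size math; all names here are illustrative, not the actual u_vbuf code:

#include <stdio.h>

/* Illustrative only: how many bytes must be mapped so the last vertex
 * element is fully covered. The buggy logic effectively used
 * count * stride, which truncates the final element whenever the
 * element (format block) size is larger than the stride. */
static unsigned
attrib_map_size(unsigned count, unsigned stride, unsigned element_size)
{
   unsigned last_element_offset = (count - 1) * stride;
   return last_element_offset + element_size;
}

int main(void)
{
   /* stride 4, element size 8, 10 elements: the old math maps 40 bytes,
    * but the last element ends at byte 44, i.e. a 4-byte overrun. */
   printf("buggy:   %u\n", 10 * 4);
   printf("correct: %u\n", attrib_map_size(10, 4, 8));
   return 0;
}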

Fuck that bug.

May 20, 2021

So we are looking to hire quite a few people into the Desktop team currently. First of all we are looking to hire two graphics engineers to help us work on Linux graphics drivers. The first of those two jobs is now online on the Red Hat jobs site. This is a job in our core graphics team focusing on RHEL, Fedora and upstream around the Intel, AMD and NVidia open source drivers. It is an opportunity to join a team of incredibly talented engineers working on everything from the graphics subsystem of the Linux kernel to userspace bits like Vulkan, OpenGL and Wayland. The job is listed as Senior Engineer, but for the right candidate we have flexibility there. We also have flexibility for people who want to work remotely: as long as there is a Red Hat office in your home country you can work remotely for us. The second job, which we hope to have up soon, will be looking more at ARM graphics and be tied to our automotive effort. We will be looking at applications for either position in combination, so feel free to apply for the already listed job even if you are more interested in the second one, as we will discuss both jobs with potential candidates.

The second job we have up is for Software Engineer – GPU, Input and Multimedia, which is also for joining our graphics team. This job is targeted at our office in Brno, Czechia and is a great entry-level position if you are interested in the field of graphics. The job listing can be found here and outlines the kind of things we want you to look at, but do expect that initially your job will be focused on helping the rest of the team manage their backlog and then grow from there.

The last job we have online now is for the automotive team, where we are looking for someone at the Senior/Principal level to join our Infotainment team, working with car makers on issues related to multimedia, helping identify needs and gaps, and then working with upstream communities to figure out how we can resolve those issues. The job is targeted at Madrid, Spain, as that is where we hope to center some of the infotainment effort and it makes things easier in terms of hardware access and similar, but for the right candidate we might be open to remote work or another Red Hat office. You can find this job listing here.

We expect to be posting further jobs for the infotainment team within a week or two, so I will update once they are up.

May 19, 2021

Using Power For Evil

There’s no shortage of very smart people working on Mesa. One of those, aspiring benchmark-quadrupler Marek Olšák, had a novel idea some time ago: could C++ function templates be used to optimize draw dispatch in a driver?

The answer was yes, and so began what was probably five or ten minutes of furiously jamming brackets and braces into a C++ file in order to achieve the intended result. Let’s check out what’s going on here.

Setup

To start, the templates must be accessible from C, as this is what the driver is written in. The methodology here is simple: generate the template instantiations into an array of function pointers such that they can be accessed by indexing the array with the template values. Here’s what the code looks like:

template <chip_class GFX_VERSION, si_has_tess HAS_TESS, si_has_gs HAS_GS,
          si_has_ngg NGG, si_has_prim_discard_cs ALLOW_PRIM_DISCARD_CS>
static void si_init_draw_vbo(struct si_context *sctx)
{
   /* Prim discard CS is only useful on gfx7+ because gfx6 doesn't have async compute. */
   if (ALLOW_PRIM_DISCARD_CS && GFX_VERSION < GFX7)
      return;

   if (NGG && GFX_VERSION < GFX10)
      return;

   sctx->draw_vbo[GFX_VERSION - GFX6][HAS_TESS][HAS_GS][NGG][ALLOW_PRIM_DISCARD_CS] =
      si_draw_vbo<GFX_VERSION, HAS_TESS, HAS_GS, NGG, ALLOW_PRIM_DISCARD_CS>;
}

template <chip_class GFX_VERSION, si_has_tess HAS_TESS, si_has_gs HAS_GS>
static void si_init_draw_vbo_all_internal_options(struct si_context *sctx)
{
   si_init_draw_vbo<GFX_VERSION, HAS_TESS, HAS_GS, NGG_OFF, PRIM_DISCARD_CS_OFF>(sctx);
   si_init_draw_vbo<GFX_VERSION, HAS_TESS, HAS_GS, NGG_OFF, PRIM_DISCARD_CS_ON>(sctx);
   si_init_draw_vbo<GFX_VERSION, HAS_TESS, HAS_GS, NGG_ON, PRIM_DISCARD_CS_OFF>(sctx);
   si_init_draw_vbo<GFX_VERSION, HAS_TESS, HAS_GS, NGG_ON, PRIM_DISCARD_CS_ON>(sctx);
}

template <chip_class GFX_VERSION>
static void si_init_draw_vbo_all_pipeline_options(struct si_context *sctx)
{
   si_init_draw_vbo_all_internal_options<GFX_VERSION, TESS_OFF, GS_OFF>(sctx);
   si_init_draw_vbo_all_internal_options<GFX_VERSION, TESS_OFF, GS_ON>(sctx);
   si_init_draw_vbo_all_internal_options<GFX_VERSION, TESS_ON, GS_OFF>(sctx);
   si_init_draw_vbo_all_internal_options<GFX_VERSION, TESS_ON, GS_ON>(sctx);
}

static void si_init_draw_vbo_all_families(struct si_context *sctx)
{
   si_init_draw_vbo_all_pipeline_options<GFX6>(sctx);
   si_init_draw_vbo_all_pipeline_options<GFX7>(sctx);
   si_init_draw_vbo_all_pipeline_options<GFX8>(sctx);
   si_init_draw_vbo_all_pipeline_options<GFX9>(sctx);
   si_init_draw_vbo_all_pipeline_options<GFX10>(sctx);
   si_init_draw_vbo_all_pipeline_options<GFX10_3>(sctx);
}

static void si_invalid_draw_vbo(struct pipe_context *pipe,
                                const struct pipe_draw_info *info,
                                const struct pipe_draw_indirect_info *indirect,
                                const struct pipe_draw_start_count *draws,
                                unsigned num_draws)
{
   unreachable("vertex shader not bound");
}

extern "C"
void si_init_draw_functions(struct si_context *sctx)
{
   si_init_draw_vbo_all_families(sctx);

   /* Bind a fake draw_vbo, so that draw_vbo isn't NULL, which would skip
    * initialization of callbacks in upper layers (such as u_threaded_context).
    */
   sctx->b.draw_vbo = si_invalid_draw_vbo;
   sctx->blitter->draw_rectangle = si_draw_rectangle;

   si_init_ia_multi_vgt_param_table(sctx);
}

This calls through a series of functions, ultimately reaching si_init_draw_vbo, where a specialized instantiation of si_draw_vbo is stored into the function pointer array at an index derived from the template parameters. Specialized functions can thus be generated based on hardware type, pipeline shader presence, and more.

Application

Once initialized, there’s an inline function used to set the current function pointer:

static inline void si_select_draw_vbo(struct si_context *sctx)
{
   sctx->b.draw_vbo = sctx->draw_vbo[sctx->chip_class - GFX6]
                                    [!!sctx->shader.tes.cso]
                                    [!!sctx->shader.gs.cso]
                                    [sctx->ngg]
                                    [si_compute_prim_discard_enabled(sctx)];
   assert(sctx->b.draw_vbo);
}

Thus the parameters are pulled directly from the context, and the function can be called whenever the draw function pointer needs to be updated, such as when new shaders are bound or primitive discard is enabled.
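
As a purely hypothetical sketch (not the actual RadeonSI code), a state-bind hook that changes one of those parameters would refresh the pointer roughly like this:

static void si_bind_gs_state_sketch(struct pipe_context *ctx, void *state)
{
   struct si_context *sctx = (struct si_context *)ctx;

   /* Hypothetical: store the new geometry shader (or NULL), then
    * re-select the si_draw_vbo<> specialization matching the updated
    * pipeline state. */
   sctx->shader.gs.cso = state;
   si_select_draw_vbo(sctx);
}

Since the selection itself is just a few array lookups, it can be re-run on every relevant state change without hurting the fast path.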

Result

The result is that now the draw dispatch can be fully optimized for the codepath required by the active hardware and graphics pipeline, reducing the CPU overhead and making the draw code the tiniest bit faster. For example, here’s just the top part of the templated function:

template <chip_class GFX_VERSION, si_has_tess HAS_TESS, si_has_gs HAS_GS, si_has_ngg NGG,
          si_has_prim_discard_cs ALLOW_PRIM_DISCARD_CS>
static void si_draw_vbo(struct pipe_context *ctx,
                        const struct pipe_draw_info *info,
                        unsigned drawid_offset,
                        const struct pipe_draw_indirect_info *indirect,
                        const struct pipe_draw_start_count_bias *draws,
                        unsigned num_draws)
{
   /* Keep code that uses the least number of local variables as close to the beginning
    * of this function as possible to minimize register pressure.
    *
    * It doesn't matter where we return due to invalid parameters because such cases
    * shouldn't occur in practice.
    */
   struct si_context *sctx = (struct si_context *)ctx;

   /* Recompute and re-emit the texture resource states if needed. */
   unsigned dirty_tex_counter = p_atomic_read(&sctx->screen->dirty_tex_counter);
   if (unlikely(dirty_tex_counter != sctx->last_dirty_tex_counter)) {
      sctx->last_dirty_tex_counter = dirty_tex_counter;
      sctx->framebuffer.dirty_cbufs |= ((1 << sctx->framebuffer.state.nr_cbufs) - 1);
      sctx->framebuffer.dirty_zsbuf = true;
      si_mark_atom_dirty(sctx, &sctx->atoms.s.framebuffer);
      si_update_all_texture_descriptors(sctx);
   }

   unsigned dirty_buf_counter = p_atomic_read(&sctx->screen->dirty_buf_counter);
   if (unlikely(dirty_buf_counter != sctx->last_dirty_buf_counter)) {
      sctx->last_dirty_buf_counter = dirty_buf_counter;
      /* Rebind all buffers unconditionally. */
      si_rebind_buffer(sctx, NULL);
   }

   si_decompress_textures(sctx, u_bit_consecutive(0, SI_NUM_GRAPHICS_SHADERS));
   si_need_gfx_cs_space(sctx, num_draws);

   /* If we're using a secure context, determine if cs must be secure or not */
   if (GFX_VERSION >= GFX9 && unlikely(radeon_uses_secure_bos(sctx->ws))) {
      bool secure = si_gfx_resources_check_encrypted(sctx);
      if (secure != sctx->ws->cs_is_secure(&sctx->gfx_cs)) {
         si_flush_gfx_cs(sctx, RADEON_FLUSH_ASYNC_START_NEXT_GFX_IB_NOW |
                               RADEON_FLUSH_TOGGLE_SECURE_SUBMISSION, NULL);
      }
   }

   if (HAS_TESS) {
      struct si_shader_selector *tcs = sctx->shader.tcs.cso;

      /* The rarely occuring tcs == NULL case is not optimized. */
      bool same_patch_vertices =
         GFX_VERSION >= GFX9 &&
         tcs && info->vertices_per_patch == tcs->info.base.tess.tcs_vertices_out;

      if (sctx->same_patch_vertices != same_patch_vertices) {
         sctx->same_patch_vertices = same_patch_vertices;
         sctx->do_update_shaders = true;
      }

      if (GFX_VERSION == GFX9 && sctx->screen->info.has_ls_vgpr_init_bug) {
         /* Determine whether the LS VGPR fix should be applied.
          *
          * It is only required when num input CPs > num output CPs,
          * which cannot happen with the fixed function TCS. We should
          * also update this bit when switching from TCS to fixed
          * function TCS.
          */
         bool ls_vgpr_fix =
            tcs && info->vertices_per_patch > tcs->info.base.tess.tcs_vertices_out;

         if (ls_vgpr_fix != sctx->ls_vgpr_fix) {
            sctx->ls_vgpr_fix = ls_vgpr_fix;
            sctx->do_update_shaders = true;
         }
      }

Note that the hardware version parts are templated, as is the HAS_TESS conditional, enabling it to be skipped entirely if there’s no tessellation shader active.

With techniques like this, it’s no surprise that RadeonSI is the driver to beat in performance and low overhead. The latest zink-wip snapshots include similar work, skipping considerable amounts of the draw dispatch when possible, and (hopefully) lowering the CPU overhead of the draw dispatch.

May 18, 2021

TL;DR: don't use select() + bump the RLIMIT_NOFILE soft limit to the hard limit in your modern programs.

The primary way to reference, allocate and pin runtime OS resources on Linux today are file descriptors ("fds"). Originally they were used to reference open files and directories and maybe a bit more, but today they may be used to reference almost any kind of runtime resource in Linux userspace, including open devices, memory (memfd_create(2)), timers (timerfd_create(2)) and even processes (with the new pidfd_open(2) system call). In a way, the philosophically skewed UNIX concept of "everything is a file" through the proliferation of fds actually acquires a bit of sensible meaning: "everything has a file descriptor" is certainly a much better motto to adopt.

Because of this proliferation of fds, non-trivial modern programs tend to have to deal with substantially more fds at the same time than they traditionally did. Today, you'll often encounter real-life programs that have a few thousand fds open at the same time.

Like on most runtime resources on Linux, limits are enforced on file descriptors: once you hit the resource limit configured via RLIMIT_NOFILE, any attempt to allocate more is refused with the EMFILE error — until you close a couple of those you already have open.

Because fds weren't such a universal concept traditionally, the limit of RLIMIT_NOFILE used to be quite low. Specifically, when the Linux kernel first invokes userspace it still sets RLIMIT_NOFILE to a low value of 1024 (soft) and 4096 (hard). (Quick explanation: the soft limit is what matters and causes the EMFILE issues, the hard limit is a secondary limit that processes may bump their soft limit to — if they like — without requiring further privileges to do so. Bumping the limit further would require privileges however.). A limit of 1024 fds made fds a scarce resource: APIs tried to be careful with using fds, since you simply couldn't have that many of them at the same time. This resulted in some questionable coding decisions and concepts at various places: often secondary descriptors that are very similar to fds — but were not actually fds — were introduced (e.g. inotify watch descriptors), simply to avoid for them the low limits enforced on true fds. Or code tried to aggressively close fds when not absolutely needing them (e.g. ftw()/nftw()), losing the nice + stable "pinning" effect of open fds.
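
If you want to see which soft/hard pair your own process is running with, a few lines of C around the standard getrlimit(2) call will do (sketch):

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
   struct rlimit rl;

   /* rlim_cur is the soft limit (the one that triggers EMFILE),
    * rlim_max is the hard limit the soft limit may be bumped up to. */
   if (getrlimit(RLIMIT_NOFILE, &rl) < 0) {
      perror("getrlimit");
      return 1;
   }

   printf("soft: %llu hard: %llu\n",
          (unsigned long long) rl.rlim_cur,
          (unsigned long long) rl.rlim_max);
   return 0;
}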

Worse though is that certain OS level APIs were designed having only the low limits in mind. The worst offender being the BSD/POSIX select(2) system call: it only works with fds in the numeric range of 0…1023 (aka FD_SETSIZE-1). If you have an fd outside of this range, tough luck: select() won't work, and only if you are lucky you'll detect that and can handle it somehow.
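
That restriction is baked into the fd_set type itself, so code that insists on using select() needs a guard along these lines (sketch using only standard POSIX names):

#include <errno.h>
#include <sys/select.h>

/* fds at or above FD_SETSIZE (1024 on Linux) simply don't fit into an
 * fd_set; calling FD_SET() on them scribbles past the end of the set.
 * Detect the situation instead of silently corrupting memory. */
int watch_fd(fd_set *set, int fd)
{
   if (fd < 0 || fd >= FD_SETSIZE)
      return -EINVAL; /* tough luck: time to switch to poll()/epoll */

   FD_SET(fd, set);
   return 0;
}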

Linux fds are exposed as simple integers, and for most calls it is guaranteed that the lowest unused integer is allocated for new fds. Thus, as long as the RLIMIT_NOFILE soft limit is set to 1024 everything remains compatible with select(): the resulting fds will also be below 1024. Yay. If we'd bump the soft limit above this threshold though and at some point in time an fd higher than the threshold is allocated, this fd would not be compatible with select() anymore.

Because of that, indiscriminately increasing the soft RLIMIT_NOFILE resource limit today for every userspace process is problematic: as long as there's userspace code still using select() doing so will risk triggering hard-to-handle, hard-to-debug errors all over the place.

However, given the nowadays ubiquitous use of fds for all kinds of resources (did you know, an eBPF program is an fd? and a cgroup too? and attaching an eBPF program to cgroup is another fd? …), we'd really like to raise the limit anyway. 🤔

So before we continue thinking about this problem, let's make the problem more complex (…uh, I mean… "more exciting") first. Having just one hard and one soft per-process limit on fds is boring. Let's add more limits on fds to the mix. Specifically on Linux there are two system-wide sysctls: fs.nr_open and fs.file-max. (Don't ask me why one uses a dash and the other an underscore, or why there are two of them...) On today's kernels they kinda lost their relevance. They had some originally, because fds weren't accounted by any other counter. But today, the kernel tracks fds mostly as small pieces of memory allocated on userspace requests — because that's ultimately what they are —, and thus charges them to the memory accounting done anyway.

So now, we have four limits (actually: five if you count the memory accounting) on the same kind of resource, and all of them make a resource artificially scarce that we don't want to be scarce. So what to do?

Back in systemd v240 already (i.e. 2019) we decided to do something about it. Specifically:

  • Automatically at boot we'll now bump the two sysctls to their maximum, making them effectively ineffective. This one was easy. We got rid of two pretty much redundant knobs. Nice!

  • The RLIMIT_NOFILE hard limit is bumped substantially to 512K. Yay, cheap fds! You may have an fd, and you, and you as well, everyone may have an fd!

  • But … we left the soft RLIMIT_NOFILE limit at 1024. We weren't quite ready to break all programs still using select() in 2019 yet. But it's not as bad as it might sound I think: given the hard limit is bumped every program can easily opt-in to a larger number of fds, by setting the soft limit to the hard limit early on — without requiring privileges.

So effectively, with this approach fds should be much less scarce (at least for programs that opt into that), and the limits should be much easier to configure, since there are only two knobs now one really needs to care about:

  • Configure the RLIMIT_NOFILE hard limit to the maximum number of fds you actually want to allow a process.

  • In the program code then either bump the soft to the hard limit, or not. If you do, you basically declare "I understood the problem, I promise to not use select(), drown me fds please!". If you don't then effectively everything remains as it always was.
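
The opt-in itself is tiny; a minimal sketch using nothing but getrlimit(2)/setrlimit(2) could look like this, called once early in main():

#include <sys/resource.h>

/* Sketch of the "I promise not to use select()" opt-in: raise the soft
 * RLIMIT_NOFILE limit to whatever the hard limit already allows.
 * No privileges are needed, since only the soft limit changes. */
static void bump_nofile_limit(void)
{
   struct rlimit rl;

   if (getrlimit(RLIMIT_NOFILE, &rl) < 0)
      return;

   rl.rlim_cur = rl.rlim_max;
   (void) setrlimit(RLIMIT_NOFILE, &rl);
}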

Apparently this approach worked, since the negative feedback on the change was even scarcer than fds traditionally were (ha, fun!). We got reports from pretty much only two projects that were bitten by the change (one being a JVM implementation): they already bumped their soft limit automatically to their hard limit during program initialization, and then allocated an array with one entry per possible fd. With the new high limit this resulted in one massive allocation that traditionally was just a few K, and this caused memory checks to be hit.

Anyway, here's the take away of this blog story:

  • Don't use select() anymore in 2021. Use poll(), epoll, io_uring, …, but for heaven's sake don't use select(). It might have been all the rage in the 1990s but it doesn't scale and is simply not designed for today's programs. I wish the man page of select() would make clearer how icky it is and that there are plenty of preferable APIs.

  • If you hack on a program that potentially uses a lot of fds, add some simple code somewhere to its start-up that bumps the RLIMIT_NOFILE soft limit to the hard limit. But if you do this, you have to make sure your code (and any code that you link to from it) refrains from using select(). (Note: there's at least one glibc NSS plugin using select() internally. Given that NSS modules can end up being loaded into pretty much any process such modules should probably be considered just buggy.)

  • If said program you hack on forks off foreign programs, make sure to reset the RLIMIT_NOFILE soft limit back to 1024 for them. Just because your program might be fine with fds >= 1024 it doesn't mean that those foreign programs might. And unfortunately RLIMIT_NOFILE is inherited down the process tree unless explicitly set.
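
The corresponding reset for forked-off foreign programs might look roughly like this (sketch, to be run in the child right before exec; 1024 is just the traditional default discussed above):

#include <sys/resource.h>
#include <unistd.h>

/* Sketch: restore a select()-compatible soft limit before handing
 * control to a foreign program that may not expect fds >= 1024. */
static void exec_with_default_nofile(char *const argv[])
{
   struct rlimit rl;

   if (getrlimit(RLIMIT_NOFILE, &rl) == 0) {
      rl.rlim_cur = rl.rlim_max < 1024 ? rl.rlim_max : 1024;
      (void) setrlimit(RLIMIT_NOFILE, &rl);
   }

   execvp(argv[0], argv);
}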

And that's all I have for today. I hope this was enlightening.

Click Play

It’s been a while.

I meant to blog. I meant to make new zink-wip snapshots. I meant to shower.

Look, none of us are perfect, and I’m just gonna get into some graphics so nobody remembers how this post started.

tombraider-suballocated.png

Boom, beautiful triangles. Look at that ultra smooth fps in mangohud. Protip: if you’re seeing weird flickering or misrenders in your app/game, try throwing mangohud in front of the zink bus to see if it fixes them.

So what has been going on for however long it’s been since the last post?

In a word: lots.

Here’s the rundown.

The Rundown

The 20210517 zink-wip snapshot is the biggest one in history. I say this with no exaggeration.

Changes since the last snapshot include:

  • an imperial units (and I measured this precisely) fuckton of general driver overhead reduction
  • (yet another) queue/dispatch rewrite, this one more optimized for threaded and multi-context use
  • an actually working disk cache implementation
  • an entire suballocator

One way or another, this is going to feel like a new driver. Ideally I’ll be doing a post every day detailing one of the items on that list, but for now I’ll close the post by saying that zink should be 100%-1000% faster (not a typo) in most scenarios where it was previously much slower than native GL drivers.

Yeah, Big Triangle knows who we are now.

May 09, 2021
This all started with a Mele PCG09. Before testing Linux on it, I took a quick look under Windows, and the device manager there showed an exclamation mark next to a Realtek 8723BS bluetooth device, so BT did not work. Under Linux I quickly found out why: the device actually uses a Broadcom Wifi/BT chipset, attached over SDIO for the Wifi part and over an UART for the BT part. The UART-connected BT part was described in the ACPI tables with a HID (Hardware-ID) of "OBDA8723", not good.

Now I could have easily fixed this with an extra initrd with a DSDT-override, but that did not feel right. There was an option in the BIOS, named "WIFI", which actually controls what HID gets advertised for the Wifi/BT; it was set to "RTL8723", which is obviously wrong, but that option was grayed out. So instead of going for the DSDT-override I really wanted to be able to change that BIOS option and set it to the right value. Some duckduckgo-ing found this blogpost on changing locked BIOS settings.

The flashrom packaged in Fedora dumped the BIOS in one go, and after building UEFITool and ifrextract from source (from their git repos) I could extract the interface description for the BIOS Setup menus without issues (as described in the blogpost). Here is the interesting part of the IFR for changing the Wifi/BT model:


0xC521 One Of: WIFI, VarStoreInfo (VarOffset/VarName): 0x110, VarStore: 0x1, QuestionId: 0x1AB, Size: 1, Min: 0x0, Max 0x2, Step: 0x0 {05 91 53 03 54 03 AB 01 01 00 10 01 10 10 00 02 00}
0xC532 One Of Option: RTL8723, Value (8 bit): 0x1 (default) {09 07 55 03 10 00 01}
0xC539 One Of Option: AP6330, Value (8 bit): 0x2 {09 07 56 03 00 00 02}
0xC540 One Of Option: Disabled, Value (8 bit): 0x0 {09 07 01 04 00 00 00}
0xC547 End One Of {29 02}



So to fix the broken BT I need to change the byte at offset 0x110 in the "Setup" EFI variable, which contains the BIOS settings, from 0x01 to 0x02. Easy. One problem though: the "dd on /sys/firmware/efi/efivars/Setup-..." method described in the blogpost does not work on most devices. Most devices protect the BIOS settings from being modified this way by having 2 Setup-${GUID} EFI variables (with different GUIDs), hiding the real one and leaving a fake one which is only a couple of bytes large.

But the BIOS Setup menu itself is just another EFI executable, so how can this access the real Setup variable? The trick is that the hiding happens when the OS calls ExitBootServices to tell EFI it is ready to take over control of the machine. This means that under Linux the real Setup EFI variable has been hidden early on during boot, but when grub is running it is still available! And there is a patch adding a new setup_var command to grub, which allows changing BIOS settings from within grub.

The original setup_var command picks the first Setup EFI variable it finds, but as mentioned already there are usually 2, so later an improved setup_var_3 command was added which instead skips Setup EFI variables that are too small (as the fake ones are only a few bytes). After building an EFI version of grub with the setup_var* commands added, it is just a matter of booting into a grub commandline and running "setup_var_3 0x110 2". From then on the BIOS shows the WIFI type as AP6330, the ACPI tables report "BCM2E67" as the HID for the BT, and just like that the bluetooth issue has been fixed.


For your convenience I've uploaded a grubia32.efi and a grubx64.efi with the setup_var patches added here. They are built from this branch at this commit (this was just a random branch which I had checked out while working on this).

The Mele PCG09 use-case for modifying hidden BIOS settings is a bit of a corner case, so here is a more broadly applicable example. Intel Bay Trail and Cherry Trail SoCs come with an embedded OTG XHCI controller to allow them to function as a USB device/gadget rather than only being capable of operating as a USB host. Since most devices ship with Windows, and Windows does not really do anything useful with USB device controllers, this controller is disabled by most BIOSes and there is no visible option to enable it. The same approach from above can be used to enable the "USB OTG" option in the BIOS so that we can use it under Linux. Let's take the Teclast X89 (Windows version) tablet as an example. Extracting the IFR and then looking for the "USB OTG" function results in finding this IFR snippet:


0x9560 One Of: USB OTG Support, VarStoreInfo (VarOffset/VarName): 0xDA, VarStore: 0x1, QuestionId: 0xA5, Size: 1, Min: 0x0, Max 0x1, Step: 0x0 {05 91 DE 02 DF 02 A5 00 01 00 DA 00 10 10 00 01 00}
0x9571 Default: DefaultId: 0x0, Value (8 bit): 0x1 {5B 06 00 00 00 01}
0x9577 One Of Option: PCI mode, Value (8 bit): 0x1 {09 07 E0 02 00 00 01}
0x957E One Of Option: Disabled, Value (8 bit): 0x0 {09 07 3B 03 00 00 00}
0x9585 End One Of {29 02}



And then running "setup_var_3 0xda 1" on the grub commandline results in a new "00:16.0 USB controller: Intel Corporation Atom Processor Z36xxx/Z37xxx Series OTG USB Device" entry showing up in lspci.

Actually using this requires a kernel with UDC (USB Device Controller) support enabled, as well as some USB gadget drivers; at least the Fedora kernel does not have these enabled by default. On Bay Trail devices an external device-mode USB-PHY is necessary for device-mode to actually work. On a kernel with UDC enabled you can check if your hardware has such a phy by doing "cat /sys/bus/ulpi/devices/dwc3.4.auto.ulpi/modalias"; if there is a phy this will usually return "ulpi:v0451p1508". If you get "ulpi:v0000p0000" instead then your hardware does not have a device-mode phy and you cannot use gadget mode.

On Cherry Trail devices the device-mode phy is built into the SoC, so on most Cherry Trail devices this just works. There is one caveat though: the x5-z83?0 Cherry Trail SoCs only have one set of USB3 superspeed data lines, and it is part of the USB data lines meant for the OTG port. So if you have a Cherry Trail device with a x5-z83?0 SoC and it has a superspeed (USB3) USB-A port, then that port is using the OTG superspeed lines. When the OTG XHCI controller is enabled and the micro-USB port gets switched to device-mode (which it also does when charging!), this will also switch the superspeed data lines to device-mode, disconnecting any superspeed USB device connected to the USB-A port. So on these devices you need to choose: you can either use the micro-USB port in device-mode, or get superspeed on the USB-A port, but you cannot use both at the same time.

If you have a kernel built with UDC support, a quick test is to run a USB-A to micro-B cable from a desktop or laptop to the tablet and then do "sudo modprobe g_serial" on the tablet. After this you should see a bunch of messages in dmesg on the desktop/laptop about a USB device showing up, ending with something like "cdc_acm 1-3:2.0: ttyACM0: USB ACM device". If you want you can run a serial console on the tablet on /dev/ttyGS0 and then connect to it on the desktop/laptop at /dev/ttyACM0.