planet.freedesktop.org
May 23, 2022

Lately I have been exposing a bit more functionality in V3DV and was wondering how far we are from Vulkan 1.2. Turns out that a lot of the new Vulkan 1.2 features are actually optional and what we have right now (missing a few trivial patches to expose a few things) seems to be sufficient for a minimal implementation.

We actually did a test run with CTS enabling Vulkan 1.2 to verify this and it went surprisingly well, with just a few test failures that I am currently looking into, so I think we should be able to submit conformance soon.

For those who may be interested, here is a list of what we are not supporting (all of these are optional features in Vulkan 1.2):

VK_EXT_descriptor_indexing

I think we should be able to support this in the future.

VK_KHR_shader_float16_int8

In theory we could support this, since the hardware has support for half-float; however, the way this is designed in hardware comes with significant caveats that I think would make it really difficult to take advantage of in practice. It would also require significant work, so it is not something we are planning at present.

VK_KHR_buffer_device_address

We can’t implement this without hacks because the Vulkan spec explicitly defines these addresses to be 64-bit values and the V3D GPU only deals with 32-bit addresses and is not capable of doing any kind of native 64-bit operation. At first I thought we could just lower these to 32-bit (since we know they will be 32-bit), but because the spec makes these explicit 64-bit values, it allows shaders to cast a device address from/to uvec2, which generates 64-bit bitcast instructions, and those require both the destination and source to be 64-bit values.
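
For illustration, here is a minimal GLSL sketch (names are hypothetical) of the kind of cast the spec permits: GL_EXT_buffer_reference_uvec2 allows converting a device address to/from uvec2, and that conversion is what ends up as a 64-bit bitcast in the compiled shader:

#version 450
#extension GL_EXT_buffer_reference : require
#extension GL_EXT_buffer_reference_uvec2 : require

layout(local_size_x = 1) in;

layout(buffer_reference, std430) buffer DataRef { uint value; };

// Hypothetical push constant carrying a device address as two 32-bit words.
layout(push_constant) uniform PC { uvec2 addr; } pc;

void main() {
    DataRef data = DataRef(pc.addr);  // uvec2 -> 64-bit address cast
    data.value = 42u;
}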

VK_EXT_sampler_filter_minmax
VK_KHR_draw_indirect_count
VK_EXT_scalar_block_layout
VK_EXT_shader_viewport_index_layer
VK_KHR_shader_atomic_int64

These lack required hardware support, so we don’t expect to implement them.

May 21, 2022

Previously, I gave you an introduction to mesh/task shaders and wrote up some details about how mesh shaders are implemented in the driver. But I left out the important details of how task shaders (aka. amplification shaders) work in the driver. In this post, I aim to give you some details about how task shaders work under the hood. Like before, this is based on my experience implementing task shaders in RADV and all details are already public information.

Refresher about the task shader API

The task shader (aka. amplification shader in D3D12) is a new stage that runs in workgroups similar to compute shaders. Each task shader workgroup has two jobs: determine how many mesh shader workgroups it should launch (dispatch size), and optionally create a “payload” (up to 16K of data of your choice) which is passed to mesh shaders.

Additionally, the API allows task shaders to perform atomic operations on the payload variables.

Typical use of task shaders can be: cluster culling, LOD selection, geometry amplification.
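
For reference, here is a minimal sketch of what such a shader looks like using the vendor-specific NV_mesh_shader GLSL syntax; the LOD logic is made up purely for illustration:

#version 450
#extension GL_NV_mesh_shader : require

layout(local_size_x = 32) in;

// Payload handed to every mesh shader workgroup launched by this task
// workgroup (up to 16K in total).
taskNV out Task {
    uint base_meshlet;
} OUT;

void main() {
    // Hypothetical LOD selection: launch fewer mesh workgroups for distant
    // geometry. A real shader would compute this from the scene.
    uint lod = gl_WorkGroupID.x % 4u;

    if (gl_LocalInvocationID.x == 0u) {
        OUT.base_meshlet = gl_WorkGroupID.x * 32u;
        gl_TaskCountNV   = 32u >> lod;   // number of mesh workgroups to launch
    }
}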

Expectations on task shaders

Before we get into any HW specific details, there are a few things we should unpack first. Based on the API programming model, let’s think about some expectations of a good driver implementation.

Storing the output task payload. There must exist some kind of buffer where the task payload is stored, and the size of this buffer will obviously be a limiting factor on how many task shader workgroups can run in parallel. Therefore, the implementation must ensure that only as many task workgroups run as there is space in this buffer. Preferably this would be a ring buffer whose entries get reused between different task shader workgroups.

Analogy with the tessellator. The above requirements are pretty similar to what tessellation can already do. So a natural conclusion may be that we could implement task shaders by abusing the tessellator. However, this would introduce a potential bottleneck on fixed-function hardware, which we would prefer to avoid.

Analogy with a compute pre-pass. Another similar thing that comes to mind is a compute pre-pass. Many games already do something like this: some pre-processing in a compute dispatch that is executed before a draw call. Of course, the application has to insert a barrier between the dispatch and the draw, which means the draw can’t start before every invocation in the dispatch is finished. In reality, not every graphics shader invocation depends on the results of all compute invocations, but there is no way to express a more fine-grained dependency. For task shaders, it is preferable to avoid this barrier and allow task and mesh shader invocations to overlap.
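
As a concrete picture of that pattern, here is a hedged Vulkan sketch (the command buffer, pipelines and counts are assumed to be set up elsewhere); the coarse barrier in the middle is exactly what task/mesh shading lets us avoid:

#include <vulkan/vulkan.h>

/* Classic compute pre-pass: dispatch, full barrier, then the draw that
 * consumes the results. */
static void record_prepass_then_draw(VkCommandBuffer cmd,
                                     uint32_t groups_x, uint32_t vertex_count)
{
    vkCmdDispatch(cmd, groups_x, 1, 1);

    const VkMemoryBarrier barrier = {
        .sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER,
        .srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT,
        .dstAccessMask = VK_ACCESS_SHADER_READ_BIT,
    };
    /* No draw invocation may start until every compute invocation finished. */
    vkCmdPipelineBarrier(cmd, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                         VK_PIPELINE_STAGE_VERTEX_SHADER_BIT, 0,
                         1, &barrier, 0, NULL, 0, NULL);

    vkCmdDraw(cmd, vertex_count, 1, 0, 0);
}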

Task shaders on AMD HW

What I discuss here is based on information that is already publicly available in open source drivers. If you are already familiar with how AMD’s own PAL-based drivers work, you won’t find any surprises here.

First things first. Under the hood, task shaders are compiled to a plain old compute shader. The task payload is located in VRAM. The shader code that stores the mesh dispatch size and payload is compiled to memory writes which store these in VRAM ring buffers. Even though they are compute shaders as far as the AMD HW is concerned, task shaders do not work like a compute pre-pass. Instead, task shaders are dispatched on an async compute queue while at the same time the mesh shader work is executed on the graphics queue in parallel.

The task+mesh dispatch packets are different from a regular compute dispatch. The compute and graphics queue firmwares work together in parallel:

  • Compute queue launches up to as many task workgroups as it has space available in the ring buffer.
  • Graphics queue waits until a task workgroup is finished and can launch mesh shader workgroups immediately. Execution of mesh dispatches from a finished task workgroup can therefore overlap with other task workgroups.
  • When a mesh dispatch from a task workgroup is finished, its slot in the ring buffer can be reused and a new task workgroup can be launched.
  • When the ring buffer is full, the compute queue waits until a mesh dispatch is finished, before launching the next task workgroup.

You can find the exact details in the PAL source code or in the RADV merge requests.

Side note, getting some implementation details wrong can easily cause a deadlock on the GPU. It is great fun to debug these.

The relevant details here are that most of the hard work is implemented in the firmware (good news, because that means I don’t have to implement it), that task shaders are executed on an async compute queue, and that the driver now has to submit compute and graphics work in parallel.

Keep in mind that the API hides this detail and pretends that the mesh shading pipeline is just another graphics pipeline that the application can submit to a graphics queue. So, once again we have a mismatch between the API programming model and what the HW actually does.

Squeezing a hidden compute pipeline in your graphics

In order to use this beautiful scheme provided by the firmware, the driver needs to do two things:

  • Create a compute pipeline from the task shader.
  • Submit the task shader work on the async compute queue while at the same time submitting the mesh and pixel shader work on the graphics queue.

We already had good support for compute pipelines in RADV (as much as the API needs), but internally in the driver we’ve never had this kind of close cooperation between graphics and compute.

When you use a draw call in a command buffer with a pipeline that has a task shader, RADV must create a hidden, internal compute command buffer. This internal compute command buffer contains the task shader dispatch packet, while the graphics command buffer contains the packet that dispatches the mesh shaders. We must also ensure correct synchronization between these two command buffers according to application barriers ― because of the API mismatch it must work as if the internal compute cmdbuf was part of the graphics cmdbuf. We also need to emit the same descriptors and push constants, etc. When the application submits the graphics queue, this new, internal compute command buffer is then submitted to the async compute queue.

Thus far, this sounds pretty logical and easy.

The actual hard work is to make it possible for the driver to submit work to different queues at the same time. RADV’s queue code was written assuming that there is a 1:1 mapping between radv_queue objects and HW queues. To make task shaders work we must now break this assumption.

So, of course I had to do some crazy refactor to enable this. At the time of writing the AMDGPU Linux kernel driver doesn’t support “gang submit” yet, so I use scheduled dependencies instead. This has the drawback of submitting to the two queues sequentially rather than doing everything in the same submit.

Conclusion, perf considerations

Let’s turn the above wall of text into some performance considerations that you can actually use when you write your next mesh shading application.

  1. Because task shaders are executed on a different HW queue, there is some overhead. Don’t use task shaders for small draws or other cases when this overhead may be more than what you gain from them.
  2. For the same reason, barriers may require the driver to emit some commands that stall the async compute queue. Be mindful of your barriers (eg. top of pipe, etc) and only use these when your task shader actually depends on some previous graphics work.
  3. Because task payload is written to VRAM by the task shader, and has to be read from VRAM by the mesh shader, there is some latency. Only use as much payload memory as you need. Try to compact the memory use by packing your data etc. (see the sketch after this list).
  4. When you have a lot of geometry data, it is beneficial to implement cluster culling in your task shader. After you’ve done this, it may or may not be worth it to implement per-triangle culling in your mesh shader.
  5. Don’t try to reimplement the classic vertex processing pipeline or emulate fixed-function HW with task+mesh shaders. Instead, come up with simpler ways that work better for your app.
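
Regarding point 3, here is a hypothetical NV_mesh_shader-style sketch of payload compaction, packing four 8-bit meshlet-local indices per uint instead of spending a full uint on each one; the index values themselves are made up for illustration:

#version 450
#extension GL_NV_mesh_shader : require

layout(local_size_x = 32) in;

// 32 indices packed into 32 bytes instead of 128 bytes.
taskNV out Task {
    uint packed_indices[8];
} OUT;

void main() {
    uint tid = gl_LocalInvocationID.x;

    // In a real shader these indices would come from culling/LOD work;
    // here each of the first 8 threads just packs four consecutive values.
    if (tid < 8u) {
        uint base = tid * 4u;
        OUT.packed_indices[tid] =  (base         & 0xFFu)
                                | ((base + 1u) & 0xFFu) << 8
                                | ((base + 2u) & 0xFFu) << 16
                                | ((base + 3u) & 0xFFu) << 24;
    }

    if (tid == 0u)
        gl_TaskCountNV = 1u;
}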

NVidia also has some perf recommendations here which mostly apply to other HW as well, except for the recommended number of vertices and primitives per meshlet, because the sweet spot for that can differ between GPU architectures.

Stay tuned

It has been officially confirmed that a Vulkan cross-vendor mesh shading extension is coming soon.

While I can’t give you any details about the new extension, I think it won’t be a surprise to anyone that it may have been the motivation for my work on mesh and task shaders.

Once the new extension goes public, I will post some thoughts about it and a comparison to the vendor-specific NV_mesh_shader extension.

May 20, 2022
Now that FESCo has decided that Fedora will keep supporting BIOS booting, the people working on Fedora's bootloader stack will need help from the Fedora community to keep Fedora booting on systems which require Legacy BIOS to boot.

To help with this the Fedora BIOS boot SIG (special interest group) has been formed. The main goal of this SIG is to help the Fedora bootloader people by:

  1. Doing regular testing of nightly Fedora N + 1 composes on hardware which only supports BIOS booting.

  2. Triaging and/or fixing BIOS boot related bugs.


A biosboot-sig@lists.fedoraproject.org mailing list and bugzilla account have been created, which will be used to discuss testing results and as assignee / Cc for bootloader bugzillas related to BIOS booting.

If you are interested in helping with Fedora BIOS boot support please:

  1. Subscribe to the email-list

  2. Add yourself to the Members section of the SIG's wiki page

Again

Two posts in one month is a record for May of 2022. Might even shoot for three at this rate.

Yesterday I posted a hasty roundup of what’s been going on with zink.

It was not comprehensive.

What else have I been working on, you might ask.

Introducing A New Zink-enabled Platform

We all know I’m champing at the bit to further my goal of world domination with zink on all platforms, so it was only a matter of time before I branched out from the big two-point-five of desktop GPU vendors (NVIDIA, AMD, and also maybe Intel sometimes).

Thus it was that a glorious bounty descended from on high at Valve: a shiny new A630 Coachz Chromebook that runs on the open source Qualcomm driver, Turnip.

How did the initial testing go?

My initial combined, all-the-tests-at-once CTS run (KHR46, 4.6 confidential, all dEQP) yielded over 1500 crashes and over 3000 failures.

Brutal.

I then accidentally bricked my kernel by foolishly allowing my distro to upgrade it for me, at which point I defenestrated the device.

Problems

There were many problems, but two were glaring:

  • Maximum of 4 descriptor sets allowed
  • No 64-bit support

It’s tough to say which issue is a bigger problem. As zinkologists know, the preferred minimum number of descriptor sets is six:

  • Constants
  • UBOs
  • Samplers
  • SSBOs
  • Images
  • Bindless

Thus, almost all the crashes I encountered were from the tests which attempted to use shader images, as they access an out-of-bounds set.

On the other hand, while there was a period when I had to use zink without 64-bit support on my Intel machine before emulation was added at the driver level, Intel’s driver has always been helpfully tolerant of my continuing to jam 64-bit i/o into shaders without adverse effects, since the backend compiler supported such operations. This was not the case with Turnip, and all I got for my laziness was crashing and failing.

Solutions?

Both problems were varying degrees of excruciating to fix, so I chose the one I thought would be worse to start off: descriptors.

We all know zink has various modes for its descriptor management:

  • caching
  • templates
  • caching without templates

The default is currently the caching mode, which calculates hash values for all the descriptors being used for a draw/dispatch and then stores the sets for reuse. My plan was to collapse the like sets: merging UBOs and SSBOs into one set as well as Samplers and Images into another set. This would use a total of four sets including the bindless one, opening up all the features.

The project actually turniped out to be easier than expected. All I needed to do was add indirection to the existing set indexing, then modify the indirect indices on startup based on how many sets were available. In this way, accessing e.g., descriptor_set[SSBO] would actually access descriptor_set[UBO] in the new compact mode, and everything else would work the same.
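
A hypothetical sketch of that indirection (not the actual zink code, just the idea): logical set IDs keep their meaning, and a remap table decides which hardware set each one lands in.

#include <stdbool.h>

enum logical_set {
   SET_CONSTANTS, SET_UBO, SET_SAMPLER, SET_SSBO, SET_IMAGE, SET_BINDLESS,
   SET_COUNT
};

static unsigned set_remap[SET_COUNT];

static void init_set_remap(bool compact)
{
   for (unsigned i = 0; i < SET_COUNT; i++)
      set_remap[i] = i;
   if (compact) {
      /* Only 4 sets available: merge UBOs+SSBOs and samplers+images. */
      set_remap[SET_SSBO]     = set_remap[SET_UBO];
      set_remap[SET_IMAGE]    = set_remap[SET_SAMPLER];
      set_remap[SET_BINDLESS] = 3;
   }
}

/* Callers then index descriptor_set[set_remap[SET_SSBO]] instead of
 * descriptor_set[SET_SSBO], and everything else works the same. */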

It was then that I remembered the tricky part of doing anything descriptor-related: updating the caching mechanism.

In time, I settled on merging the hash values from the merged sets, performed as carefully as the fusion dance, and things seemed to be working, bringing results down to just over 1000 total CTS fails.

The harder problem was actually the easier one.

Which means I’m gonna need another post.

May 19, 2022

Well

I’ve been meaning to blog for a while.

As per usual.

Finally though, I’m doing it. It’s been a month. Roundup time.

22.1

Mesa 22.1 is out, and it has the best zink release of all time. You’ve got features like:

  • kopper
  • kopper
  • kopper

Really not sure I’m getting across how momentous kopper is, but it’s probably the biggest thing to happen to zink since… I don’t know. Threaded context?

B I G.

Bugs

There’s a lot of bugs in the world. As of now, there’s a lot fewer bugs in zink than there used to be.

How many fewer, you might ask?

Let’s take a gander.

  • 0 GL4.6 CTS fails for lavapipe
  • 0 GL4.6 CTS fails for the new ANV CI job I just added
  • less than 5 GL4.6 CTS fails for RADV
  • this space reserved for NVIDIA, which has bugs in CTS itself and (probably) driver issues, but is still a very low count

In short, conformance submission pending.

Anniversary

This roughly marks 2 years since I started working on zink full-time.

Hooray.

Windows

A quick follow-up to the previous post: there was an issue in the initial WGL handling which prevented zink from loading as expected. This is now fixed.

May 16, 2022

We haven’t posted updates to the work done on the V3DV driver since we announced the driver becoming Vulkan 1.1 Conformant.

But after reaching that milestone, we’ve been very busy working on more improvements, so let’s summarize the work done since then.

Multisync support

As mentioned in past posts, for the Vulkan driver we tried to focus as much as possible on the userspace part, so we re-used the already existing kernel interface that we had for V3D, used by the OpenGL driver, without modifying or extending it.

This worked fine in general, except for synchronization. The V3D kernel interface only supported one synchronization object per submission. This didn’t map properly to Vulkan synchronization, which is more detailed and complex and allows defining several semaphores/fences. We initially handled the situation with workarounds, and left some optional features as unsupported.

After our 1.1 conformance work, our colleague Melissa Wen started to work on adding support for multiple semaphores on the V3D kernel side. Then she also implemented the changes on V3DV to use this new feature. If you want more technical info, she wrote a very detailed explanation on her blog (part1 and part2).

For now the driver has two codepaths that are used depending on whether the kernel supports this new feature or not. That also means that, depending on the kernel, the V3DV driver could expose a slightly different set of supported features.

More common code – Migration to the common synchronization framework

For a while, Mesa developers have been doing a great effort to refactor and move common functionality to a single place, so it can be used by all drivers, reducing the amount of code each driver needs to maintain.

During these months we have been porting V3DV to some of that infrastructure, from small bits (common VkShaderModule to NIR code), to a really big one: common synchronization framework.

As mentioned, the Vulkan synchronization model is really detailed and powerful. But that also means it is complex. V3DV support for Vulkan synchronization included heavy use of threads. For example, V3DV needed to rely on a CPU wait (polling with threads) to implement vkCmdWaitEvents, as the GPU lacked a mechanism for this.

This was common to several drivers. So at some point there were multiple versions of complex synchronization code, one per driver. But, some months ago, Jason Ekstrand refactored Anvil support and collaborated with other driver developers to create a common framework. Obviously each driver has its own needs, but the framework provides enough hooks for that.

After some gitlab and IRC chats, Jason provided a Merge Request with the port of V3DV to this new common framework, that we iterated and tested through the review process.

Also, with this port we got timeline semaphore support for free. Thanks to this change, we shed ~1.2k lines of code in total (and have more features!).

Again, we want to thank Jason Ekstrand for all his help.

Support for more extensions:

Since 1.1 was announced, the following extensions have been implemented and exposed:

  • VK_EXT_debug_utils
  • VK_KHR_timeline_semaphore
  • VK_KHR_create_renderpass2
  • VK_EXT_4444_formats
  • VK_KHR_driver_properties
  • VK_KHR_16bit_storage and VK_KHR_8bit_storage
  • VK_KHR_imageless_framebuffer
  • VK_KHR_depth_stencil_resolve
  • VK_EXT_image_drm_format_modifier
  • VK_EXT_line_rasterization
  • VK_EXT_inline_uniform_block
  • VK_EXT_separate_stencil_usage
  • VK_KHR_separate_depth_stencil_layouts
  • VK_KHR_pipeline_executable_properties
  • VK_KHR_shader_float_controls
  • VK_KHR_spirv_1_4

If you want more details about VK_KHR_pipeline_executable_properties, Iago recently wrote a blog post about it (here).

Android support

Android support for V3DV was added thanks to the work of Roman Stratiienko, who implemented this and submitted Mesa patches. We also want to thank the Android RPi team, and the Lineage RPi maintainer (Konsta) who also created and tested an initial version of that support, which was used as the baseline for the code that Roman submitted. I didn’t test it myself (it’s in my personal TO-DO list), but LineageOS images for the RPi4 are already available.

Performance

In addition to new functionality, we also have been working on improving performance. Most of the focus was done on the V3D shader compiler, as improvements to it would be shared among the OpenGL and Vulkan drivers.

But one of the features specific to the Vulkan driver (pending a port to OpenGL) is double buffer mode, which is only available when MSAA is not enabled. This mode splits the tile buffer size in half, so the driver can start processing the next tile while the current one is being stored in memory.

In theory this could improve performance by reducing tile store overhead, so it would be more beneficial when vertex/geometry shaders aren’t too expensive. However, it comes at the cost of reducing the tile size, which also causes some overhead on its own.

Testing shows that this helps in some cases (e.g. the Vulkan Quake ports) but hurts in others (e.g. Unreal Engine 4), so for the time being we don’t enable this by default. It can be enabled selectively by adding V3D_DEBUG=db to the environment variables. The idea for the future would be to implement a heuristic that decides when to activate this mode.

FOSDEM 2022

If you are interested in watching an overview of the improvements and changes to the driver during the last year, we made a presentation at FOSDEM 2022:
“v3dv: Status Update for Open Source Vulkan Driver for Raspberry Pi 4”

May 13, 2022

In late 2020, Apple debuted the M1 with Apple’s GPU architecture, AGX, rumoured to be derived from Imagination’s PowerVR series. Since then, we’ve been reverse-engineering AGX and building open source graphics drivers. Last January, I rendered a triangle with my own code, but there has since been a heinous bug lurking:

The driver fails to render large amounts of geometry.

Spinning a cube is fine, low polygon geometry is okay, but detailed models won’t render. Instead, the GPU renders only part of the model and then faults.

Partially rendered bunny

It’s hard to pinpoint how much we can render without faults. It’s not just the geometry complexity that matters. The same geometry can render with simple shaders but fault with complex ones.

That suggests rendering detailed geometry with a complex shader “takes too long”, and the GPU is timing out. Maybe it renders only the parts it finished in time.

Given the hardware architecture, this explanation is unlikely.

This hypothesis is easy to test, because we can control for timing with a shader that takes as long as we like:

for (int i = 0; i < LARGE_NUMBER; ++i) {
    /* some work to prevent the optimizer from removing the loop */
}

After experimenting with such a shader, we learn…

  • If shaders have a time limit to protect against infinite loops, it’s astronomically high. There’s no way our bunny hits that limit.
  • The symptoms of timing out differ from the symptoms of our driver rendering too much geometry.

That theory is out.

Let’s experiment more. Modifying the shader and seeing where it breaks, we find the only part of the shader contributing to the bug: the amount of data interpolated per vertex. Modern graphics APIs allow specifying “varying” data for each vertex, like the colour or the surface normal. Then, for each triangle the hardware renders, these “varyings” are interpolated across the triangle to provide smooth inputs to the fragment shader, allowing efficient implementation of common graphics techniques like Blinn-Phong shading.
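
For example, a typical vertex shader declares a couple of varyings like the sketch below, and every one of them adds to the per-vertex data the hardware has to carry to the fragment shader (the specific attributes and uniforms here are just an illustration):

#version 450

layout(location = 0) in vec3 a_position;
layout(location = 1) in vec3 a_normal;

// Per-vertex "varyings": interpolated across each triangle before reaching
// the fragment shader.
layout(location = 0) out vec3 v_normal;
layout(location = 1) out vec3 v_view_dir;

layout(set = 0, binding = 0) uniform Transforms {
    mat4 mvp;
    mat4 model;
    vec3 eye;
} u;

void main() {
    vec4 world  = u.model * vec4(a_position, 1.0);
    v_normal    = mat3(u.model) * a_normal;
    v_view_dir  = u.eye - world.xyz;
    gl_Position = u.mvp * vec4(a_position, 1.0);
}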

Putting the pieces together, what matters is the product of the number of vertices (geometry complexity) times amount of data per vertex (“shading” complexity). That product is “total amount of per-vertex data”. The GPU faults if we use too much total per-vertex data.

Why?

When the hardware processes each vertex, the vertex shader produces per-vertex data. That data has to go somewhere. How this works depends on the hardware architecture. Let’s consider common GPU architectures.1

Traditional immediate mode renderers render directly into the framebuffer. They first run the vertex shader for each vertex of a triangle, then run the fragment shader for each pixel in the triangle. Per-vertex “varying” data is passed almost directly between the shaders, so immediate mode renderers are efficient for complex scenes.

There is a drawback: rendering directly into the framebuffer requires tremendous amounts of memory access to constantly write the results of the fragment shader and to read out back results when blending. Immediate mode renderers are suited to discrete, power-hungry desktop GPUs with dedicated video RAM.

By contrast, tile-based deferred renderers split rendering into two passes. First, the hardware runs all vertex shaders for the entire frame, not just for a single model. Then the framebuffer is divided into small tiles, and dedicated hardware called a tiler determines which triangles are in each tile. Finally, for each tile, the hardware runs all relevant fragment shaders and writes the final blended result to memory.

Tilers reduce memory traffic required for the framebuffer. As the hardware renders a single tile at a time, it keeps a “cached” copy of that tile of the framebuffer (called the “tilebuffer”). The tilebuffer is small, just a few kilobytes, but tilebuffer access is fast. Writing to the tilebuffer is cheap, and unlike immediate renderers, blending is almost free. Because main memory access is expensive and mobile GPUs can’t afford dedicated video memory, tilers are suited to mobile GPUs, like Arm’s Mali, Imagination’s PowerVR, and Apple’s AGX.

Yes, AGX is a mobile GPU, designed for the iPhone. The M1 is a screaming fast desktop, but its unified memory and tiler GPU have roots in mobile phones. Tilers work well on the desktop, but there are some drawbacks.

First, at the start of a frame, the contents of the tilebuffer are undefined. If the application needs to preserve existing framebuffer contents, the driver needs to load the framebuffer from main memory and store it into the tilebuffer. This is expensive.

Second, because all vertex shaders are run before any fragment shaders, the hardware needs a buffer to store the outputs of all vertex shaders. In general, there is much more data required than space inside the GPU, so this buffer must be in main memory. This is also expensive.

Ah-ha. Because AGX is a tiler, it requires a buffer of all per-vertex data. We fault when we use too much total per-vertex data, overflowing the buffer.

…So how do we allocate a larger buffer?

On some tilers, like older versions of Arm’s Mali GPU, the userspace driver computes how large this “varyings” buffer should be and allocates it.2 To fix the faults, we can try increasing the sizes of all buffers we allocate, in the hopes that one of them contains the per-vertex data.

No dice.

It’s prudent to observe what Apple’s Metal driver does. We can cook up a Metal program drawing variable amounts of geometry and trace all GPU memory allocations that Metal performs while running our program. Doing so, we learn that increasing the amount of geometry drawn does not increase the sizes of any allocated buffers. In fact, it doesn’t change anything in the command buffer submitted to the kernel, except for the single “number of vertices” field in the draw command.

We know that buffer exists. If it’s not allocated by userspace – and by now it seems that it’s not – it must be allocated by the kernel or firmware.

Here’s a funny thought: maybe we don’t specify the size of the buffer at all. Maybe it’s okay for it to overflow, and there’s a way to handle the overflow.

It’s time for a little reconnaissance. Digging through what little public documentation exists for AGX, we learn from one WWDC presentation:

The Tiled Vertex Buffer stores the Tiling phase output, which includes the post-transform vertex data…

But it may cause a Partial Render if full. A Partial Render is when the GPU splits the render pass in order to flush the contents of that buffer.

Bullseye. The buffer we’re chasing, the “tiled vertex buffer”, can overflow. To cope, the GPU stops accepting new geometry, renders the existing geometry, and restarts rendering.

Since partial renders hurt performance, Metal application developers need to know about them to optimize their applications. There should be performance counters flagging this issue. Poking around, we find two:

  • Number of partial renders.
  • Number of bytes used of the parameter buffer.

Wait, what’s a “parameter buffer”?

Remember the rumours that AGX is derived from PowerVR? The public PowerVR optimization guides explain:

[The] list containing pointers to each vertex passed in from the application… is called the parameter buffer (PB) and is stored in system memory along with the vertex data.

Each varying requires additional space in the parameter buffer.

The Tiled Vertex Buffer is the Parameter Buffer. PB is the PowerVR name, TVB is the public Apple name, and PB is still an internal Apple name.

What happens when PowerVR overflows the parameter buffer?

An old PowerVR presentation says that when the parameter buffer is full, the “render is flushed”, meaning “flushed data must be retrieved from the frame buffer as successive tile renders are performed”. In other words, it performs a partial render.

Back to the Apple M1, it seems the hardware is failing to perform a partial render. Let’s revisit the broken render.

Partially rendered bunny, again

Notice that parts of the model are correctly rendered. The parts that are not contain only the black clear colour written at the start of the scene. Let’s consider the logical order of events.

First, the hardware runs vertex shaders for the bunny until the parameter buffer overflows. This works: the partial geometry is correct.

Second, the hardware rasterizes the partial geometry and runs the fragment shaders. This works: the shading is correct.

Third, the hardware flushes the partial render to the framebuffer. This must work for us to see anything at all.

Fourth, the hardware runs vertex shaders for the rest of the bunny’s geometry. This ought to work: the configuration is identical to the original vertex shaders.

Fifth, the hardware rasterizes and shades the rest of the geometry, blending with the old partial render. Because AGX is a tiler, to preserve that existing partial render, the hardware needs to load it back into the tilebuffer. We have no idea how it does this.

Finally, the hardware flushes the render to the framebuffer. This should work as it did the first time.

The only problematic step is loading the framebuffer back into the tilebuffer after a partial render. Usually, the driver supplies two “extra” fragment shaders. One clears the tilebuffer at the start, and the other flushes out the tilebuffer contents at the end.

If the application needs the existing framebuffer contents preserved, instead of writing a clear colour, the “load tilebuffer” program instead reads from the framebuffer to reload the contents. Handling this requires quite a bit of code, but it works in our driver.
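
Conceptually, the load program does something like the following GLSL sketch (AGX’s real internal programs are not GLSL, and the binding and flag below are hypothetical): it either writes the clear colour or re-reads the framebuffer into the tilebuffer.

#version 450

layout(location = 0) out vec4 tile_colour;

// Hypothetical resources: the previous framebuffer contents and a flag
// selecting "clear" vs "load".
layout(set = 0, binding = 0) uniform sampler2D previous_framebuffer;
layout(push_constant) uniform Setup {
    vec4 clear_colour;
    uint load;
} setup;

void main() {
    if (setup.load != 0u)
        tile_colour = texelFetch(previous_framebuffer,
                                 ivec2(gl_FragCoord.xy), 0);
    else
        tile_colour = setup.clear_colour;
}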

Looking closer, AGX requires more auxiliary programs.

The “store” program is supplied twice. I noticed this when initially bringing up the hardware, but the reason for the duplication was unclear. Omitting each copy separately and seeing what breaks, the reason becomes clear: one program flushes the final render, and the other flushes a partial render.3

…What about the program that loads the framebuffer into the tilebuffer?

When a partial render is possible, there are two “load” programs. One writes the clear colour or loads the framebuffer, depending on the application setting. We understand this one. The other always loads the framebuffer.

…Always loads the framebuffer, as in, for loading back with a partial render even if there is a clear at the start of the frame?

If this program is the issue, we can confirm easily. Metal must require it to draw the same bunny, so we can write a Metal application drawing the bunny and stomp over its GPU memory to replace this auxiliary load program with one always loading with black.

Metal drawing the bunny, stomping over its memory.

Doing so, Metal fails in a similar way. That means we’re at the root cause. Looking at our own driver code, we don’t specify any program for this partial render load. Up until now, that’s worked okay. If the parameter buffer is never overflowed, this program is unused. As soon as a partial render is required, however, failing to provide this program means the GPU dereferences a null pointer and faults. That explains our GPU faults at the beginning.

Following Metal, we supply our own program to load back the tilebuffer after a partial render…

Bunny with the fix

…which does not fix the rendering! Cursed, this GPU. The faults go away, but the render still isn’t quite right for the first few frames, indicating partial renders are still broken. Notice the weird artefacts on the feet.

Curiously, the render “repairs itself” after a few frames, suggesting the parameter buffer stops overflowing. This implies the parameter buffer can be resized (by the kernel or by the firmware), and the system is growing the parameter buffer after a few frames in response to overflow. This mechanism makes sense:

  • The hardware can’t allocate more parameter buffer space itself.
  • Overflowing the parameter buffer is expensive, as partial renders require tremendous memory bandwidth.
  • Overallocating the parameter buffer wastes memory for applications rendering simple geometry.

Starting the parameter buffer small and growing in response to overflow provides a balance, reducing the GPU’s memory footprint and minimizing partial renders.

Back to our misrendering. There are actually two buffers being used by our program, a colour buffer (framebuffer)… and a depth buffer. The depth buffer isn’t directly visible, but facilitates the “depth test”, which discards far pixels that are occluded by other close pixels. While the partial render mechanism discards geometry, the depth test discards pixels.

That would explain the missing pixels on our bunny. The depth test is broken with partial renders. Why? The depth test depends on the depth buffer, so the depth buffer must also be stored after a partial render and loaded back when resuming. Comparing a trace from our driver to a trace from Metal, looking for any relevant difference, we eventually stumble on the configuration required to make depth buffer flushes work.

And with that, we get our bunny.

The final Phong shaded bunny

  1. These explanations are massive oversimplifications of how modern GPUs work, but it’s good enough for our purposes here.↩︎

  2. This is a worse idea than it sounds. Starting with the new Valhall architecture, Mali allocates varyings much more efficiently.↩︎

  3. Why the duplication? I have not yet observed Metal using different programs for each. However, for front buffer rendering, partial renders need to be flushed to a temporary buffer for this scheme to work. Of course, you may as well use double buffering at that point.↩︎

May 12, 2022

In part 1 I gave a brief introduction on what mesh and task shaders are from the perspective of application developers. Now it’s time to dive deeper and talk about how mesh shaders are implemented in a Vulkan driver on AMD HW. Note that everything I discuss here is based on my experience and understanding as I was working on mesh shader support in RADV and is already available as public information in open source driver code. The goal of this blog post is to elaborate on how mesh shaders are implemented on the NGG hardware in AMD RDNA2 GPUs, and to show how these details affect shader performance. Hopefully, this helps the reader better understand how the concepts in the API are translated to the HW and what pitfalls to avoid to get good perf.

Short intro to NGG

NGG (Next Generation Geometry) is the technology that is responsible for any vertex and geometry processing in RDNA GPUs (with some caveats). Also known as “primitive shader”, the main innovations of NGG are:

  • Shaders are aware of not only vertices, but also primitives (this is why they are called primitive shader).
  • The output topology is entirely up to the shader, meaning that it can create output vertices and primitives with an arbitrary topology regardless of its input.
  • On RDNA2 and newer, per-primitive output attributes are also supported.

This flexibility allows the driver to implement every vertex/geometry processing stage using NGG. Vertex, tess eval and geometry shaders can all be compiled to NGG “primitive shaders”. The only major limiting factor is that each thread (SIMD lane) can only output up to 1 vertex and 1 primitive (with caveats).

The driver is also capable of extending the application shaders with sweet stuff such as per-triangle culling, but this is not the main focus of this blog post. I also won’t cover the caveats here, but I may write more about NGG in the future.

Mapping the mesh shader API to NGG

The draw commands as executed on the GPU only understand a number of input vertices, but the mesh shader API draw calls specify a number of workgroups instead. To make it work, we configure the shader such that the number of input vertices per workgroup is 1, and the output is set to what you passed into the API. This way, the FW can figure out how many workgroups it really needs to launch.

The driver has to accommodate the HW limitation above, so we must ensure that in the compiled shader, each thread only outputs up to 1 vertex and 1 primitive. Reminder: the API programming model allows any shader invocation to write any vertex and/or primitive. So, there is a fundamental mismatch between the programming model and what the HW can do.

This raises a few interesting questions.

How do we allow any thread to write any vertex/primitive? The driver allocates some LDS (shared memory) space, and writes all mesh shader outputs there. At the very end of the shader, each thread reads the attributes of the vertex and primitive that matches the thread ID and outputs that single vertex and primitive. This roundtrip to the LDS can be omitted if an output is only written by the thread with matching thread ID. (Note: at the time of writing, I haven’t implemented this optimization yet, but I plan to.)
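
As a GLSL-flavoured sketch of that lowering (the real driver does it in NIR, and the vertex data here is made up):

#version 450
#extension GL_NV_mesh_shader : require

layout(local_size_x = 64) in;
layout(triangles, max_vertices = 64, max_primitives = 64) out;

// Driver-allocated LDS: any invocation may write any vertex slot here.
shared vec4 lds_position[64];

void main() {
    uint tid = gl_LocalInvocationID.x;

    // "API shader" part: invocation N writes vertex (N + 1) % 64, i.e. not
    // its own index, which is what forces the LDS roundtrip.
    uint v = (tid + 1u) % 64u;
    lds_position[v] = vec4(float(v), 0.0, 0.0, 1.0);

    barrier();

    // Driver-inserted epilogue: each HW thread exports at most one vertex,
    // the one matching its own thread ID.
    gl_MeshVerticesNV[tid].gl_Position = lds_position[tid];

    if (tid == 0u)
        gl_PrimitiveCountNV = 0u;   // primitives omitted to keep this short
}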

What if the MS workgroup size is less than the max number of output vertices or primitives? Each HW thread can create up to 1 vertex and 1 primitive. The driver has to set the real workgroup size accordingly:
hw workgroup size = max(api workgroup size, max vertex count, max primitive count)
The result is that the HW will get a workgroup that has some threads that execute the code you wrote (the “API shader”), and then some that won’t do anything but wait until the very end to output their up to 1 vertex and 1 primitive. It can result in poor occupancy (low HW utilization = bad performance).

What if the shader also has barriers in it? This is now turning into a headache. The driver has to ensure that the threads that “do nothing” also execute an equal amount of barriers as those that run your API shader. If the HW workgroup has the same number of waves as the API shader, this is trivial. Otherwise, we have to emit some extra code that keeps the extra waves running in a loop executing barriers. This is the worst.

What if the API shader also uses shared memory, or not all outputs fit the LDS? The D3D12 spec requires the driver to have at least 28K shared memory (LDS) available to the shader. However, graphics shaders can only access up to 32K LDS. How do we make this work, considering the above fact that the driver has to write mesh shader outputs to LDS? This is getting really ugly now, but in that case, the driver is forced to write MS outputs to VRAM instead of LDS. (Note: at the time of writing, I haven’t implemented this edge case yet, but I plan to.)

How do you deal with the compute-like stuff, eg. workgroup ID, subgroup ID, etc.? Fortunately, most of these were already available to the shader, just not exposed in the traditional VS, TES, GS programming model. The only pain point is the workgroup ID which needs trickery. I already mentioned above that the HW is tricked into thinking that each MS workgroup has 1 input vertex. So we can just use the same register that contains the vertex ID for getting the workgroup ID.

Conclusion, performance considerations

The above implementation details can be turned into performance recommendations.

Specify a MS workgroup size that matches the maximum amount of vertices and primitives. Also, distribute the work among the full workgroup rather than leaving some threads doing nothing. If you do this, you ensure that the hardware is optimally utilized. This is the most important recommendation here today.
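
In NV_mesh_shader-style GLSL, that means sizing the workgroup to the declared output limits. Just the relevant declarations, with example numbers and assuming the implementation allows a workgroup this large:

layout(local_size_x = 126) in;   // = max(64, 126): no idle "padding" threads
layout(triangles, max_vertices = 64, max_primitives = 126) out;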

Try to only write to the mesh output array indices from the corresponding thread. If you do this, you hit an optimal code path in the driver, so it won’t have to write those outputs to LDS and read them back at the end.

Use shared memory, but not excessively. Implementing any nice algorithm in your mesh shader will likely need you to share data between threads. Don’t be afraid to use shared memory, but prefer to use subgroup functionality instead when possible.

What if you don’t want to do any of the above?

That is perfectly fine. Don’t use mesh shaders then.

The main takeaway about mesh shading is that it’s a very low level tool. The driver can implement the full programming model, but it can’t hold your hands as well as it could for traditional vertex processing. You may have to implement things (eg. vertex inputs, culling, etc.) that previously the driver would do for you. Essentially, if you write a mesh shader you are trying to beat the driver at its own game.

Wait, aren’t we forgetting something?

I think this post is already dense enough with technical detail. Brace yourself for the next post, where I’m going to blow your mind even more and talk about how task shaders are implemented.

May 11, 2022

Background
Today NVIDIA announced that they are releasing an open source kernel driver for their GPUs, so I want to share with you some background information and how I think this will impact Linux graphics and compute going forward.

One thing many people are not aware of is that Red Hat is the only Linux OS company that has a strong presence in the Linux compute and graphics engineering space. There are of course a lot of other people working in the space too, like engineers working for Intel, AMD and NVIDIA or people working for consultancy companies like Collabora or individual community members, but Red Hat as an OS integration company has been very active in trying to ensure we have a maintainable and shared upstream open source stack. This engineering presence is also what has allowed us to move important technologies forward, like getting hiDPI support for Linux some years ago, or working with NVIDIA to get glvnd implemented to remove a pain point for our users since the original OpenGL design only allowed for one OpenGL implementation to be installed at a time. We see ourselves as the open source community’s partner here, fighting to keep the Linux graphics stack coherent and maintainable, and as a partner for the hardware OEMs to work with when they need help pushing major new initiatives around GPUs for Linux forward. And as the only Linux vendor with a significant engineering footprint in GPUs we have been working closely with NVIDIA. People like Kevin Martin, the manager for our GPU technologies team, Ben Skeggs the maintainer of Nouveau and Dave Airlie, the upstream kernel maintainer for the graphics subsystem, Nouveau contributor Karol Herbst and our accelerator lead Tom Rix have all taken part in meetings, code reviews and discussions with NVIDIA. So let me talk a little about what this release means (and also what it doesn’t mean) and what we hope to see come out of this long term.

First of all, what is in this new driver?
What has been released is an out of tree source code kernel driver which has been tested to support CUDA usecases on datacenter GPUs. There is code in there to support display, but it is not complete or fully tested yet. Also, this is only the kernel part; a big part of a modern graphics driver is to be found in the firmware and userspace components, and those are still closed source. But it does mean we have an NVIDIA kernel driver now that will start being able to consume the GPL-only APIs in the Linux kernel, although this initial release doesn’t consume any APIs the old driver wasn’t already using. The driver also only supports NVIDIA Turing chip GPUs and newer, which means it is not targeting GPUs from before 2018. So for the average Linux desktop user, while this is a great first step and hopefully a sign of what is to come, it is not something you are going to start using tomorrow.

What does it mean for the NVidia binary driver?
Not too much immediately. This binary kernel driver will continue to be needed for older pre-Turing NVIDIA GPUs, and until the open source kernel module is fully tested and extended for display usecases you are likely to continue using it for your system even if you are on Turing or newer. Also, as mentioned above regarding the firmware and userspace bits, the binary driver is going to continue to be around even once the open source kernel driver is fully capable.

What does it mean for Nouveau?
Let me start with the obvious: this is actually great news for the Nouveau community and the Nouveau driver, and NVIDIA has done a great favour to the open source graphics community with this release. And for those unfamiliar with Nouveau, Nouveau is the in-kernel graphics driver for NVIDIA GPUs today which was originally developed as a reverse engineered driver, but which over recent years has actually had active support from NVIDIA. It is fully functional, but is severely hampered by not having had the ability to, for instance, re-clock the NVIDIA card, meaning that it can’t give you full performance like the binary driver can. This was something we were working with NVIDIA trying to remedy, but this new release provides us with a better path forward. So what does this new driver mean for Nouveau? Less initially, but a lot in the long run. To give a little background first: the Linux kernel does not allow multiple drivers for the same hardware, so in order for a new NVIDIA kernel driver to go in, the current one will have to go out or at least be limited to a different set of hardware. The current one is Nouveau. And just like the binary driver, a big chunk of Nouveau is not in the kernel, but in the userspace pieces found in Mesa and the Nouveau-specific firmware that NVIDIA currently kindly makes available. So regardless of the long term effort to create a new open source in-tree kernel driver based on this new open source driver for NVIDIA hardware, Nouveau will very likely be staying around to support pre-Turing hardware just like the NVIDIA binary kernel driver will.

The plan we are working towards from our side, but which is likely to take a few years to come to full fruition, is to come up with a way for the NVIDIA binary driver and Mesa to share a kernel driver. The details of how we will do that are something we are still working on and discussing with our friends at NVIDIA to address both the needs of the NVIDIA userspace and the needs of the Mesa userspace. Along with that evolution we hope to work with NVIDIA engineers to refactor the userspace bits of Mesa that are now targeting just Nouveau to be able to interact with this new kernel driver, and also work so that the binary driver and Nouveau can share the same firmware. This has clear advantages for both the open source community and NVIDIA. For the open source community it means that we will now have a kernel driver and firmware that allows things like changing the clocking of the GPU to provide the kind of performance people expect from the NVIDIA graphics card, and it means that we will have an open source driver that will have access to the firmware and kernel updates from day one for new generations of NVIDIA hardware. For the “binary” driver, and I put that in quotes because it will now be less binary :), it means as stated above that it can start taking advantage of the GPL-only APIs in the kernel, distros can ship it and enable secure boot, and it gets an open source consumer of its kernel driver allowing it to go upstream.
Whether this new shared kernel driver will be known as Nouveau or something completely different is still an open question, and of course it happening at all depends on whether we, the rest of the open source community, and NVIDIA are able to find a path together to make it happen, but so far everyone seems to be of good will.

What does this release mean for linux distributions like Fedora and RHEL?

Over time it provides a pathway to radically simplify supporting NVIDIA hardware due to the opportunities discussed elsewhere in this document. Long term we hope to be able to get a better user experience with NVIDIA hardware in terms of out-of-the-box functionality, which means day 1 support for new chipsets, a high performance open source Mesa driver for NVIDIA, and the ability to sign the NVIDIA driver alongside the rest of the kernel to enable things like secure boot support. Since this first release is targeting compute, one can expect these options to first be available for compute users, with graphics at a later time.

What are the next steps
Well, there is a lot of work to do here. NVIDIA needs to continue the effort to make this new driver feature complete for both compute and graphics/display usecases; we’d like to work together to come up with a plan for what the future unified kernel driver can look like and a model around it that works for both the community and NVIDIA; and we need to add things like a Mesa Vulkan driver. We at Red Hat will be playing an active part in this work as the only Linux vendor with the capacity to do so, and we will also work to ensure that the wider open source community has a chance to participate fully, like we do for all open source efforts we are part of.

If you want to hear more about this, I talked with Chris Fisher and Linux Action News about this topic. Note: I stated some timelines in that interview which I didn’t make clear were my guesstimates and not in any form official NVIDIA timelines, so apologies for the confusion.

May 10, 2022

In the previous post, I described how we enable multiple syncobjs capabilities in the V3D kernel driver. Now I will tell you what was changed on the userspace side, where we reworked the V3DV sync mechanisms to use Vulkan multiple wait and signal semaphores directly. This change represents greater adherence to the Vulkan submission framework.

I was not used to Vulkan concepts and the V3DV driver. Fortunately, I counted on the guidance of Igalia’s Graphics team, mainly Iago Toral (thanks!), to understand the Vulkan Graphics Pipeline, sync scopes, and submission order. Therefore, we changed the original V3DV implementation for vkQueueSubmit and all related functions to allow direct mapping of multiple semaphores from V3DV to the V3D-kernel interface.

Disclaimer: Here’s a brief and probably inaccurate background, which we’ll go into more detail later on.

In Vulkan, GPU work submissions are described as command buffers. These command buffers, with GPU jobs, are grouped in a command buffer submission batch, specified by vkSubmitInfo, and submitted to a queue for execution. vkQueueSubmit is the command called to submit command buffers to a queue. Besides command buffers, vkSubmitInfo also specifies semaphores to wait before starting the batch execution and semaphores to signal when all command buffers in the batch are complete. Moreover, a fence in vkQueueSubmit can be signaled when all command buffer batches have completed execution.
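
A minimal sketch of such a submission in C, with one wait semaphore, one signal semaphore and a fence (the handles are assumed to have been created elsewhere):

#include <vulkan/vulkan.h>

static void submit_batch(VkQueue queue, VkCommandBuffer cmd,
                         VkSemaphore wait_sem, VkSemaphore signal_sem,
                         VkFence fence)
{
    const VkPipelineStageFlags wait_stage =
        VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;

    const VkSubmitInfo submit = {
        .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
        .waitSemaphoreCount   = 1,
        .pWaitSemaphores      = &wait_sem,
        .pWaitDstStageMask    = &wait_stage,
        .commandBufferCount   = 1,
        .pCommandBuffers      = &cmd,
        .signalSemaphoreCount = 1,
        .pSignalSemaphores    = &signal_sem,
    };

    /* The fence is signaled once every batch in this submission completes. */
    vkQueueSubmit(queue, 1, &submit, fence);
}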

From this sequence, we can see some implicit ordering guarantees. Submission order defines the start order of execution between command buffers; in other words, it is determined by the order in which pSubmits appear in vkQueueSubmit and pCommandBuffers appear in VkSubmitInfo. However, we don’t have any completion guarantees for jobs submitted to different GPU queues, which means they may overlap and complete out of order. Of course, jobs submitted to the same GPU engine follow start and finish order. For signal operation order, a fence is ordered after all semaphore signal operations. In addition to implicit sync, we also have some explicit sync resources, such as semaphores, fences, and events.

Considering these implicit and explicit sync mechanisms, we reworked the V3DV implementation of queue submissions to better use the multiple syncobjs capabilities from the kernel. You can find this work in this merge request: v3dv: add support to multiple wait and signal semaphores. In this blog post, we run through each scope of change of this merge request for a V3D driver-guided description of the multisync support implementation.

Groundwork and basic code clean-up:

As the original V3D-kernel interface allowed only one semaphore, V3DV resorted to booleans to “translate” multiple semaphores into one. Consequently, if a command buffer batch had at least one semaphore, it needed to wait for all previously submitted jobs to complete before starting its execution. So, instead of just a boolean, we created and changed the structs that store semaphore information to accept the actual list of wait semaphores.

Expose multisync kernel interface to the driver:

In the two commits below, we basically updated the DRM V3D interface from that one defined in the kernel and verified if the multisync capability is available for use.

Handle multiple semaphores for all GPU job types:

At this point, we were only changing the submission design to consider multiple wait semaphores. Before supporting multisync, V3DV was waiting for the last job submitted to be signaled when at least one wait semaphore was defined, even when serialization wasn’t required. V3DV handles GPU jobs according to the GPU queue to which they are submitted:

  • Control List (CL) for binning and rendering
  • Texture Formatting Unit (TFU)
  • Compute Shader Dispatch (CSD)

Therefore, we changed their submission setup so that jobs submitted to any GPU queue are able to handle more than one wait semaphore.

These commits created all mechanisms to set arrays of wait and signal semaphores for GPU job submissions:

  • Checking the conditions to define the wait_stage.
  • Wrapping them in a multisync extension.
  • Configuring the generic extension as a multisync extension, according to the kernel interface (described in the previous blog post).

Finally, we extended the ability of GPU jobs to handle multiple signal semaphores, but at this point, no GPU job is actually in charge of signaling them. With this in place, we could rework part of the code that tracks CPU and GPU job completions by verifying the GPU status and threads spawned by Event jobs.

Rework the QueueWaitIdle mechanism to track the syncobj of the last job submitted in each queue:

As we had only single in/out syncobj interfaces for semaphores, we used a single last_job_sync to synchronize job dependencies of the previous submission. Although the DRM scheduler guarantees the order of starting to execute a job in the same queue in the kernel space, the order of completion isn’t predictable. On the other hand, we still needed to use syncobjs to follow job completion since we have event threads on the CPU side. Therefore, a more accurate implementation requires last_job syncobjs to track when each engine (CL, TFU, and CSD) is idle. We also needed to keep the driver working on previous versions of v3d kernel-driver with single semaphores, then we kept tracking ANY last_job_sync to preserve the previous implementation.
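
To illustrate the idea (this is not the actual V3DV code, just a sketch on top of the standard DRM syncobj API): keep one “last job” syncobj per engine and wait on the relevant one when an engine needs to go idle.

#include <stdint.h>
#include <xf86drm.h>

enum v3d_queue_idx { QUEUE_CL, QUEUE_TFU, QUEUE_CSD, QUEUE_ANY, QUEUE_COUNT };

struct last_job_syncs {
    uint32_t syncs[QUEUE_COUNT];  /* DRM syncobj handles, one per engine plus
                                     the legacy "any" syncobj for old kernels */
};

static int wait_engine_idle(int drm_fd, const struct last_job_syncs *last,
                            enum v3d_queue_idx q)
{
    uint32_t handle = last->syncs[q];
    /* Block until the last job submitted to this engine has signaled. */
    return drmSyncobjWait(drm_fd, &handle, 1, INT64_MAX,
                          DRM_SYNCOBJ_WAIT_FLAGS_WAIT_ALL, NULL);
}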

Rework synchronization and submission design to let the jobs handle wait and signal semaphores:

With multiple semaphores support, the conditions for waiting and signaling semaphores changed accordingly to the particularities of each GPU job (CL, CSD, TFU) and CPU job restrictions (Events, CSD indirect, etc.). In this sense, we redesigned V3DV semaphores handling and job submissions for command buffer batches in vkQueueSubmit.

We scrutinized the possible scenarios for submitting command buffer batches in order to change the original implementation carefully. This resulted in three more commits:

We keep track of whether we have submitted a job to each GPU queue (CSD, TFU, CL) and a CPU job for each command buffer. We use syncobjs to track the last job submitted to each GPU queue and a flag that indicates if this represents the beginning of a command buffer.

The first GPU job submitted to a GPU queue in a command buffer should wait on wait semaphores. The first CPU job submitted in a command buffer should call v3dv_QueueWaitIdle() to do the waiting and ignore semaphores (because it is waiting for everything).

If the job is not the first but has the serialize flag set, it should wait on the completion of all last jobs submitted to any GPU queue before running. In practice, it means using syncobjs to track the last job submitted to each queue and adding these syncobjs as job dependencies of this serialized job.

If this job is the last job of a command buffer batch, it may be used to signal semaphores if the batch contains only one type of GPU job (because we then have guarantees of execution ordering). Otherwise, we emit a no-op job just to signal the semaphores: it waits on the completion of all last jobs submitted to any GPU queue and then signals the semaphores. Note: at some point we changed this approach to correctly deal with ordering changes caused by event threads. Whenever we have an event job in the command buffer, we cannot rely on the last-job assumption; we have to wait for all event threads to complete before signaling the semaphores.

After submitting all command buffers, we emit a no-op job that waits on the completion of all last jobs per queue and then signals the fence. Note: at some point, we changed this approach to correctly deal with ordering changes caused by event threads, as mentioned before.

Final considerations

With many changes and many rounds of review, the patchset was merged. After more validation and code review, we polished and fixed the implementation together with external contributions.

Also, multisync capabilities enabled us to add new features to V3DV and switch the driver to the common synchronization and submission framework:

  • v3dv: expose support for semaphore imports

    This was waiting for multisync support in the v3d kernel, which is already available. Exposing this feature however enabled a few more CTS tests that exposed pre-existing bugs in the user-space driver so we fix those here before exposing the feature.

  • v3dv: Switch to the common submit framework

    This should give you emulated timeline semaphores for free and kernel-assisted sharable timeline semaphores for cheap once you have the kernel interface wired in.

We used a set of games to ensure there was no performance regression in the new implementation. For this, we used GFXReconstruct to capture Vulkan API calls while playing those games. Then, we compared results with and without the multisync capabilities in kernel space, and also with multisync enabled in V3DV. We didn’t observe any performance penalty, but we did see improvements when replaying scenes of the vkQuake game.

As you may already know, we at Igalia have been working on several improvements to the 3D rendering drivers of the Broadcom VideoCore GPU found in Raspberry Pi 4 devices. One of our recent efforts focused on improving the V3D(V) drivers’ adherence to the Vulkan submission and synchronization framework. We had to cross various layers of the Linux graphics stack to add support for multiple syncobjs to V3D(V), from the Linux/DRM kernel to the Vulkan driver. We have delivered bug fixes, a generic gate to extend job submission interfaces, and a more direct sync mapping of the Vulkan framework. These changes did not impact the performance of the tested games and brought greater precision to the synchronization mechanisms. Ultimately, support for multiple syncobjs opened the door to new features and other improvements to the V3DV submission framework.

DRM Syncobjs

But, first, what are DRM syncobjs?

* DRM synchronization objects (syncobj, see struct &drm_syncobj) provide a
* container for a synchronization primitive which can be used by userspace
* to explicitly synchronize GPU commands, can be shared between userspace
* processes, and can be shared between different DRM drivers.
* Their primary use-case is to implement Vulkan fences and semaphores.
[...]
* At it's core, a syncobj is simply a wrapper around a pointer to a struct
* &dma_fence which may be NULL.

And Jason Ekstrand summarized dma_fence features well in a talk at the Linux Plumbers Conference 2021:

A struct that represents a (potentially future) event:

  • Has a boolean “signaled” state
  • Has a bunch of useful utility helpers/concepts, such as refcount, callback wait mechanisms, etc.

Provides two guarantees:

  • One-shot: once signaled, it will be signaled forever
  • Finite-time: once exposed, it is guaranteed to signal in a reasonable amount of time

What does multiple semaphores support mean for Raspberry Pi 4 GPU drivers?

For our main purpose, the multiple syncobjs support means that V3DV can submit jobs with more than one wait and signal semaphore. In the kernel space, wait semaphores become explicit job dependencies to wait on before executing the job. Signal semaphores (or post dependencies), in turn, work as fences to be signaled when the job completes its execution, unlocking following jobs that depend on its completion.

The development of multisync support comprised many decision-making points and steps, summarized as follows:

  • added to the v3d kernel driver the capability to handle multiple syncobjs;
  • exposed multisync capabilities to userspace through a generic extension;
  • reworked the synchronization mechanisms of the V3DV driver to benefit from this feature;
  • enabled the simulator to work with multiple semaphores; and
  • tested Vulkan games to verify correctness and possible performance enhancements.

We decided to refactor parts of the V3D(V) submission design in kernel space and userspace during this development. We improved job scheduling in the V3D kernel driver and the V3DV job submission design. We also delivered more accurate synchronization mechanisms and further updates to the Broadcom Vulkan driver running on the Raspberry Pi 4. Here we summarize the changes in kernel space, describing the previous state of the driver, the decisions taken, side improvements, and fixes.

From single to multiple binary in/out syncobjs:

Initially, V3D was very limited in the number of syncobjs per job submission. The V3D job interfaces (CL, CSD, and TFU) only supported one syncobj (in_sync) to be added as an execution dependency and one syncobj (out_sync) to be signaled when a submission completes. The only exception was the CL submission, which accepts two in_syncs (one for the binner job and another for the render job), but that didn’t change the limited options.

Meanwhile, in userspace, the V3DV driver followed alternative paths to meet Vulkan’s synchronization and submission framework. It needed to handle multiple wait and signal semaphores, but the V3D kernel-driver interface only accepted one in_sync and one out_sync. In short, V3DV had to squash multiple semaphores into one when submitting every GPU job.

Generic ioctl extension

The first decision was how to extend the V3D interface to accept multiple in and out syncobjs. We could extend each ioctl with two syncobj array entries and two counters for them, or we could create new ioctls with multiple in/out syncobjs. But after examining how other drivers extended their submission interfaces, we decided to extend the V3D ioctls (v3d_cl_submit_ioctl, v3d_csd_submit_ioctl, v3d_tfu_submit_ioctl) with a generic ioctl extension.

I found a curious commit message when I was examining how other developers handled the issue in the past:

Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Mar 22 09:23:22 2019 +0000

    drm/i915: Introduce the i915_user_extension_method
    
    An idea for extending uABI inspired by Vulkan's extension chains.
    Instead of expanding the data struct for each ioctl every time we need
    to add a new feature, define an extension chain instead. As we add
    optional interfaces to control the ioctl, we define a new extension
    struct that can be linked into the ioctl data only when required by the
    user. The key advantage being able to ignore large control structs for
    optional interfaces/extensions, while being able to process them in a
    consistent manner.
    
    In comparison to other extensible ioctls, the key difference is the
    use of a linked chain of extension structs vs an array of tagged
    pointers. For example,
    
    struct drm_amdgpu_cs_chunk {
    	__u32		chunk_id;
        __u32		length_dw;
        __u64		chunk_data;
    };
[...]

So, inspired by amdgpu_cs_chunk and i915_user_extension, we opted to extend the V3D interface through a generic extension mechanism. After applying some suggestions from Iago Toral (Igalia) and Daniel Vetter, we arrived at the following struct:

struct drm_v3d_extension {
	__u64 next;
	__u32 id;
#define DRM_V3D_EXT_ID_MULTI_SYNC		0x01
	__u32 flags; /* mbz */
};

This generic extension has an id to identify the feature/extension we are adding to an ioctl (which maps to the related struct type), a pointer to the next extension, and flags (if needed). Whenever we need to extend the V3D interface again for another specific feature, we subclass this generic extension into the specific one instead of extending the ioctls indefinitely.

Multisync extension

For the multiple syncobjs extension, we define a multi_sync extension struct that subclasses the generic extension struct. It has arrays of in and out syncobjs, the respective number of elements in each of them, and a wait_stage value used in CL submissions to determine which job needs to wait for syncobjs before running.

struct drm_v3d_multi_sync {
	struct drm_v3d_extension base;
	/* Array of wait and signal semaphores */
	__u64 in_syncs;
	__u64 out_syncs;

	/* Number of entries */
	__u32 in_sync_count;
	__u32 out_sync_count;

	/* set the stage (v3d_queue) to sync */
	__u32 wait_stage;

	__u32 pad; /* mbz */
};

And if a multisync extension is defined, the V3D driver ignores the previous interface of single in/out syncobjs.
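
To make the flow concrete, here is a sketch of how userspace could fill and chain this extension into a CL submission. The two structs are the ones shown above; the name of the extension pointer on the submit struct and the submit flag are simplified assumptions here, not necessarily the exact uAPI names:

#include <stdint.h>
#include <string.h>
#include <drm/v3d_drm.h>   /* kernel uAPI header, assumed to provide the structs above */

static void setup_multisync(struct drm_v3d_submit_cl *submit,
			    struct drm_v3d_multi_sync *ms,
			    const uint32_t *in_syncs, uint32_t in_count,
			    const uint32_t *out_syncs, uint32_t out_count,
			    uint32_t wait_stage)
{
	memset(ms, 0, sizeof(*ms));
	ms->base.id = DRM_V3D_EXT_ID_MULTI_SYNC; /* identifies this extension type */
	ms->base.next = 0;                       /* no further chained extensions */
	ms->in_syncs = (uintptr_t)in_syncs;      /* array of wait syncobj handles */
	ms->out_syncs = (uintptr_t)out_syncs;    /* array of signal syncobj handles */
	ms->in_sync_count = in_count;
	ms->out_sync_count = out_count;
	ms->wait_stage = wait_stage;             /* which CL job (bin or render) waits */

	/* Assumed hook-up: point the submit struct at the extension chain and
	 * flag that the kernel should walk it. */
	submit->extensions = (uintptr_t)ms;
	submit->flags |= DRM_V3D_SUBMIT_EXTENSION;
}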

Once we had the interface to support multiple in/out syncobjs, the v3d kernel driver needed to handle it. As V3D uses the DRM scheduler for job execution, changing from a single syncobj to multiple ones is quite straightforward. V3D copies the in_syncs from userspace and uses drm_syncobj_find_fence() + drm_sched_job_add_dependency() to add all in_syncs (wait semaphores) as job dependencies, i.e. syncobjs to be checked by the scheduler before running the job. On CL submissions, we have the bin and render jobs, so V3D follows the value of wait_stage to determine which job depends on those in_syncs to start its execution.
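
In very simplified form, that dependency loop looks conceptually like this (error handling trimmed and the copy of the userspace handle array omitted; drm_syncobj_find_fence() and drm_sched_job_add_dependency() are the kernel helpers mentioned above, and the latter consumes the fence reference):

#include <drm/drm_file.h>
#include <drm/drm_syncobj.h>
#include <drm/gpu_scheduler.h>

static int job_add_wait_deps(struct drm_file *file_priv,
			     struct drm_sched_job *job,
			     const u32 *in_syncs, u32 in_sync_count)
{
	u32 i;

	for (i = 0; i < in_sync_count; i++) {
		struct dma_fence *fence;
		int ret;

		/* Look up the dma_fence currently backing this syncobj handle. */
		ret = drm_syncobj_find_fence(file_priv, in_syncs[i], 0, 0, &fence);
		if (ret)
			return ret;

		/* The scheduler will not start the job before this fence signals. */
		ret = drm_sched_job_add_dependency(job, fence);
		if (ret)
			return ret;
	}
	return 0;
}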

When V3D defines the last job in a submission, it replaces the dma_fence of the out_syncs with the done_fence of this last job, using drm_syncobj_find() + drm_syncobj_replace_fence(). Therefore, when the job completes its execution and signals done_fence, all out_syncs are signaled too.
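
The signaling side can be sketched just as briefly (again simplified; drm_syncobj_find() and drm_syncobj_replace_fence() are the helpers named above):

#include <drm/drm_file.h>
#include <drm/drm_syncobj.h>
#include <linux/dma-fence.h>

static void job_attach_signal_syncs(struct drm_file *file_priv,
				    struct dma_fence *done_fence,
				    const u32 *out_syncs, u32 out_sync_count)
{
	u32 i;

	for (i = 0; i < out_sync_count; i++) {
		struct drm_syncobj *syncobj = drm_syncobj_find(file_priv, out_syncs[i]);

		if (!syncobj)
			continue; /* real code would report the invalid handle */

		/* Replace whatever fence the syncobj held before: it will now
		 * signal when the last job's done_fence signals. */
		drm_syncobj_replace_fence(syncobj, done_fence);
		drm_syncobj_put(syncobj);
	}
}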

Other improvements to v3d kernel driver

This work also made some improvements to the original implementation possible. Following Iago’s suggestions, we refactored the job initialization code to allocate memory and initialize a job in one go. With this, we started to clean up resources more cohesively, clearly distinguishing cleanups in case of failure from cleanups on job completion. We also fixed the resource cleanup when a job is aborted before the DRM scheduler arms it - at that point, drm_sched_job_arm() had recently been introduced to job initialization. Finally, we prepared the semaphore interface to implement timeline syncobjs in the future.

Going Up

The patchset that adds multiple syncobjs support and improvements to V3D is available here and comprises four patches:

  • drm/v3d: decouple adding job dependencies steps from job init
  • drm/v3d: alloc and init job in one shot
  • drm/v3d: add generic ioctl extension
  • drm/v3d: add multiple syncobjs support

After extending the V3D kernel interface to accept multiple syncobjs, we worked on V3DV to benefit from V3D multisync capabilities. In the next post, I will describe a little of this work.

May 09, 2022

As a board, we have been working on several initiatives to make the Foundation a better asset for the GNOME Project. We’re working on a number of threads in parallel, so I wanted to explain the “big picture” a bit more to try and connect together things like the new ED search and the bylaw changes.

We’re all here to see free and open source software succeed and thrive, so that people can be truly empowered with agency over their technology, rather than being passive consumers. We want to bring GNOME to as many people as possible so that they have computing devices that they can inspect, trust, share and learn from.

In previous years we’ve tried to boost the relevance of GNOME (or technologies such as GTK) or solicit donations from businesses and individuals with existing engagement in FOSS ideology and technology. The problem with this approach is that we’re mostly addressing people and organisations who are already supporting or contributing FOSS in some way. To truly scale our impact, we need to look to the outside world, build better awareness of GNOME outside of our current user base, and find opportunities to secure funding to invest back into the GNOME project.

The Foundation supports the GNOME project with infrastructure, arranging conferences, sponsoring hackfests and travel, design work, legal support, managing sponsorships, advisory board, being the fiscal sponsor of GNOME, GTK, Flathub… and we will keep doing all of these things. What we’re talking about here are additional ways for the Foundation to support the GNOME project – we want to go beyond these activities, and invest into GNOME to grow its adoption amongst people who need it. This has a cost, and that means in parallel with these initiatives, we need to find partners to fund this work.

Neil has previously talked about themes such as education, advocacy, privacy, but we’ve not previously translated these into clear specific initiatives that we would establish in addition to the Foundation’s existing work. This is all a work in progress and we welcome any feedback from the community about refining these ideas, but here are the current strategic initiatives the board is working on. We’ve been thinking about growing our community by encouraging and retaining diverse contributors, and addressing evolving computing needs which aren’t currently well served on the desktop.

Initiative 1. Welcoming newcomers. The community is already spending a lot of time welcoming newcomers and teaching them the best practices. Those activities are as time consuming as they are important, but currently a handful of individuals are running initiatives such as GSoC, Outreachy and outreach to Universities. These activities help bring diverse individuals and perspectives into the community, and helps them develop skills and experience of collaborating to create Open Source projects. We want to make those efforts more sustainable by finding sponsors for these activities. With funding, we can hire people to dedicate their time to operating these programs, including paid mentors and creating materials to support newcomers in future, such as developer documentation, examples and tutorials. This is the initiative that needs to be refined the most before we can turn it into something real.

Initiative 2: Diverse and sustainable Linux app ecosystem. I spoke at the Linux App Summit about the work that GNOME and Endless have been supporting in Flathub, but this is an example of something which has a great overlap between commercial, technical and mission-based advantages. The key goal here is to improve the financial sustainability of participating in our community, which in turn has an impact on the diversity of who we can expect to afford to enter and remain in our community. We believe the existence of this is critically important for individual developers and contributors to unlock earning potential from our ecosystem, through donations or app sales. In turn, a healthy app ecosystem also improves the usefulness of the Linux desktop as a whole for potential users. We believe that we can build a case for commercial vendors in the space to join an advisory board alongside GNOME, KDE, etc. to input into the governance and contribute to the costs of growing Flathub.

Initiative 3: Local-first applications for the GNOME desktop. This is what Thib has been starting to discuss on Discourse, in this thread. There are many different threats to free access to computing and information in today’s world. The GNOME desktop and apps need to give users convenient and reliable access to technology which works similarly to the tools they already use everyday, but keeps them and their data safe from surveillance, censorship, filtering or just being completely cut off from the Internet. We believe that we can seek both philanthropic and grant funding for this work. It will make GNOME a more appealing and comprehensive offering for the many people who want to protect their privacy.

The idea is that these initiatives all sit on the boundary between the GNOME community and the outside world. If the Foundation can grow and deliver these kinds of projects, we are reaching to new people, new contributors and new funding. These contributions and investments back into GNOME represent a true “win-win” for the newcomers and our existing community.

(Originally posted to GNOME Discourse, please feel free to join the discussion there.)

Mesh and task shaders (amplification shaders in D3D jargon) are a new way to produce geometry in 3D applications. First proposed by NVidia in 2018 and initially available in the “Turing” series of GPUs, they are now supported on RDNA2 GPUs and, on the API side, are part of D3D12 and available as vendor-specific extensions to Vulkan and OpenGL. In this post I’m going to talk about what mesh shaders are, and in part 2 I’m going to talk about how they are implemented on the driver side.

Problems with the old geometry pipeline

The problem with the traditional vertex processing pipeline is that it is mainly designed assuming several fixed-function hardware units in the GPU and offers very little flexibility for the user to customize it. The main issues with the traditional pipeline are:

  • Vertex buffers and vertex shader inputs are annoying (especially from the driver’s perspective) and input assembly may be a bottleneck on some HW in some cases.
  • The user has no control over how the input vertices and primitives are arranged, so the vertex shader may uselessly run for primitives that are invisible (eg. occluded, or backfacing etc.) meaning that compute resources are wasted on things that don’t actually produce any pixels.
  • Geometry amplification depends on fixed function tessellation HW and offers poor customizability.
  • The programming model allows very poor control over input and output primitives. Geometry shaders have a horrible programming model that results in low HW occupancy and limited topologies.

The mesh shading pipeline is a graphics pipeline that addresses these issues by completely replacing the entire traditional vertex processing pipeline with two new stages: the task and mesh shader.

What is a mesh shader?

A mesh shader is a compute-like stage which allows the application to fully customize its inputs and outputs, including output primitives.

  • Mesh shader vs. Vertex shader: A mesh shader is responsible for creating its output vertices and primitives. In comparison, a vertex shader is only capable of loading a fixed number of vertices, doing some processing on them, and has no awareness of primitives.
  • Mesh shader vs. Geometry shader: As opposed to a geometry shader which can only use a fixed output topology, a mesh shader is free to define whatever topology it wants. You can think about it as if the mesh shader produces an indexed triangle list.

What does it mean that the mesh shader is compute-like?

You can use all the sweet good stuff that compute shaders already can do, but vertex shaders couldn’t, for example: use shared memory, run in workgroups, rely on workgroup ID, subgroup ID, etc.

The API allows any mesh shader invocation to write to any vertex or primitive. The invocations in each mesh shader workgroup are meant to co-operatively produce a small set of output vertices and primitives (this is sometimes called a “meshlet”). All workgroups together create the full output geometry (the “mesh”).

What does the mesh shader do?

First, it needs to figure out how many vertices and primitives it wants to create, then it can write these to its output arrays. How it does this, is entirely up to the application developer. Though, there are some performance recommendations which you should follow to make it go fast. I’m going to talk about these more in Part 2.

The input assembler step is entirely eliminated, which means that the application is now in full control of how (or if at all) the input vertex data is fetched, meaning that you can save bandwidth on things that don’t need to be loaded, etc. There is no “input” in the traditional sense, but you can rely on things like push constants, UBO, SSBO etc.

For example, a mesh shader could perform per-triangle culling in such a manner that it wouldn’t need to load data for primitives that are culled, therefore saving bandwidth.

What is a task shader aka. amplification shader?

The task shader is an optional stage which operates like a compute shader. Each task shader workgroup has two main purposes:

  • Decide how many mesh shader workgroups need to be launched.
  • Create an optional “task payload” which can be passed to mesh shaders.

The “geometry amplification” is achieved by choosing to launch more (or less) mesh shader workgroups. As opposed to the fixed-function tessellator in the traditional pipeline, it is now entirely up to the application how to create the vertices.
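
For reference, here is roughly what launching such a pipeline looks like from the application side with the vendor extension available at the time of writing (VK_NV_mesh_shader); the function and structure names come from that extension, while the pipeline and meshlet count are of course placeholders:

#include <vulkan/vulkan.h>

/* Record a mesh-shading draw: this launches task shader workgroups, and
 * each of them decides how many mesh shader workgroups to spawn. */
void record_meshlet_draw(VkDevice device, VkCommandBuffer cmd,
                         VkPipeline mesh_pipeline, uint32_t num_meshlets)
{
   /* Extension entry points are fetched through the loader. */
   PFN_vkCmdDrawMeshTasksNV draw_mesh_tasks =
      (PFN_vkCmdDrawMeshTasksNV)vkGetDeviceProcAddr(device, "vkCmdDrawMeshTasksNV");

   vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, mesh_pipeline);

   /* One task workgroup per meshlet here; without a task shader in the
    * pipeline this count would launch mesh workgroups directly. */
   draw_mesh_tasks(cmd, num_meshlets, 0 /* firstTask */);
}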

While you could re-implement the old fixed-function tessellation with mesh shading, this may not actually be necessary and your application may work fine with some other simpler algorithm.

Another interesting use case for task shaders is per-meshlet culling, meaning that a task shader is a good place to decide which meshlets you actually want to render and eliminate entire mesh shader workgroups which would otherwise operate on invisible primitives.

  • Task shader vs. Tessellation. Tessellation relies on fixed-function hardware and makes users do complicated shader I/O gymnastics with tess control shaders. The task shader is straightforward and only as complicated as you want it to be.
  • Task shader vs. Geometry shader. Geometry shaders operate on input primitives directly and replace them with “strip” primitives. Task shaders don’t have to get directly involved with the geometry output; they just let you specify how many mesh shader workgroups to launch and let the mesh shader deal with the nitty-gritty details.

Usefulness

For now, I’ll just discuss a few basic use cases.

Meshlets. If your application loads input vertex data from somewhere, it is recommended that you subdivide that data (the “mesh”) into smaller chunks called “meshlets”. Then you can write your shaders such that each mesh shader workgroup processes a single meshlet.

Procedural geometry. Your application can generate all its vertices and primitives based on a mathematical formula that is implemented in the shader. In this case, you don’t need to load any inputs, just implement your formula as if you were writing a compute shader, then store the results into the mesh shader output arrays.

Replacing compute pre-passes. Many modern games use a compute pre-pass. They launch some compute shaders that do some pre-processing on the geometry before the graphics work. These are no longer necessary. The compute work can be made part of either the task or mesh shader, which removes the overhead of the additional submission.

Note that mesh shader workgroups may be launched as soon as the corresponding task shader workgroup is finished, so mesh shader execution (of the already finished tasks) may overlap with task shader execution, removing the need for extra synchronization on the application side.

Conclusion

Thus far, I’ve sold you on how awesome and flexible mesh shading is, so it’s time to ask the million dollar question.

Is mesh shading for you?

The answer, as always, is: It depends.

Yes, mesh and task shaders do give you a lot of opportunities to implement things just the way you like them without the stupid hardware getting in your way, but as with any low-level tools, this also means that you get a lot of possibilities for shooting yourself in the foot.

The traditional vertex processing pipeline has been around for so long that on most hardware it’s extremely well optimized because the drivers do a lot of optimization work for you. Therefore, just because an app uses mesh shaders, doesn’t automatically mean that it’s going to be faster or better in any way. It’s only worth it if you are willing to do it well.

That being said, perhaps the easiest way to start experimenting with mesh shaders is to rewrite parts of your application that used to use geometry shaders. Geometry shaders are so horribly inefficient that it’ll be difficult to write a worse mesh shader.

How is mesh shading implemented under the hood?

Stay tuned for Part 2 if you are curious about that! In Part 2, I’m going to talk about how mesh and task shaders are implemented in a driver. This will shed some light on how these shaders work internally and why certain things perform really badly.


Sometimes you want to go and inspect details of the shaders that are used with specific draw calls in a frame. With RenderDoc this is really easy if the driver implements VK_KHR_pipeline_executable_properties. This extension allows applications to query the driver about various aspects of the executable code generated for a Vulkan pipeline.

I implemented this extension for V3DV, the Vulkan driver for Raspberry Pi 4, last week (it is currently in the review process) because I was tired of jumping through hoops to get the info I needed when looking at traces. For V3DV we expose the NIR and QPU assembly code as well as various other stats, some of which are quite relevant to performance, such as spill or thread counts.
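
In case you want to query this information directly from your own code rather than through RenderDoc, a minimal sketch using the extension entry point looks something like this (error handling omitted, and it assumes the extension and its pipelineExecutableInfo feature were enabled at device creation):

#include <stdio.h>
#include <vulkan/vulkan.h>

void dump_pipeline_executables(VkDevice device, VkPipeline pipeline)
{
   PFN_vkGetPipelineExecutablePropertiesKHR get_props =
      (PFN_vkGetPipelineExecutablePropertiesKHR)
         vkGetDeviceProcAddr(device, "vkGetPipelineExecutablePropertiesKHR");

   VkPipelineInfoKHR pipeline_info = {
      .sType = VK_STRUCTURE_TYPE_PIPELINE_INFO_KHR,
      .pipeline = pipeline,
   };

   /* First call: ask how many executables (shaders) the pipeline has. */
   uint32_t count = 0;
   get_props(device, &pipeline_info, &count, NULL);

   VkPipelineExecutablePropertiesKHR props[16] = {0};
   if (count > 16)
      count = 16;
   for (uint32_t i = 0; i < count; i++)
      props[i].sType = VK_STRUCTURE_TYPE_PIPELINE_EXECUTABLE_PROPERTIES_KHR;

   /* Second call: fetch name/description (statistics are queried in the same
    * fashion with vkGetPipelineExecutableStatisticsKHR). */
   get_props(device, &pipeline_info, &count, props);

   for (uint32_t i = 0; i < count; i++)
      printf("executable %u: %s (%s)\n", i, props[i].name, props[i].description);
}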


(The original post includes screenshots of some shader statistics, the final NIR code, and the QPU assembly.)
May 02, 2022

TLDR: Hermetic /usr/ is awesome; let's popularize image-based OSes with modernized security properties built around immutability, SecureBoot, TPM2, adaptability, auto-updating, factory reset, uniformity – built from traditional distribution packages, but deployed via images.

Over the past years, systemd gained a number of components for building Linux-based operating systems. While these components individually have been adopted by many distributions and products for specific purposes, we did not publicly communicate a broader vision of how they should all fit together in the long run. In this blog story I hope to provide that from my personal perspective, i.e. explain how I personally would build an OS and where I personally think OS development with Linux should go.

I figure this is going to be a longer blog story, but I hope it will be equally enlightening. Please understand though that everything I write about OS design here is my personal opinion, and not one of my employer.

For the last 12 years or so I have been working on Linux OS development, mostly around systemd. In all those years I had a lot of time to think about the Linux platform, and specifically traditional Linux distributions and their strengths and weaknesses. I have seen many attempts to reinvent Linux distributions in one way or another, to varying success. After all this most would probably agree that the traditional RPM or dpkg/apt-based distributions still define the Linux platform more than others (for 25+ years now), even though some Linux-based OSes (Android, ChromeOS) probably outnumber them in installations overall.

And over all those 12 years I kept wondering, how would I actually build an OS for a system or for an appliance, and what are the components necessary to achieve that. And most importantly, how can we make these components generic enough so that they are useful in generic/traditional distributions too, and in other use cases than my own.

The Project

Before figuring out how I would build an OS it's probably good to figure out what type of OS I actually want to build, what purpose I intend to cover. I think a desktop OS is probably the most interesting. Why is that? Well, first of all, I use one of these for my job every single day, so I care immediately, it's my primary tool of work. But more importantly: I think building a desktop OS is one of the most complex overall OS projects you can work on, simply because desktops are so much more versatile and variable than servers or embedded devices. If one figures out the desktop case, I think there's a lot more to learn from, and reuse in the server or embedded case, than going the other way. After all, there's a reason why so much of the widely accepted Linux userspace stack comes from people with a desktop background (including systemd, BTW).

So, let's see how I would build a desktop OS. If you press me hard, and ask me why I would do that given that ChromeOS already exists and more or less is a Linux desktop OS: there's plenty I am missing in ChromeOS, but most importantly, I am a lot more interested in building something people can easily and naturally rebuild and hack on, i.e. Google-style over-the-wall open source with its skewed power dynamic is not particularly attractive to me. I much prefer building this within the framework of a proper open source community, out in the open, and basing all this strongly on the status quo ante, i.e. the existing distributions. I think it is crucial to provide a clear avenue to build a modern OS based on the existing distribution model, if there shall ever be a chance to make this interesting for a larger audience.

(Let me underline though: even though I am going to focus on a desktop here, most of this is directly relevant for servers as well, in particular container host OSes and suchlike, or embedded devices, e.g. car IVI systems and so on.)

Design Goals

  1. First and foremost, I think the focus must be on an image-based design rather than a package-based one. For robustness and security it is essential to operate with reproducible, immutable images that describe the OS or large parts of it in full, rather than operating always with fine-grained RPM/dpkg style packages. That's not to say that packages are not relevant (I actually think they matter a lot!), but I think they should be less of a tool for deploying code and more one for building the objects to deploy. A different way to see this: any OS built like this must be easy to replicate in a large number of instances, with minimal variability. Regardless if we talk about desktops, servers or embedded devices: focus for my OS should be on "cattle", not "pets", i.e. that from the start it's trivial to reuse the well-tested, cryptographically signed combination of software over a large set of devices the same way, with a maximum of bit-exact reuse and a minimum of local variances.

  2. The trust chain matters, from the boot loader all the way to the apps. This means all code that is run must be cryptographically validated before it is run. All storage must be cryptographically protected: public data must be integrity checked; private data must remain confidential.

    This is in fact where big distributions currently fail pretty badly. I would go as far as saying that SecureBoot on Linux distributions is mostly security theater at this point, if you so will. That's because the initrd that unlocks your FDE (i.e. the cryptographic concept that protects the rest of your system) is not signed or protected in any way. It's trivial to modify for an attacker with access to your hard disk in an undetectable way, and collect your FDE passphrase. The involved bureaucracy around the implementation of UEFI SecureBoot of the big distributions is to a large degree pointless if you ask me, given that once the kernel is assumed to be in a good state, as the next step the system invokes completely unsafe code with full privileges.

    This is a fault of current Linux distributions though, not of SecureBoot in general. Other OSes use this functionality in more useful ways, and we should correct that too.

  3. Pretty much the same thing: offline security matters. I want my data to be reasonably safe at rest, i.e. cryptographically inaccessible even when I leave my laptop in my hotel room, suspended.

  4. Everything should be cryptographically measured, so that remote attestation is supported for as much software shipped on the OS as possible.

  5. Everything should be self-descriptive, have single sources of truth that are closely attached to the object itself, instead of stored externally.

  6. Everything should be self-updating. Today we know that software is never bug-free, and thus requires a continuous update cycle. Not only the OS itself, but also any extensions, services and apps running on it.

  7. Everything should be robust with respect to aborted OS operations, power loss and so on. It should be robust towards hosed OS updates (regardless of whether the download process failed or the image was buggy), and not require user interaction to recover from them.

  8. There must always be a way to put the system back into a well-defined, guaranteed safe state ("factory reset"). This includes that all sensitive data from earlier uses becomes cryptographically inaccessible.

  9. The OS should enforce clear separation between vendor resources, system resources and user resources: conceptually and when it comes to cryptographical protection.

  10. Things should be adaptive: the system should come up and make the best of the system it runs on, adapt to the storage and hardware. Moreover, the system should support execution on bare metal equally well as execution in a VM environment and in a container environment (i.e. systemd-nspawn).

  11. Things should not require explicit installation, i.e. every image should be a live image. For installation it should be sufficient to dd an OS image onto disk. Thus, strong focus on "instantiate on first boot", rather than "instantiate before first boot".

  12. Things should be reasonably minimal. The image the system starts its life with should be quick to download, and not include resources that can as well be created locally later.

  13. System identity, local cryptographic keys and so on should be generated locally, not be pre-provisioned, so that there's no leak of sensitive data during the transport onto the system possible.

  14. Things should be reasonably democratic and hackable. It should be easy to fork an OS, to modify an OS and still get reasonable cryptographic protection. Modifying your OS should not necessarily imply that your "warranty is voided" and you lose all good properties of the OS, if you so will.

  15. Things should be reasonably modular. The privileged part of the core OS must be extensible, including on the individual system. It's not sufficient to support extensibility just through high-level UI applications.

  16. Things should be reasonably uniform, i.e. ideally the same formats and cryptographic properties are used for all components of the system, regardless if for the host OS itself or the payloads it receives and runs.

  17. Even taking all these goals into consideration, it should still be close to traditional Linux distributions, and take advantage of what they are really good at: integration and security update cycles.

Now that we know our goals and requirements, let's start designing the OS along these lines.

Hermetic /usr/

First of all the OS resources (code, data files, …) should be hermetic in an immutable /usr/. This means that a /usr/ tree should carry everything needed to set up the minimal set of directories and files outside of /usr/ to make the system work. This /usr/ tree can then be mounted read-only into the writable root file system that then will eventually carry the local configuration, state and user data in /etc/, /var/ and /home/ as usual.

Thankfully, modern distributions are surprisingly close to working without issues in such a hermetic context. Specifically, Fedora works mostly just fine: it has adopted the /usr/ merge and the declarative systemd-sysusers and systemd-tmpfiles components quite comprehensively, which means the directory trees outside of /usr/ are automatically generated as needed if missing. In particular /etc/passwd and /etc/group (and related files) are appropriately populated, should they be missing entries.
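
To give a rough idea of what these declarative fragments look like (the service name here is made up, purely for illustration), a package just drops in files like these and the corresponding user and state directory are recreated on boot whenever they are missing:

# /usr/lib/sysusers.d/myservice.conf: create a system user if absent
u myservice - "My Service Daemon" /var/lib/myservice

# /usr/lib/tmpfiles.d/myservice.conf: create its state directory if absent
d /var/lib/myservice 0750 myservice myservice -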

In my model a hermetic OS is hence comprehensively defined within /usr/: combine the /usr/ tree with an empty, otherwise unpopulated root file system, and it will boot up successfully, automatically creating the files and resources strictly necessary to boot up.

Monopolizing vendor OS resources and definitions in an immutable /usr/ opens multiple doors to us:

  • We can apply dm-verity to the whole /usr/ tree, i.e. guarantee structural, cryptographic integrity on the whole vendor OS resources at once, with full file system metadata.

  • We can implement updates to the OS easily: by implementing an A/B update scheme on the /usr/ tree we can update the OS resources atomically and robustly, while leaving the rest of the OS environment untouched.

  • We can implement factory reset easily: erase the root file system and reboot. The hermetic OS in /usr/ has all the information it needs to set up the root file system afresh — exactly like in a new installation.

Initial Look at the Partition Table

So let's have a look at a suitable partition table, taking a hermetic /usr/ into account. Let's conceptually start with a table of four entries:

  1. A UEFI System Partition (required by firmware to boot)

  2. Immutable, Verity-protected, signed file system with the /usr/ tree in version A

  3. Immutable, Verity-protected, signed file system with the /usr/ tree in version B

  4. A writable, encrypted root file system

(This is just for initial illustration here, as we'll see later it's going to be a bit more complex in the end.)

The Discoverable Partitions Specification provides suitable partition type UUIDs for all of the above partitions. Which is great, because it makes the image self-descriptive: simply by looking at the image's GPT table we know what to mount where. This means we do not need a manual /etc/fstab, and a multitude of tools such as systemd-nspawn and similar can operate directly on the disk image and boot it up.

Booting

Now that we have a rough idea how to organize the partition table, let's look a bit at how to boot into that. Specifically, in my model "unified kernels" are the way to go, specifically those implementing Boot Loader Specification Type #2. These are basically kernel images that have an initial RAM disk attached to them, as well as a kernel command line, a boot splash image and possibly more, all wrapped into a single UEFI PE binary. By combining these into one we achieve two goals: they become extremely easy to update (i.e. drop in one file, and you update kernel+initrd) and more importantly, you can sign them as one for the purpose of UEFI SecureBoot.

In my model, each version of such a kernel would be associated with exactly one version of the /usr/ tree: both are always updated at the same time. An update then becomes relatively simple: drop in one new /usr/ file system plus one kernel, and the update is complete.

The boot loader used for all this would be systemd-boot, of course. It's a very simple loader, and implements the aforementioned boot loader specification. This means it requires no explicit configuration or anything: it's entirely sufficient to drop in one such unified kernel file, and it will be picked up, and be made a candidate to boot into.

You might wonder how to configure the root file system to boot from with such a unified kernel that contains the kernel command line and is signed as a whole and thus immutable. The idea here is to use the usrhash= kernel command line option implemented by systemd-veritysetup-generator and systemd-fstab-generator. It does two things: it will search and set up a dm-verity volume for the /usr/ file system, and then mount it. It takes the root hash value of the dm-verity Merkle tree as the parameter. This hash is then also used to find the /usr/ partition in the GPT partition table, under the assumption that the partition UUIDs are derived from it, as per the suggestions in the discoverable partitions specification (see above).

systemd-boot (if not told otherwise) will do a version sort of the kernel image files it finds, and then automatically boot the newest one. Picking a specific kernel to boot will also fixate which version of the /usr/ tree to boot into, because — as mentioned — the Verity root hash of it is built into the kernel command line the unified kernel image contains.

In my model I'd place the kernels directly into the UEFI System Partition (ESP), in order to simplify things. (systemd-boot also supports reading them from a separate boot partition, but let's not complicate things needlessly, at least for now.)

So, with all this, we now already have a boot chain that goes something like this: once the boot loader is run, it will pick the newest kernel, which includes the initial RAM disk and a secure reference to the /usr/ file system to use. This is already great. But a /usr/ alone won't make us happy, we also need a root file system. In my model, that file system would be writable, and the /etc/ and /var/ hierarchies would be located directly on it. Since these trees potentially contain secrets (SSH keys, …) the root file system needs to be encrypted. We'll use LUKS2 for this, of course. In my model, I'd bind this to the TPM2 chip (for compatibility with systems lacking one, we can find a suitable fallback, which then provides weaker guarantees, see below). A TPM2 is a security chip available in most modern PCs. Among other things it contains a persistent secret key that can be used to encrypt data, in a way that only if you possess access to it and can prove you are using validated software you can decrypt it again. The cryptographic measuring I mentioned earlier is what allows this to work. But … let's not get lost too much in the details of TPM2 devices, that'd be material for a novel, and this blog story is going to be way too long already.

What does using a TPM2 bound key for unlocking the root file system get us? We can encrypt the root file system with it, and you can only read or make changes to the root file system if you also possess the TPM2 chip and run our validated version of the OS. This protects us against an evil maid scenario to some level: an attacker cannot just copy the hard disk of your laptop while you leave it in your hotel room, because unless the attacker also steals the TPM2 device it cannot be decrypted. The attacker can also not just modify the root file system, because such changes would be detected on the next boot, since they aren't done with the right cryptographic key.

So, now we have a system that already can boot up somewhat completely, and run userspace services. All code that is run is verified in some way: the /usr/ file system is Verity protected, and the root hash of it is included in the kernel that is signed via UEFI SecureBoot. And the root file system is locked to the TPM2 where the secret key is only accessible if our signed OS + /usr/ tree is used.

(One brief intermission here: so far all the components I am referencing here exist already, and have been shipped in systemd and other projects already, including the TPM2 based disk encryption. There's one thing missing here however at the moment that still needs to be developed (happy to take PRs!): right now TPM2 based LUKS2 unlocking is bound to PCR hash values. This is hard to work with when implementing updates — what we'd need instead is unlocking by signatures of PCR hashes. TPM2 supports this, but we don't support it yet in our systemd-cryptsetup + systemd-cryptenroll stack.)

One of the goals mentioned above is that cryptographic key material should always be generated locally on first boot, rather than pre-provisioned. This of course has implications for the encryption key of the root file system: if we want to boot into this system we need the root file system to exist, and thus a key already generated that it is encrypted with. But where precisely would we generate it if we have no installer which could generate it while installing (as is done in traditional Linux distribution installers)? My proposed solution here is to use systemd-repart, which is a declarative, purely additive repartitioner. It can run from the initrd to create and format partitions on boot, before transitioning into the root file system. It can also format the partitions it creates and encrypt them, automatically enrolling a TPM2-bound key.
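
To give a rough idea (this is an illustrative sketch of systemd-repart drop-ins, not an exact configuration from any distribution), declaring the root file system and the second /usr/ slot could look something like this:

# /usr/lib/repart.d/50-root.conf (illustrative sketch)
[Partition]
Type=root
Format=btrfs
Encrypt=tpm2
FactoryReset=yes

# /usr/lib/repart.d/40-usr-b.conf (illustrative sketch)
[Partition]
Type=usr
Label=usr-B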

So, let's revisit the partition table we mentioned earlier. Here's what in my model we'd actually ship in the initial image:

  1. A UEFI System Partition (ESP)

  2. An immutable, Verity-protected, signed file system with the /usr/ tree in version A

And that's already it. No root file system, no B /usr/ partition, nothing else. Only two partitions are shipped: the ESP with the systemd-boot loader and one unified kernel image, and the A version of the /usr/ partition. Then, on first boot systemd-repart will notice that the root file system doesn't exist yet, and will create it, encrypt it, and format it, and enroll the key into the TPM2. It will also create the second /usr/ partition (B) that we'll need for later A/B updates (which will be created empty for now, until the first update operation actually takes place, see below). Once done the initrd will combine the fresh root file system with the shipped /usr/ tree, and transition into it. Because the OS is hermetic in /usr/ and contains all the systemd-tmpfiles and systemd-sysusers information it can then set up the root file system properly and create any directories and symlinks (and maybe a few files) necessary to operate.

Besides the fact that the root file system's encryption keys are generated on the system we boot from and never leave it, it is also pretty nice that the root file system will be sized dynamically, taking into account the physical size of the backing storage. This is perfect, because on first boot the image will automatically adapt to what it has been dd'ed onto.

Factory Reset

This is a good point to talk about the factory reset logic, i.e. the mechanism to place the system back into a known good state. This is important for two reasons: in our laptop use case, once you want to pass the laptop to someone else, you want to ensure your data is fully and comprehensively erased. Moreover, if you have reason to believe your device was hacked you want to revert the device to a known good state, i.e. ensure that exploits cannot persist. systemd-repart already has a mechanism for it. In the declarations of the partitions the system should have, entries may be marked to be candidates for erasing on factory reset. The actual factory reset is then requested by one of two means: by specifying a specific kernel command line option (which is not too interesting here, given we lock that down via UEFI SecureBoot; but then again, one could also add a second kernel to the ESP that is identical to the first, the only difference being that it lists this command line option: thus when the user selects this entry it will initiate a factory reset) — and via an EFI variable that can be set and is honoured on the immediately following boot. So here's how a factory reset would then go down: once the factory reset is requested it's enough to reboot. On the subsequent boot systemd-repart runs from the initrd, where it will honour the request and erase the partitions marked for erasing. Once that is complete the system is back in the state we shipped the system in: only the ESP and the /usr/ file system will exist, but the root file system is gone. And from here we can continue as on the original first boot: create a new root file system (and any other partitions), and encrypt/set it up afresh.

So now we have a nice setup, where everything is either signed or encrypted securely. The system can adapt to the system it is booted on automatically on first boot, and can easily be brought back into a well defined state identical to the way it was shipped in.

Modularity

But of course, such a monolithic, immutable system is only useful for very specific purposes. If /usr/ can't be written to, – at least in the traditional sense – one cannot just go and install a new software package that one needs. So here two goals are superficially conflicting: on one hand one wants modularity, i.e. the ability to add components to the system, and on the other immutability, i.e. that precisely this is prohibited.

So let's see what I propose as a middle ground in my model. First, what's the precise use case for such modularity? I see a couple of different ones:

  1. For some cases it is necessary to extend the system itself at the lowest level, so that the components added in extend (or maybe even replace) the resources shipped in the base OS image, so that they live in the same namespace, and are subject to the same security restrictions and privileges. Exposure to the details of the base OS and its interface for this kind of modularity is at the maximum.

    Example: a module that adds a debugger or tracing tools into the system. Or maybe an optional hardware driver module.

  2. In other cases, more isolation is preferable: instead of extending the system resources directly, additional services shall be added in that bring their own files, can live in their own namespace (but with "windows" into the host namespaces), however still are system components, and provide services to other programs, whether local or remote. Exposure to the details of the base OS for this kind of modularity is restricted: it mostly focuses on the ability to consume and provide IPC APIs from/to the system. Components of this type can still be highly privileged, but the level of integration is substantially smaller than for the type explained above.

    Example: a module that adds a specific VPN connection service to the OS.

  3. Finally, there's the actual payload of the OS. This stuff is relatively isolated from the OS and definitely from each other. It mostly consumes OS APIs, and generally doesn't provide OS APIs. This kind of stuff runs with minimal privileges, and in its own namespace of concepts.

    Example: a desktop app, for reading your emails.

Of course, the lines between these three types of modules are blurry, but I think distinguishing them does make sense, as I think different mechanisms are appropriate for each. So here's what I'd propose in my model to use for this.

  1. For the system extension case I think the systemd-sysext images are appropriate. This tool operates on system extension images that are very similar to the host's disk image: they also contain a /usr/ partition, protected by Verity. However, they just include additions to the host image: binaries that extend the host. When such a system extension image is activated, it is merged via an immutable overlayfs mount into the host's /usr/ tree. Thus any file shipped in such a system extension will suddenly appear as if it was part of the host OS itself. For optional components that should be considered part of the OS more or less this is a very simple and powerful way to combine an immutable OS with an immutable extension. Note that most likely extensions for an OS matching this tool should be built at the same time within the same update cycle scheme as the host OS itself. After all, the files included in the extensions will have dependencies on files in the system OS image, and care must be taken that these dependencies remain in order.

  2. For adding in additional somewhat isolated system services in my model, Portable Services are the proposed tool of choice. Portable services are in most ways just like regular system services; they could be included in the system OS image or an extension image. However, portable services use RootImage= to run off separate disk images, thus within their own namespace. Images set up this way have various ways to integrate into the host OS, as they are in most ways regular system services, which just happen to bring their own directory tree. Also, unlike regular system services, for them sandboxing is opt-out rather than opt-in. In my model, here too the disk images are Verity protected and thus immutable. Just like the host OS they are GPT disk images that come with a /usr/ partition and Verity data, along with signing.

  3. Finally, the actual payload of the OS, i.e. the apps. To be useful in real life here it is important to hook into existing ecosystems, so that a large set of apps are available. Given that on Linux flatpak (or on servers OCI containers) are the established format that pretty much won they are probably the way to go. That said, I think both of these mechanisms have relatively weak properties, in particular when it comes to security, since immutability/measurements and similar are not provided. This means, unlike for system extensions and portable services a complete trust chain with attestation and per-app cryptographically protected data is much harder to implement sanely.

What I'd like to underline here is that the main system OS image, as well as the system extension images and the portable service images are put together the same way: they are GPT disk images, with one immutable file system and associated Verity data. The latter two should also contain a PKCS#7 signature for the top-level Verity hash. This uniformity has many benefits: you can use the same tools to build and process these images, but most importantly: by using a single way to validate them throughout the stack (i.e. Verity, in the latter cases with PKCS#7 signatures), validation and measurement is straightforward. In fact it's so obvious that we don't even have to implement it in systemd: the kernel has direct support for this Verity signature checking natively already (IMA).

So, by composing a system at runtime from a host image, extension images and portable service images we have a nicely modular system where every single component is cryptographically validated on every single IO operation, and every component is measured, in its entire combination, directly in the kernel's IMA subsystem.

(Of course, once you add the desktop apps or OCI containers on top, then these properties are lost further down the chain. But well, a lot is already won, if you can close the chain that far down.)

Note that system extensions are not designed to replicate the fine grained packaging logic of RPM/dpkg. Of course, systemd-sysext is a generic tool, so you can use it for whatever you want, but there's a reason it does not bring support for a dependency language: the goal here is not to replicate traditional Linux packaging (we have that already, in RPM/dpkg, and I think they are actually OK for what they do) but to provide delivery of larger, coarser sets of functionality, in lockstep with the underlying OS' life-cycle and in particular with no interdependencies, except on the underlying OS.

Also note that depending on the use case it might make sense to also use system extensions to modularize the initrd step. This is probably less relevant for a desktop OS, but for server systems it might make sense to package up support for specific complex storage in a systemd-sysext system extension, which can be applied to the initrd that is built into the unified kernel. (In fact, we have been working on implementing signed yet modular initrd support to general purpose Fedora this way.)

Note that portable services are composable from system extensions too, by the way. This makes them even more useful, as you can share a common runtime between multiple portable services, or even use the host image as the common runtime for portable services. In this model a common runtime image is shared between one or more system extensions, and composed at runtime via an overlayfs instance.

More Modularity: Secondary OS Installs

Having an immutable, cryptographically locked down host OS is great I think, and if we have some moderate modularity on top, that's also great. But oftentimes it's useful to be able to depart/compromise from that for some specific use cases, i.e. provide a bridge for example to allow workloads designed around RPM/dpkg package management to coexist reasonably nicely with such an immutable host.

For this purpose in my model I'd propose using systemd-nspawn containers. The containers are focused on OS containerization, i.e. they allow you to run a full OS with init system and everything as payload (unlike for example Docker containers which focus on a single service, and where running a full OS in it is a mess).

Running systemd-nspawn containers for such secondary OS installs has various nice properties. One of course is that systemd-nspawn supports the same level of cryptographic image validation that we rely on for the host itself. Thus, to some level the whole OS trust chain is reasonably recursive if desired: the firmware validates the OS, and the OS can validate a secondary OS installed within it. In fact, we can run our trusted OS recursively on itself and get similar security guarantees! Besides these security aspects, systemd-nspawn also has really nice properties when it comes to integration with the host. For example the --bind-user= switch permits binding a host user record and their home directory into a container as a simple one-step operation. This makes it extremely easy to have a single user and $HOME but share it concurrently with the host and a zoo of secondary OSes in systemd-nspawn containers, each of which could even run a different distribution.
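
As a rough illustration (the image path and user name are made up), booting such a secondary OS install while sharing a host user could look something like this:

    sudo systemd-nspawn --image=/var/lib/machines/secondary-os.raw \
            --bind-user=alice --boot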

Developer Mode

Superficially, an OS with an immutable /usr/ appears much less hackable than an OS where everything is writable. Moreover, an OS where everything must be signed and cryptographically validated makes it hard to insert your own code, given you are unlikely to possess access to the signing keys.

To address this issue other systems have supported a "developer" mode: once entered, the security guarantees are disabled, and the system can be freely modified, without cryptographic validation. While that's a great concept to have, I doubt it's what most developers really want: the cryptographic properties of the OS are great after all, and it sucks having to give them up once developer mode is activated.

In my model I'd thus propose two different approaches to this problem. First of all, I think there's value in allowing users to additively extend/override the OS via local developer system extensions. With this scheme the underlying cryptographic validation would remain intact, but – if this form of development mode is explicitly enabled – the developer could add in more resources from local storage that are not tied to the OS builder's chain of trust, but a local one (i.e. simply backed by encrypted storage of some form).

The second approach is to make it easy to extend (or in fact replace) the set of trusted validation keys with local ones that are under the control of the user, in order to make it easy to operate with kernel, OS, extension, portable service or container images signed by the local developer without involvement of the OS builder. This is relatively easy to do for components further down the trust chain: the elements further up the chain should optionally accept additional certificates to validate against.

(Note that systemd currently has no explicit support for a "developer" mode like this. I think we should add that sooner or later however.)

Democratizing Code Signing

Closely related to the question of developer mode is the question of code signing. If you ask me, the status quo of UEFI SecureBoot code signing in the major Linux distributions is pretty sad. The work to get stuff signed is massive, but in effect it delivers very little in return: because initrds are entirely unprotected and reside on partitions lacking any form of cryptographic integrity protection, any attacker can trivially modify the boot process of any such Linux system and freely collect the FDE passphrases entered. There's little value in signing the boot loader and kernel in a complex bureaucracy if it then happily loads entirely unprotected code that processes the actually relevant security credentials: the FDE keys.

In my model, through use of unified kernels this important gap is closed, hence UEFI SecureBoot code signing becomes an integral part of the boot chain from firmware to the host OS. Unfortunately, code signing – and having something a user can locally hack on – are to some level conflicting goals. However, I think we can improve the situation here and put more emphasis on enrolling developer keys in the trust chain easily. Specifically, I see one relevant approach here: enrolling keys directly in the firmware is something that we should make less of a theoretical exercise and more something we can realistically deploy. See this work in progress making this more automatic and eventually safe. Other approaches are thinkable (including some that build on existing MokManager infrastructure), but given the politics involved they are harder to conclusively implement.

Running the OS itself in a container

What I explain above is put together with running on a bare metal system in mind. However, one of the stated goals is to make the OS adaptive enough to also run in a container environment (specifically: systemd-nspawn) nicely. Booting a disk image on bare metal or in a VM generally means that the UEFI firmware validates and invokes the boot loader, and the boot loader invokes the kernel which then transitions into the final system. This is different for containers: here the container manager immediately calls the init system, i.e. PID 1. Thus the validation logic must be different: cryptographic validation must be done by the container manager. In my model this is solved by shipping the OS image not only with a Verity data partition (as is already necessary for the UEFI SecureBoot trust chain, see above), but also with another partition, containing a PKCS#7 signature of the root hash of said Verity partition. This of course is exactly what I propose for both the system extension and portable service image. Thus, in my model the images for all three uses are put together the same way: an immutable /usr/ partition, accompanied by a Verity partition and a PKCS#7 signature partition. The OS image itself then has two ways "into" the trust chain: either through the signed unified kernel in the ESP (which is used for bare metal and VM boots) or by using the PKCS#7 signature stored in the partition (which is used for container/systemd-nspawn boots).

Parameterizing Kernels

A fully immutable and signed OS has to establish trust in the user data it makes use of before doing so. In the model I describe here, for /etc/ and /var/ we do this via disk encryption of the root file system (in combination with integrity checking). But the point where the root file system is mounted comes relatively late in the boot process, and thus cannot be used to parameterize the boot itself. In many cases it's important to be able to parameterize the boot process however.

For example, for the implementation of the developer mode indicated above it's useful to be able to pass this fact safely to the initrd, in combination with other fields (e.g. a hashed root password for allowing in-initrd logins for debug purposes). After all, if the initrd is pre-built by the vendor and signed as a whole together with the kernel, it cannot be modified to carry such data directly (which is in fact how parameterizing of the initrd was traditionally done to a large degree).

In my model this is achieved through system credentials, which allow passing parameters to systems (and services, for that matter) in an encrypted and authenticated fashion, bound to the TPM2 chip. This means that we can securely pass data into the initrd so that it can be authenticated and decrypted only on the system it is intended for and with the unified kernel image it was intended for.
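
To make this a bit more concrete, here's a hedged sketch using today's systemd-creds tooling (the credential name, file names and the consuming unit are made up for illustration; binding to a specific kernel image additionally involves TPM2 PCR policy, which is glossed over here):

    # Encrypt a parameter and seal it against the local TPM2 chip
    systemd-creds encrypt --with-key=tpm2 --name=devmode plaintext.txt devmode.cred

    # A service could then consume it as an encrypted credential (unit drop-in sketch)
    [Service]
    LoadCredentialEncrypted=devmode:/etc/credstore.encrypted/devmode.cred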

Swap

In my model the OS would also carry a swap partition, for the simple reason that only then can systemd-oomd.service provide the best results. Also see In defence of swap: common misconceptions.

Updating Images

We have a rough idea how the system shall be organized now, let's next focus on the deployment cycle: software needs regular update cycles, and software that is not updated regularly is a security problem. Thus, I am sure that any modern system must be automatically updated, without this requiring avoidable user interaction.

In my model, this is the job for systemd-sysupdate. It's a relatively simple A/B image updater: it operates either on partitions, on regular files in a directory, or on subdirectories in a directory. Each entry has a version (which is encoded in the GPT partition label for partitions, and in the filename for regular files and directories): whenever an update is initiated the oldest version is erased, and the newest version is downloaded.

With the setup described above a system update becomes a really simple operation. On each update the systemd-sysupdate tool downloads a /usr/ file system partition, an accompanying Verity partition and a PKCS#7 signature partition, and drops them into the host's partition table (where they possibly replace the oldest versions stored there so far). Then it downloads a unified kernel image and drops it into the EFI System Partition's /EFI/Linux (as per Boot Loader Specification; possibly erasing the oldest such file there). And that's already the whole update process: four files are downloaded from the server, unpacked and put in the most straightforward of ways into the partition table or file system. Unlike in other OS designs there's no mechanism required to explicitly switch to the newer version; the aforementioned systemd-boot logic will automatically pick the newest kernel once it is dropped in.
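
For illustration, a transfer definition for the /usr/ partition could look roughly like the sketch below. The URL and match patterns are made up, the key names are from memory, and sysupdate.d(5) is the authoritative reference:

    # /usr/lib/sysupdate.d/20-usr.conf (sketch)
    [Source]
    Type=url-file
    Path=https://download.example.com/fooOS/
    MatchPattern=fooOS_@v.usr.raw

    [Target]
    Type=partition
    Path=auto
    MatchPattern=fooOS_@v
    MatchPartitionType=usr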

Above we talked a lot about modularity, and how to put systems together as a combination of a host OS image, system extension images for the initrd and the host, portable service images and systemd-nspawn container images. I already emphasized that these image files are actually always the same: GPT disk images with partition definitions that match the Discoverable Partition Specification. This comes very handy when thinking about updating: we can use the exact same systemd-sysupdate tool for updating these other images as we use for the host image. The uniformity of the on-disk format allows us to update them uniformly too.

Boot Counting + Assessment

Automatic OS updates do not come without risks: if they happen automatically and an update goes wrong, your system might be automatically updated into a brick. This of course is less than ideal. Hence it is essential to address this reasonably automatically. In my model, there's systemd's Automatic Boot Assessment for that. The mechanism is simple: whenever a new unified kernel image is dropped into the system it is stored with a small integer counter value included in the filename. Whenever the unified kernel image is selected for booting by systemd-boot, the counter is decreased by one. Once the system has booted up successfully (which is determined by userspace) the counter is removed from the file name (which indicates "this entry is known to work"). If the counter ever hits zero, this indicates that the entry was tried a couple of times and failed each time, and is thus apparently "bad". In this case systemd-boot will not consider the kernel anymore, and revert to the next older entry (one that doesn't have a counter of zero).

By sticking the boot counter into the filename of the unified kernel we can directly attach this information to the kernel, and thus need not concern ourselves with cleaning up secondary information about the kernel when the kernel is removed. Updating with a tool like systemd-sysupdate remains a very simple operation hence: drop one old file, add one new file.
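
To make the mechanism concrete, here is roughly how the filename of a unified kernel evolves under this scheme (the version string and the initial counter of 3 are just example choices):

    fooOS_0.8+3.efi     # freshly dropped in, 3 boot attempts left
    fooOS_0.8+2-1.efi   # after systemd-boot tried booting it once
    fooOS_0.8.efi       # boot succeeded, counter removed: "known to work"
    fooOS_0.8+0-3.efi   # all attempts failed: entry is considered "bad"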

Picking the Newest Version

I already mentioned that systemd-boot automatically picks the newest unified kernel image to boot, by looking at the version encoded in the filename. This is done via a simple strverscmp() call (well, truth be told, it's a modified version of that call, different from the one implemented in libc, because real-life package managers use more complex rules for comparing versions these days, and hence it made sense to do that here too). The concept of having multiple entries of some resource in a directory, and picking the newest one automatically is a powerful concept, I think. It means adding/removing new versions is extremely easy (as we discussed above, in systemd-sysupdate context), and allows stateless determination of what to use.

If systemd-boot can do that, what about system extension images, portable service images, or systemd-nspawn container images that do not actually use systemd-boot as the entrypoint? All these tools actually implement the very same logic, but on the partition level: if multiple suitable /usr/ partitions exist, then the newest is determined by comparing their GPT partition labels.

This is in a way the counterpart to the systemd-sysupdate update logic described above: we always need a way to determine which partition to actually use after the update took place, and this is done the same way each time: enumerate the possible entries, pick the newest as per the (modified) strverscmp() result.

Home Directory Management

In my model the device's users and their home directories are managed by systemd-homed. This means they are relatively self-contained and can be migrated easily between devices. The numeric UID assignment for each user is done at the moment of login only, and the files in the home directory are mapped as needed via a uidmap mount. It also allows us to protect the data of each user individually with a credential that belongs to the user themselves, i.e. instead of binding confidentiality of the user's data to the system-wide full-disk-encryption, each user gets their own encrypted home directory where the user's authentication token (password, FIDO2 token, PKCS#11 token, recovery key…) is used as authentication and decryption key for the user's data. This brings a major improvement for security as it means the user's data is cryptographically inaccessible except when the user is actually logged in.
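
For illustration, creating such a self-contained, per-user encrypted home directory with systemd-homed could look roughly like this (user name and size are made up):

    # Create a LUKS-encrypted, btrfs-formatted home directory managed by systemd-homed
    sudo homectl create alice --storage=luks --fs-type=btrfs --disk-size=100G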

It also allows us to correct another major issue with traditional Linux systems: the way data encryption works during system suspend. Traditionally on Linux the disk encryption credentials (e.g. the LUKS passphrase) are kept in memory also while the system is suspended. This is a bad choice for security, since many (most?) of us probably never turn off their laptop but suspend it instead. But if the decryption key is always present in unencrypted form during the suspended time, then it could potentially be read from there by a sufficiently equipped attacker.

By encrypting the user's home directory with the user's authentication token we can first safely "suspend" the home directory before going to the system suspend state (i.e. flush out the cryptographic keys needed to access it). This means any process currently accessing the home directory will be frozen for the time of the suspend, but that's expected anyway during a system suspend cycle. Why is this better than the status quo ante? In this model the home directory's cryptographic key material is erased during suspend, but it can be safely reacquired on resume, from system code. If the system is only encrypted as a whole however, then the system code itself couldn't reauthenticate the user, because it would be frozen too. By separating home directory encryption from the root file system encryption we can avoid this problem.

Partition Setup

So we discussed the partition organization of the OS images multiple times in the above, each time focusing on a specific aspect. Let's now summarize how this should look all together.

In my model, the initial, shipped OS image should look roughly like this:

  • (1) A UEFI System Partition, with systemd-boot as boot loader and one unified kernel
  • (2) A /usr/ partition (version "A"), with a label fooOS_0.7 (under the assumption we called our project fooOS and the image version is 0.7).
  • (3) A Verity partition for the /usr/ partition (version "A"), with the same label
  • (4) A partition carrying the Verity root hash for the /usr/ partition (version "A"), along with a PKCS#7 signature of it, also with the same label

On first boot this is augmented by systemd-repart like this:

  • (5) A second /usr/ partition (version "B"), initially with a label _empty (which is the label systemd-sysupdate uses to mark partitions that currently carry no valid payload)
  • (6) A Verity partition for that (version "B"), similar to the above case, also labelled _empty
  • (7) And ditto a Verity root hash partition with a PKCS#7 signature (version "B"), also labelled _empty
  • (8) A root file system, encrypted and locked to the TPM2
  • (9) A home file system, integrity protected via a key also in TPM2 (encryption is unnecessary, since systemd-homed adds that on its own, and it's nice to avoid duplicate encryption)
  • (10) A swap partition, encrypted and locked to the TPM2

Then, on the first OS update the partitions 5, 6, 7 are filled with a new version of the OS (let's say 0.8) and thus get their label updated to fooOS_0.8. After a boot, this version is active.

On a subsequent update the three partitions fooOS_0.7 get wiped and replaced by fooOS_0.9 and so on.

On factory reset, the partitions 8, 9, 10 are deleted, so that systemd-repart recreates them, using a new set of cryptographic keys.
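
A minimal sketch of what the systemd-repart drop-ins for the partitions created on first boot (8, 9 and 10 above) could look like. Names and sizes are made up, the integrity-only setup for /home/ is glossed over, and repart.d(5) is the authoritative reference for the exact keys:

    # /usr/lib/repart.d/50-root.conf
    [Partition]
    Type=root
    Format=btrfs
    Encrypt=tpm2

    # /usr/lib/repart.d/60-home.conf
    [Partition]
    Type=home
    Format=btrfs

    # /usr/lib/repart.d/70-swap.conf
    [Partition]
    Type=swap
    Format=swap
    Encrypt=tpm2
    SizeMinBytes=4G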

Here's a graphic that hopefully illustrates the partition table from shipped image, through first boot, multiple update cycles and eventual factory reset:

Partitions Overview

Trust Chain

So let's summarize the intended chain of trust (for bare metal/VM boots) that ensures every piece of code in this model is signed and validated, and any system secret is locked to TPM2.

  1. First, firmware (or possibly shim) authenticates systemd-boot.

  2. Once systemd-boot picks a unified kernel image to boot, it is also authenticated by firmware/shim.

  3. The unified kernel image contains an initrd, which is the first userspace component that runs. It finds any system extensions passed into the initrd, and sets them up through Verity. The kernel will validate the Verity root hash signature of these system extension images against its usual keyring.

  4. The initrd also finds credentials passed in, then securely unlocks (which means: decrypts + authenticates) them with a secret from the TPM2 chip, locked to the kernel image itself.

  5. The kernel image also contains a kernel command line which contains a usrhash= option that pins the root hash of the /usr/ partition to use.

  6. The initrd then unlocks the encrypted root file system, with a secret bound to the TPM2 chip.

  7. The system then transitions into the main system, i.e. the combination of the Verity protected /usr/ and the encrypted root file system. It then activates two more encrypted (and/or integrity protected) volumes for /home/ and swap, also with a secret tied to the TPM2 chip.

Here's an attempt to illustrate the above graphically:

Trust Chain

This is the trust chain of the basic OS. Validation of system extension images, portable service images, systemd-nspawn container images always takes place the same way: the kernel validates these Verity images along with their PKCS#7 signatures against the kernel's keyring.

File System Choice

In the above I left the choice of file systems unspecified. For the immutable /usr/ partitions squashfs might be a good candidate, but any other that works nicely in a read-only fashion and generates reproducible results is a good choice, too. The home directories as managed by systemd-homed should certainly use btrfs, because it's the only general purpose file system supporting online grow and shrink, which systemd-homed can take advantage of to manage storage.

For the root file system btrfs is likely also the best idea. That's because we intend to use LUKS/dm-crypt underneath, which by default only provides confidentiality, not authenticity of the data (unless combined with dm-integrity). Since btrfs (unlike xfs/ext4) does full data checksumming it's probably the best choice here, since it means we don't have to use dm-integrity (which comes at a higher performance cost).

OS Installation vs. OS Instantiation

In the discussion above a lot of focus was put on setting up the OS and completing the partition layout and such on first boot. This means installing the OS becomes as simple as dd-ing (i.e. "streaming") the shipped disk image into the final HDD medium. Simple, isn't it?

Of course, such a scheme is just too simple for many setups in real life. Whenever multi-boot is required (i.e. co-installing an OS implementing this model with another unrelated one), dd-ing a disk image onto the HDD is going to overwrite user data that was supposed to be kept around.

In order to cover for this case, in my model, we'd use systemd-repart (again!) to allow streaming the source disk image into the target HDD in a smarter, additive way. The tool after all is purely additive: it will add in partitions or grow them if they are missing or too small. systemd-repart already has all the necessary provisions to not only create a partition on the target disk, but also copy blocks from a raw installer disk. An install operation would then become a two step process: one invocation of systemd-repart that adds in the /usr/, its Verity and the signature partition to the target medium, populated with a copy of the same partitions of the installer medium. And one invocation of bootctl that installs the systemd-boot boot loader in the ESP. (Well, there's one thing missing here: the unified OS kernel also needs to be dropped into the ESP. For now, this can be done with a simple cp call. In the long run, this should probably be something bootctl can do as well, if told so.)
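
A hedged sketch of what such an install could look like on the command line; the device path, definition directory and kernel file name are placeholders, and the repart.d drop-ins would use CopyBlocks= to stream the partition contents from the installer medium:

    # Add /usr/, Verity and signature partitions to the target disk, copied from the installer
    sudo systemd-repart --definitions=/path/to/repart.d --dry-run=no /dev/target-disk

    # Install the systemd-boot boot loader into the ESP
    sudo bootctl install --esp-path=/boot

    # For now, copy the unified kernel image over manually
    sudo cp fooOS_0.7.efi /boot/EFI/Linux/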

So, with this we have a simple scheme to cover all bases: we can either just dd an image to disk, or we can stream an image onto an existing HDD, adding a couple of new partitions and files to the ESP.

Of course, in reality things are more complex than even that: there's a good chance that the existing ESP is simply too small to carry multiple unified kernels. In my model, the way to address this is by shipping two slightly different systemd-repart partition definition file sets: the ideal case when the ESP is large enough, and a fallback case, where it isn't and where we then add in an additional XBOOTLDR partition (as per the Discoverable Partitions Specification). In that mode the ESP carries the boot loader, but the unified kernels are stored in the XBOOTLDR partition. This scenario is not quite as simple as the XBOOTLDR-less scenario described first, but is equally well supported in the various tools. Note that systemd-repart can be told size constraints on the partitions it shall create or augment, thus to implement this scheme it's enough to invoke the tool with the fallback partition scheme if invocation with the ideal scheme fails.

Either way: regardless how the partitions, the boot loader and the unified kernels ended up on the system's hard disk, on first boot the code paths are the same again: systemd-repart will be called to augment the partition table with the root file system, and properly encrypt it, as was already discussed earlier here. This means: all cryptographic key material used for disk encryption is generated on first boot only, the installer phase does not encrypt anything.

Live Systems vs. Installer Systems vs. Installed Systems

Traditionally on Linux three types of systems were common: "installed" systems, i.e. those that are stored on the main storage of the device and are the primary place people spend their time in; "installer" systems which are used to install them and whose job is to copy and set up the packages that make up the installed system; and "live" systems, which were a middle ground: a system that behaves like an installed system in most ways, but lives on removable media.

In my model I'd like to remove the distinction between these three concepts as much as possible: each of these three images should carry the exact same /usr/ file system, and should be suitable to be replicated the same way. Once installed the resulting image can also act as an installer for another system, and so on, creating a certain "viral" effect: if you have one image or installation it's automatically something you can replicate 1:1 with a simple systemd-repart invocation.

Building Images According to this Model

The above explains what the image should look like and how its first boot and update cycle will modify it. But this leaves one question unanswered: how to actually build the initial image for OS instances according to this model?

Note that there's nothing too special about the images following this model: they are ultimately just GPT disk images with Linux file systems, following the Discoverable Partition Specification. This means you can use any set of tools of your choice that can put together compliant GPT disk images.

I personally would use mkosi for this purpose though. It's designed to generate compliant images, and has a rich toolset for SecureBoot and signed/Verity file systems already in place.
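
Just to give a rough idea of the flavor, a minimal mkosi configuration might look something like the sketch below. The option names are from memory and may not match the current mkosi release exactly, so treat this purely as an illustration:

    # mkosi.default (sketch)
    [Distribution]
    Distribution=fedora
    Release=36

    [Output]
    Format=gpt_squashfs
    Bootable=yes
    Verity=yes

    [Packages]
    Packages=systemd,util-linux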

What is key here is that this model doesn't depart from RPM and dpkg, instead it builds on top of them: in this model they are excellent for putting together images on the build host, but deployment onto the runtime host does not involve individual packages.

I think one cannot overstate the value traditional distributions bring regarding security, integration and general polishing. The concepts I describe above are inherited from this, but depart from the idea that distribution packages are a runtime concept and make it a build-time concept instead.

Note that the above is pretty much independent from the underlying distribution.

Final Words

I have no illusions, general purpose distributions are not going to adopt this model as their default any time soon, and it's not even my goal that they do that. The above is my personal vision, and I don't expect people to buy into it 100%, and that's fine. However, what I am interested in is finding the overlaps, i.e. work with people who buy 50% into this vision, and share the components.

My goals here thus are to:

  1. Get distributions to move to a model where images like this can be built from the distribution easily. Specifically this means that distributions make their OS hermetic in /usr/.

  2. Find the overlaps, share components with other projects to revisit how distributions are put together. This is already happening, see systemd-tmpfiles and systemd-sysusers support in various distributions, but I think there's more to share.

  3. Make people interested in building actual real-world images based on general purpose distributions adhering to the model described above. I'd love a "GnomeBook" image with full trust properties, that is built from true Linux distros, such as Fedora or ArchLinux.

FAQ

  1. What about ostree? Doesn't ostree already deliver what this blog story describes?

    ostree is fine technology, but in respect to security and robustness properties it's not too interesting I think, because unlike image-based approaches it cannot really deliver integrity/robustness guarantees over the whole tree easily. To be able to trust an ostree setup you have to establish trust in the underlying file system first, and the complexity of the file system makes that challenging. To provide an effective offline-secure trust chain through the whole depth of the stack it is essential to cryptographically validate every single I/O operation. In an image-based model this is trivially easy, but in the ostree model it is not possible with current file system technology, and even if this is added in one way or another in the future (though I am not aware of anyone doing on-access file-based integrity that spans a whole hierarchy of files and is compatible with ostree's hardlink farm model) I think validation is still at too high a level, since Linux file system developers made very clear their implementations are not robust against rogue images. (There's this stuff planned, but doing structural authentication ahead of time instead of on access makes the idea too weak — and I'd expect too slow — in my eyes.)

    With my design I want to deliver similar security guarantees as ChromeOS does, but ostree is much weaker there, and I see no perspective of this changing. In a way ostree's integrity checks are similar to RPM's and enforced on download rather than on access. In the model I suggest above, it's always on access, and thus safe against offline attacks (i.e. evil maid attacks). In today's world, I think offline security is absolutely necessary though.

    That said, ostree does have some benefits over the model described above: it naturally shares file system inodes if many of the modules/images involved share the same data. It's thus more space efficient on disk (and thus also in RAM/cache to some degree) by default. In my model it would be up to the image builders to minimize shipping overly redundant disk images, by making good use of suitably composable system extensions.

  2. What about configuration management?

    At first glance immutable systems and configuration management don't go that well together. However, do note that in the model I propose above the root file system with all its contents, including /etc/ and /var/, is actually writable and can be modified like on any other typical Linux distribution. The only exception is /usr/ where the immutable OS is hermetic. That means configuration management tools should work just fine in this model – up to the point where they are used to install additional RPM/dpkg packages, because that's something not allowed in the model above: packages need to be installed at image build time and thus on the image build host, not the runtime host.

  3. What about non-UEFI and non-TPM2 systems?

    The above is designed around the feature set of contemporary PCs, and this means UEFI and TPM2 being available (simply because the PC is pretty much defined by the Windows platform, and current versions of Windows require both).

    I think it's important to make the best of the features of today's PC hardware, and then find suitable fallbacks on more limited hardware. Specifically this means: if there's desire to implement something like this on non-UEFI or non-TPM2 hardware we should look for suitable fallbacks for the individual functionality, but generally try to add glue to the old systems so that conceptually they behave more like the new systems instead of the other way round. Or in other words: most of the above is not strictly tied to UEFI or TPM2, and for many cases there are already reasonable fallbacks in place for more limited systems. Of course, without TPM2 many of the security guarantees will be weakened.

  4. How would you name an OS built that way?

    I think a desktop OS built this way if it has the GNOME desktop should of course be called GnomeBook, to mimic the ChromeBook name. ;-)

    But in general, I'd call hermetic, adaptive, immutable OSes like this "particles".

How can you help?

  1. Help making Distributions Hermetic in /usr/!

    One of the core ideas of the approach described above is to make the OS hermetic in /usr/, i.e. make it carry a comprehensive description of what needs to be set up outside of it when instantiated. Specifically, this means that system users that are needed are declared in systemd-sysusers snippets, and skeleton files and directories are created via systemd-tmpfiles. Moreover additional partitions should be declared via systemd-repart drop-ins.

    At this point some distributions (such as Fedora) are (probably more by accident than on purpose) already mostly hermetic in /usr/, at least for the most basic parts of the OS. However, this is not complete: many daemons require specific resources to be set up in /var/ or /etc/ before they can work, and the relevant packages do not carry systemd-tmpfiles descriptions that add them if missing. So there are two ways you could help here: politically, it would be highly relevant to convince distributions that an OS that is hermetic in /usr/ is highly desirable and a worthy goal for packagers to get there. More specifically, it would be desirable if RPM/dpkg packages would ship with enough systemd-tmpfiles information so that configuration files the packages strictly need for operation are symlinked (or copied) from /usr/share/factory/ if they are missing (even better of course would be if packages from their upstream sources would just work with an empty /etc/ and /var/, and create what they need themselves, defaulting to good defaults in the absence of configuration files). A sketch of what such declarative snippets look like follows at the end of this item.

    Note that distributions that adopted systemd-sysusers, systemd-tmpfiles and the /usr/ merge are already quite close to providing an OS that is hermetic in /usr/. These were the big, the major advancements: making the image fully hermetic should be less controversial – at least that's my guess.

    Also note that making the OS hermetic in /usr/ is not just useful in scenarios like the above. It also means that stuff like this and like this can work well.
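
    As promised above, here is roughly what such declarative snippets look like; the daemon name, user and paths are made up for illustration:

        # /usr/lib/sysusers.d/food.conf: declare the system user the daemon needs
        u food - "foo daemon" /var/lib/food

        # /usr/lib/tmpfiles.d/food.conf: create the state directory, copy the default config if missing
        d /var/lib/food 0750 food food -
        C /etc/foo.conf - - - - /usr/share/factory/etc/foo.conf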

  2. Fill in the gaps!

    I already mentioned a couple of missing bits and pieces in the implementation of the overall vision. In the systemd project we'd be delighted to review/merge any PRs that fill in the voids.

  3. Build your own OS like this!

    Of course, while we built all these building blocks and they have been adopted to various levels and for various purposes in the various distributions, no one so far has built an OS that puts things together just like that. It would be excellent if we had communities that work on building images like what I propose above, i.e. if you want to work on making a secure GnomeBook, as I suggest above, a reality, that would be more than welcome.

    What could this look like specifically? Pick an existing distribution, write a set of mkosi descriptions plus some additional drop-in files, and then build this on some build infrastructure. While doing so, report the gaps, and help us address them.

Further Documentation of Used Components and Concepts

  1. systemd-tmpfiles
  2. systemd-sysusers
  3. systemd-boot
  4. systemd-stub
  5. systemd-sysext
  6. systemd-portabled, Portable Services Introduction
  7. systemd-repart
  8. systemd-nspawn
  9. systemd-sysupdate
  10. systemd-creds, System and Service Credentials
  11. systemd-homed
  12. Automatic Boot Assessment
  13. Boot Loader Specification
  14. Discoverable Partitions Specification
  15. Safely Building Images

Earlier Blog Stories Related to this Topic

  1. The Strange State of Authenticated Boot and Disk Encryption on Generic Linux Distributions
  2. The Wondrous World of Discoverable GPT Disk Images
  3. Unlocking LUKS2 volumes with TPM2, FIDO2, PKCS#11 Security Hardware on systemd 248
  4. Portable Services with systemd v239
  5. mkosi — A Tool for Generating OS Images

And that's all for now.

April 29, 2022

I've been working on kopper recently, which is a complementary project to zink. Just as zink implements OpenGL in terms of Vulkan, kopper seeks to implement the GL window system bindings - like EGL and GLX - in terms of the Vulkan WSI extensions. There are several benefits to doing this, which I'll get into in a future post, but today's story is really about libX11 and libxcb.

Yes, again.

One important GLX feature is the ability to set the swap interval, which is how you get tear-free rendering by syncing buffer swaps to the vertical retrace. A swap interval of 1 is the typical case, where an image update happens once per frame. The Vulkan way to do this is to set the swapchain present mode to FIFO, since FIFO updates are implicitly synced to vblank. Mesa's WSI code for X11 uses a swapchain management thread for FIFO present modes. This thread is started from inside the vulkan driver, and it only uses libxcb to talk to the X server. But libGL is a libX11 client library, so in this scenario there is always an "xlib thread" as well.

libX11 uses libxcb internally these days, because otherwise there would be no way to intermix xlib and xcb calls in the same process. But it does not use libxcb's reflection of the protocol: XGetGeometry does not call xcb_get_geometry, for example. Instead, libxcb has an API to allow other code to take over the write side of the display socket, with a callback mechanism to get it back when another xcb client issues a request. The callback function libX11 uses here is straightforward: lock the Display, flush out any internally buffered requests, and return the sequence number of the last request written. Both libraries need this sequence number for various reasons internally; xcb for example uses it to make sure replies go back to the thread that issued the request.

But "lock the Display" here really means call into a vtable in the Display struct. That vtable is filled in during XOpenDisplay, but the individual function pointers are only non-NULL if you called XInitThreads beforehand. And if you're libGL, you have no way to enforce that, your public-facing API operates on a Display that was already created.

So now we see the race. The queue management thread calls into libxcb while the main thread is somewhere inside libX11. Since libX11 has taken the socket, the xcb thread runs the release callback. Since the Display was not made thread-safe at XOpenDisplay time, the release callback does not block, so the xlib thread's work won't be correctly accounted. If you're lucky the two sides will at least write to the socket atomically with respect to each other, but at this point they have diverging opinions about the request sequence numbering, and it's a matter of time until you crash.

It turns out kopper makes this really easy to hit. Like "resize a glxgears window" easy. However, this isn't just a kopper issue, this race exists for every program that uses xcb on a not-necessarily-thread-safe Display. The only reasonable fix is for libX11 to just always be thread-safe.

So now, it is.


April 26, 2022

I recently blogged about how to run a volatile systemd-nspawn container from your host's /usr/ tree, for quickly testing stuff in your host environment, sharing your home directory, but all that without making a single modification to your host, and on an isolated node.

The one-liner discussed in that blog story is great for testing during system software development. Let's have a look at another systemd tool that I regularly use to test things during systemd development, in a relatively safe environment, but still taking full benefit of my host's setup.

For a while now, systemd has been shipping with a simple component called systemd-sysext. Its primary use case goes something like this: on one hand OS systems with immutable /usr/ hierarchies are fantastic for security, robustness, updating and simplicity, but on the other hand not being able to quickly add stuff to /usr/ is just annoying.

systemd-sysext is supposed to bridge this contradiction: when invoked it will merge a bunch of "system extension" images into /usr/ (and /opt/ as a matter of fact) through the use of read-only overlayfs, making all files shipped in the image instantly and atomically appear in /usr/ during runtime — as if they always had been there. Now, let's say you are building your locked down OS, with an immutable /usr/ tree, and it comes without the ability to log in, without debugging tools, without anything you want and need when trying to debug and fix something in the system. With systemd-sysext you could build a system extension image that contains all this, drop it into the system, and activate it so that it genuinely extends the host system.

(There are many other use cases for this tool; for example, you could build systems that at their base use a generic image, but by installing one or more system extensions get extended with additional, more specific functionality, or drivers, or similar. The tool is generic, use it for whatever you want, but for now let's not get lost in listing all the possibilities.)

What's particularly nice about the tool is that it supports automatically discovered dm-verity images, with signatures and everything. So you can even do this in a fully authenticated, measured, safe way. But I am digressing…

Now that we (hopefully) have a rough understanding of what systemd-sysext is and does, let's discuss how specifically we can use this in the context of system software development, to safely use and test bleeding edge development code – built freshly from your project's build tree – in your host OS without having to risk that the host OS is corrupted or becomes unbootable by stuff that didn't quite yet work the way it was envisioned:

The images systemd-sysext merges into /usr/ can be of two kinds: disk images with a file system/verity/signature, or simple, plain directory trees. To make these images available to the tool, they can be placed or symlinked into /usr/lib/extensions/, /var/lib/extensions/, /run/extensions/ (and a bunch of others). So if we now install our freshly built development software into a subdirectory of those paths, then that's entirely sufficient to make them valid system extension images in the sense of systemd-sysext, and thus can be merged into /usr/ to try them out.

To be more specific: when I develop systemd itself, here's what I do regularly, to see how my new development version would behave on my host system. As preparation I checked out the systemd development git tree first of course, hacked around in it a bit, then built it with meson/ninja. And now I want to test what I just built:

sudo DESTDIR=/run/extensions/systemd-test meson install -C build --quiet --no-rebuild &&
        sudo systemd-sysext refresh --force

Explanation: first, we'll install my current build tree as a system extension into /run/extensions/systemd-test/. And then we apply it to the host via the systemd-sysext refresh command. This command will search for all installed system extension images in the aforementioned directories, then unmount (i.e. "unmerge") any previously merged dirs from /usr/ and then freshly mount (i.e. "merge") the new set of system extensions on top of /usr/. And just like that, I have installed my development tree of systemd into the host OS, and all that without actually modifying/replacing even a single file on the host at all. Nothing here actually hit the disk!

Note that all this works on any system really, it is not necessary that the underlying OS even is designed with immutability in mind. Just because the tool was developed with immutable systems in mind it doesn't mean you couldn't use it on traditional systems where /usr/ is mutable as well. In fact, my development box actually runs regular Fedora, i.e. is RPM-based and thus has a mutable /usr/ tree. As long as system extensions are applied the whole of /usr/ becomes read-only though.

Once I am done testing, when I want to revert to how things were without the image installed, it is sufficient to call:

sudo systemd-sysext unmerge

And there you go, all files my development tree generated are gone again, and the host system is as it was before (and /usr/ mutable again, in case one is on a traditional Linux distribution).

Also note that a reboot (regardless if a clean one or an abnormal shutdown) will undo the whole thing automatically, since we installed our build tree into /run/ after all, i.e. a tmpfs instance that is flushed on boot. And given that the overlayfs merge is a runtime thing, too, the whole operation was executed without any persistence. Isn't that great?

(You might wonder why I specified --force on the systemd-sysext refresh line earlier. That's because systemd-sysext actually does some minimal version compatibility checks when applying system extension images. For that it will compare the host's /etc/os-release file with the image's /usr/lib/extension-release.d/extension-release.<name> file, and refuse operation if the image is not actually built for the host OS version. Here we don't want to bother with dropping that file in there, since we know already that the extension image is compatible with the host, as we just built it on it. --force allows us to skip the version check.)
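
For completeness, here is roughly what that compatibility file inside the extension would look like if we did want to drop it in rather than use --force; the values have to match the host's /etc/os-release, and the ones below are just an example:

    # /run/extensions/systemd-test/usr/lib/extension-release.d/extension-release.systemd-test
    ID=fedora
    VERSION_ID=36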

You might wonder: what about combining the idea from the previous blog story (regarding running containers off the host's /usr/ tree) with system extensions? Glad you asked. Right now we have no support for this, but it's high on our TODO list (patches welcome, of course!), i.e. a new switch for systemd-nspawn called --system-extension= that would allow merging one or more such extensions into the container tree being booted would be stellar. With that, with a single command I could run a container off my host OS but with a development version of systemd dropped in, all without any persistence. How awesome would that be?

(Oh, and in case you wonder, all of this only works with distributions that have completed the /usr/ merge. On legacy distributions that didn't do that and still place parts of /usr/ all over the hierarchy the above won't work, since merging /usr/ trees via overlayfs is pretty pointless if the OS is not hermetic in /usr/.)

And that's all for now. Happy hacking!

April 24, 2022

The title might be a bit hyperbolic here, but we’re indeed exploring a first step in that direction with radv. The impetus here is the ExecuteIndirect command in Direct3D 12 and some games that are using it in non-trivial ways. (e.g. Halo Infinite)

ExecuteIndirect can be seen as an extension of what we have in Vulkan with vkCmdDrawIndirectCount. It adds extra capabilities. To support that with vkd3d-proton we need the following indirect Vulkan capabilities:

  1. Binding vertex buffers.
  2. Binding index buffers.
  3. Updating push constants.

This functionality happens to be a subset of VK_NV_device_generated_commands and hence I’ve been working on implementing a subset of that extension on radv. Unfortunately, we can’t really give the firmware an “extended indirect draw call” and execute stuff, so we’re stuck generating command buffers on the GPU.

The way the extension works, the application specifies a command “signature” on the CPU, which specifies that for each draw call the application is going to update A, B and C. Then, at runtime, the application provides a buffer with the data for A, B and C for each draw call. The driver then processes that into a command buffer and executes it the way it would execute a secondary command buffer.

The workflow is then as follows:

  1. The application (or vkd3d-proton) provides the command signature to the driver which creates an object out of it.
  2. The application queries how big a command buffer (“preprocess buffer”) of n draws with that signature would be.
  3. The application allocates the preprocess buffer.
  4. The application does its stuff to generate some commands.
  5. The application calls vkCmdPreprocessGeneratedCommandsNV which converts the application buffer into a command buffer (in the preprocess buffer)
  6. The application calls vkCmdExecuteGeneratedCommandsNV to execute the generated command buffer.

What goes into a draw in radv

When the application triggers a draw command in Vulkan, the driver generates GPU commands to do the following:

  1. Flush caches if needed
  2. Set some registers.
  3. Trigger the draw.

Of course we skip any of these steps (or parts of them) when they’re redundant. The majority of the complexity is in the register state we have to set. There are multiple parts here:

  1. Fixed function state:

    1. subpass attachments
    2. static/dynamic state (viewports, scissors, etc.)
    3. index buffers
    4. some derived state from the shaders (some tessellation stuff, fragment shader export types, varyings, etc.)
  2. shaders (start address, number of registers, builtins used)
  3. user SGPRs (i.e. registers that are available at the start of a shader invocation)

Overall, most of the pipeline state is fairly easy to emit: we just precompute it on pipeline creation and memcpy it over if we switch shaders. The most difficult part is probably the user SGPRs, and the reason for that is that they are derived from a lot of the remaining API state. Note that the list above doesn’t include push constants, descriptor sets or vertex buffers. The driver computes all of these, and generates the user SGPR data from that.

Descriptor sets in radv are just a piece of GPU memory, and radv binds a descriptor set by providing the shader with a pointer to that GPU memory in a user SGPR. Similarly, we have no hardware support for vertex buffers, so radv generates a push descriptor set containing internal texel buffers and then provides a user SGPR with a pointer to that descriptor set.

For push constants, radv has two modes: a portion of the data can be passed in user SGPRs directly, but sometimes a chunk of memory gets allocated and then a pointer to that memory is provided in a user SGPR. This fallback exists because the hardware doesn’t always have enough user SGPRs to fit all the data.

On Vega and later there are 32 user SGPRs, and on earlier GCN GPUs there are 16. This needs to fit pointers to all the referenced descriptor sets (including internal ones like the one for vertex buffers), push constants, builtins like the start vertex and start instance etc. To get the best performance here, radv determines a mapping of API object to user SGPR at shader compile time and then at draw time radv uses that mapping to write user SGPRs.

This results in some interesting behavior, like switching pipelines causing the driver to update all the user SGPRs, because the mapping might have changed.

Furthermore, as an interesting performance hack radv allocates all upload buffers (for the push constants and push descriptor sets), shaders and descriptor pools in a single 4 GiB region of memory so that we can pass only the bottom 32 bits of all the pointers in a user SGPR, getting us further with the limited number of user SGPRs. We will see later how that makes things difficult for us.

Generating a command buffer on the GPU

As shown above, radv has a bunch of complexity around state for draw calls, and if we start generating command buffers on the GPU that risks copying a significant part of that complexity to a shader. Luckily ExecuteIndirect and VK_NV_device_generated_commands have some limitations that make this easier. The app can only change

  1. vertex buffers
  2. index buffers
  3. push constants

VK_NV_device_generated_commands also allows changing shaders and the winding order of what is considered a primitive backface, but we’ve chosen to ignore that for now since it isn’t needed for ExecuteIndirect (though especially the shader switching could be useful for an application).

The second curveball is that the buffer the application provides needs to provide the same set of data for every draw call. This avoids having to do a lot of serial processing to figure out what the previous state was, which allows processing every draw command in a separate shader invocation. Unfortunately we’re still a bit dependent on the old state that is bound before the indirect command buffer execution:

  1. The previously bound index buffer
  2. Previously bound vertex buffers.
  3. Previously bound push constants.

Remember that for vertex buffers and push constants we may put them in a piece of memory. That piece of memory needs to contain all the vertex buffers/push constants for that draw call, so even if we modify only one of them, we have to copy the rest over. The index buffer is different: in the draw packets for the GPU there is a field that is derived from the index buffer size.

So in vkCmdPreprocessGeneratedCommandsNV radv partitions the preprocess buffer into a command buffer and an upload buffer (for the vertex buffers & push constants), both with a fixed stride based on the command signature. Then it launches a shader which processes a draw call in each invocation:

   if (shader used vertex buffers && we change a vertex buffer) {
      copy all vertex buffers 
      update the changed vertex buffers
      emit a new vertex descriptor set pointer
   }
   if (we change a push constant) {
      if (we change a push constant in memory) {
         copy all push constants
         update changed push constants
         emit a new push constant pointer
      }
      emit all changed inline push constants into user SGPRs
   }
   if (we change the index buffer) {
      emit new index buffers
   }
   emit a draw command
   insert NOPs up to the stride

In vkCmdExecuteGeneratedCommandsNV radv uses the internal equivalent of vkCmdExecuteCommands to execute as if the generated command buffer is a secondary command buffer.

Challenges

Of course one does not simply move part of the driver to GPU shaders without any challenges. In fact we have a whole bunch of them. Some of them just need a bunch of work to solve, some need some extension specification tweaking and some are hard to solve without significant tradeoffs.

Code maintainability

A big problem is that the code needed for the limited subset of state that is supported is now in 3 places:

  1. The traditional CPU path
  2. For determining how large the preprocess buffer needs to be
  3. For the shader called in vkCmdPreprocessGeneratedCommandsNV to build the preprocess buffer.

Having the same functionality in multiple places is a recipe for things going out of sync. This makes it harder to change this code and much easier for bugs to sneak in. This can be mitigated with a lot of testing, but testing a bunch of GPU work gets complicated quickly (e.g. a preprocess buffer that is larger than needed still results in correct output, and getting a second opinion from a shader to check adds significant complexity).

nir_builder gets old quickly

In the driver at the moment we have no good high level shader compiler. As a result a lot of the internal helper shaders are written using the nir_builder helper to generate nir, the intermediate IR of the shader compiler. Example fragment:

   nir_push_loop(b);
   {
      nir_ssa_def *curr_offset = nir_load_var(b, offset);

      nir_push_if(b, nir_ieq(b, curr_offset, cmd_buf_size));
      {
         nir_jump(b, nir_jump_break);
      }
      nir_pop_if(b, NULL);

      nir_ssa_def *packet_size = nir_isub(b, cmd_buf_size, curr_offset);
      packet_size = nir_umin(b, packet_size, nir_imm_int(b, 0x3ffc * 4));

      nir_ssa_def *len = nir_ushr_imm(b, packet_size, 2);
      len = nir_iadd_imm(b, len, -2);
      nir_ssa_def *packet = nir_pkt3(b, PKT3_NOP, len);

      nir_store_ssbo(b, packet, dst_buf, curr_offset, .write_mask = 0x1,
                     .access = ACCESS_NON_READABLE, .align_mul = 4);
      nir_store_var(b, offset, nir_iadd(b, curr_offset, packet_size), 0x1);
   }
   nir_pop_loop(b, NULL);

It is clear that this all gets very verbose very quickly. This is somewhat fine as long as all the internal shaders are tiny. However, between this and raytracing our internal shaders are getting significantly bigger and the verbosity really becomes a problem.

Interesting things to explore here are to use glslang, or even to try writing our shaders in OpenCL C and then compiling it to SPIR-V at build time. The challenge there is that radv is built on a diverse set of platforms (including Windows, Android and desktop Linux) which can make significant dependencies a struggle.

Preprocessing

Ideally your GPU work is very suitable for pipelining, to avoid synchronization cost on the GPU. If we generate the command buffer and then execute it we need to have a full GPU sync point in between, which can get very expensive as it waits until the GPU is idle. To avoid this VK_NV_device_generated_commands has added the separate vkCmdPreprocessGeneratedCommandsNV command, so that the application can batch up a bunch of work before incurring the cost of a sync point.

However, in radv we have to do the command buffer generation in vkCmdExecuteGeneratedCommandsNV as our command buffer generation depends on some of the other state that is bound, but might not be bound yet when the application calls vkCmdPreprocessGeneratedCommandsNV.

Which brings up a slight spec problem: the extension specification doesn’t say whether the application is allowed to execute vkCmdExecuteGeneratedCommandsNV on multiple queues concurrently with the same preprocess buffer. If all writes to that buffer happen in vkCmdPreprocessGeneratedCommandsNV this results in correct behavior, but if the writes happen in vkCmdExecuteGeneratedCommandsNV this is a race condition.

The 32-bit pointers

Remember that radv only passes the bottom 32 bits of some pointers around. As a result the application needs to allocate the preprocess buffer in that 4-GiB range. This in itself is easy: just add a new memory type and require it for this usage. However, the devil is in the details.

For example, what should we do for memory budget queries? Those are per memory heap, not per memory type. However, a new memory heap does not make sense either, as the memory is still subject to the physical availability of VRAM, not only to address space.

Furthermore, this 4-GiB region is more constrained than other memory, so it would be a shame if applications start allocating random stuff in it. If we look at the existing usage for a pretty heavy game (HZD) we get about

  1. 40 MiB of command buffers + upload buffers
  2. 200 MiB of descriptor pools
  3. 400 MiB of shaders

So typically we have a lot of room available. Ideally the ordering of memory types would get an application to prefer another memory type when this special region is not needed. However, memory object caching poses a big risk here: would you choose a memory object from the cache that you can reuse/suballocate (potentially in that limited region), or allocate a new one from a “better” memory type?

Luckily we have not seen that risk play out, but the only real tested user at this point has been vkd3d-proton.

Secondary command buffers

When executing the generated command buffer, radv does so the same way it would call a secondary command buffer. This has a significant limitation: on the hardware, a secondary command buffer cannot call another secondary command buffer. As a result the current implementation has a problem if vkCmdExecuteGeneratedCommandsNV gets called on a secondary command buffer.

It is possible to work around this. An example would be to split the secondary command buffer into 3 parts: pre, generated, post. However, that needs a bunch of refactoring to allow multiple internal command buffers per API command buffer.

Where to go next

Don’t expect this upstream very quickly. The main reason for exploring this in radv is ExecuteIndirect support for Halo Infinite, and after some recent updates we’re back into GPU hang limbo with radv/vkd3d-proton there. So while we’re solving that I’m holding off on upstreaming in case the hangs are caused by the implementation of this extension.

Furthermore, this is only a partial implementation of the extension anyway, with a fair number of limitations that we’d ideally eliminate before exposing it fully.

April 20, 2022

Let Your Memes Be Dreams

With Mesa 22.1 RC1 firmly out the door, most eyes have turned towards Mesa 22.2.

But not all eyes.

No, while most expected me to be rocketing off towards the next shiny feature, one ticket caught my eye:

Mesa 22.1rc1: Zink on Windows doesn’t work even simple wglgears app fails..

Sadly, I don’t support Windows. I don’t have a test machine to run it, and I don’t even have a VM I could spin up to run Lavapipe. I knew that Kopper was going to cause problems with other frontends, but I didn’t know how many other frontends were actually being used.

The answer was not zero, unfortunately. Plenty of users were enjoying the slow, software driver speed of Zink on Windows to spin those gears, and I had just crushed their dreams.

As I had no plans to change anything here, it would take a new hero to set things right.

The Hero We Deserve

Who here loves X-Plane?

I love X-Plane. It’s my favorite flight simulator. If I could, I’d play it all day every day. And do you know who my favorite X-Plane developer is?

Friend of the blog and part-time Zink developer, Sidney Just.

Some of you might know him from his extensive collection of artisanal blog posts. Some might have seen his work enabling Vulkan<->OpenGL interop in Mesa on Windows.

But did you know that Sid’s latest project is much more groundbreaking than just bumping Zink’s supported extension count far beyond the reach of every other driver?

What if I told you that this image

gears.png

is Zink running wglgears on an NVIDIA 2070 GPU on Windows at full speed? No software-copy scanout. Just Kopper.

Full Support: Windows Ultimate Home Professional Edition

Over the past couple days, Sid’s done the esoteric work of hammering out WSI support for Zink on Windows, making us the first hardware-accelerated, GL 4.6-capable Mesa driver to run natively on Windows.

Don’t believe me?

Recognize a little Aztec Ruins action from GFXBench?

aztec.png

The results are about what we’d expect of an app I’ve literally never run myself:

Zink

zink-aztec.png

NVIDIA

nv-aztec.png

Not too bad at all!

In Summary

I think we can safely say that Sid has managed to fix the original bug. Thanks, Sid!

But why is an X-Plane developer working on Zink?

The man himself has this to say on the topic:

X-Plane has traditionally been using OpenGL directly for all of its rendering needs. As a result, for years our plugin SDK has exposed the game’s OpenGL context directly to third party plugins, which have used it to render custom avionics screens and GUI elements. When we finally did the switch to Vulkan and Metal in 2020, one of the big issues we faced was how to deal with plugins. Our solution so far has been to rely on native Vulkan/OpenGL driver interop via extensions, which has mostly worked and allowed us to ship with modern backends.

Unfortunately this puts us at the mercy of the driver to provide good interop. Sadly, on some platforms this just isn’t available at all. On others, the drivers are broken, leading to artifacts when mixing Vulkan and GL rendering. To date, our solution has been to just shrug it off and hope for better drivers. X-Plane plugins make use of compatibility profile GL features, as well as core profile features, depending on the author’s skill, so libraries like ANGLE were not an option for us.

This is where Zink comes in for us: Being a real GL driver, it has support for all of the features that we need. Being open source also means that any issues that we do discover are much easier to fix ourselves. We’ve made some progress including Zink into the next major version of X-Plane, X-Plane 12, and it’s looking very promising so far. Our hope is to ship X-Plane 12 with Zink as the GL backend for plugins and leave driver interop issues in the past.

The roots of this interest can also be seen in his blog post from last year where he touches on the future of GL plugin support.

Awesome!

Big Triangle’s definitely paying attention now.

And if any of my readers think this work is cool, go buy yourself a copy of X-Plane to say thanks for contributing back to open source.

April 15, 2022

This article is part of a series on how to setup a bare-metal CI system for Linux driver development. Here are the different articles so far:

  • Part 1: The high-level view of the whole CI system, and how to fully control test machines remotely (power on, OS to boot, keyboard/screen emulation using a serial console);
  • Part 2: A comparison of the different ways to generate the rootfs of your test environment, and introducing the boot2container project;
  • Part 3: Analysis of the requirements for the CI gateway, catching regressions before deployment, easy roll-back, and netbooting the CI gateway securely over the internet.

In this article, we will finally focus on generating the rootfs/container image of the CI Gateway in a way that enables live patching the system without always needing to reboot.

This work is sponsored by the Valve Corporation.

Introduction: The impact of updates

System updates are a necessary evil for any internet-facing server, unless you want your system to become part of a botnet. This is especially true for CI systems since they let people on the internet run code on machines, often leading to unfair use such as cryptomining (this one is hard to avoid though)!

The problem with system updates is not the 2 or 3 minutes of downtime that a reboot takes, it is that we cannot reboot while any CI job is running. Scheduling a reboot thus first requires stopping the acceptance of new jobs, waiting for the current ones to finish, and finally rebooting. This solution may be acceptable if your jobs take ~30 minutes, but what if they last 6h? A reboot suddenly gets close to a typical 8h work day, and we definitely want someone looking over the reboot sequence so they can revert to a previous boot configuration if the new one fails.

This problem may be addressed in a cloud environment by live-migrating services/containers/VMs from a non-updated host to an updated one. This is unfortunately a lot more complex to pull off for a bare-metal CI without having a second CI gateway and designing synchronization systems/hardware to arbitrate access to the test machines' power/serial consoles/boot configuration.

So, while we cannot always avoid the need to drain the CI jobs before rebooting, what we can do is reduce the cases in which we need to perform this action. Unfortunately, containers have been designed with atomic updates in mind (this is why we want to use them), but that means that trivial operations such as adding an SSH key or a WireGuard peer, or updating a firewall rule, will require a reboot. A hacky solution may be for the admins to update the infra container, then log into the different CI gateways and manually reproduce the changes they have made in the new container. These changes would be lost at the next reboot, but this is not a problem since the CI gateway would use the latest container when rebooting, which already contains the updates. While possible, this solution is error-prone and not testable ahead of time, which goes against the requirements for the gateway we laid out in Part 3.

Live patching containers

An improvement over live-updating containers by hand would be to use tools such as Ansible, Salt, or even Puppet to manage and deploy non-critical services and configuration. This would enable live-updating the currently-running container, but would need to be run after every reboot. An Ansible playbook may be run locally, so it is not inconceivable for a service run at boot to download the latest playbook and run it. This solution however forces developers/admins to decide which services need to have their configuration baked into the container and which services should be deployed using a tool like Ansible... unless...

We could use a tool like Ansible to describe all the packages and services to install, along with their configuration. Creating a container would then be achieved by running the Ansible playbook on a base container image. Assuming that the playbook is truly idempotent (running the playbook multiple times will lead to the same final state), this would mean that there would be no differences between the live-patched container and the new container we created. In other words, we simply morph the currently-running container to the wanted configuration by running the same Ansible playbook we used to create the container, but against the live CI gateway! This will not always remove the need to reboot the CI gateways from time to time (updating the kernel, or services which don't support live updates without affecting CI jobs), but all the smaller changes can get applied in situ!

The base container image has to contain the basic dependencies of a tool like Ansible, but if it were also made to contain all the OS packages, the final image would effectively be split into three container layers: the base OS container, the packages needed, and the configuration. Updating the configuration would thus result in only a few megabytes of update to download at the next reboot rather than the full OS image, reducing the reboot time.

Limits to live-patching containers

Ansible is perfectly suited to morph a container into its newest version, provided that all the resources used remain static between when the new container was created and when the currently-running container gets live-patched. This is because of Ansible's core principle of idempotency of operations: rather than running commands blindly like in a shell script, it first checks the current state and then, if needed, updates it to match the desired target. This makes it safe to run the playbook multiple times, and also allows us to only restart services if their configuration or one of their dependencies changed.

When version pinning of packages is possible (Python, Ruby, Rust, Golang, ...), Ansible can guarantee the idempotency that makes live-patching safe. Unfortunately, the package managers of Linux distributions are usually not idempotent: they were designed to ship updates, not to pin software versions! In practice, this means that there are no guarantees that the package installed during live-patching will be the same as the one installed in the new base container, thus exposing oneself to potential differences in behaviour between the two deployment methods... The only way out of this issue is to create your own package repository and make sure its content will not change between the creation of the new container and the live-patching of all the CI gateways. Failing that, all I can advise you to do is pick a stable distribution which will try its best to limit functional changes between updates within the same distribution version (Alpine Linux, CentOS, Debian, ...).

In the end, Ansible won't always be able to make live-updating your container strictly equivalent to rebooting into its latest version, but as long as you are aware of its limitations (or work around them), it will make updating your CI gateways much less of a hassle than it would be otherwise! You will need to find the right balance between live-updatability and ease of maintenance of your gateway's code base.

Putting it all together: The example of valve-infra-container

At this point, you may be wondering how all of this looks in practice! Here is the example of the CI gateways we have been developing for Valve:

  • Ansible playbook: You will find here the entire configuration of our CI gateways. NOTE: we are still working on live-patching!;
  • Valve-infra-base-container: The buildah script used to generate the base container;
  • Valve-infra-container: The buildah script used to generate the final container by running the Ansible playbook.

And if you are wondering how we can go from these scripts to working containers, here is how:

$ podman run --rm -d -p 8088:5000 --name registry docker.io/library/registry:2
$ env \
    IMAGE_NAME=localhost:8088/valve-infra-base-container \
    BASE_IMAGE=archlinux \
    buildah unshare -- .gitlab-ci/valve-infra-base-container-build.sh
$ env \
    IMAGE_NAME=localhost:8088/valve-infra-container \
    BASE_IMAGE=valve-infra-base-container \
    ANSIBLE_EXTRA_ARGS='--extra-vars service_mgr_override=inside_container -e development=true' \
    buildah unshare -- .gitlab-ci/valve-infra-container-build.sh

And if you are willing to use our Makefile, it gets even easier:

$ make valve-infra-base-container BASE_IMAGE=archlinux IMAGE_NAME=localhost:8088/valve-infra-base-container
$ make valve-infra-container BASE_IMAGE=localhost:8088/valve-infra-base-container IMAGE_NAME=localhost:8088/valve-infra-container

Not too bad, right?

PS: These scripts are constantly being updated, so make sure to check out their current version!

Conclusion

In this post, we highlighted the difficulty of keeping the CI Gateways up to date when CI jobs can take multiple hours to complete, preventing new jobs from starting until the current queue is emptied and the gateway has rebooted.

We have then shown that despite looking like competing solutions to deploy services in production, containers and tools like Ansible can actually work well together to reduce the need for reboots by morphing the currently-running container into the updated one. There are however some limits to this solution which are important to keep in mind when designing the system.

In the next post, we will be designing the executor service which is responsible for time-sharing the test machines between different CI/manual jobs. We will thus be talking about deploying test environments, BOOTP, and serial consoles!

That's all for now, thanks for making it to the end!

April 12, 2022

Another Quarter Down

As everyone who’s anyone knows, the next Mesa release branchpoint is coming up tomorrow. Like usual, here’s the rundown on what to expect from zink in this release:

  • zero performance improvements (that I’m aware of)
  • Kopper has landed: Vulkan WSI is now used and NVIDIA drivers can finally run at full speed
  • lots of bugs fixed
  • seriously so many bugs
  • I’m not even joking
  • literally this whole quarter was just fixing bugs

So if you find a zink problem in the 22.1 release of Mesa, it’s definitely because of Kopper and not actually anything zink-related.

Piping

But also this is sort-of-almost-maybe a lavapipe blog, and that driver has had a much more exciting quarter. Here’s a rundown.

New Extensions:

  • VK_EXT_debug_utils
  • VK_EXT_depth_clip_control
  • VK_EXT_graphics_pipeline_library
  • VK_EXT_image_2d_view_of_3d
  • VK_EXT_image_robustness
  • VK_EXT_inline_uniform_block
  • VK_EXT_pipeline_creation_cache_control
  • VK_EXT_pipeline_creation_feedback
  • VK_EXT_primitives_generated_query
  • VK_EXT_shader_demote_to_helper_invocation
  • VK_EXT_subgroup_size_control
  • VK_EXT_texel_buffer_alignment
  • VK_KHR_format_feature_flags2
  • VK_KHR_memory_model
  • VK_KHR_pipeline_library
  • VK_KHR_shader_integer_dot_product
  • VK_KHR_shader_terminate_invocation
  • VK_KHR_swapchain_mutable_format
  • VK_KHR_synchronization2
  • VK_KHR_zero_initialize_workgroup_memory

Vulkan 1.3 is now supported. We’ve landed a number of big optimizations as well, leading to massively improved CI performance.

Lavapipe: the cutting-edge software implementation of Vulkan.

…as long as you don’t need descriptor indexing.

April 07, 2022

Since Kopper got merged today upstream I wanted to write a little about it as I think the value it brings can be unclear for the uninitiated.

Adam Jackson on our graphics team has been working for the last months, together with other community members like Mike Blumenkrantz, on implementing Kopper. For those unaware, Zink is an OpenGL implementation running on top of Vulkan, and Kopper is the layer that allows you to translate OpenGL and GLX window handling to Vulkan WSI handling. This means that you can get full OpenGL support even if your GPU only has a Vulkan driver available, and it also means you can, for instance, run GNOME on top of this stack thanks to the addition of Kopper to Zink.

During the lifecycle of the soon-to-be-released Fedora Workstation 36 we expect to allow you to turn on doing OpenGL using Kopper and Zink as an experimental feature, once we update Fedora 36 to Mesa 22.1.

So you might ask: why would I care about this as an end user? Well, initially you probably will not care much, but over time it is likely that GPU makers will eventually stop developing native OpenGL drivers and just focus on their Vulkan drivers. At that point Zink and Kopper provide you with a backwards-compatibility solution for your OpenGL applications. And for Linux distributions it will also, at some point, significantly reduce the amount of code we need to ship and maintain, as we can just rely on Zink and Kopper everywhere, which of course reduces the workload for maintainers.

This is not going to be an overnight transition though; Zink and Kopper will need some time to stabilize and further improve performance. At the moment performance is generally a bit slower than the native drivers, although we have seen some examples of games which actually got better performance with specific driver combinations, and over time we expect to see the negative performance delta shrink. The delta is unlikely to ever fully go away due to the cost of translating between the two APIs, but on the other hand we are going to be in a situation in a few years where all current/new applications use Vulkan natively (or through Proton), and thus the stuff that relies on OpenGL will be older software, so combined with faster GPUs you should still get more than good enough performance. And at that point Zink will be a lifesaver for your old OpenGL-based applications and games.

April 06, 2022

Just In Time

By the time you read this, Kopper will have landed. This means a number of things have changed:

  • Zink now uses Vulkan WSI and has actual swapchains
  • Combinations of clunky Mesa environment variables are no longer needed; MESA_LOADER_DRIVER_OVERRIDE=zink will work for all drivers
  • Some things that didn’t used to work now work
  • Some things that used to work now don’t

In particular, lots of cases of garbled/flickering rendering (I’m looking at you, Supertuxkart on ANV) will now be perfectly smooth and without issue.

Also there’s no swapinterval control yet, so X11 clients will have no choice but to churn out the maximum amount of FPS possible at all times.

You (probably?) aren’t going to be able to run a compositor on zink just yet, but it’s on the 22.1 TODO list.

Big thanks to Adam Jackson for carrying this project on his back.

April 05, 2022

Apparently, in some parts of this world, the /usr/-merge transition is still ongoing. Let's take the opportunity to have a look at one specific way to benefit from the /usr/-merge (and associated work) IRL.

I develop system-level software as you might know. Oftentimes I want to run my development code on my PC but be reasonably sure it cannot destroy or otherwise negatively affect my host system. Now I could set up a container tree for that, and boot into that. But often I am too lazy for that: I don't want to bother with a slow package manager setting up a new OS tree for me. So here's what I often do instead — and this only works because of the /usr/-merge.

I run a command like the following (without any preparatory work):

systemd-nspawn \
        --directory=/ \
        --volatile=yes \
        -U \
        --set-credential=passwd.hashed-password.root:$(mkpasswd mysecret) \
        --set-credential=firstboot.locale:C.UTF-8 \
        --bind-user=lennart \
        -b

And then I very quickly get a login prompt on a container that runs the exact same software as my host — but is also isolated from the host. I do not need to prepare any separate OS tree or anything else. It just works. And my host user lennart is just there, ready for me to log into.

So here's what these systemd-nspawn options specifically do:

  • --directory=/ tells systemd-nspawn to run off the host OS' file hierarchy. That smells like danger of course, running two OS instances off the same directory hierarchy. But don't be scared, because:

  • --volatile=yes enables volatile mode. Specifically, this means that what we configured with --directory=/ as the root file system is slightly rearranged. Instead of mounting that tree as it is, we'll mount a tmpfs instance as the actual root file system, and then mount the /usr/ subdirectory of the specified hierarchy into the /usr/ subdirectory of the container file hierarchy in read-only fashion – and only that directory. So now we have a container directory tree that is basically empty, but imports all host OS binaries and libraries into its /usr/ tree. All software installed on the host is also available in the container with no manual work. This mechanism only works because on /usr/-merged OSes vendor resources are monopolized at a single place: /usr/. It's sufficient to share that one directory with the container to get a second instance of the host OS running. Note that this means /etc/ and /var/ will be entirely empty initially when this second system boots up. Thankfully, forward-looking distributions (such as Fedora) have adopted systemd-tmpfiles and systemd-sysusers quite pervasively, so that system users and files/directories required for operation are created automatically should they be missing. Thus, even though at boot the mentioned directories are initially empty, once the system is booted up they are sufficiently populated for things to just work.

  • -U means we'll enable user namespacing, in fully automatic mode. This does three things: it picks a free host UID range dynamically for the container, then sets up user namespacing for the container processes, mapping that host UID range to UIDs 0…65534 in the container. It then sets up a similar UID mapped mount on the /usr/ tree of the container. Net effect: file ownerships as set on the host OS tree appear as if they belong to the very same users inside of the container environment, except that we use user namespacing for everything, and thus the users are actually neatly isolated from the host.

  • --set-credential=passwd.hashed-password.root:$(mkpasswd mysecret) passes a credential to the container. Credentials are bits of data that you can pass to systemd services and whole systems. They are actually awesome concepts (e.g. they support TPM2 authentication/encryption that just works!) but I am not going to go into details around that, given it's off-topic in this specific scenario. Here we just take advantage of the fact that systemd-sysusers looks for a credential called passwd.hashed-password.root to initialize the root password of the system from. We set it to mysecret. This means once the system is booted up we can log in as root with the supplied password. Yay. (Remember, /etc/ is initially empty on this container, and thus also carries no /etc/passwd or /etc/shadow, and thus has no root user record, and thus no root password.)

    mkpasswd is a tool that converts a plain text password into a UNIX hashed password, which is what this specific credential expects.

  • Similarly, --set-credential=firstboot.locale:C.UTF-8 tells the systemd-firstboot service in the container to initialize /etc/locale.conf with this locale.

  • --bind-user=lennart binds the host user lennart into the container, also as user lennart. This does two things: it mounts the host user's home directory into the container. It also copies a minimal user record of the specified user into the container that nss-systemd then picks up and includes in the regular user database. This means that once the container is booted up, I can log in as lennart with my regular password, and once I am logged in I will see my regular host home directory, and can make changes to it. Yippieh! (This does a couple more things, such as UID mapping, but let's not get lost in too many details.)

So, if I run this, I will very quickly get a login prompt, where I can log in as my regular user. I have full access to my host home directory, but otherwise everything is nicely isolated from the host, and changes outside of the home directory are either prohibited or are volatile, i.e. go to a tmpfs instance whose lifetime is bound to the container's lifetime: when I shut down the container I just started, any changes outside of my user's home directory are lost.

Note that while here I use --volatile=yes in combination with --directory=/, you can actually use it on any OS hierarchy, i.e. just about any directory that contains OS binaries.

Similarly, the --bind-user= stuff works with any OS hierarchy too (but do note that only systemd 249 and newer will pick up the user records passed to the container that way, i.e. this requires at least v249 both on the host and in the container to work).

Or in short: the possibilities are endless!

Requirements

For this all to work, you need:

  1. A recent kernel (5.15 should suffice, as it brings UID mapped mounts for the most common file systems, so that -U and --bind-user= can work well.)

  2. A recent systemd (249 should suffice, which brings --bind-user=, and a -U switch backed by UID mapped mounts).

  3. A distribution that adopted the /usr/-merge, systemd-tmpfiles and systemd-sysusers so that the directory hierarchy and user databases are automatically populated when empty at boot. (Fedora 35 should suffice.)

Limitations

While a lot of today's software actually works well out of the box on systems that come up with an unpopulated /etc/ and /var/, and either falls back to reasonable built-in defaults or deploys systemd-tmpfiles to create what is missing, things aren't perfect: some software typically installed on desktop OSes will fail to start when invoked in such a container, and be visible as ugly failed services, but it won't stop me from logging in and using the system for what I want to use it for. It would be excellent to get that fixed, though. This can either be fixed in the relevant software upstream (i.e. if opening your configuration file fails with ENOENT, then just default to reasonable defaults), or in the distribution packaging (i.e. add a tmpfiles.d/ file that copies or symlinks in skeleton configuration from /usr/share/factory/etc/ via the C or L line types).

And then there's certain software dealing with hardware management and similar that simply cannot reasonably work in a container (as device APIs on Linux are generally not virtualized for containers). It would be excellent if software like that would be updated to carry ConditionVirtualization=!container or ConditionPathIsReadWrite=/sys conditionalization in their unit files, so that it is automatically – cleanly – skipped when executed in such a container environment.

And that's all for now.

March 30, 2022

Do you want to start a career in open-source? Do you want to learn amazing skills while getting paid? Keep reading!

Igalia Coding Experience

Igalia logo

Igalia has a grant program that gives students with a background in Computer Science, Information Technology and Free Software their first exposure to the professional world, working hand in hand with Igalia programmers and learning with them. It is called Igalia Coding Experience.

While this experience is open for everyone, Igalia expressly invites women (both cis and trans), trans men, and genderqueer people to apply. The Coding Experience program gives preference to applications coming from underrepresented groups in our industry.

You can apply to any of the offered grants this year: Web Standards, WebKit, Chromium, Compilers and Graphics.

In the case of Graphics, the student will have the opportunity to deal with the Linux DRM subsystem. Specifically, the student will improve the test coverage of DRM drivers through IGT, a testing framework designed for this purpose. This includes learning how to contribute to the Linux kernel/DRM, interacting with the DRI-devel community, understanding DRM core functionality, and increasing the test coverage of the IGT tool.

The conditions of our Coding Experience program are:

  • Mentorship by one of Igalia’s outstanding open source contributors in the field.
  • It is remote-friendly. Students can participate in it wherever they live.
  • Hours: 450h
  • Compensation: 6,500€
  • Usual timetables:
    • 3 months full-time
    • 6 months part-time

The submission period goes from March 16th until April 30th. Students will be selected in May. We will work with the student to arrange a suitable starting date during 2022, from June onwards, and a finishing date to be agreed upon that suits their schedule.

Google Summer of Code (GSoC)

GSoC logo

The popular Google Summer of Code is another option for students. This year, the X.Org Foundation participates as an open source organization. We have some proposed ideas, but you can propose any project idea as well.

The timeline for proposals is from April 4th to April 19th. However, you should contact us beforehand in order to discuss your ideas with potential mentors.

GSoC gives a stipend to students too (from 1,500 to 6,000 USD depending on the size of the project and your location). The hours to complete the project vary from 175 to 350, depending on the size of the project as well.

Of course, this is a remote-friendly program, so any student in the world can participate in it.

Outreachy

Outreachy logo

Outreachy is another internship program for applicants from around the world who face under-representation, systemic bias or discrimination in the technology industry of their country. Outreachy supports diversity in free and open source software!

Outreachy internships are remote, paid ($7,000), and last three months. Outreachy internships run from May to August and December to March. Applications open in January and August.

The projects listed cover many areas of the open-source software stack: from kernel to distributions work. Please check current proposals to find anything that is interesting for you!

X.Org Endless Vacation of Code (EVoC)

X.Org logo

The X.Org Foundation voted in 2008 to initiate a program known as the X.Org Endless Vacation of Code (EVoC) program, in order to give more flexibility to students: an EVoC mentorship can be initiated at any time during the calendar year, and the Board can fund as many of these mentorships as it sees fit.

Like the other programs, EVoC is remote-friendly as well. The stipend goes as follows: an initial payment of 500 USD and two further payments of 2,250 USD upon completion of project milestones. EVoC does not set limits on hours, but there are some requirements and steps to complete before applying. Please read the X.Org Endless Vacation of Code website to learn more.

Conclusion

As you can see, there are many ways to enter the open source community. Although I focused on the programs related to the open source graphics stack, there are many more.

With all of these possibilities (and many more, including internships at companies), I hope that you can apply and that the experience will encourage you to start a career in the open-source community.

Happy hacking!

March 29, 2022

Ecosystem Victory

Today marks (at last) the release of some cool extensions I’ve had the pleasure of working on:

VK_EXT_graphics_pipeline_library

This extension revolutionizes how PSOs can be managed by the application, and it’s the first step towards solving the dreaded stuttering that zink suffers from when attempting to play any sort of game. There’s definitely going to be more posts from me on this in the future.

VK_EXT_primitives_generated_query

Currently, zink has to do some awfulness internally to replicate the awfulness of GL_PRIMITIVES_GENERATED. With this extension, at least some of that awfulness can be pushed down to the driver. And the spec, of course. You can’t scrub this filth out of your soul.

Support

The mesa community being awesome as it is, support for these extensions is already underway:

  • ANV has merge requests up already for both of them
  • RADV has a merge request up for preliminary support of VK_EXT_primitives_generated_query on certain hardware, and VK_EXT_graphics_pipeline_library support is nearing completion

But obviously Lavapipe, being the greatest of all drivers, will already have support landed by the time you read this post.

Let the bug reports flow!

March 22, 2022

OpenSSH has this very nice setting, VerifyHostKeyDNS, which when enabled, will pull SSH host keys from DNS, and you no longer need to either trust on first use, or copy host keys around out of band.

Naturally, trusting unsecured DNS is a bit scary, so this requires the record to be signed using DNSSEC. This has worked for a long time, but then broke, seemingly out of the blue. Running ssh -vvv gave output similar to

debug1: found 4 insecure fingerprints in DNS
debug3: verify_host_key_dns: checking SSHFP type 1 fptype 2
debug3: verify_host_key_dns: checking SSHFP type 4 fptype 2
debug1: verify_host_key_dns: matched SSHFP type 4 fptype 2
debug3: verify_host_key_dns: checking SSHFP type 4 fptype 1
debug1: verify_host_key_dns: matched SSHFP type 4 fptype 1
debug3: verify_host_key_dns: checking SSHFP type 1 fptype 1
debug1: matching host key fingerprint found in DNS

even though the zone was signed, the resolver was checking the signature and I even checked that the DNS response had the AD bit set.

The fix was to add options trust-ad to /etc/resolv.conf. Without this, glibc will discard the AD bit from any upstream DNS servers. Note that you should only add this if you actually have a trusted DNS resolver. I run unbound on localhost, so if somebody can do a man-in-the-middle attack on that traffic, I have other problems.

March 18, 2022

March Forward

Anyone who knows me knows that I hate cardio.

Full stop.

I’m not picking up and putting down all these heavy weights just so I can go for a jog afterwards and lose all my gains.

Similarly, I’m not trying to set a world record for speed-writing code. This stuff takes time, and it can’t be rushed.

Unless…

Lavapipe: The Best Driver

Today we’re a Lavapipe blog.

Lavapipe is, of course, the software implementation of Vulkan that ships with Mesa, originally braindumped directly into the repo by graphics god and part-time Twitter executive, Dave Airlie. For a long time, the Lavapipe meme has been “Try it on Lavapipe—it’s not conformant, but it still works pretty good haha” and we’ve all had a good chuckle at the idea that anything not officially signed and stamped by Khronos could ever draw a single triangle properly.

But, pending a single MR that fixes the four outstanding failures for Vulkan 1.2 conformance, as of last week, Lavapipe passes 100% of conformance tests. Thus, pending a merge and a Mesa bugfix release, Lavapipe will achieve official conformance.

And then we’ll have a new meme: Vulkan 1.3 When?

Meme Over

As some have noticed, I’ve been quietly (very, very, very, very, very, very, very, very, very, very, very, very quietly) implementing a number of features for Lavapipe over the past week.

But why?

Khronos-fanatics will immediately recognize that these features are all part of Vulkan 1.3.

Which Lavapipe also now supports, pending more merges which I expect to happen early next week.

This is what a sprint looks like.

March 17, 2022

We had a busy 2021 within the GNU/Linux graphics stack at Igalia.

Would you like to know what we did last year? Keep reading!

Open Source Raspberry Pi GPU (VideoCore) drivers

Raspberry Pi 4, model B

Last year both the OpenGL and the Vulkan drivers received a lot of love. For example, we implemented several optimizations, such as improvements in the v3dv pipeline cache. In this blog post, Alejandro Piñeiro presents how we improved v3dv pipeline cache times by reducing the two cache lookups done previously to only one, and shows some numbers on both a synthetic test (a modified CTS test) and some games.

We also made performance improvements to the v3d compilers for OpenGL and Vulkan. Iago Toral explains our work on optimizing the backend compiler with techniques such as improving memory lookup efficiency, reducing instruction counts, instruction packing and uniform handling, among others. There are some numbers that show framerate improvements from ~6% to ~62% on different games/demos.

Framerate improvement after optimization (in %). Taken from Iago’s blogpost

Of course, there was work related to feature implementation. This blog post from Iago lists some Vulkan extensions implemented in the v3dv driver in 2021… Although not all the implemented extensions are listed there, you can see the driver is quickly catching up in its Vulkan extension support.

My colleague Juan A. Suárez implemented performance counters in the v3d driver (an OpenGL driver) which required modifications in the kernel and in the Mesa driver. More info in his blog post.

There was more work in other areas done in 2021 too, like the improved support for RenderDoc and GFXReconstruct. And not to forget the kernel contributions to the DRM driver done by Melissa Wen, who not only worked on developing features for it, but also reviewed all the patches that came from the community.

However, the biggest milestone for the v3dv driver was becoming Vulkan 1.1 conformant in the last quarter of 2021. That was just one year after becoming Vulkan 1.0 conformant. As you can imagine, that implied a lot of work implementing features, fixing bugs and, of course, improving the driver in many different ways. Great job folks!

If you want to know more about all the work done on these drivers during 2021, there is an awesome talk from my colleague Alejandro Piñeiro at FOSDEM 2022: “v3dv: Status Update for Open Source Vulkan Driver for Raspberry Pi 4”, and another one from my colleague Iago Toral at XDC 2021: “Raspberry Pi Vulkan driver update”. Below you can find the video recordings of both talks.

FOSDEM 2022 talk: “v3dv: Status Update for Open Source Vulkan Driver for Raspberry Pi 4”

XDC 2021 talk: “Raspberry Pi Vulkan driver update”

Open Source Qualcomm Adreno GPU drivers

Photo of the Qualcomm® Robotics RB3 Platform embedded board that I use for Turnip development.

There were also several achievements done by igalians on both Freedreno and Turnip drivers. These are reverse engineered open-source drivers for Qualcomm Adreno GPUs: Freedreno for OpenGL and Turnip for Vulkan.

Starting in 2021, my colleague Danylo Piliaiev helped with implementing the missing bits in Freedreno for supporting OpenGL 3.3 on Adreno 6xx GPUs. His blog post explains his work, such as implementing ARB_blend_func_extended and ARB_shader_stencil_export, and fixing a variety of CTS test failures.

Related to this, my colleague Guilherme G. Piccoli worked on porting a recent kernel to one of the boards we use for Freedreno development: the Inforce 6640. He did an awesome job getting a 5.14 kernel booting on that embedded board. If you want to know more, please read the blog post he wrote explaining all the issues he found and how he fixed them!

Picture of the Inforce 6640 board that Guilherme used for his development. Image from his blog post.

However the biggest chunk of work was done in Turnip driver. We have implemented a long list of Vulkan extensions: VK_KHR_buffer_device_address, VK_KHR_depth_stencil_resolve, VK_EXT_image_view_min_lod, VK_KHR_spirv_1_4, VK_EXT_descriptor_indexing, VK_KHR_timeline_semaphore, VK_KHR_16bit_storage, VK_KHR_shader_float16, VK_KHR_uniform_buffer_standard_layout, VK_EXT_extended_dynamic_state, VK_KHR_pipeline_executable_properties, VK_VALVE_mutable_descriptor_type, VK_KHR_vulkan_memory_model and many others. Danylo Piliaiev and Hyunjun Ko are terrific developers!

But not all our work was related to feature development. For example, I implemented the Low-Resolution Z-buffer (LRZ) HW optimization, while Danylo fixed a long list of rendering bugs that happened in real-world applications (blog post 1, blog post 2), like D3D games run on Vulkan (thanks to DXVK and VKD3D), and instrumented the backend compiler to dump register values, among many other fixes and optimizations.

However, the biggest achievement was getting Vulkan 1.1 conformance for Turnip. Danylo wrote a blog post mentioning all the work we did to achieve that this year.

If you want to know more, don’t miss this FOSDEM 2022 talk given by my colleague Hyunjun Ko called “The status of turnip driver development. What happened in 2021 and will happen in 2022 for turnip.”. Video below.

FOSDEM 2022 talk: “The status of turnip driver development. What happened in 2021 and will happen in 2022 for turnip.”

Vulkan contributions

Our graphics work doesn’t only cover driver development: we also participate in the Khronos Group as Vulkan Conformance Test Suite developers and even as spec contributors.

My colleague Ricardo Garcia is a very productive developer. He worked on implementing tests for Vulkan Ray Tracing extensions (read his blog post about ray tracing for more info about this big Vulkan feature), implemented tests for a long list of Vulkan extensions like VK_KHR_present_id and VK_KHR_present_wait, VK_EXT_multi_draw (watch his talk at XDC 2021), VK_EXT_border_color_swizzle (watch his talk at FOSDEM 2022) among many others. In many of these extensions, he contributed to their respective specifications in a significant way (just search for his name in the Vulkan spec!).

XDC 2021 talk: “Quick Overview of VK_EXT_multi_draw”

FOSDEM 2022 talk: “Fun with border colors in Vulkan. An overview of the story behind VK_EXT_border_color_swizzle”

Similarly, I participated modestly in this effort by developing tests for some extensions like VK_EXT_image_view_min_lod (blog post). Of course, both Ricardo and I implemented many new CTS tests adding coverage to existing extensions, fixed lots of bugs in existing tests, and reported dozens of driver issues to the respective Mesa developers.

Not only that, both Ricardo and I appeared as Vulkan 1.3 spec contributors.

Vulkan 1.3

Another interesting piece of work we started in 2021 is Vulkan Video support in GStreamer. My colleague Víctor Jaquez presented the Vulkan Video extension at XDC 2021 and soon after he started working on Vulkan Video’s H.264 decoder support. You can find more information in his blog post, or by watching his talk below:

FOSDEM 2022 talk: “Video decoding in Vulkan: VK_KHR_video_queue/decode APIs”

Before I leave this section, don’t forget to take a look at Ricardo’s blog post on the debugPrintfEXT feature. If you are a graphics developer, you will find this feature very interesting for debugging issues in your applications!

Along those lines, Danylo presented at XDC 2021 a talk about dissecting and fixing Vulkan rendering issues in drivers with RenderDoc. Very useful for driver developers! Watch the talk below:

XDC 2021 talk: “Dissecting Vulkan rendering issues in drivers with RenderDoc”

To finalize this blog post, remember that you now have vkrunner (the Vulkan shader tester created by Igalia) available for RPM-based GNU/Linux distributions. In case you are working with embedded systems, maybe my blog post about cross-compiling with icecream will help to speed up your builds.

This is just a summary of the highlights of what we did last year. I’m sorry if I missed some of my colleagues’ work.

March 16, 2022

At Last

Those of you in-the-know are well aware that Zink has always had a crippling addiction to seamless cubemaps. Specifically, Vulkan doesn’t support non-seamless cubemaps since nobody wants those anymore, but this is the default mode of sampling for OpenGL.

Thus, it is impossible for Zink to pass GL 4.6 conformance until this issue is resolved.

But what does this even mean?

Cubes: They Have Faces Just Like You And Me

As veterans of intro to geometry courses all know*, a cube is a 3D shape that has six identically-sized sides called “faces”. In computer graphics, each of these faces has its own content that can be read and written discretely.

When interpolating data from a cube-type texture, there are two methods:

  • Using a seamless interpretation of a cube yields cases where pixels may be interpolated across the boundaries of faces
  • Using a non-seamless interpretation of a cube yields cases where pixels may be clamped/wrapped at the boundary of a face

This effectively results in Zink interpolating across the boundaries of cube faces when it should instead be clamping/wrapping pixel data to a single face.

But who cares about all that math nonsense when the result is that Zink is still failing CTS cases?

*Disclosure: I have been advised by my lawyer to state on the record that I have never taken an intro to geometry course and have instead copied this entire blog post off StackOverflow.

How To Make Cubes Worse 101

In order to replicate this basic OpenGL behavior, a substantial amount of code is required—most of it terrible.

The first step is to determine when a cube should be sampled as non-seamless. OpenGL helpfully has only one extension (plus this other extension) which controls seamless access of cubemaps, so as long as that one state (plus the other state) isn’t enabled, a cube shouldn’t be interpreted seamlessly.

With this done, what happens at coordinates that lie at the edge of a face? The OpenGL wrap enum covers this. For the purposes of this blog post, only two wrap modes exist:

  • edge clamping - clamp the coordinate to the edge of the face (coord = extent or coord = 0)
  • repeat - pretend the face repeats infinitely (coord %= extent)
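
In plain C terms, for an integer texel coordinate on a single face, those two modes boil down to something like this (a hypothetical helper for illustration, not the actual Mesa/zink lowering code):

   #include <stdbool.h>

   /* Apply a wrap mode to a texel coordinate on one cube face, where
    * 'extent' is the size of the face in texels. */
   static int
   wrap_face_coord(int coord, int extent, bool clamp_to_edge)
   {
      if (clamp_to_edge) {
         /* edge clamping: stay on this face */
         if (coord < 0)
            return 0;
         if (coord >= extent)
            return extent - 1;
         return coord;
      }
      /* repeat: pretend the face tiles infinitely */
      coord %= extent;
      return coord < 0 ? coord + extent : coord;
   }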

So now non-seamless cubes are detected, and the behavior for handling their non-seamlessness is known, but how can this actually be done?

Excruciating

In short, this requires shader rewrites to handle coordinate clamping, then wrapping. Since it’s not going to be desirable to have a different shader variant for every time the wrap mode changes, this means loading the parameters from a UBO. Since it’s further not going to be desirable to have shader variants for each per-texture seamless/non-seamless cube combination, this means also making the rewrite handle the no-op case of continuing to use the original, seamless coordinates after doing all the calculations for the non-seamless case.

Worst of all, this has to be plumbed through the Rube Goldberg machine that is Mesa.

It was terrible, and it continues to be terrible.

If I were another blogger, I would probably take this opportunity to flex my Calculus credentials by putting all kinds of math here, but nobody really wants to read that, and the hell if I know how to make markdown do that whiteboard thing so I can doodle in fancy formulas or whatever from the spec.

Instead, you can read the merge request if you’re that deeply invested in cube mechanics.

March 09, 2022

Subgroup operations or wave intrinsics, such as reducing a value across the threads of a shader subgroup or wave, were introduced in GPU programming languages a while ago. They communicate with other threads of the same wave, for example to exchange the input values of a reduction, but not necessarily with all of them if there is divergent control flow.
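
For example, in OpenCL C (which exposes these operations as sub_group_* built-ins), the reduction below only communicates with the threads of the subgroup that also took the branch; this is my own minimal illustration, not tied to any particular driver:

   /* Requires the cl_khr_subgroups extension. */
   #pragma OPENCL EXTENSION cl_khr_subgroups : enable

   __kernel void positive_sums(__global const int *data, __global int *out)
   {
      size_t gid = get_global_id(0);
      int v = data[gid];
      if (v > 0) {
         /* Convergent operation: the set of threads exchanging values here
          * is the set of threads of this subgroup that took the branch. */
         out[gid] = sub_group_reduce_add(v);
      }
   }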

In LLVM, we call such operations convergent. Unfortunately, LLVM does not define how the set of communicating threads in convergent operations -- the set of converged threads -- is affected by control flow.

If you're used to thinking in terms of structured control flow, this may seem trivial. Obviously, there is a tree of control flow constructs: loops, if-statements, and perhaps a few others depending on the language. Two threads are converged in the body of a child construct if and only if both execute that body and they are converged in the parent. Throw in some simple and intuitive rules about loop counters and early exits (nested return, break and continue, that sort of thing) and you're done.

In an unstructured control flow graph, the answer is not obvious at all. I gave a presentation at the 2020 LLVM Developers' Meeting that explains some of the challenges as well as a solution proposal that involves adding convergence control tokens to the IR.

Very briefly, convergent operations in the proposal use a token variable that is defined by a convergence control intrinsic. Two dynamic instances of the same static convergent operation from two different threads are converged if and only if the dynamic instances of the control intrinsic producing the used token values were converged.

(The published draft of the proposal talks of multiple threads executing the same dynamic instance. I have since been convinced that it's easier to teach this matter if we instead always give every thread its own dynamic instances and talk about a convergence equivalence relation between dynamic instances. This doesn't change the resulting semantics.)

The draft has three such control intrinsics: anchor, entry, and (loop) heart. Of particular interest here is the heart. For the most common and intuitive use cases, a heart intrinsic is placed in the header of natural loops. The token it defines is used by convergent operations in the loop. The heart intrinsic itself also uses a token that is defined outside the loop: either by another heart in the case of nested loops, or by an anchor or entry. The heart combines two intuitive behaviors:

  • It uses a token in much the same way that convergent operations do: two threads are converged for their first execution of the heart if and only if they were converged at the intrinsic that defined the used token.
  • Two threads are converged at subsequent executions of the heart if and only if they were converged for the first execution and they are currently at the same loop iteration, where iterations are counted by a virtual loop counter that is incremented at the heart.
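
As a small source-level illustration of the heart behavior (my own sketch, not taken from the proposal), consider an OpenCL C loop with a subgroup reduction in its body; with a heart in the loop header, the reduction communicates on each iteration with the threads that entered the loop converged and are on the same iteration:

   #pragma OPENCL EXTENSION cl_khr_subgroups : enable

   __kernel void per_iteration_reduce(__global const int *data,
                                      __global int *out, int n)
   {
      size_t gid = get_global_id(0);
      int acc = 0;
      /* Conceptually, a heart sits at the top of this natural loop, using a
       * token defined outside it; the reduction below uses the heart's token. */
      for (int i = 0; i < n; i++)
         acc += sub_group_reduce_add(data[gid * (size_t)n + i]);
      out[gid] = acc;
   }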

Viewed from this angle, how about we define a weaker version of these rules that lies somewhere between an anchor and a loop heart? We could call it a "light heart", though I will stick with "iterating anchor". The iterating anchor defines a token but has no arguments. Like for the anchor, the set of converged threads is implementation-defined -- when the iterating anchor is first encountered. When threads encounter the iterating anchor again without leaving the dominance region of its containing basic block, they are converged if and only if they were converged during their previous encounter of the iterating anchor.

The notion of an iterating anchor came up when discussing the convergence behaviors that can be guaranteed for natural loops. Is it possible to guarantee that natural loops always behave in the natural way -- according to their loop counter -- when it comes to convergence?

Naively, this should be possible: just put hearts into loop headers! Unfortunately, that's not so straightforward when multiple natural loops are contained in an irreducible loop: 

Hearts in A and C must refer to a token defined outside the loops; that is, a token defined in E. The resulting program is ill-formed because it has a closed path that goes through two hearts that use the same token, but the path does not go through the definition of that token. This well-formedness rule exists because the rules about heart semantics are unsatisfiable if the rule is broken.

The underlying intuitive issue is that if the branch at E is divergent in a typical implementation, the wave (or subgroup) must choose whether A or C is executed first. Neither choice works. The heart in A indicates that (among the threads that are converged in E) all threads that visit A (whether immediately or via C) must be converged during their first visit of A. But if the wave executes A first, then threads which branch directly from E to A cannot be converged with those that first branch to C. The opposite conflict exists if the wave executes C first.

If we replace the hearts in A and C by iterating anchors, this problem goes away because the convergence during the initial visit of each block is implementation-defined. In practice, it should fall out of which of the blocks the implementation decides to execute first.

So it seems that iterating anchors can fill a gap in the expressiveness of the convergence control design. But are they really a sound addition? There are two main questions:

  • Satisfiability: Can the constraints imposed by iterating anchors be satisfied, or can they cause the sort of logical contradiction discussed for the example above? And if so, is there a simple static rule that prevents such contradictions?
  • Spooky action at a distance: Are there generic code transforms which change semantics while changing a part of the code that is distant from the iterating anchor?

The second question is important because we want to add convergence control to LLVM without having to audit and change the existing generic transforms. We certainly don't want to hurt compile-time performance by increasing the amount of code that generic transforms have to examine for making their decisions.

Satisfiability 

Consider the following simple CFG with an iterating anchor in A and a heart in B that refers back to a token defined in E:

Now consider two threads that are initially converged with execution traces:

  1. E - A - A - B - X
  2. E - A - B - A - X

The heart rule implies that the threads must be converged in B. The iterating anchor rule implies that if the threads are converged in their first dynamic instances of A, then they must also be converged in their second dynamic instances of A, which leads to a temporal paradox.

One could try to resolve the paradox by saying that the threads cannot be converged in A at all, but this would mean that the threads must diverge before a divergent branch occurs. That seems unreasonable, since typical implementations want to avoid divergence as long as control flow is uniform.

The example arguably breaks the spirit of the rule about convergence regions from the draft proposal linked above, and so a minor change to the definition of convergence region may be used to exclude it.

What if the CFG instead looks as follows, which does not break any rules about convergence regions:

For the same execution traces, the heart rule again implies that the threads must be converged in B. The convergence of the first dynamic instances of A is technically implementation-defined, but we'd expect most implementations to be converged there.

The second dynamic instances of A cannot be converged due to the convergence of the dynamic instances of B. That's okay: the second dynamic instance of A in thread 2 is a re-entry into the dominance region of A, and so its convergence is unrelated to any convergence of earlier dynamic instances of A.

Spooky action at a distance 

Unfortunately, we still cannot allow this second example. A program transform may find that the conditional branch in E is constant and the edge from E to B is dead. Removing that edge brings us back to the previous example which is ill-formed. However, a transform which removes the dead edge would not normally inspect the blocks A and B or their dominance relation in detail. The program becomes ill-formed by spooky action at a distance.

The following static rule forbids both example CFGs: if there is a closed path through a heart and an iterating anchor, but not through the definition of the token that the heart uses, then the heart must dominate the iterating anchor.

There is at least one other issue of spooky action at a distance. If the iterating anchor is not the first (non-phi) instruction of its basic block, then it may be preceded by a function call in the same block. The callee may contain control flow that ends up being inlined. Back edges that previously pointed at the block containing the iterating anchor will then point to a different block, which changes the behavior quite drastically. Essentially, the iterating anchor is reduced to a plain anchor.

What can we do about that? It's tempting to decree that an iterating anchor must always be the first (non-phi) instruction of a basic block. Unfortunately, this is not easily done in LLVM in the face of general transforms that might sink instructions or merge basic blocks.

Preheaders to the rescue 

We could chew through some other ideas for making iterating anchors work, but that turns out to be unnecessary. The desired behavior of iterating anchors can be obtained by inserting preheader blocks. The initial example of two natural loops contained in an irreducible loop becomes: 

Place anchors in Ap and Cp and hearts in A and C that use the token defined by their respective dominating anchor. Convergence at the anchors is implementation-defined, but relative to this initial convergence at the anchor, convergence inside the natural loops headed by A and C behaves in the natural way, based on a virtual loop counter. The transform of inserting an anchor in the preheader is easily generalized.

To sum it up: We've concluded that defining an "iterating anchor" convergence control intrinsic is problematic, but luckily also unnecessary. The control intrinsics defined in the original proposal are sufficient. I hope that the discussion that led to those conclusions helps illustrate some aspects of the convergence control proposal for LLVM as well as the goals and principles that drove it.

March 07, 2022

They Say An Image Macro Conveys An Entire Day Of Shouting At The Computer

tess-bugs.png

March 04, 2022

A quick reminder: libei is the library for emulated input. It comes as a pair of C libraries, libei for the client side and libeis for the server side.

libei has been sitting mostly untouched since the last status update. There are two use-cases we need to solve for input emulation in Wayland - the ability to emulate input (think xdotool, or Synergy/Barrier/InputLeap client) and the ability to capture input (think Synergy/Barrier/InputLeap server). The latter effectively blocked development in libei [1]: until that use-case was sorted, there wasn't much point investing too much into libei - after all, it might get thrown out as a bad idea. And epiphanies were as elusive as toilet paper and RATs, so nothing much got done. This changed about a week or two ago when the required lightbulb finally arrived, pre-lit from the factory.

So, the solution to the input capturing use-case is going to be a so-called "passive context" for libei. In the traditional [2] "active context" approach for libei we have the EIS implementation in the compositor and a client using libei to connect to that. The compositor sets up a seat or more, then some devices within that seat that typically represent the available screens. libei then sends events through these devices, causing input to appear in the compositor, which moves the cursor around. In a typical and simple use-case you'd get a 1920x1080 absolute pointer device and a keyboard with a $layout keymap; libei then sends events to position the cursor and/or happily type away on-screen.

In the "passive context" <deja-vu> approach for libei we have the EIS implementation in the compositor and a client using libei to connect to that. The compositor sets up a seat or more, then some devices within that seat </deja-vu> that typically represent the physical devices connected to the host computer. libei then receives events from these devices, causing input to be generated in the libei client. In a typical and simple use-case you'd get a relative pointer device and a keyboard device with a $layout keymap, the compositor then sends events matching the relative input of the connected mouse or touchpad.

The two notable differences are thus: events flow from EIS to libei and the devices don't represent the screen but rather the physical [3] input devices.

This changes libei from a library for emulated input to an input event transport layer between two processes. On a much higher level than e.g. evdev or HID and with more contextual information (seats, devices are logically abstracted, etc.). And of course, the EIS implementation is always in control of the events, regardless of which direction they flow. A compositor can implement an event filter or designate a key to break the connection to the libei client. In pseudocode, the compositor's input event processing function will look like this:


function handle_input_events():
    real_events = libinput.get_events()
    for e in real_events:
        if input_capture_active:
            send_event_to_passive_libei_client(e)
        else:
            process_event(e)

    emulated_events = eis.get_events_from_active_clients()
    for e in emulated_events:
        process_event(e)
Not shown here are the various appropriate filters and conversions in between (e.g. all relative events from libinput devices would likely be sent through the single relative device exposed on the EIS context). Again, the compositor is in control so it would be trivial to implement e.g. capturing of the touchpad only but not the mouse.

In the current design, a libei context can only be active or passive, not both. The EIS context is both; it's up to the implementation to disconnect active or passive clients if it doesn't support them.

Notably, the above only caters for the transport of input events; it doesn't actually make any decision on when to capture events. This is handled by the CaptureInput XDG Desktop Portal [4]. The idea here is that an application like Synergy/Barrier/InputLeap server connects to the CaptureInput portal and requests a CaptureInput session. In that session it can define pointer barriers (left edge, right edge, etc.) and, in the future, maybe other triggers. In return it gets a libei socket that it can initialize a libei context from. When the compositor decides that the pointer barrier has been crossed, it re-routes the input events through the EIS context so they pop out in the application. Synergy/Barrier/InputLeap then converts that to the global position, passes it to the right remote Synergy/Barrier/InputLeap client and replays it there through an active libei context where it feeds into the local compositor.

Because the management of when to capture input is handled by the portal and the respective backends, it can be natively integrated into the UI. Because the actual input events are a direct flow between compositor and application, the latency should be minimal. Because it's a high-level event library, you don't need to care about hardware-specific details (unlike, say, the inputfd proposal from 2017). Because the negotiation of when to capture input is through the portal, the application itself can run inside a sandbox. And because libei only handles the transport layer, compositors that don't want to support sandboxes can set up their own negotiation protocol.

So overall, right now this seems like a workable solution.

[1] "blocked" is probably overstating it a bit but no-one else tried to push it forward, so..
[2] "traditional" is probably overstating it for a project that's barely out of alpha development
[3] "physical" is probably overstating it since it's likely to be a logical representation of the types of inputs, e.g. one relative device for all mice/touchpads/trackpoints
[4] "handled by" is probably overstating it since at the time of writing the portal is merely a draft of an XML file

Blogging: That Thing I Forgot About

Yeah, my b, I forgot this was a thing.

Fuck it though, I’m a professional, so I’m gonna pretend I didn’t just skip a month of blogs and get right back into it.

Gallivm

Gallivm is the nir/tgsi-to-llvm translation layer in Gallium that LLVMpipe (and thus Lavapipe) uses to generate the JIT functions which make triangles. It’s very old code in that it predates me knowing how triangles work, but that doesn’t mean it doesn’t have bugs.

And Gallivm bugs are the worst bugs.

For a long time, I’ve had SIGILL crashes on exactly one machine locally for the CTS glob dEQP-GLES31.functional.program_uniform.by*sampler2D_samplerCube*. These tests pass on everyone else’s machines including CI.

Like I said, Gallivm bugs are the worst bugs.

Debugging

How does one debug JIT code? GDB can’t be used, valgrind doesn’t work, and, despite what LLVM developers would tell you, building an assert-enabled LLVM doesn’t help at all in most cases here since that will only catch invalid behavior, not questionably valid behavior that very obviously produces invalid results.

So we enter the world of lp_build_print debugging. Much like standard printf debugging, the strategy here is to just lp_build_print_value or lp_build_printf("I hate this part of the shader too") our way to figuring out where in the shader the crash occurs.
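For anyone who hasn't had the pleasure, the instrumentation looks roughly like this. This is a sketch, not a real Mesa diff: the wrapper function and the variable names are made up, but lp_build_printf and lp_build_print_value are the actual gallivm helpers.

#include "gallivm/lp_bld_printf.h"

/* Sketch only: instrument whatever lp_bld_* code is building the shader,
 * assuming a struct gallivm_state *gallivm and an interesting LLVMValueRef
 * (here "texel") are in scope at the point being instrumented. */
static void debug_instrument(struct gallivm_state *gallivm, LLVMValueRef texel)
{
   /* emitted into the JIT code, so it prints when the shader actually runs */
   lp_build_printf(gallivm, "I hate this part of the shader too\n");

   /* dumps the runtime value of the given LLVMValueRef with a label */
   lp_build_print_value(gallivm, "texel", texel);
}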

Here’s an example shader from dEQP-GLES31.functional.program_uniform.by_pointer.render.basic_struct.sampler2D_samplerCube_vertex that crashes:

#version 310 es
in highp vec4 a_position;
out mediump float v_vtxOut;

struct structType
{
	mediump sampler2D m0;
	mediump samplerCube m1;
};
uniform structType u_var;

mediump float compare_float    (mediump float a, mediump float b)  { return abs(a - b) < 0.05 ? 1.0 : 0.0; }
mediump float compare_vec4     (mediump vec4 a, mediump vec4 b)    { return compare_float(a.x, b.x)*compare_float(a.y, b.y)*compare_float(a.z, b.z)*compare_float(a.w, b.w); }

void main (void)
{
	gl_Position = a_position;
	v_vtxOut = 1.0;
	v_vtxOut *= compare_vec4(texture(u_var.m0, vec2(0.0)), vec4(0.15, 0.52, 0.26, 0.35));
	v_vtxOut *= compare_vec4(texture(u_var.m1, vec3(0.0)), vec4(0.88, 0.09, 0.30, 0.61));
}

Which, in llvmpipe NIR, is:

shader: MESA_SHADER_VERTEX
source_sha1: {0xcb00c93e, 0x64db3b0f, 0xf4764ad3, 0x12b69222, 0x7fb42437}
inputs: 1
outputs: 2
uniforms: 0
shared: 0
ray queries: 0
decl_var uniform INTERP_MODE_NONE sampler2D lower@u_var.m0 (0, 0, 0)
decl_var uniform INTERP_MODE_NONE samplerCube lower@u_var.m1 (0, 0, 1)
decl_function main (0 params)

impl main {
	block block_0:
	/* preds: */
	vec1 32 ssa_0 = deref_var &a_position (shader_in vec4) 
	vec4 32 ssa_1 = intrinsic load_deref (ssa_0) (access=0)
	vec1 16 ssa_2 = load_const (0xb0cd = -0.150024)
	vec1 16 ssa_3 = load_const (0x2a66 = 0.049988)
	vec1 16 ssa_4 = load_const (0xb829 = -0.520020)
	vec1 16 ssa_5 = load_const (0xb429 = -0.260010)
	vec1 16 ssa_6 = load_const (0xb59a = -0.350098)
	vec1 16 ssa_7 = load_const (0xbb0a = -0.879883)
	vec1 16 ssa_8 = load_const (0xadc3 = -0.090027)
	vec1 16 ssa_9 = load_const (0xb4cd = -0.300049)
	vec1 16 ssa_10 = load_const (0xb8e1 = -0.609863)
	vec2 32 ssa_13 = load_const (0x00000000, 0x00000000) = (0.000000, 0.000000)
	vec1 32 ssa_49 = load_const (0x00000000 = 0.000000)
	vec4 16 ssa_14 = (float16)txl ssa_13 (coord), ssa_49 (lod), 0 (texture), 0 (sampler)
	vec1 16 ssa_15 = fadd ssa_14.x, ssa_2
	vec1 16 ssa_16 = fabs ssa_15
	vec1 16 ssa_17 = fadd ssa_14.y, ssa_4
	vec1 16 ssa_18 = fabs ssa_17
	vec1 16 ssa_19 = fadd ssa_14.z, ssa_5
	vec1 16 ssa_20 = fabs ssa_19
	vec1 16 ssa_21 = fadd ssa_14.w, ssa_6
	vec1 16 ssa_22 = fabs ssa_21
	vec1 16 ssa_23 = fmax ssa_16, ssa_18
	vec1 16 ssa_24 = fmax ssa_23, ssa_20
	vec1 16 ssa_25 = fmax ssa_24, ssa_22
	vec3 32 ssa_27 = load_const (0x00000000, 0x00000000, 0x00000000) = (0.000000, 0.000000, 0.000000)
	vec1 32 ssa_50 = load_const (0x00000000 = 0.000000)
	vec4 16 ssa_28 = (float16)txl ssa_27 (coord), ssa_50 (lod), 1 (texture), 1 (sampler)
	vec1 16 ssa_29 = fadd ssa_28.x, ssa_7
	vec1 16 ssa_30 = fabs ssa_29
	vec1 16 ssa_31 = fadd ssa_28.y, ssa_8
	vec1 16 ssa_32 = fabs ssa_31
	vec1 16 ssa_33 = fadd ssa_28.z, ssa_9
	vec1 16 ssa_34 = fabs ssa_33
	vec1 16 ssa_35 = fadd ssa_28.w, ssa_10
	vec1 16 ssa_36 = fabs ssa_35
	vec1 16 ssa_37 = fmax ssa_30, ssa_32
	vec1 16 ssa_38 = fmax ssa_37, ssa_34
	vec1 16 ssa_39 = fmax ssa_38, ssa_36
	vec1 16 ssa_40 = fmax ssa_25, ssa_39
	vec1 32 ssa_41 = flt32 ssa_40, ssa_3
	vec1 32 ssa_42 = b2f32 ssa_41
	vec1 32 ssa_43 = deref_var &gl_Position (shader_out vec4) 
	intrinsic store_deref (ssa_43, ssa_1) (wrmask=xyzw /*15*/, access=0)
	vec1 32 ssa_44 = deref_var &v_vtxOut (shader_out float) 
	intrinsic store_deref (ssa_44, ssa_42) (wrmask=x /*1*/, access=0)
	/* succs: block_1 */
	block block_1:
}

There are two sample ops (txl), and since these tests only do simple texture() calls, it seems reasonable to assume that one of them is causing the crash. Sticking an lp_build_print_value on the texel values fetched by the sample operations will reveal whether the crash occurs before or after them.

What output does this yield?

Test case 'dEQP-GLES31.functional.program_uniform.by_pointer.render.basic_struct.sampler2D_samplerCube_vertex'..
texel 1.43279037e-322 6.95333598e-310 0 1.43279037e-322 1.08694442e-322 1.43279037e-322 1.08694442e-322 0
texel 1.43279037e-322 6.95333598e-310 0 1.43279037e-322 1.08694442e-322 1.43279037e-322 1.08694442e-322 0
texel 1.43279037e-322 6.95333598e-310 0 1.43279037e-322 1.08694442e-322 1.43279037e-322 1.08694442e-322 0
texel 1.43279037e-322 6.95333598e-310 0 1.43279037e-322 1.08694442e-322 1.43279037e-322 1.08694442e-322 0
[1]    3500332 illegal hardware instruction (core dumped)

Each txl op fetches four values, which means this is the result from the first instruction, but the second one isn’t reached before the crash. Unsurprisingly, this is also the cube sampling instruction, which makes sense given that all the crashes of this type I get are from cube sampling tests.

Now that it’s been determined the second txl is causing the crash, it’s reasonable to assume that the code leading up to that sampling op is the cause rather than the op itself, which can be tested by sticking some simple lp_build_printf("What am I doing with my life") calls in just before that op. Indeed, as the printfs confirm, I’m still questioning the life choices that led me to this point right up to the sample op, so it’s now proven that the txl instruction itself is the problem.

Cube sampling has a lot of complex math involved for face selection, and I’ve spent a lot of time in there recently. My first guess was that the cube coordinates were bogus. Printing them yielded results:

Test case 'dEQP-GLES31.functional.program_uniform.by_pointer.render.basic_struct.sampler2D_samplerCube_vertex'..
texel 6.9008994e-310 0 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617
texel 6.9008994e-310 0 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617
texel 6.9008994e-310 0 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617
texel 6.9008994e-310 0 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617
cubecoords nan nan nan nan nan nan nan nan
cubecoords nan nan nan nan nan nan nan nan

These cube coords have more NaNs than a 1960s Batman TV series, so it looks like I was right in my hunch. Printing the cube S-face value next yields more NaNs. My printf search continued a couple more iterations until I wound up at this function:

static LLVMValueRef
lp_build_cube_imapos(struct lp_build_context *coord_bld, LLVMValueRef coord)
{
   /* ima = +0.5 / abs(coord); */
   LLVMValueRef posHalf = lp_build_const_vec(coord_bld->gallivm, coord_bld->type, 0.5);
   LLVMValueRef absCoord = lp_build_abs(coord_bld, coord);
   LLVMValueRef ima = lp_build_div(coord_bld, posHalf, absCoord);
   return ima;
}

Immediately, all of us multiverse-brain engineers spot something suspicious: this has a division operation with a user-provided divisor. Printing absCoord here yielded all zeroes, which was about where my remaining energy was at this Friday morning, so I mangled the code slightly:

static LLVMValueRef
lp_build_cube_imapos(struct lp_build_context *coord_bld, LLVMValueRef coord)
{
   /* ima = +0.5 / abs(coord); */
   LLVMValueRef posHalf = lp_build_const_vec(coord_bld->gallivm, coord_bld->type, 0.5);
   LLVMValueRef absCoord = lp_build_abs(coord_bld, coord);
   /* avoid div by zero */
   LLVMValueRef sel = lp_build_cmp(coord_bld, PIPE_FUNC_GREATER, absCoord, coord_bld->zero);
   LLVMValueRef div = lp_build_div(coord_bld, posHalf, absCoord);
   LLVMValueRef ima = lp_build_select(coord_bld, sel, div, coord_bld->zero);
   return ima;
}

And blammo, now that Gallivm could no longer divide by zero, the test was now passing. And so were a lot of others.

Progress

There’s been some speculation about how close Zink really is to being “useful”, where “useful” is determined by the majesty of passing GL4.6 CTS.

So how close is it? The answer might shock you.

Remaining Lavapipe Fails: 17

  • KHR-GL46.gpu_shader_fp64.builtin.mod_dvec2,Fail
  • KHR-GL46.gpu_shader_fp64.builtin.mod_dvec3,Fail
  • KHR-GL46.gpu_shader_fp64.builtin.mod_dvec4,Fail
  • KHR-GL46.pipeline_statistics_query_tests_ARB.functional_primitives_vertices_submitted_and_clipping_input_output_primitives,Fail
  • KHR-GL46.tessellation_shader.single.isolines_tessellation,Fail
  • KHR-GL46.tessellation_shader.tessellation_control_to_tessellation_evaluation.data_pass_through,Fail
  • KHR-GL46.tessellation_shader.tessellation_invariance.invariance_rule3,Fail
  • KHR-GL46.tessellation_shader.tessellation_shader_point_mode.points_verification,Fail
  • KHR-GL46.tessellation_shader.tessellation_shader_quads_tessellation.degenerate_case,Fail
  • KHR-GL46.tessellation_shader.tessellation_shader_tessellation.gl_InvocationID_PatchVerticesIn_PrimitiveID,Fail
  • KHR-GL46.tessellation_shader.vertex.vertex_spacing,Fail
  • KHR-GL46.texture_barrier.disjoint-texels,Fail
  • KHR-GL46.texture_barrier.overlapping-texels,Fail
  • KHR-GL46.texture_barrier_ARB.disjoint-texels,Fail
  • KHR-GL46.texture_barrier_ARB.overlapping-texels,Fail
  • KHR-GL46.texture_swizzle.functional,Fail
  • KHR-GL46.tessellation_shader.tessellation_shader_quads_tessellation.inner_tessellation_level_rounding,Crash

Remaining ANV Fails (Icelake): 9

  • KHR-GL46.pipeline_statistics_query_tests_ARB.functional_primitives_vertices_submitted_and_clipping_input_output_primitives,Fail
  • KHR-GL46.tessellation_shader.single.isolines_tessellation,Fail
  • KHR-GL46.tessellation_shader.tessellation_control_to_tessellation_evaluation.data_pass_through,Fail
  • KHR-GL46.tessellation_shader.tessellation_invariance.invariance_rule3,Fail
  • KHR-GL46.tessellation_shader.tessellation_shader_point_mode.points_verification,Fail
  • KHR-GL46.tessellation_shader.tessellation_shader_quads_tessellation.degenerate_case,Fail
  • KHR-GL46.tessellation_shader.tessellation_shader_quads_tessellation.inner_tessellation_level_rounding,Fail
  • KHR-GL46.tessellation_shader.tessellation_shader_tessellation.gl_InvocationID_PatchVerticesIn_PrimitiveID,Fail
  • KHR-GL46.tessellation_shader.vertex.vertex_spacing,Fail

Big Triangle better keep a careful eye on us now.

February 17, 2022

Around 2 years ago, while I was working on tessellation support for llvmpipe and running the heaven benchmark on my Ryzen, I noticed that heaven, despite running slowly, wasn't saturating all the cores. I dug in a bit, and found that llvmpipe, despite threading the rasterization, fragment shading and blending stages, never did anything else while those were happening.

I dug into the code as I clearly remembered seeing a concept of a "scene" where all the primitives were binned into and then dispatched. It turned out the "scene" was always executed synchronously.

At the time I wrote support to allow multiple scenes to exist, so while one scene was executing, the vertex shading and binning for the next scene could execute and be queued up. For heaven at the time I saw some places where it would build 36 scenes. However, heaven was still 1fps with tess, regressions in other areas were rampant, and I mostly left the patches in a branch.

The reason so many things were broken by the patches was that large parts of llvmpipe, and also lavapipe, weren't ready for async pipeline processing. The concept of a fence after the pipeline finished was there, but wasn't used properly everywhere. A lot of operations assumed there was nothing going on behind the scenes and so never fenced. Lots of things like queries broke due to the fact that a query would always be ready in the old model, but now query availability could return unavailable, like on a real hw driver. Resource tracking existed but was incomplete, so knowing when to flush wasn't always accurate. Presentation was broken due to incorrect waiting for both GL and Lavapipe. Lavapipe needed semaphore support that actually did things, as apps use it between the render and present pipeline pieces.

Mesa CI recently got some paraview traces added to it, and I was doing some perf traces with them. Paraview is a data visualization tool, and it generates vertex-heavy workloads, as opposed to compositors and even games. It turned out binning was most of the overhead, and I realized the overlapping-scenes series could help this sort of workload. I dusted off the patch series and nailed down all the issues.

Emma Anholt ran some benchmarks on the results with the paraview traces and got

  • pv-waveletvolume fps +13.9279% +/- 4.91667% (n=15)
  • pv-waveletcountour fps +67.8306% +/- 11.4762% (n=3)
which seems like a good return on the investment.

I've got it all lined up in a merge request and it doesn't break CI anymore, so hopefully I can get it landed in the next while, once I clean up any misc bits.

February 16, 2022

Earlier this week, Neil McGovern announced that he is due to be stepping down as the Executive Director of the GNOME Foundation later this year. As the President of the board and Neil’s effective manager together with the Executive Committee, I wanted to take a moment to reflect on his achievements in the past 5 years and explain a little about what the next steps would be.

Since joining in 2017, Neil has overseen a productive period of growth and maturity for the Foundation, increasing our influence both within the GNOME project and the wider Free and Open Source Software community. Here’s a few highlights of what he’s achieved together with the Foundation team and the community:

  • Improved public perception of GNOME as a desktop and GTK as a development platform, helping to align interests between key contributors and wider ecosystem stakeholders and establishing an ongoing collaboration with KDE around the Linux App Summit.
  • Worked with the board to improve the maturity of the board itself and allow it to work at a more strategic level, instigating staggered two-year terms for directors providing much-needed stability, and established the Executive and Finance committees to handle specific topics and the Governance committees to take a longer-term look at the board’s composition and capabilities.
  • Arranged 3 major grants to the Foundation totaling $2M and raised a further $250k through targeted fundraising initiatives.
  • Grown the Foundation team to its largest ever size, investing in staff development, and established ongoing direct contributions to GNOME, GTK and Flathub by Foundation staff and contractors.
  • Launched and incubated Flathub as an inclusive and sustainable ecosystem for Linux app developers to engage directly with their users, and delivered the Community Engagement Challenge to invest in the sustainability of our contributor base ­­– the Foundation’s largest and most substantial programs outside of GNOME itself since Outreachy.
  • Achieved a fantastic resolution for GNOME and the wider community, by negotiating a settlement which protects FOSS developers from patent enforcement by the Rothschild group of non-practicing entities.
  • Stood for a diverse and inclusive Foundation, implementing a code of conduct for GNOME events and online spaces, establishing our first code of conduct committee and updating the bylaws to be gender-neutral.
  • Established the GNOME Circle program together with the board, broadening the membership base of the foundation by welcoming app and library developers from the wider ecosystem.

Recognizing and appreciating the amazing progress that GNOME has made with Neil’s support, the search for a new Executive Director provides the opportunity for the Foundation board to set the agenda and next high-level goals we’d like to achieve together with our new Executive Director.

In terms of the desktop, applications, technology, design and development processes, whilst there are always improvements to be made, the board’s general feeling is that thanks to the work of our amazing community of contributors, GNOME is doing very well in terms of what we produce and publish. Recent desktop releases have looked great, highly polished and well-received, and the application ecosystem is growing and improving through new developers and applications bringing great energy at the moment. From here, our largest opportunity in terms of growing the community and our user base is being able to articulate the benefits of what we’ve produced to a wider public audience, and deliver impact which allows us to secure and grow new and sustainable sources of funding.

For individuals, we are able to offer an exceedingly high quality desktop experience and a broad range of powerful applications which are affordable to all, backed by a nonprofit which can be trusted to look after your data, digital security and your best interests as an individual. From the perspective of being a public charity in the US, we also have the opportunity to establish programs that draw upon our community, technology and products to deliver impact such as developing employable skills, incubating new Open Source contributors, learning to program and more.

For our next Executive Director, we will be looking for an individual with existing experience in that nonprofit landscape, ideally with prior experience establishing and raising funds for programs that deliver impact through technology, and appreciation for the values that bring people to Free, Open Source and other Open Culture organizations. Working closely with the existing members, contributors, volunteers and whole GNOME community, and managing our relationships with the Advisory Board and other key partners, we hope to find a candidate that can build public awareness and help people learn about, use and benefit from what GNOME has built over the past two decades.

Neil has agreed to stay in his position for a 6 month transition period, during which he will support the board in our search for a new Executive Director and support a smooth hand-over. Over the coming weeks we will publish the job description for the new ED, and establish a search committee who will be responsible for sourcing and interviewing candidates to make a recommendation to the board for Neil’s successor – a hard act to follow!

I’m confident the community will join me and the board in personally thanking Neil for his 5 years of dedicated service in support of GNOME and the Foundation. Should you have any queries regarding the process, or offers of assistance in the coming hiring process, please don’t hesitate to join the discussion or reach out directly to the board.

February 15, 2022

After roughly 20 years and counting up to 0.40 in release numbers, I've decided to call the next version of the xf86-input-wacom driver the 1.0 release. [1] This cycle has seen a bulk of development (>180 patches) which is roughly as much as the last 12 releases together. None of these patches actually added user-visible features, so let's talk about technical debt and what turned out to be an interesting way of reducing it.

The wacom driver's git history goes back to 2002 and the current batch of maintainers (Ping, Jason and I) have all been working on it for one to two decades. It used to be a Wacom-only driver but with the improvements made to the kernel over the years the driver should work with most tablets that have a kernel driver, albeit some of the more quirky niche features will be more limited (but your non-Wacom devices probably don't have those features anyway).

The one constant was always: the driver was extremely difficult to test, something common to all X input drivers. Development is a cycle of restarting the X server a billion times, and testing is mostly plugging hardware in and moving things around in the hope that you can spot the bugs. On a driver that doesn't move much, this isn't necessarily a problem. Until a bug comes along that requires some core rework of the event handling - in the kernel, libinput and, yes, the wacom driver.

After years of libinput development, I wasn't really in the mood for the whole "plug every tablet in and test it, for every commit". In a rather caffeine-driven development cycle [2], the driver was separated into two logical entities: the core driver and the "frontend". The default frontend is the X11 one which is now a relatively thin layer around the core driver parts, primarily to translate events into the X Server's API. So, not unlike libinput + xf86-input-libinput in terms of architecture. In ascii-art:


                                              |
                     +--------------------+   |  big giant
 /dev/input/event0-->|  core driver | x11 |-->|  X server
                     +--------------------+   |  process
                                              |

Now, that logical separation means we can have another frontend which I implemented as a relatively light GObject wrapper and is now a library creatively called libgwacom:



                     +-----------------------+  |
 /dev/input/event0-->|  core driver | gwacom |--|  tools or test suites
                     +-----------------------+  |

This isn't a public library or API and it's very much focused on the needs of the X driver so there are some peculiarities in there. What it allows us though is a new wacom-record tool that can hook onto event nodes and print the events as they come out of the driver. So instead of having to restart X and move and click things, you get this:

$ ./builddir/wacom-record
wacom-record:
  version: 0.99.2
  git: xf86-input-wacom-0.99.2-17-g404dfd5a
  device:
    path: /dev/input/event6
    name: "Wacom Intuos Pro M Pen"
  events:
  - source: 0
    event: new-device
    name: "Wacom Intuos Pro M Pen"
    type: stylus
    capabilities:
      keys: true
      is-absolute: true
      is-direct-touch: false
      ntouches: 0
      naxes: 6
      axes:
      - {type: x        , range: [    0, 44800], resolution: 200000}
      - {type: y        , range: [    0, 29600], resolution: 200000}
      - {type: pressure , range: [    0, 65536], resolution:      0}
      - {type: tilt_x   , range: [  -64,    63], resolution:     57}
      - {type: tilt_y   , range: [  -64,    63], resolution:     57}
      - {type: wheel    , range: [ -900,   899], resolution:      0}
  ...
  - source: 0
    mode: absolute
    event: motion
    mask: [ "x", "y", "pressure", "tilt-x", "tilt-y", "wheel" ]
    axes: { x: 28066, y: 17643, pressure: 0, tilt: [ -4, 56], rotation: 0, throttle: 0, wheel: -108, rings: [ 0, 0] }
This is YAML which means we can process the output for comparison or just to search for things.

A tool to quickly analyse data makes for faster development iterations but it's still a far cry from reliable regression testing (and writing a test suite is a daunting task at best). But one nice thing about GObject is that it's accessible from other languages, including Python. So our test suite can be in Python, using pytest and all its capabilities, plus all the advantages Python has over C. Most of driver testing comes down to: create a uinput device, set up the driver with some options, push events through that device and verify they come out of the driver in the right sequence and format. I don't need C for that. So there's a pull request sitting out there doing exactly that - adding a pytest test suite for a 20-year-old X driver written in C. That this is a) possible and b) a lot less work than expected got me quite unreasonably excited. If you do have to maintain an old C library, maybe consider whether it's possible to do the same, because there's nothing like the warm fuzzy feeling a green tick on a CI pipeline gives you.

[1] As scholars of version numbers know, they make as much sense as your stereotypical uncle's facebook opinion, so why not.
[2] The Colombian GDP probably went up a bit

February 12, 2022

FOSDEM 2022 took place this past weekend, on February 5th and 6th. It was a virtual event for the second year in a row, but this year the Graphics devroom made a comeback and I participated in it with a talk titled “Fun with border colors in Vulkan”. In the talk, I explained the context and origins behind the VK_EXT_border_color_swizzle extension that was published last year and in which I’m listed as one of the contributors.

Big kudos and a big thank you to the FOSDEM organizers once again this year. FOSDEM is arguably the most important free and open source software conference in Europe and one of the most important FOSS conferences in the world. It’s run entirely by volunteers, doing an incredible amount of work that makes it possible to have hundreds of talks and dozens of different devrooms in the span of two days. Special thanks to the Graphics devroom organizers.

For the virtual setup, FOSDEM relied on Matrix once again. It’s great because at Igalia we also use Matrix for our internal communications and, thanks to the federated nature of the service, I could join the FOSDEM virtual rooms using the same interface, client and account I normally use for work. The FOSDEM organizers also let participants create ad-hoc accounts to join the conference, in case they didn’t have a Matrix account previously. Thanks to Matrix widgets, each virtual devroom had its corresponding video stream, which you could also watch freely on their site, embedded directly in the virtual devroom, so participants wanting to watch the talks and ask questions had everything on a single page.

Talks were pre-recorded and submitted in advance, played at the scheduled times, and Jitsi was used for post-talk Q&A sessions, in which moderators and devroom organizers read aloud the most voted questions in the devroom chat.

Of course, a conference this big is not without its glitches. Video feeds from presenters and moderators were sometimes cut automatically by Jitsi allegedly due to insufficient bandwidth. It also happened to me during my Q&A section while I was using a wired connection on a 300 Mbps symmetric FTTH line. I can only suppose the pipe was not wide enough on the other end to handle dozens of streams at the same time, or Jitsi was playing games as it sometimes does. In any case, audio was flawless.

In addition, some of the pre-recorded videos could not be played at the scheduled time, resulting in a black screen with no sound, due to an apparent bug in the video system. It’s worth noting all pre-recorded talks had been submitted, processed and reviewed prior to the conference, so this was an unexpected problem. This happened with my talk and I had to do the presentation live. Fortunately, I had written a script for the talk and could use it to deliver it without issues by sharing my screen with the slides over Jitsi.

Finally, a possible improvement point for future virtual or mixed editions is the fact that the deadline for submitting talk videos was only communicated directly and prominently by email on the day the deadline ended, a couple of weeks before the conference. It was also mentioned in the presenter’s guide that was linked in a previous email message, but an explicit warning a few days or a week before the deadline would have been useful to avoid last-minute rushes and delays submitting talks.

In any case, those small problems don’t take away from the great online-only experience we had this year.

Transcription

Another advantage of having a written script for the talk is that I can use it to provide a pseudo-transcription of its contents for those that prefer not to watch a video or are unable to do so. I’ve also summed up the Q&A section at the end below. The slides are available as an attachment in the talk page.

Enjoy and see you next year, hopefully in Brussels this time.

Slide 1 (Talk cover)

Hello, my name is Ricardo Garcia. I work at Igalia as part of its Graphics Team, where I mostly work on the CTS project creating new Vulkan tests and fixing existing ones. Sometimes this means I also contribute to the specification text and other pieces of the Vulkan ecosystem.

Today I’m going to talk about the story behind the “border color swizzle” extension that was published last year. I created tests for this one and I also participated in its release process, so I’m listed as one of the contributors.

Slide 2 (Sampling in Vulkan)

I’ve already started mentioning border colors, so before we dive directly into the extension let me give you a brief introduction to sampling operations in Vulkan and explain where border colors fit in that.

Sampling means reading pixels from an image view and is typically done in the fragment shader, for example to apply a texture to some geometry.

In the example you see here, we have an image view with 3 8-bit color components in BGR order and in unsigned normalized format. This means we’ll suppose each image pixel is stored in memory using 3 bytes, with each byte corresponding to the blue, green and red components in that order.

However, when we read pixels from that image view, we want to get back normalized floating point values between 0 (for the lowest value) and 1 (for the highest value, i.e. when all bits are 1 and the natural number in memory is 255).
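Note for readers of this transcription, not part of the original slides: that normalization is just a division by the largest value the bits can hold, as in this trivial sketch (the function name is made up).

/* Trivial sketch: 8-bit unsigned normalized component to float. */
static float unorm8_to_float(unsigned char v)
{
   return (float)v / 255.0f; /* 0 -> 0.0, 255 -> 1.0 */
}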

As you can see in the GLSL code, the result of the operation is a vector of 4 floating point numbers. Since the image does not have alpha information, it’s natural to think the output vector may have a 1 in the last component, making the color opaque.

If the coordinates of the sample operation make us read the pixel represented there, we would get the values you see on the right.

It’s also worth noting the sampler argument is a combination of two objects in Vulkan: an image view and a sampler object that specifies how sampling is done.

Slide 3 (Normalized Coordinates)

Focusing a bit on the coordinates used to sample from the image, the most common case is using normalized coordinates, which means using floating point values between 0 and 1 in each of the image axis, like the 2D case you see on the right.

But, what happens if the coordinates fall outside that range? That means sampling outside the original image, in points around it like the red marks you see on the right.

That depends on how the sampler is configured. When creating it, we can specify a so-called “address mode” independently for each of the 3 texture coordinate axis that may be used (2 in our example).

Slide 4 (Address Mode)

There are several possible address modes. The most common one is probably the one you see here on the bottom left, which is the repeat addressing mode, which applies some kind of modulo operation to the coordinates as if the texture was virtually repeating in the selected axis.

There’s also the clamp mode on the top right, for example, which clamps coordinates to 0 and 1 and produces the effect of the texture borders extending beyond the image edge.

The case we’re interested in is the one on the top left, which is the border mode. When sampling outside we get a border color, as if the image was surrounded by a virtually infinite frame of a chosen color.

Slide 5 (Border Color)

The border color is specified when creating the sampler, and initially could only be chosen among a restricted set of values: transparent black (all zeros), opaque white (all ones) or the “special” opaque black color, which has a zero in all color components and a 1 in the alpha component.

The “custom border color” extension introduced the possibility of specifying arbitrary RGBA colors when creating the sampler.

Slide 6 (Image View Swizzle)

However, sampling operations are also affected by one parameter that’s not part of the sampler object. It’s part of the image view and it’s called the component swizzle.

In the example I gave you before we got some color values back, but that was supposing the component swizzle was the identity swizzle (i.e. color components were not reordered or replaced).

It’s possible, however, to specify other swizzles indicating what the resulting final color should be for each of the 4 components: you can reorder the components arbitrarily (e.g. saying the red component should actually come from the original blue one), you can force some of them to be zero or one, you can replicate one of the original components in multiple positions of the final color, etc. It’s a very flexible operation.

Slide 7 (Border Color and Swizzle pt. 1)

While working on the Zink Mesa driver, Mike discovered that the interaction between non-identity swizzle and custom border colors produced different results for different implementations, and wondered if the result was specified at all.

Slide 8 (Border Color and Swizzle pt. 2)

Let me give you an example: you specify a custom border color of 0, 0, 1, 1 (opaque blue) and an addressing mode of clamping to border in the sampler.

The image view has this strange swizzle in which the red component should come from the original blue, the green component is always zero, the blue component comes from the original green and the alpha component is not modified.

If the swizzle applies to the border color you get red. If it does not, you get blue.

Any option is reasonable: if the border color is specified as part of the sampler, maybe you want to get that color no matter which image view you use that sampler on, and expect to always get a blue border.

If the border color is supposed to act as if it came from the original image, it should be affected by the swizzle as the normal pixels are and you’d get red.

Slide 9 (Border Color and Swizzle pt. 3)

Jason pointed out the spec laid out the rules in a section called “Texel Input Operations”, which specifies that swizzling should affect border colors. According to the spec, non-identity swizzles could be applied to custom border colors without restrictions, in contrast to “opaque black”, which was considered a special value for which non-identity swizzles would result in undefined values.

Slide 10 (Texel Input Operations)

The Texel Input Operations spec section describes what the expected result is according to some steps which are supposed to happen in a defined order. It doesn’t mean the hardware has to work like this. It may need instructions before or after the hardware sampling operation to simulate that things happen in the order described there.

I’ve simplified and removed some of the steps but if border color needs to be applied we’re interested in the steps we can see in bold, and step 5 (border color applied) comes before step 7 (applying the image view swizzle).

I’ll describe the steps with a bit more detail now.

Slide 11 (Coordinate Conversion)

Step 1 is coordinate conversion: this includes converting normalized coordinates to integer texel coordinates for the image view and clamping and modifying those values depending on the addressing mode.

Slide 12 (Coordinate Validation)

Once that is done, step 2 is validating the coordinates. Here, we’ll decide if texel replacement takes place or not, which may imply using the border color. In other sampling modes, robustness features will also be taken into account.

Slide 13 (Reading Texel from Image)

Step 3 happens when the coordinates are valid, and is reading the actual texel from the image. This immediately implies reordering components from the in-memory layout to the standard RGBA layout, which means a BGR image view gets its components immediately put in RGB order after reading.

Slide 14 (Format Conversion)

Step 4 also applies if an actual texel was read from the image and is format conversion. For example, unsigned normalized formats need to convert pixel values (stored as natural numbers in memory) to floating point values.

Our example texel, already in RGB order, results in the values you see on the right.

Slide 15 (Texel Replacement)

Step 5 is texel replacement, and is the alternative to the previous two steps when the coordinates were not valid. In the case of border colors, this means taking the border color and cutting it short so it only has the components present in the original image view, to act as if the border color was actually part of the image.

Because this happens after the color components have already been reordered, the border color is always specified in standard red, green, blue and alpha order when creating the sampler. The fact that the original image view was in BGR order is irrelevant for the border color. We care about the alpha component being missing, but not about the in-memory order of the image view.

Our transparent blue border is converted to just “blue” in this step.

Slide 16 (Expansion to RGBA)

Step 6 takes us back to a unified flow of steps: it applies to the color no matter where it came from. The color is expanded to always have 4 components as expected in the shader. Missing color components are replaced with zeros and the alpha component, if missing, is set to one.

Our original transparent blue border is now opaque blue.

Slide 17 (Component Swizzle)

Step 7, finally the swizzle is applied. Let’s suppose our image view had that strange swizzle in which the red component is copied from the original blue, the green component is set to zero, the blue one is set to one and the alpha component is not modified.

Our original transparent blue border is now opaque magenta.
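Note for readers of this transcription, not part of the original slides: here is a tiny standalone C sketch of steps 5 through 7 for this exact example, just to make the ordering concrete. The enum and helper are invented for illustration; this is not Vulkan or driver code.

#include <stdio.h>

enum swz { SWZ_ZERO, SWZ_ONE, SWZ_R, SWZ_G, SWZ_B, SWZ_A };

static float apply_swz(enum swz s, const float rgba[4])
{
   switch (s) {
   case SWZ_ZERO: return 0.0f;
   case SWZ_ONE:  return 1.0f;
   case SWZ_R:    return rgba[0];
   case SWZ_G:    return rgba[1];
   case SWZ_B:    return rgba[2];
   default:       return rgba[3];
   }
}

int main(void)
{
   /* Custom border color as given to the sampler (RGBA order). */
   const float border[4] = { 0.0f, 0.0f, 1.0f, 1.0f };

   /* Step 5: texel replacement keeps only the components present in the
    * image view (RGB here, since the view has no alpha). */
   float texel[4] = { border[0], border[1], border[2], 0.0f };

   /* Step 6: expansion to RGBA, missing alpha becomes 1 -> opaque blue. */
   texel[3] = 1.0f;

   /* Step 7: the image view swizzle (r<-b, g<-zero, b<-one, a<-identity). */
   const enum swz view_swizzle[4] = { SWZ_B, SWZ_ZERO, SWZ_ONE, SWZ_A };
   float out[4];
   for (int i = 0; i < 4; i++)
      out[i] = apply_swz(view_swizzle[i], texel);

   /* Prints 1.0 0.0 1.0 1.0: opaque magenta. */
   printf("%.1f %.1f %.1f %.1f\n", out[0], out[1], out[2], out[3]);
   return 0;
}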

Slide 18 (VK_EXT_custom_border_color)

So we had this situation in which some implementations swizzled the border color and others did not. What could we do?

We could double down on the existing spec and ask vendors to fix their implementations but, what happens if they cannot fix them? Or if the fix is impractical due to its impact on performance?

Unfortunately, that was the actual situation: some implementations could not be fixed. After discovering this problem, CTS tests were going to be created for these cases. If an implementation failed to behave as mandated by the spec, it wouldn’t pass conformance, so those implementations only had one way out: stop supporting custom border colors, but that’s also a loss for users if those implementations are in widespread use (and they were).

The second option is backpedaling a bit, making behavior undefined unless some other feature is present and designing a mechanism that would allow custom border colors to be used with non-identity swizzles at least in some of the implementations.

Slide 19 (VK_EXT_border_color_swizzle)

And that’s how the “border color swizzle” extension was created last year. Custom colors with non-identity swizzle produced undefined results unless the borderColorSwizzle feature was available and enabled. Some implementations could advertise support for this almost “for free” and others could advertise lack of support for this feature.

In the middle ground, some implementations can indicate they support the case, but the component swizzle has to be indicated when creating the sampler as well. So it’s both part of the image view and part of the sampler. Samplers created this way can only be used with image views having a matching component swizzle (which means they are no longer generic samplers).
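Note for readers of this transcription: creating such a sampler could look roughly like the sketch below, assuming the struct layouts published with VK_EXT_custom_border_color and VK_EXT_border_color_swizzle. Most sampler state and all error handling are omitted; this is for illustration only and is not code from any driver or test.

#include <vulkan/vulkan.h>

/* Sketch: a sampler that clamps to a custom border color and declares the
 * component swizzle of the image views it will be used with, for
 * implementations that require the swizzle at sampler creation time. */
static VkSampler create_border_sampler(VkDevice device)
{
   const VkComponentMapping swizzle = {
      .r = VK_COMPONENT_SWIZZLE_B,    /* red taken from the original blue */
      .g = VK_COMPONENT_SWIZZLE_ZERO, /* green forced to zero */
      .b = VK_COMPONENT_SWIZZLE_G,    /* blue taken from the original green */
      .a = VK_COMPONENT_SWIZZLE_IDENTITY,
   };

   const VkSamplerBorderColorComponentMappingCreateInfoEXT swizzle_info = {
      .sType = VK_STRUCTURE_TYPE_SAMPLER_BORDER_COLOR_COMPONENT_MAPPING_CREATE_INFO_EXT,
      .components = swizzle, /* must match the image view's component mapping */
      .srgb = VK_FALSE,
   };

   const VkSamplerCustomBorderColorCreateInfoEXT border_info = {
      .sType = VK_STRUCTURE_TYPE_SAMPLER_CUSTOM_BORDER_COLOR_CREATE_INFO_EXT,
      .pNext = &swizzle_info,
      .customBorderColor = { .float32 = { 0.0f, 0.0f, 1.0f, 1.0f } }, /* opaque blue */
      .format = VK_FORMAT_B8G8R8_UNORM, /* the BGR view format from the example */
   };

   const VkSamplerCreateInfo sampler_info = {
      .sType = VK_STRUCTURE_TYPE_SAMPLER_CREATE_INFO,
      .pNext = &border_info,
      .addressModeU = VK_SAMPLER_ADDRESS_MODE_CLAMP_TO_BORDER,
      .addressModeV = VK_SAMPLER_ADDRESS_MODE_CLAMP_TO_BORDER,
      .addressModeW = VK_SAMPLER_ADDRESS_MODE_CLAMP_TO_BORDER,
      .borderColor = VK_BORDER_COLOR_FLOAT_CUSTOM_EXT,
   };

   VkSampler sampler = VK_NULL_HANDLE;
   vkCreateSampler(device, &sampler_info, NULL, &sampler);
   return sampler;
}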

The drawback of this extension, apart from the obvious observation that it should’ve been part of the original custom border color extension, is that it somehow lowers the bar for applications that want to use a single code path for every vendor. If borderColorSwizzle is supported, it’s always legal to pass the swizzle when creating the sampler. Some implementations will need it and the rest can ignore it, so the unified code path is now harder or more specific.

And that’s basically it. Sometimes the Vulkan Working Group in Khronos has had to backpedal and mark as undefined something that previous versions of the Vulkan spec considered defined. It’s not frequent nor ideal, but it happens. But it usually does not go as far as publishing a new extension as part of the fix, which is why I considered this interesting.

Slide 20 (Questions?)

Thanks for watching! Let me know if you have any questions.

Q&A Section

Martin: The first question is from "ancurio" and he’s asking if swizzling is implemented in hardware.

Me: I don’t work on implementations so take my answer with a grain of salt. It’s my understanding you can usually program that in hardware and the hardware does the swizzling for you. There may be implementations which need to do the swizzling in software, emitting extra instructions.

Martin: Another question from "ancurio". When you said lowering the bar do you mean raising it?

I explain that, yes, I meant to say raising the bar for the application. Note: I meant to say that it lowers the bar for the specification and API, which means a more complicated solution has been accepted.

Martin: "enunes" asks if this was originally motivated by some real application bug or by something like conformance tests/spec disambiguation?

I explain it has both factors. Mike found the problem while developing Zink, so a real application hit the problematic case, and then the Vulkan Working Group inside Khronos wanted to fix this, make the spec clear and provide a solution for apps that wanted to use non-identity swizzle with border colors, as it was originally allowed.

Martin: no more questions in the room but I have one more for you. How was your experience dealing with Khronos coordinating with different vendors and figuring out what was the acceptable solution for everyone?

I explain that the main driver behind the extension in Khronos was Piers Daniell from NVIDIA (NB: listed as the extension author). I mention that my experience was positive, that the Working Group is composed of people who are really interested in making a good specification and implementations that serve app developers. When this problem was detected I created some tests that worked as a poll to see which vendors could make this work easily and what others might need to do to make this work, if at all. Then, this was discussed in the Working Group, a solution was proposed (the extension), more vendors reviewed and commented on that, then tests were adapted to the final solution, and finally the extension was published.

Martin: How long did this whole process take?

Me: A few months. Take into account the Working Group does not meet every day, and they have a backlog of issues to discuss. Each of the previous steps takes several weeks, so you end up with a few months, which is not bad.

Martin: Not bad at all.

Me: Not at all, I think it works reasonably well.

Martin: Specially when you realize you broke something and the specification needs fixing. Definitely decent.

February 05, 2022

(I nearly went with clutterectomy, but that would be doing our old servant project a disservice.)

Yesterday, I finally merged the work-in-progress branch porting totem to GStreamer's GTK GL sink widget, undoing a lot of the work done in 2011 and 2014 to port the video widget and then to finally make use of its features.

But GTK has been modernised (in GTK3 but in GTK4 even more so), GStreamer grew a collection of GL plugins, Wayland and VA-API matured and clutter (and its siblings clutter-gtk, and clutter-gst) didn't get the resources they needed to follow.

Screenshot_from_2022-02-03_18-03-40

A screenshot with practically no changes, as expected

The list of bug fixes and enhancements is substantial:

  • Makes some files that threw shaders warnings playable
  • Fixes resize lag for the widgets embedded in the video widget
  • Fixes interactions with widgets on some HDR capable systems, or even widgets disappearing sometimes (!)
  • Gets rid of the floating blank windows under Wayland
  • Should help with tearing, although that's highly dependent on the system
  • Hi-DPI support
  • Hardware acceleration (through libva)

Until the port to GTK4, we expect an overall drop in performance on systems where there's no VA-API support, and the GTK4 port should bring it on par with the fastest of players available for GNOME.

You can install a Preview version right now by running:

$ flatpak install --user https://flathub.org/beta-repo/appstream/org.gnome.Totem.Devel.flatpakref

and filing bugs in the GNOME GitLab.

Next stop, a GTK4 port!

February 03, 2022

22.0

I always do one of these big roundups for each Mesa release, so here’s what you can expect to see from zink in the upcoming release:

  • fewer hangs on RADV
  • massively improved usability on NVIDIA
  • greatly improved performance with unsupported texture download formats (e.g., CS:GO, L4D2)
  • more extensions: ARB_sparse_texture, ARB_sparse_texture2, ARB_sparse_texture_clamp, EXT_memory_object, EXT_memory_object_fd, GL_EXT_semaphore, GL_EXT_semaphore_fd
  • ~1000% improved glxgears performance (be sure to run with --i-know-this-is-not-a-benchmark to see the real speed)
  • tons and tons and tons of bug fixes

All around looking like another great release.

I Hate gl_PointSize And So Can You

Yes, we’re here.

After literally years of awfulness, I’ve finally solved (for good) the debacle that is point size conversion from GL to Vulkan.

What’s so awful about it, you might be asking. How hard can it be to just add gl_PointSize to a shader, you follow up with as you push your glasses higher up your nose.

Allow me to explain.

In Vulkan, there is exactly one method for setting the size of points: the gl_PointSize shader output controls it, and that’s it.

In OpenGL (core profile):

  • 14.4 Points If program point size mode is enabled, the derived point size is taken from the (potentially clipped) shader built-in gl_PointSize written by the last vertex processing stage and clamped to the implementation-dependent point size range. If the value written to gl_PointSize is less than or equal to zero, or if no value was written to gl_PointSize, results are undefined. If program point size mode is disabled, the derived point size is specified with the command

    void PointSize( float size );

  • 11.2.3.4 Tessellation Evaluation Shader Outputs Tessellation evaluation shaders have a number of built-in output variables used to pass values to equivalent built-in input variables read by subsequent shader stages or to subsequent fixed functionality vertex processing pipeline stages. These variables are gl_Position, gl_PointSize, gl_ClipDistance, and gl_CullDistance, and all behave identically to equivalently named vertex shader outputs.
  • 11.3.4.5 Geometry Shader Outputs The built-in output gl_PointSize, if written, holds the size of the point to be rasterized, measured in pixels

In short, if PROGRAM_POINT_SIZE is enabled, then points are sized based on the gl_PointSize shader output of the last vertex stage.
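Put in API terms, the desktop GL side boils down to something like this. A sketch with a made-up helper, not zink code:

#include <GL/gl.h>

/* Sketch of the desktop GL rule above: either the fixed-function size
 * applies, or gl_PointSize from the last vertex stage does. */
static void set_point_size(GLboolean from_shader, GLfloat fixed_size)
{
   if (from_shader) {
      glEnable(GL_PROGRAM_POINT_SIZE);  /* gl_PointSize output wins */
   } else {
      glDisable(GL_PROGRAM_POINT_SIZE); /* gl_PointSize output is ignored */
      glPointSize(fixed_size);          /* derived point size comes from here */
   }
}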

In OpenGL ES (versions 2.0, 3.0, 3.1):

  • (3.3 | 3.4 | 13.3) Points The point size is taken from the shader built-in gl_PointSize written by the vertex shader, and clamped to the implementation-dependent point size range.

In OpenGL ES (version 3.2):

  • 13.5 Points The point size is determined by the last vertex processing stage. If the last vertex processing stage is not a vertex shader, the point size is 1.0. If the last vertex processing stage is a vertex shader, the point size is taken from the shader built-in gl_PointSize written by the vertex shader, and is clamped to the implementation-dependent point size range.

Thus for an ES context, the point size always comes from the last vertex stage, which means it can be anything it wants to be if that stage is a vertex shader, and cannot be written at all for other stages because it is not a valid output (that last part is going to be really funny in a minute or two).
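To make the divergence concrete, here is a rough sketch of the "where does the point size come from" decision across the APIs quoted above. This is purely illustrative C with hypothetical names, not actual zink code, and it glosses over plenty of details (XFB among them, as we'll see shortly):

#include <stdbool.h>

/* Hypothetical helper summarizing the spec language quoted above.
 * Names and structure are illustrative only. */
enum point_size_source {
   POINT_SIZE_FROM_SHADER,   /* gl_PointSize written by the last vertex stage */
   POINT_SIZE_FROM_STATE,    /* the glPointSize() value */
   POINT_SIZE_CONSTANT_ONE,  /* fixed 1.0 */
};

static enum point_size_source
choose_point_size_source(bool is_es, unsigned es_version,
                         bool program_point_size,
                         bool last_stage_is_vertex_shader)
{
   if (!is_es) {
      /* Desktop GL: PROGRAM_POINT_SIZE picks between the shader output
       * of the last vertex stage and the PointSize() state. */
      return program_point_size ? POINT_SIZE_FROM_SHADER
                                : POINT_SIZE_FROM_STATE;
   }

   if (es_version >= 32 && !last_stage_is_vertex_shader) {
      /* ES 3.2: if the last vertex processing stage is not a vertex
       * shader, the point size is 1.0. */
      return POINT_SIZE_CONSTANT_ONE;
   }

   /* ES with a vertex shader last: always the shader output. */
   return POINT_SIZE_FROM_SHADER;
}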

What do the specs agree on?

  • If a vertex shader is the last vertex stage, it can write gl_PointSize

Literally that’s it.

Awesome.

Zink

As we know, Vulkan has a very simple and clearly defined model for point size:

The point size is taken from the (potentially clipped) shader built-in PointSize written by:
• the geometry shader, if active;
• the tessellation evaluation shader, if active and no geometry shader is active;
• the vertex shader, otherwise
- 27.10. Points

It really can be that simple.

So one would think that we can just hook up some conditionals based on the GL rules and then export the correct value.

That would be easy.

Simple.

It would make sense.

HAHA

hahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahaha

XFB

It gets worse (obviously).

gl_PointSize is a valid XFB varying, which means it must be exported correctly to the transform feedback buffer. For the ES case, it’s simple, but for desktop GL, there’s a little something called PROGRAM_POINT_SIZE state which totally fucks that up. Because, as we know, Vulkan has exactly one way of setting point size, and it’s the shader variable.

Thus, if there is a desktop GL context using a vertex shader as its last vertex stage for a draw, and if that shader has its own gl_PointSize value, this value must be exported for XFB.

But not used for point rasterization.

It’s Actually Even Worse Than That

…Because in order to pass CTS for ES 3.2, your implementation also has to be able to violate spec.

Remember above when I said it was going to be funny that gl_PointSize is not a legal output for non-vertex stages in ES contexts?

CTS explicitly has “wide points” tests which verify illegal point sizes that are exported by the tessellation and geometry shader stages. Isn’t that cool?

Also, let’s be reasonable people for a moment, who actually wants a point that’s just one pixel? Nobody can see that on their 8k display.

To Sum Up

I hate GL point size, and so should you.

February 02, 2022

Checking In

I keep meaning to blog, but then I get sidetracked by not blogging.

Truly a tough life.

So what’s new in zink-land?

Nothing too exciting. Mostly bug fixes. I managed to sneak ARB_sparse_texture_clamp in for zink just before the branchpoint, so all the sparse texturing features supported by Mesa will be supported by zink. But only on NVIDIA since they’re the only driver that fully supports Vulkan sparse texturing.

The past couple days I’ve been doing some truly awful things with gl_PointSize to try and make this conformant for all possible cases. It’s a real debacle, and I’ll probably post more in-depth about it so everyone can get a good chuckle.

The one unusual part of my daily routine is that I haven’t rebased my testing branch in at least a couple weeks now since I’ve been trying to iron out regressions. Will I find that everything crashes and fails as soon as I do?

Probably.

More posts to come.

February 01, 2022

There was an article on Open for Everyone today about Nobara, a Fedora-based distribution optimized for gaming. So I have no beef with Tomas Crider or any other creator/maintainer of a distribution targeting a specific use case. In fact they are usually trying to solve or work around real problems and make things easier for people. That said, I have for years felt that the need for these things is a failing in itself, and it has been a goal for me in the context of Fedora Workstation to figure out what we can do to remove the need for 'usecase distros'. So I thought it would be of interest if I talked a bit about how I have been viewing these things and the concrete efforts we have taken to reduce the need for usecase-oriented distributions. It is worth noting that the usecase distributions have of course proven useful for this too, in the sense that they to some degree also function as a very detailed 'bug report' for why the general-case OS is not enough.
Before I start, you might say: but isn't Fedora Workstation a usecase OS too? You often talk about having a developer focus? Yes, developers are something we care deeply about, but for instance that doesn't mean we pre-install 50 IDEs in Fedora Workstation. Fedora Workstation should be a great general purpose OS out of the box and then we should have tools like GNOME Software and Toolbx available to let you quickly and easily tweak it into your ideal development system. But at the same time, by being a general purpose OS at heart, it should be equally easy to install Steam and Lutris to start gaming or install Carla and Ardour to start doing audio production. Or install OBS Studio to do video streaming.

Looking back over the years, one of the first conclusions I drew from looking at all the usecase distributions out there was that they often were mostly the standard distro, but with a carefully procured list of pre-installed software; for instance the old Fedora Games spin was exactly that, a copy of Fedora with a lot of games pre-installed. So why was this valuable to people? For those of us who have been around for a while, we remember that the average linux 'app store' was a very basic GUI which listed available software by name (usually quite cryptic names) and at best with a small icon. There was almost no other metadata available and search functionality was limited at best. So finding software was not simple, and it was usually more of a 'search the internet and if you find something interesting see if it's packaged for your distro'. So the usecase distros that focused on procured pre-installed software, be that games, pro-audio software, graphics tools or whatever their focus was, were basically responding to the fact that finding software was non-trivial, and a lot of people maybe missed out on software that could be useful to them since they simply never learned about its existence.
So when we kicked off the creation of GNOME Software, one of the big focuses early on was to create a system for providing good metadata and displaying that metadata in a useful manner. So as an end user, the most obvious change was of course the richer UI of GNOME Software, but maybe just as important was the creation of AppStream, a specification for how applications ship metadata, allowing GNOME Software and others to display much more in-depth information about the application and provide screenshots and so on.

So I do believe that between the work on GNOME Software as the actual UI and the work with many stakeholders in the Linux ecosystem to define metadata standards like AppStream, we made software a lot more discoverable on Linux and thus reduced the need for pre-loading significantly. This work also provided an important baseline for things like Flathub to thrive, as it then had a clear way to provide metadata about the applications it hosts.
We do continue to polish that user experience on an ongoing basis, but I do feel we reduced the need to pre-load a ton of software very significantly already with this.

Of course another aspect of this is application availability, which is why we worked to ensure things like Steam are available in GNOME Software on Fedora Workstation, and which we have now expanded on by starting to include more and more software listings from Flathub. These things make it easy for our users to find the software they want, but at the same time we are still staying true to our mission of only shipping free software by default in Fedora.

The second major reason for usecase distributions has been that the generic version of the OS didn't really have the right settings or setup to handle an important usecase. I think pro-audio is the best example of this, where usecase distros like Fedora Jam or Ubuntu Studio popped up. Pre-installing a lot of relevant software was definitely part of their DNA too, but there were also other issues involved, like the need for a special audio setup with JACK and often also kernel real-time patches applied. When we decided to include pro-audio support in PipeWire, resolving these issues was a big part of it. I strongly believe that we should be able to provide a simple and good out-of-the-box experience for musicians and audio engineers on Linux without needing the OS to be specifically configured for the task. The strong and positive response we have gotten from the pro-audio community for PipeWire suggests, I believe, that we are moving in the right direction there. I am not claiming things are 100% yet, but we feel very confident that we will get there with PipeWire and make the pro-audio folks full-fledged members of the Fedora Workstation community. Interestingly, we also spent quite a bit of time trying to ensure the pro-audio tools in Fedora have proper AppStream metadata so that they would appear in GNOME Software as part of this. One area we are still looking at is the real-time kernel stuff; our current take is that we believe the remaining unmerged patches are not strictly needed anymore, as most of the important stuff has already been merged, but we are monitoring it as we keep developing and benchmarking PipeWire for the pro-audio usecase.

Another reason that I often saw driving the creation of a usecase distribution is special hardware support, and not necessarily that special hardware; the NVidia driver, for instance, has triggered a lot of these attempts. The NVidia driver is challenging on a lot of levels and has been something we have been constantly working on. There were technical issues, for instance, like the NVidia driver and Mesa fighting over who owned the OpenGL.so implementation, which we fixed with the introduction of glvnd a few years ago. But for a distro like Fedora that also cares deeply about free and open source software, it also provided us with a lot of philosophical challenges. We had to answer the question of how we could, on one side, make sure our users had easy access to the driver without abandoning our principle of Fedora only shipping free software out of the box. I think we found a good compromise today where the NVidia driver is available in Fedora Workstation for easy install through GNOME Software, but at the same time we default to Nouveau out of the box. That said, this is a part of the story where we are still hard at work to improve things further and, while I am not at liberty to mention any details, I think I can at least mention that we are meeting with our engineering counterparts at NVidia on almost a weekly basis to discuss how to improve things, not just for graphics, but around compute and other shared areas of interest. The most recent public result of that collaboration was of course the XWayland support in recent NVidia drivers, but I promise you that this is something we keep focusing on and I expect that we will be able to share more cool news and important progress over the course of the year, both for users of the NVidia binary driver and for users of Nouveau.

What are we still looking at in terms of addressing issues like this? Well, one thing we are talking about is whether there is value in, or a need for, a facility to install specific software based on detected hardware or already-installed software. For instance, if we detect a high-end gaming mouse connected to your system, should we install Piper/ratbag or at least make GNOME Software suggest it? And if we detect that you installed Lutris and Steam, are there other tools we should recommend you install, like the gamemode GNOME Shell extension? It is a somewhat hard question to answer, which is why we are still pondering it; on one side it seems like a nice addition, but such connections would mean that we need to have a big database we constantly maintain, which isn't trivial, and having something running on your system to, let's say, check for those high-end mice adds a little overhead that might be a waste for many users.

Another area that we are looking at is the issue of codecs. We did a big effort a couple of years ago and got AC3, MP3, AAC and MPEG-2 video cleared for inclusion, and also got the OpenH264 implementation from Cisco made available. That solved a lot of issues, but today, with so many more people getting into media creation, I believe we need to take another stab at it and, for instance, try to get reliable hardware-accelerated encoding and decoding of video. I am not ready to announce anything, but we have a few ideas and leads we are looking at for how to move the needle there in a significant way.

So to summarize, I am not criticizing anyone for putting together what I call usecase distros, but at the same time I really want to get to a point where they are rarely needed, because we should be able to cater to most needs within the context of a general purpose Linux operating system. That said, I do appreciate the effort of these distro makers, both in terms of trying to help users have a better experience on Linux and in indirectly helping us showcase potential solutions and highlight the major pain points that still need addressing in a general purpose Linux desktop operating system.

January 27, 2022

In defense of NIR

NIR has been an integral part of the Mesa driver stack for about six or seven years now (depending on how you count) and a lot has changed since NIR first landed at the end of 2014 and I wrote my initial NIR notes. Also, for various reasons, I’ve had to give my NIR elevator pitch a few times lately. I think it’s time for a new post. This time on why, after working on this mess for seven years, I still think NIR was the right call.

A bit of history

Shortly after I joined the Mesa team at Intel in the summer of 2014, I was sitting in the cube area asking Ken questions, trying to figure out how Mesa was put together, and I asked, “Why don’t you use LLVM?” Suddenly, all eyes turned towards Ken and myself and I realized I’d poked a bear. Ken calmly explained a bunch of the packaging/shipping issues around having your compiler in a different project as well as issues radeonsi had run into with apps bundling their own LLVM that didn’t work. But for the more technical question of whether or not it was a good idea, his answer was something about trade-offs and how it’s really not clear if LLVM would really gain them much.

That same summer, Connor Abbott showed up as our intern and started developing NIR. By the end of the summer, he had a bunch of data structures, a few mostly untested passes, and a validator. He also had most of a GLSL IR to NIR pass which mostly passed validation. Later that year, after Connor had gone off to school, I took over NIR, finished the Intel scalar back-end NIR consumer, fixed piles of bugs, and wrote out-of-SSA and a bunch of optimization passes to get it to the point where we could finally land it in the tree at the end of 2014. Initially, it was only a few Intel folks and Emma Anholt (Broadcom, at the time) who were all that interested in NIR. Today, it's integral to the Mesa project and at the core of every driver that's still seeing active development. Over the past seven years, we (the Mesa community) have poured thousands of man hours (probably millions of engineering dollars) into NIR and it's gone from something only capable of handling fragment shaders to supporting full Vulkan 1.2 plus ray-tracing (task and mesh are coming) along with OpenCL 1.2 compute.

Was it worth it? That’s the multi-million dollar (literally) question. 2014 was a simpler time. Compute shaders were still newish and people didn’t use them for all that much more than they would have used a fancy fragment shader for a couple years earlier. More advanced features like Vulkan’s variable pointers weren’t even on the horizon. Had I known at the time how much work we’d have to put into NIR to keep up, I may have said, “Nah, this is too much effort; let’s just use LLVM.” If I had, I think I would have made the wrong call.

Distro and packaging issues

I’d like to get this one out of the way first because, while these issues are definitely real, it’s easily the least compelling reason to write a whole new piece of software. Having your compiler in a separate project and in LLVM in particular comes with an annoying set of problems.

First, there’s release cycles. Mesa releases on a rough 3-month cadence whereas LLVM releases on a 6-month cadence and there’s nothing syncing the two release cycles. This means that any new feature in Mesa that requires new LLVM compiler work can’t be enabled until distros pick up a new LLVM. Not only does this make the question “what Mesa version has X?” unanswerable, it also means every one of these features needs conditional paths in the driver to be enabled or not depending on LLVM version. Also, because we can’t guarantee which LLVM version a distro will choose to pair with any given Mesa version, radeonsi (the only LLVM-based hardware driver in Mesa) has to support the latest two releases of LLVM as well as tip-of-tree at all times. While this has certainly gotten better in recent years, it used to be that LLVM would switch around C++ data structures on you, requiring a bunch of wrapper classes in Mesa to deal with the mess. (They still reserve the right, it just happens less these days.)

Second is bug fixing. What do you do if there’s a compiler bug? You fix it in LLVM, of course, right? But what if the bug is in an old version of the AMD LLVM back-end and AMD’s LLVM people refuse to back-port the fix? You work around it in Mesa, of course! Yup, even though Mesa and LLVM are both open-source projects that theoretically have a stable bugfix release cycle, Mesa has to carry LLVM work-around patches because we can’t get the other team/project to back-port fixes. Things also get sticky whenever there’s a compiler bug which touches on the interface between the LLVM back-end compiler and the driver. How do you fix that in a backwards-compatible way? Sometimes, you don’t. Those interfaces can be absurdly subtle and complex and sometimes the bug is in the interface itself, so you either have to fix it in LLVM tip-of-tree and work around it in Mesa for older versions, or you have to break backwards compatibility somewhere and hope users pick up the LLVM bug-fix release.

Third is that some games actually link against LLVM and, historically, LLVM hasn’t done well with two different versions of it loaded at the same time. Some of this is LLVM and some of it is the way C++ shared library loading is handled on Linux. I won’t get into all the details but the point is that there have been some games in the past which simply can’t run on radeonsi because of LLVM library version conflicts. Some of this could probably be solved if Mesa were linked against LLVM statically but distros tend to be pretty sour on static linking unless you have a really good reason. A closed-source game pulling in their own LLVM isn’t generally considered to be a good reason.

And that, in the words of Forrest Gump, is all I have to say about that.

A compiler built for GPUs

One of the key differences between NIR and LLVM is that NIR is a GPU-focused compiler whereas LLVM is CPU-focused. Yes, AMD has an upstream LLVM back-end for their GPU hardware, Intel likes to brag about their out-of-tree LLVM back-end, and many other vendors use it in their drivers as well, even if their back-ends are closed-source and internal. However, none of that actually means that LLVM understands GPUs or is any good at compiling for them. Most HW vendors have made that choice because they needed LLVM for OpenCL support and they wanted a unified compiler, so they figured out how to make LLVM do graphics. It works but that doesn’t mean it works well.

To demonstrate this, let’s look at the following GLSL shader I stole from the texelFetch piglit test:

#version 120

#extension GL_EXT_gpu_shader4: require
#define ivec1 int
flat varying ivec4 tc;
uniform vec4 divisor;
uniform sampler2D tex;
out vec4 fragColor;
void main()
{
    vec4 color = texelFetch2D(tex, ivec2(tc), tc.w);
    fragColor = color/divisor;
}

When compiled to NIR, this turns into

shader: MESA_SHADER_FRAGMENT
name: GLSL3
inputs: 1
outputs: 1
uniforms: 1
ubos: 1
shared: 0
decl_var uniform INTERP_MODE_NONE sampler2D tex (1, 0, 0)
decl_var ubo INTERP_MODE_NONE vec4[1] uniform_0 (0, 0, 0)
decl_function main (0 params)

impl main {
    block block_0:
    /* preds: */
    vec1 32 ssa_0 = load_const (0x00000000 /* 0.000000 */)
    vec3 32 ssa_1 = intrinsic load_input (ssa_0) (0, 0, 34, 160) /* base=0 */ /* component=0 */ /* dest_type=int32 */ /* location=32 slots=1 */
    vec1 32 ssa_2 = deref_var &tex (uniform sampler2D)
    vec2 32 ssa_3 = vec2 ssa_1.x, ssa_1.y
    vec1 32 ssa_4 = mov ssa_1.z
    vec4 32 ssa_5 = (float32)txf ssa_2 (texture_deref), ssa_2 (sampler_deref), ssa_3 (coord), ssa_4 (lod)
    vec4 32 ssa_6 = intrinsic load_ubo (ssa_0, ssa_0) (0, 1073741824, 0, 0, 16) /* access=0 */ /* align_mul=1073741824 */ /* align_offset=0 */ /* range_base=0 */ /* range=16 */
    vec1 32 ssa_7 = frcp ssa_6.x
    vec1 32 ssa_8 = frcp ssa_6.y
    vec1 32 ssa_9 = frcp ssa_6.z
    vec1 32 ssa_10 = frcp ssa_6.w
    vec1 32 ssa_11 = fmul ssa_5.x, ssa_7
    vec1 32 ssa_12 = fmul ssa_5.y, ssa_8
    vec1 32 ssa_13 = fmul ssa_5.z, ssa_9
    vec1 32 ssa_14 = fmul ssa_5.w, ssa_10
    vec4 32 ssa_15 = vec4 ssa_11, ssa_12, ssa_13, ssa_14
    intrinsic store_output (ssa_15, ssa_0) (0, 15, 0, 160, 132) /* base=0 */ /* wrmask=xyzw */ /* component=0 */ /* src_type=float32 */ /* location=4 slots=1 */
    /* succs: block_1 */
    block block_1:
}

Then, the AMD driver turns it into the following LLVM IR:

; ModuleID = 'mesa-shader'
source_filename = "mesa-shader"
target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7"
target triple = "amdgcn--"

define amdgpu_ps <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> @main(<4 x i32> addrspace(6)* inreg noalias align 4 dereferenceable(18446744073709551615) %0, <8 x i32> addrspace(6)* inreg noalias align 4 dereferenceable(18446744073709551615) %1, float addrspace(6)* inreg noalias align 4 dereferenceable(18446744073709551615) %2, <8 x i32> addrspace(6)* inreg noalias align 4 dereferenceable(18446744073709551615) %3, i32 inreg %4, i32 inreg %5, <2 x i32> %6, <2 x i32> %7, <2 x i32> %8, <3 x i32> %9, <2 x i32> %10, <2 x i32> %11, <2 x i32> %12, float %13, float %14, float %15, float %16, float %17, i32 %18, i32 %19, float %20, i32 %21) #0 {
main_body:
  %22 = call nsz arcp float @llvm.amdgcn.interp.mov(i32 2, i32 0, i32 0, i32 %5) #4
  %23 = bitcast float %22 to i32
  %24 = call nsz arcp float @llvm.amdgcn.interp.mov(i32 2, i32 1, i32 0, i32 %5) #4
  %25 = bitcast float %24 to i32
  %26 = call nsz arcp float @llvm.amdgcn.interp.mov(i32 2, i32 2, i32 0, i32 %5) #4
  %27 = bitcast float %26 to i32
  %28 = getelementptr inbounds <8 x i32>, <8 x i32> addrspace(6)* %3, i32 32, !amdgpu.uniform !0
  %29 = load <8 x i32>, <8 x i32> addrspace(6)* %28, align 4, !invariant.load !0
  %30 = call nsz arcp <4 x float> @llvm.amdgcn.image.load.mip.2d.v4f32.i32(i32 15, i32 %23, i32 %25, i32 %27, <8 x i32> %29, i32 0, i32 0) #4
  %31 = ptrtoint float addrspace(6)* %2 to i32
  %32 = insertelement <4 x i32> <i32 poison, i32 0, i32 16, i32 163756>, i32 %31, i32 0
  %33 = call nsz arcp float @llvm.amdgcn.s.buffer.load.f32(<4 x i32> %32, i32 0, i32 0) #4
  %34 = call nsz arcp float @llvm.amdgcn.s.buffer.load.f32(<4 x i32> %32, i32 4, i32 0) #4
  %35 = call nsz arcp float @llvm.amdgcn.s.buffer.load.f32(<4 x i32> %32, i32 8, i32 0) #4
  %36 = call nsz arcp float @llvm.amdgcn.s.buffer.load.f32(<4 x i32> %32, i32 12, i32 0) #4
  %37 = call nsz arcp float @llvm.amdgcn.rcp.f32(float %33) #4
  %38 = call nsz arcp float @llvm.amdgcn.rcp.f32(float %34) #4
  %39 = call nsz arcp float @llvm.amdgcn.rcp.f32(float %35) #4
  %40 = call nsz arcp float @llvm.amdgcn.rcp.f32(float %36) #4
  %41 = extractelement <4 x float> %30, i32 0
  %42 = fmul nsz arcp float %41, %37
  %43 = extractelement <4 x float> %30, i32 1
  %44 = fmul nsz arcp float %43, %38
  %45 = extractelement <4 x float> %30, i32 2
  %46 = fmul nsz arcp float %45, %39
  %47 = extractelement <4 x float> %30, i32 3
  %48 = fmul nsz arcp float %47, %40
  %49 = insertvalue <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> undef, i32 %4, 4
  %50 = insertvalue <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> %49, float %42, 5
  %51 = insertvalue <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> %50, float %44, 6
  %52 = insertvalue <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> %51, float %46, 7
  %53 = insertvalue <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> %52, float %48, 8
  %54 = insertvalue <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> %53, float %20, 19
  ret <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> %54
}

; Function Attrs: nounwind readnone speculatable willreturn
declare float @llvm.amdgcn.interp.mov(i32 immarg, i32 immarg, i32 immarg, i32) #1

; Function Attrs: nounwind readonly willreturn
declare <4 x float> @llvm.amdgcn.image.load.mip.2d.v4f32.i32(i32 immarg, i32, i32, i32, <8 x i32>, i32 immarg, i32 immarg) #2

; Function Attrs: nounwind readnone willreturn
declare float @llvm.amdgcn.s.buffer.load.f32(<4 x i32>, i32, i32 immarg) #3

; Function Attrs: nounwind readnone speculatable willreturn
declare float @llvm.amdgcn.rcp.f32(float) #1

attributes #0 = { "InitialPSInputAddr"="0xb077" "denormal-fp-math"="ieee,ieee" "denormal-fp-math-f32"="preserve-sign,preserve-sign" "target-features"="+DumpCode" }
attributes #1 = { nounwind readnone speculatable willreturn }
attributes #2 = { nounwind readonly willreturn }
attributes #3 = { nounwind readnone willreturn }
attributes #4 = { nounwind readnone }

!0 = !{}

For those of you who can’t read NIR and/or LLVM or don’t want to sift through all that, let me reduce it down to the important lines:

GLSL:

vec4 color = texelFetch2D(tex, ivec2(tc), tc.w);

NIR:

vec4 32 ssa_5 = (float32)txf ssa_2 (texture_deref), ssa_2 (sampler_deref), ssa_3 (coord), ssa_4 (lod)

LLVM:

%30 = call nsz arcp <4 x float> @llvm.amdgcn.image.load.mip.2d.v4f32.i32(i32 15, i32 %23, i32 %25, i32 %27, <8 x i32> %29, i32 0, i32 0) #4

; Function Attrs: nounwind readonly willreturn
declare <4 x float> @llvm.amdgcn.image.load.mip.2d.v4f32.i32(i32 immarg, i32, i32, i32, <8 x i32>, i32 immarg, i32 immarg) #2

attributes #2 = { nounwind readonly willreturn }
attributes #4 = { nounwind readnone }

In NIR, a texelFetch() shows up as a texture instruction. NIR has a special instruction type just for textures called nir_tex_instr to handle the combinatorial explosion of possibilities when it comes to all the different ways you can access a texture. In this particular case, the texture opcode is nir_texop_txf for a texel fetch and it is passed a texture, a sampler, a coordinate and an LOD. Pretty standard stuff.

In AMD-flavored LLVM IR, this turns into a magic intrinsic function called llvm.amdgcn.image.load.mip.2d.v4f32.i32. A bunch of information about the operation such as the fact that it takes a mip parameter and returns a vec4 is encoded in the function name. The AMD back-end then knows how to turn this into the right sequence of hardware instructions to load from a texture.

There are a couple of important things to note here. First is the @llvm.amdgcn prefix on the function name. This is an entirely AMD-specific function. If I dumped out the LLVM from the Intel windows drivers for that same GLSL, it would use a different function name with a different encoding for the various bits of ancillary information such as the return type. Even though both drivers share LLVM, in theory, the way they encode graphics operations is entirely different. If you looked at NVIDIA, you would find a third encoding. There is no standardization.

Why is this important? Well, one of the most common arguments I hear from people for why we should all be using LLVM for graphics is because it allows for code sharing. Everyone can leverage all that great work that happens in upstream LLVM. Except it doesn’t. Not really. Sure, you can get LLVM’s algebraic optimizations and code motion etc. But you can’t share any of the optimizations that are really interesting for graphics because nothing graphics-related is common. Could it be standardized? Probably. But, in the state it’s in today, any claim that two graphics compilers are sharing significant optimizations because they’re both LLVM-based is a half-truth at best. And it will never become standardized unless someone other than AMD decides to put their back-end into upstream LLVM and they decide to work together.

The second important bit about that LLVM function call is that LLVM has absolutely no idea what that function does. All it knows is that it’s been decorated nounwind, readonly, and willreturn. The readonly gives it a bit of information so it knows it can move the function call around a bit since it won’t write misc data. However, it can’t even eliminate redundant texture ops because, for all LLVM knows, a second call will return a different result. While LLVM has pretty good visibility into the basic math in the shader, when it comes to anything that touches image or buffer memory, it’s flying entirely blind. The Intel LLVM-based graphics compiler tries to improve this somewhat by using actual LLVM pointers for buffer memory so LLVM gets a bit more visibility but you still end up with a pile of out-of-thin-air pointers that all potentially alias each other so it’s pretty limited.

In contrast, NIR knows exactly what sort of thing nir_texop_txf is and what it does. It knows, for instance, that, even though it accesses external memory, the API guarantees that nothing shifts out from under you so it’s fine to eliminate redundant texture calls. For nir_texop_tex (texture() in GLSL), it knows that it takes implicit derivatives and so it can’t be moved into non-uniform control-flow. For things like SSBO and workgroup memory, we know what kind of memory they’re touching and can do alias analysis that’s actually aware of buffer bindings.
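To give a rough feel for what “knowing what the op does” buys a pass, here is a simplified sketch of the kind of predicate a NIR pass can build directly on the texture opcode. This is not an actual Mesa pass and it ignores many real-world cases; it only shows that the semantic information lives in the IR itself:

#include "nir.h"

/* Simplified sketch: may this texture instruction be sunk into
 * non-uniform (divergent) control flow?  Ops that take implicit
 * derivatives may not; explicit-LOD ops are fine.  Real passes handle
 * many more opcodes and other instruction types. */
static bool
tex_requires_uniform_control_flow(nir_instr *instr)
{
   if (instr->type != nir_instr_type_tex)
      return false;

   nir_tex_instr *tex = nir_instr_as_tex(instr);

   switch (tex->op) {
   case nir_texop_tex:  /* texture(): implicit derivatives */
   case nir_texop_txb:  /* texture() with bias: implicit derivatives */
      return true;
   case nir_texop_txf:  /* texelFetch(): explicit LOD, no derivatives */
   case nir_texop_txl:  /* textureLod(): explicit LOD, no derivatives */
      return false;
   default:
      /* Be conservative about anything not listed. */
      return true;
   }
}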

Code sharing

When people try to justify their use of LLVM to me, there are typically two major benefits they cite. The first is that LLVM lets them take advantage of all this academic compiler work. In the previous section, I explained why this is a weak argument at best. The second is that embracing LLVM for graphics lets them share code with their compute compiler. Does that mean that we’re against sharing code? Not at all! In fact, NIR lets us get far more code sharing than most companies do by using LLVM.

The difference is the axis for sharing. This is something I ran into trying to explain myself to people at Intel all the time. They’re usually only thinking about how to get the Intel OpenCL driver and the Intel D3D12 driver to share code. With NIR, we have compiler code shared effectively across 20 years of hardware from 8 different vendors and at least 4 APIs. So while Intel’s Linux Vulkan and OpenCL drivers don’t share a single line of compiler code, it’s not like we went off and hand-coded a whole compiler stack just for Intel Linux Vulkan.

As an example of this, consider nir_lower_tex(), a pass that lowers various different types of texture operations to other texture operations. It can, among other things:

  • Lower texture projectors away by doing the division in the shader,

  • Lower texelFetchOffset() to texelFetch(),

  • Lower rectangle textures by dividing the coordinate by the result of textureSize(),

  • Lower texture swizzles to swizzling in the shader,

  • Lower various forms of textureGrad*() to textureLod*() under various conditions,

  • Lower imageSize(i, lod) with an LOD to imageSize(i, 0) and some shader math,

  • And much more…

Exactly what lowering is needed is highly hardware dependent (except projectors; only old Qualcomm hardware has those) but most of them are needed by at least two different vendor’s hardware. While most of these are pretty simple, when you get into things like turning derivatives into LODs, the calculations get complex and we really don’t want everyone typing it themselves if we can avoid it.
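To give an idea of what using such a shared pass looks like from a backend’s point of view, here is a minimal sketch. The option field names are from memory and vary a bit between Mesa versions, so treat it as illustrative rather than copy-paste ready:

#include "nir.h"

/* Illustrative only: a backend fills out nir_lower_tex_options with the
 * lowerings its hardware needs and runs the shared pass. */
static void
my_driver_lower_textures(nir_shader *shader)
{
   nir_lower_tex_options opts = {
      /* Lower texture projectors for every sampler dimension
       * (a per-dimension bitmask). */
      .lower_txp = ~0u,
      /* Turn rectangle-texture lookups into normalized 2D lookups. */
      .lower_rect = true,
   };

   nir_lower_tex(shader, &opts);
}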

And texture lowering is just one example. We’ve got dozens of passes for everything from lowering read-only images to textures for OpenCL, to lowering built-in functions like frexp() to simpler math, to flipping gl_FragCoord and gl_PointCoord when rendering upside down, which is required to implement OpenGL on Linux window systems. All that code is in one central place where it’s usable by all the graphics drivers on Linux.

Tight driver integration

I mentioned earlier that having your compiler out-of-tree is painful from a packaging and release point-of-view. What I haven’t addressed yet is just how tight driver/compiler integration has to be. It depends a lot on the API and hardware, of course, but the interface between compiler and driver is often very complex. We make it look very simple on the API side where you have descriptor sets (or bindings in GL) and then you access things from them in the shader. Simple, right? Hah!

In the Intel Linux Vulkan driver, we can access a UBO one of four ways depending on a complex heuristic:

  • We try to find up to 4 small ranges of commonly used UBO constants and push those into the shader as push constants.

  • If we can’t push it all and it fits inside the hardware’s 240 entry binding table, we create a descriptor for it and put it in the binding table.

  • Depending on the hardware generation, UBOs successfully bound to descriptors might be accessed as SSBOs or we might access them through the texture unit.

  • If we run out of entries in the binding table or if it’s in a ray-tracing stage (those don’t have binding tables), we fall back to doing bounds checking in the shader and access it using raw 64-bit GPU addresses.

And that’s just UBOs! SSBO binding has a similar level of complexity and also depends on the SSBO operations done in the shader. Textures have silent fall-back to bindless if we have too many, etc. In order to handle all this insanity, we have a compiler pass called anv_nir_apply_pipeline_layout() which lives in the driver. The interface between that pass and the rest of the driver is quite complex and can communicate information about exactly how things are actually laid out. We do have to serialize it to put it all in the pipeline cache so that limits the complexity some but we don’t have to worry about keeping the interface stable at all because it lives in the driver.
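Boiled down to pseudocode, the heuristic above has roughly this shape. To be clear, these names are hypothetical and this is not the real anv_nir_apply_pipeline_layout() logic, which has many more inputs; it only illustrates why the pass needs so much driver-specific information:

#include <stdbool.h>

/* Hypothetical sketch of the UBO access decision described above. */
enum ubo_access_method {
   UBO_ACCESS_PUSH_CONSTANTS,
   UBO_ACCESS_BINDING_TABLE,
   UBO_ACCESS_BOUNDS_CHECKED_ADDRESS,
};

static enum ubo_access_method
choose_ubo_access(bool fits_in_push_range,
                  bool binding_table_has_room,
                  bool is_ray_tracing_stage)
{
   if (fits_in_push_range)
      return UBO_ACCESS_PUSH_CONSTANTS;

   /* Ray-tracing stages have no binding tables at all. */
   if (binding_table_has_room && !is_ray_tracing_stage)
      return UBO_ACCESS_BINDING_TABLE;

   /* Fall back to bounds-checked access through raw 64-bit GPU
    * addresses. */
   return UBO_ACCESS_BOUNDS_CHECKED_ADDRESS;
}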

We also have passes for handling YCbCr format conversion, turning multiview into instanced rendering and constructing a gl_ViewID in the shader based on the view mask and the instance number, and a handful of other tasks. Each of these requires information from the VkPipelineCreateInfo and some of them result in magic push constants which the driver has to know need pushing.

Trying to do that with your compiler in another project would be insane. So how does AMD do it with their LLVM compiler? Good question! They either do it in NIR or as part of the NIR to LLVM conversion. By the time the shader gets to LLVM, most of the GL or Vulkanisms have been translated to simpler constructs, keeping the driver/LLVM interface manageable. It also helps that AMD’s hardware binding model is crazy simple and was basically designed for an API like Vulkan.

Structured control-flow

One of the riskier decisions we made when designing NIR was to make all control-flow inherently structured. Instead of branch and conditional branch instructions like LLVM or SPIR-V has, NIR has control-flow nodes in a tree structure. The root of the tree is always a nir_function_impl. In each function is a list of control-flow nodes that may be nir_block, nir_if, or nir_loop. An if has a condition and then and else cases. A loop is a simple infinite loop and there are nir_jump_break and nir_jump_continue instructions which act exactly like their C counterparts.
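For a feel of what that tree looks like from a pass’s perspective, here is a minimal sketch that just prints the structure. It is written from memory against NIR’s control-flow types, so details may be off, and real passes normally use helpers like nir_foreach_block() instead of open-coding the walk:

#include <stdio.h>
#include "nir.h"

/* Minimal sketch: recursively walk a list of control-flow nodes.
 * Blocks are leaves; ifs and loops contain nested lists of the same
 * node types, which is what makes the IR inherently structured. */
static void
walk_cf_list(struct exec_list *list, int depth)
{
   foreach_list_typed(nir_cf_node, node, node, list) {
      switch (node->type) {
      case nir_cf_node_block:
         printf("%*sblock\n", depth * 2, "");
         break;
      case nir_cf_node_if: {
         nir_if *nif = nir_cf_node_as_if(node);
         printf("%*sif\n", depth * 2, "");
         walk_cf_list(&nif->then_list, depth + 1);
         walk_cf_list(&nif->else_list, depth + 1);
         break;
      }
      case nir_cf_node_loop: {
         nir_loop *loop = nir_cf_node_as_loop(node);
         printf("%*sloop\n", depth * 2, "");
         walk_cf_list(&loop->body, depth + 1);
         break;
      }
      default:
         break;
      }
   }
}

/* Entry point would be something like: walk_cf_list(&impl->body, 0); */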

At the time, the decision to keep things structured was made from pure pragmatism. We had structure coming out of GLSL and most of the back-ends expected structure. Why break everything? It did mean that, when we started writing control-flow manipulation passes, things were a lot harder. A dead control-flow pass in an unstructured IR is trivial: delete any conditional branch whose condition is false, replace it with an unconditional branch if the condition is true, then delete any unreachable blocks and merge blocks as necessary. Done. In a structured IR, it’s a lot more fiddly. You have to manually collapse if ladders, and deleting the unconditional break at the end of a loop is equivalent to loop unrolling. But we got over that hump, built tools to make it less painful, and have implemented most of the important control-flow optimizations at this point. In exchange, back-ends get structure, which is something most GPUs want thanks to the SIMT model they use.

What we didn’t see coming when we made that decision (2014, remember?) was wave/subgroup ops. In the last several years, the SIMT nature of shader execution has slowly gone from an implementation detail to something that’s baked into all modern 3D and compute APIs and shader languages. With that shift has come the need to be consistent about re-convergence. If we say “texture() has to be in uniform control flow”, is the following shader ok?

#version 120

varying vec2 tc;
uniform sampler2D tex;
out vec4 fragColor;
void main()
{
    if (tc.x > 1.0)
        tc.x = 1.0;

    fragColor = texture(tex, tc);
}

Obviously, it should be. But what guarantees that you’re actually in uniform control-flow by the time you get to the texture() call? In an unstructured IR, once you diverge, it’s really hard to guarantee convergence. Of course, every GPU vendor with an LLVM-based compiler has algorithms for trying to maintain or re-create the structure but it’s always a bit fragile. Here’s an even more subtle example:

#version 120

varying vec2 tc;
uniform sampler2D tex;
out vec4 fragColor;
void main()
{
    /* Block 0 */
    float x = tc.x;
    while (1) {
        /* Block 1 */
        if (x < 1.0) {
            /* Block 2 */
            tc.x = x;
            break;
        }

        /* Block 3 */
        x = x - 1.0;
    }

    /* Block 4 */
    fragColor = texture(tex, tc);
}

The same question of validity holds but there’s something even trickier in here. Can the compiler merge block 4 and block 2? If so, where should it put it? To a CPU-centric compiler like LLVM, it looks like it would be fine to merge the two and put it all in block 2. In fact, since texture ops are expensive and block 2 is deeper inside control-flow, it may think the resulting shader would be more efficient if it did. And it would be wrong on both counts.

First, the loop exit condition is non-uniform and, since texture() takes derivatives, it’s illegal to put it in non-uniform control-flow. (Yes, in this particular case, the result of those derivatives might be a bit wonky.) Second, due to the SIMT nature of execution, you really don’t want the texture op in the loop. In the worst case, a 32-wide execution will hit block 2 32 separate times whereas, if you guarantee re-convergence, it only hits block 4 once.

The fact that NIR’s control-flow is structured from start to finish has been a hidden blessing here. Once we get the structure figured out from SPIR-V decorations (which is annoyingly challenging at times), we never lose that structure and the re-convergence information it implies. NIR knows better than to move derivatives into non-uniform control-flow and its code-motion passes are tuned assuming a SIMT execution model. What has become a constant fight for people working with LLVM is a non-issue for us. The only thing that has been a challenge has been dealing with SPIR-V’s less than obvious structure rules and trying to make sure we properly structurize everything that’s legal. (It’s been getting better recently.)

Side-note: NIR does support OpenCL SPIR-V which is unstructured. To handle this, we have nir_jump_goto and nir_jump_goto_if instructions which are allowed only for a very brief period of time. After the initial SPIR-V to NIR conversion, we run a couple passes and then structurize. After that, it remains structured for the rest of the compile.

Algebraic optimizations

Every GPU compiler engineer has horror stories about something some app developer did in a shader. Sometimes it’s the fault of the developer and sometimes it’s just an artifact of whatever node-based visual shader building system the game engine presents to the artists and how it’s been abused. On Linux, however, it can get even more entertaining. Not only do we have those shaders that were written for DX9, where someone lost the code so they ran them through a DX9 to HLSL translator and then through FXC, but then, when they ported the app to OpenGL so it could run on Linux, they did a DXBC to GLSL conversion with some horrid tool. The end result is x != 0 implemented with three levels of nested function calls, multiple splats out to a vec4, and a truly impressive pile of control-flow. I only wish I were joking….

To chew through this mess, we have nir_opt_algebraic(). We’ve implemented a little language for expressing these expression trees using python tuples and nir_opt_algebraic.py. To get a sense for what this looks like, let’s look at some excerpts from nir_opt_algebraic.py starting with the simple description at the top:

# Written in the form (<search>, <replace>) where <search> is an expression
# and <replace> is either an expression or a value.  An expression is
# defined as a tuple of the form ([~]<op>, <src0>, <src1>, <src2>, <src3>)
# where each source is either an expression or a value.  A value can be
# either a numeric constant or a string representing a variable name.
#
# <more details>

optimizations = [
   ...
   (('iadd', a, 0), a),

This rule is a good starting example because it’s so straightforward. It looks for an integer add operation of something with zero and gets rid of it. A slightly more complex example removes redundant fmax opcodes:

(('fmax', ('fmax', a, b), b), ('fmax', a, b)),

Since it’s written in python, we can also write little rule generators if the same thing applies to a bunch of opcodes or if you want to generalize across types:

# For any float comparison operation, "cmp", if you have "a == a && a cmp b"
# then the "a == a" is redundant because it's equivalent to "a is not NaN"
# and, if a is a NaN then the second comparison will fail anyway.
for op in ['flt', 'fge', 'feq']:
   optimizations += [
      (('iand', ('feq', a, a), (op, a, b)), ('!' + op, a, b)),
      (('iand', ('feq', a, a), (op, b, a)), ('!' + op, b, a)),
   ]

Because we’ve made adding new optimizations so incredibly easy, we have a lot of them. Not just the simple stuff I’ve highlighted above, either. We’ve got at least two cases where someone hand-rolled bitfieldReverse() and we match a giant pattern and turn it into a single HW instruction. (Some UE4 demo and Cyberpunk 2077, if you want to know who to blame. They hand-roll it differently, of course.) We also have patterns to chew through all the garbage from D3D9 to HLSL conversion where they emit piles of x ? 1.0 : 0.0 everywhere because D3D9 didn’t have real Boolean types. All told, as of the writing of this blog post, we have 1911 such search and replace patterns.

Not only have we made it easy to add new patterns but the nir_search framework has some pretty useful smarts in it. The expression I first showed matches a + 0 and replaces it with a but nir_search is smart enough to know that nir_op_iadd is commutative and so it also matches 0 + a without having to write two expressions. We also have syntax for detecting constants, handling different bit sizes, and applying arbitrary C predicates based on the SSA value. Since NIR is actually a vector IR (we support a lot of vec4-based hardware), nir_search also magically handles swizzles for you.

You might think 1911 patterns is a lot and it is. Doesn’t that take forever? Isn’t it O(NPS) where N is the number of instructions, P is the number of patterns and S is the average pattern size or something like that? Nope! A couple years ago, Connor Abbott converted it to use a finite-state automaton, built at driver compile time, to filter out impossible matches as we go. The result is that the whole pass effectively runs in linear time in the number of instructions.

NIR is a low(ish) level IR

This one continues to surprise me. When we set out to design NIR, the goal was something that was SSA and used flat lists of instructions (not expression trees). That was pretty much the extent of the design requirements. However, whenever you build an IR, you inevitably make a series of choices about what kinds of things you’re going to support natively and what things are going to require emulation or be a bit more painful.

One of the most fundamental choices we made in NIR was that SSA values would be typeless vectors. Each nir_ssa_def has a bit size and a number of vector components and that’s it. We don’t distinguish between integers and floats and we don’t support matrix or composite types. Not supporting matrix types was a bit controversial but it’s turned out fine. We also have to do a bit of juggling to support hardware that doesn’t have native integers because we have to lower integer operations to float and we’ve lost the type information. When working with shaders that come from D3D to OpenGL or Vulkan translators, the type information does more harm than good. I can’t count the number of shaders I’ve seen where they declare vec4 x1 through vec4 x80 at the top and then uintBitsToFloat() and floatBitsToUint() all over everywhere.
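To make “typeless vectors” concrete: conceptually, an SSA value in NIR carries only a component count and a bit size. The sketch below is a simplification of my own, not the real nir_ssa_def declaration, which also tracks the defining instruction, uses, and so on:

#include <stdint.h>

/* Simplified illustration of the idea: nothing here says whether the
 * bits are floats or integers; that is up to the instructions that
 * produce and consume the value. */
typedef struct {
   uint8_t num_components;  /* vector width, e.g. 1-16 */
   uint8_t bit_size;        /* bits per component, e.g. 1, 8, 16, 32, 64 */
} illustrative_ssa_def;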

We also made adding new ALU ops and intrinsics really easy but also added a fairly powerful metadata system for both so the compiler can still reason about them. The lines we drew between ALU ops, intrinsics, texture instructions, and control-flow like break and continue were pretty arbitrary at the time if we’re honest. Texturing was going to be a lot of intrinsics so Connor added an instruction type. That was pretty much it.

The end result, however, has been an IR that’s incredibly versatile. It’s somehow both a high-level and low-level IR at the same time. When we do SPIR-V to NIR translation, we don’t have a separate IR for parsing SPIR-V. We have some data structures to deal with composite types and a handful of other stuff but when we parse SPIR-V opcodes, we go straight to NIR. We’ve got variables with fairly standard dereference chains (those do support composite types), bindings, all the crazy built-ins like frexp(), and a bunch of other language-level stuff. By the time the NIR shows up in your back-end, however, all that’s gone. Crazy built-in functions have been lowered. GL/Vulkan binding with derefs, descriptors, and locations has been turned into byte offsets and indices in a flat binding table. Some drivers have even attempted to emit hardware instructions directly from NIR. (It’s never quite worked but says a lot that they even tried.)

The Intel compiler back-end has probably shrunk by half in terms of optimization and lowering passes in the last seven years because we’re able to do so much in NIR. We’ve got code that lowers storage image access with unsupported formats to other image formats or even SSBO access, splitting of vector UBO/SSBO access that’s too wide for hardware, workarounds for imprecise trig ops, and a bunch of others. All of the interesting lowering is done in NIR. One reason for this is that Intel has two back-ends, one for scalar and one that’s vec4 and any lowering we can do in NIR is lowering that only happens once. But, also, it’s nice to be able to have the full power of NIR’s optimizer run on your lowered code.

As I said earlier, I find the versatility of NIR astounding. We never intended to write an IR that could get that close to hardware. We just wanted SSA for easier optimization writing. But the end result has been absolutely fantastic and has done a lot to accelerate driver development in Mesa.

Conclusion

If you’ve gotten this far, I both applaud and thank you! NIR has been a lot of fun to build and, as you can probably tell, I’m quite proud of it. It’s also been a huge investment involving thousands of man hours but I think it’s been well worth it. There’s a lot more work to do, of course. We still don’t have the ray-tracing situation where it needs to be and OpenCL-style compute needs some help to be really competent. But it’s come an incredibly long way in the last seven years and I’m incredibly proud of what we’ve built and forever thankful to the many many developers who have chipped in and fixed bugs and contributed optimization and lowering passes.

Hopefully, this post provides some additional background and explanation for the big question of why Mesa carries its own compiler stack. And maybe, just maybe, someone will get excited enough about it to play around with it and even contribute! One can hope, right?

January 26, 2022

(This post was first published with Collabora on Jan 25, 2022.)

A Pixel's Color

My work on Wayland and Weston color management and HDR support has been full of learning new concepts and terms. Many of them are crucial for understanding how color works. I started out so ignorant that I did not know how to blend two pixels together correctly. I did not even know that I did not know - I was just doing the obvious blend, and that was wrong. Now I think I know what I know and do not know, and I also feel that most developers around window systems and graphical applications are as uneducated as I was.

Color knowledge is surprisingly scarce in my field it seems. It is not enough that I educate myself. I need other people to talk to, to review my work, and to write patches that I will be reviewing. With the hope of making it even a little bit easier to understand what is going on with color I wrote the article: A Pixel's Color.

The article goes through most of the important concepts, trying to give you, a programmer, a vague idea of what they are. It does not explain everything too well, because I want you to be able to read through it, but it still got longer than I expected. My intention is to tell you about things you might not know about, so that you would at least know what you do not know.

A warm thank you to everyone who reviewed and commented on the article.

A New Documentation Repository

Originally, the Wayland CM&HDR extension merge request included documentation about how color management would work on Wayland. The actual protocol extension specification cannot even begin to explain all that.

To make that documentation easier to revise and contribute to, I proposed to move it into a new repository: color-and-hdr. That also allowed us to widen the scope of the documentation, so we can easily include things outside of Wayland: EGL, Vulkan WSI, DRM KMS, and more.

I hope that color-and-hdr documentation repository gains traction and becomes a community maintained effort in gathering information about color and HDR on Linux, and that we can eventually move it out of my personal namespace to become truly community owned.

January 17, 2022

Hello, Collabora!

Ever since I announced that I was leaving Intel, there’s been a lot of speculation as to where I’d end up. I left it a bit quiet over the holidays but, now that we’re solidly in 2022, it’s time to let it spill. As of January 24, I’ll be at Collabora!

For those of you that don’t know, Collabora is an open-source consultancy. They sell engineering services to companies who are making devices that run Linux and want to contribute to open-source technologies. They’ve worked on everything from automotive to gaming consoles to smart TVs to infotainment systems to VR platforms. I’m not an expert on what Collabora has done over the years so I’ll refer you to their brag sheet for that. Unlike some contract houses, Collabora doesn’t just do engineering for hire. They’re also an ideologically driven company that really believes in upstream and invests directly in upstream projects such as Mesa, Wayland, and others.

My personal history with Collabora is as old as my history as an open-source software developer. My first real upstream work was on Wayland in early 2013. I jumped in with a cunning plan for running a graphics-enabled desktop Linux chroot on an Android device and absolutely no idea what I was getting myself into. Two of the people who not only helped me understand the underbelly of Linux window systems but also helped me learn to navigate the world of open-source software were Daniel Stone and Pekka Paalanen, both of whom were at Collabora then and still are today.

After switching to Mesa when I joined Intel in 2014, I didn’t interact with Collabora devs quite as much since they mostly stayed in the window-system world and I tried to stay in 3D. In the last few years, however, they’ve been building up their 3D team and doing some really interesting work. Alyssa Rosenzweig and I have worked quite a bit together on various NIR passes as part of her work on Panfrost and now agx. I also worked with Boris Brezillon and Erik Faye-Lund on some of the CLOn12, GLOn12, and Zink work which layers OpenGL and OpenCL on top of D3D12 and Vulkan. In case you haven’t figured it out already from my glowing review, Collabora has some top-notch people who are doing great work and I’m excited to be joining the team and working more closely with them.

So how did this happen? What convinced me to leave the cushy corporate job and join a tiny (compared to Intel) open-source company? It’s not been for lack of opportunities. I get pinged by recruiters on LinkedIn on a regular basis and certain teams in the industry have been rather persistent. I’ve thought quite a lot over the years about where I’d want to go if I ever left Intel. Intel has been my engineering home for 7.5 years and has provided the strange cocktail on which I’ve built my career: a stable team, well-funded upstream open-source work, fairly cutting edge hardware, and an IHV seat at Khronos. Every place I’d ever considered going would mean losing one or more of those things and, until Collabora, no one had given me a good enough reason to give any of that up.

Back in September, I was chatting on IRC with other Mesa devs about OpenCL, SPIR-V, and some corner-case we were missing in the compiler when the following exchange happened:

11:39 < jenatali> I hope I get time to get back to CL at some point, I
      hate leaving it half-finished, but stupid corporate priorities
      mean I have to do other stuff instead :P
11:41 < jekstrand> Yeah... Corporations... Why do we work for them
      again?  Oh, right, so we can afford to eat.

About an hour later, Daniel Stone replied privately:

12:40 <daniels> hey so if corporations ever get you down, there are
      always less-corporate options … :)
12:40 <daniels> timing completely coincidental of course
12:42 <jekstrand> Of course...
12:42 <jekstrand> I'm always open to new things if the offer is
      right...

This kicked off the weirdest and most interesting career conversation I’ve had to date. At first, I didn’t believe him. The job he was describing doesn’t exist. No one gets that offer. Not unless you’re Dave Airlie or Linus Torvalds. But, after multiple 1 – 2 hour video chats, more IRC chatter, and an hour chatting with Philippe Kalaf (Collabora’s CEO), they had me convinced. This is real.

So what did Collabora finally offer me that no one else has? Total autonomy. In my new role at Collabora, my mandate consists of two things: invest in and mentor the Collabora 3D graphics team and invest in upstream Linux and open-source graphics however I see fit. I won’t be expected to do any contract work. I may meet with clients from time to time and I’ll likely get involved more with the various Collabora-driven Mesa projects but my primary focus will be on ensuring that upstream is healthy. I won’t be tied to any one driver or hardware vendor either. Sure, it’d be good to do a bit of Panfrost work so I can help Alyssa out since she’s now my coworker and I’ll likely still work on Intel drivers a bit since that’s my home turf. But, at the end of the day, I’m now free to put my effort wherever it’s needed in the stack without concern for corporate priorities. Ray-tracing in RADV? Why not. OpenCL 3.0 for everyone? Sure. Hacking on a new kernel interface for Freedreno? That’s fine too. As far as I’m concerned, when it comes to how I spend my engineering effort, I now report directly to upstream. No strings attached.

One of the interesting side-effects of this is how it will affect my role within Khronos. Collabora is a Khronos member so I still plan to be involved there but it will look different. For several years now (as long as RADV has been a competent driver, really), I’ve always worn two hats at Khronos: Intel and Mesa/Linux. Most of the time, I was representing Intel, but there were always those weird, awkward moments where I helped out the Igalia team working on V3DV or the RADV team. Now that I’m no longer at a hardware vendor, I can really embrace the role of representing Mesa and Linux upstream within Khronos. This doesn’t mean that I’m suddenly going to fix all your Vulkan spec problems overnight but it does mean I’ll be paying a bit more attention to the non-Intel drivers and doing what I can to ensure that all the Vulkan drivers in Mesa are in good shape.

Honestly, I’m still in shock that I was offered this role. It’s a great testament to Collabora’s belief in upstream that they’re willing to fund such a role and it shows an incredible amount of faith in my work. At Intel, I was blessed to be able to work upstream as part of my day job, which isn’t something most open-source software developers get. To have someone believe in your work so much that they’re willing to cut you a paycheck just to keep doing what you’re doing is mind-boggling. I’m truly honored and I hope the work I do in the days, months, and years to come will prove that their faith was well placed.

So, what am I going to be working on with my new found freedom? Do I have any cool new projects planned that are going to turn the industry upside-down? Of course I do! But those are topics for other blog posts.

January 10, 2022

This article is part of a series on how to setup a bare-metal CI system for Linux driver development. Here are the different articles so far:

  • Part 1: The high-level view of the whole CI system, and how to fully control test machines remotely (power on, OS to boot, keyboard/screen emulation using a serial console);
  • Part 2: A comparison of the different ways to generate the rootfs of your test environment, and introducing the boot2container project.

In this article, we will further discuss the role of the CI gateway, and which steps we can take to simplify its deployment, maintenance, and disaster recovery.

This work is sponsored by the Valve Corporation.

Requirements for the CI gateway

As seen in part 1 of this CI series, the testing gateway sits between the test machines and the public network/internet:

      Internet /   ------------------------------+
    Public network                               |
                                       +---------+--------+                USB
                                       |                  +-----------------------------------+
                                       |      Testing     | Private network                   |
Main power (120/240 V) -----+          |      Gateway     +-----------------+                 |
                            |          +------+--+--------+                 |                 |
                            |                 |  | Serial /                 |                 |
                            |            Main |  | Ethernet                 |                 |
                            |            Power|  |                          |                 |
                +-----------+-----------------|--+--------------+   +-------+--------+   +----+----+
                |              Switchable PDU |                |   |   RJ45 switch  |   | USB Hub |
                |  Port 0    Port 1        ...|         Port N  |   |                |   |         |
                +----+------------------------+-----------------+   +---+------------+   +-+-------+
                     |                                                  |                  |
                Main |                                                  |                  |
                Power|                                                  |                  |
            +--------|--------+               Ethernet                  |                  |
            |                 +-----------------------------------------+   +----+----+    |
            |  Test Machine 1 |            Serial (RS-232 / TTL)            |  Serial |    |
            |                 +---------------------------------------------+  2 USB  +----+ USB
            +-----------------+                                             +---------+

The testing gateway's role is to expose the test machines to the users, either directly or via GitLab/GitHub. As such, it will likely require the following components:

  • a host Operating System;
  • a config file describing the different test machines;
  • a bunch of services to expose said machines and deploy their test environment on demand.

Since the gateway is connected to the internet, both the OS and the different services need to be kept updated relatively often to prevent your CI farm from becoming part of a botnet. This creates interesting issues:

  1. How do we test updates ahead of deployment, to minimize downtime due to bad updates?
  2. How do we make updates atomic, so that we never end up with a partially-updated system?
  3. How do we rollback updates, so that broken updates can be quickly reverted?

These issues can thankfully be addressed by running all the services in a container (as systemd units), started using boot2container. Updating the operating system and the services would simply be done by generating a new container, running tests to validate it, pushing it to a container registry, rebooting the gateway, then waiting while the gateway downloads and executes the new services.

Using boot2container does not, however, fix the issue of how to update the kernel or boot configuration when the system fails to boot the current one. Indeed, if the kernel, boot2container initramfs, and kernel command line are stored locally, they can only be modified via an SSH connection, which requires the machine to be reachable; if a broken update prevents it from booting, the gateway will be bricked until an operator boots an alternative operating system.

The easiest way not to brick your gateway after a broken update is to power it through a switchable PDU (so that we can power cycle the machine), and to download the kernel, initramfs (boot2container), and the kernel command line from a remote server at boot time. This is fortunately possible even through the internet by using fancy bootloaders, such as iPXE, and this will be the focus of this article!

Tune in for part 4 to learn more about how to create the container.

iPXE + boot2container: Netbooting your CI infrastructure from anywhere

iPXE is a tiny bootloader that packs a punch! Not only can it boot kernels from local partitions, but it can also connect to the internet and download kernels/initramfs using HTTP(S). Even more impressive is its little scripting engine, which executes boot scripts rather than the declarative boot configurations used by bootloaders like GRUB. This enables creating loops, endlessly retrying to boot until one method finally succeeds!

Let's start with a basic example, and build towards a production-ready solution!

Netbooting from a local server

In this example, we will focus on netbooting the gateway from a local HTTP server. Let's start by reviewing a simple script that makes iPXE acquire an IP from the local DHCP server, then download and execute another iPXE script from http://<ip of your dev machine>:8000/boot.ipxe. If any step fails, the script restarts from the beginning until a successful boot is achieved.

#!ipxe

echo Welcome to Valve infra's iPXE boot script

:retry
echo Acquiring an IP
dhcp || goto retry # Keep retrying getting an IP, until we get one
echo Got the IP: $${netX/ip} / $${netX/netmask}

echo

echo Chainloading from the iPXE server...
chain http://<ip of your dev machine>:8000/boot.ipxe

# The boot failed, let's restart!
goto retry

Neat, right? Now, we need to generate a bootable ISO image that starts iPXE and runs the above script by default. We will then flash this ISO to a USB pendrive:

$ git clone git://git.ipxe.org/ipxe.git
$ make -C ipxe/src -j`nproc` bin/ipxe.iso EMBED=<boot script file>
$ sudo dd if=ipxe/src/bin/ipxe.iso of=/dev/sdX bs=1M conv=fsync status=progress

Once the pendrive is plugged into the gateway, ensure that you boot from it; you should see the iPXE bootloader acquire an IP and then fail to download the script from http://<ip of your dev machine>:8000/boot.ipxe, since it does not exist yet. So, let's write one:

#!ipxe

kernel /files/kernel b2c.container="docker://hello-world"
initrd /files/initrd
boot

This script specifies the following elements:

  • kernel: Download the kernel at http://<ip of your dev machine>:8000/files/kernel, and set the kernel command line to ask boot2container to start the hello-world container
  • initrd: Download the initramfs at http://<ip of your dev machine>:8000/files/initrd
  • boot: Boot the specified boot configuration

Assuming your gateway has an architecture supported by boot2container, you may now download the kernel and initrd from boot2container's releases page. In case it is unsupported, create an issue, or a merge request to add support for it!

Now that you have created all the necessary files for the boot, start the web server on your development machine:

$ ls
boot.ipxe  initrd  kernel
$ python -m http.server 8000
Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...
<ip of your gateway> - - [09/Jan/2022 15:32:52] "GET /boot.ipxe HTTP/1.1" 200 -
<ip of your gateway> - - [09/Jan/2022 15:32:56] "GET /kernel HTTP/1.1" 200 -
<ip of your gateway> - - [09/Jan/2022 15:32:54] "GET /initrd HTTP/1.1" 200 -

If everything went well, the gateway should, after a couple of seconds, start downloading the boot script, then the kernel, and finally the initramfs. Once done, your gateway should boot Linux, run docker's hello-world container, then shut down.

Congratulations on netbooting your gateway! However, the current solution has one annoying constraint: it requires a trusted local network and server because we are using HTTP rather than HTTPS... On an untrusted network, a man in the middle could override your boot configuration and take over your CI...

If we were using HTTPS, we could download our boot script/kernel/initramfs directly from any public server, even GIT forges, without fear of any man in the middle! Let's try to achieve this!

Netbooting from public servers

In the previous section, we managed to netboot our gateway from the local network. In this section, we try to improve on it by netbooting using HTTPS. This enables booting from a public server hosted at places such as Linode for $5/month.

As I said earlier, iPXE supports HTTPS. However, if you are anything like me, you may be wondering how such a small bootloader could know which root certificates to trust. The answer is that iPXE generates an SSL certificate at compilation time, which is then used to sign all of the root certificates trusted by Mozilla (the default), or any set of certificates you may want. See iPXE's crypto page for more information.

WARNING: iPXE currently does not like certificates exceeding 4096 bits. This can be a limiting factor when trying to connect to existing servers. We hope to one day fix this bug, but in the meantime, you may be forced to use a 2048-bit Let's Encrypt certificate on a self-hosted web server. See our issue for more information.

WARNING 2: iPXE only supports a limited set of ciphers. You'll need to make sure they are listed in nginx's ssl_ciphers configuration: AES-128-CBC:AES-256-CBC:AES256-SHA256:AES128-SHA256:AES256-SHA:AES128-SHA

To get started, install NGINX + Let's Encrypt on your server, following your favourite tutorial, copy the boot.ipxe, kernel, and initrd files to the root of the web server, then make sure you can download them using your browser.
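For reference, here is a rough sketch of what the nginx server block could look like (the domain name, certificate paths, and web root are placeholders to adapt to your setup; the ssl_ciphers line lists the ciphers mentioned in the warning above):

server {
    listen 443 ssl;
    server_name boot.example.com;

    # Let's Encrypt certificate (2048-bit key, see the first warning above)
    ssl_certificate     /etc/letsencrypt/live/boot.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/boot.example.com/privkey.pem;

    # Restrict the ciphers to the ones iPXE supports
    ssl_ciphers AES-128-CBC:AES-256-CBC:AES256-SHA256:AES128-SHA256:AES256-SHA:AES128-SHA;

    # Serves boot.ipxe, kernel, and initrd
    root /var/www/boot;
}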

With this done, we just need to edit iPXE's general config C header to enable HTTPS support:

$ sed -i 's/#undef\tDOWNLOAD_PROTO_HTTPS/#define\tDOWNLOAD_PROTO_HTTPS/' ipxe/src/config/general.h

Then, let's update our boot script to point to the new server:

#!ipxe

echo Welcome to Valve infra's iPXE boot script

:retry
echo Acquiring an IP
dhcp || goto retry # Keep retrying getting an IP, until we get one
echo Got the IP: $${netX/ip} / $${netX/netmask}

echo

echo Chainloading from the iPXE server...
chain https://<your server>/boot.ipxe

# The boot failed, let's restart!
goto retry

And finally, let's re-compile iPXE, reflash the gateway pendrive, and boot the gateway!

$ make -C ipxe/src -j`nproc` bin/ipxe.iso EMBED=<boot script file>
$ sudo dd if=ipxe/src/bin/ipxe.iso of=/dev/sdX bs=1M conv=fsync status=progress

If all went well, the gateway should boot and run the hello-world container once again! Let's continue our journey by provisioning and backing up the local storage of the gateway!

Provisioning and backups of the local storage

In the previous section, we managed to control the boot configuration of our gateway via a public HTTPS server. In this section, we will improve on that by provisioning and backing up any local files the gateway container may need.

Boot2container has a nice feature that enables you to create a volume, provision it from a bucket in an S3-compatible cloud storage, and sync back any local changes. This is done by adding the following arguments to the kernel command line:

  • b2c.minio="s3,${s3_endpoint},${s3_access_key_id},${s3_access_key}": URL and credentials to the S3 service
  • b2c.volume="perm,mirror=s3/${s3_bucket_name},pull_on=pipeline_start,push_on=changes,overwrite,delete": Create a perm podman volume, mirror it from the bucket ${s3_bucket_name} when booting the gateway, then push any local change back to the bucket. Delete or overwrite any existing file when mirroring.
  • b2c.container="-ti -v perm:/mnt/perm docker://alpine": Start an alpine container, and mount the perm container volume to /mnt/perm

Pretty, isn't it? Provided that your bucket is configured to save all the revisions of every file, this trick will kill three birds with one stone: initial provisioning, backup, and automatic recovery of the files in case the local disk fails and gets replaced with a new one!

The issue is that the boot configuration is currently open for everyone to see, if they know where to look. This means that anyone could tamper with your local storage or even use your bucket to store their files...

Securing the access to the local storage

To prevent attackers from stealing our S3 credentials by simply pointing their web browser to the right URL, we can authenticate incoming HTTPS requests using an SSL client certificate. A different certificate would be embedded in every gateway's iPXE bootloader and checked by NGINX before serving the boot configuration for that particular gateway. By limiting access to a machine's boot configuration to its associated client certificate fingerprint, we even prevent compromised machines from accessing the data of other machines.
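As a rough illustration of the idea (this is not the exact valve-infra configuration; the CA file and directory paths are placeholders), nginx can enforce this with a client certificate check and serve one directory of boot files per certificate fingerprint:

# Only accept clients presenting a certificate signed by our own CA
ssl_client_certificate /etc/nginx/ipxe-clients-ca.pem;
ssl_verify_client on;

location /files/ {
    # $ssl_client_fingerprint is the SHA1 fingerprint of the client
    # certificate embedded in the gateway's iPXE build: one directory
    # of boot files per gateway.
    alias /home/ipxe/files/$ssl_client_fingerprint/;
}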

Additionally, secrets should not be kept in the kernel command line, as any process executed on the gateway could easily gain access to them by reading /proc/cmdline. To address this issue, boot2container has a b2c.extra_args_url argument to source additional parameters from the given URL. If this URL is generated every time the gateway downloads its boot configuration, can be accessed only once, and expires soon after being created, then secrets can be kept private inside boot2container and not be exposed to the containers it starts.

Implementing these suggestions in a blog post is a little tricky, so I suggest you check out valve-infra's ipxe-boot-server component for more details. It provides a Makefile that makes it super easy to generate working certificates and create bootable gateway ISOs, a small python-based web service that will serve the right configuration to every gateway (including one-time secrets), and step-by-step instructions to deploy everything!

Assuming you decided to use this component and followed the README, you should then configure the gateway in this way:

$ pwd
/home/ipxe/valve-infra/ipxe-boot-server/files/<fingerprint of your gateway>/
$ ls
boot.ipxe  initrd  kernel  secrets
$ cat boot.ipxe
#!ipxe

kernel /files/kernel b2c.extra_args_url="${secrets_url}" b2c.container="-v perm:/mnt/perm docker://alpine" b2c.ntp_peer=auto b2c.cache_device=auto
initrd /files/initrd
boot
$ cat secrets
b2c.minio="bbz,${s3_endpoint},${s3_access_key_id},${s3_access_key}" b2c.volume="perm,mirror=bbz/${s3_bucket_name},pull_on=pipeline_start,push_on=changes,overwrite,delete"

And that's it! We finally made it to the end, and created a secure way to provision our CI gateways with the wanted kernel, Operating System, and even local files!

When Charlie Turner and I started designing this system, we felt it would be a clean and simple way to solve our problems with our CI gateways, but the implementation ended up being quite a bit trickier than the high-level view... especially the SSL certificates! However, the certainty that we can now deploy updates and fix our CI gateways even when they are physically inaccessible to us (provided the hardware and PDU are fine) definitely made it all worth it, and made the prospect of having users depending on our systems less scary!

Let us know how you feel about it!

Conclusion

In this post, we focused on provisioning the CI gateway with its boot configuration and local files via the internet. This drastically reduces the risk that updating the gateway's kernel would result in an extended loss of service, as the kernel configuration can quickly be reverted by changing the boot configuration files, which are served from a cloud service provider.

The local file provisioning system also doubles as a backup and disaster-recovery system, which will automatically kick in in case of hardware failure thanks to the constant mirroring of the local files to an S3-compatible cloud storage bucket.

In the next post, we will be talking about how to create the infra container, and how we can minimize downtime during updates by not needing to reboot the gateway.

That's all for now, thanks for making it to the end!

January 07, 2022

PyCook

A few months ago, I went on a quest to better digitize and collect a bunch of the recipes I use on a regular basis. Like most people, I’ve got a 3-ring binder of stuff I’ve printed from the internet, a box with the usual 4x6 cards, most of which are hand-written, and a stack of cookbooks. I wanted something that could be both digital and physical and which would make recipes easier to share. I also wanted whatever storage system I developed to be something stupid simple. If there’s one thing I’ve learned about myself over the years it’s that if I make something too hard, I’ll never get around to it.

This led to a few requirements:

  1. A simple human-writable file format

  2. Recipes organized as individual files in a Git repo

  3. A simple tool to convert to a web page and print formats

  4. That tool should be able to produce 4x6 cards

Enter PyCook. It does pretty much exactly what the above says and not much more. The file format is based on YAML and, IMO, is beautiful to look at all by itself and very easy to remember how to type:

name: Swedish Meatballs
from: Grandma Ekstrand

ingredients:
  - 1 [lb] Ground beef (80%)
  - 1/4 [lb] Ground pork
  - 1 [tsp] Salt
  - 1 [tsp] Sugar
  - 1 Egg, beaten
  - 1 [tbsp] Catsup
  - 1 Onion, grated
  - 1 [cup] Bread crumbs
  - 1 [cup] Milk

instructions:
  - Soak crumbs in milk

  - Mix all together well

  - Make into 1/2 -- 3/4 [in] balls and brown in a pan on the stove
    (you don't need to cook them through; just brown)

  - Heat in a casserole dish in the oven until cooked through.

Over Christmas this year, I got my mom and a couple siblings together to put together a list of family favorite recipes. The format is simple enough that my non-technical mother was able to help type up recipes and they needed very little editing before PyCook would consume them. To me, that’s a pretty good indicator that the file format is simple enough. 😁

The one bit of smarts it does have is around quantities and units. One thing that constantly annoys me with recipes is the inconsistency with abbreviations of units. Take tablespoons, for instance. I’ve seen it abbreviated “T” (as opposed to “t” for teaspoon), “tbsp”, or “tblsp”, sometimes with a “.” after the abbreviation and sometimes not. To handle this, I have a tiny macro language where units have standard abbreviations and are placed in brackets. This is then substituted with the correct abbreviation, letting me change them all in one go if I ever want to. It’s also capable of handling plurals properly, so when you type [cup] it will turn into either “cup” or “cups” depending on the associated number. It also has smarts to detect “1/4” and turn it into the vulgar fraction character “¼” for HTML output or a nice LaTeX fraction when doing PDF output.
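To give a rough idea of how such a substitution can work, here is a simplified sketch in Python (this is illustrative only, not PyCook’s actual implementation; the unit table and regex are made up for the example):

import re

# Canonical abbreviation and plural form for each bracketed unit.
UNITS = {
    "tsp": ("tsp", "tsp"),
    "tbsp": ("tbsp", "tbsp"),
    "cup": ("cup", "cups"),
    "lb": ("lb", "lb"),
    "in": ("in", "in"),
}

# Plain fractions to vulgar-fraction characters for HTML output.
FRACTIONS = {"1/4": "¼", "1/2": "½", "3/4": "¾"}

def render(line):
    def substitute_unit(match):
        amount, unit = match.group(1), match.group(2)
        singular, plural = UNITS[unit]
        return f"{amount} {singular if amount == '1' else plural}"
    # Replace "[unit]" tokens, pluralizing based on the preceding amount.
    line = re.sub(r"(\S+) \[(\w+)\]", substitute_unit, line)
    for plain, vulgar in FRACTIONS.items():
        line = line.replace(plain, vulgar)
    return line

print(render("1 [cup] Bread crumbs"))   # 1 cup Bread crumbs
print(render("1/4 [lb] Ground pork"))   # ¼ lb Ground pork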

That brings me to the PDF generator. One of my requirements was the ability to produce 4x6 cards that I could use while cooking instead of having an electronic device near all those fluids. I started with the LaTeX template I got from my friend Steve Schulteis for making a PDF cookbook and adapted it to 4x6 cards, one recipe per card. And the results look pretty nice, I think.

Meatballs recipe as a 4x6 card

It’s even able to nicely paginate the 4x6 cards into a double-sided 8.5x11 PDF which prints onto Avery 5389 templates.

Other output formats include an 8.5x11 PDF cookbook with as many recipes per page as will fit and an HTML web version generated using Sphinx which is searchable and has a nice index.

P.S. Yes, that’s a real Swedish meatball recipe and it makes way better meatballs than the ones from IKEA.

New blog!

This week, I’ve taken advantage of some of my funemployment time to rework our website infrastructure. As part of my new job (more on that soon!) I’ll be communicating publicly more and I need a better blog space. I set up a Blogger blog a few years ago but I really hate it. It’s not that Blogger is a terrible platform per se. It just doesn’t integrate well with the rest of my website, and the comments I got were 90% spam.

At the time, I used Blogger because I didn’t want to mess with implementing a blog on my own website infrastructure. Why? The honest answer is an object lesson in software engineering. The last time I re-built my website, I thought that building a website generator sounded like a fantastic excuse to learn some Ruby. Why not? It’s a great programming language for web stuff and learning new languages is good, right? While those are both true, software maintainability is important. When I went to try and add a blog, it’d been nearly 4 years since I’d written a line of Ruby code, the static site generation framework I was using (nanoc) had moved to a backwards-incompatible new version, and I had no clue how to move my site forward without re-learning Ruby and nanoc and rewriting it from scratch. I didn’t have the time for that.

This time, I learned my lesson and went all C programmer on it. The new infra is built in Python, a language I use nearly daily in my Mesa work and, instead of using someone else’s static site generation infra, I rolled my own. I’m using Jinja2 for templating as well as Pygments for syntax highlighting and Pandoc for Markdown-to-HTML conversion. They’re all libraries, not frameworks, so they take a lot less learning to figure out and remember how to use. Jinja2 is something of a framework (it’s a templating language) but it’s so obvious and user-friendly that it’s basically impossible to forget how to use.

Sure, my infra isn’t nearly as nice as some. I don’t have caching and I have to re-build the whole thing every time. I also didn’t bother to implement RSS feed support, comments, or any of those other bloggy features. But, frankly, I don’t care. Even on a crappy raspberry pi, my website builds in a few seconds. As far as RSS goes, who actually uses RSS readers these days? Just follow me on Twitter (@jekstrand_) if you want to know when I blog stuff. Comments? 95% of the comments I got on Blogger were spam anyway. If you want to comment, reply to my tweet.

The old blog lived at jason-blog.jlekstrand.net while the new one lives at www.jlekstrand.net/jason/blog/. I’ve set up my webserver to re-direct from the old one and gone through the painful process of checking every single post to ensure the re-directs work. Any links you may have should still work.

So there you have it! A new website framework written in all of 2 days. Hopefully, this one will be maintainable and, if I need to extend it, I can without re-learning whole programming languages. I’m also hoping that now that blogging is as easy as writing some markdown and doing git push, I’ll actually blog more.

January 03, 2022

It appears that Google created a handy tool that helps find the command which causes a GPU hang/crash. It is called Graphics Flight Recorder (GFR) and was open-sourced a year ago but didn’t receive any attention. From the readme:

The Graphics Flight Recorder (GFR) is a Vulkan layer to help track down and identify the cause of GPU hangs and crashes. It works by instrumenting command buffers with completion tags. When an error is detected a log file containing incomplete command buffers is written. Often the last complete or incomplete commands are responsible for the crash.

It requires VK_AMD_buffer_marker support; however, this extension is rather trivial to implement - I only had to copy-paste the code from our vkCmdSetEvent implementation and that was it. Note that, at the moment of writing, GFR unconditionally uses VK_AMD_device_coherent_memory, which could be manually patched out for it to run on other GPUs.

GFR already helped me to fix hangs in “Alien: Isolation” and “Digital Combat Simulator”. In both cases the hang was in a compute shader and the output from GFR looked like:

...
- # Command:
        id: 6/9
        markerValue: 0x000A0006
        name: vkCmdBindPipeline
        state: [SUBMITTED_EXECUTION_COMPLETE]
        parameters:
          - # parameter:
            name: commandBuffer
            value: 0x000000558CFD2A10
          - # parameter:
            name: pipelineBindPoint
            value: 1
          - # parameter:
            name: pipeline
            value: 0x000000558D3D6750
      - # Command:
        id: 6/9
        message: '>>>>>>>>>>>>>> LAST COMPLETE COMMAND <<<<<<<<<<<<<<'
      - # Command:
        id: 7/9
        markerValue: 0x000A0007
        name: vkCmdDispatch
        state: [SUBMITTED_EXECUTION_INCOMPLETE]
        parameters:
          - # parameter:
            name: commandBuffer
            value: 0x000000558CFD2A10
          - # parameter:
            name: groupCountX
            value: 5
          - # parameter:
            name: groupCountY
            value: 1
          - # parameter:
            name: groupCountZ
            value: 1
        internalState:
          pipeline:
            vkHandle: 0x000000558D3D6750
            bindPoint: compute
            shaderInfos:
              - # shaderInfo:
                stage: cs
                module: (0x000000558F82B2A0)
                entry: "main"
          descriptorSets:
            - # descriptorSet:
              index: 0
              set: 0x000000558E498728
      - # Command:
        id: 8/9
        markerValue: 0x000A0008
        name: vkCmdPipelineBarrier
        state: [SUBMITTED_EXECUTION_NOT_STARTED]
...

After confirming that the corresponding vkCmdDispatch is indeed the call which hangs, in both cases I made an Amber test which fully simulated the call. For a compute shader, this is relatively easy to do since all you need is to save the decompiled shader and the buffers it uses. Luckily, in both cases these Amber tests reproduced the hangs.

With standalone reproducers, the problems were much easier to debug, and fixes were made shortly: MR#14044 for “Alien: Isolation” and MR#14110 for “Digital Combat Simulator”.

Unfortunately this tool is not a panacea:

  • It likely would fail to help with unrecoverable hangs, where it would be impossible to read the completion tags back.
  • Or with cases where the mere addition of the tags “fixes” the issue, which may happen with synchronization bugs.
  • If draw/dispatch calls run in parallel on the GPU, writing the tags may force them to execute sequentially, or the tags may be imprecise.

Anyway, it’s easy to use so you should give it a try.

December 09, 2021
Starting with version 5.17, the kernel supports the privacy screens built into the LCD panel of some new laptop models.

This means that the drm drivers will now return -EPROBE_DEFER from their probe() method on models with a builtin privacy screen when the privacy screen provider driver has not been loaded yet.

To avoid any regressions, distros should modify their initrd generation tools to include privacy screen provider drivers in the initrd (at least on systems with a privacy screen), before 5.17 kernels start showing up in their repos.

If this change is not made, then users using a graphical bootsplash (plymouth) will get an extra boot delay of up to 8 seconds (DeviceTimeout in plymouthd.defaults) before plymouth shows up; and when using disk encryption where the LUKS password is requested from the initrd, the system will fall back to text mode after these 8 seconds.

I've written a patch with the necessary changes for dracut, which might be useful as an example for how to deal with this in other initrd generators, see: https://github.com/dracutdevs/dracut/pull/1666

I've also filed bugs for tracking this for Fedora, openSUSE, Arch, Debian and Ubuntu.

Introduction

One of the big issues I have when working on Turnip driver development is that compiling either Mesa or VK-GL-CTS takes a lot of time to complete, no matter how powerful the embedded board is. There are reasons for that: typically those boards have a limited amount of RAM (8 GB in the best case), a slow storage disk (typically UFS 2.1 on-board storage) and CPUs that are not so powerful compared with x86_64 desktop alternatives.

Photo of the Qualcomm® Robotics RB3 Platform embedded board that I use for Turnip development.

To fix this, it is recommended to do cross-compilation; however, installing a development environment for cross-compilation can be cumbersome and error-prone depending on the toolchain you use. One alternative is to use a distributed compilation system that allows cross-compilation, like Icecream.

Icecream is a distributed compilation system that is very useful when you have to compile big projects and/or on low-spec machines, while having powerful machines in the local network that can do that job instead. However, it is not perfect: the linking stage is still done on the machine that submits the job, which, depending on the available RAM, could be too much for it (you can alleviate this a bit by using ZRAM, for example).

One of the features that icecream has over its alternatives is that there is no need to install the same toolchain in all the machines as it is able to share the toolchain among all of them. This is very useful as we will see below in this post.

Installation

Debian-based systems

$ sudo apt install icecc

Fedora systems

$ sudo dnf install icecream

Compile it from sources

You can compile it from sources.

Configuration of icecc scheduler

You need to have an icecc scheduler in the local network that will balance the load among all the available nodes connected to it.

It does not matter which machine is the scheduler; you can use any of them, as it is quite lightweight. To run the scheduler, execute the following command:

$ sudo icecc-scheduler

Notice that the machine running this command is going to be the scheduler, but it will not participate in the compilation process by default unless you run the iceccd daemon as well (see next step).

Setup on icecc nodes

Launch daemon

First you need to run the iceccd daemon as root. This is not needed on Debian-based systems, as its systemd unit is enabled by default.

You can do that using systemd in the following way:

$ sudo systemctl start iceccd

Or you can enable the daemon at startup time:

$ sudo systemctl enable iceccd

The daemon will connect automatically to the scheduler that is running in the local network. If that’s not the case, or there is more than one scheduler, you can run it standalone and give the scheduler’s IP as a parameter:

sudo iceccd -s <ip_scheduler>

Enable icecc compilation

With ccache

If you use ccache (recommended option), you just need to add the following in your .bashrc:

export CCACHE_PREFIX=icecc

Without ccache

To use it without ccache, you need to add icecc’s path to the $PATH environment variable so it is picked up before the system compilers:

export PATH=/usr/lib/icecc/bin:$PATH

Execution

Same architecture

If you followed the previous steps, any time you compile C/C++ code, the work will be distributed among the fastest nodes in the network. Notice that it takes into account system load, network connection, and number of cores, among other variables, to decide which node will compile the object file.

Remember that the linking stage is always done in the machine that submits the job.
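For example, a Mesa build could look something like the following (just a sketch; the job count is arbitrary and only makes sense if it roughly matches the number of cores icecc can reach in the network):

$ export CCACHE_PREFIX=icecc
$ meson setup build
$ ninja -C build -j48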

Different architectures (example cross-compiling for aarch64 on x86_64 nodes)

Icemon showing my x86_64 desktop (maxwell) cross-compiling a job for my aarch64 board (rb3).

Preparation on x86_64 machine

On one x86_64 machine, you need to create a toolchain. This is not done automatically by icecc, as you can have different toolchains for cross-compilation.

Install cross-compiler

For example, you can install the cross-compiler from the distribution repositories:

For Debian-based systems:

sudo apt install crossbuild-essential-arm64

For Fedora:

$ sudo dnf install gcc-aarch64-linux-gnu gcc-c++-aarch64-linux-gnu

Create toolchain for icecc

Finally, to create the toolchain to share in icecc:

$ icecc-create-env --gcc /usr/bin/aarch64-linux-gnu-gcc /usr/bin/aarch64-linux-gnu-g++

This will create a <hash>.tar.gz file. The <hash> is used to identify the toolchain to distribute among the nodes, in case there is more than one. But don’t worry: once it is copied to a node, it won’t be copied again, as icecc detects it is already present.

Note: it is important that the toolchain is compatible with the target machine. For example, if my aarch64 board is using Debian 11 Bullseye, it is better if the cross-compilation toolchain is created from a Debian Bullseye x86_64 machine (a VM also works), because you avoid incompatibilities like having different glibc versions.

If you have installed Debian 11 Bullseye on your aarch64 board, you can use my own cross-compilation toolchain for x86_64 and skip this step.

Copy the toolchain to the aarch64 machine

scp <hash>.tar.gz aarch64-machine-hostname:

Preparation on aarch64

Once the toolchain (<hash>.tar.gz) is copied to the aarch64 machine, you just need to export the following in your .bashrc:

# Icecc setup for crosscompilation
export CCACHE_PREFIX=icecc
export ICECC_VERSION=x86_64:~/<hash>.tar.gz

Execute

Just compile on the aarch64 machine and the jobs will be distributed among your x86_64 machines as well. Note that the jobs may also be shared with other aarch64 machines if icecc decides so; there is no need to do any extra step.

It is important to remark that the cross-compilation toolchain creation is only needed once, as icecream will copy it to all the x86_64 machines that execute any job launched by this aarch64 machine. However, you do need to copy this toolchain to any other aarch64 machine that will use icecream resources for cross-compiling.

Icecream monitor

Icemon

This is an interesting graphical tool to see the status of the icecc nodes and the jobs under execution.

Install on Debian-based systems

$ sudo apt install icecc-monitor

Install on Fedora

$ sudo dnf install icemon

Install it from sources

You can compile it from sources.

Acknowledgments

Even though icecream has good cross-compilation documentation, it was the post written 8 years ago by my Igalia colleague Víctor Jáquez that convinced me to set up icecream as explained in this post.

Hope you find this info as useful as I did :-)

December 06, 2021

On the road to AppStream 1.0, a lot of items from the long todo list have been done so far – only one major feature remains: external release descriptions, which is a tricky one to implement and specify. For AppStream 1.0 it needs to be either present or rejected though, as it would be a major change in how release data is handled in AppStream.

Besides the 1.0 preparation work, the recent 0.15 release and the releases before it come with their very own large set of changes that are worth a look and may be interesting for your application to support. But first, for a change that affects the implementation and not the XML format:

1. Completely rewritten caching code

Keeping all AppStream data in memory is expensive, especially if the data is huge (as on Debian and Ubuntu with their large repositories generated from desktop-entry files as well) and if processes using AppStream are long-running. The latter is more and more the case, not only does GNOME Software run in the background, KDE uses AppStream in KRunner and Phosh will use it too for reading form factor information. Therefore, AppStream via libappstream provides an on-disk cache that is memory-mapped, so data is only consuming RAM if we are actually doing anything with it.

Previously, AppStream used an LMDB-based cache in the background, with indices for fulltext search and other common search operations. This was a very fast solution, but it also came with limitations: LMDB’s maximum key size of 511 bytes became a problem quite often, adjusting the maximum database size (since it has to be set at opening time) was annoyingly tricky, and building dedicated indices for each search operation was very inflexible. In addition to that, the caching code was changed multiple times in the past to allow system-wide metadata to be cached per-user, as some distributions didn’t (want to) build a system-wide cache and therefore ran into performance issues when XML was parsed repeatedly for generation of a temporary cache. On top of all that, the cache was designed around the concept of “one cache for data from all sources”, which meant that we had to rebuild it entirely if just a small aspect changed, like a MetaInfo file being added to /usr/share/metainfo, which was very inefficient.

To shorten a long story, the old caching code was rewritten with the new concepts of caches not necessarily being system-wide and caches existing for more fine-grained groups of files in mind. The new caching code uses Richard Hughes’ excellent libxmlb internally for memory-mapped data storage. Unlike LMDB, libxmlb knows about the XML document model, so queries can be much more powerful and we do not need to build indices manually. The library is also already used by GNOME Software and fwupd for parsing of (refined) AppStream metadata, so it works quite well for that usecase. As a result, search queries via libappstream are now a bit slower (it very much depends on the query, roughly 20% on average), but they can be much more powerful. The caching code is a lot more robust, which should speed up the startup time of applications. And in addition to all of that, the AsPool class has gained a flag to allow it to monitor AppStream source data for changes and refresh the cache fully automatically and transparently in the background.

All software written against the previous version of the libappstream library should continue to work with the new caching code, but to make use of some of the new features, software using it may need adjustments. A lot of methods have now been deprecated, too.

2. Experimental compose support

Compiling MetaInfo and other metadata into AppStream collection metadata, extracting icons, language information, refining data and caching media is an involved process. The appstream-generator tool does this very well for data from Linux distribution sources, but the tool is also pretty “heavyweight” with lots of knobs to adjust, an underlying database and a complex algorithm for icon extraction. Embedding it into other tools via anything else but its command-line API is also not easy (due to D’s GC initialization, and because it was never written with that feature in mind). Sometimes a simpler tool is all you need, so the libappstream-compose library as well as appstreamcli compose are being developed at the moment. The library contains building blocks for developing a tool like appstream-generator, while the CLI tool allows you to simply extract metadata from any directory tree, which can be used by e.g. Flatpak. For this to work well, a lot of appstream-generator’s D code is translated into plain C, so the implementation stays identical but the language changes.

Ultimately, the generator tool will use libappstream-compose for any general data refinement, and only implement things necessary to extract data from the archive of distributions. New applications (e.g. for new bundling systems and other purposes) can then use the same building blocks to implement new data generators similar to appstream-generator with ease, sharing much of the code that would be identical between implementations anyway.

3. Supporting user input controls

Want to advertise that your application supports touch input? Keyboard input? Has support for graphics tablets? Gamepads? Sure, nothing is easier than that with the new control relation item and supports relation kind (since 0.12.11 / 0.15.0, details):

<supports>
  <control>pointing</control>
  <control>keyboard</control>
  <control>touch</control>
  <control>tablet</control>
</supports>

4. Defining minimum display size requirements

Some applications are unusable below a certain window size, so you do not want to display them in a software center that is running on a device with a small screen, like a phone. In order to encode this information in a flexible way, AppStream now contains a display_length relation item to require or recommend a minimum (or maximum) display size that the described GUI application can work with. For example:

<requires>
  <display_length compare="ge">360</display_length>
</requires>

This will make the application require a display length greater than or equal to 360 logical pixels. A logical pixel (also known as a device-independent pixel) is the amount of pixels that the application can actually draw in one direction. Since screens, especially phone screens but also screens on a desktop, can be rotated, the display_length value will be checked against the longest edge of a display by default (by explicitly specifying the shorter edge, this can be changed).

This feature is available since 0.13.0, details. See also Tobias Bernard’s blog entry on this topic.

5. Tags

This is a feature that was originally requested for the LVFS/fwupd, but one of the great things about AppStream is that we can take very project-specific ideas and generalize them so something comes out of them that is useful for many. The new tags tag allows people to tag components with an arbitrary namespaced string. This can be useful for project-internal organization of applications, as well as to convey certain additional properties to a software center, e.g. an application could mark itself as “featured” in a specific software center only. Metadata generators may also add their own tags to components to improve organization. AppStream gives no recommendations as to how these tags are to be interpreted except for them being a strictly optional feature. So any meaning is something clients and metadata authors need to negotiate. It therefore is a more specialized usecase of the already existing custom tag, and I expect it to be primarily useful within larger organizations that produce a lot of software components that need sorting. For example:

<tags>
  <tag namespace="lvfs">vendor-2021q1</tag>
  <tag namespace="plasma">featured</tag>
</tags>

This feature is available since 0.15.0, details.

6. MetaInfo Creator changes

The MetaInfo Creator (source) tool is a very simple web application that provides you with a form to fill out and will then generate MetaInfo XML to add to your project after you have answered all of its questions. It is an easy way for developers to add the required metadata without having to read the specification or any guides at all.

Recently, I added support for the new control and display_length tags, resolved a few minor issues and also added a button to instantly copy the generated output to clipboard so people can paste it into their project. If you want to create a new MetaInfo file, this tool is the best way to do it!

The creator tool will also not transfer any data out of your web browser; it is strictly a client-side application.

And that is about it for the most notable changes in AppStream land! Of course there is a lot more: additional tags for the LVFS and content rating have been added, lots of bugs have been squashed, the documentation has been refined a lot and the library has gained a lot of new API to make building software centers easier. Still, there is a lot to do and quite a few open feature requests too. Onwards to 1.0!

December 02, 2021

Khronos submission indicating Vulkan 1.1 conformance for Turnip on Adreno 618 GPU.

It is a great feat, especially for a driver which was created without hardware documentation. And we support features well beyond the bare minimum required for conformance.

But first of all, I want to thank and congratulate everyone working on the driver: Connor Abbott, Rob Clark, Emma Anholt, Jonathan Marek, Hyunjun Ko, Samuel Iglesias. And special thanks to Samuel Iglesias and Ricardo Garcia for tirelessly improving Khronos Vulkan Conformance Tests.


At the start of the year, when I started working on Turnip, I looked at the list of failing tests and thought “It wouldn’t take a lot to fix them!”, right, sure… And so I started fixing issues alongside looking for missing features.

In June there were even more failures than there were in January. How could that be? Of course we were adding new features, and that accounted for some of them. However, even this list was likely not exhaustive because in GitLab CI, instead of running the whole Vulkan CTS suite, we ran 1/3 of it. We didn’t have enough devices to run the whole suite fast enough to make it usable in CI. So I just ran it locally from time to time.

1/3 of the tests doesn’t sound bad and for the most part it’s good enough since we have a huge amount of tests looking like this:

dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_clear_copy
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_clear_copy_format_list
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_clear_load
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_clear_load_format_list
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_clear_texture
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_clear_texture_format_list
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_copy_copy
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_copy_copy_format_list
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_copy_load
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_copy_load_format_list
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_copy_texture
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_copy_texture_format_list
...

Every format, every operation, etc. Tens of thousands of them.

Unfortunately, the selection of tests for a fractional run is as straightforward as possible - just every third test. This bites us when there are single, unique tests, like:

dEQP-VK.fragment_operations.early_fragment.no_early_fragment_tests_depth
dEQP-VK.fragment_operations.early_fragment.no_early_fragment_tests_stencil
dEQP-VK.fragment_operations.early_fragment.early_fragment_tests_depth
dEQP-VK.fragment_operations.early_fragment.early_fragment_tests_stencil
dEQP-VK.fragment_operations.early_fragment.no_early_fragment_tests_depth_no_attachment
dEQP-VK.fragment_operations.early_fragment.no_early_fragment_tests_stencil_no_attachment
dEQP-VK.fragment_operations.early_fragment.early_fragment_tests_depth_no_attachment
dEQP-VK.fragment_operations.early_fragment.early_fragment_tests_stencil_no_attachment
...

Most of them test something unique that has a much higher probability of triggering a special path in the driver compared to the countless image tests. And they fell through the cracks. I even had to fix one test twice because the CI didn’t run it.

A possible solution is to skip tests only when there is a large swath of them and run smaller groups as-is. But it’s likely more productive to just throw more hardware at the issue =).

Not enough hardware in CI

Another problem is that we had only one 6xx sub-generation present in CI - Adreno 630. We distinguish four sub-generations. Not only do they have some different capabilities, there are also differences in the existing ones, causing the same test to pass on CI while being broken on another, newer GPU. Presently in CI we test only Adreno 618 and 630, which are “Gen 1” GPUs, and we claimed conformance only for Adreno 618.

Yet another issue is that we can render in either tiling or bypass (sysmem) mode. There are a few features we can support only when there is no tiling and we render directly into sysmem, and sometimes rendering directly into sysmem is just faster. At the moment we use tiling rendering by default unless we hit an edge case, so by default CTS tests only the tiling rendering path.

We are forcing sysmem mode for a subset of tests on CI, however it’s not enough because the difference between modes is relevant for more than just a few tests. Thus, ideally we should run twice as many tests, and even better would be thrice as many, to also account for the tiling mode without a binning vertex shader.

That issue became apparent when I implemented a magical eight-ball to choose between tiling and bypass modes depending on run-time information, in order to squeeze out more performance (it’s still work-in-progress). The basic idea is that a single draw call, or a few small draw calls, is faster to render directly into system memory than loading the framebuffer into tile memory and storing it back. But almost every single CTS test does exactly this! They do a single draw call or a few draw calls per render pass, which causes all tests to run in bypass mode. Fun!

Now we will be forced to deal with this issue, since with the magic eight-ball games would run partly in tiling mode and partly in bypass mode, making both modes equally important for real-world workloads.

Does conformance matter? Does it reflect anything real-world?

Unfortunately no test suite could wholly reflect what game developers do in their games. However, the amount of tests grows and new tests are getting contributed based on issues found in games and other applications.

When I ran my stash of D3D11 game traces through DXVK on Turnip for the first time, I found a bunch of new crashes and hangs, but it took fixing just a few of them for the majority of games to render correctly. This shows that the Khronos Vulkan Conformance Tests are doing their job, and we at Igalia are striving to make them even better.

One of the extensions released as part of Vulkan 1.2.199 was the VK_EXT_image_view_min_lod extension. I’m happy to see it published, as I have participated in the release process of this extension: from reviewing the spec exhaustively (I even contributed a few things to improve it!) to developing CTS tests for it that will eventually be merged into the CTS repo.

This extension was proposed by Valve to mirror a feature present in Direct3D 12 (check ResourceMinLODClamp here) and Direct3D 11 (check SetResourceMinLOD here). In other words, this extension allows clamping the minimum LOD value accessed by an image view to a minLod value set at image view creation time.

That way, any library or API layer that translates Direct3D 11/12 calls to Vulkan can use the extension to mirror the behavior above on Vulkan directly without workarounds, facilitating the port of Direct3D applications such as games to Vulkan. For example, projects like Vkd3d, Vkd3d-proton and DXVK could benefit from it.

Going into more detail, when enabled this extension changes how the image level selection is calculated and imposes an additional minimum on the image level used for integer texel coordinate operations.

The way to use this feature in an application is very simple (a minimal sketch putting the steps together follows the list):

  • Check that the extension is supported and that the physical device supports the respective feature:
// Provided by VK_EXT_image_view_min_lod
typedef struct VkPhysicalDeviceImageViewMinLodFeaturesEXT {
    VkStructureType    sType;
    void*              pNext;
    VkBool32           minLod;
} VkPhysicalDeviceImageViewMinLodFeaturesEXT;
  • Once you know everything is working, enable both the extension and the feature when creating the device.

  • When you want to create a VkImageView that defines a minLod for image accesses, add the following structure, filled with the value you want, to VkImageViewCreateInfo’s pNext chain.

// Provided by VK_EXT_image_view_min_lod
typedef struct VkImageViewMinLodCreateInfoEXT {
    VkStructureType    sType;
    const void*        pNext;
    float              minLod;
} VkImageViewMinLodCreateInfoEXT;
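
As a minimal sketch of the two interesting steps (names such as supports_image_view_min_lod and create_min_lod_view are mine, error handling is omitted, and the device is assumed to have been created with the extension and feature enabled):

#include <stdbool.h>
#include <vulkan/vulkan.h>

/* Sketch: check whether the minLod feature is available. Also check that
 * VK_EXT_image_view_min_lod is listed by
 * vkEnumerateDeviceExtensionProperties(). */
static bool
supports_image_view_min_lod(VkPhysicalDevice physical_device)
{
    VkPhysicalDeviceImageViewMinLodFeaturesEXT min_lod_features = {
        .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_IMAGE_VIEW_MIN_LOD_FEATURES_EXT,
    };
    VkPhysicalDeviceFeatures2 features2 = {
        .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FEATURES_2,
        .pNext = &min_lod_features,
    };
    vkGetPhysicalDeviceFeatures2(physical_device, &features2);
    return min_lod_features.minLod == VK_TRUE;
}

/* Sketch: create an image view whose accesses are clamped to min_lod. */
static VkImageView
create_min_lod_view(VkDevice device, VkImage image, VkFormat format, float min_lod)
{
    VkImageViewMinLodCreateInfoEXT min_lod_info = {
        .sType = VK_STRUCTURE_TYPE_IMAGE_VIEW_MIN_LOD_CREATE_INFO_EXT,
        .minLod = min_lod, /* e.g. 2.5f: never access a LOD below 2.5 */
    };
    VkImageViewCreateInfo view_info = {
        .sType = VK_STRUCTURE_TYPE_IMAGE_VIEW_CREATE_INFO,
        .pNext = &min_lod_info,
        .image = image,
        .viewType = VK_IMAGE_VIEW_TYPE_2D,
        .format = format,
        .subresourceRange = {
            .aspectMask = VK_IMAGE_ASPECT_COLOR_BIT,
            .baseMipLevel = 0,
            .levelCount = VK_REMAINING_MIP_LEVELS,
            .baseArrayLayer = 0,
            .layerCount = 1,
        },
    };
    VkImageView view = VK_NULL_HANDLE;
    vkCreateImageView(device, &view_info, NULL, &view);
    return view;
}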

And that’s all! As you see, it is a very simple extension.

Happy hacking!

November 24, 2021

I was interested in how much work a vaapi on top of vulkan video proof of concept would be.

My main reason for being interested is actually video encoding: there is no good vulkan video encoding demo yet, and I'm not experienced enough in the area to write one, but I can hack stuff. I think it is probably easier to hack a vaapi encode path into vulkan video encode than to write a demo app myself.

With that in mind I decided to see what decode would look like first. I talked to Mike B (most famous zink author) before he left for holidays, then I ignored everything he told me and wrote a super hack.

This morning I convinced zink vaapi on top of anv, with iris GL doing the presents in mpv, to show me some useful frames of video. However zink vaapi on anv with zink GL is failing miserably (well, green jellyfish).

I'm not sure how much more I'll push on the decode side at this stage; I really wanted it to validate the driver-side code, and I've found a few bugs in there already.

The WIP hacks are at [1]. I might push on to the encode side and see if I can work out what it entails, though the encode spec work is a lot more changeable at the moment.

[1] https://gitlab.freedesktop.org/airlied/mesa/-/commits/zink-video-wip

November 18, 2021

If you own a laptop (Dell, HP, Lenovo) with a WWAN module, it is very likely that the modules are FCC-locked on every boot, and the special FCC unlock procedure needs to be run before they can be used.

Until ModemManager 1.18.2, the unlock was run automatically for the FCC unlock procedures we knew about, but this will no longer happen. Once 1.18.4 is out, the procedure will need to be explicitly enabled by each user, under their own responsibility, or otherwise implicitly enabled after installing an official FCC unlock tool provided by the manufacturer itself.

See a full description of the rationale behind this change in the ModemManager documentation site and the suggested code changes in the gitlab merge request.

If you want to enable the ModemManager-provided unofficial FCC unlock tools once you have installed 1.18.4, run this command (assuming sysconfdir=/etc and datadir=/usr/share) (*):

sudo ln -sft /etc/ModemManager/fcc-unlock.d /usr/share/ModemManager/fcc-unlock.available.d/*

The user-enabled tools in /etc should not be removed during package upgrades, so this should be a one-time setup.

(*) Updated to have one single command instead of a for loop; thanks heftig!

November 15, 2021

Previously I mentioned having AMD VCN h264 support. Today I added initial support for the older UVD engine[1]. This is found on chips from Vega back to SI.

I've only tested it on my Vega so far.

I also worked out the "correct" answer to how to send the reset command properly; however, the nvidia player I'm using as a demo doesn't do things that way yet, so I've forked it for now[2].

The answer is to use vkCmdControlVideoCodingKHR to send a reset the first time a session is used. However, I can't see how the app is meant to know this is necessary, but I've asked the appropriate people.
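
For the curious, the call itself is tiny. Here is a rough sketch based on the provisional VK_KHR_video_queue API (so the names may still change before ratification); the helper name is mine.

/* Sketch: ask the hardware to reset the video session state. This has
 * to be recorded inside a video coding scope, i.e. between
 * vkCmdBeginVideoCodingKHR() and vkCmdEndVideoCodingKHR(). */
static void
reset_video_session(VkCommandBuffer cmd)
{
    VkVideoCodingControlInfoKHR control_info = {
        .sType = VK_STRUCTURE_TYPE_VIDEO_CODING_CONTROL_INFO_KHR,
        .flags = VK_VIDEO_CODING_CONTROL_RESET_BIT_KHR,
    };
    vkCmdControlVideoCodingKHR(cmd, &control_info);
}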

The initial anv branch I mentioned last week is now here[3].


[1] https://gitlab.freedesktop.org/airlied/mesa/-/commits/radv-vulkan-video-uvd-h264

[2] https://github.com/airlied/vk_video_samples/tree/radv-fixes

[3] https://gitlab.freedesktop.org/airlied/mesa/-/tree/anv-vulkan-video-prelim-decode

November 12, 2021

Last week I mentioned I had the basics of h264 decode using the proposed vulkan video on radv. This week I attempted to do the same thing with Intel's Mesa vulkan driver "anv".

Now, I'd previously tried unsuccessfully to get vaapi on crocus working but got sidetracked back into other projects. The Intel h264 decoder hasn't changed a lot across the ivb/hsw/gen8/gen9 era. I ported what I had from crocus to anv and started trying to get something to decode on my WhiskeyLake.

I wrote the code pretty early on and figured out all the things I had to send to the hardware.

The first anv-side bridge to cross was that Vulkan exposes an H264 picture-level decode API, which means you get handed the encoded slice data. However, to program the Intel hw you need to decode the slice header, so I wrote a slice header decoder in some common code. The other thing you need to give the Intel hw is the number of bits in the slice header, which in some encoding schemes is rounded to bytes and in some isn't. Slice headers also have a 3-byte header on them, which the Intel hardware wants you to discard or skip before handing the data to it.

Once I'd fixed up that sort of thing in anv + crocus, I started getting grey I-frames decoded, with later B/P frames using the grey frames as references, so you'd see this kind of weird motion.

That was, I think, 3 days ago. I have stared at this intently for those 3 days, blaming everything from the bitstream encoding to rechecking all my packets (not enough times, though). I had someone else verify they could see grey frames.

Today, after a long discussion about possibilities, I was randomly comparing a frame from the intel-vaapi-driver and from crocus, and I spotted a packet header that the docs say is 34 dwords long, but intel-vaapi was only encoding 16 dwords. I switched crocus to explicitly state a 16-dword length and started seeing my I-frames.

Now the B/P frames still have issues. I don't think I'm getting the ref frames logic right yet, but it felt like a decent win after 3 days of staring at it.

The crocus code is [1]. The anv code isn't cleaned up enough to post a pointer to yet; enterprising people might find it. Next week I'll clean it all up, and then start to ponder upstream paths and shared code for radv + anv. Then h265, maybe.

[1] https://gitlab.freedesktop.org/airlied/mesa/-/tree/crocus-media-wip

November 05, 2021

A few weeks ago I watched Victor's excellent talk on Vulkan Video. This made me question my skills in this area. I'm pretty vague on video processing hardware, and I really have no understanding of H264 or any of the standards. I've been loosely following the Vulkan video group inside Khronos, but I can't say I've understood it or been useful.

radeonsi has a gallium vaapi driver that talks to a firmware-driven encoder/decoder on the hardware; surely copying what it programs can't be that hard. I got an mpv/vaapi setup running and tested some videos on that setup just to get comfortable, and looked at what sort of data was being pushed about.

The thing is, the firmware is doing all the work here; the driver is mostly just responsible for taking semi-parsed h264 bitstream data structures and handing them to the fw API in memory buffers. The resulting decoded image should then magically appear in a buffer.

I then got the demo nvidia video decoder application mentioned in Victor's talk.

I ported the code to radv in a couple of days, but then began a long journey into the unknown. The firmware is quite particular about exactly what it wants and when it wants it. After fixing some interactions with the video player, I started to dig.

Now, vaapi and DXVA (Windows) are context-based APIs. This means they are like OpenGL: you create a context, do a bunch of work, and tear it down, and the driver does all the hw queuing of commands internally. All the state is held in the context. Vulkan is a command-buffer-based API: the application records command buffers and then enqueues those command buffers to the hardware itself.

So the vaapi driver works like this for a video:

create hw ctx, flush, decode, flush, decode, flush, decode, flush, decode, flush, destroy hw ctx, flush

However Vulkan wants things to be more like:

Create Session, record command buffer with (begin, decode, end), send to hw, (begin, decode, end), send to hw, End Session

There is no way to submit things to the hardware at session create/end time.
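
In rough pseudo-C, using the provisional entry points, recording and submitting one frame ends up looking something like this (all the *Info structures and handles are placeholders that have to be filled in elsewhere):

/* Sketch of the Vulkan per-frame pattern; beginInfo, videoBeginInfo,
 * decodeInfo, videoEndInfo, submitInfo and the handles are placeholders. */
vkBeginCommandBuffer(cmd, &beginInfo);
vkCmdBeginVideoCodingKHR(cmd, &videoBeginInfo);   /* bind the video session */
vkCmdDecodeVideoKHR(cmd, &decodeInfo);            /* decode one frame       */
vkCmdEndVideoCodingKHR(cmd, &videoEndInfo);
vkEndCommandBuffer(cmd);

/* The application itself submits the work to the video queue. */
vkQueueSubmit(videoQueue, 1, &submitInfo, VK_NULL_HANDLE);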

After a week or two of hair removal and insightful irc chats, I stumbled over a decent enough workaround to avoid the hw dying and managed to decode an H264 video of some jellyfish.

The work is based on a bunch of other stuff and is in no way suitable for upstreaming yet, not to mention that the Vulkan specification is only beta/provisional, so it can't be used anywhere outside of development.

The preliminary code is in my gitlab repo here[1]. It has a start on h265 decode, but that's not working at all yet, and I think the h264 code still hangs randomly.

I'm not sure where this is going yet, but it was definitely an interesting experiment.

[1]: https://gitlab.freedesktop.org/airlied/mesa/-/commits/radv-vulkan-video-prelim-decode

November 04, 2021

A basic example of the git alias function syntax looks like this.

[alias]
    shortcut = "!f() \
    {\
        echo Hello world!; \
    }; f"

This syntax defines a function f and then calls it. These aliases are executed in a sh shell, which means there's no access to Bash / Zsh specific functionality.

Every command ends with a ; and each line ends with a \. This is easy enough to grok. But when we try to clean up the above snippet and add some quotes around "Hello world!", we hit this obtuse error message.

}; f: 1: Syntax error: end of file unexpected (expecting "}")

This syntax error is caused by quotes needing to be escaped. The reason for this comes down to how git tokenizes and executes these functions. If you're curious …
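
For reference, escaping the inner quotes with a backslash is what keeps git happy, so a working version of the earlier alias could look like this:

[alias]
    shortcut = "!f() \
    {\
        echo \"Hello world!\"; \
    }; f"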