planet.freedesktop.org
February 01, 2023

Fast-linking: This Is Your Howto

I previously wrote a post talking about some optimization work that’s been done with RADV’s VK_EXT_graphics_pipeline_library implementation to improve fast-link performance. As promised, that wasn’t the end of the story. Today’s post will be a bit different, however, as I’ll be assuming all the graphics experts in the audience are already well-versed in all the topics I’m covering.

Also I’m assuming you’re all driver developers interested in improving your VK_EXT_graphics_pipeline_library fast-link performance.

The one exception is that today I’ll be using a specific definition of fast when it comes to fast-linking: to be fast, a driver should be able to fast-link in under 0.01ms. In an extremely CPU-intensive application, this allows even the explodiest of pipeline explosions (100+ fast-links in a single frame) to avoid any sort of hitching/stuttering: 100 fast-links at 0.01ms apiece is only 1ms, a small slice of a 16.7ms frame at 60fps.

Which drivers have what it takes to be fast?

Testing

To begin evaluating fast-link performance, it’s important to have test cases. Benchmarks. The sort that can be easily run, easily profiled, easily understood.

vkoverhead is the premier tool for evaluating CPU overhead in Vulkan drivers, and thanks to Valve, it now has legal support for GPL fast-link using real pipelines from Dota2. That’s right. Acing this synthetic benchmark will have real world implications.

For anyone interested in running these cases, it’s as simple as building and then running:

./vkoverhead -start 135

These benchmark cases will call vkCreateGraphicsPipelines in a tight loop to perform a fast-link on GPL-created pipeline libraries, fast-linking thousands of times per second for easy profiling. The number of iterations per second, in thousands, is then printed.

vkoverhead works with any Vulkan driver on any platform (including Windows!), which means it’s possible to use it to profile and optimize any driver.

Optimization

vkoverhead currently has two cases for GPL fast-link. As they are both extracted directly from Dota2, they have a number of properties in common:

  • similar descriptor layouts/requirements
  • same composition of libraries (all four GPL stages created separately)

Each case tests the following:

  • depthonly is a pipeline containing only a vertex shader, forcing the driver to use its own fragment shader
  • slow is a pipeline that happens to be slow to create on many drivers

Various tools are available on different platforms for profiling, and I’m not going to go into details here. What I’m going to do instead is look into strategies for optimizing drivers. Strategies that I (and others) have employed in real drivers. Strategies that you, if you aren’t shipping a fast-linking implementation of GPL, might be interested in.

First Strategy: Move NO-OP Fragment Shader To Device

The depthonly case explicitly tests whether drivers are creating a new fragment shader for every pipeline that lacks one. Drivers should not do this.

Instead, create a single fragment shader on the device object and reuse it like these drivers do:

In addition to being significantly faster, this also saves some memory.
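The gist of it, as a hypothetical sketch (names invented, not from any shipping driver): compile the no-op fragment shader at most once per device and hand the same object to every fast-linked pipeline that lacks a fragment stage.

#include <pthread.h>
#include <stddef.h>

/* Hypothetical driver types, for illustration only. */
struct foo_shader;

struct foo_device {
   pthread_mutex_t noop_fs_lock;
   struct foo_shader *noop_fs;   /* created once, destroyed with the device */
};

/* Assumed to exist elsewhere in the driver. */
struct foo_shader *foo_compile_noop_fs(struct foo_device *dev);

/* Return the device-wide no-op fragment shader, compiling it on first use. */
static struct foo_shader *
foo_device_get_noop_fs(struct foo_device *dev)
{
   pthread_mutex_lock(&dev->noop_fs_lock);
   if (!dev->noop_fs)
      dev->noop_fs = foo_compile_noop_fs(dev);
   pthread_mutex_unlock(&dev->noop_fs_lock);
   return dev->noop_fs;
}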

Second Strategy: Avoid Copying Shader IR

Regular, optimized pipeline creation typically involves running optimization passes across the shader binaries, possibly even the entire pipeline, to ensure that various speedups can be found. Many drivers copy the internal shader IR in the course of pipeline creation to handle shader variants.

Don’t copy shader IR when trying to fast-link a pipeline.

Copying IR is very expensive, especially in larger shaders. Instead, either precompile unoptimized shader binaries in their corresponding GPL stage or refcount IR structures that must exist during execution. Examples:
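As a rough sketch of the refcounting flavor (hypothetical types, not lifted from any driver): the library owns the IR once, and every pipeline that fast-links against it just takes a reference instead of cloning.

#include <stdint.h>
#include <stdlib.h>

/* Hypothetical refcounted IR wrapper, for illustration only. */
struct foo_shader_ir {
   uint32_t refcount;   /* make this atomic if libraries are shared across threads */
   void *nir;           /* the actual IR, owned by this wrapper */
};

static inline struct foo_shader_ir *
foo_shader_ir_ref(struct foo_shader_ir *ir)
{
   ir->refcount++;
   return ir;
}

static inline void
foo_shader_ir_unref(struct foo_shader_ir *ir)
{
   /* free the IR exactly once, when the last pipeline/library drops it */
   if (--ir->refcount == 0) {
      free(ir->nir);
      free(ir);
   }
}

/* Fast-link path: share the library's IR instead of cloning it. */
static inline void
foo_pipeline_adopt_ir(struct foo_shader_ir **dst, struct foo_shader_ir *src)
{
   *dst = foo_shader_ir_ref(src);
}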

Third Strategy: Avoid Compiling Shaders

This one seems obvious, but it has to be stated.

Do not compile shaders when attempting to achieve fast-link speed.

If you are compiling shaders, this is a very easy place to start optimizing.

Fourth Strategy: Don’t Cache Fast-link Pipelines

There’s no reason to cache a fast-linked pipeline. The amount of time saved by retrieving a cached pipeline should be outweighed by the amount of time required to:

  • compute a key/hash for a given pipeline
  • access the cache

I say should because ideally a driver will be so fast at combining a GPL pipeline that even a cache hit is at best comparable in performance, if not outright slower. Skip all aspects of caching for these pipelines.
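As a rough illustration (hypothetical helper, not from any driver), the check can be as simple as detecting a library link without link-time optimization and bypassing the cache path entirely:

#include <stdbool.h>
#include <vulkan/vulkan.h>

/* Walk the pNext chain for the library-link info. */
static const VkPipelineLibraryCreateInfoKHR *
find_library_info(const void *pnext)
{
   for (const VkBaseInStructure *s = pnext; s; s = s->pNext) {
      if (s->sType == VK_STRUCTURE_TYPE_PIPELINE_LIBRARY_CREATE_INFO_KHR)
         return (const VkPipelineLibraryCreateInfoKHR *)s;
   }
   return NULL;
}

/* Hypothetical driver-side check: no key/hash computation, no cache lookup,
 * and no cache insertion whenever the creation is a fast link. */
static bool
should_use_pipeline_cache(const VkGraphicsPipelineCreateInfo *info)
{
   const VkPipelineLibraryCreateInfoKHR *libs = find_library_info(info->pNext);
   bool fast_link = libs && libs->libraryCount > 0 &&
                    !(info->flags & VK_PIPELINE_CREATE_LINK_TIME_OPTIMIZATION_BIT_EXT);
   return !fast_link;
}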

Fifth Strategy: Misc Profiling

If a driver is still slow after checking for the above items, it’s time to try profiling. It’s surprising what slowdowns drivers will hit. The classics I’ve seen are large memset calls and avoidable allocations.

Some examples:

A Mystery Solved

In my previous post, I alluded to a driver that was shipping a GPL implementation that advertised fast-link but wasn’t actually fast. I saw a lot of guesses. Nobody got it right.

scooby.jpg

It was Lavapipe (me) all along.

As hinted at above, however, this is no longer the case. In fact, after going through the listed strategies, Lavapipe now has the fastest GPL linking in the world.

Obviously it would have to if I’m writing a blog post about optimizing fast-linking, right?

Fast-linking: Initial Comparisons

How fast is Lavapipe’s linking, you might ask?

To answer this, let’s first apply a small patch to bump up Lavapipe’s descriptor limits so it can handle the beefy Dota2 pipelines. With that done, here’s a look at comparisons to other, more legitimate drivers, all running on the same system.

NVIDIA is the gold standard for GPL fast-linking considering how long they’ve been shipping it. They’re pretty fast.

$ VK_ICD_FILENAMES=nvidia_icd.json ./vkoverhead -start 135 -duration 5
vkoverhead running on NVIDIA GeForce RTX 2070:
	* misc numbers are reported as thousands of operations per second
	* percentages for misc cases should be ignored
 135, misc_compile_fastlink_depthonly,                      444,          100.0%
 136, misc_compile_fastlink_slow,                           243,          100.0%

RADV (with pending MRs applied) has gotten incredibly fast over the past week-ish.

$ RADV_PERFTEST=gpl ./vkoverhead -start 135 -duration 5
vkoverhead running on AMD Radeon RX 5700 XT (RADV NAVI10):
	* misc numbers are reported as thousands of operations per second
	* percentages for misc cases should be ignored
 135, misc_compile_fastlink_depthonly,                      579,          100.0%
 136, misc_compile_fastlink_slow,                           537,          100.0%

Lavapipe (with pending MRs applied) blows them both out of the water.

$ VK_ICD_FILENAMES=lvp_icd.x86_64.json ./vkoverhead -start 135 -duration 5
vkoverhead running on llvmpipe (LLVM 15.0.6, 256 bits):
	* misc numbers are reported as thousands of operations per second
	* percentages for misc cases should be ignored
 135, misc_compile_fastlink_depthonly,                     1485,         100.0%
 136, misc_compile_fastlink_slow,                          1464,         100.0%

Even if the NVIDIA+RADV numbers are added together, it’s still not close.

Fast-linking: More Comparisons

If I switch over to a different machine, Intel’s ANV driver has an MR for GPL open, and it’s seeing some movement. Here’s a head-to-head with the champion.

$ ./vkoverhead -start 135 -duration 5
vkoverhead running on Intel(R) Iris(R) Plus Graphics (ICL GT2):
	* misc numbers are reported as thousands of operations per second
	* percentages for misc cases should be ignored
 135, misc_compile_fastlink_depthonly,                      384,          100.0%
 136, misc_compile_fastlink_slow,                           276,          100.0%

$ VK_ICD_FILENAMES=lvp_icd.x86_64.json ./vkoverhead -start 135 -duration 5
vkoverhead running on llvmpipe (LLVM 15.0.6, 256 bits):
	* misc numbers are reported as thousands of operations per second
	* percentages for misc cases should be ignored
 135, misc_compile_fastlink_depthonly,                     1785,         100.0%
 136, misc_compile_fastlink_slow,                          1779,         100.0%

On yet another machine, here’s Turnip, which advertises the fast-link feature. This driver requires a small patch to bump MAX_SETS to 5 since it is hardcoded at 4. I’ve also pinned execution here to the big cores for consistency.

# turnip ooms itself with -duration
$ ./vkoverhead -start 135
vkoverhead running on Turnip Adreno (TM) 618:
	* misc numbers are reported as thousands of operations per second
	* percentages for misc cases should be ignored
 135, misc_compile_fastlink_depthonly,                       73,           100.0%
 136, misc_compile_fastlink_slow,                            23,           100.0%

$ VK_ICD_FILENAMES=lvp_icd.aarch64.json ./vkoverhead -start 135 -duration 5
vkoverhead running on llvmpipe (LLVM 14.0.6, 128 bits):
	* misc numbers are reported as thousands of operations per second
	* percentages for misc cases should be ignored
 135, misc_compile_fastlink_depthonly,                      690,          100.0%
 136, misc_compile_fastlink_slow,                           699,          100.0%

More Analysis

We’ve seen that Lavapipe is unequivocally the champion of fast-linking in every head-to-head, but what does this actually look like in timings?

Here’s a chart that shows the breakdown in milliseconds.

Driver     misc_compile_fastlink_depthonly   misc_compile_fastlink_slow
NVIDIA     0.002ms                           0.004ms
RADV       0.0017ms                          0.0019ms
Lavapipe   0.0007ms                          0.0007ms

ANV        0.0026ms                          0.0036ms
Lavapipe   0.00056ms                         0.00056ms

Turnip     0.0137ms                          0.0435ms
Lavapipe   0.001ms                           0.001ms

As we can see, all of these drivers are “fast”. A single fast-link pipeline isn’t likely to cause any of them to drop a frame.

The driver I’ve got my eye on, however, is Turnip, which is the only one of the tested group that doesn’t quite hit that 0.01ms target. A little bit of profiling might show some easy gains here.

Even More Analysis

For another view of these drivers, let’s examine the relative performance. Since GPL fast-linking is inherently a CPU task that has no relation to the GPU, it stands to reason that a CPU-based driver should be able to optimize for it the best given that there’s already all manner of hackery going on to defer and delay execution. Indeed, reality confirms this, and looking at any profile of Lavapipe for the benchmark cases reveals that the only remaining bottleneck is the speed of malloc, which is to say the speed with which the returned pipeline object can be allocated.

Thus, ignoring potential micro-optimizations of pipeline struct size, it can be said that Lavapipe has effectively reached the maximum speed of the system for fast-linking. From there, we can say that any other driver running on the same system is utilizing some fraction of this power.

Therefore, every other driver’s fast-link performance can be visualized in units of Lavapipe (lvps), i.e. the driver’s score divided by Lavapipe’s score on the same machine, to determine how much gain is possible if things like refactoring time and feasibility are ignored.

Driver   misc_compile_fastlink_depthonly   misc_compile_fastlink_slow
NVIDIA   0.299lvps                         0.166lvps
RADV     0.390lvps                         0.367lvps
ANV      0.215lvps                         0.155lvps
Turnip   0.106lvps                         0.033lvps

The great thing about lvps is that these are comparable units.

At last, we finally have a way to evaluate all these drivers in a head-to-head across different systems.

The results are a bit surprising to me:

  • RADV, with the insane heroics of Samuel “Gotta Go Fast” Pitoiset, has gone from roughly zero lvps last week to first place this week
  • NVIDIA’s fast-linking, while quite fast, is closer in performance to an unoptimized, unlanded ANV MR than it is to RADV
  • Turnip is a mobile driver that both 1) has a GPL implementation and 2) is kinda fast at an objective level?

Key Takeaways

Aside from the strategies outlined above, the key takeaway for me is that there shouldn’t be any hardware limitation to implementing fast-linking. It’s a CPU-based architectural problem, and with enough elbow grease, any driver can aspire to reach nonzero lvps in vkoverhead’s benchmark cases.

January 27, 2023

A New Level Of Speed

I know everyone’s been eagerly awaiting the return of the pasta maker.

The wait is over.

But today we’re going to move away from those dangerous, addictive synthetic benchmarks to look at a different kind of speed. That’s right. Today we’re looking at pipeline compile speed. Some of you are scoffing, mouse pointer already inching towards the close button on the tab.

Pipeline compile speed in the current year? Why should anyone care when we have great tools like Fossilize that can precompile everything for a game over the course of several hours to ensure there’s no stuttering?

It turns out there’s at least one type of pipeline compile that still matters going forward. Specifically, I’m talking about fast-linked pipelines using VK_EXT_graphics_pipeline_library.

Let’s get an appetizer going, some exposition under our belts before we get to the spaghetti we’re all craving.

Pipelines: They’re In Your Games

All my readers are graphics experts. It won’t come as any surprise when I say that a pipeline is a program containing shaders which is used by the GPU. And you all know how VK_EXT_graphics_pipeline_library enables compiling partial pipelines into libraries that can then be combined into a full pipeline. None of you need a refresher on this, and we all acknowledge that I’m just padding out the word count of this post for posterity.

Some of you experts, however, have been so deep into getting those green triangles on the screen to pass various unit tests that you might not be fully aware of the fast-linking property of VK_EXT_graphics_pipeline_library.

In general, compiling shaders during gameplay is (usually) bad. This is (usually) what causes stuttering: the compilation of a pipeline takes longer than the available time to draw the frame, and rendering blocks until compilation completes. The fast-linking property of VK_EXT_graphics_pipeline_library changes this paradigm by enabling pipelines, e.g., for shader variants, to be created fast enough to avoid stuttering.

Typically, this is utilized in applications through the following process:

  • create pipeline libraries from shaders during game startup or load screen
  • wait until pipeline is needed at draw-time
  • fast-link final pipeline and use for draw
  • background compile an optimized version of the same pipeline for future use

In this way, no draw is blocked by a pipeline creation, and optimized pipelines are still used for the majority of GPU operations.
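For concreteness, here’s a minimal sketch of what the fast-link call itself looks like at the API level (error handling and the optimized background compile are omitted, and 'layout' is assumed to be compatible with the libraries):

#include <vulkan/vulkan.h>

/* Sketch of the fast-link step, assuming the four stage libraries
 * (vertex input, pre-rasterization, fragment shader, fragment output)
 * were created earlier with VK_PIPELINE_CREATE_LIBRARY_BIT_KHR. */
static VkPipeline
fast_link_pipeline(VkDevice device, VkPipelineLayout layout,
                   const VkPipeline libraries[4])
{
   const VkPipelineLibraryCreateInfoKHR link_info = {
      .sType = VK_STRUCTURE_TYPE_PIPELINE_LIBRARY_CREATE_INFO_KHR,
      .libraryCount = 4,
      .pLibraries = libraries,
   };
   const VkGraphicsPipelineCreateInfo info = {
      .sType = VK_STRUCTURE_TYPE_GRAPHICS_PIPELINE_CREATE_INFO,
      .pNext = &link_info,
      /* no VK_PIPELINE_CREATE_LINK_TIME_OPTIMIZATION_BIT_EXT here: leaving
       * that flag out is what makes this a fast link rather than a full
       * compile; the background-compiled optimized version sets it later */
      .flags = 0,
      .layout = layout,
   };
   VkPipeline pipeline = VK_NULL_HANDLE;
   vkCreateGraphicsPipelines(device, VK_NULL_HANDLE, 1, &info, NULL, &pipeline);
   return pipeline;
}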

But Why…

…would I care about this if I have Fossilize and a brand new gaming supercomputer with 256 cores all running at 12GHz?

I know you’re wondering, and the answer is simple: not everyone has these things.

Some people don’t have extremely modern computers, which means Fossilize pre-compile of shaders can take hours. Who wants to sit around waiting that long to play a game they just downloaded?

Some games don’t use Fossilize, which means there’s no pre-compile. In these situations, there are two options:

  • compile all pipelines during load screens
  • compile on-demand during rendering

The former option here gives us load times that remind us of the original Skyrim release. The latter probably yields stuttering.

Thus, VK_EXT_graphics_pipeline_library (henceforth GPL) with fast-linking.

The Obvious Problem

What does the “fast” in fast-linking really mean?

How fast is “fast”?

These are great questions that nobody knows the answer to. The only limitation here is that “fast” has to be “fast enough” to avoid stuttering.

Given that RADV is in the process of bringing up GPL for general use, and given that Zink is relying on fast-linking to eliminate compile stuttering, I thought I’d take out my perf magnifying glass and see what I found.

eyeofsauron.jpg

Uh-oh

Obviously we wouldn’t be advertising fast-linking on RADV if it wasn’t fast.

Obviously.

It goes without saying that we care about performance. No credible driver developer would advertise a performance-related feature if it wasn’t performant.

RIGHT?

And it’s not like I tried running Tomb Raider on zink and discovered that the so-called “fast”-link pipelines were being created at a non-fast speed. That would be insane to even consider—I mean, it’s literally in the name of the feature, so if using it caused the game to stutter, or if, for example, I was seeing “fast”-link pipelines being created in 10ms+

mesa.png

Surely I didn’t see that though.

rotated_mesa.png

Surely I didn’t see fast-link pipelines taking more than an entire frame’s worth of time to create.

It’s Fine

Long-time readers know that this is fine. I’m unperturbed by seeing numbers like this, and I can just file a ticket and move on with my life like a normal per—

angry_mesa.png

OBVIOUSLY I CAN’T.

Obviously.

And just as obviously I had to get a second opinion on this, which is why I took my testing over to the only game I know which uses GPL with fast-link: 3D Pinball: Space Cadet DOTA 2!

Naturally it would be DOTA2, along with any other Source Engine 2 game, that uses this functionality.

Thus, I fired up my game, and faster than I could scream MID OR MEEPO into my mic, I saw the unthinkable spewing out in my console:

COMPILE 11425115
COMPILE 39491
COMPILE 11716326
COMPILE 35963
COMPILE 11057200
COMPILE 37115
COMPILE 10738436

Yes, those are all “fast”-linked pipeline compile times in nanoseconds.

Yes, half of those are taking more than 10ms.

pastamaker.jpg

First Steps

The first step is always admitting that you have a problem, but I don’t have a problem. I’m fine. Not upset at all. Don’t read more into it.

As mentioned above, we have great tools in the Vulkan ecosystem like Fossilize to capture pipelines and replay them outside of applications. This was going to be a great help.

I thought.

I fired up a 32bit build of Fossilize, set it to run on Tomb Raider, and immediately it exploded.

fossilize-works.png

Zink has, historically, been the final boss for everything Vulkan-related, so I was unsurprised by this turn of events. I filed an issue, finger-painted ineffectually, and then gave up because I had called in the expert.

That’s right.

Friend of the blog, artisanal bit-wrangler, and a developer whose only speed is -O3 -ffast-math, Hans-Kristian Arntzen took my hand-waving, unintelligible gibbering, and pointing in the wrong direction and churned out a masterpiece in less time than it took RADV to “fast”-link some of those pipelines.

While I waited, I was working at the picosecond-level with perf to isolate the biggest bottleneck in fast-linking.

Fast-linking: Stop Compiling.

My caveman-like, tool-less hunt yielded immediate results: nir_shader_clone during fast-link was taking an absurd amount of time, and on top of that, shaders were still being compiled at this point.

This was a complex problem to solve, and I had lots of other things to do (so many things), which meant I needed to call in another friend of the blog to take over while I did all the things I had to do.

Some of you know his name, and others just know him as “that RADV guy”, but Samuel Pitoiset is the real deal when it comes to driver development. He can crank out an entire extension implementation in less time than it takes me to write one of these long-winded, benchmark-number-free introductions to a blog post, and when I told him we had a huge problem, he dropped* everything and jumped on board.
* and when I say “dropped” I mean he finished finding and fixing another Halo Infinite hang in the time it took me to explain the problem

With lightning speed, Samuel reworked pipeline creation to not do that thing I didn’t want it to do. Because doing any kind of compiling when the driver is instead supposed to be “fast” is bad. Really bad.

How did that affect my numbers?

By now I was tired of dealing with the 32bit nonsense of Tomb Raider and had put all my eggs in the proverbial DOTA2 basket, so I again fired up a round, went to AFK in jungle, and checked my debug prints.

COMPILE 55699
COMPILE 55998
COMPILE 58016
COMPILE 56825
COMPILE 60288
COMPILE 110663
COMPILE 59679
COMPILE 50614
COMPILE 54316

Do my eyes deceive me or is that a 20,000% speedup from a single patch?!

Problem Solved

And so the problem was solved. I went to Dan Ginsburg, who I’m sure everyone knows as the author of this incredible blog post about GPL, and I showed him the improvements and our new timings, and I asked what he thought about the performance now.

Dan looked at me. Looked at the numbers I showed him. Shook his head a single time.

It shook me.

I don’t know what I was thinking.

In my defense, a 20,000% speedup is usually enough to call it quits on a given project. In this case, however, I had the shadow of a competitor looming overhead.

While RADV was now down to 0.05-0.11ms for a fast-link, NVIDIA can apparently do this consistently in 0.02ms.

That’s pretty fast.

Even Faster

By now, the man, the myth, @themaister, Hans-Kristian Arntzen had finished fixing every Fossilize bug that had ever existed and would ever exist in the future, which meant I could now capture and replay GPL pipelines from DOTA2. Fossilize also has another cool feature: it allows for extraction of single pipelines from a larger .foz file, which is great for evaluating performance.

The catch? It doesn’t have any way to print per-pipeline compile timings during a replay, nor does it have a way to sort pipeline hashes based on compile times.

Either I was going to have to write some C++ to add this functionality to Fossilize, or I was going to have to get creative. With my Chromium PTSD in mind, I found myself writing out this construct:

for x in $(fossilize-list --tag 6 dota2.foz); do
    echo "PIPELINE $x"
    RADV_PERFTEST=gpl fossilize-replay --pipeline-hash $x dota2.foz 2>&1|grep COMPILE
done

I’d previously added some in-driver printfs to output compile times for the fast-link pipelines, so this gave me a file with the pipeline hash on one line and the compile timing on the next. I could then sort this and figure out some outliers to extract, yielding slow.foz, a fast-link that consistently took longer than 0.1ms.
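For the curious, the instrumentation itself was nothing fancy. A sketch of the idea (plain monotonic clock here; the real prints were just throwaway printfs inside the driver):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Nanosecond timestamp from a monotonic clock. */
static uint64_t
now_ns(void)
{
   struct timespec ts;
   clock_gettime(CLOCK_MONOTONIC, &ts);
   return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

/* Wrapped around the fast-link path, something like:
 *
 *    uint64_t start = now_ns();
 *    ...create the fast-linked pipeline...
 *    fprintf(stderr, "COMPILE %" PRIu64 "\n", now_ns() - start);
 */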

I took this to Samuel, and we put our perfs together. Immediately, he spotted another bottleneck: SHA1Transform() was taking up a considerable amount of CPU time. This was occurring because the fast-linked pipelines were being added to the shader cache for reuse.

But what’s the point of adding an unoptimized, fast-linked pipeline to a cache when it should take less time to just fast-link and return?

Blammo, another lightning-fast patch from Samuel, and fast-linked pipelines were no longer being considered for cache entries, cutting off even more compile time.

slow.foz was now consistently down to 0.07-0.08ms.

Are We There Yet?

No.

A post-Samuel flamegraph showed a few immediate issues:

memset.png

First, and easiest, a huge memset. Get this thing out of here.

Now slow.foz was fast-linking in 0.06-0.07ms. Where was the flamegraph at on this?

post-memset.png

Now the obvious question: What the farfalloni was going on with still creating a shader?!

It turns out this particular pipeline was being created without a fragment shader, and that shader was being generated during the fast-link process. Incredible coverage testing from an incredible game.

Fixing this proved trickier, and it still remains tricky. An unsolved problem.

However.

<zmike> can you get me a hack that I can use for that foz ?
* zmike just needs to get numbers for the blog
<hakzsam> hmm
<hakzsam> I'm trying

Like a true graphics hero, that hack was delivered just in time for me to run it through the blogginator. What kinds of gains would be had from this untested mystery patch?

slow.foz was now down to 0.023 ms (23566 ns).

final.png

We Did It

Thanks to Hans-Kristian enabling us and Samuel doing a lot of heavy and unsafe lifting while I sucked wind on the sidelines, we hit our target time of 0.02ms, which is a 50,000% improvement from where things started.

What does this mean?

If You’re A User…

This means in the very near future, you can fire up RADV_PERFTEST=gpl and run DOTA2 (or zink) on RADV without any kind of shader pre-caching and still have zero stuttering.

If You’re A Game Developer…

This means you can write apps relying on fast-linking and be assured that your users will not see stuttering on RADV.

If You’re A Driver Developer…

So far, there aren’t many drivers out there that implement GPL with true fast-linking. Aside from (a near-future version of) RADV, I’m reasonably certain the only driver that both advertises fast-linking and actually has fast linking is NVIDIA.

If you’re from one of those companies that has yet to take the plunge and implement GPL, or if you’ve implemented it and decided to advertise the fast-linking feature without actually being fast, here’s some key takeaways from a week in GPL optimization:

  • Ensure you aren’t compiling any shaders at link-time
  • Ensure you aren’t creating any shaders at link-time
  • Avoid adding fast-link pipelines to any sort of shader cache
  • Profile your fast-link pipeline creation

You might be thinking that profiling a single operation like this is tricky, and it’s hard to get good results from a single fossilize-replay that also compiles multiple library pipelines.

Never fear, vkoverhead is here to save the day.

You thought I wouldn’t plug it again, but here we are. In the very near future (ideally later today), vkoverhead will have some cases that isolate GPL fast-linking. This should prove useful for anyone looking to go from “fast” to fast.

There’s no big secret about being truly fast, and there’s no architectural limitations on speed. It just takes a little bit of elbow grease and some profiling.

But Also

The goal is to move GPL out of RADV_PERFTEST with Mesa 23.1 to enable it by default. There’s still some functional work to be done, but we’re not done optimizing here either.

One day I’ll be able to say with confidence that RADV has the fastest fast-link in the world, or my name isn’t Spaghetti Good Code.

UPDATE: It turns out RADV won’t have the fastest fast-link, but it can get a lot faster.

January 21, 2023

One of the challenges of reviewing a lot of code is that many reviews require multiple iterations. I really don't want to do a full review from scratch on the second and subsequent rounds. I need to be able to see what has changed since last time.

I happen to work on projects that care about having a useful Git history. This means that authors of (without loss of generality) pull requests use amend and rebase to change commits and force-push the result. I would like to see only the changes they made since my last review pass. Especially when the author also rebased onto a new version of the main branch, existing code review tools tend to break down.

Git has a little-known built-in subcommand, git range-diff, which I had been using for a while. It's pretty cool, really: It takes two ranges of commits, old and new, matches old and new commits, and then shows how they changed. The rather huge problem is that its output is a diff of diffs. Trying to make sense of those quickly becomes headache-inducing.

I finally broke down at some point late last year and wrote my own tool, which I'm calling diff-modulo-base. It allows you to look at the difference of the repository contents between old and new in the history below, while ignoring all the changes that are due to differences in the respective base versions A and B.

[diagram: commit graph with the old branch on top of base version A and the new branch on top of base version B]

As a bonus, it actually does explicitly show differences between A and B that would have caused merge conflicts during rebase. This allows a fairly comfortable view of how merge conflicts were resolved.

I've been using this tool for a while now. While there are certainly still some rough edges and to-dos, I did put a bunch more effort into it over the winter holidays and am now quite happy with it. I'm making it available for all to try at https://git.sr.ht/~nhaehnle/diff-modulo-base. Let me know if you find it useful!

Better integration with the larger code review flow?

One of the rough edges is that it would be great to integrate tightly with the GitHub notifications workflow. That workflow is surprisingly usable in that you can essentially treat the notifications as an inbox in which you can mark notifications as unread or completed, and can "mute" issues and pull requests, all with keyboard shortcuts.

What's missing in my workflow is a reliable way to remember the most recent version of a pull request that I have reviewed. My somewhat passable workaround for now is to git fetch before I do a round of reviews, and rely on the local reflog of remote refs. A Git alias allows me to say

git dmb-origin $pull_request_id

and have that become

git diff-modulo-base origin/main origin/pull/$pull_request_id/head@{1} origin/pull/$pull_request_id/head

which is usually what I want.

Ideally, I'd have a fully local way of interacting with GitHub notifications, which could then remember the reviewed version in a more reliable way. This ought to also fix the terrible lagginess of the web interface. But that's a rant for another time.

Rust

This is the first serious piece of code I've written in Rust. I have to say that experience has really been quite pleasant so far. Rust's tooling is pretty great, mostly thanks to the rust-analyzer LSP server.

The one thing I wish is that the borrow checker were able to better understand "partial" borrows. I find it occasionally convenient to tie a bunch of data structures together in a general context structure, and helper functions on such aggregates can't express that they only borrow part of the structure. This can usually be worked around by changing data types, but the fact that I have to do that is annoying. It feels like having to solve a puzzle that isn't part of the inherent complexity of the underlying problem that the code is trying to solve.

And unlike, say, circular references or graph structures in general, where it's clear that expressing and proving the sort of useful lifetime facts that developers might intuitively reason about quickly becomes intractable, improving the support for partial borrows feels like it should be a tractable problem.

January 19, 2023

After hacking the Intel media-driver and ffmpeg I managed to work out how the anv hardware mostly works now for h264 decoding.

I've pushed a branch [1] and a MR[2] to mesa. The basics of h264 decoding are working great on gen9 and compatible hardware. I've tested it on my one Lenovo WhiskeyLake laptop.

I have ported the code to hasvk as well, and once we get moving on this I'll polish that up and check we can h264 decode on IVB/HSW devices.

The one feature I know is missing is status reporting. radv can't support that, from what I can work out, due to firmware, but anv should be able to, so I might dig into that a bit.

[1] https://gitlab.freedesktop.org/airlied/mesa/-/tree/anv-vulkan-video-decode

[2] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20782

January 17, 2023

In the beginning, there was the egg. Then fictional people started eating it from different ends, and the terms "little endians" and "Big Endians" were born.

Computer architectures (mostly) come with one of two byte orders: MSB first or LSB first. The two are incompatible of course, and many a bug was introduced trying to convert between the two (or, more common: failing to do so). The two byte orders were termed Big Endian and little endian, because that hilarious naming scheme at least gives us something to laugh about while contemplating throwing it all away and considering a future as, I don't know, a strawberry plant.

Back in the mullet-infested 80s when the X11 protocol was designed both little endian and big endian were common enough. And back then running the X server on a different host than the client was common too - the X terminals back then had less processing power than a smart toilet seat today so the cpu-intensive clients were running on some mainframe. To avoid overtaxing the poor mainframe already running dozens of clients for multiple users, the job of converting between the two byte orders was punted to the X server. So to this day whenever a client connects, the first byte it sends is a literal "l" or "B" to inform the server of the client's byte order. Where the byte order doesn't match the X server's byte order, the client is a "swapped client" in X server terminology and all 16, 32, and 64-bit values must be "byte-swapped" into the server's byte order. All of those values in all requests, and then again back to the client's byte order in all outgoing replies and events. Forever, till a crash do them part.
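For those who haven't had the pleasure: "byte-swapping" here is exactly as glamorous as it sounds. A sketch of what it boils down to (the server has its own per-size helpers for this, but the idea is the same):

#include <stdint.h>

/* Reverse the bytes of a 32-bit value: 0x00000001 becomes 0x01000000. */
static inline uint32_t
swap32(uint32_t v)
{
   return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
          ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Same for 16-bit values. */
static inline uint16_t
swap16(uint16_t v)
{
   return (uint16_t)((v >> 8) | (v << 8));
}

/* The server does this for every 16/32/64-bit field of every request from a
 * swapped client, and again for every outgoing reply and event. */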

If you get one of those wrong, the number is no longer correct. And it's properly wrong too, the difference between 0x1 and 0x01000000 is rather significant. [0] Which has the hilarious side-effect of... well, pretty much anything. But usually it ranges from crashing the server (thus taking all other clients down in commiseration) to leaking random memory locations. The list of security issues affecting the various SProcFoo implementations (X server naming scheme for Swapped Procedure for request Foo) is so long that I'm too lazy to pull out the various security advisories and link to them. Just believe me, ok? *jedi handwave*

These days, encountering a Big Endian host is increasingly niche, letting it run an X client that connects to your local little-endian X server is even more niche [1]. I think the only regular real-world use-case for this is running X clients on an s390x, connecting to your local intel-ish (and thus little endian) workstation. Not something most users do on a regular basis. So right now, the byte-swapping code is mainly a free attack surface that 99% of users never actually use for anything real. So... let's not do that?

I just merged a PR into the X server repo that prohibits byte-swapped clients by default. A Big Endian client connecting to an X server will fail the connection with an error message of "Prohibited client endianess, see the Xserver man page". [2] Thus, a whole class of future security issues avoided - yay!

For the use-cases where you do need to let Big Endian clients connect to your little endian X server, you have two options: start your X server (Xorg, Xwayland, Xnest, ...) with the +byteswappedclients commandline option. Alternatively, and this only applies for Xorg: add Option "AllowByteSwappedClients" "on" to the xorg.conf ServerFlags section. Both of these will change the default back to the original setting. Both are documented in the Xserver(1) and xorg.conf(5) man pages, respectively.

Now, there's a drawback: in the Wayland stack, the compositor is in charge of starting Xwayland which means the compositor needs to expose a way of passing +byteswappedclients to Xwayland. This is compositor-specific, bugs are filed for mutter (merged for GNOME 44), kwin and wlroots. Until those are addressed, you cannot easily change this default (short of changing /usr/bin/Xwayland into a wrapper script that passes the option through).

There's no specific plan yet which X releases this will end up in, primarily because the release cycle for X is...undefined. Probably xserver-23.0 if and when that happens. It'll probably find its way into the xwayland-23.0 release, if and when that happens. Meanwhile, distributions interested in this particular change should consider backporting it to their X server version. This has been accepted as a Fedora 38 change.

[0] Also, it doesn't help that much of the X server's protocol handling code was written with the attitude of "surely the client wouldn't lie about that length value"
[1] little-endian client to Big Endian X server is so rare that it's barely worth talking about. But suffice to say, the exact same applies, just with little and big swapped around.
[2] That message is unceremoniously dumped to stderr, but that bit is unfortunately a libxcb issue.

Needless to say h264/5 weren't my real goals in life for video decoding. Lynne and myself decided to see what we could do to drive AV1 decode forward by creating our own extensions called VK_MESA_video_decode_av1. This is a radv only extension so far, and may expose some peculiarities of AMD hardware/firmware.

Lynne's blog entry[1] has all the gory details, so go read that first. (really read it first).

Now that you've read and understood all that, I'll just rant here a bit. Figuring out the DPB management and hw frame ref and curr_pic_idx fields was a bit of a nightmare. I spent a few days hacking up a lot of wrong things before landing on the thing we agreed was the least wrong, which was having the ffmpeg code allocate a frame index in the same fashion as the vaapi radeon implementation does. I had another hacky solution that involved overloading the slotIndex value to mean something that wasn't DPB slot index, but it wasn't really any better. I think there may be something about the hw I don't understand so hopefully we can achieve clarity later.

[1] https://lynne.ee/vk_mesa_video_decode_av1.html

After 8 months of work by Yinon Burgansky, libinput now has a new pointer acceleration profile: the "custom" profile. This profile allows users to tweak the exact response of their device based on their input speed.

A short primer: the pointer acceleration profile is a function that multiplies the incoming deltas with a given factor F, so that your input delta (x, y) becomes (Fx, Fy). How this is done is specific to the profile, libinput's existing profiles had either a flat factor or an adaptive factor that roughly resembles what Xorg used to have, see the libinput documentation for the details. The adaptive curve however has a fixed behaviour, all a user could do was scale the curve up/down, but not actually adjust the curve.

Input speed to output speed

The new custom filter allows exactly that: it allows a user to configure a completely custom ratio between input speed and output speed. That ratio will then influence the current delta. There is a whole new API to do this but simplified: the profile is defined via a series of points of (x, f(x)) that are linearly interpolated. Each point is defined as input speed in device units/ms to output speed in device units/ms. For example, to provide a flat acceleration equivalent, specify [(0.0, 0.0), (1.0, 1.0)]. With the linear interpolation this is of course a 45-degree function, and any incoming speed will result in the equivalent output speed.

Noteworthy: we are talking about the speed here, not any individual delta. This is not exactly the same as the flat acceleration profile (which merely multiplies the deltas by a constant factor) - it does take the speed of the device into account, i.e. device units moved per ms. For most use-cases this is the same but for particularly slow motion, the speed may be calculated across multiple deltas (e.g. "user moved 1 unit over 21ms"). This avoids some jumpyness at low speeds.

But because the curve is speed-based, it allows for some interesting features too: the curve [(0.0, 1.0), (1.0, 1.0)] is a horizontal function at 1.0. Which means that any input speed results in an output speed of 1 unit/ms. So regardless how fast the user moves the mouse, the output speed is always constant. I'm not immediately sure of a real-world use case for this particular case (some accessibility needs maybe) but I'm sure it's a good prank to play on someone.

Because libinput is written in C, the API is not necessarily immediately obvious but: to configure you pass an array of (what will be) y-values and set the step-size. The curve then becomes: [(0 * step-size, array[0]), (1 * step-size, array[1]), (2 * step-size, array[2]), ...]. There are some limitations on the number of points but they're high enough that they should not matter.
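To make that concrete, here's a sketch (my illustration, not libinput's actual implementation) of building such a curve from the y-values plus step-size and evaluating it for a given input speed:

#include <stddef.h>

/* Hypothetical representation: point i sits at (i * step, y[i]),
 * with speeds in device units/ms as described above. Needs npoints >= 2. */
struct custom_curve {
   const double *y;   /* user-supplied output speeds */
   size_t npoints;
   double step;       /* spacing between the input-speed points */
};

/* Linearly interpolate the output speed for a given input speed,
 * extrapolating from the last segment beyond the final point. */
static double
curve_eval(const struct custom_curve *c, double input_speed)
{
   if (input_speed <= 0.0)
      return c->y[0];

   size_t i = (size_t)(input_speed / c->step);
   if (i >= c->npoints - 1)
      i = c->npoints - 2;

   double t = (input_speed - i * c->step) / c->step;
   return c->y[i] + t * (c->y[i + 1] - c->y[i]);
}

/* The flat equivalent from above, y = {0.0, 1.0} with step 1.0, yields
 * curve_eval(s) == s for any speed s; y = {1.0, 1.0} yields 1.0 everywhere. */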

Note that any curve is still device-resolution dependent, so the same curve will not behave the same on two devices with different resolution (DPI). And since the curves uploaded by the user are hand-polished, the speed setting has no effect - we cannot possibly know how a custom curve is supposed to scale. The setting will simply update with the provided value and return that but the behaviour of the device won't change in response.

Motion types

Finally, there's another feature in this PR - the so-called "movement type" which must be set when defining a curve. Right now, we have two types, "fallback" and "motion". The "motion" type applies to, you guessed it, pointer motion. The only other type available is fallback which applies to everything but pointer motion. The idea here is of course that we can apply custom acceleration curves for various different device behaviours - in the future this could be scrolling, gesture motion, etc. And since those will have different requirements, they can be configured separately.

How to use this?

As usual, the availability of this feature depends on your Wayland compositor and how this is exposed. For the Xorg + xf86-input-libinput case however, the merge request adds a few properties so that you can play with this using the xinput tool:


# Set the flat-equivalent function described above
$ xinput set-prop "devname" "libinput Accel Custom Motion Points" 0.0 1.0
# Set the step, i.e. the above points are on 0 u/ms, 1 u/ms, ...
# Can be skipped, 1.0 is the default anyway
$ xinput set-prop "devname" "libinput Accel Custom Motion Step" 1.0
# Now enable the custom profile
$ xinput set-prop "devname" "libinput Accel Profile Enabled" 0 0 1

The above sets a custom pointer accel for the "motion" type. Setting it for fallback is left as an exercise to the reader (though right now, I think the fallback curve is pretty much only used if there is no motion curve defined).

Happy playing around (and no longer filing bug reports if you don't like the default pointer acceleration ;)

Availability

This custom profile will be available in libinput 1.23 and xf86-input-libinput-1.3.0. No release dates have been set yet for either of those.

Have You Ever…

Had one of those moments when you looked at some code, looked at the output of your program, and then exclaimed COMPUTER, WHY YOU SO DUMB?

Of course not.

Nobody uses the spoken word in the current year.

But you’ve definitely complained about how dumb your computer is on IRC/discord/etc.

And I’m here today to complain in a blog post.

My computer (compiler) is really fucking dumb.

HOW DUMB IS IT?

I know you’re wondering how dumb a computer (compiler) would have to be to prompt an outraged blog post from me, the paragon of temperance.

Well.

Let’s take a look!

Disclaimer: You are now reading at your own peril. There is no catharsis to be found ahead.

Some time ago, avid blog followers will recall that I created vkoverhead for evaluating CPU overhead in Vulkan drivers. Intel and NVIDIA have both contributed since its inception, which has no relevance but I’m sneaking it in as today’s fun fact because nobody will read the middle of this paragraph. Using vkoverhead, I’ve found something very strange on RADV. Something bizarre. Something terrifying.

debugoptimized builds of Mesa are (significantly) faster than release builds for some cases.

Here’s an example.

debugoptimized:

$ ./vkoverhead -test 76  -duration 10
vkoverhead running on AMD Radeon RX 5700 XT (RADV NAVI10):
	* descriptor numbers are reported as thousands of operations per second
	* percentages for descriptor cases are relative to 'descriptor_noop'
  76, descriptor_1image,                                  157308,       100.0%

release:

$ ./vkoverhead -test 76  -duration 10
vkoverhead running on AMD Radeon RX 5700 XT (RADV NAVI10):
	* descriptor numbers are reported as thousands of operations per second
	* percentages for descriptor cases are relative to 'descriptor_noop'
  76, descriptor_1image,                                  135309,       100.0%

stupidity-start.png

This is where we are now.

How?

That’s a good question, and for the answer I’m going to turn it over to the blog’s new compiler expert, Mr. Azathoth:

azathoth.png

Thanks, champ.

As we can see, I’m about to meddle with dark forces beyond the limits of human comprehension, so anyone who values their sanity/time should escape while they still can.

Let’s take a look at a debugoptimized flamegraph to check out what sort of cosmic horror we’re delving into.

stupid-perf.png

Obviously this is a descriptor updating case, and upon opening up the code for the final function, this is what our eyes will refuse to see:

static ALWAYS_INLINE void
write_image_descriptor(unsigned *dst, unsigned size, VkDescriptorType descriptor_type, const VkDescriptorImageInfo *image_info)
{
   struct radv_image_view *iview = NULL;
   union radv_descriptor *descriptor;

   if (image_info)
      iview = radv_image_view_from_handle(image_info->imageView);

   if (!iview) {
      memset(dst, 0, size);
      return;
   }

   if (descriptor_type == VK_DESCRIPTOR_TYPE_STORAGE_IMAGE) {
      descriptor = &iview->storage_descriptor;
   } else {
      descriptor = &iview->descriptor;
   }
   assert(size > 0);

   memcpy(dst, descriptor, size);
}

It seems simple enough. Nothing too terrible here.

This Is Where It Gets Stupid

In the execution of this test case, the iterated callstack will look something like:

radv_UpdateDescriptorSets ->
radv_update_descriptor_sets_impl (inlined) ->
write_image_descriptor_impl (inlined) ->
write_image_descriptor (inlined)

That’s a lot of inlining.

Typically inlining isn’t a problem when used properly. It makes things faster (citation needed) in some cases (citation needed). Note the caveats.

One thing of note in the above code snippet is that memcpy is called with a variable size parameter. This isn’t ideal since it prevents the compiler from making certain assumptions about—What’s that, Mr. Azathoth? It’s not variable, you say? It’s always writing 32 bytes, you say?

case VK_DESCRIPTOR_TYPE_STORAGE_IMAGE:
   write_image_descriptor_impl(device, cmd_buffer, 32, ptr, buffer_list,
                               writeset->descriptorType, writeset->pImageInfo + j);

Wow, thanks, Mr. Azathoth, I totally would’ve missed that!

And surely the compiler (GCC 12.2.1) wouldn’t fuck this up, right?

inline-meme.png

Which would mean that, hypothetically, if I were to force the compiler (GCC 12.2.1) to use the right memcpy size then it would have no effect.

Right?

static ALWAYS_INLINE void
write_image_descriptor(unsigned *dst, unsigned size, VkDescriptorType descriptor_type, const struct radv_image_view *iview)
{
   if (!iview) {
      memset(dst, 0, size);
      return;
   }

   if (descriptor_type == VK_DESCRIPTOR_TYPE_STORAGE_IMAGE) {
      memcpy(dst, &iview->storage_descriptor, 32);
   } else {
      if (size == 64)
         memcpy(dst, &iview->descriptor, 64);
      else
         memcpy(dst, &iview->descriptor, size);
   }
}

$ ./vkoverhead -test 76  -duration 10
vkoverhead running on AMD Radeon RX 5700 XT (RADV NAVI10):
	* descriptor numbers are reported as thousands of operations per second
	* percentages for descriptor cases are relative to 'descriptor_noop'
  76, descriptor_1image,                                  141299,       100.0%

That’s A Win

Many, many runs of vkoverhead confirm the results, which means that’s a solid win.

Right?

Except hold on.

In the course of writing this post, my SSH session to my test machine was terminated, and I had to reconnect. Just for posterity, let’s run the same pre/post patch driver through vkoverhead again to be thorough.

pre-patch, release build:

$ ./vkoverhead -test 76  -duration 10
vkoverhead running on AMD Radeon RX 5700 XT (RADV NAVI10):
	* descriptor numbers are reported as thousands of operations per second
	* percentages for descriptor cases are relative to 'descriptor_noop'
  76, descriptor_1image,                                  136943,       100.0%

post-patch, release build:

$ ./vkoverhead -test 76  -duration 10
vkoverhead running on AMD Radeon RX 5700 XT (RADV NAVI10):
	* descriptor numbers are reported as thousands of operations per second
	* percentages for descriptor cases are relative to 'descriptor_noop'
  76, descriptor_1image,                                  137810,       100.0%

Somehow, with no changes to my environment aside from being in a different SSH session, I’ve just gained a couple percentage points of performance on the pre-patch run and lost a couple on the post-patch run.

Second Opinion

It’s clear to me that GCC (and computers) can’t be trusted. I know what everyone reading this post is now thinking: Why are you still using that antiquated trash compiler instead of the awesome new Clang that does all the things better always?

Well.

I’m sure you can see where this is going.

If I fire up Clang builds of Mesa for debugoptimized and release configurations (same CFLAGS etc), I get these results.

debugoptimized:

$ ./vkoverhead -test 76  -duration 10
vkoverhead running on AMD Radeon RX 5700 XT (RADV NAVI10):
	* descriptor numbers are reported as thousands of operations per second
	* percentages for descriptor cases are relative to 'descriptor_noop'
  76, descriptor_1image,                                  112629,       100.0%

release:

$ ./vkoverhead -test 76  -duration 10
vkoverhead running on AMD Radeon RX 5700 XT (RADV NAVI10):
	* descriptor numbers are reported as thousands of operations per second
	* percentages for descriptor cases are relative to 'descriptor_noop'
  76, descriptor_1image,                                  132820,       100.0%

At least something makes sense there. Now what about a release build with the above changes?

$ ./vkoverhead -test 76  -duration 10
vkoverhead running on AMD Radeon RX 5700 XT (RADV NAVI10):
	* descriptor numbers are reported as thousands of operations per second
	* percentages for descriptor cases are relative to 'descriptor_noop'
  76, descriptor_1image,                                  127761,       100.0%

According to Clang this makes performance worse, but also the performance is just worse overall?!?

Final Thoughts

Just to see what happens, let’s check out AMDPRO:

$ VK_ICD_FILENAMES=/home/zmike/amd_icd64.json ./vkoverhead -test 76 -duration 10
vkoverhead running on AMD Radeon RX 5700 XT:
	* descriptor numbers are reported as thousands of operations per second
	* percentages for descriptor cases are relative to 'descriptor_noop'
  76, descriptor_1image,                                  151691,       100.0%

I hate computers.

2022 really passed by fast, and after completing GSoC 2022, I’m now completing another milestone: my project in the Igalia Coding Experience. I had the best experience during those four months. I learned tremendously about the Linux graphics stack and now I can say for sure that I would love to keep working in the DRM community.

While GSoC was, for me, an experience to get a better understanding of what open source is, Igalia CE was an opportunity for me to mature my knowledge of technical concepts.

So, this is a summary report of my journey at the Igalia CE.

IGT tests to V3D


Initially, V3D only had three basic IGT tests: v3d_get_bo_offset, v3d_get_param, and v3d_mmap. So, the basic goal of my CE project was to add more tests to the V3D driver.

V3D is the driver that supports the Broadcom V3D 3.3 and 4.1 OpenGL ES GPUs, and is the driver that provides 3D rendering to the Raspberry Pi 4. V3D is composed of a tiled renderer, a TFU (Texture Formatting Unit), and a CSD (Compute Shader Dispatch).

During the CE, I was able to develop tests for almost all eleven V3D ioctls (except v3d_submit_tfu). I began by writing tests for the v3d_create_bo ioctl and the Performance Monitor (perfmon) related ioctls. I developed tests that check the basic functionality of the ioctls and inspected the kernel code to understand the situations where the ioctls should fail.

After those tests, I faced the biggest challenge of my CE project: performing a Mesa no-op job on IGT. A no-op job is one of the simplest jobs that can be submitted to the V3D. It is a 3D rendering job, so it is a job submitted through the v3d_submit_cl ioctl, and performing this job on IGT was fundamental to developing good tests for the v3d_submit_cl ioctl.

The main problem I faced in submitting a no-op job on IGT was that I would have to copy more and more Mesa files to IGT. I took a while fighting against this idea, looking for other ways to submit a job to V3D. But, as some experienced developers pointed out, packeting is the best option for it. So indeed, the final solution I came up with was to copy a couple of files from Mesa, but just three of them, which sounds reasonable.

So, after some time, I was able to bring the Mesa structure to IGT with minimal overhead. But I was still not able to run a successful no-op job, as the job’s fence wasn’t being signaled by the end of the job. Then, Melissa Wen guided me to experiment with running CTS tests to inspect the no-op job. With the CTS tests, I was able to hexdump the contents of the packet and understand what was going wrong in my no-op job.

Running the CTS on the Raspberry Pi 4 was a fun side-quest of the project and ended up resulting in a commit to the CTS repository, as CTS wasn’t handling the wayland-scanner appropriately for cross-compiling. CTS was picking the wayland-scanner from the host computer instead of picking the wayland-scanner executable available in the target sysroot. This was fixed with this simple patch:

Allow override of wayland_scanner executable

When I finally got a successful no-op job, I was able to write the tests for the v3d_submit_cl and v3d_wait_bo ioctls. In these tests, I primarily tested job synchronization with single syncobjs and multiple syncobjs. In this part of the project, I had the opportunity to learn a lot about syncobjs and the different forms of synchronization in the kernel and userspace.

Having done the v3d_submit_cl tests, I developed the v3d_submit_csd tests in a similar way, as the job submission process is kind of similar. Submitting a CSD job requires a valid submission with a pipeline assembly shader, and as IGT doesn’t have a shader compiler, I hard-coded the assembly of an empty shader in the code. In this way, I was able to get a simple CSD job submitted, and having done that, I could now play around with mixing CSD and CL jobs.

In these tests, I could test the synchronization between two job queues and see, for example, if they were proceeding independently.

So, by the end of the review process, I will have added 66 new sub-tests to V3D, bringing the total to 72 IGT sub-tests! Those tests check invalid parameters, synchronization, and the proper behavior of the functionalities.

Patch/Series                                                Status
[PATCH 0/7] V3D IGT Tests Updates                           Accepted
[PATCH 0/2] Tests for V3D/VC4 Mmap BO IOCTLs                Accepted
[PATCH 0/4] Make sure v3d/vc4 support performance monitor   In Review
[PATCH 0/6] V3D Job Submission Tests                        In Review
[PATCH 0/3] V3D Mixed Job Submission Tests                  In Review

Mesa

Apart from reading a lot of kernel code, I also started to explore some of the Mesa code, especially the v3dv driver. On Mesa, I was trying to understand the userspace use of the ioctls in order to create useful tests. While I was exploring the v3dv, I was able to make two very simple contributions to Mesa: fixing typos and initializing a variable in order to ensure proper error handling.

Patch                                                    Status
v3dv: fix multiple typos                                 Accepted
v3dv: initialize fd variable for proper error handling   Accepted

IGT tests to VC4


VC4 and V3D share some similarities in their basic 3D rendering implementation. VC4 contains a 3D engine, and a display output pipeline that supports different outputs. The display part of the VC4 is used on the Raspberry Pi 4 together with the V3D driver.

Although my main focus was on the V3D tests, as the VC4 and V3D drivers are kind of similar, I was able to bring some improvements to the VC4 tests as well. I added tests for perfmons and the vc4_mmap ioctl and improved a couple of things in the tests, such as moving them to a separate folder and creating a check to skip the VC4 tests if they are running on a Raspberry Pi 4.

Patch/Series                                                Status
[PATCH 0/5] VC4 IGT Tests Updates                           Accepted
[PATCH 0/2] Tests for V3D/VC4 Mmap BO IOCTLs                Accepted
[PATCH 0/4] Make sure v3d/vc4 support performance monitor   In Review
tests/vc4_purgeable_bo: Fix conditional assertion           In Review

Linux Kernel


V3D/VC4 drivers

During this process of writing tests for IGT, I ended up reading a lot of V3D kernel code in order to evaluate possible userspace scenarios. While inspecting some of the V3D code, I found a couple of small things that could be improved, such as using the DRM-managed API for mutexes and replacing open-coded implementations with their DRM counterparts.

Patch status:
  • drm/v3d: switch to drmm_mutex_init (Accepted)
  • drm/v3d: add missing mutex_destroy (Accepted)
  • drm/v3d: replace open-coded implementation of drm_gem_object_lookup (Accepted)

Although I didn’t explore the VC4 driver as much as the V3D driver, I took a look at it as well and spotted a small thing that could be improved: using the DRM-core helpers instead of open-coded implementations. Moreover, after a report on the mailing list, I bisected a deadlock and was able to fix it after some study of the KMS locking system.

Patch status:
  • drm/vc4: drop all currently held locks if deadlock happens (Accepted)
  • drm/vc4: replace drm_gem_dma_object for drm_gem_object in vc4_exec_info (In Review)
  • drm/vc4: replace obj lookup steps with drm_gem_objects_lookup (In Review)

The debugfs side-quest

The debugfs side-quest was a total coincidence during this project. I had some spare time and was looking for something to develop. While looking at the DRM TODO list, I bumped into the debugfs clean-up task and found it interesting to work on. So, I started to work on this task based on the previous work from Wambui Karuga, an Outreachy mentee who worked on this feature during her internship. By chance, when I talked to Melissa about it, she told me that she knew about this project from a past Outreachy internship she was engaged in, and she was able to help me figure out the last pieces of this side-quest.

After submitting the first patch, introducing the debugfs device-centered functions and converting a couple of drivers to the new structure, I decided to remove the debugfs_init hook from a couple of drivers in order to get closer to the goal of dropping the hook entirely. Moreover, during my last week in the CE, I tried to write a debugfs infrastructure for the KMS objects, which was another task on the TODO list, although I still need to do some rework on this series.

Patch/Series status:
  • [PATCH 0/7] Introduce debugfs device-centered functions (Accepted)
  • drm/debugfs: use octal permissions instead of symbolic permissions (Accepted)
  • drm/debugfs: add descriptions to struct parameters (Accepted)
  • [PATCH 0/7] Convert drivers to the new debugfs device-centered functions (Accepted)
  • [PATCH 0/13] drm/debugfs: Create a debugfs infrastructure for kms objects (In Review)

More side-quests

By the end of the CE, I was on my summer break from university, so I had some time to take a couple of side-quests in this journey.

The first side-quest I got into originated in a failing IGT test on VC4, the “addfb25-bad-modifier” IGT test. Initially, I proposed a fix only for VC4, but after some discussion on the mailing list, I decided to move forward with the idea of adding the check for valid modifiers in the DRM core. The series is still in review, but I had some great interactions during the iterations.

The second side-quest was to understand why the IGT test kms_writeback was causing a kernel oops in vkms. After some bisecting and some study about KMS’s atomic API, I was able to detect the problem and write a solution for it. It was pretty exciting to deal with vkms for the first time and to get some notion about the display side of things.

Patch/Series status:
  • drm/tests: Split drm_test_dp_mst_calc_pbn_mode into parameterized tests (Accepted)
  • drm/tests: Split drm_test_dp_mst_sideband_msg_req_decode into parameterized tests (Accepted)
  • tests/kms_addfb_basic: Avoid open-coded expressions (Accepted)
  • [PATCH 0/3] Check for valid framebuffer’s format (In Review)
  • drm/vkms: reintroduce prepare_fb and cleanup_fb functions (Accepted)

Next Steps


A bit different from the end of GSoC, I’m not really sure what my next steps are going to be in the next couple of months. The only thing I know for sure is that I will keep contributing to the DRM subsystem and studying more about DRI, especially the 3D rendering and KMS sides.

The DRI infrastructure is really fascinating and there is so much to learn! Although I feel that I improved a lot in the last couple of months, I still feel like a newbie in the community. I still want to gain more knowledge of the DRM core helpers and understand how everything glues together.

Apart from the DRM subsystem, I’m also trying to take some time to program more in Rust and maybe contribute to other open-source projects, like Mesa.

Acknowledgment


I would like to thank my great mentors Melissa Wen and André Almeida for helping me through this journey. I wouldn’t be able to develop this project without their great support and encouragement. They were an amazing duo of mentors and I thank them for answering all my questions and helping me with all the challenges.

Also, I would like to thank the DRI community for reviewing my patches and giving me constructive feedback. Especially, I would like to thank Daniel Vetter for answering patiently every single question that I had about the debugfs clean-up and to thank Jani Nikula, Maxime Ripard, Thomas Zimmermann, Javier Martinez Canillas, Emma Anholt, Simon Ser, Iago Toral, Kamil Konieczny and many others that took their time to review my patches, answer my questions and provide me constructive feedback.

January 16, 2023

We haven’t posted updates to the work done on the V3DV driver since
we announced the driver becoming Vulkan 1.2 Conformant.

But after reaching that milestone, we’ve been very busy working on more improvements, so let’s summarize the work done since then.

Moved more functionality to the GPU

Our implementations of Events and Occlusion Queries were both mostly CPU-based. We have refactored both features with a new GPU-side implementation based on the use of compute shaders.

In addition to being more “the Vulkan way”, this has additional benefits. For example, in the case of events, we no longer need to stall on the CPU when we need to handle GPU-side event commands, which also allowed us to re-enable sync_fd import/export.

VK_KHR_sampler_ycbcr_conversion

We have just landed a real implementation for this extension, based on the work of Ella Stanforth as part of her Igalia Coding Experience with us. This was a really complex work, as this feature added support for multi-plane formats, and needed to modify various parts of the driver. A big kudos to Ella for getting this tricky feature going. Also thanks to Jason Ekstrand, as he worked on a common Mesa framework for ycbcr support.

Support for new extensions

Since 1.2 was announced, the following extensions got exposed:

  • VK_EXT_texel_buffer_alignment
  • VK_KHR_maintenance4
  • VK_KHR_zero_initialize_workgroup_memory
  • VK_KHR_synchronization2
  • VK_KHR_workgroup_memory_explicit_layout
  • VK_EXT_tooling_info (0 tools exposed though)
  • VK_EXT_border_color_swizzle
  • VK_EXT_shader_module_identifier
  • VK_EXT_depth_clip_control
  • VK_EXT_attachment_feedback_loop_layout
  • VK_EXT_memory_budget
  • VK_EXT_primitive_topology_list_restart
  • VK_EXT_load_store_op_none
  • VK_EXT_image_robustness
  • VK_EXT_pipeline_robustness
  • VK_KHR_shader_integer_dot_product

Some miscellanea

In addition to those, we also worked on the following:

  • Implemented a heuristic to decide when to enable double-buffer mode, which could help improve performance in some cases. It still needs to be enabled through the V3D_DEBUG environment variable.

  • Getting v3dv and v3d to use the same shader optimization method, which allows more code to be reused between the OpenGL and Vulkan drivers.

  • Getting the driver working with the fossilize-db tools

  • Bugfixing, mostly related to bugs identified through new Khronos CTS releases

January 15, 2023

Hi all!

This month’s status update will be lighter than usual: I’ve been on leave for a while at the end of December. To make up for this, I have some big news: we’ve released Sway 1.8! This brings a whole lot of improvements from wlroots 0.16, as well as some nice smaller additions to Sway itself. We’re still working on fixing up a few regressions, so I’ll probably release wlroots 0.16.2 soon-ish.

Together with Sebastian Wick we’ve plumbed support for more data blocks to libdisplay-info. We now support everything in the base EDID block! We’re filling the gaps in our CTA-861 implementation, and we’re getting ready to release version 0.1.0. As expected, EDID blobs continue to have many fields packed in creative ways, duplicating information and contradicting each other, ill-defined across many specifications and vendor-specific formats.

I’ve continued working on the goguma Android IRC client. I’ve wired up automatic bug reporting via GlitchTip – this helps a lot because grabbing logs from Android is much more complicated than it needs to be. Thanks to the bug dashboard I’ve fixed numerous crashes. I’ve also sent upstream a fix for unreliable notifications when UnifiedPush is used.

The NPotM is chayang, a small tool to gradually dim the screen. This can be used to implement a grace period before turning off the screens, to let the user press a key or move the mouse to keep the screens on.

Last but not least, I’ve written a patch to add support for ACME DNS challenges to tlstunnel. ACME DNS challenges unlock support for wildcard certificates in Let’s Encrypt. Unfortunately there is no widely supported standard protocol to update DNS records, so tlstunnel delegates this task to a helper script with the same API as dehydrated’s hooks.

That’s it! See you next month.

January 14, 2023

Automated testing of software is great. Unfortunately, what's commonly considered best practice for how to integrate testing into the development flow is a bad fit for a lot of software. You know what I mean: You submit a pull request, some automated testing process is kicked off in the background, and some time later you get back a result that's either Green (all tests passed) or Red (at least one test failed). Don't submit the PR if the tests are red. Sounds good, doesn't quite work.

There is Software Under Test (SUT), the software whose development is the goal of what you're doing, and there's the Test Bench (TB): the tests themselves, but also additional relevant parts of the environment like perhaps some hardware device that the SUT is run against.

The above development practice works well when the SUT and TB are both defined by the same code repository and are developed together. And admittedly, that is the case for a lot of useful software. But it just so happens that I mostly tend to work on software where the SUT and TB are inherently split. Graphics drivers and shader compilers implement some spec (like Vulkan or Direct3D), and an important part of the TB are a conformance test suite and other tests, the bulk of which are developed separately from the driver itself. Not to mention the GPU itself and other software like the kernel mode driver. The point is, TB development is split from the SUT development and it is infeasible to make changes to them in lockstep.

Down with No Failures, long live No Regressions

Problem #1 with keeping all tests passing all the time is that tests can fail for reasons whose root cause is not an SUT change.

For example, a new test case is added to the conformance test suite, but that test happens to fail. Suddenly nobody can submit any changes anymore.

That clearly makes no sense, and because Tooling Sucks(tm), what folks typically do is maintain a manual list of test cases that are excluded from automated testing. This unblocks development, but are you going to remember to update that exclusion list? Bonus points if the exclusion list isn't even maintained in the same repository as the SUT, which just compounds the problem.

The situation is worse when you bring up a large new feature or perhaps a new version of the hardware supported by your driver (which is really just a super duper large new feature), where there is already a large body of tests written by somebody else. Development of the new feature may take months and typically is merged bit by bit over time. For most of that time, there are going to be some test failures. And that's fine!

Unfortunately, a typical coping mechanism is that automated testing for the feature is entirely disabled until the development process is complete. The consequences are dire, as regressions in relatively basic functionality can go unnoticed for a fairly long time.

And sometimes there are simply changes in the TB that are hard to control. Maybe you upgraded the kernel mode driver for your GPU on the test systems, and suddenly some weird corner case tests fail. Yes, you have to fix it somehow, but removing the test case from your automated testing process is almost always the wrong response.

In fact, failing tests are, given the right context, a good thing! Let's say a bug is discovered in a real application in the field. Somebody root causes the problem and writes a simplified reproducer. This reproducer should be added to the TB as soon as possible, even if it is going to fail initially!

To be fair, many of the common testing frameworks recognize this by allowing tests to be marked as "expected to fail". But they typically also assume that the TB can be changed in lockstep with the SUT and fall on their face when that isn't the case.

What is needed here is to treat testing as a truly continuous exercise, with some awareness by the automation of how test runs relate to the development history.

During day-to-day development, the important bit isn't that there are no failures. The important bit is that there are no regressions.

Automation ought to track which tests pass on the main development branch and provide pre-commit reports for pull requests relative to those results: Have there been any regressions? Have any tests been fixed? Block code submissions when they cause regressions, but don't block them for pre-existing failures, especially when those failures are caused by changes in the TB.
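As a rough illustration of the idea, here is a minimal Python sketch of gating on regressions rather than on failures. It is not tied to any particular CI product, and the test names and statuses are invented for the example.

# Minimal sketch of regression-based gating: `baseline` holds the last known
# results on the main development branch, `candidate` the results obtained for
# the pull request.
def compare_runs(baseline, candidate):
    regressions, fixes, new_failures = [], [], []
    for test, status in candidate.items():
        previous = baseline.get(test)
        if previous is None and status == "fail":
            new_failures.append(test)   # pre-existing failure from a new test case
        elif previous == "pass" and status == "fail":
            regressions.append(test)
        elif previous == "fail" and status == "pass":
            fixes.append(test)
    return regressions, fixes, new_failures

baseline  = {"draw.basic": "pass", "blend.rgba16f": "fail"}
candidate = {"draw.basic": "pass", "blend.rgba16f": "fail", "blend.new_case": "fail"}

regressions, fixes, new_failures = compare_runs(baseline, candidate)
# Only block the submission when `regressions` is non-empty; pre-existing or
# newly-added failures are reported but do not block anything.
print(regressions, fixes, new_failures)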

Changes to the TB should also be tested where possible, and when they cause regressions those should be investigated. But it is quite common that regressions caused by a TB change are legitimate and shouldn't block the TB change.

Sparse testing

Problem #2 is that good test coverage means that tests take a very long time to run.

Your first solution to this problem should be to parallelize and throw more hardware at it. Let's hope the people who control the purse care enough about quality.

There is sometimes also low-hanging fruit you should pick, like wasting lots of time in process (or other) startup and teardown overhead. Addressing that can be a double-edged sword. Changing a test suite from running every test case in a separate process to running multiple test cases sequentially in the same process reduces isolation between the tests and can therefore make the tests flakier. It can also expose genuine bugs, though, and so the effort is usually worth it.

But all these techniques have their limits.

Let me give you an example. Compilers tend to have lots of switches that subtly change the compiler's behavior without (intentionally) affecting correctness. How good is your test coverage of these?

Chances are, most of them don't see any real testing at all. You probably have a few hand-written test cases that exercise the options. But what you really should be doing is run an entire suite of end-to-end tests with each of the switches applied to make sure you aren't missing some interaction. And you really should be testing all combinations of switches as well.

The combinatorial explosion is intense. Even if you only have 10 boolean switches, testing them each individually without regressing the turn-around time of the test suite requires 11x more test system hardware. Testing all possible combinations requires 1024x more. Nobody has that kind of money.

The good news is that having extremely high confidence in the quality of your software doesn't require that kind of money. If we run the entire test suite a small number of times (maybe even just once!) and independently choose a random combination of switches for each test, then not seeing any regressions there is a great indication that there really aren't any regressions.

Why is that? Because failures are correlated! Test T failing with a default setting of switches is highly correlated with test T failing with some non-default switch S enabled.

This effect isn't restricted to taking the cross product of a test suite with a bunch of configuration switches. By design, an exhaustive conformance test suite is going to have many sets of tests with high failure correlation. For example, in the Vulkan test suite you might have a bunch of test cases that all do the same thing, but with a different combination of framebuffer format and blend function. When there is a regression affecting such tests, the specific framebuffer format or blend function might not matter at all, and all of the tests will regress. Or perhaps the regression is related to a specific framebuffer format, and so all tests using that format will regress regardless of the blend function that is used, and so on.

A good automated testing system would leverage these observations using statistical methods (aka machine learning).
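To make the sampling idea concrete, here is a minimal Python sketch; the switch names are invented and run_test is a placeholder hook standing in for the real test runner.

# Minimal sketch of randomized configuration sampling: instead of running
# len(tests) * 2**len(switches) combinations, each test runs once with an
# independently chosen random set of switches.
import random

switches = ["fast_math", "vectorize", "unroll_loops", "merge_constants"]
tests = [f"test_{i:04d}" for i in range(2000)]

def run_test(test, config):
    # Placeholder: invoke the real test with `config` applied and return True
    # if it passed. Always passes in this sketch.
    return True

for test in tests:
    config = {switch: random.choice([True, False]) for switch in switches}
    if not run_test(test, config):
        # Whether this failure is a regression is decided by comparing against
        # earlier results, as discussed in the previous section.
        print(f"FAIL {test} with {config}")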

Combinatorial explosion causes your full test suite to take months to run? No problem, treat testing as a genuinely continuous task. Have test systems continuously run random samplings of test cases on the latest version of the main development branch. Whenever a change is made to either the SUT or TB, switch over to testing that new version instead. When a failure is encountered, automatically determine if it is a regression by referring to earlier test results (of the exact same test if it has been run previously, or related tests otherwise) combined with a bisection over the code history.

Pre-commit testing becomes an interesting fuzzy problem. By all means have a small traditional test suite that is manually curated to run within a few minutes. But we can also apply the approach of running randomly sampled tests to pre-commit testing.

A good automated testing system would learn a statistical model of regressions and combine that with the test results obtained so far to provide an estimate of the likelihood of regression. As long as no regression is actually found, this likelihood will keep dropping as more tests are run, though it will not drop to 0 unless all tests are run (and the setup here was this would take months). The team can define a likelihood threshold that a change must reach before it can be committed based on their appetite for risk and rate of development.

The statistical model should be augmented with source-level information about the change, such as keywords that appear in the diff and commit message and the set of files that was changed. After all, there ought to be some meaningful correlation between regressions in a raytracing test case and the fact that the regressing change affected a file with "raytracing" in its name. The model should then also be used to bias the random sampling of tests to be run to maximize the information extracted per effort spent on running test cases.

Some caveats

What I've described is largely motivated by the fact that the world is messier than commonly accepted testing "wisdom" allows. However, the world is too messy even for what I've described.

I haven't talked about flaky (randomly failing) tests at all, though a good automated testing system should be able to cope with them. Re-running a test in the same configuration is not black magic and can be used to confirm that a test is flaky. If we wanted to get fancy, we could even estimate the failure probability and treat a significant increase of the failure rate as a regression!
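If we wanted to sketch that policy in Python, it could look like the following; the retry count and threshold are arbitrary, and run_once is a placeholder hook for running a single instance of the test.

# Minimal sketch of a flaky-test policy: re-run a failing test, estimate its
# failure rate, and treat a clear increase over the historical rate as a
# regression.
def classify_failure(test, run_once, retries=10, historical_fail_rate=0.0):
    failures = sum(1 for _ in range(retries) if not run_once(test))
    observed_rate = failures / retries
    if failures == retries:
        return "genuine failure"
    if observed_rate > 2 * historical_fail_rate + 0.1:
        return "flaky, failure rate regressed"
    return "flaky, within historical noise"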

Along similar lines, there can be state leakage between test cases that causes failures only when test cases are run in a specific order, or when specific test cases are run in parallel. This would manifest as flaky tests, and so flaky test detection ought to try to help tease out these scenarios. That is admittedly difficult and will probably never be entirely reliable. Luckily, it doesn't happen often.

Sometimes, there are test cases that can leave a test system in such a broken state that it has to be rebooted. This is not entirely unusual in very early bringup of a driver for new hardware, when even the device's firmware may still be unstable. An automated test system can and should treat this case just like one would treat a crashing test process: Detect the failure, perhaps using some timer-based watchdog, force a reboot, possibly using a remote-controlled power switch, and resume with the next test case. But if a decent fraction of your test suite is affected, the resulting experience isn't fun and there may not be anything your team can do about it in the short term. So that's an edge case where manual exclusion of tests seems legitimate.

So no, testing perfection isn't attainable for many kinds of software projects. But even relative to what feels like it should be realistically attainable, the state of the art is depressing.

January 04, 2023

This is the third part of my “Turnips in the wild” blog post series where I describe how I found and fixed graphical issues in the Mesa Turnip Vulkan driver for Adreno GPUs. If you missed the first two parts, you can find them here:

Psychonauts 2

A few months ago it was reported that “Psychonauts 2” has rendering artifacts in the main menu. Though I only recently got my hands on it.

Screenshot of the main menu with the rendering artifacts. Notice the mark on the top right: the game was running directly on a Qualcomm board via FEX-Emu.

Step 1 - Toggling driver options

Forcing direct rendering, forcing tiled rendering, disabling UBWC compression, forcing synchronizations everywhere, and so on, nothing helped or changed the outcome.

Step 2 - Finding the Draw Call and staring at it intensively

Screenshot of RenderDoc with the problematic draw call selected. The first draw call with visible corruption.

When looking around the draw call, everything looks good except for a single input image:

Input image of the draw call with corruption already present. One of the input images.

Ughhh, it seems like this image comes from the previous frame, that’s bad. Inter-frame issues are hard to debug since there is no tooling to inspect two frames together…

* Looks around nervously *

Ok, let’s forget about it, maybe it doesn’t matter. The next step would be looking at the pixel values in the corrupted region:

color = (35.0625, 18.15625, 2.2382, 0.00)

Now, let’s see whether RenderDoc’s built-in shader debugger would give us the same value as GPU or not.

color = (0.0335, 0.0459, 0.0226, 0.50)

(Or not)

After looking at the same pixel on RADV, RenderDoc seems right. So the issue is somewhere in how the driver compiled the shader.

Step 3 - Bisecting the shader until nothing is left

A good start is to print the shader values with debugPrintfEXT, which is much nicer than looking at color values. Adding the debugPrintfEXT aaaand, the issue goes away, great, not like I wanted to debug it or anything.

Adding a printf changes the shader which affects the compilation process, so the changed result is not unexpected, though it’s much better when it works. So now we are stuck with observing pixel colors.

Bisecting a shader isn’t hard, especially if there is a reference GPU, with the same capture opened, to compare the changes against. You delete pieces of the shader until the results become the same; when they do, you take one step back and start removing other expressions, repeating until nothing more can be removed.

First iteration of the elimination, most of the shader is gone:
float _258 = 1.0 / gl_FragCoord.w;
vec4 _273 = vec4(_222, _222, _222, 1.0) * _258;
vec4 _280 = View.View_SVPositionToTranslatedWorld * vec4(gl_FragCoord.xyz, 1.0);
vec3 _284 = _280.xyz / vec3(_280.w);
vec3 _287 = _284 - View.View_PreViewTranslation;
vec3 _289 = normalize(-_284);
vec2 _303 = vec2(in_var_TEXCOORD0[0].x, in_var_TEXCOORD0[0].y) * Material.Material_ScalarExpressions[0].x;
vec4 _309 = texture(sampler2D(Material_Texture2D_0, Material_Texture2D_0Sampler), _303, View.View_MaterialTextureMipBias);
vec2 _312 = (_309.xy * vec2(2.0)) - vec2(1.0);
vec3 _331 = normalize(mat3(in_var_TEXCOORD10_centroid.xyz, cross(in_var_TEXCOORD11_centroid.xyz, in_var_TEXCOORD10_centroid.xyz) * in_var_TEXCOORD11_centroid.w, in_var_TEXCOORD11_centroid.xyz) * normalize((vec4(_312, sqrt(clamp(1.0 - dot(_312, _312), 0.0, 1.0)), 1.0).xyz * View.View_NormalOverrideParameter.w) + View.View_NormalOverrideParameter.xyz)) * ((View.View_CullingSign * View_PrimitiveSceneData._m0[(in_var_PRIMITIVE_ID * 37u) + 4u].w) * float(gl_FrontFacing ? (-1) : 1));


vec2 _405 = in_var_TEXCOORD4.xy * vec2(1.0, 0.5);
vec4 _412 = texture(sampler2D(LightmapResourceCluster_LightMapTexture, LightmapResourceCluster_LightMapSampler), _405 + vec2(0.0, 0.5));
uint _418 = in_var_LIGHTMAP_ID; // <<<<<<-
float _447 = _331.y;


vec3 _531 = (((max(0.0, dot((_412 * View_LightmapSceneData_1._m0[_418 + 5u]), vec4(_447, _331.zx, 1.0))))) * View.View_IndirectLightingColorScale);

bool _1313 = TranslucentBasePass.TranslucentBasePass_Shared_Fog_ApplyVolumetricFog > 0.0;
vec4 _1364;

vec4 _1322 = View.View_WorldToClip * vec4(_287, 1.0);
float _1323 = _1322.w;
vec4 _1352;
if (_1313)
{
    _1352 = textureLod(sampler3D(TranslucentBasePass_Shared_Fog_IntegratedLightScattering, View_SharedBilinearClampedSampler), vec3(((_1322.xy / vec2(_1323)).xy * vec2(0.5, -0.5)) + vec2(0.5), (log2((_1323 * View.View_VolumetricFogGridZParams.x) + View.View_VolumetricFogGridZParams.y) * View.View_VolumetricFogGridZParams.z) * View.View_VolumetricFogInvGridSize.z), 0.0);
}
else
{
    _1352 = vec4(0.0, 0.0, 0.0, 1.0);
}
_1364 = vec4(_1352.xyz + (in_var_TEXCOORD7.xyz * _1352.w), _1352.w * in_var_TEXCOORD7.w);

out_var_SV_Target0 = vec4(_531.x, _1364.w, 0, 1);


After that it became harder to reduce the code.

More elimination:
vec2 _303 = vec2(in_var_TEXCOORD0[0].x, in_var_TEXCOORD0[0].y);
vec4 _309 = texture(sampler2D(Material_Texture2D_0, Material_Texture2D_0Sampler), _303, View.View_MaterialTextureMipBias);
vec3 _331 = normalize(mat3(in_var_TEXCOORD10_centroid.xyz, in_var_TEXCOORD11_centroid.www, in_var_TEXCOORD11_centroid.xyz) * normalize((vec4(_309.xy, 1.0, 1.0).xyz))) * ((View_PrimitiveSceneData._m0[(in_var_PRIMITIVE_ID)].w) );

vec4 _412 = texture(sampler2D(LightmapResourceCluster_LightMapTexture, LightmapResourceCluster_LightMapSampler), in_var_TEXCOORD4.xy);
uint _418 = in_var_LIGHTMAP_ID; // <<<<<<-

vec3 _531 = (((dot((_412 * View_LightmapSceneData_1._m0[_418 + 5u]), vec4(_331.x, 1,1, 1.0)))) * View.View_IndirectLightingColorScale);

vec4 _1352 = textureLod(sampler3D(TranslucentBasePass_Shared_Fog_IntegratedLightScattering, View_SharedBilinearClampedSampler), vec3(vec2(0.5), View.View_VolumetricFogInvGridSize.z), 0.0);

out_var_SV_Target0 = vec4(_531.x, in_var_TEXCOORD7.w, 0, 1);


And finally, the end result:

vec3 a = in_var_TEXCOORD10_centroid.xyz + in_var_TEXCOORD11_centroid.xyz;
float b = a.x + a.y + a.z + in_var_TEXCOORD11_centroid.w + in_var_TEXCOORD0[0].x + in_var_TEXCOORD0[0].y + in_var_PRIMITIVE_ID.x;
float c = b + in_var_TEXCOORD4.x + in_var_TEXCOORD4.y + in_var_LIGHTMAP_ID;

out_var_SV_Target0 = vec4(c, in_var_TEXCOORD7.w, 0, 1);

Nothing left but loading of varyings and the simplest operations on them in order to prevent their elimination by the compiler.

in_var_TEXCOORD7.w values are several orders of magnitude different from the expected ones and if any varying is removed the issue goes away. Seems like an issue with loading of varyings.

I created a simple standalone reproducer in vkrunner to isolate this case and make my life easier, but the same fragment shader passed without any trouble. This should have pointed me to undefined behavior somewhere.

Step 4 - Going deeper

Anyway, one major difference is the vertex shader: changing it does “fix” the issue. However, changing it also resulted in changes to the varyings layout, and without changing the layout the issue is still present. Thus the vertex shader is an unlikely culprit here.

Let’s take a look at the fragment shader assembly:

bary.f r0.z, 0, r0.x
bary.f r0.w, 3, r0.x
bary.f r1.x, 1, r0.x
bary.f r1.z, 4, r0.x
bary.f r1.y, 2, r0.x
bary.f r1.w, 5, r0.x
bary.f r2.x, 6, r0.x
bary.f r2.y, 7, r0.x
flat.b r2.z, 11, 16
bary.f r2.w, 8, r0.x
bary.f r3.x, 9, r0.x
flat.b r3.y, 12, 17
bary.f r3.z, 10, r0.x
bary.f (ei)r1.x, 16, r0.x
....

Loading varyings

bary.f loads an interpolated varying, flat.b loads it without interpolation. bary.f (ei)r1.x, 16, r0.x is what loads the problematic varying, though it doesn’t look suspicious at all. Looking through the state which defines how varyings are passed between VS and FS also doesn’t yield anything useful.

Ok, but what does the second operand of flat.b r2.z, 11, 16 mean (the command format is flat.b dst, src1, src2)? The first one is the location the varying is loaded from, and looking through Turnip’s code the second one should be equal to the first, otherwise “some bad things may happen”. I forced the sources to be equal - nothing changed… What did I expect? After all, the standalone reproducer with the same assembly works fine.

The same description which promised bad things to happen also said that using 2 immediate sources for flat.b isn’t really expected. Let’s revert the change and try something like flat.b r2.z, 11, r0.x instead - nothing changed, again.

Packing varyings

What else happens with these varyings? They are being packed to remove their unused components! So let’s stop packing them. Aha! Now it works correctly!

Looking several times through the code, nothing is wrong. Changing the order of varyings helps, aligning them helps, aligning only flat varyings also helps. But the code is entirely correct.

Though one thing did change: while shuffling the varyings order I noticed that the resulting misrendering changed, so it’s likely not the order but the location that is cursed.

Interpolating varyings

What’s left? How varying interpolation is specified. The code emits interpolation state only for used varyings, but looking closer, the “used varyings” part isn’t that clearly defined. Emitting the whole interpolation state fixes the issue!

The culprit is found: stale varying-interpolation data was being read. The resulting fix can be found in tu: Fix varyings interpolation reading stale values + cosmetic changes

Correctly rendered draw call after the changes

Injustice 2

Another corruption in the main menu.

Bad draw call on Turnip

How it should look:

The same draw call on RADV

The draw call inputs and state look good enough. So it’s time to bisect the shader.

Here is the output of the reduced shader on Turnip:

Enabling the display of NaNs and Infs shows that there are NaNs in the output on Turnip (NaNs have green color here):

While the correct rendering on RADV is:

Carefully reducing the shader further resulted in the following fragment which reproduces the issue:

r12 = uintBitsToFloat(uvec4(texelFetch(t34, _1195 + 0).x, texelFetch(t34, _1195 + 1).x, texelFetch(t34, _1195 + 2).x, texelFetch(t34, _1195 + 3).x));
....
vec4 _1268 = r12;
_1268.w = uintBitsToFloat(floatBitsToUint(r12.w) & 65535u);
_1275.w = unpackHalf2x16(floatBitsToUint(r12.w)).x;

On Turnip this _1275.w is NaN, while on RADV it is a proper number. Looking at the assembly, the calculation of _1275.w from the above is translated into:

isaml.base0 (u16)(x)hr2.z, r0.w, r0.y, s#0, t#12
(sy)cov.f16f32 r1.z, hr2.z

In GLSL there is a read of uint32, stripping it of the high 16 bits, then converting the lower 16 bits to a half float.

In assembly the “read and strip the high 16 bits” part is done in a single command isaml, where the stripping is done via (u16) conversion.

At this point I wrote a simple reproducer to speed up iteration on the issue:

result = uint(unpackHalf2x16(texelFetch(t34, 0).x & 65535u).x);

After testing different values I confirmed that the (u16) conversion doesn’t strip the higher 16 bits, but clamps the value to a 16-bit unsigned integer. Running the reproducer on the proprietary driver showed that it doesn’t fold the u32 -> u16 conversion into isaml.
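The difference is easy to see in plain Python (this is just an illustration, not driver code; the 32-bit value is made up):

value = 0x48D23A7F                   # a uint32 read from the buffer

truncated = value & 0xFFFF           # what the shader asks for: 0x3a7f
clamped   = min(value, 0xFFFF)       # what the (u16) destination does: 0xffff

print(hex(truncated), hex(clamped))  # 0x3a7f 0xffff

And 0xffff reinterpreted as a 16-bit float is a NaN bit pattern, which matches the NaNs seen in the Turnip output above.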

Knowing that, the fix is easy: ir3: Do 16b tex dst folding only for floats

Monster Hunter: World

Main menu, again =) Before we even got here, two other issues had to be fixed first, including one which seems like a HW bug that the proprietary driver is not aware of.

In this case of misrendering the culprit is a compute shader.

How it should look:

Compute shaders are generally easier to deal with since much less state is involved.

None of the debug options helped and shader printf didn’t work at that time for some reason. So I decided to look at the shader assembly, trying to spot something funny.

ldl.u32 r6.w, l[r6.z-4016], 1
ldl.u32 r7.x, l[r6.z-4012], 1
ldl.u32 r7.y, l[r6.z-4032], 1
ldl.u32 r7.z, l[r6.z-4028], 1
ldl.u32 r0.z, l[r6.z-4024], 1
ldl.u32 r2.z, l[r6.z-4020], 1

Negative offsets into shared memory are not suspicious at all. Were they always there? How does it look right before being passed into our backend compiler?

vec1 32 ssa_206 = intrinsic load_shared (ssa_138) (base=4176, align_mul=4, align_offset=0)
vec1 32 ssa_207 = intrinsic load_shared (ssa_138) (base=4180, align_mul=4, align_offset=0)
vec1 32 ssa_208 = intrinsic load_shared (ssa_138) (base=4160, align_mul=4, align_offset=0)
vec1 32 ssa_209 = intrinsic load_shared (ssa_138) (base=4164, align_mul=4, align_offset=0)
vec1 32 ssa_210 = intrinsic load_shared (ssa_138) (base=4168, align_mul=4, align_offset=0)
vec1 32 ssa_211 = intrinsic load_shared (ssa_138) (base=4172, align_mul=4, align_offset=0)
vec1 32 ssa_212 = intrinsic load_shared (ssa_138) (base=4192, align_mul=4, align_offset=0)

Nope, no negative offsets, just a number of offsets close to 4096. Looks like offsets got wrapped around!

Looking at ldl definition it has 13 bits for the offset:

<pattern pos="0"           >1</pattern>
<field   low="1"  high="13" name="OFF" type="offset"/>    <--- This is the offset field
<field   low="14" high="21" name="SRC" type="#reg-gpr"/>
<pattern pos="22"          >x</pattern>
<pattern pos="23"          >1</pattern>

With the offset type being a signed integer (so one bit is for the sign), that leaves us with 12 bits, meaning an upper bound of 4095. Case closed!
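As a sanity check, a couple of lines of Python (mine, not part of the driver) confirm that the load_shared base offsets above wrap into exactly the negative offsets seen in the disassembly once squeezed into a 13-bit signed field:

# Wrap a NIR base offset into ldl's 13-bit signed OFF field.
def as_signed_13bit(offset):
    offset &= 0x1FFF                         # keep the 13 encoded bits
    return offset - 0x2000 if offset & 0x1000 else offset

for base in (4176, 4180, 4160, 4164, 4168, 4172):
    print(base, "->", as_signed_13bit(base))
# 4176 -> -4016, 4180 -> -4012, 4160 -> -4032, ... exactly the
# l[r6.z-4016] and friends we saw in the assembly.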

I knew that there is an upper bound set on the offset during optimizations, but where and how is it set?

The upper bound is set via nir_opt_offsets_options::shared_max and is equal to (1 << 13) - 1, which we saw is incorrect. Who set it?

Subject: [PATCH] ir3: Limit the maximum imm offset in nir_opt_offset for
 shared vars

STL/LDL have 13 bits to store imm offset.

Fixes crash in CS compilation in Monster Hunter World.

Fixes: b024102d7c2959451bfef323432beaa4dca4dd88
("freedreno/ir3: Use nir_opt_offset for removing constant adds for shared vars.")

Signed-off-by: Danylo Piliaiev
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14968>
---
 src/freedreno/ir3/ir3_nir.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

@@ -124,7 +124,7 @@ ir3_optimize_loop(struct ir3_compiler *compiler, nir_shader *s)
           */
          .uniform_max = (1 << 9) - 1,

-         .shared_max = ~0,
+         .shared_max = (1 << 13) - 1,

Weeeell, totally unexpected, it was me! Fixing the same game, maybe even the same shader…

Let’s set the shared_max to a correct value . . . . . Nothing changed, not even the assembly. The same incorrect offset is still there.

After a bit of wandering around the optimization pass, it was found that in one case the upper bound is not enforced correctly. Fixing it fixed the rendering.

The final changes were:

This article is part of a series on how to setup a bare-metal CI system for Linux driver development. Here are the different articles so far:

  • Part 1: The high-level view of the whole CI system, and how to fully control test machines remotely (power on, OS to boot, keyboard/screen emulation using a serial console);
  • Part 2: A comparison of the different ways to generate the rootfs of your test environment, and introducing the Boot2Container project;
  • Part 3: Analysis of the requirements for the CI gateway, catching regressions before deployment, easy roll-back, and netbooting the CI gateway securely over the internet.
  • Part 4: Generating a live-upgradable operating system image that can also be baked into a container for ease of use.

Now that we have a way to netboot and live-update our CI gateway, it is time to start working on the services that will enable sharing the test machines to users!

This work is sponsored by the Valve Corporation.

Introduction

Test machines are a valuable resource for any organization, whether it is a tech giant or a small-scale open source project. Along with automated testing, they are instrumental in keeping developer productivity high by shifting the focus from bug fixing to code review, maintenance improvements, and developing new features. Given how valuable test machines are, we should strive to keep their utilization as high as possible. This can be achieved by turning developer-managed test machines which show low utilization into shared test machines.

Let's see how!

Designing the services needed to time-share your test machines

Efficient time-sharing of machines comes with the following high-level requirements:

  1. Ability to select which hardware you need for your testing;
  2. Ability to deploy the wanted test environment quickly and automatically;
  3. Full isolation between testing jobs (no persistent state);
  4. Bi-directional file-sharing between the testing client and the test machine / device under test (DUT);
  5. Caching resources to speed up the test machines' set-up/boot time;
  6. Real time feedback from / serial console to the test machine.

Let's review them and propose services that should be hosted on the CI gateway to satisfy them, with backwards-compatibility kept in mind at all times.

Note: This post can be very confusing as it tries to focus on the interaction between different services without focusing on implementation details for the service. To help support my (probably confusing) explanations, I added sequence diagrams when applicable.

So, please grab a pot of your favourite stimulating beverage before going further. If this were to be still insufficient, please leave a comment (or contact me by other means) and I will try my best to address it :)

Boot2container provides transport, deployment, isolation, and caching of test environments

As we have seen in Part 2, using containers to run test jobs is an effective way to allow any job to set whatever test environment they desire while providing isolation between test jobs. Additionally, it enables caching the test environments so that they do not have to get re-downloaded every time.

This container could be booted using the initramfs we created in part 2, Boot2Container, which runs any list of containers, as specified in the kernel cmdline.

iPXE & MinIO provide kernel/Boot2Container deployment and configuration

To keep things simple, stateless, and fast, the kernel/Boot2Container binaries and the kernel cmdline can be downloaded at boot time via HTTP by using, for example, iPXE. The iPXE binary can itself be netbooted by the machine's firmware (see PXE, and TFTP), or simply flashed onto the test machine's drive/attached USB pen-drive.

The iPXE binary can then download the wanted boot script from an HTTP server, passing in the query URL the MAC address of the network interface that first got an IP address along with the architecture/platform (PCBIOS, x64 UEFI, ...). This allows the HTTP server to serve the wanted boot script for this machine, which contains the URLs of the kernel, initramfs, and the kernel command line to be used for the test job.

Note: While any HTTP server can be used to provide the kernel and Boot2Container to iPXE, I would recommend using an S3-compatible service such as MinIO as it not only acts like an HTTP server, but also provides an industry-standard interface to manage the data (bucket creation, file uploads, access control, ...). This gives you the freedom to change where the service is located and which software provides it without impacting other components of the infrastructure.
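To make the flow a bit more concrete, here is a minimal Python sketch (standard library only) of a boot-script service; the port, the "mac" query parameter name, the URLs, and the Boot2Container arguments are placeholders, not the actual infrastructure described in this series.

# Serve a per-machine iPXE script based on the MAC address passed in the query URL.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

BOOT_SCRIPT = """#!ipxe
kernel http://ci-gateway:9000/boot/{mac}/kernel b2c.container=docker://docker.io/alpine:latest console=ttyS0
initrd http://ci-gateway:9000/boot/{mac}/boot2container.cpio.xz
boot
"""

class BootScriptHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        query = parse_qs(urlparse(self.path).query)
        mac = query.get("mac", ["unknown"])[0]
        self.send_response(200)
        self.end_headers()
        # In a real deployment the script would be generated from the job
        # currently assigned to this machine.
        self.wfile.write(BOOT_SCRIPT.format(mac=mac).encode())

HTTPServer(("", 8080), BootScriptHandler).serve_forever()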

Boot2container volumes & MinIO provide file-sharing and user-data caching

Job data caching and bi-directional file-sharing with the test container can be implemented using volumes and Boot2Container's ability to mirror volumes from/to an S3-compatible cloud storage system such as MinIO (see b2c.volume).

Since this may be a little confusing, here are a couple of examples:

Bidirectional data sharing

Test machines are meant to produce test results, and often need input data before executing tests. An effective solution is to share a folder between the test machine and the machine submitting the job. We suggest using an S3-compatible bucket to share this data, as it provides an industry-standard way of dealing with files between multiple machines.

As an example of how this would look in practice, here are the operations Boot2Container would need to do in order to start an interactive shell on Alpine Linux on a test machine, with bi-directional data sharing:

  1. Connect to an S3-compatible Storage, which we name local_minio;
  2. Create a volume named job-volume, set the mirror target to local_minio's job-bucket bucket, tell it to download the content of this bucket right after boot (pull_on=pipeline_start) and to upload all the content of the volume back to the bucket before shutting down (push_on=pipeline_end), then mark it for deletion when we are done with execution;
  3. Run Alpine Linux (docker.io/alpine:latest) in interactive mode (-ti), with our volume mounted at /job.

Here is how it would look in the kernel command line, as actual Boot2Container arguments:

b2c.minio="local_minio,http://ci-gateway:9000,<ACCESSKEY>,<SECRETKEY>"
b2c.volume="job-volume,mirror=local_minio/job-bucket,pull_on=pipeline_start,push_on=pipeline_end,expiration=pipeline_end"
b2c.container="-v job-volume:/job -ti docker://docker.io/alpine:latest"

Other acceptable values for {push,pull}_on are: pipeline_start, container_start, container_end, pipeline_end, and changes. The latter downloads/uploads new files as soon as they get created/modified.

Caching data across reboots that only certain jobs may access

In some cases, a test machine may require a lot of confidential data which would be impractical to re-download every single time we boot the machine.

Once again, Boot2Container has us covered as it allows us to mark a volume as never expiring (expiration=never), decrypting the data when downloading it from the bucket (encrypt_key=$KEY), then storing it encrypted using fscrypt (fscrypt_key=$KEY). This would look something like this:

b2c.minio="local_minio,http://ci-gateway:9000,<ACCESSKEY>,<SECRETKEY>"
b2c.volume="job-volume,mirror=local_minio/job-bucket,pull_on=pipeline_start,expiration=never,encrypt_key=s3-password,fscrypt=7u9MGy[...]kQ=="
b2c.container="-v job-volume:/job -ti docker://docker.io/alpine:latest"

Read up more about these features, and a lot more, in Boot2Container's README.

MaRS & Sergeant Hartman provide test machine enrollment and DB of available machines/hardware

In the previous section, we focused on how to consistently boot the right test environment, but we also need to make sure we are booting on the right machine for the job!

Additionally, since we do not want to boot every machine every time a testing job comes just to figure out if we have the right test environment, we should also have a database available on the gateway that can link a machine id (MAC address?), a PDU port (see Part 1), and what hardware/peripherals it has.

While it is definitely possible to maintain a structured text file that would contain all of this information, it is also very error-prone, especially for test machines that allow swapping peripherals as maintenance operations can inadvertently swap multiple machines and testing jobs would suddenly stop being executed on the expected machine.

To mitigate this risk, it would be advisable to verify at every boot that the hardware found on the machine is the same as the one expected by the CI gateway. This can be done by creating a container that will enumerate the hardware at boot, generate a list of tags based on them, then compare it with a database running on the CI gateway, exposed as a REST service (the Machine Registration Service, AKA MaRS). If the machine is not known to the CI gateway, this machine registration container can automatically add it to MaRS's database.
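As a rough sketch of what the machine registration container could do, here is some Python; the MaRS endpoint, payload layout, and tag names are assumptions made purely for illustration.

# Enumerate (trivial) hardware tags and report them to a hypothetical MaRS REST endpoint.
import json
import platform
import urllib.request
import uuid

def hardware_tags():
    # A real implementation would enumerate PCI/USB devices, RAM, GPU model, ...
    return [f"cpu:{platform.machine()}", f"kernel:{platform.release()}"]

machine = {
    "id": "%012x" % uuid.getnode(),   # MAC address of the first interface
    "tags": hardware_tags(),
}

req = urllib.request.Request(
    "http://ci-gateway:8585/api/v1/machines",     # hypothetical MaRS endpoint
    data=json.dumps(machine).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    # MaRS either confirms that the tags match its database or registers the new machine.
    print(resp.status, resp.read().decode())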

New machines reported to (the) MaRS should however not be directly exposed to users, not until they undergo training to guarantee that:

  1. They boot reliably;
  2. The serial console is set up properly and is reliable;
  3. The reported list of tags is stable across reboots.

Fortunately, the Sergeant Hartman service is constantly on the lookout for new recruits (any machine not deemed ready for service), subjecting them to a bootloop to test their reliability. The service will then deem them ready for service if they reach a predetermined success rate (19/20, for example), at which point the test machine is ready to accept test jobs \o/.

Sergeant Hartman can also be used to perform a sanity check after every reboot of the CI gateway, to check if any of the test machines in the MaRS database have changed while the CI gateway was offline.

Finally, CI farm administrators need to occasionally work on a test machine, and thus need to prevent execution of future jobs on this test machine. We call this operation "Retiring" a test machine. The machine can later be "activated" to bring it back into the pool of test machines, after going through the training sequence.

The test machines' state machine and the expected training sequence can be seen in the following images:

Note: Rebooting test machines that boot using the Boot on AC method (see Part 1) may require them to be disconnected from the power for a relatively long time in order for the firmware to stop getting power from the power supply. A delay of 30 seconds seems to be relatively conservative, but some machines may require more. It is thus recommended to make this delay configurable on a per-machine basis, and store it in MaRS.

SALAD provides automatic mapping between serial consoles and their attached test machine

Once a test machine is booted, a serial console can provide a real time view of the testing in progress while also enabling users to use the remote machine as if they had an SSH terminal on it.

To enable this feature, we first need to connect the test machine to the CI gateway using serial consoles, as explained in Part 1:

Test Machine <-> USB <-> RS-232 <-> NULL modem cable <-> RS-232 <-> USB Hub <-> Gateway

As the CI Gateway may be used for more than one test machine, we need to figure out which serial port of the test machine is connected to which serial port of the CI gateway. We then need to make sure this information is kept up to date as we want to make sure we are viewing the logs of the right machine when executing a job!

This may sound trivial when you have only a few machines, but this can quickly become difficult to maintain when you have 10+ ports connected to your CI gateway! So, if like me you don't want to maintain this information by hand, it is possible to auto-discover this mapping at the same time as we run the machine registration/check process, thanks to the use of another service that will untangle all this mess: SALAD.

For the initial registration, the machine registration container should output, at a predetermined baudrate, a well-known string to every serial port (SALAD.ping\n, for example) then pick the first console port that answers another well-known string (SALAD.pong\n, for example). Now that the test machine knows which port to use to talk to the CI gateway, it can send its machine identifier (MAC address?) over it so that the CI gateway can keep track of which serial port is associated to which machine (SALAD.machine_id=...\n).

As part of the initial registration, the machine registration container should also transmit to MaRS the name of the serial adapter it used to talk to SALAD (ttyUSB0 for example) so that, at the next boot, the machine can be configured to output its boot log on it (console=ttyUSB0 added to its kernel command line). This also means that the verification process of the machine registration container can simply send SALAD.ping\n to stdout, and wait for SALAD.pong\n on stdin before outputting SALAD.machine_id=... to stdout again, to make sure the association is still valid.
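For illustration, the test-machine side of this handshake could look something like the following sketch, assuming the pyserial library; the device paths, baudrate, and machine ID source are placeholders.

# Probe every serial adapter until the CI gateway's SALAD service answers the ping.
import glob
import uuid

import serial   # pyserial

MACHINE_ID = "%012x" % uuid.getnode()

def find_salad_port(baudrate=115200, timeout=2):
    for dev in glob.glob("/dev/ttyUSB*"):
        try:
            port = serial.Serial(dev, baudrate, timeout=timeout)
        except serial.SerialException:
            continue
        port.write(b"SALAD.ping\n")
        if port.readline().strip() == b"SALAD.pong":
            # Tell the gateway which machine sits behind this serial port.
            port.write(f"SALAD.machine_id={MACHINE_ID}\n".encode())
            return dev, port
        port.close()
    return None, None

device, console = find_salad_port()
print("SALAD console found on", device)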

On the CI gateway side, we propose that SALAD should provide the following functions:

  1. Associate the local serial adapters to a test machine's ID;
  2. Create a TCP server per test machine, then relay the serial byte stream from the test machine's serial console to the client socket and vice versa;
  3. Expose over a REST interface the list of known machine IDs along with the TCP port associated with their machine;
  4. Create preemptively a test machine/TCP server when asked for the TCP port over the REST interface.

This provides the ability to host the SALAD service on more than just the CI gateway, which may be useful in case the gateway runs out of USB ports for the serial consoles.

Here are example outputs from the proposed REST interface:

$ curl -s http://${SALAD_HOST}:${SALAD_PORT}/api/v1/machine/
  machines:
    "00:d8:61:7a:51:cd":
      has_client: false
      tcp_port: 48841
    "00:e0:4c:68:0b:3d":
      has_client:   true
      tcp_port: 57791

$ curl -s http://${SALAD_HOST}:${SALAD_PORT}/api/v1/machine/00:d8:61:7a:51:cd
  has_client: false
  tcp_port: 48841

Interacting with a test machine's serial console is done by connecting to the tcp_port associated with the test machine. In a shell script, one could implement this using curl, jq, and netcat:

$ MACHINE_ID=00:d8:61:7a:51:cd
$ netcat ${SALAD_HOST} $(curl -s http://${SALAD_HOST}:${SALAD_PORT}/api/v1/machine/${MACHINE_ID} | jq ".tcp_port")
# You now have a read/write access to the serial console of the test machine

Power Delivery Unit: Turning ON/OFF your test machines on command

As we explained in Part 1, the only way to guarantee that test jobs don't interfere with each other is to reset the hardware fully between every job... which unfortunately means we need to cut the power to the test machine long enough for the power supply to empty its capacitors and stop providing voltage to the motherboard even when the computer is already off (30 seconds is usually enough).

Given that there are many switchable power delivery units on the market (industrial, or for home use), many communication mediums (serial, Ethernet, WiFi, Zigbee, Z-Wave, ...), and protocols (SNMP, HTTP, MQTT, ...), we really want to create an abstraction layer that will allow us to write drivers for any PDU without needing to change any other component.

One existing abstraction layer is pdudaemon, which has many drivers for industrial and home-oriented devices. It however does not provide a way to read back the state of a given port, which prevents verifying that an operation succeeded and makes it difficult to check that the power was indeed off at all times during the mandatory power-off period.

The PDU abstraction layer should allow its users to do the following (a rough interface sketch follows the list):

  1. List the supported PDUs / PDU drivers
  2. Register and validate a PDU by providing a driver name and any other needed configuration (IP address, ...)
  3. Query the list of registered PDUs, and their ports (label, state, last state change date, power usage, ...)
  4. Set the state of a port, its label, or mark it as reserved / read-only to prevent accidental changes
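Such an interface could look roughly like the following Python sketch; the class and method names are invented for illustration, not taken from an existing project.

# Rough sketch of a PDU driver abstraction with readable port state.
from abc import ABC, abstractmethod
from enum import Enum

class PortState(Enum):
    OFF = 0
    ON = 1
    UNKNOWN = 2

class PDUDriver(ABC):
    """One driver per supported PDU family (SNMP, HTTP, serial, ...)."""

    @abstractmethod
    def set_port_state(self, port: int, state: PortState) -> None: ...

    @abstractmethod
    def get_port_state(self, port: int) -> PortState:
        """Unlike pdudaemon, reading the state back is part of the contract,
        so callers can verify the mandatory power-off period."""

class DummyPDU(PDUDriver):
    """In-memory driver, handy for testing the service itself."""
    def __init__(self, ports: int):
        self.states = {p: PortState.UNKNOWN for p in range(ports)}

    def set_port_state(self, port, state):
        self.states[port] = state

    def get_port_state(self, port):
        return self.states[port]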

While this layer could be developed both as a library or a REST service, we would recommend implementing it as a standalone service because it makes the following easier:

  1. Synchronizing PDU access between clients: required for PDUs using a telnet-/serial-based interface
  2. Reducing the amount of PDU queries by centralizing all requests in one place, and caching results
  3. Deploying updates transparently and on-the-fly by running it as a socket-activated systemd service
  4. Exporting metrics to monitoring systems as a background task (port power consumptions, reboot count, percentage of utilization)

Container registries: Caching containers to reduce bandwidth usage

Even though our test machines cache containers when they first download them, it would still be pretty inefficient if every test machine in the CI farm had to download them directly from the internet.

Rather than doing that, test machines can download the containers through a proxy registry hosted on the CI gateway. This means that the containers will only be downloaded from the internet ONCE, no matter how many test machines you have in your farm. Additionally, the reduced reliance on the internet will improve your farm's reliability and performance.

Executor: Sequencing the different CI services to time-share test machines

All the different services needed to time-share the test machines effectively have now been described. What we are missing is a central service that coordinates all the others, exposes an interface to describe and queue test jobs, and then monitors their progress.

In other words, this service needs to:

  1. Interface with/implement MaRS/Sergeant Hartman to list the test machines available and their state;
  2. Specify a way for clients to describe jobs;
  3. Provide an interface for clients to queue jobs;
  4. Provide exclusive access to a DUT to one job at a time;
  5. Detect the end of a job or failure, even across reboots by direct messaging from the test machine, or snooping on the serial console;
  6. Implement watchdogs and retry counters to detect/fix anomalous situations (machine failed to boot, ...).

Job description

The job description should allow users of the CI system to specify the test environment they want to use without constraining them needlessly. It can also be viewed as the reproduction recipe in case anyone would like to reproduce the test environment locally.

By its nature, the job description is probably the most important interface in the entire CI system. It is very much like a kernel's ABI: you don't want an update to break your users, so you need to make only backwards-compatible changes to this interface!

Job descriptions should be generic and minimalistic to even have a chance to maintain backwards compatibility. To achieve this, try to base it on industry standards such as PXE, UEFI, HTTP, serial consoles, containers, and others that have proven their versatility and interoperability over the years.

Without getting tangled too much into details, here is the information it should contain:

  1. The machine to select, either by ID or by list of tags
  2. The kernel(s) and initramfs to boot
  3. The kernel command line to use
  4. The different serial console markers used to detect various conditions and trigger responses
  5. The different timeouts or watchdogs used

And here are some of the things it should NOT contain:

  1. Deployment method: DUTs should use auto-discoverable boot methods such as PXE. If the DUT can't do that natively, flash a small bootloader that will be able to perform this action rather than asking your users to know how to deploy their test environment.
  2. Scripts: Streaming commands to a DUT is asking for trouble! First, you need to know how to stream the commands (telnet, SSH, serial console), then you need to interpret the result of each command to know when you can send the next one. This will inevitably lead to a disaster either due to mismatching locales, over synchronization which may affect test results, or corrupted characters from the serial console.

Now, good luck designing your job description format... or wait for the next posts which will document the one we came up with!

Job execution

Job execution is split into the following phases:

  1. Job queuing: The client asks the executor to run a particular job
  2. Job setup: The executor gathers all the artifacts needed for the boot and makes them accessible to the DUT
  3. Job execution: The DUT turns on, deploys and executes the test environment, uploads results, then shuts down
  4. Job tear-down: The results from the job get forwarded to the client, before releasing all job resources

While the executor could perform all of these actions from the same process, we would recommend splitting the job execution into its own process, as this prevents configuration changes from affecting currently-running jobs, makes it easier to tell if a machine is running or idle, makes live-updating the executor trivial (see Part 4 if you are wondering why this would be desirable), and makes it easier to implement job preemption in the future.
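Here is a minimal Python sketch of that split; the phases inside run_job are only placeholders for the real setup/execution/tear-down logic.

# The long-lived executor only queues work; each job runs in its own process
# so restarting or updating the executor does not interrupt running jobs.
import multiprocessing

def run_job(job_description: dict):
    # 1. setup: push kernel/initramfs/artifacts to the gateway storage (MinIO)
    # 2. execution: power-cycle the DUT via the PDU service, follow the SALAD
    #    serial console, enforce watchdogs and retry counters
    # 3. tear-down: forward results to the client, release the machine
    print("running job on", job_description.get("machine_tags"))

def start_job(job_description: dict) -> multiprocessing.Process:
    # A test machine is idle when no live job process is attached to it.
    proc = multiprocessing.Process(target=run_job, args=(job_description,))
    proc.start()
    return proc

if __name__ == "__main__":
    start_job({"machine_tags": ["amdgpu", "renoir"]}).join()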

Here is how we propose the executor should interact with the other services:

Conclusion

In this post, we defined a list of requirements to efficiently time-share test machines between users, identified sets of services that satisfy these requirements, and detailed their interactions using sequence diagrams. Finally, we provided both recommendations and cautionary tales to help you set up your CI gateway.

In the next post, we will take a bit of a breather and focus on the maintainability of the CI farm through the creation of an administrator dashboard, easing access to the gateway using a Wireguard VPN, and monitoring of both the CI gateway and the test machines.

By the end of this blog series, we aim to have proposed a plug-and-play experience throughout the CI farm, and to have it automatically and transparently expose runners on GitLab/GitHub. This system will also hopefully be partially hosted on Freedesktop.org to help developers write, test, and maintain their drivers. The goal would be to have a setup time of under an hour for newcomers!

That's all for now, thanks for making it to the end of this post!

January 03, 2023

Droppin’ Em

While the blog was on break, some of you might have noticed that VK_EXT_descriptor_buffer was released with my greasy fingerprints on it.

You might also have noticed that Zink didn’t have a day-one MR.

The thing about Schrödinger’s Specification is that it only remains constant while you’re directly observing it. As soon as you look away, it becomes an amorphous entity changing in unknown and confusing ways.

But this extension does provide some nice benefits and future optimization sites for Zink:

  • further reduces CPU overhead by eliminating more API calls
  • allows for direct control over descriptor VRAM allocation
  • avoids potentially unnecessary flushing when descriptors update

Future Improvements

The initial work I’ve done handles the basic case of binding and updating descriptors. There are n sets for all the base descriptor types, and each one gets its own descriptor buffer. Each set of buffers belongs to a batch and gets bound when the batch’s cmdbuf begins recording. Simple.
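
For reference, the per-batch binding boils down to roughly this at the API level (a simplified sketch with made-up names, not the actual zink code; the entry points come from VK_EXT_descriptor_buffer and have to be fetched at runtime):

    /* Sketch: bind a batch's descriptor buffer when its cmdbuf begins
     * recording. "dev", "cmdbuf", "layout" and "desc_buffer_addr" are
     * placeholders for the real per-batch state. */
    #include <vulkan/vulkan.h>

    static void
    bind_batch_descriptor_buffer(VkDevice dev, VkCommandBuffer cmdbuf,
                                 VkPipelineLayout layout,
                                 VkDeviceAddress desc_buffer_addr)
    {
        PFN_vkCmdBindDescriptorBuffersEXT bind_buffers =
            (PFN_vkCmdBindDescriptorBuffersEXT)
            vkGetDeviceProcAddr(dev, "vkCmdBindDescriptorBuffersEXT");
        PFN_vkCmdSetDescriptorBufferOffsetsEXT set_offsets =
            (PFN_vkCmdSetDescriptorBufferOffsetsEXT)
            vkGetDeviceProcAddr(dev, "vkCmdSetDescriptorBufferOffsetsEXT");

        VkDescriptorBufferBindingInfoEXT info = {
            .sType = VK_STRUCTURE_TYPE_DESCRIPTOR_BUFFER_BINDING_INFO_EXT,
            .address = desc_buffer_addr,
            .usage = VK_BUFFER_USAGE_RESOURCE_DESCRIPTOR_BUFFER_BIT_EXT,
        };
        bind_buffers(cmdbuf, 1, &info);

        /* Point set 0 at the start of the buffer; descriptors written later
         * with vkGetDescriptorEXT() only need an offset update instead of an
         * API call per descriptor write. */
        uint32_t buffer_index = 0;
        VkDeviceSize offset = 0;
        set_offsets(cmdbuf, VK_PIPELINE_BIND_POINT_GRAPHICS, layout,
                    0 /* firstSet */, 1, &buffer_index, &offset);
    }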

There are some improvements that can still be made, however.

The first and most obvious is to add handling for bindless descriptors. I’m not sure how useful this really is since there are so few apps leveraging bindless and I haven’t noticed CPU utilization being an issue in any of them. But it’s something that can be done.

The second is to (massively) cut down on VRAM usage. I was conservative in my allocation for the descriptor buffers, which means they’re all sized to the maximum extents from the start, and all buffers are always allocated. This means that each batch gets five descriptor buffers, each of which consumes ~20MiB of VRAM, in order to avoid any sort of flushing/stalling later that would impact performance. Instead, however, it’s possible that these buffers could be allocated more dynamically, creating only the ones that will be used and then scaling up the size based on utilization. This is more of a mobile thing, and the consumption with descriptor buffers should still be less than the base mode, so it’s not a high priority.

Want to test? Try ZINK_DESCRIPTORS=db once the MR lands.

December 29, 2022

After the video decode stuff was fairly nailed down, Lynne from ffmpeg nerdsniped^Wtalked me into looking at h264 encoding.

The AMD VCN encoder engine has a very different interface from the decode engine and required a lot of code porting from the radeon vaapi driver. Prior to Xmas I burned a few days on typing that all in, and yesterday I finished typing and moved to debugging the pile of trash I'd just typed in.

Lynne meanwhile had written the initial ffmpeg side implementation, and today we threw them at each other, and polished off a lot of sharp edges. We were rewarded with valid encoded frames.

The code at this point is only doing I-frame encoding; we will work on P/B frames when we get a chance.

There are also a bunch of hacks and workarounds for API/hw mismatches, that I need to consult with Vulkan spec and AMD teams about, but we have a good starting point to move forward from. I'll also be offline for a few days on holidays so I'm not sure it will get much further until mid January.

My branch is [1]. Lynne ffmpeg branch is [2].

[1] https://gitlab.freedesktop.org/airlied/mesa/-/commits/radv-vulkan-video-enc-wip

[2] https://github.com/cyanreg/FFmpeg/tree/vulkan_decode

December 20, 2022

I've been working the past couple of weeks with an ffmpeg developer (Lynne) doing Vulkan video decode bringup on radv.

The current status of this work is in a branch[1], which is written against the EXT decode beta extensions in the spec.

Khronos has since released the final specs for these extensions, and the work has been rebased onto the final KHR form in a merge request for radv[2].

This contains an initial implementation of H264 and H265 decoding for AMD GPUs from TONGA to NAVI2x. It passes the basic conformance tests but fails some of the more complicated ones, but it has decoded the streams we've been throwing at it using ffmpeg.

Building:

git clone https://gitlab.freedesktop.org/airlied/mesa

git checkout radv-vulkan-video-prelim-decode

mkdir build

meson build -Dvulkan-beta=true -Dvulkan-drivers=amd -Dvideo-codecs=h264dec,h265dec --prefix=<prefix>

cd build

ninja

ninja install

Running:

export VK_ICD_FILENAMES=<prefix>/share/vulkan/icd.d/radeon_icd.x86_64.json
 

For prelim branch[1]:

export RADV_VIDEO_DECODE=1

For merge branch[2]:

export RADV_PERFTEST=video_decode

vulkaninfo

This should show support for VK_KHR_video_queue, VK_KHR_video_decode_queue, VK_EXT_video_decode_h264 and VK_EXT_video_decode_h265.

[1] https://gitlab.freedesktop.org/airlied/mesa/-/tree/radv-vulkan-video-prelim-decode

[2] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20388

December 18, 2022

Hi!

Following last month’s wlroots release, we’ve started the Sway release candidate cycle. Kenny Levinsen and Ronan Pigott have helped fix the bugs and regressions that popped up, and I hope we’ll be able to ship the final release next week. I also plan to release wlroots 0.16.1 with the fixes we’ve accumulated.

In other wlroots news, Manuel Stoeckl and I have continued to work on the Vulkan renderer. A lot more pixel formats are now supported for textures, and the buffer synchronization issues should all be sorted out. Next we’d like to add support for rendering to high bit depth buffers for color management purposes. This is a bit more involved since there is no shader stage which runs after blending, so I’d like to experiment with compute shaders to see if they’re better suited for us. That ties in with the renderer API redesign I’ve been planning for a while: the new rendering API should make it easier to use compute shaders, and already shows a nice perf boost for the Pixman renderer.

I’ve been debugging some issues with USB-C docks where outputs wouldn’t turn back on after a plug-unplug-replug cycle. The result is an i915 patch which fixes some of the issues, but it seems there are more where that came from. Ultimately this class of bugs should get fixed when we add support for atomically turning off multiple outputs at once in wlroots, but this will require a lot more work.

Alexander Orzechowski and I have been pushing surface-invalidation, a new Wayland protocol to help with GPU resets. GPU resets destroy the whole GL/Vulkan state on the compositor side, so compositors which can recover from resets are left with nothing to draw. The protocol allows compositors to request a new buffer from clients.

New versions of swaybg, swayidle and slurp have been released. swaybg and swayidle now take advantage of new protocols such as single-pixel-buffer-v1 and ext-idle-notify-v1. slurp can now enforce a specific aspect ratio for the selection rectangle, has a configurable font, and can print relative coordinates in its format string.

libdisplay-info is growing bit by bit. Sebastian Wick has added support for CTA short audio descriptors, and I’ve sent patches for CTA speaker allocation data blocks. I’ve continued my work on libjsonschema to support JSON serialization. By the way, libdisplay-info is now used in DXVK for EDID parsing!

I’ve merged some nice patches from delthas for the soju IRC bouncer. Alongside small quality-of-life improvements and fixes, a new WHO cache should make WHO queries more reliable, no longer hitting the rate limits of the upstream servers. Pending work on a database-backed message store will make chat history much more reliable, faster, and lossless (currently we drop all message tags). Last but not least, rj1 has implemented TLS certificate pinning to allow soju to connect to servers using self-signed certificates.

The Goguma IRC client for Android also has received some love this month. Various NOTICE and CTCP messages (e.g. from ChanServ or NickServ) should be less annoying, as they’re now treated as ephemeral. They don’t open notifications anymore, and if the user is not in a conversation with them the messages show up in a temporary bar at the bottom of the screen. I’ve also implemented image previews:

Image previews in Goguma

They are disabled by default because of privacy concerns, but can be enabled in the settings. Only image links are supported for now, but I plan to add HTML previews as well. Last, I’ve optimized the SQL queries run at startup, so launching the app should be a bit faster now.

The NPoTM is glhf, a dead simple IRC bot for GitLab projects. It’ll announce GitLab events in the IRC channel, and will expand links to issues and merge requests. We’re using it in #sway-devel, I’m pretty happy with it so far!

That’s all for now, see you in January!

December 16, 2022

After cleaning up the radv stuff I decided to go back and dig into the anv support for H264.

The current status of this work is in a branch[1]. This work is all against the current EXT decode beta extensions in the spec.

This contains an initial implementation of H264 decode for the Intel GPUs that anv supports. I've only tested it on Kabylake equivalents so far. It decodes some of the basic streams I've thrown at it from ffmpeg. Now this isn't as far along as the AMD implementation, but I'm also not sure I'm programming the hardware correctly. The Windows DXVA API has 2 ways to decode H264, short and long. I believe, but I'm not 100% sure, that the current Vulkan API is quite close to "short", but the only Intel implementations I've found source for are for "long". I've bridged this gap by writing a slice header parser in mesa, but I think the hw might be capable of taking over that task, and I could in theory dump a bunch of code. But the programming guides for the hw block are a bit vague on some of the details around how "long" works. Maybe at some point someone in Intel can tell me :-)

Building:

git clone https://gitlab.freedesktop.org/airlied/mesa

git checkout anv-vulkan-video-prelim-decode

mkdir build

meson build -Dvulkan-beta=true -Dvulkan-drivers=intel -Dvideo-codecs=h264dec --prefix=<prefix>

cd build

ninja

ninja install

Running:

export VK_ICD_FILENAMES=<prefix>/share/vulkan/icd.d/intel_icd.x86_64.json

export ANV_VIDEO_DECODE=1

vulkaninfo

This should show support for VK_KHR_video_queue, VK_KHR_video_decode_queue, VK_EXT_video_decode_h264.

[1] https://gitlab.freedesktop.org/airlied/mesa/-/tree/anv-vulkan-video-prelim-decode

December 14, 2022

Let me start by saying that I now also have a Mastodon account, so please follow me on @Cfkschaller@fosstodon.org for news and updates around Fedora Workstation, PipeWire, Wayland, LVFS and Linux in general.

Fedora vs Fedora Workstation
Before I start with the general development update I want to mention something I mentioned quite a few times before and that is the confusion I often find people have about what is Fedora and what is Fedora Workstation.

Fedora Workstation

Fedora is our overall open source project and community working on packaging components and software for the various outputs that the Fedora community delivers. Think of the Fedora community a bit like a big group of people providing well tested and maintained building blocks to be used to build operating systems and applications. As part of that bigger community you have a lot of working groups and special interest groups working to build something with those building blocks, and Fedora Workstation is the most popular thing being built from those building blocks. That means that Fedora Workstation isn’t ‘Fedora’, it is something created by the Fedora community alongside a lot of other projects like Fedora Server, Silverblue and Fedora spins like Fedora KDE and Fedora Kinoite. But all of them should be considered separate efforts built using a shared set of building blocks.
Putting together an operating system like Fedora Workstation is more than just assembling a list of software components to include though, it is also about setting policies, default configurations, testing and QE and marketing. This means that while Fedora Workstation contains many of the same components as other things like the Fedora KDE Plasma Desktop spin, the XFCE Desktop spin, the Cinnamon spin and so on, they are not the same thing. And that is not just because the feature set of GNOME is different from the feature set of XFCE, it is because each variant is able to set their own policies, their own configurations and do their own testing and QE. Different variants adopted different technologies at different times, for instance Fedora Workstation was an early adopter of new technologies like Wayland and PipeWire. So the reason I keep stressing this point is that to this day I often see comments or feedback about ‘Fedora’, feedback which, as someone spending a lot of effort on Fedora Workstation, sometimes makes no sense to me, only to reach out and discover that they were not using Fedora Workstation, but one of the spins. So I do ask people, especially those who are members of the technology press, to be more precise in their reviews about whether they are talking about Fedora Workstation or another project that is housed under the Fedora umbrella, and not just shorten it all to ‘Fedora’.

Fedora Workstation


Anyway, onwards to updating you on some of the work we are doing in and around Fedora Workstation currently.

High Dynamic Range – HDR

HDR

A major goal for us at Red Hat is to get proper HDR support into Fedora and RHEL. Sebastian Wick is leading that effort for us and working with partners and contributors across the community and the industry. For those who read my past updates you know I have been talking about this for a while, but it has been slow moving because it is a very complex issue that needs changes from the kernel graphics drivers up through the desktop shell and into the application GUI toolkits. As Sebastian put it when I spoke with him, HDR forces us (the Linux ecosystem) to take colors seriously for the first time and thus it reveals a lot of shortcomings up through our stack, many of which are not even directly HDR related.
Support for High Dynamic Range (HDR) requires compositors to compose framebuffers from different clients into a single framebuffer with a specific HDR encoding. Most clients nowadays are unaware of HDR and Wide Color Gamut (WCG), and instead produce pixels with RGB encodings which look okay on most displays, which approximate sRGB characteristics. Converting different encodings to the common HDR encoding requires clients to communicate their encoding, the compositor to adjust and convert between the different color spaces, and the driver to enable an HDR mode in the display. Converting between the encodings should ideally be done by the scanout engine of the GPU to keep the latency and power benefits of direct scanout. Similarly, applications and toolkits have to understand different encodings and how to convert between them to make use of HDR and WCG features.

Essentially, HDR forces us to handle color correctly and makes color management an integral part of the system.
So in no particular order here are some of the issues we have found and are looking at:

  • Display brightness controls are today limited to a single, internal monitor and setting the brightness level to zero might or might not turn off the screen. The luminance setting might scale linearly or approximate the human brightness perception depending on hardware. Hans de Goede on our team is working on fixing these issues to get as coherent behaviour from different hardware as possible.
  • GStreamer’s glcolorconvert element doesn’t actually convert colors, which breaks playback of some HDR and WCG encoded videos.
  • There are also some challenges with limitations in the current KMS API. User space cannot select the bit depth of the colors sent to the display, nor whether a YUV format with subsampling is used. Both are driver internal magic which can reduce the quality of the image but might be required to light up displays with the limited bandwidth available. The handling of limited and full range colors is driver defined and an override for badly behaving sinks is only available on Intel and VC4. And offloading the color transformations to the scanout hardware is impossible with the KMS API. Available features, their order and configurability varies a lot between hardware. To avoid any discrepancy between the offloaded and the software case the entire color pipeline has to be predictable by user space. We’re currently working together with hardware vendors to look at ways forward here.
  • To handle the communication of the color metadata between clients and compositors, Sebastian is working with the wider community to define two work-in-progress Wayland protocols. The color representation protocol describes how to go from a tristimulus value to the encoding (this includes YUV to RGB conversions, full vs limited color range, chroma siting) and the color-management protocol describes the tristimulus values (primaries, whitepoint, OETF, ICC profile, dynamic range).
  • There is also now a repository on FreeDesktop.org’s GitLab instance to document and discuss all things related to colors on Linux. This required a lot of research, thinking and discussing about how everything is supposed to fit together.
  • Compositors have to look up information from EDID and DisplayID for HDR. The data is notoriously unreliable currently so Simon Ser, Pekka Paalanen and Sebastian started a new project called libdisplay-info for parsing and quirking EDID+DisplayID.
  • Mutter is now in charge of ICC profile handling and we’re planning to deprecate the old colord daemon. Work on supporting high bit depth framebuffers and enabling HDR mode is ongoing and should give us an experimental setup for figuring out how to combine different SDR and HDR content.

Headless display and automated test of the GNOME Shell

Jonas Ådahl has been working on getting headless display working correctly under Wayland for a bit, and as part of that has also been working on using the headless display support to enable automated testing of gnome-shell. The core work to get GNOME Shell to be able to run headless on top of Wayland finished some time ago, but there are still some related tasks that need further work.

  • The most critical is remote login. With some hand holding it works and supports TPM based authentication for single user sessions, but currently lacks a nice way to start it. This item is on the very top of Jonas’ todo list so hopefully we can cross it off soon.
  • Remote multi user login: This feature is currently in active development and will make it possible to get a gdm login screen via gnome-remote-desktop where one can type one’s username and password to get a headless session. The remote multi user login is a collaboration between Jonas Ådahl, Pascal Nowack (the author of the RDP backend in gnome-remote-desktop) and Joan Torres from SUSE.
  • Testing: Jonas has also been putting in a lot of work recently to improve automated testing of GNOME Shell on the back of the headless display support. You can read more about that in his recent blog post on the subject.

HID-BPF

Benjamin Tissoires has been working on a set of kernel patches for some time now that implements something called HID-BPF. HID refers to Human Interface Devices (like mice, keyboards, joysticks and so on), and BPF is a kernel and user-space observability scheme for the Linux kernel. If you want to learn more about BPF in general I recommend this Linux Journal article on the subject. This blog post will talk specifically about HID-BPF and not BPF in general. We are expecting this patchset to either land in 6.2 or get delayed to 6.3.
The kernel documentation explains in length and through examples how to use HID-BPF and when. The summary would be that in the HID world, we often have to tweak a single byte on a device to make it work, simply because HW makers only test their device under Windows, and they can provide a “driver” that has the fix in software. So most of the time it comes to changing a few bytes in the report descriptor of the HID device, or on the fly when we receive an event. For years, we have been doing this sort of fixes in the kernel through the normal kernel driver model. But it is a relatively slow way of doing it and we should be able to be more efficient.

The kernel driver model process involves multiple steps and requires the reporter of the issue to test the fix at every step. The reporter has an issue and contacts a kernel developer to fix that device (sometimes, the reporter and the kernel developers are the same person). To be able to submit the fix, it has to be tested, and so compiled in the kernel upstream tree, which means that we require regular users to compile their kernel. This is not an easy feat, but that’s how we do it. Then, the patch is sent to the list and reviewed. This often leads to v2, v3 of the patch requiring the user to recompile a new kernel. Once it’s accepted, we still need to wait for the patch to land in Linus’ tree, and then for the kernel to be taken by distributions. And this is only now that the reporter can drop the custom kernel build. Of course, during that time, the reporter has a choice to make: should I update my distro kernel to get security updates and drop my fix, or should I recompile the kernel to keep everybody up to date, or should I just stick with my kernel version because my fix is way more important?
So how can HID-BPF make this easier?

So instead of the extensive process above, a fix can be written as a BPF program which can be run on any kernel. Instead of asking the reporter to recompile the upstream kernel, a kernel developer can compile the BPF program and provide both the sources and the binary to the tester. The tester now just has to drop that binary in a given directory to have a userspace component automatically bind this program to the device. When the review process happens, we can easily ask testers to update their BPF program, and they don’t even need to reboot (just unplug/replug the target device). For the developers, we continue doing the review, and we merge the program directly in the source tree. Not even a single #define needs to differ. Then an automated process (yet to be fully determined, but Benjamin has an RFC out there) builds that program into the kernel tree, and we can safely ship that tested program. When the program hits the distribution kernel, the user doesn’t even notice it: given that the kernel already ships that BPF program, the userspace will just not attach itself to the device, meaning that the test user will even benefit from the latest fixes, if there were any, without any manual intervention. Of course, if the user wants to update the kernel because of a security problem or another bug, the user will get both: the security fix and the HID fix still there.
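
To give a feel for what such a fix looks like, here is a rough sketch modelled on the examples in the HID-BPF series; since the patches have not landed yet the exact section names and helpers may still change, and the patched offset and values below are purely illustrative:

    // SPDX-License-Identifier: GPL-2.0
    /* Sketch of a HID-BPF report descriptor fixup. */
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    /* kfunc from the HID-BPF series giving access to the report descriptor */
    extern __u8 *hid_bpf_get_data(struct hid_bpf_ctx *ctx,
                                  unsigned int offset, const size_t __sz) __ksym;

    SEC("fmod_ret/hid_bpf_rdesc_fixup")
    int BPF_PROG(hid_rdesc_fixup, struct hid_bpf_ctx *hctx)
    {
        __u8 *data = hid_bpf_get_data(hctx, 0 /* offset */, 4096 /* size */);

        if (!data)
            return 0; /* EPERM check */

        /* Flip the single byte the vendor's Windows driver would have
         * patched, e.g. a bogus logical maximum in the report descriptor
         * (offset and values are made up for this example). */
        if (data[39] == 0x17)
            data[39] = 0x31;

        return 0;
    }

    char _license[] SEC("license") = "GPL";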

That single part is probably the most appealing thing about HID-BPF. And of course, this translates very well to the enterprise Linux world: when a customer has an issue on a HID device that can be fixed by HID-BPF, we can provide that customer with the pre-compiled binary to drop into the filesystem without having to rely on a kernel update.
But there is more: we can now start implementing a HID firewall (so that only fwupd can change the firmware of a HID device), we can also tell people not to invent ad-hoc kernel APIs, but rely on eBPF to expose that functionality (so that if the userspace program is not there, we don’t even expose that capability). And we can be even more creative, like deciding to change how the kernel presents a device to the userspace.

Benjamin speaking at Kernel Recipes


For more details I recommend checking out the talk Benjamin did at the kernel recipes conference.

MIPI Camera
Kate Hsuan and Hans de Goede have been working together trying to get the Linux support for MIPI cameras into shape. MIPI cameras are the next generation of PC cameras, and you are going to see more and more laptops shipping with these, so it is critical for us to get them working, and working well. This work is also done in the context of our close collaboration with Lenovo around the Fedora laptops they offer.
Without working support, Fedora Workstation users who buy a laptop equipped with a MIPI camera will only get a black image. The long term solution for MIPI cameras is a library called libcamera, which is a project led by Laurent Pinchart and Kieran Bingham from IdeasOnBoard and sponsored by Google. For desktop Linux users, applications will probably not interact directly with libcamera though; instead our expectation is that your applications use libcamera through PipeWire and the Flatpak portal. Thanks to the work of the community we now have support for the Intel IPU3 MIPI stack in the kernel and through libcamera, but the Intel IPU6 MIPI stack is the one we expect to see head into laptops in a major way, and thus our current effort is focused on bringing support for that in a useful way to Fedora users. Intel has so far provided a combination of binary and limited open source code to support these cameras under Linux, by using GStreamer to provide an emulated V4L2 device. This is not a great long term solution, but it does provide us with at least an intermediate solution until we can get IPU6 working with libcamera. Based on his successful work on IPU3, Hans de Goede has been working on getting the necessary part of the Intel open source driver for IPU6 upstreamed, since that is the basic interface to control the IPU6 and image sensor. Kate has been working on packaging all the remaining Intel non-free binary and software releases into RPM packages. The packages will provide the v4l2loopback solution, which should work with any software supporting V4L2. These packages are planned to go live soon in the RPM Fusion nonfree repository.

LVFS – Linux Vendor Firmware Service
Richard Hughes is still making great strides forward with the now ubiquitous LVFS service. A few months ago we pushed the UEFI dbx update to the LVFS, which has now been downloaded over 4 million times. This pushed the total downloads to a new high of 5 million updates just in the 4 weeks of September, although it’s returned to a more normal growth pattern now.

The LVFS also now supports 1,163 different devices, and is being used by 137 different vendors. Since we started all those years ago we’ve provided at least 74 million updates to end-users, although it’s probably even more given that lots of Red Hat customers mirror the entire LVFS for internal use. Not to mention that thanks to Richard’s collaboration with Google, LVFS is now an integral part of the ‘Works with ChromeOS’ program.

On the LVFS we also now show the end-user provided HSI reports for a lot of popular hardware. This is really useful to check how secure the device will likely be before buying new hardware, regardless of whether the vendor is uploading to the LVFS. We’ve also asked ODMs and OEMs who do actually use the LVFS to use signed reports. Once that is in place we can continue to scale up, aiming for 10,000 supported devices and 100 million downloads.

Latest LVFS statistics

PipeWire & OBS Studio

In addition to doing a bunch of bug fixes and smaller improvements around audio in PipeWire, Wim Taymans has spent time recently working on getting the video side of PipeWire in shape. We decided to use OBS Studio as our ‘proof of concept’ application because there was already a decent set of patches from Georges Stavracas and Columbarius that Wim could build upon. Getting the Linux developer community to adopt PipeWire for video will be harder than it was for audio since we can not do a drop-in replacement; instead it will require active porting by application developers. Unlike for audio, where PipeWire provides a binary compatible implementation of the ALSA, PulseAudio and JACK APIs, we can not provide a re-implementation of the V4L API that can run transparently and reliably in place of the actual V4L2. That said, Wim has created a tool called pw-v4l2 which tries to redirect the v4l2 calls into PipeWire, so you can use that for some testing, for example by running ‘pw-v4l2 cheese’ on the command line and you will see Cheese appear in the PipeWire patchbay applications Helvum and qpwgraph. As stated though it is not reliable enough to be something we can for instance have GNOME Shell do as a default for all camera handling applications. Instead we will rely on application developers out there to look at the work we did in OBS Studio as a best practice example and then implement camera input through PipeWire using that. This brings with it a lot of advantages though, like transparent support for libcamera alongside v4l2 and easy sharing of the video streams between multiple applications.
One thing to note about this is that at the moment we have two ‘equal’ video backends in PipeWire, V4L2 and libcamera, which means you get offered the same device twice, once from v4l2 and once from libcamera. Obviously this is a little confusing and annoying. Since libcamera is still in heavy development the v4l2 backend is more reliable for now, so Fedora Workstation will ship with the v4l2 backend ‘out of the box’, but will allow you to easily install the libcamera backend. As libcamera matures we will switch over to the libcamera backend (and allow you to install the v4l2 backend if you still need/want it for some reason).

Helvum running OBS Studio with native PipeWire support and Cheese using the pw-v4l2 redirector.

Of course the most critical thing to get ported to use PipeWire for camera handling is the web browsers. Luckily thanks to the work of Pengutronix there is a patchset ready to be merged into WebRTC, which is the implementation used by both Chromium/Chrome and Firefox. While I can make no promises we are currently looking to see if it is viable for us to start shipping that patch in the Fedora Firefox package soon.

And finally, thanks to the work by Jan Grulich in collaboration with Google engineers, PipeWire is now included in the Chromium test build, which has allowed Google to enable PipeWire for screen sharing by default. Having PipeWire in the test builds is also going to be critical for getting the camera handling patch merged and enabled.

Flathub, Flatpak and Fedora

As I have spoken about before, we have a clear goal of moving as close as we can to a Flatpak only model for Fedora Workstation. There are a lot of reasons for this, like making applications more robust (i.e. the OS doesn’t keep moving fast underneath the application), making updates more reliable (because an application updating its dependencies doesn’t risk conflicting with the needs of another application), making applications more portable in the sense that they will not need to be rebuilt for each variety of operating system, providing better security since applications can be sandboxed in a container with clear access limitations, and allowing us to move to an OS model like we have in Silverblue with an immutable core.
So over the last years we spent a lot of effort alongside other members of the Linux community preparing the ground to allow us to go in this direction, with improvements to GTK+ and Qt for instance to allow them to work better in sandboxes like the one provided by Flatpak.
There has also been strong support and growth around Flathub which now provides a wide range of applications in Flatpak format and being used for most major Linux distributions out there. As part of that we have been working to figure out policies and user interface to allow us to enable Flathub fully in Fedora (currently there is a small allowlisted selection available when you enable 3rd party software). This change didn’t make it into Fedora Workstation 37, but we do hope to have it ready for Fedora Workstation 38. As part of that effort we also hope to take another look at the process for building Flatpaks inside Fedora to reduce the barrier for Fedora developers to do so.

So how do we see things evolving in terms of distribution packaging? Well that is a good question. First of all a huge part of what goes into a distribution is not Flatpak material, considering that Flatpaks are squarely aimed at shipping GUI desktop applications. There is a huge list of libraries and tools that are either used to build the distribution itself, like Wayland, GNOME Shell, libinput, PipeWire etc., or tools used by developers on top of the operating system like Python, Perl, Rust tools etc. So these will still need to be RPM packaged for the distribution regardless. And there will be cases where you still want certain applications to be RPM packaged going forward. For instance many of you hopefully are aware of Container Toolbx, our effort to make pet containers a great tool for developers. What you install into your toolbox, including some GUI applications like many IDEs, will still need to be packaged as RPMs and installed into each toolbox, as they have no support for interacting with a toolbox from the host. Over time we hope that more IDEs will follow in GNOME Builder’s footsteps and become container aware and thus can be run from the host system as a Flatpak, but apart from Owen Taylor’s VS Code integration plugin most of them are not yet, and thus need to be installed inside your Toolbx.

As for building Flatpaks in Fedora as opposed to on Flathub, we are working on improving the developer experience around that. There are many good reasons why one might want to maintain a Fedora Flatpak, things like liking the Fedora content and security policies or just being more familiar with using the tested and vetted Fedora packages. Of course there are good reasons why developers might prefer maintaining applications on Flathub too, we are fine with either, but we want to make sure that whatever path you choose we have a great developer and package maintainer experience for you.

Multi-Stream Transport

Multi-monitor setups have become more and more common and popular, so one effort we have spent time on for the last few years is Lyude Paul’s work to clean up the MST support in the kernel. MST is a specification for DisplayPort that allows multiple monitors to be driven from a single DisplayPort port by multiplexing several video streams into a single stream and sending it to a branch device, which demultiplexes the signal into the original streams. DisplayPort MST will usually take the form of a single USB-C or DisplayPort connection. More recently, we’ve also seen higher resolution displays – and complicated technologies like DSC (Display Stream Compression) which need proper driver support in order to function.

Making setups like docks work is no easy task. In the Linux kernel, we have a set of shared code that any video driver can use in order to much more easily implement support for features like DisplayPort MST. We’ve put quite a lot of work into making this code both viable for any driver to use, but also to build new functionality on top of for features such as DSC. Our hope is that with this we can both encourage the growth of support for functionality like MST, and support of further features like DSC from vendors through this work. Since this code is shared, this can also come with the benefit that any new functionality implemented through this path is far easier to add to other drivers.
Lyude has mostly finished this work now and has recently been focusing on fixing some regressions that accidentally came upstream in amdgpu. The main stuff she was working on beforehand was a lot of code cleanup, particularly removing a bunch of the legacy MST code. For context: with kernel modesetting we have legacy modesetting, and atomic modesetting. Atomic modesetting is what’s used for modern drivers, and it’s a great deal simpler to work with than legacy modesetting. Most of the MST helpers were written before atomic was a thing, and as a result there was a pretty big mess of code that both didn’t really need to be there – and actively made it a lot more difficult to implement new functionality and figure out whether bug fixes that were being submitted to Lyude were even correct or not. Now that we’ve cleaned this up though, the MST helpers make heavy use of atomic and this has definitely simplified the code quite a bit.

December 13, 2022

Time for another status update on libei, the transport layer for bouncing emulated input events between applications and Wayland compositors [1]. And this time it's all about portals and how we're about to use them for libei communication. I've hinted at this in the last post, but of course you're forgiven if you forgot about this in the... uhm.. "interesting" year that was 2022. So, let's recap first:

Our basic premise is that we want to emulate and/or capture input events in the glorious new world that is Wayland (read: where applications can't do whatever they want, whenever they want). libei is a C library [0] that aims to provide this functionality. libei supports "sender" and "receiver" contexts and that just specifies which way the events will flow. A sender context (e.g. xdotool) will send emulated input events to the compositor, a "receiver" context will - you'll never guess! - receive events from the compositor. If you have the InputLeap [2] use-case, the server-side will be a receiver context, the client side a sender context. But libei is really just the transport layer and hasn't had that many changes since the last post - most of the effort was spent on trying to figure out how to exchange the socket between different applications. And for that, we have portals!

RemoteDesktop

In particular, we have a PR for the RemoteDesktop portal to add that socket exchange. Specifically, once a RemoteDesktop session starts your application can request an EIS socket and send input events over that. This socket supersedes the current NotifyButton and similar DBus calls and removes the need for the portal to stay in the middle - the application and compositor now talk directly to each other. The compositor/portal can still close the session at any time though, so all the benefits of a portal stay there. The big advantage of integrating this into RemoteDesktop is that the infrastructure for that is already mostly in place - once your compositor adds the bits for the new ConnectToEIS method you get all the other pieces for free. In GNOME this includes a visual indication that your screen is currently being remote-controlled, same as from a real RemoteDesktop session.

Now, talking to the RemoteDesktop portal is nontrivial simply because using DBus is nontrivial, doubly so for the way how sessions and requests work in the portals. To make this easier, libei 0.4.1 now includes a new library "liboeffis" that enables your application to catch the DBus. This library has a very small API and can easily be integrated with your mainloop (it's very similar to libei). We have patches for Xwayland to use that and it's really trivial to use. And of course, with the other Xwayland work we already had this means we can really run xdotool through Xwayland to connect through the XDG Desktop Portal as a RemoteDesktop session and move the pointer around. Because, kids, remember, uhm, Unix is all about lots of separate pieces.

InputCapture

On to the second mode of libei - the receiver context. For this, we also use a portal but a brand new one: the InputCapture portal. The InputCapture portal is the one to use to decide when input events should be captured. The actual events are then sent over the EIS socket.

Right now, the InputCapture portal supports PointerBarriers - virtual lines on the screen edges that, once crossed, trigger input capture for a capability (e.g. pointer + keyboard). And an application's basic approach is to request a (logical) representation of the available desktop areas ("Zones") and then set up pointer barriers at the edge(s) of those Zones. Get the EIS connection, Enable() the session and voila - the compositor will (hopefully) send input events when the pointer crosses one of those barriers. Once that happens you'll get a DBus signal in InputCapture and the events will start flowing on the EIS socket. The portal itself doesn't need to sit in the middle, events go straight to the application. The portal can still close the session anytime though. And the compositor can decide to stop capturing events at any time.

There is actually zero Wayland-y code in all this, it's display-system agnostic. So anyone with too much motivation could add this to the X server too. Because that's what the world needs...

The (currently) bad news is that this needs to be pulled into a lot of different repositories. And everything needs to get ready before it can be pulled into anything to make sure we don't add broken API to any of those components. But thanks to a lot of work by Olivier Fourdan, we have this mostly working in InputLeap (tbh the remaining pieces are largely XKB related, not libei-related). Together with the client implementation (through RemoteDesktop) we can move pointers around like in the InputLeap of old (read: X11).

Our current goal is for this to be ready for GNOME 45/Fedora 39.

[0] eventually a protocol but we're not there yet
[1] It doesn't actually have to be a compositor but that's the prime use-case, so...
[2] or barrier or synergy. I'll stick with InputLeap for this post

December 12, 2022
Igalia Logo next to the Vulkan Logo

The end of 2022 is very close so I’m just in time for some self-promotion. As you may know, the ongoing collaboration between Valve and Igalia lets me and some of my colleagues work on improving the open-source Vulkan and OpenGL Conformance Test Suite. This work is essential to ship quality Vulkan drivers and, from the Khronos side, to improve the Vulkan standard further by, among other things, adding new functionality through API extensions. When creating a new extension, apart from reaching consensus among vendors about the scope and shape of the new APIs, CTS tests are developed in order to check the specification text is clear and vendors provide a uniform implementation of the basic functionality, corner cases and, sometimes, interactions with other extensions.

In addition to our CTS work, many times we review the Vulkan specification text from those extensions we develop tests for. We also do the same for other extensions and changes, and we also submit fixes and improvements of our own.

In 2022, our work was important to be able to ship a bunch of extensions you can probably see implemented in Mesa and used by VKD3D-Proton when playing your favorite games on Linux, be it on your PC or perhaps on the fantastic Steam Deck. Or maybe used by Zink when implementing OpenGL on top of your favorite Vulkan driver. Anyway, without further ado, let’s take a look.

VK_EXT_image_2d_view_of_3d

This extension was created by our beloved super good coder Mike Blumenkrantz to be able to create a 2D view of a single slice of a 3D image. It helps emulate functionality which was already possible with OpenGL, and is used by Zink. Siru developed tests for this one but we reviewed the spec and are listed as contributors.
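
For the curious, the mechanism boils down to roughly the following (an illustrative sketch, not Zink’s or the CTS’ actual code): the 3D image is created with a flag allowing 2D views, and the view’s base array layer selects the slice.

    /* Sketch: create a 3D image that allows 2D views, then expose one slice
     * of it as a plain 2D sampled image. Creation and memory binding of
     * "image" are elided. */
    VkImageCreateInfo img_info = {
        .sType = VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO,
        .flags = VK_IMAGE_CREATE_2D_VIEW_COMPATIBLE_BIT_EXT,
        .imageType = VK_IMAGE_TYPE_3D,
        .format = VK_FORMAT_R8G8B8A8_UNORM,
        .extent = { 256, 256, 32 },
        .mipLevels = 1,
        .arrayLayers = 1,
        .samples = VK_SAMPLE_COUNT_1_BIT,
        .tiling = VK_IMAGE_TILING_OPTIMAL,
        .usage = VK_IMAGE_USAGE_SAMPLED_BIT,
        .initialLayout = VK_IMAGE_LAYOUT_UNDEFINED,
    };

    VkImageViewCreateInfo view_info = {
        .sType = VK_STRUCTURE_TYPE_IMAGE_VIEW_CREATE_INFO,
        .image = image,
        .viewType = VK_IMAGE_VIEW_TYPE_2D, /* a 2D view of the 3D image */
        .format = VK_FORMAT_R8G8B8A8_UNORM,
        .subresourceRange = {
            .aspectMask = VK_IMAGE_ASPECT_COLOR_BIT,
            .levelCount = 1,
            .baseArrayLayer = 7, /* the single slice being exposed */
            .layerCount = 1,
        },
    };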

VK_EXT_shader_module_identifier

One of my favorite extensions shipped in 2022. Created by Hans-Kristian Arntzen to be used by Proton, this extension lets applications query identifiers (hashes, if you want to think about them like that) for existing VkShaderModule objects and also to provide said identifiers in lieu of actual VkShaderModule objects when creating a pipeline. This apparently simple change has real-world impact when downloading and playing games on a Steam Deck, for example.

You see, DX12 games ship their shaders typically in an intermediate assembly-like representation called DXIL. This is the equivalent of the assembly-like SPIR-V language when used with Vulkan. But Proton has to, when implementing DX12 on top of Vulkan, translate this DXIL to SPIR-V before passing the shader down to Vulkan, and this translation takes some time that may result in stuttering that, if done correctly, would not be present in the game when it runs natively on Windows.

Ideally, we would bypass this cost by shipping a Proton translation cache with the game when you download it on Linux. This cache would allow us to hash the DXIL module and use the resulting hash as an index into a database to find a pre-translated SPIR-V module, which can be super-fast. Hooray, no more stuttering from that! You may still get stuttering when the Vulkan driver has to compile the SPIR-V module to native GPU instructions, just like the DX12 driver would when translating DXIL to native instructions, if the game does not, or cannot, pre-compile shaders somehow. Yet there’s a second workaround for that.

If you’re playing on a known platform with known hardware and drivers (think Steam Deck), you can also ship a shader cache for that particular driver and hardware. Mesa drivers already have shader caches, so shipping a RADV cache with the game makes total sense and we would avoid stuttering once more, because the driver can hash the SPIR-V module and use the resulting hash to find the native GPU module. Again, this can be super-fast so it’s fantastic! But now we have a problem, you see? We are shipping a cache that translates DXIL hashes to SPIR-V modules, and a driver cache that translates SPIR-V hashes to native modules. And both are big. Quite big for some games. And what do we want the SPIR-V modules for? For the driver to calculate their hashes and find the native module? Wouldn’t it be much more efficient if we could pass the SPIR-V hash directly to the driver instead of the actual module? That way, the database translating DXIL hashes to SPIR-V modules could be replaced with a database that translates DXIL hashes to SPIR-V hashes. This can save space in the order of gigabytes for some games, and this is precisely what this extension allows. Enjoy your extra disk space on the Deck and thank Hans-Kristian for it! We reviewed the spec, contributed to it, and created tests for this one.
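
In API terms the flow looks roughly like this (a hand-written sketch, not Proton’s actual code; “dev” and “shader_module” are assumed to already exist):

    /* Sketch: record a module's identifier once, then create later pipelines
     * from the identifier alone; fall back to the real SPIR-V only when the
     * driver cache misses. */
    PFN_vkGetShaderModuleIdentifierEXT get_id =
        (PFN_vkGetShaderModuleIdentifierEXT)
        vkGetDeviceProcAddr(dev, "vkGetShaderModuleIdentifierEXT");

    VkShaderModuleIdentifierEXT id = {
        .sType = VK_STRUCTURE_TYPE_SHADER_MODULE_IDENTIFIER_EXT,
    };
    get_id(dev, shader_module, &id); /* store id.identifier in the cache db */

    /* On later runs, no VkShaderModule (and thus no SPIR-V) is needed: */
    VkPipelineShaderStageModuleIdentifierCreateInfoEXT id_info = {
        .sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_MODULE_IDENTIFIER_CREATE_INFO_EXT,
        .identifierSize = id.identifierSize,
        .pIdentifier = id.identifier,
    };
    VkPipelineShaderStageCreateInfo stage = {
        .sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO,
        .pNext = &id_info,
        .stage = VK_SHADER_STAGE_FRAGMENT_BIT,
        .module = VK_NULL_HANDLE, /* the identifier replaces the module */
        .pName = "main",
    };
    /* The pipeline must then be created with
     * VK_PIPELINE_CREATE_FAIL_ON_PIPELINE_COMPILE_REQUIRED_BIT; if the driver
     * cache misses, vkCreateGraphicsPipelines() returns
     * VK_PIPELINE_COMPILE_REQUIRED and the caller retries with the real
     * module. */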

VK_EXT_attachment_feedback_loop_layout

This one was written by Joshua Ashton and allows applications to put images in the special VK_IMAGE_LAYOUT_ATTACHMENT_FEEDBACK_LOOP_OPTIMAL_EXT layout, in which they can both be used to render to and to sample from at the same time. It’s used by DXVK 2.0+ to more efficiently support D3D9 games that read from active render targets. We reviewed, created tests and contributed to this one.

VK_EXT_mesh_shader

I don’t need to tell you more about this one. You saw the Khronos blog post. You watched my XDC 2022 talk (and Timur’s). You read my slides. You attended the Vulkan Webinar. Important to actually have mesh shaders on Vulkan like you have in DX12, so emulation of DX12 on Vulkan was a top goal. We contributed to the spec and created tests.

VK_EXT_mutable_descriptor_type

“Hey, hey, hey!” I hear you protest. “This is just a rename of the VK_VALVE_mutable_descriptor_type extension which was released at the end of 2020.” Right, but I didn’t get to tell you about it then, so bear with me for a moment. This extension was created by Hans-Kristian Arntzen and Joshua Ashton and it helps efficiently emulate the raw D3D12 binding model on top of Vulkan. For that, it allows you to have descriptors with a type that is only known at runtime, and also to have descriptor pools and sets that reside only in host memory. We had reviewed the spec and created tests for the Valve version of the extension. Those same tests are the VK_EXT_mutable_descriptor_type tests today.
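
A rough sketch of how a mutable binding is declared (illustrative values only, not any real driver or layer code):

    /* Sketch: a large, D3D12-style binding whose concrete descriptor type is
     * only known at runtime; the layout just lists the allowed types. */
    VkDescriptorType allowed[] = {
        VK_DESCRIPTOR_TYPE_SAMPLED_IMAGE,
        VK_DESCRIPTOR_TYPE_STORAGE_IMAGE,
        VK_DESCRIPTOR_TYPE_UNIFORM_TEXEL_BUFFER,
    };
    VkMutableDescriptorTypeListEXT type_list = {
        .descriptorTypeCount = 3,
        .pDescriptorTypes = allowed,
    };
    VkMutableDescriptorTypeCreateInfoEXT mutable_info = {
        .sType = VK_STRUCTURE_TYPE_MUTABLE_DESCRIPTOR_TYPE_CREATE_INFO_EXT,
        .mutableDescriptorTypeListCount = 1, /* one list per binding */
        .pMutableDescriptorTypeLists = &type_list,
    };
    VkDescriptorSetLayoutBinding binding = {
        .binding = 0,
        .descriptorType = VK_DESCRIPTOR_TYPE_MUTABLE_EXT,
        .descriptorCount = 1024, /* heap-style array of descriptors */
        .stageFlags = VK_SHADER_STAGE_ALL,
    };
    VkDescriptorSetLayoutCreateInfo layout_info = {
        .sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO,
        .pNext = &mutable_info,
        .bindingCount = 1,
        .pBindings = &binding,
    };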

VK_EXT_extended_dynamic_state3

The final boss of dynamic state, which helps you reduce the number of pipeline objects in your application as much as possible. Combine some of the old and new dynamic states with graphics pipeline libraries and you may enjoy stutter-free gaming. Guaranteed!1 This one will be used by native apps, translation layers (including Zink) and you-name-it. We developed tests and reviewed the spec for it, and a small sketch of the new dynamic state calls follows below.

1Disclaimer: not actually guaranteed.
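
As that small illustration of VK_EXT_extended_dynamic_state3, here are a few of the new dynamic state setters (a sketch; “dev” and “cmd” are assumed to exist, and the entry points are fetched at runtime because they come from an extension):

    /* Sketch: state that previously forced separate pipeline objects can now
     * be flipped while recording the command buffer. */
    PFN_vkCmdSetPolygonModeEXT set_polygon_mode =
        (PFN_vkCmdSetPolygonModeEXT)
        vkGetDeviceProcAddr(dev, "vkCmdSetPolygonModeEXT");
    PFN_vkCmdSetRasterizationSamplesEXT set_samples =
        (PFN_vkCmdSetRasterizationSamplesEXT)
        vkGetDeviceProcAddr(dev, "vkCmdSetRasterizationSamplesEXT");
    PFN_vkCmdSetColorBlendEnableEXT set_blend_enable =
        (PFN_vkCmdSetColorBlendEnableEXT)
        vkGetDeviceProcAddr(dev, "vkCmdSetColorBlendEnableEXT");

    set_polygon_mode(cmd, VK_POLYGON_MODE_FILL);
    set_samples(cmd, VK_SAMPLE_COUNT_1_BIT);

    VkBool32 blend_enable = VK_TRUE;
    set_blend_enable(cmd, 0 /* first attachment */, 1, &blend_enable);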

December 08, 2022

It has been a busy couple of months. As I pointed on my last blog post, I finished GSoC and joined the Igalia Coding Experience mentorship project. In October, I also traveled to Minneapolis for XDC 2022 where I presented to the Linux Graphics community our AMD/KUnit work with my colleagues. So, let’s make a summary of the last couple of months.

XDC 2022


Just a small thank you note to X.Org Foundation for sponsoring my travel to Minneapolis. XDC 2022 was a great experience, and I learned quite a lot during the talks. Although I was a newcomer, all developers were very nice to me, and it was great to talk to experienced developers (and meet my mentors in person). Also, I presented the GSoC/XOrg work on the first day of the conference and this talk is available on YouTube.

Working with the Raspberry Pi 4


As I mentioned in my last blog post, GSoC was a great learning experience and I’m willing to keep learning about the Linux graphics stack. Fortunately, when I started the Igalia CE, Melissa Wen pitched me a project to increase IGT test coverage on DRM/V3D kernel driver. I was pretty glad to hear about the project as it allowed me to learn more about how a GPU works.

The Project

Currently, V3D only has three basic IGT tests: v3d_get_bo_offset, v3d_get_param, and v3d_mmap. So, the basic goal of my CE project was to add more tests to the V3D driver.

As the general DRM-core tests were in a good shape on the V3D driver, I started to think together with my mentors about more driver-specific tests for the driver.

By checking the V3D UAPI, you can see that the V3D has eleven ioctls, so there is still a lot to test for V3D on IGT.

First, there are the Buffer Object (BO) related ioctls: v3d_create_bo, v3d_wait_bo, v3d_mmap_bo, and v3d_get_bo_offset. The Buffer Objects are shared-memory objects that are allocated for the GPU to store things like vertex data. Therefore, testing them is important to make sure that memory is being correctly allocated. Unlike the VC4, the V3D has an MMU between the GPU and the bus, so it doesn’t need to allocate objects contiguously. Therefore, the idea was to develop tests for v3d_create_bo and v3d_wait_bo.
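
As a sketch of what exercising v3d_create_bo looks like at the UAPI level (hand-written here for illustration, not the actual IGT test; header include paths may differ between systems):

    /* Sketch: allocate a 4 KiB BO through DRM_IOCTL_V3D_CREATE_BO and check
     * that the kernel handed back a GEM handle and an offset in the V3D MMU
     * address space; error handling is reduced to asserts for brevity. */
    #include <assert.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <drm/drm.h>      /* struct drm_gem_close */
    #include <drm/v3d_drm.h>  /* struct drm_v3d_create_bo */

    static void test_create_bo(int fd)
    {
        struct drm_v3d_create_bo create;
        memset(&create, 0, sizeof(create));
        create.size = 4096;

        assert(ioctl(fd, DRM_IOCTL_V3D_CREATE_BO, &create) == 0);
        assert(create.handle != 0); /* GEM handle used by the other ioctls */
        /* create.offset is where the MMU placed the BO in the V3D address
         * space, since objects don't have to be physically contiguous. */

        struct drm_gem_close close_req = { .handle = create.handle };
        assert(ioctl(fd, DRM_IOCTL_GEM_CLOSE, &close_req) == 0);
    }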

Next, there are the Performance Monitor (perfmon) related ioctls: v3d_perfmon_create, v3d_perfmon_destroy, and v3d_perfmon_get_values. Performance Monitors are basically registers that are used for monitoring the performance of the V3D engine. So, tests were designed to ensure that the driver was creating perfmons properly and was resilient to incorrect requests, such as trying to get a value from a non-existent perfmon.
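
For instance, a negative test along these lines (again just a sketch, not the submitted IGT code) checks that the kernel rejects reads from a perfmon that was never created:

    /* Sketch: create a perfmon with a single counter, verify that reading a
     * bogus perfmon id fails, then destroy the real one. Counter indices are
     * defined by the V3D UAPI; 0 is used purely as an example. */
    #include <assert.h>
    #include <errno.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <drm/v3d_drm.h>

    static void test_perfmon(int fd)
    {
        struct drm_v3d_perfmon_create create;
        memset(&create, 0, sizeof(create));
        create.ncounters = 1;
        create.counters[0] = 0; /* example counter index */
        assert(ioctl(fd, DRM_IOCTL_V3D_PERFMON_CREATE, &create) == 0);

        __u64 values[DRM_V3D_MAX_PERF_COUNTERS] = { 0 };
        struct drm_v3d_perfmon_get_values get = {
            .id = create.id + 100, /* a perfmon id that was never created */
            .values_ptr = (__u64)(uintptr_t)values,
        };
        /* The driver is expected to reject the bogus id (EINVAL today). */
        assert(ioctl(fd, DRM_IOCTL_V3D_PERFMON_GET_VALUES, &get) == -1);
        assert(errno == EINVAL);

        struct drm_v3d_perfmon_destroy destroy = { .id = create.id };
        assert(ioctl(fd, DRM_IOCTL_V3D_PERFMON_DESTROY, &destroy) == 0);
    }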

And finally, the most interesting type of ioctls: the job submission ioctls. You can use the v3d_submit_cl ioctl to submit commands to the 3D engine, which is a tiled engine. When I think about tiled rendering, I always think about a Super Nintendo, but things can get a bit more complicated than a SNES as you can see here. The 3D engine is composed of binning and rendering pipelines, each with its own command list. The binning step maps the tile to a piece of the frame and the rendering step renders the tile based on its mapping.

By testing the v3d_submit_cl ioctl, it is possible to test syncing between jobs and also the V3D multisync ability.

Moreover, the V3D also has a TFU (texture formatting unit) and a CSD (compute shader dispatch), which have their own ioctls: v3d_submit_tfu and v3d_submit_csd. The TFU performs format conversions and generates mipmaps, and the CSD is responsible for dispatching a compute shader.

So, the idea is to write tests for all those functionalities from V3D, increasing the testability of V3D on IGT. Although things are not yet fully done, I’ve been enjoying working on and exploring V3D, IGT, and Mesa. After this experience with Mesa and also XDC, I became more and more interested in Mesa.

A Noop Job…

In order to test the v3d_submit_cl ioctl, I needed to design a job to be submitted. So, Melissa suggested using Mesa’s noop job specification on IGT to perform the tests. The idea was quite simple: submit a noop job and create tests based on it. But, it was not that simple after all…

First, I must say that I’m mostly a kernel developer, so I was not familiar with Mesa. So, maybe it was not that hard to figure out, but I took a while to understand Mesa’s packets and how to submit them.

The main problem I faced in submitting a noop job on IGT was that I would have to copy many, many Mesa files to IGT. I took a while fighting against this idea, looking for other ways to submit a job to V3D. But, as some experienced developers pointed out, packeting is the best option for it.

After some time, I was able to bring the Mesa structure to IGT with a minimal (although not that minimal) overhead. But, I’m still not able to run a successful noop job as the job’s fence is not being signaled by the end of the job.

Series Submitted

Although my noop job has not landed yet, so far, I was able to submit two series to IGT: one for the V3D driver and the other for the VC4 driver.

Apart from cleanups in the drivers, I added tests for the v3d_create_bo ioctl and the V3D’s and VC4’s perfmon ioctls. Moreover, as I was running the VC4 tests on the Raspberry Pi 4, I realized that most of the VC4 tests were failing on V3D, considering the VC4 doesn’t have rendering abilities on the Raspberry Pi 4. So, I also created checks to assure that the VC4 tests are not running on V3D.

Those series are still being reviewed, but I hope to get them merged soon.

Next Steps


My biggest priority now is to run a noop job on IGT. For that, I’m currently running the CTS tests on the Raspberry Pi 4 in order to reproduce a noop job and understand why my current job is resulting in a hang. I added a couple of debug logs (aka printf) on Mesa and now I can see the contents of the BOs and the parameters of the submission. So, I hope to get a fully-working noop job now.

After I develop my fully working noop job, I will finish the v3d_wait_bo tests, as those only make sense if I submit a job and wait for a BO afterwards, and design the v3d_submit_cl tests as well. For this last one, I especially hope to test the syncing functionalities of V3D.

Moreover, I hope to write soon a piece about cross-compiling CTS for the Raspberry Pi 4, which was a fun digression on this CE project.

December 07, 2022

We’re excited to announce our first Apple GPU driver release!

We’ve been working hard over the past two years to bring this new driver to everyone, and we’re really proud to finally be here. This is still an alpha driver, but it’s already good enough to run a smooth desktop experience and some games.

Read on to find out more about the state of things today, how to install it (it’s an opt-in package), and how to report bugs!

Status

This release features work-in-progress OpenGL 2.1 and OpenGL ES 2.0 support for all current Apple M-series systems. That’s enough for hardware acceleration with desktop environments, like GNOME and KDE. It’s also enough for older 3D games, like Quake3 and Neverball. While there’s always room for improvement, the driver is fast enough to run all of the above at 60 frames per second at 4K.

Please note: these drivers have not yet passed the OpenGL (ES) conformance tests. There will be bugs!

What’s next? Supporting more applications. While OpenGL (ES) 2 suffices for some applications, newer ones (especially games) demand more OpenGL features. OpenGL (ES) 3 brings with it a slew of new features, like multiple render targets, multisampling, and transform feedback. Work on these features is well under way, but they will each take a great deal of additional development effort, and all are needed before OpenGL (ES) 3.0 is available.

What about Vulkan? We’re working on it! Although we’re only shipping OpenGL right now, we’re designing with Vulkan in mind. Most of the work we’re putting toward OpenGL will be reused for Vulkan. We estimated that we could ship working OpenGL 2 drivers much sooner than a working Vulkan 1.0 driver, and we wanted to get hardware accelerated desktops into your hands as soon as possible. For the most part, those desktops use OpenGL, so supporting OpenGL first made more sense to us than diving into the Vulkan deep end, only to use Zink to translate OpenGL 2 to Vulkan to run desktops.

Plus, there is a large spectrum of OpenGL support, with OpenGL 2.1 containing a fraction of the features of OpenGL 4.6. The same is true for Vulkan: the baseline Vulkan 1.0 profile is roughly equivalent to OpenGL ES 3.1, but applications these days want Vulkan 1.3 with tons of extensions and “optional” features. Zink’s “layering” of OpenGL on top of Vulkan isn’t magic: it can only expose the OpenGL features that the underlying Vulkan driver has. A baseline Vulkan 1.0 driver isn’t even enough to get OpenGL 2.1 on Zink! Zink itself advertises support for OpenGL 4.6, but of course that’s only when paired with Vulkan drivers that support the equivalent of OpenGL 4.6… and that gets us back to a tremendous amount of time and effort.

When will OpenGL 3 support be ready? OpenGL 4? Vulkan 1.0? Vulkan 1.3? In community open source projects, it’s said that every time somebody asks when a feature will be done, it delays that feature by a month. Well, a lot of people have been asking…

At any rate, for a sneak peek… here is SuperTuxKart’s deferred renderer running at full speed, making liberal use of OpenGL ES 3 features like multiple render targets~

Anatomy of a GPU driver

Modern GPUs consist of many distinct “layered” parts. There is…

  • a memory management unit and an interface to submit memory-mapped work to the hardware
  • fixed-function 3D hardware to rasterize triangles, perform depth/stencil testing, and more
  • programmable “shader cores” (like little CPUs with bespoke instruction sets) with work dispatched by the fixed-function hardware

This “layered” hardware demands a “layered” graphics driver stack. We need…

  • a kernel driver to map memory and submit memory-mapped work
  • a userspace driver to translate OpenGL and Vulkan calls into hardware-specific data structures in graphics memory
  • a compiler translating shading programming languages like GLSL to the hardware’s instruction set

That’s a lot of work, calling for a team effort! Fortunately, that layering gives us natural boundaries to divide work among our small team.

Meanwhile, Ella Stanforth is working on a Vulkan driver, reusing the kernel driver, the compiler, and some code shared with the OpenGL driver.

Of course, we couldn’t build an OpenGL driver in under two years all by ourselves. Thanks to the power of free and open source software, we stand on the shoulders of FOSS giants. The compiler implements a “NIR” backend, where NIR is a powerful intermediate representation, including GLSL to NIR translation. The kernel driver uses the “Direct Rendering Manager” (DRM) subsystem of the Linux kernel to minimize boilerplate. Finally, the OpenGL driver implements the “Gallium3D” API inside of Mesa, the home for open source OpenGL and Vulkan drivers. Through Mesa and Gallium3D, we benefit from thirty years of OpenGL driver development, with common code translating OpenGL into the much simpler Gallium3D. Thanks to the incredible engineering of NIR, Mesa, and Gallium3D, our ragtag team of reverse-engineers can focus on what’s left: the Apple hardware.

Installation instructions

To get the new drivers, you need to run the linux-asahi-edge kernel and also install the mesa-asahi-edge Mesa package.

$ sudo pacman -Syu
$ sudo pacman -S linux-asahi-edge mesa-asahi-edge
$ sudo update-grub

Since only one version of Mesa can be installed at a time, pacman will prompt you to replace mesa with mesa-asahi-edge. This is normal!

We also recommend running Wayland instead of Xorg at this point, so if you’re using the KDE Plasma environment, make sure to install the Wayland session:

$ sudo pacman -S plasma-wayland-session

Then reboot, pick the Wayland session at the top of the login screen (SDDM), and enjoy! You might want to adjust the screen scale factor in System Settings → Display and Monitor (Plasma Wayland defaults to 100% or 200%, while 150% is often nicer). If you have “Force font DPI” enabled under Appearance → Fonts, you should disable that (it is saved separately for Wayland and Xorg, and shouldn’t be necessary on Wayland sessions). Log out and back in for these changes to fully apply.

Xorg and Xorg-based desktop environments should work, but there are a few known issues:

  • Expect screen tearing (this might be fixed soon)
  • VSync does not work (some KDE animations will be too fast, and GL apps will not limit their FPS even with VSync enabled). This is a limitation of Xorg on the Apple DCP display controllers, which do not support VBlank interrupts.
  • There are still driver bugs triggered by Xorg/KWin. We’re looking into this.

The linux-asahi-edge kernel can be installed side-by-side with the standard linux-asahi package, but both versions should be kept in sync, so make sure to always update your packages together! You can always pick the linux-asahi kernel in the GRUB boot menu, which will disable GPU acceleration and the DCP display driver.

When the packages are updated in the future, it’s possible that graphical apps will stop starting up after an update until you reboot, or they may fall back to software rendering. This is normal. Until the UAPI is stable, we’ll have to break compatibility between Mesa and the kernel every now and then, so you will need to reboot to make things work after updates. In general, if apps do keep working with acceleration after any particular Mesa update, then it’s probably safe not to reboot, but you should still do it to make sure you’re running the latest kernel!

Reporting bugs

Since the driver is still in development, there are lots of known issues and we’re still working hard on improving conformance test results. Please don’t open new bugs for random apps not working! It’s still the early days and we know there’s a lot of work to do. Here’s a quick guide of how to report bugs:

  • If you find an app that does not start up at all, please don’t report it as a bug. Lots of apps won’t work because they require a newer GL version than what we support. Please set the LIBGL_ALWAYS_SOFTWARE=1 environment variable for those apps so they fall back to software rendering (see the example below the list). If it is a popular app that is part of the Arch Linux ARM repository, you can make a comment on this issue instead, so we can add Mesa quirks to work around it.
  • If you run into issues caused by linux-asahi-edge unrelated to the GPU, please add a comment to this issue. This includes display output issues! (Resolutions, backlight control, display power control, etc.)
  • If the GPU locks up and all GPU apps stop working, run asahi-diagnose (for example, from an SSH session), open a new bug on the AsahiLinux/linux repository, attach the file generated by that command, and tell us what you were doing that caused the lockup.
  • For other GPU issues (rendering glitches, apps that crash after starting up correctly, and things like that), run asahi-diagnose and make a comment on this issue, attaching the file generated by that command. Don’t forget to tell us about your environment!
  • In the future, if a driver update causes a regression (rendering problems or crashes for apps that previously worked properly), you can open a bug directly in the Mesa tracker.
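
As an example of the software-rendering fallback mentioned above (the application name is just an illustration):

# Force Mesa's software rasterizer for one app that needs a newer GL version than the driver exposes
$ LIBGL_ALWAYS_SOFTWARE=1 supertuxkart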

We hope you enjoy our driver! Remember, things are still moving quickly, so make sure to update your packages regularly to get updates and bug fixes!

Co-written with Asahi Lina. Can you tell who wrote what?

November 14, 2022

Hi all!

Last Friday we shipped wlroots 0.16! This long-overdue release is the fruit of 10 months’ worth of work from 46 contributors. It includes many improvements, especially around new protocol support, the scene-graph API and the Vulkan renderer. See the release notes for more details!

With the new release freshly delivered, we’re already working on the next one. I’ve been continuing my work on the Vulkan renderer: the patch to stop blocking while the GPU is rendering is almost ready to be merged. I’ve even fixed a Vulkan-ValidationLayers bug along the way. I’ve also investigated color management and ICC profiles a bit more, and have a better idea of what we need to lay down the first pieces of the puzzle.

I’ve reached a new milestone for wlroots-rs: the example can now display a red screen! I’ve cleaned up my work and properly exposed the API in a package. I’m still not super happy about the way the compositor state is handled, if you have suggestions please let me know!

I’ve released libdrm 2.4.114 with new helpers to allocate DRM dumb buffers. Up until now, this was something developers had to hand-roll themselves with raw IOCTLs; hopefully this addition can help newcomers and improve type safety. I’ve also released Pixman 0.42.0 with a constified API for regions and work by Manuel Stoeckl to fix bugs discovered via the wlroots Pixman renderer and to port demos to GTK3.

I’ve got two NPotMs! The first one is libjsonschema, which is basically the C counterpart of go-jsonschema. It generates C definitions and functions to encode and decode JSON defined by a schema. This project is still very much WIP, my goal is to use it in drm_info and eventually in libdisplay-info.

The second new project is go-smee. It’s a Go version of the official Node.js smee client. smee is a handy tool when writing code which leverages Web hooks: it’s a proxy which forwards Web hooks to a locally running server. At some point I got annoyed enough with Node.js to convince me to write an alternative client.

That’s all for now, see you next month!

November 10, 2022

I already talked about debugging hangs in “Graphics Flight Recorder - unknown but handy tool to debug GPU hangs”. Now I want to talk about the nastiest kind of GPU hang - the kind that cannot be recovered from, where your computer becomes completely unresponsive and you cannot even ssh into it.

How would one debug this? There is no data to get after the hang, and it’s incredibly frustrating to even try different debug options and hypotheses: if you are wrong, you get to reboot the machine!

If you are a hardware manufacturer creating a driver for your own GPU, you could just run the workload in your fancy hardware simulator, wait for a few hours for the result and call it a day. But what if you don’t have access to a simulator, or to some debug side channel?

There are a few things one could try:

  • Try all debug options you have, try to disable compute dispatches, some draw calls, blits, and so on. The downside is that you have to reboot every time you hit the issue.
  • Eyeball the GPU packets until they start eyeballing you.
  • Breadcrumbs!

Today we will talk about the breadcrumbs. Unfortunately, Graphics Flight Recorder, which I already wrote about, isn’t of much help here. The hang is unrecoverable so GFR isn’t able to gather the results and write them to the disk.

But the idea of breadcrumbs is still useful! What if, instead of gathering the results post factum, we stream them to some other machine? This would allow us to get the results even if our target becomes unresponsive. Though the requirement to get the results ASAP considerably changes the workflow.

What if we write breadcrumbs on the GPU after each command and spin up a thread on the CPU reading them in a busy loop?

In practice, the number of breadcrumbs between the last one sent over the network and the one currently being executed is just too big for this to be practical.

So we have to make the GPU and CPU run in lockstep:

  • GPU writes a breadcrumb and immediately waits for this value to be acknowledged;
  • CPU in a busy loop checks the last written breadcrumb value, sends it over socket, and writes it back to the fixed address;
  • GPU sees a new value and continues execution.

This way the most recent breadcrumb gets sent over the network immediately. In practice, some breadcrumbs are still lost between the last one sent over the network and the one where the GPU hangs, but the difference is only a few of them.

With the lockstep execution we can narrow down the hanging command even further. For this we have to wait for a certain time after each breadcrumb before proceeding to the next one. I chose to just prompt the user for explicit keyboard input for each breadcrumb.

I ended up with the following workflow:

  • Run with breadcrumbs enabled but without requiring explicit user input;
  • On another machine receive the stream of breadcrumbs via network;
  • Note the last received breadcrumb, the hanging command would be nearby;
  • Reboot the target;
  • Enable breadcrumbs starting from a few breadcrumbs before the last one received, require explicit ack from the user on each breadcrumb;
  • In a few steps GPU should hang;
  • Now that we know the closest breadcrumb to the real hang location - we can get the command stream and see what happens right after the breadcrumb;
  • Knowing which command(s) is causing our hang - it’s now possible to test various changes in the driver.

Could Graphics Flight Recorder do this?

In theory, yes, but it would require an additional Vulkan extension to be able to wait for a value in memory. However, it would still have a crucial limitation.

Even with Vulkan being really close to the hardware there are still many cases where one Vulkan command is translated into many GPU commands under the hood. Things like image copies, blits, renderpass boundaries. For unrecoverable hangs we want to narrow down the hanging GPU command as much as possible, so it makes more sense to implement such functionality in a driver.

Turnip’s take on the breadcrumbs

I recently implemented it in Turnip (the open-source Vulkan driver for Qualcomm’s GPUs) and have used it a few times with good results.

The current implementation in Turnip is rather spartan and provides only a minimal set of instruments to achieve the workflow described above, which looks like this:

1) Launch workload with TU_BREADCRUMBS envvar (TU_BREADCRUMBS=$IP:$PORT,break=$BREAKPOINT:$BREAKPOINT_HITS):

TU_BREADCRUMBS=$IP:$PORT,break=-1:0 ./some_workload

2) Receive breadcrumbs on another machine via this bash spaghetti:

> nc -lvup $PORT | stdbuf -o0 xxd -pc -c 4 | awk -Wposix '{printf("%u:%u\n", "0x" $0, a[$0]++)}'

Received packet from 10.42.0.19:49116 -> 10.42.0.1:11113 (local)
1:0
7:0
8:0
9:0
10:0
11:0
12:0
13:0
14:0
[...]
10:3
11:3
12:3
13:3
14:3
15:3
16:3
17:3
18:3

Each line is a breadcrumb № and how many times it was repeated (either because the command buffer is reusable or because the breadcrumb is in a command stream which repeats for each tile).

3) Increase hang timeout:

echo -n 120000 > /sys/kernel/debug/dri/0/hangcheck_period_ms

4) Launch workload and break on the last known executed breadcrumb:

TU_BREADCRUMBS=$IP:$PORT,break=15:3 ./some_workload
GPU is on breadcrumb 18, continue?y
GPU is on breadcrumb 19, continue?y
GPU is on breadcrumb 20, continue?y
GPU is on breadcrumb 21, continue?y
...

5) Continue until GPU hangs.

:tada: Now you know two breadcrumbs between which there is a command which causes the hang.

Limitations

  • If the hang was caused by a lack of synchronization, or a lack of cache flushing - it may not be reproducible with breadcrumbs enabled.
  • While breadcrumbs could help to narrow down the hang to a single command - a single command may access many different states and have many steps under the hood (e.g. draw or blit commands).
  • The breadcrumb method described here does affect performance, especially if tiled rendering is used, since breadcrumbs are repeated for each tile.

Hey,

If you enjoy using an upstream Linux kernel on your Raspberry Pi system, or just want to try the freshest kernel graphics drivers there, the good news is that you can now compile and boot the V3D driver from mainline on your Raspberry Pi 4. Thanks to the work of Stefan, Peter and Nicolas [1] [2], the V3D enablement reached the Linux kernel mainline. That means you can hack on and use new features available in the upstream V3D driver directly from the source.

However, even for those used to compiling and installing a custom kernel on the Raspberry Pi, there are some quirks to getting the mainline v3d module available on 32-bit and 64-bit systems. I’ve quickly summarized how to compile and install upstream kernel versions (>=6.0) in this short blog post.

Note: the V3D driver is not present on Raspberry Pi models 0-3.

First, I’m assuming that you already know how to cross-compile a custom kernel for your system. If that is not the case, a good tutorial is already available in the Raspberry Pi documentation, but it targets the kernel in the rpi-linux repository (the downstream kernel).

Compared to that documentation, the main differences for the upstream kernel are presented below:

Raspberry Pi 4 64-bit (arm64)

Diff short summary:

  1. instead of getting the .config file from bcm2711_defconfig, get it by running make ARCH=arm64 defconfig
  2. compile and install the kernel image and modules as usual, but copy only the dtb file arch/arm64/boot/dts/broadcom/bcm2711-rpi-4-b.dtb to /boot on your target machine (there is no /overlays directory to copy)
  3. change /boot/config.txt:
    • comment out/remove the dtoverlay=vc4-kms-v3d entry
    • add a device_tree=bcm2711-rpi-4-b.dtb entry

Raspberry Pi 4 32-bit (arm)

Diff short summary:

  1. get the .config file by running make ARCH=arm multi_v7_defconfig
  2. using make ARCH=arm menuconfig or a similar tool, enable CONFIG_ARM_LPAE=y
  3. compile and install the kernel image and modules as usual, but copy only the dtb file arch/arm/boot/dts/bcm2711-rpi-4-b.dtb to /boot on your target machine (there is no /overlays directory to copy)
  4. change /boot/config.txt:
    • comment out/remove the dtoverlay=vc4-kms-v3d entry
    • add a device_tree=bcm2711-rpi-4-b.dtb entry

Step-by-step for remote deployment:

Set variables

Raspberry Pi 4 64-bit (arm64)

cd <path-to-upstream-linux-directory>
KERNEL=`make kernelrelease`
ARCH="arm64"
CROSS_COMPILE="aarch64-linux-gnu-"
DEFCONFIG=defconfig
IMAGE=Image
DTB_PATH="broadcom/bcm2711-rpi-4-b.dtb"
RPI4=<ip>
TMP=`mktemp -d`

Raspberry Pi 4 32-bit (arm)

cd <path-to-upstream-linux-directory>
KERNEL=`make kernelrelease`
ARCH="arm"
CROSS_COMPILE="arm-linux-gnueabihf-"
DEFCONFIG=multi_v7_defconfig
IMAGE=zImage
DTB_PATH="bcm2711-rpi-4-b.dtb"
RPI4=<ip>
TMP=`mktemp -d`

Get default .config file

make ARCH=$ARCH CROSS_COMPILE=$CROSS_COMPILE $DEFCONFIG

Raspberry Pi 4 32-bit (arm)

Additional step for 32-bit systems: enable CONFIG_ARM_LPAE=y using make ARCH=arm menuconfig.
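
If you prefer a non-interactive alternative to menuconfig, the kernel tree’s scripts/config helper can flip the option directly. A small sketch, run from the kernel source directory after generating the default .config:

# Enable LPAE without opening menuconfig, then resolve any dependent options
./scripts/config --enable CONFIG_ARM_LPAE
make ARCH=$ARCH CROSS_COMPILE=$CROSS_COMPILE olddefconfig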

Cross-compile the mainline kernel

make ARCH=$ARCH CROSS_COMPILE=$CROSS_COMPILE $IMAGE modules dtbs

Install modules to send

make ARCH=$ARCH CROSS_COMPILE=$CROSS_COMPILE INSTALL_MOD_PATH=$TMP modules_install

Copy kernel image, modules and the dtb to your remote system

ssh $RPI4 mkdir -p /tmp/new_modules /tmp/new_kernel
rsync -av $TMP/ $RPI4:/tmp/new_modules/
scp arch/$ARCH/boot/$IMAGE $RPI4:/tmp/new_kernel/$KERNEL
scp arch/$ARCH/boot/dts/$DTB_PATH $RPI4:/tmp/new_kernel
ssh $RPI4 sudo rsync -av /tmp/new_modules/lib/modules/ /lib/modules/
ssh $RPI4 sudo rsync -av /tmp/new_kernel/ /boot/
rm -rf $TMP

Set the config.txt of your RPi 4

On your Raspberry Pi 4, open the config file /boot/config.txt and make the following changes (see the example below the list):

  • comment out/remove the dtoverlay=vc4-kms-v3d entry
  • add a device_tree=bcm2711-rpi-4-b.dtb entry
  • add a kernel=<image-name> entry
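
Putting those edits together, the relevant part of /boot/config.txt might look like this (assuming the kernel image was copied to /boot under the example name 6.1.0; use whatever name you gave it, e.g. the output of make kernelrelease from the steps above):

# vc4-kms-v3d overlay disabled for the mainline kernel
#dtoverlay=vc4-kms-v3d
device_tree=bcm2711-rpi-4-b.dtb
kernel=6.1.0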

Why not Kworkflow?

You can safely use the steps above, but if you are hacking on the kernel and need to repeat these compile-and-install steps over and over, why not try Kworkflow?

Kworkflow is a set of scripts that streamlines all the steps needed to compile and install a custom kernel on local and remote machines, and it supports kernel building and deployment to Raspberry Pi machines running 32-bit and 64-bit Raspbian.

After learning the kernel compilation and installation steps, you can simply use the kw bd command and have a custom kernel installed on your Raspberry Pi 4.

Note: the userspace 3D acceleration (glx/mesa) is working as expected on arm64, but the driver does not load yet on arm. Besides that, a bunch of “pte invalid” errors may appear when using 3D acceleration; this is a known issue that is still under investigation.

November 02, 2022

💽 Linux Boot Partitions and How to Set Them Up 🚀

Let’s have a look how traditional Linux distributions set up /boot/ and the ESP, and how this could be improved.

How Linux distributions traditionally have been setting up their “boot” file systems has been varying to some degree, but the most common choice has been to have a separate partition mounted to /boot/. Usually the partition is formatted as a Linux file system such as ext2/ext3/ext4. The partition contains the kernel images, the initrd and various boot loader resources. Some distributions, like Debian and Ubuntu, also store ancillary files associated with the kernel here, such as kconfig or System.map. Such a traditional boot partition is only defined within the context of the distribution, and typically not immediately recognizable as such when looking just at the partition table (i.e. it uses the generic Linux partition type UUID).

With the arrival of UEFI a new partition relevant for boot appeared, the EFI System Partition (ESP). This partition is defined by the firmware environment, but typically accessed by Linux to install or update boot loaders. The choice of file system is not up to Linux, but effectively mandated by the UEFI specifications: vFAT. In theory it could be formatted as other file systems too. However, this would require the firmware to support file systems other than vFAT. This is rare and firmware specific though, as vFAT is the only file system mandated by the UEFI specification. In other words, vFAT is the only file system which is guaranteed to be universally supported.

There’s a major overlap of the type of the data typically stored in the ESP and in the traditional boot partition mentioned earlier: a variety of boot loader resources as well as kernels/initrds.

Unlike the traditional boot partition, the ESP is easily recognizable in the partition table via its GPT partition type UUID. The ESP is also a shared resource: all OSes installed on the same disk will share it and put their boot resources into it (as opposed to the traditional boot partition, of which there is one per installed Linux OS, and only that one will put resources there).

To summarize, the most common setup on typical Linux distributions is something like this:

Type                     Linux Mount Point   File System Choice
Linux “Boot” Partition   /boot/              Any Linux File System, typically ext2/ext3/ext4
ESP                      /boot/efi/          vFAT

As mentioned, not all distributions or local installations agree on this. For example, it’s probably worth mentioning that some distributions decided to put kernels onto the root file system of the OS itself. For this setup to work the boot loader itself [sic!] must implement a non-trivial part of the storage stack. This may have to include RAID, storage drivers, networked storage, volume management, disk encryption, and Linux file systems. Leaving aside the conceptual argument that complex storage stacks don’t belong in boot loaders there are very practical problems with this approach. Reimplementing the Linux storage stack in all its combinations is a massive amount of work. It took decades to implement what we have on Linux now, and it will take a similar amount of work to catch up in the boot loader’s reimplementation. Moreover, there’s a political complication: some Linux file system communities made clear they have no interest in supporting a second file system implementation that is not maintained as part of the Linux kernel.

What’s interesting is that the /boot/efi/ mount point is nested below the /boot/ mount point. This effectively means that to access the ESP the Boot partition must exist and be mounted first. A system with just an ESP and without a Boot partition hence doesn’t fit well into the current model. The Boot partition will also have to carry an empty “efi” directory that can be used as the inner mount point, and serves no other purpose.

Given that the traditional boot partition and the ESP may carry similar data (i.e. boot loader resources, kernels, initrds) one may wonder why they are separate concepts. Historically, this was the easiest way to make the pre-UEFI way how Linux systems were booted compatible with UEFI: conceptually, the ESP can be seen as just a minor addition to the status quo ante that way. Today, primarily two reasons remained:

  • Some distributions see a benefit in support for complex Linux file system concepts such as hardlinks, symlinks, SELinux labels/extended attributes and so on when storing boot loader resources. – I personally believe that making use of features in the boot file systems that the firmware environment cannot really make sense of is very clearly not advisable. The UEFI file system APIs know no symlinks, and what is SELinux to UEFI anyway? Moreover, putting more than the absolute minimum of simple data files into such file systems immediately raises questions about how to authenticate them comprehensively (including all fancy metadata) cryptographically on use (see below).

  • On real-life systems that ship with non-Linux OSes the ESP often comes pre-installed with a size too small to carry multiple Linux kernels and initrds. As growing the size of an existing ESP is problematic (for example, because there’s no space available immediately after the ESP, or because some low-quality firmware reacts badly to the ESP changing size) placing the kernel in a separate, secondary partition (i.e. the boot partition) circumvents these space issues.

File System Choices

We already mentioned that the ESP effectively has to be vFAT, as that is what UEFI (more or less) guarantees. The file system choice for the boot partition is not quite as restricted, but using arbitrary Linux file systems is not really an option either. The file system must be accessible by both the boot loader and the Linux OS. Hence only file systems that are available in both can be used. Note that such secondary implementations of Linux file systems in the boot environment – limited as they may be – are not typically welcomed or supported by the maintainers of the canonical file system implementation in the upstream Linux kernel. Modern file systems are notoriously complicated and delicate and simply don’t belong in boot loaders.

In a trusted boot world, the two file systems for the ESP and the /boot/ partition should be considered untrusted: any code or essential data read from them must be authenticated cryptographically before use. And even more, the file system structures themselves are also untrusted. The file system driver reading them must be careful not to be exploitable by a rogue file system image. Effectively this means a simple file system (for which a driver can be more easily validated and reviewed) is generally a better choice than a complex file system (Linux file system communities made it pretty clear that robustness against rogue file system images is outside of their scope and not what is being tested for.).

Some approaches tried to address the fact that boot partitions are untrusted territory by encrypting them via a mechanism compatible to LUKS, and adding decryption capabilities to the boot loader so it can access it. This misses the point though, as encryption does not imply authentication, and only authentication is typically desired. The boot loader and kernel code are typically Open Source anyway, and hence there’s little value in attempting to keep secret what is already public knowledge. Moreover, encryption implies the existence of an encryption key. Physically typing in the decryption key on a keyboard might still be acceptable on desktop systems with a single human user in front, but outside of that scenario unlock via TPM, PKCS#11 or network services are typically required. And even on the desktop FIDO2 unlocking is probably the future. Implementing all the technologies these unlocking mechanisms require in the boot loader is not realistic, unless the boot loader shall become a full OS on its own as it would require subsystems for FIDO2, PKCS#11, USB, Bluetooth network, smart card access, and so on.

File System Access Patterns

Note that traditionally both mentioned partitions were read-only during most parts of the boot. Only later, once the OS is up, write access was required to implement OS or boot loader updates. In today’s world things have become a bit more complicated. A modern OS might want to require some limited write access already in the boot loader, to implement boot counting/boot assessment/automatic fallback (e.g., if the same kernel fails to boot 3 times, automatically revert to older kernel), or to maintain an early storage-based random seed. This means that even though the file system is mostly read-only, we need limited write access after all.

vFAT cannot compete with modern Linux file systems such as btrfs when it comes to data safety guarantees. It’s not a journaled file system, does not use CoW or any form of checksumming. This means when used for the system boot process we need to be particularly careful when accessing it, and in particular when making changes to it (i.e., trying to keep changes local to single sectors). It is essential to use write patterns that minimize the chance of file system corruption. Checking the file system (“fsck”) before modification (and probably also reading) is important, as is ensuring the file system is put into a “clean” state as quickly as possible after each modification.
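
As an illustration, such a check might be run from userspace like this (a sketch only: the device name is an example, the partition must not be mounted at that moment, and -a asks dosfstools’ fsck to repair automatically and non-interactively):

# Check and clean the vFAT ESP before modifying it
fsck.fat -a /dev/sda1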

Code quality of the firmware in typical systems is known to not always be great. When relying on the file system driver included in the firmware it’s hence a good idea to limit use to operations that have a better chance to be correctly implemented. For example, when writing from the UEFI environment it might be wise to avoid any operation that requires allocation algorithms, but instead focus on access patterns that only override already written data, and do not require allocation of new space for the data.

Besides write access from the boot loader code (as described above) these file systems will require write access from the OS, to facilitate boot loader and kernel/initrd updates. These types of accesses are generally not fully random accesses (i.e., never partial file updates) but usually mean adding new files as whole, and removing old files as a whole. Existing files are typically not modified once created, though they might be replaced wholly by newer versions.

Boot Loader Updates

Note that the update cycle frequencies for boot loaders and for kernels/initrds are probably similar these days. While kernels are still vastly more complex than boot loaders, security issues are regularly found in both. In particular, boot loaders (through “shim” and similar components) carry certificate/keyring and denylist information, which typically requires frequent updates. Update cycles hence have to be expected regularly.

Boot Partition Discovery

The traditional boot partition was not recognizable by looking just at the partition table. On MBR systems it was directly referenced from the boot sector of the disk, and on EFI systems from information stored in the ESP. This is less than ideal since by losing this entrypoint information the system becomes unbootable. It’s typically a better, more robust idea to make boot partitions recognizable as such in the partition table directly. This is done for the ESP via the GPT partition type UUID. For traditional boot partitions this was not done though.
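
To see which partitions on a system are self-descriptive in this sense, you can inspect the GPT partition type of each partition, for instance like this (a sketch; the exact set of columns available depends on your util-linux version):

# List partitions with their GPT type, file system and mount point
lsblk -o NAME,PARTTYPE,PARTTYPENAME,FSTYPE,MOUNTPOINT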

Current Situation Summary

Let’s try to summarize the above:

  • Currently, typical deployments use two distinct boot partitions, often using two distinct file system implementations

  • Firmware effectively dictates existence of the ESP, and the use of vFAT

  • In userspace view: the ESP mount is nested below the general Boot partition mount

  • Resources stored in both partitions are primarily kernel/initrd, and boot loader resources

  • The mandatory use of vFAT brings certain data safety challenges, as does quality of firmware file system driver code

  • During boot, limited write access is needed; during OS runtime, more comprehensive write access is needed (though still not fully random).

  • Less restricted but still limited write patterns from OS environment (only full file additions/updates/removals, during OS/boot loader updates)

  • Boot loaders should not implement complex storage stacks.

  • ESP can be auto-discovered from the partition table, traditional boot partition cannot.

  • Neither the ESP nor the traditional boot partition is protected cryptographically, in structure or in contents. It is expected that loaded files are individually authenticated after being read.

  • The ESP is a shared resource — the traditional boot partition a resource specific to each installed Linux OS on the same disk.

How to Do it Better

Now that we have discussed many of the issues with the status quo ante, let’s see how we can do things better:

  • Two partitions for essentially the same data is a bad idea. Given they carry data very similar or identical in nature, the common case should be to have only one (but see below).

  • Two file system implementations are worse than one. Given that vFAT is more or less mandated by UEFI and the only format universally understood by all players, and thus has to be used anyway, it might as well be the only file system that is used.

  • Data safety is unnecessarily bad so far: both ESP and boot partition are continuously mounted from the OS, even though access is pretty restricted: outside of update cycles access is typically not required.

  • All partitions should be auto-discoverable/self-descriptive

  • The two partitions should not be exposed as nested mounts to userspace

To be more specific, here’s how I think a better way to set this all up would look:

  • Whenever possible, only have one boot partition, not two. On EFI systems, make it the ESP. On non-EFI systems use an XBOOTLDR partition instead (see below). Only have both in the case where a Linux OS is installed on a system that already contains an OS with an ESP that is too small to carry sufficient kernels/initrds. When a system contains an XBOOTLDR partition, put kernels/initrds on that, otherwise on the ESP.

  • Instead of the vaguely defined, traditional Linux “boot” partition use the XBOOTLDR partition type as defined by the Discoverable Partitions Specification. This ensures the partition is discoverable, and can be automatically mounted by things like systemd-gpt-auto-generator. Use XBOOTLDR only if you have to, i.e., when dealing with systems that lack UEFI (and where the ESP hence has no value) or to address the mentioned size issues with the ESP. Note that unlike the traditional boot partition the XBOOTLDR partition is a shared resource, i.e., shared between multiple parallel Linux OS installations on the same disk. Because of this it is typically wise to place a per-OS directory at the top of the XBOOTLDR file system to avoid conflicts.

  • Use vFAT for both partitions, it’s the only thing universally understood among relevant firmwares and Linux. It’s simple enough to be useful for untrusted storage. Or to say this differently: writing a file system driver that is not easily vulnerable to rogue disk images is much easier for vFAT than for let’s say btrfs. – But the choice of vFAT implies some care needs to be taken to address the data safety issues it brings, see below.

  • Mount the two partitions via the “automount” logic. For example, via systemd’s automount units, with a very short idle time-out (one second or so; see the sketch after this list). This improves data safety immensely, as the file systems will remain mounted (and thus possibly in a “dirty” state) only for very short periods of time, when they are actually accessed – and all that while the fact that they are not mounted continuously is mostly not noticeable for applications, as the file system paths remain continuously around. Given that the backing file system (vFAT) has poor data safety properties, it is essential to shorten the window of unclean file system state as much as possible. In fact, this is what the aforementioned systemd-gpt-auto-generator logic does by default.

  • Whenever mounting one of the two partitions, do a file system check (fsck; in fact this is also what systemd-gpt-auto-generator does by default, hooked into the automount logic, to run on first access). This ensures that even if the file system is in an unclean state it is restored to a clean state when needed, i.e., on first access.

  • Do not mount the two partitions nested, i.e., no more /boot/efi/. First of all, as mentioned above, it should be possible (and is desirable) to only have one of the two. Hence it is simply a bad idea to require the other as well, just to be able to mount it. More importantly though, by nesting them, automounting is complicated, as it is necessary to trigger the first automount to establish the second automount, which defeats the point of automounting them in the first place. Use the two distinct mount points /efi/ (for the ESP) and /boot/ (for XBOOTLDR) instead. You might have guessed, but that too is what systemd-gpt-auto-generator does by default.

  • When making additions or updates to the ESP/XBOOTLDR from the OS, make sure to create a file and write it in full, then syncfs() the whole file system, then rename it to give it its final name, and syncfs() again. Do similarly when removing files. (See the sketch after this list.)

  • When writing from the boot loader environment/UEFI to ESP/XBOOTLDR, do not append to files or create new files. Instead overwrite already allocated file contents (for example to maintain a random seed file) or rename already allocated files to include information in the file name (and ideally do not increase the file name in length; for example to maintain boot counters).

  • Consider adopting UKIs, which minimize the number of files that need to be updated on the ESP/XBOOTLDR during OS/kernel updates (ideally down to 1)

  • Consider adopting systemd-boot, which minimizes the number of files that need to be updated on boot loader updates (ideally down to 1)

  • Consider removing any mention of ESP/XBOOTLDR from /etc/fstab, and just let systemd-gpt-auto-generator do its thing.

  • Stop implementing file systems, complex storage, disk encryption, … in your boot loader.
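
To make the automount and safe-update recommendations above a bit more concrete, here is a rough sketch of what they can look like from a shell (device, directory and file names are examples only; systemd-gpt-auto-generator sets up equivalent mount/automount units automatically when the partitions carry the right GPT types):

# Mount the ESP on demand with a short idle timeout, instead of listing it in /etc/fstab
systemd-mount --automount=yes --timeout-idle-sec=1s /dev/sda1 /efi

# Safe update pattern for a file on the ESP: write it fully, sync the file
# system, rename into place, sync again
cp staging/mykernel.efi /efi/EFI/Linux/mykernel.efi.tmp
sync -f /efi
mv /efi/EFI/Linux/mykernel.efi.tmp /efi/EFI/Linux/mykernel.efi
sync -f /efi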

Implementing things this way, you gain:

  • Simplicity: only one file system implementation, typically only one partition and mount point

  • Robust auto-discovery of all partitions, no need to even configure /etc/fstab

  • Data safety guarantees as good as possible, given the circumstances

To summarize this in a table:

Type       Linux Mount Point   File System Choice   Automount
ESP        /efi/               vFAT                 yes
XBOOTLDR   /boot/              vFAT                 yes

A note regarding modern boot loaders that implement the Boot Loader Specification: both partitions are explicitly listed in the specification as sources for both Type #1 and Type #2 boot menu entries. Hence, if you use such a modern boot loader (e.g. systemd-boot) these two partitions are the preferred location for boot loader resources, kernels and initrds anyway.

Addendum: You got RAID?

You might wonder: what about RAID setups and the ESP? This comes up regularly in discussions: how to set up the ESP so that (software) RAID1 (mirroring) can be used for the ESP. Long story short: I’d strongly advise against using RAID on the ESP. Firmware typically doesn’t have native RAID support, and given that firmware and boot loader can write to the file systems involved, any attempt to use software RAID on them will mean that a boot cycle might corrupt the RAID sync and immediately require a re-synchronization after boot. If RAID1 backing for the ESP is really necessary, the only way to implement that safely would be to implement it as a driver for UEFI – but that creates certain bootstrapping issues (i.e., where to place the driver, if not on the ESP, the very file system the driver is supposed to provide access to), and also reimplements a considerable component of the OS storage stack in firmware mode, which seems problematic.

So what to do instead? My recommendation would be to solve this via userspace tooling. If redundant disk support shall be implemented for the ESP, then create separate ESPs on all disks, and synchronize them on the file system level instead of the block level. Or in other words, the tools that install/update/manage kernels or boot loaders should be taught to maintain multiple ESPs instead of one. Copy the kernels/boot loader files to all of them, and remove them from all of them. Under the assumption that the goal of RAID is a more reliable system this should be the best way to achieve that, as it doesn’t pretend the firmware could do things it actually cannot do. Moreover it minimizes the complexity of the boot loader, shifting the syncing logic to userspace, where it’s typically easier to get right.
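
In the simplest case that file-level syncing can be as trivial as something like the following (a sketch only: mount points are examples, and real tooling should also handle errors and partial copies):

# Mirror the primary ESP onto a second ESP on another disk, at the file level
rsync -a --delete /efi/ /efi2/
sync -f /efi2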

Addendum: Networked Boot

The discussion above focuses on booting up from a local disk. When thinking about networked boot I think two scenarios are particularly relevant:

  1. PXE-style network booting. I think in this mode of operation focus should be on directly booting a single UKI image instead of a boot loader. This sidesteps the whole issue of maintaining any boot partition at all, and simplifies the boot process greatly. In scenarios where this is not sufficient, and an interactive boot menu or other boot loader features are desired, it might be a good idea to take inspiration from the UKI concept, and build a single boot loader EFI binary (such as systemd-boot), and include the UKIs for the boot menu items and other resources inside it via PE sections. Or in other words, build a single boot loader binary that is “supercharged” and contains all auxiliary resources in its own PE sections. (Note: this does not exist, it’s an idea I intend to explore with systemd-boot). Benefit: a single file has to be downloaded via PXE/TFTP, not more. Disadvantage: unused resources are downloaded unnecessarily. Either way: in this context there is no local storage, and the ESP/XBOOTLDR discussion above is without relevance.

  2. Initrd-style network booting. In this scenario the boot loader and kernel/initrd (better: UKI) are available on a local disk. The initrd then configures the network and transitions to a network share or file system on a network block device for the root file system. In this case the discussion above applies, and in fact the ESP or XBOOTLDR partition would be the only partition available locally on disk.

And this is all I have for today.

As President of the GNOME Foundation, I wanted to post a quick note to pass on the thanks from the Board, the Foundation staff team and membership to our outgoing Executive Director, Neil McGovern. I had the pleasure of passing on GNOME’s thanks in person at the Casa Bariachi this summer at GUADEC in Guadalajara, at the most excellent mariachi celebration of GNOME’s 25th Anniversary. 🤠 Kindly they stopped the music and handed me the microphone for the whole place, although I think many of the other guests celebrating their own birthdays were less excited about Neil’s tenure as Executive Director and the Free and Open Source desktop in general. 🤣

Neil’s 6-month handover period came to an end last month and he handed over the reins to me and Thibault Martin on the Executive Committee, and Director of Operations Rosanna Yuen has stepped up to act as Chief of Staff and interface between the Board and the staff team for the time being. Our recruitment is ongoing for a new Executive Director, although the search is a little behind schedule (mostly down to me!), and we’re hugely grateful to a few volunteers who have joined our search committee to help us source, screen and interview applicants.

I have really enjoyed working closely with Neil in my time on the GNOME board, and we are hugely grateful for his contributions and achievements over the past 5 years which I posted about earlier in the year. Neil is this month starting a new role as the Executive Director of Ruby Central. Our very best wishes from the GNOME community and good luck with your new role. See you soon!

(also posted to Discourse if you wish to add any thanks or comments of your own)

October 28, 2022

Yes

I know what you’re thinking, and I had to do it.

I had literally no other option, in that I asked myself whether there was any other option and decided the answer was no.

CL

There comes a time in every Mesa developer’s life when they start searching for answers. Real question answers, not just “why are there so many pipe caps?” or “is it possible to understand the GLSL compiler?”

That time for me was very recent.

Do we really need two CL frontends in Mesa?

My heart says yes. Not only do we need two, we probably need three, such as the Erlang-based one that Jason “99.4% CTS pass-rate” Ekstrand was briefly working on last year in an unsuccessful attempt to throw off avid bloggers who were getting too close to his real next job. Or the one that Adam “Why Am I In Your Blog?” Jackson has been quietly injecting into the Xserver codebase for the past decade without anyone noticing.

Despite these other entirely valid and extant CL implementations, my brain tells me that we probably don’t even need a single CL implementation in Mesa, let alone one that’s pending CL 3.0 conformance certification, is written in the most prominent of all the languages spoken by crabs, and has by far the most sane and credible Mesa developer working full-time on it.

So here we are.

clover.png

There can be only one.

And it’s the one that runs on zink.

Merge Request !19385: delete clover

October 26, 2022

Another Year Passes

I meant to blog more, but it seems we’re out of time. As we all know, SGC must go into hibernation at the end of the year to recharge. Workplace safety laws require sufficient rest time be given to meme-rich environments, and you can’t spell SGC without safety.

No, you can’t.

I’ve got two more blog posts that I’m hoping to release before the blog’s hard-earned time off begins.

The first is about OpenCL.

That’s this one.

Zink: Also a CL Driver

We all knew this would happen.

We follow Karol on twitter.

We’ve seen the wild, unhinged tweets from his local pub claiming to have LuxMark or some other insane, CL-based application running on top of who knows what driver.

It was only a matter of time before he cornered me at XDC for a quick “talk”. It was completely normal the way he had no fewer than four laptops physically harnessed to his person as he strutted around in search of hapless driver maintainers upon whom he could foist his latest and least sane project. I wasn’t at all afraid for my life when he told me that, by the time he let me leave the conference hall that day, rusticl would be working on zink.

I’m here blogging about it now, so obviously everything is fine and I’m not still trapped in his basement, but let’s take a look at some of the challenges I faced over that grueling, interminable period of CL bring-up besides lack of water and food.

Shader Types

The first major issue is, of course, that CL shaders are called “kernels”, and they are in the KERNEL stage. Since zink only handles the GL shader stages, this posed a bit of a problem.

Thankfully, kernels are very similar to GL/VK compute shaders, and so it was easy to combine most of the code into the compute blocks.

But this led to the real problem.

Sampler Descriptors

The real problem is that now this is yet another blog post about Vulkan descriptor models.

I know.

Believe me, I know.

I didn’t want to be here either.

In GL, there is the idea of a combined sampler. This descriptor type has both an image and a sampler in a 1:1 ratio.

CL, being the absolute unit of an API that it is, has discrete images and samplers.

Also, unlike with sane APIs, CL allows entirely untyped shader image access, meaning the images have no format or even component type.

If you’re thinking this is more convenient for users, you’re wrong.

Being completely separate, sampled images and samplers cannot just reuse the same codepaths. Instead, it was necessary to add in some hooks that could help translate these descriptors into the existing combined sampler codepaths. Primarily this amounted to:

  • adding shader handling for generating the final sampled image type based on the loads of the sampler and image variables
  • tweaking the descriptor template code to handle samplers and sampled images

Yeah, it wasn’t actually that hard. Because Vulkan descriptors aren’t that hard.

Stop writing about them.

Shader Samplers

The hard part was the shader interface, because Vulkan doesn’t allow untyped image access. So to handle this, I wrote some compiler passes to go through the shaders and add type info to the variables based on their usage.

I know what you’re thinking.

You’re thinking why didn’t this total lunatic writing a CL frontend in Rust just add type info to the variables before you got them? Wouldn’t that have saved you a lot of work?

It’s a great question. One of the best. And your follow-up? Incredible. Honestly, I knew that all of my readers were smart, but I didn’t realize you were this brilliant.

I don’t have an answer to this, and when asked, Karol just told me I’d earned myself another week in the pit. Whatever that means.

It’s Over.

Zink (on RustiCL) is done.

There’s nothing left (for me) to do, and anyone who says otherwise is wrong.

If you try it out and it doesn’t work, complain to Karol. On his twitter. Or in the Phoronix comments, where he has at least three different accounts.

October 23, 2022

🔐 Brave New Trusted Boot World 🚀

This document looks at the boot process of general purpose Linux distributions. It covers the status quo and how we envision Linux boot to work in the future with a focus on robustness and simplicity.

This document will assume that the reader has comprehensive familiarity with TPM 2.0 security chips and their capabilities (e.g., PCRs, measurements, SRK), boot loaders, the shim binary, Linux, initrds, UEFI Firmware, PE binaries, and SecureBoot.

Problem Description

Status quo ante of the boot logic on typical Linux distributions:

  • Most popular Linux distributions generate initrds locally, and they are unsigned, thus not protected through SecureBoot (since that would require local SecureBoot key enrollment, which is generally not done), nor TPM PCRs.

  • Boot chain is typically Firmware → shim → grub → Linux kernel → initrd (dracut or similar) → root file system

  • Firmware’s UEFI SecureBoot protects shim, shim’s key management protects grub and kernel. No code signing protects initrd. initrd acquires the key for encrypted root fs from the user (or TPM/FIDO2/PKCS11).

  • shim/grub/kernel is measured into TPM PCR 4, among other stuff

  • EFI TPM event log reports measured data into TPM PCRs, and can be used to reconstruct and validate state of TPM PCRs from the used resources.

  • No userspace components are typically measured, except for what IMA measures

  • New kernels require locally generating new boot loader scripts and generating a new initrd each time. OS updates thus mean fragile generation of multiple resources and copying multiple files into the boot partition.

Problems with the status quo ante:

  • initrd typically unlocks root file system encryption, but is not protected whatsoever, and trivial to attack and modify offline

  • OS updates are brittle: PCR values of grub are very hard to pre-calculate, as grub measures the chosen control flow path, not just code images. PCR values vary wildly, and OS provided resources are not measured into separate PCRs. Grub’s PCR measurements might be useful up to a point to reason about the boot after the fact, for the most basic remote attestation purposes, but are useless for calculating them ahead of time during the OS build process (which would be desirable to be able to bind secrets to future expected PCR state, for example to bind secrets to an OS in a way that they remain accessible even after that OS is updated).

  • Updates of a boot loader are not robust, require multi-file updates of ESP and boot partition, and regeneration of boot scripts

  • No rollback protection (no way to cryptographically invalidate access to TPM-bound secrets on OS updates)

  • Remote attestation of running software is needlessly complex since initrds are generated locally and thus basically are guaranteed to vary on each system.

  • Locking resources maintained by arbitrary user apps to TPM state (PCRs) is not realistic for general purpose systems, since PCRs will change on every OS update, and there’s no mechanism to re-enroll each such resource before every OS update, and remove the old enrollment after the update.

  • There is no concept to cryptographically invalidate/revoke secrets for an older OS version once updated to a new OS version. An attacker thus can always access the secrets generated on old OSes if they manage to exploit an old version of the OS — even if a newer version already has been deployed.

Goals of the new design:

  • Provide a fully signed execution path from firmware to userspace, no exceptions

  • Provide a fully measured execution path from firmware to userspace, no exceptions

  • Separate out TPM PCR assignments by “owner” of the measured resources, so that resources can be bound to them in a fine-grained fashion.

  • Allow easy pre-calculation of expected PCR values based on booted kernel/initrd, configuration, local identity of the system

  • Rollback protection

  • Simple & robust updates: one updated file per concept

  • Updates without requiring re-enrollment/local preparation of the TPM-protected resources (no more “brittle” PCR hashes that must be propagated into every TPM-protected resource on each OS update)

  • System ready for easy remote attestation, to prove validity of booted OS, configuration and local identity

  • Ability to bind secrets to specific phases of the boot, e.g. the root fs encryption key should be retrievable from the TPM only in the initrd, but not after the host transitioned into the root fs.

  • Reasonably secure, automatic, unattended unlocking of disk encryption secrets should be possible.

  • “Democratize” use of PCR policies by defining PCR register meanings, and making binding to them robust against updates, so that external projects can safely and securely bind their own data to them (or use them for remote attestation) without risking breakage whenever the OS is updated.

  • Build around TPM 2.0 (with graceful fallback for TPM-less systems if desired, but TPM 1.2 support is out of scope)

Considered attack scenarios and considerations:

  • Evil Maid: physical access to a storage device should not enable an attacker, whether online or offline (i.e. “at rest”), to read the user’s plaintext data on disk (confidentiality); nor should it allow undetected modification/backdooring of user data or the OS (integrity), or exfiltration of secrets.

  • TPMs are assumed to be reasonably “secure”, i.e. can securely store/encrypt secrets. Communication to TPM is not “secure” though and must be protected on the wire.

  • Similarly, the CPU is assumed to be reasonably “secure”

  • SecureBoot is assumed to be reasonably “secure” to permit validated boot up to and including shim+boot loader+kernel (but see discussion below)

  • All user data must be encrypted and authenticated. All vendor and administrator data must be authenticated.

  • It is assumed all software involved regularly contains vulnerabilities and requires frequent updates to address them, plus regular revocation of old versions.

  • It is further assumed that key material used for signing code by the OS vendor can reasonably be kept secure (via use of HSM, and similar, where secret key information never leaves the signing hardware) and does not require frequent roll-over.

Proposed Construction

Central to the proposed design is the concept of a Unified Kernel Image (UKI). These UKIs are the combination of a Linux kernel image, an initrd, a UEFI boot stub program (and further resources, see below) into one single UEFI PE file that can either be directly invoked by the UEFI firmware (which is useful in particular in some cloud/Confidential Computing environments) or through a boot loader (which is generally useful to implement support for multiple kernel versions, with interactive or automatic selection of the image to boot into, potentially with automatic fallback management to increase robustness).

UKI Components

Specifically, UKIs typically consist of the following resources:

  1. An UEFI boot stub that is a small piece of code still running in UEFI mode and that transitions into the Linux kernel included in the UKI (e.g., as implemented in sd-stub, see below)

  2. The Linux kernel to boot in the .linux PE section

  3. The initrd that the kernel shall unpack and invoke in the .initrd PE section

  4. A kernel command line string, in the .cmdline PE section

  5. Optionally, information describing the OS this kernel is intended for, in the .osrel PE section (derived from /etc/os-release of the booted OS). This is useful for presentation of the UKI in the boot loader menu, and ordering it against other entries, using the included version information.

  6. Optionally, information describing kernel release information (i.e. uname -r output) in the .uname PE section. This is also useful for presentation of the UKI in the boot loader menu, and ordering it against other entries.

  7. Optionally, a boot splash to bring to screen before transitioning into the Linux kernel in the .splash PE section

  8. Optionally, a compiled Devicetree database file, for systems which need it, in the .dtb PE section

  9. Optionally, the public key in PEM format that matches the signatures of the .pcrsig PE section (see below), in a .pcrpkey PE section.

  10. Optionally, a JSON file encoding expected PCR 11 hash values seen from userspace once the UKI has booted up, along with signatures of these expected PCR 11 hash values, matching a specific public key in the .pcrsig PE section. (Note: we use plural for “values” and “signatures” here, as this JSON file will typically carry a separate value and signature for each PCR bank for PCR 11, i.e. one pair of value and signature for the SHA1 bank, and another pair for the SHA256 bank, and so on. This ensures when enrolling or unlocking a TPM-bound secret we’ll always have a signature around matching the banks available locally (after all, which banks the local hardware supports is up to the hardware). For the sake of simplifying this already overly complex topic, we’ll pretend in the rest of the text there was only one PCR signature per UKI we have to care about, even if this is not actually the case.)

Given UKIs are regular UEFI PE files, they can thus be signed as one for SecureBoot, protecting all of the individual resources listed above at once, and their combination. Standard Linux tools such as sbsigntool and pesign can be used to sign UKI files.

UKIs wrap all of the above data in a single file, hence all of the above components can be updated in one go through single file atomic updates, which is useful given that the primary expected storage place for these UKIs is the UEFI System Partition (ESP), which is a vFAT file system, with its limited data safety guarantees.

UKIs can be generated via a single, relatively simple objcopy invocation that glues the listed components together, generating one PE binary that then can be signed for SecureBoot. (For details on building these, see below.)
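
To give a feel for it, such an invocation could look roughly like the following; the stub path, file names and section offsets are placeholders, and the exact recipe differs between distributions, so treat this as an illustration rather than a reference:

    # Illustrative sketch only: adjust file names, offsets and stub path to your setup.
    objcopy \
        --add-section .osrel=os-release --change-section-vma .osrel=0x20000 \
        --add-section .cmdline=cmdline.txt --change-section-vma .cmdline=0x30000 \
        --add-section .linux=vmlinuz --change-section-vma .linux=0x2000000 \
        --add-section .initrd=initrd.cpio --change-section-vma .initrd=0x3000000 \
        /usr/lib/systemd/boot/efi/linuxx64.efi.stub my-uki.efi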

Note that the primary location to place UKIs in is the EFI System Partition (or an otherwise firmware accessible file system). This typically means a VFAT file system of some form. Hence an effective UKI size limit of 4GiB is in place, as that’s the largest file size a FAT32 file system supports.

Basic UEFI Stub Execution Flow

The mentioned UEFI stub program will execute the following operations in UEFI mode before transitioning into the Linux kernel that is included in its .linux PE section:

  1. The PE sections listed are searched for in the invoked UKI the stub is part of, and superficially validated (i.e. general file format is in order).

  2. All PE sections listed above of the invoked UKI are measured into TPM PCR 11. This TPM PCR is expected to be all zeroes before the UKI initializes. Pre-calculation is thus very straight-forward if the resources included in the PE image are known. (Note: as a single exception the .pcrsig PE section is excluded from this measurement, as it is supposed to carry the expected result of the measurement, and thus cannot also be input to it, see below for further details about this section.)

  3. If the .splash PE section is included in the UKI it is brought onto the screen

  4. If the .dtb PE section is included in the UKI it is activated using the Devicetree UEFI “fix-up” protocol

  5. If a command line was passed from the boot loader to the UKI executable it is discarded if SecureBoot is enabled and the command line from the .cmdline used. If SecureBoot is disabled and a command line was passed it is used in place of the one from .cmdline. Either way the used command line is measured into TPM PCR 12. (This of course removes any flexibility of control of the kernel command line of the local user. In many scenarios this is probably considered beneficial, but in others it is not, and some flexibility might be desired. Thus, this concept probably needs to be extended sooner or later, to allow more flexible kernel command line policies to be enforced via definitions embedded into the UKI. For example: allowing definition of multiple kernel command lines the user/boot menu can select one from; allowing additional allowlisted parameters to be specified; or even optionally allowing any verification of the kernel command line to be turned off even in SecureBoot mode. It would then be up to the builder of the UKI to decide on the policy of the kernel command line.)

  6. It will set a couple of volatile EFI variables to inform userspace about executed TPM PCR measurements (and which PCR registers were used), and other execution properties. (For example: the EFI variable StubPcrKernelImage in the 4a67b082-0a4c-41cf-b6c7-440b29bb8c4f vendor namespace indicates the PCR register used for the UKI measurement, i.e. the value “11”).

  7. An initrd cpio archive is dynamically synthesized from the .pcrsig and .pcrpkey PE section data (this is later passed to the invoked Linux kernel as additional initrd, to be overlaid with the main initrd from the .initrd section). These files are later available in the /.extra/ directory in the initrd context.

  8. The Linux kernel from the .linux PE section is invoked with a combined initrd that is composed from the blob from the .initrd PE section, the dynamically generated initrd containing the .pcrsig and .pcrpkey PE sections, and possibly some additional components like sysexts or syscfgs.

TPM PCR Assignments

In the construction above we take possession of two PCR registers previously unused on generic Linux distributions:

  • TPM PCR 11 shall contain measurements of all components of the UKI (with exception of the .pcrsig PE section, see above). This PCR will also contain measurements of the boot phase once userspace takes over (see below).

  • TPM PCR 12 shall contain measurements of the used kernel command line. (Plus potentially other forms of parameterization/configuration passed into the UKI, not discussed in this document)

On top of that we intend to define two more PCR registers like this:

  • TPM PCR 15 shall contain measurements of the volume encryption key of the root file system of the OS.

  • [TPM PCR 13 shall contain measurements of additional extension images for the initrd, to enable a modularized initrd – not covered by this document]

(See the Linux TPM PCR Registry for an overview how these four PCRs fit into the list of Linux PCR assignments.)

For all four PCRs the assumption is that they are zero before the UKI initializes, and only the data that the UKI and the OS measure into them is included. This makes pre-calculating them straightforward: given a specific set of UKI components, it is immediately clear what PCR values can be expected in PCR 11 once the UKI booted up. Given a kernel command line (and other parameterization/configuration) it is clear what PCR values are expected in PCR 12.

Note that these four PCRs are defined by the conceptual “owner” of the resources measured into them. PCR 11 only contains resources the OS vendor controls. Thus it is straight-forward for the OS vendor to pre-calculate and then cryptographically sign the expected values for PCR 11. The PCR 11 values will be identical on all systems that run the same version of the UKI. PCR 12 only contains resources the administrator controls, thus the administrator can pre-calculate PCR values, and they will be correct on all instances of the OS that use the same parameters/configuration. PCR 15 only contains resources inherently local to the local system, i.e. the cryptographic key material that encrypts the root file system of the OS.

Separating out these three roles does not imply these actually need to be separate when used. However the assumption is that in many popular environments these three roles should be separate.

By separating out these PCRs by the owner’s role, it becomes straightforward to remotely attest, individually, on the software that runs on a node (PCR 11), the configuration it uses (PCR 12) or the identity of the system (PCR 15). Moreover, it becomes straightforward to robustly and securely encrypt data so that it can only be unlocked on a specific set of systems that share the same OS, or the same configuration, or have a specific identity – or a combination thereof.

Note that the mentioned PCRs are so far not typically used on generic Linux-based operating systems, to our knowledge. Windows uses them, but given that Windows and Linux should typically not be included in the same boot process this should be unproblematic, as Windows’ use of these PCRs should thus not conflict with ours.

To summarize:

  • PCR 11: measurement of UKI components and boot phases. Owner: OS vendor. Expected value before UKI boot: zero. Pre-calculable: yes (at UKI build time).

  • PCR 12: measurement of the kernel command line and additional kernel runtime configuration such as systemd credentials and systemd syscfg images. Owner: administrator. Expected value before UKI boot: zero. Pre-calculable: yes (when the system configuration is assembled).

  • PCR 13: system extension images for the initrd (and possibly more). Owner: (administrator). Expected value before UKI boot: zero. Pre-calculable: yes (when the set of extensions is assembled).

  • PCR 15: measurement of the root file system volume key (possibly later more: measurement of root file system UUIDs and labels and of the machine ID /etc/machine-id). Owner: local system. Expected value before UKI boot: zero. Pre-calculable: yes (after first boot, once all such IDs are determined).

Signature Keys

In the model above, two sets of private/public key pairs are of particular relevance:

  • The SecureBoot key to sign the UKI PE executable with. This controls permissible choices of OS/kernel

  • The key to sign the expected PCR 11 values with. Signatures made with this key will end up in the .pcrsig PE section. The public key part will end up in the .pcrpkey PE section.

Typically the key pair for the PCR 11 signatures should be chosen with a narrow focus, reused for exactly one specific OS (e.g. “Fedora Desktop Edition”) and the series of UKIs that belong to it (all the way through all the versions of the OS). The SecureBoot signature key can be used with a broader focus, if desired. By keeping the PCR 11 signature key narrow in focus one can ensure that secrets bound to the signature key can only be unlocked on the narrow set of UKIs desired.

TPM Policy Use

Depending on the intended access policy to a resource protected by the TPM, one or more of the PCRs described above should be selected to bind TPM policy to.

For example, the root file system encryption key should likely be bound to TPM PCR 11, so that it can only be unlocked if a specific set of UKIs is booted (it should then, once acquired, be measured into PCR 15, as discussed above, so that later TPM objects can be bound to it, further down the chain). With the model described above this is reasonably straight-forward to do:

  • When userspace wants to bind disk encryption to a specific series of UKIs (“enrollment”), it looks for the public key passed to the initrd in the /.extra/ directory (which as discussed above originates in the .pcrpkey PE section of the UKI). The relevant userspace component (e.g. systemd) is then responsible for generating a random key to be used as symmetric encryption key for the storage volume (let’s call it disk encryption key here, DEK). The TPM is then used to encrypt (“seal”) the DEK with its internal Storage Root Key (TPM SRK). A TPM2 policy is bound to the encrypted DEK. The policy enforces that the DEK may only be decrypted if a valid signature is provided that matches the state of PCR 11 and the public key provided in the /.extra/ directory of the initrd. The plaintext DEK is passed to the kernel to implement disk encryption (e.g. LUKS/dm-crypt). (Alternatively, hardware disk encryption can be used too, i.e. Intel MKTME, AMD SME or even OPAL, all of which are outside of the scope of this document.) The TPM-encrypted version of the DEK which the TPM returned is written to the encrypted volume’s superblock.

  • When userspace wants to unlock disk encryption on a specific UKI, it looks for the signature data passed to the initrd in the /.extra/ directory (which as discussed above originates in the .pcrsig PE section of the UKI). It then reads the encrypted version of the DEK from the superblock of the encrypted volume. The signature and the encrypted DEK are then passed to the TPM. The TPM then checks if the current PCR 11 state matches the supplied signature from the .pcrsig section and the public key used during enrollment. If all checks out it decrypts (“unseals”) the DEK and passes it back to the OS, where it is then passed to the kernel which implements the symmetric part of disk encryption.

Note that in this scheme the encrypted volume’s DEK is not bound to specific literal PCR hash values, but to a public key which is expected to sign PCR hash values.

Also note that the state of PCR 11 only matters during unlocking. It is not used or checked when enrolling.

In this scenario:

  • Input to the TPM part of the enrollment process are the TPM’s internal SRK, the plaintext DEK provided by the OS, and the public key later used for signing expected PCR values, also provided by the OS. – Output is the encrypted (“sealed”) DEK.

  • Input to the TPM part of the unlocking process are the TPM’s internal SRK, the current TPM PCR 11 values, the public key used during enrollment, a signature that matches both these PCR values and the public key, and the encrypted DEK. – Output is the plaintext (“unsealed”) DEK.

Note that sealing/unsealing is done entirely on the TPM chip; the host OS just provides the inputs (well, only the inputs that the TPM chip doesn’t know already on its own), and receives the outputs. With the exception of the plaintext DEK, none of the inputs/outputs are sensitive, and can safely be stored in the open. On the wire the plaintext DEK is protected via TPM parameter encryption (not discussed in detail here because, though important, it is not in scope for this document).

TPM PCR 11 is the most important of the mentioned PCRs, and its use is thus explained in detail here. The other mentioned PCRs can be used in similar ways, but signatures/public keys must be provided via other means.

This scheme builds on the functionality Linux’ LUKS2 provides, i.e. key management supporting multiple slots, and the ability to embed arbitrary metadata in the encrypted volume’s superblock. Note that this means the TPM2-based logic explained here doesn’t have to be the only way to unlock an encrypted volume. For example, in many setups it is wise to enroll both this TPM-based mechanism and an additional “recovery key” (i.e. a high-entropy computer generated passphrase the user can provide manually in case they lose access to the TPM and need to access their data), of which either can be used to unlock the volume.
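
With systemd’s tooling, enrollment along these lines could look roughly like this (option names as provided by recent systemd-cryptenroll versions; device path and key file name are placeholders, so check the man page before relying on it):

    # Bind the volume's DEK to PCR 11 via the UKI vendor's PCR signing public key.
    systemd-cryptenroll --tpm2-device=auto \
        --tpm2-public-key=tpm2-pcr-public-key.pem \
        --tpm2-public-key-pcrs=11 \
        /dev/nvme0n1p3

    # Additionally enroll a recovery key as a fallback unlock method.
    systemd-cryptenroll --recovery-key /dev/nvme0n1p3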

Boot Phases

Secrets needed during boot-up (such as the root file system encryption key) should typically not be accessible anymore afterwards, to protect them from access if a system is attacked during runtime. To implement this the scheme above is extended in one way: at certain milestones of the boot process additional fixed “words” should be measured into PCR 11. These milestones are placed at conceptual security boundaries, i.e. whenever code transitions from a higher privileged context to a less privileged context.

Specifically:

  • When the initrd initializes (“initrd-enter”)

  • When the initrd transitions into the root file system (“initrd-leave”)

  • When the early boot phase of the OS on the root file system has completed, i.e. all storage and file systems have been set up and mounted, immediately before regular services are started (“sysinit”)

  • When the OS on the root file system completed the boot process far enough to allow unprivileged users to log in (“complete”)

  • When the OS begins shut down (“shutdown”)

  • When the service manager is mostly finished with shutting down and is about to pass control to the final phase of the shutdown logic (“final”)

By measuring these additional words into PCR 11 the distinct phases of the boot process can be distinguished in a relatively straight-forward fashion and the expected PCR values in each phase can be determined.
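
On a running system this is easy to observe: reading PCR 11 before and after one of these transitions (here with the tpm2-tools package, assuming it is installed) shows the value change:

    tpm2_pcrread sha256:11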

The phases are measured into PCR 11 (as opposed to some other PCR) mostly because available PCRs are scarce, and the boot phases defined are typically specific to a chosen OS, and hence fit well with the other data measured into PCR 11: the UKI which is also specific to the OS. The OS vendor generates both the UKI and defines the boot phases, and thus can safely and reliably pre-calculate/sign the expected PCR values for each phase of the boot.

Revocation/Rollback Protection

In order to secure secrets stored at rest, in particular in environments where unattended decryption shall be possible, it is essential that an attacker cannot use old, known-buggy – but properly signed – versions of software to access them.

Specifically, if disk encryption is bound to an OS vendor (via UKIs that include expected PCR values, signed by the vendor’s public key) there must be a mechanism to lock out old versions of the OS or UKI from accessing TPM based secrets once it is determined that the old version is vulnerable.

To implement this we propose making use of one of the “counters” TPM 2.0 devices provide: integer registers that are persistent in the TPM and can only be increased on request of the OS, but never be decreased. When sealing resources to the TPM, a policy may be declared to the TPM that restricts how the resources can later be unlocked: here we use one that requires that along with the expected PCR values (as discussed above) a counter integer range is provided to the TPM chip, along with a suitable signature covering both, matching the public key provided during sealing. The sealing/unsealing mechanism described above is thus extended: the signature passed to the TPM during unsealing now covers both the expected PCR values and the expected counter range. To be able to use a signature associated with an UKI provided by the vendor to unseal a resource, the counter thus must be at least increased to the lower end of the range the signature is for. By doing so the ability is lost to unseal the resource for signatures associated with older versions of the UKI, because their upper end of the range disables access once the counter has been increased far enough. By carefully choosing the upper and lower end of the counter range whenever the PCR values for an UKI shall be signed it is thus possible to ensure that updates can invalidate prior versions’ access to resources. By placing some space between the upper and lower end of the range it is possible to allow a controlled level of fallback UKI support, with clearly defined milestones where fallback to older versions of an UKI is not permitted anymore.

Example: a hypothetical distribution FooOS releases a regular stream of UKI kernels 5.1, 5.2, 5.3, … It signs the expected PCR values for these kernels with a key pair it maintains in a HSM. When signing UKI 5.1 it includes information directed at the TPM in the signed data declaring that the TPM counter must be above 100, and below 120, in order for the signature to be used. Thus, when the UKI is booted up and used for unlocking an encrypted volume the unlocking code must first increase the counter to 100 if needed, as the TPM will otherwise refuse unlocking the volume. The next release of the UKI, i.e. UKI 5.2 is a feature release, i.e. reverting back to the old kernel locally is acceptable. It thus does not increase the lower bound, but it increases the upper bound for the counter in the signature payload, thus encoding a valid range 100…121 in the signed payload. Now a major security vulnerability is discovered in UKI 5.1. A new UKI 5.3 is prepared that fixes this issue. It is now essential that UKI 5.1 can no longer be used to unlock the TPM secrets. Thus UKI 5.3 will bump the lower bound to 121, and increase the upper bound by one, thus allowing a range 121…122. Or in other words: for each new UKI release the signed data shall include a counter range declaration where the upper bound is increased by one. The lower range is left as-is between releases, except when an old version shall be cut off, in which case it is bumped to one above the upper bound used in that release.

UKI Generation

As mentioned earlier, UKIs are the combination of various resources into one PE file. For most of these individual components there are pre-existing tools to generate them. For example, the included kernel image can be generated with the usual Linux kernel build system. The initrd included in the UKI can be generated with existing tools such as dracut and similar. Once the basic components (.linux, .initrd, .cmdline, .splash, .dtb, .osrel, .uname) have been acquired the combination process works roughly like this:

  1. The expected PCR 11 hashes (and signatures for them) for the UKI are calculated. The tool for that takes all basic UKI components and a signing key as input, and generates a JSON object as output that includes both the literal expected PCR hash values and a signature for them. (For all selected TPM2 banks)

  2. The EFI stub binary is now combined with the basic components, the generated JSON PCR signature object from the first step (in the .pcrsig section) and the public key for it (in the .pcrpkey section). This is done via a simple “objcopy” invocation resulting in a single UKI PE binary.

  3. The resulting EFI PE binary is then signed for SecureBoot (via a tool such as sbsign or similar).
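
Sketching steps 1 and 3 with the tools mentioned in this document (options abbreviated and file names invented for illustration; consult the systemd-measure and sbsign man pages for the authoritative set of inputs):

    # Step 1: pre-calculate and sign the expected PCR 11 values for the UKI components.
    systemd-measure sign \
        --linux=vmlinuz --initrd=initrd.cpio --cmdline=cmdline.txt --osrel=os-release \
        --bank=sha256 \
        --private-key=tpm2-pcr-private-key.pem \
        --public-key=tpm2-pcr-public-key.pem \
        > uki.pcrsig.json

    # Step 3: after the objcopy step from above glued everything (including the
    # .pcrsig and .pcrpkey sections) into uki.efi, sign the result for SecureBoot.
    sbsign --key db.key --cert db.crt --output uki-signed.efi uki.efi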

Note that the UKI model implies pre-built initrds. How to generate these (and securely extend and parameterize them) is outside of the scope of this document, but a related document will be provided highlighting these concepts.

Protection Coverage of SecureBoot Signing and PCRs

The scheme discussed here touches both SecureBoot code signing and TPM PCR measurements. These two distinct mechanisms cover separate parts of the boot process.

Specifically:

  • Firmware/Shim SecureBoot signing covers bootloader and UKI

  • TPM PCR 11 covers the UKI components and boot phase

  • TPM PCR 12 covers admin configuration

  • TPM PCR 15 covers the local identity of the host

Note that this means SecureBoot coverage ends once the system transitions from the initrd into the root file system. It is assumed that trust and integrity have been established before this transition by some means, for example LUKS/dm-crypt/dm-integrity, ideally bound to PCR 11 (i.e. UKI and boot phase).

A robust and secure update scheme for PCR 11 (i.e. UKI) has been described above, which allows binding TPM-locked resources to a UKI. For PCR 12 no such scheme is currently designed, but might be added later (use case: permit access to certain secrets only if the system runs with configuration signed by a specific set of keys). Given that resources measured into PCR 15 typically aren’t updated (or if they are updated loss of access to other resources linked to them is desired) no update scheme should be necessary for it.

This document focuses on the three PCRs discussed above. Disk encryption and other userspace may choose to also bind to other PCRs. However, doing so reintroduces the PCR brittleness issue that this design is supposed to remove: PCRs defined by the various firmware UEFI/TPM specifications generally have no concept of signatures of expected PCR values.

It is known that the industry-adopted SecureBoot signing keys are too broad to act as more than a denylist for known bad code. It is thus probably a good idea to enroll vendor SecureBoot keys wherever possible (e.g. in environments where the hardware is very well known, and VM environments), to raise the bar on preparing rogue UKI-like PE binaries that will result in PCR values that match expectations but actually contain bad code. Discussion about that is however outside of the scope of this document.

Whole OS embedded in the UKI

The above is written under the assumption that the UKI embeds an initrd whose job it is to set up the root file system: find it, validate it, cryptographically unlock it and similar. Once the root file system is found, the system transitions into it.

While this is the traditional design and likely what most systems will use, it is also possible to embed a regular root file system into the UKI and avoid any transition to an on-disk root file system. In this mode the whole OS would be encapsulated in the UKI, and signed/measured as one. In such a scenario the whole of the OS must be loaded into RAM and remain there, which typically restricts the general usability of such an approach. However, for specific purposes this might be the design of choice, for example to implement self-sufficient recovery or provisioning systems.

Proposed Implementations & Current Status

The toolset for most of the above is already implemented in systemd and related projects in one way or another. Specifically:

  1. The systemd-stub (or short: sd-stub) component implements the discussed UEFI stub program

  2. The systemd-measure tool can be used to pre-calculate expected PCR 11 values given the UKI components and can sign the result, as discussed in the UKI Generation section above.

  3. The systemd-cryptenroll and systemd-cryptsetup tools can be used to bind a LUKS2 encrypted file system volume to a TPM and PCR 11 public key/signatures, according to the scheme described above. (The two components also implement a “recovery key” concept, as discussed above)

  4. The systemd-pcrphase component measures specific words into PCR 11 at the discussed phases of the boot process.

  5. The systemd-creds tool may be used to encrypt/decrypt data objects called “credentials” that can be passed into services and booted systems, and are automatically decrypted (if needed) immediately before service invocation. Encryption is typically bound to the local TPM, to ensure the data cannot be recovered elsewhere.
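
As a quick illustration of the latter (verb and option names as I understand the current systemd-creds interface; file names are placeholders):

    # Encrypt a secret so that it can only be decrypted with the local TPM;
    # the service manager decrypts it automatically when handing it to a service.
    systemd-creds encrypt --with-key=tpm2 --name=mysecret plaintext.txt mysecret.cred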

Note that systemd-stub (i.e. the UEFI code glued into the UKI) is distinct from systemd-boot (i.e. the UEFI boot loader that can manage multiple UKIs and other boot menu items and implements automatic fallback, an interactive menu and a programmatic interface for the OS among other things). One can be used without the other – both sd-stub without sd-boot and vice versa – though they integrate nicely if used in combination.

Note that the mechanisms described are relatively generic, and can be implemented and consumed in other software too; systemd should be considered a reference implementation, though one that has found comprehensive adoption across Linux distributions.

Some concepts discussed above are currently not implemented. Specifically:

  1. The rollback protection logic is currently not implemented.

  2. The mentioned measurement of the root file system volume key to PCR 15 is implemented, but not merged into the systemd main branch yet.

The UAPI Group

We recently started a new group for discussing concepts and specifications of basic OS components, including UKIs as described above. It's called the UAPI Group. Please have a look at the various documents and specifications already available there, and expect more to come. Contributions welcome!

Glossary

TPM

Trusted Platform Module; a security chip found in many modern systems, both physical systems and increasingly also in virtualized environments. Traditionally a discrete chip on the mainboard but today often implemented in firmware, and lately directly in the CPU SoC.

PCR

Platform Configuration Register; a set of registers on a TPM that are initialized to zero at boot. The firmware and OS can “extend” these registers with hashes of data used during the boot process and afterwards. “Extension” means the supplied data is first cryptographically hashed. The resulting hash value is then combined with the previous value of the PCR and the combination hashed again. The result will become the new value of the PCR. By doing this iteratively for all parts of the boot process (always with the data that will be used next during the boot process) a concept of “Measured Boot” can be implemented: as long as every element in the boot chain measures (i.e. extends into the PCR) the next part of the boot like this, the resulting PCR values will prove cryptographically that only a certain set of boot components can have been used to boot up. A standards compliant TPM usually has 24 PCRs, but more than half of those are already assigned specific meanings by the firmware. Some of the others may be used by the OS, of which we use four in the concepts discussed in this document.
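
In other words, for a given bank (SHA-256 shown, || denoting concatenation) each extend operation computes roughly:

    PCR_new = SHA256( PCR_old || SHA256(data) )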

Measurement

The act of “extending” a PCR with some data object.

SRK

Storage Root Key; a special cryptographic key generated by a TPM that never leaves the TPM, and can be used to encrypt/decrypt data passed to the TPM.

UKI

Unified Kernel Image; the concept this document is about. A combination of kernel, initrd and other resources. See above.

SecureBoot

A mechanism where every software component involved in the boot process is cryptographically signed and checked against a set of public keys stored in the mainboard hardware, implemented in firmware, before it is used.

Measured Boot

A boot process where each component measures (i.e., hashes and extends into a TPM PCR, see above) the next component it will pass control to before doing so. This serves two purposes: it can be used to bind security policy for encrypted secrets to the resulting PCR values (or signatures thereof, see above), and it can be used to reason about used software after the fact, for example for the purpose of remote attestation.

initrd

Short for “initial RAM disk”, which – strictly speaking – is a misnomer today, because no RAM disk is involved anymore, but a tmpfs file system instance. Also known as “initramfs”, which is also misleading, given the file system is not ramfs anymore, but tmpfs (both of which are in-memory file systems on Linux, with different semantics). The initrd is passed to the Linux kernel and is basically a file system tree in a cpio archive. The kernel unpacks the image into a tmpfs (i.e., into an in-memory file system), and then executes a binary from it. It thus contains the binaries for the first userspace code the kernel invokes. Typically, the initrd’s job is to find the actual root file system, unlock it (if encrypted), and transition into it.

UEFI

Short for “Unified Extensible Firmware Interface”, it is a widely adopted standard for PC firmware, with native support for SecureBoot and Measured Boot.

EFI

More or less synonymous with UEFI, IRL.

Shim

A boot component originating in the Linux world, which in a way extends the public key database SecureBoot maintains (which is under control from Microsoft) with a second layer (which is under control of the Linux distributions and of the owner of the physical device).

PE

Portable Executable; a file format for executable binaries, originally from the Windows world, but also used by UEFI firmware. PE files may contain code and data, categorized into labeled “sections”.

ESP

EFI System Partition; a special partition on a storage medium in which the firmware looks for UEFI PE binaries to execute at boot.

HSM

Hardware Security Module; a piece of hardware that can generate and store secret cryptographic keys, and execute operations with them, without the keys leaving the hardware (though this is configurable). TPMs can act as HSMs.

DEK

Disk Encryption Key; a symmetric cryptographic key used for unlocking disk encryption, i.e. passed to LUKS/dm-crypt for activating an encrypted storage volume.

LUKS2

Linux Unified Key Setup Version 2; a specification for a superblock for encrypted volumes widely used on Linux. LUKS2 is the default on-disk format for the cryptsetup suite of tools. It provides flexible key management with multiple independent key slots and allows embedding arbitrary metadata in a JSON format in the superblock.

Thanks

I’d like to thank Alain Gefflaut, Anna Trikalinou, Christian Brauner, Daan de Meyer, Luca Boccassi, Zbigniew Jędrzejewski-Szmek for reviewing this text.

October 22, 2022

In my previous post I talked about the VK_EXT_mesh_shader extension that had just been released for Vulkan, and in which I had participated by reviewing the spec and writing CTS tests. Back then I referred readers to external documentation sources like the Vulkan mesh shading post on the Khronos Blog, but today I can add one more interesting resource. A couple of weeks ago I went to Minneapolis to participate in XDC 2022, where I gave an introductory talk about mesh shaders that’s now available on YouTube. In the talk I give some details about the concepts, the Vulkan API and how the new shading stages work.

Just after me, Timur Kristóf also presented an excellent talk with details about the Mesa mesh shader implementation for RADV, available as part of the same playlist.

As an additional resource, I’m going to participate together with Timur, Steven Winston and Christoph Kubisch in a Khronos Vulkanised Webinar to talk a bit more about mesh shaders on October 27. You must register to attend, but attendance is free.

Back to XDC, crossing the Atlantic Ocean to participate in the event was definitely tiring, but I had a lot of fun at the conference. It was my first in-person XDC and a special one too, this year hosted together with WineConf and FOSS XR. Seeing everyone there and shaking some hands, even with our masks on most of the time, made me realize how much I missed traveling to events. Special thanks to Codeweavers for organizing the conference, and in particular to Jeremy White and especially to Arek Hiler for taking care of most technical details and acting as a host and manager in the XDC room.

Apart from my mesh shaders talk, do not miss the other talks by Igalians at the conference.

And, of course, take a look at the whole event playlist for more super-interesting content, like the one by Alyssa Rosenzweig and Lina Asahi about reverse-engineered GPU drivers, the one about Zink by Mike Blumenkrantz (thanks for the many shout-outs!) or the guide to write Vulkan drivers by Jason Ekstrand which includes some info about the new open source Vulkan driver for NVIDIA cards.

That said, if you’re really interested in my talk but don’t want to watch a video (or the closed captions are giving you trouble), you can find the slides and the script to my talk below. Each thumbnail can be clicked to view a full-HD image.

Talk slides and script

Title slide: Replacing the geometry pipeline with mesh shaders

Hi everyone, I’m Ricardo Garcia from Igalia. Most of my work revolves around CTS tests and the Vulkan specification, and today I’ll be talking about the new mesh shader extension for Vulkan that was published a month ago. I participated in the release process of this extension by writing thousands of tests and reviewing and discussing the specification text for Vulkan, GLSL and SPIR-V.

Mesh shaders are a new way of processing geometry in the graphics pipeline. Essentially, they introduce an alternative way of creating graphics pipelines in Vulkan, but they don’t introduce a completely new type of pipeline.

The new extension is multi-vendor and heavily based on the NVIDIA-only extension that existed before, but some details have been fine-tuned to make it closer to the DX12 version of mesh shaders and to make it easier to implement for other vendors.

Main Points

I want to cover what mesh shaders are, how they compare to the classic pipelines and how they solve some problems.

Then we’ll take a look at what a mesh shader looks like and how it works, and we’ll also talk about drawbacks mesh shaders have.

What is mesh shading?

Mesh shading introduces a new type of graphics pipeline with a much smaller number of stages compared to the classic one. One of the new stages is called the mesh shading stage.

These new pipelines try to address some issues and shortcomings with classic pipelines on modern GPUs. The new pipeline stages have many things in common with compute shaders, as we’ll see.

Classic Graphics Pipeline

This is a simplified version of the classic pipeline.

Basically the pipeline can be divided into two parts. The first stages are in charge of generating primitives for the rasterizer.

Then the rasterizer does a lot of magic including primitive clipping, barycentric interpolation and preparing fragment data for fragment shader invocations.

It’s technically possible to replace the whole pipeline with a compute shader (there’s a talk on Thursday about this), but mesh shaders do not touch the rasterizer and everything that comes after it.

Mesh shaders try to apply a compute model to replace some of this with a shader that’s similar to compute, but the changes are restricted to the first part of the pipeline.

Mesh Shading Pipeline

If I have to cut the presentation short, this is perhaps one of the slides you should focus on.

Mesh shading employs a shader that’s similar to compute to generate geometry for the rasterizer.

There’s no input assembly, no vertex shader, no tessellation, etc. Everything that you did with those stages is now done in the new mesh shading stage, which is a bit more flexible and powerful.

Mesh Shading Pipeline (Full)

In reality, the mesh shading extension actually introduces two new stages. There’s an optional task shader that runs before the mesh shader, but we’ll forget about it for now.

Classic Pipeline Problems

These are the problems mesh shading tries to solve.

Vertex inputs are a bit annoying to implement in drivers and in some hardware they use specific fixed function units that may be a bottleneck in some cases, at least in theory.

The main pain point is that vertex shaders work at the per-vertex level, so you don’t generally have control of how geometry is arranged in primitives. You may run several vertex shader invocations that end up forming a primitive that faces back and is not visible and there’s no easy way to filter those out, so you waste computing power and memory bandwidth reading data for those vertices. Some implementations do some clever stuff here trying to avoid these issues.

Finally, tessellation and geometry shaders should perhaps be simpler and more powerful, and should work like compute shaders so we process vertices in parallel more efficiently.

How do they look? GLSL Extension

So far we know mesh shaders look a bit like compute shaders and they need to generate geometry somehow because their output goes to the rasterizer, so let’s take a look.

The example will be in GLSL to make it easier to read. As you can see, it needs a new extension which, when translated to SPIR-V, will be converted into a SPIR-V extension that gives access to some new opcodes and functionality.
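
Roughly, the top of such a shader declares the extension like this (an illustrative sketch rather than the literal slide contents):

    #version 460
    #extension GL_EXT_mesh_shader : require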

How do they look? Local Size

The first similarity to compute is that mesh shaders are dispatched in 3d work groups like compute shaders, and each of them has a number of invocations in 3d controlled by the shader itself. Same deal. There’s a limit to the size of each work group, but the minimum mandatory limit by the spec is 128 invocations. If the hardware does not support work groups of that size, they will be emulated. We also have a properties structure in Vulkan where you can check the recommended maximum size for work groups according to the driver.
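
For illustration (the numbers are arbitrary, not from the slides), the work group size is declared the same way as in a compute shader:

    layout(local_size_x = 32, local_size_y = 1, local_size_z = 1) in;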

Inside the body of your shader you get access to the typical built-ins for compute shaders, like the number of work groups, work group id, local invocation indices, etc.

If subgroups are supported, you can also use subgroup operations in them.

How do they look? Type of output geometry

But mesh shaders also have to generate geometry. The type cannot be chosen at runtime. When writing the shader you have to decide if your shader will output triangles, lines or points.
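
For example, a mesh shader emitting triangles declares (again, a sketch rather than the slide):

    layout(triangles) out;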

How do they look? Maximum vertices and primitives

You must also indicate an upper limit in the number of vertices and primitives that each work group will generate.
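
Sketching it, those limits go into the output layout declaration; the numbers here are arbitrary examples:

    layout(max_vertices = 64, max_primitives = 126) out;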

Generally speaking, this will be a small-ish number. Several implementations will limit you to 256 vertices and primitives at most, which is the minimum required limit.

To handle big meshes with this, you’ll need several work groups and each work group will handle a piece of the whole mesh.

In each work group, the local invocations will cooperate to generate arrays of vertex and primitive data.

How do they look? Output geometry arrays

And here you can see how. After, perhaps, some initial processing not seen here, you have to indicate how many actual vertices and primitives the work group will emit, using the SetMeshOutputsEXT call.

That call goes first before filling any output array and you can reason about it as letting the implementation allocate the appropriate amount of memory for those output arrays.

Mesh shaders output indexed geometry, like when you use vertex and index buffers together.

You need to write data for each vertex to an output array, and primitive indices to another output array. Typically, each local invocation handles one position or a chunk of those arrays so they cooperate together to fill the whole thing. In the slide here you see a couple of those arrays, the most typical ones.

The built-in mesh vertices ext array contains per-vertex built-ins, like the vertex position. Indices used with this array go from 0 to ACTUAL_V-1.

Then, the primitive triangle indices ext array contains, for each triangle, 3 uint indices into the previous vertices array. The primitive indices array itself is accessed using indices from 0 to ACTUAL_P-1. If there’s a second slide that I want you to remember, it’s this one. What we have here is an initial template to start writing any mesh shader.
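
Putting it together, a minimal sketch of that template could look like this; it emits a single hard-coded triangle from one invocation just to show the call order and the two arrays, and it is my illustration rather than the slide contents:

    void main()
    {
        // Reserve space for the actual outputs first: 3 vertices, 1 primitive.
        SetMeshOutputsEXT(3u, 1u);

        // Normally all invocations cooperate to fill their slice of the arrays;
        // letting a single invocation write everything keeps the sketch short.
        if (gl_LocalInvocationIndex == 0u) {
            gl_MeshVerticesEXT[0].gl_Position = vec4(-0.5, -0.5, 0.0, 1.0);
            gl_MeshVerticesEXT[1].gl_Position = vec4( 0.5, -0.5, 0.0, 1.0);
            gl_MeshVerticesEXT[2].gl_Position = vec4( 0.0,  0.5, 0.0, 1.0);

            // Indices into the vertex array, in the 0..ACTUAL_V-1 range.
            gl_PrimitiveTriangleIndicesEXT[0] = uvec3(0, 1, 2);
        }
    }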

How do they look? Output attributes

There are a few more details we can add. For example, mesh shaders can also generate custom output attributes that will be interpolated and used as inputs to the fragment shader, just like vertex shaders can.

The difference is that in mesh shaders they form arrays. If we say nothing, like in the first output here, they’re considered per-vertex and have the same index range as the mesh vertices array.

A nice addition for mesh shaders is that you can use the perprimitiveEXT keyword to indicate output attributes are per-primitive and do not need to be interpolated, like the second output here. If you use these, you need to declare them with the same keyword in the fragment shader so the interfaces match. Indices to these arrays have the same range as the built-in primitive indices array.
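
As a sketch of what those interface declarations could look like (names invented for illustration), in the mesh shader:

    layout(location = 0) out vec4 outColor[];                          // per-vertex, interpolated
    layout(location = 1) perprimitiveEXT out vec4 outPrimitiveColor[]; // per-primitive, not interpolated

and the matching declarations in the fragment shader:

    layout(location = 0) in vec4 inColor;
    layout(location = 1) perprimitiveEXT in vec4 inPrimitiveColor;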

And, of course, if there’s no input assembly we need to read data from somewhere. Typically from descriptors like storage buffers containing vertex and maybe index information, but we could also generate geometry procedurally.

Some built-in arrays

Just to show you a few more details, these are the built-in arrays used for geometry. There are arrays of indices for triangles, lines or points depending on what the shader is supposed to generate.

The mesh vertices ext array that we saw before can contain a bit more data apart from the position (point size, clip and cull distances).

The third array was not used before. It’s the first time I mention it and as you can see it’s per-primitive instead of per-vertex. You can indicate a few things like the primitive id, the layer or viewport index, etc.
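
For reference, the built-in output declarations in the GLSL extension look approximately like this (abbreviated from the GL_EXT_mesh_shader specification as I recall it, so double check against the actual text):

    out uint  gl_PrimitivePointIndicesEXT[];
    out uvec2 gl_PrimitiveLineIndicesEXT[];
    out uvec3 gl_PrimitiveTriangleIndicesEXT[];

    out gl_MeshPerVertexEXT {
        vec4  gl_Position;
        float gl_PointSize;
        float gl_ClipDistance[];
        float gl_CullDistance[];
    } gl_MeshVerticesEXT[];

    perprimitiveEXT out gl_MeshPerPrimitiveEXT {
        int  gl_PrimitiveID;
        int  gl_Layer;
        int  gl_ViewportIndex;
        bool gl_CullPrimitiveEXT;
        int  gl_PrimitiveShadingRateEXT;
    } gl_MeshPrimitivesEXT[];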

Meshlets

As I mentioned before, each work group can only emit a relatively small number of primitives and vertices, so for big models, several work groups are dispatched and each of them is in charge of generating and processing a meshlet, which are the colored patches you see here on the bunny.

It’s worth mentioning the subdivision of big meshes into meshlets is typically done when preparing assets for the application, meaning there shouldn’t be any runtime delay.

Dispatching Work Groups

Mesh shading work groups are dispatched with specific commands inside a render pass, and they look similar to compute dispatches as you can see here, with a 3d size.

Mesh Shading Pipeline (Full)

Let’s talk a bit about task shaders, which are optional. If present, they go before mesh shaders, and the dispatch commands do not control the number of mesh shader work groups, but the number of task shader work groups that are dispatched and each task shader work group will dispatch a number of mesh shader work groups.

Task (Amplification) Shader

Each task shader work group also follows the compute model, with a number of local invocations that cooperate together.

Each work group typically pre-processes geometry in some way and amplifies or reduces the amount of work that needs to be done. That’s why it’s called the amplification shader in DX12.

Once that pre-processing is done, each task work group decides, at runtime, how many mesh work groups to launch as children, forming a tree with two levels.

Task (Amplification) Shader: Example Dispatch

One interesting detail about this is that compute built-ins in mesh shaders may not be unique when using task shaders. They are only unique per branch. In this example, we dispatched a couple of task shader work groups and each of them decided to dispatch 2 and 3 mesh shader work groups. Some mesh shader work groups will have the same work group id and, if the second task shader work group had launched 2 children instead of 3, even the number of work groups would be the same.

But we probably want them all to process different things, so the way to tell them apart from inside the mesh shader code is to use a payload: a piece of data that is generated in each task work group and passed down to its children as read-only data.

Combining the payload with existing built-ins allows you to process different things in each mesh shader work group.

Payload

This is done like this. On the left you have a task shader.

You can see it also works like a compute shader and invocations cooperate to pre-process stuff and generate the payload. The payload is a variable declared with the task payload shared ext qualifier. These payloads work like shared memory. That’s why they have “shared” in the qualifier.

In the mesh shader they are read-only. You can declare the same payload and read from it.
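
A minimal sketch of that pairing (payload contents invented for illustration). The task shader side:

    #version 460
    #extension GL_EXT_mesh_shader : require
    layout(local_size_x = 1) in;

    struct TaskPayload { uint firstMeshlet; };
    taskPayloadSharedEXT TaskPayload payload;

    void main()
    {
        // Cooperative pre-processing would happen here; then decide how many
        // mesh work groups this task work group launches as children.
        payload.firstMeshlet = gl_WorkGroupID.x * 4u;
        EmitMeshTasksEXT(4u, 1u, 1u);
    }

And the mesh shader side declares the same payload, reading it to tell work groups apart:

    struct TaskPayload { uint firstMeshlet; };
    taskPayloadSharedEXT TaskPayload payload;
    // payload.firstMeshlet combined with gl_WorkGroupID then identifies what to process.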

Mesh Shading Pros

Advantages:

Avoiding input assembly bottlenecks if they exist.

Pre-compute data and discard geometry in advance, saving processing power and memory bandwidth.

Geometry and tessellation can be applied freely, in more flexible ways.

The use of a model similar to compute shaders allows us to take advantage of the GPU processing power more effectively.

Many games use a compute pre-pass to process some data and calculate things that will be needed at draw time. With mesh shaders it may be possible to streamline this and integrate this processing into the mesh or task shaders.

You can also abuse mesh shading pipelines as compute pipelines with two levels if needed.

Mesh Shading Cons

Disadvantages:

Mesh shading is problematic for tiling GPUs as you can imagine, for the same reasons tessellation and geometry shaders suck on those platforms.

Giving users freedom in this part of the pipeline may allow them to shoot themselves in the foot and end up with sub-optimal performance. If not used properly, mesh shaders may be slower than classic pipelines.

The structure of vertex and index buffers needs to be declared explicitly in shaders, increasing coupling between CPU and GPU code, which is not nice.

Most importantly, right now it’s hard or even impossible to write a single mesh shader that performs great on all implementations.

Some vendor preferences are exposed as properties by the extension.

NVIDIA loves smaller work groups and using loops in code to generate geometry with each invocation (several vertices and triangles per invocation).

Threads on AMD can only generate at most one vertex and one primitive, so they’d love you to use bigger work groups and use the local invocation index to access per-vertex and per-primitive arrays.

As you can imagine, this probably results in different mesh shaders for each vendor.

Questions and Answers

In the Q&A section, someone asked about differences between the Vulkan version of mesh shaders and the Metal version of them. Unfortunately, I’m not familiar with Metal so I said I didn’t know. Then, I was asked if it was possible to implement the classic pipeline on top of mesh shaders, and I replied it was theoretically possible. The programming model of mesh shaders is more flexible, so the classic pipeline can be implemented on top of it. However, I didn’t reply (because I have doubts) about how efficient that could be. Finally, the last question asked me to elaborate on stuff that could be done with task shaders, and I replied that apart from integrating compute pre-processing in the pipeline, they were also typically used to select LOD levels or discard meshlets and, hence, to avoid launching mesh work groups to deal with them.

Closing Slide

October 16, 2022

Hi!

This month I’ve done a lot of cleanup and bugfixing in wlroots, especially in the DRM backend, the Vulkan renderer and screencopy protocol implementation. There are still a few DRM backend bugs which need to be ironed out, but we’re getting there!

Jonas Ådahl has released wayland-protocols 1.27, which features two new protocols I’ve been working on with the KDE folks. idle-notify-v1 is a standard replacement for the old KDE idle protocol we’ve been using so far. It allows clients to be notified when the user is idle for a while. This is used by swayidle to blank the screens or lock the session after a delay, for instance. Patches for wlroots, Sway and swayidle have been merged.

The other protocol is content-type-v1 and adds a way for clients to attach a content type (such as “video” or “game”) to a surface. With this, compositors can implement generic rules for all video players or all games. I plan to make this available in Sway for_window rules.

Joshua Ashton has been working on improving the way a Wayland compositor will handle Xwayland surfaces. Up until now it was using a racy mechanism, and we could indeed hit the race in gamescope. We’re working on a proper solution to fix the race: xwayland-shell-v1.

To continue with the “fix old stuff which we believe was working fine but definitely isn’t” trend, I re-implemented wl_shm in wlroots. wl_shm is the interface used by Wayland clients to share CPU buffers (as opposed to GPU buffers). Until now we were using the libwayland implementation, but defects in its API make it a better way forward to just do everything ourselves. Implementing wl_shm is fun: in particular we need to handle SIGBUS shenanigans which the client can trigger by shrinking files mmap’ed by the compositor. And signal handlers being what they are, this results in a lot of complications.

In other graphics news, I’ve been doing more kernel work. I’ve tracked down some DisplayPort Multi-Stream Transport (DP-MST) issues with Jonas and Lyude, and sent out a fix which doesn’t break Mutter. I’ll need to send follow-up patches to address a common issue: the docs and the code don’t match. I’ve also sent another patch to make it less annoying to implement explicit synchronization in Wayland compositors: a new IOCTL to allow user-space to integrate a drm_syncobj wait with an event loop.

Together with Pekka and Sebastian, we’ve continued our work on libdisplay-info. A new website contains documentation (built via gyosu), more DisplayID bits and GTF mode generation have been merged, and Pekka is working on the high-level API.

Another unrelated small improvement worth mentioning: goguma now supports UnifiedPush! Google Play Services no longer are a required dependency to get support for push notifications.

The NPotM is go-jsonschema, a Go code generator for JSON Schema. JSON Schema can describe the structure of JSON documents, and go-jsonschema will generate matching Go structures. This is useful to avoid maintaining glue code for each and every JSON consumer. I plan to write a C code generator as well. Right now go-jsonschema can handle almost all of the drm_info schema, however many features are still missing and schemas need to be written in a specific way to make the most of go-jsonschema.

See you next time!

October 13, 2022

 At Linux Plumbers Conference 2022, we held a BoF session around accelerators.

This is a summary made from memory and notes taken by John Hubbard.

We started with defining categories of accelerator devices.

1. single shot data processors, submit one off jobs to a device. (simpler image processors)

2. single-user, single task offload devices (ML training devices)

3. multi-app devices (GPU, ML/inference execution engines)

One of the main points made is that common device frameworks are normally about targeting a common userspace (e.g. mesa for GPUs). Since a common userspace doesn't exist for accelerators, this presents a problem of what sort of common things can be targeted. Discussion about tensorflow, pytorch as being the userspace, but also camera image processing and OpenCL. OpenXLA was also named as a userspace API that might be of interest to use as a target for implementations.

There was a discussion on what to call the subsystem and where to place it in the tree. It was agreed that the drivers would likely need to use DRM subsystem functionality but having things live in drivers/gpu/drm would not be great. Moving things around now for current drivers is too hard to deal with for backports etc. Adding a new directory for accel drivers would be a good plan, even if they used the drm framework. There was a lot of naming discussion; I think we landed on drivers/skynet or drivers/accel (Greg and I like skynet more).

We had a discussion about RAS (Reliability, Availability, Serviceability) which is how hardware is monitored in data centers. GPU and acceleration drivers for datacentre operations define their own RAS interfaces that get plugged into monitoring systems. This seems like an area that could be standardised across drivers. Netlink was suggested as a possible solution for this area.

Namespacing for devices was brought up. I think the advice was if you ever think you are going to namespace something in the future, you should probably consider creating a namespace for it up front, as designing one in later might prove difficult to secure properly.

We should use the drm framework with another major number to avoid some of the pain points and lifetime issues other frameworks see.

There was discussion about who could drive this forward, and Oded Gabbay from Intel's Habana Labs team was the obvious and best placed person to move it forward; Oded said he would endeavor to make it happen.

This is mostly a summary of the notes. I think we have a fair idea of a path forward; we just need to start bringing the pieces together upstream now.

October 11, 2022

Recap

For those of you who weren’t present, Super Good Code took over XDC last week.

The recording of The Talk is finally sliced, diced, and tuned to perfection thanks to the work of Arkadiusz Hiler. Watch it for the first time all over again to catch all the technical details and workout tips you missed.

Additionally, the slides for the presentation are available for benchmarking.

But How

…is something from last week even relevant today, you might be asking.

It’s a reasonable question. As everyone knows, zinkland is a place that changes so rapidly, so furiously, that memes from as recent as minutes ago are already outdated.

But in this rare case, for today only, yesterweek’s presentation is still noteworthy today: today is the day I’m merging async pipeline precompiles.

That’s right.

Drivers supporting all the required features will now be able to precompile shaders for AAA games during loading screens, which (probably) means zero stuttering during actual gameplay. The era of gaming on zink begins today.

For Some

Not all drivers support all the features out of the box. In fact, to the best of my knowledge, only the latest NVIDIA beta driver supports even the bare minimum of features, let alone the most optimal codepaths that will maximeme performance.

But have I tested it there?

I scoff at the implication.

Of course I haven’t. As we all know, zink code works without issues the instant it can be compiled, and the suggestion that any alternative outcome is possible is risible.

So don’t bother reporting bugs.

October 05, 2022

It Must Be Said

…that I misspoke in the course of my XDC presentation. I want to apologize first to the live audience, for hearing such grave information firsthand, then to my fans, for disappointing them, and lastly to the ANGLE team:

During a Q&A, I stated that ANGLE uses Vulkan secondary command buffers. This is false. ANGLE does not use secondary command buffers.

There will be no further corrections.

October 01, 2022

It’s Happening.

After 7 years in hibernation since LinuxCon Japan, I will once again be making a limited return to the conference trail next week.

To where, you ask?

XDC 2022.

Where I will be delivering a presentation that finally, at long last, unleashes the true potential of Zink as a driver.

Be there. Watch online. Tweet about it.

September 27, 2022

Shoutouts

Big thanks to Timur for taking over merging all my misc RADV patches. I’m so far underwater I can see the bottom of the Mariana Trench, and without his help, these likely wouldn’t be landing until 2023.

Cooking Up A Storm

It’s been a while, but today’s blog is a zink blog. What’s been happening?

Well, if you’ve been following prominent news sites, you might not think there’s too much happening on the performance front. And whew do I have news for you.

Big news.

With XDC next week, I’m throwing down the gauntlet.

Mesa 22.3 is going to be the BIG PERF RELEASE.

I said it.

No Memes

Could not be more serious.

Don’t believe me? Bam, have another 20-30% performance on Turnip I “quietly” merged a month ago. And also some amount of improvement on Intel that I haven’t quantified yet. Who could’ve guessed that enabling compression on images would have benefits?

Seeing mega CPU usage? Not if I’ve got anything to say about that. Still more to come here too.

Your game/emulator randomly freezes for a few seconds on startup? I’ve heroically made the PBO download ubershaders compile without blocking. Also there’s new, optimized variants that perform even better.

What’s that? Zink is still unusable for gaming because of all the shader compilation stuttering? Blammo, your compute shaders are now precompiling in the background.

Oh, but you’re still seeing stuttering because of shader compiles? Not for long, because I’veOH SHIT KHRONOS IS HERE I’M SORRY I DIDN’T THINK IT’D BE A BIG DEAL IF I MENTIONED TH

September 23, 2022
In-person conferences are finally back! After two years of remote conferences, the kernel development community got together in Dublin, Ireland, to discuss current problems that need collaboration to be solved. As in past editions, I took the opportunity to discuss futex2, a project I'm deeply involved in. futex2 is a project to solve issues found in the current futex interface. This year's session was about NUMA awareness of futex.
September 20, 2022
Photo of the RB3 development board with Adreno 630 GPU

Turnip now exposes everything required for Vulkan 1.3, which is a major milestone for a driver created without any hardware documentation.

The last major roadblocks were VK_KHR_dynamic_rendering and, to a much lesser extent, VK_EXT_inline_uniform_block. Huge props to Connor Abbott for implementing them both!

Screenshot of mesamatrix.net showing that Turnip has 100% of features required for Vulkan 1.3

VK_KHR_dynamic_rendering was an especially nasty extension to implement on tiling GPUs because dynamic rendering allows splitting a render pass between several command buffers.

For desktop GPUs this poses no issues: they can just record and execute commands in the same order they are submitted, without any additional post-processing. Desktop GPUs don't have render passes internally; to them a render pass is just a sequence of commands.

On the other hand, tiling GPUs have an internal concept of a render pass: they first bin the whole render pass geometry, load part of the framebuffer into tile memory, execute all render pass commands, store the framebuffer contents back into main memory, and then repeat load_framebuffer -> execute_renderpass -> store_framebuffer for every tile. In Turnip the required glue code is created at the end of a render pass, but when the render pass is split across several command buffers its full contents are known only at submit time. Therefore we have to stitch the final render pass together right there.
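
To make that flow concrete, here is a tiny self-contained C sketch of the per-tile load/execute/store loop. All names are invented for illustration; this is not Turnip's actual code, just the shape of what a tiler has to emit once the full render pass is known:

/* build and run with: cc tiler.c && ./a.out */
#include <stdio.h>

struct tile { int x, y; };

static void bin_geometry(int num_tiles)            { printf("bin geometry for %d tiles\n", num_tiles); }
static void load_framebuffer(struct tile t)        { printf("load tile (%d,%d)\n", t.x, t.y); }
static void execute_renderpass_cmds(struct tile t) { printf("draw tile (%d,%d)\n", t.x, t.y); }
static void store_framebuffer(struct tile t)       { printf("store tile (%d,%d)\n", t.x, t.y); }

int main(void)
{
   struct tile tiles[] = { {0, 0}, {1, 0}, {0, 1}, {1, 1} };
   int num_tiles = sizeof(tiles) / sizeof(tiles[0]);

   /* The whole render pass must be known before this loop can be emitted,
    * which is why a pass split across command buffers by dynamic rendering
    * has to be stitched back together at submit time. */
   bin_geometry(num_tiles);
   for (int i = 0; i < num_tiles; i++) {
      load_framebuffer(tiles[i]);
      execute_renderpass_cmds(tiles[i]);
      store_framebuffer(tiles[i]);
   }
   return 0;
}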

What’s next?

Implementing Vulkan 1.3 was necessary to support the latest DXVK (Direct3D 9-11 translation layer). VK_KHR_dynamic_rendering itself was also necessary for the latest VKD3D (Direct3D 12 translation layer).

For now my plan is:

  • Continue implementing new extensions for DXVK, VKD3D, and Zink as they come out.
  • Focus more on performance.
  • Improvements to driver debug tooling so it works better with internal and external debugging utilities.
September 16, 2022

 At LPC 2022 we had a BoF session for GPUs with two topics.

Moving to userspace consoles:

Currently most mainline distros still use the kernel console, provided by the VT subsystem. We'd like to move to CONFIG_VT=n, as the console and VT subsystem have historically been a source of bugs and are also nasty places for locking etc. They can also cause an oops to go missing when they take out the panic path, with locking bugs stopping other paths (like pstore or serial) from completely processing the oops.

The session started by discussing what things would look like. Lennart gave a great summary of the work David did a few years ago and the train of thought involved.

Once you think through all the paths and things you want supported, you realise the best user console is going to be one that supports emojis and non-Latin scripts. This probably means you want a lightweight wayland compositor running a fullscreen VTE based terminal. Working back from the consequences of this means you probably aren't going to want this in systemd, and it should be a separate development.

The other area discussed was the requirements for a panic/emergency kernel console, likely called drmlog; this would just be something to output to the display whenever the kernel panics, or during boot before the user console is loaded.

cgroups for GPU

This has been an ongoing topic: vendors often suggest cgroup patches, are told on review that they are too vendor-specific and to simplify and try again, and then never try again. We went over the problem space of both memory allocation accounting/enforcement and processing restrictions. We all agreed that local memory accounting was probably the easiest place to start. We also agreed that accounting should be implemented and solid before we start digging into enforcement. We had a discussion about where CPU memory should be accounted, especially around integrated vs discrete graphics, since users might care about application memory usage separately from GPU object usage, which would be CPU memory on integrated and GPU memory on discrete. I believe a follow-up hallway discussion also covered a bunch of this with the upstream cgroups maintainer.

Before We Begin

We all know where this post is going to end up.

We’ve been there before.

We’ll be there again.

But we’re not there yet. And before we get there, I need everyone to be absolutely serious for a little while. No memes, no jokes, no puns, just straight talk from me to you.

Let’s talk about Intel.

I know what you’re thinking.

If you work for Intel, you’re thinking Oh no.

If you don’t, you’re thinking Oh no. Or possibly getting ready to post some comments about ANV not supporting features, or having bugs, or any of the common refrains that I’ve seen so often around the internet.

The fact remains, however, that ANV is the reference zink hardware driver. It has been ever since I started working on zink. The reason for this is simple: it’s what I have on the machine I develop with, and when I started working on zink, it was the driver that was furthest along.

It’s therefore no exaggeration to say that without everything ANV and the Intel Mesa team brings to the table, zink wouldn’t be nearly as far along as it is.

And neither would RADV.

For those of you unaware, at the top of many RADV code files is this comment:

 * based in part on anv driver which is:
 * Copyright © 2015 Intel Corporation

That’s right. RADV originally started out reusing a lot of code written by Intel. It’s no exaggeration to say that RADV wouldn’t be nearly as far along without Intel’s Mesa team.

Let’s go deeper though. Using LWN’s Git Data Miner (gitdm) on mesa/src/compiler (which includes GLSL, SPIR-V, and NIR), let’s see who the top contributors are to the heart of Mesa:

Name              Commit Count   Percentage of Total   Affiliation
Jason Ekstrand    1429           19.7%                 Intel/Collabora
Timothy Arceri    714            9.9%                  Collabora/Valve
Ian Romanick      577            8.0%                  Intel
Rhys Perry        298            4.1%                  Valve
Caio Oliveira     270            3.7%                  Intel
Emma Anholt       268            3.7%                  Google
Marek Olšák       260            3.6%                  AMD
Kenneth Graunke   224            3.1%                  Intel
Samuel Pitoiset   176            2.4%                  Valve
Connor Abbott     168            2.3%                  Intel/Valve

Again we see that Intel’s Mesa team has been hard at work over the years to enable the things that we all now take for granted.

So now you know why every time I see someone saying that ANV is bad, or Intel engineers aren’t doing everything they can, or the constant memeposting about any number of related issues, it hurts. I’ve been relying on the fruits of the Intel Mesa team’s labor for years now, and so have you.

If anything, we should be cheering them on to continue doing the great work they’ve been doing all along.

Thanks, Intel Mesa team.

I appreciate you.

And so should everyone else.

But Now

It’s time for another round of

where

mesa.png

is

rotated_mesa.png

MY SPAGHETTI?

angry_mesa.png

…and meatballs, because obviously, as an American, I’m obligated to add condiments to everything. Mainly melted cheese.

In today’s edition of WIMS, I’ll be delving into the depths of Intel’s Mesa Vulkan driver to see what kinds of spaghetti they’re growing and how much of it I can eat before they catch me.

The first step, as always, is to pull out my trusty perf and…

Once again, I know what you’re thinking.

1.png

You’re thinking that I can’t just use perf and profile my way out of real performance issues.

But what if I showed you this from my Intel Icelake machine:

$ ./vkoverhead -test 0 -duration 3 -output-only
44854

Does 44.8 million draws/second seem like a good number? Something reasonable for the ancestor of all Mesa Vulkan drivers? Let’s check out a flamegraph, just for the sake of this is my blog and I’m gonna do it whether you want me to or not.

perf.png

And now I’m going to tell you

2.png

that I can too use perf for everything, and profiling is both a legitimate and useful way to improve any and all kinds of performance.

But looking at that graph, there’s something obvious that stands out here. Why is gfx11_emit_apply_pipe_flushes() showing up twice?

perfbad.png

The answer may surprise you.

Let’s go back in time. I want everyone to pretend that it’s early 2021 again. Photos are still being taken in black and white by hipsters, nobody has toilet paper, and everything is generally worse. Vulkan is also worse because I haven’t shipped any of my extensions. Extensions like VK_EXT_multi_draw, which I implemented for a number of Mesa drivers. Drivers like RADV. And Lavapipe.

And ANV.

You might see where I’m going with this.

Yes, it was all the way back in June of 2021 that I implemented this extension for ANV. And at the time, everything was worse, including the world’s lack of our lord and savior, our premier optimizing tool, that project I’m going to keep plugging in my blog until the end of time, vkoverhead. Back then, the only way anyone could know how their driver performed at the CPU level was to use drawoverhead through zink.

And zink wasn’t as fast then as it is now; whoever was working on it back then was a total fucking idiot who couldn’t optimize his way out of a for loop.

So it’s not going to surprise anyone, and it’s not like anyone would even be mad if they found out that in the course of those trivial, barely even noticeable changes to the driver, I also maybe sorta kinda increased ANV’s draw command recording overhead. But it’s not like it was by any big amount or anything like that. I mean, it’s not like I was a total fucking idiot back… back then… Well, it’s not like I wouldn’t have realized it even if I did accidentally nerf the driver that I relied upon for my daily use by an amount that was in any way significant, right?

It’s not like it was a full 30% or anything like that.

I, uh… I have to go for now, but it’s not like Intel’s entire Mesa team just rolled up on my house or anything like that because taht would be kinda crazy haha okay but brb

Barely Hanging On

So I’m back and I’m totally fine, don‘t ask if I need a gofundme for my medical bills or anything because there definitely aren’t any and I’m just great withp both hands still firmly attached and at least eight fingers in total between the two, which is, if you think about it, really just way more than anyone actually needs to write software.

And you know, I was totally gonna stop now with these new results

$ ./vkoverhead -test 0 -duration 3 -output-only
59973

but haha you know I’m being told that I’d better keep going or else hah so we’re just gonna uh gonna keep getting in there and seeing what we can find, if that’s all right with everyone? Yeah? Cool? Great, so, uh, yeah, let’s just check out another flamegraph real quick

perf2.png

And that’s looking just great. Yeah, it’s really great, isn’t it? It’s not? Okay, well, we’ve got some options. Uh, yeah, lots of options. Not like I need help figuring out what they are or anything, since this is what I’m good at. But there’s this big old gfx11_cmd_buffer_flush_state call sticking out like a sore thumb—and I’ve obviously still got two of those like everyone else—so let’s see what we can do about that.

Yup, well, it’s looking like inlining those functions is…

What?

3.png

You’re telling me I can’t just inline everything?

4.png

But it’s improving performance!

$ ./vkoverhead -test 0 -duration 3 -output-only
71192

And that’s great, isn’t it? An extra 60%+ draw throughput on ANV from a couple small patches? Yeah? Oh, okay, so we’re done? Yeah, no, that’s great, that’s really great. What a relief. I’m so glad we could come to an understanding. No, no, it was no trouble. No, I don’t think we’ll need to do this again, either. Learned a lot today. Just learning all around.

Results

Well, we made it through that—not like there was anything special or unusual happening today, just an ordinary turn of phrase—and everything’s fine.

5.png

But what does this mean for real-world performance results, you’re asking?

6.png

I’m just micro-optimizing for the benchmarks that I write, not saying this will double your Cyberpunk 2077 FPS.

I’m not saying it.

September 14, 2022

Hi all!

This month I’ve been working on stuff I’d usually not work on willingly. And by that I mean Rust and screen tearing of course.

I’ve been randomly typing keys on my keyboard and before I knew it, a wlroots-rs repository was created. Everybody is saying how difficult (or even impossible) it is to write Rust bindings for wlroots so I wanted to see for myself and give it a try. One thing is clear: these people weren’t wrong. The first step was to wire up bindgen to automatically generate Rust declarations from the wlroots headers, and that was easy enough. Then I needed to figure out how to use libwayland’s intrusive linked lists (wl_list, wl_signal and wl_listener) from Rust. I took a while to build a basic example where a fixed wl_signal is listened to. Then it took more time to figure out a (hacky) way to abstract that into a re-usable helper. And now I’m stuck at trying to figure out a reasonable Rust API.
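
For readers who haven't seen the C side being wrapped, this is roughly what libwayland's intrusive listener pattern looks like. struct wl_listener, wl_signal_add(), wl_signal_emit() and wl_container_of() are the real wayland-server API; the compositor struct and the local signal below are made up for illustration. The listener lives inside the compositor's own struct while the signal's intrusive list also points at it, which is exactly the shared-aliasing pattern Rust's ownership rules push back on:

/* build with: cc listener.c $(pkg-config --cflags --libs wayland-server) */
#include <stdio.h>
#include <wayland-server-core.h>

struct my_compositor {
   /* embedded in (and owned by) the compositor's state, yet also linked
    * into the signal's intrusive list */
   struct wl_listener new_output;
};

static void handle_new_output(struct wl_listener *listener, void *data)
{
   /* recover the owning struct from the embedded member */
   struct my_compositor *comp = wl_container_of(listener, comp, new_output);
   (void)comp;
   printf("new output signal fired, data=%p\n", data);
}

int main(void)
{
   struct wl_signal new_output_signal; /* stands in for wlr_backend.events.new_output */
   wl_signal_init(&new_output_signal);

   struct my_compositor comp;
   comp.new_output.notify = handle_new_output;
   wl_signal_add(&new_output_signal, &comp.new_output);

   wl_signal_emit(&new_output_signal, NULL);
   return 0;
}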

The main issue is that Rust lifetime concepts don’t map well to wlroots/Wayland. I’ve taken some inspiration from Smithay and introduced a BackendHandler trait which can be implemented by a compositor, and which has its methods called when a wlroots signal is emitted. This works nicely for simple cases, but sometimes signals are used to introduce new objects to the compositor (e.g. wlr_backend.events.new_output). Sometimes signals reference an existing object. If the compositor owns all wlroots objects, then wlroots can’t fire a signal referencing these objects. Also, the compositor would like to listen to signals on objects created by wlroots, e.g. wlr_output.events.destroy. My next try will maybe introduce some kind of wlroots object handle (and there can be multiple handles referencing the same wlroots object), but not sure how it’ll turn out. If you have any good ideas, please share! My latest work is sitting in the handler-v2 branch.

In other graphics news, I’ve been working on adding screen tearing support to the kernel and Wayland. In the past I’ve been pretty averse to screen tearing, because I personally find it very undesirable. But I’ve come to understand that it can be viewed as a user preference, and some people want to turn it on when playing a game. After seeing some people ask about it in #wayland, and seeing that Valve wants it too, I’ve decided to bite the bullet and implement the missing pieces. To be clear, tearing will be opt-in, only used when explicitly enabled by the user.

There are two missing pieces. The first one is in the kernel uAPI: user-space has no way to perform tearing page-flips with the atomic uAPI. I’ve sent patches to address this. The second missing piece is in the Wayland protocol: Vulkan apps have a way to request tearing via VK_PRESENT_MODE_IMMEDIATE_KHR, but Mesa doesn’t have a way to communicate this preference to the Wayland compositor. Xaver Hugl has been working on a protocol extension for this, and I’ve been working on a Mesa implementation.
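
For the kernel piece, here is a rough C sketch of what a tearing flip could look like from user-space, assuming the proposed uAPI ends up simply accepting the existing DRM_MODE_PAGE_FLIP_ASYNC flag on atomic commits — that's my reading of the patches, not a finalized interface. The plane, property and framebuffer IDs are placeholders the caller would have looked up beforehand:

/* compile with: cc -c flip.c $(pkg-config --cflags libdrm) */
#include <stdint.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

int async_flip(int drm_fd, uint32_t plane_id, uint32_t fb_prop_id, uint32_t fb_id)
{
   drmModeAtomicReq *req = drmModeAtomicAlloc();
   if (!req)
      return -1;

   /* only the plane's FB_ID changes: the "just flip, don't wait for vblank"
    * case that tearing needs */
   drmModeAtomicAddProperty(req, plane_id, fb_prop_id, fb_id);

   int ret = drmModeAtomicCommit(drm_fd, req,
                                 DRM_MODE_ATOMIC_NONBLOCK |
                                 DRM_MODE_PAGE_FLIP_EVENT |
                                 DRM_MODE_PAGE_FLIP_ASYNC, /* the new bit */
                                 NULL);
   drmModeAtomicFree(req);
   return ret;
}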

Work on libdisplay-info is continuing slowly. The EDID format is really trying very hard to be as annoying to parse as possible, but it makes for a good project to work on when I don’t want to think too much. I’ve started adding DisplayID support, which is the more modern standard for display device metadata. I’ve experimented a bit with Emscripten and wrote a web version of the tool.

As a freedesktop.org sysadmin, I’ve migrated our secondary nameservers. We were using SPI’s nameservers but these are getting decommissioned, so I’ve switched to Gandi’s. I’ve also investigated issues with our SMTP server: spammers were issuing a lot of subscription requests, so our outgoing queue was full and mail providers were starting to reject our emails. I had to ban the spammers and purge the bad emails from our queue.

I’ve made some good progress on IRC software. If you haven’t read it already, I’ve written a blog post about ongoing work on OAuth integration with IRC. Goguma finally gained support for the /me command and better handles sending long messages. I’ve released a new bugfix version of soju, with various stability improvements around stuck WHO responses and Web Push. I’ve also fixed a bug in OFTC’s server which was causing some issues with soju and Goguma.

That’s it for now, see you next month!

Homemade Spaghetti

As I’ve mentioned in a couple posts, I’m somewhat of a spaghetti expert. In my opinion, the longer and more plentiful the spaghetti is, the better.

So let’s take a look at one of my favorite spaghetti recipes.

Step 1: Read The Label

Today’s spaghetti comes from my new favorite brand of spaghetti feed, vkoverhead. It’s a simple brand, but it really gets the job done when it comes to growing great spaghetti. This particular spaghetti feed is vkoverhead -test 0, which is the most simple type. It grows the kind of spaghetti that everyone notices because it’s a staple of all graphics diets.

If I check out the state of this spaghetti feed with RADV now, I see the following:

$ ./vkoverhead -test 0 -output-only -duration 3
28345

Thus, I can see that I’m getting 28.3 million draws/second. Not too bad. Let’s check AMDPRO to get a little competition going.

$ VK_ICD_FILENAMES=/home/zmike/amd_pro.json ./vkoverhead -test 0 -output-only -duration 3
32889

what

mesa.png

What.

rotated_mesa.png

THEY DARE?

angry_mesa.png

It’s Totally Cool

…that AMDPRO is 15% faster than RADV. Yup, it’s totally fine. No anger problems here, no sir, not with me, not even a little furious.

Cool as a cucumber.

But if—and this is obviously just a hypothetical—If I were enraged and just recovering from a lengthy tantrum after seeing these results, I’d be looking at growing some artisanal spaghetti. To do that, I’d be running perf on the vkoverhead case and then checking out a flamegraph, which might even happen to look something like this

base.png

and you know it’s weird that the graph would look like that since in a graph like that the actual emission of draw packets is only 18% of the CPU time, which means it’s just throwing away CPU cycles, and no wonder the performance is worse, and I hate Wednesdays.

But again, don’t ask if I’m okay, I’m completely fine, this isn’t bothering me.

But if—and this is obviously just another hypothetical—If I’d just come back from a counseling session that was supposed to help me cope with these inferior performance results and wasn’t feeling any better at all, then I’d definitely be craving some spaghetti. And so I’d be looking at radv_emit_all_graphics_states() and radv_upload_graphics_shader_descriptors() to see what the actual farfalle was going on with these fat pieces of stortini.

And in the first of those functions, I’d see there were all kinds of null checks and branch chain disasters that were annihilating performance, so I’d probably rip and tear those right out, and then, just while I happened to be in the area, I’d simplify some cache-killing indirect access, and, well, it’s not like I’d leave without clearing up those branches, right? Hah, of course not, though this is all just hypothetical anyway.
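
(For anyone wondering what ripping out branch chains even means, here's a purely hypothetical, self-contained C sketch of the general shape of that kind of cleanup — made-up state names, not the actual RADV diff. The common "nothing changed between draws" path becomes a single dirty-mask test instead of a pile of pointer checks:)

#include <stdint.h>
#include <stdio.h>

enum {
   DIRTY_VIEWPORT = 1u << 0,
   DIRTY_SCISSOR  = 1u << 1,
   DIRTY_BLEND    = 1u << 2,
};

struct cmd_state {
   uint32_t dirty; /* bits set whenever the app changes the matching state */
};

static void emit_viewport(void) { printf("emit viewport\n"); }
static void emit_scissor(void)  { printf("emit scissor\n"); }
static void emit_blend(void)    { printf("emit blend\n"); }

static void emit_all_states(struct cmd_state *s)
{
   /* one test-and-return instead of a chain of null/pointer checks */
   if (!s->dirty)
      return;

   if (s->dirty & DIRTY_VIEWPORT)
      emit_viewport();
   if (s->dirty & DIRTY_SCISSOR)
      emit_scissor();
   if (s->dirty & DIRTY_BLEND)
      emit_blend();

   s->dirty = 0;
}

int main(void)
{
   struct cmd_state state = { .dirty = DIRTY_VIEWPORT | DIRTY_BLEND };
   emit_all_states(&state); /* emits viewport + blend */
   emit_all_states(&state); /* nothing dirty: early-out */
   return 0;
}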

I’m Not Being Defensive

Stop asking. I’m fine.

If I wasn’t fine, I’d probably be running vkoverhead again at this point and seeing the following results

$ ./vkoverhead -test 0 -output-only -duration 3
36006

and then I’d be fine anyway since now RADV is up by 10%. Which is okay. It’s not bad. Nothing to brag about, you know, just being up by such a tiny little amount over the competition, but it’ll do.

1.png

Is what a responsible person would say.

But here at SGC, responsibility flies out the window when performance is involved, and I don’t have enough spaghetti yet, so buckle in because this pasta machine is just getting started.

It’s perf time again, and I’ve got another totally hypothetical flamegraph

states.png

which is less consumed by the stupidity of those fat pieces of stortini I insalted above, but I’m not in the mood for stortini at all today. They gotta go. radv_upload_graphics_shader_descriptors() I got my eye on you and your little radv_flush_constants() too. Why is radv_flush_constants() even showing up here? What’s the deal with that? There’s no constants to flush. I’m taking ‘em out.

Get the rolling pin, flatten out the dough, and what happens?

$ ./vkoverhead -test 0 -output-only -duration 3
38629

2.png

Now We’re Cooking

…with perf, and I’m getting out another flamegraph, and it’s better

flush.png

because of course it is. That draw packet emission is getting more time, the fat stortini is slimming down, and everything is great.

But does anyone out there actually think I’m about to stop now? When I’m only up by a tenuous 36% from where I started, and my lead over AMDPRO is a barely-noticeable 17%?

Take off your jacket, because I’m turning the heat of the burners up to high.

Look at this eyesore

bad.png

I’m about to end this function’s whole career. By inlining it.
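
(Concretely, "inlining it" just means forcing a small, called-per-draw helper to be folded into its caller, the same idea as Mesa's ALWAYS_INLINE macro. The function below is a trivial stand-in, not the real RADV code:)

#include <stdint.h>
#include <stdio.h>

#define ALWAYS_INLINE inline __attribute__((always_inline))

/* Inlining a hot per-draw helper removes call overhead and lets the compiler
 * fold redundant work across what used to be a call boundary. */
static ALWAYS_INLINE void
emit_one_reg(uint32_t *cs, unsigned offset, uint32_t value)
{
   cs[offset] = value;
}

int main(void)
{
   uint32_t cs[4] = {0};
   emit_one_reg(cs, 0, 0xdeadbeef);
   printf("0x%08x\n", cs[0]);
   return 0;
}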

$ ./vkoverhead -test 0 -output-only -duration 3
41878

3.png

Finishing Touch

When serving any sort of dish, it’s important to add a garnish. And you know what isn’t a fucking garnish?

streamout.png

This thing in my debugoptimized build.

So now it’s gone and what is the performance at now?

$ ./vkoverhead -test 0 -output-only -duration 3
44073

4.png

Incredible. The flavor (of winning), the atmosphere (of being a winner), the experience (of being #1), are all unparalleled.

This makes for a 55% increase in RADV’s draw throughput as well as a much more reasonable 30% lead over AMDPRO.

All from growing just the right amount of spaghetti.

September 12, 2022

In the past few days I’ve been working on better integrating IRC with OAuth 2.0. In a nutshell, my goal is to make IRC clients obtain a token by interacting with an OAuth 2.0 server, and then use that token to authenticate with the IRC server. This effort has resulted in various patches for meta.sr.ht’s OAuth 2.0 server, for the soju IRC bouncer and for the gamja & goguma IRC clients.

Motivation

My motivation is to improve chat.sr.ht’s authentication. chat.sr.ht is a hosted soju IRC bouncer for SourceHut users. The soju instance delegates authentication to meta.sr.ht so that users can login with their SourceHut credentials.

The status quo is not ideal:

  • When connecting from an IRC client, users need to jump through hoops. They need to manually generate a personal access token and copy/paste it into the password field. This is especially annoying on mobile. Wouldn’t it be nice to just click a button, fill the SourceHut login form that gets presented to you, and poof everything Just Works™?
  • We maintain a soju fork (in the srht branch) with special patches to integrate with SourceHut. It’s not the end of the world, but rebasing is never fun and error-prone, and it would be much nicer to be able to use vanilla soju.
  • The access tokens expire after 1 year. When that happens, users are greeted with an error message and need to manually generate a new personal access token again. It would be nicer to automatically refresh the token if possible, and show up a login form if not. Similarly, it would be nice to revoke access tokens when a user logs out explicitly, instead of leaving behind unused tokens.

I’d like the solution to these problems to only use standardized APIs. That way any client or server with similar needs can implement the same standards and inter-operate with each other. For instance, one could add support for the standards in a GTK IRC client and have it work with chat.sr.ht. One could setup soju to use a GitLab instance as an OAuth server for authentication.

Before anybody complains about OAuth ruining IRC: this effort is just adding new optional things clients can add support for if they want to. I expect OAuth to be out-of-scope for many IRC clients and that’s perfectly fine. The current approach of passing a personal access token as the password will keep working.

Here is a video of the end result: the user loads gamja, is asked for confirmation by SourceHut, then is redirected straight back to a ready-to-use gamja. The experience on Goguma is similar.

High-level overview

┌────────────────┐           ┌────────────────┐           ┌────────────────┐
│                │           │                │           │                │
│   IRC client   │           │   IRC server   │           │  OAuth server  │
│                │           │                │           │                │
│ (gamja/goguma) │           │     (soju)     │           │  (meta.sr.ht)  │
│                │           │                │           │                │
└───────┬────────┘           └───────┬────────┘           └────────┬───────┘
        │                            │                             │
        │                                                          │
        │              1. Fetch OAuth server metadata              │
        ├────────────────────────────────────────────────────────► │
        │ ◄────────────────────────────────────────────────────────┤
        │                                                          │
        │                                                          │
        │              2. Redirect user to login page              │
        ├────────────────────────────────────────────────────────► │
        │ ◄────────────────────────────────────────────────────────┤
        │                     3. Get back a code                   │
        │                                                          │
        │                                                          │
        │                                                          │
        │                4. Exchange code for a token              │
        ├────────────────────────────────────────────────────────► │
        │ ◄────────────────────────────────────────────────────────┤
        │                                                          │
        │                                                          │
        │       5. Authenticate      │                             │
        │          with token        │                             │
        ├──────────────────────────► │                             │
        │                            │        6. Check token       │
        │                            ├───────────────────────────► │
        │                            │ ◄───────────────────────────┤
        │                            │      7. Get back username   │
        │                            │                             │
        │ ◄──────────────────────────┤                             │
        │                            │                             │
        │                            │                             │

Here’s a high-level overview of the interactions between the IRC client, the IRC server and the OAuth server. The IRC client and servers interact via the IRC protocol as usual, and they both interact with the OAuth server via HTTP.

  1. The user asks the IRC client to connect to “chat.sr.ht”. The client auto-discovers the OAuth server metadata to find out what the OAuth endpoints are and what features the server supports.
  2. The client redirects the user to the OAuth server login page.
  3. The user authenticates on the OAuth server login page (providing the username/password and possibly a one-time code). The OAuth server redirects back the user to the IRC client, with a code in the URL query parameters.
  4. The client grabs the code from the URL query parameters, and exchanges it for a token.
  5. The client connects to the IRC server and authenticates with the token.
  6. soju checks that the token is valid by querying the OAuth server.
  7. The OAuth server sends back the username associated with the token. soju uses that information to figure out which soju account should get selected.

Step 1 is optional: the alternative is to just hardcode all of the metadata inside the client. OAuth servers require client developers to register a client ID and client secret anyway, so some metadata has to be hardcoded regardless (for now — more on that later).

Implementation

All of the above uses standards described in RFCs. This means there are already OAuth servers in the wild which support everything needed!

Step 1 uses OAuth 2.0 Authorization Server Metadata (RFC 8414). An HTTP GET request returns all of the data the client needs:

> curl https://chat.sr.ht/.well-known/oauth-authorization-server
{
	"issuer": "https://meta.sr.ht",
	"authorization_endpoint": "https://meta.sr.ht/oauth2/authorize",
	"token_endpoint": "https://meta.sr.ht/oauth2/access-token",
	"response_types_supported": ["code"],
	"grant_types_supported": ["authorization_code"],
	"introspection_endpoint": "https://meta.sr.ht/oauth2/introspect",
	"introspection_endpoint_auth_methods_supported": ["none"]
}

Readers familiar with OAuth will recognize that steps 2-4 are the usual OAuth dance defined in RFC 6749. Nothing fancy here. The client redirects the user to https://meta.sr.ht/oauth2/authorize?client_id=XXX. The OAuth server redirects back the user to the client with a ?code=YYY query parameter. The client then exchanges the code for a token via an HTTP request:

> curl \
    --data-urlencode grant_type=authorization_code \
    --data-urlencode code=YYY \
    --data-urlencode client_id=XXX \
    https://chat.sr.ht/oauth2/access-token
{
	"access_token": "asdf"
}

Something worth noting is that for Goguma, the redirect URI is a bit special. Since Goguma is a mobile app, the redirection at step 3 needs to navigate from the user’s web browser to Goguma. Following the recommendations in RFC 8252, Goguma leverages a private-use URI scheme: it sets the redirect URI to fr.emersion.goguma:/oauth2 (yes, it is a valid URI!). The web browser will open Goguma when loading that URI.

Step 5 uses the IRCv3 SASL extension in combination with the SASL OAUTHBEARER mechanism (RFC 7628). SASL PLAIN could’ve been used instead, but:

  • SASL OAUTHBEARER allows the client to ensure that the IRC server supports OAuth tokens for authentication, instead of hoping for the best and showing a vague “authentication failed” error message to the user if this assumption turns out to be wrong.
  • The SASL PLAIN RFC requires clients to specify a username, however at that point the client doesn’t know the username, it only has a token. SASL OAUTHBEARER allows the client to omit the username.

Step 6 uses OAuth 2.0 Token Introspection (RFC 7662). soju sends the token to the OAuth server, and the server replies back with some useful information:

> curl --data-urlencode token=asdf https://meta.sr.ht/oauth2/introspect
{
	"active": true,
	"username": "emersion"
}

And that’s enough for soju to associate the connection with the soju account “emersion” and send back a success response to the client!

I have patches floating around for all of the projects previously mentioned:

Future plans

Once all of the above is properly plumbed, this should already be a nice step forward! But there is also room for future improvements.

My patches currently don’t handle token expiration well. Clients should at least ask the user to re-authenticate when a token expires. It would be nice to handle the refresh token and automatically obtain a new access token (this would need refresh token support to be added to meta.sr.ht).

Second, it’s a bit annoying for client developers to register their app on various OAuth servers. To remove that step, clients would need to dynamically obtain a client ID and secret from the OAuth server. The OAuth 2.0 Dynamic Client Registration Protocol (RFC 7591) can be used for this purpose. I am not sure how widely that RFC is implemented though.

Last, it would be best to avoid leaving behind some unused access tokens after the user logs out. OAuth 2.0 Token Revocation (RFC 7009) could be used by clients to clear these tokens.

September 11, 2022

So here we have it, the end of my Google Summer of Code journey. A few more than a hundred days have passed, and I can already tell that the seeds have been sown for me to keep collaborating with open source software from here on out.

My project’s primary goal was to create unit tests using KUnit for the AMDGPU driver, focusing on code used by GPUs from the same generation as the GPU “RX 580” (DCE 11.2). We predicted that KUnit would have some limitations with regard to testing GPU drivers, so we expected to see some collaboration in that sense. Finally, we knew that I would be working in parallel with people writing tests for newer generations of GPUs (DCN). I planned to keep track of my weekly progress on my blog, trying to create introductory material that could help future newcomers.

For starters, this project was completely different from what I had in mind, given that it was far from an individual experience with the Linux Kernel community; it was actually a team effort to introduce unit testing to the AMD display driver in a way that would encourage the community to spread KUnit into other GPU drivers.

Another surprise was that the unit tests didn’t really interact with the GPU, so all our worries about needing to mock devices, and our intentions of writing about it, were put aside.

In retrospect, there’s still a lot to learn about the AMD codebase and the graphics stack in general, but there are also many things that I’ve learned and haven’t managed to share yet, about IGT, KUnit, and the email-patch workflow, which I’ll be writing about in my personal blog in addition to the pair of posts there:

I’ve also dedicated some time reviewing the posts from my colleagues as well, like:

Introducing Kunit test into AMDGPU

Our patch series is still under review:

We refactored this series many times before sending it, worrying about how the code should be organized and how the tests should be compiled. My biggest lesson was learning to let go: do your best to send great patches and reviews to the community, but don’t let the fear of being slightly wrong hold you back from pressing the submit button.

Done is better than perfect!

I didn’t manage to write as many tests as I had planned for the DCE 11.2 functions, but for what it’s worth, I’m making sure that the KUnit documentation is up to date, so that anyone can write their own tests.

Patch Status
drm/amd/display: Introduce KUnit tests for fixed31_32 library Under review

Contributing to FLOSS projects

In my way to introducing unit tests to the AMD display driver I’ve managed to leave some impressions behind, writing small patches and reviewing code, listed in detail on the following sections.

KWorkflow

Kworkflow is a set of bash scripts that helped me compile and deploy the kernel on my testing system, especially valuable for bisecting code. At the beginning of my journey, shortly before the community bonding period, I sent a couple of patches to Kworkflow and reviewed a pull request.

Patch Status
#592 Add support for GPUs identified as “Display controller” in kw device Accepted
#607 Enhance docs for kw-pomodoro and kw-report Accepted

IGT

IGT GPU Tools is a collection of tools for development and testing of the DRM drivers. One of my first tasks was to run the AMDGPU test suite using the GPU “RX580”. After checking with the community that my testing setup was correct, I reported a bug to the AMD issue tracker, with proper bisection, and followed it through the patching process:

Patch Status
lib/igt_kmod: fix trivial typos Under review
CONTRIBUTING: Add reference for GTKDoc Under review
lib/kselftests: Skip kselftest when opening kmsg fails Under review
lib/igt_kmod: add igt_kselftests documentation Under review

Linux Kernel - KUnit

KUnit is the Kernel Unit Testing Framework. It not only brings a way to facilitate writing tests for the kernel, but also a way to run them using User Mode Linux or QEMU. For most of the project we were discussing how to organize tests in the AMD folders, and there were a lot of lessons from that which I intend to share. I have a lot of documentation patches on their way, some of which may be worth squashing, but nonetheless I intend to help make the KUnit documentation as clear as possible!
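
For readers who haven’t seen KUnit before, a minimal suite looks something like this generic sketch (modeled on the upstream getting-started docs, not one of the AMD display tests from the series above); after the usual Kconfig and Makefile hookup it can be run under UML with kunit.py:

#include <kunit/test.h>

static void example_add_test(struct kunit *test)
{
        /* KUNIT_EXPECT_* records a failure but lets the test case continue */
        KUNIT_EXPECT_EQ(test, 2 + 2, 4);
}

static struct kunit_case example_test_cases[] = {
        KUNIT_CASE(example_add_test),
        {}
};

static struct kunit_suite example_test_suite = {
        .name = "example",
        .test_cases = example_test_cases,
};
kunit_test_suite(example_test_suite);

MODULE_LICENSE("GPL");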

Patch Status
Documentation: kunit: fix trivial typo Accepted
Documentation: Kunit: Fix inconsistent titles Accepted
Documentation: KUnit: Fix non-uml anchor Accepted
Documentation: Kunit: Add ref for other kinds of tests Accepted
Documentation: KUnit: remove duplicated docs for kunit_tool Accepted
Documentation: KUnit: avoid repeating “kunit.py run” in start.rst Accepted
Documentation: KUnit: add note about mrproper in start.rst Accepted
Documentation: KUnit: Reword start guide for selecting tests Accepted
Documentation: KUnit: add intro to the getting-started page Accepted
Documentation: KUnit: update links in the index page Accepted
lib: overflow: update reference to kunit-tool Accepted
lib: stackinit: update reference to kunit-tool Accepted
kunit: tool: fix –qemu_config help text Under review

Linux Kernel - DRM

The Direct Rendering Manager (DRM) is a subsystem of the Linux kernel responsible for interfacing with GPUs. There’s still a lot to learn and contribute to in this subsystem, and I hope to discuss introducing unit tests to other folders during the Linux Plumbers Conference (more about that in a following section). I only have two patches to VKMS to show, sent shortly before the community bonding period.

Patch Status
drm/vkms: check plane_composer->map[0] before using it Accepted
drm/vkms: return early if compose_plane fails Discarded

Linux Kernel - AMDGPU

Inside the DRM subsystem resides the AMD folder, where you can find drivers for many generations of AMD GPUs. Besides our main patch series, I’ve sent some minor patches and reviewed others.

Patch Status
drm/amd/display: make hubp1_wait_pipe_read_start() static Accepted
Update AMDGPU glossary and MAINTAINERS Accepted
drm/amd/display: fix overflow on MIN_I64 definition Accepted
drm/amd/display: fix minor codestyle problems Accepted
drm/amd/display: remove unneeded defines from bios parser Accepted

Acknowledgment

First, I would like to thank the X.org Foundation for accepting my GSoC proposal, for what I’ll be eternally grateful.

Next, thanks to AMD for donating the RX580 GPU which powered my testing system, and alongside Igalia allowed my mentors to take part in the GSoC program.

Moreover I would like to thank The Linux Foundation, which will enable my trip to Dublin, to attend the Linux Plumbers Conference.

There were plenty of great interactions with the DRM community, AMD, and KUnit engineers, for which I’m very thankful, especially to David Gow, who promptly reviewed most of my team’s patches.

And finally, thanks to my mentors and the other mentees with whom I’ve worked closely this summer; none of this would have been possible without their support.

Mentors:

Mentees:

Next steps

Of course this is not farewell! It’s more like reaching the end of a game’s tutorial.

My first step, thanks to The Linux Foundation and my mentors, will be attending the Kernel Testing & Dependability Micro Conference at the Linux Plumbers Conference in person, where I hope to show what we’ve achieved and to discuss the best way to organize the tests, which is where I’ve invested most of my time, so please keep an eye out for my report about it on my personal blog in the following weeks.

In regards to coding, beyond following through with my patches under review, I intend to write the tests for the DCE functions I proposed initially, and maybe extend them to other generations, now that we’ve realized that a physical device is not needed in order to write unit tests. I might even look for other modules that would benefit from unit tests, like VKMS.

Finally, I’ll also try to get involved with the KernelCI project, where I could even employ my web development experience. In order to do that I’ll be attending virtually the monthly Automated Testing Call.


Thanks for reading.

September 10, 2022

Like all things, Google Summer of Code, too, comes to an end.

Now let’s go over what had to be done, what is done, and what’s left.

What had to be done

Well, this is rather easy for me to talk about: I’ll be at X.org’s Developers Conference soon, where I’ll go over the motivations behind the work I’ve done.

Not just me, though! Maíra and Magali (whose names might be familiar to you already) will be there as well; unfortunately, Tales didn’t manage to get a visa due to bureaucracy layers no one dares to understand.

Looking back, the project’s motivation boils down to Siqueira wanting to resolve an internal tension between AMD engineers and the weird code they have to manage, with unit tests acting as a safety valve. As I’ve talked about previously, GPU code can be quite intense, the DML module being a particularly fun example.

No tests, just procedurally generated stuff, and there is so much to be done, really. Initially we proposed lots of tests, some docs, and some refactoring; in my case specifically, I wanted to make drivers/gpu/drm/amd/display/dc/dml/dcn20/display_rq_dlg_calc_20.c (breathes heavily) look pretty.

So…

What was actually done

On the Kernel

Though we all had AMD code enhancement as a primary motivator, I actually did some unit tests, and at the end I was caught up in trying to understand what we were testing.

Wasn’t my smartest move (what external contributor actually knows what an RQ DLG is? Difficult answers to find, really) but I learnt so much, apart from the fact it was really satisfying to document something so convoluted and full of internal “slang” (I mean product (code)names or acronyms). But I soon found out I’d be a terrible detective, as most of the things we had to test were simply unattainable for someone who didn’t have internal docs or lots of experience with real GPUs: most things in the DML submodule refer back to themselves, to GPU internals, or even to AMD “internals”.

In the end there were some interesting results from the tests I was actually able to conclude, as follows:

  • drm/amd/display: Introduce Kunit tests to display_rq_dlg_calc_20

    This was a single patch I’ve done on the unit tests. Of course there are lots more functions to be tested, but that comes in the next section :)

    We collectively discussed some alternatives to deal with the fact the functions being tested were all static, as will be discussed by Tales in his presentation at the Linux Plumbers Conference.

    LPC: How to introduce KUnit to physical device drivers?

    There was also some documentation produced regarding the functions and structs involved in these tests, but they were not sent yet as they’re smaller changes and could be grouped in a patchset addressing this specifically.

On other projects

KWorkflow

KW (as its friends call it) is a much-needed and very interesting project, which I tried my best to contribute to: I spent about a month and a half at the beginning of GSoC pushing it, to the point where I simply had no will left to make my commit messages pretty or to respond to maintainers.

Luckily for me, the owner of the project is also my GSoC mentor, and he completely understood where I was at and that I wouldn’t be able to accomplish the (optional) goals I had set for KW in my proposal.

I really think this situation helped me better understand what it is that we’re doing when we contribute to free software, and that was the lesson I took from it.

Anyhow, there were many contributions during this period, even though I didn’t finish many of them:

IGT GPU Tools

Some things took a lot of our attention apart from the basic motivations of the project. As I was already working on IGT GPU Tools, I figured it’d be interesting to finish related work in that project. What good is writing tests people aren’t going to use (or maintain)?

IGT is widely used by the kernel graphics people. It mainly tests GPU stuff using userspace APIs, but also does some other interesting things, and we figured it’d be very cool to be able to run unit tests there: easy to integrate with the pre-existing CI, not too much headache to maintain as the KTAP specs get more stable, as well as not requiring so much effort from engineers that are already so used to it.

  • patchset: Add support for KUnit tests

    Whilst it was not so difficult from a purely technical point of view, this might be what helped me the most towards acquiring some intimacy with the C language. I felt it deeply when I came back to the same code after a month or two and saw so much to improve in it, then stepped away again as I waited for v1 to be reviewed (which didn’t happen), and came back a while ago to write the v2 linked above.

    There were several challenges involved, specially for me: I started coding C because of the kernel, all I knew were C-like languages, and they were all masking some deep truth about computers to me: pretty error handling, easy to use strings, etc.

    I admit I wasn’t as ready for C as I thought; now that I understand it better, I can see that clearly. In the end this “quasi-traumatic” experience (and I mean the IGT patchset specifically) taught me a lot, and I’m very thankful for that opportunity (thanks, André!).

    Apart from the language itself, we also had to decide what would be supported in a KTAP parser, and that wasn’t easy.

Blog posts

Well, for me personally this was the best part of the project: I really enjoy writing these things, and I also enjoy receiving feedback – especially if it’s unasked for :)

If you know my blog you know I like to go the difficult route: to talk about the objective engineering experience might teach you a lot, but where’s the fun in that? Even now I’m trying to bring some subjective matters to this.

Anyhow, I published two blog posts which you can check out here:

And I actually made lots of contributions to the Flusp site as well, whose repository was used to review these posts, which I hope will be hosted there as long as they’re hosted on my own website.

What still needs to be done

As I pointed out previously, GPU code can be very interesting! So much so, that I purposefully didn’t attempt to rush what was left of my proposal at the end of GSoC. It might have been really satisfactory to end GSoC with everything accomplished, but if I learnt anything from undergrad it is that you should NOT rush what’s important to you.

From what I’ve talked with my peers, there are two ways of seeing this project:

  1. GSoC is like a deal I made with Google and X.org to accomplish something until a due date and get money for it.

    This PoV is okay, but does it teach you anything new? I might as well have done some freelancing in that time period, would have got the money, and then I’d be very comfortable.

    But I believe there’s so much more to this experience, and that’s something Siqueira told us time and again, which brings me to a second, more wholesome way of seeing things:

  2. GSoC is a sort of first commitment to a community.

    I know this doesn’t sound so clear as the first way of looking at things, but follow me on this:

    For the community, timing might be important, but it’s definitely not as important as doing solid, good work, and keep pushing it even if it’s not as quickly as you’d like.

    I got really burnt out from the first two years at Uni, started working, went to live by myself, all that young adult jazz. Trying to always keep up with everything was a real challenge, and though I didn’t always give my best, well I really tried.

    At first, I was really sad, almost spiraling out of everything software-related, but now I see things more clearly, and I’m trying to find a rhythm that I can work with, in which I can deliver what I want, and, most importantly, what I committed to.

    Going back to Siqueira, what he told us is that (translating loosely):

    You got to make a dent in the community so that someone notices you.

    And now I see it more clearly.

    I’ve heard so many stories of people who engage on things like GSoC or Outreachy just to put it on their CV, then quit, but this is so much more important to me.

    I’ve already started making a career in free software, but the project I work on at the moment doesn’t allow me to interact so directly with an upstream or a community. I definitely want to improve at doing what I enjoy and believe in, and if that takes some learning, that’s only part of the journey to becoming a reliable member of some free software community.

So, in concrete terms, what is there to finish?

  1. First and foremost, the ongoing patchset for IGT needs lots of attention, as pointed out by Michał Winiarski in this thread, for example:

    Kernel lore archives: [PATCH v5 9/9] drm: selftest: convert drm_mm selftest to KUnit

  2. Secondly, of course, finish the original proposal of testing the dcn20/display_rq_dlg_calc_20.c file.

Might not sound like a lot, but those are really important things, and I’m sure they’ll keep me busy for some time.

I was also pinged by some people to review their patches, and I want to get back to them soon.

I’ve really been thinking a lot about giving back to the people that helped me get here, they were all awesome and I hope I can sow these same seeds and help fellow students become contributors as well. Just last week me and Maíra decided to try to get some people for a project on (fixing) coverage reports for the tests we wrote, but I might as well find another project for myself – probably related to XR, if you’re wondering, but I should write about that in the future :)

That’s it, dear reader, thank you a lot for reading this through, see you on the next one!

September 08, 2022

Need Another Hit

Ever stop in the middle of writing some code when it hits you? That nagging feeling. It’s in the back of your mind, relentlessly gnawing away. Making you question your decisions. Making you wonder what you’re doing with your life. You’re plugging away happily, when suddenly you remember I haven’t optimized any code in hours.

It’s no way to work.

It’s no way to live.

Let’s get back to our roots, blog.

New Edition

For this edition of the blog, we’re hopping off the usual tracks and onto a different set entirely: Vulkan driver optimization. I can already hear what you’re thinking.

Vulkan drivers are already fast. Go back to doing something useful, like making GL work.

First: no they’re not.

Second: I’m doing the opposite of that.

Third: shut up, it’s my blog.

How does one optimize Vulkan drivers? As we all know, any great optimization needs a test case. In the case of Vulkan, everyone who’s anyone (me) uses Zink as their test case. The reasons for this are many and varied because I say they are, but the important one to keep in mind is, as always, drawoverhead.

For those who can’t remember the times I have previously blogged about the world’s premiere benchmarking tool, don’t worry. As with any great form of entertainment, I’ve prepared a recap.

TL;DR drawoverhead

Suppose you are a large gaming-oriented company that sells hardware. Your hardware runs on a battery. The battery has a finite charge. Every bit of power drained from the battery powers the device so that users can play games.

Wouldn’t it be great if that battery lasted longer?

There are a number of ways to achieve this goal. Because this is my blog, I’m going to talk about the one that doesn’t involve underclocking or reducing graphical settings.

Obviously I’m talking about optimization, the process by which a smart and talented reader of StackOverflow copies and pastes code in exactly the right sequence such that, upon resolving all compilation errors, execution of a program utilizes fewer resources.

Now because everyone reading this is a GPU driver developer, we all know that optimization comes in two forms, optimizing for CPU and optimizing for GPU. But GPU optimization is easy. Anyone can do that. You just strap on your RadeonGPUProfiler or Nsight or whatever, glance at the output, mumble mumble mumble, and blammo, your GPU is running like a clock. A really fast one though. With like a trillion hands all spinning at once.

So we’re done with GPU optimization, but we’re not optimized enough yet. The battery still doesn’t last forever, and the users are still complaining on Reddit that they can’t even finish a casual playthrough of Elden Ring or a boss fight in Monster Hunter: Rise without needing to plug in.

This brings us to “CPU optimization”, the process by which we use more esoteric tools like perf or dtrace or custom instrumentation to generate possibly-useful traces of where the CPU may or may not be executing optimally because it’s a filthy liar that doesn’t even know what line of code it’s executing half the time. But still, we need test cases, and unlike GPU profiling, CPU profiling typically isn’t useful with only a single frame’s worth of sample data.

Thus, drawoverhead, which provides a view of how various GL operations impact CPU utilization by executing millions of draw calls per second to provide copious samples for profiling.

Why Not drawoverhead?

This is where the blog is going to take a turn for the bizarre. Some people, it seems, don’t want to use Zink for benchmarking and profiling.

I know.

I’m shocked, hurt, appalled, and also it’s me who doesn’t want to use Zink for benchmarking and profiling so it’s a very confusing time.

The problem with using Zink for optimizing CPU usage is that Zink keeps getting in the way. I want to profile only the Vulkan driver, but instead I’ve got all this Mesa oozing and spurting onto my CPU samples. It’s gross, it’s an untenable situation, and I’ve already taken steps to resolve it.

Behold the future: vkoverhead.

With one simple clone, build, and execute, it’s now possible to see how much the Vulkan driver you’re using sucks at any given task.

Want to see how costly it is to bind a different pipeline? Got it covered.

Changing vertex buffers? Blam, your performance is garbage.

Starting and stopping renderpasses? Take your entire computer and throw it in the dumpster because that’s where your performance just went.

vkoverhead: Mythbusting

The obvious problem with this is that somebody has to actually dig into the vkoverhead results for each driver and figure out what can be made better. I’ll write another post about this since it’s a separate topic.

Instead, what I want to do today is use vkoverhead to delve into one of the latest and greatest myths of modern Vulkan:

Is the use of fast-linked Graphics Pipeline Libraries worse than, equivalent to, or better than VK_EXT_vertex_input_dynamic_state?

I say this is one of the great myths because, having spoken to Very Knowledgeable Khronos Insiders as well as Experienced Application Developers, I’ve been told repeatedly that VK_EXT_vertex_input_dynamic_state is just a convenience feature that should not be used or relied upon, and proper use of GPL with fast-linking provides the same functionality and performance with broader adoption. But is this really true?

Well, now that the tools exist, it’s possible to say definitively that this sort of wishful thinking does not reflect reality. Let’s check out the numbers. As of the latest 1.1 vkoverhead release, the following cases are available:

$ ./vkoverhead -list
   0, draw
   1, draw_multi
   2, draw_vertex
   3, draw_multi_vertex
   4, draw_index_change
   5, draw_index_offset_change
   6, draw_rp_begin_end
   7, draw_rp_begin_end_dynrender
   8, draw_rp_begin_end_dontcare
   9, draw_rp_begin_end_dontcare_dynrender
  10, draw_multirt
  11, draw_multirt_dynrender
  12, draw_multirt_begin_end
  13, draw_multirt_begin_end_dynrender
  14, draw_multirt_begin_end_dontcare
  15, draw_multirt_begin_end_dontcare_dynrender
  16, draw_vbo_change
  17, draw_1vattrib_change
  18, draw_16vattrib
  19, draw_16vattrib_16vbo_change
  20, draw_16vattrib_change
  21, draw_16vattrib_change_dynamic
  22, draw_16vattrib_change_gpl
  23, draw_16vattrib_change_gpl_hashncache
  24, draw_1ubo_change
  25, draw_12ubo_change
  26, draw_1sampler_change
  27, draw_16sampler_change
  28, draw_1texelbuffer_change
  29, draw_16texelbuffer_change
  30, draw_1ssbo_change
  31, draw_8ssbo_change
  32, draw_1image_change
  33, draw_16image_change
  34, draw_1imagebuffer_change
  35, draw_16imagebuffer_change
  36, submit_noop
  37, submit_50noop
  38, submit_1cmdbuf
  39, submit_50cmdbuf
  40, submit_50cmdbuf_50submit
  41, descriptor_noop
  42, descriptor_1ubo
  43, descriptor_template_1ubo
  44, descriptor_12ubo
  45, descriptor_template_12ubo
  46, descriptor_1sampler
  47, descriptor_template_1sampler
  48, descriptor_16sampler
  49, descriptor_template_16sampler
  50, descriptor_1texelbuffer
  51, descriptor_template_1texelbuffer
  52, descriptor_16texelbuffer
  53, descriptor_template_16texelbuffer
  54, descriptor_1ssbo
  55, descriptor_template_1ssbo
  56, descriptor_8ssbo
  57, descriptor_template_8ssbo
  58, descriptor_1image
  59, descriptor_template_1image
  60, descriptor_16image
  61, descriptor_template_16image
  62, descriptor_1imagebuffer
  63, descriptor_template_1imagebuffer
  64, descriptor_16imagebuffer
  65, descriptor_template_16imagebuffer
  66, misc_resolve
  67, misc_resolve_mutable

The interesting cases for this scenario are:

  21, draw_16vattrib_change_dynamic
  22, draw_16vattrib_change_gpl
  23, draw_16vattrib_change_gpl_hashncache
  • Case 21 is changing 16 vertex attributes between draws using VK_EXT_vertex_input_dynamic_state (a rough sketch of this path follows the list)
  • Case 22 is using fast-linking GPL to compile and bind a new pipeline from precompiled partial pipelines between draws
  • Case 23 is using fully precompiled GPL pipelines with hash-n-cache to swap pipelines between draws
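
To make the comparison concrete, here’s roughly what the case 21 path boils down to (an illustrative sketch, not vkoverhead’s actual code). The extension entry point is assumed to have been fetched with vkGetDeviceProcAddr, and the command buffer is assumed to be in the recording state:

#include <vulkan/vulkan.h>

static void record_dynamic_vertex_input_draw(VkCommandBuffer cmdbuf,
                                             PFN_vkCmdSetVertexInputEXT set_vertex_input)
{
	VkVertexInputBindingDescription2EXT binding = {
		.sType = VK_STRUCTURE_TYPE_VERTEX_INPUT_BINDING_DESCRIPTION_2_EXT,
		.binding = 0,
		.stride = 16 * 16, /* 16 vec4 attributes, 16 bytes each */
		.inputRate = VK_VERTEX_INPUT_RATE_VERTEX,
		.divisor = 1,
	};
	VkVertexInputAttributeDescription2EXT attrs[16];

	for (uint32_t i = 0; i < 16; i++) {
		attrs[i] = (VkVertexInputAttributeDescription2EXT){
			.sType = VK_STRUCTURE_TYPE_VERTEX_INPUT_ATTRIBUTE_DESCRIPTION_2_EXT,
			.location = i,
			.binding = 0,
			.format = VK_FORMAT_R32G32B32A32_SFLOAT,
			.offset = i * 16,
		};
	}

	/* One command swaps the entire vertex input state; no pipeline object is
	 * created or bound, which is the whole point of the comparison. */
	set_vertex_input(cmdbuf, 1, &binding, 16, attrs);
	vkCmdDraw(cmdbuf, 3, 1, 0, 0);
}

Cases 22 and 23, by contrast, pay for a vkCreateGraphicsPipelines fast-link or a hash lookup plus vkCmdBindPipeline for every attribute change.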

Running all of these tests on NVIDIA’s driver (the only hardware driver to fully support both extensions) on an AMD Ryzen 9 5900X with a 3060 Ti yields the following:

Case                                     Draws per second
21  draw_16vattrib_change_dynamic        7,965,000
22  draw_16vattrib_change_gpl              315,000
23  draw_16vattrib_change_gpl_hashncache 4,020,000

Staggeringly, it turns out that GPL is worse in every scenario. Even the speed of the typical Vulkan hash-n-cache usage can’t make up for the fact that VK_EXT_vertex_input_dynamic_state genuinely is that much faster. And assuming the driver isn’t doing low-GPU-performance heroics, that means everyone drinking the koolaid about not using or not implementing VK_EXT_vertex_input_dynamic_state should be reconsidering their beverage of choice.

This isn’t to say Graphics Pipeline Library is bad or should not be used.

GPL is one of the best extensions Vulkan has to offer, and it definitely provides a huge number of features that every application developer should be examining to see how they might improve performance.

But it’s not better than VK_EXT_vertex_input_dynamic_state.

More vkoverhead

The project is already in a state where at least one major GPU vendor is utilizing it to drive down CPU usage. If you’re a GPU driver engineer, or perhaps if you’re someone who does benchmarking for a popular tech news site, you should check out vkoverhead too.

Some key takeaways:

  • Raw numbers can be compared between different GPUs and GPU drivers so long as the rest of the system stays the same
    • This is how I know that AMDPRO currently performs better than RADV
  • If the rest of the system is not the same between drivers/GPUs, the percentage differences can still be compared

Stay tuned for an upcoming post where I teach a course in making your own spaghetti. Also some other things that give zink a 25%+ performance boost.

September 07, 2022

My journey on the Google Summer of Code project passed by so fast… This is my last week on the GSoC and those 14 weeks flew by! A lot of stuff happened during those three months, and as I’m writing this blog post, I feel quite nostalgic about these three months.

Before I started GSoC, I never thought I would send so many patches to the mailing list, have an abstract approved on XDC 2022, or have commit rights on drm-misc.

GSoC was indeed a fantastic experience. It gave me the opportunity to grow as a developer in an open source community and I believe that I ended GSoC with a better understanding of what open source is. I learned more about the community, how to communicate with it, and who the actors in this workflow are.

So, this is a summary report of all my journey at GSoC 2022.

Contributions during GSoC 2022


First I will kick off with my non-related contributions. I mean, they are somehow related to my project, but they are not exactly unit tests for AMDGPU.

kworkflow

kworkflow is a tool for reducing the environment and setup overhead of developing for Linux, which is maintained by my mentor Rodrigo Siqueira. I use it to manage my config files, deploy to my testing machine, check code style, and more. Initially, kw didn’t have support for deploying to Fedora-based machines.

During the Community Bonding Period, I added support for deploying to Fedora-based machines and I wrote a bit about this story in this blog post. Moreover, I fixed a couple of bugs that I spotted while using it.

Patch Status
docs: dependencies: Add pv to Fedora dependencies Accepted
src: kwlib: check if the context is inside a git worktree Accepted
Add deploy support to Fedora-based systems Accepted
src: help: Fix renaming of configm to kernel-config-manager Accepted

IGT

IGT GPU Tools is a collection of tools for the development and testing of DRM drivers. During GSoC, I ran the AMDGPU suite a couple of times on my testing machine with a single display connected through HDMI. With this setup, I was able to detect a couple of failures in the IGT tests. I reported some of those issues on the AMD bug tracker, and I also sent two patches fixing a couple of the failures.

Patch Status
[i-g-t,v2] tests/amdgpu: Skip multihead MPO tests on single display Accepted
[i-g-t,v2] tests/amdgpu/amd_bypass: skip if connector is not DisplayPort On Review

Linux Kernel - KUnit

KUnit is the Kernel Unit Testing Framework. It is the framework we used for creating unit tests for the AMDGPU drivers. My patches to KUnit are based on problems that I noticed while writing unit tests for VBA. First, I fixed a simple documentation error I spotted when I consulted the docs. The other patches are a bit more interesting.

While I was writing some tests, I realized I was using a lot of expectations such as:

KUNIT_EXPECT_EQ(test, memcmp(buffer1, buffer2, size), 0);

I also realized that the output of this expectation can be quite unhelpful, as it only shows the return value of the memcmp function. So, I created an expectation macro that compares blocks of memory and outputs a hexdump of the memory when they differ.
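
As a rough illustration (a sketch, not the exact test code from my series), the new macro can replace the memcmp-based expectation above and dumps both buffers in hexadecimal when the comparison fails. The fill_buffer() function here is a hypothetical function under test:

#include <kunit/test.h>

static void buffer_comparison_example(struct kunit *test)
{
	const u8 expected[8] = { 0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77 };
	u8 actual[8];

	fill_buffer(actual, sizeof(actual)); /* hypothetical function under test */

	/* On failure, this prints a hexdump of both buffers instead of just
	 * reporting that memcmp() returned a non-zero value. */
	KUNIT_EXPECT_MEMEQ(test, actual, expected, sizeof(expected));
}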

It was a great community experience to interact with the KUnit developers and work on their feedback.

Patch Status
Documentation: KUnit: Fix example with compilation error Accepted
kunit: Introduce KUNIT_EXPECT_MEMEQ and KUNIT_EXPECT_MEMNEQ macros Accepted
kunit: Add KUnit memory block assertions to the example_all_expect_macros_test Accepted
kunit: Use KUNIT_EXPECT_MEMEQ macro Accepted

Linux Kernel - DRM

In the DRM subsystem, I contributed to the DRM unit tests, which used to be selftests and I converted them to KUnit during an LKCamp hackathon with other students. I explained more about those tests in this blog post. After those patches were merged, I dedicated myself to some janitorial duties on the tests folders: fixing stack warnings, refactoring some tests, and making the naming more consistent.

Also, this summer, I applied for commit rights on the drm-misc repository and it was approved. I was pretty glad to get commit rights, although it is a huge responsibility and I plan to use this right very carefully. I must thank my mentor Melissa Wen for encouraging me to ask for commit rights and for sharing her knowledge about the community and maintainership models (and also for answering a thousand questions I had about dim).

Patch Status
drm: selftest: convert drm_damage_helper selftest to KUnit Accepted
drm: selftest: convert drm_cmdline_parser selftest to KUnit Accepted
drm: selftest: convert drm_rect selftest to KUnit Accepted
drm: selftest: convert drm_format selftest to KUnit Accepted
drm: selftest: convert drm_plane_helper selftest to KUnit Accepted
drm: selftest: convert drm_dp_mst_helper selftest to KUnit Accepted
drm: selftest: convert drm_framebuffer selftest to KUnit Accepted
drm: selftest: convert drm_buddy selftest to KUnit Accepted
drm: selftest: convert drm_mm selftest to KUnit Accepted
drm/tests: Split up test cases in igt_check_drm_format_min_pitch Accepted
drm/mm: Reduce stack frame usage in __igt_reserve On Review
drm/tests: Split drm_framebuffer_create_test into parameterized tests On Review
drm/tests: Change “igt_” prefix to “test_drm_” On Review

Linux Kernel - AMDGPU

Most of my patches to the AMDGPU branch were ideas that I had while writing the unit tests on VBA. The display_mode_vba files were automatically generated, which means that the code might not be the most readable one. During the summer, I had a couple of ideas for cleaning up a bit of the VBA files and some of those ideas are documented in this blog post. And most of my patches are related to this matter.

But, most of the patches I sent related to the VBA files weren’t merged and there was also no feedback on the patches. Some were sent multiple times, but I didn’t get an answer. The DML is a very sensitive part of the AMDGPU driver, so big changes might not be suitable for them.

Moreover, there are a couple of fixes. My favorite one is the drm/amdgpu: Fix use-after-free on amdgpu_bo_list mutex, which was a fix to a use-after-free problem that appeared in the mailing list. It was fun to read the output provided in the mailing-list and then track the bug based on the trace. Also, it was a really nice interaction with other developers on the mailing list.

Patch Status
drm/amd/display: Remove return value of Calculate256BBlockSizes Accepted
drm/amd/display: Remove duplicate code across dcn30 and dcn31 Accepted
drm/amd/display: Remove unused variables from vba_vars_st Accepted
drm/amdgpu: Write masked value to control register Accepted
drm/amd/display: Change get_pipe_idx function scope Accepted
drm/amd/display: Remove unused clk_src variable Accepted
drm/amd/display: Remove unused dml32_CalculatedoublePipeDPPCLKAndSCLThroughput function Accepted
drm/amd/display: Remove unused NumberOfStates variable Accepted
drm/amd/display: Remove unused variables from dml_rq_dlg_get_dlg_params Accepted
drm/amd/display: Remove unused value0 variable On Review
drm/amd/display: Remove unused variables from dcn10_stream_encoder Accepted
drm/amd/display: Remove unused MaxUsedBW variable Accepted
drm/amd/display: Remove parameters from dml30_CalculateWriteBackDISPCLK Rejected
drm/amd/display: Drop dm_sw_gfx7_2d_thin_l_vp and dm_sw_gfx7_2d_thin_gl On Review
drm/amd/display: Remove duplicated CalculateWriteBackDISPCLK On Review
drm/amd/display: Remove parameters from rq_dlg_get_dlg_reg On Review
drm/amd/display: Rewrite CalculateWriteBackDISPCLK function On Review
drm/amd/display: Remove unused struct freesync_context Accepted
[PATCH 00/16] Remove entries from struct vba_vars_st On Review
drm/amd/display: Drop XFCEnabled parameter from CalculatePrefetchSchedule On Review
drm/amdgpu: Fix use-after-free on amdgpu_bo_list mutex Accepted
drm/amd/display: Include missing header Accepted

The KUnit AMDGPU Tests


After this huge tangent, let’s jump into the real milestones of the GSoC project. Making a small recapitulation of the idea of my project:

The Display Mode Library (DML) is a fundamental part of the AMDGPU driver. It involves lots of complex calculations and a large number of parameters. That makes it a great candidate for the implementation of unit tests. Unit tests will help graphics developers recognize bugs before they are merged into the mainline and make a future code refactor possible. This project intends to implement unit testing in the Display Mode VBA libraries, especially the Display Mode VBA public API and the DCN20’s Display Mode VBA functions.

In my project, the deliverables were:

  • Eleven unit tests for all the public functions on display_mode_vba and display_mode_vba20.
  • Five blog posts on the progress, problems, concepts, and all.
  • Run the unit tests on the AMDGPU Radeon RX 5700 XT.
  • Write documentation for the tests.

So, let’s discuss point by point the milestones of this summer.

The Unit Tests and Documentation

First, I had the intention to write unit tests for all the public functions on display_mode_vba and display_mode_vba20, which are eleven functions in total. Initially, it seemed like a good idea to test only the public functions, which is usually recommended by Software Engineering authors.

I started following this plan, but as I was learning more about unit testing, I started questioning the feasibility of tests for some functions. I mean, a couple of functions had more than 45 input parameters and more than a thousand lines. Checking all the possible code branches of those tests seemed to me unfeasible because there were a lot of variables to be considered.

As the function size and the number of parameters grow, the complexity of the tests grows exponentially.

So, I ended up writing tests for some static VBA functions. In the end, I wrote more than eleven unit tests but the functions for which they were written are not the same as planned initially.

The functions were chosen in two ways. First, I was trying to identify functions with a more suitable behavior for unit tests. Basically, functions with no more than 10 parameters and 100 lines. Also, I was looking for functions that were used a lot in the code, in order to increase the coverage and the relevance of the tests. But also, sometimes, Siqueira would suggest some functions outside of VBA, such as the dc_dmub_srv case, which was inspired by a regression.

Moreover, I also wrote documentation for the tests, giving instructions on how to run the tests and how to add more tests to the AMDGPU driver.

The patches with the test suites and documentation are listed here:

Patch Status
drm/amd/display: Introduce KUnit tests to the bw_fixed library On Review
drm/amd/display: Introduce KUnit tests to the display_mode_vba library On Review
drm/amd/display: Introduce KUnit to dcn20/display_mode_vba_20 library On Review
drm/amd/display: Introduce KUnit tests to dc_dmub_srv library On Review
Documentation/gpu: Add Display Core Unit Test documentation On Review

The tests are not merged yet and are currently on the second version, but there are good chances that the tests will be merged into the mainline soon.

Moreover, I was able to run all the unit tests I developed on the AMDGPU Radeon RX 5700 XT, and also with the kunit-tool.

The Blog Posts

During the summer, I wrote five blog posts about challenges that I found interesting in my journey. All blog posts are listed here:

Date Post
May 26, 2022 I’m in GSoC ‘22
Jun 11, 2022 Linux Kernel Developing with Fedora
Jul 11, 2022 About Kernel Symbol Table, Compilation, and more
Jul 19, 2022 From Selftests to KUnit
Aug 10, 2022 Does the Linux Kernel need software engineering?

More than just code


During this summer, I was also able to evolve my community skills. Before I joined GSoC, I didn’t think I had enough knowledge to review other people’s code or even read the mailing list. Now, I am more confident about reviewing and testing patches (and even committing a patch).

During GSoC, I developed the habit of reading the mailing list daily. Although I don’t really get everything that is going on there, I read a couple of threads and try to understand what is being discussed. And it became a fun part of my day to open Thunderbird and read the mailing lists from AMDGPU, DRM, KUnit, and Fedora Devel.

Moreover, while reading the mailing lists, I was able to find some discussions that I could contribute to and even review patches. Initially, I didn’t have enough confidence to send a Reviewed-by, so I used to send just a Tested-by. But now I feel more confident about sending a Reviewed-by and arguing for my points on the mailing lists.
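
(For anyone unfamiliar with the convention, these tags are just plain trailers appended to a reply on the mailing list, for example, with a placeholder name and address:

Tested-by: Jane Doe <jane@example.org>
Reviewed-by: Jane Doe <jane@example.org>

The maintainer collects them when applying the patch, so they end up in the commit message.)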

I made more than a dozen interactions on the mailing list, so I will just list the more relevant ones:

  1. My first Tested-by: This was after some interaction with David Gow on the AMDGPU Unit Tests RFC. I stated the need for enabling tests to link to the AMDGPU module and this culminated in some patches for it, where I sent my Tested-by. Also this discussion was the inspiration for this blog post.
  2. Documentation Reviews on KUnit: I reviewed a couple of KUnit documentation patches, such as 1 and 2.
  3. Spot an error in an IGT patch: This is a simple one, but it made me realize that I was able to help with my reviews.
  4. Reported a failing KUnit test: I was checking LKFT some day, and I realized that some KUnit tests were failing on all architectures. So, I sent an email reporting the failure to the tests’ author and he fixed the tests.
  5. PowerPC Compilation Fixes: On the same day I sent a patch fixing a PowerPC warning in the DRM tests, Melissa asked me on IRC how I cross-compiled for PPC, as she was working on a PPC warning. I was happy I could help her with this after some hours of failing to cross-compile for PPC. This resulted in another review and also a great interaction.

Moreover, I also committed two patches to drm-misc-next. I reviewed two patches improving the DRM KUnit tests; at the end of a review I usually wait for someone else to push the patches into drm-misc-next, but as those patches had been on the list for a while, I decided to push them myself. I was pretty afraid of doing something wrong, but all went fine with a bit of help from Melissa.

Next Steps


Well, I’m pretty excited about the next couple of months for many reasons. First, I, Magali Lemes, and Isabella Basso will be attending XDC 2022. It is going to be very exciting to participate in an in-person conference and also to make a presentation on the main track.

We are going to talk about the “KUnit sorcery and the uncanny nature of FPU in the DRM” on October 4th, presenting a bit of our GSoC project. So, currently, I’m practicing a lot to give a good presentation in October.

Moreover, I intend to keep contributing to the DRM subsystem, and currently, I have some ideas for code refactors of the DRM KUnit tests. Also, I would like to expand my contributions to the DRM beyond test-related work. Although I do like to write unit tests, I want to learn more about planes, CRTC, color management, memory management, and more. Currently, most of my contributions are related to janitorial duties and I would like to contribute to implementing new features in the DRM or improving the DRM core.

Finally, by the end of GSoC, I’ll be joining another mentorship project on the Linux Graphics Stack, the Igalia Coding Experience, in which I will be learning more about the DRM subsystem and IGT in the next months. This is making me very excited, as I will continue to contribute to open source, especially the Linux kernel, with the help of my great mentors Melissa Wen and André Almeida, who are software engineers at Igalia.

Acknowledgment


First, I would like to thank my mentors Rodrigo Siqueira, Melissa Wen, and André Almeida. They really believed in our potential, encouraged us to talk to the community, and showed us some great opportunities. They were an amazing team of mentors and I will always be thankful to them. Without them, I would probably never have submitted a project to GSoC.

Also, I would like to thank the X.Org Foundation for accepting my proposal to GSoC and also helping us with funding for XDC 2022.

Moreover, I would like to thank AMD for donating hardware for us. During the project, I used a Radeon RX 5700 XT donated by them, so I’m also very thankful to them. Moreover, I would like to thank all AMD engineers that took their time to review my patches and send feedback.

Finally, I would like to thank the DRI community for reviewing my patches and giving me constructive feedback. Also, the KUnit community, especially David Gow, Daniel Latypov, and Brendan Higgins, who reviewed a lot of my patches and took the time to meet with us this summer.

Last, but not least, I am thankful for the companionship of my colleagues Isabella Basso, Magali Lemes, and Tales Aparecida during this summer. It was great to have companions to solve problems with and to share knowledge.

While Tianocore EDKII still dominates the UEFI development world, there has been continuous effort to enable Rust for firmware development. But so far the tools involved have not been stabilised. We have now started an effort to remedy this and get stable Rust support for UEFI targets.

The Rust compiler has gained support for multiple UEFI targets in the past, namely:

  • aarch64-unknown-uefi
  • i686-unknown-uefi
  • x86_64-unknown-uefi

This allows building Rust UEFI Applications with a standard compiler by simply passing --target <arch>-unknown-uefi to cargo or rustc. Unfortunately, Tier-3 support means no compiler builds are distributed via the Rust release channels, nor does the Rust-CI guarantee the targets build successfully. Moreover, this implies that a nightly/unstable compiler is required to build for those targets, even though no nightly Rust Language features are required.

Raising support of these targets to Tier-2 will include automatic toolchain builds distributed via Rust release channels. Hence, no nightly/unstable compiler will be required anymore. Automatic CI builds will guarantee the targets build successfully and do not randomly break. This will greatly improve the trust in the platform and significantly enhance the developer experience.

Rust support for UEFI has been documented in the rustc-book section UEFI Platform Support. You can follow and support the Major Change Proposal (MCP) to raise support to Tier-2 on the Rust Compiler-Team Tracker.

September 03, 2022

September 1 was a big day! The official cross-vendor Vulkan mesh shading extension that I teased a while ago, has now been officially released. This is a significant moment for me because I’ve spent considerable time making the RADV implementation and collaborated with some excellent people to help shape this extension in Khronos.

How it started

We first started talking about mesh shaders in Mesa about two years (maybe more?) ago. At the time the only publicly available Vulkan mesh shading API was the vendor-specific NV_mesh_shader made by NVidia.

  • At the time nobody quite fully understood what mesh shaders are supposed to be and how they could work on the HW. I initially anticipated that they could be made to work even on RDNA1, but this turned out to be false due to some HW limitations.
  • It was unclear what was needed to get good performance.
  • The NVidia extension was a good start, but it was full of things that don’t make any real sense on any other HW.
  • Nobody had a clue how we would implement task shaders.

I was working together with Christoph Kubisch (from NVidia) who helped me understand what this topic is all about. Caio Oliveira and Marcin Slusarz (from Intel, working on ANV) joined the adventure too.

How we made it work

We made the decision to start working on some preliminary support for the NV extension to get some experience with the topic and learn how it’s supposed to work. Once we were satisfied with that, we made the jump to the EXT.

NIR and SPIR-V

These are the common Mesa pieces that all drivers can use. The front-end and middleware code for mesh/task shader support are here. Caio created the initial pieces and I expanded on that as I found more and more things that needed adjustment in NIR, added a new storage class for task payloads etc. Marcin also chimed in with fixes and improvements.

AMD and Intel hardware work differently, so most of the hard work couldn’t be shared and needed to be implemented in the backends. However, some of the commonalities are implemented in e.g. nir_lower_task_shader for the work that needs to happen in both RADV and ANV. There were dozens of merge requests that added support for various features, cleaned up old features to make them not crash on mesh shaders, etc. The latest is this MR which adds all the remaining puzzle pieces for the EXT.

Lowering the shaders in the backend

Because mesh shaders use NGG, the heavy lifting is done in ac_nir_lower_ngg which is already responsible for other NGG shaders (VS, TES, GS). The lowering basically “translates” the API shader to work like the HW expects. Without going into much detail, it essentially wraps the application shader and replaces some pieces of it to make them understandable by the HW.

There is now also an ac_nir_lower_taskmesh_io_to_mem for translating task payload I/O and EmitMeshTasksEXT to something the HW can understand.

Mesh shading draw calls in RADV

Previously I was mostly working on the compiler stack (NIR and ACO) and had little experience with the hardcore driver code, such as how draws/dispatches work and how the driver submits cmd buffers to the GPU. As I had near-zero knowledge of these areas, I had to learn about them, mostly by just reading the code.

So, I split the RADV work in two parts:

  1. Mesh shader only pipelines. These required only moderate changes to radv_cmd_buffer to add some new draw calls, and minor work in radv_pipeline to figure out per-primitive output attributes and get the register programming right.
  2. Task shader support. Due to how task shaders work on AMD HW, these required very severe refactoring in radv_device because we now have to submit to multiple queues at once. This is actually not finished yet because “gang submit” is still missing. Additionally, radv_cmd_buffer needed heavy changes to support an internal compute cmdbuf.

Naturally, during this work I also hit several RADV bugs in pre-existing use cases which just nobody noticed yet. Some of these were issues with secondary command buffers and conditional rendering on compute-only queues. There was also a nasty firmware bug, and other exciting stuff.

Let’s just say that my GPU loved to hang out with me.

Mesh shading on Intel

The Intel ANV compiler backend and driver implementation were done by Caio and Marcin and I just want to take this opportunity to say that I really enjoyed working together with them.

The guy who wrote the most mesh shaders on Earth

All of the above would have been impossible if I didn’t have some solid test cases which I could throw at my implementation. Ricardo Garcia developed the CTS (Vulkan Conformance Test Suite) testcases for both NV_mesh_shader and EXT_mesh_shader. During that work, Ricardo wrote several thousand mesh and task shaders for us to test with. Ricardo if you’re reading this, THANK YOU!!!

Conclusion

Implementing VK_EXT_mesh_shader gave me a very good learning experience and helped me get an understanding of parts of the driver that I had never looked at before.

What happens to NV_mesh_shader now?

We never wanted to officially support it, it was just a stepping stone for us to help start working on the EXT. The NV extension support will be removed soon.

When is it coming to my Steam Deck / Linux computer?

The RADV and ANV support will be included once the system is updated to Mesa 22.3, though we may be convinced to bring it to the Deck sooner if somebody finds a game that uses mesh shaders.

For NVidia proprietary driver users, EXT_mesh_shader is already included in the latest beta drivers.

Waiting for the gang…

We marked mesh/task shader support “experimental” in RADV because it has one main caveat that we are unable to solve without kernel support. Thanks to how we need to submit to two different queues at the same time, this can deadlock your GPU if you are running two (or more) processes which use task shaders at the same time. To properly solve it we need the “gang submit” feature in the kernel which prevents such deadlocks.

Unfortunately “gang submit” is not upstream yet. Cross your fingers and let’s hope it’ll be included in Linux 6.1.

Until then, you can use the RADV_PERFTEST=ext_ms environment variable to play your favourite mesh shader games!

September 02, 2022

Vulkan 1.3.226 was released yesterday and it finally includes the cross-vendor VK_EXT_mesh_shader extension. This has definitely been an important moment for me. As part of my job at Igalia and our collaboration with Valve, I had the chance to work reviewing this extension in depth and writing thousands of CTS tests for it. You’ll notice I’m listed as one of the extension contributors. Hopefully, the new tests will be released to the public soon as part of the open source VK-GL-CTS Khronos project.

During this multi-month journey I had the chance to work closely with several vendors working on adding support for this extension in multiple drivers, including NVIDIA (special shout-out to Christoph Kubisch, Patrick Mours, Pankaj Mistry and Piers Daniell among others), Intel (thanks Marcin Ślusarz for finding and reporting many test bugs) and, of course, Valve. Working for the latter, Timur Kristóf provided an implementation for RADV and reported to me dozens of bugs, test ideas, suggestions and improvements. Do not miss his blog post series about mesh shaders and how they’re implemented on RDNA2 hardware. Timur’s implementation will be used in your Linux system if you have a capable AMD GPU and, of course, the Steam Deck.

The extension has been developed with DX12 compatibility in mind. It’s possible to use mesh shading from Vulkan natively and it also allows future titles using DX12 mesh shading to be properly run on top of VKD3D-Proton and enjoyed on Linux, if possible, from day one. It’s hard to provide a summary of the added functionality and what mesh shaders are about in a short blog post like this one, so I’ll refer you to external documentation sources, starting by the Vulkan mesh shading post on the Khronos Blog. Both Timur and myself have submitted a couple of talks to XDC 2022 which have been accepted and will give you a primer on mesh shading as well as some more information on the RADV implementation. Do not miss the event at Minneapolis or enjoy it remotely while it’s being livestreamed in October.

August 22, 2022
Neverball rendered on the Apple M1 GPU with an open source OpenGL driver

After a year in development, the open source “Asahi” driver for the Apple GPU is running real games. There’s more to do, but Neverball is already playable (and a lot of fun!).

Neverball uses legacy “fixed function” OpenGL. Rather than supply programmable shaders like OpenGL 2, old OpenGL 1 applications configure a fixed set of graphics effects like fog and alpha testing. Modern GPUs don’t implement these features in hardware. Instead, the driver synthesizes shaders implementing the desired graphics. This translation is complicated, but we get it for “free” as an open source driver in Mesa. If we implement the modern shader pipeline, Mesa will handle fixed function OpenGL for us transparently. That’s a win for open source drivers, and a win for GPU acceleration on Asahi Linux.

To implement the modern OpenGL features, we rely on reverse-engineering the behaviour of Apple’s Metal driver, as we don’t have hardware documentation. Although Metal uses the same shader pipeline as OpenGL, it doesn’t support all the OpenGL features that the hardware does, which puts us in a bind. In the past, I’ve relied on educated guesswork to bridge the gap, but there’s another solution… and it’s a doozy.

For motivation, consider the clip space used in OpenGL. In every other API on the planet, the Z component (depth) of points in the 3D world ranges from 0 to 1, where 0 is “near” and 1 is “far”. In OpenGL, however, Z ranges from negative 1 to 1. As Metal uses the 0/1 clip space, implementing OpenGL on Metal requires emulating the -1/1 clip space by inserting extra instructions into the vertex shader to transform the Z coordinate. Although this emulation adds overhead, it works for ANGLE’s open source implementation of OpenGL ES on Metal.
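
For reference, the standard emulation (a sketch of the usual trick, not necessarily ANGLE’s exact code) appends a single transform to the clip-space position at the end of the vertex shader:

    z' = (z + w) / 2

After the perspective divide, that maps the -1/1 depth range onto 0/1, at the cost of a couple of extra instructions per vertex.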

Like ANGLE, Apple’s OpenGL driver internally translates to Metal. Because Metal uses the 0 to 1 clip space, it should require this emulation code. Curiously, when we disassemble shaders compiled with their OpenGL implementation, we don’t see any such emulation. That means Apple’s GPU must support -1/1 clip spaces in addition to Metal’s preferred 0/1. The problem is figuring out how to use this other clip space.

We expect that there’s a bit toggling between these clip spaces. The logical place for such a bit is the viewport packet, but there’s no obvious difference between the viewport packets emitted by Metal and OpenGL-on-Metal. Ordinarily, we would identify the bit by toggling the clip space in Metal and comparing memory dumps. However, according to Apple’s documentation, there’s no way to change the clip space in Metal.

That’s an apparent contradiction. There’s no way to use the -1/1 clip space with Metal, but Apple’s OpenGL-on-Metal translator uses the -1/1 clip space. What gives?

Here’s a little secret: there are two graphics APIs called “Metal”. There’s the Metal you know, a limited API that Apple documents for App Store developers, an API that lacks useful features supported by OpenGL and Vulkan.

And there’s the Metal that Apple uses themselves, an internal API adding back features that Apple doesn’t want you using. While ANGLE implements OpenGL ES on the documented Metal, Apple can implement OpenGL on the secret Metal.

Apple does not publish documentation or headers for this richer Metal API, but if we’re lucky, we can catch a glimpse behind the curtain. The undocumented classes and methods making up the internal Metal API are still available in the production Metal binaries. To use them, we only need the missing headers. Fortunately, Objective-C symbols contain enough information to reconstruct header files, allowing us to experiment with undocumented methods with “extra” functionality inherited from OpenGL.

Compared to the desktop GPUs found in Intel Macs, Apple’s own GPU implements a slim, modern feature set mapping well to Metal. Most of the “extra” functionality is emulated. It is interesting to know the emulation happens in their Metal driver instead of their OpenGL frontend, but that’s unsurprising, as it allows their Metal drivers for Intel and AMD GPUs to implement the functionality natively. While this information is fascinating for “macOS hermeneutics”, it won’t help us with our Apple GPU mystery.

What will help us are the catch-all mystery methods named setOpenGLModeEnabled, apparently enabling “OpenGL mode”.

Mystery methods named like that just beg to be called.

The render pipeline descriptor has such a method. That descriptor contains state that can change every draw. In some graphics APIs, like OpenGL with ARB_clip_control and Vulkan with VK_EXT_depth_clip_control, the application can change the clip space every draw. Ideally, the clip space state would be part of this descriptor.

We can test this optimistic guess by augmenting our Metal test bench to call [MTLRenderPipelineDescriptorInternal setOpenGLModeEnabled: YES].

It feels strange to call this hidden method. It’s stranger when the code compiles and runs just fine.

We can then compare traces between OpenGL mode and the normal Metal mode. Seemingly, enabling OpenGL mode toggles a plethora of random unknown bits. Even if one of them is what we want, it’s a bit unsatisfying that the “real” Metal would lack a proper [setClipSpace: MTLMinusOneToOne] method, rather than this blunt hack reconfiguring a pile of loosely related API behaviours.

Alas, for all the random changes in “OpenGL mode”, none seem to affect clipping behaviour.

Hope is not yet lost. There’s another setOpenGLModeEnabled method, this time in the render pass descriptor. Rather than pipeline state that can change every draw, this descriptor’s state can only change in between render passes. Changing that state in between draws would require an expensive flush to main memory, similar to the partial renders seen elsewhere with the Apple GPU. Nevertheless, it’s worth a shot.

Changing our test bench to call [MTLRenderPassDescriptorInternal setOpenGLModeEnabled: YES], we find another collection of random bits changed. Most of them are in hardware packets, and none of those seem to control clip space, either.

One bit does stand out. It’s not a hardware bit.

In addition to the packets that the userspace driver prepares for the hardware, userspace passes to the kernel a large block of render pass state describing everything from tile size to the depth/stencil buffers. Such a design is unusual. Ordinarily, GPU kernel drivers are only concerned with memory management and scheduling, remaining oblivious of 3D graphics. By contrast, Apple processes this state in the kernel, forwarding it to the GPU’s firmware to configure the actual hardware.

Comparing traces, the render pass “OpenGL mode” sets an unknown bit in this kernel-processed block. If we set the same bit in our OpenGL driver, we find the clip space changes to -1/1. Victory, right?

Almost. Because this bit is render pass state, we can’t use it to change the clip space between draws. That’s okay for baseline OpenGL and Vulkan, but it prevents us from efficiently implementing the ARB_clip_control and VK_EXT_depth_clip_control extensions. There are at least three (inefficient) implementations.

The first is ignoring the hardware support and emulating one of the clip spaces by inserting extra instructions into the vertex shader when the “wrong” clip space is used. In addition to extra overhead, that requires shader variants for the different clip spaces.

Shader variants are terrible.

In new APIs like Vulkan, Metal, and D3D12, everything needed to compile a shader is known up-front as part of a monolithic pipeline. That means pipelines are compiled when they’re created, not when they’re used, and are never recompiled. By contrast, older APIs like OpenGL and D3D11 allow using the same shader with different API states, requiring some drivers to recompile shaders on the fly. Compiling shaders is slow, so shader variants can cause unpredictable drops in an application’s frame rate, familiar to desktop gamers as stuttering. If we use this approach in our OpenGL driver, switching clip modes could cause stuttering due to recompiling shaders. In bad circumstances, that stutter could even happen long after the mode is switched.

That option is undesirable, so the second approach is always inserting emulation instructions that read the desired clip space at run-time, reserving a uniform (push constant) for the transformation. That way, the same shader is usable with either clip space, eliminating shader variants. However, that has even higher overhead than the first method. If an application frequently changes clip spaces within a render pass, this approach will be the most efficient of the three. If it does not, this approach adds constant overhead to every application. Knowing which approach is better requires the driver to have a magic crystal ball.1

The final option is using the hardware clip space bit and splitting the render pass when the clip space is changed. Here, the shaders are optimal and do not require variants. However, splitting the render pass wastes tremendous memory bandwidth if the application changes clip spaces frequently. Nevertheless, this approach has some support from the ARB_clip_control specification:

Some [OpenGL] implementations may introduce a flush when changing the clip control state. Hence frequent clip control changes are not recommended.

Each approach has trade-offs. For now, the easiest “option” is sticking our head in the sand and giving up on ARB_clip_control altogether. The OpenGL extension is optional until we get to OpenGL 4.5. Apple doesn’t implement it in their OpenGL stack. Because ARB_clip_control is primarily for porting Direct3D games, native OpenGL games are happy without it. Certainly, Neverball doesn’t mind. For now, we can use the hardware bit to use the -1/1 clip space unconditionally in OpenGL and 0/1 unconditionally in Vulkan. That does not require any emulation or flushing, though it prevents us from advertising the extensions.

That’s enough to run Neverball on macOS, using our userspace OpenGL driver in Mesa, and Apple’s proprietary kernel driver. There’s a catch: Neverball has to present with the deprecated X11 server on macOS. Years ago, Apple engineers2 contributed Mesa support for X11 on macOS (XQuartz), allowing us to run X11 applications with our Mesa driver. However, there’s no support for Apple’s own Cocoa windowing system, meaning native macOS applications won’t work with our driver. There’s also no easy way to run Linux’s newer Wayland display server on macOS. Nevertheless, Neverball does not use Cocoa directly. Instead, it uses the cross-platform SDL2 library to create its window, which internally uses Cocoa, X11, or Wayland as appropriate for the operating system. With enough sweat and tears, we can build a macOS/X11 version of SDL2 and link Neverball with that.

This Neverball/macOS/X11 port was frustrating, especially when the game is one apt install away on Linux. That’s a job for Asahi Lina, who has been hard at work writing a Linux kernel driver for Apple’s GPU. When our work converges, my userspace Mesa driver will run on Linux with her kernel driver to implement a full open source graphics stack for 3D acceleration on Asahi Linux.

Please temper your expectations: even with hardware documentation, an optimized Vulkan driver stack (with enough features to layer OpenGL 4.6 with Zink) requires many years of full time work. At least for now, nobody is working on this driver full time3. Reverse-engineering slows the process considerably. We won’t be playing AAA games any time soon.

That said, thanks to the tremendous shared code in Mesa, a basic OpenGL driver is doable by a single person. I’m optimistic that we’ll have native OpenGL 2.1 in Asahi Linux by the end of the year. That’s enough to accelerate your desktop environment and browser. It’s also enough to play older games (like Neverball). Even without fancy features, GPU acceleration means smooth animations and better battery life.

In that light, the Asahi Linux future looks bright.


  1. This crystal ball is called “Vulkan, Metal, or D3D12”, and it has its own problems.↩︎

  2. Hi Jeremy!↩︎

  3. I work full-time at Collabora on my baby, the open source Panfrost driver for Mali GPUs.↩︎

August 17, 2022

When I create kernel contributions, I usually rely on specific hardware, which makes using a system on which I need to deploy kernels too complicated or time-consuming to be worth it. Yes, I'm an idiot that hacks the kernel directly on their main machine, though in my defense, I usually just need to compile drivers rather than full kernels.

But sometimes I work on a part of the kernel that can't be easily swapped out, like the USB sub-system. In which case I need to test out full kernels.

I usually prefer compiling full kernels as RPMs, on my Fedora system as it makes it easier to remove old test versions and clearly tag more information in the changelog or version numbers if I need to.

Step one, build as non-root

First, if you haven't already done so, create an ~/.rpmmacros file (I know...), and add a few lines so you don't need to be root, or write stuff in /usr to create RPMs.

$ cat ~/.rpmmacros
%_topdir        /home/hadess/Projects/packages
%_tmppath        %{_topdir}/tmp

Easy enough. Now we can use fedpkg or rpmbuild to create RPMs. Don't forget to run those under “powerprofilesctl launch” to speed things up a bit.

Step two, build less

We're hacking the kernel, so let's try and build from upstream. Instead of the aforementioned fedpkg, we'll use “make binrpm-pkg” in the upstream kernel, which builds the kernel locally, as it normally would, and then packages just the binaries into an RPM. This means that you can't really redistribute the results of this command, but it's fine for our use.

 If you choose to build a source RPM using “make rpm-pkg”, know that this one will build the kernel inside rpmbuild, this will be important later.

 Now that we're building from the kernel sources, that's our time to activate the cheat code. Run “make localmodconfig”. It will generate a .config file containing just the currently loaded modules. Don't forget to modify it to include your new driver, or driver for a device you'll need for testing.

Step three, build faster

If running “make rpm-pkg” is the same as running “make ; make modules” and then packaging up the results, does that mean that the “%{?_smp_mflags}” RPM macro is ignored, I make you ask rhetorically. The answer is yes. “make -j16 rpm-pkg”. Boom. Faster.

Step four, build fasterer

As we're building in the kernel tree locally before creating a binary package, already compiled modules and binaries are kept, and shouldn't need to be recompiled. This last trick can however be used to speed up compilation significantly if you use multiple kernel trees, or need to clean the build tree for whatever reason. In my tests, it made things slightly slower for a single tree compilation.

$ sudo dnf install -y ccache
$ make CC="ccache gcc" -j16 binrpm-pkg

Easy.

And if you want to speed up the rpm-pkg build:

$ cat ~/.rpmmacros
[...]
%__cc            ccache gcc
%__cxx            ccache g++

More information is available in Speeding Up Linux Kernel Builds With Ccache.

Step five, package faster

Now, if you've implemented all this, you'll see that the compilation still stops for a significant amount of time just before writing “Wrote kernel...rpm”. A quick look at top will show a single CPU core pegged to 100% CPU. It's rpmbuild compressing the package that you will just install and forget about.

$ cat ~/.rpmmacros
[...]
%_binary_payload    w2T16.xzdio

More information is available in Accelerating Ceph RPM Packaging: Using Multithreaded Compression.

TL;DR and further work

All those changes sped up the kernel compilation part of my development from around 20 minutes to less than 2 minutes on my desktop machine.

$ cat ~/.rpmmacros
%_topdir        /home/hadess/Projects/packages
%_tmppath        %{_topdir}/tmp
%__cc            ccache gcc
%__cxx            ccache g++
%_binary_payload    w2T16.xzdio


$ powerprofilesctl launch make CC="ccache gcc" -j16 binrpm-pkg

I believe there's still significant speed ups that could be done, in the kernel, by parallelising some of the symbols manipulation, caching the BTF parsing for modules, switching the single-threaded vmlinux bzip2 compression, and not generating a headers RPM (note: tested this last one, saves 4 seconds :)

 

The results of my tests. YMMV, etc.

Command | Time spent | Notes
koji build --scratch --arch-override=x86_64 f36 kernel.src.rpm | 129 minutes | It's usually quicker, but that day must have been particularly busy
fedpkg local | 70 minutes | No rpmmacros changes except setting the workdir in $HOME
powerprofilesctl launch fedpkg local | 25 minutes |
localmodconfig / binrpm-pkg | 19 minutes | Defaults to "-j2"
localmodconfig -j16 / binrpm-pkg | 1:48 minutes |
powerprofilesctl launch localmodconfig ccache -j16 / binrpm-pkg | 7 minutes | Cold cache
powerprofilesctl launch localmodconfig ccache -j16 / binrpm-pkg | 1:45 minutes | Hot cache
powerprofilesctl launch localmodconfig xzdio -j16 / binrpm-pkg | 1:20 minutes |
August 13, 2022

Hi all!

This month I’ve been pondering offline-first apps. The online aspect of modern apps is an important feature for many use-cases: it enables collaboration between multiple people and seamless transition between devices (e.g. I often switch between my personal workstation, my laptop, and my phone). However, many modern apps come with a cost: oftentimes they only work with a fixed proprietary server, and only work online. I think that for many use-cases, allowing users to pick their own open-source server instance and designing offline-friendly apps is a good compromise between freedom and ease-of-use/simplicity. Not to say that peer-to-peer or fully distributed apps are always a bad choice, but they come at a significantly higher complexity cost, which makes them more annoying to both build and use.

The main hurdle when writing an offline-first app is synchronization. All devices must have a local copy of the database for offline use, and they need to push changes to the server when the device comes online. Of course, it’s perfectly possible that changes were made on multiple devices while offline, so some kind of conflict resolution is necessary. Instead of presenting a “Oops, we’ve got a conflict, which version would you like to keep?” dialog to the user, it’d be much nicer to just Do The Right Thing™. CRDTs are a solution to that problem. They look a bit scary at first because of all of the obscure naming (PN-Counter? LWW-Element-Set? anyone?) and intimidating theory in papers. However I like to think of CRDTs as “use this one easy trick to make synchronization work well”, and not some kind of complicated abstract machinery. In other words, by following some simple rules, it’s not too difficult to write well-behaved synchronization logic.
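
To give a feel for how small those “simple rules” can be, here’s a toy last-writer-wins register sketched in C (purely illustrative; seda isn’t written in C and doesn’t necessarily use this exact shape). Each replica stores a timestamp next to the value, and merging keeps the newer write; because the merge is commutative, associative, and idempotent, replicas can exchange state in any order, any number of times, and still converge:

#include <stdint.h>
#include <string.h>

/* Toy LWW register. Ties on the timestamp are broken by a device id so the
 * merge stays deterministic across replicas. */
struct lww_register {
	uint64_t timestamp;  /* e.g. milliseconds since the epoch at write time */
	uint32_t device_id;  /* tie-breaker */
	char     value[64];
};

static void lww_write(struct lww_register *r, uint64_t now,
                      uint32_t device_id, const char *value)
{
	r->timestamp = now;
	r->device_id = device_id;
	strncpy(r->value, value, sizeof(r->value) - 1);
	r->value[sizeof(r->value) - 1] = '\0';
}

/* Applying remote state in any order, any number of times, converges. */
static void lww_merge(struct lww_register *local, const struct lww_register *remote)
{
	if (remote->timestamp > local->timestamp ||
	    (remote->timestamp == local->timestamp &&
	     remote->device_id > local->device_id))
		*local = *remote;
}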

So, long story short, I’ve been experimenting with CRDTs this month. To get some hands-on experience, I’ve started working on a small hacky group expense tracking app, seda. I got the idea for this NPotM while realizing that there’s no existing good open-source user-friendly collaborative offline-capable (!) alternative yet. That said, it’s just a toy for now, nothing serious yet. If you want to play with it, you can have a look at the demo (feel free to toggle offline mode in the dev tools, then make some changes, then go online). There’s still a lot to be done: in particular, things get a bit hairy when one device deletes a participant and another creates a transaction with that user at the same time. I plan to write some docs and maybe a blog post about my findings.

I’ve released two new versions of software I maintain. soju 0.5.0 adds support for push notifications, a new IRC extension to search the chat history, support for more IRCv3 extensions (some of which originate from soju itself), and many other enhancements. hut 0.2.0 adds numerous new commands and export functionality.

In graphics news, I’ve been working on small tasks in various projects. As part of my Valve contract, I’ve been investigating some DisplayPort MST issues in the core kernel DRM code, and I’ve introduced a function to unset layer properties in libliftoff. With the help of Peter Hutterer, we’ve introduced a new X11 XWAYLAND extension as a more reliable way for clients to figure out whether they’re running under Xwayland. Last, I’ve continued ticking more boxes in libdisplay-info’s TODO list. We aren’t too far from having a complete EDID parser, but there are still so many extension blocks to add support for, and a whole new high-level API to design.

See you next month!

August 11, 2022

As of xorgproto 2022.2, we have a new X11 protocol extension. First, you may rightly say "whaaaat? why add new extensions to the X protocol?" in a rather unnecessarily accusing way, followed up by "that's like adding lipstick to a dodo!". And that's not completely wrong, but nevertheless, we have a new protocol extension to the ... [checks calendar] almost 40 year old X protocol. And that extension is, ever creatively, named "XWAYLAND".

If you recall, Xwayland is a different X server than Xorg. It doesn't try to render directly to the hardware, instead it's a translation layer between the X protocol and the Wayland protocol so that X clients can continue to function on a Wayland compositor. The X application is generally unaware that it isn't running on Xorg and Xwayland (and the compositor) will do their best to accommodate for all the quirks that the application expects because it only speaks X. In a way, it's like calling a restaurant and ordering a burger because the person answering speaks American English. Without realising that you just called the local fancy French joint and now the chefs will have to make a burger for you, totally without avec.

Anyway, sometimes it is necessary for a client (or a user) to know whether the X server is indeed Xwayland. Previously, this was done through heuristics: the xisxwayland tool checks for XRandR properties, the xinput tool checks for input device names, and so on. These heuristics are just that, though, so they can become unreliable as Xwayland gets closer to emulating Xorg or things just change. And properties in general are problematic since they could be set by other clients. To solve this, we now have a new extension.

The XWAYLAND extension doesn't actually do anything, it's the bare minimum required for an extension. It just needs to exist and clients only need to XQueryExtension or check for it in XListExtensions (the equivalent to xdpyinfo | grep XWAYLAND). Hence, no support for Xlib or libxcb is planned. So of all the nightmares you've had in the last 2 years, the one of misidentifying Xwayland will soon be in the past.

August 10, 2022

For those looking for a short answer: yes, it does.

Now, we can dive into a more elaborate answer.

Software engineering is a more systematic approach to software development, which involves the definition, implementation, measurement, management, change, and improvement of the software lifecycle. When we think about software through this lens, we must also think about software requirements, design, construction, testing, and maintenance.

Software engineering improves software maintainability, scalability, and security. Moreover, it makes it easier to add testing to the software stack. This approach makes the software more robust.

A little glossary for some software engineering terms:
- Maintainability: how easy is it to repair, improve or understand a software artifact? After finishing your product, you must continue to fix bugs, optimize functionalities and refactor code to avoid future problems.
- Scalability: how easy is it to grow or shrink a software artifact?
- Testability: how easy is it to test a software artifact? Does the function have suitable hooks for testing?

Many people might believe that it is not possible to do software engineering on the Linux Kernel, or even that software engineering is not needed. From what I see, these beliefs come from two statements:

  1. It is not possible to apply software engineering with C: sometimes software engineering is only associated with Object-Oriented Programming languages.
  2. Software engineering is not needed as we are working with drivers: as drivers are theoretically finite, we don’t have to think about their expansion and maintainability.

If we follow those beliefs, we might end up with poorly designed code. And, when badly designed code grows, I assure you that we are going to see code repetition, dead code, insanely large functions, and bugs.

But, the worst of all: when we have a huge codebase with lots of bad code practices, maintainability becomes hard and software quality will decrease more and more.

So, let’s first understand why those two beliefs are false.

Software Engineering with C


You might say: how can I use my fancy design patterns, avoid code repetition, and make beautiful polymorphism when I don’t have classes?

And okay, you are right! Design Patterns in C++ are much easier and more natural to understand and implement. In C++ you can create a hierarchy to represent a family of devices, and this feature comes out of the box. But we can translate those concepts to C.

C is a structured language, but we can write object-oriented programs in C. In this sense, libraries and structs are your main allies. Moreover, you can use function pointers to create polymorphism in C.

For example, if I want to write a simple queue in C, I can use the following approach:

#ifndef QUEUE_H_
#define QUEUE_H_

typedef struct Queue Queue;
struct Queue {
	int *buffer;
	int head;
	int size;
	int tail;
	int (*isFull)(Queue* const me);
	int (*isEmpty)(Queue* const me);
	int (*getSize)(Queue* const me);
	void (*insert)(Queue* const me, int k);
	int (*remove)(Queue* const me);
};

/* Constructor and destructors */
void Queue_Init(Queue* const me, int (*isFullFunction)(Queue* const me),
	int (*isEmptyFunction)(Queue* const me), int (*getSizeFunction)(Queue* const me),
	void (*insertFunction)(Queue* const me, int k), int (*removeFunction)(Queue* const me));

void Queue_Cleanup(Queue* const me);

/* Operations */
int Queue_isFull(Queue* const me);
int Queue_isEmpty(Queue* const me);
int Queue_getSize(Queue* const me);
void Queue_insert(Queue* const me, int k);
int Queue_remove(Queue* const me);

Queue *Queue_Create(void);
void Queue_Destroy(Queue* const me);

#endif
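
The header only declares the interface. To make the dispatch concrete, here is a minimal, illustrative constructor (my own sketch, not taken from the book, assuming the header above is saved as queue.h) that wires the function pointers to concrete implementations:

/* queue.c -- illustrative sketch only */
#include <stdlib.h>
#include "queue.h"

#define QUEUE_SIZE 16

static int Queue_isFull_impl(Queue* const me)
{
	return me->size == QUEUE_SIZE;
}

static int Queue_isEmpty_impl(Queue* const me)
{
	return me->size == 0;
}

static int Queue_getSize_impl(Queue* const me)
{
	return me->size;
}

static void Queue_insert_impl(Queue* const me, int k)
{
	if (me->isFull(me))
		return;
	me->buffer[me->head] = k;
	me->head = (me->head + 1) % QUEUE_SIZE;
	me->size++;
}

static int Queue_remove_impl(Queue* const me)
{
	/* caller is expected to check isEmpty() first */
	int k = me->buffer[me->tail];

	me->tail = (me->tail + 1) % QUEUE_SIZE;
	me->size--;
	return k;
}

Queue *Queue_Create(void)
{
	Queue *me = malloc(sizeof(*me));

	if (!me)
		return NULL;

	me->buffer = calloc(QUEUE_SIZE, sizeof(int));
	me->head = me->tail = me->size = 0;

	/* The "virtual table": callers only ever go through these pointers
	 * (e.g. q->insert(q, 42)), so a subclass can reuse or override them
	 * without touching the call sites. */
	me->isFull = Queue_isFull_impl;
	me->isEmpty = Queue_isEmpty_impl;
	me->getSize = Queue_getSize_impl;
	me->insert = Queue_insert_impl;
	me->remove = Queue_remove_impl;
	return me;
}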

Notice that I can have polymorphism using this approach, since I can create a new struct that inherits from Queue, such as:

typedef struct CachedQueue CachedQueue;
struct CachedQueue {
	Queue *queue;

	/* new attributes */
	char name[80];
	int numberElementsOnDisk;

	/* aggregation in subclass */
	Queue *outputQueue;

	/* inherited virtual function */
	int (*isFull)(CachedQueue* const me);
	int (*isEmpty)(CachedQueue* const me);
	int (*getSize)(CachedQueue* const me);
	void (*insert)(CachedQueue* const me, int k);
	int (*remove)(CachedQueue* const me);

	/* new virtual functions */
	void (*flush)(CachedQueue* const me);
	int (*load)(CachedQueue* const me);
};

Okay, this is incredible! It is POLYMORPHISM in C. And there is much more on this topic in the book “Design Patterns for Embedded Systems in C”, by Bruce Powel Douglass. I love this book and I learned a lot from it. Moreover, there is an awesome talk by Renato Geh and Matheus Tavares in the Linux Developer Conference Brazil 2019 about “Object Oriented Techniques in C: A Case Study on Git and Linux”.

So, you can see that fancy software architecture can be done on C. And don’t get me wrong, there are some beautiful abstractions in Linux that use these concepts, such as the Virtual File System (VFS). Moreover, some libraries provide great APIs, such as the DRM subsystem.

But sometimes this is not used in the implementation of drivers. And this takes us to the next point: yes, drivers need to be properly designed on the software side.

Drivers should be designed as pieces of software


Here I must say that my opinion is extremely biased by the Display Mode VBA library. For the last month, I have been writing unit tests for this library as part of my GSoC project. I got quite impressed (maybe not in a good way) by the amount of code repetition and by the huge functions.

And this is not a roast on AMDGPU code: AMD does a great job for the free software community and it is incredible that we have an open-source driver for a major graphics vendor. Moreover, I’m sure that this problem also exists in other parts of the kernel, so I believe it is a good point to discuss.

Let’s start from the premise that drivers are finite: you can grab the datasheet, code the hardware to the end of its features and finish the driver. And I might even say that you are right: drivers are finite. But, hardware companies don’t usually create one single product with singular characteristics: they usually create a product line, and sometimes product lines have children: another product line with some upgrades on the previous one.

💡 Product lines having children… For an OOP programmer, this sounds like a beautiful case for inheritance.

So, if you have a product line, are you going to create a file for each product? And for the product with a couple of features added, are you going to copy and paste the previous driver and change a couple hundred lines? This doesn’t seem like a great option, for a couple of reasons:

  1. Duplicate Code: you are duplicating the code and the bugs as well.
  2. Test Coverage: are you going to duplicate the tests also?
  3. Maintainability: especially in the maintenance phase of a project, the less code the better.

You see, it all comes down to maintainability.

As a great example of code reuse, you can check out the IIO subsystem. Hardware manufacturers such as Maxim and Analog Devices Inc usually have chips that share the same register map or share functionalities. Instead of creating a driver for each chip, developers write one single driver and add the compatible device IDs on the Device Table. For example, you can check the Maxim MAX1027 ADC driver, which is compatible with the MAX1027, MAX1029, MAX1031, MAX1227, MAX1229, and MAX1231. So we have one single driver for six devices: this is great for maintainability!
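For illustration, the heart of such a driver is an ID table roughly like the sketch below (simplified from memory, so take the exact names as approximate rather than authoritative): one driver, one entry per supported chip, and a driver-local enum to record which variant was matched:

#include <linux/module.h>
#include <linux/spi/spi.h>

/* Which member of the family are we driving? */
enum max1027_id {
	max1027,
	max1029,
	max1031,
	max1227,
	max1229,
	max1231,
};

static const struct spi_device_id max1027_id[] = {
	{ "max1027", max1027 },
	{ "max1029", max1029 },
	{ "max1031", max1031 },
	{ "max1227", max1227 },
	{ "max1229", max1229 },
	{ "max1231", max1231 },
	{ }
};
MODULE_DEVICE_TABLE(spi, max1027_id);

/* probe() can then look at spi_get_device_id(spi)->driver_data to handle
 * the small differences (resolution, channel count) between the chips,
 * while all the shared logic lives in one place. */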

In this case, if I find a bug, I can make one single modification and send one single patch; the maintainer will review it one single time, and all runs smoothly.

Now, let’s take a look at the DML folder from AMD’s Display Core, more specifically the display_mode_vba files from DCN20 and DCN21. These product lines are pretty similar, so maybe we can reuse a lot of the code.

But, if you check the directory, you can see that we have three different files: display_mode_vba_20.c, display_mode_vba_20v2.c and display_mode_vba_21.c.

💡 You can check the difference between the files through:
$ diff drivers/gpu/drm/amd/display/dc/dml/dcn20/display_mode_vba_20.c drivers/gpu/drm/amd/display/dc/dml/dcn20/display_mode_vba_20v2.c

And much of the code is identical: I mean there are functions that don’t change a line! This hits pretty hard on maintainability.

Now, if I find a bug, I need to make three modifications. Moreover, I might not even know that the code is duplicated, so I might only fix the bug in one place and leave the other files untouched. Then another developer might find the same bug once again, and will have to send it to the maintainer, who will have to review it one more time. This is a lot of rework!

And if I could guess a reason for AMD to copy and paste the code so many times, I would point out another maintainability issue: the functions are huge! Some functions from the VBA files have more than a thousand lines.

These huge functions in the VBA files mean that if you want to change a couple of lines for your new product line, you need to copy and paste the whole function.

Ideally, from the principles of the Clean Code book, we would like to have small functions that should be simple and do one thing only. And I know: this is not applicable in 100% of the cases, but I cannot find a good reason for a function to be so huge and have dozens of parameters.

💡 Other than the readability, those huge functions also hurt the stack pretty badly.

Huge functions really hurt the readability, understandability, and testability of the code. Moreover, they make it difficult to avoid code duplication as the function has dozens of side effects.

More glossary for some software engineering terms:
- Readability: how easily can a software artifact be read?
- Understandability: how easily can a software artifact be comprehended?

But this is not a dead end for the AMDGPU’s DML code: I mean, the AMDGPU driver works awesomely on Linux, and code refactoring is always an option.

We can think about software!


At this moment, we might conclude that as the AMDGPU driver is open-source, then we can fix those issues in the code. But it is definitely not safe to simply tear down the code and rewrite it in one single patch set, as the AMDGPU driver has to remain functional on Linux.

One way to fix this is through unit testing to ensure the code is properly refactored. However, throughout my GSoC project, I ended up noticing that it is not possible to write a unit test for a thousand-line function. A huge function has many side effects and testing each one of them is not feasible.

Maybe, for the Display Mode VBA, unit testing alone is not the way to go. We could first break the functions into smaller, self-contained pieces, as this will help to create better tests, improve readability, and reduce the stack size.

Now with smaller functions, it is more feasible to share code across the DCNs and create a common interface for them.

This refactor can lead to the use of those design patterns I talked about earlier and make the DML more maintainable and readable. We can think about the use of inheritance where we have a base library, from which DCN20 can extend, and then DCN21 can extend from DCN20. And this is how those three huge files can become three small files.

And this refactor can start piece by piece:

  1. Unifying the parameters: don’t pass the parameters by copying if the parameters are in the common struct. The stack will thank this change!
  2. Splitting the functions: make smaller, more readable functions.
  3. Writing tests for the functions
  4. Creating a common interface: here is where the design patterns come in.

This way we can make the refactor safer, even where unit testing is not viable up front. This doesn’t mean that we are not going to introduce any bugs in the process, but having a structured plan will help us avoid them.
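
As a rough illustration of the first step, here is a hypothetical before/after sketch (the names are invented, not actual DML code): instead of copying dozens of scalars onto the stack at every call, pass a pointer to the struct that already holds them.

/*
 * Before: every call copies all of these onto the stack:
 *
 *   static double calculate_watermark(double pixel_clock_mhz,
 *                                     double dram_speed_mts,
 *                                     unsigned int num_channels,
 *                                     unsigned int lines_in_buffer,
 *                                     ...dozens more...);
 */

/* After: the values already live in a shared context struct, so pass
 * one pointer and let the callee read only what it needs. */
struct display_params {
	double pixel_clock_mhz;
	double dram_speed_mts;
	unsigned int num_channels;
	unsigned int lines_in_buffer;
	/* ... */
};

static double calculate_watermark(const struct display_params *p);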


I must say: this is the opinion of someone who came straight out of university, thinking about well-structured code. So, I might be utopian about software engineering. I understand that the developers at AMD are doing their best and are working hard to provide the best features for us, Linux users.

But thinking of software is the best way to ensure the maintainability of our code, and bad code practices will prove costly one day or another.

August 08, 2022

Descriptors are hard

Over the weekend, I asked on twitter if people would be interested in a rant about descriptor sets. As of the writing of this post, it has 46 likes so I’ll count that as a yes.

I kind-of hate descriptor sets…

Well, not descriptor sets per se. More descriptor set layouts. The fundamental problem, I think, was that we too closely tied memory layout to the shader interface. The Vulkan model works ok if your objective is to implement GL on top of Vulkan. You want 32 textures, 16 images, 24 UBOs, etc. and everything in your engine fits into those limits. As long as they’re always separate bindings in the shader, it works fine. It also works fine if you attempt to implement HLSL SM6.6 bindless on top of it. Have one giant descriptor set with all resources ever in giant arrays and pass indices into the shader somehow as part of the material.

The moment you want to use different binding interfaces in different shaders (pretty common if artists author shaders), things start to get painful. If you want to avoid excess descriptor set switching, you need multiple pipelines with different interfaces to use the same set. This makes the already painful situation with pipelines worse. Now you need to know the binding interfaces of all pipelines that are going to be used together so you can build the combined descriptor set layout and you need to know that before you can compile ANY pipelines. We tried to solve this a bit with multiple descriptor sets and pipeline layout compatibility which is supposed to let you mix-and-match a bit. It’s probably good enough for VS/FS mixing but not for mixing whole materials.

The problem space

So, how did we get here? As with most things in Vulkan, a big part of the problem is that Vulkan targets a very diverse spread of hardware and everyone does descriptor binding a bit differently. In order to understand the problem space a bit, we need to look at the hardware…

DISCLAIMER:

I’m about to spill a truckload of hardware beans. Let me reassure you all that I am not violating any NDAs here. Everything I’m about to tell you is either publicly documented (AMD and Intel) or can be gleaned from reading public Mesa source code.

Descriptor binding methods in hardware can be roughly broken down into 4 broad categories, each with its own advantages and disadvantages:

  1. Direct access (D): This is where the shader passes the entire descriptor to the access instruction directly. The descriptor may have been loaded from a buffer somewhere but the shader instructions do not reference that buffer in any way; they just take what they’re given. The classic example here is implementing SSBOs as “raw” pointer access. Direct access is extremely flexible because the descriptors can live literally anywhere but it comes at the cost of having to pass the full descriptor through the shader every time.

  2. Descriptor buffers (B): Instead of passing the entire descriptor through the shader, descriptors live in a buffer. The buffers themselves are bound to fixed binding points or have their base addresses pushed into the shader somehow. The shader instruction takes either a fixed descriptor buffer binding index or a base address (as appropriate) along with some form of offset to the descriptor in the buffer. The difference between this and the direct access model is that the descriptor data lives in some other bit of memory that the hardware must first read before it can do the actual access. Changing buffer bindings, while definitely not free, is typically not incredibly expensive.

  3. Descriptor heaps (H): Descriptors of a particular type all live in a single global table or heap. Because the table is global, changing it typically involves a full GPU stall and maybe dumping a bunch of caches. This makes changing out the table fairly expensive. Shader instructions which access these descriptors are passed an index into the global table. Because everything is fixed and global, this requires the least amount of data to pass through the shader of the three bindless mechanisms.

  4. Fixed HW bindings (F): In this model, resources are bound to fixed HW slots, often by setting registers from the command streamer or filling out small tables in memory. With the push towards bindless, fixed HW bindings are typically only used for fixed-function things on modern hardware such as render targets and vertex, index, and streamout buffers. However, we still need to consider them because Vulkan 1.0 was designed to support pre-bindless hardware which might not be quite as nice.

Here’s a quick run-down on where things sit with most of the hardware shipping today:

Hardware              Textures  Images  Samplers  Border Colors  Typed buffers  UBOs    SSBOs
NVIDIA (Kepler+)      H         H       H         H              H              D/F     D
AMD                   D         D       D         H              D              D       D
Intel (Skylake+)      H         H       H         H              H              H/D/F   H/D
Intel (pre-Skylake)   F         F       F         F              F              D/F     F
Arm (Valhall+)        B         B       B         B              B              B/D/F   B/D
Arm (pre-Valhall)     F         F       F         F              F              D/F     D
Qualcomm (a5xx+)      B         B       B         B              B              B       B
Broadcom (vc5)        D         D       D         D              D              D       D

The line above for "Intel (pre-Skylake)" is a bit misleading. I’m labeling everything as fixed HW bindings but it’s actually a bit more flexible than most fixed HW binding mechanisms. It’s a sort of heap model, but instead of indexing into heaps directly from the shader, everything goes through a second layer of indirection called a binding table, which is restricted to 240 entries. On Skylake and later hardware, the binding table hardware still exists and uses a different set of heaps, which provides a nice back-door for drivers. More on that when we talk about D3D12.

The Vulkan 1.0 descriptor set model

As you can see from above, the hardware landscape is quite diverse when it comes to descriptor binding. Everyone has made slightly different choices depending on the type of descriptor and picking a single model for everyone isn’t easy. The Vulkan answer to this was, of course, descriptor sets and their dreaded layouts.

Ignoring UBOs for the moment, the mapping from the Vulkan API to these hardware descriptors is conceptually fairly simple. The descriptor set layout describes a set of bindings, each with a binding type and a number of descriptors in that binding. The driver maps the binding type to the type of HW binding it uses and computes how much GPU or CPU memory is needed to store all the bindings. Fixed HW bindings are typically stored CPU-side and the actual bindings get set as part of vkCmdBindDescriptorSets() or vkCmdDraw/Dispatch(). For everything in one of the three bindless categories, they allocate GPU memory. For heap descriptors, descriptors may be allocated as part of the descriptor set or, to save memory, as part of the image or buffer view object. Given that descriptor heaps are often limited in size, allocating them as part of the view object is often preferred.
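
As a concrete reference point (a generic Vulkan snippet, not tied to any particular driver), the application side of that mapping looks something like this; the driver inspects each binding’s descriptorType and descriptorCount to decide which hardware mechanism backs it and how much memory to set aside:

#include <vulkan/vulkan.h>

static VkDescriptorSetLayout create_example_layout(VkDevice device)
{
	/* One UBO plus an array of 8 combined image/samplers, all for the FS. */
	const VkDescriptorSetLayoutBinding bindings[] = {
		{
			.binding = 0,
			.descriptorType = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER,
			.descriptorCount = 1,
			.stageFlags = VK_SHADER_STAGE_FRAGMENT_BIT,
		},
		{
			.binding = 1,
			.descriptorType = VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER,
			.descriptorCount = 8,
			.stageFlags = VK_SHADER_STAGE_FRAGMENT_BIT,
		},
	};
	const VkDescriptorSetLayoutCreateInfo layout_info = {
		.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO,
		.bindingCount = 2,
		.pBindings = bindings,
	};
	VkDescriptorSetLayout layout = VK_NULL_HANDLE;

	/* The driver decides here: heap slots, descriptor-buffer bytes, or
	 * CPU-side fixed bindings, based on each binding's descriptorType. */
	vkCreateDescriptorSetLayout(device, &layout_info, NULL, &layout);
	return layout;
}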

UBOs get weird. I’m not going to try and go into all of the details because there are often heuristics involved and it gets complicated fast. However, as you can see from the above table, most hardware has some sort of fixed HW binding for UBOs, even on bindless hardware. This is because UBOs are the hottest of hot paths and even small differences in UBO fetch speed turn into real FPS differences in games. This is why, even with descriptor indexing, UBOs aren’t required to support update-after-bind. The Intel Linux driver has three or four different paths a UBO may take based on how often it’s used relative to other UBOs, update-after-bind, and which shader stage it’s being accessed from.

The other thing I have yet to mention is dynamic buffers. These typically look like a fixed HW binding. How they’re implemented varies by hardware and driver. Often they use fixed HW bindings or the descriptors are loaded into the shader as push constants. Even if the buffer pointer comes from descriptor set memory, the dynamic offset has to get loaded in via some push-like mechanism.

The D3D12 descriptor heap

DISCLAIMER:

Again, I’m going to talk details here. Again, in spite of the fact that there are exactly zero open-source D3D12 drivers, I can safely say that I’m not violating any NDAs. I’ve literally never seen the inside of a D3D12 driver. I’ve just read public documentation and am familiar with how hardware works and is driven. This is all based on D3D12 drivers I’ve written in my head, not the real deal. I may get a few things wrong.

For D3D12, Microsoft took a very different approach. They embraced heaps. D3D12 has these heavy-weight descriptor heap objects which have to be bound before you can execute any 3D or compute commands. Shaders have the usual HLSL register notation for describing the descriptor interface. When shaders are compiled into pipelines, descriptor tables are used to map the bindings in the shader to ranges in the relevant descriptor heap. While the size of a descriptor heap range remains fixed, each such range has a dynamic offset which allows the application to move it around at will.

With SM6.6, Microsoft added significant flexibility and further embraced heaps. Now, instead of having to use descriptor tables in the root signature, applications can use heap indices directly. This provides a full bindless experience. All the application developer has to do is manage heap allocations with resource lifetimes and figure out how to get indices into their shader. Gone are the days of fiddling with fixed interface layouts through side-band pipeline create APIs. From what I’ve heard, most developers love it.

If D3D12 has embraced heaps, how does it work on AMD? They use descriptor buffers, don’t they? Yup. But, fortunately for Microsoft, a descriptor heap is just a very restrictive descriptor buffer. The AMD driver just uses two of their descriptor buffer bindings (resource and sampler heaps are separate in D3D12) and implements the heap as a descriptor buffer.

One downside to the descriptor heap approach is that it forces some amount of extra indirection, especially with the SM6.6 bindless model. If your application is using bindless, you first have to load a heap index from a constant buffer somewhere and then pass that to the load/store op. The load/store turns into a sequence of instructions that fetches the descriptor from the heap, does the offset calculation, and then does the actual load or store from the corresponding pointer. Depending on how often the shader does this, how many unique descriptors are involved, and the compiler’s ability to optimize away redundant descriptor fetches, this can add up to real shader time in a hurry.

The other major downside to the D3D12 model is that handing control of the hardware heaps to the application really ties driver writers’ hands. Any time the client does a copy or blit operation which isn’t implemented directly in the DMA hardware, the driver has to spin up the 3D hardware, set up a pipeline, and do a few draws. In order to do a blit, the pixel shader needs to be able to read from the blit source image. This means it needs a texture or UAV descriptor which needs to live in the heap which is now owned by the client. On AMD, this isn’t a problem because they can re-bind descriptor sets relatively cheaply or just use one of the high descriptor set bindings which they’re not using for heaps. On Intel, they have the very convenient back-door I mentioned above where the old binding table hardware still exists for fragment shaders.

Where this gets especially bad is on NVIDIA, which is a bit ironic given that the D3D12 model is basically exactly NVIDIA hardware. NVIDIA hardware only has one texture/image heap and switching it is expensive. How do they implement these DMA operations, then? First off, as far as I can tell, the only DMA operation in D3D12 that isn’t directly supported by NVIDIA’s DMA engine is MSAA resolves. D3D12 doesn’t have an equivalent of vkCmdBlitImage(). Applications are told to implement that themselves if they really want it. What saves them, I think (I can’t confirm), is that D3D12 exposes 10^6 descriptors to the application but NVIDIA hardware supports 2^20 descriptors. That leaves about 48k descriptors for internal usage. Some of those are reserved by Microsoft for tools such as PIX but I’m guessing a few of them are reserved for the driver as well. As long as the hardware is able to copy descriptors around a bit (NVIDIA is very good at doing tiny DMA ops), they can manage their internal descriptors inside this range. It’s not ideal, but it does work.

Towards a better future?

I have nothing to announce, but I and others have been thinking about descriptors in Vulkan and how to make them better. I think we should be able to do something that’s better than the descriptor sets we have today. What is that? I’m personally not sure yet.

The good news is that, if we’re willing to ignore non-bindless hardware (I think we are for forward-looking things), there are really only two models: heaps and buffers. (Anything direct access can be stored in the heap or buffer and it won’t hurt anything.) I too can hear the siren call of D3D12 heaps but I’d really like to avoid tying drivers’ hands like that. Even if NVIDIA were to rework their hardware to support two heaps today to get around the internal descriptors problem and make it part of the next generation of GPUs, we wouldn’t be able to rely on users having that for 5-10 years, longer depending on application targets.

If we keep letting drivers manage their own heaps, D3D12 layering on top of Vulkan becomes difficult. D3D12 doesn’t have image or buffer view objects in the same sense that Vulkan does. You just create descriptors and stick them in the heap somewhere. This means we either need to come up with a way to get rid of view objects in Vulkan or a D3D12 layer needs a giant cache of view objects, the lifetimes of which are difficult to manage to say the least. It’s quite the pickle.

As with many of my rant posts, I don’t really have a solution. I’m not even really asking for feedback and ideas. My primary goal is to educate people and help them understand the problem space. Graphics is insanely complicated and hardware vendors are notoriously cagey about the details. I’m hoping that, by demystifying things a bit, I can at the very least garner a bit of sympathy for what we at Khronos are trying to do and help people understand that it’s a near miracle that we’ve gotten where we are. 😅

August 04, 2022
People write code. Test coverage is never enough. Some angry contributor will disable the CI. And we all write bugs. But that's OK, it's part of the job. Programming is hard, and sometimes we may miss a corner case, forget that numbers overflow, and run into all the other strange things that computers can do. One easy thing we can do to help the poor developer who needs to find which change in the code stopped their printer from working properly is to keep the project bisectable.