Some time ago I teased a new Mesa project that involved both features and perf. At last, it’s time to unveil the goods: a complete rewrite of all descriptor handling in Lavapipe thanks to one of RADV’s raytracing tamers, Konstantin Seurer.
It’s a feature that’s been confounding developers and users everywhere for years, but thanks to Konstantin’s tireless efforts over the past ETOOLONG, at last everyone will have the features they crave.
Yes.
In short, the work required rewriting all the LLVM-based JIT to dynamically generate all the image/buffer access code instead of being able to use a more fixed-function style of JIT. As the MR shows, the diff here is massive, and work is still ongoing to make it work everywhere.
It’s a truly Herculean effort by Konstantin that was only hindered by my goalpost-moving to fingerpaint in support for EXT_descriptor_buffer and EXT_mutable_descriptor_type.
Primarily it means that Lavapipe should start to be able to work with VKD3D-PROTON and play (haha) real games. It’s also useful for CI, allowing broader coverage for all layered drivers that depend on EXT_descriptor_indexing.
Unfortunately, the added indirection is going to slightly reduce performance in some cases.
Work is ongoing to mitigate this, and I don’t have any benchmark results.
The other day I posted about new efforts to create a compendium of games that work with zink. Initial feedback has been good, and thanks to everyone who has contributed so far.
I realized too late that I needed to be more explicit about what feedback I wanted, however.
There are three tickets open, one for each type of problem:
Please don’t dump the results of your entire steam library into a single ticket. Take the extra few seconds and split up your results appropriately.
With all that said, the quantity of games in the “working” list far outnumbers those in the “not working” lists, which is great. On the flip side, anyone looking for a starting point to contribute to zink now has quite a few real world cases to check out.
Last week I’ve been attending the HDR hackfest organized by Red Hat. The trip to Prague was rather chaotic: the morning right after the SourceHut DoS incident, I got a notification that my plane was cancelled, so had to re-book a new flight and hotel. Then a few hours before leaving for the airport, I realized that the train line to get there was cut off, so I had to take a longer route via bus (and of course, everybody taking a plane had the same idea). Thankfully Saturday evening I arrived in Prague as planned, and even had some time before the train the next day to visit the old city center with the Jewish cemetery and synagogues, and enjoy a traditional guláš. I arrived at Brno — the venue of the hackfest — on Sunday evening.
I met with some hackfest participants Monday morning in the hotel lobby, then we joined everybody else at the Red Hat office. People from various organizations were on-site: Red Hat, KDE, System76, AMD, Igalia, Collabora, Canonical, etc. Some more people from NVIDIA, Intel and Google joined us remotely (some of them waking up at 2 AM due to their timezone!). It was super nice meeting all these folks I’ve been working with remotely for years!
Sebastian had prepared a list of topics we could discuss. We started by brainstorming use-cases we cared about for HDR and color management. There are two main separate classes of users here: one wants to enjoy HDR movies and games, the other wants to perform color-critical work such as image or video editing. The former mainly cares about the colors looking good, while the latter cares about the colors looking right. These two use-cases are kind of orthogonal (a compositor can implement one without the other) but still closely related. We noted that displaying a single HDR surface full-screen is pretty easy to achieve but we really want to properly handle mixing HDR and SDR content, if only to be able to display overlay menus, notifications, cursors, and so on (and of course windowed HDR content). Additionally keeping the power usage down is important for mobile devices. We mentioned a million other issues (screen recording, brightness adjustment, “Night Light” feature, etc) but this blog post would turn into a thick book if I tried to cover everything.
Then we switched gears and discussed variable refresh rate (VRR). There are two unresolved issues when it comes to VRR: cursor handling and flickering. The first issue manifests itself when the cursor plane is moved while VRR is enabled. Either the cursor is moved at the maximum refresh rate (effectively disabling VRR), or the cursor is moved at the game/video refresh rate (resulting in a very choppy result). We need a new kernel uAPI to move the cursor plane without scheduling a new page-flip somehow. The second issue is that some screens (not all) flicker when the refresh rate is changed abruptly. This is a bit annoying to handle: we need to ensure that refresh rate changes are smoothed over multiple frames for these displays. It would be best for user-space to handle this, because the refresh rate compensation will mess up frame timings. It would be nice to be able to automatically tell apart “good” and “bad” screens; there are some HDMI and DisplayID standards for this, but they are not widely supported. More experimentation and testing is required to figure out how much we can do in user-space.
Then we got into the color management topics. First the “easy” part: AMD is missing support for the Colorspace KMS property. There are patches floating around, but there’s a blocker: AMD may decide to encode the signal as either RGB or YUV on the wire depending on the available bandwidth, and the Colorspace property has different enum entries for RGB and YUV. However user-space has no way to know whether the driver picked RGB or YUV, so it has no way to pick the correct enum entry. We decided that the best course of action was to retain backwards uAPI compatibility by keeping the existing enum entries, but treat them as equal in the driver and let it Do The Right Thing. That way user-space can unconditionally pick the RGB variant and the driver will silently convert that to the YUV variant if it happens to encode the signal as YUV on the wire.
Before we got into some more complicated color management and HDR discussions, Sebastian and Pekka explained in more detail how it’s all supposed to work. This is a very wide and tricky topic, so it can be especially complicated to learn and understand. Pekka gave some enlightening and colorful explanations (see what I did there?), which I believe helped a lot of people in the room. If you are interested, have a look at the learn page in Pekka’s color-and-hdr repository.
With that out of the way, we started debating about vendor-specific KMS properties. With the existing kernel uAPI, compositors can implement HDR and color management just fine, but have to resort to OpenGL or Vulkan shaders. This isn’t great for power efficiency, because this keeps the 3D engine in GPUs busy. Ideally we want to offload all color transformations to the display engine and keep the 3D engine completely idle (during video playback with hardware-accelerated decoding for instance). So we need a new kernel API.
A week before, Melissa had sent a patch series to introduce AMD-specific KMS properties to configure various color management related hardware blocks. The amdgpu documentation explains exactly what these hardware blocks are. Josh has implemented support for this in gamescope and this will be shipped in SteamOS soon. This is great, because this is the first time a real HDR compositor has been implemented on Linux, and with full hardware acceleration even! If nothing else, this is a very valuable testing ground. So the question we asked ourselves is whether or not we want to merge the KMS vendor-specific properties. On one hand this allows easier and wider experimentation to come up with a good vendor-neutral uAPI. On the other hand we don’t want to end up stuck with vendor-specific user-space and no generic API. Everybody had a different opinion on this topic, so that made for an interesting discussion. In the end, we agreed that we can merge vendor-specific color management properties on the condition that the uAPI is documented as unstable, hidden behind an experimental build option and a kernel parameter. This should allow for more testing while avoiding the pitfalls of hardcoding chunks of vendor-specific code in each compositor.
Things got really interesting when we discussed long-term plans. We want to design some kind of vendor-neutral API that compositors can use to program the color pipeline in GPUs. Other platforms (e.g. Android) typically provide a descriptive API: compositors set the color space and other related parameters for the source and the destination, and the driver comes up with the appropriate hardware configuration. However there are multiple ways of doing color conversions (e.g. gamut and tone mapping), and each one gives a different result. This will result in glitches when switching between OpenGL/Vulkan and KMS offloading. Unfortunately the switches can happen pretty often, e.g. when a notification comes in, or when a window is moved around. Another issue is that the descriptive API doesn’t give full control to the compositors, thus compositors cannot come up with novel color pipelines. We decided that for KMS a prescriptive API would be superior: drivers expose a list of available hardware blocks (mathematical operations like look-up tables and matrices), then user-space directly programs each hardware block. The tricky part is coming up with a good API which fits all hardware, present and future. This design would seem to work well for AMD and Intel hardware, but NVIDIA GPUs are more opinionated and have hardware blocks which convert between two fixed color spaces and cannot be disabled. We decided that it would be reasonable to expose these fixed hardware blocks to user-space as well, just like any other hardware block. I will soon send an RFC to the dri-devel mailing list with more details about our API proposal.
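To make the descriptive/prescriptive distinction a bit more concrete, here is a purely hypothetical sketch in C of what a prescriptive pipeline could look like from the point of view of user-space. None of these types belong to any real or proposed uAPI; they only illustrate the idea of exposing programmable blocks directly:

#include <stddef.h>

/* Hypothetical illustration only: a prescriptive API exposes the hardware
 * blocks themselves and lets user-space program each one, instead of handing
 * the driver a source/destination color space and hoping it picks the same
 * math as the compositor's shader fallback. */
enum color_op_type {
   COLOR_OP_1D_LUT,      /* per-channel look-up table */
   COLOR_OP_3X3_MATRIX,  /* color space conversion matrix */
   COLOR_OP_3D_LUT,      /* gamut/tone mapping cube */
   COLOR_OP_FIXED,       /* fixed, non-bypassable conversion (NVIDIA-style) */
};

struct color_op {
   enum color_op_type type;
   const float *data;   /* parameters for programmable blocks */
   size_t data_len;
};

/* A plane's color pipeline is then an ordered chain of such blocks, e.g.
 * degamma LUT -> CSC matrix -> gamma LUT, which the compositor programs
 * directly so it matches its OpenGL/Vulkan fallback. */
struct color_pipeline {
   struct color_op *ops;
   size_t num_ops;
};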
Since the kernel API would just expose what the hardware can do (very much like KMS planes), user-space will need to translate its color pipeline to the hardware, and fall back to shaders if it cannot leverage the hardware. We plan to establish a common user-space library (similar to libliftoff) to help offload color pipelines to KMS.
Throughout the days we had various other discussions, for instance about testing or about new features we’d like KMS to have. The detailed notes should get published soon if you’re interested.
As you can probably tell, we didn’t write much code during this hackfest. We just talked together the whole time. Everyone was very passionate and invested in the topics we discussed. The hackfest was very exhausting; by 5 PM the discussions were a lot slower. However that effort paid off, and we’ve made great progress! We now have a clear path forward, and I can’t wait to see the fruits of the hackfest materialize in various Git repositories. Many thanks to Carlos for organizing everything, and to Red Hat + Collabora for sponsoring the event!
Yep, it’s all merged. That means if your driver supports VK_EXT_shader_object, you can finally enjoy Tomb Raider (2013) without any issues.
NVIDIA has just released a new beta driver that I need to test, but I’m hopeful it will no longer crash when trying to use this extension.
Remember that time I mentioned how zink wasn’t allowed to use VK_EXT_vertex_input_dynamic_state on AMDVLK?
We all got a chuckle, and it was sad/funny, and nobody was surprised, but the funnier part is some code that’s been in zink for much longer. Almost a year, in fact:
if (screen->info.have_EXT_extended_dynamic_state) {
   if (screen->info.have_EXT_extended_dynamic_state2) {
      if (screen->info.have_EXT_extended_dynamic_state3) {
         if (screen->info.have_EXT_vertex_input_dynamic_state)
            dynamic = ZINK_DYNAMIC_VERTEX_INPUT;
         else
            dynamic = ZINK_DYNAMIC_STATE3;
      } else {
         if (screen->info.have_EXT_vertex_input_dynamic_state)
            dynamic = ZINK_DYNAMIC_VERTEX_INPUT2;
         else
            dynamic = ZINK_DYNAMIC_STATE2;
      }
   } else {
      dynamic = ZINK_DYNAMIC_STATE;
   }
} else {
   dynamic = ZINK_NO_DYNAMIC_STATE;
}
This is the conditional for enabling dynamic state usage in zink using an enum. As we can see, the only time VK_EXT_vertex_input_dynamic_state is enabled is if either VK_EXT_extended_dynamic_state2 or VK_EXT_extended_dynamic_state3 are also enabled. This cuts down on the number of codepaths that can be used by drivers, which improves performance and debuggability.
AMD’s drivers don’t yet support VK_EXT_extended_dynamic_state3, as anyone can see from gpuinfo. They do, however, support VK_EXT_extended_dynamic_state2, so the driver-side disablement of VK_EXT_vertex_input_dynamic_state does have some effect.
Not.
Way, way back, just over a year ago, I was doing some testing on AMD’s drivers. One thing I noticed was that trying to run any of the GLES CTS caselists on these drivers caused GPU hangs, so I stopped running those, leaving me with just GL4.6 CTS.
And what happened when I enabled VK_EXT_extended_dynamic_state2 there, you ask? Test failures. Lots of test failures.
Thus, AMD got the cone of shame: a driver workaround to explicitly disable this extension.
In conclusion, we all had a good chuckle about AMD blocking zink from using VK_EXT_vertex_input_dynamic_state, but…
Well, there’s nothing in this story that we didn’t already expect.
On another topic, I’ve been doing some per-app tracking in zink. Specifically tracking games that don’t work great. If you know of other games with issues, post them there.
But there hasn’t been any sort of canonical list for games that do work great on zink, which leads to a lot of confusion about how useful it is for gaming.
Thus, I’ve created the GAMES THAT WORK tracking ticket. If you’ve played a game on zink and it works, post about it. If you want to know whether a game works, check that ticket and maybe it’ll be updated enough to be useful.
Remember: I don’t play video games, so I can’t fill out this table on my own. The only way I know if a game works is if I spend a comprehensive amount of time benchmarking or debugging it, which is the only reason I have 300+ hours logged in the benchmark mode of Tomb Raider (2013).
Things have been quiet on the surface lately, with nothing big in development.
But there is big work underway. It’s an extremely secret project (not Half Life 3) that (or Portal 3) I’m hoping (or any Valve-related title) can be (it’s not even a game) brought into the light (it’s driver-related) within the next week (involving features) or two (and perf).
I can’t say more on this topic.
Don’t even bother asking.
It’s too secret.
What I can say is that it’s been in development for almost a month. And we all know how much time a month is when it comes to SGC speed.
Cannot believe it has been years since my last update here!
There are two things that I would like to tell people about:
The first is that I no longer work at Collabora. It has been almost 13 years full of excitement and recently I came to believe that I wanted a proper change.
They are great folks to work with, so if you are thinking of a career change and want to do open-source stuff upstream, I recommend you consider them.
And the other topic is what I have been working on lately: a free software driver for the NPUs that VeriSilicon sells to SoC vendors.
tomeu@arm-64:~/tensorflow/build/examples/label_image$ SMALLER_SOFTMAX=1 RUSTICL_ENABLE=etnaviv LD_LIBRARY_PATH=/home/tomeu/opencl/lib LIBGL_DRIVERS_PATH=/home/tomeu/opencl/lib/dri/ ./label_image --gpu_backend=cl --use_gpu=true --verbose 1 --tflite_model ../../../assets/mobilenet_quant_v1_224.tflite --labels ../../../assets/labels.txt --image ../../../assets/grace_hopper.bmp --warmup_runs 1 -c 1
[snip]
INFO: invoked
INFO: average time: 1261.99 ms
INFO: 0.666667: 458 bow tie
INFO: 0.294118: 653 military uniform
INFO: 0.0117647: 835 suit
INFO: 0.00784314: 611 jersey
INFO: 0.00392157: 922 book jacket
That is TensorFlow Lite's OpenCL delegate detecting objects with Etnaviv from Grace Hopper's portrait in military uniform.
Many years ago, when I was working on the operating system for the One Laptop Per Child project, I became painfully aware of the problems caused by IP vendors not providing the source code for their drivers.
This and other instances of the same problem motivated me to help out on the Panfrost project, writing a free software driver for the Mali GPUs by Arm. That gave me a great opportunity to learn about reverse engineering from Alyssa Rosenzweig.
Nowadays the Mesa project contains drivers for most GPUs out there, some maintained by the same companies that develop the IP, some by their customers and hobbyists alike. So the problem of the availability of source code for GPU drivers is pretty much solved.
Only that, with the advent of machine learning at the edge, we are reliving this problem with the drivers for accelerating those workloads with NPUs, TPUs, etc.
Vivante's NPU IP is very closely based on their GPUs. And it is pretty popular, being included in SoCs by Amlogic, Rockchip, NXP, Broadcom and more.
We already have a reasonably complete driver (Etnaviv) for their GPU IP, so I started by looking at what the differences were and how much of the existing userspace and kernel drivers we could reuse.
The kernel driver works with almost no changes; it just took me some time to implement the hardware initialization properly upstream. As of Linux 6.3 the driver loads correctly on Khadas' VIM3, but for a chance at decent performance this patch is needed:
[PATCH] arm64: dts: VIM3: Set the rates of the clocks for the NPU
Due to its experimental status, it is disabled by default in the device tree. To enable it, add the below to arch/arm64/boot/dts/amlogic/meson-g12b-a311d-khadas-vim3.dts:
&npu {
status = "okay";
};
Enabling Etnaviv for other boards with this IP should be relatively straightforward: describe how the hardware is initialized, which can be figured out by inspecting the downstream kernel sources for the board in question.
Mesa has seen most of the work, as this IP is compute-only and the existing userspace driver only targeted OpenGL ES.
First step was wiring up the existing driver to Mesa's OpenCL implementation, and then I focused on getting the simplest kernel to correctly run. For this and all the subsequent work, the reverse-engineering tools used by the Etnaviv community have been of great use.
At that point I had to pause the work to focus on other unrelated stuff, but Collabora's Italo Nicola and Faith Ekstrand did great work to extend the existing compiler to generate OpenCL kernels.
Once I didn't have a day job getting in the way anymore, I started adding the features needed to run the label_image example in TensorFlow Lite.
And eventually we got to this point. 1.2 seconds to run that inference is a lot of time, so the next steps for me will be to figure out the biggest causes of the low performance.
With the goal in mind of providing a free software driver that companies can use to run inference on their products containing Vivante's NPU IP, I need those tasks to perform within at least the same order of magnitude as the closed-source solution provided by Vivante.
Right now Etnaviv is about twice as slow as running label_image with the OpenCL delegate on Vivante's driver, but the solution that they provide uses a special delegate that is able to make better use of their hardware and is several times faster.
Current performance situation (label_image):
The plan is to first see why we are slower with the OpenCL delegate and fix it, and afterwards the real fun stuff will start: seeing how we can use more of the HW capabilities through the OpenCL API and with upstream TensorFlow Lite.
Italo is cleaning up an initial submission for inclusion in Mesa upstream. Once that is done I will rebase my branch and start submitting features.
In parallel to upstreaming, I will be looking at what is needed to get closer to the performance of the closed source driver, for ML acceleration.
There are a lot of people besides the ones mentioned above who have made this possible. Some of them are:
Last but not least, there are some individuals to whom I was able to turn when I needed help:
F38 just released, and I'm seeing a bunch of people complain that TF2 dies on AMD or other platforms when lavapipe is installed. Who's at fault? I've no real idea. How to fix it? I've no real idea.
AMD OpenGL drivers use LLVM as the backend compiler. Fedora 38 updated to LLVM 16. LLVM 16 is built with c++17 by default. C++17 introduces new "operator new/delete" interfaces[1].
TF2 ships with its own libtcmalloc_minimal.so implementation. tcmalloc expects to replace all the new/delete interfaces, but the version in TF2 must not support, or has incorrect support for, the new aligned interfaces.
What happens is that when TF2 probes OpenGL and LLVM is loaded, and DenseMap initializes, one "new" path fails to go through tcmalloc but the "delete" path does, and this causes tcmalloc to explode with
"src/tcmalloc.cc:278] Attempt to free invalid pointer"
I'll talk to Valve and see if we can work something out. LLVM 16 doesn't seem to support building with C++14 anymore. I'm not sure if static linking libstdc++ into LLVM might avoid the tcmalloc overrides; it also might not be acceptable to the wider Fedora community.
I escaped my RADV pen once again. I know, it’s been a while, but every so often my handler gets distracted and I sprint for the fence with all my might.
This time I decided to try out a shiny new Intel Arc A770 that was left by my food trough.
The Dota2 performance? Surprisingly good. I was getting 100+ FPS in all the places I expected to have good perf and 4-5 FPS in all the places I expected to have bad perf.
The GL performance? Also surprisingly good*. Some quick screenshots:
DOOM 2016 on Iris:
Playable.
And of course, this being a zink blog, I have to do this:
The perf on zink is so good that the game thinks it’s running on an NVIDIA GPU.
If anyone out there happens to be a prominent hardware benchmarking site, this would probably be an interesting comparison on the upcoming Mesa 23.1 release.
I’ve seen a lot of people with AMD hardware getting hit by the Fedora 38 / LLVM 16 update crash. While this is unfortunate, and there’s nothing that I, a simple meme connoisseur, can do about it, I have some simple workarounds that will enable you to play all your favorite games without issue:
Remove the lavapipe ICD (rm /usr/share/vulkan/icd.d/lvp_icd.*)
Use zink instead (put MESA_LOADER_DRIVER_OVERRIDE=zink %command% in your game’s launch options)

I realize the latter suggestion seems meme-adjacent, but so long as you’re on Mesa 23.1-rc2 or a recent git build, I doubt you’ll notice the difference for most games.
You can’t run a full desktop on zink yet, but you can now play most games at native-ish performance. Or better!
I’ve made mention of Big Triangle a number of times on the blog. Everyone’s had a good chuckle, and we’re all friends so we know it’s an inside joke.
But what if I told you I was serious each and every time I said it?
What if Big Triangle really does exist?
I know what you’re thinking: Mike, you’re not gonna get me again. You can’t trick me this time. I’ve seen this coming for—
CHECK OUT THESE AMDVLK RELEASE NOTES!
Incredible, they’re finally supporting that one extension I’ve been saying is crucial for having good performance. Isn’t that ama—
And they’ve even added an app profile for zink! I assume they’re going to be slowly rolling out all the features zink needs in a controlled manner since zink is a known good-citizen when it comes to behaving within the boundaries of—
…
There are plans for nouveau to support using the NVIDIA-supplied GSP firmware in order to support new hardware going forward.
The nouveau project doesn't have any input or control over the firmware. NVIDIA have made no promises around stable ABI or firmware versioning. The current status quo is that NVIDIA will release versioned signed gsp firmwares as part of their driver distribution packages that are version locked to their proprietary drivers (open source and binary). They are working towards allowing these firmwares to be redistributed in linux-firmware.
The NVIDIA firmwares are quite large. The nouveau project will control the selection of which versions of the released firmwares are to be supported by the driver; it's likely a newer firmware will only be pulled into linux-firmware for:
This should at least limit the number of firmwares in the linux-firmware project.
However, a secondary effect of the size of the firmwares is that adding more and more MODULE_FIRMWARE lines to the nouveau kernel module for each iteration will mean the initramfs sizes get steadily larger on systems, and after a while the initramfs will contain a few GSP firmwares that the driver doesn't even need to run.
To combat this I've looked into adding some sort of firmware grouping which dracut can pick just one entry out of.
It currently looks something like:
MODULE_FIRMWARE_GROUP_ONLY_ONE("ga106-gsp");
MODULE_FIRMWARE("nvidia/ga106/gsp/gsp-5258902.bin");
MODULE_FIRMWARE("nvidia/ga106/gsp/gsp-5303002.bin");
MODULE_FIRMWARE_GROUP_ONLY_ONE("ga106-gsp"); This group only one will end up in the module info section and dracut will only pick one module from the group to install into the initramfs. Due to how the module info section is constructed this will end up picking the last module in the group first.
The dracut MR is:
https://github.com/dracutdevs/dracut/pull/2309
The kernel one liner is:
https://lore.kernel.org/all/20230419043652.1773413-1-airlied@gmail.com/T/#u
Hi!
In the last month I’ve continued working on go-imap v2. I’ve written the
server side, implemented an in-memory server backend, and spent quite a bit of
time fixing issues reported by imaptest. I only have a handful of test
failures, most of which are due to \Recent being unimplemented on purpose because
it’s been removed from the new IMAP4rev2 RFC. The end result is a much more
correct and reliable server implementation compared to v1. I’ve pushed some
incremental improvements for the client side as well, fixing compatibility
issues with servers in the wild and adding a few more extensions. Next, I’d
like to explore server-side command pipelining and fix the remaining issues
related to unilateral updates.
In other news, I’ve (finally!) released new versions of soju and goguma. soju v0.6.0 adds a database message store, a new sojuctl utility, external authentication support, and many more improvements. goguma v0.5.0 adds image previews, UnifiedPush support, performance improvements, and new IRCv3 extensions. Since the goguma release I’ve also implemented previews for Web pages.
While we’re on the topic of new releases, there is one more piece of software which got its version bumped this month: hut v0.3.0 adds pagination, improved Web hooks support, a few new sub-commands and other quality-of-life improvements. Thanks a lot to Thorben Günther for their numerous contributions!
The NPotM is yojo. I’ve already
written two tools to integrate builds.sr.ht with other code forges, so here’s a
third one focused on Forgejo/Gitea. It’s pretty similar to hottub, and a public
instance is available for Codeberg integration. It
doesn’t support pull requests yet, patches welcome! While working on yojo I got
once again annoyed by golang.org/x/oauth2 so I started working on a simpler
alternative creatively called go-oauth2.
Last but not least, after days of battling with the Pixman API, I’ve managed to finish up my new renderer API for wlroots. I’m excited about it because the next step is to lay down the first bricks of the color management infrastructure. My plan is to work on basic support for per-output ICC profiles, then go from there. I’ll be participating in Red Hat’s HDR hackfest next week, I hope the discussions with the rest of the stakeholders (compositor and driver developers) can help us move this forward!
That’s all for April, see you next month!
As I mentioned a week or three ago, I deleted comments on the blog because (apparently) the widget was injecting ads. My b. I wish I could say the ad revenue was worth it, but it wasn’t.
With that said, I’m looking at ways to bring comments back. I’ve seen a number of possibilities, but none have really grabbed me:
If anyone has other ideas, post here about it.
EDIT: Thanks to a brilliant suggestion by the other Rhys Perry, I’ve activated giscus for comments. Took about 2 mins. Does it work? We’ll see.
After this boring procedural opening, let’s get to something exciting that nobody blogs about: shader linking.
What is shader linking? Shader linking is the process by which shaders are “linked” together to match up I/O blocks and optimize the runtime. There’s a lot of rules for what compilers can and can’t do during linking, and I’m sure that’s all very interesting, and probably there’s someone out there who would want to read about that, but we’ll save that topic for another day. And another blog.
I want to talk about one part of linking in particular, and that’s interface matching. Let’s check out some Vulkan spec text:
15.1.3. Interface Matching
An output variable, block, or structure member in a given shader stage has an interface match with
an input variable, block, or structure member in a subsequent shader stage if they both adhere to
the following conditions:
• They have equivalent decorations, other than:
◦ XfbBuffer, XfbStride, Offset, and Stream
◦ one is not decorated with Component and the other is declared with a Component of 0
◦ Interpolation decorations
◦ RelaxedPrecision if one is an input variable and the other an output variable
• Their types match as follows:
◦ if the input is declared in a tessellation control or geometry shader as an OpTypeArray with
an Element Type equivalent to the OpType* declaration of the output, and neither is a structure
member; or
◦ if the maintenance4 feature is enabled, they are declared as OpTypeVector variables, and the
output has a Component Count value higher than that of the input but the same Component Type;
or
◦ if the output is declared in a mesh shader as an OpTypeArray with an Element Type equivalent
to the OpType* declaration of the input, and neither is a structure member; or
◦ if the input is decorated with PerVertexKHR, and is declared in a fragment shader as an
OpTypeArray with an Element Type equivalent to the OpType* declaration of the output, and
neither the input nor the output is a structure member; or
◦ if in any other case they are declared with an equivalent OpType* declaration.
• If both are structures and every member has an interface match.
Fascinating. Take a moment to digest.
Once again that’s all very interesting, and probably there’s someone out there who wanted to read about that, but this isn’t quite today’s topic either.
Today’s topic is this one line a short ways below:
Shaders can declare and write to output variables that are not declared or read by the subsequent stage.
This allows, e.g., a vertex shader to write an output variable that a fragment shader doesn’t read. Nobody has ever seen a problem with this in Vulkan. The reason is pipelines. Yes, that concept about which Khronos has recently made questionable statements, courtesy of Nintendo, based on the new VK_EXT_shader_object extension. In a pipeline, all the shaders get linked, which means the compiler can delete these unused variables. Or, if not delete, then it can at least use the explicit location info for variables to ensure that I/O is matched up properly.
And because of pipelines, everything works great.
But what happens if pipelines/linking go away?
Everyone saw this coming as soon as the blog loaded. With shader objects (and GPL fastlink), it now becomes possible to create unlinked shaders with mismatched outputs. The shader code is correct, the Vulkan API usage to create the shaders is correct, but is the execution still going to be correct?
Right. CTS. So let’s check…
Okay, there’s no public CTS available for VK_EXT_shader_object yet, but I’m sure it’s coming soon.
I have access to the private CTS repos, and I can see that there is (a lot of) CTS for this extension, which is a relief, and obviously I already knew this since lavapipe has passed everything, and I’m sure there must be testing for shader interface mismatches either there or in the GPL tests.
Sure, maybe there’s no tests for this, but it must be on the test plan since that’s so comprehensive.
Alright, so it’s not in the test plan, but I can add it, and that’s not a problem. In the meanwhile, since zink needs this functionality, I can just test it there, and I’m sure it’ll work fine.
It’s more broken than AMD’s VK_EXT_robustness2 handling, but I’m sure it’ll be easy to fix.
It’s nightmarishly difficult, and I wasted an entire day trying to fix nir_assign_io_var_locations, but I’m sure only lavapipe uses it.
The Vulkan drivers affected by this issue:
Basically everyone except ANV. But also maybe ANV since the extension isn’t implemented there. And probably all the proprietary drivers too since there’s no CTS.
Great.
nir_assign_io_var_locations works like this: the input/output variables are sorted by their Location decoration, and each variable is then assigned an incrementing driver_location index.

This results in a well-ordered list of variables with proper indexing that should match up both on the input side and the output side.
Except no, not really.
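To make the failure mode easier to see, here is a rough sketch of what the per-stage assignment amounts to. This is not the actual Mesa pass, and slots_used() is a made-up helper standing in for the real multi-slot accounting:

#include "nir.h"

/* Hypothetical helper: pretend every variable consumes one slot; the real
 * pass accounts for types that take several. */
static unsigned
slots_used(const nir_variable *var)
{
   (void)var;
   return 1;
}

/* Sketch of the strategy described above: walk the variables in Location
 * order and hand out sequential driver_location indices, with no knowledge
 * of what the other shader stage declares. */
static void
assign_driver_locations(nir_variable **sorted_vars, unsigned count)
{
   unsigned index = 0;
   for (unsigned i = 0; i < count; i++) {
      sorted_vars[i]->data.driver_location = index;
      index += slots_used(sorted_vars[i]);
   }
}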
Consider the following simple shader interface:
vertex shader
layout(location = 0) in highp vec4 i_color;
layout(location = 0) out highp vec4 o_color;
void main()
{
gl_Position = vec4(some value);
o_color = i_color;
}
fragment shader
layout(location = 0) in highp vec4 i_color;
layout(location = 0) out highp vec4 o_color;
void main()
{
o_color = i_color;
}
We expect that the vertex attribute color will propagate through to the fragment output color, and that’s what happens.
Vertex shader outputs:

o_color, driver_location=0

Fragment shader inputs:

i_color, driver_location=0

Let’s modify it slightly:
vertex shader
layout(location = 0) in highp vec4 i_color;
layout(location = 0) out highp vec2 o_color;
layout(location = 2) out highp vec2 o_color2;
void main()
{
gl_Position = vec4(some value);
o_color = i_color;
}
fragment shader
layout(location = 0) in highp vec2 i_color;
layout(location = 2) in highp vec2 i_color2;
layout(location = 0) out highp vec4 o_color;
void main()
{
o_color = vec4(i_color.xy, i_color2.xy);
}
Vertex shader outputs:

o_color, driver_location=0
o_color2, driver_location=1

Fragment shader inputs:

i_color, driver_location=0
i_color2, driver_location=1

No problems yet.
But what about this:
vertex shader
layout(location = 0) in highp vec4 i_color;
layout(location = 0) out highp vec2 o_color;
layout(location = 1) out highp vec4 lol;
layout(location = 2) out highp vec2 o_color2;
void main()
{
gl_Position = vec4(some value);
o_color = i_color;
lol = vec4(1.0);
}
fragment shader
layout(location = 0) in highp vec2 i_color;
layout(location = 2) in highp vec2 i_color2;
layout(location = 0) out highp vec4 o_color;
void main()
{
o_color = vec4(i_color.xy, i_color2.xy);
}
In a linked pipeline this works just fine: lol is optimized out during linking since it isn’t read by the fragment shader, and location indices are then assigned correctly. But in unlinked shader objects (and with non-LTO EXT_graphics_pipeline_library), there is no linking. Which means lol isn’t optimized out. And what happens once nir_assign_io_var_locations is run?
Vertex shader outputs:

o_color, driver_location=0
lol, driver_location=1
o_color2, driver_location=2

Fragment shader inputs:

i_color, driver_location=0
i_color2, driver_location=1

Tada, now the shaders are broken.
Hopefully a proper fix will materialize eventually, but at present I’ve had to work around this issue in zink by creating multiple separate shader variants with different locations to ensure everything matches up.
I made an attempt at fixing this, but it was unsuccessful. I then contacted the great Mesa compiler sage, Timothy Arceri, and he provided me with a history lesson from The Before Times. Apparently this NIR pass was originally written for GLSL and lived in mesa/st. Then Vulkan drivers wanted to use it, so it was moved to common code. Since all pipelines were monolithic and could do link-time optimizations, there were no problems.
But now LTO isn’t always possible, and so we are here.
It seems to me that the solution is to write an entirely new pass for Vulkan drivers to use, and that’s all very interesting, and probably there’s someone out there who wants to read about that, but this is the end of the post.
Just a quick post to sum up all the new features and things to watch for in zink for 23.1:
ARB_separate_shader_objects support
GL_QUADS natively supported
EXT_multisample_render_to_texture now uses VK_EXT_multisampled_render_to_single_sampled
EXT_descriptor_buffer is now the default for descriptor handling
NV_compute_shader_derivatives support
ZINK_DEBUG options for debugging
Has anyone else heard that Alyssa is going to Dyson to work on some new vacuum tech? This is upending everything I thought I knew, but the source seems credible.
Today is my last day at Collabora and my last day leading the Panfrost driver.
It’s been a wild ride.
In 2017, I began work on the chai driver for Mali T (Midgard). chai would later be merged into Lyude Paul’s and Connor Abbott’s BiOpenly project for Mali G (Bifrost) to form Panfrost.
In 2019, I joined Collabora to accelerate work on the driver stack. The initial goal was to run GNOME on a Mali-T860 Chromebook.

Today, Panfrost supports a broad spectrum of Mali GPUs, conformant to the OpenGL ES 3.1 specification on Mali-G52 and Mali-G57. It’s hard to overstate how far we’ve come. I’ve had the thrills of architecting several backend shader compilers as well as the Gallium-based OpenGL driver, while my dear colleague Boris Brezillon has put together a proof-of-concept Vulkan driver which I think you’ll hear more about soon.
Lately, my focus has been ensuring the project can stand on its own four legs. I have every confidence in other Collaborans hacking on Panfrost, including Boris and Italo Nicola. The project has a bright future. It’s time for me to pass the reins.
I’m still alive. I plan to continue working on Mesa drivers for a long time, including the common infrastructure upon which Panfrost relies. And I’ll still send the odd Panfrost patch now and then. That said, my focus will shift.
I’m not ready to announce what’s in store yet… but maybe you can read between the lines!
Another week, more blog posts is what I meant to say when I started writing this post last Friday. But now it’s Monday, and everything is different.
In particular, zink is different. There’s a branchpoint coming up, and I’ll do a separate post about that and all the new features people can expect, but today’s topic is something else. Something more exciting.
Obviously it’s EXT_shader_object.
You know who is excited about this extension? Me.
That’s right, I said it.
For years now, Tomb Raider (2013) has plagued zink users with separate shader objects that could not be precompiled even with EXT_graphics_pipeline_library. Why? Because the game uses tessellation. And when I suggested we’d probably want that in EXT_graphics_pipeline_library, someone said “oh we can just add that later, it’ll be easy”, and then since it’s Vulkan it wasn’t easy and it didn’t get added.
But then Nintendo came along and solved this problem for me in a much, much better way with EXT_shader_object.
The thing about OpenGL is that ARB_separate_shader_objects is a thing, and it’s a thing for every shader stage. Even if 99% of apps/games only use VS+FS, there’s still that 1% that wants to use it with those other geometry stages.
Like Tomb Raider (2013). And yes, the (2013) is necessary so nobody imagines I’m talking about a more recent, more relevant game.
Some months ago, I implemented basic separate shaders (VS+FS only) using EXT_graphics_pipeline_library. It’s gross. Really just not an ideal way of doing things when mapping to GL. Effectively each stage gets its own mini GPL pipeline which then gets combined on-the-fly for a couple frames of use to avoid stuttering until the real pipeline is done with its background compile.
But this is stupid. The GL architecture is for separate shaders, not for just-in-time linking; we leave the linking under the hood to screw us over when it doesn’t work right so we can complain. It’s a solved problem in that regard. Making this explicit and mapping from one to the other needs all kinds of refcounting, and hash tables, and complexity, and the fact that it works at all is a miracle that science can’t explain.
Now, however, there is a direct 1:1 mapping to separate shaders with EXT_shader_object. If the app compiles a shader, zink compiles that shader (object). If the app binds a shader, zink binds that shader (object). It’s that simple. And then in the background I can still do all the optimized monolithic pipeline compiling like usual to guarantee huge FPS the next time that group of shaders is used together.
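For the curious, here is a minimal sketch of what that 1:1 mapping looks like at the Vulkan level. This is illustrative only, not actual zink code; descriptor set layouts, push constants, and error handling are omitted:

#include <vulkan/vulkan.h>

/* One GL separate shader becomes one unlinked VkShaderEXT (no
 * VK_SHADER_CREATE_LINK_STAGE_BIT_EXT), and binding it is a single
 * vkCmdBindShadersEXT call. */
static VkShaderEXT
create_separate_shader(VkDevice dev, VkShaderStageFlagBits stage,
                       VkShaderStageFlags next_stages,
                       const uint32_t *spirv, size_t spirv_size)
{
   PFN_vkCreateShadersEXT create_shaders =
      (PFN_vkCreateShadersEXT)vkGetDeviceProcAddr(dev, "vkCreateShadersEXT");
   const VkShaderCreateInfoEXT info = {
      .sType = VK_STRUCTURE_TYPE_SHADER_CREATE_INFO_EXT,
      .stage = stage,
      .nextStage = next_stages,
      .codeType = VK_SHADER_CODE_TYPE_SPIRV_EXT,
      .codeSize = spirv_size,
      .pCode = spirv,
      .pName = "main",
   };
   VkShaderEXT shader = VK_NULL_HANDLE;
   create_shaders(dev, 1, &info, NULL, &shader);
   return shader;
}

static void
bind_separate_shader(VkDevice dev, VkCommandBuffer cmd,
                     VkShaderStageFlagBits stage, VkShaderEXT shader)
{
   PFN_vkCmdBindShadersEXT bind_shaders =
      (PFN_vkCmdBindShadersEXT)vkGetDeviceProcAddr(dev, "vkCmdBindShadersEXT");
   bind_shaders(cmd, 1, &stage, &shader);
}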
Finally this one problem game will run without any frame hitching or other issues.
As soon as drivers besides NVIDIA implement it, of course. Thanks NVIDIA for your great Day 1 support of this great extension that solves…
Of this great extension…
Of…
Oh for fuck’s sake.
This game will never run without issues on zink. I’m over it. But you know what I’m not over yet?
This totally unexpected news that Panfrost is now without firm leadership and Alyssa is now without gainful employment. How could such a thing happen?
As everyone who’s anyone in the graphics community knows, SGC is the first place to receive any hiring-related rumors. It was here that the news first broke about Valve hiring some nutjob to work on zink. It was also here that everyone learned Broadcom, Bose, the entire multiverse, and Collabora were all vying to hire the five-time winner of Mesa’s Most Loudest Keyboard On Conference Call award (And yes, I can hear her clacking away towards a sixth win right now).
That’s right. It’s been a while, but I’ve got another scoop. And this one’s big. I couldn’t even believe it when I stumbled upon this, and I’m sure many of you won’t either. That’s why I’m gonna tell you, and then I’m gonna spell it out for you.
Alyssa has been hired by Boston Dynamics to work on driver-level computer vision integration in their robotics systems.
It just makes sense if you stop and think about it. Or if you re-read her last blog post in which she basically spells it out for us:
So yeah, nice try, but you’ll need to put a lot more effort into covering your tracks if you want to conceal your job hops from SGC.
Stay tuned for the crucial details everyone craves on the new Panfrost project leader: do they put toothpaste on their toothbrush before or after wetting the bristles?
As everyone is well aware, the Mesa 23.1 branchpoint is definitely going to be next week, and there is zero chance that it could ever be delayed*.
As everyone is also well aware, this is the release in which I’ve made unbreakable* promises about the viability of gaming on Zink.
Specifically, it will now be viable*.
But exactly one* bug remains as a blocker to that. Just one.
So naturally I had to fix it quick before anyone noticed*.
* Don’t @me for factual inconsistencies in any of the previous statements.
The thing about OpenGL games is a lot of them are x86 binaries, which means they run in a 32bit process. Any 32bit application gets 32bit address space. 32bit address space means a 4GiB limit on addressable memory. But what does that mean?
What is addressable memory? Addressable memory is any memory that can be accessed by a process. If malloc is called, this memory is addressable. If a file is mmaped, this memory is addressable. If GPU memory is mapped, this memory is addressable.
What happens if the limit is exceeded? Boom.
Why is the limit only 4GiB? Stop asking hard questions.
Why is this difficult? The issue from a driver perspective is that this limit includes both the addressable memory from the game (e.g., the game’s internal malloc calls) as well as the addressable memory from the driver (e.g., all the GPU mapped memory). Thus, while I would like to have all 4GiB (or more, really; doesn’t everyone have 32GiB RAM in their device these days?) to use with Zink, I do not have that luxury.
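As a tiny illustration of that last point (my own toy example, not driver code): mapping GPU memory hands the process a CPU pointer, and that mapping lives in the same address space as every allocation the game makes.

#include <vulkan/vulkan.h>

/* Once this succeeds, `size` bytes of the process address space are
 * consumed by the mapping until vkUnmapMemory() is called; in a 32bit
 * process that comes out of the same 4GiB budget the game is using. */
static void *
map_gpu_memory(VkDevice dev, VkDeviceMemory mem, VkDeviceSize size)
{
   void *ptr = NULL;
   if (vkMapMemory(dev, mem, 0, size, 0, &ptr) != VK_SUCCESS)
      return NULL;
   return ptr;
}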
Judging by recent bug reports and the prevalence of 32bit games, it’s pretty bad. Given that I solved GPU map VA leaking a long time ago, the culprit must be memory utilization in the driver. Let’s check out some profiling.
The process for this is simple: capture a (long) trace from a game and then run it through massif. Sound familiar?
The game in this case is, of course, Tomb Raider (2013), the home of our triangle princess. Starting a new game runs through a lot of intro cinematics and loads a lot of assets, and the memory usage is explosive. See what I did there? Yeah, jokes. On a Monday. Whew I need a vacation.
This is where I started:
2.4 GiB memory allocated by the driver. In a modern, 64bit process, where we can make full use of the 64GiB memory in the device, this is not a problem, and we can pretend to be a web browser using this much for a single tab. But here, in a game from an era when memory management was important and not everyone had 128GiB of memory available, that’s not going to fly.
Initial analysis yielded the following pattern:
n3: 112105776 0x5F88B73: nir_intrinsic_instr_create (nir.c:759)
n1: 47129360 0x5F96216: clone_intrinsic (nir_clone.c:358)
n1: 47129360 0x5F9692E: clone_instr (nir_clone.c:496)
n1: 47129360 0x5F96BB4: clone_block (nir_clone.c:563)
n2: 47129360 0x5F96DEE: clone_cf_list (nir_clone.c:617)
n1: 46441568 0x5F971CE: clone_function_impl (nir_clone.c:701)
n3: 46441568 0x5F974A4: nir_shader_clone (nir_clone.c:774)
n1: 28591984 0x67D9DE5: zink_shader_compile_separate (zink_compiler.c:3280)
n1: 28591984 0x69005F8: precompile_separate_shader_job (zink_program.c:2022)
n1: 28591984 0x57647B7: util_queue_thread_func (u_queue.c:309)
n1: 28591984 0x57CD7BC: impl_thrd_routine (threads_posix.c:67)
n1: 28591984 0x4DDB14C: start_thread (in /usr/lib64/libc.so.6)
n0: 28591984 0x4E5BBB3: clone (in /usr/lib64/libc.so.6)
Looking at the code, I found an obvious issue: when I implemented precompile for separate shaders a month or two ago, I had a teensie weensie little bug. Turns out when memory is allocated, it has to be freed or else it becomes unreachable.
This is commonly called a leak.
It wasn’t caught before now because it only affects Tomb Raider and a handful of unit tests.
But I caught it, and it was so minor that I already (“quietly”) landed the fix without anyone noticing.
This sort of thing will be fixed when zink is rewritten in Rust*.
With an actual bug fixed, what does memory utilization look like now?
Down 300MiB to 2.1GiB. A 12.5% reduction. Not that exciting.
Certainly nothing that would warrant a SGC blog post.
My readers have standards.
Time to expand some expandables.
Here’s another common pattern in the massif output:
n4: 317700704 0x57570DA: ralloc_size (ralloc.c:117)
n1: 226637184 0x57583BB: create_slab (ralloc.c:759)
n3: 226637184 0x5758579: gc_alloc_size (ralloc.c:789)
n6: 215583536 0x575868C: gc_zalloc_size (ralloc.c:814)
n7: 91059504 0x5F88CE6: nir_alu_instr_create (nir.c:696)
n4: 35399104 0x5F90C49: nir_build_alu2 (nir_builder.c:162)
n0: 12115376 in 29 places, all below massif's threshold (1.00%)
n1: 11690848 0x67C90F1: nir_iadd (nir_builder_opcodes.h:1309)
n2: 11690848 0x67CB493: nir_iadd_imm (nir_builder.h:719)
n1: 6074016 0x67D691C: remove_bo_access_instr (zink_compiler.c:2013)
n1: 6074016 0x67C89A9: nir_shader_instructions_pass (nir_builder.h:88)
n1: 6074016 0x67D6DB2: remove_bo_access (zink_compiler.c:2044)
n1: 6074016 0x67E4827: zink_shader_create (zink_compiler.c:4409)
n1: 6074016 0x690443E: zink_create_gfx_shader_state (zink_program.c:1885)
n1: 6074016 0x623484B: util_live_shader_cache_get (u_live_shader_cache.c:141)
n1: 6074016 0x69044CC: zink_create_cached_shader_state (zink_program.c:1900)
This is some ralloc usage from zink’s shader creation. In short, the in-memory shader IR is…
Hold on. Doesn’t this sound familiar?
It turns out that nothing is ever new, and all problems have been solved before. By applying the exact same solution, we’re gonna start to see some big movement in these numbers.
Serialized NIR is much more compact than object-form NIR. The memory footprint is an order of magnitude smaller, which begs the question: why would anyone ever store NIR structs in memory?
I don’t have an answer. One might try to make the argument that it makes shader variant creation easier, but then, it also needs to be said that shader variants require the NIR to be cloned anyway, which deserialization already (functionally) does. There’s shader_info, but that’s small, unchanging, and can be easily copied. I think it’s just convenience. And that’s fine.
But it’s not fine for me or zink.
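The approach, sketched below under my own naming rather than the actual zink changes, is to keep only a serialized blob around and rehydrate a nir_shader when a variant actually needs it:

#include "nir.h"
#include "nir_serialize.h"
#include "util/blob.h"

/* Store the shader as a compact serialized blob instead of keeping the
 * full nir_shader object tree resident in memory. */
static struct blob
pack_shader(const nir_shader *nir)
{
   struct blob blob;
   blob_init(&blob);
   nir_serialize(&blob, nir, false /* don't strip debug info */);
   return blob;
}

/* Rehydrate a fresh nir_shader when a variant needs to be compiled; this is
 * functionally the clone that variant creation needed anyway. */
static nir_shader *
unpack_shader(const struct blob *blob,
              const struct nir_shader_compiler_options *options)
{
   struct blob_reader reader;
   blob_reader_init(&reader, blob->data, blob->size);
   return nir_deserialize(NULL, options, &reader);
}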
Thus, I began converting all the NIR objects I was keeping around (and there’s lots) to serialized form. The first task was tackling zink_shader::nir, the object that exists for every shader created in the driver. How much would this help?
Down another 500MiB to 1.6GiB total. That’s another 24% reduction.
Now we’re getting somewhere.
But again, SGC enthusiasts have standards, and a simple 33% improvement from where things started is hardly worth mentioning here, so I apologize for wasting time.
Continuing, it’s easy to keep finding these patterns:
n1: 64055264 0x57583BB: create_slab (ralloc.c:759)
n2: 64055264 0x5758579: gc_alloc_size (ralloc.c:789)
n6: 61664176 0x575868C: gc_zalloc_size (ralloc.c:814)
n2: 22299104 0x5F88CE6: nir_alu_instr_create (nir.c:696)
n1: 19814432 0x60B3804: read_alu (nir_serialize.c:905)
n1: 19814432 0x60B6713: read_instr (nir_serialize.c:1787)
n1: 19814432 0x60B69BD: read_block (nir_serialize.c:1856)
n1: 19814432 0x60B6D6A: read_cf_node (nir_serialize.c:1949)
n2: 19814432 0x60B6EA0: read_cf_list (nir_serialize.c:1976)
n1: 19195888 0x60B708A: read_function_impl (nir_serialize.c:2012)
n1: 19195888 0x60B7C2A: nir_deserialize (nir_serialize.c:2219)
n2: 19195888 0x67E754A: zink_shader_deserialize (zink_compiler.c:4820)
n2: 19195888 0x6901899: zink_create_gfx_program (zink_program.c:1041)
n1: 17921504 0x6901C6C: create_linked_separable_job (zink_program.c:1105)
n1: 17921504 0x57647B7: util_queue_thread_func (u_queue.c:309)
n1: 17921504 0x57CD7BC: impl_thrd_routine (threads_posix.c:67)
n1: 17921504 0x4DDB14C: start_thread (in /usr/lib64/libc.so.6)
n0: 17921504 0x4E5BBB3: clone (in /usr/lib64/libc.so.6)
This one is from the NIR copy that happens when linking shaders. Simple enough to compress.
New graph:
An additional 37.5% reduction to 1.0GiB? That’s not too shabby. Now we’re looking at an overall 58% reduction in memory utilization. This is the kind of improvement that SGC readers have come to expect.
But wait! I was doing all this last week. And the start of this post was a really long time ago, but wasn’t there something else causing high memory utilization last week?
That’s right, these graphs are still being hit by the now-fixed RADV shader IR ballooning.
What happens if I apply that fix too?
482.7MiB total memory usage.
That’s another 51.7% improvement.
Overall a 79.9% reduction in memory usage. I’d expect similar (or greater?) savings for all games.
The MR is up now, and I expect it should be merged soon™.
Doesn’t this negatively affect performance?
No.
But doesn’t using more memory improve performance?
No.
What will I do with the rest of my 256GiB RAM?
Open two more browser tabs.
As everyone expects, Khronos has recently done a weekly spec update for Vulkan.
What nobody expected was that this week’s update would include a mammoth extension, VK_EXT_shader_object.
Or that it would be developed by Nintendo?
It’s a very cool extension for Zink. Effectively, it means (unoptimized) shader variants can be generated very fast. So fast that the extension should solve all the remaining issues with shader compilation and stuttering by enabling applications (zink) to create and bind shaders directly without the need for pipeline objects.
Widespread adoption in the ecosystem will take time, but Lavapipe has day one support as everyone expects for all the cool new extensions that I work on.
Since Samuel is looming over me, I must say that it is unlikely RADV will have support for this landed in time for 23.1, though there is an implementation in the works which passes all of CTS. A lot of refactoring is involved. Like, a lot.
But we’re definitely, 100% committed to shipping GPL by default, or you’ve lost the game.
As of today, gitlab.freedesktop.org allows anyone with a GitLab Developer role or above to remove spam issues. If you are reading this article a while after it's published, it's best to refer to the damspam README for up-to-date details. I'm going to start with the TLDR first.
Create a personal access token with API access and save the token value as $XDG_CONFIG_HOME/damspam/user.token. Then run the following commands with your project's full path (e.g. mesa/mesa, pipewire/wireplumber, xorg/lib/libX11):
$ pip install git+https://gitlab.freedesktop.org/freedesktop/damspam
$ damspam request-webhook foo/bar
# clean up, no longer needed.
$ pip uninstall damspam
$ rm $XDG_CONFIG_HOME/damspam/user.token

The damspam command will file an issue in the freedesktop/fdo-bots repository. This issue will be automatically processed by a bot and should be done by the time you finish the above commands, see this issue for an example. Note: the issue processing requires a git push to an internal repo - if you script this for multiple repos please put a sleep(30) in to avoid conflicts.
Once the request has been processed (and again, this should be instant), any issue in your project that gets assigned the label Spam will be processed automatically by damspam. See the next section for details.
Once the maintainer for your project has requested the webhook, simply assign the Spam label to any issue that is spam. The issue creator will be blocked (i.e. cannot login), this issue and any other issue filed by the same user will be closed and made confidential (i.e. they are no longer visible to the public). In the future, one of the GitLab admins can remove that user completely but meanwhile, they and their spam are gone from the public eye and they're blocked from producing more. This should happen within seconds of assigning the Spam label.
Create a personal access token with API access for the @spambot user and save the token value as $XDG_CONFIG_HOME/damspam/spambot.token. This is so you can operate as spambot instead of your own user. Then run the following command to remove all tagged spammers:
$ pip install git+https://gitlab.freedesktop.org/freedesktop/damspam
$ damspam purge-spammers

The last command will list any users that are spammers (together with an issue that should make it simple to check whether it is indeed spam) and after interactive confirmation purge them as requested. At the time of writing, the output looks like this:
$ damspam purge-spammers
0: naughtyuser : https://gitlab.freedesktop.org/somenamespace/project/-/issues/1234: [STREAMING@TV]!* LOOK AT ME
1: abcuseless : https://gitlab.freedesktop.org/somenamespace/project/-/issues/4567: ((@))THIS STREAM IS IMPORTANT
2: anothergit : https://gitlab.freedesktop.org/somenamespace/project/-/issues/8778: Buy something, really
3: whatawasteofalife : https://gitlab.freedesktop.org/somenamespace/project/-/issues/9889: What a waste of oxygen I am
Purging a user means a full delete including all issues, MRs, etc. This is nonrecoverable!
Please select the users to purge:
[q]uit, purge [a]ll, or the index:
Purging the spammers will hard-delete them and remove anything they ever did on gitlab. This is irreversible.
There are two components at play here: hookiedookie, a generic webhook dispatcher, and damspam which handles the actual spam issues. Hookiedookie provides an HTTP server and "does things" with JSON data on request. What it does is relatively generic (see the Settings.yaml example file) but it's set up to be triggered by a GitLab webhook and thus receives this payload. For damspam the rules we have for hookiedookie come down to something like this: if the URL is "webhooks/namespace/project" and damspam is set up for this project and the payload is an issue event and it has the "Spam" label in the issue labels, call out to damspam and pass the payload on. Other rules we currently use are automatic reload on push events or the rule to trigger the webhook request processing bot as above.
This is also the reason a maintainer has to request the webhook. When the request is processed, the spambot installs a webhook with a secret token (a uuid) in the project. That token will be sent as header (a standard GitLab feature). The project/token pair is also added to hookiedookie and any webhook data must contain the project name and matching token, otherwise it is discarded. Since the token is write-only, no-one (not even the maintainers of the project) can see it.
damspam gets the payload forwarded but is otherwise unaware of how it is invoked. It checks the issue, fetches the data needed, does some safety checks and, if it determines that yes, this is spam, then it closes the issue, makes it confidential, blocks the user and then recurses into every issue this user ever filed. Not necessarily in that order. There are some safety checks, so you don't have to worry about it suddenly blocking every project member.
For a while now, we've suffered from a deluge of spam (and worse) that makes it through the spam filters. GitLab has a Report Abuse feature for this but it's... woefully incomplete. The UI guides users to do the right thing - as reporter you can tick "the user is sending spam" and it automatically adds a link to the reported issue. But: none of this useful data is visible to admins. Seriously, look at the official screenshots. There is no link to the issue, all you get is a username, the user that reported it and the content of a textbox that almost never has any useful information. The link to the issue? Not there. The selection that the user is a spammer? Not there.
For an admin, this is frustrating at best. To verify that the user is indeed sending spam, you have to find the issue first. Which, at best, requires several clicks and digging through the profile activities. At worst you know that the user is a spammer because you trust the reporter but you just can't find the issue for whatever reason.
But even worse: reporting spam does nothing immediately. The spam stays up until an admin wakes up, reviews the abuse reports and removes that user. Meanwhile, the spammer can happily keep filing issues against the project. Overall, it is not a particularly great situation.
With hookiedookie and damspam, we're now better equipped to stand against the tide of spam. Anyone who can assign labels can help fight spam and the effect is immediate. And it's - for our use-cases - safe enough: if you trust someone to be a developer on your project, we can trust them to not willy-nilly remove issues pretending they're spam. In fact, they probably could've deleted issues beforehand already anyway if they wanted to make them disappear.
While we're definitely aiming at gitlab.freedesktop.org, there's nothing in particular that requires this instance. If you're the admin for a public gitlab instance feel free to talk to Benjamin Tissoires or me to check whether this could be useful for you too, and what changes would be necessary.
I got a report recently that Dota2 was using too much memory on RADV. Host memory, that is, not GPU. So how does one profile Dota2 memory usage?
On Linux, the ideal tool for memory profiling is massif. But does a flawless, unparalleled game like Dota2 run under massif?
Sort of maybe almost but not really.
And it’s not the best way to do it anyway since, for profiling, the ideal scenario is to run a static test. Thus, gfxreconstruct is the best way to test things here. Simply set VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_gfxreconstruct in your ultra secret Source Engine debug console and let it run.
Then queue up a replay under massif and find something else to do for the next half hour since it’s not exactly a speedy process.
Opening up massif-visualizer, here’s what I saw just from getting to the Dota2 title screen:
Yikes.
3GiB memory utilization just from the driver? And nearly all of it just from in-memory shader IR?
Simplified, this is a byproduct of VK_PIPELINE_CREATE_RETAIN_LINK_TIME_OPTIMIZATION_INFO_BIT_EXT with GPL (RADV_PERFTEST=gpl, definitely will be enabled by default in Mesa 23.1 or your money back). This bit requires that the driver keep around any internal data needed to later generate an optimized pipeline from the same sources as the fast-linked pipeline.
And RADV, like an overzealous hoarder, keeps those sources around in pristine condition, sucking up all your memory in the meanwhile.
As we know and expect, once the problem was revealed, noted Xtreme Coding World Champion Samuel Pitoiset sprang into action. Mesa uses NIR for its shader IR, and this infrastructure comes with its own convenient set of serialization (compression) utilities. By serializing the in-memory shaders, memory is saved.
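To illustrate the general idea, here is a hedged C sketch using Mesa's NIR and blob utilities; this is not the actual RADV change, the retained_shader container and helper names are made up, and include paths are approximate:
/* Sketch: instead of keeping a live nir_shader around for later
 * optimized-pipeline creation, serialize it into a compact blob, free the
 * in-memory IR, and deserialize only when it is actually needed. */
#include <stddef.h>
#include "nir.h"
#include "nir_serialize.h"
#include "util/blob.h"

struct retained_shader {   /* hypothetical container */
   void *data;
   size_t size;
};

static void retain_shader(struct retained_shader *rs, nir_shader *nir)
{
   struct blob blob;
   blob_init(&blob);
   nir_serialize(&blob, nir, true /* strip debug info */);
   blob_finish_get_buffer(&blob, &rs->data, &rs->size);
   ralloc_free(nir);       /* drop the expensive in-memory IR */
}

static nir_shader *restore_shader(const struct retained_shader *rs,
                                  const nir_shader_compiler_options *opts)
{
   struct blob_reader reader;
   blob_reader_init(&reader, rs->data, rs->size);
   return nir_deserialize(NULL, opts, &reader);
}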
How much memory, you ask?
Yes, that’s an 85% reduction in memory utilization.
And it will soon land for everyone to enjoy.
After my last blogpost, I kept developing the Rust version of the VGEM driver, also known as rustgem for now. Previously, I had developed two important features of the driver: the ability to attach a fence and the ability to signal a fence. Still, one important feature is missing: the ability to prevent hangs. Currently, if the fence is not signaled, the driver will simply hang. So, we can create a callback that signals the fence when it is not signaled by the user for more than 10 seconds.
In order to create this callback, we need a Timer that will trigger it after the specified amount of time. Thankfully, the Linux kernel provides us with a timer that can be set up with a callback and a timeout. But, to use it in Rust code, we need a safe abstraction that will ensure the code is safe under some assumptions.
Initially, I was developing an abstraction on my own as I checked the RfL tree and there were no Timer abstractions available.
The most important question here is “how can we guarantee access to other
objects inside the callback?”. The callback receives only a pointer to the
struct timer_list as its single argument. Naturally, we can think about using
a container_of macro. In order to make the compatibility layer between
Rust and the C callback, I decided to store the object inside the Timer. Yep, I
didn’t like that a lot, but it was the solution I came up with at the time. The
struct looked something like this:
/// A driver-specific Timer Object
//
// # Invariants
// timer is a valid pointer to a struct timer_list and we own a reference to it.
#[repr(C)]
pub struct UniqueTimer<T: TimerOps<Inner = D>, D> {
timer: *mut bindings::timer_list,
inner: D,
_p: PhantomData<T>,
}
Moreover, the second important question I had was “how can the user pass a callback function to the timer?”. There were two possibilities: using a closure or using a Trait. I decided to go down the trait path. Things would have been kind of similar if I had decided to go down the closure path.
/// Trait which must be implemented by driver-specific timer objects.
pub trait TimerOps: Sized {
/// Type of the Inner data inside the Timer
type Inner;
/// Timer callback
fn timer_callback(timer: &UniqueTimer<Self, Self::Inner>);
}
With those two questions solved, it seems that we are all set and good to go. So, we can create methods to initialize the timer and modify the timer’s timeout, implement the Drop trait, and use the following callback by default:
unsafe extern "C" fn timer_callback<T: TimerOps<Inner = D>, D: Sized>(
timer: *mut bindings::timer_list,
) {
let timer = crate::container_of!(timer, UniqueTimer<T, D>, timer)
as *mut UniqueTimer<T, D>;
// SAFETY: The caller is responsible for passing a valid timer_list subtype
T::timer_callback(unsafe { &mut *timer });
}
All should work, right? Well… No. I didn’t really mention how I was allocating
memory, and let’s just say I was initially allocating it incorrectly, so the
container_of macro was pointing to the wrong memory space.
Initially, I was allocating only the timer field with the kernel memory allocator
krealloc and allocating the rest of the struct with Rust’s memory allocator.
By making such a mess, container_of wasn’t able to point to the right
memory address.
I had to change things a bit to allocate the whole struct UniqueTimer with
the kernel’s memory allocator. However, krealloc returns a raw pointer, and it
wouldn’t be nice for the end user to have to deal with a raw pointer to the object.
So I wrapped it up inside another struct that can be dereferenced into the
UniqueTimer object.
/// A generic Timer Object
///
/// This object should be instantiated by the end user, as it holds
/// a unique reference to the UniqueTimer struct. The UniqueTimer
/// methods can be used through it.
pub struct Timer<T: TimerOps<Inner = D>, D>(*mut UniqueTimer<T, D>);
impl<T: TimerOps<Inner = D>, D> Timer<T, D> {
/// Create a timer for its first use
pub fn setup(inner: D) -> Self {
let t = unsafe {
bindings::krealloc(
core::ptr::null_mut(),
core::mem::size_of::<UniqueTimer<T, D>>(),
bindings::GFP_KERNEL | bindings::__GFP_ZERO,
) as *mut UniqueTimer<T, D>
};
// SAFETY: The pointer is valid, so pointers to members are too.
// After this, all fields are initialized.
unsafe {
addr_of_mut!((*t).inner).write(inner);
bindings::timer_setup(addr_of_mut!((*t).timer), Some(timer_callback::<T, D>), 0)
};
Self(t)
}
}
And then the container_of macro started working! Now, I could set up a Timer
for each fence and keep the fence inside the timer. Finally, I could use the
fence inside the timer to signal it when it was not signaled by the user for
more than 10 seconds.
impl TimerOps for VgemFenceOps {
type Inner = UniqueFence<Self>;
fn timer_callback(timer: &UniqueTimer<Self, UniqueFence<Self>>) {
let _ = timer.inner().signal();
}
}
So, I tested the driver with IGT using the vgem_slow test and it was now
passing! All IGT tests were passing and it looked like the driver was
practically completed (some FIXME problems notwithstanding). But, let’s see if
this abstraction is really safe…
First, let’s inspect the struct timer_list in the C code.
struct timer_list {
struct hlist_node entry;
unsigned long expires;
void (*function)(struct timer_list *);
u32 flags;
};
By looking at this struct, we can see a problem with my abstraction: through its list entry, a timer can end up pointing to itself, i.e. the type is self-referential. If you are not familiar with Rust, this can seem normal, but self-referential types can lead to undefined behavior (UB).
Let’s say we have an example type with two fields: u32 and a pointer to this
u32 value. Initially, everything looks fine, the pointer field points to the
value field in memory address A, which contains a valid u32, and all pointers
are valid. But Rust has the freedom to move values around memory. For
example, if we pass this struct into another function, it might get moved to a
different memory address. So, the once-valid pointer is no longer valid, because
when we move the struct, the struct’s fields change their addresses, but not their
values. Now, the pointer field still points to memory address A, although the
value field is now located at memory address B. This is really bad and can
lead to UB.
The solution is to make timer_list !Unpin (i.e. opt out of the Unpin auto trait). This means
that to use this type safely, we can’t use regular pointers for self-reference.
Instead, we use special pointers that “pin” their values into place, ensuring
they can’t be moved.
Still looking at the struct timer_list, it is possible to notice that a timer
can queue itself in the timer function. This functionality is not covered by my
current abstraction.
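For reference, this is the kind of C-side pattern the abstraction would need to cover: a minimal, hedged sketch of a classic kernel timer that re-arms itself from its own callback (my_timer_fn and the 100ms period are made up for illustration):
#include <linux/timer.h>
#include <linux/jiffies.h>

static struct timer_list my_timer;

/* The callback receives the timer_list pointer and may requeue its own timer. */
static void my_timer_fn(struct timer_list *t)
{
    /* ... do the periodic work ... */
    mod_timer(t, jiffies + msecs_to_jiffies(100)); /* re-arm from inside the callback */
}

static void my_timer_start(void)
{
    timer_setup(&my_timer, my_timer_fn, 0);
    mod_timer(&my_timer, jiffies + msecs_to_jiffies(100));
}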
Moreover, I was using jiffies to modify the timeout duration and I was adding a
Duration to the jiffies. This is problematic, because it can cause data
races: reading jiffies and adding a duration to them should be an atomic
operation.
Huge thanks to the RfL folks who pointed out the errors in my implementation!
With all these problems pointed out, it is time to fix them! I could have reimplemented my safe abstraction, but the RfL folks pointed me to a Timer abstraction that they are developing in a downstream tree. Therefore, I decided to use their Timer abstraction.
There were two options to implement a Timer abstraction:
- A Timeout trait to the VgemFence struct
- The FnTimer abstraction
In the end, I decided to go with the second approach. The FnTimer receives a
closure that will be executed at the timeout. The closure can return an enum that
indicates if the timer is done or if it should be rescheduled.
When implementing the timer, I had a lot of borrow checker problems. See…
I need to use the Fence object inside the callback and also move the Fence
object at the end of the function. So, I got plenty of “cannot move out of
fence because it is borrowed” errors. Also, I needed the Timer to be dropped
at the same time as the fence, so I needed to store the Timer inside the
VgemFence struct.
The solution to the problems: smart pointers! I boxed the FnTimer (and the closure
inside the FnTimer) so that I could store it inside the VgemFence struct.
That fixed the second problem. But I still couldn’t use the fence inside the
closure, because it wasn’t encapsulated inside a smart pointer. So, I used an
Arc to wrap the Fence, cloned it, and moved the clone into the scope of the closure.
pub(crate) struct VgemFence {
fence: Arc<UniqueFence<Fence>>,
_timer: Box<FnTimer<Box<dyn FnMut() -> Result<Next> + Sync>>>,
}
impl VgemFence {
pub(crate) fn create() -> Result<Self> {
let fence_ctx = FenceContexts::new(1, QUEUE_NAME, &QUEUE_CLASS_KEY)?;
let fence = Arc::try_new(fence_ctx.new_fence(0, Fence {})?)?;
// SAFETY: The caller calls [`FnTimer::init_timer`] before using the timer.
let t = Box::try_new(unsafe {
FnTimer::new(Box::try_new({
let fence = fence.clone();
move || {
let _ = fence.signal();
Ok(Next::Done)
}
})? as Box<_>)
})?;
// SAFETY: As FnTimer is inside a Box, it won't be moved.
let ptr = unsafe { core::pin::Pin::new_unchecked(&*t) };
timer_init!(ptr, 0, "vgem_timer");
// SAFETY: Duration.as_millis() returns a valid total number of whole milliseconds.
let timeout =
unsafe { bindings::msecs_to_jiffies(Duration::from_secs(10).as_millis().try_into()?) };
// We force the fence to expire within 10s to prevent driver hangs
ptr.raw_timer().schedule_at(jiffies_later(timeout));
Ok(Self { fence, _timer: t })
}
}
You can observe in this code that the initialization of the FnTimer uses an
unsafe operation. This happens because we still don’t have Safe Pinned
Initialization.
But the RfL folks are working hard to land this feature and improve ergonomics
when using Pin.
Now, running the vgem_slow IGT test again, you can see that all IGT tests are
passing!
During this time, many improvements landed in the driver: all the objects are being properly dropped, including the DRM device; all error cases are returning the correct error; the SAFETY comments are properly written and most importantly, the timeout feature was introduced. With that, all IGT tests are passing and the driver is functional!
Now, the driver is in good shape, apart from one FIXME problem: currently, the IOCTL abstraction doesn’t support drivers whose IOCTLs don’t start at 0x00, and the VGEM driver starts its IOCTLs at 0x01. I don’t know yet how to bypass this problem without adding a dummy IOCTL as 0x00, but I hope to find a solution to it soon.
The progress of this project can be followed in this PR and I hope to see this project being integrated upstream in the future.
In previous posts, “Graphics Flight Recorder - unknown but handy tool to debug GPU hangs” and “Debugging Unrecoverable GPU Hangs”, I demonstrated a few tricks for identifying the location of a GPU fault.
But what’s the next step once you’ve roughly pinpointed the issue? What if the problem is only sporadically reproducible and the only way to ensure consistent results is by replaying a trace of raw GPU commands? How can you precisely determine the cause and find a proper fix?
Sometimes, you may have an inkling of what’s causing the problem, and you can simply modify the driver’s code to see if it resolves the issue. However, there are instances where the root cause remains elusive, or you only want to change a specific value without affecting the same register before and after it.
The optimal approach in these situations is to directly modify the commands sent to the GPU. The ability to arbitrarily edit the command stream was always an obvious idea and has crossed my mind numerous times (and not only mine – proprietary driver developers seem to employ similar techniques). Finally, the stars aligned: my frustration with a recent bug, the kernel’s new support for user-space-defined GPU addresses for buffer objects, the tool I wrote to replay command stream traces not so long ago, and the realization that implementing a command stream editor was not as complicated as initially thought.
The end result is a tool for Adreno GPUs (with msm kernel driver) to decompile, edit, and compile back command streams: “freedreno,turnip: Add tooling to edit command streams and use them in ‘replay’”.
The primary advantage of this command stream editing tool lies in the ability to rapidly iterate over hypotheses. Another highly valuable feature (which I have plans for) would be the automatic bisection of the command stream, which would be particularly beneficial in instances where only the bug reporter has the necessary hardware to reproduce the issue at hand.
# Decompile one command stream from the trace
./rddecompiler -s 0 gpu_trace.rd > generate_rd.c
# Compile the executable which would output the command stream
meson setup . build
ninja -C build
# Override the command stream with the commands from the generator
./replay gpu_trace.rd --override=0 --generator=./build/generate_rd
Reading dEQP-VK.renderpass.suballocation.formats.r5g6b5_unorm_pack16.clear.clear.rd...
gpuid: 660
Uploading iova 0x100000000 size = 0x82000
Uploading iova 0x100089000 size = 0x4000
cmdstream 0: 207 dwords
generating cmdstream './generate_rd --vastart=21441282048 --vasize=33554432 gpu_trace.rd'
Uploading iova 0x4fff00000 size = 0x1d4
override cmdstream: 117 dwords
skipped cmdstream 1: 248 dwords
skipped cmdstream 2: 223 dwords
The decompiled code isn’t pretty:
/* pkt4: GRAS_SC_SCREEN_SCISSOR[0].TL = { X = 0 | Y = 0 } */
pkt4(cs, REG_A6XX_GRAS_SC_SCREEN_SCISSOR_TL(0), (2), 0);
/* pkt4: GRAS_SC_SCREEN_SCISSOR[0].BR = { X = 32767 | Y = 32767 } */
pkt(cs, 2147450879);
/* pkt4: VFD_INDEX_OFFSET = 0 */
pkt4(cs, REG_A6XX_VFD_INDEX_OFFSET, (2), 0);
/* pkt4: VFD_INSTANCE_START_OFFSET = 0 */
pkt(cs, 0);
/* pkt4: SP_FS_OUTPUT[0].REG = { REGID = r0.x } */
pkt4(cs, REG_A6XX_SP_FS_OUTPUT_REG(0), (1), 0);
/* pkt4: SP_TP_RAS_MSAA_CNTL = { SAMPLES = MSAA_FOUR } */
pkt4(cs, REG_A6XX_SP_TP_RAS_MSAA_CNTL, (2), 2);
/* pkt4: SP_TP_DEST_MSAA_CNTL = { SAMPLES = MSAA_FOUR } */
pkt(cs, 2);
/* pkt4: GRAS_RAS_MSAA_CNTL = { SAMPLES = MSAA_FOUR } */
pkt4(cs, REG_A6XX_GRAS_RAS_MSAA_CNTL, (2), 2);
Shader assembly is editable:
const char *source = R"(
shps #l37
getone #l37
cov.u32f32 r1.w, c504.z
cov.u32f32 r2.x, c504.w
cov.u32f32 r1.y, c504.x
....
end
)";
upload_shader(&ctx, 0x100200d80, source);
emit_shader_iova(&ctx, cs, 0x100200d80);
However, not everything is currently editable, such as descriptors. Despite these limitations, the existing functionality is sufficient for the majority of cases.
Hi all!
In the past week or so I’ve focused on an NPotM: go-imap, an IMAP library for Go. “But Simon, a New
Project of the Month is supposed to be new!” Right, right, the NPotM is a
lie… But only a half-lie: I’m rewriting it from scratch. go-imap was one of the
first Go projects I’ve written, and I couldn’t recite the IMAP4rev1 RFC by
heart at the time. This is just a roundabout way to say that mistakes were
made. IMAP extensions — a lot of which provide important functionality — were
designed to be implemented out-of-tree in separate Go modules. However many
extensions change the behavior of existing commands, so trying to design a
modular system is a fool’s errand which only results in a more complicated API.
Go channels were (ab)used in the public API. The internals were not
designed with goroutine safety in mind, so races were duct-taped after the fact.
It’s not possible to run multiple IMAP commands concurrently: each time
a command is sent, the caller gets to twiddle their thumbs until the reply
comes back before sending a new one, paying the full price of the roundtrip.
The parser has a weird intermediate representation based on interface{} Go
values. Many functions and types are exported in the public API but really
shouldn’t be.
For all of these reasons, I’ve decided to start from scratch rather than trying to incrementally improve the library. This turned out to be a good decision: in one week, I had a working client which has less bugs and more features than go-imap v1. I based my work on the newer IMAP4rev2 RFC, which provides a better base feature set than IMAP4rev1. I’ve ported alps to the new API to make sure I didn’t miss anything. I still need to write the server part and tests.
In IRC news, the soju database message store submitted by delthas has finally
been merged. Now, the message history can be stored in the database rather than
in plain-text files. This enables features such as full-text search and
retaining IRCv3 message tags. The goguma mobile client now has a gallery view
for images, supports replies via the reply client tag (in a non-intrusive
fashion), and scrolls to the unread indicator when a conversation is opened.
As usual, I worked on many other smaller tasks in other projects. The wlroots
output layers have been merged, but are still opt-in and require compositor
support. lists.sr.ht now uses go-emailthreads to display replies in the patch
view. hut pages publish can now take a directory as input, and will generate
the tarball on-the-fly. There are many other tiny improvements I could mention,
but it’d get boring, let’s wrap up this status update. See you next month!
The blog has been there before, it’ll be there again, I’m over it.
Let’s talk about the only thing anyone cares about on this blog for the 23.1 release cycle: perf.
What does it mean? Who knows.
How does one acquire it? See above.
But everyone wants it, so I’ve gotta deliver like a hypothetical shipping company that delivers on-time and actually rings the bell when they drop something off instead of just hucking a box onto my front porch and running away to leave my stuff sitting out to get snowed on.
Unlike such a hypothetical shipping company which doesn’t exist, perf does exist, and I’ve got it now. Lots of it.
Let’s talk about what parts of my soul I had to sell to get to this point in my life.
Fans of the blog will recall I previously wrote a post that had great SEO for problem spaces and temporal ordering of events. This functionality, rewriting GL commands on-the-fly, has since been dubbed “time-bending” by prominent community members, so I guess that’s what we’re calling it since naming is the most important part of any technical challenge.
Initially, I implemented time-bending to be able to reorder some barriers and transfer operations and avoid splitting renderpasses, something that only the savages living in tiling GPU land care about. Is it useful on immediate-mode rendering GPUs, the ones we use in our daily lives for real work (NOT GAMING DON’T ASK ABOUT GAMING)?
Maybe? Outlook uncertain. Try again later.
So I’m implementing these great features with great names that are totally getting used by real people, and then notable perf anthropologist Emma Anholt comes to me and asks why zink is spewing out a thousand TRANSFER barriers for sequences of subdata calls. I’m sitting in front of my computer staring at the charts, checking over the data, looking at the graphs, and I don’t have an answer.
At that point in my life, three weeks ago before I embarked on a sleepless, not-leaving-my-house-until-the-perf-was-extant adventure that got me involved with dark forces too horrifying to describe, I thought I knew how to perf. It’s something anyone can do, isn’t it?
Obviously not. perf isn’t even a verb.
I had to start by discarding any ideas I had about perf. Clearly I knew nothing based on the reports I was getting: zink was being absolutely slaughtered in benchmarks across the board to a truly embarrassing degree by both ANGLE and native drivers.
Again, NOT GAMING RELATED DON’T EVEN THINK ABOUT ASKING ABOUT GAMING I DON’T PLAY GAMES.
This question about TRANSFER barriers ended up being the spark of inspiration that would lead to perf. Eventually.
To make it simple, suppose this sequence of operations occurs:
copy(src_buffer, dst_buffer, src_offset=0, dst_offset=0, size=64)
copy(src_buffer, dst_buffer, src_offset=128, dst_offset=128, size=64)
copy(src_buffer, dst_buffer, src_offset=256, dst_offset=256, size=64)
Now here’s a trick question: are barriers required between these operations?
If you answered no, you’re right. But also you’re wrong if you’re on certain hardware/driver combos which are (currently) extremely broken. Let’s pretend we don’t know about those for the sake of what sanity remains.
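In raw Vulkan terms, the sequence above is just three non-overlapping vkCmdCopyBuffer calls recorded back to back. A hedged sketch (buffer creation and command buffer setup omitted; record_copies is a made-up helper):
#include <vulkan/vulkan.h>

static void record_copies(VkCommandBuffer cmd, VkBuffer src, VkBuffer dst)
{
   const VkBufferCopy regions[] = {
      { .srcOffset = 0,   .dstOffset = 0,   .size = 64 },
      { .srcOffset = 128, .dstOffset = 128, .size = 64 },
      { .srcOffset = 256, .dstOffset = 256, .size = 64 },
   };
   /* No vkCmdPipelineBarrier() between them: the regions don't overlap. */
   for (unsigned i = 0; i < 3; i++)
      vkCmdCopyBuffer(cmd, src, dst, 1, &regions[i]);
}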
But what did zink, as of a few weeks ago, do?
copy(src_buffer, dst_buffer, src_offset=0, dst_offset=0, size=64)
barrier(TRANSFER)
copy(src_buffer, dst_buffer, src_offset=128, dst_offset=128, size=64)
barrier(TRANSFER)
copy(src_buffer, dst_buffer, src_offset=256, dst_offset=256, size=64)
Brilliant. Instead of pipelining the copies, the driver is instead pointlessly stalling between each one. This is what perf looks like, right?
The simple fix for this (that I implemented) was to track “live” transfer regions on resources, i.e., the unflushed regions that have writes pending in the current batch, and then query these regions for any subsequent read/write operation to determine whether a barrier is needed. This enables omitting any and all barriers for non-overlapping regions. Thus, the command stream looks like:
copy(src_buffer, dst_buffer, src_offset=0, dst_offset=0, size=64)
copy(src_buffer, dst_buffer, src_offset=128, dst_offset=128, size=64)
copy(src_buffer, dst_buffer, src_offset=256, dst_offset=256, size=64)
without any barriers stalling the GPU workload.
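The bookkeeping behind this can be sketched roughly like so. This is a hedged illustration of the overlap check, not zink's actual data structures; live_tracker, write_needs_barrier, and the fixed-size array are all made up:
#include <stdbool.h>
#include <stdint.h>

struct live_region {
   uint64_t offset, size;
};

struct live_tracker {                 /* hypothetical per-resource tracker */
   struct live_region regions[64];
   unsigned count;
};

static bool regions_overlap(uint64_t a_off, uint64_t a_size,
                            uint64_t b_off, uint64_t b_size)
{
   return a_off < b_off + b_size && b_off < a_off + a_size;
}

/* Returns true if a barrier is needed before writing [offset, offset+size)
 * in the current batch; otherwise records the write as a live region. */
static bool write_needs_barrier(struct live_tracker *t,
                                uint64_t offset, uint64_t size)
{
   for (unsigned i = 0; i < t->count; i++) {
      if (regions_overlap(t->regions[i].offset, t->regions[i].size, offset, size))
         return true;
   }
   if (t->count < 64)
      t->regions[t->count++] = (struct live_region){offset, size};
   return false;
}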
But this still isn’t perf.
Let’s look at another sequence of operations:
copy(src_buffer, dst_buffer, src_offset=0, dst_offset=0, size=64)
draw(index_buffer=dst_buffer, index_offset=0)
copy(src_buffer, dst_buffer, src_offset=128, dst_offset=128, size=64)
draw(index_buffer=dst_buffer, index_offset=128)
copy(src_buffer, dst_buffer, src_offset=256, dst_offset=256, size=64)
draw(index_buffer=dst_buffer, index_offset=256)
This is some pretty standard stream upload index/vertex buffer behavior when using glthread: glthread handles the upload to a staging buffer and then punts to the driver for a GPU copy, eliminating a CPU data copy that would otherwise happen between the eventual buffer->buffer copy.
And how was the glorious zink driver handling such a scenario?
copy(src_buffer, dst_buffer, src_offset=0, dst_offset=0, size=64)
barrier(src=TRANSFER, dst=INDEX_READ)
draw(index_buffer=dst_buffer, index_offset=0)
copy(src_buffer, dst_buffer, src_offset=128, dst_offset=128, size=64)
barrier(src=TRANSFER, dst=INDEX_READ)
draw(index_buffer=dst_buffer, index_offset=128)
copy(src_buffer, dst_buffer, src_offset=256, dst_offset=256, size=64)
barrier(src=TRANSFER, dst=INDEX_READ)
draw(index_buffer=dst_buffer, index_offset=256)
But this, too, can be optimized to give us foolish graphics experts a brief glance at perf. By leveraging the previous region tracking, it becomes possible to execute time-bending even more powerfully: instead of checking whether a buffer has been read from and disabling reordering, the driver can check whether a “live” region has been tracked for the operation. For example:
copy(src_buffer, dst_buffer, src_offset=0, dst_offset=0, size=64)
track(dst_buffer, offset=0, size=64)
barrier(src=TRANSFER, dst=INDEX_READ)
draw(index_buffer=dst_buffer, index_offset=0)
copy(src_buffer, dst_buffer, src_offset=128, dst_offset=128, size=64)
track(dst_buffer, offset=128, size=64)
barrier(src=TRANSFER, dst=INDEX_READ)
draw(index_buffer=dst_buffer, index_offset=128)
copy(src_buffer, dst_buffer, src_offset=256, dst_offset=256, size=64)
track(dst_buffer, offset=256, size=64)
barrier(src=TRANSFER, dst=INDEX_READ)
draw(index_buffer=dst_buffer, index_offset=256)
implements tracking, which can be used like:
check_liveness(dst_buffer, offset=0, size=64)
if (check_liveness==false) track(dst_buffer, offset=0, size=64)
copy(src_buffer, dst_buffer, src_offset=0, dst_offset=0, size=64)
barrier(src=TRANSFER, dst=INDEX_READ)
draw(index_buffer=dst_buffer, index_offset=0)
check_liveness(dst_buffer, offset=128, size=64)
if (check_liveness==false) track(dst_buffer, offset=128, size=64)
copy(src_buffer, dst_buffer, src_offset=128, dst_offset=128, size=64)
barrier(src=TRANSFER, dst=INDEX_READ)
draw(index_buffer=dst_buffer, index_offset=128)
check_liveness(dst_buffer, offset=256, size=64)
if (check_liveness==false) track(dst_buffer, offset=256, size=64)
copy(src_buffer, dst_buffer, src_offset=256, dst_offset=256, size=64)
barrier(src=TRANSFER, dst=INDEX_READ)
draw(index_buffer=dst_buffer, index_offset=256)
which can then be transformed to:
check_liveness(dst_buffer, offset=0, size=64) = false
if (check_liveness==false) track(dst_buffer, offset=0, size=64)
copy(src_buffer, dst_buffer, src_offset=0, dst_offset=0, size=64)
check_liveness(dst_buffer, offset=128, size=64)
if (check_liveness==false) track(dst_buffer, offset=128, size=64)
if (check_liveness==false) copy(src_buffer, dst_buffer, src_offset=128, dst_offset=128, size=64)
check_liveness(dst_buffer, offset=256, size=64)
if (check_liveness==false) copy(src_buffer, dst_buffer, src_offset=256, dst_offset=256, size=64)
barrier(src=TRANSFER, dst=INDEX_READ)
draw(index_buffer=dst_buffer, index_offset=0)
check_liveness(dst_buffer, offset=128, size=64)
if (check_liveness==false) track(dst_buffer, offset=128, size=64)
if (check_liveness==true) copy(src_buffer, dst_buffer, src_offset=128, dst_offset=128, size=64)
if (check_liveness==true) barrier(src=TRANSFER, dst=INDEX_READ)
draw(index_buffer=dst_buffer, index_offset=128)
check_liveness(dst_buffer, offset=256, size=64)
if (check_liveness==false) track(dst_buffer, offset=256, size=64)
if (check_liveness==true) copy(src_buffer, dst_buffer, src_offset=256, dst_offset=256, size=64)
if (check_liveness==true) barrier(src=TRANSFER, dst=INDEX_READ)
draw(index_buffer=dst_buffer, index_offset=256)
Practically unreadable, so let’s clean it up:
check_liveness(dst_buffer, offset=0, size=64) = false
if (check_liveness==false) track(dst_buffer, offset=0, size=64)
copy(src_buffer, dst_buffer, src_offset=0, dst_offset=0, size=64)
check_liveness(dst_buffer, offset=128, size=64)
if (check_liveness==false) track(dst_buffer, offset=128, size=64)
if (check_liveness==false) copy(src_buffer, dst_buffer, src_offset=128, dst_offset=128, size=64)
check_liveness(dst_buffer, offset=256, size=64)
if (check_liveness==false) track(dst_buffer, offset=256, size=64)
if (check_liveness==false) copy(src_buffer, dst_buffer, src_offset=256, dst_offset=256, size=64)
barrier(src=TRANSFER, dst=INDEX_READ)
draw(index_buffer=dst_buffer, index_offset=0)
draw(index_buffer=dst_buffer, index_offset=128)
draw(index_buffer=dst_buffer, index_offset=256)
Which, functionally, becomes:
track(dst_buffer, offset=0, size=64)
copy(src_buffer, dst_buffer, src_offset=0, dst_offset=0, size=64)
track(dst_buffer, offset=128, size=64)
copy(src_buffer, dst_buffer, src_offset=128, dst_offset=128, size=64)
track(dst_buffer, offset=256, size=64)
copy(src_buffer, dst_buffer, src_offset=256, dst_offset=256, size=64)
barrier(src=TRANSFER, dst=INDEX_READ)
draw(index_buffer=dst_buffer, index_offset=0)
draw(index_buffer=dst_buffer, index_offset=128)
draw(index_buffer=dst_buffer, index_offset=256)
All the copies are pipelined, all the draws are pipelined, and there’s only one barrier.
This is perf.
Stay tuned, because Mesa 23.1 is going to do things with perf that zink users have never seen before.
While going over the AV1 work, a few people commented on the lack of VP9 support, and a few people said it would be an easier place to start, etc.
Daniel Almeida at Collabora took a first pass at writing the spec up, and I decided to go ahead and take it to a working demo level.
Lynne was busy, and they'd already said it should take an afternoon, so I decided to have a go at writing the ffmpeg side for it as well as finish off Daniel's radv code.
About 2 mins before I finished for the weekend on Friday, I got a single frame to decode, and this morning I finished off the rest to get at least 2 test videos I downloaded to work.
Branches are at [1] and [2]. There is only 8-bit support so far and I suspect some cleaning up is required.
[1] https://github.com/airlied/FFmpeg/tree/vulkan-vp9-decode
[2] https://gitlab.freedesktop.org/airlied/mesa/-/commits/radv-vulkan-video-decode-mesa-vp9
Also I’m disabling comments; I used disqus for this and didn’t realize the amount of ads it was injecting. Tag me on twitter if you want to give feedback.
It’s been quite a few months since the most recent updates about Flathub last year. We’ve been busy behind the scenes, so I’d like to share what we’ve been up to at Flathub and why—and what’s coming up from us this year. I want to focus on:
Flathub is going strong: we offer 2,000 apps from over 1,500 collaborators on GitHub. We’re averaging 700,000 app downloads a day, with 898 million HTTP requests totalling 88.3 TB served by our CDN each day (thank you Fastly!). Flatpak has, in my opinion, solved the largest technical issue which has held back the mainstream growth and acceptance of Linux on the desktop (or other personal computing devices) for the past 25 years: namely, the difficulty for app developers to publish their work in a way that makes it easy for people to discover, download (or sideload, for people in challenging connectivity environments), install and use. Flathub builds on that to help users discover the work of app developers and helps that work reach users in a timely manner.
Initial results of this disintermediation are promising: even with its modest size so far, Flathub has hundreds of apps that I have never, ever heard of before—and that’s even considering I’ve been working in the Linux desktop space for nearly 20 years and spent many of those staring at the contents of dselect (showing my age a little) or GNOME Software, attending conferences, and reading blog posts, news articles, and forums. I am also heartened to see that many of our OS distributor partners have recognised that this model is hugely complementary and additive to the indispensable work they are doing to bring the Linux desktop to end users, and that “having more apps available to your users” is a value-add allowing you to focus on your core offering and not a zero-sum game that should motivate infighting.
Getting Flathub into its current state has been a long ongoing process. Here’s what we’ve been up to behind the scenes:
Last year, we concluded our first engagement with Codethink to build features into the Flathub web app to move from a build service to an app store. That includes accounts for users and developers, payment processing via Stripe, and the ability for developers to manage upload tokens for the apps they control. In parallel, James Westman has been working on app verification and the corresponding features in flat-manager to ensure app metadata accurately reflects verification and pricing, and to provide authentication for paying users for app downloads when the developer enables it. Only verified developers will be able to make direct uploads or access payment settings for their apps.
So far, the GNOME Foundation has acted as an incubator and legal host for Flathub even though it’s not purely a GNOME product or initiative. Distributing software to end users along with processing and forwarding payments and donations also has a different legal profile in terms of risk exposure and nonprofit compliance than the current activities of the GNOME Foundation. Consequently, we plan to establish an independent legal entity to own and operate Flathub which reduces risk for the GNOME Foundation, better reflects the independent and cross-desktop interests of Flathub, and provides flexibility in the future should we need to change the structure.
We’re currently in the process of reviewing legal advice to ensure we have the right structure in place before moving forward.
As Flathub is something we want to set outside of the existing Linux desktop and distribution space—and ensure we represent and serve the widest community of Linux users and developers—we’ve been working on a governance model that ensures that there is transparency and trust in who is making decisions, and why. We have set up a working group with myself and Martín Abente Lahaye from GNOME, Aleix Pol Gonzalez, Neofytos Kolokotronis, and Timothée Ravier from KDE, and Jorge Castro flying the flag for the Flathub community. Thanks also to Neil McGovern and Nick Richards who were also more involved in the process earlier on.
We don’t want to get held up here creating something complex with memberships and elections, so at first we’re going to come up with a simple/balanced way to appoint people into a board that makes key decisions about Flathub and iterate from there.
We have received one grant for 2023 of $100K from Endless Network which will go towards the infrastructure, legal, and operations costs of running Flathub and setting up the structure described above. (Full disclosure: Endless Network is the umbrella organisation which also funds my employer, Endless OS Foundation.) I am hoping to grow the available funding to $250K for this year in order to cover the next round of development on the software, prepare for higher operations costs (e.g., accounting gets more complex), and bring in a second full-time staff member in addition to Bartłomiej Piotrowski to handle enquiries, reviews, documentation, and partner outreach.
We’re currently in discussions with NLnet about funding further software development, but have been unfortunately turned down for a grant from the Plaintext Group for this year; this Schmidt Futures project around OSS sustainability is not currently issuing grants in 2023. However, we continue to work on other funding opportunities.
My personal hypothesis is that our largest remaining barrier to Linux desktop scale and impact is economic. On competing platforms—mobile or desktop—a developer can offer their work for sale via an app store or direct download with payment or subscription within hours of making a release. While we have taken the “time to first download” time down from months to days with Flathub, as a community we continue to have a challenging relationship with money. Some creators are lucky enough to have a full-time job within the FLOSS space, while a few “superstar” developers are able to nurture some level of financial support by investing time in building a following through streaming, Patreon, Kickstarter, or similar. However, a large proportion of us have to make do with the main payback from our labours being a stream of bug reports on GitHub interspersed with occasional conciliatory beers at FOSDEM (other beverages and events are available).
The first and most obvious consequence is that if there is no financial payback for participating in developing apps for the free and open source desktop, we will lose many people in the process—despite the amazing achievements of those who have brought us to where we are today. As a result, we’ll have far fewer developers and apps. If we can’t offer access to a growing base of users or the opportunity to offer something of monetary value to them, the reward in terms of adoption and possible payment will be very small. Developers would be forgiven for taking their time and attention elsewhere. With fewer apps, our platform has less to entice and retain prospective users.
The second consequence is that this also represents a significant hurdle for diverse and inclusive participation. We essentially require that somebody is in a position of privilege and comfort that they have internet, power, time, and income—not to mention childcare, etc.—to spare so that they can take part. If that’s not the case for somebody, we are leaving them shut out from our community before they even have a chance to start. My belief is that free and open source software represents a better way for people to access computing, and there are billions of people in the world we should hope to reach with our work. But if the mechanism for participation ensures their voices and needs are never represented in our community of creators, we are significantly less likely to understand and meet those needs.
While these are my thoughts, you’ll notice a strong theme to this year will be leading a consultation process to ensure that we are including, understanding and reflecting the needs of our different communities—app creators, OS distributors and Linux users—as I don’t believe that our initiative will be successful without ensuring mutual benefit and shared success. Ultimately, no matter how beautiful, performant, or featureful the latest versions of the Plasma or GNOME desktops are, or how slick the newly rewritten installer is from your favourite distribution, all of the projects making up the Linux desktop ecosystem are subdividing between ourselves an absolutely tiny market share of the global market of personal computers. To make a bigger mark on the world, as a community, we need to get out more.
After identifying our major barriers to overcome, we’ve planned a number of focused initiatives and restructuring this year:
We’re working on deploying the work we have been doing over the past year, starting first with launching the new Flathub web experience as well as the rebrand that Jakub has been talking about on his blog. This also will finally launch the verification features so we can distinguish those apps which are uploaded by their developers.
In parallel, we’ll also be able to turn on the Flatpak repo subsets that enable users to select only verified and/or FLOSS apps in the Flatpak CLI or their desktop’s app center UI.
We would like to make sure that the voices of app creators, OS distributors, and Linux users are reflected in our plans for 2023 and beyond. We will be launching this in the form of Flathub Focus Groups at the Linux App Summit in Brno in May 2023, followed up with surveys and other opportunities for online participation. We see our role as interconnecting communities and want to be sure that we remain transparent and accountable to those we are seeking to empower with our work.
Whilst we are being bold and ambitious with what we are trying to create for the Linux desktop community, we also want to make sure we provide the right forums to listen to the FLOSS community and prioritise our work accordingly.
As we build the Flathub organisation up in 2023, we’re also planning to expand its governance by creating an Advisory Board. We will establish an ongoing forum with different stakeholders around Flathub: OS vendors, hardware integrators, app developers and user representatives to help us create the Flathub that supports and promotes our mutually shared interests in a strong and healthy Linux desktop community.
Direct app uploads are close to ready, and they enable exciting stuff like allowing Electron apps to be built outside of flatpak-builder, or driving automatic Flathub uploads from GitHub actions or GitLab CI flows; however, we need to think a little about how we encourage these to be used. Even with its frustrations, our current Buildbot ensures that the build logs and source versions of each app on Flathub are captured, and that the apps are built on all supported architectures. (Is 2023 when we add RISC-V? Reach out if you’d like to help!). If we hand upload tokens out to any developer, even if the majority of apps are open source, we will go from this relatively structured situation to something a lot more unstructured—and we fear many apps will be available on only 64-bit Intel/AMD machines.
My sketch here is that we need to establish some best practices around how to integrate Flathub uploads into popular CI systems, encouraging best practices so that we promote the properties of transparency and reproducibility that we don’t want to lose. If anyone is a CI wizard and would like to work with us as a thought partner about how we can achieve this—make it more flexible where and how build tasks can be hosted, but not lose these cross-platform and inspectability properties—we’d love to hear from you.
Once the work around legal and governance reaches a decent point, we will be in the position to move ahead with our Stripe setup and switch on the third big new feature in the Flathub web app. At present, we have already implemented support for one-off payments either as donations or a required purchase. We would like to go further than that, in line with what we were describing earlier about helping developers sustainably work on apps for our ecosystem: we would also like to enable developers to offer subscriptions. This will allow us to create a relationship between users and creators that funds ongoing work, rather than just the one-off transactions we already have.
For Flathub to succeed, we need to make sure that as we grow, we continue to be a platform that can give users confidence in the quality and security of the apps we offer. To that end, we are planning to set up infrastructure to help ensure developers are shipping the best products they possibly can to users. For example, we’d like to set up automated linting and security scanning on the Flathub back-end to help developers avoid bad practices, unnecessary sandbox permissions, outdated dependencies, etc. and to keep users informed and as secure as possible.
Fundraising is a forever task—as is running such a big and growing service. We hope that one day, we can cover our costs through some modest fees built into our payments—but until we reach that point, we’re going to be seeking a combination of grant funding and sponsorship to keep our roadmap moving. Our hope is very much that we can encourage different organisations that buy into our vision and will benefit from Flathub to help us support it and ensure we can deliver on our goals. If you have any suggestions of who might like to support Flathub, we would be very appreciative if you could reach out and get us in touch.
Thanks to you all for reading this far and supporting the work of Flathub, and also to our major sponsors and donors without whom Flathub could not exist: GNOME Foundation, KDE e.V., Mythic Beasts, Endless Network, Fastly, and Equinix Metal via the CNCF Community Cluster. Thanks also to the tireless work of the Freedesktop SDK community to give us the runtime platform most Flatpaks depend on, particularly Seppo Yli-Olli, Codethink and others.
I wanted to also give my personal thanks to a handful of dedicated people who keep Flathub working as a service and as a community: Bartłomiej Piotrowski is keeping the infrastructure working essentially single-handedly (in his spare time from keeping everything running at GNOME); Kolja Lampe and Bart built the new web app and backend API for Flathub which all of the new functionality has been built on, and Filippe LeMarchand maintains the checker bot which helps keeps all of the Flatpaks up to date.
And finally, all of the submissions to Flathub are reviewed to ensure quality, consistency and security by a small dedicated team of reviewers, with a huge amount of work from Hubert Figuière and Bart to keep the submissions flowing. Thanks to everyone—named or unnamed—for building this vision of the future of the Linux desktop together with us.
(originally posted to Flathub Discourse, head there if you have any questions or comments)
As a kernel developer, every day I need to compile and install custom kernels, and any improvement in this workflow makes me more productive. While installing my freshly compiled modules, I noticed that the install would get stuck compressing amdgpu for some time:
XZ /usr/lib/modules/6.2.0-tonyk/kernel/drivers/gpu/drm/amd/amdgpu/amdgpu.ko.xz
My target machine is the Steam Deck, which uses .xz for compressing the modules. Given that we want gamers to be able to install as many games as possible, the OS shouldn’t waste much disk space. amdgpu, when compiled with debug symbols, can use a good hunk of space. Here’s a comparison of the module's disk size uncompressed, and then with .zst and .xz compression:
360M amdgpu.ko
61M amdgpu.ko.zst
38M amdgpu.ko.xz
This more compact module comes at a cost: more CPU time for compression.
When I opened htop, I saw that only a single lonely thread was doing the hard work of compressing amdgpu, even though compression is an easily parallelizable task. I then hacked scripts/Makefile.modinst so XZ would use as many threads as possible, with the option -T0. On my main build machine, modules_install ran 4 times faster!
# before the patch
$ time make modules_install -j16
Executed in 100.08 secs
# after the patch
$ time make modules_install -j16
Executed in 28.60 secs
Then, I submitted a patch to make this default for everyone: [PATCH] kbuild: modinst: Enable multithread xz compression
However, as Masahiro Yamada noted, we shouldn’t spawn numerous threads in the build system without the user requesting it. To this day we manually specify how many threads to run with make -jX.
Thankfully, Nathan Chancellor suggested that the same results can be achieved using XZ_OPT=-T0, so we can still benefit from this without the patch. I experimented with different -TX and -jY values, but on my notebook the most efficient values were X = Y = nproc. You can check some results below:
$ make modules_install
174.83 secs
$ make modules_install -j8
100.55 secs
$ make modules_install XZ_OPT=-T0
81.51 secs
$ make modules_install -j8 XZ_OPT=-T0
53.22 sec
Android running Freedreno
As part of my training at Igalia I’ve been attempting to write a new backend for Freedreno that targets the proprietary “KGSL” kernel mode driver. For those unaware, there are two “main” kernel mode drivers for the GPU on Qualcomm SOCs: “MSM” and “KGSL”. “MSM” is DRM compliant, and Freedreno is already able to run on this driver. “KGSL” is the proprietary KMD that Qualcomm’s proprietary userspace driver targets. Now why would you want to run Freedreno against KGSL when MSM exists? Well, there are a few reasons. First, MSM only really works on an upstream kernel, so if you have to run a downstream kernel you can continue using the version of KGSL that the manufacturer shipped with your device. Second, this allows you to run both the proprietary Adreno driver and the open source Freedreno driver on the same device just by swapping libraries, which can be very nice for quickly testing something against both drivers.
When working on a new backend, one of the critical things to do is to make use of as much “common code” as possible. This has a number of benefits, not least of which is reducing the amount of code you have to write. It also reduces the number of bugs that will likely exist, since you are relying on well-tested code, and it makes it much more likely that the backend will continue to work with new driver updates.
When I started the work for a new backend I looked inside mesa’s
src/freedreno/drm folder. This has the current backend code
for Freedreno, and it’s already modularized to support multiple backends.
It currently has support for the above mentioned MSM kernel mode driver
as well as virtio (a backend that allows Freedreno to be used from
within a virtualized environment). From the name of this path, you
would think that the code in this module would only work with kernel
mode drivers that implement DRM, but actually there is only a handful of
places in this module where DRM support is assumed. This made it a good
starting point to introduce the KGSL backend and piggyback off the
common code.
For example, the drm module has a lot of code to deal
with the management of synchronization primitives, buffer objects, and
command submit lists. All of this is managed at an abstraction level above “DRM”, and
re-implementing this code would be a bad idea.
One of the big struggles with getting the KGSL backend working was
figuring out how I could get Android to load mesa instead of the Qualcomm
blob driver that is shipped with the device image. Thankfully a good
chunk of this work has already been figured out when the Turnip
developers (Turnip is the open source Vulkan implementation for Adreno
GPUs) figured out how to get Turnip running on android with KGSL.
Thankfully one of my coworkers Danylo is one of those
Turnip developers, and he gave me a lot of guidance on getting Android
setup. One thing to watch out for is the outdated instructions here. These instructions
almost work, but require some modifications. First, if you’re
using a more modern version of the Android NDK, the compiler has been
replaced with LLVM/Clang, so you need to change which compiler is being
used. Second, flags like system in the cross compiler script
incorrectly set the system as linux instead of
android. I had success using the below cross compiler
script. Take note that the compiler paths need to be updated to match
where you extracted the android NDK on your system.
[binaries]
ar = '/home/lfryzek/Documents/projects/igalia/freedreno/android-ndk-r25b-linux/android-ndk-r25b/toolchains/llvm/prebuilt/linux-x86_64/bin/llvm-ar'
c = ['ccache', '/home/lfryzek/Documents/projects/igalia/freedreno/android-ndk-r25b-linux/android-ndk-r25b/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android29-clang']
cpp = ['ccache', '/home/lfryzek/Documents/projects/igalia/freedreno/android-ndk-r25b-linux/android-ndk-r25b/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android29-clang++', '-fno-exceptions', '-fno-unwind-tables', '-fno-asynchronous-unwind-tables', '-static-libstdc++']
c_ld = 'lld'
cpp_ld = 'lld'
strip = '/home/lfryzek/Documents/projects/igalia/freedreno/android-ndk-r25b-linux/android-ndk-r25b/toolchains/llvm/prebuilt/linux-x86_64/bin/llvm-strip'
# Android doesn't come with a pkg-config, but we need one for Meson to be happy not
# finding all the optional deps it looks for. Use system pkg-config pointing at a
# directory we get to populate with any .pc files we want to add for Android
pkgconfig = ['env', 'PKG_CONFIG_LIBDIR=/home/lfryzek/Documents/projects/igalia/freedreno/android-ndk-r25b-linux/android-ndk-r25b/pkgconfig:/home/lfryzek/Documents/projects/igalia/freedreno/install-android/lib/pkgconfig', '/usr/bin/pkg-config']
[host_machine]
system = 'android'
cpu_family = 'arm'
cpu = 'armv8'
endian = 'little'
Another thing I had to figure out with Android, which was different from
these instructions, was how I would get Android to load the mesa-built
versions of the graphics libraries. That’s when my colleague Mark pointed out to me that
Android is open source and I could just check the source code myself.
Sure enough, you can find the OpenGL driver loader in Android’s
source code. From this code we can see that Android will try to load a
few different files based on some settings, and in my case it would try
to load 3 different shared libraries in the
/vendor/lib64/egl folder: libEGL_adreno.so,
libGLESv1_CM_adreno.so, and libGLESv2.so. I
could just replace these libraries with the version built from mesa and
voilà, you’re now loading a custom driver! This realization that I could
just “read the code” was very powerful in debugging some more android
specific issues I ran into, like dealing with gralloc.
Something cool that the opensource Freedreno & Turnip driver
developers figured out was getting android to run test OpenGL
applications from the adb shell without building android APKs. If you
check out the freedreno
repo, they have an ndk-build.sh script that can build
tests in the tests-* folder. The nice benefit of this is
that it provides an easy way to run simple test cases without worrying
about the android window system integration. Another nifty feature about
this repo is the libwrap tool that lets trace the commands
being submitted to the GPU.
Gralloc is the graphics memory allocator in Android, and the OS will
use it to allocate the surface for “windows”. This means that the memory
we want to render the display to is managed by gralloc and not our KGSL
backend. This means we have to get all the information about this
surface from gralloc, and if you look in
src/egl/drivers/dri2/platform_android.c you will see
existing code for handling gralloc. You would think “Hey there is no work
for me here then”, but you would be wrong. The handle gralloc provides
is hardware specific, and the code in platform_android.c
assumes a DRM gralloc implementation. Thankfully the turnip developers
had already gone through this struggle and if you look in
src/freedreno/vulkan/tu_android.c you can see they have
implemented a separate path when a Qualcomm msm implementation of
gralloc is detected. I could copy this detection logic and add a
separate path to platform_android.c.
When working on any project (open-source or otherwise), it’s nice to
know that you aren’t working alone. Thankfully the
#freedreno channel on irc.oftc.net is very
active and full of helpful people to answer any questions you may have.
While working on the backend, one area I wasn’t really sure how to address was the synchronization code for buffer objects. The backend exposed a function called cpu_prep; this function was just there to call the DRM implementation of cpu_prep on the buffer object. I wasn’t exactly sure how to implement this functionality with KGSL since it doesn’t use DRM buffer objects.
I ended up reaching out to the IRC channel, and Rob Clark explained to me that he was actually working on moving a lot of the code for cpu_prep into common code, so that a non-DRM driver (like the KGSL backend I was working on) would just need to implement that operation as a NOP (no operation).
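A minimal sketch of what that ends up looking like for the KGSL backend (the function name and wiring here are illustrative, based on the description above rather than the actual freedreno code):

/* With the bulk of cpu_prep living in common code, the KGSL backend's hook
 * can simply be a no-op. */
static int
kgsl_bo_cpu_prep(struct fd_bo *bo, struct fd_pipe *pipe, uint32_t op)
{
   /* nothing to do: there is no DRM cpu_prep ioctl to forward to */
   return 0;
}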
I encountered a few different bugs when implementing the KGSL backend, but most of them consisted of me calling KGSL wrong or handling synchronization incorrectly. Thankfully, since Turnip is already running on KGSL, I could just more carefully compare my code to what Turnip is doing and figure out my logical mistake.
Some of the bugs I encountered required the backend interface in Freedreno to be modified to expose a new per-driver implementation of that backend function, instead of just using a common implementation. For example, the existing function to map a buffer object into userspace assumed that the same fd used for the device could be used for the buffer object in the mmap call. This worked fine for any buffer objects we created through KGSL, but would not work for buffer objects created from gralloc (remember the above section on surface memory for windows coming from gralloc). To resolve this issue I exposed a new per-backend implementation of “map” where I could take a different path if the buffer object came from gralloc.
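As a hedged sketch of the idea (the import_fd and mmap_offset fields are hypothetical, used only to illustrate the two paths):

#include <sys/mman.h>

/* Per-backend "map": KGSL-allocated BOs are mapped through the device fd at
 * the BO's mmap offset, while BOs imported from gralloc carry their own
 * dma-buf fd and must be mapped through that instead. */
static void *
kgsl_bo_map(struct fd_bo *bo)
{
   int fd = (bo->import_fd >= 0) ? bo->import_fd : bo->dev->fd;
   uint64_t offset = (bo->import_fd >= 0) ? 0 : bo->mmap_offset;

   return mmap(NULL, bo->size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, offset);
}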
While testing the KGSL backend I did encounter a new bug that seems to affect both my new KGSL backend and the Turnip KGSL backend. The bug is an iommu fault that occurs when the surface allocated by gralloc does not have a height that is aligned to 4. The blitting engine on a6xx GPUs copies in 16x4 chunks, so if the height is not aligned to 4 the GPU will try to write to pixels that exist outside the allocated memory. This issue only happens with KGSL backends since we import memory from gralloc, and gralloc allocates exactly enough memory for the surface, with no alignment on the height. If running on any other platform, the fdl (Freedreno Layout) code would be called to compute the minimum required size for a surface, which would take into account the alignment requirement for the height. The Qualcomm blob driver didn’t seem to have this problem, even though it’s getting the exact same buffer from gralloc, so it must be doing something different to handle the non-aligned height.
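To make the numbers concrete, here is a tiny self-contained example of the mismatch (the surface dimensions are arbitrary, and stride/pitch alignment is ignored for simplicity):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
   uint32_t width = 1080, height = 2342, cpp = 4;  /* example RGBA8 surface */
   uint32_t aligned_height = (height + 3) & ~3u;   /* rows a 16x4 blit may touch */

   printf("gralloc allocates %u bytes, the blitter may touch up to %u bytes\n",
          width * cpp * height, width * cpp * aligned_height);
   return 0;
}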
Because this issue relied on gralloc, the application needed to be running as an Android APK to get a surface from gralloc. The best way to fix this issue would be to figure out what the blob driver is doing and try to replicate this behavior in Freedreno (assuming it isn’t doing something silly like switching to sysmem rendering). Unfortunately, it didn’t look like the libwrap library worked for tracing an APK.
The libwrap library relied on a Linux feature known as LD_PRELOAD to load libwrap.so when the application starts and replace system functions like open and ioctl with its own implementations that trace what is being submitted to the KGSL kernel mode driver. Thankfully, Android exposes this LD_PRELOAD mechanism through its “wrap” interface, where you create a property called wrap.<app-name> with a value of LD_PRELOAD=<path to libwrap.so>. Android will then load your library as would be done in a normal Linux shell. If you tried to do this with libwrap, though, you would find very quickly that you get corrupted traces. When Android launches your APK, it doesn’t only launch your application: there are different threads for different Android system-related functions, and some of them can also use OpenGL. The libwrap library is not designed to handle multiple threads using KGSL at the same time. After discovering this issue I created a MR that stores the tracing file handles in TLS (thread local storage), preventing the clobbering of the trace file and also allowing you to view the traces generated by different threads separately from each other.
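The fix boils down to making the trace file handle thread-local. A minimal sketch of the idea (the file naming and gettid() usage here are illustrative, not the actual libwrap code):

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>

static __thread FILE *trace_file;  /* one trace file handle per thread */

static FILE *
get_trace_file(void)
{
   if (!trace_file) {
      char name[64];
      snprintf(name, sizeof(name), "/sdcard/trace-%d.rd", gettid());
      trace_file = fopen(name, "wb");
   }
   return trace_file;
}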
With this in hand, one could begin investigating what the blob driver is doing to handle these unaligned surfaces.
Well, the next obvious thing to fix is the unaligned height issue, which is still open. I’ve also worked on upstreaming my changes with this WIP MR.
Freedreno running 3d-mark
In the last blog post, I pointed out that I didn’t know exactly what my next steps would be for the near future. Gladly, I had the amazing opportunity to start a new Igalia Coding Experience with a new project.
This time Melissa Wen pitched me the idea of playing around with Rust for Linux in order to rewrite the VGEM driver in Rust. The Rust for Linux project is growing fast, with new bindings and abstractions being introduced in the downstream RfL kernel. Also, some basic functionalities were introduced in Linux 6.1. Therefore, it seems like great timing to start exploring Rust in the DRM subsystem!
As mentioned by the Rust website, using Rust means Performance, Reliability, and Productivity. Rust is a blazingly fast and memory-efficient language with its powerful ownership model. No more looking for use-after-free and memory leaks, as Rust guarantees memory safety and thread safety, eliminating a handful of bugs at compile-time.
Moreover, Rust provides a new way of programming. The language provides beautiful features such as traits, enums, and error handling, that can make us feel empowered by the language. We can use a lot of concepts from functional programming and mix them with concepts from OOP, for example.
Although I’m an absolute beginner in Rust, I can see the major advantages of the Rust programming language. At the start, it was a bit tough to enjoy the language, as I was fighting with the compiler most of the time. But now that I have a firmer foundation in Rust, it is possible to appreciate the beauty of the language, and I don’t see myself starting a new project in C++ for a long while.
Bringing Rust to the Linux kernel is an ambitious idea, but it can lead to great changes. We can think about a world where no developers are hunting for memory leaks and use-after-free bugs, thanks to the safety that Rust can provide us.
Now, what about Rust for DRM? I mean, I’m not the first one to think about it. Asahi Lina is doing fantastic work on the Apple M1 GPU and things are moving quite fast there. She already has great safe abstractions for the DRM bindings, which provide the very basis for anyone willing to start a new DRM driver in Rust, which is my case.
That said, why not make use of Lina’s excellent bindings to build a new driver?
VGEM (Virtual GEM Provider) is a minimal non-hardware backed GEM (Graphics Execution Manager) service. It is used with non-native 3D hardware for buffer sharing between the X server and DRI. It is a fairly simple driver with about 400 lines of code and it uses the DMA Fence API to handle attaching and signaling the fences.
So, to rewrite VGEM in Rust, some bindings are needed, e.g. bindings for platform device, for XArray, and for dealing with DMA fence and DMA reservations. Furthermore, many DRM abstractions are needed as well.
In this sense, a lot of the DRM abstractions are already developed by Lina and also she is developing abstractions for DMA fence. So, in this project, I’ll be focusing on the bindings that Lina and the RfL folks haven’t developed yet.
After developing the bindings, it is a matter of developing the driver, which will be quite simple once all the DMA abstractions are in place, because most of the driver consists of fence manipulation.
I have developed the main platform device registration of the driver. As VGEM is a virtual device, the standard probe initialization is not useful, as a virtual device cannot be probed by the pseudo-bus that holds the platform devices. So, as VGEM is not a usual hotplugged device, we need to use the legacy platform device initialization. This made me develop my first binding for legacy registration:
/// Add a platform-level device and its resources
pub fn register(name: &'static CStr, id: i32) -> Result<Self> {
    // SAFETY: `name` is a valid, NUL-terminated C string for the duration of
    // the call, and the returned pointer is checked by `from_kernel_err_ptr`.
    let pdev = from_kernel_err_ptr(unsafe {
        bindings::platform_device_register_simple(name.as_char_ptr(), id,
                                                  core::ptr::null(), 0)
    })?;

    Ok(Self {
        ptr: pdev,
        used_resource: 0,
        is_registered: true,
    })
}
For sure, the registration must be paired with the unregistration of the device, so I implemented the Drop trait for the Device struct in order to guarantee proper device removal without having to call it explicitly.
impl Drop for Device {
    fn drop(&mut self) {
        if self.is_registered {
            // SAFETY: This path only runs if a previous call to `register`
            // completed successfully.
            unsafe { bindings::platform_device_unregister(self.ptr) };
        }
    }
}
After those, I also developed bindings for a couple more functions, and together with Lina’s bindings, I could initialize the platform device and register the DRM device under a DRM minor!
[ 38.825684] vgem: vgem_init: platform_device with id -1
[ 38.826505] [drm] Initialized vgem 1.0.0 20230201 for vgem on minor 0
[ 38.858230] vgem: Opening...
[ 38.862377] vgem: Closing...
[ 41.543416] vgem: vgem_exit: drop
Next, I focused on the development of the two IOCTLs: drm_vgem_fence_attach and drm_vgem_fence_signal. The first is responsible for creating and attaching a fence to the VGEM handle, while the second signals and consumes a fence earlier attached to a VGEM handle.
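For reference, the userspace-facing side of these IOCTLs is tiny; the structures below are reproduced from memory of include/uapi/drm/vgem_drm.h, so treat them as an approximation and check the header for the authoritative definitions:

#include <linux/types.h>

struct drm_vgem_fence_attach {
	__u32 handle;  /* GEM handle to attach the fence to */
	__u32 flags;   /* e.g. VGEM_FENCE_WRITE for a write fence */
	__u32 fence;   /* out: fence id, later passed to the signal ioctl */
	__u32 pad;
};

struct drm_vgem_fence_signal {
	__u32 fence;   /* fence id returned by the attach ioctl */
	__u32 flags;
};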
In order to add a fence, bindings to DMA reservation are needed. So, I started
by creating a safe abstraction for struct dma_resv.
/// A generic DMA Resv Object
///
/// # Invariants
/// ptr is a valid pointer to a dma_resv and we own a reference to it.
pub struct DmaResv {
    ptr: *mut bindings::dma_resv,
}

impl DmaResv {
    [...]

    /// Add a fence to the dma_resv object
    pub fn add_fences(
        &self,
        fence: &dyn RawDmaFence,
        num_fences: u32,
        usage: bindings::dma_resv_usage,
    ) -> Result {
        unsafe { bindings::dma_resv_lock(self.ptr, core::ptr::null_mut()) };

        let ret = self.reserve_fences(num_fences);
        match ret {
            Ok(_) => {
                // SAFETY: ptr is locked with dma_resv_lock(), and
                // dma_resv_reserve_fences() has been called.
                unsafe {
                    bindings::dma_resv_add_fence(self.ptr, fence.raw(), usage);
                }
            }
            Err(_) => {}
        }

        unsafe { bindings::dma_resv_unlock(self.ptr) };

        ret
    }
}
With that step, I could simply write the IOCTLs based on the new DmaResv
abstraction and Lina’s fence abstractions.
To test the IOCTLs, I used some already available IGT tests: dmabuf_sync_file
and vgem_basic. Those tests use VGEM as their base, so if the tests pass, it
means that the IOCTLs are working properly. And, after some debugging and rework
in the IOCTLs, I managed to get most of the tests to pass!
[root@fedora igt-gpu-tools]# ./build/tests/dmabuf_sync_file
IGT-Version: 1.27-gaa16e812 (x86_64) (Linux: 6.2.0-rc3-asahi-02441-g6c8eda039cfb-dirty x86_64)
Starting subtest: export-basic
Subtest export-basic: SUCCESS (0.000s)
Starting subtest: export-before-signal
Subtest export-before-signal: SUCCESS (0.000s)
Starting subtest: export-multiwait
Subtest export-multiwait: SUCCESS (0.000s)
Starting subtest: export-wait-after-attach
Subtest export-wait-after-attach: SUCCESS (0.000s)
You can check out the current progress of this project on this pull request.
Although most of the IGT tests are now passing, two tests aren’t working yet:
vgem_slow, as I haven’t introduced the timeout yet, and vgem_basic@unload,
as I still need to debug why the Drop trait from drm::drv::Registration is
not being called.
After bypassing those two problems, I still need to rework some of my code: for example, I’m currently using a dummy IOCTL as IOCTL number 0x00, as the current kernel::declare_drm_ioctl macro doesn’t support drivers whose IOCTL numbers don’t start at 0x00.
So, there is a lot of work yet to be done!
It finally happened.
Zink has been commercialized.
What does this mean, you ask? Well, look no further than this juicy X-Plane announcement.
That’s right, after months and decades of waiting, the testing and debugging is over, and Zink is now a gaming driver that runs real games in production for real, existing people. Who play games. At full speed.
I don’t have anything further to say on the topic since the above blog post says more than enough, but if you like this, show your support and play the best flight sim on the market that uses zink.
Nice work, X-Plane.
A lot has been made of VK_EXT_descriptor_buffer, also known as “sane descriptor handling”. It’s an extension that revolutionizes how descriptors can be managed not only in brevity of code but in performance.
That’s why ZINK_DESCRIPTORS=db is now the default wherever it’s supported.
But what does this gain zink (and other users), other than being completely undebuggable if anything were to break*?
* It won’t, trust me.
One nicety of descriptor buffers is performance. By swapping out descriptor templates for buffers, it removes a layer of indirection from the descriptor update path, which reduces CPU overhead by a small amount. By avoiding the need to bind different descriptor sets, GPU synchronization can also be reduced.
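For readers who haven’t used the extension, a rough sketch of what “updating” a descriptor looks like with descriptor buffers: the descriptor bytes are written straight into a host-mapped buffer with vkGetDescriptorEXT, with no sets, pools, or templates involved. (In a real application the entrypoint is loaded via vkGetDeviceProcAddr, and the buffer address/offset/size values here are placeholders.)

#include <vulkan/vulkan.h>

/* Write one uniform-buffer descriptor into a mapped descriptor buffer.
 * descriptor_size comes from
 * VkPhysicalDeviceDescriptorBufferPropertiesEXT::uniformBufferDescriptorSize. */
static void
write_ubo_descriptor(VkDevice dev, PFN_vkGetDescriptorEXT get_descriptor,
                     void *mapped_db, size_t offset,
                     VkDeviceAddress ubo_address, VkDeviceSize ubo_size,
                     size_t descriptor_size)
{
   VkDescriptorAddressInfoEXT addr = {
      .sType = VK_STRUCTURE_TYPE_DESCRIPTOR_ADDRESS_INFO_EXT,
      .address = ubo_address,
      .range = ubo_size,
   };
   VkDescriptorGetInfoEXT info = {
      .sType = VK_STRUCTURE_TYPE_DESCRIPTOR_GET_INFO_EXT,
      .type = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER,
      .data.pUniformBuffer = &addr,
   };
   get_descriptor(dev, &info, descriptor_size, (char *)mapped_db + offset);
}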
In zink terms, you’ll likely notice a small FPS increase in cases that were extremely CPU-bound (e.g., Tomb Raider).
Yes, GPU memory utilization is also affected.
Historically, as we all know by now, zink uses six (four when constrained) descriptor sets:
This optimizes access for the types as well as update frequency. In terms of descriptor sets, it means a different descriptor pool per descriptor layout so that sets can be bucket allocated to further reduce overhead.
Initially when I implemented descriptor buffer, I kept this same setup. Each descriptor type had its own descriptor buffer, and the buffers were hardcoded to fit N descriptors of the given type, where N was the maximum number of descriptors I could update per cmdbuf using drawoverhead. Each cmdbuf was associated with a set of descriptor buffers, and it worked great.
But it also used a lot of VRAM, comparable even to the then-default template mode.
During my latest round of descriptor buffer refactors which added handling for bindless textures, I had a realization. I was doing descriptor buffers all wrong.
Instead of having all these different buffers per cmdbuf, why wouldn’t I just have a single buffer? Plus a static one for bindless, of course.
Imagine pointlessly allocating five times the memory required.
So now I had two descriptor buffers:
In Tomb Raider, this ended up being about a 6% savings for peak VRAM utilization (1445MiB -> 1362MiB). Not bad, and the code is now simpler too.
But what if I stopped hardcoding the descriptor buffer size and instead used a sliding scale so that only the (rough) amount of memory needed was allocated? Some minor hacking here and there, and peak VRAM utilization was cut even more (1362MiB -> 1297MiB).
Now it’s a little over a 10% reduction in peak VRAM utilization.
There’s still some ways to go with VRAM utilization in zink considering RadeonSI peaks at 1221MiB for the same benchmark, but a 6% gap is much more reasonable than a 16% one.
Blog posts about Vulkan descriptor models aren’t going away.
I wish they were, but they just aren’t.
Stay tuned for more of these posts as well as other exciting developments on the road to Mesa 23.1.
SuperTuxKart Vulkan vs OpenGL
The latest SuperTuxKart release comes with an experimental Vulkan renderer and I was eager to check it out on my Raspberry Pi 4 and see how well it worked.
The short story is that, while I have only tested a few tracks, it seems to perform really well overall. In my tests, even with a debug build of Mesa, I saw the FPS ranging from 60 to 110 depending on the track. I think the game might actually be able to produce more than 110 fps: since various tracks were able to reach exactly 110 fps, I think the limiting factor here was the display.
I was then naturally interested in comparing this to the GL renderer and I was a bit surprised to see that, with the same settings, the GL renderer would be somewhere in the 8-20 fps range for the same tracks. The game was clearly hitting a very bad path in the GL driver so I had to fix that before I could make a fair comparison between both.
A perf session quickly pointed me to the issue: Mesa has code to transparently translate vertex attribute formats that are not natively supported to a supported format. While this is great for compatibility it is obviously going to be very slow. In particular, SuperTuxKart uses rgba16f and rg16f with some vertex buffers and Mesa was silently translating these to 32-bit counterparts because the GL driver was not advertising support for the 16-bit variants. The hardware does support 16-bit floating point vertex attributes though, so this was very easy to fix.
The Vulkan driver was already exposing support for this, which explains the dramatic difference in performance between both drivers. Indeed, with that change SuperTuxKart now plays smoothly on OpenGL too, with framerates always above 30 fps and up to 110 fps depending on the track. We should probably have an option in Mesa to make these kinds of under-the-hood compatibility translations more obvious to users so we can catch silly issues like this more easily.
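For context, exposing a format for vertex fetch in Gallium is just a matter of the driver reporting it from its is_format_supported hook. A hedged sketch of roughly what such a check looks like (this is illustrative, not the actual patch):

/* Report 16-bit float formats as usable for vertex buffers so Mesa stops
 * silently translating them to their 32-bit counterparts. */
static bool
screen_is_format_supported(struct pipe_screen *pscreen, enum pipe_format format,
                           enum pipe_texture_target target, unsigned sample_count,
                           unsigned storage_sample_count, unsigned usage)
{
   if (usage & PIPE_BIND_VERTEX_BUFFER) {
      switch (format) {
      case PIPE_FORMAT_R16G16B16A16_FLOAT:
      case PIPE_FORMAT_R16G16_FLOAT:
         return true;
      default:
         break;
      }
   }
   /* ... the driver's existing format checks continue here ... */
   return false;
}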
With that said, even if GL is now a lot better, Vulkan is still ahead by quite a lot, producing 35-50% better framerate than OpenGL depending on the track, at least for the tracks that don’t hit the 110 fps mark, which as I said above, looks like it is a display maximum, at least with my setup.
Zink
During my presentation at XDC last year I mentioned Zink wasn’t supported on Raspberry Pi 4 any more due to feature requirements we could not fulfill.
In the past, Zink used to abort when it detected unsupported features, but it seems this policy has been changed and now it simply drops a warning and points to the possibility of incorrect rendering as a result.
Also, I have been talking to zmike about one of the features we could not support natively: scalarBlockLayout. In particular, the issue with this is that we can’t make it work with vectors in all cases, and the only alternative for us would be to scalarize everything through a lowering, which would probably have a performance impact. However, zmike confirmed that Zink is already doing this, so in practice we would not see vector load/stores from Zink, in which case it should work fine.
So with all that in mind, I did give Zink a go and indeed, I get the warning that we don’t support scalar block layouts (and some other feature I don’t remember now), but otherwise it mostly works. It is not as stable as the native driver, and some things that work with the native driver don’t work with Zink at present; some examples I saw include the WebGL Aquarium demo in Chromium and SuperTuxKart.
As far as performance goes, it has been a huge leap from when I tested it maybe 2 years ago. With VkQuake3’s OpenGL renderer, performance with Zink used to be ~40% of the native OpenGL driver, but it is now on par with it, if not a tiny bit better. So kudos to zmike and all the other contributors to Zink for all the work they put into this over the last 2 years, it really shows.
With all that said, I didn’t do too much testing with Zink myself so if anyone here decides to give it a more thorough go, please let me know how it went in the comments.
Hi!
Earlier this month I went to FOSDEM with the rest of the SourceHut staff! It was great meeting face-to-face all of the people I work with. I talked with lots of folks involved in Wayland, IRC, SourceHut and many other interesting projects. This was my first post-pandemic offline conference.
Last week we’ve released wlroots 0.16.2 and Sway 1.8.1. We’ve spent a fair
bit of time trying to square away regressions, and I think we’ve addressed
almost all of them. This doesn’t mean we haven’t made any progress on new
features and improvements, quite the contrary. We’ve merged Kenny Levinsen’s
patches for the new fractional-scaling-v1 protocol, which allows clients to
render at fractional scales rather than being forced to use the next integer
scale. I’ve continued working on the new wlr_renderer API, and I’ve started
experimenting with Vulkan compute. I’m still not sure this is the right path
forward, we’ll see where this takes us.
I’ve made a lot of progress on libliftoff integration in wlroots. I’ve
extended the wlr_output_layer API to include a feedback mechanism so that
clients can re-allocate their buffers on-the-fly to enable direct scan-out on
overlay planes. I’ve wired this up to a new libliftoff API to query which
planes would be good candidates for direct scan-out. I’ve fixed the remaining
wlroots bugs, optimized libliftoff… What’s left is another testing and review
round, but we’re getting close!
By the way, the wlroots IRC channel has moved. We were (ab)using #sway-devel up until now, but now wlroots has its own separate #wlroots channel. Make sure to join it if you’ve been idling in #sway-devel!
In other Wayland news, I’ve landed a patch to add two new wl_surface events to indicate the preferred scale
and transform a client should use. No more guesswork via wl_output! I’ve also
sent out the schedule for the next Wayland release, if all goes well we’ll ship
it in two months.
libdisplay-info 0.1.0 has been released! After months of work, this initial release includes full support for EDID, partial support for CTA-861-H, and very basic support for DisplayID 1.3. Having a release out will allow us to leverage the library in more projects: it’s already used in DXVK and gamescope, I have a patch to use it in wlroots, and there are plans to use it in Mutter and Weston.
The NPotM is pixfmtdb. It’s a simple website which describes the in-memory layout of pixel formats from various graphics APIs. It also provides compatibility information: for each format, equivalent formats coming from other APIs are listed. This can be handy when wiring up multiple APIs together, for instance Cairo and Wayland, or Vulkan and KMS. Under the hood, the Khronos Data Format Specification is used to describe pixel formats in a standard way.
Recently delthas has been hard at work and has landed a lot of soju patches.
The new user run BouncerServ command can be used to run a command as another
user, which can be handy for administrators. soju now supports Unix admin
sockets to run any BouncerServ command from the shell. And support for external
authentication has been merged (right now, PAM and OAuth 2.0 are supported).
That’s all for now! See you next month.
Some weeks ago, Igalia announced publicly that we will host X.Org Developers Conference 2023 (XDC 2023) in A Coruña, Spain. If you remember, we also organized XDC 2018 in this beautiful city in the northwest of Spain (I hope you enjoyed it!)

Since the announcement, I can now confirm that the conference will be in the awesome facilities of Palexco conference center, at the city center of A Coruña, Spain, from 17th to 19th of October 2023.

We are going to set up the website soon and prepare everything to open the Call for Papers in the coming weeks. Stay tuned!
XDC 2023 is a three-day conference full of talks and workshops related to the open-source graphics stack: from Wayland to X11, from DRM/KMS to Mesa drivers, toolkits, libraries… you name it! This is the go-to conference if you are involved in the development of any part of the open-source graphics stack. Don’t miss it!

Those of you who saw me at XDC will recall that I talked about my hatred for Gallium’s pipe caps.
I still hate them.
But mostly I hate a specific type of pipe cap: the pipe cap that gates performance.
In a nutshell:
It happens again and again across drivers. Some game/app/test is inexplicably running with a framerate so low it could win a world limbo championship. Initial debugging (perf) reveals nothing. GPU profiling reveals nothing. There’s nothing utilizing a massive amount of resources. What’s going on?
The most recent iteration of this Yet Another Missing Pipe Cap bug was seen with DOOM2016. Even as early as the title screen, framerate would struggle to hit 30fps for a while before rocketing up to whatever maximum it could reach. In-game was the same, soaring to the 200fps maximum after enough time standing stationary without moving the camera. As soon as the camera shifted however, back to 20fps we went.
Manually adding timing instrumentation to the driver revealed baffling results: a buffer readback was repeatedly triggering a staging copy, which then fenced for batch completion and stalled rendering.
Backtrace:
#0 zink_buffer_map (pctx=0x7f8fb4018800, pres=0x7f8fe50a4600, level=0, usage=1610612737, box=0x5f0f590, transfer=0x7f8fe50a45e8) at ../src/gallium/drivers/zink/zink_resource.c:1906
#1 0x00007f8fe9d626e5 in tc_buffer_map (_pipe=0x7f8fb415cef0, resource=0x7f8fe50a4600, level=0, usage=1610612737, box=0x5f0f590, transfer=0x7f8fe50a45e8) at ../src/gallium/auxiliary/util/u_threaded_context.c:2623
#2 0x00007f8fe93d7f59 in pipe_buffer_map_range (pipe=0x7f8fb415cef0, buffer=0x7f8fe50a4600, offset=0, length=31457280, access=1, transfer=0x7f8fe50a45e8) at ../src/gallium/auxiliary/util/u_inlines.h:400
#3 0x00007f8fe93d9495 in _mesa_bufferobj_map_range (ctx=0x7f8fb4191710, offset=0, length=31457280, access=1, obj=0x7f8fe50a4520, index=MAP_INTERNAL) at ../src/mesa/main/bufferobj.c:499
#4 0x00007f8fe963e40f in _mesa_validate_pbo_compressed_teximage (ctx=0x7f8fb4191710, dimensions=3, imageSize=16384, pixels=0xa04000, packing=0x7f8fb41c2820, funcName=0x7f8fea64c3e2 "glCompressedTexSubImage") at ../src/mesa/main/pbo.c:456
#5 0x00007f8fe96c2c8c in _mesa_store_compressed_texsubimage (ctx=0x7f8fb4191710, dims=3, texImage=0x7f8fe50bda20, xoffset=16000, yoffset=256, zoffset=1, width=128, height=128, depth=1, format=35919, imageSize=16384, data=0xa04000) at ../src/mesa/main/texstore.c:1357
#6 0x00007f8fe972862a in st_CompressedTexSubImage (ctx=0x7f8fb4191710, dims=3, texImage=0x7f8fe50bda20, x=16000, y=256, z=1, w=128, h=128, d=1, format=35919, imageSize=16384, data=0xa04000) at ../src/mesa/state_tracker/st_cb_texture.c:2390
#7 0x00007f8fe96aa0a5 in compressed_texture_sub_image (ctx=0x7f8fb4191710, dims=3, texObj=0x7f8fe50bd5b0, texImage=0x7f8fe50bda20, target=35866, level=0, xoffset=16000, yoffset=256, zoffset=1, width=128, height=128, depth=1, format=35919, imageSize=16384, data=0xa04000) at ../src/mesa/main/teximage.c:5862
#8 0x00007f8fe96aa553 in compressed_tex_sub_image (dim=3, target=35866, textureOrIndex=0, level=0, xoffset=16000, yoffset=256, zoffset=1, width=128, height=128, depth=1, format=35919, imageSize=16384, data=0xa04000, mode=TEX_MODE_CURRENT_ERROR, caller=0x7f8fea648740 "glCompressedTexSubImage3D") at ../src/mesa/main/teximage.c:6000
#9 0x00007f8fe96aab53 in _mesa_CompressedTexSubImage3D (target=35866, level=0, xoffset=16000, yoffset=256, zoffset=1, width=128, height=128, depth=1, format=35919, imageSize=16384, data=0xa04000) at ../src/mesa/main/teximage.c:6195
#10 0x00007f8fe93435b7 in _mesa_unmarshal_CompressedTexSubImage3D (ctx=0x7f8fb4191710, cmd=0x7f8fb4197990) at src/mapi/glapi/gen/marshal_generated1.c:5199
#11 0x00007f8fe960eb20 in glthread_unmarshal_batch (job=0x7f8fb41978c8, gdata=0x0, thread_index=0) at ../src/mesa/main/glthread.c:65
#12 0x00007f8fe960f47f in _mesa_glthread_finish (ctx=0x7f8fb4191710) at ../src/mesa/main/glthread.c:312
#13 0x00007f8fe960f4c7 in _mesa_glthread_finish_before (ctx=0x7f8fb4191710, func=0x7f8fea5fcc00 "GetQueryObjectui64v") at ../src/mesa/main/glthread.c:328
#14 0x00007f8fe935c214 in _mesa_marshal_GetQueryObjectui64v (id=4097, pname=34918, params=0x5f0fca8) at src/mapi/glapi/gen/marshal_generated3.c:1349
#15 0x000000007a89a63f in ?? ()
#16 0x0000000000000000 in ?? ()
This is the software fallback path for CompressedTexSubImage3D. Zink is (in this instance) a hardware driver, so why…
Hold on, I think I see something
A little closer
Enhance
With a trivial MR, DOOM2016 gets a nice 10x perf boost. Probably some other games do too.
But why can’t there be some sort of tag for all the hundreds of pipe caps to indicate that they’re perf-related?
Why do I have to do this dance again and again?
Okay just a short status update.
The radv h264/h265 support has been merged to the mesa main branch. It is still behind the RADV_PERFTEST=video_decode flag, and should work for the basics from VI/GFX8+. It still has not passed all the CTS tests.
The anv h264 decode support has been merged to the mesa main branch. It has been tested from Skylake up to DG2. It has no enable flag; just make sure to build with h264dec video-codec support. It passes all current CTS tests.
I ported the anv h264 decoder to hasvk, the Vulkan driver for Ivybridge/Haswell. This is in a draft MR (HASVK H264). I haven’t given this much testing yet, but it has worked in the past. I’ll get to testing it before trying to get it merged.
I created an MR for spec discussion (radv av1). I've also cleaned up the radv AV1 decode code.
I've started on anv AV1 decode support for DG2. I've gotten one very simple frame to decode. I will attempt to do more. I think filmgrain is not going to be supported in the short term. I'll fill in more details on this when it's working better. I think there are a few things that might need to be changed in the AV1 decoder provisional spec for Intel, there are some derived values that ffmpeg knows that it would be nice to not derive again, and there are also some hw limits around tiles and command buffers that will need to be figured out.
A while ago, I blogged about how zink was leveraging VK_EXT_graphics_pipeline_library to avoid mid-frame shader compilation, AKA hitching. This was all good and accurate in that the code existed, it worked, and when the right paths were taken, there was no mid-frame shader compiling.
The problem, of course, is that these paths were never taken. Who could have guessed: there were bugs.
These bugs are now fixed, however, and so there should be no more mid-frame hitching ever with zink.
Period.
If you don’t believe me, run your games with MESA_SHADER_CACHE_DISABLE=true and report back.
There’s this little extension I hate called ARB_separate_shader_objects. It allows shaders to be created separately and then linked together at runtime in a performant manner that doesn’t hitch.
If you’re thinking this sounds a lot like GPL fast-linking, you’re not wrong.
If, however, you’ve noticed the obvious flaw in this thinking, you’re also not wrong.
This is not a thing, but it needs to be a thing. Specifically it needs to be a thing because our favorite triangle princess, Tomb Raider (2013), uses SSO extensively, which is why the fps is bad.
Let’s think about how this would work. GPL allows for splitting up the pipeline into four sections:
Zink already uses this for normal shader programs. It creates partial pipeline libraries organized like this:
The all shaders library is generated asynchronously during load screens, the other two are generated on-demand, and the whole thing gets fast-linked together so quickly that there is (now) zero hitching.
Logically, supporting SSO from here should just mean expanding all shaders to a pair of pipeline libraries that can be created asynchronously:
This pair of shader libraries can then be fast-linked into the all shaders intermediate library, and the usual codepaths can then be taken.
It should be that easy.
Right?
Because descriptors exist. And while we all hoped I’d be able to make it a year without writing Yet Another Blog Post About Vulkan Descriptor Models, we all knew it was only a matter of time before I succumbed to the inevitable.
Zink, atop sane drivers, uses six descriptor sets:
This is nice since it keeps the codepaths for handling each set working at a descriptor-type level, making it easy to abstract while remaining performant*.
* according to me
When doing the pipeline library split above, however, this is not possible. The infinite wisdom of the GPL spec allows for independent sets between libraries, which is to say that library A can use set X, library B can use set Y, and the resulting pipeline will use sets X and Y.
But it doesn’t provide for any sort of merging of sets, which means that zink’s entire descriptor architecture is effectively incompatible with this use of GPL.
I wanted to write some brand new, bespoke, easily-breakable descriptor code anyway, and this gives me the perfect excuse to orphan some breathtakingly smart code in a way that will never be detected by CI.
What GPL requires is a new descriptor layout that looks more like this:
Note that leaving null sets here enables the bindless set to remain bound between SSO pipelines and regular pipelines, and no changes whatsoever need to be made to anything related to bindless. If there’s one thing I don’t want to touch (but will definitely fingerpaint all over within the next day or two), it’s bindless descriptor handling.
The first step of this latest Vulkan Descriptor Management Blog Post is to create new shader variants and modify the descriptor info for the corresponding variables. The set index needs to be updated as above, and then the bindings also need to be updated.
Yes, normally descriptor bindings are also calculated based on the descriptor-typing, which helps keep binding values low, but for SSO they have to be adjusted so that all descriptors for a shader can exist happily in a given set regardless of how many there are. This leads to the following awfulness:
void
zink_descriptor_shader_get_binding_offsets(const struct zink_shader *shader, unsigned *offsets)
{
   offsets[ZINK_DESCRIPTOR_TYPE_UBO] = 0;
   offsets[ZINK_DESCRIPTOR_TYPE_SAMPLER_VIEW] =
      shader->bindings[ZINK_DESCRIPTOR_TYPE_UBO][shader->num_bindings[ZINK_DESCRIPTOR_TYPE_UBO] - 1].binding + 1;
   offsets[ZINK_DESCRIPTOR_TYPE_SSBO] =
      offsets[ZINK_DESCRIPTOR_TYPE_SAMPLER_VIEW] +
      shader->bindings[ZINK_DESCRIPTOR_TYPE_SAMPLER_VIEW][shader->num_bindings[ZINK_DESCRIPTOR_TYPE_SAMPLER_VIEW] - 1].binding + 1;
   offsets[ZINK_DESCRIPTOR_TYPE_IMAGE] =
      offsets[ZINK_DESCRIPTOR_TYPE_SSBO] +
      shader->bindings[ZINK_DESCRIPTOR_TYPE_SSBO][shader->num_bindings[ZINK_DESCRIPTOR_TYPE_SSBO] - 1].binding + 1;
}
Given a shader, an array of per-set offsets is passed in, which then gets initialized based on the highest binding counts of previous sets to avoid collisions. These offsets are then applied in a shader pass that looks like this:
int set = nir->info.stage == MESA_SHADER_FRAGMENT;
unsigned offsets[4];
zink_descriptor_shader_get_binding_offsets(zs, offsets);
nir_foreach_variable_with_modes(var, nir, nir_var_mem_ubo | nir_var_mem_ssbo | nir_var_uniform | nir_var_image) {
   if (var->data.bindless)
      continue;
   var->data.descriptor_set = set;
   switch (var->data.mode) {
   case nir_var_mem_ubo:
      var->data.binding = !!var->data.driver_location;
      break;
   case nir_var_uniform:
      if (glsl_type_is_sampler(glsl_without_array(var->type)))
         var->data.binding += offsets[1];
      break;
   case nir_var_mem_ssbo:
      var->data.binding += offsets[2];
      break;
   case nir_var_image:
      var->data.binding += offsets[3];
      break;
   default: break;
   }
}
The one stupid part here is that nir_var_uniform also includes actual uniform-type variables, so those have to be ignored.
With this done, a shader’s descriptors can be considered SSO-capable.
But shaders are only half the equation, which means entirely novel and fascinating code has to be written to create VkDescriptorSetLayout objects, and VkDescriptorSet objects, and of course there’s VkDescriptorPool, not to mention VkDescriptorUpdateTemplate…
The blog post might go on infinitely if I actually did all that, so instead I’ve turned to our lord and savior, VK_EXT_descriptor_buffer. This allows me to reuse most of the existing descriptor buffer code for creating layouts, and then I can just write my descriptor data to an arbitrary bound buffer rather than create new pools/sets/templates.
As a veteran user of descriptors and member-with-voting-rights of the Vulkan Descriptor Bloggers Consortium, nothing described above poses the slightest challenge. The hard part of SSO handling is the actual pipeline management. Because all the precompiles are done per-shader in threads, there’s no reusable objects for caching with shader variants, nor is there a way to compile such variants asynchronously anyway. These shaders can have exactly one variant, the default, and anything else is not possible.
After a number of iterations on the concept, I settled on having a union in the zink_gfx_program struct which omits all of the mechanics for managing shader variants. The “separable” zink_gfx_program object functions like this:
- use the separable zink_gfx_program stub when possible
- compile the real zink_gfx_program asynchronously
- once the real zink_gfx_program is ready, replace the separable object with the real object

In this way, the precompiled SSO shaders can be fast-linked like a regular pipeline to avoid hitching, and the bespoke descriptor update path will be taken using more or less the same mechanics as the normal path to guarantee matching performance. Any non-shader-variant pipeline state changes can be handled without any new code, and everything “just works”.
Especially clever experts have been reading up until this point with six or seven eyebrows raised with the following thought in mind.
Doesn’t Tomb Raider (2013) use tessellation shaders?
Why yes. Yes, Tomb Raider (2013) does use tessellation shaders, and yes, it does only use them for drawing hair effects.
No, GPL cannot do separate pipeline libraries for tessellation shaders. Or geometry shaders, for that matter.
What it doesn’t mean is that I’ve fixed hitching for Tomb Raider (2013).
At present, it’s not possible to fix this using Vulkan. There is simply no way to precompile separate tessellation shaders, and so I again will say that the use of SSO is why the fps is bad.
But the above-described handling does eliminate hitching caused by simple instances of SSO, which mitigates some of the hitching in Tomb Raider (2013). I’m not aware of other games that use this functionality, but if there are any, hopefully they don’t use tessellation and are smoothed out by this MR*.
* Note: ZINK_DESCRIPTORS=db is currently required to enable this functionality
Gaming on Zink: We’re getting there.
Notable RADV developer and pixel enthusiast Samuel Pitoiset doesn’t blog, but he’s just put up a MR that enables non-GPL pipeline caching with RADV_PERFTEST=gpl.
Test it and report back.
This year I started a new job working with Igalia’s Graphics Team. For those of you who don’t know Igalia, they are a “worker-owned, employee-run cooperative model consultancy focused on open source software”.
As a new member of the team, I thought it would be a great idea to summarize the incredible amount of work the team completed in 2022. If you’re interested keep reading!
One of the big milestones for the team in 2022 was achieving Vulkan 1.2 conformance on the Raspberry Pi 4. The folks over at the Raspberry Pi company wrote a nice article about the achievement. Igalia has been partnering with the Raspberry Pi company to build and improve the graphics driver on all versions of the Raspberry Pi.
The Vulkan 1.2 spec ratification came with a few extensions that were promoted to Core. This means a conformant Vulkan 1.2 driver needs to implement those extensions. Alejandro Piñeiro wrote this interesting blog post that talks about some of those extensions.
Vulkan 1.2 also came with a number of optional extensions such as
VK_KHR_pipeline_executable_properties. My colleague Iago
Toral wrote an excellent blog
post on how we implemented that extension on the Raspberry Pi 4 and
what benefits it provides for debugging.
Igalia has been heavily supporting the Open-Source Turnip Vulkan driver for Qualcomm Adreno GPUs, and in 2022 we helped it achieve Vulkan 1.3 conformance. Danylo Piliaiev, on the graphics team here at Igalia, wrote a great blog post on this achievement! One of the biggest challenges for the Turnip driver is that it is a completely reverse-engineered driver that has been built without access to any hardware documentation or reference driver code.
With Vulkan 1.3 conformance has also come the ability to run more commercial games on Adreno GPUs through the use of the DirectX translation layers. If you would like to see more of this, check out this post from Danylo where he talks about getting “The Witcher 3”, “The Talos Principle”, and “OMD2” running on the A660 GPU. Outside of Vulkan 1.3 support, he also talks about some of the extensions that were implemented to allow “Zink” (the OpenGL over Vulkan driver) to run on Turnip, bringing OpenGL 4.6 support to Adreno GPUs.
Several developers on the Graphics Team made key contributions to Vulkan extensions and the Vulkan Conformance Test Suite (CTS). My colleague Ricardo Garcia wrote an excellent blog post about those contributions. Below I’ve listed what Igalia did for each of the extensions:
Our resident “Not an AMD expert” Melissa Wen made several
contributions to the AMDGPU driver. Those contributions include
connecting parts of the pixel
blending and post blending code in AMD’s DC module to
DRM and fixing
a bug related to how panel orientation is set when a display is
connected. She also had a presentation
at XDC 2022, where she talks about techniques you can use to
understand and debug AMDGPU, even when there aren’t hardware docs
available.
André Almeida also completed and submitted work on enabling logging features for the new GFXOFF hardware feature in AMD GPUs. He also created a userspace application (which you can find here) that lets you interact with this feature through the debugfs interface. Additionally, he submitted a patch for async page flips (which he also talked about in his XDC 2022 presentation), which is still yet to be merged.
Christopher Michael joined the Graphics Team in 2022 and, along with Chema Casanova, made some key contributions to enabling hardware acceleration and mode setting on the Raspberry Pi without the use of Glamor, which makes more video memory available to graphics applications running on a Raspberry Pi.
The older generation Raspberry Pis (1-3) only have a maximum of 256MB of memory available for video memory, and using Glamor will consume part of that video memory. Christopher wrote an excellent blog post on this work. Both he and Chema also gave a joint presentation at XDC 2022 going into more detail on this work.
Our very own Samuel Iglesias had a column published in Linux Format Magazine. It’s a short column about reaching Vulkan 1.1 conformance for the v3dv & Turnip Vulkan drivers, and how Open-Source GPU drivers can go from a “hobby project” to the de facto driver for the platform. Check it out on page 7 of issue #288!
X.Org Developers Conference is one of the big conferences for us here at the Graphics Team. Last year at XDC 2022 our Team presented 5 talks in Minneapolis, Minnesota. XDC 2022 took place towards the end of the year in October, so it provides some good context on how the team closed out the year. If you didn’t attend or missed their presentation, here’s a breakdown:
Ricardo presents what exactly mesh shaders are in Vulkan. He made many contributions to this extension, including writing thousands of CTS tests, and wrote a blog post on his presentation that you should check out!
Iago goes into detail about the current status of the Raspberry Pi Vulkan driver. He talks about achieving Vulkan 1.2 conformance, as well as some of the challenges the team had to solve due to hardware limitations of the Broadcom GPU.
Chema and Christopher talk about the challenges they had to solve to enable hardware acceleration on the Raspberry Pi without Glamor.
In this non-technical presentation, Melissa talks about techniques developers can use to understand and debug drivers without access to hardware documentation.
André talks about the work that has been done to enable asynchronous page flipping in DRM’s atomic API, with an introduction to the topic explaining what exactly an asynchronous page flip is and why you would want it.
Another important conference for us is FOSDEM, and last year we presented 3 of the 5 talks in the graphics dev room. FOSDEM took place in early February 2022, so these talks provide some good context on where the team started in 2022.
Hyunjun presented the current state of the Turnip driver, also talking about the difficulties of developing a driver for a platform without hardware documentation. He talks about how Turnip developers reverse engineer the behaviour of the hardware and then implement that in an open-source driver. He also made a companion blog post to check out along with his presentation.
Igalia has been presenting the status of the v3dv driver since December 2019 and in this presentation, Alejandro talks about the status of the v3dv driver in early 2022. He talks about achieving conformance, the extensions that had to be implemented, and the future plans of the v3dv driver.
Ricardo presents about the work he did on the
VK_EXT_border_color_swizzle extension in Vulkan. He talks
about the specific contributions he did and how the extension fits in
with sampling color operations in Vulkan.
Last year Melissa & André co-mentored contributors working on introducing KUnit tests to the AMD display driver. This project was hosted as a “Google Summer of Code” (GSoC) project from the X.Org Foundation. If you’re interested in seeing their work, Tales da Aparecida, Maíra Canal, Magali Lemes, and Isabella Basso presented it at the Linux Plumbers Conference 2022 and across two talks at XDC 2022. Here you can see their first presentation and here you can see their second presentation.
André & Melissa also mentored two “Igalia Coding Experience” (CE) projects, one related to IGT GPU test tools on the VKMS kernel driver, and the other for IGT GPU test tools on the V3D kernel driver. If you’re interested in reading up on some of that work, Maíra Canal wrote about her experience being part of the Igalia CE.
Ella Stanforth was also part of the Igalia Coding Experience, being
mentored by Iago & Alejandro. They worked on the
VK_KHR_sampler_ycbcr_conversion extension for the v3dv
driver. Alejandro talks about their work in his blog
post here.
The graphics team is looking forward to having a jam-packed 2023 with just as many if not more contributions to the Open-Source graphics stack! I’m super excited to be part of the team, and hope to see my name in our 2023 recap post!
Also, you might have heard that Igalia will be hosting XDC 2023 in the beautiful city of A Coruña! We hope to see you there where there will be many presentations from all the great people working on the Open-Source graphics stack, and most importantly where you can dream in the Atlantic!
Photo of A Coruña
I previously wrote a post talking about some optimization work that’s been done with RADV’s VK_EXT_graphics_pipeline_library implementation to improve fast-link performance. As promised, that wasn’t the end of the story. Today’s post will be a bit different, however, as I’ll be assuming all the graphics experts in the audience are already well-versed in all the topics I’m covering.
Also I’m assuming you’re all driver developers interested in improving your VK_EXT_graphics_pipeline_library fast-link performance.
The one exception is that today I’ll be using a specific definition for fast when it comes to fast-linking: to be fast, a driver should be able to fast-link in under 0.01ms. In an extremely CPU-intensive application, this should allow for even the explodiest of pipeline explosions (100+ fast-links in a single frame) to avoid any sort of hitching/stuttering.
Which drivers have what it takes to be fast?
To begin evaluating fast-link performance, it’s important to have test cases. Benchmarks. The sort that can be easily run, easily profiled, easily understood.
vkoverhead is the premier tool for evaluating CPU overhead in Vulkan drivers, and thanks to Valve, it now has legal support for GPL fast-link using real pipelines from Dota2. That’s right. Acing this synthetic benchmark will have real world implications.
For anyone interested in running these cases, it’s as simple as building and then running:
./vkoverhead -start 135
These benchmark cases will call vkCreateGraphicsPipelines in a tight loop to perform a fast-link on GPL-created pipeline libraries, fast-linking thousands of times per second for easy profiling. The number of iterations per second, in thousands, is then printed.
vkoverhead works with any Vulkan driver on any platform (including Windows!), which means it’s possible to use it to profile and optimize any driver.
vkoverhead currently has two cases for GPL fast-link. As they are both extracted directly from Dota2, they have a number of properties in common:
Each case tests the following:
- depthonly is a pipeline containing only a vertex shader, forcing the driver to use its own fragment shader
- slow is a pipeline that happens to be slow to create on many drivers

Various tools are available on different platforms for profiling, and I’m not going to go into details here. What I’m going to do instead is look into strategies for optimizing drivers. Strategies that I (and others) have employed in real drivers. Strategies that you, if you aren’t shipping a fast-linking implementation of GPL, might be interested in.
The depthonly case explicitly tests whether drivers are creating a new fragment shader for every pipeline that lacks one. Drivers should not do this.
Instead, create a single fragment shader on the device object and reuse it like these drivers do:
In addition to being significantly faster, this also saves some memory.
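A hedged sketch of the idea (the types and helpers here are made-up illustrations, not any driver’s actual code):

#include <threads.h>

/* Create the "missing fragment shader" once per device and hand the same
 * object to every fast-linked pipeline that lacks a fragment stage. */
static struct noop_fs *
get_shared_noop_fs(struct device_ctx *dev)
{
   mtx_lock(&dev->lock);
   if (!dev->noop_fs)
      dev->noop_fs = compile_noop_fs(dev);  /* happens exactly once */
   mtx_unlock(&dev->lock);

   return dev->noop_fs;
}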
Regular, optimized pipeline creation typically involves running optimization passes across the shader binaries, possibly even the entire pipeline, to ensure that various speedups can be found. Many drivers copy the internal shader IR in the course of pipeline creation to handle shader variants.
Don’t copy shader IR when trying to fast-link a pipeline.
Copying IR is very expensive, especially in larger shaders. Instead, either precompile unoptimized shader binaries in their corresponding GPL stage or refcount IR structures that must exist during execution. Examples:
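As a generic illustration of the refcounting alternative (a hedged sketch, not any particular driver’s code):

#include <stdlib.h>

/* Share one immutable IR object between the GPL library and every pipeline
 * built from it instead of deep-copying it per pipeline. */
struct shader_ir {
   int refcount;   /* protect with an atomic or the driver's own locking */
   void *ir;       /* e.g. a nir_shader in Mesa drivers */
};

static struct shader_ir *
shader_ir_ref(struct shader_ir *s)
{
   s->refcount++;
   return s;
}

static void
shader_ir_unref(struct shader_ir *s, void (*free_ir)(void *))
{
   if (--s->refcount == 0) {
      free_ir(s->ir);
      free(s);
   }
}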
This one seems obvious, but it has to be stated.
Do not compile shaders when attempting to achieve fast-link speed.
If you are compiling shaders, this is a very easy place to start optimizing.
There’s no reason to cache a fast-linked pipeline. The amount of time saved by retrieving a cached pipeline should be outweighed by the amount of time required to:
I say should because ideally a driver should be so fast at combining a GPL pipeline that even a cache hit is only comparable performance, if not slower outright. Skip all aspects of caching for these pipelines.
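One way to tell the two kinds of requests apart is that a fast-link is a pipeline assembled from libraries without the link-time-optimization flag; a hedged sketch of such a check (not any specific driver’s code):

#include <stdbool.h>
#include <vulkan/vulkan.h>

/* Return true for pipelines that are being fast-linked from GPL libraries,
 * so the driver can skip cache lookup/insertion entirely for them. */
static bool
is_fastlink_request(const VkGraphicsPipelineCreateInfo *info)
{
   const VkBaseInStructure *s = info->pNext;
   for (; s; s = s->pNext) {
      if (s->sType == VK_STRUCTURE_TYPE_PIPELINE_LIBRARY_CREATE_INFO_KHR) {
         const VkPipelineLibraryCreateInfoKHR *libs = (const void *)s;
         return libs->libraryCount > 0 &&
                !(info->flags & VK_PIPELINE_CREATE_LINK_TIME_OPTIMIZATION_BIT_EXT);
      }
   }
   return false;
}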
If a driver is still slow after checking for the above items, it’s time to try profiling. It’s surprising what slowdowns drivers will hit. The classics I’ve seen are large memset calls and avoidable allocations.
Some examples:
In my previous post, I alluded to a driver that was shipping a GPL implementation that advertised fast-link but wasn’t actually fast. I saw a lot of guesses. Nobody got it right.
It was Lavapipe (me) all along.
As hinted at above, however, this is no longer the case. In fact, after going through the listed strategies, Lavapipe now has the fastest GPL linking in the world.
Obviously it would have to if I’m writing a blog post about optimizing fast-linking, right?
How fast is Lavapipe’s linking, you might ask?
To answer this, let’s first apply a small patch to bump up Lavapipe’s descriptor limits so it can handle the beefy Dota2 pipelines. With that done, here’s a look at comparisons to other, more legitimate drivers, all running on the same system.
NVIDIA is the gold standard for GPL fast-linking considering how long they’ve been shipping it. They’re pretty fast.
$ VK_ICD_FILENAMES=nvidia_icd.json ./vkoverhead -start 135 -duration 5
vkoverhead running on NVIDIA GeForce RTX 2070:
* misc numbers are reported as thousands of operations per second
* percentages for misc cases should be ignored
135, misc_compile_fastlink_depthonly, 444, 100.0%
136, misc_compile_fastlink_slow, 243, 100.0%
RADV (with pending MRs applied) has gotten incredibly fast over the past week-ish.
$ RADV_PERFTEST=gpl ./vkoverhead -start 135 -duration 5
vkoverhead running on AMD Radeon RX 5700 XT (RADV NAVI10):
* misc numbers are reported as thousands of operations per second
* percentages for misc cases should be ignored
135, misc_compile_fastlink_depthonly, 579, 100.0%
136, misc_compile_fastlink_slow, 537, 100.0%
Lavapipe (with pending MRs applied) blows them both out of the water.
$ VK_ICD_FILENAMES=lvp_icd.x86_64.json ./vkoverhead -start 135 -duration 5
vkoverhead running on llvmpipe (LLVM 15.0.6, 256 bits):
* misc numbers are reported as thousands of operations per second
* percentages for misc cases should be ignored
135, misc_compile_fastlink_depthonly, 1485, 100.0%
136, misc_compile_fastlink_slow, 1464, 100.0%
Even if the NVIDIA+RADV numbers are added together, it’s still not close.
If I switch over to a different machine, Intel’s ANV driver has a MR for GPL open, and it’s seeing some movement. Here’s a head-to-head with the champion.
$ ./vkoverhead -start 135 -duration 5
vkoverhead running on Intel(R) Iris(R) Plus Graphics (ICL GT2):
* misc numbers are reported as thousands of operations per second
* percentages for misc cases should be ignored
135, misc_compile_fastlink_depthonly, 384, 100.0%
136, misc_compile_fastlink_slow, 276, 100.0%
$ VK_ICD_FILENAMES=lvp_icd.x86_64.json ./vkoverhead -start 135 -duration 5
vkoverhead running on llvmpipe (LLVM 15.0.6, 256 bits):
* misc numbers are reported as thousands of operations per second
* percentages for misc cases should be ignored
135, misc_compile_fastlink_depthonly, 1785, 100.0%
136, misc_compile_fastlink_slow, 1779, 100.0%
On yet another machine, here’s Turnip, which advertises the fast-link feature. This driver requires a small patch to bump MAX_SETS to 5, since it’s hardcoded at 4. I’ve also pinned execution here to the big cores for consistency.
# turnip ooms itself with -duration
$ ./vkoverhead -start 135
vkoverhead running on Turnip Adreno (TM) 618:
* misc numbers are reported as thousands of operations per second
* percentages for misc cases should be ignored
135, misc_compile_fastlink_depthonly, 73, 100.0%
136, misc_compile_fastlink_slow, 23, 100.0%
$ VK_ICD_FILENAMES=lvp_icd.aarch64.json ./vkoverhead -start 135 -duration 5
vkoverhead running on llvmpipe (LLVM 14.0.6, 128 bits):
* misc numbers are reported as thousands of operations per second
* percentages for misc cases should be ignored
135, misc_compile_fastlink_depthonly, 690, 100.0%
136, misc_compile_fastlink_slow, 699, 100.0%
We’ve seen that Lavapipe is unequivocally the champion of fast-linking in every head-to-head, but what does this actually look like in timings?
Here’s a chart that shows the breakdown in milliseconds.
| Driver | misc_compile_fastlink_depthonly | misc_compile_fastlink_slow |
|---|---|---|
| NVIDIA | 0.002ms | 0.004ms |
| RADV | 0.0017ms | 0.0019ms |
| Lavapipe | 0.0007ms | 0.0007ms |
| ANV | 0.0026ms | 0.0036ms |
| Lavapipe | 0.00056ms | 0.00056ms |
| Turnip | 0.0137ms | 0.0435ms |
| Lavapipe | 0.001ms | 0.001ms |
As we can see, all of these drivers are “fast”. A single fast-link pipeline isn’t likely to cause any of them to drop a frame.
The driver I’ve got my eye on, however, is Turnip, which is the only one of the tested group that doesn’t quite hit that 0.01ms target. A little bit of profiling might show some easy gains here.
For another view of these drivers, let’s examine the relative performance. Since GPL fast-linking is inherently a CPU task that has no relation to the GPU, it stands to reason that a CPU-based driver should be able to optimize for it the best given that there’s already all manner of hackery going on to defer and delay execution. Indeed, reality confirms this, and looking at any profile of Lavapipe for the benchmark cases reveals that the only remaining bottleneck is the speed of malloc, which is to say the speed with which the returned pipeline object can be allocated.
Thus, ignoring potential micro-optimizations of pipeline struct size, it can be said that Lavapipe has effectively reached the maximum speed of the system for fast-linking. From there, we can say that any other driver running on the same system is utilizing some fraction of this power.
Therefore, every other driver’s fast-link performance can be visualized in units of Lavapipe (lvps) to determine how much gain is possible if things like refactoring time and feasibility are ignored. For example, NVIDIA’s 444 thousand ops/sec in the depthonly case divided by Lavapipe’s 1485 on the same machine works out to roughly 0.299 lvps.
| Driver | misc_compile_fastlink_depthonly | misc_compile_fastlink_slow |
|---|---|---|
| NVIDIA | 0.299lvps | 0.166lvps |
| RADV | 0.390lvps | 0.367lvps |
| ANV | 0.215lvps | 0.155lvps |
| Turnip | 0.106lvps | 0.033lvps |
The great thing about lvps is that these are comparable units.
At last, we finally have a way to evaluate all these drivers in a head-to-head across different systems.
The results are a bit surprising to me: RADV has gone from last place in lvps last week to first place this week.
Aside from the strategies outlined above, the key takeaway for me is that there shouldn’t be any hardware limitation to implementing fast-linking. It’s a CPU-based architectural problem, and with enough elbow grease, any driver can aspire to reach nonzero lvps in vkoverhead’s benchmark cases.
I know everyone’s been eagerly awaiting the return of the pasta maker.
The wait is over.
But today we’re going to move away from those dangerous, addictive synthetic benchmarks to look at a different kind of speed. That’s right. Today we’re looking at pipeline compile speed. Some of you are scoffing, mouse pointer already inching towards the close button on the tab.
Pipeline compile speed in the current year? Why should anyone care when we have great tools like Fossilize that can precompile everything for a game over the course of several hours to ensure there’s no stuttering?
It turns out there’s at least one type of pipeline compile that still matters going forward. Specifically, I’m talking about fast-linked pipelines using VK_EXT_graphics_pipeline_library.
Let’s get an appetizer going, some exposition under our belts before we get to the spaghetti we’re all craving.
All my readers are graphics experts. It won’t come as any surprise when I say that a pipeline is a program containing shaders which is used by the GPU. And you all know how VK_EXT_graphics_pipeline_library enables compiling partial pipelines into libraries that can then be combined into a full pipeline. None of you need a refresher on this, and we all acknowledge that I’m just padding out the word count of this post for posterity.
Some of you experts, however, have been so deep into getting those green triangles on the screen to pass various unit tests that you might not be fully aware of the fast-linking property of VK_EXT_graphics_pipeline_library.
In general, compiling shaders during gameplay is (usually) bad. This is (usually) what causes stuttering: the compilation of a pipeline takes longer than the available time to draw the frame, and rendering blocks until compilation completes. The fast-linking property of VK_EXT_graphics_pipeline_library changes this paradigm by enabling pipelines, e.g., for shader variants, to be created fast enough to avoid stuttering.
Typically, this is utilized in applications through a process along these lines:
1. compile partial pipeline libraries for each pipeline ahead of time (or asynchronously)
2. at draw time, fast-link the libraries into a usable pipeline
3. kick off a compile of the fully-optimized version of the same pipeline in a background thread
4. once the optimized pipeline is ready, swap it in to replace the fast-linked one
In this way, no draw is blocked by a pipeline creation, and optimized pipelines are still used for the majority of GPU operations.
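For anyone who wants to see what this looks like at the API level, here’s a minimal sketch of the fast-link step against the Vulkan headers. This is illustrative code of my own, not something lifted from zink or any driver; it assumes the four partial pipeline libraries were already created, and it omits error handling and all the library-creation boilerplate.
#include <vulkan/vulkan.h>

/* Illustrative sketch only (not zink or driver code): fast-link four
 * previously-compiled pipeline libraries into a usable pipeline. The
 * important part is what is absent: no
 * VK_PIPELINE_CREATE_LINK_TIME_OPTIMIZATION_BIT_EXT, so the driver is
 * expected to return quickly instead of compiling anything. */
static VkPipeline
fast_link_pipeline(VkDevice device, VkPipelineLayout layout,
                   const VkPipeline libraries[4])
{
   const VkPipelineLibraryCreateInfoKHR link_info = {
      .sType = VK_STRUCTURE_TYPE_PIPELINE_LIBRARY_CREATE_INFO_KHR,
      /* vertex input, pre-rasterization, fragment, fragment output */
      .libraryCount = 4,
      .pLibraries = libraries,
   };
   const VkGraphicsPipelineCreateInfo info = {
      .sType = VK_STRUCTURE_TYPE_GRAPHICS_PIPELINE_CREATE_INFO,
      .pNext = &link_info,
      /* the optimized background compile would instead set
       * VK_PIPELINE_CREATE_LINK_TIME_OPTIMIZATION_BIT_EXT here */
      .flags = 0,
      .layout = layout, /* must be compatible with the libraries */
   };
   VkPipeline pipeline = VK_NULL_HANDLE;
   /* no pipeline cache either; as shown later in this post, caching
    * fast-linked pipelines only adds overhead */
   vkCreateGraphicsPipelines(device, VK_NULL_HANDLE, 1, &info, NULL, &pipeline);
   return pipeline;
}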
…would I care about this if I have Fossilize and a brand new gaming supercomputer with 256 cores all running at 12GHz?
I know you’re wondering, and the answer is simple: not everyone has these things.
Some people don’t have extremely modern computers, which means Fossilize pre-compile of shaders can take hours. Who wants to sit around waiting that long to play a game they just downloaded?
Some games don’t use Fossilize, which means there’s no pre-compile. In these situations, there are two options:
- compile every pipeline up-front at load time
- compile pipelines at draw time, right when they’re needed
The former option here gives us load times that remind us of the original Skyrim release. The latter probably yields stuttering.
Thus, VK_EXT_graphics_pipeline_library (henceforth GPL) with fast-linking.
What does the “fast” in fast-linking really mean?
How fast is “fast”?
These are great questions that nobody knows the answer to. The only limitation here is that “fast” has to be “fast enough” to avoid stuttering.
Given that RADV is in the process of bringing up GPL for general use, and given that Zink is relying on fast-linking to eliminate compile stuttering, I thought I’d take out my perf magnifying glass and see what I found.
Obviously we wouldn’t be advertising fast-linking on RADV if it wasn’t fast.
Obviously.
It goes without saying that we care about performance. No credible driver developer would advertise a performance-related feature if it wasn’t performant.
RIGHT?
And it’s not like I tried running Tomb Raider on zink and discovered that the so-called “fast”-link pipelines were being created at a non-fast speed. That would be insane to even consider—I mean, it’s literally in the name of the feature, so if using it caused the game to stutter, or if, for example, I was seeing “fast”-link pipelines being created in 10ms+…
Surely I didn’t see that though.
Surely I didn’t see fast-link pipelines taking more than an entire frame’s worth of time to create.
Long-time readers know that this is fine. I’m unperturbed by seeing numbers like this, and I can just file a ticket and move on with my life like a normal per—
OBVIOUSLY I CAN’T.
Obviously.
And just as obviously I had to get a second opinion on this, which is why I took my testing over to the only game I know which uses GPL with fast-link: 3D Pinball: Space Cadet DOTA 2!
Naturally it would be DOTA2, along with any other Source Engine 2 game, that uses this functionality.
Thus, I fired up my game, and faster than I could scream MID OR MEEPO into my mic, I saw the unthinkable spewing out in my console:
COMPILE 11425115
COMPILE 39491
COMPILE 11716326
COMPILE 35963
COMPILE 11057200
COMPILE 37115
COMPILE 10738436
Yes, those are all “fast”-linked pipeline compile times in nanoseconds.
Yes, half of those are taking more than 10ms.
The first step is always admitting that you have a problem, but I don’t have a problem. I’m fine. Not upset at all. Don’t read more into it.
As mentioned above, we have great tools in the Vulkan ecosystem like Fossilize to capture pipelines and replay them outside of applications. This was going to be a great help.
I thought.
I fired up a 32bit build of Fossilize, set it to run on Tomb Raider, and immediately it exploded.
Zink has, historically, been the final boss for everything Vulkan-related, so I was unsurprised by this turn of events. I filed an issue, finger-painted ineffectually, and then gave up because I had called in the expert.
That’s right.
Friend of the blog, artisanal bit-wrangler, and a developer whose only speed is -O3 -ffast-math, Hans-Kristian Arntzen took my hand-waving, unintelligible gibbering, and pointing in the wrong direction and churned out a masterpiece in less time than it took RADV to “fast”-link some of those pipelines.
While I waited, I was working at the picosecond-level with perf to isolate the biggest bottleneck in fast-linking.
My caveman-like, tool-less hunt yielded immediate results: nir_shader_clone during fast-link was taking an absurd amount of time, and on top of that, shaders were actually being compiled at this point.
This was a complex problem to solve, and I had lots of other things to do (so many things), which meant I needed to call in another friend of the blog to take over while I did all the things I had to do.
Some of you know his name, and others just know him as “that RADV guy”, but Samuel Pitoiset is the real deal when it comes to driver development. He can crank out an entire extension implementation in less time than it takes me to write one of these long-winded, benchmark-number-free introductions to a blog post, and when I told him we had a huge problem, he dropped* everything and jumped on board.
* and when I say “dropped” I mean he finished finding and fixing another Halo Infinite hang in the time it took me to explain the problem
With lightning speed, Samuel reworked pipeline creation to not do that thing I didn’t want it to do. Because doing any kind of compiling when the driver is instead supposed to be “fast” is bad. Really bad.
How did that affect my numbers?
By now I was tired of dealing with the 32bit nonsense of Tomb Raider and had put all my eggs in the proverbial DOTA2 basket, so I again fired up a round, went to AFK in jungle, and checked my debug prints.
COMPILE 55699
COMPILE 55998
COMPILE 58016
COMPILE 56825
COMPILE 60288
COMPILE 110663
COMPILE 59679
COMPILE 50614
COMPILE 54316
Do my eyes deceive me or is that a 20,000% speedup from a single patch?!
And so the problem was solved. I went to Dan Ginsburg, who I’m sure everyone knows as the author of this incredible blog post about GPL, and I showed him the improvements and our new timings, and I asked what he thought about the performance now.
Dan looked at me. Looked at the numbers I showed him. Shook his head a single time.
It shook me.
I don’t know what I was thinking.
In my defense, a 20,000% speedup is usually enough to call it quits on a given project. In this case, however, I had the shadow of a competitor looming overhead.
While RADV was now down to 0.05-0.11ms for a fast-link, NVIDIA can apparently do this consistently in 0.02ms.
That’s pretty fast.
By now, the man, the myth, @themaister, Hans-Kristian Arntzen had finished fixing every Fossilize bug that had ever existed and would ever exist in the future, which meant I could now capture and replay GPL pipelines from DOTA2. Fossilize also has another cool feature: it allows for extraction of single pipelines from a larger .foz file, which is great for evaluating performance.
The catch? It doesn’t have any way to print per-pipeline compile timings during a replay, nor does it have a way to sort pipeline hashes based on compile times.
Either I was going to have to write some C++ to add this functionality to Fossilize, or I was going to have to get creative. With my Chromium PTSD in mind, I found myself writing out this construct:
for x in $(fossilize-list --tag 6 dota2.foz); do
echo "PIPELINE $x"
RADV_PERFTEST=gpl fossilize-replay --pipeline-hash $x dota2.foz 2>&1|grep COMPILE
done
I’d previously added some in-driver printfs to output compile times for the fast-link pipelines, so this gave me a file with the pipeline hash on one line and the compile timing on the next. I could then sort this and figure out some outliers to extract, yielding slow.foz, a fast-link that consistently took longer than 0.1ms.
I took this to Samuel, and we put our perfs together. Immediately, he spotted another bottleneck: SHA1Transform() was taking up a considerable amount of CPU time. This was occurring because the fast-linked pipelines were being added to the shader cache for reuse.
But what’s the point of adding an unoptimized, fast-linked pipeline to a cache when it should take less time to just fast-link and return?
Blammo, another lightning-fast patch from Samuel, and fast-linked pipelines were no longer being considered for cache entries, cutting off even more compile time.
slow.foz was now consistently down to 0.07-0.08ms.
No.
A post-Samuel flamegraph showed a few immediate issues:
First, and easiest, a huge memset. Get this thing out of here.
Now slow.foz was fast-linking in 0.06-0.07ms. Where was the flamegraph at on this?
Now the obvious question: What the farfalloni was going on with still creating a shader?!
It turns out this particular pipeline was being created without a fragment shader, and that shader was being generated during the fast-link process. Incredible coverage testing from an incredible game.
Fixing this proved trickier, and it still remains tricky. An unsolved problem.
However.
<zmike> can you get me a hack that I can use for that foz ?
* zmike just needs to get numbers for the blog
<hakzsam> hmm
<hakzsam> I'm trying
Like a true graphics hero, that hack was delivered just in time for me to run it through the blogginator. What kinds of gains would be had from this untested mystery patch?
slow.foz was now down to 0.023 ms (23566 ns).
Thanks to Hans-Kristian enabling us and Samuel doing a lot of heavy and unsafe lifting while I sucked wind on the sidelines, we hit our target time of 0.02ms, which is a 50,000% improvement from where things started.
What does this mean?
This means in the very near future, you can fire up RADV_PERFTEST=gpl and run DOTA2 (or zink) on RADV without any kind of shader pre-caching and still have zero stuttering.
This means you can write apps relying on fast-linking and be assured that your users will not see stuttering on RADV.
So far, there aren’t many drivers out there that implement GPL with true fast-linking. Aside from (a near-future version of) RADV, I’m reasonably certain the only driver that both advertises fast-linking and actually has fast linking is NVIDIA.
If you’re from one of those companies that has yet to take the plunge and implement GPL, or if you’ve implemented it and decided to advertise the fast-linking feature without actually being fast, here are some key takeaways from a week in GPL optimization:
- don’t do any shader compilation during a fast-link; that’s the entire point of the feature
- don’t hash or cache fast-linked pipelines; re-linking is cheaper than the cache machinery
- watch out for big memsets and other allocation overhead on the fast path
- profile the fast-link path itself
You might be thinking that profiling a single operation like this is tricky, and it’s hard to get good results from a single fossilize-replay that also compiles multiple library pipelines.
Never fear, vkoverhead is here to save the day.
You thought I wouldn’t plug it again, but here we are. In the very near future (ideally later today), vkoverhead will have some cases that isolate GPL fast-linking. This should prove useful for anyone looking to go from “fast” to fast.
There’s no big secret about being truly fast, and there’s no architectural limitations on speed. It just takes a little bit of elbow grease and some profiling.
The goal is to move GPL out of RADV_PERFTEST with Mesa 23.1 to enable it by default. There’s still some functional work to be done, but we’re not done optimizing here either.
One day I’ll be able to say with confidence that RADV has the fastest fast-link in the world, or my name isn’t Spaghetti Good Code.
UPDATE: It turns out RADV won’t have the fastest fast-link, but it can get a lot faster.
One of the challenges of reviewing a lot of code is that many reviews require multiple iterations. I really don't want to do a full review from scratch on the second and subsequent rounds. I need to be able to see what has changed since last time.
I happen to work on projects that care about having a useful Git history. This means that authors of (without loss of generality) pull requests use amend and rebase to change commits and force-push the result. I would like to see only the changes they made since my last review pass. Especially when the author also rebased onto a new version of the main branch, existing code review tools tend to break down.
Git has a little-known built-in subcommand, git range-diff, which I had been using for a while. It's pretty cool, really: It takes two ranges of commits, old and new, matches old and new commits, and then shows how they changed. The rather huge problem is that its output is a diff of diffs. Trying to make sense of those quickly becomes headache-inducing.
I finally broke down at some point late last year and wrote my own tool, which I'm calling diff-modulo-base. It allows you to look at the difference of the repository contents between the old and new versions of a branch, while ignoring all the changes that are due to differences in the respective base versions they were rebased onto.
As a bonus, it actually does explicitly show differences between the two base versions that would have caused merge conflicts during rebase. This allows a fairly comfortable view of how merge conflicts were resolved.
I've been using this tool for a while now. While there are certainly still some rough edges and to dos, I did put a bunch more effort into it over the winter holidays and am now quite happy with it. I'm making it available for all to try at https://git.sr.ht/~nhaehnle/diff-modulo-base. Let me know if you find it useful!
One of the rough edges is that it would be great to integrate tightly with the GitHub notifications workflow. That workflow is surprisingly usable in that you can essentially treat the notifications as an inbox in which you can mark notifications as unread or completed, and can "mute" issues and pull requests, all with keyboard shortcuts.
What's missing in my workflow is a reliable way to remember the most recent version of a pull request that I have reviewed. My somewhat passable workaround for now is to git fetch before I do a round of reviews, and rely on the local reflog of remote refs. A Git alias allows me to say
git dmb-origin $pull_request_id
and have that become
git diff-modulo-base origin/main origin/pull/$pull_request_id/head@{1} origin/pull/$pull_request_id/head
which is usually what I want.
Ideally, I'd have a fully local way of interacting with GitHub notifications, which could then remember the reviewed version in a more reliable way. This ought to also fix the terrible lagginess of the web interface. But that's a rant for another time.
This is the first serious piece of code I've written in Rust. I have to say that experience has really been quite pleasant so far. Rust's tooling is pretty great, mostly thanks to the rust-analyzer LSP server.
The one thing I'd wish is that the borrow checker was able to better understand "partial" borrows. I find it occasionally convenient to tie a bunch of data structures together in a general context structure, and helper functions on such aggregates can't express that they only borrow part of the structure. This can usually be worked around by changing data types, but the fact that I have to do that is annoying. It feels like having to solve a puzzle that isn't part of the inherent complexity of the underlying problem that the code is trying to solve.
And unlike, say, circular references or graph structures in general, where it's clear that expressing and proving the sort of useful lifetime facts that developers might intuitively reason about quickly becomes intractable, improving the support for partial borrows feels like it should be a tractable problem.
After hacking the Intel media-driver and ffmpeg I managed to work out how the anv hardware mostly works now for h264 decoding.
I've pushed a branch [1] and a MR[2] to mesa. The basics of h264 decoding are working great on gen9 and compatible hardware. I've tested it on my one Lenovo WhiskeyLake laptop.
I have ported the code to hasvk as well, and once we get moving on this I'll polish that up and check we can h264 decode on IVB/HSW devices.
The one feature I know is missing is status reporting; radv can't support that, from what I can work out, due to firmware limitations, but anv should be able to, so I might dig into that a bit.
[1] https://gitlab.freedesktop.org/airlied/mesa/-/tree/anv-vulkan-video-decode
[2] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20782
In the beginning, there was the egg. Then fictional people started eating it from different ends, and the terms "little endians" and "Big Endians" were born.
Computer architectures (mostly) come with one of two byte orders: MSB first or LSB first. The two are incompatible of course, and many a bug was introduced trying to convert between the two (or, more commonly: failing to do so). The two byte orders were termed Big Endian and little endian, because that hilarious naming scheme at least gives us something to laugh about while contemplating throwing it all away and considering a future as, I don't know, a strawberry plant.
Back in the mullet-infested 80s when the X11 protocol was designed, both little endian and big endian were common enough. And back then running the X server on a different host than the client was common too - the X terminals back then had less processing power than a smart toilet seat today, so the cpu-intensive clients were running on some mainframe. To avoid overtaxing the poor mainframe already running dozens of clients for multiple users, the job of converting between the two byte orders was punted to the X server. So to this day whenever a client connects, the first byte it sends is a literal "l" or "B" to inform the server of the client's byte order. Where the byte order doesn't match the X server's byte order, the client is a "swapped client" in X server terminology and all 16, 32, and 64-bit values must be "byte-swapped" into the server's byte order. All of those values in all requests, and then again back to the client's byte order in all outgoing replies and events. Forever, till a crash do them part.
If you get one of those wrong, the number is no longer correct. And it's properly wrong too, the difference between 0x1 and 0x01000000 is rather significant. [0] Which has the hilarious side-effect of... well, pretty much anything. But usually it ranges from crashing the server (thus taking all other clients down in commiseration) to leaking random memory locations. The list of security issues affecting the various SProcFoo implementations (X server naming scheme for Swapped Procedure for request Foo) is so long that I'm too lazy to pull out the various security advisories and link to them. Just believe me, ok? *jedi handwave*
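To illustrate with a toy example (this is not X server code): a 32-bit value from a swapped client has to go through something like the swap below, and reading the same four bytes with the wrong byte order turns a harmless 1 into 16777216.
#include <stdint.h>
#include <stdio.h>

/* Toy example, not X server code: byte-swapping a 32-bit value. */
static uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00) |
           ((v << 8) & 0x00ff0000) | (v << 24);
}

int main(void)
{
    uint32_t length = 0x1;              /* what the client meant */
    /* 0x01000000: the same four bytes interpreted with the wrong byte order */
    printf("0x%08x\n", swap32(length));
    return 0;
}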
These days, encountering a Big Endian host is increasingly niche, and having it run an X client that connects to your local little-endian X server is even more niche [1]. I think the only regular real-world use-case for this is running X clients on an s390x, connecting to your local intel-ish (and thus little endian) workstation. Not something most users do on a regular basis. So right now, the byte-swapping code is mainly a free attack surface that 99% of users never actually use for anything real. So... let's not do that?
I just merged a PR into the X server repo that prohibits byte-swapped clients by default. A Big Endian client connecting to an X server will fail the connection with an error message of "Prohibited client endianess, see the Xserver man page". [2] Thus, a whole class of future security issues avoided - yay!
For the use-cases where you do need to let Big Endian clients connect to your little endian X server, you have two options: start your X server (Xorg, Xwayland, Xnest, ...) with the +byteswappedclients commandline option. Alternatively, and this only applies for Xorg: add Option "AllowByteSwappedClients" "on" to the xorg.conf ServerFlags section. Both of these will change the default back to the original setting. Both are documented in the Xserver(1) and xorg.conf(5) man pages, respectively.
Now, there's a drawback: in the Wayland stack, the compositor is in charge of starting Xwayland which means the compositor needs to expose a way of passing +byteswappedclients to Xwayland. This is compositor-specific, bugs are filed for mutter (merged for GNOME 44), kwin and wlroots. Until those are addressed, you cannot easily change this default (short of changing /usr/bin/Xwayland into a wrapper script that passes the option through).
There's no specific plan yet which X releases this will end up in, primarily because the release cycle for X is...undefined. Probably xserver-23.0 if and when that happens. It'll probably find its way into the xwayland-23.0 release, if and when that happens. Meanwhile, distributions interested in this particular change should consider backporting it to their X server version. This has been accepted as a Fedora 38 change.
[0] Also, it doesn't help that much of the X server's protocol handling code was written with the attitude of "surely the client wouldn't lie about that length value"
[1] little-endian client to Big Endian X server is so rare that it's barely worth talking about. But suffice to say, the exact same applies, just with little and big swapped around.
[2] That message is unceremoniously dumped to stderr, but that bit is unfortunately a libxcb issue.
Needless to say h264/5 weren't my real goals in life for video decoding. Lynne and myself decided to see what we could do to drive AV1 decode forward by creating our own extensions called VK_MESA_video_decode_av1. This is a radv only extension so far, and may expose some peculiarities of AMD hardware/firmware.
Lynne's blog entry[1] has all the gory details, so go read that first. (really read it first).
Now that you've read and understood all that, I'll just rant here a bit. Figuring out the DPB management and hw frame ref and curr_pic_idx fields was a bit of a nightmare. I spent a few days hacking up a lot of wrong things before landing on the thing we agreed was the least wrong which was having the ffmpeg code allocate a frame index in the same fashion as the vaapi radeon implementation did. I had another hacky solution that involved overloading the slotIndex value to mean something that wasn't DPB slot index, but it wasn't really any better. I think there may be something about the hw I don't understand so hopefully we can achieve clarity later.
After 8 months of work by Yinon Burgansky, libinput now has a new pointer acceleration profile: the "custom" profile. This profile allows users to tweak the exact response of their device based on their input speed.
A short primer: the pointer acceleration profile is a function that multiplies the incoming deltas with a given factor F, so that your input delta (x, y) becomes (Fx, Fy). How this is done is specific to the profile; libinput's existing profiles had either a flat factor or an adaptive factor that roughly resembles what Xorg used to have, see the libinput documentation for the details. The adaptive curve however has a fixed behaviour: all a user could do was scale the curve up/down, but not actually adjust the shape of the curve.
The new custom filter allows exactly that: it allows a user to configure a completely custom ratio between input speed and output speed. That ratio will then influence the current delta. There is a whole new API to do this but simplified: the profile is defined via a series of points of (x, f(x)) that are linearly interpolated. Each point is defined as input speed in device units/ms to output speed in device units/ms. For example, to provide a flat acceleration equivalent, specify [(0.0, 0.0), (1.0, 1.0)]. With the linear interpolation this is of course a 45-degree function, and any incoming speed will result in the equivalent output speed.
Noteworthy: we are talking about the speed here, not any individual delta. This is not exactly the same as the flat acceleration profile (which merely multiplies the deltas by a constant factor) - it does take the speed of the device into account, i.e. device units moved per ms. For most use-cases this is the same, but for particularly slow motion the speed may be calculated across multiple deltas (e.g. "user moved 1 unit over 21ms"). This avoids some jumpiness at low speeds.
But because the curve is speed-based, it allows for some interesting features too: the curve [(0.0, 1.0), (1.0, 1.0)] is a horizontal function at 1.0. Which means that any input speed results in an output speed of 1 unit/ms. So regardless how fast the user moves the mouse, the output speed is always constant. I'm not immediately sure of a real-world use case for this particular case (some accessibility needs maybe) but I'm sure it's a good prank to play on someone.
Because libinput is written in C, the API is not necessarily immediately obvious but: to configure you pass an array of (what will be) y-values and set the step-size. The curve then becomes: [(0 * step-size, array[0]), (1 * step-size, array[1]), (2 * step-size, array[2]), ...]. There are some limitations on the number of points but they're high enough that they should not matter.
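To make that a bit more concrete, here's a rough sketch of the curve evaluation as I understand it (my own illustration, not libinput code); what happens beyond the last configured point is up to the implementation, so I simply clamp here:
#include <stddef.h>

/* Sketch of the curve evaluation (not libinput code). y[]: configured output
 * speeds at x = 0, step, 2*step, ...; assumes npoints >= 2. Returns the
 * output speed for a given input speed, both in device units/ms. */
static double
custom_curve_eval(const double *y, size_t npoints, double step, double speed_in)
{
    if (speed_in <= 0.0)
        return y[0];
    if (speed_in >= step * (npoints - 1))
        return y[npoints - 1];          /* clamped here for simplicity */

    size_t i = (size_t)(speed_in / step);
    double x0 = i * step;
    double t = (speed_in - x0) / step;  /* position within the segment, 0..1 */
    return y[i] + t * (y[i + 1] - y[i]);
}

/* The flat-equivalent example from above, i.e. points (0.0, 0.0), (1.0, 1.0):
 *     double y[] = { 0.0, 1.0 };
 *     custom_curve_eval(y, 2, 1.0, speed);  // returns speed unchanged within
 *                                           // the configured range */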
Note that any curve is still device-resolution dependent, so the same curve will not behave the same on two devices with different resolution (DPI). And since the curves uploaded by the user are hand-polished, the speed setting has no effect - we cannot possibly know how a custom curve is supposed to scale. The setting will simply update with the provided value and return that but the behaviour of the device won't change in response.
Finally, there's another feature in this PR - the so-called "movement type" which must be set when defining a curve. Right now, we have two types, "fallback" and "motion". The "motion" type applies to, you guessed it, pointer motion. The only other type available is fallback, which applies to everything but pointer motion. The idea here is of course that we can apply custom acceleration curves for various different device behaviours - in the future this could be scrolling, gesture motion, etc. And since those will have different requirements, they can be configured separately.
As usual, the availability of this feature depends on your Wayland compositor and how this is exposed. For the Xorg + xf86-input-libinput case however, the merge request adds a few properties so that you can play with this using the xinput tool:
# Set the flat-equivalent function described above
$ xinput set-prop "devname" "libinput Accel Custom Motion Points" 0.0 1.0
# Set the step, i.e. the above points are on 0 u/ms, 1 u/ms, ...
# Can be skipped, 1.0 is the default anyway
$ xinput set-prop "devname" "libinput Accel Custom Motion Step" 1.0
# Now enable the custom profile
$ xinput set-prop "devname" "libinput Accel Profile Enabled" 0 0 1
The above sets a custom pointer accel for the "motion" type. Setting it for fallback is left as an exercise to the reader (though right now, I think the fallback curve is pretty much only used if there is no motion curve defined).
Happy playing around (and no longer filing bug reports if you don't like the default pointer acceleration ;)
This custom profile will be available in libinput 1.23 and xf86-input-libinput-1.3.0. No release dates have been set yet for either of those.
Had one of those moments when you looked at some code, looked at the output of your program, and then exclaimed COMPUTER, WHY YOU SO DUMB?
Of course not.
Nobody uses the spoken word in the current year.
But you’ve definitely complained about how dumb your computer is on IRC/discord/etc.
And I’m here today to complain in a blog post.
My computer (compiler) is really fucking dumb.
I know you’re wondering how dumb a computer (compiler) would have to be to prompt an outraged blog post from me, the paragon of temperance.
Well.
Let’s take a look!
Disclaimer: You are now reading at your own peril. There is no catharsis to be found ahead.
Some time ago, avid blog followers will recall that I created vkoverhead for evaluating CPU overhead in Vulkan drivers. Intel and NVIDIA have both contributed since its inception, which has no relevance but I’m sneaking it in as today’s fun fact because nobody will read the middle of this paragraph. Using vkoverhead, I’ve found something very strange on RADV. Something bizarre. Something terrifying.
debugoptimized builds of Mesa are (significantly) faster than release builds for some cases.
Here’s an example.
debugoptimized:
$ ./vkoverhead -test 76 -duration 10
vkoverhead running on AMD Radeon RX 5700 XT (RADV NAVI10):
* descriptor numbers are reported as thousands of operations per second
* percentages for descriptor cases are relative to 'descriptor_noop'
76, descriptor_1image, 157308, 100.0%
release:
$ ./vkoverhead -test 76 -duration 10
vkoverhead running on AMD Radeon RX 5700 XT (RADV NAVI10):
* descriptor numbers are reported as thousands of operations per second
* percentages for descriptor cases are relative to 'descriptor_noop'
76, descriptor_1image, 135309, 100.0%
This is where we are now.
That’s a good question, and for the answer I’m going to turn it over to the blog’s new compiler expert, Mr. Azathoth:
Thanks, champ.
As we can see, I’m about to meddle with dark forces beyond the limits of human comprehension, so anyone who values their sanity/time should escape while they still can.
Let’s take a look at a debugoptimized flamegraph to check out what sort of cosmic horror we’re delving into.
Obviously this is a descriptor updating case, and upon opening up the code for the final function, this is what our eyes will refuse to see:
static ALWAYS_INLINE void
write_image_descriptor(unsigned *dst, unsigned size, VkDescriptorType descriptor_type, const VkDescriptorImageInfo *image_info)
{
struct radv_image_view *iview = NULL;
union radv_descriptor *descriptor;
if (image_info)
iview = radv_image_view_from_handle(image_info->imageView);
if (!iview) {
memset(dst, 0, size);
return;
}
if (descriptor_type == VK_DESCRIPTOR_TYPE_STORAGE_IMAGE) {
descriptor = &iview->storage_descriptor;
} else {
descriptor = &iview->descriptor;
}
assert(size > 0);
memcpy(dst, descriptor, size);
}
It seems simple enough. Nothing too terrible here.
In the execution of this test case, the iterated callstack will look something like:
radv_UpdateDescriptorSets ->
radv_update_descriptor_sets_impl (inlined) ->
write_image_descriptor_impl (inlined) ->
write_image_descriptor (inlined)
That’s a lot of inlining.
Typically inlining isn’t a problem when used properly. It makes things faster (citation needed) in some cases (citation needed). Note the caveats.
One thing of note in the above code snippet is that memcpy is called with a variable size parameter. This isn’t ideal since it prevents the compiler from making certain assumptions about—What’s that, Mr. Azathoth? It’s not variable, you say? It’s always writing 32 bytes, you say?
case VK_DESCRIPTOR_TYPE_STORAGE_IMAGE:
write_image_descriptor_impl(device, cmd_buffer, 32, ptr, buffer_list,
writeset->descriptorType, writeset->pImageInfo + j);
Wow, thanks, Mr. Azathoth, I totally would’ve missed that!
And surely the compiler (GCC 12.2.1) wouldn’t fuck this up, right?
Which would mean that, hypothetically, if I were to force the compiler (GCC 12.2.1) to use the right memcpy size then it would have no effect.
Right?
static ALWAYS_INLINE void
write_image_descriptor(unsigned *dst, unsigned size, VkDescriptorType descriptor_type, const struct radv_image_view *iview)
{
if (!iview) {
memset(dst, 0, size);
return;
}
if (descriptor_type == VK_DESCRIPTOR_TYPE_STORAGE_IMAGE) {
memcpy(dst, &iview->storage_descriptor, 32);
} else {
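/* special-case the constant size so the compiler can inline the copy instead of calling memcpy */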
if (size == 64)
memcpy(dst, &iview->descriptor, 64);
else
memcpy(dst, &iview->descriptor, size);
}
}
$ ./vkoverhead -test 76 -duration 10
vkoverhead running on AMD Radeon RX 5700 XT (RADV NAVI10):
* descriptor numbers are reported as thousands of operations per second
* percentages for descriptor cases are relative to 'descriptor_noop'
76, descriptor_1image, 141299, 100.0%
Many, many runs of vkoverhead confirm the results, which means that’s a solid win.
Right?
Except hold on.
In the course of writing this post, my SSH session to my test machine was terminated, and I had to reconnect. Just for posterity, let’s run the same pre/post patch driver through vkoverhead again to be thorough.
pre-patch, release build:
$ ./vkoverhead -test 76 -duration 10
vkoverhead running on AMD Radeon RX 5700 XT (RADV NAVI10):
* descriptor numbers are reported as thousands of operations per second
* percentages for descriptor cases are relative to 'descriptor_noop'
76, descriptor_1image, 136943, 100.0%
post-patch, release build:
$ ./vkoverhead -test 76 -duration 10
vkoverhead running on AMD Radeon RX 5700 XT (RADV NAVI10):
* descriptor numbers are reported as thousands of operations per second
* percentages for descriptor cases are relative to 'descriptor_noop'
76, descriptor_1image, 137810, 100.0%
Somehow, with no changes to my environment aside from being in a different SSH session, I’ve just gained a couple percentage points of performance on the pre-patch run and lost a couple on the post-patch run.
It’s clear to me that GCC (and computers) can’t be trusted. I know what everyone reading this post is now thinking: Why are you still using that antiquated trash compiler instead of the awesome new Clang that does all the things better always?
Well.
I’m sure you can see where this is going.
If I fire up Clang builds of Mesa for debugoptimized and release configurations (same CFLAGS etc), I get these results.
debugoptimized:
$ ./vkoverhead -test 76 -duration 10
vkoverhead running on AMD Radeon RX 5700 XT (RADV NAVI10):
* descriptor numbers are reported as thousands of operations per second
* percentages for descriptor cases are relative to 'descriptor_noop'
76, descriptor_1image, 112629, 100.0%
release:
$ ./vkoverhead -test 76 -duration 10
vkoverhead running on AMD Radeon RX 5700 XT (RADV NAVI10):
* descriptor numbers are reported as thousands of operations per second
* percentages for descriptor cases are relative to 'descriptor_noop'
76, descriptor_1image, 132820, 100.0%
At least something makes sense there. Now what about a release build with the above changes?
$ ./vkoverhead -test 76 -duration 10
vkoverhead running on AMD Radeon RX 5700 XT (RADV NAVI10):
* descriptor numbers are reported as thousands of operations per second
* percentages for descriptor cases are relative to 'descriptor_noop'
76, descriptor_1image, 127761, 100.0%
According to Clang this makes performance worse, but also the performance is just worse overall?!?
Just to see what happens, let’s check out AMDPRO:
$ VK_ICD_FILENAMES=/home/zmike/amd_icd64.json ./vkoverhead -test 76 -duration 10
vkoverhead running on AMD Radeon RX 5700 XT:
* descriptor numbers are reported as thousands of operations per second
* percentages for descriptor cases are relative to 'descriptor_noop'
76, descriptor_1image, 151691, 100.0%
I hate computers.
2022 really passed by fast, and after completing GSoC 2022, I’m now completing another milestone: my project in the Igalia Coding Experience. I had the best experience during those four months: I learned tremendously about the Linux graphics stack, and now I can say for sure that I would love to keep working in the DRM community.
While GSoC was, for me, an experience to get a better understanding of what open source is, Igalia CE was an opportunity for me to mature my knowledge of technical concepts.
So, this is a summary report of my journey at the Igalia CE.
Initially, V3D only had three basic IGT tests: v3d_get_bo_offset,
v3d_get_param, and v3d_mmap. So, the basic goal of my CE project was to add
more tests to the V3D driver.
V3D is the driver that supports the Broadcom V3D 3.3 and 4.1 OpenGL ES GPUs, and is the driver that provides 3D rendering to the Raspberry Pi 4. V3D is composed of a tiled renderer, a TFU (Texture Formatting Unit), and a CSD (Compute Shader Dispatch).
During the CE, I was able to develop tests for almost all eleven V3D ioctls
(except v3d_submit_tfu). I began writing tests to the v3d_create_bo ioctl
and Performance Monitor (perfmon) related ioctls. I developed tests that check
the basic functionality of the ioctls and I inspected the kernel code to
understand situations where the ioctl should fail.
After those tests, I got the biggest challenge that I had on my CE project:
performing a Mesa’s no-op job on IGT. A no-op job is one of the simplest jobs
that can be submitted to the V3D. It is a 3D rendering job, so it is a job
submitted through the v3d_submit_cl ioctl, and performing this job on IGT was
fundamental to developing good tests for the v3d_submit_cl ioctl.
The main problem I faced in submitting a no-op job on IGT was that I would have to copy many, many Mesa files to IGT. I took a while fighting against this idea, looking for other ways to submit a job to V3D. But, as some experienced developers pointed out, building the packets directly is the best option. So indeed, the final solution I came up with was to copy a couple of files from Mesa, but just three of them, which sounds reasonable.
So, after some time, I was able to bring the Mesa structure to IGT with minimal overhead. But I was still not able to run a successful no-op job, as the job’s fence wasn’t being signaled by the end of the job. Then, Melissa Wen guided me to experiment with running CTS tests to inspect the no-op job. With the CTS tests, I was able to hexdump the contents of the packet and understand what was going wrong in my no-op job.
Running the CTS in the Raspberry Pi 4 was a fun side-quest of the project and
ended up resulting in a commit to the CTS repository, as CTS wasn’t handling
the wayland-scanner appropriately when cross-compiling: it was picking the
wayland-scanner from the host computer instead of the wayland-scanner
executable available in the target sysroot. This was fixed
with this simple patch:
Allow override of wayland_scanner executable
When I finally got a successful no-op job, I was able to write the tests for the
v3d_submit_cl and v3d_wait_bo ioctls. On these tests, I tested primarily job
synchronization with single syncobjs and multiple syncobjs. In this part of the
project, I had the opportunity to learn a lot about syncobjs and different forms
of synchronization in the kernel and userspace.
Having done the v3d_submit_cl tests, I developed the v3d_submit_csd tests in
a similar way, as the job submission process is kind of similar. For submitting
a CSD job, it is necessary to make a valid submission with a pipeline assembly
shader, and as IGT doesn’t have a shader compiler, I hard-coded the assembly
of an empty shader in the code. In this way, I was able to get a simple CSD job
submitted, and having done that, I could now play around with mixing CSD and CL
jobs.
In these tests, I could test the synchronization between two job queues and see, for example, if they were proceeding independently.
So, by the end of the review process, I will have added 66 new sub-tests to V3D, bringing it to a total of 72 IGT sub-tests! Those tests check invalid parameters, synchronization, and the proper behavior of the functionalities.
| Patch/Series | Status |
|---|---|
| [PATCH 0/7] V3D IGT Tests Updates | Accepted |
| [PATCH 0/2] Tests for V3D/VC4 Mmap BO IOCTLs | Accepted |
| [PATCH 0/4] Make sure v3d/vc4 support performance monitor | In Review |
| [PATCH 0/6] V3D Job Submission Tests | In Review |
| [PATCH 0/3] V3D Mixed Job Submission Tests | In Review |
Apart from reading a lot of kernel code, I also started to explore some of the
Mesa code, especially the v3dv driver. On Mesa, I was trying to understand the
userspace use of the ioctls in order to create useful tests. While I was
exploring the v3dv, I was able to make two very simple contributions to Mesa:
fixing typos and initializing a variable in order to assure proper error
handling.
| Patch | Status |
|---|---|
| v3dv: fix multiple typos | Accepted |
| v3dv: initialize fd variable for proper error handling | Accepted |
VC4 and V3D share some similarities in their basic 3D rendering implementation. VC4 contains a 3D engine, and a display output pipeline that supports different outputs. The display part of the VC4 is used on the Raspberry Pi 4 together with the V3D driver.
Although my main focus was on the V3D tests, as the VC4 and V3D drivers are kind
of similar, I was able to bring some improvements to the VC4 tests as well. I
added tests for perfmons and the vc4_mmap ioctl, and improved a couple of
things in the tests, such as moving them to a separate folder and creating a check to
skip the VC4 tests if they are running on a Raspberry Pi 4.
| Patch/Series | Status |
|---|---|
| [PATCH 0/5] VC4 IGT Tests Updates | Accepted |
| [PATCH 0/2] Tests for V3D/VC4 Mmap BO IOCTLs | Accepted |
| [PATCH 0/4] Make sure v3d/vc4 support performance monitor | In Review |
| tests/vc4_purgeable_bo: Fix conditional assertion | In Review |
During this process of writing tests to IGT, I ended up reading a lot of kernel code from V3D in order to evaluate possible userspace scenarios. While inspecting some of the V3D code, I could find a couple of small things that could be improved, such as using the DRM-managed API for mutexes and replacing open-coded implementations with their DRM counterparts.
| Patch | Status |
|---|---|
| drm/v3d: switch to drmm_mutex_init | Accepted |
| drm/v3d: add missing mutex_destroy | Accepted |
| drm/v3d: replace open-coded implementation of drm_gem_object_lookup | Accepted |
Although I didn’t explore the VC4 driver as much as the V3D driver, I also took a look at the driver, and I was able to detect a small thing that could be improved: using the DRM-core helpers instead of open-code. Moreover, after a report on the mailing list, I bisected a deadlock and I was able to fix it after some study about the KMS locking system.
The debugfs side-quest was a total coincidence during this project. I had some spare time and was looking for something to develop. While looking at the DRM TODO list, I bumped into the debugfs clean-up task and found it interesting to work on. So, I started to work on this task based on the previous work from Wambui Karuga, who was an Outreachy mentee and worked on this feature during her internship. By chance, when I talked to Melissa about it, she told me that she had knowledge of this project due to a past Outreachy internship she was engaged in, and she was able to help me figure out the last pieces of this side-quest.
After submitting the first patch, introducing the debugfs device-centered
functions, and converting a couple of drivers to the new structure, I decided to
remove the debugfs_init hook from a couple of drivers in order to get closer
to the goal of removing the hook entirely. Moreover, during my last week
in the CE, I tried to write a debugfs infrastructure for the KMS objects, which
was another task in the TODO list, although I still need to do some rework on
this series.
By the end of the CE, I was on my summer break from university, so I had some time to take a couple of side-quests in this journey.
The first side-quest that I got into originated in a failed IGT test on the VC4, the “addfb25-bad-modifier” IGT test. Initially, I proposed a fix only for the VC4, but after some discussion in the mailing list, I decided to move forward with the idea to create the check for valid modifiers in the DRM core. The series is still in review, but I had some great interactions during the iterations.
The second side-quest was to understand why the IGT test kms_writeback was
causing a kernel oops in vkms. After some bisecting and some study about KMS’s
atomic API, I was able to detect the problem and write a solution for it. It was
pretty exciting to deal with vkms for the first time and to get some notion
about the display side of things.
A bit different from the end of GSoC, I’m not really sure what my next steps are going to be in the next couple of months. The only thing I know for sure is that I will keep contributing to the DRM subsystem and studying more about DRI, especially 3D rendering and KMS.
The DRI infrastructure is really fascinating and there is so much to learn! Although I feel that I improved a lot in the last couple of months, I still feel like a newbie in the community. I still want to have more knowledge of the DRM core helpers and understand how everything glues together.
Apart from the DRM subsystem, I’m also trying to take some time to program more in Rust and maybe contribute to other open-source projects, like Mesa.
I would like to thank my great mentors Melissa Wen and André Almeida for helping me through this journey. I wouldn’t be able to develop this project without their great support and encouragement. They were an amazing duo of mentors and I thank them for answering all my questions and helping me with all the challenges.
Also, I would like to thank the DRI community for reviewing my patches and giving me constructive feedback. Especially, I would like to thank Daniel Vetter for answering patiently every single question that I had about the debugfs clean-up and to thank Jani Nikula, Maxime Ripard, Thomas Zimmermann, Javier Martinez Canillas, Emma Anholt, Simon Ser, Iago Toral, Kamil Konieczny and many others that took their time to review my patches, answer my questions and provide me constructive feedback.
We haven’t posted updates on the work done on the V3DV driver since we announced the driver becoming Vulkan 1.2 Conformant.
But after reaching that milestone, we’ve been very busy working on more improvements, so let’s summarize the work done since then.
Our implementations of Events and Occlusion Queries were both mostly CPU-based. We have refactored both features with a new GPU-side implementation based on the use of compute shaders.
In addition to being more “the Vulkan way”, this has additional benefits. For example, in the case of events, we no longer need to stall on the CPU when we need to handle GPU-side event commands, and it allowed us to re-enable sync_fd import/export.
We have just landed a real implementation for this extension, based on the work of Ella Stanforth as part of her Igalia Coding Experience with us. This was really complex work, as this feature added support for multi-plane formats and needed to modify various parts of the driver. A big kudos to Ella for getting this tricky feature going. Also thanks to Jason Ekstrand, as he worked on a common Mesa framework for ycbcr support.
Since 1.2 got announced, the following extensions got exposed:
In addition to those, we also worked on the following:
Implemented a heuristic to decide when to enable double-buffer mode, which could help to improve performance in some cases. It still needs to be enabled through the V3D_DEBUG environment variable.
Getting v3dv and v3d to use the same shader optimization method, which allows more code to be reused between the OpenGL and Vulkan drivers.
Getting the driver working with the fossilize-db tools
Bugfixing, mostly related to bugs identified through new Khronos CTS releases
Hi all!
This month’s status update will be lighter than usual: I’ve been on leave for a while at the end of December. To make up for this, I have some big news: we’ve released Sway 1.8! This brings a whole lot of improvements from wlroots 0.16, as well as some nice smaller additions to Sway itself. We’re still working on fixing up a few regressions, so I’ll probably release wlroots 0.16.2 soon-ish.
Together with Sebastian Wick we’ve plumbed support for more data blocks to libdisplay-info. We now support everything in the base EDID block! We’re filling the gaps in our CTA-861 implementation, and we’re getting ready to release version 0.1.0. As expected EDID blobs continue to have many fields packed in creative ways, duplicating information and contradicting each other, ill-defined in many specifications and vendor-specific formats.
I’ve continued working on the goguma Android IRC client. I’ve wired up automatic bug reporting via GlitchTip – this helps a lot because grabbing logs from Android is much more complicated than it needs to be. Thanks to the bug dashboard I’ve fixed numerous crashes. I’ve also sent upstream a fix for unreliable notifications when UnifiedPush is used.
The NPotM is chayang, a small tool to gradually dim the screen. This can be used to implement a grace period before turning off the screens, to let the user press a key or move the mouse to keep the screens on.
Last but not least, I’ve written a patch to add support for ACME DNS challenges to tlstunnel. ACME DNS challenges unlock support for wildcard certificates in Let’s Encrypt. Unfortunately there is no widely supported standard protocol to update DNS records, so tlstunnel delegates this task to a helper script with the same API as dehydrated’s hooks.
That’s it! See you next month.
Automated testing of software is great. Unfortunately, what's commonly considered best practice for how to integrate testing into the development flow is a bad fit for a lot of software. You know what I mean: You submit a pull request, some automated testing process is kicked off in the background, and some time later you get back a result that's either Green (all tests passed) or Red (at least one test failed). Don't submit the PR if the tests are red. Sounds good, doesn't quite work.
There is Software Under Test (SUT), the software whose development is the goal of what you're doing, and there's the Test Bench (TB): the tests themselves, but also additional relevant parts of the environment like perhaps some hardware device that the SUT is run against.
The above development practice works well when the SUT and TB are both defined by the same code repository and are developed together. And admittedly, that is the case for a lot of useful software. But it just so happens that I mostly tend to work on software where the SUT and TB are inherently split. Graphics drivers and shader compilers implement some spec (like Vulkan or Direct3D), and an important part of the TB are a conformance test suite and other tests, the bulk of which are developed separately from the driver itself. Not to mention the GPU itself and other software like the kernel mode driver. The point is, TB development is split from the SUT development and it is infeasible to make changes to them in lockstep.
Problem #1 with keeping all tests passing all the time is that tests can fail for reasons whose root cause is not an SUT change.
For example, a new test case is added to the conformance test suite, but that test happens to fail. Suddenly nobody can submit any changes anymore.
That clearly makes no sense, and because Tooling Sucks(tm), what folks typically do is maintain a manual list of test cases that are excluded from automated testing. This unblocks development, but are you going to remember to update that exclusion list? Bonus points if the exclusion list isn't even maintained in the same repository as the SUT, which just compounds the problem.
The situation is worse when you bring up a large new feature or perhaps a new version of the hardware supported by your driver (which is really just a super duper large new feature), where there is already a large body of tests written by somebody else. Development of the new feature may take months and typically is merged bit by bit over time. For most of that time, there are going to be some test failures. And that's fine!
Unfortunately, a typical coping mechanism is that automated testing for the feature is entirely disabled until the development process is complete. The consequences are dire, as regressions in relatively basic functionality can go unnoticed for a fairly long time.
And sometimes there are simply changes in the TB that are hard to control. Maybe you upgraded the kernel mode driver for your GPU on the test systems, and suddenly some weird corner case tests fail. Yes, you have to fix it somehow, but removing the test case from your automated testing process is almost always the wrong response.
In fact, failing tests are, given the right context, a good thing! Let's say a bug is discovered in a real application in the field. Somebody root causes the problem and writes a simplified reproducer. This reproducer should be added to the TB as soon as possible, even if it is going to fail initially!
To be fair, many of the common testing frameworks recognize this by allowing tests to be marked as "expected to fail". But they typically also assume that the TB can be changed in lockstep with the SUT and fall on their face when that isn't the case.
What is needed here is to treat testing as a truly continuous exercise, with some awareness by the automation of how test runs relate to the development history.
During day-to-day development, the important bit isn't that there are no failures. The important bit is that there are no regressions.
Automation ought to track which tests pass on the main development branch and provide pre-commit reports for pull requests relative to those results: Have there been any regressions? Have any tests been fixed? Block code submissions when they cause regressions, but don't block them for pre-existing failures, especially when those failures are caused by changes in the TB.
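A sketch of what that per-test comparison could look like (hypothetical types and test name, not any particular CI system):
#include <stdio.h>

/* Compare a pull request's results against the main branch and only treat
 * regressions as blocking. */
enum outcome { OUTCOME_PASS, OUTCOME_FAIL, OUTCOME_SKIP };

struct result {
    const char *test;
    enum outcome on_main;   /* latest result on the main development branch */
    enum outcome on_pr;     /* result with the pull request applied */
};

static const char *classify(const struct result *r)
{
    if (r->on_main != OUTCOME_FAIL && r->on_pr == OUTCOME_FAIL)
        return "regression (block)";
    if (r->on_main == OUTCOME_FAIL && r->on_pr != OUTCOME_FAIL)
        return "fixed";
    if (r->on_pr == OUTCOME_FAIL)
        return "pre-existing failure (report, don't block)";
    return "ok";
}

int main(void)
{
    /* hypothetical test name */
    struct result r = { "dEQP-VK.example.test", OUTCOME_FAIL, OUTCOME_FAIL };
    printf("%s: %s\n", r.test, classify(&r));
    return 0;
}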
Changes to the TB should also be tested where possible, and when they cause regressions those should be investigated. But it is quite common that regressions caused by a TB change are legitimate and shouldn't block the TB change.
Problem #2 is that good test coverage means that tests take a very long time to run.
Your first solution to this problem should be to parallelize and throw more hardware at it. Let's hope the people who control the purse care enough about quality.
There is sometimes also low-hanging fruit you should pick, like wasting lots of time in process (or other) startup and teardown overhead. Addressing that can be a double-edged sword. Changing a test suite from running every test case in a separate process to running multiple test cases sequentially in the same process reduces isolation between the tests and can therefore make the tests flakier. It can also expose genuine bugs, though, and so the effort is usually worth it.
But all these techniques have their limits.
Let me give you an example. Compilers tend to have lots of switches that subtly change the compiler's behavior without (intentionally) affecting correctness. How good is your test coverage of these?
Chances are, most of them don't see any real testing at all. You probably have a few hand-written test cases that exercise the options. But what you really should be doing is run an entire suite of end-to-end tests with each of the switches applied to make sure you aren't missing some interaction. And you really should be testing all combinations of switches as well.
The combinatorial explosion is intense. Even if you only have 10 boolean switches, testing them each individually without regressing the turn-around time of the test suite requires 11x more test system hardware. Testing all possible combinations requires 1024x more. Nobody has that kind of money.
The good news is that having extremely high confidence in the quality of your software doesn't require that kind of money. If we run the entire test suite a small number of times (maybe even just once!) and independently choose a random combination of switches for each test, then not seeing any regressions there is a great indication that there really aren't any regressions.
Why is that? Because failures are correlated! Test T failing with a default setting of switches is highly correlated with test T failing with some non-default switch S enabled.
This effect isn't restricted to taking the cross product of a test suite with a bunch of configuration switches. By design, an exhaustive conformance test suite is going to have many sets of tests with high failure correlation. For example, in the Vulkan test suite you might have a bunch of test cases that all do the same thing, but with a different combination of framebuffer format and blend function. When there is a regression affecting such tests, the specific framebuffer format or blend function might not matter at all, and all of the tests will regress. Or perhaps the regression is related to a specific framebuffer format, and so all tests using that format will regress regardless of the blend function that is used, and so on.
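To make the random-sampling idea concrete, here is a small sketch (hypothetical switch and test names; the run_test function is a placeholder) of running each test once under an independently chosen combination of switches:
import random

SWITCHES = [f"--opt-{i}" for i in range(10)]    # 10 hypothetical boolean switches
TESTS = [f"test_{i:04d}" for i in range(5000)]  # hypothetical test case names

def run_test(test: str, switches: list[str]) -> bool:
    """Placeholder for actually invoking the test with the given switches."""
    return True  # pretend it passed

failures = []
for test in TESTS:
    # Each switch is independently enabled with probability 0.5,
    # so any of the 1024 possible combinations can show up.
    enabled = [s for s in SWITCHES if random.random() < 0.5]
    if not run_test(test, enabled):
        failures.append((test, enabled))
print(f"{len(failures)} failures out of {len(TESTS)} sampled runs")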
A good automated testing system would leverage these observations using statistical methods (aka machine learning).
Combinatorial explosion causes your full test suite to take months to run? No problem, treat testing as a genuinely continuous task. Have test systems continuously run random samplings of test cases on the latest version of the main development branch. Whenever a change is made to either the SUT or TB, switch over to testing that new version instead. When a failure is encountered, automatically determine if it is a regression by referring to earlier test results (of the exact same test if it has been run previously, or related tests otherwise) combined with a bisection over the code history.
Pre-commit testing becomes an interesting fuzzy problem. By all means have a small traditional test suite that is manually curated to run within a few minutes. But we can also apply the approach of running randomly sampled tests to pre-commit testing.
A good automated testing system would learn a statistical model of regressions and combine that with the test results obtained so far to provide an estimate of the likelihood of regression. As long as no regression is actually found, this likelihood will keep dropping as more tests are run, though it will not drop to 0 unless all tests are run (and the setup here was this would take months). The team can define a likelihood threshold that a change must reach before it can be committed based on their appetite for risk and rate of development.
The statistical model should be augmented with source-level information about the change, such as keywords that appear in the diff and commit message and the set of files that was changed. After all, there ought to be some meaningful correlation between regressions in a raytracing test case and the fact that the regressing change affected a file with "raytracing" in its name. The model should then also be used to bias the random sampling of tests to be run to maximize the information extracted per effort spent on running test cases.
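As a toy model of that estimate (made-up numbers, and assuming each sampled test would independently catch an existing regression with some fixed probability), the likelihood could be updated like this:
def regression_likelihood(prior: float, detect_prob: float, passed_tests: int) -> float:
    """Posterior probability that a regression exists, given that
    `passed_tests` randomly sampled tests all passed, assuming each test
    would have caught an existing regression with probability `detect_prob`."""
    p_evidence_if_regression = (1.0 - detect_prob) ** passed_tests
    p_regression = prior * p_evidence_if_regression
    p_no_regression = 1.0 - prior
    return p_regression / (p_regression + p_no_regression)

# Made-up numbers: a diff touching "raytracing" files might start with a
# higher prior for raytracing-related regressions than an unrelated diff.
prior = 0.3
for n in (0, 100, 1000, 10000):
    print(n, round(regression_likelihood(prior, detect_prob=0.001, passed_tests=n), 4))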
What I've described is largely motivated by the fact that the world is messier than commonly accepted testing "wisdom" allows. However, the world is too messy even for what I've described.
I haven't talked about flaky (randomly failing) tests at all, though a good automated testing system should be able to cope with them. Re-running a test in the same configuration is not black magic and can be used to confirm that a test is flaky. If we wanted to get fancy, we could even estimate the failure probability and treat a significant increase of the failure rate as a regression!
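Here is a sketch of what treating flakiness statistically could look like (made-up numbers, and a plain binomial tail probability standing in for a proper statistical test):
from math import comb

def binom_tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Historically the test failed ~2% of the time; today it failed 6 out of 50 runs.
historical_rate = 0.02
runs, fails = 50, 6
p_value = binom_tail(runs, fails, historical_rate)
if p_value < 0.01:
    print(f"failure rate jump is unlikely to be chance (p={p_value:.2e}): treat as regression")
else:
    print("still within the usual flakiness")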
Along similar lines, there can be state leakage between test cases that causes failures only when test cases are run in a specific order, or when specific test cases are run in parallel. This would manifest as flaky tests, and so flaky test detection ought to try to help tease out these scenarios. That is admittedly difficult and will probably never be entirely reliable. Luckily, it doesn't happen often.
Sometimes, there are test cases that can leave a test system in such a broken state that it has to be rebooted. This is not entirely unusual in very early bringup of a driver for new hardware, when even the device's firmware may still be unstable. An automated test system can and should treat this case just like one would treat a crashing test process: Detect the failure, perhaps using some timer-based watchdog, force a reboot, possibly using a remote-controlled power switch, and resume with the next test case. But if a decent fraction of your test suite is affected, the resulting experience isn't fun and there may not be anything your team can do about it in the short term. So that's an edge case where manual exclusion of tests seems legitimate.
So no, testing perfection isn't attainable for many kinds of software projects. But even relative to what feels like it should be realistically attainable, the state of the art is depressing.
This is the third part of my “Turnips in the wild” blog post series where I describe how I found and fixed graphical issues in the Mesa Turnip Vulkan driver for Adreno GPUs. If you missed the first two parts, you can find them here:
A few months ago it was reported that “Psychonauts 2” has rendering artifacts in the main menu, though I only recently got my hands on it.
Notice the mark on the top right; the game was running directly on a Qualcomm board via FEX-Emu
Forcing direct rendering, forcing tiled rendering, disabling UBWC compression, forcing synchronizations everywhere, and so on, nothing helped or changed the outcome.
The first draw call with visible corruption
When looking around the draw call, everything looks good, but a single input image:
One of the input images
Ughhh, it seems like this image comes from the previous frame, that’s bad. Inter-frame issues are hard to debug since there is no tooling to inspect two frames together…
* Looks around nervously *
Ok, let’s forget about it, maybe it doesn’t matter. The next step would be looking at the pixel values in the corrupted region:
color = (35.0625, 18.15625, 2.2382, 0.00)
Now, let’s see whether RenderDoc’s built-in shader debugger gives us the same value as the GPU or not.
color = (0.0335, 0.0459, 0.0226, 0.50)
(Or not)
After looking at the same pixel on RADV, RenderDoc seems right. So the issue is somewhere in how the driver compiled the shader.
A good start is to print the shader values with debugPrintfEXT, much nicer than looking at color values. Adding the debugPrintfEXT aaaand, the issue goes away, great, not like I wanted to debug it or anything.
Adding a printf changes the shader which affects the compilation process, so the changed result is not unexpected, though it’s much better when it works. So now we are stuck with observing pixel colors.
Bisecting a shader isn’t hard, especially if there is a reference GPU with the same capture opened to compare against. You delete pieces of the shader until the results match, then take one step back and start removing other expressions, repeating until nothing more can be removed.
float _258 = 1.0 / gl_FragCoord.w;
vec4 _273 = vec4(_222, _222, _222, 1.0) * _258;
vec4 _280 = View.View_SVPositionToTranslatedWorld * vec4(gl_FragCoord.xyz, 1.0);
vec3 _284 = _280.xyz / vec3(_280.w);
vec3 _287 = _284 - View.View_PreViewTranslation;
vec3 _289 = normalize(-_284);
vec2 _303 = vec2(in_var_TEXCOORD0[0].x, in_var_TEXCOORD0[0].y) * Material.Material_ScalarExpressions[0].x;
vec4 _309 = texture(sampler2D(Material_Texture2D_0, Material_Texture2D_0Sampler), _303, View.View_MaterialTextureMipBias);
vec2 _312 = (_309.xy * vec2(2.0)) - vec2(1.0);
vec3 _331 = normalize(mat3(in_var_TEXCOORD10_centroid.xyz, cross(in_var_TEXCOORD11_centroid.xyz, in_var_TEXCOORD10_centroid.xyz) * in_var_TEXCOORD11_centroid.w, in_var_TEXCOORD11_centroid.xyz) * normalize((vec4(_312, sqrt(clamp(1.0 - dot(_312, _312), 0.0, 1.0)), 1.0).xyz * View.View_NormalOverrideParameter.w) + View.View_NormalOverrideParameter.xyz)) * ((View.View_CullingSign * View_PrimitiveSceneData._m0[(in_var_PRIMITIVE_ID * 37u) + 4u].w) * float(gl_FrontFacing ? (-1) : 1));
vec2 _405 = in_var_TEXCOORD4.xy * vec2(1.0, 0.5);
vec4 _412 = texture(sampler2D(LightmapResourceCluster_LightMapTexture, LightmapResourceCluster_LightMapSampler), _405 + vec2(0.0, 0.5));
uint _418 = in_var_LIGHTMAP_ID; // <<<<<<-
float _447 = _331.y;
vec3 _531 = (((max(0.0, dot((_412 * View_LightmapSceneData_1._m0[_418 + 5u]), vec4(_447, _331.zx, 1.0))))) * View.View_IndirectLightingColorScale);
bool _1313 = TranslucentBasePass.TranslucentBasePass_Shared_Fog_ApplyVolumetricFog > 0.0;
vec4 _1364;
vec4 _1322 = View.View_WorldToClip * vec4(_287, 1.0);
float _1323 = _1322.w;
vec4 _1352;
if (_1313)
{
_1352 = textureLod(sampler3D(TranslucentBasePass_Shared_Fog_IntegratedLightScattering, View_SharedBilinearClampedSampler), vec3(((_1322.xy / vec2(_1323)).xy * vec2(0.5, -0.5)) + vec2(0.5), (log2((_1323 * View.View_VolumetricFogGridZParams.x) + View.View_VolumetricFogGridZParams.y) * View.View_VolumetricFogGridZParams.z) * View.View_VolumetricFogInvGridSize.z), 0.0);
}
else
{
_1352 = vec4(0.0, 0.0, 0.0, 1.0);
}
_1364 = vec4(_1352.xyz + (in_var_TEXCOORD7.xyz * _1352.w), _1352.w * in_var_TEXCOORD7.w);
out_var_SV_Target0 = vec4(_531.x, _1364.w, 0, 1);
After that it became harder to reduce the code.
vec2 _303 = vec2(in_var_TEXCOORD0[0].x, in_var_TEXCOORD0[0].y);
vec4 _309 = texture(sampler2D(Material_Texture2D_0, Material_Texture2D_0Sampler), _303, View.View_MaterialTextureMipBias);
vec3 _331 = normalize(mat3(in_var_TEXCOORD10_centroid.xyz, in_var_TEXCOORD11_centroid.www, in_var_TEXCOORD11_centroid.xyz) * normalize((vec4(_309.xy, 1.0, 1.0).xyz))) * ((View_PrimitiveSceneData._m0[(in_var_PRIMITIVE_ID)].w) );
vec4 _412 = texture(sampler2D(LightmapResourceCluster_LightMapTexture, LightmapResourceCluster_LightMapSampler), in_var_TEXCOORD4.xy);
uint _418 = in_var_LIGHTMAP_ID; // <<<<<<-
vec3 _531 = (((dot((_412 * View_LightmapSceneData_1._m0[_418 + 5u]), vec4(_331.x, 1,1, 1.0)))) * View.View_IndirectLightingColorScale);
vec4 _1352 = textureLod(sampler3D(TranslucentBasePass_Shared_Fog_IntegratedLightScattering, View_SharedBilinearClampedSampler), vec3(vec2(0.5), View.View_VolumetricFogInvGridSize.z), 0.0);
out_var_SV_Target0 = vec4(_531.x, in_var_TEXCOORD7.w, 0, 1);
And finally, the end result:
vec3 a = in_var_TEXCOORD10_centroid.xyz + in_var_TEXCOORD11_centroid.xyz;
float b = a.x + a.y + a.z + in_var_TEXCOORD11_centroid.w + in_var_TEXCOORD0[0].x + in_var_TEXCOORD0[0].y + in_var_PRIMITIVE_ID.x;
float c = b + in_var_TEXCOORD4.x + in_var_TEXCOORD4.y + in_var_LIGHTMAP_ID;
out_var_SV_Target0 = vec4(c, in_var_TEXCOORD7.w, 0, 1);
Nothing left but loading of varyings and the simplest operations on them in order to prevent their elimination by the compiler.
The in_var_TEXCOORD7.w values are several orders of magnitude different from the expected ones, and if any varying is removed the issue goes away. Seems like an issue with the loading of varyings.
I created a simple standalone reproducer in vkrunner to isolate this case and make my life easier, but the same fragment shader passed without any trouble. This should have pointed me to undefined behavior somewhere.
Anyway, one major difference is the vertex shader: changing it does “fix” the issue. However, changing it also changes the varyings layout, and without changing the layout the issue remains. Thus the vertex shader is an unlikely culprit here.
Let’s take a look at the fragment shader assembly:
bary.f r0.z, 0, r0.x
bary.f r0.w, 3, r0.x
bary.f r1.x, 1, r0.x
bary.f r1.z, 4, r0.x
bary.f r1.y, 2, r0.x
bary.f r1.w, 5, r0.x
bary.f r2.x, 6, r0.x
bary.f r2.y, 7, r0.x
flat.b r2.z, 11, 16
bary.f r2.w, 8, r0.x
bary.f r3.x, 9, r0.x
flat.b r3.y, 12, 17
bary.f r3.z, 10, r0.x
bary.f (ei)r1.x, 16, r0.x
....
bary.f loads an interpolated varying, flat.b loads one without interpolation. bary.f (ei)r1.x, 16, r0.x is what loads the problematic varying, though it doesn’t look suspicious at all. Looking through the state which defines how varyings are passed between VS and FS also doesn’t yield anything useful.
Ok, but what does the second operand of flat.b r2.z, 11, 16 mean (the command format is flat.b dst, src1, src2)? The first one is the location the varying is loaded from, and according to Turnip’s code the second one should be equal to the first, otherwise “some bad things may happen”. I forced the sources to be equal - nothing changed… What did I expect? After all, the standalone reproducer with the same assembly works fine.
The same description which promised bad things to happen also said that using 2 immediate sources for flat.b isn’t really expected. Let’s revert the change and instead emit something like flat.b r2.z, 11, r0.x - again, nothing changed.
What else happens with these varyings? They are being packed to remove their unused components, so let’s stop packing them. Aha! Now it works correctly!
Looking through the code several times, nothing is wrong. Changing the order of varyings helps, aligning them helps, aligning only flat varyings also helps. But the code is entirely correct.
One thing did change, though: while shuffling the varyings order I noticed that the resulting misrendering changed, so it’s likely not the order but the location that is cursed.
What’s left? How varying interpolation is specified. The code emits interpolation state only for used varyings, but looking closer, the “used varyings” part isn’t that obviously defined. Emitting the whole interpolation state fixes the issue!
The culprit is found: stale varying interpolation data was being read. The resulting fix can be found in “tu: Fix varyings interpolation reading stale values + cosmetic changes”
Correctly rendered draw call after the changes
Another corruption in the main menu.
Bad draw call on Turnip
How it should look:
The same draw call on RADV
The draw call inputs and state look good enough. So it’s time to bisect the shader.
Here is the output of the reduced shader on Turnip:
Enabling the display of NaNs and Infs shows that there are NaNs in the output on Turnip (NaNs have green color here):
While the correct rendering on RADV is:
Carefully reducing the shader further resulted in the following fragment which reproduces the issue:
r12 = uintBitsToFloat(uvec4(texelFetch(t34, _1195 + 0).x, texelFetch(t34, _1195 + 1).x, texelFetch(t34, _1195 + 2).x, texelFetch(t34, _1195 + 3).x));
....
vec4 _1268 = r12;
_1268.w = uintBitsToFloat(floatBitsToUint(r12.w) & 65535u);
_1275.w = unpackHalf2x16(floatBitsToUint(r12.w)).x;
On Turnip this _1275.w is NaN, while on RADV it is a proper number. Looking at the assembly, the calculation of _1275.w from the above is translated into:
isaml.base0 (u16)(x)hr2.z, r0.w, r0.y, s#0, t#12
(sy)cov.f16f32 r1.z, hr2.z
In GLSL there is a read of uint32, stripping it of the high 16 bits, then converting the lower 16 bits to a half float.
In assembly the “read and strip the high 16 bits” part is done in a single command isaml, where the stripping is done via (u16) conversion.
At this point I wrote a simple reproducer to speed up iteration on the issue:
result = uint(unpackHalf2x16(texelFetch(t34, 0).x & 65535u).x);
After testing different values I confirmed that the (u16) conversion doesn’t strip the higher 16 bits, but clamps the value to a 16-bit unsigned integer. Running the reproducer on the proprietary driver showed that it doesn’t fold the u32 -> u16 conversion into isaml.
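To illustrate the difference between the two behaviours, here is a small sketch with a made-up texel value showing why clamping instead of masking produces a NaN once the low 16 bits are reinterpreted as a half float:
import struct

def as_half(bits16: int) -> float:
    """Reinterpret 16 raw bits as an IEEE half float."""
    return struct.unpack('<e', bits16.to_bytes(2, 'little'))[0]

texel = 0x4A3C3C00            # made-up 32-bit value; low 16 bits encode half 1.0
masked  = texel & 0xFFFF      # what the GLSL "& 65535u" asks for -> 0x3C00
clamped = min(texel, 0xFFFF)  # what the (u16) conversion actually did -> 0xFFFF
print(as_half(masked))   # 1.0
print(as_half(clamped))  # nan (0xFFFF is a half-float NaN pattern)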
Knowing that the fix is easy: ir3: Do 16b tex dst folding only for floats
Main menu, again =) Before we even got here, two other issues had already been fixed, including one which seems like a HW bug the proprietary driver is not aware of.
In this case of misrendering the culprit is a compute shader.
How it should look:
Compute shaders are generally easier to deal with since much less state is involved.
None of the debug options helped, and shader printf didn’t work at that time for some reason. So I decided to look at the shader assembly, trying to spot something funny.
ldl.u32 r6.w, l[r6.z-4016], 1
ldl.u32 r7.x, l[r6.z-4012], 1
ldl.u32 r7.y, l[r6.z-4032], 1
ldl.u32 r7.z, l[r6.z-4028], 1
ldl.u32 r0.z, l[r6.z-4024], 1
ldl.u32 r2.z, l[r6.z-4020], 1
Negative offsets into shared memory are not suspicious at all. Were they always there? How does it look right before being passed into our backend compiler?
vec1 32 ssa_206 = intrinsic load_shared (ssa_138) (base=4176, align_mul=4, align_offset=0)
vec1 32 ssa_207 = intrinsic load_shared (ssa_138) (base=4180, align_mul=4, align_offset=0)
vec1 32 ssa_208 = intrinsic load_shared (ssa_138) (base=4160, align_mul=4, align_offset=0)
vec1 32 ssa_209 = intrinsic load_shared (ssa_138) (base=4164, align_mul=4, align_offset=0)
vec1 32 ssa_210 = intrinsic load_shared (ssa_138) (base=4168, align_mul=4, align_offset=0)
vec1 32 ssa_211 = intrinsic load_shared (ssa_138) (base=4172, align_mul=4, align_offset=0)
vec1 32 ssa_212 = intrinsic load_shared (ssa_138) (base=4192, align_mul=4, align_offset=0)
Nope, no negative offsets, just a number of offsets close to 4096. Looks like offsets got wrapped around!
Looking at ldl definition it has 13 bits for the offset:
<pattern pos="0" >1</pattern>
<field low="1" high="13" name="OFF" type="offset"/> <--- This is the offset field
<field low="14" high="21" name="SRC" type="#reg-gpr"/>
<pattern pos="22" >x</pattern>
<pattern pos="23" >1</pattern>
The offset type is a signed integer (so one bit is for the sign), which leaves us with 12 bits for the magnitude, meaning an upper bound of 4095. Case closed!
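A quick sanity check of that conclusion: wrapping the NIR base offsets into a 13-bit signed immediate reproduces exactly the negative offsets seen in the disassembly:
def encode_13bit_signed(offset: int) -> int:
    """Wrap an offset into a 13-bit signed immediate, as the ldl encoding does."""
    value = offset & 0x1FFF                               # keep 13 bits
    return value - 0x2000 if value >= 0x1000 else value   # sign-extend bit 12

for base in (4176, 4180, 4160, 4164, 4168, 4172):
    print(base, "->", encode_13bit_signed(base))
# 4176 -> -4016, 4180 -> -4012, 4160 -> -4032, ... matching l[r6.z-4016] and friends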
I know that an upper bound is set on the offset during optimizations, but where and how is it set?
The upper bound is set via nir_opt_offsets_options::shared_max and is equal to (1 << 13) - 1, which we saw is incorrect. Who set it?
Subject: [PATCH] ir3: Limit the maximum imm offset in nir_opt_offset for
shared vars
STL/LDL have 13 bits to store imm offset.
Fixes crash in CS compilation in Monster Hunter World.
Fixes: b024102d7c2959451bfef323432beaa4dca4dd88
("freedreno/ir3: Use nir_opt_offset for removing constant adds for shared vars.")
Signed-off-by: Danylo Piliaiev
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14968>
---
src/freedreno/ir3/ir3_nir.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
@@ -124,7 +124,7 @@ ir3_optimize_loop(struct ir3_compiler *compiler, nir_shader *s)
*/
.uniform_max = (1 << 9) - 1,
- .shared_max = ~0,
+ .shared_max = (1 << 13) - 1,
Weeeell, totally unexpected, it was me! Fixing the same game, maybe even the same shader…
Let’s set the shared_max to a correct value . . . . . Nothing changed, not even the assembly. The same incorrect offset is still there.
After a bit of wandering around the optimization pass, it was found that in one case the upper bound is not enforced correctly. Fixing it fixed the rendering.
The final changes were:
This article is part of a series on how to setup a bare-metal CI system for Linux driver development. Here are the different articles so far:
Now that we have a way to netboot and live-update our CI gateway, it is time to start working on the services that will enable sharing the test machines to users!
This work is sponsored by the Valve Corporation.
Test machines are a valuable resource for any organization, be it a tech giant or a small-scale open source project. Along with automated testing, they are instrumental in keeping the productivity of developers high by shifting focus from bug fixing to code review, maintenance improvements, and developing new features. Given how valuable test machines are, we should strive towards keeping machine utilization as high as possible. This can be achieved by turning developer-managed test machines which show low utilization into shared test machines.
Let's see how!
Efficient time-sharing of machines comes with the following high-level requirements:
Let's review them and propose services that should be hosted on the CI gateway to satisfy them, with backwards-compatibility kept in mind at all times.
Note: This post can be very confusing, as it tries to focus on the interactions between the different services without going into implementation details for each service. To help support my (probably confusing) explanations, I added sequence diagrams when applicable.
So, please grab a pot of your favourite stimulating beverage before going further. If this were to be still insufficient, please leave a comment (or contact me by other means) and I will try my best to address it :)

As we have seen in Part 2, using containers to run test jobs is an effective way to allow any job to set up whatever test environment it desires while providing isolation between test jobs. Additionally, it enables caching the test environments so that they do not have to get re-downloaded every time.
This container could be booted using the initramfs we created in Part 2, Boot2Container, which runs any list of containers, as specified in the kernel cmdline.
To keep things simple, stateless, and fast, the kernel/Boot2Container binaries and the kernel cmdline can be downloaded at boot time via HTTP by using, for example, iPXE. The iPXE binary can itself be netbooted by the machine's firmware (see PXE, and TFTP), or simply flashed onto the test machine's drive/attached USB pen-drive.
The iPXE binary can then download the wanted boot script from an HTTP Server, passing in the query URL the MAC address of the network interface that first got an IP address along with the architecture/platform (PCBIOS, x64 UEFI, ...). This allows the HTTP server to serve the wanted boot script for this machine, which contains the URLs of the kernel, initramfs, and the kernel command line to be used for the test job.
Note: While any HTTP server can be used to provide the kernel and Boot2Container to iPXE, I would recommend using an S3-compatible service such as MinIO as it not only acts like an HTTP server, but also provides an industry-standard interface to manage the data (bucket creation, file uploads, access control, ...). This gives you the freedom to change where the service is located and which software provides it without impacting other components of the infrastructure.
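As a rough sketch of that boot script lookup (hypothetical URL layout and machine database, not the actual service we use), the HTTP side could look something like this:
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

# Hypothetical per-machine database: MAC address -> boot parameters.
MACHINES = {
    "00:d8:61:7a:51:cd": {
        "kernel": "http://ci-gateway:9000/boot/linux",
        "initrd": "http://ci-gateway:9000/boot/boot2container.cpio.xz",
        "cmdline": "b2c.container=docker://docker.io/alpine:latest",
    },
}

class BootScriptHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        qs = parse_qs(urlparse(self.path).query)
        mac = qs.get("mac", [""])[0].lower()
        m = MACHINES.get(mac)
        # Serve a per-machine iPXE script, or drop unknown machines into a shell.
        script = ("#!ipxe\n"
                  f"kernel {m['kernel']} {m['cmdline']}\n"
                  f"initrd {m['initrd']}\n"
                  "boot\n") if m else "#!ipxe\nshell\n"
        self.send_response(200)
        self.end_headers()
        self.wfile.write(script.encode())

HTTPServer(("0.0.0.0", 8080), BootScriptHandler).serve_forever()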
Job data caching and bi-directional file-sharing with the test container can be implemented using volumes and Boot2Container's ability to mirror volumes from/to an S3-compatible cloud storage system such as MinIO (see b2c.volume).
Since this may be a little confusing, here are a couple of examples:
Test machines are meant to produce test results, and often need input data before executing tests. An effective solution is to share a folder between the test machine and the machine submitting the job. We suggest using an S3-compatible bucket to share this data, as it provides an industry-standard way of dealing with files shared between multiple machines.
As an example of what this would look like in practice, here are the operations Boot2Container would need to perform in order to start an interactive shell on Alpine Linux on a test machine, with bi-directional data sharing:
1. Connect to the S3-compatible storage on the CI gateway and name it local_minio;
2. Create a volume named job-volume, set its mirror target to local_minio's job-bucket bucket, then tell it to download the content of this bucket right after boot (pull_on=pipeline_start), upload all the content of the volume back to the bucket before shutting down (push_on=pipeline_end), then mark it for deletion when we are done with execution;
3. Start the container (docker.io/alpine:latest) in interactive mode (-ti), with our volume mounted at /job.
Here is how it would look in the kernel command line, as actual Boot2Container arguments:
b2c.minio="local_minio,http://ci-gateway:9000,<ACCESSKEY>,<SECRETKEY>"
b2c.volume="job-volume,mirror=local_minio/job-bucket,pull_on=pipeline_start,push_on=pipeline_end,expiration=pipeline_end"
b2c.container="-v job-volume:/job -ti docker://docker.io/alpine:latest"
Other acceptable values for {push,pull}_on are: pipeline_start, container_start, container_end, pipeline_end, and
changes. The latter downloads/uploads new files as soon as they get created/modified.
In some cases, a test machine may require a lot of confidential data which would be impractical to re-download every single time we boot the machine.
Once again, Boot2Container has us covered as it allows us to mark a volume as never expiring (expiration=never),
decrypting the data when downloading it from the bucket (encrypt_key=$KEY), then storing it encrypted using
fscrypt (fscrypt_key=$KEY). This would look something like this:
b2c.minio="local_minio,http://ci-gateway:9000,<ACCESSKEY>,<SECRETKEY>"
b2c.volume="job-volume,mirror=local_minio/job-bucket,pull_on=pipeline_start,expiration=never,encrypt_key=s3-password,fscrypt=7u9MGy[...]kQ=="
b2c.container="-v job-volume:/job -ti docker://docker.io/alpine:latest"
Read up more about these features, and a lot more, in Boot2Container's README.
In the previous section, we focused on how to consistently boot the right test environment, but we also need to make sure we are booting on the right machine for the job!
Additionally, since we do not want to boot every machine every time a testing job comes just to figure out if we have the right test environment, we should also have a database available on the gateway that can link a machine id (MAC address?), a PDU port (see Part 1), and what hardware/peripherals it has.
While it is definitely possible to maintain a structured text file that would contain all of this information, it is also very error-prone, especially for test machines that allow swapping peripherals: maintenance operations can inadvertently mix up machines, and testing jobs would suddenly stop being executed on the expected machine.
To mitigate this risk, it would be advisable to verify at every boot that the hardware found on the machine is the same as the one expected by the CI gateway. This can be done by creating a container that will enumerate the hardware at boot, generate a list of tags based on them, then compare it with a database running on the CI gateway, exposed as a REST service (the Machine Registration Service, AKA MaRS). If the machine is not known to the CI gateway, this machine registration container can automatically add it to MaRS's database.
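Here is a minimal sketch of what the machine registration container's check could look like (the tag generation, endpoint URL, and port are hypothetical; the real MaRS API will differ):
import json, subprocess, urllib.request

def hardware_tags() -> list[str]:
    """Generate coarse hardware tags; lspci output is just one possible source."""
    out = subprocess.run(["lspci", "-mm"], capture_output=True, text=True).stdout
    return sorted({line.split('"')[3] for line in out.splitlines() if line.count('"') >= 4})

def register_or_verify(mars_url: str, machine_id: str, tags: list[str]) -> bool:
    body = json.dumps({"machine_id": machine_id, "tags": tags}).encode()
    req = urllib.request.Request(f"{mars_url}/api/v1/machines/{machine_id}",
                                 data=body, method="PUT",
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        expected = json.load(resp).get("tags", [])
    return expected == tags  # a mismatch means the hardware changed behind our back

if __name__ == "__main__":
    ok = register_or_verify("http://ci-gateway:8585", "00:d8:61:7a:51:cd", hardware_tags())
    print("hardware matches MaRS database" if ok else "hardware mismatch, machine needs re-training")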
New machines reported to (the) MaRS should however not be directly exposed to users, not until they undergo training to guarantee that:
Sergeant Hartman, the service in charge of this training, can also be used to perform a sanity check after every reboot of the CI gateway, to check whether any of the test machines in the MaRS database changed while the CI gateway was offline.
Finally, CI farm administrators need to occasionally work on a test machine, and thus need to prevent execution of future jobs on this test machine. We call this operation "Retiring" a test machine. The machine can later be "activated" to bring it back into the pool of test machines, after going through the training sequence.
The test machines' state machine and the expected training sequence can be seen in the following images:
Note: Rebooting test machines that boot using the "Boot on AC" method (see Part 1) may require them to be disconnected from the power for a relatively long time in order for the firmware to stop getting power from the power supply. A delay of 30 seconds seems relatively conservative, but some machines may require more. It is thus recommended to make this delay configurable on a per-machine basis, and to store it in MaRS.
Once a test machine is booted, a serial console can provide a real time view of the testing in progress while also enabling users to use the remote machine as if they had an SSH terminal on it.
To enable this feature, we first need to connect the test machine to the CI gateway using serial consoles, as explained in Part 1:
Test Machine <-> USB <-> RS-232 <-> NULL modem cable <-> RS-232 <-> USB Hub <-> Gateway
As the CI gateway may be used for more than one test machine, we need to figure out which serial port of the test machine is connected to which serial port of the CI gateway. We then need to keep this information up to date, as we want to be sure we are viewing the logs of the right machine when executing a job!
This may sound trivial when you have only a few machines, but this can quickly become difficult to maintain when you have 10+ ports connected to your CI gateway! So, if like me you don't want to maintain this information by hand, it is possible to auto-discover this mapping at the same time as we run the machine registration/check process, thanks to the use of another service that will untangle all this mess: SALAD.
For the initial registration, the machine registration container should output, at a predetermined baudrate, a
well-known string to every serial port (SALAD.ping\n, for example) then pick the first console port that answers another
well-known string (SALAD.pong\n, for example). Now that the test machine knows which port to use to talk to the CI
gateway, it can send its machine identifier (MAC address?) over it so that the CI gateway can keep track of which serial
port is associated to which machine (SALAD.machine_id=...\n).
As part of the initial registration, the machine registration container should also transmit to MaRS the name of the
serial adapter it used to talk to SALAD (ttyUSB0 for example) so that, at the next boot, the machine can be configured
to output its boot log on it (console=ttyUSB0 added to its kernel command line). This also means that the verification
process of the machine registration container can simply send SALAD.ping\n to stdout, and wait for
SALAD.pong\n on stdin before outputting SALAD.machine_id=... to stdout again to make sure the association is still
valid.
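Here is a sketch of the machine-side discovery (device paths, baudrate setup, timeouts, and error handling are all glossed over and would need real care):
import glob, os

PING, PONG = b"SALAD.ping\n", b"SALAD.pong\n"

def find_salad_port(machine_id: str) -> str | None:
    """Ping every serial adapter and keep the first one that answers with a pong."""
    for dev in glob.glob("/dev/ttyUSB*"):
        fd = os.open(dev, os.O_RDWR | os.O_NOCTTY)
        try:
            os.write(fd, PING)
            if os.read(fd, len(PONG)) == PONG:   # real code would need a timeout here
                os.write(fd, f"SALAD.machine_id={machine_id}\n".encode())
                return dev
        finally:
            os.close(fd)
    return None

print(find_salad_port("00:d8:61:7a:51:cd"))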
On the CI gateway side, we propose that SALAD should provide the following functions:
This provides the ability to host the SALAD service on more than just the CI gateway, which may be useful in case the machine runs out of USB ports for the serial consoles.
Here are example outputs from the proposed REST interface:
$ curl -s http://${SALAD_HOST}:${SALAD_PORT}/api/v1/machine/
machines:
"00:d8:61:7a:51:cd":
has_client: false
tcp_port: 48841
"00:e0:4c:68:0b:3d":
has_client: true
tcp_port: 57791
$ curl -s http://${SALAD_HOST}:${SALAD_PORT}/api/v1/machine/00:d8:61:7a:51:cd
has_client: false
tcp_port: 48841
Interacting with a test machine's serial console is done by connecting to the tcp_port associated with the
test machine. In a shell script, one could implement this using curl, jq, and netcat:
$ MACHINE_ID=00:d8:61:7a:51:cd
$ netcat ${SALAD_HOST} $(curl -s http://${SALAD_HOST}:${SALAD_PORT}/api/v1/machine/${MACHINE_ID} | jq ".tcp_port")
# You now have a read/write access to the serial console of the test machine
As we explained in Part 1, the only way to guarantee that test jobs don't interfere with each other is to reset the hardware fully between every job... which unfortunately means we need to cut the power to the test machine long enough for the power supply to empty its capacitors and stop providing voltage to the motherboard even when the computer is already off (30 seconds is usually enough).
Given that there are many switchable power delivery units on the market (industrial, or for home use), many communication mediums (serial, Ethernet, WiFi, Zigbee, Z-Wave, ...), and protocols (SNMP, HTTP, MQTT, ...), we really want to create an abstraction layer that will allow us to write drivers for any PDU without needing to change any other component.
One existing abstraction layer is pdudaemon, which has many drivers for industrial and home-oriented devices. It however does not provide a way to read back the state of a certain port, which prevents verifying that the operation succeeded and makes it difficult to check that the power was indeed off at all times during the mandatory power-off period.
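To sketch what such an abstraction layer could look like (a hypothetical interface, not pdudaemon's API), including the ability to read back and verify the port state during the whole off period:
import time
from abc import ABC, abstractmethod

class PDU(ABC):
    """Hypothetical driver interface; each backend (SNMP, HTTP, MQTT, ...) implements it."""
    @abstractmethod
    def set_port_state(self, port: int, on: bool) -> None: ...
    @abstractmethod
    def get_port_state(self, port: int) -> bool: ...

def power_cycle(pdu: PDU, port: int, off_delay: float = 30.0) -> None:
    pdu.set_port_state(port, False)
    if pdu.get_port_state(port):                 # read back to verify the operation
        raise RuntimeError("port did not turn off")
    deadline = time.monotonic() + off_delay      # keep checking during the whole off period
    while time.monotonic() < deadline:
        if pdu.get_port_state(port):
            raise RuntimeError("port turned back on during the off period")
        time.sleep(1)
    pdu.set_port_state(port, True)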
The PDU abstraction layer should allow its users to:
While this layer could be developed either as a library or as a REST service, we would recommend implementing it as a standalone service because it makes the following easier:
Even though our test machines cache containers when they first download them, it would still be pretty inefficient if every test machine in the CI farm had to download them directly from the internet.
Rather than doing that, test machines can download the containers through a proxy registry hosted on the CI gateway. This means that the containers will only be downloaded from the internet ONCE, no matter how many test machines you have in your farm. Additionally, the reduced reliance on the internet will improve your farm's reliability and performance.
All the different services needed to time-share the test machines effectively have now been described. What we are missing is a central service that coordinates all the others, exposes an interface to describe and queue test jobs, and then monitors their progress.
In other words, this service needs to:
The job description should allow users of the CI system to specify the test environment they want to use without constraining them needlessly. It can also be viewed as the reproduction recipe in case anyone would like to reproduce the test environment locally.
By its nature, the job description is probably the most important interface in the entire CI system. It is very much like a kernel's ABI: you don't want an update to break your users, so you need to make only backwards-compatible changes to this interface!
Job descriptions should be generic and minimalistic to even have a chance to maintain backwards compatibility. To achieve this, try to base it on industry standards such as PXE, UEFI, HTTP, serial consoles, containers, and others that have proven their versatility and interoperability over the years.
Without getting tangled too much into details, here is the information it should contain:
And here are some of the things it should NOT contain:
Now, good luck designing your job description format... or wait for the next posts which will document the one we came up with!
Job execution is split into the following phases:
While the executor could perform all of these actions from the same process, we would recommend splitting the job execution into its own process as it prevents configuration changes from affecting currently-running jobs, makes it easier to tell if a machine is running or idle, makes live-updating the executor trivial (see Part 4 if you are wondering why this would be desirable), and makes it easier to implement job preemption in the future.
Here is how we propose the executor should interact with the other services:
In this post, we defined a list of requirements to efficiently time-share test machines between users, identified sets of services that satisfy these requirements, and detailed their interactions using sequence diagrams. Finally, we provided both recommendations and cautionary tales to help you set up your CI gateway.
In the next post, we will take a bit of a breather and focus on the maintainability of the CI farm through the creation of an administrator dashboard, easing access to the gateway using a Wireguard VPN, and monitoring of both the CI gateway and the test machines.
By the end of this blog series, we aim to propose a plug-and-play experience throughout the CI farm, and to have it automatically and transparently expose runners on GitLab/GitHub. This system will also hopefully be partially hosted on Freedesktop.org to help developers write, test, and maintain their drivers. The goal would be a setup time of under an hour for newcomers!
That's all for now, thanks for making it to the end of this post!
After the video decode stuff was fairly nailed down, Lynne from ffmpeg nerdsniped^Wtalked me into looking at h264 encoding.
The AMD VCN encoder engine has a very different interface from the decode engine and required a lot of code porting from the radeon vaapi driver. Prior to Xmas I burned a few days on typing that all in, and yesterday I finished typing and moved on to debugging the pile of trash I'd just typed in.
Lynne meanwhile had written the initial ffmpeg side implementation, and today we threw them at each other, and polished off a lot of sharp edges. We were rewarded with valid encoded frames.
The code at this point is only doing I-frame encoding; we will work on P/B frames when we get a chance.
There are also a bunch of hacks and workarounds for API/hw mismatches, that I need to consult with Vulkan spec and AMD teams about, but we have a good starting point to move forward from. I'll also be offline for a few days on holidays so I'm not sure it will get much further until mid January.
My branch is [1]. Lynne's ffmpeg branch is [2].
[1] https://gitlab.freedesktop.org/airlied/mesa/-/commits/radv-vulkan-video-enc-wip
I've been working the past couple of weeks with an ffmpeg developer (Lynne) doing Vulkan video decode bringup on radv.
The current status of this work is in a branch [1]. That work is all against the current EXT decode beta extensions in the spec.
Khronos has since released the final specs for these extensions. The work has been rebased onto the final KHR form and is in a merge request for radv [2].
This contains an initial implementation of H264 and H265 decoding for AMD GPUs from TONGA to NAVI2x. It passes the basic conformance tests and fails some of the more complicated ones, but it has decoded the streams we've been throwing at it using ffmpeg.
export VK_ICD_FILENAMES=<prefix>/share/vulkan/icd.d/radeon_icd.x86_64.json
For the prelim branch [1]:
export RADV_VIDEO_DECODE=1
For the merge branch [2]:
export RADV_PERFTEST=video_decode
vulkaninfo
This should show support for VK_KHR_video_queue, VK_KHR_video_decode_queue, VK_EXT_video_decode_h264 and VK_EXT_video_decode_h265.
[1] https://gitlab.freedesktop.org/airlied/mesa/-/tree/radv-vulkan-video-prelim-decode
[2] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20388
Hi!
Following last month’s wlroots release, we’ve started the Sway release candidate cycle. Kenny Levinsen and Ronan Pigott have helped fix the bugs and regressions that popped up, and I hope we’ll be able to ship the final release next week. I also plan to release wlroots 0.16.1 with the fixes we’ve accumulated.
In other wlroots news, Manuel Stoeckl and I have continued to work on the Vulkan renderer. A lot more pixel formats are now supported for textures, and the buffer synchronization issues should all be sorted out. Next we’d like to add support for rendering to high bit depth buffers for color management purposes. This is a bit more involved since there is no shader stage which runs after blending, so I’d like to experiment with compute shaders to see if they’re better suited for us. That ties in with the renderer API redesign I’ve been planning for a while: the new rendering API should make it easier to use compute shaders, and already shows a nice perf boost for the Pixman renderer.
I’ve been debugging some issues with USB-C docks where outputs wouldn’t turn back on after a plug-unplug-replug cycle. The result is an i915 patch which fixes some of the issues, but it seems there are more where that came from. Ultimately this class of bugs should get fixed when we add support for atomically turning off multiple outputs at once in wlroots, but this will require a lot more work.
Alexander Orzechowski and I have been pushing surface-invalidation, a new Wayland protocol to help with GPU resets. GPU resets destroy the whole GL/Vulkan state on the compositor side, so compositors which can recover from resets are left with nothing to draw. The protocol allows compositors to request a new buffer from clients.
New versions of swaybg, swayidle and slurp have been released. swaybg and swayidle now take advantage of new protocols such as single-pixel-buffer-v1 and ext-idle-notify-v1. slurp can now enforce a specific aspect ratio for the selection rectangle, has a configurable font, and can print relative coordinates in its format string.
libdisplay-info is growing bit by bit. Sebastian Wick has added support for CTA short audio descriptors, and I’ve sent patches for CTA speaker allocation data blocks. I’ve continued my work on libjsonschema to support JSON serialization. By the way, libdisplay-info is now used in DXVK for EDID parsing!
I’ve merged some nice patches from delthas for the soju IRC bouncer. Alongside small quality-of-life improvements and fixes, a new WHO cache should make WHO queries more reliable, no longer hitting the rate limits of the upstream servers. Pending work on a database-backed message store will make chat history much more reliable, faster, and lossless (currently we drop all message tags). Last but not least, rj1 has implemented TLS certificate pinning to allow soju to connect to servers using self-signed certificates.
The Goguma IRC client for Android also has received some love this month. Various NOTICE and CTCP messages (e.g. from ChanServ or NickServ) should be less annoying, as they’re now treated as ephemeral. They don’t open notifications anymore, and if the user is not in a conversation with them the messages show up in a temporary bar at the bottom of the screen. I’ve also implemented image previews:
They are disabled by default because of privacy concerns, but can be enabled in the settings. Only image links are supported for now, but I plan to add HTML previews as well. Last, I’ve optimized the SQL queries run at startup, so launching the app should be a bit faster now.
The NPoTM is glhf, a dead simple IRC bot for GitLab projects. It’ll announce GitLab events in the IRC channel, and will expand links to issues and merge requests. We’re using it in #sway-devel, and I’m pretty happy with it so far!
That’s all for now, see you in January!
After cleaning up the radv stuff I decided to go back and dig into the anv support for H264.
The current status of this work is in a branch[1]. This work is all against the current EXT decode beta extensions in the spec.
This contains an initial implementation of H264 decoding for the Intel GPUs that anv supports. I've only tested it on Kabylake equivalents so far. It decodes some of the basic streams I've thrown at it from ffmpeg. Now this isn't as far along as the AMD implementation, but I'm also not sure I'm programming the hardware correctly. The Windows DXVA API has 2 ways to decode H264, short and long. I believe, but I'm not 100% sure, that the current Vulkan API is quite close to "short", but the only Intel implementations I've found source for are for "long". I've bridged this gap by writing a slice header parser in mesa, but I think the hw might be capable of taking over that task, and I could in theory dump a bunch of code. But the programming guides for the hw block are a bit vague on some of the details around how "long" works. Maybe at some point someone in Intel can tell me :-)
This should show support for VK_KHR_video_queue, VK_KHR_video_decode_queue, VK_EXT_video_decode_h264.
[1] https://gitlab.freedesktop.org/airlied/mesa/-/tree/anv-vulkan-video-prelim-decode
Let me start by saying that I now also have a Mastodon account, so please follow me on @Cfkschaller@fosstodon.org for news and updates around Fedora Workstation, PipeWire, Wayland, LVFS and Linux in general.
Fedora vs Fedora Workstation
Before I start with the general development update I want to mention something I have mentioned quite a few times before, and that is the confusion I often find people have about what Fedora is and what Fedora Workstation is.
Fedora Workstation
Fedora is our overall open source project and community working on packaging components and software for the various outputs that the Fedora community delivers. Think of the Fedora community a bit like a big group of people providing well-tested and maintained building blocks to be used to build operating systems and applications. As part of that bigger community you have a lot of working groups and special interest groups working to build something with those building blocks, and Fedora Workstation is the most popular thing being built from them. That means that Fedora Workstation isn’t ‘Fedora’; it is something created by the Fedora community alongside a lot of other projects like Fedora Server, Silverblue and Fedora spins like Fedora KDE and Fedora Kinoite. But all of them should be considered separate efforts built using a shared set of building blocks.
Putting together an operating system like Fedora Workstation is more than just assembling a list of software components to include though; it is also about setting policies, default configurations, testing and QE, and marketing. This means that while Fedora Workstation contains many of the same components as other things like the Fedora KDE Plasma Desktop spin, the XFCE Desktop spin, the Cinnamon spin and so on, they are not the same thing. And that is not just because the feature set of GNOME is different from the feature set of XFCE, it is because each variant is able to set its own policies and configurations and do its own testing and QE. Different variants adopted different technologies at different times, for instance Fedora Workstation was an early adopter of new technologies like Wayland and PipeWire. The reason I keep stressing this point is that to this day I often see comments or feedback about ‘Fedora’ which, as someone spending a lot of effort on Fedora Workstation, sometimes makes no sense to me, only to reach out and discover that the person was not using Fedora Workstation, but one of the spins. So I do ask people, especially members of the technology press, to be more precise in their reviews about whether they are talking about Fedora Workstation or another project housed under the Fedora umbrella, and not just shorten it all to ‘Fedora’.
High Dynamic Range – HDR
A major goal for us at Red Hat is to get proper HDR support into Fedora and RHEL. Sebastian Wick is leading that effort for us and working with partners and contributors across the community and the industry. For those who read my past updates, you know I have been talking about this for a while, but it has been slow moving because it is a very complex issue that needs changes from the kernel graphics drivers up through the desktop shell and into the application GUI toolkits. As Sebastian put it when I spoke with him, HDR forces us (the Linux ecosystem) to take colors seriously for the first time and thus it reveals a lot of shortcomings up through our stack, many of which are not even directly HDR related.
Support for High Dynamic Range (HDR) requires compositors to produce framebuffers in specific HDR encodings from the framebuffers of different clients. Most clients nowadays are unaware of HDR and Wide Color Gamut (WCG) and instead produce pixels with RGB encodings which look okay on most displays that approximate sRGB characteristics. Converting different encodings to the common HDR encoding requires clients to communicate their encoding, the compositor to adjust and convert between the different color spaces, and the driver to enable an HDR mode in the display. Converting between the encodings should ideally be done by the scanout engine of the GPU to keep the latency and power benefits of direct scanout. Similarly, applications and toolkits have to understand different encodings and how to convert between them to make use of HDR and WCG features.
Essentially, HDR forces us to handle color correctly and makes color management an integral part of the system.
So in no particular order here are some of the issues we have found and are looking at:
Headless display and automated testing of GNOME Shell
Jonas Ådahl has been working on getting headless display support working correctly under Wayland for a while, and as part of that has also been working on using the headless display support to enable automated testing of gnome-shell. The core work to get GNOME Shell to be able to run headless on top of Wayland finished some time ago, but there are still some related tasks that need further work.
HID-BPF
Benjamin Tissoires has been working on a set of kernel patches for some time now that implements something called HID-BPF. HID refers to Human Interface Devices (like mice, keyboards, joysticks and so on) and BPF is a kernel and user-space observability scheme for the Linux kernel. If you want to learn more about BPF in general I recommend this Linux Journal article on the subject. This blog post will talk specifically about HID-BPF and not BPF in general. We are expecting this patchset to either land in 6.2 or it might get delayed to 6.3.
The kernel documentation explains at length and through examples how to use HID-BPF and when. The summary would be that in the HID world, we often have to tweak a single byte on a device to make it work, simply because HW makers only test their device under Windows, and they can provide a “driver” that has the fix in software. So most of the time it comes down to changing a few bytes in the report descriptor of the HID device, or on the fly when we receive an event. For years, we have been doing these sorts of fixes in the kernel through the normal kernel driver model. But it is a relatively slow way of doing it and we should be able to be more efficient.
The kernel driver model process involves multiple steps and requires the reporter of the issue to test the fix at every step. The reporter has an issue and contacts a kernel developer to fix that device (sometimes, the reporter and the kernel developers are the same person). To be able to submit the fix, it has to be tested, and so compiled in the kernel upstream tree, which means that we require regular users to compile their kernel. This is not an easy feat, but that’s how we do it. Then, the patch is sent to the list and reviewed. This often leads to v2, v3 of the patch requiring the user to recompile a new kernel. Once it’s accepted, we still need to wait for the patch to land in Linus’ tree, and then for the kernel to be taken by distributions. And this is only now that the reporter can drop the custom kernel build. Of course, during that time, the reporter has a choice to make: should I update my distro kernel to get security updates and drop my fix, or should I recompile the kernel to keep everybody up to date, or should I just stick with my kernel version because my fix is way more important?
So how can HID-BPF make this easier?
So instead of the extensive process above, a fix can be written as a BPF program which can be run on any kernel. Rather than asking the reporter to recompile the upstream kernel, a kernel developer can compile the BPF program and provide both the sources and the binary to the tester. The tester now just has to drop that binary in a given directory to have a userspace component automatically bind this program to the device. When the review process happens, we can easily ask testers to update their BPF program, and they don’t even need to reboot (just unplug/replug the target device). For the developers, we continue doing the review, and we merge the program directly in the source tree. Not even a single #define needs to differ. Then an automated process (yet to be fully determined, but Benjamin has an RFC out there) builds that program into the kernel tree, and we can safely ship that tested program. When the program hits the distribution kernel, the user doesn’t even notice it: given that the kernel already ships that BPF program, the userspace component will just not attach itself to the device, meaning that the test user will even benefit from the latest fixes, if there were any, without any manual intervention. Of course, if the user wants to update the kernel because of a security problem or another bug, the user will get both: the security fix and the HID fix still there.
That single part is probably the most appealing thing about HID-BPF. And of course, this translates very well to the enterprise Linux world: when a customer has an issue on a HID device that can be fixed by HID-BPF, we can provide that customer with the pre-compiled binary to drop into the filesystem without having to rely on a kernel update.
But there is more: we can now start implementing a HID firewall (so that only fwupd can change the firmware of a HID device), and we can also tell people not to invent ad-hoc kernel APIs, but to rely on eBPF to expose that functionality (so that if the userspace program is not there, we don’t even expose that capability). And we can be even more creative, like deciding to change how the kernel presents a device to userspace.
MIPI Camera
Kate Hsuan and Hans de Goede have been working together trying to get the Linux support for MIPI cameras into shape. MIPI cameras are the next generation of PC cameras, and you are going to see more and more laptops shipping with them, so it is critical for us to get them working and working well. This work is also done in the context of our close collaboration with Lenovo around the Fedora laptops they offer.
Without working support, Fedora Workstation users who buy a laptop equipped with a MIPI camera will only get a black image. The long term solution for MIPI cameras is a library called libcamera, which is a project led by Laurent Pinchart and Kieran Bingham from IdeasOnBoard and sponsored by Google. As a desktop Linux user you are probably not going to have applications interact directly with libcamera though; instead our expectation is that your applications use libcamera through PipeWire and the Flatpak portal. Thanks to the work of the community we now have support for the Intel IPU3 MIPI stack in the kernel and through libcamera, but the Intel IPU6 MIPI stack is the one we expect to see head into laptops in a major way and thus our current effort is focused on bringing support for that in a useful way to Fedora users. Intel has so far provided a combination of binary and limited open source code to support these cameras under Linux, by using GStreamer to provide an emulated V4L2 device. This is not a great long term solution, but it does provide us with at least an intermediate solution until we can get IPU6 working with libcamera. Based on his successful work on IPU3, Hans de Goede has been working on getting the necessary part of the Intel open source driver for IPU6 upstreamed since that is the basic interface to control the IPU6 and image sensor. Kate has been working on packaging all the remaining Intel non-free binary and software releases into RPM packages. The packages will provide the v4l2loopback solution which should work with any software supporting V4L2. These packages were planned to go live soon in the RPM Fusion nonfree repository.
LVFS – Linux Vendor Firmware Service
Richard Hughes is still making great strides forward with the now-ubiquitous LVFS service. A few months ago we pushed the UEFI dbx update to the LVFS, which has now been downloaded over 4 million times. This pushed the total downloads to a new high of 5 million updates in just the 4 weeks of September, although it’s returned to a more normal growth pattern now.
The LVFS also now supports 1,163 different devices, and is being used by 137 different vendors. Since we started all those years ago we’ve provided at least 74 million updates to end-users, although it’s probably even more given that lots of Red Hat customers mirror the entire LVFS for internal use. Not to mention that thanks to Richard’s collaboration with Google, LVFS is now an integral part of the ‘Works with ChromeOS’ program.
On the LVFS we also now show the end-user provided HSI reports for a lot of popular hardware. This is really useful to check how secure a device will likely be before buying new hardware, regardless of whether the vendor is uploading to the LVFS. We’ve also asked ODMs and OEMs who do actually use the LVFS to use signed reports. Once that is in place we can continue to scale up, aiming for 10,000 supported devices and 100 million downloads.
PipeWire & OBS Studio
In addition to doing a bunch of bug fixes and smaller improvements around audio in PipeWire, Wim Taymans has spent time recently working on getting the video side of PipeWire in shape. We decided to use OBS Studio as our ‘proof of concept’ application because there was already a decent set of patches from Georges Stavracas and Columbarius that Wim could build upon. Getting the Linux developer community to adopt PipeWire for video will be harder than it was for audio since we can not do a drop-in replacement; instead it will require active porting by application developers. Unlike for audio, where PipeWire provides a binary compatible implementation of the ALSA, PulseAudio and JACK APIs, we can not provide a re-implementation of the V4L2 API that can run transparently and reliably in place of the actual V4L2. That said, Wim has created a tool called pw-v4l2 which tries to redirect the v4l2 calls into PipeWire, so you can use that for some testing, for example by running ‘pw-v4l2 cheese’ on the command line and you will see Cheese appear in the PipeWire patchbay applications Helvum and qpwgraph. As stated though it is not reliable enough to be something we can for instance have GNOME Shell do as a default for all camera handling applications. Instead we will rely on application developers out there to look at the work we did in OBS Studio as a best practice example and then implement camera input through PipeWire using that. This brings with it a lot of advantages though, like transparent support for libcamera alongside v4l2 and easy sharing of the video streams between multiple applications.
One thing to note about this is that at the moment we have two ‘equal’ video backends in PipeWire, V4L2 and libcamera, which means you get offered the same device twice, once from V4L2 and once from libcamera. Obviously this is a little confusing and annoying. Since libcamera is still in heavy development, the V4L2 backend is more reliable for now, so Fedora Workstation will ship with the V4L2 backend ‘out of the box’, but allow you to easily install the libcamera backend. As libcamera matures we will switch over to the libcamera backend (and allow you to install the V4L2 backend if you still need/want it for some reason).
Of course the most critical thing to get ported to use PipeWire for camera handling is the web browsers. Luckily thanks to the work of Pengutronix there is a patchset ready to be merged into WebRTC, which is the implementation used by both Chromium/Chrome and Firefox. While I can make no promises we are currently looking to see if it is viable for us to start shipping that patch in the Fedora Firefox package soon.
And finally, thanks to the work by Jan Grulich in collaboration with Google engineers, PipeWire is now included in the Chromium test build, which has allowed Google to enable PipeWire for screen sharing by default. Having PipeWire in the test builds is also going to be critical for getting the camera handling patch merged and enabled.
Flathub, Flatpak and Fedora
As I have spoken about before, we have a clear goal of moving as close as we can to a Flatpak only model for Fedora Workstation. There are a lot of reasons for this, like making applications more robust (i.e. the OS doesn’t keep moving fast underneath the application), making updates more reliable (because an application updating its dependencies doesn’t risk conflicting with the needs of other applications), making applications more portable in the sense that they will not need to be rebuilt for each variety of operating system, providing better security since applications can be sandboxed in a container with clear access limitations, and allowing us to move to an OS model like we have in Silverblue with an immutable core.
So over the last few years we have spent a lot of effort alongside other members of the Linux community preparing the ground to allow us to go in this direction, with improvements to GTK+ and Qt, for instance, to allow them to work better in sandboxes like the one provided by Flatpak.
There has also been strong support and growth around Flathub, which now provides a wide range of applications in Flatpak format and is being used on most major Linux distributions out there. As part of that we have been working to figure out the policies and user interface needed to enable Flathub fully in Fedora (currently there is a small allowlisted selection available when you enable 3rd party software). This change didn’t make it into Fedora Workstation 37, but we do hope to have it ready for Fedora Workstation 38. As part of that effort we also hope to take another look at the process for building Flatpaks inside Fedora, to reduce the barrier for Fedora developers to do so.
So how do we see things evolving in terms of distribution packaging? Well, that is a good question. First of all, a huge part of what goes into a distribution is not Flatpak material, considering that Flatpaks are squarely aimed at shipping GUI desktop applications. There is a huge list of libraries and tools that are either used to build the distribution itself, like Wayland, GNOME Shell, libinput, PipeWire etc., or tools used by developers on top of the operating system, like Python, Perl, Rust tools etc. These will need to be RPM packaged for the distribution regardless. And there will be cases where you still want certain applications to be RPM packaged going forward. For instance many of you hopefully are aware of Container Toolbx, our effort to make pet containers a great tool for developers. What you install into your toolbox, including some GUI applications like many IDEs, will still need to be packaged as RPMs and installed into each toolbox, as they have no support for interacting with a toolbox from the host. Over time we hope that more IDEs will follow in GNOME Builder’s footsteps and become container aware and thus can be run from the host system as a Flatpak, but apart from Owen Taylor’s VS Code integration plugin most of them are not yet, and thus need to be installed inside your Toolbx.
As for building Flatpaks in Fedora as opposed to on Flathub, we are working on improving the developer experience around that. There are many good reasons why one might want to maintain a Fedora Flatpak, things like liking the Fedora content and security policies or just being more familiar with using the tested and vetted Fedora packages. Of course there are good reasons why developers might prefer maintaining applications on Flathub too, we are fine with either, but we want to make sure that whatever path you choose we have a great developer and package maintainer experience for you.
Multi-Stream Transport
Multi-monitor setups have become more and more common and popular, so one effort we have spent time on over the last few years is Lyude Paul’s work to clean up the MST support in the kernel. MST is a specification for DisplayPort that allows multiple monitors to be driven from a single DisplayPort port by multiplexing several video streams into a single stream and sending it to a branch device, which demultiplexes the signal into the original streams. DisplayPort MST will usually take the form of a single USB-C or DisplayPort connection. More recently, we’ve also seen higher resolution displays – and complicated technologies like DSC (Display Stream Compression) – which need proper driver support in order to function.

Making setups like docks work is no easy task. In the Linux kernel, we have a set of shared code that any video driver can use in order to much more easily implement support for features like DisplayPort MST. We’ve put quite a lot of work into making this code both viable for any driver to use and a good base to build new functionality on top of, for features such as DSC. Our hope is that with this we can encourage both the growth of support for functionality like MST and support for further features like DSC from vendors. Since this code is shared, it also comes with the benefit that any new functionality implemented through this path is far easier to add to other drivers.
Lyude has mostly finished this work now and has recently been focusing on fixing some regressions that accidentally made it upstream in amdgpu. The main thing she was working on beforehand was a lot of code cleanup, particularly removing a bunch of the legacy MST code. For context: with kernel modesetting we have legacy modesetting and atomic modesetting. Atomic modesetting is what modern drivers use, and it’s a great deal simpler to work with than legacy modesetting. Most of the MST helpers were written before atomic was a thing, and as a result there was a pretty big mess of code that didn’t really need to be there – and that actively made it a lot more difficult to implement new functionality and to figure out whether bug fixes being submitted to Lyude were even correct. Now that this has been cleaned up, the MST helpers make heavy use of atomic, and this has definitely simplified the code quite a bit.
Time for another status update on libei, the transport layer for bouncing emulated input events between applications and Wayland compositors [1]. And this time it's all about portals and how we're about to use them for libei communication. I've hinted at this in the last post, but of course you're forgiven if you forgot about this in the... uhm.. "interesting" year that was 2022. So, let's recap first:
Our basic premise is that we want to emulate and/or capture input events in the glorious new world that is Wayland (read: where applications can't do whatever they want, whenever they want). libei is a C library [0] that aims to provide this functionality. libei supports "sender" and "receiver" contexts and that just specifies which way the events will flow. A sender context (e.g. xdotool) will send emulated input events to the compositor, a "receiver" context will - you'll never guess! - receive events from the compositor. If you have the InputLeap [2] use-case, the server-side will be a receiver context, the client side a sender context. But libei is really just the transport layer and hasn't had that many changes since the last post - most of the effort was spent on trying to figure out how to exchange the socket between different applications. And for that, we have portals!
In particular, we have a PR for the RemoteDesktop portal to add that socket exchange: once a RemoteDesktop session starts, your application can request an EIS socket and send input events over that. This socket supersedes the current NotifyButton and similar DBus calls and removes the need for the portal to stay in the middle - the application and compositor now talk directly to each other. The compositor/portal can still close the session at any time though, so all the benefits of a portal remain. The big advantage of integrating this into RemoteDesktop is that the infrastructure for it is already mostly in place - once your compositor adds the bits for the new ConnectToEIS method, you get all the other pieces for free. In GNOME this includes a visual indication that your screen is currently being remote-controlled, same as for a real RemoteDesktop session.
Now, talking to the RemoteDesktop portal is nontrivial simply because using DBus is nontrivial, doubly so for the way how sessions and requests work in the portals. To make this easier, libei 0.4.1 now includes a new library "liboeffis" that enables your application to catch the DBus. This library has a very small API and can easily be integrated with your mainloop (it's very similar to libei). We have patches for Xwayland to use that and it's really trivial to use. And of course, with the other Xwayland work we already had this means we can really run xdotool through Xwayland to connect through the XDG Desktop Portal as a RemoteDesktop session and move the pointer around. Because, kids, remember, uhm, Unix is all about lots of separate pieces.
On to the second mode of libei - the receiver context. For this, we also use a portal but a brand new one: the InputCapture portal. The InputCapture portal is the one to use to decide when input events should be captured. The actual events are then sent over the EIS socket.
Right now, the InputCapture portal supports PointerBarriers - virtual lines on the screen edges that, once crossed, trigger input capture for a capability (e.g. pointer + keyboard). And an application's basic approach is to request a (logical) representation of the available desktop areas ("Zones") and then set up pointer barriers at the edge(s) of those Zones. Get the EIS connection, Enable() the session and voila - the compositor will (hopefully) send input events when the pointer crosses one of those barriers. Once that happens you'll get a DBus signal in InputCapture and the events will start flowing on the EIS socket. The portal itself doesn't need to sit in the middle, events go straight to the application. The portal can still close the session anytime though. And the compositor can decide to stop capturing events at any time.
There is actually zero Wayland-y code in all this, it's display-system agnostic. So anyone with too much motivation could add this to the X server too. Because that's what the world needs...
The (currently) bad news is that this needs to be pulled into a lot of different repositories. And everything needs to get ready before it can be pulled into anything to make sure we don't add broken API to any of those components. But thanks to a lot of work by Olivier Fourdan, we have this mostly working in InputLeap (tbh the remaining pieces are largely XKB related, not libei-related). Together with the client implementation (through RemoteDesktop) we can move pointers around like in the InputLeap of old (read: X11).
Our current goal is for this to be ready for GNOME 45/Fedora 39.
[0] eventually a protocol but we're not there yet
[1] It doesn't actually have to be a compositor but that's the prime use-case, so...
[2] or barrier or synergy. I'll stick with InputLeap for this post
The end of 2022 is very close so I’m just in time for some self-promotion. As you may know, the ongoing collaboration between Valve and Igalia lets me and some of my colleagues work on improving the open-source Vulkan and OpenGL Conformance Test Suite. This work is essential to ship quality Vulkan drivers and, from the Khronos side, to improve the Vulkan standard further by, among other things, adding new functionality through API extensions. When creating a new extension, apart from reaching consensus among vendors about the scope and shape of the new APIs, CTS tests are developed in order to check the specification text is clear and vendors provide a uniform implementation of the basic functionality, corner cases and, sometimes, interactions with other extensions.
In addition to our CTS work, we often review the Vulkan specification text for the extensions we develop tests for. We also do the same for other extensions and changes, and we submit fixes and improvements of our own.
In 2022, our work was important to be able to ship a bunch of extensions you can probably see implemented in Mesa and used by VKD3D-Proton when playing your favorite games on Linux, be it on your PC or perhaps on the fantastic Steam Deck. Or maybe used by Zink when implementing OpenGL on top of your favorite Vulkan driver. Anyway, without further ado, let’s take a look.
This extension was created by our beloved super good coder Mike Blumenkrantz to be able to create a 2D view of a single slice of a 3D image. It helps emulate functionality which was already possible with OpenGL, and is used by Zink. Siru developed tests for this one but we reviewed the spec and are listed as contributors.
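To make the idea concrete, here is a minimal sketch of my own (not from the extension author) of what using it looks like: the 3D image is created with the 2D-view-compatible flag, and a plain 2D image view then selects one slice through baseArrayLayer. The `device`, `image`, `slice` and the format are assumed to exist already.

```c
#include <vulkan/vulkan.h>

/* Sketch: a 2D view of a single slice of a 3D image (VK_EXT_image_2d_view_of_3d).
 * Assumes `image` was created with imageType = VK_IMAGE_TYPE_3D and
 * flags = VK_IMAGE_CREATE_2D_VIEW_COMPATIBLE_BIT_EXT. */
VkImageViewCreateInfo view_info = {
    .sType = VK_STRUCTURE_TYPE_IMAGE_VIEW_CREATE_INFO,
    .image = image,
    .viewType = VK_IMAGE_VIEW_TYPE_2D,            /* 2D view of a 3D image */
    .format = VK_FORMAT_R8G8B8A8_UNORM,
    .subresourceRange = {
        .aspectMask = VK_IMAGE_ASPECT_COLOR_BIT,
        .baseMipLevel = 0,
        .levelCount = 1,
        .baseArrayLayer = slice,                  /* which depth slice to expose */
        .layerCount = 1,
    },
};
VkImageView view;
vkCreateImageView(device, &view_info, NULL, &view);
```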
One of my favorite extensions shipped in 2022. Created by Hans-Kristian Arntzen to be used by Proton, this extension lets applications query identifiers (hashes, if you want to think about them like that) for existing VkShaderModule objects and also to provide said identifiers in lieu of actual VkShaderModule objects when creating a pipeline. This apparently simple change has real-world impact when downloading and playing games on a Steam Deck, for example.
You see, DX12 games typically ship their shaders in an intermediate assembly-like representation called DXIL. This is the equivalent of the assembly-like SPIR-V language used with Vulkan. But when implementing DX12 on top of Vulkan, Proton has to translate this DXIL to SPIR-V before passing the shader down to Vulkan, and this translation takes some time, which may result in stuttering that would not be present when the game runs natively on Windows.
Ideally, we would bypass this cost by shipping a Proton translation cache with the game when you download it on Linux. This cache would allow us to hash the DXIL module and use the resulting hash as an index into a database to find a pre-translated SPIR-V module, which can be super-fast. Hooray, no more stuttering from that! You may still get stuttering when the Vulkan driver has to compile the SPIR-V module to native GPU instructions, just like the DX12 driver would when translating DXIL to native instructions, if the game does not, or cannot, pre-compile shaders somehow. Yet there’s a second workaround for that.
If you’re playing on a known platform with known hardware and drivers (think Steam Deck), you can also ship a shader cache for that particular driver and hardware. Mesa drivers already have shader caches, so shipping a RADV cache with the game makes total sense and we would avoid stuttering once more, because the driver can hash the SPIR-V module and use the resulting hash to find the native GPU module. Again, this can be super-fast so it’s fantastic! But now we have a problem, you see? We are shipping a cache that translates DXIL hashes to SPIR-V modules, and a driver cache that translates SPIR-V hashes to native modules. And both are big. Quite big for some games. And what do we want the SPIR-V modules for? For the driver to calculate their hashes and find the native module? Wouldn’t it be much more efficient if we could pass the SPIR-V hash directly to the driver instead of the actual module? That way, the database translating DXIL hashes to SPIR-V modules could be replaced with a database that translates DXIL hashes to SPIR-V hashes. This can save space in the order of gigabytes for some games, and this is precisely what this extension allows. Enjoy your extra disk space on the Deck and thank Hans-Kristian for it! We reviewed the spec, contributed to it, and created tests for this one.
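For the curious, the flow looks roughly like the following sketch (my own, simplified; `device`, `module` and the loaded entry points are assumed): query the identifier once, store it next to the DXIL hash, and later create pipelines from the identifier alone.

```c
/* Sketch for VK_EXT_shader_module_identifier (extension entry points are
 * normally loaded via vkGetDeviceProcAddr in a real application). */
VkShaderModuleIdentifierEXT id = {
    .sType = VK_STRUCTURE_TYPE_SHADER_MODULE_IDENTIFIER_EXT,
};
vkGetShaderModuleIdentifierEXT(device, module, &id);
/* Cache id.identifier / id.identifierSize keyed by the DXIL hash. */

/* Later, build a pipeline stage from the identifier alone, no SPIR-V needed: */
VkPipelineShaderStageModuleIdentifierCreateInfoEXT id_info = {
    .sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_MODULE_IDENTIFIER_CREATE_INFO_EXT,
    .identifierSize = id.identifierSize,
    .pIdentifier = id.identifier,
};
VkPipelineShaderStageCreateInfo stage = {
    .sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO,
    .pNext = &id_info,
    .stage = VK_SHADER_STAGE_FRAGMENT_BIT,
    .module = VK_NULL_HANDLE,                     /* no actual shader module */
    .pName = "main",
};
/* The pipeline is created with
 * VK_PIPELINE_CREATE_FAIL_ON_PIPELINE_COMPILE_REQUIRED_BIT: on a driver cache
 * miss, creation returns VK_PIPELINE_COMPILE_REQUIRED and the application
 * falls back to supplying the real SPIR-V module. */
```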
This one was written by Joshua Ashton and allows applications to put images in the special VK_IMAGE_LAYOUT_ATTACHMENT_FEEDBACK_LOOP_OPTIMAL_EXT layout, in which they can both be used to render to and to sample from at the same time.
It’s used by DXVK 2.0+ to more efficiently support D3D9 games that read from active render targets.
We reviewed, created tests and contributed to this one.
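As a rough sketch of my own (not taken from DXVK), using the layout mostly amounts to creating the image with the feedback loop usage bit and transitioning it into the new layout before rendering; `cmd` and `image` are assumed to exist.

```c
/* Sketch for VK_EXT_attachment_feedback_loop_layout. The image is assumed to
 * have been created with VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT |
 * VK_IMAGE_USAGE_SAMPLED_BIT | VK_IMAGE_USAGE_ATTACHMENT_FEEDBACK_LOOP_BIT_EXT. */
VkImageMemoryBarrier barrier = {
    .sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER,
    .srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT,
    .dstAccessMask = VK_ACCESS_SHADER_READ_BIT |
                     VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT,
    .oldLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL,
    .newLayout = VK_IMAGE_LAYOUT_ATTACHMENT_FEEDBACK_LOOP_OPTIMAL_EXT,
    .srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
    .dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
    .image = image,
    .subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 },
};
vkCmdPipelineBarrier(cmd,
                     VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
                     VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT |
                     VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
                     0, 0, NULL, 0, NULL, 1, &barrier);
/* The image can now be bound as a color attachment and sampled in the same draw. */
```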
I don’t need to tell you more about this one. You saw the Khronos blog post. You watched my XDC 2022 talk (and Timur’s). You read my slides. You attended the Vulkan Webinar. Important to actually have mesh shaders on Vulkan like you have in DX12, so emulation of DX12 on Vulkan was a top goal. We contributed to the spec and created tests.
“Hey, hey, hey!” I hear you protest. “This is just a rename of the VK_VALVE_mutable_descriptor_type extension which was released at the end of 2020.” Right, but I didn’t get to tell you about it then, so bear with me for a moment. This extension was created by Hans-Kristian Arntzen and Joshua Ashton and it helps efficiently emulate the raw D3D12 binding model on top of Vulkan. For that, it allows you to have descriptors with a type that is only known at runtime, and also to have descriptor pools and sets that reside only in host memory. We had reviewed the spec and created tests for the Valve version of the extension. Those same tests are the VK_EXT_mutable_descriptor_type tests today.
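A small sketch of what this looks like in practice (again my own, assuming a VkDevice `device`): one binding of type MUTABLE whose concrete type is chosen from a per-binding list only when the descriptor is actually written.

```c
/* Sketch for VK_EXT_mutable_descriptor_type: a layout with a single binding
 * whose descriptor type is decided at runtime. */
VkDescriptorType candidates[] = {
    VK_DESCRIPTOR_TYPE_SAMPLED_IMAGE,
    VK_DESCRIPTOR_TYPE_STORAGE_IMAGE,
    VK_DESCRIPTOR_TYPE_UNIFORM_TEXEL_BUFFER,
};
VkMutableDescriptorTypeListEXT list = {
    .descriptorTypeCount = 3,
    .pDescriptorTypes = candidates,               /* types this binding may hold */
};
VkMutableDescriptorTypeCreateInfoEXT mutable_info = {
    .sType = VK_STRUCTURE_TYPE_MUTABLE_DESCRIPTOR_TYPE_CREATE_INFO_EXT,
    .mutableDescriptorTypeListCount = 1,
    .pMutableDescriptorTypeLists = &list,
};
VkDescriptorSetLayoutBinding binding = {
    .binding = 0,
    .descriptorType = VK_DESCRIPTOR_TYPE_MUTABLE_EXT,   /* type known at runtime */
    .descriptorCount = 1,
    .stageFlags = VK_SHADER_STAGE_ALL,
};
VkDescriptorSetLayoutCreateInfo layout_info = {
    .sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO,
    .pNext = &mutable_info,
    .bindingCount = 1,
    .pBindings = &binding,
};
VkDescriptorSetLayout layout;
vkCreateDescriptorSetLayout(device, &layout_info, NULL, &layout);
```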
The final boss of dynamic state, which helps you reduce the number of pipeline objects in your application as much as possible. Combine some of the old and new dynamic states with graphics pipeline libraries and you may enjoy stutter-free gaming. Guaranteed!1 This one will be used by native apps, translation layers (including Zink) and you-name-it. We developed tests and reviewed the spec for it.
1Disclaimer: not actually guaranteed.
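In case the idea is new to you, here is a tiny sketch of my own of the record-time side: the pipeline lists states such as VK_DYNAMIC_STATE_POLYGON_MODE_EXT as dynamic, and the values are set on the command buffer instead of being baked into a separate pipeline per combination (entry points loaded via vkGetDeviceProcAddr as usual; `cmd` is a command buffer in the recording state).

```c
/* Sketch for VK_EXT_extended_dynamic_state3: set formerly static state
 * dynamically while recording. */
vkCmdSetPolygonModeEXT(cmd, VK_POLYGON_MODE_FILL);
vkCmdSetRasterizationSamplesEXT(cmd, VK_SAMPLE_COUNT_4_BIT);

VkBool32 blend_enable = VK_TRUE;
vkCmdSetColorBlendEnableEXT(cmd, 0, 1, &blend_enable);   /* attachment 0 */

VkColorComponentFlags write_mask =
    VK_COLOR_COMPONENT_R_BIT | VK_COLOR_COMPONENT_G_BIT |
    VK_COLOR_COMPONENT_B_BIT | VK_COLOR_COMPONENT_A_BIT;
vkCmdSetColorWriteMaskEXT(cmd, 0, 1, &write_mask);
```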
It has been a busy couple of months. As I pointed out in my last blog post, I finished GSoC and joined the Igalia Coding Experience mentorship project. In October, I also traveled to Minneapolis for XDC 2022, where my colleagues and I presented our AMD/KUnit work to the Linux graphics community. So, let’s make a summary of the last couple of months.
Just a small thank you note to X.Org Foundation for sponsoring my travel to Minneapolis. XDC 2022 was a great experience, and I learned quite a lot during the talks. Although I was a newcomer, all developers were very nice to me, and it was great to talk to experienced developers (and meet my mentors in person). Also, I presented the GSoC/XOrg work on the first day of the conference and this talk is available on YouTube.
As I mentioned in my last blog post, GSoC was a great learning experience and I’m willing to keep learning about the Linux graphics stack. Fortunately, when I started the Igalia CE, Melissa Wen pitched me a project to increase IGT test coverage on DRM/V3D kernel driver. I was pretty glad to hear about the project as it allowed me to learn more about how a GPU works.
Currently, V3D only has three basic IGT tests: v3d_get_bo_offset, v3d_get_param, and v3d_mmap. So, the basic goal of my CE project was to add more tests to the V3D driver.
As the general DRM-core tests were in a good shape on the V3D driver, I started to think together with my mentors about more driver-specific tests for the driver.
By checking the V3D UAPI, you can see that the V3D has eleven ioctls, so there is still a lot to test for the V3D on IGT.
First, there are the Buffer Object (BO) related ioctls: v3d_create_bo, v3d_wait_bo, v3d_mmap_bo, and v3d_get_bo_offset. Buffer Objects are shared-memory objects that are allocated by the GPU to store things like vertex data. Therefore, testing them is important to make sure that memory is being correctly allocated. Unlike the VC4, the V3D has an MMU between the GPU and the bus, which means objects do not need to be allocated contiguously. So, the idea was to develop tests for v3d_create_bo and v3d_wait_bo.
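For reference, here is a stripped-down sketch of what such a test ends up driving (my own approximation, not the actual IGT code; `fd` is an open V3D render node and the include path may differ on your system):

```c
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <drm/v3d_drm.h>

/* Sketch: create a BO and then wait on it. */
struct drm_v3d_create_bo create;
memset(&create, 0, sizeof(create));
create.size = 4096;                       /* one page */
ioctl(fd, DRM_IOCTL_V3D_CREATE_BO, &create);
/* create.handle is the GEM handle, create.offset is the BO's offset in the
 * V3D address space (possible thanks to the MMU mentioned above). */

struct drm_v3d_wait_bo wait;
memset(&wait, 0, sizeof(wait));
wait.handle = create.handle;
wait.timeout_ns = 1000000000ull;          /* wait up to 1s for jobs using the BO */
ioctl(fd, DRM_IOCTL_V3D_WAIT_BO, &wait);
```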
Next, there are the Performance Monitor (perfmon) related ioctls: v3d_perfmon_create, v3d_perfmon_destroy, and v3d_perfmon_get_values. Performance Monitors are basically registers that are used for monitoring the performance of the V3D engine. So, tests were designed to ensure that the driver was creating perfmons properly and was resilient to incorrect requests, such as trying to get a value from a non-existent perfmon.
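A similar sketch for the perfmon ioctls, reusing the `fd` and headers from the snippet above; the field names are as I remember them from the V3D UAPI header, so treat them as assumptions rather than gospel:

```c
/* Sketch: create a perfmon with a single counter, read it back, destroy it. */
struct drm_v3d_perfmon_create create = {
    .ncounters = 1,
    .counters = { 0 },            /* counter index from the V3D_PERFCNT_* enum */
};
ioctl(fd, DRM_IOCTL_V3D_PERFMON_CREATE, &create);

uint64_t value = 0;
struct drm_v3d_perfmon_get_values get = {
    .id = create.id,              /* perfmon id returned by the create ioctl */
    .values_ptr = (uintptr_t)&value,
};
ioctl(fd, DRM_IOCTL_V3D_PERFMON_GET_VALUES, &get);

struct drm_v3d_perfmon_destroy destroy = { .id = create.id };
ioctl(fd, DRM_IOCTL_V3D_PERFMON_DESTROY, &destroy);
```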
And finally, the most interesting type of ioctls: the job submission ioctls. You can use the v3d_submit_cl ioctl to submit commands to the 3D engine, which is a tiled engine. When I think about tiled rendering, I always think about a Super Nintendo, but things can get a bit more complicated than on a SNES, as you can see here. The 3D engine is composed of bin and render pipelines, each with its own command list. The binning step maps each tile to a piece of the frame, and the rendering step renders the tile based on that mapping.
By testing the v3d_submit_cl ioctl, it is possible to test syncing between jobs and also the V3D multisync ability.
Moreover, the V3D also has a TFU (texture formatting unit) and a CSD (compute shader dispatch), each with its own ioctl: v3d_submit_tfu and v3d_submit_csd. The TFU performs format conversions and generates mipmaps, and the CSD is responsible for dispatching a compute shader.
So, the idea is to write tests for all those V3D functionalities, increasing the testability of the V3D on IGT. Although things are not yet fully done, I’ve been enjoying exploring the V3D, IGT, and Mesa. After this experience with Mesa and also XDC, I became more and more interested in Mesa.
In order to test the v3d_submit_cl ioctl, it was needed to design a job to be submitted. So, Melissa suggested using Mesa’s noop job specification on IGT to perform the tests. The idea was quite simple: submit a noop job and create tests based on it. But, it was not that simple after all…
First, I must say that I’m mostly a kernel developer, so I was not familiar with Mesa. Maybe it was not that hard to figure out, but it took me a while to understand Mesa’s packets and how to submit them.
The main problem I faced in submitting a noop job on IGT was that I would have to copy many, many Mesa files to IGT. I spent a while fighting against this idea, looking for other ways to submit a job to the V3D, but as some experienced developers pointed out, using the packets is the best option.
After some time, I was able to bring the Mesa structures into IGT with minimal (although not that minimal) overhead. But I’m still not able to run a successful noop job, as the job’s fence is not being signaled by the end of the job.
Although my noop job has not landed yet, so far, I was able to submit two series to IGT: one for the V3D driver and the other for the VC4 driver.
Apart from cleanups in the drivers, I added tests for the v3d_create_bo ioctl and the V3D’s and VC4’s perfmon ioctls. Moreover, as I was running the VC4 tests on the Raspberry Pi 4, I realized that most of the VC4 tests were failing on the V3D, considering the VC4 doesn’t have rendering abilities on the Raspberry Pi 4. So, I also created checks to ensure that the VC4 tests are not run on the V3D.
Those series are still being reviewed, but I hope to get them merged soon.
My biggest priority now is to run a noop job on IGT, and for that I’m currently running the CTS tests on the Raspberry Pi 4 in order to reproduce a noop job and understand why my current job results in a hang. I added a couple of debug logs (aka printf) to Mesa and now I can see the contents of the BOs and the parameters of the submission. So, I hope to get a fully working noop job soon.
After I get a fully working noop job, I will finish the v3d_wait_bo tests, as those only make sense if I submit a job and then wait on a BO, and I will design the v3d_submit_cl tests as well. For the latter, I especially hope to test the syncing functionalities of the V3D.
Moreover, I hope to write soon a piece about cross-compiling CTS for the Raspberry Pi 4, which was a fun digression on this CE project.
We’re excited to announce our first Apple GPU driver release!
We’ve been working hard over the past two years to bring this new driver to everyone, and we’re really proud to finally be here. This is still an alpha driver, but it’s already good enough to run a smooth desktop experience and some games.
Read on to find out more about the state of things today, how to install it (it’s an opt-in package), and how to report bugs!
This release features work-in-progress OpenGL 2.1 and OpenGL ES 2.0 support for all current Apple M-series systems. That’s enough for hardware acceleration with desktop environments, like GNOME and KDE. It’s also enough for older 3D games, like Quake3 and Neverball. While there’s always room for improvement, the driver is fast enough to run all of the above at 60 frames per second at 4K.
Please note: these drivers have not yet passed the OpenGL (ES) conformance tests. There will be bugs!
What’s next? Supporting more applications. While OpenGL (ES) 2 suffices for some applications, newer ones (especially games) demand more OpenGL features. OpenGL (ES) 3 brings with it a slew of new features, like multiple render targets, multisampling, and transform feedback. Work on these features is well under way, but they will each take a great deal of additional development effort, and all are needed before OpenGL (ES) 3.0 is available.
What about Vulkan? We’re working on it! Although we’re only shipping OpenGL right now, we’re designing with Vulkan in mind. Most of the work we’re putting toward OpenGL will be reused for Vulkan. We estimated that we could ship working OpenGL 2 drivers much sooner than a working Vulkan 1.0 driver, and we wanted to get hardware accelerated desktops into your hands as soon as possible. For the most part, those desktops use OpenGL, so supporting OpenGL first made more sense to us than diving into the Vulkan deep end, only to use Zink to translate OpenGL 2 to Vulkan to run desktops. Plus, there is a large spectrum of OpenGL support, with OpenGL 2.1 containing a fraction of the features of OpenGL 4.6. The same is true for Vulkan: the baseline Vulkan 1.0 profile is roughly equivalent to OpenGL ES 3.1, but applications these days want Vulkan 1.3 with tons of extensions and “optional” features. Zink’s “layering” of OpenGL on top of Vulkan isn’t magic: it can only expose the OpenGL features that the underlying Vulkan driver has. A baseline Vulkan 1.0 driver isn’t even enough to get OpenGL 2.1 on Zink! Zink itself advertises support for OpenGL 4.6, but of course that’s only when paired with Vulkan drivers that support the equivalent of OpenGL 4.6… and that gets us back to a tremendous amount of time and effort.
When will OpenGL 3 support be ready? OpenGL 4? Vulkan 1.0? Vulkan 1.3? In community open source projects, it’s said that every time somebody asks when a feature will be done, it delays that feature by a month. Well, a lot of people have been asking…
At any rate, for a sneak peek… here is SuperTuxKart’s deferred renderer running at full speed, making liberal use of OpenGL ES 3 features like multiple render targets~
Modern GPUs consist of many distinct “layered” parts. There is…
This “layered” hardware demands a “layered” graphics driver stack. We need…
That’s a lot of work, calling for a team effort! Fortunately, that layering gives us natural boundaries to divide work among our small team.
Meanwhile, Ella Stanforth is working on a Vulkan driver, reusing the kernel driver, the compiler, and some code shared with the OpenGL driver.
Of course, we couldn’t build an OpenGL driver in under two years just ourselves. Thanks to the power of free and open source software, we stand on the shoulders of FOSS giants. The compiler implements a “NIR” backend, where NIR is a powerful intermediate representation, including GLSL to NIR translation. The kernel driver uses the “Direct Rendering Manager” (DRM) subsystem of the Linux kernel to minimize boilerplate. Finally, the OpenGL driver implements the “Gallium3D” API inside of Mesa, the home for open source OpenGL and Vulkan drivers. Through Mesa and Gallium3D, we benefit from thirty years of OpenGL driver development, with common code translating OpenGL into the much simpler Gallium3D. Thanks to the incredible engineering of NIR, Mesa, and Gallium3D, our ragtag team of reverse-engineers can focus on what’s left: the Apple hardware.
To get the new drivers, you need to run the linux-asahi-edge kernel and also install the mesa-asahi-edge Mesa package.
$ sudo pacman -Syu
$ sudo pacman -S linux-asahi-edge mesa-asahi-edge
$ sudo update-grub
Since only one version of Mesa can be installed at a time, pacman will prompt you to replace mesa with mesa-asahi-edge. This is normal!
We also recommend running Wayland instead of Xorg at this point, so if you’re using the KDE Plasma environment, make sure to install the Wayland session:
$ sudo pacman -S plasma-wayland-session
Then reboot, pick the Wayland session at the top of the login screen (SDDM), and enjoy! You might want to adjust the screen scale factor in System Settings → Display and Monitor (Plasma Wayland defaults to 100% or 200%, while 150% is often nicer). If you have “Force font DPI” enabled under Appearance → Fonts, you should disable that (it is saved separately for Wayland and Xorg, and shouldn’t be necessary on Wayland sessions). Log out and back in for these changes to fully apply.
Xorg and Xorg-based desktop environments should work, but there are a few known issues:
The linux-asahi-edge kernel can be installed side-by-side with the standard linux-asahi package, but both versions should be kept in sync, so make sure to always update your packages together! You can always pick the linux-asahi kernel in the GRUB boot menu, which will disable GPU acceleration and the DCP display driver.
When the packages are updated in the future, it’s possible that graphical apps will stop starting up after an update until you reboot, or they may fall back to software rendering. This is normal. Until the UAPI is stable, we’ll have to break compatibility between Mesa and the kernel every now and then, so you will need to reboot to make things work after updates. In general, if apps do keep working with acceleration after any particular Mesa update, then it’s probably safe not to reboot, but you should still do it to make sure you’re running the latest kernel!
Since the driver is still in development, there are lots of known issues and we’re still working hard on improving conformance test results. Please don’t open new bugs for random apps not working! It’s still the early days and we know there’s a lot of work to do. Here’s a quick guide of how to report bugs:
- If a specific app doesn’t work properly, you can set the LIBGL_ALWAYS_SOFTWARE=1 environment variable for those apps to fall back to software rendering. If it is a popular app that is part of the Arch Linux ARM repository, you can make a comment on this issue instead, so we can add Mesa quirks to work around it.
- If you run into problems with linux-asahi-edge unrelated to the GPU, please add a comment to this issue. This includes display output issues! (Resolutions, backlight control, display power control, etc.)
- If the GPU locks up, run asahi-diagnose (for example, from an SSH session), open a new bug on the AsahiLinux/linux repository, attach the file generated by that command, and tell us what you were doing that caused the lockup.
- For any other GPU or rendering issue, run asahi-diagnose and make a comment on this issue, attaching the file generated by that command. Don’t forget to tell us about your environment!

We hope you enjoy our driver! Remember, things are still moving quickly, so make sure to update your packages regularly to get updates and bug fixes!
Co-written with Asahi Lina. Can you tell who wrote what?