planet.freedesktop.org
December 08, 2023

If you’ve been paying attention to the evolution of the Linux gaming ecosystem in recent years, including the release of the Steam Deck and the new Steam Deck OLED, it’s likely your initial reaction to the blog post title is a simple “OK”. However, I’m coming from a very particular place so I wanted to explain my point of view and the significance of this, and hopefully you’ll find the story interesting.

Figure 1. Steam running on Fedora Linux 39

As background, let me say I’ve always gamed on Windows when using my PC. If you think I’m an idiot for doing so lately, especially because my work at Igalia involves frequently interacting with Valve contractors like Samuel Pitoiset, Timur Kristóf, Mike Blumenkrantz or Hans-Kristian Arntzen, you’d be more than right. But hear me out. I’ve always gamed on Windows because it’s the safe bet. With a couple of small kids at home and very limited free time, when I game everything has to just work. No fiddling around with software, config files, or wasting time setting up the software stack. I’m supposed to boot Windows when I want to play, play, and then turn my computer off. The experience needs to be as close to a console as possible. And, for anything non-gaming, which is most of it, I’d be using my Linux system.

In recent years, thanks to the work done by Valve, the Linux gaming stack has improved a lot. Despite this, I’ve kept gaming on Windows for a variety of reasons:

  1. For a long time, my Linux disk only had a capacity of 128GB, so installing games was not a real possibility due to the amount of disk space they need.

  2. Also, I was running Slackware and installing Steam and getting the whole thing running implied a fair amount of fiddling I didn’t even want to think about.

  3. Later, when I was running Fedora on a larger disk, I had kids and didn’t want to take any risks or possibly waste time on it.

So, what changed?

Figure 2. Sapphire Pulse AMD Radeon RX 6700 box

Earlier this year I upgraded my PC and replaced an old Intel Haswell i7-4770k with a Ryzen R5 7600X, and my GPU changed from an NVIDIA GTX 1070 to a Radeon RX 6700. The jump in CPU power was much bigger and more impressive than the more modest jump in GPU power. But talking about that and the sorry state of the GPU market is a story for another blog post. In any case, I had put up with the NVIDIA proprietary driver for many years and I think, on Windows and for gaming, NVIDIA is the obvious first choice for many people, including me. Dealing with the proprietary blob under Linux was not particularly problematic, especially with the excellent way it’s handled by RPMFusion on Fedora, where essentially you only have to install a few packages and you can mostly forget about it.

However, given my recent professional background I decided to go with an AMD card for the first time. I wanted to use a fully open source graphics stack and I didn’t want to think about making compromises in Wayland support or other fronts whatsoever. Plus, at the time I upgraded my PC, the timing was almost perfect for me to switch to an AMD card, because:

  1. AMD cards were, in general, performing better than NVIDIA cards for the same price, except in ray tracing.

  2. The RX 6700 non-XT was on sale.

  3. It offered roughly the same performance as a PS5.

  4. It didn’t draw a ton of power like many recent high-end GPUs (175W, similar to the 1070 and its 150W TDP).

After the system upgrade, I did notice a few more stability problems when gaming under Windows, compared to what I was used to with an NVIDIA card. You can find thousands of opinions, comments and anecdotes on the Internet about the quality of AMD drivers, and a lot of people say they’re a couple of steps below NVIDIA drivers. It’s not my intention at all to pile on, but it’s true that my own personal experience has been one of generally more crashes in games and more weird situations since I switched to AMD. Normally it doesn’t get to the point of being annoying, but sometimes it’s a bit surprising, and I could definitely notice that increase in instability without, I believe, any bias on my side. Which takes us to Far Cry 6.

A few days ago I finished playing Doom Eternal and its expansions (really nice game, by the way!) and I decided to go with Far Cry 6 next. I’m slowly working my way up to some more graphically demanding games that I didn’t feel comfortable playing on the 1070. I went ahead and installed the game on Windows. Being a big 70GB download (100GB on disk), that took a bit of time. Then I launched it, adjusted the keyboard and mouse settings to my liking and went to the video options menu. The game had chosen the high preset for me and everything looked good, so I attempted to run the in-game benchmark to see if the game performed well with that preset (I love it when games have built-in benchmarks!). After a few seconds in a loading screen, the game crashed and I was back to the desktop. “Oh, what a bad way to start!”, I thought, without knowing what lay ahead. I launched the game again: same thing.

Over the course of the two hours that followed, I tried everything:

  1. Launching the main game instead of the benchmark, just in case the bug only happened in the benchmark. Nope.

  2. Lowering quality and resolution.

  3. Disabling any advanced setting.

  4. Trying windowed mode, or borderless full screen.

  5. Vsync off or on.

  6. Disabling the overlays for Ubisoft, Steam, AMD.

  7. Rebooting multiple times.

  8. Uninstalling the drivers normally as well as using DDU and installing them again.

Same result every time. I also searched on the web for people having similar problems, but got no relevant search results anywhere. Yes, a lot of people both using AMD and NVIDIA had gotten crashes somewhere in the game under different circumstances, but nobody mentioned specifically being unable to reach any gameplay at all. That day I went to bed tired and a bit annoyed. I was also close to having run the game for 2 hours according to Steam, which is the limit for refunds if I recall correctly. I didn’t want to refund the game, though, I wanted to play it.

The next day I was ready to uninstall it and move on to another title in my list but, out of pure curiosity, given that I had already spent a good amount of time trying to make it run, I searched for it on the Proton compatibility database to see if it could be run on Linux, and it seemed to be possible. The game appeared to be well supported and it was verified to run on the Deck, which was good because both the Deck and my system have an RDNA2 GPU. In my head I wasn’t fully convinced this could work, because I didn’t know if the problem was in the game (maybe a bug with recent updates) or the drivers or anywhere else (like a hardware problem).

And this was, for me, when the fun started. I installed Steam on Linux from the Gnome Software app. For those who don’t know it, it’s like an app store for Gnome that acts as a frontend to the package manager.

Figure 3. Gnome Software showing Steam as an installed application

Steam showed up there with 3 possible sources: Flathub, an “rpmfusion-nonfree-steam” repo and the more typical “rpmfusion-nonfree” repo. I went with the last option and soon I had Steam in my list of apps. I launched that and authenticated using the Steam mobile app QR code scanning function for logging in (which is a really cool way to log in, by the way, without needing to recall your username and password).

My list of installed games was empty and I couldn’t find a way to install Far Cry 6 because it was not available for Linux. However, I thought there should be an easy way to install it and launch it using the famous Proton compatibility layer, and a quick web search revealed I only had to right-click on the game title, select Properties and choose to “Force the use of a specific Steam Play compatibility tool” under the Compatibility section. Click-click-click and, sure, the game was ready to install. I let it download again and launched it.

Context menu shown after right-clicking on Far Cry 6 on the Steam application, with the Properties option highlighted
Far Cry 6 Compatibility tab displaying the option to force the use of a specific Steam Play compatibility tool

Some stuff pops up about processing or downloading Vulkan shaders and I see it doing some work. In that first launch, the game takes more time to start compared to what I had seen under Windows, but it ends up launching (and subsequent launches were noticeably faster). That includes some Ubisoft Connect stuff showing up before the game starts and so on. Intro videos play normally and I reach the game menu in full screen. No indication that I was running it on Linux whatsoever. I go directly to the video options menu, see that the game again selected the high preset, I turn off VSync and launch the benchmark. Sincerely, honestly, completely and totally expecting it to crash one more time and that would’ve been OK, pointing to a game bug. But no, for the first time in two days this is what I get:

Figure 4. Far Cry 6 benchmark screenshot displaying the game running at over 100 frames per second

The benchmark runs perfectly, no graphical glitches, no stuttering, frame rates above 100FPS normally, and I had a genuinely happy and surprised grin on my face. I laughed out loud and my wife asked what was so funny. Effortless. No command lines, no config files, nothing.

As of today, I’ve played the game for over 30 hours and the game has crashed exactly once out of the blue. And I think it was an unfortunate game bug. The rest of the time it’s been running as smooth and as perfect as the first time I ran the benchmark. Framerate is completely fine and way over the 0 frames per second I got on Windows because it wouldn’t run. The only problem seems to be that when I finish playing and exit to the desktop, Steam is unable to stop the game completely for some reason (I don’t know the cause) and it shows up as still running. I usually click on the Stop button in the Steam interface after a few seconds, it stops the game and that’s it. No problem synchronizing game saves to the cloud or anything. Just that small bug that, again, only requires a single extra click.

Then I remembered something that had happened a few months before, prior to starting Doom Eternal under Windows. I had tried to play Deathloop first, another game in my backlog. However, the game crashed every few minutes and an error window popped up. The amount and timing of the crashes didn’t look constant, and lowering the graphics settings would sometimes allow me to play a bit longer, but in any case I wasn’t able to finish the game’s intro level without crashes and being very annoyed. Searching for the error message on the web, I saw it looked like a game problem that was apparently affecting not only AMD users but also NVIDIA ones, so I had mentally classified it as a game bug and, similarly to the Far Cry 6 case, I had given up on running the game without refunding it, hoping to be able to play it in the future.

Now I was wondering if it was really a game bug and, even if it was, if maybe Proton could have a workaround for it and maybe it could be played on Linux. Again, ProtonDB showed the game to be verified on the Deck with encouraging recent reports. So I installed Deathloop on Linux, launched it just once and played for 20 minutes or so. No crashes and I got as far as I had gotten on Windows in the intro level. Again, no graphical glitches that I could see, smooth framerates, etc. Maybe it was a coincidence and I was lucky, but I think I will be able to play the game without issues when I’m done with Far Cry 6.

In conclusion, this story is another data point that tells us the quality of Proton as a product and software compatibility layer is outstanding. In combination with high quality open source Mesa drivers like RADV, I’m amazed the experience can actually be better than gaming natively on Windows. Think about that: the Windows game binary running natively on an official DX12 or Vulkan driver crashes more and doesn’t work as well as the same game running on top of a Windows compatibility layer with a graphics API translation layer, on top of a different OS kernel and a different Vulkan driver. Definitely amazing to me, and it speaks wonders of the work Valve has been doing on Linux. Or it could also speak badly of the AMD Windows drivers, or both.

Sure, some new games on launch have more compatibility issues, bugs that need fixing, maybe workarounds applied in Proton, etc. But even in those cases, if you have a bit of patience, play the game some months down the line and check ProtonDB first (ideally before buying the game), you may be in for a great experience. You don’t need to be an expert either. Not to mention that some of these details are even better and smoother if you use a Steam Deck as compared to an (officially) unsupported Linux distribution like I do.

December 06, 2023

During these last two weeks I have been working towards adding support for more operations and kinds of convolutions so we can run more interesting models. As a first target, I'm aiming at MobileDet, which, though a bit old by now (it was introduced in 2020), is still the state of the art in object detection on mobile, used for example in Frigate NVR.

I haven't mentioned it in a few updates, but all this work keeps being sponsored by Libre Computer, who are aiming to be the first manufacturer of single board computers to provide accelerated machine learning with open source components. Check out Alta and Solitude for the first such boards in the market.

Upstreaming

Igalia's Christian Gmeiner has been giving me great feedback on the merge request, and as part of that I submitted a patch to the kernel to retrieve some parameters that are needed when programming the hardware and that are best not left hardcoded.

This means that upstreaming to Mesa loses some urgency, as we are anyway going to have to wait until the merge window for 6.8 opens, after 6.7 final is out.

Convolutions with 5x5 weights

Until now I had implemented support only for weights with dimensions 1x1 (aka pointwise convolutions) and 3x3 (the most common by far). Some of the convolutions in MobileDet use 5x5 weight tensors though, so I had to implement support for them. It was a matter of adding some extra complexity to the code that compresses the weight tensors into the format that the hardware expects.

I implemented this for all kinds of supported convolutions: depthwise, strided, with padding, etc.

Tensor addition

I observed that the vendor blob implements addition operations with convolution jobs, so I looked deeper and saw that it was implementing the addition of two input tensors by placing them as the two channels of a single tensor, then passing them through a 1x1 convolution with a specially crafted weight tensor and bias vector.

This is working with hardcoded values for some specific input image dimensions, but I still need to gather more data so I can come up with a generic expression.
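
To illustrate the idea (this is only the idealized, un-quantized case; the hardcoded values mentioned above presumably come from quantization details such as scales and zero points, which are not reproduced here), a 1x1 convolution over the two stacked channels reduces to an elementwise sum:

  out(i, j) = w_1 * in_1(i, j) + w_2 * in_2(i, j) + bias
            = 1 * A(i, j) + 1 * B(i, j) + 0
            = A(i, j) + B(i, j)

where A and B are the two input tensors placed as channels in_1 and in_2, the weight vector is (1, 1) and the bias is 0.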

Softmax pooling

One more missing operation commonly used in models for mobile is pooling, in its different kinds: average, max, etc.

The blob implements these operations on the programmable core, with CL-like kernels.

So I dusted off the work that I did in the first half of 2023 and added code to Teflon for passing these operations to the Gallium drivers. Then I added a new kind of operation to the ML backend in Etnaviv to make use of the programmable core.

Things work fine, even if for now I am storing the kernel machine code in a blob inside the C code. The next step will be to implement the kernel in NIR and generate the machine code using the existing compiler in Etnaviv.

With this piece of work, we are now able to use all the hardware units in the NPU, and even if the programmable core in this configuration is really underpowered, it will allow us to keep the model in memory close to the NPU, instead of having to ping-pong between the NPU and CPU domains.

A new test suite

With new operations and kinds of convolutions being added, I was starting to have trouble testing all the possible combinations in a practical way, as the test suite that I had was taking more than 20 minutes for a full run.

To get around that, I reimplemented the tests in C++ with GoogleTest, which is supported by Emma Anholt's deqp-runner and will allow me to run the tests in parallel, making full use of the CPU cores in the board.

That made a big difference, but with so many testing combinations being added (+3000 as of now), it was still not fast enough for me. So I remembered an approach that we were considering to speed up execution of Vulkan and OpenGL conformance tests: caching the golden images that are used to compare and check that the output from the hardware is correct.

With that, the bottleneck is the network, as I store the cache in NFS, and I can run the full test suite in less than 3 minutes.
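
Conceptually the cache amounts to something like the following sketch (not the actual test suite code; the runner commands are hypothetical placeholders):

# Key the golden output by the test parameters and only regenerate on a miss.
key=$(echo "conv2d weights=5x5 stride=2 depthwise" | sha256sum | cut -d' ' -f1)
golden=/mnt/nfs/golden-cache/$key.bin
if [ ! -f "$golden" ]; then
    run_reference_model > "$golden"   # hypothetical CPU reference runner
fi
run_on_npu > output.bin               # hypothetical NPU runner
cmp -s output.bin "$golden" && echo PASS || echo FAIL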

Except that I started finding some tests that were failing randomly, especially when the cache of test results had already been brought into the filesystem cache on the board. After a lot of head scratching, I came to realize that the Etnaviv kernel driver was trying to submit up to 4 jobs at the same time to the hardware, if userspace was fast enough to enqueue that many jobs before the previous ones had finished.

There is a kernel module parameter to set the number of jobs that are submitted to the hardware at any given time, and setting that to 1 took me back to rock-solid test results, which is an absolute necessity for keeping the driver author's sanity.
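
For anyone wanting to reproduce this, module parameters like that one are typically pinned via modprobe.d; the parameter name below is a placeholder, not the real one, so check the module's parameter list first:

# List the parameters the etnaviv module actually exposes.
modinfo -p etnaviv
# Pin the value at module load time (PARAM_NAME is a placeholder).
echo "options etnaviv PARAM_NAME=1" | sudo tee /etc/modprobe.d/etnaviv.conf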

Next steps

I have quickly added support for a lot of new operations and parameter combinations and the code is not as clean as I would like, in part due to the need for some refactoring.

So in the coming days I will be investing some time in cleaning things up, and afterwards I will move on to more operations in MobileDet.


November 29, 2023

I have not been very active for a while with writing these Fedora Workstation updates, and part of the reason was that I felt I was beginning to repeat myself a lot, which I partly felt was a side effect of writing them so often. But with some time now since my last update, I felt the time was ripe again. So what are some of the things we have been working on, and what are our main targets going forward? This is not an exhaustive list, but hopefully it contains items you find interesting. Apologies for weird sentences and potential spelling mistakes: it ended up a long post, and when you read your own words over for the Nth time you start going blind to issues :)

PipeWire

PipeWire 1.0 is available! PipeWire keeps the Linux multimedia revolution rolling.
So let’s start with one of your favorite topics, PipeWire. As you probably know, PipeWire 1.0 is now out and I feel it is a project we definitely succeeded with, so big kudos to Wim Taymans for leading this effort. I think the fact that we got both the creator of JACK, Paul Davis, and the creator of PulseAudio, Lennart Poettering, to endorse it means our goal of unifying the Linux audio landscape is being met. I include their endorsement comments from the PipeWire 1.0 release announcement here:

“PipeWire represents the next evolution of audio handling for Linux, taking
the best of both pro-audio (JACK) and desktop audio servers (PulseAudio) and
linking them into a single, seamless, powerful new system.”
– Paul Davis, JACK and Ardour author

“PipeWire is a worthy successor to PulseAudio, providing a feature set
closer to how modern audio hardware works, and with a security model
with today’s application concepts in mind. Version 1.0 marks a
major milestone in completing the adoption of PipeWire in the standard
set of Linux subsystems. Congratulations to the team!”
– Lennart Poettering, Pulseaudio and systemd author

For new readers, PipeWire is an audio and video server we created for Fedora Workstation to replace PulseAudio for consumer audio and JACK for pro-audio, and to add similar functionality for video to your Linux operating system. So instead of having to deal with two different sound server architectures, users now just have to deal with one, and at the same time they get the same advantages for video handling. Since PipeWire implements both the PulseAudio API and the JACK API, it is a drop-in replacement for both of them without needing any changes to the audio applications built for those two sound servers. Wim Taymans, alongside the amazing community that has grown around the project, has been hard at work maturing PipeWire and adding any missing feature they could find that blocked anyone from moving to it from either PulseAudio or JACK. Wim’s personal focus recently has been on an IRQ-based ALSA driver for PipeWire, to be able to provide 100% performance parity with the old JACK server. So while a lot of pro-audio users felt that PipeWire’s latency was already good enough, this work by Wim shaves off the last few milliseconds to reach the same level of latency as JACK itself had.
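
One easy way to see that drop-in behavior on a PipeWire system is that the existing PulseAudio and JACK tooling keeps working unchanged; the output below is roughly what you would expect (pactl comes from the PulseAudio utilities, pw-jack from PipeWire’s JACK support, and the JACK client name is just a placeholder):

$ pactl info | grep '^Server Name'
Server Name: PulseAudio (on PipeWire 1.0.0)
$ pw-jack some_jack_application   # placeholder: run any existing JACK client against PipeWire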

In parallel with the work on PipeWire, the community, and especially Collabora, has been hard at work on the new 0.5 release of WirePlumber, the session manager which handles all policy issues for PipeWire. I know people often get a little confused about PipeWire vs WirePlumber, but think of it like this: PipeWire provides you the ability to output audio through a connected speaker, through a Bluetooth headset, through an HDMI connection and so on, but it doesn’t provide any ‘smarts’ for how that happens. The smarts are instead provided by WirePlumber, which contains policies to decide where to route your audio or video, either based on user choice or through preset policies making the right choices automatically, like moving the audio to your internal speaker if you disconnect your USB speaker. Anyway, WirePlumber 0.5 will be a major step forward, moving from Lua scripts for configuration to JSON for configuration while retaining Lua for scripting. This has many advantages, but I point you to this excellent blog post by Collabora’s Ashok Sidipotu for the details. Ashok has further details about WirePlumber 0.5 that you can find here.

With PipeWire 1.0 out the door I feel we are very close to reaching one of our initial goals with PipeWire: removing the need for custom pro-audio distributions like Fedora JAM or Ubuntu Studio, and instead letting audio folks use the same great Fedora Workstation as the rest of the world. With 1.0 done, Wim next plans to look a bit at things like the configuration tools used by pro-audio folks, and also dive more into the Flatpak portal needs of pro-audio applications, to ensure that Flatpaks + PipeWire is the future of pro-audio.

On the video handling side it’s been a little slow going, since applications need to be ported away from relying directly on V4L. Jan Grulich has been working with our friends at Mozilla and Google to get PipeWire camera handling support into Firefox and Google Chrome. At the moment it looks like the Firefox support will land first; in fact Jan has set up a COPR that lets you try it out here. For tracking the upstream work in WebRTC to add PipeWire support, Jan set up this tracker bug. Getting the web browsers to use PipeWire is important to enable the advanced video routing capabilities of PipeWire, but it will also give applications the ability to use libcamera, which is needed for modern MIPI cameras to work properly under Linux.

Another important application to get PipeWire camera support into is OBS Studio and the great thing is that community member Georges Stavracas is working on getting the PipeWire patches merged into OBS Studio, hopefully in time for their planned release early next year. You can track Georges work in this pull request.

For more information about PipeWire 1.0 I recommend our interview with Wim Taymans in Fedora Magazine and also the interview with Wim on Linux Unplugged podcast.

HDR
HDR, or High Dynamic Range, is another major effort for us. HDR is a technology I think many of you have become familiar with due to it becoming quite common in TVs these days. It basically provides for greatly increased color depth and luminance on your screen. This is a change that entails a lot of work through the stack, because when you introduce it into an existing ecosystem like the Linux desktop you have to figure out how to combine new HDR-capable applications and content with old non-HDR applications and content. Sebastian Wick, Jonas Ådahl, Olivier Fourdan, Michel Daenzer and more on the team have been working with other members of the ecosystem from Intel, AMD, NVIDIA, Collabora and more to pick and define the standards and protocols needed in this space. A lot of design work was done early in the year, so we have been quite focused on implementation work across the drivers, Wayland, Mesa, GStreamer, Mutter, GTK+ and more. Some of the more basic scenarios, like running a fullscreen HDR application, are close to being ready, while we are still working hard on getting all the needed pieces together for the more complex scenarios, like running SDR and HDR windows composited together on your desktop. So getting for instance full screen games to run in HDR mode with Steam should happen shortly, but the windowed support will probably land closer to summer next year.

Wayland remoting
One feature we have also been spending a lot of time on is enabling remote logins to a Wayland desktop. You have been able to share your screen under Wayland more or less from day one, but it required your desktop session to already be active. But let’s say you wanted to access a Wayland desktop running on a headless system: so far you have been out of luck and had to rely on an old X session instead. Putting in place all the pieces for this has been quite an undertaking, with work having been done on PipeWire, Wayland portals, the GNOME remote desktop daemon, libei (the new input emulation library), GDM and more. The pieces are finally falling into place and we expect to have everything needed landed in time for GNOME 46. This support is currently done using a private GNOME API, but a vendor-neutral API is being worked on to replace it.

As a sidenote, not directly related to desktop remoting: libei has also enabled us to bring XTEST support to XWayland, which was important for various applications including Valve’s gamescope.

NVIDIA drivers
One area we keep investing in is improving the state of NVIDIA support on Linux. This comes in the form of being the main company backing the continued development of the Nouveau graphics driver. The challenge with Nouveau is that for the longest while it offered next to no hardware acceleration for 3D graphics. The reason for this was that the firmware NVIDIA provided for Nouveau to use didn’t expose that functionality, and since recent generations of NVIDIA cards only work with firmware signed by NVIDIA, this left us stuck. So Nouveau was a good tool for doing an initial install of a system, but if you were doing any kind of serious 3D acceleration, including playing games, then you would need to install the NVIDIA binary driver. In the last year the landscape around that has changed drastically, with the release of the new out-of-tree open source driver from NVIDIA. Alongside that driver a new firmware has also been made available, one that does provide full support for hardware acceleration.
Let me quickly inject an explanation of out-of-tree versus in-tree drivers here. An in-tree driver is basically a kernel driver for a piece of hardware that has been merged into Linus Torvalds’ official Linux kernel and is thus maintained as part of the official Linux kernel releases. This ensures that the driver integrates well with the rest of the Linux kernel and that it gets updated in sync with the rest of the Linux kernel. So Nouveau is an in-tree kernel driver which also integrates with the rest of the open source graphics stack, like Mesa. The new NVIDIA open source driver is an out-of-tree driver which ships as a separate source code release on its own schedule, but of course NVIDIA works to keep it working with upstream kernel releases (which is a lot of work and thus considered a major downside of being an out-of-tree driver).

As of the time of writing this blog post, NVIDIA’s out-of-tree kernel driver and firmware are still a work in progress for display use cases, but that is changing with NVIDIA exposing more and more display features in the driver (and the firmware) with each new release. If you saw the original announcement of the new open source driver from NVIDIA and have been wondering why no distribution relies on it yet, this is why. So what does this mean for Nouveau? Well, our plan is to keep supporting Nouveau for the foreseeable future, because as an in-tree driver it is a lot easier to keep working with each new upstream kernel release.

At the same time, the new firmware updates allow Nouveau to eventually offer performance levels competitive with the official out-of-tree driver, similar to how the open source AMD driver with Mesa offers performance comparable to AMD’s binary GPU driver userspace. Nouveau maintainer Ben Skeggs spent the last year working hard on refactoring Nouveau to work with the new firmware, and we now have a new release of Nouveau out showing the fruits of that labor, enabling support for NVIDIA’s latest chipsets. Over time we will have it cover more chipsets and expand Vulkan and OpenGL (using Zink) support into a full-fledged accelerated graphics driver.
Some news here: Ben, after having worked tirelessly on keeping Nouveau afloat for so many years, decided he needed a change of pace and is leaving software development behind for the time being. A big thank you to Ben from all of us at Red Hat and Fedora! The good news is that Danilo Krummrich will take over as the development lead, with Lyude Paul focusing specifically on the display side of the driver. We also expect other members of the team to chip in. They will pick up Ben’s work and continue working with NVIDIA and the community on a bright future for Nouveau.

As I mentioned though, the new open source driver from NVIDIA is still being matured for the display use case, and until it works fully as a display driver, Nouveau won’t be able to be a full alternative either, since they share the same firmware. So people will need to rely on the binary NVIDIA driver for some time still. One thing we are looking at and discussing is whether there are ways for us to improve the experience of using that binary driver with Secure Boot enabled. At the moment that requires quite a bit of manual fiddling with tools like mokutil, but we have some ideas on how to streamline that a bit. It is a hard nut to crack due to a combination of policy issues, legal issues, security issues and hardware/UEFI bugs, so I am making no promises at this point, just a promise that it is something we are looking at.

Accessibility
Accessibility is an important feature for us in Fedora Workstation and thus we hired Lukáš Tyrychtr to focus on the issue. Lukáš has been working across the stack fixing issues blocking proper accessibility support in Fedora Workstation and has also participated in various accessibility-related events. There is still a lot to do there, so I was very happy to hear recently that the GNOME Foundation got a million-euro sponsorship from the Sovereign Tech Fund to improve various things across the stack, especially accessibility. The combination of Lukáš’ continued efforts and that new investment should make for a much improved accessibility experience in GNOME and in Fedora Workstation going forward.

GNOME Software
Another area that we keep investing in is improving GNOME Software, with Milan Crha working continuously on bugfixing and performance improvements. GNOME Software is actually a fairly complex piece of software, as it has to be able to handle the installation and updating of RPMs, OSTree system images, Flatpaks, fonts and firmware for us, in addition to the formats it handles for other distributions. For some time it felt like GNOME Software was struggling under the load of all those different formats and use cases, becoming slow and throwing a lot of error messages. Milan has been dealing with those issues one by one and also recently landed some major performance improvements, making the GNOME Software experience a lot better. One major change that Milan is working on, which I think we will be able to land in Fedora Workstation 40/41, is porting GNOME Software to use DNF5. The main improvement end users will probably notice is that it unifies the caches used by GNOME Software and by dnf on the command line, saving you storage space and also ensuring the two are fully in sync on which RPMs are installed or updated at any given time.

Fedora and Flatpaks

Flatpaks are another key element of our strategy for moving the Linux desktop forward, and as part of that we have now enabled all of Flathub to be available if you choose to enable 3rd-party repositories when you install Fedora Workstation. This means that the huge universe of applications available on Flathub will be easy to install through GNOME Software alongside the content available in Fedora’s own repositories. That said, we have also spent time improving the ease of making Fedora Flatpaks. Owen Taylor jumped in and removed the dependency on a technology called ‘modularity‘, which was initially introduced to Fedora to bring new features around having different types of content and to ease keeping containers up to date. Unfortunately it did not work out as intended, and instead it became something that everyone just felt made things a lot more complicated, including building Flatpaks from Fedora content. With Owen’s updates, building Flatpaks in Fedora has become a lot simpler and should help energize the effort of building Flatpaks in Fedora.

Toolbx
As we continue marching towards a vision of Fedora Workstation as a highly robust operating system, we keep evolving Toolbx, our tool for running your development environment(s) inside a container, which allows you to keep your host OS pristine and up to date while using specific toolchains and tools inside the development container. This is a hard requirement for immutable operating systems such as Fedora Silverblue or Universal Blue, but it is also useful on operating systems like Fedora Workstation as a way to do development for other platforms, like for instance Red Hat Enterprise Linux.

A major focus for Toolbx since its inception has been to get it to a stage where it is robust and reliable. For instance, while we prototyped it as a shell script, today it is written in Go to be more maintainable and also to conform with the rest of the container ecosystem. A recent major step towards that stability is that starting with Fedora 39, the toolbox image is now a release-blocking deliverable. This means it is now built as part of the nightly compose and the whole Toolbx stack (i.e. the fedora-toolbox image and the toolbox RPM) is part of the release-blocking test criteria. This shows the level of importance we put on Toolbx as the future of Linux software development and its criticality to Fedora Workstation. Earlier, we built the fedora-toolbox image as a somewhat separate and standalone thing, and people interested in Toolbx would try to test and keep the whole thing working, as much as possible, on their own. This was becoming unmanageable because Toolbx integrates with many parts of the distribution, from Mutter (i.e. the Wayland and X sockets) to Kerberos to RPM (i.e. %_netsharedpath in /usr/lib/rpm/macros.d/macros.toolbox) to glibc locale definitions and translations. The list of things that could change elsewhere in Fedora and end up breaking Toolbx was growing too large for a small group of Toolbx contributors to keep track of.

With the next release we now also have built-in support for Arch Linux and Ubuntu through the --distro flag in toolbox.git main, thanks again to the community contributors who worked with us on this, allowing us to widen the number of distros supported while keeping with our policy of reliability and dependability. And along the same theme of ensuring Toolbx is a tool developers can rely on, we have added lots and lots of new tests. We now have more than 280 tests that run on CentOS Stream 9, all supported Fedoras and Rawhide, and Ubuntu 22.04.
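
For illustration, creating and entering such a container would look roughly like this (a sketch based on the description above; check toolbox create --help for the exact values supported by your version):

toolbox create --distro arch
toolbox create --distro ubuntu --release 22.04
toolbox enter --distro ubuntu --release 22.04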

Another feature that Toolbx maintainer Debarshi Ray put a lot of effort into is setting up full RHEL containers in Toolbx on top of Fedora. Today, thanks to Debarshi’s work, you run subscription-manager register --username user@domain.name on the Fedora or RHEL host, and the container is automatically entitled to RHEL content. We are still looking at how we can provide a graphical interface for that process, or at least how to polish up the CLI for doing subscription-manager register. If you are interested in this feature, Debarshi provides a full breakdown here.
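
As a rough sketch of that flow (the release number is just an example, not a recommendation):

# Register the host so containers can inherit the entitlement.
sudo subscription-manager register --username user@domain.name
# Create and enter a RHEL container; "9.3" is only an example release.
toolbox create --distro rhel --release 9.3
toolbox enter --distro rhel --release 9.3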

Other nice-to-haves added are support for enterprise FreeIPA setups, where the user logs into their machine through Kerberos, and support for automatically generated shell completions for Bash, fish and Z shell.

Flatpak and Foreman & Katello
For those out there using Foreman to manage your fleet of Linux installs, we have some good news. We are in the process of implementing support for Flatpaks in these tools so that you can manage and deploy applications in the Flatpak format using them. This is still a work in progress, but the relevant Pulp and Katello commits are the Pulp commit “Support for Flatpak index endpoints” and the Katello commits “Reporting results of docker v2 repo discovery” and “Support Link header in docker v2 repo discovery”.

LVFS
Another effort that Fedora Workstation has brought to the world of Linux, and that is very popular, is the LVFS and fwupd firmware update repository and tools. Thanks to that effort we are soon going to be passing one hundred million firmware updates delivered to Linux devices! These firmware updates have helped resolve countless bugs and have much improved security for Linux users.

But we are not slowing down. Richard Hughes worked with industry partners this year to define a Bill of Materials definition for firmware updates, allowing users to be better informed about what is included in their firmware updates.

We now support over 1400 different devices on the LVFS (covering 78 different protocols!), with over 8000 public firmware versions from over 150 OEMs and ODMs. We’ve now done over 100,000 static analysis tests on over 2,000,000 EFI binaries in the firmware capsules!

Some examples of recently added hardware:
* AMD dGPUs, Navi3x and above, AVer FONE540, Belkin Thunderbolt 4 Core Hub dock, CE-LINK TB4 Docks, CH347 SPI programmer, EPOS ADAPT 1×5, Fibocom FM101, Foxconn T99W373, SDX12, SDX55 and SDX6X devices, Genesys GL32XX SD readers, GL352350, GL3590, GL3525S and GL3525 USB hubs, Goodix Touch controllers, HP Rata/Remi BLE Mice, Intel USB-4 retimers, Jabra Evolve 65e/t and SE, Evolve2, Speak2 and Link devices, Logitech Huddle, Rally System and Tap devices, Luxshare Quad USB4 Dock, MediaTek DP AUX Scalers, Microsoft USB-C Travel Hub, More Logitech Unifying receivers, More PixartRF HPAC devices, More Synaptics Prometheus fingerprint readers, Nordic HID devices, nRF52 Desktop Keyboard, PixArt BLE HPAC OTA, Quectel EM160 and RM520, Some Western Digital eMMC devices, Star Labs StarBook Mk VIr2, Synaptics Triton devices, System76 Launch 3, Launch Heavy 3 and Thelio IO 2, TUXEDO InfinityBook Pro 13 v3, VIA VL122, VL817S, VL822T, VL830 and VL832, Wacom Cintiq Pro 27, DTH134 and DTC121, One 13 and One 12 Tablets

InputLeap on Wayland
One really interesting feature that landed for Fedora Workstation 39 was the support for InputLeap. It’s probably not on most people’s radar, but it’s an important feature for system administrators, developers and generally anyone with more than a single computer on their desk.

Historically, InputLeap is a fork of Barrier, which itself was a fork of Synergy. It allows sharing the same input devices (mouse, keyboard) across different computers (Linux, Windows, macOS) and moving the pointer between the screens of these computers seamlessly, as if they were one.

InputLeap has a client/server architecture, with the server running on the main host (the one with the keyboard and mouse connected) and multiple clients, the other machines sitting next to the server machine. That implies two things: the InputLeap daemon on the server must be able to “capture” all the input events and forward them to the remote clients when the pointer reaches the edge of the screen, and the InputLeap client must be able to “replay” those input events on the client host to make it as if the keyboard and mouse were connected directly to the (other) computer. Historically, that relied on X11 mechanisms, and neither InputLeap (nor Barrier, or even Synergy for that matter) would work on Wayland.

This is one of the use cases that Peter Hutterer had in mind when he started libEI, a low-level library aimed at providing a separate communication channel for input emulation in Wayland compositors and clients (even though libEI is not strictly tied to Wayland). But libEI alone is far from sufficient to implement the InputLeap features; with Wayland we had the opportunity to make things more secure than X11 and benefit from the XDG portal mechanisms.

On the client side, for replaying input events, it’s similar to remote desktop, but we needed to update the existing RemoteDesktop portal to pass the libEI socket. On the server side, it required a brand new portal for input capture. These also required their counterparts in the GNOME portal, for both RemoteDesktop and InputCapture [8], and of course all of that needs to be supported by the Wayland compositor, which in the case of GNOME is Mutter. That alone was a lot of work.

Yet, even with all that in place, those are just the basic requirements to support a Synergy/Barrier/InputLeap-like feature; the tools in question need to have support for the portal and libEI implemented to benefit from the mechanisms we’ve put in place and for the whole feature to work and be usable. So libportal was also updated to support the new portal features, and a new “Wayland” backend, alongside the X11, Windows and Mac OS backends, was contributed to InputLeap.

The merge request in InputLeap was accepted very early, even before the libEI API was completely stabilized and before the rest of the stack was merged, which I believe was a courageous choice from Povilas (who maintains InputLeap) and one that helped reduce the time to have the feature actually working, considering the number of components and inter-dependencies involved. Of course, there are still features missing in the Wayland backend, like copy/pasting between hosts, but a clipboard interface was fairly recently added to the remote desktop portal and therefore could be used by InputLeap to implement that feature.

Fun fact: Xwayland also grew support for libEI, also using the remote desktop portal, and wires that to the XTEST extension on X11 that InputLeap’s X11 backend uses, so it might even be possible to use the X11 backend of InputLeap on the client side through Xwayland. But of course it’s better to use the Wayland backend on both the client and server sides.

InputLeap is a great example of collaboration between multiple parties upstream, including key contributions from us at Red Hat, to implement and contribute a feature that has been requested upstream for years.

Thank you to Olivier Fourdan, Debarshi Ray, Richard Hughes, Sebastian Wick and Jonas Ådahl for their contributions to this blog post.

November 17, 2023

Progress

 
This update's highlight is that last week I finally got the TP jobs working, which allows us to do the tensor manipulation in the HW, removing 18 ms of tensor preprocessing. We can currently use them for transposing tensors from the format that TensorFlow prefers to the one the HW expects and the other way around, and for lowering strided convolutions to regular ones.
 
This makes our image classification benchmark twice as fast, as expected:

tomeu@arm-64:~/mesa$ ETNA_MESA_DEBUG=ml_msgs python3.10 classification.py -i grace_hopper.bmp -m mobilenet_v1_1.0_224_quant.tflite -l labels_mobilenet_quant_v1_224.txt -e libteflon.so
Loading external delegate from build/src/gallium/targets/teflon/libteflon.so with args: {}
Running the NN job took 13 ms.
0.866667: military uniform
0.031373: Windsor tie
0.015686: mortarboard
0.007843: bow tie
0.007843: academic gown
time: 15.650ms

60 FPS is already quite interesting for many use cases, but the proprietary driver is able to do the same at around 8 ms, so there is still plenty of room for improvements.
 
Some preliminary testing indicates that enabling zero-run length compression in the weight buffers will make the biggest difference, so that is what I will be working on when I get back to performance work.

Additionally, I also got some experimental jobs running on the programmable core in this NPU, which will allow us to run more advanced models, which tend to use operations that the hardware couldn't have been designed for back then.

Upstreaming is going well, those interested can follow it here:
 
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25714.
 

Next steps

 

These will be my priorities during the next couple of weeks, in order:

  1. Upstreaming
  2. Get the Mobilenet SSD V1 model running on the HW, for object detection
  3. Performance
November 15, 2023

Hi! This month I’ve started a new PotM called pyonji. It’s an easy-to-use replacement for the venerable git-send-email command. The goal is to make it less painful for a new contributor not familiar with the e-mail based patch submission to submit patches.

Users are expected to use the same workflow as GitHub, GitLab and friends when contributing: create a new branch and add commits there. Instead of pushing to a fork though, users simply invoke pyonji.
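
In other words, a typical contribution could look something like this (branch name and commit message are of course just an example):

git checkout -b fix-typo          # work on a branch as usual
git commit -a -m "doc: fix typo"  # one or more commits
pyonji                            # instead of pushing to a fork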

When run for the first time, pyonji will ask for your e-mail account details: e-mail address, password… and nothing else. The SMTP server hostname, port and other details are automatically detected (via multiple means: SRV records, Mozilla auto-configuration database, common subdomains, etc). Once the password is verified pyonji will store everything in the Git configuration (in the same fashion that git-send-email expects it).
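
For instance, server autodetection can start from the standard mail submission SRV record, and the resulting settings end up in the same keys git-send-email reads (the host names and values below are just placeholders):

dig +short _submission._tcp.example.org SRV
git config --global sendemail.smtpServer mail.example.org
git config --global sendemail.smtpServerPort 465
git config --global sendemail.smtpEncryption ssl
git config --global sendemail.smtpUser user@example.org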

Then pyonji will present a UI with a list of commits to be submitted for review. The user can tweak details such as the base branch, the mailing list address, the version of the patch, however that’s rarely needed: pyonji will find good defaults for these. The user can add a cover letter if desired with a longer description for the set of patches. Then the big blue “submit” button can be pressed to send the patches.

Unlike git-send-email, pyonji will remember for you what the last submitted version number was (and automatically increment it). pyonji will save the cover letter so that it’s not lost if the network is flaky and you don’t need to re-type it for the next submission. pyonji will not waste your time with uninteresting questions such as “which encoding should I use?”. pyonji will automatically include the base tree information in the patches so that any conflicts are more easily resolved by the reviewer.

Please try it and let me know how it goes! In particular, I’m wondering if the logic to auto-detect the e-mail server settings is robust enough, or if there are e-mail providers I don’t handle correctly yet.

There is still a lot to be done to improve pyonji. Setup is painful for GMail and Fastmail users because app passwords are required. I wanted to use OAuth to fix this but both of these providers heavily restrict how SMTP OAuth apps can be registered. Setup doesn’t work for ProtonMail users because the bridge uses a self-signed certificate, that can be fixed but setup will remain painful. I’d like to add UI to change the base branch, improve the heuristics to pick a good default for the base branch, support for the MAINTAINERS file for easier contribution to big projects such as the kernel, add an easy way to mark a patch series as RFC, and probably a million of other things.

Apart from pyonji, I’ve been working on some graphics-related stuff as always. We’re getting closer to the wlroots 0.17 release, fixing the few remaining blocking issues. A new API to clip surfaces with the scene-graph has been merged, many thanks to Alexander Orzechowski and Isaac Freund! I’ve fixed a Mesa regression introduced by a previous patch I’ve reviewed related to EGL and split render/display SoCs (I hate these). And I’ve been discussing with other kernel developers about a way to stop (ab)using KMS dumb buffers for split render/display SoCs (I swear I really hate these). We’re trying to come up with a solution which could on the long run also help with the Buffer Allocation Constraints Problem (see the XDC 2020 talk for more info).

I’ve written a few patches to add support for OAuth 2.0 refresh tokens to meta.sr.ht. If you’ve ever used an OAuth sr.ht app (like hottub or yojo to integrate builds.sr.ht with GitHub or Forgejo), you probably know that tokens expire after one year, and that you need to redo the setup step when that happens. This is annoying, and adding support for refresh tokens to meta.sr.ht and the OAuth apps should fix this.

Last, I’m now part of the FreeDesktop Code of Conduct team. This is not a technical role, but it’s very important to have folks doing this work. I’ve attended a Code of Conduct workshop to learn how to do it, that’s been pretty interesting and helpful. The workshop focused a lot more on trying to change people’s behavior, instead of bringing down the ban hammer.

That’s all for now, see you next month!

November 11, 2023

Today, 12 years after the meeting where AppStream was first discussed and 11 years after I released a prototype implementation I am excited to announce AppStream 1.0! 🎉🎉🎊

Check it out on GitHub, or get the release tarball or read the documentation or release notes! 😁

Some nostalgic memories

I was not in the original AppStream meeting, since in 2011 I was extremely busy with finals preparations and ball organization in high school, but I still vividly remember sitting at school in the students’ lounge during a break and trying to catch the really choppy live stream from the meeting on my borrowed laptop (a futile exercise, I watched parts of the blurry recording later).

I was extremely passionate about getting software deployment to work better on Linux and to improve the overall user experience, and spent many hours on the PackageKit IRC channel discussing things with many amazing people like Richard Hughes, Daniel Nicoletti, Sebastian Heinlein and others.

At the time I was writing a software deployment tool called Listaller – this was before Linux containers were a thing, and building it was very tough due to technical and personal limitations (I had just learned C!). Then in university, when I intended to recreate this tool, but for real and better this time as a new project called Limba, I needed a way to provide metadata for it, and AppStream fit right in! Meanwhile, Richard Hughes was tackling the UI side of things while creating GNOME Software and needed a solution as well. So I implemented a prototype and together we pretty much reshaped the early specification from the original meeting into what would become modern AppStream.

Back then I saw AppStream as a necessary side-project for my actual project, and didn’t even consider myself its maintainer for quite a while (I hadn’t been at the meeting, after all). All those years ago I had no idea that ultimately I was developing AppStream not for Limba, but for a new thing that would show up later, with an even more modern design, called Flatpak. I also had no idea how incredibly complex AppStream would become, how many features it would have, how much maintenance work it would be – and also how ubiquitous it would become.

The modern Linux desktop uses AppStream everywhere now, it is supported by all major distributions, used by Flatpak for metadata, used for firmware metadata via Richard’s fwupd/LVFS, runs on every Steam Deck, can be found in cars and possibly many places I do not know yet.

What is new in 1.0?

API breaks

The most important thing that’s new with the 1.0 release is a bunch of incompatible changes. For the shared libraries, all deprecated API elements have been removed and a bunch of other changes have been made to improve the overall API and especially make it more binding-friendly. That doesn’t mean the API is completely new and nothing looks like it did before, though: where possible the previous API design was kept, and some changes that would have been too disruptive were not made. Regardless of that, you will have to port your AppStream-using applications. For some larger ones I already submitted patches to build with both AppStream versions, the 0.16.x stable series as well as 1.0+.

For the XML specification, some older compatibility for XML that had no or very few users has been removed as well. This affects for example release elements that reference downloadable data without an artifact block, which has not been supported for a while. For all of these, I checked to remove only things that had close to no users and that were a significant maintenance burden. So as a rule of thumb: If your XML validated with no warnings with the 0.16.x branch of AppStream, it will still be 100% valid with the 1.0 release.
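
If you want to check whether your own metainfo data is affected, validating it is the quickest way to find out (the file name is just an example):

appstreamcli validate org.example.MyApp.metainfo.xml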

Another notable change is that the generated output of AppStream 1.0 will always be 1.0 compliant; you cannot make it generate data for versions below that (this greatly reduced the maintenance cost of the project).

Developer element

For a long time, you could set the developer name using the top-level developer_name tag. With AppStream 1.0, this is changed a bit. There is now a developer tag with a name child (that can be translated unless the translate="no" attribute is set on it). This allows for future extensibility, and also allows setting a machine-readable id attribute in the developer element. This permits software centers to group software by developer more easily, without having to use heuristics. If we decide to extend the per-app developer information in future, this is also now possible. Do not worry though, the developer_name tag is still read, so there is no great pressure to update. The old 0.16.x stable series also has this feature backported, so it can be available everywhere. Check out the developer tag specification for more details.

Scale factor for screenshots

Screenshot images can now have a scale attribute, to indicate an (integer) scaling factor to apply. This feature was a breaking change and therefore we could not have it for the longest time, but it is now available. Please wait a bit for AppStream 1.0 to become more widely deployed though, as using it with older AppStream versions may lead to issues in some cases. Check out the screenshots tag specification for more details.

Screenshot environments

It is now possible to indicate the environment a screenshot was recorded in (GNOME, GNOME Dark, KDE Plasma, Windows, etc.) via an environment attribute on the respective screenshot tag. This was also a breaking change, so use it carefully for now! If projects want to, they can use this feature to supply dedicated screenshots depending on the environment the application page is displayed in. Check out the screenshots tag specification for more details.

References tag

This feature is more important for the scientific community and scientific applications. Using the references tag, you can associate the AppStream component with a DOI (Digital Object Identifier) or provide a link to a CFF file to provide citation information. It also allows linking to other scientific registries. Check out the references tag specification for more details.

Release tags

Releases can have tags now, just like components. This is generally not a feature that I expect to be used much, but in certain instances it can become useful with a cooperating software center, for example to tag certain releases as long-term supported versions.

Multi-platform support

Thanks to the interest and work of many volunteers, AppStream (mostly) runs on FreeBSD now, a NetBSD port exists, support for macOS was written and a Windows port is on its way! Thank you to everyone working on this 🙂

Better compatibility checks

For a long time I thought that the AppStream library should just be a thin layer above the XML and that software centers should implement a lot of the actual logic themselves. This has not been the case for a while, but there were still a lot of complex AppStream features that were hard for software centers to implement and where it makes sense to have one implementation that projects can just use.

The validation of component relations is one such thing. This was implemented in 0.16.x as well, but 1.0 vastly improves upon the compatibility checks, so you can now just run as_component_check_relations and retrieve a detailed report on whether the current component will run well on the system. Besides better API for software developers, the appstreamcli utility also has much improved support for relation checks, and I wrote about these changes in a previous post. Check it out!

With these changes, I hope this feature will be used much more, and beyond just drivers and firmware.

So much more!

The changelog for the 1.0 release is huge, and there are many papercuts resolved and changes made that I did not talk about here, like us using gi-docgen (instead of gtkdoc) now for nice API documentation, or the many improvements that went into better binding support, or better search, or just plain bugfixes.

Outlook

I expect the transition to 1.0 to take a bit of time. AppStream has not broken its API for many, many years (since 2016), so a bunch of places need to be touched even if the changes themselves are minor in many cases. In hindsight, I should have also released 1.0 much sooner and it should not have become such a mega-release, but that was mainly due to time constraints.

So, what’s in it for the future? Contrary to what I thought, AppStream does not really seem to be “done” and feature complete at any point; there is always something to improve, and people come up with new use cases all the time. So, expect more of the same in the future: bugfixes, validator improvements, documentation improvements, better tools and the occasional new feature.

Onwards to 1.0.1! 😁

November 10, 2023

TLDR: see the title of this blog post, it's really that trivial.

Now that GodotWayland has been coming for ages and all new development focuses on a pile of software that steams significantly less, we're seeing cracks appear in the old Xorg support. Not intentionally, but there's only so much time that can be spent on testing, and things that are more niche fall through. One of these was a bug I just had the pleasure of debugging, triggered by a GNOME on Xorg user using the xf86-input-libinput driver for tablet devices.

On the surface of it, this should be fine because libinput (and thus xf86-input-libinput) handles tablets just fine. But libinput is the new kid on the block. The old kid on said block is the xf86-input-wacom driver, older than libinput by slightly over a decade. And oh man, history has baked things into the driver that are worse than raisins in apple strudel [1].

The xf86-input-libinput driver was written as a wrapper around libinput and makes use of fancy things that (from libinput's POV) have always been around: things like input device hotplugging. Fancy, I know. For tablet devices the driver creates an X device for each new tool as it first comes into proximity. Future events from that tool will go through that device. A second tool, be it a new pen or the eraser on the original pen, will create a second X device and events from that tool will go through that X device. Configuration on any device will thus only affect that particular pen. Almost like the whole thing makes sense.

The wacom driver of course doesn't do this. It pre-creates X devices for some possible types of tools (pen, eraser, and cursor [2] but not airbrush or artpen). When a tool goes into proximity the events are sent through the respective device, i.e. all pens go through the pen tool, all erasers through the eraser tool. To actually track pens there is the "Wacom Serial IDs" property that contains the current tool's serial number. If you want to track multiple tools you need to query the property on proximity in [4]. At the time this was within a reasonable error margin of a good idea.

Of course, and because MOAR CONFIGURATION! will save us all from the great filter, you can specify the "ToolSerials" xorg.conf option as e.g. "airbrush;12345;artpen" and get some extra X devices pre-created, in this case an airbrush and an artpen X device and an X device just for the tool with the serial number 12345. All other tools multiplex through the default devices. Again, at the time this was a great improvement. [5]
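For illustration only, such an option would typically be set in an xorg.conf snippet for the wacom driver; a rough sketch (the identifier and match rule are made up, adjust them to your setup):

Section "InputClass"
	Identifier "Wacom extra tool devices"
	MatchDriver "wacom"
	Option "ToolSerials" "airbrush;12345;artpen"
EndSection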

Anyway, where was I? Oh, right. The above should serve as a good approximation of a reason why the xf86-input-libinput driver does not try to be fully compatible with the xf86-input-wacom driver. In everyday use these things barely matter [6], but for the desktop environment, which needs to configure these devices, all these differences mean multiple code paths. Those paths need to be tested but they aren't, so things fall through the cracks.

So quite a while ago, we made the decision that until Xorg goes dodo, the xf86-input-wacom driver is the tablet driver to use in GNOME. So if you're using a GNOME on Xorg session [7], do make sure the xf86-input-wacom driver is installed. It will make both of us happier and that's a good aim to strive for.

[1] It's just a joke. Put the pitchforks down already.
[2] The cursor is the mouse-like thing Wacom sells. Which is called cursor [3] because the English language has a limited vocabulary and we need to re-use words as much as possible lest we run out of them.
[3] It's also called puck. Because [2].
[4] And by "query" I mean "wait for the XI2 event notifying you of a property change". Because of lolz the driver cannot update the property on proximity in but needs to schedule that as idle func so the property update for the serial always arrives at some unspecified time after the proximity in but hopefully before more motion events happen. Or not, and that's how hope dies.
[5] Think about this next time someone says they long for some unspecified good old days.
[6] Except the strip axis which on the wacom driver is actually a bit happily moving left/right as your finger moves up/down on the touch strip and any X client needs to know this. libinput normalizes this to...well, a normal value but now the X client needs to know which driver is running so, oh deary deary.
[7] e.g. because you're stockholmed into it by your graphics hardware

November 08, 2023

I’ve recently worked on a patch for the vc4 display driver used on the Raspberry Pi 4. To test this patch, I needed to compile the kernel and install it, something I know how to do on x86 but not on Raspberry Pi. Because I’m pretty stubborn I’ve also insisted on making my life harder:

  • I installed Arch Linux ARM as the base system, instead of Raspberry Pi OS or Raspbian.
  • I based my patches on top of the mainline kernel, instead of using Raspberry Pi’s tree.
  • I wanted to install my built kernel alongside the one provided by the distribution, instead of overwriting it.

Raspberry Pi has an official guide to compile the kernel, however it assumes Raspberry Pi OS, Raspberry Pi’s kernel tree, and overwrites the current kernel. It was still very useful to get an idea of the process. Still, quite a few adaptations have been required. This blog post serves as my personal notepad to remember how to Do It.

First, the official guide instructs us to run make bcm2711_defconfig to generate the kernel config, however mainline complains with:

Can't find default configuration "arch/arm/configs/bcm2711_defconfig"

This can be fixed by grabbing this file from the Raspberry Pi tree:

curl -L -o arch/arm/configs/bcm2711_defconfig "https://github.com/raspberrypi/linux/raw/rpi-6.1.y/arch/arm/configs/bcm2711_defconfig"

Once that’s done, compiling the kernel as usual works fine. Then we need to install it to the /boot partition. We can ignore the overlays stuff from the official guide, we don’t use these. The source paths need to be slightly adjusted, and the destination paths need to be fixed up to use a subdirectory:

doas make modules_install
doas cp arch/arm/boot/dts/broadcom/*.dtb /boot/custom/
doas cp arch/arm/boot/zImage /boot/custom/kernel7.img

Then we need to generate an initramfs. At first I forgot to do that step and the kernel was hanging around USB bus discovery.

doas mkinitcpio --generate /boot/custom/initramfs-linux.img --kernel /boot/custom/kernel7.img

The last step is updating the boot firmware configuration located at /boot/config.txt. Comment out any dtoverlay directive, then add os_prefix=custom/ to point the firmware to our subdirectory (note, the final slash is important).
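For reference, the relevant part of /boot/config.txt ends up looking roughly like this (the overlay shown commented out is just an example; yours may differ):

# dtoverlay=vc4-kms-v3d
os_prefix=custom/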

For some reason my memory card was showing up as /dev/mmcblk1 instead of /dev/mmcblk0, so I had to bang my head against the wall until I noticed the difference and adjusted /boot/cmdline.txt and /etc/fstab accordingly.
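In practice that boiled down to something like this (a sketch based on my partition naming; double-check your own layout before running it):

doas sed -i 's#/dev/mmcblk0#/dev/mmcblk1#g' /boot/cmdline.txt /etc/fstab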

That’s it! After a reboot I was ready to start kernel hacking. Thanks to Maíra Canal for replying to my distress signal on Mastodon and providing recommendations!

November 07, 2023

TL;DR:

This blog post explores the color capabilities of AMD hardware and how they are exposed to userspace through driver-specific properties. It discusses the different color blocks in the AMD Display Core Next (DCN) pipeline and their capabilities, such as predefined transfer functions, 1D and 3D lookup tables (LUTs), and color transformation matrices (CTMs). It also highlights the differences in AMD HW blocks for pre and post-blending adjustments, and how these differences are reflected in the available driver-specific properties.

Overall, this blog post provides a comprehensive overview of the color capabilities of AMD hardware and how they can be controlled by userspace applications through driver-specific properties. This information is valuable for anyone who wants to develop applications that can take advantage of the AMD color management pipeline.

Get a closer look at each hardware block’s capabilities, unlock a wealth of knowledge about AMD display hardware, and enhance your understanding of graphics and visual computing. Stay tuned for future developments as we embark on a quest for GPU color capabilities in the ever-evolving realm of rainbow treasures.


Operating systems can use the power of GPUs to ensure consistent color reproduction across graphics devices. We can use GPU-accelerated color management to manage the diversity of color profiles, do color transformations to convert between High-Dynamic-Range (HDR) and Standard-Dynamic-Range (SDR) content, and apply color enhancements for wide color gamut (WCG). However, to make use of GPU display capabilities, we need an interface between userspace and the kernel display drivers that is currently absent in the Linux/DRM KMS API.

In the previous blog post I presented how we are expanding the Linux/DRM color management API to expose specific properties of AMD hardware. Now, I’ll guide you to the color features for the Linux/AMD display driver. We embark on a journey through DRM/KMS, AMD Display Manager, and AMD Display Core and delve into the color blocks to uncover the secrets of color manipulation within AMD hardware. Here we’ll talk less about the color tools and more about where to find them in the hardware.

We resort to driver-specific properties to reach AMD hardware blocks with color capabilities. These blocks provide features like predefined transfer functions, color transformation matrices, and 1-dimensional (1D LUT) and 3-dimensional (3D LUT) lookup tables. Here, we will understand how these color features are strategically placed into color blocks both before and after blending in the Display Pipe and Plane (DPP) and Multiple Pipe/Plane Combined (MPC) blocks.

That said, welcome back to the second part of our thrilling journey through AMD’s color management realm!

AMD Display Driver in the Linux/DRM Subsystem: The Journey

In my 2022 XDC talk “I’m not an AMD expert, but…”, I briefly explained the organizational structure of the Linux/AMD display driver where the driver code is bifurcated into a Linux-specific section and a shared-code portion. To reveal AMD’s color secrets through the Linux kernel DRM API, our journey led us through these layers of the Linux/AMD display driver’s software stack. It includes traversing the DRM/KMS framework, the AMD Display Manager (DM), and the AMD Display Core (DC) [1].

The DRM/KMS framework provides the atomic API for color management through KMS properties represented by struct drm_property. We extended the color management interface exposed to userspace by leveraging existing resources and connecting them with driver-specific functions for managing modeset properties.

On the AMD DC layer, the interface with hardware color blocks is established. The AMD DC layer contains OS-agnostic components that are shared across different platforms, making it an invaluable resource. This layer already implements hardware programming and resource management, simplifying the external developer’s task. While examining the DC code, we gain insights into the color pipeline and capabilities, even without direct access to specifications. Additionally, AMD developers provide essential support by answering queries and reviewing our work upstream.

The primary challenge involved identifying and understanding relevant AMD DC code to configure each color block in the color pipeline. However, the ultimate goal was to bridge the DC color capabilities with the DRM API. For this, we changed the AMD DM, the OS-dependent layer connecting the DC interface to the DRM/KMS framework. We defined and managed driver-specific color properties, facilitated the transport of user space data to the DC, and translated DRM features and settings to the DC interface. Considerations were also made for differences in the color pipeline based on hardware capabilities.

Exploring Color Capabilities of the AMD display hardware

Now, let’s dive into the exciting realm of AMD color capabilities, where an abundance of techniques and tools await to make your colors look extraordinary across diverse devices.

First, we need to know a little about the color transformation and calibration tools and techniques that you can find in different blocks of the AMD hardware. I borrowed some images from [2] [3] [4] to help you understand the information.

Predefined Transfer Functions (Named Fixed Curves):

Transfer functions serve as the bridge between the digital and visual worlds, defining the mathematical relationship between digital color values and linear scene/display values and ensuring consistent color reproduction across different devices and media. You can learn more about curves in the chapter GPU Gems 3 - The Importance of Being Linear by Larry Gritz and Eugene d’Eon.

ITU-R 2100 introduces three main types of transfer functions:

  • OETF: the opto-electronic transfer function, which converts linear scene light into the video signal, typically within a camera.
  • EOTF: electro-optical transfer function, which converts the video signal into the linear light output of the display.
  • OOTF: opto-optical transfer function, which has the role of applying the “rendering intent”.

AMD’s display driver supports the following pre-defined transfer functions (aka named fixed curves):

  • Linear/Unity: linear/identity relationship between pixel value and luminance value;
  • Gamma 2.2, Gamma 2.4, Gamma 2.6: pure power functions;
  • sRGB: the piece-wise transfer function with a 2.4 exponent, from IEC 61966-2-1:1999;
  • BT.709: has a linear segment in the bottom part and then a power function with a 0.45 (~1/2.22) gamma for the rest of the range; standardized by ITU-R BT.709-6;
  • PQ (Perceptual Quantizer): used for HDR display, allows luminance range capability of 0 to 10,000 nits; standardized by SMPTE ST 2084.

These capabilities vary depending on the hardware block, with some utilizing hardcoded curves and others relying on AMD’s color module to construct curves from standardized coefficients. It also supports user/custom curves built from a lookup table.

1D LUTs (1-dimensional Lookup Table):

A 1D LUT is a versatile tool, defining a one-dimensional color transformation based on a single parameter. It’s very well explained by Jeremy Selan in GPU Gems 2 - Chapter 24, Using Lookup Tables to Accelerate Color Transformations.

It enables adjustments to color, brightness, and contrast, making it ideal for fine-tuning. In the Linux AMD display driver, the atomic API offers a 1D LUT with 4096 entries and 8-bit depth, while legacy gamma uses a size of 256.

3D LUTs (3-dimensional Lookup Table):

These tables work in three dimensions – red, green, and blue. They’re perfect for complex color transformations and adjustments between color channels. They’re also more complex to manage and require more computational resources. Jeremy also explains 3D LUTs in GPU Gems 2 - Chapter 24, Using Lookup Tables to Accelerate Color Transformations.

CTM (Color Transformation Matrices):

Color transformation matrices facilitate the transition between different color spaces, playing a crucial role in color space conversion.

HDR Multiplier:

HDR multiplier is a factor applied to the color values of an image to increase their overall brightness.

AMD Color Capabilities in the Hardware Pipeline

First, let’s take a closer look at the AMD Display Core Next hardware pipeline in the Linux kernel documentation for AMDGPU driver - Display Core Next

In the AMD Display Core Next hardware pipeline, we encounter two hardware blocks with color capabilities: the Display Pipe and Plane (DPP) and the Multiple Pipe/Plane Combined (MPC). The DPP handles color adjustments per plane before blending, while the MPC engages in post-blending color adjustments. In short, we expect DPP color capabilities to match up with DRM plane properties, and MPC color capabilities to play nice with DRM CRTC properties.

Note: here’s the catch – there are some DRM CRTC color transformations that don’t have a corresponding AMD MPC color block, and vice versa. It’s like a puzzle, and we’re here to solve it!

AMD Color Blocks and Capabilities

We can finally talk about the color capabilities of each AMD color block. As they vary based on the hardware generation, let’s take the DCN3+ family as reference. What’s possible to do before and after blending depends on hardware capabilities described in the kernel driver by struct dpp_color_caps and struct mpc_color_caps.

The AMD Steam Deck hardware provides a tangible example of these capabilities. Therefore, we take the Steam Deck/DCN301 driver as an example and look at the “Color pipeline capabilities” described in the file drivers/gpu/drm/amd/display/dc/dcn301/dcn301_resources.c:

/* Color pipeline capabilities */

dc->caps.color.dpp.dcn_arch = 1; // If it is a Display Core Next (DCN): yes. Zero means DCE.
dc->caps.color.dpp.input_lut_shared = 0;
dc->caps.color.dpp.icsc = 1; // Input Color Space Conversion (CSC) matrix.
dc->caps.color.dpp.dgam_ram = 0; // The old degamma block for degamma curve (hardcoded and LUT). `Gamma correction` is the new one.
dc->caps.color.dpp.dgam_rom_caps.srgb = 1; // sRGB hardcoded curve support
dc->caps.color.dpp.dgam_rom_caps.bt2020 = 1; // BT2020 hardcoded curve support (seems not actually in use)
dc->caps.color.dpp.dgam_rom_caps.gamma2_2 = 1; // Gamma 2.2 hardcoded curve support
dc->caps.color.dpp.dgam_rom_caps.pq = 1; // PQ hardcoded curve support
dc->caps.color.dpp.dgam_rom_caps.hlg = 1; // HLG hardcoded curve support
dc->caps.color.dpp.post_csc = 1; // CSC matrix
dc->caps.color.dpp.gamma_corr = 1; // New `Gamma Correction` block for degamma user LUT;
dc->caps.color.dpp.dgam_rom_for_yuv = 0;

dc->caps.color.dpp.hw_3d_lut = 1; // 3D LUT support. If so, it's always preceded by a shaper curve. 
dc->caps.color.dpp.ogam_ram = 1; // `Blend Gamma` block for custom curve just before blending
// no OGAM ROM on DCN301
dc->caps.color.dpp.ogam_rom_caps.srgb = 0;
dc->caps.color.dpp.ogam_rom_caps.bt2020 = 0;
dc->caps.color.dpp.ogam_rom_caps.gamma2_2 = 0;
dc->caps.color.dpp.ogam_rom_caps.pq = 0;
dc->caps.color.dpp.ogam_rom_caps.hlg = 0;
dc->caps.color.dpp.ocsc = 0;

dc->caps.color.mpc.gamut_remap = 1; // Post-blending CTM (pre-blending CTM is always supported)
dc->caps.color.mpc.num_3dluts = pool->base.res_cap->num_mpc_3dlut; // Post-blending 3D LUT (preceded by shaper curve)
dc->caps.color.mpc.ogam_ram = 1; // Post-blending regamma.
// No pre-defined TF supported for regamma.
dc->caps.color.mpc.ogam_rom_caps.srgb = 0;
dc->caps.color.mpc.ogam_rom_caps.bt2020 = 0;
dc->caps.color.mpc.ogam_rom_caps.gamma2_2 = 0;
dc->caps.color.mpc.ogam_rom_caps.pq = 0;
dc->caps.color.mpc.ogam_rom_caps.hlg = 0;
dc->caps.color.mpc.ocsc = 1; // Output CSC matrix.

I included some inline comments in each element of the color caps to quickly describe them, but you can find the same information in the Linux kernel documentation. See more in struct dpp_color_caps, struct mpc_color_caps and struct rom_curve_caps.

Now, using this guideline, we go through color capabilities of DPP and MPC blocks and talk more about mapping driver-specific properties to corresponding color blocks.
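Before going block by block, it can be handy to check which color-related KMS properties your running kernel actually exposes. A quick sketch using the generic drm_info tool (the grep pattern is just a guess; the exact property names depend on the kernel and patch series you are running):

drm_info | grep -iE 'degamma|shaper|lut3d|ctm|gamma'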

DPP Color Pipeline: Before Blending (Per Plane)

Let’s explore the capabilities of DPP blocks and what you can achieve with each color block. The very first thing to pay attention to is the display architecture of the hardware: older AMD hardware uses a display architecture called DCE - Display and Compositing Engine - while newer hardware follows DCN - Display Core Next.

The architecture is described by: dc->caps.color.dpp.dcn_arch

AMD Plane Degamma: TF and 1D LUT

Described by: dc->caps.color.dpp.dgam_ram, dc->caps.color.dpp.dgam_rom_caps, dc->caps.color.dpp.gamma_corr

AMD Plane Degamma data is mapped to the initial stage of the DPP pipeline. It is utilized to transition from scanout/encoded values to linear values for arithmetic operations. Plane Degamma supports both pre-defined transfer functions and 1D LUTs, depending on the hardware generation. DCN2 and older families handle both types of curve in the Degamma RAM block (dc->caps.color.dpp.dgam_ram); DCN3+ separates hardcoded curves and the 1D LUT into two blocks: Degamma ROM (dc->caps.color.dpp.dgam_rom_caps) and the Gamma Correction block (dc->caps.color.dpp.gamma_corr), respectively.

Pre-defined transfer functions:

  • they are hardcoded curves (read-only memory - ROM);
  • supported curves: sRGB EOTF, BT.709 inverse OETF, PQ EOTF and HLG OETF, Gamma 2.2, Gamma 2.4 and Gamma 2.6 EOTF.

The 1D LUT currently accepts 4096 entries of 8-bit. The data is interpreted as an array of struct drm_color_lut elements. Setting TF = Identity/Default and LUT as NULL means bypass.

AMD Plane 3x4 CTM (Color Transformation Matrix)

AMD Plane CTM data goes to the DPP Gamut Remap block, supporting a 3x4 fixed point (s31.32) matrix for color space conversions. The data is interpreted as a struct drm_color_ctm_3x4. Setting NULL means bypass.

AMD Plane Shaper: TF + 1D LUT

Described by: dc->caps.color.dpp.hw_3d_lut

The Shaper block fine-tunes color adjustments before applying the 3D LUT, optimizing the use of the limited entries in each dimension of the 3D LUT. On AMD hardware, a 3D LUT always means a preceding shaper 1D LUT used for delinearizing and/or normalizing the color space before applying a 3D LUT, so this entry on DPP color caps dc->caps.color.dpp.hw_3d_lut means support for both shaper 1D LUT and 3D LUT.

A pre-defined transfer function enables delinearizing content with or without a shaper LUT, where the AMD color module calculates the resulting shaper curve. Shaper curves go from linear values to encoded values. If we are already in a non-linear space and/or don’t need to normalize values, we can set an Identity TF for the shaper, which works similarly to a bypass and is also the default TF value.

Pre-defined transfer functions:

  • there is no DPP Shaper ROM. Curves are calculated by AMD color modules. Check calculate_curve() function in the file amd/display/modules/color/color_gamma.c.
  • supported curves: Identity, sRGB inverse EOTF, BT.709 OETF, PQ inverse EOTF, HLG OETF, and Gamma 2.2, Gamma 2.4, Gamma 2.6 inverse EOTF.

The 1D LUT currently accepts 4096 entries of 8-bit. The data is interpreted as an array of struct drm_color_lut elements. When setting Plane Shaper TF (!= Identity) and LUT at the same time, the color module will combine the pre-defined TF and the custom LUT values into the LUT that’s actually programmed. Setting TF = Identity/Default and LUT as NULL works as bypass.

AMD Plane 3D LUT

Described by: dc->caps.color.dpp.hw_3d_lut

The 3D LUT in the DPP block facilitates complex color transformations and adjustments. A 3D LUT is a three-dimensional array where each element is an RGB triplet. As mentioned before, dc->caps.color.dpp.hw_3d_lut describes whether the DPP 3D LUT is supported.

The AMD driver-specific interface advertises the size of a single dimension via the LUT3D_SIZE property. Plane 3D LUT is a blob property where the data is interpreted as an array of struct drm_color_lut elements and the number of entries is LUT3D_SIZE cubed. The array contains samples from the approximated function. Values between samples are estimated by tetrahedral interpolation. The array is accessed with three indices, one for each input dimension (color channel), blue being the outermost dimension, red the innermost. This distribution is better visualized when examining the code in [RFC PATCH 5/5] drm/amd/display: Fill 3D LUT from userspace by Alex Hung:

+	for (nib = 0; nib < 17; nib++) {
+		for (nig = 0; nig < 17; nig++) {
+			for (nir = 0; nir < 17; nir++) {
+				ind_lut = 3 * (nib + 17*nig + 289*nir);
+
+				rgb_area[ind].red = rgb_lib[ind_lut + 0];
+				rgb_area[ind].green = rgb_lib[ind_lut + 1];
+				rgb_area[ind].blue = rgb_lib[ind_lut + 2];
+				ind++;
+			}
+		}
+	}

In our driver-specific approach we opted to advertise its behavior to userspace instead of implicitly dealing with it in the kernel driver. AMD’s hardware supports 3D LUTs with 17 or 9 points per dimension (4913 and 729 entries respectively), and you can choose between 10-bit and 12-bit precision. In the current driver-specific work we focus on enabling only the 17-point, 12-bit 3D LUT, as in [PATCH v3 25/32] drm/amd/display: add plane 3D LUT support:

+		/* Stride and bit depth are not programmable by API yet.
+		 * Therefore, only supports 17x17x17 3D LUT (12-bit).
+		 */
+		lut->lut_3d.use_tetrahedral_9 = false;
+		lut->lut_3d.use_12bits = true;
+		lut->state.bits.initialized = 1;
+		__drm_3dlut_to_dc_3dlut(drm_lut, drm_lut3d_size, &lut->lut_3d,
+					lut->lut_3d.use_tetrahedral_9,
+					MAX_COLOR_3DLUT_BITDEPTH);

A refined control of 3D LUT parameters should go through a follow-up version or generic API.

Setting 3D LUT to NULL means bypass.

AMD Plane Blend/Out Gamma: TF + 1D LUT

Described by: dc->caps.color.dpp.ogam_ram

The Blend/Out Gamma block applies the final touch-up before blending, allowing users to linearize content after the 3D LUT and just before the blending. It supports both a 1D LUT and pre-defined TFs. We can see the Shaper and Blend LUTs as 1D LUTs that sandwich the 3D LUT. So, if we don’t need 3D LUT transformations, we may want to only use the Degamma block to linearize and skip Shaper, 3D LUT and Blend.

Pre-defined transfer function:

  • there is no DPP Blend ROM. Curves are calculated by AMD color modules;
  • supported curves: Identity, sRGB EOTF, BT.709 inverse OETF, PQ EOTF, HLG inverse OETF, and Gamma 2.2, Gamma 2.4, Gamma 2.6 EOTF.

The 1D LUT currently accepts 4096 entries of 8-bit. The data is interpreted as an array of struct drm_color_lut elements. If plane_blend_tf_property != Identity TF, AMD color module will combine the user LUT values with pre-defined TF into the LUT parameters to be programmed. Setting TF = Identity/Default and LUT to NULL means bypass.

MPC Color Pipeline: After Blending (Per CRTC)

DRM CRTC Degamma 1D LUT

The degamma lookup table (LUT) converts framebuffer pixel data before applying the color conversion matrix. The data is interpreted as an array of struct drm_color_lut elements. Setting NULL means bypass.

Not really supported. The driver is currently reusing the DPP degamma LUT block (dc->caps.color.dpp.dgam_ram and dc->caps.color.dpp.gamma_corr) to support the DRM CRTC Degamma LUT, as explained in [PATCH v3 20/32] drm/amd/display: reject atomic commit if setting both plane and CRTC degamma.

DRM CRTC 3x3 CTM

Described by: dc->caps.color.mpc.gamut_remap

It sets the current transformation matrix (CTM) applied to pixel data after the lookup through the degamma LUT and before the lookup through the gamma LUT. The data is interpreted as a struct drm_color_ctm. Setting NULL means bypass.

DRM CRTC Gamma 1D LUT + AMD CRTC Gamma TF

Described by: dc->caps.color.mpc.ogam_ram

After all that, you might still want to convert the content to wire encoding. No worries: in addition to the DRM CRTC 1D LUT, we’ve got an AMD CRTC gamma transfer function (TF) to make it happen. Possible TF values are defined by enum amdgpu_transfer_function.

Pre-defined transfer functions:

  • there is no MPC Gamma ROM. Curves are calculated by AMD color modules.
  • supported curves: Identity, sRGB inverse EOTF, BT.709 OETF, PQ inverse EOTF, HLG OETF, and Gamma 2.2, Gamma 2.4, Gamma 2.6 inverse EOTF.

The 1D LUT currently accepts 4096 entries of 8-bit. The data is interpreted as an array of struct drm_color_lut elements. When setting CRTC Gamma TF (!= Identity) and LUT at the same time, the color module will combine the pre-defined TF and the custom LUT values into the LUT that’s actually programmed. Setting TF = Identity/Default and LUT to NULL means bypass.

Others

AMD CRTC Shaper and 3D LUT

We have previously worked on exposing a CRTC shaper and a CRTC 3D LUT, but they were removed from the AMD driver-specific color series because they lack a userspace use case. The CRTC shaper and 3D LUT work similarly to the plane shaper and 3D LUT, but after blending (in the MPC block). The difference here is that setting (not bypassing) the Shaper and Gamma blocks together is not expected, since both blocks are used to delinearize the input space. In summary, we either set Shaper + 3D LUT or Gamma.

Input and Output Color Space Conversion

There are two other color capabilities of AMD display hardware that were integrated into DRM by previous work and are worth a brief explanation here. The DC Input CSC sets pre-defined coefficients from the values of the DRM plane color_range and color_encoding properties. It is used for color space conversion of the input content. On the other hand, we have the DC Output CSC (OCSC), which sets pre-defined coefficients from DRM connector colorspace properties. It is used for color space conversion of the composed image to the one supported by the sink.

The search for rainbow treasures is not over yet

If you want to understand a little more about this work, be sure to watch the two talks Joshua and I presented at XDC 2023 about AMD/Steam Deck colors on Gamescope.

In the time between the first and second part of this blog post, Uma Shankar and Chaitanya Kumar Borah published the plane color pipeline for Intel, and Harry Wentland implemented a generic API for DRM based on VKMS support. We discussed these two proposals and the next steps for Color on Linux during the Color Management workshop at XDC 2023, and I briefly shared the workshop results in the 2023 XDC lightning talk session.

The search for rainbow treasures is not over yet! We plan to meet again next year at the 2024 Display Hackfest in A Coruña, Spain (Igalia’s HQ) to keep up the pace and continue advancing today’s display needs on Linux.

Finally, a HUGE thank you to everyone who worked with me on exploring AMD’s color capabilities and making them available in userspace.

November 06, 2023

 If you remember the last update two weeks ago, I got MobileNetV1 working with good performance, and I was planning to move to upstreaming my changes to the Linux kernel and Mesa.

One of the kernel patches is now queued for the 6.7 release of the Linux kernel, and the other one has just been resent for reviews.

Regarding Mesa, I have made several cleanups and have started getting great review comments from Christian Gmeiner.

While waiting for feedback, I have started work on using the TP cores for tensor manipulation, which should be many times faster  than the naive code I was running on the CPU for this.

Got some jobs producing the correct results, but I'm facing a problem with the GPU hanging right afterwards. Have already made a pass at the whole set of data that is sent to the HW (unit configuration, command stream and registers), but haven't found the problem yet. I will next improve the tooling around this and get a better view of the differences.

I hacked Mesa to use the out-of-tree driver and my code works that way, so it has to be something in the kernel driver.

During the next weeks I will keep incorporating feedback and see how I can fix the GPU hang on TP jobs.


November 05, 2023

Linus has pulled the initial GSP firmware support for nouveau. This is just the first set of work to use the new GSP firmware and there are likely many challenges and improvements ahead.

To get this working you need to install the firmware which hasn't landed in linux-firmware yet.

For Fedora this copr has the firmware in the necessary places:

https://copr.fedorainfracloud.org/coprs/airlied/nouveau-gsp/build/6593115/ 

Hopefully we can upstream that in next week or so.

If you have an Ada-based GPU then it should just try and work out of the box; if you have Turing or Ampere you currently need to pass nouveau.config=NvGspRm=1 on the kernel command line to attempt to use GSP.
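On Fedora, one way to set that parameter persistently is with grubby (a sketch; adjust to your bootloader setup):

sudo grubby --update-kernel=ALL --args="nouveau.config=NvGspRm=1"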

Going forward, I've got a few fixes and stabilization bits to land, which we will concentrate on for 6.7, then going forward we have to work out how to keep it up to date and support new hardware and how to add new features.


November 03, 2023

This is the second part of the Xwayland rootful post, the first part is there

Using Xwayland rootful to run a full X11 desktop

Xwayland rootful can run more than just a window manager; it can just as well run an entire X11 desktop, for example with Xfce:

$ Xwayland -geometry 1024x768 -decorate :12 &
DISPLAY=:12 SESSION_MANAGER= GDK_BACKEND=x11 dbus-run-session startxfce4

Xfce running on Xwayland rootful in GNOME Shell on Wayland


Unfortunately, not all the keyboard shortcuts within the nested X11 session actually work, because some of those (such as Alt-Tab, for example) get processed by the Wayland compositor directly, instead of being forwarded to the nested environment.

This however isn't a problem specific to Wayland or Xwayland, an X11 window manager running in Xnest or Xephyr will have the same issues with keyboard shortcuts. To avoid that, Xephyr is able to „grab“ the keyboard and pointer so that all input events end up in the nested X11 session and do not get processed by the parent session.

Xwayland 23.1 has a similar functionality using the Wayland pointer locking & confinement protocol and the keyboard shortcuts inhibitor protocol.

So if your favorite Wayland compositor supports these protocols (in doubt, you can check that it is the case using „wayland-info“), you can use the „-host-grab“ option in Xwayland rootful:

$ Xwayland -geometry 1024x768 -decorate -host-grab :12 &
DISPLAY=:12 SESSION_MANAGER= GDK_BACKEND=x11 dbus-run-session startxfce4

Pressing the Control and Shift keys simultaneously will release the keyboard and pointer (just like with Xephyr actually).

Using Xwayland rootful to run a single X11 application

In some cases, it might be desirable to run a single X11 application isolated from the rest of the X11 clients, on its own X11 server.

On such a setup, one could run a single X11 client either maximized or fullscreen within Xwayland rootful.

Since Xwayland 23.2 allows the root window to be resized interactively, users can move and resize that window at will.

But for that to work, we need a simple X11 window manager that could resize the X11 client window along with the root window, using XRANDR notifications, such as the matchbox window manager for example.

$ Xwayland -geometry 1024x768 -decorate :12 &
matchbox-window-manager -display :12 &
$ GDK_BACKEND=x11 midori --display=:12

When the Xwayland rootful window is resized, corresponding XRANDR events are emitted, notifying the X11 window manager which in turn resizes the client window.

Using Xwayland rootful fullscreen

For years now, Xwayland rootless has had support for the viewport Wayland protocol, to emulate XRandR for legacy games, thanks to the work from Hans De Goede.

So the idea is to add a fullscreen mode to Xwayland rootful and take advantage of the Wayland viewports support to emulate resolution changes.

This is exactly what the „-fullscreen“ command line option does: it starts Xwayland rootful in fullscreen mode using the xdg_toplevel Wayland protocol and uses the existing viewport support to scale the window and to match the actual display physical resolution.

The emulated resolution is not even limited by the physical resolution, it's possible to use XRANDR to select an emulated resolution much higher than the actual monitor's resolution, quite handy to test X11 applications on high resolution without having to purchase expensive monitors!

$ Xwayland -fullscreen :12 &
matchbox-window-manager -display :12 &
$ xterm -display :12 &
$ xrandr -s 5120x2880 -display :12

Are we done yet?

Well, there's still one thing Xwayland is not handling well, it's HiDPI and fractional scaling.

With rootless Xwayland (as on a typical Wayland desktop session), all X11 clients share the same Xwayland server, and can span across different Wayland outputs of different scales.

Even though theoretically each Wayland surface associated with each X11 window could have a different scale factor set by Xwayland, all X11 clients on the same Xserver share the same coordinate space, so in practice different X11 windows cannot have different scale factors applied.

That's the reason why all the existing merge requests to add support for HiDPI to Xwayland set the same scale to all X11 surfaces. But that means that the rendered surface could end up being way too small depending on the actual scale the window is placed on, on a mixed-DPI multi-monitor setup (I already shared my views of the problem in this issue upstream).

But such limitation does not apply to rootful Xwayland, considering that all the X11 clients running on a rootful Xwayland actually belong to and remain within the same visible root window. They are part of the same visual entity and move all together along with the Xwayland rootful window.

So we could possibly add support for HiDPI (and hence achieve fractional scaling without blurred fonts) to rootful Xwayland. The idea is that Xwayland would set the surface scale to match the scale of the output it's placed on, and automatically resize its root window according to the scale, whenever that changes or when the rootful Xwayland window is moved from one monitor to another.

So for example, when Xwayland rootful with a size of 640×480 is moved from an output with scale 1 to an output with scale 2, the size of the root window (hence the Xwayland rootful window) would be automatically changed to 1280×960, along with the corresponding XRANDR notifications so that an X11 window manager running nested can adjust the X11 clients size and positions.

And if we want a way to communicate that to the X11 clients running within Xwayland rootful, we can use an X11 property on the root window that reflects the actual scale factor being applied. An X11 client could either use that property directly, or more likely, a simple dedicated daemon could adjust the scaling factor of the various X11 toolkits depending on the value set for Wayland scaling.

That's what that proposed merge request upstream does.

gnome-calculator running on Xwayland rootful with 150% fractional scaling

Of course, at the time of writing, this is just a merge request I posted upstream, and there is no promise that it will eventually be accepted. We'll see how that goes, but if it finds its way into Xwayland upstream, it will be part of the next major release of Xwayland some time next year.

October 30, 2023

I was at XDC 2023 in A Coruña a few days ago where I had the opportunity to talk about some of the work we have been doing on the Raspberry Pi driver stack together with my colleagues Juan Suárez and Maíra Canal. We talked about Raspberry Pi 5, CPU job handling in the Vulkan driver, OpenGL 3.1 support and how we are exposing GPU stats to user space. If you missed it here is the link to Youtube.

Big thanks to Igalia for organizing it and to all the sponsors, and especially to Samuel and Chema for all the work they put into making this happen.

October 27, 2023

🪑?

October 26, 2023

And Now For Something Slightly More Technical

It’s a busy, busy week here. So busy I’m slipping on my blogging. But that’s okay, because here’s one last big technical post about something I hate.

Swapchain readback.

So Easy Even You Could Accidentally Do It

I’m not alone in drinking the haterade on this one, but GL makes it especially easy to footgun yourself by not providing explicit feedback that you’re footgunning yourself.

I recently encountered a scenario in REDACTED where this behavior was commonplace. The command stream looked roughly like this:

  • draw some stuff
  • swapbuffers
  • blitframebuffer

And this happened on every single frame (???).

In Zink Terms…

This isn’t pretty. Zink has an extremely conformant method of performing swapchain readback which definitely works without issues in all cases. I’d explain it, but it wouldn’t make either of us happy, and I’ve got so much other stuff to do that I couldn’t possibly… Oh, you really want to know? Well don’t say I didn’t warn you.

Vulkan doesn’t allow readback from swapchains. By this, I mean:

  • swapchain images must be acquired before they can be accessed for any purpose
  • there is no method to explicitly reacquire a specific swapchain image
  • there is no guarantee that swapchain images are unchanged after present

Combined, once you have presented a swapchain image you’re screwed.

…According to the spec, that is. In the real world, things work differently.

Zink takes advantage of this “real world” utilization to implement swapchain readback. In short, the only method available is to spam present/acquire on the swapchain until the last-presented image is reacquired. Then it can be read back, and the image data is (probably) the same as when it was presented.

P E R F

This is not a speedy method of implementing readback. It requires a full sync, and it was designed for the purpose of passing unit tests, which it does perfectly. Performance was never a concern, because why would anyone ever be trying to do readback in… Why would anyone ever be trying to do readback in a performance-sensitive… Using OpenGL, why would anyone ever be…

Anyway, this is very unperformant, and here at SGC we hate all things of that nature. Given that I had my real world scenario from REDACTED in which this was happening every frame, something had to be done.

This solution isn’t performant in the absolute sense either, but it’s massively faster than what was happening previously. Once zink detects an app repeatedly footgunning itself at full speed, it activates readback mode for a swapchain and maintains a staging copy of every frame. This enables the image data to be read back at any time without synchronization at the cost of an extra full-frame copy. This roughly doubles FPS in the case I was testing, which is pretty good.

The functionality is already merged for the upcoming 23.3 release.

Footgun as hard as you want.

October 25, 2023

More Milestones

As everyone knows, Red Hat’s top RustiCL expert, Karol “But it’s only 10 o’clock?” Herbst, has been hard at work beating Mesa/Zink/RustiCL into shape. That effort continues to bear fruit, and with the merge of an upcoming MR it should be possible to pass OpenCL conformance with zink on multiple platforms.

This will make zink THE FIRST EVER CONFORMANT VULKAN-BASED OPENCL IMPLEMENTATION.

Great work all around. For up-to-the-second progress reports on this ecosystem-critical topic, don’t forget to follow Karol on social media.

October 24, 2023

Hi all, long time no see! It’s been more than two months since the last status update. My excuse for this silence is two-fold: I was on leave for 5 weeks, and then X.Org Developer’s Conference happened. During my time off, I’ve traveled in Korea and Japan. I will be blunt: these last two months have been fantastic! And to be honest, that’s a huge understatement.

Busan view from Jangsan

East gate

After my trip to Asia, I went to a 2-day Valve hackfest at Igalia’s headquarters. I met other Valve contractors there, and we discussed various topics such as color management, variable refresh rate, flicker-free startup, and more.

At XDC, there were lots of interesting talks and workshops: HDR by Joshua and Melissa, NVK by Faith, Asahi by Alyssa et al, wlroots frame scheduling by Rose (my GSoC student), CI by Martin, VKMS by Maíra, Wine Wayland by Alexandros, Wine X11 by Arek, and many more! Everything should be available online if you haven’t watched live. That said, as usual, the part I enjoyed the most is the so-called hallway track. It’s great to have free-form discussions with fellow graphics developers, it results in a pretty different train of thought than the usual focused discussions we have online.

Apart from these events, I’ve found some time to do a bit of actual work, too. I’ve re-spinned an old patch I wrote to introduce a new CLOSEFB IOCTL, to allow a DRM master to leave a framebuffer on-screen when quitting so that the next DRM master can take over without a black screen in-between. This time I also included a user-space patch and an IGT test (both requirements for new kernel uAPI). I sent (and merged) another kernel patch to fix black screens in some situations when unplugging USB-C docks.

On the Wayland side, I continued working on explicit synchronization, updating the protocol and submitting a gamescope patch. Joshua has been working on a Mesa patch, so all of the pieces are coming together now. On the SourceHut side, I’ve sent a patch to add HTTP/2 support to pages.sr.ht. It’s been merged and deployed, enjoy! The NPotTM is libicc, a small library to parse ICC profile files. Unlike LittleCMS, it provides lower-level access to the ICC structure and the exact color transformation operations.

That’s all for now, see you next month!

There is an issue with the rpmfusion-packaged IPU6 camera stack for Fedora: it is not working on many Dell laptop models after upgrading to a 6.5.y kernel.

This is caused by a new mainline ov01a10 sensor driver which takes precedence over the akmod ov01a10 driver but lacks VSC integration.

This can be worked around by running the following command:
sudo rm /lib/modules/$(uname -r)/kernel/drivers/media/i2c/ov01a10.ko.xz; sudo depmod -a

After the rm + depmod run:
sudo rmmod ov01a10; sudo modprobe ov01a10

Or reboot. After this your camera will hopefully work again.

I have submitted a pull-request to disable the mainline kernel's non working ov01a10 driver, so after the next Fedora kernel update this workaround should no longer be necessary.

Your Bug Has Already Been Solved

After yesterday’s post, I’m sure my thousands of readers stampeded to install the latest zink and run their system with it, and I salute you for your hard work in finding all those new ways to crash your systems.

Some of those crashes, however, are not my bugs. They’re system bugs.

In particular, any of you still using Xorg instead of Wayland will want to create this file:

$ cat /etc/X11/xorg.conf.d/30-dmabuf.conf
Section "ServerFlags"
	Option "Debug" "dmabuf_capable"
EndSection

This makes your xserver dmabuf-capable, which will be more successful when running things with zink.

Another problem you’re likely to have is this console error:

DRI3 not available
failed to load driver: zink

Specifically you’re likely to have this on AMD hardware, and the cause is almost certainly that you’ve installed some footgun package with a naming variation on xf86-video-amdgpu.

Delete this package.

Just delete it. I don’t know why distros still make it available, but if you have it installed then you’re just footgunning yourself.

If you’re still having problems after checking for both of these issues, try turning your computer on.

October 23, 2023

Progress

Since the last update I finally got the whole of MobileNetv1 running at full-accuracy on the NPU with Mesa:
tomeu@arm-64:~/mesa$ python3.10 classification.py -i grace_hopper.bmp -m mobilenet_v1_1.0_224_quant.tflite -l labels_mobilenet_quant_v1_224.txt -e libteflon.so
Loading external delegate from libteflon.so with args: {}
Processing the input took 18 ms.
Running the NN job took 13 ms.
Processing the output took 1 ms.
0.866667: military uniform
0.031373: Windsor tie
0.015686: mortarboard
0.007843: bow tie
0.007843: academic gown
time: 33.094ms
That takes us to a performance level around 3 times faster than running the same inference on the CPUs on the A311D SoC.

Most of the time (18 ms.) is spent in my naive manipulation of the input tensor, transposing and reshuffling it to match what the HW expects. Once we learn to do these operations on the 4 tensor manipulation cores, this time should be brought close to zero.

The 13 ms. that the convolutions take in the NPU is still noticeably higher than the 8 ms. that the blob achieves, but the optimizations mentioned in previous updates in this blog should bring us pretty close.
 

Next steps

Now that we have something that people can use in their products, I will switch to upstreaming mode.

I want to do a few cleanups to the Mesa code and then I will ask for people to review and ack so it can be merged. In the meantime, the draft merge request can be found here.

I would also like to have a CI job running to make sure it doesn't regress. But given that we don't use NIR as of yet and the dependencies with the rest of Mesa are minimal, there is probably little need as long as I'm the only person contributing to the code.


Almost That Time Again

As readers are no doubt aware by now, SGC goes into hibernation beginning around November, and that time is nearly upon us once more. To cap out another glorious year of shitposting (er, highly technical and informative blogging), I’ll be attempting to put up a newsworthy post every day.

This is Day 1.

Zink: No Longer A Hacky Workaround Driver

2023 has seen great strides in the zink ecosystem:

  • Some games, most notably my favorite game of all time X-Plane, are now shipping zink in order to have a consistent GL experience across platforms
  • Zink has reached official GL 4.6 conformance on Imagination GPUs and will be shipping as their GL implementation
  • Zink can now run display servers for both X and Wayland, enabling full systems to exist without a native GL implementation

And there’s plenty more, of course, but throughout all this progress has been one very minor, very annoying wrinkle.

MESA_LOADER_DRIVER_OVERRIDE=zink has to be specified in order to use zink, even if no other GL drivers exist on the system.
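For the record, forcing zink today and checking that it actually loaded looks something like this (assuming glxinfo is installed):

MESA_LOADER_DRIVER_OVERRIDE=zink glxinfo -B | grep -i renderer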

Or Does It?

Over a year ago I attempted to enable automatic zink loading if a native driver could not be loaded. It was a reasonable first attempt, but it had issues with driver loading in scenarios where hardware drivers were not permitted.

Work has slowly progressed in Mesa since that time, and various small changes have gradually pushed the teetering tower that is GLX/EGL in the direction anyone and everyone wanted, full stop.

The result is that on zink-enabled systems, loader environment variables will no longer be necessary as of the upcoming Mesa 23.3 release. If zink is your only GL driver, you will get zink rather than an automatic fallback to swrast.

I can’t imagine anyone will need it, but remember that issues can be reported here.

October 20, 2023

A bit of background

Xwayland is intended as a compatibility layer, to allow legacy X11 applications to continue to work in a Wayland environment.

Most Wayland compositors run Xwayland „rootless“ (using the command line option „-rootless“ when spawning Xwayland) so that X11 clients can integrate seamlessly with the other Wayland native clients, the Wayland compositor taking care of stacking the various windows (or surfaces) regardless of the client being X11 or Wayland native.

That actually works very well, so well that in many cases users do not even realize that any particular client is still running on X11, using Xwayland.

For that to work, the Wayland compositor needs to integrate a fully functional X11 window manager.

Sometimes, however, it is useful to use a separate X11 server to run X11 applications with another X11 window manager or even a full X11 environment.

Nested X11 servers

With X11, it is possible to run a nested X11 server such as Xnest or Xephyr, and run a full X11 environment within those nested X servers.

That can be useful for a number of reasons, like connecting remotely to a remote legacy Unix server using XDMCP (not that I would recommend that anyway!), or for testing a particular X11 application with different window managers, or even because a particular X11 application is certified only with a specific window manager. The possibilities are endless.

$ Xephyr -retro -screen 1024x768 :12

Xephyr running the Motif window manager on a GNOME Shell Wayland session

But Xnest or Xephyr are X11 clients themselves, meaning that they run on top of Xwayland when running on a Wayland compositor. That's a bit of a waste, using two X11 servers on top of a Wayland compositor.

Besides, with X.org development winding down, downstream maintainers and packagers may want to reduce the number of X11 servers they ship and have to maintain in the future.

What's wrong with Xwayland rootful?

Right, so if Xwayland already runs rootful by default, why not just use that instead of Xnest or Xephyr?

Well, up until Xwayland 23.1, Xwayland rootful would take its screen configuration from the Wayland compositor itself (using the wl_output or xdg-output Wayland protocols), meaning that when running rootful, Xwayland would map a surface the size of all the monitors, and the user would have no way to easily move or resize it.

That's far from being practical, especially when using a multi-monitor setup!

Making Xwayland rootful (more) usable

So the first step to help making Xwayland rootful suitable as a nested X11 server is to provide a command line option to specify the desired size of the Xwayland window.

That’s the „-geometry“ option introduced in Xwayland 23.1, so that one can specify the desired size of the Xwayland rootful window:

$ Xwayland -geometry 1024x768 :12

That will give you a black window of the specified size. If you want more of a „retro“ look, with the classic stipple background and the X cursor, you can use:

$ Xwayland -geometry 1024x768 -retro :12

Still, the Xwayland window is missing a title bar that would allow for moving the window around.

This is because Wayland does not decorate its surfaces; it is left to the Wayland clients themselves to add window decorations (also known as client side decorations, or CSD for short).

This however would add a lot of complexity to Xwayland (which is primarily an Xserver, not a full-fledged Wayland application). Thankfully, there is libdecor, which can shield Xwayland from that complexity and provide window decorations for us.

So if libdecor is installed on the system and Xwayland is built with libdecor enabled (this is an optional dependency though), then we can request that Xwayland uses decorations with the „-decorate“ command line option:

$ Xwayland -geometry 1024x768 -retro -decorate :12


Now we can have fun running some legacy X11 applications on that Xwayland rootful server:

$ xterm -display :12 &
$ twm -display :12 &
$ xsetroot -solid dodgerblue -display :12


We can even use „xrandr“ to query the size of the Xwayland window and resize it:
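Something along these lines (a quick sketch; the target size is an arbitrary example):

$ xrandr -display :12
$ xrandr -display :12 -s 800x600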


New with Xwayland 23.2, the Xwayland window is also resizable interactively, and the resulting display size is reflected in XRandR, creating an XRandR configuration to match the actual window size set interactively by the user.



October 18, 2023

This is my first blog post, ever!

I'm afraid there isn't much yet, but my intention is to post things related to Xwayland and various other projects I contribute to.

October 12, 2023

EBUSY

As everyone knows, SGC goes into yearly hibernation beginning in November. Leading up to that point has been a mad scramble to nail down all the things, leaving less time for posts here.

But there have been updates, and I’m gonna round ‘em all up.

R A Y T R A C W T F

Friend of the blog and future Graphics scientist with a PhD in WTF, Konstantin Seurer, has been hard at work over the past several weeks. Remember earlier this year when he implemented VK_EXT_descriptor_indexing for Lavapipe? Well he’s at it again, and this time he’s aimed for something bigger.

He’s now implemented raytracing for Lavapipe.

It’s a tremendous feat, one that sets him apart from the other developers who have not implemented raytracing for a software implementation of Vulkan.

CLosure

I blogged (or maybe imagined blogging) about RustiCL progress on zink last year at XDC, specifically the time renowned pubmaster Karol Herbst handcuffed himself to me and refused to divulge the location of the key (disguised as a USB thumb drive in his laptop) until we had basic CL support functioning in a pair programming exercise that put us up against the unnaturally early closing time of Minneapolis pubs. That episode is finally turning into something useful as CL support for zink will soon be merged.

While I can’t reveal too much about the performance as of yet, what I can say now is that it’s roughly 866% faster.

Fixups

A number of longstanding bugs have recently been fixed.

Wolfenstein Face

Anyone who has tried to play one of the modern Wolfenstein GL games on RADV has probably seen this abomination:

wolf-face.png

Wolfenstein Face affects a very small number of apps. Actually just the Wolfenstein (The New Order / The Old Blood) games. I’d had a ticket open about it for a while, and it turns out that this is a known issue in D3D games which has its own workaround. The workaround is now going to be applied for zink as well, which should resolve the issue while hopefully not causing others.

Apitrace: The Final Frontier

Since the dawn of time, experts have tried to obtain traces from games with rendering bugs, but some of these games have historically been resistant to tracing.

  • A number of games could be traced, but then replaying those traces would crash at a certain point. This is now fixed, enabling better bug reporting for a large number of AAA games from the last decade.
  • Another set of games using the id Engine could record traces, but then replaying them would fail to render correctly:

wolf-trace.png

This affects (at least) Wolfenstein: The Old Blood and DOOM2016, but the problem has been identified, and a fix is on the way.

Zink: Exploring New Display Systems

After a number of universally-reviled hacks, Zink should now work fine in both Wayland and Surfaceless EGL configurations.

The Real Post

Any other, lesser blogger would’ve saved this for another post in order to maximize their posting frequency metric, but here at SGC the readers get a full meal with every post even when they don’t have enough time to digest it all at once. Since I’m not going to XDC this year, consider this the thing I might have given a presentation on.

During my executive senior keynote seminar presentation workshop on zink at last year’s XDC, I brought up tiler performance as one of the known deficiencies. Specifically this was in regard to how tilers need to maximize time spent inside renderpasses and avoid unnecessary load/store operations when beginning/ending those renderpasses, which required either some sort of Vulkan extension to enable deferred load/store op setting OR command stream parsing for GL.

While I did work on a number of Vulkan extensions this year, deferred load/store ops wasn’t one of them.

So it was that I implemented renderpass tracking for Threaded Context to scan the GL command stream in the course of recording it for threaded dispatch. The CPU overhead is negligible (~5% on a couple extremely synthetic drawoverhead cases and nothing noticeable in apps), while the performance gains are staggering (~10-15x speedup in AAA games). All in all, it was a painful process but one that has yielded great results.

The gist of it, as I’ve described in previous posts that I’m too lazy to find links for, is that framebuffer attachment access is accumulated during TC command recording such that zink is able to determine which load/store ops are needed. This works great so long as nothing unexpected splits the renderpass. “Unexpected” in this context refers to one of the following scenarios:

  • zink receives a (transfer) command sequence which is impossible to reorder and must split the renderpass to execute copies/blits
  • the app randomly flushes during rendering
  • the GL frontend hits a TC synchronization point and halts the recording thread to wait for the driver thread to finish execution

The final issue remaining for renderpass tracking has been this third scenario: any time the GL frontend needs to sync TC, renderpass metadata is split. The splitting is such that a single renderpass becomes two because the driver must complete execution on the currently-recorded metadata in order to avoid deadlocking itself against the waiting GL frontend, but then the renderpass will continue after the sync. While this happens in a very small number of scenarios, one of them is quite common.

Texture uploading.

Texture Uploads: How Do They Work?

There are (currently) three methods by which TC can perform texture uploads:

  • for small uploads, the data is enqueued and passed asynchronously to the driver thread
  • for larger uploads:
    • if renderpass tracking is enabled and a renderpass is active, the upload will be sequenced into N strided uploads and passed asynchronously to the driver thread to avoid splitting renderpasses
    • otherwise TC syncs the driver thread and performs the upload directly

Eagle-eyed readers will notice that I’ve already handled the “problem” case described above; in order to avoid splitting renderpasses, I’ve written some handling which rewrites texture uploads into a sequence of N asynchronous buffer2image copies, where N is either 1 or $height depending on whether the source data’s stride matches the image’s stride. In the case where N is not 1, this can result in e.g., 4096 copy operations being enqueued for a 4096x4096 texture atlas. Even in the case where N is 1, it still adds an extra full copy of the texture data. While this is still more optimal than splitting a renderpass, it’s not optimal in the absolute sense.

You can see where this is going.

TC Execution: Define Optimal

Optimal Threaded Context execution is the state when the GL frontend is recording commands while the driver thread is deserializing those commands into hardware-specific instructions to submit to the GPU. Visually, it looks like this Halloween-themed diagram:

ideal.png

Ignoring the small-upload case, the current state of texture uploading looks like one of the following Halloween-themed diagrams:

  • the sequenced upload case will have more work, so the driver thread will run a bit longer than it otherwise would, resulting in the GL frontend waiting a bit longer than it otherwise would for completion

copies.png

  • the sync upload case creates a bubble in TC execution

sync.png

Solve For P

To maintain maximum performance, TC needs to be processing commands asynchronously in the driver thread while the GL frontend continues to record commands for processing. Thus, to maintain maximum performance during texture uploads, the texture upload needs to occur (without copies) while the driver thread continues executing.

Looking at this problem from a different perspective, the case that needs to be avoided at all costs is the case where the GL frontend syncs TC execution. The reason why this sync exists is to avoid accidentally uploading data to an in-use image, which would cause unpredictable (but definitely wrong) output. In this context, in-use can be defined as an image which is either:

  • enqueued in a TC batch for execution
  • enqueued/active in a GPU submission

On the plus side, pipe_context::is_resource_busy exists to query the second of these, so that’s solved. On the minus side, while TC has some usage tracking for buffers, it has nothing for images, and adding such tracking in a performant manner is challenging.

To figure out a solution for TC image tracking, let’s examine the most common problem case. In games, the most common scenario for texture uploading is something like this:

  • create staging image
  • upload texture data to staging image
  • draw to scene while sampling staging image
  • delete staging image

For such a case, it’d be trivial to add a seen flag to struct threaded_resource and pass the conditional if the flag is false. Since it’s straightforward enough to evaluate when an image has been seen in TC, this would suffice. Unfortunately, such a naive (don’t @ me about diacritics) implementation ignores another common pattern:

  • create staging image
  • upload texture data to staging image
  • draw to scene while sampling staging image
  • cache staging image for reuse
  • render frame
  • upload texture data to staging image
  • draw to scene while sampling staging image
  • cache staging image for reuse
  • render frame

For this scenario, the staging image is reused, requiring a bit more tracking in order to accurately determine that it can be safely used for uploads.

The solution I’ve settled on is to use a derivative of zink’s resource tracking. This adds an ID for the last-used batch to the resource, which can then be checked during uploads. When the image is determined idle, the texture data is passed directly to the driver for an unsynchronized upload similar to how unsynchronized buffer uploads work. It’s simple and hasn’t shown any definitive performance overhead in my testing.

Unfortunately, for this to really work to its fullest potential in zink, it requires VK_EXT_host_image_copy to avoid further staging copies, and nobody implements that yet in mesa main (except Lavapipe, though there’s also this ANV MR). But someday more drivers will support it, and then it’ll be great.

As far as non-tiler performance gains from this work, it’s hard to say definitively whether they’ll be noticeable. Texture uploads during loading screens are typically intermixed with shader compilation, so there’s little TC execution to unblock, but any game which uses texture streaming may see some slight latency improvements.

The only remaining future work here is to further enable unsynchronized texture uploads in zink by adding a special cmdbuf for unsynchronized uploads to handle the non-HIC case. Otherwise performance should be pretty solid across the board.

October 10, 2023

At the moment I am hard at work putting together the final bits for the AppStream 1.0 release (hopefully to be released this month). The new release comes with many new features, an improved developer API and removal of most deprecated things (so it carefully breaks compatibility with very old data and the previous C API). One of the tasks for the upcoming 1.0 release was #481, asking about a formal way to distinguish Linux phone applications from desktop applications.

AppStream infamously does not support any “is-for-phone” label for software components; instead, the decision whether something is compatible with a device is based on the device’s capabilities and the component’s requirements. This allows truly adaptive applications to describe their requirements correctly, and does not lock us into “form factors” going into the future, as there are many of them and the feature range between a phone, a tablet and a tiny laptop is quite fluid.

Of course the “match to current device capabilities” check does not work if you are a website ranking phone compatibility. It also does not really work if you are a developer and want to know which devices your component / application will actually be considered compatible with. One goal for AppStream 1.0 is to have its library provide more complete building blocks to software centers. Instead of just a “here’s the data, interpret it according to the specification” API, libappstream now interprets the specification for the application and provides API to handle most common operations – like checking device compatibility. For developers, AppStream also now implements a few “virtual chassis configurations”, to roughly gauge which configurations a component may be compatible with.

To test the new code, I ran it against the large Debian and Flatpak repositories to check which applications are considered compatible with what chassis/device type already. The result was fairly disastrous, with many applications not specifying compatibility correctly (many do, but it’s by far not the norm!). Which brings me to the actual topic of this blog post: very few people seem to really know how to mark an application compatible with certain screen sizes and inputs! This is most certainly a matter of incomplete guides and a lack of good templates, so maybe this post can help with that a bit:

The ultimate cheat-sheet to mark your app “chassis-type” compatible

As a quick reminder, compatibility is indicated using AppStream’s relations system: A requires relation indicates that the software will not run at all or will run terribly if the requirement is not met; if the requirement is not met, it should not be installable on a system. A recommends relation means that it would be advantageous to have the recommended items, but it’s not essential to run the application (it may run with a degraded experience without the recommended things though). And a supports relation means a given interface/device/control/etc. is supported by this application, but the application may work completely fine without it.

I have a desktop-only application

A desktop-only application is characterized by needing a larger screen to fit the application, and requiring a physical keyboard and accurate mouse input. This type is assumed by default if no capabilities are set for an application, but it’s better to be explicit. This is the metadata you need:

<component type="desktop-application">
  <id>org.example.desktopapp</id>
  <name>DesktopApp</name>
  [...]
  <requires>
    <display_length>768</display_length>

    <control>keyboard</control>
    <control>pointing</control>
  </requires>
  [...]
</component>

With this requires relation, you require a small-desktop sized screen (at least 768 device-independent pixels (dp) on its smallest edge) and require a keyboard and mouse to be present / connectable. Of course, if your application needs more minimum space, adjust the requirement accordingly. Note that if the requirement is not met, your application may not be offered for installation.

Note: Device-independent / logical pixels

One logical pixel (= device-independent pixel) roughly corresponds to the visual angle of one pixel on a device with a pixel density of 96 dpi (for historical X11 reasons) and a distance from the observer of about 52 cm, making the physical pixel about 0.26 mm in size. When using logical pixels as a unit, they might not always map to exact physical lengths, as their exact size is defined by the device providing the display. They do, however, accurately depict the maximum number of pixels that can be drawn in the given direction on the device’s display space. AppStream always uses logical pixels when measuring lengths in pixels.

I have an application that works on mobile and on desktop / an adaptive app

Adaptive applications have fewer hard requirements, but a wide range of support for controls and screen sizes. For example, they support touch input, unlike desktop apps. An example MetaInfo snippet for this kind of app may look like this:

<component type="desktop-application">
  <id>org.example.adaptive_app</id>
  <name>AdaptiveApp</name>
  [...]

  <requires>
    <display_length>360</display_length>
  </requires>

  <supports>
    <control>keyboard</control>
    <control>pointing</control>
    <control>touch</control>
  </supports>
  [...]
</component>

Unlike the pure desktop application, this adaptive application requires a much smaller lowest display edge length, and also supports touch input, in addition to keyboard and mouse/touchpad precision input.

I have a pure phone/tablet app

Making an application a pure phone application is tricky: We need to mark it as compatible with phones only, while not completely preventing its installation on non-phone devices (even though its UI may be unusable there, you may want to test the app, and software centers may allow its installation when requested explicitly even if they don’t show it by default). This is how to achieve that result:

<component type="desktop-application">
  <id>org.example.phoneapp</id>
  <name>PhoneApp</name>
  [...]

  <requires>
    <display_length>360</display_length>
  </requires>

  <recommends>
    <display_length compare="lt">1280</display_length>
    <control>touch</control>
  </recommends>
  [...]
</component>

We require a phone-sized minimum display edge (adjust the value to fit your app!), but then also recommend that the screen’s smallest edge be smaller than that of a larger tablet/laptop, while also recommending touch input and not listing any support for keyboard and mouse.

Please note that this blog post is of course not a comprehensive guide, so if you want to dive deeper into what you can do with requires/recommends/suggests/supports, you may want to have a look at the relations tags described in the AppStream specification.

Validation

It is still easy to make mistakes with the system requirements metadata, which is why AppStream 1.0 will provide more commands to check MetaInfo files for system compatibility. Current pre-1.0 AppStream versions already have an is-satisfied command to check if the application is compatible with the currently running operating system:

:~$ appstreamcli is-satisfied ./org.example.adaptive_app.metainfo.xml
Relation check for: */*/*/org.example.adaptive_app/*

Requirements:
 • Unable to check display size: Can not read information without GUI toolkit access.
Recommendations:
 • No recommended items are set for this software.
Supported:
 ✔ Physical keyboard found.
 ✔ Pointing device (e.g. a mouse or touchpad) found.
 • This software supports touch input.

In addition to this command, AppStream 1.0 will introduce a new one as well: check-syscompat. This command will check the component against libappstream’s mock system configurations that define a “most common” (whatever that is at the time) configuration for a respective chassis type.

If you pass the --details flag, you can even get an explanation why the component was considered or not considered for a specific chassis type:

:~$ appstreamcli check-syscompat --details ./org.example.phoneapp.metainfo.xml
Chassis compatibility check for: */*/*/org.example.phoneapp/*

Desktop:
  Incompatible
 • recommends: This software recommends a display with its shortest edge
   being << 1280 px in size, but the display of this device has 1280 px.
 • recommends: This software recommends a touch input device.

Laptop:
  Incompatible
 • recommends: This software recommends a display with its shortest edge 
   being << 1280 px in size, but the display of this device has 1280 px.
 • recommends: This software recommends a touch input device.

Server:
  Incompatible
 • requires: This software needs a display for graphical content.
 • recommends: This software needs a display for graphical content.
 • recommends: This software recommends a touch input device.

Tablet:
 ✔ Compatible (100%)

Handset:
 ✔ Compatible (100%)

I hope this is helpful for people. Happy metadata writing! 😀

October 06, 2023

Progress

Last week I was a bit distracted with the trip to Paris for the Embedded Recipes conference, but later I found some time for hacking and got some interesting results out of it.

Refactored the Gallium front-end

As commented in the previous update, I had found some limits in my testing due to the naive way that the front-end was scheduling jobs to the Gallium hardware-dependent driver.

I got to basically rewrite it (removing any C++ remnants along the way) and moved to a model in which the drivers compile the operation blocks that they support to a format that can be quickly sent to the hardware.

As a side effect, I got proper memory management of the workload which allowed me to expand the testing I can do in a reasonable amount of time.

Also took the chance to rewrite the higher level scheduling data structure so all jobs in the same model partition are sent to the hardware in a single batch, for decreased latency.

Unfortunately I didn't get to remove copies of input and output tensors because the TensorFlow Lite API for this (TfLiteAsyncKernel) is undocumented and far from trivial. They seem to just be adding stuff on top to abstract whatever the Android folks may end up wanting to do.

Got MobileNet V1 to run

As part of the refactoring from above, I got multiple operations in the same model to work, which allowed us to correctly run some inferences, even if at low accuracy rates:

Photo by Julien Langlois, CC BY-SA 3.0

tomeu@arm-64:~/mesa$ LD_PRELOAD=libtensorflow_lite.so python3.10 class_device.py -i hen.bmp -m mobilenet_v1_0.25_224_quant.tflite -l labels_mobilenet_quant_v1_224.txt -e libteflon.so
Loading external delegate from build/src/gallium/targets/teflon/libteflon.so with args: {}
tflite_plugin_create_delegate
Teflon delegate: loaded etnaviv driver
INFO: Initialized TensorFlow Lite runtime.
PrepareDelegate
VERBOSE: Replacing 27 out of 31 node(s) with delegate (Teflon Delegate) node, yielding 2 partitions for the whole graph.
0.960784: hen
0.015686: cock
0.007843: goose
0.003922: Pembroke
0.003922: Ibizan hound
time: 22.802ms
tflite_plugin_destroy_delegate

This matched the output from the blob bit by bit, even though I was doing some tensor operations by hand on the CPU. That also causes it to run far too slowly. We should be able to get that down to around 5 ms once we learn how to drive the TP units for tensor manipulation.

Presented this work at Embedded Recipes 2023

Tired of only writing about all this in this blog, I took the chance given to me by Kevin Hilman to present it in front of a captive audience.


You can find the slides here, and listen to the talk at:



Next steps

The previous update went more in depth into what is left to do in the medium term, so I will just mention what I plan to do in the immediate future:

  1. Get input and output channels working at the 512 level, so we can run a higher accuracy version of the MobileNet V1 network
  2. Learn to use the TP units to remove those costly transpositions and reshuffles on the CPU (at this point, we would have something useful to people in the field)
  3. Upstream changes to the Linux kernel
  4. Propose Teflon to the Mesa folks

September 26, 2023

Progress

With the kids back in school I have been able to work on the Vivante VIP NPU driver full-time during the two weeks after the last update, with quite some work coming out of the pipeline:

Found the problem with enabling the 8th NN core

Though I don't know exactly yet what the problem is, I found that by going back to a previous brute-force approach to powering up the NPU, the 8th core works just fine.

For now this unblocks the work and gets me closer to the initial goal of running a MobileNetv1 inference and seeing what the performance is like, so I'm leaving a proper fix for this for later.

I bet there's either a register that is being written in the wrong order, or a delay between register writes that is too short. Will have to delve into the power domain subsystem and/or the common clock framework in the Linux kernel to fix this one.

Added support for depthwise convolutions

MobileNetV1 introduced Separable Depthwise Convolutions (see the linked paper for an in-depth description), which are layers that contain a depthwise convolution to process each depth level separately, plus a pointwise convolution to rejoin them again. This offers the same result with 23x fewer multiplications, so it's very attractive for mobile use-cases.

This hardware doesn't support depthwise convolutions directly, but we can lower them to regular convolutions after modifying the weight tensor to cover each IFM/depth separately.

Added support for pointwise convolutions

For the second half of a Separable Depthwise Convolution, I just had to take into account that 1x1 kernels are packed in a different format in memory, as otherwise it would be very inefficient for each NN core to pull each 1-byte kernel separately from the memory bus.

Added support for unsigned weights

TensorFlow Lite has moved towards implementing a new quantization specification which gives preference to signed weights because of convenience, as symmetric quantization is simpler to implement. Unfortunately for us, our hardware works natively with unsigned weights so we would need to convert them if we were to use TFLite's new quantization.

But the models that Google themselves publish make use of the ancient tooling that still supports the old, unsigned quantization scheme, so I had to find a way of producing models with unsigned quantization for our test suite, to match what MobileNetV1 does.

That also implied moving to per-tensor quantization, instead of per-axis.

Added support for higher IFMs and OFMs (up to 256 each)

In the previous update I explained how support for multiple input and output channels (or feature maps) was added, but I wasn't able to test with more than 7 output channels because the 8th NN core was MIA.

With that solved, I was able to see what would be needed for convolutions with higher channel counts, such as those that MobileNetV1 use (32, 64, 128, 256, 512 and 1024).

Each level implied revisiting the tiled format in which weights and biases are laid out in memory, making it more and more complex.

I got to 256, with 512 and 1024 bringing more changes in the tiled format that I still need to reverse engineer.


Next steps

Model partition compilation and resource management

I'm facing problems with testing coverage as we support so many different parameters that need to be tested in combination, with an explosion in the number of individual tests. Because of the hacky current state of the TFLite delegate (and Gallium state tracker), I'm not able to run all the tests: I don't have proper resource management implemented yet, so we reach OOM before the end.

So my next task after I get back from Embedded Recipes will be to refactor the delegate implementation so we have a proper compilation of the model partitions. These will own the weight+bias buffers as well as the intermediate tensors, with each inference just feeding an input tensor to the partition and retrieving an output tensor at the end.

This will allow me to scale up the automated testing further, so I can keep adding new features with confidence, knowing that I'm not adding regressions.

Move development to Cottonwood A311D board

Da Xue of LibreComputer has got Etnaviv and Teflon working on the new boards that his company is releasing soon. One of them contains an A311D SoC, the same as the VIM3 I'm currently using for development. I will be initially targeting that one, and later make sure that it also works on the Cottonwood boards that will have the S905D3 SoC, which has a VIP Pico instead of a VIP Nano.

Besides being in general a great FOSS champion and specifically being supportive of ML inference with open source, Da is directly sponsoring this work, so I look forward to meeting him in Paris this week and exchanging notes.

Bigger coefficient tensors

The last known features missing before being able to run MobileNetV1 are IFMs and OFMs of 512 and 1024, each.

Hopefully it will only require some further tweaking of the tiled memory representation of the coefficient buffer.

Medium term goals

I don't expect performance to be that great yet, so I plan on switching the focus to it after the above has been accomplished. I expect the features below to make the most impact in improving performance:
  1. Avoid copies in and out of the model partition, by mapping user buffers to the NPU
  2. Use the TP units for tensor manipulation (transposing, mostly)
  3. Properly configuring the automatic caching of kernels and images in the internal on-chip SRAM
  4. Use the external SRAM for intermediate tensor data
  5. Chain all TP and NN jobs in a model partition in the same command stream
  6. Enable zero-run-length compression in the coefficient buffer
  7. Tune the tiling parameters for reduced memory bandwidth usage

September 25, 2023

Recently I’ve been working on a project where I needed to convert an application written in OpenGL to a software renderer. The matrix transformation code in OpenGL made use of the GLM library for matrix math, and I needed to convert the 4x4 matrices to be 3x3 matrices to work with the software renderer. There was some existing code to do this that was broken, and looked something like this:

glm::mat3 mat3x3 = glm::mat3(mat4x4);

Don’t worry if you don’t see the problem already, I’m going to illustrate in more detail with the example of a translation matrix. In 3D a standard translation matrix to translate by a vector (x, y, z) looks something like this:

[1 0 0 x]
[0 1 0 y]
[0 0 1 z]
[0 0 0 1]

Then when we multiply this matrix by a vector like (a, b, c, 1) the result is (a + x, b + y, c + z, 1). If you don’t understand why the matrix is 4x4 or why we have that extra 1 at the end don’t worry, I’ll explain that in more detail later.

Now using the existing conversion code to get a 3x3 matrix will simply take the first 3 columns and first 3 rows of the matrix and produce a 3x3 matrix from those. Converting the translation matrix above using this code produces the following matrix:

[1 0 0]
[0 1 0]
[0 0 1]

See the problem now? The (x, y, z) values disappeared! In the conversion process we lost these critical values from the translation matrix, and now if we multiply by this matrix nothing will happen since we are just left with the identity matrix. So if we can’t use this simple “cast” function in GLM, what can we use?

Well one thing we can do is preserve the last column and last row of the matrix. So assume we have a 4x4 matrix like this:

[a b c d]
[e f g h]
[i j k l]
[m n o p]

Then preserving the last row and column we should get a matrix like this:

[a b d]
[e f h]
[m n p]

And if we use this conversion process for the same translation matrix we will get:

[1 0 x]
[0 1 y]
[0 0 1]

Now we see that the (x, y) part of the translation is preserved, and if we try to multiply this matrix by the vector (a, b, 1) the result will be (a + x, b + y, 1). The translation is preserved in the conversion!

Why do we have to use this conversion?

The reason the conversion is more complicated is hidden in how we defined the translation matrix and the vector we wanted to translate. The vector was actually a 4D vector with the final component set to 1. The reason we do this is that we actually want to represent an affine space instead of just a vector space. An affine space is a type of space where you can have both points and vectors. A point is exactly what you would expect it to be: just a point in space relative to some origin. A vector is a direction with magnitude but no origin. This is important because, strictly speaking, translation isn’t actually defined for vectors in a normal vector space. Additionally, if you try to construct a matrix to represent translation for a vector space, you’ll find that it’s impossible to derive such a matrix, because translation is not a linear function.

To get around the problem of vector spaces, mathematicians more clever than I figured out you can implement an affine space in a normal vector space by increasing the dimension of the vector space by one, and by adding an extra row and column to the transformation matrices used. They called this a homogeneous coordinate system. This lets you say that a vector is actually just a point if the 4th component is 1, but if it’s 0 it’s just a vector. Using this abstraction one can implement all the well-defined operations for an affine space (like translation!).

So using the “homogeneous coordinate system” abstraction, translation is an operation defined by taking a point and moving it by a vector. Let’s look at how that works with the translation matrix I used as an example above. If you multiply that matrix by a 4D vector where the 4th component is 0, it will just return the same vector. Now if we multiply by a 4D vector where the 4th component is 1, it will return the point translated by the vector we used to construct that translation matrix. This implements the translation operation as it’s defined in an affine space!

If you’re interested in understanding more about homogeneous coordinate spaces (like how the translation matrix is derived in the first place), I would encourage you to look at resources like “Mathematics for Computer Graphics Applications”. They provide a much more detailed explanation than I am providing here. (The homogeneous coordinate system also has some benefits for representing projections, which I won’t get into here, but they are explained in that textbook.)

Now to finally answer the question about why we needed to preserve those final rows and columns. Based on what we now know, we weren’t actually just converting from a “3D space” to a “2D space”; we were converting from a “3D homogeneous space” to a “2D homogeneous space”. The process of converting from a higher-dimensional matrix to a lower-dimensional matrix is lossy, and some transformation details are going to be lost in the process (like, for example, the translation along the z-axis). There is no way to tell what kind of space a given matrix is supposed to transform just by looking at the matrix itself. The matrix does not carry any information about what space it’s operating in, and any conversion function would need to know that information to properly convert that matrix. Therefore we need to develop our own conversion function that preserves the transformations that are important to our application when moving from a “3D homogeneous space” to a “2D homogeneous space”.

Hopefully this explanation helps if you are ever working on converting 3D transformation code to 2D.

September 22, 2023

Remember Way Back When…

This blog was about pointlessly optimizing things? I’m talking like taking vkGetDescriptorSetLayoutSupport and making it fast. The kinds of optimizations nobody asked for and potentially nobody even wanted.

Well good news: this isn’t a post about those types of optimizations.

This is a post where I’m gonna talk about some speedups that you didn’t even know you craved but now that you know they exist you can’t live without.

The Vulkan Queue: What Is It?

Lots of people are asking, but surely nobody reading this blog since you’re all experts. But if you have a friend who wants to know, here’s the official resource for all that knowledge. It’s got diagrams. Images with important parts circled. The stuff that means whoever wrote it knew what they were talking about.

The thing this “official” resource doesn’t tell you is the queue is potentially pretty slow. You chuck some commands into it, and then you wait on your fence/semaphore, but the actual time it takes to perform queue submission is nonzero. In fact, it’s quite a bit larger than zero. How large is it? you might be asking.

I didn’t want to do this, but you’ve forced my hand.

The Vulkan Queue: Timing

What if I told you there was a tool for measuring things like this. A tool for determining the cost of various Vulkan operations. For benchmarking, one might say.

That’s right, it’s time to yet again plug vkoverhead, the best and only tool for doing whatever I’m about to do.

Like a prophet, my past self already predicted that I’d be sitting here writing this post to close out a week of types_of_headaches.meme -> vulkan_headaches.meme. That’s why vkoverhead already has the -submit-only option in order to run a series of benchmark cases which have numbers that are totally not about to go up.

Let’s look at those cases now to fill up some more page space and time travel closer to the end of my workweek:

  • submit_noop submits nothing. There’s no semaphores, no cmdbufs, it just submits and returns in order to provide a baseline
  • submit_50noop submits nothing 50 times, which is to say it passes 50x VkSubmitInfo structs to vkQueueSubmit (or the 2 versions if sync2 is supported)
  • submit_1cmdbuf submits a single cmdbuf. In theory this should be slower than the noop case, but I hate computers and obviously this isn’t true at all
  • submit_50cmdbuf submits 50 cmdbufs. In theory this should be slower than the single cmdbuf case, and, thankfully, this one particular time in which we have expectations of how computers work does match our expectations
  • submit_50cmdbuf_50submit submits 50 cmdbufs in 50 submits for a total of 50 cmdbufs per vkQueueSubmit call. This is the slowest test, you would think, and I thought that too, and the longer this explanation goes on the more you start to wonder if computers really do work at all like you expect or if this is going to upset you, but it’s Friday, and I don’t have anywhere to be except the gym, so I could keep delaying the inevitable for a while longer, but I do have to get to the gym, so sure, this is totally gonna be way slower than all the other tests, trust me™

It’s a great series of tests which showcase some driver pain points. Specifically it shows how slow submit can be.

Let’s check out some baseline results on the driver everyone loves to hang out with, RADV:

  40, submit_noop,                                        19569683,     100.0%
  41, submit_50noop,                                      402324,       2.1%
  42, submit_1cmdbuf,                                     51356,        0.3%
  43, submit_50cmdbuf,                                     1840,         0.0%
  44, submit_50cmdbuf_50submit,                            1031,         0.0%

Everything looks like we’d expect. The benchmark results ensmallen as they get more complex.

But why?

Why So Slow

Because if you think about it like a smart human and not a dumb pile of “thinking” sand, submitting 50 cmdbufs is submitting 50 cmdbufs no matter how you do it.

queue-think.png

Some restrictions apply, signal semaphores blah blah blah, but none of that’s happening here so what the fuck, RADV?

This is where we get into some real facepalm territory. Vulkan, as an API, gives drivers the ability to optimize this. That’s the entire reason why vkQueueSubmit has the submitCount param and takes an array of submits.

But what does Mesa do here? Well, in the current code there’s this gem:

for (uint32_t i = 0; i < submitCount; i++) {
   struct vulkan_submit_info info = {
      .pNext = pSubmits[i].pNext,
      .command_buffer_count = pSubmits[i].commandBufferInfoCount,
      .command_buffers = pSubmits[i].pCommandBufferInfos,
      .wait_count = pSubmits[i].waitSemaphoreInfoCount,
      .waits = pSubmits[i].pWaitSemaphoreInfos,
      .signal_count = pSubmits[i].signalSemaphoreInfoCount,
      .signals = pSubmits[i].pSignalSemaphoreInfos,
      .fence = i == submitCount - 1 ? fence : NULL
   };
   VkResult result = vk_queue_submit(queue, &info);
   if (unlikely(result != VK_SUCCESS))
      return result;
}

Tremendous. It’s worth mentioning that not only does this split the batched submits into individual ones, but each submit also allocates a struct to contain the submit info so that the drivers can use the same interface. So it’s increasing the kernel overhead by performing multiple submits and also increasing memory allocations.

Fast Forward

We’ve all been here before on SGC, and I really do need to get to the gym, so I’m authorizing a one-time fast forward to the results of optimizing this:

RADV GFX11:

  40, submit_noop,                                        19569683,     100.0%
  41, submit_50noop,                                      402324,       2.1%
  42, submit_1cmdbuf,                                     51356,        0.3%
  43, submit_50cmdbuf,                                     1840,         0.0%
  44, submit_50cmdbuf_50submit,                            1031,         0.0%
↓
  40, submit_noop,                                        21008648,     100.0%
  41, submit_50noop,                                      4866415,      23.2%
  42, submit_1cmdbuf,                                     51294,        0.2%
  43, submit_50cmdbuf,                                     1823,         0.0%
  44, submit_50cmdbuf_50submit,                            1828,         0.0%

That’s like 1000% faster for case #41 and 50% faster for case #44.

But how does this affect other drivers? I’m sure you’re asking next. And of course, this being the primary blog for distributing Mesa benchmarking numbers in any given year, I have those numbers.

Lavapipe:

  40, submit_noop,                                        1972672,      100.0%
  41, submit_50noop,                                      40334,        2.0%
  42, submit_1cmdbuf,                                     5994597,      303.9%
  43, submit_50cmdbuf,                                    2623720,      133.0%
  44, submit_50cmdbuf_50submit,                           133453,       6.8%
↓
  40, submit_noop,                                        1980681,      100.0%
  41, submit_50noop,                                      1202374,      60.7%
  42, submit_1cmdbuf,                                     6340872,      320.1%
  43, submit_50cmdbuf,                                    2482127,      125.3%
  44, submit_50cmdbuf_50submit,                           1165495,      58.8%

3000% faster for #41 and 1000% faster for #44.

Intel DG2:

  40, submit_noop,                                        101336,       100.0%
  41, submit_50noop,                                       2123,         2.1%
  42, submit_1cmdbuf,                                     35372,        34.9%
  43, submit_50cmdbuf,                                      713,          0.7%
  44, submit_50cmdbuf_50submit,                             707,          0.7%
↓
  40, submit_noop,                                        106065,       100.0%
  41, submit_50noop,                                      105992,       99.9%
  42, submit_1cmdbuf,                                     35110,        33.1%
  43, submit_50cmdbuf,                                      709,          0.7%
  44, submit_50cmdbuf_50submit,                             702,          0.7%

5000% faster for #41 and a big 🤕 for #44 because Intel.

Turnip A740:

  40, submit_noop,                                        1227546,      100.0%
  41, submit_50noop,                                      26194,        2.1%
  42, submit_1cmdbuf,                                     1186327,      96.6%
  43, submit_50cmdbuf,                                    545341,       44.4%
  44, submit_50cmdbuf_50submit,                           16531,        1.3%
↓
  40, submit_noop,                                        1313550,      100.0%
  41, submit_50noop,                                      1078383,      82.1%
  42, submit_1cmdbuf,                                     1129515,      86.0%
  43, submit_50cmdbuf,                                    329247,       25.1%
  44, submit_50cmdbuf_50submit,                           484241,       36.9%

4000% faster for #41, 3000% faster for #44.

Pretty good, and it somehow manages to still be conformant.

Code here.

September 15, 2023

But Not Mine

If you’re reading, thanks for everything.

Glamorous

I planned to blog about it a while ago, but then I didn’t, and news sites have since broken the news: Zink from Mesa main can finally run xservers.

Yes, it’s true. For the first time ever, you can install Mesa (from git) and use zink (with environment variables) to run your entire system (unless you’re on Intel).

But what was so challenging about getting this to work? The answer won’t surprise you.

WSI

Fans of the blog know that I’m no fan of WSI. If I had my way, GPUs would render to output buffers that we could peruse at our leisure using whatever methods we had at our disposal. Ideally manual inspection. Alas, few others share my worldview and so we all must suffer.

The root of all evil when it comes to computers is synchronization. This is triply so for anything GPU-related, and when all this “display server” chicanery is added in, the evilness value becomes one of those numbers so large that numerologists are still researching naming possibilities. There are two types of synchronization used with WSI:

  • implicit sync - “just fucking do it”
  • explicit sync - “I’ll tell you exactly when to do it”

From a user perspective, the former has less code to manage. The downside is that on the driver side things become more complex, as implicit sync is effectively layered atop explicit sync.

Another way of looking at it is:

  • implicit sync - OpenGL
  • explicit sync - Vulkan

And, since xservers run on GL, you can see where this is going.

Implicitly Terrible

Don’t get me wrong, explicit sync sucks too, but at least it makes sense. Broadly speaking, with explicit sync you have a dmabuf image, you submit it to the GPU, and you tell the server to display it.

In the words of venerable Xorg developer, EGL maintainer, and synchronization PTSD survivor Daniel Stone, the way to handle implicit sync is “vibes”. You have a dmabuf image, you glFlush, and magically it gets displayed.

Sound nuts? It is, and that’s why Vulkan doesn’t support it.

But zink uses Vulkan, so…

Send Eyebleach

Explicit sync is based on two concepts:

  • import
  • export

A user of a dmabuf waits on an export operation before using it (i.e., a wait semaphore), then signals an import operation at the end of a cmdbuf submission (i.e., a signal semaphore). Vulkan WSI handles this under the hood for users. But there’s no way to use Vulkan WSI with imported dmabufs, which means this all has to be copy/pasted around to work elsewhere.

In zink, all that happens in an xserver scenario is apps import/export dmabufs, sample/render them, and then do queue submission. To successfully copy/paste the WSI code and translate this into explicit sync for Vulkan, it’s necessary to be a bit creative with driver mechanics. The gist of it is:

  • when doing a queue import (from FOREIGN) for a dmabuf, create and queue an export (DMA_BUF_IOCTL_EXPORT_SYNC_FILE) semaphore to be waited on before the current cmdbuf
  • when triggering a barrier on any exported dmabuf, queue an import (DMA_BUF_IOCTL_IMPORT_SYNC_FILE) semaphore to be signaled after the current cmdbuf
  • at submit time, serialize all the wait semaphores onto a separate queue submission before the main cmdbuf
  • at submit time, serialize all the signal semaphores onto a separate queue submission after the main cmdbuf
  • pray for modifiers to match up

Big thanks to Faith “ARB_shader_image_load_store” Ekstrand for god-tier rubberducking when I was in the home stretch of this undertaking.

Anyway I expect to be absolutely buried in bug reports by next week from all the people testing this, so thanks in advance.

September 08, 2023

Overdue

It’s been a busy week, and I’ve got posts I’ve been meaning to write. The problem is they’re long and I’m busy. That’s why I’m doing a shorter post today, since even just getting this one out somehow took 4+ hours while I was continually sidetracked by “real work”.

But despite this being a shorter post, don’t worry: the memes won’t be shorter.

Fighting!

I got a ticket very recently that I immediately jumped on and didn’t at all procrastinate or forget about. The ticket concerned a little game called THE KING OF FIGHTERS XIV.

Now for those of you unfamiliar, The King of Fighters is a long-running fighting game franchise which debuted in the 90s. At the arcade. Pretty sure I played it once. But at like a retro arcade or something because I’m not that old, fellow kids.

The bug in question was that when a match is won using a special move, half the frame would misrender:

kof.jpg

Heroically, the reporter posted a number of apitrace captures. Unfortunately that effort ended up being ineffectual since it did nothing but reveal yet another apitrace bug related to VAO uploads which caused replays of the traces to crash.

It was the worst kind of bug.

I was going to have to test the defect myself.

It would prove to be the greatest test of my skill yet. I would have to:

  • start the correct game
  • win a match
  • win a match using a special move
  • renderdoc capture at the exact right time during the finishing animation to capture the bug

Was I successful?

I’m not saying there’s someone out there who’s worse at the test app than a guy performing exploratory tests on his keyboard under renderdoc. That’s not what I’m saying.

kof.png

Eddie Gordo Got Mad At This Combo

The debug process for this issue was, in contrast to the capture process, much simpler. I attribute this to the fact that, while I don’t own a gamepad for use with whatever test apps need to be run, I do have a code controller that I use for all my debugging:

code-controller.png

I’ve been hesitant to share such pro strats on the blog before, but SGC has been around for long enough now that even when the copycats start vlogging about my tech and showing off the frame data, everyone will recognize where it came from. All I ask is that you post clips of tournament popoffs.

Using my code controller, I was able to perform a debug -> light code -> light code -> debug -> heavy code -> compile -> block -> post meme -> reboot -> heavy code -> heavy code combo for an easy W.

To break down this advanced sequence, a small debug reveals that the issue is a render area clamped to 1024x1024 on a 1920x1080 frame. Since I have every line of the codebase memorized (zink main don’t @ me) it was instantly obvious that some poking was in order.

Vulkan has this pair of (awful) VUs:

VUID-VkRenderingInfo-pNext-06079
If the pNext chain does not contain VkDeviceGroupRenderPassBeginInfo or its deviceRenderAreaCount member is equal to 0, the width of the imageView member of any element of pColorAttachments, pDepthAttachment, or pStencilAttachment that is not VK_NULL_HANDLE must be greater than or equal to renderArea.offset.x + renderArea.extent.width

VUID-VkRenderingInfo-pNext-06080
If the pNext chain does not contain VkDeviceGroupRenderPassBeginInfo or its deviceRenderAreaCount member is equal to 0, the height of the imageView member of any element of pColorAttachments, pDepthAttachment, or pStencilAttachment that is not VK_NULL_HANDLE must be greater than or equal to renderArea.offset.y + renderArea.extent.height

which don’t match up at all to GL’s ability to throw whatever size framebuffer attachments at the GPU and have things come out fine. A long time ago I wrote this MR to clamp framebuffer size to the smallest attachment. But in this particular case, there are three framebuffer attachments:

  • 1920x1080
  • UNUSED
  • 1920x1080

The unused attachment ends up clamping the framebuffer to a smaller region to avoid violating spec, and this breaks rendering. Some light code pokes to skip clamping for NULL attachments open up the combo. Another quick debug doesn’t show the issue as being resolved, which means it’s time for some heavy code: checking for unused attachments in the fragment shader during renderpass start.

Naturally this triggers a full tree compile, which is a blocking operation that gives me enough time to execute a post meme for style points. The downside is that I’m using an AMD system, so as soon as I try to run the new code it hangs–it’s at this point that I nail a reboot to launch it into orbit.

I’m not looking for a record-setting juggle, so I finish off my combo with a heavy code -> heavy code finisher to hack in attachment write tracking for TC renderpass optimization and then plumb it through the rest of my stack so unused attachments will skip all renderpass-related operations.

Problem solved, and all without having to personally play any games.

Next Week

I’ll finally post that one post I’ve been planning to post for weeks but it’s hard and I just blew my entire meme budget for the month today so what is even going to happen who knows.

September 07, 2023

Progress

This week started quite fruitfully; these features were added:

  • Convolutions with multiple input and output channels (input and output feature maps)
  • "Same" padding in convolutions

And with this we should have all the features we need to run a model such as MobileNet v1 and get some performance numbers to guide the next steps.

One more roadblock

Only that the NPU hangs when I try to use the 8th core... and this is required to run most detection models, as they start by convolving the input to 32 feature maps.

I have checked that we are sending bit-identical command streams and input buffers to the kernel, so I suspect the problem will be somewhere in the kernel.

So I plan to instrument the out-of-tree kernel driver and get some register and command stream dumps, in the hope that there is some bit in a magic register somewhere that I need to flip.

Want to try it out?

I'm not really looking forward to such work, so I decided to first invest some time cleaning things up a bit to make it easier for other people to play with this if they wish.

I have removed from my branch everything from my previous attempt at using OpenCL and have written some documentation about how to run the TensorFlow Lite delegate:

https://gitlab.freedesktop.org/tomeu/mesa/-/blob/teflon/docs/teflon.rst

You will need a VIM3 board, a recent mainline kernel and a Debian testing rootfs.


September 03, 2023
This is my final blog post for GSoC ‘23. For me, the past few months have been both challenging and rewarding. Although I didn’t achieve all the goals I set out to do, I learned a lot in the process. What Was Done As mentioned in the last post, we created a new project, boot2ipxe, to build netboot disk image for different platforms. With that as the base image, ipxe-boot-server could generate disk image that boots boot2container from a HTTPS server.
Long time no see! It’s time to share my progress on the GSoC project. Last time we manually built a netboot SD card image for RPi 3B+ and left some questions to be answered. Now let’s dig into those questions and show you what I’ve got. Boot2ipxe First, a new project, Boot2ipxe, is created for building and testing iPXE netboot disk image for both SBCs and x86 PC, which used to be part of ipxe-boot-server’s job.
September 01, 2023

 Sriram invited me to the oneAPI meetup, and I felt I hadn't summed up the state of compute and community development in a while. Enjoy 45 minutes of opinions!

https://www.youtube.com/watch?v=HzzLY5TdnZo




if you disagree you’re wrong

August 28, 2023
Introduction Hello! This is the final report of the work I did as a Google Summer of Code 2023 contributor to NVK. My work revolved around the implementation of YCbCr format support, which came in form of enabling three Vulkan extensions. Mesa is the open-source, default graphics driver stack on Linux, with implementations for graphics hardware from most vendors. One such implementation is NV...

gitlab is down, post low-effort blogs and touch grass until it returns

August 27, 2023

The GSoC journey is coming to a close. In just over 100 days, I gained more experience in open-source development than I could ever imagine in this period.

Prior to GSoC, I was not used to regularly submitting patches to the mailing lists. Now, I’ve sent many patches and revisions. I believe my interaction with the community will only grow. I learned so much about the tools and workflow of kernel development.

After this experience, I’m more than certain that I want to make this a job, contributing to open-source is fun, so why not make this a living :)

Goals

The main goal of the project was to increase the code coverage on the DRM core helper functions by creating unit tests.

As the coverage of all helpers is a big task for the time period, I decided to create tests for the drm_format_helper.c functions.

Throughout the project, other side tasks appeared. I will list the contributions made below.

GSoC contributions

Linux Kernel - VKMS

VKMS is a software-only model of a KMS driver that is useful for testing and running X (or similar) on headless machines.

This was, unexpectedly, a big part of my GSoC. I learned a lot about color formats and how a graphics driver works. Currently, only one piece of my work was upstreamed, the rest needs more work and was postponed in favor of the primary project goal.

Patch Status
drm/vkms: Add support to 1D gamma LUT Accepted

For more information go check my blogpost about the whole process.

IGT

IGT GPU Tools is a collection of tools for the development and testing of DRM drivers. While working on VKMS I heavily used the IGT framework for testing; on one occasion a bug made a test stop working on VKMS, so I submitted a patch to fix that.

Patch Status
lib/igt_fb: Add check for intel device on use_enginecopy Accepted

Linux Kernel - DRM

In the DRM subsystem, I worked on the main project goal, contributing unit tests, and also helped fix some bugs that appeared while working on the tests. With the patches sent, I reached 71.5% line coverage and 85.7% function coverage in drm_format_helper.c.

Patch Status
drm/tests: Remove CONFIG_DRM_FBDEV_EMULATION on .kunitconfig Rejected
drm/tests: Alloc drm_device on drm_exec tests Accepted
drm/tests: Test default pitch fallback Accepted
drm/tests: Add KUnit tests for drm_fb_swab() Accepted
drm/tests: Add KUnit tests for drm_fb_clip_offset() Accepted
drm/tests: Add KUnit tests for drm_fb_build_fourcc_list() Accepted
drm/tests: Add multi-plane support to conversion_buf_size() Accepted
drm/tests: Add KUnit tests for drm_fb_memcpy() Accepted

Challenges Faced

I think the most difficult task was describing my work. Whether in blog posts or in commit messages, it takes a lot of work to write what you’ve done concisely and clearly. With time you get the hang of it, but I think I can still improve on this front.

Moreover, many times I had to debug some problems. I already knew how to use GDB, but using it in the kernel is a little more cumbersome. After searching, reading the documentation, and getting tips from my mentors, I got it working.

On VKMS, I had to create new features, and this requires a lot of thought. I made a lot of diagrams in my head to understand how the color formats would be laid out in memory, and yet most of my work hasn’t seen the light of day XD.

What is left to do

I was able to do most of the proposed tasks. The drm_xfrm_toio tests were left out due to the difficulty of testing it, as it uses IO memory. I also tested drm_fb_blit(), but I’m waiting for the acceptance of a pending patchset before sending that work; with it, line coverage will go to 89.2% and function coverage to 94.3%.

Community Interaction

Besides submitting patches, I reviewed some too. Being on the other side, I enjoyed thinking about how a piece of code could be improved.

Also, on one occasion I started a discussion about the best way to solve an issue by sending a patch. This got me a Reported-by tag on the patch that fixed the bug.

Patch
Re: [PATCH] drm/format_helper: Add Kunit tests for drm_fb_xrgb8888_to_mono()
Re: [PATCH] drm/format_helper: Add Kunit tests for drm_fb_xrgb8888_to_mono()
Re: [PATCH v2] drm/format-helper: Make conversion_buf_size() support sub-byte pixel fmts
Re: [PATCH v3 1/2] drm/format-helper: Add Kunit tests for drm_fb_xrgb8888_to_mono()
Re: [PATCH 1/5] drm/tests: Test drm_rect_intersect()
Re: [PATCH v2 1/5] drm/tests: Test drm_rect_intersect()
Re: [PATCH v2 1/7] drm/vkms: isolate pixel conversion functionality
Re: [PATCH v3 1/6] drm/vkms: isolate pixel conversion functionality
Re: [PATCH v4 3/5] drm/tests: Add test cases for drm_rect_calc_vscale()
Re: [PATCH v2 1/2] drm/vkms: allow full alpha blending on all planes
Re: [PATCH v2 1/2] drm: Add fixed-point helper to get rounded integer values
Re: [PATCH v2 2/2] drm/vkms: Fix RGB565 pixel conversion
Re: [PATCH v3 1/2] drm: Add fixed-point helper to get rounded integer values
Re: [PATCH 0/3] drm/vkms: Minor Improvements
Re: [PATCH v2] drm/vkms: Fix race-condition between the hrtimer and the atomic commit
Re: [PATCH] drm/tests: Add test case for drm_rect_clip_scaled()
Re: [PATCH v4] drm/vkms: Add support to 1D gamma LUT
Re: [PATCH] drm/tests: Remove CONFIG_DRM_FBDEV_EMULATION on .kunitconfig
Re: [PATCH] drm/tests: Remove CONFIG_DRM_FBDEV_EMULATION on .kunitconfig
Re: [PATCH] drm/tests: Alloc drm_device on drm_exec tests
Re: [PATCH] drm: Drop select FRAMEBUFFER_CONSOLE for DRM_FBDEV_EMULATION
Re: [PATCH -next 6/7] drm/format-helper: Remove unnecessary NULL values
Re: [PATCH 6/6] drm/format-helper: Add KUnit tests for drm_fb_memcpy()
Re: [PATCH v2 6/6] drm/tests: Add KUnit tests for drm_fb_memcpy()

Moreover, I use a Thunderbird addon to get diffs properly highlighted. While tinkering with its configuration, I noticed that the CSS of the configuration menu was broken, which made the user experience pretty bad.

I sent a patch fixing that to the maintainer of the addon, and it generated a discussion that led to a complete rework of the CSS file due to Thunderbird updates.

Acknowledgments

I’d like to thank my mentors, André “Tony” Almeida, Maíra Canal, and Tales L. Aparecida. Their support and expertise were invaluable throughout this journey.

Moreover, I thank the X.Org Foundation for allowing me to participate in this program, and also for accepting my talk proposal on the XDC 2023.

Lastly, I thank my colleague Carlos Eduardo Gallo for exchanging knowledge during the weekly meetings.

August 26, 2023
Introduction: Not That Type of Advertisement Hello again! Last time, we talked about implementing multi-plane format support, and now after having everything fully wired up, we confirmed that it compiles and hence it’s time to ship it and move on to the next stage: YCbCr advertisement. Hm? I forgot something? It probably wasn’t important then. Probably. Maybe. :^) Put simply, advertisement is...
August 25, 2023

It's 5am and I have a headache. The perfect time for some reflection!

Not only that, but I've just had to play the part of Static Site Ungenerator, because I found out that I deleted the source of the last post and I didn't want to lose it in the upcoming publish. If your Atom feed went funky, sorry.

This document is my Final Work Submission, but is fun for all the family, including the ones who don't work at Google. Hi everyone!

What we wanted to happen

Going into the summer, the plan was to add functionality to wlroots so that its users (generally Wayland compositors) could more easily switch to a smarter frame schedule. I've had many goes at explaining the problem and they all sucked, so here we go again: if a compositor puts some thought into when it starts its render, desktop latency as perceived by the user can decrease. The computer will feel snappier.

wlroots started the summer with no accommodations for compositors that wanted to put thought into when they start to render. It assumed exactly no thought was to be put in, and left you on your own if you were to decide otherwise. But that has all changed!

The aim of my work could have comprised three things, but I added a fourth and then didn't have time for the third:

  1. measurement - a way to determine how long a render job took, from start (on the CPU) to finish (on the GPU).
  2. scheduling - the system that chooses when to tell a compositor that it should render. this wants a way for wlroots users to dictate when they want to start rendering a new frame, relative to when this frame is due to be displayed
  3. prediction - some clever maths that learns from the time taken by previous renders and guesses how long the next one will take. this allows for moving the render start time closer to the frame deadline (good), carefully enough to avoid missing the deadline (very bad)
  4. bonus! tracing - measurement, but for humans. we're getting 60 new numbers every second, and it's going to be hard to make sense of them if they're just being printed out to the console

What happened

After some flailing around trying to add a delay to the existing scheduling, I started writing patches worth landing.

First came the render timer API. Now we can measure the duration of our render passes. This MR brought an abstraction for timers, and an implementation for wlroots' gles2 renderer.

Next, the scene timer API. wlr_scene does some of its own work before setting off the render pass itself, so it needed to become aware of timers and expose a way to use them.

Meanwhile, I was having another stab at configuring a frame delay. It wasn't very good, and the design of wlroots' scheduling and the complexity of the logic underneath it turned out to take a long time to get through. With this MR, though, I had a better idea of where I was trying to go. A long thought process followed, much of which lives in this issue, and further down we'll see what came of that.

Before working on a prediction algorithm, I wanted to be able to see live feedback on how render timings behaved and which frames were missed so that I could do a good (informed) job of predicting them. I took a detour into the world of tracing. libuserevents was spawned and so was the work to make use of it in wlroots. Linux's user_events tracing interface was appealing because it meant that GPUVis, an existing tool that can display a timeline of CPU and GPU events, would be able to show wlroots' events. Unfortunately Linux and I have so far struggled to get along and this work is still in progress - no submission yet because it's broken. Even more unfortunately, this meant that I wasn't able to get around to prediction.

Then I got tired of fighting that, and despite the words of discouragement...

“it's kind of impossible and I don't think we can do much better than !4214”

- a fool, a moron, a silly goose (me)

a refactor of wlroots' frame scheduling that allows us to do much better than !4214: !4307! This hasn't quite made it past the finish line, but it's close; I can feel it in my frames. It (in my opinion) neatly extracts the hairy logic that lived in wlr_output into a helper interface, allowing users to swap out which frame scheduler they use, or to forgo the helpers and roll their own without there being bits and pieces left over in the parts of wlroots that they do care about. This is the most exciting piece of the puzzle IMO; wlr_output has grown to have its fingers in many pies, and this MR reduces that and leaves wlr_output a little bit more friendly in a way that took a lot of brain cycles but turned out clean.

This new interface doesn't come with a frame delay option for free, but an implementation of the interface that has this feature is underway: !4334. It fits nicely! We hashed it out a little on IRC because the frame delay option is a surprisingly tricky constraint on the interface, but I think the conclusion is good. It was definitely a lot easier to write this with confidence after the scheduling redesign :)

To make this scheduling design possible and clean, a couple of little changes were needed in other areas, and thankfully the case for these changes was easy to make. They're helpful to me, but also make those parts of wlroots less surprising and/or broken. There was also a discussion about the fate of wlr_output.events.needs_frame, which is an extra complexity in wlroots' frame scheduling. It turned out that while removing it is possible, it wasn't necessary for the new scheduling system, so it continues in the background.

Loose ends

While libuserevents is usable, the wlroots integration is not ready.

There is sadly no "stock" plug-and-play prediction algorithm in wlroots.

The new scheduling infrastructure has not landed but I'm sure it will Soon™. The implementation with the frame delay option will hopefully follow shortly after. When (touch wood) it does, compositors will have to bring their own prediction algorithm, but a "good enough" algorithm can be very simple and given the current interface design can easily be swapped out for a stock one if one materialises.
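To make “very simple” concrete, here is a toy sketch of such a predictor (my own illustration, not wlroots code): remember the last few measured render durations and predict the maximum of those plus a safety margin.

/* Toy render-time predictor: max of the last HISTORY samples plus a margin. */
#include <stdint.h>

#define HISTORY 8

struct render_predictor {
	int64_t samples_ns[HISTORY];
	int count, pos;
};

static void predictor_push(struct render_predictor *p, int64_t duration_ns)
{
	p->samples_ns[p->pos] = duration_ns;
	p->pos = (p->pos + 1) % HISTORY;
	if (p->count < HISTORY)
		p->count++;
}

static int64_t predictor_guess(const struct render_predictor *p, int64_t margin_ns)
{
	int64_t max = 0;
	for (int i = 0; i < p->count; i++) {
		if (p->samples_ns[i] > max)
			max = p->samples_ns[i];
	}
	/* With no samples yet this degenerates to just the margin, i.e.
	 * "start rendering almost immediately". */
	return max + margin_ns;
}

A compositor would feed predictor_push() with the timer measurements described above and use predictor_guess() to decide how long before the deadline it needs to start rendering.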

And finally, the funniest one. I wrote an implementation of the timer API for wlroots' Vulkan renderer, and then put off submitting it for two months because everything else was more important. gles2 is the default renderer and supports roughly every GPU in existence. Writing the Vulkan timer was fun but landing it was less of a priority than every other task I had and nothing really depended on it, so it remains stuck on my laptop to this day. Perhaps I should get round to that.

Closure

The project didn't go how I expected it to - not even close. I even wrote up a schedule as part of my application that almost immediately turned out completely wrong. I'm not bothered, though, because it was fun, I made myself useful, and I met some cool people.

If you're considering doing something like I did, I can happily recommend Simon as a mentor, X.Org, and GSoC, in that order. Much love to Simon for making me feel comfortable when I really didn't know what I was doing, and for participating in my wildly off-topic free software rambles. I've only interacted with a small part of the X.Org community so far but it struck me from the start how welcoming everyone is; I have no doubts that the other X.Org project mentors are as lovely in their own ways. And of course, as a strong proponent of software that doesn't suck that's free, I have to appreciate that GSoC gave me a welcoming place to do my part in that and relieve my worldly pressures (did you know you have to pay for internet??).

Thanks everyone for putting up with me. If you would like to put up with me some more, click the links on the left - I'm not going anywhere, there's still work to do!

August 24, 2023

Progress

Managed to squeeze some time between holidaying to hack on the NPU driver and got something out of it.

Since the last update I have:

  • implemented support for strided convolutions with more than one input channel, and
  • implemented support for more than one output channel, but for now only for a single input channel.

Next steps are to support convolutions with multiple input and output channels (sketched naively below), and padding. Then see what is still missing so we can run MobileNet v1 and check the performance when using the NN units and doing the rest on the CPU.
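For reference, the kind of operation being offloaded, a strided 2D convolution over multiple input and output channels, can be sketched naively as below (shapes, layout and names are illustrative only, not what the driver does):

/* Naive NHWC convolution, no padding; weights laid out as [oc][kh][kw][ic]. */
static void conv2d_naive(const float *in, int in_h, int in_w, int in_c,
                         const float *weights, int k_h, int k_w,
                         float *out, int out_c, int stride)
{
    int out_h = (in_h - k_h) / stride + 1;
    int out_w = (in_w - k_w) / stride + 1;

    for (int oc = 0; oc < out_c; oc++)
        for (int oy = 0; oy < out_h; oy++)
            for (int ox = 0; ox < out_w; ox++) {
                float acc = 0.0f;
                for (int ky = 0; ky < k_h; ky++)
                    for (int kx = 0; kx < k_w; kx++)
                        for (int ic = 0; ic < in_c; ic++) {
                            int iy = oy * stride + ky;
                            int ix = ox * stride + kx;
                            acc += in[(iy * in_w + ix) * in_c + ic] *
                                   weights[((oc * k_h + ky) * k_w + kx) * in_c + ic];
                        }
                out[(oy * out_w + ox) * out_c + oc] = acc;
            }
}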

As a reminder, I'm pushing all the code to this branch: https://gitlab.freedesktop.org/tomeu/mesa/-/commits/teflon/.

IRC channel

A bunch of us have started to gather in the #ml-mainline IRC channel in OFTC to discuss matters around doing accelerated ML with mainline, on embedded.

For those of you who may not have an IRC bouncer set up yet, you can easily join with the web chat UI, but in case others aren't in front of the keyboard when you type your question, I recommend using element.io with the Matrix IRC bridge:

https://blog.christophersmart.com/2022/03/21/joining-a-bridged-irc-network-on-element-matrix/

Embedded recipes

I have been invited to give a talk about all this ML with mainline effort at Embedded Recipes 2023, Paris 28-29 September. Slides and a recording will be published after the conference ends.

Sponsor

Last but not least, if I am able to invest so much effort in this, it is because the folks at LibreComputer have been supporting me financially these last couple of months.

Thanks to Da Xue for his support, it is greatly appreciated! It is awesome to see SBC vendors investing in the Linux upstream ecosystem.

August 23, 2023

fml.png

Addendum

It turns out this was the product of a tiler optimization I did earlier this year to pipeline texture uploads without splitting renderpasses. I was (wrongly) assuming that the PBO stride would always match the image format stride, which broke functionality in literally just this one corner case.
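To make the corner case concrete, here is a hedged sketch (my own illustration, not the actual Mesa code) of the difference between the wrong assumption and a stride-aware copy:

/* Upload rows from a PBO whose row stride may differ from the tightly
 * packed image stride; only when all strides match can one big memcpy work. */
#include <stdint.h>
#include <stddef.h>
#include <string.h>

static void copy_rows(uint8_t *dst, size_t dst_stride,
                      const uint8_t *src, size_t src_stride,
                      size_t row_bytes, unsigned rows)
{
    if (dst_stride == row_bytes && src_stride == row_bytes) {
        memcpy(dst, src, row_bytes * rows);   /* fast path: strides match */
        return;
    }
    for (unsigned y = 0; y < rows; y++)       /* slow path: copy row by row */
        memcpy(dst + y * dst_stride, src + y * src_stride, row_bytes);
}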

August 22, 2023

Conformant OpenGL® ES 3.1 drivers are now available for M1- and M2-family GPUs. That means the drivers are compatible with any OpenGL ES 3.1 application. Interested? Just install Linux!

For existing Asahi Linux users, upgrade your system with dnf upgrade (Fedora) or pacman -Syu (Arch) for the latest drivers.

Our reverse-engineered, free and open source graphics drivers are the world’s only conformant OpenGL ES 3.1 implementation for M1- and M2-family graphics hardware. That means our driver passed tens of thousands of tests to demonstrate correctness and is now recognized by the industry.

To become conformant, an “implementation” must pass the official conformance test suite, designed to verify every feature in the specification. The test results are submitted to Khronos, the standards body. After a 30-day review period, if no issues are found, the implementation becomes conformant. The Khronos website lists all conformant implementations, including our drivers for the M1, M1 Pro/Max/Ultra, M2, and M2 Pro/Max.

Today’s milestone isn’t just about OpenGL ES. We’re releasing the first conformant implementation of any graphics standard for the M1. And we don’t plan to stop here ;-)

Teaser of the “Vulkan instancing” demo running on Asahi Linux

Unlike ours, the manufacturer’s M1 drivers are unfortunately not conformant for any standard graphics API, whether Vulkan or OpenGL or OpenGL ES. That means that there is no guarantee that applications using the standards will work on your M1/M2 (if you’re not running Linux). This isn’t just a theoretical issue. Consider Vulkan. The third-party MoltenVK layers a subset of Vulkan on top of the proprietary drivers. However, those drivers lack key functionality, breaking valid Vulkan applications. That hinders developers and users alike, if they haven’t yet switched their M1/M2 computers to Linux.

Why did we pursue standards conformance when the manufacturer did not? Above all, our commitment to quality. We want our users to know that they can depend on our Linux drivers. We want standard software to run without M1-specific hacks or porting. We want to set the right example for the ecosystem: the way forward is implementing open standards, conformant to the specifications, without compromises for “portability”. We are not satisfied with proprietary drivers, proprietary APIs, and refusal to implement standards. The rest of the industry knows that progress comes from cross-vendor collaboration. We know it, too. Achieving conformance is a win for our community, for open source, and for open graphics.

Of course, Asahi Lina and I are two individuals with minimal funding. It’s a little awkward that we beat the big corporation…

It’s not too late though. They should follow our lead!


OpenGL ES 3.1 updates the experimental OpenGL ES 3.0 and OpenGL 3.1 we shipped in June. Notably, ES 3.1 adds compute shaders, typically used to accelerate general computations within graphics applications. For example, a 3D game could run its physics simulations in a compute shader. The simulation results can then be used for rendering, eliminating stalls that would otherwise be required to synchronize the GPU with a CPU physics simulation. That lets the game run faster.

Let’s zoom in on one new feature: atomics on images. Older versions of OpenGL ES allowed an application to read an image in order to display it on screen. ES 3.1 allows the application to write to the image, typically from a compute shader. This new feature enables flexible image processing algorithms, which previously needed to fit into the fixed-function 3D pipeline. However, GPUs are massively parallel, running thousands of threads at the same time. If two threads write to the same location, there is a conflict: depending which thread runs first, the result will be different. We have a race condition.

“Atomic” access to memory provides a solution to race conditions. With atomics, special hardware in the memory subsystem guarantees consistent, well-defined results for select operations, regardless of the order of the threads. Modern graphics hardware supports various atomic operations, like addition, serving as building blocks to complex parallel algorithms.

Can we put these two features together to write to an image atomically?

Yes. A ubiquitous OpenGL ES extension, required for ES 3.2, adds atomics operating on pixels in an image. For example, a compute shader could atomically increment the value at pixel (10, 20).

Other GPUs have dedicated instructions to perform atomics on images, making the driver implementation straightforward. For us, the story is more complicated. The M1 lacks hardware instructions for image atomics, even though it has non-image atomics and non-atomic images. We need to reframe the problem.

The idea is simple: to perform an atomic on a pixel, we instead calculate the address of the pixel in memory and perform a regular atomic on that address. Since the hardware supports regular atomics, our task is “just” calculating the pixel’s address.

If the image were laid out linearly in memory, this would be straightforward: multiply the Y-coordinate by the number of bytes per row (“stride”), multiply the X-coordinate by the number of bytes per pixel, and add. That gives the pixel’s offset in bytes relative to the first pixel of the image. To get the final address, we add that offset to the address of the first pixel.

Address of (X, Y) = Address of (0, 0) + Y × Stride + X × Bytes Per Pixel
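On the CPU side, the same trick can be sketched in a few lines of C (my own illustration; it assumes a 32-bit pixel format so that a 32-bit atomic applies, and uses a GCC/Clang builtin for the atomic):

/* Atomic add on pixel (x, y) of a *linear* 32bpp image. */
#include <stdint.h>

static uint32_t image_atomic_add_linear(void *base, uint32_t stride_bytes,
                                        uint32_t x, uint32_t y, uint32_t value)
{
    uintptr_t addr = (uintptr_t)base + (uintptr_t)y * stride_bytes
                   + (uintptr_t)x * 4 /* bytes per pixel */;
    return __atomic_fetch_add((uint32_t *)addr, value, __ATOMIC_RELAXED);
}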

Alas, images are rarely linear in memory. To improve cache efficiency, modern graphics hardware interleaves the X- and Y-coordinates. Instead of one row after the next, pixels in memory follow a spiral-like curve.

We need to amend our previous equation to interleave the coordinates. We could use many instructions to mask one bit at a time, shifting to construct the interleaved result, but that’s inefficient. We can do better.

There is a well-known “bit twiddling” algorithm to interleave bits. Rather than shuffle one bit at a time, the algorithm shuffles groups of bits, parallelizing the problem. Implementing this algorithm in shader code improves performance.
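For reference, the well-known interleave looks roughly like this in C (a generic sketch of the classic algorithm, not the driver's shader code):

/* Spread the bits of a 16-bit value into the even bit positions, then
 * combine two spread coordinates into one interleaved (Morton) index. */
#include <stdint.h>

static uint32_t spread_bits_16(uint32_t v)
{
    v &= 0x0000ffff;
    v = (v | (v << 8)) & 0x00ff00ff;
    v = (v | (v << 4)) & 0x0f0f0f0f;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

static uint32_t interleave_xy(uint16_t x, uint16_t y)
{
    return spread_bits_16(x) | (spread_bits_16(y) << 1);
}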

In practice, only the lower 7-bits (or less) of each coordinate are interleaved. That lets us use 32-bit instructions to “vectorize” the interleave, by putting the X- and Y-coordinates in the low and high 16-bits of a 32-bit register. Those 32-bit instructions let us interleave X and Y at the same time, halving the instruction count. Plus, we can exploit the GPU’s combined shift-and-add instruction. Putting the tricks together, we interleave in 10 instructions of M1 GPU assembly:

# Inputs x, y in r0l, r0h.
# Output in r1.

add r2, #0, r0, lsl 4
or  r1, r0, r2
and r1, r1, #0xf0f0f0f
add r2, #0, r1, lsl 2
or  r1, r1, r2
and r1, r1, #0x33333333
add r2, #0, r1, lsl 1
or  r1, r1, r2
and r1, r1, #0x55555555
add r1, r1l, r1h, lsl 1
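For readers who don't speak M1 assembly, a plain-C transcription of the vectorized trick might read as follows (my own rendering of the listing above, not driver code): pack x into the low and y into the high 16 bits of one value, spread both halves at once, then recombine them.

#include <stdint.h>

static uint32_t interleave_xy_vectorized(uint32_t x, uint32_t y)
{
    uint32_t v = (x & 0x7f) | ((y & 0x7f) << 16);  /* x in low, y in high half */
    v = (v | (v << 4)) & 0x0f0f0f0f;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return (v & 0xffff) | ((v >> 16) << 1);        /* combine the two halves */
}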

We could stop here, but what if there’s a dedicated instruction to interleave bits? PowerVR has a “shuffle” instruction shfl, and the M1 GPU borrows from PowerVR. Perhaps that instruction was borrowed too. Unfortunately, even if it was, the proprietary compiler won’t use it when compiling our test shaders. That makes it difficult to reverse-engineer the instruction – if it exists – by observing compiled shaders.

It’s time to dust off a powerful reverse-engineering technique from magic kindergarten: guess and check.

Dougall Johnson provided the guess. When considering the instructions we already know about, he took special notice of the “reverse bits” instruction. Since reversing bits is a type of bit shuffle, the interleave instruction should be encoded similarly. The bit reverse instruction has a two-bit field specifying the operation, with value 01. Related instructions to count the number of set bits and find the first set bit have values 10 and 11 respectively. That encompasses all known “complex bit manipulation” instructions.

00 ? ? ?
01 Reverse bits
10 Count set bits
11 Find first set

There is one value of the two-bit enumeration that is unobserved and unknown: 00. If this interleave instruction exists, it’s probably encoded like the bit reverse but with operation code 00 instead of 01.

There’s a difficulty: the three known instructions have one single input source, but our instruction interleaves two sources. Where does the second source go? We can make a guess based on symmetry. Presumably to simplify the hardware decoder, M1 GPU instructions usually encode their sources in consistent locations across instructions. The other three instructions have a gap where we would expect the second source to be, in a two-source arithmetic instruction. Probably the second source is there.

Armed with a guess, it’s our turn to check. Rather than handwrite GPU assembly, we can hack our compiler to replace some two-source integer operation (like multiply) with our guessed encoding of “interleave”. Then we write a compute shader using this operation (by “multiplying” numbers) and run it with the newfangled compute support in our driver.

All that’s left is writing a shader that checks that the mystery instruction returns the interleaved result for each possible input. Since the instruction takes two 16-bit sources, there are about 4 billion (2^32) inputs. With our driver, the M1 GPU manages to check them all in under a second, and the verdict is in: this is our interleave instruction.

As for our clever vectorized assembly to interleave coordinates? We can replace it with one instruction. It’s anticlimactic, but it’s fast and it passes the conformance tests.

And that’s what matters.


Thank you to Khronos and Software in the Public Interest for supporting open drivers.

August 21, 2023
I have just become aware that Fedora users using the packaged IPU6 camera stack are having an issue where the output from the camera is black. We have been updating the stack and the new version of ipu6-camera-bins has hit the stable updates repo, while the matching new version of ipu6-camera-hal is currently in the updates-testing repo.

This causes the versions of ipu6-camera-bins and ipu6-camera-hal to get out of sync (unless you have updates-testing enabled), which leads to the camera output being all black.

You can fix this issue by running the following command:

sudo dnf update --enablerepo=rpmfusion-nonfree-updates-testing 'ipu6-camera-*'

Sorry about the inconvenience, we'll make sure to push both packages at the same time for the next set of updates.

I have tagged all the new ipu6-camera-hal builds to be moved to the stable update repositories, so on the next rpmfusion updates push this should be resolved.

TL;DR:

Color is a visual perception. Human eyes can detect a broader range of colors than any devices in the graphics chain. Since each device can generate, capture or reproduce a specific subset of colors and tones, color management controls color conversion and calibration across devices to ensure a more consistent color reproduction. We can expose a GPU-accelerated display color management pipeline to support this process and enhance results, and this is what we are doing on Linux to improve color management on Gamescope/SteamDeck. Even with the challenges of being external developers, we have been working on mapping AMD GPU color capabilities to the Linux kernel color management interface, which is a combination of DRM and AMD driver-specific color properties. This more extensive color management pipeline includes pre-defined Transfer Functions, 1-Dimensional LookUp Tables (1D LUTs), and 3D LUTs before and after the plane composition/blending.


The study of color is well-established and has been explored for many years. Color science and research findings have also guided technology innovations. As a result, color in Computer Graphics is a very complex topic that I’m putting a lot of effort into becoming familiar with. I always find myself rereading all the materials I have collected about color space and operations since I started this journey (about one year ago). I also understand how hard it is to find consensus on some color subjects, as exemplified by all explanations around the 2015 online viral phenomenon of The Black and Blue Dress. Have you heard about it? What is the color of the dress for you?

So, taking into account my skills with colors and building consensus, this blog post only focuses on GPU hardware capabilities to support color management :-D If you want to learn more about color concepts and color on Linux, you can find useful links at the end of this blog post.

Linux Kernel, show me the colors ;D

DRM color management interface only exposes a small set of post-blending color properties. Proposals to enhance the DRM color API from different vendors have landed the subsystem mailing list over the last few years. On one hand, we got some suggestions to extend DRM post-blending/CRTC color API: DRM CRTC 3D LUT for R-Car (2020 version); DRM CRTC 3D LUT for Intel (draft - 2020); DRM CRTC 3D LUT for AMD by Igalia (v2 - 2023); DRM CRTC 3D LUT for R-Car (v2 - 2023). On the other hand, some proposals to extend DRM pre-blending/plane API: DRM plane colors for Intel (v2 - 2021); DRM plane API for AMD (v3 - 2021); DRM plane 3D LUT for AMD - 2021. Finally, Simon Ser sent the latest proposal in May 2023: Plane color pipeline KMS uAPI, from discussions in the 2023 Display/HDR Hackfest, and it is still under evaluation by the Linux Graphics community.

All previous proposals seek a generic solution for expanding the API, but many seem to have stalled due to the uncertainty of matching the hardware capabilities of all vendors well. Meanwhile, the use of AMD color capabilities on Linux remained limited by the DRM interface, as shown by the DCN 3.0 family color caps and mapping diagram below, which presents the Linux/DRM color interface without driver-specific color properties [*]:

Bearing in mind that we need to know the variety of color pipelines in the subsystem to be clear about a generic solution, we decided to approach the issue from a different perspective and worked on enabling a set of Driver-Specific Color Properties for AMD Display Drivers. As a result, I recently sent another round of the AMD driver-specific color mgmt API.

For those who have been following the AMD driver-specific proposal since the beginning (see [RFC][V1]), the main new features of the latest version [v2] are the addition of pre-blending Color Transformation Matrix (plane CTM) and the differentiation of Pre-defined Transfer Functions (TF) supported by color blocks. For those who just got here, I will recap this work in two blog posts. This one describes the current status of the AMD display driver in the Linux kernel/DRM subsystem and what changes with the driver-specific properties. In the next post, we go deeper to describe the features of each color block and provide a better picture of what is available in terms of color management for Linux.

The Linux kernel color management API and AMD hardware color capabilities

Before discussing colors in the Linux kernel with AMD hardware, consider accessing the Linux kernel documentation (version 6.5.0-rc5). In the AMD Display documentation, you will find my previous work documenting AMD hardware color capabilities and the Color Management Properties. It describes how AMD Display Manager (DM) intermediates requests between the AMD Display Core component (DC) and the Linux/DRM kernel interface for color management features. It also describes the relevant function to call the AMD color module in building curves for content space transformations.

A subsection also describes hardware color capabilities and how they evolve between versions. This subsection, DC Color Capabilities between DCN generations, is a good starting point to understand what we have been doing on the kernel side to provide a broader color management API with AMD driver-specific properties.

Why do we need more kernel color properties on Linux?

Blending is the process of combining multiple planes (framebuffers abstraction) according to their mode settings. Before blending, we can manage the colors of various planes separately; after blending, we have combined those planes in only one output per CRTC. Color conversions after blending would be enough in a single-plane scenario or when dealing with planes in the same color space on the kernel side. Still, they cannot handle the blending of multiple planes with different color spaces and luminance levels. With plane color management properties, userspace can get a better representation of colors to deal with the diversity of color profiles of devices in the graphics chain, support a wide color gamut (WCG), and convert High-Dynamic-Range (HDR) content to Standard-Dynamic-Range (SDR) content (and vice versa). With a GPU-accelerated display color management pipeline, we can use hardware blocks for color conversions and color mapping and support advanced color management.

The current DRM color management API enables us to perform some color conversions after blending, but there is no interface to calibrate the input space per plane. Note that here I’m not considering some workarounds in the AMD display manager mapping of DRM CRTC de-gamma and DRM CRTC CTM property to pre-blending DC de-gamma and gamut remap block, respectively. So, in more detail, it only exposes three post-blending features (a minimal userspace sketch of driving them follows the list):

  • DRM CRTC de-gamma: used to convert the framebuffer’s colors to linear gamma;
  • DRM CRTC CTM: used for color space conversion;
  • DRM CRTC gamma: used to convert colors to the gamma space of the connected screen.
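As a concrete illustration, a minimal userspace sketch of programming one of these (the CRTC "CTM") through libdrm could look like the following; the property ID is assumed to have been looked up beforehand with drmModeObjectGetProperties(), and error handling is omitted:

/* Hedged sketch: set an identity CTM on a CRTC via the legacy property API. */
#include <stdint.h>
#include <string.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

static int set_identity_ctm(int fd, uint32_t crtc_id, uint32_t ctm_prop_id)
{
    struct drm_color_ctm ctm;
    memset(&ctm, 0, sizeof(ctm));

    /* The CTM uses S31.32 fixed point; 1.0 on the diagonal is 1ULL << 32. */
    for (int i = 0; i < 3; i++)
        ctm.matrix[i * 3 + i] = (uint64_t)1 << 32;

    uint32_t blob_id;
    if (drmModeCreatePropertyBlob(fd, &ctm, sizeof(ctm), &blob_id))
        return -1;

    return drmModeObjectSetProperty(fd, crtc_id, DRM_MODE_OBJECT_CRTC,
                                    ctm_prop_id, blob_id);
}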

AMD driver-specific color management interface

We can compare the Linux color management API with and without the driver-specific color properties. From now, we denote driver-specific properties with the AMD prefix and generic properties with the DRM prefix. For visual comparison, I bring the DCN 3.0 family color caps and mapping diagram closer and present it here again:

Mixing AMD driver-specific color properties with DRM generic color properties, we have a broader Linux color management system with the following features exposed by properties in the plane and CRTC interface, as summarized by this updated diagram:

The blocks highlighted by red lines are the new properties in the driver-specific interface developed by me (Igalia) and Joshua (Valve). The red dashed lines are new links between API and AMD driver components implemented by us to connect the Linux/DRM interface to AMD hardware blocks, mapping components accordingly. In short, we have the following color management properties exposed by the DRM/AMD display driver:

  • Pre-blending - AMD Display Pipe and Plane (DPP):
    • AMD plane de-gamma: 1D LUT and pre-defined transfer functions; used to linearize the input space of a plane;
    • AMD plane CTM: 3x4 matrix; used to convert plane color space;
    • AMD plane shaper: 1D LUT and pre-defined transfer functions; used to delinearize and/or normalize colors before applying 3D LUT;
    • AMD plane 3D LUT: 17x17x17 size with 12 bit-depth; three dimensional lookup table used for advanced color mapping;
    • AMD plane blend/out gamma: 1D LUT and pre-defined transfer functions; used to linearize back the color space after 3D LUT for blending.
  • Post-blending - AMD Multiple Pipe/Plane Combined (MPC):
    • DRM CRTC de-gamma: 1D LUT (can’t be set together with plane de-gamma);
    • DRM CRTC CTM: 3x3 matrix (remapped to post-blending matrix);
    • DRM CRTC gamma: 1D LUT + AMD CRTC gamma TF; added to take advantage of driver pre-defined transfer functions;

Note: You can find more about AMD display blocks in the Display Core Next (DCN) - Linux kernel documentation, provided by Rodrigo Siqueira (Linux/AMD display developer) in a 2021-documentation series. In the next post, I’ll revisit this topic, explaining display and color blocks in detail.

How did we get a large set of color features from AMD display hardware?

So, looking at AMD hardware color capabilities in the first diagram, we can see no post-blending (MPC) de-gamma block in any hardware families. We can also see that the AMD display driver maps CRTC/post-blending CTM to pre-blending (DPP) gamut_remap, but there is post-blending (MPC) gamut_remap (DRM CTM) from newer hardware versions that include SteamDeck hardware. You can find more details about hardware versions in the Linux kernel documentation/AMDGPU Product Information.

I needed to rework these two mappings mentioned above to provide pre-blending/plane de-gamma and CTM for SteamDeck. I changed the DC mapping to detach stream gamut remap matrices from the DPP gamut remap block. That means mapping AMD plane CTM directly to DPP/pre-blending gamut remap block and DRM CRTC CTM to MPC/post-blending gamut remap block. In this sense, I also limited plane CTM properties to those hardware versions with MPC/post-blending gamut_remap capabilities since older versions cannot support this feature without clashes with DRM CRTC CTM.

Unfortunately, I couldn’t prevent conflict between AMD plane de-gamma and DRM CRTC de-gamma since post-blending de-gamma isn’t available in any AMD hardware versions until now. The fact is that a post-blending de-gamma makes little sense in the AMD color pipeline, where plane blending works better in a linear space, and there are enough color blocks to linearize content before blending. To deal with this conflict, the driver now rejects atomic commits if users try to set both AMD plane de-gamma and DRM CRTC de-gamma simultaneously.

Finally, we had no other clashes when enabling other AMD driver-specific color properties for our use case, Gamescope/SteamDeck. Our main work for the remaining properties was understanding the data flow of each property, the hardware capabilities and limitations, and how to shape the data for programming the registers - AMD color block capabilities (and limitations) are the topics of the next blog post. Besides that, we fixed some driver bugs along the way since it was the first Linux use case for most of the new color properties, and some behaviors are only exposed when exercising the engine.

Take a look at the Gamescope/Steam Deck Color Pipeline[**], and see how Gamescope uses the new API to manage color space conversions and calibration (please click on the image for a better view):

In the next blog post, I’ll describe the implementation and technical details of each pre- and post-blending color block/property on the AMD display driver.

* Thanks to Harry Wentland for helping with diagrams, color concepts and AMD capabilities.

** Thanks to Joshua Ashton for providing and explaining the Gamescope/Steam Deck color pipeline.

*** Thanks to the Linux Graphics community - explicitly Harry, Joshua, Pekka, Simon, Sebastian, Siqueira, Alex H. and Ville - for all the learning during this Linux DRM/AMD color journey. Also, thanks to Carlos and Tomas for organizing the 2023 Display/HDR Hackfest, where we had a great and immersive opportunity to discuss Color & HDR on Linux.

  1. Cinematic Color - 2012 SIGGRAPH course notes by Jeremy Selan: an introduction to color science, concepts and pipelines.
  2. Color management and HDR documentation for FOSS graphics by Pekka Paalanen: documentation and useful links on applying color concepts to the Linux graphics stack.
  3. HDR in Linux by Jeremy Cline: a blog post exploring color concepts for HDR support on Linux.
  4. Methods for conversion of high dynamic range content to standard dynamic range content and vice-versa by ITU-R: guideline for conversions between HDR and SDR contents.
  5. Using Lookup Tables to Accelerate Color Transformations by Jeremy Selan: Nvidia blog post about Lookup Tables on color management.
  6. The Importance of Being Linear by Larry Gritz and Eugene d’Eon: Nvidia blog post about gamma and color conversions.
August 18, 2023

It Me, Maintenance5

After a long week of what-even-happened, it’s finally time to talk about maintenance5.

This long-awaited maintenance extension has a number of great and zinkful features:

  • VK_FORMAT_A8_UNORM_KHR for native A8 handling
    • totally works all the time and doesn’t at all need any workarounds for cases where the driver only supports a certain subset of the required format features
      • hahaha
        • ahahahahahahahahahahahahaha
  • a property to detect Intel hardware
  • default value of 1.0 for gl_PointSize
  • deprecating shader modules
    • still gonna keep using them
  • VK_REMAINING_ARRAY_LAYERS
  • Clarification that copies between images of any type are allowed, treating 1D images as 2D images with a height of 1.
    • already been doing it since 2018

But who can guess which one is the topic of this blog post?

Hell Yeah

Finally a default value for gl_PointSize.

Long-term fans of the blog will recall that I’ve previously raged against the insane concept that pointsize must be written many times prior. In fact, it remains the second most blogged about topic in SGC history right behind ~~Big Triangle~~ descriptor management, the topic that modern graphics-related blogs must cover above all others.

Finally with maintenance5 we can be freed from these unjust shackles that have bound us for so long. No more* shall complex logic be unnecessarily injected into the compiler stack to add senseless writes to this output.
* except all that code still has to exist and run to handle drivers that don’t support maintenance5

Beyond the obvious benefit of having a fixed default pointsize (sanity), let’s check out some other benefits.

Shader Reduction

Previously all zink-emitted shaders would have a pointsize write, even those that were never used for drawing points. This resulted in unnecessary shader i/o at the hardware level. Nobody wants unnecessary shader i/o at the hardware level.

Now, however, it’s possible to use heuristics during linking to delete all unnecessary pointsize writes any time there is no XFB emission.
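As a rough illustration of the idea (not zink's actual linking code, which handles far more cases), such a pass could look something like this in NIR terms:

/* Sketch only: drop gl_PointSize stores when the shader has no XFB info. */
static bool
strip_pointsize_writes(nir_shader *nir)
{
   if (nir->xfb_info)
      return false;

   nir_function_impl *impl = nir_shader_get_entrypoint(nir);
   bool progress = false;

   nir_foreach_block(block, impl) {
      nir_foreach_instr_safe(instr, block) {
         if (instr->type != nir_instr_type_intrinsic)
            continue;
         nir_intrinsic_instr *intr = nir_instr_as_intrinsic(instr);
         if (intr->intrinsic != nir_intrinsic_store_deref)
            continue;
         nir_variable *var = nir_intrinsic_get_var(intr, 0);
         if (var && var->data.mode == nir_var_shader_out &&
             var->data.location == VARYING_SLOT_PSIZ) {
            nir_instr_remove(instr);
            progress = true;
         }
      }
   }
   return progress;
}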

How much performance improvement will this yield?

Six.

Six improvement units of performance.

Complexity Reduction

Everyone remembers that time I discovered that huge flaw in nir_assign_io_var_locations where shader interfaces would break due to psiz injection.

With maintenance5 all of that can be handwaved away, meaning fewer shader variants are needed.

Code Deletion

.

Maintainance.

Maintenance extensions are best extensions, prove me wrong.

August 17, 2023

Hi!

Let me start this status update with an announcement: from 2023-08-28 to 2023-10-01 (!), I will be on leave, so I will have reduced availability. Don’t be surprised if I miss some e-mails, and feel free to ping me again (more generally, please do ping me if I forget about a discussion — that also tends to happen when I’m not on leave). During that time, I will be traveling to Korea and Japan. If you live there and want to say hello, please reach out! :)

This month, Rose has continued working on wlroots frame scheduling. After a fair amount of discussion, she’s found a pretty nice API design. She still needs to address and clean up a few things, but that merge request is on a good track! I’ve also merged a new API to embed a compositor inside a Wayland client, and sent patches to remove some cases where we were waiting for a reply from Xwayland in a blocking fashion.

My kernel patch for signaling an eventfd from a drm_syncobj has been merged (see last month’s post for more details), and I’ve reviewed a patch from Erik Kurzinger to import a sync_file into a drm_syncobj timeline, which was possible before but awkward (it required 3 IOCTLs and a temporary binary drm_syncobj). As usual, I’ve sent a few kernel documentation patches as well.

I’ve released a new version of Cage, the Wayland kiosk compositor. Cage now uses the latest wlroots release, implements a bunch of new protocols and leverages wlroots’ scene-graph API.

The NPotM is go-mls, a Go library for the Messaging Layer Security protocol. It’s a shiny new end-to-end encryption framework for messaging protocols (similar to the one used by e.g. Signal and Matrix). I wanted to figure out how it works, but simply reading a 132-page RFC didn’t seem fun enough, so I just tried implementing it instead. I’m passing most of the official test vectors, still missing a few things but overall not too far away from a proper implementation. I’ve been discussing with a few folks about an IRCv3 extension for MLS, but we’re still at the very early stages on that front.

Speaking of IRCv3, the pre-away extension has been merged, so the away status of soju users shouldn’t blink anymore when the Goguma mobile client synchronizes in the background. I’ve also submitted the no-implicit-names extension for standardization. That extension reduces bandwidth usage for clients who don’t need to always maintain a list of all users in all channels. This helps a lot with slow 3G connections in the countryside.

The SNPotM is libdns/dnsupdate, a Go library for the venerable dynamic DNS UPDATE protocol implemented by various authoritative name servers. The library conforms to an interface shared with other (proprietary) libdns providers. I have more plans in this area, but will keep that for a future blog post.

I’ve sent a go-proxyproto patch to add a helper to configure an HTTP/2 server with PROXY protocol upgrade support. TLS ALPN is needed to negotiate HTTP/2, so it’s tricky to make work behind a reverse proxy which terminates the TLS connection. This patch is basically part of kimchi ripped off and put behind a nice API. This patch would be useful to add HTTP/2 support to pages.sr.ht.

Last but not least, I’ve implemented tracker export for the todo.sr.ht GraphQL API. delthas has added support for that in hut. Next up is support for import in hut! I’ve also sent a whole bunch of bug fixes for sr.ht.

That’s all for this month! I’m not sure I’ll write a status update in September, but will definitely do so in October.

August 14, 2023

As part of the same process outlined in Matthias Clasen's "LibreOffice packages" email, my management chain has made the decision to stop all upstream and downstream work on desktop Bluetooth, multimedia applications (namely totem, rhythmbox and sound-juicer) and libfprint/fprintd. The rest of my upstream and downstream work will be reassigned depending on Red Hat's own priorities (see below), as I am transferred to another team that deals with one of a list of Red Hat’s priority projects.

I'm very disappointed, because those particular projects were already starved for resources: I spent less than 10% of my work time on them in the past year, with other projects and responsibilities taking most of my time.

This means that, in the medium-term at least, all those GNOME projects will go without a maintainer, reviewer, or triager:
- gnome-bluetooth (including Settings panel and gnome-shell integration)
- totem, totem-pl-parser, gom
- libgnome-volume-control
- libgudev
- geocode-glib
- gvfs AFC backend

Those freedesktop projects will be archived until further notice:
- power-profiles-daemon
- switcheroo-control
- iio-sensor-proxy
- low-memory-monitor

I will not be available for reviewing libfprint/fprintd, upower, grilo/grilo-plugins, gnome-desktop thumbnailer sandboxing patches, or any work related to XDG specifications.

Kernel work, reviews and maintenance, including recent work on SteelSeries headset and Logitech devices kernel drivers, USB revoke for Flatpak Portal support, or core USB is suspended until further notice.

All my Fedora packages were orphaned about a month and a half ago; it's likely that some are still orphaned, if there are takers. RHEL packages were unassigned about 3 weeks ago and have been reassigned since then, so I cannot point to the new maintainer(s).

If you are a partner, or a customer, I would recommend that you get in touch with your Red Hat contacts to figure out what the plan is going forward for the projects you might be involved with.

If you are a colleague that will take on all or part of the 90% of the work that's not being stopped, or a community member that was relying on my work to further advance your own projects, get in touch, I'll do my best to accommodate your queries, time permitting.

I'll try to make sure to update this post, or create a new one if and when any of the above changes.
August 11, 2023

It’s Not Maintenance5

I just got back from lunch and have to work off some cals, and that means it’s time for another big lift on the blog. Today’s topic: how dumb can a driver’s compiler stack be?

As I outlined in the previous post, zink’s compiler stack is about to get a whole lot dumber for various reasons. But what exactly does that look like?

Lesser bloggers would simply link to the MR and say “figure it out”.

Here at SGC, however, I put in the extra effort so my readers can comprehend all the stringy awfulness that goes into each individual graphical sausage that this triangle factory is pumping out.

Let’s get at it.

Step 1: How Much Work Is This, Exactly?

The key point of using the theoretical new NIR linker (that definitely exists and will be merged in finite time) is that it requires drivers to accept lowered i/o. This means, effectively, that zink must begin consuming lowered i/o as soon as it receives shaders. Naturally the first step to that was evaluating all the shader passes which operate on specific i/o variables using derefs (AKA “explicit” i/o):

  • lower_64bit_vertex_attribs
  • lower_64bit_vars
  • split_blocks
  • lower_bindless_io
  • rewrite_read_as_0

The first four are called from zink_shader_create, the first time zink sees new shaders, while the last one is called from zink_compiler_assign_io. As shaders won’t have derefs again until just before they go through NTV, they’ll all have to be…

What’s that you say, creator of the patented Delete The Code methodology and planar YUV expert, Faith Ekstrand? I can just delete some of this code?

That sounds like a pretty smart idea. Looking at the list again, and then cross-referencing against all the features lowered i/o provides, and then pushing up my glasses so I can read the very real documentation that nir has, let’s see where that leads:

  • lower_64bit_vertex_attribs
    • nir_lower_io_lower_64bit_to_32 is available during i/o lowering, so this can all be deleted
  • lower_64bit_vars
    • the drivers that need this won’t magically grow 64bit support for shader-local variables, but the parts that operate specifically on i/o can be deleted
  • split_blocks
    • this was written to improve XFB inlining and reduce the number of “extra” outputs needed to generate XFB data, but if new, location-based variables are being generated anyway this is no longer needed
  • lower_bindless_io
    • this converts bindless texture i/o (because obviously you can pass bindless texture handles between shader stages) to something that is spirv-legal
    • needs to be updated
  • rewrite_read_as_0
    • this handles uninitialized reads from mismatched shader interfaces (i.e., the consumer reads more components than the producer writes)
    • needs to be updated

Not actually that much work, huzzah.

Step 2: New Variables

As in the flowchart, this process involves taking explicit i/o, converting to lowered i/o, then converting back to explicit. Explicit i/o is characterized by using derefs to explicit variables for access, which means variables are needed. A work-intensive benefit of this is simpler variables: since lowered i/o is characterized by location-based access to components, the subsequent conversion back to explicit i/o can use entirely new variables, and since these variables are location-based, there’s no need to retain any* of the gross struct/array typing that GLSL yields.
* except where arrays are indirectly accessed

For those of you who are truly in the know, this means goku in his SSJB form

struct TestStruct {
   dmat2x3 a[2];
   mat2x3 b[2];
   dvec2 c;
};
layout (location = 0, xfb_offset = 0) flat out TestStruct goku;

gets blasted into a series of smaller and more vincible variables:

decl_var shader_out INTERP_MODE_FLAT dvec3 goku#0 (VARYING_SLOT_VAR2.xyz, 2, 0)
decl_var shader_out INTERP_MODE_FLAT dvec3 goku#1 (VARYING_SLOT_VAR4.xyz, 4, 0)
decl_var shader_out INTERP_MODE_FLAT dvec3 goku#2 (VARYING_SLOT_VAR6.xyz, 6, 0)
decl_var shader_out INTERP_MODE_FLAT vec3 goku#3 (VARYING_SLOT_VAR8.xyz, 8, 0)
decl_var shader_out INTERP_MODE_FLAT vec3 goku#4 (VARYING_SLOT_VAR9.xyz, 9, 0)
decl_var shader_out INTERP_MODE_FLAT vec3 goku#5 (VARYING_SLOT_VAR10.xyz, 10, 0)
decl_var shader_out INTERP_MODE_FLAT vec3 goku#6 (VARYING_SLOT_VAR11.xyz, 11, 0)
decl_var shader_out INTERP_MODE_FLAT dvec2 goku#7 (VARYING_SLOT_VAR12.xy, 12, 0)

Beautiful and easy to parse. There’s only one snag: I gotta do this manually.

Long-time fans of the blog will recall some wild ravings in the past where I described a pass I wrote to handle a similar issue. lower_64bit_vars is that pass, and it both splits variables containing 64bit types into 32bit types and then rewrites all access to them to use those new types.

And now I have to do basically the same thing. Again. But in a different enough way that none of the code is reusable.

harold.jpg

The process for doing this variable rewrite is split in three:

  • scan the existing variables and make a table linking them to all the locations/components they consume
  • scan the shader for access to these variables and determine which ones have indirect access
  • scan the shader for access to these variables again, this time creating new vector-based variables which consume at most a single location*
    • except for variables accessed indirectly, which need to retain arrayness for indirect access

But then there’s also the bonus step (everyone loves bonuses!) of scanning all the new variables and comparing them against the original variables to ensure they have the same number of per-location components (i.e., if the original variable consumes all components for a given location, the new one must too) in order to maintain shader interface compatibility, and for all the locations where a mismatch is detected, single-component variables have to be inserted, and they have to have associated access added too so various optimizers don’t delete them again, and it’s obviously one of the first things anyone embarking on this idiotic journey would consider and not a last-second thing that someone would only realize after running a series of esoteric piglit tests and seeing bizarre failures.

Variables. Done.

Step 3: Explicit Again

The next step is where things get really stupid, because this is where things need to happen so that the shader goes back to having all the derefs and explicit variable access it used to have before some idiot went and deleted them.

I called this the add_derefs pass because I’m a creative type. An auteur.

For this, all the i/o variables need to be iterated through, and for each variable, scan the shader for access, where “access” means the location and component are consumed by the variable. And also its fbfetch-edness matches. Then take this lowered load/store access, krangle in whatever possibly-indirect derefs the variable needs to mimic the lowered operation, and write in a new explicit load/store access.

Except also I forgot to mention that i/o lowering needs to lower interpolation instructions, which are also (currently) in explicit deref format. And these explicit interpolation instructions get converted to lowered ones, and then sometimes a load_deref becomes load_barycentric_centroid. And you know (lol) it wouldn’t be a real adventure (lol) if a zink change didn’t uncover (lol) some incredibly obscure and opaque (lol) llvmpipe bug! So then there’s the usual spelunking through there, and whispered backchannel discussions and cursing with Dave, and OF FUCKING COURSE IT’S TGSI AGAIN but we got it done.

Also it’s possible there might be a future where llvmpipe doesn’t use TGSI but don’t quote me (I’ll deny it to my grave) and if anyone asks you didn’t hear it from me.

Step 3a: Done

You’d think by the way I just went off on my usual TGSI rant that I was done exploring this section, but think again because none of us asked what gl_ClipDistance or gl_CullDistance thought about any of this.

Well I asked, and they’re not happy.

Clip/cull distance are stupidweird ones because they’re array[8] variables that consume two locations. And that means all the calculations/heuristics for accessing arrays that work for every other array are broken for these.

But it’s fine, because this is zink and the whole thing is just a jenga tower of hacks all the way down anyway.

Step 4: Done

I’ll be completely and brutally honest with you, this all worked perfectly the first time I ran it.

On NVK, that is, which, as I mentioned in my historic XDC keynote, has been relying on the now-merged NIR 2.0 since last year. Truly a driver living in the future.

Other drivers, however, required considerably more work to make CI explode. Sorry, I meant not explode. Obviously. Totally a joke. The absolute state of CI is 100% not the fault of this lowered i/o conversion.

Anyway, the clear choice once parity was achieved was to then start deleting code.

Remember all that gross NTV code I linked in the previous post? Gone.

More stupid XFB code that’s been jenga-ing around for years? Gone.

Obscure ticket from years ago? Fixed incidentally.

 src/compiler/nir/nir_passthrough_gs.c                |    2 +-
 src/gallium/auxiliary/nir/nir_to_tgsi_info.c         |    4 +
 src/gallium/drivers/zink/nir_to_spirv/nir_to_spirv.c |  412 +------------------------------------
 src/gallium/drivers/zink/zink_compiler.c             | 1081 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------
 src/gallium/drivers/zink/zink_compiler.h             |    3 +-
 src/gallium/drivers/zink/zink_draw.cpp               |    2 +-
 src/gallium/drivers/zink/zink_program.c              |    8 +-
 src/gallium/drivers/zink/zink_types.h                |    6 +-
 8 files changed, 736 insertions(+), 782 deletions(-)

And as the statistics show, another bonus ticket was fixed too, through the magic of code deletion.

Step 5: The Struggle Continues

I didn’t even get to mention the great things that happened related to maintenance5 yet. Be sure to read again next week when I inevitably shovel more garbage onto the internet in the form of an unfortunately large blog post riddled with memes that obfuscate the truly interesting parts.

August 09, 2023

New Topic

As every one of my big brained readers knows, zink runs on top of vulkan. As you also know, vulkan uses spirv for its shaders. This means, in general, compiler-y stuff in zink tries to stay as close to spirv mechanics as possible.

Let’s look at an example. Here’s a very simple fragment shader from glxgears before it undergoes spirv translation:

shader: MESA_SHADER_FRAGMENT
source_sha1: {0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000}
name: ff-fs
internal: true
stage: 4
next_stage: 0
inputs_read: 1
outputs_written: 4-11
system_values_read: 0x00000000'00100000'00000000
subgroup_size: 1
first_ubo_is_default_ubo: true
separate_shader: true
flrp_lowered: true
inputs: 1
outputs: 8
uniforms: 0
decl_var shader_in INTERP_MODE_NONE vec4 VARYING_SLOT_COL0 (VARYING_SLOT_COL0.xyzw, 0, 0)
decl_var shader_out INTERP_MODE_NONE vec4 gl_FragData[0] (FRAG_RESULT_DATA0.xyzw, 0, 0)
decl_var shader_out INTERP_MODE_NONE vec4 gl_FragData[1] (FRAG_RESULT_DATA1.xyzw, 1, 0)
decl_var shader_out INTERP_MODE_NONE vec4 gl_FragData[2] (FRAG_RESULT_DATA2.xyzw, 2, 0)
decl_var shader_out INTERP_MODE_NONE vec4 gl_FragData[3] (FRAG_RESULT_DATA3.xyzw, 3, 0)
decl_var shader_out INTERP_MODE_NONE vec4 gl_FragData[4] (FRAG_RESULT_DATA4.xyzw, 4, 0)
decl_var shader_out INTERP_MODE_NONE vec4 gl_FragData[5] (FRAG_RESULT_DATA5.xyzw, 5, 0)
decl_var shader_out INTERP_MODE_NONE vec4 gl_FragData[6] (FRAG_RESULT_DATA6.xyzw, 6, 0)
decl_var shader_out INTERP_MODE_NONE vec4 gl_FragData[7] (FRAG_RESULT_DATA7.xyzw, 7, 0)
decl_var push_const INTERP_MODE_NONE struct gfx_pushconst
decl_function main (0 params)

impl main {
    block b0:   // preds: 
    32     %0 = deref_var &VARYING_SLOT_COL0 (shader_in vec4)
    32x4   %1 = @load_deref (%0) (access=0)
    32     %2 = deref_var &gl_FragData[0] (shader_out vec4)
                @store_deref (%2, %1) (wrmask=xyzw, access=0)
    32     %3 = deref_var &gl_FragData[1] (shader_out vec4)
                @store_deref (%3, %1) (wrmask=xyzw, access=0)
    32     %4 = deref_var &gl_FragData[2] (shader_out vec4)
                @store_deref (%4, %1) (wrmask=xyzw, access=0)
    32     %5 = deref_var &gl_FragData[3] (shader_out vec4)
                @store_deref (%5, %1) (wrmask=xyzw, access=0)
    32     %6 = deref_var &gl_FragData[4] (shader_out vec4)
                @store_deref (%6, %1) (wrmask=xyzw, access=0)
    32     %7 = deref_var &gl_FragData[5] (shader_out vec4)
                @store_deref (%7, %1) (wrmask=xyzw, access=0)
    32     %8 = deref_var &gl_FragData[6] (shader_out vec4)
                @store_deref (%8, %1) (wrmask=xyzw, access=0)
    32     %9 = deref_var &gl_FragData[7] (shader_out vec4)
                @store_deref (%9, %1) (wrmask=xyzw, access=0)
                // succs: b1 
    block b1:
}

Notice all the variables and derefs. This is in contrast to what shaders from more hardware-y drivers look like:

shader: MESA_SHADER_FRAGMENT
source_sha1: {0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000}
name: ff-fs
internal: true
stage: 4
next_stage: 0
inputs_read: 1
outputs_written: 2
subgroup_size: 1
first_ubo_is_default_ubo: true
separate_shader: true
flrp_lowered: true
inputs: 1
outputs: 1
uniforms: 0
decl_var shader_in INTERP_MODE_NONE vec4 VARYING_SLOT_COL0 (VARYING_SLOT_COL0.xyzw, 0, 0)
decl_var shader_out INTERP_MODE_NONE vec4 FRAG_RESULT_COLOR (FRAG_RESULT_COLOR.xyzw, 4, 0)
decl_function main (0 params)

impl main {
    block b0:  // preds: 
    32    %3 = undefined
    32    %0 = deref_var &VARYING_SLOT_COL0 (shader_in vec4)
    32x4  %1 = @load_deref (%0) (access=0)
    32    %2 = deref_var &FRAG_RESULT_COLOR (shader_out vec4)
    32    %4 = load_const (0x00000000)
               @store_output (%1, %4 (0x0)) (base=4, wrmask=xyzw, component=0, src_type=float32, io location=FRAG_RESULT_COLOR slots=1, xfb(), xfb2())  // FRAG_RESULT_COLOR
               // succs: b1 
    block b1:
}

The latter form here is called “lowered” i/o: the derefs for explicit variables have been lowered to intrinsics corresponding to the operations being performed. Such excitement, many detail.

Change Is Bad

With few exceptions, every mesa driver uses lowered i/o. Zink is one of those exceptions, and the reasons are simple:

  • spirv requires explicit variable derefs
  • using lowered i/o would take a lot of work
  • there’s literally no benefit to using lowered i/o in zink

It’s a tough choice, but if I had to pick one of these as the “main” reason why I haven’t done the move, my response would be yes.

With that said, I’m extremely disgruntled to announce that I have completed the transition to lowered i/o.

Hooray.

The reasoning behind this Sisyphean undertaking which has cost me the past couple weeks along with what shreds of sanity previously remained within this mortal shell:

  • I’ve really wanted to delete some unbelievably gross code for a while
  • in theory there will someday be this magical new linker that infamous graphics mad scientist Marek Olšák has been working on since the before-times, and this will require the use of lowered i/o

It’s a tough choice, but if I had to pick one of these as the “main” reason why I have done the move, my response would be yes.

Before And After

I’ll save the details of this for some deep dive posts to pad out my monthly blog counter. For now, let’s take a look at the overview: how does this affect “shader stuff” in zink?

The short answer, for that one person who is actively eyeballs-deep in zink shader refactoring, is that it shouldn’t have any effect whatsoever. The zink passes that use explicit derefs for i/o are mostly at the end of the compilation chain, and derefs will have been added back in time to avoid needing to touch anything there.

This refactor may be a tough concept to grasp, so I’m providing some flowcharts since it’s been far too long since the blog has seen any. Here is a basic overview of the zink shader compilation process:

It’s a simple process that anyone can understand.

This is the old process side-by-side with the new one for comparison:

Next time: maintenance5 in lavapipe or more compiler talk. You decide. But not really because I’m the one writing the posts.

August 07, 2023

Summer has kept me busy with holidays, but I have managed to find a bit of time to keep hacking on the driver for the VeriSilicon NPU since the last update.

TL;DR

The issue with getting the output to the right scale is solved now, and simple convolution operations are working just fine.

3D tensors are now supported as inputs, and we support strided convolutions as well, but only on 2D inputs for now.

The test workloads are running fast and stably, so I now feel I have pretty solid ground beneath my feet.

There are three features left before I can run a real, full-fledged commercially interesting model:

  1. 3D inputs for strided convolutions
  2. Multiple output channels
  3. Padded convolutions

Re-quantization

The last update in this blog left off at my attempt to figure out how the raw convolution outputs had to be processed, using fields called post_shift and post_multiplier, so I could get the right values in the final output.

After spending more time than I probably should have in a spreadsheet trying to find correlations, some desperate googling brought me to research papers about optimizing quantization operations on integer-only hardware:

That explains the meaning of the shift and multiplier, as these are the operations we can use to approximate the floating point division on integer hardware.

But to actually understand what the hardware was trying to do with them, it was useful to look at the QNNPACK implementation of requantization.
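
My mental model of what the hardware does with those two fields ended up being something like the below (a rough sketch based on my reading of the papers and QNNPACK, not the actual driver code; the exact rounding and clamping behavior is an assumption):

#include <stdint.h>

/* The floating-point rescale factor that would take the 32-bit
 * accumulator back to the 8-bit quantized output is approximated as an
 * integer multiplier plus a right shift, which is all that
 * post_multiplier and post_shift can express. */
static uint8_t requantize(int32_t acc, int32_t multiplier, unsigned shift,
                          int32_t out_zero_point)
{
    /* widen so the multiply cannot overflow */
    int64_t scaled = (int64_t)acc * multiplier;

    /* rounding right shift: add half the divisor before shifting */
    if (shift > 0)
        scaled = (scaled + ((int64_t)1 << (shift - 1))) >> shift;

    /* move into the unsigned 8-bit output range and clamp */
    int64_t result = scaled + out_zero_point;
    if (result < 0)
        result = 0;
    if (result > 255)
        result = 255;
    return (uint8_t)result;
}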

3D input tensor

This was pretty much straightforward, as it was basically a matter of updating the code to take the added dimension into account, and also reordering the tensor elements, as the hardware expects depth-first order.

This was made much easier by some improvements to the scripts I use to observe the behavior of the closed source stack, by intercepting the communication with the kernel's GPL driver.

For example, this is the output when Mesa has generated a cmd stream that is functionally equivalent to what the blob sends to the kernel:

+ diff -u -U 100 /home/tomeu/mesa.txt /home/tomeu/galcore.txt
--- /home/tomeu/mesa.txt    2023-08-07 18:28:29.939750225 +0200
+++ /home/tomeu/galcore.txt    2023-08-07 18:28:42.116625362 +0200
@@ -1,176 +1,273 @@
 {
-    0x0801028a, /* LOAD_STATE (1) Base: 0x00A28 Size: 1 Fixp: 0 */
-    0x00000011, /*   PA.SYSTEM_MODE := PROVOKING_VERTEX_LAST=1,HALF_PIXEL_CENTER=1 */
-    0x08010e13, /* LOAD_STATE (1) Base: 0x0384C Size: 1 Fixp: 0 */
-    0x00000002, /*   GL.API_MODE := OPENCL */
+    0x00000000, /* UNKNOWN (0) */
+    0x00000000, /*  */
+    0x00000000, /* UNKNOWN (0) */
+    0x00000000, /*  */
+    0x00000000, /* UNKNOWN (0) */
+    0x00000000, /*  */
     0x00000000, /* UNKNOWN (0) */
     0x00000000, /*  */
     0x08010e4f, /* LOAD_STATE (1) Base: 0x0393C Size: 1 Fixp: 0 */
     0x00000000, /*   GL.OCB_REMAP_START := 0x0 */
     0x08010e50, /* LOAD_STATE (1) Base: 0x03940 Size: 1 Fixp: 0 */
     0x00000000, /*   GL.OCB_REMAP_END := 0x0 */
     0x08010e4c, /* LOAD_STATE (1) Base: 0x03930 Size: 1 Fixp: 0 */
     0x00000010, /*   GL.NN_CONFIG := UNK0=0x0,DISABLE_ZDPN=0,DISABLE_SWTILING=0,SMALL_BATCH=1,DDR_BURST_SIZE=0x0,UNK7=0,NN_CORE_COUNT=0x0,UNK12=0 */
     0x08010428, /* LOAD_STATE (1) Base: 0x010A0 Size: 1 Fixp: 0 */
-    0xffff3000, /*   PS.NN_INST_ADDR := *0xffff3000 */
+    0x3348e780, /*   PS.NN_INST_ADDR := *0x3348e780 */
     0x08010429, /* LOAD_STATE (1) Base: 0x010A4 Size: 1 Fixp: 0 */
     0x00000000, /*   0x010A4 */
     0x08010e03, /* LOAD_STATE (1) Base: 0x0380C Size: 1 Fixp: 0 */
     0x00000c23, /*   GL.FLUSH_CACHE := DEPTH=1,COLOR=1,TEXTURE=0,PE2D=0,TEXTUREVS=0,SHADER_L1=1,SHADER_L2=0,UNK10=1,UNK11=1,DESCRIPTOR_UNK12=0,DESCRIPTOR_UNK13=0 */
     0x08010e03, /* LOAD_STATE (1) Base: 0x0380C Size: 1 Fixp: 0 */
     0x00000c23, /*   GL.FLUSH_CACHE := DEPTH=1,COLOR=1,TEXTURE=0,PE2D=0,TEXTUREVS=0,SHADER_L1=1,SHADER_L2=0,UNK10=1,UNK11=1,DESCRIPTOR_UNK12=0,DESCRIPTOR_UNK13=0 */
     0x00000000, /* UNKNOWN (0) */
     0x00000000, /*  */
 }
 map->layer_type = 0x0;  /* (0) */
 map->no_z_offset = 0x0;  /* (0) */
 map->kernel_xy_size = 0x2;  /* (2) */
 map->kernel_z_size = 0x4;  /* (4) */
 map->kernels_per_core = 0x1;  /* (1) */
 map->pooling = 0x0;  /* (0) */
 map->pooling_xy_size = 0x1;  /* (1) */
 map->prelu = 0x0;  /* (0) */
 map->nn_layer_flush = 0x1;  /* (1) */
 map->kernel_data_type = 0x0;  /* (0) */
 map->in_image_data_type = 0x0;  /* (0) */
 map->out_image_data_type = 0x0;  /* (0) */
 map->in_image_x_size = 0x4;  /* (4) */
 map->in_image_y_size = 0x4;  /* (4) */
 map->in_image_x_offset = 0x0;  /* (0) */
 map->in_image_y_offset = 0x0;  /* (0) */
 map->unused0 = 0x0;  /* (0) */
 map->brick_mode = 0x0;  /* (0) */
 map->brick_distance = 0x0;  /* (0) */
 map->relu = 0x0;  /* (0) */
 map->unused1 = 0x0;  /* (0) */
 map->post_multiplier = 0x0;  /* (0) */
 map->post_shift = 0x17;  /* (23) */
 map->unused2 = 0x0;  /* (0) */
 map->no_flush = 0x0;  /* (0) */
 map->unused3 = 0x0;  /* (0) */
 map->out_image_x_size = 0x3;  /* (3) */
 map->out_image_y_size = 0x3;  /* (3) */
 map->out_image_z_size = 0x1;  /* (1) */
 map->rounding_mode = 0x1;  /* (1) */
 map->in_image_x_offset_bit_3 = 0x0;  /* (0) */
 map->in_image_y_offset_bit_3 = 0x0;  /* (0) */
 map->out_image_tile_x_size = 0x3;  /* (3) */
 map->out_image_tile_y_size = 0x3;  /* (3) */
-map->kernel_address = 0x3fffd00;  /* (67108096) */
+map->kernel_address = 0xcd237f;  /* (13443967) */
 map->kernel_z_size2 = 0x0;  /* (0) */
-map->in_image_address = 0xffff6000;  /* (4294926336) */
-map->out_image_address = 0xffff7000;  /* (4294930432) */
+map->in_image_address = 0x3348e240;  /* (860414528) */
+map->out_image_address = 0x89ffc500;  /* (2315240704) */
 map->image_caching_mode = 0x0;  /* (0) */
 map->kernel_caching_mode = 0x1;  /* (1) */
 map->partial_cache_data_unit = 0x0;  /* (0) */
 map->kernel_pattern_msb = 0x0;  /* (0) */
 map->kernel_y_size = 0x2;  /* (2) */
 map->out_image_y_stride = 0x3;  /* (3) */
 map->kernel_pattern_low = 0x0;  /* (0) */
 map->kernel_pattern_high = 0x0;  /* (0) */
 map->kernel_cache_start_address = 0x800;  /* (2048) */
 map->kernel_cache_end_address = 0xa00;  /* (2560) */
 map->image_start_address = 0x0;  /* (0) */
 map->image_end_address = 0x800;  /* (2048) */
 map->in_image_border_mode = 0x0;  /* (0) */
 map->in_image_border_const = 0x7d;  /* (125) */
 map->unused4 = 0x0;  /* (0) */
 map->kernel_data_type_bit_2 = 0x0;  /* (0) */
 map->in_image_data_type_bit_2 = 0x0;  /* (0) */
 map->out_image_data_type_bit_2 = 0x0;  /* (0) */
 map->post_multiplier_1_to_6 = 0x1f;  /* (31) */
 map->post_shift_bit_5_6 = 0x0;  /* (0) */
 map->unused5 = 0x0;  /* (0) */
 map->in_image_x_stride = 0x4;  /* (4) */
 map->in_image_y_stride = 0x4;  /* (4) */
 map->out_image_x_stride = 0x3;  /* (3) */
 map->unused6 = 0x0;  /* (0) */
 map->post_multiplier_7_to_14 = 0x61;  /* (97) */
 map->out_image_circular_buf_size = 0x0;  /* (0) */
 map->unused7 = 0x0;  /* (0) */
 map->per_channel_post_mul = 0x0;  /* (0) */
 map->out_image_circular_buf_end_addr_plus_1 = 0x3ffffff;  /* (67108863) */
 map->unused8 = 0x0;  /* (0) */
 map->in_image_circular_buf_size = 0x0;  /* (0) */
 map->unused9 = 0x0;  /* (0) */
 map->in_image_circular_buf_end_addr_plus_1 = 0x3ffffff;  /* (67108863) */
 map->unused10 = 0x0;  /* (0) */
 map->coef_zero_point = 0x80;  /* (128) */
 map->out_zero_point = 0x77;  /* (119) */
 map->kernel_direct_stream_from_VIP_sram = 0x0;  /* (0) */
 map->depthwise = 0x0;  /* (0) */
 map->unused11 = 0x0;  /* (0) */
 map->unused12 = 0x0;  /* (0) */
 map->unused13 = 0x0;  /* (0) */
 map->unused14 = 0x0;  /* (0) */
 map->unused15 = 0x0;  /* (0) */
 map->unused16 = 0x0;  /* (0) */
 map->further1 = 0x0;  /* (0) */
 map->further2 = 0x0;  /* (0) */
 map->further3 = 0x3ffffff;  /* (67108863) */
 map->further4 = 0x7f800000;  /* (2139095040) */
 map->further5 = 0xff800000;  /* (4286578688) */
 map->further6 = 0x0;  /* (0) */
 map->further7 = 0x0;  /* (0) */
 map->further8 = 0x0;  /* (0) */
   0x40, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x2c, 0x99, 0x0e, 0x00, 0x00,
   0x40, 0xea, 0x2c, 0xeb, 0x80, 0xaf, 0x80, 0x9b, 0x99, 0x80, 0x80, 0x13,
   0x80, 0x80, 0x80, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00
   0x69, 0xd3, 0x2d, 0x92, 0x07, 0x00, 0x64, 0x00, 0x0c, 0x22, 0x90, 0xd6,
   0x53, 0xc9, 0xe2, 0x48, 0xe6, 0x4c, 0xa8, 0xeb, 0xd2, 0xf3, 0xb0, 0xf4,
   0x2d, 0xa4, 0x3e, 0xf4, 0x0f, 0x7b, 0x98, 0x01, 0x41, 0x84, 0x92, 0x7e,
   0xfa, 0x19, 0xf5, 0xda, 0xb3, 0x5a, 0xb7, 0xf3, 0x97, 0x95, 0x12, 0xe7,
   0x51, 0x94, 0xcb, 0x5a, 0x1f, 0xa9, 0xc6, 0xc4, 0x1c, 0xa9, 0x92, 0x1f,
   0xf7, 0x64, 0xc3, 0xca
   0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77

This corresponds to a convolution with the following parameters:

  • 8x8x1 input tensor
  • 3x3x1 weight tensor
  • stride == 2

The differences are due to different addresses being allocated between runs, plus some differences due to how Mesa's code is structured, but those shouldn't affect the end result.

At the top we have the payload of the submit IOCTL, followed by a struct with the configuration for the NN units themselves and then the buffers for the weights, input and output.

When running a convolution configuration that isn't yet supported, we will spot more differences and hopefully will be able to figure out the logic behind them.

Strided convolutions

The hardware doesn't really support strided convolutions, so these are "lowered" to 1-stride convolutions with added channels, as per this research paper:

By implementing the algorithm in the paper, we match the behavior of the blob, as with requantization. The paper refers only to 2D input tensors, so I will need to check how the blob behaves with 3D inputs and figure out the logic behind it.

For now I have chosen to do the tensor manipulation on the CPU, but later on we will be able to use the TP units in the HW for this, reducing latency.
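
As an illustration of the kind of manipulation involved, a stride-2 convolution over a single-channel 2D input can be turned into a stride-1 convolution by rearranging the input space-to-depth and rearranging the weights the same way, roughly like this (layout and naming are illustrative, not the driver's actual code):

#include <stdint.h>

/* Rearrange an HxW single-channel tensor into an (H/2)x(W/2)x4 tensor
 * where each output channel holds one element of the original 2x2
 * block, so a stride-1 convolution over the result (with weights
 * rearranged the same way) computes the same values as the original
 * stride-2 convolution. */
static void space_to_depth_2x2(const uint8_t *in, uint8_t *out,
                               unsigned height, unsigned width)
{
    unsigned out_h = height / 2;
    unsigned out_w = width / 2;

    for (unsigned y = 0; y < out_h; y++) {
        for (unsigned x = 0; x < out_w; x++) {
            for (unsigned c = 0; c < 4; c++) {
                /* channel c picks one element of the 2x2 input block */
                unsigned in_y = y * 2 + c / 2;
                unsigned in_x = x * 2 + c % 2;
                out[(y * out_w + x) * 4 + c] = in[in_y * width + in_x];
            }
        }
    }
}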

Test suite

With so many different convolution parameters supported, I felt the need for a comfortable way of keeping regressions in check.

I wrote a simple pytest module that generates a TFLite model with a single convolution operation, with the parameters and payloads varied according to the different configurations that we support.

At some point I will add a CI job, probably before sending the initial merge request.

August 04, 2023

The initial NVK (nouveau vulkan) experimental driver has been merged into mesa master[1], and although there's lots of work to be done before it's application-ready, the main reason it was merged was that the initial kernel work needed was merged into drm-misc-next[2] and will then go to drm-next for the 6.6 merge window. (This work is separate from the GSP firmware enablement required for reclocking; that is a parallel development needed to make nvk usable.) Faith at Collabora will have a blog post about the Mesa side; this is more about the kernel journey.

What was needed in the kernel?

The nouveau kernel API was written 10 years or more ago, and was designed around OpenGL at the time. There were two major restrictions in the current uAPI that made it unsuitable for Vulkan.

  1. buffer objects (physical memory allocations) were allocated 1:1 with virtual memory allocations for a file descriptor. This meant the kernel managed the virtual address space. For proper Vulkan support, the bo allocation and vm allocation have to be separate, and userspace should control the virtual address space.
  2. Command submission didn't use sync objects. The nouveau command submission wasn't wired up to the modern sync objects. These are pretty much a requirement for Vulkan fencing and semaphores to work properly.

How to implement these?

When we kicked off the nvk idea I made a first pass at implementing a new user API to allow the above features. I took a look at how the GPU VMA management was done in current drivers and realized that there was scope for a common component to manage the GPU VA space. I did a hacky implementation of some common code and a nouveau implementation. Luckily at the time, Danilo Krummrich had joined my team at Red Hat and needed more kernel development experience in GPU drivers. I handed my sketchy implementation to Danilo and let him run with it. He spent a lot of time learning and writing copious code. His GPU VA manager code was merged into drm-misc-next last week and his nouveau code landed today.

What is the GPU VA manager?

The idea behind the GPU VA manager is that there is no need for every driver to implement something that should essentially not be a hardware specific problem. The manager is designed to track VA allocations from userspace, and keep track of what GEM objects they are currently bound to. The implementation went through a few twists and turns and experiments. 

For a long period we considered using maple tree as the core of it, but we hit a number of messy interactions between the dma-fence locking and the memory allocations required to add new nodes to the maple tree. The dma-fence critical section rules are a hard requirement that everything else has to accommodate. In the end Danilo used an rbtree to track things. We will revisit whether we can use maple tree again in the future.

We had a long discussion, and a couple of implement-it-both-ways-and-see experiments, about whether we needed to track empty sparse VMA ranges in the manager or not. Nouveau wanted these, but generically we weren't sure they were helpful, and they also affected the uAPI, as it needed explicit operations to create/drop them. In the end we started tracking these in the driver and left the core VA manager cleaner.

Now the code is in tree we will start to push future drivers to use it instead of spinning their own.

What changes are needed for nouveau?

Now that the VAs are being tracked, the nouveau API needed two new entrypoints. Since BO allocation will no longer create a VM mapping, a new API is needed to bind BO allocations to VM addresses. This is called the VM_BIND API. It has two variants:

  1. a synchronous version that immediately maps a BO to a VM and is used for the common allocation paths.
  2. an asynchronous version that is modeled after the Vulkan sparse API, and takes in/out sync objects, which use the drm scheduler to schedule the vm/bo binding.
The VM BIND backend then does all the page table manipulation required.
 
The second API added was an EXEC call. This takes in/out sync objects and a set of addresses that point to command buffers to execute. This uses the drm scheduler to deal with the synchronization and hands the firmware the command buffer address to execute.
Internally for nouveau this meant adding support for the drm scheduler, adding new internal page table manipulation APIs, and wiring up the GPU VA manager.
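
To give a rough picture of what each entrypoint carries, here is the conceptual shape of the two calls (toy structs for illustration only, NOT the actual nouveau uAPI):

#include <stdbool.h>
#include <stdint.h>

/* VM_BIND: userspace picks the virtual address and asks the kernel to
 * map (or unmap) a BO there, either synchronously or queued on the
 * scheduler with sync objects. */
struct toy_vm_bind {
    uint32_t bo_handle;      /* GEM object to map */
    uint64_t bo_offset;      /* offset into the BO */
    uint64_t va_addr;        /* userspace-chosen virtual address */
    uint64_t range;          /* size of the mapping */
    bool     async;          /* queue with in/out sync objects */
};

/* EXEC: hand the firmware command buffer addresses, with sync objects
 * handling the fencing via the drm scheduler. */
struct toy_exec {
    const uint64_t *push_addrs;   /* VAs of command buffers to execute */
    uint32_t        push_count;
    const uint32_t *in_syncobjs;  /* wait before executing */
    uint32_t        in_count;
    const uint32_t *out_syncobjs; /* signal when done */
    uint32_t        out_count;
};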

Shoutouts:

My input was the sketchy sketch at the start, and doing the userspace changes to the nvk codebase to allow testing.

The biggest shoutout to Danilo, who took a sketchy sketch of what things should look like, created a real implementation, did all the experimental ideas I threw at him, and threw them and others back at me, negotiated with other drivers to use the common code, and built a great foundational piece of drm kernel infrastructure.

Faith at Collabora, who has done the bulk of the work on nvk, did a code review at the end and pointed out some missing pieces of the API and the optimisations it enables.

Karol at Red Hat for work on the main nvk driver, and Ben at Red Hat for nouveau advice on how things worked while he smashed away at the GSP rock.

(and anyone else who has contributed to nvk, nouveau and even NVIDIA for some bits :-)

[1] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24326

[2] https://cgit.freedesktop.org/drm-misc/log/

Progress

As everyone knows, Zink is a fast-moving target. Sometimes it moves so fast that even I don’t fully grasp the weight of some changes as they fly past.

I’m sure you all remember this monumental slide from XDC last year:

x-no.png

Truly a masterpiece that’s impossible to improve upon; don’t @ me.

But Now

Time has passed. Almost a year, some would say. Is that slide still accurate?

Anyone who knows anything about journalism knows that the answer to all rhetorical questions on the internet is always the same.

For Those Of You Not Following Closely

A couple weeks ago, Collabora’s Igor Torrente put up an MR that slid under the radar for most people. Not me, of course, because as a responsible maintainer I carefully review every character in every line of code in every file for every patch in every MR tagged with my driver(s).

And I’m Not The Only One

Because the great Adam Jackson and Daniel Stone also got D E E P into this one. By which I mean they commented.

And approved.

So Many Headers

It’s the equivalent of clickbait on a blog. But why—

x-yes.png

That’s Right.

August 03, 2023

Long Time No See

Yes, yes, it’s been a while, some number of weeks or mesa-years since I last blogged. Lots of important things have happened in that time. I’ve generated enough blog content for an entire month of posts, in fact. Maybe I’ll manage to maintain enough motivation to write about them.

Let’s kick off the return by looking at some progress updates.

  • maintenance5 finally released (but nobody can implement it yet)
  • host_image_copy finally released (lavapipe world-first open source best driver #1)
  • alyssa delete code big
  • some other things happened?

It’s all very exciting, and there’s definitely gonna be lots of posts once I remember what happened when I was melting into a puddle during the heatwave.

In the meanwhile, I want to talk about something else. Something a lot of people ask me about.

I want to talk about working for Valve.

YMMV

I work in Open Source, so obviously I can only comment on that, and I only work on certain projects, so obviously I can only comment on my experience working on them, and I’m only one non-hivemind mortal being, so obviously I can only comment on what I’ve personally experienced, but I’m nearly through my third year here and it feels like a good time for a post of this sort. You know, because three.

So what’s it really like here?

In a word, working here is great.

Imagine you’ve got three or twenty projects you enjoy working on. Now imagine your task is to make them better. Better how, exactly? However you want. Which one do you work on? Whichever one you want. How many hours per week do you work? However many you want. Who checks your progress? You do. How do biannual performance evaluations happen? They don’t. And on top of all that you get paid.

It sounds too good to be true, doesn’t it? Surely working here can’t be such a complete state of anarchy, and indeed it isn’t. In my experience, the Open Source team here is like a big jigsaw puzzle: there’s a ton of different pieces, each of them with its place, each of them making the others better.

Let me explain.

The Team

There’s a lot of people working here, all of them smarter than me (but none of them blogging more than me). Most of them have been here longer than me too. Every one of them has fallen into their niche, the place they like to tinker where they can excel. Here’s a few just out of the old-timers:

  • Hans-Kristian works on VKD3D-proton, Vulkan spec, self-help guides, and is generally the graphics bible
  • Joshie has a frog in most of the Linux gaming ecosystem components in addition to being a top frog in DX layering and frogging occasional patches into Mesa
  • Timothy Arceri can make Mesa’s GLSL compiler roll over, sit, shake, and compile shaders in finite time
  • Daniel “fyi I’ll be on vacation until Sunday” Schürmann is our ACO architect and resident workaholic
  • Bas isn’t technically on the team but also he is
  • Samuel has written eight new RADV extension implementations in the time it took me to write this post
  • And I know I’ve just left out a ton of people because this is like 5% of the team here—really just the ones I could meme about with my last functioning brain cell

Everyone has distinct roles that they play on the team. Project areas they specialize in, as per my “anything goes” claim above. Some people work on lots of things, some don’t, but the niches are filled. Everyone’s got their “spot”.

Put another way, everyone on the team is a piece of the puzzle, the puzzle being “make Linux gaming great”. Everyone fits into their spot, and by doing so things get better.

Spotting

There’s another way of looking at things though. While everyone here can be a puzzle piece, everyone has their own puzzles too. I work on zink, but that doesn’t mean “Mike is the one working on zink”. What it really means is that I’m able to work on zink because the puzzle pieces have been assembled such that I’m able to work on zink. It’s like how you wouldn’t try benching four plates at the gym without having a spot (I would, but I’m huge).

Sometimes getting a spot is a production. You know, the kind of thing that makes headlines where Joshie throws me the rock, but it’s too hot so I fire off a pass to Georg, and he slows down the tempo while we wait for hardware vendors to understand how their thinking sand interacts with complex ancient graphics specifications, but then we get in the zone, and Georg throws me an alleyoop, and then Joshie takes it back to fix the original problem and now games work a little better.

Sometimes it’s all rockstars all day long.

But it’s the other times that make working here really great. The times when you’re struggling to grind out that last rep because you told your buddy you were definitely gonna hit 10x315 on front squat this week and you don’t wanna go back and admit you had too much preworkout.

I’m talking about times like when Timur picked up my massive CPU optimization series and wrangled it into a bunch of MRs because I was foolishly stretching myself too thin across too many projects.

I’m talking about the unsung heroes who make working here truly great.

Case Study: Rhys (not that one, the other one) Perry

Everyone knows Rhys. Everyone here, anyway. Outside the team it might be a different story; he has no blog, and searching for his many and varied accomplishments in the depths of the internet yields only one article written before I was born.

IYKYK, as they say. Just in the past week he’s quietly fixed Rage 2 and WWZ. A glance through his extensive patch history is a litany of complex optimizations and tweaks which aren’t flashy enough to be newsworthy on their own but add up to significant gains through consistent improvements.

But it’s still not any of these that (to me, at least) make Rhys one of the unsung heroes of the team. The glue that holds parts of it together.

All my very high IQ readers know what it’s like to get stuck on something. That feeling when you come across a problem, and you know it’s a problem, and you can handwave some half-functional solution that lets you limp across the finish line to collapse in a broken, battered heap with 0 regressions as the output of your refactoring branch’s latest CTS run, but you can’t quite figure out the “right” way to fix the problem. The way that won’t get your patches NAKed harder than proposing the addition of registers to NIR right now.

At times like these, who’s there to help you out? Who is it that gives the bar that tiny, it-was-all-you-bro-I-didn’t-even-touch-it nudge to help you finish that last rep?

It’s Rhys Perry. It’s always the Rhyses, the unsung heroes. The ones who answer complex questions on IRC at 2am because they’re watching an historic cricket match and happened to glance over and see you flailing away at your keyboard. The ones who step in and say “sure, I’ll review this absolute disaster you fingerpainted into gitlab using the webui without regard to formatting, or coding conventions, or even the right programming language, and we’ll get through this together with a fix that’ll make everyone happy” when you’re staring down Yog-Sothoth in the abyss of the compiler stack at the end of a week and have exactly one functioning brain cell remaining that tells you only SHOWER. NOW. IT’S BEEN DAYS.

And it’s the mixing of all these people, rockstars and not, unsung heroes and not, working on so many projects, enabling each other and making us all better at what we do that makes working at Valve great.

To me, at least.

Tune in next time when I’ll be MS Painting some XFB memes and raging about Big Triangle since that’s apparently my niche.

July 28, 2023

EOSS in Prague was great, lots of hallway track, good talks, good food, excellent tea at meetea - first time I had proper tea in my life, quite an experience. And also my first talk since covid, packed room with standing audience, apparently one of the top ten most attended talks per LF’s conference report.

The video recording is now up, and I’ve uploaded the fixed slides, including the missing slide that I accidentally cut in a last-minute edit. It’s the same content as my blog posts from last year, first talking about locking engineering principles and then the hierarchy of locking engineering patterns.

July 24, 2023
Introduction Hello again! A lot has happened over the past few weeks and we have achieved a milestone that I feel comfortable in talking more about. More precisely, NVK is now multi-plane format ready, YCbCr formats are advertized and supported, and YCbCr sampler support is being worked on now. Today, we’ll be covering the very first point, which is the first and biggest stage in our project, ...
July 18, 2023

Hi all!

As usual, this month has been rich in Wayland-related activities. Rose has continued building and upstreaming better frame scheduling infrastructure for wlroots; you can read more on her blog. I’ve resurrected an old patch to make wlroots behave better when the GPU is under high load. In my testing this improves latency a lot in some specific scenarios and on some specific hardware, but doesn’t help on some others. It’s not super clear if anything can be done about this, it may be that we are hitting some hardware limitations here: GPUs don’t know how to preempt tasks very well.

I’ve also started working on explicit synchronization again. This was previously blocked on a hard problem: drivers may want to use a new kind of synchronization fence primitive (user-space memory fences) and it wasn’t clear how the current primitives (drm_syncobj) would hold up. We’ve been talking about this new primitive for a few years but unfortunately it’s a complicated matter and nothing new has surfaced. However, after discussing with Daniel Vetter, we’ve come to the conclusion that the kernel will provide backwards compatibility for drm_syncobj, so we can just stop worrying and use that as the basis for explicit synchronization protocols and implementations. Moreover, NVIDIA engineers are interested in helping with this effort, so I hope we can keep the momentum and join forces to push the new protocol, APIs and implementations to the finish line.

There is a lot to be done to plumb explicit synchronization. This month I’ve respinned a new kernel uAPI patch to allow compositors to wait on a drm_syncobj without blocking. This also involved writing a test suite in IGT and a wlroots patch to use the new uAPI. Everything is now reviewed, I hope to merge this soon. Apart from this, we also need a new Wayland protocol, a new Vulkan extension for drm_syncobj import/export, more implementations of the protocol, ideally yet another new kernel uAPI to improve interoperability with sync_file, and even a new X11 protocol so that legacy X11 clients (read: games) can take advantage of this whole thing. Oh my… As French people say, there is some bread on the table.

In other Wayland news, we’ve started having some more-or-less weekly meetings for wayland-protocols standardization. We’ve been talking about upstreaming some of the stuff currently in a private GTK protocol, IMEs, and layer-shell. It’s been great to be able to discuss face-to-face about blockers for these protocols. The meeting notes are available on the wiki. We’ve done a lot of talking and gesturing, but also some actual work: security-context has finally (!) been merged, and I’ve updated the ext-layer-shell patch.

Apart from the explicit synchronization work, I’ve sent a few other kernel patches. Numerous patches to improve the kernel uAPI documentation, and a few patches to add more information to the hotplug events sent by bridge/i915/nouveau so that compositors don’t need to reload the whole KMS state on each hotplug event (instead, they can now only reload the KMS state of the one specific connector which got hotplugged). I’ve reviewed a few patches as well. Thomas Zimmermann has made it so all DRM drivers now support DMA-BUFs (required for wlroots to run), so now wlroots works on e.g. gma500. AMD engineers have sent patches to support more than 64 DRM devices, there are some subtle uAPI stability issues at play I’ve tried to provide feedback on.

Let’s wrap up this status update with a collection of various smaller happenings. I’ve removed dlsym() related magic used in the Wayland test suite which caused sporadic failures on FreeBSD. I’ve been gradually improving the API for go-imap v2 and fixing a few bugs. hut now supports pagination on all commands thanks to tireless work by Thorben Günther. kanshi now supports configuring adaptive sync (VRR). I’ve improved the API of go-oauth2 a bit. Last but not least, I’ve reworked an old patch to make it easier to parse scfg files from Go programs, by defining a Go struct instead of hand-rolling parsing code.

See you next month!

July 14, 2023

I recently came across tinygrad as a small powerful nn framework that had an OpenCL backend target and could run LLaMA model.

I've been looking out for rusticl workloads, and this seemed like a good one, and I could jump on the AI train, and run an LLM in my house!

I started it going on my Radeon 6700XT with the latest rusticl using radeonsi with the LLVM backend, and I could slowly interrogate a model with a question, and it would respond. I've no idea how performant it is vs ROCm yet, which seems to be where tinygrad is more directed, but I may get to that next week.

While I was there though, I decided to give the Mesa ACO compiler backend a go: it's been tied into radeonsi recently, and I had done some hacks before to get compute kernels to run. I reproduced said hacks on the modern code and gave it a run.

tinygrad comes with a benchmark script called benchmark_train_efficientnet so I started playing with it to see what low hanging fruit I could find in an LLVM vs ACO shootout.

The bench does 10 runs: the first is where lots of compilation happens, the last is well primed cache-wise. Here are the figures from the first and last runs with a release build of llvm and mesa (and the ACO hacks).

LLVM:

215.78 ms cpy,  12245.04 ms run,  120.33 ms build, 12019.45 ms realize,  105.26 ms CL,   -0.12 loss,  421 tensors, 0.04 GB used,      0.94 GFLOPS

10.25 ms cpy,   221.02 ms run,   83.50 ms build,   36.25 ms realize,  101.27 ms CL,   -0.01 loss,  421 tensors, 0.04 GB used,     52.11 GFLOPS

ACO:

71.10 ms cpy,  3443.04 ms run,  112.58 ms build, 3214.13 ms realize,  116.34 ms CL,   -0.04 loss,  421 tensors, 0.04 GB used,      3.35 GFLOPS
10.36 ms cpy,   234.90 ms run,   84.84 ms build,   36.51 ms realize,  113.54 ms CL,    0.05 loss,  421 tensors, 0.04 GB used,     49.03 GFLOPS

So ACO is about 4 times faster to compile but produces binaries that are less optimised.

The benchmark produces 148 shaders:

LLVM:

126 Max Waves: 16 
  6 Max Waves: 10
  5 Max Waves: 9
  6 Max Waves: 8
  5 Max Waves: 4


ACO:

 96 Max Waves: 16
 36 Max Waves: 12
  2 Max Waves: 10
 10 Max Waves: 8
  4 Max Waves: 4

So ACO doesn't quite get the optimal shaders for a bunch of paths, even with some local hackery I've done to make it do better.[1]

I'll investigate ROCm next week maybe, got a bit of a cold/flu, and large GPU stacks usually make me want to wipe the machine after I test them :-P

[1] https://gitlab.freedesktop.org/airlied/mesa/-/commits/radeonsi-rusticl-aco-wip


July 09, 2023

I'm suffering from having a mortal form again, but things are moving in the general direction of progress.

Or "Rose, it's 2 in the morning!" Yeah yeah, whatever, you're not my mum.

Imperfections

Some would call this whining - skip this section if you're here for technology :)

You're not supposed to make yourself work when you don't have energy to because you'll feel bad. People have tried telling me this and I've tried listening but to really take it on board I had to figure out what low energy actually feels like, so here we are, skipping a week of status reporting and holding a suspiciously high Factorio play time. I spent some of that play time making a cool blue circuit factory! Downtime is a good idea, hopefully - we'll find out next week whether it worked.

It's surprising that one of the hardest problems given to me by the Fates has been fighting against myself, which sounds overly dramatic but in a literal sense is true. I would be moving faster if I felt up to it, but I don't feel up to it because I moved too fast recently. It's my fault because I wore myself out, but it's not my fault to rest when I need to, so instinctively I remain undecided on whether it's my fault. Sadly this isn't a balance that I've learned to strike, at least not for large scale work that I care about.

Add this to a general guilt for doing less than others seem to be doing (a velocity- rather than the famous competence-based impostor syndrome) and the work that was once appealing becomes more distant. LoC metrics are a favourite of crap managers, quick glancers, and the part of my subconscious that judges my self worth. It's not ideal and it's even not-idealer when your work is mostly thinking and not actually that much coding - see the previous report for a bunch of musings about what code should be written and not much written code. It's valid work! But the goblin in my skull disagrees. The mortal form disappoints me. I was hoping to discover my inner cold programming machine but I just found some boring human imperfections. Yawn!

This isn't what I was expecting to write about but I think it's helping. I'm sure these aren't unique experiences but they worry me nonetheless, which is partially because I'm hardwired to be worrying about something most of the time.

In a couple of days it will all be OK because I'll be able to play Counter-Strike again and that will for sure make my productivity go up, or down. The paradox of relaxing!

The Happenings

As predicted, I have to face prediction. Before I do that, I want to get a feel for the behaviour of compositors' performance so I'm not mathsing in the dark, and my weapon of choice is Linux's tracing system which either is called ftrace or has a component called ftrace. I can't tell which.

We've met Linux's tracing before. The screenshots from GPUVis were made of data extracted from it, which makes it an attractive answer to the question "where do I put all my data". In theory, if wlroots gains the ability to output events to this system, GPUVis will automatically be able to display these events as it does all the others.

The mechanism for userspace to emit events in this way landed in Linux 6.4 which was unleashed about 12 hours before I realised that my laptop's 6.3 series kernel didn't have support for it and nearly gave up. Until 6.4, the feature was gated behind CONFIG_BROKEN and looked truly like a lost cause. Thankfully Simon noticed that 6.4 held the answer to my problems and I found things to do while I waited for it to hit my distribution. Thrilling! We're back on track.

To hide the horrors of a bare UAPI from wlroots, I wrote and published libuserevents, which is my first C library and will make interacting with user_events amazing and great and you should definitely use it. There are whispers of integration into wlroots so far. I hope eventually I'll have a nice tool that can monitor a running compositor and show a graph of the frame times because that will at least be something pretty to look at to get away from thinking.

In the background there's a scene timer wriggling its way through review and the dreaded How To Schedule Frame Signals is looming over us all. I forgot to submit the Vulkan timer in all the ruckus. Oh well, apparently no one's supposed to be using the Vulkan backend yet anyway so I doubt there's anyone holding their breath.

I've also just noticed that the second status report has links to git branches instead of commits, so they're likely very stale by now. Remind past me to not do that, that moron.

Who knows what the future holds? Join us next week to find out.