
If you’ve been paying attention to the evolution of the Linux gaming ecosystem in recent years, including the release of the Steam Deck and the new Steam Deck OLED, it’s likely your initial reaction to the blog post title is a simple “OK”. However, I’m coming from a very particular place so I wanted to explain my point of view and the significance of this, and hopefully you’ll find the story interesting.
As background, let me say I’ve always gamed on Windows when using my PC. If you think I’m an idiot for doing so lately, especially because my work at Igalia involves frequently interacting with Valve contractors like Samuel Pitoiset, Timur Kristóf, Mike Blumenkrantz or Hans-Kristian Arntzen, you’d be more than right. But hear me out. I’ve always gamed on Windows because it’s the safe bet. With a couple of small kids at home and very limited free time, when I game everything has to just work. No fiddling around with software, config files, or wasting time setting up the software stack. I’m supposed to boot Windows when I want to play, play, and then turn my computer off. The experience needs to be as close to a console as possible. And, for anything non-gaming, which is most of it, I’d be using my Linux system.
In recent years, thanks to the work done by Valve, the Linux gaming stack has improved a lot. Despite this, I’ve kept gaming on Windows for a variety of reasons:
For a long time, my Linux disk only had a capacity of 128GB, so installing games was not a real possibility due to the amount of disk space they need.
Also, I was running Slackware and installing Steam and getting the whole thing running implied a fair amount of fiddling I didn’t even want to think about.
Then, when I was running Fedora on a large disk, I had kids and I didn’t want to take any risks or possibly waste time on that.
So, what changed?
Earlier this year I upgraded my PC and replaced an old Intel Haswell i7-4770k with a Ryzen R5 7600X, and my GPU changed from an NVIDIA GTX 1070 to a Radeon RX 6700. The jump in CPU power was much bigger and more impressive than the more modest jump in GPU power. But talking about that and the sorry state of the GPU market is a story for another blog post. In any case, I had put up with the NVIDIA proprietary driver for many years and I think, on Windows and for gaming, NVIDIA is the obvious first choice for many people, including me. Dealing with the proprietary blob under Linux was not particularly problematic, especially with the excellent way it’s handled by RPMFusion on Fedora, where essentially you only have to install a few packages and you can mostly forget about it.
However, given my recent professional background I decided to go with an AMD card for the first time. I wanted to use a fully open source graphics stack and I didn’t want to think about making compromises in Wayland support or other fronts whatsoever. Plus, at the time I upgraded my PC, the timing was almost perfect for me to switch to an AMD card, because:
AMD cards were, in general, performing better than NVIDIA cards at the same price, except for ray tracing.
The RX 6700 non-XT was on sale.
Its performance was roughly on par with a PS5.
It didn’t draw a ton of power like many recent high-end GPUs (175W, similar to the 1070 and its 150W TDP).
After the system upgrade, I did notice a few more stability problems when gaming under Windows, compared to what I was used to with an NVIDIA card. You can find thousands of opinions, comments and anecdotes on the Internet about the quality of AMD drivers, and a lot of people say they’re a couple of steps below NVIDIA drivers. It’s not my intention at all to pile on those, but it’s true my own personal experience is having generally more crashes in games and having to face more weird situations since I switched to AMD. Normally, it doesn’t get to the point of being annoying at all, but sometimes it’s a bit surprising and I could definitely notice that increase in instability without any bias on my side, I believe. Which takes us to Far Cry 6.
A few days ago I finished playing Doom Eternal and its expansions (really nice game, by the way!) and I decided to go with Far Cry 6 next. I’m slowly working my way up with some more graphically demanding games that I didn’t feel comfortable playing on the 1070. I went ahead and installed the game on Windows. Being a big 70GB download (100GB on disk), that took a bit of time. Then I launched it, adjusted the keyboard and mouse settings to my liking and went to the video options menu. The game had chosen the high preset for me and everything looked good, so I attempted to run the in-game benchmark to see if the game performed well with that preset (I love it when games have built-in benchmarks!). After a few seconds in a loading screen, the game crashed and I was back on the desktop. “Oh, what a bad way to start!”, I thought, without knowing what lay ahead. I launched the game again, same thing.
Over the course of the two hours that followed, I tried everything:
Launching the main game instead of the benchmark, just in case the bug only happened in the benchmark. Nope.
Lowering quality and resolution.
Disabling any advanced setting.
Trying windowed mode, or borderless full screen.
Vsync off or on.
Disabling the overlays for Ubisoft, Steam, AMD.
Rebooting multiple times.
Uninstalling the drivers normally as well as using DDU and installing them again.
Same result every time. I also searched on the web for people having similar problems, but got no relevant search results anywhere. Yes, a lot of people both using AMD and NVIDIA had gotten crashes somewhere in the game under different circumstances, but nobody mentioned specifically being unable to reach any gameplay at all. That day I went to bed tired and a bit annoyed. I was also close to having run the game for 2 hours according to Steam, which is the limit for refunds if I recall correctly. I didn’t want to refund the game, though, I wanted to play it.
The next day I was ready to uninstall it and move on to another title in my list but, out of pure curiosity, given that I had already spent a good amount of time trying to make it run, I searched for it on the Proton compatibility database to see if it could be run on Linux, and it seemed to be possible. The game appeared to be well supported and it was verified to run on the Deck, which was good because both the Deck and my system have an RDNA2 GPU. In my head I wasn’t fully convinced this could work, because I didn’t know if the problem was in the game (maybe a bug with recent updates) or the drivers or anywhere else (like a hardware problem).
And this was, for me, when the fun started. I installed Steam on Linux from the Gnome Software app. For those who don’t know it, it’s like an app store for Gnome that acts as a frontend to the package manager.
Steam showed up there with 3 possible sources: Flathub, an “rpmfusion-nonfree-steam” repo and the more typical “rpmfusion-nonfree” repo. I went with the last option and soon I had Steam in my list of apps. I launched it and authenticated using the Steam mobile app’s QR code scanning function (which is a really cool way to log in, by the way, without needing to recall your username and password).
My list of installed games was empty and I couldn’t find a way to install Far Cry 6 because it was not available for Linux. However, I thought there should be an easy way to install it and launch it using the famous Proton compatibility layer, and a quick web search revealed I only had to right-click on the game title, select Properties and choose to “Force the use of a specific Steam Play compatibility tool” under the Compatibility section. Click-click-click and, sure, the game was ready to install. I let it download again and launched it.
Some stuff pops up about processing or downloading Vulkan shaders and I see it doing some work. In that first launch, the game takes more time to start compared to what I had seen under Windows, but it ends up launching (and subsequent launches were noticeably faster). That includes some Ubisoft Connect stuff showing up before the game starts and so on. Intro videos play normally and I reach the game menu in full screen. No indication that I was running it on Linux whatsoever. I go directly to the video options menu, see that the game again selected the high preset, I turn off VSync and launch the benchmark. Sincerely, honestly, completely and totally expecting it to crash one more time and that would’ve been OK, pointing to a game bug. But no, for the first time in two days this is what I get:
The benchmark runs perfectly, no graphical glitches, no stuttering, frame rates above 100FPS normally, and I had a genuinely happy and surprised grin on my face. I laughed out loud and my wife asked what was so funny. Effortless. No command lines, no config files, nothing.
As of today, I’ve played the game for over 30 hours and the game has crashed exactly once out of the blue. And I think it was an unfortunate game bug. The rest of the time it’s been running as smooth and as perfect as the first time I ran the benchmark. Framerate is completely fine and way over the 0 frames per second I got on Windows because it wouldn’t run. The only problem seems to be that when I finish playing and exit to the desktop, Steam is unable to stop the game completely for some reason (I don’t know the cause) and it shows up as still running. I usually click on the Stop button in the Steam interface after a few seconds, it stops the game and that’s it. No problem synchronizing game saves to the cloud or anything. Just that small bug that, again, only requires a single extra click.
Then I remembered something that had happened a few months before, prior to starting to play Doom Eternal under Windows. I had tried to play Deathloop first, another game in my backlog. However, the game crashed every few minutes and an error window popped up. The amount and timing of the crashes didn’t look constant, and lowering the graphics settings sometimes would allow me to play the game a bit longer, but in any case I wasn’t able to finish the game intro level without crashes and being very annoyed. Searching for the error message on the web, I saw it looked like a game problem that was apparently affecting not only AMD users, but also NVIDIA ones, so I had mentally classified that as a game bug and, similarly to the Far Cry 6 case, I had given up on running the game without refunding it, hoping to be able to play it in the future.
Now I was wondering if it was really a game bug and, even if it was, if maybe Proton could have a workaround for it and maybe it could be played on Linux. Again, ProtonDB showed the game to be verified on the Deck with encouraging recent reports. So I installed Deathloop on Linux, launched it just once and played for 20 minutes or so. No crashes and I got as far as I had gotten on Windows in the intro level. Again, no graphical glitches that I could see, smooth framerates, etc. Maybe it was a coincidence and I was lucky, but I think I will be able to play the game without issues when I’m done with Far Cry 6.
In conclusion, this story is another data point that tells us the quality of Proton as a product and software compatibility layer is outstanding. In combination with some high quality open source Mesa drivers like RADV, I’m amazed the experience can actually be better than gaming natively on Windows. Think about that: the Windows game binary running natively on an official DX12 or Vulkan driver crashes more and doesn’t work as well as the same game running on top of a Windows compatibility layer with a graphics API translation layer, on top of a different OS kernel and a different Vulkan driver. Definitely amazing to me, and it speaks volumes about the work Valve has been doing on Linux. Or it could also speak badly of AMD Windows drivers, or both.
Sure, some new games on launch have more compatibility issues, bugs that need fixing, maybe workarounds applied in Proton, etc. But even in those cases, if you have a bit of patience, play the game some months down the line and check ProtonDB first (ideally before buying the game), you may be in for a great experience. You don’t need to be an expert either. Not to mention that some of these details are even better and smoother if you use a Steam Deck as compared to an (officially) unsupported Linux distribution like I do.
During these last two weeks I have been working towards adding support for more operations and kinds of convolutions so we can run more interesting models. As a first target, I’m aiming for MobileDet, which, though a bit old by now (it was introduced in 2020), is still the state of the art in object detection on mobile, and is used for example in Frigate NVR.
I haven't mentioned it in a few updates, but all this work keeps being sponsored by Libre Computer, who are aiming to be the first manufacturer of single board computers to provide accelerated machine learning with open source components. Check out Alta and Solitude for the first such boards in the market.
Igalia’s Christian Gmeiner has been giving me great feedback on the merge request, and as part of that I submitted a patch to the kernel to retrieve some parameters that are needed when programming the hardware and that are best not left hardcoded.
This means that upstreaming to Mesa loses some urgency, as we are going to have to wait until the merge window for 6.8 opens anyway, after 6.7 final is out.
Until now I had implemented support only for weights with dimensions 1x1 (aka pointwise convolutions) and 3x3 (the most common by far). Some of the convolutions in MobileDet use 5x5 weight tensors though, so I had to implement support for them. It was a matter of adding some extra complexity to the code that compresses the weight tensors in the format that the hardware expects.
I implemented this for all kinds of supported convolutions: depthwise, strided, with padding, etc.
I observed that the vendor blob implements addition operations with convolution jobs, so I looked deeper and saw that it was implementing the addition of two input tensors by placing them as the two channels of a single tensor, then passing them through a 1x1 convolution with a specially crafted weight tensor and bias vector.
This is working with hardcoded values for some specific input image dimensions, but I still need to gather more data so I can come up with a generic expression.
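To make the trick concrete, here is a small float-based NumPy sketch of my own (not the driver or blob code): stacking the two inputs as channels and applying a 1x1 convolution with weights [1, 1] and zero bias reproduces element-wise addition. For quantized tensors the weights and bias presumably also have to compensate for scales and zero points, which would explain the specially crafted values.

# Illustrative sketch only: element-wise addition expressed as a 1x1 convolution
# over a tensor whose two channels are the inputs to be added.
import numpy as np

h, w = 8, 8
a = np.random.rand(h, w).astype(np.float32)
b = np.random.rand(h, w).astype(np.float32)

stacked = np.stack([a, b], axis=-1)          # shape (h, w, 2): inputs as two channels
weights = np.array([1.0, 1.0], np.float32)   # 1x1 kernel, 2 input channels -> 1 output
bias = 0.0

# A 1x1 convolution is just a per-pixel dot product over the channel axis.
out = stacked @ weights + bias               # shape (h, w)

assert np.allclose(out, a + b)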
One more missing operation commonly used in models for mobile is pooling, in its different kinds: average, max, etc.
The blob implements these operations on the programmable core, with CL-like kernels.
So I dusted off the work that I did in the first half of 2023 and added code to Teflon for passing these operations to the Gallium drivers. Then I added a new kind of operation to the ML backend in Etnaviv to make use of the programmable core.
Things work fine, even if for now I am storing the kernel machine code in a blob inside the C code. The next step will be to implement the kernel in NIR and generate the machine code using the existing compiler in Etnaviv.
With this piece of work, we are now able to use all the hardware units in the NPU, and even if the programmable core in this configuration is really underpowered, it will allow us to keep the model in memory close to the NPU, instead of having to ping-pong between the NPU and CPU domains.
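For reference, this is roughly what one of those pooling operations has to compute; the sketch below is purely illustrative NumPy, not the CL-like kernel that actually runs on the programmable core.

# Purely illustrative: 2x2 average pooling with stride 2 over a single-channel tensor.
import numpy as np

def avg_pool_2x2(x: np.ndarray) -> np.ndarray:
    h, w = x.shape
    # Group the input into non-overlapping 2x2 windows and average each window.
    return x[:h // 2 * 2, :w // 2 * 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

x = np.arange(16, dtype=np.float32).reshape(4, 4)
print(avg_pool_2x2(x))  # each output element is the mean of a 2x2 input block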
With new operations and kinds of convolutions being added, I was starting to have trouble testing all the possible combinations in a practical way, as the test suite that I had was taking more than 20 minutes for a full run.
To get around that, I reimplemented the tests in C++ with GoogleTest, which is supported by Emma Anholt's deqp-runner and will allow me to run the tests in parallel, making full use of the CPU cores in the board.
That made a big difference, but with so many testing combinations being added (over 3,000 as of now), it was still not fast enough for me. So I remembered an approach that we were considering to speed up execution of Vulkan and OpenGL conformance tests: caching the golden images that are used to compare and check that the output from the hardware is correct.
With that, the bottleneck is the network, as I store the cache in NFS, and I can run the full test suite in less than 3 minutes.
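In case the idea is useful to others, here is a minimal sketch of that caching scheme (in Python for brevity; the real tests are C++/GoogleTest, and the cache directory is made up): the expected output for a given test configuration is computed once, stored under a key derived from the test parameters in a shared directory on NFS, and later runs just load and compare it.

# Sketch of a golden-output cache keyed by test parameters; illustrative only.
import hashlib
import pathlib
import numpy as np

CACHE_DIR = pathlib.Path("/mnt/nfs/golden-cache")  # hypothetical NFS mount

def golden_path(test_params: dict) -> pathlib.Path:
    # Key the cache entry on a stable hash of the test parameters.
    key = hashlib.sha256(repr(sorted(test_params.items())).encode()).hexdigest()
    return CACHE_DIR / f"{key}.npy"

def expected_output(test_params: dict, compute_reference) -> np.ndarray:
    path = golden_path(test_params)
    if path.exists():
        return np.load(path)                 # cache hit: skip the slow reference run
    golden = compute_reference(test_params)  # slow path: run the reference implementation
    path.parent.mkdir(parents=True, exist_ok=True)
    np.save(path, golden)
    return golden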
Except that I started finding some tests that were failing randomly, especially once the cache of test results had already been brought into the filesystem cache on the board. After a lot of scratching my head, I came to realize that the Etnaviv kernel driver was trying to submit up to 4 jobs at the same time to the hardware, if userspace was fast enough to enqueue that many jobs before the previous ones had finished.
There is a kernel module parameter to set the number of jobs that are submitted to the hardware at any given point, and setting that to 1 took me back to rock solid test results, which is an absolute need for keeping the driver author's sanity.
I have added support for a lot of new operations and parameter combinations rather quickly, and the code is not as clean as I would like, in part due to the need for some refactoring.
So in the next days I will be investing some time in cleaning things up, and afterwards will move to more operations in MobileDet.
I have not been so active for a while with writing these Fedora Workstation updates and part of the reason was that I felt I was beginning to repeat myself a lot, which I partly felt was a side effect of writing them so often, but with some time now since my last update I felt the time was ripe again. So what are some of the things we have been working on and what are our main targets going forward? This is not an exhaustive list, but hopefully it includes items you find interesting. Apologies for weird sentences and potential spelling mistakes, but it ended up being a long post and when you read your own words over for the Nth time you start going blind to issues :)
So let’s start with one of your favorite topics, PipeWire. As you probably know, PipeWire 1.0 is now out and I feel it is a project we definitely succeeded with, so big kudos to Wim Taymans for leading this effort. I think the fact that we got both the creator of JACK, Paul Davis, and the creator of PulseAudio, Lennart Poettering, to endorse it means our goal of unifying the Linux audio landscape is being met. I include their endorsement comments from the PipeWire 1.0 release announcement here:
“PipeWire represents the next evolution of audio handling for Linux, taking
the best of both pro-audio (JACK) and desktop audio servers (PulseAudio) and
linking them into a single, seamless, powerful new system.”
– Paul Davis, JACK and Ardour author
“PipeWire is a worthy successor to PulseAudio, providing a feature set
closer to how modern audio hardware works, and with a security model
with today’s application concepts in mind. Version 1.0 marks a
major milestone in completing the adoption of PipeWire in the standard
set of Linux subsystems. Congratulations to the team!”
– Lennart Poettering, Pulseaudio and systemd author
So for new readers, PipeWire is an audio and video server we created for Fedora Workstation to replace PulseAudio for consumer audio and JACK for pro-audio, and to add similar functionality for video to your Linux operating system. So instead of having to deal with two different sound server architectures, users now just have to deal with one and at the same time they get the same advantages for video handling. Since PipeWire implements both the PulseAudio API and the JACK API, it is a drop-in replacement for both of them without needing any changes to the audio applications built for those two sound servers. Wim Taymans, alongside the amazing community that has grown around the project, has been hard at work maturing PipeWire and adding any missing feature they could find that blocked anyone from moving to it from either PulseAudio or JACK. Wim’s personal focus recently has been on an IRQ-based ALSA driver for PipeWire to be able to provide 100% performance parity with the old JACK server. So while a lot of pro-audio users felt that PipeWire’s latency was already good enough, this work by Wim shaves off the last few milliseconds to reach the same level of latency as JACK itself had.
In parallel with the work on PipeWire, the community and especially Collabora have been hard at work on the new 0.5 release of WirePlumber, the session manager which handles all policy issues for PipeWire. I know people often get a little confused about PipeWire vs WirePlumber, but think of it like this: PipeWire provides you the ability to output audio through a connected speaker, through a bluetooth headset, through an HDMI connection and so on, but it doesn’t provide any ‘smarts’ for how that happens. The smarts are instead provided by WirePlumber, which contains policies to decide where to route your audio or video, either based on user choice or through preset policies making the right choices automatically, like moving the audio to your internal speaker if you disconnect your USB speaker. Anyway, WirePlumber 0.5 will be a major step forward for WirePlumber, moving from Lua scripts for configuration to JSON, while retaining Lua for scripting. This has many advantages, but I point you to this excellent blog post by Collabora’s Ashok Sidipotu for the details. Ashok has further details about WirePlumber 0.5 that you can find here.
With PipeWire 1.0 out the door I feel we are very close to reaching one of our initial goals with PipeWire: to remove the need for custom pro-audio distributions like Fedora JAM or Ubuntu Studio, and instead just let audio folks be able to use the same great Fedora Workstation as the rest of the world. With 1.0 done, Wim plans next to look a bit at things like the configuration tools used by pro-audio folks and also dive more into the Flatpak portal needs of pro-audio applications, to ensure that Flatpaks + PipeWire is the future of pro-audio.
On the video handling side it’s been a little slower going, since applications there need to be ported away from relying directly on v4l. Jan Grulich has been working with our friends at Mozilla and Google to get PipeWire camera handling support into Firefox and Google Chrome. At the moment it looks like the Firefox support will land first; in fact Jan has set up a COPR that lets you try it out here. For tracking the upstream work in WebRTC to add PipeWire support, Jan set up this tracker bug. Getting the web browsers to use PipeWire is important both to enable the advanced video routing capabilities of PipeWire and to provide applications with the ability to use libcamera, which is needed for modern MIPI cameras to work properly under Linux.
Another important application to get PipeWire camera support into is OBS Studio and the great thing is that community member Georges Stavracas is working on getting the PipeWire patches merged into OBS Studio, hopefully in time for their planned release early next year. You can track Georges work in this pull request.
For more information about PipeWire 1.0 I recommend our interview with Wim Taymans in Fedora Magazine and also the interview with Wim on Linux Unplugged podcast.
HDR
HDR, or High Dynamic Range, is another major effort for us. HDR is a technology I think many of you have become familiar with due to it becoming quite common in TVs these days. It basically provides for greatly increased color depth and luminance on your screen. This entails a lot of changes throughout the stack, because when you introduce it into an existing ecosystem like the Linux desktop you have to figure out how to combine new HDR-capable applications and content with old non-HDR applications and content. Sebastian Wick, Jonas Ådahl, Olivier Fourdan, Michel Daenzer and more on the team have been working with other members of the ecosystem from Intel, AMD, NVIDIA, Collabora and more to pick and define the standards and protocols needed in this space. A lot of design work was done early in the year, so we have been quite focused on implementation work across the drivers, Wayland, Mesa, GStreamer, Mutter, GTK+ and more. Some of the more basic scenarios, like running a fullscreen HDR application, are close to ready, while we are still working hard on getting all the needed pieces together for the more complex scenarios, like running SDR and HDR windows composited together on your desktop. So getting, for instance, full screen games to run in HDR mode with Steam should happen shortly, but the windowed support will probably land closer to summer next year.
Wayland remoting
One feature we have also been spending a lot of time on is enabling remote logins to a Wayland desktop. You have been able to share your screen under Wayland more or less from day one, but it required your desktop session to be already active. But let’s say you wanted to access your Wayland desktop running on a headless system: you have been out of luck so far and had to rely on an old X session instead. So putting in place all the pieces for this has been quite an undertaking, with work having been done on PipeWire, on Wayland portals, the GNOME remote desktop daemon, libei (the new input emulation library), GDM and more. The pieces needed are finally falling into place and we expect to have everything needed landed in time for GNOME 46. This support is currently done using a private GNOME API, but a vendor-neutral API is being worked on to replace it.
As a sidenote, not directly related to desktop remoting: libei has also enabled us to bring XTEST support to XWayland, which was important for various applications, including Valve’s gamescope.
NVIDIA drivers
One area we keep investing in is improving the state of NVIDIA support on Linux. Part of that is being the main company backing the continued development of the Nouveau graphics driver. The challenge with Nouveau is that for the longest while it offered next to no hardware acceleration for 3D graphics. The reason for this was that the firmware that NVIDIA provided for Nouveau to use didn’t expose that functionality, and since recent generations of NVIDIA cards only work with firmware signed by NVIDIA, this left us stuck. So Nouveau was a good tool for doing an initial install of a system, but if you were doing any kind of serious 3D acceleration, including playing games, then you would need to install the NVIDIA binary driver. In the last year that landscape has changed drastically with the release of the new out-of-tree open source driver from NVIDIA. Alongside that driver, a new firmware has also been made available, one that does provide full support for hardware acceleration.
Let me quickly inject an explanation of out-of-tree versus in-tree drivers here. An in-tree driver is basically a kernel driver for a piece of hardware that has been merged into the official Linux kernel from Linus Torvalds and is thus maintained as part of the official Linux kernel releases. This ensures that the driver integrates well with the rest of the Linux kernel and that it gets updated in sync with the rest of the Linux kernel. So Nouveau is an in-tree kernel driver which also integrates with the rest of the open source graphics stack, like Mesa. The new NVIDIA open source driver is an out-of-tree driver which ships as a separate source code release on its own schedule, but of course NVIDIA works to keep it working with the upstream kernel releases (which is a lot of work and thus considered a major downside of being an out-of-tree driver).
As of the time of writing this blog post, NVIDIA’s out-of-tree kernel driver and firmware are still a work in progress for display use cases, but that is changing with NVIDIA exposing more and more display features in the driver (and the firmware) with each new release they do. But if you saw the original announcement of the new open source driver from NVIDIA and have been wondering why no distribution relies on it yet, this is why. So what does this mean for Nouveau? Well, our plan is to keep supporting Nouveau for the foreseeable future because it is an in-tree driver, which makes it a lot easier to ensure it keeps working with each new upstream kernel release.
At the same time, the new firmware updates allow Nouveau to eventually offer performance levels competitive with the official out-of-tree driver, kind of how the open source AMD driver with Mesa offers performance comparable to AMD’s binary GPU driver userspace. So Nouveau maintainer Ben Skeggs spent the last year working hard on refactoring Nouveau to work with the new firmware and we now have a new release of Nouveau out showing the fruits of that labor, enabling support for NVIDIA’s latest chipset. Over time we will have it cover more chipsets and expand Vulkan and OpenGL (using Zink) support to be a full-fledged accelerated graphics driver.
Some news here: Ben, after having worked tirelessly on keeping Nouveau afloat for so many years, decided he needed a change of pace and is leaving software development behind for the time being. A big thank you to Ben from all of us at Red Hat and Fedora! The good news is that Danilo Krummrich will take over as the development lead, with Lyude Paul specifically taking on the display side of the driver. We also expect to have other members of the team chipping in too. They will pick up Ben’s work and continue working with NVIDIA and the community on a bright future for Nouveau.
As I mentioned, the new open source driver from NVIDIA is still being matured for the display use case, and until it works fully as a display driver Nouveau will not be a full alternative either, since they share the same firmware. So people will need to rely on the binary NVIDIA driver for some time still. One thing we are looking at and discussing is whether there are ways for us to improve the experience of using that binary driver with Secure Boot enabled. At the moment that requires quite a bit of manual fiddling with tools like mokutil, but we have some ideas on how to streamline that a bit. It is a hard nut to crack due to a combination of policy issues, legal issues, security issues and hardware/UEFI bugs, so I am making no promises at this point, just a promise that it is something we are looking at.
Accessibility
Accessibility is an important feature for us in Fedora Workstation and thus we hired Lukáš Tyrychtr to focus on the issue. Lukáš has been working across the stack fixing issues blocking proper accessibility support in Fedora Workstation and has also participated in various accessibility related events. There is still a lot to do there, so I was very happy to hear recently that the GNOME Foundation got a million euro sponsorship from the Sovereign Tech Fund to improve various things across the stack, especially accessibility. So the combination of Lukáš’s continued efforts and that new investment should make for a much improved accessibility experience in GNOME and in Fedora Workstation going forward.
GNOME Software
Another area that we keep investing in is improving GNOME Software, with Milan Crha working continuously on bugfixing and performance improvements. GNOME Software is actually a fairly complex piece of software, as it has to be able to handle the installation and updating of RPMs, OSTree system images, Flatpaks, fonts and firmware for us, in addition to the formats it handles for other distributions. For some time it felt like GNOME Software was struggling with the load of all those different formats and use cases and was becoming both slow and prone to error messages. Milan has been spending a lot of time dealing with those issues one by one and also recently landed some major performance improvements making the GNOME Software experience a lot better. One major change that Milan is working on, which I think we will be able to land in Fedora Workstation 40/41, is porting GNOME Software to use DNF5. The main improvement end users will probably notice is that it unifies the caches used by GNOME Software and by dnf on the command line, saving you storage space and also ensuring the two are fully in sync on what RPMs are installed/updated at any given time.
Fedora and Flatpaks
Flatpaks are another key element of our strategy for moving the Linux desktop forward, and as part of that we have now enabled all of Flathub to be available if you choose to enable 3rd party repositories when you install Fedora Workstation. This means that the huge universe of applications available on Flathub will be easy to install through GNOME Software alongside the content available in Fedora’s own repositories. That said, we have also spent time improving the ease of making Fedora Flatpaks. Owen Taylor jumped in and removed the dependency on a technology called ‘modularity’, which was initially introduced to Fedora to bring new features around having different types of content and to ease keeping containers up to date. Unfortunately it did not work out as intended and instead became something that everyone just felt made things a lot more complicated, including building Flatpaks from Fedora content. With Owen’s updates, building Flatpaks in Fedora has become a lot simpler and should help energize the effort of building Flatpaks in Fedora.
Toolbx
As we continue marching towards a vision for Fedora Workstation as a highly robust operating system, we keep evolving Toolbx, our tool for running your development environment(s) inside a container, which allows you to keep your host OS pristine and up to date while using specific toolchains and tools inside the development container. This is a hard requirement for immutable operating systems such as Fedora Silverblue or Universal Blue, but it is also useful on operating systems like Fedora Workstation as a way to do development for other platforms, like for instance Red Hat Enterprise Linux.
A major focus for Toolbx since its inception has been to get it to a stage where it is robust and reliable. So for instance, while we prototyped it as a shell script, today it is written in Go to be more maintainable and also to conform with the rest of the container ecosystem. A recent major step forward for that stability is that starting with Fedora 39, the toolbox image is now a release-blocking deliverable. This means it is now built as part of the nightly compose and the whole Toolbx stack (i.e. the fedora-toolbox image and the toolbox RPM) is part of the release-blocking test criteria. This shows the level of importance we put on Toolbx as the future of Linux software development and its criticality to Fedora Workstation. Earlier, we built the fedora-toolbox image as a somewhat separate and standalone thing, and people interested in Toolbx would try to test and keep the whole thing working, as much as possible, on their own. This was becoming unmanageable because Toolbx integrates with many parts of the distribution, from Mutter (i.e., the Wayland and X sockets) to Kerberos to RPM (i.e., %_netsharedpath in /usr/lib/rpm/macros.d/macros.toolbox) to glibc locale definitions and translations. The list of things that could change elsewhere in Fedora, and end up breaking Toolbx, was growing too large for a small group of Toolbx contributors to keep track of.
With the next release we now also have built-in support for Arch Linux and Ubuntu through the --distro flag in toolbox.git main, thanks again to the community contributors who worked with us on this, allowing us to widen the number of distros supported while keeping with our policy of reliability and dependability. And along the same theme of ensuring Toolbx is a tool developers can rely on, we have added lots and lots of new tests. We now have more than 280 tests that run on CentOS Stream 9, all supported Fedoras and Rawhide, and Ubuntu 22.04.
Another feature that Toolbx maintainer Debarshi Ray put a lot of effort into is setting up full RHEL containers in Toolbx on top of Fedora. Today, thanks to Debarshi’s work, you run subscription-manager register --username user@domain.name on the Fedora or RHEL host, and the container is automatically entitled to RHEL content. We are still looking at how we can provide a graphical interface for that process, or at least how to polish up the CLI for doing subscription-manager register. If you are interested in this feature, Debarshi provides a full breakdown here.
Other nice-to-haves added are support for enterprise FreeIPA set-ups, where the user logs into their machine through Kerberos, and support for automatically generated shell completions for Bash, fish and Z shell.
Flatpak and Foreman & Katello
For those out there using Foreman to manage your fleet of Linux installs, we have some good news. We are in the process of implementing support for Flatpaks in these tools so that you can manage and deploy applications in the Flatpak format using them. This is still a work in progress, but the relevant commits are the Pulp commit “Support for Flatpak index endpoints” and the Katello commits “Reporting results of docker v2 repo discovery” and “Support Link header in docker v2 repo discovery”.
LVFS
Another effort that Fedora Workstation has brought to the world of Linux and that is very popular are the LVFS and fwupd firmware update repository and tools. Thanks to that effort we are soon going to be passing one hundred million firmware updates on Linux devices! These firmware updates have helped resolve countless bugs and much improved security for Linux users.
But we are not slowing down. Richard Hughes worked with industry partners this year to define a Bill of Materials definition for firmware updates, allowing users to be better informed about what is included in their firmware updates.
We now support over 1400 different devices on the LVFS (covering 78 different protocols!), with over 8000 public firmware versions from over 150 OEMs and ODMs. We’ve now done over 100,000 static analysis tests on over 2,000,000 EFI binaries in the firmware capsules!
Some examples of recently added hardware:
* AMD dGPUs, Navi3x and above, AVer FONE540, Belkin Thunderbolt 4 Core Hub dock, CE-LINK TB4 Docks, CH347 SPI programmer, EPOS ADAPT 1×5, Fibocom FM101, Foxconn T99W373, SDX12, SDX55 and SDX6X devices, Genesys GL32XX SD readers, GL352350, GL3590, GL3525S and GL3525 USB hubs, Goodix Touch controllers, HP Rata/Remi BLE Mice, Intel USB-4 retimers, Jabra Evolve 65e/t and SE, Evolve2, Speak2 and Link devices, Logitech Huddle, Rally System and Tap devices, Luxshare Quad USB4 Dock, MediaTek DP AUX Scalers, Microsoft USB-C Travel Hub, More Logitech Unifying receivers, More PixartRF HPAC devices, More Synaptics Prometheus fingerprint readers, Nordic HID devices, nRF52 Desktop Keyboard, PixArt BLE HPAC OTA, Quectel EM160 and RM520, Some Western Digital eMMC devices, Star Labs StarBook Mk VIr2, Synaptics Triton devices, System76 Launch 3, Launch Heavy 3 and Thelio IO 2, TUXEDO InfinityBook Pro 13 v3, VIA VL122, VL817S, VL822T, VL830 and VL832, Wacom Cintiq Pro 27, DTH134 and DTC121, One 13 and One 12 Tablets
InputLeap on Wayland
One really interesting feature that landed for Fedora Workstation 39 was support for InputLeap. It’s probably not on most people’s radar, but it’s an important feature for system administrators, developers and generally anyone with more than a single computer on their desk.
Historically, InputLeap is a fork of Barrier, which itself was a fork of Synergy. It allows you to share the same input devices (mouse, keyboard) across different computers (Linux, Windows, macOS) and to move the pointer between the screens of these computers seamlessly as if they were one.
InputLeap has a client/server architecture, with the server running on the main host (the one with the keyboard and mouse connected) and multiple clients, the other machines sitting next to the server machine. That implies two things: the InputLeap daemon on the server must be able to “capture” all the input events and forward them to the remote clients when the pointer reaches the edge of the screen, and the InputLeap client must be able to “replay” those input events on the client host to make it as if the keyboard and mouse were connected directly to that (other) computer. Historically, that relied on X11 mechanisms and neither InputLeap (nor Barrier or even Synergy, as a matter of fact) would work on Wayland.
This is one of the use cases that Peter Hutterer had in mind when he started libEI, a low-level library aimed at providing a separate communication channel for input emulation in Wayland compositors and clients (even though libEI is not strictly tied to Wayland). But libEI alone is far from sufficient to implement the InputLeap features; with Wayland we had the opportunity to make things more secure than X11 and to benefit from the XDG portal mechanisms.
On the client side, for replaying input events, it’s similar to remote desktop, but we needed to update the existing RemoteDesktop portal to pass the libEI socket. On the server side, it required a brand new portal for input capture. These also required their counterparts in the GNOME portal, for both RemoteDesktop and InputCapture [8], and of course all of that needs to be supported by the Wayland compositor, which in the case of GNOME is Mutter. That alone was a lot of work.
Yet, even with all that in place, those are just the basic requirements to support a Synergy/Barrier/InputLeap-like feature; the tools in question need to have support for the portal and libEI implemented to benefit from the mechanisms we’ve put in place and for the whole feature to work and be usable. So libportal was also updated to support the new portal features, and a new “Wayland” backend was contributed to InputLeap alongside the X11, Windows and Mac OS backends.
The merge request in InputLeap was accepted very early, even before the libEI API was completely stabilized and before the rest of the stack was merged, which I believe was a courageous choice from Povilas (who maintains InputLeap) that helped reduce the time to have the feature actually working, considering the number of components and inter-dependencies involved. Of course, there are still features missing in the Wayland backend, like copy/pasting between hosts, but a clipboard interface was fairly recently added to the remote desktop portal and could therefore be used by InputLeap to implement that feature.
Fun fact: Xwayland also grew support for libEI, also using the remote desktop portal, and wires that up to the XTEST extension on X11 that InputLeap’s X11 backend uses, so it might even be possible to use the X11 backend of InputLeap on the client side through Xwayland. But of course it’s better to use the Wayland backend on both the client and server sides.
InputLeap is a great example of collaboration between multiple parties upstream, including key contributions from us at Red Hat, to implement and contribute a feature that has been requested for years upstream.
Thank you to Olivier Fourdan, Debarshi Ray, Richard Hughes, Sebastian Wick and Jonas Ådahl for their contributions to this blog post.
tomeu@arm-64:~/mesa$ ETNA_MESA_DEBUG=ml_msgs python3.10 classification.py -i grace_hopper.bmp -m mobilenet_v1_1.0_224_quant.tflite -l labels_mobilenet_quant_v1_224.txt -e libteflon.so
Loading external delegate from build/src/gallium/targets/teflon/libteflon.so with args: {}
Running the NN job took 13 ms.
0.866667: military uniform
0.031373: Windsor tie
0.015686: mortarboard
0.007843: bow tie
0.007843: academic gown
time: 15.650ms
These will be my priorities during the next couple of weeks, in order:
Hi! This month I’ve started a new PotM called pyonji. It’s an easy-to-use replacement for the venerable git-send-email command. The goal is to make it less painful for a new contributor not familiar with e-mail based patch submission to submit patches.

Users are expected to use the same workflow as GitHub, GitLab and friends when contributing: create a new branch and add commits there. Instead of pushing to a fork though, users simply invoke pyonji.
When run for the first time, pyonji will ask for your e-mail account details: e-mail address, password… and nothing else. The SMTP server hostname, port and other details are automatically detected (via multiple means: SRV records, Mozilla auto-configuration database, common subdomains, etc). Once the password is verified pyonji will store everything in the Git configuration (in the same fashion that git-send-email expects it).
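As a rough idea of what that auto-detection cascade can look like (this is my own sketch, not pyonji’s actual code, and the hostnames are examples): try the RFC 6186 SRV record first, then the Mozilla/Thunderbird auto-configuration database, then fall back to guessing common subdomains.

# Illustrative sketch of SMTP server auto-detection for a given e-mail address.
import socket
import urllib.request
import xml.etree.ElementTree as ET
import dns.resolver  # dnspython, third-party

def detect_smtp(address):
    domain = address.rsplit("@", 1)[1]

    # 1. RFC 6186 SRV record for mail submission.
    try:
        answers = dns.resolver.resolve(f"_submission._tcp.{domain}", "SRV")
        best = min(answers, key=lambda r: (r.priority, -r.weight))
        return str(best.target).rstrip("."), best.port
    except Exception:
        pass

    # 2. Mozilla ISPDB, the database Thunderbird uses for auto-configuration.
    try:
        url = f"https://autoconfig.thunderbird.net/v1.1/{domain}"
        with urllib.request.urlopen(url, timeout=5) as resp:
            root = ET.parse(resp).getroot()
        host = root.findtext(".//outgoingServer[@type='smtp']/hostname")
        port = root.findtext(".//outgoingServer[@type='smtp']/port")
        if host and port:
            return host, int(port)
    except Exception:
        pass

    # 3. Guess common subdomains and check whether something is listening.
    for host in (f"smtp.{domain}", f"mail.{domain}"):
        for port in (465, 587):
            try:
                socket.create_connection((host, port), timeout=5).close()
                return host, port
            except OSError:
                continue
    return None

print(detect_smtp("user@example.org"))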
Then pyonji will present a UI with a list of commits to be submitted for review. The user can tweak details such as the base branch, the mailing list address, the version of the patch, however that’s rarely needed: pyonji will find good defaults for these. The user can add a cover letter if desired with a longer description for the set of patches. Then the big blue “submit” button can be pressed to send the patches.
Unlike git-send-email, pyonji will remember for you what the last submitted version number was (and automatically increment it). pyonji will save the cover letter so that it’s not lost if the network is flaky and you don’t need to re-type it for the next submission. pyonji will not waste your time with uninteresting questions such as “which encoding should I use?”. pyonji will automatically include the base tree information in the patches so that any conflicts are more easily resolved by the reviewer.
Please try it and let me know how it goes! In particular, I’m wondering if the logic to auto-detect the e-mail server settings is robust enough, or if there are e-mail providers I don’t handle correctly yet.
There is still a lot to be done to improve pyonji. Setup is painful for GMail and Fastmail users because app passwords are required. I wanted to use OAuth to fix this but both of these providers heavily restrict how SMTP OAuth apps can be registered. Setup doesn’t work for ProtonMail users because the bridge uses a self-signed certificate, that can be fixed but setup will remain painful. I’d like to add UI to change the base branch, improve the heuristics to pick a good default for the base branch, support for the MAINTAINERS file for easier contribution to big projects such as the kernel, add an easy way to mark a patch series as RFC, and probably a million of other things.
Apart from pyonji, I’ve been working on some graphics-related stuff as always. We’re getting closer to the wlroots 0.17 release, fixing the few remaining blocking issues. A new API to clip surfaces with the scene-graph has been merged, many thanks to Alexander Orzechowski and Isaac Freund! I’ve fixed a Mesa regression introduced by a previous patch I reviewed related to EGL and split render/display SoCs (I hate these). And I’ve been discussing with other kernel developers a way to stop (ab)using KMS dumb buffers for split render/display SoCs (I swear I really hate these). We’re trying to come up with a solution which could in the long run also help with the Buffer Allocation Constraints Problem (see the XDC 2020 talk for more info).
I’ve written a few patches to add support for OAuth 2.0 refresh tokens to meta.sr.ht. If you’ve ever used an OAuth sr.ht app (like hottub or yojo to integrate builds.sr.ht with GitHub or Forgejo), you probably know that tokens expire after one year, and that you need to redo the setup step when that happens. This is annoying, and adding support for refresh tokens to meta.sr.ht and the OAuth apps should fix this.
Last, I’m now part of the FreeDesktop Code of Conduct team. This is not a technical role, but it’s very important to have folks doing this work. I’ve attended a Code of Conduct workshop to learn how to do it, that’s been pretty interesting and helpful. The workshop focused a lot more on trying to change people’s behavior, instead of bringing down the ban hammer.
That’s all for now, see you next month!
Today, 12 years after the meeting where AppStream was first discussed and 11 years after I released a prototype implementation, I am excited to announce AppStream 1.0!
Check it out on GitHub, or get the release tarball or read the documentation or release notes!
I was not in the original AppStream meeting, since in 2011 I was extremely busy with finals preparations and ball organization in high school, but I still vividly remember sitting at school in the students’ lounge during a break and trying to catch the really choppy live stream from the meeting on my borrowed laptop (a futile exercise, I watched parts of the blurry recording later).
I was extremely passionate about getting software deployment to work better on Linux and to improve the overall user experience, and spent many hours on the PackageKit IRC channel discussing things with many amazing people like Richard Hughes, Daniel Nicoletti, Sebastian Heinlein and others.
At the time I was writing a software deployment tool called Listaller – this was before Linux containers were a thing, and building it was very tough due to technical and personal limitations (I had just learned C!). Then in university, when I intended to recreate this tool, but for real and better this time as a new project called Limba, I needed a way to provide metadata for it, and AppStream fit right in! Meanwhile, Richard Hughes was tackling the UI side of things while creating GNOME Software and needed a solution as well. So I implemented a prototype and together we pretty much reshaped the early specification from the original meeting into what would become modern AppStream.
Back then I saw AppStream as a necessary side-project for my actual project, and didn’t even consider myself the maintainer of it for quite a while (I hadn’t been at the meeting, after all). All those years ago I had no idea that ultimately I was developing AppStream not for Limba, but for a new thing that would show up later, with an even more modern design, called Flatpak. I also had no idea how incredibly complex AppStream would become and how many features it would have and how much more maintenance work it would be – and also not how ubiquitous it would become.
The modern Linux desktop uses AppStream everywhere now, it is supported by all major distributions, used by Flatpak for metadata, used for firmware metadata via Richard’s fwupd/LVFS, runs on every Steam Deck, can be found in cars and possibly many places I do not know yet.
The most important thing that’s new with the 1.0 release is a bunch of incompatible changes. For the shared libraries, all deprecated API elements have been removed and a bunch of other changes have been made to improve the overall API and especially make it more binding-friendly. That doesn’t mean that the API is completely new and nothing looks like before though, when possible the previous API design was kept and some changes that would have been too disruptive have not been made. Regardless of that, you will have to port your AppStream-using applications. For some larger ones I already submitted patches to build with both AppStream versions, the 0.16.x stable series as well as 1.0+.
For the XML specification, some older compatibility for XML that had no or very few users has been removed as well. This affects for example release elements that reference downloadable data without an artifact block, which has not been supported for a while. For all of these, I checked to remove only things that had close to no users and that were a significant maintenance burden. So as a rule of thumb: If your XML validated with no warnings with the 0.16.x branch of AppStream, it will still be 100% valid with the 1.0 release.
Another notable change is that the generated output of AppStream 1.0 will always be 1.0 compliant, you can not make it generate data for versions below that (this greatly reduced the maintenance cost of the project).
For a long time, you could set the developer name using the top-level developer_name tag. With AppStream 1.0, this has changed a bit. There is now a developer tag with a name child (that can be translated unless the translate="no" attribute is set on it). This allows future extensibility, and also allows setting a machine-readable id attribute in the developer element. This permits software centers to group software by developer more easily, without having to use heuristics. If we decide to extend the developer information per-app in future, this is also now possible. Do not worry though, the developer_name tag is also still read, so there is no high pressure to update. The old 0.16.x stable series also has this feature backported, so it can be available everywhere. Check out the developer tag specification for more details.
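As a concrete (made-up) example of the old and new forms described above:

<!-- Before AppStream 1.0 -->
<developer_name>Example Project Developers</developer_name>

<!-- AppStream 1.0: developer element with a machine-readable id and a name child -->
<developer id="org.example.project">
  <name translate="no">Example Project Developers</name>
</developer>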
Screenshot images can now have a scale attribute, to indicate an (integer) scaling factor to apply. This feature was a breaking change and therefore we could not have it for the longest time, but it is now available. Please wait a bit for AppStream 1.0 to become more widely deployed though, as using it with older AppStream versions may lead to issues in some cases. Check out the screenshots tag specification for more details.
It is now possible to indicate the environment a screenshot was recorded in (GNOME, GNOME Dark, KDE Plasma, Windows, etc.) via an environment attribute on the respective screenshot tag. This was also a breaking change, so use it carefully for now! If projects want to, they can use this feature to supply dedicated screenshots depending on the environment the application page is displayed in. Check out the screenshots tag specification for more details.
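Putting the two together, a screenshot entry could hypothetically look like the snippet below; the attribute values are illustrative, so check the screenshots tag specification for the exact identifiers.

<screenshot type="default" environment="gnome">
  <caption>The main window</caption>
  <image type="source" scale="2">https://example.org/screenshot-hidpi.png</image>
</screenshot>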
This is a feature more important for the scientific community and scientific applications. Using the references tag, you can associate the AppStream component with a DOI (Digital Object Identifier) or provide a link to a CFF file to provide citation information. It also allows linking to other scientific registries. Check out the references tag specification for more details.
Releases can have tags now, just like components. This is generally not a feature that I expect to be used much, but in certain instances it can become useful with a cooperating software center, for example to tag certain releases as long-term supported versions.
Thanks to the interest and work of many volunteers, AppStream (mostly) runs on FreeBSD now, a NetBSD port exists, support for macOS was written and a Windows port is on its way! Thank you to everyone working on this!
For a long time I thought that the AppStream library should just be a thin layer above the XML and that software centers should implement a lot of the actual logic themselves. This has not been the case for a while, but there were still a lot of complex AppStream features that were hard for software centers to implement and where it makes sense to have one implementation that projects can just use.
The validation of component relations is one such thing. This was implemented in 0.16.x as well, but 1.0 vastly improves upon the compatibility checks, so you can now just run as_component_check_relations and retrieve a detailed list of whether the current component will run well on the system. Besides better API for software developers, the appstreamcli utility also has much improved support for relation checks, and I wrote about these changes in a previous post. Check it out!
With these changes, I hope this feature will be used much more, and beyond just drivers and firmware.
The changelog for the 1.0 release is huge, and there are many papercuts resolved and changes made that I did not talk about here, like us using gi-docgen (instead of gtkdoc) now for nice API documentation, or the many improvements that went into better binding support, or better search, or just plain bugfixes.
I expect the transition to 1.0 to take a bit of time. AppStream has not broken its API for many, many years (since 2016), so a bunch of places need to be touched even if the changes themselves are minor in many cases. In hindsight, I should have also released 1.0 much sooner and it should not have become such a mega-release, but that was mainly due to time constraints.
So, what’s in it for the future? Contrary to what I thought, AppStream does not really seem to be “done” and feature complete at some point; there is always something to improve, and people come up with new use cases all the time. So, expect more of the same in future: bugfixes, validator improvements, documentation improvements, better tools and the occasional new feature.
Onwards to 1.0.1!
TLDR: see the title of this blog post, it's really that trivial.
Now that GodotWayland has been coming for ages and all new development focuses on a pile of software that steams significantly less, we're seeing cracks appear in the old Xorg support. Not intentionally, but there's only so much time that can be spent on testing and things that are more niche fall through.
One of these was a bug I just had the pleasure of debugging, triggered by a GNOME on Xorg user using the xf86-input-libinput driver for tablet devices.
On the surface of it, this should be fine because libinput (and thus xf86-input-libinput) handles tablets just fine. But libinput is the new kid on the block. The old kid on said block is the xf86-input-wacom driver, older than libinput by slightly over a decade. And oh man, history has baked things into the driver that are worse than raisins in apple strudel [1].
The xf86-input-libinput driver was written as a wrapper around libinput and makes use of fancy things that (from libinput's POV) have always been around: things like input device hotplugging. Fancy, I know. For tablet devices the driver creates an X device for each new tool when it first comes into proximity. Future events from that tool will go through that device. A second tool, be it a new pen or the eraser on the original pen, will create a second X device and events from that tool will go through that X device. Configuration on any device will thus only affect that particular pen. Almost like the whole thing makes sense.
The wacom driver of course doesn't do this. It pre-creates X devices for some possible types of tools (pen, eraser, and cursor [2] but not airbrush or artpen). When a tool goes into proximity the events are sent through the respective device, i.e. all pens go through the pen tool, all erasers through the eraser tool. To actually track pens there is the "Wacom Serial IDs" property that contains the current tool's serial number. If you want to track multiple tools you need to query the property on proximity in [4]. At the time this was within a reasonable error margin of a good idea.
Of course, and because MOAR CONFIGURATION! will save us all from the great filter, you can specify the "ToolSerials" xorg.conf option as e.g. "airbrush;12345;artpen" and get some extra X devices pre-created, in this case an airbrush and artpen X device and an X device just for the tool with the serial number 12345. All other tools multiplex through the default devices. Again, at the time this was a great improvement. [5]
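As a hypothetical xorg.conf snippet (illustrative only; adjust the match and serials to your own setup), that would look something like:

Section "InputClass"
    Identifier "Wacom extra tool devices"
    MatchDriver "wacom"
    Option "ToolSerials" "airbrush;12345;artpen"
EndSection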
Anyway, where was I? Oh, right. The above should serve as a good approximation of a reason why the xf86-input-libinput driver does not try to be fully compatible with the xf86-input-wacom driver. In everyday use these things barely matter [6] but for the desktop environment, which needs to configure these devices, all these differences mean multiple code paths. Those paths need to be tested but they aren't, so things fall through the cracks.
So quite a while ago, we made the decision that until Xorg goes dodo, the xf86-input-wacom driver is the tablet driver to use in GNOME. So if you're using a GNOME on Xorg session [7], do make sure the xf86-input-wacom driver is installed. It will make both of us happier and that's a good aim to strive for.
[1] It's just a joke. Put the pitchforks down already.
[2] The cursor is the mouse-like thing Wacom sells. Which is called cursor [3] because the English language has a limited vocabulary and we need to re-use words as much as possible lest we run out of them.
[3] It's also called puck. Because [2].
[4] And by "query" I mean "wait for the XI2 event notifying you of a property change". Because of lolz the driver cannot update the property on proximity in but needs to schedule that as idle func so the
property update for the serial always arrives at some unspecified time after the proximity in but hopefully before more motion events happen. Or not, and that's how hope dies.
[5] Think about this next time someone says they long for some unspecified good old days.
[6] Except the strip axis which on the wacom driver is actually a bit happily moving left/right as your finger moves up/down on the touch strip and any X client needs to know this. libinput normalizes this to...well, a normal value but now the X client needs to know which driver is running so, oh deary deary.
[7] e.g. because you're stockholmed into it by your graphics hardware
I’ve recently worked on a patch for the vc4 display driver used on the Raspberry Pi 4. To test this patch, I needed to compile the kernel and install it, something I know how to do on x86 but not on Raspberry Pi. Because I’m pretty stubborn I’ve also insisted on making my life harder:
Raspberry Pi has an official guide to compile the kernel, however it assumes Raspberry Pi OS, Raspberry Pi’s kernel tree, and overwrites the current kernel. It was still very useful to get an idea of the process. Still, quite a few adaptations have been required. This blog post serves as my personal notepad to remember how to Do It.
First, the official guide instructs us to run make bcm2711_defconfig to generate the kernel config, however mainline complains with:
Can't find default configuration "arch/arm/configs/bcm2711_defconfig"
This can be fixed by grabbing this file from the Raspberry Pi tree:
curl -L -o arch/arm/configs/bcm2711_defconfig "https://github.com/raspberrypi/linux/raw/rpi-6.1.y/arch/arm/configs/bcm2711_defconfig"
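For the record, the compilation step itself is the standard kernel build flow; something like the following (assuming a native 32-bit build on the Pi; when cross-compiling, prepend ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- or equivalent to each make invocation):

make bcm2711_defconfig
make -j"$(nproc)" zImage modules dtbs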
Once that’s done, the kernel compiles as usual. Then we need to install it to the /boot partition. We can ignore the overlays stuff from the official guide, we don’t use these. The source paths need to be slightly adjusted, and the destination paths need to be fixed up to use a subdirectory:
doas make modules_install
doas cp arch/arm/boot/dts/broadcom/*.dtb /boot/custom/
doas cp arch/arm/boot/zImage /boot/custom/kernel7.img
Then we need to generate an initramfs. At first I forgot that step and the kernel was hanging during USB bus discovery.
doas mkinitcpio --generate /boot/custom/initramfs-linux.img --kernel /boot/custom/kernel7.img
The last step is updating the boot firmware configuration located at /boot/config.txt. Comment out any dtoverlay directive, then add os_prefix=custom/ to point the firmware to our subdirectory (note, the final slash is important).
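The relevant bits of /boot/config.txt end up looking roughly like this (a hypothetical excerpt, other settings omitted):

# dtoverlay=vc4-kms-v3d
os_prefix=custom/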
For some reason my memory card was showing up as /dev/mmcblk1 instead of /dev/mmcblk0, so I had to bang my head against the wall until I noticed the difference, then adjust /boot/cmdline.txt and /etc/fstab accordingly.
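In my case that meant making sure the root device in /boot/cmdline.txt matches what the kernel actually enumerates, e.g. something along these lines (hypothetical excerpt):

root=/dev/mmcblk1p2 rw rootwait console=tty1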
That’s it! After a reboot I was ready to start kernel hacking. Thanks to Maíra Canal for replying to my distress signal on Mastodon and providing recommendations!
This blog post explores the color capabilities of AMD hardware and how they are exposed to userspace through driver-specific properties. It discusses the different color blocks in the AMD Display Core Next (DCN) pipeline and their capabilities, such as predefined transfer functions, 1D and 3D lookup tables (LUTs), and color transformation matrices (CTMs). It also highlights the differences in AMD HW blocks for pre and post-blending adjustments, and how these differences are reflected in the available driver-specific properties.
Overall, this blog post provides a comprehensive overview of the color capabilities of AMD hardware and how they can be controlled by userspace applications through driver-specific properties. This information is valuable for anyone who wants to develop applications that can take advantage of the AMD color management pipeline.
Get a closer look at each hardware block’s capabilities, unlock a wealth of knowledge about AMD display hardware, and enhance your understanding of graphics and visual computing. Stay tuned for future developments as we embark on a quest for GPU color capabilities in the ever-evolving realm of rainbow treasures.
Operating Systems can use the power of GPUs to ensure consistent color reproduction across graphics devices. We can use GPU-accelerated color management to manage the diversity of color profiles, do color transformations to convert between High-Dynamic-Range (HDR) and Standard-Dynamic-Range (SDR) content and color enhancements for wide color gamut (WCG). However, to make use of GPU display capabilities, we need an interface between userspace and the kernel display drivers that is currently absent in the Linux/DRM KMS API.
In the previous blog post I presented how we are expanding the Linux/DRM color management API to expose specific properties of AMD hardware. Now, I’ll guide you to the color features for the Linux/AMD display driver. We embark on a journey through DRM/KMS, AMD Display Manager, and AMD Display Core and delve into the color blocks to uncover the secrets of color manipulation within AMD hardware. Here we’ll talk less about the color tools and more about where to find them in the hardware.
We resort to driver-specific properties to reach AMD hardware blocks with color capabilities. These blocks display features like predefined transfer functions, color transformation matrices, and 1-dimensional (1D LUT) and 3-dimensional lookup tables (3D LUT). Here, we will understand how these color features are strategically placed into color blocks both before and after blending in Display Pipe and Plane (DPP) and Multiple Pipe/Plane Combined (MPC) blocks.
That said, welcome back to the second part of our thrilling journey through AMD’s color management realm!
In my 2022 XDC talk “I’m not an AMD expert, but…”, I briefly explained the organizational structure of the Linux/AMD display driver where the driver code is bifurcated into a Linux-specific section and a shared-code portion. To reveal AMD’s color secrets through the Linux kernel DRM API, our journey led us through these layers of the Linux/AMD display driver’s software stack. It includes traversing the DRM/KMS framework, the AMD Display Manager (DM), and the AMD Display Core (DC) [1].
The DRM/KMS framework provides the atomic API for color management through KMS properties represented by struct drm_property. We extended the color management interface exposed to userspace by leveraging existing resources and connecting them with driver-specific functions for managing modeset properties.
On the AMD DC layer, the interface with hardware color blocks is established. The AMD DC layer contains OS-agnostic components that are shared across different platforms, making it an invaluable resource. This layer already implements hardware programming and resource management, simplifying the external developer’s task. While examining the DC code, we gain insights into the color pipeline and capabilities, even without direct access to specifications. Additionally, AMD developers provide essential support by answering queries and reviewing our work upstream.
The primary challenge involved identifying and understanding relevant AMD DC code to configure each color block in the color pipeline. However, the ultimate goal was to bridge the DC color capabilities with the DRM API. For this, we changed the AMD DM, the OS-dependent layer connecting the DC interface to the DRM/KMS framework. We defined and managed driver-specific color properties, facilitated the transport of user space data to the DC, and translated DRM features and settings to the DC interface. Considerations were also made for differences in the color pipeline based on hardware capabilities.
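To make the idea of driver-specific properties a bit more concrete from the userspace side, here is a minimal libdrm sketch that looks up a KMS property by name on an object and stages a blob for it in an atomic request. The property name in the usage comment is only a placeholder; the real names and semantics are the ones defined by the driver series discussed in this post.

#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <xf86drmMode.h>

/* Find a property by name on a KMS object (plane, CRTC, ...) and stage a blob
 * value for it in an atomic request. Returns 0 on success, -1 otherwise. */
static int stage_blob_property(int fd, drmModeAtomicReq *req,
                               uint32_t obj_id, uint32_t obj_type,
                               const char *prop_name,
                               const void *data, size_t len)
{
    drmModeObjectProperties *props =
        drmModeObjectGetProperties(fd, obj_id, obj_type);
    if (!props)
        return -1;

    uint32_t prop_id = 0;
    for (uint32_t i = 0; i < props->count_props; i++) {
        drmModePropertyRes *prop = drmModeGetProperty(fd, props->props[i]);
        if (prop && strcmp(prop->name, prop_name) == 0)
            prop_id = prop->prop_id;
        drmModeFreeProperty(prop);
    }
    drmModeFreeObjectProperties(props);
    if (!prop_id)
        return -1; /* the driver does not expose this property */

    uint32_t blob_id;
    if (drmModeCreatePropertyBlob(fd, data, len, &blob_id))
        return -1;

    return drmModeAtomicAddProperty(req, obj_id, prop_id, blob_id) < 0 ? -1 : 0;
}

/* Usage sketch (hypothetical property name, illustration only):
 * stage_blob_property(fd, req, plane_id, DRM_MODE_OBJECT_PLANE,
 *                     "AMD_PLANE_DEGAMMA_LUT", lut, sizeof(lut));
 */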
Now, let’s dive into the exciting realm of AMD color capabilities, where an abundance of techniques and tools awaits to make your colors look extraordinary across diverse devices.
First, we need to know a little about the color transformation and calibration tools and techniques that you can find in different blocks of the AMD hardware. I borrowed some images from [2] [3] [4] to help you understand the information.
Transfer functions serve as the bridge between the digital and visual worlds, defining the mathematical relationship between digital color values and linear scene/display values and ensuring consistent color reproduction across different devices and media. You can learn more about curves in the chapter GPU Gems 3 - The Importance of Being Linear by Larry Gritz and Eugene d’Eon.
ITU-R 2100 introduces three main types of transfer functions:
- OETF: the opto-electronic transfer function, which converts linear scene light into the video signal, typically within a camera.
- EOTF: electro-optical transfer function, which converts the video signal into the linear light output of the display.
- OOTF: opto-optical transfer function, which has the role of applying the “rendering intent”.
AMD’s display driver supports the following pre-defined transfer functions (aka named fixed curves):
- Linear/Unity: linear/identity relationship between pixel value and luminance value;
- Gamma 2.2, Gamma 2.4, Gamma 2.6: pure power functions;
- sRGB: the piece-wise transfer function from IEC 61966-2-1:1999, overall close to a 2.4 gamma;
- BT.709: has a linear segment in the bottom part and then a power function with a 0.45 (~1/2.22) gamma for the rest of the range; standardized by ITU-R BT.709-6;
- PQ (Perceptual Quantizer): used for HDR display, allows luminance range capability of 0 to 10,000 nits; standardized by SMPTE ST 2084.
These capabilities vary depending on the hardware block, with some utilizing hardcoded curves and others relying on AMD’s color module to construct curves from standardized coefficients. It also supports user/custom curves built from a lookup table.
A 1D LUT is a versatile tool, defining a one-dimensional color transformation based on a single parameter. It’s very well explained by Jeremy Selan at GPU Gems 2 - Chapter 24 Using Lookup Tables to Accelerate Color Transformations
It enables adjustments to color, brightness, and contrast, making it ideal for fine-tuning. In the Linux AMD display driver, the atomic API offers a 1D LUT with 4096 entries and 8-bit depth, while legacy gamma uses a size of 256.
These tables work in three dimensions – red, green, and blue. They’re perfect for complex color transformations and adjustments between color channels. They’re also more complex to manage and require more computational resources. Jeremy also explains 3D LUTs in GPU Gems 2 - Chapter 24 Using Lookup Tables to Accelerate Color Transformations
Color transformation matrices facilitate the transition between different color spaces, playing a crucial role in color space conversion.
HDR multiplier is a factor applied to the color values of an image to increase their overall brightness.
First, let’s take a closer look at the AMD Display Core Next hardware pipeline in the Linux kernel documentation for AMDGPU driver - Display Core Next
In the AMD Display Core Next hardware pipeline, we encounter two hardware blocks with color capabilities: the Display Pipe and Plane (DPP) and the Multiple Pipe/Plane Combined (MPC). The DPP handles color adjustments per plane before blending, while the MPC engages in post-blending color adjustments. In short, we expect DPP color capabilities to match up with DRM plane properties, and MPC color capabilities to play nice with DRM CRTC properties.
Note: here’s the catch – there are some DRM CRTC color transformations that don’t have a corresponding AMD MPC color block, and vice versa. It’s like a puzzle, and we’re here to solve it!
We can finally talk about the color capabilities of each AMD color block. As they vary based on the generation of hardware, let’s take the DCN3+ family as reference. What’s possible to do before and after blending depends on hardware capabilities described in the kernel driver by struct dpp_color_caps and struct mpc_color_caps.
The AMD Steam Deck hardware provides a tangible example of these capabilities. Therefore, we take the Steam Deck/DCN301 driver as an example and look at the “Color pipeline capabilities” described in the file drivers/gpu/drm/amd/display/dc/dcn301/dcn301_resource.c:
/* Color pipeline capabilities */
dc->caps.color.dpp.dcn_arch = 1; // If it is a Display Core Next (DCN): yes. Zero means DCE.
dc->caps.color.dpp.input_lut_shared = 0;
dc->caps.color.dpp.icsc = 1; // Input Color Space Conversion (CSC) matrix.
dc->caps.color.dpp.dgam_ram = 0; // The old degamma block for degamma curve (hardcoded and LUT). `Gamma correction` is the new one.
dc->caps.color.dpp.dgam_rom_caps.srgb = 1; // sRGB hardcoded curve support
dc->caps.color.dpp.dgam_rom_caps.bt2020 = 1; // BT2020 hardcoded curve support (seems not actually in use)
dc->caps.color.dpp.dgam_rom_caps.gamma2_2 = 1; // Gamma 2.2 hardcoded curve support
dc->caps.color.dpp.dgam_rom_caps.pq = 1; // PQ hardcoded curve support
dc->caps.color.dpp.dgam_rom_caps.hlg = 1; // HLG hardcoded curve support
dc->caps.color.dpp.post_csc = 1; // CSC matrix
dc->caps.color.dpp.gamma_corr = 1; // New `Gamma Correction` block for degamma user LUT;
dc->caps.color.dpp.dgam_rom_for_yuv = 0;
dc->caps.color.dpp.hw_3d_lut = 1; // 3D LUT support. If so, it's always preceded by a shaper curve.
dc->caps.color.dpp.ogam_ram = 1; // `Blend Gamma` block for custom curve just after blending
// no OGAM ROM on DCN301
dc->caps.color.dpp.ogam_rom_caps.srgb = 0;
dc->caps.color.dpp.ogam_rom_caps.bt2020 = 0;
dc->caps.color.dpp.ogam_rom_caps.gamma2_2 = 0;
dc->caps.color.dpp.ogam_rom_caps.pq = 0;
dc->caps.color.dpp.ogam_rom_caps.hlg = 0;
dc->caps.color.dpp.ocsc = 0;
dc->caps.color.mpc.gamut_remap = 1; // Post-blending CTM (pre-blending CTM is always supported)
dc->caps.color.mpc.num_3dluts = pool->base.res_cap->num_mpc_3dlut; // Post-blending 3D LUT (preceded by shaper curve)
dc->caps.color.mpc.ogam_ram = 1; // Post-blending regamma.
// No pre-defined TF supported for regamma.
dc->caps.color.mpc.ogam_rom_caps.srgb = 0;
dc->caps.color.mpc.ogam_rom_caps.bt2020 = 0;
dc->caps.color.mpc.ogam_rom_caps.gamma2_2 = 0;
dc->caps.color.mpc.ogam_rom_caps.pq = 0;
dc->caps.color.mpc.ogam_rom_caps.hlg = 0;
dc->caps.color.mpc.ocsc = 1; // Output CSC matrix.
I included some inline comments in each element of the color caps to quickly describe them, but you can find the same information in the Linux kernel documentation. See more in struct dpp_color_caps, struct mpc_color_caps and struct rom_curve_caps.
Now, using this guideline, we go through color capabilities of DPP and MPC blocks and talk more about mapping driver-specific properties to corresponding color blocks.
Let’s explore the capabilities of DPP blocks and what you can achieve with each color block. The very first thing to pay attention to is the display architecture of the hardware: previously, AMD used a display architecture called DCE, while recent generations use DCN (Display Core Next). Whether the hardware is DCN-based is described by: dc->caps.color.dpp.dcn_arch
Described by: dc->caps.color.dpp.dgam_ram, dc->caps.color.dpp.dgam_rom_caps, dc->caps.color.dpp.gamma_corr
AMD Plane Degamma data is mapped to the initial stage of the DPP pipeline. It is utilized to transition from scanout/encoded values to linear values for arithmetic operations. Plane Degamma supports both pre-defined transfer functions and 1D LUTs, depending on the hardware generation. DCN2 and older families handle both types of curve in the Degamma RAM block (dc->caps.color.dpp.dgam_ram); DCN3+ separates hardcoded curves and the 1D LUT into two blocks: Degamma ROM (dc->caps.color.dpp.dgam_rom_caps) and the Gamma Correction block (dc->caps.color.dpp.gamma_corr), respectively.
Pre-defined transfer functions:
The 1D LUT currently accepts 4096 entries of 8-bit. The data is interpreted as an array of struct drm_color_lut elements. Setting TF = Identity/Default and LUT as NULL means bypass.
References:
AMD Plane CTM data goes to the DPP Gamut Remap block, supporting a 3x4 fixed point (s31.32) matrix for color space conversions. The data is interpreted as a struct drm_color_ctm_3x4. Setting NULL means bypass.
References:
Described by: dc->caps.color.dpp.hw_3d_lut
The Shaper block fine-tunes color adjustments before applying the 3D LUT, optimizing the use of the limited entries in each dimension of the 3D LUT. On AMD hardware, a 3D LUT always means a preceding shaper 1D LUT used for delinearizing and/or normalizing the color space before applying a 3D LUT, so this entry on DPP color caps (dc->caps.color.dpp.hw_3d_lut) means support for both the shaper 1D LUT and the 3D LUT.
A pre-defined transfer function enables delinearizing content with or without a shaper LUT, where the AMD color module calculates the resulting shaper curve. Shaper curves go from linear values to encoded values. If we are already in a non-linear space and/or don’t need to normalize values, we can set an Identity TF for the shaper, which works similarly to bypass and is also the default TF value.
Pre-defined transfer functions: see the calculate_curve() function in the file amd/display/modules/color/color_gamma.c.
The 1D LUT currently accepts 4096 entries of 8-bit. The data is interpreted as an array of struct drm_color_lut elements. When setting Plane Shaper TF (!= Identity) and LUT at the same time, the color module will combine the pre-defined TF and the custom LUT values into the LUT that’s actually programmed. Setting TF = Identity/Default and LUT as NULL works as bypass.
References:
Described by: dc->caps.color.dpp.hw_3d_lut
The 3D LUT in the DPP block facilitates complex color transformations and adjustments. A 3D LUT is a three-dimensional array where each element is an RGB triplet. As mentioned before, dc->caps.color.dpp.hw_3d_lut describes whether the DPP 3D LUT is supported.
The AMD driver-specific property advertises the size of a single dimension via the LUT3D_SIZE property. Plane 3D LUT is a blob property where the data is interpreted as an array of struct drm_color_lut elements and the number of entries is LUT3D_SIZE cubed. The array contains samples from the approximated function; values between samples are estimated by tetrahedral interpolation. The array is accessed with three indices, one for each input dimension (color channel), blue being the outermost dimension, red the innermost. This distribution is better visualized when examining the code in [RFC PATCH 5/5] drm/amd/display: Fill 3D LUT from userspace by Alex Hung:
+ for (nib = 0; nib < 17; nib++) {
+ for (nig = 0; nig < 17; nig++) {
+ for (nir = 0; nir < 17; nir++) {
+ ind_lut = 3 * (nib + 17*nig + 289*nir);
+
+ rgb_area[ind].red = rgb_lib[ind_lut + 0];
+ rgb_area[ind].green = rgb_lib[ind_lut + 1];
+ rgb_area[ind].blue = rgb_lib[ind_lut + 2];
+ ind++;
+ }
+ }
+ }
In our driver-specific approach we opted to advertise its behavior to userspace instead of implicitly dealing with it in the kernel driver. AMD hardware supports 3D LUTs with 17 or 9 entries per dimension (4913 and 729 entries total, respectively), at either 10-bit or 12-bit precision. In the current driver-specific work we focus on enabling only the 17-size 12-bit 3D LUT, as in [PATCH v3 25/32] drm/amd/display: add plane 3D LUT support:
+ /* Stride and bit depth are not programmable by API yet.
+ * Therefore, only supports 17x17x17 3D LUT (12-bit).
+ */
+ lut->lut_3d.use_tetrahedral_9 = false;
+ lut->lut_3d.use_12bits = true;
+ lut->state.bits.initialized = 1;
+ __drm_3dlut_to_dc_3dlut(drm_lut, drm_lut3d_size, &lut->lut_3d,
+ lut->lut_3d.use_tetrahedral_9,
+ MAX_COLOR_3DLUT_BITDEPTH);
A refined control of 3D LUT parameters should go through a follow-up version or generic API.
Setting 3D LUT to NULL means bypass.
References:
Described by: dc->caps.color.dpp.ogam_ram
The Blend/Out Gamma block applies the final touch-up before blending, allowing users to linearize content after the 3D LUT and just before blending. It supports both 1D LUT and pre-defined TF. We can see the Shaper and Blend LUTs as 1D LUTs that sandwich the 3D LUT. So, if we don’t need 3D LUT transformations, we may want to only use the Degamma block to linearize and skip Shaper, 3D LUT and Blend.
Pre-defined transfer function:
The 1D LUT currently accepts 4096 entries of 8-bit. The data is interpreted as an array of struct drm_color_lut elements. If plane_blend_tf_property != Identity TF, the AMD color module will combine the user LUT values with the pre-defined TF into the LUT parameters to be programmed. Setting TF = Identity/Default and LUT to NULL means bypass.
References:
The degamma lookup table (LUT) for converting framebuffer pixel data before applying the color conversion matrix. The data is interpreted as an array of struct drm_color_lut elements. Setting NULL means bypass.
Not really supported. The driver is currently reusing the DPP degamma LUT block (dc->caps.color.dpp.dgam_ram and dc->caps.color.dpp.gamma_corr) for supporting the DRM CRTC Degamma LUT, as explained in [PATCH v3 20/32] drm/amd/display: reject atomic commit if setting both plane and CRTC degamma.
Described by: dc->caps.color.mpc.gamut_remap
It sets the current transformation matrix (CTM) applied to pixel data after the lookup through the degamma LUT and before the lookup through the gamma LUT. The data is interpreted as a struct drm_color_ctm. Setting NULL means bypass.
Described by: dc->caps.color.mpc.ogam_ram
After all that, you might still want to convert the content to wire encoding. No worries, in addition to the DRM CRTC 1D LUT, we’ve got an AMD CRTC gamma transfer function (TF) to make it happen. Possible TF values are defined by enum amdgpu_transfer_function.
Pre-defined transfer functions:
The 1D LUT currently accepts 4096 entries of 8-bit. The data is interpreted as an array of struct drm_color_lut elements. When setting CRTC Gamma TF (!= Identity) and LUT at the same time, the color module will combine the pre-defined TF and the custom LUT values into the LUT that’s actually programmed. Setting TF = Identity/Default and LUT to NULL means bypass.
References:
We have previously worked on exposing CRTC shaper and CRTC 3D LUT, but they were removed from the AMD driver-specific color series because they lack a userspace use case. CRTC shaper and 3D LUT work similarly to plane shaper and 3D LUT, but after blending (MPC block). The difference here is that setting (not bypassing) the Shaper and Gamma blocks together is not expected, since both blocks are used to delinearize the input space. In summary, we either set Shaper + 3D LUT or Gamma.
There are two other color capabilities of AMD display hardware that were integrated into DRM by previous work and are worth a brief explanation here. The DC Input CSC sets pre-defined coefficients from the values of the DRM plane color_range and color_encoding properties. It is used for color space conversion of the input content. On the other hand, we have the DC Output CSC (OCSC), which sets pre-defined coefficients from DRM connector colorspace properties. It is used for color space conversion of the composed image to the one supported by the sink.
References:
If you want to understand a little more about this work, be sure to watch the two talks Joshua and I presented at XDC 2023 about AMD/Steam Deck colors on Gamescope:
In the time between the first and second part of this blog post, Uma Shankar and Chaitanya Kumar Borah published the plane color pipeline for Intel and Harry Wentland implemented a generic API for DRM based on VKMS support. We discussed these two proposals and the next steps for Color on Linux during the Color Management workshop at XDC 2023, and I briefly shared the workshop results in the 2023 XDC lightning talk session.
The search for rainbow treasures is not over yet! We plan to meet again next year in the 2024 Display Hackfest in Coruña-Spain (Igalia’s HQ) to keep up the pace and continue advancing today’s display needs on Linux.
Finally, a HUGE thank you to everyone who worked with me on exploring AMD’s color capabilities and making them available in userspace.
If you remember the last update two weeks ago, I got MobileNetV1 working with good performance, and I was planning to move to upstreaming my changes to the Linux kernel and Mesa.
One of the kernel patches is now queued for the 6.7 release of the Linux kernel, and the other one has just been resent for reviews.
Regarding Mesa, I have made several cleanups and have started getting great review comments from Christian Gmeiner.
While waiting for feedback, I have started work on using the TP cores for tensor manipulation, which should be many times faster than the naive code I was running on the CPU for this.
I got some jobs producing the correct results, but I'm facing a problem with the GPU hanging right afterwards. I have already made a pass at the whole set of data that is sent to the HW (unit configuration, command stream and registers), but haven't found the problem yet. I will next improve the tooling around this and get a better view of the differences.
I hacked Mesa to use the out-of-tree driver and my code works that way, so it has to be something in the kernel driver.
During the next weeks I will keep incorporating feedback and see how I can fix the GPU hang on TP jobs.
Linus has pulled the initial GSP firmware support for nouveau. This is just the first set of work to use the new GSP firmware and there are likely many challenges and improvements ahead.
To get this working you need to install the firmware which hasn't landed in linux-firmware yet.
For Fedora this copr has the firmware in the necessary places:
https://copr.fedorainfracloud.org/coprs/airlied/nouveau-gsp/build/6593115/
Hopefully we can upstream that in next week or so.
If you have an Ada-based GPU then it should just try and work out of the box; if you have Turing or Ampere you currently need to pass nouveau.config=NvGspRm=1 on the kernel command line to attempt to use GSP.
Going forward, I've got a few fixes and stabilization bits to land, which we will concentrate on for 6.7, then going forward we have to work out how to keep it up to date and support new hardware and how to add new features.
This is the second part of the Xwayland rootful post, the first part is there.
Xwayland rootful can run more than just a window manager, it can as well run an entire X11 desktop, for example with Xfce:
[Image: Xfce running on Xwayland rootful in GNOME Shell on Wayland]
There is a catch, though: keyboard shortcuts get caught by the parent session instead of the nested one. This however isn't a problem specific to Wayland or Xwayland; an X11 window manager running in Xnest or Xephyr will have the same issues with keyboard shortcuts. To avoid that, Xephyr is able to „grab“ the keyboard and pointer so that all input events end up in the nested X11 session and do not get processed by the parent session.
Xwayland 23.1 has a similar functionality using the Wayland pointer locking & confinement protocol and the keyboard shortcuts inhibitor protocol.
So if your favorite Wayland compositor supports these protocols (if in doubt, you can check using „wayland-info“), you can use the „-host-grab“ option in Xwayland rootful:
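For example, combining it with the options from the first part of this post (an illustrative invocation):

$ Xwayland -geometry 1024x768 -decorate -host-grab :12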
Pressing the Control and Shift keys simultaneously will release the keyboard and pointer (just like with Xephyr actually).
In some cases, it might be desirable to run a single X11 application isolated from the rest of the X11 clients, on its own X11 server.
On such a setup, one could run a single X11 client either maximized or fullscreen within Xwayland rootful.
Since Xwayland 23.2 allows interactively resizing the root window, users can move and resize that window at will.
But for that to work, we need a simple X11 window manager that could resize the X11 client window along with the root window, using XRANDR notifications, such as the matchbox window manager for example.
When the Xwayland rootful window is resized, corresponding XRANDR events are emitted, notifying the X11 window manager which in turn resizes the client window.
For years now, Xwayland rootless had support for the viewport Wayland protocol, to emulate XRandR for legacy games thanks to the work from Hans De Goede.
So the idea is to add a fullscreen mode to Xwayland rootful and take advantage of the Wayland viewports support to emulate resolution changes.
This is exactly what the „-fullscreen“ command line option does: it starts Xwayland rootful in fullscreen mode using the xdg_toplevel Wayland protocol and uses the existing viewport support to scale the window and match the actual display physical resolution.
The emulated resolution is not even limited by the physical resolution, it's possible to use XRANDR to select an emulated resolution much higher than the actual monitor's resolution, quite handy to test X11 applications on high resolution without having to purchase expensive monitors!
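In practice that could look something like this (illustrative commands; pick an emulated resolution from the modes xrandr lists):

$ Xwayland -fullscreen :12 &
$ xrandr -display :12 -s 3840x2160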
Well, there's still one thing Xwayland does not handle well: HiDPI and fractional scaling.
With rootless Xwayland (as on a typical Wayland desktop session), all X11 clients share the same Xwayland server, and can span across different Wayland outputs of different scales.
Even though theoretically each Wayland surface associated with each X11 window could have a different scale factor set by Xwayland, all X11 clients on the same Xserver share the same coordinate space, so in practice different X11 windows cannot have different scale factors applied.
That's the reason why all the existing merge requests to add support for HiDPI to Xwayland set the same scale to all X11 surfaces. But that means that the rendered surface could end up being way too small depending on the actual scale the window is placed on, on a mixed-DPI multi-monitor setup (I already shared my views of the problem in this issue upstream).
But such limitation does not apply to rootful Xwayland, considering that all the X11 clients running on a rootful Xwayland actually belong to and remain within the same visible root window. They are part of the same visual entity and move all together along with the Xwayland rootful window.
So we could possibly add support for HiDPI (and hence achieve fractional scaling without blurred fonts) to rootful Xwayland. The idea is that Xwayland would set the surface scale to match the scale of the output it's placed on, and automatically resize its root window according to the scale, whenever that changes or when the rootful Xwayland window is moved from one monitor to another.
So for example, when Xwayland rootful with a size of 640×480 is moved from an output with scale 1 to an output with scale 2, the size of the root window (hence the Xwayland rootful window) would be automatically changed to 1280×960, along with the corresponding XRANDR notifications so that an X11 window manager running nested can adjust the X11 clients size and positions.
And if we want a way to communicate that to the X11 clients running within Xwayland rootful, we can use an X11 property on the root window that reflects the actual scale factor being applied. An X11 client could either use that property directly, or more likely, a simple dedicated daemon could adjust the scaling factor of the various X11 toolkits depending on the value set for Wayland scaling.
That's what that proposed merge request upstream does.
[Image: gnome-calculator running on Xwayland rootful with 150% fractional scaling]
Of course, at this time of writing, this is just a merge request I just posted upstream, and there is no promise that it will be accepted eventually. We'll see how that goes, but if it finds its way into Xwayland upstream, it will be part of the next major release of Xwayland some time next year.
I was at XDC 2023 in A Coruña a few days ago, where I had the opportunity to talk about some of the work we have been doing on the Raspberry Pi driver stack together with my colleagues Juan Suárez and Maíra Canal. We talked about Raspberry Pi 5, CPU job handling in the Vulkan driver, OpenGL 3.1 support and how we are exposing GPU stats to user space. If you missed it, here is the link to YouTube.
Big thanks to Igalia for organizing it and to all the sponsors and specially to Samuel and Chema for all the work they put into making this happen.
It’s a busy, busy week here. So busy I’m slipping on my blogging. But that’s okay, because here one last big technical post about something I hate.
Swapchain readback.
I’m not alone in drinking the haterade on this one, but GL makes it especially easy to footgun yourself by not providing explicit feedback that you’re footgunning yourself.
I recently encountered a scenario in REDACTED where this behavior was commonplace. The command stream looked roughly like this:
And this happened on every single frame (???).
This isn’t pretty. Zink has an extremely conformant method of performing swapchain readback which definitely works without issues in all cases. I’d explain it, but it wouldn’t make either of us happy, and I’ve got so much other stuff to do that I couldn’t possibly… Oh, you really want to know? Well don’t say I didn’t warn you.
Vulkan doesn’t allow readback from swapchains. By this, I mean:
Combined, once you have presented a swapchain image you’re screwed.
…According to the spec, that is. In the real world, things work differently.
Zink takes advantage of this “real world” utilization to implement swapchain readback. In short, the only method available is to spam present/acquire on the swapchain until the last-presented image is reacquired. Then it can be read back, and the image data is (probably) the same as when it was presented.
This is not a speedy method of implementing readback. It requires a full sync, and it was designed for the purpose of passing unit tests, which it does perfectly. Performance was never a concern, because why would anyone ever be trying to do readback in… Why would anyone ever be trying to do readback in a performance-sensitive… Using OpenGL, why would anyone ever be…
Anyway, this is very unperformant, and here at SGC we hate all things of that nature. Given that I had my real world scenario from REDACTED in which this was happening every frame, something had to be done.
This solution isn’t performant in the absolute sense either, but it’s massively faster than what was happening previously. Once zink detects an app repeatedly footgunning itself at full speed, it activates readback mode for a swapchain and maintains a staging copy of every frame. This enables the image data to be read back at any time without synchronization at the cost of an extra full-frame copy. This roughly doubles FPS in the case I was testing, which is pretty good.
The functionality is already merged for the upcoming 23.3 release.
Footgun as hard as you want.
As everyone knows, Red Hat’s top RustiCL expert, Karol “But it’s only 10 o’clock?” Herbst, has been hard at work beating Mesa/Zink/RustiCL into shape. That effort continues to bear fruit, and with the merge of an upcoming MR it should be possible to pass OpenCL conformance with zink on multiple platforms.
This will make zink THE FIRST EVER CONFORMANT VULKAN-BASED OPENCL IMPLEMENTATION.
Great work all around. For up-to-the-second progress reports on this ecosystem-critical topic, don’t forget to follow Karol on social media.
Hi all, long time no see! It’s been more than two months since the last status update. My excuse for this silence is two-fold: I was on leave for 5 weeks, and then X.Org Developer’s Conference happened. During my time off, I’ve traveled in Korea and Japan. I will be blunt: these last two months have been fantastic! And to be honest, that’s a huge understatement.
After my trip in Asia, I went to a 2-day Valve hackfest in Igalia’s headquarters. I met other Valve contractors there, and we discussed various topics such as color management, variable refresh rate, flicker-free startup, and more.
At XDC, there were lots of interesting talks and workshops: HDR by Joshua and Melissa, NVK by Faith, Asahi by Alyssa et al, wlroots frame scheduling by Rose (my GSoC student), CI by Martin, VKMS by Maíra, Wine Wayland by Alexandros, Wine X11 by Arek, and many more! Everything should be available online if you haven’t watched live. That said, as usual, the part I enjoyed the most is the so-called hallway track. It’s great to have free-form discussions with fellow graphics developers, it results in a pretty different train of thought than the usual focused discussions we have online.
Apart from these events, I’ve found some time to do a bit of actual work, too. I’ve re-spinned an old patch I wrote to introduce a new CLOSEFB IOCTL, to allow a DRM master to leave a framebuffer on-screen when quitting so that the next DRM master can take over without a black screen in-between. This time I also included a user-space patch and an IGT test (both requirements for new kernel uAPI). I sent (and merged) another kernel patch to fix black screens in some situations when unplugging USB-C docks.
On the Wayland side, I continued working on explicit synchronization, updating the protocol and submitting a gamescope patch. Joshua has been working on a Mesa patch, so all of the pieces are coming together now. On the SourceHut side, I’ve sent a patch to add HTTP/2 support to pages.sr.ht. It’s been merged and deployed, enjoy! The NPotTM is libicc, a small library to parse ICC profile files. Unlike LittleCMS, it provides lower-level access to the ICC structure and the exact color transformation operations.
That’s all for now, see you next month!
sudo rm /lib/modules/$(uname -r)/kernel/drivers/media/i2c/ov01a10.ko.xz; sudo depmod -a
sudo rmmod ov01a10; sudo modprobe ov01a10
After yesterday’s post, I’m sure my thousands of readers stampeded to install the latest zink and run their system with it, and I salute you for your hard work in finding all those new ways to crash your systems.
Some of those crashes, however, are not my bugs. They’re system bugs.
In particular, any of you still using Xorg instead of Wayland will want to create this file:
$ cat /etc/X11/xorg.conf.d/30-dmabuf.conf
Section "ServerFlags"
Option "Debug" "dmabuf_capable"
EndSection
This makes your xserver dmabuf-capable, which will be more successful when running things with zink.
Another problem you’re likely to have is this console error:
DRI3 not available
failed to load driver: zink
Specifically you’re likely to have this on AMD hardware, and the cause is almost certainly that you’ve installed some footgun package with a naming variation on xf86-video-amdgpu.
Delete this package.
Just delete it. I don’t know why distros still make it available, but if you have it installed then you’re just footgunning yourself.
If you’re still having problems after checking for both of these issues, try turning your computer on.
tomeu@arm-64:~/mesa$ python3.10 classification.py -i grace_hopper.bmp -m mobilenet_v1_1.0_224_quant.tflite -l labels_mobilenet_quant_v1_224.txt -e libteflon.so
Loading external delegate from libteflon.so with args: {}
Processing the input took 18 ms.
Running the NN job took 13 ms.
Processing the output took 1 ms.
0.866667: military uniform
0.031373: Windsor tie
0.015686: mortarboard
0.007843: bow tie
0.007843: academic gown
time: 33.094ms

That takes us to a performance level around 3 times faster than running the same inference on the CPUs on the A311D SoC.
Now that we have something that people can use in their products, I will switch to upstreaming mode.
I want to do a few cleanups to the Mesa code and then I will ask for people to review and ack so it can be merged. In the meantime, the draft merge request can be found here.
I would also like to have a CI job running to make sure it doesn't regress. But given that we don't use NIR as of yet and the dependencies with the rest of Mesa are minimal, there is probably little need as long as I'm the only person contributing to the code.
As readers are no doubt aware by now, SGC goes into hibernation beginning around November, and that time is nearly upon us once more. To cap out another glorious year of ~~shitposting~~ highly technical and informative blogging, I’ll be attempting to put up a newsworthy post every day.
This is Day 1.
2023 has seen great strides in the zink ecosystem:
And there’s plenty more, of course, but throughout all this progress has been one very minor, very annoying wrinkle.
MESA_LOADER_DRIVER_OVERRIDE=zink has to be specified in order to use zink, even if no other GL drivers exist on the system.
Over a year ago I attempted to enable automatic zink loading if a native driver could not be loaded. It was a reasonable first attempt, but it had issues with driver loading in scenarios where hardware drivers were not permitted.
Work has slowly progressed in Mesa since that time, and various small changes have gradually pushed the teetering tower that is GLX/EGL in the direction anyone and everyone wanted, full stop.
The result is that on zink-enabled systems, loader environment variables will no longer be necessary as of the upcoming Mesa 23.3 release. If zink is your only GL driver, you will get zink rather than an automatic fallback to swrast.
I can’t imagine anyone will need it, but remember that issues can be reported here.
Xwayland is intended as a compatibility layer, to allow legacy X11 applications to continue to work in a Wayland environment.
Most Wayland compositors run Xwayland „rootless“ (using the command line option „-rootless“ when spawning Xwayland) so that X11 clients can integrate seamlessly with the other Wayland native clients, the Wayland compositor taking care of stacking the various windows (or surfaces) regardless of the client being X11 or Wayland native.
That actually works very well, so well that in many cases users do not even realize that any particular client is still running on X11, using Xwayland.
For that to work, the Wayland compositor needs to integrate a fully functional X11 window manager.
Sometimes, however, it is useful to use a separate X11 server to run X11 applications with another X11 window manager or even a full X11 environment.
With X11, it is possible to run a nested X11 server such as Xnest or Xephyr, and run a full X11 environment within those nested X servers.
That can be useful for a number of reasons, like connecting remotely to a remote legacy Unix server using XDMCP (not that I would recommend that anyway!), or for testing a particular X11 application with different window managers, or even because a particular X11 application is certified only with a specific window manager. The possibilities are endless.
$ Xephyr -retro -screen 1024x768 :12
[Image: Xephyr running the Motif window manager on a GNOME Shell Wayland session]
But Xnest or Xephyr are X11 clients themselves, meaning that they run on top of Xwayland when running on a Wayland compositor. That's a bit of a waste, using two X11 servers on top of a Wayland compositor.
Besides, with X.org development winding down, downstream maintainers and packagers may want to reduce the number of X11 servers they ship and have to maintain in the future.
Right, so if Xwayland already runs rootful by default, why not just use that instead of Xnest or Xephyr?
Well, up until Xwayland 23.1, Xwayland rootful would take its screen configuration from the Wayland compositor itself (using the wl_output or xdg-output Wayland protocols), meaning that when running rootful, Xwayland would map a surface the size of all the monitors, and the user would have no way to easily move or resize it.
That's far from being practical, especially when using a multi-monitor setup!
So the first step to help making Xwayland rootful suitable as a nested X11 server is to provide a command line option to specify the desired size of the Xwayland window.
That's the „-geometry“ option introduced in Xwayland 23.1 so that one can specifies the desired size of the Xwayland rootful window:
$ Xwayland -geometry 1024x768 :12This is because Wayland does not decorate its surfaces, this is left to the Wayland client themselves to add window decorations (also known as client side decorations, or CSD for short).
This however would add a lot of complexity to Xwayland (which is primarily an Xserver, not a full fledged Wayland application). Thankfully, there is libdecor which can fence Xwayland from that complexity and provide window decorations for us.
So if libdecor is installed on the system and Xwayland is built with libdecor enabled (this is an optional dependency though), then we can request that Xwayland uses decorations with the „-decorate“ command line option:
$ Xwayland -geometry 1024x768 -retro -decorate :12
Now we can have fun running some legacy X11 applications on that Xwayland rootful server:
$ xterm -display :12 &

We can even use „xrandr“ to query the size of the Xwayland window and resize it:
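For instance (hypothetical invocations; the available sizes depend on the RandR modes exposed by Xwayland):

$ xrandr -display :12
$ xrandr -display :12 -s 800x600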
New with Xwayland 23.2, the Xwayland window is also resizable interactively, and the resulting display size is available in XRandR, creating an XRandR configuration to match the actual window size set interactively by the user.
This is my first blog post, ever!
I'm afraid there isn't much yet, but my intention is to post things related to Xwayland and various other projects I contribute to.
As everyone knows, SGC goes into yearly hibernation beginning in November. Leading up to that point has been a mad scramble to nail down all the things, leaving less time for posts here.
But there have been updates, and I’m gonna round ‘em all up.
Friend of the blog and future Graphics scientist with a PhD in WTF, Konstantin Seurer, has been hard at work over the past several weeks. Remember earlier this year when he implemented VK_EXT_descriptor_indexing for Lavapipe? Well he’s at it again, and this time he’s aimed for something bigger.
He’s now implemented raytracing for Lavapipe.
It’s a tremendous feat, one that sets him apart from the other developers who have not implemented raytracing for a software implementation of Vulkan.
I blogged (or maybe imagined blogging) about RustiCL progress on zink last year at XDC, specifically the time renowned pubmaster Karol Herbst handcuffed himself to me and refused to divulge the location of the key (disguised as a USB thumb drive in his laptop) until we had basic CL support functioning in a pair programming exercise that put us up against the unnaturally early closing time of Minneapolis pubs. That episode is finally turning into something useful as CL support for zink will soon be merged.
While I can’t reveal too much about the performance as of yet, what I can say now is that it’s roughly 866% faster.
A number of longstanding bugs have recently been fixed.
Anyone who has tried to play one of the modern Wolfenstein GL games on RADV has probably seen this abomination:
Wolfenstein Face affects a very small number of apps. Actually just the Wolfenstein (The New Order / The Old Blood) games. I’d had a ticket open about it for a while, and it turns out that this is a known issue in D3D games which has its own workaround. The workaround is now going to be applied for zink as well, which should resolve the issue while hopefully not causing others.
Since the dawn of time, experts have tried to obtain traces from games with rendering bugs, but some of these games have historically been resistant to tracing.
This affects (at least) Wolfenstein: The Old Blood and DOOM 2016, but the problem has been identified, and a fix is on the way.
After a number of universally-reviled hacks, Zink should now work fine in both Wayland and Surfaceless EGL configurations.
Any other, lesser blogger would’ve saved this for another post in order to maximize their posting frequency metric, but here at SGC the readers get a full meal with every post even when they don’t have enough time to digest it all at once. Since I’m not going to XDC this year, consider this the thing I might have given a presentation on.
During my executive senior keynote seminar presentation workshop on zink at last year’s XDC, I brought up tiler performance as one of the known deficiencies. Specifically this was in regard to how tilers need to maximize time spent inside renderpasses and avoid unnecessary load/store operations when beginning/ending those renderpasses, which required either some sort of Vulkan extension to enable deferred load/store op setting OR command stream parsing for GL.
While I did work on a number of Vulkan extensions this year, deferred load/store ops wasn’t one of them.
So it was that I implemented renderpass tracking for Threaded Context to scan the GL command stream in the course of recording it for threaded dispatch. The CPU overhead is negligible (~5% on a couple extremely synthetic drawoverhead cases and nothing noticeable in apps), while the performance gains are staggering (~10-15x speedup in AAA games). All in all, it was a painful process but one that has yielded great results.
The gist of it, as I’ve described in previous posts that I’m too lazy to find links for, is that framebuffer attachment access is accumulated during TC command recording such that zink is able to determine which load/store ops are needed. This works great so long as nothing unexpected splits the renderpass. “Unexpected” in this context refers to one of the following scenarios:
The final issue remaining for renderpass tracking has been this third scenario: any time the GL frontend needs to sync TC, renderpass metadata is split. The splitting is such that a single renderpass becomes two because the driver must complete execution on the currently-recorded metadata in order to avoid deadlocking itself against the waiting GL frontend, but then the renderpass will continue after the sync. While this happens in a very small number of scenarios, one of them is quite common.
Texture uploading.
There are (currently) three methods by which TC can perform texture uploads:
Eagle-eyed readers will notice that I’ve already handled the “problem” case described above; in order to avoid splitting renderpasses, I’ve written some handling which rewrites texture uploads into a sequence of N asynchronous buffer2image copies, where N is either 1 or $height depending on whether the source data’s stride matches the image’s stride. In the case where N is not 1, this can result in e.g., 4096 copy operations being enqueued for a 4096x4096 texture atlas. Even in the case where N is 1, it still adds an extra full copy of the texture data. While this is still more optimal than splitting a renderpass, it’s not optimal in the absolute sense.
You can see where this is going.
Optimal Threaded Context execution is the state when the GL frontend is recording commands while the driver thread is deserializing those commands into hardware-specific instructions to submit to the GPU. Visually, it looks like this Halloween-themed diagram:
Ignoring the small-upload case, the current state of texture uploading looks like one of the following Halloween-themed diagrams:
To maintain maximum performance, TC needs to be processing commands asynchronously in the driver thread while the GL frontend continues to record commands for processing. Thus, to maintain maximum performance during texture uploads, the texture upload needs to occur (without copies) while the driver thread continues executing.
Looking at this problem from a different perspective, the case that needs to be avoided at all costs is the case where the GL frontend syncs TC execution. The reason why this sync exists is to avoid accidentally uploading data to an in-use image, which would cause unpredictable (but definitely wrong) output. In this context, in-use can be defined as an image which is either:
On the plus side, pipe_context::is_resource_busy exists to query the second of these, so that’s solved. On the minus side, while TC has some usage tracking for buffers, it has nothing for images, and adding such tracking in a performant manner is challenging.
To figure out a solution for TC image tracking, let’s examine the most common problem case. In games, the most common scenario for texture uploading is something like this:
For such a case, it’d be trivial to add a seen flag to struct threaded_resource and pass the conditional if the flag is false. Since it’s straightforward enough to evaluate when an image has been seen in TC, this would suffice. Unfortunately, such a naive (don’t @ me about diacritics) implementation ignores another common pattern:
For this scenario, the staging image is reused, requiring a bit more tracking in order to accurately determine that it can be safely used for uploads.
The solution I’ve settled on is to use a derivative of zink’s resource tracking. This adds an ID for the last-used batch to the resource, which can then be checked during uploads. When the image is determined idle, the texture data is passed directly to the driver for an unsynchronized upload similar to how unsynchronized buffer uploads work. It’s simple and hasn’t shown any definitive performance overhead in my testing.
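As an illustration of the approach (this is only a conceptual sketch with made-up names, not the actual Mesa/TC data structures):

#include <stdbool.h>
#include <stdint.h>

/* Conceptual sketch: each image remembers the ID of the last TC batch that
 * used it; comparing that against the batches already processed, plus the
 * driver's own busy query, tells us whether an upload can skip the TC sync. */
struct sketch_image {
    uint32_t last_used_batch_id;      /* 0 = never seen by TC */
};

struct sketch_tc {
    uint32_t last_completed_batch_id; /* batches up to here have been processed */
};

static bool
can_upload_unsynchronized(const struct sketch_tc *tc,
                          const struct sketch_image *img,
                          bool gpu_busy /* e.g. pipe_context::is_resource_busy */)
{
    /* Only batches that already completed reference the image, and the GPU
     * is not currently using it either: safe to upload without syncing. */
    bool pending_in_tc = img->last_used_batch_id > tc->last_completed_batch_id;
    return !pending_in_tc && !gpu_busy;
}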
Unfortunately, for it to really work to its fullest potential in zink, VK_EXT_host_image_copy is required to avoid further staging copies, and nobody implements it yet in mesa main (except Lavapipe, though there’s also this ANV MR). But someday more drivers will support this, and then it’ll be great.
As far as non-tiler performance gains from this work, it’s hard to say definitively whether they’ll be noticeable. Texture uploads during loading screens are typically intermixed with shader compilation, so there’s little TC execution to unblock, but any game which uses texture streaming may see some slight latency improvements.
The only remaining future work here is to further enable unsynchronized texture uploads in zink by adding a special cmdbuf for unsynchronized uploads to handle the non-HIC case. Otherwise performance should be pretty solid across the board.
At the moment I am hard at work putting together the final bits for the AppStream 1.0 release (hopefully to be released this month). The new release comes with many new features, an improved developer API and removal of most deprecated things (so it carefully breaks compatibility with very old data and the previous C API). One of the tasks for the upcoming 1.0 release was #481, asking about a formal way to distinguish Linux phone applications from desktop applications.
AppStream infamously does not support any “is-for-phone” label for software components; instead, the decision whether something is compatible with a device is based on the device’s capabilities and the component’s requirements. This allows truly adaptive applications to describe their requirements correctly, and does not lock us into “form factors” going into the future, as there are many and the feature range between a phone, a tablet and a tiny laptop is quite fluid.
Of course the “match to current device capabilities” check does not work if you are a website ranking phone compatibility. It also does not really work if you are a developer and want to know which devices your component / application will actually be considered compatible with. One goal for AppStream 1.0 is to have its library provide more complete building blocks to software centers. Instead of just a “here’s the data, interpret it according to the specification” API, libappstream now interprets the specification for the application and provides API to handle most common operations – like checking device compatibility. For developers, AppStream also now implements a few “virtual chassis configurations”, to roughly gauge which configurations a component may be compatible with.
To test the new code, I ran it against the large Debian and Flatpak repositories to check which applications are considered compatible with what chassis/device type already. The result was fairly disastrous, with many applications not specifying compatibility correctly (many do, but it’s by far not the norm!). Which brings me to the actual topic of this blog post: Very few seem to really know how to mark an application compatible with certain screen sizes and inputs! This is most certainly a matter of incomplete guides and good templates, so maybe this post can help with that a bit:
As a quick reminder, compatibility is indicated using AppStream’s relations system: A requires relation indicates that the software will not run at all or will run terribly if the requirement is not met; if the requirement is not met, the component should not be installable on that system. A recommends relation means that it would be advantageous to have the recommended items, but they are not essential to run the application (it may run with a degraded experience without the recommended things, though). And a supports relation means a given interface/device/control/etc. is supported by this application, but the application may work completely fine without it.
A desktop-only application is characterized by needing a larger screen to fit the application, and requiring a physical keyboard and accurate mouse input. This type is assumed by default if no capabilities are set for an application, but it’s better to be explicit. This is the metadata you need:
<component type="desktop-application">
<id>org.example.desktopapp</id>
<name>DesktopApp</name>
[...]
<requires>
<display_length>768</display_length>
<control>keyboard</control>
<control>pointing</control>
</requires>
[...]
</component>
With this requires relation, you require a small-desktop-sized screen (at least 768 device-independent pixels (dp) on its smallest edge) and require a keyboard and mouse to be present / connectable. Of course, if your application needs more minimum space, adjust the requirement accordingly. Note that if the requirement is not met, your application may not be offered for installation.
Note: Device-independent / logical pixels
One logical pixel (= device-independent pixel) roughly corresponds to the visual angle of one pixel on a device with a pixel density of 96 dpi (for historical X11 reasons) and a distance from the observer of about 52 cm, making the physical pixel about 0.26 mm in size. When using logical pixels as a unit, they might not always map to exact physical lengths, as their exact size is defined by the device providing the display. They do, however, accurately depict the maximum number of pixels that can be drawn in the depicted direction on the device’s display space. AppStream always uses logical pixels when measuring lengths in pixels.
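To get a rough physical feel for these values (using the nominal 0.26 mm per logical pixel figure above, so purely an approximation): a requirement of 768 logical pixels corresponds to about 768 × 0.26 mm ≈ 20 cm on the shortest display edge, while the 360 logical pixels used in the phone-oriented examples below correspond to roughly 9.4 cm. On real hardware the physical size will differ with the display’s pixel density.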
Adaptive applications have fewer hard requirements, but a wide range of support for controls and screen sizes. For example, they support touch input, unlike desktop apps. An example MetaInfo snippet for this kind of app may look like this:
<component type="desktop-application">
<id>org.example.adaptive_app</id>
<name>AdaptiveApp</name>
[...]
<requires>
<display_length>360</display_length>
</requires>
<supports>
<control>keyboard</control>
<control>pointing</control>
<control>touch</control>
</supports>
[...]
</component>
Unlike the pure desktop application, this adaptive application requires a much smaller lowest display edge length, and also supports touch input, in addition to keyboard and mouse/touchpad precision input.
Making an application a pure phone application is tricky: We need to mark it as compatible with phones only, while not completely preventing its installation on non-phone devices (even though its UI is horrible, you may want to test the app, and software centers may allow its installation when requested explicitly even if they don’t show it by default). This is how to achieve that result:
<component type="desktop-application">
<id>org.example.phoneapp</id>
<name>PhoneApp</name>
[...]
<requires>
<display_length>360</display_length>
</requires>
<recommends>
<display_length compare="lt">1280</display_length>
<control>touch</control>
</recommends>
[...]
</component>
We require a phone-sized display minimum edge size (adjust to a value that is fit for your app!), but then also recommend the screen to have a smaller edge size than a larger tablet/laptop, while also recommending touch input and not listing any support for keyboard and mouse.
Please note that this blog post is of course not a comprehensive guide, so if you want to dive deeper into what you can do with requires/recommends/suggests/supports, you may want to have a look at the relations tags described in the AppStream specification.
It is still easy to make mistakes with the system requirements metadata, which is why AppStream 1.0 will provide more commands to check MetaInfo files for system compatibility. Current pre-1.0 AppStream versions already have an is-satisfied command to check if the application is compatible with the currently running operating system:
:~$ appstreamcli is-satisfied ./org.example.adaptive_app.metainfo.xml
Relation check for: */*/*/org.example.adaptive_app/*

Requirements:
 • Unable to check display size: Can not read information without GUI toolkit access.
Recommendations:
 • No recommended items are set for this software.
Supported:
 • Physical keyboard found.
 • Pointing device (e.g. a mouse or touchpad) found.
 • This software supports touch input.
In addition to this command, AppStream 1.0 will introduce a new one as well: check-syscompat. This command will check the component against libappstream’s mock system configurations that define a “most common” (whatever that is at the time) configuration for a respective chassis type.
If you pass the --details flag, you can even get an explanation why the component was considered or not considered for a specific chassis type:
:~$ appstreamcli check-syscompat --details ./org.example.phoneapp.metainfo.xml
Chassis compatibility check for: */*/*/org.example.phoneapp/*

Desktop: ✘ Incompatible
 • recommends: This software recommends a display with its shortest edge being << 1280 px in size, but the display of this device has 1280 px.
 • recommends: This software recommends a touch input device.

Laptop: ✘ Incompatible
 • recommends: This software recommends a display with its shortest edge being << 1280 px in size, but the display of this device has 1280 px.
 • recommends: This software recommends a touch input device.

Server: ✘ Incompatible
 • requires: This software needs a display for graphical content.
 • recommends: This software needs a display for graphical content.
 • recommends: This software recommends a touch input device.

Tablet: Compatible (100%)

Handset: Compatible (100%)
I hope this is helpful for people. Happy metadata writing!
Last week I was a bit distracted with the trip to Paris for the Embedded Recipes conference, but later I have found some time for hacking and got some interesting results out of it.
As commented in the previous update, I had found some limits in my testing due to the naive way that the front-end was scheduling jobs to the Gallium hardware-dependent driver.
I got to basically rewrite it (and removed any C++ remnants, on the way) and moved to a model in which the drivers would compile the operation blocks that they support to a format that can be quickly sent to the hardware.
As a side effect, I got proper memory management of the workload which allowed me to expand the testing I can do in a reasonable amount of time.
Also took the chance to rewrite the higher level scheduling data structure so all jobs in the same model partition are sent to the hardware in a single batch, for decreased latency.
Unfortunately I didn't get to remove copies of input and output tensors because the TensorFlow Lite API for this (TfLiteAsyncKernel) is undocumented and far from trivial. They seem to just be adding stuff on top to abstract whatever the Android folks may end up wanting to do.
[Image by Julien Langlois, CC BY-SA 3.0]
tomeu@arm-64:~/mesa$ LD_PRELOAD=libtensorflow_lite.so python3.10 class_device.py -i hen.bmp -m mobilenet_v1_0.25_224_quant.tflite -l labels_mobilenet_quant_v1_224.txt -e libteflon.so
Loading external delegate from build/src/gallium/targets/teflon/libteflon.so with args: {}
tflite_plugin_create_delegate
Teflon delegate: loaded etnaviv driver
INFO: Initialized TensorFlow Lite runtime.
PrepareDelegate
VERBOSE: Replacing 27 out of 31 node(s) with delegate (Teflon Delegate) node, yielding 2 partitions for the whole graph.
0.960784: hen
0.015686: cock
0.007843: goose
0.003922: Pembroke
0.003922: Ibizan hound
time: 22.802ms
tflite_plugin_destroy_delegate
This matched, bit by bit, the output from the blob, even though I was doing some tensor operations by hand on the CPU. That also causes it to run far too slowly. We should be able to get that down to around 5ms once we learn how to drive the TP units for tensor manipulation.
Tired of only writing about all this in this blog, I took the chance given to me by Kevin Hilman to present it in front of a captive audience.
You can find the slides here, and listen to the talk at:
The previous update went more in depth into what is left to do in the medium term, so I will just mention what I plan to do in the immediate future:
With the kids back in school I have been able to work on the Vivante VIP NPU driver full-time during the two weeks after the last update, with quite some work coming out of the pipeline:
Recently I’ve been working on a project where I needed to convert an application written in OpenGL to a software renderer. The matrix transformation code in OpenGL made use of the GLM library for matrix math, and I needed to convert the 4x4 matrices to be 3x3 matrices to work with the software renderer. There was some existing code to do this that was broken, and looked something like this:
glm::mat3 mat3x3 = glm::mat3(mat4x4);
Don’t worry if you don’t see the problem already, I’m going to illustrate in more detail with the example of a translation matrix. In 3D a standard translation matrix to translate by a vector (x, y, z) looks something like this:
[1 0 0 x]
[0 1 0 y]
[0 0 1 z]
[0 0 0 1]
Then when we multiply this matrix by a vector like (a, b, c, 1) the result is (a + x, b + y, c + z, 1). If you don’t understand why the matrix is 4x4 or why we have that extra 1 at the end don’t worry, I’ll explain that in more detail later.
Now using the existing conversion code to get a 3x3 matrix will simply take the first 3 columns and first 3 rows of the matrix and produce a 3x3 matrix from those. Converting the translation matrix above using this code produces the following matrix:
[1 0 0]
[0 1 0]
[0 0 1]
See the problem now? The (x, y, z) values disappeared! In the conversion process we lost these critical values from the translation matrix, and now if we multiply by this matrix nothing will happen since we are just left with the identity matrix. So if we can’t use this simple “cast” function in GLM, what can we use?
Well one thing we can do is preserve the last column and last row of the matrix. So assume we have a 4x4 matrix like this:
[a b c d]
[e f g h]
[i j k l]
[m n o p]
Then preserving the last row and column we should get a matrix like this:
[a b d]
[e f h]
[m n p]
And if we use this conversion process for the same translation matrix we will get:
[1 0 x]
[0 1 y]
[0 0 1]
Now we see that the (x, y) part of the translation is preserved, and if we try to multiply this matrix by the vector (a, b, 1) the result will be (a + x, b + y, 1). The translation is preserved in the conversion!
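As an illustration of that rule (a sketch assuming GLM, not code from the project in question; the function name is made up), a conversion helper that keeps rows/columns 0, 1 and 3 could look like this:

#include <glm/glm.hpp>

// Keep rows/columns 0, 1 and 3 of the 4x4 matrix (dropping the z row/column)
// so the 2D translation stored in the last column survives the conversion.
// GLM matrices are column-major, so they are indexed as m[col][row].
glm::mat3 mat4_to_mat3_2d(const glm::mat4 &m)
{
    const int keep[3] = {0, 1, 3};
    glm::mat3 out(1.0f);
    for (int col = 0; col < 3; ++col)
        for (int row = 0; row < 3; ++row)
            out[col][row] = m[keep[col]][keep[row]];
    return out;
}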
The reason the conversion is more complicated is hidden in how we defined the translation matrix and the vector we wanted to translate. The vector was actually a 4D vector with the final component set to 1. The reason we do this is that we actually want to represent an affine space instead of just a vector space. An affine space is a type of space where you can have both points and vectors. A point is exactly what you would expect it to be: just a point in space relative to some origin; a vector is a direction with magnitude but no origin. This is important because, strictly speaking, translation isn’t actually defined for vectors in a normal vector space. Additionally, if you try to construct a matrix to represent translation in a vector space, you’ll find that it’s impossible: translation is not a linear function, so no such matrix exists. On the other hand, operations like translation are well defined in an affine space and do what you would expect.
To get around the problem of vector spaces, mathematicians more clever than I figured out you can implement an affine space in a normal vector space by increasing the dimension of the vector space by one, and by adding an extra row and column to the transformation matrices used. They called this a homogeneous coordinate system. This lets you say that a vector is actually just a point if the 4th component is 1, but if it’s 0 it’s just a vector. Using this abstraction one can implement all the well-defined operations for an affine space (like translation!).
So, using the “homogeneous coordinate system” abstraction, translation is an operation that is defined by taking a point and moving it by a vector. Let’s look at how that works with the translation matrix I used as an example above. If you multiply that matrix by a 4D vector where the 4th component is 0, it will just return the same vector. Now if we multiply by a 4D vector where the 4th component is 1, it will return the point translated by the vector we used to construct that translation matrix. This implements the translation operation as it’s defined in an affine space!
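A quick way to see that behaviour in practice (a small sketch using GLM's glm::translate; the numbers are arbitrary):

#include <glm/glm.hpp>
#include <glm/gtc/matrix_transform.hpp>

// Translate by (5, 7, 9), then apply the matrix to a point (w = 1) and to a
// vector (w = 0): the point moves, the vector comes back unchanged.
glm::mat4 T = glm::translate(glm::mat4(1.0f), glm::vec3(5.0f, 7.0f, 9.0f));
glm::vec4 point  = T * glm::vec4(1.0f, 2.0f, 3.0f, 1.0f);  // (6, 9, 12, 1)
glm::vec4 vector = T * glm::vec4(1.0f, 2.0f, 3.0f, 0.0f);  // (1, 2, 3, 0)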
If you’re interested in understanding more about homogeneous coordinate spaces, (like how the translation matrix is derived in the first place) I would encourage you to look at resources like “Mathematics for Computer Graphics Applications”. They provide a much more detailed explanation than I am providing here. (The homogeneous coordinate system also has some benefits for representing projections which I won’t get into here, but are explained in that text book.)
Now to finally answer the question about why we needed to preserve those final columns and rows. Based on what we now know, we weren’t actually just converting from a “3D space” to a “2D space”; we were converting from a “3D homogeneous space” to a “2D homogeneous space”. The process of converting from a higher-dimensional matrix to a lower-dimensional matrix is lossy, and some transformation details are going to be lost in the process (for example, the translation along the z-axis). There is no way to tell what kind of space a given matrix is supposed to transform just by looking at the matrix itself. The matrix does not carry any information about what space it’s operating in, and any conversion function would need to know that information to properly convert that matrix. Therefore we need to develop our own conversion function that preserves the transformations that are important to our application when moving from a “3D homogeneous space” to a “2D homogeneous space”.
Hopefully this explanation helps if you are ever working on converting 3D transformation code to 2D.
Remember when this blog was about pointlessly optimizing things? I’m talking about things like taking vkGetDescriptorSetLayoutSupport and making it fast. The kinds of optimizations nobody asked for and potentially nobody even wanted.
Well good news: this isn’t a post about those types of optimizations.
This is a post where I’m gonna talk about some speedups that you didn’t even know you craved but now that you know they exist you can’t live without.
Lots of people are asking how queue submission works, though surely nobody reading this blog, since you’re all experts. But if you have a friend who wants to know, here’s the official resource for all that knowledge. It’s got diagrams. Images with important parts circled. The stuff that means whoever wrote it knew what they were talking about.
The thing this “official” resource doesn’t tell you is the queue is potentially pretty slow. You chuck some commands into it, and then you wait on your fence/semaphore, but the actual time it takes to perform queue submission is nonzero. In fact, it’s quite a bit larger than zero. How large is it? you might be asking.
I didn’t want to do this, but you’ve forced my hand.
What if I told you there was a tool for measuring things like this. A tool for determining the cost of various Vulkan operations. For benchmarking, one might say.
That’s right, it’s time to yet again plug vkoverhead, the best and only tool for doing whatever I’m about to do.
Like a prophet, my past self already predicted that I’d be sitting here writing this post to close out a week of types_of_headaches.meme -> vulkan_headaches.meme. That’s why vkoverhead already has the -submit-only option in order to run a series of benchmark cases which have numbers that are totally not about to go up.
Let’s look at those cases now to fill up some more page space and time travel closer to the end of my workweek:
submit_noop submits nothing. There’s no semaphores, no cmdbufs, it just submits and returns in order to provide a baseline.
submit_50noop submits nothing 50 times, which is to say it passes 50x VkSubmitInfo structs to vkQueueSubmit (or the 2 versions if sync2 is supported).
submit_1cmdbuf submits a single cmdbuf. In theory this should be slower than the noop case, but I hate computers and obviously this isn’t true at all.
submit_50cmdbuf submits 50 cmdbufs. In theory this should be slower than the single cmdbuf case, and, thankfully, this one particular time in which we have expectations of how computers work does match our expectations.
submit_50cmdbuf_50submit submits 50 cmdbufs in 50 submits for a total of 50 cmdbufs per vkQueueSubmit call. This is the slowest test, you would think, and I thought that too, and the longer this explanation goes on the more you start to wonder if computers really do work at all like you expect or if this is going to upset you, but it’s Friday, and I don’t have anywhere to be except the gym, so I could keep delaying the inevitable for a while longer, but I do have to get to the gym, so sure, this is totally gonna be way slower than all the other tests, trust me™.
It’s a great series of tests which showcase some driver pain points. Specifically it shows how slow submit can be.
Let’s check out some baseline results on the driver everyone loves to hang out with, RADV:
40, submit_noop, 19569683, 100.0%
41, submit_50noop, 402324, 2.1%
42, submit_1cmdbuf, 51356, 0.3%
43, submit_50cmdbuf, 1840, 0.0%
44, submit_50cmdbuf_50submit, 1031, 0.0%
Everything looks like we’d expect. The benchmark results ensmallen as they get more complex.
But why?
Because if you think about it like a smart human and not a dumb pile of “thinking” sand, submitting 50 cmdbufs is submitting 50 cmdbufs no matter how you do it.
Some restrictions apply, signal semaphores blah blah blah, but none of that’s happening here so what the fuck, RADV?
This is where we get into some real facepalm territory. Vulkan, as an API, gives drivers the ability to optimize this. That’s the entire reason why vkQueueSubmit has the submitCount param and takes an array of submits.
But what does Mesa do here? Well, in the current code there’s this gem:
for (uint32_t i = 0; i < submitCount; i++) {
struct vulkan_submit_info info = {
.pNext = pSubmits[i].pNext,
.command_buffer_count = pSubmits[i].commandBufferInfoCount,
.command_buffers = pSubmits[i].pCommandBufferInfos,
.wait_count = pSubmits[i].waitSemaphoreInfoCount,
.waits = pSubmits[i].pWaitSemaphoreInfos,
.signal_count = pSubmits[i].signalSemaphoreInfoCount,
.signals = pSubmits[i].pSignalSemaphoreInfos,
.fence = i == submitCount - 1 ? fence : NULL
};
VkResult result = vk_queue_submit(queue, &info);
if (unlikely(result != VK_SUCCESS))
return result;
}
Tremendous. It’s worth mentioning that not only is this splitting the batched submits into individual ones, each submit also allocates a struct to contain the submit info so that the drivers can use the same interface. So it’s increasing the kernel overhead by performing multiple submits and also increasing memory allocations.
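For illustration, here is a hedged sketch of the direction such an optimization can take. This is not the actual Mesa change: vk_queue_submit_many() and the infos array are made-up names, and the other variables are the ones from the loop above.

for (uint32_t i = 0; i < submitCount; i++) {
   infos[i].pNext = pSubmits[i].pNext;
   infos[i].command_buffer_count = pSubmits[i].commandBufferInfoCount;
   infos[i].command_buffers = pSubmits[i].pCommandBufferInfos;
   infos[i].wait_count = pSubmits[i].waitSemaphoreInfoCount;
   infos[i].waits = pSubmits[i].pWaitSemaphoreInfos;
   infos[i].signal_count = pSubmits[i].signalSemaphoreInfoCount;
   infos[i].signals = pSubmits[i].pSignalSemaphoreInfos;
   infos[i].fence = (i == submitCount - 1) ? fence : NULL;
}
/* one driver-level submission covering every submit, instead of one
 * vk_queue_submit() call (and one allocation) per submit */
VkResult result = vk_queue_submit_many(queue, infos, submitCount);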
We’ve all been here before on SGC, and I really do need to get to the gym, so I’m authorizing a one-time fast forward to the results of optimizing this:
RADV GFX11:
40, submit_noop, 19569683, 100.0%
41, submit_50noop, 402324, 2.1%
42, submit_1cmdbuf, 51356, 0.3%
43, submit_50cmdbuf, 1840, 0.0%
44, submit_50cmdbuf_50submit, 1031, 0.0%
↓
40, submit_noop, 21008648, 100.0%
41, submit_50noop, 4866415, 23.2%
42, submit_1cmdbuf, 51294, 0.2%
43, submit_50cmdbuf, 1823, 0.0%
44, submit_50cmdbuf_50submit, 1828, 0.0%
That’s like 1000% faster for case #41 and 50% faster for case #44.
But how does this affect other drivers? I’m sure you’re asking next. And of course, this being the primary blog for distributing Mesa benchmarking numbers in any given year, I have those numbers.
Lavapipe:
40, submit_noop, 1972672, 100.0%
41, submit_50noop, 40334, 2.0%
42, submit_1cmdbuf, 5994597, 303.9%
43, submit_50cmdbuf, 2623720, 133.0%
44, submit_50cmdbuf_50submit, 133453, 6.8%
↓
40, submit_noop, 1980681, 100.0%
41, submit_50noop, 1202374, 60.7%
42, submit_1cmdbuf, 6340872, 320.1%
43, submit_50cmdbuf, 2482127, 125.3%
44, submit_50cmdbuf_50submit, 1165495, 58.8%
3000% faster for #41 and 1000% faster for #44.
Intel DG2:
40, submit_noop, 101336, 100.0%
41, submit_50noop, 2123, 2.1%
42, submit_1cmdbuf, 35372, 34.9%
43, submit_50cmdbuf, 713, 0.7%
44, submit_50cmdbuf_50submit, 707, 0.7%
↓
40, submit_noop, 106065, 100.0%
41, submit_50noop, 105992, 99.9%
42, submit_1cmdbuf, 35110, 33.1%
43, submit_50cmdbuf, 709, 0.7%
44, submit_50cmdbuf_50submit, 702, 0.7%
5000% faster for #41 and a big 🤕 for #44 because Intel.
Turnip A740:
40, submit_noop, 1227546, 100.0%
41, submit_50noop, 26194, 2.1%
42, submit_1cmdbuf, 1186327, 96.6%
43, submit_50cmdbuf, 545341, 44.4%
44, submit_50cmdbuf_50submit, 16531, 1.3%
↓
40, submit_noop, 1313550, 100.0%
41, submit_50noop, 1078383, 82.1%
42, submit_1cmdbuf, 1129515, 86.0%
43, submit_50cmdbuf, 329247, 25.1%
44, submit_50cmdbuf_50submit, 484241, 36.9%
4000% faster for #41, 3000% faster for #44.
Pretty good, and it somehow manages to still be conformant.
Code here.
If you’re reading, thanks for everything.
I planned to blog about it a while ago, but then I didn’t, and news sites have since broken the news: Zink from Mesa main can finally run xservers.
Yes, it’s true. For the first time ever, you can install Mesa (from git) and use zink (with environment variables) to run your entire system (unless you’re on Intel).
But what was so challenging about getting this to work? The answer won’t surprise you.
Fans of the blog know that I’m no fan of WSI. If I had my way, GPUs would render to output buffers that we could peruse at our leisure using whatever methods we had at our disposal. Ideally manual inspection. Alas, few others share my worldview and so we all must suffer.
The root of all evil when it comes to computers is synchronization. This is triply so for anything GPU-related, and when all this “display server” chicanery is added in, the evilness value becomes one of those numbers so large that numerologists are still researching naming possibilities. There are two types of synchronization used with WSI: implicit sync and explicit sync.
From a user perspective, the former has less code to manage. The downside is that on the driver side things become more complex, as implicit sync is effectively layered atop explicit sync.
Another way of looking at it is: implicit sync is the GL model, and explicit sync is the Vulkan model.
And, since xservers run on GL, you can see where this is going.
Don’t get me wrong, explicit sync sucks too, but at least it makes sense. Broadly speaking, with explicit sync you have a dmabuf image, you submit it to the GPU, and you tell the server to display it.
In the words of venerable Xorg developer, EGL maintainer, and synchronization PTSD survivor Daniel Stone, the way to handle implicit sync is “vibes”. You have a dmabuf image, you glFlush, and magically it gets displayed.
Sound nuts? It is, and that’s why Vulkan doesn’t support it.
But zink uses Vulkan, so…
Explicit sync is based on two concepts:
A user of a dmabuf waits on an export operation before using it (i.e., a wait semaphore), then signals an import operation at the end of a cmdbuf submission (i.e., a signal semaphore). Vulkan WSI handles this under the hood for users. But there’s no way to use Vulkan WSI with imported dmabufs, which means this all has to be copy/pasted around to work elsewhere.
In zink, all that happens in an xserver scenario is apps import/export dmabufs, sample/render them, and then do queue submission. To successfully copy/paste the WSI code and translate this into explicit sync for Vulkan, it’s necessary to be a bit creative with driver mechanics. The gist of it is:
an exported (DMA_BUF_IOCTL_EXPORT_SYNC_FILE) semaphore to be waited on before the current cmdbuf
an imported (DMA_BUF_IOCTL_IMPORT_SYNC_FILE) semaphore to be signaled after the current cmdbuf
Big thanks to Faith “ARB_shader_image_load_store” Ekstrand for god-tier rubberducking when I was in the home stretch of this undertaking.
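For the curious, here is a rough, simplified illustration of that dance using the kernel's dma-buf sync_file ioctls and Vulkan sync-fd semaphores. It is not zink's actual code: error handling is omitted, the KHR entry points would really be loaded via vkGetDeviceProcAddr, and the signal semaphore must have been created with sync-fd export enabled.

#include <linux/dma-buf.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <vulkan/vulkan.h>

/* Before recording work that touches the dma-buf: turn its current fence
 * into a semaphore the next submit can wait on. */
static VkSemaphore dmabuf_to_wait_semaphore(VkDevice dev, int dmabuf_fd)
{
    struct dma_buf_export_sync_file export_args = {};
    export_args.flags = DMA_BUF_SYNC_RW;
    ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &export_args);

    VkSemaphoreCreateInfo sci = {VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO};
    VkSemaphore sem = VK_NULL_HANDLE;
    vkCreateSemaphore(dev, &sci, NULL, &sem);

    VkImportSemaphoreFdInfoKHR import = {VK_STRUCTURE_TYPE_IMPORT_SEMAPHORE_FD_INFO_KHR};
    import.semaphore = sem;
    import.flags = VK_SEMAPHORE_IMPORT_TEMPORARY_BIT;
    import.handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_SYNC_FD_BIT;
    import.fd = export_args.fd;  /* fd ownership passes to the semaphore */
    vkImportSemaphoreFdKHR(dev, &import);
    return sem;
}

/* After submitting with a signal semaphore: push its fence back into the
 * dma-buf so implicit-sync consumers (like the xserver) wait on our work. */
static void signal_semaphore_to_dmabuf(VkDevice dev, VkSemaphore sem, int dmabuf_fd)
{
    VkSemaphoreGetFdInfoKHR get = {VK_STRUCTURE_TYPE_SEMAPHORE_GET_FD_INFO_KHR};
    get.semaphore = sem;
    get.handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_SYNC_FD_BIT;
    int sync_fd = -1;
    vkGetSemaphoreFdKHR(dev, &get, &sync_fd);

    struct dma_buf_import_sync_file import_args = {};
    import_args.flags = DMA_BUF_SYNC_RW;
    import_args.fd = sync_fd;
    ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &import_args);
    close(sync_fd);  /* the import ioctl does not consume the fd */
}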
Anyway I expect to be absolutely buried in bug reports by next week from all the people testing this, so thanks in advance.
It’s been a busy week, and I’ve got posts I’ve been meaning to write. The problem is they’re long and I’m busy. That’s why I’m doing a shorter post today, since even just getting this one out somehow took 4+ hours while I was continually sidetracked by “real work”.
But despite this being a shorter post, don’t worry: the memes won’t be shorter.
I got a ticket very recently that I immediately jumped on and didn’t at all procrastinate or forget about. The ticket concerned a little game called THE KING OF FIGHTERS XIV.
Now for those of you unfamiliar, The King of Fighters is a long-running fighting game franchise which debuted in the 90s. At the arcade. Pretty sure I played it once. But at like a retro arcade or something because I’m not that old, fellow kids.
The bug in question was that when a match is won using a special move, half the frame would misrender:
Heroically, the reporter posted a number of apitrace captures. Unfortunately that effort ended up being ineffectual since it did nothing but reveal yet another apitrace bug related to VAO uploads which caused replays of the traces to crash.
It was the worst kind of bug.
I was going to have to test the defect myself.
It would prove to be the greatest test of my skill yet. I would have to:
Was I successful?
I’m not saying there’s someone out there who’s worse at the test app than a guy performing exploratory tests on his keyboard under renderdoc. That’s not what I’m saying.
The debug process for this issue was, in contrast to the capture process, much simpler. I attribute this to the fact that, while I don’t own a gamepad for use with whatever test apps need to be run, I do have a code controller that I use for all my debugging:
I’ve been hesitant to share such pro strats on the blog before, but SGC has been around for long enough now that even when the copycats start vlogging about my tech and showing off the frame data, everyone will recognize where it came from. All I ask is that you post clips of tournament popoffs.
Using my code controller, I was able to perform a debug -> light code -> light code -> debug -> heavy code -> compile -> block -> post meme -> reboot -> heavy code -> heavy code combo for an easy W.
To break down this advanced sequence, a small debug reveals that the issue is a render area clamped to 1024x1024 on a 1920x1080 frame. Since I have every line of the codebase memorized (zink main don’t @ me), it was instantly obvious that some poking was in order.
Vulkan has this pair of (awful) VUs:
VUID-VkRenderingInfo-pNext-06079
If the pNext chain does not contain VkDeviceGroupRenderPassBeginInfo or its deviceRenderAreaCount member is equal to 0, the width of the imageView member of any element of pColorAttachments, pDepthAttachment, or pStencilAttachment that is not VK_NULL_HANDLE must be greater than or equal to renderArea.offset.x + renderArea.extent.width
VUID-VkRenderingInfo-pNext-06080
If the pNext chain does not contain VkDeviceGroupRenderPassBeginInfo or its deviceRenderAreaCount member is equal to 0, the height of the imageView member of any element of pColorAttachments, pDepthAttachment, or pStencilAttachment that is not VK_NULL_HANDLE must be greater than or equal to renderArea.offset.y + renderArea.extent.height
which don’t match up at all to GL’s ability to throw whatever size framebuffer attachments at the GPU and have things come out fine. A long time ago I wrote this MR to clamp framebuffer size to the smallest attachment. But in this particular case, there are three framebuffer attachments, one of which goes unused.
The unused attachment ends up clamping the framebuffer to a smaller region to avoid violating spec, and this breaks rendering. Some light code pokes to skip clamping for NULL attachments open up the combo. Another quick debug doesn’t show the issue as being resolved, which means it’s time for some heavy code: checking for unused attachments in the fragment shader during renderpass start.
Naturally this triggers a full tree compile, which is a blocking operation that gives me enough time to execute a post meme for style points. The downside is that I’m using an AMD system, so as soon as I try to run the new code it hangs; it’s at this point that I nail a reboot to launch it into orbit.
I’m not looking for a record-setting juggle, so I finish off my combo with a heavy code -> heavy code finisher to hack in attachment write tracking for TC renderpass optimization and then plumb it through the rest of my stack so unused attachments will skip all renderpass-related operations.
Problem solved, and all without having to personally play any games.
I’ll finally post that one post I’ve been planning to post for weeks but it’s hard and I just blew my entire meme budget for the month today so what is even going to happen who knows.
This week started quite fruitfully, these features were added:
And with this we should have all the features we need to run a model such as MobileNet v1 and get some performance numbers to guide the next steps.
Only that the NPU hangs when I try to use the 8th core... and this is required to run most detection models, as they start by convoluting the input to 32 feature maps.
Have checked and we are sending to the kernel bit-identical command streams and input buffers, so I suspect the problem will be somewhere in the kernel.
So I plan to instrument the out-of-tree kernel driver and get some register and command stream dumps, in the hope that there is some bit in a magic register somewhere that I need to flip.
I'm not really looking forward to such work, so I decided to first invest some time cleaning things up a bit to make it easier for other people to play with this if they wish.
I have removed from my branch everything from my previous attempt at using OpenCL and have written some documentation about how to run the TensorFlow Lite delegate:
https://gitlab.freedesktop.org/tomeu/mesa/-/blob/teflon/docs/teflon.rst
You will need a VIM3 board, a recent mainline kernel and a Debian testing rootfs.
Sriram invited me to the oneAPI meetup, and I felt I hadn't summed up the state of compute and community development in a while. Enjoy 45 minutes of opinions!
https://www.youtube.com/watch?v=HzzLY5TdnZo
if you disagree you’re wrong
gitlab is down, post low-effort blogs and touch grass until it returns
The GSoC journey is coming to a close. In just over 100 days, I gained more experience in open-source development than I could ever imagine in this period.
Prior to GSoC, I was not used to regularly submitting patches to the mailing lists. Now, I’ve sent many patches and revisions. I believe my interaction with the community will only grow. I learned so much about the tools and workflow of kernel development.
After this experience, I’m more than certain that I want to make this a job, contributing to open-source is fun, so why not make this a living :)
The main goal of the project was to increase the code coverage on the DRM core helper functions by creating unit tests.
As the coverage of all helpers is a big task for the time period, I decided to create tests for the drm_format_helper.c functions.
Throughout the project, other side tasks appeared. I will list the contributions made below.
VKMS is a software-only model of a KMS driver that is useful for testing and running X (or similar) on headless machines.
This was, unexpectedly, a big part of my GSoC. I learned a lot about color formats and how a graphics driver works. Currently, only one piece of my work was upstreamed, the rest needs more work and was postponed in favor of the primary project goal.
Patch | Status |
---|---|
drm/vkms: Add support to 1D gamma LUT | Accepted |
For more information go check my blogpost about the whole process.
IGT GPU Tools is a collection of tools for the development and testing of DRM drivers. While working on VKMS I used the IGT framework heavily for testing; on one occasion a bug made a test stop working on VKMS, so I submitted a patch to fix that.
Patch | Status |
---|---|
lib/igt_fb: Add check for intel device on use_enginecopy | Accepted |
In the DRM subsystem, I achieved the main project goal of adding unit tests, and also helped to fix some bugs that appeared while working on the tests. With the patches sent, I got 71.5% line coverage and 85.7% function coverage on drm_format_helper.c.
I think the most difficult task was describing my work. Either on blog posts or in the commit messages, it takes a lot of work to write what you’ve done concisely and clearly. With time you get the way of things, but I think I can improve on this subject.
Moreover, many times I had to debug some problems. I already knew how to use GDB, but using it in the kernel is a little more cumbersome. After searching, reading the documentation, and getting tips from my mentors, I got it working.
On the VKMS, I had to create new features, this requires a lot of thought. I made a lot of diagrams in my head to understand how the color formats would be displayed in memory, and yet most of my work hasn’t seen the light of day XD.
I was able to do most of the proposed tasks. But the drm_xfrm_toio was left out due to the difficulty of testing it, as it uses IO memory. I tested the drm_fb_blit(), but I’m waiting for the acceptance of the patchset to send it; with that patch the line coverage will go to 89.2% and the function coverage will go to 94.3%.
Besides patch submission, I reviewed some patches too. Going to the other side, I enjoyed thinking about how a piece of code could be improved.
Also, on one occasion I started a discussion about the best way to solve an issue by sending a patch. This got me a Reported-by tag on the patch that fixed the bug.
Moreover, I use a Thunderbird addon to make diffs properly highlighted. When I was tinkering with its configuration, I noticed that the CSS of the configuration menu was wrong, which made the user experience pretty bad.
I sent a patch fixing that to the maintainer of the addon; this patch generated a discussion that led to a complete overhaul of the CSS file due to Thunderbird updates.
I’d like to thank my mentors, André “Tony” Almeida, Maíra Canal, and Tales L. Aparecida. Their support and expertise were invaluable throughout this journey.
Moreover, I thank the X.Org Foundation for allowing me to participate in this program, and also for accepting my talk proposal for XDC 2023.
Lastly, I thank my colleague Carlos Eduardo Gallo for exchanging knowledge during the weekly meetings.
It's 5am and I have a headache. The perfect time for some reflection!
Not only that, but I've just had to play the part of Static Site Ungenerator, because I found out that I deleted the source of the last post and I didn't want to lose it in the upcoming publish. If your Atom feed went funky, sorry.
This document is my Final Work Submission, but is fun for all the family, including the ones who don't work at Google. Hi everyone!
Going into the summer, the plan was to add functionality to wlroots so that its users (generally Wayland compositors) could more easily switch to a smarter frame schedule. I've had many goes at explaining the problem and they all sucked, so here we go again: if a compositor puts some thought into when it starts its render, desktop latency as perceived by the user can decrease. The computer will feel snappier.
wlroots started the summer with no accommodations for compositors that wanted to put thought into when they start to render. It assumed exactly no thought was to be put in, and left you on your own if you were to decide otherwise. But that has all changed!
The aim of my work could have comprised three things, but I added a fourth and then didn't have time for the third:
After some flailing around trying to add a delay to the existing scheduling, I started writing patches worth landing.
First came the render timer API. Now we can measure the duration of our render passes. This MR brought an abstraction for timers, and an implementation for wlroots' gles2 renderer.
Next, the scene timer API. wlr_scene does some of its own work before setting off the render pass itself, so it needed to become aware of timers and expose a way to use them.
Meanwhile, I was having another stab at configuring a frame delay. It wasn't very good, and the design of wlroots' scheduling and the complexity of the logic underneath it turned out to take a long time to get through. With this MR, though, I had a better idea of where I was trying to go. A long thought process followed, much of which lives in this issue, and further down we'll see what came of that.
Before working on a prediction algorithm, I wanted to be able to see live feedback on how render
timings behaved and which frames were missed so that I could do a good (informed) job of predicting
them.
I took a detour into the world of tracing.
libuserevents was spawned and so was the work to make use of it in wlroots.
Linux's user_events tracing interface was appealing because it meant that GPUVis, an existing tool
that can display a timeline of CPU and GPU events, would be able to show wlroots' events.
Unfortunately Linux and I have so far struggled to get along and this work is still in progress -
no submission yet because it's broken.
Even more unfortunately, this meant that I wasn't able to get around to prediction.
Then I got tired of fighting that, and despite the words of discouragement...
a refactor of wlroots' frame scheduling that allows us to do much better than !4214:
!4307!
This hasn't quite made it past the finish line, but it's close; I can feel it in my frames.
It (in my opinion) neatly extracts the hairy logic that lived in wlr_output into a helper interface, allowing users to swap out which frame scheduler they use, or to forgo the helpers and roll their own without there being bits and pieces left over in the parts of wlroots that they do care about.
This is the most exciting piece of the puzzle IMO; wlr_output has grown to have its fingers in many pies, and this MR reduces that and leaves wlr_output a little bit more friendly, in a way that took a lot of brain cycles but turned out clean.
This new interface doesn't come with a frame delay option for free, but an implementation of the interface that has this feature is underway: !4334. It fits nicely! We hashed it out a little on IRC because the frame delay option is a surprisingly tricky constraint on the interface, but I think the conclusion is good. It was definitely a lot easier to write this with confidence after the scheduling redesign :)
To make this scheduling design possible and clean, a couple of little changes were needed in other areas, and thankfully the case for these changes was easy to make. They're helpful to me, but also make those parts of wlroots less surprising and/or broken.
There was also a discussion about the fate of wlr_output.events.needs_frame, which is an extra complexity in wlroots' frame scheduling. It turned out that while removing it is possible, it wasn't necessary for the new scheduling system, so it continues in the background.
While libuserevents is usable, the wlroots integration is not ready.
There is sadly no "stock" plug-and-play prediction algorithm in wlroots.
The new scheduling infrastructure has not landed but I'm sure it will Soon™. The implementation with the frame delay option will hopefully follow shortly after. When (touch wood) it does, compositors will have to bring their own prediction algorithm, but a "good enough" algorithm can be very simple and given the current interface design can easily be swapped out for a stock one if one materialises.
And finally, the funniest one. I wrote an implementation of the timer API for wlroots' Vulkan renderer, and then put off submitting it for two months because everything else was more important. gles2 is the default renderer and supports roughly every GPU in existence. Writing the Vulkan timer was fun but landing it was less of a priority than every other task I had and nothing really depended on it, so it remains stuck on my laptop to this day. Perhaps I should get round to that.
The project didn't go how I expected it to - not even close. I even wrote up a schedule as part of my application that almost immediately turned out completely wrong. I'm not bothered, though, because it was fun, I made myself useful, and I met some cool people.
If you're considering doing something like I did, I can happily recommend Simon as a
mentor, X.Org, and GSoC, in that order.
Much love to Simon for making me feel comfortable when I really didn't know what I was doing, and
for participating in my wildly off-topic free software rambles.
I've only interacted with a small part of the X.Org community so far but it struck me from the start
how welcoming everyone is;
I have no doubts that the other X.Org project mentors are as lovely in their own ways.
And of course, as a strong proponent of software that doesn't suck that's free, I have to
appreciate that GSoC gave me a welcoming place to do my part in that and relieve my worldly
pressures (did you know you have to pay for internet??).
Thanks everyone for putting up with me. If you would like to put up with me some more, click the links on the left - I'm not going anywhere, there's still work to do!
Managed to squeeze some time between holidaying to hack on the NPU driver and got something out of it.
Since the last update I have:
Next steps are to support convolutions with multiple input and output channels, and padding. Then see what is still missing so we can run MobileNet v1 and check the performance when using the NN units and doing the rest on the CPU.
As a reminder, I'm pushing all the code to this branch: https://gitlab.freedesktop.org/tomeu/mesa/-/commits/teflon/.
A bunch of us have started to gather in the #ml-mainline IRC channel in OFTC to disucss matters about doing accelerated ML with mainline, on embedded.
For those of you that may not have a IRC bouncer setup yet, you can easily join with the web chat UI, but in case others aren't in front of the keyboard when you type your question, I recommend using element.io with the Matrix IRC bridge:
https://blog.christophersmart.com/2022/03/21/joining-a-bridged-irc-network-on-element-matrix/
I have been invited to give a talk about all this ML with mainline effort at Embedded Recipes 2023, Paris 28-29 September. Slides and a recording will be published after the conference ends.
Last but not least, if I am able to invest so much effort on this is because the folks at LibreComputer have been supporting me financially this last couple of months.
Thanks to Da Xue for his support, it is greatly appreciated! It is awesome to see SBC vendors investing in the Linux upstream ecosystem.
It turns out this was the product of a tiler optimization I did earlier this year to pipeline texture uploads without splitting renderpasses. I was (wrongly) assuming that the PBO stride would always match the image format stride, which broke functionality in literally just this one corner case.
Conformant OpenGL® ES 3.1 drivers are now available for M1- and M2-family GPUs. That means the drivers are compatible with any OpenGL ES 3.1 application. Interested? Just install Linux!
For existing Asahi Linux users, upgrade your system with dnf upgrade (Fedora) or pacman -Syu (Arch) for the latest drivers.
Our reverse-engineered, free and open source graphics drivers are the world’s only conformant OpenGL ES 3.1 implementation for M1- and M2-family graphics hardware. That means our driver passed tens of thousands of tests to demonstrate correctness and is now recognized by the industry.
To become conformant, an “implementation” must pass the official conformance test suite, designed to verify every feature in the specification. The test results are submitted to Khronos, the standards body. After a 30-day review period, if no issues are found, the implementation becomes conformant. The Khronos website lists all conformant implementations, including our drivers for the M1, M1 Pro/Max/Ultra, M2, and M2 Pro/Max.
Today’s milestone isn’t just about OpenGL ES. We’re releasing the first conformant implementation of any graphics standard for the M1. And we don’t plan to stop here ;-)
Unlike ours, the manufacturer’s M1 drivers are unfortunately not conformant for any standard graphics API, whether Vulkan or OpenGL or OpenGL ES. That means that there is no guarantee that applications using the standards will work on your M1/M2 (if you’re not running Linux). This isn’t just a theoretical issue. Consider Vulkan. The third-party MoltenVK layers a subset of Vulkan on top of the proprietary drivers. However, those drivers lack key functionality, breaking valid Vulkan applications. That hinders developers and users alike, if they haven’t yet switched their M1/M2 computers to Linux.
Why did we pursue standards conformance when the manufacturer did not? Above all, our commitment to quality. We want our users to know that they can depend on our Linux drivers. We want standard software to run without M1-specific hacks or porting. We want to set the right example for the ecosystem: the way forward is implementing open standards, conformant to the specifications, without compromises for “portability”. We are not satisfied with proprietary drivers, proprietary APIs, and refusal to implement standards. The rest of the industry knows that progress comes from cross-vendor collaboration. We know it, too. Achieving conformance is a win for our community, for open source, and for open graphics.
Of course, Asahi Lina and I are two individuals with minimal funding. It’s a little awkward that we beat the big corporation…
It’s not too late though. They should follow our lead!
OpenGL ES 3.1 updates the experimental OpenGL ES 3.0 and OpenGL 3.1 we shipped in June. Notably, ES 3.1 adds compute shaders, typically used to accelerate general computations within graphics applications. For example, a 3D game could run its physics simulations in a compute shader. The simulation results can then be used for rendering, eliminating stalls that would otherwise be required to synchronize the GPU with a CPU physics simulation. That lets the game run faster.
Let’s zoom in on one new feature: atomics on images. Older versions of OpenGL ES allowed an application to read an image in order to display it on screen. ES 3.1 allows the application to write to the image, typically from a compute shader. This new feature enables flexible image processing algorithms, which previously needed to fit into the fixed-function 3D pipeline. However, GPUs are massively parallel, running thousands of threads at the same time. If two threads write to the same location, there is a conflict: depending which thread runs first, the result will be different. We have a race condition.
“Atomic” access to memory provides a solution to race conditions. With atomics, special hardware in the memory subsystem guarantees consistent, well-defined results for select operations, regardless of the order of the threads. Modern graphics hardware supports various atomic operations, like addition, serving as building blocks to complex parallel algorithms.
Can we put these two features together to write to an image atomically?
Yes. A ubiquitous OpenGL ES extension, required for ES 3.2, adds atomics operating on pixels in an image. For example, a compute shader could atomically increment the value at pixel (10, 20).
Other GPUs have dedicated instructions to perform atomics on images, making the driver implementation straightforward. For us, the story is more complicated. The M1 lacks hardware instructions for image atomics, even though it has non-image atomics and non-atomic images. We need to reframe the problem.
The idea is simple: to perform an atomic on a pixel, we instead calculate the address of the pixel in memory and perform a regular atomic on that address. Since the hardware supports regular atomics, our task is “just” calculating the pixel’s address.
If the image were laid out linearly in memory, this would be straightforward: multiply the Y-coordinate by the number of bytes per row (“stride”), multiply the X-coordinate by the number of bytes per pixel, and add. That gives the pixel’s offset in bytes relative to the first pixel of the image. To get the final address, we add that offset to the address of the first pixel.
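In code, the linear case is just the familiar stride calculation (a sketch; the names are illustrative):

#include <stdint.h>

// Byte address of pixel (x, y) in a linearly laid out image.
static uint64_t linear_pixel_address(uint64_t base, uint32_t x, uint32_t y,
                                     uint32_t stride_bytes, uint32_t bytes_per_pixel)
{
    return base + (uint64_t)y * stride_bytes + (uint64_t)x * bytes_per_pixel;
}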
Alas, images are rarely linear in memory. To improve cache efficiency, modern graphics hardware interleaves the X- and Y-coordinates. Instead of one row after the next, pixels in memory follow a spiral-like curve.
We need to amend our previous equation to interleave the coordinates. We could use many instructions to mask one bit at a time, shifting to construct the interleaved result, but that’s inefficient. We can do better.
There is a well-known “bit twiddling” algorithm to interleave bits. Rather than shuffle one bit at a time, the algorithm shuffles groups of bits, parallelizing the problem. Implementing this algorithm in shader code improves performance.
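For reference, the classic bit-twiddling interleave looks like this written out in plain code (the driver does the equivalent inside the shader):

#include <stdint.h>

// Spread the low 16 bits of v so there is a zero bit between each pair of
// original bits, shuffling groups of bits instead of one bit at a time.
static uint32_t part1by1(uint32_t v)
{
    v &= 0x0000ffff;
    v = (v | (v << 8)) & 0x00ff00ff;
    v = (v | (v << 4)) & 0x0f0f0f0f;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

// Interleave X and Y into a Morton-order index: x bits land in the even
// positions, y bits in the odd positions.
static uint32_t interleave_xy(uint16_t x, uint16_t y)
{
    return part1by1(x) | (part1by1(y) << 1);
}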
In practice, only the lower 7 bits (or fewer) of each coordinate are interleaved. That lets us use 32-bit instructions to “vectorize” the interleave, by putting the X- and Y-coordinates in the low and high 16 bits of a 32-bit register. Those 32-bit instructions let us interleave X and Y at the same time, halving the instruction count. Plus, we can exploit the GPU’s combined shift-and-add instruction. Putting the tricks together, we interleave in 10 instructions of M1 GPU assembly:
# Inputs x, y in r0l, r0h.
# Output in r1.
add r2, #0, r0, lsl 4
or r1, r0, r2
and r1, r1, #0xf0f0f0f
add r2, #0, r1, lsl 2
or r1, r1, r2
and r1, r1, #0x33333333
add r2, #0, r1, lsl 1
or r1, r1, r2
and r1, r1, #0x55555555
add r1, r1l, r1h, lsl 1
We could stop here, but what if there’s a dedicated instruction to interleave bits? PowerVR has a “shuffle” instruction shfl, and the M1 GPU borrows from PowerVR. Perhaps that instruction was borrowed too. Unfortunately, even if it was, the proprietary compiler won’t use it when compiling our test shaders. That makes it difficult to reverse-engineer the instruction – if it exists – by observing compiled shaders.
It’s time to dust off a powerful reverse-engineering technique from magic kindergarten: guess and check.
Dougall Johnson provided the guess. When considering the instructions we already know about, he took special notice of the “reverse bits” instruction. Since reversing bits is a type of bit shuffle, the interleave instruction should be encoded similarly. The bit reverse instruction has a two-bit field specifying the operation, with value 01. Related instructions to count the number of set bits and find the first set bit have values 10 and 11 respectively. That encompasses all known “complex bit manipulation” instructions.
Opcode | Operation
---|---
00 | ? ? ?
01 | Reverse bits
10 | Count set bits
11 | Find first set
There is one value of the two-bit enumeration that is unobserved and unknown: 00. If this interleave instruction exists, it’s probably encoded like the bit reverse but with operation code 00 instead of 01.
There’s a difficulty: the three known instructions have one single input source, but our instruction interleaves two sources. Where does the second source go? We can make a guess based on symmetry. Presumably to simplify the hardware decoder, M1 GPU instructions usually encode their sources in consistent locations across instructions. The other three instructions have a gap where we would expect the second source to be, in a two-source arithmetic instruction. Probably the second source is there.
Armed with a guess, it’s our turn to check. Rather than handwrite GPU assembly, we can hack our compiler to replace some two-source integer operation (like multiply) with our guessed encoding of “interleave”. Then we write a compute shader using this operation (by “multiplying” numbers) and run it with the newfangled compute support in our driver.
All that’s left is writing a shader that checks that the mystery instruction returns the interleaved result for each possible input. Since the instruction takes two 16-bit sources, there are about 4 billion (\(2^{32}\)) inputs. With our driver, the M1 GPU manages to check them all in under a second, and the verdict is in: this is our interleave instruction.
As for our clever vectorized assembly to interleave coordinates? We can replace it with one instruction. It’s anticlimactic, but it’s fast and it passes the conformance tests.
And that’s what matters.
Thank you to Khronos and Software in the Public Interest for supporting open drivers.
Color is a visual perception. Human eyes can detect a broader range of colors than any device in the graphics chain. Since each device can generate, capture or reproduce a specific subset of colors and tones, color management controls color conversion and calibration across devices to ensure a more consistent color reproduction. We can expose a GPU-accelerated display color management pipeline to support this process and enhance results, and this is what we are doing on Linux to improve color management on Gamescope/SteamDeck. Even with the challenges of being external developers, we have been working on mapping AMD GPU color capabilities to the Linux kernel color management interface, which is a combination of DRM and AMD driver-specific color properties. This more extensive color management pipeline includes pre-defined Transfer Functions, 1-Dimensional LookUp Tables (1D LUTs), and 3D LUTs before and after the plane composition/blending.
The study of color is well-established and has been explored for many years. Color science and research findings have also guided technology innovations. As a result, color in Computer Graphics is a very complex topic that I’m putting a lot of effort into becoming familiar with. I always find myself rereading all the materials I have collected about color space and operations since I started this journey (about one year ago). I also understand how hard it is to find consensus on some color subjects, as exemplified by all explanations around the 2015 online viral phenomenon of The Black and Blue Dress. Have you heard about it? What is the color of the dress for you?
So, taking into account my skills with colors and building consensus, this blog post only focuses on GPU hardware capabilities to support color management :-D If you want to learn more about color concepts and color on Linux, you can find useful links at the end of this blog post.
The DRM color management interface only exposes a small set of post-blending color properties. Proposals to enhance the DRM color API from different vendors have landed on the subsystem mailing list over the last few years. On one hand, we got some suggestions to extend the DRM post-blending/CRTC color API: DRM CRTC 3D LUT for R-Car (2020 version); DRM CRTC 3D LUT for Intel (draft - 2020); DRM CRTC 3D LUT for AMD by Igalia (v2 - 2023); DRM CRTC 3D LUT for R-Car (v2 - 2023). On the other hand, there were some proposals to extend the DRM pre-blending/plane API: DRM plane colors for Intel (v2 - 2021); DRM plane API for AMD (v3 - 2021); DRM plane 3D LUT for AMD - 2021. Finally, Simon Ser sent the latest proposal in May 2023: Plane color pipeline KMS uAPI, from discussions in the 2023 Display/HDR Hackfest, and it is still under evaluation by the Linux Graphics community.
All previous proposals seek a generic solution for expanding the API, but many
seem to have stalled due to the uncertainty of matching well the hardware
capabilities of all vendors. Meanwhile, the use of AMD color capabilities on
Linux remained limited by the DRM interface, as the DCN 3.0 family color caps
and mapping
diagram below shows the Linux/DRM color interface without
driver-specific color properties [*]:
Bearing in mind that we need to know the variety of color pipelines in the
subsystem to be clear about a generic solution, we decided to approach the
issue from a different perspective and worked on enabling a set of
Driver-Specific Color Properties for AMD Display Drivers
. As a result, I
recently sent another round of the AMD driver-specific color mgmt
API.
For those who have been following the AMD driver-specific proposal since the
beginning (see
[RFC][V1]),
the main new features of the latest version
[v2]
are the addition of pre-blending Color Transformation Matrix (plane CTM)
and
the differentiation of Pre-defined Transfer Functions (TF)
supported by color
blocks. For those who just got here, I will recap this work in two blog posts.
This one describes the current status of the AMD display driver in the Linux
kernel/DRM subsystem and what changes with the driver-specific properties. In
the next post, we go deeper to describe the features of each color block and
provide a better picture of what is available in terms of color management for
Linux.
Before discussing colors in the Linux kernel with AMD hardware, consider
accessing the Linux kernel documentation (version 6.5.0-rc5). In the AMD
Display documentation, you will find my previous work documenting AMD hardware
color capabilities and the Color Management
Properties.
It describes how AMD Display Manager (DM)
intermediates requests between the
AMD Display Core component (DC)
and the Linux/DRM kernel
interface for
color management features. It also describes the relevant function to call the
AMD color module in building curves for content space transformations.
A subsection also describes hardware color capabilities and how they evolve between versions. This subsection, DC Color Capabilities between DCN generations, is a good starting point to understand what we have been doing on the kernel side to provide a broader color management API with AMD driver-specific properties.
Blending is the process of combining multiple planes (framebuffers abstraction)
according to their mode settings. Before blending, we can manage the colors of
various planes separately; after blending, we have combined those planes in
only one output per CRTC. Color conversions after blending would be enough in a
single-plane scenario or when dealing with planes in the same color space on
the kernel side. Still, it cannot help to handle the blending of multiple
planes with different color spaces and luminance levels. With plane color
management properties, userspace can get a better representation of colors to deal with the diversity of color profiles of devices in the graphics chain, bring a wide color gamut (WCG), and convert High-Dynamic-Range (HDR) content to Standard-Dynamic-Range (SDR) content (and vice versa). With a
GPU-accelerated display color management pipeline, we can use hardware blocks
for color conversions and color mapping and support advanced color management.
The current DRM color management API enables us to perform some color conversions after blending, but there is no interface to calibrate input space by planes. Note that here I’m not considering some workarounds in the AMD display manager mapping of DRM CRTC de-gamma and DRM CRTC CTM property to pre-blending DC de-gamma and gamut remap block, respectively. So, in more detail, it only exposes three post-blending features: DRM CRTC de-gamma 1D LUT, DRM CRTC CTM and DRM CRTC gamma 1D LUT.
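To make that interface a bit more tangible, here is a minimal userspace sketch that programs one of those post-blending properties (the gamma 1D LUT) through libdrm. It assumes the CRTC property ID has already been looked up by name (the lookup_crtc_prop() helper is hypothetical), and a real compositor would set this as part of an atomic commit (drmModeAtomicAddProperty + drmModeAtomicCommit) rather than the legacy single-property call used here to keep the sketch short.

#include <stdint.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

/* Hypothetical helper: walk drmModeObjectGetProperties() for the CRTC and
 * return the ID of the property with the given name, or 0 if not found. */
uint32_t lookup_crtc_prop(int fd, uint32_t crtc_id, const char *name);

static int set_crtc_gamma(int fd, uint32_t crtc_id, int lut_size)
{
    /* lut_size must match the CRTC's GAMMA_LUT_SIZE property. */
    struct drm_color_lut lut[lut_size];

    /* Identity 1D LUT: each channel ramps linearly over the 16-bit range. */
    for (int i = 0; i < lut_size; i++) {
        uint16_t v = (uint16_t)(i * 0xffff / (lut_size - 1));
        lut[i].red = lut[i].green = lut[i].blue = v;
        lut[i].reserved = 0;
    }

    uint32_t blob_id;
    int ret = drmModeCreatePropertyBlob(fd, lut, sizeof(lut), &blob_id);
    if (ret)
        return ret;

    uint32_t prop_id = lookup_crtc_prop(fd, crtc_id, "GAMMA_LUT");
    ret = drmModeObjectSetProperty(fd, crtc_id, DRM_MODE_OBJECT_CRTC,
                                   prop_id, blob_id);
    drmModeDestroyPropertyBlob(fd, blob_id);
    return ret;
}

The de-gamma LUT and CTM follow the same blob-plus-property pattern, with struct drm_color_ctm carrying a 3x3 matrix in S31.32 fixed point.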
We can compare the Linux color management API with and without the
driver-specific color properties. From now, we denote driver-specific
properties with the AMD prefix and generic properties with the DRM prefix. For
visual comparison, I bring the DCN 3.0 family color caps and mapping
diagram
closer and present it here again:
Mixing AMD driver-specific color properties with DRM generic color properties, we have a broader Linux color management system with the following features exposed by properties in the plane and CRTC interface, as summarized by this updated diagram:
The blocks highlighted by red lines
are the new properties
in the
driver-specific interface developed by me (Igalia) and Joshua (Valve). The red
dashed lines
are new links between API and AMD driver components
implemented by
us to connect the Linux/DRM interface to AMD hardware blocks, mapping
components accordingly. In short, we have the following color management
properties exposed by the DRM/AMD display driver:
Note: You can find more about AMD display blocks in the Display Core Next (DCN) - Linux kernel documentation, provided by Rodrigo Siqueira (Linux/AMD display developer) in a 2021-documentation series. In the next post, I’ll revisit this topic, explaining display and color blocks in detail.
So, looking at AMD hardware color capabilities in the first diagram, we can see that there is no post-blending (MPC) de-gamma block in any hardware family. We can also see that the AMD display driver maps CRTC/post-blending CTM to pre-blending (DPP) gamut_remap, but there is a post-blending (MPC) gamut_remap (DRM CTM) in newer hardware versions, which include SteamDeck hardware. You can find more details about hardware versions in the Linux kernel documentation/AMDGPU Product Information.
I needed to rework these two mappings mentioned above to provide
pre-blending/plane de-gamma and CTM for SteamDeck. I changed the DC mapping to
detach stream gamut remap
matrixes from the DPP gamut remap
block. That
means mapping AMD plane CTM directly to DPP/pre-blending gamut remap block and
DRM CRTC CTM to MPC/post-blending gamut remap block. In this sense, I also
limited plane CTM properties to those hardware versions with MPC/post-blending
gamut_remap capabilities since older versions cannot support this feature
without clashes with DRM CRTC CTM.
Unfortunately, I couldn’t prevent a conflict between AMD plane de-gamma and DRM CRTC de-gamma, since post-blending de-gamma isn’t available in any AMD hardware version so far. The fact is that a post-blending de-gamma makes little sense in the AMD color pipeline, where plane blending works better in a linear space and there are enough color blocks to linearize content before blending. To deal with this conflict, the driver now rejects atomic commits if users try to set both AMD plane de-gamma and DRM CRTC de-gamma simultaneously.
Finally, we had no other clashes when enabling other AMD driver-specific color properties for our use case, Gamescope/SteamDeck. Our main work for the remaining properties was understanding the data flow of each property, the hardware capabilities and limitations, and how to shape the data for programming the registers - AMD color block capabilities (and limitations) are the topics of the next blog post. Besides that, we fixed some driver bugs along the way since it was the first Linux use case for most of the new color properties, and some behaviors are only exposed when exercising the engine.
Take a look at the Gamescope/Steam Deck Color Pipeline[**], and see how Gamescope uses the new API to manage color space conversions and calibration (please click on the image for a better view):
In the next blog post, I’ll describe the implementation and technical details of each pre- and post-blending color block/property on the AMD display driver.
* Thanks to Harry Wentland for helping with diagrams, color concepts and AMD capabilities.
** Thanks to Joshua Ashton for providing and explaining the Gamescope/Steam Deck color pipeline.
*** Thanks to the Linux Graphics community - explicitly Harry, Joshua, Pekka, Simon, Sebastian, Siqueira, Alex H. and Ville - for all the learning during this Linux DRM/AMD color journey. Also, thanks to Carlos and Tomas for organizing the 2023 Display/HDR Hackfest, where we had a great and immersive opportunity to discuss Color & HDR on Linux.
After a long week of what-even-happened, it’s finally time to talk about maintenance5.
This long-awaited maintenance extension has a number of great and zinkful features:
VK_FORMAT_A8_UNORM_KHR
for native A8 handling
gl_PointSize
VK_REMAINING_ARRAY_LAYERS
Clarification that copies between images of any type are allowed, treating 1D images as 2D images with a height of 1.
But who can guess which one is the topic of this blog post?
Finally a default value for gl_PointSize
.
Long-term fans of the blog will recall that I’ve previously raged against the insane concept that pointsize must be written
many times prior. In fact, it remains the second most blogged about topic in SGC history right behind Big Triangle descriptor management, the topic that modern graphics-related blogs must cover above all others.
Finally with maintenance5 we can be freed from these unjust shackles that have bound us for so long. No more* shall complex logic be unnecessarily injected into the compiler stack to add senseless writes to this output.
* except all that code still has to exist and run to handle drivers that don’t support maintenance5
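For context, detecting whether that legacy path can be skipped boils down to a feature query. Here’s a minimal sketch, assuming Vulkan headers recent enough to ship VK_KHR_maintenance5; a real driver would also confirm the extension is advertised before trusting the feature struct:

#include <stdbool.h>
#include <vulkan/vulkan.h>

/* Sketch of how a layered driver might detect maintenance5 before deciding
 * whether the pointsize-injection pass can be skipped. */
static bool has_maintenance5(VkPhysicalDevice pdev)
{
    VkPhysicalDeviceMaintenance5FeaturesKHR maint5 = {
        .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MAINTENANCE_5_FEATURES_KHR,
    };
    VkPhysicalDeviceFeatures2 features = {
        .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FEATURES_2,
        .pNext = &maint5,
    };

    /* Chained query: the driver fills in maint5.maintenance5. */
    vkGetPhysicalDeviceFeatures2(pdev, &features);
    return maint5.maintenance5 == VK_TRUE;
}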
Beyond the obvious benefit of having a fixed default pointsize (sanity), let’s check out some other benefits.
Previously all zink-emitted shaders would have a pointsize write, even those that were never used for drawing points. This resulted in unnecessary shader i/o at the hardware level. Nobody wants unnecessary shader i/o at the hardware level.
Now, however, it’s possible to use heuristics during linking to delete all unnecessary pointsize writes any time there is no XFB emission.
How much performance improvement will this yield?
Six.
Six improvement units of performance.
Everyone remembers that time I discovered that huge flaw in nir_assign_io_var_locations
where shader interfaces would break due to psiz injection.
With maintenance5 all of that can be handwaved away, meaning fewer shader variants are needed.
Maintenance extensions are best extensions, prove me wrong.
Hi!
Let me start this status update with an announcement: from 2023-08-28 to 2023-10-01 (!), I will be on leave, so I will have reduced availability. Don’t be surprised if I miss some e-mails, and feel free to ping me again (more generally, please do ping me if I forget about a discussion — that also tends to happen when I’m not on leave). During that time, I will be traveling to Korea and Japan. If you live there and want to say hello, please reach out! :)
This month, Rose has continued working on wlroots frame scheduling. After a fair amount of discussion, she’s found a pretty nice API design. She still needs to address and cleanup a few things, but that merge request is on a good track! I’ve also merged a new API to embed a compositor inside a Wayland client, and sent patches to remove some cases where we were waiting for a reply from Xwayland in a blocking fashion.
My kernel patch for signaling an eventfd from a drm_syncobj
has been merged
(see last month’s post for more details), and I’ve reviewed a patch from Erik
Kurzinger to import a sync_file
into a drm_syncobj
timeline, which was
possible before but awkward (it required 3 IOCTLs and a temporary binary
drm_syncobj
). As usual, I’ve sent a few kernel documentation patches as well.
I’ve released a new version of Cage, the Wayland kiosk compositor. Cage now uses the latest wlroots release, implements a bunch of new protocols and leverages wlroots’ scene-graph API.
The NPotM is go-mls, a Go library for the Messaging Layer Security protocol. It’s a shiny new end-to-end encryption framework for messaging protocols (similar to the one used by e.g. Signal and Matrix). I wanted to figure out how it works, but simply reading a 132-page RFC didn’t seem fun enough, so I just tried implementing it instead. I’m passing most of the official test vectors, still missing a few things but overall not too far away from a proper implementation. I’ve been discussing with a few folks about an IRCv3 extension for MLS, but we’re still at the very early stages on that front.
Speaking of IRCv3, the pre-away extension has been merged, so the away status of soju users shouldn’t blink anymore when the Goguma mobile client synchronizes in the background. I’ve also submitted the no-implicit-names extension for standardization. That extension reduces bandwidth usage for clients who don’t need to always maintain a list of all users in all channels. This helps a lot with slow 3G connections in the countryside.
The SNPotM is libdns/dnsupdate, a Go library for the venerable dynamic DNS UPDATE protocol implemented by various authoritative name servers. The library conforms to an interface shared with other (proprietary) libdns providers. I have more plans in this area, but will keep that for a future blog post.
I’ve sent a go-proxyproto patch to add a helper to configure an HTTP/2 server with PROXY protocol upgrade support. TLS ALPN is needed to negotiate HTTP/2, so it’s tricky to make work behind a reverse proxy which terminates the TLS connection. This patch is basically part of kimchi ripped off and put behind a nice API. This patch would be useful to add HTTP/2 support to pages.sr.ht.
Last but not least, I’ve implemented tracker export for the todo.sr.ht GraphQL API. delthas has added support for that in hut. Next up is support for import in hut! I’ve also sent a whole bunch of bug fixes for sr.ht.
That’s all for this month! I’m not sure I’ll write a status update in September, but will definitely do so in October.
I just got back from lunch and have to work off some cals, and that means it’s time for another big lift on the blog. Today’s topic: how dumb can a driver’s compiler stack be?
As I outlined in the previous post, zink’s compiler stack is about to get a whole lot dumber for various reasons. But what exactly does that look like?
Lesser bloggers would simply link to the MR and say “figure it out”.
Here at SGC, however, I put in the extra effort so my readers can comprehend all the stringy awfulness that goes into each individual graphical sausage that this triangle factory is pumping out.
Let’s get at it.
The key point of using the theoretical new NIR linker (that definitely exists and will be merged in finite time) is that it requires drivers to accept lowered i/o. This means, effectively, that zink must begin consuming lowered i/o as soon as it receives shaders. Naturally the first step to that was evaluating all the shader passes which operate on specific i/o variables using derefs (AKA “explicit” i/o):
The first four are called from zink_shader_create
, the first time zink sees new shaders, while the last one is called zink_compiler_assign_io
. As shaders won’t have derefs again until just before they go through NTV, they’ll all have to be…
What’s that you say, creator of the patented Delete The Code methodology and planar YUV expert, Faith Ekstrand? I can just delete some of this code?
That sounds like a pretty smart idea. Looking at the list again, and then cross-referencing against all the features lowered i/o provides, and then pushing up my glasses so I can read the very real documentation that nir has, let’s see where that leads:
nir_lower_io_lower_64bit_to_32 is available during i/o lowering, so this can all be deleted.
Not actually that much work, huzzah.
As in the flowchart, this process involves taking explicit i/o, converting to lowered i/o, then converting back to explicit. Explicit i/o is characterized by using derefs to explicit variables for access, which means variables are needed. A work-intensive benefit of this is simpler variables: since lowered i/o is characterized by location-based access to components, the subsequent conversion back to explicit i/o can use entirely new variables, and since these variables are location-based, there’s no need to retain any* of the gross struct/array typing that GLSL yields.
* except where arrays are indirectly accessed
For those of you who are truly in the know, this means goku in his SSJB form
struct TestStruct {
dmat2x3 a[2];
mat2x3 b[2];
dvec2 c;
};
layout (location = 0, xfb_offset = 0) flat out TestStruct goku;
gets blasted into a series of smaller and more vincible variables:
decl_var shader_out INTERP_MODE_FLAT dvec3 goku#0 (VARYING_SLOT_VAR2.xyz, 2, 0)
decl_var shader_out INTERP_MODE_FLAT dvec3 goku#1 (VARYING_SLOT_VAR4.xyz, 4, 0)
decl_var shader_out INTERP_MODE_FLAT dvec3 goku#2 (VARYING_SLOT_VAR6.xyz, 6, 0)
decl_var shader_out INTERP_MODE_FLAT vec3 goku#3 (VARYING_SLOT_VAR8.xyz, 8, 0)
decl_var shader_out INTERP_MODE_FLAT vec3 goku#4 (VARYING_SLOT_VAR9.xyz, 9, 0)
decl_var shader_out INTERP_MODE_FLAT vec3 goku#5 (VARYING_SLOT_VAR10.xyz, 10, 0)
decl_var shader_out INTERP_MODE_FLAT vec3 goku#6 (VARYING_SLOT_VAR11.xyz, 11, 0)
decl_var shader_out INTERP_MODE_FLAT dvec2 goku#7 (VARYING_SLOT_VAR12.xy, 12, 0)
Beautiful and easy to parse. There’s only one snag: I gotta do this manually.
Long-time fans of the blog will recall some wild ravings in the past where I described a pass I wrote to handle a similar issue. lower_64bit_vars
is that pass, and it both splits variables containing 64bit types into 32bit types and then rewrites all access to them to use those new types.
And now I have to do basically the same thing. Again. But in a different enough way that none of the code is reusable.
The process for doing this variable rewrite is split in three:
But then there’s also the bonus step (everyone loves bonuses!) of scanning all the new variables and comparing them against the original variables to ensure they have the same number of per-location components (i.e., if the original variable consumes all components for a given location, the new one must too) in order to maintain shader interface compatibility, and for all the locations where a mismatch is detected, single-component variables have to be inserted, and they have to have associated access added too so various optimizers don’t delete them again, and it’s obviously one of the first things anyone embarking on this idiotic journey would consider and not a last-second thing that someone would only realize after running a series of esoteric piglit tests and seeing bizarre failures.
Variables. Done.
The next step is where things get really stupid, because this is where things need to happen so that the shader goes back to having all the derefs and explicit variable access it used to have before some idiot went and deleted them.
I called this the add_derefs
pass because I’m a creative type. An auteur.
For this, all the i/o variables need to be iterated through, and for each variable, scan the shader for access, where “access” means the location and component are consumed by the variable. And also its fbfetch-edness matches. Then take this lowered load/store access, krangle in whatever possibly-indirect derefs the variable needs to mimic the lowered operation, and write in a new explicit load/store access.
Except also I forgot to mention that i/o lowering needs to lower interpolation instructions, which are also (currently) in explicit deref format. And these explicit interpolation instructions get converted to lowered ones, and then sometimes a load_deref
becomes load_barycentric_centroid
. And you know (lol) it wouldn’t be a real adventure (lol) if a zink change didn’t uncover (lol) some incredibly obscure and opaque (lol) llvmpipe bug! So then there’s the usual spelunking through there, and whispered backchannel discussions and cursing with Dave, and OF FUCKING COURSE IT’S TGSI AGAIN but we got it done.
Also it’s possible there might be a future where llvmpipe doesn’t use TGSI but don’t quote me (I’ll deny it to my grave) and if anyone asks you didn’t hear it from me.
You’d think by the way I just went off on my usual TGSI rant that I was done exploring this section, but think again because none of us asked what gl_ClipDistance
or gl_CullDistance
thought about any of this.
Well I asked, and they’re not happy.
Clip/cull distance are stupidweird ones because they’re array[8] variables that consume two locations. And that means all the calculations/heuristics for accessing arrays that work for every other array are broken for these.
But it’s fine, because this is zink and the whole thing is just a jenga tower of hacks all the way down anyway.
I’ll be completely and brutally honest with you, this all worked perfectly the first time I ran it.
On NVK, that is, which, as I mentioned in my historic XDC keynote, has been relying on the now-merged NIR 2.0 since last year. Truly a driver living in the future.
Other drivers, however, required considerably more work to make CI explode. Sorry, I meant not explode. Obviously. Totally a joke. The absolute state of CI is 100% not the fault of this lowered i/o conversion.
Anyway, the clear choice once parity was achieved was to then start deleting code.
Remember all that gross NTV code I linked in the previous post? Gone.
More stupid XFB code that’s been jenga-ing around for years? Gone.
Obscure ticket from years ago? Fixed incidentally
src/compiler/nir/nir_passthrough_gs.c | 2 +-
src/gallium/auxiliary/nir/nir_to_tgsi_info.c | 4 +
src/gallium/drivers/zink/nir_to_spirv/nir_to_spirv.c | 412 +------------------------------------
src/gallium/drivers/zink/zink_compiler.c | 1081 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------
src/gallium/drivers/zink/zink_compiler.h | 3 +-
src/gallium/drivers/zink/zink_draw.cpp | 2 +-
src/gallium/drivers/zink/zink_program.c | 8 +-
src/gallium/drivers/zink/zink_types.h | 6 +-
8 files changed, 736 insertions(+), 782 deletions(-)
And as the statistics here show, another bonus ticket was fixed too through the magic of code deletion.
I didn’t even get to mention the great things that happened related to maintenance5 yet. Be sure to read again next week when I inevitably shovel more garbage onto the internet in the form of an unfortunately large blog post riddled with memes that obfuscate the truly interesting parts.
As every one of my big brained readers knows, zink runs on top of vulkan. As you also know, vulkan uses spirv for its shaders. This means, in general, compiler-y stuff in zink tries to stay as close to spirv mechanics as possible.
Let’s look at an example. Here’s a very simple fragment shader from glxgears before it undergoes spirv translation:
shader: MESA_SHADER_FRAGMENT
source_sha1: {0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000}
name: ff-fs
internal: true
stage: 4
next_stage: 0
inputs_read: 1
outputs_written: 4-11
system_values_read: 0x00000000'00100000'00000000
subgroup_size: 1
first_ubo_is_default_ubo: true
separate_shader: true
flrp_lowered: true
inputs: 1
outputs: 8
uniforms: 0
decl_var shader_in INTERP_MODE_NONE vec4 VARYING_SLOT_COL0 (VARYING_SLOT_COL0.xyzw, 0, 0)
decl_var shader_out INTERP_MODE_NONE vec4 gl_FragData[0] (FRAG_RESULT_DATA0.xyzw, 0, 0)
decl_var shader_out INTERP_MODE_NONE vec4 gl_FragData[1] (FRAG_RESULT_DATA1.xyzw, 1, 0)
decl_var shader_out INTERP_MODE_NONE vec4 gl_FragData[2] (FRAG_RESULT_DATA2.xyzw, 2, 0)
decl_var shader_out INTERP_MODE_NONE vec4 gl_FragData[3] (FRAG_RESULT_DATA3.xyzw, 3, 0)
decl_var shader_out INTERP_MODE_NONE vec4 gl_FragData[4] (FRAG_RESULT_DATA4.xyzw, 4, 0)
decl_var shader_out INTERP_MODE_NONE vec4 gl_FragData[5] (FRAG_RESULT_DATA5.xyzw, 5, 0)
decl_var shader_out INTERP_MODE_NONE vec4 gl_FragData[6] (FRAG_RESULT_DATA6.xyzw, 6, 0)
decl_var shader_out INTERP_MODE_NONE vec4 gl_FragData[7] (FRAG_RESULT_DATA7.xyzw, 7, 0)
decl_var push_const INTERP_MODE_NONE struct gfx_pushconst
decl_function main (0 params)
impl main {
block b0: // preds:
32 %0 = deref_var &VARYING_SLOT_COL0 (shader_in vec4)
32x4 %1 = @load_deref (%0) (access=0)
32 %2 = deref_var &gl_FragData[0] (shader_out vec4)
@store_deref (%2, %1) (wrmask=xyzw, access=0)
32 %3 = deref_var &gl_FragData[1] (shader_out vec4)
@store_deref (%3, %1) (wrmask=xyzw, access=0)
32 %4 = deref_var &gl_FragData[2] (shader_out vec4)
@store_deref (%4, %1) (wrmask=xyzw, access=0)
32 %5 = deref_var &gl_FragData[3] (shader_out vec4)
@store_deref (%5, %1) (wrmask=xyzw, access=0)
32 %6 = deref_var &gl_FragData[4] (shader_out vec4)
@store_deref (%6, %1) (wrmask=xyzw, access=0)
32 %7 = deref_var &gl_FragData[5] (shader_out vec4)
@store_deref (%7, %1) (wrmask=xyzw, access=0)
32 %8 = deref_var &gl_FragData[6] (shader_out vec4)
@store_deref (%8, %1) (wrmask=xyzw, access=0)
32 %9 = deref_var &gl_FragData[7] (shader_out vec4)
@store_deref (%9, %1) (wrmask=xyzw, access=0)
// succs: b1
block b1:
}
Notice all the variables and derefs. This is in contrast to what shaders from more hardware-y drivers look like:
shader: MESA_SHADER_FRAGMENT
source_sha1: {0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000}
name: ff-fs
internal: true
stage: 4
next_stage: 0
inputs_read: 1
outputs_written: 2
subgroup_size: 1
first_ubo_is_default_ubo: true
separate_shader: true
flrp_lowered: true
inputs: 1
outputs: 1
uniforms: 0
decl_var shader_in INTERP_MODE_NONE vec4 VARYING_SLOT_COL0 (VARYING_SLOT_COL0.xyzw, 0, 0)
decl_var shader_out INTERP_MODE_NONE vec4 FRAG_RESULT_COLOR (FRAG_RESULT_COLOR.xyzw, 4, 0)
decl_function main (0 params)
impl main {
block b0: // preds:
32 %3 = undefined
32 %0 = deref_var &VARYING_SLOT_COL0 (shader_in vec4)
32x4 %1 = @load_deref (%0) (access=0)
32 %2 = deref_var &FRAG_RESULT_COLOR (shader_out vec4)
32 %4 = load_const (0x00000000)
@store_output (%1, %4 (0x0)) (base=4, wrmask=xyzw, component=0, src_type=float32, io location=FRAG_RESULT_COLOR slots=1, xfb(), xfb2()) // FRAG_RESULT_COLOR
// succs: b1
block b1:
}
The latter form here is called “lowered” i/o: the derefs for explicit variables have been lowered to intrinsics corresponding to the operations being performed. Such excitement, many detail.
With few exceptions, every mesa driver uses lowered i/o. Zink is one of those exceptions, and the reasons are simple:
It’s a tough choice, but if I had to pick one of these as the “main” reason why I haven’t done the move, my response would be yes.
With that said, I’m extremely disgruntled to announce that I have completed the transition to lowered i/o.
Hooray.
The reasoning behind this Sisyphean undertaking which has cost me the past couple weeks along with what shreds of sanity previously remained within this mortal shell:
It’s a tough choice, but if I had to pick one of these as the “main” reason why I have done the move, my response would be yes.
I’ll save the details of this for some deep dive posts to pad out my monthly blog counter. For now, let’s take a look at the overview: how does this affect “shader stuff” in zink?
The short answer, for that one person who is actively eyeballs-deep in zink shader refactoring, is that it shouldn’t have any effect whatsoever. The zink passes that use explicit derefs for i/o are mostly at the end of the compilation chain, and derefs will have been added back in time to avoid needing to touch anything there.
This refactor may be a tough concept to grasp, so I’m providing some flowcharts since it’s been far too long since the blog has seen any. Here is a basic overview of the zink shader compilation process:
It’s a simple process that anyone can understand.
This is the old process side-by-side with the new one for comparison:
Next time: maintenance5 in lavapipe or more compiler talk. You decide. But not really because I’m the one writing the posts.
Summer has kept me busy with holidays, but I have managed to find a bit of time to keep hacking on the driver for the VeriSilicon NPU since the last update.
The issue with placing the output to the right scale is solved now, and simple convolution operations are working just fine.
3D tensors are now supported as inputs, and we support strided convolutions as well, but only on 2D inputs for now.
The test workloads are running fast and stably now, so I now feel I have pretty solid ground beneath my feet.
There are three features left before I can run a real, full-fledged commercially interesting model:
The last update in this blog was left at my attempt at figuring out how the convolution raw outputs had to be processed with fields called post_shift and post_multiplier so I could get the right values in the final output.
After spending more time than I should probably have in a spreadsheet trying to find correlations, some desperate googling brought me to some research papers about optimizing quantization operations on integer-only hardware:
That explains the meaning of the shift and multiplier, as these are the operations we can use to approximate the floating point division on integer hardware.
But to actually understand what the hardware was trying to do with them, it was useful to look at the QNNPACK implementation of requantization.
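To make the relationship concrete, here is a small CPU-side sketch of that approximation in the spirit of those papers and the QNNPACK code: deriving a multiplier and right shift from a floating-point requantization scale, then applying them with rounding. The field widths, rounding mode and helper names are illustrative assumptions, not the NPU's actual register layout.

#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Approximate scale ≈ multiplier / 2^shift, assuming 0 < scale < 1 as is
 * typical for convolution requantization. A 15-bit multiplier is used here
 * purely for illustration. */
static void compute_mult_shift(double scale, uint32_t *multiplier, int *shift)
{
    int exp;
    double m = frexp(scale, &exp);               /* scale = m * 2^exp, m in [0.5, 1) */
    *multiplier = (uint32_t)lround(m * (1u << 15));
    *shift = 15 - exp;                            /* so scale ≈ multiplier / 2^shift */
}

/* Requantize an int32 accumulator: out = zero_point + round(acc * scale),
 * clamped to the uint8 range. */
static uint8_t requantize(int32_t acc, uint32_t multiplier, int shift, int32_t zero_point)
{
    int64_t prod = (int64_t)acc * multiplier;
    int64_t rounding = (int64_t)1 << (shift - 1);
    int32_t val = (int32_t)((prod + rounding) >> shift) + zero_point;
    if (val < 0) val = 0;
    if (val > 255) val = 255;
    return (uint8_t)val;
}

int main(void)
{
    uint32_t mult;
    int shift;
    compute_mult_shift(0.0009765625, &mult, &shift);   /* a made-up conv scale */
    printf("multiplier=%u shift=%d -> %u\n", mult, shift,
           requantize(12345, mult, shift, 119));
    return 0;
}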
This was pretty much straightforward, as it was basically a matter of updating the code to take into account the added dimension, and also reordering the tensor elements into the depth-first order the hardware expects.
This was made much easier by some improvements to the scripts I use to observe the behavior of the closed source stack, by intercepting the communication with the kernel's GPL driver.
For example, this is the output when Mesa has generated a cmd stream that is functionally equivalent to what the blob sends to the kernel:
+ diff -u -U 100 /home/tomeu/mesa.txt /home/tomeu/galcore.txt
--- /home/tomeu/mesa.txt 2023-08-07 18:28:29.939750225 +0200
+++ /home/tomeu/galcore.txt 2023-08-07 18:28:42.116625362 +0200
@@ -1,176 +1,273 @@
{
- 0x0801028a, /* LOAD_STATE (1) Base: 0x00A28 Size: 1 Fixp: 0 */
- 0x00000011, /* PA.SYSTEM_MODE := PROVOKING_VERTEX_LAST=1,HALF_PIXEL_CENTER=1 */
- 0x08010e13, /* LOAD_STATE (1) Base: 0x0384C Size: 1 Fixp: 0 */
- 0x00000002, /* GL.API_MODE := OPENCL */
+ 0x00000000, /* UNKNOWN (0) */
+ 0x00000000, /* */
+ 0x00000000, /* UNKNOWN (0) */
+ 0x00000000, /* */
+ 0x00000000, /* UNKNOWN (0) */
+ 0x00000000, /* */
0x00000000, /* UNKNOWN (0) */
0x00000000, /* */
0x08010e4f, /* LOAD_STATE (1) Base: 0x0393C Size: 1 Fixp: 0 */
0x00000000, /* GL.OCB_REMAP_START := 0x0 */
0x08010e50, /* LOAD_STATE (1) Base: 0x03940 Size: 1 Fixp: 0 */
0x00000000, /* GL.OCB_REMAP_END := 0x0 */
0x08010e4c, /* LOAD_STATE (1) Base: 0x03930 Size: 1 Fixp: 0 */
0x00000010, /* GL.NN_CONFIG := UNK0=0x0,DISABLE_ZDPN=0,DISABLE_SWTILING=0,SMALL_BATCH=1,DDR_BURST_SIZE=0x0,UNK7=0,NN_CORE_COUNT=0x0,UNK12=0 */
0x08010428, /* LOAD_STATE (1) Base: 0x010A0 Size: 1 Fixp: 0 */
- 0xffff3000, /* PS.NN_INST_ADDR := *0xffff3000 */
+ 0x3348e780, /* PS.NN_INST_ADDR := *0x3348e780 */
0x08010429, /* LOAD_STATE (1) Base: 0x010A4 Size: 1 Fixp: 0 */
0x00000000, /* 0x010A4 */
0x08010e03, /* LOAD_STATE (1) Base: 0x0380C Size: 1 Fixp: 0 */
0x00000c23, /* GL.FLUSH_CACHE := DEPTH=1,COLOR=1,TEXTURE=0,PE2D=0,TEXTUREVS=0,SHADER_L1=1,SHADER_L2=0,UNK10=1,UNK11=1,DESCRIPTOR_UNK12=0,DESCRIPTOR_UNK13=0 */
0x08010e03, /* LOAD_STATE (1) Base: 0x0380C Size: 1 Fixp: 0 */
0x00000c23, /* GL.FLUSH_CACHE := DEPTH=1,COLOR=1,TEXTURE=0,PE2D=0,TEXTUREVS=0,SHADER_L1=1,SHADER_L2=0,UNK10=1,UNK11=1,DESCRIPTOR_UNK12=0,DESCRIPTOR_UNK13=0 */
0x00000000, /* UNKNOWN (0) */
0x00000000, /* */
}
map->layer_type = 0x0; /* (0) */
map->no_z_offset = 0x0; /* (0) */
map->kernel_xy_size = 0x2; /* (2) */
map->kernel_z_size = 0x4; /* (4) */
map->kernels_per_core = 0x1; /* (1) */
map->pooling = 0x0; /* (0) */
map->pooling_xy_size = 0x1; /* (1) */
map->prelu = 0x0; /* (0) */
map->nn_layer_flush = 0x1; /* (1) */
map->kernel_data_type = 0x0; /* (0) */
map->in_image_data_type = 0x0; /* (0) */
map->out_image_data_type = 0x0; /* (0) */
map->in_image_x_size = 0x4; /* (4) */
map->in_image_y_size = 0x4; /* (4) */
map->in_image_x_offset = 0x0; /* (0) */
map->in_image_y_offset = 0x0; /* (0) */
map->unused0 = 0x0; /* (0) */
map->brick_mode = 0x0; /* (0) */
map->brick_distance = 0x0; /* (0) */
map->relu = 0x0; /* (0) */
map->unused1 = 0x0; /* (0) */
map->post_multiplier = 0x0; /* (0) */
map->post_shift = 0x17; /* (23) */
map->unused2 = 0x0; /* (0) */
map->no_flush = 0x0; /* (0) */
map->unused3 = 0x0; /* (0) */
map->out_image_x_size = 0x3; /* (3) */
map->out_image_y_size = 0x3; /* (3) */
map->out_image_z_size = 0x1; /* (1) */
map->rounding_mode = 0x1; /* (1) */
map->in_image_x_offset_bit_3 = 0x0; /* (0) */
map->in_image_y_offset_bit_3 = 0x0; /* (0) */
map->out_image_tile_x_size = 0x3; /* (3) */
map->out_image_tile_y_size = 0x3; /* (3) */
-map->kernel_address = 0x3fffd00; /* (67108096) */
+map->kernel_address = 0xcd237f; /* (13443967) */
map->kernel_z_size2 = 0x0; /* (0) */
-map->in_image_address = 0xffff6000; /* (4294926336) */
-map->out_image_address = 0xffff7000; /* (4294930432) */
+map->in_image_address = 0x3348e240; /* (860414528) */
+map->out_image_address = 0x89ffc500; /* (2315240704) */
map->image_caching_mode = 0x0; /* (0) */
map->kernel_caching_mode = 0x1; /* (1) */
map->partial_cache_data_unit = 0x0; /* (0) */
map->kernel_pattern_msb = 0x0; /* (0) */
map->kernel_y_size = 0x2; /* (2) */
map->out_image_y_stride = 0x3; /* (3) */
map->kernel_pattern_low = 0x0; /* (0) */
map->kernel_pattern_high = 0x0; /* (0) */
map->kernel_cache_start_address = 0x800; /* (2048) */
map->kernel_cache_end_address = 0xa00; /* (2560) */
map->image_start_address = 0x0; /* (0) */
map->image_end_address = 0x800; /* (2048) */
map->in_image_border_mode = 0x0; /* (0) */
map->in_image_border_const = 0x7d; /* (125) */
map->unused4 = 0x0; /* (0) */
map->kernel_data_type_bit_2 = 0x0; /* (0) */
map->in_image_data_type_bit_2 = 0x0; /* (0) */
map->out_image_data_type_bit_2 = 0x0; /* (0) */
map->post_multiplier_1_to_6 = 0x1f; /* (31) */
map->post_shift_bit_5_6 = 0x0; /* (0) */
map->unused5 = 0x0; /* (0) */
map->in_image_x_stride = 0x4; /* (4) */
map->in_image_y_stride = 0x4; /* (4) */
map->out_image_x_stride = 0x3; /* (3) */
map->unused6 = 0x0; /* (0) */
map->post_multiplier_7_to_14 = 0x61; /* (97) */
map->out_image_circular_buf_size = 0x0; /* (0) */
map->unused7 = 0x0; /* (0) */
map->per_channel_post_mul = 0x0; /* (0) */
map->out_image_circular_buf_end_addr_plus_1 = 0x3ffffff; /* (67108863) */
map->unused8 = 0x0; /* (0) */
map->in_image_circular_buf_size = 0x0; /* (0) */
map->unused9 = 0x0; /* (0) */
map->in_image_circular_buf_end_addr_plus_1 = 0x3ffffff; /* (67108863) */
map->unused10 = 0x0; /* (0) */
map->coef_zero_point = 0x80; /* (128) */
map->out_zero_point = 0x77; /* (119) */
map->kernel_direct_stream_from_VIP_sram = 0x0; /* (0) */
map->depthwise = 0x0; /* (0) */
map->unused11 = 0x0; /* (0) */
map->unused12 = 0x0; /* (0) */
map->unused13 = 0x0; /* (0) */
map->unused14 = 0x0; /* (0) */
map->unused15 = 0x0; /* (0) */
map->unused16 = 0x0; /* (0) */
map->further1 = 0x0; /* (0) */
map->further2 = 0x0; /* (0) */
map->further3 = 0x3ffffff; /* (67108863) */
map->further4 = 0x7f800000; /* (2139095040) */
map->further5 = 0xff800000; /* (4286578688) */
map->further6 = 0x0; /* (0) */
map->further7 = 0x0; /* (0) */
map->further8 = 0x0; /* (0) */
0x40, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x2c, 0x99, 0x0e, 0x00, 0x00,
0x40, 0xea, 0x2c, 0xeb, 0x80, 0xaf, 0x80, 0x9b, 0x99, 0x80, 0x80, 0x13,
0x80, 0x80, 0x80, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00
0x69, 0xd3, 0x2d, 0x92, 0x07, 0x00, 0x64, 0x00, 0x0c, 0x22, 0x90, 0xd6,
0x53, 0xc9, 0xe2, 0x48, 0xe6, 0x4c, 0xa8, 0xeb, 0xd2, 0xf3, 0xb0, 0xf4,
0x2d, 0xa4, 0x3e, 0xf4, 0x0f, 0x7b, 0x98, 0x01, 0x41, 0x84, 0x92, 0x7e,
0xfa, 0x19, 0xf5, 0xda, 0xb3, 0x5a, 0xb7, 0xf3, 0x97, 0x95, 0x12, 0xe7,
0x51, 0x94, 0xcb, 0x5a, 0x1f, 0xa9, 0xc6, 0xc4, 0x1c, 0xa9, 0x92, 0x1f,
0xf7, 0x64, 0xc3, 0xca
0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77
This corresponds to a convolution with the following parameters:
The differences are due to different addresses being allocated between runs, plus some differences due to how Mesa's code is structured, but these shouldn't affect the end result.
At the top we have the payload of the submit IOCTL, followed by a struct with the configuration for the NN units themselves and then the buffers for the weights, input and output.
When running a convolution configuration that isn't yet supported, we will spot more differences and hopefully will be able to figure out the logic behind them.
The hardware doesn't really support strided convolutions, so these are "lowered" to 1-stride convolutions with added channels, as per this research paper:
By implementing the algorithm in the paper, we match the behavior of the blob, as with requantization. The paper refers only to 2D input tensors, so I will need to check how the blob behaves with 3D inputs and figure out the logic behind it.
For now I have chosen to do the tensor manipulation on the CPU, but later on we will be able to use the TP units in the HW for this, reducing latency.
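For the curious, the CPU-side rearrangement is in the spirit of the space-to-depth transform sketched below, which folds each SxS block of pixels into extra channels so that a stride-S convolution can be expressed with stride 1. This is a minimal sketch of the general idea only; the exact element ordering and kernel rearrangement the paper and the blob use are not something I'm reproducing here.

#include <stdint.h>
#include <stdio.h>

/* Space-to-depth on an HWC tensor: output shape is (h/s, w/s, c*s*s),
 * still in HWC layout. Element ordering here is an illustrative choice. */
static void space_to_depth_hwc(const uint8_t *in, uint8_t *out,
                               int h, int w, int c, int s)
{
    for (int y = 0; y < h / s; y++)
        for (int x = 0; x < w / s; x++)
            for (int dy = 0; dy < s; dy++)
                for (int dx = 0; dx < s; dx++)
                    for (int ch = 0; ch < c; ch++)
                        out[((y * (w / s) + x) * s * s + dy * s + dx) * c + ch] =
                            in[((y * s + dy) * w + (x * s + dx)) * c + ch];
}

int main(void)
{
    /* A 4x4x1 image with stride 2 becomes a 2x2x4 image. */
    uint8_t in[4 * 4] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 };
    uint8_t out[4 * 4];

    space_to_depth_hwc(in, out, 4, 4, 1, 2);
    for (int i = 0; i < 16; i++)
        printf("%d ", out[i]);
    printf("\n");
    return 0;
}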
With so many different convolution parameters supported, I felt the need for a comfortable way of keeping regressions in check.
I wrote a simple pytest module that will generate a TFLite model with a single convolution operation, and the parameters and payloads will be changed according to the different parameters that we support.
At some point I will add a CI job, probably before sending the initial merge request.
The initial NVK (nouveau vulkan) experimental driver has been merged into mesa master[1], and although there's lots of work to be done before it's application ready, the main reason it was merged was because the initial kernel work needed was merged into drm-misc-next[2] and will then go to drm-next for the 6.6 merge window. (This work is separate from the GSP firmware enablement required for reclocking, that is a parallel development, needed to make nvk useable). Faith at Collabora will have a blog post about the Mesa side, this is more about the kernel journey.
The nouveau kernel API was written 10 years or more ago, and was designed around OpenGL at the time. There were two major restrictions in the current uAPI that made it unsuitable for Vulkan.
When we kicked off the nvk idea I made a first pass at implementing a new user API, to allow the above features. I took a look at how the GPU VMA management was done in current drivers and realized that there was scope for a common component to manage the GPU VA space. I did a hacky implementation of some common code and a nouveau implementation. Luckily at the time, Danilo Krummrich had joined my team at Red Hat and needed more kernel development experience in GPU drivers. I handed my sketchy implementation to Danilo and let him run with it. He spent a lot of time learning and writing copious code. His GPU VA manager code was merged into drm-misc-next last week and his nouveau code landed today.
The idea behind the GPU VA manager is that there is no need for every driver to implement something that should essentially not be a hardware specific problem. The manager is designed to track VA allocations from userspace, and keep track of what GEM objects they are currently bound to. The implementation went through a few twists and turns and experiments.
For a long period we considered using maple tree as the core of it, but we hit a number of messy interactions between the dma-fence locking and memory allocations required to add new nodes to the maple tree. The dma-fence critical section is a hard requirement to make others deal with. In the end Danilo used an rbtree to track things. We will revisit if we can deal with maple tree again in the future.
We had a long discussion, and a couple of "implement it both ways and see" experiments, on whether we needed to track empty sparse VMA ranges in the manager or not. Nouveau wanted these, but generically we weren't sure they were helpful, and they also affected the uAPI as it needed explicit operations to create/drop them. In the end we started tracking these in the driver and left the core VA manager cleaner.
Now the code is in tree we will start to push future drivers to use it instead of spinning their own.
Now that the VAs are being tracked, the nouveau API needed two new entrypoints. Since BO allocation will no longer create a VM, a new API is needed to bind BO allocations with VM addresses. This is called the VM_BIND API. It has two variants
My input was the sketchy sketch at the start, and doing the userspace changes to the nvk codebase to allow testing.
The biggest shoutout to Danilo, who took a sketchy sketch of what things should look like, created a real implementation, did all the experimental ideas I threw at him, and threw them and others back at me, negotiated with other drivers to use the common code, and built a great foundational piece of drm kernel infrastructure.
Faith at Collabora who has done the bulk of the work on nvk did a code review at the end and pointed out some missing pieces of the API and the optimisations it enables.
Karol at Red Hat on the main nvk driver and Ben at Red Hat for nouveau advice on how things worked, while he smashed away at the GSP rock.
(and anyone else who has contributed to nvk, nouveau and even NVIDIA for some bits :-)
[1] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24326
As everyone knows, Zink is a fast-moving target. Sometimes it moves so fast that even I don’t fully grasp the weight of some changes as they fly past.
I’m sure you all remember this monumental slide from XDC last year:
Truly a masterpiece that’s impossible to improve upon; don’t @ me.
Time has passed. Almost a year, some would say. Is that slide still accurate?
Anyone who knows anything about journalism knows that the answer to all rhetorical questions on the internet is always the same.
A couple weeks ago, Collabora’s Igor Torrente put up an MR that slid under the radar for most people. Not me, of course, because as a responsible maintainer I carefully review every character in every line of code in every file for every patch in every MR tagged with my driver(s).
Because the great Adam Jackson and Daniel Stone also got D E E P into this one. By which I mean they commented.
And approved.
It’s the equivalent of clickbait on a blog. But why—
Yes, yes, it’s been a while, some number of weeks or mesa-years since I last blogged. Lots of important things have happened in that time. I’ve generated enough blog content for an entire month of posts, in fact. Maybe I’ll manage to maintain enough motivation to write about them.
Let’s kick off the return by looking at some progress updates.
It’s all very exciting, and there’s definitely gonna be lots of posts once I remember what happened when I was melting into a puddle during the heatwave.
In the meanwhile, I want to talk about something else. Something a lot of people ask me about.
I want to talk about working for Valve.
I work in Open Source, so obviously I can only comment on that, and I only work on certain projects, so obviously I can only comment on my experience working on them, and I’m only one non-hivemind mortal being, so obviously I can only comment on what I’ve personally experienced, but I’m nearly through my third year here and it feels like a good time for a post of this sort. You know, because three.
So what’s it really like here?
In a word, working here is great.
Imagine you’ve got three or twenty projects you enjoy working on. Now imagine your task is to make them better. Better how, exactly? However you want. Which one do you work on? Whichever one you want. How many hours per week do you work? However many you want. Who checks your progress? You do. How do biannual performance evaluations happen? They don’t. And on top of all that you get paid.
It sounds too good to be true, doesn’t it? Surely working here can’t be such a complete state of anarchy, and indeed it isn’t. In my experience, the Open Source team here is like a big jigsaw puzzle: there’s a ton of different pieces, each of them with its place, each of them making the others better.
Let me explain.
There’s a lot of people working here, all of them smarter than me (but none of them blogging more than me). Most of them have been here longer than me too. Every one of them has fallen into their niche, the place they like to tinker where they can excel. Here’s a few just out of the old-timers:
Everyone has distinct roles that they play on the team. Project areas they specialize in, as per my “anything goes” claim above. Some people work on lots of things, some don’t, but the niches are filled. Everyone’s got their “spot”.
Put another way, everyone on the team is a piece of the puzzle, the puzzle being “make Linux gaming great”. Everyone fits into their spot, and by doing so things get better.
There’s another way of looking at things though. While everyone here can be a puzzle piece, everyone has their own puzzles too. I work on zink, but that doesn’t mean “Mike is the one working on zink”. What it really means is that I’m able to work on zink because the puzzle pieces have been assembled such that I’m able to work on zink. It’s like how you wouldn’t try benching four plates at the gym without having a spot (I would, but I’m huge).
Sometimes getting a spot is a production. You know, the kind of thing that makes headlines where Joshie throws me the rock, but it’s too hot so I fire off a pass to Georg, and he slows down the tempo while we wait for hardware vendors to understand how their thinking sand interacts with complex ancient graphics specifications, but then we get in the zone, and Georg throws me an alleyoop, and then Joshie takes it back to fix the original problem and now games work a little better.
Sometimes it’s all rockstars all day long.
But it’s the other times that make working here really great. The times when you’re struggling to grind out that last rep because you told your buddy you were definitely gonna hit 10x315 on front squat this week and you don’t wanna go back and admit you had too much preworkout.
I’m talking about times like when Timur picked up my massive CPU optimization series and wrangled it into a bunch of MRs because I was foolishly stretching myself too thin across too many projects.
I’m talking about the unsung heroes who make working here truly great.
Everyone knows Rhys. Everyone here, anyway. Outside the team it might be a different story; he has no blog, and searching for his many and varied accomplishments in the depths of the internet yields only one article written before I was born.
IYKYK, as they say. Just in the past week he’s quietly fixed Rage 2 and WWZ. A glance through his extensive patch history is a litany of complex optimizations and tweaks which aren’t flashy enough to be newsworthy on their own but add up to significant gains through consistent improvements.
But it’s still not any of these that (to me, at least) make Rhys one of the unsung heroes of the team. The glue that holds parts of it together.
All my very high IQ readers know what it’s like to get stuck on something. That feeling when you come across a problem, and you know it’s a problem, and you can handwave some half-functional solution that lets you limp across the finish line to collapse in a broken, battered heap with 0 regressions
as the output of your refactoring branch’s latest CTS run, but you can’t quite figure out the “right” way to fix the problem. The way that won’t get your patches NAKed harder than proposing the addition of registers to NIR right now.
At times like these, who’s there to help you out? Who is it that gives the bar that tiny, it-was-all-you-bro-I-didn’t-even-touch-it nudge to help you finish that last rep?
It’s Rhys Perry. It’s always the Rhyses, the unsung heroes. The ones who answer complex questions on IRC at 2am because they’re watching an historic cricket match and happened to glance over and see you flailing away at your keyboard. The ones who step in and say “sure, I’ll review this absolute disaster you fingerpainted into gitlab using the webui without regard to formatting, or coding conventions, or even the right programming language, and we’ll get through this together with a fix that’ll make everyone happy” when you’re staring down Yog-Sothoth in the abyss of the compiler stack at the end of a week and have exactly one functioning brain cell remaining that tells you only SHOWER. NOW. IT’S BEEN DAYS.
And it’s the mixing of all these people, rockstars and not, unsung heroes and not, working on so many projects, enabling each other and making us all better at what we do that makes working at Valve great.
To me, at least.
Tune in next time when I’ll be MS Painting some XFB memes and raging about Big Triangle since that’s apparently my niche.
EOSS in Prague was great, lots of hallway track, good talks, good food, excellent tea at meetea - first time I had proper tea in my life, quite an experience. It was also my first talk since covid, packed room with a standing audience, apparently one of the top ten most attended talks per LF’s conference report.
The video recording is now uploaded, I’ve uploaded the fixed slides, including the missing slide that I accidentally cut in a last-minute edit. It’s the same content as my blog posts from last year, first talking about locking engineering principles and then the hierarchy of locking engineering patterns.
Hi all!
As usual, this month has been rich in Wayland-related activities. Rose has continued building and upstreaming better frame scheduling infrastructure for wlroots, you can read more on her blog. I’ve resurrected an old patch to make wlroots behave better when the GPU is under high load. In my testing this improves latency a lot in some specific scenarios and on some specific hardware, but doesn’t help on some others. It’s not super clear if anything can be done about this, it may be that we are hitting some hardware limitations here: GPUs don’t know how to preempt tasks very well.
I’ve also started working on explicit synchronization again. This was
previously blocked on a hard problem: drivers may want to use a new kind of
synchronization fence primitive (user-space memory fences) and it wasn’t clear
how the current primitives (drm_syncobj
) would hold up. We’ve been talking
about this new primitive for a few years but unfortunately it’s a complicated
matter and nothing new has surfaced. However, after discussing with Daniel
Vetter, we’ve come to the conclusion that the kernel will provide backwards
compatibility for drm_syncobj
, so we can just stop worrying and use that as
the basis for explicit synchronization protocols and implementations. Moreover,
NVIDIA engineers are interested in helping with this effort, so I hope we can
keep the momentum and join forces to push the new protocol, APIs and
implementations to the finish line.
There is a lot to be done to plumb explicit synchronization. This month I’ve
respinned a new kernel uAPI patch to allow compositors to
wait on a drm_syncobj
without blocking. This also involved writing a test
suite in IGT and a wlroots patch to use the new uAPI. Everything is now
reviewed, I hope to merge this soon. Apart from this, we also need a
new Wayland protocol, a new Vulkan
extension for drm_syncobj
import/export, more implementations of the
protocol, ideally yet another new kernel uAPI to improve
interoperability with sync_file
, and even a new X11 protocol so that legacy
X11 clients (read: games) can take advantage of this whole thing. Oh my… As
French people say, there is some bread on the table.
In other Wayland news, we’ve started having some more-or-less weekly meetings for wayland-protocols standardization. We’ve been talking about upstreaming some of the stuff currently in a private GTK protocol, IMEs, and layer-shell. It’s been great to be able to discuss face-to-face about blockers for these protocols. The meeting notes are available on the wiki. We’ve done a lot of talking and gesturing, but also some actual work: security-context has finally (!) been merged, and I’ve updated the ext-layer-shell patch.
Apart from the explicit synchronization work, I’ve sent a few other kernel patches. Numerous patches to improve the kernel uAPI documentation, and a few patches to add more information to the hotplug events sent by bridge/i915/nouveau so that compositors don’t need to reload the whole KMS state on each hotplug event (instead, they can now only reload the KMS state of the one specific connector which got hotplugged). I’ve reviewed a few patches as well. Thomas Zimmermann has made it so all DRM drivers now support DMA-BUFs (required for wlroots to run), so now wlroots works on e.g. gma500. AMD engineers have sent patches to support more than 64 DRM devices, there are some subtle uAPI stability issues at play I’ve tried to provide feedback on.
Let's wrap up this status update with a collection of various smaller happenings. I've removed dlsym()-related magic used in the Wayland test suite which caused sporadic failures on FreeBSD. I've been gradually improving the API for go-imap v2 and fixing a few bugs. hut now supports pagination on all commands thanks to tireless work by Thorben Günther. kanshi now supports configuring adaptive sync (VRR). I've improved the API of go-oauth2 a bit. Last but not least, I've reworked an old patch to make it easier to parse scfg files from Go programs, by defining a Go struct instead of hand-rolling parsing code.
See you next month!
I recently came across tinygrad, a small but powerful neural network framework that has an OpenCL backend target and can run the LLaMA model.
I've been looking out for rusticl workloads, and this seemed like a good one: I could jump on the AI train and run an LLM in my house!
I started it going on my Radeon 6700XT with the latest rusticl, using radeonsi with the LLVM backend, and I could slowly interrogate a model with a question and it would respond. I've no idea how it performs vs ROCm yet, which seems to be where tinygrad is more directed, but I may get to that next week.
While I was there, though, I decided to give the Mesa ACO compiler backend a go. It's been tied into radeonsi recently, and I'd done some hacks before to get compute kernels running, so I reproduced those hacks on the modern code and gave it a run.
tinygrad comes with a benchmark script called benchmark_train_efficientnet, so I started playing with it to see what low-hanging fruit I could find in an LLVM vs ACO shootout.
The benchmark does 10 runs: the first is where lots of compilation happens, and by the last the caches are well primed. Here are the figures from the first and last runs with a release build of LLVM and Mesa (and the ACO hacks).
LLVM:
215.78 ms cpy, 12245.04 ms run, 120.33 ms build, 12019.45 ms realize, 105.26 ms CL, -0.12 loss, 421 tensors, 0.04 GB used, 0.94 GFLOPS
10.25 ms cpy, 221.02 ms run, 83.50 ms build, 36.25 ms realize, 101.27 ms CL, -0.01 loss, 421 tensors, 0.04 GB used, 52.11 GFLOPS
ACO:
71.10 ms cpy, 3443.04 ms run, 112.58 ms build, 3214.13 ms realize, 116.34 ms CL, -0.04 loss, 421 tensors, 0.04 GB used, 3.35 GFLOPS
10.36 ms cpy, 234.90 ms run, 84.84 ms build, 36.51 ms realize, 113.54 ms CL, 0.05 loss, 421 tensors, 0.04 GB used, 49.03 GFLOPS
So on the compilation-heavy first run ACO is about 4 times faster overall, mostly thanks to quicker compiles, but it produces slightly less optimised binaries: 49.03 vs 52.11 GFLOPS once the caches are primed.
The benchmark produces 148 shaders.
So ACO doesn't quite get the optimal shaders for a bunch of paths, even with some local hackery I've done to make it do better.[1]
I'll maybe investigate ROCm next week; I've got a bit of a cold/flu, and large GPU stacks usually make me want to wipe the machine after I test them :-P
[1] https://gitlab.freedesktop.org/airlied/mesa/-/commits/radeonsi-rusticl-aco-wip
I'm suffering from having a mortal form again, but things are moving in the general direction of progress.
Or "Rose, it's 2 in the morning!" Yeah yeah, whatever, you're not my mum.
Some would call this whining - skip this section if you're here for technology :)
You're not supposed to make yourself work when you don't have energy to because you'll feel bad. People have tried telling me this and I've tried listening but to really take it on board I had to figure out what low energy actually feels like, so here we are, skipping a week of status reporting and holding a suspiciously high Factorio play time. I spent some of that play time making a cool blue circuit factory! Downtime is a good idea, hopefully - we'll find out next week whether it worked.
It's surprising that one of the hardest problems given to me by the Fates has been fighting against myself, which sounds overly dramatic but in a literal sense is true. I would be moving faster if I felt up to it, but I don't feel up to it because I moved too fast recently. It's my fault because I wore myself out, but it's not my fault to rest when I need to, so instinctively I remain undecided on whether it's my fault. Sadly this isn't a balance that I've learned to strike, at least not for large scale work that I care about.
Add this to a general guilt for doing less than others seem to be doing (a velocity- rather than the famous competence-based impostor syndrome) and the work that was once appealing becomes more distant. LoC metrics are a favourite of crap managers, quick glancers, and the part of my subconscious that judges my self worth. It's not ideal and it's even not-idealer when your work is mostly thinking and not actually that much coding - see the previous report for a bunch of musings about what code should be written and not much written code. It's valid work! But the goblin in my skull disagrees. The mortal form disappoints me. I was hoping to discover my inner cold programming machine but I just found some boring human imperfections. Yawn!
This isn't what I was expecting to write about but I think it's helping. I'm sure these aren't unique experiences but they worry me nonetheless, which is partially because I'm hardwired to be worrying about something most of the time.
In a couple of days it will all be OK because I'll be able to play Counter-Strike again and that will for sure make my productivity go up, or down. The paradox of relaxing!
As predicted, I have to face prediction. Before I do that, I want to get a feel for the behaviour of compositors' performance so I'm not mathsing in the dark, and my weapon of choice is Linux's tracing system which either is called ftrace or has a component called ftrace. I can't tell which.
We've met Linux's tracing before. The screenshots from GPUVis were made of data extracted from it, which makes it an attractive answer to the question "where do I put all my data". In theory, if wlroots gains the ability to output events to this system, GPUVis will automatically be able to display these events as it does all the others.
The mechanism for userspace to emit events in this way landed in Linux 6.4, which was unleashed about 12 hours before I realised that my laptop's 6.3-series kernel didn't support it; I nearly gave up. Until 6.4, the feature was gated behind CONFIG_BROKEN and looked truly like a lost cause. Thankfully Simon noticed that 6.4 held the answer to my problems, and I found things to do while I waited for it to hit my distribution. Thrilling! We're back on track.
To hide the horrors of a bare UAPI from wlroots, I wrote and published libuserevents, which is my first C library and will make interacting with user_events amazing and great and you should definitely use it. There are whispers of integration into wlroots so far. I hope eventually I'll have a nice tool that can monitor a running compositor and show a graph of the frame times because that will at least be something pretty to look at to get away from thinking.
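For the curious, here's a rough sketch of the kind of raw user_events dance that a wrapper library can hide. The struct fields, ioctl name and tracefs path follow the Linux 6.4 user_events documentation; the event name and payload are made up for this example, and libuserevents' actual API may look nothing like this.

```c
/* Sketch of the raw Linux user_events UAPI (as documented for 6.4):
 * register an event, then emit it with writev() when tracing is enabled.
 * The "wlr_frame" event and its field are hypothetical. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/uio.h>
#include <unistd.h>
#include <linux/user_events.h>

int main(void)
{
	int enabled = 0;
	struct user_reg reg = {0};

	/* Describe the event: name plus a typed field list. The kernel will
	 * set bit 31 of 'enabled' whenever someone starts tracing it. */
	reg.size = sizeof(reg);
	reg.enable_bit = 31;
	reg.enable_size = sizeof(enabled);
	reg.enable_addr = (__u64)(uintptr_t)&enabled;
	reg.name_args = (__u64)(uintptr_t)"wlr_frame u32 duration_ns";

	int fd = open("/sys/kernel/tracing/user_events_data", O_RDWR);
	if (fd < 0 || ioctl(fd, DIAG_IOCSREG, &reg) < 0) {
		perror("user_events registration");
		return 1;
	}

	/* Emitting is nearly free when nobody is listening: just a flag check. */
	if (enabled) {
		__u32 duration_ns = 16667000;
		struct iovec io[2] = {
			{ &reg.write_index, sizeof(reg.write_index) },
			{ &duration_ns, sizeof(duration_ns) },
		};
		writev(fd, io, 2);
	}

	close(fd);
	return 0;
}
```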
In the background there's a scene timer wriggling its way through review and the dreaded How To Schedule Frame Signals is looming over us all. I forgot to submit the Vulkan timer in all the ruckus. Oh well, apparently no one's supposed to be using the Vulkan backend yet anyway so I doubt there's anyone holding their breath.
I've also just noticed that the second status report has links to git branches instead of commits, so they're likely very stale by now. Remind past me to not do that, that moron.
Who knows what the future holds? Join us next time to find out.