I’ve been traveling a bit lately, but Mesa has reached an important landmark that I wanted to broadcast to the three users out there who have been waiting years for us to reach this milestone:
Mesa now* supports all the GL OVR extensions.
You read that correctly.
Nearly a decade after the extensions were drafted and half that since everyone in VR-land moved to Vulkan, Mesa will now support all three of the people who still write VR applications in GL.
Big thanks to Marek for doing the GLSL heavy lifting and whoever eventually rubber stamps the followup which adds the last extension which is definitely being used.
* Only zink users are cool enough to withstand the awesome power of these extensions.
It was some time ago that I created my first MR touching WSI stuff.
That was also the first time I broke Mesa.
Did I learn anything?
The answer is no, but then again it would have to be given the topic of this sleep-deprived post.
WSI has a lot of issues, but most of them stem from its mere existence. If people stopped wanting to see the triangles, we would all have easier lives and performance would go through the fucking roof. That’s ignoring the raw sweat and verbiage dedicated to insane ideas like determining the precise time at which the triangles should be made visible on a display or literally how are colors.
I’m nowhere near as smart as the people arguing about these things: I’m the guy who plays jenga with the tower constructed from popsicle sticks, marshmallow fluff, and wishful thinking. That’s why, a while ago, I declared war on DRI interfaces and then also definitely won that war without any issues. In fact, it works perfectly.
But why did I embark upon this journey which required absolutely no fixups?
The answer lies in architecture. In the before-times, DRI (a massively overloaded acronym that no longer means anything) allowed Xorg to plug directly into Mesa to utilize hardware-accelerated rendering. It sidestepped the GL API in favor of a contract with Mesa that certain API would never change. And that was great for Xorg since it provided an optimal path to do xserver stuff. But it was (eventually) terrible for Mesa.
When an API contract is made, it remains binding forever. A case when the contract is broken is called a Bug Report. Mesa has no bugs, however, except for the ones I didn’t cause, and so this DRI contract that enables Xorg to shortcut more sensible APIs like EGL remains identical to this day, decades later. What is not identical, however, is Mesa.
In those intervening years, Mesa has developed into an entire ecosystem for driver development and other, less sane ideas. Gallium was created and then became the only method for implementing GL drivers. EGL and GBM are things now. But still, that DRI contract remains binding. Xorg must work. Like that one reviewer who will suggest changes for every minuscule flaw in your stupid, idiotic, uneducated, cretinous abuse of whitespace, it is not going away.
DRIL was the method by which Mesa could finally unshackle itself. The only parts of DRI still used by Xorg are for determining rendertarget capabilities, effectively eglGetConfigs. So @ajax and I punted out the relevant API into a stub which is mostly just a wrapper around eglGetConfigs. This enabled change and cleanup in every part of the codebase that was previously immutable.
As anyone who has tried to debug Mesa’s DRI frontend knows, it sucks. It’s one of the worst pieces of code to debug. A significant reason for this is (was) how the DRI callback system perpetuated circular architecture.
At the time of DRIL’s merge, a user of GLX/EGL/GBM would engage with this sort of control flow:
gallium/frontends/dri
In terms of functionality, it was functional. But debugging at a glance was impossible, and trying to eyeball any execution path required the type of PhD held by fewer than five people globally. The cyclical back-and-forth function pointering was a vertical cliff of a learning curve for anyone who didn’t already know how things worked, and even things as “simple” as eglInitialize went through several impenetrable cycles of idiot-looping to determine success or failure. The absolute state of it made adding new features a nightmarish and daunting prospect, and reviewing any changes had, at best, even odds of breaking things because of how difficult it is to test this stuff.
Maybe.
The juiciest refactoring is over, and now function pointering only occurs when the DRI frontend needs to access API-specific data for its drawables. It’s actually possible to follow execution just by reading the code. Not that it’s necessarily easy, but it’s possible.
There’s still a lot of work to be done here. There’s still some corner case bugs with DRIL, there’s probably EGL issues that have yet to be discovered because much of that code is still fairly opaque, and half the codebase is still prefixed with dri2_.
At the least, I think it’s now possible to work on WSI in Mesa and have some idea what’s going on. Or maybe I’ve just been down in the abyss for so long that I’m the one staring back.
I’ve been cooking. I mean like really cooking. Expect big things related to the number 3 later this month.
* UPDATE: At the urging of my legal team, I’ve been advised to mention that no part of this post, blog, or site has any association with, bearing on, or endorsement from Half Life 3.
The topic of a Direct Rendering Manager (DRM) cgroup controller is something which has been proposed a few times in the past, but so far is still missing from the Linux graphics stack. Some of those attempts were focusing on controlling the GPU memory usage aspect, while some were concerned with scheduling. As I am continuing to explore this area as part of my work at Igalia, in this post we will discuss one possible way of implementing the latter.
The general problem we are trying to address is the fact that many GPUs (and their respective kernel drivers) can simultaneously schedule workloads from different clients, and that there are use-cases where having external control over scheduling decisions would be beneficial.
But first, let us clarify what we mean by “external control”. By that term we refer to the scheduling decisions being influenced from outside the actual process doing the rendering. If we were to draw a parallel to CPU scheduling, that would be the difference between a process (or a thread) issuing a system call such as setpriority(2) or nice(2) itself (“internal control”), versus its scheduling priority being modified by an external entity such as the user issuing the renice(1) shell command, launching the executable via the nice(1) shell command, or even using the CPU scheduling cgroup controller (“external control”).
External control has two benefits. Firstly, it is the user who typically knows which tasks are higher priority and which should run in the background, and should therefore be prevented as much as possible from starving the foreground tasks of resources. Secondly, external control can be applied to any process in a unified manner, without the need for applications to individually expose the means to control their scheduling priority.
If we now return to the world of GPU scheduling, we find ourselves in a landscape where internal scheduling control is possible with many GPU drivers, but external control is not. Improving on that involves some technical and conceptual challenges, because GPUs are not as nice and uniform in their scheduling needs and capabilities as CPUs are. But if we were able to come up with something reasonable, even if not perfect, it could bring improvements to the user experience in a variety of scenarios.
The earliest attempt I can remember was from 2018, by Matt Roper[1], who proposed to implement a driver-specific priority based controller. The RFC limited itself to i915 (kernel driver for Intel GPUs) and, although the priority-based setup is well established in the world of CPU scheduling, and it is easy to understand its effects, the proposal did not gain much traction.
Because of the aforementioned advantages, when I proposed my version of the controller in 2022[2], it also included a slightly different version of a priority-based controller. In contrast to the earlier one, this proposal was in principle driver-agnostic and the priority levels were also abstracted.
The proposal was also accompanied by benchmark results showing that the approach was effective in allowing users on Linux to launch GPU tasks in the background, while leaving more GPU bandwidth to the foreground task than when not using the controller. Similarly on ChromeOS, when wired into the focused versus un-focused window cgroup management, it was able to demonstrate relatively more GPU time given to the foreground window.
Anticipating the potential lack of sufficient support for this approach the same RFC also included a second controller which takes a different route. It abstracts things one step further and implements a weight based controller based on GPU utilisation[3].
The basic idea is that the GPU time budget is split based on relative group weights across the cgroup hierarchy, and that the controller notifies the individual DRM drivers when their clients are over budget. From there it is left for the individual drivers to know how to best manage this situation, depending on the specific scheduling capabilities of the driver and the GPU hardware.
The user interface completely mimics the existing CPU and IO cgroup controllers with the single drm.weight control file. The weights carry no absolute meaning and are only relative within a single group of siblings. Their only purpose is to split out the time budget between them.
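To make the single-file interface concrete, here is a minimal sketch of how a user-space tool might set and read back a weight. Only the drm.weight file name comes from the proposal; the helper names and the idea of passing the cgroup directory in are assumptions for illustration, and the file only exists on a kernel carrying the proposed controller.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Write a relative weight into the `drm.weight` control file of one cgroup.
/// `cgroup_dir` is the directory of that cgroup; where it lives on a real
/// system is left to the caller (hypothetical helper, not part of the RFC).
pub fn set_drm_weight(cgroup_dir: &Path, weight: u32) -> io::Result<()> {
    fs::write(cgroup_dir.join("drm.weight"), format!("{weight}\n"))
}

/// Read the weight back and parse it as an unsigned integer.
pub fn get_drm_weight(cgroup_dir: &Path) -> io::Result<u32> {
    let text = fs::read_to_string(cgroup_dir.join("drm.weight"))?;
    text.trim()
        .parse()
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}
```

A window manager could call set_drm_weight on the focused application's group with a larger value than on the background groups; since the weights are only relative among siblings, the absolute numbers do not matter.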
Visually one potential cgroup configuration could look like this:
The DRM cgroup controller then executes a periodic scanning task which queries each DRM client for its GPU usage and notifies drivers when clients are over their allocated budget.
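A rough sketch of one pass of such a scanning task is shown here. The Client struct, the Driver trait and the Recorder test double are all hypothetical names invented for this illustration; the real controller lives in the kernel and its notification mechanism is not described in this post.

```rust
/// One DRM client as seen by this sketch of the controller.
pub struct Client {
    pub name: &'static str,
    /// GPU time used during the last scan period, in microseconds.
    pub used_us: u64,
    /// Share of the period this client may use, as a fraction in 0.0..=1.0.
    pub budget: f64,
}

/// Hypothetical driver hook standing in for the over-budget notification.
pub trait Driver {
    fn over_budget(&mut self, client: &str, used_us: u64, allowed_us: u64);
}

/// One pass of the periodic scan: compare usage against budget and notify.
pub fn scan_once(clients: &[Client], period_us: u64, driver: &mut dyn Driver) {
    for c in clients {
        let allowed_us = (c.budget * period_us as f64) as u64;
        if c.used_us > allowed_us {
            driver.over_budget(c.name, c.used_us, allowed_us);
        }
    }
}

/// Test double that records notifications instead of throttling anything.
#[derive(Default)]
pub struct Recorder {
    pub notified: Vec<String>,
}

impl Driver for Recorder {
    fn over_budget(&mut self, client: &str, _used_us: u64, _allowed_us: u64) {
        self.notified.push(client.to_string());
    }
}
```

What the driver does inside over_budget is deliberately left open, which mirrors the design choice of leaving the throttling policy to each driver.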
If we expand the concept with runtime adjustment of group weights based on window focus status, with two graphically active clients such as a game and a web browser, we can end up with the following two scenarios:
Here we show the actual GPU utilisation of each group together with their drm.weight. On the left hand side the web browser is the focused window, with the weights 100-to-10 in its favour.
The compositor is not using its full 200 / (200 + 100), so a portion is passed on to the desktop group to the extent of the full 80% required. Inside the desktop group the game is currently using 70%, while its actual allocation is 80% * (10 / (100 + 10)) = 7.27%. Therefore it is currently consuming more than its budget, and the corresponding DRM driver will be notified by the controller and will be able to do something about it.
After the user has given focus to the game window, relative weights will be adjusted and so will the budgets. Now the web browser will be over budget and therefore it can be throttled down, limiting the effect of its background activity on the foreground game window.
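The budget arithmetic from the focus example can be written down as a small function. This is only a sketch of the split among siblings; the function name is made up here, and it ignores how unused budget is redistributed between groups.

```rust
/// Budget of one group: the parent's share split by relative sibling weights.
/// `sibling_weights` must include the group's own weight.
pub fn budget(parent_share: f64, weight: u32, sibling_weights: &[u32]) -> f64 {
    let total: u32 = sibling_weights.iter().sum();
    if total == 0 {
        return 0.0;
    }
    parent_share * f64::from(weight) / f64::from(total)
}

/// A client is over budget when its measured usage exceeds its share.
pub fn is_over_budget(usage: f64, budget: f64) -> bool {
    usage > budget
}
```

With the browser focused, budget(0.80, 10, &[100, 10]) gives the game roughly 0.0727 of GPU time, so a game using 0.70 is over budget. Swapping the weights after the focus change gives the game roughly 0.727 instead, and the browser becomes the one holding the small share.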
Back when I started developing this idea Intel GPUs were my main focus, which is why i915 was the first driver I wired up with the controller.
There I implemented a rather simple approach of dynamically adjusting the scheduling priority of the throttled contexts, by an amount proportional to how far the client is over budget in relative terms.
The implementation would also cross-check against the physical engine utilisation, since in i915 we have easy access to that metric, and only throttle if the latter is close to being fully utilised. (Why this makes sense could be an interesting digression relating to the fact that a single cgroup can in theory contain multiple GPUs and multiple clients using a mix of those GPUs. But let’s leave that for later.)
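As an illustration of the idea, and not the actual i915 code, the proportional adjustment with the engine utilisation cross-check might look something like this. The constants and the function name are assumptions picked for the sketch.

```rust
/// Largest priority penalty the sketch will apply; an illustrative value only.
const MAX_PENALTY: i32 = 100;
/// Only throttle when the engine is at least this busy; also illustrative.
const BUSY_THRESHOLD: f64 = 0.95;

/// Returns a priority offset (zero or negative) for one client.
/// `usage` and `budget` are fractions of GPU time, `engine_busy` is the
/// physical engine utilisation in 0.0..=1.0.
pub fn priority_offset(usage: f64, budget: f64, engine_busy: f64) -> i32 {
    // No contention, or the client is within budget: leave priority alone.
    if engine_busy < BUSY_THRESHOLD || usage <= budget {
        return 0;
    }
    // A client with no budget at all gets the full penalty.
    if budget <= 0.0 {
        return -MAX_PENALTY;
    }
    // Relative overshoot: 1.0 means the client used twice its budget.
    let over = (usage - budget) / budget;
    let penalty = (over * f64::from(MAX_PENALTY)).min(f64::from(MAX_PENALTY));
    -(penalty.round() as i32)
}
```

A client 50% over its budget on a saturated engine would get an offset of -50 here, while the same client on a half-idle engine would be left untouched.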
One of the scenarios I used to test how well this works is to run two demanding GPU clients, each in its own cgroup, tweak their relative weights, and see what happens. The results were encouraging and are shown in the following table.
We can see that, when a client’s group weight was decreased, the GPU bandwidth it was receiving also went down, as a consequence of the lowered context priority after receiving the over-budget notification.
This is a suitable moment to mention that the DRM cgroup controller does not promise perfect control, that is, achieving the actual GPU sharing ratios as expressed by group-relative weights. As we have mentioned before, GPU scheduling is not nearly at the same level of quality and granularity as in the CPU world, so the goal it sets is simply to improve things: to do something which has a positive impact on user experience. At the same time, the mechanism and control interface proposed do not preclude individual drivers from doing as good a job as they can, or even the future possibility of replacing the inner workings of the controller with something smarter, with no need to change the user space control interface.
Going back to the initial i915 implementation, the second test I did was an attempt to wire up the controller with the background/foreground window focus handling in ChromeOS. There I experimented with a game (Android VM) running in parallel with a WebGL demo in a browser. At a certain point after both clients were running I lowered the weight of the background game, and in the screenshot below we can see how the FPS metric in the browser jumped up.
This illustrates how having the controller can indeed improve the user experience. The user’s focus will be on the foreground window, and therefore it does make sense to prioritise GPU access for that client for better interactiveness and smoother rendering there. In fact, in this example the actual FPS jumped from around 48-49 to 60fps, meaning that throttling the background client allowed the foreground one to match its rendering to the display’s refresh rate.
AMD’s kernel module was the next interesting driver which I wired up with the controller.
The fact that its scheduling is built on top of the DRM scheduler with only three distinct priority levels mandated a different approach to throttling. We keep a sorted list of “most offending” clients (most out of budget, or most borrowed unused budget from the sibling group), with the idea that the top client on that list gets throttled by lowering its scheduling priority. That was relatively straightforward to implement and sounded like it could potentially satisfy the most basic use case of background task isolation.
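The “most offending” list can be sketched as a sort by overshoot. This is a simplification under stated assumptions: the struct and function names are invented here, and only the “most out of budget” criterion is modelled, not the borrowing of unused budget from a sibling group.

```rust
/// Usage and budget of one client over the last scan period, as fractions.
#[derive(Debug, Clone, PartialEq)]
pub struct Usage {
    pub name: String,
    pub used: f64,
    pub budget: f64,
}

/// Clients that are over budget, worst offender first.
pub fn offenders(clients: &[Usage]) -> Vec<&Usage> {
    let mut over: Vec<&Usage> = clients.iter().filter(|c| c.used > c.budget).collect();
    // Sort by overshoot, descending; total_cmp gives a total order on f64.
    over.sort_by(|a, b| (b.used - b.budget).total_cmp(&(a.used - a.budget)));
    over
}

/// The single client whose scheduling priority this sketch would lower.
pub fn pick_victim(clients: &[Usage]) -> Option<&Usage> {
    offenders(clients).into_iter().next()
}
```

With only three priority levels available there is not much room for proportional adjustment, so picking one victim per scan and lowering it a level is about as fine-grained as such an approach can get.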
To test the runtime behaviour we set up two sibling cgroups and vary their relative scheduling weights. In one cgroup we run glxgears with vsync turned off and log its frame rate over time, while in the second group we run glmark2.
Let us first look at how the glxgears frame rate varies during this test, depending on three different scheduling weight ratios between the cgroups. The scheduling weight ratio is expressed as glxgears:glmark2, i.e. 10:1 means the glxgears scheduling weight was ten times that configured for glmark2.
We can observe that, as the glmark2 is progressing through its various sub-benchmarks, glxgears frame rate is changing too. But it was overall higher in the runs where the scheduling weight ratio was in its favour. That is a positive result showing that even a simple implementation seems to be having the desired effect, at least to some extent.
For the second test we can look from the perspective of glmark2, checking how the benchmark score changes depending on the ratio of scheduling weights.
Again we see that the scores are generally improving when the scheduling weight ratio is increased in favour of the benchmark.
However, in neither case is the change in the result proportional to the actual ratios. This is because the primitive implementation is not able to precisely limit the “background” client, but is only able to achieve some throttling. Also, there is an inherent delay in how fast the controller can react, given that the control loop is based on periodic scanning. This period is configurable and was set to two seconds for the above tests.
Hopefully this write-up has managed to demonstrate two main points:
First, that a generic and driver-agnostic approach to a DRM scheduling cgroup controller can improve user experience and enable new use cases, while at the same time following the established control interface as it exists for CPU and IO control, which makes it future-proof and extendable;
Secondly, that even relatively basic driver implementations can be somewhat effective in providing positive control effects.
It also probably needs to be reiterated that neither the driver implementations nor the cgroup controller implementation itself are limited by the user interface proposed. Both could be independently improved under the hood in the future.
What is next? There is more work to be done such as conducting more detailed testing, polishing the implementation and potentially attempting to wire up more drivers to the controller. Further advocacy work in the DRM community too.
There's been a couple of mentions of Rust4Linux in the past week or two, one from Linus on the speed of engagement and one about Wedson departing the project due to non-technical concerns. This got me thinking about project phases and developer types.
The interaction between wayfinders and maintainers is the most difficult one. Wayfinders like to move freely and quickly, maintainers have other priorities that slow them down. I believe there need to be road builders engaged between the wayfinders and maintainers.
Road builders have to be willing to expend the extra time on resolving roadblocks in the best way possible for all parties. The time it takes to resolve a single roadblock may be greater than the time expended on the whole wayfinding expedition, and this frustrates wayfinders. The builder has to understand what the maintainers' concerns are and where they come from, and why the wayfinder made certain decisions. They work via education and trust building to get them aligned to move past the block. They then move down the road and repeat this process until the road is open. How this is done might change depending on the type of maintainers.
Agrees with the road's direction, might not like some of the intersections, willing to be educated and give feedback on newer intersection designs. Moves to group 1 or trusts that others are willing to maintain intersections on their road.
I think my request from this is that contributors should try and identify the archetype they currently resonate with and find the next group over to interact with.
For wayfinders, it's fine to just keep wayfinding, just don't be surprised when the road building takes longer, or the road that gets built isn't what you envisaged.
For road builders, just keep building, find new techniques for bridging gaps and blowing stuff up when appropriate. Figure out when to use higher authorities. Take the high road, and focus on the big picture.
For maintainers, try and keep up with modern road building, don't say 20 year old roads are the pinnacle of innovation. Be willing to install the rumble strips, widen the lanes, add crash guardrails, and truck safety offramps. Understand that wayfinders show you opportunities for longer term success and that road builders are going to keep building the road, and the result is better if you engage positively with them.
Hi!
After months of bikeshedding finishing touches we’ve finally merged
ext-image-capture-source-v1 and ext-image-copy-capture-v1 in
wayland-protocols! These two new protocols supersede the old wlr-screencopy-v1
protocol. They unlock some nice features such as toplevel and cursor capture,
as well as improved damage tracking. Thanks a lot to Andri Yngvason! He’s
written a blog post about the new protocols with more details. The
wlroots MR doesn’t have toplevel capture implemented yet, but that’s next on
the TODO list.
In other Wayland news, we’ve merged full support for explicit synchronization in wlroots. This generally results in a better system architecture than implicit synchronization, reduces over-synchronization for complicated pipelines, and makes wlroots work correctly with drivers lacking implicit synchronization support (e.g. NVIDIA).
Alexander has implemented automatic X11 surface restacking in wlroots’ scene-graph. That way, all scene-graph compositors get proper X11 stack handling for free (Sway’s implementation was buggy). This should fix issues where the X11 server and the compositor don’t have the same idea of the relative ordering of surfaces, resulting in clicks going “through” windows or reaching invisible windows.
Ricardo Steijn has contributed Sway support for tearing-control-v1.
This allows users to opt-in to immediate page-flips which don’t wait for the
vertical sync point (VSync) to program new frames into the hardware. For
tearing to be enabled, two conditions need to be fulfilled: tearing needs to
be enabled per-output via the output allow_tearing command, and tearing
needs to be enabled per-application either via the tearing-control-v1 Wayland
protocol or manually via the window allow_tearing command. I’ve also pushed
kernel patches from André Almeida and me to fix a few bugs around tearing
page-flips with the atomic KMS API, so once these land forcing the legacy KMS
API shouldn’t be necessary anymore.
drm_info v2.7.0 has been released with a few new features and cleanups.
Support for DRM_CLIENT_CAP_CURSOR_PLANE_HOTSPOT and
DRM_CAP_ATOMIC_ASYNC_PAGE_FLIP has been added, and a new flag has been
introduced to display information from a JSON dump.
Last, I’ve released a new version of go-maildir with a brand new API. Instead
of referring to messages by their Maildir key and fishing back their full
filename on each operation, the API exposes a Message type. It should be much
nicer to use than the previous one.
That’s all for August, see you next month!
The Freedesktop.org Specifications directory contains a list of common specifications that have accumulated over the decades and define how common desktop environment functionality works. The specifications are designed to increase interoperability between desktops. Common specifications make life a lot easier for both desktop-environment developers and especially application developers (who will almost always want to maximize the number of Linux DEs their app can run on and behave as expected in, to increase their app’s target audience).
Unfortunately, building the HTML specifications and maintaining the directory of available specs has become a bit of a difficult chore, as the pipeline for building the site has become fairly old and unmaintained (parts of it still depended on Python 2). In order to make my life of maintaining this part of Freedesktop easier, I aimed to carefully modernize the website. I do have bigger plans to maybe eventually restructure the site to make it easier to navigate and not just a plain alphabetical list of specifications, and to integrate it with the Wiki, but in the interest of backwards compatibility and to get anything done in time (rather than taking on a mega-project that can’t be finished), I decided to just do the minimum modernization first to get a viable website, and do the rest later.
So, long story short: Most Freedesktop specs are written in DocBook XML. Some were plain HTML documents, some were DocBook SGML, a few were plaintext files. To make things easier to maintain, almost every specification is written in DocBook now. This also simplifies the review process and we may be able to switch to something else like AsciiDoc later if we want to. Of course, one could have switched to something other than DocBook, but that would have been a much bigger chore with a lot more broken links, and I did not want this to become an even bigger project than it already was, so I kept its scope somewhat narrow.
DocBook is a markup language for documentation which has been around for a very long time, and therefore has older tooling around it. But fortunately our friends at openSUSE created DAPS (DocBook Authoring and Publishing Suite) as a modern way to render DocBook documents to HTML and other file formats. DAPS is now used to generate all Freedesktop specifications on our website. The website index and the specification revisions are also now defined in structured TOML files, to make them easier to read and to extend. A bunch of specifications that had been missing from the original website are also added to the index and rendered on the website now.
Originally, I wanted to put the website live in a temporary location and solicit feedback, especially since some links have changed and not everything may have redirects. However, due to how GitLab Pages worked (and due to me not knowing GitLab CI well enough…) the changes went live before their MR was actually merged. Rather than reverting the change, I decided to keep it (as the old website did not build properly anymore) and to see if anything breaks. So far, no dead links or bad side effects have been observed, but:
If you notice any broken link to specifications.fd.o or anything else weird, please file a bug so that we can fix it!
Thank you, and I hope you enjoy reading the specifications with better rendering and a more coherent look!
I'm happy to announce that my first project regarding support for the NPU in NXP's i.MX 8M Plus SoC has reached the feature complete stage.
CC BY-NC 4.0 Henrik Boye
For the last several weeks I have been working full-time on adding support for the NPU to the existing Etnaviv driver. Most of the existing code that supports the NPU in the Amlogic A311D was reused, but NXP used a much more recent version of the NPU IP so some advancements required new code, and this in turn required reverse engineering.
This work has been kindly sponsored by the Open Source consultancy Ideas On Board, for which I am very grateful. I hope this will be useful to those companies that need full mainline support in their products, even if it is just the start. This company is unique in working on both NPU and camera drivers in Linux mainline, so they have the best experience for products that require long term support and vision processing.
Since the last update I have fixed the last bugs in the compression of the weights tensor and implemented support for a new hardware-assisted way of executing depthwise convolutions. Some improvements to how the tensor addition operation is lowered to convolutions were needed as well.
Performance is pretty good already, allowing for detecting objects in video streams at 30 frames per second, so at a similar performance level as the NPU in the Amlogic A311D. Some performance features are left to be implemented, so I think there is still substantial room for improvement.
In a previous post I gave the context for my pet project ieee1275-rs, a framework to build bootable ELF payloads on Open Firmware (IEEE 1275). OF is a standard developed by Sun for SPARC that aimed to provide a standardized firmware interface that was rich and nice to work with. It was later adopted by IBM and Apple for POWER, and even by the OLPC XO.
The crate is intended to provide a similar set of facilities as uefi-rs, that is, an abstraction over the entry point and the interfaces. I started the ieee1275-rs crate specifically for IBM’s POWER platforms, although if people want to provide support for SPARC, G3/4/5s and the OLPC XO I would welcome contributions.
There are several ways the firmware takes a payload to boot. In Fedora we use a PReP partition, which is a ~4MB partition labeled with the 41h type in MBR or with 9E1A2D38-C612-4316-AA26-8B49521E5A8B as the GUID in the GPT table. The ELF is written as raw data in the partition.
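For tooling that has to find the PReP partition, the two identifiers above are all that is needed. A minimal sketch, with made-up helper names, and comparing the GUID only in its textual form as partitioning tools print it (the on-disk GPT encoding is mixed-endian binary and is not handled here):

```rust
/// MBR partition type used for PReP boot partitions (41h).
pub const PREP_MBR_TYPE: u8 = 0x41;
/// GPT partition type GUID used for PReP boot partitions.
pub const PREP_GPT_GUID: &str = "9E1A2D38-C612-4316-AA26-8B49521E5A8B";

/// True if an MBR partition type byte marks a PReP partition.
pub fn is_prep_mbr(partition_type: u8) -> bool {
    partition_type == PREP_MBR_TYPE
}

/// True if a textual GPT type GUID marks a PReP partition.
/// Case is ignored since tools differ in how they print GUIDs.
pub fn is_prep_gpt(type_guid: &str) -> bool {
    type_guid.trim().eq_ignore_ascii_case(PREP_GPT_GUID)
}
```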
Another alternative is a so-called CHRP script in “ppc/bootinfo.txt”. This script can load an ELF located in the same filesystem, and it is what the bootable CD/DVD installer uses. I have yet to test whether this is something that can be used across Open Firmware implementations.
To avoid compatibility issues, the ELF payload has to be compiled as a 32bit big-endian binary as the firmware interface would often assume that endianness and address size.
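A quick way to verify that a build really produced a 32-bit big-endian ELF is to look at the first bytes of the file. The sketch below checks the ELF identification bytes; the function name is made up, and it only inspects the class and data encoding fields, nothing else in the header.

```rust
/// True if `header` starts with an ELF identification for a 32-bit,
/// big-endian object. Only the first six bytes are inspected.
pub fn is_elf32_big_endian(header: &[u8]) -> bool {
    const MAGIC: [u8; 4] = [0x7f, b'E', b'L', b'F'];
    const EI_CLASS: usize = 4; // index of the file class byte
    const EI_DATA: usize = 5; // index of the data encoding byte
    const ELFCLASS32: u8 = 1; // 32-bit objects
    const ELFDATA2MSB: u8 = 2; // big-endian encoding

    header.len() > EI_DATA
        && header[..4] == MAGIC
        && header[EI_CLASS] == ELFCLASS32
        && header[EI_DATA] == ELFDATA2MSB
}
```

Reading the first few bytes of the built binary with std::fs::read and passing them to this function is enough to catch a payload accidentally built for a 64-bit or little-endian target.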
As I entered this problem I had some experience writing UEFI binaries, the entry point in UEFI looks like this:
#![no_main]
#![no_std]

use uefi::prelude::*;

#[entry]
fn main(_image_handle: Handle, mut system_table: SystemTable<Boot>) -> Status {
    uefi::helpers::init(&mut system_table).unwrap();
    system_table.boot_services().stall(10_000_000);
    Status::SUCCESS
}
Basically you get a pointer to a table of functions, and that’s how you ask the firmware to perform system functions for you. I thought that maybe Open Firmware did something similar, so I had a look at how GRUB does this and it used a ppc assembler snippet that jumps to grub_ieee1275_entry_fn(); yaboot does a similar thing. I was already grumbling about having to look into how to embed an asm binary in my Rust project. But it turns out this snippet conforms to the PPC function calling convention. Those snippets mostly take care of zeroing the BSS segment, and it turns out the ELF that Rust outputs does not contain one (although I am not sure this means there isn’t a runtime one, I need to investigate this further). So I decided to just create a small ppc32be ELF binary with the start function placed at the top of the .text section at address 0x10000.
I have created a repository with the most basic setup that you can run. With some cargo configuration to get the right linking options, and a script to create the disk image with the ELF payload on the PReP partition and run qemu, we can get this source code being run by Open Firmware:
#![no_std]
#![no_main]

use core::{panic::PanicInfo, ffi::c_void};

#[panic_handler]
fn _handler(_info: &PanicInfo) -> ! {
    loop {}
}

#[no_mangle]
#[link_section = ".text"]
extern "C" fn _start(_r3: usize, _r4: usize, _entry: extern "C" fn(*mut c_void) -> usize) -> isize {
    loop {}
}
Provided we have already created the disk image (check the run_qemu.sh script for more details), we can run our code by executing the following commands:
$ cargo +nightly build --release --target powerpc-unknown-linux-gnu
$ dd if=target/powerpc-unknown-linux-gnu/release/openfirmware-basic-entry of=disk.img bs=512 seek=2048 conv=notrunc
$ qemu-system-ppc64 -M pseries -m 512 --drive file=disk.img
[...]
Welcome to Open Firmware
Copyright (c) 2004, 2017 IBM Corporation All rights reserved.
This program and the accompanying materials are made available
under the terms of the BSD License available at
http://www.opensource.org/licenses/bsd-license.php
Trying to load: from: /vdevice/v-scsi@71000003/disk@8000000000000000 ... Successfully loaded
Ta da! The wonders of getting your firmware to run an infinite loop. Here’s where the fun begins.
Now, to complete the hello world, we need to do something useful. Remember our _entry argument in the _start() function? That’s our gateway to the firmware functionality. Let’s look at how the IEEE1275 spec tells us how we can work with it.
This function is a universal entry point that takes a structure as an argument that tells the firmware what to run, depending on the function it expects some extra arguments attached. Let’s look at how we can at least print “Hello World!” on the firmware console.
The basic structure looks like this:
#[repr(C)]
pub struct Args {
    pub service: *const u8, // null terminated ascii string representing the name of the service call
    pub nargs: usize,       // number of arguments
    pub nret: usize,        // number of return values
}
This is just the header of every possible call; nargs and nret determine the size of the memory of the entire argument payload. Let’s look at an example to just exit the program:
#[no_mangle]
#[link_section = ".text"]
extern "C" fn _start(_r3: usize, _r4: usize, entry: extern "C" fn(*mut Args) -> usize) -> isize {
    let mut args = Args {
        service: "exit\0".as_ptr(),
        nargs: 0,
        nret: 0
    };

    entry(&mut args as *mut Args);
    0 // The program will exit in the line before, we return 0 to satisfy the compiler
}
When we run it in qemu we get the following output:
Trying to load: from: /vdevice/v-scsi@71000003/disk@8000000000000000 ... Successfully loaded
W3411: Client application returned.
Aha! We successfully called firmware code!
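To make the role of nargs and nret a bit more concrete, here is a minimal sketch of what a call with a payload could look like. It assumes the IEEE 1275 write service, which takes three arguments (ihandle, addr, len) and returns one value (actual). The WriteArgs struct and the payload_cells and build_write helpers are hypothetical names for illustration and not part of the ieee1275 crate. The snippet only checks the memory layout on the host; it does not call into the firmware.

```rust
use std::mem::size_of;

/// Header shared by every client interface call (same layout as above).
#[repr(C)]
pub struct Args {
    pub service: *const u8, // null terminated ascii service name
    pub nargs: usize,       // number of arguments
    pub nret: usize,        // number of return values
}

/// Hypothetical payload for the "write" service: the header followed by
/// nargs input cells and nret output cells, each one cell (usize) wide.
#[repr(C)]
pub struct WriteArgs {
    pub args: Args,
    pub ihandle: usize,  // in: handle of the device instance to write to
    pub addr: *const u8, // in: buffer to write
    pub len: usize,      // in: buffer length in bytes
    pub actual: usize,   // out: bytes actually written, filled in by the firmware
}

/// Number of cells the whole call occupies: 3 header cells plus the payload.
pub fn payload_cells(args: &Args) -> usize {
    3 + args.nargs + args.nret
}

/// Fills in a write call; nothing is sent to the firmware here.
pub fn build_write(ihandle: usize, msg: &[u8]) -> WriteArgs {
    WriteArgs {
        args: Args { service: "write\0".as_ptr(), nargs: 3, nret: 1 },
        ihandle,
        addr: msg.as_ptr(),
        len: msg.len(),
        actual: 0,
    }
}

fn main() {
    let msg = b"Hello World!\n";
    let call = build_write(0, msg); // 0 is a placeholder, not a real ihandle
    // The header's counters describe exactly how much memory the call uses.
    assert_eq!(payload_cells(&call.args) * size_of::<usize>(), size_of::<WriteArgs>());
    println!("write call occupies {} cells", payload_cells(&call.args));
}
```

A real ihandle would have to be obtained from the firmware first, which is out of scope for this sketch.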
To summarize, we’ve learned that we don’t really need assembly code to produce an entry point to our OF bootloader (tho we need to zero our bss segment if we have one), we’ve learned how to build a valid OF ELF for the PPC architecture and how to call a basic firmware service.
In a follow-up post I intend to show a hello world text output and how the ieee1275 crate helps to abstract away most of the grunt work needed to access common firmware services. Stay tuned!
The DRIL merge is done, and things are mostly working again after a tumultuous week. To recap, here’s everything that went wrong leading up to 24.2-rc1, the reason why it went wrong, and the potential steps that could be taken (but almost certainly won’t) to avoid future issues.
One of the big changes that went in last-minute was an MR linking all the GL frontend libs to Gallium, which is a huge improvement to the old way of using dlopen to directly trigger version mismatch errors.
It had some problems, like how it broke Steam. As some readers may have inferred, this was Very Bad, as my employer has some interest in ensuring that Steam does not break.
The core problem in this case has to do with library paths, distro policies, and Steam’s own library handling:
- The GL frontend libs link directly to libgallium.so now, which means this library must be in the library path
- libgallium.so has been installed to ${libdir}/dri
- I originally proposed installing it to ${libdir} to avoid library pathing issues, but the criticism I received was that distros would not be friendly towards shipping an unstable library here
- Instead, rpath was used so that the dri directory was appended to the library path for libgallium.so
Unfortunately, there are lots of things that don’t fully handle all variations of rpath, chief among them Steam. Furthermore, some distros don’t use the most optimal implementation of rpath (i.e., they use DT_RPATH instead of DT_RUNPATH), which hits those unimplemented parts of Steam.
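For anyone wondering why the choice of tag matters at all: the two tags sit at different points in the dynamic loader’s search order. The following is a rough model of the order documented for glibc’s ld.so, not code from Mesa or Steam. The Elf struct and search_order function are made up for illustration, and details such as $ORIGIN expansion, the ld.so cache, splitting colon-separated lists, and DT_RUNPATH only applying to an object’s direct dependencies are left out.

```rust
/// Hypothetical, simplified view of the path-related dynamic tags of an ELF object.
struct Elf<'a> {
    rpath: Option<&'a str>,   // DT_RPATH
    runpath: Option<&'a str>, // DT_RUNPATH
}

/// Returns the places searched for a dependency, in order.
/// Each entry is kept as one string even if it is a colon-separated list.
fn search_order<'a>(elf: &Elf<'a>, ld_library_path: Option<&'a str>) -> Vec<&'a str> {
    let mut dirs = Vec::new();
    // DT_RPATH is only honoured when no DT_RUNPATH is present, and it is
    // consulted before LD_LIBRARY_PATH.
    if elf.runpath.is_none() {
        dirs.extend(elf.rpath);
    }
    dirs.extend(ld_library_path);
    // DT_RUNPATH is consulted after LD_LIBRARY_PATH.
    dirs.extend(elf.runpath);
    // Default system directories come last (reduced to one entry here).
    dirs.push("/usr/lib");
    dirs
}

fn main() {
    let env = Some("/opt/runtime/lib"); // placeholder LD_LIBRARY_PATH value
    let with_rpath = Elf { rpath: Some("/usr/lib/dri"), runpath: None };
    let with_runpath = Elf { rpath: None, runpath: Some("/usr/lib/dri") };
    println!("DT_RPATH:   {:?}", search_order(&with_rpath, env));
    println!("DT_RUNPATH: {:?}", search_order(&with_runpath, env));
}
```

The practical upshot is that an environment which relies on LD_LIBRARY_PATH to pick its own libraries sees different results depending on which tag was emitted.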
The reason(s) this managed to land without issues?
- The LD_LIBRARY_PATH variables had been changed to include ${libdir}/dri, which I had used for a test run but did not intend to land with the final version
- I also set LD_LIBRARY_PATH locally to avoid random issues when testing obscure apps
Combined, I wasn’t getting adequate testing, so it appeared everything was fine when really nothing was fine.
Lucky for me, Simon McVittie wrote a full textbook analysis of the issue and possible solutions, so this is now fixed.
Ideally in the future I’ll have better testing environments and won’t be trying to hammer in big MRs minutes before an RC goes out.
DRI is (now) a simple interface that tells Xorg which rendering formats can be used for drawables. This is dependent on the device and driver, but fbconfigs aren’t typically something that should vary too much between driver versions. DRIL is meant to split this functionality out of the rest of Mesa so that all the internal interfaces don’t have to be a Gordian Knot.
Unfortunately, this means if DRIL has problems determining which formats are usable, the xserver also has problems. There were a lot of problems:
eglChooseConfigs, you’re probably fucking up) which made it hard to adequately review
This is why there was a sudden deluge of issues about broken colors
On my end, I didn’t check glxinfo output thoroughly enough, nor did I do exceptionally thorough testing of desktop apps. All the unit tests passed along with CI, which seemed like it should have been enough. Too bad there are no piglit tests which check to see whether various fbconfigs are supported. Maybe I’ll write one to ensure there’s a CI baseline and catch any future regressions.
This is a pretty dumb issue, but it was an issue nonetheless: drivers simply stopped loading. This affected any number of embedded (etnaviv) devices and was fixed by a pretty trivial MR. Also I broke KMSRO, which broke even more devices.
Whoops.
The problem here is there’s no CI testing, and I have no such devices for testing. Hard to evaluate these types of things when they silently fail.
I promise.
I have been doing random coding experiments in my spare time that I never got to publicize much outside of my inner circles. I thought I would dust off my blog a bit to talk about what I did in case it is useful for others.
For some background, I used to manage the bootloader team at Red Hat a few years ago alongside Peter Jones and Javier Martinez. I learned a great deal from them and I fell in love with this particular problem space and I have come to enjoy tinkering with experiments in this space.
There are many open challenges in this space that we could tackle to have a more robust boot path across Linux distros, from boot attestation for initramfs and cmdline to A/B rollbacks and TPM LUKS decryption (à la BitLocker)…
One that particularly interests me is unifying the firmware-kernel boot interface across implementations in the hypothetical absence of GRUB.
The priority of the team was to support RHEL boot path on all the architectures we supported. Namely x86_64 (legacy BIOS & UEFI), aarch64 (UEFI), s390x and ppc64le (Open Power and PowerVM).
These are extremely heterogeneous firmware interfaces, some are on their way to extinction (legacy PC BIOS) and some will remain weird for a while.
GRUB (GRand Unified Bootloader), as its name suggests, intends to be a unified bootloader for all platforms. GRUB has to support a superset of firmware interfaces; some of those, like legacy BIOS, do not provide much more than rudimentary disk or network access and basic graphics handling.
To get to load a kernel and its initramfs, this means that GRUB has to implement basic drivers for storage, networking, TCP/IP, filesystems, volume management… Every time there is a new storage device technology, we need to implement a driver twice, once in the kernel and once in GRUB itself. GRUB is, for all intents and purposes, an entire operating system that has to be maintained.
The maintenance burden is actually quite big, and recently it has been a target for the InfoSec community after the BootHole vulnerability. GRUB is implemented in C, and it is an extremely complex code base that is not as well staffed as it should be. It implements its own scripting language (parser et al) and it is clear there are quite a few CVEs lurking in there.
So, we are basically maintaining code we already have to write, test and maintain in the Linux kernel in a different OS whose main job (in the context of RHEL, CentOS and Fedora) is to boot a Linux kernel.
This realization led to the initiative that these days is taking shape in the discussions around nmbl (no more boot loader). You can read more about that in that blog post; I am not actively participating in that effort, but I encourage you to read about it. I do want to focus on something else and very specific, which is what you do before you load the nmbl kernel.
I want to focus on the code that goes from the firmware interface to loading the kernel (nmbl or otherwise) from disk. We want some sort of A/B boot protocol that is somewhat normalized across the platforms we support, and we need to pick the kernel from the disk.
The systemd community has led some of the boot modernization initiatives, vocally supporting the adoption of UKI and signed pre-built initramfs images, developing the Boot Loader Spec, and other efforts.
At some point I heard Lennart making the point that we should standardize on using the EFI System Partition as /boot to place the kernel as most firmware implementations know how to talk to a FAT partition.
This proposal caught my attention and I have been pondering if we could have a relatively small codebase written in a safe language (you know which) that could support a well-defined protocol for A/B booting a kernel in Legacy BIOS, S390 and OpenFirmware (UEFI and Open Power already support BLS snippets so we are covered there).
My modest inroad into testing this hypothesis so far has been the development of ieee1275-rs, a Rust module to write programs for the Open Firmware interface. So far I have not been able to load a kernel by myself, but I still think the lessons learned and some of the code could be useful to others. Please note this is a personal experiment and nothing Red Hat is officially working on.
I will be writing more about the technical details of this crate in a follow up blog post where I get into some of the details of writing Rust code for a firmware interface, this post is long enough already. Stay tuned.
Lot of stuff happening. I can’t talk about much of it yet, but trust me when I say the following:
It’s happening.
When it happens, you’ll know what I meant.
Remember way back when I put DRI interfaces on notice?
Now, only four months later, DRI interfaces are finally going away.
Begun by @ajax and then finished off by me and Pavel (ghostwritten by @daniels), the DRIL (DRI Legacy) interface is a tiny shim which matches Xorg’s ABI expectations to provide a list of sensible fbconfig formats during startup. Then it does nothing. And by doing nothing, it saves the rest of Mesa from being shackled to ancient ABI constraints.
Let the refactoring begin.
Obviously I’m not going to stop here. SGC leaves no code half-krangled. That’s why, as soon as DRIL lands, I’ll also be hammering in this followup MR which finally makes all the GL frontends link directly to the Gallium backend driver.
Why is this so momentous, you ask? How many of you have gotten the error DRI driver not from this Mesa build when trying to use your custom Mesa build?
With this MR, that error is going away. Permanently. Now you can have as many Mesa builds on your system as you want. No longer do you need to set LIBGL_DRIVERS_PATH for any reason.
The future is here.
Hi!
This month wlroots 0.18.0 has been released! This new version includes a fair
share of niceties: ICC profiles, GPU reset recovery, fewer black screens when
plugging in a monitor on Intel, a whole bunch of new protocol implementations,
and much more. Thanks a lot to all contributors! Two recent merge requests made
it in the release: Kenny’s Vulkan renderer optimizations, and support for the
SIZE_HINTS
KMS property to use a smaller cursor plane on Intel to save power.
For the next release we’ll be trying out release candidates to formally focus
on bugfixing and leave time for compositors and language bindings to update and
report issues.
I’ve continued working on various graphics-related topics, for instance the wlroots implementation of the upcoming ext-screencopy-v1 protocol is now complete and the protocol itself is almost ready (still figuring out the most difficult part: how to name it). I also sent out a kernel patch to fix tearing page-flips when cursor/overlay planes don’t change (and are included in the atomic commit). I reviewed patches by Enrico Weigelt to improve libdrm’s portability to OpenBSD and Solaris. Last, I’ve released libdisplay-info 0.2.0 with a new high-level API for colorimetry and support for more EDID/CTA/DisplayID blocks.
To get the releases over with, let’s briefly mention Goguma 0.7.0. This one unlocks file uploads, a new look based on Material You with an adaptive color scheme, many improvements to the iOS port, and text/media can be shared to Goguma from other apps. slingamn has played with a gamja/Ergo setup configured with Forgejo as an OAuth server, and it worked nicely after fixing a gamja SASL-related bug and implementing a missing feature in Forgejo’s OAuth token introspection endpoint!
Last, I also added a new libscfg API to write files - this can be useful to auto-generate some configuration files for instance. And I also performed some more boring X.Org Foundation sysadmin stuff, such as dealing with domain-related issues, recovering a server running out of disk space again, and convincing Postfix to start up.
See you next month!
The Igalia Graphics team has been expanding and making significant contributions in the space of open source graphics. An earlier blog post by our team member Lucas provides excellent insight into the team’s evolution over the past years. The following series of posts will attempt to summarize the team’s recent engagements:
Before delving into details, it is worth mentioning the recent highlights: Igalia hosted the 2024 Linux Display Next Hackfest in May this year and the X.Org Developers Conference 2023 in October last year, both in the beautiful city of A Coruña. These events were a huge success in creating a hub for graphics experts to foster open innovation. Continue reading for more details on these events.
Last year brought great news for AMD GPU color management: the AMD driver-specific color management properties reached the upstream linux-next! My Igalia colleague Melissa Wen has been spearheading this effort for some time now and has journalled every detail in a series of blog posts.
AMD has been improving its display color management pipeline with each new hardware generation. The new color capabilities, before and after plane composition, can be used by compositors and userspace applications to provide a vibrant experience to the end-user. Exposing AMD driver-specific color properties is a step towards advanced color management on Linux, allowing gamut mapping, HDR rendering, HDR on SDR, and SDR on HDR.
On a very high level, there are 2 parts of this support:
Upgrading the DRM/KMS Linux interface to expose the new features to the user-space. One major challenge was the limited DRM/KMS interface, which only exposed a small set of post-blending color properties. Latest AMD Display Core Next hardware has many more post-blending and pre-blending capabilities. Melissa’s work involved mapping these capabilities to the AMD driver’s display core interface and then to the DRM interface. Her blog post provides a brief overview of this extensive mapping effort.
Updating the AMD’s Linux display driver to expose the new hardware features. AMD DCN 3.0 comes with cutting edge color capabilities described by Melissa here and this blog post also talks about the AMD’s Linux display subsystem components and about the new properties.
I quote here some of Melissa’s write-ups that helped me get some understanding about this vast subject:
Turnip, the open-source Vulkan driver for Qualcomm Adreno GPUs, has been receiving major upgrades this year for Qualcomm’s Adreno 7XX GPUs.
From my colleague Danylo Piliaiev’s Turnip update at FOSDEM 2024, Turnip seems to be in a great state; major Vulkan extensions and better debug support, AAA desktop games can now run via FEX + Turnip on Linux, with some from the Termux community even running desktop games on Android with Box64/FEX + Turnip.
The highlight of Danylo’s talk is the A7XX support. The team started the year with A7XX bring-up and is now ramping up support for the new features introduced in A7XX:
Mark Collins, who also represents Igalia at the Khronos Vulkan WG, implemented GMEM rendering for A7XX, which can be considerably faster and more power efficient than sysmem rendering depending on what’s being rendered. This was followed up by support for unidirectional LRZ, bringing A7XX to parity with A6XX’s GMEM rendering feature set and further boosting performance, with more performance improvements for A7XX on the horizon.
Our colleague Amber Harmonia added support for allowing a shader to contain 64-bit atomic operations on signed and unsigned integers and support for allowing rasterizing wide lines while Fixed Stride Draw Table support is work-in-progress.
In addition to new feature support, we are committed to providing a robust and performant driver.
Recently, Job Noorman has joined our Turnip team to improve the IR3 compiler. He improved handling of predicate registers and added support for predication. Adreno GPUs have special registers, called predicate registers, that store the result of a condition; using these registers can eliminate branches in the generated code, thereby improving performance. Similarly, a code size reduction of more than 10% was observed in shader-db with his patch for using rptN instructions.
Turnip has come far and has been giving competition to the Adreno’s proprietary driver recently. Here is Assassin’s Creed running on Adreno + Turnip. Check the FPS on that screen!
Danylo usually talks about analyzing some of the major Turnip issues in his series of blog posts “Turnips in the wild” with part 3 being the latest addition. This is exactly what you need to jump start Turnip development.
As always, the team also discovered many new techniques for debugging GPU issues. GPU driver developers want to modify the GPU command stream at run time to see the outcome of editing it in different ways. Danylo implemented this highly sought-after feature as a tool for Adreno and describes how this tool can be used.
The management of the display, graphics and composition in Linux lies in the kernel DRM/KMS framework. Igalian Maíra Canal provides full disclosure on our notable contributions authoring, reviewing and testing kernel DRM patches, while I provide a few highlights here:
My Igalia colleague André Almeida and Simon Ser have been working on Asynchronous Page Flips, an optimization that allows applications to flip a plane for immediate presentation. The support for this feature is now available in the atomic API. Plus, with André’s patch, it is enabled for all planes including the primary plane if the hardware supports it.
Maíra has been working on features crucial to graphics development on RPi. She supplied per-client GPU usage statistics as well as global GPU utilization.
To keep job submission to the GPU continuous, CPU jobs submitted from userspace must be avoided. A series of patches from Maíra moved the CPU job mechanisms from the V3DV driver to the V3D kernel driver.
After achieving Vulkan 1.2 conformance on V3DV, the Igalia team working on V3DV has been focusing on instrumental enhancements of the driver. V3DV is the Vulkan driver for the Broadcom VideoCore GPU on the Raspberry Pi.
RPi 5 was launched in October last year with a new BCM GPU. Alejandro provided an overview of the team’s journey through V3DV development since RPi 4 and then talks about challenges of RPi 5 support in V3DV:
More improvements and new Vulkan extensions were supported last year.
This year Iago landed support for the Vulkan dynamic rendering extension. VK_KHR_dynamic_rendering is a popular Vulkan extension that has added flexibility to the Vulkan API by allowing users to skip render pass and framebuffer objects and start rendering immediately. And now it’s available on the Pi.
As mentioned in the DRM/KMS improvements above, Maíra together with José María Casanova (Chema) and Melissa supported GPU utilization stats and CPU jobs optimization. Here is a snapshot of collection of GPU stats on Pi5:
RPi 5 continues to use the OpenGL/Wayland-based Wayfire compositor. Christopher was therefore tasked with enabling Wayfire to run on RPi 3 and 4 as well. He achieved this by implementing software rendering through a Pixman back-end. Check out the demo:
Iago also made some interesting observations while experimenting with SuperTuxKart on the Pi. You will be pleasantly surprised to know how Vulkan out-performed OpenGL.
The team has been working towards Vulkan 1.3 and we will hopefully be able to share more news on that front very soon.
Christian Gmeiner, one of the maintainers of Etnaviv (open-source graphics driver for Vivante GPUs), joined our team last year. We are very excited to have him on-board because it is a testament to Igalia’s dedication towards open source graphics software development.
Christian is also enjoying being at Igalia, as he discusses in his blog post, where he also reveals his plans for Etnaviv:
One of his latest updates is the user-space hardware database. He explains that a user-space driver HW database has been introduced to obtain GPU specific information like GPU features and limits, corresponding to the introduction of an in-kernel hardware database. I am sure this will be super helpful for the reverse engineers out there!
Igalians are always eager to share their knowledge and expertise with the open source community by participating in key organizations and events.
There is quite a trend in Igalians serving on the X.Org Foundation’s Board of Directors. Samuel Iglesias took on this responsibility for a number of terms but this year he is stepping down. He reminisced about his role in this blog post.
Ricardo was, however, elected to the board of directors in 2022 and stayed on the board until Q1 2024, leaving Christopher Michael as the only Igalian currently on the board. In his blog post, Ricardo introduces the X.Org Foundation but also tackles some questions about its future.
Samuel was invited to join the Linux Foundation (Europe) advisory board and he has accepted the invitation. This is a huge milestone for the whole graphics team. Congratulations Sam!
This is a rather new event that has materialized in the Linux community to enhance the Linux display stack.
Melissa’s work on HDR and AMD color management together with interesting discussions during XDC 2023 Color Management workshop paved the way for the event this year and therefore, Igalia graciously offered to host it.
The event attracted key participants from Linux community, AMD, Nvidia, Google, Fedora, and Gnome, focusing on topics like HDR/color Management, variable refresh rate, tearing, multiplane/hardware overlay for video and gaming, real-time scheduling, async KMS API, power saving vs. color/latency, content-adaptive scaling and sharpening, and display control. The success of this event has highlighted the need for future editions.
At EOSS this year, we presented the following talks:
At FOSDEM this year, we presented the following talks:
At Vulkanised this year, we presented the following talks:
Stéphane Cerveau & Hyunjun Ko, “Implementing a Vulkan Video Encoder From Mesa to GStreamer”
Iago Toral, Faith Ekstrand, “8 Years of Open Drivers, including the State of Vulkan in Mesa”
Igalians who attended the event found it quite informative on the subject.
Igalia hosted XDC 2023 in the city of their headquarters, A Coruña. We also presented many talks and demos.
The lightning talks and demos had an equally active participation from Igalia:
Workshops were organized for discussion on larger subjects like advanced color management (discussion summary) and continuous integration (discussion summary).
Igalia graphics team has profound expertise in Mesa, Vulkan, OpenGL and Linux kernel. We have also embraced new and really interesting graphics technologies that I talk about in my next post.
Note
This blog post is part 1 of a series of blog posts about isaspec and its usage in the etnaviv GPU stack.
I will add here links to the other blog posts, once they are published.
The first time I heard about isaspec, I was blown away by the possibilities it opens. I am really thankful that Igalia made it possible to complete this crucial piece of core infrastructure for the etnaviv GPU stack.
If isaspec is new to you, here is what the Mesa docs have to tell about it:
isaspec provides a mechanism to describe an instruction set in XML, and generate a disassembler and assembler. The intention is to describe the instruction set more formally than hand-coded assembler and disassembler, and better decouple the shader compiler from the underlying instruction encoding to simplify dealing with instruction encoding differences between generations of GPU.
Benefits of a formal ISA description, compared to hand-coded assemblers and disassemblers, include easier detection of new bit combinations that were not seen before in previous generations due to more rigorous description of bits that are expect to be ‘0’ or ‘1’ or ‘x’ (dontcare) and verification that different encodings don’t have conflicting bits (i.e. that the specification cannot result in more than one valid interpretation of any bit pattern).
If you are interested in more details, I highly recommend Rob Clark’s introduction to isaspec presentation.
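The “no conflicting bits” check from the quote above is easy to picture with a small sketch. This is not how isaspec itself is implemented (isaspec is Python); it is a hedged illustration in Rust, and the Encoding struct, parse and conflicts are hypothetical names. It assumes a pattern string is written most significant bit first and may contain ‘0’, ‘1’ and ‘x’.

```rust
/// An encoding as a (mask, value) pair over a 128-bit instruction word:
/// mask has a 1 for every bit the encoding fixes, value holds the fixed bits.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Encoding {
    mask: u128,
    value: u128,
}

/// Parses a pattern such as "01x0" (most significant bit first) placed at bit `low`.
fn parse(pattern: &str, low: u32) -> Encoding {
    let mut enc = Encoding { mask: 0, value: 0 };
    for (i, c) in pattern.chars().rev().enumerate() {
        let bit = 1u128 << (low + i as u32);
        match c {
            '0' => enc.mask |= bit,
            '1' => {
                enc.mask |= bit;
                enc.value |= bit;
            }
            'x' => {} // dontcare: leaves the bit unconstrained
            _ => panic!("unexpected pattern character {c:?}"),
        }
    }
    enc
}

/// Two encodings conflict if some bit pattern matches both of them, i.e.
/// they agree on every bit that both of them fix.
fn conflicts(a: Encoding, b: Encoding) -> bool {
    ((a.value ^ b.value) & (a.mask & b.mask)) == 0
}

fn main() {
    let a = parse("000000", 0); // a 6-bit opcode field set to 0
    let b = parse("000001", 0); // the same field set to 1
    let c = parse("00000x", 0); // lowest bit left as dontcare
    println!("a vs b conflict: {}", conflicts(a, b));
    println!("a vs c conflict: {}", conflicts(a, c));
}
```

Encodings a and b differ in a bit they both fix, so no instruction word can match both. Encoding c leaves that bit open, which means an all-zero field would decode two ways; that is the kind of ambiguity a formal description can flag.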
Vivante uses a fixed-size (128 bits), predictable instruction format with explicit inputs and outputs.
As of today, there are three different encodings seen in the wild:
There are several reasons.
The current ISA documentation is not very explicit and leaves a lot of room for interpretation and speculation. One thing it does provide is some nice explanations of what an instruction does. isaspec does not support <doc> tags yet, but there is a PoC MR that generates really nice-looking and informative ISA documentation based on the XML.
I think soon you might find all etnaviv’s isaspec documentation at docs.mesa3d.org.
There are no unit tests based on instructions generated by the blob driver. This might not sound too bad, but it opens the door to generating ‘bad’ encoded instructions that could trigger all sorts of weird and hard-to-debug problems. Such breakages could be caused by some compiler rework, etc.
In an ideal world, there would be a unit test that does the following:
This is our ultimate goal, which we really must reach. etnaviv will not be the only driver that does such deep unit testing - e.g. freedreno does it too.
Do you remember the rusticl OpenCL attempt for etnaviv? It contains lines like:
if (nir_src_is_const(intr->src[1])) {
   inst.tex.swiz = 128;
}

if (rmode == nir_rounding_mode_rtz)
   inst.tex.amode = 0x4 + INST_ROUND_MODE_RTZ;
else /*if (rmode == nir_rounding_mode_rtne)*/
   inst.tex.amode = 0x4 + INST_ROUND_MODE_RTNE;
Do you clearly see what is going on? Why do we need to set tex.amode for an ALU instruction?
I always found it quite disappointing to see such code snippets. Sure, they mimic what the blob driver is doing, but you might lose all the knowledge about why these bits are used that way days after you worked on it. There must be a cleaner, more understandable, and thus more maintainable way to document the ISA better.
This situation might become even worse if we want to support the other encodings and could end up with more of these bad patterns, resulting in a maintenance nightmare.
Oh, and if you wonder what happened to OpenCL and etnaviv - I promise there will be an update later this year.
As isaspec is written in Python, it is really easy to extend it and add support for new functionality.
At its core, we can generate a disassembler and an assembler based on isaspec. This alone saves us from writing a lot of code that needs to be kept in sync with all the ISA reverse engineering findings that happen over time.
As isaspec is just an ordinary XML file, you can use any programming language you like to work with it.
I really fell in love with the idea of having one source of truth that models our target ISA, contains written documentation, and extends each opcode with meta information that can be used in the upper layers of the compiler stack.
I think I have sold you the idea quite well, so it must be a matter of some days to switch to it. Sadly no, as there are some missing features:
<meta> tags are supported
The first big MR I worked on extended the BITSET APIs with features needed for isaspec. Here we are talking about bitwise AND, OR, and NOT, and left shifts.
The next step was to switch isaspec to use the BITSET API to support wider ISAs. This resulted in a lot of commits, as there was a need for some new APIs to support handling this new feature. After these 31 commits, we were able to start looking into isaspec support for etnaviv.
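To give an idea of what “wider than 64 bits” means for those operations, here is a minimal sketch of a 128-bit bitset stored as four 32-bit words. Mesa’s real BITSET helpers are written in C; the Bits type and its methods below are hypothetical and only illustrate the technique, in particular how a left shift has to carry bits across word boundaries.

```rust
const WORDS: usize = 4; // 4 x 32 bits = 128 bits, the size of one instruction

/// 128-bit bitset; word 0 holds bits 0..=31.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Bits([u32; WORDS]);

impl Bits {
    fn from_u128(v: u128) -> Bits {
        Bits(std::array::from_fn(|i| (v >> (32 * i)) as u32))
    }

    fn to_u128(self) -> u128 {
        self.0.iter().rev().fold(0, |acc, &w| (acc << 32) | w as u128)
    }

    fn and(self, o: Bits) -> Bits {
        Bits(std::array::from_fn(|i| self.0[i] & o.0[i]))
    }

    fn or(self, o: Bits) -> Bits {
        Bits(std::array::from_fn(|i| self.0[i] | o.0[i]))
    }

    fn not(self) -> Bits {
        Bits(self.0.map(|w| !w))
    }

    /// Left shift by `n` bits, carrying bits across word boundaries.
    fn shl(self, n: u32) -> Bits {
        let (words, bits) = ((n / 32) as usize, n % 32);
        let mut out = [0u32; WORDS];
        for i in words..WORDS {
            let src = i - words;
            out[i] = self.0[src] << bits;
            // Pull in the bits shifted out of the next lower word.
            if bits != 0 && src > 0 {
                out[i] |= self.0[src - 1] >> (32 - bits);
            }
        }
        Bits(out)
    }
}

fn main() {
    let opc = Bits::from_u128(0x3f); // six low bits set
    let moved = opc.shl(75);         // lands in the third 32-bit word
    assert_eq!(moved.to_u128(), 0x3f_u128 << 75);
    // Clearing a field: AND with the inverted mask.
    let cleared = opc.or(moved).and(moved.not());
    assert_eq!(cleared, opc);
    println!("{:08x?}", moved.0);
}
```

The native u128 type is used here only to cross-check the word-wise implementation.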
Now it is time to start writing an isaspec XML for etnaviv, and the easiest opcode to start with is the nop. As the name suggests, it does nothing and has no src’s, no dst, or any other modifier.
As I do not have this initial version anymore, I tried to recreate it - it might have looked something like this:
<?xml version="1.0" encoding="UTF-8"?>
<isa>
<bitset name="#instruction">
<display>
{NAME} void, void, void, void
</display>
<pattern low="6" high="10">00000</pattern>
<pattern pos="11">0</pattern>
<pattern pos="12">0</pattern>
<pattern low="13" high="26">00000000000000</pattern>
<pattern low="27" high="31">00000</pattern>
<pattern pos="32">0</pattern>
<pattern pos="33">0</pattern>
<pattern pos="34">0</pattern>
<pattern low="35" high="38">0000</pattern>
<pattern pos="39">0</pattern>
<pattern low="40" high="42">000</pattern>
<!-- SRC0 -->
<pattern pos="43">0</pattern> <!-- SRC0_USE -->
<pattern low="44" high="52">000000000</pattern> <!-- SRC0_REG -->
<pattern pos="53">0</pattern>
<pattern low="54" high="61">00000000</pattern> <!-- SRC0_SWIZ -->
<pattern pos="62">0</pattern> <!-- SRC0_NEG -->
<pattern pos="63">0</pattern> <!-- SRC0_ABS -->
<pattern low="64" high="66">000</pattern> <!-- SRC0_AMODE -->
<pattern low="67" high="69">000</pattern> <!-- SRC0_RGROUP -->
<!-- SRC1 -->
<pattern pos="70">0</pattern> <!-- SRC1_USE -->
<pattern low="71" high="79">000000000</pattern> <!-- SRC1_REG -->
<pattern low="81" high="88">00000000</pattern> <!-- SRC1_SWIZ -->
<pattern pos="89">0</pattern> <!-- SRC1_NEG -->
<pattern pos="90">0</pattern> <!-- SRC1_ABS -->
<pattern low="91" high="93">000</pattern> <!-- SRC1_AMODE -->
<pattern pos="94">0</pattern>
<pattern pos="95">0</pattern>
<pattern low="96" high="98">000</pattern> <!-- SRC1_RGROUP -->
<!-- SRC2 -->
<pattern pos="99">0</pattern> <!-- SRC2_USE -->
<pattern low="100" high="108">000000000</pattern> <!-- SRC2_REG -->
<pattern low="110" high="117">00000000</pattern> <!-- SRC2_SWIZ -->
<pattern pos="118">0</pattern> <!-- SRC2_NEG -->
<pattern pos="119">0</pattern> <!-- SRC2_ABS -->
<pattern pos="120">0</pattern>
<pattern low="121" high="123">000</pattern> <!-- SRC2_AMODE -->
<pattern low="124" high="126">000</pattern> <!-- SRC2_RGROUP -->
<pattern pos="127">0</pattern>
</bitset>
<!-- opcodes sorted by opc number -->
<bitset name="nop" extends="#instruction">
<pattern low="0" high="5">000000</pattern> <!-- OPC -->
<pattern pos="80">0</pattern> <!-- OPCODE_BIT6 -->
</bitset></isa>
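To see how such a description can turn into a decoder check, here is a small sketch that folds pattern entries into a mask/value pair and tests a 128-bit word against it. This is not the code isaspec generates; Pattern, fold and matches are hypothetical names, and the sketch assumes the pattern string is written most significant bit first. Only the two patterns that the nop bitset adds on top of #instruction are used.

```rust
/// One <pattern> entry: fixed bits covering bit positions low..=high.
struct Pattern {
    low: u32,
    high: u32,
    bits: &'static str, // '0'/'1' characters, most significant bit first (assumed)
}

/// Folds patterns into (mask, value): mask marks the fixed bits,
/// value holds the state those bits must have.
fn fold(patterns: &[Pattern]) -> (u128, u128) {
    let (mut mask, mut value) = (0u128, 0u128);
    for p in patterns {
        // The pattern string has to be exactly as wide as the bit range.
        assert_eq!(p.bits.len() as u32, p.high - p.low + 1);
        for (i, c) in p.bits.chars().rev().enumerate() {
            let bit = 1u128 << (p.low + i as u32);
            mask |= bit;
            if c == '1' {
                value |= bit;
            }
        }
    }
    (mask, value)
}

/// True if every fixed bit of `word` has the required state.
fn matches(word: u128, patterns: &[Pattern]) -> bool {
    let (mask, value) = fold(patterns);
    (word & mask) == value
}

fn main() {
    let nop = [
        Pattern { low: 0, high: 5, bits: "000000" }, // OPC
        Pattern { low: 80, high: 80, bits: "0" },    // OPCODE_BIT6
    ];
    println!("all-zero word matches: {}", matches(0, &nop));
    println!("word with bit 80 set matches: {}", matches(1 << 80, &nop));
}
```

A full decoder would also fold in the patterns inherited from #instruction; they are left out here to keep the example short.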
With the knowledge of the old ISA documentation, I went fishing for instructions. I only used instructions from the binary blob for this process. It is quite important for me to have as many unit tests as I can write to not break any decoding with some isaspec XML changes I do. And it was a huge lifesaver at that time.
After I reached almost feature parity with the old disassembler, I thought it was time to land etnaviv.xml and replace the current handwritten disassembler with a generated one - yeah, so I submitted an MR to make the switch.
As this is only a driver internal disassembler used by maybe 2-3 human beings, it would not be a problem if there were some regressions.
Today I would say the isaspec disassembler is superior to the handwritten one.
The next item on my list was to add encoding support. As you can imagine, there was some work needed upfront to support ISAs that are bigger than 64 bits. This time the MR only contains two commits 😄.
With everything ready it is time to add isaspec based encoding support to etnaviv.
The goal is to drop our custom (and too simple) assembler and switch to one that is powered by isaspec.
This opens the door to:
In the end, all the magic that is needed is shown in the following diff:
diff --git a/src/etnaviv/isa/etnaviv.xml b/src/etnaviv/isa/etnaviv.xml
index eca8241a2238a..c9a3ebe0a40c2 100644
--- a/src/etnaviv/isa/etnaviv.xml
+++ b/src/etnaviv/isa/etnaviv.xml
@@ -125,6 +125,13 @@ SPDX-License-Identifier: MIT
<field name="AMODE" low="0" high="2" type="#reg_addressing_mode"/>
<field name="REG" low="3" high="9" type="uint"/>
<field name="COMPS" low="10" high="13" type="#wrmask"/>
+
+ <encode type="struct etna_inst_dst *">
+ <map name="DST_USE">p->DST_USE</map>
+ <map name="AMODE">src->amode</map>
+ <map name="REG">src->reg</map>
+ <map name="COMPS">p->COMPS</map>
+ </encode>
</bitset>
<bitset name="#instruction" size="128">
@@ -137,6 +144,46 @@ SPDX-License-Identifier: MIT
<derived name="TYPE" type="#type">
<expr>{TYPE_BIT2} << 2 | {TYPE_BIT01}</expr>
</derived>
+
+ <encode type="struct etna_inst *" case-prefix="ISA_OPC_">
+ <map name="TYPE_BIT01">src->type & 0x3</map>
+ <map name="TYPE_BIT2">(src->type & 0x4) >> 2</map>
+ <map name="LOW_HALF">src->sel_bit0</map>
+ <map name="HIGH_HALF">src->sel_bit1</map>
+ <map name="COND">src->cond</map>
+ <map name="RMODE">src->rounding</map>
+ <map name="SAT">src->sat</map>
+ <map name="DST_USE">src->dst.use</map>
+ <map name="DST">&src->dst</map>
+ <map name="DST_FULL">src->dst_full</map>
+ <map name="COMPS">src->dst.write_mask</map>
+ <map name="SRC0">&src->src[0]</map>
+ <map name="SRC0_USE">src->src[0].use</map>
+ <map name="SRC0_REG">src->src[0].reg</map>
+ <map name="SRC0_RGROUP">src->src[0].rgroup</map>
+ <map name="SRC0_AMODE">src->src[0].amode</map>
+ <map name="SRC1">&src->src[1]</map>
+ <map name="SRC1_USE">src->src[1].use</map>
+ <map name="SRC1_REG">src->src[1].reg</map>
+ <map name="SRC1_RGROUP">src->src[1].rgroup</map>
+ <map name="SRC1_AMODE">src->src[1].amode</map>
+ <map name="SRC2">&src->src[2]</map>
+ <map name="SRC2_USE">src->src[2].use</map>
+ <map name="SRC2_REG">src->src[2].reg</map>
+ <map name="SRC2_RGROUP">src->src[2].rgroup</map>
+ <map name="SRC2_AMODE">src->src[2].amode</map>
+
+ <map name="TEX_ID">src->tex.id</map>
+ <map name="TEX_SWIZ">src->tex.swiz</map>
+ <map name="TARGET">src->imm</map>
+
+ <!-- sane defaults -->
+ <map name="PMODE">1</map>
+ <map name="SKPHP">0</map>
+ <map name="LOCAL">0</map>
+ <map name="DENORM">0</map>
+ <map name="LEFT_SHIFT">0</map>
+ </encode>
</bitset>
<bitset name="#src-swizzle" size="8">
@@ -148,6 +195,13 @@ SPDX-License-Identifier: MIT
<field name="SWIZ_Y" low="2" high="3" type="#swiz"/>
<field name="SWIZ_Z" low="4" high="5" type="#swiz"/>
<field name="SWIZ_W" low="6" high="7" type="#swiz"/>
+
+ <encode type="uint8_t">
+ <map name="SWIZ_X">(src & 0x03) >> 0</map>
+ <map name="SWIZ_Y">(src & 0x0c) >> 2</map>
+ <map name="SWIZ_Z">(src & 0x30) >> 4</map>
+ <map name="SWIZ_W">(src & 0xc0) >> 6</map>
+ </encode>
</bitset>
<enum name="#thread">
@@ -272,6 +326,13 @@ SPDX-License-Identifier: MIT
</expr>
</derived>
</override>
+
+ <encode type="struct etna_inst_src *">
+ <map name="SRC_SWIZ">src->swiz</map>
+ <map name="SRC_NEG">src->neg</map>
+ <map name="SRC_ABS">src->abs</map>
+ <map name="SRC_RGROUP">p->SRC_RGROUP</map>
+ </encode>
</bitset>
<bitset name="#instruction-alu-no-src" extends="#instruction-alu">
One nice side effect of this work is the removal of the isa.xml.h file that has been part of etnaviv since day one. We are able to generate all of the file's contents with isaspec and some custom python3 scripts. Moving instruction src swizzling from the driver into etnaviv.xml was super easy - less code to maintain!
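To make the encode maps in the diff a bit more concrete, here is a minimal Python sketch of what they boil down to: each map computes a value, and that value is placed into the field's low..high bit range. The helper names (set_field, encode_swizzle, split_type, derive_type) are made up for illustration and are not part of isaspec, and the bit position of SWIZ_X (bits 0-1) is an assumption, since the diff context only shows SWIZ_Y, SWIZ_Z and SWIZ_W.

```python
def set_field(word: int, low: int, high: int, value: int) -> int:
    """Place `value` into bits low..high (inclusive) of `word`."""
    width = high - low + 1
    mask = (1 << width) - 1
    if value & ~mask:
        raise ValueError(f"value {value:#x} does not fit in {width} bits")
    return (word & ~(mask << low)) | (value << low)


def encode_swizzle(src: int) -> int:
    """Mirror of the #src-swizzle encode maps from the diff.

    SWIZ_X at bits 0-1 is assumed; Y, Z and W follow the XML shown above.
    """
    word = 0
    word = set_field(word, 0, 1, (src & 0x03) >> 0)  # SWIZ_X (assumed position)
    word = set_field(word, 2, 3, (src & 0x0C) >> 2)  # SWIZ_Y
    word = set_field(word, 4, 5, (src & 0x30) >> 4)  # SWIZ_Z
    word = set_field(word, 6, 7, (src & 0xC0) >> 6)  # SWIZ_W
    return word


def split_type(type_value: int) -> tuple[int, int]:
    """Split a 3-bit type into (TYPE_BIT01, TYPE_BIT2), like the encode maps."""
    return type_value & 0x3, (type_value & 0x4) >> 2


def derive_type(type_bit01: int, type_bit2: int) -> int:
    """Inverse direction, like the derived TYPE expression used for decoding."""
    return type_bit2 << 2 | type_bit01
```

Python integers are arbitrary precision, so the same set_field helper works unchanged for a 128-bit instruction word, for example setting a single bit at position 80.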
I am really happy with the end result, even though it took quite some time from the initial idea to the point when everything was integrated into Mesa’s main git branch.
There is so much more to share - I can’t wait to publish parts II and III.
Last week I started work on adding support to the Etnaviv driver for the NPU inside the NXP i.MX 8M Plus SoC (VeriSilicon's VIPNano-SI+).
This work is sponsored by the open source consultancy Ideas on Board, and will include the same level of support as for the Amlogic A311D SoC, which means full acceleration for the SSDLite MobileDet object detection model.
My current project at Igalia has had me working on Mesa’s software renderers, llvmpipe and lavapipe. I’ve been working to get them running on Android, and I wanted to document the progress I’ve made, the challenges I’ve faced, and talk a little bit about the development process for a project like this. My work is not totally merged into upstream Mesa yet, but you can see the MRs I made here:
VK_EXT_external_memory_dma_buf
Getting system-level software to build and run on Android is unfortunately not straightforward. Since we are doing software rendering, we don’t need a physical device and can instead make use of the Android emulator. If you didn’t know, Android has two emulators: the common one most people use is “goldfish”, and the lesser-known one is “cuttlefish”. For this project I did my work on the cuttlefish emulator, as it’s meant for testing the Android OS itself instead of just Android apps and is more reflective of real hardware. The cuttlefish emulator takes a little bit more work to set up, and I’ve found that it only works properly on Debian-based Linux distros. I run Fedora, so I had to run the emulator in a Debian VM.
Thankfully Google has good instructions for building and running cuttlefish, which you can find here. The instructions show you how to set up the emulator using nightly build images from Google. We’ll also need to set up our own Android OS images, so after we’ve confirmed we can run the emulator, we need to start looking at building AOSP.
For building our own AOSP image, we can also follow the instructions from Google here. For the target we’ll want aosp_cf_x86_64_phone-trunk_staging-eng. At this point it’s a good idea to verify that you can build the image, which you can do by following the rest of the instructions on the page. Building AOSP from source does take a while though, so prepare to wait potentially an entire day for the image to build. Also, if you get errors complaining that you’re out of memory, you can try to reduce the number of parallel builds. Google officially recommends having 64GB of RAM, and I only had 32GB, so some packages had to be built with the parallel builds set to 1 so I wouldn’t run out of RAM.
For running this custom-built image on Cuttlefish, you can just copy all the *.img files from out/target/product/vsoc_x86_64/ to the root cuttlefish directory, and then launch cuttlefish. If everything worked successfully you should be able to see your custom built AOSP image running in the cuttlefish webui.
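If you end up repeating that copy step after every rebuild, it can be scripted. The following is only a small sketch: copy_images is a made-up helper name, and the destination is whatever your own cuttlefish root directory happens to be (the path in the usage comment is a placeholder, not something the cuttlefish tooling defines).

```python
import shutil
from pathlib import Path


def copy_images(product_dir, cuttlefish_dir):
    """Copy every *.img file from the AOSP product output directory into
    the cuttlefish root directory and return the copied file names."""
    product_dir = Path(product_dir)
    cuttlefish_dir = Path(cuttlefish_dir)
    copied = []
    for image in sorted(product_dir.glob("*.img")):
        # copy2 keeps timestamps, which makes it easier to spot stale images
        shutil.copy2(image, cuttlefish_dir / image.name)
        copied.append(image.name)
    return copied


# Hypothetical usage, run from the AOSP checkout:
#   copy_images("out/target/product/vsoc_x86_64", "/path/to/cuttlefish")
```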
Working from the changes in MR !29344, building llvmpipe or lavapipe targeting Android should just work™️. Getting to that stage required a few changes. First, llvmpipe actually already had some support on Android, as long as it was running on a device that supports a DRM display driver. In that case it could use the dri window system integration, which already works on Android. I wanted to get llvmpipe (and lavapipe) running without dri, so I had to add support for Android in the drisw window system integration.
Supporting Android in drisw mainly meant adding support for importing dmabufs as framebuffers. The Android windowing system will provide us with a “gralloc” buffer, which contains a dmabuf fd that represents the framebuffer. Adding support for importing dmabufs in drisw means we can import and begin drawing to these framebuffers. Most of the changes to support that can be found in drisw_allocate_textures and the underlying changes to llvmpipe to support importing dmabufs in MR !27805.
The EGL Android platform code also needed some changes to use the drisw window system code. Previously this code would only work with true dri drivers, but with some small tweaks it was possible to have it initialize the drisw window system and then use it for rendering if no hardware devices are available.
For lavapipe the changes were a lot simpler. The Android Vulkan loader requires your driver to have a HAL_MODULE_INFO_SYM symbol in the binary, so that got created and populated correctly, following other Vulkan drivers in Mesa like turnip. Then the image creation code had to be modified to support the VK_ANDROID_native_buffer extension, which allows the Android Vulkan loader to create images using Android native buffer handles. Under the hood this means getting the dmabuf fd from the native buffer handle. Thankfully Mesa already has some common code to handle this, so I could just use that. Some other small changes were also necessary to address crashes and other failures that came up during testing.
With the changes out of the way we can now start building Mesa on Android. For this project I had to update the Android documentation for Mesa to include steps for building LLVM for Android, since the version Google ships with the NDK is missing libraries that llvmpipe/lavapipe need to function. You can see the updated documentation here and here. After sorting out LLVM, building llvmpipe/lavapipe is the same as building any other Mesa driver for Android: we set up a cross file to tell meson how to cross compile and then we run meson. At this point you could manually modify the Android image and copy these files to the VM, but I also wanted to support building a new AOSP image directly including the driver. In order to do that you also have to rename the driver binaries to match Android’s naming convention, and make sure SO_NAME matches as well. If you check out this section of the documentation I wrote, it covers how to do that.
If you followed all of that you should have built a version of llvmpipe and lavapipe that you can run on Android’s cuttlefish emulator.
Over the last months I've started looking into a few of the papercuts that affect graphics tablet users in GNOME. So now that most of those have gone in, let's see what has happened:
The calibration code, a descendant of the old xinput_calibrator tool, was in pretty rough shape and didn't work particularly well. That's now fixed and I've made the calibrator a little bit easier to use too. Previously the timeout was quite short, which made calibration quite stressful; that timeout is now per target rather than for the whole calibration process. Likewise, the calibration targets now accept larger variations - something probably not needed for real use-cases (you want the calibration to be exact) but it certainly makes testing easier since clicking near the target is good enough.
The other feature added was to allow calibration even when the tablet is manually mapped to a monitor. Previously this only worked in the "auto" configuration but some tablets don't correctly map to the right screen and lost calibration abilities. That's fixed now too.
A picture says a thousand words, except in this case where the screenshot provides no value whatsoever. But here you have it anyway.
Traditionally, GNOME would rely on libwacom to get some information about tablets so it could present users with the right configuration options. The drawback was that a tablet not recognised by libwacom didn't exist in GNOME Settings - and there was no immediately obvious way of fixing this, the panel either didn't show up or (with multiple tablets) the unrecognised one was missing. The tablet worked (because the kernel and libinput didn't require libwacom) but it just couldn't be configured.
libwacom 2.11 changed the default fallback tablet to be a built-in one since this is now the most common unsupported tablet we see. Together with the new fallback handling in GNOME Settings this means that any unsupported tablet is treated as a generic built-in tablet and provides the basic configuration options for those (Map to Monitor, Calibrate, assigning stylus buttons). The tablet should still be added to libwacom but at least it's no longer a requirement for configuration. Plus there's now a link to the GNOME Help to explain things. Below is a screenshot of what this looks like (after modifying my libwacom to no longer recognise the tablet, poor Intuos).
For historical reasons, the names of the display in the GNOME Settings Display configuration differed from the one used by the Wacom panel. Not ideal and that bit is now fixed with the Wacom panel listing the name of the monitor and the connector name if multiple monitors share the same name. You get the best value out of this if you have a monitor vendor with short names. (This is not a purchase recommendation).
If you're an avid tablet user, you may have multiple stylus tools - but it's also likely that you have multiple tools of the same type which makes differentiating them in the GUI hard. Which is why they're highlighted now - if you bring the tool into proximity, the matching image is highlighted to make it easier to know which stylus you're about to configure. Oh, and in the process we added a new SVG for AES styli too to make the picture look more like the actual physical tool. The <blink> tag may no longer be cool but at least we can disco our way through the stylus configuration now.
GNOME Settings historically presents a slider from "Soft" to "Firm" to adjust the feel of the tablet tip (which influences the pressure values sent to the application). Behind the scenes this was converted into a set of 7 fixed curves, but thanks to an old mutter bug those curves only covered a small amount of the possible range. This is now fixed so you can really go from pencil-hard to jelly-soft, and the slider now controls an almost-continuous range instead of just 7 curves. Behold, a picture of slidery goodness:
And of course a bunch of miscellaneous fixes. Things that I quickly found were support for Alt in the tablet pad keymappings, fixing of erroneous backwards movement when wrapping around on the ring, a long-standing stylus button mismatch, better stylus naming and a rather odd fix causing configuration issues if the eraser was the first tool ever to be brought into proximity.
There are a few more things in the pipe but I figured this is enough to write a blog post so I no longer have to remember to write a blog post about all this.
I don’t have a lot of time. There’s a gun to my head. Literally.
John Eldenring is here, and he has a gun pointed at my temple, and he’s telling me that if I don’t start playing his new downloadable content now, I won’t be around to make any more posts.
Today’s game is Blazblue Centralfiction. I don’t know what this game is, I don’t know what it’s about, I don’t even know what genre it is, but it’s got problems.
What kinds of problems?
This is a DX game that embeds video files. In proton, GStreamer is used to play the video files while DXVK goes brrrr in the background. GStreamer uses GL. What do you think this looks like under perf?
GStreamer here does the classic memcpy PIXEL_PACK_BUFFER texture upload -> glReadPixels memcpy download in order to transcode the video files into something renderable. I say classic because this is a footgun on both ends:
This results in blocking at both ends of the transcode pipeline. A better choice, for Mesa’s current state of optimization, would’ve been to do glTexImage -> glGetTexImage. This would leverage all the work that I did however many years ago in this post for PBO downloads using compute shaders.
Still, this is the future, so Mesa must adapt. With a few copy/pasted lines and a sprinkle of magical SGC dust (massive compute-based PBO shaders), the flamegraph becomes:
Flamegraphs aren’t everything though. This has more obvious real world results just from trace replay times:
# before
$ time glretrace -b BlazBlue.Centralfiction.trace
/home/zmike/.local/share/Steam/steamapps/common/Proton - Experimental/files/bin/wine-preloader
Rendered 0 frames in 15.4088 secs, average of 0 fps
glretrace -b BlazBlue.Centralfiction.trace 13.88s user 0.35s system 91% cpu 15.514 total
# after
$ time glretrace -b BlazBlue.Centralfiction.trace
/home/zmike/.local/share/Steam/steamapps/common/Proton - Experimental/files/bin/wine-preloader
Rendered 0 frames in 10.6251 secs, average of 0 fps
glretrace -b BlazBlue.Centralfiction.trace 9.83s user 0.42s system 95% cpu 10.747 total
Considering this trace only captured the first 4-5 seconds of a 98 second movie, I’d say that’s damn good.
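For anyone who wants the two replay times above as a single number, here is the arithmetic as a tiny Python snippet. It only uses the "Rendered 0 frames in ... secs" figures from the glretrace output; nothing else is measured or assumed.

```python
before_s = 15.4088  # glretrace wall time before the change
after_s = 10.6251   # glretrace wall time after the change

speedup = before_s / after_s
reduction = (before_s - after_s) / before_s

print(f"{speedup:.2f}x faster, {reduction:.1%} less replay time")
# prints "1.45x faster, 31.0% less replay time"
```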
Check out the MR if you want to test.
Hi all!
This status update will be shorter than usual because I had a lot less free time for my open-source projects than usual this month. Indeed, I recently joined SNCF Réseau (the company responsible for the French railway infrastructure) to work on OSRD, an open-source tool to design and operate railway networks. The project’s immediate goal is to fit new freight trains in an existing timetable a few days in advance, but the longer term scope is much larger. Working partly on-site in a big team is quite the change of pace but I like it so far!
I’ve released a lot of new versions this month! The big one is Wayland 1.23.0
which adds a mechanism to set the size of the internal connection buffer, an
enum-header mode for wayland-scanner to generate a header with only enums,
auto-generated enum validator functions for compositors, a new
deprecated-since attribute to mark parts of protocols as deprecated, and a
few other niceties. libliftoff 0.5.0 prioritizes layers that are frequently
updated, adds performance optimizations (a fast path when the intersection of
layers doesn’t change, a fast path for standard KMS properties, an early return
to avoid needlessly trying potential solutions) and a timeout to avoid stalling
the compositor for too long. soju 0.8.0 adds a new file upload IRC extension,
adds support for Unix domain sockets for HTTP and WebSocket listeners and
better spreads the load on multiple upstream servers on large deployments.
kanshi 1.7.0 adds output defaults and aliases. Phew, that was a mouthful!
In other Wayland news, the xdg-toplevel-icon protocol got merged after a long
and difficult process. I really hope we can improve the contribution experience
for future proposals. We realized that the governance document was missing the
review requirements, so I fixed that along the way. The wlroots
linux-drm-syncobj-v1 implementation has been merged (it’s been used by
gamescope for a few months - note that this does not include the wlroots
renderer, backend and scene-graph changes). Multiple wlroots versions can now
be installed side-by-side thanks to Violet Purcell. Sway has gained a new
color_profile output command to apply an ICC profile to an
output thanks to M. Stoeckl. A high-level API for colorimetry has
been added in libdisplay-info thanks to Pekka Paalanen, and support for HDMI
audio data blocks has been implemented by Sebastian Wick.
Let’s switch gears and talk about IRC updates. I’ve submitted an IRCv3 proposal
to fix a few ISUPPORT deficiencies - it will need a lot more
feedback and implementations before it can be accepted. I’ve continued
debugging Goguma’s duplicate message bug and I’m pleased to
announce that I’ve almost completely fixed it (I still experience it very
rarely somehow…). delthas has added support for adaptive color schemes (Goguma
now uses your preferred accent color if any). I’ve performed some more boring
maintenance tasks, for instance adding support for newer Android Gradle Plugin
version to webcrypto.dart, one of Goguma’s dependencies.
One last update to wrap up this post: Zhi Qu has added support for the ID extension to go-imap, which is sadly required to connect to some servers. That’s all for now, see you next month!
There are times when you feel you’re making no progress and there are other times when things feel like they are landing in quick succession. Luckily this is definitely the latter, with a lot of our long term efforts finally coming over the finish line. As many of you probably know, our priorities tend to be driven by a combination of what our RHEL Workstation customers need, what our hardware partners are doing and what is needed for Fedora Workstation to succeed. We also try to be good upstream partners and do patch reviews and participate where we can in working on upstream standards, especially those of concern to our RHEL Workstation and Server users. So when all those things align we are at our most productive, and that seems to be what is happening now. Everything below is a feature in flight that will land in Fedora Workstation 41 at the latest.
Artificial Intelligence
And it is not just Granite, we are ensuring other major AI projects will work with Fedora too, like Meta’s popular Llama LLM. And a big step for that is how Tom Rix has been working on bringing in AMD accelerated support (ROCm) for PyTorch to Fedora. PyTorch is an optimized tensor library for deep learning using GPUs and CPUs. The long term goal is that you should be able to just install PyTorch on Fedora and have it work hardware accelerated with the chipsets of any of the 3 major GPU vendors.
NVIDIA in Fedora
So the clear market leader at the moment for powering AI workloads is NVIDIA, so I am also happy to let you know about two updates we are working on that will make your life better on Fedora when using NVIDIA GPUs, be that for graphics, compute or Artificial Intelligence. For the longest time we have had easy install of the NVIDIA driver through GNOME Software in Fedora Workstation. Unfortunately this setup never dealt with what is now the default usecase, which is using it with a system that has secure boot enabled. So the driver install was dropped from GNOME Software in our most recent release, as the only way for people to get it working was through using mokutil on the command line, but the UI didn’t tell you that. Well, we of course realize that sending people back to the command line to get this driver installed is highly unfortunate, so Milan Crha has been working together with Allan Day and Jakub Steiner to come up with a streamlined user experience in GNOME Software that lets you install the binary NVIDIA driver and provides an integrated graphical user interface to help you sign the kernel module for use with secure boot. This is a bit different from what we are doing in RHEL, for instance, where we are working with NVIDIA to provide pre-signed kernel modules, but that is a lot harder to do in Fedora due to the rapidly updating kernel versions, which most Fedora users appreciate as a big plus. So instead what we are opting for in Fedora is, as I said, to make it simple for you to self-sign the kernel module for use with secure boot. We are currently looking at when we can make this feature available, but no later than Fedora Workstation 41 for sure.
Toolbx getting top notch NVIDIA integration
We are also hoping the packaging fixes to subscription manager will land soon as that will make using RHEL containers on Fedora a lot smoother. While this feature basically already works as outlined here we do hope to make it even more streamlined going forward.
Open Source NVIDIA support
Of course, being Red Hat, we haven’t forgotten about open source here. You have probably heard about Nova, our new Rust based upstream kernel driver for NVIDIA hardware, which will provide optimized support for the hardware supported by NVIDIA’s firmware (basically all newer ones), accelerate Vulkan through the NVK module and provide OpenGL through Zink. That effort is still in quite early days, but there are some really cool developments happening around Nova that I am not at liberty to share yet; I hope to be able to talk about those soon.
High Dynamic Range (HDR)
Jonas Ådahl, after completing the remote access work for GNOME under Wayland, has moved his focus to helping land the HDR support in mutter and GNOME Shell. He recently finished rebasing his HDR patches onto a wip merge request from Georges Stavracas which ported gnome-shell to using paint nodes, so the HDR enablement in mutter and GNOME Shell is now a set of 3 patches.
With this the work is mostly done; what is left is avoiding overexposure of the cursor and inhibiting direct scanout.
We also hope to help finalize the upstream Wayland specs soon so that everyone can implement this and know the protocols are stable and final.
DRM leasing – VR Headsets
The most common usecase for DRM leasing is VR headsets, but it is also a useful feature for things like video walls. José Expósito is working on finalizing a patch for it using the Wayland protocol adopted by KDE and others. We were somewhat hesitant to go down this route as we felt a portal would have been a better approach, especially as a lot of our experienced X.org developers are worried that Wayland is in the process of replicating one of the core issues with X through the unmanageable plethora of Wayland protocols that is being pushed. That said, the DRM leasing stuff was not a hill worth dying on here; getting this feature out to our users in a way they could quickly use was more critical, so DRM leasing will land soon through this merge request.
Explicit sync
Another area that we have put a lot of effort into together with our colleagues at NVIDIA is landing support for what is called explicit sync in the Linux kernel and the graphics drivers. The Linux graphics stack was up to this point using something called implicit sync, but the NVIDIA drivers did not work well with that and thus people were experiencing ‘blinking’ applications under Wayland. So we worked with NVIDIA and have landed the basic support in the kernel and in GNOME, and thus once the 555 release of the NVIDIA driver is out we hope the ‘blinking’ issues are fully resolved for your display. There has been some online discussion about potential performance gains from this change too, across all graphics drivers, but it is still unclear whether there will be real world measurable gains from adding explicit sync. I have heard knowledgeable people argue both sides, with some saying there should be visible performance gains while others say the potential gains will be so specific that unless you write a test to benchmark it explicitly you will not be able to detect a difference. But what is beyond doubt is that this will make using the NVIDIA stack with Wayland a lot better, and that is a worthwhile goal in itself. The one item we are still working on is integrating the PipeWire support for explicit sync into our stack, because without it you might have the same flickering issues with PipeWire streams on top of the NVIDIA driver that you have up to now seen on your screen. So for instance if you are using PipeWire for screen capture it might look fine on screen with the fixes already merged, but the captured video has flickering. Wim Taymans landed some initial support in PipeWire already, so now Michel Dänzer is working on implementing the needed bits for PipeWire in mutter. At the same time Wim is working on ensuring we have a testing client available to verify the compositor support.
Once everything has landed in mutter and we have been able to verify that it works with the sample client, we will need to add support to client applications interacting with PipeWire, like Firefox, Chrome, OBS Studio and GNOME-remote-desktop.
In the previous post I gave an introduction to shader linking. Mike has already blogged about this topic a while ago, focusing mostly on Zink, and now it’s time for me to share some of my adventures about it too, but of course focusing on how we improved it in RADV and the various rabbit holes that this work has led me to.
In Mesa, we mainly represent shaders in NIR (the NIR intermediate representation) and that is where link-time optimizations happen.
The big news is that Marek Olšák wrote a new pass called nir_opt_varyings which is an all-in-one solution to all the optimizations above, and now authors of various drivers are rushing to take advantage of this new code. We can’t miss the opportunity to start using nir_opt_varyings in RADV too, so that’s what I’ve been working on for the past several weeks.
It is intended to replace all of the previous linking passes and can do all of the following:
So, I started by adding a call to nir_opt_varyings and went from there.
The naive reader might think using the new pass is as simple as going to radv_link_shaders and calling nir_opt_varyings there. But it can never be that easy, can it?
The issue is that one can’t simply deal with shader linking. We also need to get our hands dirty with all the details of how shader I/O works on a very low level.
The first problem is that RADV’s current linking solution radv_link_shaders works with the shaders when their I/O variables are still intact, meaning that they are still treated as dereferenced variables. However, nir_opt_varyings expects to work with explicit I/O.
In fact, all of RADV’s linking code works based on I/O variables and dereferences, so much so that it’d be too much to refactor all of that (and such refactoring probably would have its own set of problems and rabbit holes). So the solution here is to add a new linking step that runs after nir_lower_io and call the new pass in there.
After writing the above, I quickly discovered that some tests crash, others fail, and most applications render incorrectly. So I’ve set out on a journey to solve all that.
Like every driver, RADV needs to collect certain information about every shader in order to program the GPU’s registers correctly before a draw. This information includes the number of inputs / outputs (and which slots are used), in order to determine how much LDS needs to be allocated (in case of tessellation shaders) or how many FS inputs are needed, etc.
This is done by radv_nir_shader_info_pass which also operated on I/O variables, rather than information from I/O instructions. However, after nir_lower_io the explicit I/O instructions may no longer be in sync with the original I/O variables. This wasn’t a problem before because we haven’t done any optimizations on explicit I/O so we could rely on I/O variable information being accurate.
However, in order to use an optimization based on explicit I/O, the RADV shader info pass had to be refactored to collect its information from explicit I/O intrinsics, otherwise we wouldn’t be able to have up-to-date information after running nir_opt_varyings, resulting in wrong register programming.
Driver location assignment is how the driver decides which input and output goes to which “slot” or “address”. RADV did this after linking, but still based on I/O variables, so the mechanism needed to be re-thought.
It also came to my attention that there are some plans to deprecate the concept of driver locations in NIR in favour of so-called I/O semantics. So I had to do my refactor with this in mind; I spent some effort on removing our uses of driver locations in order to make the new code somewhat future-proof.
For most stage combinations, nir_recompute_io_bases can be used as a stopgap to simply reassign the driver locations based on the assumption that a shader will only write the outputs that the next stage reads. However, this is somewhat difficult to achieve for tessellation shaders because of their unique “brain puzzle” (TCS can read its own outputs, so the compiler can’t simply remove TCS outputs when TES doesn’t read them).
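The difference between "normal" stages and TCS can be shown with a toy model. This is only a sketch using plain Python sets with made-up output names; it is not how NIR or RADV represent I/O, and removable_outputs is a hypothetical helper, not a Mesa function.

```python
def removable_outputs(written, read_by_next_stage, read_by_self=()):
    """Outputs that can be dropped from a shader stage.

    An output is only dead if neither the next stage nor the stage
    itself reads it back. For most stages read_by_self is empty; a TCS
    can read its own outputs, so it may need to keep some of them.
    """
    return set(written) - set(read_by_next_stage) - set(read_by_self)


# A vertex shader feeding a fragment shader: unread outputs are simply dead.
vs_dead = removable_outputs(
    written={"color", "uv", "debug"},
    read_by_next_stage={"color", "uv"},
)

# A TCS feeding a TES: "scratch" is never read by the TES, but the TCS
# reads it back itself, so it has to stay.
tcs_dead = removable_outputs(
    written={"pos", "scratch", "unused"},
    read_by_next_stage={"pos"},
    read_by_self={"scratch"},
)
```

Here vs_dead ends up containing only "debug" and tcs_dead only "unused", which is why the "outputs written equals inputs read" assumption holds for the first case but not for tessellation.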
Due to the unique brain puzzle that is I/O in tessellation shaders, they require extra brain power… Shader linking between TCS and TES was implemented all over the place; even our backend compiler ACO had some dependence on TCS linking information, which made any kind of refactor difficult.
At the time, VK_EXT_shader_object was new, and our implementation used so-called shader epilogs to deal with the dynamic states of the TCS (including in OpenGL for RadeonSI), which is what ACO needed the linking information for.
After a discussion with the team, we decided that TCS epilogs had to go; not only because of my shader linking effort, but also to make the code base saner and more maintainable.
This effort made our code lighter by about 1200 LOC.
On AMD hardware, TCS outputs are implemented using LDS (when the TCS reads them) and VRAM (when the TES reads them), which means that the driver has two different ways to store these variables depending on their use. However, since the code was based on driver_location and there can only be one driver location, we used effectively the same location for both LDS and VRAM, which was suboptimal.
With the TCS epilog out of the way, now ac_nir_lower_tess_io_to_mem is free to choose the LDS layout because the drivers no longer need to generate a TCS epilog that would need to make assumptions about their memory layout.
One of the innovations of nir_opt_varyings
is that it can pack two 16-bit inputs and
outputs together into a single 32-bit slot in order to save I/O space.
Unfortunately, RADV didn’t really work at all with packed 16-bit I/O.
Practically, this meant that every test case using 16-bit I/O failed.
I considered disabling 16-bit packing in nir_opt_varyings
but eventually
I decided to just implement it in RADV properly instead.
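To make the packing idea concrete, here is a minimal sketch of what sharing one 32-bit slot between two 16-bit values means at the bit level. The helper names are hypothetical and are not part of NIR or RADV; the real pass works on I/O intrinsics rather than on raw integers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers, not Mesa API: two 16-bit varyings share one
 * 32-bit I/O slot component, with the first value in the low half. */
static inline uint32_t pack_2x16(uint16_t lo, uint16_t hi)
{
   return (uint32_t)lo | ((uint32_t)hi << 16);
}

/* Read back one half of a packed slot: half 0 is the low 16 bits,
 * half 1 is the high 16 bits. */
static inline uint16_t unpack_16(uint32_t slot, unsigned half)
{
   assert(half < 2);
   return (uint16_t)(slot >> (16 * half));
}
```

The producer stage would store `pack_2x16(a, b)` and the consumer would read `unpack_16(slot, 0)` and `unpack_16(slot, 1)`, so both stages have to agree on which half holds which varying.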
While writing patches to handle packed 16-bit I/O, we noticed how repetitive the code was; basically the same thing was implemented several times with subtle differences. Of course, this had to be dealt with.
Mesh shading pipelines have so-called per-primitive I/O, which needs special handling.
For example, it is wrong to pack per-primitive and per-vertex inputs or outputs
into the same slot. Because OpenGL doesn’t have per-primitive I/O, this was
left unsolved and needed to be fixed in nir_opt_varyings
before RADV could use it.
nir_recompute_io_bases needed to learn about per-primitive I/O.
nir_opt_varyings itself needed to learn per-primitive I/O too.
Using nir_opt_varyings
had a slight regression in shader instruction counts
due to inter-stage code motion.
Essentially, it moved some instructions into the previous stage, which prevented
nir_opt_load_store_vectorize
from deducing the alignment of some memory instructions,
resulting in worse vectorization.
The solution was to add a new pass, nir_opt_load_store_update_alignments,
based on the code already written for nir_opt_load_store_vectorize.
It just updates the alignment of each memory access.
We run that pass before nir_opt_varyings, thereby preserving the alignment info before it is lost.
FLAT
FLAT
fragment shader inputs are generally better because they require no interpolation
and allow packing 32-bit and 16-bit inputs together, so
nir_opt_varyings
takes special care trying to promote interpolated inputs to
FLAT
when possible.
However, there was a regression caused by this mechanism unintentionally creating more inputs than there were before.
mediump I/O
The mediump
qualifier effectively means that the application allows the driver to
use either 16-bit or 32-bit precision for a variable, whichever it
deems more optimal for a specific shader.
It turned out that RADV didn’t deal with mediump
I/O at all.
This was fine, because it just meant they got treated as normal 32-bit I/O,
but it became a problem when it turned out that nir_opt_varyings
was unaware of mediump
and mixed it up with other inputs, which confused the vectorizer.
Side note: I am not quite done yet with mediump.
In the future I plan to lower it to 16-bit precision,
but only when I can make sure it doesn’t result in more inputs.
Vulkan has some functionality that allows applications to do custom FS input interpolation
in shader code, which from the driver perspective means that each FS invocation needs to
access the output of each vertex. (This requires special register programming from RADV
on the FS inputs.) This is something that also needed to be added to nir_opt_varyings
before RADV could use it.
The nir_opt_varyings
pass only works on so-called scalarized I/O, meaning that it can
only deal with instructions that write a single output component or read a single
input component. Fortunately, there is already a handy nir_lower_io_to_scalar
pass which we can use.
The downside is that scalarized shader I/O becomes sub-optimal (on all stages on AMD HW except VS -> PS) because the I/O instructions are really memory accesses, which are simply more optimal when more components are accessed by the same instruction.
This is solved in two ways:
Improving the nir_opt_load_store_vectorize pass to better deal with
lowered shader I/O, meaning that it can now better vectorize the memory access instructions
that are generated by scalarized I/O.
The nir_opt_vectorize_io pass, which can re-vectorize the
I/O intrinsics (before they are lowered to memory access).
One of the main questions with any kind of optimization is how to measure the effects of that optimization in an objective way. This is a solved problem: we have shader stats for this, which contain information about various aspects of a shader, such as the number of instructions, register use etc. Except, there were no stats about I/O, so this needed to be added.
These stats are useful to prove that all of this work actually improved things. Furthermore, they turned out useful for finding bugs in existing code as well.
Considering that nir_opt_varyings
is supposed to be an all-in-one linking solution,
a naive person like me would assume that once the driver had been refactored to use
nir_opt_varyings
we can simply stop using all of the old linking passes. But…
It turns out that getting rid of any of the other passes seems to cause regressions in shader stats
such as instruction count (which we don’t want).
Why is this?
Due to the order in which we call various NIR optimizations, it seems that we can’t effectively
take advantage of new optimization opportunities after nir_opt_varyings. This means that
we either have to re-run all expensive optimizations once more after the new linking step,
or we will have to reorder our optimizations in order to be able to remove the old linking
passes.
While we haven’t yet found the exit from all the rabbit holes we fell into, we made really good progress and I feel that all of our I/O code ended up better after this effort. Some work on RADV shader I/O (as of June 2024) remains:
Lowering mediump to 16-bit when beneficial
I owe a big thank you to Marek for developing nir_opt_varyings
in the first place
and helping me adopt it every step of the way.
Also thanks to Samuel, Rhys and Georg for the good conversations we had and for reviewing my patches.
In the past few weeks I have been working on, among other things, a kernel driver for the NPU in the Rockchip RK3588 SoC, new from the ground up.
It is now fully working and after a good amount of polishing I sent it yesterday to the kernel mailing lists, for review. Those interested can see the code and follow the review process at this link.
The kernel driver is able to fully use the three cores in the NPU, giving us the possibility of running 4 simultaneous object detection inferences such as the one below on a stream, at almost 30 frames per second.
The userspace driver is in a less polished state, but fully featured at this stage. I will be working on this in the next few days so it can be properly submitted for review.
This is the first accelerator-only driver for an edge NPU submitted to the mainline kernel, and hopefully it can serve as a template for the next ones to come, as the differences among NPUs of different vendors are relatively superficial.
Yesterday evening we released systemd v256 into the wild. While other projects, such as Firefox, are just about to leave the 7bit world and enter 8bit territory, we already entered 9bit version territory! For details about the release, see our announcement mail.
In the weeks leading up to this release I have posted a series of serieses of posts to Mastodon about key new features in this release. Mastodon has its goods and its bads. Among the latter is probably that it isn't that great for posting listings of serieses of posts. Hence let me provide you with a list of the relevant first post in the series of posts here:
.v/ Directories
X_SYSTEMD_UNIT_ACTIVE= sd_notify() Messages
ProtectSystem=
run0 As sudo Replacement
systemd-nspawn
ssh into systemd-homed Accounts
systemd-vmspawn
systemd-sysext
systemctl sleep
systemd-ssh-generator
systemd-cryptenroll without device argument
dlopen() ELF Metadata
Capsules
I intend to do a similar series of serieses of posts for the next systemd release (v257), hence if you haven't left tech Twitter for Mastodon yet, now is the opportunity.
And while I have you: note that the All Systems Go 2024 Conference (Berlin) Call for Papers ends 😲 THIS WEEK 🤯! Hence, HURRY, and get your submissions in now, for the best low-level Linux userspace conference around!
Shader linking is one of the more complicated topics in graphics driver development. It is both a never ending effort in the pursuit of performance and a black hole in which driver developers disappear. In this post, I intend to give an introduction to what shader linking is and why it’s worth spending our time working on it in general.
Shaders are smallish programs that run on your GPU and are necessary for any graphical application in order to draw things on your screen using your GPU. A typical game can have thousands, or even hundreds of thousands of shaders. Because every GPU has its own instruction set with major differences, it is generally the responsibility of the graphics driver to compile shaders in a way that is most optimal on your GPU in order to make games run fast.
One of the ways of making them faster is linking. Many times, the driver knows exactly which shaders are going to be used together and that gives it the opportunity to perform optimizations based on the assumption that two shaders are only ever used together (and never with other shaders).
In Vulkan, there are now three ways for an application to create graphics shaders and all of these have a possibility for utilizing linking:
In Mesa, we mainly represent shaders in NIR (the NIR intermediate representation) and that is where link-time optimizations happen.
Shader linking allows the compiler stack to make assumptions about a shader by looking at another shader that it is used together with. Let’s take a look at what optimizations are possible.
The compiler may look at the outputs of a shader and the inputs of the next stage, and delete unnecessary I/O. For example, when you have a pipeline with VS (vertex shader) and FS (fragment shader):
As a result, both the VS and FS will have fewer IO instructions and more optimal algebraic instructions. The same ideas are basically applicable to any two shader stages.
This first group of optimizations is the easiest to implement and has been supported
by NIR for a long time through nir_remove_dead_variables, nir_remove_unused_varyings
and nir_link_opt_varyings.
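As a rough illustration of the bookkeeping behind this first group of optimizations, the sketch below finds removable I/O by comparing the slots one stage writes with the slots the next stage reads. The struct and function names are hypothetical and are not NIR data structures. It also ignores outputs that fixed-function hardware consumes (such as position), which a real driver must keep.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical summary of one linked stage pair; not a Mesa structure. */
struct io_link_info {
   uint64_t outputs_written; /* bit N set: the producer writes slot N */
   uint64_t inputs_read;     /* bit N set: the consumer reads slot N */
};

/* Slots the producer writes but the consumer never reads. The stores,
 * and any computation that only feeds them, are candidates for removal. */
static uint64_t dead_outputs(const struct io_link_info *link)
{
   assert(link);
   return link->outputs_written & ~link->inputs_read;
}

/* Slots the consumer reads but the producer never writes. The loads
 * have no producer, so the compiler is free to replace them. */
static uint64_t unwritten_inputs(const struct io_link_info *link)
{
   assert(link);
   return link->inputs_read & ~link->outputs_written;
}
```

With `outputs_written = 0x0b` and `inputs_read = 0x06`, slots 0 and 3 are dead outputs and slot 2 is an input that nothing writes.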
Shader linking also lets the compiler “compact” the output space, by reordering I/O variables in both shaders so that they use the least amount of space possible.
For example, it may be possible that they have “gaps” between the I/O slots that they use, and the compiler can then be smart and rearrange the I/O variables so that there are as few gaps as possible.
As a result, less I/O space will be used. The exact benefit of this optimization depends highly on the hardware architecture and which stages are involved. But generally speaking, using less space can mean less memory use (which can translate into better occupancy or higher throughput), or simply that the next stage is launched faster, or will use fewer registers, or needs fewer fixed-function HW resources, etc.
NIR has also supported I/O compaction in nir_compact_varyings.
However, its implementation was far from perfect: the main challenge was
handling indirect indexing, and it lacked packing of 16-bit varyings into 32-bit slots.
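The gap-closing idea can be sketched in a few lines. This is a simplified illustration, not how nir_compact_varyings is written: it works on whole slots, ignores components and indirect indexing, and `compact_slots` and `MAX_SLOTS` are made-up names.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_SLOTS 32

/* Assign every used slot a new location with no gaps in between.
 * new_loc[old] receives the compacted location, or -1 if slot "old"
 * was unused. Returns the number of slots in use after compaction. */
static unsigned compact_slots(uint32_t used_mask, int new_loc[MAX_SLOTS])
{
   assert(new_loc);
   unsigned next = 0;

   for (unsigned old = 0; old < MAX_SLOTS; old++) {
      if (used_mask & (UINT32_C(1) << old))
         new_loc[old] = (int)next++;
      else
         new_loc[old] = -1;
   }

   return next;
}
```

A pair of shaders using slots 0, 2 and 5 would end up using slots 0, 1 and 2. The same remapping table has to be applied to the outputs of one stage and the inputs of the next, which is why this only works when the two shaders are linked.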
Also known as inter-stage code motion, this is a complex optimization that has two main goals:
This concept is all-new in Mesa and hasn’t existed until Marek wrote nir_opt_varyings
recently.
At this point you might ask yourself the question, why is all of this necessary? In other words, why do shaders actually need these optimizations? Why don’t app developers write shaders that are already optimal?
The answer might surprise you.
Many times, the same shader is reused between different pipelines, in which case the application developer needs to write them in a way in which they are interchangeable. This is simply a good practice from the perspective of the application developer, reducing the number of shaders they need to maintain.
Sometimes, applications effectively generate different shaders from the same source using ifdefs, specialization constants etc.
Even though the same source shader was written to be usable with multiple other shaders, in each pipeline the driver can deal with it as if it were a different shader, and in each pipeline the shader will be linked to the other shaders in that specific pipeline.
The big news is that Marek Olšák wrote a new pass called
nir_opt_varyings
which is an all-in-one solution to all the optimizations above, and
now authors of various drivers are rushing to take advantage of this new code.
I’ll share my experience of using that in RADV in the next blog post.
This is a long-awaited update to the previous mesh shading related posts. RDNA3 brings many interesting improvements to the hardware which simplify how mesh shaders work.
RDNA2 already supported mesh and task shaders, but mesh shaders had a big caveat regarding how outputs work: each shader invocation could only really write up to 1 vertex and 1 primitive, which meant that the shader compiler had to work around that to implement the programming model of the mesh shading API.
On RDNA2 the shader compiler had to:
RDNA3 changes how shader outputs work on pre-rasterization stages (including VS, TES, GS, MS).
Previous architectures had a special buffer called parameter cache where the pre-rasterization stage stored positions and generic output attributes for fragment shaders (pixel shaders) to read.
The parameter cache was removed from RDNA3 in favour of the attribute ring which is basically a buffer in VRAM. Shaders must now store their outputs to this buffer and after rasterization, the HW reads the attributes from the attribute ring and stores them to the LDS space of fragment shaders.
When I first heard about the attribute ring I didn’t understand how this is an improvement over the previous design (VRAM bandwidth is considered a bottleneck in many cases), but then I realized that this is meant to work together with the Infinity Cache that these new chips have. In the ideal access pattern, each attribute store would overwrite a full cache line so the shader won’t actually touch VRAM.
For mesh shaders, this has two consequences:
RADV already supports the attribute ring for VS, TES and GS so we have some experience with how it works and only needed to apply that to mesh shaders.
For non-generic output attributes (such as position, clip/cull distances, etc.) we
still need to use exp
instructions just like the old hardware. However, these now
have a new mode called row export which allows each lane to write not only its own
outputs but also others in the same row.
The legacy fast launch mode is essentially the same thing as RDNA2 had, so in this mode mesh shaders can be compiled with the same structure and the compiler only needs to be adjusted to use the attribute ring.
The drawback of this mode is that it still has the same issue with workgroup size as RDNA2 had. So this is just useful for helping driver developers port their code to the new architecture but it doesn’t allow us to fully utilize the new capabilities of the hardware.
The initial MS implementation in RADV used this mode.
In this mode, the number of HW shader invocations is determined similarly to how a compute shader would work, and there is no need to match the number of vertices and primitives in this mode.
Thanks to Rhys for working on this and enabling the new mode on RDNA3.
Based on the information we can glean from the open source progress (in particular, the published register files) happening thus far, we think RDNA4 will only support this new mode.
I’ve wanted to write about this for some time, but somehow forgot that I have a blog… Sorry!
As always, what I discuss here is based on open source driver code including mesa (RadeonSI and RADV) and AMD’s reference driver code.
Back in the day when presumably at least someone was young, the venerable xsetwacom tool was commonly used to configure wacom tablet devices on Xorg [1]. This tool is going dodo in Wayland because, well, a tool that is specific to an X input driver kinda stops working when said X input driver is no longer being used. Such is technology, let's go back to sheep farming.
There's nothing hugely special about xsetwacom, it's effectively identical to the xinput commandline tool except for the CLI that guides you towards the various wacom driver-specific properties and knows the right magic values to set. Like xinput, xsetwacom has one big peculiarity: it is a fire-and-forget tool and nothing is persistent - unplugging the device or logging out would vanish the current value without so much as a "poof" noise [2].
It also somewhat clashes with GNOME (or any DE, really). GNOME configuration works so that GNOME Settings (gnome-control-center) and GNOME Tweaks write the various values to the gsettings. mutter [3] picks up changes to those values and in response toggles the X driver properties (or in Wayland the libinput context). xsetwacom short-cuts that process by writing directly to the driver but properties are "last one wins" so there were plenty of use-cases over the years where changes by xsetwacom were overwritten.
Anyway, there are plenty of use-cases where xsetwacom is actually quite useful, in particular where tablet behaviour needs to be scripted, e.g. switching between pressure curves at the press of a button or key. But xsetwacom cannot work under Wayland because a) the xf86-input-wacom driver is no longer in use, b) only the compositor (i.e. mutter) has access to the libinput context (and some behaviours are now implemented in the compositor anyway) and c) we're constantly trying to think of new ways to make life worse for angry commenters on the internets. So if xsetwacom cannot work, what can we do?
Well, most configurations possible with xsetwacom are actually available in GNOME. So let's make those available to a commandline utility! And voila, I present to you gsetwacom, a commandline utility to toggle the various tablet settings under GNOME:
$ gsetwacom list-devices
devices:
- name: "HUION Huion Tablet_H641P Pen"
  usbid: "256C:0066"
- name: "Wacom Intuos Pro M Pen"
  usbid: "056A:0357"
$ gsetwacom tablet "056A:0357" set-left-handed true
$ gsetwacom tablet "056A:0357" set-button-action A keybinding "<Control><Alt>t"
$ gsetwacom tablet "056A:0357" map-to-monitor --connector DP-1
Just like xsetwacom was effectively identical to xinput but with a domain-specific CLI, gsetwacom is effectively identical to the gsettings tool but with a domain-specific CLI. gsetwacom is not intended to be a drop-in replacement for xsetwacom, the CLI is very different. That's mostly on purpose because I don't want to have to chase bug-for-bug compatibility for something that is very different after all.
I almost spent more time writing this blog post than on the implementation so it's still a bit rough. Also, (partially) due to how relocatable schemas work, error checking is virtually nonexistent - if you want to configure Button 16 on your 2-button tablet device you can do that. Just don't expect 14 new buttons to magically sprout from your tablet. This could all be worked around with e.g. libwacom integration but right now I'm too lazy for that [4]
Oh, and because gsetwacom writes the gsettings configuration it is persistent, GNOME Settings will pick up those values and they'll be re-applied by mutter after unplug. And because mutter-on-Xorg still works, gsetwacom will work the same under Xorg. It'll also work under the GNOME derivatives as long as they use the same gsettings schemas and keys.
Le utilitaire est mort, vive le utilitaire!
[1] The git log claims libwacom was originally written in 2009. By me. That was a surprise...
[2] Though if you have the same speakers as I do you at least get a loud "pop" sound whenever you log in/out and the speaker gets woken up
[3] It used to be gnome-settings-daemon but with mutter now controlling the libinput context this all moved to mutter
[4] Especially because I don't want to write Python bindings for libwacom right now
Finally, conformant Vulkan for the M1! The new “Honeykrisp” driver is the first conformant Vulkan® for Apple hardware on any operating system, implementing the full 1.3 spec without “portability” waivers.
Honeykrisp is not yet released for end users. We’re continuing to add features, improve performance, and port to more hardware. Source code is available for developers.
Honeykrisp is not based on prior M1 Vulkan efforts, but rather Faith Ekstrand’s open source NVK driver for NVIDIA GPUs. In her words:
All Vulkan drivers in Mesa trace their lineage to the Intel Vulkan driver and started by copying+pasting from it. My hope is that NVK will eventually become the driver that everyone copies and pastes from. To that end, I’m building NVK with all the best practices we’ve developed for Vulkan drivers over the last 7.5 years and trying to keep the code-base clean and well-organized.
Why spend years implementing features from scratch when we can reuse NVK? There will be friction starting out, given NVIDIA’s desktop architecture differs from the M1’s mobile roots. In exchange, we get a modern driver designed for desktop games.
We’ll need to pass a half-million tests ensuring correctness, submit the results, and then we’ll become conformant after 30 days of industry review. Starting from NVK and our OpenGL 4.6 driver… can we write a driver passing the Vulkan 1.3 conformance test suite faster than the 30 day review period?
It’s unprecedented…
Challenge accepted.
It begins with a text.
Faith… I think I want to write a Vulkan driver.
Her advice?
Just start typing.
There’s no copy-pasting yet – we just add M1 code to NVK and remove NVIDIA as we go. Since the kernel mediates our access to the hardware, we begin connecting “NVK” to Asahi Lina’s kernel driver using code shared with OpenGL. Then we plug in our shader compiler and hit the hay.
To access resources, GPUs use “descriptors” containing the address, format, and size of a resource. Vulkan bundles descriptors into “sets” per the application’s “descriptor set layout”. When compiling shaders, the driver lowers descriptor accesses to marry the set layout with the hardware’s data structures. As our descriptors differ from NVIDIA’s, our next task is adapting NVK’s descriptor set lowering. We start with a simple but correct approach, deleting far more code than we add.
With working descriptors, we can compile compute shaders. Now we program the fixed-function hardware to dispatch compute. We first add bookkeeping to map Vulkan command buffers to lists of M1 “control streams”, then we generate a compute control stream. We copy that code from our OpenGL driver, translate the GL into Vulkan, and compute works.
That’s enough to move on to “copies” of buffers and images. We implement Vulkan’s copies with compute shaders, internally dispatched with Vulkan commands as if we were the application. The first copy test passes.
Fleshing out yesterday’s code, all copy tests pass.
We’re ready to tackle graphics. The novelty is handling graphics state like depth/stencil. That’s straightforward, but there’s a lot of state to handle. Faith’s code collects all “dynamic state” into a single structure, which we translate into hardware control words. As usual, we grab that translation from our OpenGL driver, blend with NVK, and move on.
What makes state “dynamic”? Dynamic state can change without
recompiling shaders. By contrast, static state is baked into shader
binaries called “pipelines”. If games create all their pipelines during
a loading screen, there is no compiler “stutter” during gameplay. The
idea hasn’t quite panned out: many game developers don’t know their
state ahead-of-time so cannot create pipelines early. In response,
Vulkan has made ever more state dynamic, punctuated with the EXT_shader_object
extension that makes pipelines optional.
We want full dynamic state and shader objects. Unfortunately, the M1 bakes random state into shaders: vertex attributes, fragment outputs, blending, even linked interpolation qualifiers. Like most of the industry in the 2010s, the M1’s designers bet on pipelines.
Faced with this hardware, a reasonable driver developer would double-down on pipelines. DXVK would stutter, but we’d pass conformance.
I am not reasonable.
To eliminate stuttering in OpenGL, we make state dynamic with four strategies:
Wait, what-a-logs?
AMD also bakes state into shaders… with a twist. They divide the hardware binary into three parts: a prolog, the shader, and an epilog. Confining dynamic state to the periphery eliminates shader variants. They compile prologs and epilogs on the fly, but that’s fast and doesn’t stutter. Linking shader parts is a quick concatenation, or long jumps avoid linking altogether. This strategy works for the M1, too.
For Honeykrisp, let’s follow NVK’s lead and treat all state as dynamic. No other Vulkan driver has implemented full dynamic state and shader objects this early on, but it avoids refactoring later. Today we add the code to build, compile, and cache prologs and epilogs.
Putting it together, we get a (dynamic) triangle:
Guided by the list of failing tests, we wire up the little bits missed along the way, like translating border colours.
/* Translate an American VkBorderColor into a Canadian agx_border_colour */
enum agx_border_colour
translate_border_color(VkBorderColor color)
{
   switch (color) {
   case VK_BORDER_COLOR_INT_TRANSPARENT_BLACK:
      return AGX_BORDER_COLOUR_TRANSPARENT_BLACK;
   ...
   }
}
Test results are getting there.
Pass: 149770, Fail: 7741, Crash: 2396
That’s good enough for vkQuake.
Lots of little fixes bring us to a 99.6% pass rate… for Vulkan 1.1. Why stop there? NVK is 1.3 conformant, so let’s claim 1.3 and skip to the finish line.
Pass: 255209, Fail: 3818, Crash: 599
98.3% pass rate for 1.3 on our 1 week anniversary.
Not bad.
SuperTuxKart has a Vulkan renderer.
Zink works too.
I tracked down some fails to a test bug, where an arbitrary verification threshold was too strict to pass on some devices. I filed a bug report, and it’s resolved within a few weeks.
The tests for “descriptor indexing” revealed a compiler bug affecting subgroup shuffles in non-uniform control flow. The M1’s shuffle instruction is quirky, but it’s easy to work around. Fixing that fixes the descriptor indexing tests.
A few tests crash inside our register allocator. Their shaders contain a peculiar construction:
condition
is always false, but the compiler doesn’t know
that.
Infinite loops are nominally invalid since shaders must terminate in finite time, but this shader is syntactically valid. “All loops contain a break” seems obvious for a shader, but it’s false. It’s straightforward to fix register allocation, but what a doozy.
Remember copies? They’re slow, and every frame currently requires a copy to get on screen.
For “zero copy” rendering, we need enough Linux window system
integration to negotiate an efficient surface layout across process
boundaries. Linux uses “modifiers” for this purpose, so we implement the
EXT_image_drm_format_modifier
extension. And by implement, I mean copy.
Copies to avoid copies.
“I’d like a 4K x86 Windows Direct3D PC game on a 16K arm64 Linux Vulkan Mac.”
…
“Ma’am, this is a Wendy’s.”
As bug fixing slows down, we step back and check our driver architecture. Since we treat all state as dynamic, we don’t pre-pack control words during pipeline creation. That adds theoretical CPU overhead.
Is that a problem? After some optimization, vkoverhead says we’re pushing 100 million draws per second.
I think we’re okay.
Time to light up YCbCr. If we don’t use special YCbCr hardware, this feature is “software-only”. However, it touches a lot of code.
It touches so much code that Mohamed Ahmed spent an entire summer adding it to NVK.
Which means he spent a summer adding it to Honeykrisp.
Thanks, Mohamed ;-)
Query copies are next. In Vulkan, the application can query the number of samples rendered, writing the result into an opaque “query pool”. The result can be copied from the query pool on the CPU or GPU.
For the CPU, the driver maps the pool’s internal data structure and copies the result. This may require nontrivial repacking.
For the GPU, we need to repack in a compute shader. That’s harder, because we can’t just run C code on the GPU, right?
…Actually, we can.
A little witchcraft makes GPU query copies as easy as C.
void copy_query(struct params *p, int i) {
   uintptr_t dst = p->dest + i * p->stride;
   int query = p->first + i;
   if (p->available[query] || p->partial) {
      int q = p->index[query];
      write_result(dst, p->_64, p->results[q]);
   }
   ...
}
The final boss: border colours, hard mode.
Direct3D lets the application choose an arbitrary border colour when creating a sampler. By contrast, Vulkan only requires three border colours:
(0, 0, 0, 0) – transparent black
(0, 0, 0, 1) – opaque black
(1, 1, 1, 1) – opaque white
We handled these on April 8. Unfortunately, there are two problems.
First, we need custom border colours for Direct3D compatibility. Both
DXVK and vkd3d-proton
require the EXT_custom_border_color
extension.
Second, there’s a subtle problem with our hardware, causing dozens of fails even without custom border colours. To understand the issue, let’s revisit texture descriptors, which contain a pixel format and a component reordering swizzle.
Some formats are implicitly reordered. Common “BGRA” formats swap red
and blue for historical
reasons. The M1 does not directly support these formats. Instead,
the driver composes the swizzle with the format’s reordering. If the
application uses a BARB
swizzle with a BGRA
format, the driver uses an RABR
swizzle with an
RGBA
format.
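For the curious, composing the two reorderings is a tiny table lookup. The sketch below is only an illustration with made-up names (`enum chan`, `compose_swizzle`), not the driver's code, and it leaves out the constant 0 and 1 swizzle values that real swizzles also allow.

```c
#include <assert.h>

/* Hypothetical channel enum; not the driver's actual type. */
enum chan { CHAN_R, CHAN_G, CHAN_B, CHAN_A };

/* format[c] says which hardware (RGBA) channel holds logical channel c.
 * app[i] says which logical channel the application wants in output i.
 * The composed swizzle picks the hardware channel directly. */
static void compose_swizzle(const enum chan format[4],
                            const enum chan app[4],
                            enum chan out[4])
{
   for (int i = 0; i < 4; i++) {
      assert((unsigned)app[i] < 4);
      out[i] = format[app[i]];
   }
}
```

Feeding it the BGRA reordering `{B, G, R, A}` and the application swizzle `{B, A, R, B}` yields `{R, A, B, R}`, matching the BARB to RABR example above.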
There’s a catch: swizzles apply to the border colour, but formats do not. We need to undo the format reordering when programming the border colour for correct results after the hardware applies the composed swizzle. Our OpenGL driver implements border colours this way, because it knows the texture format when creating the sampler. Unfortunately, Vulkan doesn’t give us that information.
Without custom border colour support, we “should” be okay. Swapping red and blue doesn’t change anything if the colour is white or black.
There’s an even subtler catch. Vulkan mandates support for a packed 16-bit format with 4-bit components. The M1 supports a similar format… but with reversed “endianness”, swapping red and alpha.
That still seems okay. For transparent black (all zero) and opaque white (all one), swapping components doesn’t change the result.
The problem is opaque black: (0, 0, 0, 1). Swapping red and alpha gives
(1, 0, 0, 0). Transparent red?
Uh-oh.
We’re stuck. No known hardware configuration implements correct Vulkan semantics.
Is hope lost?
Do we give up?
A reasonable person would.
I am not reasonable.
Let’s jump into the deep end. If we implement custom border colours, opaque black becomes a special case. But how? The M1’s custom border colours entangle the texture format with the sampler. A reasonable person would skip Direct3D support.
As you know, I am not reasonable.
Although the hardware is unsuitable, we control software. Whenever a shader samples a texture, we’ll inject code to fix up the border colour. This emulation is simple, correct, and slow. We’ll use dirty driver tricks to speed it up later. For now, we eat the cost, advertise full custom border colours, and pass the opaque black tests.
All that’s left is some last minute bug fixing, and…
Pass: 686930, Fail: 0
Success.
The next task is implementing everything that DXVK and vkd3d-proton require to layer Direct3D. That includes esoteric extensions like transform feedback. Then Wine and an open source x86 emulator will run Windows games on Asahi Linux.
That’s getting ahead of ourselves. In the meantime, enjoy Linux games with our conformant OpenGL 4.6 drivers… and stay tuned.
Recently, the Linux Mint Blog published Monthly News – April 2024, which goes into detail about wanting to fork and maintain older GNOME apps in collaboration with other GTK-based desktop environments.
Despite the good intentions of the author, Clem, many readers interpreted this as an attack against GNOME. Specifically: GTK, libadwaita, the relationship between them, and their relevance to any desktop environment or desktop operating system. Unfortunately, many of these readers seem to have a lot of difficulty understanding what GTK is trying to be, and how libadwaita helps.
In this article, we’ll look at the history of why and how libadwaita was born, the differences between GTK 4 and libadwaita in terms of scope of support, their relevance to each desktop environment and desktop operating system, and the state of GTK 4 today.
First of all, what is GTK? GTK is a cross-platform widget toolkit from the GNOME Project, which means it provides interactive elements that developers can use to build their apps.
The latest major release of GTK is 4, which brings performance improvements over GTK 3. GTK 4 also removes several widgets that were part of the GNOME design language, which became a controversy. In the context of application design, a design language is the visual characteristics that are communicated to the user. Fonts, colors, shapes, forms, layouts, writing styles, spacing, etc. are all elements of the design language. (Source)
In general, cross-platform toolkits tend to provide general-purpose/standard widgets, typically with a non-opinionated styling, i.e. widgets and design patterns that are used consistently across different operating systems (OSes) and desktop environments.
However, GTK was unusual in bundling GNOME’s design language into the toolkit itself, which made it far from generic and led to two kinds of problems: philosophical and technical.
When we look at apps made for the GNOME desktop (will be referred to as “GNOME apps”) as opposed to non-GNOME apps, we notice that they’re distinctive: GNOME apps tend to have hamburger buttons, header bars, larger buttons, larger padding and margins, etc., while most non-GNOME apps tend to be more compact, use menu bars, standard title bars, and many other design metaphors that may not be used in GNOME apps.
This is because, from a design philosophy standpoint, GNOME’s design patterns tend to go in a different direction than most apps. As a brand and product, GNOME has a design language it adheres to, which is accompanied by the GNOME Human Interface Guidelines (HIG).
As a result, GTK and GNOME’s design language clashed. Instead of being as general-purpose as possible, GTK as a cross-platform toolkit contained an entire design language intended to be used only by a specific desktop, thus defeating the purpose of a cross-platform toolkit.
For more information on GNOME’s design philosophy, see “What is GNOME’s Philosophy?”.
The unnecessary unification of the toolkit and design language also divided a significant amount of effort and maintenance: Instead of focusing solely on the general-purpose widgets that could be used across all desktop OSes and environments, much of the focus was on the widgets that were intended to conform to the GNOME HIG. Many of the general-purpose widgets also included features and functionality that were only relevant to the GNOME desktop, making them less general-purpose.
Thus, the general-purpose widgets were being implemented and improved slowly, and the large codebase also made the GNOME widgets and design language difficult to maintain, change, and adapt. In other words, almost everything was hindered by the lack of independence on both sides.
Because of the technical bottlenecks caused by the philosophical decisions, libhandy was created in 2017, with the first experimental version released in 2018. As described on the website, libhandy is a collection of “[b]uilding blocks for modern adaptive GNOME applications.” In other words, libhandy provides additional widgets that can be used by GNOME apps, especially those that use GTK 3. For example, Boxes uses libhandy, and many GNOME apps that used to use GTK 3 also used libhandy.
However, some of the problems remained: Since libhandy was relatively new at the time, most GNOME widgets were still part of GTK 3, which continued to suffer from the consequences of merging the toolkit and design language. Furthermore, GTK 4 was released at the end of December 2020 — after libhandy. Since libhandy was created before the initial release of GTK 4, it made little sense to fully address these issues in GTK 3, especially when doing so would have caused major breakages and inconveniences for GTK, libhandy, and app developers. As such, it wasn’t worth the effort.
With these issues in mind, the best course of action was to introduce all these major changes and breakages in GTK 4, use libhandy as an experiment and to gain experience, and properly address these issues in a successor.
Because of all the above problems, libadwaita was created: libhandy’s successor that will accompany GTK 4.
GTK 4 was initially released in December 2020, and libadwaita was released one year later, in December 2021. With the experience gained from libhandy, libadwaita managed to become extensible and easy to maintain.
Libadwaita is a platform library accompanying GTK 4. A platform library is a library used to complement a specific platform. In the case of libadwaita, the platform it targets is the GNOME desktop.
Some GNOME widgets from GTK 3 (or earlier versions of GTK 4) were removed or deprecated in GTK 4 and were reimplemented in / transferred to libadwaita, for example:
These aforementioned widgets only benefited GNOME apps, as they were strictly designed to provide widgets that conformed to the GNOME HIG. Non-GNOME apps usually didn’t use these widgets, so they were practically irrelevant to everyone else.
In addition, libadwaita introduced several widgets as counterparts to GTK 4 to comply with the HIG:
Similarly, these aforementioned GTK 4 widgets (the ones starting with Gtk) are not designed to comply with the GNOME HIG. Since GTK 4 widgets are supposed to be general-purpose, they should not be platform-specific; the HIG no longer has any influence on GTK, only on the development of libadwaita.
The main difference between GTK 4 and libadwaita is the scope of support, specifically the priorities in terms of the GNOME desktop, and desktop environment and OS support. While most resources are dedicated to GNOME desktop integration, GTK 4 is not nearly as focused on the GNOME desktop as libadwaita. GTK 4, while opinionated, still tries to get closer to the traditional desktop metaphor by providing these general-purpose widgets, while libadwaita provides custom widgets to conform to the GNOME HIG.
Since libadwaita is only made for the GNOME desktop, and the GNOME desktop is primarily officially supported on Linux, libadwaita thus primarily supports Linux. In contrast, GTK is officially supported on all major operating systems (Windows, macOS, Linux). However, since GTK 4 is mostly developed by GNOME developers, it works best on Linux and GNOME — hence “opinionated”.
Thanks to the removal of GNOME widgets from GTK 4, GTK developers can continue to work on general-purpose widgets, without being influenced or restricted in any way by the GNOME HIG. Developers of cross-platform GTK 3 apps that rely exclusively on general-purpose widgets can be more confident that GTK 4 won’t remove these widgets, and hopefully enjoy the benefits that GTK 4 offers.
At the time of writing, there are several cross-platform apps that have either successfully ported to GTK 4, or are currently in the process of doing so. To name a few: Freeciv gtk4 client, HandBrake, Inkscape, Transmission, and PulseAudio Volume Control. The LibreOffice developers are working on the GTK 4 port, with the gtk4 VCL plugin option enabled. For example, the libreoffice-fresh package from Arch Linux has it enabled.
Here are screenshots of the aforementioned apps:
This is a counter-response to Thom Holwerda’s response to this article.
An app targeting a specific platform will typically run best on that platform and will naturally struggle to integrate with other platforms. Whether the libraries change over time or stay the same forever, if the developers are invested in the platform they are targeting, the app will follow the direction of the platform and continue to struggle to integrate with other platforms. At best, it will integrate with other platforms by accident.
In this case, developers who have and will continue to target the GNOME desktop will actively adapt their apps to follow the GNOME philosophy, for better or worse. Hamburger buttons, header bars, typography, and distinct design patterns were already present a decade ago (2014). (Source) Since other platforms were (and still are) adhering to different design languages, with or without libhandy/libadwaita, the GTK 3 apps targeting GNOME were already distinguishable a decade ago. Custom solutions such as theming were (and still are) inadequate, as there was (and still is) no 🪄 magical 🪄 solution that converts GNOME’s design patterns into their platform-agnostic counterparts.
Whether the design language is part of the toolkit or a separate library has no effect on integration, because GNOME apps already looked really different long before libhandy was created, and non-GNOME apps already looked “out of place” in GNOME as well. Apps targeting a specific platform that unintentionally integrate with other platforms will eventually stop integrating with other platforms as the target platform progresses and apps adapt. In rare cases, developers may decide to no longer adhere to the GNOME HIG.
While libadwaita is the most popular and widely used platform library that accompanies GTK 4, there are several alternatives to libadwaita:
There are also several alternatives to libhandy:
Just like libadwaita and libhandy, these platform libraries offer custom widgets and styling that differ from GTK and are built for their respective platforms, so it’s important to realize that GTK is meant to be built with a complementary platform library that extends its functionality when targeting a specific platform.
Similarly, Kirigami from KDE accompanies Qt to build Plasma apps. MauiKit from the Maui Project (another KDE project) also accompanies Qt, but targets Nitrux. Libcosmic by System76 accompanies iced to build COSMIC apps.
A cross-platform toolkit should primarily provide general-purpose widgets. Third parties should be able to extend the toolkit as they see fit through a platform library if they want to target a specific platform.
As we’ve seen throughout the philosophical and technical issues with GTK, a lot of effort has gone into moving GNOME widgets from GTK 4 to libadwaita. GTK 4 will continue to provide these general-purpose widgets for apps intended to run on any desktop or OS, while platform libraries such as libadwaita, Granite and libhelium provide styling and custom widgets that respect their respective platforms.
Libadwaita is targeted exclusively at the GNOME ecosystem, courtesy of the GNOME HIG. Apps built with libadwaita are intended to run best on GNOME, while GTK 4 apps that don’t come with a platform library are intended to run everywhere.
Hi!
Sadly, I need to start this status update with bad news: SourceHut has decided to terminate my contract. At this time, I’m still in the process of figuring out what I’ll do next. I’ve marked some SourceHut-specific projects as unmaintained, such as sr.ht-container-compose (feel free to fork of course). I’ve handed over hut maintenance to xenrox, and I’ve started migrating a few projects to other forges (more to follow). I will continue to maintain projects that I still use such as soju to the extent that my free time allows.
On a more positive note, Igalia’s display next hackfest took place this month. Although I couldn’t attend in person, it was great to discuss focused topics in real time with other engineers in the community. We discussed color management, HDR, adaptive sync, testing, real-time scheduling, power usage implications of the color pipeline, improved uAPI to handle KMS atomic commit failures, hardware plane offloading, display muxes, backlight, scaling and sharpening filters… And I probably missed a few other things.
We’ve released wlroots 0.17.3 with a bunch of new bug fixes (thanks to Simon Zeni). The patches to add support for ICC profiles from M. Stoeckl have been merged. I’ve continued working on the new ext-screencopy-v1 protocol but there are a few remaining issues to address before this is ready.
The display hackfest has motivated me to work on libliftoff. Apart from a few bug fixes, a new API to set a timeout for the libliftoff algorithm has been added, and some optimizations are about to get merged (one thanks to Leo Li).
The Wayland release cycle has started: we’ve merged patches to generate validators for enum values and added a new deprecated-since XML attribute to mark a request, event or enum as deprecated. Thanks to Ferdinand Bachmann, kanshi has gained output defaults and aliases (useful for sharing output configurations across multiple profiles). mako 1.9 has been released with a new flag to toggle modes, another new flag to bypass history when dismissing a notification, and support for compositor-side cursor images.
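As a rough illustration of what the deprecated-since attribute enables, here is a small Python sketch that scans a protocol description for deprecated items. The XML fragment, its interface and request names, and the `deprecated_items` helper are all invented for this example; only the attribute name comes from the text above.

```python
import xml.etree.ElementTree as ET

# Hypothetical protocol fragment: the interface and request names are made up.
PROTOCOL_XML = """
<protocol name="example">
  <interface name="example_surface" version="3">
    <request name="commit"/>
    <request name="set_legacy_mode" since="1" deprecated-since="3"/>
    <event name="done" since="2"/>
  </interface>
</protocol>
"""

def deprecated_items(xml_text):
    """List (interface, kind, name, version) for every deprecated item."""
    root = ET.fromstring(xml_text)
    found = []
    for iface in root.iter("interface"):
        for item in iface:
            since = item.get("deprecated-since")
            if since is not None:
                found.append((iface.get("name"), item.tag,
                              item.get("name"), int(since)))
    return found

if __name__ == "__main__":
    for entry in deprecated_items(PROTOCOL_XML):
        print(entry)
```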
In IRC news, goguma now uses Material 3 (please report any regression), has gained support for messages only visible to channel operators (STATUSMSG), and I’ve spent a fair bit of time investigating the infamous duplicate message bug. I have a better understanding of the issue now, but still need a bit more time to come up with a proper fix.
Thanks to old patches sent by sitting33 that I took way too long to review, gamja now only marks messages as read when it’s focused, shows the number of unread highlights in the tab title, and hides the internal WHO reply chatter from the user.
Last, I’ve released go-imap 2.0.0 beta 3 with a whole bunch of bug fixes. Ksenia Roshchina has contributed a client implementation of the ACL IMAP extension.
That’s all for now, see you next month!
Some weeks ago I attended the Embedded Open Source Summit for the first time. Igalia had a booth that allowed us to showcase the work that we have been doing over the past years. Several Igalians also gave talks there.
I gave a talk titled “Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driver for a New GPU”, in which I gave an introduction to Igalia’s contributions to maintaining the OpenGL/Vulkan stack for the Raspberry Pi, focusing on the challenges of implementing Mesa support for the Raspberry Pi 5, the latest device in that series, which was released in October 2023.
If you are interested, the video and slides of my presentation are now available:
https://static.sched.com/hosted_files/eoss24/78/2024-04-eoss-apinheiro-rpi5.pdf
And as a bonus, you can see here a video showing the RPI5 running some Unreal Engine 4 Demos, and other applications:
TLDR: Thanks to José Exposito, libwacom 2.12 will support all [1] Huion and Gaomon devices when running on a 6.10 kernel.
libwacom, now almost 13 years old, is a C library that provides a bunch of static information about graphics tablets that is not otherwise available by looking at the kernel device. Basically, it's a set of APIs in the form of libwacom_get_num_buttons and so on. This is used by various components to be more precise about initializing devices, even though libwacom itself has no effect on whether the device works. It's only a library for historical reasons [2]; if I were to rewrite it today, I'd probably ship libwacom as a set of static JSON or XML files with a specific schema.
Here are a few examples of how this information is used: libinput uses libwacom to query information about tablet tools. The kernel event node always supports tilt but the individual tool that is currently in proximity may not. libinput can get the tool ID from the kernel, query libwacom and then initialize the tool struct correctly so the compositor and Wayland clients will get the right information. GNOME Settings uses libwacom's information to e.g. detect if a tablet is built-in or an external display (to show you the "Map to Monitor" button or not, if builtin). GNOME's mutter uses the SVGs provided by libwacom to show you an OSD where you can assign keystrokes to the buttons. All these features require that the tablet is supported by libwacom.
Huion and Gaomon devices [3] were not well supported by libwacom because they re-use USB ids, i.e. different tablets from seemingly different manufacturers have the same vendor and product ID. This is understandable, the 16-bit product id only allows for 65535 different devices and if you're a company that thinks about more than just the current quarterly earnings you realise that if you release a few devices every year (let's say 5-7), you may run out of product IDs in about 10000 years. Need to think ahead! So between the 140 Huion and Gaomon devices we now have in libwacom I only counted 4 different USB ids. Nine years ago we added name matching too to work around this (i.e. the vid/pid/name combo must match) but, lo and behold, we may run out of unique strings before the heat death of the universe so device names are re-used too! [4] Since we had no other information available to userspace, this meant that if you plugged in e.g. a Gaomon M106, it could be detected as an S620 and given wrong button numbers, a wrong SVG, etc.
A while ago José got himself a tablet and started contributing to DIGIMEND (and upstreaming a bunch of things). At some point we realised that the kernel actually had the information we needed: the firmware version string from the tablet which conveniently gave us the tablet model too. With this kernel patch scheduled for 6.10 this is now exported as the uniq property (HID_UNIQ in the uevent) and that means it's available to userspace. After a bit of rework in libwacom we can now match on the trifecta of vid/pid/uniq or the quadrella of vid/pid/name/uniq. So hooray, for the first time we can actually detect Huion and Gaomon devices correctly.
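The matching idea can be sketched in a few lines of Python. This is not libwacom's implementation or file format: the `Match` class, the `find_tablet` function, and every ID, name and uniq string below are invented for illustration. It only shows why adding uniq to the key disambiguates devices that share vid, pid and name.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Match:
    """Hypothetical match entry; None in name/uniq means 'do not compare'."""
    vid: int
    pid: int
    name: Optional[str] = None
    uniq: Optional[str] = None

    def matches(self, vid, pid, name, uniq):
        if (self.vid, self.pid) != (vid, pid):
            return False
        if self.name is not None and self.name != name:
            return False
        if self.uniq is not None and self.uniq != uniq:
            return False
        return True

    def specificity(self):
        # vid/pid/name/uniq outranks vid/pid/uniq or vid/pid/name.
        return (self.name is not None) + (self.uniq is not None)

# All values here are made up; two tablets share vid, pid and name.
EXAMPLE_DATABASE = [
    (Match(0x1234, 0x0001, name="Example Tablet Pen", uniq="FW_M106"), "M106"),
    (Match(0x1234, 0x0001, name="Example Tablet Pen", uniq="FW_S620"), "S620"),
]

def find_tablet(database, vid, pid, name, uniq):
    """Return the model of the most specific matching entry, or None."""
    hits = [(m.specificity(), model) for m, model in database
            if m.matches(vid, pid, name, uniq)]
    if not hits:
        return None
    return max(hits, key=lambda hit: hit[0])[1]
```

With only vid, pid and name the two entries are indistinguishable; once the kernel exports uniq, exactly one entry matches.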
The second thing José did was to extract all model names from the .deb packages Huion and Gaomon provide and auto-generate all libwacom descriptions for all supported devices. Which meant, in one pull request we added around 130 devices. Nice!
As said above, this requires the future kernel 6.10 but you can apply the patches to your current kernel if you want. If you do have one of the newly added devices, please verify the .tablet file for your device and let us know so we can remove the "this is autogenerated" warnings and fix any issues with the file. Some of the new files may now take precedence over the old hand-added ones so over time we'll likely have to merge them. But meanwhile, for a brief moment in time, things may actually work.
[1] for some value of "all", but it should be all current and past ones, provided they were supported by Huion's driver
[2] anecdote: in 2011 Jason Gerecke from Wacom and I sat down and decided on a generic tablet handling library independent of the xf86-input-wacom driver. libwacom was supposed to be that library but it never turned into more than a static description library; libinput is now what our original libwacom idea was.
[3] and XP Pen and UCLogic but we don't yet have a fix for those at the time of writing
[4] names like "HUION PenTablet Pen"...
We’re excited to announce the details of our upcoming 2024 Linux Display Next Hackfest in the beautiful city of A Coruña, Spain!
This year’s hackfest will be hosted by Igalia and will take place from May 14th to 16th. It will be a gathering of minds from a diverse range of companies and open source projects, all coming together to share, learn, and collaborate outside the traditional conference format.
We’re excited to welcome participants from various backgrounds, including:
This diverse mix of backgrounds is represented by developers from several companies working on the Linux display stack: AMD, Arm, BlueSystems, Bootlin, Collabora, Google, GravityXR, Igalia, Intel, LittleCMS, Qualcomm, Raspberry Pi, Red Hat, SUSE, and System76. It’ll ensure a dynamic exchange of perspectives and foster collaboration across the Linux Display community.
Please take a look at the list of participants for more info.
The beauty of the hackfest is that the agenda is driven by participants! As this is a hybrid event, we decided to improve the experience for remote participants by creating a dedicated space for them to propose topics and some introductory talks in advance. From those inputs, we defined a schedule that reflects the collective interests of the group, but is still open for amendments and new proposals. Find the schedule details in the official event webpage.
Expect discussions on:
This year’s Linux Display Next hackfest is a hybrid event, hosted onsite at the Igalia offices and available for remote attendance. In-person participants will find an environment for networking and brainstorming in our inspiring and collaborative office space. Additionally, A Coruña itself is a gem waiting to be explored, with stunning beaches, good food, and historical sites.
To make the most of your time in A Coruña, we’ll be organizing some social activities:
Igalia sponsors lunches and coffee-breaks on hackfest days, Tuesday’s dinner, and the social event on Thursday afternoon for in-person participants.
We can’t wait to welcome hackfest attendees to A Coruña! Stay tuned for further details and outcomes of this unconventional and unique experience.
With new releases of the Linux kernel and Mesa drivers poised to be packaged by Linux distributions, the TensorFlow Lite driver for the NPU in the Amlogic A311D SoC will be available to users with minimal effort.
With that work bearing fruit, I have been looking at how this driver could be of use with other hardware.
Philipp Zabel of Pengutronix has been looking at adding support for the NPU in the NXP i.MX 8M Plus SoC, and he has made great progress on reverse engineering the in-memory format of the weights tensor, which is different from that used in the A311D.
I started by probing what supporting the NPU in the S905D3 SoC from Amlogic would entail, and I found it not that different from what is currently supported, apart from it also using a new format for the weights tensor.
After a couple of weeks staring at memory dumps and writing a Python tool to decode them, I realized that the run-length and Huffman encodings were the same, with only a few differences such as where and how the bias values were stored.
With a few changes to Philipp's work-in-progress branch I got my first tests passing on the Libre Computer Solitude SBC board.
Next I will look at supporting more weights tensor dimensions and fixing bugs in how the weights and other values are encoded.
The command stream programming seems to be very similar to that of the A311D, so I don't expect much work to be needed there.
Once everything is working at the same level as with the A311D, I will move on to determining the optimal values for the zero run-length and Huffman symbol maps, for maximum compression and thus performance (as NPUs are so fast at arithmetic that they tend to be memory starved).
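For readers unfamiliar with zero run-length coding, here is a generic Python sketch of the idea. It does not reflect the NPU's actual bitstream, which is only known from reverse engineering; the token layout and the function names `zrl_encode` and `zrl_decode` are assumptions chosen for clarity.

```python
# Generic zero run-length coding sketch. The ("Z", n) / ("V", v) token layout
# is an assumption for illustration, not the hardware's weights format.

def zrl_encode(values):
    """Collapse runs of zeros into ("Z", run_length); keep others as ("V", v)."""
    tokens = []
    run = 0
    for v in values:
        if v == 0:
            run += 1
            continue
        if run:
            tokens.append(("Z", run))
            run = 0
        tokens.append(("V", v))
    if run:
        tokens.append(("Z", run))
    return tokens

def zrl_decode(tokens):
    """Inverse of zrl_encode."""
    out = []
    for kind, x in tokens:
        if kind == "Z":
            out.extend([0] * x)
        else:
            out.append(x)
    return out

if __name__ == "__main__":
    weights = [0, 0, 0, 5, 0, 0, 7, 0]
    print(zrl_encode(weights))
```

The more zeros the weights contain, the fewer tokens need to be fetched from memory, which is why tuning the encoding matters for a memory-starved accelerator.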
Big thanks to Pengutronix for supporting Philipp's work, and to Libre Computer for having supported the development of the driver so far.
Discussions about rebase vs. merge are familiar territory for anybody with an interest in version control in general and git in particular. I want to finally give a more permanent home to an idea that I have expressed in the past and that I've occasionally seen others hint at in those discussions as well.
There are multiple camps in these discussions that have slightly different ideas about how and for what purposes git should be used.
The first major axis of disagreement is whether history needs to be git bisect-able. Outside of my own little hobby projects, I've always worked on projects for which bisectability was important. This has generally been because their scope was such that CI simply had no chance to cover all uses of the software. Bug reports that can be traced to regressions from weeks or even months ago are not frequent per se, but they have always been frequent enough to matter. git bisect is an essential tool for finding those regression points when they happen. Not all projects are like that, but for projects which are, the notion of an "atomic" change to the project's main development branch (or branches) is important.
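The reason atomic changes matter for bisection is easiest to see in a simplified model of the search itself. The Python sketch below assumes a linear history with a known-good first commit and a known-bad last commit; `first_bad` and the `is_bad` callback are hypothetical names, and real git bisect also handles non-linear history and skipped commits.

```python
# Simplified model of bisection over a linear history (oldest to newest).
# Assumes commits[0] is known good and commits[-1] is known bad.

def first_bad(commits, is_bad):
    """Return the first commit for which is_bad() is true."""
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]

if __name__ == "__main__":
    history = list("abcdefgh")
    print(first_bad(history, lambda c: c >= "f"))
```

Every midpoint the search lands on has to build and run, which is exactly the property that non-atomic intermediate commits break.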
The second major axis of disagreement is whether the development history of those "atomic" changes is important enough to preserve. The original git development workflow does not consider this to be important: developers send around and review multiple iterations of a change, but only the final version of the change goes into the permanent record of the git repository. I tend to agree with that view. I have very occasionally found it useful to go back and read through the comments on a pull request that led to a change months ago (or the email thread in projects that use an email workflow), but I have never found it useful to look at older versions of a change.
Some people seem to really care about this kind of history, though. They're the people who argue for a merge-based workflow for pull requests on GitHub (but against force-pushes to the same) and who have built hacks for bisectability and readability of history like --first-parent. I'm calling that a hack because it does not compose well. It works for projects whose atomic change history is essentially linear, but it breaks down once the history becomes more complex. What if the project occasionally has a genuine merge? Now you'd want to apply --first-parent for most merge commits but not all. Things get messy.
One final observation. Even "my" camp, which generally prefers to discard development history leading up to the atomic change in a main development branch, does want to preserve a kind of history that is currently not captured by git's graph. git revert inserts the hash of the commit that was reverted into the commit message. Similarly, git cherry-pick optionally inserts the hash of the commit that was cherry-picked into the commit message.
In other words, there is a kind of history for whose preservation at least in some cases there seems to be a broad consensus. This kind of history is distinct from the history that is captured by commit parent links. Looked at in this light, the idea is almost obvious: make this history an explicit part of git commit metadata.
The gist of it would be this. Every commit has an (often empty) list of historical commit references explaining the origins of the diff that is implicitly represented by the commit; let's call them diff-parents. The diff-parents are an ordered list of references to commits, each of them with a "reverted" bit that can optionally be set.
The history of a revert can be encoded by making the reverted commit a diff-parent with the "reverted" bit set. The history of a cherry-pick can be encoded similarly, with the "reverted" bit clear. When we perform a simple rebase, each new commit has an obvious diff-parent. When commits are squashed during a rebase, the sequence of squashed commits becomes the list of diff-parents of the newly formed commit. GitHub users who like to preserve all development history can use the "squash" option when landing pull requests and have the history be preserved via the list of diff-parents. git commit --amend can similarly record the original commit as diff-parent.
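As a thought experiment, the metadata described above could be modelled like this in Python. Nothing here exists in git: the `DiffParent` and `Commit` classes and the `revert`, `cherry_pick`, `squash` and `amend` helpers are hypothetical names that merely encode the rules from the preceding paragraphs.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass(frozen=True)
class DiffParent:
    """Reference to a commit the diff originated from (hypothetical metadata)."""
    commit: str
    reverted: bool = False

@dataclass
class Commit:
    id: str
    parents: List[str]
    message: str
    diff_parents: List[DiffParent] = field(default_factory=list)

def revert(new_id, head, target):
    # A revert points at the reverted commit with the "reverted" bit set.
    return Commit(new_id, [head], f"Revert {target}",
                  [DiffParent(target, reverted=True)])

def cherry_pick(new_id, head, source):
    # A cherry-pick points at its source with the bit clear.
    return Commit(new_id, [head], f"Pick {source}", [DiffParent(source)])

def squash(new_id, base, squashed_ids, message):
    # The squashed commits become the ordered list of diff-parents.
    return Commit(new_id, [base], message,
                  [DiffParent(c) for c in squashed_ids])

def amend(new_id, original):
    # An amended commit keeps its graph parents and records the original.
    return Commit(new_id, list(original.parents), original.message,
                  [DiffParent(original.id)])
```

Note that the graph parents and the diff-parents stay separate lists, which is the point of the proposal: the commit graph remains a clean sequence of atomic changes while the provenance is still recorded.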
This is an idea and not a fully fleshed-out plan. There are obviously a whole bunch of tricky questions to answer. For example: How does this all fit into git's admittedly often byzantine CLI? Can merge commits be diff-parents, and how would that work? Can we visualize the difference between a commit and its diff-parents? (Hint: Here's an idea)
Diff-parents are a source of potential information leaks. This is not a problem specific to the idea of diff-parents; it is a general problem with the idea of preserving all history. Imagine some developer accidentally commits some credentials in their local clone of a repository and then uses git commit --amend to remove them again. Whoops, the commit that contains the credentials is still referenced as a diff-parent. Will it (and therefore the credentials) be published to the world for all to see when the developer pushes their branch to GitHub? This needs to be taken seriously.
So there are a whole bunch of issues that would have to be addressed for this idea to work well. I believe those issues to be quite surmountable in principle, but given the state of git development (where GitHub, which to many is almost synonymous with git, doesn't even seem to be able to understand how git was originally meant to be used) I am not particularly optimistic. Still, I think it's a good idea, and I'd love to see it or something like it in git.
It’s been around 6 months since the GNOME Foundation was joined by our new Executive Director, Holly Million, and the board and I wanted to update members on the Foundation’s current status and some exciting upcoming changes.
As you may be aware, the GNOME Foundation has operated at a deficit (nonprofit speak for a loss – ie spending more than we’ve been raising each year) for over three years, essentially running the Foundation on reserves from some substantial donations received 4-5 years ago. The Foundation has a reserves policy which specifies a minimum amount of money we have to keep in our accounts. This is so that if there is a significant interruption to our usual income, we can preserve our core operations while we work on new funding sources. We’ve now “hit the buffers” of this reserves policy, meaning the Board can’t approve any more deficit budgets – to keep spending at the same level we must increase our income.
One of the board’s top priorities in hiring Holly was therefore her experience in communications and fundraising, and building broader and more diverse support for our mission and work. Her goals since joining – as well as building her familiarity with the community and project – have been to set up better financial controls and reporting, develop a strategic plan, and start fundraising. You may have noticed the Foundation being more cautious with spending this year, because Holly prepared a break-even budget for the Board to approve in October, so that we can steady the ship while we prepare and launch our new fundraising initiatives.
The biggest prerequisite for fundraising is a clear strategy – we need to explain what we’re doing and why it’s important, and use that to convince people to support our plans. I’m very pleased to report that Holly has been working hard on this and meeting with many stakeholders across the community, and has prepared a detailed and insightful five year strategic plan. The plan defines the areas where the Foundation will prioritise, develop and fund initiatives to support and grow the GNOME project and community. The board has approved a draft version of this plan, and over the coming weeks Holly and the Foundation team will be sharing this plan and running a consultation process to gather feedback and input from GNOME Foundation and community members.
In parallel, Holly has been working on a fundraising plan to stabilise the Foundation, growing our revenue and ability to deliver on these plans. We will be launching a variety of fundraising activities over the coming months, including a development fund for people to directly support GNOME development, working with professional grant writers and managers to apply for government and private foundation funding opportunities, and building better communications to explain the importance of our work to corporate and individual donors.
Another observation that Holly had since joining was that we had, by general nonprofit standards, a very small board of just 7 directors. While we do have some committees which have (very much appreciated!) volunteers from outside the board, our officers are usually appointed from within the board, and many board members end up serving on multiple committees and wearing several hats. It also means the number of perspectives on the board is limited and less representative of the diverse contributors and users that make up the GNOME community.
Holly has been working with the board and the governance committee to reduce how much we ask from individual board members, and improve representation from the community within the Foundation’s governance. Firstly, the board has decided to increase its size from 7 to 9 members, effective from the upcoming elections this May & June, allowing more voices to be heard within the board discussions. After that, we’re going to be working on opening up the board to more participants, creating non-voting officer seats to represent certain regions or interests from across the community, and take part in committees and board meetings. These new non-voting roles are likely to be appointed with some kind of application process, and we’ll share details about these roles and how to be considered for them as we refine our plans over the coming year.
We’re really excited to develop and share these plans and increase the ways that people can get involved in shaping the Foundation’s strategy and how we raise and spend money to support and grow the GNOME community. This brings me to my final point, which is that we’re approaching the annual board elections, which take place in the run up to GUADEC. Because of the expansion of the board, and four directors coming to the end of their terms, we’ll be electing 6 seats this election. It’s really important to Holly and the board that we use this opportunity to bring some new voices to the table, leading by example in growing and better representing our community.
Allan wrote in the past about what the board does and what’s expected from directors. As you can see we’re working hard on reducing what we ask from each individual board member by increasing the number of directors, and bringing additional members into committees and non-voting roles. If you’re interested in seeing more diverse backgrounds and perspectives represented on the board, I would strongly encourage you to consider standing for election and to reach out to a board member to discuss their experience.
Thanks for reading! Until next time.
Best Wishes,
Rob
President, GNOME Foundation
Update 2024-04-27: It was suggested in the Discourse thread that I clarify the interaction between the break-even budget and the 1M EUR committed by the STF project. This money is received in the form of a contract for services rather than a grant to the Foundation, and must be spent on the development areas agreed during the planning and application process. It’s included within this year’s budget (October 2023 – September 2024) and is all expected to be spent during this fiscal year, so it doesn’t have an impact on the Foundation’s reserves position. The Foundation retains a small percentage fee to support its costs in connection with the project, including the new requirement to have our accounts externally audited at the end of the financial year. We are putting this money towards recruitment of an administrative assistant to improve financial and other operational support for the Foundation and community, including the STF project and future development initiatives.
(also posted to GNOME Discourse, please head there if you have any questions or comments)
I’ve been seeing a lot of ultra technical posts fly past my news feed lately and I’m tired of it. There’s too much information out there, too many analyses of vague hardware capabilities, too much handwaving in the direction of compiler internals.
It’s too much.
Take it out. I know you’ve got it with you. I know all my readers carry them at all times.
That’s right.
It’s time to make some pasta.
Everyone understands pasta.
Today I’ll be firing up the pasta maker on this ticket that someone nerdsniped me with. This is the sort of simple problem that any of us smoothbrains can understand: app too slow.
Here at SGC, we’re all experts at solving app too slow by now, so let’s take a gander at the problem area.
I’m in a hurry to get to the gym today, so I’ll skip over some of the less interesting parts of my analysis. Instead, let’s look at some artisanal graphics.
This is an image, but let’s pretend it’s a graph of the time between when an app is started to when it displays its first frame:
At the start is when the user launched the app, the body of the arrow is what happens during “startup”, and the head of the arrow is when the app has displayed its first frame to the user. The “startup” period is what the user perceives as latency. More technical blogs would break down here into discussions and navel-gazing about “time to first light” and “photon velocity” or whatever, but we’re keeping things simple. If SwapBuffers is called, the app has displayed its frame.
Where are we at with this now?
I did my testing on an Intel Icelake CPU/GPU because I’m lazy. Also because the original ticket was for Intel systems. Also because deal with it, this isn’t an AMD blog.
The best way to time this is to put an exit call at the end of SwapBuffers, then run the app in a while loop using time.
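A minimal sketch of that measurement loop in C, assuming the app has already been patched to exit right after its first SwapBuffers, so that process lifetime equals startup time. The function names are made up for illustration, and a plain shell loop around time does the same job.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Launch argv once and return its wall-clock lifetime in milliseconds.
 * Returns a negative value if the process could not be started. */
double time_run_ms(char *const argv[])
{
    struct timespec start, end;
    int status;

    clock_gettime(CLOCK_MONOTONIC, &start);

    pid_t pid = fork();
    if (pid < 0)
        return -1.0;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127); /* only reached if exec failed */
    }

    if (waitpid(pid, &status, 0) < 0)
        return -1.0;
    clock_gettime(CLOCK_MONOTONIC, &end);

    /* Treat "could not exec" as an error rather than a very fast startup. */
    if (WIFEXITED(status) && WEXITSTATUS(status) == 127)
        return -1.0;

    return (end.tv_sec - start.tv_sec) * 1000.0 +
           (end.tv_nsec - start.tv_nsec) / 1e6;
}

/* Average the lifetime over several launches, skipping failed ones.
 * Returns a negative value if no launch succeeded. */
double average_startup_ms(char *const argv[], int runs)
{
    double total = 0.0;
    int ok = 0;

    for (int i = 0; i < runs; i++) {
        double ms = time_run_ms(argv);
        if (ms >= 0.0) {
            total += ms;
            ok++;
        }
    }
    return ok ? total / ok : -1.0;
}
```

Calling average_startup_ms with an argv of { "gtk4-demo", NULL } and a few dozen runs would give one averaged number per driver. The first run is best discarded so the shader cache is warm.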
On iris, the average startup time for gtk4-demo was between 190-200ms.
On zink, the average startup time was between 350-370ms.
Uh-oh.
Initial analysis revealed something very stupid for the zink case: a lot of time was being spent on shaders.
Now, I’m not saying a lot of time was spent compiling shaders. That would be smart. Shaders have to be compiled, and it’s not like that can be skipped or anything. A cold run of this app that compiles shaders takes upwards of 1.0 seconds on any driver, and I’m not looking to improve that case since it’s rare. And hard. And also I gotta save some work for other people who want to make good blog posts.
The problem here is that when creating shaders, zink blocks while it does some initial shader rewrites and optimizations. This is like if you’re going to make yourself a sandwich, before you put smoked brisket on the bread you have to first slice the bread so it’s ready when you want to put the brisket on it. Sure, you could slice it after you’ve assembled your pile of pulled pork and slaw, but generally you slice the bread, you leave the bread sitting somewhere while you find/make/assemble the burnt ends for your sandwich, and then you finish making your sandwich. Compiling shaders is basically the same as making a sandwich.
But slicing bread takes time. And when you’re slicing the bread, you’re not doing anything else. You can’t. You’re holding a knife and a loaf of bread. You’re physically incapable of doing anything else until you finish slicing.
Similarly, zink can’t do anything else while it’s doing that shader creation. It’s sitting there creating the shaders. And while it’s doing that, the rest of the app (or just the main GL thread if glthread is active) is blocked. It can’t do anything else. It’s waiting on zink to finish, and it cannot make forward progress until the shader creation has completed.
Now this process happens dozens or hundreds of times during app startup, and every time it happens, the app blocks. Its own initialization routines–reading configuration data, setting up global structs and signal handlers, making display server connections, etc–cannot proceed until GL stops blocking.
If you’re unsure where I’m going with this, it’s a bad thing that zink is slicing all this bread while the app is trying to make sandwiches.
The year is whatever year you’re reading this, and in that year we have very powerful CPUs. CPUs so powerful that you can do lots of things at once. Instead of having only two hands to hold the bread and slice it, you have your own hands and then the hands of another 10+ of your clones which are also able to hold bread and slice it. So if you tell one of those clones “slice some bread for me”, you can do other stuff and come back to some nicely sliced bread. When exactly that bread arrives is another issue depensynchronizationding on how well you understand the joke here.
But this is me, so I get all the jokes, and that means I can do something like this:
By moving all that bread slicing into a thread, the rest of the startup operations can proceed without blocking. This frees up the app to continue with its own lengthy startup routines.
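For readers who want the shape of the idea without the sandwich, here is a generic sketch using plain pthreads. It is not zink's code: the struct, the function names, and the "multiply by two" placeholder work are all invented for illustration. Creation kicks off the expensive part on a worker and returns at once, and the caller only blocks when it actually needs the result.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a shader whose initial rewrites are expensive. */
struct shader_job {
    pthread_t thread;
    bool threaded;  /* true while a worker thread still has to be joined */
    int input;      /* pretend this is the unoptimized shader */
    int result;     /* pretend this is the optimized shader */
};

/* The "bread slicing": runs on the worker thread. */
static void *shader_job_run(void *data)
{
    struct shader_job *job = data;
    job->result = job->input * 2; /* placeholder for the real rewrites */
    return NULL;
}

/* Called at shader creation time: hands the work off and returns. */
void shader_job_start(struct shader_job *job, int input)
{
    job->input = input;
    job->result = 0;
    job->threaded =
        pthread_create(&job->thread, NULL, shader_job_run, job) == 0;
    if (!job->threaded)
        shader_job_run(job); /* no thread available: do it inline */
}

/* Called when the shader is first used: waits only if the worker
 * has not finished yet. Safe to call more than once. */
int shader_job_finish(struct shader_job *job)
{
    if (job->threaded) {
        pthread_join(job->thread, NULL);
        job->threaded = false;
    }
    return job->result;
}
```

The join in shader_job_finish is the synchronization point that decides when exactly the sliced bread arrives. Everything the app does between start and finish overlaps with the worker.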
After the change, zink starts up in an average of 260-280ms, a 25% improvement.
I know not everyone wants pasta on their sandwiches, but that’s where we ended up today.
That changeset is the end of this post, but it’s not the end of my investigation. There’s still mysteries to uncover here.
Like why the farfalle is this app calling glXInitialize and eglInitialize?
Can zink get closer to iris’s startup time?
We’ll find out in a future installment of Who Wants To Eat Lunch?
Yesterday I managed to implement in my open-source driver all the remaining operations so the SSDLite MobileDet model can run on Rockchip's NPU in the RK3588 SoC.
Performance is pretty good at 30 frames per second when using just one of the 3 cores that the NPU contains.
Now that we got to this level of usefulness, I'm going to switch to writing a kernel driver suited for inclusion into the Linux kernel, to the drivers/accel subsystem.
There is still lots of work to do, but progress is going pretty fast, though as I write more drivers for different NPUs I will have to split my time among them. At least, until we get more contributors! :)
For the last few months, Benjamin Tissoires and I have been working on and polishing a little tool called udev-hid-bpf [1]. This is the scaffolding required to quickly and easily write, test and eventually fix your HID input devices (mouse, keyboard, etc.) via a BPF program instead of a full-blown custom kernel driver or a semi-full-blown kernel patch. To understand how it works, you need to know two things: HID and BPF [2].
HID is the Human Interface Device standard and the most common way input devices communicate with the host (HID over USB, HID over Bluetooth, etc.). It has two core components: the "report descriptor" and "reports", both of which are byte arrays. The report descriptor is a fixed burnt-in-ROM byte array that (in rather convoluted terms) tells us what we'll find in the reports. Things like "bits 16 through to 24 is the delta x coordinate" or "bit 5 is the binary button state for button 3 in degrees Celsius". The reports themselves are sent at (usually) regular intervals and contain the data in the described format, as the device perceives reality. If you're interested in more details, see Understanding HID report descriptors.
BPF or more correctly eBPF is a Linux kernel technology to write programs in a subset of C, compile it and load it into the kernel. The magic thing here is that the kernel will verify it, so once loaded, the program is "safe". And because it's safe it can be run in kernel space which means it's fast. eBPF was originally written for network packet filters but as of kernel v6.3 and thanks to Benjamin, we have BPF in the HID subsystem. HID actually lends itself really well to BPF because, well, we have a byte array and to fix our devices we need to do complicated things like "toggle that bit to zero" or "swap those two values".
If we want to fix our devices we usually need to do one of two things: fix the report descriptor to enable/disable/change some of the values the device pretends to support. For example, we can say we support 5 buttons instead of the supposed 8. Or we need to fix the report by e.g. inverting the y value for the device. This can be done in a custom kernel driver but a HID BPF program is quite a lot more convenient.
For illustration purposes, here's the example program to flip the y coordinate. HID BPF programs are usually device specific: we need to know that e.g. the y coordinate is 16 bits and sits in bytes 3 and 4 (little endian):
SEC("fmod_ret/hid_bpf_device_event")
int BPF_PROG(hid_y_event, struct hid_bpf_ctx *hctx)
{
    s16 y;
    __u8 *data = hid_bpf_get_data(hctx, 0 /* offset */, 9 /* size */);

    if (!data)
        return 0; /* EPERM check */

    y = data[3] | (data[4] << 8);
    y = -y;

    data[3] = y & 0xFF;
    data[4] = (y >> 8) & 0xFF;

    return 0;
}
That's it. HID-BPF is invoked before the kernel handles the HID report/report descriptor so to the kernel the modified report looks as if it came from the device.
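The byte juggling is easy to get wrong, so it can help to check it in user space before loading anything into the kernel. The sketch below mirrors the arithmetic of the BPF program on a plain byte array. The function name is made up and it is not part of udev-hid-bpf; it assumes the same layout as above (16-bit little-endian y in bytes 3 and 4).

```c
#include <stdint.h>

/* Negate the 16-bit little-endian y value stored in bytes 3 and 4
 * of a report buffer. The buffer must be at least 5 bytes long. */
void flip_y(uint8_t *report)
{
    /* Reassemble the two bytes into one 16-bit value. */
    uint16_t raw = (uint16_t)(report[3] | (report[4] << 8));

    /* Two's-complement negation, done on the unsigned value so the
     * result is fully defined by the C standard. */
    uint16_t neg = (uint16_t)(0u - raw);

    /* Split it back into low and high byte. */
    report[3] = neg & 0xFF;
    report[4] = (neg >> 8) & 0xFF;
}
```

A y of 5 (bytes 0x05 0x00) becomes -5 (bytes 0xFB 0xFF), and flipping twice restores the original report. One corner case worth knowing: -32768 has no positive counterpart in 16 bits and maps to itself.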
As said above, this is device specific because where the coordinates are in the report depends on the device (the report descriptor will tell us). In this example we want to ensure the BPF program is only loaded for our device (vid/pid of 04d9/a09f), and for extra safety we also double-check that the report descriptor matches.
// The bpf.o will only be loaded for devices in this list
HID_BPF_CONFIG(
    HID_DEVICE(BUS_USB, HID_GROUP_GENERIC, 0x04D9, 0xA09F)
);

SEC("syscall")
int probe(struct hid_bpf_probe_args *ctx)
{
    /*
     * The device exports 3 interfaces.
     * The mouse interface has a report descriptor of length 71.
     * So if report descriptor size is not 71, mark as -EINVAL
     */
    ctx->retval = ctx->rdesc_size != 71;
    if (ctx->retval)
        ctx->retval = -EINVAL;

    return 0;
}
Obviously the check in probe() can be as complicated as you want.
This is pretty much it, the full working program only has a few extra includes and boilerplate. So it mostly comes down to compiling and running it, and this is where udev-hid-bpf comes in.
udev-hid-bpf is a tool to make the development and testing of HID BPF programs simple, and to collect HID BPF programs. You basically run meson compile and meson install and voila, whatever BPF program applies to your devices will be auto-loaded next time you plug those in. If you just want to test a single bpf.o file you can run udev-hid-bpf install /path/to/foo.bpf.o and it will install the required udev rule for it to get loaded whenever the device is plugged in. If you don't know how to compile, you can grab a tarball from our CI and test the pre-compiled bpf.o. Hooray, even simpler.
udev-hid-bpf is written in Rust but you don't need to know Rust, it's just the scaffolding. The BPF programs are all in C. Rust just gives us a relatively easy way to provide a static binary that will work on most testers' machines.
The documentation for udev-hid-bpf is here. So if you have a device that needs a hardware quirk or just has an annoying behaviour that you always wanted to fix, well, now's the time. Fixing your device has never been easier! [3].
[1] Yes, the name is meh but you're welcome to come up with a better one and go back in time to suggest it a few months ago.
[2] Because I'm lazy the terms eBPF and BPF will be used interchangeably in this article. Because the difference doesn't really matter in this context, it's all eBPF anyway but nobody has the time to type that extra "e".
[3] Citation needed
Hi!
The X.Org Foundation results are in, and I’m now officially part of the Board of Directors. I hope I can be of use to the community on more organizational issues! Speaking of which, I’ve spent quite a bit of time dealing with Code of Conduct matters lately. Of course I can’t disclose details for privacy, but hopefully our actions can gradually improve the contribution experience for FreeDesktop.Org projects.
New extensions have been merged in wayland-protocols. linux-drm-syncobj-v1 enables explicit synchronization which is a better architecture than what we have today (implicit synchronization) and will improve NVIDIA support. alpha-modifier-v1 allows Wayland clients to set an alpha channel multiplier on their surfaces; it can be used to implement effects such as fade-in or fade-out without redrawing, and can even be offloaded to KMS. The tablet-v2 protocol we’ve used for many years has been stabilized.
In other Wayland news, a new API has been added to dynamically resize libwayland’s internal buffer. By default, the server-side buffer size is still 4 KiB but the client-side buffer will grow as needed. This should help with bursts (e.g. long format lists) and high poll rate mice. I’ve added a new wayland-scanner mode to generate headers with only enums to help libraries such as wlroots which use these in their public API. And I’ve sent an announcement for the next Wayland release, it should happen at the end of May if all goes well.
With the help of Sebastian Wick, libdisplay-info has gained support for more bits, in particular DisplayID type II, III and VII timings, as well as CTA Video Format Preference blocks, Room Configuration blocks and Speaker Location blocks. I’ve worked on libicc to finish up the parser, next I’d like to add the math required to apply an ICC profile. gamja now has basic support for file uploads (only when pasting a file for now) and hides no-op nickname changes (e.g. from “emersion” to “emersion_” and back).
See you next month!
Just a quick post to let everyone know that I have clicked merge on the vroom MR. Once it lands, you can test the added performance gains with ZINK_DEBUG=ioopt.
I’ll be enabling this by default in the next month or so once a new GL CTS release happens that fixes all the hundreds of broken tests which would otherwise regress. With that said, I’ve tested it on a number of games and benchmarks, and everything works as expected.
Have fun.
Trusting hardware, particularly the registers that describe its functionality, is fundamentally risky.
The etnaviv GPU stack is continuously improving and becoming more robust. This time, a hardware database was incorporated into Mesa, utilizing header files provided by the SoC vendors.
If you are interested in the implementation details, I recommend checking out this Mesa MR.
Are you employed at VeriSilicon and want to help? You could greatly simplify our work by supplying the community with a comprehensive header that includes all the models you offer.
Last but not least: I deeply appreciate Igalia’s passion for open source GPU driver development, and I am grateful to be a part of the team. Their enthusiasm for open source work not only pushes the boundaries of technology but also builds a strong, collaborative community around it.
Years ago, when I began dedicating time to hacking on etnaviv, the kernel driver in use would read a handful of registers and relay the gathered information to the user space blob. This blob driver was then capable of identifying the GPU (including model, revision, etc.), supported features (such as DXT texture compression, seamless cubemaps, etc.), and crucial limits (like the number of registers, number of varyings, and so on).
For reverse engineering purposes, this interface is super useful. Imagine if you could change one of these feature bits on a target running the binary blob.
With libvivhook it is possible to do exactly this. From time to time, I am running such an old vendor driver stack on an i.MX 6QuadPlus SBC, which features a Vivante GC3000 as its GPU.
Somewhere, I have a collection of scripts that I utilized to acquire additional knowledge about unknown GPU states activated when a specific feature bit was set.
To explore a simple example, let’s consider the case of misrepresenting a GPU’s identity as a GC2000. This involves modifying the information provided by the kernel driver to the user space, making the user space driver believe it is interacting with a GC2000 GPU. This scenario could be used for testing, debugging, or understanding how specific features or optimizations are handled differently across GPU models.
export ETNAVIV_CHIP_MODEL="0x2000"
export ETNAVIV_CHIP_REVISION="0x5108"
export ETNAVIV_FEATURES0_CLEAR="0xFFFFFFFF"
export ETNAVIV_FEATURES1_CLEAR="0xFFFFFFFF"
export ETNAVIV_FEATURES2_CLEAR="0xFFFFFFFF"
export ETNAVIV_FEATURES0_SET="0xe0296cad"
export ETNAVIV_FEATURES1_SET="0xc9799eff"
export ETNAVIV_FEATURES2_SET="0x2efbf2d9"
LD_PRELOAD="/lib/viv_interpose.so" ./test-case
If you capture the generated command stream and compare it with the one produced under the correct identity, you’ll observe many differences. This is super useful - I love it.
At some point in time, Vivante changed their ioctl() interface and modified the gcvHAL_QUERY_CHIP_IDENTITY command. Instead of providing a very detailed chip identity, they reduced the data set to the following values:
This shift could indeed hinder reverse engineering efforts significantly. At a glance, it becomes impossible to alter any feature value, and understanding how the vendor driver processes these values is out of reach. Determining the function or impact of an unknown feature bit now seems unattainable.
However, the kernel driver also requires a mechanism to verify the existing features of the GPU, as it needs to accommodate a wide variety of GPUs. Therefore, there must be some sort of system or method in place to ensure the kernel driver can effectively manage and support the diverse functionalities and capabilities of different GPUs.
Let’s welcome: gc_feature_database.h, or hwdb for short.
Vivante transitioned to using a database that stores entries for limit values and feature bits. This database is accessed by querying with model, revision, product id, eco id and customer id.
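To make the lookup concrete, here is a rough sketch of what querying such a database can look like. The struct is heavily trimmed and its member names are hypothetical (the real generated headers have hundreds of members), and apart from the GC2000 model/revision pair used earlier in this post, the table contents are made up.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily trimmed database entry. */
struct hwdb_entry {
    /* The five keys used for the query. */
    uint32_t model;
    uint32_t revision;
    uint32_t product_id;
    uint32_t eco_id;
    uint32_t customer_id;
    /* Payload: one example limit value and one example feature bit. */
    uint32_t stream_count;
    unsigned has_astc : 1;
};

/* Made-up sample data, not copied from any vendor header. */
static const struct hwdb_entry hwdb[] = {
    { 0x2000, 0x5108, 0x0,  0x0, 0x0,  8, 0 },
    { 0x7000, 0x6214, 0x30, 0x0, 0x0, 16, 1 },
};

/* Return the entry matching all five ids, or NULL if there is none. */
const struct hwdb_entry *hwdb_lookup(uint32_t model, uint32_t revision,
                                     uint32_t product_id, uint32_t eco_id,
                                     uint32_t customer_id)
{
    for (size_t i = 0; i < sizeof(hwdb) / sizeof(hwdb[0]); i++) {
        const struct hwdb_entry *e = &hwdb[i];

        if (e->model == model && e->revision == revision &&
            e->product_id == product_id && e->eco_id == eco_id &&
            e->customer_id == customer_id)
            return e;
    }
    return NULL;
}
```

Because a feature lives in a table row rather than in silicon, turning off a broken feature for one specific model and revision is a one-line change to the data.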
There is some speculation as to why this move was made. My theory is that they became frustrated with the recurring cycle of introducing feature bits to indicate the implementation of a feature, subsequently discovering problems with said feature, and then having to introduce additional feature bits to signal that the feature now truly operates as intended. It became far more straightforward to deactivate a malfunctioning feature by modifying information in the hardware database (hwdb). After they began utilizing the hwdb within the driver, updates to the feature registers in the hardware ceased.
Here is a concrete example of such a case that can be found in the etnaviv gallium driver:
screen->specs.tex_astc = VIV_FEATURE(screen, chipMinorFeatures4, TEXTURE_ASTC) &&
!VIV_FEATURE(screen, chipMinorFeatures6, NO_ASTC);
Meanwhile, in the etnaviv world there was a hybrid in the making. We stuck with the detailed feature words and found a smart way to convert from Vivante’s hwdb entries to our own in-kernel database. There is even a full blown Vivante -> etnaviv hwdb converter.
At that time, I did not fully understand all the consequences this approach would bring - more on that later. So, I dedicated my free time to reverse engineering and tweaking the user space driver, while letting the kernel developers do their thing.
About a year after the initial hwdb landed in the kernel, I thought it might be a good idea to read out the extra id values, and provide them via sysfs to the user space. At that time, I already had the idea of moving the hardware database to user space in mind. However, I was preoccupied with other priorities that were higher on my to-do list, and I ended up forgetting about it.
Tomeu Vizoso began to work on teflon and a Neural Processing Unit (NPU) driver within Mesa, leveraging a significant amount of the existing codebase and concepts, including the same kernel driver for the GPU. During this process, he encountered a need for some NPU-specific limit values. To address this, he added an in-kernel hwdb entry and made the limit values accessible to user space.
That’s it — the kernel supplies all the values the NPU driver requires. We’re finished, aren’t we?
It turns out that there are many more NPU related values that need to be exposed in the same manner, with seemingly no end in sight.
One of the major drawbacks when the hardware database (hwdb) resides in the kernel is the considerable amount of time it takes for hwdb patches to be written, reviewed, and eventually merged into Linus’s git tree. This significantly slows down the development of user space drivers. For end users, this means they must either run a bleeding-edge kernel or backport the necessary changes on their own.
For me personally, the in-kernel hardware database should never have been implemented in its current form. If I could go back in time, I would have voiced my concerns.
As a result, moving the hardware database (hwdb) to user space quickly became a top priority on my to-do list, and I began working on it. However, during the testing phase of my proof of concept (PoC), I had to pause my work due to a kernel issue that made it unreliable for user space to trust the ID values provided by the kernel. Once my fix for this issue began to be incorporated into stable kernel versions, it was time to finalize the user space hwdb.
There is only one little but important detail we have not talked about yet. There are vendor specific versions of gc_feature_database.h based on different versions of the binary blob. For instance, there is one from NXP, ST, Amlogic and some more.
Here is a brief look at the differences:
nxp/gc_feature_database.h (autogenerated at 2023-10-24 16:06:00, 861 struct members, 27 entries)
stm/gc_feature_database.h (autogenerated at 2022-12-29 11:13:00, 833 struct members, 4 entries)
amlogic/gc_feature_database.h (autogenerated at 2021-04-12 17:20:00, 733 struct members, 8 entries)
We understand that these header files are generated and adhere to a specific structure. Therefore, all we need to do is write an intelligent Python script capable of merging the struct members into a single consolidated struct. This script will also convert the old struct entries to the new format and generate a header file that we can use.
I’m consistently amazed by how swiftly and effortlessly Python can be used for such tasks. Ninety-nine percent of the time, there’s a ready-to-use Python module available, complete with examples and some documentation. To address the C header parsing challenge, I opted for pycparser.
The final outcome is a generated hwdb.h file that looks and feels similar to those generated from the binary blob.
This header merging approach offers several advantages:
While working on this topic I decided to do a bigger refactoring with the end goal to provide a struct etna_core_info that is located outside of the gallium driver.
This makes the code future proof and moves the filling of struct etna_core_info directly into the lowest layer - libetnaviv_drm (src/etnaviv/drm).
We have not yet talked about one important detail.
What happens if there is no entry in the user space hwdb?
The solution is straightforward: we fall back to the previous method and request all feature words from the kernel driver. However, in an ideal scenario, our user space hardware database should supply all necessary entries. If you find that an entry for your GPU/NPU is missing, please get in touch with me.
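The fallback logic can be sketched roughly as follows. This is not the Mesa implementation: the struct, the callback types, the stub functions and the word/bit constants are all hypothetical, and they only illustrate the order in which the two sources are consulted.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down description of a core. */
struct core_info {
    bool tex_astc;
    bool from_hwdb; /* records which path filled the struct */
};

/* Hypothetical hooks: a user space hwdb lookup and a kernel query. */
typedef const struct core_info *(*hwdb_lookup_fn)(uint32_t model,
                                                  uint32_t revision);
typedef uint32_t (*kernel_feature_word_fn)(unsigned index);

#define ASTC_WORD 0u        /* made-up feature word index */
#define ASTC_BIT  (1u << 0) /* made-up bit position */

/* Prefer the user space hwdb, fall back to kernel feature words. */
void fill_core_info(struct core_info *out, uint32_t model, uint32_t revision,
                    hwdb_lookup_fn lookup, kernel_feature_word_fn query)
{
    const struct core_info *entry = lookup(model, revision);

    if (entry) {
        *out = *entry;
        out->from_hwdb = true;
        return;
    }

    /* No entry: derive the feature from what the kernel reports. */
    out->tex_astc = (query(ASTC_WORD) & ASTC_BIT) != 0;
    out->from_hwdb = false;
}

/* Stub hwdb that knows exactly one core and says it lacks ASTC. */
const struct core_info *stub_lookup(uint32_t model, uint32_t revision)
{
    static const struct core_info gc2000 = { false, true };

    return (model == 0x2000 && revision == 0x5108) ? &gc2000 : NULL;
}

/* Stub kernel query that reports the ASTC bit as set. */
uint32_t stub_query(unsigned index)
{
    return index == ASTC_WORD ? ASTC_BIT : 0;
}
```

With the stubs, a core that is present in the database gets its values from there even though the stub kernel reports something different, while an unknown core ends up with whatever the feature words say.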
The existing system, despite its limitations, is set to remain indefinitely, with new entries being added to accommodate new GPUs. Although it will never contain as much information as the user space counterpart, this isn’t necessarily a drawback. For the purposes at hand, only a handful of feature bits are required.
I know what you’re all thinking: there have not been enough blog posts this year. As always, my highly intelligent readers are right, and as always, you’re just gonna have to live with that because I’m not changing the way anything works. SGC happens when it happens.
And today. As it snows in April. SGC. Is. Happening.
Let’s begin.
I was sitting at my battlestation doing some very ordinary REDACTED work for REDACTED, and friend of the blog, Samuel “Shader Objects” Pitoiset (he has legally changed his name, please be respectful), came to me with a simple request. He wanted to enable VK_EXT_shader_object for the radv-zink jobs in mesa CI as the final part of his year-long bringup for the extension. This meant that all the tests passing without shader objects needed to also pass with shader objects.
This should’ve been easy; it was over a year ago that the Khronos blog famously and confusingly announced that pipelines were dead and nobody should ever use them again (paraphrased). A year is more than enough time for everyone to collectively get their shit together. Or so you might think.
Turns out shader objects are hard. This simple ask sent me down a rabbithole the likes of which I had never imagined.
It started normally enough. There were a few zink tests which failed when shader objects were enabled. Nobody was surprised; I wrote the zink usage before validation support had landed and also before anything but lavapipe supported it. As everyone is well aware, lavapipe is the best and most handsome Vulkan driver, and just by using it you eliminate all bugs that your application may have. RADV is not, and so there are bugs.
A number of them were simple:
The list goes on, and longtime followers of the blog are nodding to themselves as they skim the issues, confirming that they would have applied all the same one-liner fixes.
Then it started to get crazy.
I’m a genius, so obviously I know how this all works. That’s why I’m writing this blog. Right?
Right. Good. So Samuel comes to me, and he hits me with this absolute brainbuster of an issue. An issue so tough that I have to perform an internet search to find a credible authority on the topic. I found this amazing and informative site that exactly described the issue Samuel had posted. I followed the staggering intellect of the formidable author and blah blah blah yeah obviously the only person I’d find writing about an issue I have to solve is past-me who was too fucking lazy to actually solve it.
I started looking into this more deeply after taking a moment to fix a different issue related to location assignment that Samuel was too lazy to file a ticket for and thus has deprived the blog of potential tests that readers could run to examine and debug the issue for themselves. But the real work was happening elsewhere.
Now we’re getting to the good stuff. I hope everyone has their regulation-thickness safety helmet strapped on and splatter guards raised to full height because you’ll need them both.
As I said in Adventures In Linking, nir_assign_io_var_locations is the root of all evil. In the case where shaders have mismatched builtins, the assigned locations are broken. I decided to take the hammer to this. I mean I took the forbidden action, did the very thing that I railed about live at XDC.
Sidebar: at this exact moment, Samuel told me his issue was already fixed.
I added a new pipe cap.
I know. It was a last resort, but I wanted the issue fixed. The result was this MR, which gave nir_assign_io_var_locations the ability to ignore builtins with regard to assigning locations. This would resolve the issue once and for all, as drivers which treat builtins differently could pass the appropriate param to the NIR pass and then get good results.
Problem solved.
I got some review comments which were interesting, but ultimately the problem remained: lavapipe (and maybe some other vulkan drivers) use this pass to assign locations, and no amount of pipe caps will change that.
It was a tough problem to solve, but someone had to do it. That’s why I dug in and began examining this MR from the only man who is both a Mesa expert and a Speed Force user, Marek Olšák, to enable his new NIR optimized linker for RadeonSI. This was a big, meaty triangles-go-brrr thing to sink my teeth into. I had to get into a different headspace to figure out what I was even doing anymore.
The gist of opt_varyings is that you give all the shaders in a pipeline to Marek, and Marek says “trust me, buddy, this is gonna be way faster” and gives you back new shaders that do the same thing except only the vertex shader actually has any code. Read the design document if you want more info.
Now I’m deep into it though, and I’m reading the commits, and I see there’s this new lower_mediump_io callback which lowers mediump I/O to 16bit. Which is allowed by GLSL. And I use GLSL, so naturally I could do this too. And I did, and I ran it in zink, and I put it through CTS and OH FUCK OH SHIT OH FUCK WHAT THE FUCK EVEN–
Here’s the thing. In GLSL, you can have mediump I/O which drivers can translate to mean 16bit I/O, and this works great. In Vulkan, we have this knockoff brand, dumpster tier VK_KHR_16bit_storage extension which seems like it should be the same, except for one teeny tiny little detail:
• VUID-StandaloneSpirv-Component-04920
The Component decoration value must not be greater than 3
Brilliant. So I can have up to four 16bit components at a given location. Two whole dwords. Very useful. Great. Just what I wanted. Thanks.
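The arithmetic behind that complaint can be sketched in a few lines. This is toy math only: the helper name is made up for illustration, and the only fact taken from the text is the quoted rule that a Component decoration may not exceed 3. It is only meant for 16bit and 32bit components.

```python
# Toy arithmetic only: models the quoted rule that a Component decoration
# may not exceed 3, so one location exposes component indices 0..3.
MAX_COMPONENT_INDEX = 3  # from VUID-StandaloneSpirv-Component-04920


def usable_bits_per_location(component_bits):
    """Bits addressable at one location if every component has this width.

    Hypothetical helper; only meaningful for 16bit and 32bit components.
    """
    components = MAX_COMPONENT_INDEX + 1
    return components * component_bits


# Four 32bit components fill four dwords; four 16bit components fill two.
print(usable_bits_per_location(32) // 32, "dwords with 32bit components")
print(usable_bits_per_location(16) // 32, "dwords with 16bit components")
```

In other words, shrinking the components does not buy any extra components per location; it only halves how much of the location gets used.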
Also, XFB is a thing, and, well, pardon my saying so, but mediump xfb? Fuck right off.
With mediump safely ejected from the codebase and my life, I was free to pursue other things. I didn’t, but I was free to. And even with Samuel screaming somewhere distant that his issue was already long since fixed, I couldn’t stop. There were other people struggling to implement opt_varyings in their own drivers, and as we all know, half of driver performance is the speed with which they implement new features. That meant that, as expected, RadeonSI had a significant lead on me since I’m always just copying Marek’s homework anyway, but the hell if I was about to let some other driver copy homework faster than me.
Fans of the blog will recall way, way, way, way back in Q3 ‘23 when I blogged about very dumb things. Specifically about how I was going to start using “lowered I/O” in zink. Well, I did that. And then I let the smoking rubble cool for a few months. And now it’s Q2 ‘24, and I’m older and unfathomably wiser, and I am about to put this rake into the wheel of my bicycle once more.
In this case, the rake is nir_io_glsl_lower_derefs, which moves all the I/O lowering into the frontend rather than doing it manually. The result is the same: zink gets lowered I/O, and the only difference is that it happens earlier. It’s less code in zink, and…
Of course there is no driver but RadeonSI which sets nir_io_glsl_lower_derefs.
And, of course, RadeonSI doesn’t use any of the common Gallium NIR passes.
But surely they’d still work.
Surely at least some of them would work.
Surely there wouldn’t be that many of them.
Surely fucking all of themthe ones that didn’t work would be easy to fix.
Surely they wouldn’t uncover any other, more complex, more time-consuming issues that would drag in the entire Mesa compiler ecosystem.
Wouldn’t be worth mentioning at SGC if any of those were true, would it.
By now I was pretty deep into this project, which is to say that I had inexplicably vanished from several other tasks I was supposed to be accomplishing, and the only way out was through. But before I could delve into any of the legacy GL compatibility stuff, I had bigger problems.
Namely everything was exploding because I failed to follow the directions and was holding opt_varyings wrong. In the fine print, the documentation for the pass very explicitly says that lower_to_scalar must be set in the compiler options. But did I read the directions? Obviously I did. If you’re asking whether I read them comprehensively, however, or whether I remembered what I had read once I was deep within the coding fugue of fixing this damn bug Samuel had given me way back wh
With lower_to_scalar active, I actually came upon the big problem: my existing handling for lowered I/O was inadequate, and I needed to make my code better. Much better.
Originally when I switched to lowered I/O, I wrote some passes to unclown I/O back to variables and derefs. There was one NIR pass that ran early on to generate variables based on the loads and stores, and there was a second that ran just before spirv translation to convert all the load/store intrinsics back to load/store derefs. This worked great.
But it didn’t work great now! Obviously it wouldn’t, right? I mean, nothing in this entire compiler stack ever works, does it? It’s all just a giant jenga tower that’s one fat-finger away from total and utter—What? Oh, right, heh, yeah, no, I just got a little carried away remembering is all. No problem. Let’s keep going. We have to now that we’ve already come this far. Don’t we? I’ll stop writing if you stop reading, how about that. No? Well, heh, of course it’d be that way! This is… We’re SGC!
So I had this rework_io_vars function, and it. was. BIG. I’m talking over a hundred lines with loops and switches and all kinds of cool control flow to handle all the weird corner cases I found at 4:14am when I was working on it. The way that it worked was pretty simple:
It worked great. Really, there were no known bugs.
The problem with this came with the scalarized frontend I/O lowering, which would create patterns like:
store(location=1, component_count=1)
store(location=0, component_count=1, array_size=4, array_offset=$val)
In this scenario, there’s indirect access mixed with direct access for the same location, but it’s at an offset from the base of the array, and it kiiinda almost works except it totally doesn’t because the first instruction has no metadata hint about being part of the second instruction’s array. And since the pass iterates over the shader in instruction order, encountering the instructions in this order is a problem whereas encountering them in a different order potentially wouldn’t be a problem.
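To make the ordering hazard concrete, here is a toy model of an instruction-order scan. It is not the zink code: the dict layout and both function names are invented for illustration, and it shows only one way such order dependence can arise.

```python
# Toy model, not Mesa code. Each store is a dict with a "location" and an
# optional "array_size", mirroring the two pseudo-instructions above.


def naive_create_vars(stores):
    """Walk stores in instruction order and create a variable the first time
    a location is seen that no existing variable already covers."""
    variables = []
    for store in stores:
        loc = store["location"]
        covered = any(v["location"] <= loc < v["location"] + v["array_size"]
                      for v in variables)
        if not covered:
            variables.append({"location": loc,
                              "array_size": store.get("array_size", 1)})
    return variables


def overlapping_locations(variables):
    """Return the sorted locations claimed by more than one variable."""
    claims = {}
    for v in variables:
        for loc in range(v["location"], v["location"] + v["array_size"]):
            claims[loc] = claims.get(loc, 0) + 1
    return sorted(loc for loc, count in claims.items() if count > 1)


scalar_store = {"location": 1}
array_store = {"location": 0, "array_size": 4}

# Scalar first: location 1 gets its own variable, then the array claims 0..3.
print(overlapping_locations(naive_create_vars([scalar_store, array_store])))
# Array first: the scalar store falls inside the array and nothing overlaps.
print(overlapping_locations(naive_create_vars([array_store, scalar_store])))
```

The same two stores produce a conflicting variable at location 1 in one order and a clean result in the other, which is the kind of fragility described above.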
I had two options available to me at that point. The first option was to add in some workarounds to enlarge the scalar to an array when encountering this pattern. And I tried that, and it worked. But then I came across a slightly different variant which didn't work. And that's when I chose the second option.
Burn it all down. The whole thing.
I mean, uh, just—just that one function. It’s not like I want to BURN THE WHOLE THING DOWN after staring into the abyss for so long, definitely not.
The new pass! Right, the new pass. The new rework_io_vars pass that I wrote is a sequence of operations that ends up being far more robust than the original. It works something like this:
• shader_info masks, e.g., outputs_written and inputs_read
• rework_io_vars is the base function with special-casing for VS inputs and FS outputs to create variables for those builtins separately
The “scan” process ends up being a function called loop_io_var_mask which iterates a shader_info mask for a given input/output mode and scans the shader for instructions which occur on each location for that mode. The gathered info includes a component mask as well as array size and fbfetch info–all that stuff. Everything needed to create variables. After the shader is scanned, variables are created for the given location. By processing the indirect mask first, it becomes possible to always detect the above case and handle it correctly.
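The "indirect first" idea can be sketched with a toy model. This is an assumption-laden illustration rather than the real loop_io_var_mask: the store dicts, the "indirect" flag, and both function names are hypothetical, and the only behaviour borrowed from the text is that indirectly accessed arrays are handled before direct accesses and that a component mask and array size are gathered per variable.

```python
# Toy model, not Mesa code: stores are dicts with "location", "component",
# and optionally "array_size" and "indirect" (all names are hypothetical).


def find_covering(variables, location):
    """Return the variable whose location range contains `location`, if any."""
    for base, var in variables.items():
        if base <= location < base + var["array_size"]:
            return var
    return None


def create_vars_indirect_first(stores):
    """Build variables keyed by base location, handling indirect arrays first."""
    variables = {}
    # Pass 1: indirectly addressed arrays claim their whole location range.
    for store in stores:
        if store.get("indirect"):
            var = variables.setdefault(
                store["location"], {"array_size": 1, "component_mask": 0})
            var["array_size"] = max(var["array_size"], store["array_size"])
            var["component_mask"] |= 1 << store["component"]
    # Pass 2: direct accesses fold into a covering array or get their own var.
    for store in stores:
        if not store.get("indirect"):
            var = find_covering(variables, store["location"])
            if var is None:
                var = {"array_size": 1, "component_mask": 0}
                variables[store["location"]] = var
            var["component_mask"] |= 1 << store["component"]
    return variables


stores = [
    {"location": 1, "component": 1},
    {"location": 0, "component": 0, "array_size": 4, "indirect": True},
]
print(create_vars_indirect_first(stores))
print(create_vars_indirect_first(list(reversed(stores))))
```

Because the array is known before any direct access is looked at, both instruction orders end up with a single four-element variable at location 0.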
Problem solved.
But that’s fine, and I am so sane right now you wouldn’t believe it if I told you. I wrote this great, readable, bulletproof variable generator, and it’s tremendous, but then I tried using it without nir_io_glsl_lower_derefs because I value bisectability, and obviously there was zero chance that would ever work so why would I ever even bother. XFB is totally broken, and there’s all kinds of other weird failures that I started examining and then had to go stand outside staring into the woods for a while, and it’s just not happening. And nir_io_glsl_lower_derefs doesn’t work without the new version either, which means it’s gonna be impossible to bisect anything between the two changes.
Totally fine, I’m sure, just like me.
By now, I had a full stack of zink compiler cleanups and fixes that I’d accumulated in the course of all this. Multiple stacks, really. So many stacks. Fortunately I was able to slip them into the repo without anyone noticing. And also without CI slowing to a crawl due to the freedreno farm yet again being in an absolute state.
I was passing CTS again, which felt great. But then I ran piglit, and I remembered that I had skipped over all those Gallium compatibility passes. And I definitely had to go in and fix them.
There were a lot of these passes to fix, and nearly all of them had the same two issues:
This meant I had to add handling for lowered I/O without variables, and then I also had to add generic handling for scalarized versions of both codepaths. Great, great, great. So I did that. And one of them really needed a lot of work, but most of the others were reasonably straightforward.
And then there’s lower_clip.
lower_clip is a pass that rewrites shaders to handle user-specified clip planes when the underlying driver doesn’t support them. The pass does this by leveraging clipdistance.
And here’s the thing about clipdistance: unlike the other builtins, it’s an array. But it’s treated like a vector. Except you can still access it indirectly like an array. So is it an array or is it a vector? Decades from now, graphics engineers will still be arguing about this stupidity, but now is the time when I need to solve this, and it’s not something that I as a mere, singular human, can possibly solve. Hah! There’s no way I’d be able to do that. I’d have to be crazy. And I’m… Uh-oh, what’s the right way to finish that statement? It’s probably fine! Everything’s fine!
But when you’ve got an array that’s treated like a vector that’s really an array, things get confusing fast, and in NIR there’s the compact flag to indicate that you need to reassess your life choices. One of those choices needing reassessment is the use of nir_shader_gather_info, a simple function that populates shader_info with useful metadata after scanning the shader. And here’s a pop quiz that I’m sure everyone can pass with ease after reading this far.
How many shader locations are consumed by gl_ClipDistance?
Simple question, right? It’s a variably-sized float[] array-vector with up to 8 members, so it consumes up to two locations. Right? No, that’s a question, not a rhetorical… But you’re using nir_shader_gather_info, and it sees gl_ClipDistance, okay, so how many slots do you expect it to add to your outputs_written bitmask? Is it 8? Or is it 2? Does anybody really know?
Regardless of what you thought, the answer is 8, and you’ll get 8, and you’ll be happy with 8. And if you’re trying to use outputs_written for anything, and you see any of the other builtins within 8 slots of gl_ClipDistance being used, then you should be able to just figure it out that this is clipdistance playing pranks again. Right?
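The two candidate answers to the pop quiz are easy to put side by side. The sketch below is a toy bitmask model, not Mesa code: the slot number is a placeholder rather than the real enum value, and both helper names are made up. It only contrasts "one slot per array element" with "four floats packed per slot".

```python
# Toy model; the slot number is a placeholder, not Mesa's real enum value.
CLIP_DIST0_SLOT = 10  # hypothetical base slot for gl_ClipDistance


def mask_one_slot_per_element(base, length):
    """Treat the array like any other array: each element takes a slot."""
    mask = 0
    for i in range(length):
        mask |= 1 << (base + i)
    return mask


def mask_compact(base, length):
    """Treat the array as packed floats: four elements share one slot."""
    slots = (length + 3) // 4
    mask = 0
    for i in range(slots):
        mask |= 1 << (base + i)
    return mask


as_array = mask_one_slot_per_element(CLIP_DIST0_SLOT, 8)
as_compact = mask_compact(CLIP_DIST0_SLOT, 8)
print(bin(as_array).count("1"), "slots marked when counted per element")
print(bin(as_compact).count("1"), "slots marked when counted as packed vec4s")
```

Anything reading the first mask sees six extra bits set past the two locations the data actually occupies, which is why neighbouring builtins appear to collide with clipdistance.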
“It’s all fun and games until someone gets too deep into clipdistance” is a proverb oft-repeated among compiler developers. Personally, I went back and forth until I cobbled together something to sort of almost fix the problem, but I posed the issue to the community at large, and now we are having plans with headings and subheadings. You’re welcome.
And that’s the end of it, right?
The problem with going in and fixing anything in core Mesa is that you end up breaking everything else. So while I was off fixing Gallium compatibility passes, specifically lower_clip, I ended up breaking freedreno and v3d. Someday maybe we’ll get to the bottom of that.
But I’m fast-forwarding, because while I was working on this…
What even is this anymore? Right, I was fixing Samuel’s bug. The one about not using opt_varyings. So I had my variable generator functioning, and I had the compat passes working (for me), and CTS and piglit were both passing. Then I decided to try out nir_io_glsl_opt_varyings. Just a little. Just to see what happened.
I don’t have any more jokes here. It didn’t work good. A lot of things went boom-boom. There were some opt_varyings bugs like these, and some related bugs like this, and there was missing core NIR stuff for zink, and there were GLSL bugs, and also CTS was broken. Also a bunch of the earlier zink stacks of compiler patches were fixing bugs here.
But eventually, over weeks, it started working.
Other than verifying everything still works, I haven’t tested much. If you’re feeling brave, try out the MR with dependencies (or wait for rebase) and tell me how the perf looks. So far, all I’ve seen is about a 6000% improvement across the board.
Finally, it’s over.
Samuel, your bug is fixed. Never ask me for anything again.
The Linux kernel 6.8 came out on March 10th, 2024, bringing brand-new features and plenty of performance improvements on different subsystems. As part of Igalia, I’m happy to be an active part of many features that are released in this version, and today I’m going to review some of them.
Linux 6.8 is packed with a lot of great features, performance optimizations, and new hardware support. In this release, we can check the Intel Xe DRM driver experimentally, further support for AMD Zen 5 and other upcoming AMD hardware, initial support for the Qualcomm Snapdragon 8 Gen 3 SoC, the Imagination PowerVR DRM kernel driver, support for the Nintendo NSO controllers, and much more.
Igalia is widely known for its contributions to Web Platforms, Chromium, and Mesa. But, we also make significant contributions to the Linux kernel. This release shows some of the great work that Igalia is putting into the kernel and strengthens our desire to keep working with this great community.
Let’s take a deep dive into Igalia’s major contributions to the 6.8 release:
You may have seen the release of a new Steam Deck last year, the Steam Deck OLED. What you may not know is that Igalia helped bring this product to life by putting some effort into the AMD driver-specific color management properties implementation. Melissa Wen, together with Joshua Ashton (Valve), and Harry Wentland (AMD), implemented several driver-specific properties to allow Gamescope to manage color features provided by the AMD hardware to fit HDR content and improve gamers’ experience.
She has explained all features implemented in the AMD display kernel driver in two blog posts and a 2023 XDC talk:
André Almeida worked together with Simon Ser (SourceHut) to provide support for asynchronous page-flips in the atomic API. This feature targets users who want to present a new frame immediately, even after missing a V-blank. It is particularly useful for applications with high frame rates, such as games.
Raspberry Pi 5 was officially released in October 2023 and Igalia was ready to bring top-notch graphics support for it. Although we still can’t use the RPi 5 with the mainline kernel, it is superb to see some pieces coming upstream. Iago Toral worked on implementing all the kernel support needed for the V3D 7.1.x driver.
With the kernel patches, by the time the RPi 5 was released, it already included a fully OpenGL ES 3.1 and Vulkan 1.2 compliant driver implemented by Igalia.
Apart from the release of the Raspberry Pi 5, Igalia is still working on improving the whole Raspberry Pi environment. I worked, together with José Maria “Chema” Casanova, implementing the support for GPU stats on the V3D driver. This means that RPi 4/5 users now can access the usage percentage of the GPU and they can access the statistics by process or globally.
I also worked, together with Melissa, on implementing CPU jobs for the V3D driver. As the Broadcom GPU isn’t capable of performing some operations, the Vulkan driver uses the CPU to compensate for it. In order to avoid stalls in the job submission, CPU jobs are now part of the kernel and can be easily synchronized through synchronization objects.
If you are curious about the CPU job implementation, you can check this blog post.
Sometimes we don’t contribute to a major feature in the release, but we can still help by improving documentation and sending fixes. André also contributed to this release by documenting the different AMD GPU reset methods, making them easier for future users to understand.
During Igalia’s efforts to improve the general users’ experience on the Steam Deck, Guilherme G. Piccoli noticed a message in the kernel log and readily provided a fix for this PCI issue.
Outside of the Steam Deck world, we can check some of Igalia’s work on the Qualcomm Adreno GPUs. Although most of our Adreno-related work is located at the user-space, Danylo Piliaiev sent a couple of kernel fixes to the msm driver, fixing some hangs and some CTS tests.
We also had contributions from our 2023 Igalia CE student, Nia Espera. Nia’s project was related to mobile Linux and she managed to write a couple of patches to the kernel in order to add support for the OnePlus 9 and OnePlus 9 Pro devices.
If you are a student interested in open-source and would like to have a first exposure to the professional world, check if we have openings for the Igalia Coding Experience. I was a CE student myself and being mentored by an Igalian was an incredible experience.
I am publicly announcing that I am transgender. I have experienced gender dysphoria for almost ten years, but I have found a label that I feel comfortable with now.
In this article, I’m going to go over:
Before I delve into my personal experience, allow me to define several key terms:
Allow me to share a little backstory. I come from a neighborhood where being anti-LGBTQ+ was considered “normal” a decade ago. This outlook was quite common in the schools I attended, but I wouldn’t be surprised if a considerably significant portion of the people around here are still anti-LGBTQ+ today. Many individuals, including former friends and teachers, have expressed their opposition to LGBTQ+ in the past, which influenced my own view against the LGBTQ+ community at the time.
Due to my previous experiences and the environment I live(d) in, I tried really hard to avoid thinking about my sexuality and gender identity for almost a decade. Every time I thought about my sexuality and gender identity, I’d do whatever I could to distract myself. I kept forcing myself to be as masculine as possible. However, since we humans have a limit, I eventually reached a limit to the amount of thoughts I could suppress.
I always struggled with communicating and almost always felt lonely whenever I was around the majority of people, so I pretended to be “normal” and hid my true feelings. About 5 years ago, I began to spend most of my time online. I met people who are just like me, many of whom I’m still friends with 3-4 years later. At the time, despite my strong biases against LGBTQ+ from my surroundings, I naturally felt more comfortable within the community, far more than I did outside. I was able to express myself more freely and have people actually understand me. It was the only time I didn’t feel the need to act masculine. However, despite all this, I was still in the mindset of suppressing my feelings. Truly an egg irl moment.
Eventually, I was unable to hold my thoughts in anymore, and everything exploded. All I could think about for a few months was my gender identity: the biases from my childhood environment often clashed with me questioning my own identity, and whether I really saw myself as a man. I just had these recurring thoughts and a lot of anxiety about where I was getting these thoughts from, and why.
Since then, my work performance got exponentially worse by the week. I quickly lost interest in my hobbies, and began to distance myself from communities and friends. I often lashed out on people because my mental health was getting worse. My sleep quality was also getting worse, which only worsened the situation. On top of that, I still had to hide my feelings, which continued to exhaust me. All I could think about for months was my gender identity.
After I slowly became comfortable with and accepting of my gender identity, I started having suicidal thoughts on a daily basis, which I was able to endure… until I reached a breaking point once again. I was having suicidal thoughts on a bi-hourly basis. It escalated to hourly, and finally almost 24/7. I obviously couldn’t work anymore, nor could I do my hobbies. I needed to hide my pain because of my social anxiety. I also didn’t have the courage to call the suicide hotline either. What happened was that I talked to many people, some of whom have encouraged and even helped me seek professional help.
However, that was all in the past. I feel much better and more comfortable with myself and the people I opened up to, and now I’m confident enough to share it publicly 😊
I identify as agender. My pronouns are any/all; I’ll accept any pronouns. I don’t think I have a preference, so feel free to call me whatever you want; whatever you think fits me best :)
I’m happy with agender because I feel disconnected from my own masculinity. I don’t think I belong at either end of the spectrum (or even in between), so I’m pretty happy that there is something that best describes me.
So… why come out publicly? Why am I making a big deal out of this?
Simply put, I am really proud of and relieved at having discovered myself. For so long, I tried to suppress my thoughts and force myself to be someone I was fundamentally not. That never worked, so I explored myself instead and discovered that I’m trans. However, I also wrote this article to explain how much living in a transphobic environment affected me, even before I discovered myself.
For me, displaying my gender identity is like displaying a username or profile picture. We choose a username and profile picture when possible to give a glimpse of who we are.
I chose “TheEvilSkeleton” as my username because I used to play Minecraft regularly when I was 10 years old. While I don’t play Minecraft anymore, it helped me discover my passion: creating and improving things and working together — that’s why I’m a programmer and contribute to software. I chose Chrome-chan as my profile picture because I think she is cute and I like cute things :3. I highly value my username and profile picture, the same way I now value my gender identity.
While I’m doing much better than before, I did go through a depressive episode that I’m still recovering from at the time of writing, and I’m still processing the discovery because of my childhood environment, but I certainly feel much better after discovering myself and coming out.
However, coming out won’t magically heal the trauma I’ve experienced throughout my childhood environment. It won’t make everyone around me accept who I am, or even make them feel comfortable around me. It won’t drop the amount of harassment I receive online to zero — if anything, I write this with the expectation that I will be harassed and discriminated against more than ever.
There will be new challenges that I will have to face, but I still have to deal with the trauma, and I will have to deal with possible trauma in the future. The best thing I can do is train myself to be mentally resilient. I certainly feel much better coming out, but I’m still worried about the future. I sometimes wish I wasn’t trans, because I’m genuinely terrified about the things people have gone through in the past, and are still going through right now.
I know I’m going to have to fight for my life now that I’ve come out publicly, because apparently the right to live as yourself is still controversial in 2024.
Of course, I wasn’t alone in my journey. What helped me get through it was talking to my friends and seeking help in other places. I came out to several of my friends in private. They were supportive and listened to me vent; they reassured me that there’s nothing wrong with me, and congratulated me for discovering myself and coming out.
Some of my friends encouraged and helped me seek professional help at local clinics for my depression. I have gained more confidence in myself; I am now capable of calling clinics by myself, even when I’m nervous. If these suicidal thoughts escalate again, I will finally have the courage to call the suicide hotline.
If you’re feeling anxious about something, don’t hesitate to talk to your friends about it. Unless you know that they’ll take it the wrong way and/or are currently dealing with personal issues, they will be more than happy to help.
I have messaged so many people in private and felt much better after talking. I’ve never felt so comforted by friends who try their best to be there for me. Some friends have listened without saying anything, while some others have shared their experiences with me. Both were extremely valuable to me, because sometimes I just want (and need) to be heard and understood.
If you’re currently trying to suppress your thoughts and really trying to force yourself into the gender you were assigned at birth, like I was, the best advice I can give you is to give yourself time to explore yourself. It’s perfectly fine to acknowledge that you’re not cisgender (that is, if you’re not). You might want to ask your trans friends to help you explore yourself. From experience, it’s not worth forcing yourself to be someone you’re not.
I feel relieved about coming out, but to be honest, I’m still really worried about the future of my mental health. I really hope that everything will work out and that I’ll be more mentally resilient.
I’m really happy that I had the courage to take the first steps, to go to clinics, to talk to people, to open up publicly. It’s been really difficult for me to write and publish the article. I’m really grateful to have wonderful friends, and legitimately, I couldn’t ask for better friends.
Flatpaks have been a key part of our strategy for desktop applications for a while now and we are working on a multitude of things to make Flatpaks an even stronger technology going forward. Christian Hergert is working on figuring out how applications that require system daemons will work with Flatpaks, using his own Sysprof project as the proof of concept application. The general idea here is to rely on the work that has happened in systemd around sysext/confext/portablectl, trying to figure out how we can get a system service installed from a Flatpak and the necessary bits wired up properly. The other part of this work, figuring out how to give applications permissions that today are handled with udev rules, is being worked on by Hubert Figuière based on earlier work by Georges Stavracas on behalf of the GNOME Foundation thanks to the sponsorship from the Sovereign Tech Fund. So hopefully we will get both of these important issues resolved soon. Kalev Lember is working on polishing up the Flatpak support in Foreman (and Satellite) to ensure there are good tools for managing Flatpaks when you have a fleet of systems you manage, building on the work of Stephan Bergman. Finally, Jan Horak and Jan Grulich are working hard on polishing up the experience of using Firefox from a fully sandboxed Flatpak. This work is mainly about working with the upstream community to get some needed portals over the finish line and polish up some UI issues in Firefox, like this one.
Toolbx
Toolbx, our project for handling developer containers, is picking up pace with Debarshi Ray currently working on getting full NVIDIA binary driver support for the containers. One of our main goals for Toolbx at the moment is making it a great tool for AI development, and thus getting the NVIDIA and CUDA support squared away is critical. Debarshi has also spent quite a lot of time cleaning up the Toolbx website, providing easier access to and updating the documentation there. We are also moving to use the new Ptyxis (formerly Prompt) terminal application created by Christian Hergert in Fedora Workstation 40. This gives us a great GTK4 terminal, and we also believe we will be able to further integrate Toolbx and Ptyxis going forward, creating an even better user experience.
Nova
So as you probably know, we have been the core maintainers of the Nouveau project for years, keeping this open source upstream NVIDIA GPU driver alive. We plan to keep doing that, but the opportunities offered by the availability of the new GSP firmware for NVIDIA hardware mean we should now be able to offer a full featured and performant driver. But co-hosting both the old and the new way of doing things in the same upstream kernel driver has turned out to be counterproductive, so we are now looking to split the driver in two. For older pre-GSP NVIDIA hardware we will keep the old Nouveau driver around as is. For GSP based hardware we are launching a new driver called Nova. It is important to note here that Nova is thus not a competitor to Nouveau, but a continuation of it. The idea is that the new driver will be primarily written in Rust, based on work already done in the community. We are also evaluating whether some of the existing Nouveau code should be copied into the new driver, since we already spent quite a bit of time trying to integrate GSP there. Worst case scenario, if we can’t reuse code, we use the lessons learned from Nouveau with GSP to implement the support in Nova more quickly. Contributing to this effort from our team at Red Hat are Danilo Krummrich, Dave Airlie, Lyude Paul, Abdiel Janulgue and Phillip Stanner.
Explicit Sync and VRR
Another exciting development that has been a priority for us is explicit sync, which is critical especially for the NVIDIA driver, but which might also provide performance improvements for other GPU architectures going forward. So a big thank you to Michel Dänzer, Olivier Fourdan, Carlos Garnacho, the NVIDIA folks, Simon Ser and the rest of the community for working on this. This work has just finished upstream so we will look at backporting it into Fedora Workstation 40. Another major Fedora Workstation 40 feature is experimental support for Variable Refresh Rate or VRR in GNOME Shell. The feature was mostly developed by community member Dor Askayo, but Jonas Ådahl, Michel Dänzer, Carlos Garnacho and Sebastian Wick have all contributed with code reviews and fixes. In Fedora Workstation 40 you need to enable it using the command
gsettings set org.gnome.mutter experimental-features "['variable-refresh-rate']"
I already covered PipeWire in my post a week ago, but to quickly summarize here too: using PipeWire for video handling is now finally getting to the stage where it is actually happening. Both Firefox and OBS Studio now come with PipeWire support, and hopefully we can also get Chromium and Chrome to start taking a serious look at merging the patches for this soon. What’s more, Wim spent time fixing Firewire FFADO bugs, so hopefully for our pro-audio community users this makes their Firewire equipment fully usable and performant with PipeWire. Wim did point out when I spoke to him, though, that the FFADO drivers had obviously never had any other consumer than JACK, so when he tried to allow for more functionality the drivers quickly broke down. Wim has therefore limited the feature set of the PipeWire FFADO module to be an exact match of how these drivers were being used by JACK. If the upstream kernel maintainer is able to fix the issues found by Wim then we could look at providing a fuller feature set. In Fedora Workstation 40 the de-duplication support for v4l vs libcamera devices should work as soon as we update Wireplumber to the new 0.5 release.
To hear more about PipeWire and the latest developments be sure to check out this interview with Wim Taymans by the good folks over at Destination Linux.
Remote Desktop
Another major feature landing in Fedora Workstation 40 that Jonas Ådahl and Ray Strode have spent a lot of effort on is finalizing the remote desktop support for GNOME on Wayland. There has already been support for remote connections to already logged in sessions, but with these updates you can do the login remotely too, and thus the session does not need to be started already on the remote machine. This work will also enable 3rd party solutions to do remote logins on Wayland systems, so while I am not at liberty to mention names, be on the lookout for more 3rd party Wayland remoting software becoming available this year.
This work is also important to help Anaconda with its Wayland transition as remote graphical install is an important feature there. So what you should see there is Anaconda using GNOME Kiosk mode and the GNOME remote support to handle this going forward and thus enabling Wayland native Anaconda.
HDR
Another feature we have been working on for a long time is HDR, or High Dynamic Range. We wanted to do it properly and also needed to work with a wide range of partners in the industry to make this happen. So over the last year we have been contributing to improve various standards around color handling and acceleration to prepare the ground, and working on and contributing to key libraries needed to, for instance, gather the needed information from GPUs and screens. Things are coming together now, and Jonas Ådahl and Sebastian Wick are going to focus on getting Mutter HDR capable. Once that work is done we are by no means finished, but it should put us close to at least being able to start running some simple use cases (like some fullscreen applications) while we work out the finer points to get great support for running SDR and HDR applications side by side, for instance.
PyTorchWe want to make Fedora Workstation a great place to do AI development and testing. First step in that effort is packaging up PyTorch and making sure it can have working hardware acceleration out of the box. Tom Rix has been leading that effort on our end and you will see the first fruits of that labor in Fedora Workstation 40 where PyTorch should work with GPU acceleration on AMD hardware (ROCm) out of the box. We hope and expect to be able to provide the same for NVIDIA and Intel graphics eventually too, but this is definitely a step by step effort.
For the last couple of weeks I have kept chipping at a new userspace driver for the NPU in the Rockchip RK3588 SoC.
I am very happy to report that the work has gone really smoothly and I reached my first milestone: running the MobileNetV1 model with all convolutions accelerated by the NPU.
And it not only runs flawlessly, but at the same performance level as the blob.
It has been great having access to the register list as disclosed by Rockchip in their TRM, and to the NVDLA and ONNC documentation and source code. This has allowed for the work to proceed at a pace several times faster than with my previous driver for the VeriSilicon NPU, for which a lot of painstaking reverse engineering had to be done.
[Image by Julien Langlois, CC BY-SA 3.0]
tomeu@arm-64:~/mesa$ TEFLON_DEBUG=verbose python3.10 classification.py -i hens.jpg -m mobilenet_v1_1.0_224_quant.tflite -l labels_mobilenet_quant_v1_224.txt -e libteflon.so
Loading external delegate from libteflon.so with args: {}
Teflon delegate: loaded rknpu driver
teflon: compiling graph: 89 tensors 27 operations
...
teflon: compiled graph, took 413 ms
teflon: invoked graph, took 11 ms
teflon: invoked graph, took 11 ms
teflon: invoked graph, took 11 ms
teflon: invoked graph, took 10 ms
teflon: invoked graph, took 10 ms
0.984314: hen
0.019608: cock
0.000000: toilet tissue
0.000000: sea cucumber
0.000000: wood rabbit
time: 10.776ms
Notice how nothing in the invocation refers to the specific driver that TensorFlow Lite is using; that is completely abstracted by Mesa. Once all these bits are upstream and packaged by distros, one will be able to just download a model in INT8 quantization format and get accelerated inferences going fast irrespective of the hardware.
Thanks to TL Lim of PINE64 for sending me a QuartzPro64 board for me to hack on.
I want to go back and get my last work on performance for the VeriSilicon driver upstreamed, so it is packaged in distros sooner rather than later.
After that, I'm a bit torn between working further on the userspace driver and implementing more operations and control flow, or starting to write a kernel driver for mainline.
Hi! It’s this time of the month once again it seems…
We’ve finally released Sway 1.9! Note that it uses the new wlroots rendering API, but doesn’t use the scene-graph API: we’ve left that for 1.10. We’ve also released wlroots 0.17.2 with a whole bunch of bug fixes. Special thanks to Simon Zeni for doing the backporting work!
In other Wayland news, the wlroots merge request to atomically apply changes to multiple outputs has been merged! In addition, another merge request to help compositors allocate the right kind of buffers during modesets has been merged. These two combined should help light up more multi-output setups correctly on Intel GPUs, which previously required a workaround (WLR_DRM_NO_MODIFIERS=1). Thanks to Kenny for helping with that work!
I also got around to writing a Sway patch to gracefully handle GPU resets. This should be good news for users of a particular GPU vendor which tends to be a bit trigger happy with resets! Sway will now survive and continue running instead of being frozen. Note, clients may still glitch, need a nudge to redraw, or freeze. A few wlroots patches were also required to get this to work.
With the help of Jean Thomas, Goguma (and pushgarden) has gained support for Apple Push Notification service (APNs). This means that Goguma iOS users can now enjoy instantaneous notifications! This is also important to prove that it’s possible to design a standard (as an IRC extension) which doesn’t hardcode any proprietary platform (and thus doesn’t force each IRC server to have one codepath per platform), but still interoperates with these proprietary platforms (important for usability) and ensures that said proprietary platforms have minimal access to sensitive data (via end-to-end encryption between the IRC server and the IRC client).
It’s now also possible to share links and files to Goguma. That is, when using another app (e.g. the gallery, your favorite fediverse client, and many others) and opening the share menu, Goguma will show up as an option. It will then ask which conversation to share the content with, and automatically upload any shared file.
No NPotM this time around sadly. To make up for it, I’ve implemented refresh tokens in sinwon, and made most of the remaining tests pass in go-mls.
See you next month!
During the past weeks I have paused work on the driver for the Vivante NPU and have started work on a new driver, for Rockchip's own NPU IP, as used in SoCs such as RK3588(S) and RK3568.
The version of the NPU in the RK3588 claims a performance of 6 TOPS across its 3 cores, though from what I have read, people are having trouble making use of more than one core in parallel, with the closed source driver.
[Image: A nice walk in the park]
Rockchip, as most other vendors of NPU IP, provides a GPLed kernel driver and pushes out their userspace driver in binary form. The kernel driver is pleasantly simple and relatively up-to-date with regard to its use of internal kernel APIs. The userspace stack though is notoriously buggy and difficult to use, with basic features still unimplemented and performance being quite below what the hardware should be able to achieve.
To be clear, this is on top of the usual problems related to closed-source drivers. I get the impression that Rockchip's NPU team is really understaffed.
Other people had already looked at reverse-engineering the HW so they could address the limitations and bugs in the closed source driver, and use it in situations not supported by Rockchip. I used information acquired by Pierre-Hugues Husson and Jasbir Matharu to get started, a big thanks to them!
After the initial environment was set up (I had to forward-port their kernel driver to v6.8), I wrote a simple library that can be loaded in the process with LD_PRELOAD and that, by overriding ioctl and other syscalls, lets me dump the buffers that the proprietary userspace driver sends to the hardware.
I started looking at a buffer that, according to the debug logs of the proprietary driver, contained register writes, and when looking at the register descriptions in the TRM, I saw that it had to be closely based on NVIDIA's NVDLA open-source NPU IP.
With Rockchip's (terse) description of the registers, NVDLA's documentation and source code for both the hardware and the userspace driver, I have been able to make progress several times faster than I was able to when working on VeriSilicon's driver (for which I had zero documentation).
Right now I am at the stage at which I am able to correctly execute TensorFlow Lite's Conv2D and DepthwiseConv2D operations with different combinations of input dimensions, weight dimensions, strides and padding. Next is to support multiple output channels.
I'm currently using Rockchip's kernel, but as soon as I'm able to run object detection models with decent hardware utilization, I plan to start writing a new kernel driver for mainlining.
Rockchip's kernel driver has gems such as passing addresses in the kernel address space across the UAPI...
Tests run fast and reliably, even with high concurrency:
tomeu@arm-64:~/mesa$ TEFLON_TEST_DELEGATE=~/mesa/build/src/gallium/targets/teflon/libteflon.so TEFLON_TEST_DATA=src/gallium/targets/teflon/tests LD_LIBRARY_PATH=/home/tomeu/tflite-vx-delegate/build/_deps/tensorflow-build/ ~/.cargo/bin/gtest-runner run --gtest /home/tomeu/mesa/build/src/gallium/targets/teflon/test_teflon --output /tmp -j8 --tests-per-group 1 --baseline ~/mesa/src/gallium/drivers/rocket/ci/rocket-rk3588-fails.txt --flakes ~/mesa/src/gallium/drivers/rocket/ci/rocket-rk3588-flakes.txt --skips ~/mesa/src/gallium/drivers/rocket/ci/rocket-rk3588-skips.txt
Running gtest on 8 threads in 1-test groups
Pass: 0, Duration: 0
Pass: 139, Skip: 14, Duration: 2, Remaining: 2
Pass: 277, Skip: 22, Duration: 4, Remaining: 0
Pass: 316, Skip: 24, Duration: 4, Remaining: 0
We hit a major milestone this week with the long worked on adoption of PipeWire Camera support finally starting to land!
Not long ago Firefox was released with experimental PipeWire camera support thanks to the great work by Jan Grulich.
Then this week OBS Studio shipped with PipeWire camera support thanks to the great work of Georges Stavracas, who cleaned up the patches and pushed to get them merged based on earlier work by himself, Wim Taymans and Columbarius. This means we now have two major applications out there that can use PipeWire for camera handling and thus two applications whose video streams can be interacted with through patchbay applications like Helvum and qpwgraph.
These applications are important and central enough that having them use PipeWire is in itself useful, but they will now also provide two examples of how to do it for application developers looking at how to add PipeWire camera support to their own applications; there is no better documentation than working code.
The PipeWire support is also paired with camera portal support. The use of the portal also means we are getting closer to being able to fully sandbox media applications in Flatpaks which is an important goal in itself. Which reminds me, to test out the new PipeWire support be sure to grab the official OBS Studio Flatpak from Flathub.
For those wondering, work is also underway to bring this into the Chromium and Google Chrome browsers, where Michael Olbrich from Pengutronix has been pushing to get patches written and merged. He did a talk about this work at FOSDEM last year, as you can see from these slides, with this patch being the last step to get this working there too.
The move to PipeWire also prepared us for the new generation of MIPI cameras being rolled out in new laptops and helps push work on supporting those cameras towards libcamera, the new library for dealing with the new generation of complex cameras. This of course ties well into the work that Hans de Goede and Kate Hsuan have been doing recently, along with Bryan O’Donoghue from Linaro, on providing an open source driver for MIPI cameras, and of course the incredible work by Laurent Pinchart and Kieran Bingham from Ideas on Board on libcamera itself.
The PipeWire support is of course fresh and I am sure we will find bugs and corner cases that need fixing as more people test out the functionality in both Firefox and OBS Studio, and there are some interface annoyances we are working to resolve. For instance, since PipeWire supports both V4L and libcamera as a backend, you do at the moment get double entries in your selection dialogs for most of your cameras. WirePlumber has implemented de-duplication code which will ensure only the libcamera listing will show for cameras supported by both v4l and libcamera, but it is only part of the development version of WirePlumber and thus will land in Fedora Workstation 40, so until that is out you will have to deal with the duplicate options.
Another recent good PipeWire tidbit: with the PipeWire 1.0.4 release, PipeWire maintainer Wim Taymans also fixed up the FireWire FFADO support. The FFADO support had been in there for some time, but after seeing Venn Stone do some thorough tests and find issues we decided it was time to bite the bullet and buy some second-hand FireWire hardware for Wim to be able to test and verify himself.
So all in all it's been a great few weeks for PipeWire and for Linux Audio AND Video, and if you are an application maintainer be sure to look at how you can add PipeWire camera support to your application and of course get that application packaged up as a Flatpak for people using Fedora Workstation and other distributions to consume.
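The de-duplication rule described above (show only the libcamera entry when a camera is reachable through both backends) can be sketched in a few lines. This is only an illustration of the rule, not WirePlumber's code: the `api` and `device` keys, and the idea that both backends report a shared device identifier, are assumptions made for the example.

```python
def dedupe_cameras(cameras):
    """Drop V4L entries for devices that are also exposed through libcamera.

    Each camera is a dict with hypothetical keys:
      "api":    "v4l2" or "libcamera"
      "device": an identifier assumed to be shared by both backends
    """
    via_libcamera = {c["device"] for c in cameras if c["api"] == "libcamera"}
    return [
        c for c in cameras
        if c["api"] == "libcamera" or c["device"] not in via_libcamera
    ]


cameras = [
    {"api": "v4l2", "device": "usb-cam-0"},
    {"api": "libcamera", "device": "usb-cam-0"},
    {"api": "v4l2", "device": "capture-card-1"},  # V4L only, so it is kept
]
visible = dedupe_cameras(cameras)
```

A camera that only V4L knows about still shows up, so nothing disappears from the selection dialog; only the duplicate listing goes away.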
In a different epoch, before the pandemic, I did a presentation about upstream first at the Siemens Linux Community Event 2018, where I tried to explain the fundamentals of open source using microeconomics. Unfortunately that talk didn’t work out too well with an audience that isn’t well-versed in upstream and open source concepts, largely because it was just too much material crammed into too little time.
Last year I got the opportunity to try again for an Intel-internal event series, and this time I’ve split the material into two parts. I think that worked a lot better. For obvious reasons I cannot publish the recordings, but I can publish the slides.
The first part “Upstream, Why?” covers a few concepts from microeconomics 101, and then applies them to upstream open source. The key concept is, on the one hand, that open source achieves an efficient software market in the microeconomic sense by driving margins and prices to zero. On the other hand, the only way to make money in such a market is either to have more-or-less unstable barriers to entry that prevent the efficient market from forming and destroying all monetary value, or to sell a complementary product.
The second part “Upstream, How?” then looks at what this all means for the different stakeholders involved:
Individual engineers, who have skills and create a product with zero economic value, and might still be stupid enough and try to build a career on that.
Upstream communities, often with a formal structure as a foundation, and what exactly their goals should be to build a thriving upstream open source project that can actually pay some bills, generate some revenue somewhere else and get engineers paid. Because without that you’re not going to have much of a project with a long term future.
Engineering organizations, what exactly their incentives and goals should be, and the fundamental conflicts of interest this causes. Specifically on this I’ve only seen bad solutions, and ugly solutions, but not yet a really good one. A relevant pre-pandemic talk of mine on this topic is also “Upstream Graphics: Too Little, Too Late”
And finally the overall business and, more importantly, what kind of business strategy is needed to really thrive with an open source upstream first approach: You need to clearly understand which software market’s economic value you want to destroy by driving margins and prices to zero, and which complementary product you’re selling to still earn money.
At least judging by the feedback I’ve received internally, taking more time and going a bit more in-depth on the various concepts worked much better than the keynote presentation I did at Siemens, hence I decided to publish at least the slides.
Touchscreens are quite prevalent by now but one of the not-so-hidden secrets is that they're actually two devices: the monitor and the actual touch input device. Surprisingly, users want the touch input device to work on the underlying monitor which means your desktop environment needs to somehow figure out which of the monitors belongs to which touch input device. Often these two devices come from two different vendors, so mutter needs to use ... */me holds torch under face* .... HEURISTICS! :scary face:
Those heuristics are actually quite simple: same vendor/product ID? same dimensions? is one of the monitors a built-in one? [1] But unfortunately in some cases those heuristics don't produce the correct result. In particular external touchscreens seem to be getting more common again and plugging those into a (non-touch) laptop means you usually get that external screen mapped to the internal display.
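To make the failure mode concrete, here is a minimal sketch of that kind of heuristic chain. It is not mutter's implementation: the class names, the fields, the order of the checks and the sample dimensions are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Monitor:
    connector: str
    vendor_id: Optional[int]   # hypothetical: IDs comparable to the touch device's
    product_id: Optional[int]
    width_mm: int
    height_mm: int
    builtin: bool


@dataclass(frozen=True)
class Touchscreen:
    vendor_id: int
    product_id: int
    width_mm: int
    height_mm: int


def guess_monitor(touch: Touchscreen, monitors: Sequence[Monitor]) -> Optional[Monitor]:
    """Return the first monitor matched by any heuristic, or None."""
    # Heuristic 1: same vendor/product ID.
    for m in monitors:
        if (m.vendor_id, m.product_id) == (touch.vendor_id, touch.product_id):
            return m
    # Heuristic 2: same physical dimensions.
    for m in monitors:
        if (m.width_mm, m.height_mm) == (touch.width_mm, touch.height_mm):
            return m
    # Heuristic 3: fall back to a built-in panel.
    for m in monitors:
        if m.builtin:
            return m
    return None


# Made-up numbers: an external touchscreen whose reported size is slightly off.
laptop_panel = Monitor("eDP-1", None, None, 309, 174, builtin=True)
external = Monitor("DP-2", None, None, 597, 336, builtin=False)
touch = Touchscreen(0x04F3, 0x2D4A, 600, 340)
guess = guess_monitor(touch, [external, laptop_panel])
```

With these made-up values neither the ID check nor the dimension check matches, so the built-in fallback wins and the external touchscreen ends up mapped to the laptop panel, which is exactly the wrong result described above.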
Luckily mutter does have a configuration for it, though it is not exposed in the GNOME Settings (yet). But you, my $age $jedirank, can access this via a commandline interface to at least work around the immediate issue. But first: we need to know the monitor details and you need to know about gsettings relocatable schemas.
Finding the right monitor information is relatively trivial: look at $HOME/.config/monitors.xml and get your monitor's vendor, product and serial from there. e.g. in my case this is:
<monitors version="2">
  <configuration>
    <logicalmonitor>
      <x>0</x>
      <y>0</y>
      <scale>1</scale>
      <monitor>
        <monitorspec>
          <connector>DP-2</connector>
          <vendor>DEL</vendor>              <--- this one
          <product>DELL S2722QC</product>   <--- this one
          <serial>59PKLD3</serial>          <--- and this one
        </monitorspec>
        <mode>
          <width>3840</width>
          <height>2160</height>
          <rate>59.997</rate>
        </mode>
      </monitor>
    </logicalmonitor>
    <logicalmonitor>
      <x>928</x>
      <y>2160</y>
      <scale>1</scale>
      <primary>yes</primary>
      <monitor>
        <monitorspec>
          <connector>eDP-1</connector>
          <vendor>IVO</vendor>
          <product>0x057d</product>
          <serial>0x00000000</serial>
        </monitorspec>
        <mode>
          <width>1920</width>
          <height>1080</height>
          <rate>60.010</rate>
        </mode>
      </monitor>
    </logicalmonitor>
  </configuration>
</monitors>

Well, so we know the monitor details we want. Note there are two monitors listed here, in this case I want to map the touchscreen to the external Dell monitor. Let's move on to gsettings.
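If you would rather not read the XML by eye, the same three values can be pulled out with a few lines of Python. This is a minimal sketch that assumes the file has the layout shown above; the function name is made up for the example, and the embedded XML is a trimmed copy of my file.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the monitors.xml shown above (modes and positions omitted).
SAMPLE = """
<monitors version="2">
  <configuration>
    <logicalmonitor>
      <monitor>
        <monitorspec>
          <connector>DP-2</connector>
          <vendor>DEL</vendor>
          <product>DELL S2722QC</product>
          <serial>59PKLD3</serial>
        </monitorspec>
      </monitor>
    </logicalmonitor>
    <logicalmonitor>
      <monitor>
        <monitorspec>
          <connector>eDP-1</connector>
          <vendor>IVO</vendor>
          <product>0x057d</product>
          <serial>0x00000000</serial>
        </monitorspec>
      </monitor>
    </logicalmonitor>
  </configuration>
</monitors>
"""


def monitor_specs(xml_text):
    """Return one dict per <monitorspec> element found in the document."""
    root = ET.fromstring(xml_text)
    fields = ("connector", "vendor", "product", "serial")
    return [
        {name: spec.findtext(name) for name in fields}
        for spec in root.iter("monitorspec")
    ]


specs = monitor_specs(SAMPLE)
for spec in specs:
    print(spec["connector"], spec["vendor"], spec["product"], spec["serial"])
```

To run it against the real file, pass `pathlib.Path.home().joinpath(".config", "monitors.xml").read_text()` instead of `SAMPLE`.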
gsettings is of course the configuration storage wrapper GNOME uses (and the CLI tool with the same name). GSettings follow a specific schema, i.e. a description of a schema name and possible keys and values for each key. You can list all those, set them, look up the available values, etc.:
$ gsettings list-recursively
... lots of output ...
$ gsettings set org.gnome.desktop.peripherals.touchpad click-method 'areas'
$ gsettings range org.gnome.desktop.peripherals.touchpad click-method
enum
'default'
'none'
'areas'
'fingers'

Now, schemas work fine as-is as long as there is only one instance. Where the same schema is used for different devices (like touchscreens) we use a so-called "relocatable schema" and that requires also specifying a path - and this is where it gets tricky. I'm not aware of any functionality to get the specific path for a relocatable schema so often it's down to reading the source. In the case of touchscreens, the path includes the USB vendor and product ID (in lowercase), e.g. in my case the path is:

/org/gnome/desktop/peripherals/touchscreens/04f3:2d4a/

In your case you can get the touchscreen details from lsusb, libinput record, /proc/bus/input/devices, etc. Once you have it, gsettings takes a schema:path argument like this:

$ gsettings list-recursively org.gnome.desktop.peripherals.touchscreen:/org/gnome/desktop/peripherals/touchscreens/04f3:2d4a/
org.gnome.desktop.peripherals.touchscreen output ['', '', '']

Looks like the touchscreen is bound to no monitor. Let's bind it with the data from above:

$ gsettings set org.gnome.desktop.peripherals.touchscreen:/org/gnome/desktop/peripherals/touchscreens/04f3:2d4a/ output "['DEL', 'DELL S2722QC', '59PKLD3']"

Note the quotes so your shell doesn't misinterpret things.
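If you need to do this for more than one device, the command can be assembled programmatically. The sketch below only builds the argument list for the `gsettings set` call shown above; the helper name is made up, and it assumes that none of the monitor fields contain a single quote (no escaping is attempted).

```python
import shlex

SCHEMA = "org.gnome.desktop.peripherals.touchscreen"
PATH_PREFIX = "/org/gnome/desktop/peripherals/touchscreens/"


def touchscreen_output_command(usb_vendor, usb_product, monitor):
    """Build the argv for binding a touchscreen to a monitor.

    usb_vendor / usb_product: integer USB IDs of the touchscreen
    monitor: (vendor, product, serial) strings taken from monitors.xml
    """
    # The relocatable schema path uses lowercase, zero-padded hex IDs.
    path = f"{PATH_PREFIX}{usb_vendor:04x}:{usb_product:04x}/"
    # GVariant text form of a string array; assumes no quotes in the fields.
    value = "[" + ", ".join(f"'{field}'" for field in monitor) + "]"
    return ["gsettings", "set", f"{SCHEMA}:{path}", "output", value]


cmd = touchscreen_output_command(0x04F3, 0x2D4A, ("DEL", "DELL S2722QC", "59PKLD3"))
print(shlex.join(cmd))  # a shell-quoted form you can paste into a terminal
```

Passing the list straight to `subprocess.run(cmd, check=True)` sidesteps the shell quoting issue entirely, since no shell is involved.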
And that's it. Now I have my internal touchscreen mapped to my external monitor which makes no sense at all but shows that you can map a touchscreen to any screen if you want to.
[1] Probably the one that most commonly takes effect since it's the vast vast majority of devices
I’ve had a few things I was going to blog about over the past month, but then news sites picked them up and I lost motivation because there’s only so many hours in a day that anyone wants to spend reading things that aren’t specification texts. Yeah, that’s my life now.
Anyway, a lot’s happened, and I’d try to enumerate it all but I’ve forgotten / lost track / don’t care. git log me if you’re interested. Some highlights:
More on the last one later. Like in a couple months. When I won’t get vanned for talking about it.
No, it’s not Half Life 3 / Portal 3 / L4D3.
Today’s post was inspired by interfaces: they’re the things that make code go brrrrr. Basically Legos, but for adults who never go outside. If you’ve written code, you’ve done it using an interface.
Graphics has interfaces too. OpenGL is an interface. Vulkan is an interface.
Mesa has interfaces. It’s got some neat ones like Gallium which let you write a whole GL driver without knowing anything about GL.
And then it’s got the DRI interfaces. Which, by their mere existence, answer the question “What could possibly be done to make WSI even worse than it already is?”
The DRI interfaces date way back to a time before the blog. A time when now-dinosaurs roamed the earth. A time when Vulkan was but a twinkle in the eye of Mantle, which didn’t even exist. I’m talking Copyright 1998-1999 Precision Insight, Inc., Cedar Park, Texas. at the top of the file old.
The point of these interfaces was to let external applications access GL functionality. Specifically the xserver. This was before GLAMOR combined GBM and EGL to enable a better way of doing things that didn’t involve brain damage, and it was a necessary evil to enable cross-vendor hardware acceleration using Mesa. Other historical details abound, but this isn’t a textbook. The DRI interfaces did their job and enabled hardware-accelerated display servers for decades.
Now, however, they’ve become cruft. A hassle. A roadblock on the highway to a future where I can run zink on stupid platforms with ease.
The first step to admitting there’s a problem is having a problem. I think that’s how the saying goes, anyway. In Mesa, the problem is any time I (or anyone) want to do something related to the DRI frontend, like allow NVK to use zink by default, it has to go through DRI. Which means going through the DRI interfaces. Which means untangling a mess of unnecessary function pointers with versioned prototypes meaning they can’t be changed without adding a new version of the same function and adding new codepaths which call the new version if available. And guess how many people in the project truly understand how all the layers fit together?
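For anyone who hasn’t had the pleasure, the versioning pattern looks roughly like this. It’s a toy model in Python rather than the real C structs: every name here is hypothetical, and the only thing it’s meant to show is how one changed prototype turns into a second function pointer plus a new codepath at every call site.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ImageInterface:
    """Hypothetical versioned function table (not a real Mesa struct)."""
    version: int
    # Version 1 entry point: its prototype is frozen forever.
    create_image: Callable[[int, int], str]
    # Version 2 added a parameter, so it had to become a *new* entry point.
    create_image2: Optional[Callable[[int, int, int], str]] = None


def create(iface: ImageInterface, width: int, height: int, modifier: int = 0) -> str:
    """Caller-side dispatch: use the newest entry point the provider has."""
    if iface.version >= 2 and iface.create_image2 is not None:
        return iface.create_image2(width, height, modifier)
    # Old providers: fall back and silently drop the new argument.
    return iface.create_image(width, height)


old = ImageInterface(version=1, create_image=lambda w, h: f"v1 {w}x{h}")
new = ImageInterface(
    version=2,
    create_image=lambda w, h: f"v1 {w}x{h}",
    create_image2=lambda w, h, m: f"v2 {w}x{h} mod={m}",
)
```

Now multiply that by every function in every interface, and remember that both ends of the “contract” live in the same repository.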
It’s a mess. And more than a mess, it’s a huge hassle any time a change needs to be made. Not only do the interfaces have to be versioned and changed, someone looking to work on a new or bitrotted platform has to first chase down all the function pointers to see where the hell execution is headed. Even when the function pointers always lead to the same place.
I don’t have any memes today.
This is my declaration of war.
DRI interfaces: you’re officially on notice. I’m coming for you.
In the last update I explained how compression of zero weights gave our driver such a big performance improvement.
Since then, I have explored further what could take us closer to the performance of the proprietary driver and saw the opportunity to gather some of the proverbial low-hanging fruit.
Our driver's performance on SSD MobileDet went from 32.7 ms to 24.8 ms, against the proprietary driver's 19.5 ms.
On MobileNetV1, our driver went from 9.9 ms to 6.6 ms, against the proprietary driver's 5.5 ms. Pretty close!
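For a quick sense of scale, the ratios behind those numbers can be worked out directly from the timings quoted above (the dictionary layout and function name are just for this example):

```python
# Inference times in milliseconds, as quoted above.
timings = {
    "SSD MobileDet": {"before": 32.7, "after": 24.8, "proprietary": 19.5},
    "MobileNetV1": {"before": 9.9, "after": 6.6, "proprietary": 5.5},
}


def ratios(t):
    """Return (speedup over our previous result, remaining gap to the blob)."""
    speedup = t["before"] / t["after"]
    gap = t["after"] / t["proprietary"]
    return speedup, gap


for model, t in timings.items():
    speedup, gap = ratios(t)
    print(f"{model}: {speedup:.2f}x faster than before, "
          f"{gap:.2f}x the proprietary driver's time")
```

That is roughly a 1.32x speedup on SSD MobileDet and 1.5x on MobileNetV1, leaving our driver at about 1.27x and 1.2x the proprietary driver's time respectively.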
Our driver was rejecting convolutions with a number of output channels that is not divisible by the number of convolution cores in the NPU because at the start of the development the code that lays the weights out in memory didn't support that. That caused TensorFlow Lite to run the convolutions on the CPU, and some of them were big enough to take a few milliseconds, several times more than on the NPU.
When implementing support for bigger kernels I had to add improvements to the tiling of the convolutions and that included adding support for these other convolutions. So by just removing the rejection of these, we got a nice speed up on SSD MobileDet: from 32.7ms to 27ms!
That didn't help on MobileNetV1 because that one has all its convolutions with neat numbers of output channels.
So far we were only caching the kernels on the on-chip SRAM. I spent some time looking at how the proprietary driver sets the various caching fields and found a way of getting us to cache a portion of the input tensor on the remaining internal SRAM.
That got us the rest of the performance improvement mentioned above, but I am having trouble with some combination of parameters when the input tensor caching is enabled, so I need to get to the bottom of it before I submit it for review.
At this point I am pretty confident that we can get quite close to the performance of the proprietary driver without much additional work, as a few major performance features remain to be implemented, and I know that I still need to give a pass at tuning some of the previous performance work.
But after getting the input tensor caching finished and before I move to any other improvements, I think I will invest some time in adding some profiling facilities so I can better direct the efforts and get the best returns.
…that this year is a lot busier than expected. Blog posts will probably come in small clusters here and there rather than with any sort of regular cadence.
But now I’m here. You’re here. Let’s get cozy for a few minutes.
I’m sure you’ve seen some news, you’ve been trawling the gitlab MRs, you’re on the #nouveau channels. You’re one of my readers, so we both know you must be an expert.
Zink on NVK is happening.
Those of you who remember the zink XDC talk know that this work has been ongoing for a while, but now I can finally reveal the real life twist that only a small number of need-to-know community members have been keeping under wraps for years: I still haven’t been to XDC yet.
Let me explain.
I’m sure everyone recalls the point in the presentation where “I” talked about progress made towards Zink on NVK. A lot of people laughed it off; oh sure, you said, that’s just the usual sort of joke we expect. But what if I told you it wasn’t a joke? That all of it was 100% accurate, it just hadn’t happened yet?
I know what you’re thinking now, and you’re absolutely correct. The me that attended XDC was actually time traveling from the future. A future in which Zink on NVK is very much finished. Since then, I’ve been slowly and quietly “backporting” the patches my future self wrote and slipping them into git.
Let’s look at an example.
20 Feb 2024 was a landmark day in my future-journal for a number of reasons, not the least due to the alarming effects of planetary alignment that you’re all no doubt monitoring. For the purposes of the current blog post that I’m now writing, however, it was monumental for a different reason. This was the day that noted zinkologist and current record-holder for Most Tests Fixed With One Line Of Code, Faith Ekstrand (@gfxstrand), would delve into debugging the most serious known issue in zink+nvk:
Yup, it’s another clusterfuck.
Now let me say that I had the debug session noted down in my journal, but I didn’t add details. If you haven’t been in #nouveau for a live debug session, it’s worth scheduling time around it. Get some popcorn ready. Put on your safety glasses and set up your regulation-size splatterguard, all the usual, and then…
Well, if I had to describe the scene, it’s like watching someone feed a log into a wood chipper. All the potential issues investigated one-by-one and eliminated into the pile of growing sawdust.
Anyway, it turns out that NVK (currently) does not expose a BAR memory type with host-visible and device-local properties, and zink has no handling for persistently mapped buffers in this scenario. I carefully cherry-picked the appropriate patch from my futurelog and rammed it through CI late at night when nobody would notice.
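For the non-time-travelers, here’s roughly what “no handling for this scenario” means. This is a toy Python model, not the zink patch: the flag values are the real VkMemoryPropertyFlagBits, but the function names and the fallback policy are assumptions, and a real driver would also have to respect memoryTypeBits from the buffer’s memory requirements.

```python
# Values of VkMemoryPropertyFlagBits from the Vulkan headers.
DEVICE_LOCAL = 0x1
HOST_VISIBLE = 0x2
HOST_COHERENT = 0x4
HOST_CACHED = 0x8


def find_memory_type(type_flags, required):
    """Index of the first memory type with all `required` bits, else None."""
    for index, flags in enumerate(type_flags):
        if flags & required == required:
            return index
    return None


def pick_persistent_map_type(type_flags):
    """Prefer BAR-style memory, fall back to plain host-visible memory."""
    index = find_memory_type(type_flags, DEVICE_LOCAL | HOST_VISIBLE)
    if index is not None:
        return index
    # No device-local + host-visible type: without this fallback the
    # lookup fails and persistently mapped buffers have nowhere to go.
    return find_memory_type(type_flags, HOST_VISIBLE)


with_bar = [DEVICE_LOCAL, HOST_VISIBLE | HOST_COHERENT,
            DEVICE_LOCAL | HOST_VISIBLE | HOST_COHERENT]
without_bar = [DEVICE_LOCAL, HOST_VISIBLE | HOST_COHERENT | HOST_CACHED]
```

With `with_bar` the combined type is chosen; with `without_bar` the lookup lands on the host-visible type instead of coming back empty.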
As a result, all GL games now work on NVK. No hyperbole. They just work.
Stay tuned for future updates backported from a time when I’m not struggling to find spare seconds under the watchful gaze of Big Triangle.
Hi! February is FOSDEM month, and as usual I’ve come to Brussels to meet with a lot of other FOSS developers and exchange ideas. I like to navigate between the buildings and along the hallways to find nice people to discuss with. This edition I’ve been involved in the new modern e-mail devroom and I’ve given a talk about IMAP with Damian, a fellow IMAP library maintainer and organizer of this devroom. The whole weekend was great!
In wlroots news, I’ve worked on multi-connector atomic commits. Right now, wlroots sequentially configures outputs, one at a time. This is slow and makes it impossible to properly handle GPU limitations such as bandwidth: if the GPU cannot drive two outputs with a 4k resolution, we’ll only find out after the first one has been lit up. As a result we can’t properly implement fallbacks and this results in black screens on some setups. In particular, on Intel some users need to set WLR_DRM_NO_MODIFIERS=1 to have their multi-output setup work correctly. The multi-connector atomic commit work is the first step to resolve these situations and also results in faster modesets. The second step will be to add fallback logic to use a less bandwidth-intensive scanout buffer on modeset.
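To illustrate why configuring outputs one at a time is a problem, here is a toy model of the two approaches. It is not wlroots code: the bandwidth budget, the per-output costs and the function names are made up, and the real kernel check is an atomic test commit rather than a simple sum.

```python
def sequential_modeset(outputs, budget):
    """Light up outputs one by one; later outputs may no longer fit."""
    used = 0
    result = {}
    for name, (preferred_cost, _fallback_cost) in outputs.items():
        if used + preferred_cost <= budget:
            used += preferred_cost
            result[name] = "preferred"
        else:
            # Earlier outputs already consumed the bandwidth: black screen.
            result[name] = "black"
    return result


def atomic_modeset(outputs, budget):
    """Test the whole configuration at once, then try a cheaper fallback."""
    for choice, idx in (("preferred", 0), ("fallback", 1)):
        if sum(costs[idx] for costs in outputs.values()) <= budget:
            return {name: choice for name in outputs}
    return {name: "black" for name in outputs}


# Hypothetical costs: (preferred buffer, less bandwidth-intensive buffer).
outputs = {"DP-1": (60, 40), "DP-2": (60, 40)}
seq = sequential_modeset(outputs, budget=100)
atomic = atomic_modeset(outputs, budget=100)
```

In the sequential model DP-1 takes the expensive buffer and DP-2 stays black, whereas testing both outputs together makes it possible to notice the problem up front and light up both with the cheaper buffers.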
While working on the wlroots DRM backend code, I’ve also taken the opportunity to clean up the internals and skip unnecessary modesets when switching between VTs. Ctrl Alt 1 should be faster now! I’ve also tried to resurrect the ext-screencopy-v1 protocol, required for capturing individual windows. I’ve pushed a new version and reworked the wlroots implementation; hopefully I can find some more time next month to continue on this front.
Sway 1.9-rc4 has been recently released, my reading of the tea leaves at my disposal indicates that the final release may be shipped soon. Sway 1.9 will leverage the new wlroots rendering API, however it does not include the huge scene-graph rework that Alexander has pushed forward in the last year or so. Sway 1.10 will be the first release to include this major overhaul and all the niceties it unlocks. And Sway 1.10 will also finally support input method popups (used for CJK among other things) thanks to efforts by Access and Tadeo Kondrak.
The NPotM is sinwon, a simple OAuth 2 server for small deployments. I’ve long been trying to find a good solution to delegate authentication to a single service and provide single-sign-on for my personal servers. I’ve come to like OAuth 2 because it’s a standard, it’s not tied to another use-case (like IMAP or SMTP is), and it prevents other services from manipulating user passwords directly. sinwon stores everything in a SQLite database, and it’s pretty boring: no fancy cryptography usage for tokens, no fancy cloud-grade features. I like boring. sinwon has a simple UI to manage users and OAuth clients (sometimes called “apps”). Still missing are refresh tokens, OAuth scopes, an audit log, personal access tokens, and more advanced features such as TOTP, device authorization grants and mTLS. Patches welcome!
I’ve continued my work to make it easier to contribute to the SourceHut codebase. Setting up PGP keys is now optional to run a SourceHut instance, and a local S3-compatible server (such as minio) can be used without TLS. Thorben Günther has added paste.sr.ht to sr.ht-container-compose. I’m also working on making services use meta.sr.ht’s GraphQL API instead of maintaining their own copy of the user’s profile, but more needs to be done there.
And now for the random collection of smaller updates… The soju IRC bouncer and the goguma IRC client for mobile devices now support file uploads: no need to use an external service anymore to share a screenshot or picture in an IRC conversation. Conrad Hoffmann and Thomas Müller have added support for multiple address books to the go-webdav library, as well as creating/deleting address books and calendars. I’ve modernized the FreeDesktop e-mail server setup with SPF, DKIM and DMARC. KDE developers have contributed a new layer-shell minor version to support docking their panel to a corner of the screen.
That’s all for now, see you next month!
In the past 8 months, I’ve lost 60 pounds and went from completely sedentary to well on my way towards becoming fit, while putting in a minimum of effort. On the fitness side, I’ve taken my cardiorespiratory fitness from below average to above average, and I’m visibly stronger (I can do multiple pull-ups!). Again, I’ve aimed to do so with minimal effort to maximize my efficiency.
Here’s what I wrote in my prior post on weight loss:
I have no desire to be a bodybuilder, but I want to be in great shape now and be as healthy and mobile as possible well into my old age. And a year ago, my blood pressure was already at pre-hypertension levels, even though I was relatively young.
Research shows that 5 factors are key to a long life — extending your life by 12–14 years:
In addition, people who are in good health have a much shorter end-of-life period. This means they extend the healthy portion of their lifespan (the “healthspan”) and compress the worst parts into a shorter period at the very end. Having seen many grandparents go through years of struggle as they grew older, I wanted my own story to have a different ending.
Although I’m not a smoker, I was missing three of the other factors. My weight was massively unhealthy, I didn’t exercise at all and spent most of my day in front of a desk, and my diet was awful. I do drink moderately, however (almost entirely beer).
This post accompanies my earlier writeup, “The lazy technologist’s guide to weight loss.” Check that out for an in-depth, science-driven review of my experience losing weight.
Why is this the lazy technologist’s guide, again? I wanted to lose weight in the “laziest” way possible — in the same sense that lazy programmers find the most efficient solutions to problems, according to an apocryphal quote by Bill Gates and a real one by Larry Wall, creator of Perl. Gates supposedly said, “I choose a lazy person to do a hard job. Because a lazy person will find an easy way to do it.” Wall wrote in Programming Perl, “Laziness: The quality that makes you go to great effort to reduce overall energy expenditure. It makes you write labor-saving programs that other people will find useful and document what you wrote so you don’t have to answer so many questions about it.”
What’s the lowest-effort, most research-driven way to become fit as quickly as possible, during and after losing weight? Discovering and executing upon that was my journey. Read on if you’re considering taking a similar path.
My initial goal for fitness was simply to meet the “30+ min/day” factor in the research study I cited at the beginning of this post, while considering a few factors:
Joint issues become very common for older people, especially in the knees and hips. My program needed to avoid any high-impact, repetitive stress on those joints to preserve maximum function. I’ve always heard that running is bad for your knees, but after I looked into it, the research does not bear that out. And yet, it remains a popular misconception among both the general population and doctors who do not frequently perform hip replacements.
However, I just don’t like running — I enjoy different activities if I’m going to be working hard physically, such as games like racquetball/squash/pickleball or self-defense (Krav Maga!). I’m also not a big fan of getting all sweaty in general, but especially in the middle of a workday. So I wanted an activity with a moderate rather than high level of exertion.
Low-impact options include walking, cycling, swimming, and rowing, among others. But swimming requires an indoor pool or year-round good weather, and rowing requires a specialized machine or boat, while I’m aiming to stay minimal. I also do not own a bicycle, nor is the snowy weather in Minnesota great for cycling in the winter (fat-tire bikes being an exception).
We’re left with walking as the primary activity.
Initially, I started with only walking. This is called low-intensity steady state (LISS) cardio (cardiovascular, a.k.a. aerobic) exercise. Later, I also incorporated high-intensity interval training (HIIT) as the laziest possible way to further improve my cardiovascular health.
To bump walking up into a “moderate” level of activity, I need to walk at 3–4 mph. This is what’s sometimes called a “brisk” walk: 3 mph feels fast, and 4 mph is about as fast as I can go without changing into some weird competitive walking style.
I also need to hit 30+ minutes per day of this brisk walking. At first, I used a “walking pad” treadmill under my standing desk, which I bought for <$200 on Amazon. My goal was to integrate walking directly into my day with no dedicated time, and this seemed like a good path. However, this violates the minimalism requirement. I also learned that a brisk pace is too fast to do much of anything at the desk besides watch videos or browse social media. So I broke this up into two 1-mile outdoor walks, one after lunch and another after dinner.
Each 1-mile walk takes 15–20 minutes. Fitting this into a workday requires me to block off 45–60 minutes for lunch, between lunch prep, time to eat, and the walk itself. I find this much easier than trying to create a huge block of time in the morning for exercise, because I do not naturally wake up early. In the evening, I’ll frequently extend the after-dinner walk to ~2 miles instead of 1 mile.
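The 15–20 minute figure falls straight out of the 3–4 mph pace. Here’s a tiny sketch of that arithmetic; the function name is mine, not from any library:

```python
def walk_minutes(distance_miles: float, speed_mph: float) -> float:
    """Minutes needed to cover a distance at a steady walking pace."""
    if speed_mph <= 0:
        raise ValueError("speed_mph must be positive")
    return distance_miles / speed_mph * 60


# The brisk-walk range from this post: 3 mph (slow end) to 4 mph (fast end).
for speed in (3.0, 4.0):
    print(f"1 mile at {speed} mph takes about {walk_minutes(1, speed):.0f} min")
```

Doubling the distance for the longer evening walk simply doubles the time, so a ~2-mile walk lands at roughly 30–40 minutes.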
It turns out that walking after meals is a great strategy for both weight loss and suppressing your blood sugar levels, among other benefits. This can be as short as a 2-minute walk, according to recent studies. In fact, it’s seen as so key in Mediterranean culture that walking is considered a component of the Mediterranean diet.
Overall, I’ve increased my active calorie burn by 250 calories/day by incorporating active walks into my day. That’s a combination of the 2 after-meal brisk walks, plus a more relaxed walk on my under-desk treadmill sometime during the day. The latter is typically a 2 mph walk for 40–60 min, and I do it while I’m in a meeting that I’m not leading, or maybe watching a webinar. Without buying the walking pad, you could do the same on a nice outdoor walk with a headset or earbuds, but Minnesota weather sometimes makes that miserable. All of this typically gets me somewhere between 10,000 and 15,000 steps per day.
Not only is this good for fitness, it also helps to offset the effects of metabolic adaptation. If you’re losing weight, your body consumes fewer calories because it decreases your resting metabolic rate to conserve energy. Although some sites will suggest this could be hundreds of calories daily, which is quite discouraging, research shows that’s exaggerated for most people. During active weight loss, it’s typically ~100 calories per day, although it may be up to 175±150 calories for diet-resistant people. That range is a standard deviation, so people who are in the worst ~15% of the diet-resistant subset could have adaptations >325 calories/day. So if you believe you’re diet-resistant, you probably want to aim for a 1000-calorie deficit, to ensure you’re able to lose weight at a good rate. On the bright side, that adaptation gets cut in half once you’ve stabilized for a few weeks at your new weight, and it’s effectively back to zero a year later.
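To sanity-check the “worst ~15%” figure, here’s a rough sketch that treats the diet-resistant adaptation as normally distributed with a mean of 175 and a standard deviation of 150 calories/day. The normal shape is my assumption for illustration (the studies report a mean and spread, not a full distribution), and it uses Python’s standard `statistics.NormalDist`:

```python
from statistics import NormalDist

# Assumed model: adaptation ~ Normal(mean=175, sd=150) calories/day.
adaptation = NormalDist(mu=175, sigma=150)

# One standard deviation above the mean.
threshold = adaptation.mean + adaptation.stdev

# Share of people whose adaptation exceeds that threshold.
share_above = 1 - adaptation.cdf(threshold)

print(f"threshold: {threshold:.0f} calories/day")
print(f"share above threshold: {share_above:.1%}")
```

Under that assumption, about 16% of the diet-resistant group would sit above 325 calories/day, which is where the rough “worst ~15%” comes from.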
To further maintain my muscle following weight loss, I added a weighted vest to my after-lunch walks occasionally (examples: Rogue, 5.11, TRX). I started doing this once a week, and I aim to get to 3x+/week. I use a 40 lb weighted vest to counterbalance the 40+ lb of weight that I’ve lost. When I walk with the vest, I’m careful to maintain the same pace as without the vest, which increases the intensity and my heart rate. This pushes a normal moderate-intensity walk into the low end of high intensity (approaching 80% of my max heart rate). I also anticipate incorporating this weighted vest into my strength training later, once my own body weight is insufficient for continued progression.
Considering a minimalist approach, however, I think you could do just fine without a weighted vest. There are other ways to increase intensity, such as speed or inclines, and the combination of a high-protein diet, HIIT, and strength training provides similar benefits.
Why do HIIT? Regularly getting your heart rate close to its maximum is good for your cardiovascular health, and you can’t do it with LISS, which by definition is low intensity. Another option besides HIIT is much longer moderate-intensity continuous training (your classic aerobic workout), but HIIT can fit the same benefits or more into a fraction of the time.
Research is very supportive of HIIT compared to longer aerobic workouts, which enables time compression of the total workout length from the classic 60 minutes down to 30 minutes or less.
However, 30 minutes still isn’t the least you can do and still get most of the benefits. The minimum required HIIT remains unclear — in overall length, weekly frequency, as well as patterns of high-intensity and rest / low-intensity. Here are some examples of research that test the limits of minimalist HIIT and find that it still works well:
Yes, you read that right — the last study used 20-second intervals. They were only separated by 10 seconds of rest, so the primary exercise period was just 4 minutes, excluding warm-up. Furthermore, this meta-analysis suggests that HIIT benefits more from increasing the intensity of the high-intensity intervals, rather than increasing the volume of repetitions.
After my investigation, it was clear that “low-volume” or “extremely low volume” HIIT could work well, so there was no need to do the full 30-minute HIIT workouts that are popular with many gym chains.
I settled on 3 minutes of HIIT, 2x/week: 3 repetitions of 30 seconds hard / 30 seconds light, plus a 1-minute warm-up. This overlaps with the HIIT intervals, breaks, and repetitions from the research I’ve dug into, and it also has the convenient benefit of not quite making me sweat during the workout, so I don’t need to change clothes.
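If you want to drive this from a timer, the routine is simple enough to lay out programmatically. This is just a sketch of my 1-minute warm-up plus 3 × (30 s hard / 30 s light); the function and phase names are made up for illustration:

```python
from typing import List, Tuple


def hiit_schedule(warmup_s: int, hard_s: int, light_s: int,
                  reps: int) -> List[Tuple[int, str, int]]:
    """Return (start_second, phase_name, duration_seconds) for each phase."""
    phases = [("warm-up", warmup_s)]
    for _ in range(reps):
        phases.append(("hard", hard_s))
        phases.append(("light", light_s))

    schedule = []
    elapsed = 0
    for name, duration in phases:
        schedule.append((elapsed, name, duration))
        elapsed += duration
    return schedule


routine = hiit_schedule(warmup_s=60, hard_s=30, light_s=30, reps=3)
total_s = routine[-1][0] + routine[-1][2]

for start, name, duration in routine:
    print(f"{start // 60}:{start % 60:02d}  {name:<8} {duration}s")
print(f"total: {total_s // 60} min {total_s % 60} s")
```

The same function covers other protocols by changing the arguments, so it’s easy to experiment with different interval lengths.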
I’m seeing the benefits of this already, which I’ll discuss in the Summary.
I also wanted to incorporate strength training for many reasons. In the short term, it was to minimize muscle loss as I lost weight (addressed in my prior post). In the medium and long term, I want to build muscle now so that I can live a healthier life once I’m older and also feel better about myself today.
What I’ve found is that aiming for the range of 10%–15% body fat is ideal for men who want to be very fit. This range makes it easy to tell visually when you’re at the top or bottom of the range, based on the appearance of a well-defined six-pack or its fading away to barely visible. It gets harder to tell where you are visually from 15% upwards, while anything below 10% has some health risks and starts to look pretty unusual too.
Within that 10%–15% range, I’m planning to do occasional short-term “lean bulks” / “clean bulks” and “cuts.” That’s the typical approach to building muscle — you eat a slight excess of calories while ensuring plenty of protein, aiming to gain about 2–4 lbs/month for someone my size. After a cycle of doing this, you then “cut” by dieting to lose the excess fat you’ve gained, because it’s impossible to only gain muscle. My personal preference is to make this cycle more agile with shorter iteration cycles, compared to some of the examples I’ve seen. I’m thinking about a 3:1 bulk:cut split over 4 months that results in a total gain/loss of ~10 lbs.
My goal of staying minimal pushed me toward calisthenics (bodyweight exercises), rather than needing to work out at a gym or buy free weights. This means the only required equipment is a doorway pull-up bar ($25), while everything else can be done with a wall, table or chair/bench. Although I may not build enormous muscles, it’s possible to get to the point of lifting your entire body weight with a single arm, which is more than good enough for me. That’s effectively lifting 2x your body weight, since you’re lifting 1x with just one arm.
My routine is inspired by Reddit’s r/bodyweightfitness (including the Recommended Routine and the Minimalist Routine) and this blog post by Steven Low, author of the book “Overcoming Gravity.” I’ve also incorporated scientific research wherever possible to guide repetitions and frequency. Overall, the goal is to get both horizontal and vertical pushing and pulling exercises for the arms/shoulders due to their larger range of motion, while getting push and pull for legs, and good core exercises that cover both the upper and lower back as well.
I’ve chosen compound exercises that work many muscles simultaneously — for practicality (more applicable to real-world motions), length of workout, and minimal equipment needs. If you’re working isolated muscles, you generally need lots of specialized machines at a gym. Isometrics (exercises where you don’t move, like a wall-sit) are also less applicable to real use cases as you age, such as the strength and agility to catch yourself from a fall. For that reason, I prefer compound exercises with some rapid, explosive movements that help to build both strength and agility.
Here’s my current schedule (3 sets of repetitions for each movement, with a 3-minute break between sets):
For ones that I couldn’t do initially (e.g. pull-ups, handstands, L-sits, Nordic curls), I used progressions to work my way there step by step. For pull-ups, that meant doing negatives / eccentrics by jumping up and slowly lowering myself down over multiple seconds, then repeating. For handstands, I face the wall to encourage better posture, so it’s been about longer holds and figuring out how to bail out so I can more confidently get vertical. For L-sits, I follow this progression. For Nordic curls, I’m doing slow negatives as far down as I can make it, then dropping the rest of the way onto my hands and pushing back up.
On days with multiple exercises for the same muscles, I’ll typically try to split them up so they fit more easily into a workday. For example, I’ll find 10 minutes mid-morning between meetings/calls to do one movement and 10 minutes mid-afternoon for the other. This is the same time I might’ve spent making a coffee, before I started focusing on fitness.
Combined with the walks, this plan gets me moving 4 times a day — two 20-minute walks and two 10-minute workouts, for a total of 1 hour each day. The great thing about this approach is that I never feel like I need to dedicate a ton of time to exercise, because it fits naturally into the structure of my day. I’ve also got an additional 40–60 minutes of slow walking while at my desk, which again fits easily into my day.
As you can see, I’m currently at 1x/wk for non-core exercises, which is a “traditional split.” That means I’m splitting up exercises, focusing on just one set of muscles each day. The problem is that the frequency of training for each muscle group is low, which I’d like to change so that I can build strength more quickly.
I’m switching to “paired sets” (aka “alternating sets”) that alternate among different muscle groups, so I can fit more into the same amount of time. Here’s how that works: if you were taking a 3-minute rest between sets, that gives you time to fit in an unrelated set of muscles that you weren’t using in the first exercise (e.g. biceps & triceps, quads & hamstrings, chest & back). I do this as an alternating tri-set (arm pull, arm push, legs) with a 30–45 second rest between each muscle group, and a 1.5–2 minute break between each full tri-set. You might also see “supersets,” which is a similar concept but with no breaks within the tri-set. I’ve found that I tend to get too tired and sloppy if I try a superset, so I do alternating sets instead.
In addition, I’ve done a lot more research on strength training after getting started. For LISS and HIIT, I had a strongly research-driven approach before beginning. For strength training, I went with some more direct recommendations and only did additional academic research later. Here’s what I’ve learned since then:
Overall, that suggests a workout design that looks like this (2 days a week):
To incorporate this research into a redesigned routine that also includes HIIT and core work, here’s what I’ve recently changed to (most links go to “progressions” that will help you get started):
Also, 4+ days a week, I do a quick set of a 5-second negative for each type of compound exercise (arm push, arm pull, leg press). That’s just 2 days in addition to my strength days, so I usually fit it into HIIT warm-up or cool-down.
On each day, my overall expected time commitment will be about 10 minutes. For strength training, all the alternating sets will overlap with each other. Even with a 3-min break between each set for the same muscle group, that should run quite efficiently for 2–3 sets. For HIIT, it’s already a highly compressed routine that takes ~5 minutes including warm-up and cool-down, but I need another 5 minutes afterwards to decompress after exercise that intense. You may notice that I only have one dedicated day to work my core (Wednesday), but I’m also getting core exercise during push-ups (as I plank), L-sit pull-ups, and handstands (as I balance).
The research recommendation to increase load to 80% of your max can seem more challenging with calisthenics, since it’s just about bodyweight. However, it’s always possible by decreasing your leverage, using one limb instead of two, or increasing the proportion of your weight that’s applied by changing your body angles. For example, you can do push-ups at a downwards incline with your feet on a bench/chair. You can also do more advanced types of squats like Bulgarian split squats, shrimp squats, or pistol squats.
My cardiorespiratory fitness, as measured by VO2 Max (maximal oxygen consumption) on my Apple Watch, has increased from 32 (the lowest end of “below average,” for my age & gender) to 40.1 (above average). It continues to improve on a nearly daily basis. That’s largely happened within just a couple of months, since I started walking every day and doing HIIT.
My blood pressure (one of my initial concerns) has dropped out of pre-hypertension into the healthy range. My resting heart rate has also decreased from 63 to 56 bpm, which was a long slow process that’s occurred over the entire course of my weight loss.
On the strength side, I wasn’t expecting any gains because I’m in a caloric deficit. My main goal was to avoid losing muscle while losing weight. I’ve now been strength training for 2.5 months, and I’ve been pleasantly surprised by the “newbie gains” (which people often see in their first year or two of strength training).
For example, I couldn’t do any pull-ups when I started. I could barely do a couple of negatives, by jumping up and letting myself down slowly. Now I can do 4 pull-ups (neutral grip). Also, I can now hold a wall handstand for 30–45 seconds and do 6–8 very small push-ups, while I could barely get into that position at all when I started.
Overall, clear results emerged almost instantly for cardiorespiratory fitness, and as soon as 6 weeks after beginning a regular strength-training routine. If you try it out, let me know how it works for you!
[Last update: 2024-02-16]
In the past 8 months, I’ve lost 60 pounds and gone from completely sedentary to becoming much more fit, while putting in a minimum of effort. I have no desire to be a bodybuilder, but I want to be in great shape now and be as healthy and mobile as possible well into my old age. A year ago, my blood pressure was already at pre-hypertension levels, even though I was relatively young.
I wasn’t willing to let this last any longer, and I wasn’t willing to accept that future.
Research shows that 5 factors are key to a long life — correlated with extending your life by 12–14 years:
In addition, people who are in good health have a much shorter end-of-life period. This means they extend the healthy portion of their lifespan (the “healthspan”) and compress the worst parts into a shorter period at the very end. Having seen many grandparents go through years of struggle as they grew older, I wanted my own story to have a different ending.
Although I’m not a smoker, I was missing three of the other factors. My weight was massively unhealthy, I didn’t exercise at all and spent most of my day in front of a desk, and my diet was awful. On the bright side for these purposes, I drink moderately (almost entirely beer).
In this post, I’ll walk through my own experience going from obese to a healthy weight, with plenty of research-driven references and data along the way.
Why is this the lazy technologist’s guide, though? I wanted to lose weight in the “laziest” way possible — in the same sense that lazy programmers find the most efficient solutions to problems, according to an apocryphal quote by Bill Gates and a real one by Larry Wall, creator of Perl. Gates supposedly said, “I choose a lazy person to do a hard job. Because a lazy person will find an easy way to do it.” Wall wrote in Programming Perl, “Laziness: The quality that makes you go to great effort to reduce overall energy expenditure. It makes you write labor-saving programs that other people will find useful and document what you wrote so you don’t have to answer so many questions about it.”
What’s the lowest-effort, most research-driven way to lose weight as quickly as possible without losing health? Discovering and executing upon that was my journey. Read on if you’re considering taking a similar path.
My initial goal was to get down from 240 pounds (obese, BMI of 31.7) into the healthy range, reaching 185 pounds (BMI of 24.4).
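For reference, here’s a sketch of how those BMI numbers are computed, using the standard imperial formula (703 × pounds / inches²). I haven’t stated my height in this post; the 73 inches below is an assumed value picked only because it reproduces both BMI figures above:

```python
def bmi(weight_lb: float, height_in: float) -> float:
    """Body mass index from pounds and inches (imperial formula)."""
    if height_in <= 0:
        raise ValueError("height_in must be positive")
    return 703 * weight_lb / height_in ** 2


ASSUMED_HEIGHT_IN = 73  # assumption for illustration, not stated in the post

for weight in (240, 185):
    print(f"{weight} lb -> BMI {bmi(weight, ASSUMED_HEIGHT_IN):.1f}")
```

Plug in your own height and weight to see where you land relative to the healthy range (BMI 18.5–24.9).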
My aim was to lose at the high end of a healthy rate, 2 pounds per week. Credible sources like the Mayo Clinic and the CDC suggested aiming for 1–2 pounds a week, because anything beyond that can cause issues with muscle loss as well as malnutrition.
But how could I accomplish that?
I’ve lost weight once previously (about 15 years ago), although it was a smaller amount. Back then, I learned that there’s no silver bullet — the trick is to create a calorie deficit, so that your body consumes more energy than the calories in what you eat.
Every pound is about 3500 calories, which helps to set a weekly and daily goal for your calorie deficit. For me to lose 2 pounds a week, that’s 2*3500 = 7000 calories/week, or 1000 calories/day of deficit (eating that much less than my body uses).
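Here’s that calculation as a small sketch, so you can plug in your own target rate. It relies on the same ~3500 calories/pound rule of thumb, and the names are mine:

```python
CALORIES_PER_POUND = 3500  # rule-of-thumb value used in this post
DAYS_PER_WEEK = 7


def daily_deficit(pounds_per_week: float) -> float:
    """Daily calorie deficit needed to lose weight at the given weekly rate."""
    weekly_deficit = pounds_per_week * CALORIES_PER_POUND
    return weekly_deficit / DAYS_PER_WEEK


for rate in (1, 1.5, 2):
    print(f"{rate} lb/week -> {daily_deficit(rate):.0f} calories/day deficit")
```

At the gentler end of the recommended range (1 pound/week), the required deficit drops to 500 calories/day.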
It’s far more effective and efficient to create this deficit primarily through eating less rather than expecting exercise to make a huge difference. If you were previously gaining weight, you might’ve been eating 3000 calories/day or more! You can easily reduce what you eat by 1500 calories/day from that starting point, but it’s almost impossible to exercise enough to burn that many calories. An hour of intense exercise might burn 500 calories, and it’s very hard to keep up that level of effort for even one full hour — especially if you’ve been sitting in a chair all day for years on end.
Not to mention, that much exercise would defeat the whole idea of this being the lazy person’s way of making progress.
So how exactly can you reduce calories? You’ve got a lot of options, but they basically boil down to two things — eat less (portion control), and eat better (food choice).
At this point, I knew I needed to eat 1000 calories/day less than I burned. I used this calculator to identify that, as a sedentary person, I burned about 2450 calories/day. So to create that deficit, I needed to eat about 1450 calories/day. At that point, I was probably eating 2800–3000 calories/day, so that would require massive changes in my diet.
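Put together, the plan is just subtraction and division. A minimal sketch using my numbers (2450 calories/day burned, a 1000-calorie deficit, and 240 down to 185 pounds at 2 pounds/week); the function names are hypothetical and the result is only as good as the calculator’s estimate of what you burn:

```python
def intake_target(burned_per_day: float, deficit_per_day: float) -> float:
    """Calories to eat per day to hit the desired deficit."""
    return burned_per_day - deficit_per_day


def weeks_to_goal(start_lb: float, goal_lb: float,
                  pounds_per_week: float) -> float:
    """Weeks needed to reach the goal weight at a steady loss rate."""
    if pounds_per_week <= 0:
        raise ValueError("pounds_per_week must be positive")
    return (start_lb - goal_lb) / pounds_per_week


target = intake_target(burned_per_day=2450, deficit_per_day=1000)
weeks = weeks_to_goal(start_lb=240, goal_lb=185, pounds_per_week=2)

print(f"eat about {target:.0f} calories/day")
print(f"expect roughly {weeks:.1f} weeks to reach the goal")
```

Keep in mind that the calories you burn shrink as you lose weight, so the `burned_per_day` input needs to be re-estimated along the way.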
I don’t like the idea of fad diets that completely remove one or many types of foods entirely (Atkins, keto, paleo, etc), although they can work for other people. One of those big lessons about dieting is that as long as you’re removing something from what you eat, you’ll probably lose weight.
I decided to make two big changes: how often I ate healthy vs unhealthy food, and when I ate over the course of the day. At the time, I was eating a huge amount of high-fat, high-sugar, and low-health foods like burgers and fries multiple times per week, fried food, lots of chips/crisps, white bread (very high sugar in the US) & white rice, cheese, chocolate and candy.
I decided to shift that toward white meat (chicken/pork/turkey), seafood, salads & veggies, and whole grains (whole-wheat bread, brown rice, quinoa, etc). One pro-tip: American salad dressings are super unhealthy, often even the “vinaigrettes” that sound better. Do like Italians do, and dress salads yourself with olive oil, salt, and vinegar. However, I didn’t want to remove my favorite foods entirely, because that would destroy my long-term motivation and enjoyment of my progress. For example, once a week, I still allow myself to get a cheeseburger. But I’ll typically get a single patty, no mayo/cheese/ketchup, and with a side like salad (w/ healthy dressing) or cole slaw. I’ll also ensure my other meal of the day is very light. Many days, I’ll enjoy a small treat like 1–2 chocolates, as well (50–100 calories).
I wanted to reach my calorie target without eliminating beer, so I could both preserve my quality of life and also maintain the moderate drinking that research shows is correlated with increased lifespan.
I was also drinking very high-calorie beer (like double IPAs and bourbon-barrel–aged imperial stouts). I shifted that toward low-alcohol, low-calorie beer (alcohol levels and calories are correlated). Bell’s Light-Hearted IPA and Lagunitas DayTime IPA are two pretty good ones in my area. Of the non-alcoholic (NA) beers, Athletic Free Wave Hazy IPA is the best I’ve found in my area, but Untappd has reasonably good ratings for Sam Adams Just the Haze and Sierra Nevada Trail Pass IPA, which should be broadly available. As a rough estimate on calories in beer, you can use this formula:
Beer calories = ABV (alcohol percentage) * 2.5 * fluid ounces
As an exception, many Belgian beers are quite “efficient” to drink, in that roughly 75% of the calories are alcohol rather than other carbs that just add calories. As a result, they violate the above formula and tend to be lower-calorie than you’d expect. This could be the result of carefully crafted recipes that consume most of the carbs, and fermentation that uses up all of the sugar.
Here’s a more specific formula that you can use, if you’re curious about how “efficient” a given beer is, and you know how many total calories it has (find this online):
Beer calories from ethanol = (ABV * 0.8 / 100) * (29.6 * fluid ounces) * 7
(Simplified form): Beer calories from ethanol = ABV * 1.7 * fluid ounces
This uses the density and calories of ethanol (0.8 g/ml and 7 cal/g, respectively) and converts from fluid ounces to milliliters (29.6 ml/oz). If you then calculate that number as a fraction of the total calories in a beer, you can find its “efficiency.” For example, a 12-ounce bottle of 8.5% beer might have 198 calories total. Using the equation, we can calculate that it’s got 169 calories from ethanol, so 169/198 = 85% “efficient.”
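If you’d rather not do this by hand, here’s a short sketch of both formulas: the rough total-calorie estimate and the ethanol-only calculation. The function names are mine, and the constants are the ones given above:

```python
ETHANOL_DENSITY_G_PER_ML = 0.8
ETHANOL_CAL_PER_G = 7
ML_PER_FL_OZ = 29.6


def rough_beer_calories(abv_percent: float, fl_oz: float) -> float:
    """Rough estimate of total calories: ABV * 2.5 * fluid ounces."""
    return abv_percent * 2.5 * fl_oz


def ethanol_calories(abv_percent: float, fl_oz: float) -> float:
    """Calories that come from the alcohol itself."""
    ethanol_ml = abv_percent / 100 * fl_oz * ML_PER_FL_OZ
    return ethanol_ml * ETHANOL_DENSITY_G_PER_ML * ETHANOL_CAL_PER_G


def efficiency(abv_percent: float, fl_oz: float, total_calories: float) -> float:
    """Fraction of a beer's total calories that come from ethanol."""
    return ethanol_calories(abv_percent, fl_oz) / total_calories


# The example from the text: a 12 oz bottle of 8.5% beer with 198 calories.
print(f"rough estimate: {rough_beer_calories(8.5, 12):.0f} calories")
print(f"from ethanol:   {ethanol_calories(8.5, 12):.0f} calories")
print(f"efficiency:     {efficiency(8.5, 12, 198):.0%}")
```

For this beer, the rough formula predicts 255 calories, well above the 198 on the label, which is exactly the kind of “efficient” outlier described above.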
If you’re really trying to optimize for this, however, beer is the wrong drink. Have a low-calorie mixed drink instead, like a vodka soda, ranch water, or rum and Diet Coke.
Therefore, instead of giving up beer entirely, I decided to skip breakfast. I’d eaten light breakfasts for years (a small bowl of cereal, or a banana and a granola bar), so this wasn’t a big deal to me.
Later, I discovered this qualified my diet as time-restricted intermittent fasting as well, since I was only eating/drinking between ~12pm–6pm. This approach of 18 hours off / 6 hours on (18:6 fasting) may have aided in my weight loss, but studies are mixed, with some suggesting no effect.
Here’s what a day might look like on 1450 calories:
When I get hungry, I often drink some water instead, because my body’s easily confused about hunger vs thirst. It’s a mental game too — I remind myself that hunger means my body is burning fat, and that’s a good thing.
For a long time, I kept track of my estimated calorie consumption mentally. More recently, I decided to make my life a little easier by switching to an app. I chose MyFitnessPal because it’s got a big database including almost everything I eat.
On this plan, I had a great deal of success in losing my first 40 pounds, getting down from 240 to 200. However, it started to feel like a bit of a struggle to maintain my weight loss as I reached 200 pounds and wanted to continue losing at the same rate of 2 pounds/week.
I fell behind by about two weeks on my weight-loss goal, which was massively frustrating because I’d done so well until then. I convinced myself to keep going because the approach had worked for months, and this was a temporary setback.
Finally I re-used the same weight-loss calculator and realized what seemed obvious in hindsight: Since I now weighed less, I also burned fewer calories per day! Those 40 pounds that were now gone didn’t use any energy anymore, but I was still eating as if I had them. I needed to change something to restore the 1000-calorie daily deficit.
At this point, I aimed to decrease my intake to about 1200 calories per day. This quickly became frustrating because it started to affect my quality of life by forcing choices I didn’t want to make, such as choosing between a decent dinner or a beer, or forcing me to eat a salad with no protein for dinner if I had a little bit bigger lunch.
That low calorie limit also carried the risk of causing metabolic adaptation — meaning my body could burn hundreds fewer calories per day as a result of being in a “starvation mode” of sorts. That ends up being a vicious cycle that continually forces you to eat less, and it makes weight loss even more challenging.
Consequently, I began to introduce moderate exercise (walking), so I could bring my intake back up to 1400 calories on days when I burned 200 extra calories. I’ve discussed the details in a follow-up guide for fitness.
Over the course of my learning, I discovered that it’s ideal (according to actuarial tables) to sit in the middle of the healthy range rather than be at the top of it. I maintained my initial weight-loss goal to keep myself motivated on progress, but set a second goal of reaching 165 pounds — or whatever weight it takes to get a six-pack (~10% body fat).
I also discovered that high-protein diets are better at preserving muscle, so more of the weight loss is fat. This is especially true when coupled with resistance or strength training, which also sends your body a signal that it needs to keep its muscle instead of losing it. The minimum recommended daily allowance (RDA) of protein (0.36 grams per pound of body weight, or 67 g/day for me) could be your absolute lower limit, while as much as 0.6 g/lb (111 g/day for me) could help in improving your muscle mass.
Another study suggested multiplying the RDA by 1.25–1.5 (or more if you exercise) to maintain muscle during weight loss, which would put my recommended protein at 84–100 grams per day. The same study also said exercise helps to maintain muscle during weight loss, so it could be an either/or situation rather than needing both. Additionally, high-protein diets can help with hunger and weight loss, in part because they keep you fuller for longer. Getting 25%–30% of daily calories from protein will get you to this level, which is a whole lot of protein. Starting from your overall daily calories, you can apply this percentage and then divide your desired protein calories by 4 to get the number of grams per day:
Protein grams per day = Total daily calories * {25%, 30%} / 4
For my calorie limit, that’s about 88–105 grams per day.
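Here’s a sketch that pulls the protein targets above into one place. Two inputs are my assumptions for illustration: a body weight of 185 pounds (my goal weight, which matches the 67 and 111 g/day figures) and a 1400-calorie daily limit (which matches the 88–105 g/day range):

```python
CALORIES_PER_GRAM_PROTEIN = 4
RDA_GRAMS_PER_LB = 0.36


def protein_by_weight(weight_lb: float, grams_per_lb: float) -> float:
    """Daily protein target in grams, scaled by body weight."""
    return weight_lb * grams_per_lb


def protein_by_calories(daily_calories: float, fraction: float) -> float:
    """Daily protein target in grams, as a fraction of total calories."""
    return daily_calories * fraction / CALORIES_PER_GRAM_PROTEIN


ASSUMED_WEIGHT_LB = 185       # assumption: goal weight from earlier in the post
ASSUMED_DAILY_CALORIES = 1400  # assumption: reproduces the 88-105 g range

rda = protein_by_weight(ASSUMED_WEIGHT_LB, RDA_GRAMS_PER_LB)
upper = protein_by_weight(ASSUMED_WEIGHT_LB, 0.6)
low_pct = protein_by_calories(ASSUMED_DAILY_CALORIES, 0.25)
high_pct = protein_by_calories(ASSUMED_DAILY_CALORIES, 0.30)

print(f"RDA minimum:        {rda:.0f} g/day")
print(f"0.6 g/lb target:    {upper:.0f} g/day")
print(f"1.25x to 1.5x RDA:  {rda * 1.25:.0f} to {rda * 1.5:.0f} g/day")
print(f"25% to 30% of cals: {low_pct:.0f} to {high_pct:.0f} g/day")
```

The 1.25–1.5× figures can differ by a gram from the ones in the text, depending on whether you round the RDA to 67 before multiplying.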
I’ve found that eating near the absolute minimum recommended protein level (67 grams per day, for my weight) tends to happen fairly naturally with my originally planned diet, while getting much higher protein takes real effort. I needed to identify low-calorie, high-protein foods and incorporate them more intentionally into meals, so that I can get enough protein without compromising my daily calorie limit.
Here’s a good list of low-calorie, high-protein foods that are pretty affordable:
If you’re vegetarian, you’d want to go heavier on lentils and beans, and add plenty of nuts, plus hummus and peanut butter. You probably also want to bring in tempeh, and you likely already eat tofu.
I’d never tried canned salmon before, and I was impressed with how easily I could make it into a salad or an open-faced sandwich (like Danish smørrebrød). The salmon came in large pieces and retained the original texture, as you’d want. Canned tuna has been more variable in terms of texture — I’ve had some great-looking albacore from Genova and some great-tasting (but not initially good-looking) skipjack from Wild Planet.
Avoid the most common brands of canned fish though, like Chicken of the Sea, StarKist, or Bumble Bee. They are often farmed or net-caught instead of pole/line-caught, and they may be higher in parasites (for farmed fish like salmon). I also aim to buy lower-mercury types of salmon and tuna — this means I can eat each kind of fish as often as I want, instead of once a week. I buy canned Wild Planet skipjack tuna (not albacore, but yellowfin is pretty good too) and canned Deming’s sockeye salmon (not pink salmon) at my local grocery store, and I pick up large trays of refrigerated cocktail shrimp at Costco. The Genova brand also garners good reviews for canned fish and may be easier to find. All of those are pre-cooked and ready to eat, so they’re easy to use for a quick lunch.
Go ahead and get fresh seafood if you want, but be aware that you’ll be going through a lot of it so it could get expensive. Fish only stays good for a couple of days unless frozen, so you’ll also be making a lot of trips to the store or regularly thawing/cooking frozen fish.
Over the past 8 months, I’ve managed to lose 60 pounds (and counting!) through a low-effort approach that has minimized the overall impact on my quality of life. I’ve continued to eat the foods I want — but less of them.
The biggest challenge has been persistence through the tough times. What has helped most is not cutting out any foods completely, and instead just eating the unhealthy ones less often. That meant I didn’t feel like I was breaking my whole diet whenever I had something I really wanted, as long as it fit within my calorie limit.
What’s next? A few months after beginning my weight loss, I also started working out to get into better shape, which was another one of those original 5 factors for a long life. Right now, I’m aiming to get down to about 10% body fat, which is likely to be around 165 pounds. Then I’ll flip my eating habits into muscle-building mode, which will require a slight caloric surplus rather than a deficit.
Stay tuned to see what happens!