Hi all!
As usual, this month has been rich in Wayland-related activities. Rose has continued building and upstreaming better frame scheduling infrastructure for wlroots, you can read more on her blog. I’ve resurrected an old patch to make wlroots behave better when the GPU is under high load. In my testing this improves latency a lot some specific scenarios and some specific hardware, but doesn’t help on some others. It’s not super clear if anything can be done about this, it may be that we are hitting some hardware limitations here: GPUs don’t know how to preempt tasks very well.
I’ve also started working on explicit synchronization again. This was
previously blocked on a hard problem: drivers may want to use a new kind of
synchronization fence primitive (user-space memory fences) and it wasn’t clear
how the current primitives (drm_syncobj) would hold up. We’ve been talking
about this new primitive for a few years but unfortunately it’s a complicated
matter and nothing new has surfaced. However, after discussing with Daniel
Vetter, we’ve come to the conclusion that the kernel will provide backwards
compatibility for drm_syncobj, so we can just stop worrying and use that as
the basis for explicit synchronization protocols and implementations. Moreover,
NVIDIA engineers are interested in helping with this effort, so I hope we can
keep the momentum and join forces to push the new protocol, APIs and
implementations to the finish line.
There is a lot to be done to plumb explicit synchronization. This month I’ve
respinned a new kernel uAPI patch to allow compositors to
wait on a drm_syncobj without blocking. This also involved writing a test
suite in IGT and a wlroots patch to use the new uAPI. Everything is now
reviewed, I hope to merge this soon. Apart from this, we also need a
new Wayland protocol, a new Vulkan
extension for drm_syncobj import/export, more implementations of the
protocol, ideally yet another new kernel uAPI to improve
interoperability with sync_file, and even a new X11 protocol so that legacy
X11 clients (read: games) can take advantage of this whole thing. Oh my… As
French people say, there is some bread on the table.
In other Wayland news, we’ve started having some more-or-less weekly meetings for wayland-protocols standardization. We’ve been talking about upstreaming some of the stuff currently in a private GTK protocol, IMEs, and layer-shell. It’s been great to be able to discuss face-to-face about blockers for these protocols. The meeting notes are available on the wiki. We’ve done a lot of talking and gesturing, but also some actual work: security-context has finally (!) been merged, and I’ve updated the ext-layer-shell patch.
Apart from the explicit synchronization work, I’ve sent a few other kernel patches. Numerous patches to improve the kernel uAPI documentation, and a few patches to add more information to the hotplug events sent by bridge/i915/nouveau so that compositors don’t need to reload the whole KMS state on each hotplug event (instead, they can now only reload the KMS state of the one specific connector which got hotplugged). I’ve reviewed a few patches as well. Thomas Zimmermann has made it so all DRM drivers now support DMA-BUFs (required for wlroots to run), so now wlroots works on e.g. gma500. AMD engineers have sent patches to support more than 64 DRM devices, there are some subtle uAPI stability issues at play I’ve tried to provide feedback on.
Let’s wrap up this status update with a collection of various smaller
happenings. I’ve removed dlsym() related magic used in the Wayland test suite
which caused sporadic failures on FreeBSD. I’ve been gradually improving the
API for go-imap v2 and fixing a few bugs. hut now supports pagination on all
commands thanks to tireless work by Thorben Günther. kanshi now supports
configuring adaptive sync (VRR). I’ve improved the API of go-oauth2 a bit. Last
but not least, I’ve reworked an old patch to make it easier to
parse scfg files from Go programs, by defining a Go struct
instead of hand-rolling parsing code.
See you next month!
I recently came across tinygrad as a small powerful nn framework that had an OpenCL backend target and could run LLaMA model.
I've been looking out for rusticl workloads, and this seemed like a good one, and I could jump on the AI train, and run an LLM in my house!
I started it going on my Radeon 6700XT with the latest rusticl using radeonsi with the LLVM backend, and I could slowly interrogate a model with a question, and it would respond. I've no idea how performant it is vs ROCm yet which seems to be where tinygrad is more directed, but I may get to that next week.
While I was there though I decided to give the Mesa ACO compiler backend a go, it's been tied into radeonsi recently, and I done some hacks before to get compute kernels to run. I reproduced said hacks on the modern code and gave it a run.
tinygrad comes with a benchmark script called benchmark_train_efficientnet so I started playing with it to see what low hanging fruit I could find in an LLVM vs ACO shootout.
The bench does 10 runs, the first is where lots of compilation happens, the last is well primed cache wise. There are the figures from the first and last runs with a release build of llvm and mesa. (and the ACO hacks).
LLVM:
215.78 ms cpy, 12245.04 ms run, 120.33 ms build, 12019.45 ms realize, 105.26 ms CL, -0.12 loss, 421 tensors, 0.04 GB used, 0.94 GFLOPS
10.25 ms cpy, 221.02 ms run, 83.50 ms build, 36.25 ms realize, 101.27 ms CL, -0.01 loss, 421 tensors, 0.04 GB used, 52.11 GFLOPS
ACO:
71.10 ms cpy, 3443.04 ms run, 112.58 ms build, 3214.13 ms realize, 116.34 ms CL, -0.04 loss, 421 tensors, 0.04 GB used, 3.35 GFLOPS
10.36 ms cpy, 234.90 ms run, 84.84 ms build, 36.51 ms realize, 113.54 ms CL, 0.05 loss, 421 tensors, 0.04 GB used, 49.03 GFLOPS
So ACO is about 4 times faster to compile but produces binaries that are less optimised.
The benchmark produces 148 shaders:
LLVM:
ACO:
So ACO doesn't quite get the optimal shaders for a bunch of paths, even with some local hackery I've done to make it do better.[1]
I'll investigate ROCm next week maybe, got a bit of a cold/flu, and large GPU stacks usually make me want to wipe the machine after I test them :-P
[1] https://gitlab.freedesktop.org/airlied/mesa/-/commits/radeonsi-rusticl-aco-wip
I'm suffering from having a mortal form again, but things are moving in the general direction of progress.
Or "Rose, it's 2 in the morning!" Yeah yeah, whatever, you're not my mum.
Some would call this whining - skip this section if you're here for technology :)
You're not supposed to make yourself work when you don't have energy to because you'll feel bad. People have tried telling me this and I've tried listening but to really take it on board I had to figure out what low energy actually feels like, so here we are, skipping a week of status reporting and holding a suspiciously high Factorio play time. I spent some of that play time making a cool blue circuit factory! Downtime is a good idea, hopefully - we'll find out next week whether it worked.
It's surprising that one of the hardest problems given to me by the Fates has been fighting against myself, which sounds overly dramatic but in a literal sense is true. I would be moving faster if I felt up to it, but I don't feel up to it because I moved too fast recently. It's my fault because I wore myself out, but it's not my fault to rest when I need to, so instinctively I remain undecided on whether it's my fault. Sadly this isn't a balance that I've learned to strike, at least not for large scale work that I care about.
Add this to a general guilt for doing less than others seem to be doing (a velocity- rather than the famous competence-based impostor syndrome) and the work that was once appealing becomes more distant. LoC metrics are a favourite of crap managers, quick glancers, and the part of my subconscious that judges my self worth. It's not ideal and it's even not-idealer when your work is mostly thinking and not actually that much coding - see the previous report for a bunch of musings about what code should be written and not much written code. It's valid work! But the goblin in my skull disagrees. The mortal form disappoints me. I was hoping to discover my inner cold programming machine but I just found some boring human imperfections. Yawn!
This isn't what I was expecting to write about but I think it's helping. I'm sure these aren't unique experiences but they worry me nonetheless, which is partially because I'm hardwired to be worrying about something most of the time.
In a couple of days it will all be OK because I'll be able to play Counter-Strike again and that will for sure make my productivity go up, or down. The paradox of relaxing!
As predicted, I have to face prediction. Before I do that, I want to get a feel for the behaviour of compositors' performance so I'm not mathsing in the dark, and my weapon of choice is Linux's tracing system which either is called ftrace or has a component called ftrace. I can't tell which.
We've met Linux's tracing before. The screenshots from GPUVis were made of data extracted from it, which makes it an attractive answer to the question "where do I put all my data". In theory, if wlroots gains the ability to output events to this system, GPUVis will automatically be able to display these events as it does all the others.
The mechanism for userspace to emit events in this way landed in Linux 6.4 which was unleashed about 12 hours before I realised that my laptop's 6.3 series kernel didn't have support for it and nearly gave up. Until 6.4, the feature was gated behind CONFIG_BROKEN and looked truly like a lost cause. Thankfully Simon noticed that 6.4 held the answer to my problems and I found things to do while I waited for it to hit my distribution. Thrilling! We're back on track.
To hide the horrors of a bare UAPI from wlroots, I wrote and published libuserevents, which is my first C library and will make interacting with user_events amazing and great and you should definitely use it. There are whispers of integration into wlroots so far. I hope eventually I'll have a nice tool that can monitor a running compositor and show a graph of the frame times because that will at least be something pretty to look at to get away from thinking.
In the background there's a scene timer wriggling its way through review and the dreaded How To Schedule Frame Signals is looming over us all. I forgot to submit the Vulkan timer in all the ruckus. Oh well, apparently no one's supposed to be using the Vulkan backend yet anyway so I doubt there's anyone holding their breath.
I've also just noticed that the second status report has links to git branches instead of commits, so they're likely very stale by now. Remind past me to not do that, that moron.
Who knows what the future holds? Join us next week time to find out.
Today, Imagination Technologies announced some very exciting news: they are now using Zink for full OpenGL 4.6 support! Collabora had the pleasure of working together with engineers from Imagination to make this a reality, and it’s very rewarding to now be able show the results to the world!
More importantly, this is the first time we’ve seen a hardware vendor trust the OpenGL-on-Vulkan Mesa driver enough to completely side-step a native OpenGL driver and use it in a shipping product. It’s wonderful to see that Zink can realistically be used as a work-horse, especially in a high-performance graphics setting.
Zink started out as a small R&D project at Collabora, but has since grown to be a full-on community project. None of this would have been possible without the awesome work done by Mike and the other Zink contributors!
One small detail from Imagination’s post that I think is important to highlight is that the solution is officially conformant. This is the first product to be officially conformant using Zink, but it’s not going to be the last! In fact, we only need one more conformant implementation before Zink itself is conformant as a generic layered implementation, according to the Khronos Conformant Product Criteria.
In the not too distant future, we should be able to combine Zink with the in-progress open source driver from Imagination, and that’s when things will really start to shine for the open source graphics stack on Imagination hardware. So there’s plenty more to look forward to here!
The All Systems Go! 2023 Call for Participation Closes in Three Days!
The Call for Participation (CFP) for All Systems Go! 2023 will close in three days, on 7th of July! We’d like to invite you to submit your proposals for consideration to the CFP submission site quickly!
![]()
All topics relevant to foundational open-source Linux technologies are welcome. In particular, however, we are looking for proposals including, but not limited to, the following topics:
The CFP will close on July 7th, 2023. A response will be sent to all submitters on or before July 14th, 2023. The conference takes place in 🗺️ Berlin, Germany 🇩🇪 on Sept. 13-14th.
All Systems Go! 2023 is all about foundational open-source Linux technologies. We are primarily looking for deeply technical talks by and for developers, engineers and other technical roles.
We focus on the userspace side of things, so while kernel topics are welcome they must have clear, direct relevance to userspace. The following is a non-comprehensive list of topics encouraged for 2023 submissions:
For more information please visit our conference website!
Adam Jackson created the script add-gitlab-merge-requests.sh which is the basis of this workflow.
The idea is to provide local access to all of the PRs that exist upstream. This both provides a better general overview of which PRs that have been pulled into the branch you're working on, but also enables you to search the contents of all PRs.
This function automagically detects if your remote is hosted on GitHub or GitLab and makes the necessary adjustments to work on either platform.
[alias]
mr = pr
pr = "!f() { \
REMOTES=$(git remote); \
REMOTE=\"origin\"; \
case \"$REMOTES\" in \
*upstream*) \
REMOTE=\"upstream\"; \
;; \
esac; \
ORIGIN=${1:-${REMOTE}}; \
URL=$(git remote get-url ${ORIGIN}); \
\
case \"$URL\" in \
*gitlab*) \
REMOTE_EXT="mr" \
REMOTE_PATH="merge-requests" \
;; \
*github*) \
REMOTE_EXT="pr" \
REMOTE_PATH="pull" \
;; \
esac …
Peter Hutterer: gitlab.freedesktop.org now has a bugbot for automatic issue/merge request processing As of today, gitlab.freedesktop.org provides easy hooks to invoke the gitlab-triage tool for your project. gitlab-triage allows for the automation of recurring tasks, for example something like
If the label FOO is set, close the issue and add a comment containing ".... blah ..."Many project have recurring tasks like this, e.g. the wayland project gets a lot of issues that are compositor (not protocol) issues. Being able to just set a label and have things happen is much more convenient than having to type out the same explanations over and over again.
The goal for us was to provide automated handling for these with as little friction as possible. And of course each project must be able to decide what actions should be taken. Usually gitlab-triage is run as part of project-specific scheduled pipelines but since we already have webhook-based spam-fighting tools we figured we could make this even easier.
So, bugbot was born. Any project registered with bugbot can use labels prefixed with "bugbot::" to have gitlab-triage invoked against the project's policies file. These labels thus serve as mini-commands for bugbot, though each project decides what happens for any particular label. bugbot effectively works like this:
sleep 30
for label in {issue|merge_request}.current_labels:
if label.startswith("bugbot::"):
wget https://gitlab.freedesktop.org/foo/bar/-/raw/{main|master}/.triage-policies.yml
run-gitlab-triage --as-user @bugbot --use-file .triage-policies.yml
break
And this is triggered on every issue/merge request update for any registered project which means that all you need to do is set the label and you're done.
The things of note here:
resource_rules:
issues:
rules:
- name: convert bugbot label to other label
conditions:
labels:
- "bugbot::foo"
actions:
labels:
- "foo"
remove_labels:
- "bugbot::foo"
comment: |
Nice label you have there. Would be a shame
if someone removed it
status: "close"
merge_requests:
rules:
[]
And the effect of this file can be seen in this issue here.
Bugbot is part of the damspam project and registering a project can be done with a single command. Note: this can only be done by someone with the Maintainer role or above.
Create a personal access token with API access and save the token value as $XDG_CONFIG_HOME/bugbot/user.token Then run the following commands with your project's full path (e.g. mesa/mesa, pipewire/wireplumber, xorg/lib/libX11):
$ pip install git+https://gitlab.freedesktop.org/freedesktop/damspam $ bugbot request-webhook foo/barAfter this you may remove the token file and the package
$ pip uninstall damspam $ rm $XDG_CONFIG_HOME/bugbot/user.tokenThe bugbot command will file an issue in the freedesktop/fdo-bots repository. This issue will be automatically processed and should be done by the time you finish the above commands, see this issue for an example. Note: the issue processing requires a git push to an internal repo - if you script this for multiple repos please put a sleep(30) in to avoid conflicts.
Remember you can test your policies file with
$ gitlab-triage --dry-run --token $GITLAB_TOKEN \ --source-id foo/bar --resource-reference 1234
I mentioned some time back about Konstantin’s heroic efforts to add all the descriptor features to Lavapipe.
After a lot of battling CI those efforts have finally paid off, and Lavapipe now supports all the features. Enough to run a credible amount of vkd3d-proton, in fact.
Great work. Can’t wait to see how he tackles sparse binding.
What two weeks!
Taking from where I left at the last update, I made progress in understanding the format of the buffer that contains the weights and biases.
The bit of knowledge that made a difference was realising that the format is optimized so that each NN core can efficiently access the portion of it that it needs, without having to do any parsing or decoding. Knowing that also helped in guessing what some fields in the parameter structure are for.
With that, I was able to correctly run a convolution on a small matrix with arbitrary weights and biases.
The biggest roadblock in this area currently is understanding how I need to program the output unit in the NN so the output data is in the desired scale. There are a series of fields that influence how the output values are processed before being placed in the output buffer, and I don't really know how they work yet. They are called post_shift and post_mult and the first correlates moderately (r=0.78) to the quantization scale of the output. I know that the post_shift field does what it says, to the right, but to understand what value I need in each situation I feel I need to understand better how the hardware works and what could be the initial values at the end of the convolution and before the output unit. I will be reading a bunch of research papers about NN-accelerating silicon in the summer.
That said, replacing the OpenCL kernels in TensorFlow Lite's GPU delegate that do convolutions with the fixed units turned out to be a worse idea than I initially thought. This is because that delegate is completely oriented towards float-first hardware such as GPUs and this accelerator is integer only.
A consequence of this is that TFLite inserts a dequantize operation at the start of the graph and a quantize at the end, to match the desired intput and output formats of a fully quantized model while feeding floats to the GPU. We need integers, so would be having to quantize after TFLite's dequantization and vice versa. Also, the other operations in the graph expect floats as well... This is certainly the wrong path to take for performance in a bandwidth-constrained device as all embedded boards are, so I had to go back to the drawing board.
If TF Lite's GPU delegate is such a bad match for this HW, what can we do to run inferences with reasonable speeds? The same that VeriSilicon did: write our own delegate:
https://gitlab.freedesktop.org/tomeu/mesa/-/commits/teflon/
TF Lite's operation description matches relatively well what we currently know of the configuration of the NN units. So we will not need to write complex shaders to implement the operations, but "just" translate the description of the operation to the HW configuration.
Of course, there is no HW that has fixed function units that accelerate all operations that are built into TF Lite or even that the most commonly used models contain. VeriSilicon's delegate deals with that by having a library of optimized OpenCL kernels that run on their programmable shader core(s).
But we want to avoid getting in the business of writing dozens of kernels that will need to be tweaked and made more complex so they run efficiently on other NPUs out there.
Fortunately, the delegate infrastructure in TF Lite is designed for this very scenario of imperfect HW and we can have a simple delegate that will implement the operations supported by the HW and the rest will execute in other delegates based on their capabilities.
How fast that will be is a big unknown right now, as switching between delegates will have a cost in terms of synchronization and data sharing, but that is something that we probably can improve in the TF Lite code base as the kernel has already all mechanisms for efficient synchronization and data sharing.
Other possibilities that we have with the TF Lite delegate mechanism is offloading the operations we don't need to a different delegate that supports accelerating them. For example, in the case of a board with Amlogic A311D or S905D3, we could use the GPU delegate to run those operations on the Mali GPU on it, via the OpenCL driver that Alyssa is writing in Mesa.
And if that is still slower than with the proprietary stack, one could always write an optimized kernel in NIR to run on the programmable core in the Vivante NPU. That is the beauty of free software, we can address the needs we have ourselves, and importantly so, do it by pooling work with others!
Because this frontend is implemented in terms of Gallium, we leverage the infrastructure in there for memory management, synchronization and execution. I think this will work well for adding support to other NN engines such as those from Rockchip, Cadence, Mediatek, etc.
I need to crack the nut of the post-processing of the raw output so it is in the expected scale, and afterwards I will be looking at handling multiple feature maps (kernel z > 1).
After that I don't see much else in the way of running convolutions as expected by TF Lite, so hopefully I will be running some models and measuring the performance. I expect that we will want to do the same for accelerating tensor operations with the TP units. And we will probably want to give a look at using the SRAM to reduce bandwidth and memory access latency. That still some way off though, and the summer is just starting!
API design is kinda tricky.
wlroots is designed to be very very modular. It's clear from the readme:
This design goal seems to have come about from lessons learned in other projects where a more all-in-one approach has become a burden. Note that "pluggable" doesn't just mean you can plug it, it also means that you have to. In my opinion, that's why people shy away from this approach: if you want library users to be able to swap out parts of your code for theirs, you have to force them to be responsible for all the wiring even if they are using the default everything; otherwise they won't have access to the wiring for when they change their mind.
This is all very commendable - engineering has happened, people learned, we are making better tradeoffs now than we were before. But what it means for me is that I have to also learn and do engineering. Imagine! More seriously, I want to respect the consensus that making library users do wiring is good, and they should be able to opt out of my code being called by not calling it.
What I'm finding, though, is that the Thing I'm Trying To Do (scheduling) sounds like a feature but
is more of an optimisation.
Really, I'm making the frame signal firing be more interesting than it was before.
Since the beginning it has worked in its way with a constant delay of zero, and now I am improving
it to recognise that the delay might not be zero.
I don't think it's possible to do this work in a neat and separate box, because it lies right in the
middle of the frame firing path.
There's a fair bit of complexity in the way the signal is fired,
so refactoring the whole firing path to separate it from what it touches
(wlr_output and the backends) is much harder than it might seem.
Top scientists (me) are working around the clock (or parts of it) to figure out how the scheduling can be decoupled such that it doesn't have to touch the backends at all. And by "working to figure out" I mostly mean "stewing on how it's kind of impossible and I don't think we can do much better than !4214". We'll see, but at the moment it seems like I have to disregard all the pondering from the top of the page and stick my grubby fingers into all kinds of once-sacred functions.
There are measurement woes, too: wlr_scene is a whole thing, and it calls the renderer for you.
It spends some time doing that and we need to know how long that time is, so we do need to add
measurement code inside the function that does the thing (wlr_scene_output_build_state).
I've been doing a lot less pondering and a lot more giving up on this one, and my grand plan at the
moment is to just throw a timer right in there and let the user query it.
This does make scene bigger, but I think it's alright to expect the code that does the rendering to
also do the timing thereof. It will need to be aware of the timer to some extent, so we might as
well put the whole thing in there and not bother trying to pretend they're separate.
Aside from all that, I've implemented a timer for the Vulkan backend, but I haven't submitted it yet. Not much more to say there.
Soon, I will have to face prediction and think about statistics. Nooooooo!
Keen-eyed readers of the blog will have noticed over the past couple weeks an obvious omission in my furious benchmark-related handwaving.
I didn’t mention glmark2.
Yes, that glmark2.
Except my results were a bit different.
The thing about glmark2 is it’s just a CPU benchmark of drivers at this point. None of the rendering is in any way intensive, GPUs have gotten far more powerful than they were at the time the thing was written, and CPUs are now pushing tens of thousands of frames per second.
Yes, tens of thousands.
I set out to investigate this using the case with the biggest perf delta, which was build:use-vbo=true:
Brutal.
Naturally I went to my standby for profiling, perf. I found some stuff, and I fixed it all, and surprise surprise the numbers didn’t change. Clearly there was some blocking going on that was affecting the frames. Could the flamegraph tools for perf highlight this issue for me?
No.
An average Monday in the life of a zink developer.
New tools would be needed to figure out what’s going on here.
Do such tools already exist? Maybe. Post in the comments if you have tools you use for this type of thing.
The solution I settled on is the olde-timey method of instrumentation. That is, injecting os_time_get_nano() calls in areas of interest and printing the diffs.
Let’s take a moment and speculate on what my findings might have been.
Yes, obviously it’s WSI.
It turns out Mesa’s (X11) WSI is too slow to push more than 10,000 fps.
As I’ve been tracking in this issue, the Mesa WSI code is very inefficient. This probably won’t affect most things, but when the target frametime is around 40,000 nanoseconds, things get dicey.
For those of you unwilling to sort through your memorized table of frame timings, an application rendering at 60 fps must render frames in 16,666,667 nanoseconds.
glmark2 is over 400 times faster.
To enumerate the slow points I found at a high level:
vkGetPhysicalDeviceSurfaceCapabilitiesKHR is doing unlimited X11 roundtripsAgain, these aren’t things that affect most apps/games, but when the frametime diff is 100,000ns vs 40,000ns, these add up to a majority of that difference.
Going in order, part of the problem is the split nature of zink’s WSI handling. The Gallium DRI frontend runs in a different thread from the zink driver thread, and it has its own integration with the API state tracker. This is fine for other drivers, as it lets relatively expensive WSI operations (e.g., display server roundtrips) occur without impeding the driver thread.
But how does this work with zink?
Zink uses the kopper interface, which is just a bridge between the Gallium DRI frontend and Vulkan WSI. In short, when DRI would directly call through to acquire/present, it instead handwaves some API methods at zink to maybe do the same thing.
One of these operations is updating the drawable geometry to determine the sizing of the framebuffer. In VK parlance, this is the image extents returned by vkGetPhysicalDeviceSurfaceCapabilitiesKHR.
Now, looking at the other side, usually the DRI frontend (e.g., with RadeonSI) uses X11 configure notify events to determine the geometry, which lets it work asynchronously without roundtripping. Unfortunately, this doesn’t work so well for zink, as there’s no guarantee the underlying Vulkan driver will have the same idea of geometry. Thus, any time geometry is updated, the DRI frontend has to call through to zink, which has to call through to Vulkan, which then returns the geometry.
And if the Vulkan driver performs a display server roundtrip to get that geometry, it becomes very slow at extreme framerates.
The solution here is obvious: make Mesa’s WSI avoid roundtrips when polling geometry. This doesn’t fix non-Mesa drivers, but it does provide a substantial (~10%) boost to performance for the ones affected.
Another big difference between Gallium DRI and VK WSI is present handling.
In DRI, presentation happens.
In VK WSI, the event queue is flushed, and then presentation happens, then the result of presentation is waited on and checked, and the value is returned to the app.
It’s a subtle difference, but I’m sure my genius-level readers can pick up on it.
The reasoning for the discrepancy is clear: vkQueuePresentKHR has a return value, which means it can return values to indicate various failure/suboptimal modes. But is this a good idea?
Yes, but also no. On one hand, certain returns are important so the app can determine when to reallocate the swapchain or make other, related changes. On the other hand, doing all this extra work impacts WSI performance by a substantial amount, and none of it is truly useful to applications since the same values can be returned by acquire.
So I deleted it.
That was a ~60% perf boost.
The final frontier of WSI perf is syncfile operations. TL;DR Vulkan WSI is explicit, which means there’s a file descriptor fence passed back and forth between kernel and application to signal when a swapchain image can be used/reused, and all of these operations have costs. These costs aren’t typically visible to applications, but when 4000ns can be a 10% performance boost, they start to become noticeable.
Now one thing to note here is that the DRI frontend uses implicit sync, which means drivers are allowed to do some tricks to avoid extra operations related to swapchain fencing. VK WSI, however, uses explicit sync, which means all these tricks are still totally legalNOT AT ALL LEGAL. This means that, in a sense, VK WSI will always be slightly slower. Probably less than 10,000ns per frame, but, again, this is significant when running at extreme framerates.
The only operation I could sensibly eliminate from the presentation path was the dmabuf semaphore export. Semaphores are automatically reset after they are signaled, so this means there’s no need to continually export the same semaphore for every present.
The DRI frontend uses a swapchain size of four images when using IMMEDIATE and the swap mode returned by the server is ‘flip’.
VK WSI always uses three images.
Switching to four swapchain images for this exe uncapped the perf a bit further.
Nobody’s expecting any huge gains here, but that’s obviously a lie since the only reason anyone comes to SGC is for the huge gains.
And huge gains there are.
My new glmark2 results:
Still nowhere near caught up, but this cuts roughly 40,000ns (~40%) off zink’s frametime, which is big enough to qualify for a mention.
These changes are still a bit experimental. I’ve tested them locally and gotten good results without issues in all scenarios, but that doesn’t mean they’re “correct”. VK WSI is used by a lot of applications in a lot of ways, so it’s possible there’s scenarios I haven’t considered.
If you’re interested in testing, the MR is here.
This work proposed by my mentor Maíra Canal is going more difficult than I thought >W<.
On the Community Bonding Period of GSoC, Maíra proposed I work on VKMS. While looking on the TODO list of the driver, on the plane feature section, I found the item “Additional buffer formats, especially YUV formats for video like NV12” interesting to work on, as it has some correlation with my work on the GSoC.
Before I start talking about what was done I think this blog post needs more context.
The Virtual Kernel Mode Setting is a virtual video driver. It is used for testing the correct display of userspace programs without needing a physical monitor. Virtual drivers are useful for automated testing.
NV12 is a multi-planar color format that uses the YCbCr color model with Chroma subsampling. This explanation doesn’t say much for the first reader, so let’s explain it further.
The YCbCr, like RGB, is a form to represent color on the computer memory. It’s divided into three components:
An image along with components respectively (User:Brianski, Public domain, via Wikimedia Commons)
This color model is better for compression because it separates the brightness from the color. Humans perceive more changes in brightness than in color, so we can have less information about color in an image and not perceive the difference in loss of detail. So some formats have the same chroma (Cb, Cr) components for multiple pixels. This technique is called Chroma subsampling.
The NV12 has the same color components for 2 pixels on the horizontal and 2 pixels on the vertical. It achieves that by having two areas of memory called planes, one only for the luminance component and the other for the two colors components.
A format that separates its components into multiple planes is called a multi-planar format. More specifically, a format with two planes is called a semi-planar format, and one with three planes is called a planar format.
Y Plane Cb Cr Plane Formed Pixels
Y00 Y01 Cb0 Cr0 Y00Cb0Cr0 Y01Cb0Cr0
Y02 Y03 + Cb1 Cr1 ----> Y02Cb0Cr0 Y03Cb0Cr0
Y10 Y11 Y10Cb1Cr1 Y11Cb1Cr1
Y12 Y13 Y12Cb1Cr1 Y13Cb1Cr1
Each Yij value uses the correspondent Cbi an Cri values.
Currently, the VKMS supports none of the features above. So the task is subdivided into this order:
I think you can see why this work is harder than I thought. I knew none of that information beforehand, so learning all of that was an adventure :D.
I have implemented all those things, but it still needs a little thinking to get all working together.
The VKMS only expects drm_framebuffer’s to have a single
plane. The first thing to do is to remove the offset, pitch, and cpp
variables from the vkms_frame_info and switch into using the arrays that they
were taken. With that, we can access the information for every plane in a
drm_framebuffer, not just the first one.
struct vkms_frame_info {
struct drm_rect rotated;
struct iosys_map map[DRM_FORMAT_MAX_PLANES];
unsigned int rotation;
- unsigned int offset;
- unsigned int pitch;
- unsigned int cpp;
};
After that, we need the ability to choose the plane and use its own cpp,
pitch, and offset. To do that, we need to pass an index to the functions
that access the color information and use it on the new arrays.
-static size_t pixel_offset(const struct vkms_frame_info *frame_info, int x, int y)
+static size_t pixel_offset(const struct vkms_frame_info *frame_info, int x, int y, size_t index)
{
struct drm_framebuffer *fb = frame_info->fb;
- return fb->offsets[0] + (y * fb->pitches[0])
- + (x * fb->format->cpp[0]);
+ return fb->offsets[index] + (y * fb->pitches[index])
+ + (x * fb->format->cpp[index]);
}
static void *packed_pixels_addr(const struct vkms_frame_info *frame_info,
- int x, int y)
+ int x, int y, size_t index)
{
- size_t offset = pixel_offset(frame_info, x, y);
+ size_t offset = pixel_offset(frame_info, x, y, index);
return (u8 *)frame_info->map[0].vaddr + offset;
}
The drm_format_info has the hsub and vsub variables
that dictate the subsampling factor horizontally and vertically (for [NV12 hsub
= vsub = 2][nv12-fourcc.h]). So we need to take this into account when accessing
the color information of a pixel. Note that this is not to be done on the first
plane, as for all formats the subsampling is not present on it.
@@ -238,8 +238,10 @@ static void get_src_pixels_per_plane(const struct vkms_frame_info *frame_info,
{
const struct drm_format_info *frame_format = frame_info->fb->format;
- for (size_t i = 0; i < frame_format->num_planes; i++)
- src_pixels[i] = get_packed_src_addr(frame_info, y, i);
+ for (size_t i = 0; i < frame_format->num_planes; i++){
+ int vsub = i ? frame_format->vsub : 1;
+ src_pixels[i] = get_packed_src_addr(frame_info, y / vsub, i);
+ }
}
void vkms_compose_row(struct line_buffer *stage_buffer, struct vkms_plane_state *plane, int y)
@@ -250,6 +252,8 @@ void vkms_compose_row(struct line_buffer *stage_buffer, struct vkms_plane_state
int limit = min_t(size_t, drm_rect_width(&frame_info->dst), stage_buffer->n_pixels);
u8 *src_pixels[DRM_FORMAT_MAX_PLANES];
+ int hsub_count = 0;
+
enum drm_color_encoding encoding = plane->base.base.color_encoding;
enum drm_color_range range = plane->base.base.color_range;
@@ -258,17 +262,21 @@ void vkms_compose_row(struct line_buffer *stage_buffer, struct vkms_plane_state
for (size_t x = 0; x < limit; x++) {
int x_pos = get_x_position(frame_info, limit, x);
+ hsub_count = (hsub_count + 1) % frame_format->hsub;
+
if (drm_rotation_90_or_270(frame_info->rotation)) {
+ get_src_pixels_per_plane(frame_info, src_pixels, x + frame_info->rotated.y1);
for (size_t i = 0; i < frame_format->num_planes; i++)
- src_pixels[i] = get_packed_src_addr(frame_info,
- x + frame_info->rotated.y1, i) +
- frame_format->cpp[i] * y;
+ if (!i || !hsub_count)
+ src_pixels[i] += frame_format->cpp[i] * y;
}
plane->pixel_read(src_pixels, &out_pixels[x_pos], encoding, range);
- for (size_t i = 0; i < frame_format->num_planes; i++)
- src_pixels[i] += frame_format->cpp[i];
+ for (size_t i = 0; i < frame_format->num_planes; i++) {
+ if (!i || !hsub_count)
+ src_pixels[i] += frame_format->cpp[i];
+ }
}
}
This was, by far, the most difficult part of the project.
The Color YCbCr has three color encoding standards, BT601, BT709, and BT2020, besides that the YCbCr can occupy the full range of each byte it uses or a limited range.
To tell what color encoding and range the driver support, we have to add the
drm_plane_create_color_properties()
@@ -212,5 +212,14 @@ struct vkms_plane *vkms_plane_init(struct vkms_device *vkmsdev,
drm_plane_create_rotation_property(&plane->base, DRM_MODE_ROTATE_0,
DRM_MODE_ROTATE_MASK | DRM_MODE_REFLECT_MASK);
+ drm_plane_create_color_properties(&plane->base,
+ BIT(DRM_COLOR_YCBCR_BT601) |
+ BIT(DRM_COLOR_YCBCR_BT709) |
+ BIT(DRM_COLOR_YCBCR_BT2020),
+ BIT(DRM_COLOR_YCBCR_LIMITED_RANGE) |
+ BIT(DRM_COLOR_YCBCR_FULL_RANGE),
+ DRM_COLOR_YCBCR_BT601,
+ DRM_COLOR_YCBCR_FULL_RANGE);
+
return plane;
}
The conversion code was taken from the tpg-core.c, a virtual
driver from the media subsystem that does those conversions in software as well.
As a side thought, maybe would be better to have those two subsystems use the same code, maybe with a separate subsystem that handles color formats.
The TPG code was changed to use the drm_fixed.h operations, to be
more precise and coherent.
struct pixel_yuv_u8 {
u8 y, u, v;
};
static void ycbcr2rgb(const s64 m[3][3], int y, int cb, int cr,
int y_offset, int *r, int *g, int *b)
{
s64 fp_y; s64 fp_cb; s64 fp_cr;
s64 fp_r; s64 fp_g; s64 fp_b;
y -= y_offset;
cb -= 128;
cr -= 128;
fp_y = drm_int2fixp(y);
fp_cb = drm_int2fixp(cb);
fp_cr = drm_int2fixp(cr);
fp_r = drm_fixp_mul(m[0][0], fp_y) +
drm_fixp_mul(m[0][1], fp_cb) +
drm_fixp_mul(m[0][2], fp_cr);
fp_g = drm_fixp_mul(m[1][0], fp_y) +
drm_fixp_mul(m[1][1], fp_cb) +
drm_fixp_mul(m[1][2], fp_cr);
fp_b = drm_fixp_mul(m[2][0], fp_y) +
drm_fixp_mul(m[2][1], fp_cb) +
drm_fixp_mul(m[2][2], fp_cr);
*r = drm_fixp2int(fp_r);
*g = drm_fixp2int(fp_g);
*b = drm_fixp2int(fp_b);
}
static void yuv_u8_to_argb_u16(struct pixel_argb_u16 *argb_u16, struct pixel_yuv_u8 *yuv_u8,
enum drm_color_encoding encoding, enum drm_color_range range)
{
#define COEFF(v, r) (\
drm_fixp_div(drm_fixp_mul(drm_fixp_from_fraction(v, 10000), drm_int2fixp((1 << 16) - 1)),\
drm_int2fixp(r)) \
)\
const s64 bt601[3][3] = {
{ COEFF(10000, 219), COEFF(0, 224), COEFF(14020, 224) },
{ COEFF(10000, 219), COEFF(-3441, 224), COEFF(-7141, 224) },
{ COEFF(10000, 219), COEFF(17720, 224), COEFF(0, 224) },
};
const s64 bt601_full[3][3] = {
{ COEFF(10000, 255), COEFF(0, 255), COEFF(14020, 255) },
{ COEFF(10000, 255), COEFF(-3441, 255), COEFF(-7141, 255) },
{ COEFF(10000, 255), COEFF(17720, 255), COEFF(0, 255) },
};
const s64 rec709[3][3] = {
{ COEFF(10000, 219), COEFF(0, 224), COEFF(15748, 224) },
{ COEFF(10000, 219), COEFF(-1873, 224), COEFF(-4681, 224) },
{ COEFF(10000, 219), COEFF(18556, 224), COEFF(0, 224) },
};
const s64 rec709_full[3][3] = {
{ COEFF(10000, 255), COEFF(0, 255), COEFF(15748, 255) },
{ COEFF(10000, 255), COEFF(-1873, 255), COEFF(-4681, 255) },
{ COEFF(10000, 255), COEFF(18556, 255), COEFF(0, 255) },
};
const s64 bt2020[3][3] = {
{ COEFF(10000, 219), COEFF(0, 224), COEFF(14746, 224) },
{ COEFF(10000, 219), COEFF(-1646, 224), COEFF(-5714, 224) },
{ COEFF(10000, 219), COEFF(18814, 224), COEFF(0, 224) },
};
const s64 bt2020_full[3][3] = {
{ COEFF(10000, 255), COEFF(0, 255), COEFF(14746, 255) },
{ COEFF(10000, 255), COEFF(-1646, 255), COEFF(-5714, 255) },
{ COEFF(10000, 255), COEFF(18814, 255), COEFF(0, 255) },
};
int r = 0;
int g = 0;
int b = 0;
bool full = range == DRM_COLOR_YCBCR_FULL_RANGE;
unsigned int y_offset = full ? 0 : 16;
switch (encoding) {
case DRM_COLOR_YCBCR_BT601:
ycbcr2rgb(full ? bt601_full : bt601,
yuv_u8->y, yuv_u8->u, yuv_u8->v, y_offset, &r, &g, &b);
break;
case DRM_COLOR_YCBCR_BT709:
ycbcr2rgb(full ? rec709_full : rec709,
yuv_u8->y, yuv_u8->u, yuv_u8->v, y_offset, &r, &g, &b);
break;
case DRM_COLOR_YCBCR_BT2020:
ycbcr2rgb(full ? bt2020_full : bt2020,
yuv_u8->y, yuv_u8->u, yuv_u8->v, y_offset, &r, &g, &b);
break;
default:
pr_warn_once("Not supported color encoding\n");
break;
}
argb_u16->r = clamp(r, 0, 0xffff);
argb_u16->g = clamp(g, 0, 0xffff);
argb_u16->b = clamp(b, 0, 0xffff);
}
After all that, we can finally create the NV12 conversion function. We need to access the YCbCr values in the form of the NV12.
static void NV12_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel,
enum drm_color_encoding encoding, enum drm_color_range range)
{
struct pixel_yuv_u8 yuv_u8;
yuv_u8.y = src_pixels[0][0];
yuv_u8.u = src_pixels[1][0];
yuv_u8.v = src_pixels[1][1];
yuv_u8_to_argb_u16(out_pixel, &yuv_u8, encoding, range);
}
After all this work you think that all worked, right? Well, IGT GPU Tools, says the opposite.
[root@archlinux shared]# ./build/tests/kms_plane --run pixel-format
IGT-Version: 1.27.1-g4637d2285 (x86_64) (Linux: 6.4.0-rc1-VKMS-DEVEL+ x86_64)
Opened device: /dev/dri/card0
Starting subtest: pixel-format
Starting dynamic subtest: pipe-A-planes
Using (pipe A + Virtual-1) to run the subtest.
Testing format XR24(0x34325258) / modifier linear(0x0) on A.0
Testing format AR24(0x34325241) / modifier linear(0x0) on A.0
Testing format XR48(0x38345258) / modifier linear(0x0) on A.0
Testing format AR48(0x38345241) / modifier linear(0x0) on A.0
Testing format RG16(0x36314752) / modifier linear(0x0) on A.0
Testing format NV12(0x3231564e) / modifier linear(0x0) (ITU-R BT.601 YCbCr, YCbCr limited range) on A.0
(kms_plane:403) WARNING: CRC mismatches with format NV12(0x3231564e) on A.0 with 3/4 solid colors tested (0xD)
Testing format NV12(0x3231564e) / modifier linear(0x0) (ITU-R BT.601 YCbCr, YCbCr full range) on A.0
(kms_plane:403) WARNING: CRC mismatches with format NV12(0x3231564e) on A.0 with 2/4 solid colors tested (0xC)
Testing format NV12(0x3231564e) / modifier linear(0x0) (ITU-R BT.709 YCbCr, YCbCr limited range) on A.0
(kms_plane:403) WARNING: CRC mismatches with format NV12(0x3231564e) on A.0 with 3/4 solid colors tested (0xD)
Testing format NV12(0x3231564e) / modifier linear(0x0) (ITU-R BT.709 YCbCr, YCbCr full range) on A.0
(kms_plane:403) WARNING: CRC mismatches with format NV12(0x3231564e) on A.0 with 2/4 solid colors tested (0xC)
Testing format NV12(0x3231564e) / modifier linear(0x0) (ITU-R BT.2020 YCbCr, YCbCr limited range) on A.0
(kms_plane:403) WARNING: CRC mismatches with format NV12(0x3231564e) on A.0 with 3/4 solid colors tested (0xD)
Testing format NV12(0x3231564e) / modifier linear(0x0) (ITU-R BT.2020 YCbCr, YCbCr full range) on A.0
(kms_plane:403) WARNING: CRC mismatches with format NV12(0x3231564e) on A.0 with 2/4 solid colors tested (0xC)
The subtest pixel-format from kms_plane tests all color formats supported by
a driver. It does that by creating framebuffers filled with different colors
and an image with the same color in the userspace. After that, it checks if the
CRC of the framebuffer and the userspace image are equal.
The NV12 support described until this point doesn’t work out because of the imprecision in the YCbCr to RGB conversion. I don’t know if the conversion is intrinsically imperfect, or if the use of fixed-point operations is the culprit. All I know is that certain colors are slightly off.
Luckily the IGT guys know about this issue, the way they overcome that is by just checking the MSB bits of color values, basically rounding then. They do that by passing a Gamma Look-Up Table (LUT) to the driver. But VKMS doesn’t support that.
It is a look-up table that the index’ is the color and its value is the result color.
This is the definition of a 1D table. You can have one for all the channels, so the transformation is the same for all the color channels, or one for each channel, so you can tweak each channel specifically.

There is a more complex type of LUT, a 3D LUT. But this one I don’t fully understand and it’s not needed. All I know is that you use the three color channels at the same time for the index, one for each coordinate, and the value that you get is a color. And besides that, you have to do interpolations.

Its image representation is a pretty cube :).
It’s not that difficult, the DRM core does all the hard work.
You have to tell that the driver supports a LUT of a specific size, and the DRM places it for you inside the crtc_state.
@@ -290,6 +290,9 @@ int vkms_crtc_init(struct drm_device *dev, struct drm_crtc *crtc,
drm_crtc_helper_add(crtc, &vkms_crtc_helper_funcs);
+ drm_mode_crtc_set_gamma_size(crtc, VKMS_LUT_SIZE);
+ drm_crtc_enable_color_mgmt(crtc, 0, false, VKMS_LUT_SIZE);
+
spin_lock_init(&vkms_out->lock);
spin_lock_init(&vkms_out->composer_lock);
After that, you have to use it. You just have to access the framebuffer, after the transformations, and use the 1D LUT of each channel.
static void apply_lut(const struct vkms_crtc_state *crtc_state, struct line_buffer *output_buffer)
{
struct drm_color_lut *lut;
size_t lut_length;
if (!crtc_state->base.gamma_lut)
return;
lut = (struct drm_color_lut *)crtc_state->base.gamma_lut->data;
lut_length = crtc_state->base.gamma_lut->length / sizeof(*lut);
if (!lut_length)
return;
for (size_t x = 0; x < output_buffer->n_pixels; x++) {
size_t lut_r_index = output_buffer->pixels[x].r * (lut_length - 1) / 0xffff;
size_t lut_g_index = output_buffer->pixels[x].g * (lut_length - 1) / 0xffff;
size_t lut_b_index = output_buffer->pixels[x].b * (lut_length - 1) / 0xffff;
output_buffer->pixels[x].r = lut[lut_r_index].red;
output_buffer->pixels[x].g = lut[lut_g_index].green;
output_buffer->pixels[x].b = lut[lut_b_index].blue;
}
}
Now we finally have the conversion working :DDDDDDD.
[root@archlinux shared]# ./build/tests/kms_plane --run pixel-format
IGT-Version: 1.27.1-g4637d2285 (x86_64) (Linux: 6.4.0-rc1-VKMS-DEVEL+ x86_64)
Opened device: /dev/dri/card0
Starting subtest: pixel-format
Starting dynamic subtest: pipe-A-planes
Using (pipe A + Virtual-1) to run the subtest.
Testing format XR24(0x34325258) / modifier linear(0x0) on A.0
Testing format AR24(0x34325241) / modifier linear(0x0) on A.0
Testing format XR48(0x38345258) / modifier linear(0x0) on A.0
Testing format AR48(0x38345241) / modifier linear(0x0) on A.0
Testing format RG16(0x36314752) / modifier linear(0x0) on A.0
Testing format NV12(0x3231564e) / modifier linear(0x0) (ITU-R BT.601 YCbCr, YCbCr limited range) on A.0
Testing format NV12(0x3231564e) / modifier linear(0x0) (ITU-R BT.601 YCbCr, YCbCr full range) on A.0
Testing format NV12(0x3231564e) / modifier linear(0x0) (ITU-R BT.709 YCbCr, YCbCr limited range) on A.0
Testing format NV12(0x3231564e) / modifier linear(0x0) (ITU-R BT.709 YCbCr, YCbCr full range) on A.0
Testing format NV12(0x3231564e) / modifier linear(0x0) (ITU-R BT.2020 YCbCr, YCbCr limited range) on A.0
Testing format NV12(0x3231564e) / modifier linear(0x0) (ITU-R BT.2020 YCbCr, YCbCr full range) on A.0
Testing format AR24(0x34325241) / modifier linear(0x0) on A.1
Testing format XR24(0x34325258) / modifier linear(0x0) on A.1
Testing format XR48(0x38345258) / modifier linear(0x0) on A.1
Testing format AR48(0x38345241) / modifier linear(0x0) on A.1
Testing format RG16(0x36314752) / modifier linear(0x0) on A.1
Testing format NV12(0x3231564e) / modifier linear(0x0) (ITU-R BT.601 YCbCr, YCbCr limited range) on A.1
Testing format NV12(0x3231564e) / modifier linear(0x0) (ITU-R BT.601 YCbCr, YCbCr full range) on A.1
Testing format NV12(0x3231564e) / modifier linear(0x0) (ITU-R BT.709 YCbCr, YCbCr limited range) on A.1
Testing format NV12(0x3231564e) / modifier linear(0x0) (ITU-R BT.709 YCbCr, YCbCr full range) on A.1
Testing format NV12(0x3231564e) / modifier linear(0x0) (ITU-R BT.2020 YCbCr, YCbCr limited range) on A.1
Testing format NV12(0x3231564e) / modifier linear(0x0) (ITU-R BT.2020 YCbCr, YCbCr full range) on A.1
Dynamic subtest pipe-A-planes: SUCCESS (2.586s)
Subtest pixel-format: SUCCESS (2.586s)
The pixel-format-clamped is still not passing, I still haven’t got the time to tackle that. I hope that all will be done after this is solved.
After this, it will be very easy to add another YCbCr format, it is just a matter of getting the color values in the way that the format stores.
The world tried to stop me, but I pushed through and got some work done on wlroots.
Unfortunately aside from programming I have a physical form that I have to take care of. Recently this form's hands have been hurting because of strain from typing, which is pretty rude of them, so I took some time out to build a keyboard. Woe is me, having to spend my time soldering and cutting. I had a great time. After finishing the keyboard, all of a sudden it was time to move out, so we packed and shuffled and drove and unpacked and complained about the trains in this country. Then I slept, and yesterday I finally found the time to get back to work. This is why there was no update last week.
I've decided that I need consistent terms for the subtasks I have. So:
measurement is reading how long a frame takes to render, prediction is
coming up with a number of milliseconds, and scheduling is delaying by that
number and all the API funk therein. It turns out that Simon was right and there
is a lot of API funk for me to deal with. We've had some discussions about what
exactly wlr_output.events.frame is for and whether we really need it,
and I've implemented a scheduling mechanism based on that signal
to show how we can make good use of it.
Bringing together this scheduling, the timer from the first status report for measurement, and some naive prediction (next frame takes at most 1ms more than last frame) christens tinywl as the first user of all this mess. Here are some screenshots from GPUVis showing the old, boring timeline where we rendered immediately after the last present, and the new, sexy timeline where we render immediately before the next one! The purple lines are vblanks, and the boxes in each row represent durations when those processes were running on the CPU. I think blue lines are GPU submissions, but the naming is unclear.
Wow, isn't that beautiful.
I didn’t go into much detail in my last post about all the huuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuge optimizations I’ve been working on (they’re big) with code that totally exists and isn’t made up. The reason was that I didn’t feel like it.
But today’s a different day, and I haven’t done any real blogging in a while. Does that mean…
One of the big perf sinks in ET profiling was fragment shader epilog handling.
Yeah, that’s the stuff right there in the middle. The lookup is slow, the key generation is slow, and I hate all of it and we already know I’ve taken out the hammer, so let’s skip past more of the profiling details onto some code.
RADV has the same mechanism for prologs and epilogs:
In code, it looks like this:
struct radv_ps_epilog_key key = radv_generate_ps_epilog_key(device, &state, true);
uint32_t hash = radv_hash_ps_epilog(&key);
u_rwlock_rdlock(&device->ps_epilogs_lock);
struct hash_entry *epilog_entry =
_mesa_hash_table_search_pre_hashed(device->ps_epilogs, hash, &key);
u_rwlock_rdunlock(&device->ps_epilogs_lock);
if (!epilog_entry) {
u_rwlock_wrlock(&device->ps_epilogs_lock);
epilog_entry = _mesa_hash_table_search_pre_hashed(device->ps_epilogs, hash, &key);
if (epilog_entry) {
u_rwlock_wrunlock(&device->ps_epilogs_lock);
return epilog_entry->data;
}
epilog = radv_create_ps_epilog(device, &key, NULL);
struct radv_ps_epilog_key *key2 = malloc(sizeof(*key2));
if (!epilog || !key2) {
radv_shader_part_unref(device, epilog);
free(key2);
u_rwlock_wrunlock(&device->ps_epilogs_lock);
return NULL;
}
memcpy(key2, &key, sizeof(*key2));
_mesa_hash_table_insert_pre_hashed(device->ps_epilogs, hash, key2, epilog);
u_rwlock_wrunlock(&device->ps_epilogs_lock);
return epilog;
}
return epilog_entry->data;
Since all my readers are performance experts, everyone can see the issues.
Literally everyone.
Yeah, so this isn’t great. There’s a number of factors here that I’m going to go over for posterity, so I’m expecting that nobody will read the following section because all you experts already know what I’m about to say.
Why do a lookup and then an insertion? Both of these operations perform a lookup, so it’s just two lookups in the worst case scenario. It’s already well-known that Mesa’s hash/set lookups are unusually slow (don’t @ me), so this should be avoided.
The Mesa hash table API doesn’t have a search_or_insert function because of collective laziness, which means it all needs to be converted into a set, which does have the API. And it’s more convenient to work with.
Locking is bad. Nobody wants to lock. Why does this have locking? Obviously it needs locking because the data is stored on the device, and the device can be used between threads without restriction, blah, blah, blah, but what if things just didn’t work like that?
Instead of always accessing everything on the device, instead access objects that are thread local. Like command buffers. But nobody wants to be duplicating all the epilog generation since that’s expensive, so this is a case where a two-tiered hash table is optimal:
Blam, the locking is gone.
The prolog handling has a mechanism where the last-used prolog is stored to the cmdbuf for immediate reuse if the same data is passed, so copying that over to the epilog handling can improve perf in some corner cases.
If I had to give a number, probably at least 1000% faster. Just as a rough estimate.
It sure does, but only zink uses any of it, so who cares.
MR is up here, probably to be merged sometime in 2030 depending on the lunar cycle and developer horoscopes permitting reviews.
Hi!
This month Rose Hudson has started working on wlroots as part of Google Summer of Code! She will focus on reducing frame latency by implementing an adaptive frame scheduler. She already has landed new wlroots APIs to measure render time. You can follow Rose’s blog if you’re interested.
Other wlroots contributors have been hard at work too. Alexander has
implemented the new render pass API for GLES2, and while at it he’s
significantly improved its performance. I hope this will help with weak SoCs
and power usage. A big refactoring from vyivel has been merged to
unify the map/unmap logic across all shells. I’ve moved some of the cursor
logic over from wlr_output to wlr_cursor (with the eventual goal of
simplifying wlr_output and removing most of the cursor-specific logic). And
we’ve all fixed a whole bunch of bugs!
The NPotM is lodns. It’s a
simple local DNS resolver which forwards queries to a DNS-over-TLS or
DNS-over-HTTPS server. It’s somewhat similar to systemd-resolved. Still
missing are a way to forward local hostnames to the DNS resolver advertised via
DHCP and support for /etc/hosts.
As usual, I’ve made small improvements to various projects. I’ve added a fast
tab switcher for gamja: press Ctrl+k, type a few letters to filter the
channels, and press enter to switch. I’ve contributed to the upcoming
IRCv3 message redaction extension and implemented it in goguma. kanshi
has gained a kanshictl switch command to manually switch to another profile.
go-oauth2 now supports dynamic client registration. gyosu generates
documentation for typedef. And more! But that’s enough for today, see you
next month!
Yes, you heard that right.
Ray Tracing Pipelines.
On RADV.
Enabled by default.
Now merged in Mesa main.
This has been in the works for a loooooooooong time. Probably the longest of any RADV features so far.
But what makes ray tracing pipelines so complex that it takes this long to implement? Let’s take a short look at what it took for RADV to get its implementation off the ground.
For the purposes of this blog, ray tracing is the process of finding intersections between rays and some geometry.
Most of the time, this geometry will be made up of lots of triangles. We don’t want to test every single triangle for intersection separately, so Bounding Volume Hierarchies (BVHs) are used to speed up the process by skipping entire groups of triangles at once.
Nowadays, GPUs have dedicated hardware to speed up the ray tracing process.
AMD’s hardware acceleration for ray tracing is very simple: It consists of a single instruction called image_bvh_intersect_ray (and its 64-bit variant).1
Why is it called image_bvh_intersect_ray? Because the hardware sees the BVH as a 1D image and uses its memory subsystem for textures to fetch BVH data, of course.
This instruction takes care of calculating intersections between a ray and a single node in the BVH. But intersecting one node isn’t good enough: In order to find actual intersections between the ray and geometry, we need to traverse the BVH and check lots of nodes. The traversal loop that accomplishes this is implemented in software2.
In Vulkan, you can use ray tracing pipelines to utilize your GPU’s hardware-accelerated ray tracing capabilities. It might not seem like it, but ray tracing pipelines actually bring a whole lot of new features with them that make them quite complex to implement.
Ray tracing pipelines introduce a set of new shader stages:
traceRayEXT to start tracingThat’s right, as a small side effect, ray tracing pipelines also introduced full proper recursion from shaders. This doesn’t just apply to callable shaders: You can also trace new rays from a closest-hit shader, which can recursively invoke more closest-hit shaders, etc.
Also, ray tracing pipelines introduce a very dynamic, GPU-driven shader dispatch process: In traditional graphics and compute pipelines, once you bind a pipeline, you know exactly which shaders are going to execute once you do a draw or dispatch. In ray tracing pipelines, this depends on something called the Shader Binding Table, which is a piece of memory containing so-called “shader handles”. These shader handles identify the shader that is actually launched when vkCmdTraceRaysKHR is called.
In both graphics and compute pipelines, the concept of pipeline stages was quite simple: You have a bunch of shader stages (for graphics pipelines, it’s usually vertex and fragment, for compute pipelines it’s just compute). Each stage has exactly one shader: You don’t have one graphics pipeline with many vertex shaders. In ray tracing pipelines, there are no restrictions on how many shaders can exist for each stage.
In RT pipelines, there is also the concept of shaders dispatching other shaders: Every time traceRayEXT is called, more shaders (any-hit, intersection, closest-hit or miss shaders)
are launched.
That’s lots of changes just for some ray tracing!
RT pipelines aren’t really a fitting representation of AMD hardware. There is no such thing as reading a memory location to determine which shader to launch, and the hardware has no concept of a callstack to implement recursion. RADV therefore has to do a bit of magic to transform RT pipelines in a way that will actually run.
The first approach RADV used to implement these ray tracing pipelines was essentially to pretend that the whole ray tracing pipelines a normal compute shader:
All shaders from the pipeline are assigned a unique ID. Then, all shaders are inserted into a humongous chain of if (idx == shader_id) { (paste shader code here) } statements.
If you wanted to call a shader, it was as simple as setting idx to the ID of the shader you wanted to call. You could even implement recursion by storing the ID of the shader
to return to on a call stack.
Launching shaders according to the shader binding table wasn’t a problem either: You just read the shader binding table at the start and set idx to whatever value is in there.
But there was a problem.
As it turns out, if you don’t put any restrictions on how many shaders can exist in a stage, there’s going to be apps that use LOTS of them. We’re talking almost a thousand shaders in some cases. Ludicrously large code like that resulted in lots of ludicrous results (games spending over half an hour compiling shaders!). Clearly, the megashader solution wasn’t sustainable.
Ray Tracing Pipelines also add pipeline libraries. You might have heard of them in the context of Graphics Pipeline Libraries, which was also really painful to implement in RADV.
Pipeline libraries essentially allow you to create parts of your ray tracing pipeline beforehand, and then re-use these created parts all over other ray tracing pipelines. But if we just paste all shaders into one chonker compute shader, we can’t compile it yet when creating a pipeline library, because other shaders will be added once a real pipeline is created from it!
This basically meant that we couldn’t do anything but copy the source code around, and start compiling only when the real pipeline is created. It also turned out that it’s valid behaviour to query the stack size used for recursion from pipeline libraries, but because RADV didn’t compile any code yet, it didn’t even know what stack size the shaders from that pipeline used.
This is where separate shader compilation comes in. As the name suggests, most3 shaders are compiled independently. Instead of using shader IDs to select what shader is called, we store the VRAM addresses of the shaders and directly jump to whatever shaders we want to execute next.
Directly jumping to a shader is still impossible because reading the shader binding table is required. Instead, RADV creates a small piece of shader assembly that sets up necessary parameters, reads the shader binding table, and then directly jumps to the selected shader (like it is done for shader calls).
This allows us to compile shaders immediately when creating pipeline libraries. It also pretty much resolves the problem of chonker compute shaders taking ludicrously long to compile. It also required basically reworking the entire ray tracing compilation infrastructure, but I think it forms a great basis for future work in the performance area.
Everything runs.
In case you disagree, please open an issue.
Pretty competitive with AMDVLK/the AMD Windows drivers! You’ll generally see similar, if not better, performance on RADV.
Not well (expect significantly less performance compared to AMDVLK/Windows drivers). This is being worked on.
RDNA3 introduces another instruction that helps with BVH traversal stack management, but RADV doesn’t use it yet. ↩
This is also what makes it so easy to support ray tracing even when there is no hardware acceleration (using RADV_PERFTEST=emulate_rt): Most of the traversal code can be reused, only image_bvh_intersect_ray needs to be replaced with a software equivalent. ↩
Any-hit and Intersection shaders are still combined into a single traversal shader. This still shows some of the disadvantages of the combined shader method, but generally compile times aren’t that ludicrous anymore. ↩
No time for an intro because it’s time to go fast and when you’re going this fast you don’t have time to explain things or catch anyone up on what’s happening or edit out tpyos or even use punctuation let’s gooooooooooooooooooooo
Everyone knew as soon as the latest P-score update came out that this post was coming. There were a couple items that immediately warranted deeper inspection:
No time for memes because there is a bomb attached to my keyboard that will explode if my WPM drops below 50, and typing out both of those links above from memory already came dangerously close to blowing up zink’s bus factor.
Where to look first? Obviously it’s gotta be that first one first, which means it’s Bioshock time.
I remember playing this game when it came out, and I remember it being fine. Benchmarking is about the same except it’s way harder to actually get into the benchmark mode for no obvious reason. According to Steam, the last time I ran this game for work was January of 2022; at that time, I remember having a weird bug where the rendering only progressed when I moved the mouse. Does that bug still exist?
Thankfully no, but the FPS is indeed low. Lower than I remembered. That’s good, because it meant I could bisect. And what did I find? Obviously it was a regression from this MR, which was causing some truly abysmal FPS in certain scenes (It turns out that doing a full GPU sync every time a query is used yields bad perf).
Specifically, the memory type check was broken. But great news: I “quietly” fixed this last week and nobody noticed.
Further benchmarking here yielded nothing actionable. On RadeonSI I was unable to repro anything remotely approaching the figures above—at most, I was able to get about half the claimed FPS. When switching to the Proton+DXVK mode, however, that’s around the performance I see.
I’m still not happy with the perf here, but I don’t think there’s much to be done at present; in my local testing I’m seeing an 8-9% perf gap, and it looks like some weird RADV corner case that I can’t examine further because RGP dumping is broken.
This is another esoteric benchmark that I had to pull the actual cmdline out of ps while it was running because who even comes up with these things? Just give me a simple benchmark.sh script or something. I don’t wanna have to figure out ./etl.x86_64 '+exec preset_high +set ui_r_mode -1 +set r_fullscreen 0 +set r_mode -1 +timedemo 1 +set demodone quit +set demoloop1 demo etlegacy281-pts1; set nextdemo vstr demodone +vstr demoloop1 +vid_restart +set r_customwidth 1920 +set r_customheight 1080' 2>&1|grep fps.
Given the extremely high FPS here, it’s obvious that this benchmark is CPU-bound, and so it makes some sense as to why zink would have lower perf: a layered implementation using the same API is not going to match up with a native implementation in cases like this. But is the gap necessarily so large?
Initial results looked like this for me:
| GPU | FPS (RadeonSI) | FPS (zink) |
|---|---|---|
| RX5700 (primary) | 670 | 520 |
| 7900XTX (secondary) | 650 | 545 |
Clearly these are very different results than the above charts, but that’s benchmarking in a nutshell. Also weird is that radeonsi gets slower on the 7900 while zink+radv is faster? I don’t understand how sand thinks, but surely this makes sense to it.
Let’s see what’s going on with perf:
And, just at a glance, the problem area:
Well that’s not great.
Taking out a magnifying glass, this resolves into some more definite problem areas:
I banged out some epilog fixes, and then I did the same for VS prologs, and it’s great, and it’ll be up in MRs soon™. Huge perf gains. Absolutely massive. But only for zink, because nothing else uses any of this.
Where does that leave things in perf (zoomed for readability)?
And the problems are…
Some of these are easily hammered out, and then there’s blend state emission. I tried a bunch of things, aaaaaaand it’s hard.
But progress was made, and suddenly now it’s like:
Which has a few weird things:
I haven’t been able to make too much headway on any of these, but they’re consistent in consuming a fair bit of CPU.
Everyone knows that these results are going to be better, but how much better is the question.
It’s big, but it’s not that big.
| GPU | FPS (RadeonSI) | FPS (zink) |
|---|---|---|
| RX5700 (primary) | 670 | 580 |
| 7900XTX (secondary) | 650 | 545 |
I’m not sure what’s going on with my 7900, but I’m gonna assume there’s some weird quirk going on as a result of being a secondary GPU.
Overall, it amounts to roughly a 10% improvement, leaving the perf gap at around 14% between the two drivers. That’s not ideal, but it’s not terrible.
In the previous update I explained that the programmable core in this NPU (VIPNano-QI) is too slow to run inference workloads substantially faster than the CPUs. The vendor stack achieves acceptable inference rates by running most of the work on fixed-function units that can perform different kinds of convolutions and transformations of tensors.
Most of the work is done by the convolution units that VeriSilicon calls NN cores, so this is what I have been focusing on at this stage. I think that even if we still do all tensor transformation on the programmable core, by using the NN units we could already achieve usable performance.
By looking around in the ioctls that VeriSilicon's userspace stack sends to the kernel, it was clear that in the NN jobs there was little more than a pointer to a structure that configures the NN fixed-function units. Luckily I didn't need to reverse engineer it from zero, as VeriSilicon's out-of-tree kernel driver is GPL and contains two instances of programming this HW with a trivial job (a 2x2x1 kernel with a single bias value).
Took some boring work to translate what the code does to a C struct, but this was the initial one:
struct etna_nn_params {
uint32_t op_type : 1; /* conv: 0 fully_connected: 1 */
uint32_t no_z_offset : 1;
uint32_t kernel_x_size : 4;
uint32_t kernel_z_size : 14; /* & 0x3FFF */
uint32_t kernels_per_core : 7;
uint32_t zero1 : 2;
uint32_t zero2 : 1;
uint32_t zero3 : 1;
uint32_t nn_layer_flush : 1;
uint32_t kernel_data_type : 2; /* UINT8 0x2 INT8 0x0 */
uint32_t in_image_data_type : 2; /* UINT8 0x2 INT8 0x0 */
uint32_t out_image_data_type : 2; /* UINT8 0x2 INT8 0x0 */
uint32_t in_image_x_size : 13;
uint32_t in_image_y_size : 13;
uint32_t zero4 : 3;
uint32_t zero5 : 3;
uint32_t unused0 : 1;
uint32_t zero6 : 16;
uint32_t zero7 : 1;
uint32_t enable_relu : 1;
uint32_t zero9 : 1;
uint32_t post_shift : 6;
uint32_t unused1 : 2;
uint32_t zero10 : 1;
uint32_t zero11 : 1;
uint32_t unused2 : 2;
uint32_t out_image_x_size : 13;
uint32_t out_image_y_size : 13;
uint32_t out_image_z_size : 14;
uint32_t zero12 : 2; /* 0x0 */
uint32_t zero13 : 1; /* (0 >> 3) & 0x1 */
uint32_t zero14 : 1; /* (0 >> 3) & 0x1 */
uint32_t unk0 : 7; /* 1 */
uint32_t unk1 : 7; /* 1 */
uint32_t kernel_address : 26; /* >> 6 */
uint32_t kernel_z_size2 : 6; /* >> 14 */
uint32_t in_image_address;
uint32_t out_image_address;
uint32_t unused3 : 12;
uint32_t kernel_y_size : 4;
uint32_t out_image_y_size2 : 16; /* maybe stride? */
uint32_t zero15;
uint32_t zero16;
uint32_t zero17;
uint32_t kernel_cache_end_address;
uint32_t zero19;
uint32_t image_end_address;
uint32_t zero20 : 2;
uint32_t zero21 : 16;
uint32_t kernel_data_type_bit_2 : 1;
uint32_t in_image_data_type_bit_2 : 1;
uint32_t out_image_data_type_bit_2 : 1;
uint32_t zero22 : 6;
uint32_t post_shift_bit_5_6 : 2;
uint32_t unused4 : 3;
uint32_t in_image_stride : 16;
uint32_t in_image_y_size2 : 16; /* again? */
uint32_t out_image_stride : 16;
uint32_t unused5 : 8;
uint32_t zero23 : 8;
uint32_t zero24 : 26; /* 0 >> 6 */
uint32_t zero25 : 1;
uint32_t zero26 : 1;
uint32_t zero27 : 1; /* 0 >> 4 */
uint32_t zero28 : 1; /* 0 >> 4 */
uint32_t zero29 : 1;
uint32_t kernel_data_type_bit_3 : 1;
uint32_t unk2 : 26; /* 0xFFFFFFFF >> 6 */
uint32_t unused6 : 4;
uint32_t zero30 : 1;
uint32_t in_image_data_type_bit_3 : 1;
uint32_t zero31 : 26; /* 0 >> 6 */
uint32_t out_image_data_type_bit_3 : 1;
uint32_t unused7 : 6;
uint32_t unk3 : 26; /* 0xFFFFFFFF >> 6 */
uint32_t unused8 : 6;
uint32_t coef_zero_point : 8;
uint32_t out_zero_point : 8;
uint32_t zero32 : 1;
uint32_t zero33 : 1;
uint32_t zero34 : 8;
uint32_t unused9 : 6;
uint32_t zero35;
uint32_t zero36 : 4;
uint32_t zero37 : 28; /* 0 >> 4 */
uint32_t zero38 : 4;
uint32_t zero39 : 28; /* 0 >> 4 */
uint32_t further1;
uint32_t further2;
uint32_t further3;
uint32_t further4;
uint32_t further5;
uint32_t further6;
uint32_t further7;
uint32_t further8;
};
As you can see there are a lot of "zero" and "unused" fields, most of them I think will be actually used for something as HW engineers don't tend to like wasting bits. By adding instrumentation for dumping these structs to the reverse engineering tooling, I will be making myself a better idea of what each field means and does.
I got GPU hangs the first time that I submitted a job with the same configuration as the kernel's trivial reset job, and looking further showed that the buffer that contains the convolution filters must follow a specific format.
By looking again at the kernel driver sources, I used the same kernel/filter buffer and the GPU didn't hang anymore. That kernel was all zeroes as the weights, and indeed my output buffer was now full of zeroes.
Then I tried to put my weights into the format that I inferred from the kernel driver source code, but I wasn't able to get any job to run to completion without hangs, and the output buffer was unchanged.
To figure out what I was missing about how the weights (and the biases) need to be placed in the buffer, I added code to the reverse engineering tooling to dump the weights buffer. With that buffer and after playing some with the sizes of the output, input and kernel buffers, I finally got a job to run with non-zero weights.
What I am doing right now is slowly zeroing out the weights buffer to figure out what are data bits, what are control and what effect the changes have in the output.
Hope that by the next update I will have documented the format of the weights buffer and will be able to run at least one kind of convolution!
After what was basically a flurry of typing, the snegg Python bindings for libei are now available. This is a Python package that provides bindings to the libei/libeis/liboeffis C libraries with a little bit of API improvement to make it not completely terrible. The main goal of these bindings (at least for now) is to provide some quick and easy way to experiment with what could possibly be done using libei - both server-side and client-side. [1] The examples directory has a minimal EI client (with portal support via liboeffis) and a minimal EIS implementation. The bindings are still quite rough and the API is nowhere near stable.
A proper way to support EI in Python would be to implement the protocol directly - there's no need for the C API quirkiness this way and you can make full use of things like async and whatnot. If you're interested in that, get in touch! Meanwhile, writing something roughly resemling xdotool is probably only a few hundred lines of python code. [2]
[1] writing these also exposed a few bugs in libei itself so I'm happy 1.0 wasn't out just yet
[2] at least the input emulation parts of xdotool
Upgrade your Asahi Linux systems, because your graphics drivers are getting a big boost: leapfrogging from OpenGL 2.1 over OpenGL 3.0 up to OpenGL 3.1! Similarly, the OpenGL ES 2.0 support is bumping up to OpenGL ES 3.0. That means more playable games and more functioning applications.
Back in December, I teased an early screenshot of SuperTuxKart’s deferred renderer working on Asahi, using OpenGL ES 3.0 features like multiple render targets and instancing. Now you too can enjoy SuperTuxKart with advanced lighting the way it’s meant to be:
SuperTuxKart rendering with advanced lighting
As before, these drivers are experimental and not yet conformant to the OpenGL or OpenGL ES specifications. For now, you’ll need to run our -edge packages to opt-in to the work-in-progress drivers, understanding that there may be bugs. Please refer to our previous post explaining how to install the drivers and how to report bugs to help us improve.
With that disclaimer out of the way, there’s a LOT of new functionality packed into OpenGL 3.0, 3.1, and OpenGL ES 3.0 to make this release. Highlights include:
For now, let’s talk about…
Vulkan and OpenGL support multisampling, short for multisampled anti-aliasing. In graphics, aliasing causes jagged diagonal edges due to rendering at insufficient resolution. One solution to aliasing is rendering at higher resolutions and scaling down. Edges will be blurred, not jagged, which looks better. Multisampling is an efficient implementation of that idea.
A multisampled image contains multiple samples for every pixel. After rendering, a multisampled image is resolved to a regular image with one sample per pixel, typically by averaging the samples within a pixel.
Apple GPUs support multisampled images and framebuffers. There’s quite a bit of typing to plumb the programmer’s view of multisampling into the form understood by the hardware, but there’s no fundamental incompatibility.
The trouble comes with sample shading. Recall that in modern graphics, the colour of each fragment is determined by running a fragment shader given by the programmer. If the fragments are pixels, then each sample within that pixel gets the same colour. Running the fragment shader once per pixel still benefits from multisampling thanks to higher quality rasterization, but it’s not as good as actually rendering at a higher resolution. If instead the fragments are samples, each sample gets a unique colour, equivalent to rendering at a higher resolution (supersampling). In Vulkan and OpenGL, fragment shaders generally run per-pixel, but with “sample shading”, the application can force the fragment shader to run per-sample.
How does sample shading work from the drivers’ perspective? On a typical GPU, it is simple: the driver compiles a fragment shader that calculates the colour of a single sample, and sets a hardware bit to execute it per-sample instead of per-pixel. There is only one bit of state associated with sample shading. The hardware will execute the fragment shader multiple times per pixel, writing out pixel colours independently.
Easy, right?
Alas, Apple’s “AGX” GPU is not typical.
AGX always executes the shader once per pixel, not once per sample, like older GPUs that did not support sample shading. AGX does support it, though.
How? The AGX instruction set allows pixel shaders to output different colours to each sample. The instruction used to output a colour1 takes a set of samples to modify, encoded as a bit mask. The default all-1’s mask writes the same value to all samples in a pixel, but a mask setting a single bit will write only the single corresponding sample.
This design is unusual, and it requires driver backflips to translate “fragment shaders” into hardware pixel shaders. How do we do it?
Physically, the hardware executes our shader once per pixel. Logically, we’re supposed to execute the application’s fragment shader once per sample. If we know the number of samples per pixel, then we can wrap the application’s shader in a loop over each sample. So, if the original fragment shader is:
interpolated colour = interpolate at current sample(input colour);
output current sample(interpolated colour);then we will transform the program to the pixel shader:
for (sample = 0; sample < number of samples; ++sample) {
sample mask = (1 << sample);
interpolated colour = interpolate at sample(input colour, sample);
output samples(sample mask, interpolated colour);
}The original fragment shader runs inside the loop, once per sample. Whenever it interpolates inputs at the current sample position, we change it to instead interpolate at a specific sample given by the loop counter sample. Likewise, when it outputs a colour for a sample, we change it to output the colour to the single sample given by the loop counter.
If the story ended here, this mechanism would be silly. Adding sample masks to the instruction set is more complicated than a single bit to invoke the shader multiple times, as other GPUs do. Even Apple’s own Metal driver has to implement this dance, because Metal has a similar approach to sample shading as OpenGL and Vulkan. With all this extra complexity, is there a benefit?
If we generated that loop at the end, maybe not. But if we know at compile-time that sample shading is used, we can run our full optimizer on this sample loop. If there is an expression that is the same for all samples in a pixel, it can be hoisted out of the loop.2 Instead of calculating the same value multiple times, as other GPUs do, the value can be calculated just once and reused for each sample. Although it complicates the driver, this approach to sample shading isn’t Apple cutting corners. If we slapped on the loop at the end and did no optimizations, the resulting code would be comparable to what other GPUs execute in hardware. There might be slight differences from spawning fewer threads but executing more control flow instructions3, but that’s minor. Generating the loop early and running the optimizer enables better performance than possible on other GPUs.
So is the mechanism only an optimization? Did Apple stumble on a better approach to sample shading that other GPUs should adopt? I wouldn’t be so sure.
Let’s pull the curtain back. AGX has its roots as a mobile GPU intended for iPhones, with significant PowerVR heritage. Even if it powers Mac Pros today, the mobile legacy means AGX prefers software implementations of many features that desktop GPUs implement with dedicated hardware.
Yes, I’m talking about blending.
Blending is an operation in graphics APIs to combine the fragment shader output colour with the existing colour in the framebuffer. It is usually used to implement alpha blending, to let the background poke through translucent objects.
When multisampling is used without sample shading, although the fragment shader only runs once per pixel, blending happens per-sample. Even if the fragment shader outputs the same colour to each sample, if the framebuffer already had different colours in different samples, blending needs to happen per-sample to avoid losing that information already in the framebuffer.
A traditional desktop GPU blends with dedicated hardware. In the mobile space, there’s a mix of dedicated hardware and software. On AGX, blending is purely software. Rather than configure blending hardware, the driver must produce variants of the fragment shader that include instructions to implement the desired blend mode. With alpha blending, a fragment shader like:
becomes:
colour = calculate lighting();
dest = load destination colour;
alpha = colour.alpha;
blended = (alpha * colour) + ((1 - alpha) * dest));
output(blended);Where’s the problem?
Blending happens per sample. Even if the application intends to run the fragment shader per pixel, the shader must run per sample for correct blending. Compared to other GPUs, this approach to blending would regress performance when blending and multisampling are enabled but sample shading is not.
On the other hand, exposing multisample pixel shaders to the driver solves the problem neatly. If both the blending and the multisample state are known, we can first insert instructions for blending, and then wrap with the sample loop. The above program would then become:
for (sample = 0; sample < number of samples; ++sample_id) {
colour = calculate lighting();
dest = load destination colour at sample (sample);
alpha = colour.alpha;
blended = (alpha * colour) + ((1 - alpha) * dest);
sample mask = (1 << sample);
output samples(sample_mask, blended);
}In this form, the fragment shader is asymptotically worse than the application wanted: the fragment shader is executed inside the loop, running per-sample unnecessarily.
Have no fear, the optimizer is here. Since colour is the same for each sample in the pixel, it does not depend on the sample ID. The compiler can move the entire original fragment shader (and related expressions) out of the per-sample loop:
colour = calculate lighting();
alpha = colour.alpha;
inv_alpha = 1 - alpha;
colour_alpha = alpha * colour;
for (sample = 0; sample < number of samples; ++sample_id) {
dest = load destination colour at sample (sample);
blended = colour_alpha + (inv_alpha * dest);
sample mask = (1 << sample);
output samples(sample_mask, blended);
}Now blending happens per sample but the application’s fragment shader runs just once, matching the performance characteristics of traditional GPUs. Even better, all of this happens without any special work from the compiler. There’s no magic multisampling optimization happening here: it’s just a loop.
By the way, what do we do if we don’t know the blending and multisample state at compile-time? Hope is not lost…
…but that’s a story for another day.
While OpenGL ES 3.0 is an improvement over ES 2.0, we’re not done. In my work-in-progress branch, OpenGL ES 3.1 support is nearly finished, which will unlock compute shaders.
The final goal is a Vulkan driver running modern games. We’re a while away, but the baseline Vulkan 1.0 requirements parallel OpenGL ES 3.1, so our work translates to Vulkan. For example, the multisampling compiler passes described above are common code between the drivers. We’ve tested them against OpenGL, and now they’re ready to go for Vulkan.
And yes, the team is already working on Vulkan.
Until then, you’re one pacman -Syu away from enjoying OpenGL 3.1!
Store a formatted value to local memory acting as a tilebuffer.↩︎
Via common subexpression elimination if the loop is unrolled, otherwise via code motion.↩︎
Since the number of samples is constant, all threads branch in the same direction so the usual “GPUs are bad at branching” advice does not apply.↩︎
.
Yes, we’ve just skipped over the third anniversary of SGC and the blog. Time flies when you’re making triangles. I had some plans, and I was gonna make some posts, and then I didn’t, so here we are. Great.
There’s been a lot of things in motion over the past however long it’s been since I blogged. Exciting things. Big things. Not like Half Life 3 big, my lawyer is advising me to state for posterity, but you know. Big.
Blogging’s not like riding a bike. You gotta ease back into it after you take some time off. Just ease in…
I’ve been monitoring the issues everyone’s been reporting for GAMES THAT DON’T RENDER GOOD, and before I got sucked into doing this (huge) thing that I’m totally gonna talk about at some point, I was actually organizing some of the reports into tables and such. And investigating them. And one of the ones I was looking at was some broken sampling in Wolfenstein:
Some weird grid lines being rendered there. And so I took out my trusty chainsaw renderdoc, and I looked at the shaders, and, well, I’m not about to bore you with the details, but they were totally unreadable. It turns out that all the “free” bitcasting zink does during SPIRV translation isn’t actually free when you gotta read it.
Fine, you twisted my arm. Here’s some details:
Yeah that’s the good stuff. Ideally you want a 1:1 balance between the bitcasts and the shader ops, so you can see here that I’m really doing a great job. Shoutout to SPIRV-Cross for heroically putting up with it.
So I saw this, and I stared at it a while, and I wasn’t getting anywhere, so I kept staring at it in case it was one of those magic eye puzzles, but I’m pretty convinced now, a month later, that it’s not a magic eye puzzle, so probably I wasted some time, but you can never be too sure with those things, which is why you gotta check just in case.
If it’s not a magic eye puzzle, however, what’s going on here?
This goes back to one of the oldest ideas in zink, namely that when translating from NIR, all values should be stored as uint for consistency. This is great in terms of usability at the compiler level: when you use a result from op A later on, you know it’s a uint, so if op B takes a float param, you blindly cast the param from uint to float and it’s correct every time. But then also for every single op you’re adding in extra bitcasts. And in the case where op A has a result type of float and op B has an input type of float, that means two bitcasts to reuse one SSA value.
The smarter solution, obviously, is just to track the type of each value and then only bitcast when absolutely necessary. Like I did in this merge request which landed last month.
Which means it’s all fixed now, doesn’t it.
And the shaders are totally readable by humans.
And it probably fixed some random bugs in the process.
And you can already see where this is heading, can’t you.
There’s still a couple bitcasts, sure, and they’re annoying, yeah, and I should probably get rid of them, totally, but ultimately this is still not readable. But it’s better? Probably?
And since zink is now usable to play most games, this means it should be much easier for people (other than me) to renderdoc their way to fixing shader-related bugs (so I don’t have to).
If they’re so inclined (to help me out).
Anyway great post everyone, we broke the no-blogging streak. Solid hustle. Let’s wrap it up and hit the gym.
The first week of GSoC is over! I'm working on presentation scheduling in wlroots. Here's how it's going.
Currently in wlroots, the easy and obvious frame schedule for compositors to implement is one where Wayland clients and the compositor are both told to render a new frame as soon as the last one makes it to the screen. This means that clients are almost guaranteed to miss the deadline for this cycle, and their frame submissions will only make it into the compositor's renderer in the next cycle. With this schedule, there are two frames of latency between a client receiving some user input and the new content making it to the screen.
If the compositor starts rendering later then clients are more likely to have submitted the frame that was most recently triggered. The tradeoff is that the compositor is less likely to hit its own deadline. It's like the computer is playing The Price Is Right for every frame, and I'm in control of how it guesses.
I'm trying to make a smarter frame schedule (where the computer is really good at the game) be another easy and obvious schedule for wlroots compositors to use, so the whole ecosystem can win without having to faff about implementing it themselves.
Thanks to Simon for the idea and for mentoring me.
I quickly implemented a way for compositors to delay rendering by a specified number of milliseconds, and Kenny quickly pointed out that the way I did it was silly. Since then, I have a much better understanding of what actually happens for each frame in wlroots. Thanks Kenny!
After that review and a bit more discussion I gave it another go, but didn't submit it yet because I haven't finished polishing it. While I was doing that, I was brutally reminded that the C compiler does not care about me and will not attempt to help at all. Don't forget to enable your sanitizers, reader. Arguably I deserved it, because I had recently commented on how I was enjoying writing C for a change. Don't fall into the same trap as me; C does not want you to enjoy it.
I also wrote a new API that allows you to find out how long it took to render a frame. For now, this is only implemented on the GLES2 backend. Storing this time for the last few frames is a good way to make an educated guess on the time for the next one, and by also knowing your monitor's (maximum) refresh rate you can come up with a reasonable duration to delay rendering by.
In testing the render timer I discovered that my test machine takes a whopping 4 milliseconds to paint the screen a solid colour. A valuable lesson lies here: don't ask questions if you're not strong enough to hear the answer. Or maybe "damage tracking is good".
See you next week!
As I wrote in the last update, my OpenCL branch is able to correctly run MobileNet v1 with the GPU delegate in TensorFlow-Lite, albeit much slower than with VeriSilicon's proprietary stack.
In the weeks that passed I have been investigating the performance difference, understanding better how the HW works and what could the explanation be. Inference with Etnaviv took 1200 ms, while the proprietary stack did the same in less than 10 ms (120x faster!).
When trying to understand the big performance difference I discovered that the existing reverse engineering tools that I had been using to understand how to run OpenCL workloads weren't working. They detected a single OpenCL kernel at the end of the execution, and there was no way that single kernel could be executing the whole network.
After a lots of fumbling around in the internets I stumbled upon a commit that included an interestingly-named environment variable: VIV_VX_DISABLE_TP_NN_EVIS. With it, VeriSilicon's OpenVX implementation will execute the network without using nor the TP or NN fixed-function units, nor the EVIS instruction set (which helps with reducing memory bandwith use by allowing operations on packed int8 and int16 types).
With that environment variable OpenVX was using regular OpenCL to run the inference, and the performance difference was interesting: 398.428 ms. Still much better than our time, but also more than 50 times slower than when fully using the capabilities of the hardware. The reason for this is that there is only one core in the NPU that is able to run programmable kernels. The rest are fixed-function units as I'm going to explain next.
Digging further in VeriSilicon's kernel driver and on marketing documents I gathered that this particular NPU has 8 convolution cores (they call them NN cores) and 4 cores for accelerating some tensor operations (TP cores). What these units cannot do, has to be done in the single slow programmable core.
Next step was to understand how the proprietary stack made use of the fixed function units in the NPU.
The MobileNet v1 model I used contains these operations, as output by TFLite's model analyzer:
Op#0 CONV_2D(T#88, T#6, T#4[28379, 17476, 18052, -2331, 17431, ...]) -> [T#5]
Op#1 DEPTHWISE_CONV_2D(T#5, T#33, T#32[-249, 165, 173, -2, 158, ...]) -> [T#31]
...
[12 more pairs of CONV_2D and DEPTHWISE_CONV_2D]
...
Op#27 AVERAGE_POOL_2D(T#29) -> [T#0]
Op#28 CONV_2D(T#0, T#3, T#2[-5788, -4159, 2282, -6706, -9783, ...]) -> [T#1]
Op#29 RESHAPE(T#1, T#86[-1, 1001]) -> [T#85]
Op#30 SOFTMAX(T#85) -> [T#87]
As can be seen, it is basically a bunch of convolutions with a final reshaping and a SOFTMAX operation at the end.
By using some of the environment variables that are mentioned in this issue in GitHub, we can get some information on how the proprietary stack plans the execution on the hardware:
operation_name:VXNNE_OPERATOR_TENSOR_TRANS operation_target:VXNNE_OPERATION_TARGET_TP
operation_name:VXNNE_OPERATOR_RESHUFFLE operation_target:VXNNE_OPERATION_TARGET_TP
operation_name:VXNNE_OPERATOR_CONVOLUTION operation_target:VXNNE_OPERATION_TARGET_NN
...
[34 more VXNNE_OPERATOR_CONVOLUTION on VXNNE_OPERATION_TARGET_NN]
...
operation_name:VXNNE_OPERATOR_POOLING operation_target:VXNNE_OPERATION_TARGET_SH
operation_name:VXNNE_OPERATOR_FULLYCONNECTED operation_target:VXNNE_OPERATION_TARGET_TP
operation_name:VXNNE_OPERATOR_SOFTMAX operation_target:VXNNE_OPERATION_TARGET_SH
From that we can see that the TP units are used to prepare the input tensor, then all convolution operations are going to the NN cores, and then the output of the convolutions is passed through a pooling operation in the programmable core, passing its input to the TP cores for further processing and then finishing with SOFTMAX on the programmable cores.
So in this case, only a small part of the network is actually ran on the programmable cores, via OpenCL...
What I will be working on next:
If performance is at least 3x faster than running the inference on the CPU, I would call this good enough to be useful and I will switch to upstreaming. The Mesa side of it doesn't look that bad, but I think the bigger challenge will be getting something merged in TensorFlow that can run fast on this hardware.
The most reasonable approach I have been able to think of would be adding new CL C and SPIR-V vendor extensions that add a new intrinsic for the whole convolution operation (with parameters similar to those of the vxConvolutionLayer node).
The GPU delegate in TensorFlow Lite would use it on the Vivante NPU and Mesa would have a robust way of knowing that this kernel should be run with a NN job, and with what configuration.
That's a lot of work, but I would say at this point that afterwards I will start looking at making fuller use of the NPU's capabilities by doing something similar with the operations that the TP cores can accelerate.
For the last four years I’ve served as a member of the X.Org Foundation Board of Directors, but some days ago I stepped down after my term ended and not having run for re-election.
I started contributing to Mesa in 2014 and joined the amazing freedesktop community. Soon after, I joined the X.Org Foundation as an regular member in order to participate in the elections and get access to some interesting perks (VESA, Khronos Group). You can learn more about what X.Org Foundation does in Ricardo’s blogpost.
But everything changed in 2018. That year, Chema and I organized XDC 2018 in A Coruña, Spain.

The following year, I ran for the yearly election of X.Org Foundation’s board of directors (as it is a two years term, we renew half of the board every year)… and I was elected! It was awesome! Almost immediately, I started coordinating XDC, and looking for organization proposals for the following XDC. I documented my experience organizing XDC 2018 in an attempt to make the job easier for future organizers, reducing the burden that organizing such a conference entails.
In 2021, I was re-elected and everything continued without changes (well, except the pandemic and having our first 2 virtual XDCs: 2020 and 2021).
Unfortunately, my term finished this year… and I did not re-run for election. The reasons were a mix of personal life commitments (having 2 kids change your life completely) and new professional responsibilities. After those changes, I could not contribute as much as I wanted, and that was enough to me to pass the torch and let others contribute to the X.Org Foundation instead. Congratulations to Christopher Michale and Arek Hiler, I’m pretty sure you are going to do great!
Surprisingly enough, I am closing the cycle as it started: organizing X.Org Developers Conference 2023 in A Coruña, Spain from 17th to 19th October 2023.

I leave the board of directors but I won friends and great memories. In case you are interested on participating to the community via the board of directors, prepare your candidancy for next year!
See you in A Coruña!
A few weeks ago the annual X.Org Foundation Board of Directors election took place. The Board of Directors has 8 members at any given moment, and members are elected for 2-year terms. Instead of renewing the whole board every 2 years, half the board is renewed every year. Foundation members, which must apply for or renew membership every year, are the electorate in the process. Their main duty is voting in board elections and occasionally voting in other changes proposed by the board.
As you may know, thanks to the work I do at Igalia, and the trust of other Foundation members, I’m part of the board and currently serving the second year of my term, which will end in Q1 2024. Despite my merits coming from my professional life, I do not represent Igalia as a board member. However, to avoid companies from taking over the board, I must disclose my professional affiliation and we must abide by the rule that prohibits more than two people with the same affiliation from being on the board at the same time.
Because of the name of the Foundation and for historical reasons, some people are confused about its purpose and sometimes they tend to think it acts as a governance body for some projects, particularly the X server, but this is not the case. The X.Org Foundation wiki page at freedesktop.org has some bits of information but I wanted to clarify a few points, like mentioning the Foundation has no paid employees, and explain what we do at the Foundation and the tasks of the Board of Directors in practical terms.
Cue the music.
(“The Who - Who Are You?” starts playing)
The main points would be:
The Foundation acts as an umbrella for multiple projects, including the X server, Wayland and others.
The board of directors has no power to decide who has to work on what.
The largest task is probably organizing XDC.
Being a director is not a paid position.
The Foundation pays for project infrastructure.
The Foundation, or its financial liaison, acts as an intermediary with other orgs.
Some directors have argued in the past that we need to change the Foundation name to something different, like the Freedesktop.org Foundation. With some healthy sense of humor, others have advocated for names like Freedesktop Software Foundation, or FSF for short, which should be totally not confusing. Humor or not, the truth is the X.Org Foundation is essentially the Freedesktop Foundation, so the name change would be nice in my own personal opinion.
If you take a look at the Freedesktop Gitlab instance, you can navigate to a list of projects and sort them by stars. Notable mentions you’ll find in the list: Mesa, PipeWire, GStreamer, Wayland, the X server, Weston, PulseAudio, NetworkManager, libinput, etc. Most of them closely related to a free and open source graphics stack, or free and open source desktop systems in general.
As I mentioned above, the Foundation has no paid employees and the board has no power to direct engineering resources to a particular project under its umbrella. It’s not a legal question, but a practical one. Is the X.Org server dying and nobody wants to touch it anymore? Certainly. Many people who worked on the X server are now working on Wayland and creating and improving something that works better in a modern computer, with a GPU that’s capable of doing things which were not available 25 years ago. It’s their decision and the board can do nothing.
On a tangent, I’m feeling a bit old now, so let me say when I started using Linux more than 20 years ago people were already mentioning most toolkits were drawing stuff to pixmaps and putting those pixmaps on the screen, ignoring most of the drawing capabilities of the X server. I’ve seen tearing when playing movies on Linux many times, and choppy animations everywhere. Attempting to use the X11 protocol over a slow network resulted in broken elements and generally unusable screens, problems which would not be present when falling back to a good VNC server and client (they do only one specialized thing and do it better).
For the last 3 or 4 years I’ve been using Wayland (first on my work laptop, nowadays also on my personal desktop) and I’ve seen it improve all the time. When using Wayland, animations are never choppy in my own experience, tearing is unheard of and things work more smoothly, as far as my experience goes. Thanks to using the hardware better, Wayland may also give you improved battery life. I’ve posted in the past that you can even use NVIDIA with Gnome on Wayland these days, and things are even simpler if you use an Intel or AMD GPU.
Naturally, there may be a few things which may not be ready for you yet. For example, maybe you use a DE which only works on X11. Or perhaps you use an app or DE which works on Wayland, but its support is not great and has problems there. If it’s an app, likely power users or people working on distributions can tune it to make it use XWayland by default, instead of Wayland, while bugs are ironed out.
Ouch, there we have the “X.Org” moniker again…
Back on track, if the Foundation can do nothing about the lack of people maintaining the X server and does not set any technical direction for projects, what does it do? (I hear you shouting “nothing!” while waving your fist at me.) One of the most time-consuming tasks is organizing XDC every year, which is arguably one of the most important conferences, if not the most important one, for open source graphics right now.
Specifically, the board of directors will set up a commission composed of several board members and other Foundation members to review talk proposals, select which ones will have a place at the conference, talk to speakers about shortening or lengthening their talks, and put them on a schedule to be used at the conference, which typically lasts 3 days. I chaired the paper committee for XDC 2022 and spent quite a lot of time on this.
The conference is free to attend for anyone and usually alternates location between Europe and the Americas. Some people may want to travel to the conference to present talks there but they may lack the budget to do so. Maybe they’re a student or they don’t have enough money, or their company will not sponsor travel to the conference. For that, we have travel grants. The board of directors also reviews requests for travel grants and approves them when they make sense.
But that is only the final part. The board of directors selects the conference contents and prepares the schedule, but the job of running the conference itself (finding an appropriate venue, paying for it, maybe providing some free lunches or breakfasts for attendees, handling audio and video, streaming, etc) falls in the hands of the organizer. Kid you not, it’s not easy to find someone willing to spend the needed amount of time and money organizing such a conference, so the work of the board starts a bit earlier. We have to contact people and request for proposals to organize the conference. If we get more than one proposal, we have to evaluate and select one.
As the conference nears, we have to fire some more emails and convince companies to sponsor XDC. This is also really important and takes time as well. Money gathered from sponsors is not only used for the conference itself and travel grants, but also to pay for infrastructure and project hosting throughout the whole year. Which takes us to…
No, that’s not happening.
Being a director of the Foundation is not a paid position. Every year we suffer a bit to be able to get enough candidates for the 4 positions that will be elected. Many times we have to extend the nomination period.
If you read news about the Foundation having trouble finding candidates for the board, that barely qualifies as news because it’s almost the same every year. Which doesn’t mean we’re not happy when people spread the news and we receive some more nominations, thank you!
Just like being an open source maintainer is not a grateful task sometimes, not everybody wants to volunteer and do time-consuming tasks for free. Running the board elections themselves, approving membership renewals and requests every year, and sending voting reminders also takes time. Believe me, I just did that a few weeks ago with help from Mark Filion from Collabora and technical assistance from Martin Roukala.
The Foundation spends a lot of money on project hosting costs, including Gitlab and CI systems, for projects under the Freedesktop.org umbrella. These systems are used every day and are fundamental for some projects and software you may be using if you run Linux. Running our own Gitlab instance and associated services helps keep the web decentralized and healthy, and provides more technical flexibility. Many people seem to appreciate those details, judging by the number of projects we host.
The Foundation also approaches other organizations on behalf of the community to achieve some stuff that would be difficult otherwise.
To pick one example, we’ve worked with VESA to provide members with access to various specifications that are needed to properly implement some features. Our financial liaison, formerly SPI and soon SFC, signs agreements with the Khronos Group that let them waive fees for certifying open source implementations of their standards.
For example, you know RADV is certified to comply with the Vulkan 1.3 spec and the submission was made on behalf of Software in the Public Interest, Inc. Same thing for lavapipe. Similar for Turnip, which is Vulkan 1.1 conformant.
The song is probably over by now and you have a better idea of what the Foundation does, and what the board members do to keep the lights on. If you have any questions, please let me know.
Last year, the Linux Foundation announced the creation of the Linux Foundation Europe.

The goal of the Linux Foundation Europe is, in a nutshell, to promote Open Source in Europe not only to individuals (via events and courses), but to companies (guidance and hosting projects) and European organizations. However, this effort needs the help of European experts in Open Source.
Thus, the Linux Foundation Europe (LFE) has formed an advisory board called the Linux Foundation Europe Advisory Board (LFEAB), which includes representatives from a cross-section of 20 leading European organizations within the EU, the UK, and beyond. The Advisory Board will play an important role in stewarding Linux Foundation Europe’s growing community, which now spans 100 member organizations from across the European region.
Early this year, I was invited to join the LFEAB as an inaugural member. I would not be in this position without the huge amount of work done by the rest of my colleagues at Igalia since the company was founded in 2001, which has paved the way for us to be one of the landmark consultancies specialized in Open Source, both globally and in Europe.
My presence in the LFEAB will help to share our experience, and help the Linux Foundation Europe to grow and spread Open Source everywhere in Europe.

I’m excited to participate in the Linux Foundation Europe Advisory Board! I and the rest of the LFEAB will be at the Open Source Summit Europe, send me an email if you want to meet me to know more about LFEAB, about Igalia or about how you can contribute more to Open Source.
Happy hacking!
After finishing up my first Igalia Coding Experience in January, I got the amazing opportunity to keep working in the DRI community by extending my Igalia CE to a second round. Huge thanks to Igalia for providing me with this opportunity!
Another four months passed by and here I am completing another milestone with Igalia. Previously, in the last final reports, I described GSoC as “an experience to get a better understanding of what open source is” and the first round of the Igalia CE as “an opportunity for me to mature my knowledge of technical concepts”. My second round of the Igalia CE was a period for broadening my horizons.
I had the opportunity to deepen my knowledge of a new programming language and learn more about Kernel Mode Setting (KMS). I took my time learning more about Vulkan and the Linux graphics stack. All of this new knowledge about the DRM infrastructure fascinated me and made me excited to keep developing.
So, this is a summary report of my journey at my second Igalia CE.
First, I took some time to wrap up the contributions of my previous Igalia CE. In my January Update, I described the journey to include IGT tests for V3D. But at the time, I hadn’t yet sent the final versions of the tests. Right when I started my second Igalia CE, I sent the final versions of the V3D tests, which were accepted and merged.
| Series | Status |
|---|---|
| [PATCH i-g-t 0/6] V3D Job Submission Tests | Accepted |
| [PATCH i-g-t 0/3] V3D Mixed Job Submission Tests | Accepted |
The first part of my Igalia CE was focused on rewriting the VGEM driver in Rust. VGEM (Virtual GEM Provider) is a minimal non-hardware-backed GEM (Graphics Execution Manager) service. It is used with non-native 3D hardware for buffer sharing between the X server and DRI.
The goal of the project was to explore Rust in the DRM subsystem and have a working VGEM driver written in Rust. Rust is a blazingly fast and memory-efficient language with its powerful ownership model. It was really exciting to learn more about Rust and implement from the beginning a DRM driver.
During the project, I wrote two blog posts describing the technical aspects of rustgem driver.
If you are interested in this project, check them out!
| Date | Blogpost |
|---|---|
| 28th February | Rust for VGEM |
| 22th March | Adding a Timeout feature to Rustgem |
By the end of the first half of the Igalia CE, I sent an RFC patch with the rustgem driver.
Thanks to Asahi Lina, the Rust for Linux folks, and Daniel Vetter for all the feedback provided during the development of the driver.
I still need to address some feedback and rebase the series on top of the new pin-init API, but I hope to see this driver upstream soon.
You can check the driver’s current status in this PR.
| Series | Status |
|---|---|
| [RFC PATCH 0/9] Rust version of the VGEM driver | In Review |
Apart from rewriting the VGEM driver, I also sent a couple of improvements to the C version of the VGEM driver and its IGT tests.
I found a missing mutex_destroy on the code and also an unused struct.
| Patches | Status |
|---|---|
| [PATCH] drm/vgem: add missing mutex_destroy | Accepted |
| [PATCH] drm/vgem: Drop struct drm_vgem_gem_object | Accepted |
On the IGT side, I added some new tests to the VGEM tests. I wanted to ensure that my driver returned the correct values for all possible error paths, so I wrote this IGT test. Initially, it was just for me, but I decided to submit it upstream.
| Series | Status |
|---|---|
| [PATCH v3 i-g-t 0/2] Add negative tests to VGEM | Accepted |
Focusing on the VKMS was the major goal of the second part of my Igalia CE. Melissa Wen is one of the maintainers of the VKMS, and she provided me with a fantastic opportunity to learn more about the VKMS. So far, I haven’t dealt with displays, and learning new concepts in the graphics stack was great.
VKMS is a software-only KMS driver that is quite useful for testing and running X (or similar compositors) on headless machines. At the time, the driver didn’t have any support for optional plane properties, such as rotation and blend mode. Therefore, my goal was to implement the first plane property of the driver: rotation. I described the technicalities of this challenge in this blog post, but I can say that it was a nice challenge of this mentorship project.
In the end, we have the first plane property implemented for the VKMS and it is already committed.
Together with the VKMS part, I sent a series to the IGT mailing list with some improvements to the kms_rotation_crc tests.
These improvements included adding new tests for rotation with offset and reflection and the isolation of some Intel-specific tests.
As I was working with the rotation series, I discovered a couple of things that could be improved in the VKMS driver. Last year, Igor Torrente sent a series to VKMS that changed the composition work in the driver. Before his series, the plane composition was executed on top of the primary plane. Now, the plane composition is executed on top of the CRTC.
Although his series was merged, some parts of the code still considered that the composition was executed on top of the primary plane, limiting the VKMS capabilities. So I sent a couple of patches to the mailing list, improving the handling of the primary plane and allowing full alpha blending on all planes.
Moreover, I sent a series that added a module parameter to set a background color to the CRTC. This work raised an interesting discussion about the need for this property by the user space and whether this parameter should be a KMS property.
Apart from introducing the rotation property to the VKMS driver, I also took my time to implement two other properties: alpha and blend mode. This series is still awaiting review, but it would be a nice addition to the VKMS, increasing its IGT test coverage rate.
Finally, I found a bug in the RGB565 conversion.
The RGB565 conversion to ARGB16161616 involves some fixed-point operations and, when running the pixel-format IGT test, I verified that the RGB565 test was failing.
So, some of those fixed-point operations were returning erroneous values.
I checked that the RGB coefficients weren’t being rounded when converted from fixed-point to integers.
But, this should happen in order to provided the proper coefficient values.
Therefore, the fix was: implement a new helper that rounds the fixed-point value when converting it to a integer.
After performing all this work on the VKMS, I sent a patch adding myself as a VKMS maintainer, which was acked by Javier Martinez and Melissa Wen. So now, I’m working together with Melissa, Rodrigo Siqueira and all DRI community to improve and maintain the VKMS driver.
A couple of years ago, Sumera Priyadarsini, an Outreachy intern, worked on a Virtual Hardware functionality for the VKMS. The idea was to add a Virtual Hardware or vblank-less mode as a kernel parameter to enable VKMS to emulate virtual devices. This means no vertical blanking events occur and page flips are completed arbitrarily when required for updating the frame. Unfortunately, she wasn’t able to wrap things up and this ended up never being merged into VKMS.
Melissa suggested rebasing this series and now we can have the Virtual Hardware functionality working on the current VKMS. This was a great work by Sumera and my work here was just to adapt her changes to the new VKMS code.
| Series | Status |
|---|---|
| [PATCH 0/2] drm/vkms: Enable Virtual Hardware support | In Review |
Finally, I was in the last week of the project, just wrapping things up, when I decided to run the VKMS CI. I had recently committed the rotation series and I had run the CI before, but to my surprise, I got the following output:
[root@fedora igt-gpu-tools]# ./build/tests/kms_writeback
IGT-Version: 1.27.1-gce51f539 (x86_64) (Linux: 6.3.0-rc4-01641-gb8e392245105-dirty x86_64)
(kms_writeback:1590) igt_kms-WARNING: Output Writeback-1 could not be assigned to a pipe
Starting subtest: writeback-pixel-formats
Subtest writeback-pixel-formats: SUCCESS (0.000s)
Starting subtest: writeback-invalid-parameters
Subtest writeback-invalid-parameters: SUCCESS (0.001s)
Starting subtest: writeback-fb-id
Subtest writeback-fb-id: SUCCESS (0.020s)
Starting subtest: writeback-check-output
(kms_writeback:1590) CRITICAL: Test assertion failure function get_and_wait_out_fence, file ../tests/kms_writeback.c:288:
(kms_writeback:1590) CRITICAL: Failed assertion: ret == 0
(kms_writeback:1590) CRITICAL: Last errno: 38, Function not implemented
(kms_writeback:1590) CRITICAL: sync_fence_wait failed: Timer expired
Stack trace:
#0 ../lib/igt_core.c:1963 __igt_fail_assert()
#1 [get_and_wait_out_fence+0x83]
#2 ../tests/kms_writeback.c:337 writeback_sequence()
#3 ../tests/kms_writeback.c:360 __igt_unique____real_main481()
#4 ../tests/kms_writeback.c:481 main()
#5 ../sysdeps/nptl/libc_start_call_main.h:74 __libc_start_call_main()
#6 ../csu/libc-start.c:128 __libc_start_main@@GLIBC_2.34()
#7 [_start+0x25]
Subtest writeback-check-output failed.
**** DEBUG ****
(kms_writeback:1590) CRITICAL: Test assertion failure function get_and_wait_out_fence, file ../tests/kms_writeback.c:288:
(kms_writeback:1590) CRITICAL: Failed assertion: ret == 0
(kms_writeback:1590) CRITICAL: Last errno: 38, Function not implemented
(kms_writeback:1590) CRITICAL: sync_fence_wait failed: Timer expired
(kms_writeback:1590) igt_core-INFO: Stack trace:
(kms_writeback:1590) igt_core-INFO: #0 ../lib/igt_core.c:1963 __igt_fail_assert()
(kms_writeback:1590) igt_core-INFO: #1 [get_and_wait_out_fence+0x83]
(kms_writeback:1590) igt_core-INFO: #2 ../tests/kms_writeback.c:337 writeback_sequence()
(kms_writeback:1590) igt_core-INFO: #3 ../tests/kms_writeback.c:360 __igt_unique____real_main481()
(kms_writeback:1590) igt_core-INFO: #4 ../tests/kms_writeback.c:481 main()
(kms_writeback:1590) igt_core-INFO: #5 ../sysdeps/nptl/libc_start_call_main.h:74 __libc_start_call_main()
(kms_writeback:1590) igt_core-INFO: #6 ../csu/libc-start.c:128 __libc_start_main@@GLIBC_2.34()
(kms_writeback:1590) igt_core-INFO: #7 [_start+0x25]
**** END ****
Subtest writeback-check-output: FAIL (1.047s)
🫠 🫠 🫠 🫠 🫠 🫠 🫠 🫠 🫠 🫠 🫠 🫠 🫠 🫠 🫠 🫠
Initially, I thought I had introduced the bug with my rotation series. Turns out, I just had made it more likely to happen. This bug has been hidden in VKMS for a while, but it happened just on rare occasions. Yeah, I’m talking about a race condition… The kind of bug that just stays hidden in your code for a long while.
When I started to debug, I thought it was a performance issue. But then, I increased the timeout to 10 seconds and even then the job wouldn’t finish. So, I thought that it could be a deadlock. But after inspecting the DRM internal locks and the VKMS locks, it didn’t seem the case.
Melissa pointed me to a hint: there was one framebuffer being leaked when removing the driver. I discovered that it was the writeback framebuffer. It meant that the writeback job was being queued, but it wasn’t being signaled. So, the problem was inside the VKMS locking mechanism.
After tons of GDB and ftrace, I was able to find out that the composer was being set twice without any calls to the composer worker. I changed the internal locks a bit and I was able to run the test repeatedly for minutes! I sent the fix for review and now I’m just waiting for a Reviewed-by.
| Patches | Status |
|---|---|
| [PATCH] drm/vkms: Fix race-condition between the hrtimer and the atomic commit | In Review |
While debugging, I found some things that could be improved in the VKMS writeback file. So, I decided to send a series with some minor improvements to the code.
| Series | Status |
|---|---|
| [PATCH 0/3] drm/vkms: Minor Improvements | In Review |
If you run all IGT KMS tests on the VKMS driver, you will see that some tests will fail. That’s not what we would expect: we would expect that all tests would pass or skip. The tests could fail due to errors in the VKMS driver or be wrong exceptions on the IGT side. So, on the final part of my Igalia CE, I inspected a couple of IGT failures and sent fixes to address the errors.
This patch is a revival of a series I sent in January to fix the IGT test kms_addfb_basic@addfb25-bad-modifier.
This test also failed in VC4, and I investigated the reason in January.
I sent a patch to guarantee that the test would pass and after some feedback, I came down to a dead end.
So, I left this patch aside for a while and decided to recapture it now.
Now, with this patch being merged, we can guarantee that the test kms_addfb_basic@addfb25-bad-modifier is passing for multiple drivers.
| Patches | Status |
|---|---|
| [PATCH] drm/gem: Check for valid formats | Accepted |
On the IGT side, I sent a couple of improvements to the tests.
The failure was usually just a scenario that the test didn’t consider.
For example, the kms_plane_scaling test was failing in VKMS, because it didn’t consider the case in which the driver did not have the rotation property.
As VKMS didn’t use to have the rotation property, the tests were failing instead of skipping.
Therefore, I just developed a path for drivers without the rotation property for the tests to skip.
I sent improvements to the kms_plane_scaling, kms_flip, and kms_plane tests, making the tests pass or skip on all cases for the VKMS.
One important thing to VKMS is creating a baseline of generic KMS tests that should pass. This way, we can test new contributions against this baseline and avoid introducing regressions in the codebase. I sent a patch to IGT to create a testlist for the VKMS driver with all the KMS tests that must pass on the VKMS driver. This is great for maintenance, as we can run the testlist to ensure that the VKMS functionalities are preserved.
With new features being introduced in VKMS, it is important to keep the test list updated. So, I verified the test results and updated this test list during my time at the Igalia CE. I intend to keep this list updated as long as I can.
| Series | Status |
|---|---|
| [PATCH i-g-t] tests/vkms: Create a testlist to the vkms DRM driver | Accepted |
| [PATCH i-g-t 0/3] tests/vkms: Update VKMS’s testlist | Accepted |
First, I would like to thank my great mentor Melissa Wen. Melissa and I are completing a year together as mentee and mentor and it has been an amazing journey. Since GSoC, Melissa has been helping me by answering every single question I have and providing me with great encouragement. I have a lot of admiration for her and I’m really grateful for having her as my mentor during the last year.
Also, I would like to thank Igalia for giving me this opportunity to keep working in the DRI community and learning more about this fascinating topic. Thanks to all Igalians that helped through this journey!
Moreover, I would like to thank the DRI community for reviewing my patches and giving me constructive feedback.
Especially, I would like to thank Asahi Lina, Daniel Vetter and the Rust for Linux folks for all the help with the rustgem driver.
Thanks for all the suggestions, on here, and on twitter and on mastodon, anyway who noted I could use a single fd and avoid all the pain was correct!
I hacked up an ever growing ftruncate/madvise memfd and it seemed to work fine. In order to use it for sparse I have to use it for all device memory allocations in lavapipe which means if I push forward I probably have to prove it works and scales a bit better to myself. I suspect layering some of the pb bufmgr code on top of an ever growing fd might work, or maybe just having multiple 2GB buffers might be enough.
Not sure how best to do shaderResourceResidency, userfaultfd might be somewhat useful, mapping with PROT_NONE and then using write(2) to get a -EFAULT is also promising, but I'm not sure how best to avoid segfaults for read/writes to PROT_NONE regions.
Once I got that going, though I ran headfirst into something that should have been obvious to me, but I hadn't thought through.
llvmpipe allocates all it's textures linearly, there is no tiling (even for vulkan optimal). Sparse textures are incompatible with linear implementations. For sparseImage2D you have to be able to give the sparse tile sizes from just the image format. This typically means you have to work out how large the tile that fits into a hw page is in w/h. Of course for a linear image, this would be dependent on the image stride not just the format, and you just don't have that information.
I guess it means texture tiling in llvmpipe might have to become a thing, we've thought about it over the years but I don't think there's ever been a solid positive for implementing it.
Might have to put sparse support on the back burner for a little while longer.
Hi all!
This status update comes in a bit late because I was on leave last week. The highlight this month is the HDR hackfest, I’ve written a dedicated blog post about it. After the publication of that blog post, I’ve sent out an RFC to dri-devel.
We’ve made some good progress on wlroots’ Vulkan renderer. Manuel Stoeckl has added support for an intermediate buffer for blending, which is required for non-8-bit output formats and for color management features. The renderer now has an optional extra rendering pass to run a shader after blending. This is currently used to encode color values to sRGB, and will be used in the future to apply ICC profiles and to perform color space conversions. I’ve added support for the NV12 DMA-BUF format, support for more YCbCr formats is in a merge request.
The new cursor-shape-v1 protocol has been merged in wayland-protocols thanks
to KDE and winit folks. Traditionally Wayland clients needed to load XCursor
themes and submit these as wl_shm buffers to the compositor. However there
are a few downsides: there is no mechanism to configure the theme that gets
loaded, the theme cannot be changed on-the-fly, there is no way to configure
separate themes per seat, and loading cursors slows down client startup. The
cursor-shape-v1 protocol allows clients to set a cursor image by its name
instead of using wl_shm buffers.
I’ve worked on adding a new mode to wayland-scanner to generate enums only. This is useful for libraries like wlroots which use C enums generated from protocol XML in their public headers. We plan to ship these headers as part of a wayland-protocols installation.
To wrap up this status update, let’s mention a few updates for miscellaneous projects. A handful of new formats have been added to pixfmtdb. gqlclient now handles GraphQL interfaces correctly and generates methods to unwrap the underlying type. This is now used in hut to show ticket comments, among other things. go-imap now supports SEARCHRES, LITERAL+, and features a simplified API for STATUS commands.
See you next month!
Mike nerdsniped me into wondering how hard sparse memory support would be in lavapipe.
The answer is unfortunately extremely.
Sparse binding essentially allows creating a vulkan buffer/image of a certain size, then plugging in chunks of memory to back it in page-size multiple chunks.
This works great with GPU APIs where we've designed this, but it's actually hard to pull off on the CPU.
Currently lavapipe allocates memory with an aligned malloc. It allocates objects with no backing and non-sparse bindings connect objects to the malloced memory.
However with sparse objects, the object creation should allocate a chunk of virtual memory space, then sparse binding should bind allocated device memory into the virtual memory space. Except Linux has no interfaces for doing this without using a file descriptor.
You can't mmap a chunk of anonymous memory that you allocated with malloc to another location. So if I malloc backing memory A at 0x1234000, but the virtual memory I've used for the object is at 0x4321000, there's no nice way to get the memory from the malloc to be available at the new location (unless I missed an API).
However you can do it with file descriptors. You can mmap a PROT_NONE area for the sparse object, then allocate the backing memory into file descriptors, then mmap areas from those file descriptors into the correct places.
But there are limits on file descriptors, you get 1024 soft, or 4096 hard limits by default, which is woefully low for this. Also *all* device memory allocations would need to be fd backed, not just ones going to be used in sparse allocations.
Vulkan has a limit maxMemoryAllocationCount that could be used for this, but setting it to the fd limit is a problem because some fd's are being used by the application and just in general by normal operations, so reporting 4096 for it, is probably going to explode if you only have 3900 of them left.
Also the sparse CTS tests don't respect the maxMemoryAllocationCount anyways :-)
I shall think on this a bit more, please let me know if anyone has any good ideas!
This blogpost was actually written partially in November/December 2022 while I was developing IGT tests for the V3D driver. I ended up leaving it aside for a while and now, I came back and finished the last loose ends. That’s why I’m referencing the time where I was fighting against V3D’s noop jobs.
Currently, during my Igalia Coding Experience, I’m working on the V3D’s IGT tests and therefore, I’m dealing a lot with the Raspberry Pi 4.
During the project, I had a real struggle to design the tests for the v3d_submit_cl ioctl, as I was not capable of submit a proper noop job to the GPU.
In order to debug the tests, my mentor Melissa Wen suggested to me to run the CTS tests to reproduce a noop job and debug it through Mesa. I cloned the CTS repository into my Raspberry Pi 4 and I tried to compile, but my Raspberry Pi 4 went OOM. This sent me on a journey to cross-compile CTS for the Raspberry Pi 4. I decided to compile this journey into this blogpost.
During this blogpost, I’m using a Raspbian OS with desktop 64-bit.
First, you need to install Mesa on the Raspberry Pi 4. I decided to compile Mesa on the Raspberry Pi 4 itself, but maybe one day, I can write a blogpost about cross-compiling Mesa.
Currently, the Raspbian repositories only provide libdrm 2.4.104 and Mesa’s main branch needs libdrm >=2.4.109.
So, first, let’s install libdrm 2.4.109 on the Raspberry Pi 4.
First, let’s make sure that you have meson installed on your RPi4.
We will need meson to build libdrm and Mesa.
I’m installing meson through pip3 because we need a meson version greater than 0.60 to build Mesa.
# On the Raspberry Pi 4
$ sudo pip3 install meson
Then, you can install libdrm 2.4.109 on the RPi4.
# On the Raspberry Pi 4
$ wget https://dri.freedesktop.org/libdrm/libdrm-2.4.114.tar.xz
$ tar xvpf libdrm-2.4.114.tar.xz
$ cd libdrm-2.4.114
$ mkdir build
$ cd build
$ FLAGS="-O2 -march=armv8-a+crc+simd -mtune=cortex-a72" \
CXXFLAGS="-O2 -march=armv8-a+crc+simd -mtune=cortex-a72" \
meson -Dudev=true -Dvc4="enabled" -Dintel="disabled" -Dvmwgfx="disabled" \
-Dradeon="disabled" -Damdgpu="disabled" -Dnouveau="disabled" -Dfreedreno="disabled" \
-Dinstall-test-programs=true ..
$ sudo ninja install
So, now let’s install Mesa.
During this blogpost, I will use ${USER} as the username on the machine.
Note that, in order to run sudo apt build-dep mesa, you will have to uncomment some deb-src on the file /etc/apt/sources.list and run sudo apt update.
# On the Raspberry Pi 4
# Install Mesa's build dependencies
$ sudo apt build-dep mesa
# Build and Install Mesa
$ git clone https://gitlab.freedesktop.org/mesa/mesa
$ cd mesa
$ mkdir builddir
$ mkdir installdir
$ CFLAGS="-mcpu=cortex-a72" CXXFLAGS="-mcpu=cortex-a72" \
meson -Dprefix="/home/${USER}/mesa/installdir" -D platforms=x11 \
-D vulkan-drivers=broadcom \
-D gallium-drivers=kmsro,v3d,vc4 builddir
$ cd builddir
$ ninja
$ cd ..
$ ninja -C builddir install
In order to cross-compile the Raspberry Pi, you need to clone the target sysroot to the host.
For it, we are going to use rsync, so the host and the target need to be connected through a network.
$ sudo apt update
$ sudo apt dist-upgrade
As I said before, we will be using the rsync command to sync files between the host and the Raspberry Pi.
For some of these files, root rights is required internally, so let’s enable rsync with elevated rights.
$ echo "$USER ALL=NOPASSWD:$(which rsync)" | sudo tee --append /etc/sudoers
Some symbolic links are needed to make the toolchain work properly, so to create all required symbolic link reliably, this bash script is needed.
$ wget https://raw.githubusercontent.com/abhiTronix/raspberry-pi-cross-compilers/master/utils/SSymlinker
Once it is downloaded, you just need to make it executable, and then run it for each path needed.
$ sudo chmod +x SSymlinker
$ ./SSymlinker -s /usr/include/aarch64-linux-gnu/asm -d /usr/include
$ ./SSymlinker -s /usr/include/aarch64-linux-gnu/gnu -d /usr/include
$ ./SSymlinker -s /usr/include/aarch64-linux-gnu/bits -d /usr/include
$ ./SSymlinker -s /usr/include/aarch64-linux-gnu/sys -d /usr/include
$ ./SSymlinker -s /usr/include/aarch64-linux-gnu/openssl -d /usr/include
$ ./SSymlinker -s /usr/lib/aarch64-linux-gnu/crtn.o -d /usr/lib/crtn.o
$ ./SSymlinker -s /usr/lib/aarch64-linux-gnu/crt1.o -d /usr/lib/crt1.o
$ ./SSymlinker -s /usr/lib/aarch64-linux-gnu/crti.o -d /usr/lib/crti.o
First, we need to create a workspace for building CTS, where the Raspberry Pi 4 sysroot is going to be built.
$ sudo mkdir ~/rpi-vk
$ sudo mkdir ~/rpi-vk/installdir
$ sudo mkdir ~/rpi-vk/tools
$ sudo mkdir ~/rpi-vk/sysroot
$ sudo mkdir ~/rpi-vk/sysroot/usr
$ sudo mkdir ~/rpi-vk/sysroot/usr/share
$ sudo chown -R 1000:1000 ~/rpi-vk
$ cd ~/rpi-vk
Now, we need to sync up our sysroot folder with the system files from the Raspberry Pi.
We will be using rsync that let us sync files from the Raspberry Pi.
To do this, enter the following commands one by one into your terminal and remember to change username and 192.168.1.47 with the IP address of your Raspberry Pi.
$ rsync -avz --rsync-path="sudo rsync" --delete pi@192.168.1.47:/lib sysroot
$ rsync -avz --rsync-path="sudo rsync" --delete pi@192.168.1.47:/usr/include sysroot/usr
$ rsync -avz --rsync-path="sudo rsync" --delete pi@192.168.1.47:/usr/lib sysroot/usr
$ rsync -avz --rsync-path="sudo rsync" --delete pi@192.168.1.47:/usr/share sysroot/usr
$ rsync -avz --rsync-path="sudo rsync" --delete pi@192.168.1.47:/home/${USER}/mesa/installdir installdir
The files we copied in the previous step still have symbolic links pointing to the file system on the Raspberry Pi.
So, we need to alter this, so that they become relative links from the new sysroot directory on the host machine.
There is a Python script available online that can help us.
$ wget https://raw.githubusercontent.com/abhiTronix/rpi_rootfs/master/scripts/sysroot-relativelinks.py
Once it is downloaded, you just need to make it executable and run it.
$ sudo chmod +x sysroot-relativelinks.py
$ ./sysroot-relativelinks.py sysroot
As Raspbian OS 64-bits uses GCC 10.2.0, let’s install the proper cross-compiler toolchain on our host machine. I’m using the toolchain provided by abhiTronix/raspberry-pi-cross-compilers, but there are many other around the web that you can use.
We are going to use the tools folder to setup our toolchain.
$ cd ~/rpi-vk/tools
$ wget https://sourceforge.net/projects/raspberry-pi-cross-compilers/files/Bonus%20Raspberry%20Pi%20GCC%2064-Bit%20Toolchains/Raspberry%20Pi%20GCC%2064-Bit%20Cross-Compiler%20Toolchains/Bullseye/GCC%2010.2.0/cross-gcc-10.2.0-pi_64.tar.gz/download
$ tar xvf download
$ rm download
If you run all the steps from this tutorial expect this one, you will still get some weird Wayland-related errors when cross-compiling it.
This will happen because probably the wayland-scanner version from your host is different from the wayland-scanner version of the target.
For example, on Fedora 37, the wayland-scanner version is 1.21.0 and the version on the Raspberry Pi 4 is 1.18.0.
In order to build Wayland, you will need the following dependencies:
$ sudo dnf install expat-devel xmlto
So, let’s install the proper Wayland version on our sysroot.
$ wget https://wayland.freedesktop.org/releases/wayland-1.18.0.tar.xz
$ tar xvf wayland-1.18.0.tar.xz
$ cd wayland-1.18.0
$ meson --prefix ~/rpi-vk/sysroot/usr build
$ ninja -C install
Now that we have the hole Raspberry Pi environment set up, we just need to create a toolchain file for CMake and its all set! So, let’s clone the CTS repository.
$ git clone https://github.com/KhronosGroup/VK-GL-CTS
$ cd VK-GL-CTS
To build dEQP, you need first to download sources for zlib, libpng, jsoncpp, glslang, vulkan-docs, spirv-headers, and spirv-tools. To download sources, run:
$ python3 external/fetch_sources.py
Inside the CTS directory, we are going to create a toolchain file called cross_compiling.cmake with the following contents:
set(CMAKE_VERBOSE_MAKEFILE ON)
set(CMAKE_SYSTEM_NAME Linux)
set(CMAKE_SYSTEM_VERSION 1)
set(CMAKE_SYSTEM_PROCESSOR aarch64)
# Check if the sysroot and toolchain paths are correct
set(tools /home/${USER}/rpi-vk/tools/cross-pi-gcc-10.2.0-64)
set(rootfs_dir $ENV{HOME}/rpi-vk/sysroot)
set(CMAKE_FIND_ROOT_PATH ${rootfs_dir})
set(CMAKE_SYSROOT ${rootfs_dir})
set(ENV{PKG_CONFIG_PATH} "")
set(ENV{PKG_CONFIG_LIBDIR} "${CMAKE_SYSROOT}/usr/lib/pkgconfig:${CMAKE_SYSROOT}/usr/share/pkgconfig")
set(ENV{PKG_CONFIG_SYSROOT_DIR} ${CMAKE_SYSROOT})
set(CMAKE_LIBRARY_ARCHITECTURE aarch64-linux-gnu)
set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -fPIC -Wl,-rpath-link,${CMAKE_SYSROOT}/usr/lib/${CMAKE_LIBRARY_ARCHITECTURE} -L${CMAKE_SYSROOT}/usr/lib/${CMAKE_LIBRARY_ARCHITECTURE}")
set(CMAKE_C_FLAGS "${CMAKE_CXX_FLAGS} -fPIC -Wl,-rpath-link,${CMAKE_SYSROOT}/usr/lib/${CMAKE_LIBRARY_ARCHITECTURE} -L${CMAKE_SYSROOT}/usr/lib/${CMAKE_LIBRARY_ARCHITECTURE}")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fPIC -Wl,-rpath-link,${CMAKE_SYSROOT}/usr/lib/${CMAKE_LIBRARY_ARCHITECTURE} -L${CMAKE_SYSROOT}/usr/lib/${CMAKE_LIBRARY_ARCHITECTURE}")
set(WAYLAND_SCANNER ${CMAKE_SYSROOT}/usr/bin/wayland-scanner)
## Compiler Binary
SET(BIN_PREFIX ${tools}/bin/aarch64-linux-gnu)
SET (CMAKE_C_COMPILER ${BIN_PREFIX}-gcc)
SET (CMAKE_CXX_COMPILER ${BIN_PREFIX}-g++ )
SET (CMAKE_LINKER ${BIN_PREFIX}-ld
CACHE STRING "Set the cross-compiler tool LD" FORCE)
SET (CMAKE_AR ${BIN_PREFIX}-ar
CACHE STRING "Set the cross-compiler tool AR" FORCE)
SET (CMAKE_NM {BIN_PREFIX}-nm
CACHE STRING "Set the cross-compiler tool NM" FORCE)
SET (CMAKE_OBJCOPY ${BIN_PREFIX}-objcopy
CACHE STRING "Set the cross-compiler tool OBJCOPY" FORCE)
SET (CMAKE_OBJDUMP ${BIN_PREFIX}-objdump
CACHE STRING "Set the cross-compiler tool OBJDUMP" FORCE)
SET (CMAKE_RANLIB ${BIN_PREFIX}-ranlib
CACHE STRING "Set the cross-compiler tool RANLIB" FORCE)
SET (CMAKE_STRIP {BIN_PREFIX}-strip
CACHE STRING "Set the cross-compiler tool RANLIB" FORCE)
set(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)
set(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
set(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)
Note that we had to specify our toolchain and also the specify the path to the wayland-scanner.
Now that we are all set, we can finally cross-compile CTS.
$ mkdir build
$ cd build
$ cmake -DCMAKE_BUILD_TYPE=Debug \
-DCMAKE_LIBRARY_PATH=/home/${USER}/rpi-vk/installdir/lib \
-DCMAKE_INCLUDE_PATH=/home/${USER}/rpi-vk/installdir/include \
-DCMAKE_GENERATOR=Ninja \
-DCMAKE_TOOLCHAIN_FILE=/home/${USER}/VK-GL-CTS/cross_compiling.cmake ..
$ ninja
Now, you can transfer the compiled files to the Raspberry Pi 4 and run CTS!
This was a fun little challenge of my CE project and it was pretty nice to learn more about CTS. Running CTS was also a great idea from Melissa as I was able to hexdump the contents of a noop job for the V3DV and fix my noop job on IGT. So, now I finally have a working noop job on IGT and you can check it here.
Also, a huge thanks to my friend Arthur Grillo for helping me with resources about cross-compiling for the Raspberry Pi.
Wow, it’s really happening :D
After some warm-up and work put on, I got accepted into the 2023 Google Summer of Code program with the X.Org organization.
The title of my project is “Increasing Code Coverage on the DRM Code”. The DRM subsystem is the standard way to interact with complex graphics devices.
It provides many global helpers for use inside the drivers. As these helpers are used by many drivers on the DRM subsystem, testing those functions for asserting that no regressions are made is crucial for kernel development.
Many units test were written for those helpers with the KUnit framework. But there is still much work to do. Running the Gcov code covering
analysis tool, we see that just one file has 100% of code
coverage. Knowing this, I will create more tests for the
drm-format-helper.c. This file handles color format conversion.
Currently, the conversion functions can’t handle planar formats. Instead
of having the color information packed in a single plane, those have their
information separated into multiple planes. I pretend to add support for it by
modifying the drm_fb_xfrm() function.
This Summer, I will be mentored by:
During this community bonding period, I’m reserving time to review patches on
the dri-devel mailing list and read more about the DRM.
These are my reviews:
In parallel, I’m trying to add support for the NV12 format to the VKMS driver. This would be nice for testing userspace programs that contain framebuffers with video. This is turning out to be bigger than I thought, as I need to add support for planar formats there and make the format conversion in software. Stay tuned for blog posts on that too ;).
See ya! :)
I received the email saying that my proposal for GSoC ‘23 was accepted by X.Org Foundation on May 5th. It’s my first time working with open-source communities and I’m really excited about it. My mentor Martin Roukala gave me some great advice and helped me understand the project and my tasks better. Now it’s time to dive into the project details.
A number of members of my team at Valve don’t blog, and I like to bring visibility to their work when I can. Here’s a quick roundup of what some of them have been doing, in no particular order.
Or maybe from most important work to least important work.
You decide.
After another week of heroic work, his implementation of VK_EXT_descriptor_indexing for Lavapipe has finally passed CI. This means it will probably be merged within the next week or two.
Given that Dave Airlie is practically done with his VK_EXT_mesh_shader (it draws multiple triangles now), this means the only thing remaining for full VKD3D-PROTON functionality on Lavapipe is sparse binding. I think.
Which begs the question: will Lavapipe get sparse binding support before Intel?
It’s depressing, it seems impossible given how many years we’ve all been waiting, and we don’t want to admit it to ourselves, but we already know the answer.
Another of our well-monitored raytracing wranglers, Friedrich has managed to escape the from his designated habitat extremely comfortable work environment to do something unrelated to the tracing of rays.
Specifically, he started a blog, and it’s got just the right mix of extremely serious content and extremely seriouser content to make for a great read.
I have high hopes for the future of his content provided we can get him back on target. Unfortunately the most recent sightings have been in dangerous proximity to the RADV pipeline defrobulator infrastructure, which doesn’t bode well for anyone’s safety.
I would never accuse Samuel of taking time off. This is a man who, when he goes grocery shopping, carries a tablet in one hand so he can continue implementing extensions while he sifts through produce. This is a guy who, when he goes through airport security, gets asked to review the metal detectors rather than walk through them. This is an individual who, when he takes time off, brings his laptop with him to rebase his plethora of code refactoring branches.
What I’m trying to say is it was a great week for code refactoring.
This individual has a blog but refuses to post about their employment status until a prominent news site covers it first.
Well, Captain Code Deletion, consider this news C O V E R E D.
We’re glad to have you deleting things here after your brief and questionably-successful stint at Twitter.
I had some other stuff I was gonna blog about, but everyone’s already gotta click through to Friedrich’s post, and it’s long, so I’m gonna save it. Because I’ve totally got that content. And it’s ready. And I could post it. But also it’s Friday? And Capricorn is in the ascendant–you know what that means.
Yeah. Great post.
It’s an exciting time for Mesa as its next major release is unveiled this week. Igalia has played an important role in this milestone, with Eric Engestrom managing the release and 11 other Igalians contributing over 110 merge requests. A sample of these contributions are detailed below.
As part of an effort to enhance the reliability of GPU resets on amdgpu, Tony implemented a GPU reset notification feature in the RADV Vulkan driver. This new function improves the robustness of the RADV driver. The driver can now check if the GPU has been reset by a userspace application, allowing the driver to recover their contexts, exit, or engage in some other appropriate action.
You can read more about Tony’s changes in the link below
With a goal of improving feature parity of the KGSL kernel mode driver with its drm counterpart, Mark has been rewriting the backend for KGSL. These changes leverage the new, common backend Vulkan infrastructure inside Mesa and fix multiple bugs. In addition, they introduce support for importing/exporting sync FDs, pre-signalled fences, and timeline semaphore support.
If you’re interested in taking a deeper dive into Mark’s changes, you can read the following MR:
Danylo has adopted a significant role for two major changes inside turnip: 1)contributing to the effort to migrate turnip to C++ and 2)supporting the next generation a7xx Adreno GPUs from Qualcomm. A more detailed overview of Danylo’s changes can be found in the linked MRs below:
Igalia maintains the v3d OpenGL driver and v3dv Vulkan drive for broadcom videocore GPUs which can be found on devices such as the Raspberry Pi. Iago, Alex and Juan have combined their expertise to implement multiple fixes for both the v3d gallium driver and the v3dv vulkan driver on the Raspberry Pi. These changes include CPU performance optimizations, support for 16-bit floating point vertex attributes, and raising support in the driver to OpenGL 3.1 level functionality. This Igalian trio has also been addressing fixes for conformance issues raised in the Vulkan 1.3.5 conformance test suite (CTS).
You can dive into some of their Raspberry Pi driver changes here:
In addition to managing the 23.1 release, Eric has also implemented many fixes in Mesa’s infrastructure. He has assisted with addressing a number of CI issues within Mesa on various drivers from v3d to panfrost. Eric also dedicated part of his time to general clean-up of the Mesa code (e.g. removing duplicate functions, fixing and improving the meson-based build system, and removing dead code).
If you’re interested in seeing some of his work, check out some of the MRs below:
GPU hangs are one of the most common results of pretty much anything going wrong GPU-side, and finding out why they occur isn’t always easy. In this blog post, I’ll document my journey towards finding the cause of one specific hang in the game “Splitgate”.
Right off the bat, I noticed a few oddities with this particular hang. Firstly, the actual game always ran completely fine. The only place where it hung was on the first startup where the game automatically configures graphics settings (I’ll call it autoconfiguration from here).
Additionally, while I could reproduce the hang on my Steam Deck, I couldn’t get the hang to appear on my desktop. I have an RDNA2 graphics card, which is the same architecture as the Deck, so it seemed unlikely that specifics about the hardware architecture were the problem here.
As a first step, I tried running the game with the Vulkan Validation Layers. If the game is using the API in an invalid way and that is the cause of the hangs, there’s a rather good chance the Validation Layers will catch it.
Even though there were a few errors from the validation layers, it seemed like none of the errors were actually relevant to the hang. Most importantly, the errors with autoconfiguration on were the same as the errors with autoconfiguration off.
As any software, the Validation Layers aren’t perfect and can’t detect every possible invalid behaviour. At this point I was still unsure whether I’d have to search for the bug on the application side or on the driver side.
With the validation layers being unable to detect any invalid behaviour by the app during the autoconfiguration phase, another question comes to mind: What is the application doing, actually?
To answer that, I utilized the API Dump Vulkan layer by LunarG. When this layer is activated, it dumps all the commands made by the application, including every parameter and return value to standard output.
While API dumps are good to have for debugging, large API dumps from large engines are often difficult to navigate (not just because it’s an 800MB large file and your text editor dies trying to scroll through them). Instead, it’s often best to extract just the work that hangs for further debugging. But what frame is this?
The CPU and GPU do work asynchronously, which means that the CPU is free to do more work while GPU starts with its work. Somewhat unfortunately, this also means the CPU can do more Vulkan calls which will show up in the API dump after the app already submitted the hanging frame to the GPU. This means that I couldn’t just look at the last command in the API dump and assume that command caused the hang. Luckily, there were other hints towards what caused the hang.
In Vulkan, when you want to know when a particular work submission finishes,
you give a VkFence to the submit function. Later, you can wait for
the submission to finish with vkWaitForFences, or you can query whether
the submission has already finished with vkGetFenceStatus.
I noticed that after work was submitted, the app seemed to call
vkGetFenceStatus from time to time, polling whether that submission was
finished. Usually, vkGetFenceStatus would return VK_SUCCESS after a
few calls, indicating that the submission finished. However, there was one
submission where vkGetFenceStatus seemed to always return VK_NOT_READY.
It seemed very likely that the GPU was hanging while executing that submission.
To test my theory, I modified the implementation of vkQueueSubmit, which
you call for submitting work, to call vkDeviceWaitIdle immediately after
submitting the work. vkDeviceWaitIdle waits for all outstanding GPU work to
finish. When the GPU hangs, the vkQueueSubmit which caused the hang should
be the last line in the API dump1.
This time, the API dump cut off at the vkQueueSubmit for which
vkGetFenceStatus always returned VK_NOT_READY.
Bingo.
![]()
Now we know which submission hangs, but that submission still contains a lot of commands. Even though the text editor now survives scrolling through the commands, finding what is wrong by inspection is highly unlikely.
Instead, I tried to answer the question: “What specific command is making the GPU hang?”
In order to find the answer, I needed to find out as much as possible about what state the GPU is in when it hangs. There are a few useful tools which helped me gather info:
The first thing I did was use umr to query if any waves were active at the time of the hang. Waves (or Wavefronts) are groups of 32 or 64 shader invocations (or threads in DX terms) that the GPU executes at the same time2. There were indeed quite a few waves currently executing. For each wave, umr can show a disassembly of the GPU code that is currently executing, as well as the values of all registers, and more.
In this case, I was especially interested in the halt and fatal_halt status
bits for each wave. These bits are set when the wave encounters a fatal
exception (for example dereferencing invalid pointers) and won’t continue
execution. These bits were not set for any waves I inspected, so it was
unlikely that exceptions in a shader were causing the hang.
Aside from exceptions, the other common way for shaders to trigger GPU hangs is by accidentally executing infinite loops. But the shader code currently executing was very simple and didn’t even have a jump instruction anywhere, so the hang couldn’t be caused by infinite loops either.
Shaders aren’t the only thing that the GPU executes, and as such shaders aren’t the only thing that can cause GPU hangs.
In RADV, command buffers recorded in Vulkan are translated to a
hardware-specific command buffer format called PKT33. Commands encoded
in this format are written to GPU-accessible memory, and executed by the
GPU’s command processor (CP for short) when the command buffer is submitted.
These commands might also be involved in the hang, so I tried finding out which
commands the CP was executing when the hang happened. RADV has integrated debug
functionality that can help with exactly this, which can be enabled by setting
an environment variable named RADV_DEBUG to "hang". But when I tried
triggering the hang with this environment variable in place, it started up just
fine!
This isn’t the first time I’ve seen this. RADV_DEBUG=hang has a funny side
effect: It also inserts commands to wait for draws or dispatches to complete
immediately after the dispatch is triggered. This immensely helps with
figuring out which shader is faulty if there are multiple shaders executing
concurrently. But it also prevents certain hangs from happening: Where things
executing concurrently causes the hang in the first place.
In other words, we seem to be looking at a synchronization issue.
![]()
Even though we know we’re dealing with a synchronization issue, the original question remains unsolved: What command causes the hang?
The “sync after every draw/dispatch” method of RADV_DEBUG=hang fixes the
issue, but it has a very broad effect. Since the issue seems to reproduce
very reliably (which in itself is a rarity for synchronization bugs), we
can apply that sync selectively to only some draws or dispatches to narrow
down what commands exactly cause the hangs.
First, I tried restricting the synchronization to only apply to dispatches (so no draws were synchronized). This made the hang appear again. Testing the other way around (restricting the synchronization to only draws) confirmed: All compute dispatches were fine, the issue was about draw synchronization only.
Next, I tried only synchronizing at the end of renderpasses. This also fixed the hang. However, synchronizing at the start of renderpasses fixed nothing. Therefore it was impossible that missing synchronization across renderpasses was the cause of the hang.
The last likely option was that there was missing synchronization in between the draws and something in between renderpasses.
At this point, the API dump of the hanging submission proved very helpful. Upon taking a closer look, it became clear that the commands in the submitted command buffer had a very simple pattern (some irrelevant commands omitted for brevity):
vkCmdBeginRenderPass to begin a new renderpassvkCmdDrawvkCmdEndRenderPass, ending the renderpassvkCmdWriteTimestamp, writing the current elapsed timeWhat stuck out to me was that vkCmdWriteTimestamp was called with a
pipelineStage of VK_PIPELINE_STAGE_TOP_OF_PIPE. In simple terms, this means
that the timestamp can be written before the preceding draw finished.4
Further testing confirmed: If I insert synchronization before writing the timestamp, the hang is fixed. Inserting synchronization immediately after writing the timestamp makes the hang re-appear.
By now, it has become pretty clear that timestamp queries are the problem here. But it just didn’t really make sense that the timestamp write itself would hang.
Timestamp writes on AMD hardware don’t require launching any shaders.
They can be implemented using one PKT3 command called COPY_DATA5, which
accepts many data sources other than memory. One of these data sources is the
current timestamp. RADV uses COPY_DATA to write the timestamp to memory.
The memory for these timestamps is managed by the driver, so it’s exceedingly
unlikely the memory write would fail.
From the wave analysis with umr earlier I also knew that the in-flight shaders didn’t actually write or read any memory that might interfere with the timestamp write (somehow). The timestamp write itself being the cause of the hang seemed impossible.
If timestamp writes can’t be the problem, what else can there be that might hang the GPU?
There is one other part to timestamp queries aside from writing the timestamp itself: In Vulkan, timestamps are always written to opaque “query pool” objects. In order to actually view the timestamp value, an app has to copy the results stored in the query pool to a buffer in CPU or GPU memory. Splitgate uses Unreal Engine 4, which has a known bug related to query pool copies that RADV has to work around.
It isn’t too far-fetched to think there might be other bugs in UE’s Vulkan RHI regarding query copies. Synchronizing the query copy didn’t do anything, but just commenting out the query copy fixed the hang as well.
![]()
Up until this point, I was pretty sure that something about the timestamp write must be the cause of the problems. Now it seemed like query copies might also influence the problem somehow? I was pretty unsure how to reconcile these two observations, so I tried finding out more about how exactly the query copy affected things.
Query copies on RADV are implemented using small compute shaders written directly in NIR. Having the simple driver-internal shaders in NIR is a nice and simple way of storing them inside the driver, but they’re a bit hard to read for people not used to the syntax. For demonstration purposes I’ll use a GLSL translation of the shader6. The copy shader for timestamp queries looks like this:
location(binding = 0) buffer dst_buf;
location(binding = 1) buffer src_buf;
void main() {
uint32_t result_size = flags & VK_QUERY_RESULT_64_BIT ? sizeof(uint64_t) : sizeof(uint32_t);
uint32_t dst_stride = result_size;
if (flags & VK_QUERY_RESULT_WITH_AVAILABILITY_BIT)
dst_stride += sizeof(uint32_t);
uint32_t src_stride = 8;
uint64_t result = 0;
bool available = false;
uint64_t src_offset = src_stride * global_id.x;
uint64_t dst_offset = dst_stride * global_id.x;
uint64_t timestamp = src_buf[src_offset];
if (timestamp != TIMESTAMP_NOT_READY) {
result = timestamp;
available = true;
}
if ((flags & VK_QUERY_RESULT_PARTIAL_BIT) || available) {
if (flags & VK_QUERY_RESULT_64_BIT) {
dst_buf[dst_offset] = result;
} else {
dst_buf[dst_offset] = (uint32_t)result;
}
}
if (flags & VK_QUERY_RESULT_WITH_AVAILABILITY_BIT) {
dst_buf[dst_offset + result_size] = available;
}
}
At first, I tried commenting out the stores to dst_buf, which resulted in the
hangs disappearing again. This can indicate that dst_buf is the problem, but
it’s not the only possibility. The compiler can also optimize out the load
because it isn’t used further down in the shader, so this could also mask
an invalid read as well. When I commented out the read and always stored a
constant instead - it also didn’t hang!
But could it be that the shader was reading from an invalid address? Splitgate is by far not the only app out there using timestamp queries, and those apps all work fine - so it can’t just be fundamentally broken, right?
To test this out, I modified the timestamp write command once again. Remember
how PKT3_COPY_DATA is really versatile? Aside from copying memory and
timestamps, it can also copy a 32/64-bit constant supplied as a parameter.
I undid all the modifications to the copy shader and forced a constant to be
written instead of timestamps. No hangs to be seen.
![]()
It seems like aside from the synchronization, the value that is written as the timestamp influences whether a hang happens or not. But that also means neither of the two things already investigated can actually be the source of the hang, can they?
It’s essentially the same question as in the beginning, still unanswered:
“What the heck is hanging here???”
Stabbing in the dark with more guesses won’t help here. The only thing that can is more info. I already had a small GPU buffer that I used for some other debugging I skipped over. To get definitive info on whether it hangs because of the timestamp write, the timestamp copy, or something else entirely, I modified the command buffer recording to write some magic numbers into that debug buffer whenever these operations happened. It went something along the lines of:
0xAAAAAAAA if timestamp write is complete0xBBBBBBBB if timestamp copy is completeHowever, I still needed to ensure I only read the magic numbers after the GPU had time to execute them (without waiting forever during GPU hangs).. This required a different intricate and elaborate synchronization algorithm.
// VERY COMPLICATED SYNCHRONIZATION
sleep(1);
With that out of the way, let’s take a look at the magic number of the hanging submission.
Magic: 0x0
what??? this means neither write nor copy have executed? Alright, what if I add another command writing a magic number right at the beginning of the command buffer?
Magic: 0x0
So… the hang happens before the command buffer starts executing? Something can’t be right here.7
![]()
At this point I started logging all submits that contained either timestamp writes or timestamp copies, and I noticed that there was another submission with the same pattern of commands right before the hanging one.
This previous submission had executed just fine - all timestamps were written, all shaders finished without hangs. This meant that neither the way timestamps were written nor the way they were copied could be direct causes of hangs, because they worked just one submission prior.
I verified this theory by forcing full shader synchronization to happen before the timestamp write, but only for the submission that actually hangs. To my surprise, this did nothing to fix the hangs.
When I applied the synchronization trick to the previous submit (that always worked fine!), the hangs stopped appearing.
It seems like the cause of the hang is not in the hanging submission, but in a completely separate one that completed successfully.
![]()
Let’s rewind to the question that started this whole mess. “What is the app doing?”
Splitgate (as of today) uses Unreal Engine 4.27.2. Luckily, Epic Games make the source code of UE available to anyone registering for it with their Epic Games account. There was hope that the benchmark code they were using was built into Unreal, where I could examine what exactly it does.
Searching in the game logs from a run with the workaround enabled, I found this:
LogSynthBenchmark: Display: Graphics:
LogSynthBenchmark: Display: Adapter Name: 'AMD Custom GPU 0405 (RADV VANGOGH)'
LogSynthBenchmark: Display: (On Optimus the name might be wrong, memory should be ok)
LogSynthBenchmark: Display: Vendor Id: 0x1002
LogSynthBenchmark: Display: Device Id: 0x163F
LogSynthBenchmark: Display: Device Revision: 0x0
LogSynthBenchmark: Display: GPU first test: 0.06s
LogSynthBenchmark: Display: ... 3.519 s/GigaPix, Confidence=100% 'ALUHeavyNoise' (likely to be very inaccurate)
LogSynthBenchmark: Display: ... 2.804 s/GigaPix, Confidence=100% 'TexHeavy' (likely to be very inaccurate)
LogSynthBenchmark: Display: ... 2.487 s/GigaPix, Confidence=100% 'DepTexHeavy' (likely to be very inaccurate)
LogSynthBenchmark: Display: ... 8.917 s/GigaPix, Confidence=100% 'FillOnly' (likely to be very inaccurate)
LogSynthBenchmark: Display: ... 0.330 s/GigaPix, Confidence=100% 'Bandwidth' (likely to be very inaccurate)
LogSynthBenchmark: Display: ... 0.951 s/GigaVert, Confidence=100% 'VertThroughPut1' (likely to be very inaccurate)
LogSynthBenchmark: Display: ... 6.053 s/GigaVert, Confidence=100% 'VertThroughPut2' (likely to be very inaccurate)
LogSynthBenchmark: Display: GPU second test: 0.54s
LogSynthBenchmark: Display: ... 4.186 s/GigaPix, Confidence=100% 'ALUHeavyNoise' (likely to be inaccurate)
LogSynthBenchmark: Display: ... 3.118 s/GigaPix, Confidence=100% 'TexHeavy' (likely to be inaccurate)
LogSynthBenchmark: Display: ... 2.844 s/GigaPix, Confidence=100% 'DepTexHeavy' (likely to be inaccurate)
LogSynthBenchmark: Display: ... 9.127 s/GigaPix, Confidence=100% 'FillOnly' (likely to be inaccurate)
LogSynthBenchmark: Display: ... 0.339 s/GigaPix, Confidence=100% 'Bandwidth' (likely to be inaccurate)
LogSynthBenchmark: Display: ... 0.983 s/GigaVert, Confidence=100% 'VertThroughPut1' (likely to be inaccurate)
LogSynthBenchmark: Display: ... 6.422 s/GigaVert, Confidence=100% 'VertThroughPut2' (likely to be inaccurate)
LogSynthBenchmark: Display: GPU Final Results:
LogSynthBenchmark: Display: ... 4.186 s/GigaPix, Confidence=100% 'ALUHeavyNoise'
LogSynthBenchmark: Display: ... 3.118 s/GigaPix, Confidence=100% 'TexHeavy'
LogSynthBenchmark: Display: ... 2.844 s/GigaPix, Confidence=100% 'DepTexHeavy'
LogSynthBenchmark: Display: ... 9.127 s/GigaPix, Confidence=100% 'FillOnly'
LogSynthBenchmark: Display: ... 0.339 s/GigaPix, Confidence=100% 'Bandwidth'
LogSynthBenchmark: Display: ... 0.983 s/GigaVert, Confidence=100% 'VertThroughPut1'
LogSynthBenchmark: Display: ... 6.422 s/GigaVert, Confidence=100% 'VertThroughPut2'
FSynthBenchmark indeed appears in the UE codebase as a
benchmark tool to auto-calibrate settings. From reading its code, it seemed
like it does 3 separate benchmark ru…
wait. 3??
We can clearly see from the logs there are only two benchmark runs. Maybe the third run hangs the GPU somehow?
While thinking about this, another possibility came to my mind. The GPU driver can’t actually detect if the GPU is hung because of some fatal error or if it just takes an obscenely long amount of time for some work. No matter what it is, if it isn’t finished in 10 seconds, the GPU will be reset.8
So what if the hang I’ve been chasing all this time isn’t actually a hang? How do I even find out?
The amdgpu kernel driver has a parameter named lockup_timeout for this exact
purpose: You can modify this parameter to change the amount of time after which
the GPU is reset if a job doesn’t finish, or disable this GPU reset entirely.
To test this theory, I went with disabling the GPU reset.
After setting all the parameters up and rebooting, I started the game another time.
And it worked! It took a really long time, but eventually, the game started up fully. It was indeed just hammering the poor Deck’s GPU with work that took way too long.
Finally, things start clearing up a bit. There is still an open question, though: What does the workaround do to prevent this?
The code that runs the 3 benchmark passes doesn’t always run them unconditionally. Instead, the 3 benchmarks have an increasingly larger workload (each roughly 10x as much as the previous one). Comments nearby explain that this choice was made because the larger benchmark runs cause driver resets on low-end APUs (hey, that’s exactly the problem we’re having!). It measures the time it takes for the benchmark workloads to complete using the timestamp queries, and if the total benchmark time is beyond a certain point, it skips the other benchmark runs.
If you’ve been paying extremely close attention all the way until here, you might notice a small problem. UE4 interprets the timestamp values as the time until the benchmark workload completes. But as I pointed out all the way near the beginning, the timestamp can be written before the benchmark workload is even finished!
If the timestamp is written before the benchmark workload finishes, the measured benchmark time is much less than the workload actually took. In practice, this results in the benchmark results indicating a much faster GPU than there actually is. I assume this led to the third benchmark (which was too heavy for the Deck GPU) to be launched. My desktop GPU seems to be powerful enough to get through the benchmark before the lockup timeout, which is why I couldn’t reproduce the issue there.
In the end, the hack I originally found to work around the issue turned out to be a fitting workaround.
And I even got to make my first bugreport for Unreal Engine.
![]()
If an app is using Vulkan on multiple threads, this might not always be the case. This is a rare case where I’m grateful for Unreal Engine to have a single RHI thread. ↩
Nvidia also calls them “warps”. ↩
Short for “Packet 3”. Packet 2, 1 and 0 also exist, although they aren’t widely used on newer AMD hardware. ↩
If you insert certain pipeline barriers, writing the timestamp early would be disallowed, but these barriers weren’t there in this case. ↩
This command writes the timestamp immediately when the CP executes it. There is another command which waits for previous commands to finish before writing the timestamp. ↩
You can also view the original NIR and the GLSL translation here ↩
As it turned out later, the debugging method was flawed. In actuality, both timestamp writes and copies completed successfully, but the writes indicating this seemed to be still in the write cache. Forcing the memory containing the magic number to be uncached solved this. ↩
Usually, the kernel driver can also command the GPU to kill whatever job it is doing right now. For some reason, it didn’t work here though. ↩
If nothing goes wrong, Mesa 23.1 will ship in the next few hours, which means that at last everyone will have a new zink release.
And it’s a big one.
Since I’m expecting lots of people will be testing zink for the first time now that it should be usable for most things, I thought it would be useful to have a post about debugging zink. Or maybe not debugging but issue reporting. Yeah that sounds right.
Zink is a complex driver. It has many components:
There are many systems at play when zink is running, and malfunctions in any of them may lead to bugs. If you encounter a bug, there are a number of steps you can take to try mitigating its effect or diagnosing it.
Let’s dig in.
Zink tries to use a lot of threads. Sometimes they can cause problems. One of the first steps I take when I encounter an issue is to disable them:
mesa_glthread=false GALLIUM_THREAD=0 ZINK_DEBUG=flushsync MESA_LOADER_DRIVER_OVERRIDE=zink <command>
mesa_glthread=false disables glthread, which very occasionally causes issues related to vertex buffersGALLIUM_THREAD=0 disables threaded context, which very occasionally causes issues in everythingZINK_DEBUG=flushsync disables threaded queue submission, which has historically never been an issue but may remove some timing discrepancies to enable bugs to shine through more readilyIf none of the above affects your problem, it’s time to move on to the next step.
Zink tries a lot of optimizations. Sometimes they can cause problems. One of the first steps I take when I encounter an issue that isn’t fixed by the above is to disable them:
mesa_glthread=false GALLIUM_THREAD=0 ZINK_DEBUG=flushsync,noreorder,norp MESA_LOADER_DRIVER_OVERRIDE=zink <command>
noreorder disables command reordering, which will probably fix any issuenorp disables renderpass optimization, which is only enabled by default on tiling GPUs so it’s definitely not impacting anyone reading thisIf none of the above affects your problem, then you’re probably screwed it’s time to move on to the next step.
Zink tries to obey Vulkan synchronization rules. Sometimes these rules are difficult to understand and confusing to implement. One of the first steps I take when I encounter an issue that isn’t fixed by the above is to cry a lot check out this blog post the synchronization. Manually inspecting this is impossible hard, so there’s only a big hammer:
mesa_glthread=false GALLIUM_THREAD=0 ZINK_DEBUG=flushsync,noreorder,norp,sync MESA_LOADER_DRIVER_OVERRIDE=zink <command>
sync forces full memory synchronization before every command, which will solve the problem of extant FPSIf none of the above affects your problem, then I totally have more ideas, and I’m gonna blog about them right here, but I gotta go, um, get something. From my…refrigerator. It’s just down the street I’ll berightback.
Zink tries to conform to Vulkan API rules. Sometimes these rules are obfuscated by a trillion Valid Usage statements that nobody has time to read or the mental capacity to comprehend in their totality.
And when a problem appears that cannot be mitigated with any of the above, we must pray to our lord and savior the Vulkan Validation Layer for guidance:
mesa_glthread=false GALLIUM_THREAD=0 ZINK_DEBUG=flushsync,noreorder,norp,sync,validation MESA_LOADER_DRIVER_OVERRIDE=zink <command>
validation enables VVL at runtime, which will (hopefully) explain how I fucked upIf this doesn’t fix your problem,
You should file a ticket if you find an issue. But if any of the above provided some additional info, providing that can reduce the time it takes to resolve that issue.
libei is the library for Emulated Input - see this post for an introduction. Like many projects, libei was started when it was still unclear if it could be the right solution to the problem. In the years (!) since, we've upgraded the answer to that question from "hopefully" to "yeah, I reckon" - doubly so since we added support for receiver contexts and got InputLeap working through the various portal changes.
Emulating or capturing input needs two processes to communicate for obvious reasons so the communication protocol is a core part of it. But initially, libei was a quickly written prototype and the protocol was hacked up on an as-needed let's-get-this-working basis. The rest of the C API got stable enough but the protocol was the missing bit. Long-term the protocol must be stable - without a stable protocol updating your compositor may break all flatpaks still shipping an older libei. Or updating a flatpak may not work with an older compositor. So in the last weeks/months, a lot of work as gone into making the protocol stable. This consisted of two parts: drop protobuf and make the variuos features interface-dependent, unashamedly quite like the Wayland protocol which is also split into a number of interfaces that can be independently versioned. Initially, I attempted to make the protocol binary compatible with Wayland but dropped that goal eventually - the benefits were minimal and the effort and limitations (due to different requirements) were quite significant.
The protocol is defined in a single XML file and can be used directly from language bindings (if any). The protocol documentation is quite extensive but it's relatively trivial in principal: the first 8 bytes of each message are the object ID, then we have 4 bytes for the message length in bytes, then 4 for the object-specific opcode. That opcode is one of the requests or events in the object's interface - which is defined at object creation time. Unlike Wayland, the majority of objects in libei are created in server-side (the EIS implementation decides which seats are available and which devices in those seats). The remainder of the message are the arguments. Note that unlike other protocols the message does not carry a signature - prior knowledge of the message is required to parse the arguments. This is a direct effect of initially making it wayland-compatible and I didn't really find it worth the effort to add this.
Anyway, long story short: swapping the protocol out didn't initially have any effect on the C library but with the changes came some minor updates to remove some of the warts in the API. Perhaps the biggest change is that the previous capabilities of a device are now split across several interfaces. Your average mouse-like emulated device will have the "pointer", "button" and "scroll" interfaces, or maybe the "pointer_absolute", "button" and "scroll" interface. The touch and keyboard interfaces were left as-is. Future interfaces will likely include gestures and tablet tools, I have done some rough prototyping locally and it will fit in nicely enough with the current protocol.
At the time of writing, the protocol is not officialy stable but I have no intention of changing it short of some bug we may discover. Expect libei 1.0 very soon.
In my last blog post, I described a bit of my previous work on the rustgem project, and after that, as I had finished the VGEM features, I sent a RFC to the mailing list.
Although I still need to work on some rustgem feedback, I started to explore more of the KMS (Kernel Mode Setting) and its properties.
I talked to my mentor Melissa Wen, one of the VKMS maintainers, and she proposed implementing plane rotation capabilities to VKMS. The VKMS (Virtual Kernel Mode Setting) is a software-only KMS driver that is quite useful for testing and running X (or similar compositors) on headless machines. It sounded like a great idea, as I would like to explore a bit more of the KMS side of things.
In order to have an image on a display, we need to go through the whole Kernel Mode Setting (KMS) Display Pipeline. The pipeline has a couple of different objects, such as framebuffers, planes, and CRTCs, and the relationship between them can be quite complicated. If you are interested in the KMS Display Pipeline, I recommend reading the great KMS documentation. But here we are focused in only one of those abstractions, the plane.
In the context of graphics processing, a plane refers to an image source that can be superimposed or blended on top of a CRTC during the scanout process. The plane itself specifies the cropping and scaling of that image, and where it is placed on the visible area of the CRTC. Moreover, planes may possess additional attributes that dictate pixel positioning and blending, such as rotation or Z-positioning.
Rotation is an optional KMS property of the DRM plane object, which we use to specify the rotation amount in degrees in counter-clockwise direction. The rotation is applied to the image sampled from the source rectangle, before scaling it to fit in the destination rectangle. So, basically, the rotation property adds a rotation and a reflection step between the source and destination rectangles.
|*********|$$$$$$$$$| |$$$$$$$$$|@@@@@@@@@|
|*********|$$$$$$$$$| ---------> |$$$$$$$$$|@@@@@@@@@|
|#########|@@@@@@@@@| 90º |*********|#########|
|#########|@@@@@@@@@| |*********|#########|
The possible rotation values are rotate-0, rotate-90, rotate-180, rotate-270, reflect-x and reflect-y.
Now that we understand what plane rotation is, we can think about how to implement the rotation property on VKMS.
VKMS has some really special driver attributes, as all its composition happens by software operations. The rotation is usually an operation that is performed on the user-space, but the hardware can also perform it. In order for the hardware to perform it, the driver will set some registers, change some configurations, and indicate to the hardware that the plane should be rotated. This doesn’t happen on VKMS, as the composition is essentially a software loop. So, we need to modify this loop to perform the rotation.
First, we need a brief notion of how the composition happens in VKMS. The composition in VKMS happens line-by-line. Each line is represented by a staging buffer, which contains the composition for one plane, and an output buffer, which contains the composition of all planes in z-pos order. For each line, we query an array by the first pixel of the line and go through the whole source array linearly, performing the proper pixel conversion. The composition of the line can be summarized by:
void vkms_compose_row(struct line_buffer *stage_buffer, struct vkms_plane_state *plane, int y)
{
struct pixel_argb_u16 *out_pixels = stage_buffer->pixels;
struct vkms_frame_info *frame_info = plane->frame_info;
u8 *src_pixels = get_packed_src_addr(frame_info, y);
int limit = min_t(size_t, drm_rect_width(&frame_info->dst), stage_buffer->n_pixels);
for (size_t x = 0; x < limit; x++, src_pixels += frame_info->cpp)
plane->pixel_read(src_pixels, &out_pixels[x]);
}
Here we can see that we have the line, represented by the stage buffer and the y coordinate, and the source pixels. We read each source pixel in a linear manner, through the for-loop, and we place it on the stage buffer in the appropriate format.
With that in mind, we can think that rotating a plane is a matter of changing how we read and interpret the lines.
Let’s think about the reflect-x operation.
|*********|$$$$$$$$$| |$$$$$$$$$|*********|
|*********|$$$$$$$$$| -----------> |$$$$$$$$$|*********|
|#########|@@@@@@@@@| reflect-x |@@@@@@@@@|#########|
|#########|@@@@@@@@@| |@@@@@@@@@|#########|
Thinking that the VKMS composition happens line-by-line, we can describe the operation as a read in reverse order.
So, instead of start reading the pixels from left to right, we need to start reading the pixels from right to left.
We can implement this by getting the limit of the line and subtracting the current x position:
static int get_x_position(const struct vkms_frame_info *frame_info, int limit, int x)
{
if (frame_info->rotation & DRM_MODE_REFLECT_X)
return limit - x - 1;
return x;
}
For the reflect-y operation, we need to start reading the plane from the last line, instead of reading it from the first line.
|*********|$$$$$$$$$| |#########|@@@@@@@@@|
|*********|$$$$$$$$$| -----------> |#########|@@@@@@@@@|
|#########|@@@@@@@@@| reflect-y |*********|$$$$$$$$$|
|#########|@@@@@@@@@| |*********|$$$$$$$$$|
This can be performed by changing the y on the external composition loop.
Similarly from the reflect-x case, we can get the y limit and subtract the current y position.
static int get_y_pos(struct vkms_frame_info *frame_info, int y)
{
if (frame_info->rotation & DRM_MODE_REFLECT_Y)
return drm_rect_height(&frame_info->rotated) - y - 1;
return y;
}
So, to implement the rotation in VKMS, we need to change how we interpret the boundaries of the plane and read accordingly.
This might seem odd because we could just rotate the src rectangle by using drm_rect_rotate, but this wouldn’t work as the composition in VKMS is performed line-by-line and the pixels are accessed linearly.
However, drm_rect_rotate is of great help for us on the rotate-90 and rotate-270 cases.
Those cases demand scaling and drm_rect_rotate helps us tremendously with it.
Basically, what it does is:
|$$|@@|
|$$|@@|
|*********|$$$$$$$$$| |$$|@@|
|*********|$$$$$$$$$| --------------------> |$$|@@|
|#########|@@@@@@@@@| drm_rect_rotate(90) |**|##|
|#########|@@@@@@@@@| |**|##|
|**|##|
|**|##|
After the drm_rect_rotate operation, we need to read the columns as lines and the lines as columns.
See that even for a case like rotate-90, it is just a matter of changing the point of view and reading the lines differently.
The complete implementation of all rotation modes is available here. Together with the rotation feature, I sent a patch to reduce the code repetition in the code by isolating the pixel conversion functionality. This patch was already merged, but the rest of the series is still pending a Reviewed-by.
Rotating planes on VKMS was a fun challenge of my Igalia Coding Experience and I hope to keep working on VKMS to bring more and more features.
In my last blog post, I described a bit of my previous work on the rustgem project, and after that, as I had finished the VGEM features, I sent a RFC to the mailing list.
Although I still need to work on some rustgem feedback, I started to explore more of the KMS (Kernel Mode Setting) and its properties.
I talked to my mentor Melissa Wen, one of the VKMS maintainers, and she proposed implementing plane rotation capabilities to VKMS. The VKMS (Virtual Kernel Mode Setting) is a software-only KMS driver that is quite useful for testing and running X (or similar compositors) on headless machines. It sounded like a great idea, as I would like to explore a bit more of the KMS side of things.
In order to have an image on a display, we need to go through the whole Kernel Mode Setting (KMS) Display Pipeline. The pipeline has a couple of different objects, such as framebuffers, planes, and CRTCs, and the relationship between them can be quite complicated. If you are interested in the KMS Display Pipeline, I recommend reading the great KMS documentation. But here we are focused in only one of those abstractions, the plane.
In the context of graphics processing, a plane refers to an image source that can be superimposed or blended on top of a CRTC during the scanout process. The plane itself specifies the cropping and scaling of that image, and where it is placed on the visible area of the CRTC. Moreover, planes may possess additional attributes that dictate pixel positioning and blending, such as rotation or Z-positioning.
Rotation is an optional KMS property of the DRM plane object, which we use to specify the rotation amount in degrees in counter-clockwise direction. The rotation is applied to the image sampled from the source rectangle, before scaling it to fit in the destination rectangle. So, basically, the rotation property adds a rotation and a reflection step between the source and destination rectangles.
|*********|$$$$$$$$$| |$$$$$$$$$|@@@@@@@@@|
|*********|$$$$$$$$$| ---------> |$$$$$$$$$|@@@@@@@@@|
|#########|@@@@@@@@@| 90º |*********|#########|
|#########|@@@@@@@@@| |*********|#########|
The possible rotation values are rotate-0, rotate-90, rotate-180, rotate-270, reflect-x and reflect-y.
Now that we understand what plane rotation is, we can think about how to implement the rotation property on VKMS.
VKMS has some really special driver attributes, as all its composition happens by software operations. The rotation is usually an operation that is performed on the user-space, but the hardware can also perform it. In order for the hardware to perform it, the driver will set some registers, change some configurations, and indicate to the hardware that the plane should be rotated. This doesn’t happen on VKMS, as the composition is essentially a software loop. So, we need to modify this loop to perform the rotation.
First, we need a brief notion of how the composition happens in VKMS. The composition in VKMS happens line-by-line. Each line is represented by a staging buffer, which contains the composition for one plane, and an output buffer, which contains the composition of all planes in z-pos order. For each line, we query an array by the first pixel of the line and go through the whole source array linearly, performing the proper pixel conversion. The composition of the line can be summarized by:
void vkms_compose_row(struct line_buffer *stage_buffer, struct vkms_plane_state *plane, int y)
{
struct pixel_argb_u16 *out_pixels = stage_buffer->pixels;
struct vkms_frame_info *frame_info = plane->frame_info;
u8 *src_pixels = get_packed_src_addr(frame_info, y);
int limit = min_t(size_t, drm_rect_width(&frame_info->dst), stage_buffer->n_pixels);
for (size_t x = 0; x < limit; x++, src_pixels += frame_info->cpp)
plane->pixel_read(src_pixels, &out_pixels[x]);
}
Here we can see that we have the line, represented by the stage buffer and the y coordinate, and the source pixels. We read each source pixel in a linear manner, through the for-loop, and we place it on the stage buffer in the appropriate format.
With that in mind, we can think that rotating a plane is a matter of changing how we read and interpret the lines.
Let’s think about the reflect-x operation.
|*********|$$$$$$$$$| |$$$$$$$$$|*********|
|*********|$$$$$$$$$| -----------> |$$$$$$$$$|*********|
|#########|@@@@@@@@@| reflect-x |@@@@@@@@@|#########|
|#########|@@@@@@@@@| |@@@@@@@@@|#########|
Thinking that the VKMS composition happens line-by-line, we can describe the operation as a read in reverse order.
So, instead of start reading the pixels from left to right, we need to start reading the pixels from right to left.
We can implement this by getting the limit of the line and subtracting the current x position:
static int get_x_position(const struct vkms_frame_info *frame_info, int limit, int x)
{
if (frame_info->rotation & DRM_MODE_REFLECT_X)
return limit - x - 1;
return x;
}
For the reflect-y operation, we need to start reading the plane from the last line, instead of reading it from the first line.
|*********|$$$$$$$$$| |#########|@@@@@@@@@|
|*********|$$$$$$$$$| -----------> |#########|@@@@@@@@@|
|#########|@@@@@@@@@| reflect-y |*********|$$$$$$$$$|
|#########|@@@@@@@@@| |*********|$$$$$$$$$|
This can be performed by changing the y on the external composition loop.
Similarly from the reflect-x case, we can get the y limit and subtract the current y position.
static int get_y_pos(struct vkms_frame_info *frame_info, int y)
{
if (frame_info->rotation & DRM_MODE_REFLECT_Y)
return drm_rect_height(&frame_info->rotated) - y - 1;
return y;
}
So, to implement the rotation in VKMS, we need to change how we interpret the boundaries of the plane and read accordingly.
This might seem odd because we could just rotate the src rectangle by using drm_rect_rotate, but this wouldn’t work as the composition in VKMS is performed line-by-line and the pixels are accessed linearly.
However, drm_rect_rotate is of great help for us on the rotate-90 and rotate-270 cases.
Those cases demand scaling and drm_rect_rotate helps us tremendously with it.
Basically, what it does is:
|$$|@@|
|$$|@@|
|*********|$$$$$$$$$| |$$|@@|
|*********|$$$$$$$$$| --------------------> |$$|@@|
|#########|@@@@@@@@@| drm_rect_rotate(90) |**|##|
|#########|@@@@@@@@@| |**|##|
|**|##|
|**|##|
After the drm_rect_rotate operation, we need to read the columns as lines and the lines as columns.
See that even for a case like rotate-90, it is just a matter of changing the point of view and reading the lines differently.
The complete implementation of all rotation modes is available here. Together with the rotation feature, I sent a patch to reduce the code repetition in the code by isolating the pixel conversion functionality. This patch was already merged, but the rest of the series is still pending a Reviewed-by.
Rotating planes on VKMS was a fun challenge of my Igalia Coding Experience and I hope to keep working on VKMS to bring more and more features.
Some time ago I teased a new Mesa project that involved both features and perf. At last, it’s time to unveil the goods: a complete rewrite of all descriptor handling in Lavapipe thanks to one of RADV’s raytracing tamers, Konstantin Seurer.
It’s a feature that’s been confounding developers and users everywhere for years, but thanks to Konstantin’s tireless efforts over the past ETOOLONG, at last everyone will have the features they crave.
Yes.
In short, the work required rewriting all the LLVM-based JIT to dynamically generate all the image/buffer access code instead of being able to use a more fixed-function style of JIT. As the MR shows, the diff here is massive, and work is still ongoing to make it work everywhere.
It’s a truly Herculean effort by Konstantin that was only hindered by my goalpost-moving to fingerpaint in support for EXT_descriptor_buffer and EXT_mutable_descriptor_type.
Primarily it means that Lavapipe should start to be able to work with VKD3D-PROTON and play (haha) real games. This has more uses in CI purposes as well, allowing broader coverage for all layered drivers that depend on EXT_descriptor_indexing.
Unfortunately, the added indirection is going to slightly reduce performance in some cases.
Work is ongoing to mitigate this, and I don’t have any benchmark results.
The other day I posted about new efforts to create a compendium of games that work with zink. Initial feedback has been good, and thanks to everyone who has contributed so far.
I realized too late that I needed to be more explicit about what feedback I wanted, however.
There are three tickets open, one for each type of problem:
Please don’t dump the results of your entire steam library into a single ticket. Take the extra few seconds and split up your results appropriately.
With all that said, the quantity of games in the “working” list far outnumbers those in the “not working” lists, which is great. On the flip side, anyone looking for a starting point to contribute to zink now has quite a few real world cases to check out.
Last week I’ve been attending the HDR hackfest organized by Red Hat. The trip to Prague was rather chaotic: the morning right after the SourceHut DoS incident, I got a notification that my plane was cancelled, so had to re-book a new flight and hotel. Then a few hours before leaving for the airport, I realized that the train line to get there was cut off, so I had to take a longer route via bus (and of course, everybody taking a plane had the same idea). Thankfully Saturday evening I arrived in Prague as planned, and even had some time before the train the next day to visit the old city center with the Jewish cemetery and synagogues, and enjoy a traditional guláš. I arrived at Brno — the venue of the hackfest — on Sunday evening.
I met with some hackfest participants Monday morning in the hotel lobby, then we joined everybody else at the Red Hat office. People from various organizations were on-site: Red Hat, KDE, System76, AMD, Igalia, Collabora, Canonical, etc. Some more people from NVIDIA, Intel and Google joined us remotely (some of them waking up at 2 AM due to their timezone!). It was super nice meeting all these folks I’ve been working with remotely for years!
![]()
Sebastian had prepared a list of topics we could discuss. We started by brainstorming use-cases we cared about for HDR and color management. There are two main separate classes of users here: one wants to enjoy HDR movies and games, the other wants to perform color-critical work such as image or video editing. The former mainly cares about the colors looking good, while the latter cares about the colors looking right. These two use-cases are kind of orthogonal (a compositor can implement one without the other) but still closely related. We noted that displaying a single HDR surface full-screen is pretty easy to achieve but we really want to properly handle mixing HDR and SDR content, if only to be able to display overlay menus, notifications, cursors, and so on (and of course windowed HDR content). Additionally keeping the power usage down is important for mobile devices. We mentioned a million of other issues (screen recording, brightness adjustment, “Night Light” feature, etc) but this blog post would turn into a thick book if I tried to cover everything.
Then we switched gears and discussed about variable refresh rate (VRR). There are two unresolved issues when it comes to VRR: cursor handling and flickering. The first issue manifests itself when the cursor plane is moved while VRR is enabled. Either the cursor is moved at the maximum refresh rate (effectively disabling VRR), either the cursor is moved at the game/video refresh rate (resulting in a very choppy result). We need a new kernel uAPI to move the cursor plane without scheduling a new page-flip somehow. The second issue is that some screens (not all) flicker when the refresh rate is changed abruptly. This is a bit annoying to handle, we need to ensure that refresh rate changes are smoothed over multiple frames for these displays. It would be best for user-space to handle this, because the refresh rate compensation will mess up frame timings. It would be nice to be able to automatically tell apart “good” and “bad” screens, there are some HDMI and DisplayID standards for this but they are not widely supported. More experimentation and testing is required to figure out how much we can do in user-space.
Then we got into the color management topics. First the “easy” part: AMD is missing support for the Colorspace KMS property. There are patches floating around but a blocker: AMD may decide to encode the signal as either RGB or YUV on the wire depending on the available bandwidth, and the Colorspace property has different enum entries for RGB and YUV. However user-space has no way to know whether the driver picked RGB or YUV, so has no way to pick the correct enum entry. We decided that the best course of action was to retain backwards uAPI compatibility by keeping the existing enum entries, but treat them as equal in the driver and let it Do The Right Thing. That way user-space can unconditionally pick the RGB variant and the driver will silently convert that to the YUV variant if it happens to encode the signal as YUV on the wire.
Before we got into some more complicated color management and HDR discussions, Sebastian and Pekka explained in more detail how it’s all supposed to work. This is a very wide and tricky topic, so it can be especially complicated to learn and understand. Pekka gave some enlightening and colorful explanations (see what I did here?), I believe that helped a lot of people in the room. If you are interested, have a look at the learn page in Pekka’s color-and-hdr repository.
![]()
With that out of the way, we started debating about vendor-specific KMS properties. With the existing kernel uAPI, compositors can implement HDR and color management just fine, but have to resort to OpenGL or Vulkan shaders. This isn’t great for power efficiency, because this keeps the 3D engine in GPUs busy. Ideally we want to offload all color transformations to the display engine and keep the 3D engine completely idle (during video playback with hardware-accelerated decoding for instance). So we need a new kernel API.
A week before, Melissa has sent a patch series to introduce AMD-specific KMS properties to configure various color management related hardware blocks. The amdgpu documentation explains exactly what these hardware blocks are. Josh has implemented support for this in gamescope and this will be shipped in SteamOS soon. This is great, because this is the first time a real HDR compositor has been implemented on Linux, and with full hardware acceleration even! If nothing else, this is a very valuable testing grounds. So the question we asked ourselves is whether or not we want to merge the KMS vendor-specific properties. On one hand this allows easier and wider experimentation to come up with a good vendor-neutral uAPI. On the other hand we don’t want to end up stuck with vendor-specific user-space and no generic API. Everybody had a different opinion on this topic, so that made for an interesting discussion. At the end, we agreed that we can merge vendor-specific color management properties on the condition that the uAPI is documented as unstable, hidden behind an experimental build option and a kernel parameter. This should allow for more testing while avoiding the pitfalls of hardcoding chunks of vendor-specific code in each compositor.
Things got really interesting when we discussed about long-term plans. We want to design some kind of vendor-neutral API that compositors can use to program the color pipeline in GPUs. Other platforms (e.g. Android) typically provide a descriptive API: compositors set the color space and other related parameters for the source and the destination, and the driver comes up with the appropriate hardware configuration. However there are multiple ways of doing color conversions (e.g. gamut and tone mapping). Each way will give a different result. This will result in glitches when switching between OpenGL/Vulkan and KMS offloading. Unfortunately the switches can happen pretty often, e.g. when a notification comes in, or when a window is moved around. Another issue is that the descriptive API doesn’t give full control to the compositors, thus compositors cannot come up with novel color pipelines. We’ve decided that for KMS a prescriptive API would be superior: drivers expose a list of available hardware blocks (mathematical operations like look-up tables and matrices), then user-space directly programs each hardware block. The tricky part is coming up with a good API which fits all hardware, present and future. It would seem like this design would work well for AMD and Intel hardware, but NVIDIA GPUs are more opinionated and have hardware blocks converting between two fixed color spaces and cannot be disabled. We decided that it would be reasonable to expose these fixed hardware blocks to user-space as well just like any other hardware block. I will soon send an RFC to the dri-devel mailing list with more details about our API proposal.
Since the kernel API would just expose what the hardware can do (very much like KMS planes), user-space will need to translate its color pipeline to the hardware, and fallback to shaders if it cannot leverage the hardware. We plan to establish a common user-space library (similar to libliftoff) to help offload color pipelines to KMS.
Throughout the days we had various other discussions, for instance about testing or about new features we’d like KMS to have. The detailed notes should get published soon if you’re interested.
You can probably tell, we didn’t write much code during this hackfest. We just talked together the whole time. Everyone was very passionate and invested in the topics we discussed. The hackfest was very exhausting, by 5 PM the discussions were a lot slower. However that effort payed off, and we’ve made great progress! We now have a clear path forward, and I can’t wait to see the fruits of the hackfest materialize in various Git repositories. Many thanks to Carlos for organizing everything, and to Red Hat + Collabora for sponsoring the event!
Yep, it’s all merged. That means if your driver supports VK_EXT_shader_object, you can finally enjoy Tomb Raider (2013) without any issues.
NVIDIA has just released a new beta driver that I need to test, but I’m hopeful it will no longer crash when trying to use this extension.
Remember that time I mentioned how zink wasn’t allowed to use VK_EXT_vertex_input_dynamic_state on AMDVLK?
We all got a chuckle, and it was sad/funny, and nobody was surprised, but the funnier part is some code that’s been in zink for much longer. Almost a year, in fact:
if (screen->info.have_EXT_extended_dynamic_state) {
if (screen->info.have_EXT_extended_dynamic_state2) {
if (screen->info.have_EXT_extended_dynamic_state3) {
if (screen->info.have_EXT_vertex_input_dynamic_state)
dynamic = ZINK_DYNAMIC_VERTEX_INPUT;
else
dynamic = ZINK_DYNAMIC_STATE3;
} else {
if (screen->info.have_EXT_vertex_input_dynamic_state)
dynamic = ZINK_DYNAMIC_VERTEX_INPUT2;
else
dynamic = ZINK_DYNAMIC_STATE2;
}
} else {
dynamic = ZINK_DYNAMIC_STATE;
}
} else {
dynamic = ZINK_NO_DYNAMIC_STATE;
}
This is the conditional for enabling dynamic state usage in zink using an enum. As we can see, the only time VK_EXT_vertex_input_dynamic_state is enabled is if either VK_EXT_extended_dynamic_state2 or VK_EXT_extended_dynamic_state3 are also enabled. This cuts down on the number of codepaths that can be used by drivers, which improves performance and debuggability.
AMD’s drivers don’t yet support VK_EXT_extended_dynamic_state3, as anyone can see from gpuinfo. They do, however, support VK_EXT_extended_dynamic_state2, so the driver-side disablement of VK_EXT_vertex_input_dynamic_state does have some effect.
Not.
Way, way back, just over a year ago, I was doing some testing on AMD’s drivers. One thing I noticed was that trying to run any of the GLES CTS caselists on these drivers caused GPU hangs, so I stopped running those, leaving me with just GL4.6 CTS.
And what happened when I enabled VK_EXT_extended_dynamic_state2 there, you ask? Test failures. Lots of test failures.
Thus, AMD got the cone of shame: a driver workaround to explicitly disable this extension.
In conclusion, we all had a good chuckle about AMD blocking zink from using VK_EXT_vertex_input_dynamic_state, but…
Well, there’s nothing in this story that we didn’t already expect.
On another topic, I’ve been doing some per-app tracking in zink. Specifically tracking games that don’t work great. If you know of other games with issues, post them there.
But there hasn’t been any sort of canonical list for games that do work great on zink, which leads to a lot of confusion about how useful it is for gaming.
Thus, I’ve created the GAMES THAT WORK tracking ticket. If you’ve played a game on zink and it works, post about it. If you want to know whether a game works, check that ticket and maybe it’ll be updated enough to be useful.
Remember: I don’t play video games, so I can’t fill out this table on my own. The only way I know if a game works is if I spend a comprehensive amount of time benchmarking or debugging it, which is the only reason I have 300+ hours logged in the benchmark mode of Tomb Raider (2013).
Things have been quiet on the surface lately, with nothing big in development.
But there is big work underway. It’s an extremely secret project (not Half Life 3) that (or Portal 3) I’m hoping (or any Valve-related title) can be (it’s not even a game) brought into the light (it’s driver-related) within the next week (involving features) or two (and perf).
I can’t say more on this topic.
Don’t even bother asking.
It’s too secret.
What I can say is that it’s been in development for almost a month. And we all know how much time a month is when it comes to SGC speed.
Cannot believe it has been years since my last update here!
There are two things that I would like to tell people about:
The first is that I no longer work at Collabora. It has been almost 13 years full of excitement and recently I came to believe that I wanted a proper change.
They are great folks to work with, so if you are thinking of a career change and want to do open-source stuff upstream, I recommend you to consider them.
And the other topic is what I have been working on lately: a free software driver for the NPUs that VeriSilicon sells to SoC vendors.
tomeu@arm-64:~/tensorflow/build/examples/label_image$ SMALLER_SOFTMAX=1 RUSTICL_ENABLE=etnaviv LD_LIBRARY_PATH=/home/tomeu/opencl/lib LIBGL_DRIVERS_PATH=/home/tomeu/opencl/lib/dri/ ./label_image --gpu_backend=cl --use_gpu=true --verbose 1 --tflite_model ../../../assets/mobilenet_quant_v1_224.tflite --labels ../../../assets/labels.txt --image ../../../assets/grace_hopper.bmp --warmup_runs 1 -c 1
[snip]
INFO: invoked
INFO: average time: 1261.99 ms
INFO: 0.666667: 458 bow tie
INFO: 0.294118: 653 military uniform
INFO: 0.0117647: 835 suit
INFO: 0.00784314: 611 jersey
INFO: 0.00392157: 922 book jacket
That is TensorFlow Lite's OpenCL delegate detecting objects with Etnaviv from Grace Hopper's portrait in military uniform.
Many years ago, when I was working on the operating system for the One Laptop Per Child project, I became painfully aware of the problems derived by IP vendors not providing the source code for their drivers.
This and other instances of the same problem motivated me to help out on the Panfrost project, writing a free software driver for the Mali GPUs by Arm. That gave me a great opportunity to learn about reverse engineering from Alyssa Rosenzweig.
Nowadays the Mesa project contains drivers for most GPUs out there, some maintained by the same companies that develop the IP, some by their customers and hobbyists alike. So the problem of the availability of source code for GPU drivers is pretty much solved.
Only that, with the advent of machine learning in the edge, we are reliving this problem with the drivers for accelerating those workloads with NPUs, TPUs, etc.
Vivante's NPU IP is very closely based on their GPUs. And it is pretty popular, being included in SoCs by Amlogic, Rockchip, NXP, Broadcom and more.
We already have a reasonably complete driver (Etnaviv) for their GPU IP, so I started by looking at what the differences were and how much of the existing userspace and kernel drivers we could reuse.
The kernel driver works with almost no changes, just took me some time to implement the hardware initialization properly in upstream. As of Linux 6.3 the driver loads correctly on Khadas' VIM3, but for a chance at decent performance this patch is needed:
[PATCH] arm64: dts: VIM3: Set the rates of the clocks for the NPU
Due to its experimental status, it is disabled by default in the device tree. To enable it, add the below to arch/arm64/boot/dts/amlogic/meson-g12b-a311d-khadas-vim3.dts:
&npu {
status = "okay";
};
Enabling Etnaviv for other boards with this IP should be relatively straightforward, by describing how the HW is initialized by inspecting the downstream kernel sources for the board in question.
Mesa has seen most of the work, as this IP is compute-only and the userspace driver only targeted OpenGL ES.
First step was wiring up the existing driver to Mesa's OpenCL implementation, and then I focused on getting the simplest kernel to correctly run. For this and all the subsequent work, the reverse-engineering tools used by the Etnaviv community have been of great use.
At that point I had to pause the work to focus on other unrelated stuff, but Collabora's Italo Nicola and Faith Ekstrand did great work to extend the existing compiler to generate OpenCL kernels.
Once I didn't have a day job getting in the way anymore, I started adding the features needed to run the label_image example in TensorFlow Lite.
And eventually we got to this point. 1.2 seconds to run that inferrence is a lot of time, so the next steps for me will be to figure out what are the biggest causes for the low performance.
With the goal in mind of providing a free software driver that companies can use to run inferrence on their products containing Vivante's NPU IP, I need for those tasks to be performanced at at least the same order of magnitude as the closed source solution provided by Vivante.
Right now Etnaviv is about twice as slow as running label_image with the OpenCL delegate on Vivante's driver, but the solution that they provide uses a special delegate that is able to better use their hardware is several times faster.
Current performance situation (label_image):
The plan is to first see why we are slower with the OpenCL delegate and fix it, and afterwards the real fun stuff will start: seeing how we can use more of the HW capabilities through the OpenCL API and with upstream TensorFlow Lite.
Italo is cleaning up an initial submission for inclusion in Mesa upstream. Once that is done I will rebase my branch and start submitting features.
In parallel to upstreaming, I will be looking at what is needed to get closer to the performance of the closed source driver, for ML acceleration.
There is a lot of people besides the ones mentioned above that have made this possible. Some of they are:
Last but not least, there are some individuals to whom I was able to turn when I needed help:
F38 just released and seeing a bunch of people complain that TF2 dies on AMD or other platforms when lavapipe is installed. Who's at fault? I've no real idea. How to fix it? I've no real idea.
AMD OpenGL drivers use LLVM as the backend compiler. Fedora 38 updated to LLVM 16. LLVM 16 is built with c++17 by default. C++17 introduces new "operator new/delete" interfaces[1].
TF2 ships with it's own libtcmalloc_minimal.so implementation, tcmalloc expects to replace all the new/delete interfaces, but the version in TF2 must not support or had incorrect support for the new align interfaces.
What happens is when TF2 probes OpenGL and LLVM is loaded, when DenseMap initializes, one "new" path fails to go into tcmalloc, but the "delete" path does, and this causes tcmalloc to explode with
"src/tcmalloc.cc:278] Attempt to free invalid pointer"
I'll talk to Valve and see if we can work out something, LLVM 16 doesn't seem to support building with C++14 anymore. I'm not sure if static linking libstdc++ into LLVM might avoid the tcmalloc overrides, it might not also be acceptable to the wider Fedora community.
I escaped my RADV pen once again. I know, it’s been a while, but every so often my handler gets distracted and I sprint for the fence with all my might.
This time I decided to try out a shiny new Intel Arc A770 that was left by my food trough.
The Dota2 performance? Surprisingly good. I was getting 100+ FPS in all the places I expected to have good perf and 4-5 FPS in all the places I expected to have bad perf.
The GL performance? Also surprisingly good*. Some quick screenshots:
DOOM 2016 on Iris:
Playable.
And of course, this being a zink blog, I have to do this:
The perf on zink is so good that the game thinks it’s running on an NVIDIA GPU.
If anyone out there happens to be a prominent hardware benchmarking site, this would probably be an interesting comparison on the upcoming Mesa 23.1 release.
I’ve seen a lot of people with AMD hardware getting hit by the Fedora 38 / LLVM 16 update crash. While this is unfortunate, and there’s nothing that I, a simple meme connoisseur, can do about it, I have some simple workarounds that will enable you to play all your favorite games without issue:
rm /usr/share/vulkan/icd.d/lvp_icd.*)
MESA_LOADER_DRIVER_OVERRIDE=zink %command% in your game’s launch optionsI realize the latter suggestion seems meme-adjacent, but so long as you’re on Mesa 23.1-rc2 or a recent git build, I doubt you’ll notice the difference for most games.
You can’t run a full desktop on zink yet, but you can now play most games at native-ish performance. Or better!
I’ve made mention of Big Triangle a number of times on the blog. Everyone’s had a good chuckle, and we’re all friends so we know it’s an inside joke.
But what if I told you I was serious each and every time I said it?
What if Big Triangle really does exist?
I know what you’re thinking: Mike, you’re not gonna get me again. You can’t trick me this time. I’ve seen this coming for—
CHECK OUT THESE AMDVLK RELEASE NOTES!
Incredible, they’re finally supporting that one extension I’ve been saying is crucial for having good performance. Isn’t that ama—
And they’ve even added an app profile for zink! I assume they’re going to be slowly rolling out all the features zink needs in a controlled manner since zink is a known good-citizen when it comes to behaving within the boundaries of—
…
There are plans for nouveau to support using the NVIDIA supplied GSP firmware in order to support new hardware going forward
The nouveau project doesn't have any input or control over the firmware. NVIDIA have made no promises around stable ABI or firmware versioning. The current status quo is that NVIDIA will release versioned signed gsp firmwares as part of their driver distribution packages that are version locked to their proprietary drivers (open source and binary). They are working towards allowing these firmwares to be redistributed in linux-firmware.
The NVIDIA firmwares are quite large. The nouveau project will control the selection of what versions of the released firmwares are to be supported by the driver, it's likely a newer firmware will only be pulled into linux-firmware for:
This should at least limit the number of firmwares in the linux-firmware project.
However a secondary effect of the size of the firmwares is that having the nouveau kernel module at more and more MODULE_FIRMWARE lines for each iteration will mean the initramfs sizes will get steadily larger on systems, and after a while the initramfs will contain a few gsp firmwares that the driver doesn't even need to run.
To combat this I've looked into adding some sort of module grouping which dracut can pick one out off.
It currently looks something like:
MODULE_FIRMWARE_GROUP_ONLY_ONE("ga106-gsp");
MODULE_FIRMWARE("nvidia/ga106/gsp/gsp-5258902.bin");
MODULE_FIRMWARE("nvidia/ga106/gsp/gsp-5303002.bin");
MODULE_FIRMWARE_GROUP_ONLY_ONE("ga106-gsp"); This group only one will end up in the module info section and dracut will only pick one module from the group to install into the initramfs. Due to how the module info section is constructed this will end up picking the last module in the group first.
The dracut MR is:
https://github.com/dracutdevs/dracut/pull/2309
The kernel one liner is:
https://lore.kernel.org/all/20230419043652.1773413-1-airlied@gmail.com/T/#u
Hi!
In the last month I’ve continued working on go-imap v2. I’ve written the
server side, implemented an in-memory server backend, and spent quite a bit of
time fixing issues reported by imaptest. I only have a handful of test
failures, most of which due to \Recent being unimplemented on purpose because
it’s been removed from the new IMAP4rev2 RFC. The end result is a much more
correct and reliable server implementation compared to v1. I’ve pushed some
incremental improvements for the client side as well, fixing compatibility
issues with servers in the wild and adding a few more extensions. Next, I’d
like to explore server-side command pipelining and fix the remaining issues
related to unilateral updates.
In other news, I’ve (finally!) released new versions of soju and goguma. soju v0.6.0 adds a database message store, a new sojuctl utility, external authentication support, and many more improvements. goguma v0.5.0 adds image previews, UnifiedPush support, performance improvements, and new IRCv3 extensions. Since the goguma release I’ve also implemented previews for Web pages.
While we’re on the topic of new releases, there is one more piece of software which got its version bumped this month: hut v0.3.0 adds pagination, improved Web hooks support, a few new sub-commands and other quality-of-life improvements. Thanks a lot to Thorben Günther for their numerous contributions!
The NPotM is yojo. I’ve already
written two tools to integrate builds.sr.ht with other code forges, so here’s a
third one focused on Forgejo/Gitea. It’s pretty similar to hottub, a public
instance is available for Codeberg integration. It
doesn’t support pull requests yet, patches welcome! While working on yojo I got
once again annoyed by golang.org/x/oauth2 so I started working on a simpler
alternative creatively called go-oauth2.
Last but not least, after days of battling with the Pixman API, I’ve managed to finish up my new renderer API for wlroots. I’m excited about it because the next step is to lay down the first bricks of the color management infrastructure. My plan is to work on basic support for per-output ICC profiles, then go from there. I’ll be participating in Red Hat’s HDR hackfest next week, I hope the discussions with the rest of the stakeholders (compositor and driver developers) can help us move this forward!
That’s all for April, see you next month!
As I mentioned a week or three ago, I deleted comments on the blog because (apparently) the widget was injecting ads. My b. I wish I could say the ad revenue was worth it, but it wasn’t.
With that said, I’m looking at ways to bring comments back. I’ve seen a number of possibilities, but none have really grabbed me:
If anyone has other ideas, post here about it.
EDIT: Thanks to a brilliant suggestion by the other Rhys Perry, I’ve activated giscus for comments. Took about 2 mins. Does it work? We’ll see.
After this boring procedural opening, let’s get to something exciting that nobody blogs about: shader linking.
What is shader linking? Shader linking is the process by which shaders are “linked” together to match up I/O blocks and optimize the runtime. There’s a lot of rules for what compilers can and can’t do during linking, and I’m sure that’s all very interesting, and probably there’s someone out there who would want to read about that, but we’ll save that topic for another day. And another blog.
I want to talk about one part of linking in particular, and that’s interface matching. Let’s check out some Vulkan spec text:
15.1.3. Interface Matching
An output variable, block, or structure member in a given shader stage has an interface match with
an input variable, block, or structure member in a subsequent shader stage if they both adhere to
the following conditions:
• They have equivalent decorations, other than:
◦ XfbBuffer, XfbStride, Offset, and Stream
◦ one is not decorated with Component and the other is declared with a Component of 0
◦ Interpolation decorations
◦ RelaxedPrecision if one is an input variable and the other an output variable
• Their types match as follows:
◦ if the input is declared in a tessellation control or geometry shader as an OpTypeArray with
an Element Type equivalent to the OpType* declaration of the output, and neither is a structure
member; or
◦ if the maintenance4 feature is enabled, they are declared as OpTypeVector variables, and the
output has a Component Count value higher than that of the input but the same Component Type;
or
◦ if the output is declared in a mesh shader as an OpTypeArray with an Element Type equivalent
to the OpType* declaration of the input, and neither is a structure member; or
◦ if the input is decorated with PerVertexKHR, and is declared in a fragment shader as an
OpTypeArray with an Element Type equivalent to the OpType* declaration of the output, and
neither the input nor the output is a structure member; or
◦ if in any other case they are declared with an equivalent OpType* declaration.
• If both are structures and every member has an interface match.
Fascinating. Take a moment to digest.
Once again that’s all very interesting, and probably there’s someone out there who wanted to read about that, but this isn’t quite today’s topic either.
Today’s topic is this one line a short ways below:
Shaders can declare and write to output variables that are not declared or read by the subsequent stage.
This allows e.g., a vertex shader to write an output variable that a fragment shader doesn’t read. Nobody has ever seen a problem with this in Vulkan. The reason is pipelines. Yes, that concept about which Khronos has recently made questionable statements, courtesy of Nintendo, based on the new VK_EXT_shader_object extension. In a pipeline, all the shaders get linked, which means the compiler can delete these unused variables. Or, if not delete, then it can at least use the explicit location info for variables to ensure that I/O is matched up properly.
And because of pipelines, everything works great.
But what happens if pipelines/linking go away?
Everyone saw this coming as soon as the blog loaded. With shader objects (and GPL fastlink), it now becomes possible to create unlinked shaders with mismatched outputs. The shader code is correct, the Vulkan API usage to create the shaders is correct, but is the execution still going to be correct?
Right. CTS. So let’s check…
Okay, there’s no public CTS available for VK_EXT_shader_object yet, but I’m sure it’s coming soon.
I have access to the private CTS repos, and I can see that there is (a lot of) CTS for this extension, which is a relief, and obviously I already knew this since lavapipe has passed everything, and I’m sure there must be testing for shader interface mismatches either there or in the GPL tests.
Sure, maybe there’s no tests for this, but it must be on the test plan since that’s so comprehensive.
Alright, so it’s not in the test plan, but I can add it, and that’s not a problem. In the meanwhile, since zink needs this functionality, I can just test it there, and I’m sure it’ll work fine.
It’s more broken than AMD’s VK_EXT_robustness2 handling, but I’m sure it’ll be easy to fix.
It’s nightmarishly difficult, and I wasted an entire day trying to fix nir_assign_io_var_locations, but I’m sure only lavapipe uses it.
The Vulkan drivers affected by this issue:
Basically everyone except ANV. But also maybe ANV since the extension isn’t implemented there. And probably all the proprietary drivers too since there’s no CTS.
Great.
nir_assign_io_var_locations works like this:
Location decorationThis results in a well-ordered list of variables with proper indexing that should match up both on the input side and the output side.
Except no, not really.
Consider the following simple shader interface:
vertex shader
layout(location = 0) in highp vec4 i_color;
layout(location = 0) out highp vec4 o_color;
void main()
{
gl_Position = vec4(some value);
o_color = i_color;
}
fragment shader
layout(location = 0) in highp vec4 i_color;
layout(location = 0) out highp vec4 o_color;
void main()
{
o_color = i_color;
}
We expect that the vertex attribute color will propagate through to the fragment output color, and that’s what happens.
Vertex shader outputs:
o_color, driver_location=0Fragment shader inputs:
i_color, driver_location=0Let’s modify it slightly:
vertex shader
layout(location = 0) in highp vec4 i_color;
layout(location = 0) out highp vec2 o_color;
layout(location = 2) out highp vec2 o_color2;
void main()
{
gl_Position = vec4(some value);
o_color = i_color;
}
fragment shader
layout(location = 0) in highp vec2 i_color;
layout(location = 2) in highp vec2 i_color2;
layout(location = 0) out highp vec4 o_color;
void main()
{
o_color = vec4(i_color.xy, i_color2.xy);
}
Vertex shader outputs:
o_color, driver_location=0o_color2, driver_location=1Fragment shader inputs:
i_color, driver_location=0i_color2, driver_location=1No problems yet.
But what about this:
vertex shader
layout(location = 0) in highp vec4 i_color;
layout(location = 0) out highp vec2 o_color;
layout(location = 1) out highp vec4 lol;
layout(location = 2) out highp vec2 o_color2;
void main()
{
gl_Position = vec4(some value);
o_color = i_color;
lol = vec4(1.0);
}
fragment shader
layout(location = 0) in highp vec2 i_color;
layout(location = 2) in highp vec2 i_color2;
layout(location = 0) out highp vec4 o_color;
void main()
{
o_color = vec4(i_color.xy, i_color2.xy);
}
In a linked pipeline this works just fine: lol is optimized out during linking since it isn’t read by the fragment shader, and location indices are then assigned correctly. But in unlinked shader objects (and with non-LTO EXT_graphics_pipeline_library), there is no linking. Which means lol isn’t optimized out. And what happens once nir_assign_io_var_locations is run?
Vertex shader outputs:
o_color, driver_location=0lol, driver_location=1o_color2, driver_location=2Fragment shader inputs:
i_color, driver_location=0i_color2, driver_location=1Tada, now the shaders are broken.
Hopefully there will be some, but at present I’ve had to work around this issue in zink by creating multiple separate shader variants with different locations to ensure everything matches up.
I made an attempt at fixing this, but it was unsuccessful. I then contacted the great Mesa compiler sage, Timothy Arceri, and he provided me with a history lesson from The Before Times. Apparently this NIR pass was originally written for GLSL and lived in mesa/st. Then Vulkan drivers wanted to use it, so it was moved to common code. Since all pipelines were monolithic and could do link-time optimizations, there were no problems.
But now LTO isn’t always possible, and so we are here.
It seems to me that the solution is to write an entirely new pass for Vulkan drivers to use, and that’s all very interesting, and probably there’s someone out there who wants to read about that, but this is the end of the post.
Just a quick post to sum up all the new features and things to watch for in zink for 23.1:
ARB_separate_shader_objects support
GL_QUADS natively suppored
EXT_multisample_render_to_texture now uses VK_EXT_multisampled_render_to_single_sampledEXT_descriptor_buffer is now the default for descriptor handlingNV_compute_shader_derivatives support
ZINK_DEBUG options for debugging
Has anyone else heard that Alyssa is going to Dyson to work on some new vaccuum tech? This is upending everything I thought I knew, but the source seems credible.
Today is my last day at Collabora and my last day leading the Panfrost driver.
It’s been a wild ride.
In 2017, I began work on the chai driver for Mali T (Midgard). chai would later be merged into Lyude Paul’s and Connor Abbott’s BiOpenly project for Mali G (Bifrost) to form Panfrost.
In 2019, I joined Collabora to accelerate work on the driver stack. The initial goal was to run GNOME on a Mali-T860 Chromebook.

Today, Panfrost supports a broad spectrum of Mali GPUs, conformant to the OpenGL ES 3.1 specification on Mali-G52 and Mali-G57. It’s hard to overstate how far we’ve come. I’ve had the thrills of architecting several backend shader compilers as well as the Gallium-based OpenGL driver, while my dear colleague Boris Brezillon has put together a proof-of-concept Vulkan driver which I think you’ll hear more about soon.
Lately, my focus has been ensuring the project can stand on its own four legs. I have every confidence in other Collaborans hacking on Panfrost, including Boris and Italo Nicola. The project has a bright future. It’s time for me to pass the reins.
I’m still alive. I plan to continue working on Mesa drivers for a long time, including the common infrastructure upon which Panfrost relies. And I’ll still send the odd Panfrost patch now and then. That said, my focus will shift.
I’m not ready to announce what’s in store yet… but maybe you can read between the lines!
Another week, more blog posts is what I meant to say when I started writing this post last Friday. But now it’s Monday, and everything is different.
In particular, zink is different. There’s a branchpoint coming up, and I’ll do a separate post about that and all the new features people can expect, but today’s topic is something else. Something more exciting.
Obviously it’s EXT_shader_object.
who is excited about this extension is me.
That’s right, I said it.
For years now, Tomb Raider (2013) has plagued zink users with separate shader objects that could not be precompiled even with EXT_graphics_pipeline_library. Why? Because the game uses tessellation. And when I suggested we’d probably want that in EXT_graphics_pipeline_library, someone said “oh we can just add that later, it’ll be easy”, and then since it’s Vulkan it wasn’t easy and it didn’t get added.
But then Nintendo came along and solved this problem for me in a much, much better way with EXT_shader_object.
The thing about OpenGL is that ARB_separate_shader_objects is a thing, and it’s a thing for every shader stage. Even if 99% of apps/games only use VS+FS, there’s still that 1% that wants to use it with those other geometry stages.
Like Tomb Raider (2013). And yes, the (2013) is necessary so nobody imagines I’m talking about a more recent, more relevant game.
Some months ago, I implemented basic separate shaders (VS+FS only) using EXT_graphics_pipeline_library. It’s gross. Really just not an ideal way of doing things when mapping to GL. Effectively each stage gets its own mini GPL pipeline which then gets combined on-the-fly for a couple frames of use to avoid stuttering until the real pipeline is done with its background compile.
But this is stupid. The GL architecture is for separate shaders, not for just-in-time linking; we leave the linking under the hood to screw us over when it doesn’t work right so we can complain. It’s a solved problem in that regard. Making this explicit and mapping from one to the other needs all kinds of refcounting, and hash tables, and complexity, and the fact that it works at all is a miracle that science can’t explain.
Now, however, there is a direct 1:1 mapping to separate shaders with EXT_shader_object. If the app compiles a shader, zink compiles that shader (object). If the app binds a shader, zink binds that shader (object). It’s that simple. And then in the background I can still do all the optimized monolithic pipeline compiling like usual to guarantee huge FPS the next time that group of shaders is used together.
Finally this one problem game will run without any frame hitching or other issues.
As soon as drivers besides NVIDIA implement it, of course. Thanks NVIDIA for your great Day 1 support of this great extension that solves…
Of this great extension…
Of…
Oh for fuck’s sake.
This game will never run without issues on zink. I’m over it. But you know what I’m not over yet?
This totally unexpected news that Panfrost is now without firm leadership and Alyssa is now without gainful employment. How could such a thing happen?
As everyone who’s anyone in the graphics community knows, SGC is the first place to receive any hiring-related rumors. It was here that the news first broke about Valve hiring some nutjob to work on zink. It was also here that everyone learned Broadcom, Bose, the entire multiverse, and Collabora were all vying to hire the five-time winner of Mesa’s Most Loudest Keyboard On Conference Call award (And yes, I can hear her clacking away towards a sixth win right now).
That’s right. It’s been a while, but I’ve got another scoop. And this one’s big. I couldn’t even believe it when I stumbled upon this, and I’m sure many of you won’t either. That’s why I’m gonna tell you, and then I’m gonna spell it out for you.
Alyssa has been hired by Boston Dynamics to work on driver-level computer vision integration in their robotics systems.
It just makes sense if you stop and think about it. Or if you re-read her last blog post in which she basically spells it out for us:
So yeah, nice try, but you’ll need to put a lot more effort into covering your tracks if you want to conceal your job hops from SGC.
Stay tuned for the crucial details everyone craves on the new Panfrost project leader: do they put toothpaste on their toothbrush before or after wetting the bristles?
As everyone is well aware, the Mesa 23.1 branchpoint is definitely going to be next week, and there is zero chance that it could ever be delayed*.
As everyone is also well aware, this is the release in which I’ve made unbreakable* promises about the viability of gaming on Zink.
Specifically, it will now be viable*.
But exactly one* bug remains as a blocker to that. Just one.
So naturally I had to fix it quick before anyone noticed*.
* Don’t @me for factual inconsistencies in any of the previous statements.
The thing about OpenGL games is a lot of them are x86 binaries, which means they run in a 32bit process. Any 32bit application gets 32bit address space. 32bit address space means a 4GiB limit on addressable memory. But what does that mean?
What is addressable memory? Addressable memory is any memory that can be accessed by a process. If malloc is called, this memory is addressable. If a file is mmaped, this memory is addressable. If GPU memory is mapped, this memory is addressable.
What happens if the limit is exceeded? Boom.
Why is the limit only 4GiB? Stop asking hard questions.
Why is this difficult? The issue from a driver perspective is that this limit includes both the addressable memory from the game (e.g., the game’s internal malloc calls) as well as the addressable memory from the driver (e.g., all the GPU mapped memory). Thus, while I would like to have all 4GiB (or more, really; doesn’t everyone have 32GiB RAM in their device these days?) to use with Zink, I do not have that luxury.
Judging by recent bug reports and the prevalance on 32bit games, it’s pretty bad. Given that I solved GPU map VA leaking a long time ago, the culprit must be memory utilization in the driver. Let’s check out some profiling.
The process for this is simple: capture a (long) trace from a game and then run it through massif. Sound familiar?
The game in this case is, of course, Tomb Raider (2013), the home of our triangle princess. Starting a new game runs through a lot of intro cinematics and loads a lot of assets, and the memory usage is explosive. See what I did there? Yeah, jokes. On a Monday. Whew I need a vacation.
This is where I started:
2.4 GiB memory allocated by the driver. In a modern, 64bit process, where we can make full use of the 64GiB memory in the device, this is not a problem and we can pretend to be a web browser using this much for a single tab. But here, from an era when memory management was important and everyone didn’t have 128GiB memory available, that’s not going to fly.
Initial analysis yielded the following pattern:
n3: 112105776 0x5F88B73: nir_intrinsic_instr_create (nir.c:759)
n1: 47129360 0x5F96216: clone_intrinsic (nir_clone.c:358)
n1: 47129360 0x5F9692E: clone_instr (nir_clone.c:496)
n1: 47129360 0x5F96BB4: clone_block (nir_clone.c:563)
n2: 47129360 0x5F96DEE: clone_cf_list (nir_clone.c:617)
n1: 46441568 0x5F971CE: clone_function_impl (nir_clone.c:701)
n3: 46441568 0x5F974A4: nir_shader_clone (nir_clone.c:774)
n1: 28591984 0x67D9DE5: zink_shader_compile_separate (zink_compiler.c:3280)
n1: 28591984 0x69005F8: precompile_separate_shader_job (zink_program.c:2022)
n1: 28591984 0x57647B7: util_queue_thread_func (u_queue.c:309)
n1: 28591984 0x57CD7BC: impl_thrd_routine (threads_posix.c:67)
n1: 28591984 0x4DDB14C: start_thread (in /usr/lib64/libc.so.6)
n0: 28591984 0x4E5BBB3: clone (in /usr/lib64/libc.so.6)
Looking at the code, I found an obvious issue: when I implemented precompile for separate shaders a month or two ago, I had a teensie weensie little bug. Turns out when memory is allocated, it has to be freed or else it becomes unreachable.
This is commonly called a leak.
It wasn’t caught before now because it only affects Tomb Raider and a handful of unit tests.
But I caught it, and it was so minor that I already (“quietly”) landed the fix without anyone noticing.
This sort of thing will be fixed when zink is rewritten in Rust*.
With an actual bug fixed, what does memory utilization look like now?
Down 300MiB to 2.1GiB. A 12.5% reduction. Not that exciting.
Certainly nothing that would warrant a SGC blog post.
My readers have standards.
Time to expand some expandables.
Here’s another common pattern in the massif output:
n4: 317700704 0x57570DA: ralloc_size (ralloc.c:117)
n1: 226637184 0x57583BB: create_slab (ralloc.c:759)
n3: 226637184 0x5758579: gc_alloc_size (ralloc.c:789)
n6: 215583536 0x575868C: gc_zalloc_size (ralloc.c:814)
n7: 91059504 0x5F88CE6: nir_alu_instr_create (nir.c:696)
n4: 35399104 0x5F90C49: nir_build_alu2 (nir_builder.c:162)
n0: 12115376 in 29 places, all below massif's threshold (1.00%)
n1: 11690848 0x67C90F1: nir_iadd (nir_builder_opcodes.h:1309)
n2: 11690848 0x67CB493: nir_iadd_imm (nir_builder.h:719)
n1: 6074016 0x67D691C: remove_bo_access_instr (zink_compiler.c:2013)
n1: 6074016 0x67C89A9: nir_shader_instructions_pass (nir_builder.h:88)
n1: 6074016 0x67D6DB2: remove_bo_access (zink_compiler.c:2044)
n1: 6074016 0x67E4827: zink_shader_create (zink_compiler.c:4409)
n1: 6074016 0x690443E: zink_create_gfx_shader_state (zink_program.c:1885)
n1: 6074016 0x623484B: util_live_shader_cache_get (u_live_shader_cache.c:141)
n1: 6074016 0x69044CC: zink_create_cached_shader_state (zink_program.c:1900)
This is some ralloc usage from zink’s shader creation. In short, the in-memory shader IR is…
Hold on. Doesn’t this sound familiar?
It turns out that nothing is ever new, and all problems have been solved before. By applying the exact same solution, we’re gonna start to see some big movement in these numbers.
Serialized NIR is much more compact than object-form NIR. The memory footprint is an order of magnitude smaller, which begs the question why would anyone ever store NIR structs in memory.
I don’t have an answer. One might try to make the argument that it makes shader variant creation easier, but then, it also needs to be said that shader variants require the NIR to be cloned anyway, which deserialization already (functionally) does. There’s shader_info, but that’s small, unchanging, and can be easily copied. I think it’s just convenience. And that’s fine.
But it’s not fine for me or zink.
Thus, I began converting all the NIR objects I was keeping around (and there’s lots) to serialized form. The first task was tackling zink_shader::nir, the object that exists for every shader created in the driver. How much would this help?
Down another 500MiB to 1.6GiB total. That’s another 24% reduction.
Now we’re getting somewhere.
But again, SGC enthusiasts have standards, and a simple 33% improvement from where things started is hardly worth mentioning here, so I apologize for wasting time.
Continuing, it’s easy to keep finding these patterns:
n1: 64055264 0x57583BB: create_slab (ralloc.c:759)
n2: 64055264 0x5758579: gc_alloc_size (ralloc.c:789)
n6: 61664176 0x575868C: gc_zalloc_size (ralloc.c:814)
n2: 22299104 0x5F88CE6: nir_alu_instr_create (nir.c:696)
n1: 19814432 0x60B3804: read_alu (nir_serialize.c:905)
n1: 19814432 0x60B6713: read_instr (nir_serialize.c:1787)
n1: 19814432 0x60B69BD: read_block (nir_serialize.c:1856)
n1: 19814432 0x60B6D6A: read_cf_node (nir_serialize.c:1949)
n2: 19814432 0x60B6EA0: read_cf_list (nir_serialize.c:1976)
n1: 19195888 0x60B708A: read_function_impl (nir_serialize.c:2012)
n1: 19195888 0x60B7C2A: nir_deserialize (nir_serialize.c:2219)
n2: 19195888 0x67E754A: zink_shader_deserialize (zink_compiler.c:4820)
n2: 19195888 0x6901899: zink_create_gfx_program (zink_program.c:1041)
n1: 17921504 0x6901C6C: create_linked_separable_job (zink_program.c:1105)
n1: 17921504 0x57647B7: util_queue_thread_func (u_queue.c:309)
n1: 17921504 0x57CD7BC: impl_thrd_routine (threads_posix.c:67)
n1: 17921504 0x4DDB14C: start_thread (in /usr/lib64/libc.so.6)
n0: 17921504 0x4E5BBB3: clone (in /usr/lib64/libc.so.6)
This one is from the NIR copy that happens when linking shaders. Simple enough to compress.
New graph:
An additional 37.5% reduction to 1.0GiB? That’s not too shabby. Now we’re looking at an overall 58% reduction in memory utilization. This is the kind of improvement that SGC readers have come to expect.
But wait! I was doing all this last week. And the start of this post was a really long time ago, but wasn’t there something else causing high memory utilization last week?
That’s right, these graphs are still being hit by the now-fixed RADV shader IR ballooning.
What hap
What happens if I apply that fix too?
482.7MiB total memory usage.
That’s another 51.7% improvement.
Overall a 79.9% reduction in memory usage. I’d expect similar (or greater?) savings for all games.
The MR is up now, and I expect it should be merged soon™.
Doesn’t this negatively affect performance?
No.
But doesn’t using more memory improve performance?
No.
What will I do with the rest of my 256GiB RAM?
Open two more browser tabs.