Remember when this blog was about pointlessly optimizing things? I’m talking about taking vkGetDescriptorSetLayoutSupport and making it fast. The kinds of optimizations nobody asked for and potentially nobody even wanted.
Well good news: this isn’t a post about those types of optimizations.
This is a post where I’m gonna talk about some speedups that you didn’t even know you craved but now that you know they exist you can’t live without.
Lots of people are asking how queue submission works. Surely nobody reading this blog needs to ask, since you’re all experts. But if you have a friend who wants to know, here’s the official resource for all that knowledge. It’s got diagrams. Images with important parts circled. The stuff that means whoever wrote it knew what they were talking about.
The thing this “official” resource doesn’t tell you is the queue is potentially pretty slow. You chuck some commands into it, and then you wait on your fence/semaphore, but the actual time it takes to perform queue submission is nonzero. In fact, it’s quite a bit larger than zero. How large is it? you might be asking.
I didn’t want to do this, but you’ve forced my hand.
What if I told you there was a tool for measuring things like this. A tool for determining the cost of various Vulkan operations. For benchmarking, one might say.
That’s right, it’s time to yet again plug vkoverhead, the best and only tool for doing whatever I’m about to do.
Like a prophet, my past self already predicted that I’d be sitting here writing this post to close out a week of types_of_headaches.meme -> vulkan_headaches.meme. That’s why vkoverhead already has the -submit-only option in order to run a series of benchmark cases which have numbers that are totally not about to go up.
Let’s look at those cases now to fill up some more page space and time travel closer to the end of my workweek:
- submit_noop: submits nothing. There’s no semaphores, no cmdbufs, it just submits and returns in order to provide a baseline
- submit_50noop: submits nothing 50 times, which is to say it passes 50x VkSubmitInfo structs to vkQueueSubmit (or the 2 versions if sync2 is supported)
- submit_1cmdbuf: submits a single cmdbuf. In theory this should be slower than the noop case, but I hate computers and obviously this isn’t true at all
- submit_50cmdbuf: submits 50 cmdbufs. In theory this should be slower than the single cmdbuf case, and, thankfully, this one particular time in which we have expectations of how computers work does match our expectations
- submit_50cmdbuf_50submit: submits 50 cmdbufs in 50 submits for a total of 50 cmdbufs per vkQueueSubmit call. This is the slowest test, you would think, and I thought that too, and the longer this explanation goes on the more you start to wonder if computers really do work at all like you expect or if this is going to upset you, but it’s Friday, and I don’t have anywhere to be except the gym, so I could keep delaying the inevitable for a while longer, but I do have to get to the gym, so sure, this is totally gonna be way slower than all the other tests, trust me™

It’s a great series of tests which showcase some driver pain points. Specifically it shows how slow submit can be.
Let’s check out some baseline results on the driver everyone loves to hang out with, RADV:
40, submit_noop, 19569683, 100.0%
41, submit_50noop, 402324, 2.1%
42, submit_1cmdbuf, 51356, 0.3%
43, submit_50cmdbuf, 1840, 0.0%
44, submit_50cmdbuf_50submit, 1031, 0.0%
Everything looks like we’d expect. The benchmark results ensmallen as they get more complex.
But why?
Because if you think about it like a smart human and not a dumb pile of “thinking” sand, submitting 50 cmdbufs is submitting 50 cmdbufs no matter how you do it.
Some restrictions apply, signal semaphores blah blah blah, but none of that’s happening here so what the fuck, RADV?
This is where we get into some real facepalm territory. Vulkan, as an API, gives drivers the ability to optimize this. That’s the entire reason why vkQueueSubmit has the submitCount param and takes an array of submits.
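To make the batching concrete, here’s a hedged sketch (plain Vulkan usage on the application side, not Mesa code) of what a batched submission looks like: one vkQueueSubmit2 call carrying an array of VkSubmitInfo2 structs, which is exactly the shape a driver is allowed to exploit.

```c
#include <vulkan/vulkan.h>

/* Sketch only: submit `count` cmdbufs as `count` VkSubmitInfo2 entries in a
 * single vkQueueSubmit2 call, giving the driver one batched submission to
 * work with instead of many. Semaphores and error handling omitted. */
static VkResult submit_batched(VkQueue queue, const VkCommandBuffer *cmdbufs,
                               uint32_t count, VkFence fence)
{
   VkCommandBufferSubmitInfo cb_infos[50];
   VkSubmitInfo2 submits[50];

   for (uint32_t i = 0; i < count && i < 50; i++) {
      cb_infos[i] = (VkCommandBufferSubmitInfo){
         .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_SUBMIT_INFO,
         .commandBuffer = cmdbufs[i],
      };
      submits[i] = (VkSubmitInfo2){
         .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO_2,
         .commandBufferInfoCount = 1,
         .pCommandBufferInfos = &cb_infos[i],
      };
   }
   /* One call, many submits: the driver sees the whole batch at once. */
   return vkQueueSubmit2(queue, count, submits, fence);
}
```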
But what does Mesa do here? Well, in the current code there’s this gem:
   for (uint32_t i = 0; i < submitCount; i++) {
      struct vulkan_submit_info info = {
         .pNext = pSubmits[i].pNext,
         .command_buffer_count = pSubmits[i].commandBufferInfoCount,
         .command_buffers = pSubmits[i].pCommandBufferInfos,
         .wait_count = pSubmits[i].waitSemaphoreInfoCount,
         .waits = pSubmits[i].pWaitSemaphoreInfos,
         .signal_count = pSubmits[i].signalSemaphoreInfoCount,
         .signals = pSubmits[i].pSignalSemaphoreInfos,
         .fence = i == submitCount - 1 ? fence : NULL
      };
      VkResult result = vk_queue_submit(queue, &info);
      if (unlikely(result != VK_SUCCESS))
         return result;
   }
Tremendous. It’s worth mentioning that not only is this splitting the batched submits into individual ones, each submit also allocates a struct to contain the submit info so that the drivers can use the same interface. So it’s increasing the kernel overhead by performing multiple submits and also increasing memory allocations.
We’ve all been here before on SGC, and I really do need to get to the gym, so I’m authorizing a one-time fast forward to the results of optimizing this:
RADV GFX11:
40, submit_noop, 19569683, 100.0%
41, submit_50noop, 402324, 2.1%
42, submit_1cmdbuf, 51356, 0.3%
43, submit_50cmdbuf, 1840, 0.0%
44, submit_50cmdbuf_50submit, 1031, 0.0%
↓
40, submit_noop, 21008648, 100.0%
41, submit_50noop, 4866415, 23.2%
42, submit_1cmdbuf, 51294, 0.2%
43, submit_50cmdbuf, 1823, 0.0%
44, submit_50cmdbuf_50submit, 1828, 0.0%
That’s like 1000% faster for case #41 and nearly 80% faster for case #44.
But how does this affect other drivers? I’m sure you’re asking next. And of course, this being the primary blog for distributing Mesa benchmarking numbers in any given year, I have those numbers.
Lavapipe:
40, submit_noop, 1972672, 100.0%
41, submit_50noop, 40334, 2.0%
42, submit_1cmdbuf, 5994597, 303.9%
43, submit_50cmdbuf, 2623720, 133.0%
44, submit_50cmdbuf_50submit, 133453, 6.8%
↓
40, submit_noop, 1980681, 100.0%
41, submit_50noop, 1202374, 60.7%
42, submit_1cmdbuf, 6340872, 320.1%
43, submit_50cmdbuf, 2482127, 125.3%
44, submit_50cmdbuf_50submit, 1165495, 58.8%
3000% faster for #41 and 1000% faster for #44.
Intel DG2:
40, submit_noop, 101336, 100.0%
41, submit_50noop, 2123, 2.1%
42, submit_1cmdbuf, 35372, 34.9%
43, submit_50cmdbuf, 713, 0.7%
44, submit_50cmdbuf_50submit, 707, 0.7%
↓
40, submit_noop, 106065, 100.0%
41, submit_50noop, 105992, 99.9%
42, submit_1cmdbuf, 35110, 33.1%
43, submit_50cmdbuf, 709, 0.7%
44, submit_50cmdbuf_50submit, 702, 0.7%
5000% faster for #41 and a big 🤕 for #44 because Intel.
Turnip A740:
40, submit_noop, 1227546, 100.0%
41, submit_50noop, 26194, 2.1%
42, submit_1cmdbuf, 1186327, 96.6%
43, submit_50cmdbuf, 545341, 44.4%
44, submit_50cmdbuf_50submit, 16531, 1.3%
↓
40, submit_noop, 1313550, 100.0%
41, submit_50noop, 1078383, 82.1%
42, submit_1cmdbuf, 1129515, 86.0%
43, submit_50cmdbuf, 329247, 25.1%
44, submit_50cmdbuf_50submit, 484241, 36.9%
4000% faster for #41, 3000% faster for #44.
Pretty good, and it somehow manages to still be conformant.
Code here.
If you’re reading, thanks for everything.
I planned to blog about it a while ago, but then I didn’t, and news sites have since broken the news: zink from Mesa main can finally run xservers.
Yes, it’s true. For the first time ever, you can install Mesa (from git) and use zink (with environment variables) to run your entire system (unless you’re on Intel).
But what was so challenging about getting this to work? The answer won’t surprise you.
Fans of the blog know that I’m no fan of WSI. If I had my way, GPUs would render to output buffers that we could peruse at our leisure using whatever methods we had at our disposal. Ideally manual inspection. Alas, few others share my worldview and so we all must suffer.
The root of all evil when it comes to computers is synchronization. This is triply so for anything GPU-related, and when all this “display server” chicanery is added in, the evilness value becomes one of those numbers so large that numerologists are still researching naming possibilities. There are two types of synchronization used with WSI: implicit sync and explicit sync.
From a user perspective, the former has less code to manage. The downside is that on the driver side things become more complex, as implicit sync is effectively layered atop explicit sync.
Another way of looking at it is: implicit sync is the GL model, and explicit sync is the Vulkan model.
And, since xservers run on GL, you can see where this is going.
Don’t get me wrong, explicit sync sucks too, but at least it makes sense. Broadly speaking, with explicit sync you have a dmabuf image, you submit it to the GPU, and you tell the server to display it.
In the words of venerable Xorg developer, EGL maintainer, and synchronization PTSD survivor Daniel Stone, the way to handle implicit sync is “vibes”. You have a dmabuf image, you glFlush, and magically it gets displayed.
Sound nuts? It is, and that’s why Vulkan doesn’t support it.
But zink uses Vulkan, so…
Explicit sync is based on two concepts:
A user of a dmabuf waits on an export operation before using it (i.e., a wait semaphore), then signals an import operation at the end of a cmdbuf submission (i.e., a signal semaphore). Vulkan WSI handles this under the hood for users. But there’s no way to use Vulkan WSI with imported dmabufs, which means this all has to be copy/pasted around to work elsewhere.
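To illustrate the mechanics (a hand-rolled sketch, not zink’s actual code), here’s roughly how a dmabuf’s implicit fences can be turned into a Vulkan wait semaphore using the sync-file ioctl plus VK_KHR_external_semaphore_fd:

```c
#include <sys/ioctl.h>
#include <linux/dma-buf.h>
#include <vulkan/vulkan.h>

/* Sketch: export the dmabuf's implicit fences as a sync file, then import
 * that sync file as a (temporary) binary semaphore payload so it can be used
 * as a wait semaphore on the next submit. */
static VkResult wait_semaphore_from_dmabuf(VkDevice dev, int dmabuf_fd,
                                           VkSemaphore sem,
                                           PFN_vkImportSemaphoreFdKHR import_fd)
{
   struct dma_buf_export_sync_file export = {
      .flags = DMA_BUF_SYNC_RW,
      .fd = -1,
   };
   if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &export) < 0)
      return VK_ERROR_UNKNOWN;

   VkImportSemaphoreFdInfoKHR info = {
      .sType = VK_STRUCTURE_TYPE_IMPORT_SEMAPHORE_FD_INFO_KHR,
      .semaphore = sem,
      .flags = VK_SEMAPHORE_IMPORT_TEMPORARY_BIT,
      .handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_SYNC_FD_BIT,
      .fd = export.fd, /* ownership transfers to the driver on success */
   };
   return import_fd(dev, &info);
}
```

The signal side is the mirror image: export a sync fd from a semaphore signaled by the submit, then DMA_BUF_IOCTL_IMPORT_SYNC_FILE it back into the dmabuf.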
In zink, all that happens in an xserver scenario is apps import/export dmabufs, sample/render them, and then do queue submission. To successfully copy/paste the WSI code and translate this into explicit sync for Vulkan, it’s necessary to be a bit creative with driver mechanics. The gist of it is:

- an export (DMA_BUF_IOCTL_EXPORT_SYNC_FILE) semaphore to be waited on before the current cmdbuf
- an import (DMA_BUF_IOCTL_IMPORT_SYNC_FILE) semaphore to be signaled after the current cmdbuf

Big thanks to Faith “ARB_shader_image_load_store” Ekstrand for god-tier rubberducking when I was in the home stretch of this undertaking.
Anyway I expect to be absolutely buried in bug reports by next week from all the people testing this, so thanks in advance.
It’s been a busy week, and I’ve got posts I’ve been meaning to write. The problem is they’re long and I’m busy. That’s why I’m doing a shorter post today, since even just getting this one out somehow took 4+ hours while I was continually sidetracked by “real work”.
But despite this being a shorter post, don’t worry: the memes won’t be shorter.
I got a ticket very recently that I immediately jumped on and didn’t at all procrastinate or forget about. The ticket concerned a little game called THE KING OF FIGHTERS XIV.
Now for those of you unfamiliar, The King of Fighters is a long-running fighting game franchise which debuted in the 90s. At the arcade. Pretty sure I played it once. But at like a retro arcade or something because I’m not that old, fellow kids.
The bug in question was that when a match is won using a special move, half the frame would misrender:
Heroically, the reporter posted a number of apitrace captures. Unfortunately that effort ended up being ineffectual since it did nothing but reveal yet another apitrace bug related to VAO uploads which caused replays of the traces to crash.
It was the worst kind of bug.
I was going to have to ~~play a game~~ test the defect myself.
It would prove to be the greatest test of my skill yet. I would have to:
Was I successful?
I’m not saying there’s someone out there who’s worse at ~~the game~~ the test app than a guy ~~playing~~ performing exploratory tests on his keyboard under renderdoc. That’s not what I’m saying.
The debug process for this issue was, in contrast to the capture process, much simpler. I attribute this to the fact that, while I don’t own a gamepad for use with whatever test apps need to be run, I do have a code controller that I use for all my debugging:
I’ve been hesitant to share such pro strats on the blog before, but SGC has been around for long enough now that even when the copycats start vlogging about my tech and showing off the frame data, everyone will recognize where it came from. All I ask is that you post clips of tournament popoffs.
Using my code controller, I was able to perform a debug -> light code -> light code -> debug -> heavy code -> compile -> block -> post meme -> reboot -> heavy code -> heavy code combo for an easy W.

To break down this advanced sequence, a small debug reveals that the issue is a render area clamped to 1024x1024 on a 1920x1080 frame. Since I have every line of the codebase memorized (zink main don’t @ me) it was instantly obvious that some poking was in order.
Vulkan has this pair of (awful) VUs:
VUID-VkRenderingInfo-pNext-06079
If the pNext chain does not contain VkDeviceGroupRenderPassBeginInfo or its deviceRenderAreaCount member is equal to 0, the width of the imageView member of any element of pColorAttachments, pDepthAttachment, or pStencilAttachment that is not VK_NULL_HANDLE must be greater than or equal to renderArea.offset.x + renderArea.extent.width
VUID-VkRenderingInfo-pNext-06080
If the pNext chain does not contain VkDeviceGroupRenderPassBeginInfo or its deviceRenderAreaCount member is equal to 0, the height of the imageView member of any element of pColorAttachments, pDepthAttachment, or pStencilAttachment that is not VK_NULL_HANDLE must be greater than or equal to renderArea.offset.y + renderArea.extent.height
which don’t match up at all to GL’s ability to throw whatever size framebuffer attachments at the GPU and have things come out fine. A long time ago I wrote this MR to clamp framebuffer size to the smallest attachment. But in this particular case, there are three framebuffer attachments:
The unused attachment ends up clamping the framebuffer to a smaller region to avoid violating spec, and this breaks rendering. Some light code pokes to skip clamping for NULL attachments open up the combo. Another quick debug doesn’t show the issue as being resolved, which means it’s time for some heavy code: checking for unused attachments in the fragment shader during renderpass start.
Naturally this triggers a full tree compile, which is a blocking operation that gives me enough time to execute a post meme for style points. The downside is that I’m using an AMD system, so as soon as I try to run the new code it hangs–it’s at this point that I nail a reboot to launch it into orbit.
I’m not looking for a record-setting juggle, so I finish off my combo with a heavy code -> heavy code finisher to hack in attachment write tracking for TC renderpass optimization and then plumb it through the rest of my stack so unused attachments will skip all renderpass-related operations.
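For flavor, the “skip NULL attachments when clamping” idea boils down to something like this hypothetical sketch (not the actual zink code; all names here are made up):

```c
#include <vulkan/vulkan.h>

#define MIN2(a, b) ((a) < (b) ? (a) : (b))

/* Hypothetical: clamp the render area to the smallest *used* attachment so an
 * unused (VK_NULL_HANDLE) attachment can no longer shrink the framebuffer. */
static VkExtent2D clamp_render_area(VkExtent2D fb_extent,
                                    const VkImageView *views,
                                    const VkExtent2D *view_extents,
                                    uint32_t count)
{
   VkExtent2D area = fb_extent;
   for (uint32_t i = 0; i < count; i++) {
      if (views[i] == VK_NULL_HANDLE)
         continue; /* unused attachment: ignore it */
      area.width = MIN2(area.width, view_extents[i].width);
      area.height = MIN2(area.height, view_extents[i].height);
   }
   return area;
}
```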
Problem solved, and all without having to personally play any games.
I’ll finally post that one post I’ve been planning to post for weeks but it’s hard and I just blew my entire meme budget for the month today so what is even going to happen who knows.
This week started quite fruitfully; these features were added:
And with this we should have all the features we need to run a model such as MobileNet v1 and get some performance numbers to guide the next steps.
Only that the NPU hangs when I try to use the 8th core... and this is required to run most detection models, as they start by convolving the input into 32 feature maps.
I have checked that we are sending bit-identical command streams and input buffers to the kernel, so I suspect the problem is somewhere in the kernel.
So I plan to instrument the out-of-tree kernel driver and get some register and command stream dumps, in the hope that there is some bit in a magic register somewhere that I need to flip.
I'm not really looking forward to such work, so I decided to first invest some time cleaning things up a bit to make it easier for other people to play with this if they wish.
I have removed from my branch everything from my previous attempt at using OpenCL and have written some documentation about how to run the TensorFlow Lite delegate:
https://gitlab.freedesktop.org/tomeu/mesa/-/blob/teflon/docs/teflon.rst
You will need a VIM3 board, a recent mainline kernel and a Debian testing rootfs.
Sriram invited me to the oneAPI meetup, and I felt I hadn't summed up the state of compute and community development in a while. Enjoy 45 minutes of opinions!
https://www.youtube.com/watch?v=HzzLY5TdnZo
if you disagree you’re wrong
gitlab is down, post low-effort blogs and touch grass until it returns
The GSoC journey is coming to a close. In just over 100 days, I gained more experience in open-source development than I could ever imagine in this period.
Prior to GSoC, I was not used to regularly submitting patches to the mailing lists. Now, I’ve sent many patches and revisions. I believe my interaction with the community will only grow. I learned so much about the tools and workflow of kernel development.
After this experience, I’m more than certain that I want to make this my job; contributing to open source is fun, so why not make a living from it :)
The main goal of the project was to increase the code coverage on the DRM core helper functions by creating unit tests.
As covering all the helpers is too big a task for the time period, I decided to create tests for the drm_format_helper.c functions.
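As an illustration of the shape these tests take, here is a toy KUnit case in the same spirit (not one of the actual drm_format_helper tests; the real ones exercise helpers like drm_fb_xrgb8888_to_rgb565() against full buffers):

```c
#include <kunit/test.h>
#include <linux/types.h>

/* Toy example of the KUnit scaffolding used for format-helper tests. The real
 * cases build a framebuffer, run the conversion helper, and compare the output
 * buffer; here we only check a hand-computed value to show the structure. */
static void xrgb8888_to_rgb565_example(struct kunit *test)
{
	u32 xrgb = 0x00ff0000;	/* pure red, XRGB8888 */
	u16 expected = 0xf800;	/* pure red, RGB565 */
	u16 converted = ((xrgb >> 19) & 0x1f) << 11 |
			((xrgb >> 10) & 0x3f) << 5 |
			((xrgb >> 3) & 0x1f);

	KUNIT_EXPECT_EQ(test, expected, converted);
}

static struct kunit_case example_cases[] = {
	KUNIT_CASE(xrgb8888_to_rgb565_example),
	{}
};

static struct kunit_suite example_suite = {
	.name = "drm_format_helper_example",
	.test_cases = example_cases,
};
kunit_test_suite(example_suite);
```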
Throughout the project, other side tasks appeared. I will list the contributions made below.
VKMS is a software-only model of a KMS driver that is useful for testing and running X (or similar) on headless machines.
This was, unexpectedly, a big part of my GSoC. I learned a lot about color formats and how a graphics driver works. Currently, only one piece of my work was upstreamed, the rest needs more work and was postponed in favor of the primary project goal.
| Patch | Status |
|---|---|
| drm/vkms: Add support to 1D gamma LUT | Accepted |
For more information go check my blogpost about the whole process.
IGT GPU Tools is a collection of tools for the development and testing of DRM drivers. While working on VKMS I heavily used the IGT framework for testing; on one occasion a bug caused a test to stop working on VKMS, so I submitted a patch to fix that.
| Patch | Status |
|---|---|
| lib/igt_fb: Add check for intel device on use_enginecopy | Accepted |
In the DRM subsystem, I accomplished the main project goal by adding unit tests, and also helped fix some bugs that appeared while working on the tests. With the patches I sent, I reached 71.5% line coverage and 85.7% function coverage on drm_format_helper.c.
I think the most difficult task was describing my work. Whether in blog posts or in commit messages, it takes a lot of work to describe what you’ve done concisely and clearly. With time you get the hang of it, but I think I can still improve in this area.
Moreover, many times I had to debug some problems. I already knew how to use GDB, but using it in the kernel is a little more cumbersome. After searching, reading the documentation, and getting tips from my mentors, I got it working.
On VKMS, I had to create new features, which requires a lot of thought. I made a lot of diagrams in my head to understand how the color formats would be laid out in memory, and yet most of my work hasn’t seen the light of day XD.
I was able to do most of the proposed tasks. But drm_xfrm_toio was left out due to the difficulty of testing it, as it uses IO memory. I tested drm_fb_blit(), but I’m waiting for the acceptance of the patchset before sending it; with that patch the line coverage will go to 89.2% and the function coverage to 94.3%.
Besides patch submission, I reviewed some patches too. Going to the other side, I enjoyed thinking about how a piece of code could be improved.
Also, on one occasion I started a discussion about the best way to solve an issue by sending a patch. This got me a Reported-by tag on the patch that fixed the bug.
Moreover, I use a Thunderbird addon to properly highlight diffs. When I was tinkering with the configuration, I noticed that the CSS of the configuration menu was wrong, which made the user experience pretty bad.
I sent a patch fixing that to the maintainer of the addon, and this patch generated a discussion that led to a whole rework of the CSS file due to Thunderbird updates.
I’d like to thank my mentors, André “Tony” Almeida, Maíra Canal, and Tales L. Aparecida. Their support and expertise were invaluable throughout this journey.
Moreover, I thank the X.Org Foundation for allowing me to participate in this program, and also for accepting my talk proposal for XDC 2023.
Lastly, I thank my colleague Carlos Eduardo Gallo for exchanging knowledge during the weekly meetings.
It's 5am and I have a headache. The perfect time for some reflection!
Not only that, but I've just had to play the part of Static Site Ungenerator, because I found out that I deleted the source of the last post and I didn't want to lose it in the upcoming publish. If your Atom feed went funky, sorry.
This document is my Final Work Submission, but is fun for all the family, including the ones who don't work at Google. Hi everyone!
Going into the summer, the plan was to add functionality to wlroots so that its users (generally Wayland compositors) could more easily switch to a smarter frame schedule. I've had many goes at explaining the problem and they all sucked, so here we go again: if a compositor puts some thought into when it starts its render, desktop latency as perceived by the user can decrease. The computer will feel snappier.
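To make that concrete, the scheduling idea amounts to something like this (my own illustrative helper, not a wlroots API): delay the start of rendering so the frame finishes just before the next vblank instead of sitting finished and waiting.

```c
#include <stdint.h>

/* Illustrative only: pick how long to wait after vblank before starting to
 * render, so the frame lands just before the next vblank. A smaller delay is
 * safer; a larger delay means fresher input and lower perceived latency. */
static int64_t render_delay_ns(int64_t refresh_period_ns,
                               int64_t predicted_render_ns,
                               int64_t safety_margin_ns)
{
    int64_t delay = refresh_period_ns - predicted_render_ns - safety_margin_ns;
    return delay > 0 ? delay : 0;
}
```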
wlroots started the summer with no accommodations for compositors that wanted to put thought into when they start to render. It assumed exactly no thought was to be put in, and left you on your own if you were to decide otherwise. But that has all changed!
The aim of my work could have comprised three things, but I added a fourth and then didn't have time for the third:
After some flailing around trying to add a delay to the existing scheduling, I started writing patches worth landing.
First came the render timer API. Now we can measure the duration of our render passes. This MR brought an abstraction for timers, and an implementation for wlroots' gles2 renderer.

Next, the scene timer API. wlr_scene does some of its own work before setting off the render pass itself, so it needed to become aware of timers and expose a way to use them.
Meanwhile, I was having another stab at configuring a frame delay. It wasn't very good, and the design of wlroots' scheduling and the complexity of the logic underneath it turned out to take a long time to get through. With this MR, though, I had a better idea of where I was trying to go. A long thought process followed, much of which lives in this issue, and further down we'll see what came of that.
Before working on a prediction algorithm, I wanted to be able to see live feedback on how render
timings behaved and which frames were missed so that I could do a good (informed) job of predicting
them.
I took a detour into the world of tracing. libuserevents was spawned, and so was the work to make use of it in wlroots.
Linux's user_events tracing interface was appealing because it meant that GPUVis, an existing tool
that can display a timeline of CPU and GPU events, would be able to show wlroots' events.
Unfortunately Linux and I have so far struggled to get along and this work is still in progress -
no submission yet because it's broken.
Even more unfortunately, this meant that I wasn't able to get around to prediction.
Then I got tired of fighting that, and despite the words of discouragement...
a refactor of wlroots' frame scheduling that allows us to do much better than !4214:
!4307!
This hasn't quite made it past the finish line, but it's close; I can feel it in my frames.
It (in my opinion) neatly extracts the hairy logic that lived in wlr_output
into a helper
interface, allowing users to swap out which frame scheduler they use, or to forgo the helpers and
roll their own without there being bits and pieces left over in the parts of wlroots that they do
care about.
This is the most exciting piece of the puzzle IMO; wlr_output
has grown to have its fingers in
many pies, and this MR reduces that and leaves wlr_output
a little bit more friendly in a way that
took a lot of brain cycles but turned out clean.
This new interface doesn't come with a frame delay option for free, but an implementation of the interface that has this feature is underway: !4334. It fits nicely! We hashed it out a little on IRC because the frame delay option is a surprisingly tricky constraint on the interface, but I think the conclusion is good. It was definitely a lot easier to write this with confidence after the scheduling redesign :)
To make this scheduling design possible and clean, a
couple of
little changes were
needed in other areas, and
thankfully the case for these changes was easy to make.
They're helpful to me, but also make those parts of wlroots less surprising and/or broken.
There was also a discussion about
the fate of wlr_output.events.needs_frame
, which is an extra complexity in wlroots' frame
scheduling.
It turned out that while removing it is possible, it wasn't necessary for the new scheduling system,
so it continues in the background.
While libuserevents
is usable, the wlroots integration is not ready.
There is sadly no "stock" plug-and-play prediction algorithm in wlroots.
The new scheduling infrastructure has not landed but I'm sure it will Soon™. The implementation with the frame delay option will hopefully follow shortly after. When (touch wood) it does, compositors will have to bring their own prediction algorithm, but a "good enough" algorithm can be very simple and given the current interface design can easily be swapped out for a stock one if one materialises.
And finally, the funniest one. I wrote an implementation of the timer API for wlroots' Vulkan renderer, and then put off submitting it for two months because everything else was more important. gles2 is the default renderer and supports roughly every GPU in existence. Writing the Vulkan timer was fun but landing it was less of a priority than every other task I had and nothing really depended on it, so it remains stuck on my laptop to this day. Perhaps I should get round to that.
The project didn't go how I expected it to - not even close. I even wrote up a schedule as part of my application that almost immediately turned out completely wrong. I'm not bothered, though, because it was fun, I made myself useful, and I met some cool people.
If you're considering doing something like I did, I can happily recommend Simon as a
mentor, X.Org, and GSoC, in that order.
Much love to Simon for making me feel comfortable when I really didn't know what I was doing, and
for participating in my wildly off-topic free software rambles.
I've only interacted with a small part of the X.Org community so far but it struck me from the start
how welcoming everyone is;
I have no doubts that the other X.Org project mentors are as lovely in their own ways.
And of course, as a strong proponent of software that doesn't suck that's free, I have to
appreciate that GSoC gave me a welcoming place to do my part in that and relieve my worldly
pressures (did you know you have to pay for internet??).
Thanks everyone for putting up with me. If you would like to put up with me some more, click the links on the left - I'm not going anywhere, there's still work to do!
Managed to squeeze some time between holidaying to hack on the NPU driver and got something out of it.
Since the last update I have:
Next steps are to support convolutions with multiple input and output channels, and padding. Then see what is still missing so we can run MobileNet v1 and check the performance when using the NN units and doing the rest on the CPU.
As a reminder, I'm pushing all the code to this branch: https://gitlab.freedesktop.org/tomeu/mesa/-/commits/teflon/.
A bunch of us have started to gather in the #ml-mainline IRC channel in OFTC to discuss matters about doing accelerated ML with mainline, on embedded.
For those of you that may not have an IRC bouncer set up yet, you can easily join with the web chat UI, but in case others aren't in front of the keyboard when you type your question, I recommend using element.io with the Matrix IRC bridge:
https://blog.christophersmart.com/2022/03/21/joining-a-bridged-irc-network-on-element-matrix/
I have been invited to give a talk about all this ML with mainline effort at Embedded Recipes 2023, Paris 28-29 September. Slides and a recording will be published after the conference ends.
Last but not least, if I am able to invest so much effort in this, it is because the folks at LibreComputer have been supporting me financially these last couple of months.
Thanks to Da Xue for his support, it is greatly appreciated! It is awesome to see SBC vendors investing in the Linux upstream ecosystem.
It turns out this was the product of a tiler optimization I did earlier this year to pipeline texture uploads without splitting renderpasses. I was (wrongly) assuming that the PBO stride would always match the image format stride, which broke functionality in literally just this one corner case.
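For context, the source stride of a PBO upload in GL is governed by the unpack state rather than the destination image format; roughly (an illustrative helper, not the zink code):

```c
/* Illustrative only: GL unpack semantics for the source row stride of a PBO
 * upload. GL_UNPACK_ROW_LENGTH (if nonzero) wins over the image width, and
 * the row size is padded up to GL_UNPACK_ALIGNMENT. */
static unsigned pbo_row_stride(unsigned width, unsigned unpack_row_length,
                               unsigned bytes_per_pixel, unsigned unpack_alignment)
{
   unsigned pixels_per_row = unpack_row_length ? unpack_row_length : width;
   unsigned stride = pixels_per_row * bytes_per_pixel;
   return (stride + unpack_alignment - 1) & ~(unpack_alignment - 1);
}
```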
Conformant OpenGL® ES 3.1 drivers are now available for M1- and M2-family GPUs. That means the drivers are compatible with any OpenGL ES 3.1 application. Interested? Just install Linux!
For existing Asahi Linux users, upgrade your system with dnf upgrade (Fedora) or pacman -Syu (Arch) for the latest drivers.
Our reverse-engineered, free and open source graphics drivers are the world’s only conformant OpenGL ES 3.1 implementation for M1- and M2-family graphics hardware. That means our driver passed tens of thousands of tests to demonstrate correctness and is now recognized by the industry.
To become conformant, an “implementation” must pass the official conformance test suite, designed to verify every feature in the specification. The test results are submitted to Khronos, the standards body. After a 30-day review period, if no issues are found, the implementation becomes conformant. The Khronos website lists all conformant implementations, including our drivers for the M1, M1 Pro/Max/Ultra, M2, and M2 Pro/Max.
Today’s milestone isn’t just about OpenGL ES. We’re releasing the first conformant implementation of any graphics standard for the M1. And we don’t plan to stop here ;-)
Unlike ours, the manufacturer’s M1 drivers are unfortunately not conformant for any standard graphics API, whether Vulkan or OpenGL or OpenGL ES. That means that there is no guarantee that applications using the standards will work on your M1/M2 (if you’re not running Linux). This isn’t just a theoretical issue. Consider Vulkan. The third-party MoltenVK layers a subset of Vulkan on top of the proprietary drivers. However, those drivers lack key functionality, breaking valid Vulkan applications. That hinders developers and users alike, if they haven’t yet switched their M1/M2 computers to Linux.
Why did we pursue standards conformance when the manufacturer did not? Above all, our commitment to quality. We want our users to know that they can depend on our Linux drivers. We want standard software to run without M1-specific hacks or porting. We want to set the right example for the ecosystem: the way forward is implementing open standards, conformant to the specifications, without compromises for “portability”. We are not satisfied with proprietary drivers, proprietary APIs, and refusal to implement standards. The rest of the industry knows that progress comes from cross-vendor collaboration. We know it, too. Achieving conformance is a win for our community, for open source, and for open graphics.
Of course, Asahi Lina and I are two individuals with minimal funding. It’s a little awkward that we beat the big corporation…
It’s not too late though. They should follow our lead!
OpenGL ES 3.1 updates the experimental OpenGL ES 3.0 and OpenGL 3.1 we shipped in June. Notably, ES 3.1 adds compute shaders, typically used to accelerate general computations within graphics applications. For example, a 3D game could run its physics simulations in a compute shader. The simulation results can then be used for rendering, eliminating stalls that would otherwise be required to synchronize the GPU with a CPU physics simulation. That lets the game run faster.
Let’s zoom in on one new feature: atomics on images. Older versions of OpenGL ES allowed an application to read an image in order to display it on screen. ES 3.1 allows the application to write to the image, typically from a compute shader. This new feature enables flexible image processing algorithms, which previously needed to fit into the fixed-function 3D pipeline. However, GPUs are massively parallel, running thousands of threads at the same time. If two threads write to the same location, there is a conflict: depending which thread runs first, the result will be different. We have a race condition.
“Atomic” access to memory provides a solution to race conditions. With atomics, special hardware in the memory subsystem guarantees consistent, well-defined results for select operations, regardless of the order of the threads. Modern graphics hardware supports various atomic operations, like addition, serving as building blocks to complex parallel algorithms.
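The same idea shows up on the CPU; this tiny C11 sketch (my own illustration, unrelated to the driver) contrasts a racy increment with an atomic one:

```c
#include <stdatomic.h>

/* Many threads bumping a counter: a plain `counter++` is a read-modify-write
 * and races; atomic_fetch_add gives a well-defined result regardless of thread
 * ordering, the same guarantee image atomics provide per pixel. */
static _Atomic unsigned counter;

void bump(void)
{
    atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
}
```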
Can we put these two features together to write to an image atomically?
Yes. A ubiquitous OpenGL ES extension, required for ES 3.2, adds atomics operating on pixels in an image. For example, a compute shader could atomically increment the value at pixel (10, 20).
Other GPUs have dedicated instructions to perform atomics on images, making the driver implementation straightforward. For us, the story is more complicated. The M1 lacks hardware instructions for image atomics, even though it has non-image atomics and non-atomic images. We need to reframe the problem.
The idea is simple: to perform an atomic on a pixel, we instead calculate the address of the pixel in memory and perform a regular atomic on that address. Since the hardware supports regular atomics, our task is “just” calculating the pixel’s address.
If the image were laid out linearly in memory, this would be straightforward: multiply the Y-coordinate by the number of bytes per row (“stride”), multiply the X-coordinate by the number of bytes per pixel, and add. That gives the pixel’s offset in bytes relative to the first pixel of the image. To get the final address, we add that offset to the address of the first pixel.
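As a sketch (illustrative C, not driver code), the linear case is just:

```c
#include <stdint.h>

/* Illustrative only: byte address of pixel (x, y) in a linearly laid out image. */
static uint64_t linear_pixel_address(uint64_t base, uint32_t x, uint32_t y,
                                     uint32_t stride_bytes, uint32_t bytes_per_pixel)
{
    return base + (uint64_t)y * stride_bytes + (uint64_t)x * bytes_per_pixel;
}
```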
Alas, images are rarely linear in memory. To improve cache efficiency, modern graphics hardware interleaves the X- and Y-coordinates. Instead of one row after the next, pixels in memory follow a spiral-like curve.
We need to amend our previous equation to interleave the coordinates. We could use many instructions to mask one bit at a time, shifting to construct the interleaved result, but that’s inefficient. We can do better.
There is a well-known “bit twiddling” algorithm to interleave bits. Rather than shuffle one bit at a time, the algorithm shuffles groups of bits, parallelizing the problem. Implementing this algorithm in shader code improves performance.
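For reference, the textbook form of that algorithm looks like this (a generic illustration, not the vectorized shader code the driver emits):

```c
#include <stdint.h>

/* Classic "interleave by binary magic numbers": spread the bits of each
 * 16-bit coordinate apart in a few group-wise steps, then merge. */
static uint32_t interleave_xy(uint16_t x, uint16_t y)
{
    uint32_t a = x, b = y;

    a = (a | (a << 8)) & 0x00ff00ff;
    a = (a | (a << 4)) & 0x0f0f0f0f;
    a = (a | (a << 2)) & 0x33333333;
    a = (a | (a << 1)) & 0x55555555;

    b = (b | (b << 8)) & 0x00ff00ff;
    b = (b | (b << 4)) & 0x0f0f0f0f;
    b = (b | (b << 2)) & 0x33333333;
    b = (b | (b << 1)) & 0x55555555;

    return a | (b << 1); /* x in the even bits, y in the odd bits */
}
```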
In practice, only the lower 7-bits (or less) of each coordinate are interleaved. That lets us use 32-bit instructions to “vectorize” the interleave, by putting the X- and Y-coordinates in the low and high 16-bits of a 32-bit register. Those 32-bit instructions let us interleave X and Y at the same time, halving the instruction count. Plus, we can exploit the GPU’s combined shift-and-add instruction. Putting the tricks together, we interleave in 10 instructions of M1 GPU assembly:
# Inputs x, y in r0l, r0h.
# Output in r1.
add r2, #0, r0, lsl 4
or r1, r0, r2
and r1, r1, #0xf0f0f0f
add r2, #0, r1, lsl 2
or r1, r1, r2
and r1, r1, #0x33333333
add r2, #0, r1, lsl 1
or r1, r1, r2
and r1, r1, #0x55555555
add r1, r1l, r1h, lsl 1
We could stop here, but what if there’s a dedicated instruction to interleave bits? PowerVR has a “shuffle” instruction shfl, and the M1 GPU borrows from PowerVR. Perhaps that instruction was borrowed too. Unfortunately, even if it was, the proprietary compiler won’t use it when compiling our test shaders. That makes it difficult to reverse-engineer the instruction – if it exists – by observing compiled shaders.
It’s time to dust off a powerful reverse-engineering technique from magic kindergarten: guess and check.
Dougall Johnson provided the guess. When considering the instructions we already know about, he took special notice of the “reverse bits” instruction. Since reversing bits is a type of bit shuffle, the interleave instruction should be encoded similarly. The bit reverse instruction has a two-bit field specifying the operation, with value 01. Related instructions to count the number of set bits and find the first set bit have values 10 and 11 respectively. That encompasses all known “complex bit manipulation” instructions.
| Operation code | Instruction |
|---|---|
| 00 | ? ? ? |
| 01 | Reverse bits |
| 10 | Count set bits |
| 11 | Find first set |
There is one value of the two-bit enumeration that is unobserved and unknown: 00. If this interleave instruction exists, it’s probably encoded like the bit reverse but with operation code 00 instead of 01.
There’s a difficulty: the three known instructions have one single input source, but our instruction interleaves two sources. Where does the second source go? We can make a guess based on symmetry. Presumably to simplify the hardware decoder, M1 GPU instructions usually encode their sources in consistent locations across instructions. The other three instructions have a gap where we would expect the second source to be, in a two-source arithmetic instruction. Probably the second source is there.
Armed with a guess, it’s our turn to check. Rather than handwrite GPU assembly, we can hack our compiler to replace some two-source integer operation (like multiply) with our guessed encoding of “interleave”. Then we write a compute shader using this operation (by “multiplying” numbers) and run it with the newfangled compute support in our driver.
All that’s left is writing a shader that checks that the mystery instruction returns the interleaved result for each possible input. Since the instruction takes two 16-bit sources, there are about 4 billion (\(2^{32}\)) inputs. With our driver, the M1 GPU manages to check them all in under a second, and the verdict is in: this is our interleave instruction.
As for our clever vectorized assembly to interleave coordinates? We can replace it with one instruction. It’s anticlimactic, but it’s fast and it passes the conformance tests.
And that’s what matters.
Thank you to Khronos and Software in the Public Interest for supporting open drivers.
Color is a visual perception. Human eyes can detect a broader range of colors than any devices in the graphics chain. Since each device can generate, capture or reproduce a specific subset of colors and tones, color management controls color conversion and calibration across devices to ensure a more accurate and consistent color representation. We can expose a GPU-accelerated display color management pipeline to support this process and enhance results, and this is what we are doing on Linux to improve color management on Gamescope/SteamDeck. Even with the challenges of being external developers, we have been working on mapping AMD GPU color capabilities to the Linux kernel color management interface, which is a combination of DRM and AMD driver-specific color properties. This more extensive color management pipeline includes pre-defined Transfer Functions, 1-Dimensional LookUp Tables (1D LUTs), and 3D LUTs before and after the plane composition/blending.
The study of color is well-established and has been explored for many years. Color science and research findings have also guided technology innovations. As a result, color in Computer Graphics is a very complex topic that I’m putting a lot of effort into becoming familiar with. I always find myself rereading all the materials I have collected about color space and operations since I started this journey (about one year ago). I also understand how hard it is to find consensus on some color subjects, as exemplified by all explanations around the 2015 online viral phenomenon of The Black and Blue Dress. Have you heard about it? What is the color of the dress for you?
So, taking into account my skills with colors and building consensus, this blog post only focuses on GPU hardware capabilities to support color management :-D If you want to learn more about color concepts and color on Linux, you can find useful links at the end of this blog post.
DRM color management interface only exposes a small set of post-blending color properties. Proposals to enhance the DRM color API from different vendors have landed the subsystem mailing list over the last few years. On one hand, we got some suggestions to extend DRM post-blending/CRTC color API: DRM CRTC 3D LUT for R-Car (2020 version); DRM CRTC 3D LUT for Intel (draft - 2020); DRM CRTC 3D LUT for AMD by Igalia (v2 - 2023); DRM CRTC 3D LUT for R-Car (v2 - 2023). On the other hand, some proposals to extend DRM pre-blending/plane API: DRM plane colors for Intel (v2 - 2021); DRM plane API for AMD (v3 - 2021); DRM plane 3D LUT for AMD - 2021. Finally, Simon Ser sent the latest proposal in May 2023: Plane color pipeline KMS uAPI, from discussions in the 2023 Display/HDR Hackfest, and it is still under evaluation by the Linux Graphics community.
All previous proposals seek a generic solution for expanding the API, but many
seem to have stalled due to the uncertainty of matching well the hardware
capabilities of all vendors. Meanwhile, the use of AMD color capabilities on
Linux remained limited by the DRM interface, as the DCN 3.0 family color caps
and mapping
diagram below shows the Linux/DRM color interface without
driver-specific color properties [*]:
Bearing in mind that we need to know the variety of color pipelines in the
subsystem to be clear about a generic solution, we decided to approach the
issue from a different perspective and worked on enabling a set of
Driver-Specific Color Properties for AMD Display Drivers
. As a result, I
recently sent another round of the AMD driver-specific color mgmt
API.
For those who have been following the AMD driver-specific proposal since the
beginning (see
[RFC][V1]),
the main new features of the latest version
[v2]
are the addition of pre-blending Color Transformation Matrix (plane CTM)
and
the differentiation of Pre-defined Transfer Functions (TF)
supported by color
blocks. For those who just got here, I will recap this work in two blog posts.
This one describes the current status of the AMD display driver in the Linux
kernel/DRM subsystem and what changes with the driver-specific properties. In
the next post, we go deeper to describe the features of each color block and
provide a better picture of what is available in terms of color management for
Linux.
Before discussing colors in the Linux kernel with AMD hardware, consider
accessing the Linux kernel documentation (version 6.5.0-rc5). In the AMD
Display documentation, you will find my previous work documenting AMD hardware
color capabilities and the Color Management
Properties.
It describes how AMD Display Manager (DM)
intermediates requests between the
AMD Display Core component (DC)
and the Linux/DRM kernel
interface for
color management features. It also describes the relevant function to call the
AMD color module in building curves for content space transformations.
A subsection also describes hardware color capabilities and how they evolve between versions. This subsection, DC Color Capabilities between DCN generations, is a good starting point to understand what we have been doing on the kernel side to provide a broader color management API with AMD driver-specific properties.
Blending is the process of combining multiple planes (framebuffers abstraction)
according to their mode settings. Before blending, we can manage the colors of
various planes separately; after blending, we have combined those planes in
only one output per CRTC. Color conversions after blending would be enough in a
single-plane scenario or when dealing with planes in the same color space on
the kernel side. Still, it cannot help to handle the blending of multiple
planes with different color spaces and luminance levels. With plane color
management properties, userspace can get a more accurate representation of
colors to deal with the diversity of color profiles of devices in the graphics
chain, bring a wide color gamut (WCG)
, convert High-Dynamic-Range (HDR)
content to Standard-Dynamic-Range (SDR)
content (and vice-versa). With a
GPU-accelerated display color management pipeline, we can use hardware blocks
for color conversions and color mapping and support advanced color management.
The current DRM color management API enables us to perform some color conversions after blending, but there is no interface to calibrate input space by planes. Note that here I’m not considering some workarounds in the AMD display manager mapping of DRM CRTC de-gamma and DRM CRTC CTM property to pre-blending DC de-gamma and gamut remap block, respectively. So, in more detail, it only exposes three post-blending features: the DRM CRTC de-gamma LUT, the DRM CRTC CTM, and the DRM CRTC gamma LUT.
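For a sense of what that post-blending interface looks like from userspace, here is a minimal sketch (my own example, assuming the CRTC id and its GAMMA_LUT property id were already looked up) that programs a gamma LUT blob through the atomic API:

```c
#include <stdint.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

/* Sketch: wrap the LUT entries in a property blob and attach it to the CRTC's
 * GAMMA_LUT property in an atomic commit. Error handling kept minimal. */
static int set_crtc_gamma_lut(int fd, uint32_t crtc_id, uint32_t gamma_lut_prop_id,
                              const struct drm_color_lut *lut, uint32_t entries)
{
   uint32_t blob_id = 0;
   int ret = drmModeCreatePropertyBlob(fd, lut, entries * sizeof(*lut), &blob_id);
   if (ret)
      return ret;

   drmModeAtomicReq *req = drmModeAtomicAlloc();
   if (!req)
      return -1;
   drmModeAtomicAddProperty(req, crtc_id, gamma_lut_prop_id, blob_id);
   ret = drmModeAtomicCommit(fd, req, DRM_MODE_ATOMIC_ALLOW_MODESET, NULL);
   drmModeAtomicFree(req);
   drmModeDestroyPropertyBlob(fd, blob_id);
   return ret;
}
```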
We can compare the Linux color management API with and without the
driver-specific color properties. From now, we denote driver-specific
properties with the AMD prefix and generic properties with the DRM prefix. For
visual comparison, I bring the DCN 3.0 family color caps and mapping
diagram
closer and present it here again:
Mixing AMD driver-specific color properties with DRM generic color properties, we have a broader Linux color management system with the following features exposed by properties in the plane and CRTC interface, as summarized by this updated diagram:
The blocks highlighted by red lines
are the new properties
in the
driver-specific interface developed by me (Igalia) and Joshua (Valve). The red
dashed lines
are new links between API and AMD driver components
implemented by
us to connect the Linux/DRM interface to AMD hardware blocks, mapping
components accordingly. In short, we have the following color management
properties exposed by the DRM/AMD display driver:
Note: You can find more about AMD display blocks in the Display Core Next (DCN) - Linux kernel documentation, provided by Rodrigo Siqueira (Linux/AMD display developer) in a 2021-documentation series. In the next post, I’ll revisit this topic, explaining display and color blocks in detail.
So, looking at AMD hardware color capabilities in the first diagram, we can see no post-blending (MPC) de-gamma block in any hardware families. We can also see that the AMD display driver maps CRTC/post-blending CTM to pre-blending (DPP) gamut_remap, but there is post-blending (MPC) gamut_remap (DRM CTM) from newer hardware versions that include SteamDeck hardware. You can find more details about hardware versions in the Linux kernel documentation/AMDGPU Product Information.
I needed to rework these two mappings mentioned above to provide
pre-blending/plane de-gamma and CTM for SteamDeck. I changed the DC mapping to
detach stream gamut remap
matrixes from the DPP gamut remap
block. That
means mapping AMD plane CTM directly to DPP/pre-blending gamut remap block and
DRM CRTC CTM to MPC/post-blending gamut remap block. In this sense, I also
limited plane CTM properties to those hardware versions with MPC/post-blending
gamut_remap capabilities since older versions cannot support this feature
without clashes with DRM CRTC CTM.
Unfortunately, I couldn’t prevent conflict between AMD plane de-gamma and DRM plane de-gamma since post-blending de-gamma isn’t available in any AMD hardware versions until now. The fact is that a post-blending de-gamma makes little sense in the AMD color pipeline, where plane blending works better in a linear space, and there are enough color blocks to linearize content before blending. To deal with this conflict, the driver now rejects atomic commits if users try to set both AMD plane de-gamma and DRM CRTC de-gamma simultaneously.
Finally, we had no other clashes when enabling other AMD driver-specific color properties for our use case, Gamescope/SteamDeck. Our main work for the remaining properties was understanding the data flow of each property, the hardware capabilities and limitations, and how to shape the data for programming the registers - AMD color block capabilities (and limitations) are the topics of the next blog post. Besides that, we fixed some driver bugs along the way since it was the first Linux use case for most of the new color properties, and some behaviors are only exposed when exercising the engine.
Take a look at the Gamescope/Steam Deck Color Pipeline[**], and see how Gamescope uses the new API to manage color space conversions and calibration (please click on the image for a better view):
In the next blog post, I’ll describe the implementation and technical details of each pre- and post-blending color block/property on the AMD display driver.
* Thanks to Harry Wentland for helping with diagrams, color concepts and AMD capabilities.
** Thanks to Joshua Ashton for providing and explaining the Gamescope/Steam Deck color pipeline.
*** Thanks to the Linux Graphics community - explicitly Harry, Joshua, Pekka, Simon, Sebastian, Siqueira, Alex H. and Ville - to all the learning during this Linux DRM/AMD color journey. Also, Carlos and Tomas for organizing the 2023 Display/HDR Hackfest where we have a great and immersive opportunity to discuss Color & HDR on Linux.
After a long week of what-even-happened, it’s finally time to talk about maintenance5.
This long-awaited maintenance extension has a number of great and zinkful features:
- VK_FORMAT_A8_UNORM_KHR for native A8 handling
- a default value for gl_PointSize
- VK_REMAINING_ARRAY_LAYERS
- clarification that copies between images of any type are allowed, treating 1D images as 2D images with a height of 1
But who can guess which one is the topic of this blog post?
Finally, a default value for gl_PointSize.
Long-term fans of the blog will recall that I’ve previously raged, many times, against the insane concept that pointsize must be written. In fact, it remains the second most blogged about topic in SGC history, right behind ~~Big Triangle~~ descriptor management, the topic that modern graphics-related blogs must cover above all others.
Finally with maintenance5 we can be freed from these unjust shackles that have bound us for so long. No more* shall complex logic be unnecessarily injected into the compiler stack to add senseless writes to this output.
* except all that code still has to exist and run to handle drivers that don’t support maintenance5
Beyond the obvious benefit of having a fixed default pointsize (sanity), let’s check out some other benefits.
Previously all zink-emitted shaders would have a pointsize write, even those that were never used for drawing points. This resulted in unnecessary shader i/o at the hardware level. Nobody wants unnecessary shader i/o at the hardware level.
Now, however, it’s possible to use heuristics during linking to delete all unnecessary pointsize writes any time there is no XFB emission.
How much performance improvement will this yield?
Six.
Six improvement units of performance.
Everyone remembers that time I discovered that huge flaw in nir_assign_io_var_locations where shader interfaces would break due to psiz injection.
With maintenance5 all of that can be handwaved away, meaning fewer shader variants are needed.
Maintenance extensions are best extensions, prove me wrong.
Hi!
Let me start this status update with an announcement: from 2023-08-28 to 2023-10-01 (!), I will be on leave, so I will have reduced availability. Don’t be surprised if I miss some e-mails, and feel free to ping me again (more generally, please do ping me if I forget about a discussion — that also tends to happen when I’m not on leave). During that time, I will be traveling to Korea and Japan. If you live there and want to say hello, please reach out! :)
This month, Rose has continued working on wlroots frame scheduling. After a fair amount of discussion, she’s found a pretty nice API design. She still needs to address and cleanup a few things, but that merge request is on a good track! I’ve also merged a new API to embed a compositor inside a Wayland client, and sent patches to remove some cases where we were waiting for a reply from Xwayland in a blocking fashion.
My kernel patch for signaling an eventfd from a drm_syncobj
has been merged
(see last month’s post for more details), and I’ve reviewed a patch from Erik
Kurzinger to import a sync_file
into a drm_syncobj
timeline, which was
possible before but awkward (it required 3 IOCTLs and a temporary binary
drm_syncobj
). As usual, I’ve sent a few kernel documentation patches as well.
I’ve released a new version of Cage, the Wayland kiosk compositor. Cage now uses the latest wlroots release, implements a bunch of new protocols and leverages wlroots’ scene-graph API.
The NPotM is go-mls, a Go library for the Messaging Layer Security protocol. It’s a shiny new end-to-end encryption framework for messaging protocols (similar to the one used by e.g. Signal and Matrix). I wanted to figure out how it works, but simply reading a 132-page RFC didn’t seem fun enough, so I just tried implementing it instead. I’m passing most of the official test vectors, still missing a few things but overall not too far away from a proper implementation. I’ve been discussing with a few folks about an IRCv3 extension for MLS, but we’re still at the very early stages on that front.
Speaking of IRCv3, the pre-away extension has been merged, so the away status of soju users shouldn’t blink anymore when the Goguma mobile client synchronizes in the background. I’ve also submitted the no-implicit-names extension for standardization. That extension reduces bandwidth usage for clients who don’t need to always maintain a list of all users in all channels. This helps a lot with slow 3G connections in the countryside.
The SNPotM is libdns/dnsupdate, a Go library for the venerable dynamic DNS UPDATE protocol implemented by various authoritative name servers. The library conforms to an interface shared with other (proprietary) libdns providers. I have more plans in this area, but will keep that for a future blog post.
I’ve sent a go-proxyproto patch to add a helper to configure an HTTP/2 server with PROXY protocol upgrade support. TLS ALPN is needed to negotiate HTTP/2, so it’s tricky to make work behind a reverse proxy which terminates the TLS connection. This patch is basically part of kimchi ripped off and put behind a nice API. This patch would be useful to add HTTP/2 support to pages.sr.ht.
Last but not least, I’ve implemented tracker export for the todo.sr.ht GraphQL API. delthas has added support for that in hut. Next up is support for import in hut! I’ve also sent a whole bunch of bug fixes for sr.ht.
That’s all for this month! I’m not sure I’ll write a status update in September, but will definitely do so in October.
I just got back from lunch and have to work off some cals, and that means it’s time for another big lift on the blog. Today’s topic: how dumb can a driver’s compiler stack be?
As I outlined in the previous post, zink’s compiler stack is about to get a whole lot dumber for various reasons. But what exactly does that look like?
Lesser bloggers would simply link to the MR and say “figure it out”.
Here at SGC, however, I put in the extra effort so my readers can comprehend all the stringy awfulness that goes into each individual graphical sausage that this triangle factory is pumping out.
Let’s get at it.
The key point of using the theoretical new NIR linker (that definitely exists and will be merged in finite time) is that it requires drivers to accept lowered i/o. This means, effectively, that zink must begin consuming lowered i/o as soon as it receives shaders. Naturally the first step to that was evaluating all the shader passes which operate on specific i/o variables using derefs (AKA “explicit” i/o):
The first four are called from zink_shader_create, the first time zink sees new shaders, while the last one is called from zink_compiler_assign_io. As shaders won’t have derefs again until just before they go through NTV, they’ll all have to be…
What’s that you say, creator of the patented Delete The Code methodology and planar YUV expert, Faith Ekstrand? I can just delete some of this code?
That sounds like a pretty smart idea. Looking at the list again, and then cross-referencing against all the features lowered i/o provides, and then pushing up my glasses so I can read the very real documentation that nir has, let’s see where that leads:
- nir_lower_io_lower_64bit_to_32 is available during i/o lowering, so this can all be deleted

Not actually that much work, huzzah.
As in the flowchart, this process involves taking explicit i/o, converting to lowered i/o, then converting back to explicit. Explicit i/o is characterized by using derefs to explicit variables for access, which means variables are needed. A work-intensive benefit of this is simpler variables: since lowered i/o is characterized by location-based access to components, the subsequent conversion back to explicit i/o can use entirely new variables, and since these variables are location-based, there’s no need to retain any* of the gross struct/array typing that GLSL yields.
* except where arrays are indirectly accessed
For those of you who are truly in the know, this means goku in his SSJB form:
struct TestStruct {
dmat2x3 a[2];
mat2x3 b[2];
dvec2 c;
};
layout (location = 0, xfb_offset = 0) flat out TestStruct goku;
gets blasted into a series of smaller and more vincible variables:
decl_var shader_out INTERP_MODE_FLAT dvec3 goku#0 (VARYING_SLOT_VAR2.xyz, 2, 0)
decl_var shader_out INTERP_MODE_FLAT dvec3 goku#1 (VARYING_SLOT_VAR4.xyz, 4, 0)
decl_var shader_out INTERP_MODE_FLAT dvec3 goku#2 (VARYING_SLOT_VAR6.xyz, 6, 0)
decl_var shader_out INTERP_MODE_FLAT vec3 goku#3 (VARYING_SLOT_VAR8.xyz, 8, 0)
decl_var shader_out INTERP_MODE_FLAT vec3 goku#4 (VARYING_SLOT_VAR9.xyz, 9, 0)
decl_var shader_out INTERP_MODE_FLAT vec3 goku#5 (VARYING_SLOT_VAR10.xyz, 10, 0)
decl_var shader_out INTERP_MODE_FLAT vec3 goku#6 (VARYING_SLOT_VAR11.xyz, 11, 0)
decl_var shader_out INTERP_MODE_FLAT dvec2 goku#7 (VARYING_SLOT_VAR12.xy, 12, 0)
Beautiful and easy to parse. There’s only one snag: I gotta do this manually.
Long-time fans of the blog will recall some wild ravings in the past where I described a pass I wrote to handle a similar issue. lower_64bit_vars is that pass, and it both splits variables containing 64bit types into 32bit types and then rewrites all access to them to use those new types.
And now I have to do basically the same thing. Again. But in a different enough way that none of the code is reusable.
The process for doing this variable rewrite is split in three:
But then there’s also the bonus step (everyone loves bonuses!) of scanning all the new variables and comparing them against the original variables to ensure they have the same number of per-location components (i.e., if the original variable consumes all components for a given location, the new one must too) in order to maintain shader interface compatibility, and for all the locations where a mismatch is detected, single-component variables have to be inserted, and they have to have associated access added too so various optimizers don’t delete them again, and it’s obviously one of the first things anyone embarking on this idiotic journey would consider and not a last-second thing that someone would only realize after running a series of esoteric piglit tests and seeing bizarre failures.
Variables. Done.
The next step is where things get really stupid, because this is where things need to happen so that the shader goes back to having all the derefs and explicit variable access it used to have before some idiot went and deleted them.
I called this the add_derefs pass because I’m a creative type. An auteur.
For this, all the i/o variables need to be iterated through, and for each variable, scan the shader for access, where “access” means the location and component are consumed by the variable. And also its fbfetch-edness matches. Then take this lowered load/store access, krangle in whatever possibly-indirect derefs the variable needs to mimic the lowered operation, and write in a new explicit load/store access.
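If that's hard to picture, here's a toy sketch of the matching in Python (purely illustrative: the real thing is a NIR pass, none of these names exist there, and real variables can span multiple locations):

from dataclasses import dataclass

# Toy stand-ins, nothing like the actual NIR data structures.
@dataclass
class IoVar:
    name: str
    location: int
    num_components: int
    fbfetch: bool = False

@dataclass
class LoweredIo:
    op: str            # "load_input", "store_output", ...
    location: int
    component: int
    fbfetch: bool = False

def add_derefs(io_vars, lowered_instrs):
    rewritten = []
    for var in io_vars:
        for instr in lowered_instrs:
            # "access" = the instruction hits a location/component owned by
            # this variable and agrees with it on fbfetch-ness
            hits = instr.location == var.location and instr.component < var.num_components
            if not hits or instr.fbfetch != var.fbfetch:
                continue
            # the real pass builds whatever (possibly indirect) deref chain the
            # variable needs and swaps the lowered op for explicit i/o
            kind = "load_deref" if instr.op.startswith("load") else "store_deref"
            rewritten.append((kind, var.name, instr.component))
    return rewritten

print(add_derefs([IoVar("color", 1, 4)], [LoweredIo("store_output", 1, 0)]))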
Except also I forgot to mention that i/o lowering needs to lower interpolation instructions, which are also (currently) in explicit deref format. And these explicit interpolation instructions get converted to lowered ones, and then sometimes a load_deref becomes a load_barycentric_centroid. And you know (lol) it wouldn't be a real adventure (lol) if a zink change didn't uncover (lol) some incredibly obscure and opaque (lol) llvmpipe bug! So then there's the usual spelunking through there, and whispered backchannel discussions and cursing with Dave, and OF FUCKING COURSE IT'S TGSI AGAIN but we got it done.
Also it’s possible there might be a future where llvmpipe doesn’t use TGSI but don’t quote me (I’ll deny it to my grave) and if anyone asks you didn’t hear it from me.
You’d think by the way I just went off on my usual TGSI rant that I was done exploring this section, but think again because none of us asked what gl_ClipDistance or gl_CullDistance thought about any of this.
Well I asked, and they’re not happy.
Clip/cull distance are stupidweird ones because they’re array[8] variables that consume two locations. And that means all the calculations/heuristics for accessing arrays that work for every other array are broken for these.
But it’s fine, because this is zink and the whole thing is just a jenga tower of hacks all the way down anyway.
I’ll be completely and brutally honest with you, this all worked perfectly the first time I ran it.
On NVK, that is, which, as I mentioned in my historic XDC keynote, has been relying on the now-merged NIR 2.0 since last year. Truly a driver living in the future.
Other drivers, however, required considerably more work to make CI explode. Sorry, I meant not explode. Obviously. Totally a joke. The absolute state of CI is 100% not the fault of this lowered i/o conversion.
Anyway, the clear choice once parity was achieved was to then start deleting code.
Remember all that gross NTV code I linked in the previous post? Gone.
More stupid XFB code that’s been jenga-ing around for years? Gone.
Obscure ticket from years ago? Fixed incidentally.
src/compiler/nir/nir_passthrough_gs.c | 2 +-
src/gallium/auxiliary/nir/nir_to_tgsi_info.c | 4 +
src/gallium/drivers/zink/nir_to_spirv/nir_to_spirv.c | 412 +------------------------------------
src/gallium/drivers/zink/zink_compiler.c | 1081 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------
src/gallium/drivers/zink/zink_compiler.h | 3 +-
src/gallium/drivers/zink/zink_draw.cpp | 2 +-
src/gallium/drivers/zink/zink_program.c | 8 +-
src/gallium/drivers/zink/zink_types.h | 6 +-
8 files changed, 736 insertions(+), 782 deletions(-)
And as seen in the statistics here, another bonus ticket was fixed too through the magic of code deletion.
I didn’t even get to mention the great things that happened related to maintenance5 yet. Be sure to read again next week when I inevitably shovel more garbage onto the internet in the form of an unfortunately large blog post riddled with memes that obfuscate the truly interesting parts.
As every one of my big brained readers knows, zink runs on top of vulkan. As you also know, vulkan uses spirv for its shaders. This means, in general, compiler-y stuff in zink tries to stay as close to spirv mechanics as possible.
Let’s look at an example. Here’s a very simple fragment shader from glxgears before it undergoes spirv translation:
shader: MESA_SHADER_FRAGMENT
source_sha1: {0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000}
name: ff-fs
internal: true
stage: 4
next_stage: 0
inputs_read: 1
outputs_written: 4-11
system_values_read: 0x00000000'00100000'00000000
subgroup_size: 1
first_ubo_is_default_ubo: true
separate_shader: true
flrp_lowered: true
inputs: 1
outputs: 8
uniforms: 0
decl_var shader_in INTERP_MODE_NONE vec4 VARYING_SLOT_COL0 (VARYING_SLOT_COL0.xyzw, 0, 0)
decl_var shader_out INTERP_MODE_NONE vec4 gl_FragData[0] (FRAG_RESULT_DATA0.xyzw, 0, 0)
decl_var shader_out INTERP_MODE_NONE vec4 gl_FragData[1] (FRAG_RESULT_DATA1.xyzw, 1, 0)
decl_var shader_out INTERP_MODE_NONE vec4 gl_FragData[2] (FRAG_RESULT_DATA2.xyzw, 2, 0)
decl_var shader_out INTERP_MODE_NONE vec4 gl_FragData[3] (FRAG_RESULT_DATA3.xyzw, 3, 0)
decl_var shader_out INTERP_MODE_NONE vec4 gl_FragData[4] (FRAG_RESULT_DATA4.xyzw, 4, 0)
decl_var shader_out INTERP_MODE_NONE vec4 gl_FragData[5] (FRAG_RESULT_DATA5.xyzw, 5, 0)
decl_var shader_out INTERP_MODE_NONE vec4 gl_FragData[6] (FRAG_RESULT_DATA6.xyzw, 6, 0)
decl_var shader_out INTERP_MODE_NONE vec4 gl_FragData[7] (FRAG_RESULT_DATA7.xyzw, 7, 0)
decl_var push_const INTERP_MODE_NONE struct gfx_pushconst
decl_function main (0 params)
impl main {
block b0: // preds:
32 %0 = deref_var &VARYING_SLOT_COL0 (shader_in vec4)
32x4 %1 = @load_deref (%0) (access=0)
32 %2 = deref_var &gl_FragData[0] (shader_out vec4)
@store_deref (%2, %1) (wrmask=xyzw, access=0)
32 %3 = deref_var &gl_FragData[1] (shader_out vec4)
@store_deref (%3, %1) (wrmask=xyzw, access=0)
32 %4 = deref_var &gl_FragData[2] (shader_out vec4)
@store_deref (%4, %1) (wrmask=xyzw, access=0)
32 %5 = deref_var &gl_FragData[3] (shader_out vec4)
@store_deref (%5, %1) (wrmask=xyzw, access=0)
32 %6 = deref_var &gl_FragData[4] (shader_out vec4)
@store_deref (%6, %1) (wrmask=xyzw, access=0)
32 %7 = deref_var &gl_FragData[5] (shader_out vec4)
@store_deref (%7, %1) (wrmask=xyzw, access=0)
32 %8 = deref_var &gl_FragData[6] (shader_out vec4)
@store_deref (%8, %1) (wrmask=xyzw, access=0)
32 %9 = deref_var &gl_FragData[7] (shader_out vec4)
@store_deref (%9, %1) (wrmask=xyzw, access=0)
// succs: b1
block b1:
}
Notice all the variables and derefs. This is in contrast to what shaders from more hardware-y drivers look like:
shader: MESA_SHADER_FRAGMENT
source_sha1: {0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000}
name: ff-fs
internal: true
stage: 4
next_stage: 0
inputs_read: 1
outputs_written: 2
subgroup_size: 1
first_ubo_is_default_ubo: true
separate_shader: true
flrp_lowered: true
inputs: 1
outputs: 1
uniforms: 0
decl_var shader_in INTERP_MODE_NONE vec4 VARYING_SLOT_COL0 (VARYING_SLOT_COL0.xyzw, 0, 0)
decl_var shader_out INTERP_MODE_NONE vec4 FRAG_RESULT_COLOR (FRAG_RESULT_COLOR.xyzw, 4, 0)
decl_function main (0 params)
impl main {
block b0: // preds:
32 %3 = undefined
32 %0 = deref_var &VARYING_SLOT_COL0 (shader_in vec4)
32x4 %1 = @load_deref (%0) (access=0)
32 %2 = deref_var &FRAG_RESULT_COLOR (shader_out vec4)
32 %4 = load_const (0x00000000)
@store_output (%1, %4 (0x0)) (base=4, wrmask=xyzw, component=0, src_type=float32, io location=FRAG_RESULT_COLOR slots=1, xfb(), xfb2()) // FRAG_RESULT_COLOR
// succs: b1
block b1:
}
The latter form here is called “lowered” i/o: the derefs for explicit variables have been lowered to intrinsics corresponding to the operations being performed. Such excitement, many detail.
With few exceptions, every mesa driver uses lowered i/o. Zink is one of those exceptions, and the reasons are simple:
It’s a tough choice, but if I had to pick one of these as the “main” reason why I haven’t done the move, my response would be yes.
With that said, I’m extremely disgruntled to announce that I have completed the transition to lowered i/o.
Hooray.
The reasoning behind this Sisyphean undertaking which has cost me the past couple weeks along with what shreds of sanity previously remained within this mortal shell:
It’s a tough choice, but if I had to pick one of these as the “main” reason why I have done the move, my response would be yes.
I’ll save the details of this for some deep dive posts to pad out my monthly blog counter. For now, let’s take a look at the overview: how does this affect “shader stuff” in zink?
The short answer, for that one person who is actively eyeballs-deep in zink shader refactoring, is that it shouldn’t have any effect whatsoever. The zink passes that use explicit derefs for i/o are mostly at the end of the compilation chain, and derefs will have been added back in time to avoid needing to touch anything there.
This refactor may be a tough concept to grasp, so I’m providing some flowcharts since it’s been far too long since the blog has seen any. Here is a basic overview of the zink shader compilation process:
It’s a simple process that anyone can understand.
This is the old process side-by-side with the new one for comparison:
Next time: maintenance5 in lavapipe or more compiler talk. You decide. But not really because I’m the one writing the posts.
Summer has kept me busy with holidays, but I have managed to find a bit of time to keep hacking on the driver for the VeriSilicon NPU since the last update.
The issue with placing the output to the right scale is solved now, and simple convolution operations are working just fine.
3D tensors are now supported as inputs, and we support strided convolutions as well, but only on 2D inputs for now.
The test workloads are running fast and stably now, so I feel I have pretty solid ground beneath my feet.
There are three features left before I can run a real, full-fledged commercially interesting model:
The last update in this blog left off at my attempt to figure out how the raw convolution outputs had to be processed, using fields called post_shift and post_multiplier, so I could get the right values in the final output.
After spending more time than I probably should have in a spreadsheet trying to find correlations, some desperate googling brought me to some research papers about optimizing quantization operations on integer-only hardware:
That explains the meaning of the shift and multiplier, as these are the operations we can use to approximate the floating point division on integer hardware.
But to actually understand what the hardware was trying to do with them, it was useful to look at the QNNPACK implementation of requantization.
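To make that concrete, here is a small Python sketch of the scheme as I understand it from those papers and from QNNPACK; the exact rounding and the number of bits the hardware gives post_multiplier are still assumptions on my side:

# Illustrative sketch of integer-only requantization, gemmlowp/QNNPACK style.
# How this maps onto the hardware's post_multiplier/post_shift fields is assumed.
def quantize_scale(real_scale, mult_bits=16):
    # Approximate real_scale as multiplier * 2**-shift, using as much of the
    # multiplier's range as possible without overflowing mult_bits bits.
    shift = 0
    while shift < 31 and round(real_scale * (1 << (shift + 1))) < (1 << (mult_bits - 1)):
        shift += 1
    return round(real_scale * (1 << shift)), shift

def requantize(acc, multiplier, shift, zero_point):
    # acc is the raw int32 accumulator coming out of the convolution.
    rounding = (1 << (shift - 1)) if shift > 0 else 0
    value = (acc * multiplier + rounding) >> shift
    return max(0, min(255, value + zero_point))   # clamp to uint8

# real_scale is typically input_scale * weight_scale / output_scale
multiplier, shift = quantize_scale(0.00315)
print(multiplier, shift, requantize(12345, multiplier, shift, zero_point=119))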
Supporting 3D input tensors was pretty much straightforward, as it was basically a matter of updating the code to take the added dimension into account, and also reordering the tensor elements into the depth-first order the hardware expects.
This was made much easier by some improvements to the scripts I use to observe the behavior of the closed source stack, by intercepting the communication with the kernel's GPL driver.
For example, this is the output when Mesa has generated a cmd stream that is functionally equivalent to what the blob sends to the kernel:
+ diff -u -U 100 /home/tomeu/mesa.txt /home/tomeu/galcore.txt
--- /home/tomeu/mesa.txt 2023-08-07 18:28:29.939750225 +0200
+++ /home/tomeu/galcore.txt 2023-08-07 18:28:42.116625362 +0200
@@ -1,176 +1,273 @@
{
- 0x0801028a, /* LOAD_STATE (1) Base: 0x00A28 Size: 1 Fixp: 0 */
- 0x00000011, /* PA.SYSTEM_MODE := PROVOKING_VERTEX_LAST=1,HALF_PIXEL_CENTER=1 */
- 0x08010e13, /* LOAD_STATE (1) Base: 0x0384C Size: 1 Fixp: 0 */
- 0x00000002, /* GL.API_MODE := OPENCL */
+ 0x00000000, /* UNKNOWN (0) */
+ 0x00000000, /* */
+ 0x00000000, /* UNKNOWN (0) */
+ 0x00000000, /* */
+ 0x00000000, /* UNKNOWN (0) */
+ 0x00000000, /* */
0x00000000, /* UNKNOWN (0) */
0x00000000, /* */
0x08010e4f, /* LOAD_STATE (1) Base: 0x0393C Size: 1 Fixp: 0 */
0x00000000, /* GL.OCB_REMAP_START := 0x0 */
0x08010e50, /* LOAD_STATE (1) Base: 0x03940 Size: 1 Fixp: 0 */
0x00000000, /* GL.OCB_REMAP_END := 0x0 */
0x08010e4c, /* LOAD_STATE (1) Base: 0x03930 Size: 1 Fixp: 0 */
0x00000010, /* GL.NN_CONFIG := UNK0=0x0,DISABLE_ZDPN=0,DISABLE_SWTILING=0,SMALL_BATCH=1,DDR_BURST_SIZE=0x0,UNK7=0,NN_CORE_COUNT=0x0,UNK12=0 */
0x08010428, /* LOAD_STATE (1) Base: 0x010A0 Size: 1 Fixp: 0 */
- 0xffff3000, /* PS.NN_INST_ADDR := *0xffff3000 */
+ 0x3348e780, /* PS.NN_INST_ADDR := *0x3348e780 */
0x08010429, /* LOAD_STATE (1) Base: 0x010A4 Size: 1 Fixp: 0 */
0x00000000, /* 0x010A4 */
0x08010e03, /* LOAD_STATE (1) Base: 0x0380C Size: 1 Fixp: 0 */
0x00000c23, /* GL.FLUSH_CACHE := DEPTH=1,COLOR=1,TEXTURE=0,PE2D=0,TEXTUREVS=0,SHADER_L1=1,SHADER_L2=0,UNK10=1,UNK11=1,DESCRIPTOR_UNK12=0,DESCRIPTOR_UNK13=0 */
0x08010e03, /* LOAD_STATE (1) Base: 0x0380C Size: 1 Fixp: 0 */
0x00000c23, /* GL.FLUSH_CACHE := DEPTH=1,COLOR=1,TEXTURE=0,PE2D=0,TEXTUREVS=0,SHADER_L1=1,SHADER_L2=0,UNK10=1,UNK11=1,DESCRIPTOR_UNK12=0,DESCRIPTOR_UNK13=0 */
0x00000000, /* UNKNOWN (0) */
0x00000000, /* */
}
map->layer_type = 0x0; /* (0) */
map->no_z_offset = 0x0; /* (0) */
map->kernel_xy_size = 0x2; /* (2) */
map->kernel_z_size = 0x4; /* (4) */
map->kernels_per_core = 0x1; /* (1) */
map->pooling = 0x0; /* (0) */
map->pooling_xy_size = 0x1; /* (1) */
map->prelu = 0x0; /* (0) */
map->nn_layer_flush = 0x1; /* (1) */
map->kernel_data_type = 0x0; /* (0) */
map->in_image_data_type = 0x0; /* (0) */
map->out_image_data_type = 0x0; /* (0) */
map->in_image_x_size = 0x4; /* (4) */
map->in_image_y_size = 0x4; /* (4) */
map->in_image_x_offset = 0x0; /* (0) */
map->in_image_y_offset = 0x0; /* (0) */
map->unused0 = 0x0; /* (0) */
map->brick_mode = 0x0; /* (0) */
map->brick_distance = 0x0; /* (0) */
map->relu = 0x0; /* (0) */
map->unused1 = 0x0; /* (0) */
map->post_multiplier = 0x0; /* (0) */
map->post_shift = 0x17; /* (23) */
map->unused2 = 0x0; /* (0) */
map->no_flush = 0x0; /* (0) */
map->unused3 = 0x0; /* (0) */
map->out_image_x_size = 0x3; /* (3) */
map->out_image_y_size = 0x3; /* (3) */
map->out_image_z_size = 0x1; /* (1) */
map->rounding_mode = 0x1; /* (1) */
map->in_image_x_offset_bit_3 = 0x0; /* (0) */
map->in_image_y_offset_bit_3 = 0x0; /* (0) */
map->out_image_tile_x_size = 0x3; /* (3) */
map->out_image_tile_y_size = 0x3; /* (3) */
-map->kernel_address = 0x3fffd00; /* (67108096) */
+map->kernel_address = 0xcd237f; /* (13443967) */
map->kernel_z_size2 = 0x0; /* (0) */
-map->in_image_address = 0xffff6000; /* (4294926336) */
-map->out_image_address = 0xffff7000; /* (4294930432) */
+map->in_image_address = 0x3348e240; /* (860414528) */
+map->out_image_address = 0x89ffc500; /* (2315240704) */
map->image_caching_mode = 0x0; /* (0) */
map->kernel_caching_mode = 0x1; /* (1) */
map->partial_cache_data_unit = 0x0; /* (0) */
map->kernel_pattern_msb = 0x0; /* (0) */
map->kernel_y_size = 0x2; /* (2) */
map->out_image_y_stride = 0x3; /* (3) */
map->kernel_pattern_low = 0x0; /* (0) */
map->kernel_pattern_high = 0x0; /* (0) */
map->kernel_cache_start_address = 0x800; /* (2048) */
map->kernel_cache_end_address = 0xa00; /* (2560) */
map->image_start_address = 0x0; /* (0) */
map->image_end_address = 0x800; /* (2048) */
map->in_image_border_mode = 0x0; /* (0) */
map->in_image_border_const = 0x7d; /* (125) */
map->unused4 = 0x0; /* (0) */
map->kernel_data_type_bit_2 = 0x0; /* (0) */
map->in_image_data_type_bit_2 = 0x0; /* (0) */
map->out_image_data_type_bit_2 = 0x0; /* (0) */
map->post_multiplier_1_to_6 = 0x1f; /* (31) */
map->post_shift_bit_5_6 = 0x0; /* (0) */
map->unused5 = 0x0; /* (0) */
map->in_image_x_stride = 0x4; /* (4) */
map->in_image_y_stride = 0x4; /* (4) */
map->out_image_x_stride = 0x3; /* (3) */
map->unused6 = 0x0; /* (0) */
map->post_multiplier_7_to_14 = 0x61; /* (97) */
map->out_image_circular_buf_size = 0x0; /* (0) */
map->unused7 = 0x0; /* (0) */
map->per_channel_post_mul = 0x0; /* (0) */
map->out_image_circular_buf_end_addr_plus_1 = 0x3ffffff; /* (67108863) */
map->unused8 = 0x0; /* (0) */
map->in_image_circular_buf_size = 0x0; /* (0) */
map->unused9 = 0x0; /* (0) */
map->in_image_circular_buf_end_addr_plus_1 = 0x3ffffff; /* (67108863) */
map->unused10 = 0x0; /* (0) */
map->coef_zero_point = 0x80; /* (128) */
map->out_zero_point = 0x77; /* (119) */
map->kernel_direct_stream_from_VIP_sram = 0x0; /* (0) */
map->depthwise = 0x0; /* (0) */
map->unused11 = 0x0; /* (0) */
map->unused12 = 0x0; /* (0) */
map->unused13 = 0x0; /* (0) */
map->unused14 = 0x0; /* (0) */
map->unused15 = 0x0; /* (0) */
map->unused16 = 0x0; /* (0) */
map->further1 = 0x0; /* (0) */
map->further2 = 0x0; /* (0) */
map->further3 = 0x3ffffff; /* (67108863) */
map->further4 = 0x7f800000; /* (2139095040) */
map->further5 = 0xff800000; /* (4286578688) */
map->further6 = 0x0; /* (0) */
map->further7 = 0x0; /* (0) */
map->further8 = 0x0; /* (0) */
0x40, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x2c, 0x99, 0x0e, 0x00, 0x00,
0x40, 0xea, 0x2c, 0xeb, 0x80, 0xaf, 0x80, 0x9b, 0x99, 0x80, 0x80, 0x13,
0x80, 0x80, 0x80, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00
0x69, 0xd3, 0x2d, 0x92, 0x07, 0x00, 0x64, 0x00, 0x0c, 0x22, 0x90, 0xd6,
0x53, 0xc9, 0xe2, 0x48, 0xe6, 0x4c, 0xa8, 0xeb, 0xd2, 0xf3, 0xb0, 0xf4,
0x2d, 0xa4, 0x3e, 0xf4, 0x0f, 0x7b, 0x98, 0x01, 0x41, 0x84, 0x92, 0x7e,
0xfa, 0x19, 0xf5, 0xda, 0xb3, 0x5a, 0xb7, 0xf3, 0x97, 0x95, 0x12, 0xe7,
0x51, 0x94, 0xcb, 0x5a, 0x1f, 0xa9, 0xc6, 0xc4, 0x1c, 0xa9, 0x92, 0x1f,
0xf7, 0x64, 0xc3, 0xca
0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77
This corresponds to a convolution with the following parameters:
The differences are due to different addresses being allocated between runs, plus some differences due to how Mesa's code is structured, but that shouldn't affect the end result.
At the top we have the payload of the submit IOCTL, followed by a struct with the configuration for the NN units themselves and then the buffers for the weights, input and output.
When running a convolution configuration that isn't yet supported, we will spot more differences and hopefully will be able to figure out the logic behind them.
The hardware doesn't really support strided convolutions, so these are "lowered" to 1-stride convolutions with added channels, as per this research paper:
By implementing the algorithm in the paper, we match the behavior of the blob, as with requantization. The paper only covers 2D input tensors though, so I will need to check how the blob behaves with 3D inputs and figure out the logic behind it.
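The gist of the input-side rearrangement, as a numpy sketch (the kernel gets a matching rearrangement as described in the paper, and the real code of course doesn't use numpy):

import numpy as np

def space_to_channels(x, stride):
    # Fold each stride x stride block of spatial positions into the channel
    # dimension, so that a stride-N convolution over x can be expressed as a
    # 1-stride convolution over the result (with a matching kernel rearrangement).
    h, w, c = x.shape
    assert h % stride == 0 and w % stride == 0
    x = x.reshape(h // stride, stride, w // stride, stride, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(h // stride, w // stride, stride * stride * c)

x = np.arange(6 * 6 * 1).reshape(6, 6, 1)
print(space_to_channels(x, 2).shape)   # (3, 3, 4)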
For now I have chosen to do the tensor manipulation on the CPU, but later on we will be able to use the TP units in the HW for this, reducing latency.
With so many different convolution parameters supported, I felt the need for a comfortable way of keeping regressions in check.
I wrote a simple pytest module that generates a TFLite model with a single convolution operation, with the parameters and payloads varied across the different configurations that we support.
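Something along these lines, trimmed down for illustration (the real module quantizes the models, since the hardware is integer-only, and covers many more parameter combinations):

import numpy as np
import pytest
import tensorflow as tf

@pytest.mark.parametrize("out_channels", [1, 4])
@pytest.mark.parametrize("stride", [1, 2])
def test_single_conv(out_channels, stride):
    # A model containing a single convolution operation.
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(out_channels, 2, strides=stride, input_shape=(4, 4, 1)),
    ])
    tflite_model = tf.lite.TFLiteConverter.from_keras_model(model).convert()

    interpreter = tf.lite.Interpreter(model_content=tflite_model)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    interpreter.set_tensor(inp["index"], np.random.rand(*inp["shape"]).astype(np.float32))
    interpreter.invoke()
    # In the real test, this (CPU) result is what the NPU output gets compared against.
    out = interpreter.get_output_details()[0]
    assert interpreter.get_tensor(out["index"]).shape[-1] == out_channels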
At some point I will add a CI job, probably before sending the initial merge request.
The initial NVK (nouveau vulkan) experimental driver has been merged into mesa master[1], and although there's lots of work to be done before it's application ready, the main reason it was merged was because the initial kernel work needed was merged into drm-misc-next[2] and will then go to drm-next for the 6.6 merge window. (This work is separate from the GSP firmware enablement required for reclocking, that is a parallel development, needed to make nvk useable). Faith at Collabora will have a blog post about the Mesa side, this is more about the kernel journey.
The nouveau kernel API was written 10 years or more ago, and was designed around OpenGL at the time. There were two major restrictions in the current uAPI that made it unsuitable for Vulkan.
When we kicked off the nvk idea I made a first pass at implementing a new user API, to allow the above features. I took a look at how the GPU VMA management was done in current drivers and realized that there was scope for a common component to manage the GPU VA space. I did a hacky implementation of some common code and a nouveau implementation. Luckily at the time, Danilo Krummrich had joined my team at Red Hat and needed more kernel development experience in GPU drivers. I handed my sketchy implementation to Danilo and let him run with it. He spent a lot of time learning and writing copious code. His GPU VA manager code was merged into drm-misc-next last week and his nouveau code landed today.
The idea behind the GPU VA manager is that there is no need for every driver to implement something that should essentially not be a hardware specific problem. The manager is designed to track VA allocations from userspace, and keep track of what GEM objects they are currently bound to. The implementation went through a few twists and turns and experiments.
For a long period we considered using maple tree as the core of it, but we hit a number of messy interactions between the dma-fence locking and memory allocations required to add new nodes to the maple tree. The dma-fence critical section is a hard requirement to make others deal with. In the end Danilo used an rbtree to track things. We will revisit if we can deal with maple tree again in the future.
We had a long discussion, and a couple of rounds of implementing it both ways to see, on whether we needed to track empty sparse VMA ranges in the manager or not. nouveau wanted these, but generically we weren't sure they were helpful, and that also affected the uAPI as it needed explicit operations to create/drop them. In the end we started tracking these in the driver and left the core VA manager cleaner.
Now the code is in tree we will start to push future drivers to use it instead of spinning their own.
Now that the VAs are being tracked, the nouveau API needed two new entrypoints. Since BO allocation will no longer create a VM, a new API is needed to bind BO allocations with VM addresses. This is called the VM_BIND API. It has two variants
My input was the sketchy sketch at the start, and doing the userspace changes to the nvk codebase to allow testing.
The biggest shoutout to Danilo, who took a sketchy sketch of what things should look like, created a real implementation, did all the experimental ideas I threw at him, and threw them and others back at me, negotiated with other drivers to use the common code, and built a great foundational piece of drm kernel infrastructure.
Faith at Collabora who has done the bulk of the work on nvk did a code review at the end and pointed out some missing pieces of the API and the optimisations it enables.
Karol at Red Hat on the main nvk driver and Ben at Red Hat for nouveau advice on how things worked, while he smashed away at the GSP rock.
(and anyone else who has contributed to nvk, nouveau and even NVIDIA for some bits :-)
[1] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24326
As everyone knows, Zink is a fast-moving target. Sometimes it moves so fast that even I don’t fully grasp the weight of some changes as they fly past.
I’m sure you all remember this monumental slide from XDC last year:
Truly a masterpiece that’s impossible to improve upon; don’t @ me.
Time has passed. Almost a year, some would say. Is that slide still accurate?
Anyone who knows anything about journalism knows that the answer to all rhetorical questions on the internet is always the same.
A couple weeks ago, Collabora's Igor Torrente put up an MR that slid under the radar for most people. Not me, of course, because as a responsible maintainer I carefully review every character in every line of code in every file for every patch in every MR tagged with my driver(s).
Because the great Adam Jackson and Daniel Stone also got D E E P into this one. By which I mean they commented.
And approved.
It’s the equivalent of clickbait on a blog. But why—
Yes, yes, it’s been a while, some number of weeks or mesa-years since I last blogged. Lots of important things have happened in that time. I’ve generated enough blog content for an entire month of posts, in fact. Maybe I’ll manage to maintain enough motivation to write about them.
Let’s kick off the return by looking at some progress updates.
It’s all very exciting, and there’s definitely gonna be lots of posts once I remember what happened when I was melting into a puddle during the heatwave.
In the meanwhile, I want to talk about something else. Something a lot of people ask me about.
I want to talk about working for Valve.
I work in Open Source, so obviously I can only comment on that, and I only work on certain projects, so obviously I can only comment on my experience working on them, and I’m only one non-hivemind mortal being, so obviously I can only comment on what I’ve personally experienced, but I’m nearly through my third year here and it feels like a good time for a post of this sort. You know, because three.
So what’s it really like here?
In a word, working here is great.
Imagine you’ve got three or twenty projects you enjoy working on. Now imagine your task is to make them better. Better how, exactly? However you want. Which one do you work on? Whichever one you want. How many hours per week do you work? However many you want. Who checks your progress? You do. How do biannual performance evaluations happen? They don’t. And on top of all that you get paid.
It sounds too good to be true, doesn’t it? Surely working here can’t be such a complete state of anarchy, and indeed it isn’t. In my experience, the Open Source team here is like a big jigsaw puzzle: there’s a ton of different pieces, each of them with its place, each of them making the others better.
Let me explain.
There’s a lot of people working here, all of them smarter than me (but none of them blogging more than me). Most of them have been here longer than me too. Every one of them has fallen into their niche, the place they like to tinker where they can excel. Here’s a few just out of the old-timers:
Everyone has distinct roles that they play on the team. Project areas they specialize in, as per my “anything goes” claim above. Some people work on lots of things, some don’t, but the niches are filled. Everyone’s got their “spot”.
Put another way, everyone on the team is a piece of the puzzle, the puzzle being “make Linux gaming great”. Everyone fits into their spot, and by doing so things get better.
There’s another way of looking at things though. While everyone here can be a puzzle piece, everyone has their own puzzles too. I work on zink, but that doesn’t mean “Mike is the one working on zink”. What it really means is that I’m able to work on zink because the puzzle pieces have been assembled such that I’m able to work on zink. It’s like how you wouldn’t try benching four plates at the gym without having a spot (I would, but I’m huge).
Sometimes getting a spot is a production. You know, the kind of thing that makes headlines where Joshie throws me the rock, but it's too hot so I fire off a pass to Georg, and he slows down the tempo while we wait for hardware vendors to understand how their thinking sand interacts with complex ancient graphics specifications, but then we get in the zone, and Georg throws me an alleyoop, and then Joshie takes it back to fix the original problem and now games work a little better.
Sometimes it’s all rockstars all day long.
But it’s the other times that make working here really great. The times when you’re struggling to grind out that last rep because you told your buddy you were definitely gonna hit 10x315 on front squat this week and you don’t wanna go back and admit you had too much preworkout.
I’m talking about times like when Timur picked up my massive CPU optimization series and wrangled it into a bunch of MRs because I was foolishly stretching myself too thin across too many projects.
I’m talking about the unsung heroes who make working here truly great.
Everyone knows Rhys. Everyone here, anyway. Outside the team it might be a different story; he has no blog, and searching for his many and varied accomplishments in the depths of the internet yields only one article written before I was born.
IYKYK, as they say. Just in the past week he’s quietly fixed Rage 2 and WWZ. A glance through his extensive patch history is a litany of complex optimizations and tweaks which aren’t flashy enough to be newsworthy on their own but add up to significant gains through consistent improvements.
But it’s still not any of these that (to me, at least) make Rhys one of the unsung heroes of the team. The glue that holds parts of it together.
All my very high IQ readers know what it's like to get stuck on something. That feeling when you come across a problem, and you know it's a problem, and you can handwave some half-functional solution that lets you limp across the finish line to collapse in a broken, battered heap with 0 regressions as the output of your refactoring branch's latest CTS run, but you can't quite figure out the "right" way to fix the problem. The way that won't get your patches NAKed harder than proposing the addition of registers to NIR right now.
At times like these, who’s there to help you out? Who is it that gives the bar that tiny, it-was-all-you-bro-I-didn’t-even-touch-it nudge to help you finish that last rep?
It’s Rhys Perry. It’s always the Rhyses, the unsung heroes. The ones who answer complex questions on IRC at 2am because they’re watching an historic cricket match and happened to glance over and see you flailing away at your keyboard. The ones who step in and say “sure, I’ll review this absolute disaster you fingerpainted into gitlab using the webui without regard to formatting, or coding conventions, or even the right programming language, and we’ll get through this together with a fix that’ll make everyone happy” when you’re staring down Yog-Sothoth in the abyss of the compiler stack at the end of a week and have exactly one functioning brain cell remaining that tells you only SHOWER. NOW. IT’S BEEN DAYS.
And it’s the mixing of all these people, rockstars and not, unsung heroes and not, working on so many projects, enabling each other and making us all better at what we do that makes working at Valve great.
To me, at least.
Tune in next time when I’ll be MS Painting some XFB memes and raging about Big Triangle since that’s apparently my niche.
EOSS in Prague was great, lots of hallway track, good talks, good food, excellent tea at meetea - first time I had proper tea in my life, quite an experience. And also my first talk since covid, packed room with standing audience, apparently one of the top ten most attended talks per LF's conference report.
The video recording is now uploaded, I’ve uploaded the fixed slides, including the missing slide that I accidentally cut in a last-minute edit. It’s the same content as my blog posts from last year, first talking about locking engineering principles and then the hierarchy of locking engineering patterns.
Hi all!
As usual, this month has been rich in Wayland-related activities. Rose has continued building and upstreaming better frame scheduling infrastructure for wlroots, you can read more on her blog. I’ve resurrected an old patch to make wlroots behave better when the GPU is under high load. In my testing this improves latency a lot in some specific scenarios and on some specific hardware, but doesn’t help on some others. It’s not super clear if anything can be done about this, it may be that we are hitting some hardware limitations here: GPUs don’t know how to preempt tasks very well.
I’ve also started working on explicit synchronization again. This was previously blocked on a hard problem: drivers may want to use a new kind of synchronization fence primitive (user-space memory fences) and it wasn’t clear how the current primitives (drm_syncobj) would hold up. We’ve been talking about this new primitive for a few years but unfortunately it’s a complicated matter and nothing new has surfaced. However, after discussing with Daniel Vetter, we’ve come to the conclusion that the kernel will provide backwards compatibility for drm_syncobj, so we can just stop worrying and use that as the basis for explicit synchronization protocols and implementations. Moreover, NVIDIA engineers are interested in helping with this effort, so I hope we can keep the momentum and join forces to push the new protocol, APIs and implementations to the finish line.
There is a lot to be done to plumb explicit synchronization. This month I’ve respinned a new kernel uAPI patch to allow compositors to wait on a drm_syncobj without blocking. This also involved writing a test suite in IGT and a wlroots patch to use the new uAPI. Everything is now reviewed, I hope to merge this soon. Apart from this, we also need a new Wayland protocol, a new Vulkan extension for drm_syncobj import/export, more implementations of the protocol, ideally yet another new kernel uAPI to improve interoperability with sync_file, and even a new X11 protocol so that legacy X11 clients (read: games) can take advantage of this whole thing. Oh my… As French people say, there is some bread on the table.
In other Wayland news, we’ve started having some more-or-less weekly meetings for wayland-protocols standardization. We’ve been talking about upstreaming some of the stuff currently in a private GTK protocol, IMEs, and layer-shell. It’s been great to be able to discuss face-to-face about blockers for these protocols. The meeting notes are available on the wiki. We’ve done a lot of talking and gesturing, but also some actual work: security-context has finally (!) been merged, and I’ve updated the ext-layer-shell patch.
Apart from the explicit synchronization work, I’ve sent a few other kernel patches. Numerous patches to improve the kernel uAPI documentation, and a few patches to add more information to the hotplug events sent by bridge/i915/nouveau so that compositors don’t need to reload the whole KMS state on each hotplug event (instead, they can now only reload the KMS state of the one specific connector which got hotplugged). I’ve reviewed a few patches as well. Thomas Zimmermann has made it so all DRM drivers now support DMA-BUFs (required for wlroots to run), so now wlroots works on e.g. gma500. AMD engineers have sent patches to support more than 64 DRM devices, there are some subtle uAPI stability issues at play I’ve tried to provide feedback on.
Let’s wrap up this status update with a collection of various smaller happenings. I’ve removed dlsym() related magic used in the Wayland test suite which caused sporadic failures on FreeBSD. I’ve been gradually improving the API for go-imap v2 and fixing a few bugs. hut now supports pagination on all commands thanks to tireless work by Thorben Günther. kanshi now supports configuring adaptive sync (VRR). I’ve improved the API of go-oauth2 a bit. Last but not least, I’ve reworked an old patch to make it easier to parse scfg files from Go programs, by defining a Go struct instead of hand-rolling parsing code.
See you next month!
I recently came across tinygrad, a small but powerful NN framework that has an OpenCL backend target and can run the LLaMA model.
I've been looking out for rusticl workloads, and this seemed like a good one, and I could jump on the AI train, and run an LLM in my house!
I started it going on my Radeon 6700XT with the latest rusticl using radeonsi with the LLVM backend, and I could slowly interrogate a model with a question, and it would respond. I've no idea how performant it is vs ROCm yet which seems to be where tinygrad is more directed, but I may get to that next week.
While I was there though I decided to give the Mesa ACO compiler backend a go, it's been tied into radeonsi recently, and I'd done some hacks before to get compute kernels to run. I reproduced said hacks on the modern code and gave it a run.
tinygrad comes with a benchmark script called benchmark_train_efficientnet so I started playing with it to see what low hanging fruit I could find in an LLVM vs ACO shootout.
The bench does 10 runs; the first is where lots of compilation happens, the last is well primed cache-wise. Here are the figures from the first and last runs with a release build of llvm and mesa (and the ACO hacks).
LLVM:
215.78 ms cpy, 12245.04 ms run, 120.33 ms build, 12019.45 ms realize, 105.26 ms CL, -0.12 loss, 421 tensors, 0.04 GB used, 0.94 GFLOPS
10.25 ms cpy, 221.02 ms run, 83.50 ms build, 36.25 ms realize, 101.27 ms CL, -0.01 loss, 421 tensors, 0.04 GB used, 52.11 GFLOPS
ACO:
71.10 ms cpy, 3443.04 ms run, 112.58 ms build, 3214.13 ms realize, 116.34 ms CL, -0.04 loss, 421 tensors, 0.04 GB used, 3.35 GFLOPS
10.36 ms cpy, 234.90 ms run, 84.84 ms build, 36.51 ms realize, 113.54 ms CL, 0.05 loss, 421 tensors, 0.04 GB used, 49.03 GFLOPS
So ACO is about 4 times faster to compile but produces binaries that are less optimised.
The benchmark produces 148 shaders:
LLVM:
ACO:
So ACO doesn't quite get the optimal shaders for a bunch of paths, even with some local hackery I've done to make it do better.[1]
I'll investigate ROCm next week maybe, got a bit of a cold/flu, and large GPU stacks usually make me want to wipe the machine after I test them :-P
[1] https://gitlab.freedesktop.org/airlied/mesa/-/commits/radeonsi-rusticl-aco-wip
I'm suffering from having a mortal form again, but things are moving in the general direction of progress.
Or "Rose, it's 2 in the morning!" Yeah yeah, whatever, you're not my mum.
Some would call this whining - skip this section if you're here for technology :)
You're not supposed to make yourself work when you don't have energy to because you'll feel bad. People have tried telling me this and I've tried listening but to really take it on board I had to figure out what low energy actually feels like, so here we are, skipping a week of status reporting and holding a suspiciously high Factorio play time. I spent some of that play time making a cool blue circuit factory! Downtime is a good idea, hopefully - we'll find out next week whether it worked.
It's surprising that one of the hardest problems given to me by the Fates has been fighting against myself, which sounds overly dramatic but in a literal sense is true. I would be moving faster if I felt up to it, but I don't feel up to it because I moved too fast recently. It's my fault because I wore myself out, but it's not my fault to rest when I need to, so instinctively I remain undecided on whether it's my fault. Sadly this isn't a balance that I've learned to strike, at least not for large scale work that I care about.
Add this to a general guilt for doing less than others seem to be doing (a velocity- rather than the famous competence-based impostor syndrome) and the work that was once appealing becomes more distant. LoC metrics are a favourite of crap managers, quick glancers, and the part of my subconscious that judges my self worth. It's not ideal and it's even not-idealer when your work is mostly thinking and not actually that much coding - see the previous report for a bunch of musings about what code should be written and not much written code. It's valid work! But the goblin in my skull disagrees. The mortal form disappoints me. I was hoping to discover my inner cold programming machine but I just found some boring human imperfections. Yawn!
This isn't what I was expecting to write about but I think it's helping. I'm sure these aren't unique experiences but they worry me nonetheless, which is partially because I'm hardwired to be worrying about something most of the time.
In a couple of days it will all be OK because I'll be able to play Counter-Strike again and that will for sure make my productivity go up, or down. The paradox of relaxing!
As predicted, I have to face prediction. Before I do that, I want to get a feel for the behaviour of compositors' performance so I'm not mathsing in the dark, and my weapon of choice is Linux's tracing system which either is called ftrace or has a component called ftrace. I can't tell which.
We've met Linux's tracing before. The screenshots from GPUVis were made of data extracted from it, which makes it an attractive answer to the question "where do I put all my data". In theory, if wlroots gains the ability to output events to this system, GPUVis will automatically be able to display these events as it does all the others.
The mechanism for userspace to emit events in this way landed in Linux 6.4 which was unleashed about 12 hours before I realised that my laptop's 6.3 series kernel didn't have support for it and nearly gave up. Until 6.4, the feature was gated behind CONFIG_BROKEN and looked truly like a lost cause. Thankfully Simon noticed that 6.4 held the answer to my problems and I found things to do while I waited for it to hit my distribution. Thrilling! We're back on track.
To hide the horrors of a bare UAPI from wlroots, I wrote and published libuserevents, which is my first C library and will make interacting with user_events amazing and great and you should definitely use it. There are whispers of integration into wlroots so far. I hope eventually I'll have a nice tool that can monitor a running compositor and show a graph of the frame times because that will at least be something pretty to look at to get away from thinking.
In the background there's a scene timer wriggling its way through review and the dreaded How To Schedule Frame Signals is looming over us all. I forgot to submit the Vulkan timer in all the ruckus. Oh well, apparently no one's supposed to be using the Vulkan backend yet anyway so I doubt there's anyone holding their breath.
I've also just noticed that the second status report has links to git branches instead of commits, so they're likely very stale by now. Remind past me to not do that, that moron.
Who knows what the future holds? Join us next week to find out.
Today, Imagination Technologies announced some very exciting news: they are now using Zink for full OpenGL 4.6 support! Collabora had the pleasure of working together with engineers from Imagination to make this a reality, and it’s very rewarding to now be able show the results to the world!
More importantly, this is the first time we’ve seen a hardware vendor trust the OpenGL-on-Vulkan Mesa driver enough to completely side-step a native OpenGL driver and use it in a shipping product. It’s wonderful to see that Zink can realistically be used as a work-horse, especially in a high-performance graphics setting.
Zink started out as a small R&D project at Collabora, but has since grown to be a full-on community project. None of this would have been possible without the awesome work done by Mike and the other Zink contributors!
One small detail from Imagination’s post that I think is important to highlight is that the solution is officially conformant. This is the first product to be officially conformant using Zink, but it’s not going to be the last! In fact, we only need one more conformant implementation before Zink itself is conformant as a generic layered implementation, according to the Khronos Conformant Product Criteria.
In the not too distant future, we should be able to combine Zink with the in-progress open source driver from Imagination, and that’s when things will really start to shine for the open source graphics stack on Imagination hardware. So there’s plenty more to look forward to here!
The All Systems Go! 2023 Call for Participation Closes in Three Days!
The Call for Participation (CFP) for All Systems Go! 2023 will close in three days, on 7th of July! We’d like to invite you to submit your proposals for consideration to the CFP submission site quickly!
All topics relevant to foundational open-source Linux technologies are welcome. In particular, however, we are looking for proposals including, but not limited to, the following topics:
The CFP will close on July 7th, 2023. A response will be sent to all submitters on or before July 14th, 2023. The conference takes place in 🗺️ Berlin, Germany 🇩🇪 on Sept. 13-14th.
All Systems Go! 2023 is all about foundational open-source Linux technologies. We are primarily looking for deeply technical talks by and for developers, engineers and other technical roles.
We focus on the userspace side of things, so while kernel topics are welcome they must have clear, direct relevance to userspace. The following is a non-comprehensive list of topics encouraged for 2023 submissions:
For more information please visit our conference website!
Adam Jackson created the script add-gitlab-merge-requests.sh which is the basis of this workflow.
The idea is to provide local access to all of the PRs that exist upstream. This both provides a better general overview of which PRs have been pulled into the branch you're working on, and also enables you to search the contents of all PRs.
This function automagically detects if your remote is hosted on GitHub or GitLab and makes the necessary adjustments to work on either platform.
[alias]
mr = pr
pr = "!f() { \
REMOTES=$(git remote); \
REMOTE=\"origin\"; \
case \"$REMOTES\" in \
*upstream*) \
REMOTE=\"upstream\"; \
;; \
esac; \
ORIGIN=${1:-${REMOTE}}; \
URL=$(git remote get-url ${ORIGIN}); \
\
case \"$URL\" in \
*gitlab*) \
REMOTE_EXT="mr" \
REMOTE_PATH="merge-requests" \
;; \
*github*) \
REMOTE_EXT="pr" \
REMOTE_PATH="pull" \
;; \
esac …
As of today, gitlab.freedesktop.org provides easy hooks to invoke the gitlab-triage tool for your project. gitlab-triage allows for the automation of recurring tasks, for example something like
If the label FOO is set, close the issue and add a comment containing ".... blah ..."

Many projects have recurring tasks like this, e.g. the wayland project gets a lot of issues that are compositor (not protocol) issues. Being able to just set a label and have things happen is much more convenient than having to type out the same explanations over and over again.
The goal for us was to provide automated handling for these with as little friction as possible. And of course each project must be able to decide what actions should be taken. Usually gitlab-triage is run as part of project-specific scheduled pipelines but since we already have webhook-based spam-fighting tools we figured we could make this even easier.
So, bugbot was born. Any project registered with bugbot can use labels prefixed with "bugbot::" to have gitlab-triage invoked against the project's policies file. These labels thus serve as mini-commands for bugbot, though each project decides what happens for any particular label. bugbot effectively works like this:
sleep 30
for label in {issue|merge_request}.current_labels:
    if label.startswith("bugbot::"):
        wget https://gitlab.freedesktop.org/foo/bar/-/raw/{main|master}/.triage-policies.yml
        run-gitlab-triage --as-user @bugbot --use-file .triage-policies.yml
        break

And this is triggered on every issue/merge request update for any registered project, which means that all you need to do is set the label and you're done. The things of note here:
resource_rules:
  issues:
    rules:
      - name: convert bugbot label to other label
        conditions:
          labels:
            - "bugbot::foo"
        actions:
          labels:
            - "foo"
          remove_labels:
            - "bugbot::foo"
          comment: |
            Nice label you have there. Would be a shame if someone removed it
          status: "close"
  merge_requests:
    rules: []

And the effect of this file can be seen in this issue here.
Bugbot is part of the damspam project and registering a project can be done with a single command. Note: this can only be done by someone with the Maintainer role or above.
Create a personal access token with API access and save the token value as $XDG_CONFIG_HOME/bugbot/user.token
Then run the following commands with your project's full path (e.g. mesa/mesa, pipewire/wireplumber, xorg/lib/libX11):
$ pip install git+https://gitlab.freedesktop.org/freedesktop/damspam
$ bugbot request-webhook foo/bar

After this you may remove the token file and the package
$ pip uninstall damspam
$ rm $XDG_CONFIG_HOME/bugbot/user.token

The bugbot command will file an issue in the freedesktop/fdo-bots repository. This issue will be automatically processed and should be done by the time you finish the above commands, see this issue for an example. Note: the issue processing requires a git push to an internal repo - if you script this for multiple repos please put a sleep(30) in to avoid conflicts.
Remember you can test your policies file with
$ gitlab-triage --dry-run --token $GITLAB_TOKEN \
      --source-id foo/bar --resource-reference 1234
I mentioned some time back about Konstantin’s heroic efforts to add all the descriptor features to Lavapipe.
After a lot of battling CI those efforts have finally paid off, and Lavapipe now supports all the features. Enough to run a credible amount of vkd3d-proton, in fact.
Great work. Can’t wait to see how he tackles sparse binding.
What two weeks!
Picking up from where I left off at the last update, I made progress in understanding the format of the buffer that contains the weights and biases.
The bit of knowledge that made a difference was realising that the format is optimized so that each NN core can efficiently access the portion of it that it needs, without having to do any parsing or decoding. Knowing that also helped in guessing what some fields in the parameter structure are for.
With that, I was able to correctly run a convolution on a small matrix with arbitrary weights and biases.
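The check itself is nothing fancy: compare what the hardware writes out against a naively computed reference, along the lines of this numpy sketch (illustrative only, not the actual test code):

import numpy as np

def conv2d_valid(image, kernel, bias):
    # Single-channel, stride-1, "valid" padding reference convolution
    # (really a cross-correlation, matching what NN frameworks call convolution).
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1), dtype=np.int32)
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel) + bias
    return out

image = np.arange(16, dtype=np.int32).reshape(4, 4)
kernel = np.array([[1, 0], [0, -1]], dtype=np.int32)
print(conv2d_valid(image, kernel, bias=3))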
The biggest roadblock in this area currently is understanding how I need to program the output unit in the NN so the output data is in the desired scale. There are a series of fields that influence how the output values are processed before being placed in the output buffer, and I don't really know how they work yet. They are called post_shift and post_mult, and the first correlates moderately (r=0.78) with the quantization scale of the output. I know that the post_shift field does what it says (a shift to the right), but to understand what value I need in each situation I feel I need to understand better how the hardware works and what the values coming out of the convolution might look like before they reach the output unit. I will be reading a bunch of research papers about NN-accelerating silicon in the summer.
That said, replacing the OpenCL kernels in TensorFlow Lite's GPU delegate that do convolutions with the fixed units turned out to be a worse idea than I initially thought. This is because that delegate is completely oriented towards float-first hardware such as GPUs and this accelerator is integer only.
A consequence of this is that TFLite inserts a dequantize operation at the start of the graph and a quantize at the end, to match the desired input and output formats of a fully quantized model while feeding floats to the GPU. We need integers, so we would have to quantize after TFLite's dequantization and vice versa. Also, the other operations in the graph expect floats as well... This is certainly the wrong path to take for performance in a bandwidth-constrained device as all embedded boards are, so I had to go back to the drawing board.
If TF Lite's GPU delegate is such a bad match for this HW, what can we do to run inferences at reasonable speeds? The same thing VeriSilicon did: write our own delegate:
https://gitlab.freedesktop.org/tomeu/mesa/-/commits/teflon/
TF Lite's operation description matches relatively well what we currently know of the configuration of the NN units. So we will not need to write complex shaders to implement the operations, but "just" translate the description of the operation to the HW configuration.
Of course, there is no HW that has fixed function units that accelerate all operations that are built into TF Lite or even that the most commonly used models contain. VeriSilicon's delegate deals with that by having a library of optimized OpenCL kernels that run on their programmable shader core(s).
But we want to avoid getting in the business of writing dozens of kernels that will need to be tweaked and made more complex so they run efficiently on other NPUs out there.
Fortunately, the delegate infrastructure in TF Lite is designed for this very scenario of imperfect HW and we can have a simple delegate that will implement the operations supported by the HW and the rest will execute in other delegates based on their capabilities.
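To make the partitioning idea concrete, here is a toy sketch with invented enums and helpers; the real implementation would go through the TF Lite delegate API (TfLiteDelegate and the context's ReplaceNodeSubsetsWithDelegateKernels), so treat this purely as an illustration of "claim what the fixed-function units can run, leave the rest to others":

/* Toy partitioning sketch: the NPU delegate only claims the op types the
 * NN cores can accelerate; TF Lite hands everything else to other delegates
 * (or runs it on the CPU). */
enum op { OP_CONV_2D, OP_DEPTHWISE_CONV_2D, OP_AVERAGE_POOL_2D, OP_RESHAPE, OP_SOFTMAX };

static int op_supported_by_npu(enum op o)
{
    return o == OP_CONV_2D || o == OP_DEPTHWISE_CONV_2D;
}

/* Returns how many node indices we ask TF Lite to delegate to us. */
static int claim_nodes(const enum op *graph, int num_nodes, int *claimed)
{
    int n = 0;
    for (int i = 0; i < num_nodes; i++)
        if (op_supported_by_npu(graph[i]))
            claimed[n++] = i;
    return n;
}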
How fast that will be is a big unknown right now, as switching between delegates will have a cost in terms of synchronization and data sharing, but that is something we can probably improve in the TF Lite code base, as the kernel already has all the mechanisms needed for efficient synchronization and data sharing.
Another possibility the TF Lite delegate mechanism gives us is offloading the operations we don't support to a different delegate that can accelerate them. For example, in the case of a board with an Amlogic A311D or S905D3, we could use the GPU delegate to run those operations on its Mali GPU, via the OpenCL driver that Alyssa is writing in Mesa.
And if that is still slower than with the proprietary stack, one could always write an optimized kernel in NIR to run on the programmable core in the Vivante NPU. That is the beauty of free software, we can address the needs we have ourselves, and importantly so, do it by pooling work with others!
Because this frontend is implemented in terms of Gallium, we leverage the infrastructure in there for memory management, synchronization and execution. I think this will work well for adding support to other NN engines such as those from Rockchip, Cadence, Mediatek, etc.
I need to crack the nut of the post-processing of the raw output so it is in the expected scale, and afterwards I will be looking at handling multiple feature maps (kernel z > 1).
After that I don't see much else in the way of running convolutions as expected by TF Lite, so hopefully I will be running some models and measuring the performance. I expect that we will want to do the same for accelerating tensor operations with the TP units. And we will probably want to take a look at using the SRAM to reduce bandwidth and memory access latency. That is still some way off though, and the summer is just starting!
API design is kinda tricky.
wlroots is designed to be very very modular. It's clear from the readme:
This design goal seems to have come about from lessons learned in other projects where a more all-in-one approach has become a burden. Note that "pluggable" doesn't just mean you can plug it, it also means that you have to. In my opinion, that's why people shy away from this approach: if you want library users to be able to swap out parts of your code for theirs, you have to force them to be responsible for all the wiring even if they are using the default everything; otherwise they won't have access to the wiring for when they change their mind.
This is all very commendable - engineering has happened, people learned, we are making better tradeoffs now than we were before. But what it means for me is that I have to also learn and do engineering. Imagine! More seriously, I want to respect the consensus that making library users do wiring is good, and they should be able to opt out of my code being called by not calling it.
What I'm finding, though, is that the Thing I'm Trying To Do (scheduling) sounds like a feature but is more of an optimisation. Really, I'm making the frame signal firing be more interesting than it was before. Since the beginning it has worked in its way with a constant delay of zero, and now I am improving it to recognise that the delay might not be zero. I don't think it's possible to do this work in a neat and separate box, because it lies right in the middle of the frame firing path. There's a fair bit of complexity in the way the signal is fired, so refactoring the whole firing path to separate it from what it touches (wlr_output and the backends) is much harder than it might seem.
Top scientists (me) are working around the clock (or parts of it) to figure out how the scheduling can be decoupled such that it doesn't have to touch the backends at all. And by "working to figure out" I mostly mean "stewing on how it's kind of impossible and I don't think we can do much better than !4214". We'll see, but at the moment it seems like I have to disregard all the pondering from the top of the page and stick my grubby fingers into all kinds of once-sacred functions.
There are measurement woes, too: wlr_scene is a whole thing, and it calls the renderer for you. It spends some time doing that and we need to know how long that time is, so we do need to add measurement code inside the function that does the thing (wlr_scene_output_build_state). I've been doing a lot less pondering and a lot more giving up on this one, and my grand plan at the moment is to just throw a timer right in there and let the user query it. This does make scene bigger, but I think it's alright to expect the code that does the rendering to also do the timing thereof. It will need to be aware of the timer to some extent, so we might as well put the whole thing in there and not bother trying to pretend they're separate.
Aside from all that, I've implemented a timer for the Vulkan backend, but I haven't submitted it yet. Not much more to say there.
Soon, I will have to face prediction and think about statistics. Nooooooo!
This work, proposed by my mentor Maíra Canal, is proving more difficult than I thought >W<.
During the Community Bonding Period of GSoC, Maíra proposed that I work on VKMS. While looking at the driver's TODO list, in the plane features section, I found the item “Additional buffer formats, especially YUV formats for video like NV12” interesting to work on, as it correlates with my GSoC work.
Before I start talking about what was done I think this blog post needs more context.
The Virtual Kernel Mode Setting (VKMS) is a virtual video driver. It is used to test that userspace programs display correctly without needing a physical monitor, which makes virtual drivers like it useful for automated testing.
NV12 is a multi-planar color format that uses the YCbCr color model with chroma subsampling. That explanation doesn’t say much on a first read, so let’s break it down.
YCbCr, like RGB, is a way to represent color in computer memory. It’s divided into three components: Y, the luminance (brightness) of the pixel; Cb, the blue-difference chroma component; and Cr, the red-difference chroma component.
This color model is better for compression because it separates the brightness from the color. Humans perceive changes in brightness more than changes in color, so we can store less color information in an image without a perceptible loss of detail. That is why some formats share the same chroma (Cb, Cr) components between multiple pixels. This technique is called chroma subsampling.
NV12 shares the same chroma components across 2 pixels horizontally and 2 pixels vertically. It achieves that by having two areas of memory called planes: one only for the luminance component and the other for the two chroma components.
A format that separates its components into multiple planes is called a multi-planar format. More specifically, a format with two planes is called a semi-planar format, and one with three planes is called a planar format.
Y Plane Cb Cr Plane Formed Pixels
Y00 Y01 Cb0 Cr0 Y00Cb0Cr0 Y01Cb0Cr0
Y02 Y03 + Cb1 Cr1 ----> Y02Cb0Cr0 Y03Cb0Cr0
Y10 Y11 Y10Cb1Cr1 Y11Cb1Cr1
Y12 Y13 Y12Cb1Cr1 Y13Cb1Cr1
Each Yij value uses the corresponding Cbi and Cri values.
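To make the addressing concrete, here is a small illustrative helper (the layout and names are made up for this example, this is not VKMS code) that fetches the Y, Cb and Cr bytes for the pixel at (x, y) in an NV12 buffer:

#include <stddef.h>
#include <stdint.h>

struct nv12_view {
    const uint8_t *y_plane;   /* plane 0: one Y byte per pixel */
    const uint8_t *uv_plane;  /* plane 1: interleaved Cb/Cr, one pair per 2x2 block */
    size_t y_pitch;           /* bytes per line of the Y plane */
    size_t uv_pitch;          /* bytes per line of the CbCr plane */
};

static void nv12_sample(const struct nv12_view *v, int x, int y,
                        uint8_t *luma, uint8_t *cb, uint8_t *cr)
{
    *luma = v->y_plane[y * v->y_pitch + x];
    /* hsub = vsub = 2: one CbCr pair is shared by a 2x2 block of pixels */
    const uint8_t *uv = v->uv_plane + (y / 2) * v->uv_pitch + (x / 2) * 2;
    *cb = uv[0];
    *cr = uv[1];
}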
Currently, VKMS supports none of the features above, so the task is subdivided, in order, into: supporting multi-planar framebuffers, handling the chroma subsampling, converting YCbCr to RGB with its different encodings and ranges, and finally wiring up the NV12 format itself.
I think you can see why this work is harder than I thought. I knew none of that information beforehand, so learning all of that was an adventure :D.
I have implemented all those things, but it still needs a little more thought to get everything working together.
VKMS currently expects drm_framebuffers to have a single plane. The first thing to do is to remove the offset, pitch, and cpp variables from vkms_frame_info and switch to using the arrays they were taken from. With that, we can access the information for every plane in a drm_framebuffer, not just the first one.
struct vkms_frame_info {
struct drm_rect rotated;
struct iosys_map map[DRM_FORMAT_MAX_PLANES];
unsigned int rotation;
- unsigned int offset;
- unsigned int pitch;
- unsigned int cpp;
};
After that, we need the ability to choose the plane and use its own cpp, pitch, and offset. To do that, we need to pass an index to the functions that access the color information and use it on the new arrays.
-static size_t pixel_offset(const struct vkms_frame_info *frame_info, int x, int y)
+static size_t pixel_offset(const struct vkms_frame_info *frame_info, int x, int y, size_t index)
{
struct drm_framebuffer *fb = frame_info->fb;
- return fb->offsets[0] + (y * fb->pitches[0])
- + (x * fb->format->cpp[0]);
+ return fb->offsets[index] + (y * fb->pitches[index])
+ + (x * fb->format->cpp[index]);
}
static void *packed_pixels_addr(const struct vkms_frame_info *frame_info,
- int x, int y)
+ int x, int y, size_t index)
{
- size_t offset = pixel_offset(frame_info, x, y);
+ size_t offset = pixel_offset(frame_info, x, y, index);
return (u8 *)frame_info->map[0].vaddr + offset;
}
The drm_format_info has the hsub and vsub variables that dictate the horizontal and vertical subsampling factors (for NV12, hsub = vsub = 2). So we need to take this into account when accessing the color information of a pixel. Note that this does not apply to the first plane, as no format subsamples its first plane.
@@ -238,8 +238,10 @@ static void get_src_pixels_per_plane(const struct vkms_frame_info *frame_info,
{
const struct drm_format_info *frame_format = frame_info->fb->format;
- for (size_t i = 0; i < frame_format->num_planes; i++)
- src_pixels[i] = get_packed_src_addr(frame_info, y, i);
+ for (size_t i = 0; i < frame_format->num_planes; i++){
+ int vsub = i ? frame_format->vsub : 1;
+ src_pixels[i] = get_packed_src_addr(frame_info, y / vsub, i);
+ }
}
void vkms_compose_row(struct line_buffer *stage_buffer, struct vkms_plane_state *plane, int y)
@@ -250,6 +252,8 @@ void vkms_compose_row(struct line_buffer *stage_buffer, struct vkms_plane_state
int limit = min_t(size_t, drm_rect_width(&frame_info->dst), stage_buffer->n_pixels);
u8 *src_pixels[DRM_FORMAT_MAX_PLANES];
+ int hsub_count = 0;
+
enum drm_color_encoding encoding = plane->base.base.color_encoding;
enum drm_color_range range = plane->base.base.color_range;
@@ -258,17 +262,21 @@ void vkms_compose_row(struct line_buffer *stage_buffer, struct vkms_plane_state
for (size_t x = 0; x < limit; x++) {
int x_pos = get_x_position(frame_info, limit, x);
+ hsub_count = (hsub_count + 1) % frame_format->hsub;
+
if (drm_rotation_90_or_270(frame_info->rotation)) {
+ get_src_pixels_per_plane(frame_info, src_pixels, x + frame_info->rotated.y1);
for (size_t i = 0; i < frame_format->num_planes; i++)
- src_pixels[i] = get_packed_src_addr(frame_info,
- x + frame_info->rotated.y1, i) +
- frame_format->cpp[i] * y;
+ if (!i || !hsub_count)
+ src_pixels[i] += frame_format->cpp[i] * y;
}
plane->pixel_read(src_pixels, &out_pixels[x_pos], encoding, range);
- for (size_t i = 0; i < frame_format->num_planes; i++)
- src_pixels[i] += frame_format->cpp[i];
+ for (size_t i = 0; i < frame_format->num_planes; i++) {
+ if (!i || !hsub_count)
+ src_pixels[i] += frame_format->cpp[i];
+ }
}
}
This was, by far, the most difficult part of the project.
YCbCr has three color encoding standards: BT601, BT709, and BT2020. Besides that, the YCbCr values can occupy either the full range of each byte they use or a limited range.
To tell userspace which color encodings and ranges the driver supports, we have to call drm_plane_create_color_properties() when initializing the plane:
@@ -212,5 +212,14 @@ struct vkms_plane *vkms_plane_init(struct vkms_device *vkmsdev,
drm_plane_create_rotation_property(&plane->base, DRM_MODE_ROTATE_0,
DRM_MODE_ROTATE_MASK | DRM_MODE_REFLECT_MASK);
+ drm_plane_create_color_properties(&plane->base,
+ BIT(DRM_COLOR_YCBCR_BT601) |
+ BIT(DRM_COLOR_YCBCR_BT709) |
+ BIT(DRM_COLOR_YCBCR_BT2020),
+ BIT(DRM_COLOR_YCBCR_LIMITED_RANGE) |
+ BIT(DRM_COLOR_YCBCR_FULL_RANGE),
+ DRM_COLOR_YCBCR_BT601,
+ DRM_COLOR_YCBCR_FULL_RANGE);
+
return plane;
}
The conversion code was taken from tpg-core.c, a virtual driver from the media subsystem that does those conversions in software as well. As a side thought, maybe it would be better to have those two subsystems share the same code, perhaps in a separate component that handles color formats.
The TPG code was changed to use the drm_fixed.h operations, to be more precise and coherent.
struct pixel_yuv_u8 {
u8 y, u, v;
};
static void ycbcr2rgb(const s64 m[3][3], int y, int cb, int cr,
int y_offset, int *r, int *g, int *b)
{
s64 fp_y; s64 fp_cb; s64 fp_cr;
s64 fp_r; s64 fp_g; s64 fp_b;
y -= y_offset;
cb -= 128;
cr -= 128;
fp_y = drm_int2fixp(y);
fp_cb = drm_int2fixp(cb);
fp_cr = drm_int2fixp(cr);
fp_r = drm_fixp_mul(m[0][0], fp_y) +
drm_fixp_mul(m[0][1], fp_cb) +
drm_fixp_mul(m[0][2], fp_cr);
fp_g = drm_fixp_mul(m[1][0], fp_y) +
drm_fixp_mul(m[1][1], fp_cb) +
drm_fixp_mul(m[1][2], fp_cr);
fp_b = drm_fixp_mul(m[2][0], fp_y) +
drm_fixp_mul(m[2][1], fp_cb) +
drm_fixp_mul(m[2][2], fp_cr);
*r = drm_fixp2int(fp_r);
*g = drm_fixp2int(fp_g);
*b = drm_fixp2int(fp_b);
}
static void yuv_u8_to_argb_u16(struct pixel_argb_u16 *argb_u16, struct pixel_yuv_u8 *yuv_u8,
enum drm_color_encoding encoding, enum drm_color_range range)
{
#define COEFF(v, r) (\
drm_fixp_div(drm_fixp_mul(drm_fixp_from_fraction(v, 10000), drm_int2fixp((1 << 16) - 1)),\
drm_int2fixp(r)) \
)
const s64 bt601[3][3] = {
{ COEFF(10000, 219), COEFF(0, 224), COEFF(14020, 224) },
{ COEFF(10000, 219), COEFF(-3441, 224), COEFF(-7141, 224) },
{ COEFF(10000, 219), COEFF(17720, 224), COEFF(0, 224) },
};
const s64 bt601_full[3][3] = {
{ COEFF(10000, 255), COEFF(0, 255), COEFF(14020, 255) },
{ COEFF(10000, 255), COEFF(-3441, 255), COEFF(-7141, 255) },
{ COEFF(10000, 255), COEFF(17720, 255), COEFF(0, 255) },
};
const s64 rec709[3][3] = {
{ COEFF(10000, 219), COEFF(0, 224), COEFF(15748, 224) },
{ COEFF(10000, 219), COEFF(-1873, 224), COEFF(-4681, 224) },
{ COEFF(10000, 219), COEFF(18556, 224), COEFF(0, 224) },
};
const s64 rec709_full[3][3] = {
{ COEFF(10000, 255), COEFF(0, 255), COEFF(15748, 255) },
{ COEFF(10000, 255), COEFF(-1873, 255), COEFF(-4681, 255) },
{ COEFF(10000, 255), COEFF(18556, 255), COEFF(0, 255) },
};
const s64 bt2020[3][3] = {
{ COEFF(10000, 219), COEFF(0, 224), COEFF(14746, 224) },
{ COEFF(10000, 219), COEFF(-1646, 224), COEFF(-5714, 224) },
{ COEFF(10000, 219), COEFF(18814, 224), COEFF(0, 224) },
};
const s64 bt2020_full[3][3] = {
{ COEFF(10000, 255), COEFF(0, 255), COEFF(14746, 255) },
{ COEFF(10000, 255), COEFF(-1646, 255), COEFF(-5714, 255) },
{ COEFF(10000, 255), COEFF(18814, 255), COEFF(0, 255) },
};
int r = 0;
int g = 0;
int b = 0;
bool full = range == DRM_COLOR_YCBCR_FULL_RANGE;
unsigned int y_offset = full ? 0 : 16;
switch (encoding) {
case DRM_COLOR_YCBCR_BT601:
ycbcr2rgb(full ? bt601_full : bt601,
yuv_u8->y, yuv_u8->u, yuv_u8->v, y_offset, &r, &g, &b);
break;
case DRM_COLOR_YCBCR_BT709:
ycbcr2rgb(full ? rec709_full : rec709,
yuv_u8->y, yuv_u8->u, yuv_u8->v, y_offset, &r, &g, &b);
break;
case DRM_COLOR_YCBCR_BT2020:
ycbcr2rgb(full ? bt2020_full : bt2020,
yuv_u8->y, yuv_u8->u, yuv_u8->v, y_offset, &r, &g, &b);
break;
default:
pr_warn_once("Not supported color encoding\n");
break;
}
argb_u16->r = clamp(r, 0, 0xffff);
argb_u16->g = clamp(g, 0, 0xffff);
argb_u16->b = clamp(b, 0, 0xffff);
}
After all that, we can finally create the NV12 conversion function. We just need to access the YCbCr values in the way NV12 stores them.
static void NV12_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel,
enum drm_color_encoding encoding, enum drm_color_range range)
{
struct pixel_yuv_u8 yuv_u8;
yuv_u8.y = src_pixels[0][0];
yuv_u8.u = src_pixels[1][0];
yuv_u8.v = src_pixels[1][1];
yuv_u8_to_argb_u16(out_pixel, &yuv_u8, encoding, range);
}
After all this work, you’d think everything worked, right? Well, IGT GPU Tools says the opposite.
[root@archlinux shared]# ./build/tests/kms_plane --run pixel-format
IGT-Version: 1.27.1-g4637d2285 (x86_64) (Linux: 6.4.0-rc1-VKMS-DEVEL+ x86_64)
Opened device: /dev/dri/card0
Starting subtest: pixel-format
Starting dynamic subtest: pipe-A-planes
Using (pipe A + Virtual-1) to run the subtest.
Testing format XR24(0x34325258) / modifier linear(0x0) on A.0
Testing format AR24(0x34325241) / modifier linear(0x0) on A.0
Testing format XR48(0x38345258) / modifier linear(0x0) on A.0
Testing format AR48(0x38345241) / modifier linear(0x0) on A.0
Testing format RG16(0x36314752) / modifier linear(0x0) on A.0
Testing format NV12(0x3231564e) / modifier linear(0x0) (ITU-R BT.601 YCbCr, YCbCr limited range) on A.0
(kms_plane:403) WARNING: CRC mismatches with format NV12(0x3231564e) on A.0 with 3/4 solid colors tested (0xD)
Testing format NV12(0x3231564e) / modifier linear(0x0) (ITU-R BT.601 YCbCr, YCbCr full range) on A.0
(kms_plane:403) WARNING: CRC mismatches with format NV12(0x3231564e) on A.0 with 2/4 solid colors tested (0xC)
Testing format NV12(0x3231564e) / modifier linear(0x0) (ITU-R BT.709 YCbCr, YCbCr limited range) on A.0
(kms_plane:403) WARNING: CRC mismatches with format NV12(0x3231564e) on A.0 with 3/4 solid colors tested (0xD)
Testing format NV12(0x3231564e) / modifier linear(0x0) (ITU-R BT.709 YCbCr, YCbCr full range) on A.0
(kms_plane:403) WARNING: CRC mismatches with format NV12(0x3231564e) on A.0 with 2/4 solid colors tested (0xC)
Testing format NV12(0x3231564e) / modifier linear(0x0) (ITU-R BT.2020 YCbCr, YCbCr limited range) on A.0
(kms_plane:403) WARNING: CRC mismatches with format NV12(0x3231564e) on A.0 with 3/4 solid colors tested (0xD)
Testing format NV12(0x3231564e) / modifier linear(0x0) (ITU-R BT.2020 YCbCr, YCbCr full range) on A.0
(kms_plane:403) WARNING: CRC mismatches with format NV12(0x3231564e) on A.0 with 2/4 solid colors tested (0xC)
The subtest pixel-format from kms_plane tests all color formats supported by a driver. It does that by creating framebuffers filled with different colors and an image with the same color in userspace. After that, it checks if the CRC of the framebuffer and the userspace image are equal.
The NV12 support described until this point doesn’t work out because of the imprecision in the YCbCr to RGB conversion. I don’t know if the conversion is intrinsically imperfect, or if the use of fixed-point operations is the culprit. All I know is that certain colors are slightly off.
Luckily the IGT folks know about this issue; the way they overcome it is by checking only the MSBs of the color values, basically rounding them. They do that by passing a Gamma Look-Up Table (LUT) to the driver. But VKMS doesn’t support that.
A LUT is a look-up table where the index is the input color and the stored value is the resulting color.
This is the definition of a 1D table. You can have one for all the channels, so the transformation is the same for all the color channels, or one for each channel, so you can tweak each channel specifically.
There is a more complex type of LUT, a 3D LUT. But this one I don’t fully understand and it’s not needed. All I know is that you use the three color channels at the same time for the index, one for each coordinate, and the value that you get is a color. And besides that, you have to do interpolations.
Its image representation is a pretty cube :).
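Coming back to the 1D case: to give an idea of what the MSB-rounding trick looks like from userspace, here is a rough illustrative sketch (not IGT's actual code); the local struct simply mirrors the kernel's struct drm_color_lut:

#include <stdint.h>

struct color_lut_entry { uint16_t red, green, blue, reserved; };

/* Fill a 1D LUT that keeps only the top kept_bits of each channel, so small
 * conversion errors in the low bits get rounded away before the CRC. */
static void fill_msb_lut(struct color_lut_entry *lut, unsigned int size,
                         unsigned int kept_bits)
{
    uint16_t mask = 0xffff << (16 - kept_bits);
    for (unsigned int i = 0; i < size; i++) {
        /* Spread the index over the full 16-bit range, then drop the LSBs. */
        uint16_t v = (uint16_t)((i * 0xffffu) / (size - 1)) & mask;
        lut[i].red = lut[i].green = lut[i].blue = v;
        lut[i].reserved = 0;
    }
}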
It’s not that difficult, the DRM core does all the hard work.
You have to advertise that the driver supports a LUT of a specific size, and the DRM core places it for you inside the crtc_state.
@@ -290,6 +290,9 @@ int vkms_crtc_init(struct drm_device *dev, struct drm_crtc *crtc,
drm_crtc_helper_add(crtc, &vkms_crtc_helper_funcs);
+ drm_mode_crtc_set_gamma_size(crtc, VKMS_LUT_SIZE);
+ drm_crtc_enable_color_mgmt(crtc, 0, false, VKMS_LUT_SIZE);
+
spin_lock_init(&vkms_out->lock);
spin_lock_init(&vkms_out->composer_lock);
After that, you have to use it. You just have to access the framebuffer, after the transformations, and use the 1D LUT of each channel.
static void apply_lut(const struct vkms_crtc_state *crtc_state, struct line_buffer *output_buffer)
{
struct drm_color_lut *lut;
size_t lut_length;
if (!crtc_state->base.gamma_lut)
return;
lut = (struct drm_color_lut *)crtc_state->base.gamma_lut->data;
lut_length = crtc_state->base.gamma_lut->length / sizeof(*lut);
if (!lut_length)
return;
for (size_t x = 0; x < output_buffer->n_pixels; x++) {
size_t lut_r_index = output_buffer->pixels[x].r * (lut_length - 1) / 0xffff;
size_t lut_g_index = output_buffer->pixels[x].g * (lut_length - 1) / 0xffff;
size_t lut_b_index = output_buffer->pixels[x].b * (lut_length - 1) / 0xffff;
output_buffer->pixels[x].r = lut[lut_r_index].red;
output_buffer->pixels[x].g = lut[lut_g_index].green;
output_buffer->pixels[x].b = lut[lut_b_index].blue;
}
}
Now we finally have the conversion working :DDDDDDD.
[root@archlinux shared]# ./build/tests/kms_plane --run pixel-format
IGT-Version: 1.27.1-g4637d2285 (x86_64) (Linux: 6.4.0-rc1-VKMS-DEVEL+ x86_64)
Opened device: /dev/dri/card0
Starting subtest: pixel-format
Starting dynamic subtest: pipe-A-planes
Using (pipe A + Virtual-1) to run the subtest.
Testing format XR24(0x34325258) / modifier linear(0x0) on A.0
Testing format AR24(0x34325241) / modifier linear(0x0) on A.0
Testing format XR48(0x38345258) / modifier linear(0x0) on A.0
Testing format AR48(0x38345241) / modifier linear(0x0) on A.0
Testing format RG16(0x36314752) / modifier linear(0x0) on A.0
Testing format NV12(0x3231564e) / modifier linear(0x0) (ITU-R BT.601 YCbCr, YCbCr limited range) on A.0
Testing format NV12(0x3231564e) / modifier linear(0x0) (ITU-R BT.601 YCbCr, YCbCr full range) on A.0
Testing format NV12(0x3231564e) / modifier linear(0x0) (ITU-R BT.709 YCbCr, YCbCr limited range) on A.0
Testing format NV12(0x3231564e) / modifier linear(0x0) (ITU-R BT.709 YCbCr, YCbCr full range) on A.0
Testing format NV12(0x3231564e) / modifier linear(0x0) (ITU-R BT.2020 YCbCr, YCbCr limited range) on A.0
Testing format NV12(0x3231564e) / modifier linear(0x0) (ITU-R BT.2020 YCbCr, YCbCr full range) on A.0
Testing format AR24(0x34325241) / modifier linear(0x0) on A.1
Testing format XR24(0x34325258) / modifier linear(0x0) on A.1
Testing format XR48(0x38345258) / modifier linear(0x0) on A.1
Testing format AR48(0x38345241) / modifier linear(0x0) on A.1
Testing format RG16(0x36314752) / modifier linear(0x0) on A.1
Testing format NV12(0x3231564e) / modifier linear(0x0) (ITU-R BT.601 YCbCr, YCbCr limited range) on A.1
Testing format NV12(0x3231564e) / modifier linear(0x0) (ITU-R BT.601 YCbCr, YCbCr full range) on A.1
Testing format NV12(0x3231564e) / modifier linear(0x0) (ITU-R BT.709 YCbCr, YCbCr limited range) on A.1
Testing format NV12(0x3231564e) / modifier linear(0x0) (ITU-R BT.709 YCbCr, YCbCr full range) on A.1
Testing format NV12(0x3231564e) / modifier linear(0x0) (ITU-R BT.2020 YCbCr, YCbCr limited range) on A.1
Testing format NV12(0x3231564e) / modifier linear(0x0) (ITU-R BT.2020 YCbCr, YCbCr full range) on A.1
Dynamic subtest pipe-A-planes: SUCCESS (2.586s)
Subtest pixel-format: SUCCESS (2.586s)
The pixel-format-clamped subtest is still not passing; I haven’t had time to tackle that yet. I hope everything will be done once this is solved.
After this, it will be very easy to add other YCbCr formats; it is just a matter of fetching the color values in the way each format stores them.
The world tried to stop me, but I pushed through and got some work done on wlroots.
Unfortunately aside from programming I have a physical form that I have to take care of. Recently this form's hands have been hurting because of strain from typing, which is pretty rude of them, so I took some time out to build a keyboard. Woe is me, having to spend my time soldering and cutting. I had a great time. After finishing the keyboard, all of a sudden it was time to move out, so we packed and shuffled and drove and unpacked and complained about the trains in this country. Then I slept, and yesterday I finally found the time to get back to work. This is why there was no update last week.
I've decided that I need consistent terms for the subtasks I have. So: measurement is reading how long a frame takes to render, prediction is coming up with a number of milliseconds, and scheduling is delaying by that number and all the API funk therein. It turns out that Simon was right and there is a lot of API funk for me to deal with. We've had some discussions about what exactly wlr_output.events.frame is for and whether we really need it, and I've implemented a scheduling mechanism based on that signal to show how we can make good use of it.
Bringing together this scheduling, the timer from the first status report for measurement, and some naive prediction (next frame takes at most 1ms more than last frame) christens tinywl as the first user of all this mess. Here are some screenshots from GPUVis showing the old, boring timeline where we rendered immediately after the last present, and the new, sexy timeline where we render immediately before the next one! The purple lines are vblanks, and the boxes in each row represent durations when those processes were running on the CPU. I think blue lines are GPU submissions, but the naming is unclear.
Wow, isn't that beautiful.
Hi!
This month Rose Hudson has started working on wlroots as part of Google Summer of Code! She will focus on reducing frame latency by implementing an adaptive frame scheduler. She already has landed new wlroots APIs to measure render time. You can follow Rose’s blog if you’re interested.
Other wlroots contributors have been hard at work too. Alexander has implemented the new render pass API for GLES2, and while at it he’s significantly improved its performance. I hope this will help with weak SoCs and power usage. A big refactoring from vyivel has been merged to unify the map/unmap logic across all shells. I’ve moved some of the cursor logic over from wlr_output to wlr_cursor (with the eventual goal of simplifying wlr_output and removing most of the cursor-specific logic). And we’ve all fixed a whole bunch of bugs!
The NPotM is lodns. It’s a simple local DNS resolver which forwards queries to a DNS-over-TLS or DNS-over-HTTPS server. It’s somewhat similar to systemd-resolved. Still missing are a way to forward local hostnames to the DNS resolver advertised via DHCP and support for /etc/hosts.
As usual, I’ve made small improvements to various projects. I’ve added a fast tab switcher for gamja: press Ctrl+k, type a few letters to filter the channels, and press enter to switch. I’ve contributed to the upcoming IRCv3 message redaction extension and implemented it in goguma. kanshi has gained a kanshictl switch command to manually switch to another profile. go-oauth2 now supports dynamic client registration. gyosu generates documentation for typedef. And more! But that’s enough for today, see you next month!
Yes, you heard that right.
Ray Tracing Pipelines.
On RADV.
Enabled by default.
Now merged in Mesa main.
This has been in the works for a loooooooooong time. Probably the longest of any RADV features so far.
But what makes ray tracing pipelines so complex that it takes this long to implement? Let’s take a short look at what it took for RADV to get its implementation off the ground.
For the purposes of this blog, ray tracing is the process of finding intersections between rays and some geometry.
Most of the time, this geometry will be made up of lots of triangles. We don’t want to test every single triangle for intersection separately, so Bounding Volume Hierarchies (BVHs) are used to speed up the process by skipping entire groups of triangles at once.
Nowadays, GPUs have dedicated hardware to speed up the ray tracing process.
AMD’s hardware acceleration for ray tracing is very simple: It consists of a single instruction called image_bvh_intersect_ray (and its 64-bit variant).1
Why is it called image_bvh_intersect_ray? Because the hardware sees the BVH as a 1D image and uses its memory subsystem for textures to fetch BVH data, of course.
This instruction takes care of calculating intersections between a ray and a single node in the BVH. But intersecting one node isn’t good enough: In order to find actual intersections between the ray and geometry, we need to traverse the BVH and check lots of nodes. The traversal loop that accomplishes this is implemented in software2.
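To give an idea of what that software loop looks like, here is a very rough, self-contained sketch. The data layout is invented for the example and has nothing to do with AMD's BVH format or RADV's actual shader; it only illustrates why a "single node intersection" primitive still needs a driver-written loop around it.

#include <stdbool.h>

struct aabb { float lo[3], hi[3]; };
struct node {
    struct aabb bounds;
    int child[2];   /* >= 0: index of a child node, -1: no child */
    int prim;       /* >= 0: leaf holding this primitive, -1: inner node */
};

/* Loosely the kind of per-node test the hardware instruction accelerates:
 * a ray/box slab test. */
static bool ray_hits_box(const float orig[3], const float dir[3], const struct aabb *b)
{
    float tmin = 0.0f, tmax = 1e30f;
    for (int i = 0; i < 3; i++) {
        float inv = 1.0f / dir[i];
        float t0 = (b->lo[i] - orig[i]) * inv;
        float t1 = (b->hi[i] - orig[i]) * inv;
        if (t0 > t1) { float t = t0; t0 = t1; t1 = t; }
        if (t0 > tmin) tmin = t0;
        if (t1 < tmax) tmax = t1;
    }
    return tmin <= tmax;
}

/* The traversal loop: walk the tree with an explicit stack, skipping whole
 * subtrees whose bounding boxes the ray misses. Closest-hit bookkeeping is
 * omitted to keep the sketch short. */
static int traverse(const struct node *bvh, int root, const float orig[3], const float dir[3])
{
    int stack[64], sp = 0, hit = -1;
    stack[sp++] = root;
    while (sp > 0) {
        const struct node *n = &bvh[stack[--sp]];
        if (!ray_hits_box(orig, dir, &n->bounds))
            continue;
        if (n->prim >= 0) {
            hit = n->prim;                        /* leaf: record the hit */
        } else {
            if (n->child[0] >= 0) stack[sp++] = n->child[0];
            if (n->child[1] >= 0) stack[sp++] = n->child[1];
        }
    }
    return hit;
}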
In Vulkan, you can use ray tracing pipelines to utilize your GPU’s hardware-accelerated ray tracing capabilities. It might not seem like it, but ray tracing pipelines actually bring a whole lot of new features with them that make them quite complex to implement.
Ray tracing pipelines introduce a set of new shader stages: ray generation shaders, which call traceRayEXT to start tracing, plus intersection, any-hit, closest-hit, miss and callable shaders.
That’s right, as a small side effect, ray tracing pipelines also introduced full proper recursion from shaders. This doesn’t just apply to callable shaders: You can also trace new rays from a closest-hit shader, which can recursively invoke more closest-hit shaders, etc.
Also, ray tracing pipelines introduce a very dynamic, GPU-driven shader dispatch process: In traditional graphics and compute pipelines, once you bind a pipeline, you know exactly which shaders are going to execute once you do a draw or dispatch. In ray tracing pipelines, this depends on something called the Shader Binding Table, which is a piece of memory containing so-called “shader handles”. These shader handles identify the shader that is actually launched when vkCmdTraceRaysKHR is called.
In both graphics and compute pipelines, the concept of pipeline stages was quite simple: You have a bunch of shader stages (for graphics pipelines, it’s usually vertex and fragment, for compute pipelines it’s just compute). Each stage has exactly one shader: You don’t have one graphics pipeline with many vertex shaders. In ray tracing pipelines, there are no restrictions on how many shaders can exist for each stage.
In RT pipelines, there is also the concept of shaders dispatching other shaders: Every time traceRayEXT is called, more shaders (any-hit, intersection, closest-hit or miss shaders) are launched.
That’s lots of changes just for some ray tracing!
RT pipelines aren’t really a fitting representation of AMD hardware. There is no such thing as reading a memory location to determine which shader to launch, and the hardware has no concept of a callstack to implement recursion. RADV therefore has to do a bit of magic to transform RT pipelines in a way that will actually run.
The first approach RADV used to implement these ray tracing pipelines was essentially to pretend that the whole ray tracing pipeline is a normal compute shader:
All shaders from the pipeline are assigned a unique ID. Then, all shaders are inserted into a humongous chain of if (idx == shader_id) { (paste shader code here) } statements.
If you wanted to call a shader, it was as simple as setting idx to the ID of the shader you wanted to call. You could even implement recursion by storing the ID of the shader to return to on a call stack.
Launching shaders according to the shader binding table wasn’t a problem either: You just read the shader binding table at the start and set idx to whatever value is in there.
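Put together, the scheme looked conceptually like this. This is plain C pseudocode of my own, not RADV's actual code; the real thing works on NIR, not IDs in an enum:

/* Conceptual sketch of the "megashader": every shader gets an ID, the whole
 * pipeline becomes one big dispatch loop, calls set idx, and recursion uses
 * a small ID stack. */
enum {
    SHADER_DONE = -1,
    SHADER_RAYGEN,          /* raygen code up to its traceRayEXT() call */
    SHADER_RAYGEN_RESUME,   /* raygen code after the call returns */
    SHADER_TRAVERSAL,       /* the traversal loop */
    SHADER_CLOSEST_HIT,
};

void megashader(const int *shader_binding_table)
{
    int call_stack[32], sp = 0;
    int idx = shader_binding_table[0];      /* launch whatever the SBT selects */

    while (idx != SHADER_DONE) {
        if (idx == SHADER_RAYGEN) {
            /* (pasted raygen shader code); traceRayEXT() turns into: */
            call_stack[sp++] = SHADER_RAYGEN_RESUME;  /* where to return to */
            idx = SHADER_TRAVERSAL;
        } else if (idx == SHADER_TRAVERSAL) {
            /* (pasted traversal loop); pretend it always finds a hit here */
            idx = SHADER_CLOSEST_HIT;
        } else if (idx == SHADER_CLOSEST_HIT) {
            /* (pasted closest-hit code); returning pops the call stack */
            idx = sp > 0 ? call_stack[--sp] : SHADER_DONE;
        } else if (idx == SHADER_RAYGEN_RESUME) {
            /* (rest of the raygen shader) */
            idx = SHADER_DONE;
        }
    }
}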
But there was a problem.
As it turns out, if you don’t put any restrictions on how many shaders can exist in a stage, there’s going to be apps that use LOTS of them. We’re talking almost a thousand shaders in some cases. Ludicrously large code like that resulted in lots of ludicrous results (games spending over half an hour compiling shaders!). Clearly, the megashader solution wasn’t sustainable.
Ray Tracing Pipelines also add pipeline libraries. You might have heard of them in the context of Graphics Pipeline Libraries, which was also really painful to implement in RADV.
Pipeline libraries essentially allow you to create parts of your ray tracing pipeline beforehand, and then re-use these created parts all over other ray tracing pipelines. But if we just paste all shaders into one chonker compute shader, we can’t compile it yet when creating a pipeline library, because other shaders will be added once a real pipeline is created from it!
This basically meant that we couldn’t do anything but copy the source code around, and start compiling only when the real pipeline is created. It also turned out that it’s valid behaviour to query the stack size used for recursion from pipeline libraries, but because RADV didn’t compile any code yet, it didn’t even know what stack size the shaders from that pipeline used.
This is where separate shader compilation comes in. As the name suggests, most3 shaders are compiled independently. Instead of using shader IDs to select what shader is called, we store the VRAM addresses of the shaders and directly jump to whatever shaders we want to execute next.
Directly jumping to a shader is still impossible because reading the shader binding table is required. Instead, RADV creates a small piece of shader assembly that sets up necessary parameters, reads the shader binding table, and then directly jumps to the selected shader (like it is done for shader calls).
This allows us to compile shaders immediately when creating pipeline libraries. It also pretty much resolves the problem of chonker compute shaders taking ludicrously long to compile. It also required basically reworking the entire ray tracing compilation infrastructure, but I think it forms a great basis for future work in the performance area.
Everything runs.
In case you disagree, please open an issue.
Pretty competitive with AMDVLK/the AMD Windows drivers! You’ll generally see similar, if not better, performance on RADV.
Not well (expect significantly less performance compared to AMDVLK/Windows drivers). This is being worked on.
RDNA3 introduces another instruction that helps with BVH traversal stack management, but RADV doesn’t use it yet. ↩
This is also what makes it so easy to support ray tracing even when there is no hardware acceleration (using RADV_PERFTEST=emulate_rt): Most of the traversal code can be reused, only image_bvh_intersect_ray needs to be replaced with a software equivalent. ↩
Any-hit and Intersection shaders are still combined into a single traversal shader. This still shows some of the disadvantages of the combined shader method, but generally compile times aren’t that ludicrous anymore. ↩
In the previous update I explained that the programmable core in this NPU (VIPNano-QI) is too slow to run inference workloads substantially faster than the CPUs. The vendor stack achieves acceptable inference rates by running most of the work on fixed-function units that can perform different kinds of convolutions and transformations of tensors.
Most of the work is done by the convolution units that VeriSilicon calls NN cores, so this is what I have been focusing on at this stage. I think that even if we still do all tensor transformation on the programmable core, by using the NN units we could already achieve usable performance.
By looking around in the ioctls that VeriSilicon's userspace stack sends to the kernel, it was clear that in the NN jobs there was little more than a pointer to a structure that configures the NN fixed-function units. Luckily I didn't need to reverse engineer it from zero, as VeriSilicon's out-of-tree kernel driver is GPL and contains two instances of programming this HW with a trivial job (a 2x2x1 kernel with a single bias value).
Took some boring work to translate what the code does to a C struct, but this was the initial one:
struct etna_nn_params {
uint32_t op_type : 1; /* conv: 0 fully_connected: 1 */
uint32_t no_z_offset : 1;
uint32_t kernel_x_size : 4;
uint32_t kernel_z_size : 14; /* & 0x3FFF */
uint32_t kernels_per_core : 7;
uint32_t zero1 : 2;
uint32_t zero2 : 1;
uint32_t zero3 : 1;
uint32_t nn_layer_flush : 1;
uint32_t kernel_data_type : 2; /* UINT8 0x2 INT8 0x0 */
uint32_t in_image_data_type : 2; /* UINT8 0x2 INT8 0x0 */
uint32_t out_image_data_type : 2; /* UINT8 0x2 INT8 0x0 */
uint32_t in_image_x_size : 13;
uint32_t in_image_y_size : 13;
uint32_t zero4 : 3;
uint32_t zero5 : 3;
uint32_t unused0 : 1;
uint32_t zero6 : 16;
uint32_t zero7 : 1;
uint32_t enable_relu : 1;
uint32_t zero9 : 1;
uint32_t post_shift : 6;
uint32_t unused1 : 2;
uint32_t zero10 : 1;
uint32_t zero11 : 1;
uint32_t unused2 : 2;
uint32_t out_image_x_size : 13;
uint32_t out_image_y_size : 13;
uint32_t out_image_z_size : 14;
uint32_t zero12 : 2; /* 0x0 */
uint32_t zero13 : 1; /* (0 >> 3) & 0x1 */
uint32_t zero14 : 1; /* (0 >> 3) & 0x1 */
uint32_t unk0 : 7; /* 1 */
uint32_t unk1 : 7; /* 1 */
uint32_t kernel_address : 26; /* >> 6 */
uint32_t kernel_z_size2 : 6; /* >> 14 */
uint32_t in_image_address;
uint32_t out_image_address;
uint32_t unused3 : 12;
uint32_t kernel_y_size : 4;
uint32_t out_image_y_size2 : 16; /* maybe stride? */
uint32_t zero15;
uint32_t zero16;
uint32_t zero17;
uint32_t kernel_cache_end_address;
uint32_t zero19;
uint32_t image_end_address;
uint32_t zero20 : 2;
uint32_t zero21 : 16;
uint32_t kernel_data_type_bit_2 : 1;
uint32_t in_image_data_type_bit_2 : 1;
uint32_t out_image_data_type_bit_2 : 1;
uint32_t zero22 : 6;
uint32_t post_shift_bit_5_6 : 2;
uint32_t unused4 : 3;
uint32_t in_image_stride : 16;
uint32_t in_image_y_size2 : 16; /* again? */
uint32_t out_image_stride : 16;
uint32_t unused5 : 8;
uint32_t zero23 : 8;
uint32_t zero24 : 26; /* 0 >> 6 */
uint32_t zero25 : 1;
uint32_t zero26 : 1;
uint32_t zero27 : 1; /* 0 >> 4 */
uint32_t zero28 : 1; /* 0 >> 4 */
uint32_t zero29 : 1;
uint32_t kernel_data_type_bit_3 : 1;
uint32_t unk2 : 26; /* 0xFFFFFFFF >> 6 */
uint32_t unused6 : 4;
uint32_t zero30 : 1;
uint32_t in_image_data_type_bit_3 : 1;
uint32_t zero31 : 26; /* 0 >> 6 */
uint32_t out_image_data_type_bit_3 : 1;
uint32_t unused7 : 6;
uint32_t unk3 : 26; /* 0xFFFFFFFF >> 6 */
uint32_t unused8 : 6;
uint32_t coef_zero_point : 8;
uint32_t out_zero_point : 8;
uint32_t zero32 : 1;
uint32_t zero33 : 1;
uint32_t zero34 : 8;
uint32_t unused9 : 6;
uint32_t zero35;
uint32_t zero36 : 4;
uint32_t zero37 : 28; /* 0 >> 4 */
uint32_t zero38 : 4;
uint32_t zero39 : 28; /* 0 >> 4 */
uint32_t further1;
uint32_t further2;
uint32_t further3;
uint32_t further4;
uint32_t further5;
uint32_t further6;
uint32_t further7;
uint32_t further8;
};
As you can see there are a lot of "zero" and "unused" fields, most of which I think will actually be used for something, as HW engineers don't tend to like wasting bits. By adding instrumentation for dumping these structs to the reverse engineering tooling, I will be getting a better idea of what each field means and does.
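The instrumentation in question is nothing fancy; conceptually it is just a dumper along these lines (a hypothetical helper assuming the etna_nn_params definition above), so that runs with different parameters can be diffed field by field:

#include <stdio.h>

static void dump_nn_params(const struct etna_nn_params *p)
{
    printf("op_type=%u kernel=%ux%ux%u in=%ux%u out=%ux%ux%u post_shift=%u\n",
           (unsigned)p->op_type,
           (unsigned)p->kernel_x_size, (unsigned)p->kernel_y_size,
           (unsigned)p->kernel_z_size,
           (unsigned)p->in_image_x_size, (unsigned)p->in_image_y_size,
           (unsigned)p->out_image_x_size, (unsigned)p->out_image_y_size,
           (unsigned)p->out_image_z_size,
           (unsigned)p->post_shift);
}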
I got GPU hangs the first time that I submitted a job with the same configuration as the kernel's trivial reset job, and looking further showed that the buffer that contains the convolution filters must follow a specific format.
By looking again at the kernel driver sources, I used the same kernel/filter buffer and the GPU didn't hang anymore. That kernel was all zeroes as the weights, and indeed my output buffer was now full of zeroes.
Then I tried to put my weights into the format that I inferred from the kernel driver source code, but I wasn't able to get any job to run to completion without hangs, and the output buffer was unchanged.
To figure out what I was missing about how the weights (and the biases) need to be placed in the buffer, I added code to the reverse engineering tooling to dump the weights buffer. With that buffer and after playing around a bit with the sizes of the output, input and kernel buffers, I finally got a job to run with non-zero weights.
What I am doing right now is slowly zeroing out the weights buffer to figure out which bits are data, which are control, and what effect the changes have on the output.
Hope that by the next update I will have documented the format of the weights buffer and will be able to run at least one kind of convolution!
After what was basically a flurry of typing, the snegg Python bindings for libei are now available. This is a Python package that provides bindings to the libei/libeis/liboeffis C libraries with a little bit of API improvement to make it not completely terrible. The main goal of these bindings (at least for now) is to provide some quick and easy way to experiment with what could possibly be done using libei - both server-side and client-side. [1] The examples directory has a minimal EI client (with portal support via liboeffis) and a minimal EIS implementation. The bindings are still quite rough and the API is nowhere near stable.
A proper way to support EI in Python would be to implement the protocol directly - there's no need for the C API quirkiness this way and you can make full use of things like async and whatnot. If you're interested in that, get in touch! Meanwhile, writing something roughly resembling xdotool is probably only a few hundred lines of Python code. [2]
[1] writing these also exposed a few bugs in libei itself so I'm happy 1.0 wasn't out just yet
[2] at least the input emulation parts of xdotool
Upgrade your Asahi Linux systems, because your graphics drivers are getting a big boost: leapfrogging from OpenGL 2.1 over OpenGL 3.0 up to OpenGL 3.1! Similarly, the OpenGL ES 2.0 support is bumping up to OpenGL ES 3.0. That means more playable games and more functioning applications.
Back in December, I teased an early screenshot of SuperTuxKart’s deferred renderer working on Asahi, using OpenGL ES 3.0 features like multiple render targets and instancing. Now you too can enjoy SuperTuxKart with advanced lighting the way it’s meant to be:
As before, these drivers are experimental and not yet conformant to the OpenGL or OpenGL ES specifications. For now, you’ll need to run our -edge packages to opt-in to the work-in-progress drivers, understanding that there may be bugs. Please refer to our previous post explaining how to install the drivers and how to report bugs to help us improve.
With that disclaimer out of the way, there’s a LOT of new functionality packed into OpenGL 3.0, 3.1, and OpenGL ES 3.0 to make this release. Highlights include:
For now, let’s talk about…
Vulkan and OpenGL support multisampling, short for multisampled anti-aliasing. In graphics, aliasing causes jagged diagonal edges due to rendering at insufficient resolution. One solution to aliasing is rendering at higher resolutions and scaling down. Edges will be blurred, not jagged, which looks better. Multisampling is an efficient implementation of that idea.
A multisampled image contains multiple samples for every pixel. After rendering, a multisampled image is resolved to a regular image with one sample per pixel, typically by averaging the samples within a pixel.
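For illustration, a 4x resolve is conceptually nothing more than this (single-channel, sample-major layout invented for the example):

#include <stdint.h>

/* Average the four samples of every pixel into a single-sample output. */
static void resolve_4x(const uint8_t *msaa, uint8_t *out,
                       unsigned width, unsigned height)
{
    for (unsigned p = 0; p < width * height; p++) {
        unsigned sum = 0;
        for (unsigned s = 0; s < 4; s++)
            sum += msaa[p * 4 + s];       /* 4 samples per pixel */
        out[p] = (uint8_t)(sum / 4);      /* averaged, one sample per pixel */
    }
}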
Apple GPUs support multisampled images and framebuffers. There’s quite a bit of typing to plumb the programmer’s view of multisampling into the form understood by the hardware, but there’s no fundamental incompatibility.
The trouble comes with sample shading. Recall that in modern graphics, the colour of each fragment is determined by running a fragment shader given by the programmer. If the fragments are pixels, then each sample within that pixel gets the same colour. Running the fragment shader once per pixel still benefits from multisampling thanks to higher quality rasterization, but it’s not as good as actually rendering at a higher resolution. If instead the fragments are samples, each sample gets a unique colour, equivalent to rendering at a higher resolution (supersampling). In Vulkan and OpenGL, fragment shaders generally run per-pixel, but with “sample shading”, the application can force the fragment shader to run per-sample.
How does sample shading work from the drivers’ perspective? On a typical GPU, it is simple: the driver compiles a fragment shader that calculates the colour of a single sample, and sets a hardware bit to execute it per-sample instead of per-pixel. There is only one bit of state associated with sample shading. The hardware will execute the fragment shader multiple times per pixel, writing out pixel colours independently.
Easy, right?
Alas, Apple’s “AGX” GPU is not typical.
Like older GPUs that did not support sample shading, AGX always executes the shader once per pixel, not once per sample. AGX does support sample shading, though.
How? The AGX instruction set allows pixel shaders to output different colours to each sample. The instruction used to output a colour1 takes a set of samples to modify, encoded as a bit mask. The default all-1’s mask writes the same value to all samples in a pixel, but a mask setting a single bit will write only the single corresponding sample.
This design is unusual, and it requires driver backflips to translate “fragment shaders” into hardware pixel shaders. How do we do it?
Physically, the hardware executes our shader once per pixel. Logically, we’re supposed to execute the application’s fragment shader once per sample. If we know the number of samples per pixel, then we can wrap the application’s shader in a loop over each sample. So, if the original fragment shader is:
interpolated colour = interpolate at current sample(input colour);
output current sample(interpolated colour);
then we will transform the program to the pixel shader:
for (sample = 0; sample < number of samples; ++sample) {
sample mask = (1 << sample);
interpolated colour = interpolate at sample(input colour, sample);
output samples(sample mask, interpolated colour);
}
The original fragment shader runs inside the loop, once per sample. Whenever it interpolates inputs at the current sample position, we change it to instead interpolate at a specific sample given by the loop counter sample. Likewise, when it outputs a colour for a sample, we change it to output the colour to the single sample given by the loop counter.
If the story ended here, this mechanism would be silly. Adding sample masks to the instruction set is more complicated than a single bit to invoke the shader multiple times, as other GPUs do. Even Apple’s own Metal driver has to implement this dance, because Metal has a similar approach to sample shading as OpenGL and Vulkan. With all this extra complexity, is there a benefit?
If we generated that loop at the end, maybe not. But if we know at compile-time that sample shading is used, we can run our full optimizer on this sample loop. If there is an expression that is the same for all samples in a pixel, it can be hoisted out of the loop.2 Instead of calculating the same value multiple times, as other GPUs do, the value can be calculated just once and reused for each sample. Although it complicates the driver, this approach to sample shading isn’t Apple cutting corners. If we slapped on the loop at the end and did no optimizations, the resulting code would be comparable to what other GPUs execute in hardware. There might be slight differences from spawning fewer threads but executing more control flow instructions3, but that’s minor. Generating the loop early and running the optimizer enables better performance than possible on other GPUs.
So is the mechanism only an optimization? Did Apple stumble on a better approach to sample shading that other GPUs should adopt? I wouldn’t be so sure.
Let’s pull the curtain back. AGX has its roots as a mobile GPU intended for iPhones, with significant PowerVR heritage. Even if it powers Mac Pros today, the mobile legacy means AGX prefers software implementations of many features that desktop GPUs implement with dedicated hardware.
Yes, I’m talking about blending.
Blending is an operation in graphics APIs to combine the fragment shader output colour with the existing colour in the framebuffer. It is usually used to implement alpha blending, to let the background poke through translucent objects.
When multisampling is used without sample shading, although the fragment shader only runs once per pixel, blending happens per-sample. Even if the fragment shader outputs the same colour to each sample, if the framebuffer already had different colours in different samples, blending needs to happen per-sample to avoid losing that information already in the framebuffer.
A traditional desktop GPU blends with dedicated hardware. In the mobile space, there’s a mix of dedicated hardware and software. On AGX, blending is purely software. Rather than configure blending hardware, the driver must produce variants of the fragment shader that include instructions to implement the desired blend mode. With alpha blending, a fragment shader like:
colour = calculate lighting();
output(colour);
becomes:
colour = calculate lighting();
dest = load destination colour;
alpha = colour.alpha;
blended = (alpha * colour) + ((1 - alpha) * dest));
output(blended);
Where’s the problem?
Blending happens per sample. Even if the application intends to run the fragment shader per pixel, the shader must run per sample for correct blending. Compared to other GPUs, this approach to blending would regress performance when blending and multisampling are enabled but sample shading is not.
On the other hand, exposing multisample pixel shaders to the driver solves the problem neatly. If both the blending and the multisample state are known, we can first insert instructions for blending, and then wrap with the sample loop. The above program would then become:
for (sample = 0; sample < number of samples; ++sample) {
colour = calculate lighting();
dest = load destination colour at sample (sample);
alpha = colour.alpha;
blended = (alpha * colour) + ((1 - alpha) * dest);
sample mask = (1 << sample);
output samples(sample_mask, blended);
}
In this form, the fragment shader is asymptotically worse than the application wanted: the fragment shader is executed inside the loop, running per-sample unnecessarily.
Have no fear, the optimizer is here. Since colour is the same for each sample in the pixel, it does not depend on the sample ID. The compiler can move the entire original fragment shader (and related expressions) out of the per-sample loop:
colour = calculate lighting();
alpha = colour.alpha;
inv_alpha = 1 - alpha;
colour_alpha = alpha * colour;
for (sample = 0; sample < number of samples; ++sample) {
dest = load destination colour at sample (sample);
blended = colour_alpha + (inv_alpha * dest);
sample mask = (1 << sample);
output samples(sample_mask, blended);
}
Now blending happens per sample but the application’s fragment shader runs just once, matching the performance characteristics of traditional GPUs. Even better, all of this happens without any special work from the compiler. There’s no magic multisampling optimization happening here: it’s just a loop.
By the way, what do we do if we don’t know the blending and multisample state at compile-time? Hope is not lost…
…but that’s a story for another day.
While OpenGL ES 3.0 is an improvement over ES 2.0, we’re not done. In my work-in-progress branch, OpenGL ES 3.1 support is nearly finished, which will unlock compute shaders.
The final goal is a Vulkan driver running modern games. We’re a while away, but the baseline Vulkan 1.0 requirements parallel OpenGL ES 3.1, so our work translates to Vulkan. For example, the multisampling compiler passes described above are common code between the drivers. We’ve tested them against OpenGL, and now they’re ready to go for Vulkan.
And yes, the team is already working on Vulkan.
Until then, you’re one pacman -Syu away from enjoying OpenGL 3.1!
Store a formatted value to local memory acting as a tilebuffer.↩︎
Via common subexpression elimination if the loop is unrolled, otherwise via code motion.↩︎
Since the number of samples is constant, all threads branch in the same direction so the usual “GPUs are bad at branching” advice does not apply.↩︎
The first week of GSoC is over! I'm working on presentation scheduling in wlroots. Here's how it's going.
Currently in wlroots, the easy and obvious frame schedule for compositors to implement is one where Wayland clients and the compositor are both told to render a new frame as soon as the last one makes it to the screen. This means that clients are almost guaranteed to miss the deadline for this cycle, and their frame submissions will only make it into the compositor's renderer in the next cycle. With this schedule, there are two frames of latency between a client receiving some user input and the new content making it to the screen.
If the compositor starts rendering later then clients are more likely to have submitted the frame that was most recently triggered. The tradeoff is that the compositor is less likely to hit its own deadline. It's like the computer is playing The Price Is Right for every frame, and I'm in control of how it guesses.
I'm trying to make a smarter frame schedule (where the computer is really good at the game) be another easy and obvious schedule for wlroots compositors to use, so the whole ecosystem can win without having to faff about implementing it themselves.
Thanks to Simon for the idea and for mentoring me.
I quickly implemented a way for compositors to delay rendering by a specified number of milliseconds, and Kenny quickly pointed out that the way I did it was silly. Since then, I have a much better understanding of what actually happens for each frame in wlroots. Thanks Kenny!
After that review and a bit more discussion I gave it another go, but didn't submit it yet because I haven't finished polishing it. While I was doing that, I was brutally reminded that the C compiler does not care about me and will not attempt to help at all. Don't forget to enable your sanitizers, reader. Arguably I deserved it, because I had recently commented on how I was enjoying writing C for a change. Don't fall into the same trap as me; C does not want you to enjoy it.
I also wrote a new API that allows you to find out how long it took to render a frame. For now, this is only implemented on the GLES2 backend. Storing this time for the last few frames is a good way to make an educated guess on the time for the next one, and by also knowing your monitor's (maximum) refresh rate you can come up with a reasonable duration to delay rendering by.
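For illustration, using the naive prediction mentioned earlier (the next frame takes at most 1 ms more than the last one), the arithmetic is roughly this. The names and the zero-clamping are mine; this is not a wlroots API, just the idea:

#include <stdint.h>

#define MS_TO_NS(x) ((int64_t)(x) * 1000000)

/* How long to wait after the last present before starting to render, given
 * how long the last frame took and the output's refresh period. */
static int64_t render_delay_ns(int64_t last_render_ns, int64_t refresh_period_ns)
{
    int64_t predicted = last_render_ns + MS_TO_NS(1); /* 1 ms more than last frame */
    int64_t delay = refresh_period_ns - predicted;
    return delay > 0 ? delay : 0;   /* if we can't make it, render immediately */
}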
In testing the render timer I discovered that my test machine takes a whopping 4 milliseconds to paint the screen a solid colour. A valuable lesson lies here: don't ask questions if you're not strong enough to hear the answer. Or maybe "damage tracking is good".
See you next week!
As I wrote in the last update, my OpenCL branch is able to correctly run MobileNet v1 with the GPU delegate in TensorFlow-Lite, albeit much slower than with VeriSilicon's proprietary stack.
In the weeks that passed I have been investigating the performance difference, understanding better how the HW works and what the explanation could be. Inference with Etnaviv took 1200 ms, while the proprietary stack did the same in less than 10 ms (120x faster!).
When trying to understand the big performance difference I discovered that the existing reverse engineering tools that I had been using to understand how to run OpenCL workloads weren't working. They detected a single OpenCL kernel at the end of the execution, and there was no way that single kernel could be executing the whole network.
After lots of fumbling around in the internets I stumbled upon a commit that included an interestingly-named environment variable: VIV_VX_DISABLE_TP_NN_EVIS. With it, VeriSilicon's OpenVX implementation will execute the network without using the TP or NN fixed-function units, nor the EVIS instruction set (which helps reduce memory bandwidth use by allowing operations on packed int8 and int16 types).
With that environment variable OpenVX was using regular OpenCL to run the inference, and the performance difference was interesting: 398.428 ms. Still much better than our time, but also more than 50 times slower than when fully using the capabilities of the hardware. The reason for this is that there is only one core in the NPU that is able to run programmable kernels. The rest are fixed-function units as I'm going to explain next.
Digging further into VeriSilicon's kernel driver and marketing documents, I gathered that this particular NPU has 8 convolution cores (they call them NN cores) and 4 cores for accelerating some tensor operations (TP cores). Whatever these units cannot do has to be done in the single slow programmable core.
The next step was to understand how the proprietary stack made use of the fixed-function units in the NPU.
The MobileNet v1 model I used contains these operations, as output by TFLite's model analyzer:
Op#0 CONV_2D(T#88, T#6, T#4[28379, 17476, 18052, -2331, 17431, ...]) -> [T#5]
Op#1 DEPTHWISE_CONV_2D(T#5, T#33, T#32[-249, 165, 173, -2, 158, ...]) -> [T#31]
...
[12 more pairs of CONV_2D and DEPTHWISE_CONV_2D]
...
Op#27 AVERAGE_POOL_2D(T#29) -> [T#0]
Op#28 CONV_2D(T#0, T#3, T#2[-5788, -4159, 2282, -6706, -9783, ...]) -> [T#1]
Op#29 RESHAPE(T#1, T#86[-1, 1001]) -> [T#85]
Op#30 SOFTMAX(T#85) -> [T#87]
As can be seen, it is basically a bunch of convolutions with a final reshaping and a SOFTMAX operation at the end.
By using some of the environment variables that are mentioned in this issue in GitHub, we can get some information on how the proprietary stack plans the execution on the hardware:
operation_name:VXNNE_OPERATOR_TENSOR_TRANS operation_target:VXNNE_OPERATION_TARGET_TP
operation_name:VXNNE_OPERATOR_RESHUFFLE operation_target:VXNNE_OPERATION_TARGET_TP
operation_name:VXNNE_OPERATOR_CONVOLUTION operation_target:VXNNE_OPERATION_TARGET_NN
...
[34 more VXNNE_OPERATOR_CONVOLUTION on VXNNE_OPERATION_TARGET_NN]
...
operation_name:VXNNE_OPERATOR_POOLING operation_target:VXNNE_OPERATION_TARGET_SH
operation_name:VXNNE_OPERATOR_FULLYCONNECTED operation_target:VXNNE_OPERATION_TARGET_TP
operation_name:VXNNE_OPERATOR_SOFTMAX operation_target:VXNNE_OPERATION_TARGET_SH
From that we can see that the TP units are used to prepare the input tensor, all convolution operations go to the NN cores, and the output of the convolutions is passed through a pooling operation on the programmable core, which hands its output to the TP cores for further processing before finishing with SOFTMAX back on the programmable core.
So in this case, only a small part of the network is actually run on the programmable core, via OpenCL...
What I will be working on next:
If performance is at least 3x faster than running the inference on the CPU, I would call this good enough to be useful and I will switch to upstreaming. The Mesa side of it doesn't look that bad, but I think the bigger challenge will be getting something merged in TensorFlow that can run fast on this hardware.
The most reasonable approach I have been able to think of would be adding new CL C and SPIR-V vendor extensions that add a new intrinsic for the whole convolution operation (with parameters similar to those of the vxConvolutionLayer node).
The GPU delegate in TensorFlow Lite would use it on the Vivante NPU and Mesa would have a robust way of knowing that this kernel should be run with a NN job, and with what configuration.
That's a lot of work, but I would say at this point that afterwards I will start looking at making fuller use of the NPU's capabilities by doing something similar with the operations that the TP cores can accelerate.
For the last four years I’ve served as a member of the X.Org Foundation Board of Directors, but some days ago I stepped down after my term ended and not having run for re-election.
I started contributing to Mesa in 2014 and joined the amazing freedesktop community. Soon after, I joined the X.Org Foundation as a regular member in order to participate in the elections and get access to some interesting perks (VESA, Khronos Group). You can learn more about what the X.Org Foundation does in Ricardo's blogpost.
But everything changed in 2018. That year, Chema and I organized XDC 2018 in A Coruña, Spain.
The following year, I ran in the yearly election for the X.Org Foundation's board of directors (as it is a two-year term, we renew half of the board every year)… and I was elected! It was awesome! Almost immediately, I started coordinating XDC and looking for organization proposals for the following XDC. I documented my experience organizing XDC 2018 in an attempt to make the job easier for future organizers, reducing the burden that organizing such a conference entails.
In 2021, I was re-elected and everything continued without changes (well, except the pandemic and having our first 2 virtual XDCs: 2020 and 2021).
Unfortunately, my term finished this year… and I did not run for re-election. The reasons were a mix of personal life commitments (having 2 kids changes your life completely) and new professional responsibilities. After those changes, I could not contribute as much as I wanted, and that was enough for me to pass the torch and let others contribute to the X.Org Foundation instead. Congratulations to Christopher Michale and Arek Hiler, I’m pretty sure you are going to do great!
Surprisingly enough, I am closing the cycle as it started: organizing X.Org Developers Conference 2023 in A Coruña, Spain from 17th to 19th October 2023.
I leave the board of directors having won friends and great memories. In case you are interested in participating in the community via the board of directors, prepare your candidacy for next year!
See you in A Coruña!
A few weeks ago the annual X.Org Foundation Board of Directors election took place. The Board of Directors has 8 members at any given moment, and members are elected for 2-year terms. Instead of renewing the whole board every 2 years, half the board is renewed every year. Foundation members, which must apply for or renew membership every year, are the electorate in the process. Their main duty is voting in board elections and occasionally voting in other changes proposed by the board.
As you may know, thanks to the work I do at Igalia, and the trust of other Foundation members, I’m part of the board and currently serving the second year of my term, which will end in Q1 2024. Despite my merits coming from my professional life, I do not represent Igalia as a board member. However, to prevent companies from taking over the board, I must disclose my professional affiliation and we must abide by the rule that prohibits more than two people with the same affiliation from being on the board at the same time.
Because of the name of the Foundation and for historical reasons, some people are confused about its purpose and sometimes they tend to think it acts as a governance body for some projects, particularly the X server, but this is not the case. The X.Org Foundation wiki page at freedesktop.org has some bits of information but I wanted to clarify a few points, like mentioning the Foundation has no paid employees, and explain what we do at the Foundation and the tasks of the Board of Directors in practical terms.
Cue the music.
(“The Who - Who Are You?” starts playing)
The main points would be:
The Foundation acts as an umbrella for multiple projects, including the X server, Wayland and others.
The board of directors has no power to decide who has to work on what.
The largest task is probably organizing XDC.
Being a director is not a paid position.
The Foundation pays for project infrastructure.
The Foundation, or its financial liaison, acts as an intermediary with other orgs.
Some directors have argued in the past that we need to change the Foundation name to something different, like the Freedesktop.org Foundation. With some healthy sense of humor, others have advocated for names like Freedesktop Software Foundation, or FSF for short, which should be totally not confusing. Humor or not, the truth is the X.Org Foundation is essentially the Freedesktop Foundation, so the name change would be nice in my own personal opinion.
If you take a look at the Freedesktop Gitlab instance, you can navigate to a list of projects and sort them by stars. Notable mentions you’ll find in the list: Mesa, PipeWire, GStreamer, Wayland, the X server, Weston, PulseAudio, NetworkManager, libinput, etc. Most of them closely related to a free and open source graphics stack, or free and open source desktop systems in general.
As I mentioned above, the Foundation has no paid employees and the board has no power to direct engineering resources to a particular project under its umbrella. It’s not a legal question, but a practical one. Is the X.Org server dying and nobody wants to touch it anymore? Certainly. Many people who worked on the X server are now working on Wayland and creating and improving something that works better in a modern computer, with a GPU that’s capable of doing things which were not available 25 years ago. It’s their decision and the board can do nothing.
On a tangent, I’m feeling a bit old now, so let me say when I started using Linux more than 20 years ago people were already mentioning most toolkits were drawing stuff to pixmaps and putting those pixmaps on the screen, ignoring most of the drawing capabilities of the X server. I’ve seen tearing when playing movies on Linux many times, and choppy animations everywhere. Attempting to use the X11 protocol over a slow network resulted in broken elements and generally unusable screens, problems which would not be present when falling back to a good VNC server and client (they do only one specialized thing and do it better).
For the last 3 or 4 years I’ve been using Wayland (first on my work laptop, nowadays also on my personal desktop) and I’ve seen it improve all the time. When using Wayland, animations are never choppy in my own experience, tearing is unheard of and things work more smoothly, as far as my experience goes. Thanks to using the hardware better, Wayland may also give you improved battery life. I’ve posted in the past that you can even use NVIDIA with Gnome on Wayland these days, and things are even simpler if you use an Intel or AMD GPU.
Naturally, there may be a few things which may not be ready for you yet. For example, maybe you use a DE which only works on X11. Or perhaps you use an app or DE which works on Wayland, but its support is not great and has problems there. If it’s an app, likely power users or people working on distributions can tune it to make it use XWayland by default, instead of Wayland, while bugs are ironed out.
Ouch, there we have the “X.Org” moniker again…
Back on track, if the Foundation can do nothing about the lack of people maintaining the X server and does not set any technical direction for projects, what does it do? (I hear you shouting “nothing!” while waving your fist at me.) One of the most time-consuming tasks is organizing XDC every year, which is arguably one of the most important conferences, if not the most important one, for open source graphics right now.
Specifically, the board of directors will set up a commission composed of several board members and other Foundation members to review talk proposals, select which ones will have a place at the conference, talk to speakers about shortening or lengthening their talks, and put them on a schedule to be used at the conference, which typically lasts 3 days. I chaired the paper committee for XDC 2022 and spent quite a lot of time on this.
The conference is free to attend for anyone and usually alternates location between Europe and the Americas. Some people may want to travel to the conference to present talks there but they may lack the budget to do so. Maybe they’re a student or they don’t have enough money, or their company will not sponsor travel to the conference. For that, we have travel grants. The board of directors also reviews requests for travel grants and approves them when they make sense.
But that is only the final part. The board of directors selects the conference contents and prepares the schedule, but the job of running the conference itself (finding an appropriate venue, paying for it, maybe providing some free lunches or breakfasts for attendees, handling audio and video, streaming, etc) falls in the hands of the organizer. Kid you not, it’s not easy to find someone willing to spend the needed amount of time and money organizing such a conference, so the work of the board starts a bit earlier. We have to contact people and request for proposals to organize the conference. If we get more than one proposal, we have to evaluate and select one.
As the conference nears, we have to fire some more emails and convince companies to sponsor XDC. This is also really important and takes time as well. Money gathered from sponsors is not only used for the conference itself and travel grants, but also to pay for infrastructure and project hosting throughout the whole year. Which takes us to…
No, that’s not happening.
Being a director of the Foundation is not a paid position. Every year we suffer a bit to be able to get enough candidates for the 4 positions that will be elected. Many times we have to extend the nomination period.
If you read news about the Foundation having trouble finding candidates for the board, that barely qualifies as news because it’s almost the same every year. Which doesn’t mean we’re not happy when people spread the news and we receive some more nominations, thank you!
Just like being an open source maintainer can be a thankless task, not everybody wants to volunteer and do time-consuming work for free. Running the board elections themselves, approving membership renewals and requests every year, and sending voting reminders also takes time. Believe me, I just did that a few weeks ago with help from Mark Filion from Collabora and technical assistance from Martin Roukala.
The Foundation spends a lot of money on project hosting costs, including Gitlab and CI systems, for projects under the Freedesktop.org umbrella. These systems are used every day and are fundamental for some projects and software you may be using if you run Linux. Running our own Gitlab instance and associated services helps keep the web decentralized and healthy, and provides more technical flexibility. Many people seem to appreciate those details, judging by the number of projects we host.
The Foundation also approaches other organizations on behalf of the community to achieve some stuff that would be difficult otherwise.
To pick one example, we’ve worked with VESA to provide members with access to various specifications that are needed to properly implement some features. Our financial liaison, formerly SPI and soon SFC, signs agreements with the Khronos Group that let them waive fees for certifying open source implementations of their standards.
For example, you know RADV is certified to comply with the Vulkan 1.3 spec and the submission was made on behalf of Software in the Public Interest, Inc. Same thing for lavapipe. Similar for Turnip, which is Vulkan 1.1 conformant.
The song is probably over by now and you have a better idea of what the Foundation does, and what the board members do to keep the lights on. If you have any questions, please let me know.
Last year, the Linux Foundation announced the creation of the Linux Foundation Europe.
The goal of the Linux Foundation Europe is, in a nutshell, to promote Open Source in Europe not only to individuals (via events and courses), but to companies (guidance and hosting projects) and European organizations. However, this effort needs the help of European experts in Open Source.
Thus, the Linux Foundation Europe (LFE) has formed an advisory board called the Linux Foundation Europe Advisory Board (LFEAB), which includes representatives from a cross-section of 20 leading European organizations within the EU, the UK, and beyond. The Advisory Board will play an important role in stewarding Linux Foundation Europe’s growing community, which now spans 100 member organizations from across the European region.
Early this year, I was invited to join the LFEAB as an inaugural member. I would not be in this position without the huge amount of work done by the rest of my colleagues at Igalia since the company was founded in 2001, which has paved the way for us to be one of the landmark consultancies specialized in Open Source, both globally and in Europe.
My presence in the LFEAB will help share our experience, and help the Linux Foundation Europe grow and spread Open Source everywhere in Europe.
I’m excited to participate in the Linux Foundation Europe Advisory Board! I and the rest of the LFEAB will be at the Open Source Summit Europe; send me an email if you want to meet to learn more about the LFEAB, about Igalia, or about how you can contribute more to Open Source.
Happy hacking!
After finishing up my first Igalia Coding Experience in January, I got the amazing opportunity to keep working in the DRI community by extending my Igalia CE to a second round. Huge thanks to Igalia for providing me with this opportunity!
Another four months passed by and here I am completing another milestone with Igalia. Previously, in my final reports, I described GSoC as “an experience to get a better understanding of what open source is” and the first round of the Igalia CE as “an opportunity for me to mature my knowledge of technical concepts”. My second round of the Igalia CE was a period for broadening my horizons.
I had the opportunity to deepen my knowledge of a new programming language and learn more about Kernel Mode Setting (KMS). I took my time learning more about Vulkan and the Linux graphics stack. All of this new knowledge about the DRM infrastructure fascinated me and made me excited to keep developing.
So, this is a summary report of my journey at my second Igalia CE.
First, I took some time to wrap up the contributions of my previous Igalia CE. In my January Update, I described the journey to include IGT tests for V3D. But at the time, I hadn’t yet sent the final versions of the tests. Right when I started my second Igalia CE, I sent the final versions of the V3D tests, which were accepted and merged.
Series | Status |
---|---|
[PATCH i-g-t 0/6] V3D Job Submission Tests | Accepted |
[PATCH i-g-t 0/3] V3D Mixed Job Submission Tests | Accepted |
The first part of my Igalia CE was focused on rewriting the VGEM driver in Rust. VGEM (Virtual GEM Provider) is a minimal non-hardware-backed GEM (Graphics Execution Manager) service. It is used with non-native 3D hardware for buffer sharing between the X server and DRI.
The goal of the project was to explore Rust in the DRM subsystem and end up with a working VGEM driver written in Rust. Rust is a blazingly fast and memory-efficient language with a powerful ownership model. It was really exciting to learn more about Rust and implement a DRM driver from scratch.
During the project, I wrote two blog posts describing the technical aspects of the rustgem driver.
If you are interested in this project, check them out!
Date | Blogpost |
---|---|
28th February | Rust for VGEM |
22nd March | Adding a Timeout feature to Rustgem |
By the end of the first half of the Igalia CE, I sent an RFC patch with the rustgem
driver.
Thanks to Asahi Lina, the Rust for Linux folks, and Daniel Vetter for all the feedback provided during the development of the driver.
I still need to address some feedback and rebase the series on top of the new pin-init API, but I hope to see this driver upstream soon.
You can check the driver’s current status in this PR.
Series | Status |
---|---|
[RFC PATCH 0/9] Rust version of the VGEM driver | In Review |
Apart from rewriting the VGEM driver, I also sent a couple of improvements to the C version of the VGEM driver and its IGT tests.
I found a missing mutex_destroy in the code and also an unused struct.
Patches | Status |
---|---|
[PATCH] drm/vgem: add missing mutex_destroy | Accepted |
[PATCH] drm/vgem: Drop struct drm_vgem_gem_object | Accepted |
On the IGT side, I added some new tests to the VGEM tests. I wanted to ensure that my driver returned the correct values for all possible error paths, so I wrote this IGT test. Initially, it was just for me, but I decided to submit it upstream.
Series | Status |
---|---|
[PATCH v3 i-g-t 0/2] Add negative tests to VGEM | Accepted |
Focusing on the VKMS was the major goal of the second part of my Igalia CE. Melissa Wen is one of the maintainers of the VKMS, and she provided me with a fantastic opportunity to learn more about it. Until then, I hadn't dealt with displays, and learning new concepts in the graphics stack was great.
VKMS is a software-only KMS driver that is quite useful for testing and for running X (or similar compositors) on headless machines. At the time, the driver didn't have any support for optional plane properties, such as rotation and blend mode. Therefore, my goal was to implement the first plane property of the driver: rotation. I described the technicalities of this challenge in this blog post, but I can say that it was a nice challenge in this mentorship project.
In the end, we have the first plane property implemented for the VKMS and it is already committed.
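To illustrate what a rotation plane property means for a software-composited driver (this is not the actual VKMS code, just a sketch), during blending each destination pixel gets fetched from a rotated source coordinate. One possible 90-degree mapping could look like this:

```c
struct point { int x, y; };

/* Map a destination pixel back to its source pixel for a source plane of
 * src_w pixels of width rotated by 90 degrees; the destination is then
 * src_h pixels wide and src_w pixels tall.  Illustrative only. */
static struct point rotate_90_src_coord(int dst_x, int dst_y, int src_w)
{
	struct point p = {
		.x = src_w - 1 - dst_y,
		.y = dst_x,
	};
	return p;
}
```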
Together with the VKMS part, I sent a series to the IGT mailing list with some improvements to the kms_rotation_crc
tests.
These improvements included adding new tests for rotation with offset and reflection and the isolation of some Intel-specific tests.
As I was working with the rotation series, I discovered a couple of things that could be improved in the VKMS driver. Last year, Igor Torrente sent a series to VKMS that changed the composition work in the driver. Before his series, the plane composition was executed on top of the primary plane. Now, the plane composition is executed on top of the CRTC.
Although his series was merged, some parts of the code still considered that the composition was executed on top of the primary plane, limiting the VKMS capabilities. So I sent a couple of patches to the mailing list, improving the handling of the primary plane and allowing full alpha blending on all planes.
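For reference, "full alpha blending" here means the usual per-pixel "over" operation, where each plane is blended over whatever has already been composed below it. A tiny illustrative example with straight (non-premultiplied) 8-bit alpha, not the VKMS code itself:

```c
#include <stdint.h>

/* Blend one channel of a source pixel over the destination:
 * src * a + dst * (255 - a), with rounding, scaled back to 8 bits. */
static uint8_t blend_channel(uint8_t src, uint8_t dst, uint8_t alpha)
{
	uint32_t v = src * alpha + dst * (255 - alpha);
	return (uint8_t)((v + 127) / 255);
}
```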
Moreover, I sent a series that added a module parameter to set a background color to the CRTC. This work raised an interesting discussion about the need for this property by the user space and whether this parameter should be a KMS property.
Apart from introducing the rotation property to the VKMS driver, I also took my time to implement two other properties: alpha and blend mode. This series is still awaiting review, but it would be a nice addition to the VKMS, increasing its IGT test coverage rate.
Finally, I found a bug in the RGB565 conversion.
The RGB565 conversion to ARGB16161616 involves some fixed-point operations and, when running the pixel-format
IGT test, I verified that the RGB565 test was failing.
So, some of those fixed-point operations were returning erroneous values. I verified that the RGB coefficients weren't being rounded when converted from fixed-point to integers, although this rounding is needed to produce the proper coefficient values. Therefore, the fix was to implement a new helper that rounds the fixed-point value when converting it to an integer.
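A minimal sketch of the idea, assuming a signed 64-bit value with 32 fractional bits similar to the kernel's drm_fixed.h (the names below are illustrative, not the exact upstream helper):

```c
#include <stdint.h>

#define FIXED_FRAC_BITS 32
#define FIXED_HALF ((int64_t)1 << (FIXED_FRAC_BITS - 1))

/* Plain shifting truncates the fractional part, so coefficients come out
 * slightly too small. */
static inline int64_t fixed2int_trunc(int64_t a)
{
	return a >> FIXED_FRAC_BITS;
}

/* Adding half of one unit before shifting rounds to the nearest integer. */
static inline int64_t fixed2int_round(int64_t a)
{
	return (a + FIXED_HALF) >> FIXED_FRAC_BITS;
}
```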
After performing all this work on the VKMS, I sent a patch adding myself as a VKMS maintainer, which was acked by Javier Martinez and Melissa Wen. So now, I’m working together with Melissa, Rodrigo Siqueira, and the rest of the DRI community to improve and maintain the VKMS driver.
A couple of years ago, Sumera Priyadarsini, an Outreachy intern, worked on a Virtual Hardware functionality for the VKMS. The idea was to add a Virtual Hardware or vblank-less mode as a kernel parameter to enable VKMS to emulate virtual devices. This means no vertical blanking events occur and page flips are completed arbitrarily when required for updating the frame. Unfortunately, she wasn’t able to wrap things up and this ended up never being merged into VKMS.
Melissa suggested rebasing this series and now we can have the Virtual Hardware functionality working on the current VKMS. This was great work by Sumera, and my work here was just to adapt her changes to the new VKMS code.
Series | Status |
---|---|
[PATCH 0/2] drm/vkms: Enable Virtual Hardware support | In Review |
Finally, I was in the last week of the project, just wrapping things up, when I decided to run the VKMS CI. I had recently committed the rotation series and I had run the CI before, but to my surprise, I got the following output:
[root@fedora igt-gpu-tools]# ./build/tests/kms_writeback
IGT-Version: 1.27.1-gce51f539 (x86_64) (Linux: 6.3.0-rc4-01641-gb8e392245105-dirty x86_64)
(kms_writeback:1590) igt_kms-WARNING: Output Writeback-1 could not be assigned to a pipe
Starting subtest: writeback-pixel-formats
Subtest writeback-pixel-formats: SUCCESS (0.000s)
Starting subtest: writeback-invalid-parameters
Subtest writeback-invalid-parameters: SUCCESS (0.001s)
Starting subtest: writeback-fb-id
Subtest writeback-fb-id: SUCCESS (0.020s)
Starting subtest: writeback-check-output
(kms_writeback:1590) CRITICAL: Test assertion failure function get_and_wait_out_fence, file ../tests/kms_writeback.c:288:
(kms_writeback:1590) CRITICAL: Failed assertion: ret == 0
(kms_writeback:1590) CRITICAL: Last errno: 38, Function not implemented
(kms_writeback:1590) CRITICAL: sync_fence_wait failed: Timer expired
Stack trace:
#0 ../lib/igt_core.c:1963 __igt_fail_assert()
#1 [get_and_wait_out_fence+0x83]
#2 ../tests/kms_writeback.c:337 writeback_sequence()
#3 ../tests/kms_writeback.c:360 __igt_unique____real_main481()
#4 ../tests/kms_writeback.c:481 main()
#5 ../sysdeps/nptl/libc_start_call_main.h:74 __libc_start_call_main()
#6 ../csu/libc-start.c:128 __libc_start_main@@GLIBC_2.34()
#7 [_start+0x25]
Subtest writeback-check-output failed.
**** DEBUG ****
(kms_writeback:1590) CRITICAL: Test assertion failure function get_and_wait_out_fence, file ../tests/kms_writeback.c:288:
(kms_writeback:1590) CRITICAL: Failed assertion: ret == 0
(kms_writeback:1590) CRITICAL: Last errno: 38, Function not implemented
(kms_writeback:1590) CRITICAL: sync_fence_wait failed: Timer expired
(kms_writeback:1590) igt_core-INFO: Stack trace:
(kms_writeback:1590) igt_core-INFO: #0 ../lib/igt_core.c:1963 __igt_fail_assert()
(kms_writeback:1590) igt_core-INFO: #1 [get_and_wait_out_fence+0x83]
(kms_writeback:1590) igt_core-INFO: #2 ../tests/kms_writeback.c:337 writeback_sequence()
(kms_writeback:1590) igt_core-INFO: #3 ../tests/kms_writeback.c:360 __igt_unique____real_main481()
(kms_writeback:1590) igt_core-INFO: #4 ../tests/kms_writeback.c:481 main()
(kms_writeback:1590) igt_core-INFO: #5 ../sysdeps/nptl/libc_start_call_main.h:74 __libc_start_call_main()
(kms_writeback:1590) igt_core-INFO: #6 ../csu/libc-start.c:128 __libc_start_main@@GLIBC_2.34()
(kms_writeback:1590) igt_core-INFO: #7 [_start+0x25]
**** END ****
Subtest writeback-check-output: FAIL (1.047s)
🫠 🫠 🫠 🫠 🫠 🫠 🫠 🫠 🫠 🫠 🫠 🫠 🫠 🫠 🫠 🫠
Initially, I thought I had introduced the bug with my rotation series. Turns out, I had just made it more likely to happen. This bug had been hidden in VKMS for a while, surfacing only on rare occasions. Yeah, I’m talking about a race condition… the kind of bug that stays hidden in your code for a long while.
When I started to debug, I thought it was a performance issue. But then, I increased the timeout to 10 seconds and even then the job wouldn’t finish. So, I thought that it could be a deadlock. But after inspecting the DRM internal locks and the VKMS locks, it didn’t seem the case.
Melissa pointed me to a hint: there was one framebuffer being leaked when removing the driver. I discovered that it was the writeback framebuffer. It meant that the writeback job was being queued, but it wasn’t being signaled. So, the problem was inside the VKMS locking mechanism.
After tons of GDB and ftrace, I was able to find out that the composer was being set twice without any calls to the composer worker. I changed the internal locks a bit and I was able to run the test repeatedly for minutes! I sent the fix for review and now I’m just waiting for a Reviewed-by.
Patches | Status |
---|---|
[PATCH] drm/vkms: Fix race-condition between the hrtimer and the atomic commit | In Review |
While debugging, I found some things that could be improved in the VKMS writeback file. So, I decided to send a series with some minor improvements to the code.
Series | Status |
---|---|
[PATCH 0/3] drm/vkms: Minor Improvements | In Review |
If you run all IGT KMS tests on the VKMS driver, you will see that some tests fail. That’s not what we would expect: we would expect all tests to either pass or skip. The failures could be due to errors in the VKMS driver or to wrong expectations on the IGT side. So, in the final part of my Igalia CE, I inspected a couple of IGT failures and sent fixes to address the errors.
This patch is a revival of a series I sent in January to fix the IGT test kms_addfb_basic@addfb25-bad-modifier
.
This test also failed in VC4, and I investigated the reason in January.
I sent a patch to guarantee that the test would pass and, after some feedback, I came to a dead end.
So, I left this patch aside for a while and decided to pick it up again now.
Now, with this patch being merged, we can guarantee that the test kms_addfb_basic@addfb25-bad-modifier
is passing for multiple drivers.
Patches | Status |
---|---|
[PATCH] drm/gem: Check for valid formats | Accepted |
On the IGT side, I sent a couple of improvements to the tests.
The failure was usually just a scenario that the test didn’t consider.
For example, the kms_plane_scaling
test was failing in VKMS, because it didn’t consider the case in which the driver did not have the rotation property.
As VKMS didn't have the rotation property until now, the tests were failing instead of skipping.
Therefore, I added a path for the tests to skip on drivers without the rotation property, as sketched below.
I sent improvements to the kms_plane_scaling, kms_flip, and kms_plane tests, making the tests pass or skip in all cases for the VKMS.
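The skip path itself is small. A rough sketch of what it can look like in an IGT test (the call sites here are illustrative; igt_require() skips a subtest when its condition is false, and igt_plane_has_prop() checks whether the driver exposes a given plane property):

```c
#include "igt.h"

static void test_rotation(igt_plane_t *plane)
{
	/* Skip instead of fail on drivers without a rotation property. */
	igt_require(igt_plane_has_prop(plane, IGT_PLANE_ROTATION));

	/* ... the actual rotation subtest would continue here ... */
}
```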
One important thing for VKMS is creating a baseline of generic KMS tests that must pass. This way, we can test new contributions against this baseline and avoid introducing regressions in the codebase. I sent a patch to IGT creating a testlist for the VKMS driver with all the KMS tests that must pass on it. This is great for maintenance, as we can run the testlist to ensure that the VKMS functionalities are preserved.
With new features being introduced in VKMS, it is important to keep the test list updated. So, I verified the test results and updated this test list during my time at the Igalia CE. I intend to keep this list updated as long as I can.
Series | Status |
---|---|
[PATCH i-g-t] tests/vkms: Create a testlist to the vkms DRM driver | Accepted |
[PATCH i-g-t 0/3] tests/vkms: Update VKMS’s testlist | Accepted |
First, I would like to thank my great mentor Melissa Wen. Melissa and I are completing a year together as mentee and mentor and it has been an amazing journey. Since GSoC, Melissa has been helping me by answering every single question I have and providing me with great encouragement. I have a lot of admiration for her and I’m really grateful for having her as my mentor during the last year.
Also, I would like to thank Igalia for giving me this opportunity to keep working in the DRI community and learning more about this fascinating topic. Thanks to all Igalians that helped through this journey!
Moreover, I would like to thank the DRI community for reviewing my patches and giving me constructive feedback.
Especially, I would like to thank Asahi Lina, Daniel Vetter and the Rust for Linux folks for all the help with the rustgem
driver.
Thanks for all the suggestions, both here and on Twitter and Mastodon; anyone who noted I could use a single fd and avoid all the pain was correct!
I hacked up an ever-growing ftruncate/madvise memfd and it seemed to work fine. In order to use it for sparse, I'd have to use it for all device memory allocations in lavapipe, which means that before pushing forward I probably have to prove to myself that it works and scales a bit better. I suspect layering some of the pb bufmgr code on top of an ever-growing fd might work, or maybe just having multiple 2GB buffers might be enough.
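For the curious, here is a rough sketch of what the ever-growing memfd could look like; the structure and names are mine, not lavapipe's, and error handling, locking and size alignment are omitted:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <stdint.h>

static int heap_fd = -1;
static uint64_t heap_size;

static int heap_init(void)
{
	/* One anonymous file backs every "device memory" allocation. */
	heap_fd = memfd_create("lvp-heap", MFD_CLOEXEC);
	return heap_fd < 0 ? -1 : 0;
}

/* Append an allocation by growing the file; returns its offset. */
static uint64_t heap_alloc(uint64_t size)
{
	uint64_t offset = heap_size;

	if (ftruncate(heap_fd, heap_size + size) < 0)
		return UINT64_MAX;
	heap_size += size;
	return offset;
}

/* Release the pages of a freed allocation while keeping offsets stable. */
static void heap_free(uint64_t offset, uint64_t size)
{
	fallocate(heap_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		  offset, size);
}
```

Growing is just an ftruncate(), and punching holes gives pages back to the kernel without invalidating any existing offset, which is what makes a single fd workable.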
Not sure how best to do shaderResourceResidency, userfaultfd might be somewhat useful, mapping with PROT_NONE and then using write(2) to get a -EFAULT is also promising, but I'm not sure how best to avoid segfaults for read/writes to PROT_NONE regions.
Once I got that going, though, I ran headfirst into something that should have been obvious to me, but that I hadn't thought through.
llvmpipe allocates all its textures linearly; there is no tiling (even for VK_IMAGE_TILING_OPTIMAL). Sparse textures are incompatible with linear implementations. For sparse 2D images you have to be able to derive the sparse tile sizes from just the image format. This typically means working out how many texels wide and high a tile that fits into a hardware page is; for example, with 64 KiB pages and a 4-byte-per-texel format that works out to a 128x128 tile. Of course, for a linear image this would depend on the image stride, not just the format, and you just don't have that information.
I guess it means texture tiling in llvmpipe might have to become a thing, we've thought about it over the years but I don't think there's ever been a solid positive for implementing it.
Might have to put sparse support on the back burner for a little while longer.
Hi all!
This status update comes in a bit late because I was on leave last week. The highlight this month is the HDR hackfest, I’ve written a dedicated blog post about it. After the publication of that blog post, I’ve sent out an RFC to dri-devel.
We’ve made some good progress on wlroots’ Vulkan renderer. Manuel Stoeckl has added support for an intermediate buffer for blending, which is required for non-8-bit output formats and for color management features. The renderer now has an optional extra rendering pass to run a shader after blending. This is currently used to encode color values to sRGB, and will be used in the future to apply ICC profiles and to perform color space conversions. I’ve added support for the NV12 DMA-BUF format, support for more YCbCr formats is in a merge request.
The new cursor-shape-v1 protocol has been merged in wayland-protocols thanks
to KDE and winit folks. Traditionally Wayland clients needed to load XCursor
themes and submit these as wl_shm
buffers to the compositor. However there
are a few downsides: there is no mechanism to configure the theme that gets
loaded, the theme cannot be changed on-the-fly, there is no way to configure
separate themes per seat, and loading cursors slows down client startup. The
cursor-shape-v1 protocol allows clients to set a cursor image by its name
instead of using wl_shm
buffers.
I’ve worked on adding a new mode to wayland-scanner to generate enums only. This is useful for libraries like wlroots which use C enums generated from protocol XML in their public headers. We plan to ship these headers as part of a wayland-protocols installation.
To wrap up this status update, let’s mention a few updates for miscellaneous projects. A handful of new formats have been added to pixfmtdb. gqlclient now handles GraphQL interfaces correctly and generates methods to unwrap the underlying type. This is now used in hut to show ticket comments, among other things. go-imap now supports SEARCHRES, LITERAL+, and features a simplified API for STATUS commands.
See you next month!
Mike nerdsniped me into wondering how hard sparse memory support would be in lavapipe.
The answer is unfortunately extremely.
Sparse binding essentially allows creating a vulkan buffer/image of a certain size, then plugging in chunks of memory to back it in page-size multiple chunks.
This works great with GPU APIs where we've designed this, but it's actually hard to pull off on the CPU.
Currently lavapipe allocates memory with an aligned malloc. It allocates objects with no backing and non-sparse bindings connect objects to the malloced memory.
However with sparse objects, the object creation should allocate a chunk of virtual memory space, then sparse binding should bind allocated device memory into the virtual memory space. Except Linux has no interfaces for doing this without using a file descriptor.
You can't mmap a chunk of anonymous memory that you allocated with malloc to another location. So if I malloc backing memory A at 0x1234000, but the virtual memory I've used for the object is at 0x4321000, there's no nice way to get the memory from the malloc to be available at the new location (unless I missed an API).
However you can do it with file descriptors. You can mmap a PROT_NONE area for the sparse object, then allocate the backing memory into file descriptors, then mmap areas from those file descriptors into the correct places.
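A small sketch of that fd-backed approach (illustrative only, not lavapipe code): reserve the sparse object's address range with PROT_NONE, then map chunks of a memfd into it with MAP_FIXED when binding.

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/types.h>
#include <stddef.h>

/* Reserve virtual address space for a sparse resource, with no backing. */
static void *sparse_reserve(size_t size)
{
	return mmap(NULL, size, PROT_NONE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}

/* Bind one chunk of "device memory" (a memfd) into the reserved range.
 * MAP_FIXED atomically replaces the PROT_NONE mapping for that range. */
static void *sparse_bind(void *base, size_t obj_offset,
			 int mem_fd, off_t mem_offset, size_t size)
{
	return mmap((char *)base + obj_offset, size,
		    PROT_READ | PROT_WRITE,
		    MAP_SHARED | MAP_FIXED, mem_fd, mem_offset);
}
```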
But there are limits on file descriptors: you get a soft limit of 1024 and a hard limit of 4096 by default, which is woefully low for this. Also *all* device memory allocations would need to be fd-backed, not just the ones going to be used in sparse bindings.
Vulkan has a maxMemoryAllocationCount limit that could be used for this, but setting it to the fd limit is a problem because some fds are being used by the application and by normal operations in general, so reporting 4096 for it is probably going to explode if you only have 3900 of them left.
Also the sparse CTS tests don't respect the maxMemoryAllocationCount anyways :-)
I shall think on this a bit more, please let me know if anyone has any good ideas!
This blogpost was actually written partially in November/December 2022 while I was developing IGT tests for the V3D driver. I ended up leaving it aside for a while, and now I have come back and finished the last loose ends. That’s why I’m referencing the time when I was fighting against V3D’s noop jobs.
Currently, during my Igalia Coding Experience, I’m working on the V3D’s IGT tests and therefore, I’m dealing a lot with the Raspberry Pi 4.
During the project, I had a real struggle designing the tests for the v3d_submit_cl ioctl, as I was not able to submit a proper noop job to the GPU.
In order to debug the tests, my mentor Melissa Wen suggested that I run the CTS tests to reproduce a noop job and debug it through Mesa. I cloned the CTS repository onto my Raspberry Pi 4 and tried to compile it, but my Raspberry Pi 4 went OOM. This sent me on a journey to cross-compile CTS for the Raspberry Pi 4. I decided to compile this journey into this blogpost.
During this blogpost, I’m using a Raspbian OS with desktop 64-bit.
First, you need to install Mesa on the Raspberry Pi 4. I decided to compile Mesa on the Raspberry Pi 4 itself, but maybe one day, I can write a blogpost about cross-compiling Mesa.
Currently, the Raspbian repositories only provide libdrm 2.4.104
and Mesa’s main branch needs libdrm >=2.4.109
.
So, first, let’s install a newer libdrm (2.4.114) on the Raspberry Pi 4.
First, let’s make sure that you have meson
installed on your RPi4.
We will need meson
to build libdrm
and Mesa.
I’m installing meson
through pip3
because we need a meson
version greater than 0.60 to build Mesa.
# On the Raspberry Pi 4
$ sudo pip3 install meson
Then, you can install libdrm 2.4.109
on the RPi4.
# On the Raspberry Pi 4
$ wget https://dri.freedesktop.org/libdrm/libdrm-2.4.114.tar.xz
$ tar xvpf libdrm-2.4.114.tar.xz
$ cd libdrm-2.4.114
$ mkdir build
$ cd build
$ CFLAGS="-O2 -march=armv8-a+crc+simd -mtune=cortex-a72" \
CXXFLAGS="-O2 -march=armv8-a+crc+simd -mtune=cortex-a72" \
meson -Dudev=true -Dvc4="enabled" -Dintel="disabled" -Dvmwgfx="disabled" \
-Dradeon="disabled" -Damdgpu="disabled" -Dnouveau="disabled" -Dfreedreno="disabled" \
-Dinstall-test-programs=true ..
$ sudo ninja install
So, now let’s install Mesa.
During this blogpost, I will use ${USER}
as the username on the machine.
Note that, in order to run sudo apt build-dep mesa
, you will have to uncomment some deb-src
on the file /etc/apt/sources.list
and run sudo apt update
.
# On the Raspberry Pi 4
# Install Mesa's build dependencies
$ sudo apt build-dep mesa
# Build and Install Mesa
$ git clone https://gitlab.freedesktop.org/mesa/mesa
$ cd mesa
$ mkdir builddir
$ mkdir installdir
$ CFLAGS="-mcpu=cortex-a72" CXXFLAGS="-mcpu=cortex-a72" \
meson -Dprefix="/home/${USER}/mesa/installdir" -D platforms=x11 \
-D vulkan-drivers=broadcom \
-D gallium-drivers=kmsro,v3d,vc4 builddir
$ cd builddir
$ ninja
$ cd ..
$ ninja -C builddir install
In order to cross-compile for the Raspberry Pi, you need to clone the target sysroot to the host.
For it, we are going to use rsync
, so the host and the target need to be connected through a network.
$ sudo apt update
$ sudo apt dist-upgrade
As I said before, we will be using the rsync
command to sync files between the host and the Raspberry Pi.
For some of these files, root rights are required internally, so let’s enable rsync with elevated rights.
$ echo "$USER ALL=NOPASSWD:$(which rsync)" | sudo tee --append /etc/sudoers
Some symbolic links are needed to make the toolchain work properly, so to create all the required symbolic links reliably, this bash script is needed.
$ wget https://raw.githubusercontent.com/abhiTronix/raspberry-pi-cross-compilers/master/utils/SSymlinker
Once it is downloaded, you just need to make it executable, and then run it for each path needed.
$ sudo chmod +x SSymlinker
$ ./SSymlinker -s /usr/include/aarch64-linux-gnu/asm -d /usr/include
$ ./SSymlinker -s /usr/include/aarch64-linux-gnu/gnu -d /usr/include
$ ./SSymlinker -s /usr/include/aarch64-linux-gnu/bits -d /usr/include
$ ./SSymlinker -s /usr/include/aarch64-linux-gnu/sys -d /usr/include
$ ./SSymlinker -s /usr/include/aarch64-linux-gnu/openssl -d /usr/include
$ ./SSymlinker -s /usr/lib/aarch64-linux-gnu/crtn.o -d /usr/lib/crtn.o
$ ./SSymlinker -s /usr/lib/aarch64-linux-gnu/crt1.o -d /usr/lib/crt1.o
$ ./SSymlinker -s /usr/lib/aarch64-linux-gnu/crti.o -d /usr/lib/crti.o
First, we need to create a workspace for building CTS, where the Raspberry Pi 4 sysroot is going to be built.
$ sudo mkdir ~/rpi-vk
$ sudo mkdir ~/rpi-vk/installdir
$ sudo mkdir ~/rpi-vk/tools
$ sudo mkdir ~/rpi-vk/sysroot
$ sudo mkdir ~/rpi-vk/sysroot/usr
$ sudo mkdir ~/rpi-vk/sysroot/usr/share
$ sudo chown -R 1000:1000 ~/rpi-vk
$ cd ~/rpi-vk
Now, we need to sync up our sysroot folder with the system files from the Raspberry Pi.
We will be using rsync
that let us sync files from the Raspberry Pi.
To do this, enter the following commands one by one into your terminal, remembering to replace the username and 192.168.1.47 with the username and IP address of your Raspberry Pi.
$ rsync -avz --rsync-path="sudo rsync" --delete pi@192.168.1.47:/lib sysroot
$ rsync -avz --rsync-path="sudo rsync" --delete pi@192.168.1.47:/usr/include sysroot/usr
$ rsync -avz --rsync-path="sudo rsync" --delete pi@192.168.1.47:/usr/lib sysroot/usr
$ rsync -avz --rsync-path="sudo rsync" --delete pi@192.168.1.47:/usr/share sysroot/usr
$ rsync -avz --rsync-path="sudo rsync" --delete pi@192.168.1.47:/home/${USER}/mesa/installdir installdir
The files we copied in the previous step still have symbolic links pointing to the file system on the Raspberry Pi.
So, we need to alter this, so that they become relative links from the new sysroot
directory on the host machine.
There is a Python script available online that can help us.
$ wget https://raw.githubusercontent.com/abhiTronix/rpi_rootfs/master/scripts/sysroot-relativelinks.py
Once it is downloaded, you just need to make it executable and run it.
$ sudo chmod +x sysroot-relativelinks.py
$ ./sysroot-relativelinks.py sysroot
As Raspbian OS 64-bits uses GCC 10.2.0, let’s install the proper cross-compiler toolchain on our host machine. I’m using the toolchain provided by abhiTronix/raspberry-pi-cross-compilers, but there are many other around the web that you can use.
We are going to use the tools
folder to setup our toolchain.
$ cd ~/rpi-vk/tools
$ wget https://sourceforge.net/projects/raspberry-pi-cross-compilers/files/Bonus%20Raspberry%20Pi%20GCC%2064-Bit%20Toolchains/Raspberry%20Pi%20GCC%2064-Bit%20Cross-Compiler%20Toolchains/Bullseye/GCC%2010.2.0/cross-gcc-10.2.0-pi_64.tar.gz/download
$ tar xvf download
$ rm download
If you run all the steps from this tutorial except this one, you will still get some weird Wayland-related errors when cross-compiling. This happens because the wayland-scanner version on your host is probably different from the wayland-scanner version on the target. For example, on Fedora 37 the wayland-scanner version is 1.21.0, while the version on the Raspberry Pi 4 is 1.18.0.
In order to build Wayland, you will need the following dependencies:
$ sudo dnf install expat-devel xmlto
So, let’s install the proper Wayland version on our sysroot.
$ wget https://wayland.freedesktop.org/releases/wayland-1.18.0.tar.xz
$ tar xvf wayland-1.18.0.tar.xz
$ cd wayland-1.18.0
$ meson --prefix ~/rpi-vk/sysroot/usr build
$ ninja -C build install
Now that we have the whole Raspberry Pi environment set up, we just need to create a toolchain file for CMake and it’s all set! So, let’s clone the CTS repository.
$ git clone https://github.com/KhronosGroup/VK-GL-CTS
$ cd VK-GL-CTS
To build dEQP, you need first to download sources for zlib, libpng, jsoncpp, glslang, vulkan-docs, spirv-headers, and spirv-tools. To download sources, run:
$ python3 external/fetch_sources.py
Inside the CTS directory, we are going to create a toolchain file called cross_compiling.cmake
with the following contents:
set(CMAKE_VERBOSE_MAKEFILE ON)
set(CMAKE_SYSTEM_NAME Linux)
set(CMAKE_SYSTEM_VERSION 1)
set(CMAKE_SYSTEM_PROCESSOR aarch64)
# Check if the sysroot and toolchain paths are correct
set(tools /home/${USER}/rpi-vk/tools/cross-pi-gcc-10.2.0-64)
set(rootfs_dir $ENV{HOME}/rpi-vk/sysroot)
set(CMAKE_FIND_ROOT_PATH ${rootfs_dir})
set(CMAKE_SYSROOT ${rootfs_dir})
set(ENV{PKG_CONFIG_PATH} "")
set(ENV{PKG_CONFIG_LIBDIR} "${CMAKE_SYSROOT}/usr/lib/pkgconfig:${CMAKE_SYSROOT}/usr/share/pkgconfig")
set(ENV{PKG_CONFIG_SYSROOT_DIR} ${CMAKE_SYSROOT})
set(CMAKE_LIBRARY_ARCHITECTURE aarch64-linux-gnu)
set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -fPIC -Wl,-rpath-link,${CMAKE_SYSROOT}/usr/lib/${CMAKE_LIBRARY_ARCHITECTURE} -L${CMAKE_SYSROOT}/usr/lib/${CMAKE_LIBRARY_ARCHITECTURE}")
set(CMAKE_C_FLAGS "${CMAKE_CXX_FLAGS} -fPIC -Wl,-rpath-link,${CMAKE_SYSROOT}/usr/lib/${CMAKE_LIBRARY_ARCHITECTURE} -L${CMAKE_SYSROOT}/usr/lib/${CMAKE_LIBRARY_ARCHITECTURE}")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fPIC -Wl,-rpath-link,${CMAKE_SYSROOT}/usr/lib/${CMAKE_LIBRARY_ARCHITECTURE} -L${CMAKE_SYSROOT}/usr/lib/${CMAKE_LIBRARY_ARCHITECTURE}")
set(WAYLAND_SCANNER ${CMAKE_SYSROOT}/usr/bin/wayland-scanner)
## Compiler Binary
SET(BIN_PREFIX ${tools}/bin/aarch64-linux-gnu)
SET (CMAKE_C_COMPILER ${BIN_PREFIX}-gcc)
SET (CMAKE_CXX_COMPILER ${BIN_PREFIX}-g++ )
SET (CMAKE_LINKER ${BIN_PREFIX}-ld
CACHE STRING "Set the cross-compiler tool LD" FORCE)
SET (CMAKE_AR ${BIN_PREFIX}-ar
CACHE STRING "Set the cross-compiler tool AR" FORCE)
SET (CMAKE_NM ${BIN_PREFIX}-nm
CACHE STRING "Set the cross-compiler tool NM" FORCE)
SET (CMAKE_OBJCOPY ${BIN_PREFIX}-objcopy
CACHE STRING "Set the cross-compiler tool OBJCOPY" FORCE)
SET (CMAKE_OBJDUMP ${BIN_PREFIX}-objdump
CACHE STRING "Set the cross-compiler tool OBJDUMP" FORCE)
SET (CMAKE_RANLIB ${BIN_PREFIX}-ranlib
CACHE STRING "Set the cross-compiler tool RANLIB" FORCE)
SET (CMAKE_STRIP ${BIN_PREFIX}-strip
CACHE STRING "Set the cross-compiler tool STRIP" FORCE)
set(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)
set(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
set(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)
Note that we had to specify our toolchain and also specify the path to the wayland-scanner.
Now that we are all set, we can finally cross-compile CTS.
$ mkdir build
$ cd build
$ cmake -DCMAKE_BUILD_TYPE=Debug \
-DCMAKE_LIBRARY_PATH=/home/${USER}/rpi-vk/installdir/lib \
-DCMAKE_INCLUDE_PATH=/home/${USER}/rpi-vk/installdir/include \
-DCMAKE_GENERATOR=Ninja \
-DCMAKE_TOOLCHAIN_FILE=/home/${USER}/VK-GL-CTS/cross_compiling.cmake ..
$ ninja
Now, you can transfer the compiled files to the Raspberry Pi 4 and run CTS!
This was a fun little challenge of my CE project and it was pretty nice to learn more about CTS. Running CTS was also a great idea from Melissa as I was able to hexdump the contents of a noop job for the V3DV and fix my noop job on IGT. So, now I finally have a working noop job on IGT and you can check it here.
Also, a huge thanks to my friend Arthur Grillo for helping me with resources about cross-compiling for the Raspberry Pi.
Wow, it’s really happening :D
After some warm-up and work put in, I got accepted into the 2023 Google Summer of Code program with the X.Org organization.
The title of my project is “Increasing Code Coverage on the DRM Code”. The DRM subsystem is the standard way to interact with complex graphics devices.
It provides many global helpers for use inside the drivers. As these helpers are used by many drivers on the DRM subsystem, testing those functions for asserting that no regressions are made is crucial for kernel development.
Many unit tests were written for those helpers with the KUnit framework, but there is still much work to do. Running the Gcov code coverage analysis tool, we see that just one file has 100% code coverage. Knowing this, I will create more tests for drm_format_helper.c, the file that handles color format conversion.
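As a rough idea of the shape such tests take, here is a hedged KUnit sketch; the KUnit macros are real, but the conversion helper and expected value below are made up for illustration and are not the actual drm_format_helper API:

```c
#include <kunit/test.h>
#include <linux/types.h>

/* Hypothetical pixel conversion, defined here just so the sketch is
 * self-contained: XRGB8888 -> RGB565 by dropping the low bits. */
static u16 my_xrgb8888_to_rgb565_pixel(u32 px)
{
	return ((px >> 8) & 0xf800) | ((px >> 5) & 0x07e0) | ((px >> 3) & 0x001f);
}

static void xrgb8888_to_rgb565_case(struct kunit *test)
{
	/* Pure red (0x00ff0000) should become 0xf800 in RGB565. */
	KUNIT_EXPECT_EQ(test, my_xrgb8888_to_rgb565_pixel(0x00ff0000), 0xf800);
}

static struct kunit_case format_helper_cases[] = {
	KUNIT_CASE(xrgb8888_to_rgb565_case),
	{}
};

static struct kunit_suite format_helper_suite = {
	.name = "my-format-helper-example",
	.test_cases = format_helper_cases,
};
kunit_test_suite(format_helper_suite);
```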
Currently, the conversion functions can’t handle planar formats. Instead
of having the color information packed in a single plane, those have their
information separated into multiple planes. I intend to add support for them by modifying the drm_fb_xfrm() function.
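To give an idea of what "planar" means in practice, here is an illustrative sketch of NV12, a two-plane YCbCr format: plane 0 stores one byte of Y per pixel, and plane 1 stores interleaved Cb/Cr pairs at half the horizontal and vertical resolution (the names below are made up):

```c
#include <stdint.h>
#include <stddef.h>

struct nv12_view {
	const uint8_t *y;    /* plane 0: width x height, 1 byte per pixel    */
	const uint8_t *cbcr; /* plane 1: (width/2) x (height/2) CbCr pairs   */
	size_t y_pitch;      /* bytes per Y row    */
	size_t cbcr_pitch;   /* bytes per CbCr row */
};

/* Fetch the (Y, Cb, Cr) triple for pixel (x, y). */
static void nv12_sample(const struct nv12_view *v, unsigned int x,
			unsigned int y, uint8_t out[3])
{
	const uint8_t *c = v->cbcr + (y / 2) * v->cbcr_pitch + (x / 2) * 2;

	out[0] = v->y[y * v->y_pitch + x];
	out[1] = c[0]; /* Cb */
	out[2] = c[1]; /* Cr */
}
```

So a conversion helper for a format like this has to walk more than one source plane at once, which is what the current single-plane code paths cannot do.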
This Summer, I will be mentored by:
During this community bonding period, I’m reserving time to review patches on
the dri-devel
mailing list and read more about the DRM.
These are my reviews:
In parallel, I’m trying to add support for the NV12 format to the VKMS driver. This would be nice for testing userspace programs that contain framebuffers with video. This is turning out to be bigger than I thought, as I need to add support for planar formats there and make the format conversion in software. Stay tuned for blog posts on that too ;).
See ya! :)