planet.freedesktop.org
December 02, 2021

The Khronos submission indicating Vulkan 1.1 conformance for Turnip on the Adreno 618 GPU.

It is a great feat, especially for a driver developed without hardware documentation. And we support features well beyond the bare minimum required for conformance.

But first of all, I want to thank and congratulate everyone working on the driver: Connor Abbott, Rob Clark, Emma Anholt, Jonathan Marek, Hyunjun Ko, Samuel Iglesias. And special thanks to Samuel Iglesias and Ricardo Garcia for tirelessly improving Khronos Vulkan Conformance Tests.


At the start of the year, when I started working on Turnip, I looked at the list of failing tests and thought “It wouldn’t take a lot to fix them!”, right, sure… And so I started fixing issues alongside looking for missing features.

In June there were even more failures than there had been in January. How could that be? Of course we were adding new features, which accounted for some of them. However, even this list was likely not exhaustive, because in GitLab CI we ran only 1/3 of the Vulkan CTS suite instead of the whole thing. We didn’t have enough devices to run the full suite fast enough for it to be usable in CI, so I just ran it locally from time to time.

1/3 of the tests doesn’t sound bad, and for the most part it’s good enough, since we have a huge number of tests that look like this:

dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_clear_copy
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_clear_copy_format_list
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_clear_load
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_clear_load_format_list
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_clear_texture
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_clear_texture_format_list
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_copy_copy
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_copy_copy_format_list
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_copy_load
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_copy_load_format_list
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_copy_texture
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_copy_texture_format_list
...

Every format, every operation, etc. Tens of thousands of them.

Unfortunately, the selection of tests for a fractional run is as straightforward as possible: just every third test. That bites us when there are small sets of unique tests, like:

dEQP-VK.fragment_operations.early_fragment.no_early_fragment_tests_depth
dEQP-VK.fragment_operations.early_fragment.no_early_fragment_tests_stencil
dEQP-VK.fragment_operations.early_fragment.early_fragment_tests_depth
dEQP-VK.fragment_operations.early_fragment.early_fragment_tests_stencil
dEQP-VK.fragment_operations.early_fragment.no_early_fragment_tests_depth_no_attachment
dEQP-VK.fragment_operations.early_fragment.no_early_fragment_tests_stencil_no_attachment
dEQP-VK.fragment_operations.early_fragment.early_fragment_tests_depth_no_attachment
dEQP-VK.fragment_operations.early_fragment.early_fragment_tests_stencil_no_attachment
...

Most of them test something unique that has a much higher probability of triggering a special path in the driver than the uncountable image tests do. And they fell through the cracks. I even had to fix one test twice because CI didn’t run it.

A possible solution is to skip tests only when there is a large swath of them and run smaller groups as-is. But it’s likely more productive to just throw more hardware at the issue =).
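To make the idea concrete, here is a rough sketch of the difference between the current selection and a group-aware one (group_size(), run_test(), and the threshold are hypothetical helpers for illustration, not the actual CI tooling):

#include <stddef.h>

// Hypothetical helpers, for illustration only.
extern size_t group_size(const char *test_name);
extern void run_test(const char *test_name);
#define GROUP_THRESHOLD 1000

// Current behavior: every third test, regardless of grouping.
static void run_naive_shard(const char **tests, size_t n, size_t shard)
{
    for (size_t i = shard; i < n; i += 3)
        run_test(tests[i]);
}

// Group-aware sketch: thin out only large homogeneous groups (like the
// tens of thousands of dEQP-VK.image.mutable.* cases) and run small
// groups of unique tests in full.
static void run_group_aware_shard(const char **tests, size_t n, size_t shard)
{
    for (size_t i = 0; i < n; i++) {
        if (group_size(tests[i]) < GROUP_THRESHOLD || i % 3 == shard)
            run_test(tests[i]);
    }
}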

Not enough hardware in CI

Another trouble is that we had only one 6xx sub-generation present in CI: the Adreno 630. We distinguish four sub-generations. Not only do they have somewhat different capabilities, there are also differences in how the shared capabilities behave, so the same test may pass in CI yet be broken on another, newer GPU. Presently in CI we test only the Adreno 618 and 630, which are “Gen 1” GPUs, and we claimed conformance only for the Adreno 618.

Yet another issue is that we can render in two modes: tiling and bypass (sysmem). That’s because there are a few features we can support only when tiling is off and we render directly into sysmem, and because rendering directly into sysmem is sometimes just faster. At the moment we use tiled rendering by default unless we hit an edge case, so by default CTS exercises only tiled rendering.

We force sysmem mode for a subset of tests in CI; however, that’s not enough, because the difference between the modes is relevant for far more than a few tests. Ideally we should run twice as many tests, and better still thrice as many, to also account for tiling mode without a binning vertex shader.

That issue became apparent when I implemented a magical eight-ball that chooses between tiling and bypass modes based on run-time information in order to squeeze out more performance (it’s still work-in-progress). The basic idea is that a single draw call, or a few small ones, is faster to render directly into system memory than it is to load the framebuffer into tile memory and store it back out. But almost every CTS test does exactly this: one or a few draw calls per render pass, which causes all tests to run in bypass mode. Fun!
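In pseudo-driver terms, the heuristic boils down to something like this (the names and thresholds are made up for illustration; this is not the actual Turnip code):

#include <stdbool.h>
#include <stdint.h>

#define FEW_DRAWS_THRESHOLD 5  // made-up cutoff

struct render_pass_state {
    uint32_t draw_count;
    uint64_t drawn_pixel_estimate;
    uint64_t fb_pixels;  // framebuffer width * height
};

// Prefer direct (sysmem) rendering when a render pass has only a few
// small draws, since the tile load/store traffic would then cost more
// than tiling saves.
static bool use_sysmem_rendering(const struct render_pass_state *rp)
{
    return rp->draw_count <= FEW_DRAWS_THRESHOLD &&
           rp->drawn_pixel_estimate < rp->fb_pixels;
}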

Now we would be forced to deal with this issue, since with the magic eight-ball games would run partly in tiling mode and partly in bypass, making both modes equally important for real-world workloads.

Does conformance matter? Does it reflect anything real-world?

Unfortunately, no test suite can wholly reflect what game developers do in their games. However, the number of tests keeps growing, and new tests are being contributed based on issues found in games and other applications.

When I ran my stash of D3D11 game traces through DXVK on Turnip for the first time, I found a bunch of new crashes and hangs, but it took fixing just a few of them for the majority of games to render correctly. This shows that the Khronos Vulkan Conformance Tests are doing their job, and we at Igalia are striving to make them even better.

One of the extensions released as part of Vulkan 1.2.199 was VK_EXT_image_view_min_lod. I’m happy to see it published, as I participated in the release process of this extension: from reviewing the spec exhaustively (I even contributed a few things to improve it!) to developing the CTS tests for it that will eventually be merged into the CTS repo.

This extension was proposed by Valve to mirror a feature present in Direct3D 12 (check ResourceMinLODClamp here) and Direct3D 11 (check SetResourceMinLOD here). In other words, this extension allows clamping the minimum LOD value accessed by an image view to a minLod value set at image view creation time.

That way, any library or API layer that translates Direct3D 11/12 calls to Vulkan can use the extension to mirror the behavior above on Vulkan directly without workarounds, facilitating the port of Direct3D applications such as games to Vulkan. For example, projects like Vkd3d, Vkd3d-proton and DXVK could benefit from it.

Going into more detail, this extension changes how the image level is selected and, when the feature is enabled, imposes an additional minimum on the image level used for integer texel coordinate operations.

The way to use this feature in an application is very simple:

  • Check that the extension is supported and that the physical device supports the respective feature:
// Provided by VK_EXT_image_view_min_lod
typedef struct VkPhysicalDeviceImageViewMinLodFeaturesEXT {
    VkStructureType    sType;
    void*              pNext;
    VkBool32           minLod;
} VkPhysicalDeviceImageViewMinLodFeaturesEXT;
  • Once you know everything is working, enable both the extension and the feature when creating the device.

  • When you want to create a VkImageView that defines a minLod for image accesses, add the following structure, filled with the value you want, to VkImageViewCreateInfo’s pNext (a combined sketch follows below).

// Provided by VK_EXT_image_view_min_lod
typedef struct VkImageViewMinLodCreateInfoEXT {
    VkStructureType    sType;
    const void*        pNext;
    float              minLod;
} VkImageViewMinLodCreateInfoEXT;
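Putting both steps together, a minimal sketch of the whole flow might look like this (physical_device, device, and image are assumed to exist already; this is illustrative rather than copied from a real application):

// Query the feature...
VkPhysicalDeviceImageViewMinLodFeaturesEXT min_lod_features = {
    .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_IMAGE_VIEW_MIN_LOD_FEATURES_EXT,
};
VkPhysicalDeviceFeatures2 features2 = {
    .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FEATURES_2,
    .pNext = &min_lod_features,
};
vkGetPhysicalDeviceFeatures2(physical_device, &features2);

// ...and, if it is supported (and the extension and feature were
// enabled at device creation), clamp accesses through this view so
// nothing below mip level 2 is ever sampled.
if (min_lod_features.minLod) {
    VkImageViewMinLodCreateInfoEXT min_lod_info = {
        .sType = VK_STRUCTURE_TYPE_IMAGE_VIEW_MIN_LOD_CREATE_INFO_EXT,
        .minLod = 2.0f,
    };
    VkImageViewCreateInfo view_info = {
        .sType = VK_STRUCTURE_TYPE_IMAGE_VIEW_CREATE_INFO,
        .pNext = &min_lod_info,
        .image = image,
        .viewType = VK_IMAGE_VIEW_TYPE_2D,
        .format = VK_FORMAT_R8G8B8A8_UNORM,
        .subresourceRange = {
            .aspectMask = VK_IMAGE_ASPECT_COLOR_BIT,
            .baseMipLevel = 0,
            .levelCount = VK_REMAINING_MIP_LEVELS,
            .baseArrayLayer = 0,
            .layerCount = 1,
        },
    };
    VkImageView view;
    vkCreateImageView(device, &view_info, NULL, &view);
}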

And that’s all! As you can see, it is a very simple extension.

Happy hacking!

November 24, 2021

I was interested in how much work a vaapi-on-top-of-vulkan-video proof of concept would take.

My main reason for being interested is actually video encoding: there is no good vulkan video encoding demo yet, and I'm not experienced enough in the area to write one, but I can hack stuff. I think it is probably easier to hack vaapi encode into vulkan video encode than to write a demo app myself.

With that in mind I decided to see what decode would look like first. I talked to Mike B (most famous zink author) before he left for holidays, then I ignored everything he told me and wrote a super hack.

This morning I convinced zink vaapi on top of anv, with iris GL doing the presents in mpv, to show me some useful frames of video. However zink vaapi on anv with zink GL is failing miserably (well, green jellyfish).

I'm not sure how much more I'll push on the decode side at this stage, I really wanted it to validate the driver side code, and I've found a few bugs in there already.

The WIP hacks are at [1]. I might push on to the encode side and see if I can work out what it entails, though the encode spec work is a lot more changeable at the moment.

[1] https://gitlab.freedesktop.org/airlied/mesa/-/commits/zink-video-wip

November 19, 2021

Last Post Of The Year

Yes, we’ve finally reached that time. It’s mid-November, and I’ve been storing up all this random stuff to unveil now because I’m taking the rest of the year off.

This will be the final SGC post for 2021. As such, it has to be a good one, doesn’t it?

Zink Roundup

It’s been a wild year for zink. Does anybody even remember how many times I finished the project? I don’t, but it’s been at least a couple. Somehow there’s still more to do though.

I’ll be updating zink-wip one final time later today with the latest Copper snapshot. This is going to be crashier than the usual zink-wip, but that’s because zink-wip doesn’t have nearly as much cool future-zink stuff as it used to these days. Nearly everything is already merged into mainline, or at least everything that’s likely to help with general use, so just use that if you aren’t specifically trying to test out Copper.

One of those things that’s been a thorn in zink’s side for a long time is PBO handling, specifically for unsupported formats like ARGB/ABGR, ALPHA, LUMINANCE, and InTeNsItY. Vulkan has no analogs for any of these, and any app/game which tries to do texture upload or download from them with zink is going to have a very, very bad time, as has been the case with CS:GO, which would take literal days to reach menus due to performing fullscreen GL_LUMINANCE texture downloads.

This is now fixed in the course of landing compute PBO download support, which I blogged about forever ago since it also yields a 2-10x performance improvement for a number of other cases in all Gallium drivers. Or at least the ones that enable it.

CS:GO should now run out of the box in Mesa 22.0, and things like RPCS3 which do a lot of PBO downloading should also see huge improvements.

That’s all I’ve got here for zink, so now it’s time once again…

THIS IS NO LONGER A ZINK BLOG

That’s right, it’s happening. Change your hats, we’re a Gallium blog again for the first time in nearly five months.

Everyone remembers when I promised that you’d be able to run native Linux D3D9 games on the Nine state tracker. Well, I suppose that’s a fancy way of saying Source Engine games, aka the ones Valve ships with native Linux ports, since probably nobody else has shipped any kind of native Linux app that uses the D3D9 API, but still!

That time is now.

Right now.

No more waiting, no new Mesa release required, you can just plug it in and test it out this second for instantly improved performance.

As long as you first acknowledge that this is not a Valve-official project, and it’s only to be used for educational purposes.

But also, please benchmark it lots and tell me your findings. Again, just for educational purposes. Wink.

How?

This has been a long time in the making. After the original post, I knew that the goal here was to eventually be able to run these games without needing any kind of specialized Mesa build, since that’s annoying and also breaks compatibility with running Nine for other purposes.

Thus I enlisted the further help of Nine expert and image enthusiast, Axel Davy, to help smooth out the rough edges once I was done fingerpainting my way to victory.

The result is a simple wrapper which can be preloaded to run any DXVK-compatible (i.e., any of them that support -vulkan) Source Engine game on Nine—and obviously this won’t work on NVIDIA blob at all so don’t bother trying.

In short:

  • clone that repo
  • right click on Properties for e.g., Left 4 Dead 2
  • change the command line to LD_PRELOAD=/path/to/Xnine/nine_sdl.so %command% -vulkan

For Portal 2 (at present, though this won’t always be the case), you’ll also need to add NINE_VHACKS=1 to work around some frogs that were accidentally added to the latest version of the game as a developer-only easter egg.

Then just run the game normally, and if everything went right and you have Nine installed in one of the usual places, you should load up the game with Gallium Nine. More details on that in the repo’s README.

GPU Goes Brrr?

Yes. Very brrr.

Here’s your normal GL performance from a simple Portal 2 benchmark:

p2-togl.png

Around 400 FPS.

Here’s Gallium Nine:

p2-nine.png

Around 600 FPS.

A 50% improvement with the exact same backend GPU driver isn’t too bad for a simple preload shim.

Can I Get A Side Of SHOTS FIRED With That?

You got it.

What about DXVK?

This isn’t an extensive benchmark, but here we go with that too:

p2-dxvk.png

Also around 600 FPS.

I say “around” here because the variation is quite extreme for both Nine and DXVK based on slight changes in variable clock speeds because I didn’t pin them: Nine ranges between 590-610 FPS, and DXVK is 590-620 FPS.

So now there’s two solid, open source methods for improving performance in these games over the normal GL version. But what if we go even deeper?

What if we check out some real performance numbers?

Power Consumption

If you’ve never checked out PowerTOP, it’s a nice way to get an overview of what’s using up system resources and consuming power.

If you’ve never used it for benchmarking, don’t worry, I took care of that too.

Here are some PowerTOP figures for the same Portal 2 timedemo:

What’s interesting here is that DXVK uses 90%+ CPU, while Nine is only using about 25%. This is a gap that’s consistent across runs, and it likely explains why a number of people find that DXVK doesn’t work on their systems: you still need some amount of CPU to run the actual game calculations, so if you’re on older hardware, you might end up using all of your available CPU just on DXVK internals.

GPU Usage?

Got you covered. Here’s a per-second poll (one row per second) from radeontop.

DXVK:

GPU VGT TA SX SH SPI SC PA DB CB usage (%), then VRAM (% mb), GTT (% mb), Memory Clock (% ghz), Shader Clock (% ghz)
35.83% 17.50% 23.33% 28.33% 17.50% 29.17% 28.33% 5.00% 27.50% 26.67% 12.75% 1038.15mb 7.82% 638.53mb 52.19% 0.457ghz 33.52% 0.704ghz
35.83% 17.50% 23.33% 28.33% 17.50% 29.17% 28.33% 5.00% 27.50% 26.67% 12.75% 1038.15mb 7.82% 638.53mb 52.19% 0.457ghz 33.52% 0.704ghz
36.67% 30.00% 33.33% 35.00% 30.00% 35.00% 32.50% 18.33% 30.83% 28.33% 12.76% 1038.57mb 7.82% 638.53mb 48.88% 0.428ghz 36.95% 0.776ghz
75.83% 63.33% 62.50% 66.67% 63.33% 68.33% 65.00% 27.50% 60.83% 53.33% 12.76% 1038.73mb 7.82% 638.53mb 100.00% 0.875ghz 95.82% 2.012ghz
71.67% 60.00% 60.00% 64.17% 60.00% 66.67% 60.83% 23.33% 56.67% 51.67% 12.76% 1038.73mb 7.82% 638.53mb 100.00% 0.875ghz 96.31% 2.023ghz
75.00% 62.50% 66.67% 66.67% 62.50% 69.17% 68.33% 23.33% 65.83% 59.17% 12.76% 1038.73mb 7.82% 638.53mb 100.00% 0.875ghz 96.71% 2.031ghz
63.33% 55.00% 56.67% 58.33% 55.00% 59.17% 59.17% 17.50% 52.50% 50.00% 12.76% 1038.73mb 7.82% 638.53mb 100.00% 0.875ghz 89.77% 1.885ghz
78.33% 64.17% 64.17% 65.00% 64.17% 69.17% 70.83% 30.00% 63.33% 58.33% 12.76% 1038.73mb 7.82% 638.53mb 100.00% 0.875ghz 97.33% 2.044ghz
73.33% 60.83% 64.17% 65.00% 60.83% 67.50% 64.17% 29.17% 59.17% 51.67% 12.76% 1038.73mb 7.82% 638.53mb 100.00% 0.875ghz 97.39% 2.045ghz
60.83% 50.83% 50.00% 53.33% 50.83% 55.00% 50.83% 25.83% 48.33% 45.00% 12.76% 1038.73mb 7.82% 638.53mb 100.00% 0.875ghz 95.35% 2.002ghz
67.50% 50.00% 55.00% 59.17% 50.00% 60.00% 58.33% 28.33% 52.50% 45.00% 12.76% 1038.73mb 7.82% 638.53mb 100.00% 0.875ghz 87.91% 1.846ghz

Nine:

GPU VGT TA SX SH SPI SC PA DB CB usage (%), then VRAM (% mb), GTT (% mb), Memory Clock (% ghz), Shader Clock (% ghz)
17.50% 11.67% 15.00% 10.83% 11.67% 15.00% 10.83% 3.33% 10.83% 10.00% 7.38% 600.56mb 1.60% 130.48mb 50.38% 0.441ghz 15.76% 0.331ghz
17.50% 11.67% 15.00% 10.83% 11.67% 15.00% 10.83% 3.33% 10.83% 10.00% 7.38% 600.56mb 1.60% 130.48mb 50.38% 0.441ghz 15.76% 0.331ghz
70.83% 63.33% 65.83% 60.00% 63.33% 68.33% 57.50% 24.17% 56.67% 54.17% 7.35% 598.43mb 1.60% 130.48mb 89.50% 0.783ghz 77.09% 1.619ghz
74.17% 70.00% 67.50% 60.00% 70.00% 70.83% 61.67% 17.50% 60.83% 58.33% 7.35% 598.42mb 1.60% 130.47mb 100.00% 0.875ghz 91.03% 1.912ghz
78.33% 69.17% 72.50% 65.00% 69.17% 75.83% 65.83% 15.00% 65.83% 64.17% 7.37% 599.80mb 1.60% 130.47mb 100.00% 0.875ghz 93.92% 1.972ghz
70.83% 67.50% 64.17% 55.00% 67.50% 67.50% 57.50% 20.83% 55.83% 53.33% 7.35% 598.42mb 1.60% 130.47mb 100.00% 0.875ghz 91.93% 1.930ghz
65.00% 64.17% 60.00% 51.67% 64.17% 61.67% 53.33% 18.33% 52.50% 50.83% 7.37% 599.80mb 1.60% 130.47mb 100.00% 0.875ghz 89.95% 1.889ghz
74.17% 68.33% 70.00% 60.83% 68.33% 72.50% 65.00% 24.17% 64.17% 58.33% 7.35% 598.42mb 1.60% 130.47mb 100.00% 0.875ghz 92.53% 1.943ghz
77.50% 73.33% 73.33% 62.50% 73.33% 75.00% 61.67% 22.50% 62.50% 57.50% 7.35% 598.42mb 1.60% 130.47mb 100.00% 0.875ghz 91.21% 1.915ghz
70.00% 65.83% 60.00% 57.50% 65.83% 61.67% 59.17% 24.17% 55.00% 54.17% 7.35% 598.42mb 1.60% 130.47mb 100.00% 0.875ghz 92.69% 1.946ghz
70.00% 65.83% 60.00% 57.50% 65.83% 61.67% 59.17% 24.17% 55.00% 54.17% 7.35% 598.42mb 1.60% 130.47mb 100.00% 0.875ghz 92.69% 1.946ghz

Again, here we see a number of interesting things. DXVK consistently provokes slightly higher clock speeds (because I didn’t pin them), which may explain why it skews slightly higher in the benchmark results. DXVK also uses nearly 2x more VRAM and nearly 5x more GTT. On more modern hardware it’s unlikely that this would matter since we all have more GPU memory than we can possibly use in an OpenGL game, but on older hardware—or in cases where memory usage might lead to power consumption that should be avoided because we’re running on battery—this could end up being significant.

Conclusion

Source Engine games run great on Linux. That’s what we all care about at the end of the day, isn’t it?

But also, if more Source Engine games get ported to DXVK, give them a try with Nine. Or just test the currently ported ones, Portal 2 and Left 4 Dead 2.

I want data.

Lots of data.

Post it here, email it to me, whatever.

Until 2022

Lots of cool projects still in the works, so stay tuned next year!

November 18, 2021

If you own a laptop (Dell, HP, Lenovo) with a WWAN module, it is very likely that the modules are FCC-locked on every boot, and the special FCC unlock procedure needs to be run before they can be used.

Until ModemManager 1.18.2, the FCC unlock procedures we knew about were run automatically, but this will no longer happen. Once 1.18.4 is out, the procedure will need to be explicitly enabled by each user, under their own responsibility, or otherwise implicitly enabled after installing an official FCC unlock tool provided by the manufacturer itself.

See a full description of the rationale behind this change in the ModemManager documentation site and the suggested code changes in the gitlab merge request.

If you want to enable the ModemManager provided unofficial FCC unlock tools once you have installed 1.18.4, run (assuming sysconfdir=/etc and datadir=/usr/share) this command (*):

sudo ln -sft /etc/ModemManager/fcc-unlock.d /usr/share/ModemManager/fcc-unlock.available.d/*

The user-enabled tools in /etc should not be removed during package upgrades, so this should be a one-time setup.

(*) Updated to have one single command instead of a for loop; thanks heftig!

November 17, 2021

What If Zink Was Actually The Fastest GL Driver?

In an earlier post I talked about Copper and what it could do on the way to a zink future.

What I didn’t talk about was WSI, or the fact that I’ve already fully implemented it in the course of bashing Copper into a functional state.

Window System Integration

…was the final step for zink to become truly usable.

At present, zink has a very hacky architecture where it loads through the regular driver path, but then for every image that is presented on the screen, it keeps a shadow copy which it blits to just before scanout, and this is the one that gets displayed.

Usually this works great other than the obvious (but minor) overhead that the blit incurs.

Where it doesn’t work great, however, is on non-Mesa drivers.

That’s right. I’m looking at you, NVIDIA.

As long-time blog enthusiasts will remember, I had NVIDIA running on zink some time ago, but there was a problem as it related to performance. Specifically, that single blit turned into a blit and then a full-frame CPU copy, which made getting any sort of game running with usable FPS a bit of a challenge.

WSI solves this by letting the Vulkan driver handle the scanout image entirely, removing all the copies to let zink render more like a normal driver (or game/app).

So How Is it?

That’s what everyone’s probably wondering. I have zink. I have WSI. I have my RTX2070 with the NVIDIA blob driver.

How does NVIDIA’s Vulkan driver (with zink) stack up to NVIDIA’s GL driver?

Everything below is using the 495.44 beta driver, as that’s the latest one at the time of my testing, and the non-beta driver didn’t work at all.

But first, can NVIDIA’s GL driver even render the game I want to show?

nvtrbug.png

Confusingly, the answer is no, this version of NVIDIA’s GL driver can’t correctly render Tomb Raider, which is my go-to for all things GL and benchmarking. I’m gonna let that slide though since it’s still pumping out those frames at a solid rate.

It’s frustrating, but sometimes just passing CTS isn’t enough to be able to run some games, or there are certain extensions (ARB_bindless_texture) which are under-covered.

The Numbers Don’t Lie

I’ll say as a prelude that it was a bit challenging to get a AAA game running in this environment. There’s some very strange issues happening with the NVIDIA Vulkan driver which prevented me from running quite a lot of things. Tomb Raider was the first one I got going after two full days of hacking at it, and that’s about what my time budget allowed for the excursion, so I stopped at that.

Up first: NVIDIA’s GL driver (495.44) nvtr-gl.png

Second: NVIDIA’s Vulkan driver (495.44) nvtr.png

As we can see, zink with NVIDIA’s Vulkan driver is roughly 25-30% faster than NVIDIA’s GL driver for Tomb Raider.

In Closing

I doubt that zink maintains this performance gap for all titles, but now we know that there are already at least some cases where it can pull ahead. Given that most vendors are shifting resources towards current-year graphics APIs like Vulkan and D3D12, it won’t be surprising if maintenance-mode GL drivers start to fall behind actively developed Vulkan drivers.

In short, there’s a real possibility that zink can provide tangible benefits to vendors who only want to ship Vulkan drivers, and those benefits might be more than (eventually) providing a conformant GL implementation.

Stay tuned for tomorrow when I close out the week strong with one final announcement for the year.

November 15, 2021

Previously I mentioned having AMD VCN h264 support. Today I added initial support for the older UVD engine[1]. This is found on chips from Vega back to SI.

I've only tested it on my Vega so far.

I also worked out the "correct" answer to how to send the reset command; however, the nvidia player I'm using as a demo doesn't do things that way yet, so I've forked it for now[2].

The answer is to use vkCmdControlVideoCodingKHR to send a reset the first time a session is used. However I can't see how the app is meant to know this is necessary, but I've asked the appropriate people.
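For reference, the call is shaped roughly like this against the provisional spec (a sketch only, assuming the command buffer is already inside the video coding scope):

// Issue a session reset the first time a session is used, between
// vkCmdBeginVideoCodingKHR and the first decode.
VkVideoCodingControlInfoKHR control = {
    .sType = VK_STRUCTURE_TYPE_VIDEO_CODING_CONTROL_INFO_KHR,
    .flags = VK_VIDEO_CODING_CONTROL_RESET_BIT_KHR,
};
vkCmdControlVideoCodingKHR(cmd_buffer, &control);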

The initial anv branch I mentioned last week is now here[3].


[1] https://gitlab.freedesktop.org/airlied/mesa/-/commits/radv-vulkan-video-uvd-h264

[2] https://github.com/airlied/vk_video_samples/tree/radv-fixes

[3] https://gitlab.freedesktop.org/airlied/mesa/-/tree/anv-vulkan-video-prelim-decode

Copper: It’s A Thing (Sort of)

Over the past months, I’ve been working with Adam “X Does What I Say” Jackson to try and improve zink’s path through the arcane system of chutes and ladders that comprises Gallium’s loader architecture. The recent victory in getting a Wayland system compositor running is the culmination of those efforts.

I wanted to write at least a short blog post detailing some of the Gallium changes that were necessary to make this happen, if only so I have something to refer back to when I inevitably break things later, so let’s dig in.

Pipes: How Do They Work?

It’s questionable to me whether anyone really knows how all the Gallium loader and DRI frontend interfaces work without first taking a deep read of the code and then having a nice, refreshing run around the block screaming to let all the crazy out. From what I understand of it, there’s the DRI (userspace) interface, which is used by EGL/GBM/GLX/SMH to manage buffers for scanout. DRI itself is split between software and platform; each DRI interface is a composite made of all the “extensions” which provide additional functionality to enable various API-level extensions.

It’s a real disaster to have to work with, and ideally the eventual removal of classic drivers will allow it to be simplified so that mere humans like me can comprehend its majesty.

Beyond all this, however, there’s the notion that the DRI frontend is responsible for determining the size of the scanout buffer as well as various other attributes. The software path through this is nice and simple since there’s no hardware to negotiate with, and the platform path exists.

Currently, zink runs on the platform path, which means that the DRI frontend is what “runs” zink. It chooses the framebuffer size, manages resizes, and handles multisample resolve blits as needed for every frame that gets rendered.

Too Many CooksPipes

The problem with this methodology is that there are effectively two WSI systems active simultaneously: the Mesa DRI architecture, and the (eventual) Vulkan WSI infrastructure. Vulkan WSI isn’t going to work at all if it isn’t in charge of deciding things like window size, which means that the existing DRI architecture can’t work, neither in the platform mode nor the software mode.

As we know, there can be only one.

Thus Adam has been toiling away behind the scenes, taking neither vacation nor lunch break for the past ten years in order to iterate on a more optimal solution.

The result?

Copper

If you’re a Mesa developer or just a metallurgist, you know why the name Copper was chosen.

The premise of Copper is that it’s a DRI interface extension which can be used exclusively by zink to avoid any of the problem areas previously mentioned. The application will create a window, create a GL context for it, and (eventually) Vulkan WSI can figure things out by just having the window/surface passed through. This shifts all the “driving” WSI code out of DRI and into Vulkan WSI, which is much more natural.

In addition to Copper, zink can now be bound to a slight variation of the Gallium software loader to skip all the driver querying bits. There’s no longer anything to query, as DRI doesn’t have to make decisions anymore. It just calls through to zink normally, and zink can handle everything using the Vulkan API.

Simple and clean.

Unfortunately

This all requires a ton of code. Looking at the two largest commits:

  • 29 files changed, 1413 insertions(+), 540 deletions(-)
  • 23 files changed, 834 insertions(+), 206 deletions(-)

Is a big yikes.

I can say with certainty that these improvements won’t be landing before 2022, but eventually they will in one form or another, and then zink will become significantly more flexible.

November 12, 2021

Last week I mentioned I had the basics of h264 decode using the proposed vulkan video on radv. This week I attempted to do the same thing with Intel's Mesa vulkan driver "anv".

Now, I'd previously tried and failed to get vaapi on crocus working, but got sidetracked into other projects. The Intel h264 decoder hasn't changed a lot across the ivb/hsw/gen8/gen9 era. I ported what I had from crocus to anv and started trying to get something to decode on my WhiskeyLake.

I wrote the code pretty early on, figured out all the things I had to send the hardware.

The first bridge to cross on the anv side was that Vulkan exposes an H264 picture-level decode API, which means you get handed the encoded slice data. However, to program the Intel hw you need to decode the slice header, so I wrote a slice header decoder in some common code. The other thing you need to give the Intel hw is the number of bits in the slice header, which in some encoding schemes is rounded to bytes and in some isn't. Slices also have a 3-byte header on them, which the Intel hardware wants you to discard or skip before handing the data over.
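If that 3-byte header is the usual Annex B start code (my assumption here), skipping it looks something like this sketch, not the actual anv/crocus code:

#include <stddef.h>
#include <stdint.h>

// Skip the 3-byte start code (00 00 01) so the hardware only sees
// the slice data it expects.
static void skip_start_code(const uint8_t **data, size_t *size)
{
    if (*size >= 3 &&
        (*data)[0] == 0x00 && (*data)[1] == 0x00 && (*data)[2] == 0x01) {
        *data += 3;
        *size -= 3;
    }
}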

Once I'd fixed up that sort of thing in anv + crocus, I started getting grey I-frames decoded, with later B/P frames using the grey frames as references, so you'd see this kinda weird motion.

That was, I think, 3 days ago. I have stared at this intently for those 3 days, blaming everything from the bitstream encoding onward and rechecking all my packets (not enough times, though). I had someone else verify they could see grey frames.

Today, after a long discussion about possibilities, I was randomly comparing a frame from the intel-vaapi-driver with one from crocus, and I spotted a packet header that the docs say is 34 dwords long but that intel-vaapi was only encoding as 16 dwords. I switched crocus to explicitly state a 16-dword length, and I started seeing my I-frames.

Now the B/P frames still have issues. I don't think I'm getting the ref frames logic right yet, but it felt like a decent win after 3 days of staring at it.

The crocus code is [1]. The anv code isn't cleaned up enough to post a pointer to yet; enterprising people might find it. Next week I'll clean it all up, and then start to ponder upstream paths and shared code for radv + anv. Then h265, maybe.

[1]https://gitlab.freedesktop.org/airlied/mesa/-/tree/crocus-media-wip

A Long Time Coming

Zink can now run all display platform flavors of Weston (and possibly other compositors?). Expect it in zink-wip later today once it passes another round of my local CI.

Here it is in DRM running weston-simple-egl and weston-simple-dmabuf-egl all on zink:

wayland-screenshot-2021-11-12_13-25-25.png

Under Construction

This has a lot of rough edges, mostly as it relates to X11. In particular:

  • xservers (including xwayland) can’t run because GLAMOR is hard
  • some apps (e.g., Unigine Heaven) randomly get killed by the xserver for unknown reasons
  • if you’re very lucky, you can hit a Vulkan WSI deadlock

How?

I’d go into details on this, but honestly it’s going to be like a week of posts to detail the sheer amount of chainsawing that’s gone into the project.

Stay tuned for that and more next week.

November 11, 2021

Everyone Knows…

That the one true benchmark for graphics is glxgears. It’s been the standard for 20+ years, and it’s going to remain the standard for a long time to come.

Gears Through Time

Zink has gone through a couple phases of glxgears performance.

Everyone remembers weird glxgears that was getting illegal amounts of frames due to its misrendering:

glxgears.png

We salute you, old friend.

Now, however, some number of you have become aware of the new threat posed by heavy gears in the Mesa 21.3 release. Whereas glxgears is usually a lightweight, performant benchmarking tool, heavy gears is the opposite, chugging away at up to 20% of a single CPU core with none of the accompanying performance.

sadgears.png

Terrifying.

What Creates Such A Monster?

The answer won’t surprise you: GL_QUADS.

Indeed, because zink is a driver relying on the Vulkan API, only the primitive types supported by Vulkan can be directly drawn. This means any app using GL_QUADS is going to have a very bad time.

glxgears is exactly one of these apps, and (now that there’s a ticket open) I was forced to take action.

Transquadmation

The root of the problem here is that gears passes its vertices into GL to be drawn as a rectangle, but zink can only draw triangles. This (currently) results in doing a very non-performant readback of the index buffer before every draw call to convert the draw to a triangle-based one.
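For context, the conversion itself is trivial; a quad’s four vertices become two triangles, as in this hypothetical helper (not zink’s actual code):

#include <stdint.h>

// Expand one quad (v0 v1 v2 v3) into the two triangles
// (v0, v1, v2) and (v0, v2, v3).
static void emit_quad_as_tris(uint32_t base, uint32_t *out, uint32_t *count)
{
    static const uint32_t order[6] = { 0, 1, 2, 0, 2, 3 };
    for (unsigned i = 0; i < 6; i++)
        out[(*count)++] = base + order[i];
}

The expensive part isn’t the math, it’s doing it from a readback of the index buffer after the fact.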

A smart person might say “Why not just convert the vertices to triangles as you get them instead of waiting until they’re in the buffer?”

Thankfully, a smart person did say that and then do the accompanying work. The result is that finally, after all these years, zink can actually perform well in a real benchmark:

happygears.png

Stay Tuned

For more exciting zink updates. You won’t want to miss them.

November 10, 2021

UberCTS

In the course of working on more CI-related things for zink, I came across a series of troublesome tests (KHR-GL46.geometry_shader.rendering.rendering.triangles_*) that triggered a severe performance issue. Specifically, the LLVM optimizer spends absolute ages trying to optimize ubershaders like this one used in the tests:

#version 440

in  vec4 position;
out vec4 vs_gs_color;

uniform bool  is_lines_output;
uniform bool  is_indexed_draw_call;
uniform bool  is_instanced_draw_call;
uniform bool  is_points_output;
uniform bool  is_triangle_fan_input;
uniform bool  is_triangle_strip_input;
uniform bool  is_triangle_strip_adjacency_input;
uniform bool  is_triangles_adjacency_input;
uniform bool  is_triangles_input;
uniform bool  is_triangles_output;
uniform ivec2 renderingTargetSize;
uniform ivec2 singleRenderingTargetSize;

void main()
{
    gl_Position = position + vec4(float(gl_InstanceID) ) * vec4(0, float(singleRenderingTargetSize.y) / float(renderingTargetSize.y), 0, 0) * vec4(2.0);
    vs_gs_color = vec4(1.0, 0.0, 0.0, 0.0);

    if (is_lines_output)
    {
        if (!is_indexed_draw_call)
        {
            if (is_triangle_fan_input)
            {
                switch(gl_VertexID)
                {
                   case 0:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 1:
                   case 5:  vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                   case 2:  vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   case 3:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 4:  vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
                }
            }
            else
            if (is_triangle_strip_input)
            {
                switch(gl_VertexID)
                {
                   case 1:
                   case 6: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                   case 0:
                   case 4:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 2:  vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   case 3:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 5:  vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
                }
            }
            else
            if (is_triangle_strip_adjacency_input)
            {
                switch(gl_VertexID)
                {
                   case 2:
                   case 12: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                   case 0:
                   case 8:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 4:  vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   case 6:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 10: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
                }
            }
            else
            if (is_triangles_input)
            {
                switch(gl_VertexID)
                {
                   case 0: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 1: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                   case 2: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   case 3: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 4: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   case 5: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 6: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 7: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 8: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   case 9:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 10: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   case 11: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                }
            }
            else
            if (is_triangles_adjacency_input)
            {
                vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0);

                switch(gl_VertexID)
                {
                    case 0: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 2: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                    case 4: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                    case 6: vs_gs_color  = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 8: vs_gs_color  = vec4(0.3, 0.4, 0.5, 0.6); break;
                    case 10: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                    case 12: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 14: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                    case 16: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                    case 18: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 20: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                    case 22: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                }
            }
        }
        else
        {
            if (is_triangles_input)
            {
                switch(gl_VertexID)
                {
                    case 11: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 10: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                    case 9:  vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                    case 8:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 7:  vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                    case 6:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                    case 5:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 4:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                    case 3:  vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                    case 2:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 1:  vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                    case 0:  vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                }
            }
            else
            if (is_triangle_fan_input)
            {
                switch(gl_VertexID)
                {
                   case 5:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 4:
                   case 0:  vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                   case 3:  vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   case 2:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 1:  vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
                }
            }
            else
            if (is_triangle_strip_input)
            {
                switch (gl_VertexID)
                {
                   case 5:
                   case 0: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                   case 6:
                   case 2:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 4:  vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   case 3:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 1:  vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
                }
            }
            else
            if (is_triangle_strip_adjacency_input)
            {
                switch(gl_VertexID)
                {
                   case 11:
                   case 1:  vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                   case 13:
                   case 5:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 9:  vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   case 7:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 3:  vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
                }
            }
            else
            if (is_triangles_adjacency_input)
            {
                vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0);

                switch(gl_VertexID)
                {
                    case 23: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 21: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                    case 19: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                    case 17: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 15: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                    case 13: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                    case 11: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 9:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                    case 7:  vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                    case 5: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 3: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                    case 1: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                }
            }
        }
    }
    else
    if (is_points_output)
    {
        if (!is_indexed_draw_call)
        {
            if (is_triangles_adjacency_input)
            {
                vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0);

                switch (gl_VertexID)
                {
                    case 0:
                    case 6:
                    case 12:
                    case 18: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                    case 2:
                    case 22: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                    case 4:
                    case 8: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 10:
                    case 14: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                    case 16:
                    case 20: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                }
            }
            else
            if (is_triangle_fan_input)
            {
                switch(gl_VertexID)
                {
                   case 0:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 1:
                   case 5:  vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   case 2:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 3:  vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                   case 4:  vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
                }
            }
            else
            if (is_triangle_strip_input)
            {
                switch (gl_VertexID)
                {
                   case 1:
                   case 4:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 0:
                   case 6:  vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   case 2:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 3:  vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                   case 5:  vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
                }
            }
            else
            if (is_triangle_strip_adjacency_input)
            {
                switch (gl_VertexID)
                {
                   case 2:
                   case 8:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 0:
                   case 12: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   case 4:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 6:  vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                   case 10: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
                }
            }
            else
            if (is_triangles_input)
            {
                switch (gl_VertexID)
                {
                    case 0:
                    case 3:
                    case 6:
                    case 9: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                    case 1:
                    case 11: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                    case 2:
                    case 4: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 5:
                    case 7: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                    case 8:
                    case 10: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                }
            }
        }
        else
        {
            if (is_triangle_fan_input)
            {
                switch (gl_VertexID)
                {
                   case 5:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 4:
                   case 0:  vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   case 3:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 2:  vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                   case 1:  vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
                }
            }
            else
            if (is_triangle_strip_input)
            {
                switch (gl_VertexID)
                {
                   case 5:
                   case 2:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 6:
                   case 0:  vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   case 4:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 3:  vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                   case 1:  vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
                }
            }
            else
            if (is_triangle_strip_adjacency_input)
            {
                switch (gl_VertexID)
                {
                   case 11:
                   case 5:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 13:
                   case 1:  vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   case 9:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 7:  vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                   case 3:  vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
                }
            }
            else
            if (is_triangles_adjacency_input)
            {
                vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0);
                switch (gl_VertexID)
                {
                    case 23:
                    case 17:
                    case 11:
                    case 5: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                    case 21:
                    case 1: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                    case 19:
                    case 15: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 13:
                    case 9: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                    case 7:
                    case 3: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                }
            }
            else
            if (is_triangles_input)
            {
                switch (gl_VertexID)
                {
                    case 11:
                    case 8:
                    case 5:
                    case 2: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                    case 10:
                    case 0: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                    case 9:
                    case 7: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 6:
                    case 4: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                    case 3:
                    case 1: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                }
            }
        }
    }
    else
    if (is_triangles_output)
    {
        int vertex_id = 0;

        if (!is_indexed_draw_call && is_triangles_adjacency_input && (gl_VertexID % 2 == 0) )
        {
            vertex_id = gl_VertexID / 2 + 1;
        }
        else
        {
            vertex_id = gl_VertexID + 1;
        }

        vs_gs_color = vec4(float(vertex_id) / 48.0, float(vertex_id % 3) / 2.0, float(vertex_id % 4) / 3.0, float(vertex_id % 5) / 4.0);
    }
}

By ages I mean upwards of 10 minutes per test.

Yikes.

When In Doubt, Inline

Fortunately, zink already has tools to combat exactly this problem: ZINK_INLINE_UNIFORMS.

This feature analyzes shaders to determine if inlining uniform values will be beneficial, and if so, it rewrites the shader with the uniform values as constants rather than loads. This brings the resulting NIR for the shader from 4000+ lines down to just under 300. The tests all become near-instant to run as well.
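As a toy illustration of the shape of the idea (this is not the actual NIR pass):

#include <stdbool.h>

// Toy IR: a "shader" is an array of instructions. Uniform loads whose
// values are known at draw time are rewritten into constants, after
// which branches like `if (is_lines_output)` can be folded away.
enum opcode { OP_LOAD_UNIFORM, OP_CONST /* , ... */ };

struct instr {
    enum opcode op;
    int slot;   // uniform slot, for OP_LOAD_UNIFORM
    int value;  // constant value, for OP_CONST
};

static void inline_uniforms(struct instr *ir, int n,
                            const int *known_values, const bool *known)
{
    for (int i = 0; i < n; i++) {
        if (ir[i].op == OP_LOAD_UNIFORM && known[ir[i].slot]) {
            ir[i].op = OP_CONST;
            ir[i].value = known_values[ir[i].slot];
        }
    }
}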

Uniform inlining has been in zink for a while, but it’s been disabled by default (except on zink-wip for testing) as this isn’t a feature that’s typically desirable when running apps/games; every time the uniforms are updated, a new shader must be compiled, and this causes (even more) stuttering, making games on zink (even more) unplayable.

On CPU-based drivers like lavapipe, however, the time to compile a shader is usually less than the time to actually run a shader, so the trade-off becomes worth doing.

Stay tuned for exciting announcements in the next few days.

November 05, 2021

A few weeks ago I watched Victor's excellent talk on Vulkan Video. This made me question my skills in this area. I'm pretty vague on video processing hardware; I really have no understanding of H264 or any of the standards. I've been loosely following the Vulkan video group inside of Khronos, but I can't say I've understood it or been useful.

radeonsi has a gallium vaapi driver that talks to the firmware encoder/decoder on the hardware; surely copying what it programs can't be that hard. I got an mpv/vaapi setup running and tested some videos on that setup just to get comfortable. I looked at what sort of data was being pushed about.

The thing is, the firmware is doing all the work here; the driver is mostly just responsible for taking semi-parsed h264 bitstream data structures and handing them to the fw API in memory buffers. Then the resulting decoded image should magically appear in a buffer.

I then got the demo nvidia video decoder application mentioned in Victor's talk.

I ported the code to radv in a couple of days, but then began a long journey into the unknown. The firmware is quite expectant on exactly what it wants and when it wants it. After fixing some interactions with the video player, I started to dig.

Now, vaapi and DXVA (Windows) are context-based APIs. This means they are like OpenGL: you create a context, do a bunch of work, and tear it down; the driver does all the hw queuing of commands internally, and all the state is held in the context. Vulkan is a command-buffer-based API: the application records command buffers and then enqueues those command buffers to the hardware itself.

So the vaapi driver works like this for a video

create hw ctx, flush, decode, flush, decode, flush, decode, flush, decode, flush, destroy hw ctx, flush

However Vulkan wants things to be more like

Create Session, record command buffer with (begin, decode, end), send to hw, record (begin, decode, end), send to hw, End Session

There is no way at the Create/End session time to submit things to the hardware.
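In (provisional) API terms, the flow is shaped like this; it's a sketch only, with most required fields elided:

VkVideoBeginCodingInfoKHR begin_info = {
    .sType = VK_STRUCTURE_TYPE_VIDEO_BEGIN_CODING_INFO_KHR,
    .videoSession = session,
    // ...session parameters, reference slots...
};
vkCmdBeginVideoCodingKHR(cmd, &begin_info);

VkVideoDecodeInfoKHR decode_info = {
    .sType = VK_STRUCTURE_TYPE_VIDEO_DECODE_INFO_KHR,
    // ...bitstream buffer, output picture, references...
};
vkCmdDecodeVideoKHR(cmd, &decode_info);

VkVideoEndCodingInfoKHR end_info = {
    .sType = VK_STRUCTURE_TYPE_VIDEO_END_CODING_INFO_KHR,
};
vkCmdEndVideoCodingKHR(cmd, &end_info);

// The command buffer is then submitted like any other work; nothing
// reaches the hardware at session create/destroy time.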

After a week or two of hair removal and insightful irc chats, I stumbled over a decent enough workaround to avoid the hw dying, and managed to decode an H264 video of some jellyfish.

The work is based on a bunch of other stuff, and is in no way suitable for upstreaming yet, not to mention the Vulkan specification is only beta/provisional and so can't be used anywhere outside of development.

The preliminary code is in my gitlab repo here[1]. It has a start on h265 decode, but it's not working at all yet, and I think the h264 code is a bit hangy randomly.

I'm not sure where this is going yet, but it was definitely an interesting experiment.

[1]: https://gitlab.freedesktop.org/airlied/mesa/-/commits/radv-vulkan-video-prelim-decode

November 04, 2021

What’s Even Happening

Where does the time go?

I’m not sure, but I’m making the most of it. As we all know, the Mesa 21.3 release cycle is well underway, primarily to enable me to jam an increasingly ludicrous number of bug fixes in before the final tarball ships.

But is it possible that I’m doing other things too?

Why yes. Yes it is.

CI

We all like CI. It helps prevent us from footgunning, even when we’re totally sure that our patches are exactly what the codebase needs.

That’s why I decided to add GL4.6 CI runs over the past day or so. No more will tests be randomly fixed or failed with my commits!

Unless they’re part of the Khronos Confidential GL Test Suite, of course, but we don’t talk about that one.

Intrepid readers will note that there’s now a file in the repo which lists exactly how many failures there are on lavapipe, so now everyone knows how many thousands of tests are doing the opposite.

Rendering

A new Vulkan spec was released this week and it included something I’ve been excited about for a while: KHR_dynamic_rendering. This extension is going to let me cut down on some CPU overhead, and so some time ago I wrote the lavapipe implementation to get a feel for it.

Surprisingly, this means that lavapipe is now the only mesa driver implementing it, though I don’t expect that to be the case for long.

I’m looking forward to seeing more projects switch to this, since let’s be honest, nobody ever liked renderpasses anyway.
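To give a feel for why, here’s roughly what dynamic rendering replaces the render pass + framebuffer dance with (a minimal sketch; cmd, color_view, and extent are assumed):

VkRenderingAttachmentInfoKHR color = {
    .sType = VK_STRUCTURE_TYPE_RENDERING_ATTACHMENT_INFO_KHR,
    .imageView = color_view,
    .imageLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL,
    .loadOp = VK_ATTACHMENT_LOAD_OP_CLEAR,
    .storeOp = VK_ATTACHMENT_STORE_OP_STORE,
};
VkRenderingInfoKHR rendering_info = {
    .sType = VK_STRUCTURE_TYPE_RENDERING_INFO_KHR,
    .renderArea = { .extent = extent },
    .layerCount = 1,
    .colorAttachmentCount = 1,
    .pColorAttachments = &color,
};

// No VkRenderPass or VkFramebuffer objects required.
vkCmdBeginRenderingKHR(cmd, &rendering_info);
// ...draws...
vkCmdEndRenderingKHR(cmd);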

October 28, 2021

…For 2021

Zink is done. It’s finished. I’ve said it before, I’ve said it again after that, and I even said I meant it that one time, but now I’m serious.

Super serious.

We’re done here. There’s no need to check this blog anymore, and you don’t have to update zink ever again if you’ve pulled in the last week.

Mesa 21.3

This is it. This is the release everyone’s been waiting for.

Why is that?

Because zink is Pretty Fast™ now. And it can run lots of stuff.

Blog enthusiasts will recall that, all that time ago when zink was over, I noted a number of Big Name™ games that zink could now run, including Metro: Last Light Redux and HITMAN (Agent 47’s Deluxe Psychedelic Trip Edition).

True blog connoisseurs will recall when zink started to pull ahead of real GL drivers in features in order to run Warhammer 40k: Dawn of War.

But how many die-hard blog fans are here from the future and can remember when I posted that Bioshock Infinite now actually runs on RADV instead of hanging?

It’s hard to overstate the amount of work that’s gone into zink for this release. Over 400 patches amounted to ES 3.2, a suballocator, and a slew of random extensions to improve compatibility and performance across the board.

If you find a game or app* which doesn’t run at all on zink 21.3, play the lottery. It’s your day.

  • Except using it for your Xorg session or Wayland compositor. Don’t try this without supervision.

Bioshock Infinite Now Runs On Zink

bioshock.png

Zink+RADV: New BFFs

As part of a cross-training exercise, I’ve been hanging around with the hangmaster himself, the Baron of Black Screens, the Liege of Lost Sessions, Samuel Pitoiset. Together we’ve (but mostly him while I watch) been stamping out a colossal number of pathological hangs and crashes with zink on RADV.

At present, zink on RADV has only around 200 failures in the GL 4.6 conformance suite. It’s not quite there yet, but considering the number was well over 1000 just a week ago, there’s a lot of easy cleanup work that can be done here in both drivers to knock that down further.

Will 2022 Be The Year Of Zink Conformance?

Maybe.

What’s Next?

Bug fixes.

Lots of bug fixes.

Seriously, there’s so, so many bug fixes coming.

I have some fun surprises to unveil in the near future too.

For example: any guesses which Vulkan driver I’ll be showing zink off on next? Hint: B I G F P S.

October 20, 2021

In 2007, Jan Arne Petersen added a D-Bus API to what was still pretty much an import into gnome-control-center of the "acme" utility I had written to get all the keys on my iBook working.

It switched the code away from remapping keyboard keys to "XF86Audio*", to expecting players to contact the D-Bus daemon and ask to be forwarded key events.


Multimedia keys circa 2003

In 2013, we added support for controlling media players using MPRIS, as another interface. Fast-forward to 2021, and MPRIS support is ubiquitous, whether in free software, proprietary applications or even browsers. So we'll be parting with the "org.gnome.SettingsDaemon.MediaKeys" D-Bus API. If your application still wants to work with older versions of GNOME, it is recommended to at least handle the MediaKeys API's unavailability gracefully.


Multimedia keys in 2021

TL;DR: Remove code that relies on gnome-settings-daemon's MediaKeys API, make sure to add MPRIS support to your app.
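For the MPRIS side, a bare-bones GDBus sketch of claiming a player name follows. The "ExamplePlayer" name and the empty callback are made up for illustration; the actual org.mpris.MediaPlayer2 interfaces still have to be exported in the callback.

#include <gio/gio.h>

static void
on_bus_acquired(GDBusConnection *conn, const gchar *name, gpointer data)
{
   /* export the org.mpris.MediaPlayer2 and .Player interfaces here */
}

int main(void)
{
   /* claim a well-known MPRIS name on the session bus */
   g_bus_own_name(G_BUS_TYPE_SESSION,
                  "org.mpris.MediaPlayer2.ExamplePlayer",
                  G_BUS_NAME_OWNER_FLAGS_NONE,
                  on_bus_acquired,
                  NULL, NULL, NULL, NULL);
   g_main_loop_run(g_main_loop_new(NULL, FALSE));
   return 0;
}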

October 04, 2021

I’m Bad At Blogging

I’m responsible enough to admit that I’m bad at blogging.

I’ve said repeatedly that I’m going to blog more often, and then I go and do the complete opposite.

I don’t know why I’m like this, but here we are, and now it’s time for another blog post.

What’s Been Happening

In short: not a lot.

All the features I’ve previously blogged about have landed, and zink is once again in “release mode” until the branchpoint next week, to avoid having to rush patches in at the last second. This means there probably won’t be any interesting patches to zink at all until then.

We’re in a good spot though, and I’m pleased with the state of the driver for this release. You probably still won’t be using it to play any OpenGL games you pick up from the Winter Steam Sale, but potentially those days aren’t too far off.

With that said, I do have to actually blog about something technical for once, so let’s roll the dice and see what it’s going to be.

ARB_bindless_texture

We did it. We got a good roll.

This is actually a cool extension for an implementation deep dive because of how (relatively) simple Vulkan makes it to handle.

First, an overview: What is ARB_bindless_texture?

This is an extension used by only the most elite GL apps to enable texture streaming, namely the ability to continually add more images into the rendering pipeline either for sampling or shader write operations. An image is bound to a “handle”, and from there, it can be made “resident” at any time to use it in shaders. This is different from the general GL methodology where an image must be explicitly bound to a specific slot (instead each image has its own slot), and it allows for both greater flexibility and more images to be in use at any given time.
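On the GL side, usage looks roughly like this sketch of the core ARB_bindless_texture entry points (error handling omitted; tex is an existing texture name):

/* get a 64-bit handle for the texture instead of binding it to a unit */
GLuint64 handle = glGetTextureHandleARB(tex);
/* make it resident so shaders may access it */
glMakeTextureHandleResidentARB(handle);
/* ... pass the handle to shaders, e.g. via a UBO, as a sampler ... */
glMakeTextureHandleNonResidentARB(handle);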

At the implementation level, this actually amounts to three distinct features:

  • the ability to track and manage unique “handles” for each image that can be made resident
  • the ability to access these images from shaders
  • the ability to pass these images between shader stages as normal I/O

In zink, I tackled these in the order I’ve listed.

Handle Management

This wouldn’t have been (as) possible without one very special, very awful Vulkan extension.

You knew this was coming.

VK_EXT_descriptor_indexing.

That’s right, it’s a requirement for this, but not for the reason you might think. Zink has no need for the impossibly large descriptor sets enabled by this extension, but I did need the other features it provides:

  • VK_DESCRIPTOR_BINDING_UPDATE_AFTER_BIND_BIT - enables binding a “bindless” descriptor set once and then performing updates on it without needing to have multiple sets
  • VK_DESCRIPTOR_BINDING_PARTIALLY_BOUND_BIT - enables invalidating deleted members of an active set and leaving them as garbage values in the descriptor set as long as they won’t be accessed in shaders (don’t worry, this is totally safe)
  • VK_DESCRIPTOR_BINDING_UPDATE_UNUSED_WHILE_PENDING_BIT - enables updating members of an active set that aren’t currently in use by any shaders

With these, it becomes possible to implement bindless textures using the existing Gallium convention:

  • create u_idalloc instance to track and generate integer handle IDs
  • map these handle IDs to slots in a large-ish (1024) sized descriptor array
  • dynamically update the slots in the set as textures are made resident/not-resident
  • return handle IDs to the u_idalloc pool once they are destroyed and the image is no longer in use

This creates a cycle where a handle ID is allocated, an image is bound to that slot in the descriptor array, the image can be unbound, the handle ID is deleted, and then finally the ID is recycled, all while only binding and updating a single descriptor set as draws continue.
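To make that concrete, here’s a minimal sketch (assumed names, not zink’s actual code) of creating a set layout with the three binding flags above so the set can stay bound while being updated:

VkDescriptorBindingFlags flags =
   VK_DESCRIPTOR_BINDING_UPDATE_AFTER_BIND_BIT |
   VK_DESCRIPTOR_BINDING_PARTIALLY_BOUND_BIT |
   VK_DESCRIPTOR_BINDING_UPDATE_UNUSED_WHILE_PENDING_BIT;

VkDescriptorSetLayoutBindingFlagsCreateInfo flags_info = {
   .sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_BINDING_FLAGS_CREATE_INFO,
   .bindingCount = 1,
   .pBindingFlags = &flags,
};
VkDescriptorSetLayoutBinding binding = {
   .binding = 0,
   .descriptorType = VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER,
   .descriptorCount = 1024, /* the large-ish handle array from above */
   .stageFlags = VK_SHADER_STAGE_ALL,
};
VkDescriptorSetLayoutCreateInfo layout_info = {
   .sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO,
   .pNext = &flags_info,
   /* required for UPDATE_AFTER_BIND bindings */
   .flags = VK_DESCRIPTOR_SET_LAYOUT_CREATE_UPDATE_AFTER_BIND_POOL_BIT,
   .bindingCount = 1,
   .pBindings = &binding,
};
vkCreateDescriptorSetLayout(dev, &layout_info, NULL, &set_layout);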

Shader Access

Now that the images are accessible to the GPU in the bindless descriptor array, shaders will have to be updated to read them.

In NIR, bindless instructions come in two variants:

  • nir_intrinsic_bindless_image_*
  • nir_instr_type_tex with nir_tex_src_texture_handle

These have their own unique semantics that I didn’t bother to look into; I only needed to do completely normal array derefs, so what I actually needed was just to rewrite them back into normal-er instructions.

For the image intrinsics, that ended up being the following snippet:

nir_intrinsic_instr *instr = nir_instr_as_intrinsic(in);

nir_intrinsic_op op;
#define OP_SWAP(OP) \
case nir_intrinsic_bindless_image_##OP: \
   op = nir_intrinsic_image_deref_##OP; \
   break;


/* convert bindless intrinsics to deref intrinsics */
switch (instr->intrinsic) {
OP_SWAP(atomic_add)
OP_SWAP(atomic_and)
OP_SWAP(atomic_comp_swap)
OP_SWAP(atomic_dec_wrap)
OP_SWAP(atomic_exchange)
OP_SWAP(atomic_fadd)
OP_SWAP(atomic_fmax)
OP_SWAP(atomic_fmin)
OP_SWAP(atomic_imax)
OP_SWAP(atomic_imin)
OP_SWAP(atomic_inc_wrap)
OP_SWAP(atomic_or)
OP_SWAP(atomic_umax)
OP_SWAP(atomic_umin)
OP_SWAP(atomic_xor)
OP_SWAP(format)
OP_SWAP(load)
OP_SWAP(order)
OP_SWAP(samples)
OP_SWAP(size)
OP_SWAP(store)
default:
   return false;
}

enum glsl_sampler_dim dim = nir_intrinsic_image_dim(instr);
nir_variable *var = dim == GLSL_SAMPLER_DIM_BUF ? bindless_buffer_array : bindless_image_array;
if (!var)
   var = create_bindless_image(b->shader, dim);
instr->intrinsic = op;
b->cursor = nir_before_instr(in);
nir_deref_instr *deref = nir_build_deref_var(b, var);
if (glsl_type_is_array(var->type))
   deref = nir_build_deref_array(b, deref, nir_u2uN(b, instr->src[0].ssa, 32));
nir_instr_rewrite_src_ssa(in, &instr->src[0], &deref->dest.ssa);

In short, swap the intrinsic back to a regular image one, then rewrite the image src as a deref of a bindless image variable (which is just image[1024]). In long…it’s the same thing. It’s actually that simple.

The tex instruction is where things get trickier.

nir_variable *var = tex->sampler_dim == GLSL_SAMPLER_DIM_BUF ? bindless_buffer_array : bindless_texture_array;
if (!var)
   var = create_bindless_texture(b->shader, tex);
b->cursor = nir_before_instr(in);
nir_deref_instr *deref = nir_build_deref_var(b, var);
if (glsl_type_is_array(var->type))
   deref = nir_build_deref_array(b, deref, nir_u2uN(b, tex->src[idx].src.ssa, 32));
nir_instr_rewrite_src_ssa(in, &tex->src[idx].src, &deref->dest.ssa);

This part is the same as the image rewrite: just rewriting the instruction as a deref.

This part, however, is different:

unsigned needed_components = glsl_get_sampler_coordinate_components(glsl_without_array(var->type));
unsigned c = nir_tex_instr_src_index(tex, nir_tex_src_coord);
unsigned coord_components = nir_src_num_components(tex->src[c].src);
if (coord_components < needed_components) {
   nir_ssa_def *def = nir_pad_vector(b, tex->src[c].src.ssa, needed_components);
   nir_instr_rewrite_src_ssa(in, &tex->src[c].src, def);
   tex->coord_components = needed_components;
}

The thing about bindless textures is that by the time zink sees them, they have no dimensionality. They’re just textures in an array, regardless of whether they’re 1D, 2D, 3D, or arrayed. This means the variables used for derefs might not have the right number of coordinate components, or the instructions using them might not have the right number. To fix this, an extra cleanup is needed here to match up the number of components with the variable being used.

With all of that in place, basic bindless operations are working.

But wait…

Shader I/O

This was the tricky part. According to the spec, it now becomes legal to have an image or a sampler as an input or an output in a shader.

But is it really, truly necessary to pass images between the shaders?

No. No it isn’t.

nir_deref_instr *src_deref = nir_src_as_deref(instr->src[0]);
nir_variable *var = nir_deref_instr_get_variable(src_deref);
if (var->data.bindless)
   return false;
if (var->data.mode != nir_var_shader_in && var->data.mode != nir_var_shader_out)
   return false;
if (!glsl_type_is_image(var->type) && !glsl_type_is_sampler(var->type))
   return false;

var->type = glsl_int64_t_type();
var->data.bindless = 1;
b->cursor = nir_before_instr(in);
nir_deref_instr *deref = nir_build_deref_var(b, var);
if (instr->intrinsic == nir_intrinsic_load_deref) {
    nir_ssa_def *def = nir_load_deref(b, deref);
    nir_instr_rewrite_src_ssa(in, &instr->src[0], def);
    nir_ssa_def_rewrite_uses(&instr->dest.ssa, def);
} else {
   nir_store_deref(b, deref, instr->src[1].ssa, nir_intrinsic_write_mask(instr));
}
nir_instr_remove(in);
nir_instr_remove(&src_deref->instr);

Bindless shader i/o is really just passing array indices that masquerade as images. If they’re rewritten back to integer types, that all goes away, and they become regular i/o that needs no additional handling.

Just This Once

The translation to Vulkan made everything incredibly easy. I didn’t need any special hacks or corner case behavior, and I didn’t have to spend time reading code from other drivers to figure out what the hell I was doing wrong. Validation even works for it!

Truly miraculous.

October 01, 2021

Wim Taymans

Wim Taymans laying out the vision for the future of Linux multimedia


PipeWire has already made great strides forward in terms of improving the audio handling situation on Linux, but one of the original goals was to also bring along the video side of the house. In fact, in the first few releases of Fedora Workstation where we shipped PipeWire, we solely enabled it as a tool to handle screen sharing for Wayland and Flatpaks. So with PipeWire having stabilized a lot for audio now, we feel the time has come to go back to the video side of PipeWire and work to improve the state of the art for video capture handling under Linux. Wim Taymans did a presentation to our team inside Red Hat on the 30th of September talking about the current state of the world and where we need to go to move forward. I thought the information and ideas in his presentation deserved wider distribution, so this blog post builds on that presentation to share it more widely and hopefully also rally the community to support us in this endeavour.

The current state of video capture handling on Linux, usually webcams, is basically the v4l2 kernel API. It has served us well for a lot of years, but we believe that just like you don’t write audio applications directly to the ALSA API anymore, you shouldn’t write video applications directly to the v4l2 kernel API anymore either. With PipeWire we can offer a lot more flexibility, security and power for video handling, just like it does for audio. The v4l2 API is an open/ioctl/mmap/read/write/close based API, meant for a single application to access at a time. There is a library called libv4l2, but nobody uses it because it causes more problems than it solves (no mmap, slow conversions, quirks). But there is no need to rely on the kernel API anymore, as there are GStreamer and PipeWire plugins for v4l2 that let you access it through the GStreamer or PipeWire API instead. So our goal is not to replace v4l2, just as it is not our goal to replace ALSA; v4l2 and ALSA remain the kernel driver layer for video and audio.
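For reference, this is the classic single-owner v4l2 pattern the paragraph above refers to (an abridged sketch; error handling omitted):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/videodev2.h>

int main(void)
{
   /* open the device node directly - only one app can own it */
   int fd = open("/dev/video0", O_RDWR);
   struct v4l2_capability cap;
   ioctl(fd, VIDIOC_QUERYCAP, &cap);  /* query device capabilities */
   /* ...then VIDIOC_S_FMT, VIDIOC_REQBUFS, mmap(), VIDIOC_STREAMON,
    * and finally close() - no other app can use the camera meanwhile */
   close(fd);
   return 0;
}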

It is also worth considering that new cameras are getting more and more complicated, and thus configuring them is getting more complicated too. Driving this change is a new set of cameras on the way, often called MIPI cameras, as they adhere to the API standards set by the MIPI Alliance. Partly driven by this, v4l2 is in active development with a codec API addition (stateful/stateless), DMABUF, the request API, and also the addition of a Media Controller (MC) graph with nodes, ports and links of processing blocks. This means that the threshold for an application developer to use these APIs directly is getting very high, in addition to the aforementioned issues of single-application access, the security issues of direct kernel access and so on.

libcamera logo


Libcamera is meant to be the userland library for v4l2.


Of course we are not the only ones seeing the growing complexity of cameras as a challenge for developers, and thus libcamera has been developed to make interacting with these cameras easier. Libcamera provides a unified API for camera setup and capture, it hides the complexity of modern camera devices, and it is supported on ChromeOS, Android and Linux.
One way to describe libcamera is as the Mesa of cameras. Libcamera provides hooks to run (out-of-process) vendor extensions, for example for image processing or enhancement. Using libcamera is considered pretty much a requirement for embedded systems these days, and newer Intel chips will also have IPUs configurable with media controllers.

Libcamera is still under heavy development upstream and does not yet have a stable ABI, but it did add a .so version very recently, which will make packaging in Fedora and elsewhere a lot simpler. In fact, we have builds in Fedora ready now. Libcamera also ships with a set of GStreamer plugins, which means you should in theory be able to get, for instance, Cheese working through libcamera (although, as we will go into, we think this is the wrong approach).

Before I go further, an important thing to be aware of here is that unlike with ALSA, where PipeWire can provide a virtual ALSA device for backwards compatibility with older applications using the ALSA API directly, no such option is possible for v4l2. So application developers will need to port to something new here, be that libcamera or PipeWire. So what do we feel is the right way forward?

Ideal Linux Multimedia Stack

How we envision the Linux multimedia stack going forward


Above you see an illustration of how we believe the stack should look going forward. If you made this drawing of the current state, then thanks to our backwards compatibility with ALSA, PulseAudio and Jack, all the applications would already be pointing at PipeWire for their audio handling like they are in the illustration above, but all the video handling from most applications would be pointing directly at v4l2. At the same time, we don’t want applications to port to libcamera either, as it doesn’t offer a lot of the flexibility that using PipeWire will; instead what we propose is that all applications target PipeWire in combination with the video camera portal API. Be aware that the video portal is not an alternative to or an abstraction of the PipeWire API, it is just a way to set up the connection to PipeWire that has the added bonus of working if your application ships as a Flatpak or another type of desktop container. PipeWire would then be in charge of talking to libcamera or v4l2 for video, just like PipeWire is in charge of talking to ALSA on the audio side. Having PipeWire be the central hub means we get a lot of the same advantages for video that we get for audio. For instance, as the application developer you interact with PipeWire regardless of whether what you want is a screen capture, a camera feed or a video being played back. Multiple applications can share the same camera, and at the same time security protections are provided to avoid the camera being used without your knowledge to spy on you. And we can also have patchbay applications that support video pipelines and not just audio, like Carla provides for Jack applications. To be clear, this feature will not come for ‘free’ from Jack patchbays since Jack only does audio, but hopefully new PipeWire patchbays like Helvum can add video support.

So what about GStreamer, you might ask. Well, GStreamer is a great way to write multimedia applications and we strongly recommend it, but we do not recommend your GStreamer application using the v4l2 or libcamera plugins; instead we recommend that you use the PipeWire plugins. This is of course a little different from the audio side, where PipeWire supports the PulseAudio and Jack APIs and thus you don’t need to port, but by targeting the PipeWire plugins in GStreamer your GStreamer application will get the full PipeWire featureset.
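For example, a minimal GStreamer capture pipeline using the PipeWire source element instead of v4l2src might look like this sketch (error and message handling omitted):

#include <gst/gst.h>

int main(int argc, char **argv)
{
   gst_init(&argc, &argv);
   /* pipewiresrc replaces v4l2src; autovideosink picks a display sink */
   GstElement *pipeline = gst_parse_launch(
      "pipewiresrc ! videoconvert ! autovideosink", NULL);
   gst_element_set_state(pipeline, GST_STATE_PLAYING);
   g_main_loop_run(g_main_loop_new(NULL, FALSE));
   return 0;
}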

So what is our plan of action?
So we will start putting the pieces in place for this step by step in Fedora Workstation. We have already started by working on the libcamera support in PipeWire and packaging libcamera for Fedora. We will set it up so that PipeWire has the option to switch between v4l2 and libcamera, so that most users can keep using v4l2 through PipeWire for the time being, while we work with upstream and the community to mature libcamera and its PipeWire backend. We will also enable device discovery for PipeWire.

We are also working on maturing the GStreamer elements for PipeWire for the video capture usecase, as we expect a lot of application developers will just be using GStreamer as opposed to targeting PipeWire directly. We will start with Cheese as our initial testbed for this work as it is a fairly simple application, using it as a proof of concept for PipeWire camera access. We are still trying to decide if we will make Cheese speak directly with PipeWire or have it talk to PipeWire through the pipewiresrc GStreamer plugin; both approaches have their pros and cons in the context of testing and verifying this.

We will also start working with the Chromium and Firefox projects to have them use the Camera portal and PipeWire for camera support, just like we worked with them through WebRTC for the screen sharing support using PipeWire.

There are a few major items we are still trying to decide on in terms of the interaction between PipeWire and the Camera portal API. It would be tempting to see if we can hide the Camera portal API behind the PipeWire API, or failing that, at least hide it for people using the GStreamer plugin. That way all applications would get portal support for free when porting to GStreamer, instead of requiring the Camera portal API as a second step. On the other hand, you need to set up the screen sharing portal yourself, so it would probably make things more consistent if we also left camera access for application developers to set up.

What do we want from the community here?
The first step is just to help us with testing as we roll this out in Fedora Workstation and Cheese. While libcamera was written motivated by MIPI cameras, all webcams are meant to work through it, and thus all webcams are meant to work with PipeWire using the libcamera backend. At the moment that is not the case, and thus community testing and feedback is critical for helping us and the libcamera community to mature libcamera. We hope that by allowing you to easily configure PipeWire to use the libcamera backend (and switch back after you are done testing), we can get a lot of you to test and let us know which cameras are not working well yet.

A little further down the road, please start planning to move any application you maintain or contribute to away from the v4l2 API and towards PipeWire. If your application is a GStreamer application, the transition should be fairly simple, going from the v4l2 plugins to the PipeWire plugins, but beyond that you should familiarize yourself with the Camera portal API and the PipeWire API for accessing cameras.

For further news and information on PipeWire follow our @PipeWireP twitter account, and for general news and information about what we are doing in Fedora Workstation make sure to follow me on twitter @cfkschaller.

September 27, 2021
For the hw-enablement of Bay- and Cherry-Trail devices which I do as a side project, it is sometimes useful to play with the Android that comes pre-installed on some of these devices.

Sometimes the Android-X86 boot-loader (kernelflinger) is locked and the standard "Developer-Options" -> "Enable OEM Unlock" -> "Run 'fastboot oem unlock'" sequence does not work (e.g. I got the unlock yes/no dialog and could move between yes and no, but I could not actually confirm the choice).

Luckily there is an alternative: kernelflinger checks an "OEMLock" EFI variable to see if the device is locked or not. Like with some of my previous adventures changing hidden BIOS settings, this EFI variable is hidden from the OS as soon as the OS calls ExitBootServices, but we can use the same modified grub to change it. After booting from a USB stick with the relevant grub binary installed as "EFI/BOOT/BOOTX64.EFI" or "BOOTIA32.EFI", entering the following command on the grub cmdline will unlock the bootloader:

setup_var_cv OEMLock 0 1 1

Disabling dm-verity support is pretty easy on these devices because they can just boot a regular Linux distro from a USB drive. Note that booting a regular Linux distro may cause the Android "system" partition to get auto-mounted, after which dm-verity checks will fail! Once we have a regular Linux distro running, step 1 is to find out which partition is the android_boot partition. To do this, run as root:

blkid /dev/mmcblk?p#

Replace the ? with the mmcblk number of the internal eMMC and the # with 1 through n, until one of the partitions is reported as having 'PARTLABEL="android_boot"'. Usually "mmcblk?p3" is the one you want, so you could try that first.

Now make an image of the partition by running e.g.:

dd if=/dev/mmcblk1p3" of=android_boot.img

And then copy the "android_boot.img" file to another computer. On this computer, extract the boot image and then the initrd like this:

abootimg -x android_boot.img
mkdir initrd
cd initrd
zcat ../initrd.img | cpio -i


Now edit the fstab file and remove "verify" from the line for the system partition. After this, update android_boot.img like this:

find . | cpio -o -H newc -R 0.0 | gzip -9 > ../initrd.img
cd ..
abootimg -u android_boot.img -r initrd.img


The easiest way to test the new image is using fastboot. Boot the tablet into Android, connect it to the PC, then run:

adb reboot bootloader
fastboot boot android_boot.img


And then from an "adb shell" do "cat /fstab" and verify that the "verify" option is gone now. After this you can (optionally) dd the new android_boot.img back to the android_boot partition to make the change permanent.

Note: if Android is not booting, you can force the bootloader to enter fastboot mode on the next boot by downloading this file and then, under regular Linux, running the following command as root:

cat LoaderEntryOneShot > /sys/firmware/efi/efivars/LoaderEntryOneShot-4a67b082-0a4c-41cf-b6c7-440b29bb8c4f
September 24, 2021

Fedora Workstation
So I have spoken about our vision for Fedora Workstation quite a few times before, but I feel it is often useful to get back to it as we progress with our overall effort. So if you have read some of my blog posts about Fedora Workstation over the last 5 years, be aware that there is probably little new in here for you. If you haven’t read them, however, this is hopefully a useful primer on what we are trying to achieve with Fedora Workstation.

The first few years after we launched Fedora Workstation in 2014, we focused a lot on establishing a good culture around what we were doing with Fedora, making sure that it was a good day-to-day desktop driver for people, and not just a great place to develop the operating system itself. I think it was Fedora Project Lead Matthew Miller who phrased it very well when he said that we want to be Leading Edge, not Bleeding Edge. We also took a good look at the operating system from an overall stance, tried to map out where Linux tended to fall short as a desktop operating system, and asked ourselves what our core audience would and should be. We refocused our efforts on being a great operating system for all kinds of developers, but I think it is fair to say we decided that was too narrow a wording, as our efforts are truly aimed at makers of all kinds, like graphics artists and musicians, in addition to coders. So I thought I’d go through our key pillar efforts and talk about where they are at and where they are going.

Flatpak

Flatpak logo
One of the first things we concluded was that our story for people who wanted to deploy applications to our platform was really bad. The main challenge was that the platform was moving very fast, and it was a big overhead for application developers to keep on top of the changes. In addition, since the Linux desktop is so fragmented, application developers would have to deal with the fact that there were 20 different variants of this platform, all moving at a different pace. The way Linux applications were packaged, with each dependency packaged independently of the application, created pains on both sides: for the application developer it meant the world kept moving underneath them with limited control, and for the distributions it meant packaging pains, as different applications that all depended on the same library might work or fail with different versions of it. So we concluded we needed a system which allowed us to decouple the application from the host OS, to let application developers update their platform at a pace of their own choosing, and at the same time unify the platform in the sense that the application should be able to run without problems on the latest Fedora releases, the latest RHEL releases or the latest versions of any other distribution out there. As we looked at it, we realized there were some security downsides compared to the existing model, since the OS vendor would no longer be in charge of keeping all libraries up to date and secure, so sandboxing the applications ended up a critical requirement. At the time, Alexander Larsson was working on bringing Docker to RHEL and Fedora, so we tasked him with designing the new application model. The initial idea was to see if we could adjust Docker containers to the desktop usecase, but Docker containers as they stood at that time were very unsuited for hosting desktop applications, and our experience working with the Docker upstream at the time was that they were not very welcoming to our contributions. So in light of how major the changes we would need to implement were, and the unlikelihood of getting them accepted upstream, Alex started on what would become Flatpak. Another major technology coincidentally being developed at the same time was OSTree, by Colin Walters. To this day I think the best description of OSTree is that it functions as a git for binaries, meaning it gives you a simple way to maintain and update your binary applications with minimally sized updates. It also provides some disk deduplication, which we felt was important due to the duplication of libraries and so on that containers bring with them. Finally, another major design decision Alex made was that the runtime/baseimage should be hosted outside the container, to make it possible to update the runtime independently of the application with relevant security updates etc.

Today there is a thriving community around Flatpaks, with the center of activity being flathub, the Flatpak application repository. In Fedora Workstation 35 you should start seeing Flatpak from Flathub being offered as long as you have 3rd party repositories enabled. Also underway is Owen Taylor leading our efforts of integrating Flatpak building into the internal tools we use at Red Hat for putting RHEL together, with the goal of switching over to Flatpaks as our primary application delivery method for desktop applications in RHEL and to help us bridge the Fedora and RHEL application ecosystem.

You can follow the latest news from Flatpak through the official Flatpak twitter account.

Silverblue

So another major issue we decided needed improvement was that of OS upgrades (as opposed to application updates). The model pursued by Linux distros since their inception is one of shipping their OS as a large collection of independently packaged libraries. This setup is inherently fragile and requires a lot of quality engineering and testing to avoid problems, but even then things sometimes fail, especially in a fast-moving OS like Fedora. A lot of configuration changes and updates have traditionally been done through scripts and similar, also making rollback to an older version very challenging in cases where there is a problem. Adventurous developers could also have made changes to their own copy of the OS that would break the upgrade later on. So thanks to all the great efforts to test and verify upgrades they usually go well for most users, but we wanted something even more sturdy. So the idea came up to move to an image-based OS model, similar to what people had gotten used to on their phones. And OSTree once again became the technology we chose to do this with, especially considering it was being used in Red Hat’s first foray into image-based operating systems for servers (the server effort later got rolled into CoreOS as part of Red Hat acquiring CoreOS). The idea is that you ship the core operating system as a singular image, and to upgrade you just replace that image with a new one, so the risk of problems is greatly reduced. On top of that, each of those images can be tested and verified as a whole by your QE and test teams. Of course we realized that a subset of people would still want to be able to tweak their OS, but once again OSTree came to our rescue, as it allows developers to layer further RPMs on top of the OS image, including replacing current system libraries with, for instance, newer ones. The great thing about OSTree layering is that once you are done testing/using the layered RPMs you can, with a very simple command, just drop them again and go back to the upstream image. Combined with applications being shipped as Flatpaks, this creates an OS that is a lot more sturdy, secure and simple to update, with a much lower chance of an OS update breaking any of your applications. On top of that, OSTree allows us to do easy OS rollbacks, so if the latest update somehow doesn’t work for you, you can quickly roll back while waiting for the issue you are having to be fixed upstream. And hence Fedora Silverblue was born as the vehicle for us to develop and evolve an image-based desktop operating system.

You can follow our efforts around Silverblue through the official Silverblue twitter account.

Toolbx

Toolbox with RHEL

Toolbox pet container with RHEL UBI


So Flatpak helped us address a lot of the gaps in making a better desktop OS on the application side, and Silverblue was the vehicle for our vision on the OS side, but we realized that we also needed some way for all kinds of developers to easily take advantage of the great resource that is the Fedora RPM package universe and the wider tools universe out there. We needed something that provided people with a great terminal experience. We had already been working on various smaller improvements to the terminal for a while, but we realized we needed something a lot more substantial. Accessing an immutable OS like Silverblue through a terminal window tends to be quite limiting, so that is usually not what you want to do, and you also don’t want to rely on OSTree layering for running all your development tools and so on, as that is going to be potentially painful when you upgrade your OS.
Luckily the container revolution happening in the Linux world pointed us to the solution here too: as containers were rolled out, the concept of ‘pet containers’ was also born. The idea of a pet container is that, unlike general containers (sometimes referred to as cattle containers), pet containers are containers that you care about on an individual level, like your personal development environment. In fact, pet containers even improve on how we used to do things, as they allow you to very easily maintain different environments for different projects. For instance, if you have two projects, hosted in two separate pet containers, which depend on two different versions of python, then containers make that simple, as they ensure there is no risk of one of your projects ‘contaminating’ the others with its dependencies, while at the same time allowing you to grab RPMs or other kinds of packages from upstream resources and install them in your container. In fact, while inside your pet container the world feels a lot like it always has on the Linux command line. Thanks to the great effort of Dan Walsh and his team we had a growing number of easy-to-use container tools available to us, like podman. Podman is developed with the primary usecase of running and deploying your containers at scale, managed by OpenShift and Kubernetes. But it also gave us the foundation we needed for Debarshi Ray to kick off the Toolbx project, to ensure that we had an easy-to-use tool for creating and managing pet containers. As a bonus, Toolbx allows us to achieve another important goal: letting Fedora Workstation users develop applications against RHEL in a simple and straightforward manner, because Toolbx allows you to create RHEL containers just as easily as Fedora containers.

You can follow our efforts around Toolbox on the official Toolbox twitter account

Wayland

Ok, so between Flatpak, Silverblue and Toolbox we had a clear vision for how to create a robust OS, with a great story for application developers to maintain and deliver applications for it, and with Toolbox providing a great developer story on top of this OS. But we also looked at the technical state of the Linux desktop and realized that there were some serious deficits we needed to address. One of the first we saw was the state of graphics, where X.org had served us well for many decades, but its age was showing and adding new features as they came in was becoming more and more painful. Kristian Høgsberg had started work on an alternative to X while still at Red Hat, called Wayland, an effort he and a team of engineers were pushing forward at Intel. There was a general agreement in the wider community that Wayland was the way forward, but apart from Intel there was little serious development effort being put into moving it forward. On top of that, Canonical at the time had decided to go off on their own and develop their own alternative architecture in competition with X.org and Wayland. So as we were seeing a lot of things on the horizon in the graphics space, like HiDPI, and were also getting requests to come up with a way to make Linux desktops more secure, we decided to team up with Intel and get Wayland into a truly usable state on the desktop. So we put many of our top developers, like Olivier Fourdan, Adam Jackson and Jonas Ådahl, on maturing Wayland as quickly as possible.
As things would have it, we also ended up getting a lot of collaboration and development help from the embedded sector, where companies such as Collabora were helping to deploy systems with Wayland onto various kinds of embedded devices and contributing fixes and improvements back to Wayland (and Weston). To be honest, I have to admit we did not fully appreciate what a herculean task getting Wayland production-ready for the desktop would end up being, and it took us quite a few Fedora releases before we decided it was ready to go. As you might imagine, dealing with 30 years of technical debt is no easy thing to pay down, and while we kept moving forward at a steady pace there always seemed to be a new batch of issues to resolve, but we managed to do so, not just by maturing Wayland, but also by porting major applications, such as Martin Stransky porting Firefox and Caolan McNamara porting LibreOffice over to Wayland. At the end of the day I think what saw us through to success was the incredible collaboration happening upstream between a large host of individual contributors and companies, and having the support of the X.org community. And even when we had the whole thing put together there were still practical issues to overcome, like how we had to keep defaulting to X.org in Fedora when people installed the binary NVidia driver, because that driver did not work with XWayland, the X backwards-compatibility layer in Wayland. Luckily that is now in the process of becoming a thing of the past, with the latest NVidia driver updates supporting XWayland and us working closely with NVidia to ensure driver and windowing stack work well together.

PipeWire

Pipewire in action

Example of PipeWire running


So now we had a clear vision for the OS and a much improved, much more secure graphics stack in the form of Wayland, but we realized that all the new security features brought in by Flatpak and Wayland also made certain things, like desktop capturing/remoting and web camera access, a lot harder. Security is great and critical, but just like the old joke about the most secure computer being the one that is turned off, we realized that we needed to make sure these things kept working, just in a more secure and better manner. Thankfully we have GStreamer co-creator Wim Taymans on the team, and he thought he could come up with a PulseAudio equivalent for video that would allow us to offer screen capture and webcam access in a convenient and secure manner.
As Wim was prototyping what we called PulseVideo at the time, we also started discussing the state of audio on Linux. Wim had contributed to PulseAudio to add a security layer to it, to make it harder, for instance, for a rogue application to eavesdrop on you using your microphone, but since it was not part of the original design it wasn’t a great solution. At the same time, we talked about how our vision for Fedora Workstation was to make it the natural home for all kinds of makers, which included musicians, and how the separateness of the pro-audio community was getting in the way of that, especially due to the uneasy co-existence of PulseAudio on the consumer side and Jack on the pro-audio side. As part of his development effort, Wim came to the conclusion that he could make the core logic of his new project so fast and versatile that it should be able to deal with the low-latency requirements of the pro-audio community and also serve its purpose well on the consumer audio and video side. Having audio and video in one shared system would also be an improvement for us in terms of dealing with combined audio and video sources, as guaranteeing audio/video sync, for instance, had often been a challenge in the past. So Wim’s effort evolved into what we today call PipeWire, which I am going to be brave enough to say has been one of the most successful launches of a major new Linux system component we have ever done. Replacing two old sound servers while at the same time adding video support is no small feat, but Wim is working very hard on fixing bugs as quickly as they come in and ensuring users have a great experience with PipeWire. And at the same time we are very happy that PipeWire now gives us the ability to offer musicians and sound engineers a new home in Fedora Workstation.

You can follow our efforts on PipeWire on the PipeWire twitter account.

Hardware support and firmware

In parallel with everything mentioned above, we were looking at the hardware landscape surrounding desktop Linux. One of the first things we realized was horribly broken was firmware support under Linux. More and more of the hardware smarts were being found in the firmware, yet firmware access under Linux and the firmware update story were basically non-existent. As we were discussing this problem internally, Peter Jones, who is our representative on the UEFI standards committee, pointed out that we were probably better poised to actually do something about this problem than ever, since UEFI was causing the firmware update process on most laptops and workstations to become standardized. So we teamed Peter up with Richard Hughes, and out of that collaboration fwupd and LVFS were born. And in the years since we launched, we have gone from having next to no firmware available on Linux (and the little we had only available through painful processes like burning bootable CDs etc.) to a lot of hardware having firmware update support, with more getting added almost on a weekly basis.
For the latest and greatest news around LVFS, the best source of information is Richard Hughes’ twitter account.

In parallel to this, Adam Jackson worked on glvnd, which provided us with a way to have multiple OpenGL implementations on the same system. Those who have been using Linux for a while will surely remember the pain of the NVidia driver and Mesa fighting over who provided OpenGL on your system, as it was all tied to a specific .so name. There were a lot of hacks of varying degrees of fragility being used out there to deal with that situation, but with the advent of glvnd nobody has to care about that problem anymore.

We also decided that we needed to have a part of the team dedicated to looking at what was happening in the market and working on covering important gaps. And by gaps I mean fixing the things that keep the hardware vendors from being able to properly support Linux, not writing drivers for them. Instead we have been working closely with Dell and Lenovo to ensure that their suppliers provide drivers for their hardware, and when needed we work to provide a framework for them to plug their hardware into. This has led to a series of small but important improvements, like getting the fingerprint reader stack on Linux to a state where hardware vendors can actually support it, bringing Thunderbolt support to Linux through Bolt, support for high-definition and gaming mice through the libratbag project, support in the Linux kernel for the new laptop privacy-screen feature, improved power management support through the power profiles daemon, and now recently hiring a dedicated engineer to get HDR support fully in place in Linux.

Summary

So to summarize: we are of course not over the finish line with our vision yet. Silverblue is a fantastic project, but we are not yet ready to declare it the official version of Fedora Workstation, mostly because we want to give the community more time to embrace the Flatpak application model and developers more time to embrace the pet container model. Especially applications like IDEs, which cross the boundary between being in their own Flatpak sandbox while also interacting with things in your pet container and calling out to system tools like gdb, need more work, but Christian Hergert has already done great work solving the problem in GNOME Builder, while Owen Taylor has put together support for using Visual Studio Code with pet containers. So hopefully the wider universe of IDEs will follow suit; in the meantime, one would need to call them from the command line inside the pet container.

The good thing here is that Flatpaks and Toolbox also work great on traditional Fedora Workstation; you can get the full benefit of both technologies even on a traditional distribution, so we can allow for a soft and easy transition.

So for anyone who made it this far, apologies for this becoming a little novel; that was not my intention when I started writing it :)

Feel free to follow my personal twitter account for more general news and updates on what we are doing around Fedora Workstation.

Last week we had our most-loved annual conference: X.Org Developers Conference 2021. As a reminder, due to the COVID-19 situation in Europe (and the respective restrictions on travel and events), we kept it virtual again this year… which is a pity, as the intended venue was Gdańsk, a very beautiful city (see the picture below if you don’t believe me!) in Poland. Let’s see if we can finally have an XDC there!

XDC 2021

This year we had a very strong program. There were talks covering all aspects of the open-source graphics stack: the kernel (including an Outreachy talk about VKMS), Mesa drivers of all kinds, input, libraries, X.org security and Wayland robustness… we had talks about testing drivers, debugging them, our infrastructure at freedesktop.org, and even Vulkan specs (such as Vulkan Video and VK_EXT_multi_draw) and their support in the open-source graphics stack. Definitely a very complete program that is very interesting to all open-source developers working in this area. You can watch all the talks here or here, and the slides have already been uploaded to the program.

On behalf of the Call For Papers Committee, I would like to thank all speakers for their talks… this conference won’t make sense without you!

Big shout-out to the XDC 2021 organizers (Intel), represented by Radosław Szwichtenberg, Ryszard Knop and Maciej Ramotowski. They did an awesome job running a very smooth conference. I can tell you that they promptly fixed any issue that came up, all of it behind the scenes, so that attendees didn’t even notice anything most of the time! That is what good conference organizers do!

XDC 2021 Organizers

Can I invite you to a drink at least? You really deserve it!

If you want to know more details about what this virtual conference entailed, just watch Ryszard’s talk at XDC (info, video) or you can reuse their materials for future conferences. That’s very useful info for future conference organizers!

Talking about our streaming platforms, the big novelty this year was the use of media.ccc.de as a privacy-friendly alternative to our traditional Youtube setup (last year we got feedback about this). Media.ccc.de is an open-source platform that respects your privacy and we hope it worked fine for all attendees. Our stats indicate that ~50% of our audience connected to it during the three days of the conference. That’s awesome!

Last but not least, we couldn’t make this conference happen without our sponsors. We are very lucky to have on board Intel as our Platinum sponsor and organizer, our Gold sponsors (Google, NVIDIA, ARM, Microsoft and AMD), our Silver sponsors (Igalia, Collabora, The Linux Foundation), our Bronze sponsors (Gitlab and Khronos Group) and our Supporters (C3VOC). Big thank you from the X.Org community!

XDC 2021 Sponsors

Feedback

We would like to hear from you and learn what worked and what needs to be improved for future editions of XDC! Share your experience with us!

We have sent an email asking for feedback to different mailing lists (for example this). Don’t hesitate to send an email to X.Org Foundation board with all your feedback!

XDC 2022 announced!

X.Org Developers Conference 2022 has been announced! Jeremy White, from Codeweavers, gave a lightning talk presenting next year’s edition! Next year XDC will not be alone… WineConf 2022 is going to be organized by Codeweavers as well and co-located with XDC!

Save the dates! October 4-5-6, 2022 in Minneapolis, Minnesota, USA.

XDC 2022: Minneapolis, Minnesota, USA Image from Wikipedia. License CC BY-SA 4.0.

XDC 2023 hosting proposals

Have you enjoyed XDC 2021? Do you think you can do it better? ;-) We are looking for organizers for XDC 2023 (most likely in Europe but we are open to other places).

We know this is a decision that takes time (triggering internal discussions, looking for volunteers, budgeting, finding a venue suitable for the event, etc). Therefore, we encourage potentially interested parties to start their internal discussions now, so any questions they have can be answered before we open the call for proposals for XDC 2023 at some point next year. Please read what is required to organize this conference, and feel free to contact me or the X.Org Foundation board for more info if needed.

Final acknowledgment

I would like to thank Igalia for all the support I got when I decided to run for re-election this year in the X.Org Foundation board and to allow me to participate in XDC organization during my work hours. It’s amazing that our Free Software and collaboration values are still present after 20 years rocking in the free world!

Igalia 20th anniversary

September 23, 2021

After a nine year hiatus, a new version of the X Input Protocol is out. Credit for the work goes to Povilas Kanapickas, who also implemented support for XI 2.4 in the various pieces of the stack [0]. So let's have a look.

X has had touch events since XI 2.2 (2012) but those were only really useful for direct touch devices (read: touchscreens). There were accommodations for indirect touch devices like touchpads but they were never used. The synaptics driver set the required bits for a while but it was dropped in 2015 because ... it was complicated to make use of and no-one seemed to actually use it anyway. Meanwhile, the rest of the world moved on and touchpad gestures are now prevalent. They've been standard in MacOS for ages, in Windows for almost ages and - with recent GNOME releases - now feature prominently on the Linux desktop as well. They have been part of libinput and the Wayland protocol for years (and even recently gained a new set of "hold" gestures). Meanwhile, X was left behind in the dust or mud, depending on your local climate.

XI 2.4 fixes this: it adds pinch and swipe gestures to the XI2 protocol and makes those available to supporting clients [2]. Notable here is that the interpretation of gestures is left to the driver [1]. The server takes the gestures and does the required state handling but otherwise has no say in what constitutes a gesture. This is of course no different from e.g. 2-finger scrolling on a touchpad, where the server just receives scroll events and passes them on accordingly.

XI 2.4 gesture events are quite similar to touch events in that they are processed as a sequence of begin/update/end with both types having their own event types. So the events you will receive are e.g. XIGesturePinchBegin or XIGestureSwipeUpdate. As with touch events, a client must select for all three (begin/update/end) on a window. Only one gesture can exist at any time, so if you are a multi-tasking octopus prepare to be disappointed.
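Selecting for them looks much like selecting for touch events. A minimal sketch (dpy and win are assumed to exist; error handling omitted):

#include <X11/extensions/XInput2.h>

/* select for all six gesture events on a window - remember that
 * begin/update/end must all be selected together */
unsigned char mask_bits[XIMaskLen(XI_LASTEVENT)] = {0};
XIEventMask mask = {
   .deviceid = XIAllMasterDevices,
   .mask_len = sizeof(mask_bits),
   .mask = mask_bits,
};
XISetMask(mask_bits, XI_GesturePinchBegin);
XISetMask(mask_bits, XI_GesturePinchUpdate);
XISetMask(mask_bits, XI_GesturePinchEnd);
XISetMask(mask_bits, XI_GestureSwipeBegin);
XISetMask(mask_bits, XI_GestureSwipeUpdate);
XISetMask(mask_bits, XI_GestureSwipeEnd);
XISelectEvents(dpy, win, &mask, 1);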

Because gestures are tied to an indirect-touch device, the location they apply at is wherever the cursor is currently positioned. In that, they work similar to button presses, and passive grabs apply as expected too. So long-term the window manager will likely want a passive grab on the root window for swipe gestures while applications will implement pinch-to-zoom as you'd expect.

In terms of API there are no surprises. libXi 1.8 is the version to implement the new features, and there we have a new XIGestureClassInfo returned by XIQueryDevice and of course the two events: XIGesturePinchEvent and XIGestureSwipeEvent. Grabbing is done via e.g. XIGrabSwipeGestureBegin, so for those of you with XI2 experience this will all look familiar. For those of you without - it's probably no longer worth investing time into becoming an XI2 expert.

Overall, it's a nice addition to the protocol and it will help getting the X server slightly closer to Wayland for a widely-used feature. Once GTK, mutter and all the other pieces in the stack are in place, it will just work for any (GTK) application that supports gestures under Wayland already. The same will be true for Qt I expect.

X server 21.1 will be out in a few weeks, xf86-input-libinput 1.2.0 is already out and so are xorgproto 2021.5 and libXi 1.8.

[0] In addition to taking on the Xorg release, so clearly there are no limits here
[1] More specifically: it's done by libinput since neither xf86-input-evdev nor xf86-input-synaptics will ever see gestures being implemented
[2] Hold gestures missed out on the various deadlines

September 22, 2021

Xorg is about to be released.

And it's a release without Xwayland.

And... wait, what?

Let's unwind this a bit, and ideally you should come away with a better understanding of Xorg vs Xwayland, and possibly even Wayland itself.

Heads up: if you are familiar with X, the below is simplified to the point it hurts. Sorry about that, but as an X developer you're probably good at coping with pain.

Let's go back to the 1980s, when fashion was weird and there were still reasons to be optimistic about the future. Because this is a thought exercise, we go back with full hindsight 20/20 vision and, ideally, the winning Lotto numbers in case we have some time for some self-indulgence.

If we were to implement an X server from scratch, we'd come away with a set of components: libxprotocol, which handles the actual protocol wire format parsing and provides a C API to access it (quite like libxcb, actually). That one is just the protocol-to-code conversion layer.

We'd have a libxserver component which handles all the state management required for an X server to actually behave like an X server (nothing in the X protocol requires an X server to display anything). That library has a few entry points for abstract input events (pointer and keyboard, because this is the 80s after all) and a few exit points for rendered output.
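To make the thought exercise concrete, those entry and exit points might look something like this (entirely hypothetical, of course - this library doesn't exist):

#include <stdint.h>

struct xserver;

struct xserver *xserver_new(void);

/* entry points: abstract input */
void xserver_pointer_motion(struct xserver *xs, int x, int y);
void xserver_key_press(struct xserver *xs, uint32_t keycode);

/* exit point: whoever embeds the library decides what "output" means */
void xserver_set_render_callback(struct xserver *xs,
                                 void (*render)(void *pixels, void *data),
                                 void *data);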

libxserver uses libxprotocol but that's an implementation detail, we can ignore the protocol for the rest of the post.

Let's create a github organisation and host those two libraries. We now have: http://github.com/x/libxserver and http://github.com/x/libxprotocol [1].

Now, to actually implement a working, functional X server, our new project would link against libxserver and hook into this library's API points. For input you'd use libinput and pass those events through; for output you'd use the modesetting driver that knows how to scream at the hardware until something finally shows up. This is somewhere between outrageously simplified and unacceptably wrong, but it'll do for this post.

Your X server has to handle a lot of the hardware-specifics but other than that it's a wrapper around libxserver which does the work of ... well, being an X server.

Our stack looks like this:


+--------------------------+
| xserver     [libxserver] |--------[ X client ]
|                          |
| [libinput] [modesetting] |
+--------------------------+
|          kernel          |
+--------------------------+
Hooray, we have re-implemented Xorg. Or rather, XFree86, because we're 20 years from all the pent-up frustration that caused the Xorg fork. Let's host this project on http://github.com/x/xorg

Now, let's say instead of physical display devices, we want to render into a framebuffer, and we have no input devices.


+--------------------------+
| xserver     [libxserver] |--------[ X client ]
|                          |
|        [write()]         |
+--------------------------+
|       some buffer        |
+--------------------------+
This is basically Xvfb or, if you are writing out PostScript, Xprint. Let's host those on github too, we're accumulating quite a set of projects here.

Now, let's say those buffers are allocated elsewhere and we're just rendering to them. And those buffers are passed to us via an IPC protocol, like... Wayland!


+------------------------+
|  xserver  [libxserver] |--------[ X client ]
|                        |
| input events  [render] |
+------------------------+
           |  |
+------------------------+
|   Wayland compositor   |
+------------------------+
And voilà, we have Xwayland. If you swap out the protocol you can have Xquartz (X on macOS) or Xwin (X on Windows) or Xnest/Xephyr (X on X) or Xvnc (X over VNC). The principle is always the same.

Fun fact: the Wayland compositor doesn't need to run on the hardware, you can play display server matryoshka until you run out of turtles.

In our glorious revisionist past all these are distinct projects, re-using libxserver and some external libraries where needed. Depending on how we render things, each project may be very simple or get very complex.

But in the end, we have several independent projects all providing us with an X server process - the specific X bits are done in libxserver though. We can release Xwayland without having to release Xorg or Xvfb.

libxserver won't need a lot of releases, the behaviour is largely specified by the protocol requirements and once you're done implementing it, it'll be quite a slow-moving project.

Ok, now, fast forward to 2021, lose some hindsight, hope, and attitude and - oh, we have exactly the above structure. Except that it's not spread across multiple independent repos on github, it's all sitting in the same git directory: our Xorg, Xwayland, Xvfb, etc. are all sitting in hw/$name, and libxserver is basically the rest of the repo.

A traditional X server release was a tag in that git directory. An XWayland-only release is basically an rm -rf hw/*-but-not-xwayland followed by a tag, an Xorg-only release is basically an rm -rf hw/*-but-not-xfree86 [2].
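Purely as an illustration of that (the directory list is abbreviated, the tag name is hypothetical, and the real release procedure is more careful than this):

rm -rf hw/xfree86 hw/vfb hw/xnest hw/kdrive   # hw/*-but-not-xwayland, roughly
git commit -am "Xwayland-only tree"           # commit the removals
git tag xwayland-21.1.0                       # tag what's left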

In theory, we could've moved all these out into separate projects a while ago but the benefits are small and no-one has the time for that anyway.

So there you have it - you can have Xorg-only or XWayland-only releases without the world coming to an end.

Now, for the "Xorg is dead" claims - it's very likely that the current release will be the last Xorg release. [3] There is little interest in an X server that runs on hardware, or rather: there's little interest in the effort required to push out releases. Povilas did a great job in getting this one out but again, it's likely this is the last release. [4]

Xwayland - very different, it'll hang around for a long time because it's "just" a protocol translation layer. And of course the interest is there, so we have volunteers to do the releases.

So basically: expect Xwayland releases, be surprised (but not confused) by Xorg releases.

[1] Github of course doesn't exist yet because we're in the 80s. Time-travelling is complicated.
[2] Historical directory name, just accept it.
[3] Just like the previous release...
[4] At least until the next volunteer steps up. Turns out the problem "no-one wants to work on this" is easily fixed by "me! me! I want to work on this". A concept that is apparently quite hard to understand in the peanut gallery.

The Strange State of Authenticated Boot and Disk Encryption on Generic Linux Distributions

TL;DR: Linux has been supporting Full Disk Encryption (FDE) and technologies such as UEFI SecureBoot and TPMs for a long time. However, the way they are set up by most distributions is not as secure as they should be, and in some ways quite frankly weird. In fact, right now, your data is probably more secure if stored on current ChromeOS, Android, Windows or MacOS devices, than it is on typical Linux distributions.

Generic Linux distributions (i.e. Debian, Fedora, Ubuntu, …) adopted Full Disk Encryption (FDE) more than 15 years ago, with the LUKS/cryptsetup infrastructure. It was a big step forward to a more secure environment. Almost ten years ago the big distributions started adding UEFI SecureBoot to their boot process. Support for Trusted Platform Modules (TPMs) was added to the distributions a long time ago as well — but even though many PCs/laptops these days have TPM chips on-board, they are generally not used in the default setup of generic Linux distributions.

How these technologies currently fit together on generic Linux distributions doesn't really make too much sense to me — and falls short of what they could actually deliver. In this story I'd like to have a closer look at why I think that, and what I propose to do about it.

The Basic Technologies

Let's have a closer look at what these technologies actually deliver:

  1. LUKS/dm-crypt/cryptsetup provide disk encryption, and optionally data authentication. Disk encryption means that reading the data in clear-text form is only possible if you possess a secret of some form, usually a password/passphrase. Data authentication means that no one can make changes to the data on disk unless they possess a secret of some form. Most distributions only enable the former though — the latter is a more recent addition to LUKS/cryptsetup, and is not used by default on most distributions (though it probably should be). Closely related to LUKS/dm-crypt are dm-verity (which can authenticate immutable volumes) and dm-integrity (which can authenticate writable volumes, among other things). A brief cryptsetup sketch of the authenticated mode follows below, after this list.

  2. UEFI SecureBoot provides mechanisms for authenticating boot loaders and other pre-OS binaries before they are invoked. If those boot loaders then authenticate the next step of booting in a similar fashion there's a chain of trust which can ensure that only code that has some level of trust associated with it will run on the system. Authentication of boot loaders is done via cryptographic signatures: the OS/boot loader vendors cryptographically sign their boot loader binaries. The cryptographic certificates that may be used to validate these signatures are then signed by Microsoft, and since Microsoft's certificates are basically built into all of today's PCs and laptops this will provide some basic trust chain: if you want to modify the boot loader of a system you must have access to the private key used to sign the code (or to the private keys further up the certificate chain).

  3. TPMs do many things. For this text we'll focus on one facet: they can be used to protect secrets (for example for use in disk encryption, see above) that are released only if the code that booted the host can be authenticated in some form. This works roughly like this: every component that is used during the boot process (i.e. code, certificates, configuration, …) is hashed with a cryptographic hash function before it is used. The resulting hash is written to some small volatile memory the TPM maintains that software cannot set directly (the so-called Platform Configuration Registers, "PCRs"): each step of the boot process writes hashes of the resources needed by the next part of the boot process into these PCRs. The PCRs cannot be written freely: a hash written to a PCR is combined with what is already stored there — also through hashing — and the result of that then replaces the previous value. Effectively this means: only if every component involved in the boot matches expectations do the hash values exposed in the TPM PCRs match the expected values too. And if you then use those values to unlock the secrets you want to protect, you can guarantee that the key is only released to the OS if the expected OS and configuration are booted. The process of hashing the components of the boot process and writing that to the TPM PCRs is called "measuring". (A small shell illustration of this extend operation follows below, after this list.) What's also important to mention is that the secrets are not only protected by these PCR values but encrypted with a "seed key" that is generated on the TPM chip itself and cannot leave the TPM (at least so goes the theory). The idea is that you cannot read out a TPM's seed key, and thus you cannot duplicate the chip: unless you possess the original, physical chip you cannot retrieve the secret it might be able to unlock for you. Finally, TPMs can enforce a limit on unlock attempts per unit of time ("anti-hammering"): if you can only execute a certain number of unlock attempts within a specific time window, brute forcing becomes prohibitively slow.
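
As a minimal sketch of the authenticated mode mentioned in item 1 (the device name is hypothetical, and this combination was still marked experimental in cryptsetup at the time of writing):

# create a LUKS2 volume that authenticates its data in addition to encrypting it
cryptsetup luksFormat --type luks2 --integrity hmac-sha256 /dev/nvme0n1p3
# open it as usual; tampered-with sectors will then show up as read errors
cryptsetup open /dev/nvme0n1p3 root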

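To make the extend operation from item 3 concrete, here is a small shell illustration of the arithmetic (assuming SHA-256 PCRs; real TPMs do this in hardware, and /boot/vmlinuz merely stands in for "some boot component"):

# PCR extend, emulated: new_pcr = SHA256(old_pcr || measurement)
pcr=$(printf '0%.0s' {1..64})                          # PCRs start out as zeros
measurement=$(sha256sum /boot/vmlinuz | cut -d' ' -f1) # hash of a boot component
pcr=$(printf '%s' "$pcr$measurement" | xxd -r -p | sha256sum | cut -d' ' -f1)
# after all components are measured, $pcr only matches the expected value
# if every single measurement matched, in exactly this order
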
How Linux Distributions use these Technologies

As mentioned already, Linux distributions adopted the first two of these technologies widely, the third one not so much.

So typically, here's how the boot process of Linux distributions works these days:

  1. The UEFI firmware invokes a piece of code called "shim" (which is stored in the EFI System Partition — the "ESP" — of your system), that more or less is just a list of certificates compiled into code form. The shim is signed with the aforementioned Microsoft key, that is built into all PCs/laptops. This list of certificates then can be used to validate the next step of the boot process. The shim is measured by the firmware into the TPM. (Well, the shim can do a bit more than what I describe here, but this is outside of the focus of this article.)

  2. The shim then invokes a boot loader (often Grub) that is signed by a private key owned by the distribution vendor. The boot loader is stored in the ESP as well, plus some other places (i.e. possibly a separate boot partition). The corresponding certificate is included in the list of certificates built into the shim. The boot loader components are also measured into the TPM.

  3. The boot loader then invokes the kernel and passes it an initial RAM disk image (initrd), which contains initial userspace code. The kernel itself is signed by the distribution vendor too. It's also validated via the shim. The initrd is not validated, though (!). The kernel is measured into the TPM, the initrd sometimes too.

  4. The kernel unpacks the initrd image, and invokes what is contained in it. Typically, the initrd then asks the user for the password of the encrypted root file system, and uses it to set up the encrypted volume. No code authentication or TPM measurements take place. (A rough sketch of this step follows below, after this list.)

  5. The initrd then transitions into the root file system. No code authentication or TPM measurements take place.

  6. When the OS itself is up the user is prompted for their user name, and their password. If correct, this will unlock the user account: the system is now ready to use. At this point no code authentication, no TPM measurements take place. Moreover, the user's password is not used to unlock any data, it's used only to allow or deny the login attempt — the user's data has already been decrypted a long time ago, by the initrd, as mentioned above.
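
As a rough sketch of what steps 4 and 5 boil down to (device and mapper names hypothetical):

cryptsetup open /dev/nvme0n1p3 luks-root   # prompts for the FDE passphrase
mount /dev/mapper/luks-root /sysroot       # set up the decrypted root fs
exec switch_root /sysroot /sbin/init       # transition into the root fs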

What you'll notice here of course is that code validation happens for the shim, the boot loader and the kernel, but no longer for the initrd or the main OS code. TPM measurements might go one step further: the initrd is sometimes measured too, if you are lucky. Moreover, you might notice that the disk encryption password and the user password are queried by code that is not validated, and is thus not safe from external manipulation. You might also notice that even though TPM measurements of boot loader/OS components are done, nothing in the typical setup actually ever makes use of the resulting PCRs.

Attack Scenarios

Of course, before determining whether the setup described above makes sense or not, one should have an idea what one actually intends to protect against.

The most basic attack scenario to focus on is probably that you want to be reasonably sure that if someone steals your laptop that contains all your data then this data remains confidential. The model described above probably delivers that to some degree: the full disk encryption when used with a reasonably strong password should make it hard for the laptop thief to access the data. The data is as secure as the password used is strong. The attacker might attempt to brute force the password, thus if the password is not chosen carefully the attacker might be successful.

Two more interesting attack scenarios go something like this:

  1. Instead of stealing your laptop the attacker takes the harddisk from your laptop while you aren't watching (e.g. while you went for a walk and left it at home or in your hotel room), makes a copy of it, and then puts it back. You'll never notice they did that. The attacker then analyzes the data in their lab, maybe trying to brute force the password. In this scenario you won't even know that your data is at risk, because for you nothing changed — unlike in the basic scenario above. If the attacker manages to break your password they have full access to the data included on it, i.e. everything you stored on it so far, but not necessarily what you are going to store on it later. This scenario is worse than the basic one mentioned above, for the simple fact that you won't know that you might be attacked. (This scenario could be extended further: maybe the attacker has a chance to watch you type in your password or so, effectively lowering the password strength.)

  2. Instead of stealing your laptop the attacker takes the harddisk from your laptop while you aren't watching, inserts backdoor code on it, and puts it back. In this scenario you won't know your data is at risk, because physically everything is as before. What's really bad though is that the attacker gets access to anything you do on your laptop, both the data already on it, and whatever you will do in the future.

I think in particular this backdoor attack scenario is something we should be concerned about. We know for a fact that attacks like that happen all the time (Pegasus, industry espionage, …), hence we should make them hard.

Are we Safe?

So, does the scheme so far implemented by generic Linux distributions protect us against the latter two scenarios? Unfortunately not at all. Because distributions set up disk encryption the way they do, and only bind it to a user password, an attacker can easily duplicate the disk, and then attempt to brute force your password. What's worse: since code authentication ends at the kernel — and the initrd is not authenticated anymore — backdooring is trivially easy: an attacker can change the initrd any way they want, without having to fight any kind of protection. And given that FDE unlocking is implemented in the initrd, and it's the initrd that asks for the encryption password, things are just too easy: an attacker could insert some code that picks up the FDE password as you type it in and sends it wherever they want. And not just that: once they are in, they are in, and they can do anything they like for the rest of the system's lifecycle, with full privileges — including installing backdoors for versions of the OS or kernel that are installed on the device in the future, so that their backdoor remains open for as long as they like.

That is sad of course. It's particularly sad given that the other popular OSes all address this much better: ChromeOS, Android, Windows and MacOS all have way better built-in protections against attacks like this. And it's why one can certainly claim that your data is probably better protected right now if you store it on those OSes than it is on generic Linux distributions.

(Yeah, I know that there are some niche distros which do this better, and some hackers hack their own. But I care about general purpose distros here, i.e. the big ones, that most people base their work on.)

Note that there are more problems with the current setup. For example, it's really weird that during boot the user is queried for an FDE password which actually protects their data, and then once the system is up they are queried again – now asking for a username, and another password. And the weird thing is that this second authentication that appears to be user-focused doesn't really protect the user's data anymore — at that moment the data is already unlocked and accessible. The username/password query is supposed to be useful in multi-user scenarios of course, but how does that make any sense, given that these multiple users would all have to know a disk encryption password that unlocks the whole thing during the FDE step, and thus they have access to every user's data anyway if they make an offline copy of the harddisk?

Can we do better?

Of course we can, and that is what this story is actually supposed to be about.

Let's first figure out what the minimal issues we should fix are (at least in my humble opinion):

  1. The initrd must be authenticated before being booted into. (And measured unconditionally.)

  2. The OS binary resources (i.e. /usr/) must be authenticated before being booted into. (But don't need to be encrypted, since everyone has the same anyway, there's nothing to hide here.)

  3. The OS configuration and state (i.e. /etc/ and /var/) must be encrypted, and authenticated before they are used. The encryption key should be bound to the TPM device; i.e. system data should be locked to a security concept belonging to the system, not the user.

  4. The user's home directory (i.e. /home/lennart/ and similar) must be encrypted and authenticated. The unlocking key should be bound to a user password or user security token (FIDO2 or PKCS#11 token); i.e. user data should be locked to a security concept belonging to the user, not the system.

Or to summarize this differently:

  1. Every single component of the boot process and OS needs to be authenticated, i.e. all of shim (done), boot loader (done), kernel (done), initrd (missing so far), OS binary resources (missing so far), OS configuration and state (missing so far), the user's home (missing so far).

  2. Encryption is necessary for the OS configuration and state (bound to TPM), and for the user's home directory (bound to a user password or user security token).

In Detail

Let's see how we can achieve the above in more detail.

How to Authenticate the initrd

At the moment initrds are generated on the installed host via scripts (dracut and similar) that try to figure out a minimal set of binaries and configuration data to build an initrd that contains just enough to be able to find and set up the root file system. What is included in the initrd hence depends highly on the individual installation and its configuration, and pretty likely no two initrds generated that way will be fully identical. This model clearly has benefits: the initrds generated this way are very small and minimal, and support exactly what is necessary for the system to boot, nothing less and nothing more. It comes with serious drawbacks too though: the generation process is fragile and sometimes more akin to black magic than to following clear rules: the generator script has to natively understand a myriad of storage stacks to determine what needs to be included and what not. It also means that authenticating the image is hard: given that each individual host gets a different, specialized initrd, we cannot just sign the initrd with the vendor key the way we sign the kernel. If we want to keep this design we'd have to figure out some other mechanism (e.g. a per-host signature key that is generated locally, or authenticating the initrd with a message authentication code bound to the TPM). While these approaches are certainly thinkable, I am not convinced they are actually a good idea: locally and dynamically generated per-host initrds are something we should probably move away from.

If we move away from locally generated initrds, things become a lot simpler. If the distribution vendor generates the initrds on their build systems then they can be attached to the kernel image itself, and thus be signed and measured along with the kernel image, without any further work. This simplicity is simply lovely. Besides robustness and reproducibility this gives us an easy route to authenticated initrds.

But of course, nothing is really that simple: working with vendor-generated initrds means that we can't adjust them anymore to the specifics of the individual host. If we pre-build the initrds and include them in the kernel image in immutable fashion then it becomes harder to support complex, more exotic storage, or to parameterize things with local network server information, credentials, passwords, and so on. Now, for my simple laptop use-case these things don't matter, there's no need to extend/parameterize things, laptops and their setups are not that wildly different. But what to do about the cases where we want both: extensibility to cover for less common storage subsystems (iscsi, LVM, multipath, drivers for exotic hardware…) and parameterization?

Here's a proposal for how to achieve that: let's build a basic initrd into the kernel as suggested, but then do two things to make this scheme both extensible and parameterizable, without compromising security.

  1. Let's define a way how the basic initrd can be extended with additional files, which are stored in separate "extension images". The basic initrd should be able to discover these extension images, authenticate them and then activate them, thus extending the initrd with additional resources on-the-fly.

  2. Let's define a way how we can safely pass additional parameters to the kernel/initrd (and actually the rest of the OS, too) in an authenticated (and possibly encrypted) fashion. Parameters in this context can be anything specific to the local installation, i.e. server information, security credentials, certificates, SSH server keys, or even just the root password that shall be able to unlock the root account in the initrd …

In such a scheme we should be able to deliver everything we are looking for:

  1. We'll have a full trust chain for the code: the boot loader will authenticate and measure the kernel and basic initrd. The initrd extension images will then be authenticated by the basic initrd image.

  2. We'll have authentication for all the parameters passed to the initrd.

This so far sounds very unspecific? Let's make it more specific by looking closer at the components I'd suggest using for this logic:

  1. For a few months now the systemd suite has contained a subsystem implementing system extensions (v248). System extensions are ultimately just disk images (for example a squashfs file system in a GPT envelope) that can extend an underlying OS tree. Extending in this regard means they simply add additional files and directories into the OS tree, i.e. below /usr/. For a longer explanation see systemd-sysext(8). When a system extension is activated it is simply mounted and then merged into the main /usr/ tree via a read-only overlayfs mount. What's particularly nice about them in the context we are talking about here is that the extension images may carry dm-verity authentication data, and PKCS#7 signatures (once this is merged, that is, i.e. v250).

  2. The systemd suite also contains a concept called service "credentials". These are small pieces of information passed to services in a secure way. One key feature of these credentials is that they can be encrypted and authenticated in a very simple way with a key bound to the TPM (v250). See LoadCredentialEncrypted= and systemd-creds(1) for details. They are great for safely storing SSL private keys and similar on your system, but they also come in handy for parameterizing initrds: an encrypted credential is just a file that can only be decoded if the right TPM is around with the right PCR values set.

  3. The systemd suite contains a component called systemd-stub(7). It's an EFI stub, i.e. a small piece of code that is attached to a kernel image, and turns the kernel image into a regular EFI binary that can be directly executed by the firmware (or a boot loader). This stub has a number of nice features (for example, it can show a boot splash before invoking the Linux kernel itself and such). Once this work is merged (v250) the stub will support one more feature: it will automatically search for system extension image files and credential files next to the kernel image file, measure them and pass them on to the main initrd of the host.

Putting this together we have a nice way to provide fully authenticated kernel images, initrd images and initrd extension images, as well as encrypted and authenticated parameters via the credentials logic.

How would a distribution actually make use of this? A distribution vendor would pre-build the basic initrd, glue it into the kernel image, and sign that as a whole. Then, for each supported extension of the basic initrd (e.g. one for iscsi support, one for LVM, one for multipath, …), the vendor would use a tool such as mkosi to build an extension image, i.e. a GPT disk image containing the files in squashfs format, a Verity partition that authenticates it, plus a PKCS#7 signature partition that validates the root hash of the dm-verity partition and that can be checked against a key provided by the boot loader or main initrd. Then, any parameters for the initrd would be encrypted using systemd-creds encrypt -T. The resulting encrypted credentials and the initrd extension images are then simply placed next to the kernel image in the ESP (or boot partition). Done.
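
A sketch of the credential half of that workflow (file and credential names are made up, and the exact location the stub picks credentials up from is an assumption based on the pending v250 work):

# encrypt a host-specific parameter with a key bound to the local TPM2 chip
echo -n "secret" | systemd-creds encrypt -T --name=rootpw - rootpw.cred
# drop it next to the kernel image in the ESP for the stub to discover
cp rootpw.cred /efi/EFI/Linux/kernel.efi.extra.d/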

This checks all boxes: everything is authenticated and measured, the credentials also encrypted. Things remain extensible and modular, can be pre-built by the vendor, and installation is as simple as dropping in one file for each extension and/or credential.

How to Authenticate the Binary OS Resources

Let's now have a look at how to authenticate the Binary OS resources, i.e. the stuff you find in /usr/, i.e. the stuff traditionally shipped to the user's system via RPMs or DEBs.

I think there are three relevant ways to authenticate this:

  1. Make /usr/ a dm-verity volume. dm-verity is a concept implemented in the Linux kernel that provides authenticity to read-only block devices: every read access is cryptographically verified against a top-level hash value. This top-level hash is typically a 256-bit value that you can either encode in the kernel image you are using, or cryptographically sign (which is particularly nice once this is merged). I think this is actually the best approach since it makes the /usr/ tree entirely immutable in a very simple way. However, it also means that the whole of /usr/ needs to be updated at once, i.e. the traditional rpm/apt based update logic cannot work in this mode.

  2. Make /usr/ a dm-integrity volume. dm-integrity is a concept provided by the Linux kernel that offers integrity guarantees to writable block devices, i.e. in some ways it can be considered to be a bit like dm-verity while permitting write access. It can be used in three ways, one of which I think is particularly relevant here. The first way is with a simple hash function in "stand-alone" mode: this is not too interesting here, it just provides greater data safety for file systems that don't hash check their files' data on their own. The second way is in combination with dm-crypt, i.e. with disk encryption. In this case it adds authenticity to confidentiality: only if you know the right secret you can read and make changes to the data, and any attempt to make changes without knowing this secret key will be detected as IO error on next read by those in possession of the secret (more about this below). The third way is the one I think is most interesting here: in "stand-alone" mode, but with a keyed hash function (e.g. HMAC). What's this good for? This provides authenticity without encryption: if you make changes to the disk without knowing the secret this will be noticed on the next read attempt of the data and result in IO errors. This mode provides what we want (authenticity) and doesn't do what we don't need (encryption). Of course, the secret key for the HMAC must be provided somehow, I think ideally by the TPM.

  3. Make /usr/ a dm-crypt (LUKS) + dm-integrity volume. This provides both authenticity and encryption. The latter isn't typically needed for /usr/ given that it generally contains no secret data: anyone can download the binaries off the Internet anyway, and the sources too. By encrypting this you'll waste CPU cycles, but beyond that it doesn't hurt much. (Admittedly, some people might want to hide the precise set of packages they have installed, since it of course does reveal a bit of information about you: i.e. what you are working on, maybe what your job is – think: if you are a hacker you have hacking tools installed – and similar). Going this way might simplify things in some cases, as it means you don't have to distinguish "OS binary resources" (i.e /usr/) and "OS configuration and state" (i.e. /etc/ + /var/, see below), and just make it the same volume. Here too, the secret key must be provided somehow, I think ideally by the TPM.

All three approaches are valid. The first approach has my primary sympathies, but for distributions not willing to abandon client-side updates via RPM/dpkg it is not an option, in which case I would propose one of the other two approaches.
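
For the first approach, a minimal veritysetup sketch (device names hypothetical; <roothash> is the value printed by the format step, which is what you'd embed in, or sign along with, the kernel image):

veritysetup format /dev/vg/usr /dev/vg/usr-hash    # build the hash tree, prints the root hash
veritysetup open /dev/vg/usr usr /dev/vg/usr-hash <roothash>
mount -o ro /dev/mapper/usr /usr                   # every read is now verified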

The LUKS encryption key (and, in the case of dm-integrity stand-alone mode, the key for the keyed hash function) should be bound to the TPM. Why the TPM for this? You could also use a user password, or a FIDO2 or PKCS#11 security token — but I think the TPM is the right choice, because it reduces the need for repeated authentication, i.e. having to first provide the disk encryption password and then log in with another password. It should be possible for the system to boot up unattended, with only one authentication prompt then needed to unlock the user's data properly. The TPM provides a way to do this in a reasonably safe and fully unattended way. Also, when we stop considering just the laptop use-case for a moment: on servers interactive disk encryption prompts don't make much sense — the fact that TPMs can provide secrets without requiring user interaction, and thus work in entirely unattended environments, is quite desirable. Note that crypttab(5) as implemented by systemd (v248) provides native support for authentication via password, TPM2, PKCS#11 or FIDO2, so the choice is ultimately all yours.
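
For illustration, crypttab(5) lines for such TPM2- or token-bound volumes might look like this (UUIDs hypothetical):

# /etc/crypttab: name, device, key file, options
root  UUID=aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee  -  tpm2-device=auto
data  UUID=ffffffff-0000-1111-2222-333333333333  -  fido2-device=auto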

How to Encrypt/Authenticate OS Configuration and State

Let's now look at the OS configuration and state, i.e. the stuff in /etc/ and /var/. It probably makes sense to not consider these two hierarchies independently but instead just consider this to be the root file system. If the OS binary resources are in a separate file system it is then mounted onto the /usr/ sub-directory of the root file system.

The OS configuration and state (or: root file system) should be both encrypted and authenticated: it might contain secret keys, user passwords, privileged logs and similar. This data matters, and plenty of it should remain confidential.

The encryption of choice here is dm-crypt (LUKS) + dm-integrity, similar to what was discussed above, again with the key bound to the TPM.

If the OS binary resources are protected the same way, it is safe to merge these two volumes and have a single partition for both (see above).

How to Encrypt/Authenticate the User's Home Directory

The data in the user's home directory should be encrypted, and bound to the user's preferred token of authentication (i.e. a password or FIDO2/PKCS#11 security token). As mentioned, in the traditional mode of operation the user's home directory is not individually encrypted, but only encrypted because FDE is in use. The encryption key for that is a system-wide key though, not a per-user key. And I think that's a problem, as mentioned (and probably not even generally understood by our users). We should correct that and ensure that the user's password is what unlocks the user's data.

In the systemd suite we provide a service systemd-homed(8) (v245) that implements this in a safe way: each user gets their own LUKS volume stored in a loopback file in /home/, and this is enough to synthesize a user account. The encryption password for this volume is the user's account password, thus it's really the password provided at login time that unlocks the user's data. systemd-homed also supports other mechanisms of authentication, in particular PKCS#11/FIDO2 security tokens. It also provides support for other storage back-ends (such as fscrypt), but I'd always suggest using the LUKS back-end since it's the only one providing the comprehensive confidentiality guarantees one wants for a UNIX-style home directory.
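
A minimal sketch of setting up such an account (user name and token choice illustrative):

# create a user whose home directory is a per-user LUKS volume in /home/
homectl create lennart --storage=luks
# or additionally bind unlocking to a FIDO2 security token
homectl create lennart --storage=luks --fido2-device=auto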

Note that there's one special caveat here: if the user's home directory (e.g. /home/lennart/) is encrypted and authenticated, what about the file system this data is stored on, i.e. /home/ itself? If that directory is part of the root file system this would result in double encryption: first the data is encrypted with the TPM root file system key, and then again with the per-user key. Such double encryption is a waste of resources, and unnecessary. I'd thus suggest making /home/ its own dm-integrity volume with a HMAC, keyed by the TPM. This means the data stored directly in /home/ will be authenticated but not encrypted. That's good not only for performance, but also has practical benefits: it allows extracting the encrypted volumes of the various users in case the TPM key is lost, as a way to recover from dead laptops or similar.

Why authenticate /home/, if it only contains per-user home directories that are authenticated on their own anyway? That's a valid question: it's because the kernel file system maintainers made clear that Linux file system code is not considered safe against rogue disk images, and is not tested for that; this means before you mount anything you need to establish trust in some way because otherwise there's a risk that the act of mounting might exploit your kernel.
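
What such an HMAC-keyed /home/ volume could look like with integritysetup (device name and key handling simplified; in the proposed scheme the key would come from the TPM rather than a file):

# format /home/ as a stand-alone dm-integrity volume with a keyed hash
integritysetup format /dev/vg/home --integrity hmac-sha256 \
    --integrity-key-file=/run/home.key --integrity-key-size=32
integritysetup open /dev/vg/home home --integrity hmac-sha256 \
    --integrity-key-file=/run/home.key --integrity-key-size=32
mount /dev/mapper/home /home   # tampering now shows up as read errors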

Summary of Resources and their Protections

So, let's now put this all together. Here's a table showing the various resources we deal with, and how I think they should be protected (in my idealized world).

Resource | Needs Authentication | Needs Encryption | Suggested Technology | Validation/Encryption Keys/Certificates acquired via | Stored where
Shim | yes | no | SecureBoot signature verification | firmware certificate database | ESP
Boot loader | yes | no | ditto | firmware certificate database/shim | ESP/boot partition
Kernel | yes | no | ditto | ditto | ditto
initrd | yes | no | ditto | ditto | ditto
initrd parameters | yes | yes | systemd TPM encrypted credentials | TPM | ditto
initrd extensions | yes | no | systemd-sysext with Verity+PKCS#7 signatures | firmware/initrd certificate database | ditto
OS binary resources | yes | no | dm-verity | root hash linked into kernel image, or firmware/initrd certificate database | top-level partition
OS configuration and state | yes | yes | dm-crypt (LUKS) + dm-integrity | TPM | top-level partition
/home/ itself | yes | no | dm-integrity with HMAC | TPM | top-level partition
User home directories | yes | yes | dm-crypt (LUKS) + dm-integrity in loopback files | User password/FIDO2/PKCS#11 security token | loopback file inside /home partition

This should provide all the desired guarantees: everything is authenticated, and the individualized per-host or per-user data is also encrypted. No double encryption takes place. The encryption keys/verification certificates are stored/bound to the most appropriate infrastructure.

Does this address the three attack scenarios mentioned earlier? I think so, yes. The basic attack scenario I described is addressed by the fact that /var/, /etc/ and /home/*/ are encrypted. Brute forcing the former two is harder than in the status quo ante model, since a high-entropy key is used instead of one derived from a user-provided password. Moreover, the "anti-hammering" logic of the TPM will make brute forcing prohibitively slow. The home directories are protected by the user's password or ideally a personal FIDO2/PKCS#11 security token in this model. Of course, a password isn't better security-wise than in the status quo ante. But given the FIDO2/PKCS#11 support built into systemd-homed it should be easier to lock down the home directories securely.

Binding encryption of /var/ and /etc/ to the TPM also addresses the first of the two more advanced attack scenarios: a copy of the harddisk is useless without the physical TPM chip, since the seed key is sealed into it. (And even if the attacker had the chance to watch you type in your password, it won't help unless they have access to the TPM chip.) For the home directory this attack is not addressed as long as a plain password is used. However, since binding home directories to FIDO2/PKCS#11 tokens is built into systemd-homed, things should be safe here too — provided the user actually possesses and uses such a device.

The backdoor attack scenario is addressed by the fact that every resource in play now is authenticated: it's hard to backdoor the OS if there's no component that isn't verified by signature keys or TPM secrets the attacker hopefully doesn't know.

For general purpose distributions that focus on updating the OS per RPM/dpkg the idealized model above won't work out, since (as mentioned) this implies an immutable /usr/, and thus requires updating /usr/ via an atomic update operation. For such distros a setup like the following is probably more realistic, but see above.

Resource | Needs Authentication | Needs Encryption | Suggested Technology | Validation/Encryption Keys/Certificates acquired via | Stored where
Shim | yes | no | SecureBoot signature verification | firmware certificate database | ESP
Boot loader | yes | no | ditto | firmware certificate database/shim | ESP/boot partition
Kernel | yes | no | ditto | ditto | ditto
initrd | yes | no | ditto | ditto | ditto
initrd parameters | yes | yes | systemd TPM encrypted credentials | TPM | ditto
initrd extensions | yes | no | systemd-sysext with Verity+PKCS#7 signatures | firmware/initrd certificate database | ditto
OS binary resources, configuration and state | yes | yes | dm-crypt (LUKS) + dm-integrity | TPM | top-level partition
/home/ itself | yes | no | dm-integrity with HMAC | TPM | top-level partition
User home directories | yes | yes | dm-crypt (LUKS) + dm-integrity in loopback files | User password/FIDO2/PKCS#11 security token | loopback file inside /home partition

This means there's only one root file system that contains all of /etc/, /var/ and /usr/.

Recovery Keys

When binding encryption to TPMs one problem that arises is what strategy to adopt if the TPM is lost, due to hardware failure: if I need the TPM to unlock my encrypted volume, what do I do if I need the data but lost the TPM?

The answer here is supporting recovery keys (this is similar to how other OSes approach this). Recovery keys are pretty much the same concept as passwords, the main difference being that they are computer-generated rather than user-chosen. Because of that they typically have much higher entropy (which also makes them more annoying to type in, i.e. you want to use them only when you must, not day-to-day). By having higher entropy they are useful in combination with TPM, FIDO2 or PKCS#11 based unlocking: unlike a combination with passwords they do not compromise the higher strength of protection that TPM/FIDO2/PKCS#11 based unlocking is supposed to provide.

Current versions of systemd-cryptenroll(1) implement a recovery key concept in an attempt to address this problem. You may enroll any combination of TPM chips, PKCS#11 tokens, FIDO2 tokens, recovery keys and passwords on the same LUKS volume. When enrolling a recovery key it is generated and shown on screen both in text form and as a QR code you can scan off the screen if you like. The idea is to write down/store this recovery key in a safe place so that you can use it when you need it. Note that such recovery keys can be entered wherever a LUKS password is requested, i.e. after generation they behave pretty much the same as a regular password.
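
For example (device name hypothetical):

# enroll a TPM2 binding plus a computer-generated recovery key
systemd-cryptenroll --tpm2-device=auto /dev/nvme0n1p3
systemd-cryptenroll --recovery-key /dev/nvme0n1p3
# the second command prints the new key both as text and as a QR code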

TPM PCR Brittleness

Locking devices to TPMs and enforcing a PCR policy with this (i.e. configuring the TPM key to be unlockable only if certain PCRs match certain values, and thus requiring the OS to be in a certain state) brings a problem with it: TPM PCR brittleness. If the key you want to unlock with the TPM requires the OS to be in a specific state (i.e. that all OS components' hashes match certain expectations or similar) then doing OS updates might have the effect of making your key inaccessible: the OS updates will cause the code to change, and thus the hashes of the code, and thus certain PCRs. (Thankfully, you enrolled a recovery key, as described above, so this doesn't mean you lost your data, right?)

To address this I'd suggest three strategies:

  1. Most importantly: don't actually use the TPM PCRs that contain code hashes. There are multiple PCRs defined, each containing measurements of different aspects of the boot process. My recommendation is to bind keys to PCR 7 only, a PCR that contains measurements of the UEFI SecureBoot certificate databases. Thus, the keys will remain accessible as long as these databases remain the same, and updates to code will not affect them (updates to the certificate databases will, and they do happen too, though hopefully much less frequently than code updates). Does this reduce security? Not much, no, because the code that's run is after all not just measured but also validated via code signatures, and those signatures are validated with the aforementioned certificate databases. Thus binding an encrypted TPM key to PCR 7 should enforce a similar level of trust in the boot/OS code as binding it to a PCR with hashes of specific versions of that code: using PCR 7 means you say "every code signed by these vendors is allowed to unlock my key", while using a PCR that contains code hashes means "only this exact version of my code may access my key". (A concrete sketch follows below, after this list.)

  2. Use LUKS key management to enroll multiple versions of the TPM keys in the relevant volumes, to support multiple versions of the OS code (or multiple versions of the certificate database, as discussed above). Specifically: whenever an update is done that might result in changing the relevant PCRs, pre-calculate the new PCRs, and enroll them in an additional LUKS slot on the relevant volumes. This means that the unlocking keys tied to the TPM remain accessible in both states of the system. Eventually, once rebooted after the update, remove the old slots.

  3. If these two strategies didn't work out (maybe because the OS/firmware was updated outside of OS control, or the update mechanism was aborted at the wrong time) and the TPM PCRs changed unexpectedly, and the user now needs to use their recovery key to get access to the OS back, let's handle this gracefully and automatically reenroll the current TPM PCRs at boot, after the recovery key checked out, so that for future boots everything is in order again.
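
As a concrete sketch of strategy 1 (device name hypothetical; PCR 7 is in fact the default systemd-cryptenroll uses):

# bind the TPM2 enrollment to PCR 7, the SecureBoot certificate database, only
systemd-cryptenroll --tpm2-device=auto --tpm2-pcrs=7 /dev/nvme0n1p3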

Other approaches can work too: for example, some OSes simply remove TPM PCR policy protection of disk encryption keys altogether immediately before OS or firmware updates, and then reenable it right after. Of course, this opens a time window where the key bound to the TPM is much less protected than people might assume. I'd try to avoid such a scheme if possible.

Anything Else?

So, given that we are talking about idealized systems: I personally actually think the ideal OS would be much simpler, and thus more secure than this:

I'd try to ditch the Shim, and instead focus on enrolling the distribution vendor keys directly in the UEFI firmware certificate list. This is actually supported by all firmwares too. This has various benefits: it's no longer necessary to bind everything to Microsoft's root key; you can just enroll your own keys and thus make sure only what you want to trust is trusted, and nothing else. To make an approach like this easier, we have been working on automatic enrollment of these keys from the systemd-boot boot loader; see this work in progress for details. This way the firmware will authenticate the boot loader/kernel/initrd without any further component in place.

I'd also not bother with a separate boot partition, and just use the ESP for everything. The ESP is required anyway by the firmware, and is good enough for storing the few files we need.

FAQ

Can I implement all of this in my distribution today?

Probably not. While the big issues have mostly been addressed there's a lot of integration work still missing. As you might have seen, I linked some PRs that haven't even been merged into our tree yet, let alone been released or entered the distributions.

Will this show up in Fedora/Debian/Ubuntu soon?

I don't know. I am making a proposal how these things might work, and am working on getting various building blocks for this into shape. What the distributions do is up to them. But even if they don't follow the recommendations I make 100%, or don't want to use the building blocks I propose I think it's important they start thinking about this, and yes, I think they should be thinking about defaulting to setups like this.

Work for measuring/signing initrds on Fedora has been started, here's a slide deck with some information about it.

But isn't a TPM evil?

Some corners of the community tried (unfortunately successfully to some degree) to paint TPMs/Trusted Computing/SecureBoot as generally evil technologies that stop us from using our systems the way we want. That idea is rubbish though, I think. We should focus on what these technologies can deliver for us (and that's a lot, I think, see above), and appreciate the fact that we can actually use them to kick perceived evil empires out of our devices instead of being subjected to them. Yes, the way SecureBoot/TPMs are defined puts you in the driver's seat if you want — you may enroll your own certificates to keep out everything you don't like.

What if my system doesn't have a TPM?

TPMs are becoming quite ubiquitous, in particular as the upcoming Windows versions will require them. In general I think we should focus on modern, fully equipped systems when designing all this, and then find fall-backs for more limited systems. Frankly it feels as if so far the design approach was the other way round: trying to make the new stuff work like the old rather than making the old work like the new (to me it appears this thinking is the main raison d'être for the Grub boot loader).

More specifically, on systems without a TPM we ultimately cannot provide the same security guarantees as on those that have one. So depending on the resource to protect we should fall back to different TPM-less mechanisms. For example, if there is no TPM then the root file system should probably be encrypted with a user-provided password, typed in at boot as before. And the boot credentials we probably should simply not encrypt, placing them in the ESP in plain text.

Effectively this means: without TPM you'll still get protection regarding the basic attack scenario, as before, but not the other two.

What if my system doesn't have UEFI?

Many of the mechanisms explained above taken individually do not require UEFI. But of course the chain of trust suggested above requires something like UEFI SecureBoot. If your system lacks UEFI it's probably best to find work-alikes to the technologies suggested above, but I doubt I'll be able to help you there.

rpm/dpkg already cryptographically validates all packages at installation time (gpg), why would I need more than that?

This type of package validation happens only once: at the moment of installation (or update) of the package, not when the installed data is actually used. Thus when an attacker manages to modify the package data after installation and before use, they can make any change they like without it ever being noticed. Such package download validation does address certain attack scenarios (i.e. man-in-the-middle attacks on network downloads), but it doesn't protect you from attackers with physical access, as described in the attack scenarios above.

Systems such as ostree aren't better than rpm/dpkg regarding this BTW, their data is not validated on use either, but only during download or when processing tree checkouts.

The key point here is that the scheme explained above provides offline protection for the data "at rest" — even someone with physical access to your device cannot easily make changes that aren't noticed on next use. rpm/dpkg/ostree provide online protection only: as long as the system remains up, all OS changes are done through the intended code paths, and no one has physical access, everything should be good. In today's world I am sure this is not good enough though. As mentioned, most modern OSes provide offline protection for the data at rest in one way or another. Generic Linux distributions are terribly behind on this.

This is all so desktop/laptop focused, what about servers?

I am pretty sure servers should provide similar security guarantees as outlined above. In a way servers are a much simpler case: there are no users and no interactivity. Thus the discussion of /home/ and what it contains and of user passwords doesn't matter. However, the authenticated initrd and the unattended TPM-based encryption I think are very important for servers too, in a trusted data center environment. It provides security guarantees so far not given by Linux server OSes.

I'd like to help with this, or discuss/comment on this

Submit patches or reviews through GitHub. General discussion about this is best done on the systemd mailing list.

September 20, 2021

ES: But Why

I got a request recently to fix up the WebGL Aquarium demo. I’ve had this bookmarked for a while since it’s one of the only test cases for GL_EXT_multisampled_render_to_texture I’m aware of, at least when running Chrome in EGL mode.

Naturally, I decided to do both at once since this would be yet another extension that no native desktop driver in Mesa currently supports.

Transient Multisampling

The idea behind this extension is that on tilers, a single-sample image can be temporarily treated as multisampled without needing the extra memory bandwidth that multisampled images require. The multisampled rendering image is transient, meaning it isn’t loaded or written to, so it can be lazily allocated and then discarded.

Vulkan has similar mechanisms: loadOp set to VK_ATTACHMENT_LOAD_OP_DONT_CARE and storeOp set to VK_ATTACHMENT_STORE_OP_DONT_CARE, with an image allocated using ZINK_HEAP_DEVICE_LOCAL_LAZY and VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT, then adding a resolve attachment for writeout. Simple enough.

Unfortunately, this doesn’t quite work.

As Danylo Piliaiev, RenderDoc master and Finder Of Bad Pixels, was quick to point out, this approach will discard any previous data the texture had, resulting in bad pixels. Lots of bad pixels, in fact.

The solution, which I hate, is that the “transient” multisampled image now gets a full-texture draw before its “transient” renderpass to initialize it, then is loaded with VK_ATTACHMENT_LOAD_OP_LOAD, effectively making it not transient at all.

Sorry tilers.

No frames for you.

aquarium.gif

But it does work and give solid performance (-20% or so while recording smh screen recorders git gud), so I can’t be too mad.

September 16, 2021

I am back with another status update on raytracing in RADV. And the good news is that things are finally starting to come together. After ~9 months of on-and-off work we now have games working with raytracing. The first game to work, right after getting all the required functionality in place, was Control:

Control with raytracing on RADV

After poking for a long time at CTS and demos it is really nice to see the fruits of one’s work.

The piece that I added recently was copy/compaction and serialization of acceleration structures, which meant a bunch of shader writing, handling another type of query and dealing with indirect dispatches. (Since of course the API doesn’t give the input size on the CPU. No idea how this API should be secured …)

What games?

I did try 5 games:

  1. Quake 2 RTX (Vulkan): works. This was working already on my previous update.
  2. Control (D3D): works. Pretty much just works. Runs at maybe 30-50% of RT performance on Windows.
  3. Metro Exodus (Vulkan): works. Needs one workaround and is very finicky in WSI but otherwise works fine. Runs at 20-25% of RT performance on Windows.
  4. Ghostrunner (D3D): Does not work. This really needs per-shadergroup compilation instead of just mashing all the shaders together, as I currently get shaders with 1 million NIR instructions, which is a pain to debug.
  5. Doom Eternal (Vulkan): Does not work. The raytracing option in the menu stays grayed out and at this point I’m at a loss what is required to make the game allow enabling RT.

If anybody could tell me how to get Doom Eternal to allow RT I’d appreciate it.

What is next?

Of course the support is far from done. Some things to still make progress on:

  1. Upstreaming what I have. Samuel has been busy reviewing my MRs and I think there is a good chance that what I have now will make it into 21.3.
  2. Improve the pipeline compilation model to hopefully make Ghostrunner work.
  3. Improved BVH building. The current BVH is really naive, which is likely one of the big performance factors.
  4. Improve traversal.
  5. Move on to stuff needed for DXR 1.1 like VK_KHR_ray_query.

P.S. If you haven’t seen it yet, Jason Ekstrand from Intel recently gave a talk about how Intel implements raytracing. Nice showcase of how you can provide some more involved hardware implementation than RDNA2 does.

Been some time since my last update, so I felt it was time to flex my blog writing muscles again and provide updates on some of the things we are working on in Fedora in preparation for Fedora Workstation 35. This is not meant to be a comprehensive what's-new article about Fedora Workstation 35, more a listing of some of the things we are doing as part of the Red Hat desktop team.

NVidia support for Wayland
One thing we spent a lot of effort on for a long time now is getting full support for the NVidia binary driver under Wayland. It has been a recurring topic in our bi-weekly calls with the NVidia engineering team ever since we started looking at moving to Wayland. There has been basic binary driver support for some time, meaning you could run a native Wayland session on top of the binary driver, but the critical missing piece was that you could not get support for accelerated graphics when running applications through XWayland, our X.org compatibility layer. Which basically meant that any application requiring 3D support and which wasn’t a native Wayland application yet wouldn’t work. So over the last months we have had a great collaboration with NVidia around closing this gap, with them working closely with us on fixing issues in their driver while we have been fixing bugs and missing pieces in the rest of the stack. We have been reporting and discussing issues back and forth, allowing a very quick turnaround on issues as we find them, which of course all resulted in the NVidia 470.42.01 driver with XWayland support. I am sure we will find new corner cases that need to be resolved in the coming months, but I am equally sure we will be able to quickly resolve them due to the close collaboration we have now established with NVidia. And I know some people will wonder why we spent so much time working with NVidia around their binary driver, but the reality is that NVidia is the market leader, especially in the professional Linux workstation space, and there are a lot of people who would either end up not using Linux or using Linux with X without it, including a lot of Red Hat customers and Fedora users. And that is what I and my team are here for at the end of the day, to make sure Red Hat customers are able to get their job done using their Linux systems.

Lightweight kiosk mode
One of the wonderful things about open source is the constant flow of code and innovation between all the different parts of the ecosystem. For instance, one thing we on the RHEL side have often been asked about over the last few years is a lightweight and simple-to-use solution for people wanting to run single-application setups, like information boards, ATM machines, cash registers, information kiosks and so on. For many use cases people felt that running a full GNOME 3 desktop underneath their application was either too resource hungry or created a risk that people would accidentally end up in the desktop session. At the same time, from our viewpoint as a development team we didn’t want a completely separate stack for this use case, as that would just increase our maintenance burden by making us do a lot of things twice. So to solve this problem Ray Strode spent some time writing what we call GNOME Kiosk mode, which makes setting up a simple session running a single application easy, without running things like the GNOME shell, tracker, evolution etc. This gives you a window manager with full support for the latest technologies such as compositing, libinput and Wayland, but coming in at about 18MB, which is about 71MB less than a minimal GNOME 3 desktop session. You can read more about the new Kiosk mode and how to use it in this great blog post from our savvy Edge Computing Product Manager Ben Breard. The kiosk mode session described in Ben’s article about RHEL will be available with Fedora Workstation 35.

High-definition mouse wheel support
A major part of what we do is making sure that Red Hat Enterprise Linux customers and Fedora users get hardware support on par with what you find on other operating systems. We try our best to work with our hardware partners, like Lenovo, to ensure that such hardware support lands at the same time as those features are enabled on other systems, but some things end up taking longer for various reasons. Support for high-definition mouse wheels was one of those. Peter Hutterer, our resident input expert, put together a great blog post explaining the history and status of high-definition mouse wheel support. As Peter points out in his blog post, the feature is not yet fully supported under Wayland, but we hope to close that gap in time for Fedora Workstation 35.

Mouse with HiRes scroll wheel

PipeWire
I feel I can't do one of these posts without talking about the latest developments in PipeWire, our unified audio and video server. Wim Taymans keeps working with the rapidly growing PipeWire community to fix issues as they are reported and add new features. Most recently Wim's focus has been on implementing S/PDIF passthrough over both S/PDIF and HDMI connections. This will allow us to send undecoded data over such connections, which is critical for working well with surround sound systems and soundbars. The PipeWire community has also been working hard on further improving the Bluetooth support, with battery status support for the headset profile using Apple extensions, and aptX-LL and FastStream codec support was added as well. And of course a huge amount of bug fixes; it turns out that when you replace two different sound servers that have been around for close to two decades there are a lot of corner cases to cover :). Make sure to check out the two latest release notes, for 0.3.35 and 0.3.36, for details.

Screenshot of EasyEffects

EasyEffects is a great example of a cool new application built with PipeWire

Privacy screen
Another feature that we have been working on as a result of our Lenovo partnership is privacy screen support. For those not familiar with this technology, it basically allows you to reduce the readability of your screen when viewed from the side, so that if you are using your laptop at a coffee shop, for instance, a person sitting close by will have a much harder time trying to read what is on your screen. Hans de Goede has been shepherding the kernel side of this forward, working with Marco Trevisan from Canonical on the userspace part of it (which also makes this a nice example of cross-company collaboration), allowing you to turn this feature on or off. This feature is not likely to fully land in time for Fedora Workstation 35, though, so we are looking at whether we will bring it in as an update to Fedora Workstation 35 or make it a Fedora Workstation 36 feature.

Penny

Zink inside the penny


As most of you know, the future of 3D graphics on Linux is the Vulkan API from the Khronos Group. This doesn't mean that OpenGL is going away anytime soon though, as there is a large host of applications out there using this API, and for certain types of 3D graphics development developers might still choose OpenGL over Vulkan. Of course for us that creates a bit of a challenge, because maintaining two 3D graphics interfaces is a lot of work, even with the great help and contributions from the hardware makers themselves. So we have been eyeing the Zink project for a while, which aims at re-implementing OpenGL on top of Vulkan, as a potential candidate for solving our long-term need to support the OpenGL API without drowning us in work while doing so. The big advantage of Zink is that it allows us to support one shared OpenGL implementation across all hardware and then focus our HW support efforts on the Vulkan drivers. As part of this effort Adam Jackson has been working on a project called Penny.

Zink implements OpenGL in terms of Vulkan, as far as the drawing itself is concerned, but presenting that drawing to the rest of the system is currently system-specific (GLX). For hardware that already has a Mesa driver, we use GBM. On NVIDIA's Vulkan (and probably any other binary stack on Linux, and probably also environments like WSL or macOS + MoltenVK) we download the image from the GPU back to the CPU and then use the same software upload/display path as llvmpipe, which as you can imagine is Not Fast.

Penny aims to extend Zink by replacing both of those paths, and instead using the various Vulkan WSI extensions to manage presentation. Even for the GBM case this should enable higher performance since zink will have more information about the rendering pipeline (multisampling in particular is poorly handled atm). Future window system integration work can focus on Vulkan, with EGL and GLX getting features “for free” once they’re enabled in Vulkan.

3rd party software cleanup
Over time we have been working on adding more and more 3rd party software for easy consumption in Fedora Workstation. The problem we discovered, though, was that due to this being done over time, with changing requirements and expectations, the functionality was not behaving in a very intuitive way, and there were also new questions that needed to be answered. So Allan Day and Owen Taylor spent some time this cycle reviewing all the bits and pieces of this functionality and worked to clean it up. The goal is that when you enable third-party repositories in Fedora Workstation 35 it behaves in a much more predictable and understandable way and also includes a lot of applications from Flathub. Yes, that is correct: you should be able to install a lot of applications from Flathub in Fedora Workstation 35 without having to first visit the Flathub website to enable it; instead they will show up once you have turned the knob for general 3rd party application support.

Power profiles
Another item we spent quite a bit of time on for Fedora Workstation 35 is making sure we integrate the Power Profiles work that Bastien Nocera has been doing as part of our collaboration with Lenovo. Power Profiles is basically a feature that allows your system to behave in a smarter way when it comes to power consumption and thus prolongs your battery life. So for instance when we notice you are getting low on battery we can offer to switch you into a strong power-saving mode to prolong how long you can use the system until you can recharge. There is a more in-depth explanation of Power Profiles in the official README.
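
If you are curious how applications can hook into this, power-profiles-daemon exposes its state over D-Bus. Below is a minimal sketch (using GDBus, and assuming the net.hadess.PowerProfiles bus name, object path and interface the daemon currently uses; check the README for the authoritative interface) that reads the active profile and then switches to power-saver:

/* Hedged sketch: read and set the active profile of power-profiles-daemon
 * over D-Bus. Assumes the net.hadess.PowerProfiles name/object/interface.
 * Build with: gcc sketch.c $(pkg-config --cflags --libs gio-2.0) */
#include <gio/gio.h>

int
main(void)
{
   GError *error = NULL;
   GDBusConnection *bus = g_bus_get_sync(G_BUS_TYPE_SYSTEM, NULL, &error);
   if (!bus)
      g_error("system bus: %s", error->message);

   /* Read the ActiveProfile property: "power-saver", "balanced" or "performance" */
   GVariant *reply = g_dbus_connection_call_sync(bus,
      "net.hadess.PowerProfiles", "/net/hadess/PowerProfiles",
      "org.freedesktop.DBus.Properties", "Get",
      g_variant_new("(ss)", "net.hadess.PowerProfiles", "ActiveProfile"),
      G_VARIANT_TYPE("(v)"), G_DBUS_CALL_FLAGS_NONE, -1, NULL, &error);
   if (reply) {
      GVariant *value;
      g_variant_get(reply, "(v)", &value);
      g_print("active profile: %s\n", g_variant_get_string(value, NULL));
      g_variant_unref(value);
      g_variant_unref(reply);
   }

   /* Ask the daemon to switch to the power-saver profile */
   reply = g_dbus_connection_call_sync(bus,
      "net.hadess.PowerProfiles", "/net/hadess/PowerProfiles",
      "org.freedesktop.DBus.Properties", "Set",
      g_variant_new("(ssv)", "net.hadess.PowerProfiles", "ActiveProfile",
                    g_variant_new_string("power-saver")),
      NULL, G_DBUS_CALL_FLAGS_NONE, -1, NULL, &error);
   if (reply)
      g_variant_unref(reply);
   g_object_unref(bus);
   return 0;
}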

Wayland
I usually end up talking about Wayland in my posts too, but I expect to be doing less of that going forward as we have now covered all the major gaps we saw between Wayland and X.org. Jonas Ådahl got the headless support merged, which was one of our big missing pieces, and as mentioned above Olivier Fourdan, Jonas and others worked with NVidia on getting the binary driver with XWayland support working with GNOME Shell. Of course, this being software, we are never truly done; there will be new issues discovered, random bugs that need to be fixed, and new features that need to be implemented. We already have our next big team focus in place: HDR support, which will need work from the graphics drivers, up through Mesa, into the window manager and the GUI toolkits, and in the applications themselves. We have been investigating and trying out some things for a while already, but we are now ready to make this a main focus for the team. In fact we will soon be posting a new job listing for a full-time engineer to work on HDR vertically through the stack, so keep an eye out for that if you are interested in working on this. The job will be open to candidates who wish to work remotely, so as long as Red Hat has a business presence in the country you live in, we should be able to offer you the job if you are the right candidate for us. Update: the job listing is now online for our HDR engineer.

BTW, if you want to see future updates and keep on top of other happenings from Fedora and Red Hat in the desktop space, make sure to follow me on Twitter.

September 13, 2021

Zink Is Over: This Time I’m Serious.

Look.

I know what you’re gonna say, and maybe I did just say zink was done a week or two ago.

I’m not saying I didn’t.

But that was practically last year at the speed with which zink’s codebase moves and its developer community sits in my office eating cookies between Mesa builds, and it was also before I set off on my journey to make the rest of those zany Phoronix benchmark games run instead of crashing or whatever.

What do we got left on that list anyway?

Metro: Last Light Redux

Oh you want some Metro? We got Metro at home.

metro.png

HITMAN

hitman.png

Agent 47, I’m gonna pretend I didn’t see that. Pull yourself together.

Basemark: High Settings

basemark.png

It’s uh… Mangohud’s slowing me down.

Bioshock Infinite

I bet you’re wondering where this one is, huh.

Warhammer 40,000: Dawn of War

dow3-fail.png

Easy as that, ju—Wait, what?

This game requires ARB_bindless_texture just to run? Is this a joke? Even fucking DOOM 2016, the final boss of OpenGL, doesn’t require bindless textures.

Fine.

Totally fine.

Not at all a problem, and I’m sure it’ll be easy to do.

Definitely no reason why only two Mesa drivers total implement it other than it being some trivial switch that everyone forgot to flip, right?

Probably just a config value here, or maybe a couple lines of code there…

Ignore all the validation errors because descriptor indexing isn’t accurately supported…

Add some null checks…

Fire up ASAN to fix a random stack explosion

File a piglit ticket because two of the eighty unit tests for the extension are bugged and these are quite literally the only unit tests available…

dow3.png

Kapow, first try, expect it in zink-wip later today-ish.

It’s just that easy.

If you disagree, you are nitpicking and biased.

September 01, 2021

I've been working on portals recently and one of the issues for me was that the documentation just didn't quite hit the sweet spot. At least the bits I found were either too high-level or too implementation-specific. So here's a set of notes on how a portal works, in the hope that this is actually correct.

First, Portals are supposed to be a way for sandboxed applications (flatpaks) to trigger functionality they don't have direct access to. The prime example: opening a file without the application having access to $HOME. This is done by the applications talking to portals instead of implementing the functionality themselves.

There is really only one portal process: /usr/libexec/xdg-desktop-portal, started as a systemd user service. That process owns a DBus bus name (org.freedesktop.portal.Desktop) and an object on that name (/org/freedesktop/portal/desktop). You can see that bus name and object with D-Feet; from DBus' POV there's nothing special about it. What makes it the portal is simply that the application running inside the sandbox can talk to that DBus name and thus call the various methods. Obviously the xdg-desktop-portal needs to run outside the sandbox to do its thing.

There are multiple portal interfaces, all available on that one object. Those interfaces have names like org.freedesktop.portal.FileChooser (to open/save files). The xdg-desktop-portal implements those interfaces and thus handles any method calls on those interfaces. So where an application is sandboxed, it doesn't implement the functionality itself, it instead calls e.g. the OpenFile() method on the org.freedesktop.portal.FileChooser interface. Then it gets an fd back and can read the content of that file without needing full access to the file system.

Some interfaces are fully handled within xdg-desktop-portal. For example, the Camera portal checks a few things internally and pops up a dialog for the user to confirm access if needed [1], but otherwise there's nothing else involved with this specific method call.

Other interfaces have a backend "implementation" DBus interface. For example, the org.freedesktop.portal.FileChooser interface has an org.freedesktop.impl.portal.FileChooser (notice the "impl") counterpart. xdg-desktop-portal does not implement those impl.portals; it instead routes the DBus calls to the respective "impl.portal". Your sandboxed application calls OpenFile(), and xdg-desktop-portal then calls OpenFile() on org.freedesktop.impl.portal.FileChooser. That interface returns a value, xdg-desktop-portal extracts it and returns it back to the application in response to the original OpenFile() call.

What provides those impl.portals doesn't matter to xdg-desktop-portal, and this is where things are hot-swappable [2]. There are GTK- and Qt-specific portals in the form of xdg-desktop-portal-gtk and xdg-desktop-portal-kde, and another one is provided by GNOME Shell directly. You can check the files in /usr/share/xdg-desktop-portal/portals/ to see which impl.portal is provided on which bus name. The reason those impl.portals exist is so they can be native to the desktop environment - regardless of what application you're running, thanks to the generic xdg-desktop-portal you see the native file chooser dialog for your desktop environment.
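
For illustration, those *.portal files are simple ini-style descriptions, roughly like this (paraphrasing the gtk one from memory; the exact interface list will differ):

[portal]
DBusName=org.freedesktop.impl.portal.desktop.gtk
Interfaces=org.freedesktop.impl.portal.FileChooser;org.freedesktop.impl.portal.AppChooser;
UseIn=gnome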

So the full call sequence is:

  • At startup, xdg-desktop-portal parses the /usr/share/xdg-desktop-portal/portals/*.portal files to know which impl.portal interface is provided on which bus name
  • The application calls OpenFile() on the org.freedesktop.portal.FileChooser interface on the object path /org/freedesktop/portal/desktop. It can do so because the bus name this object sits on is not restricted by the sandbox
  • xdg-desktop-portal receives that call. This is a portal with an impl.portal, so xdg-desktop-portal calls OpenFile() on the bus name that provides the org.freedesktop.impl.portal.FileChooser interface (as previously established by reading the *.portal files)
  • Assuming xdg-desktop-portal-gtk provides that portal at the moment, that process now pops up a GTK FileChooser dialog that runs outside the sandbox. The user selects a file
  • xdg-desktop-portal-gtk sends back the fd for the file to the xdg-desktop-portal, and the impl.portal parts are done
  • xdg-desktop-portal receives that fd and sends it back as reply to the OpenFile() method in the normal portal
  • The application receives the fd and can read the file now
A few details here aren't fully correct, but it's correct enough to understand the sequence - the exact details depend on the method call anyway.

Finally: because of DBus restrictions, the various methods in the portal interfaces don't just reply with values. Instead, the xdg-desktop-portal creates a new org.freedesktop.portal.Request object and returns the object path for that. Once that's done the method is complete from DBus' POV. When the actual return value arrives (e.g. the fd), that value is passed via a signal on that Request object, which is then destroyed. This roundabout way is done for purely technical reasons, regular DBus methods would time out while the user picks a file path.
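
To make that pattern concrete, here is a hedged sketch (using GDBus, error handling omitted) of what a client does: make the call, get a Request object path back, and treat the Response signal on that object as the actual return value. Real clients also pass a handle_token in the options and subscribe before calling, to avoid racing against the signal.

/* Hedged sketch of the portal Request/Response pattern with GDBus. */
#include <gio/gio.h>

static void
on_response(GDBusConnection *bus, const char *sender, const char *path,
            const char *interface, const char *signal, GVariant *params,
            gpointer user_data)
{
   guint32 response; /* 0 = success, non-zero = cancelled/failed */
   GVariant *results;
   g_variant_get(params, "(u@a{sv})", &response, &results);
   g_print("portal response code: %u\n", response);
   g_variant_unref(results);
}

static void
open_file(GDBusConnection *bus)
{
   /* OpenFile() returns immediately with a Request object path */
   GVariant *reply = g_dbus_connection_call_sync(bus,
      "org.freedesktop.portal.Desktop", "/org/freedesktop/portal/desktop",
      "org.freedesktop.portal.FileChooser", "OpenFile",
      g_variant_new("(ssa{sv})", "", "Pick a file", NULL),
      G_VARIANT_TYPE("(o)"), G_DBUS_CALL_FLAGS_NONE, -1, NULL, NULL);

   const char *request_path;
   g_variant_get(reply, "(&o)", &request_path);

   /* The result we actually want arrives later via this signal */
   g_dbus_connection_signal_subscribe(bus, "org.freedesktop.portal.Desktop",
                                      "org.freedesktop.portal.Request",
                                      "Response", request_path, NULL,
                                      G_DBUS_SIGNAL_FLAGS_NONE,
                                      on_response, NULL, NULL);
   g_variant_unref(reply);
}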

Anyway. Maybe this helps someone understanding how the portal bits fit together.

[1] it does so using another portal but let's ignore that
[2] not really hot-swappable though. You need to restart xdg-desktop-portal but not your host. So luke-warm-swappable only

Edit Sep 01: clarify that it's not GTK/Qt providing the portals, but xdg-desktop-portal-gtk and -kde

August 31, 2021

Let me talk here about how we implemented support for performance counters in the Mesa V3D driver, the OpenGL driver used by the Raspberry Pi 4. For reference, the implementation is very similar to the one already available (not done by me, by the way) for VC4, the OpenGL driver for the Raspberry Pi 3 and prior devices, also part of Mesa. If you are already familiar with how this is implemented in VC4, then this will mostly be a refresher.

First of all, what are these performance counters? Most processors nowadays contain some hardware facilities to get measurements about what is happening inside the processor, and graphics processors are of course no different. In this case, the graphics chips used by Raspberry Pi devices (manufactured by Broadcom) can record a bunch of different graphics-related parameters: how many quads are passing or failing depth/stencil tests, how many clock cycles are spent on doing vertex/fragment shading, hits/misses in the GPU cache, and many other values. In fact, with the V3D driver it is possible to measure around 87 different parameters, and up to 32 of them simultaneously. Quite a few fewer in VC4, though. But still a lot.

On a hardware level, using these counters is just a matter of writing and reading some GPU registers. First, write the registers to select what we want to measure, then a few more to start the measurement, and finally read other registers containing the results. But of course, much like we don't expect users to write GPU assembly code, we don't expect users to write GPU registers directly. Moreover, even Mesa drivers such as V3D can't interact directly with the hardware; rather, this is done through the kernel, which is the one that can use the hardware directly, via the DRM subsystem. For the case of V3D (and the same applies to VC4, and in general to any other driver), we have a driver in user-space (whether the OpenGL driver, V3D, or the Vulkan driver, V3DV), and a kernel driver in kernel-space, unsurprisingly also called V3D. The user-space driver is in charge of translating all the commands and options created with the OpenGL API (or any other API) into batches of commands to be executed by the GPU, which are submitted to the kernel driver as DRM jobs. The kernel takes the proper actions to send these to the GPU for execution, including touching the proper registers. Thus, if we want to implement support for the performance counters, we need to modify the code in two places: the kernel and the (user-space) driver.

Implementation in the kernel

Here we need to think about how to deal with the GPU and the registers to make the performance counters work, as well as the API we provide to user-space to use them. As mentioned before, the approach we are following here is the same as the one used in the VC4 driver: performance counter monitors. That is, the user-space driver creates one or more monitors, specifying for each monitor which counters it is interested in (up to 32 simultaneously, the hardware limit). The kernel returns a unique identifier for each monitor, which can be used later to do the measurement, query the results, and finally destroy the monitor when done.

In this case, there isn't an explicit start/stop of the measurement. Rather, every time the driver wants to measure a job, it includes the identifier of the monitor it wants to use for that job, if any. Before submitting a job to the GPU, the kernel checks if the job has a monitor identifier attached. If so, it checks whether the previous job executed by the GPU was also using the same monitor identifier, in which case it doesn't need to do anything other than send the job to the GPU, as the required performance counters are already enabled. If the monitor is different, it first needs to read the current counter values (through the proper GPU registers), add them to the current monitor, stop the measurement, configure the counters for the new monitor, start the measurement again, and finally submit the new job to the GPU. In this process, if it turns out there wasn't a monitor under execution before, then it only needs to execute the last steps.

The reason to do all this is that multiple applications can be executing at the same time, some using (different) performance counters, and most of them probably not using performance counters at all. But the performance counter values of one application shouldn’t affect any other application so we need to make sure we don’t mix up the counters between applications. Keeping the values in their respective monitors helps to accomplish this. There is still a small requirement in the user-space driver to help with accomplishing this, but in general, this is how we avoid the mixing.

If you want to take a look at the full implementation, it is available in a single commit.
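
To give a rough idea of the shape of that interface, the sketch below shows approximately what the new uAPI looks like. The names follow the v3d uAPI headers as I remember them, so treat the commit itself as the authoritative reference:

/* Hedged sketch of the v3d perfmon uAPI (see the commit for the real thing). */
#include <stdint.h>

#define DRM_V3D_MAX_PERF_COUNTERS 32

/* Create a monitor tracking up to 32 counters; the kernel fills in "id". */
struct drm_v3d_perfmon_create {
        uint32_t id;                                  /* out: monitor id */
        uint32_t ncounters;                           /* in: counters used */
        uint8_t  counters[DRM_V3D_MAX_PERF_COUNTERS]; /* in: which counters */
};

/* Destroy a monitor once user-space is done with it. */
struct drm_v3d_perfmon_destroy {
        uint32_t id;
};

/* Read back the accumulated values: an array of ncounters 64-bit values. */
struct drm_v3d_perfmon_get_values {
        uint32_t id;
        uint32_t pad;
        uint64_t values_ptr; /* user pointer to uint64_t[ncounters] */
};

The submit ioctl then just carries the monitor id, which is what triggers the switching logic described above.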

Implementation in the driver

Once we have a way to create and manage the monitors, using them in the driver is quite easy: as mentioned before, we only need to create a monitor with the counters we are interested in and attach it to the job to be submitted to the kernel. In order to make things easier, we keep a mirror-like version of the monitor inside the driver.

This approach is adequate when you are developing the driver, as you can add code directly to it to check performance. But what about the final user, who is writing an OpenGL application and wants to check how to improve its performance, or find a bottleneck in it? We want the user to have a way to do this through OpenGL.

Fortunately, there is in fact a way to do this through OpenGL: the GL_AMD_performance_monitor extension. This OpenGL extension provides an API to query what counters the hardware supports, to create monitors, to start and stop them, and to retrieve the values. It looks very similar to what we have described so far, except for an important difference: the user needs to start and stop the monitors explicitly. We will explain later why this is necessary. But the key point here is that when we start a monitor, this means that from that moment on, until stopping it, any job created and submitted to the kernel will have the identifier of that monitor attached. This implies that only one monitor can be enabled in the application at the same time. But this isn’t a problem, as this restriction is part of the extension.
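
In code, using the extension follows that same discover/select/start/stop/read shape. A minimal sketch, based on the extension spec (error checking omitted, and only a single counter selected):

/* Minimal GL_AMD_performance_monitor sketch, per the extension spec. */
GLuint group, counter, monitor;
GLint num_groups, num_counters, max_active;

/* Discover the first counter group and its first counter */
glGetPerfMonitorGroupsAMD(&num_groups, 1, &group);
glGetPerfMonitorCountersAMD(group, &num_counters, &max_active, 1, &counter);

/* Create a monitor and enable that counter on it */
glGenPerfMonitorsAMD(1, &monitor);
glSelectPerfMonitorCountersAMD(monitor, GL_TRUE, group, 1, &counter);

/* Everything drawn between Begin and End is measured */
glBeginPerfMonitorAMD(monitor);
/* ... draw calls ... */
glEndPerfMonitorAMD(monitor);

/* Poll until the result is available, then read it back */
GLuint available = 0;
while (!available)
   glGetPerfMonitorCounterDataAMD(monitor, GL_PERFMON_RESULT_AVAILABLE_AMD,
                                  sizeof(available), &available, NULL);
GLuint data[3]; /* stream of (group, counter, value) entries */
glGetPerfMonitorCounterDataAMD(monitor, GL_PERFMON_RESULT_AMD,
                               sizeof(data), data, NULL);
glDeletePerfMonitorsAMD(1, &monitor);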

Our driver does not implement this API directly, but through “queries”, which are used then by the Gallium subsystem in Mesa to implement the extension. For reference, the V3D driver (as well as the VC4) is implemented as part of the Gallium subsystem. The Gallium part basically handles all the hardware-independent OpenGL functionality, and just requires the driver hook functions to be implemented by the driver. If the driver implements the proper functions, then Gallium exposes the right extension (in this case, the GL_AMD_performance_monitor extension).

For our case, it requires the driver to implement functions to return which counters are available, to create or destroy a query (in this case, the query is the same as the monitor), start and stop the query, and once it is finished, to get the results back.

At this point, I would like to explain a bit better what it implies to stop the monitor and get the results back. As explained earlier, stopping the monitor or query means that from that moment on, any new job submitted to the kernel (and thus to the GPU) won't have a performance monitor identifier attached, and hence won't be measured. But it is important to know that while the driver submits jobs to the kernel at its own pace, these aren't executed immediately; the GPU needs time to execute them, so the kernel puts arriving jobs in a queue to be submitted to the GPU. This means that when the user stops the monitor, there could still be jobs in the queue that haven't been executed yet and are thus pending to be measured.

And how do we know that the jobs have been executed by the GPU? The hook function that implements getting the query results has a "wait" parameter, which tells whether the function needs to wait for all the pending-to-be-measured jobs to be executed, or not. If it doesn't wait but there are pending jobs, it just returns telling the caller this fact. This allows the caller to do other work in the meantime and query again later, instead of blocking until all the jobs have been executed. This is implemented through sync objects: every time a job is sent to the kernel, there's a sync object that is used to signal when the job has finished executing. This is mainly used as a way to synchronize jobs. In our case, when the user finalizes the query we save the fence of the last submitted job, and we use it to know when this last job has been executed.
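
Structurally, the hook ends up looking something like the sketch below. Every name in it is illustrative rather than the actual V3D code; the real implementation is in the merge request linked below:

/* Illustrative sketch of a get-query-result hook honoring "wait".
 * fence_is_signaled()/fence_wait()/read_counter_values() are
 * hypothetical helpers, not the actual Mesa/V3D functions. */
static bool
get_query_result(struct perf_query *query, bool wait, uint64_t *results)
{
   /* Fence of the last job submitted while the monitor was active */
   if (!fence_is_signaled(query->last_job_fence)) {
      if (!wait)
         return false;                      /* caller can poll again later */
      fence_wait(query->last_job_fence);    /* block until the GPU is done */
   }
   /* All measured jobs have executed: fetch the values from the kernel */
   read_counter_values(query, results);
   return true;
}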

There are quite a few details I’m not covering here. If you are interested though, you can take a look at the merge request.

Gallium HUD

So far we have seen how the performance counters are implemented, and how to use them. In all the cases it requires writing code to create the monitor/query, start/stop it, and query back the results, either in the driver itself or in the application through the GL_AMD_performance_monitor extension [1].

But what if we want to get some general measurements without adding code to the application or the driver? Fortunately, there is an environment variable, GALLIUM_HUD, that, when set, will show some graphs on top of the application with the measured counters.

Using it is very easy: set it to "help" to learn how to use it, as well as to get a list of the available counters for the current hardware.

As an example:

$ env GALLIUM_HUD=L2T-CLE-reads,TLB-quads-passing-z-and-stencil-test,QPU-total-active-clk-cycles-vertex-coord-shading scorched3d

You will see:

Performance Counters in Scorched 3D

Bear in mind that to be able to use this you will need a kernel that supports performance counters for V3D. At the time of writing, no kernel has been released yet with this support. If you don't want to wait for it, you can download the patch, apply it to your Raspberry Pi kernel (it has been tested on the 5.12 branch), and build and install it.

  1. All this is for the case of using OpenGL; if your application uses Vulkan, there are other similar extensions, which are not yet implemented in our V3DV driver at the time of writing this post. 

Gut Ding braucht Weile - "good things take time". Almost three years ago, we added high-resolution wheel scrolling to the kernel (v5.0). The desktop stack however was first lagging and eventually left behind (except for an update a year ago or so, see here). However, I'm happy to announce that thanks to José Expósito's efforts, we have now pushed it across the line. So - in a socially distanced manner and masked up to your eyebrows - gather round children, for it is storytime.

Historical History

In the beginning, there was the wheel detent. Or rather there were 24 of them, dividing a 360 degree [1] movement of a wheel into a neat set of 15-degree clicks. libinput exposed those wheel clicks as part of the "pointer axis" namespace and you could get the click count with libinput_event_pointer_get_axis_discrete() (announced here). The degree value is exposed as libinput_event_pointer_get_axis_value(). Other scroll backends (finger-scrolling or button-based scrolling) expose the pixel-precise value via that same function.

In a "recent" Microsoft Windows version (Vista!), MS added the ability for wheels to trigger more than 24 clicks per rotation. The MS Windows API now treats one "traditional" wheel click as a value of 120; anything finer-grained is a fraction thereof. You may have a mouse that triggers quarter-wheel clicks, each sending a value of 30. This makes for smoother scrolling and is supported(-ish) by a lot of mice introduced in the last 10 years [2]. Obviously, several small scrolls are nicer than one large scroll, so the UX is less bad than before.
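
The arithmetic is simple enough that a tiny example illustrates it. This is just a sketch of the 120-based convention, not code from any particular project: accumulate fractions until a full detent's worth arrives.

/* Example of the 120-based convention: a quarter-click wheel sends 30
 * per event, so four events add up to one traditional click of 120. */
static int accumulated; /* per-device, per-axis state in real code */

static int
handle_v120(int value)
{
   accumulated += value;
   int clicks = accumulated / 120;  /* full detents reached so far */
   accumulated -= clicks * 120;     /* keep the fractional remainder */
   return clicks;
}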

Now it's time for libinput to catch up with Windows Vista! For $reasons, the existing pointer axis API couldn't be changed to accommodate the high-res values, so a new API was added for scroll events. Read on for the details, you will believe what happens next.

Out with the old, in with the new

As of libinput 1.19, libinput has three new events: LIBINPUT_EVENT_POINTER_SCROLL_WHEEL, LIBINPUT_EVENT_POINTER_SCROLL_FINGER, and LIBINPUT_EVENT_POINTER_SCROLL_CONTINUOUS. These events reflect, perhaps unsurprisingly, scroll movements of a wheel, a finger, or along a continuous axis (e.g. button scrolling). And they replace the old event LIBINPUT_EVENT_POINTER_AXIS. Those familiar with libinput will notice that the new event names encode the scroll source in the event name now. This makes them slightly more flexible and saves callers an extra call.

In terms of actual API, the new events come with two new functions. The first is libinput_event_pointer_get_scroll_value(): for the FINGER and CONTINUOUS events, the value returned is in "pixels" [3]; for the new WHEEL events, the value is in degrees. IOW this is a drop-in replacement for the old libinput_event_pointer_get_axis_value() function. The second call is libinput_event_pointer_get_scroll_value_v120(), which, for WHEEL events, returns the 120-based logical units the kernel uses as well. libinput_event_pointer_has_axis() returns true if the given axis has a value, just as before. With those three calls you now get the data for the new events.
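
Put together, handling the new events looks roughly like this (a sketch against the libinput 1.19 API; the surrounding event loop and device setup are omitted):

/* Sketch of handling the new libinput 1.19 scroll events. */
#include <stdio.h>
#include <libinput.h>

static void
handle_event(struct libinput_event *event)
{
   switch (libinput_event_get_type(event)) {
   case LIBINPUT_EVENT_POINTER_SCROLL_WHEEL: {
      struct libinput_event_pointer *p = libinput_event_get_pointer_event(event);
      if (libinput_event_pointer_has_axis(p, LIBINPUT_POINTER_AXIS_SCROLL_VERTICAL)) {
         /* degrees, drop-in replacement for the old axis value */
         double degrees = libinput_event_pointer_get_scroll_value(p,
                             LIBINPUT_POINTER_AXIS_SCROLL_VERTICAL);
         /* 120 == one traditional wheel click */
         double v120 = libinput_event_pointer_get_scroll_value_v120(p,
                             LIBINPUT_POINTER_AXIS_SCROLL_VERTICAL);
         printf("wheel: %.2f degrees (%.0f/120)\n", degrees, v120);
      }
      break;
   }
   case LIBINPUT_EVENT_POINTER_SCROLL_FINGER:
   case LIBINPUT_EVENT_POINTER_SCROLL_CONTINUOUS:
      /* pixel-based values, same getter function */
      break;
   case LIBINPUT_EVENT_POINTER_AXIS:
      /* legacy event: ignore it entirely if you handle the new ones */
      break;
   default:
      break;
   }
}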

Backwards compatibility

To ensure backwards compatibility, libinput generates both old and new events so the rule for callers is: if you want to support the new events, just ignore the old ones completely. libinput also guarantees new events even on pre-5.0 kernels. This makes the old and new code easy to ifdef out, and once you get past the immediate event handling the code paths are virtually identical.

When, oh when?

These changes have been merged into the libinput main branch and will be part of libinput 1.19. Which is due to be released over the next month or so, so feel free to work backwards from that for your favourite distribution.

Having said that, libinput is merely the lowest block in the Jenga tower that is the desktop stack. José linked to the various MRs in the upstream libinput MR, so if you're on your seat's edge waiting for e.g. GTK to get this, well, there's an MR for that.

[1] That's degrees of an angle, not Fahrenheit
[2] As usual, on a significant number of those you'll need to know whatever proprietary protocol the vendor deemed to be important IP. Older MS mice stand out here because they use straight HID.
[3] libinput doesn't really have a concept of pixels, but it has a normalized pixel that movements are defined as. Most callers take that as real pixels except for the high-resolution displays where it's appropriately scaled.

Zink Is Over

A while ago I blogged about finishing up ES 3.2. Then I didn’t mention it again because…well, I suppose I’ve only blogged four times since then, but I’m going to pretend this was part of my master plan to make everyone forget so I could build hype again.

The hype is here: Zink can now* run ES 3.2 apps.

  • “now” is a variable unit of time subject to CI not trying to drown itself at the nearest pub, literally melt itself to slag, or hurl itself off a cliff the instant daniels takes his eyes off it

What Does This Mean For The Future?

I know I’ve said this a few times previously, and we all had a good laugh, but this time I mean it.

Zink is done.

The final boss has been beaten, there’s no more versions to support, no extensions left on my todo list, definitely no bugs remaining, and performance can’t possibly improve further.

If you think you’ve found a zink bug, report it to whoever wrote the test or app you’re running, because the only thing I plan on doing for the rest of 2021 is playing Cyberpunk 2077 on Lavapipe.

Right after it finishes loading.

August 30, 2021

The Struggle Continues

Everyone’s seen the Phoronix benchmark numbers by now, and though there’s a lot of confusion over how to calculate the percentage increase between “game did not run a year ago” and “game runs”, it seems like a couple people out there at Big Triangle are starting to take us seriously.

With that said, even my parents are asking me what the deal is with this one result in particular:

ohno.png

Performance isn’t supposed to go down. Everyone knows this. The version numbers go up and so does the performance as long as it’s not Javascript-based.

Enraged, I sprinted to my computer and searched for tesseract game, which gave me the entirely wrong result, but I eventually did manage to find the right one. I fired up zink-wip, certain that this would end up being some bug I’d already fixed.

Unfortunately, this was not the case.

tesseract-bad.png

I vowed not to sleep, rebase, leave my office, or even run another application until this was resolved, so you can imagine how pleased I am to be writing this post after spending way too much time getting to the bottom of everything.

Speculation Interlude

Full disclosure: I didn’t actually check why performance went down. I’m pretty sure it’s just the result of having improved buffer mapping to be better in most cases, which ended up hurting this case.

But Why

…is the performance so bad?

Some quick profiling revealed that this was down to a Gallium component called vbuf, used for translating vertex buffers and attributes from the ones specified by the application to ones that drivers can actually support. The component itself is fine; the problem is that, ideally, it’s not something you ever want to be hitting when you care about performance.

Consider the usual sequence of drawing a frame:

  • generate and upload vertex data
  • bind some descriptors
  • maybe throw in a query or two if you need some spice
  • draw
  • repeat until frame is done

This is all great and normal, but what would happen—just hypothetically of course—if instead it looked like this:

  • generate and upload vertex data
  • stall and read vertex data
  • rewrite vertex data in another format and reupload
  • bind some descriptors
  • maybe throw in a query or two if you need some spice
  • draw
  • repeat until frame is done

Suddenly the driver is now stalling multiple times per frame on top of doing lots of CPU work!

Incidentally, this is (almost certainly) why performance appeared to have regressed: the vertex buffer is now device-local and can’t be mapped directly, so it has to be copied to a new buffer before it can be read, which is even slower.

Just AMD Problems

DISCLAIMER: We’re going deep into meme territory now, so let’s all dial down the seriousness about a thousand notches before posting about how much I hate AMD or whatever.

vertexattribmeme.png

Unlike cool hardware, AMD opts to not support features which might be less performant. I assume this is in the hopes that developers will Make The Right Choice and not use those features, but obviously developers are gonna develop, and so it is that Tesseract-The-Game-But-Not-The-One-On-Steam uses 3-component vertex attributes that aren’t supported by AMD hardware, necessitating the use of vbuf to translate them to 4-component attributes that can be safely used.

Decomposition

The vertex buffer format at work here was R8G8B8_SNORM, which is a perfectly cromulent format as long as you hate yourself. A shader would read this as a vec4, which, by the power of buffer robustness, gets translated to vec4(x, y, z, 1.0) because the w component is missing.

The approach I took to solving this was to decompose the vertex attribute into three separate R8_SNORM attributes, as this single-component format is wimpy enough for AMD to handle. Thus, a vertex input state containing three separate attributes including this one would now contain five, as the original R8G8B8_SNORM one is split into three, each reading a single component at an offset to simulate the original attribute.

The tricky part to this is that it requires a vertex shader prolog and variant in order to successfully split the shader’s input in such a way that the read value is the same. It also requires a NIR pass. Let’s check out the NIR pass since this blog has gone for way too long without seeing any real work:

struct decompose_state {
  nir_variable **split;
  bool needs_w;
};

static bool
decompose_attribs(nir_shader *nir, uint32_t decomposed_attrs, uint32_t decomposed_attrs_without_w)
{
   uint32_t bits = 0;
   nir_foreach_variable_with_modes(var, nir, nir_var_shader_in)
      bits |= BITFIELD_BIT(var->data.driver_location);
   bits = ~bits;
   u_foreach_bit(location, decomposed_attrs | decomposed_attrs_without_w) {
      nir_variable *split[5];
      struct decompose_state state;
      state.split = split;
      nir_variable *var = nir_find_variable_with_driver_location(nir, nir_var_shader_in, location);
      assert(var);
      split[0] = var;
      bits |= BITFIELD_BIT(var->data.driver_location);
      const struct glsl_type *new_type = glsl_type_is_scalar(var->type) ? var->type : glsl_get_array_element(var->type);
      unsigned num_components = glsl_get_vector_elements(var->type);
      state.needs_w = (decomposed_attrs_without_w & BITFIELD_BIT(location)) != 0 && num_components == 4;
      for (unsigned i = 0; i < (state.needs_w ? num_components - 1 : num_components); i++) {
         split[i+1] = nir_variable_clone(var, nir);
         split[i+1]->name = ralloc_asprintf(nir, "%s_split%u", var->name, i);
         if (decomposed_attrs_without_w & BITFIELD_BIT(location))
            split[i+1]->type = !i && num_components == 4 ? var->type : new_type;
         else
            split[i+1]->type = new_type;
         split[i+1]->data.driver_location = ffs(bits) - 1;
         bits &= ~BITFIELD_BIT(split[i+1]->data.driver_location);
         nir_shader_add_variable(nir, split[i+1]);
      }
      var->data.mode = nir_var_shader_temp;
      nir_shader_instructions_pass(nir, lower_attrib, nir_metadata_dominance, &state);
   }
   nir_fixup_deref_modes(nir);
   NIR_PASS_V(nir, nir_remove_dead_variables, nir_var_shader_temp, NULL);
   optimize_nir(nir);
   return true;
}

First, the base of the pass; two masks are provided, one for attributes that are being fully split (i.e., four components) and one for attributes that have fewer than four components and thus need to have a w component added, as in the Tesseract case. Each variable in the mask is split into four, with slightly different behavior for the ones needing a w and the ones that don’t.

The new variables are all given new driver locations matching the ones given to the split attributes for the vertex input pipeline state, and the decompose_state is passed along to the per-instruction part of the pass:

static bool
lower_attrib(nir_builder *b, nir_instr *instr, void *data)
{
   struct decompose_state *state = data;
   nir_variable **split = state->split;
   if (instr->type != nir_instr_type_intrinsic)
      return false;
   nir_intrinsic_instr *intr = nir_instr_as_intrinsic(instr);
   if (intr->intrinsic != nir_intrinsic_load_deref)
      return false;
   nir_deref_instr *deref = nir_src_as_deref(intr->src[0]);
   nir_variable *var = nir_deref_instr_get_variable(deref);
   if (var != split[0])
      return false;
   unsigned num_components = glsl_get_vector_elements(split[0]->type);
   b->cursor = nir_after_instr(instr);
   nir_ssa_def *loads[4];
   for (unsigned i = 0; i < (state->needs_w ? num_components - 1 : num_components); i++)
      loads[i] = nir_load_deref(b, nir_build_deref_var(b, split[i+1]));
   if (state->needs_w) {
      loads[3] = nir_channel(b, loads[0], 3);
      loads[0] = nir_channel(b, loads[0], 0);
   }
   nir_ssa_def *new_load = nir_vec(b, loads, num_components);
   nir_ssa_def_rewrite_uses(&intr->dest.ssa, new_load);
   nir_instr_remove_v(instr);
   return true;
}

The existing variable is passed along with the new variable array. Where the original is loaded, instead the new variables are all loaded in sequence and assembled into a vec matching the length of the original one. For attributes needing a w component, the first new variable is loaded as a vec4 so that the w component can be reused naturally. Then the original load instruction is removed, and with it, the original variable and its brokenness.

Immediate Results

Sort of.

tesseract-semifixed.png

The frames were definitely there, but the graphics…

Occlusion Queries

It turns out there’s almost zero coverage for occlusion queries in Vulkan’s CTS. There’s surprisingly little coverage for most query-related things, in fact, which means it wasn’t too surprising when it turned out that there were RADV query bugs at play. What was surprising was how they manifested, but that was about par for anything that reads garbage memory.

A simple one-liner later (just kidding, this fucken thing took like 4 days to find) and, magically, things were happening:

tesseract-fixed.png

We Did It.

A big thanks to Bas Nieuwenhuizen for consulting along the way even despite being so busy getting a RADV raytracing MR up and, as always, preparing his next blog post.

August 25, 2021

A year ago, I first announced libei - a library to support emulated input. After an initial spurt of development, it was left mostly untouched until a few weeks ago. Since then, another flurry of changes has landed, including some initial integration into GNOME's mutter. So, let's see what has changed.

A Recap

First, a short recap of what libei is: it's a transport layer for emulated input events to allow for any application to control the pointer, type, etc. But, unlike the XTEST extension in X, libei allows the compositor to be in control over clients, the devices they can emulate and the input events as well. So it's safer than XTEST but also a lot more flexible. libei already supports touch and smooth scrolling events, something XTest doesn't have or is struggling with.

Terminology refresher: libei is the client library (used by an application wanting to emulate input), EIS is the Emulated Input Server, i.e. the part that typically runs in the compositor.

Server-side Devices

So what has changed recently: first, the whole approach has been flipped on its head - now a libei client connects to the EIS implementation and "binds" to the seats the EIS implementation provides. The EIS implementation then provides input devices to the client. In the simplest case, that's just a relative pointer, but we have capabilities for absolute pointers, keyboards and touch as well. The plan for the future is to add gesture and tablet support too. Possibly joysticks, but I haven't really thought about that in detail yet.

So basically, the initial conversation with an EIS implementation goes like this:

  • Client: Hello, I am $NAME
  • Server: Hello, I have "seat0" and "seat1"
  • Client: Bind to "seat0" for pointer, keyboard and touch
  • Server: Here is a pointer device
  • Server: Here is a keyboard device
  • Client: Send relative motion event 10/2 through the pointer device
Notice how the touch device is missing? The capabilities the client binds to are just what the client wants, the server doesn't need to actually give the client a device for that capability.

One of the design choices for libei is that devices are effectively static. If something changes on the EIS side, the device is removed and a new device is created with the new data. This applies for example to regions and keymaps (see below), so libei clients need to be able to re-create their internal states whenever the screen or the keymap changes.

Device Regions

Devices can now have regions attached to them, also provided by the EIS implementation. These regions define areas reachable by the device and are required for clients such as Barrier. On a dual-monitor setup you may have one device with two regions or two devices with one region each (representing one monitor); it depends on the EIS implementation. Either way, as a libei client you will know that there is an area and you will know how to reach any given pixel on that area. Since the EIS implementation decides the regions, it's possible to have areas that are unreachable by emulated input (though I'm struggling a bit to find a real-world use-case).

So basically, the conversation with an EIS implementation goes like this:

  • Client: Hello, I am $NAME
  • Server: Hello, I have "seat0" and "seat1"
  • Client: Bind to "seat0" for absolute pointer
  • Server: Here is an abs pointer device with regions 1920x1080@0,0, 1080x1920@1920,0
  • Server: Here is an abs pointer device with regions 1920x1080@0,0
  • Server: Here is an abs pointer device with regions 1080x1920@1920,0
  • Client: Send abs position 100/100 through the second device
Notice how we have three absolute devices? A client emulating a tablet that is mapped to a screen could just use the third device. As with everything, the server decides what devices are created and the clients have to figure out what they want to do and how to do it.

Perhaps unsurprisingly, the use of regions make libei clients windowing-system independent. The Barrier EI support WIP no longer has any Wayland-specific code in it. In theory, we could implement EIS in the X server and libei clients would work against that unmodified.

Keymap handling

The keymap handling has been changed so the keymap too is provided by the EIS implementation now, effectively in the same way as the Wayland compositor provides the keymap to Wayland clients. This means a client knows what keycodes to send, it can handle the state to keep track of things, etc. Using Barrier as an example again - if you want to generate an "a", you need to look up the keymap to figure out which keycode generates an A, then you can send that through libei to actually press the key.

Admittedly, this is quite messy. XKB (and specifically libxkbcommon) does not make it easy to go from a keysym to a keycode. The existing Barrier X code is full of XKB corner-cases already; I expect those to be necessary for the EI support as well.
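
For illustration, the naive version of that reverse lookup is a brute-force scan over the keymap, something like the sketch below - which conveniently ignores shift levels, groups and all the corner cases that make this genuinely hard:

/* Naive keysym-to-keycode lookup with libxkbcommon: scan every keycode
 * and check what it would produce in the current state. Ignores shift
 * levels and groups, which is exactly where the mess starts. */
#include <xkbcommon/xkbcommon.h>

static xkb_keycode_t
keycode_for_keysym(struct xkb_keymap *keymap, struct xkb_state *state,
                   xkb_keysym_t target)
{
   for (xkb_keycode_t kc = xkb_keymap_min_keycode(keymap);
        kc <= xkb_keymap_max_keycode(keymap); kc++) {
      if (xkb_state_key_get_one_sym(state, kc) == target)
         return kc;
   }
   return XKB_KEYCODE_INVALID;
}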

Scrolling

Scroll events have four types: pixel-based scrolling, discrete scrolling, and scroll stop/cancel events. The first should be obvious, discrete scrolling is for mouse wheels. It uses the same 120-based API that Windows (and the kernel) use, so it's compatible with high-resolution wheel mice. The scroll stop event notifies an EIS implementation that the scroll interaction has stopped (e.g. lifting fingers off) which in turn may start kinetic scrolling - just like the libinput/Wayland scroll stop events. The scroll cancel event notifies the EIS implementation that scrolling really has stopped and no kinetic scrolling should be triggered. There's no equivalent in libinput/Wayland for this yet but it helps to get the hook in place.

Emulation "Transactions"

This has fairly little functional effect, but interactions with an EIS server are now sandwiched in a start/stop emulating pair. While this doesn't matter for one-shot tools like xdotool, it does matter for things like Barrier which can send the start emulating event when the pointer enters the local window. This again allows the EIS implementation to provide some visual feedback to the user. To correct the example from above, the sequence is actually:

  • ...
  • Server: Here is a pointer device
  • Client: Start emulating
  • Client: Send relative motion event 10/2 through the pointer device
  • Client: Send relative motion event 1/4 through the pointer device
  • Client: Stop emulating

Properties

Finally, there is now a generic property API, something copied from PipeWire. Properties are simple key/value string pairs and cover those things that aren't in the immediate API. One example: the portal can set things like "ei.application.appid" to the Flatpak's appid. Properties can be locked down, and only libei itself can set properties before the initial connection. This makes them reliable enough for the EIS implementation to make decisions based on their values. Just like with PipeWire, the list of useful properties will grow over time. It's too early to tell what is really needed.

Repositories

Now, for the actual demo bits: I've added enough support to Barrier, XWayland, Mutter and GNOME Shell that I can control a GNOME on Wayland session through Barrier (note: the controlling host still needs to run X since we don't have the ability to capture input events under Wayland yet). The keymap handling in Barrier is nasty but it's enough to show that it can work.

GNOME Shell has a rudimentary UI, again just to show what works:

The status icon shows ... if libei clients are connected, it changes to !!! while the clients are emulating events. Clients are listed by name and can be disconnected at will. I am not a designer, this is just a PoC to test the hooks.

Note how xdotool is listed in this screenshot: that tool is unmodified, it's the XWayland libei implementation that allows it to work and show up correctly.

The various repositories are in the "wip/ei" branch of:

And of course libei itself.

Where to go from here? The last weeks were driven by rapid development, so there's plenty of test cases to be written to make sure the new code actually works as intended. That's easy enough. Looking at the Flatpak integration is another big ticket item, once the portal details are sorted all the pieces are (at least theoretically) in place. That aside, improving the integrations into the various systems above is obviously what's needed to get this working OOTB on the various distributions. Right now it's all very much in alpha stage and I could use help with all of those (unless you're happy to wait another year or so...). Do ping me if you're interested to work on any of this.

August 23, 2021

Hi all, hope you all are doing fine!

Finally, today part 5 of my Outreachy Saga comes out; week 9 was on 7/19/21, and as you can see I'm a little late too ( ;P )... This week had the theme: "Career opportunities / Career goals".

When I read the Outreachy organizers' email, I had an anxiety crisis, starting to think about what I want to do after the internship and what my career goals are, and I panicked... Imposter syndrome hit hard, and it still haunts my thoughts; it has been very challenging (as my therapist says) to work with this feeling of not being good enough to apply for a job opening, or thinking that my resume is worthless...

But week 11 arrived #SPOILERALERT with the theme "Making connections", and talking to some people and getting to know their experiences in companies that work with free software, and their contributions to it, I could feel that I am on the right path and that this is the area I want to work in. So let's get back to the topic of today's post!!

What am I looking for?

I'm looking for a job, preferably remote, which can be full or part-time!

But also some other opportunity where I can improve my CV and help me to continue working with the Linux Kernel, preferably. The end of my Outreachy internship is fast approaching (or rather, it's the day after tomorrow (o_o) ), so after August 24th, I will be available to work full or part-time. I currently live in Fundão, Portugal, and I am open to remote positions based anywhere in the world, along with the possibility of international relocation (my dream is to live in Canada!!! 😜).

What types of work would you like to contribute to?

I would like to continue working with the Linux Kernel, and I also really like Embedded Systems (I've played a little with Raspberry Pi, BeagleBone, and Arduino).

What tools or skills do you have that would help you with that work?

You know... I have a lot of difficulty answering this question because I always think I don't have many skills, but I'll do my best!!!

During my Outreachy internship, I created vkms_config_show(), a function which aims to print the data in drm_debugfs_create_files(). I also started to learn how to debug the code, and I have already worked a little with the Coccinelle tool. And as I mentioned earlier, I've used Raspberry Pi, BeagleBone, and Arduino.

What languages do you speak, and at what school grade level?

Portuguese (Native), English (intermediate)

And reviewing now my CV and Linkedin, I could see that I've already done a lot!

I have experience with remote work, with people from different parts of the world. I was the DAECOMP (Academic Directory of Computer Engineering - that is what we call our students' association) president, where I needed to interact with my classmates to find out which improvements they thought the course needed, and to interact with the campus management and our professors to get things done to improve the course and the campus. I also organized Free Software dissemination events. And I was a flute tutor (yes, I study music!!!) and a robotics tutor, working with young people and children.

Well, I'll stop here... To see more about my experience, feel free to visit my Linkedin

Once again thank you for following me so far, my adventure with Outreachy is almost over, every day has been a lot of learning! Please feel free to comment! And stay tuned to the next chapters of this Saga!!!

Take care and have a great day!

August 18, 2021

We Back

Just a quick update today while I dip my toes back into the blogosphere to remind myself that it’s not so scary.

Remember when I blogged about how nice it would be to have a suballocator all those months ago?

Now it’s landed, and it’s nice indeed to have a suballocator.

Remember when everyone wanted GL 4.6 compatibility contexts so they could play Feral ports of their favorite games? zink-wip did that 6 months ago.

What this all means is that it’s more or less open testing season on zink. I’ve already got a sizable number of tickets open for various Steam games based on zink-wip testing, but this is hardly conclusive.

What games work for you?

What games don’t work?

I’m not gonna test them all myself, so get out there and start crashing.

August 11, 2021

Here I’m playing “Spelunky 2” on my laptop and simultaneously replaying the same Vulkan calls on an ARM board with Adreno GPU running the open source Turnip Vulkan driver. Hint: it’s an x64 Windows game that doesn’t run on ARM.

The bottom right is the game I’m playing on my laptop, the top left is GFXReconstruct immediately replaying Vulkan calls from the game on ARM board.

How is it done? And why would it be useful for debugging? Read below!


Debugging issues a driver faces with real-world applications requires the ability to capture and replay graphics API calls. However, for mobile GPUs it becomes even more challenging, since for a Vulkan driver the main “source” of real-world workloads is x86-64 apps that run via Wine + DXVK - mainly games which were made for desktop x86-64 Windows and do not run on ARM. Efforts are being made to run these apps on ARM but that is still work-in-progress. And we want to test the drivers NOW.

The obvious solution is to run those applications on an x86-64 machine, capturing all Vulkan calls, and then replay those calls on a second machine where we cannot run the app. This way it is possible to test the driver even without running the application directly on it.

The main trouble is that Vulkan calls made on one GPU + driver combo are not generally compatible with another GPU + driver combo, sometimes not even between GPUs from the same vendor. There are different memory capabilities (VkPhysicalDeviceMemoryProperties), different memory requirements for buffers and images, different extensions available, and different optional features supported. It is easier with OpenGL, but there are some incompatibilities there too.
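
Memory types are the classic example. A replayer cannot reuse the memory type index recorded in the capture; it has to re-select a compatible type on the replay device by its properties. Below is the standard Vulkan pattern for that selection, as an illustration of what a rebind-style replay conceptually has to do:

/* Illustrative: memory type indices are not portable between GPUs, so a
 * replayer must pick a type on the replay device by required properties. */
#include <vulkan/vulkan.h>

static uint32_t
find_memory_type(VkPhysicalDevice phys, uint32_t type_bits,
                 VkMemoryPropertyFlags required)
{
   VkPhysicalDeviceMemoryProperties props;
   vkGetPhysicalDeviceMemoryProperties(phys, &props);

   for (uint32_t i = 0; i < props.memoryTypeCount; i++) {
      if ((type_bits & (1u << i)) &&
          (props.memoryTypes[i].propertyFlags & required) == required)
         return i;
   }
   return UINT32_MAX; /* no compatible type: the capture can't be replayed as-is */
}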

There are two open-source vendor-agnostic tools for capturing Vulkan calls: RenderDoc (captures single frame) and GFXReconstruct (captures multiple frames). RenderDoc at the moment isn’t suitable for the task of capturing applications on desktop GPUs and replaying on mobile because it doesn’t translate memory type and requirements (see issue #814). GFXReconstruct on the other hand has the necessary features for this.

I’ll show a couple of tricks with GFXReconstruct I’m using to test things on Turnip.


Capturing with GFXReconstruct

At this point you either have the application itself or, if it doesn’t use Vulkan, a trace of its calls that can be translated to Vulkan. There are detailed instructions on how to use GFXReconstruct to capture a trace on a desktop OS. However, there is no clear instruction on how to do this on Android (see issue #534); fortunately there is one in Android’s documentation:

Android how-to:
For Android 9 you should copy the layers into the application which will be traced.
For Android 10+ it's easier to copy them into com.lunarg.gfxreconstruct.replay.
You need a userdebug build of Android, or probably a rooted device.

# Push GFXReconstruct layer to the device
adb push libVkLayer_gfxreconstruct.so /sdcard/

# Since there is no APK for the capture layer,
# copy the layer to e.g. folder of com.lunarg.gfxreconstruct.replay
adb shell run-as com.lunarg.gfxreconstruct.replay cp /sdcard/libVkLayer_gfxreconstruct.so .

# Enable layers
adb shell settings put global enable_gpu_debug_layers 1

# Specify target application
adb shell settings put global gpu_debug_app <package_name>

# Specify layer list (from top to bottom)
adb shell settings put global gpu_debug_layers VK_LAYER_LUNARG_gfxreconstruct

# Specify packages to search for layers
adb shell settings put global gpu_debug_layer_app com.lunarg.gfxreconstruct.replay

If the target application doesn’t have rights to write into external storage - you should change where the capture file is created:

adb shell "setprop debug.gfxrecon.capture_file '/data/data/<target_app_folder>/files/'"


However, when you try to replay the captured trace on another GPU, it will most likely result in an error:

[gfxrecon] FATAL - API call vkCreateDevice returned error value VK_ERROR_EXTENSION_NOT_PRESENT that does not match the result from the capture file: VK_SUCCESS.  Replay cannot continue.
Replay has encountered a fatal error and cannot continue: the specified extension does not exist

Or other errors and crashes. Fortunately, we can limit the capabilities of the desktop GPU with VK_LAYER_LUNARG_device_simulation.

When simulating another GPU, VK_LAYER_LUNARG_device_simulation should be told to intersect the capabilities of both GPUs, making the capture compatible with both of them. This can be achieved with recently added environment variables:

VK_DEVSIM_MODIFY_EXTENSION_LIST=whitelist
VK_DEVSIM_MODIFY_FORMAT_LIST=whitelist
VK_DEVSIM_MODIFY_FORMAT_PROPERTIES=whitelist

The whitelist name is rather confusing because it essentially means “intersection”.

You also need a JSON file describing the target GPU’s capabilities, which can be generated by running:

vulkaninfo -j &> <device_name>.json

The final command to capture a trace would be:

VK_LAYER_PATH=<path/to/device-simulation-layer>:<path/to/gfxreconstruct-layer> \
VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_gfxreconstruct:VK_LAYER_LUNARG_device_simulation \
VK_DEVSIM_FILENAME=<device_name>.json \
VK_DEVSIM_MODIFY_EXTENSION_LIST=whitelist \
VK_DEVSIM_MODIFY_FORMAT_LIST=whitelist \
VK_DEVSIM_MODIFY_FORMAT_PROPERTIES=whitelist \
<the_app>

Replaying with GFXReconstruct

gfxrecon-replay -m rebind --skip-failed-allocations <trace_name>.gfxr
  • -m Enable memory translation for replay on GPUs with memory types that are not compatible with the capture GPU’s
    • rebind Change memory allocation behavior based on resource usage and replay memory properties. Resources may be bound to different allocations with different offsets.
  • --skip-failed-allocations skip vkAllocateMemory, vkAllocateCommandBuffers, and vkAllocateDescriptorSets calls that failed during capture

Without these options replay would fail.

Now you can easily test any app/game on your ARM board, if you have enough RAM =) I even successfully ran a capture of “Metro Exodus” on Turnip.

But what if I want to test something that requires interactivity?

Or what if you don’t want to save a huge trace to disk, one which could grow to tens of gigabytes if the application runs for a considerable amount of time?

During recording GFXReconstruct just appends calls to a file; there are no additional post-processing steps. Given that, the next logical step is to skip writing to disk and send the Vulkan calls over the network!

This allows us to interact with the application and immediately see the results on another device with a different GPU. And so I hacked together crude support for over-the-network replay.

The only difference from ordinary tracing is that instead of a file we now have to specify a network address of the target device:

VK_LAYER_PATH=<path/to/device-simulation-layer>:<path/to/gfxreconstruct-layer> \
    ...
GFXRECON_CAPTURE_FILE="<ip>:<port>" \
<the_app>

And on the target device:

while true; do gfxrecon-replay -m rebind --sfa ":<port>"; done

Why while true? It is common for DXVK to call vkCreateInstance several times, leading to the creation of several traces. When replaying over the network we therefore want gfxrecon-replay to restart immediately when one trace ends, ready for the next one.

You may want to bring the FPS down to match the capabilities of the lower-powered GPU in order to prevent constant hiccups. This can be done either with libstrangle or with MangoHud:

  • stranglevk -f 10
  • MANGOHUD_CONFIG=fps_limit=10 mangohud

You have seen the result at the start of the post.

August 10, 2021

I’ve been silent here for quite some time, so here is a quick summary of some of the new functionality we have been exposing in V3DV, the Vulkan driver for Raspberry Pi 4, over the last few months:

  • VK_KHR_bind_memory2
  • VK_KHR_copy_commands2
  • VK_KHR_dedicated_allocation
  • VK_KHR_descriptor_update_template
  • VK_KHR_device_group
  • VK_KHR_device_group_creation
  • VK_KHR_external_fence
  • VK_KHR_external_fence_capabilities
  • VK_KHR_external_fence_fd
  • VK_KHR_external_semaphore
  • VK_KHR_external_semaphore_capabilities
  • VK_KHR_external_semaphore_fd
  • VK_KHR_get_display_properties2
  • VK_KHR_get_memory_requirements2
  • VK_KHR_get_surface_capabilities2
  • VK_KHR_image_format_list
  • VK_KHR_incremental_present
  • VK_KHR_maintenance2
  • VK_KHR_maintenance3
  • VK_KHR_multiview
  • VK_KHR_relaxed_block_layout
  • VK_KHR_sampler_mirror_clamp_to_edge
  • VK_KHR_storage_buffer_storage_class
  • VK_KHR_uniform_buffer_standard_layout
  • VK_KHR_variable_pointers
  • VK_EXT_custom_border_color
  • VK_EXT_external_memory_dma_buf
  • VK_EXT_index_type_uint8
  • VK_EXT_physical_device_drm

Besides that list of extensions, we have also added basic support for Vulkan subgroups (this is a Vulkan 1.1 feature) and Geometry Shaders (we use this to implement multiview).

I think we now meet most (if not all) of the Vulkan 1.1 mandatory feature requirements, but we still need to check this properly and we also need to start doing Vulkan 1.1 CTS runs and fix test failures. In any case, the bottom line is that Vulkan 1.1 should be fairly close now.

August 05, 2021

Just about a year after the original announcement, I think it's time to see the progress on power-profiles-daemon.

Note that I would still recommend you read the up-to-date project README if you have questions about why this project was necessary, and why a new project was started rather than building on an existing one.

 The project was born out of the need to make a firmware feature available to end-users for a number of lines of Lenovo laptops for them to be fully usable on Fedora. For that, I worked with Mark Pearson from Lenovo, who wrote the initial kernel support for the feature and served as our link to the Lenovo firmware team, and Hans de Goede, who worked on making the kernel interfaces more generic.

More generic, but in a good way

 With the initial kernel support written for (select) Lenovo laptops, Hans implemented a more generic interface called platform_profile. This interface is now the one that power-profiles-daemon will integrate with, and means that it also supports a number of Microsoft Surface, HP, Lenovo's own Ideapad laptops, and maybe Razer laptops soon.

 The next item to make more generic is Lenovo's "lap detection" which still relies on a custom driver interface. This should be soon transformed into a generic proximity sensor, which will mean I get to work some more on iio-sensor-proxy.

Working those interactions

power-profiles-daemon landed in a number of distributions, sometimes enabled by default, sometimes not enabled by default (sigh, the less said about that the better), which fortunately meant that we had some early feedback available.

 The goal was always to have the user in control, but we still needed to think carefully about how the UI would look and how users would interact with it when a profile was temporarily unavailable, or the system started a "power saver" mode because battery was running out.

The latter is something that David Redondo's work on the "HoldProfile" API made possible. Software can programmatically switch to the power-saver or performance profile for the duration of a command. This is useful to switch to the performance profile when running a compilation (e.g. powerprofilesctl jhbuild --no-interact build gnome-shell), or for gnome-settings-daemon to set the power-saver profile when low on battery.
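
Under the hood this is a D-Bus API. A rough sketch of holding and releasing a profile with GDBus follows; the bus name, object path, and method signatures follow the project documentation, but treat the snippet as illustrative rather than canonical (error handling omitted):

#include <gio/gio.h>

int main(void)
{
    GDBusConnection *bus = g_bus_get_sync(G_BUS_TYPE_SYSTEM, NULL, NULL);

    /* HoldProfile(profile, reason, application_id) returns a cookie. */
    GVariant *ret = g_dbus_connection_call_sync(bus,
        "net.hadess.PowerProfiles", "/net/hadess/PowerProfiles",
        "net.hadess.PowerProfiles", "HoldProfile",
        g_variant_new("(sss)", "performance", "compiling", "org.example.App"),
        G_VARIANT_TYPE("(u)"), G_DBUS_CALL_FLAGS_NONE, -1, NULL, NULL);

    guint32 cookie;
    g_variant_get(ret, "(u)", &cookie);

    /* ... run the long-lived work here ... */

    /* Release the hold; it is also dropped automatically if the
     * caller disconnects from the bus. */
    g_dbus_connection_call_sync(bus,
        "net.hadess.PowerProfiles", "/net/hadess/PowerProfiles",
        "net.hadess.PowerProfiles", "ReleaseProfile",
        g_variant_new("(u)", cookie),
        NULL, G_DBUS_CALL_FLAGS_NONE, -1, NULL, NULL);

    g_variant_unref(ret);
    g_object_unref(bus);
    return 0;
}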

 The aforementioned David Redondo and Kai Uwe Broulik also worked on the KDE interface to power-profiles-daemon, as Florian Müllner implemented the gnome-shell equivalent.

Promised by me, delivered by somebody else :)

 I took this opportunity to update the Power panel in Settings, which shows off the temporary switch to the performance mode, and the setting to automatically switch to power-saver when low on battery.

Low-Power, everywhere

Talking of which, while it's important for the system to know that it's targeting power-saving behaviour, it's also pretty useful for applications to try and behave better.

Maybe you've already integrated with "low memory" events using GLib, but thanks to Patrick Griffis you can be an even better ecosystem citizen and monitor whether the system is in "Power Saver" mode, adjusting your application's behaviour accordingly.

This feature will be available in GLib 2.70 along with documentation of useful steps to take. GNOME Software will already be using this functionality to avoid large automated downloads when energy saving is needed.
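
A minimal sketch against the GLib 2.70 API (what the callback does in reaction is up to the application):

#include <gio/gio.h>

/* Called whenever the system enters or leaves "Power Saver" mode. */
static void
on_power_saver_changed(GObject *object, GParamSpec *pspec, gpointer user_data)
{
    GPowerProfileMonitor *monitor = G_POWER_PROFILE_MONITOR(object);

    if (g_power_profile_monitor_get_power_saver_enabled(monitor)) {
        /* Defer large downloads, pause animations, throttle polling... */
    }
}

int main(void)
{
    GPowerProfileMonitor *monitor = g_power_profile_monitor_dup_default();

    g_signal_connect(monitor, "notify::power-saver-enabled",
                     G_CALLBACK(on_power_saver_changed), NULL);

    /* ... run the application's main loop as usual ... */
    g_main_loop_run(g_main_loop_new(NULL, FALSE));
    return 0;
}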

Availability

 The majority of the above features are available in the GNOME 41 development branches and should get to your favourite GNOME-friendly distribution for their next release, such as Fedora 35.
August 04, 2021

I've been chasing a crocus misrendering bug seen in a Qt trace.


The bottom image is crocus; the top is 965. This only happened on Gen4->5, so Ironlake and GM45 were my test machines. I burned a lot of time trying to work this out. I trimmed the traces down, dumped a stupendous amount of batchbuffers, turned off UBO push constants, dumped all the index and vertex buffers, and tried some RGBx changes, but nothing was rushing to hit me, except that the vertex shaders produced were different.

However they were different for many reasons, due to the optimization pipelines the mesa state tracker runs vs the 965 driver. Inputs and UBO loads were in different places so there was a lot of noise in the shaders.

I ported the trace to a piglit GL application so I could hack on the shaders and GL more easily, and with that I trimmed it down even further (even if I did burn some time on a misplaced */+ typo).

Using the ported app, I removed all uniform buffer loads and then split the vertex shader in half (it was quite large, but had two chunks). I finally then could spot the difference in the NIR shaders.

What stood out was that the 965 shader had an if which the crocus shader had converted to a bcsel. This is part of a peephole optimization that the mesa state tracker calls, and sure enough removing that call fixed the rendering. But why? It is a valid optimization.

In a parallel thread on another part of the planet, Ian Romanick filed a MR to mesa https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/12191 fixing a bug in the gen4/5 fs backend with conditional selects. This was something he noticed while debugging elsewhere. However his fix was for the fragment shader backend, and my bug was in the vec4 vertex shader backend. I tracked down where the same changes were needed in the vec4 backend and tested a fix on top of his branch, and the misrendering disappeared.

It's a strange coincidence we both started hitting the same bug in different backends in the same week via different tests, but he's definitely saved me a lot of pain in working this out! Hopefully we can combine them and get it merged this week.

Also thanks to Angelo on the initial MR for testing crocus with some real workloads.

August 03, 2021

Hi all, hope you all are doing fine!

Finally, today part 4 of my Outreachy Saga came out. The mid-point was on 5/7/21 and, as you can see, I'm really late. This week had the theme: “Modifying Expectations”.

But why did it take me so long to post? First, I had to internalize the topic a lot, because in my head I thought that when I reached this point I would have achieved all the goals I had proposed at the beginning of the internship. But when the mid-point arrived, it seemed to me that I hadn't done anything and that my internship was going to end because I hadn't fulfilled expectations.

The project aimed at two tasks:

  • Clean up the debugfs support
  • Remove custom dumb_map_offset implementations

During the development of the first task, we found that it could not be carried out as intended, so it needed to be restructured and resulted in:

  • Create vkms_config_show()

    • A function which aims to print the data in drm_debugfs_create_files()
    • It has already been reviewed and approved to be part of the drm-misc tree

During the development of this function, I came across an improvement in the code:

  • Replace macro in vkms_release()

    • It has already been reviewed and approved to be part of the drm-misc tree

As part of this week's assignments, I needed to talk to my advisors about the internship progress and schedule review. During our conversation, I could see/understand that I managed to achieve one of the goals, as presented above (I thought I hadn't achieved anything!!), and we also realized that I was not going to be able to do the second task, as it was in another context and could take a long time to understand how to solve it.

Thus, for the second half of the internship, it was decided to convert vkms_config_debugfs into the structure proposed by Wambui Karuga, and that's what I'm working on now.

During this time I'm learning that Linux kernel development is not linear and that several things can happen (setup problems again, breaking the kernel, not knowing what to do...), so I realized that one of the goals of Outreachy is learning to contribute and work with the project I've chosen.

So I started to take advantage of my journey to learn how to contribute as much as I can to the Linux kernel, and since I identified a lot with the development, maybe I can find a job to keep working on and contributing to Linux kernel development.

Thank you for following me so far, please feel free to comment! And stay tuned to the next chapters of this Saga!!!

Take care and have a great day!

July 30, 2021

Deeper Into Software

I don’t feel like blogging about zink today, so here’s more about everyone’s favorite software implementation of Vulkan.

The existing LLVMpipe architecture works like this from a top-down view:

  • mesa / st - this is the GL/Gallium state tracker
  • llvmpipe - this is the Gallium driver
  • gallivm - this is the LLVM program compiler
  • llvm - this is where the fragment shader runs

In short, everything is for the purpose of compiling LLVM programs which will draw/compute the desired result.

Lavapipe makes a slight change:

  • lavapipe - this is the Vulkan state tracker
  • llvmpipe - this is the Gallium driver
  • gallivm - this is the LLVM program compiler
  • llvm - this is where the fragment shader runs

It’s that simple.

Thus, any time a new feature is added to Lavapipe, what’s actually being done is plumbing that Vulkan feature through some number of layers to change how LLVM is executed. Some features, like samplerAnisotropy, require significant work at the gallivm layer just to toggle a boolean flag at the lavapipe level.

Other changes, like KHR_timeline_semaphores, are entirely contained in Lavapipe.

What Are Timeline Semaphores?

Vulkan has a number of mechanisms for synchronization including fences, events, and binary semaphores, all of which serve a specific purpose. For more concrete details on all of them, please read the blog of an actual expert.

The best and most awesome (don’t @ me, it’s not debatable) of these synchronization methods, however is the timeline semaphore.

A timeline semaphore is an object that can be used to signal and wait on specific integer-assigned points in command execution, also known as timelines. Each queue submission can be accompanied by an array of timeline semaphores to wait on and an array to signal; command buffers in a given submission will wait before executing, then signal after they’re done. This enables parallel code design where one thread can assemble command buffers and submit them, and the GPU can be made to pause at certain points for buffers/images referenced to become populated by another thread before continuing with execution.
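
To make this concrete, here is a minimal host-side sketch using the core Vulkan 1.2 timeline API (nothing Lavapipe-specific; instance and device creation are assumed):

/* Create a timeline semaphore with an initial value of 0. */
VkSemaphoreTypeCreateInfo type_info = {
    .sType = VK_STRUCTURE_TYPE_SEMAPHORE_TYPE_CREATE_INFO,
    .semaphoreType = VK_SEMAPHORE_TYPE_TIMELINE,
    .initialValue = 0,
};
VkSemaphoreCreateInfo create_info = {
    .sType = VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO,
    .pNext = &type_info,
};
VkSemaphore timeline;
vkCreateSemaphore(device, &create_info, NULL, &timeline);

/* Block on the host until some queue submission signals value 2. */
uint64_t wait_value = 2;
VkSemaphoreWaitInfo wait_info = {
    .sType = VK_STRUCTURE_TYPE_SEMAPHORE_WAIT_INFO,
    .semaphoreCount = 1,
    .pSemaphores = &timeline,
    .pValues = &wait_value,
};
vkWaitSemaphores(device, &wait_info, UINT64_MAX);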

Typically, semaphores are managed through signals which pass through the kernel and hardware, meaning that “waiting” on a timeline is really just waiting on an ioctl (DRM_IOCTL_SYNCOBJ_TIMELINE_WAIT) to signal that the specified timeline id has occurred, which requires no additional host-side synchronization. Things get a bit trickier in software, however, as the kernel is not involved, so everything must be managed in the driver.

Lavapipe And Timelines

This was a todo item sitting on the list for a while because it was tricky to handle. The most visible problems here were:

  • connecting timeline identifiers with queue submissions; timelines only need to be monotonic, not sequential, meaning that using something like a sliding array wouldn’t be very efficient
  • the actual synchronization when threads are involved

After some thought and deliberation about my life choices up to this point, I decided to tackle this implementation. The methodology I selected was to add a monotonic counter to the internal command buffer submission and then create a series of per-object timeline “links” which would serve to match the counter to the timeline identifier. This would enable each timeline semaphore to maintain a singly-linked list of links each time they were submitted, and the list could then be pruned at any given time—referenced against the internal counter—to update the “current” timeline id and then evaluate whether a specified wait condition had passed. In the case where the condition had not passed, the timeline link could also store a handle to the fence from the llvmpipe queue submission that could be waited on directly.
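
In sketch form, the data structure might look like this; the names and layout are mine for illustration, not the actual Lavapipe code:

#include <stdint.h>
#include <stdlib.h>

/* Each submission appends a link matching the internal submission
 * counter to the timeline value it will signal. */
struct timeline_link {
   uint64_t submit_counter;      /* monotonic internal submission counter */
   uint64_t timeline_value;      /* value this submission signals */
   struct pipe_fence *fence;     /* hypothetical handle to the queue fence */
   struct timeline_link *next;
};

struct timeline_semaphore {
   uint64_t current_value;       /* highest value known to have completed */
   struct timeline_link *links;  /* singly linked, in submission order */
};

/* Prune links whose submissions have completed, updating current_value. */
static void prune(struct timeline_semaphore *sem, uint64_t completed_counter)
{
   while (sem->links && sem->links->submit_counter <= completed_counter) {
      struct timeline_link *link = sem->links;
      if (link->timeline_value > sem->current_value)
         sem->current_value = link->timeline_value;
      sem->links = link->next;
      free(link);
   }
}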

Did it work?

Almost on the first try, actually.

But then I ran into a wall in CI while running piglit tests through zink.

It turns out that the CTS tests are considerably less aggressive than the piglit ones for things like this: specifically, there don’t appear to be any cases where a single timeline has 16 threads all trying to wait on it at different values, iterating thousands of times over the course of a couple seconds.

Oops.

But that’s now taken care of, and conformance never felt so good.

The road to Vulkan 1.2 continues!

July 29, 2021

This is a title

I’m back.

Where did I go?

My birthday passed recently, so I gifted myself a couple weeks off from blogging. Feels good.

For today, this is a Lavapipe blog.

What’s New With Lavapipe?

Lots.

Let’s check out what conformant features were added just in July:

  • EXT_line_rasterization
  • EXT_vertex_input_dynamic_state
  • EXT_extended_dynamic_state2
  • EXT_color_write_enable
  • features.strictLines
  • features.shaderStorageImageExtendedFormats
  • features.shaderStorageImageReadWithoutFormat
  • features.samplerAnisotropy
  • KHR_timeline_semaphores

Also under the hood now is a new 2D rasterizer from VMware which yields “a 2x to 3x performance improvement for 2D workloads”.

Why Aren’t You Using Lavapipe Yet?

Have a big Vulkan-using project? Do you constantly have to worry about breakages from all manner of patches being merged without testing? Can’t afford or too lazy to set up and maintain actual hardware for testing?

Why not Lavapipe?

Seriously, why not? If there’s features missing that you need for your project, open tickets so we know what to work on.

July 28, 2021

Part 1, Part 2, Part 3

After getting thoroughly nerd-sniped a few weeks back, we now have FreeBSD support through qemu in the freedesktop.org ci-templates. This is possible through the qemu image generation we have had for quite a while now. So let's see how we can easily add a FreeBSD VM (or other distributions) to our gitlab CI pipeline:


.freebsd:
  variables:
    FDO_DISTRIBUTION_VERSION: '13.0'
    FDO_DISTRIBUTION_TAG: 'freebsd.0' # some value for humans to read

build-image:
  extends:
    - .freebsd
    - .fdo.qemu-build@freebsd
  variables:
    FDO_DISTRIBUTION_PACKAGES: "curl wget"

Now, so far this may all seem quite familiar. And indeed, this is almost exactly the same process as for normal containers (see Part 1); the only difference is the .fdo.qemu-build base template. Using this template means we build an image babushka: our desired BSD image is actually a QEMU raw image sitting inside another generic container image. That latter image only exists to start the QEMU image and set up the environment if need be; you don't need to care which distribution it runs (Fedora for now).

Because of the nesting, we need to handle this accordingly in the script: tag of the actual test job: we need to start the image and make sure our jobs are actually built within the VM. The templates set up an ssh alias "vm" for this, and the vmctl script helps to do things on the VM:


test-build:
  extends:
    - .freebsd
    - .fdo.distribution-image@freebsd
  script:
    # start our QEMU image
    - /app/vmctl start

    # copy our current working directory to the VM
    # (this is a yaml multiline command to work around the colon)
    - |
      scp -r $PWD vm:

    # Run the build commands on the VM and if they succeed, create a .success file
    - /app/vmctl exec "cd $CI_PROJECT_NAME; meson builddir; ninja -C builddir" && touch .success || true

    # Copy results back to our run container so we can include them in artifacts:
    - |
      scp -r vm:$CI_PROJECT_NAME/builddir .

    # kill the VM
    - /app/vmctl stop

    # Now that we have cleaned up: if our build job before
    # failed, exit with an error
    - "[[ -e .success ]] || exit 1"
Now, there's a bit to unpack but with the comments above it should be fairly obvious what is happening. We start the VM, copy our working directory over and then run a command on the VM before cleaning up. The reason we use touch .success is simple: it allows us to copy things out and clean up before actually failing the job.

Obviously, if you want to build any other distribution you just swap the freebsd out for fedora or whatever - the process is the same. libinput has been using fedora qemu images for ages now.

July 27, 2021

Thanks to the work done by José Expósito, libinput 1.19 will ship with a new type of gesture: Hold Gestures. So far libinput supported swipe (moving multiple fingers in the same direction) and pinch (moving fingers towards or away from each other). These gestures are well-known, commonly used, and familiar to most users. For example, GNOME 40 recently increased its use of touchpad gestures to switch between workspaces, etc. But since swipe and pinch gestures require movement, it was not possible (for callers) to detect fingers on the touchpad that don't move.

This gap is now filled by Hold gestures. These are triggered when a user puts fingers down on the touchpad, without moving the fingers. This allows for some new interactions and we had two specific ones in mind: hold-to-click, a common interaction on older touchscreen interfaces where holding a finger in place eventually triggers the context menu. On a touchpad, a three-finger hold could zoom in, or do dictionary lookups, or kill a kitten. Whatever matches your user interface most, I guess.

The second interaction was the ability to stop kinetic scrolling. libinput does not actually provide kinetic scrolling; it merely provides the information the caller needs to implement it there: specifically, it tells the caller when a finger was lifted off a touchpad at the end of a scroll movement. It's up to the caller (usually: the toolkit) to implement the kinetic scrolling effects. One missing piece was that while libinput provided information about lifting the fingers, it didn't provide information about putting fingers down again later - a common way to stop scrolling on other systems.

Hold gestures are intended to address this: a hold gesture triggered after a flick with two fingers can now be used by callers (read: toolkits) to stop scrolling.

Now, one important thing about hold gestures is that they will generate a lot of false positives, so be careful how you implement them. The vast majority of interactions with the touchpad will trigger some movement - once that movement hits a certain threshold the hold gesture will be cancelled and libinput sends out the movement events. Those events may be tiny (depending on touchpad sensitivity) so getting the balance right for the aforementioned hold-to-click gesture is up to the caller.
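
From the caller's side, handling the new events looks roughly like this sketch against the libinput 1.19 API; trigger_hold_to_click() is a hypothetical toolkit callback:

#include <libinput.h>

static void
handle_event(struct libinput_event *event)
{
    struct libinput_event_gesture *gesture;

    switch (libinput_event_get_type(event)) {
    case LIBINPUT_EVENT_GESTURE_HOLD_BEGIN:
        /* Fingers are resting on the touchpad: a good moment to stop
         * any ongoing kinetic scrolling. */
        break;
    case LIBINPUT_EVENT_GESTURE_HOLD_END:
        gesture = libinput_event_get_gesture_event(event);
        /* A cancelled hold turned into movement (or another gesture),
         * so don't trigger hold-to-click for it. */
        if (!libinput_event_gesture_get_cancelled(gesture))
            trigger_hold_to_click(); /* hypothetical toolkit callback */
        break;
    default:
        break;
    }
}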

As usual, the required bits to get hold gestures into the wayland protocol are either in the works, mid-flight or merge-ready so expect this to hit the various repositories over the medium-term future.

July 26, 2021

I have not talked about raytracing in RADV for a while, but after some procrastination and focus on other things, I recently got back to it and achieved my next milestone.

In particular I have been hacking away at CTS and got to a point where dEQP-VK.ray_tracing.* runs to completion without crashes or hangs. Furthermore, I got the pass rate to 90% of non-skipped tests. So we’re finally getting somewhere close to usable.

As a further sign that it is usable, my CTS fixes also fixed the corruption issues in Quake 2 RTX (GitHub version), delivering this image:

Q2RTX on RADV

Of course not everything is perfect yet. Besides the CTS pass rate not being 100%, it has about half the Windows performance at 4K right now, and we still have some feature gaps to make it really usable for most games.

Why is it slow?

TL;DR: Because I haven’t optimized it yet and have implemented every shortcut imaginable.

AMD raytracing primer

Raytracing with Vulkan works with two steps:

  1. You build a giant acceleration structure that contains all your geometry. Typically this ends up being some kind of tree, usually a Bounding Volume Hierarchy (BVH).
  2. Then you trace rays using some traversal shader through the acceleration structure you just built.

With RDNA2 AMD started accelerating this by adding an instruction that allowed doing intersection tests between a ray and a single BVH node, where the BVH node can either be

  • A triangle
  • A box node specifying 4 AABB boxes

Of course this isn’t quite enough to deal with all geometry types in Vulkan so we also add two more:

  • an AABB box
  • an instance of another BVH combined with a transformation matrix

Building the BVH

With a search tree like a BVH it is very possible to build trees that are pretty much useless. As an example, consider a binary search tree that is very unbalanced. We can have similarly bad things happen with a BVH, including an unbalanced tree or overlapping bounding volumes.

And my implementation is the simplest thing possible: the input geometry becomes the leaves in exactly the same order, and then internal nodes are created just as you’d draw them. That is probably decently fast at building the BVH but surely results in a terrible BVH to actually traverse.
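
For illustration, such a build could look roughly like the following sketch; the bvh struct and make_box_node() helper are hypothetical stand-ins, not the actual RADV code:

/* Sketch: leaves stay in input order; 4-wide internal nodes are created
 * level by level above them, with no reordering or splitting heuristics. */
static uint32_t
build_naive_bvh(struct bvh *tree, uint32_t *nodes, uint32_t count)
{
   while (count > 1) {
      uint32_t parent_count = (count + 3) / 4;

      for (uint32_t i = 0; i < parent_count; i++) {
         uint32_t first = i * 4;
         uint32_t children = count - first < 4 ? count - first : 4;

         /* A parent's bounds are the union of its children's AABBs;
          * make_box_node() (hypothetical) computes that union. */
         nodes[i] = make_box_node(tree, &nodes[first], children);
      }
      count = parent_count;
   }
   return nodes[0]; /* the root node id */
}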

BVH traversal

After we built a BVH we can start tracing some rays. In rough pseudocode the current implementation is

stack = empty
insert root node into stack
while stack is not empty:
   node = pop a node from the stack

   if we left the bottom level BVH:
      reset ray origin/direction to initial origin/direction

   result = amd_intersect(ray, node)
   switch node type:
      triangle:
         if result is a hit:
            load some node data
            process hit
      box node:
         for each box hit:
            push child node on stack
      custom node 1 (instance):
         load node data
         push the root node of the bottom BVH on the stack
         apply transformation matrix to ray origin/direction
      custom node 2 (AABB geometry):
         load node data
         process hit

We already knew there were inherently going to be some difficulties:

  • We have a poor BVH so we’re going to do way more iterations than needed.
  • Calling shaders as a result of hits is going to result in some divergence.

Furthermore this also clearly shows some difficulties with how we approached the intersection instruction. Some advantages of the intersection instruction are that it avoids divergence in computing collisions when we have different node types in a subgroup, and that it is cheaper when there are only a few lanes active. (A single CU can process one ray/node intersection per cycle, modulo memory latency, while it can process an ALU instruction on 64 lanes per cycle.)

However even if it avoids the divergence in the collision computation we still introduce a ton of divergence in the processing of the results of the intersection. So we are still doing pretty bad here.

A fast GPU traversal stack needs some work too

Another thing to note is our traversal stack size. According to the Vulkan specification, a bottom-level acceleration structure should support 2^24 - 1 triangles and a top-level acceleration structure should support 2^24 - 1 bottom-level structures. Combined with a tree with 4 children in each internal node, we can end up with a tree depth of about 24 levels.

In each internal-node iteration of our loop we pop one element and push up to 4 elements, so at the deepest level of traversal we could end up with a 72-entry stack. Assuming these are 32-bit node identifiers, that ends up at 288 bytes of stack per lane, or ~18 KiB per 64-lane workgroup (the minimum which could possibly keep a CU busy with an ALU-only workload). Given that we have 64 KiB of LDS per CU (yes, I am using LDS since there is no divergent dynamic register addressing), that leaves only 3 workgroups per CU, leaving very few options for parallelism between different hardware execution units (e.g. the ALU and the texture unit that executes the ray intersections) or latency hiding of memory operations.

So ideally we get this stack size down significantly.

Where do we go next?

The first step is to get CTS passing and an initial merge request into upstream Mesa. As a follow-on to that I’d like to get a minimal prototype going for some DXR 1.0 games with vkd3d-proton, just to make sure we have the right feature coverage.

After that we’ll have to do all the traversal optimizations. I’ll probably implement a bunch of instrumentation so I actually have a clue on what to optimize. This is where having some runnable games really helps get the right idea about performance bottlenecks.

Finally, with some luck better shaders to build a BVH will materialize as well.

July 22, 2021

If you want to write an X application, you need to use some library that speaks the X11 protocol. For a long time this meant libX11, often called xlib, which - like most things about X - is a fantastic bit of engineering that is very much a product of its time with some confusing baroque bits. Overall it does a very nice job of hiding the icky details of the protocol from the application developer.

One of the details it hides has to do with how resource IDs are allocated in X. A resource ID (an XID, in the jargon) is a 32-bit (well, really 29-bit) integer that names a resource - window, colormap, what have you. Those 29 bits are split up netmask/hostmask style, where the top 8 or so uniquely identify the client, and the rest identify the resource belonging to that client. When you create a window in X, what you really tell the server is "I want a window that's initially this size, this background color (etc.) and from now on when I say (my client id + 17) I mean that window." This is great for performance because it means resource allocation is assumed to succeed and you don't have to wait for a reply from the server.

Key to all this is that in xlib the XID is the return value from the call that issues the resource creation request. Internally the request gets queued into the protocol's write buffer, but the client can march ahead and issue the next few commands as if creation had succeeded - because it probably did, and if it didn't you're probably going to crash anyway.

So to allocate XIDs the client just marches forward through its XID range. What happens when you hit the end of the range? Before X11R4, you'd crash, because xlib doesn't keep track of which XIDs it's allocated, just the lowest one it hasn't allocated yet. Starting in R4 the server added an extension called XC-MISC that lets the client ask the server for a list of unused XIDs, so when xlib hits the end of the range it can request a new range from the server.
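
In simplified form, the classic allocator amounts to something like this (an illustrative sketch, not xlib's actual code, which hides all of this behind macros and the XC-MISC fallback):

#include <assert.h>
#include <stdint.h>

/* The server hands each client a base and a mask at connection setup. */
struct client_ids {
    uint32_t resource_base; /* the client-identifying top bits */
    uint32_t resource_mask; /* which low bits the client may use */
    uint32_t next_id;       /* lowest ID not yet handed out */
};

static uint32_t alloc_xid(struct client_ids *c)
{
    /* Pre-XC-MISC behavior: running off the end was simply fatal. */
    assert(c->next_id <= c->resource_mask);
    return c->resource_base | c->next_id++;
}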

But. UI programming tends to want threads, and xlib is perhaps not the most thread-friendly. So XCB was invented, which sacrifices some of xlib's ease of use for a more direct binding to the protocol and (in theory) an explicitly thread-safe design. We then modified xlib and XCB to coexist in the same process, using the same I/O buffers, reply and event management, etc.

This literal reflection of the protocol into the API has consequences. In XCB, unlike xlib, XID generation is an explicit step. The client first calls into XCB to allocate the XID, and then passes that XID to the creation request in order to give the resource a name.

Which... sorta ruins that whole thread-safety thing.

Let's say you call xcb_generate_id in thread A and the XID it returns is the last one in your range. Then thread B schedules in and tries to allocate another XID. You'll ask the server for a new range, but since thread A hasn't called its resource creation request yet, from the server's perspective that "allocated" XID looks like it's still free! So now, whichever thread issues their resource creation request second will get BadIDChoice thrown at them if the other thread's resource hasn't been destroyed in the interim.

A library that was supposed to be about thread safety baked a thread safety hazard into the API. Good work, team.

How do you fix this without changing the API? Maybe you could keep a bitmap on the client side that tracks XID allocation, that's only like 256KB worst case, you can grow it dynamically and most clients don't create more than a few dozen resources anyway. Make xcb_generate_id consult that bitmap for the first unallocated ID, and mark it used when it returns. Then track every resource destruction request and zero it back out of the bitmap. You'd only need XC-MISC if some other client destroyed one of your resources and you were completely out of XIDs otherwise.
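
Sketched out, the proposed allocator might look like this (hypothetical code, sized for the 29-bit split described earlier; nothing XCB actually ships):

#include <stdint.h>

/* With ~8 bits of client prefix, a client owns about 2^21 XIDs; one bit
 * per XID is a 256KB bitmap in the worst case (grow it lazily in practice). */
#define CLIENT_ID_COUNT (1u << 21)

static uint8_t used[CLIENT_ID_COUNT / 8];

static uint32_t generate_id(uint32_t resource_base)
{
    for (uint32_t i = 0; i < CLIENT_ID_COUNT; i++) {
        if (!(used[i / 8] & (1u << (i % 8)))) {
            used[i / 8] |= (uint8_t)(1u << (i % 8));
            return resource_base | i;
        }
    }
    /* Completely full: only now fall back to asking via XC-MISC. */
    return 0;
}

/* Called from every annotated resource destruction request. */
static void destroy_id(uint32_t xid)
{
    uint32_t i = xid & (CLIENT_ID_COUNT - 1);
    used[i / 8] &= (uint8_t)~(1u << (i % 8));
}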

And you can implement this, except. One, XCB has zero idea what a resource destruction request is, that's simply not in the protocol description. Not a big deal, you can fix that, there's only like forty destructors you'd need to annotate. But then two, that would only catch resource destruction calls that flow through XCB's protocol binding API, which xlib does not, xlib instead pushes raw data through xcb_writev. So now you need to modify every client library (libXext, libGL, ...) to inform XCB about resource destruction.

Which is doable. Tedious. But doable.

I think.

I feel a little weird writing about this because: surely I can't be the first person to notice this.

July 21, 2021

Debugging programs using printf statements is not a technique that everybody appreciates. However, it can be quite useful and sometimes necessary depending on the situation. My past work on air traffic control software involved using several forms of printf debugging many times. The distributed and time-sensitive nature of the system being studied made it inconvenient or simply impossible to reproduce some issues and situations if one of the processes was stalled while it was being debugged.

In the context of Vulkan and graphics in general, printf debugging can be useful to see what shader programs are doing, but some people may not be aware it’s possible to “print” values from shaders. In Vulkan, shader programs are normally created in a high level language like GLSL or HLSL and then compiled to SPIR-V, which is then passed down to the driver and compiled to the GPU’s native instruction set. That final binary, many times outside the control of user applications, runs in a quite closed and highly parallel environment without many options to observe what’s happening and without text input and output facilities. Fortunately, tools like glslang can generate some debug information when compiling shaders to SPIR-V and other tools like Nsight can use that information to let you debug shaders being run.

Still, being able to print the values of different expressions inside a shader can be an easy way to debug issues. With the arrival of Ray Tracing, this is even more useful than before. In ray tracing pipelines, the shaders being executed and resources being used are chosen based on the scene geometry, the origin and the direction of the ray being traced. printf debugging can let you see where you are and what you’re using. So how do you print values from shaders?

Vulkan’s debug printf is implemented as part of the Validation Layers and the general procedure is well documented. If you were to implement this kind of mechanism yourself, you’d likely use a storage buffer to save the different values you want to print while shader invocations are running and, later, you’d go over the contents of that buffer and print the associated message with each value or values. And that is, essentially, what debug printf does but in a very convenient and automated way so that you don’t have to deal with the gory details and corner cases.

In a GLSL shader, simply:

  1. Enable the GL_EXT_debug_printf extension.

  2. Sprinkle your code with debugPrintfEXT() calls.

  3. Use the Vulkan Configurator that’s part of the SDK, or manually edit vk_layer_settings.txt for your app, enabling VK_VALIDATION_FEATURE_ENABLE_DEBUG_PRINTF_EXT (a programmatic alternative is sketched after this list).

  4. Normally, disable other validation features so as not to get too much output.

  5. Take a look at the debug report or debug utils info messages containing printf results, or set printf_to_stdout to true so printf messages are sent to stdout directly.
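
For completeness, debug printf can also be enabled programmatically at instance creation through VK_EXT_validation_features; a minimal sketch, assuming the Khronos validation layer is installed (error handling omitted):

#include <vulkan/vulkan.h>

/* Enable debug printf when creating the instance. */
VkValidationFeatureEnableEXT enables[] = {
    VK_VALIDATION_FEATURE_ENABLE_DEBUG_PRINTF_EXT,
};
VkValidationFeaturesEXT features = {
    .sType = VK_STRUCTURE_TYPE_VALIDATION_FEATURES_EXT,
    .enabledValidationFeatureCount = 1,
    .pEnabledValidationFeatures = enables,
};
const char *layers[] = { "VK_LAYER_KHRONOS_validation" };
VkInstanceCreateInfo info = {
    .sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO,
    .pNext = &features,
    .enabledLayerCount = 1,
    .ppEnabledLayerNames = layers,
};
VkInstance instance;
vkCreateInstance(&info, NULL, &instance);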

You can find an example shader in the validation layers test code. The debug printf feature has helped me a lot in the past, so I wanted to make sure it’s widely known and used.

Due to the observer effect, you may end up in situations where your code works correctly when enabling debug printf but incorrectly without it. This may be due to multiple reasons but one of the main ones I’ve encountered is improper synchronization. When debug printf is used, the layers use additional synchronization primitives to sync the contents of auxiliary buffers, which can mask synchronization bugs present in the app.

Finally, RenderDoc 1.14, released at the end of May, also supports Vulkan’s shader printf statements and will let you take a look at the print statements produced during a draw call. Furthermore, the print statements don’t have to be present in the original shader. You can also use the shader edit system to insert them on the fly and use them to debug the results of a particular shader invocation. Isn’t that awesome? Great work by Baldur Karlsson as always.

PS: As a happy coincidence, just yesterday LunarG published a white paper on Vulkan’s debug printf with additional information on this excellent feature. Be sure to check it out!

In order to expose OpenGL 4.6 the last missing feature in llvmpipe is anisotropic texture filtering. Adding support for this also allows lavapipe expose the Vulkan samplerAnisotropy feature.

I started writing anisotropic support > 6 months ago. At the time we were trying to deprecate the classic swrast driver, and someone pointed out it had support for anisotropic filtering. This support had also been ported to the softpipe driver, but never to llvmpipe.

I had also considered porting SwiftShader's anisotropic support, but since I was told the softpipe code was functional and had users, I based my llvmpipe port on that.

Porting the code to llvmpipe meant rewriting it to generate LLVM IR using the llvmpipe vector processing code. This is a lot messier than just writing linear processing code, and when I thought I had it working, it passed the GL CTS but failed the VK CTS. The results also looked worse to my eye than I'd have thought acceptable, and softpipe seemed to be just as bad.

Once I swung back around to this, I decided to port the VK CTS test to GL and run it on the softpipe and llvmpipe code. Initially llvmpipe had some more bugs to solve, especially where the mipmap levels were being chosen, but once I'd finished aligning softpipe and llvmpipe I started digging into why the softpipe code wasn't as nice as I expected.

The softpipe code was based on an implementation of an Elliptical Weighted Average Filter (EWA). The paper "Creating Raster Omnimax Images from Multiple Perspective Views Using the Elliptical Weighted Average Filter" described this. I sat down with the paper and softpipe code and eventually found the one line where they diverged.[1] This turned out to be a bug introduced in a refactoring 5 years ago, and nobody had noticed or tracked it down.

I then ported the same fix to my llvmpipe code, and VK CTS passes. I also optimized the llvmpipe code a bit to avoid doing pointless sampling and cleaned things up. This code landed in [2] today.

For GL4.6 there are still some fixes in other areas.

[1] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11917

[2] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8804

July 20, 2021

After a month of reverse-engineering, we’re excited to release documentation on the Valhall instruction set, available as a PDF. The findings are summarized in an XML architecture description for machine consumption. In tandem with the documentation, we’ve developed a Valhall assembler and disassembler as a reverse-engineering aid.

Valhall is the fourth Arm® Mali™ architecture and the fifth Mali instruction set. It is implemented in the Arm® Mali™-G78, the most recently released Mali hardware, and Valhall will continue to be implemented in Mali products yet to come.

Each architecture represents a paradigm shift from the last. Midgard generalizes the Utgard pixel processor to support compute shaders by unifying the shader stages, adding general purpose memory access, and supporting integers of various bit sizes. Bifrost scalarizes Midgard, transitioning away from the fixed 4-channel vector (vec4) architecture of Utgard and Midgard to instead rely on warp-based execution for parallelism, better using the hardware on modern workloads. Valhall linearizes Bifrost, removing the Very Long Instruction Word mechanisms of its predecessors. Valhall replaces the compiler’s static scheduling with hardware dynamic scheduling, trading additional control hardware for higher average performance. That means padding with “no operation” instructions is no longer required, which may decrease code size, promising better instruction cache use.

All information in this post and the linked PDF and XML is published in good faith and for general information purposes only. We do not make any warranties about the completeness, reliability and accuracy of this information. Any action you take upon the information you find here is strictly at your own risk. We will not be liable for any losses and/or damages in connection with the use of this information.

While we strive to make the information as accurate as possible, we make no claims, promises, or guarantees about its accuracy, completeness, or adequacy. We expressly disclaim liability for content, errors and omissions in this information.

Let’s dig in.

Getting started

In June, Collabora procured an International edition of the Samsung Galaxy S21 phone, powered by a system-on-chip with Mali G78. Although Arm announced Valhall with the Mali G77 in May 2019, roll out has been slow due to the COVID-19 pandemic. At the time of writing, there are not yet Linux friendly devices with a Valhall chip, forcing use of a locked down Android device. There’s a silver lining: we have a head start on the reverse-engineering, so by the time hacker-friendly devices arrive with Valhall GPUs, we can have open source drivers ready.

Android complicates reverse-engineering (though not as much as macOS). On Linux, we can compile a library on the device to intercept data sent to the GPU. On Android, we must cross-compile from a desktop with the Android Native Development Kit, ironically software that doesn’t run on Arm processors. Further, where on Linux we can track the standard system calls, Android device drivers replace the standard open() system call with a complicated Android-only “binder” interface. Adapting the library to support binder would be gnarly, but do we have to? We could sprinkle in one little hack anywhere we see a file descriptor without the file name.

#define MALI0 "/dev/mali0"

bool is_mali(int fd)
{
    char in[128] = { 0 }, out[128] = { 0 };
    snprintf(in, sizeof(in), "/proc/self/fd/%d", fd);

    /* Resolve the descriptor back to a path via procfs and check
     * whether it points at the Mali device node. */
    int count = readlink(in, out, sizeof(out) - 1);
    return count == strlen(MALI0) && strncmp(out, MALI0, count) == 0;
}

Now we can hook the Mali ioctl() calls without tracing binder and easily dump graphics memory.
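
With that helper in place, an LD_PRELOAD-style interposer can sit between the driver and the kernel. A rough sketch, where dump_mali_ioctl() is a hypothetical stand-in for the real tracing code:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdarg.h>

int ioctl(int fd, unsigned long request, ...)
{
    static int (*real_ioctl)(int, unsigned long, void *);
    if (!real_ioctl)
        real_ioctl = (int (*)(int, unsigned long, void *))
                     dlsym(RTLD_NEXT, "ioctl");

    /* In practice the optional argument here is a single pointer. */
    va_list ap;
    va_start(ap, request);
    void *arg = va_arg(ap, void *);
    va_end(ap);

    if (is_mali(fd))
        dump_mali_ioctl(fd, request, arg); /* hypothetical logger */

    return real_ioctl(fd, request, arg);
}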

We’re interested in the new instruction set, so we’re looking for the compiled shader binaries in memory. There’s a chicken-and-egg problem: we need to find the shaders to reverse-engineer them, but we need to reverse-engineer the shaders to know what to look for. Fortunately, there’s an escape hatch. The proprietary Mali drivers allow an OpenGL application to query the compiled binary with the ARM_mali_program_binary extension, returning a file in the Mali Binary Shader format. That format was reverse-engineered years ago by Connor Abbott for earlier Mali architectures, and the basic structure is unchanged in Valhall. Our task is simple: compile a test shader, dump both GPU memory and the Mali Binary Shader, and find the common section. Searching for the common bytes produces an address in executable graphics memory, in this case 0x7f0002de00. Searching for that address in turn finds the “shader program descriptor” which references it.

18 00 00 80 00 10 00 00 00 DE 02 00 7F 00 00 00

Another search shows this descriptor’s address in the payload of an index-driven vertex shading job for graphics or a compute job for OpenCL. Those jobs contain the Job Manager header introduced a decade ago for Midgard, so we understand them well: they form a linked list of jobs, and only the first job is passed to the kernel. The kernel interface has a “job chain” parameter on the submit system call taking a GPU address. We understand the kernel interface well as it is open source due to kernel licensing requirements.

With each layer identified, we teach the wrapper library to chase the pointers and dump every shader executed, enabling us to reverse-engineer the new instruction set and develop a disassembler.

Instruction set reconnaissance

Reverse-engineering in the dark is possible, but it’s easier to have some light. While waiting for the Valhall phone to arrive, I read everything Arm made public about the instruction set, particularly this article from Anandtech. Without lifting a finger, that article tells us Valhall is…

  • Warp-based, like Bifrost, but with 16 threads per warp instead of Bifrost’s 4/8.
  • Isomorphic to Bifrost on the instruction level (“operational equivalence”).
  • Regularly encoded.
  • Flat, lacking Bifrost’s clause and tuple packaging.

It also says that Valhall has a 16KB instruction cache, holding 2048 instructions. Since Valhall has a regular encoding, we divide 16384 bytes by 2048 instructions to find a Valhall instruction is 8 bytes. Our first attempt at a “disassembler” can print hex dumps of every 8 bytes on a line; our calculation ensures that is the correct segmentation.
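
In code, that first “disassembler” is just a few lines; a sketch:

#include <stdint.h>
#include <stdio.h>

/* Print one 8-byte Valhall instruction per line as a raw hex dump. */
static void hexdump_disasm(const uint8_t *code, size_t size)
{
    for (size_t i = 0; i + 8 <= size; i += 8) {
        for (size_t j = 0; j < 8; j++)
            printf("%02x ", code[i + j]);
        printf("\n");
    }
}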

From here on, reverse-engineering is iterative. We have a baseline level of knowledge, and we want to grow that knowledge. To do so, we input test programs into the proprietary driver to observe the output, then perturb the input program to see how the output changes.

As we discover new facts about the architecture, we update our disassembler, demonstrating new knowledge and separating the known from the unknown. Ideally, we encode these facts in a machine-readable file forming a single reference for the architecture. From this file, we can generate a disassembler, an assembler, an instruction encoder, and documentation. For Valhall, I use an XML file, resembling Bifrost’s equivalent XML.

Filling out this file is usually straightforward though tedious. Modern APIs are large, so there is a great deal of effort required to map the API requirements to the hardware features.

However, some hardware features do not map to any API. Here are subtler tales from reversing Valhall.

Dependency slots

Arithmetic is faster than memory access, so modern processors execute arithmetic in parallel with pending memory accesses. Modern GPU architectures require the compiler to manage this mechanism by analyzing the program and instructing the hardware to wait for the results before they’re needed.

For this purpose, Bifrost uses an explicit scoreboarding system. Bifrost groups up to 16 instructions together in a clause, and each clause has a fixed header. The compiler assigns a “dependency slot” between 0 and 7 to each clause, specified in the header. Each clause can wait on any set of slots, specified with another 8-bits in the clause header. Specifying dependencies per-clause is a compromise between precision and code size.

We expect Valhall to feature a similar scheme, but Valhall doesn’t have clauses or clause headers, so where does it specify this info?

Studying compiled shaders, we see the last byte of every instruction is usually zero. But when the result of a memory access is first read, the previous instruction has a bit set in the last byte. Which bit is set depends on the number of memory accesses in flight, so it seems the last byte encodes a dependency wait. The memory access instructions themselves are often zero in their last bytes, so it doesn’t look like the last byte is used to encode the dependency slot – but executing many memory access instructions at once and comparing the bits, we see a single 2-bit field stands out as differing. The dependency slot is specified inside the instruction, not in the metadata.

What makes this design practical? Two factors.

One, only the waits need to be specified in general. Arithmetic instructions don’t need a dependency slot, since they complete immediately. The longest message-passing instruction is shorter than the longest arithmetic instruction, so there is space in the instruction itself to specify waits only when needed.

Two, the performance gain from adding extra slots levels off quickly. Valhall cuts back on Bifrost’s 8 slots (6 general purpose). Instead it has 4 or 5 slots, with only 3 general purpose, saving 4 bits in every instruction.

This story exemplifies a general pattern: Valhall is a flattening of Bifrost. Alternatively, Bifrost is “Valhall with clauses”, although that description is an anachronism. Why does Bifrost have clauses, and why does Valhall remove them? The pattern in this story of dependency waits generalizes to answer the question: grouping many instructions into Bifrost clauses allows the hardware to amortize operations like dependency waits and reduce the hardware gate count of the shader core. However, clauses add substantial encoding overhead, compiler complexity, and imprecision. Bifrost optimizes for die space; Valhall optimizes for performance.

The missing modifier

Hardware features that are unused by the proprietary driver are a perennial challenge for reverse-engineering. However, we have a complete Bifrost reference at our disposal, and Valhall instructions are usually equivalent to Bifrost. Special instructions and modes from Bifrost cast a shadow on Valhall, showing where there are gaps in our knowledge. Sometimes these gaps are impractical to close, short of brute-forcing the encoding space. Other times we can transfer knowledge and make good guesses.

Consider the Cross Lane PERmute instruction, CLPER, which takes a register and the index of another lane in the warp, and returns the value of the register in the specified lane. CLPER is a “subgroup operation”, required for Vulkan and used to implement screen-space derivatives in fragment shaders. On Bifrost, the CLPER instruction is defined as:

<ins name="+CLPER.i32" mask="0xfc000" exact="0x7c000">
  <src start="0" mask="0x7"/>
  <src start="3"/>
  <mod name="lane_op" start="6" size="2">
    <opt>none</opt>
    <opt>xor</opt>
    <opt>accumulate</opt>
    <opt>shift</opt>
  </mod>
  <mod name="subgroup" start="8" size="2">
    <opt>subgroup2</opt>
    <opt>subgroup4</opt>
    <opt>subgroup8</opt>
  </mod>
  <mod name="inactive_result" start="10" size="4">
    <opt>zero</opt>
    <opt>umax</opt>
    ....
    <opt>v2infn</opt>
    <opt>v2inf</opt>
  </mod>
</ins>

We expect a similar definition for Valhall. One modification is needed: Valhall warps contain 16 threads, so there should be a subgroup16 option after subgroup8, with the natural binary encoding 11. Looking at a binary Valhall CLPER instruction, we see a 11 bit pair corresponding to the subgroup field. Similarly, experimenting with different subgroup operations in OpenCL lets us figure out the lane_op field. We end up with an instruction definition like:

<ins name="CLPER.u32" title="Cross-lane permute" dests="1" opcode="0xA0" opcode2="0xF">
  <src/>
  <src widen="true"/>
  <subgroup/>
  <lane_op/>
</ins>

Notice we do not specify the encoding in the Valhall XML, since Valhall encoding is regular. Also notice we lack the inactive_result modifier. On Bifrost, inactive_result specifies the value returned if the program attempts to access an inactive lane. We may guess Valhall has the same mechanism, but that modifier is not directly controllable by current APIs. How do we proceed?

If we can run code on the device, we can experiment with the instruction. Inactive lanes may be caused by divergent control flow, where one lane in the thread branches but another lane does not, forcing the hardware to execute only part of the warp. After reverse-engineering Valhall’s branch instructions, we can construct a situation where a single lane is active and the rest are inactive. Then we insert a CLPER instruction with extra bits set, store the result to main memory, and print the result. This assembly program does the trick:

# Elect a single lane
BRANCHZ.reconverge.id lane_id, offset:3

# Try to read a value from an inactive thread
CLPER.u32 r0, r0, 0x01000000.b3, inactive_result:VALUE

# Store the value
STORE.i32.slot0.reconverge @r0, u0, offset:0

# End shader
NOP.return

With the assembler we’re writing, we can assemble this compute kernel. How do we run it on the device without knowing the GPU data structures required to dispatch compute shaders? We make use of another classic reverse-engineering technique: instead of writing the initialization code ourselves, piggyback off the proprietary driver. Our wrapper library allows us to access graphics memory before the driver submits work to the hardware. We use this to read the memory, but we may also modify it. We already identified the shader program descriptor, so we can inject our own shaders. From here, we can jury-rig a script to execute arbitrary shader binaries on the device in the context of an OpenCL application running under the proprietary driver.

Putting it together, we find the inactive_result bits in the CLPER encoding and write one more script to dump all values.

for ((i = 0 ; i < 16 ; i++)); do
  sed -e "s/VALUE/$i/" shader.asm | python3 asm.py shader.bin
  adb push shader.bin /data/local/tmp/
  adb shell 'REPLACE=/data/local/tmp/shader.bin '\
    'LD_PRELOAD=/data/local/tmp/panwrap.so '\
    '/data/local/tmp/test-opencl'
done

The script’s output contains sixteen possibilities – and they line up perfectly with Bifrost’s sixteen options. Success.

Next steps

There’s more to learn about Valhall, but we’ve reverse-engineered enough to develop a Valhall compiler. As Valhall is a simplification of Bifrost, and we’ve already developed a free and open source compiler for Bifrost, this task is within reach. Indeed, adapting the Bifrost compiler to Valhall will require refactoring but little new development.

Mali G78 does bring changes beyond the instruction set. The data structures are changed to reduce Vulkan driver overhead. For example, the monolithic “Renderer State Descriptor” on Bifrost is split into a “Shader Program Descriptor” and a “Depth Stencil Descriptor”, so changes to the depth/stencil state no longer require the driver to re-emit shader state. True, the changes require more reverse-engineering. Fortunately, many data structures are adapted from Bifrost requiring few changes to the Mesa driver.

Overall, supporting Valhall in Mesa is within reach. If you’re designing a Linux-friendly device with Valhall and looking for open source drivers, please reach out!

Originally posted on Collabora’s blog

July 15, 2021

Some days ago my Igalia colleague Adrián Pérez pointed us to mold, a new drop-in replacement for existing Unix linkers created by the original author of LLVM lld. While mold is pretty new and does not aim to be 100% compatible with GNU ld, GNU gold or LLVM lld (at least as of the time I’m writing this), I noticed the benchmark table in its README file also painted a pretty picture of the performance of lld, albeit inferior to that of mold.

In my job at Igalia I work most of the time on VK-GL-CTS, Vulkan and OpenGL’s Conformance Test Suite, which contains thousands of tests for OpenGL and Vulkan. These tests are provided by different executable files, and the Vulkan tests on which I’m focused are contained in a binary called deqp-vk. When built with debug information, deqp-vk can be quite large. A recent build, for example, takes 369 MB on my drive. But the worst part is that linking the binary typically takes around 25 seconds on my work laptop.

$ time cmakebuild.sh --target deqp-vk
  [6/6] Linking CXX executable external/vulkancts/modules/vulkan/deqp-vk

  real    0m25.137s
  user    0m22.280s
  sys     0m3.440s

I had never paid much attention to the linker before, always relying on the default choice in Fedora or any other distribution. However, I decided to install lld, which has an official package, and gave it a try. You Will Not Believe What Happened Next.

$ time cmakebuild.sh --target deqp-vk
  [6/6] Linking CXX executable external/vulkancts/modules/vulkan/deqp-vk

  real    0m2.622s
  user    0m5.456s
  sys     0m1.764s

lld is capable of correctly linking deqp-vk in 1/10th of the time the default linker (GNU ld) takes to do the same job. If you want to try lld yourself you have several options. Ideally, you’d be able to run update-alternatives --set ld /usr/bin/lld as root but that option is notably not available in Fedora. There was a proposal to make that work but it never materialized, so it cannot be made the default system-wide linker.

However, depending on the build system used by a particular project, there should be a way to make it use lld instead of /usr/bin/ld. For example, VK-GL-CTS uses CMake, and CMake invokes the compiler to link executable files instead of calling the linker directly, which would be unusual. Both GCC and Clang accept -fuse-ld=lld as a command line option to use lld instead of the default linker. That flag should be added to CMake’s CMAKE_EXE_LINKER_FLAGS variable, either by reconfiguring an existing project with, for example, ccmake, or by adding the flag to the LDFLAGS environment variable before running CMake on a build directory for the first time.

I’m looking forward to using the mold linker and its multithreading capabilities in the future. In the meantime, I’m very happy to have checked out lld. It’s not often that a simple tooling change like this one gives me such a clear advantage.

July 09, 2021

It Happened.

ablend.png

That’s right.

Zink(-wip) now fully supports GL_KHR_blend_equation_advanced, which means ES 3.2 is a go (once my local CI clears me to push today’s snapshot).

And all it took was one brief exchange on the extremely professional #zink channel on OFTC with a top Mesa reviewer who is incidentally rumored to be undergoing training deep in the mountains to become an expert BBQ master:

That's the thing. You can totally do it in Zink.

My mind was blown.

Why hadn’t I thought of that sooner?

I could just…do it? Just like that? And then it’d be done?

Truly the experts are on a different level from us mortals.

So now it’s done, and that means zink is finished. I don’t expect there will be any more work to do now that the final boss has been defeated. Don’t even bother trying to file bug reports.

You may not like it, but this is what peak Friday looks like.

July 07, 2021

The Unsung Heroes

This is going to be less of a technical post and more of a “have you thought about” post from me personally (usual disclaimer: this post represents only my views). With that said, I think this is more important than the average post here, meaning that expectations should be set somewhere between “I need to stop everything else I’m doing until I finish reading” and “this is the most important event in my life.”

Let’s talk about open source. No, Open Source. The idea of it.

How Does Open Source Work?

Those of you who are veterans are rolling your eyes. Another post about the glory of Open Source.

The thing about Open Source is that it’s sort of whatever you make of it. At its core, it’s about getting people together to solve a problem—community building. Whether that community is large or small, the goal is the same: write some quality software.

To that end, you’ve got your usual corporate powerpoint slide of community roles:

  • maintainers
  • developers
  • reviewers
  • whatever other buzzwords are currently relevant

In Mesa, the maintainer and developer roles are mostly the same among core contributors: these are the people who write the code that gets posted about on all the news sites.

The reviewer is a bit more mysterious though. Who are reviewers, and what separates them from the others?

WD-40

Reviewers are the grease that makes the project work. There’s really no other way of saying it.

Outside of a few components of Mesa that are effectively the wild west, without any form of oversight or approval needed for changes to be landed, every driver and utility in the tree requires that changes undergo review before they land. This means that each and every patch which affects code or build has to have a person stop everything else they’re doing and physically scroll through each patch, line-by-line, then add a Reviewed-by or Acked-by tag.

If you’re unclear as to the meanings of these tags, consider it like you’re going skydiving with someone you’ve never met before who has been in charge of preparing your parachute:

  • Reviewed-by means “I triple-checked your parachute as well as your reserve, and I’m as certain as a human is capable of being that everything is how it should be”
  • Acked-by means “Hey, I grabbed this already-packed parachute off the hanger and gave it a once-over; you’ll probably be fine”

It’s then up to the developer to decide whether to merge the code based on the feedback given to them by the reviewer.

This, of course, assumes they get feedback at all.

Balance

Too often on news sites (and in certain corporate metrics) you’ll see something like “Patches McCodesAlot, working for GreatCodingCompany, authored the most code changes for this release cycle (9001 patches), which is over 100x more than the next highest contributor.”

The manager at a company sees this and thinks “I’ll send this up the chain. We should poach Patches so we can have greater control over this project which underpins our entire business strategy. Also it’ll make my powerpoint pie charts look rad.”

The casual reader sees this and says “Wow, Patches is awesome! Without Patches, I probably couldn’t even play Fororantwatch on my Linux gaming desktop!”

But how do the patches that Patches writes get merged into the release? Unless Patches works exclusively in one of the undermaintained areas of the project, in which case it’s unlikely that their work is being widely used, the odds are that someone’s pulling a huge lift on the review side to enable all of those patches landing into the repository.

This is the job of the reviewer.

A Thanks

As this Mesa release cycle starts to wind down, I hope that readers of this blog and news sites can take a moment to look past Patches McCodesAlot and see the people who make it possible for Patches to land so many damn patches.

At the time of this post, this is what the top 10 reviewers managed to accomplish over the past few months:

Number of Reviews   Reviewer Name       Corporate Affiliation
91                  Erik Faye-Lund      Collabora
94                  Samuel Pitoiset     Valve
99                  Alejandro Piñeiro   Igalia
115                 Kenneth Graunke     Intel
116                 Bas Nieuwenhuizen   Blogger
121                 Lionel Landwerlin   Intel
128                 Adam Jackson        Red Hat
140                 Marek Olšák         AMD
176                 Jason Ekstrand      Intel
300                 Dave Airlie         Red Hat

Summed up, that’s over 1300 patches reviewed! For perspective, that’s around 30% of all the patches in this release, and it’s about 70% of the total number of patches that zink has received in the course of its existence.

Looking at it another way though, this is over 1300 patches that other people wrote which were able to land because these people took the time to look over the proposed changes—to triple-check the parachutes, as it were.

So thanks, Mesa reviewers. The project wouldn’t exist without all of you (and your generous employers, who should be blasting these metrics in the press when they talk about being good Open Source citizens).

But Also

I’d be remiss if I didn’t also mention the people working on Mesa CI. There are no patch counts or review counts or anything to recognize everyone hard at work here, but CI is what keeps the triangles blasting out of your GPUs looking how they should.

Thanks, CI team. You’re awesome.

According to a recent metric, the Mesa CI infrastructure only had a 0.6% accidental failure rate. That’s pretty good considering how many thousands of jobs run every day.

June 28, 2021

This year, I decided to participate as a speaker in the esLibre 2021 conference. esLibre is a Spanish free software conference that covers many different topics related to open-source projects, from technical aspects to their social impact.

This year the conference had talks about game development with Godot, KDE, LibreOffice, and Free Software in Universities, among many others. Check out the program.

esLibre 2021

This is my first time participating in this conference and I enjoyed it a lot. Huge applause to the organization team for the enormous amount of work put into this edition, for helping out the speakers with different testing days, and for their kindness in replying to any question from me and other attendees. They did a superb job!

My talk was an introduction to Mesa, where I covered things like where Mesa sits in the open-source graphics stack, a summary of what it does, the drivers implemented in Mesa, how our community is organized, and how to contribute to it. If you know Spanish, you can check it out here (PDF). But in case you want an English version of it, this talk is very similar to the one I gave at Ubucon Europe 2018.

My esLibre talk was recorded as well! Edit: the recording is already available here.

Enjoy it!

Introduction to Mesa

Hi all, hope you all are doing fine!

Today is part 3 of my Outreachy Saga. It’s been 5 weeks of my Outreachy internship, and everything is not sailing as smoothly as I would like! Why?? Because I had a little problem with my setup and was stuck for 2 days, unable to work, until I managed to set everything up correctly. As I said in my introduction post, one thing that I’m learning at my internship is "learning", because not everything goes as I would like. Sometimes it is necessary to stop, breathe, and redo everything, and after redoing everything, it is so rewarding when things start to flow.

Today my week’s blog will be focusing on the Linux Kernel Community at which I’m interning and the project on which I’m working. So, let’s get started!

What is Linux Kernel?

A little context: the core of an Operating System (OS) is the kernel, which is responsible for integrating the physical devices (hardware) of the computer with the programs (software). In a Linux OS this core is known as the Linux kernel; it is open source and freely available to the community. As I said in my introduction blog (https://open-sourceress.com/outreachy-introduction/), the community is the set of people and companies that want to collaborate on the development of the system.

Thanks to these contributions the Linux kernel has grown a lot: with over 8 million lines of code and well over 1000 contributors to each release, it is one of the largest and most active free software projects in existence. The kernel codebase is logically broken down into a set of subsystems: networking, architecture-specific support (x86, ARM, MIPS, ...), memory management, video devices, real-time systems, among others. This makes the contributions made to the kernel a little easier to manage, as most subsystems have a designated maintainer, who handles verifying and accepting contributions before they are incorporated into the Linux kernel mainline.

About my project at Linux Kernel – “Improvements to DRI-devel (aka kernel GPU subsystem)“

In laptops, tablets, phones, and lots of other places, the GPU/display stack uses more silicon die space than everything else combined (humans are mostly visual people, after all). dri-devel (and the wider set of projects under the X.org Foundation's umbrella) is the community that makes this all work and shine.

In my project, I would like to create new features and better understand how the DRM core works. To achieve this goal, I chose these tasks: cleaning up the debugfs support and removing custom dumb_map_offset implementations.

How can you contribute to Linux Kernel?

Anyone can contribute to the development of the kernel: just develop a patch, send it to the subsystem's mailing list, wait for the community's considerations, fix whatever it takes, and that's it.

But yes, I know well that starting to contribute to the kernel is scary, especially for anyone who is a noob (beginner, newbie) in the Free Software development world and doesn't know where to start.

But there are several things and initiatives to help, for example:

Internet courses and materials:

A beginners guide to linux kernel development

Kernel newbies

First Patch tutorial

Write and Submit your first Linux kernel Patch

Internship programs:

Outreachy

Outreachy is a paid, remote internship program. Outreachy's goal is to support people from groups underrepresented in tech. We help newcomers to free software and open source make their first contributions. Outreachy provides internships to open source work. People apply from all around the world. Interns work remotely and are not required to move. Interns are paid a stipend of $5,500 USD for the three month internship. Interns have a $500 USD travel stipend to attend conferences or events. Interns work with experienced mentors from open source communities. Outreachy internship projects may include programming, user experience, documentation, illustration, graphical design, or data science. Interns often find employment after their internship with Outreachy sponsors or in jobs that use the skills they learned during their internship.

GSoC

Google Summer of Code is a global program focused on bringing more student developers into open source software development. Students work with an open source organization on a 10 week programming project during their break from school.

Study groups:

In Brazil, I met 2 of these groups

In Campinas - LKCamp

In São Paulo - FLUSP

It's scary, I know, but as you can see there are several initiatives and lots of content to help you start contributing to the Linux kernel. So don't be afraid: try to contribute to the Linux kernel and ask the community for help; there will always be someone who can help you!

Ah!!! And I almost forgot: if you need help, you can send me a message. I'm also just starting in this world of kernel contribution, but I'll do my best to help. Besides showing my Outreachy internship progress, my goal with this blog is also to create content to help beginners contribute to and develop the kernel, both in English and in Portuguese (my native language).

Thank you for following me so far, please feel free to comment! And stay tuned to the next chapters of this Saga!!!

Take care and have a great day!

June 23, 2021

BREAKING: THIS IS NO LONGER A ZINK BLOG

For today, at least.

Today, this blog is a Gallium blog. And it’s a momentous day indeed.

We all know what this is:

portal2-title.png

It’s a screenshot of Portal 2 with the Gallium HUD activated and VSync disabled.

But what driver is that underneath?

Well, for today’s blog it’s RadeonSI, the reference implementation of a Gallium driver.

And why is this, I can hear you all asking.

What if I told you that this screenshot with 10% higher FPS is also Portal 2 with VSync disabled on RadeonSI using one trick that graphics developers WON’T TELL YOU:

portal2-nine-title.png

Interested?

Coming Soon (Maybe, And Also Maybe Requiring Some Early 2000s-era Elbow Grease From Anyone Wanting To Try): Native Linux Source Games On Gallium Nine

We did it.

By assembling an elite team of individuals with a few minutes to spare here and there over the past week, including:

  • Josh Ashton, expert spammer of 🐸 emojis
  • Axel Davy, expert struct packer
  • Me, expert blogger

Is Such A Thing Even Possible? Why Yes, Yes It Is.

it is now (technically) possible to run DXVK-compatible Source Engine games through Gallium’s Nine state tracker, providing a native D3D9 runtime.

Is your Portal 2 in-game FPS sad and barely even 500 like this screenshot?

portal2-ingame.png

Why not jack it up to more than TWICE THAT NUMBER* with riced out, GPU-fan-shredding technology that Mesa Gallium drivers have been shipping for years?

portal2-nine-ingame.png

Disclaimer*

This post does not represent any form of official statement or address from Valve and is only a small project that was started out of boredom while I waited for CTS runs to finish.

This post also does not make any claims or statements regarding performance on other drivers, or performance comparisons using alternative graphics emulation layers, though whew, it sure would be interesting to see what those kinds of numbers look like!

June 21, 2021

The Khronos Group has released today a new version of the Vulkan specification that includes the VK_EXT_multi_draw extension. This new extension has been championed by Mike Blumenkrantz, contracted by Valve to work on Zink, an OpenGL implementation that’s part of Mesa and runs on top of Vulkan. Mike has been working very hard to make OpenGL-on-Vulkan performant and better, and came up with this extension to close an existing gap between the two APIs. As part of the ongoing collaboration between Igalia and Valve, I had the chance to participate in the release process by reviewing the specification text in depth, providing feedback and fixes, and writing a set of CTS tests to check conformance for drivers implementing the extension. As you can see in the contributors list, VK_EXT_multi_draw had input and feedback from more vendors. Special mention to Jason Ekstrand from Intel, who provided an initial review of the text, and Piers Daniell from NVIDIA, who was also involved since the early stages.

Thanks to VK_EXT_multi_draw, Vulkan will have equivalents to the glMultiDrawArrays and glMultiDrawElements functions from OpenGL. They’re called vkCmdDrawMultiEXT and vkCmdDrawMultiIndexedEXT. These two new functions allow recording a batch of draw commands in a command buffer using a single call, and they can be used in situations where an application would be recording a high number of draws without changing state. Although Vulkan already had mechanisms that allowed applications to record batches of draw commands in the form of indirect draws, these need the array of draw parameters to reside in a GPU-accessible buffer. VK_EXT_multi_draw, on the other hand, lets applications provide arrays of draw parameters using CPU memory.
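
As a rough sketch of what this looks like from the application side (the per-draw parameter type and helper functions here are made up for illustration, and loading the extension entry point with vkGetDeviceProcAddr is left to the caller):

#include <vulkan/vulkan.h>

/* Hypothetical per-draw parameters gathered by the application. */
typedef struct { uint32_t firstVertex, vertexCount; } AppDraw;

/* Before: one vkCmdDraw call per draw, each paying recording overhead. */
static void record_unbatched(VkCommandBuffer cmd, const AppDraw *d, uint32_t n)
{
    for (uint32_t i = 0; i < n; i++)
        vkCmdDraw(cmd, d[i].vertexCount, 1, d[i].firstVertex, 0);
}

/* After: a single vkCmdDrawMultiEXT call. The parameter array lives in
 * plain CPU memory, unlike the GPU-accessible buffer an indirect draw needs. */
static void record_batched(VkCommandBuffer cmd, const AppDraw *d, uint32_t n,
                           VkMultiDrawInfoEXT *infos, /* n elements */
                           PFN_vkCmdDrawMultiEXT cmdDrawMultiEXT)
{
    for (uint32_t i = 0; i < n; i++) {
        infos[i].firstVertex = d[i].firstVertex;
        infos[i].vertexCount = d[i].vertexCount;
    }
    cmdDrawMultiEXT(cmd, n, infos, 1 /* instanceCount */,
                    0 /* firstInstance */, sizeof(VkMultiDrawInfoEXT));
}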

vkCmdDrawMultiEXT is essentially equivalent to calling vkCmdDraw multiple times in a row, and vkCmdDrawMultiIndexedEXT does the same for vkCmdDrawIndexed. To improve application performance and reduce CPU overhead, Vulkan drivers are allowed and encouraged to omit checks for API function arguments provided by applications (these correctness checks are provided by the Vulkan Validation Layers mainly during application development), and thanks to mechanisms like primary and secondary command buffers, Vulkan makes it possible to prepare sequences of commands for the GPU to execute using multiple threads and CPU cores. In this situation, you may be wondering how much of an improvement the new functions provide apart from saving a few microseconds processing some function calls. In other words, what’s the practical difference between calling vkCmdDraw a thousand times and batching a thousand draws using vkCmdDrawMultiEXT?

The answer is that most of the overhead of recording a draw command doesn’t come from having to call a function, but from the checks the implementation has to run when recording the command. These checks may not be related to correctness, but to additional actions and options that may need to be taken depending on the state of the command buffer at the moment the draw command is recorded. For example, see the calls to radv_before_draw when RADV processes a draw command (note: RADV is Mesa’s super nice free software Vulkan driver for AMD cards). These checks only need to run once when using the new functions. In benchmark-like scenarios using real drivers, Mike has been able to verify that, while the overhead varies per driver and some of them are lightweight and have minimal overhead, some mainstream drivers can double their draw call processing rate when using VK_EXT_multi_draw.

Mike has work-in-progress implementations for Mesa’s ANV and RADV drivers (the Vulkan drivers for Intel and AMD GPUs, respectively) which pass conformance and will hopefully land soon in Mesa’s main branch, and more drivers are expected to ship support for the extension in the near future.

We Did It

After months and months of the construction crews hammering away, VK_EXT_multi_draw has now been released for general use.

Will this suddenly make zink the fastest GPU driver in history?

Obviously.

Long-time readers will recall that I memed about this extension some time ago, and the numbers in a synthetic benchmark targeted at exactly this feature are phenomenal.

For more on the topic, we go to our Senior Multidraw Correspondent and my personal Khronos BFF, Ricardo Garcia, who has been following this story since the beginning.

June 18, 2021

Fast Friday

In short, an issue was filed recently about getting the Nine state tracker working with zink.

Was it the first? No..

Was it the first one this year? Yes.

Thus began a minutes-long, helter-skelter sequence of events to get Nine up and running, spread out over the course of a day or two. In need of a skilled finagler knowledgeable in the mysterium of Gallium state trackers, I contacted the only developer I know with a rockstar name, Axel Davy. We set out at dawn, and I strapped on my parachute. It was almost immediately that I heard a familiar call: there’s a build issue.

Next stop was crashing in unimplemented interface methods, with a stopover in flailing about wildly in TGSI (what even is this?) before I arrived at my target:

nine.png

Ah, glorious triangles.

June 16, 2021

Hi all, hope you all are doing fine!

It's been 3 weeks since I started the Outreachy internship, I've done a lot but at the same time, I don't think I've done anything.

The first week was the week of setting up my machine, fighting with IRC to be able to send messages, and sending some necessary information to the Outreachy organizers. I also needed to configure my blog's RSS feed (yes, at a time when I was in doubt whether I wanted to work with backend or frontend, I decided to learn how to develop a blog). As I use Gatsby as the base of the blog, it was relatively easy to configure the RSS (Hooray!! One thing worked \o/)

To do my setup, my mentor Melissa gave me 2 tutorials as a base:

Setting up your QEMU VM

How to compile and install the Linux Kernel

And I needed to redo them a few times to understand how they worked (a tutorial with my steps, explaining where I had problems, is on my GIANT to-do list; one day it will come out...), because I was going to use a virtual machine to run the tests and see if I hadn't broken the kernel too much. After everything was configured, I needed to test whether everything was right, and for that I used this tutorial:

Experiment-one-iio-dummy

OK, setup working, and now?? I still needed to configure two more things: VKMS (a software-only model of a KMS driver that is useful for testing and for running X (or similar) on headless machines) and IGT (a test suite used specifically for debugging and development of the DRM drivers). For this I used the tutorial:

VKMS

I was stuck on this task for a few days: the tests failed, but why??? A configuration error? A tool installation error??

Nooo! It was my own mistake... I hadn't read the tutorial properly, and I missed the message that said:

“The tests need to be run without a compositor, so you need to switch to text-only mode”

For that I only needed to do:

sudo systemctl isolate multi-user.target 

Ready! Solved, tests working \o/ and now what?

Now my task for the next few days is to “create a debugfs file for vkms using drm_state_dump()”, but that's a subject for the next post.

Thank you for following me so far, please feel free to comment! And stay tuned to the next chapters of this Saga called Outreachy!!

Take care and have a great day!