planet.freedesktop.org
November 23, 2020

I’ve Been Here For…

I guess I never left, really, since I’ve been vicariously living the life of someone who still writes zink patches through reviewing and discussing some great community efforts that are ongoing.

But now I’m back living that life of someone who writes zink patches.

Valve has generously agreed to sponsor my work on graphics-related projects.

For the time being, that work happens to be zink.

Ambition

I don’t want to just make a big post about leaving and then come back after a couple weeks like nothing happened.

It’s 2020.

We need some sort of positive energy and excitement here.

As such, I’m hereby announcing Operation Oxidize, an ambitious endeavor between me and the formidably skillful Erik Faye-Lund of Collabora.

We’re going to land 99% of zink-wip into mainline Mesa by the end of the year, bringing the driver up to basic GL 4.6 and ES 3.2 support with vastly improved performance.

Or at least, that’s the goal.

Will we succeed?

Stay tuned to find out!

November 22, 2020

Another Brief Review

This was a (relatively) quiet week in zink-world. Here’s some updates, once more in no particular order:

  • Custom border color support landed
  • Erik wrote and I reviewed a patch that enabled some blitting optimizations but also regressed a number of test cases
    • Oops
  • I wrote and Erik reviewed a series which improved some of the query code but also regressed a number of test cases
    • Oops++
  • The flurry of activity around Quake3 not working on RADV died down as it’s now been suggested that this is not a RADV bug and is instead the result of no developers fully understanding the majesty of RADV’s pipeline barrier implementation
    • Developers around the world stunned by the possibility that they don’t know everything
  • Witold Baryluk has helpfully contributed a truckload of issue tickets for my zink-wip branch after extensive testing on AMD hardware
    • I’ll get to these at some point, I promise

Stay tuned for further updates.

November 21, 2020
Recently I acquired an Acer Aspire Switch 10 E SW3-016; this device was the main reason for writing my blog post about the shim boot loop. The EFI firmware of this device is bad in a number of ways:

  1. It considers its eMMC unbootable unless its ESP contains an EFI/Microsoft/Boot/bootmgfw.efi file.

  2. But it will actually boot EFI/Boot/bootx64.efi! (wait, what? yes, really)

  3. It will only boot from a USB disk connected to its micro-USB connector, not from the USB-A connector on the keyboard-dock.

  4. You must first set a BIOS admin password before you can disable secure-boot (which is necessary to boot home-built kernels without doing your own signing).

  5. Last but not least it has one more nasty "feature": it detects whether the OS being booted is Windows, Android or unknown, and it updates the ACPI DSDT based on this!

Some more details on the OS detection misfeature. The ACPI Device (SDHB) node for the MMC controller connected to the SDIO wifi module contains:

        Name (WHID, "80860F14")
        Name (AHID, "INT33BB")


Depending on which OS the BIOS thinks it is booting, it renames one of these 2 to _HID. This is weird given that it will only boot if EFI/Microsoft/Boot/bootmgfw.efi exists, but it still does this. Worse, it looks at the actual contents of EFI/Boot/bootx64.efi for this. It seems that that file must be signed; otherwise the firmware goes into OS-unknown mode and keeps the 2 DSDT bits above as-is, so there is no _HID defined for the wifi's MMC controller and thus no wifi. I hit this issue when I replaced EFI/Boot/bootx64.efi with grubx64.efi to break the bootloop. grubx64.efi is not signed, so the DSDT as Linux saw it contained the above AML code. Using the proper workaround for the bootloop from my previous blog post, this bit of the DSDT morphs into:

        Name (_HID, "80860F14")
        Name (AHID, "INT33BB")


And the wifi works.
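
If you want to check which variant of the DSDT your firmware handed to Linux, one quick way is to dump the live table and look at the SDHB device node. A rough sketch, assuming acpica-tools (for iasl) is installed and that the node really is called SDHB as above:

sudo cat /sys/firmware/acpi/tables/DSDT > dsdt.dat   # dump the DSDT as the firmware gave it to Linux
iasl -d dsdt.dat                                     # disassemble it into dsdt.dsl
grep -A 2 'Device (SDHB)' dsdt.dsl                   # see whether _HID or the WHID/AHID pair is present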

The Acer Aspire Switch 10 E SW3-016's firmware also triggers an actual bug / issue in Linux' ACPI implementation, causing the bluetooth to not work. This is discussed in much detail here. I have a patch series fixing this here.

And the older Acer Aspire Switch 10 SW5-012's and S1002's firmware has some similar issues:

  1. It considers its eMMC unbootable unless its ESP contains an EFI/Microsoft/Boot/bootmgfw.efi file

  2. These models will actually always boot the EFI/Microsoft/Boot/bootmgfw.efi file, so that is somewhat more sensible.

  3. On the SW5-012 you must first set a BIOS admin password before you can disable secure-boot.

  4. The SW5-012 is missing an ACPI device node for the PWM controller used for controlling the backlight brightness. I guess that the Windows i915 gfx driver just directly pokes the registers (which are in a whole other IP block), rather than relying on a separate PWM driver as Linux does. Unfortunately there is no way to fix this other than using a DSDT overlay. I have a DSDT overlay available for this here, for the V1.20 BIOS and only for the V1.20 BIOS.

Because of 1. and 2. you need to take the following steps to get Linux to boot on the Acer Aspire Switch 10 SW5-012 or the S1002:

  1. Rename the original bootmgfw.efi (so that you can chainload it in the multi-boot case)

  2. Replace bootmgfw.efi with shimia32.efi

  3. Copy EFI/fedora/grubia32.efi to EFI/Microsoft/Boot

This assumes that you have the files from a 32 bit Windows install in your ESP already.
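
In shell form those three steps look roughly like this. This is only a sketch: it assumes the ESP is mounted at /boot/efi, that shimia32.efi and grubia32.efi live in EFI/fedora, and the backup file name is arbitrary:

ESP=/boot/efi
sudo mv $ESP/EFI/Microsoft/Boot/bootmgfw.efi $ESP/EFI/Microsoft/Boot/bootmgfw-original.efi   # 1. keep the original for chainloading
sudo cp $ESP/EFI/fedora/shimia32.efi $ESP/EFI/Microsoft/Boot/bootmgfw.efi                    # 2. the firmware now boots shim
sudo cp $ESP/EFI/fedora/grubia32.efi $ESP/EFI/Microsoft/Boot/                                # 3. put grub next to shim
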
November 20, 2020

(This post was first published with Collabora on Nov 19, 2020.)

Wayland (the protocol and architecture) is still lacking proper consideration for color management. Wayland also lacks support for high dynamic range (HDR) imagery, which has been around in the movie and broadcasting industries for a while now (e.g. Netflix HDR UI).

While there are well established tools and workflows for how to do color management on X11, even X11 has not gained support for HDR. There were plans for it (Alex Goins, DeepColor Visuals), but as far as I know nothing really materialized from them. Right now, the only way to watch HDR content on an HDR monitor in Linux is to use the DRM KMS API directly, in other words, not use any window system, which means not using any desktop environment. Kodi is one of the very few applications that can do this at all.

This is a story about starting the efforts to fix the situation on Wayland.

History and People

Color management for Wayland has been talked about on and off for many years by dozens of people. To me it was obvious from the start that color management architecture on Wayland must be fundamentally different from X11. I thought the display server must be part of the color management stack instead of an untrusted, unknown entity that must be bypassed and overridden by applications that fight each other for who gets to configure the display. This opinion was wildly controversial and it took a long time to get my point across, but over the years some color management experts started to open up to new ideas and other people joined in the opinion as well.  Whether these new ideas are actually better than the ways of old remains to be seen, though. I think the promise of getting everything and more to work better is far too great to not try it out.

The discussions started several times over the years, but they always dried out mostly without any tangible progress. Color management is a wide, deep and difficult topic, and the required skills, knowledge, interest, and available time did not come together until fairly recently. People did write draft protocol extensions, but I would claim that it was not really until Sebastian Wick started building on top of them that things started moving forward. But one person cannot push such a huge effort alone even for the simple reason that there must be at least one reviewer before anything can be merged upstream. I was very lucky that since summer 2020 I have been able to work on Wayland color management and HDR for improving ChromeOS, letting me support Sebastian's efforts on a daily basis. Vitaly Prosyak joined the effort this year as well, researching how to combine the two seemingly different worlds of ICC and HDR, and how tone-mapping could be implemented.

I must also note the past efforts of Harish Krupo, who submitted a major Weston merge request, but unfortunately at the time reviewers in Weston upstream had little availability. Even before that, there were experiments by Ville Syrjälä. All these are now mostly superseded by the on-going work.

Currently the active people around the topic are me (Collabora), Vitaly Prosyak (AMD), and Naveen Kumar (Intel). Sebastian Wick (unaffiliated) is still around as well. None of us is a color management or HDR expert by trade, so we are all learning things as we go.

Design

The foundation for the color management protocol is ICC profile files for describing both output and content color spaces. The aim is for ICCv4, while also allowing ICCv2, as these are well known and well supported in general. Adding iccMAX support or anything else will be possible any time in the future.

As color management is all about color spaces and gamuts, and high dynamic range (HDR) is also very much about color spaces and gamuts plus extended luminance range, Sebastian and I decided that the Wayland color management extension should cater for both from the beginning. Combining traditional color management and HDR is a fairly new thing as far as I know, and I'm not sure we have much prior art to base upon, so this is an interesting research journey as well. There is a lot of prior art on HDR and color management separately, but they tend to have fundamental differences that make the combination non-obvious.

To help us keep focused and to explain to the community what we actually intend with Wayland color management and HDR support, I wrote the section "Wayland Color Management and HDR Design Goals" in color.rst (draft). I very much recommend reading it so that you get a picture of what we (or I, at least) want to aim for.

Elle Stone explains in her article how color management should work on X11. As I wanted to avoid repeating the massive email threads on the wayland-devel mailing list, I wrote the section "Color Pipeline Overview" in color.rst (draft) more or less as a response to her article, trying to explain in what ways Wayland will be different from X11. I think that understanding that section is paramount before anyone makes any comment on our efforts with the Wayland protocol extension.

HDR brings even more reasons to put color space conversions in the display server than just the idea that all applications should be color managed, if not explicitly then implicitly.  Most desktop applications (well, literally all right now) are using Standard Dynamic Range (SDR).  SDR is a fuzzy concept referring to all traditional, non-HDR image content.  Therefore, your desktop is usually 100% SDR. You run your fancy new HDR monitor in SDR mode, which means it looks just like any old monitor with nothing fancy.  What if you want to watch an HDR video? The monitor won't display HDR in SDR mode.  If you simply switch the monitor to HDR mode, you will be blinded by all the over-bright SDR applications.  Switching monitor modes may also cause flicker and take a bit of time. That would be a pretty bad user experience, right?

A solution is to run your monitor in HDR mode all the time, and have the window system compositor convert all SDR application windows appropriately to the HDR luminance, so that they look normal in spite of the HDR mode. There will always be applications that will never support HDR at all, so the compositor doing the conversion is practically the only way.

For the protocol, we are currently exploring the use of relative luminance.  The reason is that people look at monitors in wildly varying viewing environments, under standard office lighting for example. The environment and personal preferences affect what monitor brightness you want. Monitors themselves can also be wildly different in their capabilities. Most prior art on HDR uses absolute luminance, but absolute luminance has the problem that it assumes a specific viewing environment, usually a dark room, similar to a movie theatre.  If a display server showed a movie at the absolute luminance it was mastered for, in most cases it would be far too dark to see. Whether using relative luminance at the protocol level turns out to be a good idea or not, we shall see.

Development

The Wayland color management and HDR protocol extension proposal is known as wayland/wayland-protocols!14 (MR14). Because it is a very long running merge request (the bar for landing a new protocol into wayland-protocols is high) and there are several people working on it, we started using sub-merge-requests to modify the proposal. You can find the sub-MRs in Sebastian's fork. If you have a change to propose, that is how to do it.

Obviously using sub-MRs also splits the review discussions into multiple places, but in this case I think it is a good thing, because the discussion threads in Gitlab are already massive.

There are several big and small open questions we haven't had the time to tackle yet even among the active group; questions that I feel we should have some tentative answers to before asking for wider community comments. There is also no set schedule, so don't hold your breath. This work is likely to take months still before there is a complete tentative protocol, and probably years until these features are available in your favourite Wayland desktop environments.

If you are an expert on the topics of color management or HDR displays and content, you are warmly welcome to join the development.

If you are an interested developer or an end user looking to try out things, sorry, there is nothing really for you yet.

November 18, 2020
How to fix the Linux EFI secure-boot shim bootloop issue seen on some systems.

Quite a few Bay- and Cherry-Trail based systems have bad firmware which completely ignores any efibootmgr-set boot options. They basically completely reset the boot order, doing some sort of auto-detection at boot. Some of these will even give an error about their eMMC not being bootable unless the ESP has an EFI/Microsoft/Boot/bootmgfw.efi file!

Many of these end up booting EFI/Boot/bootx64.efi unconditionally every boot. This will cause a boot loop since, when Linux is installed, EFI/Boot/bootx64.efi is now shim. When shim is started with a path of EFI/Boot/bootx64.efi, shim will add a new efibootmgr entry pointing to EFI/fedora/shimx64.efi and then reset. The goal of this is so that the firmware's F12 boot menu can be used to easily switch between Windows and Linux (without chainloading, which breaks bitlocker). But since these bad EFI implementations ignore efibootmgr stuff, the EFI/Boot/bootx64.efi shim will run again after the reset and we have a loop.

There are 2 ways to fix this loop:

1. The right way: Stop shim from trying to add a boot entry pointing to EFI/fedora/shimx64.efi:

rm EFI/Boot/fbx64.efi
cp EFI/fedora/grubx64.efi EFI/Boot


The first command stops shim from trying to add a new efibootmgr entry (it calls fbx64.efi to do that for it); instead it will try to execute grubx64.efi from the directory from which it itself was executed, so we must put a grubx64.efi in the EFI/Boot dir, which the second command does. Do not use the livecd's EFI/Boot/grubx64.efi file for this, as I did at first; that one searches for its config and env under EFI/Boot, which is not what we want.

Note that upgrading shim will restore EFI/Boot/fbx64.efi. To avoid this you may want to back up EFI/Boot/bootx64.efi, then do "sudo rpm -e shim-x64" and then restore the backup.
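
Spelled out, that amounts to something like the following sketch (assuming a Fedora system with the ESP mounted at /boot/efi; the backup location is arbitrary):

ESP=/boot/efi
sudo cp $ESP/EFI/Boot/bootx64.efi /root/bootx64.efi.bak   # keep a copy of the working shim binary
sudo rpm -e shim-x64                                      # removing the package also removes the files it owns...
sudo cp /root/bootx64.efi.bak $ESP/EFI/Boot/bootx64.efi   # ...so put the backup back afterwards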

2. The wrong way: Replace EFI/Boot/bootx64.efi with a copy of EFI/fedora/grubx64.efi

This is how I used to do this until hitting the scenario which caused me to write this blog post. There are 2 problems with this:

2a) This requires disabling secure-boot (which I could live with so far)
2b) Some firmwares change how they behave, exporting a different DSDT to the OS depending on whether EFI/Boot/bootx64.efi is signed or not (even with secure boot disabled), and their behavior is totally broken when it is not signed. I will post another rant ^W blogpost about this soon. For now let's just say that you should use workaround 1. from above, since it simply is a better workaround.

Note that for better readability the above text uses bootx64, shimx64, fbx64 and grubx64 throughout. When using a 32 bit EFI (which is typical on Bay Trail systems) you should replace these with bootia32, shimia32, fbia32 and grubia32. Note that 32 bit EFI Bay Trail systems should still use a 64 bit Linux distro; the firmware being 32 bit is a weird Windows related thing.

Also note that your system may use another key than F12 to show the firmware's boot menu.
November 15, 2020

A Brief Review

As time/sanity permit, I’ll be trying to do roundup posts for zink happenings each week. Here’s a look back at things that happened, in no particular order:

November 13, 2020

(project was renamed from vallium to lavapipe)

I had some requirements for writing a vulkan software rasterizer within the Mesa project. I took some time to look at the options and realised that just writing a vulkan layer on top of gallium's llvmpipe would be a good answer for this problem. However in doing so I knew people would ask why this wouldn't work for a hardware driver.

tl;dr DO NOT USE LAVAPIPE OVER A GALLIUM HW DRIVER.

What is lavapipe?

The lavapipe layer is a gallium frontend. It takes the Vulkan API and roughly translates it into the gallium API.

How does it do that?

Vulkan is a low-level API: it allows the user to allocate memory, create resources, and record command buffers, amongst other things. When a hw Vulkan driver is recording a command buffer, it is putting hw-specific commands into it that will be run directly on the GPU. These command buffers are submitted to queues when the app wants to execute them.

Gallium is a context level API, i.e. like OpenGL/D3D10. The user has to create resources and contexts and the driver internally manages command buffers etc. The driver controls internal flushing and queuing of command buffers.
 
In order to bridge the gap, the lavapipe layer abstracts the gallium context into a separate thread of execution. When recording a vulkan command buffer it creates a CPU side command buffer containing an encoding of the Vulkan API. It passes that recorded CPU command buffer to the thread on queue submission. The thread then creates a gallium context, and replays the whole CPU recorded command buffer into the context, one command at a time.

That sounds horrible, isn't it slow?

Yes.

Why doesn't that matter for *software* drivers?

Software rasterizers are a very different proposition from an overhead point of view than real hardware. CPU rasterization is pretty heavy on the CPU load, so nearly always 90% of your CPU time will be in the rasterizer and fragment shader. Having some minor CPU overheads around command submission and queuing isn't going to matter in the overall profile of the user application. CPU rasterization is already slow; the Vulkan->gallium translation overhead isn't going to be the reason for making it much slower.

For real HW drivers, which are meant to record their own command buffers in the GPU domain and submit them directly to the hw, adding in a CPU layer that just copies the command buffer data is a massive overhead, and one that can't easily be removed from the lavapipe layer.

The lavapipe execution context is also pretty horrible: it has to connect all the state pieces like shaders etc. to the gallium context, and disconnect them all at the end of each command buffer. There is only one command submission queue and one context to be used. A lot of hardware exposes more queues etc. that this will never model.

I still don't want to write a vulkan driver, give me more reasons.

Pipeline barriers:

Pipeline barriers in Vulkan are essential to efficient hw usage by the driver. They are one of the most difficult-to-understand and hardest-to-get-right pieces of writing a Vulkan driver. For a software rasterizer they are also mostly unneeded: when I get a barrier I just completely hard-flush the gallium context, because I know the sw driver behind it. For a real hardware driver this would be a horrible solution; you spend a lot of time trying to make anything optimal here.

Memory allocation:

Vulkan is built around the idea of separate memory allocation and objects binding to those allocations. Gallium is built around object allocation with the memory allocs happening implicitly. I've added some simple memory allocation objects to the gallium API for swrast. These APIs are in no way useful for hw drivers. There is no way to expose memory types or heaps from gallium usefully. The current memory allocation API works for software drivers because I know all they want is an aligned_malloc. There is no decent way to bridge this gap without writing a new gallium API that looks like Vulkan. (in which case just write a vulkan driver already).

Can this make my non-Vulkan capable hw run Vulkan?

No. If the hardware can't do virtual memory properly, or can't expose the features Vulkan requires, this can't be fixed with a software layer that just introduces overhead.


There's been some recent discussion about whether the X server is abandonware. As the person arguably most responsible for its care and feeding over the last 15 years or so, I feel like I have something to say about that.

The thing about being the maintainer of a public-facing project for nearly the whole of your professional career is it's difficult to separate your own story from the project. So I'm not going to try to be dispassionate, here. I started working on X precisely because free software had given me options and capabilities that really matter, and I feel privileged to be able to give that back. I can't talk about that without caring about it.

So here's the thing: X works extremely well for what it is, but what it is is deeply flawed. There's no shame in that; it's 33 years old and still relevant, and I wish more software worked so well on that kind of timeframe. But using it to drive your display hardware and multiplex your input devices is choosing to make your life worse.

It is, however, uniquely well suited to a very long life as an application compatibility layer. Though the code happens to implement an unfortunate specification, the code itself is quite well structured, easy to hack on, and not far off from being easily embeddable.

The issue, then, is how to get there. And I don't have any real desire to get there while still pretending that the xfree86 hardware-backed server code is a real thing. Sorry, I guess, but I've worked on xfree86-derived servers for very nearly as long as XFree86-the-project existed, and I am completely burnt out on that on its own merits, let alone doing that and also being release manager and reviewer of last resort. You can only apply so much thrust to the pig before you question why you're trying to make it fly at all.

So, is Xorg abandoned? To the extent that that means using it to actually control the display, and not just keep X apps running, I'd say yes. But xserver is more than xfree86. Xwayland, Xwin, Xephyr, Xvnc, Xvfb: these are projects with real value that we should not give up. A better way to say it is that we can finally abandon xfree86.

And if that sounds like a world you'd like to see, please, come talk to us, let's make it happen. I'd be absolutely thrilled to see someone take this on, and I'm happy to be your guide through the server internals.

November 12, 2020

A recent article on phoronix has some commentary about sharing code between Windows and Linux, and how this seems to be a metric that Intel likes.

I'd like to explore this idea a bit and explain why I believe it's bad for Linux based distros and our open source development models in the graphics area.

tl;dr there is a big difference between open-source-released and open-source-developed projects in terms of sustainability and community.

The Linux graphics stack from a distro vendor point of view is made up of two main projects, the Linux kernel and Mesa userspace. These two projects are developed in the open with completely open source, vendor-agnostic practices. There is no vendor controlling either project, and both projects have a goal of trying to maximise shared code and shared processes/coding standards across drivers from all vendors.

This cross-vendor synergy is very important to the functioning ecosystem that is the Linux graphics stack. The stack also relies in some places on the LLVM project, but again LLVM upstream is vendor agnostic and open source developed.

The value to distros is that they have central places to pick up driver stacks with good release cycles and a minimal number of places they have to deal with to interact with those communities. Now usually hardware vendors don't see the value in the external communities as much as Linux distros do. From a hardware vendor's internal point of view they see more benefit in creating a single stack shared between their Windows and Linux drivers to maximise their return on investment, or make their orgchart prettier, or produce fewer powerpoints about why their orgchart isn't optimal.

A shared Windows/Linux stack as such is a thing the vendors want more for their own reasons than for the benefit of the Linux community.

Why is it a bad idea?

I'll start by saying it's not always a bad idea. In theory it might be possible to produce such a stack with the benefits of an open source development model; however most vendors seem to fail at this. They see open source as a release model: they develop internally and shovel the results over the fence into a github repo every X weeks after a bunch of cycles. They build products containing these open source pieces, but they never expend the time building projects or communities around them.

As an example take AMDVLK vs radv. I started radv because AMD had been promising the world an open source Vulkan driver for Linux that was shared with their Windows stack. Even when it was delivered, it was open source released but internally developed. There was no avenue for community participation in the driver development. External contributors were never on the same footing as an AMD employee. Even AMD employees on different teams weren't on the same footing. Compare this to the radv project in Mesa, where the open development model allowed Valve to contribute the ACO backend compiler and provide better results than AMD's vendor-shared code could ever have done, with far less investment and manpower.

Intel have a non-Mesa compiler called the Intel Graphics Compiler, mentioned in the article. This is fully developed by Intel internally; there is little info on project direction, how to get involved, or where the community is. There doesn't seem to be much public review, and patches seem to get merged into the public repo by igcbot, which may mean they are being mirrored from some internal repo. They are not using github merge requests etc. Compare this to development of a Mesa NIR backend, where lots of changes are reviewed and maximal common code sharing is attempted so that all vendors benefit from the code.

One area where it has mostly sort of worked out is with the AMD display code in the kernel. I believe this code to be shared with their Windows driver (but I'm not 100% sure). They do try to engage with community changes to the code, but the code is still pretty horrible and not really optimal on Linux. Integrating it with atomic modesetting and refactoring it was a pain. So even in the best case it's not an optimal outcome, even for the vendor; they have to work hard to make the shared code capable of supporting different OS interactions.

How would I do it?

If I had to share a Windows/Linux driver stack I'd (biased opinion) start from the most open project and bring that into the closed projects. I definitely wouldn't start with a new internal project that tries to disrupt both. For example, if I needed to create a Windows GL driver, I could:

a) write a complete GL implementation and throw it over the wall every few weeks, and make Windows/Linux use it. Linux users lose out on the shared stack, distros lose out, having to build a stack of multiple per-vendor deps instead of one dependency, and Windows gains nothing really, but I'm so in control of my own destiny (communities don't matter).

b) use Mesa and upstream my driver to share with the Linux stack, adding the Windows code to the Mesa stack. I get to share the benefits of external development by other vendors, Windows gains that benefit, and Linux retains the benefits to its ecosystem.

A warning, then, to anyone wishing for more vendor code sharing between OSes: it generally doesn't end with Linux being better off; it ends up with Linux being more fragmented, harder to support, and in the long run unsustainable.


November 06, 2020

About a year ago, I got a new laptop: a late 2019 Razer Blade Stealth 13.  It sports an Intel i7-1065G7 with Intel’s best Ice Lake graphics along with an NVIDIA GeForce GTX 1650.  Apart from needing an ACPI lid quirk and the power management issues described here, it’s been a great laptop so far and the Linux experience has been very smooth.

Unfortunately, the out-of-the-box integrated graphics performance of my new laptop was less than stellar.  My first task with the new laptop was to debug a rendering issue in the Linux port of Shadow of the Tomb Raider, which turned out to be a bug in the game.  In the process, I discovered that the performance of the game’s built-in benchmark was almost half that of Windows.  We’ve had some performance issues with Mesa from time to time on some games, but half seemed a bit extreme.  Looking at system-level performance data with gputop revealed that the GPU clock rate was unable to get above about 60-70% of the maximum in spite of the GPU being busy the whole time.  Why?  The GPU wasn’t able to get enough power.  Once I sorted out my power management problems, the benchmark went from about 50-60% of the speed of Windows to more like 104% of the speed of Windows (yes, that’s more than 100%).
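
If you want to check whether your own Intel GPU is clock-limited, the i915 driver exposes the actual and maximum GT frequencies in sysfs. A quick sketch (the card number may differ on your system):

# Actual vs. maximum GPU frequency while the game or benchmark is running
watch -n1 'cat /sys/class/drm/card0/gt_act_freq_mhz /sys/class/drm/card0/gt_max_freq_mhz'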

This blog post is intended to serve as a bit of a guide to understanding memory throughput and power management issues and configuring your system properly to get the most out of your Intel integrated GPU.  Not everything in this post will affect all laptops so you may have to do some experimentation with your system to see what does and does not matter.  I also make no claim that this post is in any way complete; there are almost certainly other configuration issues of which I'm not aware or which I've forgotten.

Update your drivers

This should go without saying but if you want the best performance out of your hardware, running the latest drivers is always recommended.  This is especially true for hardware that has just been released.  Generally, for graphics, most of the big performance improvements are going to be in Mesa but your Linux kernel version can matter as well.  In the case of Intel Ice Lake processors, some of the power management features aren’t enabled until Linux 5.4.

I’m not going to give a complete guide to updating your drivers here.  If you’re running a distro like Arch, chances are that you’re already running something fairly close to the latest available.  If you’re on Ubuntu, the padoka PPA provides versions of the userspace components (Mesa, X11, etc.) that are usually no more than about a week out-of-date, but upgrading your kernel is more complicated.  Other distros may have something similar, but I’ll leave that as an exercise to the reader.

This doesn’t mean that you need to be obsessive about updating kernels and drivers.  If you’re happy with the performance and stability of your system, go ahead and leave it alone.  However, if you have brand new hardware and want to make sure you have new enough drivers, it may be worth attempting an update.  Or, if you have the patience, you can just wait 6 months for the next distro release cycle and hope to pick up with a distro update.

Make sure you have dual-channel RAM

One of the big bottleneck points in 3D rendering applications is memory bandwidth.  Most standard monitors run at a resolution of 1920x1080 and a refresh rate of 60 Hz.  A 1920x1080 RGBA (32bpp) image is just shy of 8 MiB in size and, if the GPU is rendering at 60 FPS, that adds up to about 474 MiB/s of memory bandwidth to write out the image every frame.  If you're running a 4K monitor, multiply by 4 and you get about 1.8 GiB/s.  Those numbers are only for the final color image, assume we write every pixel of the image exactly once, and don't take into account any other memory access.  Even in a simple 3D scene, there are other images than just the color image being written such as depth buffers or auxiliary gbuffers, each pixel typically gets written more than once depending on app over-draw, and shading typically involves reading from uniform buffers and textures.  Modern 3D applications typically also have things such as depth pre-passes, lighting passes, and post-processing filters for depth-of-field and/or motion blur.  The result of this is that actual memory bandwidth for rendering a 3D scene can be 10-100x the bandwidth required to simply write the color image.
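
If you want to double-check those numbers, the arithmetic is quick to reproduce; a throwaway sketch using bc:

echo '1920 * 1080 * 4 / 1024 / 1024' | bc -l        # one RGBA frame: ~7.9 MiB
echo '1920 * 1080 * 4 * 60 / 1024 / 1024' | bc -l   # written out 60 times a second: ~474.6 MiB/s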

Because of the incredible amount of bandwidth required for 3D rendering, discrete GPUs use memories which are optimized for bandwidth above all else.  These go by different names such as GDDR6 or HBM2 (current as of the writing of this post) but they all use extremely wide buses and access many bits of memory in parallel to get the highest throughput they can.  CPU memory, on the other hand, is typically DDR4 (current as of the writing of this post) which runs on a narrower 64-bit bus and so the over-all maximum memory bandwidth is lower.  However, as with anything in engineering, there is a trade-off being made here.  While narrower buses have lower over-all throughput, they are much better at random access which is necessary for good CPU memory performance when crawling complex data structures and doing other normal CPU tasks.  When 3D rendering, on the other hand, the vast majority of your memory bandwidth is consumed in reading/writing large contiguous blocks of memory and so the trade-off falls in favor of wider buses.

With integrated graphics, the GPU uses the same DDR RAM as the CPU so it can't get as much raw memory throughput as a discrete GPU.  Some of the memory bottlenecks can be mitigated via large caches inside the GPU but caching can only do so much.  At the end of the day, if you're fetching 2 GiB of memory to draw a scene, you're going to blow out your caches and load most of that from main memory.

The good news is that most motherboards support a dual-channel RAM configuration where, if your DDR units are installed in identical pairs, the memory controller will split memory access between the two DDR units in the pair.  This has similar benefits to running on a 128-bit bus but without some of the drawbacks.  The result is about a 2x improvement in over-all memory throughput.  While this may not affect your CPU performance significantly outside of some very special cases, it makes a huge difference to your integrated GPU, which cares far more about total throughput than random access.  If you are unsure how your computer's RAM is configured, you can run “dmidecode -t memory” and see if you have two identical devices reported in different channels.
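
For example, here is roughly what to look for (run as root; the exact field names vary a bit between firmwares):

sudo dmidecode -t memory | grep -E 'Locator|Size|Speed'
# A dual-channel setup shows two populated devices with identical Size/Speed
# whose Locator / Bank Locator values point at different channels.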

Power management 101

Before getting into the details of how to fix power management issues, I should explain a bit about how power management works and, more importantly, how it doesn’t.  If you don’t care to learn about power management and are just here for the system configuration tips, feel free to skip this section.

Why is power management important?  Because the clock rate (and therefore the speed) of your CPU or GPU is heavily dependent on how much power is available to the system.  If it’s unable to get enough power for some reason, it will run at a lower clock rate and you’ll see that as processes taking more time or lower frame rates in the case of graphics.  There are some things that you, as the user, cannot control such as the physical limitations of the chip or the way the OEM has configured things on your particular laptop.  However, there are some things which you can do from a system configuration perspective which can greatly affect power management and your performance.

First, we need to talk about thermal design power or TDP.  There is a lot of misunderstanding on the internet about TDP and we need to clear some of them up.  Wikipedia defines TDP as “the maximum amount of heat generated by a computer chip or component that the cooling system in a computer is designed to dissipate under any workload.”  The Intel Product Specifications site defines TDP as follows:

Thermal Design Power (TDP) represents the average power, in watts, the processor dissipates when operating at Base Frequency with all cores active under an Intel-defined, high-complexity workload. Refer to Datasheet for thermal solution requirements.

In other words, the TDP value provided on the Intel spec sheet is a pretty good design target for OEMs but doesn’t provide nearly as many guarantees as one might hope.  In particular, there are several things that the TDP value on the spec sheet is not:
  • It’s not the exact maximum power.  It’s an “average power”.
  • It may not match any particular workload.  It’s based on “an Intel-defined, high-complexity workload”.  Power consumption on any other workload is likely to be slightly different.
  • It’s not the actual maximum.  It’s based on when the processor is “operating at Base Frequency with all cores active.” Technologies such as Turbo Boost can cause the CPU to operate at a higher power for short periods of time.
If you look at the  Intel Product Specifications page for the i7-1065G7, you’ll see three TDP values: the nominal TDP of 15W, a configurable TDP-up value of 25W and a configurable TDP-down value of 12W.  The nominal TDP (simply called “TDP”) is the base TDP which is enough for the CPU to run all of its cores at the base frequency which, given sufficient cooling, it can do in the steady state.  The TDP-up and TDP-down values provide configurability that gives the OEM options when they go to make a laptop based on the i7-1065G7.  If they’re making a performance laptop like Razer and are willing to put in enough cooling, they can configure it to 25W and get more performance.  On the other hand, if they’re going for battery life, they can put the exact same chip in the laptop but configure it to run as low as 12W.  They can also configure the chip to run at 12W or 15W and then ship software with the computer which will bump it to 25W once Windows boots up.  We’ll talk more about this reconfiguration later on.

Beyond just the numbers on the spec sheet, there are other things which may affect how much power the chip can get.  One of the big ones is cooling.  The law of conservation of energy dictates that energy is never created or destroyed.  In particular, your CPU doesn’t really consume energy; it turns that electrical energy into heat.  For every Watt of electrical power that goes into the CPU, a Watt of heat has to be pumped out by the cooling system.  (Yes, a Watt is also a measure of heat flow.)  If the CPU is using more electrical energy than the cooling system can pump back out, energy gets temporarily stored in the CPU as heat and you see this as the CPU temperature rising.  Eventually, however, the CPU has to back off and let the cooling system catch up or else that built up heat may cause permanent damage to the chip.

Another thing which can affect CPU power is the actual power delivery capabilities of the motherboard itself.  In a desktop, the discrete GPU is typically powered directly by the power supply and it can draw 300W or more without affecting the amount of power available to the CPU.  In a laptop, however, you may have more power limitations.  If you have multiple components requiring significant amounts of power such as a CPU and a discrete GPU, the motherboard may not be able to provide enough power for both of them to run flat-out so it may have to limit CPU power while the discrete GPU is running.  These types of power balancing decisions can happen at a very deep firmware level and may not be visible to software.

The moral of this story is that the TDP listed on the spec sheet for the chip isn’t what matters; what matters is how the chip is configured by the OEM, how much power the motherboard is able to deliver, and how much power the cooling system is able to remove.  Just because two laptops have the same processor with the same part number doesn’t mean you should expect them to get the same performance.  This is unfortunate for laptop buyers but it’s the reality of the world we live in.  There are some things that you, as the user, cannot control such as the physical limitations of the chip or the way the OEM has configured things on your particular laptop.  However, there are some things which you can do from a system configuration perspective and that’s what we’ll talk about next.

If you want to experiment with your system and understand what’s going on with power, there are two tools which are very useful for this: powertop and turbostat.  Both are open-source and should be available through your distro package manager.  I personally prefer the turbostat interface for CPU power investigations but powertop is able to split your power usage up per-process which can be really useful as well.
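
A minimal way to watch power draw while a game or benchmark is running (both tools need root; turbostat's exact columns depend on your CPU):

sudo turbostat --quiet --interval 5   # per-core and package power/frequency, updated every 5 seconds
sudo powertop                         # interactive view, including per-process power estimates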

Update GameMode to at least version 1.5

About two and a half years ago (1.0 was released in May of 2018), Feral Interactive released their GameMode daemon, which is able to tweak some of your system settings when a game starts up to get maximal performance.  One of the settings that GameMode tweaks is your CPU performance governor.  By default, GameMode will set it to “performance” when a game is running.  While this seems like a good idea (“performance” is better, right?), it can actually be counterproductive on integrated GPUs and cause you to get worse over-all performance.

Why would the “performance” governor cause worse performance?  First, understand that the names “performance” and “powersave” for CPU governors are a bit misleading.  The powersave governor isn’t just for when you’re running on battery and want to use as little power as possible.  When on the powersave governor, your system will clock all the way up if it needs to and can even turbo if you have a heavy workload.  The difference between the two governors is that the powersave governor tries to give you as much performance as possible while also caring about power; it’s quite well balanced.  Intel typically recommends the powersave governor even in data centers because, even though they have piles of power and cooling available, data centers typically care about their power bill.  The performance governor, on the other hand, doesn’t care about power consumption and only cares about getting the maximum possible performance out of the CPU so it will typically burn significantly more power than needed.

So what does this have to do with GPU performance?  On an integrated GPU, the GPU and CPU typically share a power budget and every Watt of power the CPU is using is a Watt that’s unavailable to the GPU.  In some configurations, the TDP is enough to run both the GPU and CPU flat-out but that’s uncommon.  Most of the time, however, the CPU is capable of using the entire TDP if you clock it high enough.  When running with the performance governor, that extra unnecessary CPU power consumption can eat into the power available to the GPU and cause it to clock down.

This problem should be mostly fixed as of GameMode version 1.5 which adds an integrated GPU heuristic.  The heuristic detects when the integrated GPU is using significant power and puts the CPU back to using the powersave governor.  In the testing I’ve done, this pretty reliably chooses the powersave governor in the cases where the GPU is likely to be TDP limited.  The heuristic is dynamic so it will still use the performance governor if the CPU power usage way overpowers the GPU power usage such as when compiling shaders at a loading screen.

What do you need to do on your system?  First, check what version of GameMode you have installed on your system (if any).  If it’s version 1.4 or earlier and you intend to play games on an integrated GPU, I recommend either upgrading GameMode or disabling or uninstalling the GameMode daemon.
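
Checking which version you have is usually a one-liner. A sketch for a few common distros (package names can vary, and older builds of gamemoded may not support -v):

gamemoded -v             # print the installed GameMode daemon version
# or ask your package manager:
apt policy gamemode      # Debian/Ubuntu
dnf info gamemode        # Fedora
pacman -Qi gamemode      # Arch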

Use thermald

In “power management 101” I talked about how sometimes OEMs will configure a laptop to 12W or 15W in BIOS and then re-configure it to 25W in software.  This is done via the “Intel Dynamic Platform and Thermal Framework” (DPTF) driver on Windows.  The DPTF driver manages your over-all system thermals and keeps the system within its thermal budget.  This is especially important for fanless or ultra-thin laptops where the cooling may not be sufficient for the system to run flat-out for long periods.  One thing the DPTF driver does is dynamically adjust the TDP of your CPU.  It can adjust it both up, if the laptop is running cool and you need the power, and down, if the laptop is running hot and needs to cool down.  Some OEMs choose to be very conservative with their TDP defaults in BIOS to prevent the laptop from overheating or constantly running hot if the Windows DPTF driver is not available.

On Linux, the equivalent to this is thermald.  When installed and enabled on your system, it reads the same OEM configuration data from ACPI as the Windows DPTF driver and is also able to scale your package TDP threshold up past the BIOS default as per the OEM configuration.  You can also write your own configuration files if you really wish, but you do so at your own risk.

Most distros package thermald but it may not be enabled nor work quite properly out-of-the-box.  This is because, historically, it has relied on the closed-source dptfxtract utility that's provided by Intel as a binary.  It requires dptfxtract to fetch the OEM provided configuration data from the ACPI tables. Since most distros don't usually ship closed-source software in their main repositories and since thermald doesn't do much without that data, a lot of distros don't bother to ship or enable it by default.  You'll have to turn it on manually.

To fix this, install both thermald and dptfxtract and ensure that thermald is enabled.  On most distros, thermald is packaged normally even if it isn’t enabled by default because it is open-source.  The dptfxtract utility is usually available in your distro’s non-free repositories.  On Ubuntu, dptfxtract is available as a package in multiverse.  For Fedora, dptfxtract is available via RPM Fusion’s non-free repo.  There are also packages for Arch and likely others as well.  If no one packages it for your distro, it’s just one binary so it’s pretty easy to install manually.
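
On Fedora, for example, the whole setup boils down to something like this (other distros are analogous, with dptfxtract coming from whatever non-free repo packages it):

sudo dnf install thermald        # the open-source daemon itself, from the main repos
sudo dnf install dptfxtract      # the binary-only helper, from RPM Fusion's non-free repo
sudo systemctl enable --now thermald.service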

Some of this may change going forward, however.  Recently, Matthew Garrett did some work to reverse-engineer the DPTF framework and provide support for fetching the DPTF data from ACPI without the need for the binary blob.  When running with a recent kernel and Matthew's fork of thermald, you should be able to get OEM-configured thermals without the need for the dptfxtract blob, at least on some hardware.  Whether or not you get the right configuration will depend on your hardware, your kernel version, your distro, and whether they ship the Intel version of thermald or Matthew's fork.  Even then, your distro may leave it uninstalled or disabled by default.  It's still disabled by default in Fedora 33, for instance.

It should be noted at this point that, if thermald and dptfxtract are doing their job, your laptop is likely to start running much hotter when under heavy load than it did before.  This is because thermald is re-configuring your processor with a higher thermal budget which means it can now run faster but it will also generate more heat and may drain your battery faster.  In theory, thermald should keep your laptop’s thermals within safe limits; just not within the more conservative limits the OEM programmed into BIOS.  If all the additional heat makes you uncomfortable, you can just disable thermald and it should go back to the BIOS defaults.

Enable NVIDIA’s dynamic power-management

On my laptop (the late 2019 Razer Blade Stealth 13), the BIOS has the CPU configured to 35W out-of-the-box.  (Yes, 35W is higher than TDP-up and I’ve never seen it burn anything close to that much power; I have no idea why it’s configured that way.)  This means that we have no need for DPTF and the cooling is good enough that I don’t really need thermald on it either.  Instead, its power management problems come from the power balancing that the motherboard does between the CPU and the discrete NVIDIA GPU.

If the NVIDIA GPU is powered on at all, the motherboard configures the CPU to the TDP-down value of 12W.  I don’t know exactly how it’s doing this but it’s at a very deep firmware level that seems completely opaque to software.  To make matters worse, it doesn’t just restrict CPU power when the discrete GPU is doing real rendering; it restricts CPU power whenever the GPU is powered on at all.  In the default configuration with the NVIDIA proprietary drivers, that’s all the time.

Fortunately, if you know where to find it, there is a configuration option available in recent drivers for Turing and later GPUs which lets the NVIDIA driver completely power down the discrete GPU when it isn’t in use.  You can find this documented in Chapter 22 of the NVIDIA driver README.  The runtime power management feature is still beta as of the writing of this post and does come with some caveats such as that it doesn’t work if you have audio or USB controllers (for USB-C video) on your GPU.  Fortunately, with many laptops with a hybrid Intel+NVIDIA graphics solution, the discrete GPU exists only for render off-loading and doesn’t have any displays connected to it.  In that case, the audio and USB-C can be disabled and don’t cause any problems.  On my laptop, as soon as I properly enabled runtime power management in the NVIDIA driver, the motherboard stopped throttling my CPU and it started running at the full TDP-up of 25W.
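
For reference, what the README documents is an nvidia module option plus a set of udev rules. Here is a sketch of the module-option half only (assuming a Turing or later GPU and a recent driver; the file name is arbitrary, and the udev rules described in Chapter 22 may also be needed):

# /etc/modprobe.d/nvidia-runtime-pm.conf
# 0x02 requests fine-grained runtime power management; see Chapter 22 of the
# NVIDIA driver README for the accompanying udev rules and the full caveat list.
options nvidia NVreg_DynamicPowerManagement=0x02

Once it is working, the GPU's power/runtime_status file under /sys/bus/pci/devices/ should read "suspended" whenever nothing is using the discrete GPU.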

I believe that nouveau has some capabilities for runtime power management.  However, I don’t know for sure how good they are and whether or not they’re able to completely power down the GPU.

Look for other things which might be limiting power

In this blog post, I've covered some of the things which I've personally seen limit GPU power when playing games and running benchmarks.  However, it is by no means an exhaustive list.  If there's one thing that's true about power management, it's that every machine is a bit different.  The biggest challenge with my laptop was the NVIDIA discrete GPU draining power.  On some other laptop, it may be something else.

You can also look for background processes which may be using significant CPU cycles.  With a discrete GPU, a modest amount of background CPU work will often not hurt you unless the game is particularly CPU-hungry.  With an integrated GPU, however, it's far more likely that a background task such as a backup or software update will eat into the GPU's power budget.  Just this last week, a friend of mine was playing a game on Proton and discovered that the game launcher itself was burning enough power with the CPU to prevent the GPU from running at full power.  Once he suspended the game launcher, his GPU was able to run at full power.

Especially with laptops, you're also likely to be affected by the computer's cooling system as was mentioned earlier.  Some laptops such as my Razer are designed with high-end cooling systems that let the laptop run at full power.  Others, particularly the ultra-thin laptops, are far more thermally limited and may never be able to hit the advertised TDP for extended periods of time.

Conclusion

When trying to get the most performance possible out of a laptop, RAM configuration and power management are key.  Unfortunately, due to the issues documented above (and possibly others), the out-of-the-box experience on Linux is not what it should be.  Hopefully, we’ll see this situation improve in the coming years but for now this post will hopefully give people the tools they need to configure their machines properly and get the full performance out of their hardware.

This Is The End

…of my full-time hobby work on zink.

At least for a while.

More on that at the end of the post.

Before I get to that, let’s start with yesterday’s riddle. Anyone who chose this pic

[screenshot: benchmark result showing 51 fps]

with 51 fps as being zink, you were correct.

That’s right, zink is now at around 95% of native GL performance for this benchmark, at least on my system.

I know there’s been a lot of speculation about the capability of the driver to reach native or even remotely-close-to-native speeds, and I’m going to say definitively that it’s possible, and performance is only going to increase further from here.

A bit of a different look on things can also be found on my Fall roundup post here.

A Big Boost From Threads

I’ve long been working on zink using a single-thread architecture, and my goal has been to make it as fast as possible within that constraint. Part of my reasoning is that it’s been easier to work within the existing zink architecture than to rewrite it, but the main issue is just that threads are hard, and if you don’t have a very stable foundation to build off of when adding threading to something, it’s going to get exponentially more difficult to gain that stability afterwards.

Reaching a 97% pass rate on my piglit tests at GL 4.6 and ES 3.2 gave me a strong indicator that the driver was in good enough shape to start looking at threads more seriously. Sure, piglit tests aren’t CTS; they fail to cover a lot of areas, and they’re certainly less exhaustive about the areas that they do cover. With that said, CTS isn’t a great tool for zink at the moment due to the lack of provoking vertex compatibility support in the driver (I’m still waiting on a Vulkan extension for this, though it’s looking likely that Erik will be providing a fallback codepath for this using a geometry shader in the somewhat near future) which will fail lots of tests. Given the sheer number of CTS tests, going through the failures and determining which ones are failing due to provoking vertex issues and which are failing due to other issues isn’t a great use of my time, so I’m continuing to wait on that. The remaining piglit test failures are mostly due either to provoking vertex issues or some corner case missing features such as multisampled ZS readback which are being worked on by other people.

With all that rambling out of the way, let’s talk about threads and how I’m now using them in zink-wip.

At present, I’m using u_threaded_context, aka glthread, making zink the only non-radeon driver to implement it. The way this works is by using Gallium to write the command stream to a buffer that is then processed asynchronously, freeing up the main thread for application use and avoiding any sort of blocking from driver overhead. For systems where zink is CPU-bound in the driver thread, this massively increases performance, as seen from the ~40% fps improvement that I gained after the implementation.

This transition presented a number of issues, the first of which was that u_threaded_context required buffer invalidation and rebinding. I’d had this on my list of targets for a while, so it was a good opportunity to finally hook it up.

Next up, u_threaded_context was very obviously written to work for the existing radeon driver architecture, and this was entirely incompatible with zink, specifically in how the batch/command buffer implementation is hardcoded like I talked about yesterday. Switching to monotonic, dynamically scaling command buffer usage resolved that and brought with it some other benefits.

The other big issue was, as I’m sure everyone expected, documentation.

I certainly can’t deny that there’s lots of documentation for u_threaded_context. It exists, it’s plentiful, and it’s quite detailed in some cases.

It’s also written by people who know exactly how it works with the expectation that it’s being read by other people who know exactly how it works. I had no idea going into the implementation how any of it worked other than a general knowledge of the asynchronous command stream parts that are common to all thread queue implementations, so this was a pretty huge stumbling block.

Nevertheless, I persevered, and with the help of a lot of RTFC, I managed to get it up and running. This is a more general overview post rather than a more in-depth, technical one, so I’m not going to go into any deep analysis of the (huge amounts of) code required to make it work, but here’s some key points from the process in case anyone reading this hits some of the same issues/annoyances that I did:

  • use consistent naming for all your struct subclassing, because a huge amount of the code churn is just going to be replacing driver class -> gallium class references with driver class -> u_threaded_context class -> gallium class ones; if you can sed these all at once, it simplifies the work tremendously
  • u_threaded_context works off the radeon queue/fence architecture, which allows (in some cases) multiple fences for any given queue submission, so ensure that your fences work the same way or (as I did) can effectively have sub-fences
  • obviously don’t forget your locking, but also don’t over-lock; I’m still doing some analysis to check how much locking I need for the context-based caches, and it may even be the case that I’m under-locked at the moment, but it’s important to keep in mind that your pipe_context can be in many different threads at a given time, and so, as the u_threaded_context docs repeatedly say without further explanation, don’t use it “in an unsafe way”
  • the buffer mapping rules/docs are complex, but basically it boils down to checking the TC_TRANSFER_MAP_* flags before doing the things that those flags prohibit
    • ignore threaded_resource::max_forced_staging_uploads to start with since it adds complexity
    • if you get TC_TRANSFER_MAP_THREADED_UNSYNC, you have to use threaded_context::base.stream_uploader for staging buffers, though this isn’t (currently) documented anywhere
    • watch your buffer alignments; I already fixed an issue with this, but u_threaded_context was written for radeon drivers, so there may be other cases where hardcoded values for those drivers exist
    • probably just read the radeonsi code before even attempting this anyway

All told, fixing all the regressions took much longer than the actual implementation, but that’s just par for the course with driver work.

Anyone interested in testing should take note that, as always, this has only been used on Intel hardware (and if you’re on Intel, this post is definitely worth reading), and so on systems which were not CPU-bound previously or haven’t been worked on by me, you may not yet see these kinds of gains.

But you will eventually.

And That’s It

This is a sort of bittersweet post as it marks the end of my full-time hobby work with zink. I’ve had a blast over the past ~6 months, but all things change eventually, and such is the case with this situation.

Those of you who have been following me for a long time will recall that I started hacking on zink while I was between jobs in order to improve my skills and knowledge while doing something productive along the way. I succeeded in all regards, at least by my own standards, and I got to work with some brilliant people at the same time.

But now, at last, I will once again become employed, and the course of that employment will take me far away from this project. I don’t expect that I’ll have a considerable amount of mental energy to dedicate to hobbyist Open Source projects, at least for the near term, so this is a farewell of sorts in that sense. This means (again, for at least the near term):

  • I’ll likely be blogging far less frequently
  • I don’t expect to be writing any new patches for zink/gallium/mesa

This does not mean that zink is dead, or the project is stalling development, or anything like that, so don’t start overreaching on the meaning of this post.

I still have 450+ patches left to be merged into mainline Mesa, and I do plan to continue driving things towards that end, though I expect it’ll take a good while. I’ll also be around to do patch reviews for the driver and continue to be involved in the community.

I look forward to a time when I’ll get to write more posts here and move the zink user experience closer to where I think it can be.

This is Mike, signing off for now.

Happy rendering.

November 05, 2020

During my presentation at the X Developers Conference I stated that we had been mostly using the Khronos Vulkan Conformance Test suite (aka Vulkan CTS) to validate our Vulkan driver for Raspberry Pi 4 (aka V3DV). While the CTS is an invaluable resource for driver testing and validation, it doesn’t exactly compare to actual real world applications, and so, I made the point that we should try to do more real world testing for the driver after completing initial Vulkan 1.0 support.

To be fair, we had been doing a little bit of this already when I worked on getting the Vulkan ports of all 3 Quake game classics to work with V3DV, which allowed us to identify and fix a few driver bugs during development. The good thing about these games is that we could get the source code and compile them natively for ARM platforms, so testing and debugging was very convenient.

Unfortunately, there are not a plethora of Vulkan applications and games like these that we can easily test and debug on a Raspberry Pi as of today, which posed a problem. One way to work around this limitation that was suggested after my presentation at XDC was to use Zink, the OpenGL to Vulkan layer in Mesa. Using Zink, we can take existing OpenGL applications that are currently available for Raspberry Pi and use them to test our Vulkan implementation a bit more thoroughly, expanding our options for testing while we wait for the Vulkan ecosystem on Raspberry Pi 4 to grow.

So last week I decided to get hands on with that. Zink requires a few things from the underlying Vulkan implementation depending on the OpenGL version targeted. Currently, Zink only targets desktop OpenGL versions, so that limits us to OpenGL 2.1, which is the maximum version of desktop OpenGL that Raspberry Pi 4 can support (we support up to OpenGL ES 3.1 though). For that desktop OpenGL version, Zink required a few optional Vulkan 1.0 features that we were missing in V3DV, namely:

  • Logic operations.
  • Alpha to one.
  • VK_KHR_maintenance1.
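
For anyone curious how those requirements can be detected from outside the driver, here's a small self-contained C sketch (my own illustration, not V3DV or Zink code) that checks them with the standard Vulkan 1.0 queries:

#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <vulkan/vulkan.h>

/* returns true if the device exposes logic ops, alpha-to-one and
 * VK_KHR_maintenance1, i.e. the three requirements listed above */
static bool
has_zink_gl21_requirements(VkPhysicalDevice pdev)
{
   VkPhysicalDeviceFeatures feats;
   vkGetPhysicalDeviceFeatures(pdev, &feats);
   if (!feats.logicOp || !feats.alphaToOne)
      return false;

   uint32_t count = 0;
   vkEnumerateDeviceExtensionProperties(pdev, NULL, &count, NULL);
   VkExtensionProperties *props = malloc(count * sizeof(*props));
   if (!props)
      return false;
   vkEnumerateDeviceExtensionProperties(pdev, NULL, &count, props);

   bool found = false;
   for (uint32_t i = 0; i < count; i++) {
      if (!strcmp(props[i].extensionName, "VK_KHR_maintenance1"))
         found = true;
   }
   free(props);
   return found;
}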

The first two were trivial: they were already implemented and we only had to expose them in the driver. Notably, when I was testing these features with the relevant CTS tests I found a bug in the alpha to one tests, so I proposed a fix to Khronos which is currently in review.

I also noticed that Zink was implicitly requiring support for timestamp queries, so I implemented that in V3DV as well and then wrote a patch for Zink to handle this requirement better.

Finally, Zink doesn’t use Vulkan swapchains; instead it creates presentable images directly, which was problematic for us because our platform needs to handle allocations for presentable images specially, so a patch for Zink was also required to address this.

As of the writing of this post, all this work has been merged in Mesa and it enables Zink to run OpenGL 2.1 applications over V3DV on Raspberry Pi 4. Here are a few screenshots of Quake3 taken with the native OpenGL driver (V3D), with the native Vulkan driver (V3DV) and with Zink (over V3DV). There is a significant performance hit with Zink at present, although that is probably not too unexpected at this stage, but otherwise it seems to be rendering correctly, which is what we were really interested to see:


Quake3 Vulkan renderer (V3DV)

Quake3 OpenGL renderer (V3D)

Quake3 OpenGL renderer (Zink + V3DV)

Note: you’ll notice that the Vulkan screenshot is darker than the OpenGL versions. As I reported in another post, that is a feature of the Vulkan port of Quake3 and is unrelated to the driver.

Going forward, we expect to use Zink to test more applications and hopefully identify driver bugs that help us make V3DV better.

It’s Time.

I’ve been busy cramming more code than ever into the repo this week in order to finish up my final project for a while by Friday. I’ll talk more about that tomorrow though. Today I’ve got two things for all of you.

First, A Riddle

Of these two screenshots, one is zink+ANV and one is IRIS. Which is which?

2.png

1.png

Second, Queue Architecture

Let’s talk a bit at a high level about how zink uses (non-compute) command buffers.

Currently in the repo zink works like this:

  • there is 1 queue
  • there are 4 command buffers used in a ring
  • after every flush (e.g., glFlush), the command buffers cycle
  • the driver flushes itself internally on pretty much every function call
  • any time an in-use command buffer is iterated to, the driver stalls until the command buffer has completed

In short, there’s a huge bottleneck around the flushing mechanism, and then there’s a lesser-reached bottleneck for cases where an application flushes repeatedly before a command buffer’s ops are completed.

Some time ago I talked about some modifications I’d done to the above architecture, and then things looked more like this:

  • there is 1 queue
  • there are 4 command buffers used in a ring
  • after every flush (e.g., glFlush), the command buffers cycle
  • the driver defers all possible flushes to try and match 1 flush to 1 frame
  • any time an in-use command buffer is iterated to, the driver stalls until the command buffer has completed

The major difference after this work was that the flushing was reduced, which then greatly reduced the impact of that bottleneck that exists when all the command buffers are submitted and the driver wants to continue recording commands.

A lot of speculation has occurred among the developers over “how many” command buffers should be used, and there’s been some talk of profiling this, but for various reasons I’ll get into tomorrow, I opted to sidestep the question entirely in favor of a more dynamic solution: monotonically-identified command buffers.

Monotony

The basic idea behind this strategy, which is used by a number of other drivers in the tree, is that there’s no need to keep a “ring” of command buffers to cycle through, as the driver can just continually allocate new command buffers on-the-fly and submit them as needed, reusing them once they’ve naturally completed instead of forcibly stalling on them. Here’s a visual comparison:

The current design:

Here’s the new version:

This way, there’s no possibility of stalling based on application flushes (or the rare driver-internal flush which does still exist in a couple places).
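
To make that concrete, here's a minimal, driver-agnostic sketch of the allocate-or-reuse logic; fence_signaled() and new_cmdbuf() are hypothetical stand-ins for the real fencing and allocation code, not zink functions:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct cmdbuf {
   struct cmdbuf *next;   /* list of submitted buffers, oldest first */
   uint32_t id;           /* monotonically increasing identifier */
};

/* hypothetical helpers standing in for the driver's fence/allocation code */
bool fence_signaled(const struct cmdbuf *cb);
struct cmdbuf *new_cmdbuf(void);

/* grab a command buffer to record into: recycle the oldest submitted one if
 * the GPU is done with it, otherwise allocate a fresh one; never stall */
struct cmdbuf *
get_cmdbuf(struct cmdbuf **submitted, uint32_t *next_id)
{
   struct cmdbuf *cb = *submitted;
   if (cb && fence_signaled(cb)) {
      *submitted = cb->next;   /* pop the completed buffer for reuse */
   } else {
      cb = new_cmdbuf();       /* everything still in flight: grow the pool */
   }
   cb->id = (*next_id)++;
   cb->next = NULL;
   return cb;
}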

The architectural change here had two great benefits:

  • for systems that aren’t CPU bound, more command buffers will automatically be created and used, yielding immediate performance gains (~5% on Dave Airlie’s AMD setup)
  • the driver internals get massively simplified

The latter of these is due to the way that the queue in zink is split between gfx and compute command buffers; with the hardcoded batch system, the compute queue had its own command buffer while the gfx queue had four, but they all had unique IDs which were tracked using bitfields all over the place, not to mention it was frustrating never being able to just “know” which command buffer was currently being recorded to for a given command without indexing the array.

Now it’s easy to know which command buffer is currently being recorded to, as it’ll always be the one associated with the queue (gfx or compute) for the given operation.

This had further implications, however, and I’d done this to pave the way for a bigger project, one that I’ve spent the past few days on. Check back tomorrow for that and more.

November 02, 2020

New Hotness

Quick update today, but I’ve got some very exciting news coming soon.

The biggest news of the day is that work is underway to merge some patches from Duncan Hopkins which enable zink to run on Mac OS using MoltenVK. This has significant potential to improve OpenGL support on that platform, so it’s awesome that work has been done to get the ball rolling there.

In only slightly less monumental news though, Adam Jackson is already underway with Vulkan WSI work for zink, which is going to be huge for performance.

October 30, 2020

(I just sent the below email to mesa3d developer list).

Just to let everyone know, a month ago I submitted the 20.2 llvmpipe
driver for OpenGL 4.5 conformance under the SPI/X.org umbrella, and it
is now official[1].

Thanks to everyone who helped me drive this forward, and to all the
contributors both to llvmpipe and the general Mesa stack that enabled
this.

Big shout out to Roland Scheidegger for helping review the mountain of
patches I produced in this effort.

My next plans involved submitting lavapipe for Vulkan 1.0, it's at 99%
or so CTS, but there are line drawing, sampler accuracy and some snorm
blending failure I have to work out.
I also ran the OpenCL 3.0 conformance suite against clover/llvmpipe
yesterday and have some vague hopes of driving that to some sort of
completion.

(for GL 4.6 only texture anisotropy is really missing, I've got
patches for SPIR-V support, in case someone was feeling adventurous).

Dave.

[1] https://www.khronos.org/conformance/adopters/conformant-products/opengl#submission_272

October 29, 2020

Buffering

I’ve got a lot of exciting stuff in the pipe now, but for today I’m just going to talk a bit about resource invalidation: what it is, when it happens, and why it’s important.

Let’s get started.

What is invalidation?

Resource invalidation occurs when the backing buffer of a resource is wholly replaced. Consider the following scenario under zink:

  • Have struct A { VkBuffer buffer; };
  • User calls glBufferData(target, size, data, usage), which stores data to A.buffer
  • User calls glBufferData(target, size, NULL, usage), which unsets the data from A.buffer

On a sane/competent driver, the second glBufferData call will trigger invalidation, which means that A.buffer will be replaced entirely, while A is still the driver resource used by Gallium to represent target.
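
From the application's point of view this is the classic buffer orphaning idiom; here's a small illustrative GL snippet (not taken from any real app, and assuming a loader that exposes the GL 1.5 entry points):

#include <GL/gl.h>

/* the glBufferData call with NULL data "orphans" the storage, letting the
 * driver hand back fresh backing memory instead of syncing with in-flight GPU work */
static void
upload_next_frame(GLuint vbo, GLsizeiptr size, const void *new_vertices)
{
   glBindBuffer(GL_ARRAY_BUFFER, vbo);
   glBufferData(GL_ARRAY_BUFFER, size, NULL, GL_DYNAMIC_DRAW);          /* orphan */
   glBufferData(GL_ARRAY_BUFFER, size, new_vertices, GL_DYNAMIC_DRAW);  /* refill */
}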

When does invalidation occur?

Resource invalidation can occur in a number of scenarios, but the most common is when unsetting a buffer’s data, as in the above example. The other main case for it is replacing the data of a buffer that’s in use for another operation. In such a case, the backing buffer can be replaced to avoid forcing a sync in the command stream which will stall the application’s processing. There’s some other cases for this as well, like glInvalidateFramebuffer and glDiscardFramebufferEXT, but the primary usage that I’m interested in is buffers.

Why is invalidation important?

The main reason is performance. In the above scenario without invalidation, the second glBufferData call will write null to the whole buffer, which is going to be much more costly than just creating a new buffer.

That’s it

Now comes the slightly more interesting part: how does invalidation work in zink?

Currently, as of today’s mainline zink codebase, we have struct zink_resource to represent a resource for either a buffer or an image. One struct zink_resource represents exactly one VkBuffer or VkImage, and there’s some passable lifetime tracking that I’ve written to guarantee that these Vulkan objects persist through the various command buffers that they’re associated with.

Each struct zink_resource is, as is the way of Gallium drivers, also a struct pipe_resource, which is tracked by Gallium. Because of this, struct zink_resource objects themselves cannot be invalidated in order to avoid breaking Gallium, and instead only the inner Vulkan objects themselves can be replaced.

For this, I created struct zink_resource_object, which is an object that stores only the data that directly relates to the Vulkan objects, leaving struct zink_resource to track the states of these objects. Their lifetimes are separate, with struct zink_resource being bound to the Gallium tracker and struct zink_resource_object persisting for either the lifetime of struct zink_resource or its command buffer usage—whichever is longer.
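
Roughly speaking, the split looks like the sketch below; the field lists are trimmed to the members this post actually touches, and Mesa/Vulkan headers are assumed, so treat it as a picture of the layout rather than the literal zink definitions:

struct zink_resource_object {
   /* owns the actual Vulkan handles; refcounted so it can outlive the
    * pipe_resource while command buffers still reference it */
   VkBuffer buffer;
   /* ... image handle, memory, reference count, desc_set_refs, ... */
};

struct zink_resource {
   struct pipe_resource base;              /* the part Gallium tracks */
   struct zink_resource_object *obj;       /* replaceable backing storage */
   struct util_range valid_buffer_range;   /* which bytes hold valid data */
   /* ... access/access_stage barrier state, bind_history, bind_stages, ... */
};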

Code

The code for this mechanism isn’t super interesting since it’s basically just moving some parts around. Where it gets interesting is the exact mechanics of invalidation and how struct zink_resource_object can be injected into an in-use resource, so let’s dig into that a bit.

Here’s what the pipe_context::invalidate_resource hook looks like:

static void
zink_invalidate_resource(struct pipe_context *pctx, struct pipe_resource *pres)
{
   struct zink_context *ctx = zink_context(pctx);
   struct zink_resource *res = zink_resource(pres);
   struct zink_screen *screen = zink_screen(pctx->screen);

   if (pres->target != PIPE_BUFFER)
      return;

This only handles buffer resources, but extending it for images would likely be little to no extra work.

   if (res->valid_buffer_range.start > res->valid_buffer_range.end)
      return;

Zink tracks the valid data segments of its buffers. This conditional is used to check for an uninitialized buffer, i.e., one which contains no valid data. If a buffer has no data, it’s already invalidated, so there’s nothing to be done here.

   util_range_set_empty(&res->valid_buffer_range);

Invalidating means the buffer will no longer have any valid data, so the range tracking can be reset here.

   if (!get_all_resource_usage(res))
      return;

If this resource isn’t currently in use, unsetting the valid range is enough to invalidate it, so it can just be returned right away with no extra work.

   struct zink_resource_object *old_obj = res->obj;
   struct zink_resource_object *new_obj = resource_object_create(screen, pres, NULL, NULL);
   if (!new_obj) {
      debug_printf("new backing resource alloc failed!");
      return;
   }

Here’s the old internal buffer object as well as a new one, created using the existing buffer as a template so that it’ll match.

   res->obj = new_obj;
   res->access_stage = 0;
   res->access = 0;

struct zink_resource is just a state tracker for the struct zink_resource_object object, so upon invalidate, the states are unset since this is effectively a brand new buffer.

   zink_resource_rebind(ctx, res);

This is the tricky part, and I’ll go into more detail about it below.

   zink_descriptor_set_refs_clear(&old_obj->desc_set_refs, old_obj);

If this resource was used in any cached descriptor sets, the references to those sets need to be invalidated so that the sets won’t be reused.

   zink_resource_object_reference(screen, &old_obj, NULL);
}

Finally, the old struct zink_resource_object is unrefed, which will ensure that it gets destroyed once its current command buffer has finished executing.

Simple enough, but what about that zink_resource_rebind() call? Like I said, that’s where things get a little tricky, but because of how much time I spent on descriptor management, it ends up not being too bad.

This is what it looks like:

void
zink_resource_rebind(struct zink_context *ctx, struct zink_resource *res)
{
   assert(res->base.target == PIPE_BUFFER);

Again, this mechanism is only handling buffer resources for now, and there’s only one place in the driver that calls it, but it never hurts to be careful.

   for (unsigned shader = 0; shader < PIPE_SHADER_TYPES; shader++) {
      if (!(res->bind_stages & BITFIELD64_BIT(shader)))
         continue;
      for (enum zink_descriptor_type type = 0; type < ZINK_DESCRIPTOR_TYPES; type++) {
         if (!(res->bind_history & BITFIELD64_BIT(type)))
            continue;

Something common to many Gallium drivers is this idea of “bind history”, which is where a resource will have bitflags set when it’s used for a certain type of binding. While other drivers have a lot more cases than zink does due to various factors, the only thing that needs to be checked for my purposes is the descriptor type (UBO, SSBO, sampler, shader image) across all the shader stages. If a given resource has the flags set here, this means it was at some point used as a descriptor of this type, so the current descriptor bindings need to be compared to see if there’s a match.

         uint32_t usage = zink_program_get_descriptor_usage(ctx, shader, type);
         while (usage) {
            const int i = u_bit_scan(&usage);

This is a handy mechanism that returns the current descriptor usage of a shader as a bitfield. So for example, if a vertex shader uses UBOs in slots 0, 1, and 3, usage will be 11, and the loop will process i as 0, 1, and 3.
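
As a quick aside, that loop behaves like this toy stand-alone example (not driver code); __builtin_ctz stands in for Mesa's u_bit_scan(), which pops the lowest set bit and returns its index:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
   uint32_t usage = (1u << 0) | (1u << 1) | (1u << 3);   /* slots 0, 1, 3 == 11 */
   while (usage) {
      const int i = __builtin_ctz(usage);   /* index of the lowest set bit */
      usage &= usage - 1;                   /* clear it, as u_bit_scan() does */
      printf("descriptor slot %d\n", i);    /* prints 0, 1, 3 */
   }
   return 0;
}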

            struct zink_resource *cres = get_resource_for_descriptor(ctx, type, shader, i);
            if (res != cres)
               continue;

Now the slot of the descriptor type can be compared against the resource that’s being re-bound. If this resource is the one that’s currently bound to the specified slot of the specified descriptor type, then steps can be taken to perform additional operations necessary to successfully replace the backing storage for the resource, mimicking the same steps taken when initially binding the resource to the descriptor slot.

            switch (type) {
            case ZINK_DESCRIPTOR_TYPE_SSBO: {
               struct pipe_shader_buffer *ssbo = &ctx->ssbos[shader][i];
               util_range_add(&res->base, &res->valid_buffer_range, ssbo->buffer_offset,
                              ssbo->buffer_offset + ssbo->buffer_size);
               break;
            }

For SSBO descriptors, the only change needed is to add the bound region to the valid range. This region is passed to the shader, so even if it’s never written to, it might be, and so it can be considered a valid region.

            case ZINK_DESCRIPTOR_TYPE_SAMPLER_VIEW: {
               struct zink_sampler_view *sampler_view = zink_sampler_view(ctx->sampler_views[shader][i]);
               zink_descriptor_set_refs_clear(&sampler_view->desc_set_refs, sampler_view);
               zink_buffer_view_reference(ctx, &sampler_view->buffer_view, NULL);
               sampler_view->buffer_view = get_buffer_view(ctx, res, sampler_view->base.format,
                                                           sampler_view->base.u.buf.offset, sampler_view->base.u.buf.size);
               break;
            }

Sampler descriptors require a new VkBufferView be created since the previous one is no longer valid. Again, the references for the existing bufferview need to be invalidated now since that descriptor set can no longer be reused from the cache, and then the new VkBufferView is set after unrefing the old one.

            case ZINK_DESCRIPTOR_TYPE_IMAGE: {
               struct zink_image_view *image_view = &ctx->image_views[shader][i];
               zink_descriptor_set_refs_clear(&image_view->desc_set_refs, image_view);
               zink_buffer_view_reference(ctx, &image_view->buffer_view, NULL);
               image_view->buffer_view = get_buffer_view(ctx, res, image_view->base.format,
                                                         image_view->base.u.buf.offset, image_view->base.u.buf.size);
               util_range_add(&res->base, &res->valid_buffer_range, image_view->base.u.buf.offset,
                              image_view->base.u.buf.offset + image_view->base.u.buf.size);
               break;
            }

Images are nearly identical to the sampler case, the difference being that while samplers are read-only like UBOs (and therefore reach this point already having valid buffer ranges set), images are more like SSBOs and can be written to. Thus the valid range must be set here like in the SSBO case.

            default:
               break;

Eagle-eyed readers will note that I’ve omitted a UBO case, and this is because there’s nothing extra to be done there. UBOs will already have their valid range set and don’t need a VkBufferView.

            }

            invalidate_descriptor_state(ctx, shader, type);

Finally, the incremental descriptor state hash for this shader stage and descriptor type is invalidated. It’ll be recalculated normally upon the next draw or compute operation, so this is a quick zero-setting operation.

         }
      }
   }
}

That’s everything there is to know about the current state of resource invalidation in zink!

October 24, 2020

Never Seen Before

A rare Saturday post because I spent so much time this week intending to blog and then somehow not getting around to it. Let’s get to the status updates, and then I’m going to dive into the more interesting of the things I worked on over the past few days.

Zink has just hit another big milestone that I’ve just invented: as of now, my branch is passing 97% of piglit tests up through GL 4.6 and ES 3.2, and it’s a huge improvement from earlier in the week when I was only at around 92%. That’s just over 1000 failure cases remaining out of ~41,000 tests. For perspective, a table.

               IRIS    zink-mainline   zink-wip
Passed Tests   43508   21225           40190
Total Tests    43785   22296           41395
Pass Rate      99.4%   95.2%           97.1%

As always, I happen to be running on Intel hardware, so IRIS and ANV are my reference points.

It’s important to note here that I’m running piglit tests, and this is very different from CTS; put another way, I may be passing over 97% of the test cases I’m running, but that doesn’t mean that zink is conformant for any versions of GL or ES, which may not actually be possible at present (without huge amounts of awkward hacks) given the persistent issues zink has with provoking vertex handling. I expect this situation to change in the future through the addition of more Vulkan extensions, but for now I’m just accepting that there’s some areas where zink is going to misrender stuff.

What Changed?

The biggest change that boosted the zink-wip pass rate was my fixing 64bit vertex attributes, which in total had been accounting for ~2000 test failures.

Vertex attributes, as we all know since we’re all experts in the graphics field, are the inputs for vertex shaders, and the data types for these inputs can vary just like C data types. In particular, with GL 4.1, ARB_vertex_attrib_64bit became a thing, which allows 64bit values to be passed as inputs here.

Once again, this is a problem for zink.

It comes down to the difference between GL’s implicit handling methodology and Vulkan’s explicit handling methodology. Consider the case of a dvec4 data type. Conceptually, this is a data type which is 4x64bit values, requiring 32bytes of storage. A vec4 uses 16bytes of storage, and this equates to a single “slot” or “location” within the shader inputs, as everything there is vec4-aligned. This means that, by simple arithmetic, a dvec4 requires two slots for its storage, one for the first two members, and another for the second two, both consuming a single 16byte slot.

When loading a dvec4 in GL(SL), a single variable with the first location slot is used, and the driver will automatically use the second slot when loading the second half of the value.

When loading a dvec4 in (SPIR)Vulkan, two variables with consecutive, explicit location slots must be used, and the driver will load exactly the input location specified.

This difference requires that for any dvec3 or dvec4 vertex input in zink, the value and also the load have to be split along the vec4 boundary for things to work.

Gallium already performs this split on the API side, allowing zink to already be correctly setting things up in the VkPipeline creation, so I wrote a NIR pass to fix things on the shader side.

Shader Rewriting

Yes, it’s been at least a week since I last wrote about a NIR pass, so it’s past time that I got back into that.

Going into this, the idea here is to perform the following operations within the vertex shader:

  • for the input variable (hereafter A), find the deref instruction (hereafter A_deref); deref is used to access variables for input and output, and so it’s guaranteed that any 64bit input will first have a deref
  • create a second variable (hereafter B) of size double (for dvec3) or dvec2 (for dvec4) to represent the second half of A
  • alter A and A_deref’s type to dvec2; this aligns the variable (and its subsequent load) to the vec4 boundary, which enables it to be correctly read from a single location slot
  • create a second deref instruction for B (hereafter B_deref)
  • find the load_deref instruction for A_deref (hereafter A_load); a load_deref instruction is used to load data from a variable deref
  • alter the number of components for A_load to 2, matching its new dvec2 size
  • create a second load_deref instruction for B_deref which will load the remaining components (hereafter B_load)
  • construct a new composite (hereafter C_load) dvec3 or dvec4 by combining A_load + B_load to match the load of the original type of A
  • rewrite all the subsequent uses of A_load’s result to instead use C_load’s result

Simple, right?

Here we go.

static bool
lower_64bit_vertex_attribs_instr(nir_builder *b, nir_instr *instr, void *data)
{
   if (instr->type != nir_instr_type_deref)
      return false;
   nir_deref_instr *A_deref = nir_instr_as_deref(instr);
   if (A_deref->deref_type != nir_deref_type_var)
      return false;
   nir_variable *A = nir_deref_instr_get_variable(A_deref);
   if (A->data.mode != nir_var_shader_in)
      return false;
   if (!glsl_type_is_64bit(A->type) || !glsl_type_is_vector(A->type) || glsl_get_vector_elements(A->type) < 3)
      return false;

First, it’s necessary to filter out all the instructions that aren’t what should be rewritten. As above, only dvec3 and dvec4 types are targeted here (dmat* types are reduced to dvec types prior to this point), so anything other than an A_deref of variables with those types is ignored.

   /* create second variable for the split */
   nir_variable *B = nir_variable_clone(A, b->shader);
   /* split new variable into second slot */
   B->data.driver_location++;
   nir_shader_add_variable(b->shader, B);

B matches A except in its type and slot location, which will always be one greater than the slot location of A, so A can be cloned here to simplify the process of creating B.

   unsigned total_num_components = glsl_get_vector_elements(A->type);
   /* new variable is the second half of the dvec */
   B->type = glsl_vector_type(glsl_get_base_type(A->type), glsl_get_vector_elements(A->type) - 2);
   /* clamp original variable to a dvec2 */
   A_deref->type = A->type = glsl_vector_type(glsl_get_base_type(A->type), 2);

A and B need their types modified to not cross the vec4/slot boundary. A is always a dvec2, which has 2 components, and B will always be the remaining components.

   /* create B_deref instr for the new variable B */
   b->cursor = nir_after_instr(instr);
   nir_deref_instr *B_deref = nir_build_deref_var(b, B);

Now B_deref has been added thanks to the nir_builder helper function which massively simplifies the process of setting up all the instruction parameters.

   nir_foreach_use_safe(A_deref_use, &A_deref->dest.ssa) {

NIR is SSA-based, and all uses of an SSA value are tracked for the purposes of ensuring that SSA values are truly assigned only once as well as ease of rewriting them in the case where a value needs to be modified, just as this pass is doing. This use-tracking comes along with a simple API for iterating over the uses.

      nir_instr *A_load_instr = A_deref_use->parent_instr;
      assert(A_load_instr->type == nir_instr_type_intrinsic &&
             nir_instr_as_intrinsic(A_load_instr)->intrinsic == nir_intrinsic_load_deref);

The only use of A_deref should be A_load, so really iterating over the A_deref uses is just a quick, easy way to get from there to the A_load instruction.

      /* this is a load instruction for the A_deref, and we need to split it into two instructions that we can
       * then zip back into a single ssa def */
      nir_intrinsic_instr *A_load = nir_instr_as_intrinsic(A_load_instr);
      /* clamp the first load to 2 64bit components */
      A_load->num_components = A_load->dest.ssa.num_components = 2;

A_load must be clamped to a single slot location to avoid crossing the vec4 boundary, so this is done by changing the number of components to 2, which matches the now-changed type of A.

      b->cursor = nir_after_instr(A_load_instr);
      /* this is the second load instruction for the second half of the dvec3/4 components */
      nir_intrinsic_instr *B_load = nir_intrinsic_instr_create(b->shader, nir_intrinsic_load_deref);
      B_load->src[0] = nir_src_for_ssa(&B_deref->dest.ssa);
      B_load->num_components = total_num_components - 2;
      nir_ssa_dest_init(&B_load->instr, &B_load->dest, B_load->num_components, 64, NULL);
      nir_builder_instr_insert(b, &B_load->instr);

This is B_load, which loads a number of components that matches the type of B. It’s inserted after A_load, though the before/after isn’t important in this case. The key is just that this instruction is added before the next one.

      nir_ssa_def *def[4];
      /* create a new dvec3/4 comprised of all the loaded components from both variables */
      def[0] = nir_vector_extract(b, &A_load->dest.ssa, nir_imm_int(b, 0));
      def[1] = nir_vector_extract(b, &A_load->dest.ssa, nir_imm_int(b, 1));
      def[2] = nir_vector_extract(b, &B_load->dest.ssa, nir_imm_int(b, 0));
      if (total_num_components == 4)
         def[3] = nir_vector_extract(b, &B_load->dest.ssa, nir_imm_int(b, 1));
      nir_ssa_def *C_load = nir_vec(b, def, total_num_components);

Now that A_load and B_load both exist and are loading the corrected number of components, these components can be extracted and reassembled into a larger type for use in the shader, specifically the original dvec3 or dvec4 which is being used. nir_vector_extract performs this extraction from a given instruction by taking an index of the value to extract, and then the composite value is created by passing the extracted components to nir_vec as an array.

      /* use the assembled dvec3/4 for all other uses of the load */
      nir_ssa_def_rewrite_uses_after(&A_load->dest.ssa, nir_src_for_ssa(C_load), C_load->parent_instr);

Since this is all SSA, the NIR helpers can be used to trivially rewrite all the uses of the loaded value from the original A_load instruction to now use the assembled C_load value. It’s important that only the uses after C_load has been created (i.e., nir_ssa_def_rewrite_uses_after) are those that are rewritten, however, or else the shader will also rewrite the original A_load value with C_load, breaking the shader entirely with an SSA-impossible as well as generally-impossible C_load = vec(C_load + B_load) assignment.

   }

   return true;
}

Progress has occurred, so the pass returns true to reflect that.

Now those large attributes are loaded according to Vulkan spec, and everything is great because, as expected, ANV has no bugs here.

October 16, 2020

Brain Hurty

It’s been a very long week for me, and I’m only just winding down now after dissecting and resolving a crazy fp64/ssbo bug. I’m too scrambled to jump into any code, so let’s just do another fluff day and review happenings in mainline mesa which relate to zink.

MRs Landed

Thanks to the tireless, eagle-eyed reviewing of Erik Faye-Lund, a ton of zink patches out of zink-wip have landed this week. Here’s an overview of that in backwards historical order:

Versions Bumped

Zink has grown tremendously over the past day or so of MRs landing, going from GLSL 1.30 and GL 3.0 to GLSL 3.30 and GL 3.3.

It can even run Blender now. Unless you’re running it under Wayland.

October 15, 2020

New versions of the KWinFT projects Wrapland, Disman, KWinFT and KDisplay are available now. They were released this week, aligned with the release of Plasma 5.20, and offer new features and stability improvements.

Universal display management

The highlight this time is a completely redefined and reworked Disman that allows you to control display configurations not only in a KDE Plasma session with KWinFT but also with KWin and in other Wayland sessions with wlroots-based compositors as well as any X11 session.

You can use it with the included command-line tool dismanctl or together with the graphical frontend KDisplay. Read more about Disman's goals and technical details in the 5.20 beta announcement.

KWinFT projects you should use

Let's cut directly to the chase! As Disman and KDisplay are replacements for libkscreen and KScreen and KWinFT for KWin you will be interested in a comparison from a user point of view. What is better and what should you personally choose?

Disman and KDisplay for KDE Plasma

If you run a KDE Plasma desktop at the moment, you should definitely consider using Disman and replacing KScreen with KDisplay.

Disman comes with a more reliable overall design, moving internal logic to its D-Bus service and away from the frontends in KDisplay. Changes thereby become more atomic and bugs are less likely to emerge.

The UI of KDisplay is improved in comparison to KScreen and comfort functions have been added, as for example automatic selection of the best available mode.

There are still some caveats to this release that might prompt you to wait for the next one though:

  • Although multiple bugs were discovered and fixed during the beta phase, 5.20 is still the first release after a large redesign, so it is not unlikely that more bugs will be discovered later on.
  • If you want to use Disman and KDisplay in a legacy KWin Wayland session, note that I have personally tested Disman only with KWinFT and sway. Maybe other people already use it with legacy KWin in a Wayland session, but the backend that is loaded in this case could very well have seen no QA at all yet. That being said, if you run a KWin X11 session you will experience no such problems, since in this case the backend is the same as with KWinFT or any other X11 window manager.
  • If you require the KDisplay UI in another language then this release is not yet for you. An online translation system has been set up now, but the first localizations will only become available with release 5.21.

So your mileage may vary but in most cases you should have a better experience with Disman and KDisplay.

And if you in general like to support new projects with ambitious goals and make use of most modern technologies you should definitely give it a try.

Disman and KDisplay with wlroots

Disman includes a backend for wlroots-backed compositors. I'm proud of this achievement since I believe we need more projects for the Linux desktop which do not only try to solve issues in their own little habitat and project monoculture but which aim at improving the Linux desktop in a more holistic and collaborative spirit.

I tested the backend myself and even provided some patches to wlroots directly to improve its output-management capabilities, so using Disman with wlroots should be a decent experience. One catch though is that those patches will only be part of the next wlroots release. For now you can only get them by compiling wlroots from master or having your distribution of choice backport them.

In comparison with other options to manage your displays in wlroots I believe Disman provides the most user-friendly solution, taking lots of work off your shoulders by automatically optimizing unknown new display setups and reloading data for already known setups.

Another prominent alternative for display management on wlroots is kanshi. I don't think kanshi is as easy to use and autonomously optimizing as Disman, but you might be able to configure displays more precisely with it. So you could prefer kanshi or Disman depending on your needs.

You can use Disman in wlroots sessions as a standalone system with its included command-line tool dismanctl and without KDisplay. This way you do not need to pull in as many KDE dependencies. But KDisplay together with Disman and wlroots also works very well and provides you an easy-to-use UI for adapting the display configuration according to your needs.

Disman and KDisplay on X11

You will like it. Try it out is all I can say. The RandR backend is tested thoroughly and while there is still room for some refactoring it should work very well already. This is also independent of what desktop environment or window manager you use. Install it and see for yourself.

That being said the following issues are known at the moment:

  • Only a global scale can be selected for all displays. Disman can not yet set different scales for different displays. This might become possible in the future but has certain drawbacks on its own.
  • The global scale in X11 is set by KDisplay alone. So you must install Disman with KDisplay in case you want to change the global scale without writing to the respective X11 config files manually.
  • At the time of writing there is a bug with Nvidia cards that leads to reduced refresh rates. But I expect this bug to be fixed very soon.

KWinFT vs KWin

I was talking a lot about Disman since it contains the most interesting changes this release and it can now be useful to many more people than before.

But you might also be interested in replacing KWin with KWinFT, so let's take a look at how KWinFT at this point in time compares to legacy KWin.

As it stands KWinFT is still a drop-in replacement for KWin. You can install it to your system, replacing KWin, and use it together with a KDE Plasma session.

X11

If you usually run an X11 session you should choose KWinFT without hesitation. It provides the same features as KWin and comes with an improved compositing pipeline that lowers latency and increases smoothness. There are also patches in the works to improve upon this further for multi-display setups. These patches might come to the 5.20 release via a bug fix release.

One point to keep in mind though is that the KWinFT project will concentrate in the future on improving the experience with Wayland. We won't maliciously regress the X11 experience, but if there is a tradeoff between improving the Wayland session and regressing X11, KWinFT will opt for the former. Whether such a situation unfolds at some point in time has yet to be seen. The X11 session might as well continue to work without any regressions for the next decade.

Wayland

The situation is different if you want to run KWinFT as a Wayland compositor. I believe in regards to stability and robustness KWinFT is superior.

In particular this holds true for multi-display setups and display management. Although I worked mostly on Disman in the last two months, that work naturally spilled over to KWinFT too. KWinFT's output objects are now much more reasonably implemented. Besides that there were many more bug fixes to output handling, which you can verify by looking at the merged changes for 5.20 in KWinFT and Wrapland.

If you have issues with your outputs in KWin definitely try out KWinFT, of course together with Disman.

Another area where you probably will have a better experience is the composition itself. As on X11, the compositing pipeline was reworked. For multi-display setups the patch that was linked above, which might come in a bug fix release to 5.20, should improve the situation further.

On the other side KWin's Wayland session gained some much awaited features with 5.20. According to the changelog screencasting is now possible, as are middle-click pasting and integration with Klipper, the clipboard management utility in the system tray.

I say "according to the changelog" because I have not tested these features myself and I expect them to not work without issues. That is for one because big feature additions like these regularly require later adjustments due to unforeseen behavior changes, but also because as a matter of principle and strategy I disagree with the KWin developers' general approach here.

The KWin codebase is rotten and needs a rigorous overhaul. Putting more features on top of that, which often requires massive internal changes just for the sake of crossing an item off a checklist, might make sense from the viewpoint of KDE users and KDE's marketing staff, but from a long-term engineering perspective it will only litter the code more and lead to more and more breakage over time. Most users won't notice that immediately, but when they do it is already too late.

On how to do that better I really have to compliment the developers of Gnome's Mutter and wlroots.

Mutter's Wayland session in particular was in a bad state just a few years ago, with some fundamental problems due to its history. But they committed to a very forward-thinking stance, ignoring the initial bad reception and not being tempted by immediate quick fixes that long-term would not hold up to the necessary standards. And nowadays Gnome Mutter's Wayland session is in way better shape. I want to highlight their transactional KMS project. This is a massive overhaul that is completely transparent to the common user, but enables the Mutter developers to build on a solid base in many ways in the future.

Still as said I have not tried KWin 5.20 myself and if the new features are important to you, give it a try and check for yourself if your experience confirms my concerns or if you are happy with what was added. Switching from KWin to KWinFT or the other way around is easy after all.

How to get the KWinFT projects

If you self-compile KWinFT it is very easy to switch from KWin to KWinFT. Just compile KWinFT to your system prefix. If you want more comfort through distribution packages you have to choose your distribution carefully.

Currently only Manjaro provides KWinFT packages officially. You can install all KWinFT projects on Manjaro easily through the packages with the same names.

Manjaro also offers git-variants of these packages allowing you to run KWinFT projects directly from master branch. This way you can participate in its development directly or give feedback to latest changes.

If you run Arch Linux you can install all KWinFT projects from the AUR. The release packages are not yet updated to 5.20, but I assume that will happen pretty soon. They have a somewhat weird naming scheme: there are kwinft, wrapland-kwinft, disman-kwinft and kdisplay-kwinft. Git-variants of these packages are available too, but they follow the better naming scheme without a kwinft suffix. So for example the git package for disman-kwinft is just called disman-git. Naming nitpicks aside, huge thanks to the maintainers of these packages: abelian424 and Christoph (haagch).

A special place in my heart was conquered not long ago by Fedora. I switched over to it from KDE Neon due to problems on the latest update and the often outdated packages, and I am amazed by Fedora's technically versed and overall professional vision.

To install KWinFT projects on Fedora with its exceptional package manager DNF you can make use of this copr repository that includes their release versions. The packages are already updated to 5.20. Thanks to zawertun for providing these packages!

Fedora's KDE SIG group also took interest in the KWinFT projects and set up a preliminary copr for them. One of their packagers contacted me after the Beta release and I hope that I can help them get it fully set up soon. I think Fedora's philosophy of pushing the Linux ecosystem by providing most recent packages and betting on emerging technologies will harmonize very well with the goals of the KWinFT project.

Busy, Busy, Busy

It’s been a busy week for me in personal stuff, so my blogging has been a bit slow. Here’s a brief summary of a few vaguely interesting things that I’ve been up to:

  • I’m now up to 37fps in the Heaven benchmark, which is an absolute unit of a number that brings me up to 68.5% of native GL
  • I discovered that people have been reporting bugs for zink-wip (great) without tagging me (not great), and I’m not getting notifications for it (also not great). gitlab is hard.
  • I added support for null UBOs in descriptor sets, which I think I was supposed to actually have done some time ago, but I hadn’t run into any tests that hit it. null checks all the way down
  • I fired up supertuxkart for one of the reports, which led me down a rabbithole of discovering that part of my compute shader implementation was broken for big shared data loads; I was, for whatever reason, attempting to do an OpAccessChain with a result type of uvec4 from a base type of array<uint>, which…it’s just not going to work ever, so that’s some strong work by past me
  • I played supertuxkart for a good 30 seconds and took a screenshot

supertuxkart.png

There’s a weird flickering bug with the UI that I get in some levels that bears some looking into, but otherwise I get a steady 60/64/64 in zink vs 74/110/110 in IRIS (no idea what the 3 numbers mean, if anyone does, feel free to drop me a line).

That’s it for today. Hopefully a longer post tomorrow but no promises.

October 14, 2020

 A couple of years ago, we sandboxed thumbnailers using bubblewrap to avoid drive-by downloads taking advantage of thumbnailers with security issues.

 It's a great tool, and it's a tool that Flatpak relies upon to create its own sandboxes. But that also meant that we couldn't use it inside the Flatpak sandboxes themselves, and those aren't always as closed as they could be, to support legacy applications.

 We've finally implemented support for sandboxing thumbnailers within Flatpak, using the Spawn D-Bus interface (indirectly).

This should all land in GNOME 40, though it should already be possible to integrate it into your Flatpaks. Make sure to use the latest gnome-desktop development version, and that the flatpak-spawn utility is new enough in the runtime you're targeting (it's been updated in the freedesktop.org runtimes #1, #2, #3, but it takes time to trickle down to GNOME versions). Example JSON snippets:

{
    "name": "flatpak-xdg-utils",
    "buildsystem": "meson",
    "sources": [
        {
            "type": "git",
            "url": "https://github.com/flatpak/flatpak-xdg-utils.git",
            "tag": "1.0.4"
        }
    ]
},
{
    "name": "gnome-desktop",
    "buildsystem": "meson",
    "config-opts": ["-Ddebug_tools=true", "-Dudev=disabled"],
    "sources": [
        {
            "type": "git",
            "url": "https://gitlab.gnome.org/GNOME/gnome-desktop.git"
        }
    ]
}

(We also sped up GStreamer-based thumbnailers by allowing them to use a cache, and added profiling information to the thumbnail test tools, which could prove useful if you want to investigate performance or bugs in that area)

Edit: correct a link, thanks to the commenters for the notice

October 13, 2020

This week, I had a conversation with one of my coworkers about our subgroup/wave size heuristic and, in particular, whether or not control-flow divergence should be considered as part of the choice.  This led me down a fun path of looking into the statistics of control-flow divergence and the end result is somewhat surprising:  Once you get above about an 8-wide subgroup, the subgroup size doesn't matter.

Before I get into the details, let's talk nomenclature.  As you're likely aware, GPUs often execute code in groups of 1 or more invocations.  In D3D terminology, these are called waves.  In Vulkan and OpenGL terminology, these are called subgroups.  The two terms are interchangeable and, for the rest of this post, I'll use the Vulkan/OpenGL conventions.

Control-flow divergence

Before we dig into the statistics, let's talk for a minute about control-flow divergence.  This is mostly going to be a primer on SIMT execution and control-flow divergence in GPU architectures.  If you're already familiar, skip ahead to the next section.

Most modern GPUs use a Single Instruction Multiple Thread (SIMT) model.  This means that the graphics programmer writes a shader which, for instance, colors a single pixel (fragment/pixel shader) but what the shader compiler produces is a program which colors, say, 32 pixels using a vector instruction set architecture (ISA).  Each logical single-pixel execution of the shader is called an "invocation" while the physical vectorized execution of the shader which covers multiple pixels is called a wave or a subgroup.  The size of the subgroup (number of pixels colored by a single hardware execution) varies depending on your architecture.  On Intel it can be 8, 16, or 32; on AMD it's 32 or 64; and on Nvidia (if my knowledge is accurate) it's always 32.

This conversion from logical single-pixel version of the shader to a physical multi-pixel version is often fairly straightforward.  The GPU registers each hold N values and the instructions provided by the GPU ISA operate on N pieces of data at a time.  If, for instance, you have an add in the logical shader, it's converted to an add provided by the hardware ISA which adds N values.  (This is, of course an over-simplification but it's sufficient for now.)  Sounds simple, right?

Where things get more complicated is when you have control-flow in your shader.  Suppose you have an if statement with both then and else sections.  What should we do when we hit that if statement?  The if condition will be N Boolean values.  If all of them are true or all of them are false, the answer is pretty simple: we do the then or the else respectively.  If you have a mix of true and false values, we have to execute both sides.  More specifically, the physical shader has to disable all of the invocations for which the condition is false and run the "then" side of the if statement.  Once that's complete, it has to re-enable those channels and disable the channels for which the condition is true and run the "else" side of the if statement.  Once that's complete, it re-enables all the channels and continues executing the code after the if statement.
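
Here's a scalar-C illustration of that masking behavior (a toy of my own, not real GPU code): each array element plays the role of one invocation, and both sides of the branch are walked, each under its own mask.

#include <stdbool.h>

#define SUBGROUP_SIZE 8

/* logical shader: x = cond ? x * 2.0f : x + 1.0f
 * physical shader: run both blocks, each with only some lanes enabled */
static void
run_if_else(const bool cond[SUBGROUP_SIZE], float x[SUBGROUP_SIZE])
{
   /* "then" block: only lanes where cond is true are enabled */
   for (int lane = 0; lane < SUBGROUP_SIZE; lane++)
      if (cond[lane])
         x[lane] *= 2.0f;

   /* "else" block: the remaining lanes are enabled; real hardware can only
    * skip one of these blocks when every lane agrees on the condition */
   for (int lane = 0; lane < SUBGROUP_SIZE; lane++)
      if (!cond[lane])
         x[lane] += 1.0f;
}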

When you start nesting if statements and throw loops into the mix, things get even more complicated.  Loop continues have to disable all those channels until the next iteration of the loop, loop breaks have to disable all those channels until the loop is entirely complete, and the physical shader has to figure out when there are no channels left and complete the loop.  This makes for some fun and interesting challenges for GPU compiler developers.  Also, believe it or not, everything I just said is a massive over-simplification. :-)

The point which most graphics developers need to understand and what's important for this blog post is that the physical shader has to execute every path taken by any invocation in the subgroup.  For loops, this means that it has to execute the loop enough times for the worst case in the subgroup.  This means that if you have the same work in both the then and else sides of an if statement, that work may get executed twice rather than once and you may be better off pulling it outside the if.  It also means that if you have something particularly expensive and you put it inside an if statement, that doesn't mean that you only pay for it when needed, it means you pay for it whenever any invocation in the subgroup needs it.

Fun with statistics

At the end of the last section, I said that one of the problems with the SIMT model used by GPUs is that they end up having worst-case performance for the subgroup.  Every path through the shader which has to be executed for any invocation in the subgroup has to be taken by the shader as a whole.  The question that naturally arises is, "does a larger subgroup size make this worst-case behavior worse?"  Clearly, the naive answer is, "yes".  If you have a subgroup size of 1, you only execute exactly what's needed and if you have a subgroup size of 2 or more, you end up hitting this worst-case behavior.  If you go higher, the bad cases should be more likely, right?  Yes, but maybe not quite like you think.

This is one of those cases where statistics can be surprising.  Let's say you have an if statement with a boolean condition b.  That condition is actually a vector (b1, b2, b3, ..., bN) and if any two of those vector elements differ, we pay the cost of both paths.  Assuming that the conditions are independent identically distributed (IID) random variables, the probability of the entire vector being true is P(all(bi = true)) = P(b1 = true) * P(b2 = true) * ... * P(bN = true) = P(bi = true)^N where N is the size of the subgroup.  Therefore, the probability of having uniform control-flow is P(bi = true)^N + P(bi = false)^N.  The probability of non-uniform control-flow, on the other hand, is 1 - P(bi = true)^N - P(bi = false)^N.

Before we go further with the math, let's put some solid numbers on it.  Let's say we have a subgroup size of 8 (the smallest Intel can do) and let's say that our input data is a series of coin flips where bi is "flip i was heads".  Then P(bi = true) = P(bi = false) = 1/2.  Using the math in the previous paragraph, P(uniform) = P(bi = true)^8 + P(bi = false)^8 = 1/128.  This means that there is only a 1:128 chance that you'll get uniform control-flow and a 127:128 chance that you'll end up taking both paths of your if statement.  If we increase the subgroup size to 64 (the maximum among AMD, Intel, and Nvidia), you get a 1:2^63 chance of having uniform control-flow and a (2^63-1):2^63 chance of executing both halves.  If we assume that the shader takes T time units when control-flow is uniform and 2T time units when control-flow is non-uniform, then the amortized cost of the shader for a subgroup size of 8 is 1/128 * T + 127/128 * 2T = 255/128 T and, by a similar calculation, the cost of a shader with a subgroup size of 64 is (2^64 - 1)/2^63 T.  Both of those are within rounding error of 2T and the added cost of using the massively wider subgroup size is less than 1%.  Playing with the statistics a bit, the following chart shows the probability of divergence vs. the subgroup size for various choices of P(bi = true):

One thing to immediately notice is that because we're only concerned about the probability of divergence and not of the two halves of the if independently, the graph is symmetric (p=0.9 and p=0.1 are the same).  Second, and the point I was trying to make with all of the math above, is that until your probability gets pretty extreme (> 90%) the probability of divergence is reasonably high at any subgroup size.  From the perspective of a compiler with no knowledge of the input data, we have to assume every if condition is a 50/50 chance at which point we can basically assume it will always diverge.
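
To make the arithmetic concrete, here's a tiny stand-alone C program (purely illustrative, not driver code) that tabulates P(divergent) = 1 - p^N - (1 - p)^N for a few subgroup sizes; with p = 0.5 it reproduces the coin-flip numbers above.

#include <math.h>
#include <stdio.h>

int main(void)
{
   const double p = 0.5; /* P(bi = true) for a 50/50 condition */
   const unsigned sizes[] = { 1, 2, 4, 8, 16, 32, 64 };
   for (unsigned i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
      /* probability that all N invocations agree (all true or all false) */
      double uniform = pow(p, sizes[i]) + pow(1.0 - p, sizes[i]);
      printf("subgroup size %2u: P(divergent) = %.4f\n", sizes[i], 1.0 - uniform);
   }
   return 0;
}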

Instead of only considering divergence, let's take a quick look at another case.  Let's say that you have a one-sided if statement (no else) that is expensive but rare.  To put numbers on it, let's say the probability of the if statement being taken is 1/16 for any given invocation.  Then P(taken) = P(any(bi = true)) = 1 - P(all(bi = false)) = 1 - P(bi = false)^N = 1 - (15/16)^N.  This works out to about 0.4 for a subgroup size of 8, 0.65 for 16, 0.87 for 32, and 0.98 for 64.  The following chart shows what happens if we play around with the probabilities of our if condition a bit more:

As we saw with the earlier divergence plot, even events with a fairly low probability (10%) are quite likely to happen with a subgroup size of 8 (57%) and become even more likely the higher the subgroup size goes.  Again, from the perspective of a compiler with no knowledge of the data trying to make heuristic decisions, it looks like "ifs always happen" is a reasonable assumption.  However, if we have something expensive like a texture instruction that we can easily move into an if statement, we may as well.  There are no guarantees, but if the probability of that if statement is low enough, we might be able to avoid it at least some of the time.
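
The same quick tabulation for the rare, one-sided if (again a stand-alone illustration of the arithmetic above, not driver code):

#include <math.h>
#include <stdio.h>

int main(void)
{
   const double p_taken = 1.0 / 16.0; /* per-invocation probability of taking the if */
   const unsigned sizes[] = { 8, 16, 32, 64 };
   for (unsigned i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
      printf("subgroup size %2u: P(any invocation takes it) = %.2f\n",
             sizes[i], 1.0 - pow(1.0 - p_taken, sizes[i]));
   return 0;
}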

Statistical independence

A keen statistical eye may have caught a subtle statement I made very early on in the previous section:

 Assuming that the conditions are independent identically distributed (IID) random variables...

While less statistically minded readers may have glossed over this as meaningless math jargon, it's actually a very important assumption.  Let's take a minute to break it down.  A random variable in statistics is just an event.  In our case, it's something like "the if condition was true".  To say that a set of random variables is identically distributed means that they have the same underlying probabilities.  Two coin tosses, for instance, are identically distributed while the distributions of "coin came up heads" and "die came up 6" are very different.  When combining random variables, we have to be careful to ensure that we're not mixing apples and oranges.  All of the analysis above was looking at the evaluation of a boolean in the same if condition but across different subgroup invocations.  These should be identically distributed.

The remaining word that's of critical importance in the IID assumption is "independent".  Two random variables are said to be independent if they have no effect on one another or, to be more precise, knowing the value of one tells you nothing whatsoever about the value of the other.  Random variables which are not independent are said to be "correlated".  One example of random variables which are very much not independent would be housing prices in a neighborhood because the first thing home appraisers look at to determine the value of a house is the value of other houses in the same area that have sold recently.  In my computations above, I used the rule that P(X and Y) = P(X) * P(Y) but this only holds if X and Y are independent random variables.  If they're dependent, the statistics look very different.  This raises an obvious question:  Are if conditions statistically independent across a subgroup?  The short answer is "no".

How does this correlation and lack of independence (those are the same) affect the statistics?  If two events X and Y are negatively correlated then P(X and Y) < P(X) * P(Y) and if two events are positively correlated then P(X and Y) > P(X) * P(Y).  When it comes to if conditions across a subgroup, most correlations that matter are positive.  Going back to our statistics calculations, the probability of an if condition diverging is 1 - P(all(bi = true)) - P(all(bi = false)) and P(all(bi = true)) = P(b1 = true and b2 = true and ... and bN = true).  So, if the data is positively correlated, we get P(all(bi = true)) > P(bi = true)^N and P(divergent) = 1 - P(all(bi = true)) - P(all(bi = false)) < 1 - P(bi = true)^N - P(bi = false)^N.  So correlation for us typically reduces the probability of divergence.  This is a good thing because divergence is expensive.  How much does it reduce the probability of divergence?  That's hard to tell without deep knowledge of the data but there are a few easy cases to analyze.

One particular example of dependence that comes up all the time is uniform values.  Many values passed into a shader are the same for all invocations within a draw call or for all pixels within a group of primitives.  Sometimes the compiler is privy to this information (if it comes from a uniform or constant buffer, for instance) but often it isn't.  It's fairly common for apps to pass some bit of data as a vertex attribute which, even though it's specified per-vertex, is actually the same for all of them.  If a bit of data is uniform (even if the compiler doesn't know it is), then any if conditions based on that data (or from a calculation using entirely uniform values) will be the same.  From a statistics perspective, this means that P(all(bi = true)) + P(all(bi = false)) = 1 and P(divergent) = 0.  From a shader execution perspective, this means that it will never diverge no matter the probability of the condition because our entire wave will evaluate the same value.

What about non-uniform values such as vertex positions, texture coordinates, and computed values?  In your average vertex, geometry, or tessellation shader, these are likely to be effectively independent.  Yes, there are patterns in the data such as common edges and some triangles being closer to others.  However, there is typically a lot of vertex data and the way that vertices get mapped to subgroups is random enough that these correlations between vertices aren't likely to show up in any meaningful way.  (I don't have a mathematical proof for this off-hand.)  When they're independent, all the statistics we did in the previous section apply directly.

With pixel/fragment shaders, on the other hand, things get more interesting.  Most GPUs rasterize pixels in groups of 2x2 pixels where each 2x2 pixel group comes from the same primitive.  Each subgroup is made up of a series of these 2x2 pixel groups so, if the subgroup size is 16, it's actually 4 groups of 2x2 pixels each.  Within a given 2x2 pixel group, the chances of a given value within the shader being the same for each pixel in that 2x2 group is quite high.  If we have a condition which is the same within each 2x2 pixel group then, from the perspective of divergence analysis, the subgroup size is effectively divided by 4.  As you can see in the earlier charts (for which I conveniently provided small subgroup sizes), the difference between a subgroup size of 2 and 4 is typically much larger than between 8 and 16.

Another common source of correlation in fragment shader data comes from the primitives themselves.  Even if they may be different between triangles, values are often the same or very tightly correlated between pixels in the same triangle.  This is sort of a super-set of the 2x2 pixel group issue we just covered.  This is important because this is a type of correlation that hardware has the ability to encourage.  For instance, hardware can choose to dispatch subgroups such that each subgroup only contains pixels from the same primitive.  Even if the hardware typically mixes primitives within the same subgroup, it can attempt to group things together to increase data correlation and reduce divergence.

Why bother with subgroups?

All this discussion of control-flow divergence might leave you wondering why we bother with subgroups at all.  Clearly, they're a pain.  They definitely are.  Oh, you have no idea...

But they also bring some significant advantages in that the parallelism allows us to get better throughput out of the hardware.  One obvious way this helps is that we can spend less hardware on instruction decoding (we only have to decode once for the whole wave) and put those gates into more floating-point arithmetic units.  Also, most processors are pipelined and, while they can start processing a new instruction each cycle, it takes several cycles before an instruction makes its way from the start of the pipeline to the end and its result can be used in a subsequent instruction.  If you have a lot of back-to-back dependent calculations in the shader, you can end up with lots of stalls where an instruction goes into the pipeline and the next instruction depends on its value and so you have to wait 10ish cycles for the previous instruction to complete.  On Intel, each SIMD32 instruction is actually four SIMD8 instructions that pipeline very nicely and so it's easier to keep the ALU busy.

Ok, so wider subgroups are good, right?  Go as wide as you can!  Well, yes and no.  Generally, there's a point of diminishing returns.  Is one instruction decoder per 32 invocations of ALU really that much more hardware than one per 64 invocations?  Probably not.  Generally, the subgroup size is determined based on what's required to keep the underlying floating-point arithmetic hardware full.  If you have 4 ALUs per execution unit and a pipeline depth of 10 cycles, then an 8-wide subgroup is going to have trouble keeping the ALU full.  A 32-wide subgroup, on the other hand, will keep it 80% full even with back-to-back dependent instructions so going 64-wide is pointless.
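
As a rough back-of-the-envelope check of those numbers, here's a stand-alone C sketch using the example figures from the paragraph above (4 ALUs and a 10-cycle pipeline are illustrative assumptions, not a description of any particular GPU):

#include <stdio.h>

int main(void)
{
   const double alu_width = 4.0;       /* invocations processed per cycle */
   const double pipeline_depth = 10.0; /* cycles before a result can be reused */
   const unsigned sizes[] = { 8, 16, 32, 64 };
   for (unsigned i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
      /* with back-to-back dependent instructions, one subgroup only covers
       * subgroup_size / alu_width cycles out of every pipeline_depth cycles */
      double occupancy = sizes[i] / (alu_width * pipeline_depth);
      if (occupancy > 1.0)
         occupancy = 1.0;
      printf("subgroup size %2u: ~%.0f%% ALU occupancy\n", sizes[i], occupancy * 100.0);
   }
   return 0;
}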

On Intel GPU hardware, there are additional considerations.  While most GPUs have a fixed subgroup size, ours is configurable and the subgroup size is chosen by the compiler.  What's less flexible for us is our register file.  We have a fixed register file size of 4KB regardless of the subgroup size so, depending on how many temporary values your shader uses, it may be difficult to compile it 16 or 32-wide and still fit everything in registers.  While wider programs generally yield better parallelism, the additional register pressure can easily negate any parallelism benefits.

There are also other issues such as cache utilization and thrashing but those are way out of scope for this blog post...

What does this all mean?

This topic came up this week in the context of tuning our subgroup size heuristic in the Intel Linux 3D drivers.  In particular, how should that heuristic reason about control-flow and divergence?  Are wider programs more expensive because they have the potential to diverge more?

After all the analysis above, the conclusion I've come to is that any given if condition falls roughly into one of three categories:

  1. Effectively uniform.  It never (or very rarely ever) diverges.  In this case, there is no difference between subgroup sizes because it never diverges.
  2. Random.  Since we have no knowledge about the data in the compiler, we have to assume that random if conditions are basically a coin flip every time.  Even with our smallest subgroup size of 8, this means it's going to diverge with a probability of 99.2% (the 127:128 odds from earlier).  Even if you assume 2x2 subspans in fragment shaders are strongly correlated, divergence is still likely with a probability of 50% for SIMD8 shaders, 87.5% for SIMD16, and 99.2% for SIMD32.
  3. Random but very one-sided.  These conditions are the type where we can actually get serious statistical differences between the different subgroup sizes.  Unfortunately, we have no way of knowing when an if condition will be in this category so it's impossible to make heuristic decisions based on it.

Where does that leave our heuristic?  The only interesting case in the above three is random data in fragment shaders.  In our experience, the increased parallelism going from SIMD8 to SIMD16 is huge so it probably makes up for the increased divergence.  The parallelism increase from SIMD16 to SIMD32 isn't huge but the change in the probability of a random if diverging is pretty small (87.5% vs. 99.2%) so, all other things being equal, it's probably better to go SIMD32.

October 12, 2020

Jumping Right In

When last I left off, I’d cleared out 2/3 of my checklist for improving update_sampler_descriptors() performance:

handle_image_descriptor() was next on my list. As a result, I immediately dove right into an entirely different part of the flamegraph since I’d just been struck by a seemingly-obvious idea. Here’s the last graph:

frame_retrieval.png

Here’s the next step:

reuse_barriers.png

What changed?

Well, as I’m now caching descriptor sets across descriptor pools, it occurred to me that, assuming my descriptor state hashing mechanism is accurate, all the resources used in a given set must be identical. This means that all resources for a given type (e.g., UBO, SSBO, sampler, image) must be completely identical to previous uses across all shader stages. Extrapolating further, this also means that the way in which these resources are used must also be identical, which means the pipeline barriers for access and image layouts must also be identical.

Which means they can be stored onto the struct zink_descriptor_set object and reused instead of being accumulated every time. This reuse completely eliminates add_transition() from using any CPU time (it’s the left-most block above update_sampler_descriptors() in the first graph), and it thus massively reduces overall time for descriptor updates.
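
As a rough sketch of the idea (not the actual zink code; the cache_hit parameter and early return stand in for however the real code detects a cached set, and the merging/deduplication of barriers is elided), the barriers only need to be accumulated on a cache miss:

/* hypothetical sketch: on a cache hit, zds->barriers already holds the merged
 * barriers from the previous use of this set, so skip re-accumulating them */
static void
add_transition(struct zink_descriptor_set *zds, struct zink_resource *res,
               VkImageLayout layout, VkAccessFlags access,
               VkPipelineStageFlags stage, bool cache_hit)
{
   if (cache_hit)
      return;
   struct zink_descriptor_barrier t = {
      .res = res,
      .layout = layout,
      .access = access,
      .stage = stage,
   };
   util_dynarray_append(&zds->barriers, struct zink_descriptor_barrier, t);
}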

This marks a notable landmark, as it’s the point at which update_descriptors() begins to use only ~50% of the total CPU time consumed in zink_draw_vbo(), with the other half going to the draw command where it should be.

handle_image_descriptor()

At last my optimization-minded workflow returned to this function, and looking at the flamegraph again yielded the culprit. It’s not visible due to this being a screenshot, but the whole of the perf hog here was obvious, so let’s check out the function itself since it’s small. I think I explored part of this at one point in the distant past, possibly for ARB_texture_buffer_object, but refactoring has changed things up a bit:

static void
handle_image_descriptor(struct zink_screen *screen, struct zink_resource *res, enum zink_descriptor_type type, VkDescriptorType vktype, VkWriteDescriptorSet *wd,
                        VkImageLayout layout, unsigned *num_image_info, VkDescriptorImageInfo *image_info, struct zink_sampler_state *sampler,
                        VkBufferView *null_view, VkImageView imageview, bool do_set)

First, yes, there are a lot of parameters, including VkBufferView *null_view, which is a pointer to a stack array that’s initialized as containing VK_NULL_HANDLE. As the VkWriteDescriptorSet must be given a pointer to an array of buffer views for texel buffers, it’s important that the stack variable used doesn’t go out of scope, so it has to be passed in like this or else this functionality can’t be broken out in this way.

{
    if (!res) {
        /* if we're hitting this assert often, we can probably just throw a junk buffer in since
         * the results of this codepath are undefined in ARB_texture_buffer_object spec
         */
        assert(screen->info.rb2_feats.nullDescriptor);
        
        switch (vktype) {
        case VK_DESCRIPTOR_TYPE_UNIFORM_TEXEL_BUFFER:
        case VK_DESCRIPTOR_TYPE_STORAGE_TEXEL_BUFFER:
           wd->pTexelBufferView = null_view;
           break;
        case VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER:
        case VK_DESCRIPTOR_TYPE_STORAGE_IMAGE:
           image_info->imageLayout = VK_IMAGE_LAYOUT_UNDEFINED;
           image_info->imageView = VK_NULL_HANDLE;
           if (sampler)
              image_info->sampler = sampler->sampler[0];
           if (do_set)
              wd->pImageInfo = image_info;
           ++(*num_image_info);
           break;
        default:
           unreachable("unknown descriptor type");
        }

This is just handling for null shader inputs, which is permitted by various GL specs.

     } else if (res->base.target != PIPE_BUFFER) {
        assert(layout != VK_IMAGE_LAYOUT_UNDEFINED);
        image_info->imageLayout = layout;
        image_info->imageView = imageview;
        if (sampler) {
           VkFormatProperties props;
           vkGetPhysicalDeviceFormatProperties(screen->pdev, res->format, &props);

This vkGetPhysicalDeviceFormatProperties call is actually the entire cause of handle_image_descriptor() using any CPU time, at least on ANV. The lookup for the format is a significant bottleneck here, so it has to be removed.

           if ((res->optimial_tiling && props.optimalTilingFeatures & VK_FORMAT_FEATURE_SAMPLED_IMAGE_FILTER_LINEAR_BIT) ||
               (!res->optimial_tiling && props.linearTilingFeatures & VK_FORMAT_FEATURE_SAMPLED_IMAGE_FILTER_LINEAR_BIT))
              image_info->sampler = sampler->sampler[0];
           else
              image_info->sampler = sampler->sampler[1] ?: sampler->sampler[0];
        }
        if (do_set)
           wd->pImageInfo = image_info;
        ++(*num_image_info);
     }
}

Just for completeness, the remainder of this function is checking whether the device’s format features support the requested type of filtering (if linear), and then zink will fall back to nearest in other cases. Following this, do_set is true only for the base member of an image/sampler array of resources, and so this is the one that gets added into the descriptor set.

But now I’m again returning to vkGetPhysicalDeviceFormatProperties. Since this is using CPU, it needs to get out of the hotpath here in descriptor updating, but it does still need to be called. As such, I’ve thrown more of this onto the zink_screen object:

static void
populate_format_props(struct zink_screen *screen)
{
   for (unsigned i = 0; i < PIPE_FORMAT_COUNT; i++) {
      VkFormat format = zink_get_format(screen, i);
      if (!format)
         continue;
      vkGetPhysicalDeviceFormatProperties(screen->pdev, format, &screen->format_props[i]);
   }
}
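
With that cache in place, the per-update lookup in handle_image_descriptor() can become a plain array access, something like this sketch (indexing by the gallium format is my assumption based on how the cache is filled above):

/* sketch: read the properties cached at screen creation instead of calling
 * vkGetPhysicalDeviceFormatProperties() in the descriptor-update hot path */
VkFormatProperties props = screen->format_props[res->base.format];
if ((res->optimial_tiling && props.optimalTilingFeatures & VK_FORMAT_FEATURE_SAMPLED_IMAGE_FILTER_LINEAR_BIT) ||
    (!res->optimial_tiling && props.linearTilingFeatures & VK_FORMAT_FEATURE_SAMPLED_IMAGE_FILTER_LINEAR_BIT))
   image_info->sampler = sampler->sampler[0];
else
   image_info->sampler = sampler->sampler[1] ?: sampler->sampler[0];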

Indeed, now instead of performing the fetch on every descriptor update, I’m just grabbing all the properties on driver init and then using the cached values throughout. Let’s see how this looks.

props.png

update_descriptors() is now using visibly less time than the draw command, though not by a huge amount. I’m also now up to an unstable 33fps.

Some Cleanups

At this point, it bears mentioning that I wasn’t entirely satisfied with the amount of CPU consumed by descriptor state hashing, so I ended up doing some pre-hashing here for samplers, as they have the largest state. At the beginning, the hashing looked like this:

In this case, each sampler descriptor hash was the size of a VkDescriptorImageInfo, which is two 64-bit values and a 32-bit value, or 20 bytes of hashing per sampler. That ends up being a lot of hashing, and it also ends up being a lot of repeatedly hashing the same values.

Instead, I changed things around to do some pre-hashing:

In this way, I could have a single 32-bit value representing the sampler view that persisted for its lifetime, and a pair of 32-bit values for the sampler (since I still need to potentially toggle between linear and nearest filtering) that I can select between. This ends up being 8 bytes to hash, which is over 50% less. It’s not a huge change in the flamegraph, but it’s possibly an interesting factoid. Also, as the layouts will always be the same for these descriptors, that member can safely be omitted from the original sampler_view hash.
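
A minimal sketch of what that pre-hashing can look like, using mesa's _mesa_hash_data() helper (the helper names, and where the resulting hashes get stored, are assumptions for illustration rather than the real zink fields):

/* hypothetical sketch: hash the view and sampler handles once at creation
 * time so per-draw descriptor state hashing only mixes in 32-bit values */
static uint32_t
prehash_sampler_view(VkImageView view)
{
   return _mesa_hash_data(&view, sizeof(view));
}

static void
prehash_sampler(struct zink_sampler_state *state, uint32_t hash[2])
{
   /* one hash per filtering variant, so linear vs. nearest can still be
    * selected later without rehashing */
   hash[0] = _mesa_hash_data(&state->sampler[0], sizeof(state->sampler[0]));
   hash[1] = _mesa_hash_data(&state->sampler[1], sizeof(state->sampler[1]));
}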

Set Usage

Another vaguely interesting tidbit is profiling zink’s usage of mesa’s set implementation, which is used to ensure that various objects are only added to a given batch a single time. Historically the pattern for use in zink has been something like:

if (!_mesa_set_search(set, value)) {
   do_something();
   _mesa_set_add(set, value);
}

This is not ideal, as it ends up performing two lookups in the set for cases where the value isn’t already present. A much better practice is:

bool found = false;
_mesa_set_search_and_add(set, value, &found);
if (!found)
    do_something();

In this way, the lookup is only done once, which ends up being huge for large sets.

For My Final Trick

I was now holding steady at 33fps, but there was a tiny bit more performance to squeeze out of descriptor updating when I began to analyze how much looping was being done. This is a general overview of all the loops in update_descriptors() for each type of descriptor at the time of my review:

  • loop for all shader stages
    • loop for all bindings in the shader
      • in sampler and image bindings, loop for all resources in the binding
  • loop for all resources in descriptor set
  • loop for all barriers to be applied in descriptor set

This was a lot of looping, and it was especially egregious in the final component of my refactored update_descriptors():

static bool
write_descriptors(struct zink_context *ctx, struct zink_descriptor_set *zds, unsigned num_wds, VkWriteDescriptorSet *wds,
                 unsigned num_resources, struct zink_descriptor_resource *resources, struct set *persistent,
                 bool is_compute, bool cache_hit)
{
   bool need_flush = false;
   struct zink_batch *batch = is_compute ? &ctx->compute_batch : zink_curr_batch(ctx);
   struct zink_screen *screen = zink_screen(ctx->base.screen);
   assert(zds->desc_set);
   unsigned check_flush_id = is_compute ? 0 : ZINK_COMPUTE_BATCH_ID;
   for (int i = 0; i < num_resources; ++i) {
      assert(num_resources <= zds->pool->num_resources);

      struct zink_resource *res = resources[i].res;
      if (res) {
         need_flush |= zink_batch_reference_resource_rw(batch, res, resources[i].write) == check_flush_id;
         if (res->persistent_maps)
            _mesa_set_add(persistent, res);
      }
      /* if we got a cache hit, we have to verify that the cached set is still valid;
       * we store the vk resource to the set here to avoid a more complex and costly mechanism of maintaining a
       * hash table on every resource with the associated descriptor sets that then needs to be iterated through
       * whenever a resource is destroyed
       */
      assert(!cache_hit || zds->resources[i] == res);
      if (!cache_hit)
         zink_resource_desc_set_add(res, zds, i);
   }
   if (!cache_hit && num_wds)
      vkUpdateDescriptorSets(screen->dev, num_wds, wds, 0, NULL);
   for (int i = 0; zds->pool->num_descriptors && i < util_dynarray_num_elements(&zds->barriers, struct zink_descriptor_barrier); ++i) {
      struct zink_descriptor_barrier *barrier = util_dynarray_element(&zds->barriers, struct zink_descriptor_barrier, i);
      zink_resource_barrier(ctx, NULL, barrier->res,
                            barrier->layout, barrier->access, barrier->stage);
   }

   return need_flush;
}

This function iterates over all the resources in a descriptor set, tagging them for batch usage and persistent mapping, adding references for the descriptor set to the resource as I previously delved into. Then it iterates over the barriers and applies them.

But why was I iterating over all the resources and then over all the barriers when every resource will always have a barrier for the descriptor set, even if it ends up getting filtered out based on previous usage?

It just doesn’t make sense.

So I refactored this a bit, and now there’s only one loop:

static bool
write_descriptors(struct zink_context *ctx, struct zink_descriptor_set *zds, unsigned num_wds, VkWriteDescriptorSet *wds,
                 struct set *persistent, bool is_compute, bool cache_hit)
{
   bool need_flush = false;
   struct zink_batch *batch = is_compute ? &ctx->compute_batch : zink_curr_batch(ctx);
   struct zink_screen *screen = zink_screen(ctx->base.screen);
   assert(zds->desc_set);
   unsigned check_flush_id = is_compute ? 0 : ZINK_COMPUTE_BATCH_ID;

   if (!cache_hit && num_wds)
      vkUpdateDescriptorSets(screen->dev, num_wds, wds, 0, NULL);
   for (int i = 0; zds->pool->num_descriptors && i < util_dynarray_num_elements(&zds->barriers, struct zink_descriptor_barrier); ++i) {
      struct zink_descriptor_barrier *barrier = util_dynarray_element(&zds->barriers, struct zink_descriptor_barrier, i);
      if (barrier->res->persistent_maps)
         _mesa_set_add(persistent, barrier->res);
      need_flush |= zink_batch_reference_resource_rw(batch, barrier->res, zink_resource_access_is_write(barrier->access)) == check_flush_id;
      zink_resource_barrier(ctx, NULL, barrier->res,
                            barrier->layout, barrier->access, barrier->stage);
   }

   return need_flush;
}

This actually has the side benefit of reducing the required looping even further, as barriers get merged based on access and stages, meaning that though there may be N resources used by a given set across M stages, it’s possible that the looping here might be reduced to only N rather than N * M since all barriers might be consolidated.

In Closing

Let’s check all the changes out in the flamegraph:

final.png

This last bit has shaved off another big chunk of CPU usage overall, bringing update_descriptors() from 11.4% to 9.32%. Descriptor state updating is down from 0.718% to 0.601% from the pre-hashing as well, though this wasn’t exactly a huge target to hit.

Just for nostalgia, here’s the starting point from just after I’d split the descriptor types into different sets and we all thought 27fps with descriptor set caching was a lot:

split.png

But now zink is up another 25% performance to a steady 34fps:

heaven.png

And I did it in only four blog posts.

For anyone interested, I’ve also put up a branch corresponding to the final flamegraph along with the perf data, which I’ve been using hotspot to view.

But Then, The Future

Looking forward, there’s still some easy work that can be done here.

For starters, I’d probably improve descriptor states a little such that I also had a flag anytime the batch cycled. This would enable me to add batch-tracking for resources/samplers/sampler_views more reliably when it was actually necessary vs. trying to add it every time, which ends up being a significant perf hit from all the lookups. I imagine that’d make a huge part of the remaining update_descriptors() usage disappear.

future-lookups.png

There’s also, as ever, the pipeline hashing, which can further be reduced by adding more dynamic state handling, which would remove more values from the pipeline state and thus reduce the amount of hashing required.

future-pipeline.png

I’d probably investigate doing some resource caching to keep a bucket of destroyed resources around for faster reuse since there’s a fair amount of that going on.

future-resource.png

Ultimately though, the CPU time in use by zink is unlikely to see any other huge decreases (unless I’m missing something especially obvious or clever) without more major architectural changes, which will end up being a bigger project that takes more than just a week of blog posting to finish. As such, I’ve once again turned my sights to unit test pass rates and related issues, since there’s still a lot of work to be done there.

I’ve fixed another 500ish piglit tests over the past few days, bringing zink up past 92% pass rate, and I’m hopeful I can get that number up even higher in the near future.

Stay tuned for more updates on all things zink and Mike.

October 09, 2020

Moar Descriptors

I talked about a lot of boring optimization stuff yesterday, exploring various ideas which, while they will eventually end up improving performance, didn’t yield immediate results.

Now it’s just about time to start getting to the payoff.

pool_reuse.png

Here’s a flamegraph of the starting point. Since yesterday’s progress of improving the descriptor cache a bit and adding context-based descriptor states to reduce hashing, I’ve now implemented an object to hold the VkDescriptorPool, enabling the pools themselves to be shared across programs for reuse, which deduplicates a considerable amount of memory. The first scene of the heaven benchmark creates a whopping 91 zink_gfx_program structs, each of which previously had their own UBO descriptor pool and sampler descriptor pool for 182 descriptor pools in total, each with 5000 descriptors in it. With this mechanism, that’s cut down to 9 descriptor pools in total which are shared across all the programs. Without even changing that maximum descriptor limit, I’m already up by another frame to 28fps, even if the flamegraph doesn’t look too different.

Moar Caches

I took a quick detour next over to the pipeline cache (that large-ish block directly to the right of update_descriptors in the flamegraph), which stores all the VkPipeline objects that get created during startup. Pipeline creation is extremely costly, so it’s crucial that it be avoided during runtime. Happily, the current caching infrastructure in zink is sufficient to meet that standard, and there are no pipelines created while the scene is playing out.

But I thought to myself: what about VkPipelineCache for startup time improvement while I’m here?

I had high hopes for this, and it was quick and easy to add in, but ultimately even with the cache implemented and working, I saw no benefit in any part of the benchmark.

That was fine, since what I was really shooting for was a result more like this:

pipeline_hash.png

The hierarchy of the previously-expensive pipeline hash usage has completely collapsed now, and it’s basically nonexistent. This was achieved through a series of five patches which:

  • moved the tessellation levels for TCS output out of the pipeline state since these have no relation and I don’t know why I put them there to begin with
  • used the bitmasks for vertex divisors and buffers to much more selectively hash the large (32) VkVertexInputBindingDivisorDescriptionEXT and VkVertexInputBindingDescription arrays in the pipeline state instead of always hashing the full array
  • also only hashed the vertex buffer state if we don’t have VK_EXT_extended_dynamic_state support, which lets that be removed from the pipeline creation altogether
  • for debug builds, which are the only builds I run, I changed the pipeline hash tables over to use the pre-hashed values directly since mesa has a pesky assert() that rehashes on every lookup, so this more accurately reflects release build performance

And I’m now up to a solid 29fps.

Over To update_sampler_descriptors()

I left off yesterday with a list of targets to hit in this function, from left to right in the most recent flamegraph:

Looking higher up the chain for the add_transition() usage, it turns out that a huge chunk of this was actually the hash table rehashing itself every time it resized when new members were added (mesa hash tables start out with a very small maximum number of entries and then increase by a power of 2 every time). Since I always know ahead of time the maximum number of entries I’ll have in a given descriptor set, I put up a MR to let me pre-size the table, preventing any of this nonsense from taking up CPU time. The results were good:

rehash.png

The entire add_transition hierarchy collapsed a bit, but there’s more to come. I immediately became distracted when I came to the realization that I’d actually misplaced a frame at some point and set to hunting it down.

Vulkan Experts

Anyone who said to themselves “you’re binding your descriptors before you’re emitting your pipeline barriers, thus starting and stopping your render passes repeatedly during each draw” as the answer to yesterday’s question about what I broke during refactoring was totally right, so bonus points to everyone out there who nailed it. This can actually be seen in the flamegraph as the tall stack above update_ubo_descriptors(), which is the block to the right of update_sampler_descriptors().

Oops.

frame_retrieval.png

Now the stack is a little to the right where it belongs, and I was now just barely touching 30fps, which is up about 10% from the start of today’s post.

Stay tuned next week, when I pull another 10% performance out of my magic hat and also fix RADV corruption.

October 08, 2020

Descriptors Once More

It’s just that kind of week.

When I left off in my last post, I’d just implemented a two-tiered cache system for managing descriptor sets which was objectively worse in performance than not doing any caching at all.

Cool.

Next, I did some analysis of actual descriptor usage, and it turned out that the UBO churn was massive, while sampler descriptors were only changed occasionally. This is due to a mechanism in mesa involving a NIR pass which rewrites uniform data passed to the OpenGL context as UBO loads in the shader, compacting the data into a single buffer and allowing it to be more efficiently passed to the GPU. There’s a utility component u_upload_mgr for gallium-based drivers which allocates a large (~100k) buffer and then maps/writes to it at offsets to avoid needing to create a new buffer for this every time the uniform data changes.

The downside of u_upload_mgr, for the current state of zink, is that it means the hashed descriptor states are different for almost every single draw because while the UBO doesn’t actually change, the offset does, and this is necessarily part of the descriptor hash since zink is exclusively using VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER descriptors.

I needed to increase the efficiency of the cache to make it worthwhile, so I decided first to reduce the impact of a changed descriptor state hash during pre-draw updating. This way, even if the UBO descriptor state hash was changing every time, maybe it wouldn’t be so impactful.

Deep Cuts

What this amounted to was to split the giant, all-encompassing descriptor set, which included UBOs, samplers, SSBOs, and shader images, into separate sets such that each type of descriptor would be isolated from changes in the other descriptors between draws.

Thus, I now had four distinct descriptor pools for each program, and I was producing and binding up to four descriptor sets for every draw. I also changed the shader compiler a bit to always bind shader resources to the newly-split sets as well as created some dummy descriptor sets since it’s illegal to bind sets to a command buffer with non-sequential indices, but it was mostly easy work. It seemed like a great plan, or at least one that had a lot of potential for more optimizations based on it. As far as direct performance increases from the split, UBO descriptors would be constantly changing, but maybe…

Well, the patch is pretty big (8 files changed, 766 insertions(+), 450 deletions(-)), but in the end, I was still stuck around 23 fps.

Dynamic UBOs

With this work done, I decided to switch things up a bit and explore using VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC descriptors for the mesa-produced UBOs, as this would reduce both the hashing required to calculate the UBO descriptor state (offset no longer needs to be factored in, which is one fewer uint32_t to hash) as well as cache misses due to changing offsets.
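
The mechanical difference is that the offset moves out of the descriptor write (and thus out of the hashed state) and is supplied at bind time instead. A minimal sketch using the core Vulkan call (the variable names here are placeholders):

/* sketch: with VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC, the changing
 * u_upload_mgr offset is passed at bind time, so the descriptor set itself
 * (and its hash) stays the same across draws */
uint32_t dynamic_offsets[] = { ubo_offset };
vkCmdBindDescriptorSets(cmdbuf, VK_PIPELINE_BIND_POINT_GRAPHICS,
                        pipeline_layout, 0 /* firstSet */,
                        1, &desc_set,
                        ARRAY_SIZE(dynamic_offsets), dynamic_offsets);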

Due to potential driver limitations, only these mesa-produced UBOs are dynamic now, otherwise zink might exceed the maxDescriptorSetUniformBuffersDynamic limit, but this was still more than enough.

Boom.

I was now at 27fps, the same as raw bucket allocation.

Hash Performance

I’m not going to talk about hash performance. Mesa uses xxhash internally, and I’m just using the tools that are available.

What I am going to talk about, however, is the amount of hashing and lookups that I was doing.

Let’s take a look at some flamegraphs for the scene that I was showing in my fps screenshots.

split.png

This is, among other things, a view of the juggernaut update_descriptors() function that I linked earlier in the week. At the time of splitting the descriptor sets, it’s over 50% of the driver’s pipe_context::draw_vbo hook, which is decidedly not great.

So I optimized harder.

The leftmost block just above update_descriptors is hashing that’s done to update the descriptor state for cache management. There wasn’t much point in recalculating the descriptor state hash on every draw since there are plenty of draws where the states remain unchanged. To try and improve this, I moved to a context-based descriptor state tracker, where each pipe_context hook to change active descriptors would invalidate the corresponding descriptor state, and then update_descriptors() could just scan through all the states to see which ones needed to be recalculated.

states.png

The largest block above update_descriptors() on the right side is the new calculator function for descriptor states. It’s actually a bit more intensive than the old method, but again, this is much more easily optimized than the previous hashing which was scattered throughout a giant 400 line function.

Next, while I was in the area, I added even faster access to reusing descriptor sets. This is as simple as an array of sets on the program struct that can have their hash values directly compared to the current descriptor state that’s needed, avoiding lookups through potentially huge hash tables.
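
A sketch of that fast path (the last_sets array and the hash member on the set are hypothetical names for illustration):

/* hypothetical sketch: compare stored hashes in a small per-program array
 * before doing any hash table lookup at all */
static struct zink_descriptor_set *
find_recent_set(struct zink_program *pg, uint32_t desc_state_hash)
{
   for (unsigned i = 0; i < ARRAY_SIZE(pg->last_sets); i++) {
      if (pg->last_sets[i] && pg->last_sets[i]->hash == desc_state_hash)
         return pg->last_sets[i];
   }
   return NULL; /* miss: fall back to the descriptor set cache */
}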

last_set.png

Not much to see here since this isn’t really where any of the performance bottleneck was occurring.

ETOOMUCHHASH

Let’s skip ahead a bit. I finally refactored update_descriptors() into smaller functions to update and bind descriptor sets in a loop prior to applying barriers and issuing the eventual draw command, shaking up the graph quite a bit:

split_update_descriptors.png

Clearly, updating the sampler descriptors (update_sampler_descriptors()) is taking a huge amount of time. The three large blocks above it are:

Each of these three had clear ways they could be optimized, and I’m going to speed through that and more in my next post.

But now, a challenge to all the Vulkan experts reading along. In this last section, I’ve briefly covered some refactoring work for descriptor updates.

What is the significant performance regression that I’ve introduced in the course of this refactoring?

It’s possible to determine the solution to my question without reading through any of the linked code, but you have the code available to you nonetheless.

Until next time.

October 07, 2020

I Skipped Bucket Day

I’m back, and I’m about to get even deeper into zink’s descriptor management. I figured everyone including me is well acquainted with bucket allocating, so I skipped that day and we can all just imagine what that post would’ve been like instead.

Let’s talk about caching and descriptor sets.

I talked about it before, I know, so here’s a brief reminder of where I left off:

Just a very normal cache mechanism. The problem, as I said previously, was that there was just way too much hashing going on, and so the performance ended up being worse than a dumb bucket allocator.

Not ideal.

But I kept turning the idea over in the back of my mind, and then I realized that part of the problem was in the upper-right block named move invalidated sets to invalidated set array. It ended up being the case that my resource tracking for descriptor sets was far too invasive; I had separate hash tables on every resource to track every set that a resource was attached to at all times, and I was basically spending all my time modifying those hash tables, not even the actual descriptor set caching.

So then I thought: well, what if I just don’t track it that closely?

Indeed, this simplifies things a bit at the conceptual level, since now I can avoid doing any sort of hashing related to resources, though this does end up making my second-level descriptor set cache less effective. But I’m getting ahead of myself in this post, so it’s time to jump into some code.

Resource Tracking 2.0

Instead of doing really precise tracking, it’s important to recall a few key points about how the descriptor sets are managed:

  • they’re bucket allocated
  • they’re allocated using the struct zink_program ralloc context
  • they’re only ever destroyed when the program is
  • zink is single-threaded

Thus, I brought some pointer hacks to bear:

void
zink_resource_desc_set_add(struct zink_resource *res, struct zink_descriptor_set *zds, unsigned idx)
{
   zds->resources[idx] = res;
   util_dynarray_append(&res->desc_set_refs, struct zink_resource**, &zds->resources[idx]);
}

This function associates a resource with a given descriptor set at the specified index (based on pipeline state). And then it pushes a reference to that pointer from the descriptor set’s C-array of resources into an array on the resource.

Later, during resource destruction, I can then walk the array of pointers like this:

util_dynarray_foreach(&res->desc_set_refs, struct zink_resource **, ref) {
   if (**ref == res)
      **ref = NULL;
}

If the reference I pushed earlier is still pointing to this resource, I can unset the pointer, and this will get picked up during future descriptor updates to flag the set as not-cached, requiring that it be updated. Since a resource won’t ever be destroyed while a set is in use, this is also safe for the associated descriptor set’s lifetime.

And since there’s no hashing or tree traversals involved, this is incredibly fast.

Second-level Caching

At this point, I’d created two categories for descriptor sets: active sets, which were the ones in use in a command buffer, and inactive sets, which were the ones that weren’t currently in use, with sets being pushed into the inactive category once they were no longer used by any command buffers. This ended up being a bit of a waste, however, as I had lots of inactive sets that were still valid but unreachable since I was using an array for storing these as well as the newly-bucket-allocated sets.

Thus, a second-level cache, AKA the B cache, which would store not-used sets that had at one point been valid. I’m still not doing any sort of checking of sets which may have been invalidated by resource destruction, so the B cache isn’t quite as useful as it could be. Also:

  • the check program cache for matching set has now been expanded to two lookups in case a matching set isn’t active but is still configured and valid in the B cache
  • the check program for unused set block in the above diagram will now cannibalize a valid inactive set from the B cache rather than allocate a new set

The last of these items is a bit annoying, but ultimately the B cache can end up having hundreds of members at various points, and iterating through it to try and find a set that’s been invalidated ends up being impractical just based on the random distribution of sets across the table. Also, I only set the resource-based invalidation up to null the resource pointer, so finding an invalid set would mean walking through the resource array of each set in the cache. Thus, a quick iteration through a few items to see if the set-finder gets lucky, otherwise it’s clobbering time.

And this brought me up to about 24fps, which was still down a bit from the mind-blowing 27-28fps I was getting with just the bucket allocator, but it turns out that caching starts to open up other avenues for sizable optimizations.

Which I’ll get to in future posts.

October 05, 2020

Healthier Blogging

I took some time off to focus on making the numbers go up, but if I only do that sort of junk food style blogging with images, and charts, and benchmarks then we might all stop learning things, and certainly I won’t be transferring any knowledge between the coding part of my brain and the speaking part, so really we’re all losers at that point.

In other words, let’s get back to going extra deep into some code and doing some long form patch review.

Descriptor Management

First: what are descriptors?

Descriptors are, in short, when you feed a buffer or an image (+sampler) into a shader. In OpenGL, this is all handled for the user behind the scenes with e.g., a simple glGenBuffers() -> glBindBuffer() -> glBufferData() for an attached buffer. For a gallium-based driver, this example case will trigger the pipe_context::set_constant_buffer or pipe_context::set_shader_buffers hook at draw time to inform the driver that a buffer has been attached, and then the driver can link it up with the GPU.

Things are a bit different in Vulkan. There’s an entire chapter of the spec devoted to explaining how descriptors work in great detail, but the important details for the zink case are:

  • each descriptor must have a binding value created for it which is unique within the given descriptor set
  • each descriptor must have a Vulkan descriptor type, as converted from its OpenGL type
  • each descriptor must also potentially expand to its full array size (image descriptor types only)

Additionally, while organizing and initializing all these descriptor sets, zink has to track all the resources used and guarantee their lifetimes exceed the lifetimes of the batch they’re being submitted with.

To handle this, zink has an amount of code. In the current state of the repo, it’s about 40 lines.

However…

meme-update_descriptors.png

In the state of my branch that I’m going to be working off of for the next few blog posts, the function for handling descriptor updates is 304 lines. This is the increased amount of code that’s required to handle (almost) all GL descriptor types in a way that’s reasonably reliable.

Is it a great design decision to have a function that large?

Probably not.

But I decided to write it all out first before I did any refactoring so that I could avoid having to incrementally refactor my first attempt at refactoring, which would waste lots of time.

Also, memes.

How Does This Work?

The idea behind the latter version of the implementation that I linked is as follows:

  • iterate over all the shader stages
  • iterate over the bindings for the shader
  • for each binding:
    • fill in the Vulkan buffer/image info struct
    • for all resources used by the binding:
      • add tracking for the underlying resource(s) for lifetime management
      • flag a pipeline barrier for the resource(s) based on usage*
    • fill in the VkWriteDescriptorSet struct for the binding

*Then merge and deduplicate all the accumulated pipeline barriers and apply only those which induce a layout or access change in the resource to avoid over-applying barriers.

As I mentioned in a previous post, zink then applies these descriptors to a newly-allocated, max-size descriptor set object from an allocator pool located on the batch object. Every time a draw command is triggered, a new VkDescriptorSet is allocated and updated using these steps.

First Level Refactoring Target

As I touched on briefly in a previous post, the first change to make here in improving descriptor handling is to move the descriptor pools to the program objects. This lets zink create smaller descriptor pools which are likely going to end up using less memory than these giant ones. Here’s the code used for creating descriptor pools prior to refactoring:

#define ZINK_BATCH_DESC_SIZE 1000
VkDescriptorPoolSize sizes[] = {
   {VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER,         ZINK_BATCH_DESC_SIZE},
   {VK_DESCRIPTOR_TYPE_UNIFORM_TEXEL_BUFFER,   ZINK_BATCH_DESC_SIZE},
   {VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER, ZINK_BATCH_DESC_SIZE},
   {VK_DESCRIPTOR_TYPE_STORAGE_TEXEL_BUFFER,   ZINK_BATCH_DESC_SIZE},
   {VK_DESCRIPTOR_TYPE_STORAGE_IMAGE,          ZINK_BATCH_DESC_SIZE},
   {VK_DESCRIPTOR_TYPE_STORAGE_BUFFER,         ZINK_BATCH_DESC_SIZE},
};
VkDescriptorPoolCreateInfo dpci = {};
dpci.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO;
dpci.pPoolSizes = sizes;
dpci.poolSizeCount = ARRAY_SIZE(sizes);
dpci.flags = 0;
dpci.maxSets = ZINK_BATCH_DESC_SIZE;
vkCreateDescriptorPool(screen->dev, &dpci, 0, &batch->descpool);

Here, all the used descriptor types allocate ZINK_BATCH_DESC_SIZE descriptors in the pool, and there are ZINK_BATCH_DESC_SIZE sets in the pool. There’s a bug here, which is that really the descriptor types should have ZINK_BATCH_DESC_SIZE * ZINK_BATCH_DESC_SIZE descriptors to avoid oom-ing the pool in the event that allocated sets actually use that many descriptors, but ultimately this is irrelevant, as we’re only ever allocating 1 set at a time due to zink flushing multiple times per draw anyway.

Ideally, however, it would be better to avoid this. The majority of draw cases use much, much smaller descriptor sets which only have 1-5 descriptors total, so allocating 6 * 1000 for a pool is roughly 6 * 1000 more than is actually needed for every set.

The other downside of this strategy is that by creating these giant, generic descriptor sets, it becomes impossible to know what’s actually in a given set without attaching considerable metadata to it, which makes reusing sets without modification (i.e., caching) a fair bit of extra work. Yes, I know I said bucket allocation was faster, but I also said I believe in letting the best idea win, and it doesn’t really seem like doing full updates every draw should be faster, does it? But I’ll get to that in another post.

Descriptor Migration

When creating a struct zink_program (which is the struct that contains all the shaders), zink creates a VkDescriptorSetLayout object for describing the descriptor set layouts that will be allocated from the pool. This means zink is allocating giant, generic descriptor set pools and then allocating (almost certainly) very, very small sets, which means the driver ends up with this giant memory balloon of unused descriptors allocated in the pool that can never be used.

A better idea for this would be to create descriptor pools which precisely match the layout for which they’ll be allocating descriptor sets, as this means there’s no memory ballooning, even if it does end up being more pool objects.

Here’s the current function for creating the layout object for a program:

static VkDescriptorSetLayout
create_desc_set_layout(VkDevice dev,
                       struct zink_shader *stages[ZINK_SHADER_COUNT],
                       unsigned *num_descriptors)
{
   VkDescriptorSetLayoutBinding bindings[PIPE_SHADER_TYPES * (PIPE_MAX_CONSTANT_BUFFERS + PIPE_MAX_SHADER_SAMPLER_VIEWS + PIPE_MAX_SHADER_BUFFERS + PIPE_MAX_SHADER_IMAGES)];
   int num_bindings = 0;

   for (int i = 0; i < ZINK_SHADER_COUNT; i++) {
      struct zink_shader *shader = stages[i];
      if (!shader)
         continue;

      VkShaderStageFlagBits stage_flags = zink_shader_stage(pipe_shader_type_from_mesa(shader->nir->info.stage));

This function is called for both the graphics and compute pipelines, and for the latter, only a single shader object is passed, meaning that i is purely for iterating the maximum number of shaders and not descriptive of the current shader being processed.

      for (int j = 0; j < shader->num_bindings; j++) {
         assert(num_bindings < ARRAY_SIZE(bindings));
         bindings[num_bindings].binding = shader->bindings[j].binding;
         bindings[num_bindings].descriptorType = shader->bindings[j].type;
         bindings[num_bindings].descriptorCount = shader->bindings[j].size;
         bindings[num_bindings].stageFlags = stage_flags;
         bindings[num_bindings].pImmutableSamplers = NULL;
         ++num_bindings;
      }
   }

This iterates over the bindings in a given shader, setting up the various values required by the layout creation struct using the values stored to the shader struct using the code in zink_compiler.c.

   *num_descriptors = num_bindings;
   if (!num_bindings) return VK_NULL_HANDLE;

If this program has no descriptors at all, then this whole thing can become a no-op, and descriptor updating can be skipped for draws which use this program.

   VkDescriptorSetLayoutCreateInfo dcslci = {};
   dcslci.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO;
   dcslci.pNext = NULL;
   dcslci.flags = 0;
   dcslci.bindingCount = num_bindings;
   dcslci.pBindings = bindings;

   VkDescriptorSetLayout dsl;
   if (vkCreateDescriptorSetLayout(dev, &dcslci, 0, &dsl) != VK_SUCCESS) {
      debug_printf("vkCreateDescriptorSetLayout failed\n");
      return VK_NULL_HANDLE;
   }

   return dsl;
}

Then there’s just the usual Vulkan semantics of storing values to the struct and passing it to the Create function.

But back in the context of moving descriptor pool creation to the program, this is actually the perfect place to jam the pool creation code in since all the information about descriptor types is already here. Here’s what that looks like:

VkDescriptorPoolSize sizes[6] = {};
int type_map[12];
unsigned num_types = 0;
memset(type_map, -1, sizeof(type_map));

for (int i = 0; i < ZINK_SHADER_COUNT; i++) {
   struct zink_shader *shader = stages[i];
   if (!shader)
      continue;

   VkShaderStageFlagBits stage_flags = zink_shader_stage(pipe_shader_type_from_mesa(shader->nir->info.stage));
   for (int j = 0; j < shader->num_bindings; j++) {
      assert(num_bindings < ARRAY_SIZE(bindings));
      bindings[num_bindings].binding = shader->bindings[j].binding;
      bindings[num_bindings].descriptorType = shader->bindings[j].type;
      bindings[num_bindings].descriptorCount = shader->bindings[j].size;
      bindings[num_bindings].stageFlags = stage_flags;
      bindings[num_bindings].pImmutableSamplers = NULL;
      if (type_map[shader->bindings[j].type] == -1) {
         type_map[shader->bindings[j].type] = num_types++;
         sizes[type_map[shader->bindings[j].type]].type = shader->bindings[j].type;
      }
      sizes[type_map[shader->bindings[j].type]].descriptorCount++;
      ++num_bindings;
   }
}

I’ve added the sizes, type_map, and num_types variables, which map used Vulkan descriptor types to a zero-based array and associated counter that can be used to fill in the pPoolSizes and poolSizeCount values in a VkDescriptorPoolCreateInfo struct.

After the layout creation, which remains unchanged, I’ve then added this block:

for (int i = 0; i < num_types; i++)
   sizes[i].descriptorCount *= ZINK_DEFAULT_MAX_DESCS;

VkDescriptorPoolCreateInfo dpci = {};
dpci.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO;
dpci.pPoolSizes = sizes;
dpci.poolSizeCount = num_types;
dpci.flags = 0;
dpci.maxSets = ZINK_DEFAULT_MAX_DESCS;
vkCreateDescriptorPool(dev, &dpci, 0, &descpool);

Which uses the descriptor types and sizes from above to create a pool that will pre-allocate the exact descriptor counts that are needed for this program.

Managing descriptor sets in this way does have other challenges, however. Like resources, it’s crucial that sets not be modified or destroyed while they’re submitted to a batch.

Descriptor Tracking

Previously, any time a draw was completed, the batch object would reset and clear its descriptor pool, wiping out all the allocated sets. If the pool is no longer on the batch, however, it’s not possible to perform this reset without adding tracking info for the batch to all the descriptor sets. Also, resetting a descriptor pool like this is wasteful, as it’s probable that a program will be used for multiple draws and thus require multiple descriptor sets. What I’ve done instead is add this function, which is called just after allocating the descriptor set:

bool
zink_batch_add_desc_set(struct zink_batch *batch, struct zink_program *pg, struct zink_descriptor_set *zds)
{
   struct hash_entry *entry = _mesa_hash_table_search(batch->programs, pg);
   assert(entry);
   struct set *desc_sets = (void*)entry->data;
   if (!_mesa_set_search(desc_sets, zds)) {
      pipe_reference(NULL, &zds->reference);
      _mesa_set_add(desc_sets, zds);
      return true;
   }
   return false;
}

Similar to all the other batch<->object tracking, this stores the given descriptor set into a set, but in this case the set is itself stored as the data in a hash table keyed with the program, which provides both objects for use during batch reset:

void
zink_reset_batch(struct zink_context *ctx, struct zink_batch *batch)
{
   struct zink_screen *screen = zink_screen(ctx->base.screen);
   batch->descs_used = 0;

   // cmdbuf hasn't been submitted before
   if (!batch->submitted)
      return;

   zink_fence_finish(screen, &ctx->base, batch->fence, PIPE_TIMEOUT_INFINITE);
   hash_table_foreach(batch->programs, entry) {
      struct zink_program *pg = (struct zink_program*)entry->key;
      struct set *desc_sets = (struct set*)entry->data;
      set_foreach(desc_sets, sentry) {
         struct zink_descriptor_set *zds = (void*)sentry->key;
         /* reset descriptor pools when no batch is using this program to avoid
          * having some inactive program hogging a billion descriptors
          */
         pipe_reference(&zds->reference, NULL);
         zink_program_invalidate_desc_set(pg, zds);
      }
      _mesa_set_destroy(desc_sets, NULL);

And this function is called:

void
zink_program_invalidate_desc_set(struct zink_program *pg, struct zink_descriptor_set *zds)
{
   uint32_t refcount = p_atomic_read(&zds->reference.count);
   /* refcount > 1 means this is currently in use, so we can't recycle it yet */
   if (refcount == 1)
      util_dynarray_append(&pg->alloc_desc_sets, struct zink_descriptor_set *, zds);
}

If a descriptor set has no active batch uses, its refcount will be 1, and then it can be added to the array of allocated descriptor sets for immediate reuse in the next draw. In this iteration of refactoring, descriptor sets can only have one batch use, so this condition is always true when this function is called, but future work will see that change.

Putting it all together is this function:

struct zink_descriptor_set *
zink_program_allocate_desc_set(struct zink_screen *screen,
                               struct zink_batch *batch,
                               struct zink_program *pg)
{
   struct zink_descriptor_set *zds;

   if (util_dynarray_num_elements(&pg->alloc_desc_sets, struct zink_descriptor_set *)) {
      /* grab one off the allocated array */
      zds = util_dynarray_pop(&pg->alloc_desc_sets, struct zink_descriptor_set *);
      goto out;
   }

   VkDescriptorSetAllocateInfo dsai;
   memset((void *)&dsai, 0, sizeof(dsai));
   dsai.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO;
   dsai.pNext = NULL;
   dsai.descriptorPool = pg->descpool;
   dsai.descriptorSetCount = 1;
   dsai.pSetLayouts = &pg->dsl;

   VkDescriptorSet desc_set;
   if (vkAllocateDescriptorSets(screen->dev, &dsai, &desc_set) != VK_SUCCESS) {
      debug_printf("ZINK: %p failed to allocate descriptor set :/\n", pg);
      return NULL;
   }
   zds = ralloc_size(NULL, sizeof(struct zink_descriptor_set));
   assert(zds);
   pipe_reference_init(&zds->reference, 1);
   zds->desc_set = desc_set;
out:
   if (zink_batch_add_desc_set(batch, pg, zds))
      batch->descs_used += pg->num_descriptors;

   return zds;
}

If a pre-allocated descriptor set exists, it’s popped off the array. Otherwise, a new one is allocated. After that, the set is referenced onto the batch.

Progress

Now all the sets are allocated on the program using a more specific allocation strategy, which paves the way for a number of improvements that I’ll be discussing in various lengths over the coming days:

  • bucket allocating
  • descriptor caching 2.0
  • split descriptor sets
  • context-based descriptor states
  • (pending implementation/testing/evaluation) context-based descriptor pool+layout caching
September 29, 2020

A Showcase

Today I’m taking a break from writing about my work to write about the work of zink’s newest contributor, He Haocheng (aka @hch12907). Among other things, Haocheng has recently tackled the issue of extension refactoring, which is a huge help for future driver development. I’ve written time and time again about adding extensions, and with this patchset in place, the process is simplified and expedited almost into nonexistence.

Before

As an example, let’s look at the most recent extension that I’ve added support for, VK_EXT_extended_dynamic_state. The original patch looked like this:

diff --git a/src/gallium/drivers/zink/zink_screen.c b/src/gallium/drivers/zink/zink_screen.c
index 3effa2b0fe4..83b89106931 100644
--- a/src/gallium/drivers/zink/zink_screen.c
+++ b/src/gallium/drivers/zink/zink_screen.c
@@ -925,6 +925,10 @@ load_device_extensions(struct zink_screen *screen)
       assert(have_device_time);
       free(domains);
    }
+   if (screen->have_EXT_extended_dynamic_state) {
+      GET_PROC_ADDR(CmdSetViewportWithCountEXT);
+      GET_PROC_ADDR(CmdSetScissorWithCountEXT);
+   }
 
 #undef GET_PROC_ADDR
 
@@ -938,7 +942,8 @@ zink_internal_create_screen(struct sw_winsys *winsys, int fd, const struct pipe_
    bool have_tf_ext = false, have_cond_render_ext = false, have_EXT_index_type_uint8 = false,
       have_EXT_robustness2_features = false, have_EXT_vertex_attribute_divisor = false,
       have_EXT_calibrated_timestamps = false, have_VK_KHR_vulkan_memory_model = false;
-   bool have_EXT_custom_border_color = false, have_EXT_blend_operation_advanced = false;
+   bool have_EXT_custom_border_color = false, have_EXT_blend_operation_advanced = false,
+        have_EXT_extended_dynamic_state = false;
    if (!screen)
       return NULL;
 
@@ -1001,6 +1006,9 @@ zink_internal_create_screen(struct sw_winsys *winsys, int fd, const struct pipe_
             if (!strcmp(extensions[i].extensionName,
                         VK_EXT_BLEND_OPERATION_ADVANCED_EXTENSION_NAME))
                have_EXT_blend_operation_advanced = true;
+            if (!strcmp(extensions[i].extensionName,
+                        VK_EXT_EXTENDED_DYNAMIC_STATE_EXTENSION_NAME))
+               have_EXT_extended_dynamic_state = true;
 
          }
          FREE(extensions);
@@ -1012,6 +1020,7 @@ zink_internal_create_screen(struct sw_winsys *winsys, int fd, const struct pipe_
    VkPhysicalDeviceIndexTypeUint8FeaturesEXT index_uint8_feats = {};
    VkPhysicalDeviceVulkanMemoryModelFeatures mem_feats = {};
    VkPhysicalDeviceBlendOperationAdvancedFeaturesEXT blend_feats = {};
+   VkPhysicalDeviceExtendedDynamicStateFeaturesEXT dynamic_state_feats = {};
 
    feats.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FEATURES_2;
    screen->feats11.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_VULKAN_1_1_FEATURES;
@@ -1060,6 +1069,11 @@ zink_internal_create_screen(struct sw_winsys *winsys, int fd, const struct pipe_
       blend_feats.pNext = feats.pNext;
       feats.pNext = &blend_feats;
    }
+   if (have_EXT_extended_dynamic_state) {
+      dynamic_state_feats.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_EXTENDED_DYNAMIC_STATE_FEATURES_EXT;
+      dynamic_state_feats.pNext = feats.pNext;
+      feats.pNext = &dynamic_state_feats;
+   }
    vkGetPhysicalDeviceFeatures2(screen->pdev, &feats);
    memcpy(&screen->feats, &feats.features, sizeof(screen->feats));
    if (have_tf_ext && tf_feats.transformFeedback)
@@ -1074,6 +1088,8 @@ zink_internal_create_screen(struct sw_winsys *winsys, int fd, const struct pipe_
    screen->have_EXT_calibrated_timestamps = have_EXT_calibrated_timestamps;
    if (have_EXT_custom_border_color && screen->border_color_feats.customBorderColors)
       screen->have_EXT_custom_border_color = true;
+   if (have_EXT_extended_dynamic_state && dynamic_state_feats.extendedDynamicState)
+      screen->have_EXT_extended_dynamic_state = true;
 
    VkPhysicalDeviceProperties2 props = {};
    VkPhysicalDeviceVertexAttributeDivisorPropertiesEXT vdiv_props = {};
@@ -1150,7 +1166,7 @@ zink_internal_create_screen(struct sw_winsys *winsys, int fd, const struct pipe_
     * this requires us to pass the whole VkPhysicalDeviceFeatures2 struct
     */
    dci.pNext = &feats;
-   const char *extensions[12] = {
+   const char *extensions[13] = {
       VK_KHR_MAINTENANCE1_EXTENSION_NAME,
    };
    num_extensions = 1;
@@ -1185,6 +1201,8 @@ zink_internal_create_screen(struct sw_winsys *winsys, int fd, const struct pipe_
       extensions[num_extensions++] = VK_EXT_CUSTOM_BORDER_COLOR_EXTENSION_NAME;
    if (have_EXT_blend_operation_advanced)
       extensions[num_extensions++] = VK_EXT_BLEND_OPERATION_ADVANCED_EXTENSION_NAME;
+   if (have_EXT_extended_dynamic_state)
+      extensions[num_extensions++] = VK_EXT_EXTENDED_DYNAMIC_STATE_EXTENSION_NAME;
    assert(num_extensions <= ARRAY_SIZE(extensions));
 
    dci.ppEnabledExtensionNames = extensions;
diff --git a/src/gallium/drivers/zink/zink_screen.h b/src/gallium/drivers/zink/zink_screen.h
index 4ee409c0efd..1d35e775262 100644
--- a/src/gallium/drivers/zink/zink_screen.h
+++ b/src/gallium/drivers/zink/zink_screen.h
@@ -75,6 +75,7 @@ struct zink_screen {
    bool have_EXT_calibrated_timestamps;
    bool have_EXT_custom_border_color;
    bool have_EXT_blend_operation_advanced;
+   bool have_EXT_extended_dynamic_state;
 
    bool have_X8_D24_UNORM_PACK32;
    bool have_D24_UNORM_S8_UINT;

It’s awful, right? There’s obviously lots of copy/pasted code here, and it’s a tremendous waste of time to have to do the copy/pasting, not to mention the time needed for reviewing such mind-numbing changes.

After

Here’s the same patch after He Haocheng’s work has been merged:

diff --git a/src/gallium/drivers/zink/zink_device_info.py b/src/gallium/drivers/zink/zink_device_info.py
index 0300e7f7574..69e475df2cf 100644
--- a/src/gallium/drivers/zink/zink_device_info.py
+++ b/src/gallium/drivers/zink/zink_device_info.py
@@ -62,6 +62,7 @@ def EXTENSIONS():
         Extension("VK_EXT_calibrated_timestamps"),
         Extension("VK_EXT_custom_border_color",      alias="border_color", properties=True, feature="customBorderColors"),
         Extension("VK_EXT_blend_operation_advanced", alias="blend", properties=True),
+        Extension("VK_EXT_extended_dynamic_state",   alias="dynamic_state", feature="extendedDynamicState"),
     ]
 
 # There exists some inconsistencies regarding the enum constants, fix them.
diff --git a/src/gallium/drivers/zink/zink_screen.c b/src/gallium/drivers/zink/zink_screen.c
index 864ec32fc22..3c1214d384b 100644
--- a/src/gallium/drivers/zink/zink_screen.c
+++ b/src/gallium/drivers/zink/zink_screen.c
@@ -926,6 +926,10 @@ load_device_extensions(struct zink_screen *screen)
       assert(have_device_time);
       free(domains);
    }
+   if (screen->info.have_EXT_extended_dynamic_state) {
+      GET_PROC_ADDR(CmdSetViewportWithCountEXT);
+      GET_PROC_ADDR(CmdSetScissorWithCountEXT);
+   }
 
 #undef GET_PROC_ADDR

feels_good.png

I’m certain this is going to lead to an increase in my own productivity in the future given how quick the process has now become.

A Short Note On Hardware

I’ve been putting up posts lately about my benchmarking figures, and in them I’ve been referencing Intel hardware. The reason I use Intel is because I’m (currently) a hobbyist developer with exactly one computer capable of doing graphics development, and it has an Intel onboard GPU. I’m quite happy with this given the high quality state of Intel’s drivers, as things become much more challenging when I have to debug both my own bugs as well as an underlying Vulkan driver’s bugs at the same time.

With that said, I present to you this recent out-of-context statement from Dave Airlie regarding zink performance on a different driver:

<airlied> zmike: hey on a fiji amd card, you get 45/46 native vs 35 fps with zink on one heaven scene here, however zink is corrupted

I don’t have any further information about anything there, but it’s the best I can do given the limitations in my available hardware.

September 28, 2020

But No, I’m Not

Just a quick post today to summarize a few exciting changes I’ve made today.

To start with, I’ve added some tracking to the internal batch objects for catching things like piglit’s spec@!opengl 1.1@streaming-texture-leak. Let’s check out the test code there for a moment:

/** @file streaming-texture-leak.c
 *
 * Tests that allocating and freeing textures over and over doesn't OOM
 * the system due to various refcounting issues drivers may have.
 *
 * Textures used are around 4MB, and we make 5k of them, so OOM-killer
 * should catch any failure.
 *
 * Bug #23530
 */
for (i = 0; i < 5000; i++) {
        glGenTextures(1, &texture);
        glBindTexture(GL_TEXTURE_2D, texture);

        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER,
                        GL_LINEAR);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER,
                        GL_LINEAR);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, TEX_SIZE, TEX_SIZE,
                     0, GL_RGBA,
                     GL_UNSIGNED_BYTE, tex_buffer);

        piglit_draw_rect_tex(0, 0, piglit_width, piglit_height,
                             0, 0, 1, 1);

        glDeleteTextures(1, &texture);
}

This test loops 5000 times, using a different sampler texture for each draw, and then destroys the texture. This is supposed to catch drivers which can’t properly manage their resource refcounts, but instead here zink is getting caught by trying to dump 5000 active resources into the same command buffer, which ooms the system.

The reason for the problem in this case is that, after my recent optimizations which avoid unnecessary flushing, zink only submits the command buffer when a frame is finished or one of the write-flagged resources associated with an active batch is read from. Thus, the whole test runs in one go, only submitting the queue at the very end when the test performs a read.

In this case, my fix is simple: check the system’s total memory on driver init, and then always flush a batch if it crosses some threshold of memory usage in its associated resources when beginning a new draw. I chose 1/8 of total memory per batch to be “safe”, since with four batches in flight that allows zink to use up to 50% of the total memory with its resources before it’ll begin to stall and force the draws to complete, hopefully avoiding any oom scenarios. This ends up being a flush every 250ish draws in the above test code, and everything works nicely without killing my system.
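
As a minimal sketch of that heuristic (ctx->total_mem, batch->resource_size, and the helper name are illustrative, not the actual zink fields), the check at the start of each draw looks conceptually like this:

static void
maybe_flush_for_memory(struct zink_context *ctx)
{
   struct zink_batch *batch = zink_curr_batch(ctx);
   /* total_mem is queried once at driver init; resource_size is a running
    * total of the memory backing the resources referenced by this batch
    */
   uint64_t threshold = ctx->total_mem / 8;
   if (batch->resource_size > threshold)
      /* submit the cmdbuf so its resources can be unreferenced */
      flush_batch(ctx);
}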

Performance 3.0

As a bonus, I noticed that zink was taking considerably longer than IRIS to complete this test once it was fixed, so I did a little profiling, and this was the result: epilogue.png

Up another 3 fps (~10%) from Friday, which isn’t bad for a few minutes spent removing some memset calls from descriptor updating and then throwing in some code for handling VK_DYNAMIC_STATE_VERTEX_INPUT_BINDING_STRIDE_EXT.

September 25, 2020

Today new beta versions for all KWinFT projects – that are KWinFT, Wrapland, Disman and KDisplay – were released. With that we are on target for the full release which is aligned with Plasma 5.20 on October 13.

Big changes will unquestionably come to Disman, a previously stifled library for display management, which now learns to stand on its own feet providing universal means for the configuration of displays with different windowing systems and Wayland compositors.

But also for the compositor KWinFT a very specific yet important feature got implemented and a multitude of stability fixes and code refactors were accomplished.

In the following we will do a deep dive into the reasons for and results of these recent efforts.

For a quick overview of the work on Disman you can also watch this lightning talk that I held at the virtual XDC 2020 conference last week.

Universal display management with Disman

It was initially not planned like this but Disman and KDisplay were two later additions to the KWinFT project.

The projects were forked from libkscreen and KScreen respectively and I saw this as an opportunity to completely rethink and in every sense overhaul these previously rather lackluster and at times completely neglected components of the KDE Plasma workspace. This past negligence is rather tragic, since complaints about miserable output management in KDE Plasma go back as far as anyone can remember. Improving this bad state of affairs was my main motivation when I started working on libkscreen and KScreen around two years ago.

In my opinion a well functioning – not necessarily fancy but for sure robust – display configuration system is a cornerstone of a well crafted desktop system. One reason for that is how prevalent multi-display setups are, and another is how immeasurably annoying it is when you can't configure the projector correctly that one time you have to give a presentation in front of a room full of people.

Disman now tries to solve this by providing a solution not only for KWinFT or the KDE Plasma desktop alone but for any system running X11 or any Wayland compositor.

Moving logic and ideas

Let us look into the details of this solution and why I haven't mentioned KDisplay yet. The reason for this omission is that KDisplay from now on will be a husk of its former self.

Ancient rituals

As a fork of KScreen made no more than one month ago, KDisplay was still the logical center of any display configuration, with an always active KDE daemon (KDED) module and a KConfig module (KCM) integrated into the KDE System Settings.

The KDED module was responsible for reacting to display hot-plug events, reading control files of the resulting display combination from the user directory, generating optimal configurations if none were found and writing new files to the hard disk after the configuration had been applied successfully to the windowing system.

In this work flow Disman was only relevant as a provider of backend plugins that were loaded at runtime. Disman was used either in-process or through an included D-Bus service that got automatically started whenever the first client tried to talk to it. According to the commit adding this out-of-process mode five years ago, the intention behind it was to improve performance and stability. But in the end, on a functional level, the service did not do much more than forward data between the windowing system and the Disman consumers.

Break with tradition

Interestingly the D-Bus service was only activatable with the X11 backend and was explicitly disabled on Wayland. When I noticed this I was first tempted to remove the D-Bus service in the eternal struggle to reduce code complexity. And after all, if the service is not used on Wayland we might not need it at all.

But some time later I realized that this D-Bus service must be appreciated in a different way than for its initial reasoning. From a different point of view this service could be the key to a much more ambitious grand solution.

The service allows us to serialize and synchronize access of arbitrarily many clients in a transparent way while moving all relevant logical systems to a shared central place and providing per client a high level of integration with those systems.

Concretely, this means that the Disman D-Bus service becomes an independent entity. Once invoked by a single call from a client, for example by the included command line utility with dismanctl -o, the service reads and writes all necessary control files on its own. It generates optimal display configurations if no files are found and can even disable a laptop display in case the lid was closed while an external output is connected.

In this model Disman consumers solely provide user interfaces that are informed about the generated or loaded current config and that can modify this config additionally if desirable. This way the consumer can concentrate on providing a user interface with great usability and design and leave to Disman all the logic of handling the modified configuration afterwards.

Making it easy to add other clients is only one advantage. On a higher level this new design has two more.

Auxiliary data

I noticed already last year that some basic assumptions in KScreen were questionable. Its internal data logic relied on a round trip through the windowing system.

This meant in practice that the user was supposed to change display properties via the KScreen KCM. These were then sent to the windowing system which tried to apply them to the hardware. Afterwards it informed the KScreen KDE daemon through its own specific protocols and a libkscreen backend about this new configuration. Only the daemon then would write the updated configuration to the disk.

Why it was done this way is clear: we can be sure we have written a valid configuration to the disk and by having only the daemon do the file write we have the file access logic in a single place and do not need to sync file writes of different processes.

But the fundamental problem of this design is that we sometimes need to share additional information about our display configuration that matters for sensible display management but is of no relevance to the windowing system, and because of that it can not be passed through it.

A simple example is when a display is auto-rotated. Smartphones and tablets but also many convertibles come with orientation sensors to auto-rotate the built-in display according to the current device orientation. When auto-rotation is switched on or off in the KCM it is not sent through the windowing system but the daemon or another service needs to know about such a change in order to adapt the display rotation correctly with later orientation changes.

A complex but interesting other example is the replication of displays, also often called mirroring. When I started work on KScreen two years ago the mechanism was painfully primitive: one could only duplicate all displays at once and it was done by moving all of them to the same position and then changing their display resolutions hoping to find some sufficiently alike to cover a similar area.

Obviously that had several issues; the worst in my opinion was that this doesn't work for displays with different aspect ratios, as I noticed quickly after I got myself a 16:10 display. Another grave issue was that displays might not run at their full resolution. In a mixed DPI setup the formerly HiDPI displays are downgraded to the best resolution common with the LoDPI displays.

The good news is that on X11 and also Wayland methods are available to replicate displays without these downsides.

On X11 we can apply arbitrary linear transformations to an output. This solves both issues.

On Wayland all available output management protocols at minimum allow setting a single floating point value to scale a display. This solves the mixed DPI problem since we can still run both displays at an arbitrary resolution and adapt the logical size of the replica through its scale. If the management protocol even provides a way to specify the logical size directly, like the KWinFT protocol does, we can also solve the problem of diverging display aspect ratios.

From a bird's eye view in this model there are one or multiple displays that act as replicas for a single source display. Only the transformation, scale or logical size of the replicas is changed; the source is the invariant. The important information to remember for each display is therefore solely whether there is a replication source that the display is a replica of. But neither on X11 nor in any Wayland compositor is this information conveyed via the windowing system.

With the new design we send all configuration data including such auxiliary data to the Disman D-Bus service. The service will save all this data to a configuration-specific file but send to the windowing system only a relevant subset of the data. After the windowing system reports that the configuration was applied the Disman service informs all connected clients about this change sending the data received from the windowing system augmented by the auxiliary data that had not been passed through the windowing system.

This way every display management client receives all relevant data about the current configuration including the auxiliary data.

The motivation to solve this problem was the original driving force behind the large redesign of Disman that is coming with this release.

But I realized soon that this redesign also has another advantage that long-term is likely even more important than the first one.

Ready for everything that is not KDE Plasma too

With the new design Disman becomes a truly universal solution for display management.

Previously a running KDE daemon process with KDisplay's daemon module inserted was required in order to load a display configuration on startup and to react to display hot-plug events. The problem with this is that the KDE daemon commonly only makes sense to run on a KDE Plasma desktop.

Thanks to the new design the Disman D-Bus service can now be run as a standalone background service managing all your displays permanently, even if you don't use KDE Plasma.

In a non-Plasma environment like a Wayland session with sway this can be achieved by simply calling once dismanctl -o in a startup script.

On the other side the graphical user interface that KDisplay provides can now be used to manage displays on any desktop that Disman runs on too. KDisplay does not require the KDE System Settings to be installed and can be run as a standalone app. Simply call kdisplay from command line or start it from the launcher of your desktop environment.

KDisplay still includes a now absolutely gutted KDE daemon module that will be run in a Plasma session. The module now basically only launches the Disman D-Bus service on session startup. So in a Plasma session, after installation of Disman and KDisplay, everything is set up automatically. In every other session, as mentioned, a simple dismanctl -o call at startup is enough to get the very same result.

Maybe the integration in other sessions than Plasma could be improved to even make setting up this single call at startup unnecessary. Should Disman for example install a systemd unit file executing this call by default? I would be interested in feedback in this regard in particular from distributions. What do they prefer?

KWinFT and Wrapland improved in selected areas

With today's beta release the greatest changes come to Disman and KDisplay. But that does not mean KWinFT and Wrapland have not received some important updates.

Outputs all the way

The ongoing work on Disman – and by that on displays, or outputs as they are called in the land of window managers – meant that stability and feature patches for outputs naturally came to Wrapland and KWinFT as well. A large refactor was the introduction of a master output class on the server side of Wrapland. The class acts as a central entry point for compositors and deals with the different output related protocol objects internally.

Having this class in place it was rather easy to add support for xdg-output version 2 and 3 afterwards. In order to do that it was also reasonable to re-evaluate how we provide output identifying metadata in KWinFT and Wrapland in general.

In regards to output identification, a happy coincidence was that Simon Ser of the wlroots project had already been asking himself the very same questions in the past.

I concluded that Simon's plan for wlroots was spot on and I decided to help them out a bit with patches for wlr-protocols and wlroots. In the same vein I updated Wrapland's output device protocol. That means Wrapland- and wlroots-based compositors now feature the very same way of identifying outputs, which made it easy to provide full support for both in Disman.

Presentation timings

This release comes with support for the presentation-time protocol.

It is one of only three Wayland protocol extensions that have been officially declared stable. Because of that, supporting it also felt important in a formal sense.

Primarily though it is essential to my ongoing work on Xwayland. I plan to make use of the presentation-time protocol in Xwayland's Present extension implementation.

With the support in KWinFT I can test future presentation-time work in Xwayland now with KWinFT and sway as wlroots also supports the protocol. Having two different compositors for alternative testing will be quite helpful.

Try out the beta

If you want to try out the new beta release of Disman together with your favorite desktop environment or the KWinFT beta as a drop-in replacement for KWin you have to compile from source at the moment. For that use the Plasma/5.20 branches in the respective repositories.

For Disman there are some limited instructions on how to compile it in the Readme file.

If you have questions or just want to chat about the projects feel free to join the official KWinFT Gitter channel.

If you want to wait for the full release check back on the release date, October 13. I plan to write another article to that date that will then list all distributions where you will be able to install the KWinFT projects comfortably by package manager.

That is also a call to distro packagers: if you plan to provide packages for the KWinFT projects on October 13 get in touch to get support and be featured in the article.

Performance 2.0

In my last post, I left off with an overall 25%ish improvement in framerate for my test case:

endpost1.png

At the end, this was an extra 3 fps over my previous test, but how did I get to this point?

The answer lies in even more unnecessary queue submission. Let’s take a look at zink’s pipe_context::set_framebuffer_state hook, which is called by gallium any time the framebuffer state changes:

static void
zink_set_framebuffer_state(struct pipe_context *pctx,
                           const struct pipe_framebuffer_state *state)
{
   struct zink_context *ctx = zink_context(pctx);
   struct zink_screen *screen = zink_screen(pctx->screen);

   util_copy_framebuffer_state(&ctx->fb_state, state);

   struct zink_framebuffer *fb = get_framebuffer(ctx);

   zink_framebuffer_reference(screen, &ctx->framebuffer, fb);
   if (ctx->gfx_pipeline_state.render_pass != fb->rp)
      ctx->gfx_pipeline_state.hash = 0;
   zink_render_pass_reference(screen, &ctx->gfx_pipeline_state.render_pass, fb->rp);

   uint8_t rast_samples = util_framebuffer_get_num_samples(state);
   /* in vulkan, gl_SampleMask needs to be explicitly ignored for sampleCount == 1 */
   if ((ctx->gfx_pipeline_state.rast_samples > 1) != (rast_samples > 1))
      ctx->dirty_shader_stages |= 1 << PIPE_SHADER_FRAGMENT;
   if (ctx->gfx_pipeline_state.rast_samples != rast_samples)
      ctx->gfx_pipeline_state.hash = 0;
   ctx->gfx_pipeline_state.rast_samples = rast_samples;
   if (ctx->gfx_pipeline_state.num_attachments != state->nr_cbufs)
      ctx->gfx_pipeline_state.hash = 0;
   ctx->gfx_pipeline_state.num_attachments = state->nr_cbufs;

   /* need to start a new renderpass */
   if (zink_curr_batch(ctx)->rp)
      flush_batch(ctx);
   struct zink_batch *batch = zink_batch_no_rp(ctx);
   zink_framebuffer_reference(screen, &batch->fb, fb);

   framebuffer_state_buffer_barriers_setup(ctx, &ctx->fb_state, zink_curr_batch(ctx));
}

Briefly: zink copies the framebuffer state, and there are a number of conditions under which a new pipeline object is needed, all of which result in ctx->gfx_pipeline_state.hash = 0;. Other than this, there’s a sample count check so that the shader can be modified if necessary, and then there’s the setup for creating the Vulkan framebuffer object as well as the renderpass object in get_framebuffer().

Eagle-eyed readers will immediately spot the problem here, which is, aside from the fact that there’s not actually any reason to be setting up the framebuffer or renderpass here, how zink is also flushing the current batch if a renderpass is active.

The change I made here was to remove everything related to Vulkan from here, and move it to zink_begin_render_pass(), which is the function that the driver uses to begin a renderpass for a given batch.

This is clearly a much larger change than just removing the flush_batch() call, which might be what’s expected now that ending a renderpass no longer forces queue submission. Indeed, why haven’t I just ended the current renderpass and kept using the same batch?

The reason for this is that zink is designed in such a way that a given batch, at the time of calling vkCmdBeginRenderPass, is expected to either have no struct zink_render_pass associated with it (the batch has not performed a draw yet) or have the same object which matches the pipeline state (the batch is continuing to draw using the same renderpass). Adjusting this to be compatible with removing the flush here ends up being more code than just moving the object setup to a different spot.

So now the framebuffer and renderpass are created or pulled from their caches just prior to the vkCmdBeginRenderPass call, and a flush is removed, gaining some noticeable fps.

Going Further

Now that I’d unblocked that bottleneck, I went back to the list and checked the remaining problem areas:

  • descriptor set allocation is going to be a massive performance hit for any application which does lots of draws per frame since each draw command allocates its own (huge) descriptor set
  • the 1000 descriptor set limit is going to be hit constantly for any application which does lots of draws per frame

I decided to change things up a bit here.

This is the current way of things.

My plan was something more like this:

Where get descriptorset from program would look something like:

In this way, I’d get to conserve some sets and reuse them across draw calls even between different command buffers since I could track whether they were in use and avoid modifying them in any way. I’d also get to remove any tracking for descriptorset usage on batches, thereby removing possible queue submissions there. Any time resources in a set were destroyed, I could keep references to the sets on the resources and then invalidate the sets, returning them to the unused pool.
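
As a rough code sketch of that “get descriptorset from program” flow (all helper names here are hypothetical, not the actual zink functions):

struct zink_descriptor_set *
get_descriptor_set(struct zink_program *pg, uint32_t desc_state_hash)
{
   /* 1. a matching set that isn't currently being modified can be reused as-is */
   struct zink_descriptor_set *zds = find_cached_set(pg, desc_state_hash);
   if (zds)
      return zds;
   /* 2. otherwise recycle an invalidated set, or allocate a brand new one */
   zds = pop_invalidated_set(pg);
   if (!zds)
      zds = allocate_new_set(pg);
   /* 3. (re)write the descriptors and cache the set under its new hash */
   write_descriptors(pg, zds, desc_state_hash);
   cache_set(pg, zds, desc_state_hash);
   return zds;
}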

The results were great: desccache1.png

21 fps now, which is up another 3 from before.

Cache Improvements

Next I started investigating my cache implementation. There was a lot of hashing going on, as I was storing both the in-use sets as well as the unused sets (the valid and invalidated) based on the hash calculated for their descriptor usage, so I decided to try moving just the invalidated sets into an array as they no longer had a valid hash anyway, thereby giving quicker access to sets I knew to be free.

This would also help with my next plan, but again, the results were promising:

desccache2.png

Now I was at 23 fps, which is another 10% from the last changes, and just from removing some of the hashing going on.

This is like shooting fish in a barrel now.

Naturally at this point I began to, as all developers do, consider bucketing my allocations, since I’d seen in my profiling that some of these programs were allocating thousands of sets to submit simultaneously across hundreds of draws. I ended up using a scaling factor here so that programs would initially begin allocating in tens of sets, scaling up by a factor of ten every time it reached that threshold (i.e., once 100 sets were allocated, it begins allocating 100 at a time).
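
Here’s a sketch of that scaling factor (illustrative, not the exact code I landed):

static unsigned
bucket_size(unsigned num_sets_allocated)
{
   unsigned size = 10;
   /* start by allocating 10 sets at a time; once a program has allocated
    * 100 sets, allocate 100 at a time, then 1000, and so on
    */
   while (num_sets_allocated >= size * 10)
      size *= 10;
   return size;
}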

This didn’t have any discernible effect on the fps, but there were certainly fewer allocation calls going on, so I imagine the results will show up somewhere else.

And Then I Thought About It

Because sure, my efficient, possibly-overengineered descriptorset caching mechanism for reusing sets across draws and batches was cool, and it worked great, but was the overhead of all the hashing involved actually worse for performance than just using a dumb bucket allocator to set up and submit the same sets multiple times even in the same command buffer?

I’m not one of those types who refuses to acknowledge that other ideas can be better than the ones I personally prefer, so I smashed all of my descriptor info hashing out of the codebase and just used an array for storing unused sets. So now the mechanism looked like this:

But would this be better than the more specific caching I was already using? Well… desccache3.png

24 fps, so the short of it is yes. It was 1-2 fps faster across the board.

Conclusion

This is where I’m at now after spending some time also rewriting all the clear code again and fixing related regressions. The benchmark is up ~70% from where I started, and the gains just keep coming. I’ll post again about performance improvements in the future, but here’s a comparison to a native GL driver, namely IRIS:

reference.png

Zink is at a little under 50% of the performance here, up from around 25% when I started, though this gap still varies throughout other sections of the benchmark, dipping as low as 30% of the IRIS performance in some parts.

It’s progress.

September 24, 2020

XDC 2020 sponsors

Last week, X.Org Developers Conference 2020 was held online for the first time. This year, with all the COVID-19 situation that is affecting almost every country worldwide, the X.Org Foundation Board of Directors decided to make it virtual.

I love open-source conferences :-) They are great for networking, having fun with the rest of the community, having really good technical discussions in the hallway track… and visiting a new place every year! Unfortunately, we couldn’t do any of that this time and we needed to look for an alternative… with going virtual being the obvious one.

The organization team at Intel, led by Radoslaw Szwichtenberg and Martin Peres, analyzed the different open-source alternatives to organize XDC 2020 in a virtual manner. Finally, due to the setup requirements and the possibility of having more than 200 attendees connected to the video stream at the same time (Big Blue Button doesn’t recommend more than 100 simultaneous users), they selected Jitsi for speakers + Youtube for streaming/recording + IRC for questions. Arkadiusz Hiler summarized very well what they did from the A/V technical point of view and what the experience of hosting a virtual XDC was like.

I’m very happy with the final result given the special situation this year: the streaming was flawless, we had almost no technical issues (except one audio issue in the opening session… like in physical ones! :-D), and IRC turned out to be very active during the conference. Thanks a lot to the organizers for their great job!

However, there is always room for improvements. Therefore, the X.Org Foundation board is asking for feedback, please share with us your opinion on XDC 2020!

Just focusing on my own experience, it was very good. I really enjoyed the talks presented this year and the interesting discussions happening in IRC. I would like to highlight the four talks presented by my colleagues at Igalia :-D

This year I was also a speaker! I presented “Improving Khronos CTS tests with Mesa code coverage” talk (watch recording), where I explained how we can improve the VK-GL-CTS quality by leveraging Mesa code coverage using open-source tools.

My XDC 2020 talk

I’m looking forward to attending X.Org Developers Conference 2021! But first, we need an organizer! Requests For Proposals for hosting XDC2021 are now open!

See you next year!

Finally, Performance

For a long time now, I’ve been writing about various work I’ve done in the course of getting to GL 4.6. This has generally been feature implementation work with an occasional side of bug hunting and fixing and so I haven’t been too concerned about performance.

I’m not done with feature work. There’s still tons of things missing that I’m planning to work on.

I’m not done with bug hunting or fixing. There’s still tons of bugs (like how currently spec@!opengl 1.1@streaming-texture-leak ooms my system and crashes all the other tests trying to run in parallel) that I’m going to fix.

But I wanted a break, and I wanted to learn some new parts of the graphics pipeline instead of just slapping more extensions in.

For the moment, I’ve been focusing on the Unigine Heaven benchmark since there’s tons of room for improvement, though I’m planning to move on from that once I get bored and/or hit a wall. Here’s my starting point, which is taken from the patch in my branch with the summary zink: add batch flag to determine if batch is currently in renderpass, some 300ish patches ahead of the main branch’s tip:

start.png

This is 14 fps running as ./heaven_x64 -project_name Heaven -data_path ../ -engine_config ../data/heaven_4.0.cfg -system_script heaven/unigine.cpp -sound_app openal -video_app opengl -video_multisample 0 -video_fullscreen 0 -video_mode 3 -extern_define ,RELEASE,LANGUAGE_EN,QUALITY_LOW,TESSELLATION_DISABLED -extern_plugin ,GPUMonitor, and I’m going to be posting screenshots from roughly the same point in the demo as I progress to gauge progress.

Is this an amazing way to do a benchmark?

No.

Is it a quick way to determine if I’m making things better or worse right now?

Given the size of the gains I’m making, absolutely.

Let’s begin.

Architecture

Now that I’ve lured everyone in with promises of gains and a screenshot with an fps counter, what I really want to talk about is code.

In order to figure out the most significant performance improvements for zink, it’s important to understand the architecture. At the point when I started, zink’s batches (C name for an object containing a command buffer and a fence as well as references to all the objects submitted to the queue for lifetime validation) worked like this for draw commands:

  • there are 4 batches (and 1 compute batch, but that’s out of scope)
  • each batch has 1 command buffer
  • each batch has 1 descriptor pool
  • each batch can allocate at most 1000 descriptor sets
  • each descriptor set is a fixed size of 1000 of each type of supported descriptor (UBO, SSBO, samplers, uniform and storage texel buffers, images)
  • each batch, upon reaching 1000 descriptor sets, will automatically submit its command buffer and cycle to the next batch
  • each batch has 1 fence
  • each batch, before being reused, waits on its fence and then resets its state
  • each batch, during its state reset, destroys all its allocated descriptor sets
  • renderpasses are started and ended on a given batch as needed
  • renderpasses, when ended, trigger queue submission for the given command buffer
  • renderpasses allocate their descriptor set on-demand just prior to the actual vkCmdBeginRenderPass call
  • renderpasses, during corresponding descriptor set updates, trigger memory barriers on all descriptor resources for their intended usage
  • pipelines are cached at runtime based on all the metadata used to create them, but we don’t currently do any on-disk pipeline caching
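
Condensed into a rough sketch (using the fields referenced by the code later in this post; this isn’t the exact struct), a batch looked something like this:

struct zink_batch {
   VkCommandBuffer cmdbuf;       /* the single command buffer */
   VkDescriptorPool descpool;    /* the single descriptor pool */
   unsigned descs_left;          /* counts down from 1000; hitting 0 forces a submit */
   struct zink_fence *fence;     /* waited on before the batch can be reused */
   struct zink_render_pass *rp;  /* the renderpass active on this batch, if any */
   bool in_rp;                   /* true while a renderpass is being recorded */
   /* ...plus references to every object used, for lifetime validation */
};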

This is a lot to take in, so I’ll cut to some conclusions that I drew from these points:

  • sequential draws cannot occur on a given batch unless none of the resources used by its descriptor sets require memory barriers
    • because barriers cannot be submitted during a renderpass
    • this is very unlikely
    • therefore, each batch contains exactly 1 draw command
  • each sequential draw after the fourth one causes explicit fence waiting
    • batches must reset their state before being reused, and prior to this they wait on their fence
    • there are 4 batches
    • this means that any time a frame contains more than 4 draws, zink is pausing to wait for batches to finish so it can get a valid command buffer to use
    • the Heaven benchmark contains hundreds of draw commands per frame
    • yikes
  • definitely there could be better barrier usage
    • a barrier should only be necessary for zink’s uses when transitioning a resource from write -> read or write -> write, or when changing image layouts; for other cases we can just track the usage in order to trigger a barrier later when one of these conditions is met (see the sketch after this list)
  • the pipe_context::flush hook in general is very, very bad and needs to be avoided
    • we use this basically every other line like we’re building a jenga tower
    • sure would be a disaster if someone were to try removing these calls
  • probably we could do some disk caching for pipeline objects so that successive runs of applications could use pre-baked objects and avoid needing to create any
  • descriptor set allocation is going to be a massive performance hit for any application which does lots of draws per frame since each draw command allocates its own (huge) descriptor set
  • the 1000 descriptor set limit is going to be hit constantly for any application which does lots of draws per frame
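
Here’s the barrier rule from that list as a minimal sketch (a hypothetical helper, not zink’s actual tracking code):

static bool
resource_needs_barrier(bool last_access_was_write,
                       VkImageLayout old_layout, VkImageLayout new_layout)
{
   /* write -> read and write -> write both require a barrier */
   if (last_access_was_write)
      return true;
   /* as does any image layout transition; read -> read never does */
   return old_layout != new_layout;
}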

There’s a lot more I could go into here, but this is already a lot.

Removing Renderpass Submission

I decided to start here since it was easy:

diff --git a/src/gallium/drivers/zink/zink_context.c b/src/gallium/drivers/zink/zink_context.c
index a9418430bb7..f07ae658115 100644
--- a/src/gallium/drivers/zink/zink_context.c
+++ b/src/gallium/drivers/zink/zink_context.c
@@ -800,12 +800,8 @@ struct zink_batch *
 zink_batch_no_rp(struct zink_context *ctx)
 {
    struct zink_batch *batch = zink_curr_batch(ctx);
-   if (batch->in_rp) {
-      /* flush batch and get a new one */
-      flush_batch(ctx);
-      batch = zink_curr_batch(ctx);
-      assert(!batch->in_rp);
-   }
+   zink_end_render_pass(ctx, batch);
+   assert(!batch->in_rp);
    return batch;
 }

Amazing, I know. Let’s see how much the fps changes:

norp.png

15?! Wait a minute. That’s basically within the margin of error!

It is actually a consistent 1-2 fps gain, even a little more in some other parts, but it seemed like it should’ve been more now that all the command buffers are being gloriously saturated, right?

Well, not exactly. Here’s a fun bit of code from the descriptor updating function:

struct zink_batch *batch = zink_batch_rp(ctx);
unsigned num_descriptors = ctx->curr_program->num_descriptors;
VkDescriptorSetLayout dsl = ctx->curr_program->dsl;

if (batch->descs_left < num_descriptors) {
   ctx->base.flush(&ctx->base, NULL, 0);
   batch = zink_batch_rp(ctx);
   assert(batch->descs_left >= num_descriptors);
}

Right. The flushing continues. And while I’m here, what does zink’s pipe_context::flush hook even look like again?

static void
zink_flush(struct pipe_context *pctx,
           struct pipe_fence_handle **pfence,
           enum pipe_flush_flags flags)
{
   struct zink_context *ctx = zink_context(pctx);

   struct zink_batch *batch = zink_curr_batch(ctx);
   flush_batch(ctx);
...
   /* HACK:
    * For some strange reason, we need to finish before presenting, or else
    * we start rendering on top of the back-buffer for the next frame. This
    * seems like a bug in the DRI-driver to me, because we really should
    * be properly protected by fences here, and the back-buffer should
    * either be swapped with the front-buffer, or blitted from. But for
    * some strange reason, neither of these things happen.
    */
   if (flags & PIPE_FLUSH_END_OF_FRAME)
      pctx->screen->fence_finish(pctx->screen, pctx,
                                 (struct pipe_fence_handle *)batch->fence,
                                 PIPE_TIMEOUT_INFINITE);
}

Oh. So really every time zink “finishes” a frame in this benchmark (which has already stalled hundreds of times up to this point), it then waits on that frame to finish instead of letting things outside the driver worry about that.

Borrowing Code

It was at this moment that a dim spark flickered to life in my memories, reminding me of the in-progress MR from Antonio Caggiano for caching surfaces on batches. In particular, it reminded me that his series has a patch which removes the above monstrosity.

Let’s see what happens when I add those patches in:

nofence.png

15.

Again.

I expected a huge performance win here, but it seems that we still can’t fully utilize all these changes and are still stuck at 15 fps. Every time descriptors are updated, the batch ends up hitting that arbitrary 1000 descriptor set limit, and then it submits the command buffer, so there’s still multiple batches being used for each frame.

Getting Mad

So naturally at this point I tried increasing the limit.

Then I increased it again.

And again.

And now I had exactly one flush per frame, but my fps was still fixed at a measly 15.

That’s when I decided to do some desk curls.

What happened next was shocking:

endpost1.png

18 fps.

It was a sudden 20% fps gain, but it was only the beginning.

More on this tomorrow.

September 23, 2020

Adventures In Blending

For the past few days, I’ve been trying to fix a troublesome bug. Specifically, the Unigine Heaven benchmark wasn’t drawing most textures in color, and this was hampering my ability to make further claims about zink being the fastest graphics driver in the history of software since it’s not very impressive to be posting side-by-side screenshots that look like garbage even if the FPS counter in the corner is higher.

Thus I embarked on adventure.

First: The Problem

heaven-pre.png

This was the starting point. The thing ran just fine, but without valid frames drawn, it’s hard to call it a test of very much other than how many bugs I can pack into a single frame.

Naturally I assumed that this was going to be some bug in zink’s handling of something, whether it was blending, or sampling, or blending and sampling. I set out to figure out exactly how I’d screwed up.

Second: The Analysis

I had no idea what the problem was. I phoned Dr. Render, as we all do when facing issues like this, and I was told that I had problems.

heaven-renderdoc.png

Lots of problems.

The biggest problem was figuring out how to get anywhere with so many draw calls. Each frame consisted of 3 render passes (with hundreds of draws each) as well as a bunch of extra draws and clears.

There was a lot going on. This was by far the biggest thing I’d had to fix, and it’s much more difficult to debug a game-like application than it is a unit test. With that said, and since there’s not actually any documentation about “What do I do if some of my frame isn’t drawing with color?” for people working on drivers, here were some of the things I looked into:

  • Disabling Depth Testing

Just to check. On IRIS, which is my reference for these types of things, the change gave some neat results:

heaven-nodepth.png

How bout that.

On zink, I got the same thing, except there was no color, and it wasn’t very interesting.

  • Checking sampler resources for depth buffers

On an #executive suggestion, I looked into whether a z/s buffer had snuck into my sampler buffers and was thus providing bogus pixel data.

It hadn’t.

  • Checking Fragment Shader Outputs

This was a runtime version of my usual shader debugging, wherein I try to isolate the pixels in a region to a specific color based on a conditional, which then lets me determine which path in the shader is broken. To do this, I added a helper function in ntv:

static SpvId
clobber(struct ntv_context *ctx)
{
   SpvId type = get_fvec_type(ctx, 32, 4);
   SpvId vals[] = {
      emit_float_const(ctx, 32, 1.0),
      emit_float_const(ctx, 32, 0.0),
      emit_float_const(ctx, 32, 0.0),
      emit_float_const(ctx, 32, 1.0)
    };
   printf("CLOBBERING\n");
   return spirv_builder_emit_composite_construct(&ctx->builder, type, vals, 4);
}

This returns a vec4 of the color RED, and I cleverly stuck it at the end of emit_store_deref() like so:

if (ctx->stage == MESA_SHADER_FRAGMENT && var->data.location == FRAG_RESULT_DATA0 && match)
   result = clobber(ctx);

match in this case is set based on this small block at the very start of ntv:

if (s->info.stage == MESA_SHADER_FRAGMENT) {
   const char *env = getenv("TEST_SHADER");
   match = env && s->info.name && !strcmp(s->info.name, env);
}

Thus, I could set my environment in gdb with e.g., set env TEST_SHADER=GLSL271 and then zink would swap the output of the fragment shader named GLSL271 to RED, which let me determine what various shaders were being used for. When I found the shader used for the lamps, things got LIT:

heaven-lamps.png

But ultimately, even though I did find the shaders that were being used for the more general material draws, this ended up being another dead end.

  • Verifying Blend States

This took me the longest since I had to figure out a way to match up the Dr. Render states to the runtime states that I could see. I eventually settled on adding breakpoints based on index buffer size, as the chart provided by Dr. Render had this in the vertex state, which made things simple.

But alas, zink was doing all the right blending too.

  • Complaining

As usual, this was my last resort, but it was also my most powerful weapon that I couldn’t abuse too frequently, lest people come to the conclusion that I don’t actually know what I’m doing.

Which I definitely do.

And now that I’ve cleared up any misunderstandings there, I’m not ashamed to say that I went to #intel-3d to complain that Dr. Render wasn’t giving me any useful draw output for most of the draws under IRIS. If even zink can get some pixels out of a draw, then a more compliant driver like IRIS shouldn’t be having issues here.

I wasn’t wrong.

The Magic Of Dual Blending

It turns out that the Heaven benchmark is buggy and expects D3D semantics for dual blending. Mesa knows this and lets drivers enable workarounds if they have the need, specifically dual_color_blend_by_location=true, which informs the driver that it needs to adjust the Location and Index of gl_FragData[1] from D3D semantics to OpenGL/Vulkan ones.

As usual, the folks at Intel with their encyclopedic knowledge were quick to point out the exact problem, which then just left me with the relatively simple tasks of:

  • hooking zink up to the driconf build
  • checking driconf at startup so zink can get info on these application workarounds/overrides
  • adding shader keys for forcing the dual blending workaround
  • writing a NIR pass to do the actual work

The result is not that interesting, but here it is anyway:

static bool
lower_dual_blend(nir_shader *shader)
{
   bool progress = false;
   nir_variable *var = nir_find_variable_with_location(shader, nir_var_shader_out, FRAG_RESULT_DATA1);
   if (var) {
      var->data.location = FRAG_RESULT_DATA0;
      var->data.index = 1;
      progress = true;
   }
   nir_shader_preserve_all_metadata(shader);
   return progress;
}

In short, D3D expects to blend two outputs based on their locations, but in Vulkan and OpenGL, the blending is based on index. So here, I’ve just changed the location of gl_FragData[1] to match gl_FragData[0] and then incremented the index, because Fragment outputs identified with an Index of zero are directed to the first input of the blending unit associated with the corresponding Location. Outputs identified with an Index of one are directed to the second input of the corresponding blending unit.

And now nice things can be had:

heaven-post.png

Tune in tomorrow when I strap zink to a rocket and begin counting down to blastoff.

September 21, 2020

Viewporting

In Vulkan, a pipeline object is bound to the graphics pipeline for a given command buffer when a draw is about to take place. This pipeline object contains information about the draw state, and any time that state changes, a different pipeline object must be created/bound.

This is expensive.

Some time ago, Antonio Caggiano did some work to cache pipeline objects, which lets zink reuse them once they’re created. This was great, because creating Vulkan objects is very costly, and we want to always be reusing objects whenever possible.
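
The gist of that caching is a hash table keyed on the pipeline state; here’s a rough sketch under assumed names (prog->pipelines, create_new_pipeline()), not Antonio’s actual patch:

struct pipeline_entry {
   struct zink_gfx_pipeline_state state; /* key (copied) */
   VkPipeline pipeline;                  /* value */
};

static VkPipeline
get_gfx_pipeline(struct zink_gfx_program *prog, struct zink_gfx_pipeline_state *state)
{
   /* the hash is lazily computed and zeroed whenever pipeline-affecting state changes */
   if (!state->hash)
      state->hash = _mesa_hash_data(state, sizeof(*state));
   struct hash_entry *he =
      _mesa_hash_table_search_pre_hashed(prog->pipelines, state->hash, state);
   if (he)
      return ((struct pipeline_entry *)he->data)->pipeline;

   struct pipeline_entry *pe = ralloc(prog, struct pipeline_entry);
   pe->state = *state;
   pe->pipeline = create_new_pipeline(prog, state);
   _mesa_hash_table_insert_pre_hashed(prog->pipelines, state->hash, &pe->state, pe);
   return pe->pipeline;
}

This is also why the zink_set_framebuffer_state code shown earlier zeroes ctx->gfx_pipeline_state.hash whenever something pipeline-affecting changes: a zeroed hash forces a fresh lookup (and possibly a new pipeline) on the next draw.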

Unfortunately, the core Vulkan spec has the number of viewports and scissor regions both being part of the pipeline state, which means that any time the number of regions changes (viewport and scissor region counts are the same for our purposes), we need a new pipeline.

Extensions to the Rescue

VK_EXT_extended_dynamic_state adds functionality to avoid this performance issue. When supported, the pipeline object is created with zero as the count of viewport and scissor regions, and then vkCmdSetViewportWithCountEXT and vkCmdSetScissorWithCountEXT can be called just before draw to ram these state updates into the command buffer without ever needing a different pipeline object.
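
In rough terms, the draw-time sequence then becomes something like this (a sketch assuming the extension is enabled; the entrypoints are the ones loaded via GET_PROC_ADDR earlier):

/* the pipeline was created with viewportCount == 0 and scissorCount == 0, plus
 * VK_DYNAMIC_STATE_VIEWPORT_WITH_COUNT_EXT and VK_DYNAMIC_STATE_SCISSOR_WITH_COUNT_EXT
 * in its dynamic state, so the same object works for any number of regions
 */
VkViewport viewports[PIPE_MAX_VIEWPORTS];
VkRect2D scissors[PIPE_MAX_VIEWPORTS];
/* ...fill the arrays from the gallium viewport/scissor state... */
vkCmdSetViewportWithCountEXT(batch->cmdbuf, num_viewports, viewports);
vkCmdSetScissorWithCountEXT(batch->cmdbuf, num_viewports, scissors);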

Longer posts later in the week; I’m in the middle of a construction zone for the next few days, and it’s less hospitable than I anticipated.

September 18, 2020

Blog Returns

Once again, I ended up not blogging for most of the week. When this happens, there’s one of two possibilities: I’m either taking a break or I’m so deep into some code that I’ve forgotten about everything else in my life including sleep.

This time was the latter. I delved into the deepest parts of zink and discovered that the driver is, in fact, functioning only through a combination of sheer luck and a truly unbelievable amount of driver stalls that provide enough forced synchronization and slow things down enough that we don’t explode into a flaming mess every other frame.

Oops.

I’ve fixed all of the crazy things I found, and, in the process, made some sizable performance gains that I’m planning to spend a while blogging about in considerable depth next week.

And when I say sizable, I’m talking in the range of 50-100% fps gains.

But it’s Friday, and I’m sure nobody wants to just see numbers or benchmarks. Let’s get into something that’s interesting on a technical level.

Samplers

Yes, samplers.

In Vulkan, samplers have a lot of rules to follow. Specifically, I’m going to be examining part of the spec that states “If a VkImageView is sampled with VK_FILTER_LINEAR as a result of this command, then the image view’s format features must contain VK_FORMAT_FEATURE_SAMPLED_IMAGE_FILTER_LINEAR_BIT”.

This is a problem for zink. Gallium gives us info about the sampler in the struct pipe_context::create_sampler_state hook, but the created sampler won’t actually be used until draw time. As a result, there’s no way to know which image is going to be sampled, and thus there’s no way to know what features the sampled image’s format flags will contain. This only becomes known at the time of draw.

The way I saw it, there were two options:

  • Dynamically create the sampler just before draw and swizzle between LINEAR and NEAREST based on the format features
  • Create both samplers immediately when LINEAR is passed and swizzle between them at draw time

In theory, the first option is probably more performant in the best case scenario where a sampler is only ever used with a single image, as it would then only ever create a single sampler object.

Unfortunately, this isn’t realistic. Just as an example, u_blitter creates a number of samplers up front, and then it also makes assumptions about filtering based on ideal operations which may not be in sync with the underlying Vulkan driver’s capabilities. So for these persistent samplers, the first option may initially allow the sampler to be created with LINEAR filtering, but it may later then be used for an image which can’t support it.

So I went with the second option. Now any time a LINEAR sampler is created by gallium, we’re actually creating both types so that the appropriate one can be used, ensuring that we can always comply with the spec and avoid any driver issues.
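
A minimal sketch of the idea (not the actual zink code; the struct and function names are made up for illustration), assuming the VkSamplerCreateInfo has already been filled in from the gallium sampler state:

#include <vulkan/vulkan.h>

struct sampler_pair {
   VkSampler linear;
   VkSampler nearest;
};

static VkResult
create_sampler_pair(VkDevice dev, VkSamplerCreateInfo sci, struct sampler_pair *out)
{
   /* the LINEAR sampler that gallium asked for */
   sci.magFilter = sci.minFilter = VK_FILTER_LINEAR;
   VkResult result = vkCreateSampler(dev, &sci, NULL, &out->linear);
   if (result != VK_SUCCESS)
      return result;

   /* a NEAREST twin for image views whose format doesn't advertise
    * VK_FORMAT_FEATURE_SAMPLED_IMAGE_FILTER_LINEAR_BIT */
   sci.magFilter = sci.minFilter = VK_FILTER_NEAREST;
   return vkCreateSampler(dev, &sci, NULL, &out->nearest);
}

At draw time, the format features of the image view actually being sampled decide which of the two gets bound.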

Hooray.

September 14, 2020

Hoo Boy

Let’s talk about ARB_shader_draw_parameters. Specifically, let’s look at gl_BaseVertex.

In OpenGL, this shader variable’s value depends on the parameters passed to the draw command, and the value is always zero if the command has no base vertex.

In Vulkan, the value here is only zero if the first vertex is zero.

The difference here means that for arrayed draws without base vertex parameters, GL always expects zero, and Vulkan expects first vertex.

Hooray.

Compatibilizing

The easiest solution here would be to just throw a shader key at the problem, producing variants of the shader for use with indexed vs non-indexed draws, and using NIR passes to modify the variables for the non-indexed case and zero the value. It’s quick, it’s easy, and it’s not especially great for performance since it requires compiling the shader multiple times and creating multiple pipeline objects.

This is where push constants come in handy once more.

Avid readers of the blog will recall the last time I used push constants was for TCS injection when I needed to generate my own TCS and have it read the default inner/outer tessellation levels out of a push constant.

Since then, I’ve created a struct to track the layout of the push constant:

struct zink_push_constant {
   unsigned draw_mode_is_indexed;
   float default_inner_level[2];
   float default_outer_level[4];
};

Now just before draw, I update the push constant value for draw_mode_is_indexed:

if (ctx->gfx_stages[PIPE_SHADER_VERTEX]->nir->info.system_values_read & (1ull << SYSTEM_VALUE_BASE_VERTEX)) {
   unsigned draw_mode_is_indexed = dinfo->index_size > 0;
   vkCmdPushConstants(batch->cmdbuf, gfx_program->layout, VK_SHADER_STAGE_VERTEX_BIT,
                      offsetof(struct zink_push_constant, draw_mode_is_indexed), sizeof(unsigned),
                      &draw_mode_is_indexed);
}

And now the shader can be made aware of whether the draw mode is indexed.

Now comes the NIR, as is the case for most of this type of work.

static bool
lower_draw_params(nir_shader *shader)
{
   if (shader->info.stage != MESA_SHADER_VERTEX)
      return false;

   if (!(shader->info.system_values_read & (1ull << SYSTEM_VALUE_BASE_VERTEX)))
      return false;

   return nir_shader_instructions_pass(shader, lower_draw_params_instr, nir_metadata_dominance, NULL);
}

This is the future, so I’m now using Eric Anholt’s recent helper function to skip past iterating over the shader’s function/blocks/instructions, instead just passing the lowering implementation as a parameter and letting the helper create the nir_builder for me.

static bool
lower_draw_params_instr(nir_builder *b, nir_instr *in, void *data)
{
   if (in->type != nir_instr_type_intrinsic)
      return false;
   nir_intrinsic_instr *instr = nir_instr_as_intrinsic(in);
   if (instr->intrinsic != nir_intrinsic_load_base_vertex)
      return false;

I’m filtering out everything except for nir_intrinsic_load_base_vertex here, which is the instruction for loading gl_BaseVertex.

   
   b->cursor = nir_after_instr(&instr->instr);

I’m modifying instructions after this one, so I set the cursor after.

   nir_intrinsic_instr *load = nir_intrinsic_instr_create(b->shader, nir_intrinsic_load_push_constant);
   load->src[0] = nir_src_for_ssa(nir_imm_int(b, 0));
   nir_intrinsic_set_range(load, 4);
   load->num_components = 1;
   nir_ssa_dest_init(&load->instr, &load->dest, 1, 32, "draw_mode_is_indexed");
   nir_builder_instr_insert(b, &load->instr);

I’m loading the first 4 bytes of the push constant variable that I created according to my struct, which is the draw_mode_is_indexed value.

   nir_ssa_def *composite = nir_build_alu(b, nir_op_bcsel,
                                          nir_build_alu(b, nir_op_ieq, &load->dest.ssa, nir_imm_int(b, 1), NULL, NULL),
                                          &instr->dest.ssa,
                                          nir_imm_int(b, 0),
                                          NULL);

This adds a new ALU instruction of type bcsel, AKA the ternary operator (condition ? true : false). The condition here is another ALU of type ieq, AKA integer equals, and I’m testing whether the loaded push constant value is equal to 1. If true, this is an indexed draw, so I continue using the loaded gl_BaseVertex value. If false, this is not an indexed draw, so I need to use zero instead.

   nir_ssa_def_rewrite_uses_after(&instr->dest.ssa, nir_src_for_ssa(composite), composite->parent_instr);

With my bcsel composite gl_BaseVertex value constructed, I can now rewrite all subsequent uses of gl_BaseVertex in the shader to use the composite value, which will automatically swap between the Vulkan gl_BaseVertex and zero based on the value of the push constant without the need to rebuild the shader or make a new pipeline.

   return true;
}

And now the shader gets the expected value and everything works.

Billy Mays

It’s also worth pointing out here that gl_DrawID from the same extension has a similar problem: gallium doesn’t pass multidraws in full to the driver, instead iterating for each draw, which means that the shader value is never what’s expected either. I’ve employed a similar trick to jam the draw index into the push constant and read that back in the shader to get the expected value there too.
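
The gl_DrawID side looks much the same; here’s a hypothetical sketch (the draw_id field and its offset are my own illustration, not the actual zink push constant layout):

/* hypothetical: assume the push constant struct has grown a draw_id member */
unsigned draw_id = i; /* index of the current draw in gallium's multidraw loop */
vkCmdPushConstants(batch->cmdbuf, gfx_program->layout, VK_SHADER_STAGE_VERTEX_BIT,
                   offsetof(struct zink_push_constant, draw_id), sizeof(unsigned),
                   &draw_id);

A matching load_push_constant at that offset then replaces nir_intrinsic_load_draw_id in the shader, just like the gl_BaseVertex lowering above.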

Extensions.

September 10, 2020

In an ideal world, every frame your application draws would appear on the screen exactly on time. Sadly, as anyone living in the year 2020 CE can attest, this is far from an ideal world. Sometimes the scene gets more complicated and takes longer to draw than you estimated, and sometimes the OS scheduler just decides it has more important things to do than pay attention to you.

When this happens, for some applications, it would be best if you could just get the bits on the screen as fast as possible rather than wait for the next vsync. The Present extension for X11 has an option to let you do exactly this:

If 'options' contains PresentOptionAsync, and the 'target-msc'
is less than or equal to the current msc for 'window', then
the operation will be performed as soon as possible, not
necessarily waiting for the next vertical blank interval. 

But you don't usually use Present directly; rather, Present is the mechanism for GLX and Vulkan to put bits on the screen. So, today I merged some code to Mesa to enable the corresponding features in those APIs, namely GLX_EXT_swap_control_tear and VK_PRESENT_MODE_FIFO_RELAXED_KHR. If all goes well these should be included in Mesa 21.0, with a backport to 20.2.x not out of the question. As the GLX extension name suggests, this can introduce some visual tearing when the buffer swap does come in late, but for fullscreen games or VR displays that can be an acceptable tradeoff in exchange for reduced stuttering.

Despite what this might look like, I don't actually enjoy starting new projects: it's a lot easier to clean up some build warnings, or add a CI, than it is to start from an empty directory.

But sometimes needs must, and I've just released version 0.1 of such a project. Below you'll find an excerpt from the README, which should answer most of the questions. Please read the README directly in the repository if you're getting to this blog post more than a couple of days after it was first published.

Feel free to file new issues in the tracker if you have ideas on possible power-saving or performance enhancements. Currently the only "Performance" mode supported interacts with Intel CPUs with P-State support. More hardware support is planned.

TL;DR: this setting will be in the GNOME 3.40 development branch soon, Fedora packages are done, and API docs are available:

 

 

From the README:

Introduction

power-profiles-daemon offers to modify system behaviour based upon user-selected power profiles. There are 3 different power profiles, a "balanced" default mode, a "power-saver" mode, as well as a "performance" mode. The first 2 of those are available on every system. The "performance" mode is only available on select systems and is implemented by different "drivers" based on the system or systems it targets.

In addition to those 2 or 3 modes (depending on the system), "actions" can be hooked up to change the behaviour of a particular device. For example, this can be used to disable the fast-charging for some USB devices when in power-saver mode.

GNOME's Settings and shell both include interfaces to select the current mode, but they are also expected to adjust the behaviour of the desktop depending on the mode, such as turning the screen off after inaction more aggressively when in power-saver mode.

Note that power-profiles-daemon does not save the currently active profile across system restarts and will always start with the "balanced" profile selected.

Why power-profiles-daemon

The power-profiles-daemon project was created to help provide a solution for two separate use cases, for desktops, laptops, and other devices running a “traditional Linux desktop”.

The first one is a "Low Power" mode, that users could toggle themselves, or have the system toggle for them, with the intent to save battery. Mobile devices running iOS and Android have had a similar feature available to end-users and application developers alike.

The second use case was to allow a "Performance" mode on systems where the hardware maker would provide and design such a mode. The idea is that the Linux kernel would provide a way to access this mode which usually only exists as a configuration option in some machines' "UEFI Setup" screen.

This second use case is the reason why we didn't implement the "Low Power" mode in UPower, as was originally discussed.

As the daemon would change kernel settings, we would need to run it as root, and make its API available over D-Bus, as has been customary for more than 10 years. We would also design that API to be as easily usable to build graphical interfaces as possible.

Why not...

This section will contain explanations of why this new daemon was written rather than re-using, or modifying an existing one. Each project obviously has its own goals and needs, and those comparisons are not meant as a slight on the project.

As the code bases for both those projects listed and power-profiles-daemon are ever evolving, the comments were understood to be correct when made.

thermald

thermald only works on Intel CPUs, and is very focused on allowing maximum performance based on a "maximum temperature" for the system. As such, it could be seen as complementary to power-profiles-daemon.

tuned and TLP

Both projects have similar goals, allowing tweaks to be applied for a variety of workloads that go far beyond the workloads and use cases that power-profiles-daemon targets.

A fair number of the tweaks that could apply to devices running GNOME or another free desktop are either potentially destructive (e.g. some of the SATA power-saving modes resulting in corrupted data), or work well enough to be put into place by default (e.g. audio codec power-saving), even if we need to disable the power saving on some hardware that reacts badly to it.

Both are good projects to use for the purpose of experimenting with particular settings to see if they'd be something that can be implemented by default, or to put some fine-grained, static, policies in place on server-type workloads which are not as fluid and changing as desktop workloads can be.

auto-cpufreq

It doesn't take user-intent into account, doesn't have a D-Bus interface and seems to want to work automatically by monitoring the CPU usage, which kind of goes against a user's wishes as a user might still want to conserve as much energy as possible under high-CPU usage.

Over the past couple of (gasp!) decades, I've had my fair share of release blunders: forgetting to clean the tree before making a tarball by hand, forgetting to update the NEWS file, forgetting to push after creating the tarball locally, forgetting to update the appdata file (causing problems on Flathub)...

That's where check-news.sh comes in, to replace the check-news function of the autotools. Ideally you would:

- make sure your CI runs a dist job

- always use a merge request to do releases

- integrate check-news.sh into your meson build (though I would relax the appdata checks for devel releases)

September 09, 2020

Long, Long Day

I fell into the abyss of query code again, so this is just a short post to (briefly) touch on the adventure that is pipeline statistics queries.

Pipeline statistics queries include a collection of statistics about things that happened while the query was active, such as the count of shader invocations.

Thankfully, these weren’t too difficult to plug into the ever-evolving zink query architecture, as my previous adventure into QBOs (more on this another time) ended up breaking out a lot of helper functions for various things that simplified the process.

Mapping

The first step, as always, is figuring out how to map the gallium API to Vulkan. The array below handles the basic conversion for single query types:

unsigned map[] = {
   [PIPE_STAT_QUERY_IA_VERTICES] = VK_QUERY_PIPELINE_STATISTIC_INPUT_ASSEMBLY_VERTICES_BIT,
   [PIPE_STAT_QUERY_IA_PRIMITIVES] = VK_QUERY_PIPELINE_STATISTIC_INPUT_ASSEMBLY_PRIMITIVES_BIT,
   [PIPE_STAT_QUERY_VS_INVOCATIONS] = VK_QUERY_PIPELINE_STATISTIC_VERTEX_SHADER_INVOCATIONS_BIT,
   [PIPE_STAT_QUERY_GS_INVOCATIONS] = VK_QUERY_PIPELINE_STATISTIC_GEOMETRY_SHADER_INVOCATIONS_BIT,
   [PIPE_STAT_QUERY_GS_PRIMITIVES] = VK_QUERY_PIPELINE_STATISTIC_GEOMETRY_SHADER_PRIMITIVES_BIT,
   [PIPE_STAT_QUERY_C_INVOCATIONS] = VK_QUERY_PIPELINE_STATISTIC_CLIPPING_INVOCATIONS_BIT,
   [PIPE_STAT_QUERY_C_PRIMITIVES] = VK_QUERY_PIPELINE_STATISTIC_CLIPPING_PRIMITIVES_BIT,
   [PIPE_STAT_QUERY_PS_INVOCATIONS] = VK_QUERY_PIPELINE_STATISTIC_FRAGMENT_SHADER_INVOCATIONS_BIT,
   [PIPE_STAT_QUERY_HS_INVOCATIONS] = VK_QUERY_PIPELINE_STATISTIC_TESSELLATION_CONTROL_SHADER_PATCHES_BIT,
   [PIPE_STAT_QUERY_DS_INVOCATIONS] = VK_QUERY_PIPELINE_STATISTIC_TESSELLATION_EVALUATION_SHADER_INVOCATIONS_BIT,
   [PIPE_STAT_QUERY_CS_INVOCATIONS] = VK_QUERY_PIPELINE_STATISTIC_COMPUTE_SHADER_INVOCATIONS_BIT
};

Pretty straightforward.

With this in place, it’s worth mentioning that OpenGL has facilities for performing either “all” statistics queries, which include all the above types, or “single” statistics queries, which is just one of the types.
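
Assuming the query pool is created with exactly the statistics it needs to count, the setup could look something like this sketch (not the actual zink code; num_queries, all_stats, and index are illustrative):

VkQueryPoolCreateInfo pool_info = {0};
pool_info.sType = VK_STRUCTURE_TYPE_QUERY_POOL_CREATE_INFO;
pool_info.queryType = VK_QUERY_TYPE_PIPELINE_STATISTICS;
pool_info.queryCount = num_queries;
if (all_stats) {
   /* "all" statistics query: OR together every bit from the table above */
   for (unsigned i = 0; i < ARRAY_SIZE(map); i++)
      pool_info.pipelineStatistics |= map[i];
} else {
   /* "single" statistics query: just the one mapped bit */
   pool_info.pipelineStatistics = map[index];
}
if (vkCreateQueryPool(screen->dev, &pool_info, NULL, &query->query_pool) != VK_SUCCESS)
   return false;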

Building on that, I’m going to reach way back to the original loop that I’ve been using for handling query results. That’s now its own helper function called check_query_results(). It’s invoked from get_query_result() like so:

int result_size = 1;
   /* these query types emit 2 values */
if (query->vkqtype == VK_QUERY_TYPE_TRANSFORM_FEEDBACK_STREAM_EXT ||
    query->type == PIPE_QUERY_PRIMITIVES_GENERATED ||
    query->type == PIPE_QUERY_PRIMITIVES_EMITTED)
   result_size = 2;
else if (query->type == PIPE_QUERY_PIPELINE_STATISTICS)
   result_size = 11;

if (query->type == PIPE_QUERY_PIPELINE_STATISTICS)
   num_results = 1;
for (unsigned last_start = query->last_start; last_start + num_results <= query->curr_query; last_start++) {
   /* verify that we have the expected number of results pending */
   assert(num_results <= ARRAY_SIZE(results) / result_size);
   VkResult status = vkGetQueryPoolResults(screen->dev, query->query_pool,
                                           last_start, num_results,
                                           sizeof(results),
                                           results,
                                           sizeof(uint64_t),
                                           flags);
   if (status != VK_SUCCESS)
      return false;

   if (query->type == PIPE_QUERY_PRIMITIVES_GENERATED) {
      status = vkGetQueryPoolResults(screen->dev, query->xfb_query_pool[0],
                                              last_start, num_results,
                                              sizeof(xfb_results),
                                              xfb_results,
                                              2 * sizeof(uint64_t),
                                              flags | VK_QUERY_RESULT_64_BIT);
      if (status != VK_SUCCESS)
         return false;

   }

   check_query_results(query, result, num_results, result_size, results, xfb_results);
}

Beautiful, isn’t it?

For the case of any xfb-related query, 2 result values are returned (and then also PRIMITIVES_GENERATED gets its own separate xfb pool for correctness), while most others only get 1 result value.

And then there’s PIPELINE_STATISTICS, which gets eleven. Since this is a weird result size to handle, I decided to just grab the results for each query one at a time in order to avoid any kind of craziness with allocating enough memory for the actual result values. Plenty of time for that later.

The handling for the results is fun too:

case PIPE_QUERY_PIPELINE_STATISTICS: {
   uint64_t *statistics = (uint64_t*)&result->pipeline_statistics;
   for (unsigned j = 0; j < 11; j++)
      statistics[j] += results[i + j];
   break;
}

Each bit set in the query object returns its own query result, so there’s 11 of them that need to be accumulated for each query result.

Surprisingly though, that’s the craziest part of implementing this. Not really much compared to some of the other insanity.

Corrections

The other day, I made wild claims about zink being the fastest graphics driver in history.

I’m not going to say they were inaccurate.

What I am going to say is that I had a great chat with well-known, long-time mesa developer Kenneth Graunke, who pointed out that most of the time differences in the softfp64 tests were due to NIR optimizations related to loop unrolling and the different settings used between IRIS and zink there.

He also helpfully pointed out that I forgot to set a number of important pipe caps, including the one that holds the value for gl_MaxVaryingComponents. As a result, zink has been advertising a very illegal and too-small number of varyings, which made the shaders smaller, which made us faster.

This has since been fixed, and zink is now fully compliant with the required GL specs for this value.

However.

This has only increased the time of the one test I showcased from ~2 seconds to ~12 seconds.

Due to different mechanics between GLSL and SPIRV shader construction, all XFB outputs have to be explicitly specified when they’re converted in zink, and they count against the regular output limits. As a result, it’s necessary to reserve half of the driver’s advertised vertex outputs for potential XFB usage. This leaves us with far fewer available output slots for user shaders in order to preserve XFB functionality.

September 08, 2020

This is going to be a short post, as changes to Videos have been few and far between in the past couple of releases.

The major change to the latest release is that we've gained Tracker 3 support through a grilo plugin (which meant very few changes to our own code). But the Tracker 3 libraries are incompatible with the Tracker 2 daemon that's usually shipped in distributions, including on this author's development system.

So we made use of the ability of Tracker to run inside a Flatpak sandbox along with the video player, removing the need to have Tracker installed by the distribution, on the host. This should also make it easier to give users control of the directories they want to use to store their movies, in the future.

The release candidate for GNOME 3.38 is available right now as the stable version on Flathub.

September 07, 2020

So here is a new update on the evolution of the Vulkan driver for the rpi4 (Broadcom GPU).

Features

Since my last update we finished the support for two features: robust buffer access and multisampling.

Robust buffer access is a feature that allows specifying that accesses to buffers are bounds-checked against the range of the buffer descriptor. Usually this is used as a debug tool during development, but disabled on release (this is explained in more detail in this ARM guide). So sorry, no screenshot here.

On my last update I mentioned that we had started the support for multisampling, enough to get some demos working. Since then we were able to finish the rest of the multisampling support, and even implemented the optional sample rate shading feature. So now the following Sascha Willems demo is working:

Sascha Willems deferred multisampling demo run on rpi4

Bugfixing

Taking into account that most of the features needed to support Vulkan core 1.0 are implemented now, a lot of the effort since the last update went into bugfixing, focusing on the specifics of the driver. Our main reference for this is the Vulkan CTS, the official Khronos testsuite for Vulkan and OpenGL.

As usual, here are some screenshots from the nice Sascha Willems’s demos, showing demos that were failing when I wrote the last update and are working now thanks to the bugfixing work.

Sascha Willems hdr demo run on rpi4

Sascha Willems gltf skinning demo run on rpi4

Next

At this point there are no features pending to implement to fulfill the support for Vulkan core 1.0, so our focus will be on getting all the Vulkan CTS tests to pass.

Previous updates

Just in case you missed any of the updates of the vulkan driver so far:

Vulkan raspberry pi first triangle
Vulkan update now with added source code
v3dv status update 2020-07-01
V3DV Vulkan driver update: VkQuake1-3 now working
v3dv status update 2020-07-31

Just For Fun

I finally managed to get a complete piglit run over the weekend, and, for my own amusement, I decided to check the timediffs against a reference run from the IRIS driver. Given that the Intel drivers are of extremely high quality (and are direct interfaces to the underlying hardware that I happen to be using), I tend to use ANV and IRIS as my references whenever I’m trying to debug things.

Both runs used the same base checkout from mesa, so all the core/gallium/nir parts were identical.

The results weren’t what I expected.

My expectation when I clicked into the timediffs page was that zink would be massively slower in a huge number of tests, likely to a staggering degree in some cases.

We were, but then also on occasion we weren’t.

As a final disclaimer before I dive into this, I feel like given the current state of people potentially rushing to conclusions I need to say that I’m not claiming zink is faster than a native GL driver, only that for some cases, our performance is oddly better than I expected.

The Good

piglit-misc-bench.png

The first thing to take note of here is that IRIS is massively better than zink in successful test completion, with a near-perfect 99.4% pass rate compared to zink’s measly 91%, and that’s across 2500 more tests too. This is important also since timediff only compares between passing tests.

With that said, somehow zink’s codepath is significantly faster when it comes to dealing with high numbers of varying outputs, and also, weirdly, a bunch of dmat4 tests, even though they’re both using the same softfp64 path since my icelake hardware doesn’t support native 64bit operations.

I was skeptical about some of the numbers here, particularly the ext_transform_feedback max-varying-arrays-of-arrays cases, but manual tests were even weirder:

time MESA_GLSL_CACHE_DISABLE=true MESA_LOADER_DRIVER_OVERRIDE=zink bin/ext_transform_feedback-max-varyings -auto -fbo

MESA_GLSL_CACHE_DISABLE=true MESA_LOADER_DRIVER_OVERRIDE=zink  -auto -fbo  2.13s user 0.03s system 98% cpu 2.197 total
time MESA_GLSL_CACHE_DISABLE=true MESA_LOADER_DRIVER_OVERRIDE=iris bin/ext_transform_feedback-max-varyings -auto -fbo

MESA_GLSL_CACHE_DISABLE=true MESA_LOADER_DRIVER_OVERRIDE=iris  -auto -fbo  301.64s user 0.52s system 99% cpu 5:02.45 total

wat.

I don’t have a good explanation for this since I haven’t dug into it other than to speculate that ANV is just massively better at handling large numbers of varying outputs.

The Bad

By contrast, zink gets thrashed pretty decisively in arb_map_buffer_alignment-map-invalidate-range, and we’re about 150x slower.

Yikes. Looks like that’s going to be a target for some work since potentially an application might hit that codepath.

The Weird

piglit-fp64-bench.png

Somehow, zink is noticeably slower in a bunch of other fp64 tests (and this isn’t the full list, only a little over half). It’s strange to me that zink can perform better in certain fp64 cases but then also worse in others, but I’m assuming this is just the result of different shader optimizations happening between the drivers, shifting them onto slightly less slow parts of the softfp64 codepath in certain cases.

Possibly something to look into.

Probably not in too much depth since softfp64 is some pretty crazy stuff.

In Closing

Tests (and especially piglit ones) are not indicative of real world performance.

Exposure Notifications is a protocol developed by Apple and Google for facilitating COVID-19 contact tracing on mobile phones by exchanging codes with nearby phones over Bluetooth, implemented within the Android and iOS operating systems, now available here in Toronto.

Wait – phones? Android and iOS only? Can’t my Debian laptop participate? It has a recent Bluetooth chip. What about phones running GNU/Linux distributions like the PinePhone or Librem 5?

Exposure Notifications breaks down neatly into three sections: a Bluetooth layer, some cryptography, and integration with local public health authorities. Linux is up to the task, via BlueZ, OpenSSL, and some Python.

Given my background, will this turn out to be a reverse-engineering epic resulting in a novel open stack for a closed system?

Not at all. The specifications for the Exposure Notifications are available for both the Bluetooth protocol and the underlying cryptography. A partial reference implementation is available for Android, as is an independent Android implementation in microG. In Canada, the key servers run an open source stack originally built by Shopify and now maintained by the Canadian Digital Service, including open protocol documentation.

All in all, this is looking to be a smooth-sailing weekend1 project.

The devil’s in the details.

Bluetooth

Exposure Notifications operates via Bluetooth Low Energy “advertisements”. Scanning for other devices is as simple as scanning for advertisements, and broadcasting is as simple as advertising ourselves.

On an Android phone, this is handled deep within Google Play Services. Can we drive the protocol from userspace on a regular GNU/Linux laptop? It depends. Not all laptops support Bluetooth, not all Bluetooth implementations support Bluetooth Low Energy, and I hear not all Bluetooth Low Energy implementations properly support undirected transmissions (“advertising”).

Luckily in my case, I develop on a Debianized Chromebook with a Wi-Fi/Bluetooth module. I’ve never used the Bluetooth, but it turns out the module has full support for advertisements, verified with the lescan (Low Energy Scan) command of the hcitool Bluetooth utility.

hcitool is a part of BlueZ, the standard Linux library for Bluetooth. Since lescan is able to detect nearby phones running Exposure Notifications, poring through its source code is a good first step to our implementation. With some minor changes to hcitool to dump packets as raw hex and to filter for the Exposure Notifications protocol, we can print all nearby Exposure Notifications advertisements. So far, so good.

That’s about where the good ends.

While scanning is simple with reference code in hcitool, advertising is complicated by BlueZ’s lack of an interface at the time of writing. While a general “enable advertising” routine exists, routines to set advertising parameters and data per the Exposure Notifications specification are unavailable. This is not a showstopper, since BlueZ is itself an open source userspace library. We can drive the Bluetooth module the same way BlueZ does internally, filling in the necessary gaps in the API, while continuing to use BlueZ for the heavy-lifting.

Some care is needed to multiplex scanning and advertising within a single thread while remaining power efficient. The key is that advertising, once configured, is handled entirely in hardware without CPU intervention. On the other hand, scanning does require CPU involvement, but it is not necessary to scan continuously. Since COVID-19 is thought to transmit from sustained exposure, we only need to scan every few minutes. (Food for thought: how does this connect to the sampling theorem?)

Thus we can order our operations as:

  • Configure advertising
  • Scan for devices
  • Wait for several minutes
  • Repeat.

Since most of the time the program is asleep, this loop is efficient. It additionally allows us to reconfigure advertising every ten to fifteen minutes, in order to change the Bluetooth address to prevent tracking.

All of the above amounts to a few hundred lines of C code, treating the Exposure Notifications packets themselves as opaque random data.
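
A skeleton of that loop might look something like the following; the two helpers are hypothetical placeholders for the BlueZ-driven code described above, not the project's actual API:

#include <unistd.h>

void configure_advertising(void);     /* hypothetical: new random address + payload */
void scan_for_advertisements(void);   /* hypothetical: one LE scan pass */

int main(void)
{
   for (;;) {
      /* advertising, once configured, runs entirely in hardware */
      configure_advertising();

      /* rotate the address roughly every fifteen minutes, scanning a few
       * times in between; sustained exposure is what matters, so periodic
       * scans are enough */
      for (int i = 0; i < 3; i++) {
         scan_for_advertisements();
         sleep(5 * 60);
      }
   }
   return 0;
}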

Cryptography

Yet the data is far from random; it is the result of a series of operations in terms of secret keys defined by the Exposure Notifications cryptography specification. Every day, a “temporary exposure key” is generated, from which a “rolling proximity identifier key” and an “associated encrypted metadata key” are derived. These are used to generate a “rolling proximity identifier” and the “associated encrypted metadata”, which are advertised over Bluetooth and changed in lockstep with the Bluetooth random addresses.

There are lots of moving parts to get right, but each derivation reuses a common encryption primitive: HKDF-SHA256 for key derivation, AES-128 for the rolling proximity identifier, and AES-128-CTR for the associated encrypted metadata. Ideally, we would grab a state-of-the-art library of cryptography primitives like NaCl or libsodium and wire everything up.

First, some good news: once these routines are written, we can reliably unit test them. Though the specification states that “test vectors… are available upon request”, it isn’t clear who to request from. But Google’s reference implementation is itself unit-tested, and sure enough, it contains a TestVectors.java file, from which we can grab the vectors for a complete set of unit tests.

After patting ourselves on the back for writing unit tests, we’ll need to pick a library to implement the cryptography. Suppose we try NaCl first. We’ll quickly realize the primitives we need are missing, so we move onto libsodium, which is backwards-compatible with NaCl. For a moment, this will work – libsodium has upstream support for HKDF-SHA256. Unfortunately, the version of libsodium shipping in Debian testing is too old for HKDF-SHA256. Not a big problem – we can backwards port the implementation, written in terms of the underlying HMAC-SHA256 operations, and move on to the AES.

AES is a standard symmetric cipher, so libsodium has excellent support… for some modes. However standard, AES is not one cipher; it is a family of ciphers with different key lengths and operating modes, with dramatically different security properties. “AES-128-CTR” in the Exposure Notifications specification is clearly 128-bit AES in CTR (Counter) mode, but what about “AES-128” alone, stated to operate on a “single AES-128 block”?

The mode implicitly specified is known as ECB (Electronic Codebook) mode and is known to have fatal security flaws in most applications. Because AES-ECB is generally insecure, libsodium does not have any support for this cipher mode. Great, now we have two problems – we have to rewrite our cryptography code against a new library, and we have to consider if there is a vulnerability in Exposure Notifications.

ECB’s crucial flaw is that for a given key, identical plaintext will always yield identical ciphertext, regardless of position in the stream. Since AES is block-based, this means identical blocks yield identical ciphertext, leading to trivial cryptanalysis.

In Exposure Notifications, ECB mode is used only to derive rolling proximity identifiers from the rolling proximity identifier key and the timestamp, by the equation:

RPI_ij = AES_128_ECB(RPIK_i, PaddedData_j)

…where PaddedData is a function of the quantized timestamp. Thus the issue is avoided, as every plaintext will be unique (since timestamps are monotonically increasing, unless you’re trying to contact trace Back to the Future).

Nevertheless, libsodium doesn’t know that, so we’ll need to resort to a ubiquitous cryptography library that doesn’t, uh, take security quite so seriously…

I’ll leave the implications up to your imagination.
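
For what it’s worth, the single-block derivation itself is tiny once a library is picked; here’s a hedged sketch against OpenSSL’s EVP interface (the function name is mine, and error handling is abbreviated):

#include <openssl/evp.h>

/* Derive one Rolling Proximity Identifier: RPI = AES-128-ECB(RPIK, PaddedData) */
static int derive_rpi(const unsigned char rpik[16],
                      const unsigned char padded_data[16],
                      unsigned char rpi_out[16])
{
   EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
   int len = 0, ok = 0;

   if (!ctx)
      return 0;
   /* one ECB block; disable padding since the input is exactly 16 bytes */
   if (EVP_EncryptInit_ex(ctx, EVP_aes_128_ecb(), NULL, rpik, NULL) == 1 &&
       EVP_CIPHER_CTX_set_padding(ctx, 0) == 1 &&
       EVP_EncryptUpdate(ctx, rpi_out, &len, padded_data, 16) == 1 &&
       len == 16)
      ok = 1;

   EVP_CIPHER_CTX_free(ctx);
   return ok;
}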

Database

While the Bluetooth and cryptography sections are governed by upstream specifications, making sense of the data requires tracking a significant amount of state. At minimum, we must:

  • Record received packets (the Rolling Proximity Identifier and the Associated Encrypted Metadata).
  • Query received packets for diagnosed identifiers.
  • Record our Temporary Exposure Keys.
  • Query our keys to upload if we are diagnosed.

If we were so inclined, we could handwrite all the serialization and concurrency logic and hope we don’t have a bug that results in COVID-19 mayhem.

A better idea is to grab SQLite, perhaps the most deployed software in the world, and express these actions as SQL queries. The database persists to disk, and we can even express natural unit tests with a synthetic in-memory database.
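
A minimal sketch of that bootstrap through the C API (the table and column names are made up for illustration; the project's actual schema may differ):

#include <sqlite3.h>
#include <stdio.h>

int main(void)
{
   sqlite3 *db;
   char *err = NULL;

   /* ":memory:" gives the synthetic in-memory database used for tests;
    * a real run would point at a file on disk instead */
   if (sqlite3_open(":memory:", &db) != SQLITE_OK)
      return 1;

   const char *schema =
      "CREATE TABLE IF NOT EXISTS received ("
      "  rpi BLOB NOT NULL,"           /* Rolling Proximity Identifier */
      "  aem BLOB NOT NULL,"           /* Associated Encrypted Metadata */
      "  seen INTEGER NOT NULL);"      /* when the advertisement was heard */
      "CREATE TABLE IF NOT EXISTS our_keys ("
      "  tek BLOB NOT NULL,"           /* Temporary Exposure Key */
      "  interval INTEGER NOT NULL);"  /* key validity window */

   if (sqlite3_exec(db, schema, NULL, NULL, &err) != SQLITE_OK) {
      fprintf(stderr, "schema failed: %s\n", err);
      sqlite3_free(err);
   }

   sqlite3_close(db);
   return 0;
}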

With this infrastructure, we’re now done with the primary daemon, recording Exposure Notification identifiers to the database and broadcasting our own identifiers. That’s not interesting if we never do anything with that data, though. Onwards!

Key retrieval

Once per day, Exposure Notifications implementations are expected to query the server for Temporary Exposure Keys associated with diagnosed COVID-19 cases. From these keys, the cryptography implementation can reconstruct the associated Rolling Proximity Identifiers, for which we can query the database to detect if we have been exposed.

Per Google’s documentation, the servers are expected to return a zip file containing two files:

  • export.bin: a container serialized as Protocol Buffers containing Diagnosis Keys
  • export.sig: a signature for the export with the public health agency’s key

The signature is not terribly interesting to us. On Android, it appears the system pins the public keys of recognized public health agencies as an integrity check for the received file. However, this public key is given directly to Google; we don’t appear to have an easy way to access it.

Does it matter? For our purposes, it’s unlikely. The Canadian key retrieval server is already transport-encrypted via HTTPS, so tampering with the data would already require compromising a certificate authority in addition to intercepting the requests to https://canada.ca. Broadly speaking, that limits attackers to nation-states, and since Canada has no reason to attack its own infrastructure, that limits our threat model to foreign nation-states. International intelligence agencies probably have better uses of resources than getting people to take extra COVID tests.

It’s worth noting other countries’ implementations could serve this zip file over plaintext HTTP, in which case this signature check becomes important.

Focusing then on export.bin, we may import the relevant protocol buffer definitions to extract the keys for matching against our database. Since this requires only read-only access to the database and executes infrequently, we can safely perform this work from a separate process written in a higher-level language like Python, interfacing with the cryptography routines over the Python foreign function interface ctypes. Extraction is easy with the Python protocol buffers implementation, and downloading should be as easy as a GET request with the standard library’s urllib, right?

Here we hit a gotcha: the retrieval endpoint is guarded behind an HMAC, requiring authentication to download the zip. The protocol documentation states:

Of course there’s no reliable way to truly authenticate these requests in an environment where millions of devices have immediate access to them upon downloading an Application: this scheme is purely to make it much more difficult to casually scrape these keys.

Ah, security by obscurity. Calculating the HMAC itself is simple given the documentation, but it requires a “secret” HMAC key specific to the server. As the documentation is aware, this key is hardly secret, but it’s not available on the Canadian Digital Service’s official repositories. Interoperating with the upstream servers would require some “extra” tricks.

From purely academic interest, we can write and debug our implementation without any such authorization by running our own sandbox server. Minus the configuration, the server source is available, so after spinning up a virtual machine and fighting with Go versioning, we can test our Python script.

Speaking of a personal sandbox…

Key upload

There is one essential edge case to the contact tracing implementation, one that we can’t test against the Canadian servers. And edge cases matter. In effect, the entire Exposure Notifications infrastructure is designed for the edge cases. If you don’t care about edge cases, you don’t care about digital contact tracing (so please, stay at home.)

The key feature – and key edge case – is uploading Temporary Exposure Keys to the Canadian key server in case of a COVID-19 diagnosis. This upload requires an alphanumeric code generated by a healthcare provider upon diagnosis, so if we used the shared servers, we couldn’t test an implementation. With our sandbox, we can generate as many alphanumeric codes as we’d like.

Once sandboxed, there isn’t much to the implementation itself: the keys are snarfed out of the SQLite database, we handshake with the server over protocol buffers marshaled over POST requests, and we throw in some public-key cryptography via the Python bindings to libsodium.

This functionality neatly fits into a second dedicated Python script which does not interface with the main library. It’s exposed as a command line interface with flow resembling that of the mobile application, adhering reasonably to the UNIX philosophy. Admittedly I’m not sure wrestling with the command line is top on the priority list of a Linux hacker ill with COVID-19. Regardless, the interface is suitable for higher-level (graphical) abstractions.

Problem solved, but of course there’s a gotcha: if the request is malformed, an error should be generated as a key robustness feature. Unfortunately, while developing the script against my sandbox, a bug led the request to be dropped unexpectedly, rather than returning with an error message. On the server implemented in Go, there was an apparent nil dereference. Oops. Fixing this isn’t necessary for this project, but it’s still a bug, even if it requires a COVID-19 diagnosis to trigger. So I went and did the Canadian thing and sent a pull request.

Conclusion

All in all, we end up with a Linux implementation of Exposure Notifications functional in Ontario, Canada. What’s next? Perhaps supporting contact tracing systems elsewhere in the world – patches welcome. Closer to home, while functional, the aesthetics are not (yet) anything to write home about – perhaps we could write a touch-based frontend for mobile Linux environments like Plasma Mobile and Phosh, maybe even running it on an Android flagship flashed with postmarketOS to go full circle.

Source code for liben is available for anyone who dares go near. Compiling from source is straightforward but necessary at the time of writing. As for packaging?

Here’s hoping COVID-19 contact tracing will be obsolete by the time liben hits Debian stable.


  1. Today (Monday) is Labour Day, so this is a 3-day weekend. But I started on Saturday and posted this today, so it technically counts.↩︎

September 04, 2020

Wim Taymans

Wim Taymans talking about current state of PipeWire


Wim Taymans did an internal demonstration yesterday for the desktop team at Red Hat of the current state of PipeWire. For those still unaware PipeWire is our effort to bring together audio, video and pro-audio under Linux, creating a smooth and modern experience. Before PipeWire there was PulseAudio for consumer audio, Jack for Pro-audio and just unending pain and frustration for video. PipeWire is being done with the aim of being ABI compatible with ALSA, PulseAudio and JACK, meaning that PulseAudio and Jack apps should just keep working on top of Pipewire without the need for rewrites (and with the same low latency for JACK apps).

As Wim reported yesterday things are coming together, with the PulseAudio, Jack and ALSA backends all being usable if not 100% feature complete yet. Wim has been running his system with PipeWire as the only sound server for a while now and things are now in a state where we feel ready to ask the wider community to test and help provide feedback and test cases.

Carla on PipeWire

Carla running on PipeWire

Carla as shown above is a popular Jack applications and it provides among other things this patchbay view of your audio devices and applications. I recommend you all to click in and take a close look at the screenshot above. That is the Jack application Carla running and as you see PulseAudio applications like GNOME Settings and Google Chrome are also showing up now thanks to the unified architecture of PipeWire, alongside Jack apps like Hydrogen. All of this without any changes to Carla or any of the other applications shown.

At the moment Wim is primarily testing using Cheese, GNOME Control center, Chrome, Firefox, Ardour, Carla, vlc, mplayer, totem, mpv, Catia, pavucontrol, paman, qsynth, zrythm, helm, Spotify and Calf Studio Gear. So these are the applications you should be getting the most mileage from when testing, but most others should work too.

Anyway, let me quickly go over some of the highlight from Wim’s presentation.

Session Manager

PipeWire now has a functioning session manager that allows for things like

  • Metadata, system for tagging objects with properties, visible to all clients (if permitted)
  • Load and save of volumes, automatic routing
  • Default source and sink with metadata, saved and loaded as well
  • Moving streams with metadata

Currently this is a simple sample session manager that Wim created himself, but we also have a more advanced session manager called Wireplumber being developed by Collabora, which they developed for use in automotive Linux usecases, but which we will probably be moving to over time also for the desktop.

Human readable handling of Audio Devices

Wim took the code and configuration data in PulseAudio for ALSA Card Profiles and created a standalone library that can be shared between PipeWire and PulseAudio. This library handles ALSA sound card profiles, devices, mixers and UCM (the use case manager used to configure newer audio chips, like the one in the Lenovo X1 Carbon), and lets PipeWire provide the correct information to things like GNOME Control Center or pavucontrol. Using the same code as has been used in PulseAudio for this has the added benefit that when you switch from PulseAudio to PipeWire your devices don’t change names. So everything should look and feel just like PulseAudio from an application perspective. In fact just below is a screenshot of pavucontrol, the PulseAudio mixer application, running on top of PipeWire without a problem.

PulseAudio Mixer

Pavucontrol, the Pulse Audio mixer on Pipewire

Creating audio sink devices with Jack
PipeWire now allows you to create new audio sink devices with Jack. The example command below creates a PipeWire sink node out of calfjackhost and sets it up so that we can route, for instance, the audio from Firefox into it. At the moment you can do that by running your Jack apps like this:

PIPEWIRE_PROPS="media.class=Audio/Sink" calfjackhost

But eventually we hope to move this functionality into the GNOME Control Center or similar so that you can do this setup graphically. The screenshot below shows us using CalfJackHost as an audio sink, outputting the audio from Firefox (a PulseAudio application) and CalfJackHost generating an analyzer graph of the audio.

Calfjackhost on pipewire

The CalfJackhost being used as an audio sink for Firefox

Creating devices with GStreamer
We can also use GStreamer to create PipeWire devices now. The command below takes the popular Big Buck Bunny animation created by the great folks over at Blender and lets you set it up as a video source in PipeWire. So if, for instance, you always wanted to play back a video inside Cheese, to apply the Cheese effects to it, you can do that this way without Cheese needing to change to handle video playback. As one can imagine this opens up the ability to string together a lot of applications in interesting ways to achieve things that there might not be an application for yet. Of course application developers can also take more direct advantage of this to easily add features to their applications; for instance I am really looking forward to something like OBS Studio taking full advantage of PipeWire.

gst-launch-1.0 uridecodebin uri=file:///home/wim/data/BigBuckBunny_320x180.mp4 ! pipewiresink mode=provide stream-properties="props,media.class=Video/Source,node.description=BBB"

Cheese playing a video through PipeWire

Cheese playing a video provided by GStreamer through PipeWire.

How to get started testing PipeWire
Ok, so after seeing all of this you might be thinking, how can I test all of this stuff out and find out how my favorite applications work with PipeWire? Well, the first thing you should do is make sure you are running Fedora Workstation 32 or later, as that is where we are developing all of this. Once you've done that you need to make sure you have all the needed pieces installed:

sudo dnf install pipewire-libpulse pipewire-libjack pipewire-alsa

Once that dnf command finishes you run the following to get PulseAudio replaced by PipeWire.


cd /usr/lib64/

sudo ln -sf pipewire-0.3/pulse/libpulse-mainloop-glib.so.0 /usr/lib64/libpulse-mainloop-glib.so.0.999.0
sudo ln -sf pipewire-0.3/pulse/libpulse-simple.so.0 /usr/lib64/libpulse-simple.so.0.999.0
sudo ln -sf pipewire-0.3/pulse/libpulse.so.0 /usr/lib64/libpulse.so.0.999.0

sudo ln -sf pipewire-0.3/jack/libjack.so.0 /usr/lib64/libjack.so.0.999.0
sudo ln -sf pipewire-0.3/jack/libjacknet.so.0 /usr/lib64/libjacknet.so.0.999.0
sudo ln -sf pipewire-0.3/jack/libjackserver.so.0 /usr/lib64/libjackserver.so.0.999.0

sudo ldconfig

(you can also find those commands here)

Once you run these commands you should be able to run

pactl info

and see this as the first line returned:
Server String: pipewire-0

I do recommend rebooting, to be 100% sure you are on a PipeWire system with everything outputting through PipeWire. Once that is done you are ready to start testing!

Our goal is to use the remainder of the Fedora Workstation 32 lifecycle and the Fedora Workstation 33 lifecycle to stabilize and finish the last major features of PipeWire and then start relying on it in Fedora Workstation 34. So I hope this article will encourage more people to get involved and join us on gitlab and on the PipeWire IRC channel at #pipewire on Freenode.

As we are trying to stabilize PipeWire we are working on it on a bug by bug basis atm, so if you end up testing out the current state of PipeWire then be sure to report issues back to us through the PipeWire issue tracker, but do try to ensure you have a good test case/reproducer as we are still so early in the development process that we can’t dig into ‘obscure/unreproducible’ bugs.

Also if you want/need to go back to PulseAudio you can run the commands here

Also if you just want to test a single application and not switch your whole system over you should be able to do that by using the following commands:

pw-pulse

or

pw-jack

Next Steps
So what are our exact development plans at this point? Well here is a list in somewhat priority order:

  1. Stabilize – Our top priority now is to make PipeWire so stable that the power users that we hope to attract as our first batch of users are comfortable running PipeWire as their only audio server. This is critical to build up a userbase that can help us identify and prioritize remaining issues and ensure that when we do switch Fedora Workstation over to using PipeWire as the default and only supported audio server it will be a great experience for users.
  2. Jackdbus – We want to implement support for the jackdbus API soon, as we know it's an important feature for the Fedora Jam folks. So we hope to get to this in the not too distant future.
  3. Flatpak portal for JACK/audio applications – The future of application packaging is Flatpaks and being able to sandbox Jack applications properly inside a Flatpak is something we want to enable.
  4. Bluetooth – Bluetooth has been supported in PipeWire from the start, but as Wim's focus has moved elsewhere it has gone a little stale. So we are looking at cycling back to it and cleaning it up to get it production ready. This includes proper support for things like LDAC and AAC passthrough, which is currently not handled in PulseAudio. Wim hopes to push an updated PipeWire out in Fedora next week which should at least get Bluetooth into a basic working state, but the big fix will come later.
  5. Pulse effects – Wim has looked at this, but there are some bugs that block data from moving through the pipeline.
  6. Latency compensation – We want complete latency compensation implemented. This is not actually in Jack currently, so it would be a net new feature.
  7. Network audio – PulseAudio style network audio is not implemented yet.

This is the continuation from these posts: part 1, part 2, part 3

This is the part where it all comes together, with (BYO) fireworks and confetti, champagne and hoorays. Or rather, this is where I'll describe how to actually set everything up. It's a bit complicated because while libxkbcommon does the parsing legwork now, we haven't actually changed other APIs and the file formats which are still 1990s-style nerd cool and requires significant experience in CS [1] to understand what goes where.

The below relies on software using libxkbcommon and libxkbregistry. At the time of writing, libxkbcommon is used by all mainstream Wayland compositors but not by the X server. libxkbregistry is not yet used because I'm typing this before we had a release for it. But at least now I have a link to point people to.

libxkbcommon has an xkbcli-scaffold-new-layout tool that creates the template files shown below. At the time of writing, this tool must be run from the git repo build directory; it is not installed.

I'll explain here how to add the us(banana) variant and the custom:foo option, and I will optimise for simplicity and brevity.

Directory structure

First, create the following directory layout:


$ tree $XDG_CONFIG_HOME/xkb
/home/user/.config/xkb
├── compat
├── keycodes
├── rules
│   ├── evdev
│   └── evdev.xml
├── symbols
│   ├── custom
│   └── us
└── types
If $XDG_CONFIG_HOME is unset, fall back to $HOME/.config.

Rules files

Create the rules file and add an entry to map our custom:foo option to a section in the symbols/custom file.


$ cat $XDG_CONFIG_HOME/xkb/rules/evdev
! option = symbols
custom:foo = +custom(foo)

// Include the system 'evdev' file
! include %S/evdev
Note that no entry is needed for the variant, that is handled by wildcards in the system rules file. If you only want a variant and no options, you technically don't need this rules file.

Second, create the xml file used by libxkbregistry to display your new entries in the configuration GUIs:


$ cat $XDG_CONFIG_HOME/xkb/rules/evdev.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xkbConfigRegistry SYSTEM "xkb.dtd">
<xkbConfigRegistry version="1.1">
<layoutList>
<layout>
<configItem>
<name>us</name>
</configItem>
<variantList>
<variant>
<configItem>
<name>banana</name>
<shortDescription>banana</shortDescription>
<description>US(Banana)</description>
</configItem>
</variant>
</variantList>
</layout>
</layoutList>
<optionList>
<group allowMultipleSelection="true">
<configItem>
<name>custom</name>
<description>custom options</description>
</configItem>
<option>
<configItem>
<name>custom:foo</name>
<description>This option does something great</description>
</configItem>
</option>
</group>
</optionList>
</xkbConfigRegistry>
Our variant needs to be added as a layoutList/layout/variantList/variant, the option to the optionList/group/option. libxkbregistry will combine this with the system-wide evdev.xml file in /usr/share/X11/xkb/rules/evdev.xml.

Overriding and adding symbols

Now to the actual mapping. Add a section to each of the symbols files that matches the variant or option name:


$ cat $XDG_CONFIG_HOME/xkb/symbols/us
partial alphanumeric_keys modifier_keys
xkb_symbols "banana" {
name[Group1]= "Banana (us)";

include "us(basic)"

key <CAPS> { [ Escape ] };
};
With this, the us(banana) layout will be a US keyboard layout but with the CapsLock key mapped to Escape. What about our option? Mostly the same, let's map the tilde key to nothing:

$ cat $XDG_CONFIG_HOME/xkb/symbols/custom
partial alphanumeric_keys modifier_keys
xkb_symbols "foo" {
key <TLDE> { [ VoidSymbol ] };
};
A note here: NoSymbol means "don't overwrite it" whereas VoidSymbol is "map to nothing".

Notes

You may notice that the variant and option sections are almost identical. XKB doesn't care about variants vs options, it only cares about components to combine. So the sections do what we expect of them: variants include enough other components to make them a full keyboard layout, options merely define a few keys so they can be combined with layouts(variants). Due to how the lookups work, you could load the option template as layout custom(foo).

For the actual documentation of keyboard configuration, you'll need to google around, there are quite a few posts on how to map keys. All that has changed is where from and how things are loaded but not the actual data formats.

If you wanted to install this as system-wide custom rules, replace $XDG_CONFIG_HOME with /etc.

The above is a replacement for xmodmap. It does not require a script to be run manually to apply the config, the existing XKB integration will take care of it. It will work in Wayland (but as said above not in X, at least not for now).

A final word

Now, I fully agree that this is cumbersome, clunky and feels outdated. This is easy to fix, all that is needed is for someone to develop a better file format, make sure it's backwards compatible with the full spec of the XKB parser (the above is a fraction of what it can do), that you can generate the old files from the new format to reduce maintenance, and then maintain backwards compatibility with the current format for the next ten or so years. Should be a good Google Decade of Code beginner project.

[1] Cursing and Swearing

This is the continuation from these posts: part 1, part 2, part 3 and part 4.

In the posts linked above, I describe how it's possible to have custom keyboard layouts in $HOME or /etc/xkb that will get picked up by libxkbcommon. This only works for the Wayland stack, the X stack doesn't use libxkbcommon. In this post I'll explain why it's unlikely this will ever happen in X.

As described in the previous posts, users configure with rules, models, layouts, variants and options (RMLVO). What XKB uses internally though are keycodes, compat, geometry, symbols and types (KcCGST) [1].

There are, effectively, two KcCGST keymap compilers: libxkbcommon and xkbcomp. libxkbcommon can go from RMLVO to a full keymap, xkbcomp relies on other tools (e.g. setxkbmap) which in turn use a utility library called libxkbfile to parse rules files. The X server has a copy of the libxkbfile code. It doesn't use libxkbfile itself but it relies on the header files provided by it for some structs.

Wayland's keyboard configuration works like this:

  • the compositor decides on the RMLVO keyboard layout, through an out-of-band channel (e.g. gsettings, weston.ini, etc.)
  • the compositor invokes libxkbcommon to generate a KcCGST keymap and passes that full keymap to the client
  • the client compiles that keymap with libxkbcommon and feeds any key events into libxkbcommon's state tracker to get the right keysyms
The advantage we have here is that only the full keymap is passed between entities. Changing how that keymap is generated does not affect the client. This, coincidentally [2], is also how Xwayland gets the keymap passed to it and why Xwayland works with user-specific layouts.
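
To make that flow concrete, here's a minimal sketch in C using the public libxkbcommon API. The compositor and client halves are collapsed into one program, error handling and cleanup are omitted, and the RMLVO values are just an example, so treat it as an illustration rather than real compositor code:

#include <stdio.h>
#include <xkbcommon/xkbcommon.h>

int main(void)
{
    /* "compositor" side: RMLVO in, full keymap text out */
    struct xkb_context *ctx = xkb_context_new(XKB_CONTEXT_NO_FLAGS);
    struct xkb_rule_names names = {
        .rules = "evdev", .model = "pc105",
        .layout = "us", .variant = "dvorak", .options = NULL,
    };
    struct xkb_keymap *keymap =
        xkb_keymap_new_from_names(ctx, &names, XKB_KEYMAP_COMPILE_NO_FLAGS);
    char *keymap_string =
        xkb_keymap_get_as_string(keymap, XKB_KEYMAP_FORMAT_TEXT_V1);

    /* "client" side: full keymap text in, keysyms out */
    struct xkb_keymap *client_keymap =
        xkb_keymap_new_from_string(ctx, keymap_string,
                                   XKB_KEYMAP_FORMAT_TEXT_V1,
                                   XKB_KEYMAP_COMPILE_NO_FLAGS);
    struct xkb_state *state = xkb_state_new(client_keymap);

    /* feed a key event; xkb keycode = evdev keycode + 8, 38 is the 'a' key */
    xkb_keysym_t sym = xkb_state_key_get_one_sym(state, 38);
    char name[64];
    xkb_keysym_get_name(sym, name, sizeof(name));
    printf("keysym: %s\n", name);
    return 0;
}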

X works differently. Notably, KcCGST can come in two forms, the partial form specifying names only and the full keymap. The partial form looks like this:


$ setxkbmap -print -layout fr -variant azerty -option ctrl:nocaps
xkb_keymap {
xkb_keycodes { include "evdev+aliases(azerty)" };
xkb_types { include "complete" };
xkb_compat { include "complete" };
xkb_symbols { include "pc+fr(azerty)+inet(evdev)+ctrl(nocaps)" };
xkb_geometry { include "pc(pc105)" };
};
This defines the component names but not the actual keymap, punting that to the next part in the stack. This will turn out to be the Achilles heel. Keymap handling in the server has two distinct approaches:
  • During keyboard device init, the input driver passes RMLVO to the server, based on defaults or xorg.conf options
  • The server has its own rules file parser and creates the KcCGST component names (as above)
  • The server forks off xkbcomp and passes the component names to stdin
  • xkbcomp generates a keymap based on the components and writes it out as XKM file format
  • the server reads in the XKM format and updates its internal structs
This has been the approach for decades. To give you an indication of how fast-moving this part of the server is: XKM caching was the latest feature added... in 2009.

Driver initialisation is nice, but barely used these days. You set your keyboard layout in e.g. GNOME or KDE and that will apply it in the running session. Or run setxkbmap, for those with a higher affinity to neckbeards. setxkbmap works like this:

  • setxkbmap parses the rules file to convert RMLVO to KcCGST component names
  • setxkbmap calls XkbGetKeyboardByName and hands those component names to the server
  • The server forks off xkbcomp and passes the component names to stdin
  • xkbcomp generates a keymap based on the components and writes it out as XKM file format
  • the server reads in the XKM format and updates its internal structs
Notably, the RMLVO to KcCGST conversion is done on the client side, not the server side. And the only way to send a keymap to the server is that XkbGetKeyboardByName request - which only takes KcCGST, you can't even pass it a full keymap. This is also a long-standing potential issue with XKB: if your client tools use different XKB data files than the server, you don't get the keymap you expected.

Other parts of the stack do basically the same as setxkbmap which is just a thin wrapper around libxkbfile anyway.

Now, you can use xkbcomp on the client side to generate a keymap, but you can't hand it as-is to the server. xkbcomp can do this (using libxkbfile) by updating the XKB state one-by-one (XkbSetMap, XkbSetCompatMap, XkbSetNames, etc.). But at this point you're at the stage where you ask the server to knowingly compile a wrong keymap before updating parts of it.

So, realistically, the only way to get user-specific XKB layouts into the X server would require updating libxkbfile to provide the same behavior as libxkbcommon, update the server to actually use libxkbfile instead of its own copy, and updating xkbcomp to support the changes in part 2, part 3. All while ensuring no regressions in code that's decades old, barely maintained, has no tests, and, let's be honest, not particularly pretty to look at. User-specific XKB layouts are somewhat a niche case to begin with, so I don't expect anyone to ever volunteer and do this work [3], much less find the resources to review and merge that code. The X server is unlikely to see another real release and this is definitely not something you want to sneak in in a minor update.

The other option would be to extend XKB-the-protocol with a request to pass a full keymap to the server. Given the inertia involved and that the server won't see more full releases, this is not going to happen.

So as a summary: if you want custom keymaps on your machine, switch to Wayland (and/or fix any remaining issues preventing you from doing so) instead of hoping this will ever work on X. xmodmap will remain your only solution for X.

[1] Geometry is so pointless that libxkbcommon doesn't even implement this. It is a complex format to allow rendering a picture of your keyboard but it'd be a per-model thing and with evdev everyone is using the same model, so ...
[2] totally not coincidental btw
[3] libxkbcommon has been around for a decade now and no-one has volunteered to do this in the years since, so...

The Optimizations Continue

Optimizing transfer_map is one of the first issues I created, and it’s definitely one of the most important, at least as it pertains to unit tests. So many unit tests perform reads on buffers that it’s crucial to ensure no unnecessary flushing or stalling is happening here.

Today, I’ve made further strides in this direction for piglit’s spec@glsl-1.30@execution@texelfetch fs sampler2d 1x281-501x281:

before

MESA_LOADER_DRIVER_OVERRIDE=zink bin/texelFetch 4.41s user 1.92s system 71% cpu 8.801 total

after

MESA_LOADER_DRIVER_OVERRIDE=zink bin/texelFetch 4.22s user 1.72s system 76% cpu 7.749 total

More Speed Loops

As part of ensuring test coherency, a lot of explicit fencing was added around transfer_map and transfer_unmap. Part of this was to work around the lack of good barrier usage, and part of it was just to make sure that everything was properly synchronized.

An especially non-performant case of this was in transfer_unmap, where I added a fence to block after the buffer was unmapped. In reality, this makes no sense other than for the case of synchronizing with the compute batch, which is a bit detached from everything else.

The reason this fence continued to be needed comes down to barrier usage for descriptors. At the start, all image resources used for descriptors just used VK_IMAGE_LAYOUT_GENERAL for the barrier layout, which is fine and good; certainly this is spec compliant. It isn’t, however, optimally informing the underlying driver about the usage for the case where the resource was previously used with VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL to copy data back from a staging image.

Instead, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL can be used for sampler images since they’re read-only. This differs from VK_IMAGE_LAYOUT_GENERAL in that VK_IMAGE_LAYOUT_GENERAL contains both read and write—not what’s actually going on in this case given that sampler images are read-only.
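
As an illustration of what that looks like at the Vulkan level (a sketch only, not the actual zink code), the staging-copy-then-sample case is roughly this transition:

#include <vulkan/vulkan.h>

/* Sketch: move an image from the transfer-destination layout used for the
 * staging copy to the read-only layout a sampler descriptor wants. */
static void
transition_for_sampling(VkCommandBuffer cmdbuf, VkImage image)
{
   VkImageMemoryBarrier barrier = {
      .sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER,
      .srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT,
      .dstAccessMask = VK_ACCESS_SHADER_READ_BIT,
      .oldLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL,
      .newLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL,
      .srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
      .dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
      .image = image,
      .subresourceRange = {
         .aspectMask = VK_IMAGE_ASPECT_COLOR_BIT,
         .levelCount = 1,
         .layerCount = 1,
      },
   };
   vkCmdPipelineBarrier(cmdbuf,
                        VK_PIPELINE_STAGE_TRANSFER_BIT,
                        VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
                        0, 0, NULL, 0, NULL, 1, &barrier);
}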

No Other Results

I’d planned to post more timediffs from some piglit results, but I hit some weird i915 bugs and so the tests have yet to complete after ~14 hours. More to come next week.

September 03, 2020

A Quick Optimization

As I mentioned last week, I’m turning a small part of my attention now to doing some performance improvements. One of the low-hanging fruits here is adding buffer ranges; in short, this means that for buffer resources (i.e., not images), the driver tracks the ranges in memory of the buffer that have data written, which allows avoiding gpu stalls when trying to read or write from a range in the buffer that’s known to not have anything written.

util_range

The util_range API in gallium is extraordinarily simple. Here’s the key parts:

struct util_range {
   unsigned start; /* inclusive */
   unsigned end; /* exclusive */

   /* for the range to be consistent with multiple contexts: */
   simple_mtx_t write_mutex;
};

/* This is like a union of two sets. */
static inline void
util_range_add(struct pipe_resource *resource, struct util_range *range,
               unsigned start, unsigned end)
{
   if (start < range->start || end > range->end) {
      if (resource->flags & PIPE_RESOURCE_FLAG_SINGLE_THREAD_USE) {
         range->start = MIN2(start, range->start);
         range->end = MAX2(end, range->end);
      } else {
         simple_mtx_lock(&range->write_mutex);
         range->start = MIN2(start, range->start);
         range->end = MAX2(end, range->end);
         simple_mtx_unlock(&range->write_mutex);
      }
   }
}

static inline boolean
util_ranges_intersect(const struct util_range *range,
                      unsigned start, unsigned end)
{
   return MAX2(start, range->start) < MIN2(end, range->end);
}

When the driver writes to the buffer, util_range_add() is called with the range being modified. When the user then tries to map the buffer using a specified range, util_ranges_intersect() is called with the range being mapped to determine if a stall is needed or if it can be skipped.

Where do these ranges get added for the buffer?

Lots of places. Here’s a quick list:

  • struct pipe_context::blit
  • struct pipe_context::clear_texture
  • struct pipe_context::set_shader_buffers
  • struct pipe_context::set_shader_images
  • struct pipe_context::copy_region
  • struct pipe_context::create_stream_output_target
  • resource unmap and flush hooks
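
To sketch how a driver ties those two helpers together, here's a hypothetical example (the mydrv_* names and struct layout are made up for illustration, not the actual zink code):

/* Mesa-internal headers for pipe_resource/pipe_box and util_range. */
#include <stdbool.h>
#include "pipe/p_defines.h"
#include "pipe/p_state.h"
#include "util/u_range.h"

struct mydrv_resource {
   struct pipe_resource base;
   struct util_range valid_buffer_range; /* ranges containing GPU-written data */
};

/* called wherever the GPU writes [offset, offset + size) of a buffer */
static void
mydrv_mark_buffer_written(struct mydrv_resource *res, unsigned offset, unsigned size)
{
   util_range_add(&res->base, &res->valid_buffer_range, offset, offset + size);
}

/* called from transfer_map: a read only needs to stall if the mapped range
 * overlaps something the GPU may have written */
static bool
mydrv_map_needs_stall(struct mydrv_resource *res, const struct pipe_box *box,
                      unsigned usage)
{
   if (!(usage & PIPE_TRANSFER_READ))
      return false;
   return util_ranges_intersect(&res->valid_buffer_range, box->x, box->x + box->width);
}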

Some Numbers

The tests are still running to see what happens here, but I found some interesting improvements using my piglit timediff display after rebasing my branch yesterday for the first time in a bit and running the tests again overnight:

piglit-dmat.png

I haven’t made any changes to this codepath myself (that I’m aware of), so it looks like I’ve pulled in some improvement that’s massively cutting down the time required for the codepath that handles implicit 32bit -> 64bit conversions, and timediff picked it up.

September 01, 2020

Benchmarking vs Testing

Just a quick post for today to talk about a small project I undertook this morning.

As I’ve talked about repeatedly, I run a lot of tests.

It’s practically all I do when I’m not dumping my random wip code by the truckload into the repo.

The problem with this approach is that it doesn’t leave much time for performance improvements. How can I possibly have time to run benchmarks if I’m already spending all my time running tests?

Motivation

As one of the greatest bench markers of all time once said, If you want to turn a vision into reality, you have to give 100% and never stop believing in your dream.

Thus I decided to look for those gains while I ran tests.

Piglit already possesses facilities for providing HTML summaries of all tests (sidebar: the results that I posted last week are actually a bit misleading since they included a number of regressions that have since been resolved, but I forgot to update the result set). This functionality includes providing the elapsed time for each test. It has not, however, included any way of displaying changes in time between result sets.

Until now.

The Future Is Here

With my latest MR, the HTML view can show off those big gains (or losses) that have been quietly accumulating through all these high volume sets of unit tests:

piglit-timing.png

Is it more generally useful to other drivers?

Maybe? It’ll pick up any unexpected performance regressions (inverse gains) too, and it doesn’t need any extra work to display, so it’s an easy check, at the least.

August 31, 2020

This weekend the X1 Carbon with Fedora Workstation went live in North America on Lenovo's webstore. This is a big milestone for us and for Lenovo, as it's the first time Fedora ships pre-installed on a laptop from a major vendor and the first time the world's largest laptop maker ships premium laptops with Linux directly to consumers. Currently only the X1 Carbon is available, but more models are on the way and more geographies will be added soon. As a sidenote, the X1 Carbon and more has actually been available from Lenovo for a couple of months now; it is just the web sales that have now gone online. So if you are an IT department buying Lenovo laptops in bulk, be aware that you can already buy the X1 Carbon and the P1, for instance, through the direct-to-business sales channel.

Also, as a reminder for people looking to deploy Fedora laptops or workstations in numbers, be sure to check out Fleet Commander, our tool for helping you manage configurations across such a fleet.

I am very happy with the work that has been done here to get to this point, both by Lenovo and by the engineers on my team here at Red Hat. For example, Lenovo made sure to get all of their component makers to ramp up their Linux support, and we have been working with them both to help them get started writing drivers for Linux and to add infrastructure they could plug their hardware into. We also worked hard to get them all set up on the Linux Vendor Firmware Service so that you can be assured of getting updated firmware not just for the laptop itself, but also for its components.

We also have a list of improvements that we are working on to ensure you get the full benefit of your new laptops with Fedora and Lenovo, including improved power management features, such as power consumption profiles with a high-performance mode for some laptops, allowing them to run even faster when on AC power, and at the other end a low-power mode to maximize battery life. As part of that we are also working on adding lap detection support, so that we can ensure your laptop doesn't run too hot in your lap and burn you, and that radio antennas aren't transmitting too strongly when they are that close to your body.

So I hope you decide to take the leap and get one of the great developer laptops we are doing together with Lenovo. This is a unique collaboration between the world's largest laptop maker and the world's largest Linux company. What we are doing here isn't just a minimal hardware enablement effort, but a concerted effort to evolve Linux as a laptop operating system, and to do it in a proper open source way. This is the culmination of our work over the last few years: creating the LVFS, adding Thunderbolt support to Linux, improving fingerprint reader support in Linux, supporting HiDPI screens, supporting HiDPI mice, creating the possibility of a secure desktop with Wayland, working with NVidia to ensure that Mesa and the NVidia driver can co-exist through glvnd, and creating Flatpak to bring the advantages of containers to the desktop space in a vendor-neutral way. So when you buy a Lenovo laptop with Fedora Workstation, you are not just getting a great system, but you are also supporting our efforts to take Linux to the next level, something which I think we are truly the only Linux vendor with the scale and engineering ability to do.

Of course we are not stopping here, so let me also use this chance to talk a bit about some of our other efforts.

Toolbox
Containers are popular for deploying software, but a lot of people are also discovering that they are an incredible way to develop software, even if that software is not going to be deployed as a Flatpak or Kubernetes container. The term often used for containers used as a development tool is pet containers, and with the Toolbox project we are aiming to create the best tool possible for developers to work with pet containers. Toolbox allows you to always have a clean environment to work in, which you can change to suit each project you work on, however you like, without affecting your host system. So for instance, if you need to install a development snapshot of Python you can do that inside your Toolbox container and be confident that various other parts of your desktop will not start crashing due to the change. And when you are done with your project and don't want that toolbox around anymore, you can easily delete it without having to spend time figuring out which of the packages you installed can now be safely uninstalled from your host system, or just not bother and have your host get bloated over time with stuff you are not actually using anymore.

One big advantage we have at Red Hat is that we are a major contributor to container technologies across the stack. We are a major participant in the Open Container Initiative and, alongside Google, the biggest contributor to the Kubernetes project. This includes having created a set of container tools called Podman. So when we started prototyping Toolbox we could base it on podman and get access to all the power and features that podman provides, but at the same time make them easier to use and consume from your developer laptop or workstation.

Our initial motivation was also driven by the fact that for image-based operating systems like Fedora Silverblue and Fedora CoreOS, where the host system is immutable, you still need some way to be able to install packages and do development, but we quickly realized that the pet container development model is superior to the old 'on host' model even if you are using a traditional package-based system like Fedora Workstation. So we started out by prototyping the baseline functionality, writing it as a shell script to quickly test out our initial ideas. Of course, as Toolbox picked up in popularity we realized we needed to transition to a proper development language so that we wouldn't end up with an unmaintainable mess written in shell, and thus Debarshi Ray and Ondřej Míchal have recently completed the rewrite to Go (note: the choice of Go was to make it easier for the wider container community to contribute, since almost all container tools are written in Go).

Leading up to Fedora Workstation 33 we are trying to figure out a few things. One is how we can give you access to a RHEL-based toolbox through the Red Hat Developer Program in an easy and straightforward manner, and this is another area where pet container development shines. You can set up your pet container to run a different Linux version than your host. So you can use Fedora to get the latest features for your laptop, but target RHEL inside your Toolbox to get an easy and quick deployment path to your company's RHEL servers. I would love it if we could extend this even further as we go along, to for instance let you set up a Steam runtime toolbox to do game development targeting Steam.
Setting up a RHEL toolbox is already technically possible, but requires a lot more knowledge and understanding of the underlying technologies than we would wish.
The second thing we are looking at is how we deal with graphical applications in the context of these pet containers. The main reason is that while you can install, for instance, Visual Studio Code inside the toolbox container and launch it from the command line, we realize that is not a great model for how you interact with GUI applications. At the moment the only IDE that is set up to run on the host but is able to interact with containers properly is GNOME Builder, but we realize that there are a lot more IDEs people are using, and thus we want to come up with ways to make them work better with toolbox containers beyond launching them from the command line from inside the container. There are some extensions available for things like Visual Studio Code that are starting to improve things (those extensions are not created by us, but they are looking at solving a similar problem), but we want to see how we can help provide a polished experience here. Over time we do believe the pet container model of development is so good that most IDEs will follow in GNOME Builder's footsteps and make in-container development a core part of the feature set, but for now we need to figure out a good bridging strategy.

Wayland – headless and variable refresh rate.
Since switching to Wayland we have continued working to improve how GNOME works under Wayland, to remove any major feature regressions from X11 and to start taking advantage of the opportunities that Wayland gives us. One of the last issues Jonas Ådahl has been hard at work on recently is ensuring we have headless support for running GNOME on systems without a screen. We know that there are a lot of sysadmins, for instance, who want to be able to launch a desktop session on their servers to be used as a tool to test and debug issues. These desktops are then accessed through tools such as VNC or Nice DCV. As part of that work he also made sure we can deal with having multiple monitors connected that have different refresh rates. Before that fix you would get the lowest common denominator between your screens, but now if you have, for instance, a 60Hz monitor and a 75Hz monitor, they will be able to function independently of each other and run at their maximum refresh rates. With the variable refresh rate work now landed upstream, Jonas is racing to get the headless support finished and landed in time for Fedora Workstation 33.

Linux Vendor Firmware Service
Richard Hughes is continuing his work on moving the LVFS forward, having spent time this cycle working with the Linux Foundation to ensure the service can scale even better. He is also continuously onboarding new vendors and helping existing vendors use LVFS for even more things. LVFS has become so popular that we are now getting reports of major hardware companies who up to now haven't been too interested in the LVFS being told by their customers to start using it or they will switch suppliers. So expect the rapid growth of vendors joining the LVFS to keep increasing. It is also worth noting that many of the vendors who are already set up on LVFS are steadily working on increasing the number of systems they support on it and pushing their suppliers to do the same. For enterprise use of LVFS firmware, Marc Richter also wrote an article on access.redhat.com about how to use LVFS with Red Hat Satellite. Satellite, for those who don't know it, is Red Hat's tool for managing and keeping servers up to date and secure. For large companies, having their machines, especially servers, access LVFS directly is not wanted behaviour, so now they can use Satellite to provide a local repository of the LVFS firmware.

PipeWire
One of the changes we have been working on that I am personally extremely excited about is PipeWire. For those of you who don't know it, PipeWire is one of our major swamp-draining efforts, which aims to bring together audio, pro-audio and video under Linux and provide a modern infrastructure for us to move forward. It does so, however, while being ABI compatible with both Jack and PulseAudio, meaning that applications will not need to be ported to work with PipeWire. We have been using it for a while for video already, to handle screen capture under Wayland and to allow Flatpak containers access to webcams in a secure way, and Wim Taymans has been working tirelessly on moving that project forward over the last 6 months, focusing a lot on fixing corner cases in the Jack support and also ramping up the PulseAudio support. We had hoped to start wide testing of the audio parts of PipeWire in Fedora Workstation 32, but since a key advantage PipeWire brings is not just replacing Jack or PulseAudio, but also ensuring the two usecases co-exist and interact properly, we didn't want to start asking people to test until we got the PulseAudio support close to being production ready. Wim has been making progress by leaps and bounds recently, and while I can't 100% promise it yet, we do expect to roll out the audio bits of PipeWire for more widescale testing in Fedora Workstation 33, with the goal of making it the default for Fedora Workstation 34 or, more likely, Fedora Workstation 35.
Wim is doing an internal demo this week, so I will try to put out a blog post talking about that later in the week.

Flatpak – incremental updates
One of the features we added to Flatpaks was the ability to distribute them as Open Container Initiative compliant containers. The reason for this was that as companies, Red Hat included, built infrastructure for hosting and distributing containers, we could also use that for Flatpaks. This is obviously a great advantage for a variety of reasons, but it had one large downside compared to the traditional way of distributing Flatpaks (as OSTree images), which is that each update comes as a single large download as opposed to the incremental updates that OSTree provides.
Which is why, if you compare the same application shipped from Flathub, which uses OSTree, versus from the Fedora container registry, you will quickly notice that you get a lot smaller updates from Flathub. For Kubernetes containers this hasn't been considered a huge problem, as their main usecase is copying the containers around on a high-speed network inside your cloud provider, but for desktop users this is annoying. So Alex Larsson and Owen Taylor have been working on a way to do incremental updates for OCI/Docker/Kubernetes containers too, which not only means we can get very close to the Flathub update size in the Fedora Container Catalog, but also means that since we implemented this in a way that works for all OCI/Kubernetes containers, you will be able to get incremental update functionality for them too. That matters especially as such containers are making their way into edge computing, where update sizes do matter, just like they do on the desktop.

Hangul input under Wayland
Red Hat, like Lenovo, targets most of the world with our products and projects. This means that we want them to work great even for people who don't use English or another European language. To achieve this we have a team dedicated to ensuring that not just Linux, but all Red Hat products work well for international users, as part of my group at Red Hat. That team, led by Jens Petersen, is distributed around the globe with engineers in Japan, China, India, Singapore and Germany. This team contributes to a lot of different things like font maintenance, input method development, i18n infrastructure and more.
One thing this team recently discovered was that support for Korean input under Wayland was lacking. So Peng Wu, Takao Fujiwara and Carlos Garnacho worked together to come up with a series of patches for ibus and GNOME Shell to ensure that Fedora Workstation on Wayland works perfectly for Korean input. I wanted to highlight this effort because, while I don't usually mention efforts with such a regional impact in my blog posts, it is a critical part of keeping Linux viable and usable across the globe. Ensuring that you can use your computer in your own language is something we feel is important and want to enable, and it is also an area where I believe Red Hat is investing more than any other vendor out there.

GLX on EGL
We meet with NVidia on a regular basis to discuss topics of shared interest, and one thing we have been looking at for a while now is the best way to support the NVidia binary driver under XWayland. As part of that, Adam Jackson has been working on a research project to see how feasible it would be to create a way to run GLX applications on top of EGL. As one might imagine, EGL doesn't have a 1-to-1 match with the GLX APIs, but based on what we have seen so far it should be close enough to get things going (Adam already got glxgears running :). The goal here would be to have an initial version that works ok, and then in collaboration with NVidia we can evolve it to be a great solution for even the most demanding OpenGL/GLX applications. Currently the code causes an extra memcopy compared to running on GLX natively, but this is something we think can be resolved in collaboration with NVidia. Of course this is still an early-stage effort and Adam and NVidia are currently looking at it, so there is of course still a chance we will hit a snag and have to go back to the drawing board. For those interested, you can take a look at this Mesa merge request to see the current state.

Hi!

I haven’t said Hi for a while when starting a post. I think the rush and the whirlwind of things happening during the GSoC made me a little agitated. This year, my project was the only one accepted for X.Org Foundation, and I felt a great responsibility. Well, this is the last week of the project, and I’m slowing down and breathing :)

This report is a summary of my journey at Google Summer of Code 2020. Experience and technical reports can be read in more detail in the 11 blog posts I published during this period:

Date Post
2020/05/13 I’m in - GSoC 2020 - X.Org Foundation
2020/05/20 Everyone makes a script
2020/06/02 Status update - Tie up loose ends before starting
2020/06/03 Walking in the KMS CURSOR CRC test
2020/06/15 Status update - connected errors
2020/07/06 GSoC First Phase - Achievements
2020/07/17 Increasing test coverage in VKMS - max square cursor size
2020/08/12 The end of an endless debugging of an endless wait
2020/08/19 If a warning remains, the job is not finished.
2020/08/27 Another day, another mistery
2020/08/28 Better validation of alpha-blending

So, the Google Summer of Code was an amazing experience! I have evolved not only technically, but also as a developer in a software community. In this period, I could work on different projects and even interact with their community. I contributed to three projects with various purposes, sizes, and maturities, and I describe below a little of my relationship with each of them:

DRM/Linux Kernel

The Linux kernel is one of the largest, most famous, and most mature free and open-source project. It is also the kernel that I have been using for over ten years. The development of Linux is so interesting to me that I chose it as a case study of my Master’s in Computer Science research.

Among the various subsystems in the project, I have contributed to DRM, the part of Linux responsible for the interface with GPUs. It provides user-space with an API to send commands and data in a format suitable for modern GPUs. It is the kernel-space component of graphics stacks like the X.Org Server.

And it was the X.Org Foundation, the organization that supported my project in GSoC. Thanks to the support from the DRI community and the X.Org Foundation, I have contributed over the past few months to improving VKMS. The Virtual Kernel Mode Setting is a software-only model of a KMS driver that allows you to test DRM and run X on systems without hardware display capability.

IGT GPU Tools

IGT is a set of tools used to develop and validate DRM drivers. These tools can be used by drivers other than the Intel ones, and I used it extensively to evolve VKMS. Using IGT to improve VKMS can be very useful for validating patches sent to the core of DRM, that is, performing automated tests against new code changes with no need for real hardware.

With this concern, all my work on GSoC aimed to bring greater stability to the execution of IGT tests using VKMS. IGT test cases guided most of my contributions to VKMS. Before sending any code, I took care to validate that, with my change, the tests I have any familiarity with remained stable and working properly.

KWorkFlow

Kworkflow is a set of scripts that I use in my development environment for the Linux kernel. It greatly facilitates the execution of routine tasks of coding, examining the code, and sending patches.

My mentor, Rodrigo Siqueira, developed it, and other students and former students of computer science at the university have contributed to add functionality and evolve the project. It supports compiling and deploying your local kernel version, helps you browse the code and the change history, and provides essential information for patch formatting.

With these three projects, I had an exciting development journey and many lessons learned to share. Here is a summary of that experience:

From start to finish

The general purpose of my GSoC project was to evolve VKMS using IGT tests. In this way, I used the kms_cursor_crc test case as a starting point to fix and validate features, adjust behaviors, and increase test coverage in VKMS.


[KMS cursor crc uses]
the display CRC support to validate cursor plane functionality.
The test will position the cursor plane either fully onscreen,
partially onscreen, or fully offscreen, using either a fully opaque
or fully transparent surface. In each case, it enables the cursor plane
and then reads the PF CRC (hardware test) and compares it with the CRC
value obtained when the cursor plane was disabled and its drawing is
directly inserted on the PF by software.

In my project proposal, I presented the piglit statistics for kms_cursor_crc using VKMS: seven successful test cases, two failures, one warning (under development), and 236 skips. Now, I can present an overview of this state and the improvements mapped and applied during this GSoC period:

Failing tests

Initially, three test cases failed using VKMS. Before GSoC started, I had already sent a proposal that moved the first one from failure to success with a warning; it was related to the composition of planes considering the alpha channel. The second was a case that had worked two years ago. The third had never worked before. These last two were related to the behavior of the cursor plane in power management tasks.

From failure to warning

Since VKMS did not consider the alpha channel for blending, it zeroed that channel before computing the crc. However, the operation that would do this was zeroing the wrong channel, due to an endianness trap. It led the test case to failure. Even after fixing this operation, the test still emitted a warning. This happened because, when zeroing the alpha channel, VKMS was delivering a fully transparent black primary plane for capturing CRC, while the background should be a solid black.

A cross-cutting problem that affected the performance of VKMS for sequential subtests execution

During the sequential execution of the kms_cursor_crc test cases, the results were unstable. Successful tests failed, two runs of the same test alternated between failure and success … in short, a mess.

In debugging and examining the change history, I found a commit that changed how VKMS performs in the kms_cursor_crc test cases. This change replaced the drm_wait_for_vblanks function with drm_wait_flip_done and, therefore, VKMS stopped "forcing" vblank interrupts during the state commit process. Without vblank interruptions, the execution of testcases in sequence got stuck. Forcing vblanks was just a stopgap and not a real solution. Besides, the IGT test itself was also leaving a kind of trash behind when a failure happened, since it did not complete the cleanup, which affected the next subtest.

Skips due to unmet of test requirements

Skips were caused by the lack of the following features in VKMS:

  • Cursor sizes larger than 64x64 and non-square cursor (few drivers take this approach)
  • Support for more than one CRTC (still not developed):

  Test requirement not met in function igt_require_pipe, file ../lib/igt_kms.c:1900:
  Test requirement: !(!display->pipes[pipe].enabled)
  Pipe F does not exist or not enabled

So, for each of these issues, I needed to identify the problem, find out which project it was coming from, map what was required to resolve it, and then code the contribution. With that, I had to combine work on two fronts: DRM and IGT. Also, with the more intensive use of my development environment, I needed to develop some improvements to the tool that supports me in compiling, installing, and exploring the kernel using a virtual machine. As a consequence, I also sent some contributions to the Kworkflow project.

Patches sent during GSoC 2020 to solve the above issues

## Project Patch Status Blog post
01 DRM/Linux kernel drm/vkms: change the max cursor width/height Accepted Link
02 DRM/Linux kernel drm/vkms: add missing drm_crtc_vblank_put to the get/put pair on flush Discarded -
03 DRM/Linux kernel drm/vkms: fix xrgb on compute crc Accepted -
04 DRM/Linux kernel drm/vkms: guarantee vblank when capturing crc (v3) Accepted Link
05 DRM/Linux kernel drm/vkms: add alpha-premultiplied color blending (v2) Accepted Link
06 IGT GPU Tools lib/igt_fb: remove extra parameters from igt_put_cairo_ctx Accepted Link
07 IGT GPU Tools [i-g-t,v2,1/2] lib/igt_fb: change comments with fd description Accepted Link
08 IGT GPU Tools [i-g-t,v2,2/2] test/kms_cursor_crc: update subtests descriptions and some comments Accepted Link
09 IGT GPU Tools [i-g-t,v2,1/1] test/kms_cursor_crc: release old pipe_crc before create a new one Accepted Link
10 IGT GPU Tools [i-g-t,2/2] test/kms_cursor_crc: align the start of the CRC capture to a vblank Discarded Link
11 IGT GPU Tools [i-g-t] tests/kms_cursor_crc: refactoring cursor-alpha subtests Under review Link
12 Kworkflow src: add grep utility to explore feature Merged Link
13 Kworkflow kw: small issue on u/mount alert message Merged Link
14 Kworkflow Add support for deployment in a debian-VM Under review Link

With the patches sent, I collaborated to:

  • Increasing test coverage in VKMS
  • Bug fixes in VKMS
  • Adjusting part of the VKMS design to its peculiarities as a fake driver allowing to stabilize its work performance
  • Fixing a leak in kms_cursor_crc cleanup when a testcase fails
  • Improving different parts of the IGT tool documentation
  • Increasing the effectiveness of testing cursor plane alpha-composition
  • Improve my development environment and increase my productivity.

Not only coding, getting involved in the community

In addition to sending code improvements, my initial proposal included adding support for real overlay planes. However, I participated in other community activities that led me to adapt a portion of my project to meet emerging and more urgent demands.

I took a long time debugging the VKMS's unstable behavior together with other DRM community members. Initially, I was working in isolation on the solution. However, I realized that the problem had a history and that it would be more productive to talk to other developers to find a suitable solution. When I saw on the mailing list that another developer had encountered a similar problem, I joined the conversation. This experience was very enriching, and I had the support and guidance of DRM's maintainer, Daniel Vetter. Debugging together with Daniel and Sidong Yang, we got a better solution to the problem. Finally, it seems that somehow this debate also contributed to another developer, Leandro Ribeiro, in his work on another VKMS issue.

In this debugging process, I also gained more experience and confidence concerning the project. So, I reviewed and tested some patches sent to VKMS [1,2,3,4]. Finally, with the knowledge acquired, I was also able to contribute to the debugging of a feature under development that adds support for writeback [5].

Discussion and future works

The table below summarizes my progress in stabilizing the execution of kms_cursor_crc using VKMS.

Status Start End
pass 7 25 (+18)
warn 1 0 (-1)
fail 2 0 (-2)
skip 236 221 (-15)
total 246 246

To solve the warning triggered by the test case pipe-A-cursor-alpha-transparent, I needed to develop an already mapped feature: TODO: Use the alpha value to blend vaddr_src with vaddr_dst instead of overwriting it in blend(). This feature showed that another test case, the pipe-A-cursor-alpha-opaque, had a false-pass. Moreover, both test cases related to the composition of the cursor plane with the primary plane were not sufficient to verify the composition's correctness. As a result, I sent IGT a refactoring of the test cases to improve coverage.

The handling of the two failures initially reported came out of the same debugging process. However, they had different stories. The dpms test case had already been successful in the past, according to a commit in the git log. It started to fail after a change which was correct but evidenced the driver's deficiency. There was no report that the suspend test case ever worked correctly, but it lacked the same thing: ensuring that vblank interruptions are occurring for the composition work and CRC captures.

The remaining skips are related to:

  1. non-square cursor support, which, as far as I know, is a very specific feature of Intel drivers. If this restricted applicability is confirmed, the right thing to do is to move these test cases to the i915-specific driver test folder.
  2. the test requirement for more than one pipe: "Pipe (B-F) does not exist or not enabled". This needs more complex work to add support for more than one CRTC.

I also started examining how to support the real overlay:

  • create a parameter to enable the overlay;
  • enable the third type of plane overlay in addition to the existing primary and cursor;
  • and allow the blending of the three planes.

Lessons learned

  1. In one of my first contributions to VKMS I received feedback that was, let's say, scary. To this day, I have no idea who that person was, and maybe he didn't want to scare me, but his lack of politeness in saying 'That looks horrid' was pretty disheartening for a beginner. Luckily I didn't take it too seriously; I continued my journey, started GSoC, and realized that the most active developers in the community are very friendly and inclusive. I learned a lot, and I would like to thank them for sharing their knowledge and information.
  2. Still related to the previous topic, I feel that the less serious developers also don't care much about contribution etiquette and the code of conduct. By not knowing or by overlooking the rules of the community, they end up creating discomfort for other developers. In the DRM community, I noticed care in maintaining the community and bringing in new contributors. I did not feel unworthy or discredited in any discussion with the others. On the contrary, I felt very encouraged to continue, to learn, and to contribute.
  3. In addition to the diversity of skills, maturity, and culture of the developers, the Linux community deals with timezone differences. It’s crazy to see the energy of the DRI community on IRC channel, even with people sleeping and waking up at random times.
  4. I still don’t feel very comfortable talking on the IRC channel, for two reasons: I always take time to understand the subject being discussed technically, and my English is not very good. The dialogues I had by e-mail were very useful. It gave me time to think and check the validity of my ideas before speaking. It was still a record/documentation for future contributions.
  5. The Linux kernel is very well documented. For most questions, there is an answer somewhere in the documentation. However, it is not always immediate to find the answer because it is an extensible project. Besides, a lot has already been encapsulated and encoded for reuse, so creating something from scratch can often be unnecessary.

And after GSoC

Well, my first, short-term plan is to make a good presentation at XDC 2020. In it, I will highlight interesting cases on my project of working on IGT and VKMS together. My presentation will be on the first day, September 16; more details:

VKMS improvements using IGT GPU Tools

My second plan is to continue contributing to VKMS, initially adding the mapped features of a real overlay and structuring VKMS for more than one CRTC. With writeback support coming in, doing these things will probably require even more thought.

Not least, I want to finish my master’s degree and complete my research on Linux development, adding my practical experiences with VKMS development. I’m almost there, hopefully later this year. And then, I hope to find a job that will allow me to work on Linux development. And live. :P

Acknowledgment

Finally, I thank X.Org Foundation for accepting my project and believing in my performance (since I was the organization’s sole project in this year’s GSoC). Also, Trevor Woerner for motivation, communication and confidence! He was always very attentive, guiding me and giving tips and advice.

Thanks to my mentor, Rodrigo Siqueira, who believed in my potential, openly shared knowledge that he acquired with great effort, gave relevance to each question that I presented, and encouraged me to talk with the community. Many thanks also to the DRI community and Daniel Vetter for sharing their time and so much information and knowledge with me, and for being friendly and giving me constructive feedback.

  1. https://patchwork.freedesktop.org/patch/374926/
  2. https://patchwork.freedesktop.org/patch/377557/
  3. https://patchwork.freedesktop.org/patch/381344/
  4. https://patchwork.freedesktop.org/patch/387714/
  5. https://patchwork.freedesktop.org/series/80961/

This is the continuation from these posts: part 1, part 2

Let's talk about everyone's favourite [1] keyboard configuration system again: XKB. If you recall the goal is to make it simple for users to configure their own custom layouts. Now, as described earlier, XKB-the-implementation doesn't actually have a concept of a "layout" as such, it has "components" and something converts your layout desires into the combination of components. RMLVO (rules, model, layout, variant, options) is what you specify and gets converted to KcCGST (keycodes, compat, geometry, symbols, types). This is a one-way conversion, the resulting keymaps no longer has references to the RMLVO arguments. Today's post is about that conversion, and we're only talking about libxkbcommon as XKB parser because anything else is no longer maintained.

The peculiar thing about XKB data files (provided by xkeyboard-config [3]) is that the filename is part of the API. You say layout "us" variant "dvorak", the rules file translates this to symbols 'us(dvorak)' and the parser will understand this as "load file 'symbols/us' and find the dvorak section in that file". [4] The default "us" keyboard layout will require these components:


xkb_keymap {
xkb_keycodes { include "evdev+aliases(qwerty)" };
xkb_types { include "complete" };
xkb_compat { include "complete" };
xkb_symbols { include "pc+us+inet(evdev)" };
xkb_geometry { include "pc(pc105)" };
};
So the symbols are really: file symbols/pc, add symbols/us and then the section named 'evdev' from symbols/inet [5]. Types are loaded from types/complete, etc. The lookup paths for libxkbcommon are $XDG_CONFIG_HOME/xkb, /etc/xkb, and /usr/share/X11/xkb, in that order.

Most of the symbols sections aren't actually full configurations. The 'us' default section only sets the alphanumeric rows, everything else comes from the 'pc' default section (hence: include "pc+us+inet(evdev)"). And most variant sections just include the default one, usually called 'basic'. For example, this is the 'euro' variant of the 'us' layout which merely combines a few other sections:


partial alphanumeric_keys
xkb_symbols "euro" {

include "us(basic)"
name[Group1]= "English (US, euro on 5)";

include "eurosign(5)"

include "level3(ralt_switch)"
};
Including things works as you'd expect: include "foo(bar)" loads section 'bar' from file 'foo' and this works for 'symbols/', 'compat/', etc., it'll just load the file in the same subdirectory. So yay, the directory is kinda also part of the API.

Alright, now you understand how KcCGST files are loaded, much to your despair.

For user-specific configuration, we could already load a 'custom' layout from the user's home directory. But it'd be nice if we could just add a variant to an existing layout. Like "us(banana)", because potassium is important or somesuch. This wasn't possible because the filename is part of the API. So our banana variant had to be in $XDG_CONFIG_HOME/xkb/symbols/us and once we found that "us" file, we could no longer include the system one.

So as of two days ago, libxkbcommon now extends the parser to have merged KcCGST files, or in other words: it'll load the symbols/us file in the lookup path order until it finds the section needed. With that, you can now copy this into your $XDG_CONFIG_HOME/xkb/symbols/us file and have it work as variant:


partial alphanumeric_keys
xkb_symbols "banana" {

include "us(basic)"
name[Group1]= "English (Banana)";

// let's assume there are some keymappings here
};
And voila, you now have a banana variant that can combine with the system-level "us" layout.

And because there must be millions [6] of admins out there that maintain custom XKB layouts for a set of machines, the aforementioned /etc/xkb lookup path was also recently added to libxkbcommon. So we truly now have the typical triplet of lookup paths:

  • vendor-provided ones in /usr/share/X11/xkb,
  • host-specific ones in /etc/xkb, and
  • user-specific ones in $XDG_CONFIG_HOME/xkb [7].
Good times, good times.

[1] Much in the same way everyone's favourite Model T colour was black
[2] This all follows the UNIX philosophy, there are of course multiple tools involved and none of them know what the other one is doing
[3] And I don't think Sergey gets enough credit for maintaining that pile of language oddities
[4] Note that the names don't have to match, we could map layout 'us' to the symbols in 'banana' but life's difficult enough as it is
[5] I say "add" when it's sort of a merge-and-overwrite and, yes, of course there are multiple ways to combine layouts, thanks for asking
[6] Actual number may be less
[7] Notice how "X11" is missing in the latter two? If that's not proof that we want to get rid of X, I don't know what is!

At Last, Geometry Shaders

I’ve mentioned GS on a few occasions in the past without going too deeply into the topic. The truth is that I’m probably not ever going to get that deep into the topic.

There’s just not that much to talk about.

The 101

Geometry shaders take the vertices from previous shader stages (vertex, tessellation) and then emit some sort of (possibly-complete) primitive. EmitVertex() and EndPrimitive() are the big GLSL functions that need to be cared about, and there’s gl_PrimitiveID and gl_InvocationID, but hooking everything into a mesa (gallium) driver is, at the overview level, pretty straightforward:

  • add struct pipe_context hooks for the gs state creation/deletion, which are just the same wrappers that every other shader type uses
  • make sure to add GS state saving to any util_blitter usage
  • add a bunch of GS shader pipe cap handling
  • the driver now supports geometry shaders
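
For the gallium side, the wiring is the same pattern every other shader stage uses. A hypothetical sketch (the mydrv_* names are made up, and mydrv_compile_shader, mydrv_free_shader and mydrv_context are assumed driver helpers, so this is not actual zink code):

/* Hypothetical driver wiring for geometry shader state, sketch only. */
static void *
mydrv_create_gs_state(struct pipe_context *pctx,
                      const struct pipe_shader_state *shader)
{
   /* same compile-and-wrap path the VS/FS hooks use, just another stage */
   return mydrv_compile_shader(pctx, shader);
}

static void
mydrv_bind_gs_state(struct pipe_context *pctx, void *gs)
{
   mydrv_context(pctx)->gfx_stages[PIPE_SHADER_GEOMETRY] = gs;
}

static void
mydrv_delete_gs_state(struct pipe_context *pctx, void *gs)
{
   mydrv_free_shader(pctx, gs);
}

void
mydrv_context_gs_init(struct pipe_context *pctx)
{
   pctx->create_gs_state = mydrv_create_gs_state;
   pctx->bind_gs_state = mydrv_bind_gs_state;
   pctx->delete_gs_state = mydrv_delete_gs_state;
   /* and remember util_blitter_save_geometry_shader() around blitter calls */
}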

But Then Zink

The additional changes needed in zink are almost entirely in ntv. Most of this is just expanding existing conditionals which were previously restricted to vertex shaders to continue handling the input/output variables in the same way, though there’s also just a bunch of boilerplate SPIRV setup for enabling/setting execution modes and the like that can mostly be handled with ctrl+f in the SPIRV spec with shader_info.h open.

One small gotcha is again gl_Position handling. Since this was previously transformed in the vertex shader to go from an OpenGL [-1, 1] depth range to a Vulkan [0, 1] range, any time a GS loads gl_Position it then needs to be un-transformed, as (at the point of the GS implementation patch) zink doesn’t support shader keys, and so there will only ever be one variant of a shader, which means the vertex shader must always perform this transform. This will be resolved at a later point, but it’s worth taking note of now.
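
(For reference, the usual GL-to-Vulkan fixup is z_vk = (z_gl + w) / 2 with w unchanged, so un-transforming in the GS amounts to computing z_gl = 2 * z_vk - w before treating the value as a GL-style position again.)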

The other point to note is, as always, I know everyone’s tired of it by now, transform feedback. Whereas vertex shaders emit this info at the end of the shader, things are different in the case of geometry shaders. The point of transform feedback is to emit captured data at the time shader output occurs, so now xfb happens at the point of EmitVertex().

And that’s it.

Sometimes, but only sometimes, things in zink-land don’t devolve into insane hacks.

August 28, 2020

I recently posted about a feature I developed for VKMS to consider the alpha channel in the composition of the cursor plane with the primary plane. This process took a little longer than expected, and now I can identify a few reasons:

  • Beginner: I had little knowledge of computer graphics and its operations
  • Maybe cliché: I did not consider that in the subsystem itself, there was already material to aid coding.
  • The unexpected: the test cases that checked the composition of a cursor considering the alpha channel were successful, even with a defect in the implementation.

IGT GPU Tools has two test cases in kms_cursor_crc to check the cursor blend in the primary plane: the cursor-alpha-opaque and the cursor-alpha-transparent. These two cases are structured in the same way, changing only the value of the alpha channel - totally opaque 0xFF or totally transparent 0x00.

In a brief description, the test checks the composition as follows:

  1. Creates an XRGB primary plane framebuffer with a black background
  2. Creates an ARGB framebuffer for the cursor plane with white color and a given alpha value (0xFF or 0x00)
  3. Enables the cursor on hardware and captures the plane's CRC after composition (hardware)
  4. Disables the cursor on hardware, draws a cursor directly on the primary plane and captures the CRC (software)
  5. Compares the two CRCs captured: they must have equal values.

After implementing alpha blending using the straight alpha formula in VKMS, both tests were successful. However, the equation was not correct.

To paint, IGT uses the Cairo library. For the primary plane, the test uses CAIRO_FORMAT_RGB24, and CAIRO_FORMAT_ARGB32 for the cursor. According to the documentation, the ARGB32 format stores the pixel color in the pre-multiplied alpha representation:

CAIRO_FORMAT_ARGB32

each pixel is a 32-bit quantity, with alpha in the upper 8 bits, then red, then
green, then blue. The 32-bit quantities are stored native-endian.
Pre-multiplied alpha is used. (That is, 50% transparent red is 0x80800000, not
0x80ff0000.) (Since 1.0)

In a brief dialogue with Pekka about endianness on DRM, he showed me information from the DRM documentation that I didn’t know about. According to it, DRM converges with the representation used by Cairo:


Current DRM assumption is that alpha is premultiplied, and old userspace can
break if the property defaults to anything else.

It is also possible to find information about DRM’s Plane Composition Properties, such as the existence of pixel blend mode to add a blend mode for alpha blending equation selection, describing how the pixels from the current plane are composited with the background.

Finally, you can find there the pre-multiplied alpha blending equation:

out.rgb = plane_alpha * fg.rgb + (1 - (plane_alpha * fg.alpha)) * bg.rgb
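
As an illustration (a hypothetical helper, not the VKMS code), blending a single premultiplied ARGB8888 cursor pixel over an opaque background with plane_alpha fixed at 1 reduces the equation to out.rgb = fg.rgb + (1 - fg.alpha) * bg.rgb:

#include <stdint.h>

/* Sketch: premultiplied "fg over bg" for one ARGB8888 pixel, plane_alpha = 1. */
static uint32_t
blend_premultiplied(uint32_t fg, uint32_t bg)
{
   uint32_t fg_a = (fg >> 24) & 0xff;
   uint32_t out = 0xff000000; /* result stays opaque */

   for (int shift = 0; shift <= 16; shift += 8) {
      uint32_t fg_c = (fg >> shift) & 0xff;
      uint32_t bg_c = (bg >> shift) & 0xff;
      /* fg is already premultiplied, so there is no fg_a * fg_c term here */
      uint32_t c = fg_c + ((255 - fg_a) * bg_c + 127) / 255;
      out |= (c > 255 ? 255 : c) << shift;
   }
   return out;
}

With the straight-alpha formula the first term would be fg_a * fg_c / 255 instead, which only happens to give the same result when fg.alpha is 0x00 or 0xFF, which is exactly why extreme alpha values alone can't catch the defect.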

From this information, we see that both IGT and DRM use the same representation, but the current test cases of kms_cursor_crc do not show the defect in using the straight-alpha formula.

With this in mind, I thought of refactoring the test cases so that they could validate translucent cursors and "remove some zeros" from the equation. After sharing thoughts with my mentor, Siqueira, I decided to combine the two testcases (cursor-alpha-opaque and cursor-alpha-transparent) into one and refactor it so that the testcase verifies not only extreme alpha values, but also translucent values.

Therefore, the submitted test case proposal follows these steps (a rough sketch of this loop appears after the list):

  1. Creates an XRGB primary plane framebuffer with a black background
  2. Creates an ARGB framebuffer for the cursor plane and enables the cursor on hardware
  3. Paints the cursor white with a range of alpha values (from 0xFF down to 0x00)
  4. For each alpha value, captures the plane’s CRC after composition into an array of CRCs (hardware)
  5. Disables the cursor on hardware
  6. Draws the cursor directly on the primary plane following the same range of alpha values (software)
  7. Captures the CRC and compares it with the corresponding CRC in the array of hardware CRCs: they must be equal
  8. Clears the primary plane and goes back to step 6
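
The sketch below outlines that flow; the helpers (enable_cursor(), draw_white_cursor(), capture_crc(), and friends) are placeholders for illustration, assuming the usual kms_cursor_crc fixture, and are not the code in the submitted patch:

#define N_ALPHA_STEPS 16

static void test_cursor_alpha_range(data_t *data)
{
    igt_crc_t hw_crc[N_ALPHA_STEPS], sw_crc;

    /* hardware pass: cursor plane enabled, one CRC per alpha value */
    enable_cursor(data);
    for (int i = 0; i < N_ALPHA_STEPS; i++) {
        double alpha = 1.0 - (double)i / (N_ALPHA_STEPS - 1);
        draw_white_cursor(data, alpha);
        hw_crc[i] = capture_crc(data);
    }
    disable_cursor(data);

    /* software pass: draw the cursor directly on the primary plane and
     * compare each CRC against the stored hardware CRC */
    for (int i = 0; i < N_ALPHA_STEPS; i++) {
        double alpha = 1.0 - (double)i / (N_ALPHA_STEPS - 1);
        draw_cursor_on_primary(data, alpha);
        sw_crc = capture_crc(data);
        igt_assert_crc_equal(&hw_crc[i], &sw_crc);
        clear_primary(data);
    }
}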

The code sent: tests/kms_cursor_crc: refactoring cursor-alpha subtests

CI checks did not show any problems, so I am expecting some human feedback :)

A Different Friday

Taking a break from talking about all this crazy feature nonsense, let’s get back to things that actually matter. Like gains. Specifically, why have some of these piglit tests been taking so damn long to run?

A great example of this is spec@!opengl 2.0@tex3d-npot.

Mesa 20.3: MESA_LOADER_DRIVER_OVERRIDE=zink bin/tex3d-npot 24.65s user 83.38s system 73% cpu 2:27.31 total

Mesa zmike/zink-wip: MESA_LOADER_DRIVER_OVERRIDE=zink bin/tex3d-npot 6.09s user 5.07s system 48% cpu 23.122 total

Yes, some changes I’ve made allow this test to pass in 16% of the time it takes in the released version of Mesa. How did this happen?

Speed Loop

The core of the problem at present is zink’s reliance on explicit fencing without enough info to know when to actually wait on a fence, as is vaguely referenced in this ticket, though the specific test case there still has yet to be addressed. In short, here’s the problem codepath that’s being hit for the above test:

static void *
zink_transfer_map(struct pipe_context *pctx,
                  struct pipe_resource *pres,
                  unsigned level,
                  unsigned usage,
                  const struct pipe_box *box,
                  struct pipe_transfer **transfer)
{
  ...
  if (pres->target == PIPE_BUFFER) {
      if (usage & PIPE_TRANSFER_READ) {
         /* need to wait for rendering to finish
          * TODO: optimize/fix this to be much less obtrusive
          * mesa/mesa#2966
          */
         struct pipe_fence_handle *fence = NULL;
         pctx->flush(pctx, &fence, PIPE_FLUSH_HINT_FINISH);
         if (fence) {
            pctx->screen->fence_finish(pctx->screen, NULL, fence,
                                       PIPE_TIMEOUT_INFINITE);
            pctx->screen->fence_reference(pctx->screen, &fence, NULL);
         }
      }

Here, the current command buffer is submitted and then its fence is waited upon any time a call to e.g., glReadPixels() is made.

Any time.

Regardless of whether the resource in question even has pending writes.

Or whether it’s ever had anything written to it at all.

This patch was added at my request to fix a huge number of test cases we were seeing fail due to write -> read sequences on a resource without waiting for the write to complete, and it was added with the understanding that at some point in the future it would be improved to both ensure synchronization and not incur such a massive performance hit.

That time has passed, and we are now in the future.

Buffer Synchronization

At its core, this is just another synchronization issue, and so by adding more details about synchronization needs, the problem can be resolved.

I chose to do this using r/w flags for resources based on the command buffer (batch) id that the resource was used on. Presently, Mesa releases ship with this code for tracking resource usage in a command buffer:

void
zink_batch_reference_resoure(struct zink_batch *batch,
                             struct zink_resource *res)
{
   struct set_entry *entry = _mesa_set_search(batch->resources, res);
   if (!entry) {
      entry = _mesa_set_add(batch->resources, res);
      pipe_reference(NULL, &res->base.reference);
   }
}

This ensures that the resource isn’t destroyed before the batch finishes, which is a key functionality that drivers generally prefer to have in order to avoid crashing.

It doesn’t, however, provide any details about how the resource is being used, such as whether it’s being read from or written to, which means there’s no way to optimize that case in zink_transfer_map().

Here’s the somewhat improved version:

void
zink_batch_reference_resource_rw(struct zink_batch *batch, struct zink_resource *res, bool write)
{
   unsigned mask = write ? ZINK_RESOURCE_ACCESS_WRITE : ZINK_RESOURCE_ACCESS_READ;

   struct set_entry *entry = _mesa_set_search(batch->resources, res);
   if (!entry) {
      entry = _mesa_set_add(batch->resources, res);
      pipe_reference(NULL, &res->base.reference);
   }
   /* the batch_uses value for this batch is guaranteed to not be in use now because
    * reset_batch() waits on the fence and removes access before resetting
    */
   res->batch_uses[batch->batch_id] |= mask;
}

For context, I’ve simultaneously added a member uint8_t batch_uses[4]; to struct zink_resource, as there are 4 batches in the batch array.

What this change does is allow callers to provide very basic info about whether the resource is being read from or written to in a given batch, stored as a bitmask in the batch-specific struct zink_resource::batch_uses entry. Each batch has its own entry in this array, since it needs to be modified atomically so that its usage can be unset when the fence is notified of completion, and each entry can have at most two bits set (read and write).
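
For reference, here is a minimal sketch of the declarations implied by the snippets above; the actual zink definitions may differ, and the flag values in particular are assumptions:

/* assumed flag values; the post only names the flags, not their values */
#define ZINK_RESOURCE_ACCESS_READ  (1 << 0)
#define ZINK_RESOURCE_ACCESS_WRITE (1 << 1)

struct zink_resource {
   struct pipe_resource base;
   /* one entry per batch in the 4-entry batch array; each entry is a
    * READ/WRITE bitmask, set by zink_batch_reference_resource_rw() and
    * cleared atomically once the batch's fence signals */
   uint8_t batch_uses[4];
   /* ...other members omitted... */
};

A caller binding a sampler view would then flag its resource with read-only access, while binding it as a render target would flag it for writing.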

Now here’s the current zink_fence_finish() implementation, which backs the screen’s fence_finish hook for waiting on a fence:

bool
zink_fence_finish(struct zink_screen *screen, struct zink_fence *fence,
                  uint64_t timeout_ns)
{
   bool success = vkWaitForFences(screen->dev, 1, &fence->fence, VK_TRUE,
                                  timeout_ns) == VK_SUCCESS;
   if (success && fence->active_queries)
      zink_prune_queries(screen, fence);
   return success;
}

Not much to see here. Normal fence stuff. I’ve made some additions:

static inline void
fence_remove_resource_access(struct zink_fence *fence, struct zink_resource *res)
{
   p_atomic_set(&res->batch_uses[fence->batch_id], 0);
}

bool
zink_fence_finish(struct zink_screen *screen, struct zink_fence *fence,
                  uint64_t timeout_ns)
{
   bool success = vkWaitForFences(screen->dev, 1, &fence->fence, VK_TRUE,
                                  timeout_ns) == VK_SUCCESS;
   if (success) {
      if (fence->active_queries)
         zink_prune_queries(screen, fence);

      /* unref all used resources */
      util_dynarray_foreach(&fence->resources, struct pipe_resource*, pres) {
         struct zink_resource *res = zink_resource(*pres);
         fence_remove_resource_access(fence, res);

         pipe_resource_reference(pres, NULL);
      }
      util_dynarray_clear(&fence->resources);
   }
   return success;
}

Now as soon as a fence completes, all used resources will have the usage for the corresponding batch removed.

With this done, some changes can be made to zink_transfer_map():

static uint32_t
get_resource_usage(struct zink_resource *res)
{
   uint32_t batch_uses = 0;
   for (unsigned i = 0; i < 4; i++) {
      uint8_t used = p_atomic_read(&res->batch_uses[i]);
      if (used & ZINK_RESOURCE_ACCESS_READ)
         batch_uses |= ZINK_RESOURCE_ACCESS_READ << i;
      if (used & ZINK_RESOURCE_ACCESS_WRITE)
         batch_uses |= ZINK_RESOURCE_ACCESS_WRITE << i;
   }
   return batch_uses;
}

static void *
zink_transfer_map(struct pipe_context *pctx,
                  struct pipe_resource *pres,
                  unsigned level,
                  unsigned usage,
                  const struct pipe_box *box,
                  struct pipe_transfer **transfer)
{
   ...
   struct zink_resource *res = zink_resource(pres);
   uint32_t batch_uses = get_resource_usage(res);
   if (pres->target == PIPE_BUFFER) {
      if ((usage & PIPE_TRANSFER_READ && batch_uses >= ZINK_RESOURCE_ACCESS_WRITE) ||
          (usage & PIPE_TRANSFER_WRITE && batch_uses)) {
         /* need to wait for rendering to finish
          * TODO: optimize/fix this to be much less obtrusive
          * mesa/mesa#2966
          */
         zink_fence_wait(pctx);
      }

get_resource_usage() iterates over struct zink_resource::batch_uses to generate a full bitmask of the resource’s usage across all batches. Then, combined with the usage flags passed to transfer_map, this determines whether any waiting is necessary (restated as a small sketch after the list below):

  • If the transfer_map is for reading, only wait for rendering if the resource has pending writes
  • If the transfer_map is for writing, wait if the resource has any pending usage
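
Restated as a standalone predicate, purely as a sketch reusing the names from the snippets above (this helper doesn’t exist in the driver):

static bool
transfer_needs_fence_wait(unsigned usage, uint32_t batch_uses)
{
   /* reads only need to wait if a write may still be pending on some batch */
   if ((usage & PIPE_TRANSFER_READ) && batch_uses >= ZINK_RESOURCE_ACCESS_WRITE)
      return true;
   /* writes must wait on any pending access, read or write */
   if ((usage & PIPE_TRANSFER_WRITE) && batch_uses)
      return true;
   return false;
}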

When this resource usage flagging is properly applied to all types of resources, huge performance gains abound despite how simple it is. For the above test case, flagging the sampler view resources with read-only access continues to ensure their lifetimes while enabling them to be read concurrently while draw commands are still pending, which yields the improvement to the tex3d-npot test case above.

Future Gains

I’m always on the lookout for more gains, so I’ve already done or flagged some future work here as well:

  • texture resources have had similar mechanisms applied for improved synchronization, whereas in released Mesa they have none
  • zink-wip has facilities for waiting on specific batches to complete, so zink_fence_wait() here can instead be changed to only wait on the last batch with write access for the PIPE_TRANSFER_READ case
  • buffer ranges can be added to struct zink_resource, as many other drivers do with the util_range API, enabling the fence wait here to be skipped if the regions don’t overlap or no valid data is pending

These improvements are mostly specific to unit testing, but for me personally, those are the most important gains to be making, as faster test runs mean faster verification that no regressions have occurred, which means I can continue smashing more code into the repo.

Stay tuned for future Zink Gains updates.

As a newbie, I consider debugging a kind of guided study. During the process, I have a goal that leads me to browse the code, raise and drop various suspicions, look at the change history, and perhaps relate changes from other parts of the kernel to the current problem. Co-debugging is even more interesting. Approaches are discussed, we open our minds to more possibilities, refine solutions, and share knowledge… in addition to preventing incomplete solutions. Debugging the operation of vkms when composing planes was a very interesting journey. The sharing of ideas was so dynamic and open that, in the end, it became linked to another demand: the implementation of writeback support.

And that is how I fell into another debugging journey.

Writeback support on vkms is something Rodrigo Siqueira has been working on for some time. However, with so many comings and goings of versions, other DRM changes kept introducing new issues. With each new version, it was necessary to reassess the current state of the DRM structures in addition to incorporating the revisions from the previous version.

When Leandro Ribeiro reported that calling drm_crtc_vblank_put frees the writeback work, this indicated that there was also an issue with the simulated vblank execution there. I asked Siqueira a little more about writeback, to try to collaborate with his work. He explained to me what writeback is, how it is implemented in his patches, and the IGT test used for validation.

In that state, writeback was crashing on the last subtest, writeback-check-output, again due to a timeout while waiting for resource access. We discussed the possible causes and concluded that the solution lay in defining when the composer work needs to be enabled and how to ensure that vblank keeps happening so that the writeback work can also run. When Daniel suggested creating a small function to set output->composer_enabled, he was already thinking of writeback.

However, something that appeared straightforward was somewhat entangled by our initial assumption that the problem was concentrated in writeback-check-output. We later adjusted our investigation to also look at the writeback-fb-id subtest. We began this part of the debugging process using ftrace to check the execution steps of these two test cases. Looking at different execution cases and more parts of the code, we examined the writeback job to find the symmetries, the actions of each element, and the beginning and end of each cycle. We also added some prints to VKMS and the test code, and checked dmesg.

Siqueira concluded that VKMS needs to guarantee that the composer is enabled when starting the writeback job and released (as well as vblank) in the job’s cleanup (which is currently called when a job is queued or when the state is destroyed). Another concern was to avoid scattered modifications and keep the changes well encapsulated in writeback functions.

With that, Siqueira wrapped up v5 and sent it out. Just fly, little bird:

As soon as it was sent, I checked not only whether the patchset passed the kms_writeback subtests, but also whether the other IGT tests I use remained stable: kms_flip, kms_cursor_crc, and kms_pipe_crc_basic. With that, I realized that a refactoring in the composer was out of date and reported the problem for correction (since it related to a fix I had recently made). I also indicated that, apart from this small problem (not directly related to writeback), writeback support was working well in the tests performed. However, the series has not yet received any other feedback.

The end? Hmm… never ends.

In the past few weeks, I was examining two issues on vkms: the development of writeback support and alpha blending. I try to keep activities in parallel so that one can help me recover from any tiredness caused by the other :P

Alpha blending is a TODO[1] of VKMS that could possibly resolve the warning[2] triggered by the cursor-alpha-transparent testcase. And, you know, if you still have a warning, you still have work.

  1. Blend - blend value at vaddr_src with value at vaddr_dst
     * TODO: Use the alpha value to blend vaddr_src with vaddr_dst
     *	 instead of overwriting it.
    
  2. WARNING: Suspicious CRC: All values are 0

To develop this feature, I needed to understand and “practice” some abstract concepts: alpha compositing, bitwise operations, and endianness. The beginning was a little confusing, as I was reading many definitions and rarely seeing practical examples, which is terrible when dealing with abstractions. I searched for information on Google and, little by little, the code took shape in my head.

The code, the problem solved and a new problem found

I combined what I understood from that reading with what was already in VKMS. An important note: I assume that when we blend a plane with an alpha channel onto the background, the final composition is solid (that is, it has an opaque alpha); the result on the screen is not a transparent plate but usually a black background. Therefore, the resulting alpha channel is 100% opaque, so we set it to 0xFF.

When executing my code, the two test cases related to the alpha channel (opaque and transparent) passed cleanly. But for me, that still wasn’t enough. I needed to see the colors being read and the result of the composition (by putting some pr_info() calls in the code). Besides, testing only extreme values (a solid black background and a completely white cursor with fully opaque or transparent alpha) did not convince me. You can see I’m a little pessimistic and a little wary.

So I decided to play with the cursor-alpha testcase… and I scraped myself.

Ouch!

I changed the alpha in the transparent test case from 0.0 to 0.5, and things started to fail. After a few rounds of checking the returned values, what I always saw was a cursor color (ARGB) returned as follows:

  • A: the alpha determined (ok)
  • RGB: the color resulting from a white cursor with the set transparency already applied over a standard solid black background (not ok)

For example, I expected a white cursor with transparency 0.5 to return ARGB = 80FFFFFF, but VKMS was reading 80808080 (???)

What is that?

I did more experiments to understand what was going on. I checked whether 0.5 transparency worked on i915, and it did. I also changed the cursor’s RGB color and the transparency level in the test. The cursor color read was always the given alpha plus an RGB already blended with black. I could only see that the mismatch happened because, in the hardware test, composing a cursor color of 80808080 over an FF000000 (primary) background produced FF404040 as the final result, while the software test step drew FF808080 as the final color.
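
To make the mismatch concrete, here is the per-channel arithmetic as a small standalone sketch (assuming 8-bit channels and the FF000000 background from the test; this is not code from VKMS or IGT):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Cairo hands over a pre-multiplied 50% white cursor: A = 0x80, R = G = B = 0x80 */
    uint8_t a = 0x80, fg = 0x80, bg = 0x00;

    /* straight-alpha formula applied to already pre-multiplied data:
     * the alpha ends up applied twice, darkening the result */
    uint8_t wrong = (a * fg + (255 - a) * bg) / 255;   /* 0x40 -> FF404040 */

    /* correct pre-multiplied blending: fg already carries the alpha */
    uint8_t right = fg + ((255 - a) * bg) / 255;       /* 0x80 -> FF808080 */

    printf("straight: %02x, pre-multiplied: %02x\n", wrong, right);
    return 0;
}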

My code would also have produced that result if it had received the right color… so how do I deal with it, and why is it not a problem on the i915 driver?

Reading the documentation

I was reading the Cairo manual in addition to testing different color combinations. I suspected it could be a format problem, but even so, I had no idea what was “wrong”. After some days, I found the reason:

CAIRO_FORMAT_ARGB32

each pixel is a 32-bit quantity, with alpha in the upper 8 bits, then red, then
green, then blue. The 32-bit quantities are stored native-endian.
Pre-multiplied alpha is used. (That is, 50% transparent red is 0x80800000, not
0x80ff0000.)

Pre-multiplied alpha is used. Thanks, cairo… or not!

But what is pre-multiplied alpha? I didn’t even know it existed.

So, I needed to read about the difference between straight and premultiplied alpha. Moreover, I needed to figure out how to do alpha blending using this format.

After some searching, I found a post on the Nvidia - Gameworks Blog that helped me a lot in understanding this approach. Besides, good old StackOverflow clarified what I needed to do.

So I adapted the code I had previously written for straight alpha blending to work with alpha-premultiplied colors. Of course, I added pr_info() calls to see the results. I also played a little more with the cursor and background colors in the kms_cursor_crc test as a check; for example, I changed the background color to solid red and/or varied the cursor color and/or the cursor alpha value. In all these situations, the patch worked fine.
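
Along those lines, here is a minimal sketch of per-pixel pre-multiplied blending with the resulting alpha forced to opaque; the names and layout are illustrative assumptions, not the actual VKMS patch:

#include <stdint.h>

/* Blend a pre-multiplied ARGB8888 cursor pixel over a primary-plane pixel. */
static uint32_t blend_argb_premultiplied(uint32_t fg, uint32_t bg)
{
    uint32_t fg_a = (fg >> 24) & 0xff;
    uint32_t out = 0xff000000;                       /* resulting alpha forced to 0xFF */

    for (int shift = 0; shift <= 16; shift += 8) {   /* B, G, R channels */
        uint32_t fg_c = (fg >> shift) & 0xff;
        uint32_t bg_c = (bg >> shift) & 0xff;
        /* out = fg + (1 - alpha) * bg, per channel */
        uint32_t out_c = fg_c + ((255 - fg_a) * bg_c) / 255;
        out |= (out_c > 255 ? 255 : out_c) << shift;
    }
    return out;
}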

So, I just sent:

A new interpretation for the warning

Examining this behavior, I discovered a more precise cause of the warning. As VKMS currently only overwrites the cursor over the background, a fully transparent black cursor has a color of 0 and results in a transparent primary layer instead of solid black. When doing XRGB, VKMS zeroes the alpha instead of making it opaque, which seems incorrect and triggers the warning.

I also think that this shows the weakness of the full opaque/transparent tests since they were not able to expose the wrong composition of colors. So I will prepare a patch for IGT to expand the coverage of this type of problem.

Well, with the alpha blending done, the warning never bothered me anyway.

Let it go!