planet.freedesktop.org
January 14, 2021

The open source Panfrost driver for Arm Mali Midgard and Bifrost GPUs now provides non-conformant OpenGL ES 3.0 on Bifrost and desktop OpenGL 3.1 on Midgard (Mali T760 and newer) and Bifrost, in time for Mesa’s first release of 2021. This follows the OpenGL ES 3.0 support on Midgard that landed over the summer, as well as the initial OpenGL ES 2.0 support that recently debuted for Bifrost. OpenGL ES 3.0 is now tested on Mali G52 in Mesa’s continuous integration, achieving a 99.9% pass rate on the corresponding drawElements Quality Program tests.

Architecturally, Bifrost shares most of its fixed-function data structures with Midgard, but features a brand new instruction set. Our work for bringing up OpenGL ES 3.0 on Bifrost reflects this division. Some fixed-function features, like instancing and transform feedback, worked without any Bifrost-specific changes since we already did bring-up on Midgard. Other shader features, like uniform buffer objects, required “from scratch” implementations in the Bifrost compiler, a task facilitated by the compiler’s maturing intermediate representation with first-class builder support. Yet other features like multiple render targets required some Bifrost-specific code while leveraging other code shared with Midgard. All in all, the work progressed much more quickly the second time around, a testament to the power of code sharing. But there is no need to limit sharing to just Panfrost GPUs; open source drivers can share code across vendors.

Indeed, since Mali is an embedded GPU, the proprietary driver only exposes OpenGL ES, not desktop OpenGL. However, desktop OpenGL 3.1 support comes nearly “for free” for us as an upstream Mesa driver by leveraging common infrastructure. This milestone shows the technical advantage of open source development: Compared to layered implementations of desktop GL like gl4es or Zink, Panfrost’s desktop OpenGL support is native, reducing CPU overhead. Furthermore, applications can make use of the hardware’s hidden features, like explicit primitive restart indices, alpha testing, and quadrilaterals. Although these features could be emulated, the native solutions are more efficient.

Mesa’s shared code also extends to OpenCL support via Clover. Once a driver supports compute shaders and sufficient compiler features, baseline OpenCL is just a few patches and a bug-fixing spree away. While OpenCL implementations could be layered (for example with clvk), an open source Mesa driver avoids the indirection.

I would like to thank Collaboran Boris Brezillon, who has worked tirelessly to bring OpenGL ES 3.0 support to Bifrost, as well as the prolific Icecream95, who has spearheaded OpenCL and desktop OpenGL support.

Originally posted on Collabora’s blog

A Quintessential Metric

There’s been a lot of talk about driver overhead in the Mesa community as of late, in large part begun by Marek Olšák and his daredevil stunts driving RadeonSI through flaming hoops while juggling chainsaws.

While zink isn’t quite at that level yet (and neither am I), there’s still some progress being made that I’d like to dig into a bit.

What Is Overhead?

As in all software, overhead is the performance penalty that is incurred as compared to a baseline measurement. In Mesa, a lot of people know of driver overhead as “Gallium sucks” and/or “A Gallium-based driver is slow” due to the fact that Gallium does incur some amount of overhead as compared to the old-style immediate mode DRI drivers.

While it’s true that there is an amount of performance lost by using Gallium in this sense, it’s also true that the performance gained is much greater. The reason for this is that Gallium is able to batch commands and state changes for every driver using it, allowing redundant calls to avoid triggering any work in the GPU.

It also makes for an easier time profiling and improving upon the CPU usage that’s required to handle the state changes emitted by Gallium. Instead of having a ton of core Mesa callbacks which need to be handled, each one potentially leading to a no-op that can be analyzed and deferred by the driver, Gallium provides a more cohesive API where each driver hook is a necessary change that must be handled. Because of this, the job of optimizing for those changes is simplified.
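
As a purely illustrative sketch of that filtering, using made-up names rather than the real Gallium interfaces, the idea is that the state tracker remembers what it last bound and never bothers the driver with a no-op:

/* Hypothetical illustration only: the state tracker remembers the last
 * constant state object (CSO) it bound and skips the driver hook when
 * an application re-binds the same one. */
struct toy_context {
   void *bound_blend_cso;                        /* last bound blend state */
   void (*bind_blend_state)(struct toy_context *ctx, void *cso); /* driver hook */
};

static void
toy_bind_blend_state(struct toy_context *ctx, void *cso)
{
   if (ctx->bound_blend_cso == cso)
      return;                                    /* redundant: no driver work */

   ctx->bound_blend_cso = cso;
   ctx->bind_blend_state(ctx, cso);              /* a real change the driver must handle */
}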

How Can Overhead Be Measured?

Other than the obvious method of running apps on a driver and checking the fps counter, piglit provides a facility for this: the drawoverhead test. This test has over a hundred subtests which perform sequences of draw operations with various state changes, each with its own result relative to a baseline, enabling a developer to easily profile and optimize a given codepath.
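
Conceptually, each drawoverhead subtest boils down to timing a tight loop of draws, optionally with a state change before each one, and reporting the resulting draw rate. A rough sketch of the idea (not the actual piglit code, with all GL context/buffer/shader setup omitted):

/* Rough sketch of a drawoverhead-style measurement: the interesting
 * number is the CPU-side draw submission rate, so nothing waits on the GPU. */
#include <time.h>
#include <GL/gl.h>

static double
now_sec(void)
{
   struct timespec ts;
   clock_gettime(CLOCK_MONOTONIC, &ts);
   return ts.tv_sec + ts.tv_nsec / 1e9;
}

/* Returns thousands of draws per second, with or without a texture change
 * before every draw. Assumes a context, bound vertex/index buffers, a
 * shader program and the two textures already exist. */
static double
measure(GLuint tex_a, GLuint tex_b, int change_state)
{
   const int N = 100000;
   double start = now_sec();
   for (int i = 0; i < N; i++) {
      if (change_state)
         glBindTexture(GL_TEXTURE_2D, (i & 1) ? tex_a : tex_b);
      glDrawElements(GL_TRIANGLES, 3, GL_UNSIGNED_SHORT, NULL);
   }
   return N / (now_sec() - start) / 1000.0;
}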

How Is Zink Doing Here?

To answer this, let’s look at some preliminary results from zink in master, the code which will soon be shipping in Mesa 21.0.0. All numbers here are, in contrast to my usual benchmarking, done on an AMD 5700XT GPU. More on this later.

ZINK: MASTER

   #, Test name                                              ,    Thousands draws/s, Difference vs the 1st
   1, DrawElements ( 1 VBO| 0 UBO|  0    ) w/ no state change,                  818, 100.0%
   2, DrawElements ( 4 VBO| 0 UBO|  0    ) w/ no state change,                  686, 83.9%
   3, DrawElements (16 VBO| 0 UBO|  0    ) w/ no state change,                  411, 50.3%
   4, DrawElements ( 1 VBO| 0 UBO| 16 Tex) w/ no state change,                  232, 28.4%
   5, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ no state change,                  258, 31.5%
   6, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ shader program change,             87, 10.7%
   7, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ vertex attrib change,             162, 19.9%
   8, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 1 texture change,                 150, 18.3%
   9, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 8 textures change,                120, 14.7%
  10, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 1 TBO change,                     192, 23.5%
  11, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 8 TBOs change,                    146, 17.9%

After this point, the test aborts because shader images are not yet implemented, but it’s enough for a baseline.

These numbers are…not great. Primarily, at least to start, I’ll be focusing on the first row where zink is performing 818,000 draws per second.

Let’s check out some performance from zink-wip (20201230 snapshot), specifically with GALLIUM_THREAD=0 set to disable threaded context. This means I’m adding in descriptor caching and unlimited command buffer counts (vs forcing a stall after every submit from the 4th one onwards to reset a batch):

ZINK: WIP (CACHED, NO THREAD)

   #, Test name                                              ,    Thousands draws/s, Difference vs the 1st
   1, DrawElements ( 1 VBO| 0 UBO|  0    ) w/ no state change,                  766, 100.0%
   2, DrawElements ( 4 VBO| 0 UBO|  0    ) w/ no state change,                  633, 82.6%
   3, DrawElements (16 VBO| 0 UBO|  0    ) w/ no state change,                  407, 53.1%
   4, DrawElements ( 1 VBO| 0 UBO| 16 Tex) w/ no state change,                  500, 65.3%
   5, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ no state change,                  449, 58.6%
   6, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ shader program change,             85, 11.2%
   7, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ vertex attrib change,             235, 30.7%
   8, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 1 texture change,                 159, 20.8%
   9, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 8 textures change,                128, 16.7%
  10, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 1 TBO change,                     179, 23.4%
  11, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 8 TBOs change,                    139, 18.2%

This is actually worse for a lot of cases!

But why is that?

It turns out that in the base draw case, descriptor caching and using more command buffers really need threaded context in order to pay off. There are sizable gains in the baseline texture cases (+100% or so each) and for a vertex attribute change (+50%), but fundamentally the driver overhead seems higher.

What happens if threading is enabled though?

ZINK: WIP (CACHED, THREAD)

   #, Test name                                              ,    Thousands draws/s, Difference vs the 1st
   1, DrawElements ( 1 VBO| 0 UBO|  0    ) w/ no state change,                 5206, 100.0%
   2, DrawElements ( 4 VBO| 0 UBO|  0    ) w/ no state change,                 5149, 98.9%
   3, DrawElements (16 VBO| 0 UBO|  0    ) w/ no state change,                 5187, 99.6%
   4, DrawElements ( 1 VBO| 0 UBO| 16 Tex) w/ no state change,                 5210, 100.1%
   5, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ no state change,                 4684, 90.0%
   6, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ shader program change,            137, 2.6%
   7, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ vertex attrib change,             252, 4.8%
   8, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 1 texture change,                 243, 4.7%
   9, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 8 textures change,                222, 4.3%
  10, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 1 TBO change,                     213, 4.1%
  11, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 8 TBOs change,                    208, 4.0%

blink.gif

Indeed, threading yields almost a 700% performance improvement for the baseline cases. It turns out that synchronously performing expensive tasks like computing hash values for descriptor sets is bad. Who could have guessed.

State Changes

Looking at the other values, however, is a bit more pertinent for the purpose of this post. Overhead is incurred when state changes are triggered by descriptors being changed, and this is much closer to a real world scenario (i.e., gaming) than simply running draw calls with no changes. Caching yields roughly a 50% performance improvement for this case.

Further Improvements

As I’d mentioned previously, I’m doing some work now on descriptor management with an aim to further lower this overhead. Let’s see what that looks like.

ZINK: TEST (UNCACHED, THREAD)

   #, Test name                                              ,    Thousands draws/s, Difference vs the 1st
   1, DrawElements ( 1 VBO| 0 UBO|  0    ) w/ no state change,                 5426, 100.0%
   2, DrawElements ( 4 VBO| 0 UBO|  0    ) w/ no state change,                 5423, 99.9%
   3, DrawElements (16 VBO| 0 UBO|  0    ) w/ no state change,                 5432, 100.1%
   4, DrawElements ( 1 VBO| 0 UBO| 16 Tex) w/ no state change,                 5246, 96.7%
   5, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ no state change,                 5177, 95.4%
   6, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ shader program change,            153, 2.8%
   7, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ vertex attrib change,             229, 4.2%
   8, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 1 texture change,                 247, 4.6%
   9, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 8 textures change,                228, 4.2%
  10, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 1 TBO change,                     237, 4.4%
  11, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 8 TBOs change,                    223, 4.1%

While there’s a small (~4%) improvement for the baseline numbers, what’s much more interesting is the values where descriptor states are changed. They are, in fact, about as good or even slightly better than the caching version of descriptor management.

This is huge. Specifically it’s huge because it means that I can likely port over some of the techniques used in this approach to the cached version in order to drive further reductions in overhead.

Closing Remarks

Before I go, let’s check out some numbers from a real driver. Specifically, RadeonSI: the pinnacle of Gallium-based drivers.

RADEONSI

   #, Test name                                              ,    Thousands draws/s, Difference vs the 1st
   1, DrawElements ( 1 VBO| 0 UBO|  0    ) w/ no state change,                 6221, 100.0%
   2, DrawElements ( 4 VBO| 0 UBO|  0    ) w/ no state change,                 6261, 100.7%
   3, DrawElements (16 VBO| 0 UBO|  0    ) w/ no state change,                 6236, 100.2%
   4, DrawElements ( 1 VBO| 0 UBO| 16 Tex) w/ no state change,                 6263, 100.7%
   5, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ no state change,                 6243, 100.4%
   6, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ shader program change,            217, 3.5%
   7, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ vertex attrib change,            1467, 23.6%
   8, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 1 texture change,                 374, 6.0%
   9, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 8 textures change,                218, 3.5%
  10, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 1 TBO change,                     680, 10.9%
  11, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 8 TBOs change,                    318, 5.1%

Yikes. Especially intimidating here is the relative performance for vertex attribute changes, where RadeonSI is able to retain almost 25% of its baseline performance relative to zink not even managing 5%.

Hopefully these figures get closer to each other in the future, but this just shows that there’s still a long way to go.

January 13, 2021

This post explains how to parse the HID Unit Global Item as explained by the HID Specification, page 37. The table there is quite confusing and it took me a while to fully understand it (Benjamin Tissoires was really the one who cracked it). I couldn't find any better explanation online which means either I'm incredibly dense and everyone's figured it out or no-one has posted a better explanation. On the off-chance it's the latter [1], here are the instructions on how to parse this item.

We know a HID Report Descriptor consists of a number of items that describe the content of each HID Report (read: an event from a device). These Items include things like Logical Minimum/Maximum for axis ranges, etc. A HID Unit item specifies the physical unit to apply. For example, a Report Descriptor may specify that X and Y axes are in mm which can be quite useful for all the obvious reasons.

Like most HID items, a HID Unit Item consists of a one-byte item tag and 1, 2 or 4 byte payload. The Unit item in the Report Descriptor itself has the binary value 0110 01nn where the nn is either 1, 2, or 3 indicating 1, 2 or 4 bytes of payload, respectively. That's standard HID.

The payload is divided into nibbles (4-bit units) and goes from LSB to MSB. The lowest-order 4 bits (first byte & 0xf) define the unit System to apply: one of SI Linear, SI Rotation, English Linear or English Rotation (well, or None/Reserved). The rest of the nibbles are in this order: "length", "mass", "time", "temperature", "current", "luminous intensity". In something resembling code this means:


system = value & 0xf
length_exponent = (value & 0xf0) >> 4
mass_exponent = (value & 0xf00) >> 8
time_exponent = (value & 0xf000) >> 12
...
The System defines which unit is used for length (e.g. SILinear means length is in cm). The actual value of each nibble is the exponent for the unit in use [2]. In something resembling code:

switch (system)
case SILinear:
print("length is in cm^{length_exponent}");
break;
case SIRotation:
print("length is in rad^{length_exponent}");
break;
case EnglishLinear:
print("length is in in^{length_exponent}");
break;
case EnglishRotation:
print("length is in deg^{length_exponent}");
break;
case None:
case Reserved"
print("boo!");
break;

For example, the value 0x321 means "SI Linear" (0x1) so the remaining nibbles represent, in ascending nibble order: Centimeters, Grams, Seconds, Kelvin, Ampere, Candela. The length nibble has a value of 0x2 so it's square cm, the mass nibble has a value of 0x3 so it is cubic grams (well, it's just an example, so...). This means that any report containing this item comes in cm²g³. As a more realistic example: 0xF011 would be cm/s.

If we changed the lowest nibble to English Rotation (0x4), i.e. our value is now 0x324, the units represent: Degrees, Slug, Seconds, F, Ampere, Candela [3]. The length nibble 0x2 means square degrees, the mass nibble is cubic slugs. As a more realistic example, 0xF014 would be degrees/s.

Any nibble with value 0 means the unit isn't in use, so the example from the spec with value 0x00F0D121 is SI linear, units cm² g s⁻³ A⁻¹, which is... Voltage! Of course you knew that and totally didn't have to double-check with wikipedia.

Because bits are expensive and the base units are of course either too big or too small or otherwise not quite right, HID also provides a Unit Exponent item. The Unit Exponent item (a separate item to Unit in the Report Descriptor) then describes the exponent to be applied to the actual value in the report. For example, a Unit Exponent of -3 means 10⁻³ is to be applied to the value. If the report descriptor specifies an item of Unit 0x00F0D121 (i.e. V) and Unit Exponent -3, the value of this item is mV (milliVolt); a Unit Exponent of 3 would make it kV (kiloVolt).
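
To tie the Unit and Unit Exponent items together, here is a small self-contained C sketch (my own illustration, not code from any particular HID stack) that decodes the nibbles, including the two's-complement exponents from [2], for the voltage example above:

#include <stdio.h>
#include <stdint.h>

/* Sign-extend a 4-bit two's-complement nibble (0x8-0xf map to -8..-1). */
static int nibble_exponent(uint32_t nibble)
{
    return nibble >= 8 ? (int)nibble - 16 : (int)nibble;
}

static void decode_unit(uint32_t unit, int unit_exponent)
{
    static const char *systems[] = {
        "None", "SI Linear", "SI Rotation", "English Linear", "English Rotation"
    };
    /* Nibble order after the system nibble, per the HID spec table. */
    static const char *dims[] = {
        "length", "mass", "time", "temperature", "current", "luminous intensity"
    };

    uint32_t system = unit & 0xf;
    printf("system: %s\n", system <= 4 ? systems[system] : "Reserved");

    for (int i = 0; i < 6; i++) {
        int exp = nibble_exponent((unit >> (4 * (i + 1))) & 0xf);
        if (exp != 0)
            printf("  %s^%d\n", dims[i], exp);
    }
    printf("  scaled by 10^%d (Unit Exponent)\n", unit_exponent);
}

int main(void)
{
    /* cm^2 g s^-3 A^-1 with Unit Exponent -3, i.e. millivolt */
    decode_unit(0x00F0D121, -3);
    return 0;
}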

Now, in hindsight all this is pretty obvious and maybe even sensible. It'd have been nice if the spec would've explained it a bit clearer but then I would have nothing to write about, so I guess overall I call it a draw.

[1] This whole adventure was started because there's a touchpad out there that measures touch pressure in radians, so at least one other person out there struggled with the docs...
[2] The nibble value is twos complement (i.e. it's a signed 4-bit integer). Values 0x1-0x7 are exponents 1 to 7, values 0x8-0xf are exponents -8 to -1.
[3] English Linear should've trolled everyone and used Centimetres instead of the Centimeters in SI Linear.

January 12, 2021

TL;DR: It's now easy to unlock your LUKS2 volume with a FIDO2 security token (e.g. YubiKey or Nitrokey FIDO2). And TPM2 unlocking is easy now too.

Blogging is a lot of work, and a lot less fun than hacking. I mostly focus on the latter because of that, but from time to time I guess stuff is just too interesting to not be blogged about. Hence here, finally, another blog story about exciting new features in systemd.

With the upcoming systemd v248 the systemd-cryptsetup component of systemd (which is responsible for assembling encrypted volumes during boot) gained direct support for unlocking encrypted storage with three types of security hardware:

  1. Unlocking with FIDO2 security tokens (well, at least with those which implement the hmac-secret extension, most do). i.e. your YubiKeys (series 5 and above), or Nitrokey FIDO2 and such.

  2. Unlocking with TPM2 security chips (pretty ubiquitous on non-budget PCs/laptops/…)

  3. Unlocking with PKCS#11 security tokens, i.e. your smartcards and older YubiKeys (the ones that implement PIV). (Strictly speaking this was supported on older systemd already, but was a lot more "manual".)

For completeness' sake, let's keep in mind that the component also allows unlocking with these more traditional mechanisms:

  1. Unlocking interactively with a user-entered passphrase (i.e. the way most people probably already deploy it, supported since about forever)

  2. Unlocking via key file on disk (optionally on removable media plugged in at boot), supported since forever.

  3. Unlocking via a key acquired through trivial AF_UNIX/SOCK_STREAM socket IPC. (Also new in v248)

  4. Unlocking via recovery keys. These are pretty much the same thing as a regular passphrase (and in fact can be entered wherever a passphrase is requested) — the main difference being that they are always generated by the computer, and thus have guaranteed high entropy, typically higher than user-chosen passphrases. They are generated in a way they are easy to type, in many cases even if the local key map is misconfigured. (Also new in v248)

In this blog story, let's focus on the first three items, i.e. those that talk to specific types of hardware for implementing unlocking.

To make working with security tokens and TPM2 easy, a new, small tool was added to the systemd tool set: systemd-cryptenroll. Its only purpose is to make it easy to enroll your security token/chip of choice into an encrypted volume. It works with any LUKS2 volume, and embeds a tiny bit of meta-information into the LUKS2 header with parameters necessary for the unlock operation.

Unlocking with FIDO2

So, let's see how this fits together in the FIDO2 case. Most likely this is what you want to use if you have one of these fancy FIDO2 tokens (which need to implement the hmac-secret extension, as mentioned). Let's say you already have your LUKS2 volume set up, and previously unlocked it with a simple passphrase. Plug in your token, and run:

# systemd-cryptenroll --fido2-device=auto /dev/sda5

(Replace /dev/sda5 with the underlying block device of your volume).

This will enroll the key as an additional way to unlock the volume, and embeds all necessary information for it in the LUKS2 volume header. Before we can unlock the volume with this at boot, we need to allow FIDO2 unlocking via /etc/crypttab. For that, find the right entry for your volume in that file, and edit it like so:

myvolume /dev/sda5 - fido2-device=auto

Replace myvolume and /dev/sda5 with the right volume name, and underlying device of course. Key here is the fido2-device=auto option you need to add to the fourth column in the file. It tells systemd-cryptsetup to use the FIDO2 metadata now embedded in the LUKS2 header, wait for the FIDO2 token to be plugged in at boot (utilizing systemd-udevd, …) and unlock the volume with it.

And that's it already. Easy-peasy, no?

Note that all of this doesn't modify the FIDO2 token itself in any way. Moreover you can enroll the same token in as many volumes as you like. Since all enrollment information is stored in the LUKS2 header (and not on the token) there are no bounds on any of this. (OK, well, admittedly, there's a cap on LUKS2 key slots per volume, i.e. you can't enroll more than a bunch of keys per volume.)

Unlocking with PKCS#11

Let's now have a closer look how the same works with a PKCS#11 compatible security token or smartcard. For this to work, you need a device that can store an RSA key pair. I figure most security tokens/smartcards that implement PIV qualify. How you actually get the keys onto the device might differ though. Here's how you do this for any YubiKey that implements the PIV feature:

# ykman piv reset
# ykman piv generate-key -a RSA2048 9d pubkey.pem
# ykman piv generate-certificate --subject "Knobelei" 9d pubkey.pem
# rm pubkey.pem

(This chain of commands erases what was stored in the PIV feature of your token before, so be careful!)

For tokens/smartcards from other vendors a different series of commands might work. Once you have a key pair on it, you can enroll it with a LUKS2 volume like so:

# systemd-cryptenroll --pkcs11-token-uri=auto /dev/sda5

Just like the same command's invocation in the FIDO2 case, this enrolls the security token as an additional way to unlock the volume; any passphrases you already have enrolled remain enrolled.

For the PKCS#11 case you need to edit your /etc/crypttab entry like this:

myvolume /dev/sda5 - pkcs11-uri=auto

If you have a security token that implements both PKCS#11 PIV and FIDO2, I'd probably enroll it as a FIDO2 device, given it's the more contemporary, future-proof standard. Moreover, it requires no special preparation in order to get an RSA key onto the device: FIDO2 keys typically just work.

Unlocking with TPM2

Most modern (non-budget) PC hardware (and other kinds of hardware too) nowadays comes with a TPM2 security chip. In many ways a TPM2 chip is a smartcard that is soldered onto the mainboard of your system. Unlike your usual USB-connected security tokens you thus cannot remove them from your PC, which means they address quite a different security scenario: they aren't immediately comparable to a physical key you can take with you that unlocks some door; instead, they are a key you leave at the door, but one that refuses to be turned by anyone but you.

Even though this sounds a lot weaker than the FIDO2/PKCS#11 model, TPM2 still brings benefits for securing your systems: because the cryptographic key material stored in TPM2 devices cannot be extracted (at least that's the theory), if you bind your hard disk encryption to it, it means attackers cannot just copy your disk and analyze it offline — they always need access to the TPM2 chip too to have a chance to acquire the necessary cryptographic keys. Thus, they can still steal your whole PC and analyze it, but they cannot just copy the disk without you noticing and analyze the copy.

Moreover, you can bind the ability to unlock the harddisk to specific software versions: for example you could say that only your trusted Fedora Linux can unlock the device, but not any arbitrary OS some hacker might boot from a USB stick they plugged in. Thus, if you trust your OS vendor, you can entrust storage unlocking to the vendor's OS together with your TPM2 device, and thus can be reasonably sure intruders cannot decrypt your data unless they both hack your OS vendor and steal/break your TPM2 chip.

Here's how you enroll your LUKS2 volume with your TPM2 chip:

# systemd-cryptenroll --tpm2-device=auto --tpm2-pcrs=7 /dev/sda5

This looks almost as straightforward as the two earlier systemd-cryptenroll command lines — if it wasn't for the --tpm2-pcrs= part. With that option you can specify to which TPM2 PCRs you want to bind the enrollment. TPM2 PCRs are a set of (typically 24) hash values that every TPM2-equipped system calculates at boot from all the software that is invoked during the boot sequence, in a secure, unfakable way (this is called "measurement"). If you bind unlocking to a specific value of a specific PCR, you thus require that the system follow the same sequence of software at boot in order to re-acquire the disk encryption key. Sounds complex? Well, that's because it is.

For now, let's see how we have to modify your /etc/crypttab to unlock via TPM2:

myvolume /dev/sda5 - tpm2-device=auto

This part is easy again: the tpm2-device= option is what tells systemd-cryptsetup to use the TPM2 metadata from the LUKS2 header and to wait for the TPM2 device to show up.

Bonus: Recovery Key Enrollment

FIDO2, PKCS#11 and TPM2 security tokens and chips pair well with recovery keys: since you don't need to type in your password every day anymore, it makes sense to get rid of it, and instead enroll a high-entropy recovery key that you then print out or scan off screen and store in a safe, physical location. i.e. forget about good ol' passphrase-based unlocking, go for FIDO2 plus recovery key instead! Here's how you do it:

# systemd-cryptenroll --recovery-key /dev/sda5

This will generate a key, enroll it in the LUKS2 volume, show it to you on screen and generate a QR code you may scan off screen if you like. The key has very high entropy, and can be entered wherever you can enter a passphrase. Because of that you don't have to modify /etc/crypttab to make the recovery key work.

Future

There's still plenty of room for further improvement in all of this. In particular for the TPM2 case: what the text above doesn't really mention is that binding your encrypted volume unlocking to specific software versions (i.e. kernel + initrd + OS versions) actually sucks hard: if you naively update your system to newer versions you might lose access to your TPM2 enrolled keys (which isn't terrible, after all you did enroll a recovery key — right? — which you then can use to regain access). To solve this some more integration with distributions would be necessary: whenever they upgrade the system they'd have to make sure to enroll the TPM2 again — with the PCR hashes matching the new version. And whenever they remove an old version of the system they need to remove the old TPM2 enrollment. Alternatively TPM2 also knows a concept of signed PCR hash values. In this mode the distro could just ship a set of PCR signatures which would unlock the TPM2 keys. (But quite frankly I don't really see the point: whether you drop in a signature file on each system update, or enroll a new set of PCR hashes in the LUKS2 header doesn't make much of a difference). Either way, to make TPM2 enrollment smooth some more integration work with your distribution's system update mechanisms needs to happen. And yes, because of this OS updating complexity the example above — where I referenced your trusty Fedora Linux — doesn't actually work IRL (yet? hopefully…). Nothing updates the enrollment automatically after you initially enrolled it, hence after the first kernel/initrd update you have to manually re-enroll things again, and again, and again … after every update.

The TPM2 could also be used for other kinds of key policies, which we might look into adding later too. For example, Windows uses TPM2 stuff to allow short (4 digits or so) "PINs" for unlocking the harddisk, i.e. kind of a low-entropy password you type in. The reason this is reasonably safe is that in this case the PIN is passed to the TPM2, which enforces that not more than some limited number of unlock attempts may be made within some time frame, and that after too many attempts the PIN is invalidated altogether. Thus making dictionary attacks harder (which would normally be easier given the short length of the PINs).

Postscript

(BTW: Yubico sent me two YubiKeys for testing and Nitrokey a Nitrokey FIDO2, thank you! — That's why you see all those references to YubiKey/Nitrokey devices in the text above: it's the hardware I had to test this with. That said, I also tested the FIDO2 stuff with a SoloKey I bought, where it also worked fine. And yes, you!, other vendors!, who might be reading this, please send me your security tokens for free, too, and I might test things with them as well. No promises though. And I am not going to give them back, if you do, sorry. ;-))

January 11, 2021

A Different Strategy

As the merge window for the upcoming Mesa release looms, Erik and I have decided on a new strategy for development: we’re just going to stop merging patches.

At this point in time, we have no regressions as compared to the last release, so we’re just doing a full stop until after the branch point in order to save ourselves time potentially tracking down any issues in further feature additions.

Slowdowns

Some of you may have noticed that zink-wip has yet to update this year. This isn’t due to a lack of work, but rather due to lack of stability. I’ve been tinkering with a new descriptor management infrastructure (yes, I’m back on the horse), and it’s… capable of drawing frames is maybe the best way to describe it. I’ve gone through probably about ten iterations on it so far based on all the ideas I’ve had.

This is hardly an exhaustive list, but here’s some of the ideas that I’ve cycled through:

  • async descriptor updating - It seems like this should be good on paper given that it’s legal to do descriptor updates in threads, but the overhead from signalling the task thread in this case ended up being, on average, about 10-20x the cost of just doing the updating synchronously.

  • all push descriptors all the time - Just for hahas along the way I jammed everything into a pushed descriptor set. Or at least I was going to try. About halfway through, I realized this was way more work to execute than it’d be worth for the hahas considering I wouldn’t ever be able to use this in reality.

  • zero iteration updates - The gist of this idea is that, looking at the descriptor updating code, there’s a ton of iterating going on. This is an extreme hotpath, so any amount of looping that can be avoided is great, and the underlying Vulkan driver has to iterate the sets anyway, so… Eventually I managed to throw a bunch of memory at the problem and do all the setup during pipeline init, giving me pre-initialized blobs of memory in the form of VkWriteDescriptorSet arrays with the associated sub-types for descriptors. With this in place, naturally I turned to…

  • templates - Descriptor templates are a way of giving the Vulkan driver the raw memory of the descriptor info as a blob and letting it huck that directly into a buffer. Since I already had the memory set up for this, it was an easy swap over, though the gains were less impressive than I’d expected.
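
For the curious, here's roughly what the feature looks like at the Vulkan API level. This is a generic sketch assuming device, set_layout, set, and ubo handles already exist, not zink's actual code:

#include <vulkan/vulkan.h>

/* Generic sketch: describe once how raw memory maps onto a descriptor set,
 * then update the set directly from a pre-packed blob. */
static void
update_with_template(VkDevice device, VkDescriptorSetLayout set_layout,
                     VkDescriptorSet set, VkBuffer ubo)
{
   VkDescriptorUpdateTemplateEntry entry = {
      .dstBinding = 0,
      .dstArrayElement = 0,
      .descriptorCount = 1,
      .descriptorType = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER,
      .offset = 0,
      .stride = sizeof(VkDescriptorBufferInfo),
   };
   VkDescriptorUpdateTemplateCreateInfo info = {
      .sType = VK_STRUCTURE_TYPE_DESCRIPTOR_UPDATE_TEMPLATE_CREATE_INFO,
      .templateType = VK_DESCRIPTOR_UPDATE_TEMPLATE_TYPE_DESCRIPTOR_SET,
      .descriptorUpdateEntryCount = 1,
      .pDescriptorUpdateEntries = &entry,
      .descriptorSetLayout = set_layout,
   };
   VkDescriptorUpdateTemplate templ;
   vkCreateDescriptorUpdateTemplate(device, &info, NULL, &templ);

   /* The "blob of raw memory": just an array of the matching info structs. */
   VkDescriptorBufferInfo data = {
      .buffer = ubo,
      .offset = 0,
      .range = VK_WHOLE_SIZE,
   };
   vkUpdateDescriptorSetWithTemplate(device, set, templ, &data);
}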

At last I’ve settled on a model of uncached, templated descriptors with an extra push set for handling uniform data for further exploration. Initial results for real world use (e.g., graphical benchmarks) are good, but piglit’s drawoverhead test shows there’s still a lot of work to be done to catch up to caching.

Big thanks to Hans-Kristian Arntzen, aka themaister, aka low-level graphics swashbuckler, for providing insight and consults along the process of this.

January 08, 2021

VkRunner is a Vulkan shader tester based on Piglit’s shader_runner (I already talked about it in my blog). This tool is very helpful for creating simple Vulkan tests without writing hundreds of lines of code. In the Graphics Team at Igalia, we use it extensively to help us with open-source driver development in Mesa, such as the V3D and Turnip drivers.

As a hobby project over the last Christmas holiday season, I wrote the .spec file for VkRunner and uploaded it to Fedora’s Copr and the OpenSUSE Build Service (OBS) for generating the respective RPM packages.

This is the first time I have created a package, and thanks to the documentation on how to create RPM packages, the process was simpler than I initially thought. If I find the time to read the Debian New Maintainers’ Guide, I will create a DEB package as well.

Anyway, if you have installed Fedora or OpenSUSE in your computer and you want to try VkRunner, just follow these steps:

  • Fedora:
$ sudo dnf copr enable samuelig/vkrunner
$ sudo dnf install vkrunner

  • OpenSUSE / SLE:
$ sudo zypper addrepo https://download.opensuse.org/repositories/home:samuelig/openSUSE_Leap_15.2/home:samuelig.repo
$ sudo zypper refresh
$ sudo zypper install vkrunner

Enjoy it!

January 07, 2021

Apple’s latest line of Macs includes their in-house “M1” system-on-chip, featuring a custom GPU. This poses a problem for those of us in the Asahi Linux project who wish to run Linux on our devices, as this custom Apple GPU has neither public documentation nor open source drivers. Some speculate it might descend from PowerVR GPUs, as used in older iPhones, while others believe the GPU to be completely custom. But rumours and speculations are no fun when we can peek under the hood ourselves!

A few weeks ago, I purchased a Mac Mini with an M1 GPU as a development target to study the instruction set and command stream, to understand the GPU’s architecture at a level not previously publicly understood, and ultimately to accelerate the development of a Mesa driver for the hardware. Today I’ve reached my first milestone: I now understand enough of the instruction set to disassemble simple shaders with a free and open-source tool chain, released on GitHub here.

The process for decoding the instruction set and command stream of the GPU parallels the process I used for reverse-engineering Mali GPUs in the Panfrost project, originally pioneered by the Lima, Freedreno, and Nouveau free software driver projects. Typically, for Linux or Android driver reverse-engineering, a small wrapper library is written and injected into a test application via LD_PRELOAD to hook key system calls like ioctl and mmap in order to analyze user-kernel interactions. Once the “submit command buffer” call is issued, the library can dump all (mapped) shared memory for offline analysis.

The same overall process will work for the M1, but there are some macOSisms that need to be translated. First, there is no LD_PRELOAD on macOS; the equivalent is DYLD_INSERT_LIBRARIES, which has some extra security features which are easy enough to turn off for our purposes. Second, while the standard Linux/BSD system calls do exist on macOS, they are not used for graphics drivers. Instead, Apple’s own IOKit framework is used for both kernel and userspace drivers, with the critical entry point of IOConnectCallMethod, an analogue of ioctl. These differences are easy enough to paper over, but they do add a layer of distance from the standard Linux tooling.

The bigger issue is orienting ourselves in the IOKit world. Since Linux is under a copyleft license, (legal) kernel drivers are open source, so the ioctl interface is public, albeit vendor-specific. macOS’s kernel (XNU) being under a permissive license brings no such obligations; the kernel interface is proprietary and undocumented. Even after wrapping IOConnectCallMethod, it took some elbow grease to identify the three critical calls: memory allocation, command buffer creation, and command buffer submission. Wrapping the allocation and creation calls is essential for tracking GPU-visible memory (what we are interested in studying), and wrapping the submission call is essential for timing the memory dump.
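
For illustration, an interposer along these lines (not the actual tooling, and with the interesting selector numbers left to be discovered by observation) might look like:

/* Sketch of a macOS interposer, loaded via DYLD_INSERT_LIBRARIES, that logs
 * every IOConnectCallMethod invocation so interesting selectors (allocation,
 * command buffer creation and submission) can be spotted. Illustrative only. */
#include <stdio.h>
#include <IOKit/IOKitLib.h>

static kern_return_t
wrap_IOConnectCallMethod(mach_port_t connection, uint32_t selector,
                         const uint64_t *input, uint32_t inputCnt,
                         const void *inputStruct, size_t inputStructCnt,
                         uint64_t *output, uint32_t *outputCnt,
                         void *outputStruct, size_t *outputStructCnt)
{
   fprintf(stderr, "IOConnectCallMethod(selector=%u, inputCnt=%u, structCnt=%zu)\n",
           selector, inputCnt, inputStructCnt);

   /* Forward to the real implementation; dyld leaves our own reference
    * to the original symbol untouched. */
   return IOConnectCallMethod(connection, selector, input, inputCnt,
                              inputStruct, inputStructCnt,
                              output, outputCnt, outputStruct, outputStructCnt);
}

/* Standard __DATA,__interpose entry, as in <mach-o/dyld-interposing.h>. */
__attribute__((used)) static const struct {
   const void *replacement;
   const void *replacee;
} interposer __attribute__((section("__DATA,__interpose"))) = {
   (const void *)wrap_IOConnectCallMethod,
   (const void *)IOConnectCallMethod,
};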

With those obstacles cleared, we can finally get to the shader binaries, black boxes in themselves. However, the process from here on out is standard: start with the simplest fragment or compute shader possible, make a small change in the input source code, and compare the output binaries. Iterating on this process is tedious but will quickly reveal key structures, including opcode numbers.

The findings of the process documented in the free software disassembler confirm a number of traits of the GPU:

One, the architecture is scalar. Unlike some GPUs that are scalar for 32-bits but vectorized for 16-bits, the M1’s GPU is scalar at all bit sizes. Yet Metal optimization resources imply 16-bit arithmetic should be significantly faster, in addition to a reduction of register usage leading to higher thread count (occupancy). This suggests the hardware is superscalar, with more 16-bit ALUs than 32-bit ALUs, allowing the part to benefit from low-precision graphics shaders much more than competing chips can, while removing a great deal of complexity from the compiler.

Two, the architecture seems to handle scheduling in hardware, common among desktop GPUs but less so in the embedded space. This again makes the compiler simpler at the expense of more hardware. Instructions seem to have minimal encoding overhead, unlike other architectures which need to pad out instructions with nop’s to accommodate highly constrained instruction sets.

Three, various modifiers are supported. Floating-point ALUs can do clamps (saturate), negates, and absolute value modifiers “for free”, a common shader architecture trait. Further, most (all?) instructions can type-convert between 16-bit and 32-bit “for free” on both the destination and the sources, which allows the compiler to be much more aggressive about using 16-bit operations without risking conversion overheads. On the integer side, various bitwise complements and shifts are allowed on certain instructions for free. None of this is unique to Apple’s design, but it’s worth noting all the same.

Finally, not all ALU instructions have the same timing. Instructions like imad, used to multiply two integers and add a third, are avoided in favour of repeated iadd integer addition instructions where possible. This also suggests a superscalar architecture; software-scheduled designs like those I work on for my day job cannot exploit differences in pipeline length, inadvertently slowing down simple instructions to match the speed of complex ones.

From my prior experience working with GPUs, I expect to find some eldritch horror waiting in the instruction set, to balloon compiler complexity. Though the above work currently covers only a small surface area of the instruction set, so far everything seems sound. There are no convoluted optimization tricks, but doing away with the trickery creates a streamlined, efficient design that does one thing and does it well. Maybe Apple’s hardware engineers discovered it’s hard to beat simplicity.

Alas, a shader tool chain isn’t much use without an open source userspace driver. Next up: dissecting the command stream!

Disclaimer: This work is a hobby project conducted based on public information. Opinions expressed may not reflect those of my employer.

January 05, 2021

Some People Will Appreciate This

border_colors.png

Also zink hit GL 4.1 today.

January 04, 2021

It Happens

As long-time readers of the blog know, SGC is a safe space where making mistakes is not only accepted, it’s a way of life. So it is once again that I need to amend statements previously made regarding Xorg synchronization after Michel Dänzer, also known for anchoring the award-winning series Why Is My MR Failing CI Today?, pointed out that while I was indeed addressing the correct problem, I was addressing it from the wrong side.

Looking Closer

The issue here is that WSI synchronizes with the display server using a file descriptor for the swapchain image that the Vulkan driver manages. But what if the Vulkan driver never configures itself to be used for WSI (genuine or faked) in the first place?

Yes, this indeed appeared to be the true problem. Iago Toral Quiroga added handling for this specific to the V3DV driver back in October, and it’s the same mechanism: setting up a Mesa-internal struct during resource initialization.

So I extended this to the ANV codepath and…

And obviously it didn’t work.

But why was this the case?

A script-based git blame revealed that ANV has a different handling for implicit sync than other Vulkan drivers. After a well-hidden patch, ANV relies entirely on a struct attached to VkSubmitInfo which contains the swapchain image’s memory pointer in order to handle implicit sync. Thus by attaching a wsi_memory_signal_submit_info struct, everything was resolved.

Is it a great fix? No. Does it work? Yes.

Questions

If ANV wasn’t configuring itself to handle implicit sync, why was poll() working?

Luck.

Why does RADV work without any of this?

Also probably luck.

January 01, 2021

Reviving Very Old X Code

I've taken the week between Christmas and New Year's off this year. I didn't really have anything serious planned, just taking a break from the usual routine. As often happens, I got sucked into doing a project when I received this simple bug report: Debian Bug #974011

I have been researching old terminal and X games recently, and realized
that much of the code from 'xmille' originated from the terminal game
'mille', which is part of bsdgames.

...

[The copyright and license information] has been stripped out of all
code in the xmille distribution.  Also, none of the included materials
give credit to the original author, Ken Arnold.

The reason the 'xmille' source is missing copyright and license information from the 'mille' files is that they were copied in before that information was added upstream. Xmille forked from Mille around 1987 or so. I wrote the UI parts for the system I had at the time, which was running X10R4. A very basic port to X11 was done at some point, and that's what Debian has in the archive today.

At some point in the 90s, I ported Xmille to the Athena widget set, including several custom widgets in an Xaw extension library, Xkw. It's a lot better than the version in Debian, including displaying the cards correctly (the Debian version has some pretty bad color issues).

Here's what the current Debian version looks like:

Fixing The Bug

To fix the missing copyright and license information, I imported the mille source code into the "latest" Xaw-based version. The updated mille code had a number of bug fixes and improvements, along with the copyright information.

That should have been sufficient to resolve the issue, and I could have constructed a suitable source package from whatever bits were needed and uploaded that as a replacement 'xmille' package.

However, at some later point, I had actually merged xmille into a larger package, 'kgames', which also included a number of other games, including Reversi, Dominoes, Cribbage and ten Solitaire/Patience variants. (as an aside, those last ten games formed the basis for my Patience Palm Pilot application, which seems to have inspired an Android App of the same name...)

So began my yak shaving holiday.

Building Kgames in 2020

Ok, so getting this old source code running should be easy, right? It's just a bunch of C code designed in the 80s and 90s to work on VAXen and their kin. How hard could it be?

  1. Everything was a 32-bit computer back then; pointers and ints were both 32 bits, so you could cast them with wild abandon and cause no problems. Today, testing revealed segfaults in some corners of the code.

  2. It's K&R C code. Remember that the first version of ANSI-C didn't come out until 1989, and it was years later that we could reliably expect to find an ANSI compiler with a random Unix box.

  3. It's X11 code. Fortunately (?), X11 hasn't changed since these applications were written, so at least that part still works just fine. Imagine trying to build Windows or Mac OS code from the early 90's on a modern OS...

I decided to dig in and add prototypes everywhere; that found a lot of pointer/int casting issues, as well as several lurking bugs where the code was just plain broken.
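
As a contrived example of the kind of thing that turns up (not actual kgames code), the K&R-style function below silently truncates a 64-bit pointer through its implicit int return, while the prototyped ANSI version lets the compiler complain:

struct card { int rank, suit; };
static struct card deck[4][13];

/* K&R style: no prototype, implicit int return. On a 32-bit VAX the
 * pointer/int cast was harmless; on a 64-bit machine it truncates, and
 * modern compilers rightly reject or warn about the whole construct. */
card_at(x, y)
    int x, y;
{
    return (int) &deck[x][y];
}

/* ANSI style with a real prototype: mismatched arguments and the bogus
 * cast now get caught at compile time. */
static struct card *
card_at_ansi(int x, int y)
{
    return &deck[x][y];
}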

After a day or so, I had things building and running and was no longer hitting crashes.

Kgames 1.0 uploaded to Debian New Queue

With that done, I decided I could at least upload the working bits to the Debian archive and close the bug reported above. kgames 1.0-2 may eventually get into unstable, presumably once the Debian FTP team realizes just how important fixing this bug is. Or something.

Here's what xmille looks like in this version:

And here's my favorite solitaire variant too:

But They Look So Old

Yeah, Xaw applications have a rustic appearance which may appeal to some, but for people with higher resolution monitors and “well seasoned” eyesight, squinting at the tiny images and text makes it difficult to enjoy these games today.

How hard could it be to update them to use larger cards and scalable fonts?

Xkw version 2.0

I decided to dig in and start hacking the code, starting by adding new widgets to the Xkw library that used cairo for drawing instead of core X calls. Fortunately, the needs of the games were pretty limited, so I only needed to implement a handful of widgets:

  • KLabel. Shows a text string. It allows the string to be left, center or right justified. And that's about it.

  • KCommand. A push button, which uses KLabel for the underlying presentation.

  • KToggle. A push-on/push-off button, which uses KCommand for most of the implementation. Also supports 'radio groups' where pushing one on makes the others in the group turn off.

  • KMenuButton. A button for bringing up a menu widget; this is some pretty simple behavior built on top of KCommand.

  • KSimpleMenu, KSmeBSB, KSmeLine. These three create pop-up menus; KSimpleMenu creates a container which can hold any number of KSmeBSB (string) and KSmeLine (separator line) objects.

  • KTextLine. A single line text entry widget.

The other Xkw widgets all got their rendering switched to using cairo, plus using double buffering to make updates look better.

SVG Playing Cards

Looking on wikimedia, I found a page referencing a large number of playing cards in SVG form. That led me to Adrian Kennard's playing card web site, which let me customize and download a deck of cards, licensed using the CC0 Public Domain license.

With these cards, I set about rewriting the Xkw playing card widget, stripping out three different versions of bitmap playing cards and replacing them with just these new SVG versions.

SVG Xmille Cards

Ok, so getting regular playing cards was good, but the original goal was to update Xmille, and that has cards hand drawn by me. I could just use those images, import them into cairo and let it scale them to suit on the screen. I decided to experiment with inkscape's bitmap tracing code to see what it could do with them.

First, I had to get them into a format that inkscape could parse. That turned out to be a bit tricky; the original format is a set of X bitmap layers, each layer painting a single color. I ended up hacking the Xmille source code to generate the images using X, then fetching them with XGetImage and walking them to construct XPM format files which could then be fed into the portable bitmap tools to create PNG files that inkscape could handle.

The resulting images have a certain charm:

I did replace the text in the images to make it readable, otherwise these are untouched from what inkscape generated.

The Results

Remember that all of these are applications built using the venerable X toolkit; there are still some non-antialiased graphics visible as the shaped buttons use the X Shape extension. But, all rendering is now done with cairo, so it's all anti-aliased and all scalable.

Here's what Xmille looks like after the upgrades:

And here's spider:

Once kgames 1.0 reaches Debian unstable, I'll upload these new versions.

December 30, 2020

A Different Sort Of Optimization

There’s a number of strange hacks in zink that provide compatibility for some of the layers in mesa. One of these hacks is the NIR pass used for managing non-constant UBO/SSBO array indexing, made necessary because SPIRV operates by directly accessing variables, and so it’s impossible to have a non-constant index because then when generating the SPIRV there’s no way to know which variable is being accessed.

In its current state from zink-wip it looks like this:

static nir_ssa_def *
recursive_generate_bo_ssa_def(nir_builder *b, nir_intrinsic_instr *instr, nir_ssa_def *index, unsigned start, unsigned end)
{
   if (start == end - 1) {
      /* block index src is 1 for this op */
      unsigned block_idx = instr->intrinsic == nir_intrinsic_store_ssbo;
      nir_intrinsic_instr *new_instr = nir_intrinsic_instr_create(b->shader, instr->intrinsic);
      new_instr->src[block_idx] = nir_src_for_ssa(nir_imm_int(b, start));
      for (unsigned i = 0; i < nir_intrinsic_infos[instr->intrinsic].num_srcs; i++) {
         if (i != block_idx)
            nir_src_copy(&new_instr->src[i], &instr->src[i], &new_instr->instr);
      }
      if (instr->intrinsic != nir_intrinsic_load_ubo_vec4) {
         nir_intrinsic_set_align(new_instr, nir_intrinsic_align_mul(instr), nir_intrinsic_align_offset(instr));
         if (instr->intrinsic != nir_intrinsic_load_ssbo)
            nir_intrinsic_set_range(new_instr, nir_intrinsic_range(instr));
      }
      new_instr->num_components = instr->num_components;
      if (instr->intrinsic != nir_intrinsic_store_ssbo)
         nir_ssa_dest_init(&new_instr->instr, &new_instr->dest,
                           nir_dest_num_components(instr->dest),
                           nir_dest_bit_size(instr->dest), NULL);
      nir_builder_instr_insert(b, &new_instr->instr);
      return &new_instr->dest.ssa;
   }

   unsigned mid = start + (end - start) / 2;
   return nir_build_alu(b, nir_op_bcsel, nir_build_alu(b, nir_op_ilt, index, nir_imm_int(b, mid), NULL, NULL),
      recursive_generate_bo_ssa_def(b, instr, index, start, mid),
      recursive_generate_bo_ssa_def(b, instr, index, mid, end),
      NULL
   );
}

static bool
lower_dynamic_bo_access_instr(nir_intrinsic_instr *instr, nir_builder *b)
{
   if (instr->intrinsic != nir_intrinsic_load_ubo &&
       instr->intrinsic != nir_intrinsic_load_ubo_vec4 &&
       instr->intrinsic != nir_intrinsic_get_ssbo_size &&
       instr->intrinsic != nir_intrinsic_load_ssbo &&
       instr->intrinsic != nir_intrinsic_store_ssbo)
      return false;
   /* block index src is 1 for this op */
   unsigned block_idx = instr->intrinsic == nir_intrinsic_store_ssbo;
   if (nir_src_is_const(instr->src[block_idx]))
      return false;
   b->cursor = nir_after_instr(&instr->instr);
   bool ssbo_mode = instr->intrinsic != nir_intrinsic_load_ubo && instr->intrinsic != nir_intrinsic_load_ubo_vec4;
   unsigned first_idx = 0, last_idx;
   if (ssbo_mode) {
      last_idx = first_idx + b->shader->info.num_ssbos;
   } else {
      /* skip 0 index if uniform_0 is one we created previously */
      first_idx = !b->shader->info.first_ubo_is_default_ubo;
      last_idx = first_idx + b->shader->info.num_ubos;
   }

   /* now create the composite dest with a bcsel chain based on the original value */
   nir_ssa_def *new_dest = recursive_generate_bo_ssa_def(b, instr,
                                                       instr->src[block_idx].ssa,
                                                       first_idx, last_idx);

   if (instr->intrinsic != nir_intrinsic_store_ssbo)
      /* now use the composite dest in all cases where the original dest (from the dynamic index)
       * was used and remove the dynamically-indexed load_*bo instruction
       */
      nir_ssa_def_rewrite_uses_after(&instr->dest.ssa, nir_src_for_ssa(new_dest), &instr->instr);

   nir_instr_remove(&instr->instr);
   return true;
}

In brief, lower_dynamic_bo_access_instr() is used to detect UBO/SSBO instructions with a non-constant index, e.g., array_of_ubos[n] where n is a uniform. Following this, recursive_generate_bo_ssa_def() generates a chain of bcsel instructions which checks the non-constant array index against constant values and then, upon matching, uses the value loaded from that UBO.

Without going into more depth about the exact mechanics of this pass for the sake of time, I’ll instead provide a better explanation by example. Here’s a stripped down version of one of the simplest piglit shader tests for non-constant uniform indexing (fs-array-nonconst):

[require]
GLSL >= 1.50
GL_ARB_gpu_shader5

[vertex shader passthrough]

[fragment shader]
#version 150
#extension GL_ARB_gpu_shader5: require

uniform block {
	vec4 color[2];
} arr[4];

uniform int n;
uniform int m;

out vec4 color;

void main()
{
	color = arr[n].color[m];
}

[test]
clear color 0.2 0.2 0.2 0.2
clear

ubo array index 0
uniform vec4 block.color[0] 0.0 1.0 1.0 0.0
uniform vec4 block.color[1] 1.0 0.0 0.0 0.0

uniform int n 0
uniform int m 1
draw rect -1 -1 1 1

relative probe rect rgb (0.0, 0.0, 0.5, 0.5) (1.0, 0.0, 0.0)

Using two uniforms, a color is indexed from a UBO as the FS output.

In the currently shipping version of zink, the final NIR output from ANV of the fragment shader might look something like this:

shader: MESA_SHADER_FRAGMENT
inputs: 0
outputs: 0
uniforms: 8
ubos: 5
shared: 0
decl_var shader_out INTERP_MODE_NONE vec4 color (FRAG_RESULT_DATA0.xyzw, 8, 0)
decl_function main (0 params)

impl main {
	block block_0:
	/* preds: */
	vec1 32 ssa_0 = load_const (0x00000002 /* 0.000000 */)
	vec1 32 ssa_1 = load_const (0x00000001 /* 0.000000 */)
	vec1 32 ssa_2 = load_const (0x00000004 /* 0.000000 */)
	vec1 32 ssa_3 = load_const (0x00000003 /* 0.000000 */)
	vec1 32 ssa_4 = load_const (0x00000010 /* 0.000000 */)
	vec1 32 ssa_5 = intrinsic load_ubo (ssa_1, ssa_4) (0, 1073741824, 16, 0, -1) /* access=0 */ /* align_mul=1073741824 */ /* align_offset=16 */ /* range_base=0 */ /* range=-1 */
	vec1 32 ssa_6 = load_const (0x00000000 /* 0.000000 */)
	vec1 32 ssa_7 = intrinsic load_ubo (ssa_1, ssa_6) (0, 1073741824, 0, 0, -1) /* access=0 */ /* align_mul=1073741824 */ /* align_offset=0 */ /* range_base=0 */ /* range=-1 */
	vec1 32 ssa_8 = umin ssa_7, ssa_3
	vec1 32 ssa_9 = ishl ssa_5, ssa_2
	vec1 32 ssa_10 = iadd ssa_8, ssa_1
	vec1 32 ssa_11 = load_const (0xfffffffc /* -nan */)
	vec1 32 ssa_12 = iand ssa_9, ssa_11
	vec1 32 ssa_13 = load_const (0x00000005 /* 0.000000 */)
	vec4 32 ssa_14 = intrinsic load_ubo (ssa_13, ssa_12) (0, 4, 0, 0, -1) /* access=0 */ /* align_mul=4 */ /* align_offset=0 */ /* range_base=0 */ /* range=-1 */
	vec4 32 ssa_23 = intrinsic load_ubo (ssa_2, ssa_12) (0, 4, 0, 0, -1) /* access=0 */ /* align_mul=4 */ /* align_offset=0 */ /* range_base=0 */ /* range=-1 */
	vec1 32 ssa_27 = ilt32 ssa_10, ssa_2
	vec1 32 ssa_28 = b32csel ssa_27, ssa_23.x, ssa_14.x
	vec1 32 ssa_29 = b32csel ssa_27, ssa_23.y, ssa_14.y
	vec1 32 ssa_30 = b32csel ssa_27, ssa_23.z, ssa_14.z
	vec1 32 ssa_31 = b32csel ssa_27, ssa_23.w, ssa_14.w
	vec4 32 ssa_32 = intrinsic load_ubo (ssa_3, ssa_12) (0, 4, 0, 0, -1) /* access=0 */ /* align_mul=4 */ /* align_offset=0 */ /* range_base=0 */ /* range=-1 */
	vec1 32 ssa_36 = ilt32 ssa_10, ssa_3
	vec1 32 ssa_37 = b32csel ssa_36, ssa_32.x, ssa_28
	vec1 32 ssa_38 = b32csel ssa_36, ssa_32.y, ssa_29
	vec1 32 ssa_39 = b32csel ssa_36, ssa_32.z, ssa_30
	vec1 32 ssa_40 = b32csel ssa_36, ssa_32.w, ssa_31
	vec4 32 ssa_41 = intrinsic load_ubo (ssa_0, ssa_12) (0, 4, 0, 0, -1) /* access=0 */ /* align_mul=4 */ /* align_offset=0 */ /* range_base=0 */ /* range=-1 */
	vec4 32 ssa_45 = intrinsic load_ubo (ssa_1, ssa_12) (0, 4, 0, 0, -1) /* access=0 */ /* align_mul=4 */ /* align_offset=0 */ /* range_base=0 */ /* range=-1 */
	vec1 32 ssa_49 = ilt32 ssa_10, ssa_1
	vec1 32 ssa_50 = b32csel ssa_49, ssa_45.x, ssa_41.x
	vec1 32 ssa_51 = b32csel ssa_49, ssa_45.y, ssa_41.y
	vec1 32 ssa_52 = b32csel ssa_49, ssa_45.z, ssa_41.z
	vec1 32 ssa_53 = b32csel ssa_49, ssa_45.w, ssa_41.w
	vec1 32 ssa_54 = ilt32 ssa_10, ssa_0
	vec1 32 ssa_55 = b32csel ssa_54, ssa_50, ssa_37
	vec1 32 ssa_56 = b32csel ssa_54, ssa_51, ssa_38
	vec1 32 ssa_57 = b32csel ssa_54, ssa_52, ssa_39
	vec1 32 ssa_58 = b32csel ssa_54, ssa_53, ssa_40
	vec4 32 ssa_59 = vec4 ssa_55, ssa_56, ssa_57, ssa_58
	intrinsic store_output (ssa_59, ssa_6) (8, 15, 0, 160, 132) /* base=8 */ /* wrmask=xyzw */ /* component=0 */ /* src_type=float32 */ /* location=4 slots=1 */	/* color */
	/* succs: block_1 */
	block block_1:
}

All the b32csel ops are generated by the above NIR pass, with each one “checking” a non-constant index against a constant value. At the end of the shader, the store_output uses the correct values, but this is pretty gross.

And Then Inlining

Some time ago, noted Gallium professor Marek Olšák authored a series which provided a codepath for inlining uniform data directly into shaders. The process for this is two steps:

  • Detect and designate uniforms to be inlined
  • Replace shader loads of these uniforms with the actual uniforms

The purpose of this is specifically to eliminate complex conditionals resulting from uniform data, so the detection NIR pass specifically looks for conditionals which use only constants and uniform data as the sources. Something like if (uniform_variable_expression) then becomes if (constant_value_expression) which can then be optimized out, greatly simplifying the eventual shader instructions.
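As a rough illustration of what “uses only constants and uniform data” means here, a check along these lines could classify a single source value. This is a heavily simplified sketch, not the actual Mesa pass, and the helper name is made up:

#include "nir.h"

/* hypothetical sketch: is this SSA value something the inlining detection
 * could treat as "constant or uniform data"? the real pass tracks far more
 * state (chains of ALU ops, which UBO is the uniform block, etc.) */
static bool
def_is_const_or_uniform_load(nir_ssa_def *def)
{
   nir_instr *parent = def->parent_instr;

   if (parent->type == nir_instr_type_load_const)
      return true;

   if (parent->type == nir_instr_type_intrinsic) {
      nir_intrinsic_instr *intr = nir_instr_as_intrinsic(parent);
      /* uniform data lowered to UBO loads, as in the zink output above */
      return intr->intrinsic == nir_intrinsic_load_ubo;
   }

   return false;
}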

Looking at the above NIR, this seems like a good target for inlining as well, so I took my hatchet to the detection pass and added support for the bcsel and fcsel ALU ops when their result sources were the results of intrinsics, e.g., loads. The results are good, to say the least:

shader: MESA_SHADER_FRAGMENT
inputs: 0
outputs: 0
uniforms: 8
ubos: 5
shared: 0
decl_var shader_out INTERP_MODE_NONE vec4 color (FRAG_RESULT_DATA0.xyzw, 8, 0)
decl_function main (0 params)

impl main {
	block block_0:
	/* preds: */
	vec1 32 ssa_0 = load_const (0x00000001 /* 0.000000 */)
	vec1 32 ssa_1 = load_const (0x00000004 /* 0.000000 */)
	vec1 32 ssa_2 = load_const (0x00000010 /* 0.000000 */)
	vec1 32 ssa_3 = intrinsic load_ubo (ssa_0, ssa_2) (0, 1073741824, 16, 0, -1) /* access=0 */ /* align_mul=1073741824 */ /* align_offset=16 */ /* range_base=0 */ /* range=-1 */
	vec1 32 ssa_4 = ishl ssa_3, ssa_1
	vec1 32 ssa_5 = load_const (0x00000002 /* 0.000000 */)
	vec1 32 ssa_6 = load_const (0xfffffffc /* -nan */)
	vec1 32 ssa_7 = iand ssa_4, ssa_6
	vec1 32 ssa_8 = load_const (0x00000000 /* 0.000000 */)
	vec4 32 ssa_9 = intrinsic load_ubo (ssa_5, ssa_7) (0, 4, 0, 0, -1) /* access=0 */ /* align_mul=4 */ /* align_offset=0 */ /* range_base=0 */ /* range=-1 */
	intrinsic store_output (ssa_9, ssa_8) (8, 15, 0, 160, 132) /* base=8 */ /* wrmask=xyzw */ /* component=0 */ /* src_type=float32 */ /* location=4 slots=1 */	/* color */
	/* succs: block_1 */
	block block_1:
}

The second load_ubo here is using the inlined uniform data to determine that it needs to load the 0 index, greatly reducing the shader’s complexity.

This still needs a bit of tuning, but I’m hoping to get it finalized soonish.

December 29, 2020

A New Sync

For some time now I’ve been talking about zink’s lack of WSI and the forthcoming, near-messianic work by FPS sherpa Adam Jackson to implement it.

This is an extremely challenging project, however, and some work needs to be done in the meanwhile to ensure that zink doesn’t drive off a cliff.

Swapchain Strikes Back

Any swapchain master is already well acquainted with the mechanism by which images are displayed on the screen, but the gist of it for anyone unfamiliar is that there are N image resources that are swapped back and forth (2 for double-buffered, 3 for triple-buffered, …). An image being rendered to is a backbuffer, and an image being displayed is a frontbuffer.

Ideally, a frontbuffer shouldn’t be drawn to while it’s in the process of being presented since such an action obliterates the app’s usefulness. The knowledge of exactly when a resource is done presenting is gained through WSI. On Xorg, however, it’s a bit tricky, to say the least. DRI3 is intended to address the underlying problems there with the XPresent extension, and the Mesa DRI frontend utilizes this to determine when an image is safe to use.

All this is great, and I’m sure it works terrifically in other cases, but zink is not like other cases. Zink lacks direct WSI integration. Under Xorg, this means it relies entirely on the DRI frontend to determine when it’s safe to start rendering onto an image resource.

But what if the DRI frontend gets it wrong?

Indeed, due to quirks in the protocol/xserver, XPresent idle events can be received for a “presented” image immediately, even if it’s still in use and has not finished presenting.

scumbag-xorg.png

In apps like SuperTuxKart, this results in insane flickering due to always rendering over the current frame before it’s finished being presented.

Return Of The Poll

To solve this problem, a wise, reclusive ghostwriter took time off from being at his local pub to offer me a suggestion:

Why not just rip the implicit fence of the DMAbuf out of the image object?

It was a great idea. But what did this pub enthusiast mean?

In short, WSI handles this problem by internally poll()ing on the image resource's underlying file descriptor. When there are no more events to poll() for, the image is safe to write on.

So now it’s back to the (semi) basics of programming. First, get the file descriptor of the image using normal Vulkan function calls:

static int
get_resource_fd(struct zink_screen *screen, struct zink_resource *res)
{
   VkMemoryGetFdInfoKHR fd_info = {};
   int fd;
   fd_info.sType = VK_STRUCTURE_TYPE_MEMORY_GET_FD_INFO_KHR;
   fd_info.memory = res->obj->mem;
   fd_info.handleType = VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD_BIT;
   VkResult result = (*screen->vk_GetMemoryFdKHR)(screen->dev, &fd_info, &fd);
   return result == VK_SUCCESS ? fd : -1;
}

This provides a file descriptor that can be used for more nefarious purposes. Any time the gallium pipe_context::flush hook is called, the flushed resource (swapchain image) must be synchronized by poll()ing as in this snippet:

static void
zink_flush(struct pipe_context *pctx,
           struct pipe_fence_handle **pfence,
           enum pipe_flush_flags flags)
{
   struct zink_context *ctx = zink_context(pctx);

   if (flags & PIPE_FLUSH_END_OF_FRAME && ctx->flush_res) {
      if (ctx->flush_res->obj->fd != -1) {
          /* FIXME: remove this garbage once we get wsi */
          struct pollfd p = {};
          p.fd = ctx->flush_res->obj->fd;
          p.events = POLLOUT;
          /* keep the poll() outside the assert so it isn't compiled away with NDEBUG */
          int ret = poll(&p, 1, -1);
          assert(ret == 1);
          assert(p.revents & POLLOUT);
          (void)ret;
      }
      ctx->flush_res = NULL;
   }

The POLLOUT event flag is used to determine when it’s safe to write. If there’s no pending usage during present then this will return immediately, otherwise it will wait until the image is safe to use.

hacks++.

December 24, 2020

ETOOMUCHREBASE

For real though, I’ve spent literal hours over the past week just rebasing stuff and managing conflicts. And then rebasing again after diffing against a reference commit when I inevitably discover that I fucked up the merge somehow.

But now the rebasing is done for a few minutes while I run more unit tests, so it’s finally time to blog.

It’s been a busy week. Nothing I’ve done has been very interesting. Lots of stabilizing and refactoring.

The blogging must continue, however, so here goes.

The Return of QBOs

Many months ago I blogged about QBOs.

Maybe.

Maybe I didn’t.

QBOs are Query Buffer Objects, where the result of a given query is stored into a buffer. This is great for performance since the query result can be read from the buffer without stalling.

Conceptually, anyway.
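For reference, from the application's side a QBO is used roughly like this (a minimal GL sketch, not tied to anything zink-specific):

/* minimal sketch of app-side QBO usage (ARB_query_buffer_object / GL 4.4+) */
GLuint query, qbo;
glGenQueries(1, &query);
glGenBuffers(1, &qbo);
glBindBuffer(GL_QUERY_BUFFER, qbo);
glBufferData(GL_QUERY_BUFFER, sizeof(GLuint64), NULL, GL_DYNAMIC_COPY);

glBeginQuery(GL_SAMPLES_PASSED, query);
/* ... draw calls ... */
glEndQuery(GL_SAMPLES_PASSED);

/* with a QBO bound, the "pointer" argument is an offset into the buffer,
 * so the result lands in GPU memory with no CPU readback */
glGetQueryObjectui64v(query, GL_QUERY_RESULT, (GLuint64 *)0);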

At present, zink has problems making this efficient for many types of queries due to the mismatch between GL query data and Vulkan query data, which means the driver has to manually read the results back and parse them on the CPU.

This is consistent with how zink manages non-QBO queries:

  • start query
  • end query
  • stall GPU
  • read query results back for user

As I’ve said many times along the way, the goal for zink has been to get the features in place and working first and then optimize later.

It’s now later, and query bottlenecking is actually hurting performance in some apps (e.g., RPCS3).

Memory++

Some profiling was done recently by bleeding edge tester Witold Baryluk, and it turns out that zink is using slightly less GPU memory than some native drivers, though it’s also using slightly more than some other native drivers:

lowmem.png

Looking at the right side of the graph, it’s obvious that there’s still some VRAM available to be used, which means there’s some VRAM available to use in optimizations.

As such, I decided to rewrite the query internals to have every query be a QBO internally, consistent with the RadeonSI model. While it does use a tiny bit more VRAM due to needing to allocate the backing buffers, the benefit of this is that now all query result data is copied to a buffer as soon as the query stops, so from an API perspective, this means that the result becomes available as soon as the backing buffer becomes idle.

It also means that any time actual QBOs are used (which is all the time for competent apps), I’ll eventually have the ability to asynchronously post the result data from a query onto a user buffer without triggering a stall.

Functionally, this isn’t a super complex maneuver: I’ve already got a utility function that performs a vkCmdCopyQueryPoolResults for regular QBO handling, so repurposing this to be called any time a query was ended, combined with modifying the parsing function to first map the internal buffer, was sufficient.
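In case it helps picture the maneuver, the copy itself amounts to something like the following. This is a hedged sketch with made-up names, not the actual zink utility function:

/* sketch: copy a just-ended query's result into a driver-internal backing
 * buffer so it can be read (or copied to a user QBO) later without stalling */
static void
copy_query_result(VkCommandBuffer cmdbuf, VkQueryPool pool,
                  uint32_t query, VkBuffer backing_buffer)
{
   vkCmdCopyQueryPoolResults(cmdbuf, pool,
                             query, 1,             /* first query, count */
                             backing_buffer, 0,    /* dst buffer, offset */
                             sizeof(uint64_t),     /* stride */
                             VK_QUERY_RESULT_64_BIT |
                             VK_QUERY_RESULT_WAIT_BIT);
}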

In the end, the query code is now a bit more uniform, and in the future I can use a compute shader to keep everything on GPU without needing to do any manual readback.

December 18, 2020

memcpy harder

Amidst the flurry of patches being smashed into the repo today I thought I’d talk about memcpy. Yes, I’m referring to the same function everyone knows and loves.

The final frontier of RPCS3 performance was memcpy. With ARGB emulation in place, my perf results were looking like this on RADV:

zink.png

Pushing a hard 12fps, this was a big improvement from before, but it seemed a bit low. Not having any prior experience in PS3 emulation, I started wondering whether this was the limit of things in the current year.

Meanwhile, RadeonSI was getting significantly higher performance with a graph like this:

radeonsi.png

Clearly performance is capable of being much higher, so why is zink so much worse at memcpy?

Further Exploration…

…along with the wisdom of the great sage Dave Airlie led me to check out the resource creation code in zink, specifically the types of memory that are used for allocation. Vulkan supports a range of memory types for the discerning allocations connoisseur, but the driver was jamming everything into VK_MEMORY_PROPERTY_HOST_COHERENT_BIT. This is great for ease of use, as the contents of buffers are always synchronized between CPU and GPU with no additional work needed, but it ends up being massively slower for any kind of direct copy operations on the backing memory, e.g., anything PBO-related.

What should really be used is VK_MEMORY_PROPERTY_HOST_CACHED_BIT whenever possible. This requires some additional legwork to properly invalidate/flush the memory used by any vkMapMemory calls, but the results, shown below, were well worth the effort.
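That legwork looks roughly like the following. This is a minimal sketch assuming a VkDevice and a mapped VkDeviceMemory are already in hand; it is not the actual zink code:

#include <stdbool.h>
#include <vulkan/vulkan.h>

/* sketch: with HOST_CACHED (non-coherent) memory, CPU writes must be flushed
 * and GPU writes invalidated around any access through the mapping */
static void
sync_noncoherent_mapping(VkDevice device, VkDeviceMemory mem, bool cpu_wrote)
{
   VkMappedMemoryRange range = {
      .sType = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE,
      .memory = mem,
      .offset = 0,
      .size = VK_WHOLE_SIZE,
   };
   if (cpu_wrote)
      /* make CPU writes visible to the GPU */
      vkFlushMappedMemoryRanges(device, 1, &range);
   else
      /* make GPU writes visible to the CPU before reading */
      vkInvalidateMappedMemoryRanges(device, 1, &range);
}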

zink2.png

And performance was a buttery smooth 30fps (the cap) as well:

bioshock.png

Mission accomplished.

December 17, 2020

The Name Of The Game

…is emulation.

With blit-based transfers working, I checked my RPCS3 flamegraph again to see the massive performance improvements that I’d no doubt be seeing:

fg.png

Except there were none.

Closer examination revealed that this was due to the app using ARGB formats for its PBOs. Referencing that against VkFormat led to a further problem: ARGB and ABGR are not explicitly supported by Vulkan.

This wasn’t exactly going to be an easy fix, but it wouldn’t prove too challenging either.

Swizzle Me Timbers

Yes, swizzles. In layman’s terms, a swizzle is mapping a given component to another component, for example in an RGBA-ordered format, using a WXYZ swizzle would result in a reordering of the 0-indexed components to 3012, or ARGB.

Gallium, when using blit-based transfers, provides a lot of opportunities to use swizzles, specifically by having a lot of blits go through a u_blitter codepath that translates blits into quad draws with a sampled image.

Thus, by applying an ARGB/ABGR emulation swizzle to each of these codepaths, I can drop in native-ish support under the hood of the driver by internally reporting ARGB as RGBA and ABGR as BGRA.

In pseudocode, the ARGB path looks something like this:

 unsigned char dst_swiz[4];

 if (src_is_argb) {
    unsigned char reverse_alpha[] = {
       PIPE_SWIZZLE_Y,
       PIPE_SWIZZLE_Z,
       PIPE_SWIZZLE_W,
       PIPE_SWIZZLE_X,
    };
    /* compose swizzle with alpha at the end */
    util_format_compose_swizzles(original_swizzle, reverse_alpha, dst_swiz);
 } else if (dst_is_argb) {
    unsigned char reverse_alpha[] = {
       PIPE_SWIZZLE_W,
       PIPE_SWIZZLE_X,
       PIPE_SWIZZLE_Y,
       PIPE_SWIZZLE_Z,
    };
    /* compose swizzle with alpha at the start */
    util_format_compose_swizzles(original_swizzle, reverse_alpha, dst_swiz);
}

The original swizzle is composed with the alpha-reversing swizzle to generate a swizzle that translates the resource’s internal ARGB data into RGBA data (or vice versa) like the Vulkan driver is expecting it to be.

From there, the only restriction is that this emulation is prohibited in texel buffers due to there not being a direct method of applying a swizzle to that codepath. Sure, I could do the swizzle in the shader as a variant, but then this leads to more shader variants and more pipeline objects, so it’s simpler to just claim no support here and let gallium figure things out using other codepaths.

Progress?

Would this be enough to finally get some frames moving?

Find out tomorrow in the conclusion to this SGC miniseries.

December 16, 2020

To Begin

This is the journey of how zink-wip went from 0 fps in RPCS3 to a bit more than that. Quite a bit more, in fact, if you’re using RADV.

As all new app tests begin, this one started with firing up the app. Since there’s no homebrew games available (that I could find), I decided to pick something that I owned and was familiar with. Namely a demo of Bioshock.

It started up nicely enough:

title.png

But then I started a game and things got rough:

maximum-oof.png

Yikes.

Another Overly Technical Post

One of the fundamentals of a graphics driver is that the GPU should be handling as much work as possible. This means that, for example, any time an application is using a Pixel Buffer Object (PBO), the GPU should be used for uploading and downloading the pixel buffer.

Why are you suddenly mentioning PBOs, you might be asking.

Well, let’s check out what’s going on using a perf flamegraph:

fg.png

The driver in this case is hitting a software path for copying pixels to and from a PBO, effectively doing full-frame memcpy operations multiple times each frame. This is on the CPU, which is obviously not great for performance. As above, ideally this should be moved to the GPU.

Gallium provides a pipe cap for this: PIPE_CAP_PREFER_BLIT_BASED_TEXTURE_TRANSFER

Zink doesn’t use this in master right now, which naturally led me down the path of enabling it.

There were problems.

Lots of problems.

The first problem was that suddenly I had an infinite number of failing unit tests. Confusing for sure. Some intensive debugging led me to this block of code in zink which is used for directly mapping a rectangular region of image resource memory:

VkResult result = vkMapMemory(screen->dev, res->mem, res->offset, res->size, 0, &ptr);
if (result != VK_SUCCESS)
   return NULL;
VkImageSubresource isr = {
   res->aspect,
   level,
   0
};
VkSubresourceLayout srl;
vkGetImageSubresourceLayout(screen->dev, res->image, &isr, &srl);
ptr = ((uint8_t *)ptr) + box->z * srl.depthPitch +
                         box->y * srl.rowPitch +
                         box->x;

Suspicious. box in this case represents the region to be mapped, yet members of VkSubresourceLayout like offset aren’t being applied to handle the level that’s intended to be loaded, nor is this taking into account the bits-per-pixel of the image. In fact, this is always assuming that each x coordinate unit equals a single byte.

The fully corrected version is more like this:

VkResult result = vkMapMemory(screen->dev, res->mem, res->offset, res->size, 0, &ptr);
if (result != VK_SUCCESS)
   return NULL;
VkImageSubresource isr = {
   res->aspect,
   level,
   0
};
VkSubresourceLayout srl;
vkGetImageSubresourceLayout(screen->dev, res->image, &isr, &srl);
const struct util_format_description *desc = util_format_description(res->base.format);
unsigned offset = srl.offset +
                  box->z * srl.depthPitch +
                  (box->y / desc->block.height) * srl.rowPitch +
                  (box->x / desc->block.width) * (desc->block.bits / 8);
ptr = ((uint8_t *)ptr) + offset;

It turns out that no unit test had previously passed a nonzero x coordinate for a mapping region or tried to map a nonzero level, so this was never exposed as being broken.

Imagine that.

December 15, 2020

In Many Forms

One of the people in the #zink IRC channel has been posing an interesting challenge for me in the form of trying to run every possible console emulator on my zink-wip branch.

This has raised a number of issues with various parts of the driver, so expect a number of posts on the topic.

Threads

First up was the citra emulator for the 3DS. This is an interesting app for a number of reasons, the least of which is because it uses a ton of threads, including a separate one for GL, which put my own work to the test.

Suffice to say that my initial implementation of u_threaded_context needed some work.

One of the main premises of the threaded context is this idea of an asynchronous fence object. The threaded context will create these in a thread and provide them to the driver in the pipe_context::flush hook, but only in some cases; at other times, the fence object provided will just be a “regular” synchronous one.

The trick here is that the driver itself has a fence for managing synchronization, and the threaded context can create N number of its own fences to manage the driver’s fence, all of which must potentially work when called in a random order and from either the “main” thread or the driver-specific thread.

There’s too much code involved here to be providing any samples here, but I’ll go over the basics of it just for posterity. Initially, I had implemented this entirely on the zink side such that each zink fence had references to all the tc fences in a chain, and fence-related resources were generally managed on the last fence in the chain. I had two separate object types for this: one for zink fences and one for tc fences. The former contained all the required vulkan-specific objects while the latter contained just enough info to work with tc.

This was sort of fine, and it worked for many things, the least of which was all my benchmarking.

The problem was that a desync could occur if one of the tc fences was destroyed sufficiently later than its zink fence, leading to an eventual crash. This was never triggered by unit tests nor basic app usage, but something like citra with its many threads managed to hit it consistently and quickly.

Thus began the day-long process of rewriting the tc implementation to a much-improved 2.0 version. The primary difference in this design model is that I worked a bit closer to the original RadeonSI implementation, having only a single externally-used fence object type for both gallium as well as tc and creating them for the zink fence object without any sort of cross-referencing. This meant that rather than having 1 zink fence with references to N tc fences, I now had N tc fences each with a reference to 1 zink fence.

This simplified the code a bit in other ways after the rewrite, as the gallium/tc fence objects were now entirely independent. The one small catch was that zink fences get recycled, meaning that in theory a gallium/tc fence could have a reference to a zink fence that it no longer was managing, but this was simple enough to avoid by gating all tc fence functionality on a comparison between its stored fence id and the id of the fence that it had a reference to. If they failed to match, the gallium/tc fence had already completed.
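A rough sketch of that gating looks like this (hypothetical struct and field names; the real objects hold much more state):

/* sketch: a gallium/tc fence keeps a reference to a recyclable driver fence
 * plus a snapshot of the id that fence had when the reference was taken */
struct driver_fence {
   uint32_t id;                /* bumped every time the fence is recycled */
   /* ...vulkan objects... */
};

struct tc_fence {
   struct driver_fence *fence;
   uint32_t id;                /* snapshot of fence->id at creation */
};

static bool
tc_fence_is_current(const struct tc_fence *mfence)
{
   /* if the driver fence has been recycled for a newer submission, the work
    * this tc fence represents has already completed */
   return mfence->fence && mfence->fence->id == mfence->id;
}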

Stability++

Things seem like they’re in better shape now with regards to stability. It’s become more challenging than ever to debug the driver with threading enabled, but that’s just one of the benefits that threads provide.

Next time I’ll begin a series on how to get a mesa driver from less than 1fps to native performance in RPCS3.

December 14, 2020

New Week, New Idea

I have to change things up.

Historically I’ve spent a day working on zink and then written up a post at the end. The problem with this approach is obvious: a lot of times when I get to the end of the day I’m just too mentally drained to think about anything else and want to just pass out on my couch.

So now I’m going to try inverting my schedule: as soon as I get up, it’s now going to be blog time.

I’m not even fully awake right now, so this is definitely going to be interesting.

Stencil Sampling

Today’s exploratory morning post is about sampling from stencil buffers.

What is sampling from stencil buffers, some of you might be asking.

Sampling in general is the reading of data from a resource. It’s most commonly used as an alternative to using a Copy command for transferring some amount of data from one resource to another in a specified way.

For example, extracting only stencil data from a resource which combines both depth and stencil data. In zink, this is an important operation because none of the Copy commands support multisampled resources containing both depth and stencil data, an OpenGL feature that the unit tests most certainly cover.

As with all things, zink has a tough time with this.

Sampling Basics

For the purpose of this post, I’m only going to be talking about sampling from image resources. Sampling from buffer resources is certainly possible and useful as well, but there’s just less that can go wrong in that case.

The general process of a sampling-based copy operation in Gallium-based drivers is as follows:

  • have resource src which contains some amount of data
  • have resource dst which is the intended destination for the data from src
  • bind src as a “sampler view”, which is essentially a combination of info that determines how data will be sampled from a resource
  • bind dst as an output target (e.g., a framebuffer attachment)
  • bind a fragment shader containing a sampler type that samples from the bound sampler view and writes to the output target (either by gl_FragColor output or imageStore)
  • dump some vertices into the pipeline and blammo, missionaccomplished.jpg

In the case of stencil sampling, zink has issues with the third step here.

The Code

Here’s what we’ve currently got shipping in the driver for the relevant part of creating image sampler views:

VkImageViewCreateInfo ivci = {};
ivci.sType = VK_STRUCTURE_TYPE_IMAGE_VIEW_CREATE_INFO;
ivci.image = res->image;
ivci.viewType = image_view_type(state->target);
ivci.format = zink_get_format(screen, state->format);
assert(ivci.format);
ivci.components.r = component_mapping(state->swizzle_r);
ivci.components.g = component_mapping(state->swizzle_g);
ivci.components.b = component_mapping(state->swizzle_b);
ivci.components.a = component_mapping(state->swizzle_a);

ivci.subresourceRange.aspectMask = sampler_aspect_from_format(state->format);
ivci.subresourceRange.baseMipLevel = state->u.tex.first_level;
ivci.subresourceRange.baseArrayLayer = state->u.tex.first_layer;
ivci.subresourceRange.levelCount = state->u.tex.last_level - state->u.tex.first_level + 1;
ivci.subresourceRange.layerCount = state->u.tex.last_layer - state->u.tex.first_layer + 1;

err = vkCreateImageView(screen->dev, &ivci, NULL, &sampler_view->image_view);

Filling in some gaps:

  • res is the image being sampled from
  • image_view_type() converts Gallium texture types (e.g., 1D, 2D, 2D_ARRAY, …) to the corresponding Vulkan type
  • zink_get_format() converts a Gallium image format to a usable Vulkan one
  • component_mapping() converts a Gallium swizzle to a Vulkan one (swizzles determine which channels in the sample operation are mapped from the source to the destination)
  • sampler_aspect_from_format() infers VkImageAspectFlags from a Gallium format

The Problem

Regarding sampler descriptors, the Vulkan spec states: “If imageView is created from a depth/stencil image, the aspectMask used to create the imageView must include either VK_IMAGE_ASPECT_DEPTH_BIT or VK_IMAGE_ASPECT_STENCIL_BIT but not both.”

This means that for combined depth+stencil resources, only the depth or stencil aspect can be specified, but not both. Since Gallium presents drivers with a format and swizzle based on the data being sampled from the image, this poses a problem: 1) the format provided will usually map to something like VK_FORMAT_D32_SFLOAT_S8_UINT, and 2) the swizzle provided will be based on that format.

But zink can only specify one of the aspects, so something has to give.

The Solution

The format being sampled must also match the aspect type, and VK_FORMAT_D32_SFLOAT_S8_UINT is obviously not a pure stencil format. This means that any time zink infers a stencil-only aspect image format like PIPE_FORMAT_X32_S8X24_UINT, which is a two channel format where the depth channel is ignored, the format passed in VkImageViewCreateInfo has to just be the stencil format being sampled. Helpfully, this will always be VK_FORMAT_S8_UINT.

So now the code would look like this:

VkImageViewCreateInfo ivci = {};
ivci.sType = VK_STRUCTURE_TYPE_IMAGE_VIEW_CREATE_INFO;
ivci.image = res->obj->image;
ivci.viewType = image_view_type(state->target);

ivci.components.r = component_mapping(state->swizzle_r);
ivci.components.g = component_mapping(state->swizzle_g);
ivci.components.b = component_mapping(state->swizzle_b);
ivci.components.a = component_mapping(state->swizzle_a);
ivci.subresourceRange.aspectMask = sampler_aspect_from_format(state->format);
/* samplers for stencil aspects of packed formats need to always use stencil type */
if (ivci.subresourceRange.aspectMask == VK_IMAGE_ASPECT_STENCIL_BIT)
   ivci.format = VK_FORMAT_S8_UINT;
else
   ivci.format = zink_get_format(screen, state->format);

Mo’ Time, Mo’ Problems

The above code was working great for months in zink-wip.

Then “bugs” were fixed in master.

The new problem came from a merge request claiming to “fix depth/stencil blit shaders”. The short of this is that previously, the shaders generated by mesa for the purpose of doing depth+stencil sampling were always reading from the first channel of the image, which was exactly what zink was intending given that for that case the underlying Vulkan driver would only be reading one component anyway. After this change, however, samplers are now reading from the second channel of the image.

Given that a Vulkan stencil format has no second channel, this poses a problem.

Luckily, the magic of swizzles can solve this. By mapping the second channel of the sampler to the first channel of the image data, the sampler will read the stencil data again.

The fully fixed code now looks like this:

VkImageViewCreateInfo ivci = {};
ivci.sType = VK_STRUCTURE_TYPE_IMAGE_VIEW_CREATE_INFO;
ivci.image = res->obj->image;
ivci.viewType = image_view_type(state->target);

ivci.components.r = component_mapping(state->swizzle_r);
ivci.components.g = component_mapping(state->swizzle_g);
ivci.components.b = component_mapping(state->swizzle_b);
ivci.components.a = component_mapping(state->swizzle_a);
ivci.subresourceRange.aspectMask = sampler_aspect_from_format(state->format);
/* samplers for stencil aspects of packed formats need to always use stencil type */
if (ivci.subresourceRange.aspectMask == VK_IMAGE_ASPECT_STENCIL_BIT) {
   ivci.format = VK_FORMAT_S8_UINT;
   ivci.components.g = VK_COMPONENT_SWIZZLE_R;
} else
   ivci.format = zink_get_format(screen, state->format);

And now everything works.

For now.

December 08, 2020

Counting

I keep saying this, but I feel like I’m progressively getting further away from the original goal of this blog, which was to talk about actual code that I’m writing and not just create great graphics memes. So today it’s once again a return to the roots and the code that I had intended to talk about yesterday.

Gallium is a state tracker, and as part of this, it provides various features to make writing drivers easier. One of these features is that it rolls atomic counters into SSBOs, both in terms of the actual buffer resource and the changing of shader instructions to access atomic counters as though they’re uint32_t values at an offset in a buffer. On the surface, and for most drivers, this is great: the driver just has to implement handling for SSBOs, and then they get counters as a bonus.

As always, however, for zink this is A Very Bad Thing.

SPIRV

One of the challenges in zink is the ntv backend which translates the OpenGL shader into a Vulkan shader. Typically this means GLSL -> SPIRV, but there’s also the ARB_gl_spirv extension which allows GL to have SPIRV shaders as well, meaning that zink has to do SPIRV -> SPIRV. GLSL has a certain way of working that NIR handles, but SPIRV is very different, and so the information that is provided by NIR for GLSL is very different than what’s available for SPIRV.

In particular, SPIRV shaders have valid binding values for shader buffers. GLSL shaders do not. This becomes a problem when trying to determine, as zink must, exactly which descriptors are on a given resource and which descriptors need to have their bindings used vs which can be ignored. Since there’s no way to differentiate a GLSL shader from a SPIRV shader, this is a challenge. It’s further a challenge given that one of the NIR passes that changes shader instructions over from variable pointer derefs to explicit block_id and offset values happens to break bindings in such a way that it becomes impossible to accurately tell which SSBO variables are counters and which are actual SSBOs.

Enter The Dark Arts

if (!strcmp(glsl_get_type_name(var->interface_type), "counters"))

Yup. The original SSBO/counter implementation has to use strcmp to check the name of the variable’s interface in order to accurately determine whether it’s a counter-turned-SSBO.

There’s also some extremely gross code in ntv for trying to match up the SSBO to its expected block_id based on this, but SGC is a SFW blog, so I’m going to refrain from posting it.

Improvements

As always, there’s ways to improve my code. This way came some time after I’d written SSBO support in the form of a new pipe cap, PIPE_CAP_NIR_ATOMICS_AS_DEREF. What this does is allow a driver to skip the Gallium code that transforms counters into SSBOs, making them very easy to detect.

With this in my pocket, I was already 5% of the way to better atomic counter handling.

The next step was to unbreak counter location values. The location is, in ntv language, the offset of a variable inside a given buffer block using a type-based unit, e.g., location=1 would mean an offset of 4 bytes for an int type block. Here’s a NIR pass I wrote to tackle the problem:

static bool
fixup_counter_locations(nir_shader *shader)
{
   unsigned last_binding = 0;
   unsigned last_location = 0;
   if (!shader->info.num_abos)
      return false;
   nir_foreach_variable_with_modes(var, shader, nir_var_uniform) {
      if (!type_is_counter(var->type))
         continue;
      var->data.binding += shader->info.num_ssbos;
      if (var->data.binding != last_binding) {
         last_binding = var->data.binding;
         last_location = 0;
      }
      var->data.location = last_location++;
   }
   return true;
}

The premise here is that counters get merged into buffers based on their binding value, and any number of counters can exist for a given binding. Since Gallium always puts counter buffers after SSBOs, the binding used needs to be incremented by the number of real SSBOs present. With this done, all counters with matching bindings can be assumed to exist sequentially on the same buffer.

Next comes the actual SPIRV variable construction. With the knowledge that zink will be receiving some sort of NIR shader instruction like vec1 32 ssa_0 = deref_var &some_counter, where some_counter is actually a value at an offset inside a buffer, it’s important to be considering how to conveniently handle the offset. I ended up with something like this:

   if (type_is_counter(var->type)) {
      SpvId padding = var->data.offset ? get_sized_uint_array_type(ctx, var->data.offset / 4) : 0;
      SpvId types[2];
      if (padding)
         types[0] = padding, types[1] = array_type;
      else
         types[0] = array_type;
      struct_type = spirv_builder_type_struct(&ctx->builder, types, 1 + !!padding);
      if (padding)
         spirv_builder_emit_member_offset(&ctx->builder, struct_type, 1, var->data.offset);
   }

This creates a struct containing 1-2 members:

  • (optional) a padding array for the variable’s offset
  • the actual variable type, sized as an array

Converting the deref_var instruction can then be simplified into a consistent and easy to generate OpAccessChain:

if (type_is_counter(deref->var->type)) {
   SpvId dest_type = glsl_type_is_array(deref->var->type) ?
                     get_glsl_type(ctx, deref->var->type) :
                     get_dest_uvec_type(ctx, &deref->dest);
   SpvId ptr_type = spirv_builder_type_pointer(&ctx->builder,
                                               SpvStorageClassStorageBuffer,
                                               dest_type);
   SpvId idx[] = {
      emit_uint_const(ctx, 32, !!deref->var->data.offset),
   };
   result = spirv_builder_emit_access_chain(&ctx->builder, ptr_type, result, idx, ARRAY_SIZE(idx));
}

After setting up the destination type for the deref, the OpAccessChain is generated using a single index: for cases where the variable lies at a nonzero offset it selects the second member after the padding array, otherwise it selects the first member, which is the intended counter variable.

The rest of the atomic counter conversion was just a matter of supporting the specific counter-related instructions that would otherwise have been converted to regular atomic instructions.

As a result of these changes, zink has gone from a 75% pass rate in ARB_gl_spirv piglit tests all the way up to around 90%.

December 07, 2020

Alright, But Now I’m Really Back

Blogging is hard. Also, getting back on track during a major holiday week is hard.

But now things are settling, and it’s time to get down to brass tacks.

And code. Brass code.

Maybe just code.

Updates

But first, some updates.

Historically when I’ve missed my blogging window for an extended period, it’s because I’m busy. This has been the case for the past week, but I don’t have much to show for it in terms of zink enhancements. There’s some work ongoing on various MRs, but probably this is a good time to revise the bold statement I’d previously made: there’s now roughly two weeks (9 workdays) remaining in which it’s feasible to land zink patches before the end of the year, and probably hitting GL 4.6 in mainline mesa is unrealistic. I’d be pleasantly surprised if we hit 4.0 given that we’d need to be landing a minimum of 1 new MR each day.

But there are some cool things on the horizon for zink nonetheless:

  • work has begun on getting zink working with lavapipe for CI purposes
  • I fixed an annoying spec-related issue that now gives better compatibility with non-Intel drivers
  • improved gl_spirv support
  • further performance-related work

And Now For Something Vaguely Interesting

Looking at the second item in the above list, there’s a vague sense of discomfort that anyone working deeply with shader images will recognize.

Yes, I’m talking about the COHERENT qualifier.

For anyone interested in a deeper reading into this GLSL debacle, check out this stackoverflow thread.

TL;DR, COHERENT is supposed to ensure that buffer/image access across shader stages is synchronized, also known as coherency in this context. But then also due to GL spec wording it can simultaneously mean absolutely nothing, sort of like compiler warnings, so this is an area that generally one should avoid thinking about or delving into.

Naturally, zink is delving deep into this. And of course, Vulkan makes everything better, so this issue is 100% not an issue anymore, and everything is great.

Just kidding.

Vulkan has the exact same language in the parts of the spec referencing this behavior:

While GLSL (and legacy SPIR-V) applies the “coherent” decoration to variables (for historical reasons), this model treats each memory access instruction as having optional implicit availability/visibility operations.

optional implicit

Aren’t explicit specifications like Vulkan great?

What happens here is that the spec has no requirement that either the application or driver actually enforces coherency across resources, meaning that if an application optionally decides not to bother, then it’s up to the driver whether to optionally bother guaranteeing coherent access. If neither the application nor driver take any action to guarantee this behavior, the application won’t work as expected.

coherency.png

To fix this on the application (zink) side, image writes in shaders need to specify MakeTexelAvailable|NonPrivateTexel operands, and the reads need MakeTexelVisible|NonPrivateTexel.

December 04, 2020

In Part 1 I've shown you how to create your own distribution image using the freedesktop.org CI templates. In Part 2, I've shown you how to truly build nested images. In this part, I'll talk about the ci-fairy tool that is part of the same repository of ci-templates.

When you're building a CI pipeline, there are some tasks that most projects need in some way or another. The ci-fairy tool is a grab-bag of solutions for these. Some of those solutions are for a pipeline itself, others are for running locally. So let's go through the various commands available.

Using ci-fairy in a pipeline

It's as simple as including the template in your .gitlab-ci.yml file.


include:
  - 'https://gitlab.freedesktop.org/freedesktop/ci-templates/-/raw/master/templates/ci-fairy.yml'

Of course, if you want to track a specific sha instead of following master, just sub that sha there. freedesktop.org projects can include ci-fairy like this:

include:
  - project: 'freedesktop/ci-templates'
    ref: master
    file: '/templates/ci-fairy.yml'

Once that's done, you have access to a .fdo.ci-fairy job template that you can extend from. This will download an image from quay.io that is capable of git, python, bash and obviously ci-fairy. This image is a fixed one and referenced by a unique sha, so even as we keep working on ci-fairy upstream you should never see regressions; updating requires you to explicitly update the sha of the included ci-fairy template. Obviously, if you're using master like above you'll always get the latest.

Due to how the ci-templates work, it's good to set the FDO_UPSTREAM_REPO variable to the upstream project name. This means ci-fairy will be able to find the equivalent origin/master branch when that's not available in the merge request. Note, this is not your personal fork but the upstream one, e.g. "freedesktop/ci-templates" if you are working on the ci-templates itself.

Checking commit messages

ci-fairy has a command to check commits for a few basic expectations in commit messages. This currently includes things like enforcing an 80 char subject line length, that there is an empty line after the subject line, that no fixup or squash commits are in the history, etc. If you have complex requirements you need to write your own, but for most projects this job ensures that there are no obvious errors in the git commit log:


check-commit:
  extends:
    - .fdo.ci-fairy
  script:
    - ci-fairy check-commits --signed-off-by
  except:
    - master@upstream/project

Since you don't ever want this to fail on an already merged commit, exclude this job on the master branch of the upstream project; the MRs should've caught this already anyway.

Checking merge requests

To rebase a contributor's merge request, the contributor must tick the checkbox to Allow commits from members who can merge to the target branch. The default value is off, which is frustrating (gitlab is working on it though) and causes unnecessary delays in processing merge requests. ci-fairy has a command to check for this value on an MR and fail if it isn't set; contributors ideally pay attention to the pipeline and fix this accordingly.


check-merge-request:
  extends:
    - .fdo.ci-fairy
  script:
    - ci-fairy check-merge-request --require-allow-collaboration
  allow_failure: true

As a tip: run this job towards the end of the pipeline to give collaborators a chance to file an MR before this job fails.

Using ci-fairy locally

The two examples above are the most useful ones for CI pipelines, but ci-fairy also has some useful local commands. For that you'll have to install it, but that's as simple as


$ pip3 install git+http://gitlab.freedesktop.org/freedesktop/ci-templates
A big focus for ci-fairy's local commands is that they should, usually, be able to work without any specific configuration if you run them in the repository itself.

Linting

Just hacked on the CI config?


$ ci-fairy lint
and done, you get the same error back that the online linter for your project would return.

Pipeline checks

Just pushed to the repo?


$ ci-fairy wait-for-pipeline
Pipeline https://gitlab.freedesktop.org/username/project/-/pipelines/238586
status: success | 7/7 | created: 0 | pending: 0 | running: 0 | failed: 0 | success: 7 ....
The command is self-explanatory, I think.

Summary

There are a few other parts to ci-fairy, including templating and even minio handling. I recommend looking at e.g. the libinput CI pipeline, which uses much of ci-fairy's functionality. And check the online documentation for ci-fairy; who knows, there may be something useful in there for you.

The useful contribution of ci-fairy is primarily that it tries to detect the settings for each project automatically, regardless of whether it's run inside a MR pipeline or just as part of a normal pipeline. So the same commands will work without custom configuration on a per-project basis. And for many things it works without API tokens, so the setup costs are just the pip install.

If you have recurring jobs, let us know, we're always looking to add more useful functionality to this little tool.

November 27, 2020

New Game+

Up until now, I’ve been relying solely on my Intel laptop’s onboard GPU for testing, and that’s been great; Intel’s drivers are robust as hell and have very few issues. On top of that, the rare occasions when I’ve found issues have led to a swift resolution.

Certainly I can’t complain at all about my experience with Intel’s hardware or software.

But now things are different and strange because I received in the mail a couple weeks ago a shiny AMD Radeon RX 5700XT.

Mostly in that it’s a new codebase with new debugging tools and such.

Unlike when I started my zink journey earlier this year, however, I’m much better equipped to dive in and Get Things Done.

Progress

This week’s been a bit hectic as I worked to build my new machines, do holiday stuff, and then also get back into the project here. Nevertheless, significant progress has been made in a couple areas:

  • I tested out and then dumped some very heavy-handed descriptor locks that had no impact on performance because they were never doing anything
  • I fixed a major issue with barrier usage

The latter of these is what I’m going to talk about today, and I’m going to zoom in on one very specific handler since it’s been a while since this blog has shown any actual code.

Barrier Performance

When using Vulkan barriers, it’s important to simultaneously:

  • use enough of them that the underlying driver has enough information to transition resources into the states desired
  • avoid using so many that your performance is garbage

Many months ago, I wrote a patch which aimed to address the second point while also not neglecting the first.

I succeeded in one of these goals.

The reason I didn’t notice that I also failed in one of these goals until now is that ANV actually has weak barrier support. By this I mean that while ANV’s barriers work fine and serve the expected purpose of changing resource layouts when necessary, it doesn’t actually do anything with the srcStageMask or dstStageMask parameters. Also, Intel drivers conveniently don’t change an image’s layout for different uses (i.e., GENERAL is the same as SHADER_READ), so screwing up these layouts doesn’t really matter.

Is this a problem?

No.

ANV is great at doing ANV stuff, and zink has thus far been great at punting GL down the pipe so that ANV can blast out some (correct) pixels.

Consider the following code:

bool
zink_resource_image_needs_barrier(struct zink_resource *res, VkImageLayout new_layout, VkAccessFlags flags, VkPipelineStageFlags pipeline)
{
   if (!pipeline)
      pipeline = pipeline_dst_stage(new_layout);
   if (!flags)
      flags = access_dst_flags(new_layout);
   return res->layout != new_layout || (res->access & flags) != flags || (res->access_stage & pipeline) != pipeline;
}

This is a function I wrote for the purpose of no-oping redundant barriers. The idea here is that a barrier is unnecessary if it’s applying the same layout with the same access (or a subset of access) for the same stages (or a subset of stages). The access and stage flags can optionally be omitted and filled in with default values for ease of use too.

Basic state tracking.

This worked great on ANV.

The problem, however, comes when trying to run this on a driver that really gets deep into putting those access and stage flags to work in order to optimize the resource’s access.

RADV is such a driver.

Consider the following barrier sequence:

  • VK_IMAGE_LAYOUT_GENERAL, VK_ACCESS_READ | VK_ACCESS_WRITE, VK_PIPELINE_STAGE_TRANSFER_BIT
  • VK_IMAGE_LAYOUT_GENERAL, VK_ACCESS_READ | VK_ACCESS_WRITE, VK_PIPELINE_STAGE_TRANSFER_BIT

Obviously it’s not desirable to use GENERAL as a layout, but that’s how zink works for a couple cases at the moment, so it’s a case that must be covered adequately. Going by the above filtering function, the second barrier has the same layout, the same access flags, and the same stage flags, so it gets ignored.

Conceptually, barriers in Vulkan are used for the purpose of informing the driver of dependencies between operations for both internal image layout (i.e., compressing/decompressing images for various usages) and synchronization. This means that if image A is written to in operation O1 and then read from in operation O2, the user can either stall after O1 or use one of the various synchronization methods provided by Vulkan to ensure the desired result. Given that this is GPU <-> GPU synchronization, that means either a semaphore or a pipeline barrier.

The above scenario seems at first glance to be a redundant barrier based on the state-tracking flags, but conceptually it isn’t since it expresses a dependency between two operations which happen to use matching access.
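For illustration, expressing that O2's shader reads depend on O1's shader writes to an image might look like this. This is a generic Vulkan sketch, not zink's actual barrier code, and the image and cmdbuf handles are assumed to exist:

/* sketch: make writes from operation O1 visible to shader reads in O2 */
VkImageMemoryBarrier imb = {
   .sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER,
   .srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT,
   .dstAccessMask = VK_ACCESS_SHADER_READ_BIT,
   .oldLayout = VK_IMAGE_LAYOUT_GENERAL,
   .newLayout = VK_IMAGE_LAYOUT_GENERAL,
   .srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
   .dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
   .image = image,
   .subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 },
};
vkCmdPipelineBarrier(cmdbuf,
                     VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,   /* after O1 */
                     VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,   /* before O2 */
                     0, 0, NULL, 0, NULL, 1, &imb);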

Refinement

After crashing my system a few times trying to do full piglit runs (seriously, don’t ever try this with zink+radv at present if you’re on similar hardware to mine), I came back to the barrier issue and started to rejigger this filter a bit.

The improved version is here:

bool
zink_resource_image_needs_barrier(struct zink_resource *res, VkImageLayout new_layout, VkAccessFlags flags, VkPipelineStageFlags pipeline)
{
   if (!pipeline)
      pipeline = pipeline_dst_stage(new_layout);
   if (!flags)
      flags = access_dst_flags(new_layout);
   return res->layout != new_layout || (res->access_stage & pipeline) != pipeline ||
          (res->access & flags) != flags ||
          (zink_resource_access_is_write(flags) && util_bitcount(flags) > 1);
}

This adds an extra check for a sequence of barriers where the new barrier has at least two access flags and one of them is write access. In this sense, the barrier dependency is ignored if the resource is doing READ -> READ, but READ|WRITE -> READ|WRITE will still be emitted like it should.

This change fixes a ton of unit tests, though I don’t actually know how many since there’s still some overall instability in various tests which cause my GPU to hang.

RADVCAL

Certainly worth mentioning is that I’ve been working closely with the RADV developer community over the past few days, and they’ve been extremely helpful in both getting me started with debugging the driver and assisting with resolving some issues. In particular, keep your eyes on these MRs which also fix zink issues.

Stay tuned as always for more updates on all things Mike and zink.

November 23, 2020

I’ve Been Here For…

I guess I never left, really, since I’ve been vicariously living the life of someone who still writes zink patches through reviewing and discussing some great community efforts that are ongoing.

But now I’m back living that life of someone who writes zink patches.

Valve has generously agreed to sponsor my work on graphics-related projects.

For the time being, that work happens to be zink.

Ambition

I don’t want to just make a big post about leaving and then come back after a couple weeks like nothing happened.

It’s 2020.

We need some sort of positive energy and excitement here.

As such, I’m hereby announcing Operation Oxidize, an ambitious endeavor between me and the formidably skillful Erik Faye-Lund of Collabora.

We’re going to land 99% of zink-wip into mainline Mesa by the end of the year, bringing the driver up to basic GL 4.6 and ES 3.2 support with vastly improved performance.

Or at least, that’s the goal.

Will we succeed?

Stay tuned to find out!

November 22, 2020

Another Brief Review

This was a (relatively) quiet week in zink-world. Here’s some updates, once more in no particular order:

  • Custom border color support landed
  • Erik wrote and I reviewed a patch that enabled some blitting optimizations but also regressed a number of test cases
    • Oops
  • I wrote and Erik reviewed a series which improved some of the query code but also regressed a number of test cases
    • Oops++
  • The flurry of activity around Quake3 not working on RADV died down as it’s now been suggested that this is not a RADV bug and is instead the result of no developers fully understanding the majesty of RADV’s pipeline barrier implementation
    • Developers around the world stunned by the possibility that they don’t know everything
  • Witold Baryluk has helpfully contributed a truckload of issue tickets for my zink-wip branch after extensive testing on AMD hardware
    • I’ll get to these at some point, I promise

Stay tuned for further updates.

November 21, 2020

Recently I acquired an Acer Aspire Switch 10 E SW3-016; this device was the main reason for writing my blog post about the shim boot loop. The EFI firmware of this device is bad in a number of ways:

  1. It considers its eMMC unbootable unless its ESP contains an EFI/Microsoft/Boot/bootmgfw.efi file.

  2. But it will actually boot EFI/Boot/bootx64.efi! (wait what? yes really)

  3. It will only boot from a USB disk connected to its micro-USB connector, not from the USB-A connector on the keyboard-dock.

  4. You must first set a BIOS admin password before you can disable secure-boot (which is necessary to boot home-built kernels without doing your own signing)

  5. Last but not least it has one more nasty "feature": it detects whether the OS being booted is Windows, Android or unknown, and it updates the ACPI DSDT based on this!

Some more details on the OS detection misfeature: the ACPI "Device (SDHB)" node for the MMC controller connected to the SDIO wifi module contains:

        Name (WHID, "80860F14")
        Name (AHID, "INT33BB")


Depending on which OS the BIOS thinks it is booting, it renames one of these 2 to _HID. This is weird given that it will only boot if EFI/Microsoft/Boot/bootmgfw.efi exists, but it still does this. Worse, it looks at the actual contents of EFI/Boot/bootx64.efi for this. It seems that that file must be signed, otherwise the firmware goes into OS-unknown mode and keeps the 2 above DSDT bits as-is, so there is no _HID defined for the wifi's mmc controller and thus no wifi. I hit this issue when I replaced EFI/Boot/bootx64.efi with grubx64.efi to break the bootloop. grubx64.efi is not signed, so the DSDT as Linux saw it contained the above AML code. Using the proper workaround for the bootloop from my previous blog post, this bit of the DSDT morphs into:

        Name (_HID, "80860F14")
        Name (AHID, "INT33BB")


And the wifi works.

The Acer Aspire Switch 10 E SW3-016's firmware also triggers an actual bug / issue in Linux' ACPI implementation, causing the bluetooth to not work. This is discussed in much detail here. I have a patch series fixing this here.

And the older Acer Aspire Switch 10 SW5-012's and S1002's firmware has some similar issues:

  1. It considers its eMMC unbootable unless its ESP contains an EFI/Microsoft/Boot/bootmgfw.efi file

  2. These models will actually always boot the EFI/Microsoft/Boot/bootmgfw.efi file, so that is somewhat more sensible.

  3. On the SW5-012 you must first set a BIOS admin password before you can disable secure-boot.

  4. The SW5-012 is missing an ACPI device node for the PWM controller used for controlling the backlight brightness. I guess that the Windows i915 gfx driver just directly pokes the registers (which are in a whole other IP block), rather than relying on a separate PWM driver as Linux does. Unfortunately there is no way to fix this other than using a DSDT overlay. I have a DSDT overlay for the v1.20 BIOS (and only for the v1.20 BIOS) available for this here.

Because of 1. and 2. you need to take the following steps to get Linux to boot on the Acer Aspire Switch 10 SW5-012 or the S1002:

  1. Rename the original bootmgfw.efi (so that you can chainload it in the multi-boot case)

  2. Replace bootmgfw.efi with shimia32.efi

  3. Copy EFI/fedora/grubia32.efi to EFI/Microsoft/Boot

This assumes that you have the files from a 32 bit Windows install in your ESP already.
November 20, 2020

(This post was first published with Collabora on Nov 19, 2020.)

Wayland (the protocol and architecture) is still lacking proper consideration for color management. Wayland also lacks support for high dynamic range (HDR) imagery which has been around in movie and broadcasting industry for a while now (e.g. Netflix HDR UI).

While there are well established tools and workflows for how to do color management on X11, even X11 has not gained support for HDR. There were plans for it (Alex Goins, DeepColor Visuals), but as far as I know nothing really materialized from them.  Right now, the only way to watch HDR content on a HDR monitor in Linux is to use the DRM KMS API directly, in other words, not use any window system, which means not using any desktop environment. Kodi is one of the very few applications that can do this at all.

This is a story about starting the efforts to fix the situation on Wayland.

History and People

Color management for Wayland has been talked about on and off for many years by dozens of people. To me it was obvious from the start that color management architecture on Wayland must be fundamentally different from X11. I thought the display server must be part of the color management stack instead of an untrusted, unknown entity that must be bypassed and overridden by applications that fight each other for who gets to configure the display. This opinion was wildly controversial and it took a long time to get my point across, but over the years some color management experts started to open up to new ideas and other people joined in the opinion as well.  Whether these new ideas are actually better than the ways of old remains to be seen, though. I think the promise of getting everything and more to work better is far too great to not try it out.

The discussions started several times over the years, but they always dried out mostly without any tangible progress. Color management is a wide, deep and difficult topic, and the required skills, knowledge, interest, and available time did not come together until fairly recently. People did write draft protocol extensions, but I would claim that it was not really until Sebastian Wick started building on top of them that things started moving forward. But one person cannot push such a huge effort alone even for the simple reason that there must be at least one reviewer before anything can be merged upstream. I was very lucky that since summer 2020 I have been able to work on Wayland color management and HDR for improving ChromeOS, letting me support Sebastian's efforts on a daily basis. Vitaly Prosyak joined the effort this year as well, researching how to combine the two seemingly different worlds of ICC and HDR, and how tone-mapping could be implemented.

I must also note the past efforts of Harish Krupo, who submitted a major Weston merge request, but unfortunately at the time reviewers in Weston upstream were not much available. Even before that, there were experiments by Ville Syrjälä. All these are now mostly superseded by the on-going work.

Currently the active people around the topic are me (Collabora), Vitaly Prosyak (AMD), and Naveen Kumar (Intel). Sebastian Wick (unaffiliated) is still around as well. None of us is a color management or HDR expert by trade, so we are all learning things as we go.

Design

The foundation for the color management protocol is ICC profile files for describing both output and content color spaces. The aim is to support ICCv4, while also allowing ICCv2, as these are known and well supported in general. Adding iccMAX support or anything else will be possible any time in the future.

As color management is all about color spaces and gamuts, and high dynamic range (HDR) is also very much about color spaces and gamuts plus extended luminance range, Sebastian and I decided that the Wayland color management extension should cater for both from the beginning. Combining traditional color management and HDR is a fairly new thing as far as I know, and I'm not sure we have much prior art to base upon, so this is an interesting research journey as well. There is a lot of prior art on HDR and color management separately, but they tend to have fundamental differences that make the combination not obvious.

To help us keep focused and to explain to the community what we actually intend with Wayland color management and HDR support, I wrote the section "Wayland Color Management and HDR Design Goals" in color.rst (draft). I very much recommend you read it so that you get a picture of what we (or I, at least) want to aim for.

Elle Stone explains in her article how color management should work on X11. As I wanted to avoid repeating the massive email threads that were had on the wayland-devel mailing list, I wrote the section "Color Pipeline Overview" in color.rst (draft) more or less as a response to her article, trying to explain in what ways Wayland will be different from X11. I think that understanding that section is paramount before anyone makes any comment on our efforts with the Wayland protocol extension.

HDR brings even more reasons to put color space conversions in the display server than just the idea that all applications should be color managed if not explicitly then implicitly.  Most of the desktop applications (well, literally all right now) are using Standard Dynamic Range (SDR).  SDR is a fuzzy concept referring to all traditional, non-HDR image content.  Therefore, your desktop is usually 100% SDR. You run your fancy new HDR monitor in SDR mode, which means it looks just like any old monitor with nothing fancy.  What if you want to watch a HDR video? The monitor won't display HDR in SDR mode.  If you simply switch the monitor to HDR mode, you will be blinded by all the over-bright SDR applications.  Switching monitor modes may also cause flicker and take a bit of time. That would be a pretty bad user experience, right?

A solution is to run your monitor in HDR mode all the time, and have the window system compositor convert all SDR application windows appropriately to the HDR luminance, so that they look normal in spite of the HDR mode. There will always be applications that will never support HDR at all, so the compositor doing the conversion is practically the only way.

For the protocol, we are currently exploring the use of relative luminance.  The reason is that people look at monitors in wildly varying viewing environments, under standard office lighting for example. The environment and personal preferences affect what monitor brightness you want. Also monitors themselves can be wildly different in their capabilities. Most prior art on HDR uses absolute luminance, but absolute luminance has the problem that it assumes a specific viewing environment, usually a dark room, similar to a movie theatre.  If a display server would show a movie with the absolute luminance it was mastered for, in most cases it would be far too dark to see. Whether using relative luminance at the protocol level turns out to be a good idea or not, we shall see.

Development

The Wayland color management and HDR protocol extension proposal is known as wayland/wayland-protocols!14 (MR14). Because it is a very long running merge request (the bar for landing a new protocol into wayland-protocols is high) and there are several people working on it, we started using sub-merge-requests to modify the proposal. You can find the sub-MRs in Sebastian's fork. If you have a change to propose, that is how to do it.

Obviously using sub-MRs also splits the review discussions into multiple places, but in this case I think it is a good thing, because the discussion threads in Gitlab are already massive.

There are several big and small open questions we haven't had the time to tackle yet even among the active group; questions that I feel we should have some tentative answers to before asking for wider community comments. There is also no set schedule, so don't hold your breath. This work is likely to take months still before there is a complete tentative protocol, and probably years until these features are available in your favourite Wayland desktop environments.

If you are an expert on the topics of color management or HDR displays and content, you are warmly welcome to join the development.

If you are an interested developer or an end user looking to try out things, sorry, there is nothing really for you yet.

November 18, 2020
How to fix the Linux EFI secure-boot shim bootloop issue seen on some systems.

Quite a few Bay- and Cherry-Trail based systems have bad firmware which completely ignores any efibootmgr set boot options. They basically completely reset the boot order, doing some sort of auto-detection at boot. Some of these will even give an error about their eMMC not being bootable unless the ESP has an EFI/Microsoft/Boot/bootmgfw.efi file!

Many of these end up booting EFI/Boot/bootx64.efi unconditionally every boot. This will cause a boot loop since, when Linux is installed, EFI/Boot/bootx64.efi is now shim. When shim is started with a path of EFI/Boot/bootx64.efi, shim will add a new efibootmgr entry pointing to EFI/fedora/shimx64.efi and then reset. The goal of this is so that the firmware's F12 bootmenu can be used to easily switch between Windows and Linux (without chainloading, which breaks bitlocker). But since these bad EFI implementations ignore efibootmgr stuff, the EFI/Boot/bootx64.efi shim will run again after the reset and we have a loop.

There are 2 ways to fix this loop:

1. The right way: Stop shim from trying to add a boot entry pointing to EFI/fedora/shimx64.efi:

rm EFI/Boot/fbx64.efi
cp EFI/fedora/grubx64.efi EFI/Boot


The first command will stop shim from trying to add a new efibootmgr entry (it calls fbx64.efi to do that for it); instead it will try to execute grubx64.efi from the directory from which it was executed, so we must put a grubx64.efi in the EFI/Boot dir, which the second command does. Do not use the livecd's EFI/Boot/grubx64.efi file for this, as I did at first; that copy searches for its config and env under EFI/Boot, which is not what we want.

Note that upgrading shim will restore EFI/Boot/fbx64.efi. To avoid this you may want to backup EFI/Boot/bootx64.efi, then do "sudo rpm -e shim-x64" and then restore the backup.
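
On Fedora that sequence could look something like this (a sketch; it assumes the ESP is mounted at /boot/efi and that shim-x64 is the package owning fbx64.efi):

# back up the current EFI/Boot/bootx64.efi (shim) before touching packages
sudo cp /boot/efi/EFI/Boot/bootx64.efi /root/bootx64.efi.bak
# remove the shim package so future updates cannot put fbx64.efi back
sudo rpm -e shim-x64
# restore the backed-up bootloader in case the package removal deleted it
sudo cp /root/bootx64.efi.bak /boot/efi/EFI/Boot/bootx64.efi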

2. The wrong way: Replace EFI/Boot/bootx64.efi with a copy of EFI/fedora/grubx64.efi

This is how I used to do this until hitting the scenario which caused me to write this blog post. There are 2 problems with this:

2a) This requires disabling secure-boot (which I could live with so far)
2b) Some firmware changes how it behaves, exporting a different DSDT to the OS depending on whether EFI/Boot/bootx64.efi is signed or not (even with secure boot disabled), and its behavior is totally broken when it is not signed. I will post another rant ^W blogpost about this soon. For now let's just say that you should use workaround 1. from above since it simply is a better workaround.

Note: for better readability the above text uses bootx64, shimx64, fbx64 and grubx64 throughout. When using a 32 bit EFI (which is typical on Bay Trail systems) you should replace these with bootia32, shimia32, fbia32 and grubia32. Note that 32 bit EFI Bay Trail systems should still use a 64 bit Linux distro; the firmware being 32 bit is a weird Windows-related thing.

Also note that your system may use another key than F12 to show the firmware's bootmenu.
November 15, 2020

A Brief Review

As time/sanity permit, I’ll be trying to do roundup posts for zink happenings each week. Here’s a look back at things that happened, in no particular order:

November 13, 2020

(project was renamed from vallium to lavapipe)

I had some requirements for writing a vulkan software rasterizer within the Mesa project. I took some time to look at the options and realised that just writing a vulkan layer on top of gallium's llvmpipe would be a good answer for this problem. However in doing so I knew people would ask why this wouldn't work for a hardware driver.

tl;dr DO NOT USE LAVAPIPE OVER A GALLIUM HW DRIVER.

What is lavapipe?

The lavapipe layer is a gallium frontend. It takes the Vulkan API and roughly translates it into the gallium API.

How does it do that?

Vulkan is a low-level API: it allows the user to allocate memory, create resources, and record command buffers, amongst other things. When a hw vulkan driver is recording a command buffer, it is putting hw-specific commands into it that will be run directly on the GPU. These command buffers are submitted to queues when the app wants to execute them.

Gallium is a context level API, i.e. like OpenGL/D3D10. The user has to create resources and contexts and the driver internally manages command buffers etc. The driver controls internal flushing and queuing of command buffers.
 
In order to bridge the gap, the lavapipe layer abstracts the gallium context into a separate thread of execution. When recording a vulkan command buffer it creates a CPU side command buffer containing an encoding of the Vulkan API. It passes that recorded CPU command buffer to the thread on queue submission. The thread then creates a gallium context, and replays the whole CPU recorded command buffer into the context, one command at a time.

That sounds horrible, isn't it slow?

Yes.

Why doesn't that matter for *software* drivers?

Software rasterizers are a very different proposition from an overhead point of view than real hardware. CPU rasterization is pretty heavy on the CPU load, so nearly always 90% of your CPU time will be in the rasterizer and fragment shader. Having some minor CPU overheads around command submission and queuing isn't going to matter in the overall profile of the user application. CPU rasterization is already slow, the Vulkan->gallium translation overhead isn't going to be the reason for making it much slower.
For real HW drivers which are meant to record their own command buffers in the GPU domain and submit them direct to the hw, adding in a CPU layer that just copies the command buffer data is a massive overhead and one that can't easily be removed from the lavapipe layer.

The lavapipe execution context is also pretty horrible: it has to connect all the state pieces like shaders etc. to the gallium context, and disconnect them all at the end of each command buffer. There is only one command submission queue and one context to be used. A lot of hardware exposes more queues etc. that this will never model.

I still don't want to write a vulkan driver, give me more reasons.

Pipeline barriers:

Pipeline barriers in Vulkan are essential to efficient driver hw usage. They are one of the most difficult to understand and hard to get right pieces of writing a vulkan driver. For a software rasterizer they are also mostly unneeded. When I get a barrier I just completely hardflush the gallium context because I know the sw driver behind it. For a real hardware driver this would be a horrible solution. You spend a lot of time trying to make anything optimal here.

Memory allocation:

Vulkan is built around the idea of separate memory allocation and objects binding to those allocations. Gallium is built around object allocation with the memory allocs happening implicitly. I've added some simple memory allocation objects to the gallium API for swrast. These APIs are in no way useful for hw drivers. There is no way to expose memory types or heaps from gallium usefully. The current memory allocation API works for software drivers because I know all they want is an aligned_malloc. There is no decent way to bridge this gap without writing a new gallium API that looks like Vulkan. (in which case just write a vulkan driver already).

Can this make my non-Vulkan capable hw run Vulkan?

No. If the hardware can't do virtual memory properly, or expose features for vulkan this can't be fixed with a software layer that just introduces overhead.


There's been some recent discussion about whether the X server is abandonware. As the person arguably most responsible for its care and feeding over the last 15 years or so, I feel like I have something to say about that.

The thing about being the maintainer of a public-facing project for nearly the whole of your professional career is it's difficult to separate your own story from the project. So I'm not going to try to be dispassionate, here. I started working on X precisely because free software had given me options and capabilities that really matter, and I feel privileged to be able to give that back. I can't talk about that without caring about it.

So here's the thing: X works extremely well for what it is, but what it is is deeply flawed. There's no shame in that, it's 33 years old and still relevant, I wish more software worked so well on that kind of timeframe. But using it to drive your display hardware and multiplex your input devices is choosing to make your life worse.

It is, however, uniquely well suited to a very long life as an application compatibility layer. Though the code happens to implement an unfortunate specification, the code itself is quite well structured, easy to hack on, and not far off from being easily embeddable.

The issue, then, is how to get there. And I don't have any real desire to get there while still pretending that the xfree86 hardware-backed server code is a real thing. Sorry, I guess, but I've worked on xfree86-derived servers for very nearly as long as XFree86-the-project existed, and I am completely burnt out on that on its own merits, let alone doing that and also being release manager and reviewer of last resort. You can only apply so much thrust to the pig before you question why you're trying to make it fly at all.

So, is Xorg abandoned? To the extent that that means using it to actually control the display, and not just keep X apps running, I'd say yes. But xserver is more than xfree86. Xwayland, Xwin, Xephyr, Xvnc, Xvfb: these are projects with real value that we should not give up. A better way to say it is that we can finally abandon xfree86.

And if that sounds like a world you'd like to see, please, come talk to us, let's make it happen. I'd be absolutely thrilled to see someone take this on, and I'm happy to be your guide through the server internals.

November 12, 2020

A recent article on phoronix has some commentary about sharing code between Windows and Linux, and how this seems to be a metric that Intel likes.

I'd like to explore this idea a bit and explain why I believe it's bad for Linux based distros and our open source development models in the graphics area.

tl;dr there is a big difference between open source released and open source developed projects in terms of sustainability and community.

The Linux graphics stack from a distro vendor point of view is made up of two main projects, the Linux kernel and Mesa userspace. These two projects are developed in the open with completely open source, vendor agnostic practices. There is no vendor controlling either project and both projects have a goal of trying to maximise shared code and shared processes/coding standards across drivers from all vendors.

This cross-vendor synergy is very important to the functioning ecosystem that is the Linux graphics stack. The stack also relies in some places on the LLVM project, but again LLVM upstream is vendor agnostic and open source developed.

The value to distros is they have central places to pick up driver stacks with good release cycles and a minimal number of places they have to deal with to interact with those communities. Now usually hardware vendors don't see the value in the external communities as much as Linux distros do. From a hardware vendor's internal point of view they see more benefit in creating a single stack shared between their Windows and Linux drivers to maximise their return on investment, or make their orgchart prettier, or produce fewer powerpoints about why their orgchart isn't optimal.

A shared Windows/Linux stack as such is a thing the vendors want more for their own reasons than for the benefit of the Linux community.

Why is it a bad idea?

I'll start by saying it's not always a bad idea. In theory it might be possible to produce such a stack with the benefits of open source development model, however most vendors seem to fail at this. They see open source as a release model, they develop internally and shovel the results over the fence into a github repo every X weeks after a bunch of cycles. They build products containing these open source pieces, but they never expend the time building projects or communities around them.

As an example take AMDVLK vs radv. I started radv because AMD had been promising the world an open source Vulkan driver for Linux that was shared with their Windows stack. Even when it was delivered it was open source released but internally developed. There was no avenue for community participation in the driver development. External contributors were never on the same footing as an AMD employee. Even AMD employees on different teams weren't on the same footing. Compare this to the radv project in Mesa where it allowed Valve to contribute the ACO backend compiler and provide better results than AMD vendor shared code could ever have done, with far less investment and manpower.

Intel have a non-Mesa compiler called the Intel Graphics Compiler mentioned in the article. This is fully developed by Intel internally; there is little info on project direction, how to get involved, or where the community is. There doesn't seem to be much public review, and patches seem to get merged to the public repo by igcbot, which may mean they are being mirrored from some internal repo. They are not using github merge requests etc. Compare this to development of a Mesa NIR backend, where lots of changes are reviewed and maximal common code sharing is attempted so that all vendors benefit from the code.

One area where it has mostly sort of worked out is with the AMD display code in the kernel. I believe this code to be shared with their Windows driver (but I'm not 100% sure). They do try to engage with community changes to the code, but the code is still pretty horrible and not really optimal on Linux. Integrating it with atomic modesetting and refactoring was a pain. So even in the best case it's not an optimal outcome, even for the vendor. They have to work hard to make the shared code be capable of supporting different OS interactions.

How would I do it?

If I had to share Windows/Linux driver stack I'd (biased opinion) start from the most open project and bring that into the closed projects. I definitely wouldn't start with a new internal project that tries to disrupt both. For example if I needed to create a Windows GL driver, I could:

a) write a complete GL implementation and throw it over the wall every few weeks, and make Windows/Linux use it. Linux users lose out on the shared stack, distros lose out, having to build a stack of multiple per-vendor deps instead of one dependency, Windows gains nothing really, but I'm so in control of my own destiny (communities don't matter).

b) use Mesa and upstream my driver to share with the Linux stack, adding the Windows code to the Mesa stack. I get to share the benefits of external development by other vendors, Windows gains that benefit, and Linux retains the benefits to its ecosystem.

A warning then to anyone wishing for more vendor code sharing between OSes: it generally doesn't end with Linux being better off, it ends up with Linux being more fragmented, harder to support and in the long run unsustainable.


November 06, 2020

About a year ago, I got a new laptop: a late 2019 Razer Blade Stealth 13.  It sports an Intel i7-1065G7 with the best of Intel’s Ice Lake graphics along with an NVIDIA GeForce GTX 1650.  Apart from needing an ACPI lid quirk and the power management issues described here, it’s been a great laptop so far and the Linux experience has been very smooth.

Unfortunately, the out-of-the-box integrated graphics performance of my new laptop was less than stellar.  My first task with the new laptop was to debug a rendering issue in the Linux port of Shadow of the Tomb Raider, which turned out to be a bug in the game.  In the process, I discovered that the performance of the game’s built-in benchmark was almost half of what it was on Windows.  We’ve had some performance issues with Mesa from time to time on some games but half seemed a bit extreme.  Looking at system-level performance data with gputop revealed that the GPU clock rate was unable to get above about 60-70% of the maximum in spite of the GPU being busy the whole time.  Why?  The GPU wasn’t able to get enough power.  Once I sorted out my power management problems, the benchmark went from about 50-60% the speed of Windows to more like 104% the speed of Windows (yes, that’s more than 100%).

This blog post is intended to serve as a bit of a guide to understanding memory throughput and power management issues and configuring your system properly to get the most out of your Intel integrated GPU.  Not everything in this post will affect all laptops so you may have to do some experimentation with your system to see what does and does not matter.  I also make no claim that this post is in any way complete; there are almost certainly other configuration issues of which I'm not aware or which I've forgotten.

Update your drivers

This should go without saying but if you want the best performance out of your hardware, running the latest drivers is always recommended.  This is especially true for hardware that has just been released.  Generally, for graphics, most of the big performance improvements are going to be in Mesa but your Linux kernel version can matter as well.  In the case of Intel Ice Lake processors, some of the power management features aren’t enabled until Linux 5.4.
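
A quick way to see what you are currently running (glxinfo comes from the mesa-utils or mesa-demos package, depending on the distro):

# kernel version; for Ice Lake power management you want 5.4 or newer
uname -r
# Mesa version and which driver is actually rendering
glxinfo -B | grep -E "OpenGL renderer|OpenGL version"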

I’m not going to give a complete guide to updating your drivers here.  If you’re running a distro like Arch, chances are that you’re already running something fairly close to the latest available.  If you’re on Ubuntu, the padoka PPA provides versions of the userspace components (Mesa, X11, etc.) that are usually no more than about a week out-of-date but upgrading your kernel is more complicated.  Other distros may have something similar but I’ll leave that as an exercise to the reader.

This doesn’t mean that you need to be obsessive about updating kernels and drivers.  If you’re happy with the performance and stability of your system, go ahead and leave it alone.  However, if you have brand new hardware and want to make sure you have new enough drivers, it may be worth attempting an update.  Or, if you have the patience, you can just wait 6 months for the next distro release cycle and hope to pick up with a distro update.

Make sure you have dual-channel RAM

One of the big bottleneck points in 3D rendering applications is memory bandwidth.  Most standard monitors run at a resolution of 1920x1080 and a refresh rate of 60 Hz.  A 1920x1080 RGBA (32bpp) image is just shy of 8 MiB in size and, if the GPU is rendering at 60 FPS, that adds up to about 474 MiB/s of memory bandwidth to write out the image every frame.  If you're running a 4K monitor, multiply by 4 and you get about 1.8 GiB/s.  Those numbers are only for the final color image, assume we write every pixel of the image exactly once, and don't take into account any other memory access.  Even in a simple 3D scene, there are other images than just the color image being written such as depth buffers or auxiliary gbuffers, each pixel typically gets written more than once depending on app over-draw, and shading typically involves reading from uniform buffers and textures.  Modern 3D applications typically also have things such as depth pre-passes, lighting passes, and post-processing filters for depth-of-field and/or motion blur.  The result of this is that actual memory bandwidth for rendering a 3D scene can be 10-100x the bandwidth required to simply write the color image.

Because of the incredible amount of bandwidth required for 3D rendering, discrete GPUs use memories which are optimized for bandwidth above all else.  These go by different names such as GDDR6 or HBM2 (current as of the writing of this post) but they all use extremely wide buses and access many bits of memory in parallel to get the highest throughput they can.  CPU memory, on the other hand, is typically DDR4 (current as of the writing of this post) which runs on a narrower 64-bit bus and so the over-all maximum memory bandwidth is lower.  However, as with anything in engineering, there is a trade-off being made here.  While narrower buses have lower over-all throughput, they are much better at random access which is necessary for good CPU memory performance when crawling complex data structures and doing other normal CPU tasks.  When 3D rendering, on the other hand, the vast majority of your memory bandwidth is consumed in reading/writing large contiguous blocks of memory and so the trade-off falls in favor of wider buses.

With integrated graphics, the GPU uses the same DDR RAM as the CPU so it can't get as much raw memory throughput as a discrete GPU.  Some of the memory bottlenecks can be mitigated via large caches inside the GPU but caching can only do so much.  At the end of the day, if you're fetching 2 GiB of memory to draw a scene, you're going to blow out your caches and load most of that from main memory.

The good news is that most motherboards support a dual-channel RAM configuration where, if your DDR units are installed in identical pairs, the memory controller will split memory access between the two DDR units in the pair.  This has similar benefits to running on a 128-bit bus but without some of the drawbacks.  The result is about a 2x improvement in over-all memory throughput.  While this may not affect your CPU performance significantly outside of some very special cases, it makes a huge difference to your integrated GPU which cares far more about total throughput than random access.  If you are unsure how your computer’s RAM is configured, you can run “dmidecode -t memory” and see if you have two identical devices reported in different channels.
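
For example, something along these lines will show what is populated in each channel (the exact field names can differ a bit between vendors):

# list installed memory devices; dual-channel means two identical modules
# reported in different channels/banks
sudo dmidecode -t memory | grep -E "Size|Type:|Speed|Locator"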

Power management 101

Before getting into the details of how to fix power management issues, I should explain a bit about how power management works and, more importantly, how it doesn’t.  If you don’t care to learn about power management and are just here for the system configuration tips, feel free to skip this section.

Why is power management important?  Because the clock rate (and therefore the speed) of your CPU or GPU is heavily dependent on how much power is available to the system.  If it’s unable to get enough power for some reason, it will run at a lower clock rate and you’ll see that as processes taking more time or lower frame rates in the case of graphics.  There are some things that you, as the user, cannot control such as the physical limitations of the chip or the way the OEM has configured things on your particular laptop.  However, there are some things which you can do from a system configuration perspective which can greatly affect power management and your performance.

First, we need to talk about thermal design power or TDP.  There are a lot of misunderstandings on the internet about TDP and we need to clear some of them up.  Wikipedia defines TDP as “the maximum amount of heat generated by a computer chip or component that the cooling system in a computer is designed to dissipate under any workload.”  The Intel Product Specifications site defines TDP as follows:

Thermal Design Power (TDP) represents the average power, in watts, the processor dissipates when operating at Base Frequency with all cores active under an Intel-defined, high-complexity workload. Refer to Datasheet for thermal solution requirements.

In other words, the TDP value provided on the Intel spec sheet is a pretty good design target for OEMs but doesn’t provide nearly as many guarantees as one might hope.  In particular, there are several things that the TDP value on the spec sheet is not:
  • It’s not the exact maximum power.  It’s a “average power”.
  • It may not match any particular workload.  It’s based on “an Intel-defined, high-complexity workload”.  Power consumption on any other workload is likely to be slightly different.
  • It’s not the actual maximum.  It’s based on when the processor is “operating at Base Frequency with all cores active.” Technologies such as Turbo Boost can cause the CPU to operate at a higher power for short periods of time.
If you look at the  Intel Product Specifications page for the i7-1065G7, you’ll see three TDP values: the nominal TDP of 15W, a configurable TDP-up value of 25W and a configurable TDP-down value of 12W.  The nominal TDP (simply called “TDP”) is the base TDP which is enough for the CPU to run all of its cores at the base frequency which, given sufficient cooling, it can do in the steady state.  The TDP-up and TDP-down values provide configurability that gives the OEM options when they go to make a laptop based on the i7-1065G7.  If they’re making a performance laptop like Razer and are willing to put in enough cooling, they can configure it to 25W and get more performance.  On the other hand, if they’re going for battery life, they can put the exact same chip in the laptop but configure it to run as low as 12W.  They can also configure the chip to run at 12W or 15W and then ship software with the computer which will bump it to 25W once Windows boots up.  We’ll talk more about this reconfiguration later on.

Beyond just the numbers on the spec sheet, there are other things which may affect how much power the chip can get.  One of the big ones is cooling.  The law of conservation of energy dictates that energy is never created or destroyed.  In particular, your CPU doesn’t really consume energy; it turns that electrical energy into heat.  For every Watt of electrical power that goes into the CPU, a Watt of heat has to be pumped out by the cooling system.  (Yes, a Watt is also a measure of heat flow.)  If the CPU is using more electrical energy than the cooling system can pump back out, energy gets temporarily stored in the CPU as heat and you see this as the CPU temperature rising.  Eventually, however, the CPU has to back off and let the cooling system catch up or else that built up heat may cause permanent damage to the chip.

Another thing which can affect CPU power is the actual power delivery capabilities of the motherboard itself.  In a desktop, the discrete GPU is typically powered directly by the power supply and it can draw 300W or more without affecting the amount of power available to the CPU.  In a laptop, however, you may have more power limitations.  If you have multiple components requiring significant amounts of power such as a CPU and a discrete GPU, the motherboard may not be able to provide enough power for both of them to run flat-out so it may have to limit CPU power while the discrete GPU is running.  These types of power balancing decisions can happen at a very deep firmware level and may not be visible to software.

The moral of this story is that the TDP listed on the spec sheet for the chip isn’t what matters; what matters is how the chip is configured by the OEM, how much power the motherboard is able to deliver, and how much power the cooling system is able to remove.  Just because two laptops have the same processor with the same part number doesn’t mean you should expect them to get the same performance.  This is unfortunate for laptop buyers but it’s the reality of the world we live in.  There are some things that you, as the user, cannot control such as the physical limitations of the chip or the way the OEM has configured things on your particular laptop.  However, there are some things which you can do from a system configuration perspective and that’s what we’ll talk about next.

If you want to experiment with your system and understand what’s going on with power, there are two tools which are very useful for this: powertop and turbostat.  Both are open-source and should be available through your distro package manager.  I personally prefer the turbostat interface for CPU power investigations but powertop is able to split your power usage up per-process which can be really useful as well.
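
Both are usually just a package install away (turbostat typically ships in your distro’s kernel-tools or linux-tools package):

# per-package power, frequency and temperature, refreshed periodically
sudo turbostat
# interactive view that can also attribute power draw to individual processes
sudo powertop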

Update GameMode to at least version 1.5

About two and a half years ago (1.0 was released in May of 2018), Feral Interactive released their GameMode daemon which is able to tweak some of your system settings when a game starts up to get maximal performance.  One of the settings that GameMode tweaks is your CPU performance governor.  By default, GameMode will set it to “performance” when a game is running.  While this seems like a good idea (“performance” is better, right?), it can actually be counterproductive on integrated GPUs and cause you to get worse over-all performance.

Why would the “performance” governor cause worse performance?  First, understand that the names “performance” and “powersave” for CPU governors are a bit misleading.  The powersave governor isn’t just for when you’re running on battery and want to use as little power as possible.  When on the powersave governor, your system will clock all the way up if it needs to and can even turbo if you have a heavy workload.  The difference between the two governors is that the powersave governor tries to give you as much performance as possible while also caring about power; it’s quite well balanced.  Intel typically recommends the powersave governor even in data centers because, even though they have piles of power and cooling available, data centers typically care about their power bill.  The performance governor, on the other hand, doesn’t care about power consumption and only cares about getting the maximum possible performance out of the CPU so it will typically burn significantly more power than needed.
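
If you are curious which governor you are currently on, the cpufreq sysfs interface will tell you; a small sketch, assuming the usual /sys/devices/system/cpu layout:

# show the governor currently in use on every CPU
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# switch everything back to the balanced powersave governor
echo powersave | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor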

So what does this have to do with GPU performance?  On an integrated GPU, the GPU and CPU typically share a power budget and every Watt of power the CPU is using is a Watt that’s unavailable to the GPU.  In some configurations, the TDP is enough to run both the GPU and CPU flat-out but that’s uncommon.  Most of the time, however, the CPU is capable of using the entire TDP if you clock it high enough.  When running with the performance governor, that extra unnecessary CPU power consumption can eat into the power available to the GPU and cause it to clock down.

This problem should be mostly fixed as of GameMode version 1.5 which adds an integrated GPU heuristic.  The heuristic detects when the integrated GPU is using significant power and puts the CPU back to using the powersave governor.  In the testing I’ve done, this pretty reliably chooses the powersave governor in the cases where the GPU is likely to be TDP limited.  The heuristic is dynamic so it will still use the performance governor if the CPU power usage way overpowers the GPU power usage such as when compiling shaders at a loading screen.

What do you need to do on your system?  First, check what version of GameMode you have installed on your system (if any).  If it’s version 1.4 or earlier and you intend to play games on an integrated GPU, I recommend either upgrading GameMode or disabling or uninstalling the GameMode daemon.
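
Your package manager is the easiest way to check which version is installed; the package is usually just called gamemode, though naming varies by distro:

# Fedora and friends
rpm -q gamemode
# Debian/Ubuntu and friends
dpkg -l | grep gamemode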

Use thermald

In “power management 101” I talked about how sometimes OEMs will configure a laptop to 12W or 15W in BIOS and then re-configure it to 25W in software.  This is done via the “Intel Dynamic Platform and Thermal Framework” driver on Windows.  The DPTF driver manages your over-all system thermals and keeps the system within its thermal budget.  This is especially important for fanless or ultra-thin laptops where the cooling may not be sufficient for the system to run flat-out for long periods.  One thing the DPTF driver does is dynamically adjust the TDP of your CPU.  It can adjust it both up, if the laptop is running cool and you need the power, and down, if the laptop is running hot and needs to cool down.  Some OEMs choose to be very conservative with their TDP defaults in BIOS to prevent the laptop from overheating or constantly running hot if the Windows DPTF driver is not available.

On Linux, the equivalent to this is thermald.  When installed and enabled on your system, it reads the same OEM configuration data from ACPI as the Windows DPTF driver and is also able to scale up your package TDP threshold past the BIOS default as per the OEM configuration.  You can also write your own configuration files if you really wish, but you do so at your own risk.

Most distros package thermald but it may not be enabled nor work quite properly out-of-the-box.  This is because, historically, it has relied on the closed-source dptfxtract utility that's provided by Intel as a binary.  It requires dptfxtract to fetch the OEM provided configuration data from the ACPI tables. Since most distros don't usually ship closed-source software in their main repositories and since thermald doesn't do much without that data, a lot of distros don't bother to ship or enable it by default.  You'll have to turn it on manually.

To fix this, install both thermald and dptfxtract and ensure that thermald is enabled.  On most distros, thermald is packaged normally even if it isn’t enabled by default because it is open-source.  The dptfxtract utility is usually available in your distro’s non-free repositories.  On Ubuntu, dptfxtract is available as a package in multiverse.  For Fedora, dptfxtract is available via RPM Fusion’s non-free repo.  There are also packages for Arch and likely others as well.  If no one packages it for your distro, it’s just one binary so it’s pretty easy to install manually.
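
On Fedora, for example, that boils down to something like this (with dptfxtract coming from the RPM Fusion non-free repo mentioned above):

# install the daemon plus the tool that extracts the OEM tables from ACPI
sudo dnf install thermald dptfxtract
# make sure the service is enabled and started
sudo systemctl enable --now thermald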

Some of this may change going forward, however.  Recently, Matthew Garrett did some work to reverse-engineer the DPTF framework and provide support for fetching the DPTF data from ACPI without the need for the binary blob.  When running with a recent kernel and Matthew's fork of thermald, you should be able to get OEM-configured thermals without the need for the dptfxtract blob, at least on some hardware.  Whether or not you get the right configuration will depend on your hardware, your kernel version, your distro, and whether they ship the Intel version of thermald or Matthew's fork.  Even then, your distro may leave it uninstalled or disabled by default.  It's still disabled by default in Fedora 33, for instance.

It should be noted at this point that, if thermald and dptfxtract are doing their job, your laptop is likely to start running much hotter when under heavy load than it did before.  This is because thermald is re-configuring your processor with a higher thermal budget which means it can now run faster but it will also generate more heat and may drain your battery faster.  In theory, thermald should keep your laptop’s thermals within safe limits; just not within the more conservative limits the OEM programmed into BIOS.  If all the additional heat makes you uncomfortable, you can just disable thermald and it should go back to the BIOS defaults.

Enable NVIDIA’s dynamic power-management

On my laptop (the late 2019 Razer Blade Stealth 13), the BIOS has the CPU configured to 35W out-of-the-box.  (Yes, 35W is higher than TDP-up and I’ve never seen it burn anything close to that much power; I have no idea why it’s configured that way.)  This means that we have no need for DPTF and the cooling is good enough that I don’t really need thermald on it either.  Instead, its power management problems come from the power balancing that the motherboard does between the CPU and the discrete NVIDIA GPU.

If the NVIDIA GPU is powered on at all, the motherboard configures the CPU to the TDP-down value of 12W.  I don’t know exactly how it’s doing this but it’s at a very deep firmware level that seems completely opaque to software.  To make matters worse, it doesn’t just restrict CPU power when the discrete GPU is doing real rendering; it restricts CPU power whenever the GPU is powered on at all.  In the default configuration with the NVIDIA proprietary drivers, that’s all the time.

Fortunately, if you know where to find it, there is a configuration option available in recent drivers for Turing and later GPUs which lets the NVIDIA driver completely power down the discrete GPU when it isn’t in use.  You can find this documented in Chapter 22 of the NVIDIA driver README.  The runtime power management feature is still beta as of the writing of this post and does come with some caveats such as that it doesn’t work if you have audio or USB controllers (for USB-C video) on your GPU.  Fortunately, with many laptops with a hybrid Intel+NVIDIA graphics solution, the discrete GPU exists only for render off-loading and doesn’t have any displays connected to it.  In that case, the audio and USB-C can be disabled and don’t cause any problems.  On my laptop, as soon as I properly enabled runtime power management in the NVIDIA driver, the motherboard stopped throttling my CPU and it started running at the full TDP-up of 25W.
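
As a pointer, the knob documented there is a kernel module option plus a couple of udev rules; here is a minimal sketch of the modprobe part, with the option name taken from the driver README, so verify it (and the accompanying udev rules) against Chapter 22 for the driver version you actually have:

# /etc/modprobe.d/nvidia-runtime-pm.conf
# let the driver power the discrete GPU off completely when it is idle
options nvidia "NVreg_DynamicPowerManagement=0x02"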

I believe that nouveau has some capabilities for runtime power management.  However, I don’t know for sure how good they are and whether or not they’re able to completely power down the GPU.

Look for other things which might be limiting power

In this blog post, I've covered some of the things which I've personally seen limit GPU power when playing games and running benchmarks.  However, it is by no means an exhaustive list.  If there's one thing that's true about power management, it's that every machine is a bit different.  The biggest challenge with my laptop was the NVIDIA discrete GPU draining power.  On some other laptop, it may be something else.

You can also look for background processes which may be using significant CPU cycles.  With a discrete GPU, a modest amount of background CPU work will often not hurt you unless the game is particularly CPU-hungry.  With an integrated GPU, however, it's far more likely that a background task such as a backup or software update will eat into the GPU's power budget.  Just this last week, a friend of mine was playing a game on Proton and discovered that the game launcher itself was burning enough power with the CPU to prevent the GPU from running at full power.  Once he suspended the game launcher, his GPU was able to run at full power.

Especially with laptops, you're also likely to be affected by the computer's cooling system as was mentioned earlier.  Some laptops such as my Razer are designed with high-end cooling systems that let the laptop run at full power.  Others, particularly the ultra-thin laptops, are far more thermally limited and may never be able to hit the advertised TDP for extended periods of time.

Conclusion

When trying to get the most performance possible out of a laptop, RAM configuration and power management are key.  Unfortunately, due to the issues documented above (and possibly others), the out-of-the-box experience on Linux is not what it should be.  Hopefully, we’ll see this situation improve in the coming years but for now this post will hopefully give people the tools they need to configure their machines properly and get the full performance out of their hardware.

This Is The End

…of my full-time hobby work on zink.

At least for a while.

More on that at the end of the post.

Before I get to that, let’s start with yesterday’s riddle. Anyone who chose this pic

1.png

with 51 fps as being zink, you were correct.

That’s right, zink is now at around 95% of native GL performance for this benchmark, at least on my system.

I know there’s been a lot of speculation about the capability of the driver to reach native or even remotely-close-to-native speeds, and I’m going to say definitively that it’s possible, and performance is only going to increase further from here.

A bit of a different look on things can also be found on my Fall roundup post here.

A Big Boost From Threads

I’ve long been working on zink using a single-thread architecture, and my goal has been to make it as fast as possible within that constraint. Part of my reasoning is that it’s been easier to work within the existing zink architecture than to rewrite it, but the main issue is just that threads are hard, and if you don’t have a very stable foundation to build off of when adding threading to something, it’s going to get exponentially more difficult to gain that stability afterwards.

Reaching a 97% pass rate on my piglit tests at GL 4.6 and ES 3.2 gave me a strong indicator that the driver was in good enough shape to start looking at threads more seriously. Sure, piglit tests aren’t CTS; they fail to cover a lot of areas, and they’re certainly less exhaustive about the areas that they do cover. With that said, CTS isn’t a great tool for zink at the moment due to the lack of provoking vertex compatibility support in the driver (I’m still waiting on a Vulkan extension for this, though it’s looking likely that Erik will be providing a fallback codepath for this using a geometry shader in the somewhat near future) which will fail lots of tests. Given the sheer number of CTS tests, going through the failures and determining which ones are failing due to provoking vertex issues and which are failing due to other issues isn’t a great use of my time, so I’m continuing to wait on that. The remaining piglit test failures are mostly due either to provoking vertex issues or some corner case missing features such as multisampled ZS readback which are being worked on by other people.

With all that rambling out of the way, let’s talk about threads and how I’m now using them in zink-wip.

At present, I’m using u_threaded_context, aka glthread, making zink the only non-radeon driver to implement it. The way this works is by using Gallium to write the command stream to a buffer that is then processed asynchronously, freeing up the main thread for application use and avoiding any sort of blocking from driver overhead. For systems where zink is CPU-bound in the driver thread, this massively increases performance, as seen from the ~40% fps improvement that I gained after the implementation.

This transition presented a number of issues, the first of which was that u_threaded_context required buffer invalidation and rebinding. I’d had this on my list of targets for a while, so it was a good opportunity to finally hook it up.

Next up, u_threaded_context was very obviously written to work for the existing radeon driver architecture, and this was entirely incompatible with zink, specifically in how the batch/command buffer implementation is hardcoded like I talked about yesterday. Switching to monotonic, dynamically scaling command buffer usage resolved that and brought with it some other benefits.

The other big issue was, as I’m sure everyone expected, documentation.

I certainly can’t deny that there’s lots of documentation for u_threaded_context. It exists, it’s plentiful, and it’s quite detailed in some cases.

It’s also written by people who know exactly how it works with the expectation that it’s being read by other people who know exactly how it works. I had no idea going into the implementation how any of it worked other than a general knowledge of the asynchronous command stream parts that are common to all thread queue implementations, so this was a pretty huge stumbling block.

Nevertheless, I persevered, and with the help of a lot of RTFC, I managed to get it up and running. This is a more general overview post rather than a more in-depth, technical one, so I’m not going to go into any deep analysis of the (huge amounts of) code required to make it work, but here’s some key points from the process in case anyone reading this hits some of the same issues/annoyances that I did:

  • use consistent naming for all your struct subclassing, because a huge amount of the code churn is just going to be replacing driver class -> gallium class references to driver class -> u_threaded_context class -> gallium class ones; if you can sed these all at once, it simplifies the work tremendously
  • u_threaded_context works off the radeon queue/fence architecture, which allows (in some cases) multiple fences for any given queue submission, so ensure that your fences work the same way or (as I did) can effectively have sub-fences
  • obviously don’t forget your locking, but also don’t over-lock; I’m still doing some analysis to check how much locking I need for the context-based caches, and it may even be the case that I’m under-locked at the moment, but it’s important to keep in mind that your pipe_context can be in many different threads at a given time, and so, as the u_threaded_context docs repeatedly say without further explanation, don’t use it “in an unsafe way”
  • the buffer mapping rules/docs are complex, but basically it boils down to checking the TC_TRANSFER_MAP_* flags before doing the things that those flags prohibit
    • ignore threaded_resource::max_forced_staging_uploads to start with since it adds complexity
    • if you get TC_TRANSFER_MAP_THREADED_UNSYNC, you have to use threaded_context::base.stream_uploader for staging buffers, though this isn’t (currently) documented anywhere
    • watch your buffer alignments; I already fixed an issue with this, but u_threaded_context was written for radeon drivers, so there may be other cases where hardcoded values for those drivers exist
    • probably just read the radeonsi code before even attempting this anyway

All told, fixing all the regressions took much longer than the actual implementation, but that’s just par for the course with driver work.

Anyone interested in testing should take note that, as always, this has only been used on Intel hardware (and if you’re on Intel, this post is definitely worth reading), and so on systems which were not CPU-bound previously or haven’t been worked on by me, you may not yet see these kinds of gains.

But you will eventually.

And That’s It

This is a sort of bittersweet post as it marks the end of my full-time hobby work with zink. I’ve had a blast over the past ~6 months, but all things change eventually, and such is the case with this situation.

Those of you who have been following me for a long time will recall that I started hacking on zink while I was between jobs in order to improve my skills and knowledge while doing something productive along the way. I succeeded in all regards, at least by my own standards, and I got to work with some brilliant people at the same time.

But now, at last, I will once again become employed, and the course of that employment will take me far away from this project. I don’t expect that I’ll have a considerable amount of mental energy to dedicate to hobbyist Open Source projects, at least for the near term, so this is a farewell of sorts in that sense. This means (again, for at least the near term):

  • I’ll likely be blogging far less frequently
  • I don’t expect to be writing any new patches for zink/gallium/mesa

This does not mean that zink is dead, or the project is stalling development, or anything like that, so don’t start overreaching on the meaning of this post.

I still have 450+ patches left to be merged into mainline Mesa, and I do plan to continue driving things towards that end, though I expect it’ll take a good while. I’ll also be around to do patch reviews for the driver and continue to be involved in the community.

I look forward to a time when I’ll get to write more posts here and move the zink user experience closer to where I think it can be.

This is Mike, signing off for now.

Happy rendering.

November 05, 2020

During my presentation at the X Developers Conference I stated that we had been mostly using the Khronos Vulkan Conformance Test suite (aka Vulkan CTS) to validate our Vulkan driver for Raspberry Pi 4 (aka V3DV). While the CTS is an invaluable resource for driver testing and validation, it doesn’t exactly compare to actual real world applications, and so, I made the point that we should try to do more real world testing for the driver after completing initial Vulkan 1.0 support.

To be fair, we had been doing a little bit of this already when I worked on getting the Vulkan ports of all 3 Quake game classics to work with V3DV, which allowed us to identify and fix a few driver bugs during development. The good thing about these games is that we could get the source code and compile them natively for ARM platforms, so testing and debugging was very convenient.

Unfortunately, there are not a plethora of Vulkan applications and games like these that we can easily test and debug on a Raspberry Pi as of today, which posed a problem. One way to work around this limitation that was suggested after my presentation at XDC was to use Zink, the OpenGL to Vulkan layer in Mesa. Using Zink, we can take existing OpenGL applications that are currently available for Raspberry Pi and use them to test our Vulkan implementation a bit more thoroughly, expanding our options for testing while we wait for the Vulkan ecosystem on Raspberry Pi 4 to grow.

So last week I decided to get hands on with that. Zink requires a few things from the underlying Vulkan implementation depending on the OpenGL version targeted. Currently, Zink only targets desktop OpenGL versions, so that limits us to OpenGL 2.1, which is the maximum version of desktop OpenGL that Raspberry Pi 4 can support (we support up to OpenGL ES 3.1 though). For that desktop OpenGL version, Zink required a few optional Vulkan 1.0 features that we were missing in V3DV, namely:

  • Logic operations.
  • Alpha to one.
  • VK_KHR_maintenance1.

The first two were trivial: they were already implemented and we only had to expose them in the driver. Notably, when I was testing these features with the relevant CTS tests I found a bug in the alpha to one tests, so I proposed a fix to Khronos which is currently in review.

I also noticed that Zink was implicitly requiring support for timestamp queries, so I implemented that in V3DV and then wrote a patch for Zink to handle this requirement better.

Finally, Zink doesn’t use Vulkan swapchains, instead it creates presentable images directly, which was problematic for us because our platform needs to handle allocations for presentable images specially, so a patch for Zink was also required to address this.

As of the writing of this post, all this work has been merged in Mesa and it enables Zink to run OpenGL 2.1 applications over V3DV on Raspberry Pi 4. Here are a few screenshots of Quake3 taken with the native OpenGL driver (V3D), with the native Vulkan driver (V3DV) and with Zink (over V3DV). There is a significant performance hit with Zink at present, although that is probably not too unexpected at this stage, but otherwise it seems to be rendering correctly, which is what we were really interested to see:


Quake3 Vulkan renderer (V3DV)

Quake3 OpenGL renderer (V3D)

Quake3 OpenGL renderer (Zink + V3DV)

Note: you’ll notice that the Vulkan screenshot is darker than the OpenGL versions. As I reported in another post, that is a feature of the Vulkan port of Quake3 and is unrelated to the driver.

Going forward, we expect to use Zink to test more applications and hopefully identify driver bugs that help us make V3DV better.

It’s Time.

I’ve been busy cramming more code than ever into the repo this week in order to finish up my final project for a while by Friday. I’ll talk more about that tomorrow though. Today I’ve got two things for all of you.

First, A Riddle

Of these two screenshots, one is zink+ANV and one is IRIS. Which is which?

2.png

1.png

Second, Queue Architecture

Let’s talk a bit at a high level about how zink uses (non-compute) command buffers.

Currently in the repo zink works like this:

  • there is 1 queue
  • there are 4 command buffers used in a ring
  • after every flush (e.g., glFlush), the command buffers cycle
  • the driver flushes itself internally on pretty much every function call
  • any time an in-use command buffer is iterated to, the driver stalls until the command buffer has completed

In short, there’s a huge bottleneck around the flushing mechanism, and then there’s a lesser-reached bottleneck for cases where an application flushes repeatedly before a command buffer’s ops are completed.
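To make that bottleneck concrete, here's a rough sketch of the fixed-ring scheme; the names are invented for illustration and don't match the real zink code:

#define NUM_BATCHES 4

struct sketch_batch {
   VkCommandBuffer cmdbuf;
   VkFence fence;    /* signaled when the GPU finishes this batch */
   bool submitted;
};

struct sketch_ctx {
   struct sketch_batch batches[NUM_BATCHES];
   unsigned cur;
};

static void
sketch_cycle_batch(struct sketch_ctx *ctx, VkDevice dev)
{
   struct sketch_batch *next = &ctx->batches[(ctx->cur + 1) % NUM_BATCHES];
   /* if the next batch in the ring is still in flight, the driver has no
    * choice but to stall right here */
   if (next->submitted) {
      vkWaitForFences(dev, 1, &next->fence, VK_TRUE, UINT64_MAX);
      vkResetFences(dev, 1, &next->fence);
      next->submitted = false;
   }
   ctx->cur = (ctx->cur + 1) % NUM_BATCHES;
}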

Some time ago I talked about some modifications I’d done to the above architecture, and then things looked more like this:

  • there is 1 queue
  • there are 4 command buffers used in a ring
  • after every flush (e.g., glFlush), the command buffers cycle
  • the driver defers all possible flushes to try and match 1 flush to 1 frame
  • any time an in-use command buffer is iterated to, the driver stalls until the command buffer has completed

The major difference after this work was that the flushing was reduced, which then greatly reduced the impact of that bottleneck that exists when all the command buffers are submitted and the driver wants to continue recording commands.

A lot of speculation has occurred among the developers over “how many” command buffers should be used, and there’s been some talk of profiling this, but for various reasons I’ll get into tomorrow, I opted to sidestep the question entirely in favor of a more dynamic solution: monotonically-identified command buffers.

Monotony

The basic idea behind this strategy, which is used by a number of other drivers in the tree, is that there’s no need to keep a “ring” of command buffers to cycle through, as the driver can just continually allocate new command buffers on-the-fly and submit them as needed, reusing them once they’ve naturally completed instead of forcibly stalling on them. Here’s a visual comparison:

The current design:

Here’s the new version:

This way, there’s no possibility of stalling based on application flushes (or the rare driver-internal flush which does still exist in a couple places).
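A minimal sketch of the monotonic scheme for comparison; again the names are invented, and sketch_alloc_new_batch() is a hypothetical helper that creates and begins recording a brand new command buffer:

struct sketch_batch2 {
   VkCommandBuffer cmdbuf;
   VkFence fence;
   uint64_t id;        /* monotonically increasing identifier */
   bool submitted;
};

struct sketch_ctx2 {
   struct sketch_batch2 *batches;
   unsigned num_batches;
   uint64_t next_id;
};

static struct sketch_batch2 *
sketch_get_batch(struct sketch_ctx2 *ctx, VkDevice dev)
{
   /* reuse any batch the GPU has already finished with... */
   for (unsigned i = 0; i < ctx->num_batches; i++) {
      struct sketch_batch2 *b = &ctx->batches[i];
      if (b->submitted && vkGetFenceStatus(dev, b->fence) == VK_SUCCESS) {
         vkResetFences(dev, 1, &b->fence);
         vkResetCommandBuffer(b->cmdbuf, 0);
         b->submitted = false;
         b->id = ctx->next_id++;
         return b;
      }
   }
   /* ...otherwise allocate a brand new one: no stall, ever */
   return sketch_alloc_new_batch(ctx, dev);
}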

The architectural change here had two great benefits:

  • for systems that aren’t CPU bound, more command buffers will automatically be created and used, yielding immediate performance gains (~5% on Dave Airlie’s AMD setup)
  • the driver internals get massively simplified

The latter of these is due to the way that the queue in zink is split between gfx and compute command buffers; with the hardcoded batch system, the compute queue had its own command buffer while the gfx queue had four, but they all had unique IDs which were tracked using bitfields all over the place. Not to mention it was frustrating never being able to just “know” which command buffer was currently being recorded to for a given command without indexing the array.

Now it’s easy to know which command buffer is currently being recorded to, as it’ll always be the one associated with the queue (gfx or compute) for the given operation.

This had further implications, however, and I’d done this to pave the way for a bigger project, one that I’ve spent the past few days on. Check back tomorrow for that and more.

November 02, 2020

New Hotness

Quick update today, but I’ve got some very exciting news coming soon.

The biggest news of the day is that work is underway to merge some patches from Duncan Hopkins which enable zink to run on Mac OS using MoltenVK. This has significant potential to improve OpenGL support on that platform, so it’s awesome that work has been done to get the ball rolling there.

In only slightly less monumental news though, Adam Jackson is already underway with Vulkan WSI work for zink, which is going to be huge for performance.

October 30, 2020

(I just sent the below email to the mesa3d developer list.)

Just to let everyone know, a month ago I submitted the 20.2 llvmpipe
driver for OpenGL 4.5 conformance under the SPI/X.org umbrella, and it
is now official[1].

Thanks to everyone who helped me drive this forward, and to all the
contributors both to llvmpipe and the general Mesa stack that enabled
this.

Big shout out to Roland Scheidegger for helping review the mountain of
patches I produced in this effort.

My next plans involve submitting lavapipe for Vulkan 1.0; it's at 99%
or so of the CTS, but there are line drawing, sampler accuracy and some
snorm blending failures I have to work out.
I also ran the OpenCL 3.0 conformance suite against clover/llvmpipe
yesterday and have some vague hopes of driving that to some sort of
completion.

(for GL 4.6 only texture anisotropy is really missing, I've got
patches for SPIR-V support, in case someone was feeling adventurous).

Dave.

[1] https://www.khronos.org/conformance/adopters/conformant-products/opengl#submission_272

October 29, 2020

Buffering

I’ve got a lot of exciting stuff in the pipe now, but for today I’m just going to talk a bit about resource invalidation: what it is, when it happens, and why it’s important.

Let’s get started.

What is invalidation?

Resource invalidation occurs when the backing buffer of a resource is wholly replaced. Consider the following scenario under zink:

  • Have struct A { VkBuffer buffer; };
  • User calls glBufferData(target, size, data, usage), which stores data to A.buffer
  • User calls glBufferData(target, size, NULL, usage), which unsets the data from A.buffer

On a sane/competent driver, the second glBufferData call will trigger invalidation, which means that A.buffer will be replaced entirely, while A is still the driver resource used by Gallium to represent target.
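In GL terms this is the classic buffer “orphaning” pattern; the usage flag below is just an illustrative choice, and size/data stand in for whatever the scenario above used:

GLuint vbo;
glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
/* stores data to A.buffer */
glBufferData(GL_ARRAY_BUFFER, size, data, GL_DYNAMIC_DRAW);
/* ...draw using vbo while the GPU may still be reading it... */
/* unsets the data: the driver can swap in a fresh A.buffer instead of
 * synchronizing against the in-flight GPU work */
glBufferData(GL_ARRAY_BUFFER, size, NULL, GL_DYNAMIC_DRAW);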

When does invalidation occur?

Resource invalidation can occur in a number of scenarios, but the most common is when unsetting a buffer’s data, as in the above example. The other main case for it is replacing the data of a buffer that’s in use for another operation. In such a case, the backing buffer can be replaced to avoid forcing a sync in the command stream which will stall the application’s processing. There’s some other cases for this as well, like glInvalidateFramebuffer and glDiscardFramebufferEXT, but the primary usage that I’m interested in is buffers.

Why is invalidation important?

The main reason is performance. In the above scenario without invalidation, the second glBufferData call will write null to the whole buffer, which is going to be much more costly than just creating a new buffer.

That’s it

Now comes the slightly more interesting part: how does invalidation work in zink?

Currently, as of today’s mainline zink codebase, we have struct zink_resource to represent a resource for either a buffer or an image. One struct zink_resource represents exactly one VkBuffer or VkImage, and there’s some passable lifetime tracking that I’ve written to guarantee that these Vulkan objects persist through the various command buffers that they’re associated with.

Each struct zink_resource is, as is the way of Gallium drivers, also a struct pipe_resource, which is tracked by Gallium. Because of this, struct zink_resource objects themselves cannot be invalidated in order to avoid breaking Gallium, and instead only the inner Vulkan objects themselves can be replaced.

For this, I created struct zink_resource_object, which is an object that stores only the data that directly relates to the Vulkan objects, leaving struct zink_resource to track the states of these objects. Their lifetimes are separate, with struct zink_resource being bound to the Gallium tracker and struct zink_resource_object persisting for either the lifetime of struct zink_resource or its command buffer usage—whichever is longer.

Code

The code for this mechanism isn’t super interesting since it’s basically just moving some parts around. Where it gets interesting is the exact mechanics of invalidation and how struct zink_resource_object can be injected into an in-use resource, so let’s dig into that a bit.

Here’s what the pipe_context::invalidate_resource hook looks like:

static void
zink_invalidate_resource(struct pipe_context *pctx, struct pipe_resource *pres)
{
   struct zink_context *ctx = zink_context(pctx);
   struct zink_resource *res = zink_resource(pres);
   struct zink_screen *screen = zink_screen(pctx->screen);

   if (pres->target != PIPE_BUFFER)
      return;

This only handles buffer resources, but extending it for images would likely be little to no extra work.

   if (res->valid_buffer_range.start > res->valid_buffer_range.end)
      return;

Zink tracks the valid data segments of its buffers. This conditional is used to check for an uninitialized buffer, i.e., one which contains no valid data. If a buffer has no data, it’s already invalidated, so there’s nothing to be done here.

   util_range_set_empty(&res->valid_buffer_range);

Invalidating means the buffer will no longer have any valid data, so the range tracking can be reset here.

   if (!get_all_resource_usage(res))
      return;

If this resource isn’t currently in use, unsetting the valid range is enough to invalidate it, so it can just be returned right away with no extra work.

   struct zink_resource_object *old_obj = res->obj;
   struct zink_resource_object *new_obj = resource_object_create(screen, pres, NULL, NULL);
   if (!new_obj) {
      debug_printf("new backing resource alloc failed!");
      return;
   }

Here’s the old internal buffer object as well as a new one, created using the existing buffer as a template so that it’ll match.

   res->obj = new_obj;
   res->access_stage = 0;
   res->access = 0;

struct zink_resource is just a state tracker for the struct zink_resource_object object, so upon invalidate, the states are unset since this is effectively a brand new buffer.

   zink_resource_rebind(ctx, res);

This is the tricky part, and I’ll go into more detail about it below.

   zink_descriptor_set_refs_clear(&old_obj->desc_set_refs, old_obj);

If this resource was used in any cached descriptor sets, the references to those sets need to be invalidated so that the sets won’t be reused.

   zink_resource_object_reference(screen, &old_obj, NULL);
}

Finally, the old struct zink_resource_object is unrefed, which will ensure that it gets destroyed once its current command buffer has finished executing.

Simple enough, but what about that zink_resource_rebind() call? Like I said, that’s where things get a little tricky, but because of how much time I spent on descriptor management, it ends up not being too bad.

This is what it looks like:

void
zink_resource_rebind(struct zink_context *ctx, struct zink_resource *res)
{
   assert(res->base.target == PIPE_BUFFER);

Again, this mechanism only handles buffer resources for now, and there’s only one place in the driver that calls it, but it never hurts to be careful.

   for (unsigned shader = 0; shader < PIPE_SHADER_TYPES; shader++) {
      if (!(res->bind_stages & BITFIELD64_BIT(shader)))
         continue;
      for (enum zink_descriptor_type type = 0; type < ZINK_DESCRIPTOR_TYPES; type++) {
         if (!(res->bind_history & BITFIELD64_BIT(type)))
            continue;

Something common to many Gallium drivers is this idea of “bind history”, which is where a resource will have bitflags set when it’s used for a certain type of binding. While other drivers have a lot more cases than zink does due to various factors, the only thing that needs to be checked for my purposes is the descriptor type (UBO, SSBO, sampler, shader image) across all the shader stages. If a given resource has the flags set here, this means it was at some point used as a descriptor of this type, so the current descriptor bindings need to be compared to see if there’s a match.

         uint32_t usage = zink_program_get_descriptor_usage(ctx, shader, type);
         while (usage) {
            const int i = u_bit_scan(&usage);

This is a handy mechanism that returns the current descriptor usage of a shader as a bitfield. So for example, if a vertex shader uses UBOs in slots 0, 1, and 3, usage will be 11 (binary 1011), and the loop will process i as 0, 1, and 3.

            struct zink_resource *cres = get_resource_for_descriptor(ctx, type, shader, i);
            if (res != cres)
               continue;

Now the slot of the descriptor type can be compared against the resource that’s being re-bound. If this resource is the one that’s currently bound to the specified slot of the specified descriptor type, then steps can be taken to perform additional operations necessary to successfully replace the backing storage for the resource, mimicking the same steps taken when initially binding the resource to the descriptor slot.

            switch (type) {
            case ZINK_DESCRIPTOR_TYPE_SSBO: {
               struct pipe_shader_buffer *ssbo = &ctx->ssbos[shader][i];
               util_range_add(&res->base, &res->valid_buffer_range, ssbo->buffer_offset,
                              ssbo->buffer_offset + ssbo->buffer_size);
               break;
            }

For SSBO descriptors, the only change needed is to add the bound region to the buffer’s valid range. This region is passed to the shader, so even if it’s never actually written to, it might be, and so it has to be considered a valid region.

            case ZINK_DESCRIPTOR_TYPE_SAMPLER_VIEW: {
               struct zink_sampler_view *sampler_view = zink_sampler_view(ctx->sampler_views[shader][i]);
               zink_descriptor_set_refs_clear(&sampler_view->desc_set_refs, sampler_view);
               zink_buffer_view_reference(ctx, &sampler_view->buffer_view, NULL);
               sampler_view->buffer_view = get_buffer_view(ctx, res, sampler_view->base.format,
                                                           sampler_view->base.u.buf.offset, sampler_view->base.u.buf.size);
               break;
            }

Sampler descriptors require a new VkBufferView be created since the previous one is no longer valid. Again, the references for the existing bufferview need to be invalidated now since that descriptor set can no longer be reused from the cache, and then the new VkBufferView is set after unrefing the old one.

            case ZINK_DESCRIPTOR_TYPE_IMAGE: {
               struct zink_image_view *image_view = &ctx->image_views[shader][i];
               zink_descriptor_set_refs_clear(&image_view->desc_set_refs, image_view);
               zink_buffer_view_reference(ctx, &image_view->buffer_view, NULL);
               image_view->buffer_view = get_buffer_view(ctx, res, image_view->base.format,
                                                         image_view->base.u.buf.offset, image_view->base.u.buf.size);
               util_range_add(&res->base, &res->valid_buffer_range, image_view->base.u.buf.offset,
                              image_view->base.u.buf.offset + image_view->base.u.buf.size);
               break;
            }

Images are nearly identical to the sampler case, the difference being that while samplers are read-only like UBOs (and therefore reach this point already having valid buffer ranges set), images are more like SSBOs and can be written to. Thus the valid range must be set here like in the SSBO case.

            default:
               break;

Eagle-eyed readers will note that I’ve omitted a UBO case, and this is because there’s nothing extra to be done there. UBOs will already have their valid range set and don’t need a VkBufferView.

            }

            invalidate_descriptor_state(ctx, shader, type);

Finally, the incremental descriptor state hash for this shader stage and descriptor type is invalidated. It’ll be recalculated normally upon the next draw or compute operation, so this is a quick zero-setting operation.

         }
      }
   }
}

That’s everything there is to know about the current state of resource invalidation in zink!

October 24, 2020

Never Seen Before

A rare Saturday post because I spent so much time this week intending to blog and then somehow not getting around to it. Let’s get to the status updates, and then I’m going to dive into the more interesting of the things I worked on over the past few days.

Zink has just hit another big milestone that I’ve just invented: as of now, my branch is passing 97% of piglit tests up through GL 4.6 and ES 3.2, and it’s a huge improvement from earlier in the week when I was only at around 92%. That’s just over 1000 failure cases remaining out of ~41,000 tests. For perspective, a table.

               IRIS    zink-mainline   zink-wip
Passed Tests   43508   21225           40190
Total Tests    43785   22296           41395
Pass Rate      99.4%   95.2%           97.1%

As always, I happen to be running on Intel hardware, so IRIS and ANV are my reference points.

It’s important to note here that I’m running piglit tests, and this is very different from CTS; put another way, I may be passing over 97% of the test cases I’m running, but that doesn’t mean that zink is conformant for any versions of GL or ES, which may not actually be possible at present (without huge amounts of awkward hacks) given the persistent issues zink has with provoking vertex handling. I expect this situation to change in the future through the addition of more Vulkan extensions, but for now I’m just accepting that there’s some areas where zink is going to misrender stuff.

What Changed?

The biggest change that boosted the zink-wip pass rate was my fixing 64bit vertex attributes, which in total had been accounting for ~2000 test failures.

Vertex attributes, as we all know since we’re all experts in the graphics field, are the inputs for vertex shaders, and the data types for these inputs can vary just like C data types. In particular, with GL 4.1, ARB_vertex_attrib_64bit became a thing, which allows 64bit values to be passed as inputs here.

Once again, this is a problem for zink.

It comes down to the difference between GL’s implicit handling methodology and Vulkan’s explicit handling methodology. Consider the case of a dvec4 data type. Conceptually, this is a data type which is 4x64bit values, requiring 32bytes of storage. A vec4 uses 16bytes of storage, and this equates to a single “slot” or “location” within the shader inputs, as everything there is vec4-aligned. This means that, by simple arithmetic, a dvec4 requires two slots for its storage, one for the first two members, and another for the second two, each consuming a single 16byte slot.

When loading a dvec4 in GL(SL), a single variable with the first location slot is used, and the driver will automatically use the second slot when loading the second half of the value.

When loading a dvec4 in (SPIR)Vulkan, two variables with consecutive, explicit location slots must be used, and the driver will load exactly the input location specified.

This difference requires that for any dvec3 or dvec4 vertex input in zink, the value and also the load have to be split along the vec4 boundary for things to work.

Gallium already performs this split on the API side, allowing zink to already be correctly setting things up in the VkPipeline creation, so I wrote a NIR pass to fix things on the shader side.
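As a rough illustration of that API-side split, a dvec4 attribute at shader location 0 ends up occupying two consecutive locations in the VkPipeline vertex input state; the values here are illustrative rather than the exact code zink generates:

VkVertexInputAttributeDescription attrs[2] = {
   {
      .location = 0,                     /* first half: .xy */
      .binding  = 0,
      .format   = VK_FORMAT_R64G64_SFLOAT,
      .offset   = 0,
   },
   {
      .location = 1,                     /* second half: .zw */
      .binding  = 0,
      .format   = VK_FORMAT_R64G64_SFLOAT,
      .offset   = 16,                    /* two 8-byte doubles */
   },
};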

Shader Rewriting

Yes, it’s been at least a week since I last wrote about a NIR pass, so it’s past time that I got back into that.

Going into this, the idea here is to perform the following operations within the vertex shader:

  • for the input variable (hereafter A), find the deref instruction (hereafter A_deref); deref is used to access variables for input and output, and so it’s guaranteed that any 64bit input will first have a deref
  • create a second variable (hereafter B) of size double (for dvec3) or dvec2 (for dvec4) to represent the second half of A
  • alter A and A_deref’s type to dvec2; this aligns the variable (and its subsequent load) to the vec4 boundary, which enables it to be correctly read from a single location slot
  • create a second deref instruction for B (hereafter B_deref)
  • find the load_deref instruction for A_deref (hereafter A_load); a load_deref instruction is used to load data from a variable deref
  • alter the number of components for A_load to 2, matching its new dvec2 size
  • create a second load_deref instruction for B_deref which will load the remaining components (hereafter B_load)
  • construct a new composite (hereafter C_load) dvec3 or dvec4 by combining A_load + B_load to match the load of the original type of A
  • rewrite all the subsequent uses of A_load’s result to instead use C_load’s result

Simple, right?

Here we go.

static bool
lower_64bit_vertex_attribs_instr(nir_builder *b, nir_instr *instr, void *data)
{
   if (instr->type != nir_instr_type_deref)
      return false;
   nir_deref_instr *A_deref = nir_instr_as_deref(instr);
   if (A_deref->deref_type != nir_deref_type_var)
      return false;
   nir_variable *A = nir_deref_instr_get_variable(A_deref);
   if (A->data.mode != nir_var_shader_in)
      return false;
   if (!glsl_type_is_64bit(A->type) || !glsl_type_is_vector(A->type) || glsl_get_vector_elements(A->type) < 3)
      return false;

First, it’s necessary to filter out all the instructions that aren’t what should be rewritten. As above, only dvec3 and dvec4 types are targeted here (dmat* types are reduced to dvec types prior to this point), so anything other than a deref of a variable with one of those types is ignored.

   /* create second variable for the split */
   nir_variable *B = nir_variable_clone(A, b->shader);
   /* split new variable into second slot */
   B->data.driver_location++;
   nir_shader_add_variable(b->shader, B);

B matches A except in its type and slot location, which will always be one greater than the slot location of A, so A can be cloned here to simplify the process of creating B.

   unsigned total_num_components = glsl_get_vector_elements(A->type);
   /* new variable is the second half of the dvec */
   B->type = glsl_vector_type(glsl_get_base_type(A->type), glsl_get_vector_elements(A->type) - 2);
   /* clamp original variable to a dvec2 */
   A_deref->type = A->type = glsl_vector_type(glsl_get_base_type(A->type), 2);

A and B need their types modified to not cross the vec4/slot boundary. A is always a dvec2, which has 2 components, and B will always be the remaining components.

   /* create deref instr for the new variable (B_deref) */
   b->cursor = nir_after_instr(instr);
   nir_deref_instr *B_deref = nir_build_deref_var(b, B);

Now B_deref has been added thanks to the nir_builder helper function which massively simplifies the process of setting up all the instruction parameters.

   nir_foreach_use_safe(A_deref_use, &A_deref->dest.ssa) {

NIR is SSA-based, and all uses of an SSA value are tracked for the purposes of ensuring that SSA values are truly assigned only once as well as ease of rewriting them in the case where a value needs to be modified, just as this pass is doing. This use-tracking comes along with a simple API for iterating over the uses.

      nir_instr *A_load_instr = A_deref_use->parent_instr;
      assert(A_load_instr->type == nir_instr_type_intrinsic &&
             nir_instr_as_intrinsic(A_load_instr)->intrinsic == nir_intrinsic_load_deref);

The only use of A_deref should be A_load, so really iterating over the A_deref uses is just a quick, easy way to get from there to the A_load instruction.

      /* this is a load instruction for the A_deref, and we need to split it into two instructions that we can
       * then zip back into a single ssa def */
      nir_intrinsic_instr *A_load = nir_instr_as_intrinsic(A_load_instr);
      /* clamp the first load to 2 64bit components */
      A_load->num_components = A_load->dest.ssa.num_components = 2;

A_load must be clamped to a single slot location to avoid crossing the vec4 boundary, so this is done by changing the number of components to 2, which matches the now-changed type of A.

      b->cursor = nir_after_instr(A_load_instr);
      /* this is the second load instruction for the second half of the dvec3/4 components */
      nir_intrinsic_instr *B_load = nir_intrinsic_instr_create(b->shader, nir_intrinsic_load_deref);
      B_load->src[0] = nir_src_for_ssa(&B_deref->dest.ssa);
      B_load->num_components = total_num_components - 2;
      nir_ssa_dest_init(&B_load->instr, &B_load->dest, B_load->num_components, 64, NULL);
      nir_builder_instr_insert(b, &B_load->instr);

This is B_load, which loads a number of components that matches the type of B. It’s inserted after A_load, though the before/after isn’t important in this case. The key is just that this instruction is added before the next one.

      nir_ssa_def *def[4];
      /* create a new dvec3/4 comprised of all the loaded components from both variables */
      def[0] = nir_vector_extract(b, &A_load->dest.ssa, nir_imm_int(b, 0));
      def[1] = nir_vector_extract(b, &A_load->dest.ssa, nir_imm_int(b, 1));
      def[2] = nir_vector_extract(b, &B_load->dest.ssa, nir_imm_int(b, 0));
      if (total_num_components == 4)
         def[3] = nir_vector_extract(b, &B_load->dest.ssa, nir_imm_int(b, 1));
      nir_ssa_def *C_load = nir_vec(b, def, total_num_components);

Now that A_load and B_load both exist and are loading the corrected number of components, these components can be extracted and reassembled into a larger type for use in the shader, specifically the original dvec3 or dvec4 which is being used. nir_vector_extract performs this extraction from a given instruction by taking an index of the value to extract, and then the composite value is created by passing the extracted components to nir_vec as an array.

      /* use the assembled dvec3/4 for all other uses of the load */
      nir_ssa_def_rewrite_uses_after(&A_load->dest.ssa, nir_src_for_ssa(C_load), C_load->parent_instr);

Since this is all SSA, the NIR helpers can be used to trivially rewrite all the uses of the loaded value from the original A_load instruction to now use the assembled C_load value. It’s important, however, that only the uses occurring after C_load has been created are rewritten (hence nir_ssa_def_rewrite_uses_after), or else the pass would also rewrite the A_load use inside C_load itself, breaking the shader entirely with an SSA-impossible as well as generally-impossible C_load = vec(C_load, B_load) assignment.

   }

   return true;
}

Progress has occurred, so the pass returns true to reflect that.

Now those large attributes are loaded according to Vulkan spec, and everything is great because, as expected, ANV has no bugs here.

October 16, 2020

Brain Hurty

It’s been a very long week for me, and I’m only just winding down now after dissecting and resolving a crazy fp64/ssbo bug. I’m too scrambled to jump into any code, so let’s just do another fluff day and review happenings in mainline mesa which relate to zink.

MRs Landed

Thanks to the tireless, eagle-eyed reviewing of Erik Faye-Lund, a ton of zink patches out of zink-wip have landed this week. Here’s an overview of that in backwards historical order:

Versions Bumped

Zink has grown tremendously over the past day or so of MRs landing, going from GL 3.0 and GLSL 1.30 to GL 3.3 and GLSL 3.30.

It can even run Blender now. Unless you’re running it under Wayland.

October 15, 2020

New versions of the KWinFT projects Wrapland, Disman, KWinFT and KDisplay are available now. They were released this week, aligned with the release of Plasma 5.20, and offer new features and stability improvements.

Universal display management

The highlight this time is a completely redefined and reworked Disman that allows you to control display configurations not only in a KDE Plasma session with KWinFT, but also with KWin, in other Wayland sessions with wlroots-based compositors, and in any X11 session.

You can use it with the included command-line tool dismanctl or together with the graphical frontend KDisplay. Read more about Disman's goals and technical details in the 5.20 beta announcement.

KWinFT projects you should use

Let's cut directly to the chase! As Disman and KDisplay are replacements for libkscreen and KScreen, and KWinFT for KWin, you will be interested in a comparison from a user's point of view. What is better and what should you personally choose?

Disman and KDisplay for KDE Plasma

If you run a KDE Plasma desktop at the moment, you should definitely consider using Disman and replacing KScreen with KDisplay.

Disman comes with a more reliable overall design, moving internal logic to its D-Bus service and away from the frontends in KDisplay. Changes thereby become more atomic and bugs are less likely to emerge.

The UI of KDisplay is improved in comparison to KScreen, and convenience features have been added, for example automatic selection of the best available mode.

There are still some caveats to this release that might prompt you to wait for the next one though:

  • Although multiple bugs were discovered and fixed during the beta phase, 5.20 is still the first release after a large redesign, so it is not unlikely that more bugs will be discovered later on.
  • If you want to use Disman and KDisplay in a legacy KWin Wayland session, note that I have personally only tested Disman with KWinFT and sway. Other people may already use it with legacy KWin in a Wayland session, but the backend that is loaded in this case could very well have seen no QA at all yet. That being said, if you run a KWin X11 session you will experience no such problems, since in this case the backend is the same as with KWinFT or any other X11 window manager.
  • If you require the KDisplay UI in another language, then this release is not yet for you. An online translation system has been set up, but the first localization will only become available with release 5.21.

So your mileage may vary but in most cases you should have a better experience with Disman and KDisplay.

And if you generally like to support new projects with ambitious goals that make use of modern technologies, you should definitely give it a try.

Disman and KDisplay with wlroots

Disman includes a backend for wlroots-based compositors. I'm proud of this achievement, since I believe we need more projects for the Linux desktop which do not only try to solve issues in their own little habitat and project monoculture, but which aim at improving the Linux desktop in a more holistic and collaborative spirit.

I tested the backend myself and even provided some patches to wlroots directly to improve its output-management capabilities, so using Disman with wlroots should be a decent experience. One catch though is that those patches require a new wlroots release. For now you can only get them by compiling wlroots from master or having your distribution of choice backport them.

In comparison with other options to manage your displays in wlroots, I believe Disman provides the most user-friendly solution, taking lots of work off your shoulders by automatically optimizing unknown new display setups and reloading data for already known setups.

Another prominent alternative for display management on wlroots is kanshi. I don't think kanshi is as easy to use and as autonomously optimizing as Disman, but you might be able to configure displays more precisely with it. So you could prefer kanshi or Disman depending on your needs.

You can use Disman in wlroots sessions as a standalone system with its included command-line tool dismanctl and without KDisplay. This way you do not need to pull in as many KDE dependencies. But KDisplay together with Disman and wlroots also works very well and provides you with an easy-to-use UI for adapting the display configuration to your needs.

Disman and KDisplay on X11

You will like it. Try it out is all I can say. The RandR backend is tested thoroughly and while there is still room for some refactoring it should work very well already. This is also independent of what desktop environment or window manager you use. Install it and see for yourself.

That being said the following issues are known at the moment:

  • Only a global scale can be selected for all displays. Disman can not yet set different scales for different displays. This might become possible in the future but has certain drawbacks on its own.
  • The global scale in X11 is set by KDisplay alone. So you must install Disman with KDisplay in case you want to change the global scale without writing to the respective X11 config files manually.
  • At the time of writing there is a bug with Nvidia cards that leads to reduced refresh rates. But I expect this bug to be fixed very soon.

KWinFT vs KWin

I was talking a lot about Disman since it contains the most interesting changes this release and it can now be useful to many more people than before.

But you might also be interested in replacing KWin with KWinFT, so let's take a look at how KWinFT at this point in time compares to legacy KWin.

As it stands, KWinFT is still a drop-in replacement for KWin. You can install it to your system in place of KWin and use it together with a KDE Plasma session.

X11

If you usually run an X11 session you should choose KWinFT without hesitation. It provides the same features as KWin and comes with an improved compositing pipeline that lowers latency and increases smoothness. There are also patches in the works to improve upon this further for multi-display setups. These patches might come to the 5.20 release via a bug fix release.

One point to keep in mind though is that the KWinFT project will concentrate in the future on improving the experience with Wayland. We won't maliciously regress the X11 experience, but if there is a tradeoff between improving the Wayland session and regressing X11, KWinFT will opt for the former. Whether such a situation will ever unfold has yet to be seen, though. The X11 session might as well continue to work without any regressions for the next decade.

Wayland

The situation is different if you want to run KWinFT as a Wayland compositor. I believe in regards to stability and robustness KWinFT is superior.

In particular this holds true for multi-display setups and display management. Although I worked mostly on Disman in the last two months, that work naturally spilled over to KWinFT too. KWinFT's output objects are now much more reasonably implemented. Besides that, there were many more bug fixes to output handling, which you can verify by looking at the merged changes for 5.20 in KWinFT and Wrapland.

If you have issues with your outputs in KWin definitely try out KWinFT, of course together with Disman.

Another area where you will probably have a better experience is compositing itself. As on X11, the pipeline was reworked. For multi-display setups, the patch that was linked above, which might come in a bug fix release to 5.20, should improve the situation further.

On the other side, KWin's Wayland session gained some much awaited features with 5.20. According to the changelog, screencasting is now possible in theory, as is middle-click pasting and integration with Klipper, the clipboard management utility in the system tray.

I say "in theory" because I have not tested it myself and I expect it to not work without issues. That is for one because big feature additions like these regularly require later adjustments due to unforeseen behavior changes but also because on a principal and strategic level I disagree with the KWin developers' general approach here.

The KWin codebase is rotten and needs a rigorous overhaul. Putting more features on top of that, which often requires massive internal changes just for the sake of crossing an item off a checklist, might make sense from the viewpoint of KDE users and KDE's marketing staff, but from a long-term engineering perspective it will only litter the code more and lead to more and more breakage over time. Most users won't notice that immediately, but when they do it is already too late.

On how to do that better I really have to compliment the developers of Gnome's Mutter and wlroots.

Especially Mutter's Wayland session was in a bad state with some fundamental problems due to its history just a few years ago. But they committed to a very forward-thinking stance, ignoring the initial bad reception and not being tempted by immediate quick fixes that would not hold up to the necessary standards in the long term. And nowadays Gnome Mutter's Wayland session is in way better shape. I want to highlight their transactional KMS project. This is a massive overhaul that is completely transparent to the common user, but it enables the Mutter developers to build on a solid base in many ways in the future.

Still, as I said, I have not tried KWin 5.20 myself, so if the new features are important to you, give it a try and check for yourself whether your experience confirms my concerns or whether you are happy with what was added. Switching from KWin to KWinFT or the other way around is easy after all.

How to get the KWinFT projects

If you self-compile KWinFT it is very easy to switch from KWin to KWinFT. Just compile KWinFT to your system prefix. If you want more comfort through distribution packages you have to choose your distribution carefully.

Currently only Manjaro provides KWinFT packages officially. You can install all KWinFT projects on Manjaro easily through the packages with the same names.

Manjaro also offers git-variants of these packages allowing you to run KWinFT projects directly from master branch. This way you can participate in its development directly or give feedback to latest changes.

If you run Arch Linux you can install all KWinFT projects from the AUR. The release packages are not yet updated to 5.20 but I assume this will happen pretty soon. They have a somewhat odd naming scheme: there are kwinft, wrapland-kwinft, disman-kwinft and kdisplay-kwinft. Git variants of these packages are available too, but they follow the better naming scheme without the kwinft suffix. So for example the git package for disman-kwinft is just called disman-git. Naming nitpicks aside, huge thanks to the maintainers of these packages: abelian424 and Christoph (haagch).

A special place in my heart was conquered not long ago by Fedora. I switched over to it from KDE Neon due to problems with the latest update and the often outdated packages, and I am amazed by Fedora's technically versed and overall professional vision.

To install KWinFT projects on Fedora with its exceptional package manager DNF you can make use of this copr repository that includes their release versions. The packages are already updated to 5.20. Thanks to zawertun for providing these packages!

Fedora's KDE SIG group also took interest in the KWinFT projects and set up a preliminary copr for them. One of their packagers contacted me after the Beta release and I hope that I can help them get it fully set up soon. I think Fedora's philosophy of pushing the Linux ecosystem by providing most recent packages and betting on emerging technologies will harmonize very well with the goals of the KWinFT project.

Busy, Busy, Busy

It’s been a busy week for me in personal stuff, so my blogging has been a bit slow. Here’s a brief summary of a few vaguely interesting things that I’ve been up to:

  • I’m now up to 37fps in the Heaven benchmark, which is an absolute unit of a number that brings me up to 68.5% of native GL
  • I discovered that people have been reporting bugs for zink-wip (great) without tagging me (not great), and I’m not getting notifications for it (also not great). gitlab is hard.
  • I added support for null UBOs in descriptor sets, which I think I was supposed to actually have done some time ago, but I hadn’t run into any tests that hit it. null checks all the way down
  • I fired up supertuxkart for one of the reports, which led me down a rabbithole of discovering that part of my compute shader implementation was broken for big shared data loads; I was, for whatever reason, attempting to do an OpAccessChain with a result type of uvec4 from a base type of array<uint>, which…it’s just not going to work ever, so that’s some strong work by past me
  • I played supertuxkart for a good 30 seconds and took a screenshot

supertuxkart.png

There’s a weird flickering bug with the UI that I get in some levels that bears some looking into, but otherwise I get a steady 60/64/64 in zink vs 74/110/110 in IRIS (no idea what the 3 numbers mean, if anyone does, feel free to drop me a line).

That’s it for today. Hopefully a longer post tomorrow but no promises.

October 14, 2020

 A couple of years ago, we sandboxed thumbnailers using bubblewrap to avoid drive-by downloads taking advantage of thumbnailers with security issues.

 It's a great tool, and it's a tool that Flatpak relies upon to create its own sandboxes. But that also meant that we couldn't use it inside the Flatpak sandboxes themselves, and those aren't always as closed as they could be, to support legacy applications.

 We've finally implemented support for sandboxing thumbnailers within Flatpak, using the Spawn D-Bus interface (indirectly).

This should all land in GNOME 40, though it should already be possible to integrate it into your Flatpaks. Make sure to use the latest gnome-desktop development version, and that the flatpak-spawn utility is new enough in the runtime you're targeting (it's been updated in the freedesktop.org runtimes #1, #2, #3, but it takes time to trickle down to GNOME versions). Example JSON snippets:

{
    "name": "flatpak-xdg-utils",
    "buildsystem": "meson",
    "sources": [
        {
            "type": "git",
            "url": "https://github.com/flatpak/flatpak-xdg-utils.git",
            "tag": "1.0.4"
        }
    ]
},
{
    "name": "gnome-desktop",
    "buildsystem": "meson",
    "config-opts": ["-Ddebug_tools=true", "-Dudev=disabled"],
    "sources": [
        {
            "type": "git",
            "url": "https://gitlab.gnome.org/GNOME/gnome-desktop.git"
        }
    ]
}

(We also sped up GStreamer-based thumbnailers by allowing them to use a cache, and added profiling information to the thumbnail test tools, which could prove useful if you want to investigate performance or bugs in that area)

Edit: corrected a link, thanks to the commenters for the notice

October 13, 2020

This week, I had a conversation with one of my coworkers about our subgroup/wave size heuristic and, in particular, whether or not control-flow divergence should be considered as part of the choice.  This led me down a fun path of looking into the statistics of control-flow divergence and the end result is somewhat surprising:  Once you get above about an 8-wide subgroup, the subgroup size doesn't matter.

Before I get into the details, let's talk nomenclature.  As you're likely aware, GPUs often execute code in groups of 1 or more invocations.  In D3D terminology, these are called waves.  In Vulkan and OpenGL terminology, these are called subgroups.  The two terms are interchangeable and, for the rest of this post, I'll use the Vulkan/OpenGL conventions.

Control-flow divergence

Before we dig into the statistics, let's talk for a minute about control-flow divergence.  This is mostly going to be a primer on SIMT execution and control-flow divergence in GPU architectures.  If you're already familiar, skip ahead to the next section.

Most modern GPUs use a Single Instruction Multiple Thread (SIMT) model.  This means that the graphics programmer writes a shader which, for instance, colors a single pixel (fragment/pixel shader) but what the shader compiler produces is a program which colors, say, 32 pixels using a vector instruction set architecture (ISA).  Each logical single-pixel execution of the shader is called an "invocation" while the physical vectorized execution of the shader which covers multiple pixels is called a wave or a subgroup.  The size of the subgroup (number of pixels colored by a single hardware execution) varies depending on your architecture.  On Intel, it can be 8, 16, or 32; on AMD, it's 32 or 64; and on Nvidia (if my knowledge is accurate), it's always 32.

This conversion from logical single-pixel version of the shader to a physical multi-pixel version is often fairly straightforward.  The GPU registers each hold N values and the instructions provided by the GPU ISA operate on N pieces of data at a time.  If, for instance, you have an add in the logical shader, it's converted to an add provided by the hardware ISA which adds N values.  (This is, of course an over-simplification but it's sufficient for now.)  Sounds simple, right?

Where things get more complicated is when you have control-flow in your shader.  Suppose you have an if statement with both then and else sections.  What should we do when we hit that if statement?  The if condition will be N Boolean values.  If all of them are true or all of them are false, the answer is pretty simple: we do the then or the else respectively.  If you have a mix of true and false values, we have to execute both sides.  More specifically, the physical shader has to disable all of the invocations for which the condition is false and run the "then" side of the if statement.  Once that's complete, it has to re-enable those channels and disable the channels for which the condition is true and run the "else" side of the if statement.  Once that's complete, it re-enables all the channels and continues executing the code after the if statement.
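If it helps, here's a rough scalar model of that masking written as pseudo-C; the helper functions are made up, and real hardware does this with dedicated execution-mask registers and compiler-generated branch instructions:

uint32_t exec_mask = ~0u;                  /* all lanes enabled (32-wide here) */
uint32_t cond_mask = evaluate_condition(); /* one bit per lane */

if (cond_mask & exec_mask) {               /* any lane wants the "then" side? */
   uint32_t saved = exec_mask;
   exec_mask &= cond_mask;                 /* disable lanes where cond is false */
   run_then_side(exec_mask);
   exec_mask = saved;
}
if (~cond_mask & exec_mask) {              /* any lane wants the "else" side? */
   uint32_t saved = exec_mask;
   exec_mask &= ~cond_mask;                /* disable lanes where cond is true */
   run_else_side(exec_mask);
   exec_mask = saved;
}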

When you start nesting if statements and throw loops into the mix, things get even more complicated.  Loop continues have to disable all those channels until the next iteration of the loop, loop breaks have to disable all those channels until the loop is entirely complete, and the physical shader has to figure out when there are no channels left and complete the loop.  This makes for some fun and interesting challenges for GPU compiler developers.  Also, believe it or not, everything I just said is a massive over-simplification. :-)

The point which most graphics developers need to understand and what's important for this blog post is that the physical shader has to execute every path taken by any invocation in the subgroup.  For loops, this means that it has to execute the loop enough times for the worst case in the subgroup.  This means that if you have the same work in both the then and else sides of an if statement, that work may get executed twice rather than once and you may be better off pulling it outside the if.  It also means that if you have something particularly expensive and you put it inside an if statement, that doesn't mean that you only pay for it when needed, it means you pay for it whenever any invocation in the subgroup needs it.

Fun with statistics

At the end of the last section, I said that one of the problems with the SIMT model used by GPUs is that they end up having worst-case performance for the subgroup.  Every path through the shader which has to be executed for any invocation in the subgroup has to be taken by the shader as a whole.  The question that naturally arises is, "does a larger subgroup size make this worst-case behavior worse?"  Clearly, the naive answer is, "yes".  If you have a subgroup size of 1, you only execute exactly what's needed and if you have a subgroup size of 2 or more, you end up hitting this worst-case behavior.  If you go higher, the bad cases should be more likely, right?  Yes, but maybe not quite like you think.

This is one of those cases where statistics can be surprising.  Let's say you have an if statement with a boolean condition b.  That condition is actually a vector (b1, b2, b3, ..., bN) and if any two of those vector elements differ, we pay the cost of both paths.  Assuming that the conditions are independent identically distributed (IID) random variables, the probability of the entire vector being true is P(all(bi = true)) = P(b1 = true) * P(b2 = true) * ... * P(bN = true) = P(bi = true)^N where N is the size of the subgroup.  Therefore, the probability of having uniform control-flow is P(bi = true)^N + P(bi = false)^N.  The probability of non-uniform control-flow, on the other hand, is 1 - P(bi = true)^N - P(bi = false)^N.

Before we go further with the math, let's put some solid numbers on it.  Let's say we have a subgroup size of 8 (the smallest Intel can do) and let's say that our input data is a series of coin flips where bi is "flip i was heads".  Then P(bi = true) = P(bi = false) = 1/2.  Using the math in the previous paragraph, P(uniform) = P(bi = true)^8 + P(bi = false)^8 = 1/128.  This means that there is only a 1:128 chance that you'll get uniform control-flow and a 127:128 chance that you'll end up taking both paths of your if statement.  If we increase the subgroup size to 64 (the maximum among AMD, Intel, and Nvidia), you get a 1:2^63 chance of having uniform control-flow and a (2^63-1):2^63 chance of executing both halves.  If we assume that the shader takes T time units when control-flow is uniform and 2T time units when control-flow is non-uniform, then the amortized cost of the shader for a subgroup size of 8 is 1/128 * T + 127/128 * 2T = 255/128 T and, by a similar calculation, the cost of a shader with a subgroup size of 64 is (2^64 - 1)/2^63 T.  Both of those are within rounding error of 2T and the added cost of using the massively wider subgroup size is less than 1%.  Playing with the statistics a bit, the following chart shows the probability of divergence vs. the subgroup size for various choices of P(bi = true):

One thing to immediately notice is that because we're only concerned about the probability of divergence and not of the two halves of the if independently, the graph is symmetric (p=0.9 and p=0.1 are the same).  Second, and the point I was trying to make with all of the math above, is that until your probability gets pretty extreme (> 90%) the probability of divergence is reasonably high at any subgroup size.  From the perspective of a compiler with no knowledge of the input data, we have to assume every if condition is a 50/50 chance at which point we can basically assume it will always diverge.
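If you want to reproduce the numbers behind that chart, the math fits in a few lines of C (assuming, as above, independent invocations):

#include <math.h>
#include <stdio.h>

int
main(void)
{
   const double probs[] = { 0.5, 0.75, 0.9, 0.99 };
   for (unsigned i = 0; i < 4; i++) {
      for (unsigned N = 2; N <= 64; N *= 2) {
         double p = probs[i];
         /* P(divergent) = 1 - P(all true) - P(all false) */
         double p_uniform = pow(p, N) + pow(1.0 - p, N);
         printf("p=%.2f N=%2u P(divergent)=%.4f\n", p, N, 1.0 - p_uniform);
      }
   }
   return 0;
}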

Instead of only considering divergence, let's take a quick look at another case.  Let's say that you have a one-sided if statement (no else) that is expensive but rare.  To put numbers on it, let's say the probability of the if statement being taken is 1/16 for any given invocation.  Then P(taken) = P(any(bi = true)) = 1 - P(all(bi = false)) = 1 - P(bi = false)^N = 1 - (15/16)^N.  This works out to about 0.4 for a subgroup size of 8, 0.65 for 16, 0.87 for 32, and 0.98 for 64.  The following chart shows what happens if we play around with the probabilities of our if condition a bit more:

As we saw with the earlier divergence plot, even events with a fairly low probability (10%) are fairly likely to happen even with a subgroup size of 8 (57%) and are even more likely the higher the subgroup size goes.  Again, from the perspective of a compiler with no knowledge of the data trying to make heuristic decisions, it looks like "ifs always happen" is a reasonable assumption.  However, if we have something expensive like a texture instruction that we can easily move into an if statement, we may as well.  There are no guarantees, but if the probability of that if statement is low enough, we might be able to avoid it at least some of the time.
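For completeness, the one-sided case above is the same computation with the formula swapped (using pow() from <math.h> and, again, assuming independence):

/* probability that at least one invocation in an N-wide subgroup takes a
 * branch whose per-invocation probability is p */
static double
p_taken(double p, unsigned N)
{
   return 1.0 - pow(1.0 - p, N);   /* e.g. p = 1/16, N = 8  ->  ~0.40 */
}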

Statistical independence

A keen statistical eye may have caught a subtle statement I made very early on in the previous section:

 Assuming that the conditions are independent identically distributed (IID) random variables...

While less statistically minded readers may have glossed over this as meaningless math jargon, it's actually a very important assumption.  Let's take a minute to break it down.  A random variable in statistics is just an event.  In our case, it's something like "the if condition was true".  To say that a set of random variables is identically distributed means that they have the same underlying probabilities.  Two coin tosses, for instance, are identically distributed while the distribution of "coin came up heads" and "die came up 6" are very different.  When combining random variables, we have to be careful to ensure that we're not mixing apples and oranges.  All of the analysis above was looking at the evaluation of a boolean in the same if condition but across different subgroup invocations.  These should be identically distributed.

The remaining word that's of critical importance in the IID assumption is "independent".  Two random variables are said to be independent if they have no effect on one another or, to be more precise, knowing the value of one tells you nothing whatsoever about the value of the other.  Random variables which are not independent are said to be "correlated".  One example of random variables which are very much not independent would be housing prices in a neighborhood, because the first thing home appraisers look at to determine the value of a house is the value of other houses in the same area that have sold recently.  In my computations above, I used the rule that P(X and Y) = P(X) * P(Y) but this only holds if X and Y are independent random variables.  If they're dependent, the statistics look very different.  This raises an obvious question:  Are if conditions statistically independent across a subgroup?  The short answer is "no".

How does this correlation and lack of independence (those are the same) affect the statistics?  If two events X and Y are negatively correlated then P(X and Y) < P(X) * P(Y) and if two events are positively correlated then P(X and Y) > P(X) * P(Y).  When it comes to if conditions across a subgroup, most correlations that matter are positive.  Going back to our statistics calculations, the probability of if condition diverging is 1 - P(all(bi = true)) - P(all(bi = false)) and P(all(bi = true)) = P(b1 = true and b2 = true and... bN = true).  So, if the data is positively correlated, we get P(all(bi = true)) > P(bi = true)^N and P(divergent) = 1 - P(all(bi = true)) - P(all(bi = false)) < 1 - P(bi = true)^N - P(bi = false)^N.  So correlation for us typically reduces the probability of divergence.  This is a good thing because divergence is expensive.  How much does it reduce the probability of divergence?  That's hard to tell without deep knowledge of the data but there are a few easy cases to analyze.

One particular example of dependence that comes up all the time is uniform values.  Many values passed into a shader are the same for all invocations within a draw call or for all pixels within a group of primitives.  Sometimes the compiler is privy to this information (if it comes from a uniform or constant buffer, for instance) but often it isn't.  It's fairly common for apps to pass some bit of data as a vertex attribute which, even though it's specified per-vertex, is actually the same for all of them.  If a bit of data is uniform (even if the compiler doesn't know it is), then any if conditions based on that data (or on a calculation using entirely uniform values) will be the same.  From a statistics perspective, this means that P(all(bi = true)) + P(all(bi = false)) = 1 and P(divergent) = 0.  From a shader execution perspective, this means that it will never diverge no matter the probability of the condition, because our entire wave will evaluate the same value.

What about non-uniform values such as vertex positions, texture coordinates, and computed values?  In your average vertex, geometry, or tessellation shader, these are likely to be effectively independent.  Yes, there are patterns in the data such as common edges and some triangles being closer to others.  However, there is typically a lot of vertex data and the way that vertices get mapped to subgroups is random enough that these correlations between vertices aren't likely to show up in any meaningful way.  (I don't have a mathematical proof for this off-hand.)  When they're independent, all the statistics we did in the previous section apply directly.

With pixel/fragment shaders, on the other hand, things get more interesting.  Most GPUs rasterize pixels in groups of 2x2 pixels where each 2x2 pixel group comes from the same primitive.  Each subgroup is made up of a series of these 2x2 pixel groups so, if the subgroup size is 16, it's actually 4 groups of 2x2 pixels each.  Within a given 2x2 pixel group, the chances of a given value within the shader being the same for each pixel in that 2x2 group is quite high.  If we have a condition which is the same within each 2x2 pixel group then, from the perspective of divergence analysis, the subgroup size is effectively divided by 4.  As you can see in the earlier charts (for which I conveniently provided small subgroup sizes), the difference between a subgroup size of 2 and 4 is typically much larger than between 8 and 16.

Another common source of correlation in fragment shader data comes from the primitives themselves.  Even if values differ between triangles, they are often the same or very tightly correlated between pixels within the same triangle.  This is sort of a super-set of the 2x2 pixel group issue we just covered.  This is important because this is a type of correlation that hardware has the ability to encourage.  For instance, hardware can choose to dispatch subgroups such that each subgroup only contains pixels from the same primitive.  Even if the hardware typically mixes primitives within the same subgroup, it can attempt to group things together to increase data correlation and reduce divergence.

Why bother with subgroups?

All this discussion of control-flow divergence might leave you wondering why we bother with subgroups at all.  Clearly, they're a pain.  They definitely are.  Oh, you have no idea...

But they also bring some significant advantages in that the parallelism allows us to get better throughput out of the hardware.  One obvious way this helps is that we can spend less hardware on instruction decoding (we only have to decode once for the whole wave) and put those gates into more floating-point arithmetic units.  Also, most processors are pipelined and, while they can start processing a new instruction each cycle, it takes several cycles before an instruction makes its way from the start of the pipeline to the end and its result can be used in a subsequent instruction.  If you have a lot of back-to-back dependent calculations in the shader, you can end up with lots of stalls where an instruction goes into the pipeline, the next instruction depends on its value, and so you have to wait 10ish cycles for the previous instruction to complete.  On Intel, each SIMD32 instruction is actually four SIMD8 instructions that pipeline very nicely and so it's easier to keep the ALU busy.

Ok, so wider subgroups are good, right?  Go as wide as you can!  Well, yes and no.  Generally, there's a point of diminishing returns.  Is one instruction decoder per 32 invocations of ALU really that much more hardware than one per 64 invocations?  Probably not.  Generally, the subgroup size is determined based on what's required to keep the underlying floating-point arithmetic hardware full.  If you have 4 ALUs per execution unit and a pipeline depth of 10 cycles, then an 8-wide subgroup is going to have trouble keeping the ALU full.  A 32-wide subgroup, on the other hand, will keep it 80% full even with back-to-back dependent instructions so going 64-wide is pointless.
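To make the arithmetic behind that 80% figure explicit, here's a back-of-the-envelope sketch of my own (not anything from the hardware docs), using the example numbers above:

/* rough ALU occupancy with nothing but back-to-back dependent instructions:
 * each subgroup-wide instruction keeps the ALUs busy for
 * subgroup_size / alus_per_eu cycles out of every pipeline_depth cycles */
static float
alu_occupancy(unsigned subgroup_size, unsigned alus_per_eu, unsigned pipeline_depth)
{
   float busy_cycles = (float)subgroup_size / alus_per_eu;
   float occupancy = busy_cycles / pipeline_depth;
   return occupancy > 1.0f ? 1.0f : occupancy;
}

/* alu_occupancy(8, 4, 10)  -> 0.2 (20% busy: mostly stalls)
 * alu_occupancy(32, 4, 10) -> 0.8 (80% busy, the figure quoted above) */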

On Intel GPU hardware, there are additional considerations.  While most GPUs have a fixed subgroup size, ours is configurable and the subgroup size is chosen by the compiler.  What's less flexible for us is our register file.  We have a fixed register file size of 4KB regardless of the subgroup size so, depending on how many temporary values your shader uses, it may be difficult to compile it 16 or 32-wide and still fit everything in registers.  While wider programs generally yield better parallelism, the additional register pressure can easily negate any parallelism benefits.

There are also other issues such as cache utilization and thrashing but those are way out of scope for this blog post...

What does this all mean?

This topic came up this week in the context of tuning our subgroup size heuristic in the Intel Linux 3D drivers.  In particular, how should that heuristic reason about control-flow and divergence?  Are wider programs more expensive because they have the potential to diverge more?

After all the analysis above, the conclusion I've come to is that any given if condition falls roughly into one of three categories:

  1. Effectively uniform.  It never (or very rarely ever) diverges.  In this case, there is no difference between subgroup sizes because it never diverges.
  2. Random.  Since we have no knowledge about the data in the compiler, we have to assume that random if conditions are basically a coin flip every time.  Even with our smallest subgroup size of 8, this means it's going to diverge with a probability of 99.6%.  Even if you assume 2x2 subspans in fragment shaders are strongly correlated, divergence is still likely with a probability of 75% for SIMD8 shaders, 94% for SIMD16, and 99.6% for SIMD32.
  3. Random but very one-sided.  These conditions are the type where we can actually get serious statistical differences between the different subgroup sizes.  Unfortunately, we have no way of knowing when an if condition will be in this category so it's impossible to make heuristic decisions based on it.

Where does that leave our heuristic?  The only interesting case in the above three is random data in fragment shaders.  In our experience, the increased parallelism going from SIMD8 to SIMD16 is huge so it probably makes up for the increased divergence.  The parallelism increase from SIMD16 to SIMD32 isn't huge but the change in the probability of a random if diverging is pretty small (94% vs. 99.6%) so, all other things being equal, it's probably better to go SIMD32.

October 12, 2020

Jumping Right In

When last I left off, I’d cleared out 2/3 of my checklist for improving update_sampler_descriptors() performance:

handle_image_descriptor() was next on my list. As a result, I immediately dove right into an entirely different part of the flamegraph since I’d just been struck by a seemingly-obvious idea. Here’s the last graph:

frame_retrieval.png

Here’s the next step:

reuse_barriers.png

What changed?

Well, as I’m now caching descriptor sets across descriptor pools, it occurred to me that, assuming my descriptor state hashing mechanism is accurate, all the resources used in a given set must be identical. This means that all resources for a given type (e.g., UBO, SSBO, sampler, image) must be completely identical to previous uses across all shader stages. Extrapolating further, this also means that the way in which these resources are used must also be identical, which means the pipeline barriers for access and image layouts must also be identical.

Which means they can be stored onto the struct zink_descriptor_set object and reused instead of being accumulated every time. This reuse completely eliminates add_transition() from using any CPU time (it’s the left-most block above update_sampler_descriptors() in the first graph), and it thus massively reduces overall time for descriptor updates.

This marks a notable landmark, as it’s the point at which update_descriptors() begins to use only ~50% of the total CPU time consumed in zink_draw_vbo(), with the other half going to the draw command where it should be.

handle_image_descriptor()

At last my optimization-minded workflow returned to this function, and looking at the flamegraph again yielded the culprit. It’s not visible due to this being a screenshot, but the whole of the perf hog here was obvious, so let’s check out the function itself since it’s small. I think I explored part of this at one point in the distant past, possibly for ARB_texture_buffer_object, but refactoring has changed things up a bit:

static void
handle_image_descriptor(struct zink_screen *screen, struct zink_resource *res, enum zink_descriptor_type type, VkDescriptorType vktype, VkWriteDescriptorSet *wd,
                        VkImageLayout layout, unsigned *num_image_info, VkDescriptorImageInfo *image_info, struct zink_sampler_state *sampler,
                        VkBufferView *null_view, VkImageView imageview, bool do_set)

First, yes, there are a lot of parameters. Among them is VkBufferView *null_view, which is a pointer to a stack array that’s initialized as containing VK_NULL_HANDLE. As texel buffer descriptors must be written with a pointer to an array of VkBufferView handles, it’s important that the stack variable used doesn’t go out of scope, so it has to be passed in like this or else this functionality can’t be broken out in this way.

{
    if (!res) {
        /* if we're hitting this assert often, we can probably just throw a junk buffer in since
         * the results of this codepath are undefined in ARB_texture_buffer_object spec
         */
        assert(screen->info.rb2_feats.nullDescriptor);
        
        switch (vktype) {
        case VK_DESCRIPTOR_TYPE_UNIFORM_TEXEL_BUFFER:
        case VK_DESCRIPTOR_TYPE_STORAGE_TEXEL_BUFFER:
           wd->pTexelBufferView = null_view;
           break;
        case VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER:
        case VK_DESCRIPTOR_TYPE_STORAGE_IMAGE:
           image_info->imageLayout = VK_IMAGE_LAYOUT_UNDEFINED;
           image_info->imageView = VK_NULL_HANDLE;
           if (sampler)
              image_info->sampler = sampler->sampler[0];
           if (do_set)
              wd->pImageInfo = image_info;
           ++(*num_image_info);
           break;
        default:
           unreachable("unknown descriptor type");
        }

This is just handling for null shader inputs, which is permitted by various GL specs.

     } else if (res->base.target != PIPE_BUFFER) {
        assert(layout != VK_IMAGE_LAYOUT_UNDEFINED);
        image_info->imageLayout = layout;
        image_info->imageView = imageview;
        if (sampler) {
           VkFormatProperties props;
           vkGetPhysicalDeviceFormatProperties(screen->pdev, res->format, &props);

This vkGetPhysicalDeviceFormatProperties call is actually the entire cause of handle_image_descriptor() using any CPU time, at least on ANV. The lookup for the format is a significant bottleneck here, so it has to be removed.

           if ((res->optimial_tiling && props.optimalTilingFeatures & VK_FORMAT_FEATURE_SAMPLED_IMAGE_FILTER_LINEAR_BIT) ||
               (!res->optimial_tiling && props.linearTilingFeatures & VK_FORMAT_FEATURE_SAMPLED_IMAGE_FILTER_LINEAR_BIT))
              image_info->sampler = sampler->sampler[0];
           else
              image_info->sampler = sampler->sampler[1] ?: sampler->sampler[0];
        }
        if (do_set)
           wd->pImageInfo = image_info;
        ++(*num_image_info);
     }
}

Just for completeness, the remainder of this function is checking whether the device’s format features support the requested type of filtering (if linear), and then zink will fall back to nearest in other cases. Following this, do_set is true only for the base member of an image/sampler array of resources, and so this is the one that gets added into the descriptor set.

But now I’m again returning to vkGetPhysicalDeviceFormatProperties. Since this is using CPU, it needs to get out of the hotpath here in descriptor updating, but it does still need to be called. As such, I’ve thrown more of this onto the zink_screen object:

static void
populate_format_props(struct zink_screen *screen)
{
   for (unsigned i = 0; i < PIPE_FORMAT_COUNT; i++) {
      VkFormat format = zink_get_format(screen, i);
      if (!format)
         continue;
      vkGetPhysicalDeviceFormatProperties(screen->pdev, format, &screen->format_props[i]);
   }
}

Indeed, now instead of performing the fetch on every descriptor update, I’m just grabbing all the properties on driver init and then using the cached values throughout. Let’s see how this looks.
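The replacement lookup in the hot path then becomes a plain array read; roughly something like this (a sketch, assuming res->base.format holds the resource’s gallium format enum):

/* read the properties cached at screen init instead of querying the
 * physical device on every descriptor update */
VkFormatProperties props = screen->format_props[res->base.format];
/* ...followed by the same optimalTilingFeatures/linearTilingFeatures check
 * as before to pick between linear and nearest samplers */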

props.png

update_descriptors() is now using visibly less time than the draw command, though not by a huge amount. I’m also now up to an unstable 33fps.

Some Cleanups

At this point, it bears mentioning that I wasn’t entirely satisfied with the amount of CPU consumed by descriptor state hashing, so I ended up doing some pre-hashing here for samplers, as they have the largest state. At the beginning, the hashing looked like this:

In this case, each sampler descriptor hash was the size of a VkDescriptorImageInfo, which is two 64-bit values and a 32-bit value, or 20 bytes of hashing per sampler. That ends up being a lot of hashing, and it also ends up being a lot of repeatedly hashing the same values.

Instead, I changed things around to do some pre-hashing:

In this way, I could have a single 32-bit value representing the sampler view that persisted for its lifetime, and a pair of 32-bit values for the sampler (since I still need to potentially toggle between linear and nearest filtering) that I can select between. This ends up being 8 bytes to hash, which is over 50% less. It’s not a huge change in the flamegraph, but it’s possibly an interesting factoid. Also, as the layouts will always be the same for these descriptors, that member can safely be omitted from the original sampler_view hash.
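The patch itself isn’t reproduced here, but the shape of the idea is roughly the following sketch (the struct names, the hash_sampler_descriptor() helper, and the direct XXH32 usage are my own illustration, not the actual zink code):

#include <stdint.h>
#include "util/xxhash.h"   /* mesa's bundled xxhash */

/* the view keeps one persistent 32-bit hash and the sampler keeps one per
 * filter mode, all computed once at object creation time */
struct view_prehash    { uint32_t hash; };
struct sampler_prehash { uint32_t hash[2]; /* [0] = linear, [1] = nearest */ };

static uint32_t
hash_sampler_descriptor(const struct view_prehash *view,
                        const struct sampler_prehash *sampler,
                        unsigned filter_idx, uint32_t seed)
{
   /* 8 bytes to hash instead of a 20-byte VkDescriptorImageInfo */
   uint32_t state[2] = { view->hash, sampler->hash[filter_idx] };
   return XXH32(state, sizeof(state), seed);
}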

Set Usage

Another vaguely interesting tidbit is profiling zink’s usage of mesa’s set implementation, which is used to ensure that various objects are only added to a given batch a single time. Historically the pattern for use in zink has been something like:

if (!_mesa_set_search(set, value)) {
   do_something();
   _mesa_set_add(set, value);
}

This is not ideal, as it ends up performing two lookups in the set for cases where the value isn’t already present. A much better practice is:

bool found = false;
_mesa_set_search_and_add(set, value, &found);
if (!found)
    do_something();

In this way, the lookup is only done once, which ends up being huge for large sets.

For My Final Trick

I was now holding steady at 33fps, but there was a tiny bit more performance to squeeze out of descriptor updating when I began to analyze how much looping was being done. This is a general overview of all the loops in update_descriptors() for each type of descriptor at the time of my review:

  • loop for all shader stages
    • loop for all bindings in the shader
      • in sampler and image bindings, loop for all resources in the binding
  • loop for all resources in descriptor set
  • loop for all barriers to be applied in descriptor set

This was a lot of looping, and it was especially egregious in the final component of my refactored update_descriptors():

static bool
write_descriptors(struct zink_context *ctx, struct zink_descriptor_set *zds, unsigned num_wds, VkWriteDescriptorSet *wds,
                 unsigned num_resources, struct zink_descriptor_resource *resources, struct set *persistent,
                 bool is_compute, bool cache_hit)
{
   bool need_flush = false;
   struct zink_batch *batch = is_compute ? &ctx->compute_batch : zink_curr_batch(ctx);
   struct zink_screen *screen = zink_screen(ctx->base.screen);
   assert(zds->desc_set);
   unsigned check_flush_id = is_compute ? 0 : ZINK_COMPUTE_BATCH_ID;
   for (int i = 0; i < num_resources; ++i) {
      assert(num_resources <= zds->pool->num_resources);

      struct zink_resource *res = resources[i].res;
      if (res) {
         need_flush |= zink_batch_reference_resource_rw(batch, res, resources[i].write) == check_flush_id;
         if (res->persistent_maps)
            _mesa_set_add(persistent, res);
      }
      /* if we got a cache hit, we have to verify that the cached set is still valid;
       * we store the vk resource to the set here to avoid a more complex and costly mechanism of maintaining a
       * hash table on every resource with the associated descriptor sets that then needs to be iterated through
       * whenever a resource is destroyed
       */
      assert(!cache_hit || zds->resources[i] == res);
      if (!cache_hit)
         zink_resource_desc_set_add(res, zds, i);
   }
   if (!cache_hit && num_wds)
      vkUpdateDescriptorSets(screen->dev, num_wds, wds, 0, NULL);
   for (int i = 0; zds->pool->num_descriptors && i < util_dynarray_num_elements(&zds->barriers, struct zink_descriptor_barrier); ++i) {
      struct zink_descriptor_barrier *barrier = util_dynarray_element(&zds->barriers, struct zink_descriptor_barrier, i);
      zink_resource_barrier(ctx, NULL, barrier->res,
                            barrier->layout, barrier->access, barrier->stage);
   }

   return need_flush;
}

This function iterates over all the resources in a descriptor set, tagging them for batch usage and persistent mapping, adding references for the descriptor set to the resource as I previously delved into. Then it iterates over the barriers and applies them.

But why was I iterating over all the resources and then over all the barriers when every resource will always have a barrier for the descriptor set, even if it ends up getting filtered out based on previous usage?

It just doesn’t make sense.

So I refactored this a bit, and now there’s only one loop:

static bool
write_descriptors(struct zink_context *ctx, struct zink_descriptor_set *zds, unsigned num_wds, VkWriteDescriptorSet *wds,
                 struct set *persistent, bool is_compute, bool cache_hit)
{
   bool need_flush = false;
   struct zink_batch *batch = is_compute ? &ctx->compute_batch : zink_curr_batch(ctx);
   struct zink_screen *screen = zink_screen(ctx->base.screen);
   assert(zds->desc_set);
   unsigned check_flush_id = is_compute ? 0 : ZINK_COMPUTE_BATCH_ID;

   if (!cache_hit && num_wds)
      vkUpdateDescriptorSets(screen->dev, num_wds, wds, 0, NULL);
   for (int i = 0; zds->pool->num_descriptors && i < util_dynarray_num_elements(&zds->barriers, struct zink_descriptor_barrier); ++i) {
      struct zink_descriptor_barrier *barrier = util_dynarray_element(&zds->barriers, struct zink_descriptor_barrier, i);
      if (barrier->res->persistent_maps)
         _mesa_set_add(persistent, barrier->res);
      need_flush |= zink_batch_reference_resource_rw(batch, barrier->res, zink_resource_access_is_write(barrier->access)) == check_flush_id;
      zink_resource_barrier(ctx, NULL, barrier->res,
                            barrier->layout, barrier->access, barrier->stage);
   }

   return need_flush;
}

This actually has the side benefit of reducing the required looping even further, as barriers get merged based on access and stages, meaning that though there may be N resources used by a given set across M shader stages, it’s possible that the looping here might be reduced to only N rather than N * M since all the barriers might be consolidated.

In Closing

Let’s check all the changes out in the flamegraph:

final.png

This last bit has shaved off another big chunk of CPU usage overall, bringing update_descriptors() from 11.4% to 9.32%. Descriptor state updating is down from 0.718% to 0.601% from the pre-hashing as well, though this wasn’t exactly a huge target to hit.

Just for nostalgia, here’s the starting point from just after I’d split the descriptor types into different sets and we all thought 27fps with descriptor set caching was a lot:

split.png

But now zink is up another 25% performance to a steady 34fps:

heaven.png

And I did it in only four blog posts.

For anyone interested, I’ve also put up a branch corresponding to the final flamegraph along with the perf data, which I’ve been using hotspot to view.

But Then, The Future

Looking forward, there’s still some easy work that can be done here.

For starters, I’d probably improve descriptor states a little such that I also had a flag anytime the batch cycled. This would enable me to add batch-tracking for resources/samplers/sampler_views more reliably when it was actually necessary vs. trying to add it every time, which ends up being a significant perf hit from all the lookups. I imagine that’d make a huge part of the remaining update_descriptors() usage disappear.
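A minimal sketch of the idea, with entirely hypothetical field names (last_batch_id and batch_id don’t exist in the code above), might look like:

/* only pay for the set lookup once per batch cycle by stamping the resource
 * with the id of the last batch that referenced it */
if (res->last_batch_id != batch->batch_id) {
   zink_batch_reference_resource_rw(batch, res, write);
   res->last_batch_id = batch->batch_id;
}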

future-lookups.png

There’s also, as ever, the pipeline hashing, which can further be reduced by adding more dynamic state handling, which would remove more values from the pipeline state and thus reduce the amount of hashing required.

future-pipeline.png

I’d probably investigate doing some resource caching to keep a bucket of destroyed resources around for faster reuse since there’s a fair amount of that going on.

future-resource.png

Ultimately though, the CPU time in use by zink is unlikely to see any other huge decreases (unless I’m missing something especially obvious or clever) without more major architectural changes, which will end up being a bigger project that takes more than just a week of blog posting to finish. As such, I’ve once again turned my sights to unit test pass rates and related issues, since there’s still a lot of work to be done there.

I’ve fixed another 500ish piglit tests over the past few days, bringing zink up past 92% pass rate, and I’m hopeful I can get that number up even higher in the near future.

Stay tuned for more updates on all things zink and Mike.

October 09, 2020

Moar Descriptors

I talked about a lot of boring optimization stuff yesterday, exploring various ideas which, while they will eventually end up improving performance, didn’t yield immediate results.

Now it’s just about time to start getting to the payoff.

pool_reuse.png

Here’s a flamegraph of the starting point. Since yesterday’s progress of improving the descriptor cache a bit and adding context-based descriptor states to reduce hashing, I’ve now implemented an object to hold the VkDescriptorPool, enabling the pools themselves to be shared across programs for reuse, which eliminates a considerable amount of duplicated memory. The first scene of the heaven benchmark creates a whopping 91 zink_gfx_program structs, each of which previously had its own UBO descriptor pool and sampler descriptor pool, for 182 descriptor pools in total, each with 5000 descriptors in it. With this mechanism, that’s cut down to 9 descriptor pools in total which are shared across all the programs. Without even changing that maximum descriptor limit, I’m already up by another frame to 28fps, even if the flamegraph doesn’t look too different.

Moar Caches

I took a quick detour next over to the pipeline cache (that large-ish block directly to the right of update_descriptors in the flamegraph), which stores all the VkPipeline objects that get created during startup. Pipeline creation is extremely costly, so it’s crucial that it be avoided during runtime. Happily, the current caching infrastructure in zink is sufficient to meet that standard, and there are no pipelines created while the scene is playing out.

But I thought to myself: what about VkPipelineCache for startup time improvement while I’m here?

I had high hopes for this, and it was quick and easy to add in, but ultimately even with the cache implemented and working, I saw no benefit in any part of the benchmark.

That was fine, since what I was really shooting for was a result more like this:

pipeline_hash.png

The hierarchy of the previously-expensive pipeline hash usage has completely collapsed now, and it’s basically nonexistent. This was achieved through a series of five patches which:

  • moved the tessellation levels for TCS output out of the pipeline state since these have no relation and I don’t know why I put them there to begin with
  • used the bitmasks for vertex divisors and buffers to much more selectively hash the large (32) VkVertexInputBindingDivisorDescriptionEXT and VkVertexInputBindingDescription arrays in the pipeline state instead of always hashing the full array (see the sketch just after this list)
  • also only hashed the vertex buffer state if we don’t have VK_EXT_extended_dynamic_state support, which lets that be removed from the pipeline creation altogether
  • for debug builds, which are the only builds I run, I changed the pipeline hash tables over to use the pre-hashed values directly since mesa has a pesky assert() that rehashes on every lookup, so this more accurately reflects release build performance
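As a rough idea of what the bitmask-based selective hashing looks like, here’s a simplified sketch of my own (not the actual patch): only the array elements whose bits are set in the bound-buffer mask get fed to the hash.

#include <stdint.h>
#include <vulkan/vulkan.h>
#include "util/bitscan.h"   /* u_bit_scan() */
#include "util/xxhash.h"

/* hash only the vertex-buffer bindings that are actually bound instead of
 * always hashing the full 32-entry array */
static uint32_t
hash_vertex_bindings(const VkVertexInputBindingDescription *bindings,
                     unsigned bound_mask, uint32_t hash)
{
   while (bound_mask) {
      unsigned i = u_bit_scan(&bound_mask);   /* pops the lowest set bit */
      hash = XXH32(&bindings[i], sizeof(bindings[i]), hash);
   }
   return hash;
}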

And I’m now up to a solid 29fps.

Over To update_sampler_descriptors()

I left off yesterday with a list of targets to hit in this function, from left to right in the most recent flamegraph:

Looking higher up the chain for the add_transition() usage, it turns out that a huge chunk of this was actually the hash table rehashing itself every time it resized when new members were added (mesa hash tables start out with a very small maximum number of entries and then increase by a power of 2 every time). Since I always know ahead of time the maximum number of entries I’ll have in a given descriptor set, I put up a MR to let me pre-size the table, preventing any of this nonsense from taking up CPU time. The results were good:

rehash.png

The entire add_transition hierarchy collapsed a bit, but there’s more to come. I immediately became distracted when I came to the realization that I’d actually misplaced a frame at some point and set to hunting it down.

Vulkan Experts

Anyone who said to themselves “you’re binding your descriptors before you’re emitting your pipeline barriers, thus starting and stopping your render passes repeatedly during each draw” as the answer to yesterday’s question about what I broke during refactoring was totally right, so bonus points to everyone out there who nailed it. This can actually be seen in the flamegraph as the tall stack above update_ubo_descriptors(), which is the block to the right of update_sampler_descriptors().

Oops.

frame_retrieval.png

Now the stack is a little to the right where it belongs, and I was now just barely touching 30fps, which is up about 10% from the start of today’s post.

Stay tuned next week, when I pull another 10% performance out of my magic hat and also fix RADV corruption.

October 08, 2020

Descriptors Once More

It’s just that kind of week.

When I left off in my last post, I’d just implemented a two-tiered cache system for managing descriptor sets which was objectively worse in performance than not doing any caching at all.

Cool.

Next, I did some analysis of actual descriptor usage, and it turned out that the UBO churn was massive, while sampler descriptors were only changed occasionally. This is due to a mechanism in mesa involving a NIR pass which rewrites uniform data passed to the OpenGL context as UBO loads in the shader, compacting the data into a single buffer and allowing it to be more efficiently passed to the GPU. There’s a utility component u_upload_mgr for gallium-based drivers which allocates a large (~100k) buffer and then maps/writes to it at offsets to avoid needing to create a new buffer for this every time the uniform data changes.

The downside of u_upload_mgr, for the current state of zink, is that it means the hashed descriptor states are different for almost every single draw because while the UBO doesn’t actually change, the offset does, and this is necessarily part of the descriptor hash since zink is exclusively using VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER descriptors.

I needed to increase the efficiency of the cache to make it worthwhile, so I decided first to reduce the impact of a changed descriptor state hash during pre-draw updating. This way, even if the UBO descriptor state hash was changing every time, maybe it wouldn’t be so impactful.

Deep Cuts

What this amounted to was to split the giant, all-encompassing descriptor set, which included UBOs, samplers, SSBOs, and shader images, into separate sets such that each type of descriptor would be isolated from changes in the other descriptors between draws.

Thus, I now had four distinct descriptor pools for each program, and I was producing and binding up to four descriptor sets for every draw. I also changed the shader compiler a bit to always bind shader resources to the newly-split sets as well as created some dummy descriptor sets since it’s illegal to bind sets to a command buffer with non-sequential indices, but it was mostly easy work. It seemed like a great plan, or at least one that had a lot of potential for more optimizations based on it. As far as direct performance increases from the split, UBO descriptors would be constantly changing, but maybe…

Well, the patch is pretty big (8 files changed, 766 insertions(+), 450 deletions(-)), but in the end, I was still stuck around 23 fps.

Dynamic UBOs

With this work done, I decided to switch things up a bit and explore using VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC descriptors for the mesa-produced UBOs, as this would reduce both the hashing required to calculate the UBO descriptor state (offset no longer needs to be factored in, which is one fewer uint32_t to hash) as well as cache misses due to changing offsets.

Due to potential driver limitations, only these mesa-produced UBOs are dynamic now, otherwise zink might exceed the maxDescriptorSetUniformBuffersDynamic limit, but this was still more than enough.
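For reference, the reason the offset can drop out of the descriptor state entirely is that dynamic UBO descriptors take their offsets at bind time rather than at write time. Roughly (a sketch with approximate field names, not the actual zink code):

/* the descriptor itself only records the buffer; the ever-changing
 * u_upload_mgr offset is supplied when the set is bound */
uint32_t dynamic_offsets[] = { ubo_offset };
vkCmdBindDescriptorSets(batch->cmdbuf, VK_PIPELINE_BIND_POINT_GRAPHICS,
                        pg->layout, 0, 1, &zds->desc_set,
                        1, dynamic_offsets);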

Boom.

I was now at 27fps, the same as raw bucket allocation.

Hash Performance

I’m not going to talk about hash performance. Mesa uses xxhash internally, and I’m just using the tools that are available.

What I am going to talk about, however, is the amount of hashing and lookups that I was doing.

Let’s take a look at some flamegraphs for the scene that I was showing in my fps screenshots.

split.png

This is, among other things, a view of the juggernaut update_descriptors() function that I linked earlier in the week. At the time of splitting the descriptor sets, it’s over 50% of the driver’s pipe_context::draw_vbo hook, which is decidedly not great.

So I optimized harder.

The leftmost block just above update_descriptors is hashing that’s done to update the descriptor state for cache management. There wasn’t much point in recalculating the descriptor state hash on every draw since there are plenty of draws where the states remain unchanged. To try and improve this, I moved to a context-based descriptor state tracker, where each pipe_context hook to change active descriptors would invalidate the corresponding descriptor state, and then update_descriptors() could just scan through all the states to see which ones needed to be recalculated.

states.png

The largest block above update_descriptors() on the right side is the new calculator function for descriptor states. It’s actually a bit more intensive than the old method, but again, this is much more easily optimized than the previous hashing which was scattered throughout a giant 400 line function.

Next, while I was in the area, I added even faster access to reusing descriptor sets. This is as simple as an array of sets on the program struct that can have their hash values directly compared to the current descriptor state that’s needed, avoiding lookups through potentially huge hash tables.
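Sketched with hypothetical names (pg->last_set and the hash members are just for illustration), the fast path amounts to something like:

/* check the program's most recently used set for this descriptor type
 * before ever touching the hash table */
struct zink_descriptor_set *last = pg->last_set[type];
if (last && last->hash == state_hash)
   return last;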

last_set.png

Not much to see here since this isn’t really where any of the performance bottleneck was occurring.

ETOOMUCHHASH

Let’s skip ahead a bit. I finally refactored update_descriptors() into smaller functions to update and bind descriptor sets in a loop prior to applying barriers and issuing the eventual draw command, shaking up the graph quite a bit:

split_update_descriptors.png

Clearly, updating the sampler descriptors (update_sampler_descriptors()) is taking a huge amount of time. The three large blocks above it are:

Each of these three had clear ways they could be optimized, and I’m going to speed through that and more in my next post.

But now, a challenge to all the Vulkan experts reading along. In this last section, I’ve briefly covered some refactoring work for descriptor updates.

What is the significant performance regression that I’ve introduced in the course of this refactoring?

It’s possible to determine the solution to my question without reading through any of the linked code, but you have the code available to you nonetheless.

Until next time.

October 07, 2020

I Skipped Bucket Day

I’m back, and I’m about to get even deeper into zink’s descriptor management. I figured everyone including me is well acquainted with bucket allocating, so I skipped that day and we can all just imagine what that post would’ve been like instead.

Let’s talk about caching and descriptor sets.

I talked about it before, I know, so here’s a brief reminder of where I left off:

Just a very normal cache mechanism. The problem, as I said previously, was that there was just way too much hashing going on, and so the performance ended up being worse than a dumb bucket allocator.

Not ideal.

But I kept turning the idea over in the back of my mind, and then I realized that part of the problem was in the upper-right block named move invalidated sets to invalidated set array. It ended up being the case that my resource tracking for descriptor sets was far too invasive; I had separate hash tables on every resource to track every set that a resource was attached to at all times, and I was basically spending all my time modifying those hash tables, not even the actual descriptor set caching.

So then I thought: well, what if I just don’t track it that closely?

Indeed, this simplifies things a bit at the conceptual level, since now I can avoid doing any sort of hashing related to resources, though this does end up making my second-level descriptor set cache less effective. But I’m getting ahead of myself in this post, so it’s time to jump into some code.

Resource Tracking 2.0

Instead of doing really precise tracking, it’s important to recall a few key points about how the descriptor sets are managed:

  • they’re bucket allocated
  • they’re allocated using the struct zink_program ralloc context
  • they’re only ever destroyed when the program is
  • zink is single-threaded

Thus, I brought some pointer hacks to bear:

void
zink_resource_desc_set_add(struct zink_resource *res, struct zink_descriptor_set *zds, unsigned idx)
{
   zds->resources[idx] = res;
   util_dynarray_append(&res->desc_set_refs, struct zink_resource**, &zds->resources[idx]);
}

This function associates a resource with a given descriptor set at the specified index (based on pipeline state). And then it pushes a reference to that pointer from the descriptor set’s C-array of resources into an array on the resource.

Later, during resource destruction, I can then walk the array of pointers like this:

util_dynarray_foreach(&res->desc_set_refs, struct zink_resource **, ref) {
   if (**ref == res)
      **ref = NULL;
}

If the reference I pushed earlier is still pointing to this resource, I can unset the pointer, and this will get picked up during future descriptor updates to flag the set as not-cached, requiring that it be updated. Since a resource won’t ever be destroyed while a set is in use, this is also safe for the associated descriptor set’s lifetime.

And since there’s no hashing or tree traversals involved, this is incredibly fast.

Second-level Caching

At this point, I’d created two categories for descriptor sets: active sets, which were the ones in use in a command buffer, and inactive sets, which were the ones that weren’t currently in use, with sets being pushed into the inactive category once they were no longer used by any command buffers. This ended up being a bit of a waste, however, as I had lots of inactive sets that were still valid but unreachable since I was using an array for storing these as well as the newly-bucket-allocated sets.

Thus, a second-level cache, AKA the B cache, which would store not-used sets that had at one point been valid. I’m still not doing any sort of checking of sets which may have been invalidated by resource destruction, so the B cache isn’t quite as useful as it could be. Also:

  • the check program cache for matching set has now been expanded to two lookups in case a matching set isn’t active but is still configured and valid in the B cache
  • the check program for unused set block in the above diagram will now cannibalize a valid inactive set from the B cache rather than allocate a new set

The last of these items is a bit annoying, but ultimately the B cache can end up having hundreds of members at various points, and iterating through it to try and find a set that’s been invalidated ends up being impractical just based on the random distribution of sets across the table. Also, I only set the resource-based invalidation up to null the resource pointer, so finding an invalid set would mean walking through the resource array of each set in the cache. Thus, a quick iteration through a few items to see if the set-finder gets lucky, otherwise it’s clobbering time.

And this brought me up to about 24fps, which was still down a bit from the mind-blowing 27-28fps I was getting with just the bucket allocator, but it turns out that caching starts to open up other avenues for sizable optimizations.

Which I’ll get to in future posts.

October 05, 2020

Healthier Blogging

I took some time off to focus on making the numbers go up, but if I only do that sort of junk food style blogging with images, and charts, and benchmarks then we might all stop learning things, and certainly I won’t be transferring any knowledge between the coding part of my brain and the speaking part, so really we’re all losers at that point.

In other words, let’s get back to going extra deep into some code and doing some long form patch review.

Descriptor Management

First: what are descriptors?

Descriptors are, in short, when you feed a buffer or an image (+sampler) into a shader. In OpenGL, this is all handled for the user behind the scenes with e.g., a simple glGenBuffers() -> glBindBuffer() -> glBufferData() for an attached buffer. For a gallium-based driver, this example case will trigger the pipe_context::set_constant_buffer or pipe_context::set_shader_buffers hook at draw time to inform the driver that a buffer has been attached, and then the driver can link it up with the GPU.

Things are a bit different in Vulkan. There’s an entire chapter of the spec devoted to explaining how descriptors work in great detail, but the important details for the zink case are:

  • each descriptor must have a binding value created for it which is unique within the given descriptor set
  • each descriptor must have a Vulkan descriptor type, as converted from its OpenGL type
  • each descriptor must also potentially expand to its full array size (image descriptor types only)

Additionally, while organizing and initializing all these descriptor sets, zink has to track all the resources used and guarantee their lifetimes exceed the lifetimes of the batch they’re being submitted with.

To handle this, zink has an amount of code. In the current state of the repo, it’s about 40 lines.

However…

meme-update_descriptors.png

In the state of my branch that I’m going to be working off of for the next few blog posts, the function for handling descriptor updates is 304 lines. This is the increased amount of code that’s required to handle (almost) all GL descriptor types in a way that’s reasonably reliable.

Is it a great design decision to have a function that large?

Probably not.

But I decided to write it all out first before I did any refactoring so that I could avoid having to incrementally refactor my first attempt at refactoring, which would waste lots of time.

Also, memes.

How Does This Work?

The idea behind the latter version of the implementation that I linked is as follows:

  • iterate over all the shader stages
  • iterate over the bindings for the shader
  • for each binding:
    • fill in the Vulkan buffer/image info struct
    • for all resources used by the binding:
      • add tracking for the underlying resource(s) for lifetime management
      • flag a pipeline barrier for the resource(s) based on usage*
    • fill in the VkWriteDescriptorSet struct for the binding

*Then merge and deduplicate all the accumulated pipeline barriers and apply only those which induce a layout or access change in the resource to avoid over-applying barriers.

As I mentioned in a previous post, zink then applies these descriptors to a newly-allocated, max-size descriptor set object from an allocator pool located on the batch object. Every time a draw command is triggered, a new VkDescriptorSet is allocated and updated using these steps.

First Level Refactoring Target

As I touched on briefly in a previous post, the first change to make here in improving descriptor handling is to move the descriptor pools to the program objects. This lets zink create smaller descriptor pools which are likely going to end up using less memory than these giant ones. Here’s the code used for creating descriptor pools prior to refactoring:

#define ZINK_BATCH_DESC_SIZE 1000
VkDescriptorPoolSize sizes[] = {
   {VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER,         ZINK_BATCH_DESC_SIZE},
   {VK_DESCRIPTOR_TYPE_UNIFORM_TEXEL_BUFFER,   ZINK_BATCH_DESC_SIZE},
   {VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER, ZINK_BATCH_DESC_SIZE},
   {VK_DESCRIPTOR_TYPE_STORAGE_TEXEL_BUFFER,   ZINK_BATCH_DESC_SIZE},
   {VK_DESCRIPTOR_TYPE_STORAGE_IMAGE,          ZINK_BATCH_DESC_SIZE},
   {VK_DESCRIPTOR_TYPE_STORAGE_BUFFER,         ZINK_BATCH_DESC_SIZE},
};
VkDescriptorPoolCreateInfo dpci = {};
dpci.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO;
dpci.pPoolSizes = sizes;
dpci.poolSizeCount = ARRAY_SIZE(sizes);
dpci.flags = 0;
dpci.maxSets = ZINK_BATCH_DESC_SIZE;
vkCreateDescriptorPool(screen->dev, &dpci, 0, &batch->descpool);

Here, all the used descriptor types allocate ZINK_BATCH_DESC_SIZE descriptors in the pool, and there are ZINK_BATCH_DESC_SIZE sets in the pool. There’s a bug here, which is that really the descriptor types should have ZINK_BATCH_DESC_SIZE * ZINK_BATCH_DESC_SIZE descriptors to avoid oom-ing the pool in the event that allocated sets actually use that many descriptors, but ultimately this is irrelevant, as we’re only ever allocating 1 set at a time due to zink flushing multiple times per draw anyway.

Ideally, however, it would be better to avoid this. The majority of draw cases use much, much smaller descriptor sets which only have 1-5 descriptors total, so allocating 6 * 1000 for a pool is roughly 6 * 1000 more than is actually needed for every set.

The other downside of this strategy is that by creating these giant, generic descriptor sets, it becomes impossible to know what’s actually in a given set without attaching considerable metadata to it, which makes reusing sets without modification (i.e., caching) a fair bit of extra work. Yes, I know I said bucket allocation was faster, but I also said I believe in letting the best idea win, and it doesn’t really seem like doing full updates every draw should be faster, does it? But I’ll get to that in another post.

Descriptor Migration

When creating a struct zink_program (which is the struct that contains all the shaders), zink creates a VkDescriptorSetLayout object describing the layout of the descriptor sets that will be allocated from the pool. This means zink is allocating giant, generic descriptor set pools and then allocating (almost certainly) very, very small sets, which means the driver ends up with this giant memory balloon of unused descriptors allocated in the pool that can never be used.

A better idea for this would be to create descriptor pools which precisely match the layout for which they’ll be allocating descriptor sets, as this means there’s no memory ballooning, even if it does end up being more pool objects.

Here’s the current function for creating the layout object for a program:

static VkDescriptorSetLayout
create_desc_set_layout(VkDevice dev,
                       struct zink_shader *stages[ZINK_SHADER_COUNT],
                       unsigned *num_descriptors)
{
   VkDescriptorSetLayoutBinding bindings[PIPE_SHADER_TYPES * (PIPE_MAX_CONSTANT_BUFFERS + PIPE_MAX_SHADER_SAMPLER_VIEWS + PIPE_MAX_SHADER_BUFFERS + PIPE_MAX_SHADER_IMAGES)];
   int num_bindings = 0;

   for (int i = 0; i < ZINK_SHADER_COUNT; i++) {
      struct zink_shader *shader = stages[i];
      if (!shader)
         continue;

      VkShaderStageFlagBits stage_flags = zink_shader_stage(pipe_shader_type_from_mesa(shader->nir->info.stage));

This function is called for both the graphics and compute pipelines, and for the latter, only a single shader object is passed, meaning that i is purely for iterating the maximum number of shaders and not descriptive of the current shader being processed.

      for (int j = 0; j < shader->num_bindings; j++) {
         assert(num_bindings < ARRAY_SIZE(bindings));
         bindings[num_bindings].binding = shader->bindings[j].binding;
         bindings[num_bindings].descriptorType = shader->bindings[j].type;
         bindings[num_bindings].descriptorCount = shader->bindings[j].size;
         bindings[num_bindings].stageFlags = stage_flags;
         bindings[num_bindings].pImmutableSamplers = NULL;
         ++num_bindings;
      }
   }

This iterates over the bindings in a given shader, setting up the various values required by the layout creation struct using the values stored to the shader struct using the code in zink_compiler.c.

   *num_descriptors = num_bindings;
   if (!num_bindings) return VK_NULL_HANDLE;

If this program has no descriptors at all, then this whole thing can become a no-op, and descriptor updating can be skipped for draws which use this program.

   VkDescriptorSetLayoutCreateInfo dcslci = {};
   dcslci.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO;
   dcslci.pNext = NULL;
   dcslci.flags = 0;
   dcslci.bindingCount = num_bindings;
   dcslci.pBindings = bindings;

   VkDescriptorSetLayout dsl;
   if (vkCreateDescriptorSetLayout(dev, &dcslci, 0, &dsl) != VK_SUCCESS) {
      debug_printf("vkCreateDescriptorSetLayout failed\n");
      return VK_NULL_HANDLE;
   }

   return dsl;
}

Then there’s just the usual Vulkan semantics of storing values to the struct and passing it to the Create function.

But back in the context of moving descriptor pool creation to the program, this is actually the perfect place to jam the pool creation code in since all the information about descriptor types is already here. Here’s what that looks like:

VkDescriptorPoolSize sizes[6] = {};
int type_map[12];
unsigned num_types = 0;
memset(type_map, -1, sizeof(type_map));

for (int i = 0; i < ZINK_SHADER_COUNT; i++) {
   struct zink_shader *shader = stages[i];
   if (!shader)
      continue;

   VkShaderStageFlagBits stage_flags = zink_shader_stage(pipe_shader_type_from_mesa(shader->nir->info.stage));
   for (int j = 0; j < shader->num_bindings; j++) {
      assert(num_bindings < ARRAY_SIZE(bindings));
      bindings[num_bindings].binding = shader->bindings[j].binding;
      bindings[num_bindings].descriptorType = shader->bindings[j].type;
      bindings[num_bindings].descriptorCount = shader->bindings[j].size;
      bindings[num_bindings].stageFlags = stage_flags;
      bindings[num_bindings].pImmutableSamplers = NULL;
      if (type_map[shader->bindings[j].type] == -1) {
         type_map[shader->bindings[j].type] = num_types++;
         sizes[type_map[shader->bindings[j].type]].type = shader->bindings[j].type;
      }
      sizes[type_map[shader->bindings[j].type]].descriptorCount++;
      ++num_bindings;
   }
}

I’ve added the sizes, type_map, and num_types variables, which map used Vulkan descriptor types to a zero-based array and associated counter that can be used to fill in the pPoolSizes and poolSizeCount values in a VkDescriptorPoolCreateInfo struct.

After the layout creation, which remains unchanged, I’ve then added this block:

for (int i = 0; i < num_types; i++)
   sizes[i].descriptorCount *= ZINK_DEFAULT_MAX_DESCS;

VkDescriptorPoolCreateInfo dpci = {};
dpci.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO;
dpci.pPoolSizes = sizes;
dpci.poolSizeCount = num_types;
dpci.flags = 0;
dpci.maxSets = ZINK_DEFAULT_MAX_DESCS;
vkCreateDescriptorPool(dev, &dpci, 0, &descpool);

Which uses the descriptor types and sizes from above to create a pool that will pre-allocate the exact descriptor counts that are needed for this program.

Managing descriptor sets in this way does have other challenges, however. Like resources, it’s crucial that sets not be modified or destroyed while they’re submitted to a batch.

Descriptor Tracking

Previously, any time a draw was completed, the batch object would reset and clear its descriptor pool, wiping out all the allocated sets. If the pool is no longer on the batch, however, it’s not possible to perform this reset without adding tracking info for the batch to all the descriptor sets. Also, resetting a descriptor pool like this is wasteful, as it’s probable that a program will be used for multiple draws and thus require multiple descriptor sets. What I’ve done instead is add this function, which is called just after allocating the descriptor set:

bool
zink_batch_add_desc_set(struct zink_batch *batch, struct zink_program *pg, struct zink_descriptor_set *zds)
{
   struct hash_entry *entry = _mesa_hash_table_search(batch->programs, pg);
   assert(entry);
   struct set *desc_sets = (void*)entry->data;
   if (!_mesa_set_search(desc_sets, zds)) {
      pipe_reference(NULL, &zds->reference);
      _mesa_set_add(desc_sets, zds);
      return true;
   }
   return false;
}

Similar to all the other batch<->object tracking, this stores the given descriptor set into a set, but in this case the set is itself stored as the data in a hash table keyed with the program, which provides both objects for use during batch reset:

void
zink_reset_batch(struct zink_context *ctx, struct zink_batch *batch)
{
   struct zink_screen *screen = zink_screen(ctx->base.screen);
   batch->descs_used = 0;

   // cmdbuf hasn't been submitted before
   if (!batch->submitted)
      return;

   zink_fence_finish(screen, &ctx->base, batch->fence, PIPE_TIMEOUT_INFINITE);
   hash_table_foreach(batch->programs, entry) {
      struct zink_program *pg = (struct zink_program*)entry->key;
      struct set *desc_sets = (struct set*)entry->data;
      set_foreach(desc_sets, sentry) {
         struct zink_descriptor_set *zds = (void*)sentry->key;
         /* reset descriptor pools when no batch is using this program to avoid
          * having some inactive program hogging a billion descriptors
          */
         pipe_reference(&zds->reference, NULL);
         zink_program_invalidate_desc_set(pg, zds);
      }
      _mesa_set_destroy(desc_sets, NULL);

And this function is called:

void
zink_program_invalidate_desc_set(struct zink_program *pg, struct zink_descriptor_set *zds)
{
   uint32_t refcount = p_atomic_read(&zds->reference.count);
   /* refcount > 1 means this is currently in use, so we can't recycle it yet */
   if (refcount == 1)
      util_dynarray_append(&pg->alloc_desc_sets, struct zink_descriptor_set *, zds);
}

If a descriptor set has no active batch uses, its refcount will be 1, and then it can be added to the array of allocated descriptor sets for immediate reuse in the next draw. In this iteration of refactoring, descriptor sets can only have one batch use, so this condition is always true when this function is called, but future work will see that change.

Putting it all together is this function:

struct zink_descriptor_set *
zink_program_allocate_desc_set(struct zink_screen *screen,
                               struct zink_batch *batch,
                               struct zink_program *pg)
{
   struct zink_descriptor_set *zds;

   if (util_dynarray_num_elements(&pg->alloc_desc_sets, struct zink_descriptor_set *)) {
      /* grab one off the allocated array */
      zds = util_dynarray_pop(&pg->alloc_desc_sets, struct zink_descriptor_set *);
      goto out;
   }

   VkDescriptorSetAllocateInfo dsai;
   memset((void *)&dsai, 0, sizeof(dsai));
   dsai.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO;
   dsai.pNext = NULL;
   dsai.descriptorPool = pg->descpool;
   dsai.descriptorSetCount = 1;
   dsai.pSetLayouts = &pg->dsl;

   VkDescriptorSet desc_set;
   if (vkAllocateDescriptorSets(screen->dev, &dsai, &desc_set) != VK_SUCCESS) {
      debug_printf("ZINK: %p failed to allocate descriptor set :/\n", pg);
      return VK_NULL_HANDLE;
   }
   zds = ralloc_size(NULL, sizeof(struct zink_descriptor_set));
   assert(zds);
   pipe_reference_init(&zds->reference, 1);
   zds->desc_set = desc_set;
out:
   if (zink_batch_add_desc_set(batch, pg, zds))
      batch->descs_used += pg->num_descriptors;

   return zds;
}

If a pre-allocated descriptor set exists, it’s popped off the array. Otherwise, a new one is allocated. After that, the set is referenced onto the batch.

Progress

Now all the sets are allocated on the program using a more specific allocation strategy, which paves the way for a number of improvements that I’ll be discussing in various lengths over the coming days:

  • bucket allocating
  • descriptor caching 2.0
  • split descriptor sets
  • context-based descriptor states
  • (pending implementation/testing/evaluation) context-based descriptor pool+layout caching

September 29, 2020

A Showcase

Today I’m taking a break from writing about my work to write about the work of zink’s newest contributor, He Haocheng (aka @hch12907). Among other things, Haocheng has recently tackled the issue of extension refactoring, which is a huge help for future driver development. I’ve written time and time again about adding extensions, and with this patchset in place, the process is simplified and expedited almost into nonexistence.

Before

As an example, let’s look at the most recent extension that I’ve added support for, VK_EXT_extended_dynamic_state. The original patch looked like this:

diff --git a/src/gallium/drivers/zink/zink_screen.c b/src/gallium/drivers/zink/zink_screen.c
index 3effa2b0fe4..83b89106931 100644
--- a/src/gallium/drivers/zink/zink_screen.c
+++ b/src/gallium/drivers/zink/zink_screen.c
@@ -925,6 +925,10 @@ load_device_extensions(struct zink_screen *screen)
       assert(have_device_time);
       free(domains);
    }
+   if (screen->have_EXT_extended_dynamic_state) {
+      GET_PROC_ADDR(CmdSetViewportWithCountEXT);
+      GET_PROC_ADDR(CmdSetScissorWithCountEXT);
+   }
 
 #undef GET_PROC_ADDR
 
@@ -938,7 +942,8 @@ zink_internal_create_screen(struct sw_winsys *winsys, int fd, const struct pipe_
    bool have_tf_ext = false, have_cond_render_ext = false, have_EXT_index_type_uint8 = false,
       have_EXT_robustness2_features = false, have_EXT_vertex_attribute_divisor = false,
       have_EXT_calibrated_timestamps = false, have_VK_KHR_vulkan_memory_model = false;
-   bool have_EXT_custom_border_color = false, have_EXT_blend_operation_advanced = false;
+   bool have_EXT_custom_border_color = false, have_EXT_blend_operation_advanced = false,
+        have_EXT_extended_dynamic_state = false;
    if (!screen)
       return NULL;
 
@@ -1001,6 +1006,9 @@ zink_internal_create_screen(struct sw_winsys *winsys, int fd, const struct pipe_
             if (!strcmp(extensions[i].extensionName,
                         VK_EXT_BLEND_OPERATION_ADVANCED_EXTENSION_NAME))
                have_EXT_blend_operation_advanced = true;
+            if (!strcmp(extensions[i].extensionName,
+                        VK_EXT_EXTENDED_DYNAMIC_STATE_EXTENSION_NAME))
+               have_EXT_extended_dynamic_state = true;
 
          }
          FREE(extensions);
@@ -1012,6 +1020,7 @@ zink_internal_create_screen(struct sw_winsys *winsys, int fd, const struct pipe_
    VkPhysicalDeviceIndexTypeUint8FeaturesEXT index_uint8_feats = {};
    VkPhysicalDeviceVulkanMemoryModelFeatures mem_feats = {};
    VkPhysicalDeviceBlendOperationAdvancedFeaturesEXT blend_feats = {};
+   VkPhysicalDeviceExtendedDynamicStateFeaturesEXT dynamic_state_feats = {};
 
    feats.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FEATURES_2;
    screen->feats11.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_VULKAN_1_1_FEATURES;
@@ -1060,6 +1069,11 @@ zink_internal_create_screen(struct sw_winsys *winsys, int fd, const struct pipe_
       blend_feats.pNext = feats.pNext;
       feats.pNext = &blend_feats;
    }
+   if (have_EXT_extended_dynamic_state) {
+      dynamic_state_feats.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_EXTENDED_DYNAMIC_STATE_FEATURES_EXT;
+      dynamic_state_feats.pNext = feats.pNext;
+      feats.pNext = &dynamic_state_feats;
+   }
    vkGetPhysicalDeviceFeatures2(screen->pdev, &feats);
    memcpy(&screen->feats, &feats.features, sizeof(screen->feats));
    if (have_tf_ext && tf_feats.transformFeedback)
@@ -1074,6 +1088,8 @@ zink_internal_create_screen(struct sw_winsys *winsys, int fd, const struct pipe_
    screen->have_EXT_calibrated_timestamps = have_EXT_calibrated_timestamps;
    if (have_EXT_custom_border_color && screen->border_color_feats.customBorderColors)
       screen->have_EXT_custom_border_color = true;
+   if (have_EXT_extended_dynamic_state && dynamic_state_feats.extendedDynamicState)
+      screen->have_EXT_extended_dynamic_state = true;
 
    VkPhysicalDeviceProperties2 props = {};
    VkPhysicalDeviceVertexAttributeDivisorPropertiesEXT vdiv_props = {};
@@ -1150,7 +1166,7 @@ zink_internal_create_screen(struct sw_winsys *winsys, int fd, const struct pipe_
     * this requires us to pass the whole VkPhysicalDeviceFeatures2 struct
     */
    dci.pNext = &feats;
-   const char *extensions[12] = {
+   const char *extensions[13] = {
       VK_KHR_MAINTENANCE1_EXTENSION_NAME,
    };
    num_extensions = 1;
@@ -1185,6 +1201,8 @@ zink_internal_create_screen(struct sw_winsys *winsys, int fd, const struct pipe_
       extensions[num_extensions++] = VK_EXT_CUSTOM_BORDER_COLOR_EXTENSION_NAME;
    if (have_EXT_blend_operation_advanced)
       extensions[num_extensions++] = VK_EXT_BLEND_OPERATION_ADVANCED_EXTENSION_NAME;
+   if (have_EXT_extended_dynamic_state)
+      extensions[num_extensions++] = VK_EXT_EXTENDED_DYNAMIC_STATE_EXTENSION_NAME;
    assert(num_extensions <= ARRAY_SIZE(extensions));
 
    dci.ppEnabledExtensionNames = extensions;
diff --git a/src/gallium/drivers/zink/zink_screen.h b/src/gallium/drivers/zink/zink_screen.h
index 4ee409c0efd..1d35e775262 100644
--- a/src/gallium/drivers/zink/zink_screen.h
+++ b/src/gallium/drivers/zink/zink_screen.h
@@ -75,6 +75,7 @@ struct zink_screen {
    bool have_EXT_calibrated_timestamps;
    bool have_EXT_custom_border_color;
    bool have_EXT_blend_operation_advanced;
+   bool have_EXT_extended_dynamic_state;
 
    bool have_X8_D24_UNORM_PACK32;
    bool have_D24_UNORM_S8_UINT;

It’s awful, right? There’s obviously lots of copy/pasted code here, and it’s a tremendous waste of time to have to do the copy/pasting, not to mention the time needed for reviewing such mind-numbing changes.

After

Here’s the same patch after He Haocheng’s work has been merged:

diff --git a/src/gallium/drivers/zink/zink_device_info.py b/src/gallium/drivers/zink/zink_device_info.py
index 0300e7f7574..69e475df2cf 100644
--- a/src/gallium/drivers/zink/zink_device_info.py
+++ b/src/gallium/drivers/zink/zink_device_info.py
@@ -62,6 +62,7 @@ def EXTENSIONS():
         Extension("VK_EXT_calibrated_timestamps"),
         Extension("VK_EXT_custom_border_color",      alias="border_color", properties=True, feature="customBorderColors"),
         Extension("VK_EXT_blend_operation_advanced", alias="blend", properties=True),
+        Extension("VK_EXT_extended_dynamic_state",   alias="dynamic_state", feature="extendedDynamicState"),
     ]
 
 # There exists some inconsistencies regarding the enum constants, fix them.
diff --git a/src/gallium/drivers/zink/zink_screen.c b/src/gallium/drivers/zink/zink_screen.c
index 864ec32fc22..3c1214d384b 100644
--- a/src/gallium/drivers/zink/zink_screen.c
+++ b/src/gallium/drivers/zink/zink_screen.c
@@ -926,6 +926,10 @@ load_device_extensions(struct zink_screen *screen)
       assert(have_device_time);
       free(domains);
    }
+   if (screen->info.have_EXT_extended_dynamic_state) {
+      GET_PROC_ADDR(CmdSetViewportWithCountEXT);
+      GET_PROC_ADDR(CmdSetScissorWithCountEXT);
+   }
 
 #undef GET_PROC_ADDR

feels_good.png

I’m certain this is going to lead to an increase in my own productivity in the future given how quick the process has now become.

A Short Note On Hardware

I’ve been putting up posts lately about my benchmarking figures, and in them I’ve been referencing Intel hardware. The reason I use Intel is because I’m (currently) a hobbyist developer with exactly one computer capable of doing graphics development, and it has an Intel onboard GPU. I’m quite happy with this given the high quality state of Intel’s drivers, as things become much more challenging when I have to debug both my own bugs as well as an underlying Vulkan driver’s bugs at the same time.

With that said, I present to you this recent out-of-context statement from Dave Airlie regarding zink performance on a different driver:

<airlied> zmike: hey on a fiji amd card, you get 45/46 native vs 35 fps with zink on one heaven scene here, however zink is corrupted

I don’t have any further information about anything there, but it’s the best I can do given the limitations in my available hardware.

September 28, 2020

But No, I’m Not

Just a quick post to summarize a few exciting changes I’ve made today.

To start with, I’ve added some tracking to the internal batch objects for catching things like piglit’s spec@!opengl 1.1@streaming-texture-leak. Let’s check out the test code there for a moment:

/** @file streaming-texture-leak.c
 *
 * Tests that allocating and freeing textures over and over doesn't OOM
 * the system due to various refcounting issues drivers may have.
 *
 * Textures used are around 4MB, and we make 5k of them, so OOM-killer
 * should catch any failure.
 *
 * Bug #23530
 */
for (i = 0; i < 5000; i++) {
        glGenTextures(1, &texture);
        glBindTexture(GL_TEXTURE_2D, texture);

        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER,
                        GL_LINEAR);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER,
                        GL_LINEAR);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, TEX_SIZE, TEX_SIZE,
                     0, GL_RGBA,
                     GL_UNSIGNED_BYTE, tex_buffer);

        piglit_draw_rect_tex(0, 0, piglit_width, piglit_height,
                             0, 0, 1, 1);

        glDeleteTextures(1, &texture);
}

This test loops 5000 times, using a different sampler texture for each draw, and then destroys the texture. This is supposed to catch drivers which can’t properly manage their resource refcounts, but instead here zink is getting caught by trying to dump 5000 active resources into the same command buffer, which ooms the system.

The reason for the problem in this case is that, after my recent optimizations which avoid unnecessary flushing, zink only submits the command buffer when a frame is finished or one of the write-flagged resources associated with an active batch is read from. Thus, the whole test runs in one go, only submitting the queue at the very end when the test performs a read.

In this case, my fix is simple: check the system’s total memory on driver init, and then always flush a batch if it crosses some threshold of memory usage in its associated resources when beginning a new draw. I chose 1/8 of total memory per batch to be “safe”: with four batches in flight, that allows zink to use at most 50% of the total memory for its resources before it’ll begin to stall and force the draws to complete, hopefully avoiding any oom scenarios. This ends up being a flush every 250ish draws in the above test code, and everything works nicely without killing my system.
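
Here’s a minimal sketch of that heuristic; the struct fields and helpers are hypothetical stand-ins rather than zink’s actual code:

/* Hypothetical sketch: flush a batch once the memory referenced by its
 * resources crosses a fixed fraction of total system memory. */
#include <stdbool.h>
#include <stdint.h>

struct batch {
   uint64_t resource_size; /* total bytes referenced by this batch */
};

static uint64_t flush_threshold; /* set once at driver init */

static void
init_memory_limit(uint64_t total_system_memory)
{
   flush_threshold = total_system_memory / 8;
}

/* checked at the start of every draw */
static bool
batch_needs_flush(const struct batch *batch)
{
   return batch->resource_size >= flush_threshold;
}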

Performance 3.0

As a bonus, I noticed that zink was taking considerably longer than IRIS to complete this test once it was fixed, so I did a little profiling, and this was the result: epilogue.png

Up another 3 fps (~10%) from Friday, which isn’t bad for a few minutes spent removing some memset calls from descriptor updating and then throwing in some code for handling VK_DYNAMIC_STATE_VERTEX_INPUT_BINDING_STRIDE_EXT.
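
For reference, the stride handling leans on VK_EXT_extended_dynamic_state: the pipeline lists VK_DYNAMIC_STATE_VERTEX_INPUT_BINDING_STRIDE_EXT among its dynamic states, and the stride is then supplied at bind time. A sketch of the idea, not the committed zink code:

#include <vulkan/vulkan.h>

/* Sketch: with VK_DYNAMIC_STATE_VERTEX_INPUT_BINDING_STRIDE_EXT enabled on the
 * pipeline, the stride comes from the bind call instead of the pipeline, so a
 * stride change no longer requires a different pipeline object. */
static void
bind_vbo_with_dynamic_stride(VkCommandBuffer cmdbuf, VkBuffer buffer,
                             VkDeviceSize offset, VkDeviceSize stride)
{
   /* pSizes may be NULL; pStrides overrides the stride from pipeline creation */
   vkCmdBindVertexBuffers2EXT(cmdbuf, 0, 1, &buffer, &offset, NULL, &stride);
}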

September 25, 2020

Today new beta versions of all KWinFT projects – KWinFT, Wrapland, Disman and KDisplay – were released. With that we are on target for the full release, which is aligned with Plasma 5.20 on October 13.

Big changes will unquestionably come to Disman, a previously stifled library for display management, which now learns to stand on its own feet, providing universal means for configuring displays across different windowing systems and Wayland compositors.

But the compositor KWinFT also gained a very specific yet important feature, along with a multitude of stability fixes and code refactors.

In the following we will do a deep dive into the reasons for and results of these recent efforts.

For a quick overview of the work on Disman you can also watch the lightning talk that I gave at the virtual XDC 2020 conference last week.

Universal display management with Disman

It was initially not planned like this but Disman and KDisplay were two later additions to the KWinFT project.

The projects were forked from libkscreen and KScreen respectively, and I saw this as an opportunity to completely rethink and in every sense overhaul these historically rather lackluster and at times completely neglected components of the KDE Plasma workspace. This past negligence is rather tragic, since complaints about miserable output management in KDE Plasma go back as far as anyone can remember. Improving this bad state of affairs was my main motivation when I started working on libkscreen and KScreen around two years ago.

In my opinion a well-functioning – not necessarily fancy but for sure robust – display configuration system is a cornerstone of a well-crafted desktop system. One reason for that is how prevalent multi-display setups are; another is how immeasurably annoying it is when you can't configure the projector correctly that one time you have to give a presentation in front of a room full of people.

Disman now tries to solve this by providing a solution not only for KWinFT or the KDE Plasma desktop alone but for any system running X11 or any Wayland compositor.

Moving logic and ideas

Let us look into the details of this solution and why I haven't mentioned KDisplay yet. The reason for this omission is that KDisplay from now on will be a husk of its former self.

Ancient rituals

No longer than one month ago KDisplay, as a fork of KScreen, was still the logical center of any display configuration, with an always-active KDE daemon (KDED) module and a KConfig module (KCM) integrated into the KDE System Settings.

The KDED module was responsible for reacting to display hot-plug events, reading control files for the resulting display combination from the user directory, generating optimal configurations if none were found, and writing new files to the hard disk after the configuration had been applied successfully to the windowing system.

In this workflow Disman was only relevant as a provider of backend plugins that were loaded at runtime. Disman was used either in-process or through an included D-Bus service that got automatically started whenever the first client tried to talk to it. According to the commit adding this out-of-process mode five years ago, the intention behind it was to improve performance and stability. But in the end, on a functional level, the service did little more than forward data between the windowing system and the Disman consumers.

Break with tradition

Interestingly, the D-Bus service was only activatable with the X11 backend and was explicitly disabled on Wayland. When I noticed this I was first tempted to remove the D-Bus service in the eternal struggle to reduce code complexity. After all, if the service is not used on Wayland, we might not need it at all.

But some time later I realized that this D-Bus service deserves to be appreciated for something other than its initial reasoning. From a different point of view this service could be the key to a much more ambitious grand solution.

The service allows us to serialize and synchronize access from arbitrarily many clients in a transparent way, while moving all relevant logical systems to a shared central place and providing each client with a high level of integration with those systems.

Concretely, this means that the Disman D-Bus service becomes an independent entity. Once invoked by a single call from a client, for example by the included command line utility with dismanctl -o, the service reads and writes all necessary control files on its own. It generates optimal display configurations if no files are found and can even disable a laptop display in case the lid is closed while an external output is connected.

In this model Disman consumers solely provide user interfaces that are informed about the generated or loaded current config and that can modify this config additionally if desirable. This way the consumer can concentrate on providing a user interface with great usability and design and leave to Disman all the logic of handling the modified configuration afterwards.

Making it easy to add other clients is only one advantage. On a higher level this new design has two more.

Auxiliary data

I noticed already last year that some basic assumptions in KScreen were questionable. Its internal data logic relied on a round trip through the windowing system.

This meant in practice that the user was supposed to change display properties via the KScreen KCM. These were then sent to the windowing system which tried to apply them to the hardware. Afterwards it informed the KScreen KDE daemon through its own specific protocols and a libkscreen backend about this new configuration. Only the daemon then would write the updated configuration to the disk.

Why it was done this way is clear: we can be sure we have written a valid configuration to the disk and by having only the daemon do the file write we have the file access logic in a single place and do not need to sync file writes of different processes.

But the fundamental problem with this design is that sensible display management sometimes requires sharing additional information about the display configuration that is of no relevance to the windowing system and therefore cannot be passed through it.

A simple example is when a display is auto-rotated. Smartphones and tablets but also many convertibles come with orientation sensors to auto-rotate the built-in display according to the current device orientation. When auto-rotation is switched on or off in the KCM it is not sent through the windowing system but the daemon or another service needs to know about such a change in order to adapt the display rotation correctly with later orientation changes.

Another complex but interesting example is the replication of displays, often also called mirroring. When I started work on KScreen two years ago the mechanism was painfully primitive: one could only duplicate all displays at once, and it was done by moving all of them to the same position and then changing their resolutions in the hope of finding some sufficiently alike to cover a similar area.

Obviously that had several issues, the worst in my opinion being that it doesn't work for displays with different aspect ratios, as I noticed quickly after I got myself a 16:10 display. Another grave issue was that displays might not run at their full resolution. In a mixed-DPI setup the HiDPI displays are downgraded to the best resolution they have in common with the LoDPI displays.

The good news is that on X11 and also Wayland methods are available to replicate displays without these downsides.

On X11 we can apply arbitrary linear transformations to an output. This solves both issues.

On Wayland all available output management protocols at minimum allow setting a single floating point value to scale a display. This solves the mixed-DPI problem, since we can still run both displays at an arbitrary resolution and adapt the logical size of the replica through its scale. If the management protocol even provides a way to specify the logical size directly, as the KWinFT protocol does, we can also solve the problem of diverging display aspect ratios.
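
To make the arithmetic concrete, here is a small illustration of my own – not Disman code – of how a replica's scale could be derived from the source's logical size:

/* Illustration only: an output's logical width is roughly its mode width
 * divided by its scale, so to make a replica cover the same logical width as
 * the source we solve for the scale. A single scale can only match one
 * dimension, which is why differing aspect ratios still need the logical size
 * to be set directly. */
static double
replica_scale(int replica_mode_width, double source_logical_width)
{
   return (double)replica_mode_width / source_logical_width;
}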

From a bird's eye view, in this model one or multiple displays act as replicas of a single source display. Only the transformation, scale or logical size of the replicas is changed; the source is the invariant. The only information to remember for each display is therefore whether there is a replication source that the display is a replica of. But neither on X11 nor in any Wayland compositor is this information conveyed via the windowing system.

With the new design we send all configuration data including such auxiliary data to the Disman D-Bus service. The service will save all this data to a configuration-specific file but send to the windowing system only a relevant subset of the data. After the windowing system reports that the configuration was applied the Disman service informs all connected clients about this change sending the data received from the windowing system augmented by the auxiliary data that had not been passed through the windowing system.

This way every display management client receives all relevant data about the current configuration including the auxiliary data.

The motivation to solve this problem was the original driving force behind the large redesign of Disman that is coming with this release.

But I soon realized that this redesign also has another advantage, one that is likely even more important long-term than the first.

Ready for everything that is not KDE Plasma too

With the new design Disman becomes a truly universal solution for display management.

Previously a running KDE daemon process with KDisplay's daemon module loaded was required in order to load a display configuration on startup and to react to display hot-plug events. The problem with this is that the KDE daemon commonly only makes sense to run on a KDE Plasma desktop.

Thanks to the new design the Disman D-Bus service can now be run as a standalone background service managing all your displays permanently, even if you don't use KDE Plasma.

In a non-Plasma environment, like a Wayland session with sway, this can be achieved by simply calling dismanctl -o once in a startup script.

On the other hand, the graphical user interface that KDisplay provides can now be used to manage displays on any desktop that Disman runs on. KDisplay does not require the KDE System Settings to be installed and can be run as a standalone app. Simply call kdisplay from the command line or start it from the launcher of your desktop environment.

KDisplay still includes a now absolutely gutted KDE daemon module that will be run in a Plasma session. The module now does little more than launch the Disman D-Bus service on session startup. So in a Plasma session, after installing Disman and KDisplay, everything is set up automatically. In every other session, as said, a simple dismanctl -o call at startup is enough to get the very same result.

Maybe the integration in sessions other than Plasma could be improved so that even setting up this single call at startup becomes unnecessary. Should Disman, for example, install a systemd unit file executing this call by default? I would be interested in feedback in this regard, in particular from distributions. What do they prefer?

KWinFT and Wrapland improved in selected areas

With today's beta release the greatest changes come to Disman and KDisplay. But that does not mean KWinFT and Wrapland have not received some important updates.

Outputs all the way

With the ongoing work on Disman, and by that on displays – or outputs, as they are called in the land of window managers – stability and feature patches for outputs naturally came to Wrapland and KWinFT as well. A large refactor was the introduction of a master output class on the server side of Wrapland. The class acts as a central entry point for compositors and deals with the different output-related protocol objects internally.

With this class in place it was rather easy to add support for xdg-output versions 2 and 3 afterwards. In order to do that it was also reasonable to re-evaluate how we provide output-identifying metadata in KWinFT and Wrapland in general.

With regard to output identification, a happy coincidence was that Simon Ser of the wlroots project had already been asking himself the very same questions in the past.

I concluded that Simon's plan for wlroots was spot on, and I decided to help them out a bit with patches for wlr-protocols and wlroots. In the same vein I updated Wrapland's output device protocol. That means Wrapland and wlroots-based compositors now feature the very same way of identifying outputs, which made it easy to provide full support for both in Disman.

Presentation timings

This release comes with support for the presentation-time protocol.

It is one of only three Wayland protocol extensions that have been officially declared stable. Because of that, supporting it also felt important in a formal sense.

Primarily though it is essential to my ongoing work on Xwayland. I plan to make use of the presentation-time protocol in Xwayland's Present extension implementation.

With the support in KWinFT I can now test future presentation-time work in Xwayland with both KWinFT and sway, since wlroots also supports the protocol. Having two different compositors for alternative testing will be quite helpful.

Try out the beta

If you want to try out the new beta release of Disman together with your favorite desktop environment or the KWinFT beta as a drop-in replacement for KWin you have to compile from source at the moment. For that use the Plasma/5.20 branches in the respective repositories.

For Disman there are some limited instructions on how to compile it in the Readme file.

If you have questions or just want to chat about the projects feel free to join the official KWinFT Gitter channel.

If you want to wait for the full release, check back on the release date, October 13. I plan to write another article for that date listing all distributions where you will be able to install the KWinFT projects comfortably via package manager.

That is also a call to distro packagers: if you plan to provide packages for the KWinFT projects on October 13 get in touch to get support and be featured in the article.

Performance 2.0

In my last post, I left off with an overall 25%ish improvement in framerate for my test case:

endpost1.png

At the end, this was an extra 3 fps over my previous test, but how did I get to this point?

The answer lies in even more unnecessary queue submission. Let’s take a look at zink’s pipe_context::set_framebuffer_state hook, which is called by gallium any time the framebuffer state changes:

static void
zink_set_framebuffer_state(struct pipe_context *pctx,
                           const struct pipe_framebuffer_state *state)
{
   struct zink_context *ctx = zink_context(pctx);
   struct zink_screen *screen = zink_screen(pctx->screen);

   util_copy_framebuffer_state(&ctx->fb_state, state);

   struct zink_framebuffer *fb = get_framebuffer(ctx);

   zink_framebuffer_reference(screen, &ctx->framebuffer, fb);
   if (ctx->gfx_pipeline_state.render_pass != fb->rp)
      ctx->gfx_pipeline_state.hash = 0;
   zink_render_pass_reference(screen, &ctx->gfx_pipeline_state.render_pass, fb->rp);

   uint8_t rast_samples = util_framebuffer_get_num_samples(state);
   /* in vulkan, gl_SampleMask needs to be explicitly ignored for sampleCount == 1 */
   if ((ctx->gfx_pipeline_state.rast_samples > 1) != (rast_samples > 1))
      ctx->dirty_shader_stages |= 1 << PIPE_SHADER_FRAGMENT;
   if (ctx->gfx_pipeline_state.rast_samples != rast_samples)
      ctx->gfx_pipeline_state.hash = 0;
   ctx->gfx_pipeline_state.rast_samples = rast_samples;
   if (ctx->gfx_pipeline_state.num_attachments != state->nr_cbufs)
      ctx->gfx_pipeline_state.hash = 0;
   ctx->gfx_pipeline_state.num_attachments = state->nr_cbufs;

   /* need to start a new renderpass */
   if (zink_curr_batch(ctx)->rp)
      flush_batch(ctx);
   struct zink_batch *batch = zink_batch_no_rp(ctx);
   zink_framebuffer_reference(screen, &batch->fb, fb);

   framebuffer_state_buffer_barriers_setup(ctx, &ctx->fb_state, zink_curr_batch(ctx));
}

Briefly, zink copies the framebuffer state, and there are a number of conditions under which a new pipeline object is needed, all of which result in ctx->gfx_pipeline_state.hash = 0;. Other than this, there’s a sample count check so that the shader can be modified if necessary, and then there’s the setup for creating the Vulkan framebuffer object as well as the renderpass object in get_framebuffer().

Eagle-eyed readers will immediately spot the problem here: aside from the fact that there’s not actually any reason to be setting up the framebuffer or renderpass at this point, zink is also flushing the current batch if a renderpass is active.

The change I made here was to remove everything related to Vulkan from here, and move it to zink_begin_render_pass(), which is the function that the driver uses to begin a renderpass for a given batch.

This is clearly a much larger change than just removing the flush_batch() call, which might be what’s expected now that ending a renderpass no longer forces queue submission. Indeed, why haven’t I just ended the current renderpass and kept using the same batch?

The reason for this is that zink is designed in such a way that a given batch, at the time of calling vkCmdBeginRenderPass, is expected to either have no struct zink_render_pass associated with it (the batch has not performed a draw yet) or have the same object which matches the pipeline state (the batch is continuing to draw using the same renderpass). Adjusting this to be compatible with removing the flush here ends up being more code than just moving the object setup to a different spot.

So now the framebuffer and renderpass are created or pulled from their caches just prior to the vkCmdBeginRenderPass call, and a flush is removed, gaining some noticeable fps.
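
Roughly speaking, the hook then shrinks to just copying state and invalidating the pipeline hash. Here’s a simplified sketch of the end result rather than the exact committed code:

static void
zink_set_framebuffer_state(struct pipe_context *pctx,
                           const struct pipe_framebuffer_state *state)
{
   struct zink_context *ctx = zink_context(pctx);

   util_copy_framebuffer_state(&ctx->fb_state, state);

   uint8_t rast_samples = util_framebuffer_get_num_samples(state);
   /* in vulkan, gl_SampleMask needs to be explicitly ignored for sampleCount == 1 */
   if ((ctx->gfx_pipeline_state.rast_samples > 1) != (rast_samples > 1))
      ctx->dirty_shader_stages |= 1 << PIPE_SHADER_FRAGMENT;
   if (ctx->gfx_pipeline_state.rast_samples != rast_samples)
      ctx->gfx_pipeline_state.hash = 0;
   ctx->gfx_pipeline_state.rast_samples = rast_samples;
   if (ctx->gfx_pipeline_state.num_attachments != state->nr_cbufs)
      ctx->gfx_pipeline_state.hash = 0;
   ctx->gfx_pipeline_state.num_attachments = state->nr_cbufs;

   /* no get_framebuffer(), no renderpass references, no flush_batch():
    * the VkFramebuffer/VkRenderPass lookup now happens in
    * zink_begin_render_pass() just before vkCmdBeginRenderPass() */
}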

Going Further

Now that I’d unblocked that bottleneck, I went back to the list and checked the remaining problem areas:

  • descriptor set allocation is going to be a massive performance hit for any application which does lots of draws per frame since each draw command allocates its own (huge) descriptor set
  • the 1000 descriptor set limit is going to be hit constantly for any application which does lots of draws per frame

I decided to change things up a bit here.

This is the current way of things: each draw allocates its own (huge) descriptor set from the batch’s pool, updates it with every descriptor for the draw, and the set is destroyed when the batch resets its state.

My plan was something more like this: move descriptor set ownership out of the batches and into the programs, so that sets can be cached and handed back out across draws and command buffers.

Where get descriptorset from program would look something like:
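
Here’s a hypothetical sketch of that lookup; the structures and helper functions are my own stand-ins rather than zink’s actual code:

#include <stdbool.h>
#include <stdint.h>
#include <vulkan/vulkan.h>

struct program;

struct cached_set {
   VkDescriptorSet set;
   uint32_t state_hash; /* hash of the descriptor state the set was written with */
   bool in_use;         /* still referenced by an unfinished command buffer */
};

/* stand-in helpers */
struct cached_set *lookup_valid_set(struct program *prog, uint32_t state_hash);
struct cached_set *pop_invalidated_set(struct program *prog);
struct cached_set *allocate_new_set(struct program *prog);
void write_descriptors(struct cached_set *set, uint32_t state_hash);

static struct cached_set *
get_descriptor_set(struct program *prog, uint32_t state_hash)
{
   /* reuse a matching, still-valid set: no descriptor writes needed at all */
   struct cached_set *match = lookup_valid_set(prog, state_hash);
   if (match)
      return match;

   /* otherwise recycle an invalidated set, or allocate a fresh one */
   struct cached_set *set = pop_invalidated_set(prog);
   if (!set)
      set = allocate_new_set(prog); /* from the program's descriptor pool */

   write_descriptors(set, state_hash);
   set->state_hash = state_hash;
   return set;
}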

In this way, I’d get to conserve some sets and reuse them across draw calls even between different command buffers since I could track whether they were in use and avoid modifying them in any way. I’d also get to remove any tracking for descriptorset usage on batches, thereby removing possible queue submissions there. Any time resources in a set were destroyed, I could keep references to the sets on the resources and then invalidate the sets, returning them to the unused pool.

The results were great: desccache1.png

21 fps now, which is up another 3 from before.

Cache Improvements

Next I started investigating my cache implementation. There was a lot of hashing going on, as I was storing both the in-use sets as well as the unused sets (the valid and invalidated) based on the hash calculated for their descriptor usage, so I decided to try moving just the invalidated sets into an array as they no longer had a valid hash anyway, thereby giving quicker access to sets I knew to be free.

This would also help with my next plan, but again, the results were promising:

desccache2.png

Now I was at 23 fps, which is another 10% from the last changes, and just from removing some of the hashing going on.

This is like shooting fish in a barrel now.

Naturally at this point I began to, as all developers do, consider bucketing my allocations, since I’d seen in my profiling that some of these programs were allocating thousands of sets to submit simultaneously across hundreds of draws. I ended up using a scaling factor here so that programs would initially begin allocating in tens of sets, scaling up by a factor of ten every time it reached that threshold (i.e., once 100 sets were allocated, it begins allocating 100 at a time).
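
The scaling heuristic amounts to something like this (my own illustration of the idea, not the actual allocator code):

/* Illustration: start by allocating 10 descriptor sets at a time, and grow the
 * bucket by a factor of ten each time the total allocated count reaches ten
 * times the starting bucket (10 at a time, then 100, then 1000, ...). */
static unsigned
bucket_size(unsigned num_allocated_sets)
{
   unsigned size = 10;
   while (num_allocated_sets >= size * 10)
      size *= 10;
   return size;
}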

This didn’t have any discernible effect on the fps, but there were certainly fewer allocation calls going on, so I imagine the results will show up somewhere else.

And Then I Thought About It

Because sure, my efficient, possibly-overengineered descriptorset caching mechanism for reusing sets across draws and batches was cool, and it worked great, but was the overhead of all the hashing involved actually worse for performance than just using a dumb bucket allocator to set up and submit the same sets multiple times even in the same command buffer?

I’m not one of those types who refuses to acknowledge that other ideas can be better than the ones I personally prefer, so I smashed all of my descriptor info hashing out of the codebase and just used an array for storing unused sets. So now the mechanism looked like this:
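
In rough, hypothetical terms (stand-in names again, not the real code):

#include <vulkan/vulkan.h>

struct program;

VkDescriptorSet pop_unused_set(struct program *prog);      /* plain array, O(1) */
VkDescriptorSet bucket_allocate_set(struct program *prog); /* bucket allocator */

/* Sketch: no hashing at all — every draw grabs an unused set (or allocates a
 * new one), writes its descriptors, and the set goes back into the unused
 * array once its command buffer's fence signals. */
static VkDescriptorSet
get_descriptor_set_simple(struct program *prog)
{
   VkDescriptorSet set = pop_unused_set(prog);
   if (set == VK_NULL_HANDLE)
      set = bucket_allocate_set(prog);
   /* the caller always calls vkUpdateDescriptorSets() before the draw */
   return set;
}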

But would this be better than the more specific caching I was already using? Well… desccache3.png

24 fps, so the short of it is yes. It was 1-2 fps faster across the board.

Conclusion

This is where I’m at now after spending some time also rewriting all the clear code again and fixing related regressions. The benchmark is up ~70% from where I started, and the gains just keep coming. I’ll post again about performance improvements in the future, but here’s a comparison to a native GL driver, namely IRIS:

reference.png

Zink is at a little under 50% of the performance here, up from around 25% when I started, though this gap still varies throughout other sections of the benchmark, dipping as low as 30% of the IRIS performance in some parts.

It’s progress.

September 24, 2020

XDC 2020 sponsors

Last week, X.Org Developers Conference 2020 was held online for the first time. This year, with all the COVID-19 situation that is affecting almost every country worldwide, the X.Org Foundation Board of Directors decided to make it virtual.

I love open-source conferences :-) They are great for networking, having fun with the rest of the community members, having really good technical discussions in the hallway track… and visiting a new place every year! Unfortunately, we couldn’t do any of that this time and we needed to look for an alternative… going virtual being the obvious one.

The organization team at Intel, led by Radoslaw Szwichtenberg and Martin Peres, analyzed the different open-source alternatives for organizing XDC 2020 in a virtual manner. Finally, due to the setup requirements and the possibility of having more than 200 attendees connected to the video stream at the same time (Big Blue Button doesn’t recommend more than 100 simultaneous users), they selected Jitsi for speakers + Youtube for streaming/recording + IRC for questions. Arkadiusz Hiler summarized very well what they did from the A/V technical point of view and what the experience of hosting a virtual XDC was like.

I’m very happy with the final result given the special situation this year: the streaming was flawless, we had almost no technical issues (except one audio issue in the opening session… just like at physical ones! :-D), and IRC turned out to be very active during the conference. Thanks a lot to the organizers for their great job!

However, there is always room for improvements. Therefore, the X.Org Foundation board is asking for feedback, please share with us your opinion on XDC 2020!

Just focusing on my own experience, it was very good. I greatly enjoyed the talks presented this year and the interesting discussions happening in IRC. I would like to highlight the four talks presented by my colleagues at Igalia :-D

This year I was also a speaker! I presented “Improving Khronos CTS tests with Mesa code coverage” talk (watch recording), where I explained how we can improve the VK-GL-CTS quality by leveraging Mesa code coverage using open-source tools.

My XDC 2020 talk

I’m looking forward to attending X.Org Developers Conference 2021! But first, we need an organizer! Requests For Proposals for hosting XDC2021 are now open!

See you next year!

Finally, Performance

For a long time now, I’ve been writing about various work I’ve done in the course of getting to GL 4.6. This has generally been feature implementation work with an occasional side of bug hunting and fixing and so I haven’t been too concerned about performance.

I’m not done with feature work. There’s still tons of things missing that I’m planning to work on.

I’m not done with bug hunting or fixing. There’s still tons of bugs (like how currently spec@!opengl 1.1@streaming-texture-leak ooms my system and crashes all the other tests trying to run in parallel) that I’m going to fix.

But I wanted a break, and I wanted to learn some new parts of the graphics pipeline instead of just slapping more extensions in.

For the moment, I’ve been focusing on the Unigine Heaven benchmark since there’s tons of room for improvement, though I’m planning to move on from that once I get bored and/or hit a wall. Here’s my starting point, which is taken from the patch in my branch with the summary zink: add batch flag to determine if batch is currently in renderpass, some 300ish patches ahead of the main branch’s tip:

start.png

This is 14 fps running as ./heaven_x64 -project_name Heaven -data_path ../ -engine_config ../data/heaven_4.0.cfg -system_script heaven/unigine.cpp -sound_app openal -video_app opengl -video_multisample 0 -video_fullscreen 0 -video_mode 3 -extern_define ,RELEASE,LANGUAGE_EN,QUALITY_LOW,TESSELLATION_DISABLED -extern_plugin ,GPUMonitor, and I’m going to be posting screenshots from roughly the same point in the demo as I progress to gauge progress.

Is this an amazing way to do a benchmark?

No.

Is it a quick way to determine if I’m making things better or worse right now?

Given the size of the gains I’m making, absolutely.

Let’s begin.

Architecture

Now that I’ve lured everyone in with promises of gains and a screenshot with an fps counter, what I really want to talk about is code.

In order to figure out the most significant performance improvements for zink, it’s important to understand the architecture. At the point when I started, zink’s batches (C name for an object containing a command buffer and a fence as well as references to all the objects submitted to the queue for lifetime validation) worked like this for draw commands:

  • there are 4 batches (and 1 compute batch, but that’s out of scope)
  • each batch has 1 command buffer
  • each batch has 1 descriptor pool
  • each batch can allocate at most 1000 descriptor sets
  • each descriptor set is a fixed size of 1000 of each type of supported descriptor (UBO, SSBO, samplers, uniform and storage texel buffers, images)
  • each batch, upon reaching 1000 descriptor sets, will automatically submit its command buffer and cycle to the next batch
  • each batch has 1 fence
  • each batch, before being reused, waits on its fence and then resets its state
  • each batch, during its state reset, destroys all its allocated descriptor sets
  • renderpasses are started and ended on a given batch as needed
  • renderpasses, when ended, trigger queue submission for the given command buffer
  • renderpasses allocate their descriptor set on-demand just prior to the actual vkCmdBeginRenderPass call
  • renderpasses, during corresponding descriptor set updates, trigger memory barriers on all descriptor resources for their intended usage
  • pipelines are cached at runtime based on all the metadata used to create them, but we don’t currently do any on-disk pipeline caching

This is a lot to take in, so I’ll cut to some conclusions that I drew from these points:

  • sequential draws cannot occur on a given batch unless none of the resources used by its descriptor sets require memory barriers
    • because barriers cannot be submitted during a renderpass
    • this is very unlikely
    • therefore, each batch contains exactly 1 draw command
  • each sequential draw after the fourth one causes explicit fence waiting
    • batches must reset their state before being reused, and prior to this they wait on their fence
    • there are 4 batches
    • this means that any time a frame contains more than 4 draws, zink is pausing to wait for batches to finish so it can get a valid command buffer to use
    • the Heaven benchmark contains hundreds of draw commands per frame
    • yikes
  • definitely there could be better barrier usage
    • a barrier should only be necessary for zink’s uses when: transitioning a resource from write -> read, write -> write, or changing image layouts, and for other cases we can just track the usage in order to trigger a barrier later when one of these conditions is met
  • the pipe_context::flush hook in general is very, very bad and needs to be avoided
    • we use this basically every other line like we’re building a jenga tower
    • sure would be a disaster if someone were to try removing these calls
  • probably we could do some disk caching for pipeline objects so that successive runs of applications could use pre-baked objects and avoid needing to create any
  • descriptor set allocation is going to be a massive performance hit for any application which does lots of draws per frame since each draw command allocates its own (huge) descriptor set
  • the 1000 descriptor set limit is going to be hit constantly for any application which does lots of draws per frame

There’s a lot more I could go into here, but this is already a lot.

Removing Renderpass Submission

I decided to start here since it was easy:

diff --git a/src/gallium/drivers/zink/zink_context.c b/src/gallium/drivers/zink/zink_context.c
index a9418430bb7..f07ae658115 100644
--- a/src/gallium/drivers/zink/zink_context.c
+++ b/src/gallium/drivers/zink/zink_context.c
@@ -800,12 +800,8 @@ struct zink_batch *
 zink_batch_no_rp(struct zink_context *ctx)
 {
    struct zink_batch *batch = zink_curr_batch(ctx);
-   if (batch->in_rp) {
-      /* flush batch and get a new one */
-      flush_batch(ctx);
-      batch = zink_curr_batch(ctx);
-      assert(!batch->in_rp);
-   }
+   zink_end_render_pass(ctx, batch);
+   assert(!batch->in_rp);
    return batch;
 }

Amazing, I know. Let’s see how much the fps changes:

norp.png

15?! Wait a minute. That’s basically within the margin of error!

It is actually a consistent 1-2 fps gain, even a little more in some other parts, but it seemed like it should’ve been more now that all the command buffers are being gloriously saturated, right?

Well, not exactly. Here’s a fun bit of code from the descriptor updating function:

struct zink_batch *batch = zink_batch_rp(ctx);
unsigned num_descriptors = ctx->curr_program->num_descriptors;
VkDescriptorSetLayout dsl = ctx->curr_program->dsl;

if (batch->descs_left < num_descriptors) {
   ctx->base.flush(&ctx->base, NULL, 0);
   batch = zink_batch_rp(ctx);
   assert(batch->descs_left >= num_descriptors);
}

Right. The flushing continues. And while I’m here, what does zink’s pipe_context::flush hook even look like again?

static void
zink_flush(struct pipe_context *pctx,
           struct pipe_fence_handle **pfence,
           enum pipe_flush_flags flags)
{
   struct zink_context *ctx = zink_context(pctx);

   struct zink_batch *batch = zink_curr_batch(ctx);
   flush_batch(ctx);
...
   /* HACK:
    * For some strange reason, we need to finish before presenting, or else
    * we start rendering on top of the back-buffer for the next frame. This
    * seems like a bug in the DRI-driver to me, because we really should
    * be properly protected by fences here, and the back-buffer should
    * either be swapped with the front-buffer, or blitted from. But for
    * some strange reason, neither of these things happen.
    */
   if (flags & PIPE_FLUSH_END_OF_FRAME)
      pctx->screen->fence_finish(pctx->screen, pctx,
                                 (struct pipe_fence_handle *)batch->fence,
                                 PIPE_TIMEOUT_INFINITE);
}

Oh. So really every time zink “finishes” a frame in this benchmark (which has already stalled hundreds of times up to this point), it then waits on that frame to finish instead of letting things outside the driver worry about that.

Borrowing Code

It was at this moment that a dim spark flickered to life in my memories, reminding me of the in-progress MR from Antonio Caggiano for caching surfaces on batches. In particular, it reminded me that his series has a patch which removes the above monstrosity.

Let’s see what happens when I add those patches in:

nofence.png

15.

Again.

I expected a huge performance win here, but it seems that we still can’t fully utilize all these changes and are still stuck at 15 fps. Every time descriptors are updated, the batch ends up hitting that arbitrary 1000 descriptor set limit, and then it submits the command buffer, so there’s still multiple batches being used for each frame.

Getting Mad

So naturally at this point I tried increasing the limit.

Then I increased it again.

And again.

And now I had exactly one flush per frame, but my fps was still fixed at a measly 15.

That’s when I decided to do some desk curls.

What happened next was shocking:

endpost1.png

18 fps.

It was a sudden 20% fps gain, but it was only the beginning.

More on this tomorrow.

September 23, 2020

Adventures In Blending

For the past few days, I’ve been trying to fix a troublesome bug. Specifically, the Unigine Heaven benchmark wasn’t drawing most textures in color, and this was hampering my ability to make further claims about zink being the fastest graphics driver in the history of software since it’s not very impressive to be posting side-by-side screenshots that look like garbage even if the FPS counter in the corner is higher.

Thus I embarked on adventure.

First: The Problem

heaven-pre.png

This was the starting point. The thing ran just fine, but without valid frames drawn, it’s hard to call it a test of very much other than how many bugs I can pack into a single frame.

Naturally I assumed that this was going to be some bug in zink’s handling of something, whether it was blending, or sampling, or blending and sampling. I set out to figure out exactly how I’d screwed up.

Second: The Analysis

I had no idea what the problem was. I phoned Dr. Render, as we all do when facing issues like this, and I was told that I had problems.

heaven-renderdoc.png

Lots of problems.

The biggest problem was figuring out how to get anywhere with so many draw calls. Each frame consisted of 3 render passes (with hundreds of draws each) as well as a bunch of extra draws and clears.

There was a lot going on. This was by far the biggest thing I’d had to fix, and it’s much more difficult to debug a game-like application than it is a unit test. With that said, and since there’s not actually any documentation about “What do I do if some of my frame isn’t drawing with color?” for people working on drivers, here were some of the things I looked into:

  • Disabling Depth Testing

Just to check. On IRIS, which is my reference for these types of things, the change gave some neat results:

heaven-nodepth.png

How bout that.

On zink, I got the same thing, except there was no color, and it wasn’t very interesting.

  • Checking sampler resources for depth buffers

On an #executive suggestion, I looked into whether a z/s buffer had snuck into my sampler buffers and was thus providing bogus pixel data.

It hadn’t.

  • Checking Fragment Shader Outputs

This was a runtime version of my usual shader debugging, wherein I try to isolate the pixels in a region to a specific color based on a conditional, which then lets me determine which path in the shader is broken. To do this, I added a helper function in ntv:

static SpvId
clobber(struct ntv_context *ctx)
{
   SpvId type = get_fvec_type(ctx, 32, 4);
   SpvId vals[] = {
      emit_float_const(ctx, 32, 1.0),
      emit_float_const(ctx, 32, 0.0),
      emit_float_const(ctx, 32, 0.0),
      emit_float_const(ctx, 32, 1.0)
    };
   printf("CLOBBERING\n");
   return spirv_builder_emit_composite_construct(&ctx->builder, type, vals, 4);
}

This returns a vec4 of the color RED, and I cleverly stuck it at the end of emit_store_deref() like so:

if (ctx->stage == MESA_SHADER_FRAGMENT && var->data.location == FRAG_RESULT_DATA0 && match)
   result = clobber(ctx);

match in this case is set based on this small block at the very start of ntv:

if (s->info.stage == MESA_SHADER_FRAGMENT) {
   const char *env = getenv("TEST_SHADER");
   match = env && s->info.name && !strcmp(s->info.name, env);
}

Thus, I could set my environment in gdb with e.g., set env TEST_SHADER=GLSL271 and then zink would swap the output of the fragment shader named GLSL271 to RED, which let me determine what various shaders were being used for. When I found the shader used for the lamps, things got LIT:

heaven-lamps.png

But ultimately, even though I did find the shaders that were being used for the more general material draws, this ended up being another dead end.

  • Verifying Blend States

This took me the longest since I had to figure out a way to match up the Dr. Render states to the runtime states that I could see. I eventually settled on adding breakpoints based on index buffer size, as the chart provided by Dr. Render had this in the vertex state, which made things simple.

But alas, zink was doing all the right blending too.

  • Complaining

As usual, this was my last resort, but it was also my most powerful weapon that I couldn’t abuse too frequently, lest people come to the conclusion that I don’t actually know what I’m doing.

Which I definitely do.

And now that I’ve cleared up any misunderstandings there, I’m not ashamed to say that I went to #intel-3d to complain that Dr. Render wasn’t giving me any useful draw output for most of the draws under IRIS. If even zink can get some pixels out of a draw, then a more compliant driver like IRIS shouldn’t be having issues here.

I wasn’t wrong.

The Magic Of Dual Blending

It turns out that the Heaven benchmark is buggy and expects the D3D semantics for dual blending. Mesa knows this and lets drivers enable a workaround if they have the need, specifically dual_color_blend_by_location=true, which informs the driver that it needs to adjust the Location and Index of gl_FragData[1] from D3D semantics to OpenGL/Vulkan.

As usual, the folks at Intel with their encyclopedic knowledge were quick to point out the exact problem, which then just left me with the relatively simple tasks of:

  • hooking zink up to the driconf build
  • checking driconf at startup so zink can get info on these application workarounds/overrides
  • adding shader keys for forcing the dual blending workaround
  • writing a NIR pass to do the actual work

The result is not that interesting, but here it is anyway:

static bool
lower_dual_blend(nir_shader *shader)
{
   bool progress = false;
   nir_variable *var = nir_find_variable_with_location(shader, nir_var_shader_out, FRAG_RESULT_DATA1);
   if (var) {
      var->data.location = FRAG_RESULT_DATA0;
      var->data.index = 1;
      progress = true;
   }
   nir_shader_preserve_all_metadata(shader);
   return progress;
}

In short, D3D expects to blend two outputs based on their locations, but in Vulkan and OpenGL, the blending is based on index. So here, I’ve just changed the location of gl_FragData[1] to match gl_FragData[0] and then incremented the index, because Fragment outputs identified with an Index of zero are directed to the first input of the blending unit associated with the corresponding Location. Outputs identified with an Index of one are directed to the second input of the corresponding blending unit.

And now nice things can be had:

heaven-post.png

Tune in tomorrow when I strap zink to a rocket and begin counting down to blastoff.

September 21, 2020

Viewporting

In Vulkan, a pipeline object is bound to the graphics pipeline for a given command buffer when a draw is about to take place. This pipeline object contains information about the draw state, and any time that state changes, a different pipeline object must be created/bound.

This is expensive.

Some time ago, Antonio Caggiano did some work to cache pipeline objects, which lets zink reuse them once they’re created. This was great, because creating Vulkan objects is very costly, and we want to always be reusing objects whenever possible.

Unfortunately, the core Vulkan spec has the number of viewports and scissor regions as part of the pipeline state, which means that any time either count changes (though viewport and scissor region counts are the same for our purposes), we need a new pipeline.

Extensions to the Rescue

VK_EXT_extended_dynamic_state adds functionality to avoid this performance issue. When supported, the pipeline object is created with zero as the count of viewport and scissor regions, and then vkCmdSetViewportWithCountEXT and vkCmdSetScissorWithCountEXT can be called just before draw to ram these state updates into the command buffer without ever needing a different pipeline object.
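
At draw time that looks roughly like the following sketch (illustrative, not the exact zink code), assuming the pipeline was created with VK_DYNAMIC_STATE_VIEWPORT_WITH_COUNT_EXT and VK_DYNAMIC_STATE_SCISSOR_WITH_COUNT_EXT in its dynamic state:

#include <vulkan/vulkan.h>

/* Sketch: with VK_EXT_extended_dynamic_state, the viewport/scissor counts live
 * in the command buffer, so changing the number of regions never requires a
 * different pipeline object. */
static void
set_viewport_states(VkCommandBuffer cmdbuf,
                    const VkViewport *viewports,
                    const VkRect2D *scissors,
                    uint32_t count)
{
   vkCmdSetViewportWithCountEXT(cmdbuf, count, viewports);
   vkCmdSetScissorWithCountEXT(cmdbuf, count, scissors);
   /* ...vkCmdDraw*() can follow without binding a new pipeline */
}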

Longer posts later in the week; I’m in the middle of a construction zone for the next few days, and it’s less hospitable than I anticipated.

September 18, 2020

Blog Returns

Once again, I ended up not blogging for most of the week. When this happens, there’s one of two possibilities: I’m either taking a break or I’m so deep into some code that I’ve forgotten about everything else in my life including sleep.

This time was the latter. I delved into the deepest parts of zink and discovered that the driver is, in fact, functioning only through a combination of sheer luck and a truly unbelievable amount of driver stalls that provide enough forced synchronization and slow things down enough that we don’t explode into a flaming mess every other frame.

Oops.

I’ve fixed all of the crazy things I found, and, in the process, made some sizable performance gains that I’m planning to spend a while blogging about in considerable depth next week.

And when I say sizable, I’m talking in the range of 50-100% fps gains.

But it’s Friday, and I’m sure nobody wants to just see numbers or benchmarks. Let’s get into something that’s interesting on a technical level.

Samplers

Yes, samplers.

In Vulkan, samplers have a lot of rules to follow. Specifically, I’m going to be examining part of the spec that states “If a VkImageView is sampled with VK_FILTER_LINEAR as a result of this command, then the image view’s format features must contain VK_FORMAT_FEATURE_SAMPLED_IMAGE_FILTER_LINEAR_BIT”.

This is a problem for zink. Gallium gives us info about the sampler in the struct pipe_context::create_sampler_state hook, but the created sampler won’t actually be used until draw time. As a result, there’s no way to know which image is going to be sampled, and thus there’s no way to know what features the sampled image’s format flags will contain. This only becomes known at the time of draw.

The way I saw it, there were two options:

  • Dynamically create the sampler just before draw and swizzle between LINEAR and NEAREST based on the format features
  • Create both samplers immediately when LINEAR is passed and swizzle between them at draw time

In theory, the first option is probably more performant in the best case scenario where a sampler is only ever used with a single image, as it would then only ever create a single sampler object.

Unfortunately, this isn’t realistic. Just as an example, u_blitter creates a number of samplers up front, and then it also makes assumptions about filtering based on ideal operations which may not be in sync with the underlying Vulkan driver’s capabilities. So for these persistent samplers, the first option may initially allow the sampler to be created with LINEAR filtering, but it may later then be used for an image which can’t support it.

So I went with the second option. Now any time a LINEAR sampler is created by gallium, we’re actually creating both types so that the appropriate one can be used, ensuring that we can always comply with the spec and avoid any driver issues.
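
A minimal sketch of that approach, simplified from what the driver actually does (hypothetical helpers, error handling omitted):

#include <vulkan/vulkan.h>

struct sampler_pair {
   VkSampler linear;
   VkSampler nearest;
};

/* Sketch: when gallium asks for a LINEAR sampler, create both filter variants
 * so the draw-time code can always pick one the sampled image supports. */
static void
create_sampler_pair(VkDevice dev, VkSamplerCreateInfo info, struct sampler_pair *out)
{
   info.magFilter = info.minFilter = VK_FILTER_LINEAR;
   vkCreateSampler(dev, &info, NULL, &out->linear);

   info.magFilter = info.minFilter = VK_FILTER_NEAREST;
   vkCreateSampler(dev, &info, NULL, &out->nearest);
}

/* at draw time, once the sampled image's format features are known */
static VkSampler
pick_sampler(const struct sampler_pair *pair, VkFormatFeatureFlags features)
{
   return (features & VK_FORMAT_FEATURE_SAMPLED_IMAGE_FILTER_LINEAR_BIT) ?
          pair->linear : pair->nearest;
}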

Hooray.

September 14, 2020

Hoo Boy

Let’s talk about ARB_shader_draw_parameters. Specifically, let’s look at gl_BaseVertex.

In OpenGL, this shader variable’s value depends on the parameters passed to the draw command, and the value is always zero if the command has no base vertex.

In Vulkan, the value here is only zero if the first vertex is zero.

The difference here means that for arrayed draws without base vertex parameters, GL always expects zero, and Vulkan expects first vertex.

Hooray.

Compatibilizing

The easiest solution here would be to just throw a shader key at the problem, producing variants of the shader for use with indexed vs non-indexed draws, and using NIR passes to modify the variables for the non-indexed case and zero the value. It’s quick, it’s easy, and it’s not especially great for performance since it requires compiling the shader multiple times and creating multiple pipeline objects.

This is where push constants come in handy once more.

Avid readers of the blog will recall the last time I used push constants was for TCS injection when I needed to generate my own TCS and have it read the default inner/outer tessellation levels out of a push constant.

Since then, I’ve created a struct to track the layout of the push constant:

struct zink_push_constant {
   unsigned draw_mode_is_indexed;
   float default_inner_level[2];
   float default_outer_level[4];
};

Now just before draw, I update the push constant value for draw_mode_is_indexed:

if (ctx->gfx_stages[PIPE_SHADER_VERTEX]->nir->info.system_values_read & (1ull << SYSTEM_VALUE_BASE_VERTEX)) {
   unsigned draw_mode_is_indexed = dinfo->index_size > 0;
   vkCmdPushConstants(batch->cmdbuf, gfx_program->layout, VK_SHADER_STAGE_VERTEX_BIT,
                      offsetof(struct zink_push_constant, draw_mode_is_indexed), sizeof(unsigned),
                      &draw_mode_is_indexed);
}

And now the shader can be made aware of whether the draw mode is indexed.

Now comes the NIR, as is the case for most of this type of work.

static bool
lower_draw_params(nir_shader *shader)
{
   if (shader->info.stage != MESA_SHADER_VERTEX)
      return false;

   if (!(shader->info.system_values_read & (1ull << SYSTEM_VALUE_BASE_VERTEX)))
      return false;

   return nir_shader_instructions_pass(shader, lower_draw_params_instr, nir_metadata_dominance, NULL);
}

This is the future, so I’m now using Eric Anholt’s recent helper function to skip past iterating over the shader’s function/blocks/instructions, instead just passing the lowering implementation as a parameter and letting the helper create the nir_builder for me.

static bool
lower_draw_params_instr(nir_builder *b, nir_instr *in, void *data)
{
   if (in->type != nir_instr_type_intrinsic)
      return false;
   nir_intrinsic_instr *instr = nir_instr_as_intrinsic(in);
   if (instr->intrinsic != nir_intrinsic_load_base_vertex)
      return false;

I’m filtering out everything except for nir_intrinsic_load_base_vertex here, which is the instruction for loading gl_BaseVertex.

   
   b->cursor = nir_after_instr(&instr->instr);

I’m modifying instructions after this one, so I set the cursor after.

   nir_intrinsic_instr *load = nir_intrinsic_instr_create(b->shader, nir_intrinsic_load_push_constant);
   load->src[0] = nir_src_for_ssa(nir_imm_int(b, 0));
   nir_intrinsic_set_range(load, 4);
   load->num_components = 1;
   nir_ssa_dest_init(&load->instr, &load->dest, 1, 32, "draw_mode_is_indexed");
   nir_builder_instr_insert(b, &load->instr);

I’m loading the first 4 bytes of the push constant variable that I created according to my struct, which is the draw_mode_is_indexed value.

   nir_ssa_def *composite = nir_build_alu(b, nir_op_bcsel,
                                          nir_build_alu(b, nir_op_ieq, &load->dest.ssa, nir_imm_int(b, 1), NULL, NULL),
                                          &instr->dest.ssa,
                                          nir_imm_int(b, 0),
                                          NULL);

This adds a new ALU instruction of type bcsel, AKA the ternary operator (condition ? true : false). The condition here is another ALU of type ieq, AKA integer equals, and I’m testing whether the loaded push constant value is equal to 1. If true, this is an indexed draw, so I continue using the loaded gl_BaseVertex value. If false, this is not an indexed draw, so I need to use zero instead.

   nir_ssa_def_rewrite_uses_after(&instr->dest.ssa, nir_src_for_ssa(composite), composite->parent_instr);

With my bcsel composite gl_BaseVertex value constructed, I can now rewrite all subsequent uses of gl_BaseVertex in the shader to use the composite value, which will automatically swap between the Vulkan gl_BaseVertex and zero based on the value of the push constant without the need to rebuild the shader or make a new pipeline.

   return true;
}

And now the shader gets the expected value and everything works.

Billy Mays

It’s also worth pointing out here that gl_DrawID from the same extension has a similar problem: gallium doesn’t pass multidraws in full to the driver, instead iterating for each draw, which means that the shader value is never what’s expected either. I’ve employed a similar trick to jam the draw index into the push constant and read that back in the shader to get the expected value there too.

Extensions.

September 10, 2020

In an ideal world, every frame your application draws would appear on the screen exactly on time. Sadly, as anyone living in the year 2020 CE can attest, this is far from an ideal world. Sometimes the scene gets more complicated and takes longer to draw than you estimated, and sometimes the OS scheduler just decides it has more important things to do than pay attention to you.

When this happens, for some applications, it would be best if you could just get the bits on the screen as fast as possible rather than wait for the next vsync. The Present extension for X11 has an option to let you do exactly this:

If 'options' contains PresentOptionAsync, and the 'target-msc'
is less than or equal to the current msc for 'window', then
the operation will be performed as soon as possible, not
necessarily waiting for the next vertical blank interval. 

But you don't use Present directly, usually; Present is the mechanism GLX and Vulkan use to put bits on the screen. So, today I merged some code to Mesa to enable the corresponding features in those APIs, namely GLX_EXT_swap_control_tear and VK_PRESENT_MODE_FIFO_RELAXED_KHR. If all goes well these should be included in Mesa 21.0, with a backport to 20.2.x not out of the question. As the GLX extension name suggests, this can introduce some visual tearing when the buffer swap does come in late, but for fullscreen games or VR displays that can be an acceptable tradeoff in exchange for reduced stuttering.
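
From an application's point of view, opting in looks roughly like the snippets below (illustrative fragments; extension and error checks omitted): a negative swap interval with GLX_EXT_swap_control_tear, or the relaxed FIFO present mode when creating a Vulkan swapchain.

/* GLX: a negative interval means "sync to vblank, but tear if the swap is
 * late"; requires GLX_EXT_swap_control_tear in the extension string. */
glXSwapIntervalEXT(dpy, drawable, -1);

/* Vulkan: ask for the relaxed FIFO mode, after confirming the surface lists
 * it in vkGetPhysicalDeviceSurfacePresentModesKHR(). */
VkSwapchainCreateInfoKHR sci = {
   .sType = VK_STRUCTURE_TYPE_SWAPCHAIN_CREATE_INFO_KHR,
   /* ...surface, format, extent, image usage, etc... */
   .presentMode = VK_PRESENT_MODE_FIFO_RELAXED_KHR,
};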