January 14, 2021
The open source Panfrost driver for Arm Mali Midgard and Bifrost GPUs now provides non-conformant OpenGL ES 3.0 on Bifrost and desktop OpenGL 3.1 on Midgard (Mali T760 and newer) and Bifrost, in time for Mesa’s first release of 2021. This follows the OpenGL ES 3.0 support on Midgard that landed over the summer, as well as the initial OpenGL ES 2.0 support that recently debuted for Bifrost. OpenGL ES 3.0 is now tested on Mali G52 in Mesa’s continuous integration, achieving a 99.9% pass rate on the corresponding drawElements Quality Program tests.
Architecturally, Bifrost shares most of its fixed-function data structures with Midgard, but features a brand new instruction set. Our work for bringing up OpenGL ES 3.0 on Bifrost reflects this division. Some fixed-function features, like instancing and transform feedback, worked without any Bifrost-specific changes since we already did bring-up on Midgard. Other shader features, like uniform buffer objects, required “from scratch” implementations in the Bifrost compiler, a task facilitated by the compiler’s maturing intermediate representation with first-class builder support. Yet other features like multiple render targets required some Bifrost-specific code while leveraging other code shared with Midgard. All in all, the work progressed much more quickly the second time around, a testament to the power of code sharing. But there is no need to limit sharing to just Panfrost GPUs; open source drivers can share code across vendors.
Indeed, since Mali is an embedded GPU, the proprietary driver only exposes OpenGL ES, not desktop OpenGL. However, desktop OpenGL 3.1 support comes nearly “for free” for us as an upstream Mesa driver by leveraging common infrastructure. This milestone shows the technical advantage of open source development: Compared to layered implementations of desktop GL like gl4es or Zink, Panfrost’s desktop OpenGL support is native, reducing CPU overhead. Furthermore, applications can make use of the hardware’s hidden features, like explicit primitive restart indices, alpha testing, and quadrilaterals. Although these features could be emulated, the native solutions are more efficient.
Mesa’s shared code also extends to OpenCL support via Clover. Once a driver supports compute shaders and sufficient compiler features, baseline OpenCL is just a few patches and a bug-fixing spree away. While OpenCL implementations could be layered (for example with clvk), an open source Mesa driver avoids the indirection.
I would like to thank Collaboran Boris Brezillon, who has worked tirelessly to bring OpenGL ES 3.0 support to Bifrost, as well as the prolific Icecream95, who has spearheaded OpenCL and desktop OpenGL support.
Originally posted on Collabora’s blog
There’s been a lot of talk about driver overhead in the Mesa community as of late, in large part begun by Marek Olšák and his daredevil stunts driving RadeonSI through flaming hoops while juggling chainsaws.
While zink isn’t quite at that level yet (and neither am I), there’s still some progress being made that I’d like to dig into a bit.
As in all software, overhead is the performance penalty that is incurred as compared to a baseline measurement. In Mesa, a lot of people know of driver overhead as “Gallium sucks” and/or “A Gallium-based driver is slow” due to the fact that Gallium does incur some amount of overhead as compared to the old-style immediate mode DRI drivers.
While it’s true that there is an amount of performance lost by using Gallium in this sense, it’s also true that the performance gained is much greater. The reason for this is that Gallium is able to batch commands and state changes for every driver using it, allowing redundant calls to avoid triggering any work in the GPU.
It also makes for an easier time profiling and improving upon the CPU usage that’s required to handle the state changes emitted by Gallium. Instead of having a ton of core Mesa callbacks which need to be handled, each one potentially leading to a no-op that can be analyzed and deferred by the driver, Gallium provides a more cohesive API where each driver hook is a necessary change that must be handled. Because of this, the job of optimizing for those changes is simplified.
Other than the obvious method of running apps on a driver and checking the fps counter, piglit provides a facility for this: the drawoverhead
test. This test has over a hundred subtests which perform sequences of draw operations with various state changes, each with its own result relative to a baseline, enabling a developer to easily profile and optimize a given codepath.
So how does zink stack up? To answer this, let’s look at some preliminary results from zink in master, the code which will soon be shipping in Mesa 21.0.0. All numbers here are, in contrast to my usual benchmarking, done on an AMD 5700XT GPU. More on this later.
ZINK: MASTER
#, Test name , Thousands draws/s, Difference vs the 1st
1, DrawElements ( 1 VBO| 0 UBO| 0 ) w/ no state change, 818, 100.0%
2, DrawElements ( 4 VBO| 0 UBO| 0 ) w/ no state change, 686, 83.9%
3, DrawElements (16 VBO| 0 UBO| 0 ) w/ no state change, 411, 50.3%
4, DrawElements ( 1 VBO| 0 UBO| 16 Tex) w/ no state change, 232, 28.4%
5, DrawElements ( 1 VBO| 8 UBO| 8 Tex) w/ no state change, 258, 31.5%
6, DrawElements ( 1 VBO| 8 UBO| 8 Tex) w/ shader program change, 87, 10.7%
7, DrawElements ( 1 VBO| 8 UBO| 8 Tex) w/ vertex attrib change, 162, 19.9%
8, DrawElements ( 1 VBO| 8 UBO| 8 Tex) w/ 1 texture change, 150, 18.3%
9, DrawElements ( 1 VBO| 8 UBO| 8 Tex) w/ 8 textures change, 120, 14.7%
10, DrawElements ( 1 VBO| 8 UBO| 8 TBO) w/ 1 TBO change, 192, 23.5%
11, DrawElements ( 1 VBO| 8 UBO| 8 TBO) w/ 8 TBOs change, 146, 17.9%
After this point, the test aborts because shader images are not yet implemented, but it’s enough for a baseline.
These numbers are…not great. Primarily, at least to start, I’ll be focusing on the first row where zink is performing 818,000 draws per second.
Let’s check out some performance from zink-wip (20201230 snapshot), specifically with GALLIUM_THREAD=0
set to disable threaded context. This means I’m adding in descriptor caching and unlimited command buffer counts (vs forcing a stall after every submit from the 4th one onwards to reset a batch):
ZINK: WIP (CACHED, NO THREAD)
#, Test name , Thousands draws/s, Difference vs the 1st
1, DrawElements ( 1 VBO| 0 UBO| 0 ) w/ no state change, 766, 100.0%
2, DrawElements ( 4 VBO| 0 UBO| 0 ) w/ no state change, 633, 82.6%
3, DrawElements (16 VBO| 0 UBO| 0 ) w/ no state change, 407, 53.1%
4, DrawElements ( 1 VBO| 0 UBO| 16 Tex) w/ no state change, 500, 65.3%
5, DrawElements ( 1 VBO| 8 UBO| 8 Tex) w/ no state change, 449, 58.6%
6, DrawElements ( 1 VBO| 8 UBO| 8 Tex) w/ shader program change, 85, 11.2%
7, DrawElements ( 1 VBO| 8 UBO| 8 Tex) w/ vertex attrib change, 235, 30.7%
8, DrawElements ( 1 VBO| 8 UBO| 8 Tex) w/ 1 texture change, 159, 20.8%
9, DrawElements ( 1 VBO| 8 UBO| 8 Tex) w/ 8 textures change, 128, 16.7%
10, DrawElements ( 1 VBO| 8 UBO| 8 TBO) w/ 1 TBO change, 179, 23.4%
11, DrawElements ( 1 VBO| 8 UBO| 8 TBO) w/ 8 TBOs change, 139, 18.2%
This is actually worse for a lot of cases!
But why is that?
It turns out that in the base draw case, descriptor caching and extra command buffers really need threaded context to pay off. There are sizable gains in the baseline texture cases (+100% or so each) and for vertex attribute changes (+50%), but fundamentally the overhead for the driver seems higher.
What happens if threading is enabled though?
ZINK: WIP (CACHED, THREAD)
#, Test name , Thousands draws/s, Difference vs the 1st
1, DrawElements ( 1 VBO| 0 UBO| 0 ) w/ no state change, 5206, 100.0%
2, DrawElements ( 4 VBO| 0 UBO| 0 ) w/ no state change, 5149, 98.9%
3, DrawElements (16 VBO| 0 UBO| 0 ) w/ no state change, 5187, 99.6%
4, DrawElements ( 1 VBO| 0 UBO| 16 Tex) w/ no state change, 5210, 100.1%
5, DrawElements ( 1 VBO| 8 UBO| 8 Tex) w/ no state change, 4684, 90.0%
6, DrawElements ( 1 VBO| 8 UBO| 8 Tex) w/ shader program change, 137, 2.6%
7, DrawElements ( 1 VBO| 8 UBO| 8 Tex) w/ vertex attrib change, 252, 4.8%
8, DrawElements ( 1 VBO| 8 UBO| 8 Tex) w/ 1 texture change, 243, 4.7%
9, DrawElements ( 1 VBO| 8 UBO| 8 Tex) w/ 8 textures change, 222, 4.3%
10, DrawElements ( 1 VBO| 8 UBO| 8 TBO) w/ 1 TBO change, 213, 4.1%
11, DrawElements ( 1 VBO| 8 UBO| 8 TBO) w/ 8 TBOs change, 208, 4.0%
Indeed, threading yields almost a 700% performance improvement for the baseline cases. It turns out that synchronously performing expensive tasks like computing hash values for descriptor sets is bad. Who could have guessed.
Looking at the other values, however, is a bit more pertinent for the purpose of this post. Overhead is incurred when state changes are triggered by descriptors being changed, and this is much closer to a real world scenario (i.e., gaming) than simply running draw calls with no changes. Caching yields roughly a 50% performance improvement for this case.
As I’d mentioned previously, I’m doing some work now on descriptor management with an aim to further lower this overhead. Let’s see what that looks like.
ZINK: TEST (UNCACHED, THREAD)
#, Test name , Thousands draws/s, Difference vs the 1st
1, DrawElements ( 1 VBO| 0 UBO| 0 ) w/ no state change, 5426, 100.0%
2, DrawElements ( 4 VBO| 0 UBO| 0 ) w/ no state change, 5423, 99.9%
3, DrawElements (16 VBO| 0 UBO| 0 ) w/ no state change, 5432, 100.1%
4, DrawElements ( 1 VBO| 0 UBO| 16 Tex) w/ no state change, 5246, 96.7%
5, DrawElements ( 1 VBO| 8 UBO| 8 Tex) w/ no state change, 5177, 95.4%
6, DrawElements ( 1 VBO| 8 UBO| 8 Tex) w/ shader program change, 153, 2.8%
7, DrawElements ( 1 VBO| 8 UBO| 8 Tex) w/ vertex attrib change, 229, 4.2%
8, DrawElements ( 1 VBO| 8 UBO| 8 Tex) w/ 1 texture change, 247, 4.6%
9, DrawElements ( 1 VBO| 8 UBO| 8 Tex) w/ 8 textures change, 228, 4.2%
10, DrawElements ( 1 VBO| 8 UBO| 8 TBO) w/ 1 TBO change, 237, 4.4%
11, DrawElements ( 1 VBO| 8 UBO| 8 TBO) w/ 8 TBOs change, 223, 4.1%
While there’s a small (~4%) improvement for the baseline numbers, what’s much more interesting is the values where descriptor states are changed. They are, in fact, about as good or even slightly better than the caching version of descriptor management.
This is huge. Specifically it’s huge because it means that I can likely port over some of the techniques used in this approach to the cached version in order to drive further reductions in overhead.
Before I go, let’s check out some numbers from a real driver. Specifically, RadeonSI: the pinnacle of Gallium-based drivers.
RADEONSI
#, Test name , Thousands draws/s, Difference vs the 1st
1, DrawElements ( 1 VBO| 0 UBO| 0 ) w/ no state change, 6221, 100.0%
2, DrawElements ( 4 VBO| 0 UBO| 0 ) w/ no state change, 6261, 100.7%
3, DrawElements (16 VBO| 0 UBO| 0 ) w/ no state change, 6236, 100.2%
4, DrawElements ( 1 VBO| 0 UBO| 16 Tex) w/ no state change, 6263, 100.7%
5, DrawElements ( 1 VBO| 8 UBO| 8 Tex) w/ no state change, 6243, 100.4%
6, DrawElements ( 1 VBO| 8 UBO| 8 Tex) w/ shader program change, 217, 3.5%
7, DrawElements ( 1 VBO| 8 UBO| 8 Tex) w/ vertex attrib change, 1467, 23.6%
8, DrawElements ( 1 VBO| 8 UBO| 8 Tex) w/ 1 texture change, 374, 6.0%
9, DrawElements ( 1 VBO| 8 UBO| 8 Tex) w/ 8 textures change, 218, 3.5%
10, DrawElements ( 1 VBO| 8 UBO| 8 TBO) w/ 1 TBO change, 680, 10.9%
11, DrawElements ( 1 VBO| 8 UBO| 8 TBO) w/ 8 TBOs change, 318, 5.1%
Yikes. Especially intimidating here is the relative performance for vertex attribute changes, where RadeonSI is able to retain almost 25% of its baseline performance relative to zink not even managing 5%.
Hopefully these figures get closer to each other in the future, but this just shows that there’s still a long way to go.
January 13, 2021
This post explains how to parse the HID Unit Global Item as explained by the HID Specification, page 37. The table there is quite confusing and it took me a while to fully understand it (Benjamin Tissoires was really the one who cracked it). I couldn't find any better explanation online which means either I'm incredibly dense and everyone's figured it out or no-one has posted a better explanation. On the off-chance it's the latter [1], here are the instructions on how to parse this item.
We know a HID Report Descriptor consists of a number of items that describe the content of each HID Report (read: an event from a device). These Items include things like Logical Minimum/Maximum for axis ranges, etc. A HID Unit item specifies the physical unit to apply. For example, a Report Descriptor may specify that X and Y axes are in mm which can be quite useful for all the obvious reasons.
Like most HID items, a HID Unit Item consists of a one-byte item tag and 1, 2 or 4 byte payload. The Unit item in the Report Descriptor itself has the binary value 0110 01nn where the nn is either 1, 2, or 3 indicating 1, 2 or 4 bytes of payload, respectively. That's standard HID.
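In code form (with prefix standing in for that raw item byte — a name I'm making up for illustration), the size bits decode like this:
/* short item prefix: bits 7-4 = tag (0110 = Unit), bits 3-2 = type (01 = Global),
 * bits 1-0 = size code, where 0, 1, 2, 3 mean 0, 1, 2, 4 bytes of payload */
static const unsigned payload_bytes[] = { 0, 1, 2, 4 };
unsigned size = payload_bytes[prefix & 0x3];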
The payload is divided into nibbles (4-bit units) and goes from LSB to MSB. The lowest-order 4 bits (first byte & 0xf) define the unit System to apply: one of SI Linear, SI Rotation, English Linear or English Rotation (well, or None/Reserved). The rest of the nibbles are in this order: "length", "mass", "time", "temperature", "current", "luminous intensity". In something resembling code this means:
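system             = nibble 0   (lowest 4 bits)
length             = nibble 1
mass               = nibble 2
time               = nibble 3
temperature        = nibble 4
current            = nibble 5
luminous intensity = nibble 6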
The System defines which unit is used for length (e.g. SILinear means length is in cm). The actual value of each nibble is the exponent for the unit in use [2]. In something resembling code:
system = value & 0xf
length_exponent = (value & 0xf0) >> 4
mass_exponent = (value & 0xf00) >> 8
time_exponent = (value & 0xf000) >> 12
...
switch (system)
case SILinear:
print("length is in cm^{length_exponent}");
break;
case SIRotation:
print("length is in rad^{length_exponent}");
break;
case EnglishLinear:
print("length is in in^{length_exponent}");
break;
case EnglishRotation:
print("length is in deg^{length_exponent}");
break;
case None:
case Reserved:
print("boo!");
break;
For example, the value 0x321 means "SI Linear" (0x1) so the remaining nibbles represent, in ascending nibble order: Centimeters, Grams, Seconds, Kelvin, Ampere, Candela. The length nibble has a value of 0x2 so it's square cm, the mass nibble has a value of 0x3 so it is cubic grams (well, it's just an example, so...). This means that any report containing this item comes in cm²g³. As a more realistic example: 0xF011 would be cm/s.
If we changed the lowest nibble to English Rotation (0x4), i.e. our value is now 0x324, the units represent: Degrees, Slug, Seconds, F, Ampere, Candela [3]. The length nibble 0x2 means square degrees, the mass nibble is cubic slugs. As a more realistic example, 0xF014 would be degrees/s.
Any nibble with value 0 means the unit isn't in use, so the example from the spec with value 0x00F0D121 is SI linear, units cm² g s⁻³ A⁻¹, which is... Voltage! Of course you knew that and totally didn't have to double-check with wikipedia.
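To make that concrete, here's a minimal, self-contained C sketch (purely illustrative, not taken from any HID library) that pulls the system nibble and the signed per-dimension exponents out of such a 32-bit Unit value:
#include <stdio.h>
#include <stdint.h>

/* sign-extend a two's complement 4-bit nibble, see [2] */
static int nibble_exponent(uint32_t value, unsigned nibble)
{
    int v = (value >> (nibble * 4)) & 0xf;
    return v >= 8 ? v - 16 : v;
}

int main(void)
{
    uint32_t unit = 0x00F0D121; /* the Voltage example from the spec */
    const char *dims[] = { "length", "mass", "time",
                           "temperature", "current", "luminous intensity" };

    printf("system nibble: 0x%x\n", unit & 0xf);
    for (unsigned i = 0; i < 6; i++) {
        int exp = nibble_exponent(unit, i + 1);
        if (exp)
            printf("%s exponent: %d\n", dims[i], exp);
    }
    return 0;
}
For SI Linear this prints exponents of 2, 1, -3 and -1 for length, mass, time and current — the cm² g s⁻³ A⁻¹ from above.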
Because bits are expensive and the base units are of course either too big or too small or otherwise not quite right, HID also provides a Unit Exponent item. The Unit Exponent item (a separate item to Unit in the Report Descriptor) then describes the exponent to be applied to the actual value in the report. For example, a Unit Exponent of -3 means 10⁻³ is applied to the value. If the report descriptor specifies an item of Unit 0x00F0D121 (i.e. V) and Unit Exponent -3, the value of this item is mV (milliVolt), a Unit Exponent of 3 would be kV (kiloVolt).
Now, in hindsight all this is pretty obvious and maybe even sensible. It'd have been nice if the spec would've explained it a bit clearer but then I would have nothing to write about, so I guess overall I call it a draw.
[1] This whole adventure was started because there's a touchpad out there that measures touch pressure in radians, so at least one other person out there struggled with the docs...
[2] The nibble value is twos complement (i.e. it's a signed 4-bit integer). Values 0x1-0x7 are exponents 1 to 7, values 0x8-0xf are exponents -8 to -1.
[3] English Linear should've trolled everyone and used Centimetres instead of the Centimeters in SI Linear.
January 12, 2021
TL;DR: It's now easy to unlock your LUKS2 volume with a FIDO2 security token (e.g. YubiKey or Nitrokey FIDO2). And TPM2 unlocking is easy now too.
Blogging is a lot of work, and a lot less fun than hacking. I mostly focus on the latter because of that, but from time to time I guess stuff is just too interesting to not be blogged about. Hence here, finally, another blog story about exciting new features in systemd.
With the upcoming systemd v248 the
systemd-cryptsetup
component of systemd (which is responsible for assembling encrypted
volumes during boot) gained direct support for unlocking encrypted
storage with three types of security hardware:
Unlocking with FIDO2 security tokens (well, at least with those
which implement the hmac-secret
extension, most do). i.e. your
YubiKeys (series 5 and above), or Nitrokey FIDO2 and such.
Unlocking with TPM2 security chips (pretty ubiquitous on non-budget PCs/laptops/…)
Unlocking with PKCS#11 security tokens, i.e. your smartcards and older YubiKeys (the ones that implement PIV). (Strictly speaking this was supported on older systemd already, but was a lot more "manual".)
For completeness' sake, let's keep in mind that the component also allows unlocking with these more traditional mechanisms:
Unlocking interactively with a user-entered passphrase (i.e. the way most people probably already deploy it, supported since about forever)
Unlocking via key file on disk (optionally on removable media plugged in at boot), supported since forever.
Unlocking via a key acquired through trivial
AF_UNIX
/SOCK_STREAM
socket IPC. (Also new in v248)
Unlocking via recovery keys. These are pretty much the same thing as a regular passphrase (and in fact can be entered wherever a passphrase is requested) — the main difference being that they are always generated by the computer, and thus have guaranteed high entropy, typically higher than user-chosen passphrases. They are generated in a way they are easy to type, in many cases even if the local key map is misconfigured. (Also new in v248)
In this blog story, let's focus on the first three items, i.e. those that talk to specific types of hardware for implementing unlocking.
To make working with security tokens and TPM2 easy, a new, small tool was added to the systemd tool set: systemd-cryptenroll. Its only purpose is to make it easy to enroll your security token/chip of choice into an encrypted volume. It works with any LUKS2 volume, and embeds a tiny bit of meta-information into the LUKS2 header with parameters necessary for the unlock operation.
So, let's see how this fits together in the FIDO2 case. Most likely
this is what you want to use if you have one of these fancy FIDO2 tokens
(which need to implement the hmac-secret
extension, as
mentioned). Let's say you already have your LUKS2 volume set up, and
previously unlocked it with a simple passphrase. Plug in your token,
and run:
# systemd-cryptenroll --fido2-device=auto /dev/sda5
(Replace /dev/sda5
with the underlying block device of your volume).
This will enroll the key as an additional way to unlock the volume,
and embeds all necessary information for it in the LUKS2 volume
header. Before we can unlock the volume with this at boot, we need to
allow FIDO2 unlocking via
/etc/crypttab
. For
that, find the right entry for your volume in that file, and edit it
like so:
myvolume /dev/sda5 - fido2-device=auto
Replace myvolume
and /dev/sda5
with the right volume name, and
underlying device of course. Key here is the fido2-device=auto
option you need to add to the fourth column in the file. It tells
systemd-cryptsetup
to use the FIDO2 metadata now embedded in the
LUKS2 header, wait for the FIDO2 token to be plugged in at boot
(utilizing systemd-udevd
, …) and unlock the volume with it.
And that's it already. Easy-peasy, no?
Note that all of this doesn't modify the FIDO2 token itself in any way. Moreover you can enroll the same token in as many volumes as you like. Since all enrollment information is stored in the LUKS2 header (and not on the token) there are no bounds on any of this. (OK, well, admittedly, there's a cap on LUKS2 key slots per volume, i.e. you can't enroll more than a bunch of keys per volume.)
Let's now have a closer look how the same works with a PKCS#11 compatible security token or smartcard. For this to work, you need a device that can store an RSA key pair. I figure most security tokens/smartcards that implement PIV qualify. How you actually get the keys onto the device might differ though. Here's how you do this for any YubiKey that implements the PIV feature:
# ykman piv reset
# ykman piv generate-key -a RSA2048 9d pubkey.pem
# ykman piv generate-certificate --subject "Knobelei" 9d pubkey.pem
# rm pubkey.pem
(This chain of commands erases what was stored in the PIV feature of your token before, so be careful!)
For tokens/smartcards from other vendors a different series of commands might work. Once you have a key pair on it, you can enroll it with a LUKS2 volume like so:
# systemd-cryptenroll --pkcs11-token-uri=auto /dev/sda5
Just like the same command's invocation in the FIDO2 case this enrolls the security token as an additional way to unlock the volume, any passphrases you already have enrolled remain enrolled.
For the PKCS#11 case you need to edit your /etc/crypttab
entry like this:
myvolume /dev/sda5 - pkcs11-uri=auto
If you have a security token that implements both PKCS#11 PIV and FIDO2 I'd probably enroll it as FIDO2 device, given it's the more contemporary, future-proof standard. Moreover, it requires no special preparation in order to get an RSA key onto the device: FIDO2 keys typically just work.
Most modern (non-budget) PC hardware (and other kind of hardware too) nowadays comes with a TPM2 security chip. In many ways a TPM2 chip is a smartcard that is soldered onto the mainboard of your system. Unlike your usual USB-connected security tokens you thus cannot remove them from your PC, which means they address quite a different security scenario: they aren't immediately comparable to a physical key you can take with you that unlocks some door, but they are a key you leave at the door, but that refuses to be turned by anyone but you.
Even though this sounds a lot weaker than the FIDO2/PKCS#11 model, TPM2 still brings benefits for securing your systems: because the cryptographic key material stored in TPM2 devices cannot be extracted (at least that's the theory), if you bind your hard disk encryption to it, it means attackers cannot just copy your disk and analyze it offline — they always need access to the TPM2 chip too to have a chance to acquire the necessary cryptographic keys. Thus, they can still steal your whole PC and analyze it, but they cannot just copy the disk without you noticing and analyze the copy.
Moreover, you can bind the ability to unlock the harddisk to specific software versions: for example you could say that only your trusted Fedora Linux can unlock the device, but not any arbitrary OS some hacker might boot from a USB stick they plugged in. Thus, if you trust your OS vendor, you can entrust storage unlocking to the vendor's OS together with your TPM2 device, and thus can be reasonably sure intruders cannot decrypt your data unless they both hack your OS vendor and steal/break your TPM2 chip.
Here's how you enroll your LUKS2 volume with your TPM2 chip:
# systemd-cryptenroll --tpm2-device=auto --tpm2-pcrs=7 /dev/sda5
This looks almost as straightforward as the two earlier
systemd-cryptenroll
command lines — if it wasn't for the
--tpm2-pcrs=
part. With that option you can specify to which TPM2
PCRs you want to bind the enrollment. TPM2 PCRs are a set of
(typically 24) hash values that every TPM2 equipped system at boot
calculates from all the software that is invoked during the boot
sequence, in a secure, unfakable way (this is called
"measurement"). If you bind unlocking to a specific value of a
specific PCR, you thus require the system to follow the same
sequence of software at boot to re-acquire the disk encryption
key. Sounds complex? Well, that's because it is.
For now, let's see how we have to modify your /etc/crypttab
to
unlock via TPM2:
myvolume /dev/sda5 - tpm2-device=auto
This part is easy again: the tpm2-device=
option is what tells
systemd-cryptsetup
to use the TPM2 metadata from the LUKS2 header
and to wait for the TPM2 device to show up.
FIDO2, PKCS#11 and TPM2 security tokens and chips pair well with recovery keys: since you don't need to type in your password every day anymore it makes sense to get rid of it, and instead enroll a high-entropy recovery key you then print out or scan off screen and store in a safe, physical location. i.e. forget about good ol' passphrase-based unlocking, go for FIDO2 plus recovery key instead! Here's how you do it:
# systemd-cryptenroll --recovery-key /dev/sda5
This will generate a key, enroll it in the LUKS2 volume, show it to
you on screen and generate a QR code you may scan off screen if you
like. The key has highest entropy, and can be entered wherever you can
enter a passphrase. Because of that you don't have to modify
/etc/crypttab
to make the recovery key work.
There's still plenty room for further improvement in all of this. In particular for the TPM2 case: what the text above doesn't really mention is that binding your encrypted volume unlocking to specific software versions (i.e. kernel + initrd + OS versions) actually sucks hard: if you naively update your system to newer versions you might lose access to your TPM2 enrolled keys (which isn't terrible, after all you did enroll a recovery key — right? — which you then can use to regain access). To solve this some more integration with distributions would be necessary: whenever they upgrade the system they'd have to make sure to enroll the TPM2 again — with the PCR hashes matching the new version. And whenever they remove an old version of the system they need to remove the old TPM2 enrollment. Alternatively TPM2 also knows a concept of signed PCR hash values. In this mode the distro could just ship a set of PCR signatures which would unlock the TPM2 keys. (But quite frankly I don't really see the point: whether you drop in a signature file on each system update, or enroll a new set of PCR hashes in the LUKS2 header doesn't make much of a difference). Either way, to make TPM2 enrollment smooth some more integration work with your distribution's system update mechanisms need to happen. And yes, because of this OS updating complexity the example above — where I referenced your trusty Fedora Linux — doesn't actually work IRL (yet? hopefully…). Nothing updates the enrollment automatically after you initially enrolled it, hence after the first kernel/initrd update you have to manually re-enroll things again, and again, and again … after every update.
The TPM2 could also be used for other kinds of key policies, which we might look into adding later too. For example, Windows uses TPM2 stuff to allow short (4 digits or so) "PINs" for unlocking the harddisk, i.e. kind of a low-entropy password you type in. The reason this is reasonably safe is that in this case the PIN is passed to the TPM2, which enforces that not more than some limited number of unlock attempts may be made within some time frame, and that after too many attempts the PIN is invalidated altogether. This makes dictionary attacks much harder (which would otherwise be easy given the short length of the PINs).
(BTW: Yubico sent me two YubiKeys for testing and Nitrokey a Nitrokey FIDO2, thank you! — That's why you see all those references to YubiKey/Nitrokey devices in the text above: it's the hardware I had to test this with. That said, I also tested the FIDO2 stuff with a SoloKey I bought, where it also worked fine. And yes, you!, other vendors!, who might be reading this, please send me your security tokens for free, too, and I might test things with them as well. No promises though. And I am not going to give them back, if you do, sorry. ;-))
January 11, 2021
As the merge window for the upcoming Mesa release looms, Erik and I have decided on a new strategy for development: we’re just going to stop merging patches.
At this point in time, we have no regressions as compared to the last release, so we’re just doing a full stop until after the branch point in order to save ourselves time potentially tracking down any issues in further feature additions.
Some of you may have noticed that zink-wip has yet to update this year. This isn’t due to a lack of work, but rather due to lack of stability. I’ve been tinkering with a new descriptor management infrastructure (yes, I’m back on the horse), and it’s… capable of drawing frames is maybe the best way to describe it. I’ve gone through probably about ten iterations on it so far based on all the ideas I’ve had.
This is hardly an exhaustive list, but here’s some of the ideas that I’ve cycled through:
async descriptor updating - It seems like this should be good on paper given that it’s legal to do descriptor updates in threads, but the overhead from signalling the task thread in this case ended up being, on average, about 10-20x the cost of just doing the updating synchronously.
all push descriptors all the time - Just for hahas along the way I jammed everything into a pushed descriptor set. Or at least I was going to try. About halfway through, I realized this was way more work to execute than it’d be worth for the hahas considering I wouldn’t ever be able to use this in reality.
zero iteration updates - The gist of this idea is that looking at the descriptor updating code, there’s a ton of iterating going on. This is an extreme hotpath, so any amount of looping that can be avoided is great, and the underlying Vulkan driver has to iterate the sets anyway, so… Eventually I managed to throw a bunch of memory at the problem and do all the setup during pipeline init, giving me pre-initialized blobs of memory in the form of VkWriteDescriptorSet arrays with the associated sub-types for descriptors. With this in place, naturally I turned to…
templates - Descriptor templates are a way of giving the Vulkan driver the raw memory of the descriptor info as a blob and letting it huck that directly into a buffer. Since I already had the memory set up for this, it was an easy swap over, though the gains were less impressive than I’d expected.
At last I’ve settled on a model of uncached, templated descriptors with an extra push set for handling uniform data for further exploration. Initial results for real world use (e.g., graphical benchmarks) are good, but piglit’s drawoverhead
test shows there’s still a lot of work to be done to catch up to caching.
Big thanks to Hans-Kristian Arntzen, aka themaister, aka low-level graphics swashbuckler, for providing insight and consults along the process of this.
January 08, 2021
VkRunner is a Vulkan shader tester based on Piglit’s shader_runner (I already talked about it in my blog). This tool is very helpful for creating simple Vulkan tests without writing hundreds of lines of code. In the Graphics Team at Igalia, we use it extensively to help us in the open-source driver development in Mesa such as V3D and Turnip drivers.
As a hobby project for last Christmas holiday season, I wrote the .spec file for VkRunner and uploaded it to Fedora’s Copr and OpenSUSE Build Service (OBS) for generating the respective RPM packages.
This is the first time I have created a package, and thanks to the documentation on how to create RPM packages, the process was simpler than I initially thought. If I find the time to read the Debian New Maintainers’ Guide, I will create a DEB package as well.
Anyway, if you have installed Fedora or openSUSE on your computer and you want to try VkRunner, just follow these steps (the dnf commands are for Fedora, the zypper ones for openSUSE):
$ sudo dnf copr enable samuelig/vkrunner
$ sudo dnf install vkrunner
$ sudo zypper addrepo https://download.opensuse.org/repositories/home:samuelig/openSUSE_Leap_15.2/home:samuelig.repo
$ sudo zypper refresh
$ sudo zypper install vkrunner
Enjoy it!
January 07, 2021
Apple’s latest line of Macs includes their in-house “M1” system-on-chip, featuring a custom GPU. This poses a problem for those of us in the Asahi Linux project who wish to run Linux on our devices, as this custom Apple GPU has neither public documentation nor open source drivers. Some speculate it might descend from PowerVR GPUs, as used in older iPhones, while others believe the GPU to be completely custom. But rumours and speculations are no fun when we can peek under the hood ourselves!
A few weeks ago, I purchased a Mac Mini with an M1 GPU as a development target to study the instruction set and command stream, to understand the GPU’s architecture at a level not previously publicly understood, and ultimately to accelerate the development of a Mesa driver for the hardware. Today I’ve reached my first milestone: I now understand enough of the instruction set to disassemble simple shaders with a free and open-source tool chain, released on GitHub here.
The process for decoding the instruction set and command stream of the GPU parallels the same process I used for reverse-engineering Mali GPUs in the Panfrost project, originally pioneered by the Lima, Freedreno, and Nouveau free software driver projects. Typically, for Linux or Android driver reverse-engineering, a small wrapper library will be written to inject into a test application via LD_PRELOAD
that hooks key system calls like ioctl
and mmap
in order to analyze user-kernel interactions. Once the “submit command buffer” call is issued, the library can dump all (mapped) shared memory for offline analysis.
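As a rough illustration of the Linux half of that technique (a minimal sketch of my own, not the actual tooling used for this project), such an interposer boils down to something like:
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdarg.h>
#include <stdio.h>

/* Intercept ioctl(), log the call, then forward it to the real libc symbol.
 * Built as a shared object and injected with LD_PRELOAD. */
int ioctl(int fd, unsigned long request, ...)
{
    static int (*real_ioctl)(int, unsigned long, ...);
    if (!real_ioctl)
        real_ioctl = (int (*)(int, unsigned long, ...))dlsym(RTLD_NEXT, "ioctl");

    va_list ap;
    va_start(ap, request);
    void *arg = va_arg(ap, void *);
    va_end(ap);

    fprintf(stderr, "ioctl(fd=%d, request=0x%lx, arg=%p)\n", fd, request, arg);
    return real_ioctl(fd, request, arg);
}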
The same overall process will work for the M1, but there are some macOSisms that need to be translated. First, there is no LD_PRELOAD
on macOS; the equivalent is DYLD_INSERT_LIBRARIES
, which has some extra security features which are easy enough to turn off for our purposes. Second, while the standard Linux/BSD system calls do exist on macOS, they are not used for graphics drivers. Instead, Apple’s own IOKit
framework is used for both kernel and userspace drivers, with the critical entry point of IOConnectCallMethod
, an analogue of ioctl
. These differences are easy enough to paper over, but they do add a layer of distance from the standard Linux tooling.
The bigger issue is orienting ourselves in the IOKit world. Since Linux is under a copyleft license, (legal) kernel drivers are open source, so the ioctl
interface is public, albeit vendor-specific. macOS’s kernel (XNU) being under a permissive license brings no such obligations; the kernel interface is proprietary and undocumented. Even after wrapping IOConnectCallMethod
, it took some elbow grease to identify the three critical calls: memory allocation, command buffer creation, and command buffer submission. Wrapping the allocation and creation calls is essential for tracking GPU-visible memory (what we are interested in studying), and wrapping the submission call is essential for timing the memory dump.
With those obstacles cleared, we can finally get to the shader binaries, black boxes in themselves. However, the process from here on out is standard: start with the simplest fragment or compute shader possible, make a small change in the input source code, and compare the output binaries. Iterating on this process is tedious but will quickly reveal key structures, including opcode numbers.
The findings of the process documented in the free software disassembler confirm a number of traits of the GPU:
One, the architecture is scalar. Unlike some GPUs that are scalar for 32-bits but vectorized for 16-bits, the M1’s GPU is scalar at all bit sizes. Yet Metal optimization resources imply 16-bit arithmetic should be significantly faster, in addition to a reduction of register usage leading to higher thread count (occupancy). This suggests the hardware is superscalar, with more 16-bit ALUs than 32-bit ALUs, allowing the part to benefit from low-precision graphics shaders much more than competing chips can, while removing a great deal of complexity from the compiler.
Two, the architecture seems to handle scheduling in hardware, common among desktop GPUs but less so in the embedded space. This again makes the compiler simpler at the expense of more hardware. Instructions seem to have minimal encoding overhead, unlike other architectures which need to pad out instructions with nop’s to accommodate highly constrained instruction sets.
Three, various modifiers are supported. Floating-point ALUs can do clamps (saturate), negates, and absolute value modifiers “for free”, a common shader architecture trait. Further, most (all?) instructions can type-convert between 16-bit and 32-bit “for free” on both the destination and the sources, which allows the compiler to be much more aggressive about using 16-bit operations without risking conversion overheads. On the integer side, various bitwise complements and shifts are allowed on certain instructions for free. None of this is unique to Apple’s design, but it’s worth noting all the same.
Finally, not all ALU instructions have the same timing. Instructions like imad
, used to multiply two integers and add a third, are avoided in favour of repeated iadd
integer addition instructions where possible. This also suggests a superscalar architecture; software-scheduled designs like those I work on for my day job cannot exploit differences in pipeline length, inadvertently slowing down simple instructions to match the speed of complex ones.
From my prior experience working with GPUs, I expect to find some eldritch horror waiting in the instruction set, to balloon compiler complexity. Though the above work currently covers only a small surface area of the instruction set, so far everything seems sound. There are no convoluted optimization tricks, but doing away with the trickery creates a streamlined, efficient design that does one thing and does it well. Maybe Apple’s hardware engineers discovered it’s hard to beat simplicity.
Alas, a shader tool chain isn’t much use without an open source userspace driver. Next up: dissecting the command stream!
Disclaimer: This work is a hobby project conducted based on public information. Opinions expressed may not reflect those of my employer.
January 04, 2021
As long-time readers of the blog know, SGC is a safe space where making mistakes is not only accepted, it’s a way of life. So it is once again that I need to amend statements previously made regarding Xorg synchronization after Michel Dänzer, also known for anchoring the award-winning series Why Is My MR Failing CI Today?, pointed out that while I was indeed addressing the correct problem, I was addressing it from the wrong side.
The issue here is that WSI synchronizes with the display server using a file descriptor for the swapchain image that the Vulkan driver manages. But what if the Vulkan driver never configures itself to be used for WSI (genuine or faked) in the first place?
Yes, this indeed appeared to be the true problem. Iago Toral Quiroga added handling for this specific to the V3DV driver back in October, and it’s the same mechanism: setting up a Mesa-internal struct during resource initialization.
So I extended this to the ANV codepath and…
And obviously it didn’t work.
But why was this the case?
A script-based git blame
revealed that ANV has a different handling for implicit sync than other Vulkan drivers. After a well-hidden patch, ANV relies entirely on a struct attached to VkSubmitInfo
which contains the swapchain image’s memory pointer in order to handle implicit sync. Thus by attaching a wsi_memory_signal_submit_info
struct, everything was resolved.
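In rough terms it looks like the following sketch (struct layout assumed from Mesa’s WSI headers; swapchain_image_memory, cmdbuf, queue, and fence are placeholders):
struct wsi_memory_signal_submit_info wsi_info = {
   .sType = VK_STRUCTURE_TYPE_WSI_MEMORY_SIGNAL_SUBMIT_INFO_MESA,
   .memory = swapchain_image_memory, /* VkDeviceMemory of the image being presented */
};
VkSubmitInfo si = {
   .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
   .pNext = &wsi_info,
   .commandBufferCount = 1,
   .pCommandBuffers = &cmdbuf,
};
vkQueueSubmit(queue, 1, &si, fence);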
Is it a great fix? No. Does it work? Yes.
If ANV wasn’t configuring itself to handle implicit sync, why was poll() working?
Luck.
Why does RADV work without any of this?
Also probably luck.
January 01, 2021
I've taken the week between Christmas and New Year's off this year. I didn't really have anything serious planned, just taking a break from the usual routine. As often happens, I got sucked into doing a project when I received this simple bug report, Debian Bug #974011:
I have been researching old terminal and X games recently, and realized
that much of the code from 'xmille' originated from the terminal game
'mille', which is part of bsdgames.
...
[The copyright and license information] has been stripped out of all
code in the xmille distribution. Also, none of the included materials
give credit to the original author, Ken Arnold.
The reason the 'xmille' source is missing copyright and license information from the 'mille' files is that they were copied in before that information was added upstream. Xmille forked from Mille around 1987 or so. I wrote the UI parts for the system I had at the time, which was running X10R4. A very basic port to X11 was done at some point, and that's what Debian has in the archive today.
At some point in the 90s, I ported Xmille to the Athena widget set, including several custom widgets in an Xaw extension library, Xkw. It's a lot better than the version in Debian, including displaying the cards correctly (the Debian version has some pretty bad color issues).
Here's what the current Debian version looks like:
To fix the missing copyright and license information, I imported the mille source code into the "latest" Xaw-based version. The updated mille code had a number of bug fixes and improvements, along with the copyright information.
That should have been sufficient to resolve the issue and I could have constructed a suitable source package from whatever bits were needed and uploaded that as a replacement 'xmille' package.
However, at some later point, I had actually merged xmille into a larger package, 'kgames', which also included a number of other games, including Reversi, Dominoes, Cribbage and ten Solitaire/Patience variants. (as an aside, those last ten games formed the basis for my Patience Palm Pilot application, which seems to have inspired an Android App of the same name...)
So began my yak shaving holiday.
Ok, so getting this old source code running should be easy, right? It's just a bunch of C code designed in the 80s and 90s to work on VAXen and their kin. How hard could it be?
Everything was a 32-bit computer back then; pointers and ints were both 32 bits, so you could cast them with wild abandon and cause no problems. Today, testing revealed segfaults in some corners of the code.
It's K&R C code. Remember that the first version of ANSI-C didn't come out until 1989, and it was years later that we could reliably expect to find an ANSI compiler with a random Unix box.
It's X11 code. Fortunately (?), X11 hasn't changed since these applications were written, so at least that part still works just fine. Imagine trying to build Windows or Mac OS code from the early 90's on a modern OS...
I decided to dig in and add prototypes everywhere; that found a lot of pointer/int casting issues, as well as several lurking bugs where the code was just plain broken.
After a day or so, I had things building and running and was no longer hitting crashes.
With that done, I decided I could at least upload the working bits to the Debian archive and close the bug reported above. kgames 1.0-2 may eventually get into unstable, presumably once the Debian FTP team realizes just how important fixing this bug is. Or something.
Here's what xmille looks like in this version:
And here's my favorite solitaire variant too:
Yeah, Xaw applications have a rustic appearance which may appeal to some, but for people with higher resolution monitors and “well seasoned” eyesight, squinting at the tiny images and text makes it difficult to enjoy these games today.
How hard could it be to update them to use larger cards and scalable fonts?
I decided to dig in and start hacking the code, starting by adding new widgets to the Xkw library that used cairo for drawing instead of core X calls. Fortunately, the needs of the games were pretty limited, so I only needed to implement a handful of widgets:
KLabel. Shows a text string. It allows the string to be left, center or right justified. And that's about it.
KCommand. A push button, which uses KLabel for the underlying presentation.
KToggle. A push-on/push-off button, which uses KCommand for most of the implementation. Also supports 'radio groups' where pushing one on makes the others in the group turn off.
KMenuButton. A button for bringing up a menu widget; this is some pretty simple behavior built on top of KCommand.
KSimpleMenu, KSmeBSB, KSmeLine. These three create pop-up menus; KSimpleMenu creates a container which can hold any number of KSmeBSB (string) and KSmeLine (separator lines) objects.
KTextLine. A single line text entry widget.
The other Xkw widgets all got their rendering switched to using cairo, plus using double buffering to make updates look better.
Looking on wikimedia, I found a page referencing a large number of playing cards in SVG form. That led me to Adrian Kennard's playing card web site that let me customize and download a deck of cards, licensed using the CC0 Public Domain license.
With these cards, I set about rewriting the Xkw playing card widget, stripping out three different versions of bitmap playing cards and replacing them with just these new SVG versions.
Ok, so getting regular playing cards was good, but the original goal was to update Xmille, and that has cards hand drawn by me. I could just use those images, import them into cairo and let it scale them to suit on the screen. I decided to experiment with inkscape's bitmap tracing code to see what it could do with them.
First, I had to get them into a format that inkscape could parse. That turned out to be a bit tricky; the original format is as a set of X bitmap layers; each layer painting a single color. I ended up hacking the Xmille source code to generate the images using X, then fetching them with XGetImage and walking them to construct XPM format files which could then be fed into the portable bitmap tools to create PNG files that inkscape could handle.
The resulting images have a certain charm:
I did replace the text in the images to make it readable, otherwise these are untouched from what inkscape generated.
Remember that all of these are applications built using the venerable X toolkit; there are still some non-antialiased graphics visible as the shaped buttons use the X Shape extension. But, all rendering is now done with cairo, so it's all anti-aliased and all scalable.
Here's what Xmille looks like after the upgrades:
And here's spider:
Once kgames 1.0 reaches Debian unstable, I'll upload these new versions.
December 30, 2020
There’s a number of strange hacks in zink that provide compatibility for some of the layers in mesa. One of these hacks is the NIR pass used for managing non-constant UBO/SSBO array indexing, made necessary because SPIRV operates by directly accessing variables, and so it’s impossible to have a non-constant index because then when generating the SPIRV there’s no way to know which variable is being accessed.
In its current state from zink-wip it looks like this:
static nir_ssa_def *
recursive_generate_bo_ssa_def(nir_builder *b, nir_intrinsic_instr *instr, nir_ssa_def *index, unsigned start, unsigned end)
{
if (start == end - 1) {
/* block index src is 1 for this op */
unsigned block_idx = instr->intrinsic == nir_intrinsic_store_ssbo;
nir_intrinsic_instr *new_instr = nir_intrinsic_instr_create(b->shader, instr->intrinsic);
new_instr->src[block_idx] = nir_src_for_ssa(nir_imm_int(b, start));
for (unsigned i = 0; i < nir_intrinsic_infos[instr->intrinsic].num_srcs; i++) {
if (i != block_idx)
nir_src_copy(&new_instr->src[i], &instr->src[i], &new_instr->instr);
}
if (instr->intrinsic != nir_intrinsic_load_ubo_vec4) {
nir_intrinsic_set_align(new_instr, nir_intrinsic_align_mul(instr), nir_intrinsic_align_offset(instr));
if (instr->intrinsic != nir_intrinsic_load_ssbo)
nir_intrinsic_set_range(new_instr, nir_intrinsic_range(instr));
}
new_instr->num_components = instr->num_components;
if (instr->intrinsic != nir_intrinsic_store_ssbo)
nir_ssa_dest_init(&new_instr->instr, &new_instr->dest,
nir_dest_num_components(instr->dest),
nir_dest_bit_size(instr->dest), NULL);
nir_builder_instr_insert(b, &new_instr->instr);
return &new_instr->dest.ssa;
}
unsigned mid = start + (end - start) / 2;
return nir_build_alu(b, nir_op_bcsel, nir_build_alu(b, nir_op_ilt, index, nir_imm_int(b, mid), NULL, NULL),
recursive_generate_bo_ssa_def(b, instr, index, start, mid),
recursive_generate_bo_ssa_def(b, instr, index, mid, end),
NULL
);
}
static bool
lower_dynamic_bo_access_instr(nir_intrinsic_instr *instr, nir_builder *b)
{
if (instr->intrinsic != nir_intrinsic_load_ubo &&
instr->intrinsic != nir_intrinsic_load_ubo_vec4 &&
instr->intrinsic != nir_intrinsic_get_ssbo_size &&
instr->intrinsic != nir_intrinsic_load_ssbo &&
instr->intrinsic != nir_intrinsic_store_ssbo)
return false;
/* block index src is 1 for this op */
unsigned block_idx = instr->intrinsic == nir_intrinsic_store_ssbo;
if (nir_src_is_const(instr->src[block_idx]))
return false;
b->cursor = nir_after_instr(&instr->instr);
bool ssbo_mode = instr->intrinsic != nir_intrinsic_load_ubo && instr->intrinsic != nir_intrinsic_load_ubo_vec4;
unsigned first_idx = 0, last_idx;
if (ssbo_mode) {
last_idx = first_idx + b->shader->info.num_ssbos;
} else {
/* skip 0 index if uniform_0 is one we created previously */
first_idx = !b->shader->info.first_ubo_is_default_ubo;
last_idx = first_idx + b->shader->info.num_ubos;
}
/* now create the composite dest with a bcsel chain based on the original value */
nir_ssa_def *new_dest = recursive_generate_bo_ssa_def(b, instr,
instr->src[block_idx].ssa,
first_idx, last_idx);
if (instr->intrinsic != nir_intrinsic_store_ssbo)
/* now use the composite dest in all cases where the original dest (from the dynamic index)
* was used and remove the dynamically-indexed load_*bo instruction
*/
nir_ssa_def_rewrite_uses_after(&instr->dest.ssa, nir_src_for_ssa(new_dest), &instr->instr);
nir_instr_remove(&instr->instr);
return true;
}
In brief, lower_dynamic_bo_access_instr()
is used to detect UBO/SSBO instructions with a non-constant index, e.g., array_of_ubos[n]
where n
is a uniform. Following this, recursive_generate_bo_ssa_def()
generates a chain of bcsel
instructions which checks the non-constant array index against constant values and then, upon matching, uses the value loaded from that UBO.
Without going into more depth about the exact mechanics of this pass for the sake of time, I’ll instead provide a better explanation by example. Here’s a stripped down version of one of the simplest piglit shader tests for non-constant uniform indexing (fs-array-nonconst):
[require]
GLSL >= 1.50
GL_ARB_gpu_shader5
[vertex shader passthrough]
[fragment shader]
#version 150
#extension GL_ARB_gpu_shader5: require
uniform block {
vec4 color[2];
} arr[4];
uniform int n;
uniform int m;
out vec4 color;
void main()
{
color = arr[n].color[m];
}
[test]
clear color 0.2 0.2 0.2 0.2
clear
ubo array index 0
uniform vec4 block.color[0] 0.0 1.0 1.0 0.0
uniform vec4 block.color[1] 1.0 0.0 0.0 0.0
uniform int n 0
uniform int m 1
draw rect -1 -1 1 1
relative probe rect rgb (0.0, 0.0, 0.5, 0.5) (1.0, 0.0, 0.0)
Using two uniforms, a color is indexed from a UBO as the FS output.
In the currently shipping version of zink, the final NIR output from ANV of the fragment shader might look something like this:
shader: MESA_SHADER_FRAGMENT
inputs: 0
outputs: 0
uniforms: 8
ubos: 5
shared: 0
decl_var shader_out INTERP_MODE_NONE vec4 color (FRAG_RESULT_DATA0.xyzw, 8, 0)
decl_function main (0 params)
impl main {
block block_0:
/* preds: */
vec1 32 ssa_0 = load_const (0x00000002 /* 0.000000 */)
vec1 32 ssa_1 = load_const (0x00000001 /* 0.000000 */)
vec1 32 ssa_2 = load_const (0x00000004 /* 0.000000 */)
vec1 32 ssa_3 = load_const (0x00000003 /* 0.000000 */)
vec1 32 ssa_4 = load_const (0x00000010 /* 0.000000 */)
vec1 32 ssa_5 = intrinsic load_ubo (ssa_1, ssa_4) (0, 1073741824, 16, 0, -1) /* access=0 */ /* align_mul=1073741824 */ /* align_offset=16 */ /* range_base=0 */ /* range=-1 */
vec1 32 ssa_6 = load_const (0x00000000 /* 0.000000 */)
vec1 32 ssa_7 = intrinsic load_ubo (ssa_1, ssa_6) (0, 1073741824, 0, 0, -1) /* access=0 */ /* align_mul=1073741824 */ /* align_offset=0 */ /* range_base=0 */ /* range=-1 */
vec1 32 ssa_8 = umin ssa_7, ssa_3
vec1 32 ssa_9 = ishl ssa_5, ssa_2
vec1 32 ssa_10 = iadd ssa_8, ssa_1
vec1 32 ssa_11 = load_const (0xfffffffc /* -nan */)
vec1 32 ssa_12 = iand ssa_9, ssa_11
vec1 32 ssa_13 = load_const (0x00000005 /* 0.000000 */)
vec4 32 ssa_14 = intrinsic load_ubo (ssa_13, ssa_12) (0, 4, 0, 0, -1) /* access=0 */ /* align_mul=4 */ /* align_offset=0 */ /* range_base=0 */ /* range=-1 */
vec4 32 ssa_23 = intrinsic load_ubo (ssa_2, ssa_12) (0, 4, 0, 0, -1) /* access=0 */ /* align_mul=4 */ /* align_offset=0 */ /* range_base=0 */ /* range=-1 */
vec1 32 ssa_27 = ilt32 ssa_10, ssa_2
vec1 32 ssa_28 = b32csel ssa_27, ssa_23.x, ssa_14.x
vec1 32 ssa_29 = b32csel ssa_27, ssa_23.y, ssa_14.y
vec1 32 ssa_30 = b32csel ssa_27, ssa_23.z, ssa_14.z
vec1 32 ssa_31 = b32csel ssa_27, ssa_23.w, ssa_14.w
vec4 32 ssa_32 = intrinsic load_ubo (ssa_3, ssa_12) (0, 4, 0, 0, -1) /* access=0 */ /* align_mul=4 */ /* align_offset=0 */ /* range_base=0 */ /* range=-1 */
vec1 32 ssa_36 = ilt32 ssa_10, ssa_3
vec1 32 ssa_37 = b32csel ssa_36, ssa_32.x, ssa_28
vec1 32 ssa_38 = b32csel ssa_36, ssa_32.y, ssa_29
vec1 32 ssa_39 = b32csel ssa_36, ssa_32.z, ssa_30
vec1 32 ssa_40 = b32csel ssa_36, ssa_32.w, ssa_31
vec4 32 ssa_41 = intrinsic load_ubo (ssa_0, ssa_12) (0, 4, 0, 0, -1) /* access=0 */ /* align_mul=4 */ /* align_offset=0 */ /* range_base=0 */ /* range=-1 */
vec4 32 ssa_45 = intrinsic load_ubo (ssa_1, ssa_12) (0, 4, 0, 0, -1) /* access=0 */ /* align_mul=4 */ /* align_offset=0 */ /* range_base=0 */ /* range=-1 */
vec1 32 ssa_49 = ilt32 ssa_10, ssa_1
vec1 32 ssa_50 = b32csel ssa_49, ssa_45.x, ssa_41.x
vec1 32 ssa_51 = b32csel ssa_49, ssa_45.y, ssa_41.y
vec1 32 ssa_52 = b32csel ssa_49, ssa_45.z, ssa_41.z
vec1 32 ssa_53 = b32csel ssa_49, ssa_45.w, ssa_41.w
vec1 32 ssa_54 = ilt32 ssa_10, ssa_0
vec1 32 ssa_55 = b32csel ssa_54, ssa_50, ssa_37
vec1 32 ssa_56 = b32csel ssa_54, ssa_51, ssa_38
vec1 32 ssa_57 = b32csel ssa_54, ssa_52, ssa_39
vec1 32 ssa_58 = b32csel ssa_54, ssa_53, ssa_40
vec4 32 ssa_59 = vec4 ssa_55, ssa_56, ssa_57, ssa_58
intrinsic store_output (ssa_59, ssa_6) (8, 15, 0, 160, 132) /* base=8 */ /* wrmask=xyzw */ /* component=0 */ /* src_type=float32 */ /* location=4 slots=1 */ /* color */
/* succs: block_1 */
block block_1:
}
All the b32csel
ops are generated by the above NIR pass, with each one “checking” a non-constant index against a constant value. At the end of the shader, the store_output
uses the correct values, but this is pretty gross.
Some time ago, noted Gallium professor Marek Olšák authored a series which provided a codepath for inlining uniform data directly into shaders. The process for this is two steps:
detection - a NIR pass scans the shader and flags uniforms whose values, if known, would let conditionals be folded away
inlining - at draw time, the current uniform data is handed back to the compiler, which rewrites those uniform loads as constants and compiles a specialized shader variant
The purpose of this is specifically to eliminate complex conditionals resulting from uniform data, so the detection NIR pass looks for conditionals which use only constants and uniform data as their sources. Something like if (uniform_variable_expression)
then becomes if (constant_value_expression)
which can then be optimized out, greatly simplifying the eventual shader instructions.
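Conceptually, the qualifying check is something like this sketch (not the actual pass; it treats any load_ubo as “uniform data” for brevity):
static bool
src_is_inlinable(nir_src *src)
{
   /* constants always qualify */
   if (nir_src_is_const(*src))
      return true;
   /* otherwise the value must come straight from a uniform/UBO load */
   nir_instr *parent = src->ssa->parent_instr;
   return parent->type == nir_instr_type_intrinsic &&
          nir_instr_as_intrinsic(parent)->intrinsic == nir_intrinsic_load_ubo;
}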
Looking at the above NIR, this seems like a good target for inlining as well, so I took my hatchet to the detection pass and added in support for the bcsel
and fcsel
ALU ops when their result sources were the results of intrinsics, e.g., loads. The results are good to say the least:
shader: MESA_SHADER_FRAGMENT
inputs: 0
outputs: 0
uniforms: 8
ubos: 5
shared: 0
decl_var shader_out INTERP_MODE_NONE vec4 color (FRAG_RESULT_DATA0.xyzw, 8, 0)
decl_function main (0 params)
impl main {
block block_0:
/* preds: */
vec1 32 ssa_0 = load_const (0x00000001 /* 0.000000 */)
vec1 32 ssa_1 = load_const (0x00000004 /* 0.000000 */)
vec1 32 ssa_2 = load_const (0x00000010 /* 0.000000 */)
vec1 32 ssa_3 = intrinsic load_ubo (ssa_0, ssa_2) (0, 1073741824, 16, 0, -1) /* access=0 */ /* align_mul=1073741824 */ /* align_offset=16 */ /* range_base=0 */ /* range=-1 */
vec1 32 ssa_4 = ishl ssa_3, ssa_1
vec1 32 ssa_5 = load_const (0x00000002 /* 0.000000 */)
vec1 32 ssa_6 = load_const (0xfffffffc /* -nan */)
vec1 32 ssa_7 = iand ssa_4, ssa_6
vec1 32 ssa_8 = load_const (0x00000000 /* 0.000000 */)
vec4 32 ssa_9 = intrinsic load_ubo (ssa_5, ssa_7) (0, 4, 0, 0, -1) /* access=0 */ /* align_mul=4 */ /* align_offset=0 */ /* range_base=0 */ /* range=-1 */
intrinsic store_output (ssa_9, ssa_8) (8, 15, 0, 160, 132) /* base=8 */ /* wrmask=xyzw */ /* component=0 */ /* src_type=float32 */ /* location=4 slots=1 */ /* color */
/* succs: block_1 */
block block_1:
}
The second load_ubo
here is using the inlined uniform data to determine that it needs to load the 0
index, greatly reducing the shader’s complexity.
This still needs a bit of tuning, but I’m hoping to get it finalized soonish.
December 29, 2020
For some time now I’ve been talking about zink’s lack of WSI and the forthcoming, near-messianic work by FPS sherpa Adam Jackson to implement it.
This is an extremely challenging project, however, and some work needs to be done in the meanwhile to ensure that zink doesn’t drive off a cliff.
Any swapchain master is already well acquainted with the mechanism by which images are displayed on the screen, but the gist of it for anyone unfamiliar is that there’s N image resources that are swapped back and forth (2 for double-buffered, 3 for triple-buffered, …). An image being rendered to is a backbuffer, and an image being displayed is a frontbuffer.
Ideally, a frontbuffer shouldn’t be drawn to while it’s in the process of being presented since such an action obliterates the app’s usefulness. The knowledge of exactly when a resource is done presenting is gained through WSI. On Xorg, however, it’s a bit tricky, to say the least. DRI3 is intended to address the underlying problems there with the XPresent extension, and the Mesa DRI frontend utilizes this to determine when an image is safe to use.
All this is great, and I’m sure it works terrifically in other cases, but zink is not like other cases. Zink lacks direct WSI integration. Under Xorg, this means it relies entirely on the DRI frontend to determine when it’s safe to start rendering onto an image resource.
But what if the DRI frontend gets it wrong?
Indeed, due to quirks in the protocol/xserver, XPresent idle events can be received for a “presented” image immediately, even if it’s still in use and has not finished presenting.
In apps like SuperTuxKart, this results in insane flickering due to always rendering over the current frame before it’s finished being presented.
To solve this problem, a wise, reclusive ghostwriter took time off from being at his local pub to offer me a suggestion:
Why not just rip the implicit fence of the DMAbuf out of the image object?
It was a great idea. But what did this pub enthusiast mean?
In short, WSI handles this problem by internally poll()ing on the image resource's underlying file descriptor. When there are no more events to poll() for, the image is safe to write to.
So now it’s back to the (semi) basics of programming. First, get the file descriptor of the image using normal Vulkan function calls:
static int
get_resource_fd(struct zink_screen *screen, struct zink_resource *res)
{
VkMemoryGetFdInfoKHR fd_info = {};
int fd;
fd_info.sType = VK_STRUCTURE_TYPE_MEMORY_GET_FD_INFO_KHR;
fd_info.memory = res->obj->mem;
fd_info.handleType = VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD_BIT;
VkResult result = (*screen->vk_GetMemoryFdKHR)(screen->dev, &fd_info, &fd);
return result == VK_SUCCESS ? fd : -1;
}
This provides a file descriptor that can be used for more nefarious purposes. Any time the gallium pipe_context::flush
hook is called, the flushed resource (swapchain image) must be synchronized by poll()
ing as in this snippet:
static void
zink_flush(struct pipe_context *pctx,
struct pipe_fence_handle **pfence,
enum pipe_flush_flags flags)
{
struct zink_context *ctx = zink_context(pctx);
if (flags & PIPE_FLUSH_END_OF_FRAME && ctx->flush_res) {
if (ctx->flush_res->obj->fd != -1) {
/* FIXME: remove this garbage once we get wsi */
struct pollfd p = {};
p.fd = ctx->flush_res->obj->fd;
p.events = POLLOUT;
assert(poll(&p, 1, -1) == 1);
assert(p.revents & POLLOUT);
}
ctx->flush_res = NULL;
}
/* ...remainder of the flush implementation omitted... */
}
The POLLOUT
event flag is used to determine when it’s safe to write. If there’s no pending usage during present then this will return immediately, otherwise it will wait until the image is safe to use.
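One assumption baked into all of the above (not shown in the post) is that the image's memory was allocated as exportable in the first place; vkGetMemoryFdKHR only succeeds if the matching handle type was requested at allocation time. A rough sketch of such an allocation, with reqs and mem_type_index standing in for values the driver already has:

/* sketch: allocate the image's memory with an exportable opaque-fd handle so
 * vkGetMemoryFdKHR can later hand back a file descriptor for it */
VkExportMemoryAllocateInfo export_info = {
   .sType = VK_STRUCTURE_TYPE_EXPORT_MEMORY_ALLOCATE_INFO,
   .handleTypes = VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD_BIT,
};
VkMemoryAllocateInfo mai = {
   .sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,
   .pNext = &export_info,
   .allocationSize = reqs.size,        /* from vkGetImageMemoryRequirements */
   .memoryTypeIndex = mem_type_index,  /* a suitable type from the same reqs */
};
VkDeviceMemory mem;
VkResult result = vkAllocateMemory(screen->dev, &mai, NULL, &mem);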
hacks++.
December 24, 2020
For real though, I’ve spent literal hours over the past week just rebasing stuff and managing conflicts. And then rebasing again after diffing against a reference commit when I inevitably discover that I fucked up the merge somehow.
But now the rebasing is done for a few minutes while I run more unit tests, so it’s finally time to blog.
It’s been a busy week. Nothing I’ve done has been very interesting. Lots of stabilizing and refactoring.
The blogging must continue, however, so here goes.
Many months ago I blogged about QBOs.
Maybe.
Maybe I didn’t.
QBOs are Query Buffer Objects, where the result of a given query is stored into a buffer. This is great for performance since it avoids stalling while the query result is read back.
Conceptually, anyway.
At present, zink has problems making this efficient for many types of queries due to the mismatch between GL query data and Vulkan query data, and there’s a need to manually read it back and parse it with the CPU.
This is consistent with how zink manages non-QBO queries: read the result back and parse it with the CPU.
As I’ve said many times along the way, the goal for zink has been to get the features in place and working first and then optimize later.
It’s now later, and query bottlenecking is actually hurting performance in some apps (e.g., RPCS3).
Some profiling was done recently by bleeding edge tester Witold Baryluk, and it turns out that zink is using slightly less GPU memory than some native drivers, though it’s also using slightly more than some other native drivers:
Looking at the right side of the graph, it’s obvious that there’s still some VRAM available to be used, which means there’s some VRAM available to use in optimizations.
As such, I decided to rewrite the query internals to have every query be a QBO internally, consistent with the RadeonSI model. While it does use a tiny bit more VRAM due to needing to allocate the backing buffers, the benefit of this is that now all query result data is copied to a buffer as soon as the query stops, so from an API perspective, this means that the result becomes available as soon as the backing buffer becomes idle.
It also means that any time actual QBOs are used (which is all the time for competent apps), I’ll eventually have the ability to asynchronously post the result data from a query onto a user buffer without triggering a stall.
Functionally, this isn’t a super complex maneuver: I’ve already got a utility function that performs a vkCmdCopyQueryPoolResults for regular QBO handling, so repurposing this to be called any time a query was ended, combined with modifying the parsing function to first map the internal buffer, was sufficient.
In the end, the query code is now a bit more uniform, and in the future I can use a compute shader to keep everything on GPU without needing to do any manual readback.
December 18, 2020
Amidst the flurry of patches being smashed into the repo today I thought I’d talk about memcpy
. Yes, I’m referring to the same function everyone knows and loves.
The final frontier of RPCS3 performance was memcpy
. With ARGB emulation in place, my perf results were looking like this on RADV:
Pushing a hard 12fps, this was a big improvement from before, but it seemed a bit low. Not having any prior experience in PS3 emulation, I started wondering whether this was the limit of things in the current year.
Meanwhile, RadeonSI was getting significantly higher performance with a graph like this:
Clearly performance is capable of being much higher, so why is zink so much worse at memcpy
?
This, along with the wisdom of the great sage Dave Airlie, led me to check out the resource creation code in zink, specifically the types of memory that are used for allocation. Vulkan supports a range of memory types for the discerning allocations connoisseur, but the driver was jamming everything into VK_MEMORY_PROPERTY_HOST_COHERENT_BIT
. This is great for ease of use, as the contents of buffers are always synchronized between CPU and GPU with no additional work needed, but it ends up being massively slower for any kind of direct copy operations of the backing memory, e.g., anything PBO-related.
What should really be used is VK_MEMORY_PROPERTY_HOST_CACHED_BIT
whenever possible. This requires some additional legwork to properly invalidate/flush the memory used by any vkMapMemory
calls, but the results were well worth the effort:
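That additional legwork is essentially the flush/invalidate dance around non-coherent mappings. A minimal sketch of the idea (not the actual zink code) might look like this:

/* make GPU writes visible to the CPU before reading a mapped,
 * non-coherent allocation */
static void
invalidate_before_cpu_read(VkDevice dev, VkDeviceMemory mem)
{
   VkMappedMemoryRange range = {
      .sType = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE,
      .memory = mem,
      .offset = 0,
      .size = VK_WHOLE_SIZE,
   };
   vkInvalidateMappedMemoryRanges(dev, 1, &range);
}

/* make CPU writes visible to the GPU after writing through the mapping */
static void
flush_after_cpu_write(VkDevice dev, VkDeviceMemory mem)
{
   VkMappedMemoryRange range = {
      .sType = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE,
      .memory = mem,
      .offset = 0,
      .size = VK_WHOLE_SIZE,
   };
   vkFlushMappedMemoryRanges(dev, 1, &range);
}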
And performance was a buttery smooth 30fps (the cap) as well:
Mission accomplished.
December 17, 2020
…is emulation.
With blit-based transfers working, I checked my RPCS3 flamegraph again to see the massive performance improvements that I’d no doubt be seeing:
Except there were none.
Closer examination revealed that this was due to the app using ARGB formats for its PBOs. Referencing that against VkFormat led to a further problem: ARGB and ABGR are not explicitly supported by Vulkan.
This wasn’t exactly going to be an easy fix, but it wouldn’t prove too challenging either.
Yes, swizzles. In layman’s terms, a swizzle is mapping a given component to another component, for example in an RGBA-ordered format, using a WXYZ swizzle would result in a reordering of the 0-indexed components to 3012, or ARGB.
Gallium, when using blit-based transfers, provides a lot of opportunities to use swizzles, specifically by having a lot of blits go through a u_blitter
codepath that translates blits into quad draws with a sampled image.
Thus, by applying an ARGB/ABGR emulation swizzle to each of these codepaths, I can drop in native-ish support under the hood of the driver by internally reporting ARGB as RGBA and ABGR as BGRA.
In pseudocode, the ARGB path looks something like this:
unsigned char dst_swiz[4];
if (src_is_argb) {
unsigned char reverse_alpha[] = {
PIPE_SWIZZLE_Y,
PIPE_SWIZZLE_Z,
PIPE_SWIZZLE_W,
PIPE_SWIZZLE_X,
};
/* compose swizzle with alpha at the end */
util_format_compose_swizzles(original_swizzle, reverse_alpha, dst_swiz);
} else if (dst_is_argb) {
unsigned char reverse_alpha[] = {
PIPE_SWIZZLE_W,
PIPE_SWIZZLE_X,
PIPE_SWIZZLE_Y,
PIPE_SWIZZLE_Z,
};
/* compose swizzle with alpha at the start */
util_format_compose_swizzles(original_swizzle, reverse_alpha, dst_swiz);
}
The original swizzle is composed with the alpha-reversing swizzle to generate a swizzle that translates the resource’s internal ARGB data into RGBA data (or vice versa) like the Vulkan driver is expecting it to be.
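As a concrete example of that composition (reusing the reverse_alpha array and util_format_compose_swizzles call from the pseudocode above, with an identity original swizzle assumed for simplicity):

/* with an identity original swizzle, the composed result is just the
 * reverse_alpha mapping */
unsigned char original_swizzle[4] = {
   PIPE_SWIZZLE_X, PIPE_SWIZZLE_Y, PIPE_SWIZZLE_Z, PIPE_SWIZZLE_W
};
unsigned char dst_swiz[4];
util_format_compose_swizzles(original_swizzle, reverse_alpha, dst_swiz);
/* dst_swiz == { Y, Z, W, X }: sampling .r reads stored component 1, which for
 * ARGB-ordered data is R, and .a reads stored component 0, which is A, so the
 * sampler sees the data as RGBA */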
From there, the only restriction is that this emulation is prohibited in texel buffers due to there not being a direct method of applying a swizzle to that codepath. Sure, I could do the swizzle in the shader as a variant, but then this leads to more shader variants and more pipeline objects, so it’s simpler to just claim no support here and let gallium figure things out using other codepaths.
Would this be enough to finally get some frames moving?
Find out tomorrow in the conclusion to this SGC miniseries.
December 16, 2020
This is the journey of how zink-wip went from 0 fps in RPCS3 to a bit more than that. Quite a bit more, in fact, if you’re using RADV.
As all new app tests begin, this one started with firing up the app. Since there are no homebrew games available (that I could find), I decided to pick something that I owned and was familiar with: namely, a demo of Bioshock.
It started up nicely enough:
But then I started a game and things got rough:
Yikes.
One of the fundamentals of a graphics driver is that the GPU should be handling as much work as possible. This means that, for example, any time an application is using a Pixel Buffer Object (PBO), the GPU should be used for uploading and downloading the pixel buffer.
Why are you suddenly mentioning PBOs, you might be asking.
Well, let’s check out what’s going on using a perf flamegraph:
The driver in this case is hitting a software path for copying pixels to and from a PBO, effectively doing full-frame memcpy operations multiple times each frame. This is on the CPU, which is obviously not great for performance. As above, ideally this should be moved to the GPU.
Gallium provides a pipe cap for this: PIPE_CAP_PREFER_BLIT_BASED_TEXTURE_TRANSFER
Zink doesn’t use this in master right now, which naturally led me down the path of enabling it.
There were problems.
Lots of problems.
The first problem was that suddenly I had an infinite number of failing unit tests. Confusing for sure. Some intensive debugging led me to this block of code in zink which is used for directly mapping a rectangular region of image resource memory:
VkResult result = vkMapMemory(screen->dev, res->mem, res->offset, res->size, 0, &ptr);
if (result != VK_SUCCESS)
return NULL;
VkImageSubresource isr = {
res->aspect,
level,
0
};
VkSubresourceLayout srl;
vkGetImageSubresourceLayout(screen->dev, res->image, &isr, &srl);
ptr = ((uint8_t *)ptr) + box->z * srl.depthPitch +
box->y * srl.rowPitch +
box->x;
Suspicious. box
in this case represents the region to be mapped, yet members of VkSubresourceLayout like offset
aren’t being applied to handle the level that’s intended to be loaded, nor is this taking into account the bits-per-pixel of the image. In fact, this is always assuming that each x
coordinate unit equals a single byte.
The fully corrected version is more like this:
VkResult result = vkMapMemory(screen->dev, res->mem, res->offset, res->size, 0, &ptr);
if (result != VK_SUCCESS)
return NULL;
VkImageSubresource isr = {
res->aspect,
level,
0
};
VkSubresourceLayout srl;
vkGetImageSubresourceLayout(screen->dev, res->image, &isr, &srl);
const struct util_format_description *desc = util_format_description(res->base.format);
unsigned offset = srl.offset +
box->z * srl.depthPitch +
(box->y / desc->block.height) * srl.rowPitch +
(box->x / desc->block.width) * (desc->block.bits / 8);
ptr = ((uint8_t *)ptr) + offset;
It turns out that no unit test had previously passed a nonzero x
coordinate for a mapping region or tried to map a nonzero level
, so this was never exposed as being broken.
Imagine that.
December 15, 2020
One of the people in the #zink IRC channel has been posing an interesting challenge for me in the form of trying to run every possible console emulator on my zink-wip branch.
This has raised a number of issues with various parts of the driver, so expect a number of posts on the topic.
First up was the citra emulator for the 3DS. This is an interesting app for a number of reasons, the least of which is because it uses a ton of threads, including a separate one for GL, which put my own work to the test.
Suffice to say that my initial implementation of u_threaded_context
needed some work.
One of the main premises of the threaded context is this idea of an asynchronous fence object. The threaded context will create these in a thread and provide them to the driver in the pipe_context::flush
hook, but only in some cases; at other times, the fence object provided will just be a “regular” synchronous one.
The trick here is that the driver itself has a fence for managing synchronization, and the threaded context can create N number of its own fences to manage the driver’s fence, all of which must potentially work when called in a random order and from either the “main” thread or the driver-specific thread.
There’s too much code involved here to be providing any samples here, but I’ll go over the basics of it just for posterity. Initially, I had implemented this entirely on the zink side such that each zink fence had references to all the tc fences in a chain, and fence-related resources were generally managed on the last fence in the chain. I had two separate object types for this: one for zink fences and one for tc fences. The former contained all the required vulkan-specific objects while the latter contained just enough info to work with tc.
This was sort of fine, and it worked for many things, the least of which was all my benchmarking.
The problem was that a desync could occur if one of the tc fences was destroyed sufficiently later than its zink fence, leading to an eventual crash. This was never triggered by unit tests nor basic app usage, but something like citra with its many threads managed to hit it consistently and quickly.
Thus began the day-long process of rewriting the tc implementation to a much-improved 2.0 version. The primary difference in this design model is that I worked a bit closer to the original RadeonSI implementation, having only a single externally-used fence object type for both gallium as well as tc and creating them for the zink fence object without any sort of cross-referencing. This meant that rather than having 1 zink fence with references to N tc fences, I now had N tc fences each with a reference to 1 zink fence.
This simplified the code a bit in other ways after the rewrite, as the gallium/tc fence objects were now entirely independent. The one small catch was that zink fences get recycled, meaning that in theory a gallium/tc fence could have a reference to a zink fence that it no longer was managing, but this was simple enough to avoid by gating all tc fence functionality on a comparison between its stored fence id and the id of the fence that it had a reference to. If they failed to match, the gallium/tc fence had already completed.
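The post doesn’t include code for this, but the id-gating idea can be sketched roughly like so (all struct and field names here are hypothetical, not zink’s actual ones):

struct driver_fence {
   unsigned submit_id;          /* bumped each time the fence is recycled */
   /* ...vulkan sync objects... */
};

struct tc_fence {
   struct driver_fence *fence;  /* the recyclable driver-side fence */
   unsigned submit_id;          /* the driver fence's id when this was created */
};

/* if the driver fence has since been recycled for a newer submission, the
 * work this gallium/tc fence was tracking has already completed */
static bool
tc_fence_is_current(const struct tc_fence *mfence)
{
   return mfence->fence && mfence->fence->submit_id == mfence->submit_id;
}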
Things seem like they’re in better shape now with regards to stability. It’s become more challenging than ever to debug the driver with threading enabled, but that’s just one of the benefits that threads provide.
Next time I’ll begin a series on how to get a mesa driver from less than 1fps to native performance in RPCS3.
December 14, 2020
I have to change things up.
Historically I’ve spent a day working on zink and then written up a post at the end. The problem with this approach is obvious: a lot of times when I get to the end of the day I’m just too mentally drained to think about anything else and want to just pass out on my couch.
So now I’m going to try inverting my schedule: as soon as I get up, it’s now going to be blog time.
I’m not even fully awake right now, so this is definitely going to be interesting.
Today’s exploratory morning post is about sampling from stencil buffers.
What is sampling from stencil buffers, some of you might be asking.
Sampling in general is the reading of data from a resource. It’s most commonly used as an alternative to using a Copy command for transferring some amount of data from one resource to another in a specified way.
For example, extracting only stencil data from a resource which combines both depth and stencil data. In zink, this is an important operation because none of the Copy commands support multisampled resources containing both depth and stencil data, an OpenGL feature that the unit tests most certainly cover.
As with all things, zink has a tough time with this.
For the purpose of this post, I’m only going to be talking about sampling from image resources. Sampling from buffer resources is certainly possible and useful, however, but there’s just less that can go wrong for that case.
The general process of a sampling-based copy operation in Gallium-based drivers goes roughly like this: create a sampler view for the source resource, bind it, then run a shader that samples from it and writes the data to the destination (e.g., through a gl_FragColor output or an imageStore). In the case of stencil sampling, zink has issues with the sampler view creation part of this process.
Here’s what we’ve currently got shipping in the driver for the relevant part of creating image sampler views:
VkImageViewCreateInfo ivci = {};
ivci.sType = VK_STRUCTURE_TYPE_IMAGE_VIEW_CREATE_INFO;
ivci.image = res->image;
ivci.viewType = image_view_type(state->target);
ivci.format = zink_get_format(screen, state->format);
assert(ivci.format);
ivci.components.r = component_mapping(state->swizzle_r);
ivci.components.g = component_mapping(state->swizzle_g);
ivci.components.b = component_mapping(state->swizzle_b);
ivci.components.a = component_mapping(state->swizzle_a);
ivci.subresourceRange.aspectMask = sampler_aspect_from_format(state->format);
ivci.subresourceRange.baseMipLevel = state->u.tex.first_level;
ivci.subresourceRange.baseArrayLayer = state->u.tex.first_layer;
ivci.subresourceRange.levelCount = state->u.tex.last_level - state->u.tex.first_level + 1;
ivci.subresourceRange.layerCount = state->u.tex.last_layer - state->u.tex.first_layer + 1;
err = vkCreateImageView(screen->dev, &ivci, NULL, &sampler_view->image_view);
Filling in some gaps:
* res is the image being sampled from
* image_view_type() converts Gallium texture types (e.g., 1D, 2D, 2D_ARRAY, …) to the corresponding Vulkan type
* zink_get_format() converts a Gallium image format to a usable Vulkan one
* component_mapping() converts a Gallium swizzle to a Vulkan one (swizzles determine which channels in the sample operation are mapped from the source to the destination)
* sampler_aspect_from_format() infers VkImageAspectFlags from a Gallium format
Regarding sampler descriptors, the Vulkan spec states: “If imageView is created from a depth/stencil image, the aspectMask used to create the imageView must include either VK_IMAGE_ASPECT_DEPTH_BIT or VK_IMAGE_ASPECT_STENCIL_BIT but not both.”
This means that for combined depth+stencil resources, only the depth or stencil aspect can be specified but not both. As Gallium presents drivers with a format and swizzle based on the data being sampled from the image’s data, this poses a problem since 1) the format provided will usually map to something like VK_FORMAT_D32_SFLOAT_S8_UINT
and 2) the swizzle provided will be based on this format.
But if zink can only specify one of the aspects, this poses a problem.
The format being sampled must also match the aspect type, and VK_FORMAT_D32_SFLOAT_S8_UINT
is obviously not a pure stencil format. This means that any time zink infers a stencil-only aspect image format like PIPE_FORMAT_X32_S8X24_UINT
, which is a two channel format where the depth channel is ignored, the format passed in VkImageViewCreateInfo
has to just be the stencil format being sampled. Helpfully, this will always be VK_FORMAT_S8_UINT
.
So now the code would look like this:
VkImageViewCreateInfo ivci = {};
ivci.sType = VK_STRUCTURE_TYPE_IMAGE_VIEW_CREATE_INFO;
ivci.image = res->obj->image;
ivci.viewType = image_view_type(state->target);
ivci.components.r = component_mapping(state->swizzle_r);
ivci.components.g = component_mapping(state->swizzle_g);
ivci.components.b = component_mapping(state->swizzle_b);
ivci.components.a = component_mapping(state->swizzle_a);
ivci.subresourceRange.aspectMask = sampler_aspect_from_format(state->format);
/* samplers for stencil aspects of packed formats need to always use stencil type */
if (ivci.subresourceRange.aspectMask == VK_IMAGE_ASPECT_STENCIL_BIT)
ivci.format = VK_FORMAT_S8_UINT;
else
ivci.format = zink_get_format(screen, state->format);
The above code was working great for months in zink-wip.
Then “bugs” were fixed in master.
The new problem came from a merge request claiming to “fix depth/stencil blit shaders”. The short of this is that previously, the shaders generated by mesa for the purpose of doing depth+stencil sampling were always reading from the first channel of the image, which was exactly what zink was intending given that for that case the underlying Vulkan driver would only be reading one component anyway. After this change, however, samplers are now reading from the second channel of the image.
Given that a Vulkan stencil format has no second channel, this poses a problem.
Luckily, the magic of swizzles can solve this. By mapping the second channel of the sampler to the first channel of the image data, the sampler will read the stencil data again.
The fully fixed code now looks like this:
VkImageViewCreateInfo ivci = {};
ivci.sType = VK_STRUCTURE_TYPE_IMAGE_VIEW_CREATE_INFO;
ivci.image = res->obj->image;
ivci.viewType = image_view_type(state->target);
ivci.components.r = component_mapping(state->swizzle_r);
ivci.components.g = component_mapping(state->swizzle_g);
ivci.components.b = component_mapping(state->swizzle_b);
ivci.components.a = component_mapping(state->swizzle_a);
ivci.subresourceRange.aspectMask = sampler_aspect_from_format(state->format);
/* samplers for stencil aspects of packed formats need to always use stencil type */
if (ivci.subresourceRange.aspectMask == VK_IMAGE_ASPECT_STENCIL_BIT) {
ivci.format = VK_FORMAT_S8_UINT;
ivci.components.g = VK_COMPONENT_SWIZZLE_R;
} else
ivci.format = zink_get_format(screen, state->format);
And now everything works.
For now.
December 08, 2020
I keep saying this, but I feel like I’m progressively getting further away from the original goal of this blog, which was to talk about actual code that I’m writing and not just create great graphics memes. So today it’s once again a return to the roots and the code that I had intended to talk about yesterday.
Gallium is a state tracker, and as part of this, it provides various features to make writing drivers easier. One of these features is that it rolls atomic counters into SSBOs, both in terms of the actual buffer resource and the changing of shader instructions to access atomic counters as though they’re uint32_t values at an offset in a buffer. On the surface, and for most drivers, this is great: the driver just has to implement handling for SSBOs, and then they get counters as a bonus.
As always, however, for zink this is A Very Bad Thing.
One of the challenges in zink is the ntv backend which translates the OpenGL shader into a Vulkan shader. Typically this means GLSL -> SPIRV, but there’s also the ARB_gl_spirv extension which allows GL to have SPIRV shaders as well, meaning that zink has to do SPIRV -> SPIRV. GLSL has a certain way of working that NIR handles, but SPIRV is very different, and so the information that is provided by NIR for GLSL is very different than what’s available for SPIRV.
In particular, SPIRV shaders have valid binding
values for shader buffers. GLSL shaders do not. This becomes a problem when trying to determine, as zink must, exactly which descriptors are on a given resource and which descriptors need to have their bindings used vs which can be ignored. Since there’s no way to differentiate a GLSL shader from a SPIRV shader, this is a challenge. It’s further a challenge given that one of the NIR passes that changes shader instructions over from variable pointer derefs to explicit block_id
and offset
values happens to break bindings in such a way that it becomes impossible to accurately tell which SSBO variables are counters and which are actual SSBOs.
if (!strcmp(glsl_get_type_name(var->interface_type), "counters"))
Yup. The original SSBO/counter implementation has to use strcmp
to check the name of the variable’s interface in order to accurately determine whether it’s a counter-turned-SSBO.
There’s also some extremely gross code in ntv for trying to match up the SSBO to its expected block_id
based on this, but SGC is a SFW blog, so I’m going to refrain from posting it.
As always, there’s ways to improve my code. This way came some time after I’d written SSBO support in the form of a new pipe cap, PIPE_CAP_NIR_ATOMICS_AS_DEREF
. What this does is allow a driver to skip the Gallium code that transforms counters into SSBOs, making them very easy to detect.
With this in my pocket, I was already 5% of the way to better atomic counter handling.
The next step was to unbreak counter location
values. The location
is, in ntv language, the offset of a variable inside a given buffer block using a type-based unit, e.g., location=1
would mean an offset of 4 bytes for an int type block. Here’s a NIR pass I wrote to tackle the problem:
static bool
fixup_counter_locations(nir_shader *shader)
{
unsigned last_binding = 0;
unsigned last_location = 0;
if (!shader->info.num_abos)
return false;
nir_foreach_variable_with_modes(var, shader, nir_var_uniform) {
if (!type_is_counter(var->type))
continue;
var->data.binding += shader->info.num_ssbos;
if (var->data.binding != last_binding) {
last_binding = var->data.binding;
last_location = 0;
}
var->data.location = last_location++;
}
return true;
}
The premise here is that counters get merged into buffers based on their binding value, and any number of counters can exist for a given binding. Since Gallium always puts counter buffers after SSBOs, the binding used needs to be incremented by the number of real SSBOs present. With this done, all counters with matching bindings can be assumed to exist sequentially on the same buffer.
Next comes the actual SPIRV variable construction. With the knowledge that zink will be receiving some sort of NIR shader instruction like vec1 32 ssa_0 = deref_var &some_counter
, where some_counter
is actually a value at an offset inside a buffer, it’s important to be considering how to conveniently handle the offset. I ended up with something like this:
if (type_is_counter(var->type)) {
SpvId padding = var->data.offset ? get_sized_uint_array_type(ctx, var->data.offset / 4) : 0;
SpvId types[2];
if (padding)
types[0] = padding, types[1] = array_type;
else
types[0] = array_type;
struct_type = spirv_builder_type_struct(&ctx->builder, types, 1 + !!padding);
if (padding)
spirv_builder_emit_member_offset(&ctx->builder, struct_type, 1, var->data.offset);
}
This creates a struct containing 1-2 members: an optional array of uints that pads out to the counter’s offset within the buffer, followed by the actual counter type.
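For illustration, a counter living at byte offset 8 in its buffer would get a struct shaped roughly like this (written as C for readability; ntv actually emits the equivalent SPIR-V types):

struct counter_block_at_offset_8 {
   unsigned pad[2];   /* var->data.offset / 4 == 2 padding uints */
   unsigned counter;  /* the actual atomic counter variable */
};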
Converting the deref_var
instruction can then be simplified into a consistent and easy to generate OpAccessChain
:
if (type_is_counter(deref->var->type)) {
SpvId dest_type = glsl_type_is_array(deref->var->type) ?
get_glsl_type(ctx, deref->var->type) :
get_dest_uvec_type(ctx, &deref->dest);
SpvId ptr_type = spirv_builder_type_pointer(&ctx->builder,
SpvStorageClassStorageBuffer,
dest_type);
SpvId idx[] = {
emit_uint_const(ctx, 32, !!deref->var->data.offset),
};
result = spirv_builder_emit_access_chain(&ctx->builder, ptr_type, result, idx, ARRAY_SIZE(idx));
}
After setting up the destination type for the deref, the OpAccessChain
is generated using a single index: for cases where the variable lies at a nonzero offset it selects the second member after the padding array, otherwise it selects the first member, which is the intended counter variable.
The rest of the atomic counter conversion was just a matter of supporting the specific counter-related instructions that would otherwise have been converted to regular atomic instructions.
As a result of these changes, zink has gone from a 75% pass rate in ARB_gl_spirv
piglit tests all the way up to around 90%.
December 07, 2020
Blogging is hard. Also, getting back on track during a major holiday week is hard.
But now things are settling, and it’s time to get down to brass tacks.
And code. Brass code.
Maybe just code.
But first, some updates.
Historically when I’ve missed my blogging window for an extended period, it’s because I’m busy. This has been the case for the past week, but I don’t have much to show for it in terms of zink enhancements. There’s some work ongoing on various MRs, but probably this is a good time to revise the bold statement I’d previously made: there’s now roughly two weeks (9 workdays) remaining in which it’s feasible to land zink patches before the end of the year, and probably hitting GL 4.6 in mainline mesa is unrealistic. I’d be pleasantly surprised if we hit 4.0 given that we’d need to be landing a minimum of 1 new MR each day.
But there are some cool things on the horizon for zink nonetheless. One of them involves shader images, and with that comes a vague sense of discomfort that anyone working deeply with shader images will recognize.
Yes, I’m talking about the COHERENT
qualifier.
For anyone interested in a deeper reading into this GLSL debacle, check out this stackoverflow thread.
TL;DR, COHERENT
is supposed to ensure that buffer/image access across shader stages is synchronized, also known as coherency in this context. But then also due to GL spec wording it can simultaneously mean absolutely nothing, sort of like compiler warnings, so this is an area that generally one should avoid thinking about or delving into.
Naturally, zink is delving deep into this. And of course, Vulkan makes everything better, so this issue is 100% not an issue anymore, and everything is great.
Just kidding.
Vulkan has the exact same language in the parts of the spec referencing this behavior:
While GLSL (and legacy SPIR-V) applies the “coherent” decoration to variables (for historical reasons), this model treats each memory access instruction as having optional implicit availability/visibility operations.
optional implicit
Aren’t explicit specifications like Vulkan great?
What happens here is that the spec has no requirement that either the application or driver actually enforces coherency across resources, meaning that if an application optionally decides not to bother, then it’s up to the driver whether to optionally bother guaranteeing coherent access. If neither the application nor driver take any action to guarantee this behavior, the application won’t work as expected.
To fix this on the application (zink) side, image writes in shaders need to specify MakeTexelAvailable|NonPrivateTexel
operands, and the reads need MakeTexelVisible|NonPrivateTexel
.
December 04, 2020
In Part 1 I've shown you how to create your own distribution image using the freedesktop.org CI templates. In Part 2, I've shown you how to truly build nested images. In this part, I'll talk about the ci-fairy tool that is part of the same repository of ci-templates.
When you're building a CI pipeline, there are some tasks that most projects need in some way or another. The ci-fairy tool is a grab-bag of solutions for these. Some of those solutions are for a pipeline itself, others are for running locally. So let's go through the various commands available.
It's as simple as including the template in your .gitlab-ci.yml file.
include:
  - 'https://gitlab.freedesktop.org/freedesktop/ci-templates/-/raw/master/templates/ci-fairy.yml'
Of course, if you want to track a specific sha instead of following master, just sub that sha there. freedesktop.org projects can include ci-fairy like this:
include:
  - project: 'freedesktop/ci-templates'
    ref: master
    file: '/templates/ci-fairy.yml'
Once that's done, you have access to a .fdo.ci-fairy job template that you can extend from. This will download an image from quay.io that is capable of git, python, bash and, obviously, ci-fairy. This image is a fixed one and referenced by a unique sha, so even as we keep working on ci-fairy upstream you should never see regressions; updating requires you to explicitly update the sha of the included ci-fairy template. Obviously, if you're using master like above you'll always get the latest.
Due to how the ci-templates work, it's good to set the FDO_UPSTREAM_REPO variable with the upstream project name. This means ci-fairy will be able to find the equivalent origin/master branch, where that's not available in the merge request. Note, this is not your personal fork but the upstream one, e.g. "freedesktop/ci-templates" if you are working on the ci-templates itself.
ci-fairy has a command to check commits for a few basic expectations in commit messages. This currently includes things like enforcing an 80 char subject line length, that there is an empty line after the subject line, that no fixup or squash commits are in the history, etc. If you have complex requirements you need to write your own, but for most projects this job ensures that there are no obvious errors in the git commit log:
check-commit:
  extends:
    - .fdo.ci-fairy
  script:
    - ci-fairy check-commits --signed-off-by
  except:
    - master@upstream/project
Since you don't ever want this to fail on an already merged commit, exclude this job on the master branch of the upstream project - the MRs should've caught this already anyway.
To rebase a contributor's merge request, the contributor must tick the checkbox to Allow commits from members who can merge to the target branch. The default value is off, which is frustrating (gitlab is working on it though) and causes unnecessary delays in processing merge requests. ci-fairy has a command to check for this value on an MR and fail - contributors ideally pay attention to the pipeline and fix this accordingly.
check-merge-request:
  extends:
    - .fdo.ci-fairy
  script:
    - ci-fairy check-merge-request --require-allow-collaboration
  allow_failure: true
As a tip: run this job towards the end of the pipeline to give collaborators a chance to file an MR before this job fails.
The two examples above are the most useful ones for CI pipelines, but ci-fairy also has some useful local commands. For that you'll have to install it, but that's as simple as
$ pip3 install git+http://gitlab.freedesktop.org/freedesktop/ci-templates
A big focus on ci-fairy for local commands is that it should, usually, be able to work without any specific configuration if you run it in the repository itself.
Just hacked on the CI config?
$ ci-fairy lint
and done, you get the same error back that the online linter for your project would return.
Just pushed to the repo?
$ ci-fairy wait-for-pipeline
Pipeline https://gitlab.freedesktop.org/username/project/-/pipelines/238586
status: success | 7/7 | created: 0 | pending: 0 | running: 0 | failed: 0 | success: 7 ....
The command is self-explanatory, I think.
There are a few other parts to ci-fairy including templating and even minio handling. I recommend looking at e.g. the libinput CI pipeline which uses much of ci-fairy's functionality. And the online documentation for ci-fairy, who knows, there may be something useful in there for you.
The useful contribution of ci-fairy is primarily that it tries to detect the settings for each project automatically, regardless of whether it's run inside a MR pipeline or just as part of a normal pipeline. So the same commands will work without custom configuration on a per-project basis. And for many things it works without API tokens, so the setup costs are just the pip install.
If you have recurring jobs, let us know, we're always looking to add more useful functionality to this little tool.
November 27, 2020
Up until now, I’ve been relying solely on my Intel laptop’s onboard GPU for testing, and that’s been great; Intel’s drivers are robust as hell and have very few issues. On top of that, the rare occasions when I’ve found issues have led to a swift resolution.
Certainly I can’t complain at all about my experience with Intel’s hardware or software.
But now things are different and strange because I received in the mail a couple weeks ago a shiny AMD Radeon RX 5700XT.
Mostly in that it’s a new codebase with new debugging tools and such.
Unlike when I started my zink journey earlier this year, however, I’m much better equipped to dive in and Get Things Done.
This week’s been a bit hectic as I worked to build my new machines, do holiday stuff, and then also get back into the project here. Nevertheless, significant progress has been made in a couple areas:
The latter of these is what I’m going to talk about today, and I’m going to zoom in on one very specific handler since it’s been a while since this blog has shown any actual code.
When using Vulkan barriers, it’s important to simultaneously:
* express all the dependencies needed for correct synchronization
* avoid emitting more barriers than are actually needed
Many months ago, I wrote a patch which aimed to address the second point while also not neglecting the first.
I succeeded in one of these goals.
The reason I didn’t notice that I also failed in one of these goals until now is that ANV actually has weak barrier support. By this I mean that while ANV’s barriers work fine and serve the expected purpose of changing resource layouts when necessary, it doesn’t actually do anything with the srcStageMask or dstStageMask parameters. Also, Intel drivers conveniently don’t change an image’s layout for different uses (i.e., GENERAL is the same as SHADER_READ), so screwing up these layouts doesn’t really matter.
Is this a problem?
No.
ANV is great at doing ANV stuff, and zink has thus far been great at punting GL down the pipe so that ANV can blast out some (correct) pixels.
Consider the following code:
bool
zink_resource_image_needs_barrier(struct zink_resource *res, VkImageLayout new_layout, VkAccessFlags flags, VkPipelineStageFlags pipeline)
{
if (!pipeline)
pipeline = pipeline_dst_stage(new_layout);
if (!flags)
flags = access_dst_flags(new_layout);
return res->layout != new_layout || (res->access & flags) != flags || (res->access_stage & pipeline) != pipeline;
}
This is a function I wrote for the purpose of no-oping redundant barriers. The idea here is that a barrier is unnecessary if it’s applying the same layout with the same access (or a subset of access) for the same stages (or a subset of stages). The access and stage flags can optionally be omitted and filled in with default values for ease of use too.
Basic state tracking.
This worked great on ANV.
The problem, however, comes when trying to run this on a driver that really gets deep into putting those access and stage flags to work in order to optimize the resource’s access.
RADV is such a driver.
Consider the following barrier sequence:
* VK_IMAGE_LAYOUT_GENERAL, VK_ACCESS_READ | VK_ACCESS_WRITE, VK_PIPELINE_STAGE_TRANSFER_BIT
* VK_IMAGE_LAYOUT_GENERAL, VK_ACCESS_READ | VK_ACCESS_WRITE, VK_PIPELINE_STAGE_TRANSFER_BIT
Obviously it’s not desirable to use GENERAL as a layout, but that’s how zink works for a couple cases at the moment, so it’s a case that must be covered adequately. Going by the above filtering function, the second barrier has the same layout, the same access flags, and the same stage flags, so it gets ignored.
Conceptually, barriers in Vulkan are used for the purpose of informing the driver of dependencies between operations for both internal image layout (i.e., compressing/decompressing images for various usages) and synchronization. This means that if image A
is written to in operation O1
and then read from in operation O2
, the user can either stall after O1
or use one of the various synchronization methods provided by Vulkan to ensure the desired result. Given that this is GPU <-> GPU synchronization, that means either a semaphore or a pipeline barrier.
The above scenario seems at first glance to be a redundant barrier based on the state-tracking flags, but conceptually it isn’t since it expresses a dependency between two operations which happen to use matching access.
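Expressed as actual API usage, that write-then-read dependency on image A might be communicated with a pipeline barrier roughly like this (the stages, access bits, and names are illustrative, not zink’s real code):

/* sketch: dependency between a shader write (O1) and a shader read (O2) */
VkImageMemoryBarrier imb = {
   .sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER,
   .srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT,   /* what O1 did */
   .dstAccessMask = VK_ACCESS_SHADER_READ_BIT,    /* what O2 will do */
   .oldLayout = VK_IMAGE_LAYOUT_GENERAL,
   .newLayout = VK_IMAGE_LAYOUT_GENERAL,          /* no layout change needed */
   .srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
   .dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
   .image = image_A,
   .subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 },
};
vkCmdPipelineBarrier(cmdbuf,
                     VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,  /* after O1 */
                     VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,  /* before O2 */
                     0, 0, NULL, 0, NULL, 1, &imb);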
After crashing my system a few times trying to do full piglit runs (seriously, don’t ever try this with zink+radv at present if you’re on similar hardware to mine), I came back to the barrier issue and started to rejigger this filter a bit.
The improved version is here:
bool
zink_resource_image_needs_barrier(struct zink_resource *res, VkImageLayout new_layout, VkAccessFlags flags, VkPipelineStageFlags pipeline)
{
if (!pipeline)
pipeline = pipeline_dst_stage(new_layout);
if (!flags)
flags = access_dst_flags(new_layout);
return res->layout != new_layout || (res->access_stage & pipeline) != pipeline ||
(res->access & flags) != flags ||
(zink_resource_access_is_write(flags) && util_bitcount(flags) > 1);
}
This adds an extra check for a sequence of barriers where the new barrier has at least two access flags and one of them is write access. In this sense, the barrier dependency is ignored if the resource is doing READ -> READ
, but READ|WRITE -> READ|WRITE
will still be emitted like it should.
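As a quick usage illustration with the improved function (hypothetical values; assume res already had a GENERAL barrier with transfer-stage read|write access applied):

/* a repeat of the same read|write barrier is no longer filtered, because the
 * write access expresses a dependency between the two operations */
bool rw = zink_resource_image_needs_barrier(res, VK_IMAGE_LAYOUT_GENERAL,
                                            VK_ACCESS_TRANSFER_READ_BIT |
                                            VK_ACCESS_TRANSFER_WRITE_BIT,
                                            VK_PIPELINE_STAGE_TRANSFER_BIT);
/* rw == true */

/* a repeat of a read-only barrier with the same layout and stage is still
 * filtered as redundant, since its access is a subset of what was applied */
bool ro = zink_resource_image_needs_barrier(res, VK_IMAGE_LAYOUT_GENERAL,
                                            VK_ACCESS_TRANSFER_READ_BIT,
                                            VK_PIPELINE_STAGE_TRANSFER_BIT);
/* ro == false */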
This change fixes a ton of unit tests, though I don’t actually know how many since there’s still some overall instability in various tests which cause my GPU to hang.
Certainly worth mentioning is that I’ve been working closely with the RADV developer community over the past few days, and they’ve been extremely helpful in both getting me started with debugging the driver and assisting with resolving some issues. In particular, keep your eyes on these MRs which also fix zink issues.
Stay tuned as always for more updates on all things Mike and zink.
November 23, 2020
I guess I never left, really, since I’ve been vicariously living the life of someone who still writes zink patches through reviewing and discussing some great community efforts that are ongoing.
But now I’m back living that life of someone who writes zink patches.
Valve has generously agreed to sponsor my work on graphics-related projects.
For the time being, that work happens to be zink.
I don’t want to just make a big post about leaving and then come back after a couple weeks like nothing happened.
It’s 2020.
We need some sort of positive energy and excitement here.
As such, I’m hereby announcing Operation Oxidize, an ambitious endeavor between me and the formidably skillful Erik Faye-Lund of Collabora.
We’re going to land 99% of zink-wip into mainline Mesa by the end of the year, bringing the driver up to basic GL 4.6 and ES 3.2 support with vastly improved performance.
Or at least, that’s the goal.
Will we succeed?
Stay tuned to find out!
November 22, 2020
This was a (relatively) quiet week in zink-world. Here’s some updates, once more in no particular order:
* changes landed in the zink-wip branch after extensive testing on AMD hardware
Stay tuned for further updates.
November 20, 2020
(This post was first published with Collabora on Nov 19, 2020.)
Wayland (the protocol and architecture) is still lacking proper consideration for color management. Wayland also lacks support for high dynamic range (HDR) imagery which has been around in movie and broadcasting industry for a while now (e.g. Netflix HDR UI).
While there are well established tools and workflows for how to do color management on X11, even X11 has not gained support for HDR. There were plans for it (Alex Goins, DeepColor Visuals), but as far as I know nothing really materialized from them. Right now, the only way to watch HDR content on a HDR monitor in Linux is to use the DRM KMS API directly, in other words, not use any window system, which means not using any desktop environment. Kodi is one of the very few applications that can do this at all.
This is a story about starting the efforts to fix the situation on Wayland.
Color management for Wayland has been talked about on and off for many years by dozens of people. To me it was obvious from the start that color management architecture on Wayland must be fundamentally different from X11. I thought the display server must be part of the color management stack instead of an untrusted, unknown entity that must be bypassed and overridden by applications that fight each other for who gets to configure the display. This opinion was wildly controversial and it took a long time to get my point across, but over the years some color management experts started to open up to new ideas and other people joined in the opinion as well. Whether these new ideas are actually better than the ways of old remains to be seen, though. I think the promise of getting everything and more to work better is far too great to not try it out.
The discussions started several times over the years, but they always dried out mostly without any tangible progress. Color management is a wide, deep and difficult topic, and the required skills, knowledge, interest, and available time did not come together until fairly recently. People did write draft protocol extensions, but I would claim that it was not really until Sebastian Wick started building on top of them that things started moving forward. But one person cannot push such a huge effort alone even for the simple reason that there must be at least one reviewer before anything can be merged upstream. I was very lucky that since summer 2020 I have been able to work on Wayland color management and HDR for improving ChromeOS, letting me support Sebastian's efforts on a daily basis. Vitaly Prosyak joined the effort this year as well, researching how to combine the two seemingly different worlds of ICC and HDR, and how tone-mapping could be implemented.
I must also note the past efforts of Harish Krupo, who submitted a major Weston merge request, but unfortunately at the time reviewers in Weston upstream were not much available. Even before that, there were experiments by Ville Syrjälä. All these are now mostly superseded by the on-going work.
Currently the active people around the topic are me (Collabora), Vitaly Prosyak (AMD), and Naveen Kumar (Intel). Sebastian Wick (unaffiliated) is still around as well. None of us is a color management or HDR expert by trade, so we are all learning things as we go.
The foundation for the color management protocol are ICC profile files for describing both output and content color spaces. The aim is for ICCv4, also allowing ICCv2, as these are known and supported well in general. Adding iccMAX support or anything else will be possible any time in the future.
As color management is all about color spaces and gamuts, and high dynamic range (HDR) is also very much about color spaces and gamuts plus extended luminance range, Sebastian and I decided that Wayland color management extension should cater for both from the beginning. Combining traditional color management and HDR is a fairly new thing as far as I know, and I'm not sure we have much prior art to base upon, so this is an interesting research journey as well. There is a lot of prior art on HDR and color management separately, but they tend to have fundamental differences that makes the combination not obvious.
To help us keep focused and explain to the community about what we actually intend with Wayland color management and HDR support, I wrote the section "Wayland Color Management and HDR Design Goals" in color.rst (draft). I very much recommend you to read it so that you get a picture what we (or I, at least) want to aim for.
Elle Stone explains in their article how color management should work on X11. As I wanted to avoid repeating the massive email threads that were had on the wayland-devel mailing list, I wrote the section "Color Pipeline Overview" in color.rst (draft) more or less as a response to her article, trying to explain in what ways Wayland will be different from X11. I think that understanding that section is paramount before anyone makes any comment on our efforts with the Wayland protocol extension.
HDR brings even more reasons to put color space conversions in the display server than just the idea that all applications should be color managed if not explicitly then implicitly. Most of the desktop applications (well, literally all right now) are using Standard Dynamic Range (SDR). SDR is a fuzzy concept referring to all traditional, non-HDR image content. Therefore, your desktop is usually 100% SDR. You run your fancy new HDR monitor in SDR mode, which means it looks just like any old monitor with nothing fancy. What if you want to watch a HDR video? The monitor won't display HDR in SDR mode. If you simply switch the monitor to HDR mode, you will be blinded by all the over-bright SDR applications. Switching monitor modes may also cause flicker and take a bit of time. That would be a pretty bad user experience, right?
A solution is to run your monitor in HDR mode all the time, and have the window system compositor convert all SDR application windows appropriately to the HDR luminance, so that they look normal in spite of the HDR mode. There will always be applications that will never support HDR at all, so the compositor doing the conversion is practically the only way.
For the protocol, we are currently exploring the use of relative luminance. The reason is that people look at monitors in wildly varying viewing environments, under standard office lighting for example. The environment and personal preferences affect what monitor brightness you want. Also monitors themselves can be wildly different in their capabilities. Most prior art on HDR uses absolute luminance, but absolute luminance has the problem that it assumes a specific viewing environment, usually a dark room, similar to a movie theatre. If a display server would show a movie with the absolute luminance it was mastered for, in most cases it would be far too dark to see. Whether using relative luminance at the protocol level turns out to be a good idea or not, we shall see.
The Wayland color management and HDR protocol extension proposal is known as wayland/wayland-protocols!14 (MR14). Because it is a very long running merge request (the bar for landing a new protocol into wayland-protocols is high) and there are several people working on it, we started using sub-merge-requests to modify the proposal. You can find the sub-MRs in Sebastian's fork. If you have a change to propose, that is how to do it.
Obviously using sub-MRs also splits the review discussions into multiple places, but in this case I think it is a good thing, because the discussion threads in Gitlab are already massive.
There are several big and small open questions we haven't had the time to tackle yet even among the active group; questions that I feel we should have some tentative answers before asking for wider community comments. There is also no set schedule, so don't hold your breath. This work is likely to take months still before there is a complete tentative protocol, and probably years until these features are available in your favourite Wayland desktop environments.
If you are an expert on the topics of color management or HDR displays and content, you are warmly welcome to join the development.
If you are an interested developer or an end user looking to try out things, sorry, there is nothing really for you yet.
November 15, 2020
As time/sanity permit, I’ll be trying to do roundup posts for zink happenings each week. Here’s a look back at things that happened, in no particular order:
November 13, 2020
(project was renamed from vallium to lavapipe)
I had some requirements for writing a vulkan software rasterizer within the Mesa project. I took some time to look at the options and realised that just writing a vulkan layer on top of gallium's llvmpipe would be a good answer for this problem. However in doing so I knew people would ask why this wouldn't work for a hardware driver.
tl;dr DO NOT USE LAVAPIPE OVER A GALLIUM HW DRIVER.
No. If the hardware can't do virtual memory properly, or expose the features needed for vulkan, this can't be fixed with a software layer that just introduces overhead.
There's been some recent discussion about whether the X server is abandonware. As the person arguably most responsible for its care and feeding over the last 15 years or so, I feel like I have something to say about that.
The thing about being the maintainer of a public-facing project for nearly the whole of your professional career is it's difficult to separate your own story from the project. So I'm not going to try to be dispassionate, here. I started working on X precisely because free software had given me options and capabilities that really matter, and I feel privileged to be able to give that back. I can't talk about that without caring about it.
So here's the thing: X works extremely well for what it is, but what it is is deeply flawed. There's no shame in that, it's 33 years old and still relevant, I wish more software worked so well on that kind of timeframe. But using it to drive your display hardware and multiplex your input devices is choosing to make your life worse.
It is, however, uniquely well suited to a very long life as an application compatibility layer. Though the code happens to implement an unfortunate specification, the code itself is quite well structured, easy to hack on, and not far off from being easily embeddable.
The issue, then, is how to get there. And I don't have any real desire to get there while still pretending that the xfree86 hardware-backed server code is a real thing. Sorry, I guess, but I've worked on xfree86-derived servers for very nearly as long as XFree86-the-project existed, and I am completely burnt out on that on its own merits, let alone doing that and also being release manager and reviewer of last resort. You can only apply so much thrust to the pig before you question why you're trying to make it fly at all.
So, is Xorg abandoned? To the extent that that means using it to actually control the display, and not just keep X apps running, I'd say yes. But xserver is more than xfree86. Xwayland, Xwin, Xephyr, Xvnc, Xvfb: these are projects with real value that we should not give up. A better way to say it is that we can finally abandon xfree86.
And if that sounds like a world you'd like to see, please, come talk to us, let's make it happen. I'd be absolutely thrilled to see someone take this on, and I'm happy to be your guide through the server internals.
November 12, 2020
A recent article on phoronix has some commentary about sharing code between Windows and Linux, and how this seems to be a metric that Intel likes.
I'd like to explore this idea a bit and explain why I believe it's bad for Linux based distros and our open source development models in the graphics area.
tl;dr there is a big difference between open source released and open source developed projects in terms of sustainability and community.
The Linux graphics stack from a distro vendor point of view is made up of two main projects, the Linux kernel and Mesa userspace. These two projects are developed in the open with completely open source, vendor-agnostic practices. There is no vendor controlling either project, and both projects have a goal of trying to maximise shared code and shared processes/coding standards across drivers from all vendors.
This cross-vendor synergy is very important to the functioning ecosystem that is the Linux graphics stack. The stack also relies in some places on the LLVM project, but again LLVM upstream is vendor agnostic and open source developed.
The value to distros is that they have central places to pick up driver stacks with good release cycles and a minimal number of places they have to deal with to interact with those communities. Now usually hardware vendors don't see the value in the external communities as much as Linux distros do. From a hardware vendor internal point of view they see more benefit in creating a single stack shared between their Windows and Linux drivers to maximise their return on investment, or to make their orgchart prettier, or to produce fewer powerpoints about why their orgchart isn't optimal.
A shared Windows/Linux stack as such is a thing the vendors want more for their own reasons than for the benefit of the Linux community.
Why is it a bad idea?
I'll start by saying it's not always a bad idea. In theory it might be possible to produce such a stack with the benefits of an open source development model, however most vendors seem to fail at this. They see open source as a release model: they develop internally and shovel the results over the fence into a github repo every X weeks after a bunch of cycles. They build products containing these open source pieces, but they never expend the time building projects or communities around them.
As an example take AMDVLK vs radv. I started radv because AMD had been promising the world an open source Vulkan driver for Linux that was shared with their Windows stack. Even when it was delivered it was open source released but internally developed. There was no avenue for community participation in the driver development. External contributors were never on the same footing as an AMD employee. Even AMD employees on different teams weren't on the same footing. Compare this to the radv project in Mesa, where it allowed Valve to contribute the ACO backend compiler and provide better results than AMD vendor shared code could ever have done, with far less investment and manpower.
Intel have a non-Mesa compiler, the Intel Graphics Compiler, mentioned in the article. It is fully developed by Intel internally; there is little info on project direction, how to get involved, or where the community is. There doesn't seem to be much public review, and patches seem to get merged to the public repo by igcbot, which may mean they are being mirrored from some internal repo. They are not using GitHub merge requests etc. Compare this to development of a Mesa NIR backend, where lots of changes are reviewed and maximal common code sharing is attempted so that all vendors benefit from the code.
One area where it has mostly sort of worked out is the AMD display code in the kernel. I believe this code to be shared with their Windows driver (but I'm not 100% sure). They do try to engage with community changes to the code, but the code is still pretty horrible and not really optimal on Linux. Integrating it with atomic modesetting and refactoring was a pain. So even in the best case it's not an optimal outcome even for the vendor. They have to work hard to make the shared code be capable of supporting different OS interactions.
How would I do it?
If I had to share a Windows/Linux driver stack I'd (biased opinion) start from the most open project and bring that into the closed projects. I definitely wouldn't start with a new internal project that tries to disrupt both. For example if I needed to create a Windows GL driver, I could:
a) write a complete GL implementation, throw it over the wall every few weeks, and make Windows/Linux use it. Linux users lose out on the shared stack, distros lose out because instead of one dependency they have to build a stack of multiple per-vendor deps, Windows gains nothing really, but I'm so in control of my own destiny (communities don't matter).
b) use Mesa and upstream my driver to share with the Linux stack, adding the Windows code to the Mesa stack. I get to share the benefits of external development by other vendors, Windows gains that benefit, and Linux retains the benefits to its ecosystem.
A warning then to anyone wishing for more vendor code sharing between OSes: it generally doesn't end with Linux being better off, it ends up with Linux being more fragmented, harder to support, and in the long run unsustainable.
November 06, 2020
In “power management 101” I talked about how sometimes OEMs will configure a laptop to 12W or 15W in BIOS and then re-configure it to 25W in software. This is done via the “Intel Dynamic Platform and Thermal Framework” driver on Windows. The DPTF driver manages your overall system thermals and keeps the system within its thermal budget. This is especially important for fanless or ultra-thin laptops where the cooling may not be sufficient for the system to run flat-out for long periods. One thing the DPTF driver does is dynamically adjust the TDP of your CPU. It can adjust it both up if the laptop is running cool and you need the power or down if the laptop is running hot and needs to cool down. Some OEMs choose to be very conservative with their TDP defaults in BIOS to prevent the laptop from overheating or constantly running hot if the Windows DPTF driver is not available.
On Linux, the equivalent to this is thermald. When installed and enabled on your system, it reads the same OEM configuration data from ACPI as the Windows DPTF driver and is also able to scale up your package TDP threshold past the BIOS default as per the OEM configuration. You can also write your own configuration files if you really wish, but you do so at your own risk.
Most distros package thermald but it may not be enabled nor work quite properly out-of-the-box. This is because, historically, it has relied on the closed-source dptfxtract utility that's provided by Intel as a binary. It requires dptfxtract to fetch the OEM provided configuration data from the ACPI tables. Since most distros don't usually ship closed-source software in their main repositories and since thermald doesn't do much without that data, a lot of distros don't bother to ship or enable it by default. You'll have to turn it on manually.
To fix this, install both thermald and dptfxtract and ensure that thermald is enabled. On most distros, thermald is packaged normally even if it isn’t enabled by default because it is open-source. The dptfxtract utility is usually available in your distro’s non-free repositories. On Ubuntu, dptfxtract is available as a package in multiverse. For Fedora, dptfxtract is available via RPM Fusion’s non-free repo. There are also packages for Arch and likely others as well. If no one packages it for your distro, it’s just one binary so it’s pretty easy to install manually.
Some of this may change going forward, however. Recently, Matthew Garrett did some work to reverse-engineer the DPTF framework and provide support for fetching the DPTF data from ACPI without the need for the binary blob. When running with a recent kernel and Matthew's fork of thermald, you should be able to get OEM-configured thermals without the need for the dptfxtract blob, at least on some hardware. Whether or not you get the right configuration will depend on your hardware, your kernel version, your distro, and whether they ship the Intel version of thermald or Matthew's fork. Even there, your distro may leave it uninstalled or disabled by default. It's still disabled by default in Fedora 33, for instance.
It should be noted at this point that, if thermald and dptfxtract are doing their job, your laptop is likely to start running much hotter when under heavy load than it did before. This is because thermald is re-configuring your processor with a higher thermal budget which means it can now run faster but it will also generate more heat and may drain your battery faster. In theory, thermald should keep your laptop’s thermals within safe limits; just not within the more conservative limits the OEM programmed into BIOS. If all the additional heat makes you uncomfortable, you can just disable thermald and it should go back to the BIOS defaults.
…of my full-time hobby work on zink.
At least for a while.
More on that at the end of the post.
Before I get to that, let’s start with yesterday’s riddle. Anyone who chose the pic with 51 fps as being zink: you were correct.
That’s right, zink is now at around 95% of native GL performance for this benchmark, at least on my system.
I know there’s been a lot of speculation about the capability of the driver to reach native or even remotely-close-to-native speeds, and I’m going to say definitively that it’s possible, and performance is only going to increase further from here.
A bit of a different look on things can also be found on my Fall roundup post here.
I’ve long been working on zink using a single-thread architecture, and my goal has been to make it as fast as possible within that constraint. Part of my reasoning is that it’s been easier to work within the existing zink architecture than to rewrite it, but the main issue is just that threads are hard, and if you don’t have a very stable foundation to build off of when adding threading to something, it’s going to get exponentially more difficult to gain that stability afterwards.
Reaching a 97% pass rate on my piglit tests at GL 4.6 and ES 3.2 gave me a strong indicator that the driver was in good enough shape to start looking at threads more seriously. Sure, piglit tests aren’t CTS; they fail to cover a lot of areas, and they’re certainly less exhaustive about the areas that they do cover. With that said, CTS isn’t a great tool for zink at the moment due to the lack of provoking vertex compatibility support in the driver (I’m still waiting on a Vulkan extension for this, though it’s looking likely that Erik will be providing a fallback codepath for this using a geometry shader in the somewhat near future) which will fail lots of tests. Given the sheer number of CTS tests, going through the failures and determining which ones are failing due to provoking vertex issues and which are failing due to other issues isn’t a great use of my time, so I’m continuing to wait on that. The remaining piglit test failures are mostly due either to provoking vertex issues or some corner case missing features such as multisampled ZS readback which are being worked on by other people.
With all that rambling out of the way, let’s talk about threads and how I’m now using them in `zink-wip`.

At present, I’m using `u_threaded_context`, aka `glthread`, making zink the only non-radeon driver to implement it. The way this works is by using Gallium to write the command stream to a buffer that is then processed asynchronously, freeing up the main thread for application use and avoiding any sort of blocking from driver overhead. For systems where zink is CPU-bound in the driver thread, this massively increases performance, as seen from the ~40% fps improvement that I gained after the implementation.
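For readers unfamiliar with the general pattern, here is a minimal, self-contained toy in C of the idea: an application thread records commands into a queue, and a separate driver thread drains that queue asynchronously. This is purely illustrative; none of these names come from Mesa, and the real `u_threaded_context` is far more involved.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

#define QUEUE_SIZE 64

struct cmd { int opcode; int arg; };   /* stand-in for a recorded command */

static struct cmd queue[QUEUE_SIZE];
static unsigned head, tail;            /* consumer / producer positions */
static bool done;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

/* application thread: record a command and return immediately */
static void enqueue(struct cmd c)
{
   pthread_mutex_lock(&lock);
   while (tail - head == QUEUE_SIZE)   /* queue full: wait for the driver thread */
      pthread_cond_wait(&cond, &lock);
   queue[tail++ % QUEUE_SIZE] = c;
   pthread_cond_broadcast(&cond);
   pthread_mutex_unlock(&lock);
}

/* driver thread: drain the queue and do the (expensive) real work asynchronously */
static void *driver_thread(void *arg)
{
   (void)arg;
   pthread_mutex_lock(&lock);
   for (;;) {
      while (head == tail && !done)
         pthread_cond_wait(&cond, &lock);
      if (head == tail && done)
         break;
      struct cmd c = queue[head++ % QUEUE_SIZE];
      pthread_cond_broadcast(&cond);   /* wake the app thread if it was waiting for space */
      pthread_mutex_unlock(&lock);
      printf("executing opcode %d (arg %d)\n", c.opcode, c.arg);
      pthread_mutex_lock(&lock);
   }
   pthread_mutex_unlock(&lock);
   return NULL;
}

int main(void)
{
   pthread_t t;
   pthread_create(&t, NULL, driver_thread, NULL);
   for (int i = 0; i < 10; i++)
      enqueue((struct cmd){ .opcode = 1, .arg = i });   /* "draw calls" */
   pthread_mutex_lock(&lock);
   done = true;
   pthread_cond_broadcast(&cond);
   pthread_mutex_unlock(&lock);
   pthread_join(t, NULL);
   return 0;
}
```

The key property this toy shares with the real thing is that the application-facing call returns as soon as the command is recorded, so driver overhead stops blocking the application thread.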
This transition presented a number of issues, the first of which was that `u_threaded_context` required buffer invalidation and rebinding. I’d had this on my list of targets for a while, so it was a good opportunity to finally hook it up.
Next up, `u_threaded_context` was very obviously written to work for the existing radeon driver architecture, and this was entirely incompatible with zink, specifically in how the batch/command buffer implementation is hardcoded like I talked about yesterday. Switching to monotonic, dynamically scaling command buffer usage resolved that and brought with it some other benefits.
The other big issue was, as I’m sure everyone expected, documentation.
I certainly can’t deny that there’s lots of documentation for `u_threaded_context`. It exists, it’s plentiful, and it’s quite detailed in some cases.
It’s also written by people who know exactly how it works with the expectation that it’s being read by other people who know exactly how it works. I had no idea going into the implementation how any of it worked other than a general knowledge of the asynchronous command stream parts that are common to all thread queue implementations, so this was a pretty huge stumbling block.
Nevertheless, I persevered, and with the help of a lot of RTFC, I managed to get it up and running. This is a more general overview post rather than a more in-depth, technical one, so I’m not going to go into any deep analysis of the (huge amounts of) code required to make it work, but here’s some key points from the process in case anyone reading this hits some of the same issues/annoyances that I did:
- convert your driver class -> gallium class references to driver class -> u_threaded_context class -> gallium class ones; if you can `sed` these all at once, it simplifies the work tremendously
- `u_threaded_context` works off the radeon queue/fence architecture, which allows (in some cases) multiple fences for any given queue submission, so ensure that your fences work the same way or (as I did) can effectively have sub-fences
- a `pipe_context` can be in many different threads at a given time, and so, as the `u_threaded_context` docs repeatedly say without further explanation, don’t use it “in an unsafe way”
- check the `TC_TRANSFER_MAP_*` flags before doing the things that those flags prohibit
- skip `threaded_resource::max_forced_staging_uploads` to start with since it adds complexity
- when handling `TC_TRANSFER_MAP_THREADED_UNSYNC`, you have to use `threaded_context::base.stream_uploader` for staging buffers, though this isn’t (currently) documented anywhere
- `u_threaded_context` was written for radeon drivers, so there may be other cases where hardcoded values for those drivers exist

All told, fixing all the regressions took much longer than the actual implementation, but that’s just par for the course with driver work.
Anyone interested in testing should take note that, as always, this has only been used on Intel hardware (and if you’re on Intel, this post is definitely worth reading), and so on systems which were not CPU-bound previously or haven’t been worked on by me, you may not yet see these kinds of gains.
But you will eventually.
This is a sort of bittersweet post as it marks the end of my full-time hobby work with zink. I’ve had a blast over the past ~6 months, but all things change eventually, and such is the case with this situation.
Those of you who have been following me for a long time will recall that I started hacking on zink while I was between jobs in order to improve my skills and knowledge while doing something productive along the way. I succeeded in all regards, at least by my own standards, and I got to work with some brilliant people at the same time.
But now, at last, I will once again become employed, and the course of that employment will take me far away from this project. I don’t expect that I’ll have a considerable amount of mental energy to dedicate to hobbyist Open Source projects, at least for the near term, so this is a farewell of sorts in that sense. This means (again, for at least the near term):
This does not mean that zink is dead, or the project is stalling development, or anything like that, so don’t start overreaching on the meaning of this post.
I still have 450+ patches left to be merged into mainline Mesa, and I do plan to continue driving things towards that end, though I expect it’ll take a good while. I’ll also be around to do patch reviews for the driver and continue to be involved in the community.
I look forward to a time when I’ll get to write more posts here and move the zink user experience closer to where I think it can be.
This is Mike, signing off for now.
Happy rendering.
November 05, 2020
During my presentation at the X Developers Conference I stated that we had been mostly using the Khronos Vulkan Conformance Test suite (aka Vulkan CTS) to validate our Vulkan driver for Raspberry Pi 4 (aka V3DV). While the CTS is an invaluable resource for driver testing and validation, it doesn’t exactly compare to actual real world applications, and so, I made the point that we should try to do more real world testing for the driver after completing initial Vulkan 1.0 support.
To be fair, we had been doing a little bit of this already when I worked on getting the Vulkan ports of all 3 Quake game classics to work with V3DV, which allowed us to identify and fix a few driver bugs during development. The good thing about these games is that we could get the source code and compile them natively for ARM platforms, so testing and debugging was very convenient.
Unfortunately, there are not a plethora of Vulkan applications and games like these that we can easily test and debug on a Raspberry Pi as of today, which posed a problem. One way to work around this limitation that was suggested after my presentation at XDC was to use Zink, the OpenGL to Vulkan layer in Mesa. Using Zink, we can take existing OpenGL applications that are currently available for Raspberry Pi and use them to test our Vulkan implementation a bit more thoroughly, expanding our options for testing while we wait for the Vulkan ecosystem on Raspberry Pi 4 to grow.
So last week I decided to get hands on with that. Zink requires a few things from the underlying Vulkan implementation depending on the OpenGL version targeted. Currently, Zink only targets desktop OpenGL versions, so that limits us to OpenGL 2.1, which is the maximum version of desktop OpenGL that Raspberry Pi 4 can support (we support up to OpenGL ES 3.1 though). For that desktop OpenGL version, Zink required a few optional Vulkan 1.0 features that we were missing in V3DV, namely:
The first two were trivial: they were already implemented and we only had to expose them in the driver. Notably, when I was testing these features with the relevant CTS tests I found a bug in the alpha to one tests, so I proposed a fix to Khronos which is currently in review.
I also noticed that Zink was implicitly requiring support for timestamp queries, so I implemented that in V3DV and then also wrote a patch for Zink to handle this requirement better.
Finally, Zink doesn’t use Vulkan swapchains; instead it creates presentable images directly, which was problematic for us because our platform needs to handle allocations for presentable images specially, so a patch for Zink was also required to address this.
As of the writing of this post, all this work has been merged in Mesa and it enables Zink to run OpenGL 2.1 applications over V3DV on Raspberry Pi 4. Here are a few screenshots of Quake3 taken with the native OpenGL driver (V3D), with the native Vulkan driver (V3DV) and with Zink (over V3DV). There is a significant performance hit with Zink at present, although that is probably not too unexpected at this stage, but otherwise it seems to be rendering correctly, which is what we were really interested to see:
Note: you’ll notice that the Vulkan screenshot is darker than the OpenGL versions. As I reported in another post, that is a feature of the Vulkan port of Quake3 and is unrelated to the driver.
Going forward, we expect to use Zink to test more applications and hopefully identify driver bugs that help us make V3DV better.
I’ve been busy cramming more code than ever into the repo this week in order to finish up my final project for a while by Friday. I’ll talk more about that tomorrow though. Today I’ve got two things for all of you.
Of these two screenshots, one is zink+ANV and one is IRIS. Which is which?
Let’s talk a bit at a high level about how zink uses (non-compute) command buffers.
Currently in the repo zink works like this: when a flush is triggered (e.g., `glFlush`), the command buffers cycle.

In short, there’s a huge bottleneck around the flushing mechanism, and then there’s a lesser-reached bottleneck for cases where an application flushes repeatedly before a command buffer’s ops are completed.
Some time ago I talked about some modifications I’d done to the above architecture, and then things looked more like this: when a flush is triggered (e.g., `glFlush`), the command buffers cycle.

The major difference after this work was that the flushing was reduced, which then greatly reduced the impact of that bottleneck that exists when all the command buffers are submitted and the driver wants to continue recording commands.
A lot of speculation has occurred among the developers over “how many” command buffers should be used, and there’s been some talk of profiling this, but for various reasons I’ll get into tomorrow, I opted to sidestep the question entirely in favor of a more dynamic solution: monotonically-identified command buffers.
The basic idea behind this strategy, which is used by a number of other drivers in the tree, is that there’s no need to keep a “ring” of command buffers to cycle through, as the driver can just continually allocate new command buffers on-the-fly and submit them as needed, reusing them once they’ve naturally completed instead of forcibly stalling on them. Here’s a visual comparison:
This way, there’s no possibility of stalling based on application flushes (or the rare driver-internal flush which does still exist in a couple places).
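To make the strategy concrete, here is a hedged sketch of what on-demand command buffer reuse can look like in Vulkan. This is not zink's code; the `cmdbuf_entry` struct and `acquire_cmdbuf` function are invented for illustration, and it assumes the command pool was created with `VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT`.

```c
#include <stdlib.h>
#include <vulkan/vulkan.h>

/* Hypothetical bookkeeping entry; all names here are invented for illustration. */
struct cmdbuf_entry {
   VkCommandBuffer cmdbuf;
   VkFence fence;                /* signaled once the GPU finishes this submission */
   struct cmdbuf_entry *next;
};

/* Acquire a command buffer to record into: reuse one whose previous submission has
 * completed, otherwise allocate a brand new one. Nothing ever stalls waiting here.
 * Error handling is omitted for brevity. */
static struct cmdbuf_entry *
acquire_cmdbuf(VkDevice dev, VkCommandPool pool, struct cmdbuf_entry **in_flight)
{
   for (struct cmdbuf_entry **p = in_flight; *p; p = &(*p)->next) {
      if (vkGetFenceStatus(dev, (*p)->fence) == VK_SUCCESS) {
         struct cmdbuf_entry *e = *p;
         *p = e->next;                          /* unlink from the in-flight list */
         vkResetFences(dev, 1, &e->fence);
         vkResetCommandBuffer(e->cmdbuf, 0);
         return e;
      }
   }

   /* no completed command buffer available: grow the pool instead of stalling */
   struct cmdbuf_entry *e = calloc(1, sizeof(*e));
   VkCommandBufferAllocateInfo cb_info = {
      .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO,
      .commandPool = pool,
      .level = VK_COMMAND_BUFFER_LEVEL_PRIMARY,
      .commandBufferCount = 1,
   };
   vkAllocateCommandBuffers(dev, &cb_info, &e->cmdbuf);

   VkFenceCreateInfo f_info = { .sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO };
   vkCreateFence(dev, &f_info, NULL, &e->fence);
   return e;
}
```

After recording, the entry would be submitted along with its fence and pushed back onto the in-flight list; completed entries then naturally become available again on a later acquire, with no forced waits anywhere.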
The architectural change here had two great benefits:
The latter of these is due to the way that the queue in zink is split between gfx and compute command buffers; with the hardcoded batch system, the compute queue had its own command buffer while the gfx queue had four, but they all had unique IDs which were tracked using bitfields all over the place, not to mention it was frustrating never being able to just “know” which command buffer was currently being recorded to for a given command without indexing the array.
Now it’s easy to know which command buffer is currently being recorded to, as it’ll always be the one associated with the queue (gfx or compute) for the given operation.
This had further implications, however, and I’d done this to pave the way for a bigger project, one that I’ve spent the past few days on. Check back tomorrow for that and more.
November 02, 2020
Quick update today, but I’ve got some very exciting news coming soon.
The biggest news of the day is that work is underway to merge some patches from Duncan Hopkins which enable zink to run on Mac OS using MoltenVK. This has significant potential to improve OpenGL support on that platform, so it’s awesome that work has been done to get the ball rolling there.
In only slightly less monumental news though, Adam Jackson is already underway with Vulkan WSI work for zink, which is going to be huge for performance.
October 30, 2020
(I just sent the below email to the mesa3d developer list.)
October 29, 2020
I’ve got a lot of exciting stuff in the pipe now, but for today I’m just going to talk a bit about resource invalidation: what it is, when it happens, and why it’s important.
Let’s get started.
Resource invalidation occurs when the backing buffer of a resource is wholly replaced. Consider the following scenario under zink:

- the driver has `struct A { VkBuffer buffer; };`
- the application calls `glBufferData(target, size, data, usage)`, which stores data to `A.buffer`
- the application calls `glBufferData(target, size, NULL, usage)`, which unsets the data from `A.buffer`

On a sane/competent driver, the second `glBufferData` call will trigger invalidation, which means that `A.buffer` will be replaced entirely, while `A` is still the driver resource used by Gallium to represent `target`.
Resource invalidation can occur in a number of scenarios, but the most common is when unsetting a buffer’s data, as in the above example. The other main case for it is replacing the data of a buffer that’s in use for another operation. In such a case, the backing buffer can be replaced to avoid forcing a sync in the command stream which will stall the application’s processing. There’s some other cases for this as well, like `glInvalidateFramebuffer` and `glDiscardFramebufferEXT`, but the primary usage that I’m interested in is buffers.
The main reason is performance. In the above scenario without invalidation, the second `glBufferData` call will write null to the whole buffer, which is going to be much more costly than just creating a new buffer.
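For context, this is the classic “buffer orphaning” pattern that applications use (knowingly or not) when streaming data. A hedged sketch of the GL side, assuming a GL 1.5+ context with prototypes available; the helper name and parameters are mine:

```c
#include <GL/gl.h>   /* assumes buffer-object prototypes are available (e.g. GL_GLEXT_PROTOTYPES) */

/* vertices, new_vertices, and size are assumed to come from the application */
static void stream_vertices(GLuint vbo, GLsizeiptr size,
                            const void *vertices, const void *new_vertices)
{
   glBindBuffer(GL_ARRAY_BUFFER, vbo);
   glBufferData(GL_ARRAY_BUFFER, size, vertices, GL_DYNAMIC_DRAW);     /* initial upload */
   /* ... draw calls that read from vbo may still be in flight here ... */
   glBufferData(GL_ARRAY_BUFFER, size, NULL, GL_DYNAMIC_DRAW);         /* invalidate / orphan */
   glBufferData(GL_ARRAY_BUFFER, size, new_vertices, GL_DYNAMIC_DRAW); /* fresh data, new backing buffer */
}
```

With invalidation in the driver, the NULL upload simply swaps in a new backing allocation instead of stalling on whatever still references the old one.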
Now comes the slightly more interesting part: how does invalidation work in zink?
Currently, as of today’s mainline zink codebase, we have `struct zink_resource` to represent a resource for either a buffer or an image. One `struct zink_resource` represents exactly one `VkBuffer` or `VkImage`, and there’s some passable lifetime tracking that I’ve written to guarantee that these Vulkan objects persist through the various command buffers that they’re associated with.

Each `struct zink_resource` is, as is the way of Gallium drivers, also a `struct pipe_resource`, which is tracked by Gallium. Because of this, `struct zink_resource` objects themselves cannot be invalidated in order to avoid breaking Gallium, and instead only the inner Vulkan objects themselves can be replaced.

For this, I created `struct zink_resource_object`, which is an object that stores only the data that directly relates to the Vulkan objects, leaving `struct zink_resource` to track the states of these objects. Their lifetimes are separate, with `struct zink_resource` being bound to the Gallium tracker and `struct zink_resource_object` persisting for either the lifetime of `struct zink_resource` or its command buffer usage, whichever is longer.

The code for this mechanism isn’t super interesting since it’s basically just moving some parts around. Where it gets interesting is the exact mechanics of invalidation and how `struct zink_resource_object` can be injected into an in-use resource, so let’s dig into that a bit.

Here’s what the `pipe_context::invalidate_resource` hook looks like:
static void
zink_invalidate_resource(struct pipe_context *pctx, struct pipe_resource *pres)
{
struct zink_context *ctx = zink_context(pctx);
struct zink_resource *res = zink_resource(pres);
struct zink_screen *screen = zink_screen(pctx->screen);
if (pres->target != PIPE_BUFFER)
return;
This only handles buffer resources, but extending it for images would likely be little to no extra work.
if (res->valid_buffer_range.start > res->valid_buffer_range.end)
return;
Zink tracks the valid data segments of its buffers. This conditional is used to check for an uninitialized buffer, i.e., one which contains no valid data. If a buffer has no data, it’s already invalidated, so there’s nothing to be done here.
util_range_set_empty(&res->valid_buffer_range);
Invalidating means the buffer will no longer have any valid data, so the range tracking can be reset here.
if (!get_all_resource_usage(res))
return;
If this resource isn’t currently in use, unsetting the valid range is enough to invalidate it, so it can just be returned right away with no extra work.
struct zink_resource_object *old_obj = res->obj;
struct zink_resource_object *new_obj = resource_object_create(screen, pres, NULL, NULL);
if (!new_obj) {
debug_printf("new backing resource alloc failed!");
return;
}
Here’s the old internal buffer object as well as a new one, created using the existing buffer as a template so that it’ll match.
res->obj = new_obj;
res->access_stage = 0;
res->access = 0;
`struct zink_resource` is just a state tracker for the `struct zink_resource_object` object, so upon invalidate, the states are unset since this is effectively a brand new buffer.
zink_resource_rebind(ctx, res);
This is the tricky part, and I’ll go into more detail about it below.
zink_descriptor_set_refs_clear(&old_obj->desc_set_refs, old_obj);
If this resource was used in any cached descriptor sets, the references to those sets need to be invalidated so that the sets won’t be reused.
zink_resource_object_reference(screen, &old_obj, NULL);
}
Finally, the old struct zink_resource_object
is unrefed, which will ensure that it gets destroyed once its current command buffer has finished executing.
Simple enough, but what about that `zink_resource_rebind()` call? Like I said, that’s where things get a little tricky, but because of how much time I spent on descriptor management, it ends up not being too bad.
This is what it looks like:
void
zink_resource_rebind(struct zink_context *ctx, struct zink_resource *res)
{
assert(res->base.target == PIPE_BUFFER);
Again, this mechanism is only handling buffer resources for now, and there’s only one place in the driver that calls it, but it never hurts to be careful.
for (unsigned shader = 0; shader < PIPE_SHADER_TYPES; shader++) {
if (!(res->bind_stages & BITFIELD64_BIT(shader)))
continue;
for (enum zink_descriptor_type type = 0; type < ZINK_DESCRIPTOR_TYPES; type++) {
if (!(res->bind_history & BITFIELD64_BIT(type)))
continue;
Something common to many Gallium drivers is this idea of “bind history”, which is where a resource will have bitflags set when it’s used for a certain type of binding. While other drivers have a lot more cases than zink does due to various factors, the only thing that needs to be checked for my purposes is the descriptor type (UBO, SSBO, sampler, shader image) across all the shader stages. If a given resource has the flags set here, this means it was at some point used as a descriptor of this type, so the current descriptor bindings need to be compared to see if there’s a match.
uint32_t usage = zink_program_get_descriptor_usage(ctx, shader, type);
while (usage) {
const int i = u_bit_scan(&usage);
This is a handy mechanism that returns the current descriptor usage of a shader as a bitfield. So for example, if a vertex shader uses UBOs in slots 0, 1, and 3, `usage` will be 11 (binary 1011), and the loop will process `i` as 0, 1, and 3.
struct zink_resource *cres = get_resource_for_descriptor(ctx, type, shader, i);
if (res != cres)
continue;
Now the slot of the descriptor type can be compared against the resource that’s being re-bound. If this resource is the one that’s currently bound to the specified slot of the specified descriptor type, then steps can be taken to perform additional operations necessary to successfully replace the backing storage for the resource, mimicking the same steps taken when initially binding the resource to the descriptor slot.
switch (type) {
case ZINK_DESCRIPTOR_TYPE_SSBO: {
struct pipe_shader_buffer *ssbo = &ctx->ssbos[shader][i];
util_range_add(&res->base, &res->valid_buffer_range, ssbo->buffer_offset,
ssbo->buffer_offset + ssbo->buffer_size);
break;
}
For SSBO descriptors, the only change needed is to add the bound region to the buffer’s valid range. This region is passed to the shader, so even if it’s never written to, it might be, and so it can be considered a valid region.
case ZINK_DESCRIPTOR_TYPE_SAMPLER_VIEW: {
struct zink_sampler_view *sampler_view = zink_sampler_view(ctx->sampler_views[shader][i]);
zink_descriptor_set_refs_clear(&sampler_view->desc_set_refs, sampler_view);
zink_buffer_view_reference(ctx, &sampler_view->buffer_view, NULL);
sampler_view->buffer_view = get_buffer_view(ctx, res, sampler_view->base.format,
sampler_view->base.u.buf.offset, sampler_view->base.u.buf.size);
break;
}
Sampler descriptors require that a new `VkBufferView` be created since the previous one is no longer valid. Again, the references for the existing buffer view need to be invalidated now since that descriptor set can no longer be reused from the cache, and then the new `VkBufferView` is set after unrefing the old one.
case ZINK_DESCRIPTOR_TYPE_IMAGE: {
struct zink_image_view *image_view = &ctx->image_views[shader][i];
zink_descriptor_set_refs_clear(&image_view->desc_set_refs, image_view);
zink_buffer_view_reference(ctx, &image_view->buffer_view, NULL);
image_view->buffer_view = get_buffer_view(ctx, res, image_view->base.format,
image_view->base.u.buf.offset, image_view->base.u.buf.size);
util_range_add(&res->base, &res->valid_buffer_range, image_view->base.u.buf.offset,
image_view->base.u.buf.offset + image_view->base.u.buf.size);
break;
}
Images are nearly identical to the sampler case, the difference being that while samplers are read-only like UBOs (and therefore reach this point already having valid buffer ranges set), images are more like SSBOs and can be written to. Thus the valid range must be set here like in the SSBO case.
default:
break;
Eagle-eyed readers will note that I’ve omitted a UBO case, and this is because there’s nothing extra to be done there. UBOs will already have their valid range set and don’t need a `VkBufferView`.
}
invalidate_descriptor_state(ctx, shader, type);
Finally, the incremental descriptor state hash for this shader stage and descriptor type is invalidated. It’ll be recalculated normally upon the next draw or compute operation, so this is a quick zero-setting operation.
}
}
}
}
That’s everything there is to know about the current state of resource invalidation in zink!
October 24, 2020
A rare Saturday post because I spent so much time this week intending to blog and then somehow not getting around to it. Let’s get to the status updates, and then I’m going to dive into the more interesting of the things I worked on over the past few days.
Zink has just hit another big milestone that I’ve just invented: as of now, my branch is passing 97% of piglit tests up through GL 4.6 and ES 3.2, and it’s a huge improvement from earlier in the week when I was only at around 92%. That’s just over 1000 failure cases remaining out of ~41,000 tests. For perspective, a table.
| | IRIS | zink-mainline | zink-wip |
|---|---|---|---|
| Passed Tests | 43508 | 21225 | 40190 |
| Total Tests | 43785 | 22296 | 41395 |
| Pass Rate | 99.4% | 95.2% | 97.1% |
As always, I happen to be running on Intel hardware, so IRIS and ANV are my reference points.
It’s important to note here that I’m running piglit tests, and this is very different from CTS; put another way, I may be passing over 97% of the test cases I’m running, but that doesn’t mean that zink is conformant for any versions of GL or ES, which may not actually be possible at present (without huge amounts of awkward hacks) given the persistent issues zink has with provoking vertex handling. I expect this situation to change in the future through the addition of more Vulkan extensions, but for now I’m just accepting that there’s some areas where zink is going to misrender stuff.
The biggest change that boosted the `zink-wip` pass rate was my fixing 64bit vertex attributes, which in total had been accounting for ~2000 test failures.
Vertex attributes, as we all know since we’re all experts in the graphics field, are the inputs for vertex shaders, and the data types for these inputs can vary just like C data types. In particular, with GL 4.1, ARB_vertex_attrib_64bit became a thing, which allows 64bit values to be passed as inputs here.
Once again, this is a problem for zink.
It comes down to the difference between GL’s implicit handling methodology and Vulkan’s explicit handling methodology. Consider the case of a `dvec4` data type. Conceptually, this is a data type which is 4x64bit values, requiring 32 bytes of storage. A `vec4` uses 16 bytes of storage, and this equates to a single “slot” or “location” within the shader inputs, as everything there is vec4-aligned. This means that, by simple arithmetic, a `dvec4` requires two slots for its storage: one for the first two members and another for the second two, each consuming a single 16-byte slot.

When loading a `dvec4` in GL(SL), a single variable with the first location slot is used, and the driver will automatically use the second slot when loading the second half of the value.

When loading a `dvec4` in (SPIR-V) Vulkan, two variables with consecutive, explicit location slots must be used, and the driver will load exactly the input location specified.

This difference requires that for any `dvec3` or `dvec4` vertex input in zink, the value and also the load have to be split along the `vec4` boundary for things to work.
Gallium already performs this split on the API side, allowing zink to already be correctly setting things up in the `VkPipeline` creation, so I wrote a NIR pass to fix things on the shader side.
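As an illustration of what the API-side split amounts to, here is one plausible way to express a split `dvec4` attribute in Vulkan terms: two 16-byte halves at consecutive locations. This is a sketch rather than zink's actual code, and the binding and offsets are placeholders.

```c
#include <vulkan/vulkan.h>

/* A dvec4 attribute fed from a tightly packed 32-byte vertex element, split into
 * two 16-byte halves so that each half occupies exactly one input location. */
static const VkVertexInputAttributeDescription dvec4_attrs[2] = {
   { .location = 0, .binding = 0, .format = VK_FORMAT_R64G64_SFLOAT, .offset = 0 },
   { .location = 1, .binding = 0, .format = VK_FORMAT_R64G64_SFLOAT, .offset = 16 },
};
```

The NIR pass described below then makes the shader-side inputs consume one location each so they line up with this kind of layout.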
Yes, it’s been at least a week since I last wrote about a NIR pass, so it’s past time that I got back into that.
Going into this, the idea here is to perform the following operations within the vertex shader:

- for a 64bit input variable (hereafter `A`), find the `deref` instruction (hereafter `A_deref`); `deref` is used to access variables for input and output, and so it’s guaranteed that any 64bit input will first have a `deref`
- create a new variable (hereafter `B`) of size `double` (for `dvec3`) or `dvec2` (for `dvec4`) to represent the second half of `A`
- change `A` and `A_deref`’s type to `dvec2`; this aligns the variable (and its subsequent load) to the `vec4` boundary, which enables it to be correctly read from a single location slot
- create a `deref` instruction for `B` (hereafter `B_deref`)
- find the `load_deref` instruction for `A_deref` (hereafter `A_load`); a `load_deref` instruction is used to load data from a variable `deref`
- clamp the number of components loaded by `A_load` to 2, matching its new `dvec2` size
- create a `load_deref` instruction for `B_deref` which will load the remaining components (hereafter `B_load`)
- create a new (hereafter `C_load`) `dvec3` or `dvec4` by combining `A_load` + `B_load` to match the load of the original type of `A`
- rewrite all uses of `A_load`’s result to instead use `C_load`’s result

Simple, right?
Here we go.
static bool
lower_64bit_vertex_attribs_instr(nir_builder *b, nir_instr *instr, void *data)
{
if (instr->type != nir_instr_type_deref)
return false;
nir_deref_instr *A_deref = nir_instr_as_deref(instr);
if (A_deref->deref_type != nir_deref_type_var)
return false;
nir_variable *A = nir_deref_instr_get_variable(A_deref);
if (A->data.mode != nir_var_shader_in)
return false;
if (!glsl_type_is_64bit(A->type) || !glsl_type_is_vector(A->type) || glsl_get_vector_elements(A->type) < 3)
return false;
First, it’s necessary to filter out all the instructions that aren’t what should be rewritten. As above, only `dvec3` and `dvec4` types are targeted here (`dmat*` types are reduced to `dvec` types prior to this point), so anything other than an `A_deref` of variables with those types is ignored.
/* create second variable for the split */
nir_variable *B = nir_variable_clone(A, b->shader);
/* split new variable into second slot */
B->data.driver_location++;
nir_shader_add_variable(b->shader, B);
`B` matches `A` except in its type and slot location, which will always be one greater than the slot location of `A`, so `A` can be cloned here to simplify the process of creating `B`.
unsigned total_num_components = glsl_get_vector_elements(A->type);
/* new variable is the second half of the dvec */
B->type = glsl_vector_type(glsl_get_base_type(A->type), glsl_get_vector_elements(A->type) - 2);
/* clamp original variable to a dvec2 */
A_deref->type = A->type = glsl_vector_type(glsl_get_base_type(A->type), 2);
`A` and `B` need their types modified to not cross the `vec4`/slot boundary. `A` is always a `dvec2`, which has 2 components, and `B` will always be the remaining components.
/* create A_deref instr for new variable */
b->cursor = nir_after_instr(instr);
nir_deref_instr *B_deref = nir_build_deref_var(b, B);
Now `B_deref` has been added thanks to the `nir_builder` helper function, which massively simplifies the process of setting up all the instruction parameters.
nir_foreach_use_safe(A_deref_use, &A_deref->dest.ssa) {
NIR is SSA-based, and all uses of an SSA value are tracked for the purposes of ensuring that SSA values are truly assigned only once as well as ease of rewriting them in the case where a value needs to be modified, just as this pass is doing. This use-tracking comes along with a simple API for iterating over the uses.
nir_instr *A_load_instr = A_deref_use->parent_instr;
assert(A_load_instr->type == nir_instr_type_intrinsic &&
nir_instr_as_intrinsic(A_load_instr)->intrinsic == nir_intrinsic_load_deref);
The only use of `A_deref` should be `A_load`, so really iterating over the `A_deref` uses is just a quick, easy way to get from there to the `A_load` instruction.
/* this is a load instruction for the A_deref, and we need to split it into two instructions that we can
* then zip back into a single ssa def */
nir_intrinsic_instr *A_load = nir_instr_as_intrinsic(A_load_instr);
/* clamp the first load to 2 64bit components */
A_load->num_components = A_load->dest.ssa.num_components = 2;
`A_load` must be clamped to a single slot location to avoid crossing the `vec4` boundary, so this is done by changing the number of components to 2, which matches the now-changed type of `A`.
b->cursor = nir_after_instr(A_load_instr);
/* this is the second load instruction for the second half of the dvec3/4 components */
nir_intrinsic_instr *B_load = nir_intrinsic_instr_create(b->shader, nir_intrinsic_load_deref);
B_load->src[0] = nir_src_for_ssa(&B_deref->dest.ssa);
B_load->num_components = total_num_components - 2;
nir_ssa_dest_init(&B_load->instr, &B_load->dest, B_load->num_components, 64, NULL);
nir_builder_instr_insert(b, &B_load->instr);
This is `B_load`, which loads a number of components that matches the type of `B`. It’s inserted after `A_load`, though the before/after isn’t important in this case. The key is just that this instruction is added before the next one.
nir_ssa_def *def[4];
/* create a new dvec3/4 comprised of all the loaded components from both variables */
def[0] = nir_vector_extract(b, &A_load->dest.ssa, nir_imm_int(b, 0));
def[1] = nir_vector_extract(b, &A_load->dest.ssa, nir_imm_int(b, 1));
def[2] = nir_vector_extract(b, &B_load->dest.ssa, nir_imm_int(b, 0));
if (total_num_components == 4)
def[3] = nir_vector_extract(b, &B_load->dest.ssa, nir_imm_int(b, 1));
nir_ssa_def *C_load = nir_vec(b, def, total_num_components);
Now that `A_load` and `B_load` both exist and are loading the corrected number of components, these components can be extracted and reassembled into a larger type for use in the shader, specifically the original `dvec3` or `dvec4` which is being used. `nir_vector_extract` performs this extraction from a given instruction by taking an index of the value to extract, and then the composite value is created by passing the extracted components to `nir_vec` as an array.
/* use the assembled dvec3/4 for all other uses of the load */
nir_ssa_def_rewrite_uses_after(&A_load->dest.ssa, nir_src_for_ssa(C_load), C_load->parent_instr);
Since this is all SSA, the NIR helpers can be used to trivially rewrite all the uses of the loaded value from the original `A_load` instruction to now use the assembled `C_load` value. It’s important that only the uses after `C_load` has been created (i.e., `nir_ssa_def_rewrite_uses_after`) are those that are rewritten, however, or else the shader will also rewrite the original `A_load` value with `C_load`, breaking the shader entirely with an SSA-impossible as well as generally-impossible `C_load = vec(C_load + B_load)` assignment.
}
return true;
}
Progress has occurred, so the pass returns true to reflect that.
Now those large attributes are loaded according to Vulkan spec, and everything is great because, as expected, ANV has no bugs here.
October 16, 2020
It’s been a very long week for me, and I’m only just winding down now after dissecting and resolving a crazy fp64/ssbo bug. I’m too scrambled to jump into any code, so let’s just do another fluff day and review happenings in mainline mesa which relate to zink.
Thanks to the tireless, eagle-eyed reviewing of Erik Faye-Lund, a ton of zink patches out of `zink-wip` have landed this week. Here’s an overview of that in backwards historical order:
Zink has grown tremendously over the past day or so of MRs landing, going from GLSL 1.30 and GL 3.0 to GLSL and GL 3.30.
It can even run Blender now. Unless you’re running it under Wayland.
October 15, 2020
New versions of the KWinFT projects Wrapland, Disman, KWinFT and KDisplay are available now. Their release was aligned with the release of Plasma 5.20 this week, and they offer new features and stability improvements.
The highlight this time is a completely redefined and reworked Disman that allows you to control display configurations not only in a KDE Plasma session with KWinFT but also with KWin and in other Wayland sessions with wlroots-based compositors as well as any X11 session.
You can use it with the included command-line tool dismanctl or together with the graphical frontend KDisplay. Read more about Disman's goals and technical details in the 5.20 beta announcement.
Let's cut directly to the chase! As Disman and KDisplay are replacements for libkscreen and KScreen and KWinFT for KWin you will be interested in a comparison from a user point of view. What is better and what should you personally choose?
If you run a KDE Plasma desktop at the moment, you should definitely consider using Disman and replacing KScreen with KDisplay.
Disman comes with a more reliable overall design, moving internal logic to its D-Bus service and away from the frontends in KDisplay. This makes changes more atomic and bugs less likely to emerge.
The UI of KDisplay is improved in comparison to KScreen and comfort functions have been added, as for example automatic selection of the best available mode.
There are still some caveats to this release that might prompt you to wait for the next one though:
So your mileage may vary but in most cases you should have a better experience with Disman and KDisplay.
And if you in general like to support new projects with ambitious goals that make use of the most modern technologies, you should definitely give it a try.
Disman includes a backend for wlroots-based compositors. I'm proud of this achievement since I believe we need more projects for the Linux desktop which do not only try to solve issues in their own little habitat and project monoculture but which aim at improving the Linux desktop in a more holistic and collaborative spirit.
I tested the backend myself and even provided some patches to wlroots directly to improve its output-management capabilities, so using Disman with wlroots should be a decent experience. One catch though is that for those patches above a new wlroots version must be released. For now you can only get them by compiling wlroots from master or having your distribution of choice backport them.
In comparison with other options to manage your displays in wlroots I believe Disman provides the most user-friendly solution taking off lots of work from your shoulders by automatically optimizing unknown new display setups and reloading data for already known setups.
Another prominent alternative for display management on wlroots is kanshi. I don't think kanshi is as easy to use and autonomously optimizing as Disman, but you might be able to configure displays more precisely with it. So you could prefer kanshi or Disman depending on your needs.
You can use Disman in wlroots sessions as a standalone system with its included command-line tool dismanctl and without KDisplay. This way you do not need to pull in as many KDE dependencies. But KDisplay together with Disman and wlroots also works very well and provides you an easy-to-use UI for adapting the display configuration according to your needs.
You will like it. Try it out is all I can say. The RandR backend is tested thoroughly and while there is still room for some refactoring it should work very well already. This is also independent of what desktop environment or window manager you use. Install it and see for yourself.
That being said the following issues are known at the moment:
I was talking a lot about Disman since it contains the most interesting changes this release and it can now be useful to many more people than before.
But you might also be interested in replacing KWin with KWinFT, so let's take a look at how KWinFT at this point in time compares to legacy KWin.
As it stands KWinFT is still a drop-in-replacement for it. You can install it to your system replacing KWin and use it together with a KDE Plasma session.
If you usually run an X11 session you should choose KWinFT without hesitation. It provides the same features as KWin and comes with an improved compositing pipeline that lowers latency and increases smoothness. There are also patches in the work to improve upon this further for multi-display setups. These patches might come to the 5.20 release via a bug fix release.
One point to keep in mind though is that the KWinFT project will concentrate in the future on improving the experience with Wayland. We won't maliciously regress the X11 experience but if there is a tradeoff between improving the Wayland session and regressing X11, KWinFT will opt for the former. But if such a situation unfolds at some point in time has yet to be seen. The X11 session might as well continue to work without any regressions for the next decade.
The situation is different if you want to run KWinFT as a Wayland compositor. I believe in regards to stability and robustness KWinFT is superior.
In particular this holds true for multi-display setups and display management. Although I worked mostly on Disman in the last two months that work naturally spilled over to KWinFT too. KWinFT's output objects are now much more reasonably implemented. Besides that there were many more bug fixes to outputs handling what you can convince yourself of by looking at the merged changes for 5.20 in KWinFT and Wrapland.
If you have issues with your outputs in KWin definitely try out KWinFT, of course together with Disman.
Another area where you probably will have a better experience is the composition itself. As on X11 the pipeline was reworked. For multi-display setups the patch, that was linked above and might come in a bug fix release to 5.20, should improve the situation further.
On the other side KWin's Wayland session gained some much awaited features with 5.20. According to the changelog, screencasting is now possible, as is middle-click pasting and integration with Klipper, which is the clipboard management utility in the system tray. So in theory these features are now available to KWin Wayland users.
I say "in theory" because I have not tested it myself and I expect it to not work without issues. That is for one because big feature additions like these regularly require later adjustments due to unforeseen behavior changes, but also because on a principal and strategic level I disagree with the KWin developers' general approach here.
The KWin codebase is rotten and needs a rigorous overhaul. Putting more features on top of that, which often requires massive internal changes just for the sake of crossing an item off a checklist, might make sense from the viewpoint of KDE users and KDE's marketing staff, but from a long-term engineering vision it will only litter the code more and lead to more and more breakage over time. Most users won't notice that immediately, but when they do it is already too late.
On how to do that better I really have to compliment the developers of Gnome's Mutter and wlroots.
Especially Mutter's Wayland session was in a bad state with some fundamental problems due to its history just few years ago. But they committed to a very forward-thinking stance, ignoring the initial bad reception and not being tempted by immediate quick fixes that long-term would not hold up to the necessary standards. And nowadays Gnome Mutter's Wayland session is in way better shape. I want to highlight their transactional KMS project. This is a massive overhaul that is completely transparent to the common user, but enables the Mutter developers to build on a solid base in many ways in the future.
Still as said I have not tried KWin 5.20 myself and if the new features are important to you, give it a try and check for yourself if your experience confirms my concerns or if you are happy with what was added. Switching from KWin to KWinFT or the other way around is easy after all.
If you self-compile KWinFT it is very easy to switch from KWin to KWinFT. Just compile KWinFT to your system prefix. If you want more comfort through distribution packages you have to choose your distribution carefully.
Currently only Manjaro provides KWinFT packages officially. You can install all KWinFT projects on Manjaro easily through the packages with the same names.
Manjaro also offers git-variants of these packages allowing you to run KWinFT projects directly from master branch. This way you can participate in its development directly or give feedback to latest changes.
If you run Arch Linux you can install all KWinFT projects from the AUR. The release packages are not yet updated to 5.20 but I assume this happens pretty soon. They have a somewhat weird naming scheme: there are kwinft, wrapland-kwinft, disman-kwinft and kdisplay-kwinft. Of these packages git-variants are available too, but they follow the better naming scheme without a kwinft suffix. So for example the git package for disman-kwinft is just called disman-git. Naming nitpicks aside, huge thanks to the maintainers of these packages: abelian424 and Christoph (haagch).
A special place in my heart was conquered not long ago by Fedora. I switched over to it from KDE Neon due to problems on the latest update and the often outdated packages, and I am amazed by Fedora's technically versed and overall professional vision.
To install KWinFT projects on Fedora with its exceptional package manager DNF you can make use of this copr repository that includes their release versions. The packages are already updated to 5.20. Thanks to zawertun for providing these packages!
Fedora's KDE SIG group also took interest in the KWinFT projects and set up a preliminary copr for them. One of their packagers contacted me after the Beta release and I hope that I can help them to get it fully set up soon. I think Fedora's philosophy of pushing the Linux ecosystem by providing most recent packages and betting on emerging technologies will harmonize very well with the goals of the KWinFT project.
It’s been a busy week for me in personal stuff, so my blogging has been a bit slow. Here’s a brief summary of a few vaguely interesting things that I’ve been up to:
- tracked down a bug where I was generating an `OpAccessChain` with a result type of `uvec4` from a base type of `array<uint>`, which…it's just not going to work ever, so that's some strong work by past me
- There's a weird flickering bug with the UI that I get in some levels that bears some looking into, but otherwise I get a steady 60/64/64 in zink vs 74/110/110 in IRIS (no idea what the 3 numbers mean, if anyone does, feel free to drop me a line).
That’s it for today. Hopefully a longer post tomorrow but no promises.
October 14, 2020
A couple of years ago, we sandboxed thumbnailers using bubblewrap to avoid drive-by downloads taking advantage of thumbnailers with security issues.
It's a great tool, and it's a tool that Flatpak relies upon to create its own sandboxes. But that also meant that we couldn't use it inside the Flatpak sandboxes themselves, and those aren't always as closed as they could be, to support legacy applications.
We've finally implemented support for sandboxing thumbnailers within Flatpak, using the Spawn D-Bus interface (indirectly).
This should all land in GNOME 40, though it should already be possible to integrate it into your Flatpaks. Make sure to use the latest gnome-desktop development version, and that the flatpak-spawn utility is new enough in the runtime you're targeting (it's been updated in the freedesktop.org runtimes #1, #2, #3, but it takes time to trickle down to GNOME versions). Example JSON snippets:
{
"name": "flatpak-xdg-utils",
"buildsystem": "meson",
"sources": [
{
"type": "git",
"url": "https://github.com/flatpak/flatpak-xdg-utils.git",
"tag": "1.0.4"
}
]
},
{
"name": "gnome-desktop",
"buildsystem": "meson",
"config-opts": ["-Ddebug_tools=true", "-Dudev=disabled"],
"sources": [
{
"type": "git",
"url": "https://gitlab.gnome.org/GNOME/gnome-desktop.git"
}
]
}
(We also sped up GStreamer-based thumbnailers by allowing them to use a cache, and added profiling information to the thumbnail test tools, which could prove useful if you want to investigate performance or bugs in that area)
Edit: corrected a link; thanks to the commenters for the notice
October 13, 2020
This week, I had a conversation with one of my coworkers about our subgroup/wave size heuristic and, in particular, whether or not control-flow divergence should be considered as part of the choice. This led me down a fun path of looking into the statistics of control-flow divergence and the end result is somewhat surprising: once you get above about an 8-wide subgroup, the subgroup size doesn't matter.
Before I get into the details, let's talk nomenclature. As you're likely aware, GPUs often execute code in groups of 1 or more invocations. In D3D terminology, these are called waves. In Vulkan and OpenGL terminology, these are called subgroups. The two terms are interchangeable and, for the rest of this post, I'll use the Vulkan/OpenGL conventions.
Before we dig into the statistics, let's talk for a minute about control-flow divergence. This is mostly going to be a primer on SIMT execution and control-flow divergence in GPU architectures. If you're already familiar, skip ahead to the next section.
Most modern GPUs use a Single Instruction Multiple Thread (SIMT) model. This means that the graphics programmer writes a shader which, for instance, colors a single pixel (fragment/pixel shader) but what the shader compiler produces is a program which colors, say, 32 pixels using a vector instruction set architecture (ISA). Each logical single-pixel execution of the shader is called an "invocation" while the physical vectorized execution of the shader which covers multiple pixels is called a wave or a subgroup. The size of the subgroup (number of pixels colored by a single hardware execution) varies depending on your architecture. On Intel, it can be 8, 16, or 32, on AMD, it's 32 or 64 and, on Nvidia (if my knowledge is accurate), it's always 32.
This conversion from logical single-pixel version of the shader to a physical multi-pixel version is often fairly straightforward. The GPU registers each hold N values and the instructions provided by the GPU ISA operate on N pieces of data at a time. If, for instance, you have an add in the logical shader, it's converted to an add provided by the hardware ISA which adds N values. (This is, of course an over-simplification but it's sufficient for now.) Sounds simple, right?
Where things get more complicated is when you have control-flow in your shader. Suppose you have an if statement with both then and else sections. What should we do when we hit that if statement? The if condition will be N Boolean values. If all of them are true or all of them are false, the answer is pretty simple: we do the then or the else respectively. If you have a mix of true and false values, we have to execute both sides. More specifically, the physical shader has to disable all of the invocations for which the condition is false and run the "then" side of the if statement. Once that's complete, it has to re-enable those channels and disable the channels for which the condition is true and run the "else" side of the if statement. Once that's complete, it re-enables all the channels and continues executing the code after the if statement.
When you start nesting if statements and throw loops into the mix, things get even more complicated. Loop continues have to disable all those channels until the next iteration of the loop, loop breaks have to disable all those channels until the loop is entirely complete, and the physical shader has to figure out when there are no channels left and complete the loop. This makes for some fun and interesting challenges for GPU compiler developers. Also, believe it or not, everything I just said is a massive over-simplification. :-)
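To make the execution-mask idea a bit more concrete, here's a tiny, self-contained C sketch of how a SIMT machine conceptually handles a two-sided if across an 8-wide subgroup. This is purely illustrative pseudocode in C clothing, not any real ISA or driver code:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SUBGROUP_SIZE 8

/* logical shader: out = cond ? a + 1 : a * 2, written per-invocation */
static void
simt_if(const int *a, const bool *cond, int *out)
{
   uint32_t then_mask = 0, else_mask = 0;

   /* the hardware evaluates the condition for every channel at once and
    * builds execution masks for the two sides of the if
    */
   for (int i = 0; i < SUBGROUP_SIZE; i++) {
      if (cond[i])
         then_mask |= 1u << i;
      else
         else_mask |= 1u << i;
   }

   /* "then" side: only executed at all if some channel needs it, and only
    * the channels in the mask write results
    */
   if (then_mask) {
      for (int i = 0; i < SUBGROUP_SIZE; i++)
         if (then_mask & (1u << i))
            out[i] = a[i] + 1;
   }

   /* "else" side: same story with the complementary mask */
   if (else_mask) {
      for (int i = 0; i < SUBGROUP_SIZE; i++)
         if (else_mask & (1u << i))
            out[i] = a[i] * 2;
   }
   /* after the if, all channels are enabled again */
}

int
main(void)
{
   const int a[SUBGROUP_SIZE] = {1, 2, 3, 4, 5, 6, 7, 8};
   const bool cond[SUBGROUP_SIZE] = {true, false, true, true, false, false, true, false};
   int out[SUBGROUP_SIZE];

   simt_if(a, cond, out);
   for (int i = 0; i < SUBGROUP_SIZE; i++)
      printf("invocation %d: %d\n", i, out[i]);
   return 0;
}

The divergence cost is visible right in the structure: if both masks end up non-empty, both loops run, and the whole subgroup pays for both sides.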
The point which most graphics developers need to understand and what's important for this blog post is that the physical shader has to execute every path taken by any invocation in the subgroup. For loops, this means that it has to execute the loop enough times for the worst case in the subgroup. This means that if you have the same work in both the then and else sides of an if statement, that work may get executed twice rather than once and you may be better off pulling it outside the if. It also means that if you have something particularly expensive and you put it inside an if statement, that doesn't mean that you only pay for it when needed, it means you pay for it whenever any invocation in the subgroup needs it.
At the end of the last section, I said that one of the problems with the SIMT model used by GPUs is that they end up having worst-case performance for the subgroup. Every path through the shader which has to be executed for any invocation in the subgroup has to be taken by the shader as a whole. The question that naturally arises is, "does a larger subgroup size make this worst-case behavior worse?" Clearly, the naive answer is, "yes". If you have a subgroup size of 1, you only execute exactly what's needed and if you have a subgroup size of 2 or more, you end up hitting this worst-case behavior. If you go higher, the bad cases should be more likely, right? Yes, but maybe not quite like you think.
This is one of those cases where statistics can be surprising. Let's say you have an if statement with a boolean condition b. That condition is actually a vector (b1, b2, b3, ..., bN) and if any two of those vector elements differ, we pay the cost of both paths. Assuming that the conditions are independent identically distributed (IID) random variables, the probability of the entire vector being true is P(all(bi = true)) = P(b1 = true) * P(b2 = true) * ... * P(bN = true) = P(bi = true)^N where N is the size of the subgroup. Therefore, the probability of having uniform control-flow is P(bi = true)^N + P(bi = false)^N. The probability of non-uniform control-flow, on the other hand, is 1 - P(bi = true)^N - P(bi = false)^N.
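If you want to play with these numbers yourself, the formula is trivial to evaluate; here's a small stand-alone C program (mine, not from any driver, compile with -lm) that prints the divergence probability for a few subgroup sizes and condition probabilities:

#include <math.h>
#include <stdio.h>

int
main(void)
{
   const double ps[] = { 0.5, 0.25, 0.1 };
   const int sizes[] = { 1, 2, 4, 8, 16, 32, 64 };

   for (unsigned i = 0; i < sizeof(ps) / sizeof(ps[0]); i++) {
      double p = ps[i];
      printf("P(bi = true) = %.2f\n", p);
      for (unsigned j = 0; j < sizeof(sizes) / sizeof(sizes[0]); j++) {
         int n = sizes[j];
         /* uniform means every invocation agrees: all true or all false */
         double p_uniform = pow(p, n) + pow(1.0 - p, n);
         printf("  N = %2d: P(divergent) = %.6f\n", n, 1.0 - p_uniform);
      }
   }
   return 0;
}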
Before we go further with the math, let's put some solid numbers on it. Let's say we have a subgroup size of 8 (the smallest Intel can do) and let's say that our input data is a series of coin flips where bi is "flip i was heads". Then P(bi = true) = P(bi = false) = 1/2. Using the math in the previous paragraph, P(uniform) = P(bi = true)^8 + P(bi = false)^8 = 1/128. This means that there is only a 1:128 chance that you'll get uniform control-flow and a 127:128 chance that you'll end up taking both paths of your if statement. If we increase the subgroup size to 64 (the maximum among AMD, Intel, and Nvidia), you get a 1:2^63 chance of having uniform control-flow and a (2^63-1):2^63 chance of executing both halves. If we assume that the shader takes T time units when control-flow is uniform and 2T time units when control-flow is non-uniform, then the amortized cost of the shader for a subgroup size of 8 is 1/128 * T + 127/128 * 2T = 255/128 T and, by a similar calculation, the cost of a shader with a subgroup size of 64 is (2^64 - 1)/2^63 T. Both of those are within rounding error of 2T and the added cost of using the massively wider subgroup size is less than 1%. Playing with the statistics a bit, the following chart shows the probability of divergence vs. the subgroup size for various choices of P(bi = true):
One thing to immediately notice is that because we're only concerned about the probability of divergence and not of the two halves of the if independently, the graph is symmetric (p=0.9 and p=0.1 are the same). Second, and the point I was trying to make with all of the math above, is that until your probability gets pretty extreme (> 90%) the probability of divergence is reasonably high at any subgroup size. From the perspective of a compiler with no knowledge of the input data, we have to assume every if condition is a 50/50 chance, at which point we can basically assume it will always diverge.

Instead of only considering divergence, let's take a quick look at another case. Let's say that you have a one-sided if statement (no else) that is expensive but rare. To put numbers on it, let's say the probability of the if statement being taken is 1/16 for any given invocation. Then P(taken) = P(any(bi = true)) = 1 - P(all(bi = false)) = 1 - P(bi = false)^N = 1 - (15/16)^N. This works out to about 0.4 for a subgroup size of 8, 0.65 for 16, 0.87 for 32, and 0.98 for 64. The following chart shows what happens if we play around with the probabilities of our if condition a bit more:
As we saw with the earlier divergence plot, even events with a fairly low probability (10%) are fairly likely to happen even with a subgroup size of 8 (57%) and are even more likely the higher the subgroup size goes. Again, from the perspective of a compiler with no knowledge of the data trying to make heuristic decisions, it looks like "ifs always happen" is a reasonable assumption. However, if we have something expensive like a texture instruction that we can easily move into an if statement, we may as well. There are no guarantees, but if the probability of that if statement is low enough, we might be able to avoid it at least some of the time.

A keen statistical eye may have caught a subtle statement I made very early on in the previous section:
Assuming that the conditions are independent identically distributed (IID) random variables...
While less statistically minded readers may have glossed over this as meaningless math jargon, it's actually a very important assumption. Let's take a minute to break it down. A random variable in statistics is just an event. In our case, it's something like "the if condition was true". To say that a set of random variables is identically distributed means that they have the same underlying probabilities. Two coin tosses, for instance, are identically distributed while the distributions of "coin came up heads" and "die came up 6" are very different. When combining random variables, we have to be careful to ensure that we're not mixing apples and oranges. All of the analysis above was looking at the evaluation of a boolean in the same if condition but across different subgroup invocations. These should be identically distributed.
The remaining word that's of critical importance in the IID assumption is "independent". Two random variables are said to be independent if they have no effect on one another or, to be more precise, knowing the value of one tells you nothing whatsoever about the value of the other. Random variables which are not independent are said to be "correlated". One example of random variables which are very much not independent would be housing prices in a neighborhood because the first thing home appraisers look at to determine the value of a house is the value of other houses in the same area that have sold recently. In my computations above, I used the rule that P(X and Y) = P(X) * P(Y) but this only holds if X and Y are independent random variables. If they're dependent, the statistics look very different. This raises an obvious question: Are if conditions statistically independent across a subgroup? The short answer is "no".
How does this correlation and lack of independence (those are the same) affect the statistics? If two events X and Y are negatively correlated then P(X and Y) < P(X) * P(Y) and if two events are positively correlated then P(X and Y) > P(X) * P(Y). When it comes to if conditions across a subgroup, most correlations that matter are positive. Going back to our statistics calculations, the probability of if condition diverging is 1 - P(all(bi = true)) - P(all(bi = false)) and P(all(bi = true)) = P(b1 = true and b2 = true and... bN = true). So, if the data is positively correlated, we get P(all(bi = true)) > P(bi = true)^N and P(divergent) = 1 - P(all(bi = true)) - P(all(bi = false)) < 1 - P(bi = true)^N - P(bi = false)^N. So correlation for us typically reduces the probability of divergence. This is a good thing because divergence is expensive. How much does it reduce the probability of divergence? That's hard to tell without deep knowledge of the data but there are a few easy cases to analyze.
One particular example of dependence that comes up all the time is uniform values. Many values passed into a shader are the same for all invocations within a draw call or for all pixels within a group of primitives. Sometimes the compiler is privy to this information (if it comes from a uniform or constant buffer, for instance) but often it isn't. It's fairly common for apps to pass some bit of data as a vertex attribute which, even though it's specified per-vertex, is actually the same for all of them. If a bit of data is uniform (even if the compiler doesn't know it is), then any if conditions based on that data (or from a calculation using entirely uniform values) will be the same. From a statistics perspective, this means that P(all(bi = true)) + P(all(bi = false)) = 1 and P(divergent) = 0. From a shader execution perspective, this means that it will never diverge no matter the probability of the condition because our entire wave will evaluate the same value.
What about non-uniform values such as vertex positions, texture coordinates, and computed values? In your average vertex, geometry, or tessellation shader, these are likely to be effectively independent. Yes, there are patterns in the data such as common edges and some triangles being closer to others. However, there is typically a lot of vertex data and the way that vertices get mapped to subgroups is random enough that these correlations between vertices aren't likely to show up in any meaningful way. (I don't have a mathematical proof for this off-hand.) When they're independent, all the statistics we did in the previous section apply directly.
With pixel/fragment shaders, on the other hand, things get more interesting. Most GPUs rasterize pixels in groups of 2x2 pixels where each 2x2 pixel group comes from the same primitive. Each subgroup is made up of a series of these 2x2 pixel groups so, if the subgroup size is 16, it's actually 4 groups of 2x2 pixels each. Within a given 2x2 pixel group, the chances of a given value within the shader being the same for each pixel in that 2x2 group is quite high. If we have a condition which is the same within each 2x2 pixel group then, from the perspective of divergence analysis, the subgroup size is effectively divided by 4. As you can see in the earlier charts (for which I conveniently provided small subgroup sizes), the difference between a subgroup size of 2 and 4 is typically much larger than between 8 and 16.
Another common source of correlation in fragment shader data comes from the primitives themselves. Even if they may be different between triangles, values are often the same or very tightly correlated between pixels in the same triangle. This is sort of a super-set of the 2x2 pixel group issue we just covered. This is important because this is a type of correlation that hardware has the ability to encourage. For instance, hardware can choose to dispatch subgroups such that each subgroup only contains pixels from the same primitive. Even if the hardware typically mixes primitives within the same subgroup, it can attempt to group things together to increase data correlation and reduce divergence.
All this discussion of control-flow divergence might leave you wondering why we bother with subgroups at all. Clearly, they're a pain. They definitely are. Oh, you have no idea...
But they also bring some significant advantages in that the parallelism allows us to get better throughput out of the hardware. One obvious way this helps is that we can spend less hardware on instruction decoding (we only have to decode once for the whole wave) and put those gates into more floating-point arithmetic units. Also, most processors are pipelined and, while they can start processing a new instruction each cycle, it takes several cycles before an instruction makes its way from the start of the pipeline to the end and its result can be used in a subsequent instruction. If you have a lot of back-to-back dependent calculations in the shader, you can end up with lots of stalls where an instruction goes into the pipeline and the next instruction depends on its value and so you have to wait 10ish cycles for the previous instruction to complete. On Intel, each SIMD32 instruction is actually four SIMD8 instructions that pipeline very nicely and so it's easier to keep the ALU busy.
Ok, so wider subgroups are good, right? Go as wide as you can! Well, yes and no. Generally, there's a point of diminishing returns. Is one instruction decoder per 32 invocations of ALU really that much more hardware than one per 64 invocations? Probably not. Generally, the subgroup size is determined based on what's required to keep the underlying floating-point arithmetic hardware full. If you have 4 ALUs per execution unit and a pipeline depth of 10 cycles, then an 8-wide subgroup is going to have trouble keeping the ALU full. A 32-wide subgroup, on the other hand, will keep it 80% full even with back-to-back dependent instructions so going 64-wide is pointless.
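To spell out where a figure like that 80% can come from (my own back-of-the-envelope arithmetic, not a statement about any particular part): a 32-wide subgroup on a 4-wide ALU needs 8 cycles just to issue one instruction, so even a fully dependent instruction chain keeps the pipeline busy for 8 of every 10 cycles, while an 8-wide subgroup only fills 2 of them:

utilization ≈ (subgroup size / ALU width) / pipeline depth = (32/4)/10 = 80% for SIMD32, vs. (8/4)/10 = 20% for SIMD8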
On Intel GPU hardware, there are additional considerations. While most GPUs have a fixed subgroup size, ours is configurable and the subgroup size is chosen by the compiler. What's less flexible for us is our register file. We have a fixed register file size of 4KB regardless of the subgroup size so, depending on how many temporary values your shader uses, it may be difficult to compile it 16 or 32-wide and still fit everything in registers. While wider programs generally yield better parallelism, the additional register pressure can easily negate any parallelism benefits.
There are also other issues such as cache utilization and thrashing but those are way out of scope for this blog post...
This topic came up this week in the context of tuning our subgroup size heuristic in the Intel Linux 3D drivers. In particular, how should that heuristic reason about control-flow and divergence? Are wider programs more expensive because they have the potential to diverge more?
After all the analysis above, the conclusion I've come to is that any given if condition falls roughly into one of three categories:

- it's based on uniform (or effectively uniform) values, in which case it never diverges regardless of the subgroup size
- it's based on effectively random, independent data, in which case it will almost always diverge at any subgroup size we care about
- it's based on correlated data, which mostly matters in fragment shaders, where 2x2 pixel groups and per-primitive correlation reduce the effective divergence
Where does that leave our heuristic? The only interesting case in the above three is random data in fragment shaders. In our experience, the increased parallelism going from SIMD8 to SIMD16 is huge so it probably makes up for the increased divergence. The parallelism increase from SIMD16 to SIMD32 isn't huge but the change in the probability of a random if diverging is pretty small (94% vs. 99.6%) so, all other things being equal, it's probably better to go SIMD32.
![]() |
|
October 12, 2020 | |
![]() |
When last I left off, I’d cleared out 2/3 of my checklist for improving update_sampler_descriptors()
performance:
handle_image_descriptor()
was next on my list. As a result, I immediately dove right into an entirely different part of the flamegraph since I’d just been struck by a seemingly-obvious idea. Here’s the last graph:
Here’s the next step:
What changed?
Well, as I’m now caching descriptor sets across descriptor pools, it occurred to me that, assuming my descriptor state hashing mechanism is accurate, all the resources used in a given set must be identical. This means that all resources for a given type (e.g., UBO, SSBO, sampler, image) must be completely identical to previous uses across all shader stages. Extrapolating further, this also means that the way in which these resources are used must also be identical, which means the pipeline barriers for access and image layouts must also be identical.
Which means they can be stored onto the struct zink_descriptor_set
object and reused instead of being accumulated every time. This reuse completely eliminates add_transition()
from using any CPU time (it’s the left-most block above update_sampler_descriptors()
in the first graph), and it thus massively reduces overall time for descriptor updates.
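As a rough sketch of the idea (the struct and field names mirror the code shown later in this post, but the function body here is my own illustration, not the exact zink implementation): transitions are only accumulated the first time a set is populated, and a cache hit just replays what's already stored.

static void
add_transition(struct zink_descriptor_set *zds, struct zink_resource *res,
               VkImageLayout layout, VkAccessFlags access,
               VkPipelineStageFlags stage, bool cache_hit)
{
   /* on a cache hit, zds->barriers was filled in by a previous update and
    * can be replayed as-is, so there's nothing to accumulate
    */
   if (cache_hit)
      return;
   struct zink_descriptor_barrier t = {
      .res = res,
      .layout = layout,
      .access = access,
      .stage = stage,
   };
   util_dynarray_append(&zds->barriers, struct zink_descriptor_barrier, t);
}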
This marks a notable landmark, as it’s the point at which update_descriptors()
begins to use only ~50% of the total CPU time consumed in zink_draw_vbo()
, with the other half going to the draw command where it should be.
handle_image_descriptor()
At last my optimization-minded workflow returned to this function, and looking at the flamegraph again yielded the culprit. It’s not visible due to this being a screenshot, but the whole of the perf hog here was obvious, so let’s check out the function itself since it’s small. I think I explored part of this at one point in the distant past, possibly for ARB_texture_buffer_object
, but refactoring has changed things up a bit:
static void
handle_image_descriptor(struct zink_screen *screen, struct zink_resource *res, enum zink_descriptor_type type, VkDescriptorType vktype, VkWriteDescriptorSet *wd,
VkImageLayout layout, unsigned *num_image_info, VkDescriptorImageInfo *image_info, struct zink_sampler_state *sampler,
VkBufferView *null_view, VkImageView imageview, bool do_set)
First, yes, there are a lot of parameters, including VkBufferView *null_view
, which is a pointer to a stack array that’s initialized as containing VK_NULL_HANDLE
. As VkDescriptorImageInfo
must be initialized with a pointer to an array for texel buffers, it’s important that the stack variable used doesn’t go out of scope, so it has to be passed in like this or else this functionality can’t be broken out in this way.
{
if (!res) {
/* if we're hitting this assert often, we can probably just throw a junk buffer in since
* the results of this codepath are undefined in ARB_texture_buffer_object spec
*/
assert(screen->info.rb2_feats.nullDescriptor);
switch (vktype) {
case VK_DESCRIPTOR_TYPE_UNIFORM_TEXEL_BUFFER:
case VK_DESCRIPTOR_TYPE_STORAGE_TEXEL_BUFFER:
wd->pTexelBufferView = null_view;
break;
case VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER:
case VK_DESCRIPTOR_TYPE_STORAGE_IMAGE:
image_info->imageLayout = VK_IMAGE_LAYOUT_UNDEFINED;
image_info->imageView = VK_NULL_HANDLE;
if (sampler)
image_info->sampler = sampler->sampler[0];
if (do_set)
wd->pImageInfo = image_info;
++(*num_image_info);
break;
default:
unreachable("unknown descriptor type");
}
This is just handling for null shader inputs, which is permitted by various GL specs.
} else if (res->base.target != PIPE_BUFFER) {
assert(layout != VK_IMAGE_LAYOUT_UNDEFINED);
image_info->imageLayout = layout;
image_info->imageView = imageview;
if (sampler) {
VkFormatProperties props;
vkGetPhysicalDeviceFormatProperties(screen->pdev, res->format, &props);
This vkGetPhysicalDeviceFormatProperties call is actually the entire cause of handle_image_descriptor()
using any CPU time, at least on ANV. The lookup for the format is a significant bottleneck here, so it has to be removed.
if ((res->optimial_tiling && props.optimalTilingFeatures & VK_FORMAT_FEATURE_SAMPLED_IMAGE_FILTER_LINEAR_BIT) ||
(!res->optimial_tiling && props.linearTilingFeatures & VK_FORMAT_FEATURE_SAMPLED_IMAGE_FILTER_LINEAR_BIT))
image_info->sampler = sampler->sampler[0];
else
image_info->sampler = sampler->sampler[1] ?: sampler->sampler[0];
}
if (do_set)
wd->pImageInfo = image_info;
++(*num_image_info);
}
}
Just for completeness, the remainder of this function is checking whether the device’s format features support the requested type of filtering (if linear
), and then zink will fall back to nearest
in other cases. Following this, do_set
is true only for the base member of an image/sampler array of resources, and so this is the one that gets added into the descriptor set.
But now I’m again returning to vkGetPhysicalDeviceFormatProperties
. Since this is using CPU, it needs to get out of the hotpath here in descriptor updating, but it does still need to be called. As such, I’ve thrown more of this onto the zink_screen
object:
static void
populate_format_props(struct zink_screen *screen)
{
for (unsigned i = 0; i < PIPE_FORMAT_COUNT; i++) {
VkFormat format = zink_get_format(screen, i);
if (!format)
continue;
vkGetPhysicalDeviceFormatProperties(screen->pdev, format, &screen->format_props[i]);
}
}
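At the call site, the per-draw device query then turns into a plain array read; roughly like this (a sketch, assuming the resource's pipe format enum is what indexes the new cache):

if (sampler) {
   VkFormatProperties props = screen->format_props[res->base.format];
   if ((res->optimial_tiling && props.optimalTilingFeatures & VK_FORMAT_FEATURE_SAMPLED_IMAGE_FILTER_LINEAR_BIT) ||
       (!res->optimial_tiling && props.linearTilingFeatures & VK_FORMAT_FEATURE_SAMPLED_IMAGE_FILTER_LINEAR_BIT))
      image_info->sampler = sampler->sampler[0];
   else
      image_info->sampler = sampler->sampler[1] ?: sampler->sampler[0];
}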
Indeed, now instead of performing the fetch on every descriptor update, I’m just grabbing all the properties on driver init and then using the cached values throughout. Let’s see how this looks.
update_descriptors()
is now using visibly less time than the draw command, though not by a huge amount. I’m also now up to an unstable 33fps.
At this point, it bears mentioning that I wasn’t entirely satisfied with the amount of CPU consumed by descriptor state hashing, so I ended up doing some pre-hashing here for samplers, as they have the largest state. At the beginning, the hashing looked like this:
In this case, each sampler descriptor hash was the size of a VkDescriptorImageInfo, which is 2x 64bit values and a 32bit value, or 20 bytes of hashing per sampler. That ends up being a lot of hashing, and it also ends up being a lot of repeatedly hashing the same values.
Instead, I changed things around to do some pre-hashing:
In this way, I could have a single 32bit value representing the sampler view that persisted for its lifetime, and a pair of 32bit values for the sampler (since I still need to potentially toggle between linear and nearest filtering) that I can select between. This ends up being 8 bytes to hash, which is over 50% less. It’s not a huge change in the flamegraph, but it’s possibly an interesting factoid. Also, as the layouts will always be the same for these descriptors, that member can safely be omitted from the original sampler_view hash.
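For illustration, the pre-hashing can look something like this (the field names are my own shorthand and the real code may differ; XXH32 is just mesa's vendored xxhash):

/* at sampler-view creation: hash the view handle once; the layout is always
 * the same for these descriptors, so it's omitted entirely
 */
sampler_view->hash = XXH32(&sampler_view->image_view, sizeof(VkImageView), 0);

/* at sampler-state creation: one hash per filtering variant that might be
 * selected at descriptor-update time
 */
for (unsigned i = 0; i < 2; i++)
   if (sampler->sampler[i])
      sampler->hash[i] = XXH32(&sampler->sampler[i], sizeof(VkSampler), 0);

/* descriptor-state calculation now hashes 2x 32bit values per sampler
 * descriptor instead of the full 20-byte VkDescriptorImageInfo
 */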
Another vaguely interesting tidbit is profiling zink’s usage of mesa’s set
implementation, which is used to ensure that various objects are only added to a given batch a single time. Historically the pattern for use in zink has been something like:
if (!_mesa_set_search(set, value)) {
do_something();
_mesa_set_add(set, value);
}
This is not ideal, as it ends up performing two lookups in the set
for cases where the value isn’t already present. A much better practice is:
bool found = false;
_mesa_set_search_and_add(set, value, &found);
if (!found)
do_something();
In this way, the lookup is only done once, which ends up being huge for large sets.
I was now holding steady at 33fps, but there was a tiny bit more performance to squeeze out of descriptor updating when I began to analyze how much looping was being done. This is a general overview of all the loops in update_descriptors()
for each type of descriptor at the time of my review:
This was a lot of looping, and it was especially egregious in the final component of my refactored update_descriptors()
:
static bool
write_descriptors(struct zink_context *ctx, struct zink_descriptor_set *zds, unsigned num_wds, VkWriteDescriptorSet *wds,
unsigned num_resources, struct zink_descriptor_resource *resources, struct set *persistent,
bool is_compute, bool cache_hit)
{
bool need_flush = false;
struct zink_batch *batch = is_compute ? &ctx->compute_batch : zink_curr_batch(ctx);
struct zink_screen *screen = zink_screen(ctx->base.screen);
assert(zds->desc_set);
unsigned check_flush_id = is_compute ? 0 : ZINK_COMPUTE_BATCH_ID;
for (int i = 0; i < num_resources; ++i) {
assert(num_resources <= zds->pool->num_resources);
struct zink_resource *res = resources[i].res;
if (res) {
need_flush |= zink_batch_reference_resource_rw(batch, res, resources[i].write) == check_flush_id;
if (res->persistent_maps)
_mesa_set_add(persistent, res);
}
/* if we got a cache hit, we have to verify that the cached set is still valid;
* we store the vk resource to the set here to avoid a more complex and costly mechanism of maintaining a
* hash table on every resource with the associated descriptor sets that then needs to be iterated through
* whenever a resource is destroyed
*/
assert(!cache_hit || zds->resources[i] == res);
if (!cache_hit)
zink_resource_desc_set_add(res, zds, i);
}
if (!cache_hit && num_wds)
vkUpdateDescriptorSets(screen->dev, num_wds, wds, 0, NULL);
for (int i = 0; zds->pool->num_descriptors && i < util_dynarray_num_elements(&zds->barriers, struct zink_descriptor_barrier); ++i) {
struct zink_descriptor_barrier *barrier = util_dynarray_element(&zds->barriers, struct zink_descriptor_barrier, i);
zink_resource_barrier(ctx, NULL, barrier->res,
barrier->layout, barrier->access, barrier->stage);
}
return need_flush;
}
This function iterates over all the resources in a descriptor set, tagging them for batch usage and persistent mapping, adding references for the descriptor set to the resource as I previously delved into. Then it iterates over the barriers and applies them.
But why was I iterating over all the resources and then over all the barriers when every resource will always have a barrier for the descriptor set, even if it ends up getting filtered out based on previous usage?
It just doesn’t make sense.
So I refactored this a bit, and now there’s only one loop:
static bool
write_descriptors(struct zink_context *ctx, struct zink_descriptor_set *zds, unsigned num_wds, VkWriteDescriptorSet *wds,
struct set *persistent, bool is_compute, bool cache_hit)
{
bool need_flush = false;
struct zink_batch *batch = is_compute ? &ctx->compute_batch : zink_curr_batch(ctx);
struct zink_screen *screen = zink_screen(ctx->base.screen);
assert(zds->desc_set);
unsigned check_flush_id = is_compute ? 0 : ZINK_COMPUTE_BATCH_ID;
if (!cache_hit && num_wds)
vkUpdateDescriptorSets(screen->dev, num_wds, wds, 0, NULL);
for (int i = 0; zds->pool->num_descriptors && i < util_dynarray_num_elements(&zds->barriers, struct zink_descriptor_barrier); ++i) {
struct zink_descriptor_barrier *barrier = util_dynarray_element(&zds->barriers, struct zink_descriptor_barrier, i);
if (barrier->res->persistent_maps)
_mesa_set_add(persistent, barrier->res);
need_flush |= zink_batch_reference_resource_rw(batch, barrier->res, zink_resource_access_is_write(barrier->access)) == check_flush_id;
zink_resource_barrier(ctx, NULL, barrier->res,
barrier->layout, barrier->access, barrier->stage);
}
return need_flush;
}
This actually has the side benefit of reducing the required looping even further: since barriers get merged based on access and stages, even though there may be N resources used by a given set across M stages, it’s possible that the looping here might be reduced to only N iterations rather than N * M, since all the barriers might be consolidated.
Let’s check all the changes out in the flamegraph:
This last bit has shaved off another big chunk of CPU usage overall, bringing update_descriptors()
from 11.4% to 9.32%. Descriptor state updating is down from 0.718% to 0.601% from the pre-hashing as well, though this wasn’t exactly a huge target to hit.
Just for nostalgia, here’s the starting point from just after I’d split the descriptor types into different sets and we all thought 27fps with descriptor set caching was a lot:
But now zink is up another 25% performance to a steady 34fps:
And I did it in only four blog posts.
For anyone interested, I’ve also put up a branch corresponding to the final flamegraph along with the perf data, which I’ve been using hotspot to view.
Looking forward, there’s still some easy work that can be done here.
For starters, I’d probably improve descriptor states a little such that I also had a flag anytime the batch cycled. This would enable me to add batch-tracking for resources/samplers/sampler_views more reliably when it was actually necessary vs trying to add it every time, which ends up being a significant perf hit from all the lookups. I imagine that’d make a huge part of the remaining update_descriptors()
usage disappear.
There’s also, as ever, the pipeline hashing, which can further be reduced by adding more dynamic state handling, which would remove more values from the pipeline state and thus reduce the amount of hashing required.
I’d probably investigate doing some resource caching to keep a bucket of destroyed resources around for faster reuse since there’s a fair amount of that going on.
Ultimately though, the CPU time in use by zink is unlikely to see any other huge decreases (unless I’m missing something especially obvious or clever) without more major architectural changes, which will end up being a bigger project that takes more than just a week of blog posting to finish. As such, I’ve once again turned my sights to unit test pass rates and related issues, since there’s still a lot of work to be done there.
I’ve fixed another 500ish piglit tests over the past few days, bringing zink up past 92% pass rate, and I’m hopeful I can get that number up even higher in the near future.
Stay tuned for more updates on all things zink and Mike.
![]() |
|
October 09, 2020 | |
![]() |
I talked about a lot of boring optimization stuff yesterday, exploring various ideas which, while they will eventually end up improving performance, didn’t yield immediate results.
Now it’s just about time to start getting to the payoff.
Here’s a flamegraph of the starting point. Since yesterday’s progress of improving the descriptor cache a bit and adding context-based descriptor states to reduce hashing, I’ve now implemented an object to hold the VkDescriptorPool
, enabling the pools themselves to be shared across programs for reuse, which deduplicates a considerable amount of memory. The first scene of the heaven benchmark creates a whopping 91 zink_gfx_program
structs, each of which previously had their own UBO descriptor pool and sampler descriptor pool for 182 descriptor pools in total, each with 5000 descriptors in it. With this mechanism, that’s cut down to 9 descriptor pools in total which are shared across all the programs. Without even changing that maximum descriptor limit, I’m already up by another frame to 28fps, even if the flamegraph doesn’t look too different.
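Mechanically, the sharing amounts to keying pools by their layout/size information rather than by program; here's a sketch of the lookup with hypothetical names (the key struct and descriptor_pool_create are illustrative, not the actual zink code):

struct zink_descriptor_pool_key {
   unsigned num_type_sizes;
   VkDescriptorPoolSize sizes[6];
};

static struct zink_descriptor_pool *
descriptor_pool_get(struct zink_context *ctx, const struct zink_descriptor_pool_key *key)
{
   /* pools live in a context-level hash table keyed on the layout info, so
    * any program with an identical layout reuses the same pool
    */
   uint32_t hash = _mesa_hash_data(key, sizeof(*key));
   struct hash_entry *he = _mesa_hash_table_search_pre_hashed(ctx->descriptor_pools, hash, key);
   if (he)
      return he->data;
   struct zink_descriptor_pool *pool = descriptor_pool_create(ctx, key);
   /* a real implementation would need to store an owned copy of the key */
   _mesa_hash_table_insert_pre_hashed(ctx->descriptor_pools, hash, key, pool);
   return pool;
}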
I took a quick detour next over to the pipeline cache (that large-ish block directly to the right of update_descriptors
in the flamegraph), which stores all the VkPipeline objects that get created during startup. Pipeline creation is extremely costly, so it’s crucial that it be avoided during runtime. Happily, the current caching infrastructure in zink is sufficient to meet that standard, and there are no pipelines created while the scene is playing out.
But I thought to myself: what about VkPipelineCache for startup time improvement while I’m here?
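For reference, the plumbing is about as small as Vulkan plumbing gets; a minimal sketch (no disk serialization of the cache shown, and screen->pipeline_cache is just an assumed field):

VkPipelineCacheCreateInfo pcci = {0};
pcci.sType = VK_STRUCTURE_TYPE_PIPELINE_CACHE_CREATE_INFO;
/* optionally seed pcci.pInitialData / pcci.initialDataSize from a previous run */
if (vkCreatePipelineCache(screen->dev, &pcci, NULL, &screen->pipeline_cache) != VK_SUCCESS)
   screen->pipeline_cache = VK_NULL_HANDLE;

/* ...and then pass the cache instead of VK_NULL_HANDLE when creating pipelines */
vkCreateGraphicsPipelines(screen->dev, screen->pipeline_cache, 1, &pci, NULL, &pipeline);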
I had high hopes for this, and it was quick and easy to add in, but ultimately even with the cache implemented and working, I saw no benefit in any part of the benchmark.
That was fine, since what I was really shooting for was a result more like this:
The hierarchy of the previously-expensive pipeline hash usage has completely collapsed now, and it’s basically nonexistent. This was achieved through a series of five patches, among them one that disables an assert() which rehashes on every lookup, so this more accurately reflects release build performance.

And I’m now up to a solid 29fps.
update_sampler_descriptors()
I left off yesterday with a list of targets to hit in this function, from left to right in the most recent flamegraph:
Looking higher up the chain for the add_transition()
usage, it turns out that a huge chunk of this was actually the hash table rehashing itself every time it resized when new members were added (mesa hash tables start out with a very small maximum number of entries and then increase by a power of 2 every time). Since I always know ahead of time the maximum number of entries I’ll have in a given descriptor set, I put up a MR to let me pre-size the table, preventing any of this nonsense from taking up CPU time. The results were good:
The entire add_transition
hierarchy collapsed a bit, but there’s more to come. I immediately became distracted when I came to the realization that I’d actually misplaced a frame at some point and set to hunting it down.
Anyone who said to themselves “you’re binding your descriptors before you’re emitting your pipeline barriers, thus starting and stopping your render passes repeatedly during each draw” as the answer to yesterday’s question about what I broke during refactoring was totally right, so bonus points to everyone out there who nailed it. This can actually be seen in the flamegraph as the tall stack above update_ubo_descriptors()
, which is the block to the right of update_sampler_descriptors()
.
Now the stack is a little to the right where it belongs, and I was now just barely touching 30fps, which is up about 10% from the start of today’s post.
Stay tuned next week, when I pull another 10% performance out of my magic hat and also fix RADV corruption.
![]() |
|
October 08, 2020 | |
![]() |
It’s just that kind of week.
When I left off in my last post, I’d just implemented a two-tiered cache system for managing descriptor sets which was objectively worse in performance than not doing any caching at all.
Cool.
Next, I did some analysis of actual descriptor usage, and it turned out that the UBO churn was massive, while sampler descriptors were only changed occasionally. This is due to a mechanism in mesa involving a NIR pass which rewrites uniform data passed to the OpenGL context as UBO loads in the shader, compacting the data into a single buffer and allowing it to be more efficiently passed to the GPU. There’s a utility component u_upload_mgr
for gallium-based drivers which allocates a large (~100k) buffer and then maps/writes to it at offsets to avoid needing to create a new buffer for this every time the uniform data changes.
The downside of u_upload_mgr
, for the current state of zink, is that it means the hashed descriptor states are different for almost every single draw because while the UBO doesn’t actually change, the offset does, and this is necessarily part of the descriptor hash since zink is exclusively using VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER
descriptors.
I needed to increase the efficiency of the cache to make it worthwhile, so I decided first to reduce the impact of a changed descriptor state hash during pre-draw updating. This way, even if the UBO descriptor state hash was changing every time, maybe it wouldn’t be so impactful.
What this amounted to was to split the giant, all-encompassing descriptor set, which included UBOs, samplers, SSBOs, and shader images, into separate sets such that each type of descriptor would be isolated from changes in the other descriptors between draws.
Thus, I now had four distinct descriptor pools for each program, and I was producing and binding up to four descriptor sets for every draw. I also changed the shader compiler a bit to always bind shader resources to the newly-split sets, and I created some dummy descriptor sets since it’s illegal to bind sets to a command buffer with non-sequential indices, but it was mostly easy work. It seemed like a great plan, or at least one that had a lot of potential for more optimizations based on it. As far as direct performance increases from the split, UBO descriptors would be constantly changing, but maybe…
Well, the patch is pretty big (8 files changed, 766 insertions(+), 450 deletions(-)
), but in the end, I was still stuck around 23 fps.
With this work done, I decided to switch things up a bit and explore using VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC
descriptors for the mesa-produced UBOs, as this would reduce both the hashing required to calculate the UBO descriptor state (offset no longer needs to be factored in, which is one fewer uint32_t
to hash) as well as cache misses due to changing offsets.
Due to potential driver limitations, only these mesa-produced UBOs are dynamic now, otherwise zink might exceed the maxDescriptorSetUniformBuffersDynamic limit, but this was still more than enough.
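Concretely, the offset drops out of the descriptor write (and therefore out of the descriptor state hash) and gets supplied at bind time instead; a sketch with assumed variable names:

/* the write no longer bakes in the u_upload_mgr offset... */
VkDescriptorBufferInfo info = {
   .buffer = ubo_buffer,
   .offset = 0,
   .range = size,
};

/* ...so the same cached set stays valid as the offset changes, and the real
 * offset is handed to vkCmdBindDescriptorSets instead
 */
uint32_t dynamic_offset = ubo_offset;
vkCmdBindDescriptorSets(batch->cmdbuf, VK_PIPELINE_BIND_POINT_GRAPHICS,
                        pipeline_layout, 0, 1, &zds->desc_set,
                        1, &dynamic_offset);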
Boom.
I was now at 27fps, the same as raw bucket allocation.
I’m not going to talk about hash performance. Mesa uses xxhash internally, and I’m just using the tools that are available.
What I am going to talk about, however, is the amount of hashing and lookups that I was doing.
Let’s take a look at some flamegraphs for the scene that I was showing in my fps screenshots.
This is, among other things, a view of the juggernaut update_descriptors()
function that I linked earlier in the week. At the time of splitting the descriptor sets, it’s over 50% of the driver’s pipe_context::draw_vbo
hook, which is decidedly not great.
So I optimized harder.
The leftmost block just above update_descriptors
is hashing that’s done to update the descriptor state for cache management. There wasn’t much point in recalculating the descriptor state hash on every draw since there’s plenty of draws where the states remain unchanged. To try and improve this, I moved to a context-based descriptor state tracker, where each pipe_context
hook to change active descriptors would invalidate the corresponding descriptor state, and then update_descriptors()
could just scan through all the states to see which ones needed to be recalculated.
The largest block above update_descriptors()
on the right side is the new calculator function for descriptor states. It’s actually a bit more intensive than the old method, but again, this is much more easily optimized than the previous hashing which was scattered throughout a giant 400 line function.
Next, while I was in the area, I added even faster access to reusing descriptor sets. This is as simple as an array of sets on the program struct that can have their hash values directly compared to the current descriptor state that’s needed, avoiding lookups through potentially huge hash tables.
Not much to see here since this isn’t really where any of the performance bottleneck was occurring.
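In sketch form (the field names here are mine, not the actual zink code), that fast path is just a tiny array scan comparing pre-computed hashes:

/* the program keeps the handful of most recently used sets per descriptor
 * type; a hash match means the set can be rebound directly without touching
 * the cache's hash table at all
 */
for (unsigned i = 0; i < ARRAY_SIZE(pg->last_sets[type]); i++) {
   struct zink_descriptor_set *zds = pg->last_sets[type][i];
   if (zds && zds->hash == descriptor_state_hash)
      return zds;
}
/* otherwise fall back to the hash table lookup */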
Let’s skip ahead a bit. I finally refactored update_descriptors()
into smaller functions to update and bind descriptor sets in a loop prior to applying barriers and issuing the eventual draw command, shaking up the graph quite a bit:
Clearly, updating the sampler descriptors (update_sampler_descriptors()
) is taking a huge amount of time. The three large blocks above it are:
Each of these three had clear ways they could be optimized, and I’m going to speed through that and more in my next post.
But now, a challenge to all the Vulkan experts reading along. In this last section, I’ve briefly covered some refactoring work for descriptor updates.
What is the significant performance regression that I’ve introduced in the course of this refactoring?
It’s possible to determine the solution to my question without reading through any of the linked code, but you have the code available to you nonetheless.
Until next time.
![]() |
|
October 07, 2020 | |
![]() |
I’m back, and I’m about to get even deeper into zink’s descriptor management. I figured everyone including me is well acquainted with bucket allocating, so I skipped that day and we can all just imagine what that post would’ve been like instead.
Let’s talk about caching and descriptor sets.
I talked about it before, I know, so here’s a brief reminder of where I left off:
Just a very normal cache mechanism. The problem, as I said previously, was that there was just way too much hashing going on, and so the performance ended up being worse than a dumb bucket allocator.
Not ideal.
But I kept turning the idea over in the back of my mind, and then I realized that part of the problem was in the upper-right block named move invalidated sets to invalidated set array
. It ended up being the case that my resource tracking for descriptor sets was far too invasive; I had separate hash tables on every resource to track every set that a resource was attached to at all times, and I was basically spending all my time modifying those hash tables, not even the actual descriptor set caching.
So then I thought: well, what if I just don’t track it that closely?
Indeed, this simplifies things a bit at the conceptual level, since now I can avoid doing any sort of hashing related to resources, though this does end up making my second-level descriptor set cache less effective. But I’m getting ahead of myself in this post, so it’s time to jump into some code.
Instead of doing really precise tracking, it’s important to recall a few key points about how the descriptor sets are managed:
- the sets are allocated out of the struct zink_program ralloc context

Thus, I brought some pointer hacks to bear:
void
zink_resource_desc_set_add(struct zink_resource *res, struct zink_descriptor_set *zds, unsigned idx)
{
zds->resources[idx] = res;
util_dynarray_append(&res->desc_set_refs, struct zink_resource**, &zds->resources[idx]);
}
This function associates a resource with a given descriptor set at the specified index (based on pipeline state). And then it pushes a reference to that pointer from the descriptor set’s C-array of resources into an array on the resource.
Later, during resource destruction, I can then walk the array of pointers like this:
util_dynarray_foreach(&res->desc_set_refs, struct zink_resource **, ref) {
if (**ref == res)
**ref = NULL;
}
If the reference I pushed earlier is still pointing to this resource, I can unset the pointer, and this will get picked up during future descriptor updates to flag the set as not-cached, requiring that it be updated. Since a resource won’t ever be destroyed while a set is in use, this is also safe for the associated descriptor set’s lifetime.
And since there’s no hashing or tree traversals involved, this is incredibly fast.
At this point, I’d created two categories for descriptor sets: active sets, which were the ones in use in a command buffer, and inactive sets, which were the ones that weren’t currently in use, with sets being pushed into the inactive category once they were no longer used by any command buffers. This ended up being a bit of a waste, however, as I had lots of inactive sets that were still valid but unreachable since I was using an array for storing these as well as the newly-bucket-allocated sets.
Thus, a second-level cache, AKA the B cache, which would store not-used sets that had at one point been valid. I’m still not doing any sort of checking of sets which may have been invalidated by resource destruction, so the B cache isn’t quite as useful as it could be. Also:
- the check program cache for matching set step has now been expanded to two lookups, in case a matching set isn’t active but is still configured and valid in the B cache
- the check program for unused set block in the above diagram will now cannibalize a valid inactive set from the B cache rather than allocate a new set

The last of these items is a bit annoying, but ultimately the B cache can end up having hundreds of members at various points, and iterating through it to try and find a set that’s been invalidated ends up being impractical just based on the random distribution of sets across the table. Also, I only set the resource-based invalidation up to null the resource pointer, so finding an invalid set would mean walking through the resource array of each set in the cache. Thus, a quick iteration through a few items to see if the set-finder gets lucky, otherwise it’s clobbering time.
And this brought me up to about 24fps, which was still down a bit from the mind-blowing 27-28fps I was getting with just the bucket allocator, but it turns out that caching starts to open up other avenues for sizable optimizations.
Which I’ll get to in future posts.
![]() |
|
October 05, 2020 | |
![]() |
I took some time off to focus on making the numbers go up, but if I only do that sort of junk food style blogging with images, and charts, and benchmarks then we might all stop learning things, and certainly I won’t be transferring any knowledge between the coding part of my brain and the speaking part, so really we’re all losers at that point.
In other words, let’s get back to going extra deep into some code and doing some long form patch review.
First: what are descriptors?
Descriptors are, in short, when you feed a buffer or an image (+sampler) into a shader. In OpenGL, this is all handled for the user behind the scenes with e.g., a simple glGenBuffers()
-> glBindBuffer()
-> glBufferData()
for an attached buffer. For a gallium-based driver, this example case will trigger the pipe_context::set_constant_buffer
or pipe_context::set_shader_buffers
hook at draw time to inform the driver that a buffer has been attached, and then the driver can link it up with the GPU.
Things are a bit different in Vulkan. There’s an entire chapter of the spec devoted to explaining how descriptors work in great detail, but the important details for the zink case are:
- each descriptor in a set has a binding value which is unique for the given descriptor set

Additionally, while organizing and initializing all these descriptor sets, zink has to track all the resources used and guarantee their lifetimes exceed the lifetimes of the batch they’re being submitted with.
To handle this, zink has an amount of code. In the current state of the repo, it’s about 40 lines.
However…
In the state of my branch that I’m going to be working off of for the next few blog posts, the function for handling descriptor updates is 304 lines. This is the increased amount of code that’s required to handle (almost) all GL descriptor types in a way that’s reasonably reliable.
Is it a great design decision to have a function that large?
Probably not.
But I decided to write it all out first before I did any refactoring so that I could avoid having to incrementally refactor my first attempt at refactoring, which would waste lots of time.
Also, memes.
The idea behind the latter version of the implementation that I linked is as follows:
- Then merge and deduplicate all the accumulated pipeline barriers and apply only those which induce a layout or access change in the resource, to avoid over-applying barriers.
As I mentioned in a previous post, zink then applies these descriptors to a newly-allocated, max-size descriptor set object from an allocator pool located on the batch object. Every time a draw command is triggered, a new VkDescriptorSet is allocated and updated using these steps.
As I touched on briefly in a previous post, the first change to make here in improving descriptor handling is to move the descriptor pools to the program objects. This lets zink create smaller descriptor pools which are likely going to end up using less memory than these giant ones. Here’s the code used for creating descriptor pools prior to refactoring:
#define ZINK_BATCH_DESC_SIZE 1000
VkDescriptorPoolSize sizes[] = {
{VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER, ZINK_BATCH_DESC_SIZE},
{VK_DESCRIPTOR_TYPE_UNIFORM_TEXEL_BUFFER, ZINK_BATCH_DESC_SIZE},
{VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER, ZINK_BATCH_DESC_SIZE},
{VK_DESCRIPTOR_TYPE_STORAGE_TEXEL_BUFFER, ZINK_BATCH_DESC_SIZE},
{VK_DESCRIPTOR_TYPE_STORAGE_IMAGE, ZINK_BATCH_DESC_SIZE},
{VK_DESCRIPTOR_TYPE_STORAGE_BUFFER, ZINK_BATCH_DESC_SIZE},
};
VkDescriptorPoolCreateInfo dpci = {};
dpci.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO;
dpci.pPoolSizes = sizes;
dpci.poolSizeCount = ARRAY_SIZE(sizes);
dpci.flags = 0;
dpci.maxSets = ZINK_BATCH_DESC_SIZE;
vkCreateDescriptorPool(screen->dev, &dpci, 0, &batch->descpool);
Here, all the used descriptor types allocate ZINK_BATCH_DESC_SIZE
descriptors in the pool, and there are ZINK_BATCH_DESC_SIZE
sets in the pool. There’s a bug here, which is that really the descriptor types should have ZINK_BATCH_DESC_SIZE * ZINK_BATCH_DESC_SIZE
descriptors to avoid oom-ing the pool in the event that allocated sets actually use that many descriptors, but ultimately this is irrelevant, as we’re only ever allocating 1 set at a time due to zink flushing multiple times per draw anyway.
Ideally, however, it would be better to avoid this. The majority of draw cases use much, much smaller descriptor sets which only have 1-5 descriptors total, so allocating 6 * 1000 for a pool is roughly 6 * 1000 more than is actually needed for every set.
The other downside of this strategy is that by creating these giant, generic descriptor sets, it becomes impossible to know what’s actually in a given set without attaching considerable metadata to it, which makes reusing sets without modification (i.e., caching) a fair bit of extra work. Yes, I know I said bucket allocation was faster, but I also said I believe in letting the best idea win, and it doesn’t really seem like doing full updates every draw should be faster, does it? But I’ll get to that in another post.
When creating a struct zink_program
(which is the struct that contains all the shaders), zink creates a VkDescriptorSetLayout object for describing the descriptor set layouts that will be allocated from the pool. This means zink is allocating giant, generic descriptor set pools and then allocating (almost certainly) very, very small sets, which means the driver ends up with this giant memory balloon of unused descriptors allocated in the pool that can never be used.
A better idea for this would be to create descriptor pools which precisely match the layout for which they’ll be allocating descriptor sets, as this means there’s no memory ballooning, even if it does end up being more pool objects.
Here’s the current function for creating the layout object for a program:
static VkDescriptorSetLayout
create_desc_set_layout(VkDevice dev,
struct zink_shader *stages[ZINK_SHADER_COUNT],
unsigned *num_descriptors)
{
VkDescriptorSetLayoutBinding bindings[PIPE_SHADER_TYPES * (PIPE_MAX_CONSTANT_BUFFERS + PIPE_MAX_SHADER_SAMPLER_VIEWS + PIPE_MAX_SHADER_BUFFERS + PIPE_MAX_SHADER_IMAGES)];
int num_bindings = 0;
for (int i = 0; i < ZINK_SHADER_COUNT; i++) {
struct zink_shader *shader = stages[i];
if (!shader)
continue;
VkShaderStageFlagBits stage_flags = zink_shader_stage(pipe_shader_type_from_mesa(shader->nir->info.stage));
This function is called for both the graphics and compute pipelines, and for the latter, only a single shader object is passed, meaning that i
is purely for iterating the maximum number of shaders and not descriptive of the current shader being processed.
for (int j = 0; j < shader->num_bindings; j++) {
assert(num_bindings < ARRAY_SIZE(bindings));
bindings[num_bindings].binding = shader->bindings[j].binding;
bindings[num_bindings].descriptorType = shader->bindings[j].type;
bindings[num_bindings].descriptorCount = shader->bindings[j].size;
bindings[num_bindings].stageFlags = stage_flags;
bindings[num_bindings].pImmutableSamplers = NULL;
++num_bindings;
}
}
This iterates over the bindings in a given shader, setting up the various values required by the layout creation struct using the values stored to the shader struct using the code in zink_compiler.c
.
*num_descriptors = num_bindings;
if (!num_bindings) return VK_NULL_HANDLE;
If this program has no descriptors at all, then this whole thing can become a no-op, and descriptor updating can be skipped for draws which use this program.
VkDescriptorSetLayoutCreateInfo dcslci = {};
dcslci.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO;
dcslci.pNext = NULL;
dcslci.flags = 0;
dcslci.bindingCount = num_bindings;
dcslci.pBindings = bindings;
VkDescriptorSetLayout dsl;
if (vkCreateDescriptorSetLayout(dev, &dcslci, 0, &dsl) != VK_SUCCESS) {
debug_printf("vkCreateDescriptorSetLayout failed\n");
return VK_NULL_HANDLE;
}
return dsl;
}
Then there’s just the usual Vulkan semantics of storing values to the struct and passing it to the Create function.
But back in the context of moving descriptor pool creation to the program, this is actually the perfect place to jam the pool creation code in since all the information about descriptor types is already here. Here’s what that looks like:
VkDescriptorPoolSize sizes[6] = {};
int type_map[12];
unsigned num_types = 0;
memset(type_map, -1, sizeof(type_map));
for (int i = 0; i < ZINK_SHADER_COUNT; i++) {
struct zink_shader *shader = stages[i];
if (!shader)
continue;
VkShaderStageFlagBits stage_flags = zink_shader_stage(pipe_shader_type_from_mesa(shader->nir->info.stage));
for (int j = 0; j < shader->num_bindings; j++) {
assert(num_bindings < ARRAY_SIZE(bindings));
bindings[num_bindings].binding = shader->bindings[j].binding;
bindings[num_bindings].descriptorType = shader->bindings[j].type;
bindings[num_bindings].descriptorCount = shader->bindings[j].size;
bindings[num_bindings].stageFlags = stage_flags;
bindings[num_bindings].pImmutableSamplers = NULL;
if (type_map[shader->bindings[j].type] == -1) {
type_map[shader->bindings[j].type] = num_types++;
sizes[type_map[shader->bindings[j].type]].type = shader->bindings[j].type;
}
sizes[type_map[shader->bindings[j].type]].descriptorCount++;
++num_bindings;
}
}
I’ve added the sizes
, type_map
, and num_types
variables, which map used Vulkan descriptor types to a zero-based array and associated counter that can be used to fill in the pPoolSizes
and PoolSizeCount
values in a VkDescriptorPoolCreateInfo struct.
After the layout creation, which remains unchanged, I’ve then added this block:
for (int i = 0; i < num_types; i++)
sizes[i].descriptorCount *= ZINK_DEFAULT_MAX_DESCS;
VkDescriptorPoolCreateInfo dpci = {};
dpci.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO;
dpci.pPoolSizes = sizes;
dpci.poolSizeCount = num_types;
dpci.flags = 0;
dpci.maxSets = ZINK_DEFAULT_MAX_DESCS;
vkCreateDescriptorPool(dev, &dpci, 0, &descpool);
Which uses the descriptor types and sizes from above to create a pool that will pre-allocate the exact descriptor counts that are needed for this program.
Managing descriptor sets in this way does have other challenges, however. Like resources, it’s crucial that sets not be modified or destroyed while they’re submitted to a batch.
Previously, any time a draw was completed, the batch object would reset and clear its descriptor pool, wiping out all the allocated sets. If the pool is no longer on the batch, however, it’s not possible to perform this reset without adding tracking info for the batch to all the descriptor sets. Also, resetting a descriptor pool like this is wasteful, as it’s probable that a program will be used for multiple draws and thus require multiple descriptor sets. What I’ve done instead is add this function, which is called just after allocating the descriptor set:
bool
zink_batch_add_desc_set(struct zink_batch *batch, struct zink_program *pg, struct zink_descriptor_set *zds)
{
struct hash_entry *entry = _mesa_hash_table_search(batch->programs, pg);
assert(entry);
struct set *desc_sets = (void*)entry->data;
if (!_mesa_set_search(desc_sets, zds)) {
pipe_reference(NULL, &zds->reference);
_mesa_set_add(desc_sets, zds);
return true;
}
return false;
}
Similar to all the other batch<->object tracking, this stores the given descriptor set into a set, but in this case the set is itself stored as the data in a hash table keyed with the program, which provides both objects for use during batch reset:
void
zink_reset_batch(struct zink_context *ctx, struct zink_batch *batch)
{
struct zink_screen *screen = zink_screen(ctx->base.screen);
batch->descs_used = 0;
// cmdbuf hasn't been submitted before
if (!batch->submitted)
return;
zink_fence_finish(screen, &ctx->base, batch->fence, PIPE_TIMEOUT_INFINITE);
hash_table_foreach(batch->programs, entry) {
struct zink_program *pg = (struct zink_program*)entry->key;
struct set *desc_sets = (struct set*)entry->data;
set_foreach(desc_sets, sentry) {
struct zink_descriptor_set *zds = (void*)sentry->key;
/* reset descriptor pools when no batch is using this program to avoid
* having some inactive program hogging a billion descriptors
*/
pipe_reference(&zds->reference, NULL);
zink_program_invalidate_desc_set(pg, zds);
}
_mesa_set_destroy(desc_sets, NULL);
And this function is called:
void
zink_program_invalidate_desc_set(struct zink_program *pg, struct zink_descriptor_set *zds)
{
uint32_t refcount = p_atomic_read(&zds->reference.count);
/* refcount > 1 means this is currently in use, so we can't recycle it yet */
if (refcount == 1)
util_dynarray_append(&pg->alloc_desc_sets, struct zink_descriptor_set *, zds);
}
If a descriptor set has no active batch uses, its refcount will be 1, and then it can be added to the array of allocated descriptor sets for immediate reuse in the next draw. In this iteration of refactoring, descriptor sets can only have one batch use, so this condition is always true when this function is called, but future work will see that change.
Putting it all together is this function:
struct zink_descriptor_set *
zink_program_allocate_desc_set(struct zink_screen *screen,
struct zink_batch *batch,
struct zink_program *pg)
{
struct zink_descriptor_set *zds;
if (util_dynarray_num_elements(&pg->alloc_desc_sets, struct zink_descriptor_set *)) {
/* grab one off the allocated array */
zds = util_dynarray_pop(&pg->alloc_desc_sets, struct zink_descriptor_set *);
goto out;
}
VkDescriptorSetAllocateInfo dsai;
memset((void *)&dsai, 0, sizeof(dsai));
dsai.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO;
dsai.pNext = NULL;
dsai.descriptorPool = pg->descpool;
dsai.descriptorSetCount = 1;
dsai.pSetLayouts = &pg->dsl;
VkDescriptorSet desc_set;
if (vkAllocateDescriptorSets(screen->dev, &dsai, &desc_set) != VK_SUCCESS) {
debug_printf("ZINK: %p failed to allocate descriptor set :/\n", pg);
return NULL;
}
zds = ralloc_size(NULL, sizeof(struct zink_descriptor_set));
assert(zds);
pipe_reference_init(&zds->reference, 1);
zds->desc_set = desc_set;
out:
if (zink_batch_add_desc_set(batch, pg, zds))
batch->descs_used += pg->num_descriptors;
return zds;
}
If a pre-allocated descriptor set exists, it’s popped off the array. Otherwise, a new one is allocated. After that, the set is referenced onto the batch.
Now all the sets are allocated on the program using a more specific allocation strategy, which paves the way for a number of improvements that I’ll be discussing in various lengths over the coming days:
September 29, 2020
Today I’m taking a break from writing about my work to write about the work of zink’s newest contributor, He Haocheng (aka @hch12907). Among other things, Haocheng has recently tackled the issue of extension refactoring, which is a huge help for future driver development. I’ve written time and time again about adding extensions, and with this patchset in place, the process is simplified and expedited almost into nonexistence.
As an example, let’s look at the most recent extension that I’ve added support for, VK_EXT_extended_dynamic_state. The original patch looked like this:
diff --git a/src/gallium/drivers/zink/zink_screen.c b/src/gallium/drivers/zink/zink_screen.c
index 3effa2b0fe4..83b89106931 100644
--- a/src/gallium/drivers/zink/zink_screen.c
+++ b/src/gallium/drivers/zink/zink_screen.c
@@ -925,6 +925,10 @@ load_device_extensions(struct zink_screen *screen)
assert(have_device_time);
free(domains);
}
+ if (screen->have_EXT_extended_dynamic_state) {
+ GET_PROC_ADDR(CmdSetViewportWithCountEXT);
+ GET_PROC_ADDR(CmdSetScissorWithCountEXT);
+ }
#undef GET_PROC_ADDR
@@ -938,7 +942,8 @@ zink_internal_create_screen(struct sw_winsys *winsys, int fd, const struct pipe_
bool have_tf_ext = false, have_cond_render_ext = false, have_EXT_index_type_uint8 = false,
have_EXT_robustness2_features = false, have_EXT_vertex_attribute_divisor = false,
have_EXT_calibrated_timestamps = false, have_VK_KHR_vulkan_memory_model = false;
- bool have_EXT_custom_border_color = false, have_EXT_blend_operation_advanced = false;
+ bool have_EXT_custom_border_color = false, have_EXT_blend_operation_advanced = false,
+ have_EXT_extended_dynamic_state = false;
if (!screen)
return NULL;
@@ -1001,6 +1006,9 @@ zink_internal_create_screen(struct sw_winsys *winsys, int fd, const struct pipe_
if (!strcmp(extensions[i].extensionName,
VK_EXT_BLEND_OPERATION_ADVANCED_EXTENSION_NAME))
have_EXT_blend_operation_advanced = true;
+ if (!strcmp(extensions[i].extensionName,
+ VK_EXT_EXTENDED_DYNAMIC_STATE_EXTENSION_NAME))
+ have_EXT_extended_dynamic_state = true;
}
FREE(extensions);
@@ -1012,6 +1020,7 @@ zink_internal_create_screen(struct sw_winsys *winsys, int fd, const struct pipe_
VkPhysicalDeviceIndexTypeUint8FeaturesEXT index_uint8_feats = {};
VkPhysicalDeviceVulkanMemoryModelFeatures mem_feats = {};
VkPhysicalDeviceBlendOperationAdvancedFeaturesEXT blend_feats = {};
+ VkPhysicalDeviceExtendedDynamicStateFeaturesEXT dynamic_state_feats = {};
feats.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FEATURES_2;
screen->feats11.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_VULKAN_1_1_FEATURES;
@@ -1060,6 +1069,11 @@ zink_internal_create_screen(struct sw_winsys *winsys, int fd, const struct pipe_
blend_feats.pNext = feats.pNext;
feats.pNext = &blend_feats;
}
+ if (have_EXT_extended_dynamic_state) {
+ dynamic_state_feats.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_EXTENDED_DYNAMIC_STATE_FEATURES_EXT;
+ dynamic_state_feats.pNext = feats.pNext;
+ feats.pNext = &dynamic_state_feats;
+ }
vkGetPhysicalDeviceFeatures2(screen->pdev, &feats);
memcpy(&screen->feats, &feats.features, sizeof(screen->feats));
if (have_tf_ext && tf_feats.transformFeedback)
@@ -1074,6 +1088,8 @@ zink_internal_create_screen(struct sw_winsys *winsys, int fd, const struct pipe_
screen->have_EXT_calibrated_timestamps = have_EXT_calibrated_timestamps;
if (have_EXT_custom_border_color && screen->border_color_feats.customBorderColors)
screen->have_EXT_custom_border_color = true;
+ if (have_EXT_extended_dynamic_state && dynamic_state_feats.extendedDynamicState)
+ screen->have_EXT_extended_dynamic_state = true;
VkPhysicalDeviceProperties2 props = {};
VkPhysicalDeviceVertexAttributeDivisorPropertiesEXT vdiv_props = {};
@@ -1150,7 +1166,7 @@ zink_internal_create_screen(struct sw_winsys *winsys, int fd, const struct pipe_
* this requires us to pass the whole VkPhysicalDeviceFeatures2 struct
*/
dci.pNext = &feats;
- const char *extensions[12] = {
+ const char *extensions[13] = {
VK_KHR_MAINTENANCE1_EXTENSION_NAME,
};
num_extensions = 1;
@@ -1185,6 +1201,8 @@ zink_internal_create_screen(struct sw_winsys *winsys, int fd, const struct pipe_
extensions[num_extensions++] = VK_EXT_CUSTOM_BORDER_COLOR_EXTENSION_NAME;
if (have_EXT_blend_operation_advanced)
extensions[num_extensions++] = VK_EXT_BLEND_OPERATION_ADVANCED_EXTENSION_NAME;
+ if (have_EXT_extended_dynamic_state)
+ extensions[num_extensions++] = VK_EXT_EXTENDED_DYNAMIC_STATE_EXTENSION_NAME;
assert(num_extensions <= ARRAY_SIZE(extensions));
dci.ppEnabledExtensionNames = extensions;
diff --git a/src/gallium/drivers/zink/zink_screen.h b/src/gallium/drivers/zink/zink_screen.h
index 4ee409c0efd..1d35e775262 100644
--- a/src/gallium/drivers/zink/zink_screen.h
+++ b/src/gallium/drivers/zink/zink_screen.h
@@ -75,6 +75,7 @@ struct zink_screen {
bool have_EXT_calibrated_timestamps;
bool have_EXT_custom_border_color;
bool have_EXT_blend_operation_advanced;
+ bool have_EXT_extended_dynamic_state;
bool have_X8_D24_UNORM_PACK32;
bool have_D24_UNORM_S8_UINT;
It’s awful, right? There’s obviously lots of copy/pasted code here, and it’s a tremendous waste of time to have to do the copy/pasting, not to mention the time needed for reviewing such mind-numbing changes.
Here’s the same patch after He Haocheng’s work has been merged:
diff --git a/src/gallium/drivers/zink/zink_device_info.py b/src/gallium/drivers/zink/zink_device_info.py
index 0300e7f7574..69e475df2cf 100644
--- a/src/gallium/drivers/zink/zink_device_info.py
+++ b/src/gallium/drivers/zink/zink_device_info.py
@@ -62,6 +62,7 @@ def EXTENSIONS():
Extension("VK_EXT_calibrated_timestamps"),
Extension("VK_EXT_custom_border_color", alias="border_color", properties=True, feature="customBorderColors"),
Extension("VK_EXT_blend_operation_advanced", alias="blend", properties=True),
+ Extension("VK_EXT_extended_dynamic_state", alias="dynamic_state", feature="extendedDynamicState"),
]
# There exists some inconsistencies regarding the enum constants, fix them.
diff --git a/src/gallium/drivers/zink/zink_screen.c b/src/gallium/drivers/zink/zink_screen.c
index 864ec32fc22..3c1214d384b 100644
--- a/src/gallium/drivers/zink/zink_screen.c
+++ b/src/gallium/drivers/zink/zink_screen.c
@@ -926,6 +926,10 @@ load_device_extensions(struct zink_screen *screen)
assert(have_device_time);
free(domains);
}
+ if (screen->info.have_EXT_extended_dynamic_state) {
+ GET_PROC_ADDR(CmdSetViewportWithCountEXT);
+ GET_PROC_ADDR(CmdSetScissorWithCountEXT);
+ }
#undef GET_PROC_ADDR
I’m certain this is going to lead to an increase in my own productivity in the future given how quick the process has now become.
I’ve been putting up posts lately about my benchmarking figures, and in them I’ve been referencing Intel hardware. The reason I use Intel is because I’m (currently) a hobbyist developer with exactly one computer capable of doing graphics development, and it has an Intel onboard GPU. I’m quite happy with this given the high quality state of Intel’s drivers, as things become much more challenging when I have to debug both my own bugs as well as an underlying Vulkan driver’s bugs at the same time.
With that said, I present to you this recent out-of-context statement from Dave Airlie regarding zink performance on a different driver:
<airlied> zmike: hey on a fiji amd card, you get 45/46 native vs 35 fps with zink on one heaven scene here, however zink is corrupted
I don’t have any further information about anything there, but it’s the best I can do given the limitations in my available hardware.
September 28, 2020
Just a quick post to summarize a few exciting changes I’ve made today.
To start with, I’ve added some tracking to the internal batch objects for catching things like piglit’s spec@!opengl 1.1@streaming-texture-leak. Let’s check out the test code there for a moment:
/** @file streaming-texture-leak.c
*
* Tests that allocating and freeing textures over and over doesn't OOM
* the system due to various refcounting issues drivers may have.
*
* Textures used are around 4MB, and we make 5k of them, so OOM-killer
* should catch any failure.
*
* Bug #23530
*/
for (i = 0; i < 5000; i++) {
glGenTextures(1, &texture);
glBindTexture(GL_TEXTURE_2D, texture);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER,
GL_LINEAR);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER,
GL_LINEAR);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, TEX_SIZE, TEX_SIZE,
0, GL_RGBA,
GL_UNSIGNED_BYTE, tex_buffer);
piglit_draw_rect_tex(0, 0, piglit_width, piglit_height,
0, 0, 1, 1);
glDeleteTextures(1, &texture);
}
This test loops 5000 times, using a different sampler texture for each draw, and then destroys the texture. This is supposed to catch drivers which can’t properly manage their resource refcounts, but here zink instead gets caught trying to dump 5000 active resources into the same command buffer, which OOMs the system.
The reason for the problem in this case is that, after my recent optimizations which avoid unnecessary flushing, zink only submits the command buffer when a frame is finished or one of the write-flagged resources associated with an active batch is read from. Thus, the whole test runs in one go, only submitting the queue at the very end when the test performs a read.
In this case, my fix is simple: check the system’s total memory on driver init, and then always flush a batch if it crosses some threshold of memory usage in its associated resources when beginning a new draw. I chose 1/8 total memory to be “safe”, since that allows zink to use 50% of the total memory with its resources before it’ll begin to stall and force the draws to complete, hopefully avoiding any oom scenarios. This ends up being a flush every 250ish draws in the above test code, and everything works nicely without killing my system.
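As a rough sketch of that heuristic (the total_mem and resource_size fields here are illustrative names, not the exact zink members), the check at draw time looks something like this:

```c
/* Rough sketch of the memory-threshold flush described above; total_mem and
 * resource_size are illustrative names rather than the exact zink members.
 */
static void
maybe_flush_for_memory(struct zink_context *ctx)
{
   struct zink_screen *screen = zink_screen(ctx->base.screen);
   struct zink_batch *batch = zink_curr_batch(ctx);

   /* 1/8 of system memory per batch: with four batches in flight, resources
    * can occupy roughly half of memory before draws are forced to complete.
    */
   if (batch->resource_size > screen->total_mem / 8)
      flush_batch(ctx);
}
```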
As a bonus, I noticed that zink was taking considerably longer than IRIS to complete this test once it was fixed, so I did a little profiling, and this was the result:
Up another 3 fps (~10%) from Friday, which isn’t bad for a few minutes spent removing some memset calls from descriptor updating and then throwing in some code for handling VK_DYNAMIC_STATE_VERTEX_INPUT_BINDING_STRIDE_EXT.
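The dynamic-stride part, sketched roughly below; the wrapper is hypothetical and just shows the extension entry point that replaces baking strides into the pipeline:

```c
/* Hypothetical wrapper, assuming vkCmdBindVertexBuffers2EXT has been loaded
 * via GET_PROC_ADDR like the other extension entry points; this is not the
 * actual zink vertex-buffer code.
 */
#include <vulkan/vulkan.h>

static void
bind_vertex_buffers_dynamic_stride(VkCommandBuffer cmdbuf, uint32_t count,
                                   const VkBuffer *buffers,
                                   const VkDeviceSize *offsets,
                                   const VkDeviceSize *strides)
{
   /* strides passed here no longer need to be part of the pipeline state */
   vkCmdBindVertexBuffers2EXT(cmdbuf, 0, count, buffers, offsets, NULL, strides);
}
```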
September 25, 2020
Today new beta versions of all KWinFT projects – that is, KWinFT, Wrapland, Disman and KDisplay – were released. With that we are on target for the full release, which is aligned with Plasma 5.20 on October 13.
Big changes will unquestionably come to Disman, a previously stifled library for display management, which now learns to stand on its own feet, providing universal means for the configuration of displays with different windowing systems and Wayland compositors.
But also for the compositor KWinFT a very specific yet important feature got implemented and a multitude of stability fixes and code refactors were accomplished.
In the following we will do a deep dive into the reasons and results of these recent efforts.
For a quick overview of the work on Disman you can also watch this lightning talk that I held at the virtual XDC 2020 conference last week.
It was initially not planned like this but Disman and KDisplay were two later additions to the KWinFT project.
The projects were forked from libkscreen and KScreen respectively, and I saw this as an opportunity to completely rethink and in every sense overhaul what had been rather lackluster and at times completely neglected components of the KDE Plasma workspace. This past negligence is rather tragic, since complaints about miserable output management in KDE Plasma go back as long as one can remember. Improving this bad state of affairs was my main motivation when I started working on libkscreen and KScreen around two years ago.
In my opinion a well functioning – not necessarily fancy but for sure robust – display configuration system is a cornerstone of a well crafted desktop system. One reason for that is how prevalent multi-display setups are, and another is how immeasurably annoying it is when you can't configure the projector correctly that one time you have to give a presentation in front of a room full of people.
Disman now tries to solve this by providing a solution not only for KWinFT or the KDE Plasma desktop alone but for any system running X11 or any Wayland compositor.
Let us look into the details of this solution and why I haven't mentioned KDisplay yet. The reason for this omission is that KDisplay from now on will be a husk of its former self.
As a fork of KScreen, no longer than one month ago KDisplay was still the logical center of any display configuration, with an always-active KDE daemon (KDED) module and a KConfig module (KCM) integrated into the KDE System Settings.
The KDED module was responsible for reacting to display hot-plug events, reading control files for the resulting display combination from the user directory, generating optimal configurations if none were found, and writing new files to the hard disk after the configuration had been applied successfully to the windowing system.
In this workflow Disman was only relevant as a provider of backend plugins that were loaded at runtime. Disman was used either in-process or through an included D-Bus service that was started automatically whenever the first client tried to talk to it. According to the commit adding this out-of-process mode five years ago, the intention behind it was to improve performance and stability. But in the end, on a functional level the service did little more than forward data between the windowing system and the Disman consumers.
Interestingly, the D-Bus service was only activatable with the X11 backend and was explicitly disabled on Wayland. When I noticed this I was at first tempted to remove the D-Bus service in the eternal struggle to reduce code complexity. After all, if the service is not used on Wayland, maybe we don't need it at all.
But some time later I realized that this D-Bus service should be appreciated in a different way than its initial reasoning suggests. From a different point of view, this service could be the key to a much more ambitious grand solution.
The service allows us to serialize and synchronize access by arbitrarily many clients in a transparent way, while moving all relevant logic to a shared central place and providing each client a high level of integration with those systems.
Concretely, this means that the Disman D-Bus service becomes an independent entity. Once invoked by a single call from a client, for example by the included command line utility with dismanctl -o, the service reads and writes all necessary control files on its own. It generates optimal display configurations if no files are found and can even disable a laptop display when the lid is closed while an external output is connected.
In this model Disman consumers solely provide user interfaces that are informed about the generated or loaded current config and that can modify this config additionally if desirable. This way the consumer can concentrate on providing a user interface with great usability and design and leave to Disman all the logic of handling the modified configuration afterwards.
Making it easy to add other clients is only one advantage. On a higher level this new design has two more.
I noticed already last year that some basic assumptions in KScreen were questionable. Its internal data logic relied on a round trip through the windowing system.
This meant in practice that the user was supposed to change display properties via the KScreen KCM. These were then sent to the windowing system which tried to apply them to the hardware. Afterwards it informed the KScreen KDE daemon through its own specific protocols and a libkscreen backend about this new configuration. Only the daemon then would write the updated configuration to the disk.
Why it was done this way is clear: we can be sure we have written a valid configuration to the disk and by having only the daemon do the file write we have the file access logic in a single place and do not need to sync file writes of different processes.
But the fundamental problem with this design is that sensible display management sometimes requires sharing additional information about the display configuration that is of no relevance to the windowing system and therefore cannot be passed through it.
A simple example is when a display is auto-rotated. Smartphones and tablets but also many convertibles come with orientation sensors to auto-rotate the built-in display according to the current device orientation. When auto-rotation is switched on or off in the KCM it is not sent through the windowing system but the daemon or another service needs to know about such a change in order to adapt the display rotation correctly with later orientation changes.
A complex but interesting other example is the replication of displays, also often called mirroring. When I started work on KScreen two years ago the mechanism was painfully primitive: one could only duplicate all displays at once and it was done by moving all of them to the same position and then changing their display resolutions hoping to find some sufficiently alike to cover a similar area.
Obviously that had several issues; the worst in my opinion was that it doesn't work for displays with different aspect ratios, as I noticed quickly after I got myself a 16:10 display. Another grave issue was that displays might not run at their full resolution: in a mixed-DPI setup the formerly HiDPI displays are downgraded to the best resolution they have in common with the LoDPI displays.
The good news is that on X11 and also Wayland methods are available to replicate displays without these downsides.
On X11 we can apply arbitrary linear transformations to an output. This solves both issues.
On Wayland, all available output management protocols at minimum allow setting a single floating point value to scale a display. This solves the mixed-DPI problem, since we can still run both displays at arbitrary resolutions and adapt the logical size of the replica through its scale. If the management protocol even provides a way to specify the logical size directly, as the KWinFT protocol does, we can also solve the problem of diverging display aspect ratios.
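As a toy illustration of the numbers involved (an assumption-laden example, not Disman code): a 16:9 replica mirroring a 16:10 source can keep its native mode and take a scale derived from the source's logical width, which also shows why a directly settable logical size is still needed when the aspect ratios diverge.

```c
/* Toy calculation for illustration only: the replica keeps its native mode
 * and takes on the source's logical width via a scale factor.
 */
#include <stdio.h>

struct output {
   int mode_w, mode_h;          /* native pixel resolution */
   double logical_w, logical_h; /* size in the compositor's logical space */
};

int main(void)
{
   struct output source  = { 1920, 1200, 1920.0, 1200.0 }; /* 16:10 at scale 1 */
   struct output replica = { 2560, 1440, 0.0, 0.0 };       /* 16:9 HiDPI panel */

   /* with only a single scale value available, match the logical width */
   double scale = replica.mode_w / source.logical_w;
   replica.logical_w = replica.mode_w / scale;
   replica.logical_h = replica.mode_h / scale;

   /* prints "scale 1.33 -> logical 1920x1080": the widths match, but the
    * heights still differ, which is where a settable logical size helps */
   printf("scale %.2f -> logical %.0fx%.0f\n",
          scale, replica.logical_w, replica.logical_h);
   return 0;
}
```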
From a bird's eye view, in this model there are one or multiple displays that act as replicas of a single source display. Only the transformation, scale or logical size of the replicas is changed; the source is the invariant. The only information that needs to be remembered for each display is therefore whether there is a replication source that the display is a replica of. But neither in X11 nor in any Wayland compositor is this information conveyed via the windowing system.
With the new design we send all configuration data including such auxiliary data to the Disman D-Bus service. The service will save all this data to a configuration-specific file but send to the windowing system only a relevant subset of the data. After the windowing system reports that the configuration was applied the Disman service informs all connected clients about this change sending the data received from the windowing system augmented by the auxiliary data that had not been passed through the windowing system.
This way every display management client receives all relevant data about the current configuration including the auxiliary data.
The motivation to solve this problem was the original driving force behind the large redesign of Disman that is coming with this release.
But I realized soon that this redesign also has another advantage that long-term is likely even more important than the first one.
With the new design Disman becomes a truly universal solution for display management.
Previously a running KDE daemon process with KDisplay's daemon module inserted was required in order to load a display configuration on startup and to react to display hot-plug events. The problem with this is that the KDE daemon commonly only makes sense to run on a KDE Plasma desktop.
Thanks to the new design the Disman D-Bus service can now be run as a standalone background service managing all your displays permanently, even if you don't use KDE Plasma.
In a non-Plasma environment like a Wayland session with sway, this can be achieved by simply calling dismanctl -o once in a startup script.

On the other side, the graphical user interface that KDisplay provides can now be used to manage displays on any desktop that Disman runs on. KDisplay does not require the KDE System Settings to be installed and can be run as a standalone app. Simply call kdisplay from the command line or start it from the launcher of your desktop environment.

KDisplay still includes a now thoroughly gutted KDE daemon module that will be run in a Plasma session. The module now does little more than launch the Disman D-Bus service on session startup. So in a Plasma session, after installing Disman and KDisplay everything is set up automatically. In every other session, as said, a simple dismanctl -o call at startup is enough to get the very same result.
Maybe the integration in sessions other than Plasma could be improved to make even setting up this single call at startup unnecessary. Should Disman, for example, install a systemd unit file executing this call by default? I would be interested in feedback in this regard, in particular from distributions. What do they prefer?
With today's beta release the greatest changes come to Disman and KDisplay. But that does not mean KWinFT and Wrapland have not received some important updates.
With the ongoing work on Disman, and by that on displays – or outputs, as they are called in the land of window managers – stability and feature patches for outputs naturally came to Wrapland and KWinFT as well. A large refactor was the introduction of a master output class on the server side of Wrapland. The class acts as a central entry point for compositors and deals with the different output-related protocol objects internally.
With this class in place, it was rather easy to add support for xdg-output versions 2 and 3 afterwards. In order to do that it was also reasonable to re-evaluate how we provide output-identifying metadata in KWinFT and Wrapland in general.
In regards to output identification, a happy coincidence was that Simon Ser of the wlroots project had already been asking himself the very same questions in the past.
I concluded that Simon's plan for wlroots was spot on and decided to help them out a bit with patches for wlr-protocols and wlroots. In the same vein I updated Wrapland's output device protocol. That means Wrapland and wlroots based compositors now feature the very same way of identifying outputs, which made it easy to provide full support for both in Disman.
This release comes with support for the presentation-time protocol.
It is one of only three Wayland protocol extensions that have been officially declared stable. Because of that, supporting it also felt important in a formal sense.
Primarily though it is essential to my ongoing work on Xwayland. I plan to make use of the presentation-time protocol in Xwayland's Present extension implementation.
With the support in KWinFT I can test future presentation-time work in Xwayland now with KWinFT and sway as wlroots also supports the protocol. Having two different compositors for alternative testing will be quite helpful.
If you want to try out the new beta release of Disman together with your favorite desktop environment or the KWinFT beta as a drop-in replacement for KWin you have to compile from source at the moment. For that use the Plasma/5.20 branches in the respective repositories.
For Disman there are some limited instructions on how to compile it in the Readme file.
If you have questions or just want to chat about the projects feel free to join the official KWinFT Gitter channel.
If you want to wait for the full release check back on the release date, October 13. I plan to write another article to that date that will then list all distributions where you will be able to install the KWinFT projects comfortably by package manager.
That is also a call to distro packagers: if you plan to provide packages for the KWinFT projects on October 13 get in touch to get support and be featured in the article.
In my last post, I left off with an overall 25%ish improvement in framerate for my test case:
At the end, this was an extra 3 fps over my previous test, but how did I get to this point?
The answer lies in even more unnecessary queue submission. Let’s take a look at zink’s pipe_context::set_framebuffer_state hook, which is called by gallium any time the framebuffer state changes:
static void
zink_set_framebuffer_state(struct pipe_context *pctx,
const struct pipe_framebuffer_state *state)
{
struct zink_context *ctx = zink_context(pctx);
struct zink_screen *screen = zink_screen(pctx->screen);
util_copy_framebuffer_state(&ctx->fb_state, state);
struct zink_framebuffer *fb = get_framebuffer(ctx);
zink_framebuffer_reference(screen, &ctx->framebuffer, fb);
if (ctx->gfx_pipeline_state.render_pass != fb->rp)
ctx->gfx_pipeline_state.hash = 0;
zink_render_pass_reference(screen, &ctx->gfx_pipeline_state.render_pass, fb->rp);
uint8_t rast_samples = util_framebuffer_get_num_samples(state);
/* in vulkan, gl_SampleMask needs to be explicitly ignored for sampleCount == 1 */
if ((ctx->gfx_pipeline_state.rast_samples > 1) != (rast_samples > 1))
ctx->dirty_shader_stages |= 1 << PIPE_SHADER_FRAGMENT;
if (ctx->gfx_pipeline_state.rast_samples != rast_samples)
ctx->gfx_pipeline_state.hash = 0;
ctx->gfx_pipeline_state.rast_samples = rast_samples;
if (ctx->gfx_pipeline_state.num_attachments != state->nr_cbufs)
ctx->gfx_pipeline_state.hash = 0;
ctx->gfx_pipeline_state.num_attachments = state->nr_cbufs;
/* need to start a new renderpass */
if (zink_curr_batch(ctx)->rp)
flush_batch(ctx);
struct zink_batch *batch = zink_batch_no_rp(ctx);
zink_framebuffer_reference(screen, &batch->fb, fb);
framebuffer_state_buffer_barriers_setup(ctx, &ctx->fb_state, zink_curr_batch(ctx));
}
Briefly: zink copies the framebuffer state, and there are a number of conditions under which a new pipeline object is needed, all of which result in ctx->gfx_pipeline_state.hash = 0;. Other than this, there’s a sample count check so that the shader can be modified if necessary, and then there’s the setup for creating the Vulkan framebuffer object as well as the renderpass object in get_framebuffer().
Eagle-eyed readers will immediately spot the problem here: aside from the fact that there’s not actually any reason to be setting up the framebuffer or renderpass at this point, zink is also flushing the current batch if a renderpass is active.
The change I made here was to remove everything related to Vulkan from here, and move it to zink_begin_render_pass(), which is the function that the driver uses to begin a renderpass for a given batch.
This is clearly a much larger change than just removing the flush_batch() call, which might be what’s expected now that ending a renderpass no longer forces queue submission. Indeed, why haven’t I just ended the current renderpass and kept using the same batch?
The reason for this is that zink is designed in such a way that a given batch, at the time of calling vkCmdBeginRenderPass, is expected to either have no struct zink_render_pass associated with it (the batch has not performed a draw yet) or have the same object which matches the pipeline state (the batch is continuing to draw using the same renderpass). Adjusting this to be compatible with removing the flush here ends up being more code than just moving the object setup to a different spot.
So now the framebuffer and renderpass are created or pulled from their caches just prior to the vkCmdBeginRenderPass call, and a flush is removed, gaining some noticeable fps.
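A rough sketch of the reorganized flow, reusing the helper names from the code above; the actual zink function handles more state than shown here:

```c
/* Rough sketch (not the exact upstream code): the framebuffer/renderpass
 * lookups now happen only when a renderpass actually begins.
 */
static void
begin_render_pass_sketch(struct zink_context *ctx, struct zink_batch *batch)
{
   struct zink_screen *screen = zink_screen(ctx->base.screen);

   /* pull cached objects (or create them) only once a draw is imminent */
   struct zink_framebuffer *fb = get_framebuffer(ctx);
   zink_framebuffer_reference(screen, &batch->fb, fb);
   zink_render_pass_reference(screen, &ctx->gfx_pipeline_state.render_pass, fb->rp);

   /* ... fill in VkRenderPassBeginInfo and call vkCmdBeginRenderPass ... */
}
```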
Now that I’d unblocked that bottleneck, I went back to the list and checked the remaining problem areas:
I decided to change things up a bit here.
This is the current way of things.
My plan was something more like this:
Where get descriptorset from program would look something like:
In this way, I’d get to conserve some sets and reuse them across draw calls even between different command buffers since I could track whether they were in use and avoid modifying them in any way. I’d also get to remove any tracking for descriptorset usage on batches, thereby removing possible queue submissions there. Any time resources in a set were destroyed, I could keep references to the sets on the resources and then invalidate the sets, returning them to the unused pool.
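To make the resource-destruction part concrete, here's a hypothetical sketch; the res->desc_sets set and the zds->pg back-pointer are invented names for illustration:

```c
/* Hypothetical sketch of invalidating sets when a resource they reference is
 * destroyed; res->desc_sets and zds->pg are invented names for illustration.
 */
static void
invalidate_desc_sets_for_resource(struct zink_resource *res)
{
   set_foreach(res->desc_sets, entry) {
      struct zink_descriptor_set *zds = (void *)entry->key;
      /* the set's contents are now stale, so return it to the unused pool */
      zink_program_invalidate_desc_set(zds->pg, zds);
   }
   _mesa_set_clear(res->desc_sets, NULL);
}
```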
21 fps now, which is up another 3 from before.
Next I started investigating my cache implementation. There was a lot of hashing going on, as I was storing both the in-use sets as well as the unused sets (the valid and invalidated) based on the hash calculated for their descriptor usage, so I decided to try moving just the invalidated sets into an array as they no longer had a valid hash anyway, thereby giving quicker access to sets I knew to be free.
This would also help with my next plan, but again, the results were promising:
Now I was at 23 fps, which is another 10% from the last changes, and just from removing some of the hashing going on.
This is like shooting fish in a barrel now.
Naturally at this point I began to, as all developers do, consider bucketing my allocations, since I’d seen in my profiling that some of these programs were allocating thousands of sets to submit simultaneously across hundreds of draws. I ended up using a scaling factor here so that programs would initially begin allocating in tens of sets, scaling up by a factor of ten every time it reached that threshold (i.e., once 100 sets were allocated, it begins allocating 100 at a time).
This didn’t have any discernible effect on the fps, but there were certainly fewer allocation calls going on, so I imagine the results will show up somewhere else.
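For reference, a minimal sketch of that scaling allocation size (the real code may be structured differently):

```c
/* Minimal sketch of the scaling allocation size described above; the real
 * zink code may be structured differently.
 */
static unsigned
desc_set_alloc_size(unsigned num_allocated)
{
   unsigned bucket = 10;
   /* start in tens, then allocate 10x as many at a time each time the
    * program crosses the next power-of-ten threshold (100, 1000, ...) */
   while (num_allocated >= bucket * 10)
      bucket *= 10;
   return bucket;
}
```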
Because sure, my efficient, possibly-overengineered descriptorset caching mechanism for reusing sets across draws and batches was cool, and it worked great, but was the overhead of all the hashing involved actually worse for performance than just using a dumb bucket allocator to set up and submit the same sets multiple times even in the same command buffer?
I’m not one of those types who refuses to acknowledge that other ideas can be better than the ones I personally prefer, so I smashed all of my descriptor info hashing out of the codebase and just used an array for storing unused sets. So now the mechanism looked like this:
But would this be better than the more specific caching I was already using? Well…
24 fps, so the short of it is yes. It was 1-2 fps faster across the board.
This is where I’m at now after spending some time also rewriting all the clear code again and fixing related regressions. The benchmark is up ~70% from where I started, and the gains just keep coming. I’ll post again about performance improvements in the future, but here’s a comparison to a native GL driver, namely IRIS:
Zink is at a little under 50% of the performance here, up from around 25% when I started, though this gap still varies throughout other sections of the benchmark, dipping as low as 30% of the IRIS performance in some parts.
It’s progress.
September 24, 2020
Last week, X.Org Developers Conference 2020 was held online for the first time. This year, with all the COVID-19 situation that is affecting almost every country worldwide, the X.Org Foundation Board of Directors decided to make it virtual.
I love open-source conferences :-) They are great for networking, having fun with the rest of the community members, having really good technical discussions in the hallway track… and visiting a new place every year! Unfortunately, we couldn’t do any of that this time and we needed to look for an alternative… with going virtual being the obvious one.
The organization team at Intel, led by Radoslaw Szwichtenberg and Martin Peres, analyzed the different open-source alternatives for organizing XDC 2020 in a virtual manner. Finally, due to the setup requirements and the possibility of having more than 200 attendees connected to the video stream at the same time (Big Blue Button doesn’t recommend more than 100 simultaneous users), they selected Jitsi for speakers + Youtube for streaming/recording + IRC for questions. Arkadiusz Hiler summarized very well what they did from the A/V technical point of view and what the experience of hosting a virtual XDC was like.
I’m very happy with the final result given the special situation this year: the streaming was flawless, we had almost no technical issues (except one audio issue in the opening session… just like at physical ones! :-D), and IRC turned out to be very active during the conference. Thanks a lot to the organizers for their great job!
However, there is always room for improvements. Therefore, the X.Org Foundation board is asking for feedback, please share with us your opinion on XDC 2020!
Just focusing on my own experience, it was very good. I enjoyed the talks presented this year and the interesting discussions happening on IRC a lot. I would like to highlight the four talks presented by my colleagues at Igalia :-D
This year I was also a speaker! I presented “Improving Khronos CTS tests with Mesa code coverage” talk (watch recording), where I explained how we can improve the VK-GL-CTS quality by leveraging Mesa code coverage using open-source tools.
I’m looking forward to attending X.Org Developers Conference 2021! But first, we need an organizer! Requests For Proposals for hosting XDC2021 are now open!
See you next year!
For a long time now, I’ve been writing about various work I’ve done in the course of getting to GL 4.6. This has generally been feature implementation work with an occasional side of bug hunting and fixing and so I haven’t been too concerned about performance.
I’m not done with feature work. There’s still tons of things missing that I’m planning to work on.
I’m not done with bug hunting or fixing. There’s still tons of bugs (like how currently spec@!opengl 1.1@streaming-texture-leak OOMs my system and crashes all the other tests trying to run in parallel) that I’m going to fix.
But I wanted a break, and I wanted to learn some new parts of the graphics pipeline instead of just slapping more extensions in.
For the moment, I’ve been focusing on the Unigine Heaven benchmark since there’s tons of room for improvement, though I’m planning to move on from that once I get bored and/or hit a wall. Here’s my starting point, which is taken from the patch in my branch with the summary zink: add batch flag to determine if batch is currently in renderpass, some 300ish patches ahead of the main branch’s tip:
This is 14 fps running as ./heaven_x64 -project_name Heaven -data_path ../ -engine_config ../data/heaven_4.0.cfg -system_script heaven/unigine.cpp -sound_app openal -video_app opengl -video_multisample 0 -video_fullscreen 0 -video_mode 3 -extern_define ,RELEASE,LANGUAGE_EN,QUALITY_LOW,TESSELLATION_DISABLED -extern_plugin ,GPUMonitor, and I’m going to be posting screenshots from roughly the same point in the demo as I progress to gauge progress.
Is this an amazing way to do a benchmark?
No.
Is it a quick way to determine if I’m making things better or worse right now?
Given the size of the gains I’m making, absolutely.
Let’s begin.
Now that I’ve lured everyone in with promises of gains and a screenshot with an fps counter, what I really want to talk about is code.
In order to figure out the most significant performance improvements for zink, it’s important to understand the architecture. At the point when I started, zink’s batches (C name for an object containing a command buffer and a fence as well as references to all the objects submitted to the queue for lifetime validation) worked like this for draw commands:
vkCmdBeginRenderPass call
This is a lot to take in, so I’ll cut to some conclusions that I drew from these points:
pipe_context::flush hook in general is very, very bad and needs to be avoided
There’s a lot more I could go into here, but this is already a lot.
I decided to start here since it was easy:
diff --git a/src/gallium/drivers/zink/zink_context.c b/src/gallium/drivers/zink/zink_context.c
index a9418430bb7..f07ae658115 100644
--- a/src/gallium/drivers/zink/zink_context.c
+++ b/src/gallium/drivers/zink/zink_context.c
@@ -800,12 +800,8 @@ struct zink_batch *
zink_batch_no_rp(struct zink_context *ctx)
{
struct zink_batch *batch = zink_curr_batch(ctx);
- if (batch->in_rp) {
- /* flush batch and get a new one */
- flush_batch(ctx);
- batch = zink_curr_batch(ctx);
- assert(!batch->in_rp);
- }
+ zink_end_render_pass(ctx, batch);
+ assert(!batch->in_rp);
return batch;
}
Amazing, I know. Let’s see how much the fps changes:
15?! Wait a minute. That’s basically within the margin of error!
It is actually a consistent 1-2 fps gain, even a little more in some other parts, but it seemed like it should’ve been more now that all the command buffers are being gloriously saturated, right?
Well, not exactly. Here’s a fun bit of code from the descriptor updating function:
struct zink_batch *batch = zink_batch_rp(ctx);
unsigned num_descriptors = ctx->curr_program->num_descriptors;
VkDescriptorSetLayout dsl = ctx->curr_program->dsl;
if (batch->descs_left < num_descriptors) {
ctx->base.flush(&ctx->base, NULL, 0);
batch = zink_batch_rp(ctx);
assert(batch->descs_left >= num_descriptors);
}
Right. The flushing continues. And while I’m here, what does zink’s pipe_context::flush hook even look like again?
static void
zink_flush(struct pipe_context *pctx,
struct pipe_fence_handle **pfence,
enum pipe_flush_flags flags)
{
struct zink_context *ctx = zink_context(pctx);
struct zink_batch *batch = zink_curr_batch(ctx);
flush_batch(ctx);
...
/* HACK:
* For some strange reason, we need to finish before presenting, or else
* we start rendering on top of the back-buffer for the next frame. This
* seems like a bug in the DRI-driver to me, because we really should
* be properly protected by fences here, and the back-buffer should
* either be swapped with the front-buffer, or blitted from. But for
* some strange reason, neither of these things happen.
*/
if (flags & PIPE_FLUSH_END_OF_FRAME)
pctx->screen->fence_finish(pctx->screen, pctx,
(struct pipe_fence_handle *)batch->fence,
PIPE_TIMEOUT_INFINITE);
}
Oh. So really every time zink “finishes” a frame in this benchmark (which has already stalled hundreds of times up to this point), it then waits on that frame to finish instead of letting things outside the driver worry about that.
It was at this moment that a dim spark flickered to life in my memories, reminding me of the in-progress MR from Antonio Caggiano for caching surfaces on batches. In particular, it reminded me that his series has a patch which removes the above monstrosity.
Let’s see what happens when I add those patches in:
15.
Again.
I expected a huge performance win here, but it seems that we still can’t fully utilize all these changes and are still stuck at 15 fps. Every time descriptors are updated, the batch ends up hitting that arbitrary 1000 descriptor set limit, and then it submits the command buffer, so there’s still multiple batches being used for each frame.
So naturally at this point I tried increasing the limit.
Then I increased it again.
And again.
And now I had exactly one flush per frame, but my fps was still fixed at a measly 15.
That’s when I decided to do some desk curls.
What happened next was shocking:
18 fps.
It was a sudden 20% fps gain, but it was only the beginning.
More on this tomorrow.
September 23, 2020
For the past few days, I’ve been trying to fix a troublesome bug. Specifically, the Unigine Heaven benchmark wasn’t drawing most textures in color, and this was hampering my ability to make further claims about zink being the fastest graphics driver in the history of software since it’s not very impressive to be posting side-by-side screenshots that look like garbage even if the FPS counter in the corner is higher.
Thus I embarked on adventure.
This was the starting point. The thing ran just fine, but without valid frames drawn, it’s hard to call it a test of very much other than how many bugs I can pack into a single frame.
Naturally I assumed that this was going to be some bug in zink’s handling of something, whether it was blending, or sampling, or blending and sampling. I set out to figure out exactly how I’d screwed up.
I had no idea what the problem was. I phoned Dr. Render, as we all do when facing issues like this, and I was told that I had problems.
Lots of problems.
The biggest problem was figuring out how to get anywhere with so many draw calls. Each frame consisted of 3 render passes (with hundreds of draws each) as well as a bunch of extra draws and clears.
There was a lot going on. This was by far the biggest thing I’d had to fix, and it’s much more difficult to debug a game-like application than it is a unit test. With that said, and since there’s not actually any documentation about “What do I do if some of my frame isn’t drawing with color?” for people working on drivers, here were some of the things I looked into:
Just to check. On IRIS, which is my reference for these types of things, the change gave some neat results:
How bout that.
On zink, I got the same thing, except there was no color, and it wasn’t very interesting.
On an #executive suggestion, I looked into whether a z/s buffer had snuck into my sampler buffers and was thus providing bogus pixel data.
It hadn’t.
This was a runtime version of my usual shader debugging, wherein I try to isolate the pixels in a region to a specific color based on a conditional, which then lets me determine which path in the shader is broken. To do this, I added a helper function in ntv:
static SpvId
clobber(struct ntv_context *ctx)
{
SpvId type = get_fvec_type(ctx, 32, 4);
SpvId vals[] = {
emit_float_const(ctx, 32, 1.0),
emit_float_const(ctx, 32, 0.0),
emit_float_const(ctx, 32, 0.0),
emit_float_const(ctx, 32, 1.0)
};
printf("CLOBBERING\n");
return spirv_builder_emit_composite_construct(&ctx->builder, type, vals, 4);
}
This returns a vec4 of the color RED, and I cleverly stuck it at the end of emit_store_deref() like so:
if (ctx->stage == MESA_SHADER_FRAGMENT && var->data.location == FRAG_RESULT_DATA0 && match)
result = clobber(ctx);
match in this case is set based on this small block at the very start of ntv:
if (s->info.stage == MESA_SHADER_FRAGMENT) {
const char *env = getenv("TEST_SHADER");
match = env && s->info.name && !strcmp(s->info.name, env);
}
Thus, I could set my environment in gdb with e.g., set env TEST_SHADER=GLSL271 and then zink would swap the output of the fragment shader named GLSL271 to RED, which let me determine what various shaders were being used for. When I found the shader used for the lamps, things got LIT:
But ultimately, even though I did find the shaders that were being used for the more general material draws, this ended up being another dead end.
This took me the longest since I had to figure out a way to match up the Dr. Render states to the runtime states that I could see. I eventually settled on adding breakpoints based on index buffer size, as the chart provided by Dr. Render had this in the vertex state, which made things simple.
But alas, zink was doing all the right blending too.
As usual, this was my last resort, but it was also my most powerful weapon that I couldn’t abuse too frequently, lest people come to the conclusion that I don’t actually know what I’m doing.
Which I definitely do.
And now that I’ve cleared up any misunderstandings there, I’m not ashamed to say that I went to #intel-3d to complain that Dr. Render wasn’t giving me any useful draw output for most of the draws under IRIS. If even zink can get some pixels out of a draw, then a more compliant driver like IRIS shouldn’t be having issues here.
I wasn’t wrong.
It turns out that the Heaven benchmark is buggy and expects the D3D semantics for dual blending; mesa knows about this and lets drivers enable a workaround if they need it, specifically dual_color_blend_by_location=true, which informs the driver that it needs to adjust the Location and Index of gl_FragData[1] from D3D semantics to OpenGL/Vulkan.
As usual, the folks at Intel with their encyclopedic knowledge were quick to point out the exact problem, which then just left me with the relatively simple tasks of:
The result is not that interesting, but here it is anyway:
static bool
lower_dual_blend(nir_shader *shader)
{
bool progress = false;
nir_variable *var = nir_find_variable_with_location(shader, nir_var_shader_out, FRAG_RESULT_DATA1);
if (var) {
var->data.location = FRAG_RESULT_DATA0;
var->data.index = 1;
progress = true;
}
nir_shader_preserve_all_metadata(shader);
return progress;
}
In short, D3D expects to blend two outputs based on their locations, but in Vulkan and OpenGL, the blending is based on index. So here, I’ve just changed the location of gl_FragData[1] to match gl_FragData[0] and then incremented the index, because fragment outputs identified with an Index of zero are directed to the first input of the blending unit associated with the corresponding Location, and outputs identified with an Index of one are directed to the second input of the corresponding blending unit.
And now nice things can be had:
Tune in tomorrow when I strap zink to a rocket and begin counting down to blastoff.
September 21, 2020
In Vulkan, a pipeline object is bound to the graphics pipeline for a given command buffer when a draw is about to take place. This pipeline object contains information about the draw state, and any time that state changes, a different pipeline object must be created/bound.
This is expensive.
Some time ago, Antonio Caggiano did some work to cache pipeline objects, which lets zink reuse them once they’re created. This was great, because creating Vulkan objects is very costly, and we want to always be reusing objects whenever possible.
Unfortunately, the core Vulkan spec has the number of viewports and scissor regions as part of the pipeline state, which means that any time the number of regions changes (viewport and scissor region counts are always the same for our purposes), we need a new pipeline.
VK_EXT_extended_dynamic_state adds functionality to avoid this performance issue. When supported, the pipeline object is created with zero as the count of viewport and scissor regions, and then vkCmdSetViewportWithCountEXT and vkCmdSetScissorWithCountEXT can be called just before draw to ram these state updates into the command buffer without ever needing a different pipeline object.
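A minimal sketch of the two paths, assuming the WithCount entry points have been loaded via GET_PROC_ADDR as in the diff earlier on this page; this is illustrative rather than the actual zink draw code:

```c
/* Illustrative sketch only: with the extension, viewport/scissor counts are
 * supplied at record time instead of being baked into the pipeline object.
 */
#include <stdbool.h>
#include <vulkan/vulkan.h>

static void
emit_viewport_scissor(VkCommandBuffer cmdbuf, bool have_dynamic_state,
                      uint32_t count,
                      const VkViewport *viewports, const VkRect2D *scissors)
{
   if (have_dynamic_state) {
      /* counts are dynamic: no new pipeline object when the count changes */
      vkCmdSetViewportWithCountEXT(cmdbuf, count, viewports);
      vkCmdSetScissorWithCountEXT(cmdbuf, count, scissors);
   } else {
      /* counts are part of the pipeline state: any change means a new pipeline */
      vkCmdSetViewport(cmdbuf, 0, count, viewports);
      vkCmdSetScissor(cmdbuf, 0, count, scissors);
   }
}
```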
Longer posts later in the week; I’m in the middle of a construction zone for the next few days, and it’s less hospitable than I anticipated.
September 18, 2020
Once again, I ended up not blogging for most of the week. When this happens, there’s one of two possibilities: I’m either taking a break or I’m so deep into some code that I’ve forgotten about everything else in my life including sleep.
This time was the latter. I delved into the deepest parts of zink and discovered that the driver is, in fact, functioning only through a combination of sheer luck and a truly unbelievable amount of driver stalls that provide enough forced synchronization and slow things down enough that we don’t explode into a flaming mess every other frame.
Oops.
I’ve fixed all of the crazy things I found, and, in the process, made some sizable performance gains that I’m planning to spend a while blogging about in considerable depth next week.
And when I say sizable, I’m talking in the range of 50-100% fps gains.
But it’s Friday, and I’m sure nobody wants to just see numbers or benchmarks. Let’s get into something that’s interesting on a technical level.
Yes, samplers.
In Vulkan, samplers have a lot of rules to follow. Specifically, I’m going to be examining part of the spec that states “If a VkImageView is sampled with VK_FILTER_LINEAR as a result of this command, then the image view’s format features must contain VK_FORMAT_FEATURE_SAMPLED_IMAGE_FILTER_LINEAR_BIT”.
This is a problem for zink. Gallium gives us info about the sampler in the struct pipe_context::create_sampler_state hook, but the created sampler won’t actually be used until draw time. As a result, there’s no way to know which image is going to be sampled, and thus there’s no way to know what features the sampled image’s format flags will contain. This only becomes known at the time of draw.
The way I saw it, there were two options: either defer creating the sampler until draw time and choose between LINEAR and NEAREST based on the format features, or create both samplers any time LINEAR is passed and swizzle between them at draw time. In theory, the first option is probably more performant in the best case scenario where a sampler is only ever used with a single image, as it would then only ever create a single sampler object.
Unfortunately, this isn’t realistic. Just as an example, u_blitter creates a number of samplers up front, and then it also makes assumptions about filtering based on ideal operations which may not be in sync with the underlying Vulkan driver’s capabilities. So for these persistent samplers, the first option may initially allow the sampler to be created with LINEAR filtering, but it may later then be used for an image which can’t support it.
So I went with the second option. Now any time a LINEAR sampler is created by gallium, we’re actually creating both types so that the appropriate one can be used, ensuring that we can always comply with the spec and avoid any driver issues.
Hooray.
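To illustrate the chosen approach, here's a hypothetical sketch; the struct and field names are made up for this sketch:

```c
/* Hypothetical illustration: both samplers exist up front, and the right one
 * is picked at draw time based on the sampled image's format features.
 */
#include <vulkan/vulkan.h>

struct zink_sampler_pair {
   VkSampler linear;    /* used when the format supports linear filtering */
   VkSampler nearest;   /* fallback for formats without FILTER_LINEAR */
};

static VkSampler
pick_sampler(const struct zink_sampler_pair *pair,
             VkFormatFeatureFlags features)
{
   if (features & VK_FORMAT_FEATURE_SAMPLED_IMAGE_FILTER_LINEAR_BIT)
      return pair->linear;
   return pair->nearest;
}
```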
September 14, 2020
Let’s talk about ARB_shader_draw_parameters. Specifically, let’s look at gl_BaseVertex.
In OpenGL, this shader variable’s value depends on the parameters passed to the draw command, and the value is always zero if the command has no base vertex.
In Vulkan, the value here is only zero if the first vertex is zero.
The difference here means that for arrayed draws without base vertex parameters, GL always expects zero, and Vulkan expects first vertex.
Hooray.
The easiest solution here would be to just throw a shader key at the problem, producing variants of the shader for use with indexed vs non-indexed draws, and using NIR passes to modify the variables for the non-indexed case and zero the value. It’s quick, it’s easy, and it’s not especially great for performance since it requires compiling the shader multiple times and creating multiple pipeline objects.
This is where push constants come in handy once more.
Avid readers of the blog will recall the last time I used push constants was for TCS injection when I needed to generate my own TCS and have it read the default inner/outer tessellation levels out of a push constant.
Since then, I’ve created a struct to track the layout of the push constant:
struct zink_push_constant {
unsigned draw_mode_is_indexed;
float default_inner_level[2];
float default_outer_level[4];
};
Now just before draw, I update the push constant value for draw_mode_is_indexed:
if (ctx->gfx_stages[PIPE_SHADER_VERTEX]->nir->info.system_values_read & (1ull << SYSTEM_VALUE_BASE_VERTEX)) {
unsigned draw_mode_is_indexed = dinfo->index_size > 0;
vkCmdPushConstants(batch->cmdbuf, gfx_program->layout, VK_SHADER_STAGE_VERTEX_BIT,
offsetof(struct zink_push_constant, draw_mode_is_indexed), sizeof(unsigned),
&draw_mode_is_indexed);
}
And now the shader can be made aware of whether the draw mode is indexed.
Now comes the NIR, as is the case for most of this type of work.
static bool
lower_draw_params(nir_shader *shader)
{
if (shader->info.stage != MESA_SHADER_VERTEX)
return false;
if (!(shader->info.system_values_read & (1ull << SYSTEM_VALUE_BASE_VERTEX)))
return false;
return nir_shader_instructions_pass(shader, lower_draw_params_instr, nir_metadata_dominance, NULL);
}
This is the future, so I’m now using Eric Anholt’s recent helper function to skip past iterating over the shader’s function/blocks/instructions, instead just passing the lowering implementation as a parameter and letting the helper create the nir_builder for me.
static bool
lower_draw_params_instr(nir_builder *b, nir_instr *in, void *data)
{
if (in->type != nir_instr_type_intrinsic)
return false;
nir_intrinsic_instr *instr = nir_instr_as_intrinsic(in);
if (instr->intrinsic != nir_intrinsic_load_base_vertex)
return false;
I’m filtering out everything except for nir_intrinsic_load_base_vertex here, which is the instruction for loading gl_BaseVertex.
   b->cursor = nir_after_instr(&instr->instr);
I’m modifying instructions after this one, so I set the cursor after.
   nir_intrinsic_instr *load = nir_intrinsic_instr_create(b->shader, nir_intrinsic_load_push_constant);
   load->src[0] = nir_src_for_ssa(nir_imm_int(b, 0));
   nir_intrinsic_set_range(load, 4);
   load->num_components = 1;
   nir_ssa_dest_init(&load->instr, &load->dest, 1, 32, "draw_mode_is_indexed");
   nir_builder_instr_insert(b, &load->instr);
I’m loading the first 4 bytes of the push constant variable that I created according to my struct, which is the draw_mode_is_indexed value.
   nir_ssa_def *composite = nir_build_alu(b, nir_op_bcsel,
                                          nir_build_alu(b, nir_op_ieq, &load->dest.ssa, nir_imm_int(b, 1), NULL, NULL),
                                          &instr->dest.ssa,
                                          nir_imm_int(b, 0),
                                          NULL);
This adds a new ALU instruction of type bcsel, AKA the ternary operator (condition ? true : false). The condition here is another ALU of type ieq, AKA integer equals, and I’m testing whether the loaded push constant value is equal to 1. If true, this is an indexed draw, so I continue using the loaded gl_BaseVertex value. If false, this is not an indexed draw, so I need to use zero instead.
   nir_ssa_def_rewrite_uses_after(&instr->dest.ssa, nir_src_for_ssa(composite), composite->parent_instr);
With my bcsel composite gl_BaseVertex value constructed, I can now rewrite all subsequent uses of gl_BaseVertex in the shader to use the composite value, which will automatically swap between the Vulkan gl_BaseVertex and zero based on the value of the push constant without the need to rebuild the shader or make a new pipeline.
   return true;
}
And now the shader gets the expected value and everything works.
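For the record, wiring a pass like this up is a one-liner wherever the vertex shader’s NIR gets processed; a sketch, not the exact call site in zink (the `nir` variable name here is assumed):

/* Run the lowering while preparing the vertex shader; afterwards every read
 * of gl_BaseVertex effectively becomes
 *    draw_mode_is_indexed ? base_vertex : 0
 */
NIR_PASS_V(nir, lower_draw_params);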
It’s also worth pointing out here that gl_DrawID from the same extension has a similar problem: gallium doesn’t pass multidraws in full to the driver, instead iterating for each draw, which means that the shader value is never what’s expected either. I’ve employed a similar trick to jam the draw index into the push constant and read that back in the shader to get the expected value there too.
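A sketch of what that trick might look like; the field name, its placement in the struct, and the surrounding variable names are illustrative rather than the exact zink layout:

/* Hypothetical extension of the push constant struct with a draw_id slot. */
struct zink_push_constant {
   unsigned draw_mode_is_indexed;
   unsigned draw_id;
   float default_inner_level[2];
   float default_outer_level[4];
};

/* Inside the loop that handles each sub-draw of a multidraw, push the current
 * index so the shader's gl_DrawID lowering can read the expected value. */
unsigned draw_id = current_draw_index; /* illustrative variable name */
vkCmdPushConstants(batch->cmdbuf, gfx_program->layout, VK_SHADER_STAGE_VERTEX_BIT,
                   offsetof(struct zink_push_constant, draw_id), sizeof(unsigned),
                   &draw_id);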
Extensions.
September 10, 2020
In an ideal world, every frame your application draws would appear on the screen exactly on time. Sadly, as anyone living in the year 2020 CE can attest, this is far from an ideal world. Sometimes the scene gets more complicated and takes longer to draw than you estimated, and sometimes the OS scheduler just decides it has more important things to do than pay attention to you.
When this happens, for some applications, it would be best if you could just get the bits on the screen as fast as possible rather than wait for the next vsync. The Present extension for X11 has an option to let you do exactly this:
If 'options' contains PresentOptionAsync, and the 'target-msc'
is less than or equal to the current msc for 'window', then
the operation will be performed as soon as possible, not
necessarily waiting for the next vertical blank interval.
But you don't usually use Present directly; rather, Present is the mechanism GLX and Vulkan use to put bits on the screen. So, today I merged some code to Mesa to enable the corresponding features in those APIs, namely GLX_EXT_swap_control_tear and VK_PRESENT_MODE_FIFO_RELAXED_KHR. If all goes well these should be included in Mesa 21.0, with a backport to 20.2.x not out of the question. As the GLX extension name suggests, this can introduce some visual tearing when the buffer swap does come in late, but for fullscreen games or VR displays that can be an acceptable tradeoff in exchange for reduced stuttering.
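On the application side, opting in is small in both APIs. A sketch, assuming the extension and the present mode are actually advertised by the driver:

/* GLX: with GLX_EXT_swap_control_tear, a negative interval means
 * "sync to vblank, but tear instead of stuttering if the swap is late". */
glXSwapIntervalEXT(dpy, drawable, -1);

/* Vulkan: request the relaxed FIFO mode at swapchain creation, after
 * confirming it appears in vkGetPhysicalDeviceSurfacePresentModesKHR(). */
VkSwapchainCreateInfoKHR sci = {
   .sType = VK_STRUCTURE_TYPE_SWAPCHAIN_CREATE_INFO_KHR,
   .presentMode = VK_PRESENT_MODE_FIFO_RELAXED_KHR,
   /* ...the rest of the swapchain parameters... */
};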