October 30, 2020

(I just sent the below email to the mesa3d developer list).

Just to let everyone know, a month ago I submitted the 20.2 llvmpipe
driver for OpenGL 4.5 conformance under the SPI/X.org umbrella, and it
is now official[1].

Thanks to everyone who helped me drive this forward, and to all the
contributors both to llvmpipe and the general Mesa stack that enabled
this.
Big shout out to Roland Scheidegger for helping review the mountain of
patches I produced in this effort.

My next plans involve submitting lavapipe for Vulkan 1.0 conformance; it's at 99%
or so of the CTS, but there are line drawing, sampler accuracy and some snorm
blending failures I have to work out.
I also ran the OpenCL 3.0 conformance suite against clover/llvmpipe
yesterday and have some vague hopes of driving that to some sort of
conformance as well.
(for GL 4.6 only texture anisotropy is really missing, I've got
patches for SPIR-V support, in case someone was feeling adventurous).



October 29, 2020


I’ve got a lot of exciting stuff in the pipe now, but for today I’m just going to talk a bit about resource invalidation: what it is, when it happens, and why it’s important.

Let’s get started.

What is invalidation?

Resource invalidation occurs when the backing buffer of a resource is wholly replaced. Consider the following scenario under zink:

  • Have struct A { VkBuffer buffer; };
  • User calls glBufferData(target, size, data, usage), which stores data to A.buffer
  • User calls glBufferData(target, size, NULL, usage), which unsets the data from A.buffer

On a sane/competent driver, the second glBufferData call will trigger invalidation, which means that A.buffer will be replaced entirely, while A is still the driver resource used by Gallium to represent target.
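The idea can be sketched in a few lines of C. The names here are illustrative stand-ins, not zink's actual structures: the point is only that the outer resource object survives while its backing allocation is swapped wholesale.

```c
#include <assert.h>
#include <stdlib.h>

/* illustrative stand-ins for a driver resource and its backing buffer */
struct backing { unsigned char *mem; size_t size; };
struct resource { struct backing *buffer; };

/* invalidation: replace the backing allocation, keep the resource itself */
static void invalidate(struct resource *res)
{
   struct backing *fresh = malloc(sizeof(*fresh));
   fresh->size = res->buffer->size;
   fresh->mem = malloc(fresh->size);
   /* a real driver would defer destruction until the GPU is done with it */
   free(res->buffer->mem);
   free(res->buffer);
   res->buffer = fresh;
}
```

Any pointer to `struct resource` held elsewhere (by Gallium, in this analogy) remains valid across the swap, which is the whole point of the mechanism.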

When does invalidation occur?

Resource invalidation can occur in a number of scenarios, but the most common is when unsetting a buffer’s data, as in the above example. The other main case for it is replacing the data of a buffer that’s in use for another operation. In such a case, the backing buffer can be replaced to avoid forcing a sync in the command stream which will stall the application’s processing. There’s some other cases for this as well, like glInvalidateFramebuffer and glDiscardFramebufferEXT, but the primary usage that I’m interested in is buffers.

Why is invalidation important?

The main reason is performance. In the above scenario without invalidation, the second glBufferData call will write null to the whole buffer, which is going to be much more costly than just creating a new buffer.

That’s it

Now comes the slightly more interesting part: how does invalidation work in zink?

Currently, as of today’s mainline zink codebase, we have struct zink_resource to represent a resource for either a buffer or an image. One struct zink_resource represents exactly one VkBuffer or VkImage, and there’s some passable lifetime tracking that I’ve written to guarantee that these Vulkan objects persist through the various command buffers that they’re associated with.

Each struct zink_resource is, as is the way of Gallium drivers, also a struct pipe_resource, which is tracked by Gallium. Because of this, struct zink_resource objects themselves cannot be invalidated in order to avoid breaking Gallium, and instead only the inner Vulkan objects themselves can be replaced.

For this, I created struct zink_resource_object, which is an object that stores only the data that directly relates to the Vulkan objects, leaving struct zink_resource to track the states of these objects. Their lifetimes are separate, with struct zink_resource being bound to the Gallium tracker and struct zink_resource_object persisting for either the lifetime of struct zink_resource or its command buffer usage—whichever is longer.


The code for this mechanism isn’t super interesting since it’s basically just moving some parts around. Where it gets interesting is the exact mechanics of invalidation and how struct zink_resource_object can be injected into an in-use resource, so let’s dig into that a bit.

Here’s what the pipe_context::invalidate_resource hook looks like:

static void
zink_invalidate_resource(struct pipe_context *pctx, struct pipe_resource *pres)
{
   struct zink_context *ctx = zink_context(pctx);
   struct zink_resource *res = zink_resource(pres);
   struct zink_screen *screen = zink_screen(pctx->screen);

   if (pres->target != PIPE_BUFFER)
      return;

This only handles buffer resources, but extending it for images would likely be little to no extra work.

   if (res->valid_buffer_range.start > res->valid_buffer_range.end)
      return;

Zink tracks the valid data segments of its buffers. This conditional is used to check for an uninitialized buffer, i.e., one which contains no valid data. If a buffer has no data, it’s already invalidated, so there’s nothing to be done here.
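The "start greater than end means empty" encoding can be illustrated with a simplified stand-in for Mesa's util_range helpers (this is a sketch, not the real struct):

```c
#include <assert.h>
#include <stdbool.h>

/* simplified model of a valid-data range with the start > end "empty" encoding */
struct valid_range { unsigned start, end; };

static void range_set_empty(struct valid_range *r)
{
   r->start = ~0u; /* UINT_MAX: guaranteed greater than any end */
   r->end = 0;
}

static bool range_is_empty(const struct valid_range *r)
{
   return r->start > r->end;
}

/* grow the range to cover [start, end) */
static void range_add(struct valid_range *r, unsigned start, unsigned end)
{
   if (start < r->start)
      r->start = start;
   if (end > r->end)
      r->end = end;
}
```

With this encoding, "has the buffer ever held valid data?" is a single comparison, and resetting the range requires no allocation or memset.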


   util_range_set_empty(&res->valid_buffer_range);

Invalidating means the buffer will no longer have any valid data, so the range tracking can be reset here.

   if (!get_all_resource_usage(res))
      return;

If this resource isn’t currently in use, unsetting the valid range is enough to invalidate it, so it can just be returned right away with no extra work.

   struct zink_resource_object *old_obj = res->obj;
   struct zink_resource_object *new_obj = resource_object_create(screen, pres, NULL, NULL);
   if (!new_obj) {
      debug_printf("new backing resource alloc failed!");
      return;
   }

Here’s the old internal buffer object as well as a new one, created using the existing buffer as a template so that it’ll match.

   res->obj = new_obj;
   res->access_stage = 0;
   res->access = 0;

struct zink_resource is just a state tracker for the struct zink_resource_object object, so upon invalidate, the states are unset since this is effectively a brand new buffer.

   zink_resource_rebind(ctx, res);

This is the tricky part, and I’ll go into more detail about it below.

   zink_descriptor_set_refs_clear(&old_obj->desc_set_refs, old_obj);

If this resource was used in any cached descriptor sets, the references to those sets need to be invalidated so that the sets won’t be reused.

   zink_resource_object_reference(screen, &old_obj, NULL);

Finally, the old struct zink_resource_object is unrefed, which will ensure that it gets destroyed once its current command buffer has finished executing.
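The deferred-destruction behavior falls out of plain reference counting: the resource holds one reference, and each in-flight command buffer holds another. A toy model (illustrative only, not zink's actual code):

```c
#include <assert.h>
#include <stdbool.h>

/* toy refcount: one ref held by the resource, one per in-flight command buffer */
struct object { int refcount; bool destroyed; };

static void object_ref(struct object *obj)
{
   obj->refcount++;
}

static void object_unref(struct object *obj)
{
   if (--obj->refcount == 0)
      obj->destroyed = true; /* a real driver would destroy the VkBuffer here */
}
```

Unrefing from the resource while a command buffer still holds a reference leaves the object alive; destruction happens only when the command buffer completes and drops the last reference.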

Simple enough, but what about that zink_resource_rebind() call? Like I said, that’s where things get a little tricky, but because of how much time I spent on descriptor management, it ends up not being too bad.

This is what it looks like:

void
zink_resource_rebind(struct zink_context *ctx, struct zink_resource *res)
{
   assert(res->base.target == PIPE_BUFFER);

Again, this mechanism only handles buffer resources for now, and there’s only one place in the driver that calls it, but it never hurts to be careful.

   for (unsigned shader = 0; shader < PIPE_SHADER_TYPES; shader++) {
      if (!(res->bind_stages & BITFIELD64_BIT(shader)))
         continue;
      for (enum zink_descriptor_type type = 0; type < ZINK_DESCRIPTOR_TYPES; type++) {
         if (!(res->bind_history & BITFIELD64_BIT(type)))
            continue;

Something common to many Gallium drivers is this idea of “bind history”, which is where a resource will have bitflags set when it’s used for a certain type of binding. While other drivers have a lot more cases than zink does due to various factors, the only thing that needs to be checked for my purposes is the descriptor type (UBO, SSBO, sampler, shader image) across all the shader stages. If a given resource has the flags set here, this means it was at some point used as a descriptor of this type, so the current descriptor bindings need to be compared to see if there’s a match.
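The bind-history idea is just a bitfield keyed by descriptor type. A minimal sketch (the enum names and helpers here are illustrative, not zink's):

```c
#include <assert.h>
#include <stdbool.h>

/* illustrative descriptor types, mirroring the UBO/SSBO/sampler/image split */
enum desc_type { DESC_UBO, DESC_SSBO, DESC_SAMPLER, DESC_IMAGE };

struct tracker { unsigned bind_history; };

/* record that the resource was bound as this descriptor type */
static void record_bind(struct tracker *t, enum desc_type type)
{
   t->bind_history |= 1u << type;
}

/* check whether the resource was ever bound as this descriptor type */
static bool was_bound_as(const struct tracker *t, enum desc_type type)
{
   return (t->bind_history & (1u << type)) != 0;
}
```

The flags are sticky: once set, they let the rebind path skip whole descriptor types the resource has never been used as, without walking any per-slot state.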

         uint32_t usage = zink_program_get_descriptor_usage(ctx, shader, type);
         while (usage) {
            const int i = u_bit_scan(&usage);

This is a handy mechanism that returns the current descriptor usage of a shader as a bitfield. So for example, if a vertex shader uses UBOs in slots 0, 1, and 3, usage will be 11, and the loop will process i as 0, 1, and 3.
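The bit-scan loop from the example above can be demonstrated with a minimal stand-in for Mesa's u_bit_scan() (built on the compiler's find-first-set intrinsic):

```c
#include <assert.h>

/* minimal stand-in for u_bit_scan(): return and clear the lowest set bit */
static int bit_scan(unsigned *mask)
{
   int i = __builtin_ffs(*mask) - 1; /* index of lowest set bit */
   *mask &= *mask - 1;               /* clear that bit */
   return i;
}
```

For UBOs in slots 0, 1, and 3, the usage bitfield is 0b1011 = 11, and the loop visits exactly those slot indices, skipping the unused slot 2 for free.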

            struct zink_resource *cres = get_resource_for_descriptor(ctx, type, shader, i);
            if (res != cres)
               continue;

Now the slot of the descriptor type can be compared against the resource that’s being re-bound. If this resource is the one that’s currently bound to the specified slot of the specified descriptor type, then steps can be taken to perform additional operations necessary to successfully replace the backing storage for the resource, mimicking the same steps taken when initially binding the resource to the descriptor slot.

            switch (type) {
            case ZINK_DESCRIPTOR_TYPE_SSBO: {
               struct pipe_shader_buffer *ssbo = &ctx->ssbos[shader][i];
               util_range_add(&res->base, &res->valid_buffer_range, ssbo->buffer_offset,
                              ssbo->buffer_offset + ssbo->buffer_size);
               break;
            }

For SSBO descriptors, the only change needed is to add the bound region to the valid range. This region is passed to the shader, so even if it’s never written to, it might be, and so it can be considered a valid region.

            case ZINK_DESCRIPTOR_TYPE_SAMPLER_VIEW: {
               struct zink_sampler_view *sampler_view = zink_sampler_view(ctx->sampler_views[shader][i]);
               zink_descriptor_set_refs_clear(&sampler_view->desc_set_refs, sampler_view);
               zink_buffer_view_reference(ctx, &sampler_view->buffer_view, NULL);
               sampler_view->buffer_view = get_buffer_view(ctx, res, sampler_view->base.format,
                                                           sampler_view->base.u.buf.offset, sampler_view->base.u.buf.size);
               break;
            }

Sampler descriptors require a new VkBufferView be created since the previous one is no longer valid. Again, the references for the existing bufferview need to be invalidated now since that descriptor set can no longer be reused from the cache, and then the new VkBufferView is set after unrefing the old one.

            case ZINK_DESCRIPTOR_TYPE_IMAGE: {
               struct zink_image_view *image_view = &ctx->image_views[shader][i];
               zink_descriptor_set_refs_clear(&image_view->desc_set_refs, image_view);
               zink_buffer_view_reference(ctx, &image_view->buffer_view, NULL);
               image_view->buffer_view = get_buffer_view(ctx, res, image_view->base.format,
                                                         image_view->base.u.buf.offset, image_view->base.u.buf.size);
               util_range_add(&res->base, &res->valid_buffer_range, image_view->base.u.buf.offset,
                              image_view->base.u.buf.offset + image_view->base.u.buf.size);
               break;
            }
            }

Images are nearly identical to the sampler case, the difference being that while samplers are read-only like UBOs (and therefore reach this point already having valid buffer ranges set), images are more like SSBOs and can be written to. Thus the valid range must be set here like in the SSBO case.


Eagle-eyed readers will note that I’ve omitted a UBO case, and this is because there’s nothing extra to be done there. UBOs will already have their valid range set and don’t need a VkBufferView.


            invalidate_descriptor_state(ctx, shader, type);

Finally, the incremental descriptor state hash for this shader stage and descriptor type is invalidated. It’ll be recalculated normally upon the next draw or compute operation, so this is a quick zero-setting operation.


That’s everything there is to know about the current state of resource invalidation in zink!

October 28, 2020

There's been some recent discussion about whether the X server is abandonware. As the person arguably most responsible for its care and feeding over the last 15 years or so, I feel like I have something to say about that.

The thing about being the maintainer of a public-facing project for nearly the whole of your professional career is it's difficult to separate your own story from the project. So I'm not going to try to be dispassionate, here. I started working on X precisely because free software had given me options and capabilities that really matter, and I feel privileged to be able to give that back. I can't talk about that without caring about it.

So here's the thing: X works extremely well for what it is, but what it is is deeply flawed. There's no shame in that, it's 33 years old and still relevant, I wish more software worked so well on that kind of timeframe. But using it to drive your display hardware and multiplex your input devices is choosing to make your life worse.

It is, however, uniquely well suited to a very long life as an application compatibility layer. Though the code happens to implement an unfortunate specification, the code itself is quite well structured, easy to hack on, and not far off from being easily embeddable.

The issue, then, is how to get there. And I don't have any real desire to get there while still pretending that the xfree86 hardware-backed server code is a real thing. Sorry, I guess, but I've worked on xfree86-derived servers for very nearly as long as XFree86-the-project existed, and I am completely burnt out on that on its own merits, let alone doing that and also being release manager and reviewer of last resort. You can only apply so much thrust to the pig before you question why you're trying to make it fly at all.

So, is Xorg abandoned? To the extent that that means using it to actually control the display, and not just keep X apps running, I'd say yes. But xserver is more than xfree86. Xwayland, Xwin, Xephyr, Xvnc, Xvfb: these are projects with real value that we should not give up. A better way to say it is that we can finally abandon xfree86.

And if that sounds like a world you'd like to see, please, come talk to us, let's make it happen. I'd be absolutely thrilled to see someone take this on, and I'm happy to be your guide through the server internals.

October 24, 2020

Never Seen Before

A rare Saturday post because I spent so much time this week intending to blog and then somehow not getting around to it. Let’s get to the status updates, and then I’m going to dive into the more interesting of the things I worked on over the past few days.

Zink has just hit another big milestone that I’ve just invented: as of now, my branch is passing 97% of piglit tests up through GL 4.6 and ES 3.2, and it’s a huge improvement from earlier in the week when I was only at around 92%. That’s just over 1000 failure cases remaining out of ~41,000 tests. For perspective, a table.

               IRIS    zink-mainline   zink-wip
  Passed Tests 43508   21225           40190
  Total Tests  43785   22296           41395
  Pass Rate    99.4%   95.2%           97.1%

As always, I happen to be running on Intel hardware, so IRIS and ANV are my reference points.

It’s important to note here that I’m running piglit tests, and this is very different from CTS; put another way, I may be passing over 97% of the test cases I’m running, but that doesn’t mean that zink is conformant for any versions of GL or ES, which may not actually be possible at present (without huge amounts of awkward hacks) given the persistent issues zink has with provoking vertex handling. I expect this situation to change in the future through the addition of more Vulkan extensions, but for now I’m just accepting that there’s some areas where zink is going to misrender stuff.

What Changed?

The biggest change that boosted the zink-wip pass rate was my fixing 64bit vertex attributes, which in total had been accounting for ~2000 test failures.

Vertex attributes, as we all know since we’re all experts in the graphics field, are the inputs for vertex shaders, and the data types for these inputs can vary just like C data types. In particular, with GL 4.1, ARB_vertex_attrib_64bit became a thing, which allows 64bit values to be passed as inputs here.

Once again, this is a problem for zink.

It comes down to the difference between GL’s implicit handling methodology and Vulkan’s explicit handling methodology. Consider the case of a dvec4 data type. Conceptually, this is a data type of 4 x 64-bit values, requiring 32 bytes of storage. A vec4 uses 16 bytes of storage, and this equates to a single “slot” or “location” within the shader inputs, as everything there is vec4-aligned. This means that, by simple arithmetic, a dvec4 requires two slots for its storage, one for the first two members, and another for the second two, each consuming a single 16-byte slot.
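The slot arithmetic is simple enough to sketch in a helper (a hypothetical function for illustration, not part of any driver):

```c
#include <assert.h>

/* each input location slot holds 16 bytes (one vec4);
 * count how many whole slots an attribute of the given shape needs */
static unsigned slots_for_attrib(unsigned components, unsigned bit_size)
{
   unsigned bytes = components * bit_size / 8;
   return (bytes + 15) / 16; /* round up to whole 16-byte slots */
}
```

A vec4 (4 x 32-bit) and a dvec2 (2 x 64-bit) each fit a single slot, while dvec3 and dvec4 spill into two, which is exactly the case the NIR pass below has to handle.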

When loading a dvec4 in GL(SL), a single variable with the first location slot is used, and the driver will automatically use the second slot when loading the second half of the value.

When loading a dvec4 in (SPIR)Vulkan, two variables with consecutive, explicit location slots must be used, and the driver will load exactly the input location specified.

This difference requires that for any dvec3 or dvec4 vertex input in zink, the value and also the load have to be split along the vec4 boundary for things to work.

Gallium already performs this split on the API side, allowing zink to already be correctly setting things up in the VkPipeline creation, so I wrote a NIR pass to fix things on the shader side.

Shader Rewriting

Yes, it’s been at least a week since I last wrote about a NIR pass, so it’s past time that I got back into that.

Going into this, the idea here is to perform the following operations within the vertex shader:

  • for the input variable (hereafter A), find the deref instruction (hereafter A_deref); deref is used to access variables for input and output, and so it’s guaranteed that any 64bit input will first have a deref
  • create a second variable (hereafter B) of size double (for dvec3) or dvec2 (for dvec4) to represent the second half of A
  • alter A and A_deref’s type to dvec2; this aligns the variable (and its subsequent load) to the vec4 boundary, which enables it to be correctly read from a single location slot
  • create a second deref instruction for B (hereafter B_deref)
  • find the load_deref instruction for A_deref (hereafter A_load); a load_deref instruction is used to load data from a variable deref
  • alter the number of components for A_load to 2, matching its new dvec2 size
  • create a second load_deref instruction for B_deref which will load the remaining components (hereafter B_load)
  • construct a new composite (hereafter C_load) dvec3 or dvec4 by combining A_load + B_load to match the load of the original type of A
  • rewrite all the subsequent uses of A_load’s result to instead use C_load’s result

Simple, right?

Here we go.

static bool
lower_64bit_vertex_attribs_instr(nir_builder *b, nir_instr *instr, void *data)
{
   if (instr->type != nir_instr_type_deref)
      return false;
   nir_deref_instr *A_deref = nir_instr_as_deref(instr);
   if (A_deref->deref_type != nir_deref_type_var)
      return false;
   nir_variable *A = nir_deref_instr_get_variable(A_deref);
   if (A->data.mode != nir_var_shader_in)
      return false;
   if (!glsl_type_is_64bit(A->type) || !glsl_type_is_vector(A->type) || glsl_get_vector_elements(A->type) < 3)
      return false;

First, it’s necessary to filter out all the instructions that aren’t what should be rewritten. As above, only dvec3 and dvec4 types are targeted here (dmat* types are reduced to dvec types prior to this point), so anything other than an A_deref of variables with those types is ignored.

   /* create second variable for the split */
   nir_variable *B = nir_variable_clone(A, b->shader);
   /* split new variable into second slot */
   nir_shader_add_variable(b->shader, B);

B matches A except in its type and slot location, which will always be one greater than the slot location of A, so A can be cloned here to simplify the process of creating B.

   unsigned total_num_components = glsl_get_vector_elements(A->type);
   /* new variable is the second half of the dvec */
   B->type = glsl_vector_type(glsl_get_base_type(A->type), glsl_get_vector_elements(A->type) - 2);
   /* clamp original variable to a dvec2 */
   A_deref->type = A->type = glsl_vector_type(glsl_get_base_type(A->type), 2);

A and B need their types modified to not cross the vec4/slot boundary. A is always a dvec2, which has 2 components, and B will always be the remaining components.

   /* create deref instr for the new variable */
   b->cursor = nir_after_instr(instr);
   nir_deref_instr *B_deref = nir_build_deref_var(b, B);

Now B_deref has been added thanks to the nir_builder helper function which massively simplifies the process of setting up all the instruction parameters.

   nir_foreach_use_safe(A_deref_use, &A_deref->dest.ssa) {

NIR is SSA-based, and all uses of an SSA value are tracked for the purposes of ensuring that SSA values are truly assigned only once as well as ease of rewriting them in the case where a value needs to be modified, just as this pass is doing. This use-tracking comes along with a simple API for iterating over the uses.

      nir_instr *A_load_instr = A_deref_use->parent_instr;
      assert(A_load_instr->type == nir_instr_type_intrinsic &&
             nir_instr_as_intrinsic(A_load_instr)->intrinsic == nir_intrinsic_load_deref);

The only use of A_deref should be A_load, so really iterating over the A_deref uses is just a quick, easy way to get from there to the A_load instruction.

      /* this is a load instruction for the A_deref, and we need to split it into two instructions that we can
       * then zip back into a single ssa def */
      nir_intrinsic_instr *A_load = nir_instr_as_intrinsic(A_load_instr);
      /* clamp the first load to 2 64bit components */
      A_load->num_components = A_load->dest.ssa.num_components = 2;

A_load must be clamped to a single slot location to avoid crossing the vec4 boundary, so this is done by changing the number of components to 2, which matches the now-changed type of A.

      b->cursor = nir_after_instr(A_load_instr);
      /* this is the second load instruction for the second half of the dvec3/4 components */
      nir_intrinsic_instr *B_load = nir_intrinsic_instr_create(b->shader, nir_intrinsic_load_deref);
      B_load->src[0] = nir_src_for_ssa(&B_deref->dest.ssa);
      B_load->num_components = total_num_components - 2;
      nir_ssa_dest_init(&B_load->instr, &B_load->dest, B_load->num_components, 64, NULL);
      nir_builder_instr_insert(b, &B_load->instr);

This is B_load, which loads a number of components that matches the type of B. It’s inserted after A_load, though the before/after isn’t important in this case. The key is just that this instruction is added before the next one.

      nir_ssa_def *def[4];
      /* create a new dvec3/4 comprised of all the loaded components from both variables */
      def[0] = nir_vector_extract(b, &A_load->dest.ssa, nir_imm_int(b, 0));
      def[1] = nir_vector_extract(b, &A_load->dest.ssa, nir_imm_int(b, 1));
      def[2] = nir_vector_extract(b, &B_load->dest.ssa, nir_imm_int(b, 0));
      if (total_num_components == 4)
         def[3] = nir_vector_extract(b, &B_load->dest.ssa, nir_imm_int(b, 1));
      nir_ssa_def *C_load = nir_vec(b, def, total_num_components);

Now that A_load and B_load both exist and are loading the corrected number of components, these components can be extracted and reassembled into a larger type for use in the shader, specifically the original dvec3 or dvec4 which is being used. nir_vector_extract performs this extraction from a given instruction by taking an index of the value to extract, and then the composite value is created by passing the extracted components to nir_vec as an array.
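Outside of NIR, the split-then-recombine dance amounts to nothing more than this (a plain-C model of the dvec4 case, with illustrative names):

```c
#include <assert.h>

/* model of the two loads: A_load takes the first two components from slot N,
 * B_load takes the remaining two from slot N + 1 */
static void load_split(const double in[4], double first[2], double second[2])
{
   first[0] = in[0];
   first[1] = in[1];
   second[0] = in[2];
   second[1] = in[3];
}

/* model of nir_vec(): reassemble the extracted components into one dvec4 */
static void vec_combine(const double first[2], const double second[2], double out[4])
{
   out[0] = first[0];
   out[1] = first[1];
   out[2] = second[0];
   out[3] = second[1];
}
```

The recombined value is bit-identical to the original; all the complexity in the real pass comes from doing this rewrite inside SSA form, not from the data movement itself.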

      /* use the assembled dvec3/4 for all other uses of the load */
      nir_ssa_def_rewrite_uses_after(&A_load->dest.ssa, nir_src_for_ssa(C_load), C_load->parent_instr);

Since this is all SSA, the NIR helpers can be used to trivially rewrite all the uses of the loaded value from the original A_load instruction to instead use the assembled C_load value. It’s important that only the uses occurring after C_load has been created are rewritten (hence nir_ssa_def_rewrite_uses_after), however, or else the extracts feeding C_load would themselves be rewritten, breaking the shader entirely with an SSA-impossible (and generally impossible) C_load = vec(C_load, B_load) assignment.


   return true;

Progress has occurred, so the pass returns true to reflect that.

Now those large attributes are loaded according to Vulkan spec, and everything is great because, as expected, ANV has no bugs here.

October 16, 2020

Brain Hurty

It’s been a very long week for me, and I’m only just winding down now after dissecting and resolving a crazy fp64/ssbo bug. I’m too scrambled to jump into any code, so let’s just do another fluff day and review happenings in mainline mesa which relate to zink.

MRs Landed

Thanks to the tireless, eagle-eyed reviewing of Erik Faye-Lund, a ton of zink patches out of zink-wip have landed this week. Here’s an overview of that in backwards historical order:

Versions Bumped

Zink has grown tremendously over the past day or so of MRs landing, going from GLSL 1.30 and GL 3.0 to GLSL 3.30 and GL 3.3.

It can even run Blender now. Unless you’re running it under Wayland.

October 15, 2020

New versions of the KWinFT projects Wrapland, Disman, KWinFT and KDisplay are available now. They were released on the same day as Plasma 5.20 this week and offer new features and stability improvements.

Universal display management

The highlight this time is a completely redefined and reworked Disman that allows controlling display configurations not only in a KDE Plasma session with KWinFT but also with KWin, in other Wayland sessions with wlroots-based compositors, and in any X11 session.

You can use it with the included command-line tool dismanctl or together with the graphical frontend KDisplay. Read more about Disman's goals and technical details in the 5.20 beta announcement.

KWinFT projects you should use

Let's cut directly to the chase! As Disman and KDisplay are replacements for libkscreen and KScreen and KWinFT for KWin you will be interested in a comparison from a user point of view. What is better and what should you personally choose?

Disman and KDisplay for KDE Plasma

If you run a KDE Plasma desktop at the moment, you should definitely consider using Disman and replacing KScreen with KDisplay.

Disman comes with a more reliable overall design, moving internal logic into its D-Bus service and away from the frontends in KDisplay. Changes thereby become more atomic and bugs are less likely to emerge.

The UI of KDisplay is improved in comparison to KScreen and comfort functions have been added, as for example automatic selection of the best available mode.

There are still some caveats to this release that might prompt you to wait for the next one though:

  • Although multiple bugs were discovered and fixed during the beta phase, 5.20 is still the first release after a large redesign, so it is not unlikely that more bugs will be discovered later on.
  • If you want to use Disman and KDisplay in a legacy KWin Wayland session, note that I personally tested Disman only with KWinFT and sway. Maybe other people already use it with legacy KWin in a Wayland session, but the backend loaded in that case could very well have seen no QA at all yet. That being said, if you run a KWin X11 session, you will experience no such problems, since in that case the backend is the same as with KWinFT or any other X11 window manager.
  • If you require the KDisplay UI in another language, then this release is not yet for you. An online translation system has been set up now, but the first localizations will only become available with release 5.21.

So your mileage may vary but in most cases you should have a better experience with Disman and KDisplay.

And if you in general like to support new projects with ambitious goals and make use of most modern technologies you should definitely give it a try.

Disman and KDisplay with wlroots

Disman includes a backend for wlroots-based compositors. I'm proud of this achievement, since I believe we need more projects for the Linux desktop which aim to improve the Linux desktop in a holistic and collaborative spirit instead of only solving issues inside their own little habitat and project monoculture.

I tested the backend myself and even provided some patches to wlroots directly to improve its output-management capabilities, so using Disman with wlroots should be a decent experience. One catch though is that those patches require a new wlroots release. For now you can only get them by compiling wlroots from master or having your distribution of choice backport them.

In comparison with other options to manage your displays in wlroots, I believe Disman provides the most user-friendly solution, taking lots of work off your shoulders by automatically optimizing unknown new display setups and reloading data for already known setups.

Another prominent alternative for display management on wlroots is kanshi. I don't think kanshi is as easy to use and autonomously optimizing as Disman, but you might be able to configure displays more precisely with it. So you could prefer kanshi or Disman depending on your needs.

You can use Disman in wlroots sessions as a standalone system with its included command-line tool dismanctl and without KDisplay. This way you do not need to pull in as many KDE dependencies. But KDisplay together with Disman and wlroots also works very well and provides you an easy-to-use UI for adapting the display configuration according to your needs.

Disman and KDisplay on X11

You will like it. Try it out is all I can say. The RandR backend is tested thoroughly and while there is still room for some refactoring it should work very well already. This is also independent of what desktop environment or window manager you use. Install it and see for yourself.

That being said the following issues are known at the moment:

  • Only a global scale can be selected for all displays. Disman can not yet set different scales for different displays. This might become possible in the future but has certain drawbacks on its own.
  • The global scale in X11 is set by KDisplay alone. So you must install Disman with KDisplay in case you want to change the global scale without writing to the respective X11 config files manually.
  • At the time of writing there is a bug with Nvidia cards that leads to reduced refresh rates. But I expect this bug to be fixed very soon.

KWinFT vs KWin

I was talking a lot about Disman since it contains the most interesting changes this release and it can now be useful to many more people than before.

But you might also be interested in replacing KWin with KWinFT, so let's take a look at how KWinFT at this point in time compares to legacy KWin.

As it stands, KWinFT is still a drop-in replacement for KWin. You can install it to your system in place of KWin and use it together with a KDE Plasma session.


If you usually run an X11 session you should choose KWinFT without hesitation. It provides the same features as KWin and comes with an improved compositing pipeline that lowers latency and increases smoothness. There are also patches in the work to improve upon this further for multi-display setups. These patches might come to the 5.20 release via a bug fix release.

One point to keep in mind though is that the KWinFT project will concentrate in the future on improving the experience with Wayland. We won't maliciously regress the X11 experience, but if there is a tradeoff between improving the Wayland session and regressing X11, KWinFT will opt for the former. Whether such a situation unfolds at some point in time has yet to be seen, though. The X11 session might as well continue to work without any regressions for the next decade.


The situation is different if you want to run KWinFT as a Wayland compositor. I believe in regards to stability and robustness KWinFT is superior.

In particular this holds true for multi-display setups and display management. Although I worked mostly on Disman in the last two months, that work naturally spilled over to KWinFT too. KWinFT's output objects are now much more reasonably implemented. Besides that, there were many more bug fixes to output handling, which you can verify by looking at the merged changes for 5.20 in KWinFT and Wrapland.

If you have issues with your outputs in KWin definitely try out KWinFT, of course together with Disman.

Another area where you will probably have a better experience is the compositing itself. As on X11, the pipeline was reworked. For multi-display setups the patch that was linked above, and that might come in a bug fix release to 5.20, should improve the situation further.

On the other side KWin's Wayland session gained some much-awaited features with 5.20. According to the changelog, screencasting is in theory now possible, as is middle-click pasting and integration with Klipper, the clipboard management utility in the system tray.

I say "in theory" because I have not tested it myself and I expect it not to work without issues. That is for one because big feature additions like these regularly require later adjustments due to unforeseen behavior changes, but also because on a principled and strategic level I disagree with the KWin developers' general approach here.

The KWin codebase is rotten and needs a rigorous overhaul. Putting more features on top of that, features which often require massive internal changes just for the sake of crossing an item off a checklist, might make sense from the viewpoint of KDE users and KDE's marketing staff, but from a long-term engineering perspective it will only litter the code more and lead to more and more breakage over time. Most users won't notice that immediately, but when they do it is already too late.

On how to do it better, I really have to compliment the developers of Gnome's Mutter and wlroots.

Especially Mutter's Wayland session was in a bad state with some fundamental problems due to its history just a few years ago. But they committed to a very forward-thinking stance, ignoring the initial bad reception and not being tempted by immediate quick fixes that long-term would not hold up to the necessary standards. And nowadays Gnome Mutter's Wayland session is in way better shape. I want to highlight their transactional KMS project. This is a massive overhaul that is completely transparent to the common user, but enables the Mutter developers to build on a solid base in many ways in the future.

Still as said I have not tried KWin 5.20 myself and if the new features are important to you, give it a try and check for yourself if your experience confirms my concerns or if you are happy with what was added. Switching from KWin to KWinFT or the other way around is easy after all.

How to get the KWinFT projects

If you self-compile KWinFT it is very easy to switch from KWin to KWinFT. Just compile KWinFT to your system prefix. If you want more comfort through distribution packages you have to choose your distribution carefully.

Currently only Manjaro provides KWinFT packages officially. You can install all KWinFT projects on Manjaro easily through the packages with the same names.

Manjaro also offers git-variants of these packages, allowing you to run KWinFT projects directly from the master branch. This way you can participate in their development directly or give feedback on the latest changes.

If you run Arch Linux you can install all KWinFT projects from the AUR. The release packages are not yet updated to 5.20, but I assume this will happen pretty soon. They have a somewhat odd naming scheme: there are kwinft, wrapland-kwinft, disman-kwinft and kdisplay-kwinft. Git-variants of these packages are available too, but they follow the better naming scheme without a kwinft suffix. So for example the git package for disman-kwinft is just called disman-git. Naming nitpicks aside, huge thanks to the maintainers of these packages: abelian424 and Christoph (haagch).

A special place in my heart was conquered not long ago by Fedora. I switched over to it from KDE Neon due to problems on the latest update and the often outdated packages, and I am amazed by Fedora's technically versed and overall professional vision.

To install KWinFT projects on Fedora with its exceptional package manager DNF you can make use of this copr repository that includes their release versions. The packages are already updated to 5.20. Thanks to zawertun for providing these packages!

Fedora's KDE SIG group also took interest in the KWinFT projects and set up a preliminary copr for them. One of their packagers contacted me after the Beta release and I hope that I can help them to get it fully set up soon. I think Fedora's philosophy of pushing the Linux ecosystem by providing most recent packages and betting on emerging technologies will harmonize very well with the goals of the KWinFT project.

Busy, Busy, Busy

It’s been a busy week for me in personal stuff, so my blogging has been a bit slow. Here’s a brief summary of a few vaguely interesting things that I’ve been up to:

  • I’m now up to 37fps in the Heaven benchmark, which is an absolute unit of a number that brings me up to 68.5% of native GL
  • I discovered that people have been reporting bugs for zink-wip (great) without tagging me (not great), and I’m not getting notifications for it (also not great). gitlab is hard.
  • I added support for null UBOs in descriptor sets, which I think I was supposed to actually have done some time ago, but I hadn’t run into any tests that hit it. null checks all the way down
  • I fired up supertuxkart for one of the reports, which led me down a rabbithole of discovering that part of my compute shader implementation was broken for big shared data loads; I was, for whatever reason, attempting to do an OpAccessChain with a result type of uvec4 from a base type of array<uint>, which…it’s just not going to work ever, so that’s some strong work by past me
  • I played supertuxkart for a good 30 seconds and took a screenshot


There’s a weird flickering bug with the UI that I get in some levels that bears some looking into, but otherwise I get a steady 60/64/64 in zink vs 74/110/110 in IRIS (no idea what the 3 numbers mean, if anyone does, feel free to drop me a line).

That’s it for today. Hopefully a longer post tomorrow but no promises.

October 14, 2020

 A couple of years ago, we sandboxed thumbnailers using bubblewrap to avoid drive-by downloads taking advantage of thumbnailers with security issues.

 It's a great tool, and it's a tool that Flatpak relies upon to create its own sandboxes. But that also meant that we couldn't use it inside the Flatpak sandboxes themselves, and those aren't always as closed as they could be, to support legacy applications.

 We've finally implemented support for sandboxing thumbnailers within Flatpak, using the Spawn D-Bus interface (indirectly).

This should all land in GNOME 40, though it should already be possible to integrate it into your Flatpaks. Make sure to use the latest gnome-desktop development version, and that the flatpak-spawn utility is new enough in the runtime you're targeting (it's been updated in the runtimes #1, #2, #3, but it takes time to trickle down to GNOME versions). Example JSON snippets:

{
    "name": "flatpak-xdg-utils",
    "buildsystem": "meson",
    "sources": [
        {
            "type": "git",
            "url": "",
            "tag": "1.0.4"
        }
    ]
},
{
    "name": "gnome-desktop",
    "buildsystem": "meson",
    "config-opts": ["-Ddebug_tools=true", "-Dudev=disabled"],
    "sources": [
        {
            "type": "git",
            "url": ""
        }
    ]
}

(We also sped up GStreamer-based thumbnailers by allowing them to use a cache, and added profiling information to the thumbnail test tools, which could prove useful if you want to investigate performance or bugs in that area)

Edit: correct a link, thanks to the commenters for the notice

October 13, 2020

This week, I had a conversation with one of my coworkers about our subgroup/wave size heuristic and, in particular, whether or not control-flow divergence should be considered as part of the choice.  This led me down a fun path of looking into the statistics of control-flow divergence and the end result is somewhat surprising:  Once you get above about an 8-wide subgroup, the subgroup size doesn't matter.

Before I get into the details, let's talk nomenclature.  As you're likely aware, GPUs often execute code in groups of 1 or more invocations.  In D3D terminology, these are called waves.  In Vulkan and OpenGL terminology, these are called subgroups.  The two terms are interchangeable and, for the rest of this post, I'll use the Vulkan/OpenGL conventions.

Control-flow divergence

Before we dig into the statistics, let's talk for a minute about control-flow divergence.  This is mostly going to be a primer on SIMT execution and control-flow divergence in GPU architectures.  If you're already familiar, skip ahead to the next section.

Most modern GPUs use a Single Instruction Multiple Thread (SIMT) model.  This means that the graphics programmer writes a shader which, for instance, colors a single pixel (fragment/pixel shader) but what the shader compiler produces is a program which colors, say, 32 pixels using a vector instruction set architecture (ISA).  Each logical single-pixel execution of the shader is called an "invocation" while the physical vectorized execution of the shader which covers multiple pixels is called a wave or a subgroup.  The size of the subgroup (number of pixels colored by a single hardware execution) varies depending on your architecture.  On Intel, it can be 8, 16, or 32, on AMD, it's 32 or 64 and, on Nvidia (if my knowledge is accurate), it's always 32.

This conversion from logical single-pixel version of the shader to a physical multi-pixel version is often fairly straightforward.  The GPU registers each hold N values and the instructions provided by the GPU ISA operate on N pieces of data at a time.  If, for instance, you have an add in the logical shader, it's converted to an add provided by the hardware ISA which adds N values.  (This is, of course an over-simplification but it's sufficient for now.)  Sounds simple, right?

Where things get more complicated is when you have control-flow in your shader.  Suppose you have an if statement with both then and else sections.  What should we do when we hit that if statement?  The if condition will be N Boolean values.  If all of them are true or all of them are false, the answer is pretty simple: we do the then or the else respectively.  If you have a mix of true and false values, we have to execute both sides.  More specifically, the physical shader has to disable all of the invocations for which the condition is false and run the "then" side of the if statement.  Once that's complete, it has to re-enable those channels and disable the channels for which the condition is true and run the "else" side of the if statement.  Once that's complete, it re-enables all the channels and continues executing the code after the if statement.

When you start nesting if statements and throw loops into the mix, things get even more complicated.  Loop continues have to disable all those channels until the next iteration of the loop, loop breaks have to disable all those channels until the loop is entirely complete, and the physical shader has to figure out when there are no channels left and complete the loop.  This makes for some fun and interesting challenges for GPU compiler developers.  Also, believe it or not, everything I just said is a massive over-simplification. :-)

The point which most graphics developers need to understand and what's important for this blog post is that the physical shader has to execute every path taken by any invocation in the subgroup.  For loops, this means that it has to execute the loop enough times for the worst case in the subgroup.  This means that if you have the same work in both the then and else sides of an if statement, that work may get executed twice rather than once and you may be better off pulling it outside the if.  It also means that if you have something particularly expensive and you put it inside an if statement, that doesn't mean that you only pay for it when needed, it means you pay for it whenever any invocation in the subgroup needs it.
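To make the masking concrete, here is a toy model of predicated execution in Python (my own illustrative sketch; `exec_if` and its shape are invented, not any real ISA):

```python
# Toy SIMT model: execute both sides of an if, predicated by a lane mask.
def exec_if(cond, then_fn, else_fn, values):
    """Run then_fn on lanes where cond is True and else_fn where it is
    False; both sides execute whenever the mask is mixed (divergence)."""
    mask = [cond(v) for v in values]
    out = list(values)
    sides_executed = 0
    if any(mask):                      # at least one lane takes the "then" side
        sides_executed += 1
        out = [then_fn(v) if m else v for v, m in zip(out, mask)]
    if not all(mask):                  # at least one lane takes the "else" side
        sides_executed += 1
        out = [else_fn(v) if not m else v for v, m in zip(out, mask)]
    return out, sides_executed

# Uniform mask: only one side runs.
_, n = exec_if(lambda v: v < 10, lambda v: v + 1, lambda v: v - 1, [1, 2, 3, 4])
print(n)  # 1

# Divergent mask: the whole subgroup pays for both sides.
_, n = exec_if(lambda v: v < 2, lambda v: v + 1, lambda v: v - 1, [1, 2, 3, 4])
print(n)  # 2
```

Real hardware does this with execution masks and per-lane writeback rather than Python lists, but the cost structure is the same: a mixed mask means both paths run.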

Fun with statistics

At the end of the last section, I said that one of the problems with the SIMT model used by GPUs is that they end up having worst-case performance for the subgroup.  Every path through the shader which has to be executed for any invocation in the subgroup has to be taken by the shader as a whole.  The question that naturally arises is, "does a larger subgroup size make this worst-case behavior worse?"  Clearly, the naive answer is, "yes".  If you have a subgroup size of 1, you only execute exactly what's needed and if you have a subgroup size of 2 or more, you end up hitting this worst-case behavior.  If you go higher, the bad cases should be more likely, right?  Yes, but maybe not quite like you think.

This is one of those cases where statistics can be surprising.  Let's say you have an if statement with a boolean condition b.  That condition is actually a vector (b1, b2, b3, ..., bN) and if any two of those vector elements differ, we pay the cost of both paths.  Assuming that the conditions are independent identically distributed (IID) random variables, the probability of the entire vector being true is P(all(bi = true)) = P(b1 = true) * P(b2 = true) * ... * P(bN = true) = P(bi = true)^N where N is the size of the subgroup.  Therefore, the probability of having uniform control-flow is P(bi = true)^N + P(bi = false)^N.  The probability of non-uniform control-flow, on the other hand, is 1 - P(bi = true)^N - P(bi = false)^N.

Before we go further with the math, let's put some solid numbers on it.  Let's say we have a subgroup size of 8 (the smallest Intel can do) and let's say that our input data is a series of coin flips where bi is "flip i was heads".  Then P(bi = true) = P(bi = false) = 1/2.  Using the math in the previous paragraph, P(uniform) = P(bi = true)^8 + P(bi = false)^8 = 1/128.  This means that there is only a 1:128 chance that you'll get uniform control-flow and a 127:128 chance that you'll end up taking both paths of your if statement.  If we increase the subgroup size to 64 (the maximum among AMD, Intel, and Nvidia), you get a 1:2^63 chance of having uniform control-flow and a (2^63-1):2^63 chance of executing both halves.  If we assume that the shader takes T time units when control-flow is uniform and 2T time units when control-flow is non-uniform, then the amortized cost of the shader for a subgroup size of 8 is 1/128 * T + 127/128 * 2T = 255/128 T and, by a similar calculation, the cost of a shader with a subgroup size of 64 is (2^64 - 1)/2^63 T.  Both of those are within rounding error of 2T and the added cost of using the massively wider subgroup size is less than 1%.  Playing with the statistics a bit, the following chart shows the probability of divergence vs. the subgroup size for various choices of P(bi = true):

One thing to immediately notice is that because we're only concerned about the probability of divergence and not of the two halves of the if independently, the graph is symmetric (p=0.9 and p=0.1 are the same).  Second, and the point I was trying to make with all of the math above, is that until your probability gets pretty extreme (> 90%) the probability of divergence is reasonably high at any subgroup size.  From the perspective of a compiler with no knowledge of the input data, we have to assume every if condition is a 50/50 chance at which point we can basically assume it will always diverge.
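The numbers above are easy to reproduce with a few lines of Python (an illustrative sketch of the formulas, not anything from a driver):

```python
# Probability that an IID if condition diverges across an N-wide subgroup.
def p_divergent(p, n):
    return 1 - p**n - (1 - p)**n

# Coin-flip condition at subgroup size 8: uniform with probability 1/128.
print(1 - p_divergent(0.5, 8))   # 0.0078125 == 1/128

# Amortized cost in units of T (uniform costs T, divergent costs 2T).
def amortized_cost(p, n):
    return (1 - p_divergent(p, n)) * 1 + p_divergent(p, n) * 2

print(amortized_cost(0.5, 8))    # 1.9921875 == 255/128
print(amortized_cost(0.5, 64))   # within rounding error of 2
```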

Instead of only considering divergence, let's take a quick look at another case.  Let's say that you have a one-sided if statement (no else) that is expensive but rare.  To put numbers on it, let's say the probability of the if statement being taken is 1/16 for any given invocation.  Then P(taken) = P(any(bi = true)) = 1 - P(all(bi = false)) = 1 - P(bi = false)^N = 1 - (15/16)^N.  This works out to about 0.4 for a subgroup size of 8, 0.65 for 16, 0.87 for 32, and 0.98 for 64.  The following chart shows what happens if we play around with the probabilities of our if condition a bit more:

As we saw with the earlier divergence plot, even events with a fairly low probability (10%) are fairly likely to happen even with a subgroup size of 8 (57%) and are even more likely the higher the subgroup size goes.  Again, from the perspective of a compiler with no knowledge of the data trying to make heuristic decisions, it looks like "ifs always happen" is a reasonable assumption.  However, if we have something expensive like a texture instruction that we can easily move into an if statement, we may as well.  There are no guarantees, but if the probability of that if statement is low enough, we might be able to avoid it at least some of the time.
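The one-sided case can be sanity-checked the same way (again just a sketch of the formula above):

```python
# Probability that at least one of N IID lanes takes a one-sided if.
def p_taken(p, n):
    return 1 - (1 - p)**n

# p = 1/16 per invocation, as in the example above:
for n in (8, 16, 32, 64):
    print(n, round(p_taken(1 / 16, n), 2))   # ~0.4, 0.64, 0.87, 0.98

# Even a 10% event is likely at subgroup size 8:
print(round(p_taken(0.1, 8), 2))             # 0.57
```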

Statistical independence

A keen statistical eye may have caught a subtle statement I made very early on in the previous section:

 Assuming that the conditions are independent identically distributed (IID) random variables...

While less statistically minded readers may have glossed over this as meaningless math jargon, it's actually a very important assumption.  Let's take a minute to break it down.  A random variable in statistics is just an event.  In our case, it's something like "the if condition was true".  To say that a set of random variables is identically distributed means that they have the same underlying probabilities.  Two coin tosses, for instance, are identically distributed, while the distributions of "coin came up heads" and "die came up 6" are very different.  When combining random variables, we have to be careful to ensure that we're not mixing apples and oranges.  All of the analysis above was looking at the evaluation of a boolean in the same if condition but across different subgroup invocations.  These should be identically distributed.

The remaining word that's of critical importance in the IID assumption is "independent".  Two random variables are said to be independent if they have no effect on one another or, to be more precise, knowing the value of one tells you nothing whatsoever about the value of the other.  Random variables which are not independent are said to be "correlated".  One example of random variables which are very much not independent would be housing prices in a neighborhood because the first thing home appraisers look at to determine the value of a house is the value of other houses in the same area that have sold recently.  In my computations above, I used the rule that P(X and Y) = P(X) * P(Y) but this only holds if X and Y are independent random variables.  If they're dependent, the statistics look very different.  This raises an obvious question:  Are if conditions statistically independent across a subgroup?  The short answer is "no".

How does this correlation and lack of independence (those are the same) affect the statistics?  If two events X and Y are negatively correlated then P(X and Y) < P(X) * P(Y) and if two events are positively correlated then P(X and Y) > P(X) * P(Y).  When it comes to if conditions across a subgroup, most correlations that matter are positive.  Going back to our statistics calculations, the probability of an if condition diverging is 1 - P(all(bi = true)) - P(all(bi = false)) and P(all(bi = true)) = P(b1 = true and b2 = true and... bN = true).  So, if the data is positively correlated, we get P(all(bi = true)) > P(bi = true)^N and P(divergent) = 1 - P(all(bi = true)) - P(all(bi = false)) < 1 - P(bi = true)^N - P(bi = false)^N.  So correlation for us typically reduces the probability of divergence.  This is a good thing because divergence is expensive.  How much does it reduce the probability of divergence?  That's hard to tell without deep knowledge of the data but there are a few easy cases to analyze.

One particular example of dependence that comes up all the time is uniform values.  Many values passed into a shader are the same for all invocations within a draw call or for all pixels within a group of primitives.  Sometimes the compiler is privy to this information (if it comes from a uniform or constant buffer, for instance) but often it isn't.  It's fairly common for apps to pass some bit of data as a vertex attribute which, even though it's specified per-vertex, is actually the same for all of them.  If a bit of data is uniform (even if the compiler doesn't know it is), then any if conditions based on that data (or on a calculation using entirely uniform values) will be the same.  From a statistics perspective, this means that P(all(bi = true)) + P(all(bi = false)) = 1 and P(divergent) = 0.  From a shader execution perspective, this means that it will never diverge no matter the probability of the condition because our entire wave will evaluate the same value.

What about non-uniform values such as vertex positions, texture coordinates, and computed values?  In your average vertex, geometry, or tessellation shader, these are likely to be effectively independent.  Yes, there are patterns in the data such as common edges and some triangles being closer to others.  However, there is typically a lot of vertex data and the way that vertices get mapped to subgroups is random enough that these correlations between vertices aren't likely to show up in any meaningful way.  (I don't have a mathematical proof for this off-hand.)  When they're independent, all the statistics we did in the previous section apply directly.

With pixel/fragment shaders, on the other hand, things get more interesting.  Most GPUs rasterize pixels in groups of 2x2 pixels where each 2x2 pixel group comes from the same primitive.  Each subgroup is made up of a series of these 2x2 pixel groups so, if the subgroup size is 16, it's actually 4 groups of 2x2 pixels each.  Within a given 2x2 pixel group, the chances of a given value within the shader being the same for each pixel in that 2x2 group is quite high.  If we have a condition which is the same within each 2x2 pixel group then, from the perspective of divergence analysis, the subgroup size is effectively divided by 4.  As you can see in the earlier charts (for which I conveniently provided small subgroup sizes), the difference between a subgroup size of 2 and 4 is typically much larger than between 8 and 16.
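Under the idealized assumption that every 2x2 quad evaluates the condition identically, divergence statistics behave as if the subgroup were a quarter of its size. A quick sketch of that (my own illustration):

```python
# Divergence probability when 2x2 quads are perfectly correlated:
# an n-wide subgroup behaves like n // 4 independent "lanes".
def p_divergent(p, n):
    return 1 - p**n - (1 - p)**n

def p_divergent_quads(p, n):
    return p_divergent(p, n // 4)

# Coin-flip condition in a fragment shader:
print(p_divergent_quads(0.5, 8))             # 0.5    (effective size 2)
print(p_divergent_quads(0.5, 16))            # 0.875  (effective size 4)
print(round(p_divergent_quads(0.5, 32), 3))  # 0.992  (effective size 8)
```

Perfect quad correlation is the best case; real data falls somewhere between this and the fully independent numbers.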

Another common source of correlation in fragment shader data comes from the primitives themselves.  Even if they may be different between triangles, values are often the same or very tightly correlated between pixels in the same triangle.  This is sort of a super-set of the 2x2 pixel group issue we just covered.  This is important because this is a type of correlation that hardware has the ability to encourage.  For instance, hardware can choose to dispatch subgroups such that each subgroup only contains pixels from the same primitive.  Even if the hardware typically mixes primitives within the same subgroup, it can attempt to group things together to increase data correlation and reduce divergence.

Why bother with subgroups?

All this discussion of control-flow divergence might leave you wondering why we bother with subgroups at all.  Clearly, they're a pain.  They definitely are.  Oh, you have no idea...

But they also bring some significant advantages in that the parallelism allows us to get better throughput out of the hardware.  One obvious way this helps is that we can spend less hardware on instruction decoding (we only have to decode once for the whole wave) and put those gates into more floating-point arithmetic units.  Also, most processors are pipelined and, while they can start processing a new instruction each cycle, it takes several cycles before an instruction makes its way from the start of the pipeline to the end and its result can be used in a subsequent instruction.  If you have a lot of back-to-back dependent calculations in the shader, you can end up with lots of stalls where an instruction goes into the pipeline and the next instruction depends on its value, so you have to wait 10ish cycles for the previous instruction to complete.  On Intel, each SIMD32 instruction is actually four SIMD8 instructions that pipeline very nicely and so it's easier to keep the ALU busy.

Ok, so wider subgroups are good, right?  Go as wide as you can!  Well, yes and no.  Generally, there's a point of diminishing returns.  Is one instruction decoder per 32 invocations of ALU really that much more hardware than one per 64 invocations?  Probably not.  Generally, the subgroup size is determined based on what's required to keep the underlying floating-point arithmetic hardware full.  If you have 4 ALUs per execution unit and a pipeline depth of 10 cycles, then an 8-wide subgroup is going to have trouble keeping the ALU full.  A 32-wide subgroup, on the other hand, will keep it 80% full even with back-to-back dependent instructions so going 64-wide is pointless.
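The back-of-the-envelope occupancy math for that hypothetical 4-ALU, 10-cycle-deep machine looks like this (illustrative Python; the model ignores multiple threads in flight and other real-world details):

```python
# Fraction of ALU cycles kept busy by back-to-back dependent instructions:
# an n-wide subgroup occupies n / alus_per_eu issue cycles per instruction,
# but the next dependent instruction can't start until the pipeline drains.
def alu_occupancy(subgroup, alus_per_eu, pipe_depth):
    issue_cycles = subgroup / alus_per_eu
    return min(1.0, issue_cycles / pipe_depth)

print(alu_occupancy(8, 4, 10))    # 0.2 -- an 8-wide subgroup struggles
print(alu_occupancy(32, 4, 10))   # 0.8 -- 32-wide keeps the ALU 80% full
print(alu_occupancy(64, 4, 10))   # 1.0 -- 64-wide gains nothing more
```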

On Intel GPU hardware, there are additional considerations.  While most GPUs have a fixed subgroup size, ours is configurable and the subgroup size is chosen by the compiler.  What's less flexible for us is our register file.  We have a fixed register file size of 4KB regardless of the subgroup size so, depending on how many temporary values your shader uses, it may be difficult to compile it 16 or 32-wide and still fit everything in registers.  While wider programs generally yield better parallelism, the additional register pressure can easily negate any parallelism benefits.

There are also other issues such as cache utilization and thrashing but those are way out of scope for this blog post...

What does this all mean?

This topic came up this week in the context of tuning our subgroup size heuristic in the Intel Linux 3D drivers.  In particular, how should that heuristic reason about control-flow and divergence?  Are wider programs more expensive because they have the potential to diverge more?

After all the analysis above, the conclusion I've come to is that any given if condition falls roughly into one of three categories:

  1. Effectively uniform.  It never (or very rarely ever) diverges.  In this case, there is no difference between subgroup sizes because it never diverges.
  2. Random.  Since we have no knowledge about the data in the compiler, we have to assume that random if conditions are basically a coin flip every time.  Even with our smallest subgroup size of 8, this means it's going to diverge with a probability of 99.2%.  Even if you assume 2x2 subspans in fragment shaders are strongly correlated, divergence is still likely with a probability of 50% for SIMD8 shaders, 87.5% for SIMD16, and 99.2% for SIMD32.
  3. Random but very one-sided.  These conditions are the type where we can actually get serious statistical differences between the different subgroup sizes.  Unfortunately, we have no way of knowing when an if condition will be in this category so it's impossible to make heuristic decisions based on it.

Where does that leave our heuristic?  The only interesting case in the above three is random data in fragment shaders.  In our experience, the increased parallelism going from SIMD8 to SIMD16 is huge so it probably makes up for the increased divergence.  The parallelism increase from SIMD16 to SIMD32 isn't huge but the change in the probability of a random if diverging is pretty small (87.5% vs. 99.2%) so, all other things being equal, it's probably better to go SIMD32.

October 12, 2020

Jumping Right In

When last I left off, I’d cleared out 2/3 of my checklist for improving update_sampler_descriptors() performance:

handle_image_descriptor() was next on my list. As a result, I immediately dove right into an entirely different part of the flamegraph since I’d just been struck by a seemingly-obvious idea. Here’s the last graph:


Here’s the next step:


What changed?

Well, as I’m now caching descriptor sets across descriptor pools, it occurred to me that, assuming my descriptor state hashing mechanism is accurate, all the resources used in a given set must be identical. This means that all resources for a given type (e.g., UBO, SSBO, sampler, image) must be completely identical to previous uses across all shader stages. Extrapolating further, this also means that the way in which these resources are used must be identical, which means the pipeline barriers for access and image layouts must also be identical.

Which means they can be stored onto the struct zink_descriptor_set object and reused instead of being accumulated every time. This reuse completely eliminates add_transition() from using any CPU time (it’s the left-most block above update_sampler_descriptors() in the first graph), and it thus massively reduces overall time for descriptor updates.
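Reduced to its essence, the caching idea looks something like this (illustrative Python; zink's actual implementation is C and the names here are invented):

```python
# Cache expensive derived state (e.g. pipeline barriers) on the cached
# descriptor-set object the first time it is built, then reuse it on
# every subsequent bind with the same descriptor state hash.
class DescriptorSet:
    def __init__(self, barriers):
        self.barriers = barriers   # computed once, reused on every bind

_sets = {}

def get_descriptor_set(state_hash, compute_barriers):
    """compute_barriers (the expensive part) only runs on a cache miss."""
    ds = _sets.get(state_hash)
    if ds is None:
        ds = DescriptorSet(compute_barriers())
        _sets[state_hash] = ds
    return ds
```

The key observation from the text is that an accurate state hash makes the derived barriers a pure function of the hash, so a miss is the only time they need to be accumulated.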

This marks a notable landmark, as it’s the point at which update_descriptors() begins to use only ~50% of the total CPU time consumed in zink_draw_vbo(), with the other half going to the draw command where it should be.


At last my optimization-minded workflow returned to this function, and looking at the flamegraph again yielded the culprit. It’s not visible due to this being a screenshot, but the whole of the perf hog here was obvious, so let’s check out the function itself since it’s small. I think I explored part of this at one point in the distant past, possibly for ARB_texture_buffer_object, but refactoring has changed things up a bit:

static void
handle_image_descriptor(struct zink_screen *screen, struct zink_resource *res, enum zink_descriptor_type type, VkDescriptorType vktype, VkWriteDescriptorSet *wd,
                        VkImageLayout layout, unsigned *num_image_info, VkDescriptorImageInfo *image_info, struct zink_sampler_state *sampler,
                        VkBufferView *null_view, VkImageView imageview, bool do_set)

First, yes, there are a lot of parameters, including VkBufferView *null_view, which is a pointer to a stack array that’s initialized as containing VK_NULL_HANDLE. As VkDescriptorImageInfo must be initialized with a pointer to an array for texel buffers, it’s important that the stack variable used doesn’t go out of scope, so it has to be passed in like this or else this functionality can’t be broken out in this way.

    if (!res) {
        /* if we're hitting this assert often, we can probably just throw a junk buffer in since
         * the results of this codepath are undefined in ARB_texture_buffer_object spec
         */
        switch (vktype) {
        case VK_DESCRIPTOR_TYPE_UNIFORM_TEXEL_BUFFER:
        case VK_DESCRIPTOR_TYPE_STORAGE_TEXEL_BUFFER:
           wd->pTexelBufferView = null_view;
           break;
        case VK_DESCRIPTOR_TYPE_SAMPLED_IMAGE:
        case VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER:
        case VK_DESCRIPTOR_TYPE_STORAGE_IMAGE:
           image_info->imageLayout = VK_IMAGE_LAYOUT_UNDEFINED;
           image_info->imageView = VK_NULL_HANDLE;
           if (sampler)
              image_info->sampler = sampler->sampler[0];
           if (do_set)
              wd->pImageInfo = image_info;
           break;
        default:
           unreachable("unknown descriptor type");
        }

This is just handling for null shader inputs, which is permitted by various GL specs.

     } else if (res->base.target != PIPE_BUFFER) {
        assert(layout != VK_IMAGE_LAYOUT_UNDEFINED);
        image_info->imageLayout = layout;
        image_info->imageView = imageview;
        if (sampler) {
           VkFormatProperties props;
           vkGetPhysicalDeviceFormatProperties(screen->pdev, res->format, &props);

This vkGetPhysicalDeviceFormatProperties call is actually the entire cause of handle_image_descriptor() using any CPU time, at least on ANV. The lookup for the format is a significant bottleneck here, so it has to be removed.

           if ((res->optimial_tiling && props.optimalTilingFeatures & VK_FORMAT_FEATURE_SAMPLED_IMAGE_FILTER_LINEAR_BIT) ||
               (!res->optimial_tiling && props.linearTilingFeatures & VK_FORMAT_FEATURE_SAMPLED_IMAGE_FILTER_LINEAR_BIT))
              image_info->sampler = sampler->sampler[0];
           else
              image_info->sampler = sampler->sampler[1] ?: sampler->sampler[0];
        }
        if (do_set)
           wd->pImageInfo = image_info;
     }

Just for completeness, the remainder of this function is checking whether the device’s format features support the requested type of filtering (if linear), and then zink will fall back to nearest in other cases. Following this, do_set is true only for the base member of an image/sampler array of resources, and so this is the one that gets added into the descriptor set.

But now I’m again returning to vkGetPhysicalDeviceFormatProperties. Since this is using CPU, it needs to get out of the hotpath here in descriptor updating, but it does still need to be called. As such, I’ve thrown more of this onto the zink_screen object:

static void
populate_format_props(struct zink_screen *screen)
{
   for (unsigned i = 0; i < PIPE_FORMAT_COUNT; i++) {
      VkFormat format = zink_get_format(screen, i);
      if (!format)
         continue;
      vkGetPhysicalDeviceFormatProperties(screen->pdev, format, &screen->format_props[i]);
   }
}

Indeed, now instead of performing the fetch on every descriptor update, I’m just grabbing all the properties on driver init and then using the cached values throughout. Let’s see how this looks.


update_descriptors() is now using visibly less time than the draw command, though not by a huge amount. I’m also now up to an unstable 33fps.

Some Cleanups

At this point, it bears mentioning that I wasn’t entirely satisfied with the amount of CPU consumed by descriptor state hashing, so I ended up doing some pre-hashing here for samplers, as they have the largest state. At the beginning, the hashing looked like this:

In this case, each sampler descriptor hash covered the size of a VkDescriptorImageInfo, which is two 64-bit values and a 32-bit value, or 20 bytes of hashing per sampler. That ends up being a lot of hashing, and much of it is repeatedly hashing the same values.

Instead, I changed things around to do some pre-hashing:

In this way, I could have a single 32-bit value representing the sampler view that persisted for its lifetime, and a pair of 32-bit values for the sampler (since I still need to potentially toggle between linear and nearest filtering) that I can select between. This ends up being 8 bytes to hash, which is over 50% less. It’s not a huge change in the flamegraph, but it’s possibly an interesting factoid. Also, since the layouts will always be the same for these descriptors, that member can safely be omitted from the original sampler_view hash.
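As a sketch of the idea (hypothetical structs here, with FNV-1a standing in for mesa’s xxhash):

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* stand-in for mesa's XXH32: any deterministic 32-bit hash works for the sketch */
static uint32_t
fnv1a(const void *data, size_t size, uint32_t seed)
{
   const uint8_t *bytes = data;
   uint32_t hash = seed ^ 2166136261u;
   for (size_t i = 0; i < size; i++) {
      hash ^= bytes[i];
      hash *= 16777619u;
   }
   return hash;
}

/* mirrors VkDescriptorImageInfo: two 64-bit handles plus a 32-bit enum,
 * i.e. 20 bytes of state to hash per sampler, per update */
struct image_info {
   uint64_t sampler;
   uint64_t image_view;
   uint32_t layout;
};

/* before: rehash the full struct on every descriptor update */
static uint32_t
hash_full(const struct image_info *info)
{
   return fnv1a(info, sizeof(*info), 0);
}

/* after: persistent pre-hashed values computed once at object creation;
 * only 8 bytes get hashed per update */
struct prehashed {
   uint32_t view_hash;     /* computed when the sampler_view is created */
   uint32_t sampler_hash;  /* one of two values: linear or nearest */
};

static uint32_t
hash_prehashed(const struct prehashed *p)
{
   return fnv1a(p, sizeof(*p), 0);
}
```

The pre-hashed pair is stable across draws, so the per-update cost drops to hashing two 32-bit values instead of the whole VkDescriptorImageInfo.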

Set Usage

Another vaguely interesting tidbit is profiling zink’s usage of mesa’s set implementation, which is used to ensure that various objects are only added to a given batch a single time. Historically the pattern for use in zink has been something like:

if (!_mesa_set_search(set, value)) {
   _mesa_set_add(set, value);
   /* do some work on the newly-added value */
}

This is not ideal, as it ends up performing two lookups in the set for cases where the value isn’t already present. A much better practice is:

bool found = false;
_mesa_set_search_and_add(set, value, &found);
if (!found)
   /* do some work on the newly-added value */

In this way, the lookup is only done once, which ends up being huge for large sets.

For My Final Trick

I was now holding steady at 33fps, but there was a tiny bit more performance to squeeze out of descriptor updating when I began to analyze how much looping was being done. This is a general overview of all the loops in update_descriptors() for each type of descriptor at the time of my review:

  • loop for all shader stages
    • loop for all bindings in the shader
      • in sampler and image bindings, loop for all resources in the binding
  • loop for all resources in descriptor set
  • loop for all barriers to be applied in descriptor set

This was a lot of looping, and it was especially egregious in the final component of my refactored update_descriptors():

static bool
write_descriptors(struct zink_context *ctx, struct zink_descriptor_set *zds, unsigned num_wds, VkWriteDescriptorSet *wds,
                 unsigned num_resources, struct zink_descriptor_resource *resources, struct set *persistent,
                 bool is_compute, bool cache_hit)
{
   bool need_flush = false;
   struct zink_batch *batch = is_compute ? &ctx->compute_batch : zink_curr_batch(ctx);
   struct zink_screen *screen = zink_screen(ctx->base.screen);
   unsigned check_flush_id = is_compute ? 0 : ZINK_COMPUTE_BATCH_ID;
   for (int i = 0; i < num_resources; ++i) {
      assert(num_resources <= zds->pool->num_resources);

      struct zink_resource *res = resources[i].res;
      if (res) {
         need_flush |= zink_batch_reference_resource_rw(batch, res, resources[i].write) == check_flush_id;
         if (res->persistent_maps)
            _mesa_set_add(persistent, res);
      }
      /* if we got a cache hit, we have to verify that the cached set is still valid;
       * we store the vk resource to the set here to avoid a more complex and costly mechanism of maintaining a
       * hash table on every resource with the associated descriptor sets that then needs to be iterated through
       * whenever a resource is destroyed
       */
      assert(!cache_hit || zds->resources[i] == res);
      if (!cache_hit)
         zink_resource_desc_set_add(res, zds, i);
   }
   if (!cache_hit && num_wds)
      vkUpdateDescriptorSets(screen->dev, num_wds, wds, 0, NULL);
   for (int i = 0; zds->pool->num_descriptors && i < util_dynarray_num_elements(&zds->barriers, struct zink_descriptor_barrier); ++i) {
      struct zink_descriptor_barrier *barrier = util_dynarray_element(&zds->barriers, struct zink_descriptor_barrier, i);
      zink_resource_barrier(ctx, NULL, barrier->res,
                            barrier->layout, barrier->access, barrier->stage);
   }

   return need_flush;
}

This function iterates over all the resources in a descriptor set, tagging them for batch usage and persistent mapping, adding references for the descriptor set to the resource as I previously delved into. Then it iterates over the barriers and applies them.

But why was I iterating over all the resources and then over all the barriers when every resource will always have a barrier for the descriptor set, even if it ends up getting filtered out based on previous usage?

It just doesn’t make sense.

So I refactored this a bit, and now there’s only one loop:

static bool
write_descriptors(struct zink_context *ctx, struct zink_descriptor_set *zds, unsigned num_wds, VkWriteDescriptorSet *wds,
                 struct set *persistent, bool is_compute, bool cache_hit)
{
   bool need_flush = false;
   struct zink_batch *batch = is_compute ? &ctx->compute_batch : zink_curr_batch(ctx);
   struct zink_screen *screen = zink_screen(ctx->base.screen);
   unsigned check_flush_id = is_compute ? 0 : ZINK_COMPUTE_BATCH_ID;

   if (!cache_hit && num_wds)
      vkUpdateDescriptorSets(screen->dev, num_wds, wds, 0, NULL);
   for (int i = 0; zds->pool->num_descriptors && i < util_dynarray_num_elements(&zds->barriers, struct zink_descriptor_barrier); ++i) {
      struct zink_descriptor_barrier *barrier = util_dynarray_element(&zds->barriers, struct zink_descriptor_barrier, i);
      if (barrier->res->persistent_maps)
         _mesa_set_add(persistent, barrier->res);
      need_flush |= zink_batch_reference_resource_rw(batch, barrier->res, zink_resource_access_is_write(barrier->access)) == check_flush_id;
      zink_resource_barrier(ctx, NULL, barrier->res,
                            barrier->layout, barrier->access, barrier->stage);
   }

   return need_flush;
}

This actually has the side benefit of reducing the required looping even further, as barriers get merged based on access and stages, meaning that though a given set may use N resources across M stages, the looping here might be reduced to only N iterations rather than N * M since all the barriers may be consolidated.
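That consolidation can be sketched like so, with hypothetical plain-C types standing in for zink’s actual barrier structs and dynarray:

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

#define MAX_BARRIERS 64

struct barrier {
   const void *res;   /* the resource the barrier applies to */
   uint32_t layout;
   uint32_t access;
   uint32_t stages;   /* bitmask of pipeline stages */
};

struct barrier_list {
   struct barrier entries[MAX_BARRIERS];
   size_t count;
};

/* merge a new barrier into the list: if the same resource already has a
 * barrier with the same layout and access, just OR in the stage bits, so
 * N resources used across M stages can collapse to N barriers total */
static void
add_barrier(struct barrier_list *list, const void *res,
            uint32_t layout, uint32_t access, uint32_t stage)
{
   for (size_t i = 0; i < list->count; i++) {
      struct barrier *b = &list->entries[i];
      if (b->res == res && b->layout == layout && b->access == access) {
         b->stages |= stage;
         return;
      }
   }
   list->entries[list->count++] = (struct barrier){res, layout, access, stage};
}
```

With this merging in place, the final barrier-application loop only ever sees one entry per resource/layout/access combination.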

In Closing

Let’s check all the changes out in the flamegraph:


This last bit has shaved off another big chunk of CPU usage overall, bringing update_descriptors() from 11.4% to 9.32%. Descriptor state updating is down from 0.718% to 0.601% from the pre-hashing as well, though this wasn’t exactly a huge target to hit.

Just for nostalgia, here’s the starting point from just after I’d split the descriptor types into different sets and we all thought 27fps with descriptor set caching was a lot:


But now zink is up another 25% performance to a steady 34fps:


And I did it in only four blog posts.

For anyone interested, I’ve also put up a branch corresponding to the final flamegraph along with the perf data, which I’ve been using hotspot to view.

But Then, The Future

Looking forward, there’s still some easy work that can be done here.

For starters, I’d probably improve descriptor states a little such that I also had a flag for anytime the batch cycled. This would enable me to add batch-tracking for resources/samplers/sampler_views more reliably, only when it was actually necessary, instead of trying to add it every time, which ends up being a significant perf hit from all the lookups. I imagine that’d make a huge part of the remaining update_descriptors() usage disappear.


There’s also, as ever, the pipeline hashing, which can further be reduced by adding more dynamic state handling, which would remove more values from the pipeline state and thus reduce the amount of hashing required.


I’d probably investigate doing some resource caching to keep a bucket of destroyed resources around for faster reuse since there’s a fair amount of that going on.


Ultimately though, the CPU time in use by zink is unlikely to see any other huge decreases (unless I’m missing something especially obvious or clever) without more major architectural changes, which will end up being a bigger project that takes more than just a week of blog posting to finish. As such, I’ve once again turned my sights to unit test pass rates and related issues, since there’s still a lot of work to be done there.

I’ve fixed another 500ish piglit tests over the past few days, bringing zink up past 92% pass rate, and I’m hopeful I can get that number up even higher in the near future.

Stay tuned for more updates on all things zink and Mike.

October 09, 2020

Moar Descriptors

I talked about a lot of boring optimization stuff yesterday, exploring various ideas which, while they will eventually end up improving performance, didn’t yield immediate results.

Now it’s just about time to start getting to the payoff.

[flamegraph: pool_reuse.png]

Here’s a flamegraph of the starting point. Since yesterday’s progress of improving the descriptor cache a bit and adding context-based descriptor states to reduce hashing, I’ve now implemented an object to hold the VkDescriptorPool, enabling the pools themselves to be shared across programs for reuse, which deduplicates a considerable amount of memory. The first scene of the heaven benchmark creates a whopping 91 zink_gfx_program structs, each of which previously had their own UBO descriptor pool and sampler descriptor pool for 182 descriptor pools in total, each with 5000 descriptors in it. With this mechanism, that’s cut down to 9 descriptor pools in total which are shared across all the programs. Without even changing that maximum descriptor limit, I’m already up by another frame to 28fps, even if the flamegraph doesn’t look too different.
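The sharing mechanism can be sketched like this (hypothetical types; the real version keys pools off the program’s descriptor set layout):

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

#define MAX_POOLS 32

/* stand-in for a refcounted VkDescriptorPool wrapper object */
struct pool {
   uint32_t layout_key;   /* hash of the descriptor set layout */
   int refcount;
};

struct pool_cache {
   struct pool pools[MAX_POOLS];
   size_t count;
};

/* programs whose layouts match share one pool instead of each program
 * creating (and filling) its own */
static struct pool *
get_pool(struct pool_cache *cache, uint32_t layout_key)
{
   for (size_t i = 0; i < cache->count; i++) {
      if (cache->pools[i].layout_key == layout_key) {
         cache->pools[i].refcount++;
         return &cache->pools[i];
      }
   }
   struct pool *p = &cache->pools[cache->count++];
   p->layout_key = layout_key;
   p->refcount = 1;
   return p;
}
```

With 91 programs each wanting a UBO pool and a sampler pool but only a handful of distinct layouts between them, the 182 per-program pools collapse to a few shared ones.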

Moar Caches

I took a quick detour next over to the pipeline cache (that large-ish block directly to the right of update_descriptors in the flamegraph), which stores all the VkPipeline objects that get created during startup. Pipeline creation is extremely costly, so it’s crucial that it be avoided during runtime. Happily, the current caching infrastructure in zink is sufficient to meet that standard, and there are no pipelines created while the scene is playing out.

But I thought to myself: what about VkPipelineCache for startup time improvement while I’m here?

I had high hopes for this, and it was quick and easy to add in, but ultimately even with the cache implemented and working, I saw no benefit in any part of the benchmark.

That was fine, since what I was really shooting for was a result more like this:

[flamegraph: pipeline_hash.png]

The hierarchy of the previously-expensive pipeline hash usage has completely collapsed now, and it’s basically nonexistent. This was achieved through a series of five patches which:

  • moved the tessellation levels for TCS output out of the pipeline state since these have no relation and I don’t know why I put them there to begin with
  • used the bitmasks for vertex divisors and buffers to much more selectively hash the large (32) VkVertexInputBindingDivisorDescriptionEXT and VkVertexInputBindingDescription arrays in the pipeline state instead of always hashing the full array
  • also only hashed the vertex buffer state if we don’t have VK_EXT_extended_dynamic_state support, which lets that be removed from the pipeline creation altogether
  • for debug builds, which are the only builds I run, I changed the pipeline hash tables over to use the pre-hashed values directly since mesa has a pesky assert() that rehashes on every lookup, so this more accurately reflects release build performance

And I’m now up to a solid 29fps.
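The bitmask-driven selective hashing from the second of those patches can be sketched like so (hypothetical struct; `hash_bytes` is an FNV-1a stand-in for mesa’s xxhash):

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* stand-in 32-bit hash (mesa uses xxhash internally) */
static uint32_t
hash_bytes(const void *data, size_t size, uint32_t seed)
{
   const uint8_t *bytes = data;
   uint32_t hash = seed ^ 2166136261u;
   for (size_t i = 0; i < size; i++) {
      hash ^= bytes[i];
      hash *= 16777619u;
   }
   return hash;
}

/* hypothetical stand-in for VkVertexInputBindingDescription */
struct binding_desc {
   uint32_t binding;
   uint32_t stride;
   uint32_t divisor;
};

/* instead of hashing the full 32-element array every time, walk only the
 * bits set in the mask of actually-used bindings (u_bit_scan-style) */
static uint32_t
hash_used_bindings(const struct binding_desc descs[32], uint32_t used_mask)
{
   uint32_t hash = 0;
   while (used_mask) {
      int i = __builtin_ctz(used_mask);
      used_mask &= used_mask - 1;
      hash = hash_bytes(&descs[i], sizeof(descs[i]), hash);
   }
   return hash;
}
```

A typical draw uses only a couple of vertex buffers, so this hashes a handful of entries instead of all 32.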

Over To update_sampler_descriptors()

I left off yesterday with a list of targets to hit in this function, from left to right in the most recent flamegraph:

Looking higher up the chain for the add_transition() usage, it turns out that a huge chunk of this was actually the hash table rehashing itself every time it resized when new members were added (mesa hash tables start out with a very small maximum number of entries and then increase by a power of 2 every time). Since I always know ahead of time the maximum number of entries I’ll have in a given descriptor set, I put up a MR to let me pre-size the table, preventing any of this nonsense from taking up CPU time. The results were good:

[flamegraph: rehash.png]
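The effect of pre-sizing can be demonstrated with a toy open-addressing set (hypothetical code, not mesa’s actual implementation) that counts how many times it has to rebuild itself:

```c
#include <stdint.h>
#include <stdlib.h>
#include <assert.h>

/* toy set: starts at a given capacity, doubles (and rehashes everything)
 * whenever the load factor would exceed 50% */
struct toy_set {
   uint32_t *keys;    /* 0 = empty slot */
   size_t capacity;
   size_t count;
   int rehash_count;  /* how many times the table has been rebuilt */
};

static void
toy_set_init(struct toy_set *set, size_t initial_capacity)
{
   set->capacity = initial_capacity;
   set->keys = calloc(set->capacity, sizeof(uint32_t));
   set->count = 0;
   set->rehash_count = 0;
}

static void toy_set_add(struct toy_set *set, uint32_t key);

static void
toy_set_grow(struct toy_set *set)
{
   uint32_t *old_keys = set->keys;
   size_t old_capacity = set->capacity;
   set->capacity *= 2;
   set->keys = calloc(set->capacity, sizeof(uint32_t));
   set->count = 0;
   set->rehash_count++;
   for (size_t i = 0; i < old_capacity; i++)
      if (old_keys[i])
         toy_set_add(set, old_keys[i]);
   free(old_keys);
}

static void
toy_set_add(struct toy_set *set, uint32_t key)
{
   if ((set->count + 1) * 2 > set->capacity)
      toy_set_grow(set);
   size_t i = key % set->capacity;
   while (set->keys[i] && set->keys[i] != key)
      i = (i + 1) % set->capacity;
   if (!set->keys[i]) {
      set->keys[i] = key;
      set->count++;
   }
}
```

A table that starts tiny pays for several full rehashes on the way up; one pre-sized to the known maximum never rehashes at all.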

The entire add_transition hierarchy collapsed a bit, but there’s more to come. I immediately became distracted when I came to the realization that I’d actually misplaced a frame at some point and set to hunting it down.

Vulkan Experts

Anyone who said to themselves “you’re binding your descriptors before you’re emitting your pipeline barriers, thus starting and stopping your render passes repeatedly during each draw” as the answer to yesterday’s question about what I broke during refactoring was totally right, so bonus points to everyone out there who nailed it. This can actually be seen in the flamegraph as the tall stack above update_ubo_descriptors(), which is the block to the right of update_sampler_descriptors().

Oops.

[flamegraph: frame_retrieval.png]

Now the stack is a little to the right where it belongs, and I was now just barely touching 30fps, which is up about 10% from the start of today’s post.

Stay tuned next week, when I pull another 10% performance out of my magic hat and also fix RADV corruption.

October 08, 2020

Descriptors Once More

It’s just that kind of week.

When I left off in my last post, I’d just implemented a two-tiered cache system for managing descriptor sets which was objectively worse in performance than not doing any caching at all.


Next, I did some analysis of actual descriptor usage, and it turned out that the UBO churn was massive, while sampler descriptors were only changed occasionally. This is due to a mechanism in mesa involving a NIR pass which rewrites uniform data passed to the OpenGL context as UBO loads in the shader, compacting the data into a single buffer and allowing it to be more efficiently passed to the GPU. There’s a utility component u_upload_mgr for gallium-based drivers which allocates a large (~100k) buffer and then maps/writes to it at offsets to avoid needing to create a new buffer for this every time the uniform data changes.

The downside of u_upload_mgr, for the current state of zink, is that it means the hashed descriptor states are different for almost every single draw because while the UBO doesn’t actually change, the offset does, and this is necessarily part of the descriptor hash since zink is exclusively using VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER descriptors.

I needed to increase the efficiency of the cache to make it worthwhile, so I decided first to reduce the impact of a changed descriptor state hash during pre-draw updating. This way, even if the UBO descriptor state hash was changing every time, maybe it wouldn’t be so impactful.

Deep Cuts

What this amounted to was to split the giant, all-encompassing descriptor set, which included UBOs, samplers, SSBOs, and shader images, into separate sets such that each type of descriptor would be isolated from changes in the other descriptors between draws.

Thus, I now had four distinct descriptor pools for each program, and I was producing and binding up to four descriptor sets for every draw. I also changed the shader compiler a bit to always bind shader resources to the newly-split sets as well as created some dummy descriptor sets since it’s illegal to bind sets to a command buffer with non-sequential indices, but it was mostly easy work. It seemed like a great plan, or at least one that had a lot of potential for more optimizations based on it. As far as direct performance increases from the split, UBO descriptors would be constantly changing, but maybe…

Well, the patch is pretty big (8 files changed, 766 insertions(+), 450 deletions(-)), but in the end, I was still stuck around 23 fps.

Dynamic UBOs

With this work done, I decided to switch things up a bit and explore using VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC descriptors for the mesa-produced UBOs, as this would reduce both the hashing required to calculate the UBO descriptor state (offset no longer needs to be factored in, which is one fewer uint32_t to hash) as well as cache misses due to changing offsets.

Due to potential driver limitations, only these mesa-produced UBOs are dynamic now, otherwise zink might exceed the maxDescriptorSetUniformBuffersDynamic limit, but this was still more than enough.
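A sketch of why dynamic descriptors help the cache (hypothetical structs; in real Vulkan the offset is instead supplied at bind time via vkCmdBindDescriptorSets’ pDynamicOffsets parameter):

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* stand-in 32-bit hash over words */
static uint32_t
hash_u32(const uint32_t *vals, size_t n, uint32_t seed)
{
   uint32_t hash = seed ^ 2166136261u;
   for (size_t i = 0; i < n; i++) {
      hash ^= vals[i];
      hash *= 16777619u;
   }
   return hash;
}

struct ubo_state {
   uint32_t buffer_id;  /* stays the same: u_upload_mgr reuses one big buffer */
   uint32_t offset;     /* changes nearly every draw */
   uint32_t size;
};

/* plain UNIFORM_BUFFER descriptor: the offset is baked into the descriptor,
 * so it must be part of the state hash -> near-constant cache misses */
static uint32_t
hash_static_ubo(const struct ubo_state *s)
{
   uint32_t vals[3] = {s->buffer_id, s->offset, s->size};
   return hash_u32(vals, 3, 0);
}

/* UNIFORM_BUFFER_DYNAMIC descriptor: the offset is supplied at bind time,
 * so it drops out of the hash and consecutive draws hit the cache */
static uint32_t
hash_dynamic_ubo(const struct ubo_state *s)
{
   uint32_t vals[2] = {s->buffer_id, s->size};
   return hash_u32(vals, 2, 0);
}
```

Two consecutive draws against the same upload buffer at different offsets now produce identical descriptor state hashes.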


I was now at 27fps, the same as raw bucket allocation.

Hash Performance

I’m not going to talk about hash performance. Mesa uses xxhash internally, and I’m just using the tools that are available.

What I am going to talk about, however, is the amount of hashing and lookups that I was doing.

Let’s take a look at some flamegraphs for the scene that I was showing in my fps screenshots.


This is, among other things, a view of the juggernaut update_descriptors() function that I linked earlier in the week. At the time of splitting the descriptor sets, it’s over 50% of the driver’s pipe_context::draw_vbo hook, which is decidedly not great.

So I optimized harder.

The leftmost block just above update_descriptors is hashing that’s done to update the descriptor state for cache management. There wasn’t much point in recalculating the descriptor state hash on every draw since there’s plenty of draws where the states remain unchanged. To try and improve this, I moved to a context-based descriptor state tracker, where each pipe_context hook to change active descriptors would invalidate the corresponding descriptor state, and then update_descriptors() could just scan through all the states to see which ones needed to be recalculated.


The largest block above update_descriptors() on the right side is the new calculator function for descriptor states. It’s actually a bit more intensive than the old method, but again, this is much more easily optimized than the previous hashing which was scattered throughout a giant 400 line function.

Next, while I was in the area, I added even faster access to reusing descriptor sets. This is as simple as an array of sets on the program struct that can have their hash values directly compared to the current descriptor state that’s needed, avoiding lookups through potentially huge hash tables.
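A sketch of that fast path (hypothetical types):

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

#define LAST_SET_COUNT 8

/* stand-in for a cached descriptor set */
struct desc_set {
   uint32_t state_hash;
};

/* small per-program array of recently used sets: compare hashes directly
 * before falling back to the big hash-table lookup */
struct program_sets {
   struct desc_set *last_sets[LAST_SET_COUNT];
};

static struct desc_set *
find_recent_set(struct program_sets *prog, uint32_t state_hash)
{
   for (int i = 0; i < LAST_SET_COUNT; i++) {
      struct desc_set *zds = prog->last_sets[i];
      if (zds && zds->state_hash == state_hash)
         return zds;  /* hit: no hash-table traversal needed */
   }
   return NULL;       /* miss: fall back to the table lookup */
}
```

A fixed-size linear scan over a handful of pointers is far cheaper than probing a table that may hold hundreds of sets.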


Not much to see here since this isn’t really where any of the performance bottleneck was occurring.


Let’s skip ahead a bit. I finally refactored update_descriptors() into smaller functions to update and bind descriptor sets in a loop prior to applying barriers and issuing the eventual draw command, shaking up the graph quite a bit:


Clearly, updating the sampler descriptors (update_sampler_descriptors()) is taking a huge amount of time. The three large blocks above it are:

Each of these three had clear ways they could be optimized, and I’m going to speed through that and more in my next post.

But now, a challenge to all the Vulkan experts reading along. In this last section, I’ve briefly covered some refactoring work for descriptor updates.

What is the significant performance regression that I’ve introduced in the course of this refactoring?

It’s possible to determine the solution to my question without reading through any of the linked code, but you have the code available to you nonetheless.

Until next time.

October 07, 2020

I Skipped Bucket Day

I’m back, and I’m about to get even deeper into zink’s descriptor management. I figured everyone including me is well acquainted with bucket allocating, so I skipped that day and we can all just imagine what that post would’ve been like instead.

Let’s talk about caching and descriptor sets.

I talked about it before, I know, so here’s a brief reminder of where I left off:

Just a very normal cache mechanism. The problem, as I said previously, was that there was just way too much hashing going on, and so the performance ended up being worse than a dumb bucket allocator.

Not ideal.

But I kept turning the idea over in the back of my mind, and then I realized that part of the problem was in the upper-right block named move invalidated sets to invalidated set array. It ended up being the case that my resource tracking for descriptor sets was far too invasive; I had separate hash tables on every resource to track every set that a resource was attached to at all times, and I was basically spending all my time modifying those hash tables, not even the actual descriptor set caching.

So then I thought: well, what if I just don’t track it that closely?

Indeed, this simplifies things a bit at the conceptual level, since now I can avoid doing any sort of hashing related to resources, though this does end up making my second-level descriptor set cache less effective. But I’m getting ahead of myself in this post, so it’s time to jump into some code.

Resource Tracking 2.0

Instead of doing really precise tracking, it’s important to recall a few key points about how the descriptor sets are managed:

  • they’re bucket allocated
  • they’re allocated using the struct zink_program ralloc context
  • they’re only ever destroyed when the program is
  • zink is single-threaded

Thus, I brought some pointer hacks to bear:

static void
zink_resource_desc_set_add(struct zink_resource *res, struct zink_descriptor_set *zds, unsigned idx)
{
   zds->resources[idx] = res;
   util_dynarray_append(&res->desc_set_refs, struct zink_resource **, &zds->resources[idx]);
}

This function associates a resource with a given descriptor set at the specified index (based on pipeline state). And then it pushes a reference to that pointer from the descriptor set’s C-array of resources into an array on the resource.

Later, during resource destruction, I can then walk the array of pointers like this:

util_dynarray_foreach(&res->desc_set_refs, struct zink_resource **, ref) {
   if (**ref == res)
      **ref = NULL;
}

If the reference I pushed earlier is still pointing to this resource, I can unset the pointer, and this will get picked up during future descriptor updates to flag the set as not-cached, requiring that it be updated. Since a resource won’t ever be destroyed while a set is in use, this is also safe for the associated descriptor set’s lifetime.

And since there’s no hashing or tree traversals involved, this is incredibly fast.
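The whole trick boils down to this sketch (plain fixed-size arrays standing in for util_dynarray):

```c
#include <stddef.h>

#define MAX_REFS 16
#define MAX_SET_RESOURCES 4

struct resource;

/* stand-in for a descriptor set's C-array of resources */
struct desc_set {
   struct resource *resources[MAX_SET_RESOURCES];
};

/* each resource keeps pointers to every array slot that references it */
struct resource {
   struct resource **refs[MAX_REFS];
   size_t num_refs;
};

static void
resource_desc_set_add(struct resource *res, struct desc_set *zds, unsigned idx)
{
   zds->resources[idx] = res;
   res->refs[res->num_refs++] = &zds->resources[idx];
}

/* on destruction, walk the stored pointers and null out any that still
 * point at this resource: no hashing, no tree traversal */
static void
resource_destroy(struct resource *res)
{
   for (size_t i = 0; i < res->num_refs; i++)
      if (*res->refs[i] == res)
         *res->refs[i] = NULL;
}
```

The nulled slot is what later descriptor updates see, flagging the cached set as invalid without any per-resource bookkeeping structures.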

Second-level Caching

At this point, I’d created two categories for descriptor sets: active sets, which were the ones in use in a command buffer, and inactive sets, which were the ones that weren’t currently in use, with sets being pushed into the inactive category once they were no longer used by any command buffers. This ended up being a bit of a waste, however, as I had lots of inactive sets that were still valid but unreachable since I was using an array for storing these as well as the newly-bucket-allocated sets.

Thus, a second-level cache, AKA the B cache, which would store not-used sets that had at one point been valid. I’m still not doing any sort of checking of sets which may have been invalidated by resource destruction, so the B cache isn’t quite as useful as it could be. Also:

  • the check program cache for matching set has now been expanded to two lookups in case a matching set isn’t active but is still configured and valid in the B cache
  • the check program for unused set block in the above diagram will now cannibalize a valid inactive set from the B cache rather than allocate a new set

The last of these items is a bit annoying, but ultimately the B cache can end up having hundreds of members at various points, and iterating through it to try and find a set that’s been invalidated ends up being impractical just based on the random distribution of sets across the table. Also, I only set the resource-based invalidation up to null the resource pointer, so finding an invalid set would mean walking through the resource array of each set in the cache. Thus, a quick iteration through a few items to see if the set-finder gets lucky, otherwise it’s clobbering time.

And this brought me up to about 24fps, which was still down a bit from the mind-blowing 27-28fps I was getting with just the bucket allocator, but it turns out that caching starts to open up other avenues for sizable optimizations.

Which I’ll get to in future posts.

October 05, 2020

Healthier Blogging

I took some time off to focus on making the numbers go up, but if I only do that sort of junk food style blogging with images, and charts, and benchmarks then we might all stop learning things, and certainly I won’t be transferring any knowledge between the coding part of my brain and the speaking part, so really we’re all losers at that point.

In other words, let’s get back to going extra deep into some code and doing some long form patch review.

Descriptor Management

First: what are descriptors?

Descriptors are, in short, when you feed a buffer or an image (+sampler) into a shader. In OpenGL, this is all handled for the user behind the scenes with e.g., a simple glGenBuffers() -> glBindBuffer() -> glBufferData() for an attached buffer. For a gallium-based driver, this example case will trigger the pipe_context::set_constant_buffer or pipe_context::set_shader_buffers hook at draw time to inform the driver that a buffer has been attached, and then the driver can link it up with the GPU.

Things are a bit different in Vulkan. There’s an entire chapter of the spec devoted to explaining how descriptors work in great detail, but the important details for the zink case are:

  • each descriptor must have a binding value created for it which is unique within its descriptor set
  • each descriptor must have a Vulkan descriptor type, as converted from its OpenGL type
  • each descriptor must also potentially expand to its full array size (image descriptor types only)

Additionally, while organizing and initializing all these descriptor sets, zink has to track all the resources used and guarantee their lifetimes exceed the lifetimes of the batch they’re being submitted with.

To handle this, zink has an amount of code. In the current state of the repo, it’s about 40 lines.



In the state of my branch that I’m going to be working off of for the next few blog posts, the function for handling descriptor updates is 304 lines. This is the increased amount of code that’s required to handle (almost) all GL descriptor types in a way that’s reasonably reliable.

Is it a great design decision to have a function that large?

Probably not.

But I decided to write it all out first before I did any refactoring so that I could avoid having to incrementally refactor my first attempt at refactoring, which would waste lots of time.

Also, memes.

How Does This Work?

The idea behind the latter version of the implementation that I linked is as follows:

  • iterate over all the shader stages
  • iterate over the bindings for the shader
  • for each binding:
    • fill in the Vulkan buffer/image info struct
    • for all resources used by the binding:
      • add tracking for the underlying resource(s) for lifetime management
      • flag a pipeline barrier for the resource(s) based on usage*
    • fill in the VkWriteDescriptorSet struct for the binding

*Then merge and deduplicate all the accumulated pipeline barriers and apply only those which induce a layout or access change in the resource to avoid over-applying barriers.

As I mentioned in a previous post, zink then applies these descriptors to a newly-allocated, max-size descriptor set object from an allocator pool located on the batch object. Every time a draw command is triggered, a new VkDescriptorSet is allocated and updated using these steps.

First Level Refactoring Target

As I touched on briefly in a previous post, the first change to make here in improving descriptor handling is to move the descriptor pools to the program objects. This lets zink create smaller descriptor pools which are likely going to end up using less memory than these giant ones. Here’s the code used for creating descriptor pools prior to refactoring:

VkDescriptorPoolSize sizes[] = {
   {VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER,        ZINK_BATCH_DESC_SIZE},
   {VK_DESCRIPTOR_TYPE_UNIFORM_TEXEL_BUFFER,  ZINK_BATCH_DESC_SIZE},
   {VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER, ZINK_BATCH_DESC_SIZE},
   {VK_DESCRIPTOR_TYPE_STORAGE_TEXEL_BUFFER,  ZINK_BATCH_DESC_SIZE},
   {VK_DESCRIPTOR_TYPE_STORAGE_BUFFER,        ZINK_BATCH_DESC_SIZE},
   {VK_DESCRIPTOR_TYPE_STORAGE_IMAGE,         ZINK_BATCH_DESC_SIZE},
};
VkDescriptorPoolCreateInfo dpci = {};
dpci.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO;
dpci.pPoolSizes = sizes;
dpci.poolSizeCount = ARRAY_SIZE(sizes);
dpci.flags = 0;
dpci.maxSets = ZINK_BATCH_DESC_SIZE;
vkCreateDescriptorPool(screen->dev, &dpci, 0, &batch->descpool);

Here, all the used descriptor types allocate ZINK_BATCH_DESC_SIZE descriptors in the pool, and there are ZINK_BATCH_DESC_SIZE sets in the pool. There’s a bug here, which is that really the descriptor types should have ZINK_BATCH_DESC_SIZE * ZINK_BATCH_DESC_SIZE descriptors to avoid oom-ing the pool in the event that allocated sets actually use that many descriptors, but ultimately this is irrelevant, as we’re only ever allocating 1 set at a time due to zink flushing multiple times per draw anyway.

Ideally, however, it would be better to avoid this. The majority of draw cases use much, much smaller descriptor sets which only have 1-5 descriptors total, so allocating 6 * 1000 for a pool is roughly 6 * 1000 more than is actually needed for every set.

The other downside of this strategy is that by creating these giant, generic descriptor sets, it becomes impossible to know what’s actually in a given set without attaching considerable metadata to it, which makes reusing sets without modification (i.e., caching) a fair bit of extra work. Yes, I know I said bucket allocation was faster, but I also said I believe in letting the best idea win, and it doesn’t really seem like doing full updates every draw should be faster, does it? But I’ll get to that in another post.

Descriptor Migration

When creating a struct zink_program (which is the struct that contains all the shaders), zink creates a VkDescriptorSetLayout object for describing the descriptor set layouts that will be allocated from the pool. This means zink is allocating giant, generic descriptor set pools and then allocating (almost certainly) very, very small sets, which means the driver ends up with this giant memory balloon of unused descriptors allocated in the pool that can never be used.

A better idea for this would be to create descriptor pools which precisely match the layout for which they’ll be allocating descriptor sets, as this means there’s no memory ballooning, even if it does end up being more pool objects.
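Deriving exact pool sizes from the layout bindings can be sketched like this (hypothetical helper; the `pool_size` struct mirrors VkDescriptorPoolSize, and the numeric types in the test are the real Vulkan enum values):

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

#define MAX_TYPES 6

/* mirrors VkDescriptorPoolSize for the sketch */
struct pool_size {
   uint32_t type;
   uint32_t count;
};

/* instead of one giant generic pool, accumulate exactly the descriptor
 * counts that the program's layout bindings actually use */
static size_t
sizes_from_bindings(const uint32_t *binding_types, const uint32_t *binding_counts,
                    size_t num_bindings, struct pool_size sizes[MAX_TYPES])
{
   size_t num_sizes = 0;
   for (size_t i = 0; i < num_bindings; i++) {
      size_t j;
      for (j = 0; j < num_sizes; j++) {
         if (sizes[j].type == binding_types[i]) {
            sizes[j].count += binding_counts[i];
            break;
         }
      }
      if (j == num_sizes) {
         sizes[num_sizes].type = binding_types[i];
         sizes[num_sizes].count = binding_counts[i];
         num_sizes++;
      }
   }
   return num_sizes;
}
```

A pool built from these sizes holds precisely what its sets need, so nothing in it can balloon unused.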

Here’s the current function for creating the layout object for a program:

static VkDescriptorSetLayout
create_desc_set_layout(VkDevice dev,
                       struct zink_shader *stages[ZINK_SHADER_COUNT],
                       unsigned *num_descriptors)
{
   int num_bindings = 0;

   for (int i = 0; i < ZINK_SHADER_COUNT; i++) {
      struct zink_shader *shader = stages[i];
      if (!shader)
         continue;

      VkShaderStageFlagBits stage_flags = zink_shader_stage(pipe_shader_type_from_mesa(shader->nir->info.stage));

This function is called for both the graphics and compute pipelines, and for the latter, only a single shader object is passed, meaning that i is purely for iterating the maximum number of shaders and not descriptive of the current shader being processed.

      for (int j = 0; j < shader->num_bindings; j++) {
         assert(num_bindings < ARRAY_SIZE(bindings));
         bindings[num_bindings].binding = shader->bindings[j].binding;
         bindings[num_bindings].descriptorType = shader->bindings[j].type;
         bindings[num_bindings].descriptorCount = shader->bindings[j].size;
         bindings[num_bindings].stageFlags = stage_flags;
         bindings[num_bindings].pImmutableSamplers = NULL;
         num_bindings++;
      }
   }

This iterates over the bindings in a given shader, setting up the various values required by the layout-creation struct from the values stored on the shader struct by the code in zink_compiler.c.

   *num_descriptors = num_bindings;
   if (!num_bindings)
      return VK_NULL_HANDLE;

If this program has no descriptors at all, then this whole thing can become a no-op, and descriptor updating can be skipped for draws which use this program.

   VkDescriptorSetLayoutCreateInfo dcslci = {};
   dcslci.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO;
   dcslci.pNext = NULL;
   dcslci.flags = 0;
   dcslci.bindingCount = num_bindings;
   dcslci.pBindings = bindings;

   VkDescriptorSetLayout dsl;
   if (vkCreateDescriptorSetLayout(dev, &dcslci, 0, &dsl) != VK_SUCCESS) {
      debug_printf("vkCreateDescriptorSetLayout failed\n");
      return VK_NULL_HANDLE;
   }

   return dsl;

Then there’s just the usual Vulkan semantics of storing values to the struct and passing it to the Create function.

But back in the context of moving descriptor pool creation to the program, this is actually the perfect place to jam the pool creation code in since all the information about descriptor types is already here. Here’s what that looks like:

VkDescriptorPoolSize sizes[6] = {};
int type_map[12];
unsigned num_types = 0;
memset(type_map, -1, sizeof(type_map));

for (int i = 0; i < ZINK_SHADER_COUNT; i++) {
   struct zink_shader *shader = stages[i];
   if (!shader)
      continue;

   VkShaderStageFlagBits stage_flags = zink_shader_stage(pipe_shader_type_from_mesa(shader->nir->info.stage));
   for (int j = 0; j < shader->num_bindings; j++) {
      assert(num_bindings < ARRAY_SIZE(bindings));
      bindings[num_bindings].binding = shader->bindings[j].binding;
      bindings[num_bindings].descriptorType = shader->bindings[j].type;
      bindings[num_bindings].descriptorCount = shader->bindings[j].size;
      bindings[num_bindings].stageFlags = stage_flags;
      bindings[num_bindings].pImmutableSamplers = NULL;
      if (type_map[shader->bindings[j].type] == -1) {
         type_map[shader->bindings[j].type] = num_types++;
         sizes[type_map[shader->bindings[j].type]].type = shader->bindings[j].type;
      }
      sizes[type_map[shader->bindings[j].type]].descriptorCount += shader->bindings[j].size;
      num_bindings++;
   }
}

I’ve added the sizes, type_map, and num_types variables, which map the Vulkan descriptor types in use to a zero-based array and an associated counter that can be used to fill in the pPoolSizes and poolSizeCount values of a VkDescriptorPoolCreateInfo struct.

After the layout creation, which remains unchanged, I’ve then added this block:

for (int i = 0; i < num_types; i++)
   sizes[i].descriptorCount *= ZINK_DEFAULT_MAX_DESCS;

VkDescriptorPoolCreateInfo dpci = {};
dpci.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO;
dpci.pPoolSizes = sizes;
dpci.poolSizeCount = num_types;
dpci.flags = 0;
dpci.maxSets = ZINK_DEFAULT_MAX_DESCS;
vkCreateDescriptorPool(dev, &dpci, 0, &descpool);

Which uses the descriptor types and sizes from above to create a pool that will pre-allocate the exact descriptor counts that are needed for this program.

Managing descriptor sets in this way does have other challenges, however. Like resources, it’s crucial that sets not be modified or destroyed while they’re submitted to a batch.

Descriptor Tracking

Previously, any time a draw was completed, the batch object would reset and clear its descriptor pool, wiping out all the allocated sets. If the pool is no longer on the batch, however, it’s not possible to perform this reset without adding tracking info for the batch to all the descriptor sets. Also, resetting a descriptor pool like this is wasteful, as it’s probable that a program will be used for multiple draws and thus require multiple descriptor sets. What I’ve done instead is add this function, which is called just after allocating the descriptor set:

bool
zink_batch_add_desc_set(struct zink_batch *batch, struct zink_program *pg, struct zink_descriptor_set *zds)
{
   struct hash_entry *entry = _mesa_hash_table_search(batch->programs, pg);
   struct set *desc_sets = (void*)entry->data;
   if (!_mesa_set_search(desc_sets, zds)) {
      pipe_reference(NULL, &zds->reference);
      _mesa_set_add(desc_sets, zds);
      return true;
   }
   return false;
}

Similar to all the other batch<->object tracking, this stores the given descriptor set into a set, but in this case the set is itself stored as the data in a hash table keyed with the program, which provides both objects for use during batch reset:

void
zink_reset_batch(struct zink_context *ctx, struct zink_batch *batch)
{
   struct zink_screen *screen = zink_screen(ctx->base.screen);
   batch->descs_used = 0;

   /* cmdbuf hasn't been submitted before, so there's nothing to wait on or reset */
   if (!batch->submitted)
      return;

   zink_fence_finish(screen, &ctx->base, batch->fence, PIPE_TIMEOUT_INFINITE);
   hash_table_foreach(batch->programs, entry) {
      struct zink_program *pg = (struct zink_program*)entry->key;
      struct set *desc_sets = (struct set*)entry->data;
      set_foreach(desc_sets, sentry) {
         struct zink_descriptor_set *zds = (void*)sentry->key;
         /* reset descriptor pools when no batch is using this program to avoid
          * having some inactive program hogging a billion descriptors
          */
         pipe_reference(&zds->reference, NULL);
         zink_program_invalidate_desc_set(pg, zds);
      }
      _mesa_set_destroy(desc_sets, NULL);
   }
}

And this function is called:

void
zink_program_invalidate_desc_set(struct zink_program *pg, struct zink_descriptor_set *zds)
{
   uint32_t refcount = p_atomic_read(&zds->reference.count);
   /* refcount > 1 means this is currently in use, so we can't recycle it yet */
   if (refcount == 1)
      util_dynarray_append(&pg->alloc_desc_sets, struct zink_descriptor_set *, zds);
}

If a descriptor set has no active batch uses, its refcount will be 1, and then it can be added to the array of allocated descriptor sets for immediate reuse in the next draw. In this iteration of refactoring, descriptor sets can only have one batch use, so this condition is always true when this function is called, but future work will see that change.

Putting it all together is this function:

struct zink_descriptor_set *
zink_program_allocate_desc_set(struct zink_screen *screen,
                               struct zink_batch *batch,
                               struct zink_program *pg)
{
   struct zink_descriptor_set *zds;

   if (util_dynarray_num_elements(&pg->alloc_desc_sets, struct zink_descriptor_set *)) {
      /* grab one off the allocated array */
      zds = util_dynarray_pop(&pg->alloc_desc_sets, struct zink_descriptor_set *);
      goto out;
   }

   VkDescriptorSetAllocateInfo dsai;
   memset((void *)&dsai, 0, sizeof(dsai));
   dsai.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO;
   dsai.pNext = NULL;
   dsai.descriptorPool = pg->descpool;
   dsai.descriptorSetCount = 1;
   dsai.pSetLayouts = &pg->dsl;

   VkDescriptorSet desc_set;
   if (vkAllocateDescriptorSets(screen->dev, &dsai, &desc_set) != VK_SUCCESS) {
      debug_printf("ZINK: %p failed to allocate descriptor set :/\n", pg);
      return NULL;
   }
   zds = ralloc_size(NULL, sizeof(struct zink_descriptor_set));
   pipe_reference_init(&zds->reference, 1);
   zds->desc_set = desc_set;
out:
   if (zink_batch_add_desc_set(batch, pg, zds))
      batch->descs_used += pg->num_descriptors;

   return zds;
}

If a pre-allocated descriptor set exists, it’s popped off the array. Otherwise, a new one is allocated. After that, the set is referenced onto the batch.


Now all the sets are allocated on the program using a more specific allocation strategy, which paves the way for a number of improvements that I’ll be discussing at varying length over the coming days:

  • bucket allocating
  • descriptor caching 2.0
  • split descriptor sets
  • context-based descriptor states
  • (pending implementation/testing/evaluation) context-based descriptor pool+layout caching
September 29, 2020

A Showcase

Today I’m taking a break from writing about my work to write about the work of zink’s newest contributor, He Haocheng (aka @hch12907). Among other things, Haocheng has recently tackled the issue of extension refactoring, which is a huge help for future driver development. I’ve written time and time again about adding extensions, and with this patchset in place, the process is simplified and expedited almost into nonexistence.


As an example, let’s look at the most recent extension that I’ve added support for, VK_EXT_extended_dynamic_state. The original patch looked like this:

diff --git a/src/gallium/drivers/zink/zink_screen.c b/src/gallium/drivers/zink/zink_screen.c
index 3effa2b0fe4..83b89106931 100644
--- a/src/gallium/drivers/zink/zink_screen.c
+++ b/src/gallium/drivers/zink/zink_screen.c
@@ -925,6 +925,10 @@ load_device_extensions(struct zink_screen *screen)
+   if (screen->have_EXT_extended_dynamic_state) {
+      GET_PROC_ADDR(CmdSetViewportWithCountEXT);
+      GET_PROC_ADDR(CmdSetScissorWithCountEXT);
+   }
@@ -938,7 +942,8 @@ zink_internal_create_screen(struct sw_winsys *winsys, int fd, const struct pipe_
    bool have_tf_ext = false, have_cond_render_ext = false, have_EXT_index_type_uint8 = false,
       have_EXT_robustness2_features = false, have_EXT_vertex_attribute_divisor = false,
       have_EXT_calibrated_timestamps = false, have_VK_KHR_vulkan_memory_model = false;
-   bool have_EXT_custom_border_color = false, have_EXT_blend_operation_advanced = false;
+   bool have_EXT_custom_border_color = false, have_EXT_blend_operation_advanced = false,
+        have_EXT_extended_dynamic_state = false;
    if (!screen)
       return NULL;
@@ -1001,6 +1006,9 @@ zink_internal_create_screen(struct sw_winsys *winsys, int fd, const struct pipe_
             if (!strcmp(extensions[i].extensionName,
                have_EXT_blend_operation_advanced = true;
+            if (!strcmp(extensions[i].extensionName,
+               have_EXT_extended_dynamic_state = true;
@@ -1012,6 +1020,7 @@ zink_internal_create_screen(struct sw_winsys *winsys, int fd, const struct pipe_
    VkPhysicalDeviceIndexTypeUint8FeaturesEXT index_uint8_feats = {};
    VkPhysicalDeviceVulkanMemoryModelFeatures mem_feats = {};
    VkPhysicalDeviceBlendOperationAdvancedFeaturesEXT blend_feats = {};
+   VkPhysicalDeviceExtendedDynamicStateFeaturesEXT dynamic_state_feats = {};
@@ -1060,6 +1069,11 @@ zink_internal_create_screen(struct sw_winsys *winsys, int fd, const struct pipe_
       blend_feats.pNext = feats.pNext;
       feats.pNext = &blend_feats;
+   if (have_EXT_extended_dynamic_state) {
+      dynamic_state_feats.pNext = feats.pNext;
+      feats.pNext = &dynamic_state_feats;
+   }
    vkGetPhysicalDeviceFeatures2(screen->pdev, &feats);
    memcpy(&screen->feats, &feats.features, sizeof(screen->feats));
    if (have_tf_ext && tf_feats.transformFeedback)
@@ -1074,6 +1088,8 @@ zink_internal_create_screen(struct sw_winsys *winsys, int fd, const struct pipe_
    screen->have_EXT_calibrated_timestamps = have_EXT_calibrated_timestamps;
    if (have_EXT_custom_border_color && screen->border_color_feats.customBorderColors)
       screen->have_EXT_custom_border_color = true;
+   if (have_EXT_extended_dynamic_state && dynamic_state_feats.extendedDynamicState)
+      screen->have_EXT_extended_dynamic_state = true;
    VkPhysicalDeviceProperties2 props = {};
    VkPhysicalDeviceVertexAttributeDivisorPropertiesEXT vdiv_props = {};
@@ -1150,7 +1166,7 @@ zink_internal_create_screen(struct sw_winsys *winsys, int fd, const struct pipe_
     * this requires us to pass the whole VkPhysicalDeviceFeatures2 struct
    dci.pNext = &feats;
-   const char *extensions[12] = {
+   const char *extensions[13] = {
    num_extensions = 1;
@@ -1185,6 +1201,8 @@ zink_internal_create_screen(struct sw_winsys *winsys, int fd, const struct pipe_
       extensions[num_extensions++] = VK_EXT_CUSTOM_BORDER_COLOR_EXTENSION_NAME;
    if (have_EXT_blend_operation_advanced)
       extensions[num_extensions++] = VK_EXT_BLEND_OPERATION_ADVANCED_EXTENSION_NAME;
+   if (have_EXT_extended_dynamic_state)
+      extensions[num_extensions++] = VK_EXT_EXTENDED_DYNAMIC_STATE_EXTENSION_NAME;
    assert(num_extensions <= ARRAY_SIZE(extensions));
    dci.ppEnabledExtensionNames = extensions;
diff --git a/src/gallium/drivers/zink/zink_screen.h b/src/gallium/drivers/zink/zink_screen.h
index 4ee409c0efd..1d35e775262 100644
--- a/src/gallium/drivers/zink/zink_screen.h
+++ b/src/gallium/drivers/zink/zink_screen.h
@@ -75,6 +75,7 @@ struct zink_screen {
    bool have_EXT_calibrated_timestamps;
    bool have_EXT_custom_border_color;
    bool have_EXT_blend_operation_advanced;
+   bool have_EXT_extended_dynamic_state;
    bool have_X8_D24_UNORM_PACK32;
    bool have_D24_UNORM_S8_UINT;

It’s awful, right? There’s obviously lots of copy/pasted code here, and it’s a tremendous waste of time to have to do the copy/pasting, not to mention the time needed for reviewing such mind-numbing changes.


Here’s the same patch after He Haocheng’s work has been merged:

diff --git a/src/gallium/drivers/zink/ b/src/gallium/drivers/zink/
index 0300e7f7574..69e475df2cf 100644
--- a/src/gallium/drivers/zink/
+++ b/src/gallium/drivers/zink/
@@ -62,6 +62,7 @@ def EXTENSIONS():
         Extension("VK_EXT_custom_border_color",      alias="border_color", properties=True, feature="customBorderColors"),
         Extension("VK_EXT_blend_operation_advanced", alias="blend", properties=True),
+        Extension("VK_EXT_extended_dynamic_state",   alias="dynamic_state", feature="extendedDynamicState"),
 # There exists some inconsistencies regarding the enum constants, fix them.
diff --git a/src/gallium/drivers/zink/zink_screen.c b/src/gallium/drivers/zink/zink_screen.c
index 864ec32fc22..3c1214d384b 100644
--- a/src/gallium/drivers/zink/zink_screen.c
+++ b/src/gallium/drivers/zink/zink_screen.c
@@ -926,6 +926,10 @@ load_device_extensions(struct zink_screen *screen)
+   if (screen->info.have_EXT_extended_dynamic_state) {
+      GET_PROC_ADDR(CmdSetViewportWithCountEXT);
+      GET_PROC_ADDR(CmdSetScissorWithCountEXT);
+   }


I’m certain this is going to lead to an increase in my own productivity in the future given how quick the process has now become.

A Short Note On Hardware

I’ve been putting up posts lately about my benchmarking figures, and in them I’ve been referencing Intel hardware. The reason I use Intel is because I’m (currently) a hobbyist developer with exactly one computer capable of doing graphics development, and it has an Intel onboard GPU. I’m quite happy with this given the high quality state of Intel’s drivers, as things become much more challenging when I have to debug both my own bugs as well as an underlying Vulkan driver’s bugs at the same time.

With that said, I present to you this recent out-of-context statement from Dave Airlie regarding zink performance on a different driver:

<airlied> zmike: hey on a fiji amd card, you get 45/46 native vs 35 fps with zink on one heaven scene here, however zink is corrupted

I don’t have any further information about anything there, but it’s the best I can do given the limitations in my available hardware.

September 28, 2020

But No, I’m Not

Just a quick post today to summarize a few exciting changes I’ve made today.

To start with, I’ve added some tracking to the internal batch objects for catching things like piglit’s spec@!opengl 1.1@streaming-texture-leak. Let’s check out the test code there for a moment:

/** @file streaming-texture-leak.c
 * Tests that allocating and freeing textures over and over doesn't OOM
 * the system due to various refcounting issues drivers may have.
 * Textures used are around 4MB, and we make 5k of them, so OOM-killer
 * should catch any failure.
 * Bug #23530
 */
for (i = 0; i < 5000; i++) {
        glGenTextures(1, &texture);
        glBindTexture(GL_TEXTURE_2D, texture);

        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, TEX_SIZE, TEX_SIZE,
                     0, GL_RGBA,
                     GL_UNSIGNED_BYTE, tex_buffer);

        piglit_draw_rect_tex(0, 0, piglit_width, piglit_height,
                             0, 0, 1, 1);

        glDeleteTextures(1, &texture);
}

This test loops 5000 times, using a different sampler texture for each draw, and then destroys the texture. This is supposed to catch drivers which can’t properly manage their resource refcounts, but instead here zink is getting caught by trying to dump 5000 active resources into the same command buffer, which ooms the system.

The reason for the problem in this case is that, after my recent optimizations which avoid unnecessary flushing, zink only submits the command buffer when a frame is finished or one of the write-flagged resources associated with an active batch is read from. Thus, the whole test runs in one go, only submitting the queue at the very end when the test performs a read.

In this case, my fix is simple: check the system’s total memory on driver init, and then always flush a batch when beginning a new draw if the memory usage of its associated resources crosses some threshold. I chose 1/8 of total memory per batch to be “safe”; with zink’s four batches in flight, that allows zink to use up to 50% of the total memory for resources before it’ll begin to stall and force the draws to complete, hopefully avoiding any oom scenarios. This ends up being a flush every 250ish draws in the above test code, and everything works nicely without killing my system.

Performance 3.0

As a bonus, I noticed that zink was taking considerably longer than IRIS to complete this test once it was fixed, so I did a little profiling, and this was the result:

Up another 3 fps (~10%) from Friday, which isn’t bad for a few minutes spent removing some memset calls from descriptor updating and then throwing in some code for handling VK_DYNAMIC_STATE_VERTEX_INPUT_BINDING_STRIDE_EXT.

September 25, 2020

Today new beta versions were released for all KWinFT projects – that is KWinFT, Wrapland, Disman and KDisplay. With that we are on target for the full release, which is aligned with Plasma 5.20 on October 13.

Big changes will unquestionably come to Disman, a previously stifled library for display management, which now learns to stand on its own feet, providing universal means for the configuration of displays with different windowing systems and Wayland compositors.

But the compositor KWinFT also received a very specific yet important new feature, along with a multitude of stability fixes and code refactors.

In the following we will take a deep dive into the reasons for and results of these recent efforts.

For a quick overview of the work on Disman you can also watch this lightning talk that I held at the virtual XDC 2020 conference last week.

Universal display management with Disman

It was not initially planned like this, but Disman and KDisplay were two later additions to the KWinFT project.

The projects were forked from libkscreen and KScreen respectively, and I saw this as an opportunity to completely rethink and in every sense overhaul these previously rather lackluster and at times completely neglected components of the KDE Plasma workspace. This past negligence is rather tragic, since complaints about miserable output management in KDE Plasma go back as far as anyone can remember. Improving this bad state of affairs was my main motivation when I started working on libkscreen and KScreen around two years ago.

In my opinion a well functioning – not necessarily fancy, but certainly robust – display configuration system is a cornerstone of a well crafted desktop system. One reason is how prevalent multi-display setups are; another is how immeasurably annoying it is when you can't configure the projector correctly that one time you have to give a presentation in front of a room full of people.

Disman now tries to solve this by providing a solution not only for KWinFT or the KDE Plasma desktop alone but for any system running X11 or any Wayland compositor.

Moving logic and ideas

Let us look into the details of this solution and why I haven't mentioned KDisplay yet. The reason for this omission is that KDisplay from now on will be a husk of its former self.

Ancient rituals

As a fork of KScreen, until no longer than one month ago KDisplay was still the logical center of any display configuration, with an always-active KDE daemon (KDED) module and a KConfig module (KCM) integrated into the KDE System Settings.

The KDED module was responsible for reacting to display hot-plug events, reading control files for the resulting display combination from the user directory, generating optimal configurations if none were found, and writing new files to the hard disk after a configuration had been applied successfully to the windowing system.

In this workflow Disman was only relevant as a provider of backend plugins that were loaded at runtime. Disman was used either in-process or through an included D-Bus service that was automatically started whenever the first client tried to talk to it. According to the commit adding this out-of-process mode five years ago, the intention behind it was to improve performance and stability. But in the end, on a functional level, the service did little more than forward data between the windowing system and the Disman consumers.

Break with tradition

Interestingly, the D-Bus service could only be activated with the X11 backend and was explicitly disabled on Wayland. When I noticed this I was at first tempted to remove the D-Bus service, in the eternal struggle to reduce code complexity. After all, if the service is not used on Wayland, we might not need it at all.

But some time later I realized that this D-Bus service ought to be appreciated in a different way than its initial reasoning suggests. From a different point of view this service could be the key to a much more ambitious grand solution.

The service allows us to serialize and synchronize the access of arbitrarily many clients in a transparent way, while moving all relevant logical systems to a shared central place and providing each client a high level of integration with those systems.

Concretely, this means that the Disman D-Bus service becomes an independent entity. Once invoked by a single call from a client – for example by the included command line utility with dismanctl -o – the service reads and writes all necessary control files on its own. It generates optimal display configurations if no files are found and can even disable a laptop display in case the lid was closed while an external output is connected.

In this model Disman consumers solely provide user interfaces that are informed about the generated or loaded current config and that can additionally modify this config if desired. This way the consumer can concentrate on providing a user interface with great usability and design, and leave to Disman all the logic of handling the modified configuration afterwards.

Making it easy to add other clients is only one advantage. On a higher level this new design has two more.

Auxiliary data

I noticed already last year that some basic assumptions in KScreen were questionable. Its internal data logic relied on a round trip through the windowing system.

In practice this meant that the user was supposed to change display properties via the KScreen KCM. These were then sent to the windowing system, which tried to apply them to the hardware. Afterwards it informed the KScreen KDE daemon about this new configuration through its own specific protocols and a libkscreen backend. The daemon alone would then write the updated configuration to the disk.

Why it was done this way is clear: we can be sure we have written a valid configuration to the disk, and by having only the daemon do the file writes we keep the file access logic in a single place and do not need to synchronize file writes from different processes.

But the fundamental problem of this design is that for sensible display management we sometimes need to share additional information about our display configuration – information that is of no relevance to the windowing system and because of that cannot be passed through it.

A simple example is when a display is auto-rotated. Smartphones and tablets but also many convertibles come with orientation sensors to auto-rotate the built-in display according to the current device orientation. When auto-rotation is switched on or off in the KCM it is not sent through the windowing system but the daemon or another service needs to know about such a change in order to adapt the display rotation correctly with later orientation changes.

A complex but interesting other example is the replication of displays, also often called mirroring. When I started work on KScreen two years ago the mechanism was painfully primitive: one could only duplicate all displays at once and it was done by moving all of them to the same position and then changing their display resolutions hoping to find some sufficiently alike to cover a similar area.

Obviously that had several issues; the worst in my opinion was that it wouldn't work for displays with different aspect ratios, as I noticed quickly after I got myself a 16:10 display. Another grave issue was that displays might not run at their full resolution: in a mixed DPI setup, the formerly HiDPI displays are downgraded to the best resolution they have in common with the LoDPI displays.

The good news is that on X11 and also Wayland methods are available to replicate displays without these downsides.

On X11 we can apply arbitrary linear transformations to an output. This solves both issues.

On Wayland all available output management protocols at minimum allow setting a single floating point value to scale a display. This solves the mixed DPI problem, since we can still run both displays at an arbitrary resolution and adapt the logical size of the replica through its scale. If the management protocol provides a way to specify the logical size directly, as the KWinFT protocol does, we can also solve the problem of diverging display aspect ratios.

From a bird's eye view, in this model one or multiple displays act as replicas of a single source display. Only the transformation, scale or logical size of the replicas is changed; the source is the invariant. The important information to remember for each display is therefore solely whether there is a replication source that the display is a replica of. But neither in X11 nor in any Wayland compositor is this information conveyed via the windowing system.

With the new design we send all configuration data including such auxiliary data to the Disman D-Bus service. The service will save all this data to a configuration-specific file but send to the windowing system only a relevant subset of the data. After the windowing system reports that the configuration was applied the Disman service informs all connected clients about this change sending the data received from the windowing system augmented by the auxiliary data that had not been passed through the windowing system.

This way every display management client receives all relevant data about the current configuration including the auxiliary data.

The motivation to solve this problem was the original driving force behind the large redesign of Disman that is coming with this release.

But I realized soon that this redesign also has another advantage that long-term is likely even more important than the first one.

Ready for everything that is not KDE Plasma too

With the new design Disman becomes a truly universal solution for display management.

Previously a running KDE daemon process with KDisplay's daemon module loaded was required in order to load a display configuration on startup and to react to display hot-plug events. The problem with this is that the KDE daemon commonly only makes sense to run on a KDE Plasma desktop.

Thanks to the new design the Disman D-Bus service can now be run as a standalone background service managing all your displays permanently, even if you don't use KDE Plasma.

In a non-Plasma environment, like a Wayland session with sway, this can be achieved by simply calling dismanctl -o once in a startup script.

On the other side the graphical user interface that KDisplay provides can now be used to manage displays on any desktop that Disman runs on too. KDisplay does not require the KDE System Settings to be installed and can be run as a standalone app. Simply call kdisplay from command line or start it from the launcher of your desktop environment.

KDisplay still includes a now absolutely gutted KDE daemon module that is run in a Plasma session. The module now does little more than launch the Disman D-Bus service on session startup. So in a Plasma session, after installing Disman and KDisplay, everything is set up automatically. In every other session, as said, a simple dismanctl -o call at startup is enough to get the very same result.

Maybe the integration in sessions other than Plasma could be improved, making even this single call at startup unnecessary. Should Disman, for example, install a systemd unit file executing this call by default? I would be interested in feedback in this regard, in particular from distributions. What do they prefer?

KWinFT and Wrapland improved in selected areas

With today's beta release the greatest changes come to Disman and KDisplay. But that does not mean KWinFT and Wrapland have not received some important updates.

Outputs all the way

With the ongoing work on Disman, and thereby on displays – or outputs, as they are called in the land of window managers – stability and feature patches for outputs naturally came to Wrapland and KWinFT as well. A large refactor was the introduction of a master output class on the server side of Wrapland. The class acts as a central entry point for compositors and internally deals with the different output-related protocol objects.

With this class in place it was rather easy to add support for xdg-output versions 2 and 3 afterwards. In order to do that it was also reasonable to re-evaluate how we provide output-identifying metadata in KWinFT and Wrapland in general.

In regard to output identification, a happy coincidence was that Simon Ser of the wlroots project had already been asking himself the very same questions in the past.

I concluded that Simon's plan for wlroots was spot on, and I decided to help them out a bit with patches for wlr-protocols and wlroots. In the same vein I updated Wrapland's output device protocol. That means Wrapland and wlroots based compositors now feature the very same way of identifying outputs, which made it easy to provide full support for both in Disman.

Presentation timings

This release comes with support for the presentation-time protocol.

It is one of only three Wayland protocol extensions that have been officially declared stable. Because of that supporting it felt also important in a formal sense.

Primarily though it is essential to my ongoing work on Xwayland. I plan to make use of the presentation-time protocol in Xwayland's Present extension implementation.

With the support in KWinFT I can test future presentation-time work in Xwayland now with KWinFT and sway as wlroots also supports the protocol. Having two different compositors for alternative testing will be quite helpful.

Try out the beta

If you want to try out the new beta release of Disman together with your favorite desktop environment or the KWinFT beta as a drop-in replacement for KWin you have to compile from source at the moment. For that use the Plasma/5.20 branches in the respective repositories.

For Disman there are some limited instructions on how to compile it in the Readme file.

If you have questions or just want to chat about the projects feel free to join the official KWinFT Gitter channel.

If you want to wait for the full release check back on the release date, October 13. I plan to write another article to that date that will then list all distributions where you will be able to install the KWinFT projects comfortably by package manager.

That is also a call to distro packagers: if you plan to provide packages for the KWinFT projects on October 13 get in touch to get support and be featured in the article.

Performance 2.0

In my last post, I left off with an overall 25%ish improvement in framerate for my test case:


At the end, this was an extra 3 fps over my previous test, but how did I get to this point?

The answer lies in even more unnecessary queue submission. Let’s take a look at zink’s pipe_context::set_framebuffer_state hook, which is called by gallium any time the framebuffer state changes:

static void
zink_set_framebuffer_state(struct pipe_context *pctx,
                           const struct pipe_framebuffer_state *state)
{
   struct zink_context *ctx = zink_context(pctx);
   struct zink_screen *screen = zink_screen(pctx->screen);

   util_copy_framebuffer_state(&ctx->fb_state, state);

   struct zink_framebuffer *fb = get_framebuffer(ctx);

   zink_framebuffer_reference(screen, &ctx->framebuffer, fb);
   if (ctx->gfx_pipeline_state.render_pass != fb->rp)
      ctx->gfx_pipeline_state.hash = 0;
   zink_render_pass_reference(screen, &ctx->gfx_pipeline_state.render_pass, fb->rp);

   uint8_t rast_samples = util_framebuffer_get_num_samples(state);
   /* in vulkan, gl_SampleMask needs to be explicitly ignored for sampleCount == 1 */
   if ((ctx->gfx_pipeline_state.rast_samples > 1) != (rast_samples > 1))
      ctx->dirty_shader_stages |= 1 << PIPE_SHADER_FRAGMENT;
   if (ctx->gfx_pipeline_state.rast_samples != rast_samples)
      ctx->gfx_pipeline_state.hash = 0;
   ctx->gfx_pipeline_state.rast_samples = rast_samples;
   if (ctx->gfx_pipeline_state.num_attachments != state->nr_cbufs)
      ctx->gfx_pipeline_state.hash = 0;
   ctx->gfx_pipeline_state.num_attachments = state->nr_cbufs;

   /* need to start a new renderpass */
   if (zink_curr_batch(ctx)->rp)
      flush_batch(ctx);
   struct zink_batch *batch = zink_batch_no_rp(ctx);
   zink_framebuffer_reference(screen, &batch->fb, fb);

   framebuffer_state_buffer_barriers_setup(ctx, &ctx->fb_state, zink_curr_batch(ctx));
}

Briefly: zink copies the framebuffer state; there are a number of conditions under which a new pipeline object is needed, all of which result in ctx->gfx_pipeline_state.hash = 0;. Beyond that, there’s a check for sample count changes so that the shader can be modified if necessary, and then there’s the setup for creating the Vulkan framebuffer object as well as the renderpass object in get_framebuffer().

Eagle-eyed readers will immediately spot the problem here: aside from the fact that there’s not actually any reason to be setting up the framebuffer or renderpass at this point, zink is also flushing the current batch whenever a renderpass is active.

The change I made here was to remove everything related to Vulkan from here, and move it to zink_begin_render_pass(), which is the function that the driver uses to begin a renderpass for a given batch.

This is clearly a much larger change than just removing the flush_batch() call, which might be what’s expected now that ending a renderpass no longer forces queue submission. Indeed, why haven’t I just ended the current renderpass and kept using the same batch?

The reason for this is that zink is designed in such a way that a given batch, at the time of calling vkCmdBeginRenderPass, is expected to either have no struct zink_render_pass associated with it (the batch has not performed a draw yet) or have the same object which matches the pipeline state (the batch is continuing to draw using the same renderpass). Adjusting this to be compatible with removing the flush here ends up being more code than just moving the object setup to a different spot.

So now the framebuffer and renderpass are created or pulled from their caches just prior to the vkCmdBeginRenderPass call, and a flush is removed, gaining some noticeable fps.

Going Further

Now that I’d unblocked that bottleneck, I went back to the list and checked the remaining problem areas:

  • descriptor set allocation is going to be a massive performance hit for any application which does lots of draws per frame since each draw command allocates its own (huge) descriptor set
  • the 1000 descriptor set limit is going to be hit constantly for any application which does lots of draws per frame

I decided to change things up a bit here.

This is the current way of things.

My plan was something more like this:

Where get descriptorset from program would look something like:

In this way, I’d get to conserve some sets and reuse them across draw calls even between different command buffers since I could track whether they were in use and avoid modifying them in any way. I’d also get to remove any tracking for descriptorset usage on batches, thereby removing possible queue submissions there. Any time resources in a set were destroyed, I could keep references to the sets on the resources and then invalidate the sets, returning them to the unused pool.

The results were great: [desccache1.png]

21 fps now, which is up another 3 from before.

Cache Improvements

Next I started investigating my cache implementation. There was a lot of hashing going on: I was storing both the in-use sets and the unused sets (valid and invalidated alike) keyed on the hash calculated for their descriptor usage. I decided to try moving just the invalidated sets into an array, since they no longer had a valid hash anyway, giving quicker access to sets I knew to be free.

This would also help with my next plan, but again, the results were promising:


Now I was at 23 fps, which is another 10% from the last changes, and just from removing some of the hashing going on.

This is like shooting fish in a barrel now.

Naturally at this point I began to, as all developers do, consider bucketing my allocations, since I’d seen in my profiling that some of these programs were allocating thousands of sets to submit simultaneously across hundreds of draws. I ended up using a scaling factor here so that programs would initially begin allocating in tens of sets, scaling up by a factor of ten every time it reached that threshold (i.e., once 100 sets were allocated, it begins allocating 100 at a time).
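As a rough sketch of that scaling behavior (names, structure, and exact thresholds are my own illustration here, not zink's actual code), the per-call allocation size could be computed like this:

```c
#include <assert.h>

/* Scaled bucket sizing: start allocating descriptor sets in tens, and
 * once total allocation reaches the current scale's ceiling, allocate
 * ten times as many sets per call (100 allocated -> 100 at a time). */
static unsigned
bucket_alloc_size(unsigned total_allocated)
{
   unsigned bucket = 10;
   while (total_allocated >= bucket * 10)
      bucket *= 10;
   return bucket;
}
```

The point of the scaling factor is that small programs never over-allocate, while draw-heavy programs quickly amortize the allocation calls.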

This didn’t have any discernible effect on the fps, but there were certainly fewer allocation calls going on, so I imagine the results will show up somewhere else.

And Then I Thought About It

Because sure, my efficient, possibly-overengineered descriptorset caching mechanism for reusing sets across draws and batches was cool, and it worked great, but was the overhead of all the hashing involved actually worse for performance than just using a dumb bucket allocator to set up and submit the same sets multiple times even in the same command buffer?

I’m not one of those types who refuses to acknowledge that other ideas can be better than the ones I personally prefer, so I smashed all of my descriptor info hashing out of the codebase and just used an array for storing unused sets. So now the mechanism looked like this:
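Roughly, in code form (a sketch with made-up names, not the actual zink implementation), the dumb array-based pool amounts to:

```c
#include <assert.h>
#include <stddef.h>

/* Unused descriptor sets sit in a plain array; a draw either pops one
 * or allocates fresh, no hashing involved. */
#define MAX_FREE_SETS 1000

struct set_pool {
   void *free_sets[MAX_FREE_SETS];
   unsigned num_free;
   unsigned num_allocated; /* total sets ever allocated */
};

/* get a set for the current draw: reuse one if available */
static void *
pool_get_set(struct set_pool *pool)
{
   if (pool->num_free)
      return pool->free_sets[--pool->num_free];
   pool->num_allocated++;
   return NULL; /* caller would vkAllocateDescriptorSets() here */
}

/* once the set's last use completes, return it to the array */
static void
pool_return_set(struct set_pool *pool, void *set)
{
   if (pool->num_free < MAX_FREE_SETS)
      pool->free_sets[pool->num_free++] = set;
}
```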

But would this be better than the more specific caching I was already using? Well… [desccache3.png]

24 fps, so the short of it is yes. It was 1-2 fps faster across the board.


This is where I’m at now after spending some time also rewriting all the clear code again and fixing related regressions. The benchmark is up ~70% from where I started, and the gains just keep coming. I’ll post again about performance improvements in the future, but here’s a comparison to a native GL driver, namely IRIS:


Zink is at a little under 50% of the performance here, up from around 25% when I started, though this gap still varies throughout other sections of the benchmark, dipping as low as 30% of the IRIS performance in some parts.

It’s progress.

September 24, 2020

XDC 2020 sponsors

Last week, X.Org Developers Conference 2020 was held online for the first time. This year, with all the COVID-19 situation that is affecting almost every country worldwide, the X.Org Foundation Board of Directors decided to make it virtual.

I love open-source conferences :-) They are great for networking, having fun with the rest of the community, having really good technical discussions in the hallway track… and visiting a new place every year! Unfortunately, we couldn’t do any of that this time and we needed to look for an alternative, going virtual being the obvious one.

The organization team at Intel, led by Radoslaw Szwichtenberg and Martin Peres, analyzed the different open-source alternatives for organizing XDC 2020 virtually. Finally, due to the setup requirements and the possibility of having more than 200 attendees connected to the video stream at the same time (Big Blue Button doesn’t recommend more than 100 simultaneous users), they selected Jitsi for speakers + YouTube for streaming/recording + IRC for questions. Arkadiusz Hiler summarized very well what they did from the A/V technical point of view and what the experience of hosting a virtual XDC was like.

I’m very happy with the final result given the special situation this year: the streaming was flawless, we had almost no technical issues (except one audio issue in the opening session… just like at physical ones! :-D), and IRC turned out to be very active during the conference. Thanks a lot to the organizers for their great job!

However, there is always room for improvements. Therefore, the X.Org Foundation board is asking for feedback, please share with us your opinion on XDC 2020!

Just focusing on my own experience, it was very good. I really enjoyed the talks presented this year and the interesting discussions happening on IRC. I would like to highlight the four talks presented by my colleagues at Igalia :-D

This year I was also a speaker! I presented “Improving Khronos CTS tests with Mesa code coverage” talk (watch recording), where I explained how we can improve the VK-GL-CTS quality by leveraging Mesa code coverage using open-source tools.

My XDC 2020 talk

I’m looking forward to attending X.Org Developers Conference 2021! But first, we need an organizer! Requests For Proposals for hosting XDC2021 are now open!

See you next year!

Finally, Performance

For a long time now, I’ve been writing about various work I’ve done in the course of getting to GL 4.6. This has generally been feature implementation work with an occasional side of bug hunting and fixing and so I haven’t been too concerned about performance.

I’m not done with feature work. There’s still tons of things missing that I’m planning to work on.

I’m not done with bug hunting or fixing. There’s still tons of bugs (like how currently spec@!opengl 1.1@streaming-texture-leak ooms my system and crashes all the other tests trying to run in parallel) that I’m going to fix.

But I wanted a break, and I wanted to learn some new parts of the graphics pipeline instead of just slapping more extensions in.

For the moment, I’ve been focusing on the Unigine Heaven benchmark since there’s tons of room for improvement, though I’m planning to move on from that once I get bored and/or hit a wall. Here’s my starting point, which is taken from the patch in my branch with the summary zink: add batch flag to determine if batch is currently in renderpass, some 300ish patches ahead of the main branch’s tip:


This is 14 fps running as ./heaven_x64 -project_name Heaven -data_path ../ -engine_config ../data/heaven_4.0.cfg -system_script heaven/unigine.cpp -sound_app openal -video_app opengl -video_multisample 0 -video_fullscreen 0 -video_mode 3 -extern_define ,RELEASE,LANGUAGE_EN,QUALITY_LOW,TESSELLATION_DISABLED -extern_plugin ,GPUMonitor, and I’m going to be posting screenshots from roughly the same point in the demo as I progress to gauge progress.

Is this an amazing way to do a benchmark?


Is it a quick way to determine if I’m making things better or worse right now?

Given the size of the gains I’m making, absolutely.

Let’s begin.


Now that I’ve lured everyone in with promises of gains and a screenshot with an fps counter, what I really want to talk about is code.

In order to figure out the most significant performance improvements for zink, it’s important to understand the architecture. At the point when I started, zink’s batches (the C name for an object containing a command buffer and a fence, as well as references to all the objects submitted to the queue for lifetime validation) worked like this for draw commands:

  • there are 4 batches (and 1 compute batch, but that’s out of scope)
  • each batch has 1 command buffer
  • each batch has 1 descriptor pool
  • each batch can allocate at most 1000 descriptor sets
  • each descriptor set is a fixed size of 1000 of each type of supported descriptor (UBO, SSBO, samplers, uniform and storage texel buffers, images)
  • each batch, upon reaching 1000 descriptor sets, will automatically submit its command buffer and cycle to the next batch
  • each batch has 1 fence
  • each batch, before being reused, waits on its fence and then resets its state
  • each batch, during its state reset, destroys all its allocated descriptor sets
  • renderpasses are started and ended on a given batch as needed
  • renderpasses, when ended, trigger queue submission for the given command buffer
  • renderpasses allocate their descriptor set on-demand just prior to the actual vkCmdBeginRenderPass call
  • renderpasses, during corresponding descriptor set updates, trigger memory barriers on all descriptor resources for their intended usage
  • pipelines are cached at runtime based on all the metadata used to create them, but we don’t currently do any on-disk pipeline caching
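To make the stall math from that list concrete, here's a toy model (all names invented, nothing here is zink code) of the 4-batch ring: submitting advances to the next batch, and reusing a batch whose fence is still pending counts as a stall:

```c
#include <assert.h>

/* 4-batch ring: each flush submits the current batch and cycles to the
 * next one; if that batch's work is still pending, we must wait on its
 * fence before reuse. */
#define NUM_BATCHES 4

struct toy_batch {
   int submitted;   /* stand-in for a pending fence */
};

struct toy_ctx {
   struct toy_batch batches[NUM_BATCHES];
   unsigned curr;
   unsigned fence_waits; /* how many times we stalled */
};

static void
toy_flush_batch(struct toy_ctx *ctx)
{
   ctx->batches[ctx->curr].submitted = 1;     /* vkQueueSubmit */
   ctx->curr = (ctx->curr + 1) % NUM_BATCHES;
   if (ctx->batches[ctx->curr].submitted) {
      ctx->fence_waits++;                     /* vkWaitForFences stall */
      ctx->batches[ctx->curr].submitted = 0;
   }
}
```

With one draw per batch, anything past four draws in flight stalls on every subsequent flush, which is exactly what a benchmark with hundreds of draws per frame runs into.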

This is a lot to take in, so I’ll cut to some conclusions that I drew from these points:

  • sequential draws cannot occur on a given batch unless none of the resources used by its descriptor sets require memory barriers
    • because barriers cannot be submitted during a renderpass
    • this is very unlikely
    • therefore, each batch contains exactly 1 draw command
  • each sequential draw after the fourth one causes explicit fence waiting
    • batches must reset their state before being reused, and prior to this they wait on their fence
    • there are 4 batches
    • this means that any time a frame contains more than 4 draws, zink is pausing to wait for batches to finish so it can get a valid command buffer to use
    • the Heaven benchmark contains hundreds of draw commands per frame
    • yikes
  • definitely there could be better barrier usage
    • a barrier should only be necessary for zink’s uses when: transitioning a resource from write -> read, write -> write, or changing image layouts, and for other cases we can just track the usage in order to trigger a barrier later when one of these conditions is met
  • the pipe_context::flush hook in general is very, very bad and needs to be avoided
    • we use this basically every other line like we’re building a jenga tower
    • sure would be a disaster if someone were to try removing these calls
  • probably we could do some disk caching for pipeline objects so that successive runs of applications could use pre-baked objects and avoid needing to create any
  • descriptor set allocation is going to be a massive performance hit for any application which does lots of draws per frame since each draw command allocates its own (huge) descriptor set
  • the 1000 descriptor set limit is going to be hit constantly for any application which does lots of draws per frame

There’s a lot more I could go into here, but this is already a lot.
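The barrier rule from those conclusions can be expressed as a tiny predicate (a sketch; the enum and function names are mine, not zink's):

```c
#include <assert.h>
#include <stdbool.h>

enum access { ACCESS_NONE, ACCESS_READ, ACCESS_WRITE };

/* A barrier is only needed for write -> read, write -> write, or an
 * image layout change; other transitions just accumulate tracked usage
 * so a barrier can be emitted later when one of these cases hits. */
static bool
needs_barrier(enum access prev, enum access next,
              int old_layout, int new_layout)
{
   if (old_layout != new_layout)
      return true;   /* layout transition */
   if (prev == ACCESS_WRITE &&
       (next == ACCESS_READ || next == ACCESS_WRITE))
      return true;   /* write -> read, write -> write */
   return false;     /* e.g. read -> read: no barrier, just track it */
}
```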

Removing Renderpass Submission

I decided to start here since it was easy:

diff --git a/src/gallium/drivers/zink/zink_context.c b/src/gallium/drivers/zink/zink_context.c
index a9418430bb7..f07ae658115 100644
--- a/src/gallium/drivers/zink/zink_context.c
+++ b/src/gallium/drivers/zink/zink_context.c
@@ -800,12 +800,8 @@ struct zink_batch *
 zink_batch_no_rp(struct zink_context *ctx)
 {
    struct zink_batch *batch = zink_curr_batch(ctx);
-   if (batch->in_rp) {
-      /* flush batch and get a new one */
-      flush_batch(ctx);
-      batch = zink_curr_batch(ctx);
-      assert(!batch->in_rp);
-   }
+   zink_end_render_pass(ctx, batch);
+   assert(!batch->in_rp);
    return batch;
 }

Amazing, I know. Let’s see how much the fps changes:


15?! Wait a minute. That’s basically within the margin of error!

It is actually a consistent 1-2 fps gain, even a little more in some other parts, but it seemed like it should’ve been more now that all the command buffers are being gloriously saturated, right?

Well, not exactly. Here’s a fun bit of code from the descriptor updating function:

struct zink_batch *batch = zink_batch_rp(ctx);
unsigned num_descriptors = ctx->curr_program->num_descriptors;
VkDescriptorSetLayout dsl = ctx->curr_program->dsl;

if (batch->descs_left < num_descriptors) {
   ctx->base.flush(&ctx->base, NULL, 0);
   batch = zink_batch_rp(ctx);
   assert(batch->descs_left >= num_descriptors);
}

Right. The flushing continues. And while I’m here, what does zink’s pipe_context::flush hook even look like again?

static void
zink_flush(struct pipe_context *pctx,
           struct pipe_fence_handle **pfence,
           enum pipe_flush_flags flags)
{
   struct zink_context *ctx = zink_context(pctx);

   struct zink_batch *batch = zink_curr_batch(ctx);

   /* HACK:
    * For some strange reason, we need to finish before presenting, or else
    * we start rendering on top of the back-buffer for the next frame. This
    * seems like a bug in the DRI-driver to me, because we really should
    * be properly protected by fences here, and the back-buffer should
    * either be swapped with the front-buffer, or blitted from. But for
    * some strange reason, neither of these things happen.
    */
   if (flags & PIPE_FLUSH_END_OF_FRAME)
      pctx->screen->fence_finish(pctx->screen, pctx,
                                 (struct pipe_fence_handle *)batch->fence,
                                 PIPE_TIMEOUT_INFINITE);
}

Oh. So really every time zink “finishes” a frame in this benchmark (which has already stalled hundreds of times up to this point), it then waits on that frame to finish instead of letting things outside the driver worry about that.

Borrowing Code

It was at this moment that a dim spark flickered to life in my memories, reminding me of the in-progress MR from Antonio Caggiano for caching surfaces on batches. In particular, it reminded me that his series has a patch which removes the above monstrosity.

Let’s see what happens when I add those patches in:




I expected a huge performance win here, but it seems that we still can’t fully utilize all these changes and are still stuck at 15 fps. Every time descriptors are updated, the batch ends up hitting that arbitrary 1000 descriptor set limit, and then it submits the command buffer, so there’s still multiple batches being used for each frame.

Getting Mad

So naturally at this point I tried increasing the limit.

Then I increased it again.

And again.

And now I had exactly one flush per frame, but my fps was still fixed at a measly 15.

That’s when I decided to do some desk curls.

What happened next was shocking:


18 fps.

It was a sudden 20% fps gain, but it was only the beginning.

More on this tomorrow.

September 23, 2020

Adventures In Blending

For the past few days, I’ve been trying to fix a troublesome bug. Specifically, the Unigine Heaven benchmark wasn’t drawing most textures in color, and this was hampering my ability to make further claims about zink being the fastest graphics driver in the history of software since it’s not very impressive to be posting side-by-side screenshots that look like garbage even if the FPS counter in the corner is higher.

Thus I embarked on adventure.

First: The Problem


This was the starting point. The thing ran just fine, but without valid frames drawn, it’s hard to call it a test of very much other than how many bugs I can pack into a single frame.

Naturally I assumed that this was going to be some bug in zink’s handling of something, whether it was blending, or sampling, or blending and sampling. I set out to figure out exactly how I’d screwed up.

Second: The Analysis

I had no idea what the problem was. I phoned Dr. Render, as we all do when facing issues like this, and I was told that I had problems.


Lots of problems.

The biggest problem was figuring out how to get anywhere with so many draw calls. Each frame consisted of 3 render passes (with hundreds of draws each) as well as a bunch of extra draws and clears.

There was a lot going on. This was by far the biggest thing I’d had to fix, and it’s much more difficult to debug a game-like application than it is a unit test. With that said, and since there’s not actually any documentation about “What do I do if some of my frame isn’t drawing with color?” for people working on drivers, here were some of the things I looked into:

  • Disabling Depth Testing

Just to check. On IRIS, which is my reference for these types of things, the change gave some neat results:


How bout that.

On zink, I got the same thing, except there was no color, and it wasn’t very interesting.

  • Checking sampler resources for depth buffers

On an #executive suggestion, I looked into whether a z/s buffer had snuck into my sampler buffers and was thus providing bogus pixel data.

It hadn’t.

  • Checking Fragment Shader Outputs

This was a runtime version of my usual shader debugging, wherein I try to isolate the pixels in a region to a specific color based on a conditional, which then lets me determine which path in the shader is broken. To do this, I added a helper function in ntv:

static SpvId
clobber(struct ntv_context *ctx)
{
   SpvId type = get_fvec_type(ctx, 32, 4);
   SpvId vals[] = {
      emit_float_const(ctx, 32, 1.0),
      emit_float_const(ctx, 32, 0.0),
      emit_float_const(ctx, 32, 0.0),
      emit_float_const(ctx, 32, 1.0)
   };
   return spirv_builder_emit_composite_construct(&ctx->builder, type, vals, 4);
}

This returns a vec4 of the color RED, and I cleverly stuck it at the end of emit_store_deref() like so:

if (ctx->stage == MESA_SHADER_FRAGMENT && var->data.location == FRAG_RESULT_DATA0 && match)
   result = clobber(ctx);

match in this case is set based on this small block at the very start of ntv:

if (s->info.stage == MESA_SHADER_FRAGMENT) {
   const char *env = getenv("TEST_SHADER");
   match = env && s->info.name && !strcmp(s->info.name, env);
}

Thus, I could set my environment in gdb with e.g., set env TEST_SHADER=GLSL271 and then zink would swap the output of the fragment shader named GLSL271 to RED, which let me determine what various shaders were being used for. When I found the shader used for the lamps, things got LIT:


But ultimately, even though I did find the shaders that were being used for the more general material draws, this ended up being another dead end.

  • Verifying Blend States

This took me the longest since I had to figure out a way to match up the Dr. Render states to the runtime states that I could see. I eventually settled on adding breakpoints based on index buffer size, as the chart provided by Dr. Render had this in the vertex state, which made things simple.

But alas, zink was doing all the right blending too.

  • Complaining

As usual, this was my last resort, but it was also my most powerful weapon that I couldn’t abuse too frequently, lest people come to the conclusion that I don’t actually know what I’m doing.

Which I definitely do.

And now that I’ve cleared up any misunderstandings there, I’m not ashamed to say that I went to #intel-3d to complain that Dr. Render wasn’t giving me any useful draw output for most of the draws under IRIS. If even zink can get some pixels out of a draw, then a more compliant driver like IRIS shouldn’t be having issues here.

I wasn’t wrong.

The Magic Of Dual Blending

It turns out that the Heaven benchmark is buggy and expects the D3D semantics for dual blending. Mesa knows this and lets drivers enable a workaround if they need it, specifically dual_color_blend_by_location=true, which informs the driver that it needs to adjust the Location and Index of gl_FragData[1] from D3D semantics to OpenGL/Vulkan ones.

As usual, the folks at Intel with their encyclopedic knowledge were quick to point out the exact problem, which then just left me with the relatively simple tasks of:

  • hooking zink up to the driconf build
  • checking driconf at startup so zink can get info on these application workarounds/overrides
  • adding shader keys for forcing the dual blending workaround
  • writing a NIR pass to do the actual work

The result is not that interesting, but here it is anyway:

static bool
lower_dual_blend(nir_shader *shader)
{
   bool progress = false;
   nir_variable *var = nir_find_variable_with_location(shader, nir_var_shader_out, FRAG_RESULT_DATA1);
   if (var) {
      var->data.location = FRAG_RESULT_DATA0;
      var->data.index = 1;
      progress = true;
   }
   return progress;
}

In short, D3D expects to blend two outputs based on their locations, but in Vulkan and OpenGL, the blending is based on index. So here, I’ve just changed the location of gl_FragData[1] to match gl_FragData[0] and then incremented the index, because Fragment outputs identified with an Index of zero are directed to the first input of the blending unit associated with the corresponding Location. Outputs identified with an Index of one are directed to the second input of the corresponding blending unit.

And now nice things can be had:


Tune in tomorrow when I strap zink to a rocket and begin counting down to blastoff.

September 21, 2020


In Vulkan, a pipeline object is bound to the graphics pipeline for a given command buffer when a draw is about to take place. This pipeline object contains information about the draw state, and any time that state changes, a different pipeline object must be created/bound.

This is expensive.

Some time ago, Antonio Caggiano did some work to cache pipeline objects, which lets zink reuse them once they’re created. This was great, because creating Vulkan objects is very costly, and we want to always be reusing objects whenever possible.

Unfortunately, the core Vulkan spec has the number of viewports and scissor regions as both being part of the pipeline state, which means any time either one changes the number of regions (though both viewport and scissor region counts are the same for our purposes), we need a new pipeline.

Extensions to the Rescue

VK_EXT_extended_dynamic_state adds functionality to avoid this performance issue. When supported, the pipeline object is created with zero as the count of viewport and scissor regions, and then vkCmdSetViewportWithCountEXT and vkCmdSetScissorWithCountEXT can be called just before draw to ram these state updates into the command buffer without ever needing a different pipeline object.
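The effect on pipeline caching can be illustrated with a toy key comparison (all names here are my own, not zink's): without the extension, a viewport/scissor count change is a cache miss; with it, the count drops out of the key entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal stand-in for a cached pipeline state key. */
struct toy_pipeline_key {
   unsigned num_viewports; /* left at 0 when counts are dynamic */
   /* ...all the other baked-in draw state... */
};

static bool
toy_key_equal(const struct toy_pipeline_key *a,
              const struct toy_pipeline_key *b,
              bool have_dynamic_state)
{
   if (!have_dynamic_state && a->num_viewports != b->num_viewports)
      return false; /* count change -> new pipeline object needed */
   return true;     /* counts get set at draw time via
                     * vkCmdSetViewportWithCountEXT instead */
}
```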

Longer posts later in the week; I’m in the middle of a construction zone for the next few days, and it’s less hospitable than I anticipated.

September 18, 2020

Blog Returns

Once again, I ended up not blogging for most of the week. When this happens, there’s one of two possibilities: I’m either taking a break or I’m so deep into some code that I’ve forgotten about everything else in my life including sleep.

This time was the latter. I delved into the deepest parts of zink and discovered that the driver is, in fact, functioning only through a combination of sheer luck and a truly unbelievable amount of driver stalls that provide enough forced synchronization and slow things down enough that we don’t explode into a flaming mess every other frame.


I’ve fixed all of the crazy things I found, and, in the process, made some sizable performance gains that I’m planning to spend a while blogging about in considerable depth next week.

And when I say sizable, I’m talking in the range of 50-100% fps gains.

But it’s Friday, and I’m sure nobody wants to just see numbers or benchmarks. Let’s get into something that’s interesting on a technical level.


Yes, samplers.

In Vulkan, samplers have a lot of rules to follow. Specifically, I’m going to be examining part of the spec that states “If a VkImageView is sampled with VK_FILTER_LINEAR as a result of this command, then the image view’s format features must contain VK_FORMAT_FEATURE_SAMPLED_IMAGE_FILTER_LINEAR_BIT”.

This is a problem for zink. Gallium gives us info about the sampler in the struct pipe_context::create_sampler_state hook, but the created sampler won’t actually be used until draw time. As a result, there’s no way to know which image is going to be sampled, and thus there’s no way to know what features the sampled image’s format flags will contain. This only becomes known at the time of draw.

The way I saw it, there were two options:

  • Dynamically create the sampler just before draw and swizzle between LINEAR and NEAREST based on the format features
  • Create both samplers immediately when LINEAR is passed and swizzle between them at draw time

In theory, the first option is probably more performant in the best case scenario where a sampler is only ever used with a single image, as it would then only ever create a single sampler object.

Unfortunately, this isn’t realistic. Just as an example, u_blitter creates a number of samplers up front, and then it also makes assumptions about filtering based on ideal operations which may not be in sync with the underlying Vulkan driver’s capabilities. So for these persistent samplers, the first option may initially allow the sampler to be created with LINEAR filtering, but it may later then be used for an image which can’t support it.

So I went with the second option. Now any time a LINEAR sampler is created by gallium, we’re actually creating both types so that the appropriate one can be used, ensuring that we can always comply with the spec and avoid any driver issues.
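A minimal sketch of that draw-time selection (structure and names are illustrative, not the actual zink code) looks like:

```c
#include <assert.h>
#include <stdbool.h>

/* When gallium asks for a LINEAR sampler, create a NEAREST twin too,
 * then pick at draw time based on the sampled image's format features. */
struct sampler_pair {
   int linear;   /* stand-in for a VkSampler with VK_FILTER_LINEAR */
   int nearest;  /* stand-in for a VkSampler with VK_FILTER_NEAREST */
};

static int
get_sampler(const struct sampler_pair *pair, bool format_supports_linear)
{
   /* the real check is against
    * VK_FORMAT_FEATURE_SAMPLED_IMAGE_FILTER_LINEAR_BIT */
   return format_supports_linear ? pair->linear : pair->nearest;
}
```

This costs an extra sampler object up front but guarantees the spec rule is never violated, regardless of which image the sampler ends up paired with.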


September 14, 2020

Hoo Boy

Let’s talk about ARB_shader_draw_parameters. Specifically, let’s look at gl_BaseVertex.

In OpenGL, this shader variable’s value depends on the parameters passed to the draw command, and the value is always zero if the command has no base vertex.

In Vulkan, the value here is only zero if the first vertex is zero.

The difference here means that for arrayed draws without base vertex parameters, GL always expects zero, and Vulkan expects first vertex.
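As a toy model of the mismatch (function names are mine), here is what each API reports for the base-vertex builtin:

```c
#include <assert.h>
#include <stdbool.h>

/* GL: gl_BaseVertex is zero unless the draw supplies a base vertex */
static int
gl_base_vertex(bool indexed, int base_vertex)
{
   return indexed ? base_vertex : 0;
}

/* Vulkan: the BaseVertex builtin yields firstVertex for non-indexed
 * draws, so it is only zero when the first vertex is zero */
static int
vk_base_vertex(bool indexed, int base_vertex, int first_vertex)
{
   return indexed ? base_vertex : first_vertex;
}
```

The two only disagree on non-indexed draws with a nonzero first vertex, which is precisely the case a lowering pass has to patch up.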



The easiest solution here would be to just throw a shader key at the problem, producing variants of the shader for use with indexed vs non-indexed draws, and using NIR passes to modify the variables for the non-indexed case and zero the value. It’s quick, it’s easy, and it’s not especially great for performance since it requires compiling the shader multiple times and creating multiple pipeline objects.

This is where push constants come in handy once more.

Avid readers of the blog will recall the last time I used push constants was for TCS injection when I needed to generate my own TCS and have it read the default inner/outer tessellation levels out of a push constant.

Since then, I’ve created a struct to track the layout of the push constant:

struct zink_push_constant {
   unsigned draw_mode_is_indexed;
   float default_inner_level[2];
   float default_outer_level[4];
};

Now just before draw, I update the push constant value for draw_mode_is_indexed:

if (ctx->gfx_stages[PIPE_SHADER_VERTEX]->nir->info.system_values_read & (1ull << SYSTEM_VALUE_BASE_VERTEX)) {
   unsigned draw_mode_is_indexed = dinfo->index_size > 0;
   vkCmdPushConstants(batch->cmdbuf, gfx_program->layout, VK_SHADER_STAGE_VERTEX_BIT,
                      offsetof(struct zink_push_constant, draw_mode_is_indexed), sizeof(unsigned),
                      &draw_mode_is_indexed);
}

And now the shader can be made aware of whether the draw mode is indexed.

Now comes the NIR, as is the case for most of this type of work.

static bool
lower_draw_params(nir_shader *shader)
{
   if (shader->info.stage != MESA_SHADER_VERTEX)
      return false;

   if (!(shader->info.system_values_read & (1ull << SYSTEM_VALUE_BASE_VERTEX)))
      return false;

   return nir_shader_instructions_pass(shader, lower_draw_params_instr, nir_metadata_dominance, NULL);
}

This is the future, so I’m now using Eric Anholt’s recent helper function to skip past iterating over the shader’s function/blocks/instructions, instead just passing the lowering implementation as a parameter and letting the helper create the nir_builder for me.

static bool
lower_draw_params_instr(nir_builder *b, nir_instr *in, void *data)
{
   if (in->type != nir_instr_type_intrinsic)
      return false;
   nir_intrinsic_instr *instr = nir_instr_as_intrinsic(in);
   if (instr->intrinsic != nir_intrinsic_load_base_vertex)
      return false;

I’m filtering out everything except for nir_intrinsic_load_base_vertex here, which is the instruction for loading gl_BaseVertex.

   b->cursor = nir_after_instr(&instr->instr);

I’m modifying instructions after this one, so I set the cursor after.

   nir_intrinsic_instr *load = nir_intrinsic_instr_create(b->shader, nir_intrinsic_load_push_constant);
   load->src[0] = nir_src_for_ssa(nir_imm_int(b, 0));
   nir_intrinsic_set_range(load, 4);
   load->num_components = 1;
   nir_ssa_dest_init(&load->instr, &load->dest, 1, 32, "draw_mode_is_indexed");
   nir_builder_instr_insert(b, &load->instr);

I’m loading the first 4 bytes of the push constant variable that I created according to my struct, which is the draw_mode_is_indexed value.

   nir_ssa_def *composite = nir_build_alu(b, nir_op_bcsel,
                                          nir_build_alu(b, nir_op_ieq, &load->dest.ssa, nir_imm_int(b, 1), NULL, NULL),
                                          &instr->dest.ssa,
                                          nir_imm_int(b, 0), NULL);

This adds a new ALU instruction of type bcsel, AKA the ternary operator (condition ? true : false). The condition here is another ALU of type ieq, AKA integer equals, and I’m testing whether the loaded push constant value is equal to 1. If true, this is an indexed draw, so I continue using the loaded gl_BaseVertex value. If false, this is not an indexed draw, so I need to use zero instead.

   nir_ssa_def_rewrite_uses_after(&instr->dest.ssa, nir_src_for_ssa(composite), composite->parent_instr);

With my bcsel composite gl_BaseVertex value constructed, I can now rewrite all subsequent uses of gl_BaseVertex in the shader to use the composite value, which will automatically swap between the Vulkan gl_BaseVertex and zero based on the value of the push constant without the need to rebuild the shader or make a new pipeline.

   return true;
}

And now the shader gets the expected value and everything works.

Billy Mays

It’s also worth pointing out here that gl_DrawID from the same extension has a similar problem: gallium doesn’t pass multidraws in full to the driver, instead iterating for each draw, which means that the shader value is never what’s expected either. I’ve employed a similar trick to jam the draw index into the push constant and read that back in the shader to get the expected value there too.


September 10, 2020

In an ideal world, every frame your application draws would appear on the screen exactly on time. Sadly, as anyone living in the year 2020 CE can attest, this is far from an ideal world. Sometimes the scene gets more complicated and takes longer to draw than you estimated, and sometimes the OS scheduler just decides it has more important things to do than pay attention to you.

When this happens, for some applications, it would be best if you could just get the bits on the screen as fast as possible rather than wait for the next vsync. The Present extension for X11 has an option to let you do exactly this:

If 'options' contains PresentOptionAsync, and the 'target-msc'
is less than or equal to the current msc for 'window', then
the operation will be performed as soon as possible, not
necessarily waiting for the next vertical blank interval. 

But you don't usually use Present directly; usually Present is the mechanism by which GLX and Vulkan put bits on the screen. So, today I merged some code into Mesa to enable the corresponding features in those APIs, namely GLX_EXT_swap_control_tear and VK_PRESENT_MODE_FIFO_RELAXED_KHR. If all goes well these should be included in Mesa 21.0, with a backport to 20.2.x not out of the question. As the GLX extension name suggests, this can introduce some visual tearing when the buffer swap comes in late, but for fullscreen games or VR displays that can be an acceptable tradeoff in exchange for reduced stuttering.

Despite what this might look like, I don't actually enjoy starting new projects: it's a lot easier to clean up some build warnings, or add a CI, than it is to start from an empty directory.

But sometimes needs must, and I've just released version 0.1 of such a project. Below you'll find an excerpt from the README, which should answer most of the questions. Please read the README directly in the repository if you're getting to this blog post more than a couple of days after it was first published.

Feel free to file new issues in the tracker if you have ideas for possible power-saving or performance enhancements. Currently the only "Performance" mode supported interacts with Intel CPUs that have P-State support. More hardware support is planned.

TL;DR: this setting will be in the GNOME 3.40 development branch soon, Fedora packages are done, and API docs are available:



From the README:


power-profiles-daemon offers to modify system behaviour based upon user-selected power profiles. There are 3 different power profiles, a "balanced" default mode, a "power-saver" mode, as well as a "performance" mode. The first 2 of those are available on every system. The "performance" mode is only available on select systems and is implemented by different "drivers" based on the system or systems it targets.

In addition to those 2 or 3 modes (depending on the system), "actions" can be hooked up to change the behaviour of a particular device. For example, this can be used to disable fast-charging for some USB devices when in power-saver mode.

GNOME's Settings and shell both include interfaces to select the current mode, but they are also expected to adjust the behaviour of the desktop depending on the mode, such as turning the screen off after inaction more aggressively when in power-saver mode.

Note that power-profiles-daemon does not save the currently active profile across system restarts and will always start with the "balanced" profile selected.

Why power-profiles-daemon

The power-profiles-daemon project was created to help provide a solution for two separate use cases on desktops, laptops, and other devices running a "traditional Linux desktop".

The first one is a "Low Power" mode, that users could toggle themselves, or have the system toggle for them, with the intent to save battery. Mobile devices running iOS and Android have had a similar feature available to end-users and application developers alike.

The second use case was to allow a "Performance" mode on systems where the hardware maker would provide and design such a mode. The idea is that the Linux kernel would provide a way to access this mode which usually only exists as a configuration option in some machines' "UEFI Setup" screen.

This second use case is the reason why we didn't implement the "Low Power" mode in UPower, as was originally discussed.

As the daemon would change kernel settings, we would need to run it as root, and make its API available over D-Bus, as has been customary for more than 10 years. We would also design that API to be as easily usable to build graphical interfaces as possible.

Why not...

This section will contain explanations of why this new daemon was written rather than re-using, or modifying an existing one. Each project obviously has its own goals and needs, and those comparisons are not meant as a slight on the project.

As the code bases for both the projects listed and power-profiles-daemon are ever-evolving, the comments were understood to be correct when made.


thermald

thermald only works on Intel CPUs, and is very focused on allowing maximum performance based on a "maximum temperature" for the system. As such, it could be seen as complementary to power-profiles-daemon.

tuned and TLP

Both projects have similar goals, allowing for tweaks to be applied, for a variety of workloads that go far beyond the workloads and use cases that power-profiles-daemon targets.

A fair number of the tweaks that could apply to devices running GNOME or another free desktop are either potentially destructive (eg. some of the SATA power-saving mode resulting in corrupted data), or working well enough to be put into place by default (eg. audio codec power-saving), even if we need to disable the power saving on some hardware that reacts badly to it.

Both are good projects to use for the purpose of experimenting with particular settings to see if they'd be something that can be implemented by default, or to put some fine-grained, static, policies in place on server-type workloads which are not as fluid and changing as desktop workloads can be.


It doesn't take user-intent into account, doesn't have a D-Bus interface and seems to want to work automatically by monitoring the CPU usage, which kind of goes against a user's wishes as a user might still want to conserve as much energy as possible under high-CPU usage.

Over the past couple of (gasp!) decades, I've had my fair share of release blunders: forgetting to clean the tree before making a tarball by hand, forgetting to update the NEWS file, forgetting to push after creating the tarball locally, forgetting to update the appdata file (causing problems on Flathub)...

That's where this new tool comes in, replacing the check-news function of the autotools. Ideally you would:

- make sure your CI runs a dist job

- always use a merge request to do releases

- integrate it into your meson build (though I would relax the appdata checks for devel releases)

September 09, 2020

Long, Long Day

I fell into the abyss of query code again, so this is just a short post to (briefly) touch on the adventure that is pipeline statistics queries.

Pipeline statistics queries include a collection of statistics about things that happened while the query was active, such as the count of shader invocations.

Thankfully, these weren’t too difficult to plug into the ever-evolving zink query architecture, as my previous adventure into QBOs (more on this another time) ended up breaking out a lot of helper functions for various things that simplified the process.


The first step, as always, is figuring out how to map the gallium API to Vulkan. The table below handles the basic conversion for single query types:

unsigned map[] = {
   /* PIPE_STAT_QUERY_* enums mapped to the matching VK_QUERY_PIPELINE_STATISTIC_*_BIT flags */
};

Pretty straightforward.

With this in place, it’s worth mentioning that OpenGL has facilities for performing either “all” statistics queries, which include all the above types, or “single” statistics queries, which is just one of the types.

Building on that, I’m going to reach way back to the original loop that I’ve been using for handling query results. That’s now its own helper function called check_query_results(). It’s invoked from get_query_result() like so:

int result_size = 1;
if (query->type == PIPE_QUERY_PRIMITIVES_GENERATED ||
    query->type == PIPE_QUERY_PRIMITIVES_EMITTED)
   /* these query types emit 2 values */
   result_size = 2;
else if (query->type == PIPE_QUERY_PIPELINE_STATISTICS)
   result_size = 11;

/* grab the results one query at a time */
int num_results = 1;
for (unsigned last_start = query->last_start; last_start + num_results <= query->curr_query; last_start++) {
   /* verify that we have the expected number of results pending */
   assert(num_results <= ARRAY_SIZE(results) / result_size);
   VkResult status = vkGetQueryPoolResults(screen->dev, query->query_pool,
                                           last_start, num_results,
                                           sizeof(results), results,
                                           sizeof(uint64_t),
                                           flags | VK_QUERY_RESULT_64_BIT);
   if (status != VK_SUCCESS)
      return false;

   if (query->type == PIPE_QUERY_PRIMITIVES_GENERATED) {
      status = vkGetQueryPoolResults(screen->dev, query->xfb_query_pool[0],
                                     last_start, num_results,
                                     sizeof(xfb_results), xfb_results,
                                     2 * sizeof(uint64_t),
                                     flags | VK_QUERY_RESULT_64_BIT);
      if (status != VK_SUCCESS)
         return false;
   }

   check_query_results(query, result, num_results, result_size, results, xfb_results);
}

Beautiful, isn’t it?

For the case of any xfb-related query, 2 result values are returned (and then also PRIMITIVES_GENERATED gets its own separate xfb pool for correctness), while most others only get 1 result value.

And then there’s PIPELINE_STATISTICS, which gets eleven. Since this is a weird result size to handle, I decided to just grab the results for each query one at a time in order to avoid any kind of craziness with allocating enough memory for the actual result values. Plenty of time for that later.

The handling for the results is fun too:

   uint64_t *statistics = (uint64_t*)&result->pipeline_statistics;
   for (unsigned j = 0; j < 11; j++)
      statistics[j] += results[i + j];

Each bit set in the query object returns its own query result, so there are 11 of them that need to be accumulated for each query result.
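The accumulation scheme is simple enough to model on its own. A sketch, assuming the results come back as a flat array of 11 counters per query (the function name is mine, not the driver's):

```python
def accumulate_statistics(results, result_size=11):
    """Sum each of the 11 pipeline-statistics counters across query results."""
    statistics = [0] * result_size
    # results is a flat array: result_size counters per retrieved query
    for i in range(0, len(results), result_size):
        for j in range(result_size):
            statistics[j] += results[i + j]
    return statistics
```

Two identical query results simply double every counter, which is exactly what the loop in the driver does.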

Surprisingly though, that’s the craziest part of implementing this. Not really much compared to some of the other insanity.


The other day, I made wild claims about zink being the fastest graphics driver in history.

I’m not going to say they were inaccurate.

What I am going to say is that I had a great chat with well-known, long-time mesa developer Kenneth Graunke, who pointed out that most of the time differences in the softfp64 tests were due to NIR optimizations related to loop unrolling and the different settings used between IRIS and zink there.

He also helpfully pointed out that I forgot to set a number of important pipe caps, including the one that holds the value for gl_MaxVaryingComponents. As a result, zink has been advertising a very illegal and too-small number of varyings, which made the shaders smaller, which made us faster.

This has since been fixed, and zink is now fully compliant with the required GL specs for this value.


This has only increased the time of the one test I showcased from ~2 seconds to ~12 seconds.

Due to different mechanics between GLSL and SPIRV shader construction, all XFB outputs have to be explicitly specified when they’re converted in zink, and they count against the regular output limits. As a result, it’s necessary to reserve half of the driver’s advertised vertex outputs for potential XFB usage. This leaves us with much fewer available output slots for user shaders in order to preserve XFB functionality.

September 08, 2020

This is going to be a short post, as changes to Videos have been few and far between in the past couple of releases.

The major change to the latest release is that we've gained Tracker 3 support through a grilo plugin (which meant very few changes to our own code). But the Tracker 3 libraries are incompatible with the Tracker 2 daemon that's usually shipped in distributions, including on this author's development system.

So we made use of the ability of Tracker to run inside a Flatpak sandbox along with the video player, removing the need to have Tracker installed by the distribution, on the host. This should also make it easier to give users control of the directories they want to use to store their movies, in the future.

The release candidate for GNOME 3.38 is available right now as the stable version on Flathub.

September 07, 2020

So here is a new update on the evolution of the Vulkan driver for the rpi4 (Broadcom GPU).


Since my last update we finished the support for two features: robust buffer access and multisampling.

Robust buffer access is a feature that allows specifying that accesses to buffers are bounds-checked against the range of the buffer descriptor. Usually this is used as a debug tool during development, but disabled in release builds (this is explained in more detail in this ARM guide). So sorry, no screenshot here.

In my last update I mentioned that we had started the support for multisampling, enough to get some demos working. Since then we were able to finish the rest of the multisampling support, and even implemented the optional sample rate shading feature. So now the following Sascha Willems demo is working:

Sascha Willems deferred multisampling demo run on rpi4


Taking into account that most of the features needed to support Vulkan core 1.0 are implemented now, a lot of the effort since the last update went into bugfixing, focusing on the specifics of the driver. Our main reference for this is the Vulkan CTS, the official Khronos testsuite for Vulkan and OpenGL.

As usual, here are some screenshots from the nice Sascha Willems demos, showing demos that were failing when I wrote the last update and are working now thanks to the bugfixing work.

Sascha Willems hdr demo run on rpi4

Sascha Willems gltf skinning demo run on rpi4


At this point there are no major features pending to fulfill support for Vulkan core 1.0, so our focus will be on passing all the Vulkan CTS tests.

Previous updates

Just in case you missed any of the updates of the vulkan driver so far:

Vulkan raspberry pi first triangle
Vulkan update now with added source code
v3dv status update 2020-07-01
V3DV Vulkan driver update: VkQuake1-3 now working
v3dv status update 2020-07-31

Just For Fun

I finally managed to get a complete piglit run over the weekend, and, for my own amusement, I decided to check the timediffs against a reference run from the IRIS driver. Given that the Intel drivers are of extremely high quality (and are direct interfaces to the underlying hardware that I happen to be using), I tend to use ANV and IRIS as my references whenever I’m trying to debug things.

Both runs used the same base checkout from mesa, so all the core/gallium/nir parts were identical.

The results weren’t what I expected.

My expectation when I clicked into the timediffs page was that zink would be massively slower in a huge number of tests, likely to a staggering degree in some cases.

We were, but then also on occasion we weren’t.

As a final disclaimer before I dive into this, I feel like given the current state of people potentially rushing to conclusions I need to say that I’m not claiming zink is faster than a native GL driver, only that for some cases, our performance is oddly better than I expected.

The Good


The first thing to take note of here is that IRIS is massively better than zink in successful test completion, with a near-perfect 99.4% pass rate compared to zink’s measly 91%, and that’s across 2500 more tests too. This is important also since timediff only compares between passing tests.

With that said, somehow zink’s codepath is significantly faster when it comes to dealing with high numbers of varying outputs, and also, weirdly, a bunch of dmat4 tests, even though they’re both using the same softfp64 path since my icelake hardware doesn’t support native 64bit operations.

I was skeptical about some of the numbers here, particularly the ext_transform_feedback max-varying-arrays-of-arrays cases, but manual tests were even weirder:

time MESA_GLSL_CACHE_DISABLE=true MESA_LOADER_DRIVER_OVERRIDE=zink bin/ext_transform_feedback-max-varyings -auto -fbo

MESA_GLSL_CACHE_DISABLE=true MESA_LOADER_DRIVER_OVERRIDE=zink bin/ext_transform_feedback-max-varyings -auto -fbo  2.13s user 0.03s system 98% cpu 2.197 total
time MESA_GLSL_CACHE_DISABLE=true MESA_LOADER_DRIVER_OVERRIDE=iris bin/ext_transform_feedback-max-varyings -auto -fbo

MESA_GLSL_CACHE_DISABLE=true MESA_LOADER_DRIVER_OVERRIDE=iris bin/ext_transform_feedback-max-varyings -auto -fbo  301.64s user 0.52s system 99% cpu 5:02.45 total


I don’t have a good explanation for this since I haven’t dug into it other than to speculate that ANV is just massively better at handling large numbers of varying outputs.

The Bad

By contrast, zink gets thrashed pretty decisively in arb_map_buffer_alignment-map-invalidate-range, and we’re about 150x slower.

Yikes. Looks like that’s going to be a target for some work since potentially an application might hit that codepath.

The Weird


Somehow, zink is noticeably slower in a bunch of other fp64 tests (and this isn’t the full list, only a little over half). It’s strange to me that zink can perform better in certain fp64 cases but then also worse in others, but I’m assuming this is just the result of different shader optimizations happening between the drivers, shifting them onto slightly less slow parts of the softfp64 codepath in certain cases.

Possibly something to look into.

Probably not in too much depth since softfp64 is some pretty crazy stuff.

In Closing

Tests (and especially piglit ones) are not indicative of real world performance.

Exposure Notifications is a protocol developed by Apple and Google for facilitating COVID-19 contact tracing on mobile phones by exchanging codes with nearby phones over Bluetooth, implemented within the Android and iOS operating systems, now available here in Toronto.

Wait – phones? Android and iOS only? Can’t my Debian laptop participate? It has a recent Bluetooth chip. What about phones running GNU/Linux distributions like the PinePhone or Librem 5?

Exposure Notifications breaks down neatly into three sections: a Bluetooth layer, some cryptography, and integration with local public health authorities. Linux is up to the task, via BlueZ, OpenSSL, and some Python.

Given my background, will this turn out to be a reverse-engineering epic resulting in a novel open stack for a closed system?

Not at all. The specifications for Exposure Notifications are available for both the Bluetooth protocol and the underlying cryptography. A partial reference implementation is available for Android, as is an independent Android implementation in microG. In Canada, the key servers run an open source stack originally built by Shopify and now maintained by the Canadian Digital Service, including open protocol documentation.

All in all, this is looking to be a smooth-sailing weekend project.

The devil’s in the details.


Exposure Notifications operates via Bluetooth Low Energy “advertisements”. Scanning for other devices is as simple as scanning for advertisements, and broadcasting is as simple as advertising ourselves.

On an Android phone, this is handled deep within Google Play Services. Can we drive the protocol from userspace on a regular GNU/Linux laptop? It depends. Not all laptops support Bluetooth, not all Bluetooth implementations support Bluetooth Low Energy, and I hear not all Bluetooth Low Energy implementations properly support undirected transmissions (“advertising”).

Luckily in my case, I develop on a Debianized Chromebook with a Wi-Fi/Bluetooth module. I'd never used the Bluetooth, but it turns out the module has full support for advertisements, verified with the lescan (Low Energy Scan) command of the hcitool Bluetooth utility.

hcitool is a part of BlueZ, the standard Linux library for Bluetooth. Since lescan is able to detect nearby phones running Exposure Notifications, poring over its source code is a good first step to our implementation. With some minor changes to hcitool to dump packets as raw hex and to filter for the Exposure Notifications protocol, we can print all nearby Exposure Notifications advertisements. So far, so good.

That’s about where the good ends.

While scanning is simple with reference code in hcitool, advertising is complicated by BlueZ’s lack of an interface at the time of writing. While a general “enable advertising” routine exists, routines to set advertising parameters and data per the Exposure Notifications specification are unavailable. This is not a showstopper, since BlueZ is itself an open source userspace library. We can drive the Bluetooth module the same way BlueZ does internally, filling in the necessary gaps in the API, while continuing to use BlueZ for the heavy-lifting.

Some care is needed to multiplex scanning and advertising within a single thread while remaining power efficient. The key is that advertising, once configured, is handled entirely in hardware without CPU intervention. On the other hand, scanning does require CPU involvement, but it is not necessary to scan continuously. Since COVID-19 is thought to transmit from sustained exposure, we only need to scan every few minutes. (Food for thought: how does this connect to the sampling theorem?)

Thus we can order our operations as:

  • Configure advertising
  • Scan for devices
  • Wait for several minutes
  • Repeat.
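The loop above can be sketched with injected stubs for the hardware-facing calls (the function names here are illustrative, not BlueZ APIs):

```python
def tracing_loop(configure_advertising, scan_once, sleep, cycles):
    """Sketch of the scan/advertise schedule for the daemon."""
    for i in range(cycles):
        if i % 3 == 0:
            # Reconfiguring advertising every few wakeups also rotates the
            # random Bluetooth address to prevent tracking.
            configure_advertising()
        scan_once()        # brief CPU-driven scan for nearby advertisements
        sleep(5 * 60)      # then sleep; advertising continues in hardware

# Dry-run with stubs to show the call order over three wakeups:
calls = []
tracing_loop(lambda: calls.append("adv"),
             lambda: calls.append("scan"),
             lambda seconds: calls.append("sleep"),
             cycles=3)
```

Injecting the sleep function keeps the schedule trivially unit-testable, which matters more than usual when the daemon is supposed to run unattended for weeks.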

Since most of the time the program is asleep, this loop is efficient. It additionally allows us to reconfigure advertising every ten to fifteen minutes, in order to change the Bluetooth address to prevent tracking.

All of the above amounts to a few hundred lines of C code, treating the Exposure Notifications packets themselves as opaque random data.


Yet the data is far from random; it is the result of a series of operations in terms of secret keys defined by the Exposure Notifications cryptography specification. Every day, a “temporary exposure key” is generated, from which a “rolling proximity identifier key” and an “associated encrypted metadata key” are derived. These are used to generate a “rolling proximity identifier” and the “associated encrypted metadata”, which are advertised over Bluetooth and changed in lockstep with the Bluetooth random addresses.

There are lots of moving parts to get right, but each derivation reuses a common encryption primitive: HKDF-SHA256 for key derivation, AES-128 for the rolling proximity identifier, and AES-128-CTR for the associated encrypted metadata. Ideally, we would grab a state-of-the-art library of cryptography primitives like NaCl or libsodium and wire everything up.
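For the key-derivation step, HKDF-SHA256 is small enough to write against the Python standard library alone. A sketch per RFC 5869; the "EN-RPIK" and "EN-AEMK" info strings come from the Exposure Notifications cryptography specification, and the all-zero TEK is a placeholder:

```python
import hashlib
import hmac

def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int) -> bytes:
    """RFC 5869 HKDF with SHA-256: extract a PRK, then expand it."""
    # An empty salt defaults to HashLen zero bytes per the RFC
    prk = hmac.new(salt if salt else b"\x00" * 32, ikm, hashlib.sha256).digest()
    okm, block = b"", b""
    counter = 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]

# Derive both subkeys from a daily temporary exposure key (TEK):
tek = bytes(16)  # placeholder; a real TEK is 16 random bytes generated daily
rpik = hkdf_sha256(tek, b"", b"EN-RPIK", 16)  # rolling proximity identifier key
aemk = hkdf_sha256(tek, b"", b"EN-AEMK", 16)  # associated encrypted metadata key
```

The two AES steps still need a real cipher implementation, which is where the library shopping below begins.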

First, some good news: once these routines are written, we can reliably unit test them. Though the specification states that “test vectors… are available upon request”, it isn’t clear who to request from. But Google’s reference implementation is itself unit-tested, and sure enough, it contains a file, from which we can grab the vectors for a complete set of unit tests.

After patting ourselves on the back for writing unit tests, we’ll need to pick a library to implement the cryptography. Suppose we try NaCl first. We’ll quickly realize the primitives we need are missing, so we move on to libsodium, which is backwards-compatible with NaCl. For a moment, this will work – libsodium has upstream support for HKDF-SHA256. Unfortunately, the version of libsodium shipping in Debian testing is too old for HKDF-SHA256. Not a big problem – we can backport the implementation, written in terms of the underlying HMAC-SHA256 operations, and move on to the AES.

AES is a standard symmetric cipher, so libsodium has excellent support… for some modes. However standard, AES is not one cipher; it is a family of ciphers with different key lengths and operating modes, with dramatically different security properties. “AES-128-CTR” in the Exposure Notifications specification is clearly 128-bit AES in CTR (Counter) mode, but what about “AES-128” alone, stated to operate on a “single AES-128 block”?

The mode implicitly specified is known as ECB (Electronic Codebook) mode and is known to have fatal security flaws in most applications. Because AES-ECB is generally insecure, libsodium does not have any support for this cipher mode. Great, now we have two problems – we have to rewrite our cryptography code against a new library, and we have to consider if there is a vulnerability in Exposure Notifications.

ECB’s crucial flaw is that for a given key, identical plaintext will always yield identical ciphertext, regardless of position in the stream. Since AES is block-based, this means identical blocks yield identical ciphertext, leading to trivial cryptanalysis.

In Exposure Notifications, ECB mode is used only to derive rolling proximity identifiers from the rolling proximity identifier key and the timestamp, by the equation:

RPI_ij = AES_128_ECB(RPIK_i, PaddedData_j)

…where PaddedData is a function of the quantized timestamp. Thus the issue is avoided, as every plaintext will be unique (since timestamps are monotonically increasing, unless you’re trying to contact trace Back to the Future).
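The padded block itself is easy to construct. A sketch following the cryptography specification's layout ("EN-RPI", six zero bytes, then the little-endian interval number):

```python
import struct
import time

def padded_data(interval_number: int) -> bytes:
    """Build the 16-byte AES block that is encrypted to produce an RPI."""
    # ENIntervalNumber = unix_timestamp // 600, i.e. the 10-minute window
    return b"EN-RPI" + b"\x00" * 6 + struct.pack("<I", interval_number)

block = padded_data(int(time.time()) // 600)
```

Since the interval number increments every ten minutes, each block (and thus each RPI) is unique for a given key.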

Nevertheless, libsodium doesn’t know that, so we’ll need to resort to a ubiquitous cryptography library that doesn’t, uh, take security quite so seriously…

I’ll leave the implications up to your imagination.


While the Bluetooth and cryptography sections are governed by upstream specifications, making sense of the data requires tracking a significant amount of state. At minimum, we must:

  • Record received packets (the Rolling Proximity Identifier and the Associated Encrypted Metadata).
  • Query received packets for diagnosed identifiers.
  • Record our Temporary Encryption Keys.
  • Query our keys to upload if we are diagnosed.

If we were so inclined, we could handwrite all the serialization and concurrency logic and hope we don’t have a bug that results in COVID-19 mayhem.

A better idea is to grab SQLite, perhaps the most deployed software in the world, and express these actions as SQL queries. The database persists to disk, and we can even express natural unit tests with a synthetic in-memory database.
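A minimal sketch of that idea with Python's built-in sqlite3 module (the table and column names here are my own illustration, not the project's actual schema):

```python
import sqlite3

db = sqlite3.connect(":memory:")  # synthetic in-memory database for unit tests
db.executescript("""
    CREATE TABLE received_rpis (rpi BLOB NOT NULL, aem BLOB NOT NULL,
                                seen_at INTEGER NOT NULL);
    CREATE TABLE our_teks (tek BLOB NOT NULL, interval_start INTEGER NOT NULL);
""")

# Record a received advertisement, then query it back by identifier:
db.execute("INSERT INTO received_rpis VALUES (?, ?, ?)",
           (b"\x01" * 16, b"\x02" * 4, 1_600_000_000))
matches = db.execute("SELECT COUNT(*) FROM received_rpis WHERE rpi = ?",
                     (b"\x01" * 16,)).fetchone()[0]
```

Swapping ":memory:" for a file path gives the persistent on-disk database, with SQLite handling the serialization and locking for free.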

With this infrastructure, we’re now done with the primary daemon, recording Exposure Notification identifiers to the database and broadcasting our own identifiers. That’s not interesting if we never do anything with that data, though. Onwards!

Key retrieval

Once per day, Exposure Notifications implementations are expected to query the server for Temporary Encryption Keys associated with diagnosed COVID-19 cases. From these keys, the cryptography implementation can reconstruct the associated Rolling Proximity Identifiers, for which we can query the database to detect if we have been exposed.

Per Google’s documentation, the servers are expected to return a zip file containing two files:

  • export.bin: a container serialized as Protocol Buffers containing Diagnosis Keys
  • export.sig: a signature for the export with the public health agency’s key

The signature is not terribly interesting to us. On Android, it appears the system pins the public keys of recognized public health agencies as an integrity check for the received file. However, this public key is given directly to Google; we don’t appear to have an easy way to access it.

Does it matter? For our purposes, it’s unlikely. The Canadian key retrieval server is already transport-encrypted via HTTPS, so tampering with the data would already require compromising a certificate authority in addition to intercepting the requests. Broadly speaking, that limits attackers to nation-states, and since Canada has no reason to attack its own infrastructure, that limits our threat model to foreign nation-states. International intelligence agencies probably have better uses for their resources than getting people to take extra COVID tests.

It’s worth noting other countries’ implementations could serve this zip file over plaintext HTTP, in which case this signature check becomes important.

Focusing then on export.bin, we may import the relevant protocol buffer definitions to extract the keys for matching against our database. Since this requires only read-only access to the database and executes infrequently, we can safely perform this work from a separate process written in a higher-level language like Python, interfacing with the cryptography routines over the Python foreign function interface ctypes. Extraction is easy with the Python protocol buffers implementation, and downloading should be as easy as a GET request with the standard library’s urllib, right?

Here we hit a gotcha: the retrieval endpoint is guarded behind an HMAC, requiring authentication to download the zip. The protocol documentation states:

Of course there’s no reliable way to truly authenticate these requests in an environment where millions of devices have immediate access to them upon downloading an Application: this scheme is purely to make it much more difficult to casually scrape these keys.

Ah, security by obscurity. Calculating the HMAC itself is simple given the documentation, but it requires a “secret” HMAC key specific to the server. As the documentation is aware, this key is hardly secret, but it’s not available on the Canadian Digital Service’s official repositories. Interoperating with the upstream servers would require some “extra” tricks.
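For illustration, computing such an HMAC is a few lines of standard library code. The exact message layout and key the retrieval server expects are protocol-specific, so the values below are placeholders, not the real Canadian scheme:

```python
# Sketch of an HMAC-SHA256 request token. The message format here
# (region:date:timestamp) is a placeholder; the real server defines
# its own layout, and the "secret" key is server-specific.
import hashlib
import hmac

def retrieval_hmac(secret_hex: str, message: str) -> str:
    key = bytes.fromhex(secret_hex)
    return hmac.new(key, message.encode(), hashlib.sha256).hexdigest()

token = retrieval_hmac("00" * 32, "302:18500:1600000000")
assert len(token) == 64  # hex-encoded SHA-256 digest
```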

From purely academic interest, we can write and debug our implementation without any such authorization by running our own sandbox server. Minus the configuration, the server source is available, so after spinning up a virtual machine and fighting with Go versioning, we can test our Python script.

Speaking of a personal sandbox…

Key upload

There is one essential edge case to the contact tracing implementation, one that we can’t test against the Canadian servers. And edge cases matter. In effect, the entire Exposure Notifications infrastructure is designed for the edge cases. If you don’t care about edge cases, you don’t care about digital contact tracing (so please, stay at home.)

The key feature – and key edge case – is uploading Temporary Exposure Keys to the Canadian key server in case of a COVID-19 diagnosis. This upload requires an alphanumeric code generated by a healthcare provider upon diagnosis, so if we used the shared servers, we couldn’t test an implementation. With our sandbox, we can generate as many alphanumeric codes as we’d like.

Once sandboxed, there isn’t much to the implementation itself: the keys are snarfed out of the SQLite database, we handshake with the server over protocol buffers marshaled over POST requests, and we throw in some public-key cryptography via the Python bindings to libsodium.
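The first of those steps might look like the sketch below. The table and column names are hypothetical; the real schema is whatever the core library created:

```python
# Pull Temporary Exposure Keys out of the local SQLite database for
# upload. Schema (table/column names) is hypothetical.
import sqlite3

def load_teks(db_path: str):
    """Return (key_data, rolling_start_interval) pairs, newest first."""
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT key_data, rolling_start_interval "
            "FROM temporary_exposure_keys "
            "ORDER BY rolling_start_interval DESC LIMIT 14"
        ).fetchall()
    finally:
        con.close()
    return rows
```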

This functionality neatly fits into a second dedicated Python script which does not interface with the main library. It’s exposed as a command line interface with flow resembling that of the mobile application, adhering reasonably to the UNIX philosophy. Admittedly I’m not sure wrestling with the command line is top on the priority list of a Linux hacker ill with COVID-19. Regardless, the interface is suitable for higher-level (graphical) abstractions.

Problem solved, but of course there’s a gotcha: if the request is malformed, an error should be generated as a key robustness feature. Unfortunately, while developing the script against my sandbox, a bug led the request to be dropped unexpectedly, rather than returning with an error message. On the server implemented in Go, there was an apparent nil dereference. Oops. Fixing this isn’t necessary for this project, but it’s still a bug, even if it requires a COVID-19 diagnosis to trigger. So I went and did the Canadian thing and sent a pull request.


All in all, we end up with a Linux implementation of Exposure Notifications functional in Ontario, Canada. What’s next? Perhaps supporting contact tracing systems elsewhere in the world – patches welcome. Closer to home, while functional, the aesthetics are not (yet) anything to write home about – perhaps we could write a touch-based interface for mobile Linux environments like Plasma Mobile and Phosh, maybe even running it on an Android flagship flashed with postmarketOS to go full circle.

Source code for liben is available for anyone who dares go near. Compiling from source is straightforward, and at the time of writing it is also necessary. As for packaging?

Here’s hoping COVID-19 contact tracing will be obsolete by the time liben hits Debian stable.

  1. Today (Monday) is Labour Day, so this is a 3-day weekend. But I started on Saturday and posted this today, so it technically counts.↩︎

September 04, 2020

Wim Taymans

Wim Taymans talking about current state of PipeWire

Wim Taymans did an internal demonstration yesterday for the desktop team at Red Hat of the current state of PipeWire. For those still unaware, PipeWire is our effort to bring together audio, video and pro-audio under Linux, creating a smooth and modern experience. Before PipeWire there was PulseAudio for consumer audio, JACK for pro-audio, and just unending pain and frustration for video. PipeWire is being done with the aim of being ABI compatible with ALSA, PulseAudio and JACK, meaning that PulseAudio and JACK apps should just keep working on top of PipeWire without the need for rewrites (and with the same low latency for JACK apps).

As Wim reported yesterday, things are coming together, with the PulseAudio, JACK and ALSA backends all usable, if not yet 100% feature complete. Wim has been running his system with PipeWire as the only sound server for a while now, and things are now in a state where we feel ready to ask the wider community to test and help provide feedback and test cases.

Carla on PipeWire

Carla running on PipeWire

Carla, as shown above, is a popular JACK application, and it provides among other things this patchbay view of your audio devices and applications. I recommend you all click through and take a close look at the screenshot above. That is the JACK application Carla running, and as you can see, PulseAudio applications like GNOME Settings and Google Chrome are also showing up now thanks to the unified architecture of PipeWire, alongside JACK apps like Hydrogen. All of this without any changes to Carla or any of the other applications shown.

At the moment Wim is primarily testing using Cheese, GNOME Control center, Chrome, Firefox, Ardour, Carla, vlc, mplayer, totem, mpv, Catia, pavucontrol, paman, qsynth, zrythm, helm, Spotify and Calf Studio Gear. So these are the applications you should be getting the most mileage from when testing, but most others should work too.

Anyway, let me quickly go over some of the highlights from Wim’s presentation.

Session Manager

PipeWire now has a functioning session manager that allows for things like

  • Metadata, system for tagging objects with properties, visible to all clients (if permitted)
  • Load and save of volumes, automatic routing
  • Default source and sink with metadata, saved and loaded as well
  • Moving streams with metadata

Currently this is a simple sample session manager that Wim created himself, but we also have a more advanced session manager called WirePlumber being developed by Collabora, which they developed for automotive Linux use cases, but which we will probably be moving to over time for the desktop as well.

Human readable handling of Audio Devices

Wim took the code and configuration data in PulseAudio for ALSA Card Profiles and created a standalone library that can be shared between PipeWire and PulseAudio. This library handles ALSA sound card profiles, devices, mixers and UCM (the use case manager, used to configure newer audio chips like the one in the Lenovo X1 Carbon), and lets PipeWire provide the correct information to things like GNOME Control Center or pavucontrol. Using the same code as has been used in PulseAudio for this has the added benefit that when you switch from PulseAudio to PipeWire your devices don’t change names. So everything should look and feel just like PulseAudio from an application perspective. In fact just below is a screenshot of pavucontrol, the PulseAudio mixer application, running on top of PipeWire without a problem.

PulseAudio Mixer

Pavucontrol, the Pulse Audio mixer on Pipewire

Creating audio sink devices with Jack
PipeWire now allows you to create new audio sink devices with JACK. So the example command below creates a PipeWire sink node out of calfjackhost and sets it up so that we can output, for instance, the audio from Firefox into it. At the moment you can do that by running your JACK apps like this:

PIPEWIRE_PROPS="media.class=Audio/Sink" calfjackhost

But eventually we hope to move this functionality into the GNOME Control Center or similar so that you can do this setup graphically. The screenshot below shows us using CalfJackHost as an audio sink, outputting the audio from Firefox (a PulseAudio application), with CalfJackHost generating an analyzer graph of the audio.

Calfjackhost on pipewire

The CalfJackhost being used as an audio sink for Firefox

Creating devices with GStreamer
We can also use GStreamer to create PipeWire devices now. The command below takes the popular Big Buck Bunny animation created by the great folks over at Blender and lets you set it up as a video source in PipeWire. So if you always wanted to play back a video inside Cheese, for instance to apply the Cheese effects to it, you can do that this way without Cheese needing to change to handle video playback. As one can imagine this opens up the ability to string together a lot of applications in interesting ways to achieve things that there might not be an application for yet. Of course application developers can also take more direct advantage of this to easily add features to their applications; for instance, I am really looking forward to something like OBS Studio taking full advantage of PipeWire.

gst-launch-1.0 uridecodebin uri=file:///home/wim/data/BigBuckBunny_320x180.mp4 ! pipewiresink mode=provide stream-properties="props,media.class=Video/Source,node.description=BBB"

Cheese playing a video through PipeWire

Cheese playing a video provided by GStreamer through PipeWire.

How to get started testing PipeWire
Ok, so after seeing all of this you might be thinking, how can I test all of this stuff out and find out how my favorite applications work with PipeWire? Well, the first thing you should do is make sure you are running Fedora Workstation 32 or later, as that is where we are developing all of this. Once you’ve done that, you need to make sure you have all the needed pieces installed:

sudo dnf install pipewire-libpulse pipewire-libjack pipewire-alsa

Once that dnf command finishes you run the following to get PulseAudio replaced by PipeWire.

cd /usr/lib64/

sudo ln -sf pipewire-0.3/pulse/ /usr/lib64/
sudo ln -sf pipewire-0.3/pulse/ /usr/lib64/
sudo ln -sf pipewire-0.3/pulse/ /usr/lib64/

sudo ln -sf pipewire-0.3/jack/ /usr/lib64/
sudo ln -sf pipewire-0.3/jack/ /usr/lib64/
sudo ln -sf pipewire-0.3/jack/ /usr/lib64/

sudo ldconfig

(you can also find those commands here)

Once you run these commands you should be able to run

pactl info

and see this as the first line returned:
Server String: pipewire-0

I do recommend rebooting, to be 100% sure you are on a PipeWire system with everything outputting through PipeWire. Once that is done you are ready to start testing!

Our goal is to use the remainder of the Fedora Workstation 32 lifecycle and the Fedora Workstation 33 lifecycle to stabilize and finish the last major features of PipeWire and then start relying on it in Fedora Workstation 34. So I hope this article will encourage more people to get involved and join us on gitlab and on the PipeWire IRC channel at #pipewire on Freenode.

As we are trying to stabilize PipeWire we are working on it on a bug by bug basis atm, so if you end up testing out the current state of PipeWire then be sure to report issues back to us through the PipeWire issue tracker, but do try to ensure you have a good test case/reproducer as we are still so early in the development process that we can’t dig into ‘obscure/unreproducible’ bugs.

Also if you want/need to go back to PulseAudio you can run the commands here

Also if you just want to test a single application and not switch your whole system over you should be able to do that by using the following commands:




Next Steps
So what are our exact development plans at this point? Well here is a list in somewhat priority order:

  1. Stabilize – Our top priority now is to make PipeWire so stable that the power users we hope to attract as our first batch of users are comfortable running PipeWire as their only audio server. This is critical to build up a userbase that can help us identify and prioritize remaining issues and ensure that when we do switch Fedora Workstation over to using PipeWire as the default and only supported audio server it will be a great experience for users.
  2. Jackdbus – We want to implement support for the jackdbus API soon, as we know it’s an important feature for the Fedora Jam folks, so we hope to get to this in the not too distant future.
  3. Flatpak portal for JACK/audio applications – The future of application packaging is Flatpaks and being able to sandbox Jack applications properly inside a Flatpak is something we want to enable.
  4. Bluetooth – Bluetooth has been supported in PipeWire from the start, but as Wim’s focus has moved elsewhere it has gone a little stale. So we are looking at cycling back to it and cleaning it up to get it production ready. This includes proper support for things like LDAC and AAC passthrough, which is currently not handled in PulseAudio. Wim hopes to push an updated PipeWire in Fedora out next week which should at least get Bluetooth into a basic working state, but the big fix will come later.
  5. Pulse effects – Wim has looked at this, but there are some bugs that block data from moving through the pipeline.
  6. Latency compensation – We want complete latency compensation implemented. This is not actually in Jack currently, so it would be a net new feature.
  7. Network audio – PulseAudio style network audio is not implemented yet.

This is the continuation from these posts: part 1, part 2, part 3

This is the part where it all comes together, with (BYO) fireworks and confetti, champagne and hoorays. Or rather, this is where I'll describe how to actually set everything up. It's a bit complicated because while libxkbcommon does the parsing legwork now, we haven't actually changed other APIs and the file formats, which are still 1990s-style nerd cool and require significant experience in CS [1] to understand what goes where.

The below relies on software using libxkbcommon and libxkbregistry. At the time of writing, libxkbcommon is used by all mainstream Wayland compositors but not by the X server. libxkbregistry is not yet used because I'm typing this before we had a release for it. But at least now I have a link to point people to.

libxkbcommon has an xkbcli-scaffold-new-layout tool that creates the template files shown below. At the time of writing, this tool must be run from the git repo build directory; it is not installed.

I'll explain here how to add the us(banana) variant and the custom:foo option, and I will optimise for simplicity and brevity.

Directory structure

First, create the following directory layout:

$ tree $XDG_CONFIG_HOME/xkb
├── compat
├── keycodes
├── rules
│   ├── evdev
│   └── evdev.xml
├── symbols
│   ├── custom
│   └── us
└── types
If $XDG_CONFIG_HOME is unset, fall back to $HOME/.config.

Rules files

Create the rules file and add an entry to map our custom:foo option to a section in the symbols/custom file.

$ cat $XDG_CONFIG_HOME/xkb/rules/evdev
! option = symbols
custom:foo = +custom(foo)

// Include the system 'evdev' file
! include %S/evdev
Note that no entry is needed for the variant, that is handled by wildcards in the system rules file. If you only want a variant and no options, you technically don't need this rules file.
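To make the mapping concrete, here is a toy Python parser (in no way libxkbcommon's actual implementation) that resolves the option entries in a rules file like the one above:

```python
# Toy resolver for the "! option = symbols" section of an XKB rules
# file: maps an RMLVO option name to its KcCGST symbols component.
def parse_option_rules(rules_text: str) -> dict:
    mapping = {}
    in_section = False
    for line in rules_text.splitlines():
        line = line.split("//")[0].strip()  # strip comments
        if line.startswith("!"):
            # "! <mlvo field> = <kccgst field>" starts a new section;
            # "! include ..." and other headers end the one we want
            in_section = line.replace(" ", "") == "!option=symbols"
            continue
        if in_section and "=" in line:
            option, symbols = (p.strip() for p in line.split("=", 1))
            mapping[option] = symbols
    return mapping

rules = """\
! option = symbols
  custom:foo = +custom(foo)
"""
assert parse_option_rules(rules) == {"custom:foo": "+custom(foo)"}
```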

Second, create the xml file used by libxkbregistry to display your new entries in the configuration GUIs:

$ cat $XDG_CONFIG_HOME/xkb/rules/evdev.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xkbConfigRegistry SYSTEM "xkb.dtd">
<xkbConfigRegistry version="1.1">
<!-- a layoutList entry for the us(banana) variant goes here -->
<optionList>
  <group allowMultipleSelection="true">
    <configItem>
      <name>custom</name>
      <description>custom options</description>
    </configItem>
    <option>
      <configItem>
        <name>custom:foo</name>
        <description>This option does something great</description>
      </configItem>
    </option>
  </group>
</optionList>
</xkbConfigRegistry>
Our variant needs to be added as a layoutList/layout/variantList/variant, the option to the optionList/group/option. libxkbregistry will combine this with the system-wide evdev.xml file in /usr/share/X11/xkb/rules/evdev.xml.

Overriding and adding symbols

Now to the actual mapping. Add a section to each of the symbols files that matches the variant or option name:

$ cat $XDG_CONFIG_HOME/xkb/symbols/us
partial alphanumeric_keys modifier_keys
xkb_symbols "banana" {
    name[Group1]= "Banana (us)";

    include "us(basic)"

    key <CAPS> { [ Escape ] };
};
With this, the us(banana) layout will be a US keyboard layout but with the CapsLock key mapped to Escape. What about our option? Mostly the same, let's map the tilde key to nothing:

$ cat $XDG_CONFIG_HOME/xkb/symbols/custom
partial alphanumeric_keys modifier_keys
xkb_symbols "foo" {
    key <TLDE> { [ VoidSymbol ] };
};
A note here: NoSymbol means "don't overwrite it" whereas VoidSymbol is "map to nothing".


You may notice that the variant and option sections are almost identical. XKB doesn't care about variants vs options, it only cares about components to combine. So the sections do what we expect of them: variants include enough other components to make them a full keyboard layout, options merely define a few keys so they can be combined with layouts(variants). Due to how the lookups work, you could load the option template as layout custom(foo).

For the actual documentation of keyboard configuration, you'll need to google around, there are quite a few posts on how to map keys. All that has changed is where from and how things are loaded but not the actual data formats.

If you wanted to install this as system-wide custom rules, replace $XDG_CONFIG_HOME with /etc.

The above is a replacement for xmodmap. It does not require a script to be run manually to apply the config, the existing XKB integration will take care of it. It will work in Wayland (but as said above not in X, at least not for now).

A final word

Now, I fully agree that this is cumbersome, clunky and feels outdated. This is easy to fix, all that is needed is for someone to develop a better file format, make sure it's backwards compatible with the full spec of the XKB parser (the above is a fraction of what it can do), that you can generate the old files from the new format to reduce maintenance, and then maintain backwards compatibility with the current format for the next ten or so years. Should be a good Google Decade of Code beginner project.

[1] Cursing and Swearing

This is the continuation from these posts: part 1, part 2, part 3 and part 4.

In the posts linked above, I describe how it's possible to have custom keyboard layouts in $HOME or /etc/xkb that will get picked up by libxkbcommon. This only works for the Wayland stack, the X stack doesn't use libxkbcommon. In this post I'll explain why it's unlikely this will ever happen in X.

As described in the previous posts, users configure with rules, models, layouts, variants and options (RMLVO). What XKB uses internally though are keycodes, compat, geometry, symbols and types (KcCGST) [1].

There are, effectively, two KcCGST keymap compilers: libxkbcommon and xkbcomp. libxkbcommon can go from RMLVO to a full keymap, xkbcomp relies on other tools (e.g. setxkbmap) which in turn use a utility library called libxkbfile to parse rules files. The X server has a copy of the libxkbfile code. It doesn't use libxkbfile itself but it relies on the header files provided by it for some structs.

Wayland's keyboard configuration works like this:

  • the compositor decides on the RMLVO keyboard layout, through an out-of-band channel (e.g. gsettings, weston.ini, etc.)
  • the compositor invokes libxkbcommon to generate a KcCGST keymap and passes that full keymap to the client
  • the client compiles that keymap with libxkbcommon and feeds any key events into libxkbcommon's state tracker to get the right keysyms
The advantage we have here is that only the full keymap is passed between entities. Changing how that keymap is generated does not affect the client. This, coincidentally [2], is also how Xwayland gets the keymap passed to it and why Xwayland works with user-specific layouts.

X works differently. Notably, KcCGST can come in two forms, the partial form specifying names only and the full keymap. The partial form looks like this:

$ setxkbmap -print -layout fr -variant azerty -option ctrl:nocaps
xkb_keymap {
    xkb_keycodes { include "evdev+aliases(azerty)" };
    xkb_types { include "complete" };
    xkb_compat { include "complete" };
    xkb_symbols { include "pc+fr(azerty)+inet(evdev)+ctrl(nocaps)" };
    xkb_geometry { include "pc(pc105)" };
};
This defines the component names but not the actual keymap, punting that to the next part in the stack. This will turn out to be the Achilles heel. Keymap handling in the server has two distinct approaches:
  • During keyboard device init, the input driver passes RMLVO to the server, based on defaults or xorg.conf options
  • The server has its own rules file parser and creates the KcCGST component names (as above)
  • The server forks off xkbcomp and passes the component names to stdin
  • xkbcomp generates a keymap based on the components and writes it out as XKM file format
  • the server reads in the XKM format and updates its internal structs
This has been the approach for decades. To give you an indication of how fast-moving this part of the server is: XKM caching was the latest feature added... in 2009.

Driver initialisation is nice, but barely used these days. You set your keyboard layout in e.g. GNOME or KDE and that will apply it in the running session. Or run setxkbmap, for those with a higher affinity to neckbeards. setxkbmap works like this:

  • setxkbmap parses the rules file to convert RMLVO to KcCGST component names
  • setxkbmap calls XkbGetKeyboardByName and hands those component names to the server
  • The server forks off xkbcomp and passes the component names to stdin
  • xkbcomp generates a keymap based on the components and writes it out as XKM file format
  • the server reads in the XKM format and updates its internal structs
Notably, the RMLVO to KcCGST conversion is done on the client side, not the server side. And the only way to send a keymap to the server is that XkbGetKeyboardByName request - which only takes KcCGST, you can't even pass it a full keymap. This is also a long-standing potential issue with XKB: if your client tool uses different XKB data files than the server, you don't get the keymap you expected.

Other parts of the stack do basically the same as setxkbmap which is just a thin wrapper around libxkbfile anyway.

Now, you can use xkbcomp on the client side to generate a keymap, but you can't hand it as-is to the server. xkbcomp can do this (using libxkbfile) by updating the XKB state one-by-one (XkbSetMap, XkbSetCompatMap, XkbSetNames, etc.). But at this point you're at the stage where you ask the server to knowingly compile a wrong keymap before updating the parts of it.

So, realistically, the only way to get user-specific XKB layouts into the X server would require updating libxkbfile to provide the same behavior as libxkbcommon, update the server to actually use libxkbfile instead of its own copy, and updating xkbcomp to support the changes in part 2, part 3. All while ensuring no regressions in code that's decades old, barely maintained, has no tests, and, let's be honest, not particularly pretty to look at. User-specific XKB layouts are somewhat a niche case to begin with, so I don't expect anyone to ever volunteer and do this work [3], much less find the resources to review and merge that code. The X server is unlikely to see another real release and this is definitely not something you want to sneak in in a minor update.

The other option would be to extend XKB-the-protocol with a request to pass a full keymap to the server. Given the inertia involved and that the server won't see more full releases, this is not going to happen.

So as a summary: if you want custom keymaps on your machine, switch to Wayland (and/or fix any remaining issues preventing you from doing so) instead of hoping this will ever work on X. xmodmap will remain your only solution for X.

[1] Geometry is so pointless that libxkbcommon doesn't even implement this. It is a complex format to allow rendering a picture of your keyboard but it'd be a per-model thing and with evdev everyone is using the same model, so ...
[2] totally not coincidental btw
[3] libxkbcommon has been around for a decade now and no-one has volunteered to do this in the years since, so...

The Optimizations Continue

Optimizing transfer_map is one of the first issues I created, and it’s definitely one of the most important, at least as it pertains to unit tests. So many unit tests perform reads on buffers that it’s crucial to ensure no unnecessary flushing or stalling is happening here.

Today, I’ve made further strides in this direction for piglit’s spec@glsl-1.30@execution@texelfetch fs sampler2d 1x281-501x281:

Before:

MESA_LOADER_DRIVER_OVERRIDE=zink bin/texelFetch 4.41s user 1.92s system 71% cpu 8.801 total

After:

MESA_LOADER_DRIVER_OVERRIDE=zink bin/texelFetch 4.22s user 1.72s system 76% cpu 7.749 total

More Speed Loops

As part of ensuring test coherency, a lot of explicit fencing was added around transfer_map and transfer_unmap. Part of this was to work around the lack of good barrier usage, and part of it was just to make sure that everything was properly synchronized.

An especially non-performant case of this was in transfer_unmap, where I added a fence to block after the buffer was unmapped. In reality, this makes no sense other than for the case of synchronizing with the compute batch, which is a bit detached from everything else.

The reason this fence continued to be needed comes down to barrier usage for descriptors. At the start, all image resources used for descriptors just used VK_IMAGE_LAYOUT_GENERAL for the barrier layout, which is fine and good; certainly this is spec compliant. It isn’t, however, optimally informing the underlying driver about the usage for the case where the resource was previously used with VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL to copy data back from a staging image.

Instead, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL can be used for sampler images since they’re read-only. This differs from VK_IMAGE_LAYOUT_GENERAL in that VK_IMAGE_LAYOUT_GENERAL contains both read and write—not what’s actually going on in this case given that sampler images are read-only.

No Other Results

I’d planned to post more timediffs from some piglit results, but I hit some weird i915 bugs and so the tests have yet to complete after ~14 hours. More to come next week.

September 03, 2020

A Quick Optimization

As I mentioned last week, I’m turning a small part of my attention now to doing some performance improvements. One of the low-hanging fruits here is adding buffer ranges; in short, this means that for buffer resources (i.e., not images), the driver tracks the ranges in memory of the buffer that have data written, which allows avoiding gpu stalls when trying to read or write from a range in the buffer that’s known to not have anything written.


The util_range API in gallium is extraordinarily simple. Here’s the key parts:

struct util_range {
   unsigned start; /* inclusive */
   unsigned end; /* exclusive */

   /* for the range to be consistent with multiple contexts: */
   simple_mtx_t write_mutex;
};

/* This is like a union of two sets. */
static inline void
util_range_add(struct pipe_resource *resource, struct util_range *range,
               unsigned start, unsigned end)
{
   if (start < range->start || end > range->end) {
      if (resource->flags & PIPE_RESOURCE_FLAG_SINGLE_THREAD_USE) {
         range->start = MIN2(start, range->start);
         range->end = MAX2(end, range->end);
      } else {
         simple_mtx_lock(&range->write_mutex);
         range->start = MIN2(start, range->start);
         range->end = MAX2(end, range->end);
         simple_mtx_unlock(&range->write_mutex);
      }
   }
}

static inline boolean
util_ranges_intersect(const struct util_range *range,
                      unsigned start, unsigned end)
{
   return MAX2(start, range->start) < MIN2(end, range->end);
}

When the driver writes to the buffer, util_range_add() is called with the range being modified. When the user then tries to map the buffer using a specified range, util_ranges_intersect() is called with the range being mapped to determine if a stall is needed or if it can be skipped.
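The bookkeeping can be modeled in a few lines of Python to show when the stall gets skipped (the initial values are chosen so an empty range intersects nothing; this is a model, not the gallium code):

```python
# Python model of half-open range tracking, mirroring util_range_add()
# and util_ranges_intersect() above.
class UtilRange:
    def __init__(self):
        self.start, self.end = float("inf"), 0  # empty: start > end

    def add(self, start: int, end: int) -> None:
        # grow the known-written region to cover [start, end)
        self.start = min(self.start, start)
        self.end = max(self.end, end)

    def intersects(self, start: int, end: int) -> bool:
        # does [start, end) overlap [self.start, self.end)?
        return max(start, self.start) < min(end, self.end)

r = UtilRange()
r.add(0, 64)                      # driver wrote bytes [0, 64)
assert r.intersects(32, 48)       # mapping here must stall
assert not r.intersects(64, 128)  # mapping past the valid data can skip it
```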

Where do these ranges get added for the buffer?

Lots of places. Here’s a quick list:

  • struct pipe_context::blit
  • struct pipe_context::clear_texture
  • struct pipe_context::set_shader_buffers
  • struct pipe_context::set_shader_images
  • struct pipe_context::copy_region
  • struct pipe_context::create_stream_output_target
  • resource unmap and flush hooks

Some Numbers

The tests are still running to see what happens here, but I found some interesting improvements using my piglit timediff display after rebasing my branch yesterday for the first time in a bit and running the tests again overnight:


I haven’t made any changes to this codepath myself (that I’m aware of), so it looks like I’ve pulled in some improvement that’s massively cutting down the time required for the codepath that handles implicit 32bit -> 64bit conversions, and timediff picked it up.

September 01, 2020

Benchmarking vs Testing

Just a quick post for today to talk about a small project I undertook this morning.

As I’ve talked about repeatedly, I run a lot of tests.

It’s practically all I do when I’m not dumping my random wip code by the truckload into the repo.

The problem with this approach is that it doesn’t leave much time for performance improvements. How can I possibly have time to run benchmarks if I’m already spending all my time running tests?


As one of the greatest bench markers of all time once said, If you want to turn a vision into reality, you have to give 100% and never stop believing in your dream.

Thus I decided to look for those gains while I ran tests.

Piglit already possesses facilities for providing HTML summaries of all tests (sidebar: the results that I posted last week are actually a bit misleading since they included a number of regressions that have since been resolved, but I forgot to update the result set). This functionality includes providing the elapsed time for each test. It has not, however, included any way of displaying changes in time between result sets.

Until now.

The Future Is Here

With my latest MR, the HTML view can show off those big gains (or losses) that have been quietly accumulating through all these high volume sets of unit tests:


Is it more generally useful to other drivers?

Maybe? It’ll pick up any unexpected performance regressions (inverse gains) too, and it doesn’t need any extra work to display, so it’s an easy check, at the least.

August 31, 2020

This weekend the X1 Carbon with Fedora Workstation went live in North America on Lenovo’s webstore. This is a big milestone for us and for Lenovo, as it’s the first time Fedora ships pre-installed on a laptop from a major vendor and it’s the first time the world’s largest laptop maker ships premium laptops with Linux directly to consumers. Currently only the X1 Carbon is available, but more models are on the way and more geographies will be added soon too. As a sidenote, the X1 Carbon and more has actually been available from Lenovo for a couple of months now; it is just the web sales that went online now. So if you are an IT department buying Lenovo laptops in bulk, be aware that you can already buy the X1 Carbon and the P1, for instance, through the direct to business sales channel.

Also as a reminder for people looking to deploy Fedora laptops or workstations in numbers, be sure to check out Fleet Commander our tool for helping you manage configurations across such a fleet.

I am very happy with the work that has been done here to get to this point, both by Lenovo and by the engineers on my team here at Red Hat. For example, Lenovo made sure to get all of their component makers to ramp up their Linux support, and we have been working with them both to help get them started writing drivers for Linux and to help add infrastructure they could plug their hardware into. We also worked hard to get them all set up on the Linux Vendor Firmware Service so that you can be assured of getting updated firmware not just for the laptop itself, but also for its components.

We also have a list of improvements we are working on to ensure you get the full benefit of your new laptop with Fedora and Lenovo. These include improved power management features: power consumption profiles with a high-performance mode for some laptops, allowing them to run even faster on AC power, and at the other end a low-power mode to maximize battery life. As part of that we are also working on adding lap detection support, so that we can ensure your laptop doesn't run too hot in your lap and burn you, and that radio antennas don't transmit too strongly when that close to your body.

So I hope you decide to take the leap and get one of the great developer laptops we are doing together with Lenovo. This is a unique collaboration between the world's largest laptop maker and the world's largest Linux company. What we are doing here isn't just a minimal hardware enablement effort, but a concerted effort to evolve Linux as a laptop operating system, done in a proper open source way. It is the culmination of our work over the last few years: creating the LVFS, adding Thunderbolt support to Linux, improving fingerprint reader support in Linux, supporting HiDPI screens and HiDPI mice, creating the possibility of a secure desktop with Wayland, working with NVidia to ensure that the Mesa and NVidia drivers can co-exist through glvnd, and creating Flatpak to bring the advantages of containers to the desktop in a vendor-neutral way. So when you buy a Lenovo laptop with Fedora Workstation, you are not just getting a great system, but you are also supporting our efforts to take Linux to the next level, something which I think we are truly the only Linux vendor with the scale and engineering ability to do.

Of course we are not stopping here, so let me also use this chance to talk a bit about some of our other efforts.

Containers are popular for deploying software, but a lot of people are also discovering that they are an incredible way to develop software, even if that software is not going to be deployed as a Flatpak or Kubernetes container. The term often used for containers used as a development tool is "pet containers", and with the Toolbox project we are aiming to create the best possible tool for developers to work with pet containers. Toolbox lets you always have a clean environment to work in, which you can change to suit each project you work on, however you like, without affecting your host system. For instance, if you need to install a development snapshot of Python, you can do that inside your Toolbox container and be confident that various other parts of your desktop will not start crashing due to the change. And when you are done with your project and don't want that toolbox around anymore, you can easily delete it, without having to spend time figuring out which packages you installed can now safely be uninstalled from your host system, or just not bothering and having your host get bloated over time with stuff you are not actually using anymore.
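For those who haven't tried it, the everyday Toolbox workflow is only a handful of commands. This is a sketch: the container name here is made up, and exact flags and output may vary between Toolbox versions.

```shell
# Create a pet container (defaults to an image matching your Fedora release)
toolbox create --container py-snapshot

# Enter it: you get a shell inside the container, with your home directory shared
toolbox enter --container py-snapshot

# Inside the toolbox, install whatever you need without touching the host
sudo dnf install python3

# Back on the host: list your toolboxes, and throw one away when a project is done
toolbox list
toolbox rm --force py-snapshot
```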

One big advantage we have at Red Hat is that we are a major contributor to container technologies across the stack. We are a major participant in the Open Container Initiative, and we are, alongside Google, the biggest contributor to the Kubernetes project. This includes having created a set of container tools called Podman. So when we started prototyping Toolbox we could base it on Podman and get access to all the power and features that Podman provides, while at the same time making them easier to use and consume from your developer laptop or workstation.

Our initial motivation was also driven by the fact that for image-based operating systems like Fedora Silverblue and Fedora CoreOS, where the host system is immutable, you still need some way to install packages and do development. But we quickly realized that the pet container development model is superior to the old "on host" model even if you are using a traditional package-based system like Fedora Workstation. So we started out by prototyping the baseline functionality, writing it as a shell script to quickly test our initial ideas. As Toolbox picked up in popularity we realized we needed to transition quickly to a proper development language so that we wouldn't end up with an unmaintainable mess written in shell, and thus Debarshi Ray and Ondřej Míchal have recently completed the rewrite in Go (note: we chose Go to make it easier for the wider container community to contribute, since almost all container tools are written in Go).

Leading up to Fedora Workstation 33 we are trying to figure out a few things. One is how we can give you access to a RHEL-based toolbox through the Red Hat Developer Program in an easy and straightforward manner, and this is another area where pet container development shines: you can set up your pet container to run a different Linux version than your host. So you can use Fedora to get the latest features for your laptop, but target RHEL inside your Toolbox to get an easy and quick deployment path to your company's RHEL servers. I would love it if we can extend this even further as we go along, for instance letting you set up a Steam runtime toolbox to do game development targeting Steam.
Setting up a RHEL toolbox is already technically possible, but requires a lot more knowledge and understanding of the underlying technologies than we would wish.
The second thing we are looking at is how we deal with graphical applications in the context of these pet containers. While you can install, for instance, Visual Studio Code inside the toolbox container and launch it from the command line, we realize that is not a great model for how you interact with GUI applications. At the moment the only IDE that is set up to run on the host but interact properly with containers is GNOME Builder, but there are a lot more IDEs people are using, and thus we want to come up with ways to make them work better with toolbox containers beyond launching them from the command line inside the container. There are some extensions available for things like Visual Studio Code that are starting to improve the situation (those extensions are not created by us, but are looking at solving a similar problem), but we want to see how we can help provide a polished experience here. Over time we believe the pet container model of development is so good that most IDEs will follow in GNOME Builder's footsteps and make in-container development a core part of their feature set, but for now we need to figure out a good bridging strategy.

Wayland – headless and variable refresh rate.
Since switching to Wayland we have continued working on improving how GNOME works under Wayland, both to remove any major feature regressions from X11 and to start taking advantage of the opportunities that Wayland gives us. One of the issues Jonas Ådahl has been hard at work on recently is ensuring we have headless support for running GNOME on systems without a screen. We know that there are a lot of sysadmins, for instance, who want to be able to launch a desktop session on their servers to test and debug issues; these desktops are then accessed through tools such as VNC or Nice DCV. As part of that work he also made sure we can deal with multiple connected monitors that have different refresh rates. Before that fix you would get the lowest common denominator between your screens, but now if you have, for instance, a 60Hz monitor and a 75Hz monitor, they will function independently of each other and run at their maximum refresh rates. With the variable refresh rate work now landed upstream, Jonas is racing to get the headless support finished and landed in time for Fedora Workstation 33.

Linux Vendor Firmware Service
Richard Hughes is continuing his work on moving the LVFS forward, having spent time this cycle working with the Linux Foundation to ensure the service can scale even better. He is also continuously onboarding new vendors and helping existing vendors use LVFS for even more things. LVFS has become so popular that we are now getting reports of major hardware companies, who up to now haven't been too interested in the LVFS, being told by their customers to start using it or they will switch supplier. So expect the rapid growth of vendors joining the LVFS to keep increasing. It is also worth noting that many of the vendors who are already set up on LVFS are steadily increasing the number of systems they support on it, and pushing their suppliers to do the same. For enterprise use of LVFS firmware, Marc Richter also wrote an article about how to use LVFS with Red Hat Satellite. Satellite, for those who don't know it, is Red Hat's tool for managing and keeping servers up to date and secure. For large companies, having their machines, especially servers, access LVFS directly is often not wanted behaviour, so now they can use Satellite to provide a local repository of the LVFS firmware.

PipeWire
One of the changes we have been working on that I am personally extremely excited about is PipeWire. For those of you who don't know it, PipeWire is one of our major swamp-draining efforts, aiming to bring together audio, pro-audio and video under Linux and provide a modern infrastructure for us to move forward. It does so while being ABI-compatible with both JACK and PulseAudio, meaning that applications will not need to be ported to work with PipeWire. We have been using it for video for a while already, to handle screen capture under Wayland and to allow Flatpak containers access to webcams in a secure way, and Wim Taymans has been working tirelessly on moving the project forward over the last six months, focused a lot on fixing corner cases in the JACK support and on ramping up the PulseAudio support. We had hoped to start wide testing of the audio parts of PipeWire in Fedora Workstation 32, but since a key advantage PipeWire brings is not just replacing JACK or PulseAudio, but ensuring the two use cases co-exist and interact properly, we didn't want to ask people to test until we got the PulseAudio support close to production-ready. Wim has been making progress by leaps and bounds recently, and while I can't promise it 100% yet, we do expect to roll out the audio bits of PipeWire for wider-scale testing in Fedora Workstation 33, with the goal of making it the default in Fedora Workstation 34 or, more likely, Fedora Workstation 35.
Wim is doing an internal demo this week, so I will try to put out a blog post talking about that later in the week.

Flatpak – incremental updates
One of the features we added to Flatpak was the ability to distribute applications as Open Container Initiative compliant containers. The reason for this was that, as companies, Red Hat included, built infrastructure for hosting and distributing containers, we could also use that for Flatpaks. This is obviously a great advantage for a variety of reasons, but it had one large downside compared to the traditional way of distributing Flatpaks (as OSTree images): each update comes as one single large download, as opposed to the delta-based update model that OSTree provides.
This is why, if you compare the same application shipped from Flathub, which uses OSTree, versus from the Fedora container registry, you quickly notice that you get much smaller updates from Flathub. For Kubernetes containers this hasn't been considered a huge problem, as their main use case is copying containers around on a high-speed network inside your cloud provider, but for desktop users it is annoying. So Alex Larsson and Owen Taylor have been working on a way to do incremental updates for OCI/Docker/Kubernetes containers too. This not only means we can get very close to the Flathub update size in the Fedora Container Catalog, but also, since it was implemented in a way that works for all OCI/Kubernetes containers, that those get incremental update functionality too. This matters especially as such containers make their way into edge computing, where update sizes matter just like they do on the desktop.

Hangul input under Wayland
Red Hat, like Lenovo, targets most of the world with our products and projects. This means that we want them to work great even for people who don't use English or another European language. To achieve this, my group at Red Hat has a team dedicated to ensuring that not just Linux, but all Red Hat products, work well for international users. That team, led by Jens Petersen, is distributed around the globe, with engineers in Japan, China, India, Singapore and Germany. It contributes to a lot of different things, like font maintenance, input method development, i18n infrastructure and more.
One thing this team recently discovered was that Korean input under Wayland was not working properly. So Peng Wu, Takao Fujiwara and Carlos Garnacho worked together on a series of patches for ibus and GNOME Shell to ensure that Fedora Workstation on Wayland works perfectly for Korean input. I wanted to highlight this effort because, while I don't usually mention efforts with such a regional impact in my blog posts, this work is a critical part of keeping Linux viable and usable across the globe. Ensuring that you can use your computer in your own language is something we feel is important and want to enable, and it is also an area where I believe Red Hat is investing more than any other vendor out there.

We meet with NVidia on a regular basis to discuss topics of shared interest, and one thing we have been looking at for a while now is the best way to support the NVidia binary driver under XWayland. As part of that, Adam Jackson has been working on a research project to see how feasible it would be to run GLX applications on top of EGL. As one might imagine, EGL doesn't have a one-to-one match with the GLX APIs, but based on what we have seen so far it should be close enough to get things going (Adam already got glxgears running :). The goal is to have an initial version that works OK, and then, in collaboration with NVidia, evolve it into a great solution for even the most demanding OpenGL/GLX applications. Currently the code causes an extra memcpy compared to running on native GLX, but this is something we think can be resolved in collaboration with NVidia. Of course, this is still an early-stage effort, and Adam and NVidia are currently looking at it, so there is still a chance we will hit a snag and have to go back to the drawing board. For those interested, you can take a look at this Mesa merge request to see the current state.


I haven’t said hi for a while when starting a post. I think the rush and whirlwind of things happening during GSoC made me a little agitated. This year, my project was the only one accepted for the X.Org Foundation, and I felt a great responsibility. Well, this is the last week of the project, and I’m slowing down and breathing :)

This report is a summary of my journey at Google Summer of Code 2020. Experience and technical reports can be read in more detail in the 11 blog posts I published during this period:

Date Post
2020/05/13 I’m in - GSoC 2020 - X.Org Foundation
2020/05/20 Everyone makes a script
2020/06/02 Status update - Tie up loose ends before starting
2020/06/03 Walking in the KMS CURSOR CRC test
2020/06/15 Status update - connected errors
2020/07/06 GSoC First Phase - Achievements
2020/07/17 Increasing test coverage in VKMS - max square cursor size
2020/08/12 The end of an endless debugging of an endless wait
2020/08/19 If a warning remains, the job is not finished.
2020/08/27 Another day, another mistery
2020/08/28 Better validation of alpha-blending

So, Google Summer of Code was an amazing experience! I have evolved not only technically, but also as a developer in a software community. In this period, I was able to work on different projects and interact with their communities. I contributed to three projects of various purposes, sizes, and maturities, and below I describe a little of my relationship with each of them:

DRM/Linux Kernel

The Linux kernel is one of the largest, most famous, and most mature free and open-source projects. It is also the kernel I have been using for over ten years. The development of Linux is so interesting to me that I chose it as a case study for my Master’s research in Computer Science.

Among the various subsystems in the project, I have contributed to DRM, the part of Linux responsible for interfacing with GPUs. It provides user-space with an API to send commands and data in a format suitable for modern GPUs. It is the kernel-space component of graphics stacks like the X.Org Server.

And the X.Org Foundation was the organization that supported my project in GSoC. Thanks to the support of the DRI community and the X.Org Foundation, I have contributed over the past few months to improving VKMS. The Virtual Kernel Mode Setting (VKMS) driver is a software-only model of a KMS driver that allows you to test DRM and run X on systems without display hardware.


IGT GPU Tools

IGT is a set of tools used to develop and validate DRM drivers. These tools can be used by drivers other than the Intel ones, and I used them extensively to evolve VKMS. Using IGT to improve VKMS can be very useful for validating patches sent to the DRM core, that is, running automated tests against new code changes without the need for real hardware.

With this in mind, all my work in GSoC aimed to bring greater stability to the execution of IGT tests on VKMS. IGT test cases guided most of my contributions to VKMS. Before sending any code, I took care to validate that, with my change, the tests I was familiar with remained stable and working properly.


Kworkflow

Kworkflow is a set of scripts that I use in my development environment for the Linux kernel. It greatly facilitates routine tasks: coding, examining code, and sending patches.

My mentor, Rodrigo Siqueira, developed it, and other students and former computer science students at the university contributed to adding functionality and evolving the project. It supports compiling and deploying your local kernel version, helps you browse the code and the change history, and provides essential information for patch formatting.

With these three projects, I had an exciting development journey and many lessons learned to share. Here is a summary of that experience:

From start to finish

The general purpose of my GSoC project was to evolve VKMS using IGT tests. In this way, I used the kms_cursor_crc test case as a starting point to fix and validate features, adjust behaviors, and increase test coverage in VKMS.

[KMS cursor crc uses]
the display CRC support to validate cursor plane functionality.
The test will position the cursor plane either fully onscreen,
partially onscreen, or fully offscreen, using either a fully opaque
or fully transparent surface. In each case, it enables the cursor plane
and then reads the PF CRC (hardware test) and compares it with the CRC
value obtained when the cursor plane was disabled and its drawing is
directly inserted on the PF by software.

In my project proposal, I presented the piglit statistics for kms_cursor_crc on VKMS: seven test cases passing, two failing, one warning (under development), and 236 skips. Now, I can present an overview of this state and of the improvements mapped and applied during this GSoC period:

Failing tests

Initially, three test cases failed on VKMS. For the first, even before GSoC started, I had already sent a proposal that moved it from failure to success with a warning; it was related to the composition of planes considering the alpha channel. The second was a case that had still worked two years ago, and the third had never worked before. These last two were related to the behavior of the cursor plane during power management tasks.

From failure to warning

Since VKMS did not consider the alpha channel when blending, it zeroed that channel before computing the CRC. However, the operation that was supposed to do this was zeroing the wrong channel, due to an endianness trap, which led the test case to failure. Even after fixing the operation, the test still emitted a warning. This happened because, when zeroing the alpha channel, VKMS was delivering a fully transparent black primary plane for CRC capture, while the background should have been solid black.
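For illustration, here is a minimal sketch of the blending this leads towards, in Python rather than the kernel's C: with premultiplied alpha, the "over" operator reduces to out = src + dst * (1 - alpha_src) for every channel, so a fully transparent cursor leaves the primary plane untouched instead of turning it into transparent black.

```python
def blend_premultiplied(src, dst):
    """Blend one RGBA pixel over another.

    Both pixels are (r, g, b, a) tuples with 0-255 channels whose color
    channels are already multiplied by their alpha (premultiplied).
    The "over" operator is then: out = src + dst * (1 - a_src).
    """
    a_src = src[3] / 255.0
    return tuple(min(255, round(s + d * (1.0 - a_src))) for s, d in zip(src, dst))

# A fully opaque cursor pixel completely hides the primary plane pixel...
assert blend_premultiplied((255, 0, 0, 255), (0, 0, 255, 255)) == (255, 0, 0, 255)
# ...while a fully transparent one leaves it untouched.
assert blend_premultiplied((0, 0, 0, 0), (0, 0, 255, 255)) == (0, 0, 255, 255)
```

This is purely a didactic sketch of the math; the actual VKMS patch operates on raw framebuffer memory.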

A cross-cutting problem affecting VKMS during sequential subtest execution

During the sequential execution of the kms_cursor_crc test cases, the results were unstable. Successful tests failed, two runs of the same test alternated between failure and success … in short, a mess.

While debugging and examining the change history, I found a commit that changed the behavior of VKMS in the kms_cursor_crc test cases. This change replaced the drm_wait_for_vblanks function with drm_wait_flip_done, and therefore VKMS stopped "forcing" vblank interrupts during the state commit process. Without vblank interrupts, the execution of test cases in sequence got stuck. Forcing vblanks was just a stopgap and not a real solution. Besides, the IGT test itself was leaving a kind of trash behind when a failure happened, since it did not complete its cleanup and so affected the next subtest.

Skips due to unmet test requirements

Skips were caused by the lack of the following features in VKMS:

  • Cursor sizes larger than 64x64 and non-square cursor (few drivers take this approach)
  • Support for more than one CRTC (still not developed):

  Test requirement not met in function igt_require_pipe, file ../lib/igt_kms.c:1900:
  Test requirement: !(!display->pipes[pipe].enabled)
  Pipe F does not exist or not enabled

So, for each of these issues, I needed to identify the problem, find out which project the problem came from, map what was required to resolve it, and then code the contribution. With that, I had to combine work on two fronts: DRM and IGT. Also, with the more intensive use of my development environment, I needed to develop some improvements to the tool that supports me in compiling, installing, and exploring the kernel in a virtual machine. As a consequence, I also sent some contributions to the Kworkflow project.

Patches sent during GSoC 2020 to solve the above issues

## Project Patch Status Blog post
01 DRM/Linux kernel drm/vkms: change the max cursor width/height Accepted Link
02 DRM/Linux kernel drm/vkms: add missing drm_crtc_vblank_put to the get/put pair on flush Discarded -
03 DRM/Linux kernel drm/vkms: fix xrgb on compute crc Accepted -
04 DRM/Linux kernel drm/vkms: guarantee vblank when capturing crc (v3) Accepted Link
05 DRM/Linux kernel drm/vkms: add alpha-premultiplied color blending (v2) Accepted Link
06 IGT GPU Tools lib/igt_fb: remove extra parameters from igt_put_cairo_ctx Accepted Link
07 IGT GPU Tools [i-g-t,v2,1/2] lib/igt_fb: change comments with fd description Accepted Link
08 IGT GPU Tools [i-g-t,v2,2/2] test/kms_cursor_crc: update subtests descriptions and some comments Accepted Link
09 IGT GPU Tools [i-g-t,v2,1/1] test/kms_cursor_crc: release old pipe_crc before create a new one Accepted Link
10 IGT GPU Tools [i-g-t,2/2] test/kms_cursor_crc: align the start of the CRC capture to a vblank Discarded Link
11 IGT GPU Tools [i-g-t] tests/kms_cursor_crc: refactoring cursor-alpha subtests Under review Link
12 Kworkflow src: add grep utility to explore feature Merged Link
13 Kworkflow kw: small issue on u/mount alert message Merged Link
14 Kworkflow Add support for deployment in a debian-VM Under review Link

With the patches sent, I collaborated to:

  • Increasing test coverage in VKMS
  • Fixing bugs in VKMS
  • Adjusting part of the VKMS design to its peculiarities as a fake driver, allowing its behavior to stabilize
  • Fixing a leak in the kms_cursor_crc cleanup when a test case fails
  • Improving different parts of the IGT documentation
  • Increasing the effectiveness of testing cursor plane alpha-composition
  • Improving my development environment and increasing my productivity

Not only coding, getting involved in the community

In addition to sending code improvements, my initial proposal included adding support for real overlay planes. However, I participated in other community activities that led me to adapt a portion of my project to meet emerging and more urgent demands.

I took a long time debugging VKMS’s unstable behavior together with other DRM community members. Initially, I was working on the solution in isolation. However, I realized that the problem had a history and that it would be more productive to talk to other developers to find a suitable solution. When I saw on the mailing list that another developer had encountered a similar problem, I joined the conversation. This experience was very enriching, and I had the support and guidance of a DRM maintainer, Daniel Vetter. Debugging together with Daniel and Sidong Yang, we reached a better solution to the problem. Finally, it seems that this debate somehow also helped another developer, Leandro Ribeiro, in his work on another VKMS issue.

In this debugging process, I also gained more experience with, and confidence in, the project. So I reviewed and tested some patches sent to VKMS [1,2,3,4]. Finally, with the knowledge acquired, I was also able to contribute to the debugging of a feature under development that adds writeback support [5].

Discussion and future works

The table below summarizes my progress in stabilizing the execution of kms_cursor_crc using VKMS.

Status Start End
pass 7 25 (+18)
warn 1 0 (-1)
fail 2 0 (-2)
skip 236 221 (-15)
total 246 246

To solve the warning triggered by the test case pipe-A-cursor-alpha-transparent, I needed to implement an already mapped feature: "TODO: Use the alpha value to blend vaddr_src with vaddr_dst instead of overwriting it in blend()". This feature revealed that another test case, pipe-A-cursor-alpha-opaque, had a false pass. Moreover, both test cases related to the composition of the cursor plane with the primary plane were not sufficient to verify the composition’s correctness. As a result, I sent IGT a refactoring of the test cases to improve coverage.

The fixes for the two failures initially reported came out of the same debugging process, but the failures had different histories. The dpms test case had been successful in the past, according to a commit in the git log; it started to fail after a change that was correct in itself but exposed the driver’s deficiency. There was no report that the suspend test case had ever worked correctly, but it lacked the same thing: ensuring that vblank interrupts occur for the composition work and CRC captures.

The remaining skips are related to:

  1. Non-square cursor support, which, as far as I know, is a feature specific to Intel drivers. If this restricted applicability is confirmed, the right thing to do is to move these test cases to the i915-specific test folder.
  2. The test requirement for more than one pipe: “Pipe (B-F) does not exist or not enabled”. This needs more complex work to add support for more than one CRTC.

I also started examining how to support the real overlay:

  • create a parameter to enable the overlay;
  • enable the third type of plane overlay in addition to the existing primary and cursor;
  • and allow the blending of the three planes.

Lessons learned

  1. In one of my first contributions to VKMS I received feedback that was, let’s say, scary. To this day I have no idea who that person was, and maybe they didn’t mean to scare me, but their lack of politeness in saying ‘That looks horrid’ was pretty disheartening for a beginner. Luckily I didn’t take it too seriously; I continued my journey, started GSoC, and realized that the most active developers in the community are very friendly and inclusive. I learned a lot, and I would like to thank them for sharing their knowledge and information.
  2. Still on the previous topic, I feel that less serious developers also don’t care much about contribution etiquette and codes of conduct. By not knowing, or by overlooking, the rules of the community, they end up creating discomfort for other developers. In the DRM community, by contrast, I noticed care in maintaining the community and bringing in new contributors. I did not feel unworthy or discredited in any discussion; on the contrary, I felt very encouraged to continue, to learn, and to contribute.
  3. In addition to the diversity of skills, maturity, and culture of the developers, the Linux community deals with timezone differences. It’s crazy to see the energy of the DRI community on the IRC channel, even with people sleeping and waking up at random times.
  4. I still don’t feel very comfortable talking on the IRC channel, for two reasons: it always takes me time to understand the subject being discussed technically, and my English is not very good. The dialogues I had by e-mail were very useful: they gave me time to think and check the validity of my ideas before speaking, and they also serve as a record/documentation for future contributions.
  5. The Linux kernel is very well documented. For most questions, there is an answer somewhere in the documentation, though it is not always immediate to find because the project is so extensive. Besides, a lot has already been encapsulated and written for reuse, so creating something from scratch is often unnecessary.

And after GSoC

Well, my first, short-term plan is to make a good presentation at XDC 2020. In it, I will highlight interesting cases on my project of working on IGT and VKMS together. My presentation will be on the first day, September 16; more details:

VKMS improvements using IGT GPU Tools

My second plan is to continue contributing to VKMS, initially adding the mapped features: a real overlay plane, and structuring VKMS for more than one CRTC. With the arrival of writeback support, doing these things will probably require even more reasoning.

Not least, I want to finish my master’s degree and complete my research on Linux development, adding my practical experiences with VKMS development. I’m almost there, hopefully later this year. And then, I hope to find a job that will allow me to work on Linux development. And live. :P


Finally, I thank the X.Org Foundation for accepting my project and believing in me (mine was the organization’s sole project in this year’s GSoC). Thanks also to Trevor Woerner for motivation, communication and confidence! He was always very attentive, guiding me and giving tips and advice.

I thank my mentor, Rodrigo Siqueira, who believed in my potential, openly shared knowledge he acquired with great effort, gave relevance to each question I presented, and encouraged me to talk with the community. Many thanks also to the DRI community and Daniel Vetter for sharing their time, information and knowledge with me, for being friendly, and for giving me constructive feedback.


This is the continuation from these posts: part 1, part 2

Let's talk about everyone's favourite [1] keyboard configuration system again: XKB. If you recall, the goal is to make it simple for users to configure their own custom layouts. Now, as described earlier, XKB-the-implementation doesn't actually have a concept of a "layout" as such; it has "components", and something converts your layout desires into a combination of components. RMLVO (rules, model, layout, variant, options) is what you specify, and it gets converted to KcCGST (keycodes, compat, geometry, symbols, types). This is a one-way conversion; the resulting keymap no longer has references to the RMLVO arguments. Today's post is about that conversion, and we're only talking about libxkbcommon as the XKB parser, because anything else is no longer maintained.

The peculiar thing about XKB data files (provided by xkeyboard-config [3]) is that the filename is part of the API. You say layout "us" variant "dvorak", the rules file translates this to symbols 'us(dvorak)' and the parser will understand this as "load file 'symbols/us' and find the dvorak section in that file". [4] The default "us" keyboard layout will require these components:

xkb_keymap {
xkb_keycodes { include "evdev+aliases(qwerty)" };
xkb_types { include "complete" };
xkb_compat { include "complete" };
xkb_symbols { include "pc+us+inet(evdev)" };
xkb_geometry { include "pc(pc105)" };
};
So the symbols are really: file symbols/pc, add symbols/us and then the section named 'evdev' from symbols/inet [5]. Types are loaded from types/complete, etc. The lookup paths for libxkbcommon are $XDG_CONFIG_HOME/xkb, /etc/xkb, and /usr/share/X11/xkb, in that order.
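That first-match lookup can be sketched in a few lines (illustrative Python, not libxkbcommon's actual code; the `resolve` helper and its name are made up for this post):

```python
import os

# Sketch of the lookup-path resolution described above: try each path in
# order and use the first file that exists.
LOOKUP_PATHS = [
    os.path.join(os.environ.get("XDG_CONFIG_HOME",
                                os.path.expanduser("~/.config")), "xkb"),
    "/etc/xkb",
    "/usr/share/X11/xkb",
]

def resolve(component, filename, paths=LOOKUP_PATHS):
    """Return the first match, e.g. resolve('symbols', 'us')."""
    for base in paths:
        candidate = os.path.join(base, component, filename)
        if os.path.exists(candidate):
            return candidate
    return None
```

Note that this models the pre-merge behaviour, where the first "us" file found shadows all later ones; the merged-file handling discussed below looks at *every* path until the wanted section turns up.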

Most of the symbols sections aren't actually full configurations. The 'us' default section only sets the alphanumeric rows, everything else comes from the 'pc' default section (hence: include "pc+us+inet(evdev)"). And most variant sections just include the default one, usually called 'basic'. For example, this is the 'euro' variant of the 'us' layout which merely combines a few other sections:

partial alphanumeric_keys
xkb_symbols "euro" {

include "us(basic)"
name[Group1]= "English (US, euro on 5)";

include "eurosign(5)"

include "level3(ralt_switch)"
Including things works as you'd expect: include "foo(bar)" loads section 'bar' from file 'foo' and this works for 'symbols/', 'compat/', etc., it'll just load the file in the same subdirectory. So yay, the directory is kinda also part of the API.

Alright, now you understand how KcCGST files are loaded, much to your despair.

For user-specific configuration, we could already load a 'custom' layout from the user's home directory. But it'd be nice if we could just add a variant to an existing layout. Like "us(banana)", because potassium is important or somesuch. This wasn't possible because the filename is part of the API. So our banana variant had to be in $XDG_CONFIG_HOME/xkb/symbols/us and once we found that "us" file, we could no longer include the system one.

So as of two days ago, libxkbcommon now extends the parser to have merged KcCGST files, or in other words: it'll load the symbols/us file in the lookup path order until it finds the section needed. With that, you can now copy this into your $XDG_CONFIG_HOME/xkb/symbols/us file and have it work as variant:

partial alphanumeric_keys
xkb_symbols "banana" {

include "us(basic)"
name[Group1]= "English (Banana)";

// let's assume there are some keymappings here
};
And voila, you now have a banana variant that can combine with the system-level "us" layout.

And because there must be millions [6] of admins out there that maintain custom XKB layouts for a set of machines, the aforementioned /etc/xkb lookup path was also recently added to libxkbcommon. So we truly now have the typical triplet of lookup paths:

  • vendor-provided ones in /usr/share/X11/xkb,
  • host-specific ones in /etc/xkb, and
  • user-specific ones in $XDG_CONFIG_HOME/xkb [7].
Good times, good times.

[1] Much in the same way everyone's favourite Model T colour was black
[2] This all follows the UNIX philosophy, there are of course multiple tools involved and none of them know what the other one is doing
[3] And I don't think Sergey gets enough credit for maintaining that pile of language oddities
[4] Note that the names don't have to match, we could map layout 'us' to the symbols in 'banana' but life's difficult enough as it is
[5] I say "add" when it's sort of a merge-and-overwrite and, yes, of course there are multiple ways to combine layouts, thanks for asking
[6] Actual number may be less
[7] Notice how "X11" is missing in the latter two? If that's not proof that we want to get rid of X, I don't know what is!

At Last, Geometry Shaders

I’ve mentioned GS on a few occasions in the past without going too deeply into the topic. The truth is that I’m probably not ever going to get that deep into the topic.

There’s just not that much to talk about.

The 101

Geometry shaders take the vertices from previous shader stages (vertex, tessellation) and then emit some sort of (possibly-complete) primitive. EmitVertex() and EndPrimitive() are the big GLSL functions that need to be cared about, and there’s gl_PrimitiveID and gl_InvocationID, but hooking everything into a mesa (gallium) driver is, at the overview level, pretty straightforward:

  • add struct pipe_context hooks for the gs state creation/deletion, which are just the same wrappers that every other shader type uses
  • make sure to add GS state saving to any util_blitter usage
  • add a bunch of GS shader pipe cap handling
  • the driver now supports geometry shaders

But Then Zink

The additional changes needed in zink are almost entirely in ntv. Most of this is just expanding existing conditionals which were previously restricted to vertex shaders to continue handling the input/output variables in the same way, though there’s also just a bunch of boilerplate SPIRV setup for enabling/setting execution modes and the like that can mostly be handled with ctrl+f in the SPIRV spec with shader_info.h open.

One small gotcha is again gl_Position handling. Since this was previously transformed in the vertex shader to go from an OpenGL [-1, 1] depth range to a Vulkan [0, 1] range, any time a GS loads gl_Position it then needs to be un-transformed, as (at the point of the GS implementation patch) zink doesn’t support shader keys, and so there will only ever be one variant of a shader, which means the vertex shader must always perform this transform. This will be resolved at a later point, but it’s worth taking note of now.
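In clip coordinates the conversion and its inverse look like this (a worked sketch, not zink's actual NIR pass):

```python
# The VS maps OpenGL's [-w, w] clip-space depth into Vulkan's [0, w]:
#   z_vk = (z_gl + w) / 2
# so a GS that reads gl_Position must apply the inverse first:
#   z_gl = 2 * z_vk - w
def gl_to_vk_z(z, w):
    return (z + w) / 2.0

def vk_to_gl_z(z, w):
    return 2.0 * z - w

# round-trip: un-transforming recovers the original GL depth
z, w = -0.25, 1.0
assert abs(vk_to_gl_z(gl_to_vk_z(z, w), w) - z) < 1e-9
```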

The other point to note is, as always, I know everyone’s tired of it by now, transform feedback. Whereas vertex shaders emit this info at the end of the shader, things are different in the case of geometry shaders. The point of transform feedback is to emit captured data at the time shader output occurs, so now xfb happens at the point of EmitVertex().

And that’s it.

Sometimes, but only sometimes, things in zink-land don’t devolve into insane hacks.

August 28, 2020

I recently posted about a feature I developed for VKMS to consider the alpha channel in the composition of the cursor plane with the primary plane. This process took a little longer than expected, and now I can identify a few reasons:

  • Beginner: I had little knowledge of computer graphics and its operations
  • Maybe cliché: I did not consider that the subsystem itself already had material to aid coding.
  • The unexpected: the test cases that checked the composition of a cursor considering the alpha channel were successful, even with a defect in the implementation.

IGT GPU Tools has two test cases in kms_cursor_crc to check the cursor blend in the primary plane: the cursor-alpha-opaque and the cursor-alpha-transparent. These two cases are structured in the same way, changing only the value of the alpha channel - totally opaque 0xFF or totally transparent 0x00.

In a brief description, the test checks the composition as follows:

  1. Creates an XRGB primary plane framebuffer with a black background
  2. Creates an ARGB framebuffer for the cursor plane with white color and a given alpha value (0xFF or 0x00)
  3. Enables the cursor on hardware and captures the plane’s CRC after composition (hardware)
  4. Disables the cursor on hardware, draws a cursor directly on the primary plane and captures the CRC (software)
  5. Compares the two captured CRCs: they must have equal values.

After implementing alpha blending using the straight alpha formula in VKMS, both tests were successful. However, the equation was not correct.

To paint, IGT uses the Cairo library. For the primary plane, the test uses CAIRO_FORMAT_RGB24, and CAIRO_FORMAT_ARGB32 for the cursor. According to the documentation, the ARGB32 format stores the pixel color in the pre-multiplied alpha representation:


each pixel is a 32-bit quantity, with alpha in the upper 8 bits, then red, then
green, then blue. The 32-bit quantities are stored native-endian.
Pre-multiplied alpha is used. (That is, 50% transparent red is 0x80800000, not
0x80ff0000.) (Since 1.0)

In a brief dialogue with Pekka about endianness on DRM, he showed me information from the DRM documentation that I didn’t know about. According to it, DRM converges with the representation used by Cairo:

Current DRM assumption is that alpha is premultiplied, and old userspace can
break if the property defaults to anything else.

It is also possible to find information about DRM’s Plane Composition Properties, such as the existence of pixel blend mode to add a blend mode for alpha blending equation selection, describing how the pixels from the current plane are composited with the background.

Finally, you can find there the pre-multiplied alpha blending equation:

out.rgb = plane_alpha * fg.rgb + (1 - (plane_alpha * fg.alpha)) * bg.rgb

From this information, we see that both IGT and DRM use the same representation, but the current test cases of kms_cursor_crc do not show the defect in using the straight-alpha formula.
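The difference only shows up for intermediate alpha values, which a quick sketch makes concrete (illustrative Python, not the VKMS code):

```python
def blend_straight(fg, alpha, bg):
    """Straight-alpha blend of one 8-bit channel over an opaque background."""
    return round(alpha * fg + (1 - alpha) * bg)

def blend_premultiplied(fg_premul, alpha, bg):
    """Premultiplied blend: the fg channel already carries the alpha factor."""
    return round(fg_premul + (1 - alpha) * bg)

# 50% white over black: Cairo hands over the premultiplied channel 0x80.
# The premultiplied formula keeps it at 0x80; the straight formula
# multiplies by alpha a second time and yields 0x40 instead.
# At alpha 0x00 or 0xFF both formulas agree, so the extreme-value
# tests pass even with the wrong equation.
```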

With this in mind, I thought of refactoring the test cases so that they could validate translucent cursors and “remove some zeros” from the equation. After sharing thoughts with my mentor, Siqueira, I decided to combine the two test cases (cursor-alpha-opaque and cursor-alpha-transparent) into one and refactor it so that it verifies not only extreme alpha values but also translucent values.

Therefore, the submitted test case proposal follows these steps:

  1. Creates an XRGB primary plane framebuffer with a black background
  2. Creates an ARGB framebuffer for the cursor plane and enables the cursor on hardware.
  3. Paints the cursor with white color and a range of alpha values (from 0xFF to 0x00)
  4. For each alpha value, captures the plane’s CRC after composition into an array of CRCs (hardware)
  5. Disables the cursor on hardware.
  6. Draws the cursor directly on the primary plane following the same range of alpha values (software)
  7. Captures the CRC and compares it with the CRC in the array of hardware CRCs: they must have equal values.
  8. Clears the primary plane and goes to step 5.

The code sent: tests/kms_cursor_crc: refactoring cursor-alpha subtests

CI checks did not show any problems, so I am expecting some human feedback :)

A Different Friday

Taking a break from talking about all this crazy feature nonsense, let’s get back to things that actually matter. Like gains. Specifically, why have some of these piglit tests been taking so damn long to run?

A great example of this is spec@!opengl 2.0@tex3d-npot.

Mesa 20.3: MESA_LOADER_DRIVER_OVERRIDE=zink bin/tex3d-npot 24.65s user 83.38s system 73% cpu 2:27.31 total

Mesa zmike/zink-wip: MESA_LOADER_DRIVER_OVERRIDE=zink bin/tex3d-npot 6.09s user 5.07s system 48% cpu 23.122 total

Yes, currently some changes I’ve done allow this test to pass in 16% of the time that it requires in the released version of Mesa. How did this happen?

Speed Loop

The core of the problem at present is zink’s reliance on explicit fencing without enough info to know when to actually wait on a fence, as is vaguely referenced in this ticket, though the specific test case there still has yet to be addressed. In short, here’s the problem codepath that’s being hit for the above test:

static void *
zink_transfer_map(struct pipe_context *pctx,
                  struct pipe_resource *pres,
                  unsigned level,
                  unsigned usage,
                  const struct pipe_box *box,
                  struct pipe_transfer **transfer)
   if (pres->target == PIPE_BUFFER) {
      if (usage & PIPE_TRANSFER_READ) {
         /* need to wait for rendering to finish
          * TODO: optimize/fix this to be much less obtrusive
          * mesa/mesa#2966
          */
         struct pipe_fence_handle *fence = NULL;
         pctx->flush(pctx, &fence, PIPE_FLUSH_HINT_FINISH);
         if (fence) {
            pctx->screen->fence_finish(pctx->screen, NULL, fence,
                                       PIPE_TIMEOUT_INFINITE);
            pctx->screen->fence_reference(pctx->screen, &fence, NULL);
         }
      }
   }

Here, the current command buffer is submitted and then its fence is waited upon any time a call to e.g., glReadPixels() is made.

Any time.

Regardless of whether the resource in question even has pending writes.

Or whether it’s ever had anything written to it at all.

This patch was added at my request to fix up a huge number of test cases we were seeing fail due to cases of write -> read without waiting for the write to complete, and it was added with the understanding that at some point in the future it would be improved to both ensure synchronization and not incur such a massive performance hit.

That time has passed, and we are now in the future.

Buffer Synchronization

At its core, this is just another synchronization issue, and so by adding more details about synchronization needs, the problem can be resolved.

I chose to do this using r/w flags for resources based on the command buffer (batch) id that the resource was used on. Presently, Mesa releases ship with this code for tracking resource usage in a command buffer:

void
zink_batch_reference_resoure(struct zink_batch *batch,
                             struct zink_resource *res)
{
   struct set_entry *entry = _mesa_set_search(batch->resources, res);
   if (!entry) {
      entry = _mesa_set_add(batch->resources, res);
      pipe_reference(NULL, &res->base.reference);
   }
}

This ensures that the resource isn’t destroyed before the batch finishes, which is a key functionality that drivers generally prefer to have in order to avoid crashing.

It doesn’t, however, provide any details about how the resource is being used, such as whether it’s being read from or written to, which means there’s no way to optimize that case in zink_transfer_map().

Here’s the somewhat improved version:

void
zink_batch_reference_resource_rw(struct zink_batch *batch,
                                 struct zink_resource *res, bool write)
{
   unsigned mask = write ? ZINK_RESOURCE_ACCESS_WRITE : ZINK_RESOURCE_ACCESS_READ;

   struct set_entry *entry = _mesa_set_search(batch->resources, res);
   if (!entry) {
      entry = _mesa_set_add(batch->resources, res);
      pipe_reference(NULL, &res->base.reference);
   }
   /* the batch_uses value for this batch is guaranteed to not be in use now because
    * reset_batch() waits on the fence and removes access before resetting
    */
   res->batch_uses[batch->batch_id] |= mask;
}

For context, I’ve simultaneously added a member uint8_t batch_uses[4]; to struct zink_resource, as there are 4 batches in the batch array.

What this change does is allow callers to provide very basic info about whether the resource is being read from or written to in a given batch, stored as a bitmask in the batch-specific struct zink_resource::batch_uses member. Each batch has its own member of this array, as it needs to be modifiable atomically in order to have its usage unset when a fence is notified of completion, and each member can have up to two bits set (read and write).

Now here’s the current struct zink_context::fence_finish hook for waiting on a fence:

bool
zink_fence_finish(struct zink_screen *screen, struct zink_fence *fence,
                  uint64_t timeout_ns)
{
   bool success = vkWaitForFences(screen->dev, 1, &fence->fence, VK_TRUE,
                                  timeout_ns) == VK_SUCCESS;
   if (success && fence->active_queries)
      zink_prune_queries(screen, fence);
   return success;
}

Not much to see here. Normal fence stuff. I’ve made some additions:

static inline void
fence_remove_resource_access(struct zink_fence *fence, struct zink_resource *res)
{
   p_atomic_set(&res->batch_uses[fence->batch_id], 0);
}

bool
zink_fence_finish(struct zink_screen *screen, struct zink_fence *fence,
                  uint64_t timeout_ns)
{
   bool success = vkWaitForFences(screen->dev, 1, &fence->fence, VK_TRUE,
                                  timeout_ns) == VK_SUCCESS;
   if (success) {
      if (fence->active_queries)
         zink_prune_queries(screen, fence);

      /* unref all used resources */
      util_dynarray_foreach(&fence->resources, struct pipe_resource*, pres) {
         struct zink_resource *res = zink_resource(*pres);
         fence_remove_resource_access(fence, res);

         pipe_resource_reference(pres, NULL);
      }
   }
   return success;
}

Now as soon as a fence completes, all used resources will have the usage for the corresponding batch removed.

With this done, some changes can be made to zink_transfer_map():

static uint32_t
get_resource_usage(struct zink_resource *res)
{
   uint32_t batch_uses = 0;
   for (unsigned i = 0; i < 4; i++) {
      uint8_t used = p_atomic_read(&res->batch_uses[i]);
      if (used & ZINK_RESOURCE_ACCESS_READ)
         batch_uses |= ZINK_RESOURCE_ACCESS_READ << i;
      if (used & ZINK_RESOURCE_ACCESS_WRITE)
         batch_uses |= ZINK_RESOURCE_ACCESS_WRITE << i;
   }
   return batch_uses;
}

static void *
zink_transfer_map(struct pipe_context *pctx,
                  struct pipe_resource *pres,
                  unsigned level,
                  unsigned usage,
                  const struct pipe_box *box,
                  struct pipe_transfer **transfer)
   struct zink_resource *res = zink_resource(pres);
   uint32_t batch_uses = get_resource_usage(res);
   if (pres->target == PIPE_BUFFER) {
      if ((usage & PIPE_TRANSFER_READ && batch_uses >= ZINK_RESOURCE_ACCESS_WRITE) ||
          (usage & PIPE_TRANSFER_WRITE && batch_uses)) {
         /* need to wait for rendering to finish
          * TODO: optimize/fix this to be much less obtrusive
          * mesa/mesa#2966
          */

get_resource_usage() iterates over struct zink_resource::batch_uses to generate a full bitmask of the resource’s usage across all batches. Then, using the usage detail from the transfer_map, this can be applied in order to determine whether any waiting is necessary:

  • If the transfer_map is for reading, only wait for rendering if the resource has pending writes
  • If the transfer_map is for writing, wait if the resource has any pending usage
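A toy model of that decision (Python for illustration, not the zink C code; the bit layout assumes four batches with read bits in the low nibble and write bits in the high nibble, matching the `ZINK_RESOURCE_ACCESS_WRITE << i` shifts above):

```python
ACCESS_READ = 1        # per-batch read bit, shifted by batch index
ACCESS_WRITE = 1 << 4  # per-batch write bit, shifted by batch index

def must_wait(batch_uses, map_read, map_write):
    # reads only need to wait if any batch has pending writes;
    # writes need to wait on any pending usage at all
    if map_read and batch_uses >= ACCESS_WRITE:
        return True
    if map_write and batch_uses != 0:
        return True
    return False

# a buffer only read by batch 2 can be mapped for reading with no stall
assert not must_wait(ACCESS_READ << 2, map_read=True, map_write=False)
# but a pending write in batch 1 forces the flush-and-wait path
assert must_wait(ACCESS_WRITE << 1, map_read=True, map_write=False)
```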

When this resource usage flagging is properly applied to all types of resources, huge performance gains abound even despite how simple it is. For the above test case, flagging the sampler view resources with read-only access continues to ensure their lifetimes while enabling them to be concurrently read from when draw commands are still pending, which yields the improvements to the tex3d-npot test case above.

Future Gains

I’m always on the lookout for some gains, so I’ve already done and flagged some future work to be done here as well:

  • texture resources can have similar mechanisms applied for improved synchronization, whereas currently they have none
  • zink-wip has facilities for waiting on specific batches to complete, so zink_fence_wait() here can instead be changed to only wait on the last batch with write access for the PIPE_TRANSFER_READ case
  • buffer ranges can be added to struct zink_resource as many other drivers use with the util_range API, enabling the fencing here to be skipped if the regions have no overlap or no valid data is extant/pending
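The range idea in the last bullet can be sketched like so (illustrative Python; util_range in Mesa is a C API, and this only models the coarse min/max overlap test):

```python
class BufferRange:
    """Track the valid written extent of a buffer, util_range-style."""
    def __init__(self):
        self.start = self.end = 0  # empty range

    def add(self, start, end):
        if self.start == self.end:          # first write
            self.start, self.end = start, end
        else:                               # grow to cover both writes
            self.start = min(self.start, start)
            self.end = max(self.end, end)

    def overlaps(self, start, end):
        return self.start < end and start < self.end

r = BufferRange()
r.add(0, 64)
assert r.overlaps(32, 128)        # map touches valid data: may need to fence
assert not r.overlaps(64, 128)    # disjoint map: fencing can be skipped
```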

These improvements are mostly specific to unit testing, but for me personally, those are the most important gains to be making, as faster test runs mean faster verification that no regressions have occurred, which means I can continue smashing more code into the repo.

Stay tuned for future Zink Gains updates.

As a newbie, I consider debugging a guided study. During the process, I have a goal that leads me to browse the code, raise and drop various suspicions, look at the change history and perhaps relate changes from other parts of the kernel to the current problem. Co-debugging is even more interesting. Approaches are discussed, we open our minds to more possibilities, refine solutions and share knowledge… in addition to preventing any incomplete solution. Debugging the operation of VKMS when composing planes was a very interesting journey. All the shared ideas were so dynamic and open that, in the end, it was linked to another demand: the implementation of writeback support.

And that is how I fell into another debugging journey.

Writeback support on vkms is something Rodrigo Siqueira has been working on for some time. However, with so many comings and goings of versions, other DRM changes were introducing new issues. With each new version, it was necessary to reassess the current state of the DRM structures in addition to incorporating revisions from the previous version.

When Leandro Ribeiro reported that calling drm_crtc_vblank_put frees the writeback work, this indicated that there was also an issue with the simulated vblank execution there. I asked Siqueira a little more about writeback, to try to collaborate with his work. He explained to me what writeback is, how it is implemented in his patches, and the IGT test that is used for validation.

At that point, writeback was crashing on the last subtest, writeback-check-output, again due to a timeout waiting for resource access. We discussed the possible causes and concluded that the solution lay in defining when the composer work needs to be enabled and how to ensure that vblank keeps happening to also allow the writeback work. When Daniel suggested creating a small function to set output->composer_enabled, he was already thinking of writeback.

However, something that appeared to be straightforward was somewhat entangled by our initial assumption that the problem was concentrated in writeback-check-output. We later adjusted our investigation to also look at the writeback-fb-id subtest. We began this part of the debugging process using ftrace to check the execution steps of these two test cases. Looking at different execution cases and more parts of the code, we examined the writeback job to find the symmetries, the actions of each element, and the beginning and end of each cycle. We also included some prints in VKMS and the test code, and checked dmesg.

Siqueira concluded that VKMS needs to guarantee that the composer is enabled when starting the writeback job and to release it (as well as vblank) in the job’s cleanup (which is currently called when a job is queued or when the state is destroyed). Another concern was to avoid scattered modifications and keep the changes well encapsulated in writeback functions.

With that, Siqueira wrapped up the v5, and sent. Just fly, little bird:

As soon as it was sent, I checked whether the patchset passed not only the kms_writeback subtests, but also whether the other IGT tests I use had remained stable: kms_flip, kms_cursor_crc and kms_pipe_crc_basic. With that, I realized that a refactoring in the composer was out of date and reported the problem for correction (since it was a fix I had recently made). I also indicated that, apart from this small problem (not directly related to writeback), writeback support was working well in the tests performed. However, the series has not yet received any other feedback.

The end? Hmm… never ends.


In the past few weeks, I was examining two issues on vkms: development of writeback support and alpha blending. I try to keep activities in parallel so that one can recover me from any tiredness from the other :P

Alpha blending is a TODO[1] of the VKMS that possibly could solve the warning[2] triggered in the cursor-alpha-transparent testcase. And, you know, if you still have a warning, you still have work.

  1. Blend - blend value at vaddr_src with value at vaddr_dst
     * TODO: Use the alpha value to blend vaddr_src with vaddr_dst
     *	 instead of overwriting it.
  2. WARNING: Suspicious CRC: All values are 0

To develop this feature, I needed to understand and “practice” some abstract concepts: alpha compositing, bitwise operations, and endianness. The beginning was a little confusing, as I was reading many definitions and rarely seeing practical examples, which is terrible when dealing with abstractions. I searched for information on Google and, little by little, the code took shape in my head.

The code, the problem solved and a new problem found

I combined what I understood from the links above with what was already in VKMS. An important note: I consider that when we blend an alpha layer over the background, the final composition is solid (that is, opaque); the result on the screen is not a transparent plate - it is usually a black background. That said, the resulting alpha channel is 100% opaque, so we set it to 0xFF.

When executing my code, the two test cases related to the alpha channel (opaque and transparent) passed cleanly. But for me, it still wasn’t enough. I needed to see the colors read and the result of the composition (putting some pr_info in the code). Besides, only testing extreme values (solid black background, completely white cursor with a totally opaque or transparent alpha) did not convince me. See, I’m a little pessimistic and a little wary.

So I decided to play with the cursor-alpha testcase… and I scraped myself.


I changed the transparency in the test case from 0.0 to 0.5, and things started to fail. After a few rounds checking the returned values, what I always saw was a cursor color (ARGB) returned as follows:

  • A: the alpha determined (ok)
  • RGB: the color resulting from a white cursor with the set transparency already applied to a standard solid black background (not ok).

For example, I expected that a white cursor with transparency 0.5 would return ARGB = 80FFFFFF, but VKMS was reading 80808080 (???)

What is that?

I did more experiments to understand what was going on. I checked whether the 0.5 transparency worked on i915, and it did. I also changed the test RGB color of the cursor and the level of transparency. The cursor color read was always a combination of the given A + the RGB blended with black. I could only conclude that the mismatch happens because, in the hardware test, when composing a cursor color of 80808080 again over an FF000000 background (primary), I obtained FF404040 as the final result. However, in the software test step, the final color drawn was FF808080.

The developed code would also do that if it were getting the right color… how to deal with it, and why is it not a problem on the i915 driver?

Reading the documentation

I was reading the Cairo manual in addition to performing different color combination tests. I suspected that it could be a format problem, but even if it was, I had no idea what was “wrong”. After some days, I found the reason:


each pixel is a 32-bit quantity, with alpha in the upper 8 bits, then red, then
green, then blue. The 32-bit quantities are stored native-endian.
Pre-multiplied alpha is used. (That is, 50% transparent red is 0x80800000, not
0x80ff0000.) (Since 1.0)
Pre-multiplied alpha is used. Thanks, cairo… or not!

But what is pre-multiplied alpha? I didn’t even know it existed.

So, I needed to read about the difference between straight and premultiplied alpha. Moreover, I needed to figure out how to do alpha blending using this format.

After some searching, I found a post on Nvidia - Gameworks Blog that helped me a lot to understand this approach. Besides, the good old StackOverflow clarified what I needed to do.
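As an illustration of the representation (a Python sketch, not the VKMS patch): ARGB32 stores each color channel already multiplied by the alpha, rounded to 8 bits, which is exactly why 50%-transparent white reads back as 0x80808080.

```python
def premultiply(a, r, g, b):
    """Pack straight ARGB channels into Cairo-style premultiplied ARGB32."""
    def mul(c):
        return (c * a + 127) // 255  # 8-bit multiply with rounding
    return (a << 24) | (mul(r) << 16) | (mul(g) << 8) | mul(b)

# 50%-transparent white: not 0x80FFFFFF but 0x80808080
assert premultiply(0x80, 0xFF, 0xFF, 0xFF) == 0x80808080
# fully opaque values pass through unchanged
assert premultiply(0xFF, 0x12, 0x34, 0x56) == 0xFF123456
```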

So I adapted the code I had previously prepared for straight alpha blending to work with alpha-premultiplied colors. Of course, I put in pr_info to see the result. I also played a little more with the cursor and background colors in the kms_cursor_crc test for checking. For example, I changed the background color to solid red and/or the cursor color and/or the cursor alpha value. In all these situations, the patch worked fine.

So, I just sent:

A new interpretation for the warning

Examining this behavior, I discovered a more precise cause of the warning. As VKMS currently only writes the cursor over the background, a fully transparent black cursor has a color value of 0 and results in a transparent primary layer instead of a solid black one. When converting to XRGB, VKMS zeroes the alpha instead of making it opaque, which seems incorrect and triggers the warning.

I also think that this shows the weakness of the full opaque/transparent tests since they were not able to expose the wrong composition of colors. So I will prepare a patch for IGT to expand the coverage of this type of problem.

Well, with the alpha blending done, the warning will never bother me anyway.

Let it go!

After a long debugging process, we finally reached an acceptable solution to the problem of running sequential subtests of IGT tests involving CRC capture using the VKMS.

Despite all the time, uncertainty, and failures, this was, by far, one of the greatest lessons learned in my journey as a FLOSS developer. With this experience, I learned a lot about how VKMS works, evolved technically, experienced different feelings, and interacted with people I may never meet in person. These people shared some of their knowledge with me in open and accessible communication. The interactions allowed us to build a more suitable and effective solution together, and at that moment, everything I thought was important in the development of FLOSS came to fruition.

I know it sounds very romantic, but after more than a month of it, breathing is not a simple and involuntary movement.

With this patch, I was able to update my progress table on GSoC:

Result/Status   GSoC start   After accepted patches
pass            7            24 (+17)
warn            1            1
fail            2            0 (-2)
skip            236          221 (-15)
crash           0            0
timeout         0            0
incomplete      0            0
dmesg-warn      0            0
dmesg-fail      0            0
changes         0            0
fixes           0            0
regressions     0            0
total           246          246

The endless wait

I believe I have already talked about this problem in some other posts, as it has followed me since the elaboration of my GSoC project.

I was trying to solve the failure of one of the IGT kms_cursor_crc subtests, cursor-alpha-transparent. As this subtest has an implementation similar to cursor-alpha-opaque (which was passing), I used both to observe the ARGB -> XRGB manipulation happening in the CRC capture. However, strangely, cursor-alpha-opaque kept alternating between success and failure on the same kernel. Gradually, I noticed that this was a widespread problem with the sequential execution of subtests in VKMS.

In the beginning, I only noticed that the test crashed while waiting for a vblank event that was never going to happen. So I looked for solutions that enabled vblanks in time to prevent the subtest from getting stuck. However, as Daniel mentioned, these solutions just looked like duct tape.

The endless debugging

Initially, I found two problems and thought that both were related to the test implementation. So I sent patches to IGT to correct the cleanup process and to force a vblank. While waiting for feedback, I realized that a previously successful subtest was now failing, even with my intended solution: bad smell. I searched the history for the patch that changed this behavior and, in fact, reverting the change let the subtests flow.

It was then that I saw a patch sent to the DRM mailing list in which someone faced a similar problem, and I decided to participate in the discussion. From this point, doubts and suspicions increased even more. On the bright side, however, I wasn’t the only person looking into it. With our time zone differences, there were three of us sharing findings for someone else’s next shift.

I sent a first patch about this issue on June 25th (in the wrong direction). After much feedback, adjustments, and some other wrong attempts, the solution was finally accepted on August 8th.

The whole story is documented in the following discussions (patchwork links):

  1. Aiming in the wrong direction
  2. Joining the discussion of people that observed a similar problem
  3. Trying to hit the target … and failing
  4. Hey… I had this idea! But again, it was not ideal
  5. Finally, we come to a solution

And when it all seemed to come to an end, a fourth person appeared, raising doubts about other VKMS parts that were, in a way, related to the problem at hand.

And that was amazing! There’s still a lot of work out there!

Now I’m still a little tired of it, but I’ll try to write a more technical post about the solution. Since I have changed context, going back to it will take a little mental work.

For the next developments

  • Don’t keep the problem to yourself; share your doubts.
  • When making a patch, don’t forget the credits! It doesn’t make much difference for you, but it can for those who were with you on that journey. See more
  • Each thread/patch is full of information that can help you with other problems.
  • If you can’t dive into the code, browse through it.
  • ftrace is also a good friend
  • The mental effort is less if you write a post before changing the context.

P.S.: I still need to find a way to feel comfortable speaking on IRC.

The first round just passed so fast, and what did I do?

The case: the IGT test kms_cursor_crc

Becoming acquainted with an IGT test

  1. Anatomy of the test
    What: study, read the code, dive into each function, and describe its structure, steps, and functionality.
    To: be well aware of the code construction and, consequently, able to figure out and deal with problems.
  2. Patches to improve test documentation
    What: During the study of the kms_cursor_crc test, I realized that the subtests had no description that would help a newcomer to perceive the purpose of that subtest. To improve this documentation and some code comments, I sent some patches to IGT with what I was able to understand from my inspection.
  3. Refactoring a function with parameters that are never used.
    Outside the context of exploring and becoming acquainted with the case, while examining the anatomy of kms_cursor_crc I caught useless parameters in a general IGT function, i.e., it requires two parameters but never uses them within its code. I checked the author (git blame) and asked him on IRC about the need for these additional parameters, but I didn’t get a response (or maybe I missed the reply due to disconnection). Then, I sent an RFC patch to the mailing list, and again nothing. Finally, my mentor took a look, and he agreed that the parameters seem useless and can be removed. He asked me to resend it as a normal patch.

Maybe you can also take a look: lib/igt_fb: remove extra parameters from igt_put_cairo_ctx

Solving some problems

I have also sent a patchset addressing two reported problems in kms_cursor_crc:

  1. Access to debugfs data file is blocked
  2. Unstable behaviour in sequential subtests

You can check out more about that in my previous posts. Reviewing the logs, I saw that these problems in prepare_crtc were introduced more recently, since some of Haneen’s commits report that the subtests were working before. I noticed that waiting for a vblank was a solution used before and removed by another commit that does not seem to address this issue.

Sadly, my patches also received no feedback at all. So, if you are interested in them or can make an interesting comment, go ahead! :)

Preparing for the second round

In the last few days, I surveyed the materials and tasks needed to execute the second round of my project. This is a kind of roadmap with useful links.

As far as I know, I have to deal with three issues that currently seem unsupported by VKMS:

  1. Enhance the blend function to provide alpha composing
    To solve the warning message in pipe-%s-cursor-alpha-transparent
  2. DPMS (?)
    To solve failures in pipe-%s-cursor-dpms and pipe-%s-cursor-suspend
  3. Cursor size other than 64x64
    • See drivers/gpu/drm/drm_ioctl.c
    • See drivers/gpu/drm/vkms/vkms_drv.c (drm_getCap)
      drivers/gpu/drm/vkms/vkms_drv.c:131:    dev->mode_config.funcs = &vkms_mode_funcs;
      drivers/gpu/drm/vkms/vkms_drv.c:132:    dev->mode_config.min_width = XRES_MIN;
      drivers/gpu/drm/vkms/vkms_drv.c:133:    dev->mode_config.min_height = YRES_MIN;
      drivers/gpu/drm/vkms/vkms_drv.c:134:    dev->mode_config.max_width = XRES_MAX;
      drivers/gpu/drm/vkms/vkms_drv.c:135:    dev->mode_config.max_height = YRES_MAX;
      drivers/gpu/drm/vkms/vkms_drv.c:136:    dev->mode_config.preferred_depth = 24;
      drivers/gpu/drm/vkms/vkms_drv.c:137:    dev->mode_config.helper_private = &vkms_mode_config_helpers;

I also need to check why some subtests, like dpms/suspend, are failing now when they worked in the past, according to Haneen’s commit message:

commit db7f419c06d7cce892384df464d4b609a3ea70af
Author: Haneen Mohammed <>
Date:   Thu Sep 6 08:18:26 2018 +0300

    drm/vkms: Compute CRC with Cursor Plane
    This patch compute CRC for output frame with cursor and primary plane.
    Blend cursor with primary plane and compute CRC on the resulted frame.
    This currently passes cursor-size-change, and cursor-64x64-[onscreen,
    offscreen, sliding, random, dpms, rapid-movement] from igt
    kms_cursor_crc tests.
    Signed-off-by: Haneen Mohammed <>
    Signed-off-by: Daniel Vetter <>

In March, I inspected the coverage of kms_cursor_crc on VKMS to develop my GSoC project proposal. Using piglit, I present the evolution of this coverage so far:

Result        GSoC start   Only accepted patches   Fixes under development
pass          7            22                      24
warn          1            0                       1
fail          2            3                       0
skip          236          221                     221
crash         0            0                       0
timeout       0            0                       0
incomplete    0            0                       0
dmesg-warn    0            0                       0
dmesg-fail    0            0                       0
changes       0            0                       0
fixes         0            0                       0
regressions   0            0                       0
total         246          246                     246

+ Instability in the sequential run of subtests; i.e., although the statistics show 7 tests passing at the beginning, this result only became clear after a double-check, because when running all subtests sequentially, 8 tests fail and only 1 succeeds;

+ Warning or failure? My fix proposal for the pipe-A-cursor-alpha-transparent test was passing, but with a warning (CRC all zero)

As a first step, I decided to examine and solve the issue that affected the test results in a general way: the instability. Solving the instability first (or at least identifying what was going on) would make the work more consistent and fluid, since I would no longer need to double-check each subtest result, and in one run of the entire kms_cursor_crc test I could check for absent features or errors. However, in this investigation, some problems were more linked to IGT and others to VKMS. Identifying the “guilty” party was not simple, and some false charges happened.

This is a somewhat long story and deserves another post focused on the issue. The effective solution has not yet been found, but it is already clear that the proper bug fix should achieve:

  • The stability of the sequential execution of subtests;
  • The success of the following subtests: pipe-A-cursor-dpms pipe-A-cursor-suspend

As we don’t have that yet, I’ll focus this post on describing the way to allow testing of different cursor sizes in vkms.

The maximum cursor size

In 2018, one of Haneen’s contributions to vkms was adding support for the cursor plane. In this implementation, the cursor has a standard maximum supported size of 64x64, which limited the coverage of kms_cursor_crc to only the cursor subtests with this size.

IGT tests cursor sizes from 64 to 512 (powers of 2), but how do we enable cursors larger than the standard?

Initially, I thought I needed to develop some functionality from scratch, maybe by drawing in larger sizes… like a good beginner :) So I started to read some materials and check other drivers’ code to find out how to do this.

During this stage, by some kind of cosmic coincidence, I came across a conversation on IRC on the topic of “cursor sizes.” This conversation gave me some references that led me to the right work path.

Ctags-driven investigation

I tend to investigate Linux through keywords (like ctags in my head?). This approach helps me narrow the scope of reading and attempts. The Linux kernel is frighteningly large, and I know I still need many years to understand it as a whole (if that is even possible).

Therefore, these are the references that helped me to find the keywords and information related to my problem “the cursor size”:

  1. Comparing an AMD device with VKMS device: (keyword 1: DRM_CAP_CURSOR_HEIGHT)
  2. Checking max cursor sizes per driver: (keyword 2: capabilities)
  3. Finding where the DRM_CAP_CURSOR_HEIGHT value comes from: drivers/gpu/drm/drm_ioctl.c
    (How to define `dev->mode_config.cursor_height` in vkms?)

	/*
	 * Get device/driver capabilities
	 */
	static int drm_getcap(struct drm_device *dev, void *data,
			      struct drm_file *file_priv)
	{
		struct drm_get_cap *req = data;
		struct drm_crtc *crtc;

		req->value = 0;
		[..]
		case DRM_CAP_CURSOR_WIDTH:
			if (dev->mode_config.cursor_width)
				req->value = dev->mode_config.cursor_width;
			else
				req->value = 64;
			break;
		case DRM_CAP_CURSOR_HEIGHT:
			if (dev->mode_config.cursor_height)
				req->value = dev->mode_config.cursor_height;
			else
				req->value = 64;
			break;
		[..]
	}

  4. More information: include/uapi/drm/drm.h

    /*
     * The CURSOR_WIDTH and CURSOR_HEIGHT capabilities return a valid widthxheight
     * combination for the hardware cursor. The intention is that a hardware
     * agnostic userspace can query a cursor plane size to use.
     * Note that the cross-driver contract is to merely return a valid size;
     * drivers are free to attach another meaning on top, eg. i915 returns the
     * maximum plane size.
     */
    #define DRM_CAP_CURSOR_WIDTH 0x8
    #define DRM_CAP_CURSOR_HEIGHT 0x9
  5. Check the documentation for cursor_width:
    **struct drm_mode_config**
    Mode configuration control structure
    *cursor_width*: hint to userspace for max cursor width
    *cursor_height*: hint to userspace for max cursor height
  6. So, where is this mode_config defined in vkms?
    Here: drivers/gpu/drm/vkms/vkms_drv.c
    static int vkms_modeset_init(struct vkms_device *vkmsdev)
    {
    	struct drm_device *dev = &vkmsdev->drm;

    	dev->mode_config.funcs = &vkms_mode_funcs;
    	dev->mode_config.min_width = XRES_MIN;
    	dev->mode_config.min_height = YRES_MIN;
    	dev->mode_config.max_width = XRES_MAX;
    	dev->mode_config.max_height = YRES_MAX;
    	dev->mode_config.preferred_depth = 24;
    	dev->mode_config.helper_private = &vkms_mode_config_helpers;

    	return vkms_output_init(vkmsdev, 0);
    }

    There is nothing about the cursor here, so we need to assign maximum values so as not to take the default.

  7. I also found that, on my Intel computer, the maximum cursor size is 256. Why do the tests include 512?
  8. Also, there are subtests in kms_cursor_crc for non-square cursors, but these tests are restricted to i915 devices. Why are they there?
  9. Finally, I developed a simple patch that increases the coverage rate by 15 subtests. Considering the current drm-misc-next, my project state is:
Name          Results
pass          22
fail          3
skip          221
crash         0
timeout       0
warn          0
incomplete    0
dmesg-warn    0
dmesg-fail    0
changes       0
fixes         0
regressions   0
total         246
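For context, the simple patch mentioned above boils down to filling in the cursor size hints in vkms_modeset_init so that drm_getcap stops falling back to 64. A sketch of the idea (the exact values are my assumption based on the 64-512 range IGT exercises; check the actual patch for the final form):

```c
/* drivers/gpu/drm/vkms/vkms_drv.c, in vkms_modeset_init() (sketch) */
dev->mode_config.max_width = XRES_MAX;
dev->mode_config.max_height = YRES_MAX;
dev->mode_config.cursor_width = 512;  /* hint reported via DRM_CAP_CURSOR_WIDTH */
dev->mode_config.cursor_height = 512; /* hint reported via DRM_CAP_CURSOR_HEIGHT */
dev->mode_config.preferred_depth = 24;
```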

Cheer up!

It is also possible to count the subtests for which we have some idea of the problem and a provisional solution exists (I mean, an optimistic view):

Name          Results
pass          24
warn          1
fail          0
skip          221
crash         0
timeout       0
incomplete    0
dmesg-warn    0
dmesg-fail    0
changes       0
fixes         0
regressions   0
total         246

There is still the non-square cursors issue, which I’m not sure we should handle. Subtests with non-square cursors mean 16 skips.

Lessons learned

  1. For me, IRC conversations are inspiring and also show the community’s pulse. Part of what I question in my master’s research is the lack of interest from academic studies in using IRC as a means of understanding the community under investigation.
    One of the things I like about IRC is that, unlike other instant messengers today, when we are there, we are usually sitting in front of the computer (our work tool). I mean, we are not (I think) lying in a hammock on the beach, for example. Ok, there could be exceptions. :)
    But to be honest, I rarely talk on a channel; I am usually just watching.
  2. There are cases where the complexity lies in understanding what already exists rather than in developing new features. I still don’t know if that is frustrating or satisfying.
  3. I don’t know if I could understand things without using ctags. Ctags was a great tip from Siqueira, back in FLUSP times.

As the GSoC coding time officially started this week (June 1, 2020), this post marks the line between my activities in the community bonding period and the official development of my project. I used the last month to solve issues in my development environment, improve the tool that supports my development activity (kworkflow), and study concepts and implementations related to my GSoC project.


Besides e-mail, IRC chat, and Telegram, my mentor (Siqueira) and I are meeting every Wednesday on Jitsi, where we also use tmate for terminal sharing. We also use, together with Trevor, a spreadsheet to schedule tasks, report my daily activity, and write any suggestions.

Issues in my development environment

Probably you have already had this kind of pending task: a problem that troubles your work, but since you can go ahead despite it, you postpone it until total tiredness. Well, I had some small issues in my VM, from auth to the kernel update process, that I had been postponing forever. The first I solved with some direct setup, and the last led me to the second task: improving kworkflow.

Improvements for a set of scripts that facilitate my development day-to-day

The lack of support in kworkflow for deployment on a VM drove me to start hacking the code. As updating the VM requires understanding more complex things, I started by developing soft changes (less integrated with the script structure). This work is still in progress, and after discussing it with other kworkflow developers (in the issue, on IRC, and in voice meetings), the proposed changes were refined.

The first phase of my project proposal orbits the IGT test kms_cursor_crc. I already have preliminary experience with the test code; however, I lack a lot of knowledge about the steps and concepts behind the implementation. Bearing this in mind, I used part of the community bonding time to immerse myself in this code.

Checking issues in a patchset that was reported by IGT CI

My first project task is to find out why it is not possible to access debugfs files when running kms_cursor_crc (and to fix it). Two things could help me solve it: learning about debugfs and dissecting kms_cursor_crc. To guide my studies, my mentor suggested taking a look at a patchset for the IGT writeback test implementation, for which CI reported a crash in debugfs_test for i915. For this investigation, I installed Debian without a graphical environment on another machine (an old netbook) and, accessing it via ssh, applied the patches and ran the test. Well, everything seemed to work (and the subtests passed). Perhaps something has been fixed or changed in IGT since the patchset was sent. Nothing more to do here.

IGT_FORCE_DRIVER=i915 build/tests/debugfs_test 
IGT-Version: 1.25-gf1884574 (x86_64) (Linux: 4.19.0-9-amd64 x86_64)
Force option used: Using driver i915
Force option used: Using driver i915
Starting subtest: sysfs
Subtest sysfs: SUCCESS (0,009s)
Starting subtest: read_all_entries
Subtest read_all_entries: SUCCESS (0,094s)
Starting subtest: read_all_entries_display_on
Subtest read_all_entries_display_on: SUCCESS (0,144s)
Starting subtest: read_all_entries_display_off
Subtest read_all_entries_display_off: SUCCESS (0,316s)

Diving into the kms_cursor_crc test

I’m writing a kind of anatomy of the kms_cursor_crc test. I chose the alpha-transparent subtest as a target and then followed each step necessary to reach it, understanding each function called, its parameters, and also the abstractions. I am often confused by something that once seemed clear. Well, it’s graphic interface stuff, and it’s acceptable that these abstractions will disorientate me, LOL, I guess… The result of this work will be my next post. In the meantime, here are links that helped me on this journey

In this post, I describe the steps involved in the execution of a kms_cursor_crc subtest. In my approach, I chose a subtest (pipe-A-cursor-alpha-transparent) as a target and examined the code from the beginning of the test (igt main) until reaching the target subtest and executing it.

This is my zero version. I plan to incrementally expand this document with evaluation/description of the other subtests. I will probably also need to fix some misunderstandings.

As described by IGT, kms_cursor_crc

Uses the display CRC support to validate cursor plane functionality. The test will position the cursor plane either fully onscreen, partially onscreen, or fully offscreen, using either a fully opaque or fully transparent surface. In each case it then reads the PF CRC and compares it with the CRC value obtained when the cursor plane was disabled.

In the past, Haneen showed something about the test in a blog post. Fixing any issue in VKMS to make it pass all kms_cursor_crc subtests is also part of my GSoC project proposal.

The struct data_t

This struct, used in all subtests, stores many elements such as the DRM file descriptor, framebuffer info, cursor size info, etc. Also, before the test’s main function, a static data_t data is declared globally.

The beginning - igt_main

We can divide the main function into two parts: setup DRM stuff (igt_fixture) and subtest execution.

Initial test setup

igt_fixture {
	data.drm_fd = drm_open_driver_master(DRIVER_ANY);
	ret = drmGetCap(data.drm_fd, DRM_CAP_CURSOR_WIDTH, &cursor_width);
	igt_assert(ret == 0 || errno == EINVAL);
	/* Not making use of cursor_height since it is same as width, still reading */
	ret = drmGetCap(data.drm_fd, DRM_CAP_CURSOR_HEIGHT, &cursor_height);
	igt_assert(ret == 0 || errno == EINVAL);

drmGetCap(int fd, uint64_t capability, uint64_t *value) queries a capability of the DRM driver and returns 0 if the capability is supported. DRM_CAP_CURSOR_WIDTH and DRM_CAP_CURSOR_HEIGHT report a valid width/height for the hardware cursor.

/* We assume width and height are same so max is assigned width */
igt_assert_eq(cursor_width, cursor_height);


void kmstest_set_vt_graphics_mode(void)

From lib/igt_kms.c: This function sets the controlling virtual terminal (VT) into graphics/raw mode and installs an igt exit handler to set the VT back to text mode on exit. All kms tests must call this function to make sure that the framebuffer console doesn’t interfere by e.g. blanking the screen.


void igt_require_pipe_crc(int fd)

From lib/igt_debugfs: checks whether pipe CRC capturing is supported by the kernel. Uses igt_skip to automatically skip the test/subtest if this isn’t the case.

    igt_display_require(&data.display, data.drm_fd);

void igt_display_require(igt_display_t *display, int drm_fd)

From lib/igt_kms.c: Initializes \@display (a pointer to an #igt_display_t structure) and allocates the various resources required. This function automatically skips if the kernel driver doesn’t support any CRTC or outputs.

Run test on pipe

		run_tests_on_pipe(&data, pipe);

At this point, each subtest of kms_cursor_crc is run, and the pointer to the already set up struct data_t is passed.

static void run_tests_on_pipe(data_t *data, enum pipe pipe)

This function runs each subtest grouped by pipe. In the setup, it populates the passed data_t struct and then starts to call each subtest. In this version of the document, I focus only on the subtest test_cursor_transparent.

igt_subtest_f("pipe-%s-cursor-alpha-transparent", kmstest_pipe_name(pipe))
	run_test(data, test_cursor_transparent, data->cursor_max_w, data->cursor_max_h);

The execution of test_cursor_transparent

static void run_test(data_t *data, void (*testfunc)(data_t *), int cursor_w, int cursor_h)

The function run_test wraps the usual preparation for running a subtest and, afterwards, a cleanup. Therefore, it basically has three steps:

  1. Prepare CRTC: static void prepare_crtc(data_t *data, igt_output_t *output, int cursor_w, int cursor_h). This function is responsible for:
    • Select the pipe to be used
    • Create Front and Restore framebuffer of primary plane
    • Find a valid plane type for primary plane and cursor plane
    • Pair the primary framebuffer to its plane and set a default size
    • Create a new pipe CRC
    • Position the cursor fully visible
    • Store test image as cairo surface
    • Start CRC capture process
  2. Run subtest: testfunc(data) » static void test_cursor_transparent(data_t *data) » test_cursor_alpha(data, 0.0)

    The subtest test_cursor_transparent is a variation of test_cursor_alpha where the alpha channel is set to zero (that is, transparent). So, let’s take a look at the test_cursor_alpha execution:

static void test_cursor_alpha(data_t *data, double a)
	igt_display_t *display = &data->display;
	igt_pipe_crc_t *pipe_crc = data->pipe_crc;
	igt_crc_t crc, ref_crc;
	cairo_t *cr;
	uint32_t fb_id;
	int curw = data->curw;
	int curh = data->curh;
	/*alpha cursor fb*/
	fb_id = igt_create_fb(data->drm_fd, curw, curh,

When this subtest starts, it creates the cursor’s framebuffer with the format ARGB8888, i.e., a framebuffer with RGB plus an alpha channel (pay attention to endianness)

	cr = igt_get_cairo_ctx(data->drm_fd, &data->fb);
	igt_paint_color_alpha(cr, 0, 0, curw, curh, 1.0, 1.0, 1.0, a);
	igt_put_cairo_ctx(data->drm_fd, &data->fb, cr);

Then, the test uses some cairo resources to create a cairo surface for the cursor’s framebuffer and allocate a drawing context for it, draws a rectangle in RGB white with the given opacity (alpha channel), and, finally, releases the cairo surface and writes the changes out to the framebuffer. (Disclaimer: looking inside the function igt_put_cairo_ctx, I am not sure it does what its comments say, and I am also not sure all the parameters are necessary.)

The test is divided into two parts: Hardware test and Software test.

	/*Hardware Test*/
	igt_wait_for_vblank(data->drm_fd, data->pipe);
	igt_pipe_crc_get_current(data->drm_fd, pipe_crc, &crc);
	igt_remove_fb(data->drm_fd, &data->fb);

The hardware test consists in:

  • Enable the cursor: pair the cursor plane with a framebuffer, set the cursor size for its plane, and set the cursor plane size for the framebuffer. Commit the framebuffer changes on each display pipe
  • Wait for vblank. The vblank is a couple of extra scanlines that aren’t actually displayed on the screen, designed to give the electron gun (on CRTs) enough time to move back to the top of the screen to start scanning out the next frame. A vblank interrupt is used to notify the driver when it can start updating registers. To achieve a tear-free display, users must synchronize page flips and/or rendering to vertical blanking
  • Calculate and check the current CRC
	cr = igt_get_cairo_ctx(data->drm_fd, &data->primary_fb[FRONTBUFFER]);
	igt_paint_color_alpha(cr, 0, 0, curw, curh, 1.0, 1.0, 1.0, a);
	igt_put_cairo_ctx(data->drm_fd, &data->primary_fb[FRONTBUFFER], cr);
	igt_wait_for_vblank(data->drm_fd, data->pipe);
	igt_pipe_crc_get_current(data->drm_fd, pipe_crc, &ref_crc);
	igt_assert_crc_equal(&crc, &ref_crc);

The software test consists in:

  • Create a cairo surface and drawing context, draw a rectangle on the top left side in RGB white with the given opacity (alpha channel), and then release the cairo context, applying the changes to the framebuffer. Then, commit the framebuffer changes on each display pipe.
  • Wait for vblank
  • Calculate and check the current CRC

Finally, the screen is cleaned.

And, at the run_test level, the CRTC is cleaned up (cleanup_crtc(data));

Meanwhile, in GSoC:

I took the second week of Community Bonding to make some improvements to my development environment. As I have reported before, I use a QEMU VM to develop kernel contributions. I was initially using an Arch VM for development; however, at the beginning of the year, I reconfigured it to use a Debian VM, since my host is a Debian installation (fewer context changes). In this move, some loose ends remained and I did some workarounds; well, better to round them off.

I also use kworkflow (KW) to ease most of the non-coding tasks involved in day-to-day Linux kernel development. KW automates repetitive steps of a developer’s life, such as compiling and installing my kernel modifications; finding information to format and send patches correctly; mounting or remotely accessing a VM, etc. During the time that preceded the GSoC project submission, I noticed that the feature for installing a kernel inside the VM was incomplete. At that time, I started to use the “remote” option as a palliative. Therefore, I spent the last days learning more features and hacking kworkflow to improve my development environment (and send the changes back to the kw project).

I started by sending a minor fix to an alert message:

kw: small issue on u/mount alert message

Then I expanded the “explore” feature (looking for a string in directory contents) by adding the GNU grep utility in addition to the already used git grep. I gathered many helpful suggestions for this patch, and I applied them, together with the reviews received, in a new version:

src: add grep utility to explore feature

Finally, after many hours of searching, reading, and learning a little about guestfish, grub, initramfs-tools, and bash, I could create the first proposal of code changes that enable kw to automate the build and install of a kernel in a VM:

add support for deployment in a debian-VM

The main barrier to this feature was figuring out how to update grub on the VM without running the update-grub command via ssh access. First, I thought about adding a custom file with a new boot entry. Thinking and researching a little more, I realized that guestfish could solve the problem, and, following this logic, I found a blog post describing how to run “update-grub” with guestfish. From that, I made some adaptations that led to the solution.

However, in addition to updating grub, the feature still lacked some steps to properly install the kernel on the VM. I filled in the missing code by visiting an old FLUSP tutorial that describes, step by step, compiling and installing the Linux kernel inside a VM. I also used the implementation of the “remote” mode of “kw deploy” to wrap up.

Now I use kw to automatically compile and install a custom kernel on my development VM. So, time to sing: “Ooh, that’s why I’m easy; I’m easy as Sunday morning!”

Maybe not now. It’s time to learn more about IGT tests!

August 27, 2020

As a newbie, I consider debugging a guided study. During the process, I have a goal that leads me to browse the code, raise and discard various suspicions, look at the change history, and perhaps relate changes from other parts of the kernel to the current problem. Co-debugging is even more interesting. Approaches are discussed, we open our minds to more possibilities, refine solutions, and share knowledge… in addition to preventing any incomplete solution. Debugging the operation of vkms when composing planes was a very interesting journey. All the shared ideas were so dynamic and open that, in the end, it linked up with another demand: the implementation of writeback support.

And that is how I fell into another debugging journey.

Writeback support on vkms is something Rodrigo Siqueira has been working on for some time. However, with so many comings and goings of versions, other DRM changes were introducing new issues. With each new version, it was necessary to reassess the current state of the DRM structures in addition to incorporating revisions from the previous version.

When Leandro Ribeiro reported that calling drm_crtc_vblank_put frees the writeback work, this indicated that there was also an issue with the simulated vblank execution there. I asked Siqueira a little more about writeback, to try to collaborate on his work. He explained to me what writeback is, how it is implemented in his patches, and which IGT test is used for validation.

At that point, writeback was crashing on the last subtest, writeback-check-output, again due to a timeout waiting for resource access. We discussed the possible causes and concluded that the solution lay in defining when the composer work needed to be enabled and how to ensure that vblanks happen to allow the writeback work as well. When Daniel suggested creating a small function to set output->composer_enabled, he was already thinking of writeback.

However, something that appeared straightforward was somewhat entangled by our initial assumption that the problem was concentrated in writeback-check-output. We later adjusted our investigation to also look at the writeback-fb-id subtest. We began this part of the debugging process using ftrace to check the execution steps of these two test cases. Looking at different execution cases and more parts of the code, we examined the writeback job to find the symmetries, the actions of each element, and the beginning and end of each cycle. We also added some prints to VKMS and the test code, and checked dmesg.

Siqueira concluded that VKMS needs to guarantee that the composer is enabled when starting the writeback job, and release it (as well as vblank) in the job’s cleanup (which is currently called when a job has been queued or when the state is destroyed). Another concern was to avoid scattered modifications and keep the changes well encapsulated in the writeback functions.

With that, Siqueira wrapped up v5 and sent it out. Just fly, little bird.

As soon as it was sent, I checked not only that the patchset passed the kms_writeback subtests, but also that the other IGT tests I use remained stable: kms_flip, kms_cursor_crc and kms_pipe_crc_basic. With that, I realized that a refactoring in the composer was out of date and reported the problem for correction (since it was a fix I had recently made). I also indicated that, apart from this small problem (not directly related to writeback), writeback support was working well in the tests performed. However, the series has not yet received any other feedback.

The end? Hmm… never ends.

Extensions Extensions Extensions

Once more, I’ve spun the wheel of topics to blog about and landed on something extension-related. Such is the life of those working with Vulkan.

In this case, the extension is VK_EXT_custom_border_color, the underlying implementation for which I had the luxury of testing on ANV as written by the memelord himself, Ivan Briano.

Let’s get to it.

The setup for this (in zink_screen.c) is the same as for every extension, and I’ve blogged about this previously, so there’s no new ground to cover here.

Border Coloring

Fortunately, this is one of the simpler extensions to implement as well. This code is completely exclusive to the struct pipe_context::create_sampler_state hook, which is what gallium uses to have drivers set up their samplers (not sampler views, as those are separate).

Here’s the current implementation:

static void *
zink_create_sampler_state(struct pipe_context *pctx,
                          const struct pipe_sampler_state *state)
{
   struct zink_screen *screen = zink_screen(pctx->screen);

   VkSamplerCreateInfo sci = {};
   sci.sType = VK_STRUCTURE_TYPE_SAMPLER_CREATE_INFO;
   sci.magFilter = zink_filter(state->mag_img_filter);
   sci.minFilter = zink_filter(state->min_img_filter);

   if (state->min_mip_filter != PIPE_TEX_MIPFILTER_NONE) {
      sci.mipmapMode = sampler_mipmap_mode(state->min_mip_filter);
      sci.minLod = state->min_lod;
      sci.maxLod = state->max_lod;
   } else {
      sci.mipmapMode = VK_SAMPLER_MIPMAP_MODE_NEAREST;
      sci.minLod = 0;
      sci.maxLod = 0;
   }

   sci.addressModeU = sampler_address_mode(state->wrap_s);
   sci.addressModeV = sampler_address_mode(state->wrap_t);
   sci.addressModeW = sampler_address_mode(state->wrap_r);
   sci.mipLodBias = state->lod_bias;

   if (state->compare_mode == PIPE_TEX_COMPARE_NONE)
      sci.compareOp = VK_COMPARE_OP_NEVER;
   else {
      sci.compareOp = compare_op(state->compare_func);
      sci.compareEnable = VK_TRUE;
   }

   sci.unnormalizedCoordinates = !state->normalized_coords;

   if (state->max_anisotropy > 1) {
      sci.maxAnisotropy = state->max_anisotropy;
      sci.anisotropyEnable = VK_TRUE;
   }

   struct zink_sampler_state *sampler = CALLOC(1, sizeof(struct zink_sampler_state));
   if (!sampler)
      return NULL;

   if (vkCreateSampler(screen->dev, &sci, NULL, &sampler->sampler) != VK_SUCCESS) {
      FREE(sampler);
      return NULL;
   }

   return sampler;
}

It’s just populating a VkSamplerCreateInfo struct from struct pipe_sampler_state. Nothing too unusual.

Looking at the VK_EXT_custom_border_color spec, the only change required for a sampler to handle a custom border color is to add a populated VkSamplerCustomBorderColorCreateInfoEXT struct into the pNext chain for the VkSamplerCreateInfo.

The struct populating is simple enough:

VkSamplerCustomBorderColorCreateInfoEXT cbci = {};
cbci.sType = VK_STRUCTURE_TYPE_SAMPLER_CUSTOM_BORDER_COLOR_CREATE_INFO_EXT;
cbci.format = VK_FORMAT_UNDEFINED;
/* these are identical unions */
memcpy(&cbci.customBorderColor, &state->border_color, sizeof(union pipe_color_union));
sci.pNext = &cbci;

Helpfully here, the layouts of union pipe_color_union and the customBorderColor member of the Vulkan struct are the same, so this can just be copied directly. I added an assert to handle the case where the customBorderColorWithoutFormat feature isn’t provided by the driver, since currently I’m just being a bit lazy and ANV handles it just fine, but this shouldn’t be too much work to fix up if zink is ever run on a driver that doesn’t support it.

With this set, border colors will be drawn as expected.


It’s not always the case that border coloring is necessary, however, and there are also limits that must be obeyed.

For the former, the wrap values (struct pipe_sampler_state::wrap_(s|t|r), which are all enum pipe_tex_wrap) can be checked from the sampler state to determine whether specified border colors are actually going to be used, allowing them to be omitted when they’re unnecessary. Specifically, a small helper function like this can be used:

static inline bool
wrap_needs_border_color(unsigned wrap)
{
   return wrap == PIPE_TEX_WRAP_CLAMP || wrap == PIPE_TEX_WRAP_CLAMP_TO_BORDER ||
          wrap == PIPE_TEX_WRAP_MIRROR_CLAMP || wrap == PIPE_TEX_WRAP_MIRROR_CLAMP_TO_BORDER;
}

For other wrap modes, the border color can be ignored.

Looking now to limits, VK_EXT_custom_border_color provides the maxCustomBorderColorSamplers limit, which is the maximum number of samplers using custom border colors that can exist simultaneously. To avoid potential issues with this, a counter is atomically managed on the screen object based on the lifetimes of samplers that use custom border colors—another reason to avoid unnecessarily including custom border color structs here.

Altogether, the changed bits of the new function look like this:

static void *
zink_create_sampler_state(struct pipe_context *pctx,
                          const struct pipe_sampler_state *state)
{
   ...

   bool need_custom = false;
   need_custom |= wrap_needs_border_color(state->wrap_s);
   need_custom |= wrap_needs_border_color(state->wrap_t);
   need_custom |= wrap_needs_border_color(state->wrap_r);

   VkSamplerCustomBorderColorCreateInfoEXT cbci = {};
   if (screen->have_EXT_custom_border_color && need_custom) {
      cbci.sType = VK_STRUCTURE_TYPE_SAMPLER_CUSTOM_BORDER_COLOR_CREATE_INFO_EXT;
      cbci.format = VK_FORMAT_UNDEFINED;
      /* these are identical unions */
      memcpy(&cbci.customBorderColor, &state->border_color, sizeof(union pipe_color_union));
      sci.pNext = &cbci;
      sci.borderColor = VK_BORDER_COLOR_INT_CUSTOM_EXT;
      uint32_t check = p_atomic_inc_return(&screen->cur_custom_border_color_samplers);
      assert(check <= screen->max_custom_border_color_samplers);
   } else
      sci.borderColor = VK_BORDER_COLOR_FLOAT_TRANSPARENT_BLACK; // TODO with custom shader if we're super interested?

   ...
}


Borders colored.

August 26, 2020


Once more jumping around, here’s a brief look at an interesting issue I came across while implementing ARB_clear_texture.

When I started looking at the existing clear code in zink, I discovered that the only existing codepath for clears required being inside a renderpass using vkCmdClearAttachments. This meant that if a renderpass wasn’t active, we started one, which was suboptimal for performance, as unnecessarily starting a renderpass means we may end up having to end the renderpass just as quickly in order to emit some non-renderpass commands.

I decided to do something about this along the way. Vulkan has specific commands for clearing color and depth outside of renderpasses, and the usage is very straightforward. The key is detecting whether they’re capable of being used.


There’s a number of conditions for using these specialized Vulkan clear functions:

  • cannot be in a renderpass
  • must be intending to clear the entire surface (i.e., no scissor or full-surface scissor)
  • no 3D surfaces unless clearing all layers of the surface

The last is a product of a spec issue that lets drivers treat a 3D surface as possessing a single “3D” image layer instead of the actual count of layers inside it. As a result, any time a 3D surface is cleared with a non-zero layer specified, it can’t be cleared except with vkCmdClearAttachments.

The second criterion is less tricky, requiring only that the specified scissor region of the clear covers the entire surface being cleared. That can be accomplished like so with some mesa helper APIs:

static bool
clear_needs_rp(unsigned width, unsigned height, struct u_rect *region)
{
   struct u_rect intersect = {0, width, 0, height};

   if (!u_rect_test_intersection(region, &intersect))
      return true;

   u_rect_find_intersection(region, &intersect);
   if (intersect.x0 != 0 || intersect.y0 != 0 ||
       intersect.x1 != width || intersect.y1 != height)
      return true;

   return false;
}

If the intersection of the surface and the scissor region isn’t the entire size of the surface, true is returned and the renderpass codepath is used.

With this check done, and having verified previously that no renderpass is active on the current batch’s command buffer, some helper functions to perform the individual clear operations can be added:

static void
clear_color_no_rp(struct zink_batch *batch, struct zink_resource *res, const union pipe_color_union *pcolor, unsigned level, unsigned layer, unsigned layerCount)
{
   VkImageSubresourceRange range = {};
   range.baseMipLevel = level;
   range.levelCount = 1;
   range.baseArrayLayer = layer;
   range.layerCount = layerCount;
   range.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;

   VkClearColorValue color;
   color.float32[0] = pcolor->f[0];
   color.float32[1] = pcolor->f[1];
   color.float32[2] = pcolor->f[2];
   color.float32[3] = pcolor->f[3];

   if (res->layout != VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL)
      zink_resource_barrier(batch, res, VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL, 0);
   vkCmdClearColorImage(batch->cmdbuf, res->image, res->layout, &color, 1, &range);
}

static void
clear_zs_no_rp(struct zink_batch *batch, struct zink_resource *res, VkImageAspectFlags aspects, double depth, unsigned stencil, unsigned level, unsigned layer, unsigned layerCount)
{
   VkImageSubresourceRange range = {};
   range.baseMipLevel = level;
   range.levelCount = 1;
   range.baseArrayLayer = layer;
   range.layerCount = layerCount;
   range.aspectMask = aspects;

   VkClearDepthStencilValue zs_value = {depth, stencil};

   if (res->layout != VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL)
      zink_resource_barrier(batch, res, VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL, 0);
   vkCmdClearDepthStencilImage(batch->cmdbuf, res->image, res->layout, &zs_value, 1, &range);
}

This is as simple as populating the required structs, adding in a pipeline barrier if necessary to set the image layout, and then emitting the commands.


As per usual, there’s a gotcha here. Clearing color surfaces in this way doesn’t take SRGBness into account the way the renderpass codepath does, which means it’ll actually clear using the wrong colors if the surface is an RGBA view of an SRGB image. This means that the color value has to be converted prior to calling the clear_color_no_rp() helper:

if (psurf->format != res->base.format &&
    !util_format_is_srgb(psurf->format) && util_format_is_srgb(res->base.format)) {
   /* if SRGB mode is disabled for the fb with a backing srgb image then we have to
    * convert this to srgb color
    */
   color.f[0] = util_format_srgb_to_linear_float(pcolor->f[0]);
   color.f[1] = util_format_srgb_to_linear_float(pcolor->f[1]);
   color.f[2] = util_format_srgb_to_linear_float(pcolor->f[2]);
}

Here, psurf is the struct pipe_surface given by the gallium struct pipe_context::clear hook, and res is the struct zink_resource (aka subclassed struct pipe_resource) image that the surface is a view of. The util_format API takes care of checking srgb status, and then it also can be used to handle the conversion back to the expected format.

Now the color can be safely passed along to achieve the expected results, and out-of-renderpass clearing can safely be used for all cases where the original conditions are met.

August 25, 2020

Let's talk about eggs. X has always supported XSendEvent() which allows anyone to send any event to any client [1]. However, this event had a magic bit to make it detectable, so clients detect and subsequently ignore it. Spoofing input that just gets ignored is of course not productive, so in the year 13 BG [2] the XTest extension was conceived. XTest has a few requests that allow you to trigger a keyboard event (press and release, imagine the possibilities), buttons and pointer motion. The name may seem odd until someone explains to you that it was primarily written to support automated testing of X servers. But no-one has the time to explain that.

Having a separate extension worked around the issue of detectability and thus any client could spoof input events. Security concerns were addressed with "well, just ifdef out that extension then" which worked great until other applications started using it for input emulation. Since around ~2008 XTest events are emulated through special XTest devices in the server but that is solely to make the implementation less insane. Technically this means that XTest events are detectable again, except that no-one bothers to actually do that. Having said that, these devices only make it possible to detect an XTest event, but not which client sent that event. And, due to how the device hierarchy works, it's really hard to filter out those events anyway.

Now it's 2020 and we still have an X server that basically allows access to anything and any client to spoof input. This level of security is industry standard for IoT devices but we are trying to be more restrictive than that on your desktop, lest the stuff you type actually matters. The accepted replacement for X is of course Wayland, but Wayland-the-protocol does not provide an extension to emulate input. This makes sense, emulating input is not exactly a display server job [3] but it does leave us with a bunch of things no longer working.


So let's talk about libei. This new library for Emulated Input consists of two components: libei, the client library to be used in applications, and libeis, the part to be used by an Emulated Input Server, to be used in the compositors. libei provides the API to connect to an EIS implementation and send events. libeis provides the API to receive those events and pass them on to the compositor. Let's see how this looks in the stack:

+--------------------+               +------------------+
| Wayland compositor |---wayland-----| Wayland client B |
+--------------------+\              +------------------+
| libinput | libeis  | \_wayland______
+----------+---------+                \
     |          |                      +-------+------------------+
/dev/input/     +---brei---------------| libei | Wayland client A |
                                       +-------+------------------+

"brei" is the communication "Bridge for EI" and is a private protocol between libei and libeis.

Emulated input is not particularly difficult, mostly because when you're emulating input, you know exactly what you are trying to do. There is no interpretation of bad hardware data like libinput has to do, it's all straightforward. Hence the communication between libei and libeis is relatively trivial, it's all button, keyboard, pointer and touch events. Nothing fancy here and I won't go into details because that part will work as you expect it to. The selling point of libei is elsewhere, primarily in separation, distinction and control.

libei is a separate library to libinput or libwayland or Xlib or whatever else you may have. By definition, it is thus a separate input stream in both the compositor and the client. This means the compositor can do things like display a warning message while emulated input is active. Or it can re-route the input automatically, e.g. assign a separate wl_seat to the emulated input devices so they can be independently focused, etc. Having said that, the libeis API is very similar to libinput's API so integration into compositors is quite trivial because most of the hooks to process incoming events are already in place.

libei distinguishes between different clients, so the compositor is always aware of who is currently emulating input. You have synergy, xdotool and a test suite running at the same time? The compositor is aware of which client is sending which events and can thus handle those accordingly.

Finally, libei provides control over the emulated input. This works on multiple levels. A libei client has to request device capabilities (keyboard, touch, pointer) and the compositor can restrict to a subset of those (e.g. "no keyboard for you"). Second, the compositor can suspend/resume a device at any time. And finally, since the input events go through the compositor anyway, it can discard events it doesn't like. For example, even where the compositor allowed keyboards and the device is not suspended, it may still filter out Escape key presses. Or rewrite those to be Caps Lock because we all like a bit of fun, don't we?

The above isn't technically difficult either, libei itself is not overly complex. The interaction between an EI client and an EIS server is usually the following:

  • client connects to server and says hello
  • server disconnects the client, or accepts it
  • client requests one or more devices with a set of capabilities
  • server rejects those devices or allows those devices and/or limits their capabilities
  • server resumes the device (because they are suspended by default)
  • client sends events
  • client disconnects, server disconnects the client, server suspends the device, etc.
Again, nothing earth-shattering here. There is one draw-back though: the server must approve of a client and its devices, so a client is not able to connect, send events and exit. It must wait until the server has approved the devices before it can send events. This means tools like xdotool need to be handled in a special way, more on that below.

Flatpaks and portals

With libei we still have the usual difficulties: a client may claim it's synergy when really it's bad-hacker-tool. There's not that much we can do outside a sandbox, but once we are guarded by a portal, things look different:

+--------------------+
| Wayland compositor |_
+--------------------+ \
| libinput | libeis  |  \_wayland______
+----------+---------+                 \
     |      [eis-0.socket]              \
/dev/input/    /   \\                    +-------+------------------+
          |    |    ====================>| libei | Wayland client A |
          |    |    after                +-------+------------------+
   initial|    |    handover            /
connection|    |                       / initial request
          |    |                      / dbus[org.freedesktop.portal.EmulatedInput]
        +-+----+-------------+       /
        | xdg-desktop-portal |======/
        +--------------------+
The above shows the interaction when a client is run inside a sandbox with access to the org.freedesktop.portal.Desktop bus. The libei client connects to the portal (it won't see the EIS server otherwise), the portal hands it back a file descriptor to communicate on and from then on it's like a normal EI session. What the portal implementation can do though is write several Restrictions for EIS on the server side of the file descriptor using libreis. Usually, the portal would write the app-id of the client (thus guaranteeing a reliable name) and maybe things like "don't allow keyboard capabilities for any device". Once the fd is handed to the libei client, the restrictions cannot be loosened anymore so a client cannot overwrite its own name.

So the full interaction here becomes:

  • Client connects to org.freedesktop.portal.EmulatedInput
  • Portal implementation verifies that the client is allowed to emulate input
  • Portal implementation obtains a socket to the EIS server
  • Portal implementation sends the app id and any other restrictions to the EIS server
  • Portal implementation returns the socket to the client
  • Client creates devices, sends events, etc.
For a client, connecting to the portal instead of directly to the EIS server is currently a one-line change.

Note that the portal implementation is still in its very early stages and there are changes likely to happen. The general principle is unlikely to change though.


Turns out we still have a lot of X clients around so somehow we want to be able to use those. Under Wayland, those clients connect to Xwayland which translates X requests to Wayland requests. X clients will use XTest to emulate input which currently goes to where the dodos went. But we can add libei support to Xwayland and connect XTest, at which point the diagram looks like this:

+--------------------+               +------------------+
| Wayland compositor |---wayland-----| Wayland client B |
+--------------------+\              +------------------+
| libinput | libeis  | \_wayland______
+----------+---------+                \
     |          |                      +-------+------------------+
/dev/input/     +---brei---------------| libei |     XWayland     |
                                       +-------+---+--------------+
                                                   |   X client   |
                                                   +--------------+
XWayland is basically just another Wayland client but it knows about XTest requests, and it knows about the X clients that request those. So it can create a separate libei context for each X client, with pointer and keyboard devices that match the XTest abilities. The end result of that is you can have a xdotool libei context, a synergy libei context, etc. And of course, where XWayland runs in a sandbox it could use the libei portal backend. At which point we have a sandboxed X client using a portal to emulate input in the Wayland compositor. Which is pretty close to what we want.

Where to go from here?

So, at this point the libei repo is still sitting in my personal gitlab namespace. Olivier Fourdan has done most of the Xwayland integration work and there's a basic Weston implementation. The portal work (tracked here) is the one needing the most attention right now, followed by the implementations in the real compositors. I think I have tentative agreement from the GNOME and KDE developers that all this is a good idea. Given that the goal of libei is to provide a single approach to emulate input that works on all(-ish) compositors [4], that's a good start.

Meanwhile, if you want to try it, grab libei from git, build it, and run the eis-demo-server and ei-demo-client tools. And for portal support, run the eis-fake-portal tool, just so you don't need to mess with the real one. At the moment, those demo tools will have a client connecting a keyboard and pointer and sending motion, button and 'qwerty' keyboard events every few seconds. The latter with client and/or server-set keymaps because that's possible too.


What does all this have to do with eggs? "Ei", "Eis", "Brei", and "Reis" are, respectively, the German words for "egg", "ice" or "ice cream", "mush" (think: porridge) and "rice". There you go, now you can go on holidays to a German speaking country and sustain yourself on a nutritionally imbalanced diet.

[1] The whole "any to any" is a big thing in X and just shows that in the olden days you could apparently trust, well, apparently anyone
[2] "before git", i.e. too bothersome to track down so let's go with the Copyright notice in the protocol specification
[3] XTest the protocol extension exists presumably because we had a hammer and a protocol extension thus looked like a nail
[4] Let's not assume that every compositor having their own different approach to emulation is a good idea

The Tessellating Commences

Tessellation shaders were the second type of new shader I worked to implement after geometry shaders, which I haven’t blogged about yet.

As I always say, why start at the beginning when I could start at the middle, then go back to the beginning, then skip right to the end.

Tessellation happens across two distinct shader stages: tessellation control (TCS) and tessellation evaluation (TES). In short, the former sets up parameters for the latter to generate vertex data with. In OpenGL, TCS can be omitted by specifying default inner and outer levels with glPatchParameterfv which will then be propagated to gl_TessLevelInner and gl_TessLevelOuter, respectively.

Vulkan, however, will only perform tessellation when both TCS and TES are present for a given graphics pipeline.

There are piglit tests which verify GL’s TCS-less tessellation capability.


Faking It

As is the case with many things in this project, Vulkan’s lack of support for the “easier” OpenGL method of doing things requires jumping through a number of hoops. In this case, the hoops amount to getting default inner/outer levels from gallium states into the TES stage.

Going into this problem, I knew nothing about any of these functionalities or shaders (or even that they existed), but I quickly learned a number of key points:

  • TES implicitly expects to read gl_TessLevelInner and gl_TessLevelOuter variables
  • TCS expands and propagates all VS outputs to something like type output_var[gl_MaxPatchVertices] for the TES to read
  • Vulkan has no concept of “default” levels
  • Push constants exist

The last of these was the key to my approach. In short, if a TCS doesn’t exist, I just need to make one. Since I already have full control of the pipeline state, I know when a TCS is absent, and I also know the default tessellation levels. Thus, I can emit them to a shader using a push constant, and I can also create my own passthrough TCS stage to jam into the pipeline.

Ideally, this passthrough shader would look a bit like this pseudoglsl:

#version 150
#extension GL_ARB_tessellation_shader : require

in vec4 some_var[gl_MaxPatchVertices];
out vec4 some_var_out;

layout(push_constant) uniform tcsPushConstants {
    layout(offset = 0) float TessLevelInner[2];
    layout(offset = 8) float TessLevelOuter[4];
} u_tcsPushConstants;
layout(vertices = $vertices_per_patch) out;
void main()
  gl_TessLevelInner = u_tcsPushConstants.TessLevelInner;
  gl_TessLevelOuter = u_tcsPushConstants.TessLevelOuter;
  some_var_out = some_var[gl_InvocationID];

where all input variables get expanded to gl_MaxPatchVertices, and the default levels are pulled from the push constant block.

Shader Construction

Now, it’s worthwhile to note that at this point, ntv had no ability to load push constants. With that said, I’ve already blogged extensively about the process for UBO handling, and the mechanisms for both types of load are more or less identical. This means that as long as I structure my push constant data in the same way that I’ve expected my UBOs to look (i.e., a struct containing an array of vec4s), I can just copy and paste the existing code with a couple minor tweaks and simplifications, since I know this codepath will only be hit by fixed function shader code that I’ve personally verified.

Therefore, I don’t actually have to blog about the specific loading mechanism, and I can just refer to that post and skip to the juicy stuff.

Specifically, I’m talking about writing the above tessellation control shader in NIR.

As a quick note, I’ve created (in patches after the initial implementation) a struct for managing the overall push constant data, which at this point would look like this:

struct zink_push_constant {
   float default_inner_level[2];
   float default_outer_level[4];
};

Now, let’s jump in.

nir_shader *nir = nir_shader_create(NULL, MESA_SHADER_TESS_CTRL, &nir_options, NULL);
nir_function *fn = nir_function_create(nir, "main");
fn->is_entrypoint = true;
nir_function_impl *impl = nir_function_impl_create(fn);

First, I’ve created a new shader for the TCS stage, and I’ve added a “main” function that’s flagged as a shader entrypoint. This is the base of every shader.

nir_builder b;
nir_builder_init(&b, impl);
b.cursor = nir_before_block(nir_start_block(impl));

Here, a nir_builder is instantiated for the shader. nir_builder is the tool which allows modifying existing shaders. It operates using a “cursor” position, which determines where new instructions are being added.

To begin inserting instructions, the cursor is positioned at the start of the function block that’s just been created.

nir_intrinsic_instr *invocation_id = nir_intrinsic_instr_create(b.shader, nir_intrinsic_load_invocation_id);
nir_ssa_dest_init(&invocation_id->instr, &invocation_id->dest, 1, 32, "gl_InvocationID");
nir_builder_instr_insert(&b, &invocation_id->instr);

The first instruction is loading the gl_InvocationID variable, which is a single 32-bit integer. The instruction is created, then the destination is initialized to the type being stored since this instruction type loads a value, and then the instruction is inserted into the shader using the nir_builder.

nir_foreach_shader_out_variable(var, vs->nir) {
   const struct glsl_type *type = var->type;
   const struct glsl_type *in_type = var->type;
   const struct glsl_type *out_type = var->type;
   char buf[1024];
   snprintf(buf, sizeof(buf), "%s_out", var->name);
   in_type = glsl_array_type(type, 32 /* MAX_PATCH_VERTICES */, 0);
   out_type = glsl_array_type(type, vertices_per_patch, 0);

   nir_variable *in = nir_variable_create(nir, nir_var_shader_in, in_type, var->name);
   nir_variable *out = nir_variable_create(nir, nir_var_shader_out, out_type, buf);
   out->data.location = in->data.location = var->data.location;
   out->data.location_frac = in->data.location_frac = var->data.location_frac;

   /* gl_in[] receives values from equivalent built-in output
      variables written by the vertex shader (section 2.14.7).  Each array
      element of gl_in[] is a structure holding values for a specific vertex of
      the input patch.  The length of gl_in[] is equal to the
      implementation-dependent maximum patch size (gl_MaxPatchVertices).
      - ARB_tessellation_shader
    */
   for (unsigned i = 0; i < vertices_per_patch; i++) {
      /* we need to load the invocation-specific value of the vertex output and then store it to the per-patch output */
      nir_if *start_block = nir_push_if(&b, nir_ieq(&b, &invocation_id->dest.ssa, nir_imm_int(&b, i)));
      nir_deref_instr *in_array_var = nir_build_deref_array(&b, nir_build_deref_var(&b, in), &invocation_id->dest.ssa);
      nir_ssa_def *load = nir_load_deref(&b, in_array_var);
      nir_deref_instr *out_array_var = nir_build_deref_array_imm(&b, nir_build_deref_var(&b, out), i);
      nir_store_deref(&b, out_array_var, load, 0xff);
      nir_pop_if(&b, start_block);
   }
}

Next, all the output variables for the vertex shader are looped over and expanded to arrays of MAX_PATCH_VERTICES size, which in mesa’s case is 32. These values are loaded based on gl_InvocationID, so each store to the expanded variable is made conditional on the current invocation ID matching the patch vertex index i, which is exactly what the nir_push_if line checks.

nir_variable *gl_TessLevelInner = nir_variable_create(nir, nir_var_shader_out, glsl_array_type(glsl_float_type(), 2, 0), "gl_TessLevelInner");
gl_TessLevelInner->data.location = VARYING_SLOT_TESS_LEVEL_INNER;
gl_TessLevelInner->data.patch = 1;
nir_variable *gl_TessLevelOuter = nir_variable_create(nir, nir_var_shader_out, glsl_array_type(glsl_float_type(), 4, 0), "gl_TessLevelOuter");
gl_TessLevelOuter->data.location = VARYING_SLOT_TESS_LEVEL_OUTER;
gl_TessLevelOuter->data.patch = 1;

Following this, the gl_TessLevelInner and gl_TessLevelOuter variables must be created so they can be processed in ntv and converted to the corresponding variables that SPIRV and the underlying vulkan driver expect to see. All that’s necessary for this case is to set the types of the variables, the slot location, and then flag them as patch variables.

/* hacks so we can size these right for now */
struct glsl_struct_field *fields = ralloc_size(nir, 2 * sizeof(struct glsl_struct_field));
fields[0].type = glsl_array_type(glsl_uint_type(), 2, 0);
fields[0].name = ralloc_asprintf(nir, "gl_TessLevelInner");
fields[0].offset = 0;
fields[1].type = glsl_array_type(glsl_uint_type(), 4, 0);
fields[1].name = ralloc_asprintf(nir, "gl_TessLevelOuter");
fields[1].offset = 8;
nir_variable *pushconst = nir_variable_create(nir, nir_var_shader_in,
                                              glsl_struct_type(fields, 2, "struct", false), "pushconst");
pushconst->data.location = VARYING_SLOT_VAR0;

Next up, creating a variable to represent the push constant for the TCS stage that’s going to be injected. SPIRV is always explicit about variable sizes, so it’s important that everything match up with the size and offset in the push constant that’s being emitted. The inner level value will be first in the push constant layout, so this has offset 0, and the outer level will be second, so this has an offset of sizeof(gl_TessLevelInner), which is known to be 8.

nir_intrinsic_instr *load_inner = nir_intrinsic_instr_create(b.shader, nir_intrinsic_load_push_constant);
load_inner->src[0] = nir_src_for_ssa(nir_imm_int(&b, offsetof(struct zink_push_constant, default_inner_level)));
nir_intrinsic_set_base(load_inner, offsetof(struct zink_push_constant, default_inner_level));
nir_intrinsic_set_range(load_inner, 8);
load_inner->num_components = 2;
nir_ssa_dest_init(&load_inner->instr, &load_inner->dest, 2, 32, "TessLevelInner");
nir_builder_instr_insert(&b, &load_inner->instr);

nir_intrinsic_instr *load_outer = nir_intrinsic_instr_create(b.shader, nir_intrinsic_load_push_constant);
load_outer->src[0] = nir_src_for_ssa(nir_imm_int(&b, offsetof(struct zink_push_constant, default_outer_level)));
nir_intrinsic_set_base(load_outer, offsetof(struct zink_push_constant, default_outer_level));
nir_intrinsic_set_range(load_outer, 16);
load_outer->num_components = 4;
nir_ssa_dest_init(&load_outer->instr, &load_outer->dest, 4, 32, "TessLevelOuter");
nir_builder_instr_insert(&b, &load_outer->instr);

for (unsigned i = 0; i < 2; i++) {
   nir_deref_instr *store_idx = nir_build_deref_array_imm(&b, nir_build_deref_var(&b, gl_TessLevelInner), i);
   nir_store_deref(&b, store_idx, nir_channel(&b, &load_inner->dest.ssa, i), 0xff);
}
for (unsigned i = 0; i < 4; i++) {
   nir_deref_instr *store_idx = nir_build_deref_array_imm(&b, nir_build_deref_var(&b, gl_TessLevelOuter), i);
   nir_store_deref(&b, store_idx, nir_channel(&b, &load_outer->dest.ssa, i), 0xff);
}

Finally, the push constant data can be set to the created variables. Each variable needs two parts:

  • loading the push constant data
  • storing the loaded data to the variable

Each load of the push constant data requires a base and a range along with a single source value. The base and the source are both the offset of the value being loaded (referenced from struct zink_push_constant), and the range is the size of the value, which is constant based on the variable.

Using the same mechanic as the first intrinsic instruction that was created, each intrinsic is created and has its destination value initialized based on the size and number of components of the type being loaded (first load 2 values, then 4 values). A src is created based on the struct offset, which determines the offset at which data will be loaded, and the instruction is inserted using the nir_builder.

Following this, the loaded value is then dereferenced for each array member and stored component-wise into the actual gl_TessLevel* variable.

nir->info.tess.tcs_vertices_out = vertices_per_patch;
nir_validate_shader(nir, "created");

pushconst->data.mode = nir_var_mem_push_const;

At last, the finishing touches. The vertices_per_patch value is set so that SPIRV can process the shader, and the shader is then validated by NIR to ensure that I didn’t screw anything up.

And then the push constant gets its variable mode swizzled to push constant (nir_var_mem_push_const) so that ntv can special case it. This is very illegal, and so it has to happen at the point when absolutely no other NIR passes will be run on the shader, or everything will explode.

But it’s totally safe, I promise.

With all this done, the only remaining step is to set up the push constant in the Vulkan pipeline, which is trivial by comparison.

The TCS is picked up automatically after being jammed into the existing shader stages, and everything works like it should.

August 24, 2020

After Some Time

Extended renderer info (GLX_MESA_query_renderer):
    Vendor: Collabora Ltd (0x8086)
    Device: zink (Intel(R) Iris(R) Plus Graphics (ICL GT2)) (0x8a52)
    Version: 20.3.0
    Accelerated: yes
    Video memory: 11690MB
    Unified memory: yes
    Preferred profile: core (0x1)
    Max core profile version: 4.6
    Max compat profile version: 3.0
    Max GLES1 profile version: 1.1
    Max GLES[23] profile version: 3.2
OpenGL vendor string: Collabora Ltd
OpenGL renderer string: zink (Intel(R) Iris(R) Plus Graphics (ICL GT2))
OpenGL core profile version string: 4.6 (Core Profile) Mesa 20.3.0-devel (git-b229c29e4d)
OpenGL core profile shading language version string: 4.60

It’s been about three months since I jumped into the project to learn more about graphics drivers, and zink has now gone from supporting GL 3.0 to GL 4.6 and GLES 3.2 compatibility*. Currently I’m at a 91% pass rate on piglit tests while forcing ANV to do unsupported fp64 ops, which seems like a pretty good result.

*in my branch

Not bad for some random guy sitting in his basement.

What’s Next?

Well, first up is probably going to be a bit of a break. I’ve been surfing the gfx pipeline nonstop for the past couple weeks, as has been hinted at by my lack of posts here, and so there’s some non-graphics stuff I’ve been putting off doing.

After that, I imagine I’ll be going back to work on some of these test failures and go through more of the CTS results since there’s a lot of test coverage over there.

There’s also lots of refactoring work that I’ve flagged for doing at some point when I had time or interest.

Also, more blogging.

When Will This Be Merged?

Patch review is a complex process during which code must be carefully examined for defects. It takes time to do it right, and it shouldn’t be rushed.

With that said, I’ve got around 300 patches in my branch (likely far more once I break up some of the giant wip ones), so it’s unlikely that all of it will be landed anytime in the very near future.

Anyone interested in trying things out immediately can pull my zink-wip branch and follow the wiki instructions for building and running the zink driver on top of existing distro-provided vulkan drivers without any other changes, potentially even submitting bug reports for me to look at.

Tomorrow, I’ll be starting a dive here into the tessellation shader implementation and some “cool” things I had to do there.

August 20, 2020

I had some requirements for writing a vulkan software rasterizer within the Mesa project. I took some time to look at the options, and realised that just writing a vulkan layer on top of gallium's llvmpipe would be a good answer for this problem. However, in doing so, I knew people would ask why this approach wouldn't work for a hardware driver.


What is vallium?

The vallium layer is a gallium frontend. It takes the Vulkan API and roughly translates it into the gallium API.

How does it do that?

Vulkan is a low-level API: it allows the user to allocate memory, create resources, and record command buffers, amongst other things. When a hw vulkan driver is recording a command buffer, it is putting hw-specific commands into it that will be run directly on the GPU. These command buffers are submitted to queues when the app wants to execute them.

Gallium is a context level API, i.e. like OpenGL/D3D10. The user has to create resources and contexts and the driver internally manages command buffers etc. The driver controls internal flushing and queuing of command buffers.
In order to bridge the gap, the vallium layer abstracts the gallium context into a separate thread of execution. When recording a vulkan command buffer it creates a CPU side command buffer containing an encoding of the Vulkan API. It passes that recorded CPU command buffer to the thread on queue submission. The thread then creates a gallium context, and replays the whole CPU recorded command buffer into the context, one command at a time.

That sounds horrible, isn't it slow?

Yes.

Why doesn't that matter for *software* drivers?

Software rasterizers are a very different proposition, from an overhead point of view, than real hardware. CPU rasterization is pretty heavy on the CPU load, so nearly always 90% of your CPU time will be spent in the rasterizer and fragment shader. Having some minor CPU overheads around command submission and queuing isn't going to matter in the overall profile of the user application. CPU rasterization is already slow; the Vulkan->gallium translation overhead isn't going to be the reason for making it much slower.
For real HW drivers, which are meant to record their own command buffers in the GPU domain and submit them directly to the hw, adding in a CPU layer that just copies the command buffer data is a massive overhead, and one that can't easily be removed from the vallium layer.

The vallium execution context is also pretty horrible: it has to connect all the state pieces like shaders etc. to the gallium context, and disconnect them all at the end of each command buffer. There is only one command submission queue and one context to be used. A lot of hardware exposes more queues etc. that this will never model.

I still don't want to write a vulkan driver, give me more reasons.

Pipeline barriers:

Pipeline barriers in Vulkan are essential to efficient hw usage by the driver. They are one of the most difficult pieces of writing a vulkan driver to understand and get right. For a software rasterizer they are also mostly unneeded: when I get a barrier I just completely hard-flush the gallium context, because I know the sw driver behind it. For a real hardware driver this would be a horrible solution; you spend a lot of time trying to make barriers optimal.

Memory allocation:

Vulkan is built around the idea of separate memory allocation and objects binding to those allocations. Gallium is built around object allocation with the memory allocs happening implicitly. I've added some simple memory allocation objects to the gallium API for swrast. These APIs are in no way useful for hw drivers. There is no way to expose memory types or heaps from gallium usefully. The current memory allocation API works for software drivers because I know all they want is an aligned_malloc. There is no decent way to bridge this gap without writing a new gallium API that looks like Vulkan. (in which case just write a vulkan driver already).

Can this make my non-Vulkan capable hw run Vulkan?

No. If the hardware can't do virtual memory properly, or can't expose the features vulkan needs, this can't be fixed with a software layer that just introduces overhead.