planet.freedesktop.org
August 06, 2020

Post-Storm Posting

I’m back, and today’s topic is testing.

Again.

But this time is different. This time I’m going to be looking into a specific test format, namely piglit shader tests.

Shader tests in piglit are tests which are passed through piglit’s undocumented shader_runner binary, which parses *.shader_test files at runtime to automatically produce tests based on GLSL without requiring any actual GL code and only minimal boilerplate. This makes writing tests easy, and, more importantly for my own use case, it makes debugging them easier.

An Example

[require]
GLSL >= 1.50

[vertex shader passthrough]

[fragment shader]
#version 150

uniform float c;

out vec4 color;

void main()
{
	// set the 'green' value of the color to the value of the uniform
	color = vec4(0.0, c, 0.0, 1.0);
}

[test]
clear color 0.2 0.2 0.2 0.2
clear

// set 'c' to 1
uniform float c 1;
// draw a rect starting at -1,-1 (bottom left) with a width of 2 units, covering the fb
draw rect -1 -1 2 2
// read back the color buffer at [0.0, 0.0] (bottom left) and compare against rgb(0.0, 1.0, 0.0)
relative probe rect rgb (0.0, 0.0, 0.5, 0.5) (0.0, 1.0, 0.0)

Here’s an extremely simple example with a passthrough vertex shader (i.e., gl_Position = piglit_vertex;) which draws a full-framebuffer rectangle based on the color passed in as a uniform value and then reads back the color in the middle of the fb to check that it’s green.

Debugging

The great part of shader tests like these is that they’re very simple to debug in a visual way. For example, if the above test was failing, I’d have a number of easy ways to narrow down where the problem was:

  • replace c in the output color with a constant value to ensure that uniform loading isn’t an issue
  • change the first two components of the probe command’s coordinates to see if perhaps the failure is from the read failing in the specified location in the color buffer
  • change the size of the rectangle being drawn in the event that it isn’t as large as it should be to verify that the driver isn’t using some hardcoded value here

And through all of this, no rebuilding is necessary; I can just make my edits and then shader_runner will generate the test at runtime.

More Complicated

Shader tests can get a bit more complex as well. For example, here’s another “basic” one that I was using while working on dynamic UBO indexing:

# This test verifies that dynamically uniform indexing of sampler arrays
# in the fragment shader behaves correctly.

[require]
GLSL >= 1.50
GL_ARB_gpu_shader5

[vertex shader passthrough]

[fragment shader]
#version 150
#extension GL_ARB_gpu_shader5: require

uniform sampler2D s[4];

uniform int n;

out vec4 color;

void main()
{
	color = texture(s[n], vec2(0.75, 0.25));
}

[test]
clear color 0.2 0.2 0.2 0.2
clear

uniform int s[0] 0
uniform int s[1] 1
uniform int s[2] 2
uniform int s[3] 3

texture checkerboard 0 0 (32, 32) (0.5, 0.0, 0.0, 0.0) (1.0, 0.0, 0.0, 0.0)
texparameter 2D min nearest
texparameter 2D mag nearest

texture checkerboard 1 0 (32, 32) (0.5, 0.0, 0.0, 0.0) (0.0, 1.0, 0.0, 0.0)
texparameter 2D min nearest
texparameter 2D mag nearest

texture checkerboard 2 0 (32, 32) (0.5, 0.0, 0.0, 0.0) (0.0, 0.0, 1.0, 0.0)
texparameter 2D min nearest
texparameter 2D mag nearest

texture checkerboard 3 0 (32, 32) (0.5, 0.0, 0.0, 0.0) (1.0, 1.0, 1.0, 1.0)
texparameter 2D min nearest
texparameter 2D mag nearest

uniform int n 0
draw rect -1 -1 1 1

relative probe rect rgb (0.0, 0.0, 0.5, 0.5) (1.0, 0.0, 0.0)

uniform int n 1
draw rect 0 -1 1 1

relative probe rect rgb (0.5, 0.0, 0.5, 0.5) (0.0, 1.0, 0.0)

uniform int n 2
draw rect -1 0 1 1

relative probe rect rgb (0.0, 0.5, 0.5, 0.5) (0.0, 0.0, 1.0)

uniform int n 3
draw rect 0 0 1 1

relative probe rect rgb (0.5, 0.5, 0.5, 0.5) (1.0, 1.0, 1.0)

This test again uses a passthrough vertex shader, outputting the color from one of the members of the sampler uniform array using another uniform as the array index. The texture being sampled in each case goes through a utility function piglit_checkerboard_texture(), which takes parameters:

  • tex Name of the texture to be used.
  • level Mipmap level the checkerboard should be written to
  • width Width of the texture image
  • height Height of the texture image
  • horiz_square_size Size of each checkerboard tile along the X axis
  • vert_square_size Size of each checkerboard tile along the Y axis
  • black RGBA color to be used for “black” tiles
  • white RGBA color to be used for “white” tiles

Each draw command then draws a rectangle in a different corner of the framebuffer, sampling from a texture with a different color, which produces:

arb_gpu_shader5-sampler_array_indexing-fs-simple.png

While I was getting this to work, I had a couple options available to me so that I could break down the test and run it in smaller pieces:

  • eliminate all but the first texture, draw, and probe calls for simpler shader output
  • use a constant array index to verify that arrays of samplers were working

Combined with visual inspection of the test output, this made the process a relative snap.

In Summary

Piglit’s shader tests are an easy-to-read, easy-to-debug test format, and it’s a format that’s just won my coveted Preferred Test Format award for excellence in every category that matters.

Remember: working on driver-level code isn’t scary unless you need documentation.

August 03, 2020

Instancing

Once more going way out of order since it’s fresh in my mind, today will be a brief look at gl_InstanceID and a problem I found there while hooking up ARB_base_instance.

gl_InstanceID is an input for vertex shaders which provides the current instance being processed by the shader. It has a range of [0, instanceCount], and this breaks the heck out of Vulkan.

Vulkan Instancing

In Vulkan, this same variable instead has a range of [firstInstance, firstInstance + instanceCount]. This difference isn’t explicitly documented anywhere, which means keen readers will need to pick up on the difference as noted between Vulkan spec 14.7. Built-In Variables and the OpenGL wiki, because it isn’t even stated in the GLSL 1.40 spec (or anywhere else that I could find) what the range of this variable is.

Fixing

Here’s the generic int builtin loading function zink uses in ntv to handle instructions which load builtin variables:

static void
emit_load_uint_input(struct ntv_context *ctx, nir_intrinsic_instr *intr, SpvId *var_id, const char *var_name, SpvBuiltIn builtin)
{
   SpvId var_type = spirv_builder_type_uint(&ctx->builder, 32);
   if (!*var_id)
      *var_id = create_builtin_var(ctx, var_type,
                                   SpvStorageClassInput,
                                   var_name,
                                   builtin);

   SpvId result = spirv_builder_emit_load(&ctx->builder, var_type, *var_id);
   assert(1 == nir_dest_num_components(intr->dest));
   store_dest(ctx, &intr->dest, result, nir_type_uint);
}

var_id here is a pointer to the SpvId member of struct ntv_context where the created variable is stored for later use, var_name is the name of the variable, and builtin is the SPIRV name of the builtin that’s being loaded.

gl_InstanceID comes through here as SpvBuiltInInstanceIndex. To fix the difference in semantics here, I added a few lines:

static void
emit_load_uint_input(struct ntv_context *ctx, nir_intrinsic_instr *intr, SpvId *var_id, const char *var_name, SpvBuiltIn builtin)
{
   SpvId var_type = spirv_builder_type_uint(&ctx->builder, 32);
   if (!*var_id)
      *var_id = create_builtin_var(ctx, var_type,
                                   SpvStorageClassInput,
                                   var_name,
                                   builtin);

   SpvId result = spirv_builder_emit_load(&ctx->builder, var_type, *var_id);
   assert(1 == nir_dest_num_components(intr->dest));
   if (builtin == SpvBuiltInInstanceIndex) {
      /* GL's gl_InstanceID always begins at 0, so we have to normalize with gl_BaseInstance */
      SpvId base = spirv_builder_emit_load(&ctx->builder, var_type, ctx->base_instance_var);
      result = emit_binop(ctx, SpvOpISub, var_type, result, base);
   }
   store_dest(ctx, &intr->dest, result, nir_type_uint);
}

Now when loading gl_InstanceID, gl_BaseInstance is also loaded and subtracted to give a zero-indexed value.

July 31, 2020
The just released 5.2-rc1 kernel includes improved support for Logitech wireless keyboards and mice. Until now we were relying on the generic HID keyboard and mouse emulation for 27 MHz and non-unifying 2.4 GHz wireless receivers.

Starting with the 5.2 kernel instead we actually look at the devices behind the receiver. This allows us to provide battery monitoring support and to have per device quirks, like device specific HID-code to evdev-code mappings where necessary. Until now device specific quirks where not possible because the receivers have a generic product-id which is the same independent of the device behind the receiver.

The per device key-mapping is especially important for 27MHz wireless devices, these use the same HID-code for Fn + F1 to Fn + F12 for all devices, but the markings on the keys differ per model. Sofar it was impossible for Linux to get the mapping for this right, but now that we have per device product-ids for the devices behind the receiver we can finally fix this. As is the case with other devices with vendor specific mappings, the actual mapping is done in userspace through hwdb.

If you have a 27 MHz device (often using this receiver, keyboard marked as canada 210 or canada 310 at the bottom). Please give 5.2 a try. Download the latest 60-keyboard.hwdb file and place it in /lib/udev/hwdb.d (replacing the existing file) and then run "sudo udevadm hwdb --update", before booting into the 5.2 kernel. Then run "sudo evemu-record" select your keyboard and try Fn + F1 to Fn + F12 and any other special keys. If any keys do not work, edit 60-keyboard.hwdb, search for Logitech and add an entry for your keyboard, see the existing Logitech entries. After editing you need to re-run "sudo udevadm hwdb --update", followed by "sudo udevadm trigger" for the changes to take effect. Once you have a working entry, submit a pull-req to systemd to get the changes upstream. If you need any help drop me an email.

We still have some old code for the generic HID emulation for 27 MHz receivers with a product-id of 046d:c50c, these should work fine with the new code, but we've been unable to test this. I would really like to move the 046d:c50c id over to the new code and remove all the old code. If you've a 27 MHz Logitech device, please run lsusb, if your device has a product-id of 046d:c50c and you are willing to test, please drop me an email.

Likewise I suspect that 2.4GHz receivers with a product-id of 046d:c531 should work fine with the new support for non-unifying 2.4 GHz receivers, if you have one of those also please drop me an email.

Iago talked recently about the work done testing and supporting well known applications, like the Vulkan ports of the Quake1, Quake 2 and Quake3. Let’s go back here to the lastest news on feature and bugfixing work.

Pipeline cache

Pipeline cache objects allow the result of pipeline construction to be reused. Usually (and specifically on our implementation) that means caching compiled shaders. Reuse can be achieved between pipelines creation during the same application run by passing the same pipeline cache object when creating multiple pipelines. Reuse across runs of an application is achieved by retrieving pipeline cache contents in one run of an application, saving the contents, and using them to preinitialize a pipeline cache on a subsequent run.

Note that it may happens that a pipeline cache would not improve the performance of an application once it starts to render. This is because application developers are encouraged to create all the pipelines in advance, to avoid any hiccup during rendering. On that situation pipeline cache would help to reduce load times. In any case, that is not always avoidable. In that case the pipeline cache would allow to reduce the hiccup, as a cache hit is far faster than a shader recompilation.

One specific detail about our implementation is that internally we keep a default pipeline cache, used if the user doesn’t provide a pipeline cache when creating a pipeline, and also to cache the custom shaders we use for internal operations. This allowed to simplify our code, discarding some custom caches that were already implemented.

Uniform/storage texel buffer

Uniform texel buffers define a tightly-packed 1-dimensional linear array of texels, with texels going through format conversion when read in a shader in the same way as they are for an image. They are mostly equivalent to OpenGL buffer texture, so you can see them as textures backed up by a VkBuffer (through a VkBufferView). With uniform texel buffers you can only do a formatted load.

Storage texel buffers are the equivalent concept, but applied to images instead of textures. Unlike uniform texel buffers, they can also be written to in the same way as for storage images.

Multisampling

Multisampling is a technique that allows to reduce aliasing artifacts on images, by by sampling pixel coverage at multiple subpixel locations and then averaging subpixel samples to produce a final color value for each pixel. We have already started working on this feature, and included some patches on the development branch, but it is still a work in progress. Having said so, it is enough to get Sascha Willems’s basic multisampling demo working:

Sascha Willems multisampling demo run on rpi4

Bugfixing

Again, in addition to work on specific features, we also spent some time fixing specific driver bugs, using failing Vulkan CTS tests as reference. So let’s take a look of some screenshots of Sascha Willem’s demos that are now working:

Sascha Willems deferred demo run on rpi4

Sascha Willems texture array demo run on rpi4

Sascha Willems Compute N-Body demo run on rpi4

Next

We plan to work on supporting the following features next:

  • Robust access
  • Multisample (finish it)

Previous updates

Just in case you missed any of the updates of the vulkan driver so far:

Vulkan raspberry pi first triangle
Vulkan update now with added source code
v3dv status update 2020-07-01
V3DV Vulkan driver update: VkQuake1-3 now working

Queries Again

I’ve talked in the past about XFB, and I’ve talked about queries, but I’ve never spent much time talking about XFB queries.

That’s going to change.

XFB is not great, and queries aren’t something I’m too fond of at this point after rewriting the handling so many times, but it was only the other day, while handling streams for XFB3/ARB_gpu_shader5 (humblebrag) that I realized I had been in the infant area of the playground until this moment.

Enter GL_PRIMITIVES_GENERATED

Yes, it’s this query. This one query which behaves differently based on whether XFB is currently active. And in Vulkan terms, this difference is nightmarish for anyone who wants to maintain a “simple” query implementation.

Here’s the gist:

  • When XFB is active, GL_PRIMITIVES_GENERATED needs to return the XFB extension query data for primitives generated
  • When XFB is not active, GL_PRIMITIVES_GENERATED needs to return pipeline statistics query data

But hold on. It gets worse.

For the second case, there’s two subcases:

  • If a geometry shader is present, VK_QUERY_PIPELINE_STATISTIC_GEOMETRY_SHADER_PRIMITIVES_BIT is the value that needs to be returned
  • Otherwise, it’s VK_QUERY_PIPELINE_STATISTIC_INPUT_ASSEMBLY_PRIMITIVES_BIT

Ughhhhhhhhhh

I don’t really have much positive to say here other than “it works”.

The current solution is to maintain two separate VkQueryPool objects for these types of queries, keep them both in sync as far as active/not-active is concerned, and then use results based on have_xfb ? use_xfb : (have_gs ? use_gs : use_vs), and it’s about as gross as you’d imagine.

July 30, 2020

The Hidden Dangers of State Tracking

It’s another hot one, so let’s cool down by checking out a neat bug I came across.

As I’ve mentioned previously, zink runs a NIR pass to inject a gl_PointSize value into the last vertex processing stage of the pipeline for point draws. This works in three parts:

  • the NIR pass to add the instructions
  • adding the pointsize data into the shader parameters
  • the shader parameters get uploaded to the gpu as a constant buffer

Recently, I found a case where the constant buffer for this data was missing, causing tests to crash. I immediately strapped on my snorkel and dove in.

Is The NIR Pass Running?

Yes, but since it’s in a geometry shader, some recent work is needed.

Is The Data Being Added Into Shader Parameters?

Again checking out the MR where I added this, I examined the code:

            if (key->lower_point_size) {
               static const gl_state_index16 point_size_state[STATE_LENGTH] =
                  { STATE_INTERNAL, STATE_POINT_SIZE_CLAMPED, 0 };
               _mesa_add_state_reference(params, point_size_state);
               NIR_PASS_V(state.ir.nir, nir_lower_point_size_mov,
                          point_size_state);
               finalize = true;
            }

As seen above, there’s a call to _mesa_add_state_reference() with the pointsize data from where I copied it from the vertex shader callsite, so this is definitely being added to the shader parameters.

Are These Parameters Being Uploaded?

A quick breakpoint on st_update_gs_constants(), which handles uploading the constant buffer data for geometry shaders, revealed that no, there was no upload taking place.

A quick hop over to st_atom_list.h revealed that this function is called when ST_NEW_GS_CONSTANTS is set in struct st_program::affected_states, but this would leave out the case of the codepath being triggered by a tessellation evaluation shader, so really it should be ST_NEW_CONSTANTS to be more generic.

On my way out, I decided to check on the original vertex shader handling of the pass, and the state was similarly not being explicitly updated, instead relying on other parts of the code to have initiated an upload of the constant buffers.

In summary, when you copy code that has bugs which happen to not exhibit themselves in the original location, you’re basically adding more bugs.

July 28, 2020

How Do Mesa Versions Work?

Today I thought it might be interesting to dive into how mesa detects version support for drivers.

To do so, I’m going to be jumping into mesa/src/mesa/main/version.c, which is where the magic happens.

mesa/main/version.c

In this file are a couple functions which check driver extension support. For the (current) purposes of my zink work, I’ll be looking at compute_version(), the function handling desktop GL. This function takes a pointer to the extensions as well as the constant values set by the driver, and each GL version is determined by support for the required extensions.

As an example, here’s GL 3.0, which is what zink is currently detected as:

   const bool ver_3_0 = (ver_2_1 &&
                         consts->GLSLVersion >= 130 &&
                         (consts->MaxSamples >= 4 || consts->FakeSWMSAA) &&
                         (api == API_OPENGL_CORE ||
                          extensions->ARB_color_buffer_float) &&
                         extensions->ARB_depth_buffer_float &&
                         extensions->ARB_half_float_vertex &&
                         extensions->ARB_map_buffer_range &&
                         extensions->ARB_shader_texture_lod &&
                         extensions->ARB_texture_float &&
                         extensions->ARB_texture_rg &&
                         extensions->ARB_texture_compression_rgtc &&
                         extensions->EXT_draw_buffers2 &&
                         extensions->ARB_framebuffer_object &&
                         extensions->EXT_framebuffer_sRGB &&
                         extensions->EXT_packed_float &&
                         extensions->EXT_texture_array &&
                         extensions->EXT_texture_shared_exponent &&
                         extensions->EXT_transform_feedback &&
                         extensions->NV_conditional_render);

Each member of extensions being checked is the full name of an extension, so it can easily be found by a search. Each member of consts is a constant value exported by the driver. Most notable is GLSLVersion, which increases for every GL version beginning with 3.0.

Thus, saying “zink supports GL 3.0” is really equivalent to passing this check.

But wait, this is mesa core, and zink is a gallium driver.

mesa/state_tracker/st_extensions.c

Midway through this file is st_init_extensions() (direct line link omitted for future compatibility). Here, enum pipe_cap is mapped to extension support, but this is only for some cases:

   static const struct st_extension_cap_mapping cap_mapping[] = {
      { o(ARB_base_instance),                PIPE_CAP_START_INSTANCE                   },
      { o(ARB_bindless_texture),             PIPE_CAP_BINDLESS_TEXTURE                 },
      { o(ARB_buffer_storage),               PIPE_CAP_BUFFER_MAP_PERSISTENT_COHERENT   },
      { o(ARB_clear_texture),                PIPE_CAP_CLEAR_TEXTURE                    },
      { o(ARB_clip_control),                 PIPE_CAP_CLIP_HALFZ                       },
      { o(ARB_color_buffer_float),           PIPE_CAP_VERTEX_COLOR_UNCLAMPED           },
      { o(ARB_conditional_render_inverted),  PIPE_CAP_CONDITIONAL_RENDER_INVERTED      },
      { o(ARB_copy_image),                   PIPE_CAP_COPY_BETWEEN_COMPRESSED_AND_PLAIN_FORMATS },

Setting a nonzero value for these pipe caps enable the corresponding extension for version checking.

Other extensions are determined by values of pipe caps later on:

   if (GLSLVersion >= 400 && !options->disable_arb_gpu_shader5)
      extensions->ARB_gpu_shader5 = GL_TRUE;

In this case, supporting at least GLSL 4.00 and not explicitly disabling the extension is enough to enable it.

In this way, there’s no method to do “enable GL $version”, and everything is instead inferred based on the driver’s capabilities, letting us remain reasonably confident that drivers really do support all parts of every GL version they claim to support.

July 27, 2020

Variable Lists

Today I’m going to briefly go over a big-ish API change that’s taking place as a result of a MR from Jason Ekstrand.

As I’ve talked about in the past, zink does a lot of iterating over variables in shaders. Primarily, this access is limited to the first three variable lists in struct nir_shader:

typedef struct nir_shader {
   /** list of uniforms (nir_variable) */
   struct exec_list uniforms;

   /** list of inputs (nir_variable) */
   struct exec_list inputs;

   /** list of outputs (nir_variable) */
   struct exec_list outputs;

The uniforms list is used in UBO support and texture calls, outputs is used in xfb, and inputs and outputs have both been featured in all my slot remapping adventures.

Less Lists

The question asked by this MR (and the provoking issue) is “why do we have so many lists of variables when we could instead have one list?”

The issue has more discussion about the specifics of why, but for the purposes of zink, the question becomes “how does this affect us?”

Generally speaking, the changes needed here are simple. Instead of nir_foreach_variable() macro iterations over a specific list type, there’s now nir_foreach_variable_with_modes(var, shader, modes), which has an additional param for modes. This param is based on enum nir_variable_mode, which corresponds to nir_variable::data.mode, and it causes the macro to perform a check on each variable it iterates over:

#define nir_foreach_variable_with_modes(var, shader, modes) \
   nir_foreach_variable_in_shader(var, shader) \
      if (_nir_shader_variable_has_mode(var, modes))

Thus, the zink loops like nir_foreach_variable(var, &s->uniforms) look like nir_foreach_variable_with_modes(var, s, nir_var_uniform | nir_var_mem_ubo | nir_var_mem_ssbo) after the change.

Further Improvements

At some point after this lands, it’ll be useful to go through the places in zink where variable iterating occurs and try to combine iterations, as each variable list iteration now iterates over every type of variable, so ideally that should be reduced to a single loop that handles all the things that each separate type-iteration used to handle in order to keep things lightweight.

July 24, 2020

Spec Harder

At this point, longtime readers are well aware of the SPIR-V spec since I reference it in most posts. This spec covers all the base operations and types used in shaders, which are then (occasionally) documented more fully in the full Vulkan spec.

There comes a time, however, when the core SPIR-V spec is not enough to maintain compatibility with GLSL instructions. In such cases, there are additional specifications which extend the base functionality.

  • GLSL 450 provides many additional ops, primarily for calculation-related functionality
  • Other specs

Perhaps now you see where I’m going with this.

Yes, any time you’re working with one of the “other specs”, e..g, SPV_KHR_vulkan_memory_model, there’s no longer this nice, consistent webpage for viewing the spec. Instead, the SPIR-V landing page directs readers to this GitHub repository, which then has a list of links to all the extensions, whereupon finally the goal is reached at a meta-link that displays the version in the repo.

The frustrating part about this is that you have to know that you need to do this. Upon seeing references to an extension in the SPIR-V spec, and they are most certainly listed, the next step is naturally to find more info on the extension. But there is no info on any extensions in the core spec, and, while the extension names look like they’re links, they aren’t links at all, and so there’s no way to directly get to the spec for an extension that you now know you need to use after seeing it as a requirement in the core spec.

The next step is a staple of any workflow. Typically, it’s enough to just throw down a search engine of your choice query for “function name” or “specification name” or “extension name” when working with GL or VK, but for this one odd case, assuming your search engine even finds results, most likely you’ll end up instead at this page for SPV_KHR_vulkan_memory_model, which isn’t very useful.

July 23, 2020

A few weeks ago we shared an update regarding the status of the Vulkan driver for the Raspberry Pi 4. In that post I mentioned that the plan was to focus on completing the feature set for Vulkan 1.0 and then moving on to conformance and bugfixing work before attempting to run actual games and applications. A bit later my colleague Alejandro shared another update detailing some of our recent feature work.

We have been making good progress so far and at this point we are getting close to having a complete Vulkan 1.0 implementation. I believe the main pending features for that are pipeline caches, which Alejandro is currently working on, texel buffers, multisampling support and robust buffer access, so in the last few weeks I decided to take a break from feature development and try to get some Vulkan games running with our driver and use them to guide some inital performance work.

I decided to work with all 3 VkQuake games since they run on Linux, the source code is available (which makes things a lot easier to debug) and seemed to be using a subset of the Vulkan API we already supported. For vkQuake we needed compute shaders and input attachments that we implemented recently, and for vkQuake3 we needed a couple of optional Vulkan features which I implemented recently to get it running without having to modify the game code. So all these games are now running on the Raspberry Pi4 with the V3DV driver. At the same time, our friend Salva from Pi Labs has also been testing the PPSSPP emulator using Vulkan and reporting that some games seem to be working already, which has been great to hear.

I was particularly interested in getting vkQuake3 to work because the project includes both the Vulkan and the original OpenGL renderers, which was great to compare performance between both APIs. VkQuake3 comes with a GL1 and a GL2 renderer, with the GL1 render being the fastest of the two by a large margin (apparently the GL2 renderer has additional rendering features that make it much slower). I think the Vulkan renderer is based on the GL1 renderer (although I have not actually checked) so I figured it would make the most reasonable comparison, and in our tests we found the Vulkan version to be up to 60% faster. Of course, it could be argued that GL1 is a pretty old API and that the difference with a more modern GL or GLES renderer might be less significant, but it is still a good sign.

To finish the post, here are some pics of the games:


vkQuake

vkQuake2

vkQuake3 OpenGL 1 renderer

vkQuake3 OpenGL 1 renderer

vkQuake3 Vulkan renderer

vkQuake3 Vulkan renderer

A couple of final notes:
* Note that the Vulkan renderer for vkQuake3 is much darker, but that is just how the renderer operates and not a driver issue, we observed the same behavior on Intel GPUs.
* A note for those interested in trying vkQuake3, we noticed that exterior levels have broken sky rendering, I hope we will get to fix that soon.

A Slow Day

It’s a very slow day, and I awoke at 4:30 with the burning need to finish tessellation shaders. Or maybe I was just burning because it’s so hot out.

In either case, I had a couple remaining tests that were failing, and, upon closer inspection and a lot of squinting, I determined that the problem was the use of partial writes to patch output variables in tessellation control shaders.

With this knowledge in hand, I set about resolving the issue in the most bulldozer way possible, with no regard for how much of the code would survive once I discovered which nir pass I probably wanted to be running instead.

Partial Writes

A partial write results from shader code that looks like this:

vec4 somevec = vec4(1.0);
somevec.xz = vec2(0.0);

The second assignment is partial, meaning it only writes to some of the components. SPIR-V, however, doesn’t have methods for writing to only certain components of a composite value, meaning that it requires either the full composite to be written (e.g., the first line of code above) or it requires that each component be written separately.

If you’re a long-time reader of the blog, or if you’re good at inferring from the tone of an article, I’m sure you know what came next.

SPIRV

Here is a normal ntv function which stores a write to an output variable:


static void
emit_store_deref(struct ntv_context *ctx, nir_intrinsic_instr *intr)
{
   SpvId ptr = get_src(ctx, &intr->src[0]);
   SpvId src = get_src(ctx, &intr->src[1]);

   SpvId type = get_glsl_type(ctx, nir_src_as_deref(intr->src[0])->type);
   SpvId result = emit_bitcast(ctx, type, src);
   spirv_builder_emit_store(&ctx->builder, ptr, result);
}

It’s very simple. Simple is good. The only thing to be done is to take the source operand, cast it to the type of the output variable, then store (write) it to the output variable.

Here’s where we’re at now:

static void
emit_store_deref(struct ntv_context *ctx, nir_intrinsic_instr *intr)
{
   SpvId ptr = get_src(ctx, &intr->src[0]);
   SpvId src = get_src(ctx, &intr->src[1]);

   const struct glsl_type *gtype = nir_src_as_deref(intr->src[0])->type;
   SpvId type = get_glsl_type(ctx, gtype);
   unsigned num_writes = util_bitcount(nir_intrinsic_write_mask(intr));
   unsigned wrmask = nir_intrinsic_write_mask(intr);
   if (num_writes != intr->num_components) {
      /* no idea what we do if this fails */
      assert(glsl_type_is_array(gtype) || glsl_type_is_vector(gtype));

      /* this is a partial write, so we have to loop and do a per-component write */
      SpvId result_type;
      SpvId member_type;
      if (glsl_type_is_vector(gtype)) {
         result_type = get_glsl_basetype(ctx, glsl_get_base_type(gtype));
         member_type = get_uvec_type(ctx, 32, 1);
      } else
         member_type = result_type = get_glsl_type(ctx, glsl_get_array_element(gtype));
      SpvId ptr_type = spirv_builder_type_pointer(&ctx->builder,
                                                  SpvStorageClassOutput,
                                                  result_type);
      for (unsigned i = 0; i < 4; i++)
         if ((wrmask >> i) & 1) {
            SpvId idx = emit_uint_const(ctx, 32, i);
            SpvId val = spirv_builder_emit_composite_extract(&ctx->builder, member_type, src, &i, 1);
            val = emit_bitcast(ctx, result_type, val);
            SpvId member = spirv_builder_emit_access_chain(&ctx->builder, ptr_type,
                                                           ptr, &idx, 1);
            spirv_builder_emit_store(&ctx->builder, member, val);
         }
      return;

   }
   SpvId result = emit_bitcast(ctx, type, src);
   spirv_builder_emit_store(&ctx->builder, ptr, result);
}

I’ve now taken this otherwise readable function and changed it to something unnecessarily complex. For cases where the component output dosen’t match the written mask, this code extracts each written member of the source composite, then creates an access chain (which is basically pointer dereferencing / array/vector access in SPIRV) for the corresponding member of the output, and finally writes the bitcasted component.

But…

But Mike, I’m sure you’re asking now, shouldn’t nir_lower_io_to_scalar handle this? Or what about nir_lower_io in general?

Well.

Maybe?

But there’s a comment /* TODO: add patch support */, so it doesn’t, and here we are.

July 22, 2020

Last of the Vertex Processors

As I’ve touched upon previously, zink does some work in ntv to perform translations of certain GLSL variables to SPIR-V versions of those variables. Also sometimes we add and remove variables for compatibility reasons.

The key part of these translations, for some cases, is to ensure that they occur on the right shader. As an example that I’ve talked about several times, Vulkan coordinates are different from GL coordinates, specifically in that the Z axis is compressed and thus values there must be converted to perform as expected in the underlying Vulkan driver. This means that gl_Position needs to be converted before reaching the fragment shader.

Conversion Timing

Early approaches to handling this in zink, as in the currently-released versions of mesa, performed all translation in the vertex shader. Only vertex and fragment shaders are supported here, so this is fine and makes total sense.

Once more types of shaders become supported, however, this is no longer quite as good to be doing. Consider the following progression:

  • vertex shader converts gl_Position
  • vertex shader outputs gl_Position
  • geometry shader consumes gl_Position
  • geometry shader uses gl_Position for vertex output
  • geometry shader outputs gl_Position

In this scenario, the geometry shader is still executing instructions that assume a GL coordinate input, which means they will not function as expected when they receive the converted VK coordinates. The simplest fix results in:

  • vertex shader converts gl_Position
  • vertex shader outputs gl_Position
  • geometry shader consumes gl_Position
  • geometry shader unconverts gl_Position
  • geometry shader uses gl_Position for vertex output
  • geometry shader converts gl_Position
  • geometry shader outputs gl_POsition

I say simplest here because this requires no changes to the shader compiler ordering in zink, meaning that shaders don’t need to be “aware” of each other, e.g., a vertex shader doesn’t need to know whether a geometry shader exists and can just do the same conversion in every case. This is useful because at present, shaders in zink are compiled in a “random” order, meaning that it’s impossible to know whether a geometry shader exists at the time that a vertex shader is being compiled.

This is still not ideal, however, as it means that the vertex and geometry shaders are going to be executing unnecessary instructions, which yields a big frowny face in benchmarks (probably not actually that big, but this is the sort of optimizing that lets you call your code “lightweight”). The situation is further complicated with the introduction of tessellation shaders, where the flow now starts looking like:

  • vertex shader converts gl_Position
  • vertex shader outputs gl_Position
  • tessellation shader consumes gl_Position
  • tessellation shader unconverts gl_Position
  • tessellation shader converts gl_Position
  • tessellation shader outputs gl_Position
  • geometry shader consumes gl_Position
  • geometry shader unconverts gl_Position
  • geometry shader uses gl_Position for vertex output
  • geometry shader converts gl_Position
  • geometry shader outputs gl_Position

Not great.

Once Per Pipeline

The obvious change required here is to ensure that zink compiles shaders in pipeline order. With this done, all the uncompiled shaders are available to scan for info and existence, and the process can now be:

  • vertex shader converts gl_Position if no tessellation or geometry shader is present
  • vertex shader outputs gl_Position
  • tessellation shader consumes gl_Position
  • tessellation shader converts gl_Position if no geometry shader is present
  • tessellation shader outputs gl_Position
  • geometry shader consumes gl_Position
  • geometry shader uses gl_Position for vertex output
  • geometry shader converts gl_Position
  • geometry shader outputs gl_Position

Now that’s some lightweight shader execution. #optimized

July 21, 2020

Directing Indirection

Now on a wildly different topic, I’m going to talk about indirect drawing for a bit, specifically when using it in combination with primitive restart, which I already briefly talked about in a prior post.

In general, indirect drawing is used when an application wants to provide the gpu with a buffer containing the parameters to be used for draw calls. The idea is that the parameters are already “on the CPU”, so there’s no back-and-forth needed with the CPU for cases where these parameters may be derived in the course of GPU operations.

The problem here for zink is that indirect drawing can be used with primitive restart, but the problem I brought up previously still exists, namely that OpenGL allows arbitrary values for the restart index, whereas Vulkan requires a fixed value.

This means that in order for indirect draws to work correctly with primitive restart after being translated to Vulkan, zink needs to… harold.jpg

Yes, zink needs to map those buffers used for indirect drawing and rewrite the indices to convert the arbitrary restart index to the fixed one.

Utils, Utils, Utils

Thankfully, all of this can be done in the utility functions in u_prim_restart.c that I talked about in the primitive restart post, which provide functionality for both rewriting index buffers to perform restart index conversion as well as handling draw calls for unsupported primitive types.

The ARB_draw_indirect spec defines two buffer formats for use with indirect draws:

typedef struct {
  GLuint count;
  GLuint primCount;
  GLuint first;
  GLuint reservedMustBeZero;
} DrawArraysIndirectCommand;

typedef struct {
  GLuint count;
  GLuint primCount;
  GLuint firstIndex;
  GLint  baseVertex;
  GLuint reservedMustBeZero;
} DrawElementsIndirectCommand;

One is for array draws, the other for elements. Happily, only the first three members are of use for this awfulness, so there’s no need to determine which type of draw it is before grabbing that buffer with both hands and telling the GPU to completely stop everything that it wanted to do so we can push up our reading glasses and take a nice, slow read.

With the buffer contents in hand and the GPU performance having dropped to a nonexistent state, the indirect draw command can then be rewritten as a direct draw since the buffer is already mapped, and the entire premise of the indirect draw can be invalidated for the sake of compatibility.

For those interested in seeing what this looks like, the MR is here.

July 20, 2020

I’m Back

Last week was my birthday (all week), and I decided to gift myself fewer blog posts.

Now that things are back to normal, however, I’ll be easing my way back into the blogosphere. Today I’m going to be looking briefly at the benefits of code from a recent MR by Erik Faye-Lund which improves memory allocation in ntv to let us better track all the allocations done in the course of doing compiler stuff, with the key benefit being that we’re no longer leaking this entire thing every time we compile a shader (oops).

Mesa internals use an allocator called ralloc, and this was first added to the tree by Kenneth Graunke way back in 7.10.1, or at least that was the earliest reference of it I could find. It’s used everywhere, and it has a number of nice features that justify its use other than simple efficiency.

For this post, I’ll be looking specifically at its memory context capability.

Contextualize

Typically in C, we have our calls to malloc, and we have to track all these allocations, ensuring ownership is preserved, and then later call free at an appropriate time on everything that we’ve allocated.

Not so with the magic of ralloc!

A call to ralloc_context() creates a new memory context. Subsequent allocations for related objects requiring allocation can then pass this context to the related ralloc_size() calls, and this allocation will then be added to the specified context. At a later point, the context can be freed, taking all the allocated memory with it.

This enables code like:

void *ctx = ralloc_context(NULL);
void *some_mem1 = ralloc_size(ctx, some_size1);
void *some_mem2 = ralloc_size(ctx, some_size2);
void *some_mem3 = ralloc_size(ctx, some_size3);
ralloc_free(ctx);

which will free all allocations created using ctx.

But wait, there’s more! Every allocation made using ralloc is implicitly a context like this, which means that any time a struct is created with allocated members, the struct can be passed as a context, meaning that a destructor function for that struct can simply be written as a call to ralloc_free() which passes the struct, and newly-added members don’t need to be deallocated.

So for example:

struct somestruct *s = ralloc_size(NULL, sizeof(struct somestruct));
s->member1 = ralloc_size(s, sizeof(member1_size));
ralloc_free(s);

will free the struct and member, but also adding e.g.,:

s->member2 = ralloc_size(s, sizeof(member2_size));

for a new member at some point requires no changes to the deallocation codepath.

Future Work

At some point, it’s likely we’ll be going through the zink codebase to add more ralloc usage, both for helping to avoid memory leaks and for the other more general benefits that memory allocators bring over our existing usage of raw malloc.

July 17, 2020

In March, I inspected the coverage of kms_cursor_crc on VKMS to develop my GSoC project proposal. Using piglit, I present the evolution of this coverage so far:

Result GSoC start Only accepted patches Fixes under development
pass 7 22 24
warn 1 0 1
fail 2 3 0
skip 236 221 221
crash 0 0 0
timeout 0 0 0
incomplete 0 0 0
dmesg-warn 0 0 0
dmesg-fail 0 0 0
changes 0 0 0
fixes 0 0 0
regressions 0 0 0
total 246 246 246

+ Instability in the sequential run of subtests; ie, although the statistic showing that 7 tests are passing in the beggining, this result was only clear after a double-check because when running all subtests sequentially, 8 tests fails and only 1 succeed;

+ Warning or Failure ? My fix proposal for the pipe-A-cursor-alpha-transparent test was passing with a warning (CRC all zero)

As a first step, I decided to examine and solve issues that affected the test results in a general way. the instability. Solving the instability first (or at least identify what was going on) would make the work more consistent and fluid, since I would no longer need to double-check each subtest result and, in one running of the entire kms_cursor_crc test I could check the absent features or errors. However, in this investigation, some problems were more linked to IGT and others to VKMS. Identifying who is the “guilty” was not simple, and some false charges happened.

This is a little long story and deserves another post focused on the issue. The effective solution has not yet been found, but it has already been realized that the proper bug-fix achieves:

  • The stability of the sequential execution of subtests;
  • The success of the following subtests: pipe-A-cursor-dpms pipe-A-cursor-suspend

As we don’t have that yet, well, I’ll focus on this post in describing the way to allow testing of different cursor sizes in vkms.

The maximum cursor size

In 2018, one of the Haneen contributions to vkms was adding support for the cursor plane. In this implementation, the cursor has a standard maximum supported size 64x64, which limited the coverage of the kms_cursor_crc to only cursor subtests with this size.

The IGT tests cursor sizes from 64 to 512 (powers of 2), but how to enable cursor larger than the standard?

Initially, I thought I needed to develop some functionality from scratch, maybe do this by drawing in larger sizes… as a good beginner :) So I started to read some materials and check the codes of other drivers to find out how to do this.

During this stage, for some kind of universe coincidence (mystic), I came across a conversation on the IRC on the topic “cursor sizes.” This conversation gave me some references that led me to the right work path.

Ctags-driven investigation

I tend to investigate Linux trough keywords (like a ctags on the head?). This approach helps me to narrow the scope of reading and attempts. The Linux kernel is frighteningly large, and I know that I still need many years to understand things as a whole (if possible).

Therefore, these are the references that helped me to find the keywords and information related to my problem “the cursor size”:

  1. Comparing an AMD device with VKMS device: https://drmdb.emersion.fr/devices (keyword 1: DRM_CAP_CURSOR_HEIGHT)

  2. Checking max cursor sizes per driver: https://drmdb.emersion.fr/capabilities (keyword 2: capabilities)

  3. Finding where is the DRM_CAP_CURSOR_HEIGHT value: drivers/gpu/drm/drm_ioctl.c (How to define dev->mode_config.cursor_height in vkms?)
    /*
     * Get device/driver capabilities
     */
    static int drm_getcap(struct drm_device *dev, void *data, struct drm_file *file_priv)
    {
     struct drm_get_cap *req = data;
     struct drm_crtc *crtc;
    
     req->value = 0;
    [..]
     case DRM_CAP_CURSOR_WIDTH:
         if (dev->mode_config.cursor_width)
             req->value = dev->mode_config.cursor_width;
         else
             req->value = 64;
         break;
     case DRM_CAP_CURSOR_HEIGHT:
         if (dev->mode_config.cursor_height)
             req->value = dev->mode_config.cursor_height;
         else
             req->value = 64;
         break;
    [..]
    }
    
  4. More information: include/uapi/drm/drm.h
    / *
     * The CURSOR_WIDTH and CURSOR_HEIGHT capabilities return a valid widthxheight
     * combination for the hardware cursor. The intention is that a hardware
     * agnostic userspace can query a cursor plane size to use.
     *
     * Note that the cross-driver contract is to merely return a valid size;
     * drivers are free to attach another meaning on top, eg. i915 returns the
     * maximum plane size.
     * /
    #define DRM_CAP_CURSOR_WIDTH 0x8
    #define DRM_CAP_CURSOR_HEIGHT 0x9
    
  5. Check the documentation for cursor_width:
    ``` struct drm_mode_config
    Mode configuration control structure

cursor_width: hint to userspace for max cursor width
cursor_height: hint to userspace for max cursor height

6. So, where is this `mode_config` defined in vkms?  
Here: `drivers/gpu/drm/vkms/vkms_drv.c`

static int vkms_modeset_init(struct vkms_device *vkmsdev) { struct drm_device *dev = &vkmsdev->drm;

drm_mode_config_init(dev);
dev->mode_config.funcs = &vkms_mode_funcs;
dev->mode_config.min_width = XRES_MIN;
dev->mode_config.min_height = YRES_MIN;
dev->mode_config.max_width = XRES_MAX;
dev->mode_config.max_height = YRES_MAX;
dev->mode_config.preferred_depth = 24;
dev->mode_config.helper_private = &vkms_mode_config_helpers;

return vkms_output_init(vkmsdev, 0); } ```  There is nothing about cursor here, so we need to assign maximum values to not take the default.  7. I also found that, on my intel computer, the maximum cursor is 256. Why do tests include 512?  8. Also, there are subtests in kms_cursor_crc for non-square cursors, but these tests are restricted to i915 devices. Why are they here?  9. Finally, I develop a [simple patch][3] that increases the coverage rate by 15 subtests. Considering the current drm-misc-next, my project state is:
Name Results
pass 22
fail 3
skip 221
crash 0
timeout 0
warn 0
incomplete 0
dmesg-warn 0
dmesg-fail 0
changes 0
fixes 0
regressions 0
total 246

Cheer up!

It is also possible to consider those in which we have some idea of the problem and a provisory solution exists (I mean, an optimistic view):

Name Results
pass 24
warn 1
fail 0
skip 221
crash 0
timeout 0
incomplete 0
dmesg-warn 0
dmesg-fail 0
changes 0
fixes 0
regressions 0
total 246

There is still the non-square cursors issue, which I’m not sure we should handle. Subtests with non-square cursors mean 16 skips.

Lessons learned

  1. For me, IRC conversations are inspiring and also show community pulsing. Part of what I question in my master’s research is the lack of interest in academic studies in using IRC as a means of understanding the community under investigation.
    One of the things I like about IRC is that, unlike other instant messengers today, when we are there, we are usually sitting in front of the computer (our work tool). I mean, we are not (I think) lying in a hammock on the beach, for example. Ok, there could be exceptions. :)
    But to be honest, I hardly talk on a channel; I am usually just watching.

  2. There are cases where the complexity lies in understanding what already exists instead of in developing new features. I still don’t know if it’s frustrating or satisfying.

  3. I don’t know if I could understand things without using ctags. The ctags was a great tip from Siqueira even in FLUSP times.

July 15, 2020

We’re Back

I spent a few days locked in a hellish battle against a software implementation of doubles (64-bit floats) for ARB_gpu_shader_fp64 in the raging summer heat. This scholarly infographic roughly documents my progress:

fp64-clown.png

Instead of dwelling on it, I’m going to go back to something very quick and less brain-damaging, namely ARB_timer_query. This extension provides functionality for performing time queries on the gpu, both for elapsed time and just getting regular time values.

In gallium, this is represented by PIPE_CAP_QUERY_TIME_ELAPSED and PIPE_CAP_QUERY_TIMESTAMP capabilities in a driver.

Extensions

Obviously there’s Vulkan extensions to go with this. VK_EXT_calibrated_timestamps is the one that’s needed, as this provides the functionality for retrieving time values directly from the gpu without going through a command buffer. There’s the usual copy-and-paste dance for this during driver init, which includes:

  • checking for the extension name
  • checking for the feature
  • enabling the feature
  • enabling the extension
  • checking for the needed extensions-specific features (in this case, VK_TIME_DOMAIN_DEVICE_EXT)

Query Time

What followed, in this case, a lot of wrangling existing query code to remove asserts and finish some stubs where timestamp queries had been partially implemented. The most important part was this function:

 static void
 timestamp_to_nanoseconds(struct zink_screen *screen, uint64_t *timestamp)
 {
    /* The number of valid bits in a timestamp value is determined by
     * the VkQueueFamilyProperties::timestampValidBits property of the queue on which the timestamp is written.
     * - 17.5. Timestamp Queries
     */
    *timestamp &= ((1ull << screen->timestampValidBits) - 1);;
    /* The number of nanoseconds it takes for a timestamp value to be incremented by 1
     * can be obtained from VkPhysicalDeviceLimits::timestampPeriod
     * - 17.5. Timestamp Queries
     */
    *timestamp *= screen->props.limits.timestampPeriod;
 }

All time values from the gpu are returned in “ticks”, which are a unit decided by the underlying driver. Applications using Zink want nanoseconds, however, so this needs to be converted.

Side Query

But then also there’s the case where a timestamp can be retrieved directly, for example:

glGetInteger64v(GL_TIMESTAMP, &time);

In Zink and other gallium drivers, this uses either a struct pipe_screen or struct pipe_context hook:

   /**
    * Query a timestamp in nanoseconds. The returned value should match
    * PIPE_QUERY_TIMESTAMP. This function returns immediately and doesn't
    * wait for rendering to complete (which cannot be achieved with queries).
    */
   uint64_t (*get_timestamp)(struct pipe_screen *);


   /**
    * Query a timestamp in nanoseconds.  This is completely equivalent to
    * pipe_screen::get_timestamp() but takes a context handle for drivers
    * that require a context.
    */
   uint64_t (*get_timestamp)(struct pipe_context *);

The current implementation looks like:

static uint64_t
zink_get_timestamp(struct pipe_context *pctx)
{
   struct zink_screen *screen = zink_screen(pctx->screen);
   uint64_t timestamp, deviation;
   assert(screen->have_EXT_calibrated_timestamps);
   VkCalibratedTimestampInfoEXT cti = {};
   cti.sType = VK_STRUCTURE_TYPE_CALIBRATED_TIMESTAMP_INFO_EXT;
   cti.timeDomain = VK_TIME_DOMAIN_DEVICE_EXT;
   screen->vk_GetCalibratedTimestampsEXT(screen->dev, 1, &cti, &timestamp, &deviation);
   timestamp_to_nanoseconds(screen, &timestamp);
   return timestamp;
}

Using the VK_EXT_calibrated_timestamps functionality from earlier, this is very simple and straightforward, unlike software implementations of ARB_gpu_shader_fp64.

It’s a bit of a leap forward, but that’s another GL 3.3 feature done.

July 13, 2020

NIR passes

I’ve spent a lot of time talking about NIR and writing passes, so let’s take a shallow dive into what exactly that means.

To start, there’s this idea of a “lowering” pass, where “lowering” means reducing or changing some part of the shader representation. These passes are run for various reasons, ranging from handling compatibility (e.g., the gl_FragColor -> gl_FragData[n] pass I discussed previously) to optimizing the shader by removing unused variables and instructions.

Basics

The most basic type of NIR pass, which is what I’ll be covering quickly today since time got away from me, is the type that runs through a shader’s instructions in order to rewrite a specific type of instruction.

Usually this starts out with this block:

   nir_foreach_function(function, shader) {
      if (function->impl) {
         nir_builder builder;
         nir_builder_init(&builder, function->impl);
         nir_foreach_block(block, function->impl) {
            nir_foreach_instr_safe(instr, block) {
               progress = lower_impl(instr, &builder);
            }
         }
     }
  }

In this, the pass iterates over the shader’s functions, then creates a nir_builder (an object used for altering shader internals) while it iterates the blocks within the function. A function block is a group of instructions contained within a given scope, e.g., a conditional. The pass iterates into each block, looping over all the instructions and passing them to the lower_impl internal function for the pass, which is where all the work happens.

A pass I wrote earlier today searches fragment shaders for writes to gl_SampleMask and deletes them in order to mimic OpenGL behavior of ignoring that variable if a render target’s sample count is zero:

static bool
lower_samplemask_instr(nir_intrinsic_instr *instr, nir_builder *b)
{
   nir_variable *out;
   if (instr->intrinsic != nir_intrinsic_store_deref)
      return false;

   out = nir_deref_instr_get_variable(nir_src_as_deref(instr->src[0]));
   if (out->data.location != FRAG_RESULT_SAMPLE_MASK || out->data.mode != nir_var_shader_out)
      return false;
   b->cursor = nir_after_instr(&instr->instr);
   nir_instr_remove(&instr->instr);
   return true;
}

This filters for intrinsic instructions, which are those defined as intrinsics in mesa/src/compiler/nir/nir_intrinsics.py, and then skips over all instructions which aren’t store_deref. A store_deref is the instruction used when writing a value to a shader output. More filtering is done based on the variable used as the source of the operation (instr->src[0]) until the pass has located a gl_SampleMask write, and then the instruction is removed.

Important here is that true and false are returned by the implementation here to indicate progress. NIR has a feature where setting the NIR_PRINT environment variable causes the current NIR output to be printed any time a lowering pass changes the shader, so setting this correctly is important.

More on NIR passes in a future post.

July 10, 2020

Something different

Usually I cover in-depth looks at various chunks of code I’ve been working on, but today it’s going to be a more traditional style of modern blogging: memes and complaining.

I do a lot of work using Vulkan extensions. Feels like every other day I’m throwing another one into Zink (like when I had to add VK_EXT_calibrated_timestamps today for timer queries). To that end, I spend a lot of time reading the spec for extensions, though usually I prefer the man pages since they don’t throw my browser on the ground and leave it a twitchy mess like the main spec page does: chrome

The problem with extensions and the spec is that extensions each have their own section, which is the same as the man page entry, and that’s it. Extensions also modify the base text of the spec, but there’s no clear way to know (that I’ve seen, anyway) what else is modified based on the entry that the extension gets.

As an example, check out VK_EXT_robustness2. This blurb for the spec clearly says This extension also adds support for “null descriptors”, where VK_NULL_HANDLE can be used instead of a valid handle. Accesses to null descriptors have well-defined behavior, and don’t rely on robustness.

Alright, but what does that mean? Sure, I can chuck a null handle into a descriptor set, but how do I know that I’m complying with the spec other than yolo running it to see what the validator says?

Near as I can tell, the answer is to actually search for the extension name, then read the diff of the commit which officially adds the extension to the spec.

Compare this to OpenGL, where I can check out something like ARB_timer_query and see exactly what parts of the existing spec the extension is modifying and how they’re being changed.

Maybe I’m missing something here; I’ve only been working on Zink for a bit over a month, so I’m certainly not going to claim to be an expert, but this feels like a huge step backwards in documentation comprehensiveness.

July 09, 2020

Primitive Restart

The last remaining feature for GL 3.1 was primitive restart, which allows an indexed draw command to end the current primitive when a specified index is processed, beginning a new one of the same type with the next index. Other than the minor changes of enabling the driver capability, there were two main issues with translating this functionality to Vulkan:

  • OpenGL primitive restart allows for an arbitrary restart index to be specified by the user, while Vulkan always uses (uint[size]_t)-1
  • OpenGL primitive restart works on all primitive types, but Vulkan only allows it for strips and fans without adjacency

I decided to start with the first issue.

Indexing

As with many problems in driver world, this is one that people have had previously, which meant that the work I needed to do was limited. In fact, it was as simple as checking in zink_draw_vbo() whether the draw command was using the Vulkan index and, if it wasn’t, remapping it using an existing function.

uint32_t restart_index;
if (dinfo->index_size == 1)
   restart_index = (uint8_t)-1;
else if (dinfo->index_size == 2)
   restart_index = (uint16_t)-1;
else if (dinfo->index_size == 4)
   restart_index = (uint32_t)-1;
else
   unreachable("unknown index size passed");
if ((dinfo->primitive_restart && (dinfo->restart_index != restart_index)) || !screen->have_EXT_index_type_uint8) {
  util_translate_prim_restart_ib(pctx, dinfo, &index_buffer);

Note here that there’s handling for the case of missing VK_EXT_index_type_uint8, which is required in order to use index buffers of size uint8_t.

If this codepath is taken, a new index buffer with the restart values rewritten to Vulkan ones is created using the specified range, which means the draw command now starts from index zero.

Primitive types

Again, this is not a new problem that I’d discovered, which meant that, after being pointed in the right direction by Dave Airlie, this was just a few lines of code:

void
zink_draw_vbo(struct pipe_context *pctx,
              const struct pipe_draw_info *dinfo)
{

   if (dinfo->primitive_restart && !restart_supported(dinfo->mode)) {
       util_draw_vbo_without_prim_restart(pctx, dinfo);
       return;
   }

restart_supported() here just checks the primitive draw mode for a Vulkan-supported strip/fan type.

That’s actually all that was required, and now primitive restart works*.

  • Except when it breaks gl_PrimitiveID in geometry shaders, which I’ll get to in a future post.
July 08, 2020

A quick break

I’ve been blogging pretty in-depth about Zink and related code for a while, so let’s do a quick roundup with some future blog post spoilers.

I previously talked about these, now they’re merged:

Done but awaiting dependencies before merge:

  • UBO support from last week’s involved blog series
  • Primitive restart which I’ll likely cover in a quick post tomorrow
  • GLSL 1.40 was a variety of misc fixups
  • GL 3.1 depends on the above items, so that’s done
  • Geometry shaders are working about as well as they can after tackling that over the past day or so
  • GLSL 1.50 is at 98% of piglit tests passing (this can’t reach 100% due to architecture limitations)
  • GL 3.2 doesn’t have a milestone, but it’s done (depth clamp, geometry shaders, and GLSL 1.50 were the last remaining items)
  • GL 3.3 needs 3 extensions and the GLSL bump

Also, Antonio Caggiano has dipped a toe into Zink-land and is investigating fixing up some issues we have with depth/stencil buffer operations!

Goals

It’s been over a month of daily posts, so again, here are 3 goals for this blog for any recently-joined readers:

  • Document things I/others are doing in Zink and Zink-affecting components\ This is coincidentally the only documentation for some of the mesa APIs that I’m blogging about.
  • Show readers who maybe have strong coding/graphics backgrounds but haven’t done driver development that drivers aren’t some kind of black box code that’s too hard for hobbyists to jump into
  • Also possibly describe some problem-solving strategies and implementation handling that might be useful
July 07, 2020

Something more recent

Armed with a colossal set of patches in my zink-wip branch and feeling again like maybe it was time to be a team player instead of charging off down the field on my own, I decided yesterday morning to check out Erik’s MR to enable ARB_depth_clamp that’s been blocked on a failing piglit test for several weeks. The extension was working, supposedly, and all this MR does is enable it for use, so how hard could this be?

Filled with hubris after bulldozing my way through UBO and primitive restart handling over the course of a week, I thought I could do anything.

The test

The test, spec@arb_depth_clamp@depth-clamp-range, was originally written by Eric Anholt over ten years ago and includes more recent updates from Roland Scheidegger to give it better coverage. Here’s the key bits:

	float white[3] = {1.0, 1.0, 1.0};
	float clear[3] = {0.0, 0.0, 0.0};

	glClearDepth(0.5);
	glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
	glEnable(GL_DEPTH_TEST);
	glDepthFunc(GL_LEQUAL);

	glColor3fv(white);

	/* Keep in mind that the ortho projection flips near and far's signs,
	 * so 1.0 to quad()'s z maps to glDepthRange's near, and -1.0 maps to
	 * glDepthRange's far.
	 */

	glEnable(GL_DEPTH_CLAMP);

	glDepthRange(0.75, 0.0);
	quad(70, 30, 4); /* 0.75 - not drawn. */

	pass = piglit_probe_pixel_rgb(75, 35, clear) && pass;

This portion of the test:

  • clears the depth buffer to 0.5
  • enables depth testing with less-than-or-equal
  • enables depth clamping
  • sets an inverted depth range
  • draws a quad with an untransformed depth value of 4
  • checks the coordinates at the point of the quad to verify that the quad was discarded during depth testing

Problem

Looking at the test here, there’s the expectation that, after viewport transform, the depth will get clamped to 0.75, tested with <= against the clear value 0.5, and then it will be discarded since 0.75 <= 0.5 is false.

Contrary to my expectations, this quad got rendered instead of discarded. But why?

Examining

I rushed to the pipeline creation to check out the depthClampEnable value of VkPipelineRasterizationStateCreateInfo. It was enabled.

I checked depthTestEnable and depthCompareOp from VkPipelineDepthStencilStateCreateInfo in the same place. All fine there too.

vkCmdSetViewport seemed a possible candidate as well, but minDepth and maxDepth were set with 0.75 and 0.0 as they should be.

Maybe this was some weirdness going on with the shader handling? It seemed unlikely, but I wanted to cover all the bases. I fired up a quick printf shader with:

gl_FragColor = vec4(gl_FragCoord.z - 0.75, 0.0, 1.0, 1.0);

The idea was to verify that the depth was a value that would be clamped.

What I got was a nice, purple square, which was also what I expected.

ANV then

At some point, it’s important to consider the possibility that this was a driver bug.

That point was not yet, however.

I checked out the Vulkan CTS, running the dEQP-VK.draw.inverted_depth_ranges.* glob of tests, as the problem I was having was isolated to inverted depth ranges.

Confoundingly, all the tests passed.

But then I read the tests, and I noticed something: none of the CTS tests for inverted depth ranges have depth testing enabled.

At last, it was time to consider the possibility that this was, in fact, not a problem in Zink. After a brief query on IRC, Jason Ekstrand produced a patch within seconds which resolved the issue.

This is the story of how I didn’t fix a bug.

July 06, 2020

The first round just passed so fast, and what I did?

The case: the IGT test kms_cursor_crc

Being acquainted with a IGT test

  1. Anatomy of the test
    What: study, read the code, dive into each function, and describe its structure, steps, and functionality.
    To: be well aware of the code construction and, consequently, able to figure out and deal with problems.
  2. Patches to improve test documentation
    What: During the study of the kms_cursor_crc test, I realized that the subtests had no description that would help a newcomer to perceive the purpose of that subtest. To improve this documentation and some code comments, I sent some patches to IGT with what I was able to understand from my inspection.
  3. Refactoring function with parameters never used.
    Outside the context of exploring and becoming acquainted with the case, examining the anatomy of the kms_cursor_crc I caught useless parameters in a general IGT function, i.e., it requires two parameters, but never uses them within its code. I checked the author (git blame) and asked him on IRC about the need for these additional parameters, but I didn’t get a response (or maybe I missed the reply due to disconnection). Then, I sent an RFC patch to the mailing list and also nothing. Finally, my mentor took a look, and he agreed that the parameters seem useless and can be removed. He asked me to resend as a normal patch.

Maybe you can also take a look: lib/igt_fb: remove extra parameters from igt_put_caito_ctx

Solving some problems

I have also sent a patchset to treat to reported problems in kms_cursor_crc:

  1. Access to debugfs data file is blocked
  2. Unstable behaviour in sequential subtests

You can check out more about that in my previous posts. Reviewing the logs, I saw that these problems in prepare_crtc were introduced more recently since some Haneen commits report that subtests were working before. I noticed that waiting for the blank is a solution used before and removed by another commit that not seems to treat this issue.

Sadly, my patches also received no feedback at all. So, if you are interested in them or can make an interesting comment, go ahead! :)

Preparing for the second round

In the last few days, I surveyed the materials and tasks needed to execute the second round of my project. This is a kind of roadmap with useful links.

As far as I know, I have to deal with three issues that seem currently not supported by VKMS:

  1. Enhance the blend function to provide alpha composing
    To solve warning message in pipe-%s-cursor-alpha-transparent
  2. DPMS (?)
    To solve failure in pipe-%s-cursor-dpms and pipe-%s-cursor-suspend
  3. Cursor size other than 64x64
    • https://drmdb.emersion.fr/devices/3a5c55e30723
    • https://drmdb.emersion.fr/capabilities
    • See drivers/gpu/drm/drm_ioctl.c
    • See drivers/gpu/drm/vkms/vkms_drv.c (drm_getCap)
      drivers/gpu/drm/vkms/vkms_drv.c:131:    dev->mode_config.funcs = &vkms_mode_funcs;
      drivers/gpu/drm/vkms/vkms_drv.c:132:    dev->mode_config.min_width = XRES_MIN;
      drivers/gpu/drm/vkms/vkms_drv.c:133:    dev->mode_config.min_height = YRES_MIN;
      drivers/gpu/drm/vkms/vkms_drv.c:134:    dev->mode_config.max_width = XRES_MAX;
      drivers/gpu/drm/vkms/vkms_drv.c:135:    dev->mode_config.max_height = YRES_MAX;
      drivers/gpu/drm/vkms/vkms_drv.c:136:    dev->mode_config.preferred_depth = 24;
      drivers/gpu/drm/vkms/vkms_drv.c:137:    dev->mode_config.helper_private = &vkms_mode_config_helpers;
      

I also need to check the reason for some subtests like dpms/suspend are failing now, but they were working in the past, accordingly to Hannen commit message:

commit db7f419c06d7cce892384df464d4b609a3ea70af
Author: Haneen Mohammed <hamohammed.sa@gmail.com>
Date:   Thu Sep 6 08:18:26 2018 +0300

    drm/vkms: Compute CRC with Cursor Plane
    
    This patch compute CRC for output frame with cursor and primary plane.
    Blend cursor with primary plane and compute CRC on the resulted frame.
    
    This currently passes cursor-size-change, and cursor-64x64-[onscreen,
    offscreen, sliding, random, dpms, rapid-movement] from igt
    kms_cursor_crc tests.
    
    Signed-off-by: Haneen Mohammed <hamohammed.sa@gmail.com>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Link: https://patchwork.freedesktop.org/patch/msgid/b1749f5c90da5721a481f12740e2e370edb4a752.1536210181.git.hamohammed.sa@gmail.com

This is the continuation from this post.

Several moons have bypassed us [1] in the time since the first post, and Things Have Happened! If you recall (and of course you did because you just re-read the article I so conveniently linked above), libxkbcommon supports an include directive for the rules files and it will load a rules file from $XDG_CONFIG_HOME/xkb/rules/ which is the framework for custom per-user keyboard layouts. Alas, those files are just sitting there, useful but undiscoverable.

To give you a very approximate analogy, the KcCGST format I described last time are the ingredients to a meal (pasta, mince, tomato). The rules file is the machine-readable instruction set to assemble your meal but it relies a lot on wildcards. Feed it "spaghetti, variant bolognese" and the actual keymap ends up being the various components put together: "pasta(spaghetti)+sauce(tomato)+mince". But for this to work you need to know that spag bol is available in the first place, i.e you need the menu. This applies to keyboard layouts too - the keyboard configuration panel needs to present a list so the users can clickedy click-click on whatever layout they think is best for them.

This menu of possible layouts is provided by the xkeyboard-config project but for historical reasons [2], it is stored as an XML file named after the ruleset: usually /usr/share/X11/xkb/rules/evdev.xml [3]. Configuration utilities parse that file directly which is a bit of an issue when your actual keymap compiler starts supporting other include paths. Your fancy new layout won't show up because everything insists on loading the system layout menu only. This bit is part 2, i.e. this post here.

If there's one thing that the world doesn't have enough of yet, it's low-level C libraries. So I hereby present to you: libxkbregistry. This library has now been merged into the libxkbcommon repository and provides a simple C interface to list all available models, layouts and options for a given ruleset. It sits in the same repository as libxkbcommon - long term this will allow us to better synchronise any changes to XKB handling or data formats as we can guarantee that the behaviour of both components is the same.

Speaking of data formats, we haven't actually changed any of those which means they're about as easy to understand as your local COVID19 restrictions. In the previous post I outlined the example for the KcCGST and rules file, what you need now with libxkbregistry is an XKB-compatible XML file named after your ruleset. Something like this:


$ cat $HOME/.config/xkb/rules/evdev.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xkbConfigRegistry SYSTEM "xkb.dtd">
<xkbConfigRegistry version="1.1">
<layoutList>
<layout>
<configItem>
<name>us</name>
</configItem>
<variantList>
<variant>
<configItem>
<name>banana</name>
<shortDescription>banana</shortDescription>
<description>US (Banana)</description>
</variant>
</variantList>
</layoutList>
<optionList>
<group allowMultipleSelection="true">
<configItem>
<name>custom</name>
<description>Custom options</description>
</configItem>
<option>
<configItem>
<name>custom:foo</name>
<description>Map Tilde to nothing</description>
</configItem>
</option>
<option>
<configItem>
<name>custom:baz</name>
<description>Map Z to K</description>
</configItem>
</option>
</group>
</optionList>
</xkbConfigRegistry>
This looks more complicated than it is: we have models (not shown here), layouts which can have multiple variants and options which are grouped together in option group (to make options mutually exclusive). libxkbregistry will merge this with the system layouts in what is supposed to be the most obvious merge algorithm. The simple summary of that is that you can add to existing system layouts but you can't modify those - the above example will add a "banana" variant to the US keyboard layout without modifying "us" itself or any of its other variants. The second part adds two new options based on my previous post.

Now, all that is needed is to change every user of evdev.xml to use libxkbregistry. The gnome-desktop merge request is here for a start.

[1] technically something that goes around something else doesn't bypass it but the earth is flat, the moon is made of cheese, facts don't matter anymore and stop being so pedantic about things already!
[2] it's amazing what you can handwave away with "for historical reasons". Life would be better if there was less history to choose reasons from.
[3] there's also evdev.extras.xml for very niche layouts which is a separate file for historical reasons [2], despite there being a "popularity" XML attribute

Extensions Extensions Extensions

Yes, somehow there are still more extensions left to handle, so the work continues. This time, I’ll be talking briefly about a specific part of EXT_multiview_draw_buffers, namely:

If a fragment shader writes to "gl_FragColor", DrawBuffersIndexedEXT
specifies a set of draw buffers into which the color written to
"gl_FragColor" is written. If a fragment shader writes to
gl_FragData, DrawBuffersIndexedEXT specifies a set of draw buffers
into which each of the multiple output colors defined by these
variables are separately written. If a fragment shader writes to
neither gl_FragColor nor gl_FragData, the values of the fragment
colors following shader execution are undefined, and may differ
for each fragment color.

The gist of this is:

  • if a shader writes to gl_FragColor, then the value of gl_FragColor is output to all color attachments
  • if a shader writes to gl_FragData[n], then this n value corresponds to the indexed color attachment

As expected, this is a problem

SPIR-V has no builtin for a color decoration on variables, which means that gl_FragColor goes through as a regular variable with no special handling. As such, there’s similarly no special handling in the underlying Vulkan driver to split the output of this variable out to the various color attachments, which means that only the first color attachment will have the expected result when multiple color attachments are present.

The solution, as always, is more NIR.

A quick pass

What needs to happen here is a lowering pass that transforms gl_FragColor into gl_FragData[0] and then outputs its value to newly-created gl_FragData[1-7] variables. At present, it looks like this:

static bool
lower_fragcolor_instr(nir_intrinsic_instr *instr, nir_builder *b)
{
   nir_variable *out;
   if (instr->intrinsic != nir_intrinsic_store_deref)
      return false;

   out = nir_deref_instr_get_variable(nir_src_as_deref(instr->src[0]));
   if (out->data.location != FRAG_RESULT_COLOR || out->data.mode != nir_var_shader_out)
      return false;
   b->cursor = nir_after_instr(&instr->instr);

   nir_ssa_def *frag_color = nir_load_var(b, out);
   ralloc_free(out->name);
   out->name = ralloc_strdup(out, "gl_FragData[0]");

   /* translate gl_FragColor -> gl_FragData since this is already handled */
   out->data.location = FRAG_RESULT_DATA0;
   nir_component_mask_t writemask = nir_intrinsic_write_mask(instr);

   for (unsigned i = 1; i < 8; i++) {
      char name[16];
      snprintf(name, sizeof(name), "gl_FragData[%u]", i);
      nir_variable *out_color = nir_variable_create(b->shader, nir_var_shader_out,
                                                   glsl_vec4_type(),
                                                   name);
      out_color->data.location = FRAG_RESULT_DATA0 + i;
      out_color->data.driver_location = i;
      out_color->data.index = out->data.index;
      nir_store_var(b, out_color, frag_color, writemask);
   }
   return true;
}

This function is called in a nested loop that iterates over all the shader instructions, and it filters through until it gets an instruction that performs a store to gl_FragColor, which is denoted by the FRAG_RESULT_COLOR value in nir_variable::data.location. It loads the stored value, modifying the variable itself to gl_FragData[0] along the way, and then does a quick loop to create the rest of the gl_FragData variables, storing the loaded output.

Of particular importance here is:

      out_color->data.index = out->data.index;

This is the handling for gl_SecondaryFragColorEXT from GLES, which needs to work as well.

There’s no special handling for any of these variables, so the only other change needed is to throw an Index decoration onto ntv variable handling to ensure that everything works as expected.

July 03, 2020

ARB_arrays_of_arrays

I recently came across a number of failing tests where the problem was related to variable sizing in a shader. Check out this beaut:

#version 130
#extension GL_ARB_separate_shader_objects: require
#extension GL_ARB_arrays_of_arrays: require

out vec4 out_color;

uniform sampler2D s2[2][2];
uniform sampler3D s3[2][2];

void main()
{
    out_color = texture(s2[1][1], vec2(0)) + texture(s3[1][1], vec3(0));
}

When I checked out the corresponding spec, it seems that there’s no limitation to array nesting like this, and there’s not any handling for arrays of arrays in Zink at present.

Thus I entered the magic of struct glsl_type and its many, many, many helper/wrapper functions. Zink has many checks for glsl_type_is_array(var->type) when processing variables to find arrays, but there’s no checks for arrays of arrays, which was a problem.

Thankfully, as is usually the case in mesa development, someone has had this problem before, and so there’s glsl_get_aoa_size() for getting the flattened size of an array in these cases. By using the return from this instead of glsl_get_length() in these places, Zink can now support this extension.

July 02, 2020

At last

It’s time to finish up UBO support in this long form patch review blog series. Here’s where the current progress has left things:


static void
emit_load_ubo(struct ntv_context *ctx, nir_intrinsic_instr *intr)
{
   nir_const_value *const_block_index = nir_src_as_const_value(intr->src[0]);
   assert(const_block_index); // no dynamic indexing for now

   nir_const_value *const_offset = nir_src_as_const_value(intr->src[1]);
   if (const_offset) {
      SpvId uvec4_type = get_uvec_type(ctx, 32, 4);
      SpvId pointer_type = spirv_builder_type_pointer(&ctx->builder,
                                                      SpvStorageClassUniform,
                                                      uvec4_type);

      unsigned idx = const_offset->u32 / 16;
      SpvId member = emit_uint_const(ctx, 32, 0);
      SpvId offset = emit_uint_const(ctx, 32, idx);
      SpvId offsets[] = { member, offset };
      SpvId ptr = spirv_builder_emit_access_chain(&ctx->builder, pointer_type,
                                                  ctx->ubos[const_block_index->u32], offsets,
                                                  ARRAY_SIZE(offsets));
      SpvId result = spirv_builder_emit_load(&ctx->builder, uvec4_type, ptr);

      SpvId type = get_dest_uvec_type(ctx, &intr->dest);
      unsigned num_components = nir_dest_num_components(intr->dest);
      if (num_components == 1) {
         uint32_t components[] = { 0 };
         result = spirv_builder_emit_composite_extract(&ctx->builder,
                                                       type,
                                                       result, components,
                                                       1);
      } else if (num_components < 4) {
         SpvId constituents[num_components];
         SpvId uint_type = spirv_builder_type_uint(&ctx->builder, 32);
         for (uint32_t i = 0; i < num_components; ++i)
            constituents[i] = spirv_builder_emit_composite_extract(&ctx->builder,
                                                                   uint_type,
                                                                   result, &i,
                                                                   1);

         result = spirv_builder_emit_composite_construct(&ctx->builder,
                                                         type,
                                                         constituents,
                                                         num_components);
      }

      if (nir_dest_bit_size(intr->dest) == 1)
         result = uvec_to_bvec(ctx, result, num_components);

      store_dest(ctx, &intr->dest, result, nir_type_uint);
   } else
      unreachable("uniform-addressing not yet supported");
}

Remaining work here:

  • handle dynamic offsets in order to e.g., process shaders which use loops to access a UBO member index
  • handle loading an index inside each vec4-sized UBO member in order to be capable of accessing components

More problems

There’s another tangle here when it comes to accessing components of a UBO member though. The Extract operations in SPIR-V all take literals, not SPIR-V ids, which means they can’t be used to support dynamic offsets from shader-side variables. As a result, OpAccessChain is the best solution, but this has some small challenges itself.

The way that OpAccessChain works is that it takes an array of index values that are used to progressively access deeper parts of a composite type. For a case like a vec4[2], passing [0, 2] as the index array would access the first vec4’s third member, as this Op delves based on the composite’s type.

However, in emit_load_ubo() here, the instructions passed provide the offset in bytes, not “members”. This means the value passed as src[1] here has to be converted from bytes into “members”, and it has to be done such that OpAccessChain gets three separate index values in order to access the exact component of the UBO that the instruction specifies. The calculation is familiar for people who have worked extensively in C:

index_0 = 0;
index_1 = offset / sizeof(vec4)
index_2 = (offset % sizeof(vec4) / sizeof(int)
  • The first index is always 0 since the type is a pointer.
  • The second index is determining which vec4 to access; since all UBO members are sized as vec4 types, this is effectively determining which member of the UBO to access
  • The third index accesses components of the variable, which in SPIR-V internals has been sized by ntv as an int-based type in terms of size

This is it for loading the UBO, but now those loaded values need to be stored, as that’s the point of this instruction. In the above code segment, the entire vec4 is loaded, and then OpCompositeExtract is used to grab the desired values, creating a new composite at the end for storage. This won’t work for dynamic usage, however, as I mentioned previously: OpCompositeExtract takes literal index values, which means it can only be used with constant offsets.

Instead, a solution which handles both cases would be to use the OpAccessChain to loop over each individual component that needs to be loaded, Then these loaded values can be reassembled into a composite at the end.

More code

The end result looks like this:

static void
emit_load_ubo(struct ntv_context *ctx, nir_intrinsic_instr *intr)
{
   nir_const_value *const_block_index = nir_src_as_const_value(intr->src[0]);
   assert(const_block_index); // no dynamic indexing for now

   SpvId uint_type = get_uvec_type(ctx, 32, 1);
   SpvId one = emit_uint_const(ctx, 32, 1);

   /* number of components being loaded */
   unsigned num_components = nir_dest_num_components(intr->dest);
   SpvId constituents[num_components];
   SpvId result;

   /* destination type for the load */
   SpvId type = get_dest_uvec_type(ctx, &intr->dest);
   /* an id of the array stride in bytes */
   SpvId vec4_size = emit_uint_const(ctx, 32, sizeof(uint32_t) * 4);
   /* an id of an array member in bytes */
   SpvId uint_size = emit_uint_const(ctx, 32, sizeof(uint32_t));

   /* we grab a single array member at a time, so it's a pointer to a uint */
   SpvId pointer_type = spirv_builder_type_pointer(&ctx->builder,
                                                   SpvStorageClassUniform,
                                                   uint_type);

   /* our generated uniform has a memory layout like
    *
    * struct {
    *    vec4 base[array_size];
    * };
    *
    * where 'array_size' is set as though every member of the ubo takes up a vec4,
    * even if it's only a vec2 or a float.
    *
    * first, access 'base'
    */
   SpvId member = emit_uint_const(ctx, 32, 0);
   /* this is the offset (in bytes) that we're accessing:
    * it may be a const value or it may be dynamic in the shader
    */
   SpvId offset = get_src(ctx, &intr->src[1]);
   /* convert offset to an array index for 'base' to determine which vec4 to access */
   SpvId vec_offset = emit_binop(ctx, SpvOpUDiv, uint_type, offset, vec4_size);
   /* use the remainder to calculate the byte offset in the vec, which tells us the member
    * that we're going to access
    */
   SpvId vec_member_offset = emit_binop(ctx, SpvOpUDiv, uint_type,
                                        emit_binop(ctx, SpvOpUMod, uint_type, offset, vec4_size),
                                        uint_size);
   /* OpAccessChain takes an array of indices that drill into a hierarchy based on the type:
    * index 0 is accessing 'base'
    * index 1 is accessing 'base[index 1]'
    * index 2 is accessing 'base[index 1][index 2]'
    *
    * we must perform the access this way in case src[1] is dynamic because there's
    * no other spirv method for using an id to access a member of a composite, as
    * (composite|vector)_extract both take literals
    */
   for (unsigned i = 0; i < num_components; i++) {
      SpvId indices[3] = { member, vec_offset, vec_member_offset };
      SpvId ptr = spirv_builder_emit_access_chain(&ctx->builder, pointer_type,
                                                  ctx->ubos[const_block_index->u32], indices,
                                                  ARRAY_SIZE(indices));
      /* load a single value into the constituents array */
      constituents[i] = spirv_builder_emit_load(&ctx->builder, uint_type, ptr);
      /* increment to the next vec4 member index for the next load */
      vec_member_offset = emit_binop(ctx, SpvOpIAdd, uint_type, vec_member_offset, one);
   }

   /* if loading more than 1 value, reassemble the results into the desired type,
    * otherwise just use the loaded result
    */
   if (num_components > 1) {
      result = spirv_builder_emit_composite_construct(&ctx->builder,
                                                      type,
                                                      constituents,
                                                      num_components);
   } else
      result = constituents[0];

   /* explicitly convert to a bool vector if the destination type is a bool */
   if (nir_dest_bit_size(intr->dest) == 1)
      result = uvec_to_bvec(ctx, result, num_components);

   store_dest(ctx, &intr->dest, result, nir_type_uint);
}

But wait! Perhaps some avid reader is now considering how many load operations are potentially being added by this method if the original instruction was intended to load an entire vec4. Surely some optimizing can be done here?

One of the great parts about ntv is that there’s not much need to optimize anything in advance here. Getting things working is usually “good enough”, and the reason for that is once again NIR. While it’s true that loading a vec4 member of a UBO from this code does generate four load_ubo instructions, these instructions will get automatically optimized back to a single load_ubo by a nir_lower_io pass triggered from the underlying Vulkan driver, which means spending any effort pre-optimizing here is wasted time.

Moving on

ARB_uniform_buffer_object is done now, so look forward to new topics again.

July 01, 2020

About three weeks ago there was a big announcement about the update of the status of the Vulkan effort for the Raspberry Pi 4. Now the source code is public. Taking into account the interest that it got, and that now the driver is more usable, we will try to post status updates more regularly. Let’s talk about what’s happened since then.

Input Attachments

Input attachment is one of the main sub-features for Vulkan multipass, and we’ve gained support since the announcement. On Vulkan the support for multipass is more tightly supported by the API. Renderpasses can have multiple subpasses. These can have dependencies between each other, and each subpass define a subset of “attachments”. One attachment that is easy to understand is the color attachment: This is where a given subpass writes a given color. Another, input attachment, is an attachment that was updated in a previous subpass (for example, it was the color attachment on such previous subpass), and you get as a input on following subpasses. From the shader POV, you interact with it as a texture, with some restrictions. One important restriction is that you can only read the input attachment at the current pixel location. The main reason for this restriction is because on tile-based GPUs (like rpi4) all primitives are batched on tiles and fragment processing is rendered one tile at a time. In general, if you can live with those restrictions, Vulkan multipass and input attachment will provide better performance than traditional multipass solutions.

If you are interested in reading more details on this, you can check out ARM’s very nice presentation “Vulkan Multipass mobile deferred done right”, or Sascha Willems’ post “Vulkan input attachments and sub passes”. The latter also includes information about how to use them and code snippets of one of his demos. For reference, this is how the input attachment demos looks on the rpi4:

Sascha Willems inputattachment demos run on rpi4

Compute Shader

Given that this was one of the most requested features after the last update, we expect that this will be likely be the most popular news from this post: Compute shaders are now supported.

Compute shaders give applications the ability to perform non-graphics related tasks on the GPU, outside the normal rendering pipeline. For example they don’t have vertices as input, or fragments as output. They can still be used for massivelly parallel GPGPU algorithms. For example, this demo from Sascha Willems uses a compute shader to simulate cloth:

Sascha Willems Compute Cloth demos run on rpi4

Storage Image

Storage Image is another recent addition. It is a descriptor type that represents an image view, and supports unfiltered loads, stores, and atomics in a shader. It is really similar in most other ways to the well-known OpenGL concept of texture. They are really common with compute shaders. Compute shaders will not render (they can’t) directly any image, and it is likely that if they need an image, they will update it. In fact the two Sascha Willem demos using storage images also require compute shader support:

Sascha Willems compute shader demos run on rpi4

Sascha Willems compute raytracing demo run on rpi4

Performance

Right now our main focus for the driver is working on features, targetting a compliant Vulkan 1.0 driver. Having said so, now that we both support a good range of features and can run non-basic applications, we have devoted some time to analyze if there were clear points where we could improve the performance. Among these we implemented:
1. A buffer object (BO) cache: internally we are allocating and freeing really often buffer objects for basically the same tasks, so there are a constant need of buffers of the same size. Such allocation/free require a DRM call, so we implemented a BO cache (based on the existing for the OpenGL driver) so freed BOs would be added to a cache, and reused if a new BO is allocated with the same size.
2. New code paths for buffer to image copies.

Bugfixing!!

In addition to work on specific features, we also spent some time fixing specific driver bugs, using failing Vulkan CTS tests as reference. Thanks to that work, the Sascha Willems’ radial blur demo is now properly rendering, even though we didn’t focus specifically on working on that demo:

Sascha Willems radial blur demo run on rpi4

Next?

Now that the driver supports a good range of features and we are able to test more applications and run more Vulkan CTS Tests with all the needed features implemented, we plan to focus some efforts towards bugfixing for a while.

We also plan to start to work on implementing the support for Pipeline Cache, which allows the result of pipeline construction to be reused between pipelines and between runs of an application.

No time to waste

So let’s get down to pixels. The UBO indexing is now fixed-ish, which means moving onto the next step: setting up bindings for the UBOs.

A binding in this context is the numeric id assigned to a UBO for the purposes of accessing it from a shader, which also corresponds to the uniform block index. In mesa, this is the struct nir_variable::data.binding member of a UBO. A load_ubo instruction will take this value as its first parameter, which means there’s a need to ensure that everything matches up just right.

Where to start

Where I started was checking out the existing code, which assumes that nir_variable::data.binding is already set up correctly, since the comment in mesa/src/compiler/nir/nir.h for the member implies that—

Just kidding, that only applies to Vulkan drivers. In Zink, that needs to be manually set up since, at most, the value will have been incremented by 1 in the nir_lower_uniforms_to_ubo pass from yesterday’s post.

With this in mind, it’s time to check out a block from zink_compiler.c:

   nir_foreach_variable(var, &nir->uniforms) {
      if (var->data.mode == nir_var_mem_ubo) {
         int binding = zink_binding(nir->info.stage,
                                    VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER,
                                    var->data.binding);
         ret->bindings[ret->num_bindings].index = var->data.binding;
         ret->bindings[ret->num_bindings].binding = binding;
         ret->bindings[ret->num_bindings].type = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER;
         ret->num_bindings++;

This iterates over the uniform variables, which are now all wrapped in UBOs, setting up the binding table that will later be used in a vkCreateDescriptorSetLayout call, which passes the bindings along to the underlying driver.

Unfortunately, as just mentioned, this assumes that var->data.binding is set, which it isn’t.

Ordering

A number of things need to be kept in mind to effectively assign all the binding values:

  • The UBOs in this list are ordered backwards, with the zero-id UBO at the end of the list. As such, the bindings need to be generated in reverse order as compared to the uniforms list stored onto the shader.
  • The index member of the binding table, however, is not the same as the binding as this determines the index of the buffer to be used with the specified UBO; if nir_lower_uniforms_to_ubo was run, then index begins at 0, but otherwise it will begin at 1.
  • The point of the binding value is to bind the UBO itself, not variables contained in the UBO. This means that any uniform with a nonzero data.location can be ignored, as this indicates that it’s located at an offset from the base of the UBO and will be accessed by the second parameter of the load_ubo instruction, the offset.

With all this in mind, the following changes can be made:

   uint32_t cur_ubo = 0;
   /* UBO buffers are zero-indexed, but buffer 0 is always the one created by nir_lower_uniforms_to_ubo,
    * which means there is no buffer 0 if there are no uniforms
    */
   int ubo_index = !nir->num_uniforms;
   /* need to set up var->data.binding for UBOs, which means we need to start at
    * the "first" UBO, which is at the end of the list
    */
   foreach_list_typed_reverse(nir_variable, var, node, &nir->uniforms) {
      if (var->data.mode == nir_var_mem_ubo) {
         /* ignore variables being accessed if they aren't the base of the UBO */
         if (var->data.location)
            continue;
         var->data.binding = cur_ubo++;

         int binding = zink_binding(nir->info.stage,
                                    VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER,
                                    var->data.binding);
         ret->bindings[ret->num_bindings].index = ubo_index++;
         ret->bindings[ret->num_bindings].binding = binding;
         ret->bindings[ret->num_bindings].type = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER;
         ret->num_bindings++;

Declaring

Now that the binding values are all taken care of, the next step is to go back to the UBO declarations in ntv:

static void
emit_ubo(struct ntv_context *ctx, struct nir_variable *var)
{
   uint32_t size = glsl_count_attribute_slots(var->type, false);

This is the first line of the function, and it’s the only one that’s important here. Zink is going to pad out every member of a UBO to the size of a vec4 (because PIPE_CAP_PACKED_UNIFORMS is not set by the driver), which is what size here is being assigned as—the number of vec4s needed to declare the passed variable.

This isn’t what the driver should be doing here. As with the binding table setup above, this is declaring UBOs themselves, not variables inside UBOs. As such, all of these variables can be ignored, but the base variable needs to be sized for the entire UBO.

Helpfully, this type is available as struct nir_variable::interface_type for the overall UBO type, which results in the following small changes:

static void
emit_ubo(struct ntv_context *ctx, struct nir_variable *var)
{
   /* variables accessed inside a uniform block will get merged into a big
    * memory blob and accessed by offset
    */
   if (var->data.location)
      return;

   uint32_t size = glsl_count_attribute_slots(var->interface_type, false);

The UBO list in ntv also has to be walked backwards for its declarations in order to match the part from zink_compiler.c, but this is the only change necessary.

Binding accomplished

Yes, that’s sufficient for setting up the variables and bindings for all the UBOs.

Next time, I’ll finish this with a back-to-the-basics look at loading memory from buffers using offsets, except it’s in SPIR-V so everything is way more complicated.

June 30, 2020

assert()gery

In yesterday’s post, I left off in saying that removing an assert() from the constant block index check wasn’t going to work quite right. Let’s see why that is.

Some context again:

static void
emit_load_ubo(struct ntv_context *ctx, nir_intrinsic_instr *intr)
{
   nir_const_value *const_block_index = nir_src_as_const_value(intr->src[0]);
   assert(const_block_index); // no dynamic indexing for now
   assert(const_block_index->u32 == 0); // we only support the default UBO for now

   nir_const_value *const_offset = nir_src_as_const_value(intr->src[1]);
   if (const_offset) {
      SpvId uvec4_type = get_uvec_type(ctx, 32, 4);
      SpvId pointer_type = spirv_builder_type_pointer(&ctx->builder,
                                                      SpvStorageClassUniform,
                                                      uvec4_type);

      unsigned idx = const_offset->u32;
      SpvId member = emit_uint_const(ctx, 32, 0);
      SpvId offset = emit_uint_const(ctx, 32, idx);
      SpvId offsets[] = { member, offset };
      SpvId ptr = spirv_builder_emit_access_chain(&ctx->builder, pointer_type,
                                                  ctx->ubos[0], offsets,
                                                  ARRAY_SIZE(offsets));
      SpvId result = spirv_builder_emit_load(&ctx->builder, uvec4_type, ptr);

This is the top half of emit_load_ubo(), which performs the load on the desired memory region for the UBO access. In particular, the line I’m going to be exploring today is

   assert(const_block_index->u32 == 0); // we only support the default UBO for now

Which directly corresponds to the explicit 0 in

      SpvId ptr = spirv_builder_emit_access_chain(&ctx->builder, pointer_type,
                                                  ctx->ubos[0], offsets,
                                                  ARRAY_SIZE(offsets));

At a glance, it seems like the assert() can be removed, and const_block_index->u32 can be passed as the index to the ctx->ubos array, which is where all the declared UBO variables are stored, and there won’t be any issue.

Not so.

In fact, there’s a number of problems with this.

NIR resurfaces

Over in zink_compiler.c, Zink runs a nir_lower_uniforms_to_ubo pass on shaders. What this pass does is:

  • rewrites all load_uniform instructions as load_ubo instructions for the UBO bound to 0, which works with Gallium’s merging of all non-block uniforms into UBO with binding point 0 (which is what’s currently handled by Zink)
  • adds the variable for a UBO with binding point 0 if there’s any load_uniform instructions
  • increments the binding points (and load instructions) of all pre-existing UBOs by 1
  • uses a specified multiplier to rewrite the offset values specified by the converted load_ubo instructions which were previously load_uniform

But then there’s a problem: what happens when this pass gets run when there’s no non-block uniforms? Well, the answer is just as expected:

  • rewrites all load_uniform instructions as load_ubo instructions for the UBO bound to 0, which works with Gallium’s merging of all non-block uniforms into UBO with binding point 0 (which is what’s currently handled by Zink)
  • adds the variable for a UBO with binding point 0 if there’s any load_uniform instructions
  • increments the binding points (and load instructions) of all pre-existing UBOs by 1
  • uses a specified multiplier to rewrite the offset values specified by the converted load_ubo instructions which were previously load_uniform

So now, in emit_load_ubo() above, that ctx->ubos[const_block_index->u32] is actually going to translate to ctx->ubos[1] in the case of a shader without any uniforms. Unfortunately, here’s the function which declares the UBO variables:

static void
emit_ubo(struct ntv_context *ctx, struct nir_variable *var)
{
   uint32_t size = glsl_count_attribute_slots(var->type, false);
   SpvId vec4_type = get_uvec_type(ctx, 32, 4);
   SpvId array_length = emit_uint_const(ctx, 32, size);
   SpvId array_type = spirv_builder_type_array(&ctx->builder, vec4_type,
                                               array_length);
   spirv_builder_emit_array_stride(&ctx->builder, array_type, 16);

   // wrap UBO-array in a struct
   SpvId struct_type = spirv_builder_type_struct(&ctx->builder, &array_type, 1);
   if (var->name) {
      char struct_name[100];
      snprintf(struct_name, sizeof(struct_name), "struct_%s", var->name);
      spirv_builder_emit_name(&ctx->builder, struct_type, struct_name);
   }

   spirv_builder_emit_decoration(&ctx->builder, struct_type,
                                 SpvDecorationBlock);
   spirv_builder_emit_member_offset(&ctx->builder, struct_type, 0, 0);


   SpvId pointer_type = spirv_builder_type_pointer(&ctx->builder,
                                                   SpvStorageClassUniform,
                                                   struct_type);

   SpvId var_id = spirv_builder_emit_var(&ctx->builder, pointer_type,
                                         SpvStorageClassUniform);
   if (var->name)
      spirv_builder_emit_name(&ctx->builder, var_id, var->name);

   assert(ctx->num_ubos < ARRAY_SIZE(ctx->ubos));
   ctx->ubos[ctx->num_ubos++] = var_id;

   spirv_builder_emit_descriptor_set(&ctx->builder, var_id,
                                     var->data.descriptor_set);
   int binding = zink_binding(ctx->stage,
                              VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER,
                              var->data.binding);
   spirv_builder_emit_binding(&ctx->builder, var_id, binding);
}

Specifically:

   ctx->ubos[ctx->num_ubos++] = var_id;

Indeed, this is zero-indexed, which means all the UBO access for a shader with no uniforms is going to fail because all the UBO load instructions are using a block index that’s off by one.

Solved

As is my way, I slapped some if (!shader->num_uniforms) flex tape on running the zink_compiler nir_lower_uniforms_to_ubo pass in order to avoid potentially breaking the pass’s other usage over in TGSI by changing the pass itself, and now the problem is solved. The assert() can now be removed.

Yes, sometimes there’s all this work, and analyzing, and debugging, and blogging, and the end result is a sweet, sweet zero/null check.

Tune in next time when I again embark on a journey that definitely, in no way results in more flex tape being used.

June 29, 2020

Long Weekend

Not really, but I didn’t get around to blogging on Friday because I was working until pretty late on something that’s Kind Of A Big Deal.

Not really, but it’s probably more interesting than my posts about unhandled ALUs.

ARB_uniform_buffer_object support

This extension is one of the last remaining items (along with GL_NV_primitive_restart, which is likely to be done soon as well, and some fixups for GLSL-1.40) required for OpenGL 3.1 support, so I decided to take a break from fixing corner case piglit tests to try doing something useful.

At a very basic level, this extension provides shaders with the ability to declare a “struct” type uniform containing explicitly-defined members that can be referenced normally. Here’s a quick vertex shader example from piglit’s rendering test from the arb_uniform_buffer_object extension tests:

#extension GL_ARB_uniform_buffer_object : require

layout(std140) uniform;
uniform ub_pos_size { vec2 pos; float size; };
uniform ub_rot {float rotation; };

void main()
{
   mat2 m;
   m[0][0] = m[1][1] = cos(rotation); 
   m[0][1] = sin(rotation); 
   m[1][0] = -m[0][1]; 
   gl_Position.xy = m * gl_Vertex.xy * vec2(size) + pos;
   gl_Position.zw = vec2(0, 1);
};

Seen here, there’s two UBOs passed as inputs, and the shader’s main() function directly references their members to perform a rotation on the passed vertex.

What does this actually mean?

That was my first question. In essence, what it means is that once PIPE_SHADER_CAP_INDIRECT_CONST_ADDR is enabled for the driver, shaders are going to start being compiled that contain instructions to perform UBO loads with offsets, as the “struct” member access is really just loading memory from a buffer at an offset from the base.

There’s two types of indexing that need to be handled:

  • constant - this is like array[1], where the index is explicitly defined
  • dynamic - this is array[i], where the index has been computed by the shader

This type of indexing applies to both uniform block indexing, which determines which UBO is being accessed by the instruction, and uniform block offset, which is the precise region in that UBO being accessed.

At present, only constant block indexing is going to be discussed, though both types of addressing need to be handled for block offsets.

Evaluating the existing code

Let’s check out the core of the implementation:

static void
emit_load_ubo(struct ntv_context *ctx, nir_intrinsic_instr *intr)
{
   nir_const_value *const_block_index = nir_src_as_const_value(intr->src[0]);
   assert(const_block_index); // no dynamic indexing for now
   assert(const_block_index->u32 == 0); // we only support the default UBO for now

   nir_const_value *const_offset = nir_src_as_const_value(intr->src[1]);
   if (const_offset) {
      SpvId uvec4_type = get_uvec_type(ctx, 32, 4);
      SpvId pointer_type = spirv_builder_type_pointer(&ctx->builder,
                                                      SpvStorageClassUniform,
                                                      uvec4_type);

      unsigned idx = const_offset->u32;
      SpvId member = emit_uint_const(ctx, 32, 0);
      SpvId offset = emit_uint_const(ctx, 32, idx);
      SpvId offsets[] = { member, offset };
      SpvId ptr = spirv_builder_emit_access_chain(&ctx->builder, pointer_type,
                                                  ctx->ubos[0], offsets,
                                                  ARRAY_SIZE(offsets));
      SpvId result = spirv_builder_emit_load(&ctx->builder, uvec4_type, ptr);

      SpvId type = get_dest_uvec_type(ctx, &intr->dest);
      unsigned num_components = nir_dest_num_components(intr->dest);
      if (num_components == 1) {
         uint32_t components[] = { 0 };
         result = spirv_builder_emit_composite_extract(&ctx->builder,
                                                       type,
                                                       result, components,
                                                       1);
      } else if (num_components < 4) {
         SpvId constituents[num_components];
         SpvId uint_type = spirv_builder_type_uint(&ctx->builder, 32);
         for (uint32_t i = 0; i < num_components; ++i)
            constituents[i] = spirv_builder_emit_composite_extract(&ctx->builder,
                                                                   uint_type,
                                                                   result, &i,
                                                                   1);

         result = spirv_builder_emit_composite_construct(&ctx->builder,
                                                         type,
                                                         constituents,
                                                         num_components);
      }

      if (nir_dest_bit_size(intr->dest) == 1)
         result = uvec_to_bvec(ctx, result, num_components);

      store_dest(ctx, &intr->dest, result, nir_type_uint);
   } else
      unreachable("uniform-addressing not yet supported");
}

This is the handler for the load_ubo instruction in ntv. It performs a load operation on a previously-emitted UBO variable, using the first parameter (intr->src[0]) as the block index and the second parameter (intr->src[1]) as the block offset, and storing the resulting data that was loaded into the destination (intr->dest).

In this implementation, which is what’s currently in the repo, there’s some assert()s which verify that both of the parameters passed are constant rather than dynamic; as previously-mentioned, this is going to need to change, at least for the block offset. Additionally, the block index is restricted to 0, which I’ll explain a bit later, but it’s a problem.

Work items

So at a minimum, these are the following changes that need to be made:

  • Enable nonzero block indexing so that more than one UBO can be accessed
  • Handle block access using dynamic offsets

As with all projects I decide to tackle, however, these are not going to be the only changes required, as this is going to uncover a small tangle if I try to fix it directly by just removing the assert().

Stay tuned as this saga progresses.

June 25, 2020

A quick word

I didn’t budget my time well today, so here’s a very brief post about neat features in piglit, the test suite/runner for mesa.

Piglit is my go-to for verifying OpenGL behaviors. It’s got tons of fiendishly nitpicky tests for core functionality and extensions, and then it also provides this same quality for shaders with an unlimited number of shader tests.

When working on a new extension or modifying existing behavior, it can be useful to do quick runs of the tests verifying the behavior that’s being modified. A full piglit run with Zink takes around 10-20 minutes, which isn’t even enough to get in a good rhythm for some desk curls, so it’s great that there’s functionality for paring down the number of tests being run.

Regex testing

Piglit provides test inclusion and exclusion support using regular expressions.

  • With -t, a regex for tests to run can be specified, e.g., -t '.*arb_uniform_buffer_object.*'
  • With -x, a regex for tests to skip can be specified, e.g., -x .*glx.*

These options can be used to cut down runtimes as well as ignore tests with intermittent failures when those results aren’t interesting.

That’s it

No, really, I said it would be brief.

June 24, 2020
Code reviews are a central fact of life in software development. It's important to do them well, and developer quality of life depends on a good review workflow.

Unfortunately, code reviews also appear to be a difficult problem. Many projects are bottlenecked by code reviews, in that reviewers are hard to find and progress gets slowed down by having to wait a long time for reviews.

The "solution" that I've often seen applied in practice is to have lower quality code reviews. Reviewers don't attempt to gain a proper understanding of a change, so reviews become shallower and therefore easier. This is convenient on the surface, but more likely to allow bad code to go through: a subtle corner case that isn't covered by tests (yet?) may be missed, there may be a misunderstanding of a relevant underlying spec, a bad design decision slips through, and so on. This is bound to cause pains later on.

I've experienced a number of different code review workflows in practice, based on a number of tools: GitHub PRs and GitLab MRs, Phabricator, and the e-mail review of patch series that is the original workflow for which Git was designed. Of those, the e-mail review flow produced the highest quality. There may be confounding factors, such as the nature of the projects and the selection of developers working on them, but quality issues aside I certainly feel that the e-mail review flow was the most pleasant to work with. Over time I've been thinking and having discussions a lot about just why that is. I feel that I have distilled it to two key factors, which I've decided to write down here so I can just link to this blog post in the future.

First, the UI experience of e-mail is a lot nicer. All of the alternatives are ultimately web-based, and their UI latency is universally terrible. Perhaps I'm particularly sensitive, but I just cannot stand web UIs for serious work. Give me something that reacts to all input reliably in under 50ms and I will be much happier. E-mail achieves that, web UIs don't. Okay, to be fair, e-mail is probably more in the 100ms range given the general state of the desktop. The point is, web UIs are about an order of magnitude worse. It's incredibly painful. (All of this assumes that you're using a decent native e-mail client. Do yourself a favor and give that a try if you haven't. The CLI warriors all have their favorites, but frankly Thunderbird works just fine. Outlook doesn't.)

Second, I've come to realize that there are conflicting goals in review granularity that e-mail happens to address pretty well, but none of the alternatives do a good job of it. Most of the alternatives don't even seem to understand that there is a problem to begin with! Here's the thing:

Reviews want to be small. The smaller and the more self-contained a change is, the easier it is to wrap your head around and judge. If you do a big refactor that is supposed to have no functional impact, followed by a separate small functional change that is enabled by the refactor, then each change individually is much easier to review. Approving changes at a fine granularity also helps ensure that you've really thought through each individual change and that each change has a reasonable justification. Important details don't get lost in something larger.

Reviews want to be big. A small, self-contained change can be difficult to understand and judge in isolation. You're doing a refactor that moves a function somewhere else? Fine, it's easy to tell that the change is correct, but is it a good change? To judge that, you often need to understand how the refactored result ends up being used in later changes, so it's good to see all those changes at once. Keep in mind though that you don't necessarily have to approve them at the same time. It's entirely possible to say, yes, that refactor looks good, we can go ahead with that, but please fix the way it's being used in a subsequent change.

There is another reason why reviews want to be big. Code reviews have a mental context-switching overhead. As a reviewer, you need to think yourself into the affected code in order to judge it well. If you do many reviews, you typically need to context-switch between each review. This can be very taxing mentally and ultimately unpleasant. A similar, though generally smaller, context-switching overhead applies to the author of the change as well: let's say you send out some changes for review, then go off and do another thing, and come back a day or two later to some asynchronously written reviews. In order to respond to the review, you may now have to page the context of that change back in. The point of all this is that when reviews are big, the context-switching overhead gets amortized better, i.e. the cost per change drops.

Reviews want to be both small and big. Guess what, patch series solve that problem! You get to review an entire feature implementation in the form of a patch series at once, so your context-switching overhead is reduced and you can understand how the different parts of the change play together. At the same time, you can drill down into individual patches and review those. Two levels of detail are available simultaneously.

So why e-mail? Honestly, I don't know. Given that the original use case for Git is based on patch series review, it's mind-boggling in a bad way that web-based Git hosting and code review systems do such a poor job of it, if they handle it at all.

Gerrit is the only system I know of that really takes patch series as an idea seriously, but while I am using it occasionally, I haven't had the opportunity to really stress it. Unfortunately, most people don't even want to consider Gerrit as an option because it's ugly.

Phabricator's stacks are a pretty decent attempt and I've made good use of them in the context of LLVM. However, they're too hidden and clumsy to navigate. Both Phabricator and Gerrit lack a mechanism for discussing a patch series as a whole.

GitHub and Gitlab? They're basically unusable. Yes, you can look at individual commits, but then GitHub doesn't even display the commits in the correct order: they're sorted by commit or author date, not by the Git DAG order, which is an obviously and astonishingly bad idea. Comments tend to get lost when authors rebase, which is what authors should do in order to ensure a clean history, and actually reviewing an individual commit is impossible. Part of the power of patch series is the ability to say: "Patches 1-6, 9, and 11 are good to go, the rest needs work."

Oh, and by the way: Commit messages? They're important! Gerrit again is the only system I know of that allows review comments on commit messages. It's as if the authors of Gerrit are the only ones who really understood the problem space. Unfortunately, they seem to lack business and web design skills, and so we ended up in the mess we're in right now.

Mind you, even if the other players got their act together and supported the workflow properly, there'd still be the problem of UI latency. One can dream...

Shader ALUs

Today let’s talk about ALUs a little.

During shader compilation, GLSL gets serialized into SSA form, which is what ntv operates on when translating it into SPIR-V. An ALU in the context of Zink (specifically ntv) is an algebraic operation which takes a varying number of inputs and generates an output. This is represented in NIR by a struct, nir_alu_instr, which contains the operation type, the inputs, and the output.

When writing GLSL, there’s the general assumption that writing something like 1 + 2 will yield 3, but this is contingent on the driver being able to correctly compile the NIR form of the shader into instructions that the physical hardware runs in order to get that result. In Zink, there’s the need to translate all these NIR instructions into SPIR-V, which is sometimes made trickier by both different semantics between similar GLSL and SPIR-V operations as well as aggressive NIR optimizations.

A deep dive into isnan()

The isnan function checks whether the input is a number. It’s a simple enough functionality to describe, but the implementation and transit through the GLSL->NIR->SPIR-V->NIR pipeline is fraught with perils.

In mesa, isnan(x) is serialized to NIR as fne(x, x), where fne is the operation for float-not-equal, which compares two floats to determine whether they are equal. As such, there’s never actually a case where isnan gets passed through ntv. Let’s see what this looks like in practice with this failing shader test:

// from piglit's fs-isnan-vec2.shader_test for GLSL 1.30
#version 130
uniform vec2 numerator;
uniform vec2 denominator;

void main()
{
  gl_FragColor = vec4(isnan(numerator/denominator), 0.0, 1.0);
}

In Zink, this yields:

shader: MESA_SHADER_FRAGMENT
inputs: 0
outputs: 0
uniforms: 0
ubos: 1
shared: 0
decl_var ubo INTERP_MODE_NONE struct_uniform_0 uniform_0 (~0, 0, 640)
decl_var shader_out INTERP_MODE_NONE vec4 gl_FragColor (FRAG_RESULT_COLOR.xyzw, 4, 0)
decl_function main (0 params)

impl main {
	block block_0:
	/* preds: */
	vec4 32 ssa_1 = load_const (0x00000000 /* 0.000000 */, 0x00000000 /* 0.000000 */, 0x00000000 /* 0.000000 */, 0x3f800000 /* 1.000000 */)
	vec1 32 ssa_2 = load_const (0x00000000 /* 0.000000 */)
	intrinsic store_output (ssa_1, ssa_2) (8, 15, 0, 160) /* base=8 */ /* wrmask=xyzw */ /* component=0 */ /* type=float32 */	/* gl_FragColor */
	/* succs: block_1 */
	block block_1:
}

As with yesterday’s shader adventure, here’s IRIS as a control:

shader: MESA_SHADER_FRAGMENT
name: GLSL3
inputs: 0
outputs: 1
uniforms: 0
ubos: 1
shared: 0
decl_var uniform INTERP_MODE_NONE vec2 numerator (0, 0, 0)
decl_var uniform INTERP_MODE_NONE vec2 denominator (1, 2, 0)
decl_var ubo INTERP_MODE_NONE vec4[1] uniform_0 (0, 0, 0)
decl_var shader_out INTERP_MODE_NONE vec4 gl_FragColor (FRAG_RESULT_COLOR.xyzw, 4, 0)
decl_function main (0 params)

impl main {
	block block_0:
	/* preds: */
	vec1 32 ssa_0 = load_const (0x00000000 /* 0.000000 */)
	vec1 32 ssa_1 = load_const (0x3f800000 /* 1.000000 */)
	vec1 32 ssa_2 = load_const (0x00000001 /* 0.000000 */)
	vec4 32 ssa_3 = intrinsic load_ubo (ssa_2, ssa_0) (0, 4, 0) /* access=0 */ /* align_mul=4 */ /* align_offset=0 */
	vec1 32 ssa_6 = frcp ssa_3.z
	vec1 32 ssa_7 = frcp ssa_3.w
	vec1 32 ssa_8 = fmul ssa_3.x, ssa_6
	vec1 32 ssa_9 = fmul ssa_3.y, ssa_7
	vec1 32 ssa_10 = fne32 ssa_8, ssa_8
	vec1 32 ssa_12 = b2f32 ssa_10
	vec1 32 ssa_11 = fne32 ssa_9, ssa_9
	vec1 32 ssa_13 = b2f32 ssa_11
	vec4 32 ssa_14 = vec4 ssa_12, ssa_13, ssa_0, ssa_1
	intrinsic store_output (ssa_14, ssa_0) (4, 15, 0, 160) /* base=4 */ /* wrmask=xyzw */ /* component=0 */ /* type=float32 */	/* gl_FragColor */
	/* succs: block_1 */
	block block_1:
}

This is clearly much different. In particular, note that IRIS retains its fne instructions, but Zink has lost them along the way.

Why is this?

The problem comes from how SPIR-V is translated back to NIR. When emitting fne(a, a) into SPIR-V with OpFOrdNotEqual, the result is that the NaN-ness is ignored, and the NaN value is compared against itself, managing to be equivalent somehow, which breaks the test. This is due to how OpFOrdNotEqual is explicitly used for ordered (numeric).

Using OpFUnordNotEqual for this case has no such issue, as this op always return false if either of the inputs are unordered (NaN).

June 23, 2020

Something different

I’ve talked and rambled about various things, and maybe I’ve given an idea of what it’s like to work on Zink, but the reality is that I spend a majority of my time working on the shader translation pipeline, which converts NIR to SPIRV.

Today let’s go through the process for fixing an issue in this pipeline, as detected by piglit.

Steps

  • build piglit, as described by its README
  • run piglit; my current run is executed with VK_INSTANCE_LAYERS= MESA_GLSL_CACHE_DISABLE=true MESA_LOADER_DRIVER_OVERRIDE=zink ./piglit run --timeout 3000 gpu results/new
  • generate a viewable summary with e.g., ./piglit summary html results/compare <possibly some previous results> results/new
  • open generated results/compare/index.html in browser

Now there’s a massive list of tests with pass/fail results. Clicking on the results of any test will provide more detail, just like this:

piglit-result-2020-06-23_15-35-31.png

In this case, spec@glsl-1.30@execution@fs-texelfetchoffset-2d is failing. What does that mean?

Debugging piglit tests

Near the bottom of the above results, there’s a row for Command, which is the command used to run a given test. This command can be run in any tool, such as gdb or valgrind, in order to run only this test.

More importantly, however, in the case of shader tests, it lets someone debugging a given test produce NIR output, as this is usually the best way to figure out what’s going wrong.

To do so for the above test, I’ve run NIR_PRINT=1 LIBGL_DEBUG=verbose MESA_GLSL_CACHE_DISABLE=true MESA_LOADER_DRIVER_OVERRIDE=zink bin/fs-texelFetchOffset-2D -auto -fbo &>| zink. This generates a file zink in my current directory which contains the generated NIR as it progreses through various stages of optimization and translation.

NIR

The specified test is for a fragment shader, as indicated by fs in the name or just reading the test code, which uses the following shader:

"#version 130\n"
"uniform ivec2 pos;\n"
"uniform int lod;\n"
"uniform sampler2D tex;\n"
"void main()\n"
"{\n"
"       const ivec2 offset = ivec2(-2, 2);\n"
"       vec4 texel = texelFetchOffset(tex, pos, lod, offset);\n"
"	gl_FragColor = texel;\n"
"}\n";

Searching through the NIR output for the last output of the fragment shader IR yields:

shader: MESA_SHADER_FRAGMENT
inputs: 0
outputs: 0
uniforms: 8
ubos: 1
shared: 0
decl_var uniform INTERP_MODE_NONE sampler2D tex (~0, 0, 672)
decl_var ubo INTERP_MODE_NONE struct_uniform_0 uniform_0 (~0, 0, 640)
decl_var shader_out INTERP_MODE_NONE vec4 gl_FragData[0] (FRAG_RESULT_DATA0.xyzw, 8, 0)
decl_function main (0 params)

impl main {
	block block_0:
	/* preds: */
	vec1 32 ssa_0 = load_const (0x00000002 /* 0.000000 */)
	vec1 32 ssa_1 = load_const (0x00000000 /* 0.000000 */)
	vec4 32 ssa_2 = intrinsic load_ubo (ssa_0, ssa_1) (0, 4, 0) /* access=0 */ /* align_mul=4 */ /* align_offset=0 */
	vec1 32 ssa_3 = load_const (0x00000010 /* 0.000000 */)
	vec4 32 ssa_4 = intrinsic load_ubo (ssa_0, ssa_3) (0, 16, 0) /* access=0 */ /* align_mul=16 */ /* align_offset=0 */
	vec2 32 ssa_5 = vec2 ssa_2.x, ssa_2.y
	vec1 32 ssa_6 = mov ssa_4.x
	vec4 32 ssa_7 = (float)txf ssa_5 (coord), ssa_6 (lod), 3 (texture), 0 (sampler)
	intrinsic store_output (ssa_7, ssa_1) (8, 15, 0, 160) /* base=8 */ /* wrmask=xyzw */ /* component=0 */ /* type=float32 */	/* gl_FragData[0] */
	/* succs: block_1 */
	block block_1:
}

That’s a lot to go through in a single post, so I’ll be providing a brief overview for now. The most important thing to keep in mind is that ssa_* values in IR are SSA, so each value can be traced through execution by following the assignments.

Looking at main in the shader code, an ivec2 is created as (-2, 2), and this is passed into texelFetchOffset() as the offset from the pos uniform.

Looking at main in the IR, the first 5 lines of block_0 (the only block) are used to load resources. It can be assumed they’re generally correct right now, though that won’t always be the case. Next there’s a vec2 formed (ssa_5) from the load_ubo-created ssa_2; as can be seen a couple lines down, this is is the coord or P param in texelFetchOffset, which is abbreviated as txf here.

In particular, ssa_5 is formed and then passed directly to the txf instruction. What happened to the offset?

Let’s check out NIR generated for this shader by IRIS, the Intel gallium driver:

shader: MESA_SHADER_FRAGMENT
name: GLSL3
inputs: 0
outputs: 1
uniforms: 0
ubos: 1
shared: 0
decl_var uniform INTERP_MODE_NONE ivec2 pos (1, 0, 0)
decl_var uniform INTERP_MODE_NONE int lod (2, 2, 0)
decl_var uniform INTERP_MODE_NONE sampler2D tex (3, 0, 0)
decl_var ubo INTERP_MODE_NONE vec4[1] uniform_0 (0, 0, 0)
decl_var shader_out INTERP_MODE_NONE vec4 gl_FragColor (FRAG_RESULT_COLOR.xyzw, 4, 0)
decl_function main (0 params)

impl main {
	block block_0:
	/* preds: */
	vec1 32 ssa_0 = load_const (0x00000000 /* 0.000000 */)
	vec1 32 ssa_1 = load_const (0x00000002 /* 0.000000 */)
	vec3 32 ssa_2 = intrinsic load_ubo (ssa_1, ssa_0) (0, 4, 0) /* access=0 */ /* align_mul=4 */ /* align_offset=0 */
	vec1 32 ssa_16 = mov ssa_2.z
	vec1 32 ssa_3 = load_const (0xfffffffe /* -nan */)
	vec1 32 ssa_6 = iadd ssa_2.x, ssa_3
	vec1 32 ssa_7 = iadd ssa_2.y, ssa_1
	vec2 32 ssa_8 = vec2 ssa_6, ssa_7
	vec4 32 ssa_9 = (float)txf ssa_8 (coord), ssa_16 (lod), 1 (texture), 0 (sampler)
	intrinsic store_output (ssa_9, ssa_0) (4, 15, 0, 160) /* base=4 */ /* wrmask=xyzw */ /* component=0 */ /* type=float32 */	/* gl_FragColor */
	/* succs: block_1 */
	block block_1:
}

In particular:

	vec1 32 ssa_6 = iadd ssa_2.x, ssa_3
	vec1 32 ssa_7 = iadd ssa_2.y, ssa_1
	vec2 32 ssa_8 = vec2 ssa_6, ssa_7

As can be seen here, the ssa for the coord param is only formed after a pair of iadd instructions occur, as one would expect to see if a vec2 offset were added to a vec2 coordinate.

Indeed, it seems that Zink is ignoring the offset here.

Fixing

Armed with the knowledge that a txf instruction is involved, a quick search through nir_to_spirv.c reveals the emit_tex() function as a likely starting point, as it’s where txf is handled.

Some excerpts follow:

   for (unsigned i = 0; i < tex->num_srcs; i++) {
      switch (tex->src[i].src_type) {
      case nir_tex_src_coord:
         if (tex->op == nir_texop_txf ||
             tex->op == nir_texop_txf_ms)
            coord = get_src_int(ctx, &tex->src[i].src);
         else
            coord = get_src_float(ctx, &tex->src[i].src);
         coord_components = nir_src_num_components(tex->src[i].src);
         break;

      case nir_tex_src_offset:
         offset = get_src_int(ctx, &tex->src[i].src);
         break;

Here the code iterates through the inputs for the command and then takes action based on their type. In particular, I’ve cut out the coord and offset inputs, as that’s where the issue lies. The implementation is translating the nir_src values (which represent “some value” at runtime) into SpvId values (which also represent “some value” at runtime), so this is okay.

Let’s scroll down a bit:

   if (tex->op == nir_texop_txf ||
       tex->op == nir_texop_txf_ms) {
      SpvId image = spirv_builder_emit_image(&ctx->builder, image_type, load);
      result = spirv_builder_emit_image_fetch(&ctx->builder, dest_type,
                                              image, coord, lod, sample);
   } else {
      result = spirv_builder_emit_image_sample(&ctx->builder,
                                               actual_dest_type, load,
                                               coord,
                                               proj != 0,
                                               lod, bias, dref, dx, dy,
                                               offset);
   }

And here’s the problem. The txf instruction isn’t handling the offset at all, while other instructions (which map to e.g., OpImageSampleImplicitLod) are passing it along as a parameter.

The fix in this case is to check OpIAdd, which does indeed permit addition of vectors, and so the block can be changed to:

   if (tex->op == nir_texop_txf ||
       tex->op == nir_texop_txf_ms) {
      SpvId image = spirv_builder_emit_image(&ctx->builder, image_type, load);
      if (offset)
         coord = emit_binop(ctx, SpvOpIAdd,
                            /* 'coord_bitsize' here comes from adding
                            
                               coord_bitsize = nir_src_bit_size(tex->src[i].src);
                               
                               to the 'nir_tex_src_coord' switch case in the first block
                             */
                            get_ivec_type(ctx, coord_bitsize, coord_components),
                            coord, offset);
      result = spirv_builder_emit_image_fetch(&ctx->builder, dest_type,
                                              image, coord, lod, sample);
   } else {
      result = spirv_builder_emit_image_sample(&ctx->builder,
                                               actual_dest_type, load,
                                               coord,
                                               proj != 0,
                                               lod, bias, dref, dx, dy,
                                               offset);
   }

This emits an addition instruction for the coord and offset vectors and passes the new coord value to the spirv_builder_emit_image_fetch() function, and now the issue is resolved.

June 22, 2020

Finally, An Introduction

Since the start of this blog, I’ve been going full speed ahead with minimal regard for explaining terminology or architecture. This was partly to bootstrap the blog and get some potentially interesting content out there, but I also wanted to provide some insight into how clueless I was when I started out in mesa.

If you’ve been confused by the previous posts, that’s roughly where I was at the time when I first encountered whatever it was you that you’ve been reading about.

What is Gallium?

There’s a lot of good documentation available for it, but much of that documentation assumes that the reader already has fairly deep knowledge about graphics/rendering as well as pipeline architecture.

When I began working on mesa, I did not have that knowledge, so let’s take a little time to go over some parts of the mesa tree, beginning with gallium.

Gallium is the API provided by mesa/src/mesa/state_tracker. state_tracker is a mesa dri driver implementation (like i965 or radeon) which translates the mesa/src/mesa/main API and functionality into something a bit more flexible and easy to write drivers for. In particular, the state tracker is less immediate-mode functionality than core mesa, which enables greater optimization to be performed with e.g., batching and deduplication of repeated operations.

What are the main components of the Gallium API?

The main headers for use with gallium drivers can be found in mesa/src/gallium/include/pipe. This contains:

  • struct pipe_screen - an interface for accessing the underlying hardware/device layer, providing the get_param() methods for determining the capabilities (PIPE_CAP_XYZ) that a driver has. In Zink terms, this is the object that all Vulkan commands go through, as struct zink_screen::dev is the VkDevice used for everything.

  • struct pipe_context - an interface created from a struct pipe_screen providing rendering context methods to handle managing states and surface objects as well as VBO drawing. In Zink, this is the component that ties everything together.

  • struct pipe_resource - an object created from a struct pipe_screen representing some sort of buffer or texture. In Zink terms, any time an OpenGL command reads back data from a buffer or directly maps data to a texture, this is the object used.

  • struct pipe_surface - an object created from a struct pipe_context representing a texture view that can be bound as a color/depth/stencil attachment to a framebuffer.

  • struct pipe_query - an object created from a struct pipe_context representing a query of some sort, whether for performance or functional purposes.

  • struct pipe_fence_handle - an object created from a struct pipe_screen representing a fence used for command stream synchronization.

Other components

Aside from the main gallium API (which has tons more types than just those listed above), there’s also:

  • the GLSL compiler
  • the NIR compiler
  • the SPIR-V compiler
  • the mesa utility API
  • the gallium aux/utility API

These are written in a combination of C/C++/Python, and they’re (mostly) all used in gallium drivers.

June 19, 2020

A new problem

My work in ntv exposed more issues that needed to be resolved, the most significant of which was the ability of Zink to accidentally clobber output variables. What does that actually mean though?

Inputs and outputs in shaders are assigned a location (SPIRV terminology) or a slot (mesa terminology). These are helpfully defined in mesa/src/compiler/shader_enums.h:

/**
 * Indexes for vertex shader outputs, geometry shader inputs/outputs, and
 * fragment shader inputs.
 *
 * Note that some of these values are not available to all pipeline stages.
 *
 * When this enum is updated, the following code must be updated too:
 * - vertResults (in prog_print.c's arb_output_attrib_string())
 * - fragAttribs (in prog_print.c's arb_input_attrib_string())
 * - _mesa_varying_slot_in_fs()
 */
typedef enum
{
   VARYING_SLOT_POS,
   VARYING_SLOT_COL0, /* COL0 and COL1 must be contiguous */
   VARYING_SLOT_COL1,
   VARYING_SLOT_FOGC,
   VARYING_SLOT_TEX0, /* TEX0-TEX7 must be contiguous */
   VARYING_SLOT_TEX1,
   VARYING_SLOT_TEX2,
   VARYING_SLOT_TEX3,
   VARYING_SLOT_TEX4,
   VARYING_SLOT_TEX5,
   VARYING_SLOT_TEX6,
   VARYING_SLOT_TEX7,
   VARYING_SLOT_PSIZ, /* Does not appear in FS */
   VARYING_SLOT_BFC0, /* Does not appear in FS */
   VARYING_SLOT_BFC1, /* Does not appear in FS */
   VARYING_SLOT_EDGE, /* Does not appear in FS */
   VARYING_SLOT_CLIP_VERTEX, /* Does not appear in FS */
   VARYING_SLOT_CLIP_DIST0,
   VARYING_SLOT_CLIP_DIST1,
   VARYING_SLOT_CULL_DIST0,
   VARYING_SLOT_CULL_DIST1,
   VARYING_SLOT_PRIMITIVE_ID, /* Does not appear in VS */
   VARYING_SLOT_LAYER, /* Appears as VS or GS output */
   VARYING_SLOT_VIEWPORT, /* Appears as VS or GS output */
   VARYING_SLOT_FACE, /* FS only */
   VARYING_SLOT_PNTC, /* FS only */
   VARYING_SLOT_TESS_LEVEL_OUTER, /* Only appears as TCS output. */
   VARYING_SLOT_TESS_LEVEL_INNER, /* Only appears as TCS output. */
   VARYING_SLOT_BOUNDING_BOX0, /* Only appears as TCS output. */
   VARYING_SLOT_BOUNDING_BOX1, /* Only appears as TCS output. */
   VARYING_SLOT_VIEW_INDEX,
   VARYING_SLOT_VIEWPORT_MASK, /* Does not appear in FS */
   VARYING_SLOT_VAR0, /* First generic varying slot */
   /* the remaining are simply for the benefit of gl_varying_slot_name()
    * and not to be construed as an upper bound:
    */
   VARYING_SLOT_VAR1,
   VARYING_SLOT_VAR2,
   VARYING_SLOT_VAR3,
   VARYING_SLOT_VAR4,
   VARYING_SLOT_VAR5,
   VARYING_SLOT_VAR6,
   VARYING_SLOT_VAR7,
   VARYING_SLOT_VAR8,
   VARYING_SLOT_VAR9,
   VARYING_SLOT_VAR10,
   VARYING_SLOT_VAR11,
   VARYING_SLOT_VAR12,
   VARYING_SLOT_VAR13,
   VARYING_SLOT_VAR14,
   VARYING_SLOT_VAR15,
   VARYING_SLOT_VAR16,
   VARYING_SLOT_VAR17,
   VARYING_SLOT_VAR18,
   VARYING_SLOT_VAR19,
   VARYING_SLOT_VAR20,
   VARYING_SLOT_VAR21,
   VARYING_SLOT_VAR22,
   VARYING_SLOT_VAR23,
   VARYING_SLOT_VAR24,
   VARYING_SLOT_VAR25,
   VARYING_SLOT_VAR26,
   VARYING_SLOT_VAR27,
   VARYING_SLOT_VAR28,
   VARYING_SLOT_VAR29,
   VARYING_SLOT_VAR30,
   VARYING_SLOT_VAR31,
} gl_varying_slot;

As seen above, there’s a total of 64 slots: 32 for builtins and 32 for other usage. In ntv, the builtins are translated from GLSL -> SPIRV. The problem arises because SPIRV doesn’t have analogues for many of the GLSL builtins, which means they need to take up space in the latter half of the slots.

As an example, VARYING_SLOT_COL0 (GLSL’s gl_Color or gl_FrontColor depending on shader type) does not have a SPIRV builtin. This means it’ll get emitted as VARYING_SLOT_VAR[n]. In such a scenario, any shader-created VARYING_SLOT[n] created from a user-defined varying will end up clobbering the color value.

More problems

The simple solution here would be to just map the first half (builtin) of the slot range onto the second half (user), but that has its own problem: slot usage must remain within the boundaries of the enum. This means that the slot usage for GLSL builtins needs to be kept to a minimum in order to leave room for user-defined varyings.

Additionally, the slots need to be remapped consistently for all types of shaders, as ntv has no capacity to look at any shader but the one being actively processed. So doing any kind of dynamic remapping is out.

Solutions

Ideally the GLSL slot usage needs to be compacted, so I started creating a remapping array for the builtins so that I could see what was available as a SPIRV builtin and what wasn’t. Then I went over the members lacking SPIRV builtins and assigned them a value. The result was this:

/* this consistently maps slots to a zero-indexed value to avoid wasting slots */
static unsigned slot_pack_map[] = {
   /* Position is builtin */
   [VARYING_SLOT_POS] = UINT_MAX,
   [VARYING_SLOT_COL0] = 0, /* input/output */
   [VARYING_SLOT_COL1] = 1, /* input/output */
   [VARYING_SLOT_FOGC] = 2, /* input/output */
   /* TEX0-7 are translated to VAR0-7 by nir, so we don't need to reserve */
   [VARYING_SLOT_TEX0] = UINT_MAX, /* input/output */
   [VARYING_SLOT_TEX1] = UINT_MAX,
   [VARYING_SLOT_TEX2] = UINT_MAX,
   [VARYING_SLOT_TEX3] = UINT_MAX,
   [VARYING_SLOT_TEX4] = UINT_MAX,
   [VARYING_SLOT_TEX5] = UINT_MAX,
   [VARYING_SLOT_TEX6] = UINT_MAX,
   [VARYING_SLOT_TEX7] = UINT_MAX,

   /* PointSize is builtin */
   [VARYING_SLOT_PSIZ] = UINT_MAX,

   [VARYING_SLOT_BFC0] = 3, /* output only */
   [VARYING_SLOT_BFC1] = 4, /* output only */
   [VARYING_SLOT_EDGE] = 5, /* output only */
   [VARYING_SLOT_CLIP_VERTEX] = 6, /* output only */

   /* ClipDistance is builtin */
   [VARYING_SLOT_CLIP_DIST0] = UINT_MAX,
   [VARYING_SLOT_CLIP_DIST1] = UINT_MAX,

   /* CullDistance is builtin */
   [VARYING_SLOT_CULL_DIST0] = UINT_MAX, /* input/output */
   [VARYING_SLOT_CULL_DIST1] = UINT_MAX, /* never actually used */

   /* PrimitiveId is builtin */
   [VARYING_SLOT_PRIMITIVE_ID] = UINT_MAX,

   /* Layer is builtin */
   [VARYING_SLOT_LAYER] = UINT_MAX, /* input/output */

   /* ViewportIndex is builtin */
   [VARYING_SLOT_VIEWPORT] =  UINT_MAX, /* input/output */

   /* FrontFacing is builtin */
   [VARYING_SLOT_FACE] = UINT_MAX,

   /* PointCoord is builtin */
   [VARYING_SLOT_PNTC] = UINT_MAX, /* input only */

   /* TessLevelOuter is builtin */
   [VARYING_SLOT_TESS_LEVEL_OUTER] = UINT_MAX,
   /* TessLevelInner is builtin */
   [VARYING_SLOT_TESS_LEVEL_INNER] = UINT_MAX,

   [VARYING_SLOT_BOUNDING_BOX0] = 7, /* Only appears as TCS output. */
   [VARYING_SLOT_BOUNDING_BOX1] = 8, /* Only appears as TCS output. */
   [VARYING_SLOT_VIEW_INDEX] = 9, /* input/output */
   [VARYING_SLOT_VIEWPORT_MASK] = 10, /* output only */
};

Now all the GLSL builtins that need slots are compacted into 11 members of the enum, which leaves the other 21 available.

Any input or output coming through ntv now goes through a switch statement: the GLSL builtins that can be translated to SPIRV builtins are, the builtins that can’t are remapped using this array, and and rest of the slots get mapped onto a slot after the reserved slot members.

Problem solved.

For now.

June 18, 2020

Testing Testing Testing

As always, there’s more tests to run. And when the validation layers are enabled for Vulkan (export VK_INSTANCE_LAYERS=VK_LAYER_KHRONOS_validation once installed), sometimes new errors pop up.

Such was the case when running tests one day when I encountered an error about vkCmdResetQueryPool being called inside a render pass. Indeed, there was an incorrect usage here, and it needed to be resolved. Let’s take a look at why this was happening.

Query Active-ness

Once vkCmdBeginQuery is called, a query is considered active by Vulkan, which means they can’t be destroyed until they’re made inactive. Queries in the active state now have association with the command buffer (batch) that they’re made active in, but this also means that a query needs to be “transferred” to a new zink_batch any time the current one is flushed and cycled to the next batch. This happens as below:

static void
flush_batch(struct zink_context *ctx)
{
   struct zink_batch *batch = zink_curr_batch(ctx);
   if (batch->rp)
      vkCmdEndRenderPass(batch->cmdbuf);

   zink_end_batch(ctx, batch);

   ctx->curr_batch++;
   if (ctx->curr_batch == ARRAY_SIZE(ctx->batches))
      ctx->curr_batch = 0;

   zink_start_batch(ctx, zink_curr_batch(ctx));
}

This submits the current command buffer (batch), then switches to the next one and activates it. In the process, all active queries on the current batch are stopped, then they get started once more on the new command buffer first thing so that they continue to e.g., track primitives emitted without missing any operations.

Diving deeper, zink_end_batch() calls through to zink_suspend_queries(), which calls end_query(), a function I’ve mentioned a couple times previously. After all the reworking, here’s how it looks:

static void
end_query(struct zink_batch *batch, struct zink_query *q)
{
   struct zink_context *ctx = batch->ctx;
   struct zink_screen *screen = zink_screen(ctx->base.screen);
   assert(q->type != PIPE_QUERY_TIMESTAMP);
   q->active = false;
   if (q->vkqtype == VK_QUERY_TYPE_TRANSFORM_FEEDBACK_STREAM_EXT)
      screen->vk_CmdEndQueryIndexedEXT(batch->cmdbuf, q->query_pool, q->curr_query, q->index);
   else
      vkCmdEndQuery(batch->cmdbuf, q->query_pool, q->curr_query);
   if (++q->curr_query == q->num_queries) {
      get_query_result(&ctx->base, (struct pipe_query*)q, false, &q->big_result);
      vkCmdResetQueryPool(batch->cmdbuf, q->query_pool, 0, q->num_queries);
      q->last_checked_query = q->curr_query = 0;
   }
}

As seen, the query is marked inactive (as it relates to Vulkan), ended, and then the query id is incremented. If the id reaches the max number of queries in the pool, the query grabs the current results into the inline result struct discussed previously before resetting the pool.

This is where the API misuse error was coming from, as the render pass is not destroyed until the batch is reset, which occurs the next time it’s made active.

A small change to the reset block at the end of end_query():

static void
end_query(struct zink_batch *batch, struct zink_query *q)
{
   struct zink_context *ctx = batch->ctx;
   struct zink_screen *screen = zink_screen(ctx->base.screen);
   assert(q->type != PIPE_QUERY_TIMESTAMP);
   q->active = false;
   if (q->vkqtype == VK_QUERY_TYPE_TRANSFORM_FEEDBACK_STREAM_EXT)
      screen->vk_CmdEndQueryIndexedEXT(batch->cmdbuf, q->query_pool, q->curr_query, q->index);
   else
      vkCmdEndQuery(batch->cmdbuf, q->query_pool, q->curr_query);
   if (++q->curr_query == q->num_queries) {
      if (batch->rp)
         q->needs_reset = true;
      else
        reset_pool(batch, q);
   }
}

Now the reset can be deferred until the query is made active again, at which point it’s guaranteed to not be inside a render pass.

And that’s it

The updated query code is awaiting review, and Zink no longer assert()s when an application toggles its queries too many times.

June 17, 2020

In Part 1 I've shown you how to create your own distribution image using the freedesktop.org CI templates. In Part 2, we'll go a bit further than that by truly embracing nested images.

Our assumption here is that we have two projects (or jobs), with the second one relying heavily on the first one. For example, the base project and a plugin, or a base project and its language bindings. What we'll get out of this blog post is a setup where we have

  • a base image in the base project
  • an image extending that base image in a different project
  • automatic rebuilds of that extended image when the base image changes
And none of your contributors have to care about this. It's all handled automatically and filing a MR against a project will build against the right image. So let's get started.

Our base project has CI that pushes an image to its registry. The .gitlab-ci.yml contains something like this:


.fedora32:
variables:
FDO_DISTRIBUTION_VERSION: '32'
FDO_DISTRIBUTION_TAG: 'base.0'

build-img:
extends:
- .fedora32
- .fdo.container-build@fedora
variables:
FDO_DISTRIBUTION_PACKAGES: "curl wget"

This will build a fedora/32:base.0 image in the project's container registry. That image is built once and then re-used by any job extending .fdo.distribution-image@fedora. So far, so Part 1.

Now, the second project needs to test things on top of this base image, for example language bindings for rust. You want to use the same image that the base project uses (and has successfully completed its CI on) but you need some extra packages or setup. This is where the FDO_BASE_IMAGE comes in. In our dependent project, we have this:


.fedora32:
variables:
FDO_DISTRIBUTION_VERSION: '32'
FDO_DISTRIBUTION_TAG: 'rust.0'

build-rust-image:
extends:
- .fedora32
- .fdo.container-build@fedora
variables:
FDO_BASE_IMAGE: "registry.freedesktop.org/baseproject/name/fedora/32:base.0"
# extra packages we want to install and things we need to set up
FDO_DISTRIBUTION_PACKAGES: "rust cargo"
FDO_DISTRIBUTION_EXEC: "bash -x some-setup-script.sh"

test-rust:
extends:
- .fedora32
- .fdo.distribution-image@fedora
script:
- cargo build myproject-bindings

And voila, you now have two images: the base image with curl and wget in the base project and an extra image with rust and cargo in the dependent project. And all that is required is to reference the FDO_BASE_IMAGE, everything else is the same. Note how the FDO_BASE_IMAGE is a full path in this example since we assume it's in a different project. For dependent images within the same project, you can just use the image path without the host.

The dependency only matters while the image is built, after that the dependent image is just another standalone image. So even if the base project removes the base image, you still have yours to test on.

But eventually you will need to change the base image and you want the dependent image to update as well. The best solution here is to have a CI job as part of the base repo that pokes the dependent repo's CI whenever the base image updates. The CI templates add the pipeline id as label to an image when it is built. In your base project, you can thus have a job like this:


poke-dependents:
extends:
- .fedora32
- .fdo.distribution-image@fedora
image: something-with-skopeo-and-jq
script:
# FDO_DISTRIBUTION_IMAGE still has indirections
- DISTRO_IMAGE=$(eval echo ${FDO_DISTRIBUTION_IMAGE})
# retrieve info from the registry and extract the pipeline id
- JSON_IMAGE=$(skopeo inspect docker://$DISTRO_IMAGE)
- IMAGE_PIPELINE_ID=$(echo $JSON_IMAGE | jq -r '.Labels["fdo.pipeline_id"]')
- |
if [[ x"$IMAGE_PIPELINE_ID" == x"$CI_PIPELINE_ID" ]]; then
curl -X POST
-F "token=$AUTH_TOKEN_VALUE"
-F "ref=master"
-F "variables[SOMEVARIABLE]=somevalue"
https://gitlab.freedesktop.org/api/v4/projects/dependent${SLASH}project/trigger/pipeline
fi
variables:
SLASH: "%2F"

Let's dissect this: First, we use the .fdo.distribution-image@fedora template to get access to FDO_DISTRIBUTION_IMAGE. We don't need to use the actual image though, anything with skopeo and jq will do. Then we fetch the pipeline id label from the image and compare it to the current pipeline ID. If it is the same, our image was rebuilt as part of the pipeline and we poke the other project's pipeline with a SOMEVARIABLE set to somevalue. The auth token is a standard GitLab token you need to create to allow triggering the pipeline in the dependent project.

In that dependent project you can have a job like this:


rebuild-extra-image:
extends: build-extra-image
rules:
- if: '$SOMEVARIABLE == "somevalue"'
variables:
FDO_FORCE_REBUILD: 1

This job is only triggered where the variable is set and it will force a rebuild of the container image. If you want custom rebuilds of images, set the variables accordingly.

So, as promised above, we now have a base image and a separate image building on that, together with auto-rebuild hooks. The gstreamer-plugins-rs project uses this approach. The base image is built by gstreamer-rs during its CI run which then pokes gstreamer-plugins-rs to rebuild selected dependent images.

The above is most efficient when the base project knows of the dependent projects. Where this is not the case, the dependent project will need a scheduled pipeline to poll the base project and extract the image IDs from that, possibly using creation dates and whatnot. We'll figure that out when we have a use-case for it.

The overflow problem

In past posts I touched on the issue of query pool overflows. The gist of the issue, again, is that each “query” object exposed higher up to the OpenGL API is actually a query pool in Vulkan. Every time an OpenGL query is started or resumed, this consumes a query from the pool. As such, any application which attempts to toggle the active state of a query too many times will end up losing data when the pool gets reset.

Solution

It wasn’t the most difficult of problems. I dove in with the idea of storing partial query results onto the struct zink_query object. At zink_create_query, the union pipe_query_result object gets zeroed, and it can then have partial results concatenated to it. In code, it looks more or less like:

static bool
get_query_result(struct pipe_context *pctx,
                      struct pipe_query *q,
                      bool wait,
                      union pipe_query_result *result)
{
   struct zink_screen *screen = zink_screen(pctx->screen);
   struct zink_query *query = (struct zink_query *)q;
   VkQueryResultFlagBits flags = 0;

   if (wait)
      flags |= VK_QUERY_RESULT_WAIT_BIT;

   if (query->use_64bit)
      flags |= VK_QUERY_RESULT_64_BIT;

   // the below lines added for overflow handling
   if (result != &query->big_result) {
      memcpy(result, &query->big_result, sizeof(query->big_result));
      util_query_clear_result(&query->big_result, query->type);
   } else
      flags |= VK_QUERY_RESULT_PARTIAL_BIT;

So when calling this from the application, the result pointer isn’t the inlined query result, and the existing result data gets stored to it, then the latest query data is concatenated onto it once it’s fetched from Vulkan.

When calling internally from end_query just before resetting the pool, the partial bit is set, which permits the return of “whatever results are available”. Those get stored onto the inline results struct that gets passed for this case, and this process repeats as many times as necessary until the query is destroyed or the user fetches the results.

June 16, 2020

To start with

Reworking queries was a long process, though not nearly as long as getting xfb completely working. I’ll be going over it in more general terms since the amount of code changed is substantial, and it’s a combination of additions and deletions which makes for difficult reading in diff format on a blog post.

The first thing I started looking at was getting vkCmdResetQueryPool out of zink_begin_query. The obvious choice was to move it to zink_create_query; the pool should reset all its queries upon creation so that they can be used immediately. It’s important to remember that this command can’t be called from within a render pass, which means zink_batch_no_rp must be used here.

The only other place that reset is required is now at the time when a query ends in a pool, as each pool is responsible for only a single type of query.

Fences

Now up to this point, queries were sort of just fired off and forgotten, which meant that there was no way to know whether they had completed. This was going to be a problem, as it’s against Vulkan spec to destroy an in-progress query, which means they must be deferred. While violating some parts of the spec might be harmless, this part is not: destroying an in-progress xfb query actually crashes the Intel ANV driver.

To solve this, queries must be attached first to a given struct zink_batch so it becomes possible to know which command buffer they’re running in. With this relation established, a fence finishing for a batch can iterate over the list of active queries for that batch, marking them as having completed. This enables query destruction to be deferred when necessary to avoid violating spec.

Unfortunately, xfb queries allocate specific resources in the underlying driver, and attempting to defer these types of queries can result in those resources being freed by other means, so this requires Zink to block on the corresponding batch’s fence whenever deleting an active xfb query. The queries do know when they’re active now, however, so at least there’s no need to block unnecessarily.

And then also

One of the downsides of the existing query implementation is that starting and stopping a query repeatedly without fetching the results resets the pool and discards any fetched results. Now that the reset occurs on pool creation, this is no longer the case, and the only time that results are discarded is when the pool overflows on query termination. This in itself is a questionable improvement over the existing behavior of assert()ing, but it’s a step in the right direction.

June 15, 2020

I spent the last week investigating the cause of two problems between VKMS and IGT that I have faced and reported in the development phase of my GSoC project proposal. One of the issues was a weird behavior, that I described as unstable, in the sequential execution of the kms\ _cursor\ _crc subtests or running the same subtest twice in a row: a subtest that passed in the first run failed in the second and returned to succeed in the third (and so on).

At first, it was difficult to determine where was the error because:

  1. I had a very superficial comprehension (I still have it, but a little less) about the implementation of an IGT test and subtests.
  2. The file used by the test to write and verify the data had the access blocked soon after the execution of a subtest
  3. Currently, when vkms is the driver used, only two subtests of kms_cursor_crc are passing

A previous task that helped me a lot in this investigative process was stopping to read and study the test code to understand its structure and the steps taken during the execution. My mentor guided me to do this study, and I published an initial version of this anatomy in my previous post. With that, I was able to do some checks to evaluate what was leaving the file created by igt_debugfs blocked, which ended up also solving the problem of sequential execution.

Waiting for Vblank

I describe below how I reached the idea of adding a call of waiting for vblank that solved the problem mentioned above. I verified in some other scenarios that adding this call does not cause regression, but I still don’t have a good perception and confidence of why this call is only necessary for VKMS.

Timed out: Opening crc fd, and poll for first CRC

For checking this issue, you have to enable VKMS and run (for example) the subtest alpha-opaque twice:

sudo IGT_FORCE_DRIVER=vkms build/tests/kms_cursor_crc --run-subtest pipe-A-cursor-alpha-opaque

You will see the subtest succeed in the first execution and fail in the second. The debug will report a timeout on opening crc fd, and poll for first CRC. From this report, I guessed that something previously allocated was not released, i.e., lacking a king of “free”.

  1. Checking the code, much of allocation and release of things on each round of subtest is in the function prepare_crtc and cleanup_crtc:
static void run_test(data_t *data, void (*testfunc)(data_t *), int cursor_w, int cursor_h)
{
	prepare_crtc(data, data->output, cursor_w, cursor_h);
	testfunc(data);
	cleanup_crtc(data);
}

From this, I change the code to skip the call testfunc(data) - a function to call a specific subtest, because I wanted to check if the problem was limited to prepare_crtc/cleanup_crtc or would be inside any subtest. I validated that the problem was inside prepare/cleanup operations, since the issue still happened even if no specific subtest was running.

More “freedom”

I checked what was allocated in prepare_crtc to ‘mirror’ it when cleaning. I also took a look at other crc tests to see how they do the setup and cleanup of things. I was partially successful when I did a pre-check before the creation of pipe_crc, releasing pipe_crc if it was not free at that moment.

	/* create the pipe_crc object for this pipe */
	if(data->pipe_crc)
		igt_pipe_crc_free(data->pipe_crc);
	data->pipe_crc = igt_pipe_crc_new(data->drm_fd, data->pipe,
					  INTEL_PIPE_CRC_SOURCE_AUTO);

With this, the sequential execution of subtests passed to alternate between success and failure (note that no subtest was running, only it was preparing, and cleaning). I have also tried to complement the cleanup function, releasing more things, but I didn’t see any improvement.

Waiting forever

I observed that something was generating an infinite busy wait for the debugfs file related to the fd. Then, I delimited this busy waiting to the line “poll(&pfd, 1, -1)” in lib/igt_debugfs.c:igt_pipe_crc_start, the line “poll(&pfd, 1, -1)”. I tested that changing -1 to a finite timeout seemed to solve the problem, but it could not be a solution, since it would affect the behaviour of all IGT tests, not only the test kms_cursor_crc.

Not forever, waiting for VBlank

After I have read documentations and seen other crc related tests in IGT, I looked how is the process of pipe crc creation, initialization, crc collection, stop and free of kms_pipe_crc_basic, and two functions called my attention: igt_pipe_crc_new_nonblock() and igt_wait_for_vblank(). At this point, I remembered a previous talk that I had with Siqueira that the VKMS simulates vblank interrupts. For me, It was valid to think in the possibility that VKMS was busy waiting for this vblank interrupt… therefore, as the busy wait was during the starting of pipe_crc, I added a call for igt_wait_for_vblank() before igt_pipe_crc_start() and then, it seems to solve the problem… but, WHY?

	igt_wait_for_vblank(data->drm_fd, data->pipe);
	igt_pipe_crc_start(data->pipe_crc);

I hope have the answer in a next blog post.

I asked for help to my mentor and took a break to check another problem: Why the subtest alpha-transparent succeed using VKMS after the implementation of XRGB but shows a message of WARNING: Suspicious CRC: All values are 0 ?

Short post today

I got caught up in some exciting synchronization code today and ran out of time to write a post before my brain started to shut down. As a result, let’s talk briefly about synchronization.

What is synchronization?

Synchronization is the idea that application and gpu states match up. When used properly, it means that when a vertex buffer is thrown into a shader, the buffer contains the expected data. When not used properly, nothing works at all.

Zink and synchronization (zinkronization?)

Zink uses the concept of “batches”, aka struct zink_batch to represent Vulkan command buffers. Zink throws a stream of commands into a batch, and eventually they get executed by the driver. A batch doesn’t necessarily terminate with a draw command, and not every batch will even contain a draw command.

Batches in Zink are dispatched (flushed) to the gpu based on render passes. Some commands must be made inside a render pass, and some must be made outside. When that boundary is reached, and the next command must be inside/outside a render pass while the current batch is outside/inside a render pass, the current batch is flushed and the next batch in the circular batch queue becomes current.

However, just because a batch is flushed doesn’t mean it has been executed.

Fences

A fence is an object in Zink (and other places) which provides a notification when a batch has completed. In the case where, e.g., an application wants to read from an in-flight draw buffer, the current batch is flushed, then a fence is waited on to determine when the pending draw commands are complete. This is synchronization for command buffers.

Barriers

Barriers are synchronization for memory access. They allow fine-grained control over exactly when a resource transitions between access states. This avoids data races between e.g., a write operation and a read operation which might be occurring in the same command buffer, but it enables the user of the Vulkan API to be explicit about the usage to maximize performance.

Query synchronization

A big issue with the existing query handling, as was previously discussed relates to synchronization. The queries need to be able to keep the underlying data that they’re fetching from the driver in sync with the application’s idea of what data the query is fetching. This needs to remain consistent across batch flushes and query pool overflows.

More on this to come.

June 12, 2020

The next problem

With xfb defeated, it’s time to move on to the next big issue: improving query handling.

The existing implementation of queries in Zink has a few problems:

  • They’re limited to 100 queries (50 for xfb) before it’s necessary to get the result of a query before hitting an assert()
  • There’s a number of issues related to API correctness, so running in a validator will generate tons of errors (though mostly they’re harmless in the sense that the code still works as expected)

Ideally, queries shouldn’t trigger abort() due to pool depletion, and they shouldn’t trigger validator errors, however, so this needs some work.

Step one: understanding the problems

Let’s check out some of the existing code.

static bool
zink_begin_query(struct pipe_context *pctx,
                 struct pipe_query *q)
{
   struct zink_context *ctx = zink_context(pctx);
   struct zink_query *query = (struct zink_query *)q;

   /* ignore begin_query for timestamps */
   if (query->type == PIPE_QUERY_TIMESTAMP)
      return true;

   /* TODO: resetting on begin isn't ideal, as it forces render-pass exit...
    * should instead reset on creation (if possible?)... Or perhaps maintain
    * the pool in the batch instead?
    */
   struct zink_batch *batch = zink_batch_no_rp(zink_context(pctx));
   vkCmdResetQueryPool(batch->cmdbuf, query->query_pool, 0, MIN2(query->curr_query + 1, query->num_queries));
   query->curr_query = 0;

   begin_query(batch, query);
   list_addtail(&query->active_list, &ctx->active_queries);

   return true;
}

Here’s a function that’s called any time a new query is begun by an application.

Notice here that vkCmdResetQueryPool is being called here rather than alongside vkCreateQueryPool. According to Vulkan spec, each query must be reset before being used (which is why the call is here), but with this usage, any time a query is stopped without its results being returned, those results are lost because the query has been reset.

Also, as noted in the comment, because there’s a reset here, the current renderpass gets flushed in order to comply with Vulkan spec, which requires that reset be called outside of a renderpass. This causes a slowdown, which isn’t optimal.

static void
end_query(struct zink_batch *batch, struct zink_query *q)
{
   assert(q->type != PIPE_QUERY_TIMESTAMP);
   vkCmdEndQuery(batch->cmdbuf, q->query_pool, q->curr_query);
   if (++q->curr_query == q->num_queries) {
      assert(0);
      /* need to reset pool! */
   }
}

This is a function that’s called any time a query is ended (more on the multiple meanings of this in a future post) either internally or by an application.

The idea here is that the query is ended, so the “current” query id is incremented in preparation for the next query, as all queries are tied to an id value. If the incremented query id reaches the size of the query pool, then the code triggers an assert() since vkCmdResetQueryPool needs to be emitted, but that can’t happen without discarding the existing query results (as already happens above).

Step two: devising solutions

Ideally, what needs to happen here is:

  • Handle pool overflows by triggering a reset without discarding existing query results
  • Move reset commands out of zink_begin_query()

This turns out to be significantly more difficult than it sounds due to a number of other constraints in the driver, however. Stay tuned for more query-related problem solving!

June 11, 2020

More Fixes

Yesterday I covered the problems related to handling gl_PointSize during the SPIR-V conversion. But there were still more problems to overcome.

Packed outputs are a thing. This is the case when a variable is partially captured by xfb, e.g., in the case where only the x coordinate is selected from a vec4 output, it will be packed into the buffer as a single float. The original code assumed that all captured output types would match the original type, which is clearly not the case in the previously-described scenario.

A lot went into handling this case. Let’s jump into some of the code.

Critical terminology for this post:

  • SpvId - an id in SPIR-V which represents some other value, as denoted by <id> in the SPIR-V spec

Improved variable creation

The first step here was to figure out what the heck the resulting output type would be. Having remapped the struct pipe_stream_output::register_index values, it’s now possible to check the original variable types and use that when creating the xfb output.

/* return the intended xfb output vec type based on base type and vector size */
static SpvId
get_output_type(struct ntv_context *ctx, unsigned register_index, unsigned num_components)
{
   const struct glsl_type *out_type = ctx->so_output_gl_types[register_index];
   enum glsl_base_type base_type = glsl_get_base_type(out_type);
   if (base_type == GLSL_TYPE_ARRAY)
      base_type = glsl_get_base_type(glsl_without_array(out_type));

   switch (base_type) {
   case GLSL_TYPE_BOOL:
      return get_bvec_type(ctx, num_components);

   case GLSL_TYPE_FLOAT:
      return get_fvec_type(ctx, 32, num_components);

   case GLSL_TYPE_INT:
      return get_ivec_type(ctx, 32, num_components);

   case GLSL_TYPE_UINT:
      return get_uvec_type(ctx, 32, num_components);

   default:
      break;
   }
   unreachable("unknown type");
   return 0;
}

/* for streamout create new outputs, as streamout can be done on individual components,
   from complete outputs, so we just can't use the created packed outputs */
static void
emit_so_info(struct ntv_context *ctx, unsigned max_output_location,
             const struct pipe_stream_output_info *so_info, struct pipe_stream_output_info *local_so_info)
{
   for (unsigned i = 0; i < local_so_info->num_outputs; i++) {
      struct pipe_stream_output so_output = local_so_info->output[i];
      SpvId out_type = get_output_type(ctx, so_output.register_index, so_output.num_components);
      SpvId pointer_type = spirv_builder_type_pointer(&ctx->builder,
                                                      SpvStorageClassOutput,
                                                      out_type);
      SpvId var_id = spirv_builder_emit_var(&ctx->builder, pointer_type,
                                            SpvStorageClassOutput);
      char name[10];

      snprintf(name, 10, "xfb%d", i);
      spirv_builder_emit_name(&ctx->builder, var_id, name);
      spirv_builder_emit_offset(&ctx->builder, var_id, (so_output.dst_offset * 4));
      spirv_builder_emit_xfb_buffer(&ctx->builder, var_id, so_output.output_buffer);
      spirv_builder_emit_xfb_stride(&ctx->builder, var_id, so_info->stride[so_output.output_buffer] * 4);

      /* output location is incremented by VARYING_SLOT_VAR0 for non-builtins in vtn
       */
      uint32_t location = so_info->output[i].register_index;
      spirv_builder_emit_location(&ctx->builder, var_id, location);

      /* note: gl_ClipDistance[4] can the 0-indexed member of VARYING_SLOT_CLIP_DIST1 here,
       * so this is still the 0 component
       */
      if (so_output.start_component)
         spirv_builder_emit_component(&ctx->builder, var_id, so_output.start_component);

      uint32_t *key = ralloc_size(NULL, sizeof(uint32_t));
      *key = (uint32_t)so_output.register_index << 2 | so_output.start_component;
      _mesa_hash_table_insert(ctx->so_outputs, key, (void *)(intptr_t)var_id);

      assert(ctx->num_entry_ifaces < ARRAY_SIZE(ctx->entry_ifaces));
      ctx->entry_ifaces[ctx->num_entry_ifaces++] = var_id;
   }
}

Here’s the slightly changed code for emit_so_info() along with a new helper function. Note that there’s now so_info and local_so_info being passed here: the former is the gallium-produced streamout info, and the latter is the one that’s been re-translated back to enum gl_varying_slot using the code from yesterday’s post.

The remapped value is passed into the helper function, which retrieves the previously-stored struct glsl_type and then returns an SpvId for the necessary xfb output type, which is what’s now used to create the variable.

Improved variable value emission

Now that the variables are correctly created, it’s important to ensure that the correct value is being emitted.

static void
emit_so_outputs(struct ntv_context *ctx,
                const struct pipe_stream_output_info *so_info, struct pipe_stream_output_info *local_so_info)
{
   SpvId loaded_outputs[VARYING_SLOT_MAX] = {};
   for (unsigned i = 0; i < local_so_info->num_outputs; i++) {
      uint32_t components[NIR_MAX_VEC_COMPONENTS];
      struct pipe_stream_output so_output = local_so_info->output[i];
      uint32_t so_key = (uint32_t) so_output.register_index << 2 | so_output.start_component;
      struct hash_entry *he = _mesa_hash_table_search(ctx->so_outputs, &so_key);
      assert(he);
      SpvId so_output_var_id = (SpvId)(intptr_t)he->data;

      SpvId type = get_output_type(ctx, so_output.register_index, so_output.num_components);
      SpvId output = ctx->outputs[so_output.register_index];
      SpvId output_type = ctx->so_output_types[so_output.register_index];
      const struct glsl_type *out_type = ctx->so_output_gl_types[so_output.register_index];

      if (!loaded_outputs[so_output.register_index])
         loaded_outputs[so_output.register_index] = spirv_builder_emit_load(&ctx->builder, output_type, output);
      SpvId src = loaded_outputs[so_output.register_index];

      SpvId result;

      for (unsigned c = 0; c < so_output.num_components; c++) {
         components[c] = so_output.start_component + c;
         /* this is the second half of a 2 * vec4 array */
         if (ctx->stage == MESA_SHADER_VERTEX && so_output.register_index == VARYING_SLOT_CLIP_DIST1)
            components[c] += 4;
      }

Taking a short break here, a lot has changed. There’s now code for getting both the original type of the output as well as the xfb output type, and special handling has been added for gl_ClipDistance, which is potentially a vec8 that’s represented in memory as two vec4 values spanning two separate varying slots.

This takes care of loading the value for the variable corresponding to the xfb output as well as building an array of the components that are going to be emitted in the xfb output.

Now let’s get to the ugly stuff:

      /* if we're emitting a scalar or the type we're emitting matches the output's original type and we're
       * emitting the same number of components, then we can skip any sort of conversion here
       */
      if (glsl_type_is_scalar(out_type) || (type == output_type && glsl_get_length(out_type) == so_output.num_components))
         result = src;
      else {
         if (ctx->stage == MESA_SHADER_VERTEX && so_output.register_index == VARYING_SLOT_POS) {
            /* gl_Position was modified by nir_lower_clip_halfz, so we need to reverse that for streamout here:
             * 
             * opengl gl_Position.z = (vulkan gl_Position.z * 2.0) - vulkan gl_Position.w
             *
             * to do this, we extract the z and w components, perform the multiply and subtract ops, then reinsert
             */
            uint32_t z_component[] = {2};
            uint32_t w_component[] = {3};
            SpvId ftype = spirv_builder_type_float(&ctx->builder, 32);
            SpvId z = spirv_builder_emit_composite_extract(&ctx->builder, ftype, src, z_component, 1);
            SpvId w = spirv_builder_emit_composite_extract(&ctx->builder, ftype, src, w_component, 1);
            SpvId new_z = emit_binop(ctx, SpvOpFMul, ftype, z, spirv_builder_const_float(&ctx->builder, 32, 2.0));
            new_z = emit_binop(ctx, SpvOpFSub, ftype, new_z, w);
            src = spirv_builder_emit_vector_insert(&ctx->builder, type, src, new_z, 2);
         }
         /* OpCompositeExtract can only extract scalars for our use here */
         if (so_output.num_components == 1) {
            result = spirv_builder_emit_composite_extract(&ctx->builder, type, src, components, so_output.num_components);
         } else if (glsl_type_is_vector(out_type)) {
            /* OpVectorShuffle can select vector members into a differently-sized vector */
            result = spirv_builder_emit_vector_shuffle(&ctx->builder, type,
                                                             src, src,
                                                             components, so_output.num_components);
            result = emit_unop(ctx, SpvOpBitcast, type, result);
         } else {
             /* for arrays, we need to manually extract each desired member
              * and re-pack them into the desired output type
              */
             for (unsigned c = 0; c < so_output.num_components; c++) {
                uint32_t member[] = { so_output.start_component + c };
                SpvId base_type = get_glsl_type(ctx, glsl_without_array(out_type));

                if (ctx->stage == MESA_SHADER_VERTEX && so_output.register_index == VARYING_SLOT_CLIP_DIST1)
                   member[0] += 4;
                components[c] = spirv_builder_emit_composite_extract(&ctx->builder, base_type, src, member, 1);
             }
             result = spirv_builder_emit_composite_construct(&ctx->builder, type, components, so_output.num_components);
         }
      }

      spirv_builder_emit_store(&ctx->builder, so_output_var_id, result);
   }
}

There’s five blocks here:

  • Handling for cases of either a scalar value or a vec/array type containing the same total number of components that are being output to xfb
  • Handling for gl_Position, which, as was previously discussed, has already been converted to Vulkan coordinates, and so now the value of this variable needs to be un-converted so the expected value can be read back
  • Handling for the case of extracting a single component from a vector/array, which can be done using a single OpCompositeExtract
  • Handling for extracting a sequence of components from a vector, which is the OpVectorShuffle from the original implementation
  • Handling for extracting a sequence of components from an array, which requires manually extracting each desired array element and then re-assembling the resulting component array into the output

And now…

Well, the tests all pass, so maybe this will be the end of it.

Just kidding.

June 10, 2020

There’s always more tests

Helpfully, mesa has a suite of very demanding unit tests, aka piglit, which work great for finding all manner of issues with drivers. While the code from the previous posts handled a number of tests, it turned out that there were still a ton of failing tests.

What happened?

A number of things. The biggest culprits were:

  • a nir bug related to gl_PointSize; Zink (aka Vulkan) has no native point-size variable, so one has to be injected in order to draw points successfully. The problem here was that the variable was being injected unconditionally, which sometimes resulted in two gl_PointSize outputs, breaking handling of outputs in xfb.
  • another nir bug in which xfb shader variables were being optimized such that packing them into the output buffers correctly was failing.

Next up, issues mapping output values back to the correct xfb buffer in SPIR-V conversion. The problem in this case is that gallium translates struct nir_shader::info.outputs_written (a value comprised of bitflags corresponding to the enum gl_varying_slot outputs) to a 0-indexed value as struct pipe_stream_output::register_index, when what’s actually needed is enum gl_varying_slot, since that’s what passes through the output-emitting function in struct nir_variable::data.location. To fix this, some fixup is done on the local copy of struct pipe_stream_output_info that gets stored onto the struct zink_shader that represents the vertex shader being used:

/* check for a genuine gl_PointSize output vs one from nir_lower_point_size_mov */
static bool
check_psiz(struct nir_shader *s)
{
   nir_foreach_variable(var, &s->outputs) {
      if (var->data.location == VARYING_SLOT_PSIZ) {
         /* genuine PSIZ outputs will have this set */
         return !!var->data.explicit_location;
      }
   }
   return false;
}

/* semi-copied from iris */
static void
update_so_info(struct pipe_stream_output_info *so_info,
               uint64_t outputs_written, bool have_psiz)
{
   uint8_t reverse_map[64] = {};
   unsigned slot = 0;
   while (outputs_written) {
      int bit = u_bit_scan64(&outputs_written);
      /* PSIZ from nir_lower_point_size_mov breaks stream output, so always skip it */
      if (bit == VARYING_SLOT_PSIZ && !have_psiz)
         continue;
      reverse_map[slot++] = bit;
   }

   for (unsigned i = 0; i < so_info->num_outputs; i++) {
      struct pipe_stream_output *output = &so_info->output[i];

      /* Map Gallium's condensed "slots" back to real VARYING_SLOT_* enums */
      output->register_index = reverse_map[output->register_index];
   }
}

In these excerpts from zink_compile_nir(), the shader’s variables are scanned for a genuine gl_PointSize variable originating from the shader instead of NIR, then that knowledge can be applied to skip over the faked PSIZ output when rewriting the register_index values.

But this was only the start

Indeed, considerably more work was required to handle the rest of the tests, as there were failures related to packed output buffers and type conversions. It’s a lot of code to go over, and it merits its own post.

June 09, 2020

Just today it has published a status update of the Vulkan effort for the Raspberry Pi 4, including that we are moving the development of the driver to an open repository. As it is really likely that some people would be interested on testing it, even if it is not complete at all, here you can find a quick guide to compile it, and get some demos running.

Dependencies

So let’s start installing some dependencies. My personal recipe, that I use every time I configure a new machine to work on mesa is the following one (sorry if some extra unneeded dependencies slipped):

sudo apt-get install libxcb-randr0-dev libxrandr-dev \
        libxcb-xinerama0-dev libxinerama-dev libxcursor-dev \
        libxcb-cursor-dev libxkbcommon-dev xutils-dev \
        xutils-dev libpthread-stubs0-dev libpciaccess-dev \
        libffi-dev x11proto-xext-dev libxcb1-dev libxcb-*dev \
        bison flex libssl-dev libgnutls28-dev x11proto-dri2-dev \
        x11proto-dri3-dev libx11-dev libxcb-glx0-dev \
        libx11-xcb-dev libxext-dev libxdamage-dev libxfixes-dev \
        libva-dev x11proto-randr-dev x11proto-present-dev \
        libclc-dev libelf-dev git build-essential mesa-utils \
        libvulkan-dev ninja-build libvulkan1 python-mako \
        libdrm-dev libxshmfence-dev libxxf86vm-dev \
        python3-mako

Most Raspian libraries are recent enough, but they have been updating some of then during the past months, so just in case, don’t forget to update:

$ sudo apt-get update
$ sudo apt-get upgrade

Additionally, you woud need to install meson. Mesa has just recently bumped up the version needed for meson, so Raspbian version is not enough. There is the option to build meson from the tarball (meson-0.52.0 here), but by far, the easier way to get a recent meson version is using pip3:

$ pip3 install meson

2020-07-04 update

It seems that some people had problems if they have installed meson with apt-get on their system, as when building it would try the older meson version first. For those people, they were able to fix that doing this:

$ sudo apt-get remove meson
$ pip3 install --user meson

Download and build v3dv

This is the simpler recipe to build v3dv:

$ git clone https://gitlab.freedesktop.org/apinheiro/mesa.git mesa
$ cd mesa
$ git checkout wip/igalia/v3dv
$ meson --prefix /home/pi/local-install --libdir lib -Dplatforms=x11,drm -Dvulkan-drivers=broadcom -Ddri-drivers= -Dgallium-drivers=v3d,kmsro,vc4 -Dbuildtype=debug _build
$ ninja -C _build
$ ninja -C _build install

This builds and install a debug version of v3dv on a local directory. You could set a release build, or any other directory. The recipe is also building the OpenGL driver, just in case anyone want to compare, but if you are only interested on the vulkan driver, that is not mandatory.

Run some Vulkan demos

Now, the easiest way to ensure that a vulkan program founds the drivers is setting the following envvar:

export VK_ICD_FILENAMES=/home/pi/local-install/share/vulkan/icd.d/broadcom_icd.armv7l.json

That envvar is used by the Vulkan loader (installed as one of the dependencies listed before) to know which library load. This also means that you don’t need to use LD_PRELOAD, LD_LIBRARY_PATH or similar

So what Vulkan programs are working? For example several of the Sascha Willem Vulkan demos. To make things easier to everybody, here another quick recipe of how to get them build:

$ sudo apt-get install libassimp-dev
$ git clone --recursive https://github.com/SaschaWillems/Vulkan.git  sascha-willems
$ cd sascha-willems
$ mkdir build; cd build
$ cmake -DCMAKE_BUILD_TYPE=Debug  ..
$ make

Update 2020-08-03: When the post was originally written, some demos didn’t need to ask for extra assets. Recently the fonts were moved there, so you would need to gather the assests always:

$ cd ..
$ python3 download_assets.py

So in order to see a really familiar demo:

$ cd build/bin
$ ./gears

And one slightly more complex:

$./scenerendering

As mentioned, not all the demos works. But a list of some that we tested and seem to work:
* distancefieldfonts
* descriptorsets
* dynamicuniformbuffer
* gears
* gltfscene
* imgui
* indirectdraw
* occlusionquery
* parallaxmapping
* pbrbasic
* pbribl
* pbrtexture
* pushconstants
* scenerendering
* shadowmapping
* shadowmappingcascade
* specializationconstants
* sphericalenvmapping
* stencilbuffer
* textoverlay
* texture
* texture3d
* texturecubemap
* triangle
* vulkanscene

Update : rpiMike on the comments, and some people privately, have pointed some errors on the post. Thanks! And sorry for the inconvenience.

Update 2 : Mike Hooper pointed more issues on gitlab

What are extensions?

Extensions are non-core parts of a spec which provide additional features. xfb is an extension in Vulkan. In order to use it, a number of small changes are required.

What are features?

Features in Vulkan inform an application (or Zink) what the underlying driver supports. In order to enable and use an extension, both the extension and feature must be enabled.

Step 1: extension detection

To begin, let’s check out the significant parts of zink_screen.c that got modified, all in zink_internal_create_screen(), which creates a struct zink_screen* object, a subclass of struct pipe_screen*.

VkExtensionProperties *extensions;
vkEnumerateDeviceExtensionProperties(screen->pdev, NULL,
                                     &num_extensions, extensions);

for (uint32_t  i = 0; i < num_extensions; ++i) {
   if (!strcmp(extensions[i].extensionName,
               VK_EXT_TRANSFORM_FEEDBACK_EXTENSION_NAME))
      have_tf_ext = true;

}

There’s already some code doing this in Zink, so it’s a simple case of plugging in another strcmp to check for VK_EXT_TRANSFORM_FEEDBACK_EXTENSION_NAME.

Step 2: feature detection

VkPhysicalDeviceFeatures2 feats = {};
VkPhysicalDeviceTransformFeedbackFeaturesEXT tf_feats = {};

feats.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FEATURES_2;
if (have_tf_ext) {
   tf_feats.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_TRANSFORM_FEEDBACK_FEATURES_EXT;
   tf_feats.pNext = feats.pNext;
   feats.pNext = &tf_feats;
}
vkGetPhysicalDeviceFeatures2(screen->pdev, &feats);

Again, there’s already some code in Zink to handle feature detection, so this just requires plugging in the xfb feature parts.

Step 3: property checking

In addition to extensions and features, there’s also properties, which provide things like device-specific limits for various capabilities that need to be checked in order to avoid requesting resources that the gpu hardware can’t provide.

if (have_tf_ext && tf_feats.transformFeedback)
   screen->have_EXT_transform_feedback = true;

VkPhysicalDeviceProperties2 props = {};
props.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
if (screen->have_EXT_transform_feedback) {
   screen->tf_props.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_TRANSFORM_FEEDBACK_PROPERTIES_EXT;
   screen->tf_props.pNext = NULL;
   props.pNext = &screen->tf_props;
}
vkGetPhysicalDeviceProperties2(screen->pdev, &props);

This was another easy plug-in, since it’s just the xfb parts that need to be added.

Finishing touches

That’s more or less it. The xfb extension name needs to get added into VkDeviceCreateInfo::ppEnabledExtensionNames when it’s passed to vkCreateDevice() a bit later, but xfb is now fully activated if the driver supports it.

Lastly, the relevant enum pipe_cap members need to be handled in zink_get_param() so that gallium recognizes the newly-activated capabilities:

case PIPE_CAP_MAX_STREAM_OUTPUT_BUFFERS:
   return screen->have_EXT_transform_feedback ? screen->tf_props.maxTransformFeedbackBuffers : 0;
case PIPE_CAP_STREAM_OUTPUT_PAUSE_RESUME:
case PIPE_CAP_STREAM_OUTPUT_INTERLEAVE_BUFFERS:
   return 1;

And that’s it

Everything worked perfectly on the first try, and there were absolutely no issues whatsoever, because working on mesa is just that easy.

June 08, 2020

Queries: What are they?

A query in mesa is where an application asks for info from the underlying driver. There’s a number of APIs related to this, but right now only the xfb related ones matter. Specifically, GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN and GL_PRIMITIVES_GENERATED.

  • GL_PRIMITIVES_GENERATED is a query that, when active, tracks the total number of primitives generated.
  • GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN is a query that, when active, tracks the number of primitives written.

The difference between the two is that not all primitives generated are written, as some may be duplicates that are culled. In Vulkan, these translate to:

  • GL_PRIMITIVES_GENERATED = VK_QUERY_PIPELINE_STATISTIC_INPUT_ASSEMBLY_PRIMITIVES_BIT
  • GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN = VK_QUERY_TYPE_TRANSFORM_FEEDBACK_STREAM_EXT

It’s important to note that these are very different enum namespaces; one is for pipeline statistics (VK_QUERY_TYPE_PIPELINE_STATISTICS), and one is a regular query type. Also, GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN will be internally translated to PIPE_QUERY_PRIMITIVES_EMITTED, which has a very different name, so that’s good to keep in mind as well.

What does this look like?

Let’s check out some of the important bits. In Zink, starting a query now ends up looking like this, where we use an extension method for starting a query for the xfb query type and the regular vkCmdBeginQuery for all other queries.

if (q->vkqtype == VK_QUERY_TYPE_TRANSFORM_FEEDBACK_STREAM_EXT)
   zink_screen(ctx->base.screen)->vk_CmdBeginQueryIndexedEXT(batch->cmdbuf,
                                                             q->query_pool,
                                                             q->curr_query,
                                                             flags,
                                                             q->index);
else
   vkCmdBeginQuery(batch->cmdbuf, q->query_pool, q->curr_query, flags);

Stopping queries is now very similar:

if (q->vkqtype == VK_QUERY_TYPE_TRANSFORM_FEEDBACK_STREAM_EXT)
   screen->vk_CmdEndQueryIndexedEXT(batch->cmdbuf, q->query_pool, q->curr_query, q->index);
else
   vkCmdEndQuery(batch->cmdbuf, q->query_pool, q->curr_query);

Then finally we have the parts for fetching the returned query data:

int num_results;
if (query->vkqtype == VK_QUERY_TYPE_TRANSFORM_FEEDBACK_STREAM_EXT) {
   /* this query emits 2 values */
   assert(query->curr_query <= ARRAY_SIZE(results) / 2);
   num_results = query->curr_query * 2;
   VkResult status = vkGetQueryPoolResults(screen->dev, query->query_pool,
                                           0, query->curr_query,
                                           sizeof(results),
                                           results,
                                           sizeof(uint64_t),
                                           flags);
   if (status != VK_SUCCESS)
      return false;
} else {
   assert(query->curr_query <= ARRAY_SIZE(results));
   num_results = query->curr_query;
   VkResult status = vkGetQueryPoolResults(screen->dev, query->query_pool,
                                           0, query->curr_query,
                                           sizeof(results),
                                           results,
                                           sizeof(uint64_t),
                                           flags);
   if (status != VK_SUCCESS)
      return false;
}

In this block, there’s once again a split between VK_QUERY_TYPE_TRANSFORM_FEEDBACK_STREAM_EXT results and all other query results. In this case, however, the reason is that these queries return two result values, which means the buffer is treated a bit differently.

for (int i = 0; i < num_results; ++i) {
   switch (query->type) {
   case PIPE_QUERY_PRIMITIVES_GENERATED:
      result->u32 += results[i];
      break;
   case PIPE_QUERY_PRIMITIVES_EMITTED:
      /* A query pool created with this type will capture 2 integers -
       * numPrimitivesWritten and numPrimitivesNeeded -
       * for the specified vertex stream output from the last vertex processing stage.
       * - from VK_EXT_transform_feedback spec
       */
      result->u64 += results[i];
      i++;
      break;
   }
}

This is where the returned results get looped over. PIPE_QUERY_PRIMITIVES_GENERATED returns a single 32-bit uint, which is added to the result struct using the appropriate member. The xfb query is now PIPE_QUERY_PRIMITIVES_EMITTED in internal types, and it returns a sequence of two 64-bit uints: numPrimitivesWritten and numPrimitivesNeeded. Zink only needs the first value, so it gets added into the struct.

It’s that simple

Compared to the other parts of implementing xfb, this was very simple and straightforward. More or less just translating the GL enum values to VK and mesa, then letting it run its course.

June 05, 2020

Quick Runthrough

This xfb blog series has gone on for a while now, and it’d be great if it ended soon. Unfortunately, there’s a lot of corner cases which are being found by piglit, and the work and fixing continue.

Today let’s look at some of the drawing code for xfb, since that’s probably not going to be changing much in the course of fixing those corner cases.

static void
zink_emit_stream_output_targets(struct pipe_context *pctx)
{
   struct zink_context *ctx = zink_context(pctx);
   struct zink_screen *screen = zink_screen(pctx->screen);
   struct zink_batch *batch = zink_curr_batch(ctx);
   VkBuffer buffers[PIPE_MAX_SO_OUTPUTS];
   VkDeviceSize buffer_offsets[PIPE_MAX_SO_OUTPUTS];
   VkDeviceSize buffer_sizes[PIPE_MAX_SO_OUTPUTS];

   for (unsigned i = 0; i < ctx->num_so_targets; i++) {
      struct zink_so_target *t = (struct zink_so_target *)ctx->so_targets[i];
      buffers[i] = zink_resource(t->base.buffer)->buffer;
      buffer_offsets[i] = t->base.buffer_offset;
      buffer_sizes[i] = t->base.buffer_size;
   }

   screen->vk_CmdBindTransformFeedbackBuffersEXT(batch->cmdbuf, 0, ctx->num_so_targets,
                                                 buffers, buffer_offsets,
                                                 buffer_sizes);
   ctx->dirty_so_targets = false;
}

This is a function called from zink_draw_vbo(), which is the struct pipe_context::draw_vbo hook for drawing primitives. Here, the streamout target buffers are bound in Vulkan in preparation for the upcoming draw, passing along related info into the command buffer.

if (ctx->xfb_barrier) {
   /* Between the pause and resume there needs to be a memory barrier for the counter buffers
    * with a source access of VK_ACCESS_TRANSFORM_FEEDBACK_COUNTER_WRITE_BIT_EXT
    * at pipeline stage VK_PIPELINE_STAGE_TRANSFORM_FEEDBACK_BIT_EXT
    * to a destination access of VK_ACCESS_TRANSFORM_FEEDBACK_COUNTER_READ_BIT_EXT
    * at pipeline stage VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT.
    *
    * - from VK_EXT_transform_feedback spec
    */
   VkBufferMemoryBarrier barriers[PIPE_MAX_SO_OUTPUTS] = {};
   unsigned barrier_count = 0;
   for (unsigned i = 0; i < ctx->num_so_targets; i++) {
      struct zink_so_target *t = zink_so_target(ctx->so_targets[i]);
      if (t->counter_buffer_valid) {
          barriers[i].sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
          barriers[i].srcAccessMask = VK_ACCESS_TRANSFORM_FEEDBACK_COUNTER_WRITE_BIT_EXT;
          barriers[i].dstAccessMask = VK_ACCESS_TRANSFORM_FEEDBACK_COUNTER_READ_BIT_EXT;
          barriers[i].buffer = zink_resource(t->counter_buffer)->buffer;
          barriers[i].size = VK_WHOLE_SIZE;
          barrier_count++;
      }
   }
   batch = zink_batch_no_rp(ctx);
   vkCmdPipelineBarrier(batch->cmdbuf,
      VK_PIPELINE_STAGE_TRANSFORM_FEEDBACK_BIT_EXT,
      VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT,
      0,
      0, NULL,
      barrier_count, barriers,
      0, NULL
   );
   ctx->xfb_barrier = false;
}
if (ctx->dirty_so_targets)
   zink_emit_stream_output_targets(pctx);
if (so_target && so_target->needs_barrier) {
   /* A pipeline barrier is required between using the buffers as
    * transform feedback buffers and vertex buffers to
    * ensure all writes to the transform feedback buffers are visible
    * when the data is read as vertex attributes.
    * The source access is VK_ACCESS_TRANSFORM_FEEDBACK_WRITE_BIT_EXT
    * and the destination access is VK_ACCESS_VERTEX_ATTRIBUTE_READ_BIT
    * for the pipeline stages VK_PIPELINE_STAGE_TRANSFORM_FEEDBACK_BIT_EXT
    * and VK_PIPELINE_STAGE_VERTEX_INPUT_BIT respectively.
    *
    * - 20.3.1. Drawing Transform Feedback
    */
   VkBufferMemoryBarrier barriers[1] = {};
   if (so_target->counter_buffer_valid) {
       barriers[0].sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
       barriers[0].srcAccessMask = VK_ACCESS_TRANSFORM_FEEDBACK_COUNTER_WRITE_BIT_EXT;
       barriers[0].dstAccessMask = VK_ACCESS_VERTEX_ATTRIBUTE_READ_BIT;
       barriers[0].buffer = zink_resource(so_target->base.buffer)->buffer;
       barriers[0].size = VK_WHOLE_SIZE;
   }
   batch = zink_batch_no_rp(ctx);
   zink_batch_reference_resoure(batch, zink_resource(so_target->base.buffer));
   vkCmdPipelineBarrier(batch->cmdbuf,
      VK_PIPELINE_STAGE_TRANSFORM_FEEDBACK_BIT_EXT,
      VK_PIPELINE_STAGE_VERTEX_INPUT_BIT,
      0,
      0, NULL,
      ARRAY_SIZE(barriers), barriers,
      0, NULL
   );
   so_target->needs_barrier = false;
}

This is a block added to zink_draw_vbo() for synchronization of the xfb buffers. The counter buffer needs a specific type of barrier according to the spec, and the streamout target buffer needs a different type of barrier. These need to be emitted outside of a render pass, so zink_batch_no_rp() is used to get a batch that isn’t currently in a render pass (ending the active batch if necessary to switch to a new one). Without these, vk-layers will output tons of errors and also probably your stream output will be broken.

   if (ctx->num_so_targets) {
      for (unsigned i = 0; i < ctx->num_so_targets; i++) {
         struct zink_so_target *t = zink_so_target(ctx->so_targets[i]);
         if (t->counter_buffer_valid) {
            zink_batch_reference_resoure(batch, zink_resource(t->counter_buffer));
            counter_buffers[i] = zink_resource(t->counter_buffer)->buffer;
            counter_buffer_offsets[i] = t->counter_buffer_offset;
         } else
            counter_buffers[i] = NULL;
         t->needs_barrier = true;
      }
      screen->vk_CmdBeginTransformFeedbackEXT(batch->cmdbuf, 0, ctx->num_so_targets, counter_buffers, counter_buffer_offsets);
   }

/* existing code */
   if (dinfo->index_size > 0) {
      assert(dinfo->index_size != 1);
      VkIndexType index_type = dinfo->index_size == 2 ? VK_INDEX_TYPE_UINT16 : VK_INDEX_TYPE_UINT32;
      struct zink_resource *res = zink_resource(index_buffer);
      vkCmdBindIndexBuffer(batch->cmdbuf, res->buffer, index_offset, index_type);
      zink_batch_reference_resoure(batch, res);
      vkCmdDrawIndexed(batch->cmdbuf,
         dinfo->count, dinfo->instance_count,
         dinfo->start, dinfo->index_bias, dinfo->start_instance);
   } else {
/* new code */
      if (so_target && screen->tf_props.transformFeedbackDraw) {
         zink_batch_reference_resoure(batch, zink_resource(so_target->counter_buffer));
         screen->vk_CmdDrawIndirectByteCountEXT(batch->cmdbuf, dinfo->instance_count, dinfo->start_instance,
                                       zink_resource(so_target->counter_buffer)->buffer, so_target->counter_buffer_offset, 0,
                                       MIN2(so_target->stride, screen->tf_props.maxTransformFeedbackBufferDataStride));
      }
      else
         vkCmdDraw(batch->cmdbuf, dinfo->count, dinfo->instance_count, dinfo->start, dinfo->start_instance);
   }

   if (dinfo->index_size > 0 && dinfo->has_user_indices)
      pipe_resource_reference(&index_buffer, NULL);

   if (ctx->num_so_targets) {
      for (unsigned i = 0; i < ctx->num_so_targets; i++) {
         struct zink_so_target *t = zink_so_target(ctx->so_targets[i]);
         counter_buffers[i] = zink_resource(t->counter_buffer)->buffer;
         counter_buffer_offsets[i] = t->counter_buffer_offset;
         t->counter_buffer_valid = true;
      }
      screen->vk_CmdEndTransformFeedbackEXT(batch->cmdbuf, 0, ctx->num_so_targets, counter_buffers, counter_buffer_offsets);
   }

Excluding a small block that I’ve added a comment for, this is pretty much all added for handling xfb draws. This includes the begin/end calls for xfb and outputting to the counter buffers for each streamout target, and the actual vkCmdDrawIndirectByteCountEXT call for drawing transform feedback when appropriate.

The begin/end calls handle managing the buffer states to work with glPauseTransformFeedback and glResumeTransformFeedback. When resuming, the counter buffer offset is used to track the state and continue with the buffers from the correct location in memory.

Next time

We’ll look at xfb queries and extension/feature enabling, and I’ll start to get into a bit more detail about how more of this stuff works.

June 04, 2020

To Begin…

Stream output is another name for xfb closer to the driver level. Inside mesa (and gallium), we’ll commonly see different types related to struct pipe_stream_output_* which provide info about how the xfb info is output to its corresponding buffers.

Here’s some blocks from the Zink implementation, referenced heavily from the original work by David Airlie:

static struct pipe_stream_output_target *
zink_create_stream_output_target(struct pipe_context *pctx,
                                 struct pipe_resource *pres,
                                 unsigned buffer_offset,
                                 unsigned buffer_size)
{
   struct zink_so_target *t;
   t = CALLOC_STRUCT(zink_so_target);
   if (!t)
      return NULL;

   t->base.reference.count = 1;
   t->base.context = pctx;
   pipe_resource_reference(&t->base.buffer, pres);
   t->base.buffer_offset = buffer_offset;
   t->base.buffer_size = buffer_size;

   /* using PIPE_BIND_CUSTOM here lets us create a custom pipe buffer resource,
    * which allows us to differentiate and use VK_BUFFER_USAGE_TRANSFORM_FEEDBACK_COUNTER_BUFFER_BIT_EXT
    * as we must for this case
    */
   t->counter_buffer = pipe_buffer_create(pctx->screen, PIPE_BIND_STREAM_OUTPUT | PIPE_BIND_CUSTOM, PIPE_USAGE_DEFAULT,
4);
   if (!t->counter_buffer) {
      FREE(t);
      return NULL;
   }

   return &t->base;
}

Here we have the create_stream_output_target hook for our struct pipe_context, which is called any time we’re initializing stream output. This takes the previously-created stream output buffer (more on buffer creation in a future post) along with offset and size parameters for the buffer. It’s necessary to create a counter buffer here so that we can use it to correctly save and restore states if xfb is paused or resumed (glPauseTransformFeedback and glResumeTransformFeedback).

static void
zink_stream_output_target_destroy(struct pipe_context *pctx,
                                  struct pipe_stream_output_target *psot)
{
   struct zink_so_target *t = (struct zink_so_target *)psot;
   pipe_resource_reference(&t->counter_buffer, NULL);
   pipe_resource_reference(&t->base.buffer, NULL);
   FREE(t);
}

A simple destructor hook for the previously-created object. struct pipe_resource buffers are refcounted, so we always need to make sure to properly unref them here; the first parameter of pipe_resource_reference can be thought of as the resource to unref, and the second is the resource to ref.

static void
zink_set_stream_output_targets(struct pipe_context *pctx,
                               unsigned num_targets,
                               struct pipe_stream_output_target **targets,
                               const unsigned *offsets)
{
   struct zink_context *ctx = zink_context(pctx);

   if (num_targets == 0) {
      for (unsigned i = 0; i < ctx->num_so_targets; i++)
         pipe_so_target_reference(&ctx->so_targets[i], NULL);
      ctx->num_so_targets = 0;
   } else {
      for (unsigned i = 0; i < num_targets; i++)
         pipe_so_target_reference(&ctx->so_targets[i], targets[i]);
      for (unsigned i = num_targets; i < ctx->num_so_targets; i++)
         pipe_so_target_reference(&ctx->so_targets[i], NULL);
      ctx->num_so_targets = num_targets;

      /* emit memory barrier on next draw for synchronization */
      if (offsets[0] == (unsigned)-1)
         ctx->xfb_barrier = true;
      /* TODO: possibly avoid rebinding on resume if resuming from same buffers? */
      ctx->dirty_so_targets = true;
   }
}

This is the hook that gets called any time xfb is started or stopped, whether from glBeginTransformFeedback / glEndTransformFeedback or glPauseTransformFeedback / glResumeTransformFeedback. It’s used to pass the active stream output buffers to the driver context in preparation for upcoming draw calls. We store and ref++ the buffers on activate, when targets is non-NULL, and then we ref– and unset the buffers on deactivate, when targets is non-NULL. On glResumeTransformFeedback, offsets is -1 (technically UINT_MAX from underflow).

According to Vulkan spec, we need to emit a different memory barrier for synchronization if we’re resuming xfb, so a flag is set in that case. Ideally some work can be done here to optimize the case of pausing and resuming the same output buffer set to avoid needing to do additional synchronization later on in the draw, but for now, slow and steady wins the race.

June 03, 2020

In this post, I describe the steps involved in the execution of a kms_cursor_crc subtest. In my approach, I chose a subtest (pipe-A-cursor-alpha-transparent) as a target and examined the code from the beginning of the test (igt main) until reaching the target subtest and executing it.

This is my zero version. I plan to incrementally expand this document with evaluation/description of the other subtests. I will probably also need to fix some misunderstandings.

As described by IGT, kms_cursor_crc

Uses the display CRC support to validate cursor plane functionality. The test will position the cursor plane either fully onscreen, partially onscreen, or fully offscreen, using either a fully opaque or fully transparent surface. In each case it then reads the PF CRC and compares it with the CRC value obtained when the cursor plane was disabled.

In the past, Haneen have shown something about the test in a blog post. Fixing any issue in VKMS to make it passes all kms_cursor_crc subtest is also a case in my GSoC project proposal.

The struct data_t

This struct is used in all subtest stores many elements such as DRM file descriptor; framebuffers info; cursor size info; etc. Also, before the main of the test, a static data_t data is declared global.

The beginning - igt_main

We can divide the main function into two parts: setup DRM stuff (igt_fixture) and subtest execution.

Initial test setup

igt_fixture {
	data.drm_fd = drm_open_driver_master(DRIVER_ANY);
	ret = drmGetCap(data.drm_fd, DRM_CAP_CURSOR_WIDTH, &cursor_width);
	igt_assert(ret == 0 || errno == EINVAL);
	/* Not making use of cursor_height since it is same as width, still reading */
	ret = drmGetCap(data.drm_fd, DRM_CAP_CURSOR_HEIGHT, &cursor_height);
	igt_assert(ret == 0 || errno == EINVAL);
		

drmGetCap(int fd, uint64_t capability, uint64_t * value) queries capability of the DRM driver, return 0 if the capability is supported. DRM_CAP_CURSOR_WIDTH and DRM_CAP_CURSOR_HEIGHT store a valid width/height for the hardware cursor.

/* We assume width and height are same so max is assigned width */
igt_assert_eq(cursor_width, cursor_height);

	kmstest_set_vt_graphics_mode();

void kmstest_set_vt_graphics_mode(void)

From lib/igt_kms.c: This function sets the controlling virtual terminal (VT) into graphics/raw mode and installs an igt exit handler to set the VT back to text mode on exit. All kms tests must call this function to make sure that the framebuffer console doesn’t interfere by e.g. blanking the screen.

    igt_require_pipe_crc(data.drm_fd);

void igt_require_pipe_crc(int fd)

From lib/igt_debugfs: checks whether pipe CRC capturing is supported by the kernel. Uses igt_skip to automatically skip the test/subtest if this isn’t the case.

    igt_display_require(&data.display, data.drm_fd);

void igt_display_require(igt_display_t *display, int drm_fd)

From lib/igt_kms.c: Initializes \@display (a pointer to an #igt_display_t structure) and allocates the various resources required. This function automatically skips if the kernel driver doesn’t support any CRTC or outputs.

Run test on pipe

for_each_pipe_static(pipe)
	igt_subtest_group
		run_tests_on_pipe(&data, pipe);

At this part, each subtest of kms_cursor_crc is runned, where the pointer for the already setup struct data_t are passed.

static void run_tests_on_pipe(data_t *data, enum pipe pipe)

This function runs each subtest grouped by pipe. In the setup, it increments the passed data_t struct, and then starts to call each subtest. In this document version, I only focused on the subtest test_cursor_transparent .

igt_subtest_f("pipe-%s-cursor-alpha-transparent", kmstest_pipe_name(pipe))
	run_test(data, test_cursor_transparent, data->cursor_max_w, data->cursor_max_h);

The execution of test_cursor_transparent

static void run_test(data_t *data, void (*testfunc)(data_t *), int cursor_w, int cursor_h)

The function run_test wraps the habitual preparation for running a subtest and also, after then, a cleanup. Therefore, it basically has three steps:

  1. Prepare CRTC: static void prepare_crtc(data_t data, igt_output_toutput, int cursor_w, int cursor_h) This function is responsible for:
    • Select the pipe to be used
    • Create Front and Restore framebuffer of primary plane
    • Find a valid plane type for primary plane and cursor plane
    • Pairs primary framebuffer to its plane and sets a default size
    • Create a new pipe CRC
    • Position the cursor fully visible
    • Store test image as cairo surface
    • Start CRC capture process
  2. Run subtest: testfunc(data) » static void test_cursor_transparent(data_t *data) » test_cursor_alpha(data, 0.0)

    The subtest_cursor_transparent is a variation of test_cursor_alpha where the alpha channel is set zero (or transparent). So, let’s take a look at test_cursor_alpha execution:

static void test_cursor_alpha(data_t *data, double a)
{
	igt_display_t *display = &data->display;
	igt_pipe_crc_t *pipe_crc = data->pipe_crc;
	igt_crc_t crc, ref_crc;
	cairo_t *cr;
	uint32_t fb_id;
	int curw = data->curw;
	int curh = data->curh;
	/*alpha cursor fb*/
	fb_id = igt_create_fb(data->drm_fd, curw, curh,
				    DRM_FORMAT_ARGB8888,
				    LOCAL_DRM_FORMAT_MOD_NONE,
				    &data->fb);

When this subtest starts, it creates the cursor’s framebuffer with the format ARGB8888 , i.e., a framebuffer with RGB plus Alpha channel (pay attention to endianness)

	igt_assert(fb_id);
	cr = igt_get_cairo_ctx(data->drm_fd, &data->fb);
	igt_paint_color_alpha(cr, 0, 0, curw, curh, 1.0, 1.0, 1.0, a);
	igt_put_cairo_ctx(data->drm_fd, &data->fb, cr);

Then, the test uses some Cairo resources to create a cairo surface for the cursor’s framebuffer and allocate a drawing context for it, draw a rectangle with RGB white and the given opacity (alpha channel) and, finally, release the cairo surface and write the changes out to the framebuffer (Disclaimed: looking inside the function igt_put_cairo_ctx, I am not sure if it is doing what it is saying on comments, and also not sure if all the parameters are necessary)

The test is divided into two parts: Hardware test and Software test.

	/*Hardware Test*/
	cursor_enable(data);
	igt_display_commit(display);
	igt_wait_for_vblank(data->drm_fd, data->pipe);
	igt_pipe_crc_get_current(data->drm_fd, pipe_crc, &crc);
	cursor_disable(data);
	igt_remove_fb(data->drm_fd, &data->fb);

The hardware test consists in:

  • Enable cursor: Pair the cursor plane and a framebuffer, set the cursor size for its plane, set the cursor plane size for framebuffer. Commit framebuffer changes to all changes of each display pipe
  • Wait for vblank. Vblank is a couple of extra scanlines region which aren’t actually displayed on the screen designed to give the electron gun (on CRTs) enough time to move back to the top of the screen to start scanning out the next frame. A vblank interrupt is used to notify the driver when it can start the updating of registers. To achieve tear-free display, users must synchronize page flips and/or rendering to vertical blanking [https://dri.freedesktop.org/docs/drm/gpu/drm-kms.html#vertical-blanking]
  • Calculate and check the current CRC
	cr = igt_get_cairo_ctx(data->drm_fd, &data->primary_fb[FRONTBUFFER]);
	igt_paint_color_alpha(cr, 0, 0, curw, curh, 1.0, 1.0, 1.0, a);
	igt_put_cairo_ctx(data->drm_fd, &data->primary_fb[FRONTBUFFER], cr);
	igt_display_commit(display);
	igt_wait_for_vblank(data->drm_fd, data->pipe);
	igt_pipe_crc_get_current(data->drm_fd, pipe_crc, &ref_crc);
	igt_assert_crc_equal(&crc, &ref_crc);

The software test consists in:

  • Create a cairo surface and drawing context, draw a rectangle on top left side with RGB white and the given opacity (alpha channel) and then release de cairo context, applying changes to framebuffer. Then, commit framebuffer changes to all changes of each display pipe.
  • Wait for vblank
  • Calculate and check the current CRC

Finally, the screen is clean.

And in the level of run_test, crtc(data) is clean up;

This is what introspection looks like

The libqmi and libmbim libraries are every day getting more popular to control your QMI or MBIM based devices. One of the things I’ve noticed, though, is that lots of users are writing applications in e.g. Python but then running qmicli or mbimcli commands, and parsing the outputs. This approach may work, but there is absolutely no guarantee that the format of the output printed by the command line programs will be kept stable across new releases. And also, the way these operations are performed may be suboptimal (e.g. allocating QMI clients for each operation, instead of reusing them).

Since the new stable libqmi 1.26 and libmbim 1.24 releases, these libraries integrate GObject Introspection support for all their types, and that provides a much better integration within Python applications (or really, any other language supported by GObject Introspection).

The only drawback of using the libraries in this way, if you’re already using and parsing command line interface commands, is that you would need to go deep into how the protocol works in order to use them.

For example, in order to run a DMS Get Capabilities operation, you would need to create a Qmi.Device first, open it, then allocate a Qmi.Client for the DMS service, then run the Qmi.Client.get_capabilities() operation, receive and process the response with Qmi.Client.get_capabilities_finish(), and parse the result with the per-TLV reader method, e.g. output.get_info() to process the Info TLV. Once the client is no longer needed, it would need to be explicitly released before exiting. A full example doing just this is provided in the libqmi sources.

In the case of MBIM operations, there is no need for the extra step of allocating a per-service client, but instead, the user should be more aware of the actual types of messages being transferred, in order to use the correct parsing operations. For example, in order to query device capabilities you would need to create a Mbim.Device first, open it, create a message to query the device capabilities, send the message, receive and process the response, check whether an error is reported, and if it isn’t, fully parse it. A full example doing just this is provided in the libmbim sources.

Of course, all these low level operations can also be done through the qmi-proxy or mbim-proxy, so that ModemManager or other programs can be running at the same time, all sharing access to the same QMI or MBIM ports.

P.S.: not a true Python or GObject Introspection expert here, so please report any issue found or improvements that could be done 😀

And special thanks to Vladimir Podshivalov who is the one that started the hard work of setting everything up in libqmi. Thank you!

Enjoy!

June 02, 2020

As the GSoC coding time officially started this week (01/06/2020), this post is a line between my activities in the period of community bonding and the official development of my project. I used the last month to solve issues in my development environment, improve the tool that supports my development activity (kworflow), and study concepts and implementations related to my GSoC project.

Communication

Besides e-mail, IRC chat, and Telegram, my mentor (Siqueira) and I are meeting every Wednesday on Jitsi, where we also use tmate for terminal sharing. We also use, together with Trevor, a spreadsheet to schedule tasks, report my daily activity, and write any suggestions.

Issues in my development environment

Probably you have already had this kind of pending task: a problem that trouble your work but you can go ahead with it, therefore you postpone until the total tiredness. So, I have some small issues on my vm: from auth to update kernel process that I was postpone forever. The first I solved with some direct setup and the last led me for the second task: improve kworkflow.

Improvements for a set of scripts that facilitate my development day-to-day

The lack of support on kworkflow for deployment on vm drove me to start hacking the code. As updating vm needs understanding things more complex, I started developing soft changes (or less integrated with the scripts structure). This work is still in progress, and after discuss with others kworkflow developers (on the issue, on IRC and on voice meetings), the proposals of changes were refined.

The first phase of my project proposal orbits the IGT test kms_cursor_crc. I have already a preliminar experience with the test code; however, I lack a lot of knowledge of each steps and concepts behing the implementation. Bearing this in mind, I used part of the community bonding time to immerse myself in this code.

Checking issues in a patchset that was reported by IGT CI

My first project task is to find out why it is not possible to access debugfs files when running kms_cursor_crc (and fix it). Two things could help me solve it: learning about debugfs and dissecting kms_cursor_crc. To guide my studies, my mentor suggested taking a look at a patchset for the IGT write-back test implementation that CI reported a crash on debugfs_test for i915. For this investigation, I installed on another machine (an old netbook) a Debian without a graphical environment, and, accessing via ssh, I applied the patches and ran the test. Well, everything seemed to work (and the subtests passed). Perhaps something has been fixed or changed in IGT since the patchset was sent. Nothing more to do here.


IGT_FORCE_DRIVER=i915 build/tests/debugfs_test 
IGT-Version: 1.25-gf1884574 (x86_64) (Linux: 4.19.0-9-amd64 x86_64)
Force option used: Using driver i915
Force option used: Using driver i915
Starting subtest: sysfs
Subtest sysfs: SUCCESS (0,009s)
Starting subtest: read_all_entries
Subtest read_all_entries: SUCCESS (0,094s)
Starting subtest: read_all_entries_display_on
Subtest read_all_entries_display_on: SUCCESS (0,144s)
Starting subtest: read_all_entries_display_off
Subtest read_all_entries_display_off: SUCCESS (0,316s)

Diving into the kms_cursor_crc test

I’m writing a kind of anatomy from the kms_cursor_crc test. I chose the alpha-transparent subtest as a target and then followed each step necessary to achieve it, understanding each function called, parameters, and also abstractions. I am often confused by something that once seemed clear. Well, it comes to graphic interface stuff and is acceptable that theses abstraction will disorientate me LOL I guess… The result of this work will be my next post. In the meantime, here are links that helped me on this journey

May 28, 2020

So this is a blog post not related to Fedora or Red Hat, but rather my personal experience with getting a robo vacuum and robo mop into the house.

So about two Months ago my wife and I decided to get a Robo vacuum while shopping at Costco (a US wholesaler outfit). So we brought home the iRobot Roomba 980. Over the next week we ended up also getting the newer iRobot Roomba i7+ and the iRobot Braava m6 mopping robot. Our dream was that we would never have to vacuum or mop again, instead leaving that to our new robots to handle. With two little kids being able to cut that work from our todo list seemed like a dream come through.

I feel that whenever you get into a new technology it takes some time with your first product in that category to understand what questions to ask and what considerations to make. For instance I feel a lot of more informed and confident in my knowledge about electric cars having owned a Nissan Leaf for a few years now (enough to wish I had a Tesla instead for instance :). I guess our experience with robot vacuums here is similar.

Anyway, if you are considering buying a Robot vacuum or mop I think the first lesson we learned is that it is definitely not a magic solution. You have to prepare your house quite a bit before each run, including obvious things like tidying up anything on the floor like the kids legos etc., to discovering that certain furniture, like the IKEA Poang chairs are mortal enemies with your robo vacuum. We had to put our chair on top of the sofa as the Roomba would get stuck on it every time we tried to vacuum the floor. Also the door mat in front of our entrance door kept having its corners sucked into the vacuum getting it stuck. Anyway, our lesson learned is that vacuuming (or mopping) is not something we can do on an impulse or easily on a schedule, as it takes quite a bit of preparation. If you don’t have small kid leaving random stuff all over the house all the time you might be able to just set the vacuum on a schedule, but for us that has turned out to be a big no :). So in practice we only vacuum at night now when my wife and I have had time to prep the house after the kids have gone to bed.

It is worth nothing that we only got one vacuum now. We got the i7+ after we got the 980 due to realizing that the 980 didn’t have features like the smart map allowing you to for instance vacuum specific rooms. It also had other niceties like self emptying and it was supposed to be more quiet (which is nice when you run it at night). However in our experience it also had a less strong vacuum, so we felt it left more crap on the floor then the older 980 model. So in the end we returned the i7+ in favour of the 980, just because we felt it did a better job at vacuuming. It is quite loud though, so we can hear it very loud and clear up on the second floor while trying to fall asleep. So if you need a quiet house to sleep, this setup is not for you.

Another lesson we learned is that the vacuums or mops do not work great in darkness, so we now have to leave the light on downstairs at night when we want to vacuum or mop the floor. We should be able to automate that using Google Home, so Google Home could turn on the lights, start the vacuum and then once done, turn off the lights again. We haven’t actually gotten around to test that yet though.

As for the mop, I would say that it is not a replacement for mopping yourself, but it can reduce the frequency of you mopping yourself and thus help maintain a nice clear floor for longer after you done a full manual mop yourself. Also the m6 is super sensitive to edges, which I assume is to avoid it trying to mop your rugs and mats, but it also means that it can not traverse even small thresholds. So for us who have small thresholds between our kitchen area and the rest of the house we have to carry the mop over the thresholds and mop the rest of the first floor as a separate action, which is a bit of an annoyance now that we are running these things at night. That said the kitchen is the one room which needs moping more regularly, so in some sense the current setup where the roomba vacuums the whole first floor and the braava mop mops just the kitchen is a workable solution for us. One nice feature here is that they can be set up to run in order, so the mop will only start once the vacuum is done (that feature is the main reason we haven’t tested out other brand mops which might handle the threshold situation better).

So to conclude, would I recommend robot vacuums and robot mops to other parents with you kids? I would say yes, it has definitely helped us keep the house cleaner and nicer and let us spend less time cleaning the house. But it is not a miracle cure in any way or form, it still takes time and effort to prepare and set up the house and sometimes you still need to do especially the mopping yourself to get things really clean. As for the question of iRobot versus other brands I have no input as I haven’t really tested any other brands. iRobot is a local company so their vacuums are available in a lot of stores around me and I drive by their HQ on a regular basis, so that is the more or less random reason I ended up with their products as opposed to competing ones.

May 25, 2020

The KWinFT project with its two major open source offerings KWinFT and Wrapland was announced one month ago. This made quite some headlines back then but I decided to keep it down afterwards and push the project silently forward on a technical side.

Now I am pleased to announce the release of a beta version for the next stable release 5.19 in two weeks. The highlights of this release are a complete redesign of Wrapland's server library and two more projects joining KWinFT.

Wrapland Server library redesign

One of the goals of KWinFT is to facilitate large upsetting changes to the internal structure and technological base of its open source offerings. As mentioned one month ago in the project announcement these changes include pushing back the usage of Qt in lower-level libraries and instead making use of modern C++ to its full potential.

We achieved the first milestone on this route in an impressively short timeframe: the redesign of Wrapland's server library for improved encapsulation of external libwayland types and providing template-enhanced meta-classes for easy extension with new functionality in the future.

This redesign work was organized on a separate branch and merged this weekend into master. In the end that included over 200 commits and 40'000 changed lines. Here I have to thank in particular Adrien Faveraux who joined KWinFT shortly after its announcement and contributed several class refactors. Our combined work enabled us to deliver this major redesign already now with the upcoming release.

Aside from the redesign I used this opportunity to add clang-based tools for static code analysis: clang-format and clang-tidy. Adding to our autotests that run with and without sanitizers Wrapland's CI pipelines now provide efficient means for handling contributions by GitLab merge requests and checking back on the result after merge. You can see a full pipeline with linters, static code analysis, project build and autotests passing in the article picture above or check it out here directly in the project.

New in KWinFT: Disman and KDisplay

With this release Disman and KDisplay join the KWinFT project. Disman is a fork of libkscreen and KDisplay one of KScreen. KScreen is the main UI in a KDE Plasma workspace for display configuration and I was its main contributor and maintainer in the last years.

Disman can be installed in parallel with libkscreen. For KDisplay on the other side it is recommended to remove KScreen when KDisplay is installed. This way not both daemons try to meddle with the display configuration at the same time. KDisplay can make use of plugins for KWinFT, KWin and wlroots so you could also use KDisplay as a general replacement.

Forking libkscreen and KScreen to Disman and KDisplay was an unwanted move from my side because I would have liked to keep maintaining them inside KDE. But my efforts to integrate KWinFT with them were not welcomed by some members of the Plasma team. Form your own opinion by reading the discussion in the patch review.

I am not happy about this development but I decided to make the best out of a bad situation and after forking and rebranding directly created CI pipelines for both projects which now also run linters, project builds and autotests on all merge requests and branch pushes. And I defined some more courageous goals for the two projects now that I have more control.

One would think after years of being the only person putting real work into KScreen I would have had this kind of freedom also inside KDE but that is not how KDE operates.

Does it need to be this way? What are arguments for and against it? That is a discussion for another article in the future.

Very next steps

There is an overall technical vision I am following with KWinFT: building a modern C++ framework for Wayland compositor creation. A framework that is built up from independent yet well interacting small libraries.

Take a look at this task for an overview. The first one of these libraries that we have now put work in was Wrapland. I plan for the directly next one to be the backend library that provides interfacing capabilities with the kernel or a host window system the compositor runs on, what in most cases means talking to the Direct Rendering Manager.

The work in Wrapland is not finished though. After the basic representation of Wayland objects has been improved we can push further by layering the server library like this task describes. The end goal here is to get rid of the Qt dependency and make it an optional facade only.

How to try out or contribute to KWinFT

You can try out KWinFT on Manjaro. At the moment you can install KWinFT and its dependencies on Manjaro's unstable images but it is planned to make this possible also in the stable images with the upcoming 5.19 stable release.

I explicitly recommend the Manjaro distribution nowadays to any user from being new to Linux to experts. I have Manjaro running on several devices and I am very pleased with Manjaro's technology, its development pace and its community.

If you are an advanced user you can also use Arch Linux directly and install a KWinFT AUR package that builds KWinFT and its dependencies directly from Git. I hope a package of KWinFT's stable release will also be soon available from Arch' official repositories.

If you want to contribute to one of the KWinFT projects take a look at the open tasks and come join us in our Gitter channel. I am very happy that already several people joined the project who provide QA feedback and patches. There are also opportunities to work on DevOps, documentation and translations.

I am hoping KWinFT will be a welcoming place for everyone interested in high-quality open source graphics technology. A place with low entry barriers and many opportunities to learn and grow as an engineer.

May 20, 2020

Meanwhile, in GSoC:

I took the second week of Community Bonding to make some improvements in my development environment. As I have reported before, I use a QEMU VM to develop kernel contributions. I was initially using an Arch VM for development; however, at the beginning of the year, I reconfigured it to use a Debian VM, since my host is a Debian installation - fewer context changes. In this movement, some ends were loose, and I did some workarounds, well… better round it off.

I also use kworkflow (KW) to ease most of the no-coding tasks included in the day-to-day coding for Linux kernel. The KW automates repetitive steps of a developer’s life, such as compiling and installing my kernel modifications; finding information to format and send patches correctly; mounting or remotely accessing a VM, etc. During the time that preceded the GSoC project submission, I noticed that the feature of installing a kernel inside the VM was incompleted. At that time, I started to use the “remote” option as palliative. Therefore, I spent the last days learning more features and how to hack the kworkflow to improve my development environment (and send it back to the kw project).

I have started by sending a minor fix on alert message:

kw: small issue on u/mount alert message

Then I expanded the feature “explore” - looking for a string in directory contents - by adding GNU grep utility in addition to the already used git grep. I gathered many helpful suggestions for this patch, and I applied them together with the reviews received in a new version:

src: add grep utility to explore feature

Finally, after many hours of searching, reading and learning a little about guestfish, grub, initramfs-tools and bash, I could create the first proposal of code changes that enable kw to automate the build and install of a kernel in VM:

add support for deployment in a debian-VM

The main barrier to this feature was figuring out how to update the grub on the VM without running the update-grub command via ssh access. First, I thought about adding a custom file with a new entry to boot. Thinking and researching a little more, I realized that guestfish could solve the problem and, following this logic, I found a blog post describing how to run “update-grub” with guestfish. From that, I made some adaptations that created the solution.

However, in addition to updating grub, the feature still lacks some steps to install the kernel on the VM properly. I checked the missing code by visiting an old FLUSP tutorial that describes the step-by-step of compiling and install the Linux Kernel inside a VM. I also used the implementation of the “remote” mode of the “kw deploy” to wrap up.

Now I use kw to automatically compile and install a custom kernel on my development VM. So, time to sing: “Ooh, that’s why I’m easy; I’m easy as Sunday morning!”

Maybe not now. It’s time to learn more about IGT tests!

This morning I saw two things that were Microsoft and Linux graphics related.

https://devblogs.microsoft.com/commandline/the-windows-subsystem-for-linux-build-2020-summary/

a) DirectX on Linux for compute workloads
b) Linux GUI apps on Windows

At first I thought these were related, but it appears at least presently these are quite orthogonal projects.

First up clarify for the people who jump to insane conclusions:

The DX on Linux is a WSL2 only thing. Microsoft are not any way bringing DX12 to Linux outside of the Windows environment. They are also in no way open sourcing any of the DX12 driver code. They are recompiling the DX12 userspace drivers (from GPU vendors) into Linux shared libraries, and running them on a kernel driver shim that transfers the kernel interface up to the closed source Windows kernel driver. This is in no way useful for having DX12 on Linux baremetal or anywhere other than in a WSL2 environment. It is not useful for Linux gaming.

Microsoft have submitted to the upstream kernel the shim driver to support this. This driver exposes their D3DKMT kernel interface from Windows over virtual channels into a Linux driver that provides an ioctl interface. The kernel drivers are still all running on the Windows side.

Now I read the Linux GUI apps bit and assumed that these things were the same, well it turns out the DX12 stuff doesn't address presentation at all. It's currently only for compute/ML workloads using CUDA/DirectML. There isn't a way to put the results of DX12 rendering from the Linux guest applications onto the screen at all. The other project is a wayland/RDP integration server, that connects Linux apps via wayland to RDP client on Windows display, integrating that with DX12 will be a tricky project, and then integrating that upstream with the Linux stack another step completely.

Now I'm sure this will be resolved, but it has certain implications on how the driver architecture works and how much of the rest of the Linux graphics ecosystem you have to interact with, and that means that the current driver might not be a great fit in the long run and upstreaming it prematurely might be a bad idea.

From my point of view the kernel shim driver doesn't really bring anything to Linux, it's just a tunnel for some binary data between a host windows kernel binary and a guest linux userspace binary. It doesn't enhance the Linux graphics ecosystem in any useful direction, and as such I'm questioning why we'd want this upstream at all.

May 19, 2020

One of the more common issues we encounter debugging things is that users don't always know whether they're running on a Wayland or X11 session. Which I guess is a good advertisement for how far some of the compositors have come. The question "are you running on Xorg or Wayland" thus comes up a lot and suggestions previously included things like "run xeyes", "grep xinput list", "check xrandr" and so on and so forth. None of those are particularly scriptable, so there's a new tool around now: xisxwayland.

Run without arguments it simply exits with exit code 0 if the X server is Xwayland, or 1 otherwise. Which means use can use it like this:


$ cat my-xorg-only-script.sh
#!/bin/bash

if xisxwayland; then
echo "This is an Xwayland server!";
exit 1
fi

...
Or, in the case where you have a human user (gasp!), you can ask them to run:

$ xisxwayland --verbose
Xwayland: YES
And even non-technical users should be able to interpret that.

Note that the script checks for Xwayland (hence the name) via the $DISPLAY environment variable, just like any X application. It does not check whether there's a Wayland compositor running but for most use-cases this doesn't matter anyway. For those where it matters you get to write your own script. Congratulations, I guess.

May 13, 2020

I submitted a project proposal to participate in this year’s Google Summer of Code (GSoC). As I am curious about the DRM subsystem and have already started to work in contributing to it, I was looking for an organization that supports this subsystem. So, I applied to the X.Org foundation and proposed a project to Improve VKMS using IGT GPU Tools. Luckily, in May 4th, I received a e-mail from GSoC announcing that I was accepted! Happy happy happy!

Observing a conversation in #dri-devel channel, I realized that my project was the only one accepted on X.Org. WoW! I have no idea why the organization has only one intern this year, and even more, that this is me! Imediately after I received the result, Trevor Woerner congratulated me and kindly announced my project on his Twitter profile! It was fun to know that he enjoyed the things that I randomly posted in my blog, and was so nice to see that he read what I wrote!

Advices, not sure, but I can report the submission experience

From time to time, someone appears on communication channels of FLUSP (FLOSS@USP) asking how to participate in GSoC or Outreachy. They are usually trying to answer questions about the rules of participation and/or obtain reports from former participants. In my opinion, we have in Brazil many high-qualified IT people who, for some reason, do not feel safe to invest in a career abroad. They are very demanding with themselves. And I believe that groups like FLUSP can be a means of launching good developers internationally.

By viewing this contribution window, I consider it worthwhile to report some actions that I took in my application process.

First, take seriously, and do it as soon as possible.

About a year ago, I took my first efforts to understand the project and the Linux community. But that doesn’t mean you need a year in advance for that. What I mean is that, for a project like Linux, I had to dedicate enough hours studying, understanding, and trying to develop my first contributions. And I, in particular, did not feel skilled enough to propose anything for the community. Participating in GSoC was not my initial plan. What I wanted (and still want) is to become a high-skilled developer of the Linux kernel project.

Second, take your first contribution steps.

Many times I prepared the environment, started to work on some things, and then ended up blocked by all the doubts that arose when I dived into all these lines of code. Breathe… Better start with the basics. If you have someone who can mentor you, one way is to work “with her/him.” Find issues or code that s/he participates so you can start a conversation and some guidance. I prefer a “peer-to-peer” introduction than a “broadcast” message for the community. And then, things fly.

When the organizations are announced…

Ask the community (or the developers you have already developed a connexion) if any of the organizations would contemplate the system you are interested in proposing a project. You can propose something that you think relevant, you can look for project suggestions within organizations, or you can (always) ask. Honestly, I’m not very good at asking. In the case of X.Org Foundation, you can find tips here, and suggestions and potential mentors here and here.

Write your proposal and share!

As soon as you can.

In your proposal, you should introduce yourself and show what you already know so far, that is, make your portfolio. Also, try to be as transparent as possible about tasks, describe everything you already know about the issues, estimate time, and make a schedule. Finally, propose productions in addition to code (documentation, tutorial, blog posts). I think this content production will help your work and also the well-monitoring of your activities. Moreover, it is a way to spread the word and track development issues.

I wasn’t fast enough to share my project, but I saw how my proposal evolved after I received the community feedback. The feature of “create and share” is available in the GSoC application platform using Google Docs. I think my internship work plan became more potent due to share my document and apply all suggestions.

Approved and now? I am the newer X.Org developer

After the welcome, Siqueira and Trevor also gave me some guidance, reading, and initial activities.

To better understand the X.Org organization and the system I will be working on in the coming months, I have read the following links (still in progress):

Hands on, it has already started!

Well, time to revisit the proposal, organize daily activities, and check the development environment. Before the GSoC results, I was working on some kworkflow issues and so, that’s what I’ve been working on at the moment. I like to study “problem-oriented”, so to dissect the kw, I took some issues to try to solve. I believe I can report my progress in an upcoming post.

To the next!

May 08, 2020

Gather round children, it's analogy time! First, some definitions:

  • Wayland is a protocol to define the communication between a display server (the "compositor") and a client, i.e. an application though the actual communication is usually done by a toolkit like GTK or Qt.
  • A Wayland Compositor is an implementation of a display server that (usually but not necessary) handles things like displaying stuff on screen and handling your input devices, among many other things. Common examples for Wayland Compositors are GNOME's mutter, KDE's KWin, weston, sway, etc.[1]

And now for the definitions we need for our analogy:

  • HTTP is a protocol to define the communication between a web server and a client (usually called the "browser")
  • A Web Browser is an implementation that (sometimes but not usually) handles things like displaying web sites correctly, among many other things. Common examples for Web Browsers are Firefox, Chrome, Edge, Safari, etc. [2]

And now for the analogy:

The following complaints are technically correct but otherwise rather pointless to make:

  • HTTP doesn't support CSS
  • HTTP doesn't support adblocking
  • HTTP doesn't render this or that website correctly
And much in the same style, the following complaints are technically correct but otherwise rather pointless to make:
  • Wayland doesn't support keyboard shortcuts
  • Wayland doesn't support screen sharing
  • Wayland doesn't render this or that application correctly
In most cases you may encounter (online or on your setup), saying "Wayland doesn't support $BLAH" is like saying "HTTP doesn't support $BLAH". The problem isn't in with Wayland itself, it's a missing piece or bug in the compositor you're using.

Likewise, saying "I don't like Wayland" is like saying "I don't like HTTP".The vast majority of users will have negative feelings towards the browser, not the transport protocol.

[1] Because they're implementations of a display server they can speak multiple protocols and some compositors can also be X11 window managers, much in the same way as you can switch between English and your native language(s).[2] Because they're implementations of a web browser they can speak multiple protocols and some browsers can also be FTP file managers, much in the same way as... you get the point