planet.freedesktop.org
September 18, 2020

Blog Returns

Once again, I ended up not blogging for most of the week. When this happens, there’s one of two possibilities: I’m either taking a break or I’m so deep into some code that I’ve forgotten about everything else in my life including sleep.

This time was the latter. I delved into the deepest parts of zink and discovered that the driver is, in fact, functioning only through a combination of sheer luck and a truly unbelievable amount of driver stalls that provide enough forced synchronization and slow things down enough that we don’t explode into a flaming mess every other frame.

Oops.

I’ve fixed all of the crazy things I found, and, in the process, made some sizable performance gains that I’m planning to spend a while blogging about in considerable depth next week.

And when I say sizable, I’m talking in the range of 50-100% fps gains.

But it’s Friday, and I’m sure nobody wants to just see numbers or benchmarks. Let’s get into something that’s interesting on a technical level.

Samplers

Yes, samplers.

In Vulkan, samplers have a lot of rules to follow. Specifically, I’m going to be examining part of the spec that states “If a VkImageView is sampled with VK_FILTER_LINEAR as a result of this command, then the image view’s format features must contain VK_FORMAT_FEATURE_SAMPLED_IMAGE_FILTER_LINEAR_BIT”.

This is a problem for zink. Gallium gives us info about the sampler in the struct pipe_context::create_sampler_state hook, but the created sampler won’t actually be used until draw time. As a result, there’s no way to know which image is going to be sampled, and thus there’s no way to know what features the sampled image’s format flags will contain. This only becomes known at the time of draw.

The way I saw it, there were two options:

  • Dynamically create the sampler just before draw and swizzle between LINEAR and NEAREST based on the format features
  • Create both samplers immediately when LINEAR is passed and swizzle between them at draw time

In theory, the first option is probably more performant in the best case scenario where a sampler is only ever used with a single image, as it would then only ever create a single sampler object.

Unfortunately, this isn’t realistic. Just as an example, u_blitter creates a number of samplers up front, and then it also makes assumptions about filtering based on ideal operations which may not be in sync with the underlying Vulkan driver’s capabilities. So for these persistent samplers, the first option may initially allow the sampler to be created with LINEAR filtering, but it may later then be used for an image which can’t support it.

So I went with the second option. Now any time a LINEAR sampler is created by gallium, we’re actually creating both types so that the appropriate one can be used, ensuring that we can always comply with the spec and avoid any driver issues.
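As a rough sketch of the idea (the struct and helper here are hypothetical, not zink's actual code), the draw-time selection looks something like this:

#include <stdbool.h>
#include <vulkan/vulkan.h>

/* Hypothetical container: [0] holds the LINEAR sampler as requested,
 * [1] holds the NEAREST fallback created alongside it. */
struct zink_sampler_state {
   VkSampler sampler[2];
};

/* At draw time, check whether the sampled image's format can actually be
 * filtered linearly, and fall back to NEAREST if it can't. */
static VkSampler
pick_sampler(VkPhysicalDevice pdev, struct zink_sampler_state *state,
             VkFormat format)
{
   VkFormatProperties props;
   vkGetPhysicalDeviceFormatProperties(pdev, format, &props);
   bool linear_ok = props.optimalTilingFeatures &
                    VK_FORMAT_FEATURE_SAMPLED_IMAGE_FILTER_LINEAR_BIT;
   return state->sampler[linear_ok ? 0 : 1];
}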

Hooray.

September 14, 2020

Hoo Boy

Let’s talk about ARB_shader_draw_parameters. Specifically, let’s look at gl_BaseVertex.

In OpenGL, this shader variable’s value depends on the parameters passed to the draw command, and the value is always zero if the command has no base vertex.

In Vulkan, the value here is only zero if the first vertex is zero.

The difference here means that for arrayed draws without base vertex parameters, GL always expects zero, and Vulkan expects first vertex.

Hooray.

Compatibilizing

The easiest solution here would be to just throw a shader key at the problem, producing variants of the shader for use with indexed vs non-indexed draws, and using NIR passes to modify the variables for the non-indexed case and zero the value. It’s quick, it’s easy, and it’s not especially great for performance since it requires compiling the shader multiple times and creating multiple pipeline objects.

This is where push constants come in handy once more.

Avid readers of the blog will recall the last time I used push constants was for TCS injection when I needed to generate my own TCS and have it read the default inner/outer tessellation levels out of a push constant.

Since then, I’ve created a struct to track the layout of the push constant:

struct zink_push_constant {
   unsigned draw_mode_is_indexed;
   float default_inner_level[2];
   float default_outer_level[4];
};

Now just before draw, I update the push constant value for draw_mode_is_indexed:

if (ctx->gfx_stages[PIPE_SHADER_VERTEX]->nir->info.system_values_read & (1ull << SYSTEM_VALUE_BASE_VERTEX)) {
   unsigned draw_mode_is_indexed = dinfo->index_size > 0;
   vkCmdPushConstants(batch->cmdbuf, gfx_program->layout, VK_SHADER_STAGE_VERTEX_BIT,
                      offsetof(struct zink_push_constant, draw_mode_is_indexed), sizeof(unsigned),
                      &draw_mode_is_indexed);
}

And now the shader can be made aware of whether the draw mode is indexed.

Now comes the NIR, as is the case for most of this type of work.

static bool
lower_draw_params(nir_shader *shader)
{
   if (shader->info.stage != MESA_SHADER_VERTEX)
      return false;

   if (!(shader->info.system_values_read & (1ull << SYSTEM_VALUE_BASE_VERTEX)))
      return false;

   return nir_shader_instructions_pass(shader, lower_draw_params_instr, nir_metadata_dominance, NULL);
}

This is the future, so I’m now using Eric Anholt’s recent helper function to skip past iterating over the shader’s functions/blocks/instructions, instead just passing the lowering implementation as a parameter and letting the helper create the nir_builder for me.

static bool
lower_draw_params_instr(nir_builder *b, nir_instr *in, void *data)
{
   if (in->type != nir_instr_type_intrinsic)
      return false;
   nir_intrinsic_instr *instr = nir_instr_as_intrinsic(in);
   if (instr->intrinsic != nir_intrinsic_load_base_vertex)
      return false;

I’m filtering out everything except for nir_intrinsic_load_base_vertex here, which is the instruction for loading gl_BaseVertex.

   
   b->cursor = nir_after_instr(&instr->instr);

I’m modifying instructions after this one, so I set the cursor after.

   nir_intrinsic_instr *load = nir_intrinsic_instr_create(b->shader, nir_intrinsic_load_push_constant);
   load->src[0] = nir_src_for_ssa(nir_imm_int(b, 0));
   nir_intrinsic_set_range(load, 4);
   load->num_components = 1;
   nir_ssa_dest_init(&load->instr, &load->dest, 1, 32, "draw_mode_is_indexed");
   nir_builder_instr_insert(b, &load->instr);

I’m loading the first 4 bytes of the push constant variable that I created according to my struct, which is the draw_mode_is_indexed value.

   nir_ssa_def *composite = nir_build_alu(b, nir_op_bcsel,
                                          nir_build_alu(b, nir_op_ieq, &load->dest.ssa, nir_imm_int(b, 1), NULL, NULL),
                                          &instr->dest.ssa,
                                          nir_imm_int(b, 0),
                                          NULL);

This adds a new ALU instruction of type bcsel, AKA the ternary operator (condition ? true : false). The condition here is another ALU of type ieq, AKA integer equals, and I’m testing whether the loaded push constant value is equal to 1. If true, this is an indexed draw, so I continue using the loaded gl_BaseVertex value. If false, this is not an indexed draw, so I need to use zero instead.

   nir_ssa_def_rewrite_uses_after(&instr->dest.ssa, nir_src_for_ssa(composite), composite->parent_instr);

With my bcsel composite gl_BaseVertex value constructed, I can now rewrite all subsequent uses of gl_BaseVertex in the shader to use the composite value, which will automatically swap between the Vulkan gl_BaseVertex and zero based on the value of the push constant without the need to rebuild the shader or make a new pipeline.

   return true;
}

And now the shader gets the expected value and everything works.

Billy Mays

It’s also worth pointing out here that gl_DrawID from the same extension has a similar problem: gallium doesn’t pass multidraws in full to the driver, instead iterating for each draw, which means that the shader value is never what’s expected either. I’ve employed a similar trick to jam the draw index into the push constant and read that back in the shader to get the expected value there too.
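As an illustration only (a sketch; zink's real layout may differ), the push constant struct from earlier would simply grow another member that gets updated before each iterated draw:

struct zink_push_constant {
   unsigned draw_mode_is_indexed;
   unsigned draw_id; /* hypothetical: updated before each gallium sub-draw */
   float default_inner_level[2];
   float default_outer_level[4];
};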

Extensions.

September 10, 2020

In an ideal world, every frame your application draws would appear on the screen exactly on time. Sadly, as anyone living in the year 2020 CE can attest, this is far from an ideal world. Sometimes the scene gets more complicated and takes longer to draw than you estimated, and sometimes the OS scheduler just decides it has more important things to do than pay attention to you.

When this happens, for some applications, it would be best if you could just get the bits on the screen as fast as possible rather than wait for the next vsync. The Present extension for X11 has an option to let you do exactly this:

If 'options' contains PresentOptionAsync, and the 'target-msc'
is less than or equal to the current msc for 'window', then
the operation will be performed as soon as possible, not
necessarily waiting for the next vertical blank interval. 

But you don't usually use Present directly; Present is the mechanism for GLX and Vulkan to put bits on the screen. So, today I merged some code to Mesa to enable the corresponding features in those APIs, namely GLX_EXT_swap_control_tear and VK_PRESENT_MODE_FIFO_RELAXED_KHR. If all goes well these should be included in Mesa 21.0, with a backport to 20.2.x not out of the question. As the GLX extension name suggests, this can introduce some visual tearing when the buffer swap does come in late, but for fullscreen games or VR displays that can be an acceptable tradeoff in exchange for reduced stuttering.
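From the application side, opting in is simple. Here's a minimal sketch of both paths; it assumes GLX_EXT_swap_control_tear is advertised and the usual extension-loading boilerplate has already happened:

/* GLX: a negative interval via glXSwapIntervalEXT(dpy, drawable, -1)
 * means "vsync, but tear if the swap comes in late".
 *
 * Vulkan: prefer FIFO_RELAXED when the surface offers it; FIFO is
 * always available as the fallback. */
VkPresentModeKHR
pick_present_mode(VkPhysicalDevice pdev, VkSurfaceKHR surface)
{
   VkPresentModeKHR modes[16];
   uint32_t count = 16;
   vkGetPhysicalDeviceSurfacePresentModesKHR(pdev, surface, &count, modes);
   for (uint32_t i = 0; i < count; i++)
      if (modes[i] == VK_PRESENT_MODE_FIFO_RELAXED_KHR)
         return VK_PRESENT_MODE_FIFO_RELAXED_KHR;
   return VK_PRESENT_MODE_FIFO_KHR;
}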

Despite what this might look like, I don't actually enjoy starting new projects: it's a lot easier to clean up some build warnings, or add a CI, than it is to start from an empty directory.

But sometimes needs must, and I've just released version 0.1 of such a project. Below you'll find an excerpt from the README, which should answer most of the questions. Please read the README directly in the repository if you're getting to this blog post more than a couple of days after it was first published.

Feel free to file new issues in the tracker if you have ideas on possible power-saving or performance enhancements. Currently the only “Performance” mode supported interacts with Intel CPUs with P-State support. More hardware support is planned.

TL;DR: this setting will be in the GNOME 3.40 development branch soon; Fedora packages are done, and API docs are available.

From the README:

Introduction

power-profiles-daemon offers to modify system behaviour based upon user-selected power profiles. There are 3 different power profiles: a "balanced" default mode, a "power-saver" mode, and a "performance" mode. The first 2 of those are available on every system. The "performance" mode is only available on select systems and is implemented by different "drivers" based on the system or systems it targets.

In addition to those 2 or 3 modes (depending on the system), "actions" can be hooked up to change the behaviour of a particular device. For example, this can be used to disable the fast-charging for some USB devices when in power-saver mode.

GNOME's Settings and shell both include interfaces to select the current mode, but they are also expected to adjust the behaviour of the desktop depending on the mode, such as turning the screen off after inactivity more aggressively when in power-saver mode.

Note that power-profiles-daemon does not save the currently active profile across system restarts and will always start with the "balanced" profile selected.

Why power-profiles-daemon

The power-profiles-daemon project was created to help provide a solution for two separate use cases on desktops, laptops, and other devices running a “traditional Linux desktop”.

The first one is a "Low Power" mode, that users could toggle themselves, or have the system toggle for them, with the intent to save battery. Mobile devices running iOS and Android have had a similar feature available to end-users and application developers alike.

The second use case was to allow a "Performance" mode on systems where the hardware maker would provide and design such a mode. The idea is that the Linux kernel would provide a way to access this mode which usually only exists as a configuration option in some machines' "UEFI Setup" screen.

This second use case is the reason why we didn't implement the "Low Power" mode in UPower, as was originally discussed.

As the daemon would change kernel settings, we would need to run it as root, and make its API available over D-Bus, as has been customary for more than 10 years. We would also design that API to be as easy as possible to use for building graphical interfaces.
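As a sketch of what a client might look like (the net.hadess.PowerProfiles names below are taken from the API docs; treat them as an assumption of this example), switching profiles over D-Bus with GLib boils down to setting one property:

#include <gio/gio.h>

int main(void)
{
   GError *error = NULL;
   GDBusConnection *bus = g_bus_get_sync(G_BUS_TYPE_SYSTEM, NULL, &error);
   if (!bus)
      return 1;
   /* Set the ActiveProfile property to "power-saver" */
   g_dbus_connection_call_sync(bus,
      "net.hadess.PowerProfiles",   /* bus name (assumed) */
      "/net/hadess/PowerProfiles",  /* object path (assumed) */
      "org.freedesktop.DBus.Properties", "Set",
      g_variant_new("(ssv)", "net.hadess.PowerProfiles", "ActiveProfile",
                    g_variant_new_string("power-saver")),
      NULL, G_DBUS_CALL_FLAGS_NONE, -1, NULL, &error);
   g_object_unref(bus);
   return error != NULL;
}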

Why not...

This section will contain explanations of why this new daemon was written rather than re-using, or modifying an existing one. Each project obviously has its own goals and needs, and those comparisons are not meant as a slight on the project.

As the code bases for both those projects listed and power-profiles-daemon are ever evolving, the comments were understood to be correct when made.

thermald

thermald only works on Intel CPUs, and is very focused on allowing maximum performance based on a "maximum temperature" for the system. As such, it could be seen as complementary to power-profiles-daemon.

tuned and TLP

Both projects have similar goals, allowing for tweaks to be applied for a variety of workloads that go far beyond the workloads and use cases that power-profiles-daemon targets.

A fair number of the tweaks that could apply to devices running GNOME or another free desktop are either potentially destructive (e.g. some of the SATA power-saving modes resulting in corrupted data), or working well enough to be put into place by default (e.g. audio codec power-saving), even if we need to disable the power saving on some hardware that reacts badly to it.

Both are good projects to use for the purpose of experimenting with particular settings to see if they'd be something that can be implemented by default, or to put some fine-grained, static policies in place on server-type workloads which are not as fluid and changing as desktop workloads can be.

auto-cpufreq

It doesn't take user intent into account, doesn't have a D-Bus interface, and seems to want to work automatically by monitoring CPU usage, which kind of goes against the user's wishes, as a user might still want to conserve as much energy as possible even under high CPU usage.

Over the past couple of (gasp!) decades, I've had my fair share of release blunders: forgetting to clean the tree before making a tarball by hand, forgetting to update the NEWS file, forgetting to push after creating the tarball locally, forgetting to update the appdata file (causing problems on Flathub)...

That's where check-news.sh comes in, to replace the check-news function of the autotools. Ideally you would:

- make sure your CI runs a dist job

- always use a merge request to do releases

- integrate check-news.sh to your meson build (though I would relax the appdata checks for devel releases)

September 09, 2020

Long, Long Day

I fell into the abyss of query code again, so this is just a short post to (briefly) touch on the adventure that is pipeline statistics queries.

Pipeline statistics queries include a collection of statistics about things that happened while the query was active, such as the count of shader invocations.

Thankfully, these weren’t too difficult to plug into the ever-evolving zink query architecture, as my previous adventure into QBOs (more on this another time) ended up breaking out a lot of helper functions for various things that simplified the process.

Mapping

The first step, as always, is figuring out how to map the gallium API to Vulkan. The below table handles the basic conversion for single query types:

unsigned map[] = {
   [PIPE_STAT_QUERY_IA_VERTICES] = VK_QUERY_PIPELINE_STATISTIC_INPUT_ASSEMBLY_VERTICES_BIT,
   [PIPE_STAT_QUERY_IA_PRIMITIVES] = VK_QUERY_PIPELINE_STATISTIC_INPUT_ASSEMBLY_PRIMITIVES_BIT,
   [PIPE_STAT_QUERY_VS_INVOCATIONS] = VK_QUERY_PIPELINE_STATISTIC_VERTEX_SHADER_INVOCATIONS_BIT,
   [PIPE_STAT_QUERY_GS_INVOCATIONS] = VK_QUERY_PIPELINE_STATISTIC_GEOMETRY_SHADER_INVOCATIONS_BIT,
   [PIPE_STAT_QUERY_GS_PRIMITIVES] = VK_QUERY_PIPELINE_STATISTIC_GEOMETRY_SHADER_PRIMITIVES_BIT,
   [PIPE_STAT_QUERY_C_INVOCATIONS] = VK_QUERY_PIPELINE_STATISTIC_CLIPPING_INVOCATIONS_BIT,
   [PIPE_STAT_QUERY_C_PRIMITIVES] = VK_QUERY_PIPELINE_STATISTIC_CLIPPING_PRIMITIVES_BIT,
   [PIPE_STAT_QUERY_PS_INVOCATIONS] = VK_QUERY_PIPELINE_STATISTIC_FRAGMENT_SHADER_INVOCATIONS_BIT,
   [PIPE_STAT_QUERY_HS_INVOCATIONS] = VK_QUERY_PIPELINE_STATISTIC_TESSELLATION_CONTROL_SHADER_PATCHES_BIT,
   [PIPE_STAT_QUERY_DS_INVOCATIONS] = VK_QUERY_PIPELINE_STATISTIC_TESSELLATION_EVALUATION_SHADER_INVOCATIONS_BIT,
   [PIPE_STAT_QUERY_CS_INVOCATIONS] = VK_QUERY_PIPELINE_STATISTIC_COMPUTE_SHADER_INVOCATIONS_BIT
};

Pretty straightforward.

With this in place, it’s worth mentioning that OpenGL has facilities for performing either “all” statistics queries, which include all the above types, or “single” statistics queries, which is just one of the types.
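For context, a “single” statistics query on the GL side is just a regular query object with one of the ARB_pipeline_statistics_query targets, along these lines (standard GL, nothing zink-specific):

GLuint64
count_clipper_inputs(void)
{
   GLuint q;
   glGenQueries(1, &q);
   glBeginQuery(GL_CLIPPING_INPUT_PRIMITIVES_ARB, q);
   /* ... draw calls being measured ... */
   glEndQuery(GL_CLIPPING_INPUT_PRIMITIVES_ARB);

   GLuint64 result;
   glGetQueryObjectui64v(q, GL_QUERY_RESULT, &result); /* waits for results */
   glDeleteQueries(1, &q);
   return result;
}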

Building on that, I’m going to reach way back to the original loop that I’ve been using for handling query results. That’s now its own helper function called check_query_results(). It’s invoked from get_query_result() like so:

int result_size = 1;
/* these query types emit 2 values */
if (query->vkqtype == VK_QUERY_TYPE_TRANSFORM_FEEDBACK_STREAM_EXT ||
    query->type == PIPE_QUERY_PRIMITIVES_GENERATED ||
    query->type == PIPE_QUERY_PRIMITIVES_EMITTED)
   result_size = 2;
else if (query->type == PIPE_QUERY_PIPELINE_STATISTICS)
   result_size = 11;

if (query->type == PIPE_QUERY_PIPELINE_STATISTICS)
   num_results = 1;
for (unsigned last_start = query->last_start; last_start + num_results <= query->curr_query; last_start++) {
   /* verify that we have the expected number of results pending */
   assert(num_results <= ARRAY_SIZE(results) / result_size);
   VkResult status = vkGetQueryPoolResults(screen->dev, query->query_pool,
                                           last_start, num_results,
                                           sizeof(results),
                                           results,
                                           sizeof(uint64_t),
                                           flags);
   if (status != VK_SUCCESS)
      return false;

   if (query->type == PIPE_QUERY_PRIMITIVES_GENERATED) {
      status = vkGetQueryPoolResults(screen->dev, query->xfb_query_pool[0],
                                              last_start, num_results,
                                              sizeof(xfb_results),
                                              xfb_results,
                                              2 * sizeof(uint64_t),
                                              flags | VK_QUERY_RESULT_64_BIT);
      if (status != VK_SUCCESS)
         return false;

   }

   check_query_results(query, result, num_results, result_size, results, xfb_results);
}

Beautiful, isn’t it?

For the case of any xfb-related query, 2 result values are returned (and then also PRIMITIVES_GENERATED gets its own separate xfb pool for correctness), while most others only get 1 result value.

And then there’s PIPELINE_STATISTICS, which gets eleven. Since this is a weird result size to handle, I decided to just grab the results for each query one at a time in order to avoid any kind of craziness with allocating enough memory for the actual result values. Plenty of time for that later.

The handling for the results is fun too:

case PIPE_QUERY_PIPELINE_STATISTICS: {
   uint64_t *statistics = (uint64_t*)&result->pipeline_statistics;
   for (unsigned j = 0; j < 11; j++)
      statistics[j] += results[i + j];
   break;
}

Each bit set in the query object returns its own query result, so there are 11 of them that need to be accumulated for each query result.

Surprisingly though, that’s the craziest part of implementing this. Not really much compared to some of the other insanity.

Corrections

The other day, I made wild claims about zink being the fastest graphics driver in history.

I’m not going to say they were inaccurate.

What I am going to say is that I had a great chat with well-known, long-time mesa developer Kenneth Graunke, who pointed out that most of the time differences in the softfp64 tests were due to NIR optimizations related to loop unrolling and the different settings used between IRIS and zink there.

He also helpfully pointed out that I forgot to set a number of important pipe caps, including the one that holds the value for gl_MaxVaryingComponents. As a result, zink has been advertising a very illegal and too-small number of varyings, which made the shaders smaller, which made us faster.

This has since been fixed, and zink is now fully compliant with the required GL specs for this value.

However.

This has only increased the time of the one test I showcased from ~2 seconds to ~12 seconds.

Due to different mechanics between GLSL and SPIR-V shader construction, all XFB outputs have to be explicitly specified when they’re converted in zink, and they count against the regular output limits. As a result, it’s necessary to reserve half of the driver’s advertised vertex outputs for potential XFB usage, which leaves far fewer output slots available for user shaders while preserving XFB functionality.

September 08, 2020

This is going to be a short post, as changes to Videos have been few and far between in the past couple of releases.

The major change to the latest release is that we've gained Tracker 3 support through a grilo plugin (which meant very few changes to our own code). But the Tracker 3 libraries are incompatible with the Tracker 2 daemon that's usually shipped in distributions, including on this author's development system.

So we made use of the ability of Tracker to run inside a Flatpak sandbox along with the video player, removing the need to have Tracker installed by the distribution, on the host. This should also make it easier to give users control of the directories they want to use to store their movies, in the future.

The release candidate for GNOME 3.38 is available right now as the stable version on Flathub.

September 07, 2020

So here a new update of the evolution of the Vulkan driver for the rpi4 (broadcom GPU).

Features

Since my last update we finished the support for two features: robust buffer access and multisampling.

Robust buffer access is a feature that allows specifying that accesses to buffers are bounds-checked against the range of the buffer descriptor. Usually this is used as a debug tool during development, but disabled on release (this is explained in more detail in this ARM guide). So sorry, no screenshot here.
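For reference, this is a standard Vulkan feature that applications opt into at device creation; a minimal sketch (plain Vulkan API, nothing v3dv-specific):

VkPhysicalDeviceFeatures features = {
   .robustBufferAccess = VK_TRUE, /* buffer accesses are bounds-checked */
};

VkDeviceCreateInfo info = {
   .sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO,
   /* ... queue create infos and extensions elided ... */
   .pEnabledFeatures = &features,
};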

On my last update I mentioned that we had started the support for multisampling, enough to get some demos working. Since then we were able to finish the rest of the multisampling support, and even implemented the optional sample rate shading feature. So now the following Sascha Willems demo is working:

Sascha Willems deferred multisampling demo run on rpi4
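Sample rate shading, in turn, is requested through the standard pipeline multisample state; roughly (again plain Vulkan, not v3dv internals):

VkPipelineMultisampleStateCreateInfo ms = {
   .sType = VK_STRUCTURE_TYPE_PIPELINE_MULTISAMPLE_STATE_CREATE_INFO,
   .rasterizationSamples = VK_SAMPLE_COUNT_4_BIT,
   .sampleShadingEnable = VK_TRUE, /* run the fragment shader per sample... */
   .minSampleShading = 1.0f,       /* ...for every one of the samples */
};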

Bugfixing

Taking into account that most of the features towards supporting Vulkan core 1.0 are implemented now, a lot of the effort since the last update went into bugfixing, focusing on the specifics of the driver. Our main reference for this is the Vulkan CTS, the official Khronos testsuite for Vulkan and OpenGL.

As usual, here are some screenshots from the nice Sascha Willems demos, showing demos that were failing when I wrote the last update and are working now thanks to the bugfixing work.

Sascha Willems hdr demo run on rpi4

Sascha Willems gltf skinning demo run on rpi4

Next

At this point there are no features pending to implement to fulfill support for Vulkan core 1.0, so our focus will be on passing all the Vulkan CTS tests.

Previous updates

Just in case you missed any of the updates of the vulkan driver so far:

Vulkan raspberry pi first triangle
Vulkan update now with added source code
v3dv status update 2020-07-01
V3DV Vulkan driver update: VkQuake1-3 now working
v3dv status update 2020-07-31

Just For Fun

I finally managed to get a complete piglit run over the weekend, and, for my own amusement, I decided to check the timediffs against a reference run from the IRIS driver. Given that the Intel drivers are of extremely high quality (and are direct interfaces to the underlying hardware that I happen to be using), I tend to use ANV and IRIS as my references whenever I’m trying to debug things.

Both runs used the same base checkout from mesa, so all the core/gallium/nir parts were identical.

The results weren’t what I expected.

My expectation when I clicked into the timediffs page was that zink would be massively slower in a huge number of tests, likely to a staggering degree in some cases.

We were, but then also on occasion we weren’t.

As a final disclaimer before I dive into this, I feel like given the current state of people potentially rushing to conclusions I need to say that I’m not claiming zink is faster than a native GL driver, only that for some cases, our performance is oddly better than I expected.

The Good

piglit-misc-bench.png

The first thing to take note of here is that IRIS is massively better than zink in successful test completion, with a near-perfect 99.4% pass rate compared to zink’s measly 91%, and that’s across 2500 more tests too. This is important also since timediff only compares between passing tests.

With that said, somehow zink’s codepath is significantly faster when it comes to dealing with high numbers of varying outputs, and also, weirdly, a bunch of dmat4 tests, even though they’re both using the same softfp64 path since my icelake hardware doesn’t support native 64bit operations.

I was skeptical about some of the numbers here, particularly the ext_transform_feedback max-varying-arrays-of-arrays cases, but manual tests were even weirder:

time MESA_GLSL_CACHE_DISABLE=true MESA_LOADER_DRIVER_OVERRIDE=zink bin/ext_transform_feedback-max-varyings -auto -fbo

MESA_GLSL_CACHE_DISABLE=true MESA_LOADER_DRIVER_OVERRIDE=zink  -auto -fbo  2.13s user 0.03s system 98% cpu 2.197 total
time MESA_GLSL_CACHE_DISABLE=true MESA_LOADER_DRIVER_OVERRIDE=iris bin/ext_transform_feedback-max-varyings -auto -fbo

MESA_GLSL_CACHE_DISABLE=true MESA_LOADER_DRIVER_OVERRIDE=iris  -auto -fbo  301.64s user 0.52s system 99% cpu 5:02.45 total

wat.

I don’t have a good explanation for this since I haven’t dug into it other than to speculate that ANV is just massively better at handling large numbers of varying outputs.

The Bad

By contrast, zink gets thrashed pretty decisively in arb_map_buffer_alignment-map-invalidate-range, and we’re about 150x slower.

Yikes. Looks like that’s going to be a target for some work since potentially an application might hit that codepath.

The Weird

piglit-fp64-bench.png

Somehow, zink is noticeably slower in a bunch of other fp64 tests (and this isn’t the full list, only a little over half). It’s strange to me that zink can perform better in certain fp64 cases but then also worse in others, but I’m assuming this is just the result of different shader optimizations happening between the drivers, shifting them onto slightly less slow parts of the softfp64 codepath in certain cases.

Possibly something to look into.

Probably not in too much depth since softfp64 is some pretty crazy stuff.

In Closing

Tests (and especially piglit ones) are not indicative of real world performance.

Exposure Notifications is a protocol developed by Apple and Google for facilitating COVID-19 contact tracing on mobile phones by exchanging codes with nearby phones over Bluetooth, implemented within the Android and iOS operating systems, now available here in Toronto.

Wait – phones? Android and iOS only? Can’t my Debian laptop participate? It has a recent Bluetooth chip. What about phones running GNU/Linux distributions like the PinePhone or Librem 5?

Exposure Notifications breaks down neatly into three sections: a Bluetooth layer, some cryptography, and integration with local public health authorities. Linux is up to the task, via BlueZ, OpenSSL, and some Python.

Given my background, will this turn out to be a reverse-engineering epic, resulting in a novel open stack for a closed system?

Not at all. The specifications for Exposure Notifications are available for both the Bluetooth protocol and the underlying cryptography. A partial reference implementation is available for Android, as is an independent Android implementation in microG. In Canada, the key servers run an open source stack originally built by Shopify and now maintained by the Canadian Digital Service, including open protocol documentation.

All in all, this is looking to be a smooth-sailing weekend[1] project.

The devil’s in the details.

Bluetooth

Exposure Notifications operates via Bluetooth Low Energy “advertisements”. Scanning for other devices is as simple as scanning for advertisements, and broadcasting is as simple as advertising ourselves.

On an Android phone, this is handled deep within Google Play Services. Can we drive the protocol from userspace on a regular GNU/Linux laptop? It depends. Not all laptops support Bluetooth, not all Bluetooth implementations support Bluetooth Low Energy, and I hear not all Bluetooth Low Energy implementations properly support undirected transmissions (“advertising”).

Luckily in my case, I develop on a Debianized Chromebook with a Wi-Fi/Bluetooth module. I’ve never used the Bluetooth, but it turns out the module has full support for advertisements, verified with the lescan (Low Energy Scan) command of the hcitool Bluetooth utility.

hcitool is a part of BlueZ, the standard Linux library for Bluetooth. Since lescan is able to detect nearby phones running Exposure Notifications, poring over its source code is a good first step to our implementation. With some minor changes to hcitool to dump packets as raw hex and to filter for the Exposure Notifications protocol, we can print all nearby Exposure Notifications advertisements. So far, so good.

That’s about where the good ends.

While scanning is simple with reference code in hcitool, advertising is complicated by BlueZ’s lack of an interface at the time of writing. While a general “enable advertising” routine exists, routines to set advertising parameters and data per the Exposure Notifications specification are unavailable. This is not a showstopper, since BlueZ is itself an open source userspace library. We can drive the Bluetooth module the same way BlueZ does internally, filling in the necessary gaps in the API, while continuing to use BlueZ for the heavy-lifting.

Some care is needed to multiplex scanning and advertising within a single thread while remaining power efficient. The key is that advertising, once configured, is handled entirely in hardware without CPU intervention. On the other hand, scanning does require CPU involvement, but it is not necessary to scan continuously. Since COVID-19 is thought to transmit from sustained exposure, we only need to scan every few minutes. (Food for thought: how does this connect to the sampling theorem?)

Thus we can order our operations as:

  • Configure advertising
  • Scan for devices
  • Wait for several minutes
  • Repeat.
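A minimal sketch of that loop, with hypothetical helpers standing in for the BlueZ/HCI plumbing described above:

#include <unistd.h>

void configure_advertising(void); /* also rotates the random address */
void scan_once(void);             /* one short LE scan pass */

void run(void)
{
   for (;;) {
      configure_advertising();
      scan_once();
      sleep(10 * 60); /* advertising continues in hardware while we sleep */
   }
}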

Since most of the time the program is asleep, this loop is efficient. It additionally allows us to reconfigure advertising every ten to fifteen minutes, in order to change the Bluetooth address to prevent tracking.

All of the above amounts to a few hundred lines of C code, treating the Exposure Notifications packets themselves as opaque random data.

Cryptography

Yet the data is far from random; it is the result of a series of operations in terms of secret keys defined by the Exposure Notifications cryptography specification. Every day, a “temporary exposure key” is generated, from which a “rolling proximity identifier key” and an “associated encrypted metadata key” are derived. These are used to generate a “rolling proximity identifier” and the “associated encrypted metadata”, which are advertised over Bluetooth and changed in lockstep with the Bluetooth random addresses.

There are lots of moving parts to get right, but each derivation reuses a common encryption primitive: HKDF-SHA256 for key derivation, AES-128 for the rolling proximity identifier, and AES-128-CTR for the associated encrypted metadata. Ideally, we would grab a state-of-the-art library of cryptography primitives like NaCl or libsodium and wire everything up.

First, some good news: once these routines are written, we can reliably unit test them. Though the specification states that “test vectors… are available upon request”, it isn’t clear who to request from. But Google’s reference implementation is itself unit-tested, and sure enough, it contains a TestVectors.java file, from which we can grab the vectors for a complete set of unit tests.

After patting ourselves on the back for writing unit tests, we’ll need to pick a library to implement the cryptography. Suppose we try NaCl first. We’ll quickly realize the primitives we need are missing, so we move on to libsodium, which is backwards-compatible with NaCl. For a moment, this will work – libsodium has upstream support for HKDF-SHA256. Unfortunately, the version of libsodium shipping in Debian testing is too old for HKDF-SHA256. Not a big problem – we can backport the implementation, written in terms of the underlying HMAC-SHA256 operations, and move on to the AES.

AES is a standard symmetric cipher, so libsodium has excellent support… for some modes. However standard, AES is not one cipher; it is a family of ciphers with different key lengths and operating modes, with dramatically different security properties. “AES-128-CTR” in the Exposure Notifications specification is clearly 128-bit AES in CTR (Counter) mode, but what about “AES-128” alone, stated to operate on a “single AES-128 block”?

The mode implicitly specified is known as ECB (Electronic Codebook) mode and is known to have fatal security flaws in most applications. Because AES-ECB is generally insecure, libsodium does not have any support for this cipher mode. Great, now we have two problems – we have to rewrite our cryptography code against a new library, and we have to consider if there is a vulnerability in Exposure Notifications.

ECB’s crucial flaw is that for a given key, identical plaintext will always yield identical ciphertext, regardless of position in the stream. Since AES is block-based, this means identical blocks yield identical ciphertext, leading to trivial cryptanalysis.

In Exposure Notifications, ECB mode is used only to derive rolling proximity identifiers from the rolling proximity identifier key and the timestamp, by the equation:

RPI_ij = AES_128_ECB(RPIK_i, PaddedData_j)

…where PaddedData is a function of the quantized timestamp. Thus the issue is avoided, as every plaintext will be unique (since timestamps are monotonically increasing, unless you’re trying to contact trace Back to the Future).

Nevertheless, libsodium doesn’t know that, so we’ll need to resort to a ubiquitous cryptography library that doesn’t, uh, take security quite so seriously…

I’ll leave the implications up to your imagination.
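Since OpenSSL was already named in the introduction as part of the stack, here is roughly what the single-block derivation looks like with its EVP interface (a sketch; error handling elided):

#include <stdint.h>
#include <openssl/evp.h>

/* RPI = AES_128_ECB(RPIK, PaddedData): encrypt exactly one 16-byte block. */
int
derive_rpi(const uint8_t rpik[16], const uint8_t padded[16], uint8_t rpi[16])
{
   EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
   int len = 0;
   EVP_EncryptInit_ex(ctx, EVP_aes_128_ecb(), NULL, rpik, NULL);
   EVP_CIPHER_CTX_set_padding(ctx, 0); /* a single block, no padding */
   EVP_EncryptUpdate(ctx, rpi, &len, padded, 16);
   EVP_CIPHER_CTX_free(ctx);
   return len == 16;
}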

Database

While the Bluetooth and cryptography sections are governed by upstream specifications, making sense of the data requires tracking a significant amount of state. At minimum, we must:

  • Record received packets (the Rolling Proximity Identifier and the Associated Encrypted Metadata).
  • Query received packets for diagnosed identifiers.
  • Record our Temporary Encryption Keys.
  • Query our keys to upload if we are diagnosed.

If we were so inclined, we could handwrite all the serialization and concurrency logic and hope we don’t have a bug that results in COVID-19 mayhem.

A better idea is to grab SQLite, perhaps the most deployed software in the world, and express these actions as SQL queries. The database persists to disk, and we can even express natural unit tests with a synthetic in-memory database.
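A sketch of what that might look like (the table and column names here are invented for illustration):

#include <sqlite3.h>

static const char *schema =
   "CREATE TABLE IF NOT EXISTS received ("
   "  rpi      BLOB    NOT NULL,"   /* rolling proximity identifier */
   "  aem      BLOB    NOT NULL,"   /* associated encrypted metadata */
   "  received INTEGER NOT NULL);"  /* receipt time */
   "CREATE TABLE IF NOT EXISTS teks ("
   "  tek      BLOB    NOT NULL,"   /* our temporary exposure key */
   "  interval INTEGER NOT NULL);"; /* key validity window */

int
open_db(sqlite3 **db, const char *path)
{
   /* pass ":memory:" as the path for a synthetic unit-test database */
   if (sqlite3_open(path, db) != SQLITE_OK)
      return 0;
   return sqlite3_exec(*db, schema, NULL, NULL, NULL) == SQLITE_OK;
}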

With this infrastructure, we’re now done with the primary daemon, recording Exposure Notification identifiers to the database and broadcasting our own identifiers. That’s not interesting if we never do anything with that data, though. Onwards!

Key retrieval

Once per day, Exposure Notifications implementations are expected to query the server for Temporary Encryption Keys associated with diagnosed COVID-19 cases. From these keys, the cryptography implementation can reconstruct the associated Rolling Proximity Identifiers, for which we can query the database to detect if we have been exposed.

Per Google’s documentation, the servers are expected to return a zip file containing two files:

  • export.bin: a container serialized as Protocol Buffers containing Diagnosis Keys
  • export.sig: a signature for the export with the public health agency’s key

The signature is not terribly interesting to us. On Android, it appears the system pins the public keys of recognized public health agencies as an integrity check for the received file. However, this public key is given directly to Google; we don’t appear to have an easy way to access it.

Does it matter? For our purposes, it’s unlikely. The Canadian key retrieval server is already transport-encrypted via HTTPS, so tampering with the data would already require compromising a certificate authority in addition to intercepting the requests to https://canada.ca. Broadly speaking, that limits attackers to nation-states, and since Canada has no reason to attack its own infrastructure, that limits our threat model to foreign nation-states. International intelligence agencies probably have better uses of resources than getting people to take extra COVID tests.

It’s worth noting other countries’ implementations could serve this zip file over plaintext HTTP, in which case this signature check becomes important.

Focusing then on export.bin, we may import the relevant protocol buffer definitions to extract the keys for matching against our database. Since this requires only read-only access to the database and executes infrequently, we can safely perform this work from a separate process written in a higher-level language like Python, interfacing with the cryptography routines over the Python foreign function interface ctypes. Extraction is easy with the Python protocol buffers implementation, and downloading should be as easy as a GET request with the standard library’s urllib, right?

Here we hit a gotcha: the retrieval endpoint is guarded behind an HMAC, requiring authentication to download the zip. The protocol documentation states:

Of course there’s no reliable way to truly authenticate these requests in an environment where millions of devices have immediate access to them upon downloading an Application: this scheme is purely to make it much more difficult to casually scrape these keys.

Ah, security by obscurity. Calculating the HMAC itself is simple given the documentation, but it requires a “secret” HMAC key specific to the server. As the documentation is aware, this key is hardly secret, but it’s not available on the Canadian Digital Service’s official repositories. Interoperating with the upstream servers would require some “extra” tricks.

From purely academic interest, we can write and debug our implementation without any such authorization by running our own sandbox server. Minus the configuration, the server source is available, so after spinning up a virtual machine and fighting with Go versioning, we can test our Python script.

Speaking of a personal sandbox…

Key upload

There is one essential edge case to the contact tracing implementation, one that we can’t test against the Canadian servers. And edge cases matter. In effect, the entire Exposure Notifications infrastructure is designed for the edge cases. If you don’t care about edge cases, you don’t care about digital contact tracing (so please, stay at home.)

The key feature – and key edge case – is uploading Temporary Exposure Keys to the Canadian key server in case of a COVID-19 diagnosis. This upload requires an alphanumeric code generated by a healthcare provider upon diagnosis, so if we used the shared servers, we couldn’t test an implementation. With our sandbox, we can generate as many alphanumeric codes as we’d like.

Once sandboxed, there isn’t much to the implementation itself: the keys are snarfed out of the SQLite database, we handshake with the server over protocol buffers marshaled over POST requests, and we throw in some public-key cryptography via the Python bindings to libsodium.

This functionality neatly fits into a second dedicated Python script which does not interface with the main library. It’s exposed as a command line interface with flow resembling that of the mobile application, adhering reasonably to the UNIX philosophy. Admittedly I’m not sure wrestling with the command line is top on the priority list of a Linux hacker ill with COVID-19. Regardless, the interface is suitable for higher-level (graphical) abstractions.

Problem solved, but of course there’s a gotcha: if the request is malformed, an error should be generated as a key robustness feature. Unfortunately, while developing the script against my sandbox, a bug led the request to be dropped unexpectedly, rather than returning with an error message. On the server implemented in Go, there was an apparent nil dereference. Oops. Fixing this isn’t necessary for this project, but it’s still a bug, even if it requires a COVID-19 diagnosis to trigger. So I went and did the Canadian thing and sent a pull request.

Conclusion

All in all, we end up with a Linux implementation of Exposure Notifications functional in Ontario, Canada. What’s next? Perhaps supporting contact tracing systems elsewhere in the world – patches welcome. Closer to home, while functional, the aesthetics are not (yet) anything to write home about – perhaps we could write a touch-based frontend for mobile Linux environments like Plasma Mobile and Phosh, maybe even running it on an Android flagship flashed with postmarketOS to go full circle.

Source code for liben is available for anyone who dares go near. Compiling from source is straightforward but necessary at the time of writing. As for packaging?

Here’s hoping COVID-19 contact tracing will be obsolete by the time liben hits Debian stable.


  1. Today (Monday) is Labour Day, so this is a 3-day weekend. But I started on Saturday and posted this today, so it technically counts.↩︎

September 04, 2020

Wim Taymans

Wim Taymans talking about the current state of PipeWire


Wim Taymans did an internal demonstration yesterday for the desktop team at Red Hat of the current state of PipeWire. For those still unaware, PipeWire is our effort to bring together audio, video and pro-audio under Linux, creating a smooth and modern experience. Before PipeWire there was PulseAudio for consumer audio, Jack for pro-audio and just unending pain and frustration for video. PipeWire is being done with the aim of being ABI compatible with ALSA, PulseAudio and JACK, meaning that PulseAudio and Jack apps should just keep working on top of PipeWire without the need for rewrites (and with the same low latency for JACK apps).

As Wim reported yesterday, things are coming together, with the PulseAudio, Jack and ALSA backends all being usable if not 100% feature complete yet. Wim has been running his system with PipeWire as the only sound server for a while now and things are now in a state where we feel ready to ask the wider community to test and help provide feedback and test cases.

Carla on PipeWire

Carla running on PipeWire

Carla as shown above is a popular Jack application and it provides among other things this patchbay view of your audio devices and applications. I recommend you all to click in and take a close look at the screenshot above. That is the Jack application Carla running, and as you see, PulseAudio applications like GNOME Settings and Google Chrome are also showing up now thanks to the unified architecture of PipeWire, alongside Jack apps like Hydrogen. All of this without any changes to Carla or any of the other applications shown.

At the moment Wim is primarily testing using Cheese, GNOME Control center, Chrome, Firefox, Ardour, Carla, vlc, mplayer, totem, mpv, Catia, pavucontrol, paman, qsynth, zrythm, helm, Spotify and Calf Studio Gear. So these are the applications you should be getting the most mileage from when testing, but most others should work too.

Anyway, let me quickly go over some of the highlight from Wim’s presentation.

Session Manager

PipeWire now has a functioning session manager that allows for things like

  • Metadata, a system for tagging objects with properties, visible to all clients (if permitted)
  • Load and save of volumes, automatic routing
  • Default source and sink with metadata, saved and loaded as well
  • Moving streams with metadata

Currently this is a simple sample session manager that Wim created himself, but we also have a more advanced session manager called Wireplumber being developed by Collabora, originally for automotive Linux use cases, which we will probably be moving to over time for the desktop as well.

Human readable handling of Audio Devices

Wim took the code and configuration data in PulseAudio for ALSA Card Profiles and created a standalone library that can be shared between PipeWire and PulseAudio. This library handles ALSA sound card profiles, devices, mixers and UCM (the use case manager, used to configure newer audio chips like the one in the Lenovo X1 Carbon), and lets PipeWire provide the correct information to things like GNOME Control Center or pavucontrol. Using the same code as has been used in PulseAudio for this has the added benefit that when you switch from PulseAudio to PipeWire your devices don’t change names. So everything should look and feel just like PulseAudio from an application perspective. In fact just below is a screenshot of pavucontrol, the PulseAudio mixer application, running on top of PipeWire without a problem.

PulseAudio Mixer

Pavucontrol, the PulseAudio mixer on PipeWire

Creating audio sink devices with Jack
Pipewire now allows you to create new audio sink devices with Jack. So the example command below creates a Pipewire sink node out of calfjackhost and sets it up so that we can output for instance the audio from Firefox into it. At the moment you can do that by running your Jack apps like this:

PIPEWIRE_PROPS="media.class=Audio/Sink" calfjackhost

But eventually we hope to move this functionality into the GNOME Control Center or similar so that you can do this setup graphically. The screenshot below shows us using CalfJackHost as an audio sink, outputting the audio from Firefox (a PulseAudio application), with CalfJackHost generating an analyzer graph of the audio.

Calfjackhost on pipewire

The CalfJackhost being used as an audio sink for Firefox

Creating devices with GStreamer
We can also use GStreamer to create PipeWire devices now. The command below takes the popular Big Buck Bunny animation created by the great folks over at Blender and lets you set it up as a video source in PipeWire. So if you always wanted to play back a video inside Cheese, for instance to apply the Cheese effects to it, you can do that this way without Cheese needing to change to handle video playback. As one can imagine, this opens up the ability to string together a lot of applications in interesting ways to achieve things that there might not be an application for yet. Of course application developers can also take more direct advantage of this to easily add features to their applications; for instance, I am really looking forward to something like OBS Studio taking full advantage of PipeWire.

gst-launch-1.0 uridecodebin uri=file:///home/wim/data/BigBuckBunny_320x180.mp4 ! pipewiresink mode=provide stream-properties="props,media.class=Video/Source,node.description=BBB"

Cheese playing a video through PipeWire

Cheese playing a video provided by GStreamer through PipeWire.

How to get started testing PipeWire
Ok, so after seeing all of this you might be thinking, how can I test all of this stuff out and find out how my favorite applications work with PipeWire? Well, the first thing you should do is make sure you are running Fedora Workstation 32 or later as that is where we are developing all of this. Once you’ve done that you need to make sure you got all the needed pieces installed:

sudo dnf install pipewire-libpulse pipewire-libjack pipewire-alsa

Once that dnf command finishes you run the following to get PulseAudio replaced by PipeWire.


cd /usr/lib64/

sudo ln -sf pipewire-0.3/pulse/libpulse-mainloop-glib.so.0 /usr/lib64/libpulse-mainloop-glib.so.0.999.0
sudo ln -sf pipewire-0.3/pulse/libpulse-simple.so.0 /usr/lib64/libpulse-simple.so.0.999.0
sudo ln -sf pipewire-0.3/pulse/libpulse.so.0 /usr/lib64/libpulse.so.0.999.0

sudo ln -sf pipewire-0.3/jack/libjack.so.0 /usr/lib64/libjack.so.0.999.0
sudo ln -sf pipewire-0.3/jack/libjacknet.so.0 /usr/lib64/libjacknet.so.0.999.0
sudo ln -sf pipewire-0.3/jack/libjackserver.so.0 /usr/lib64/libjackserver.so.0.999.0

sudo ldconfig

(you can also find those commands here.)

Once you run these commands you should be able to run

pactl info

and see this as the first line returned:
Server String: pipewire-0

I do recommend rebooting, to be 100% sure you are on a PipeWire system with everything outputting through PipeWire. Once that is done you are ready to start testing!

Our goal is to use the remainder of the Fedora Workstation 32 lifecycle and the Fedora Workstation 33 lifecycle to stabilize and finish the last major features of PipeWire and then start relying on it in Fedora Workstation 34. So I hope this article will encourage more people to get involved and join us on gitlab and on the PipeWire IRC channel at #pipewire on Freenode.

As we are trying to stabilize PipeWire we are working on it on a bug by bug basis atm, so if you end up testing out the current state of PipeWire then be sure to report issues back to us through the PipeWire issue tracker, but do try to ensure you have a good test case/reproducer as we are still so early in the development process that we can’t dig into ‘obscure/unreproducible’ bugs.

Also if you want/need to go back to PulseAudio you can run the commands here

Also if you just want to test a single application and not switch your whole system over you should be able to do that by using the following commands:

pw-pulse

or

pw-jack

Next Steps
So what are our exact development plans at this point? Well here is a list in somewhat priority order:

  1. Stabilize – Our top priority now is to make PipeWire so stable that the power users we hope to attract as our first batch of users are comfortable running PipeWire as their only audio server. This is critical to building up a userbase that can help us identify and prioritize remaining issues and ensure that when we do switch Fedora Workstation over to using PipeWire as the default and only supported audio server it will be a great experience for users.
  2. Jackdbus – We want to implement support for the jackdbus API soon, as we know it’s an important feature for the Fedora Jam folks, so we hope to get to this in the not too distant future.
  3. Flatpak portal for JACK/audio applications – The future of application packaging is Flatpaks and being able to sandbox Jack applications properly inside a Flatpak is something we want to enable.
  4. Bluetooth – Bluetooth has been supported in PipeWire from the start, but as Wim’s focus has moved elsewhere it has gone a little stale. So we are looking at cycling back to it and cleaning it up to get it production ready. This includes proper support for things like LDAC and AAC passthrough, which is currently not handled in PulseAudio. Wim hopes to push an updated PipeWire in Fedora out next week which should at least get Bluetooth into a basic working state, but the big fix will come later.
  5. Pulse effects – Wim has looked at this, but there are some bugs that block data from moving through the pipeline.
  6. Latency compensation – We want complete latency compensation implemented. This is not actually in Jack currently, so it would be a net new feature.
  7. Network audio – PulseAudio-style network audio is not implemented yet.

This is the continuation from these posts: part 1, part 2, part 3

This is the part where it all comes together, with (BYO) fireworks and confetti, champagne and hoorays. Or rather, this is where I'll describe how to actually set everything up. It's a bit complicated because while libxkbcommon does the parsing legwork now, we haven't actually changed other APIs and the file formats which are still 1990s-style nerd cool and requires significant experience in CS [1] to understand what goes where.

The below relies on software using libxkbcommon and libxkbregistry. At the time of writing, libxkbcommon is used by all mainstream Wayland compositors but not by the X server. libxkbregistry is not yet used because I'm typing this before we had a release for it. But at least now I have a link to point people to.

libxkbcommon has an xkbcli-scaffold-new-layout tool that creates the template files shown below. At the time of writing, this tool must be run from the git repo build directory; it is not installed.

I'll explain here how to add the us(banana) variant and the custom:foo option, and I will optimise for simplicity and brevity.

Directory structure

First, create the following directory layout:


$ tree $XDG_CONFIG_HOME/xkb
/home/user/.config/xkb
├── compat
├── keycodes
├── rules
│   ├── evdev
│   └── evdev.xml
├── symbols
│   ├── custom
│   └── us
└── types
If $XDG_CONFIG_HOME is unset, fall back to $HOME/.config.

Rules files

Create the rules file and add an entry to map our custom:foo option to a section in the symbols/custom file.


$ cat $XDG_CONFIG_HOME/xkb/rules/evdev
! option = symbols
custom:foo = +custom(foo)

// Include the system 'evdev' file
! include %S/evdev
Note that no entry is needed for the variant, that is handled by wildcards in the system rules file. If you only want a variant and no options, you technically don't need this rules file.

Second, create the xml file used by libxkbregistry to display your new entries in the configuration GUIs:


$ cat $XDG_CONFIG_HOME/xkb/rules/evdev.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xkbConfigRegistry SYSTEM "xkb.dtd">
<xkbConfigRegistry version="1.1">
<layoutList>
<layout>
<configItem>
<name>us</name>
</configItem>
<variantList>
<variant>
<configItem>
<name>banana</name>
<shortDescription>banana</shortDescription>
<description>US(Banana)</description>
</configItem>
</variant>
</variantList>
</layout>
</layoutList>
<optionList>
<group allowMultipleSelection="true">
<configItem>
<name>custom</name>
<description>custom options</description>
</configItem>
<option>
<configItem>
<name>custom:foo</name>
<description>This option does something great</description>
</configItem>
</option>
</group>
</optionList>
</xkbConfigRegistry>
Our variant needs to be added as a layoutList/layout/variantList/variant, the option to the optionList/group/option. libxkbregistry will combine this with the system-wide evdev.xml file in /usr/share/X11/xkb/rules/evdev.xml.

Overriding and adding symbols

Now to the actual mapping. Add a section to each of the symbols files that matches the variant or option name:


$ cat $XDG_CONFIG_HOME/xkb/symbols/us
partial alphanumeric_keys modifier_keys
xkb_symbols "banana" {
name[Group1]= "Banana (us)";

include "us(basic)"

key <CAPS> { [ Escape ] };
};
With this, the us(banana) layout will be a US keyboard layout but with the CapsLock key mapped to Escape. What about our option? Mostly the same, let's map the tilde key to nothing:

$ cat $XDG_CONFIG_HOME/xkb/symbols/custom
partial alphanumeric_keys modifier_keys
xkb_symbols "foo" {
key <TLDE> { [ VoidSymbol ] };
};
A note here: NoSymbol means "don't overwrite it" whereas VoidSymbol is "map to nothing".

Notes

You may notice that the variant and option sections are almost identical. XKB doesn't care about variants vs options, it only cares about components to combine. So the sections do what we expect of them: variants include enough other components to make them a full keyboard layout, options merely define a few keys so they can be combined with layouts (variants). Due to how the lookups work, you could load the option template as layout custom(foo).

For the actual documentation of keyboard configuration, you'll need to google around; there are quite a few posts on how to map keys. All that has changed is where things are loaded from and how; the actual data formats are unchanged.

If you wanted to install this as system-wide custom rules, replace $XDG_CONFIG_HOME with /etc.

The above is a replacement for xmodmap. It does not require a script to be run manually to apply the config; the existing XKB integration will take care of it. It will work in Wayland (but as said above not in X, at least not for now).

A final word

Now, I fully agree that this is cumbersome, clunky and feels outdated. This is easy to fix: all that is needed is for someone to develop a better file format, make sure it's backwards compatible with the full spec of the XKB parser (the above is a fraction of what it can do), make sure the old files can be generated from the new format to reduce maintenance, and then maintain backwards compatibility with the current format for the next ten or so years. Should be a good Google Decade of Code beginner project.

[1] Cursing and Swearing

This is the continuation from these posts: part 1, part 2, part 3 and part 4.

In the posts linked above, I describe how it's possible to have custom keyboard layouts in $HOME or /etc/xkb that will get picked up by libxkbcommon. This only works for the Wayland stack; the X stack doesn't use libxkbcommon. In this post I'll explain why it's unlikely this will ever happen in X.

As described in the previous posts, users configure with rules, models, layouts, variants and options (RMLVO). What XKB uses internally though are keycodes, compat, geometry, symbols and types (KcCGST) [1].

There are, effectively, two KcCGST keymap compilers: libxkbcommon and xkbcomp. libxkbcommon can go from RMLVO to a full keymap; xkbcomp relies on other tools (e.g. setxkbmap) which in turn use a utility library called libxkbfile to parse rules files. The X server has a copy of the libxkbfile code. It doesn't use libxkbfile itself but it relies on the header files provided by it for some structs.

Wayland's keyboard configuration works like this:

  • the compositor decides on the RMLVO keyboard layout, through an out-of-band channel (e.g. gsettings, weston.ini, etc.)
  • the compositor invokes libxkbcommon to generate a KcCGST keymap and passes that full keymap to the client
  • the client compiles that keymap with libxkbcommon and feeds any key events into libxkbcommon's state tracker to get the right keysyms
The advantage we have here is that only the full keymap is passed between entities. Changing how that keymap is generated does not affect the client. This, coincidentally [2], is also how Xwayland gets the keymap passed to it and why Xwayland works with user-specific layouts.
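As a rough sketch of that flow using the libxkbcommon API (error handling omitted; this is an illustration, not any compositor's actual code):

#include <xkbcommon/xkbcommon.h>

/* compositor side: RMLVO in, serialized full keymap out */
static char *
compile_keymap_string(struct xkb_context *ctx)
{
    struct xkb_rule_names names = {
        .rules = "evdev", .model = "pc105",
        .layout = "us", .variant = "banana", .options = "custom:foo",
    };
    struct xkb_keymap *keymap = xkb_keymap_new_from_names(
        ctx, &names, XKB_KEYMAP_COMPILE_NO_FLAGS);
    char *str = xkb_keymap_get_as_string(keymap, XKB_KEYMAP_FORMAT_TEXT_V1);
    xkb_keymap_unref(keymap);
    return str; /* handed to clients, e.g. via the wl_keyboard protocol */
}

/* client side: compile the received string and look up keysyms */
static xkb_keysym_t
lookup_sym(struct xkb_context *ctx, const char *keymap_string,
           xkb_keycode_t keycode)
{
    struct xkb_keymap *keymap = xkb_keymap_new_from_string(
        ctx, keymap_string, XKB_KEYMAP_FORMAT_TEXT_V1,
        XKB_KEYMAP_COMPILE_NO_FLAGS);
    struct xkb_state *state = xkb_state_new(keymap);
    /* a real client feeds every key press/release into the state
     * (xkb_state_update_key) before doing lookups */
    xkb_keysym_t sym = xkb_state_key_get_one_sym(state, keycode);
    xkb_state_unref(state);
    xkb_keymap_unref(keymap);
    return sym;
}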

X works differently. Notably, KcCGST can come in two forms: the partial form specifying component names only, and the full keymap. The partial form looks like this:


$ setxkbmap -print -layout fr -variant azerty -option ctrl:nocaps
xkb_keymap {
    xkb_keycodes { include "evdev+aliases(azerty)" };
    xkb_types { include "complete" };
    xkb_compat { include "complete" };
    xkb_symbols { include "pc+fr(azerty)+inet(evdev)+ctrl(nocaps)" };
    xkb_geometry { include "pc(pc105)" };
};
This defines the component names but not the actual keymap, punting that to the next part in the stack. This will turn out to be the Achilles heel. Keymap handling in the server has two distinct approaches:
  • During keyboard device init, the input driver passes RMLVO to the server, based on defaults or xorg.conf options
  • The server has its own rules file parser and creates the KcCGST component names (as above)
  • The server forks off xkbcomp and passes the component names to stdin
  • xkbcomp generates a keymap based on the components and writes it out as XKM file format
  • the server reads in the XKM format and updates its internal structs
This has been the approach for decades. To give you an indication of how fast-moving this part of the server is: XKM caching was the latest feature added... in 2009.

Driver initialisation is nice, but barely used these days. You set your keyboard layout in e.g. GNOME or KDE and that will apply it in the running session. Or run setxkbmap, for those with a higher affinity to neckbeards. setxkbmap works like this:

  • setxkbmap parses the rules file to convert RMLVO to KcCGST component names
  • setxkbmap calls XkbGetKeyboardByName and hands those component names to the server
  • The server forks off xkbcomp and passes the component names to stdin
  • xkbcomp generates a keymap based on the components and writes it out as XKM file format
  • the server reads in the XKM format and updates its internal structs
Notably, the RMLVO to KcCGST conversion is done on the client side, not the server side. And the only way to send a keymap to the server is that XkbGetKeyboardByName request - which only takes KcCGST; you can't even pass it a full keymap. This is also a long-standing potential issue with XKB: if your client tools use different XKB data files than the server, you don't get the keymap you expected.

Other parts of the stack do basically the same as setxkbmap which is just a thin wrapper around libxkbfile anyway.

Now, you can use xkbcomp on the client side to generate a keymap, but you can't hand it as-is to the server. xkbcomp can do this (using libxkbfile), updating the server's XKB state piece by piece (XkbSetMap, XkbSetCompatMap, XkbSetNames, etc.). But at this point you're asking the server to knowingly compile a wrong keymap before updating parts of it.

So, realistically, the only way to get user-specific XKB layouts into the X server would be to update libxkbfile to provide the same behavior as libxkbcommon, update the server to actually use libxkbfile instead of its own copy, and update xkbcomp to support the changes described in part 2 and part 3. All while ensuring no regressions in code that's decades old, barely maintained, has no tests, and is, let's be honest, not particularly pretty to look at. User-specific XKB layouts are somewhat of a niche case to begin with, so I don't expect anyone to ever volunteer to do this work [3], much less find the resources to review and merge that code. The X server is unlikely to see another real release, and this is definitely not something you want to sneak in in a minor update.

The other option would be to extend XKB-the-protocol with a request to send a full keymap to the server. Given the inertia involved and that the server won't see more full releases, this is not going to happen.

So as a summary: if you want custom keymaps on your machine, switch to Wayland (and/or fix any remaining issues preventing you from doing so) instead of hoping this will ever work on X. xmodmap will remain your only solution for X.

[1] Geometry is so pointless that libxkbcommon doesn't even implement this. It is a complex format to allow rendering a picture of your keyboard but it'd be a per-model thing and with evdev everyone is using the same model, so ...
[2] totally not coincidental btw
[3] libxkbcommon has been around for a decade now and no-one has volunteered to do this in the years since, so...

The Optimizations Continue

Optimizing transfer_map is one of the first issues I created, and it’s definitely one of the most important, at least as it pertains to unit tests. So many unit tests perform reads on buffers that it’s crucial to ensure no unnecessary flushing or stalling is happening here.

Today, I’ve made further strides in this direction for piglit’s spec@glsl-1.30@execution@texelfetch fs sampler2d 1x281-501x281:

before

MESA_LOADER_DRIVER_OVERRIDE=zink bin/texelFetch 4.41s user 1.92s system 71% cpu 8.801 total

after

MESA_LOADER_DRIVER_OVERRIDE=zink bin/texelFetch 4.22s user 1.72s system 76% cpu 7.749 total

More Speed Loops

As part of ensuring test coherency, a lot of explicit fencing was added around transfer_map and transfer_unmap. Part of this was to work around the lack of good barrier usage, and part of it was just to make sure that everything was properly synchronized.

An especially non-performant case of this was in transfer_unmap, where I added a fence to block after the buffer was unmapped. In reality, this makes no sense other than for the case of synchronizing with the compute batch, which is a bit detached from everything else.

The reason this fence continued to be needed comes down to barrier usage for descriptors. At the start, all image resources used for descriptors just used VK_IMAGE_LAYOUT_GENERAL for the barrier layout, which is fine and good; certainly this is spec compliant. It isn’t, however, optimally informing the underlying driver about the usage for the case where the resource was previously used with VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL to copy data back from a staging image.

Instead, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL can be used for sampler images since they’re read-only. This differs from VK_IMAGE_LAYOUT_GENERAL in that VK_IMAGE_LAYOUT_GENERAL contains both read and write—not what’s actually going on in this case given that sampler images are read-only.
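As a rough illustration (not zink's actual barrier code; image and cmdbuf here are stand-ins), transitioning an image from the transfer-destination layout to the read-only sampling layout looks something like this:

VkImageMemoryBarrier barrier = {
   .sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER,
   .srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT,
   .dstAccessMask = VK_ACCESS_SHADER_READ_BIT,
   .oldLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL,
   .newLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL,
   .srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
   .dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
   .image = image,
   .subresourceRange = {
      .aspectMask = VK_IMAGE_ASPECT_COLOR_BIT,
      .levelCount = 1,
      .layerCount = 1,
   },
};
vkCmdPipelineBarrier(cmdbuf,
                     VK_PIPELINE_STAGE_TRANSFER_BIT,
                     VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
                     0, 0, NULL, 0, NULL, 1, &barrier);

The more precise layout and access flags give the underlying driver license to keep the image in a sampling-friendly state instead of the catch-all GENERAL one.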

No Other Results

I’d planned to post more timediffs from some piglit results, but I hit some weird i915 bugs and so the tests have yet to complete after ~14 hours. More to come next week.

September 03, 2020

A Quick Optimization

As I mentioned last week, I’m turning a small part of my attention now to doing some performance improvements. One of the low-hanging fruits here is adding buffer ranges; in short, this means that for buffer resources (i.e., not images), the driver tracks the ranges in memory of the buffer that have data written, which allows avoiding GPU stalls when trying to read from or write to a range in the buffer that’s known to not have anything written.

util_range

The util_range API in gallium is extraordinarily simple. Here’s the key parts:

struct util_range {
   unsigned start; /* inclusive */
   unsigned end; /* exclusive */

   /* for the range to be consistent with multiple contexts: */
   simple_mtx_t write_mutex;
};

/* This is like a union of two sets. */
static inline void
util_range_add(struct pipe_resource *resource, struct util_range *range,
               unsigned start, unsigned end)
{
   if (start < range->start || end > range->end) {
      if (resource->flags & PIPE_RESOURCE_FLAG_SINGLE_THREAD_USE) {
         range->start = MIN2(start, range->start);
         range->end = MAX2(end, range->end);
      } else {
         simple_mtx_lock(&range->write_mutex);
         range->start = MIN2(start, range->start);
         range->end = MAX2(end, range->end);
         simple_mtx_unlock(&range->write_mutex);
      }
   }
}

static inline boolean
util_ranges_intersect(const struct util_range *range,
                      unsigned start, unsigned end)
{
   return MAX2(start, range->start) < MIN2(end, range->end);
}

When the driver writes to the buffer, util_range_add() is called with the range being modified. When the user then tries to map the buffer using a specified range, util_ranges_intersect() is called with the range being mapped to determine if a stall is needed or if it can be skipped.
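A sketch of how a driver might wire that up on the map side (the helper and struct names here are hypothetical; only util_range_add() and util_ranges_intersect() are the real gallium API):

/* hypothetical driver-side check, not zink's actual code */
static void *
example_transfer_map(struct example_resource *res,
                     unsigned offset, unsigned size)
{
   /* the requested range overlaps no recorded writes: map away */
   if (!util_ranges_intersect(&res->valid_range, offset, offset + size))
      return example_map_memory(res, offset);

   /* otherwise the gpu may still be writing here, so stall first */
   example_wait_for_gpu(res);
   return example_map_memory(res, offset);
}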

Where do these ranges get added for the buffer?

Lots of places. Here’s a quick list:

  • struct pipe_context::blit
  • struct pipe_context::clear_texture
  • struct pipe_context::set_shader_buffers
  • struct pipe_context::set_shader_images
  • struct pipe_context::resource_copy_region
  • struct pipe_context::create_stream_output_target
  • resource unmap and flush hooks

Some Numbers

The tests are still running to see what happens here, but I found some interesting improvements using my piglit timediff display after rebasing my branch yesterday for the first time in a bit and running the tests again overnight:

piglit-dmat.png

I haven’t made any changes to this codepath myself (that I’m aware of), so it looks like I’ve pulled in some improvement that’s massively cutting down the time required for the codepath that handles implicit 32bit -> 64bit conversions, and timediff picked it up.

September 01, 2020

Benchmarking vs Testing

Just a quick post for today to talk about a small project I undertook this morning.

As I’ve talked about repeatedly, I run a lot of tests.

It’s practically all I do when I’m not dumping my random wip code by the truckload into the repo.

The problem with this approach is that it doesn’t leave much time for performance improvements. How can I possibly have time to run benchmarks if I’m already spending all my time running tests?

Motivation

As one of the greatest bench markers of all time once said, “If you want to turn a vision into reality, you have to give 100% and never stop believing in your dream.”

Thus I decided to look for those gains while I ran tests.

Piglit already possesses facilities for providing HTML summaries of all tests (sidebar: the results that I posted last week are actually a bit misleading since they included a number of regressions that have since been resolved, but I forgot to update the result set). This functionality includes providing the elapsed time for each test. It has not, however, included any way of displaying changes in time between result sets.

Until now.

The Future Is Here

With my latest MR, the HTML view can show off those big gains (or losses) that have been quietly accumulating through all these high volume sets of unit tests:

piglit-timing.png

Is it more generally useful to other drivers?

Maybe? It’ll pick up any unexpected performance regressions (inverse gains) too, and it doesn’t need any extra work to display, so it’s an easy check, at the least.

August 31, 2020

This weekend the X1 Carbon with Fedora Workstation went live in North America on Lenovo's webstore. This is a big milestone for us and for Lenovo, as it's the first time Fedora ships pre-installed on a laptop from a major vendor and it's the first time the world's largest laptop maker ships premium laptops with Linux directly to consumers. Currently only the X1 Carbon is available, but more models are on the way and more geographies will get added soon too. As a sidenote, the X1 Carbon and more have actually been available from Lenovo for a couple of months now; it is just the web sales that went online now. So if you are an IT department buying Lenovo laptops in bulk, be aware that you can already buy the X1 Carbon and the P1, for instance, through the direct-to-business sales channel.

Also, as a reminder for people looking to deploy Fedora laptops or workstations in numbers, be sure to check out Fleet Commander, our tool for helping you manage configurations across such a fleet.

I am very happy with the work that has been done here to get to this point, both by Lenovo and by the engineers on my team here at Red Hat. For example, Lenovo made sure to get all of their component makers to ramp up their Linux support, and we have been working with them, both helping them get started writing drivers for Linux and helping add infrastructure they could plug their hardware into. We also worked hard to get them all set up on the Linux Vendor Firmware Service so that you can be assured of getting updated firmware not just for the laptop itself, but also for its components.

We also have a list of improvements that we are working on to ensure you get the full benefit of your new laptops with Fedora and Lenovo, including improved power management features: power consumption profiles that include a high performance mode for some laptops, allowing them to run even faster when on AC power, and on the other end a low power mode to maximize battery life. As part of that we are also working on adding lap detection support, so that we can ensure your laptop doesn't risk running too hot in your lap and burning you, and that radio antennas aren't transmitting too strongly when that close to your body.

So I hope you decide to take the leap and get one of the great developer laptops we are doing together with Lenovo. This is a unique collaboration between the world's largest laptop maker and the world's largest Linux company. What we are doing here isn't just a minimal hardware enablement effort, but a concerted effort to evolve Linux as a laptop operating system, done in a proper open source way. This is the culmination of our work over the last few years: creating the LVFS, adding Thunderbolt support to Linux, improving fingerprint reader support in Linux, supporting HiDPI screens, supporting HiDPI mice, creating the possibility of a secure desktop with Wayland, working with NVidia to ensure that Mesa and the NVidia driver can co-exist through glvnd, and creating Flatpak to ensure we can bring the advantages of containers to the desktop space and at the same time do it in a vendor-neutral way. So when you buy a Lenovo laptop with Fedora Workstation, you are not just getting a great system, but you are also supporting our efforts to take Linux to the next level, something which I think we are truly the only Linux vendor with the scale and engineering ability to do.

Of course we are not stopping here, so let me also use this chance to talk a bit about some of our other efforts.

Toolbox
Containers are popular for deploying software, but a lot of people are also discovering now that they are an incredible way to develop software, even if that software is not going to be deployed as a Flatpak or Kubernetes container. The term often used for containers used as a development tool is pet containers, and with the Toolbox project we are aiming to create the best tool possible for developers to work with pet containers. Toolbox allows you to always have a clean environment to work in, which you can change to suit each project you work on, however you like, without affecting your host system. So for instance if you need to install a development snapshot of Python you can do that inside your Toolbox container and be confident that various other parts of your desktop will not start crashing due to the change. And when you are done with your project and don't want that toolbox around anymore, you can easily delete it without having to spend time figuring out which packages you installed can now be safely uninstalled from your host system, or just not bothering and having your host get bloated over time with stuff you are not actually using anymore.

One big advantage we have at Red Hat is that we are a major contributor to container technologies across the stack. We are a major participant in the Open Container Initiative and we are, alongside Google, the biggest contributor to the Kubernetes project. This includes having created a set of container tools called Podman. So when we started prototyping Toolbox we could base it on Podman and get access to all the power and features that Podman provides, but at the same time make them easier to use and consume from your developer laptop or workstation.

Our initial motivation was also driven by the fact that for image-based operating systems like Fedora Silverblue and Fedora CoreOS, where the host system is immutable, you still need some way to install packages and do development, but we quickly realized that the pet container development model is superior to the old 'on host' model even if you are using a traditional package-based system like Fedora Workstation. So we started out by prototyping the baseline functionality, writing it as a shell script to quickly test out our initial ideas. Of course, as Toolbox picked up in popularity we realized we needed to transition quickly to a proper development language so that we wouldn't end up with an unmaintainable mess written in shell, and thus Debarshi Ray and Ondřej Míchal have recently completed the rewrite to Go (note: the choice of Go was to make it easier for the wider container community to contribute, since almost all container tools are written in Go).

Leading up towards Fedora Workstation 33 we are trying to figure out a few things. One is how we can give you access to a RHEL-based toolbox through the Red Hat Developer Program in an easy and straightforward manner, and this is another area where pet container development shines. You can set up your pet container to run a different Linux version than your host. So you can use Fedora to get the latest features for your laptop, but target RHEL inside your Toolbox to get an easy and quick deployment path to your company's RHEL servers. I would love it if we could extend this even further as we go along, to for instance let you set up a Steam runtime toolbox to do game development targeting Steam.
Setting up a RHEL toolbox is already technically possible, but requires a lot more knowledge and understanding of the underlying technologies than we wish.
The second thing we are looking at is how we deal with graphical applications in the context of these pet containers. The main reason is that while you can install, for instance, Visual Studio Code inside the toolbox container and launch it from the command line, we realize that is not a great model for how you interact with GUI applications. At the moment the only IDE that is set up to run on the host but is able to interact with containers properly is GNOME Builder, but we realize that there are a lot more IDEs people are using, and thus we want to come up with ways to make them work better with toolbox containers beyond launching them from the command line from inside the container. There are some extensions available for things like Visual Studio Code that are starting to improve things (those extensions are not created by us, but look at solving a similar problem), but we want to see how we can help provide a polished experience here. Over time we do believe the pet container model of development is so good that most IDEs will follow in GNOME Builder's footsteps and make in-container development a core part of the feature set, but for now we need to figure out a good bridging strategy.

Wayland – headless and variable refresh rate.
Since switching to Wayland we have continued working on improving how GNOME works under Wayland, removing any major feature regressions from X11 and starting to take advantage of the opportunities that Wayland gives us. One of the things Jonas Ådahl has been hard at work on recently is ensuring we have headless support for running GNOME on systems without a screen. We know that there are a lot of sysadmins, for instance, who want to be able to launch a desktop session on their servers to be used as a tool to test and debug issues. These desktops are then accessed through tools such as VNC or Nice DCV. As part of that work he also made sure we could deal with multiple connected monitors that have different refresh rates. Before that fix you would get the lowest common denominator between your screens, but now if you for instance have a 60Hz monitor and a 75Hz monitor they will be able to function independently of each other and run at their maximum refresh rates. With the variable refresh rate work now landed upstream, Jonas is racing to get the headless support finished and landed in time for Fedora Workstation 33.

Linux Vendor Firmware Service
Richard Hughes is continuing his work on moving the LVFS forward, having spent time this cycle working with the Linux Foundation to ensure the service can scale even better. He is also continuously onboarding new vendors and helping existing vendors use LVFS for even more things. LVFS has become so popular that we are now getting reports of major hardware companies, who up to now haven't been too interested in the LVFS, being told by their customers to start using it or the customers will switch suppliers. So expect the rapid growth of vendors joining the LVFS to keep increasing. It is also worth noting that many of the vendors who are already set up on LVFS are steadily working on increasing the number of systems they support on it and pushing their suppliers to do the same. Also, for enterprise use of LVFS firmware, Marc Richter wrote an article on access.redhat.com about how to use LVFS with Red Hat Satellite. Satellite, for those who don't know it, is Red Hat's tool for managing and keeping servers up to date and secure. For large companies, having their machines, especially servers, access LVFS directly is not wanted behaviour, so now they can use Satellite to provide a local repository of the LVFS firmware.

PipeWire
One of the changes we have been working on that I am personally extremely excited about is PipeWire. For those of you who don't know it, PipeWire is one of our major swamp-draining efforts, which aims to bring together audio, pro-audio and video under Linux and provide a modern infrastructure for us to move forward. It does so, however, while being ABI compatible with both Jack and PulseAudio, meaning that applications will not need to be ported to work with PipeWire. We have been using it for a while already for video, to handle screen capture under Wayland and to allow Flatpak containers access to webcams in a secure way, and Wim Taymans has been working tirelessly on moving the project forward over the last 6 months, focusing a lot on fixing corner cases in the Jack support and also ramping up the PulseAudio support. We had hoped to start wide testing of the audio parts of PipeWire in Fedora Workstation 32, but since a key advantage that PipeWire brings is not just to replace Jack or PulseAudio, but also to ensure the two usecases co-exist and interact properly, we didn't want to start asking people to test until we got the PulseAudio support close to being production ready. Wim has been making progress by leaps and bounds recently, and while I can't 100% promise it yet, we do expect to roll out the audio bits of PipeWire for more widescale testing in Fedora Workstation 33, with the goal of making it the default for Fedora Workstation 34 or, more likely, Fedora Workstation 35.
Wim is doing an internal demo this week, so I will try to put out a blog post talking about that later in the week.

Flatpak – incremental updates
One of the features we added to Flatpak was the ability to distribute Flatpaks as Open Container Initiative compliant containers. The reason for this was that as companies, Red Hat included, built infrastructure for hosting and distributing containers, we could also use that for Flatpaks. This is obviously a great advantage for a variety of reasons, but it had one large downside compared to the traditional way of distributing Flatpaks (as OSTree images), which is that each update comes as a single large download, as opposed to the atomic update model that OSTree provides.
Which is why, if you compare the same application shipping from Flathub, which uses OSTree, versus from the Fedora container registry, you will quickly notice that you get much smaller updates from Flathub. For Kubernetes containers this hasn't been considered a huge problem, as their main usecase is copying containers around on a high-speed network inside your cloud provider, but for desktop users this is annoying. So Alex Larsson and Owen Taylor have been working on a way to do incremental updates for OCI/Docker/Kubernetes containers too, which not only means we can get very close to the Flathub update size in the Fedora Container Catalog, but also means, since we implemented this in a way that works for all OCI/Kubernetes containers, that those will get incremental update functionality too. Especially as such containers are making their way into edge computing, where update sizes matter just like they do on the desktop.

Hangul input under Wayland
Red Hat, like Lenovo, targets most of the world with our products and projects. This means that we want them to work great even for people who don't use English or another European language. To achieve this we have a team, as part of my group at Red Hat, dedicated to ensuring that not just Linux, but all Red Hat products work well for international users. That team, led by Jens Petersen, is distributed around the globe with engineers in Japan, China, India, Singapore and Germany. This team contributes to a lot of different things like font maintenance, input method development, i18n infrastructure and more.
One thing this team recently discovered was that support for Korean input under Wayland was lacking. So Peng Wu, Takao Fujiwara and Carlos Garnacho worked together to come up with a series of patches for ibus and GNOME Shell to ensure that Fedora Workstation on Wayland works perfectly for Korean input. I wanted to highlight this effort because, while I don't usually mention efforts with such a regional impact in my blog posts, it is a critical part of keeping Linux viable and usable across the globe. Ensuring that you can use your computer in your own language is something we feel is important and want to enable, and it is also an area where I believe Red Hat is investing more than any other vendor out there.

GLX on EGL
We meet with NVidia on a regular basis to discuss topics of shared interest, and one thing we have been looking at for a while now is the best way to support the NVidia binary driver under XWayland. As part of that, Adam Jackson has been working on a research project to see how feasible it would be to create a way to run GLX applications on top of EGL. As one might imagine, EGL doesn't have a 1-to-1 match with the GLX APIs, but from what we have seen so far it should be close enough to get things going (Adam already got glxgears running :). The goal here is to have an initial version that works ok, and then in collaboration with NVidia evolve it into a great solution for even the most demanding OpenGL/GLX applications. Currently the code causes an extra memcpy compared to running on native GLX, but this is something we think can be resolved in collaboration with NVidia. Of course this is still an early-stage effort and Adam and NVidia are currently looking at it, so there is of course a chance we will hit a snag and have to go back to the drawing board. For those interested, you can take a look at this Mesa merge request to see the current state.

Hi!

I haven’t said Hi for a while when starting a post. I think the rush and the whirlwind of things happening during the GSoC made me a little agitated. This year, my project was the only one accepted for X.Org Foundation, and I felt a great responsibility. Well, this is the last week of the project, and I’m slowing down and breathing :)

This report is a summary of my journey at Google Summer of Code 2020. Experience and technical reports can be read in more detail in the 11 blog posts I published during this period:

Date Post
2020/05/13 I’m in - GSoC 2020 - X.Org Foundation
2020/05/20 Everyone makes a script
2020/06/02 Status update - Tie up loose ends before starting
2020/06/03 Walking in the KMS CURSOR CRC test
2020/06/15 Status update - connected errors
2020/07/06 GSoC First Phase - Achievements
2020/07/17 Increasing test coverage in VKMS - max square cursor size
2020/08/12 The end of an endless debugging of an endless wait
2020/08/19 If a warning remains, the job is not finished.
2020/08/27 Another day, another mistery
2020/08/28 Better validation of alpha-blending

So, the Google Summer of Code was an amazing experience! I have evolved not only technically, but also as a developer in a software community. In this period, I could work on different projects and even interact with their community. I contributed to three projects with various purposes, sizes, and maturities, and I describe below a little of my relationship with each of them:

DRM/Linux Kernel

The Linux kernel is one of the largest, most famous, and most mature free and open-source projects. It is also the kernel that I have been using for over ten years. The development of Linux is so interesting to me that I chose it as a case study of my Master's in Computer Science research.

Among the various subsystems in the project, I have contributed to DRM, the part of Linux responsible for interfacing with GPUs. It provides user-space with an API to send commands and data in a format suitable for modern GPUs. It is the kernel-space component of graphics stacks like the X.Org Server.

And it was the X.Org Foundation that supported my project in GSoC. Thanks to the support from the DRI community and the X.Org Foundation, I have contributed over the past few months to improving VKMS. The Virtual Kernel Mode Setting is a software-only model of a KMS driver that allows you to test DRM and run X on systems without hardware display capability.

IGT GPU Tools

IGT is a set of tools used to develop and validate DRM drivers. These tools can be used by drivers other than Intel's, and I used them extensively to evolve VKMS. Using IGT to improve VKMS can be very useful for validating patches sent to the core of DRM, that is, performing automated tests against new code changes with no need for real hardware.

With this concern, all my work on GSoC aimed to bring greater stability to IGT test execution using VKMS. IGT test cases guided most of my contributions to VKMS. Before sending any code, I took care to validate that, with my change, the tests I am familiar with remained stable and working properly.

KWorkFlow

Kworkflow is a set of scripts that I use in my development environment for the Linux kernel. It greatly facilitates the execution of routine tasks of coding, examining the code, and sending patches.

My mentor, Rodrigo Siqueira, developed it, and other students and former students of computer science at the university contributed to adding functionality and evolving the project. It supports compiling and deploying your local kernel version, helps you browse the code and the change history, and provides essential information for patch formatting.

With these three projects, I had an exciting development journey and many lessons learned to share. Here is a summary of that experience:

From start to finish

The general purpose of my GSoC project was to evolve VKMS using IGT tests. In this way, I used the kms_cursor_crc test case as a starting point to fix and validate features, adjust behaviors, and increase test coverage in VKMS.


[KMS cursor crc uses]
the display CRC support to validate cursor plane functionality.
The test will position the cursor plane either fully onscreen,
partially onscreen, or fully offscreen, using either a fully opaque
or fully transparent surface. In each case, it enables the cursor plane
and then reads the PF CRC (hardware test) and compares it with the CRC
value obtained when the cursor plane was disabled and its drawing is
directly inserted on the PF by software.

In my project proposal, I presented the piglit statistics for kms_cursor_crc using VKMS: seven test cases successful, two failures, one warning (under development), and 236 skips. Now, I can present an overview of this state and the improvements mapped and applied during this GSoC period:

Failing tests

Initially, three test cases failed using VKMS. Before GSoC started, I had already sent a proposal that moved the first from failure to success with a warning; it was related to the composition of planes considering the alpha channel. The second was a case that had already worked two years ago. The third had never worked before. These last two were related to the behavior of the cursor plane in power management tasks.

From failure to warning

Since VKMS did not consider the alpha channel for blending, it zeroed that channel before computing the CRC. However, the operation that was supposed to do this was zeroing the wrong channel, due to an endianness trap. This led the test case to failure. Even after fixing this operation, the test still emitted a warning. This happened because, when zeroing the alpha channel, VKMS was delivering a fully transparent black primary plane for capturing the CRC, while the background should have been a solid black.

A cross-cutting problem that affected the performance of VKMS in sequential subtest execution

During the sequential execution of the kms_cursor_crc test cases, the results were unstable. Successful tests failed, two runs of the same test alternated between failure and success … in short, a mess.

In debugging and examining the change history, I found a commit that changed the VKMS performance of the kms_cursor_crc test cases. This change replaced the drm_wait_for_vblanks function with drm_wait_flip_done and, therefore, VKMS stopped "forcing" vblank interrupts during the state commit process. Without vblank interrupts, the execution of test cases in sequence got stuck. Forcing vblanks was just a stopgap and not a real solution. Besides, the IGT test itself was leaving a kind of trash behind when a failure happened, since it did not complete the cleanup, affecting the next subtest.

Skips due to unmet test requirements

Skips were caused by the lack of the following features in VKMS:

  • Cursor sizes larger than 64x64 and non-square cursors (few drivers take this approach)
  • Support for more than one CRTC (still not developed):

  Test requirement not met in function igt_require_pipe, file ../lib/igt_kms.c:1900:
  Test requirement: !(!display->pipes[pipe].enabled)
  Pipe F does not exist or not enabled

So, for each of these issues, I needed to identify the problem, find out which project the problem was coming from, map what was required to resolve it, and then code the contribution. With that, I had to combine work on two fronts: DRM and IGT. Also, with the more intensive use of my development environment, I needed to develop some improvements in the tool that supports me in compiling, installing, and exploring the kernel using a virtual machine. As a consequence, I also sent some contributions to the Kworkflow project.

Patches sent during GSoC 2020 to solve the above issues

## Project Patch Status Blog post
01 DRM/Linux kernel drm/vkms: change the max cursor width/height Accepted Link
02 DRM/Linux kernel drm/vkms: add missing drm_crtc_vblank_put to the get/put pair on flush Discarded -
03 DRM/Linux kernel drm/vkms: fix xrgb on compute crc Accepted -
04 DRM/Linux kernel drm/vkms: guarantee vblank when capturing crc (v3) Accepted Link
05 DRM/Linux kernel drm/vkms: add alpha-premultiplied color blending (v2) Accepted Link
06 IGT GPU Tools lib/igt_fb: remove extra parameters from igt_put_cairo_ctx Accepted Link
07 IGT GPU Tools [i-g-t,v2,1/2] lib/igt_fb: change comments with fd description Accepted Link
08 IGT GPU Tools [i-g-t,v2,2/2] test/kms_cursor_crc: update subtests descriptions and some comments Accepted Link
09 IGT GPU Tools [i-g-t,v2,1/1] test/kms_cursor_crc: release old pipe_crc before create a new one Accepted Link
10 IGT GPU Tools [i-g-t,2/2] test/kms_cursor_crc: align the start of the CRC capture to a vblank Discarded Link
11 IGT GPU Tools [i-g-t] tests/kms_cursor_crc: refactoring cursor-alpha subtests Under review Link
12 Kworkflow src: add grep utility to explore feature Merged Link
13 Kworkflow kw: small issue on u/mount alert message Merged Link
14 Kworkflow Add support for deployment in a debian-VM Under review Link

With the patches sent, I collaborated to:

  • Increasing test coverage in VKMS
  • Bug fixes in VKMS
  • Adjusting part of the VKMS design to its peculiarities as a fake driver, stabilizing its performance
  • Fixing a leak in kms_cursor_crc cleanup when a testcase fails
  • Improving different parts of the IGT tool documentation
  • Increasing the effectiveness of testing cursor plane alpha-composition
  • Improving my development environment and increasing my productivity.

Not only coding, getting involved in the community

In addition to sending code improvements, my initial proposal included adding support for real overlay planes. However, I participated in other community activities that led me to adapt a portion of my project to meet emerging and more urgent demands.

I took a long time debugging VKMS's unstable behavior together with other DRM community members. Initially, I was working in isolation on the solution. However, I realized that the problem had a history and that it would be more productive to talk to other developers to find a suitable solution. When I saw on the mailing list that another developer had encountered a similar problem, I joined the conversation. This experience was very enriching, and I had the support and guidance of the DRM maintainer, Daniel Vetter. Debugging together with Daniel and Sidong Yang, we got a better solution to the problem. Finally, it seems that, somehow, this debate contributed to another developer, Leandro Ribeiro, in his work on another VKMS issue.

In this debugging process, I also gained more experience and confidence concerning the project. So, I reviewed and tested some patches sent to VKMS [1,2,3,4]. Finally, with the knowledge acquired, I was also able to contribute to the debugging of a feature under development that adds support for writeback [5].

Discussion and future works

The table below summarizes my progress in stabilizing the execution of kms_cursor_crc using VKMS.

Status Start End
pass 7 25 (+18)
warn 1 0 (-1)
fail 2 0 (-2)
skip 236 221 (-15)
total 246 246

To solve the warning triggered by the test case pipe-A-cursor-alpha-transparent, I needed to develop an already mapped feature: "TODO: Use the alpha value to blend vaddr_src with vaddr_dst instead of overwriting it in blend()". This feature showed that another test case, pipe-A-cursor-alpha-opaque, had a false pass. Moreover, both test cases related to the composition of the cursor plane with the primary plane were not sufficient to verify the composition's correctness. As a result, I sent IGT a refactoring of the test cases to improve coverage.

The fixes for the two failures initially reported came out of the same debugging process. However, they had different stories. The dpms test case had already been successful in the past, according to a commit in the git log. It started to fail after a change which was correct in itself but exposed the driver's deficiency. For the suspend test case there was no report that it had worked correctly at some point, but it lacked the same thing: ensuring that vblank interrupts occur for the composition work and CRC captures.

The remaining skips are related to:

  1. non-square cursor support, which, as far as I know, is a feature very specific to Intel drivers. If this restricted application is confirmed, the right thing to do is to move these test cases to the i915-specific test folder.
  2. the test requirement for more than one pipe: "Pipe (B-F) does not exist or not enabled". This needs more complex work to add support for more than one CRTC.

I also started examining how to support the real overlay:

  • create a parameter to enable the overlay;
  • enable the third type of plane overlay in addition to the existing primary and cursor;
  • and allow the blending of the three planes.

Lessons learned

  1. In one of my first contributions to VKMS I received feedback that was, let's say, scary. To this day, I have no idea who that person was, and maybe he didn't want to scare me, but his lack of politeness in saying 'That looks horrid' was pretty disheartening for a beginner. Luckily I didn't take it too seriously; I continued my journey, started GSoC, and realized that the most active developers in the community are very friendly and inclusive. I learned a lot, and I would like to thank them for sharing their knowledge and information.
  2. Still related to the previous topic, I feel that less serious developers also don't care much about contribution etiquette and the code of conduct. By not knowing or overlooking the rules of the community, they end up creating discomfort for other developers. In the DRM community, I noticed care in maintaining the community and bringing in new contributors. I did not feel unworthy or discredited in any discussion with the others. On the contrary, I felt very encouraged to continue, to learn, and to contribute.
  3. In addition to the diversity of skills, maturity, and culture of the developers, the Linux community deals with timezone differences. It's crazy to see the energy of the DRI community on the IRC channel, even with people sleeping and waking up at random times.
  4. I still don’t feel very comfortable talking on the IRC channel, for two reasons: I always take time to understand the subject being discussed technically, and my English is not very good. The dialogues I had by e-mail were very useful. It gave me time to think and check the validity of my ideas before speaking. It was still a record/documentation for future contributions.
  5. The Linux kernel is very well documented. For most questions, there is an answer somewhere in the documentation. However, it is not always immediate to find the answer, because it is such an extensive project. Besides, a lot has already been encapsulated and coded for reuse, so creating something from scratch can often be unnecessary.

And after GSoC

Well, my first, short-term plan is to make a good presentation at XDC 2020. In it, I will highlight interesting cases on my project of working on IGT and VKMS together. My presentation will be on the first day, September 16; more details:

VKMS improvements using IGT GPU Tools

My second plan is to continue contributing to VKMS, initially adding the mapped features of a real overlay and structuring VKMS for more than one CRTC. Probably, with the arrival of writeback support, doing these things will require even more reasoning.

Not least, I want to finish my master’s degree and complete my research on Linux development, adding my practical experiences with VKMS development. I’m almost there, hopefully later this year. And then, I hope to find a job that will allow me to work on Linux development. And live. :P

Acknowledgment

Finally, I thank the X.Org Foundation for accepting my project and believing in my performance (since mine was the organization's sole project in this year's GSoC). Also, thanks to Trevor Woerner for motivation, communication and confidence! He was always very attentive, guiding me and giving tips and advice.

I thank my mentor, Rodrigo Siqueira, who believed in my potential, openly shared knowledge that he acquired with great effort, gave relevance to each question that I presented, and encouraged me to talk with the community. Many thanks also to the DRI community and Daniel Vetter for sharing their time and so much information and knowledge with me, being friendly and giving me constructive feedback.

  1. https://patchwork.freedesktop.org/patch/374926/
  2. https://patchwork.freedesktop.org/patch/377557/
  3. https://patchwork.freedesktop.org/patch/381344/
  4. https://patchwork.freedesktop.org/patch/387714/
  5. https://patchwork.freedesktop.org/series/80961/

This is the continuation from these posts: part 1, part 2

Let's talk about everyone's favourite [1] keyboard configuration system again: XKB. If you recall, the goal is to make it simple for users to configure their own custom layouts. Now, as described earlier, XKB-the-implementation doesn't actually have a concept of a "layout" as such; it has "components" and something [2] converts your layout desires into the combination of components. RMLVO (rules, model, layout, variant, options) is what you specify and gets converted to KcCGST (keycodes, compat, geometry, symbols, types). This is a one-way conversion; the resulting keymap no longer has references to the RMLVO arguments. Today's post is about that conversion, and we're only talking about libxkbcommon as XKB parser because anything else is no longer maintained.

The peculiar thing about XKB data files (provided by xkeyboard-config [3]) is that the filename is part of the API. You say layout "us" variant "dvorak", the rules file translates this to symbols 'us(dvorak)' and the parser will understand this as "load file 'symbols/us' and find the dvorak section in that file". [4] The default "us" keyboard layout will require these components:


xkb_keymap {
    xkb_keycodes { include "evdev+aliases(qwerty)" };
    xkb_types { include "complete" };
    xkb_compat { include "complete" };
    xkb_symbols { include "pc+us+inet(evdev)" };
    xkb_geometry { include "pc(pc105)" };
};
So the symbols are really: file symbols/pc, add symbols/us and then the section named 'evdev' from symbols/inet [5]. Types are loaded from types/complete, etc. The lookup paths for libxkbcommon are $XDG_CONFIG_HOME/xkb, /etc/xkb, and /usr/share/X11/xkb, in that order.

Most of the symbols sections aren't actually full configurations. The 'us' default section only sets the alphanumeric rows; everything else comes from the 'pc' default section (hence: include "pc+us+inet(evdev)"). And most variant sections just include the default one, usually called 'basic'. For example, this is the 'euro' variant of the 'us' layout which merely combines a few other sections:


partial alphanumeric_keys
xkb_symbols "euro" {

    include "us(basic)"
    name[Group1]= "English (US, euro on 5)";

    include "eurosign(5)"

    include "level3(ralt_switch)"
};
Including things works as you'd expect: include "foo(bar)" loads section 'bar' from file 'foo', and this works for 'symbols/', 'compat/', etc.; it'll just load the file in the same subdirectory. So yay, the directory is kinda also part of the API.

Alright, now you understand how KcCGST files are loaded, much to your despair.

For user-specific configuration, we could already load a 'custom' layout from the user's home directory. But it'd be nice if we could just add a variant to an existing layout. Like "us(banana)", because potassium is important or somesuch. This wasn't possible because the filename is part of the API. So our banana variant had to be in $XDG_CONFIG_HOME/xkb/symbols/us and once we found that "us" file, we could no longer include the system one.

So as of two days ago, libxkbcommon now extends the parser to have merged KcCGST files, or in other words: it'll load the symbols/us file in the lookup path order until it finds the section needed. With that, you can now copy this into your $XDG_CONFIG_HOME/xkb/symbols/us file and have it work as variant:


partial alphanumeric_keys
xkb_symbols "banana" {

    include "us(basic)"
    name[Group1]= "English (Banana)";

    // let's assume there are some keymappings here
};
And voila, you now have a banana variant that can combine with the system-level "us" layout.

And because there must be millions [6] of admins out there that maintain custom XKB layouts for a set of machines, the aforementioned /etc/xkb lookup path was also recently added to libxkbcommon. So we truly now have the typical triplet of lookup paths:

  • vendor-provided ones in /usr/share/X11/xkb,
  • host-specific ones in /etc/xkb, and
  • user-specific ones in $XDG_CONFIG_HOME/xkb [7].
Good times, good times.

[1] Much in the same way everyone's favourite Model T colour was black
[2] This all follows the UNIX philosophy, there are of course multiple tools involved and none of them know what the other one is doing
[3] And I don't think Sergey gets enough credit for maintaining that pile of language oddities
[4] Note that the names don't have to match, we could map layout 'us' to the symbols in 'banana' but life's difficult enough as it is
[5] I say "add" when it's sort of a merge-and-overwrite and, yes, of course there are multiple ways to combine layouts, thanks for asking
[6] Actual number may be less
[7] Notice how "X11" is missing in the latter two? If that's not proof that we want to get rid of X, I don't know what is!

At Last, Geometry Shaders

I’ve mentioned GS on a few occasions in the past without going too deeply into the topic. The truth is that I’m probably not ever going to get that deep into the topic.

There’s just not that much to talk about.

The 101

Geometry shaders take the vertices from previous shader stages (vertex, tessellation) and then emit some sort of (possibly-complete) primitive. EmitVertex() and EndPrimitive() are the big GLSL functions to care about, and there's gl_PrimitiveID and gl_InvocationID, but hooking everything into a mesa (gallium) driver is, at the overview level, pretty straightforward:

  • add struct pipe_context hooks for the gs state creation/deletion, which are just the same wrappers that every other shader type uses
  • make sure to add GS state saving to any util_blitter usage
  • add a bunch of GS shader pipe cap handling
  • the driver now supports geometry shaders

But Then Zink

The additional changes needed in zink are almost entirely in ntv. Most of this is just expanding existing conditionals which were previously restricted to vertex shaders to continue handling the input/output variables in the same way, though there’s also just a bunch of boilerplate SPIRV setup for enabling/setting execution modes and the like that can mostly be handled with ctrl+f in the SPIRV spec with shader_info.h open.

One small gotcha is again gl_Position handling. Since this was previously transformed in the vertex shader to go from an OpenGL [-1, 1] depth range to a Vulkan [0, 1] range, any time a GS loads gl_Position it then needs to be un-transformed, as (at the point of the GS implementation patch) zink doesn’t support shader keys, and so there will only ever be one variant of a shader, which means the vertex shader must always perform this transform. This will be resolved at a later point, but it’s worth taking note of now.
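For reference, the transform and its inverse look roughly like this (a sketch of the math, not zink's actual NIR):

/* OpenGL clip-space depth spans [-w, w]; Vulkan expects [0, w].
 * The vertex shader applies the forward transform, so a GS reading
 * gl_Position must first apply the inverse. */
static float gl_to_vk_depth(float z, float w) { return (z + w) * 0.5f; }
static float vk_to_gl_depth(float z, float w) { return z * 2.0f - w; }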

The other point to note is, as always, I know everyone’s tired of it by now, transform feedback. Whereas vertex shaders emit this info at the end of the shader, things are different in the case of geometry shaders. The point of transform feedback is to emit captured data at the time shader output occurs, so now xfb happens at the point of EmitVertex().

And that’s it.

Sometimes, but only sometimes, things in zink-land don’t devolve into insane hacks.

August 28, 2020

I recently posted about a feature I developed for VKMS to consider the alpha channel in the composition of the cursor plane with the primary plane. This process took a little longer than expected, and now I can identify a few reasons:

  • Beginner: I had little knowledge of computer graphics and its operations
  • Maybe cliché: I did not consider that the subsystem itself already had material to aid coding.
  • The unexpected: the test cases that checked the composition of a cursor considering the alpha channel passed even with a defect in the implementation.

IGT GPU Tools has two test cases in kms_cursor_crc to check the cursor's blending on the primary plane: cursor-alpha-opaque and cursor-alpha-transparent. These two cases are structured in the same way, changing only the value of the alpha channel - totally opaque 0xFF or totally transparent 0x00.

In a brief description, the test checks the composition as follows:

  1. Creates an XRGB primary plane framebuffer with a black background
  2. Creates an ARGB framebuffer for the cursor plane with white color and a given alpha value (0xFF or 0x00)
  3. Enables the cursor on hardware and captures the plane’s CRC after composition (hardware)
  4. Disables the cursor on hardware, draws the cursor directly on the primary plane and captures the CRC (software)
  5. Compares the two captured CRCs: they must have equal values.

After implementing alpha blending using the straight alpha formula in VKMS, both tests were successful. However, the equation was not correct.

To paint, IGT uses the Cairo library. For the primary plane, the test uses CAIRO_FORMAT_RGB24; for the cursor, CAIRO_FORMAT_ARGB32. According to the documentation, the ARGB32 format stores the pixel color in the premultiplied-alpha representation:

CAIRO_FORMAT_ARGB32

each pixel is a 32-bit quantity, with alpha in the upper 8 bits, then red, then
green, then blue. The 32-bit quantities are stored native-endian.
Pre-multiplied alpha is used. (That is, 50% transparent red is 0x80800000, not
0x80ff0000.) (Since 1.0)

In a brief dialogue with Pekka about endianness on DRM, he showed me information from the DRM documentation that I didn’t know about. According to it, DRM converges with the representation used by Cairo:


Current DRM assumption is that alpha is premultiplied, and old userspace can
break if the property defaults to anything else.

It is also possible to find there information about DRM's Plane Composition Properties, such as the pixel blend mode property for alpha blending equation selection, describing how the pixels from the current plane are composited with the background.

Finally, you can find there the pre-multiplied alpha blending equation:

out.rgb = plane_alpha * fg.rgb + (1 - (plane_alpha * fg.alpha)) * bg.rgb

From this information, we see that both IGT and DRM use the same representation, but the current test cases of kms_cursor_crc do not show the defect in using the straight-alpha formula.
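To make the difference concrete, here is a minimal sketch (plain C, neither VKMS nor IGT code) of the two per-channel formulas. With premultiplied input, both produce identical results for the extreme alpha values 0x00 and 0xFF, which is exactly why the existing tests could not expose the bug; for example, a white cursor at 50% alpha over black yields 0x80 per channel with the premultiplied formula but only 0x40 when alpha gets applied a second time:

#include <stdint.h>

/* premultiplied input: fg already contains alpha * color */
static uint8_t blend_premultiplied(uint8_t fg, uint8_t bg, uint8_t alpha)
{
	/* out = fg + (1 - alpha) * bg */
	return fg + ((255 - alpha) * bg) / 255;
}

/* straight formula, wrong for premultiplied input: alpha is applied twice */
static uint8_t blend_straight(uint8_t fg, uint8_t bg, uint8_t alpha)
{
	/* out = alpha * fg + (1 - alpha) * bg */
	return (alpha * fg) / 255 + ((255 - alpha) * bg) / 255;
}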

With this in mind, I thought of refactoring the test cases so that they could validate translucent cursors and “remove some zeros” from the equation. After sharing thoughts with my mentor, Siqueira, I decided to combine the two testcases (cursor-alpha-opaque and cursor-alpha-transparent) into one and refactor it so that it verifies not only extreme alpha values but also translucent ones.

Therefore, the submitted test case proposal follows these steps:

  1. Creates an XRGB primary plane framebuffer with a black background
  2. Creates an ARGB framebuffer for the cursor plane and enables the cursor on hardware.
  3. Paints the cursor with white color and a range of alpha values (from 0xFF to 0x00)
  4. For each alpha value, captures the plane’s CRC after composition into an array of CRCs. (hardware)
  5. Disables the cursor on hardware.
  6. Draws the cursor directly on the primary plane following the same range of alpha values (software)
  7. Captures the CRC and compares it with the corresponding CRC in the array of hardware CRCs: they must have equal values.
  8. Clears the primary plane and goes back to step 5.

The code sent: tests/kms_cursor_crc: refactoring cursor-alpha subtests

CI checks did not show any problems, so I am expecting some human feedback :)

A Different Friday

Taking a break from talking about all this crazy feature nonsense, let’s get back to things that actually matter. Like gains. Specifically, why have some of these piglit tests been taking so damn long to run?

A great example of this is spec@!opengl 2.0@tex3d-npot.

Mesa 20.3: MESA_LOADER_DRIVER_OVERRIDE=zink bin/tex3d-npot 24.65s user 83.38s system 73% cpu 2:27.31 total

Mesa zmike/zink-wip: MESA_LOADER_DRIVER_OVERRIDE=zink bin/tex3d-npot 6.09s user 5.07s system 48% cpu 23.122 total

Yes, some changes I’ve made currently allow this test to pass in 16% of the time that it requires in the released version of Mesa. How did this happen?

Speed Loop

The core of the problem at present is zink’s reliance on explicit fencing without enough info to know when to actually wait on a fence, as is vaguely referenced in this ticket, though the specific test case there still has yet to be addressed. In short, here’s the problem codepath that’s being hit for the above test:

static void *
zink_transfer_map(struct pipe_context *pctx,
                  struct pipe_resource *pres,
                  unsigned level,
                  unsigned usage,
                  const struct pipe_box *box,
                  struct pipe_transfer **transfer)
{
  ...
  if (pres->target == PIPE_BUFFER) {
      if (usage & PIPE_TRANSFER_READ) {
         /* need to wait for rendering to finish
          * TODO: optimize/fix this to be much less obtrusive
          * mesa/mesa#2966
          */
         struct pipe_fence_handle *fence = NULL;
         pctx->flush(pctx, &fence, PIPE_FLUSH_HINT_FINISH);
         if (fence) {
            pctx->screen->fence_finish(pctx->screen, NULL, fence,
                                       PIPE_TIMEOUT_INFINITE);
            pctx->screen->fence_reference(pctx->screen, &fence, NULL);
         }
      }

Here, the current command buffer is submitted and then its fence is waited upon any time a call to e.g., glReadPixels() is made.

Any time.

Regardless of whether the resource in question even has pending writes.

Or whether it’s ever had anything written to it at all.

This patch was added at my request to fix up a huge number of test cases we were seeing that failed due to write -> read sequences on a resource without waiting for the write to complete, and it was added with the understanding that at some time in the future it would be improved to both ensure synchronization and not incur such a massive performance hit.

That time has passed, and we are now in the future.

Buffer Synchronization

At its core, this is just another synchronization issue, and so by adding more details about synchronization needs, the problem can be resolved.

I chose to do this using r/w flags for resources based on the command buffer (batch) id that the resource was used on. Presently, Mesa releases ship with this code for tracking resource usage in a command buffer:

void
zink_batch_reference_resoure(struct zink_batch *batch,
                             struct zink_resource *res)
{
   struct set_entry *entry = _mesa_set_search(batch->resources, res);
   if (!entry) {
      entry = _mesa_set_add(batch->resources, res);
      pipe_reference(NULL, &res->base.reference);
   }
}

This ensures that the resource isn’t destroyed before the batch finishes, which is a key functionality that drivers generally prefer to have in order to avoid crashing.

It doesn’t, however, provide any details about how the resource is being used, such as whether it’s being read from or written to, which means there’s no way to optimize that case in zink_transfer_map().

Here’s the somewhat improved version:

void
zink_batch_reference_resource_rw(struct zink_batch *batch, struct zink_resource *res, bool write)
{
   unsigned mask = write ? ZINK_RESOURCE_ACCESS_WRITE : ZINK_RESOURCE_ACCESS_READ;

   struct set_entry *entry = _mesa_set_search(batch->resources, res);
   if (!entry) {
      entry = _mesa_set_add(batch->resources, res);
      pipe_reference(NULL, &res->base.reference);
   }
   /* the batch_uses value for this batch is guaranteed to not be in use now because
    * reset_batch() waits on the fence and removes access before resetting
    */
   res->batch_uses[batch->batch_id] |= mask;
}

For context, I’ve simultaneously added a member uint8_t batch_uses[4]; to struct zink_resource, as there are 4 batches in the batch array.

What this change does is allow callers to provide very basic info about whether the resource is being read from or written to in a given batch, stored as a bitmask in the batch-specific struct zink_resource::batch_uses member. Each batch has its own member of this array because it needs to be modifiable atomically, so that its usage can be unset when a fence is notified of completion, and each member can have up to two bits set.
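For reference, the post never shows the flag definitions; something along these lines is assumed:

/* assumed definition (not shown in the post): one bit per access type,
 * so a batch's batch_uses entry can hold both READ and WRITE */
enum zink_resource_access {
   ZINK_RESOURCE_ACCESS_READ = 1 << 0,
   ZINK_RESOURCE_ACCESS_WRITE = 1 << 1,
};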

Now here’s the current struct zink_context::fence_finish hook for waiting on a fence:

bool
zink_fence_finish(struct zink_screen *screen, struct zink_fence *fence,
                  uint64_t timeout_ns)
{
   bool success = vkWaitForFences(screen->dev, 1, &fence->fence, VK_TRUE,
                                  timeout_ns) == VK_SUCCESS;
   if (success && fence->active_queries)
      zink_prune_queries(screen, fence);
   return success;
}

Not much to see here. Normal fence stuff. I’ve made some additions:

static inline void
fence_remove_resource_access(struct zink_fence *fence, struct zink_resource *res)
{
   p_atomic_set(&res->batch_uses[fence->batch_id], 0);
}

bool
zink_fence_finish(struct zink_screen *screen, struct zink_fence *fence,
                  uint64_t timeout_ns)
{
   bool success = vkWaitForFences(screen->dev, 1, &fence->fence, VK_TRUE,
                                  timeout_ns) == VK_SUCCESS;
   if (success) {
      if (fence->active_queries)
         zink_prune_queries(screen, fence);

      /* unref all used resources */
      util_dynarray_foreach(&fence->resources, struct pipe_resource*, pres) {
         struct zink_resource *res = zink_resource(*pres);
         fence_remove_resource_access(fence, res);

         pipe_resource_reference(pres, NULL);
      }
      util_dynarray_clear(&fence->resources);
   }
   return success;
}

Now as soon as a fence completes, all used resources will have the usage for the corresponding batch removed.

With this done, some changes can be made to zink_transfer_map():

static uint32_t
get_resource_usage(struct zink_resource *res)
{
   uint32_t batch_uses = 0;
   for (unsigned i = 0; i < 4; i++) {
      uint8_t used = p_atomic_read(&res->batch_uses[i]);
      if (used & ZINK_RESOURCE_ACCESS_READ)
         batch_uses |= ZINK_RESOURCE_ACCESS_READ << i;
      if (used & ZINK_RESOURCE_ACCESS_WRITE)
         batch_uses |= ZINK_RESOURCE_ACCESS_WRITE << i;
   }
   return batch_uses;
}

static void *
zink_transfer_map(struct pipe_context *pctx,
                  struct pipe_resource *pres,
                  unsigned level,
                  unsigned usage,
                  const struct pipe_box *box,
                  struct pipe_transfer **transfer)
{
   ...
   struct zink_resource *res = zink_resource(pres);
   uint32_t batch_uses = get_resource_usage(res);
   if (pres->target == PIPE_BUFFER) {
      if ((usage & PIPE_TRANSFER_READ && batch_uses >= ZINK_RESOURCE_ACCESS_WRITE) ||
          (usage & PIPE_TRANSFER_WRITE && batch_uses)) {
         /* need to wait for rendering to finish
          * TODO: optimize/fix this to be much less obtrusive
          * mesa/mesa#2966
          */
         zink_fence_wait(pctx);
      }

get_resource_usage() iterates over struct zink_resource::batch_uses to generate a full bitmask of the resource’s usage across all batches. Then, using the usage detail from the transfer_map, this can be applied in order to determine whether any waiting is necessary:

  • If the transfer_map is for reading, only wait for rendering if the resource has pending writes
  • If the transfer_map is for writing, wait if the resource has any pending usage

When this resource usage flagging is properly applied to all types of resources, huge performance gains abound despite how simple it is. For the above test case, flagging the sampler view resources with read-only access still ensures their lifetimes while enabling them to be concurrently read while draw commands are pending, which yields the improvement to the tex3d-npot test case above.
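As an illustration, the call site for a sampler view might look roughly like this (a hedged sketch, not the actual commit; struct pipe_sampler_view::texture is the gallium member holding the sampled resource):

static void
zink_batch_reference_sampler_view(struct zink_batch *batch,
                                  struct pipe_sampler_view *sv)
{
   /* sampler views are only ever read during draws, so flag read-only
    * access: lifetime is ensured without forcing later map waits */
   zink_batch_reference_resource_rw(batch, zink_resource(sv->texture), false);
}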

Future Gains

I’m always on the lookout for some gains, so I’ve already done and flagged some future work to be done here as well:

  • texture resources have had similar mechanisms applied for improved synchronization in zink-wip, whereas the released version currently has none
  • zink-wip has facilities for waiting on specific batches to complete, so zink_fence_wait() here can instead be changed to only wait on the last batch with write access for the PIPE_TRANSFER_READ case
  • buffer ranges can be added to struct zink_resource, as many other drivers do with the util_range API, enabling the fencing here to be skipped if the regions have no overlap or no valid data is extant/pending (a rough sketch follows this list)
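For the last item, a rough sketch of how that could look in zink_transfer_map(), assuming a hypothetical struct util_range valid_buffer_range member that is kept up to date at write time (this is not the actual patch):

   /* skip the fence entirely if the mapped region never received data */
   if (usage & PIPE_TRANSFER_READ &&
       !util_ranges_intersect(&res->valid_buffer_range,
                              box->x, box->x + box->width)) {
      /* nothing valid or pending in this range: map without waiting */
   } else {
      zink_fence_wait(pctx);
   }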

These improvements are mostly specific to unit testing, but for me personally, those are the most important gains to be making, as faster test runs mean faster verification that no regressions have occurred, which means I can continue smashing more code into the repo.

Stay tuned for future Zink Gains updates.

In the past few weeks, I was examining two issues on vkms: development of writeback support and alpha blending. I try to keep activities in parallel so that one can relieve the tiredness caused by the other :P

Alpha blending is a VKMS TODO[1] that could possibly solve the warning[2] triggered in the cursor-alpha-transparent testcase. And, you know, if you still have a warning, you still have work.

  1. Blend - blend value at vaddr_src with value at vaddr_dst
     /* TODO: Use the alpha value to blend vaddr_src with vaddr_dst
      * instead of overwriting it. */

  2. WARNING: Suspicious CRC: All values are 0

To develop this feature, I needed to understand and “practice” some abstract concepts: alpha compositing, bitwise operations, and endianness. The beginning was a little confusing, as I was reading many definitions and rarely seeing practical examples, which was terrible for dealing with abstractions. I searched for information on Google and, little by little, the code took shape in my head.

The code, the problem solved and a new problem found

I combined what I understood from the links above with what was already in VKMS. An important note: I assume that when we blend an alpha layer onto the background, the final composition is solid (that is, opaque); the result on the screen is not a transparent plate, but usually a black background. Therefore, the resulting alpha channel is 100% opaque, so we set it to 0xFF.
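In code terms, the assumption amounts to something like this (a sketch, not the exact VKMS patch):

#include <stdint.h>

/* force the composed pixel's alpha byte to fully opaque, since the
 * final on-screen result is treated as solid */
static uint32_t compose_opaque(uint32_t blended_argb)
{
	return 0xFF000000u | (blended_argb & 0x00FFFFFFu);
}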

When executing my code, the two test cases related to the alpha channel (opaque/transparent) passed cleanly. But for me, it still wasn’t enough. I needed to see the colors read and the result of the composition (putting some pr_info in the code). Besides, only testing extreme values (solid black background, completely white cursor with totally opaque or transparent alpha) did not convince me. You see, I’m a little pessimistic and a little wary.

So I decided to play with the cursor-alpha testcase… and I scraped myself.

Ouch!

I changed the transparent test case from 0.0 to 0.5, and things started to fail. After a few rounds of checking the returned values, what I always saw was a cursor color (ARGB) returned as follows:

  • A: the alpha determined (ok)
  • RGB: the color resulting from a white cursor with the set transparency already applied against a standard solid black background (not ok).

For example, I expected a white cursor with 0.5 transparency to return ARGB = 80FFFFFF, but VKMS was reading 80808080 (???)

What is that?

I did more experiments to understand what was going on. I checked whether 0.5 transparency worked on i915, and it did. I also changed the test’s RGB cursor color and the level of transparency. The cursor color read was always a combination of the given alpha + the RGB blended with black. I could only conclude that the mismatch happens because, in the hardware test, composing a cursor color of 80808080 over an FF000000 background (primary) yields FF404040 as the final result, whereas in the software test step the final color drawn was FF808080.

The developed code would also do that if it were getting the right color… so how do we deal with it, and why is it not a problem on the i915 driver?

Reading the documentation

I was reading the Cairo manual in addition to performing tests with different color combinations. I suspected it could be a format problem, but even so, I had no idea what was “wrong”. After some days, I found the reason:

CAIRO_FORMAT_ARGB32

each pixel is a 32-bit quantity, with alpha in the upper 8 bits, then red, then
green, then blue. The 32-bit quantities are stored native-endian.
Pre-multiplied alpha is used. (That is, 50% transparent red is 0x80800000, not
0x80ff0000.)

Pre-multiplied alpha is used. Thanks, cairo… or not!

But what is pre-multiplied alpha? I didn’t even know it existed.

So, I needed to read about the difference between straight and premultiplied alpha. Moreover, I needed to figure out how to do alpha blending using this format.

After some searching, I found a post on Nvidia - Gameworks Blog that helped me a lot to understand this approach. Besides, the good old StackOverflow clarified what I needed to do.

So I adapted the code I had previously prepared for straight alpha blending to work with alpha-premultiplied colors. Of course, I put pr_info in to see the results. I also played a little more with the cursor and background colors in the kms_cursor_crc test to check; for example, I changed the background color to solid red and/or varied the cursor color and/or the cursor alpha value. In all these situations, the patch worked fine.

So, I just sent:

A new interpretation for the warning

Examining this behavior, I discovered a more precise cause of the warning. As VKMS currently only overwrites the cursor over the background, a fully transparent black cursor has a color of 0 and results in a transparent primary layer instead of solid black. When doing XRGB, VKMS zeroes the alpha instead of making it opaque, which seems incorrect and triggers the warning.

I also think that this shows the weakness of the full opaque/transparent tests since they were not able to expose the wrong composition of colors. So I will prepare a patch for IGT to expand the coverage of this type of problem.

Well, with alpha blending done, the warning never bothered me anyway.

Let it go!

After a long debugging process, we finally reached an acceptable solution to the problem of running sequential subtests of IGT tests involving CRC capture using the VKMS.

Despite all the time, uncertainty, and failures, this was, by far, one of the greatest lessons learned in my journey as a FLOSS developer. With this experience, I learned a lot about how VKMS works, evolved technically, experienced different feelings, and interacted with people I may never meet in person. These people shared some of their knowledge with me in open and accessible communication. The interactions allowed us to build together a more suitable and effective solution, and at that moment, everything I thought was important in FLOSS development came to fruition.

I know it sounds very romantic, but after more than a month of it, breathing is not a simple and involuntary movement.

With this patch, I was able to update my progress table on GSoC:

Result/Status   GSoC start   After accepted patches
pass            7            24 (+17)
warn            1            1
fail            2            0 (-2)
skip            236          221 (-15)
crash           0            0
timeout         0            0
incomplete      0            0
dmesg-warn      0            0
dmesg-fail      0            0
changes         0            0
fixes           0            0
regressions     0            0
total           246          246

The endless wait

I believe I have already talked about this problem in some other posts, as it has followed me since the elaboration of my GSoC project.

I was trying to solve the failure of one of the IGT kms_cursor_crc subtests, cursor-alpha-transparent. As this subtest has an implementation similar to cursor-alpha-opaque (which was passing), I used both to observe the ARGB -> XRGB manipulation happening in CRC capture. However, strangely, cursor-alpha-opaque kept alternating between success and failure on the same kernel. Gradually I noticed that this was a widespread problem with the sequential execution of subtests on VKMS.

In the beginning, I only noticed that the test crashed while waiting for a vblank event that was never going to happen. So I looked for solutions that enabled vblanks in time to prevent the subtest from getting stuck. However, as Daniel mentioned, these solutions just looked like duct tape.

The endless debugging

Initially, I found two problems and thought that both were related to the test implementation, so I sent patches to IGT to correct the cleanup process and to force a vblank. While waiting for feedback, I realized that a previously successful subtest was now failing, even with the solution I intended - bad smell. I searched the history for the patch that modified this behavior and, in fact, reverting the change let the subtests flow.

It was then that I saw a patch on the DRM mailing list in which someone faced a similar problem, and I decided to join the discussion. From this point, doubts and suspicions increased even more. However, on the bright side, I wasn’t the only person looking into it. With our time zone differences, we were three people sharing findings for someone else’s next shift.

I sent a first patch about this issue on June 25th (in the wrong direction). After so much feedback, adjustments, and some other wrong attempts, the solution was finally accepted on August 8th.

The whole story is documented in the following discussions (patchwork links):

  1. Aiming in the wrong direction
  2. Joining the discussion of people that observed a similar problem
  3. Trying to hit the target … and failing
  4. Hey… I had this idea! But again, it was not ideal
  5. Finally, we come to a solution

And when it all seemed to come to an end, a fourth person appeared, raising doubts about other VKMS parts that, in a way, were related to the problem at hand.

And that was amazing! There’s still a lot of work out there!

Now I’m still a little tired of it, but I’ll try to make a more technical post about the solution. I changed the context, going back to that will give me a little mental work.

For the next developments

  • Don’t keep the problem to yourself; share your doubts.
  • When making a patch, don’t forget the credits! It doesn’t make much difference for you, but it can do for those who were with you on that journey. See more
  • Each thread/patch is full of information that can help you with other problems.
  • If you can’t dive into the code, browse through it.
  • ftrace is also a good friend
  • The mental effort is less if you write a post before changing the context.

P.S.: I still need to find a way to feel comfortable speaking on IRC.

The first round just passed so fast. And what did I do?

The case: the IGT test kms_cursor_crc

Getting acquainted with an IGT test

  1. Anatomy of the test
    What: study, read the code, dive into each function, and describe its structure, steps, and functionality.
    To: be well aware of the code construction and, consequently, able to figure out and deal with problems.
  2. Patches to improve test documentation
    What: During the study of the kms_cursor_crc test, I realized that the subtests had no description that would help a newcomer to perceive the purpose of that subtest. To improve this documentation and some code comments, I sent some patches to IGT with what I was able to understand from my inspection.
  3. Refactoring function with parameters never used.
    While examining the anatomy of kms_cursor_crc, outside the context of exploring and becoming acquainted with the case, I caught useless parameters in a general IGT function, i.e., it requires two parameters but never uses them within its code. I checked the author (git blame) and asked him on IRC about the need for these additional parameters, but I didn’t get a response (or maybe I missed the reply due to disconnection). Then, I sent an RFC patch to the mailing list and again, nothing. Finally, my mentor took a look, and he agreed that the parameters seem useless and can be removed. He asked me to resend it as a normal patch.

Maybe you can also take a look: lib/igt_fb: remove extra parameters from igt_put_caito_ctx

Solving some problems

I have also sent a patchset to treat two reported problems in kms_cursor_crc:

  1. Access to debugfs data file is blocked
  2. Unstable behaviour in sequential subtests

You can check out more about that in my previous posts. Reviewing the logs, I saw that these problems in prepare_crtc were introduced more recently, since some of Haneen’s commits report that the subtests were working before. I noticed that waiting for the vblank is a solution that was used before and removed by another commit that does not seem to treat this issue.

Sadly, my patches also received no feedback at all. So, if you are interested in them or can make an interesting comment, go ahead! :)

Preparing for the second round

In the last few days, I surveyed the materials and tasks needed to execute the second round of my project. This is a kind of roadmap with useful links.

As far as I know, I have to deal with three issues involving features that VKMS currently does not seem to support:

  1. Enhance the blend function to provide alpha composing
    To solve warning message in pipe-%s-cursor-alpha-transparent
  2. DPMS (?)
    To solve failure in pipe-%s-cursor-dpms and pipe-%s-cursor-suspend
  3. Cursor size other than 64x64
    • https://drmdb.emersion.fr/devices/3a5c55e30723
    • https://drmdb.emersion.fr/capabilities
    • See drivers/gpu/drm/drm_ioctl.c
    • See drivers/gpu/drm/vkms/vkms_drv.c (drm_getCap)
      drivers/gpu/drm/vkms/vkms_drv.c:131:    dev->mode_config.funcs = &vkms_mode_funcs;
      drivers/gpu/drm/vkms/vkms_drv.c:132:    dev->mode_config.min_width = XRES_MIN;
      drivers/gpu/drm/vkms/vkms_drv.c:133:    dev->mode_config.min_height = YRES_MIN;
      drivers/gpu/drm/vkms/vkms_drv.c:134:    dev->mode_config.max_width = XRES_MAX;
      drivers/gpu/drm/vkms/vkms_drv.c:135:    dev->mode_config.max_height = YRES_MAX;
      drivers/gpu/drm/vkms/vkms_drv.c:136:    dev->mode_config.preferred_depth = 24;
      drivers/gpu/drm/vkms/vkms_drv.c:137:    dev->mode_config.helper_private = &vkms_mode_config_helpers;
      

I also need to check why some subtests like dpms/suspend are failing now when they worked in the past, according to Haneen’s commit message:

commit db7f419c06d7cce892384df464d4b609a3ea70af
Author: Haneen Mohammed <hamohammed.sa@gmail.com>
Date:   Thu Sep 6 08:18:26 2018 +0300

    drm/vkms: Compute CRC with Cursor Plane
    
    This patch compute CRC for output frame with cursor and primary plane.
    Blend cursor with primary plane and compute CRC on the resulted frame.
    
    This currently passes cursor-size-change, and cursor-64x64-[onscreen,
    offscreen, sliding, random, dpms, rapid-movement] from igt
    kms_cursor_crc tests.
    
    Signed-off-by: Haneen Mohammed <hamohammed.sa@gmail.com>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Link: https://patchwork.freedesktop.org/patch/msgid/b1749f5c90da5721a481f12740e2e370edb4a752.1536210181.git.hamohammed.sa@gmail.com

In March, I inspected the coverage of kms_cursor_crc on VKMS to develop my GSoC project proposal. Using piglit, I present the evolution of this coverage so far:

Result        GSoC start   Only accepted patches   Fixes under development
pass          7            22                      24
warn          1            0                       1
fail          2            3                       0
skip          236          221                     221
crash         0            0                       0
timeout       0            0                       0
incomplete    0            0                       0
dmesg-warn    0            0                       0
dmesg-fail    0            0                       0
changes       0            0                       0
fixes         0            0                       0
regressions   0            0                       0
total         246          246                     246

+ Instability in the sequential run of subtests; i.e., although the statistics show that 7 tests passed at the beginning, this result only became clear after a double-check, because when running all subtests sequentially, 8 tests failed and only 1 succeeded;

+ Warning or Failure? My fix proposal for the pipe-A-cursor-alpha-transparent test was passing, but with a warning (CRC all zero)

As a first step, I decided to examine and solve issues that affected the test results in a general way: the instability. Solving the instability first (or at least identifying what was going on) would make the work more consistent and fluid, since I would no longer need to double-check each subtest result, and in one run of the entire kms_cursor_crc test I could check for absent features or errors. However, in this investigation, some problems were more linked to IGT and others to VKMS. Identifying the “culprit” was not simple, and some false accusations happened.

This is a somewhat long story that deserves its own post. The effective solution has not yet been found, but it is already clear that the proper bug-fix achieves:

  • The stability of the sequential execution of subtests;
  • The success of the following subtests: pipe-A-cursor-dpms pipe-A-cursor-suspend

As we don’t have that yet, well, I’ll focus on this post in describing the way to allow testing of different cursor sizes in vkms.

The maximum cursor size

In 2018, one of Haneen’s contributions to vkms was adding support for the cursor plane. In this implementation, the cursor has a default maximum supported size of 64x64, which limited the coverage of kms_cursor_crc to the cursor subtests of this size alone.

The IGT test exercises cursor sizes from 64 to 512 (powers of 2), but how do we enable cursors larger than the default?

Initially, I thought I needed to develop some functionality from scratch, maybe do this by drawing in larger sizes… as a good beginner :) So I started to read some materials and check the codes of other drivers to find out how to do this.

During this stage, by some kind of universe coincidence (mystic), I came across an IRC conversation on the topic of “cursor sizes.” This conversation gave me some references that led me to the right work path.

Ctags-driven investigation

I tend to investigate Linux through keywords (like running ctags in my head?). This approach helps me narrow the scope of reading and attempts. The Linux kernel is frighteningly large, and I know that I still need many years to understand things as a whole (if that’s even possible).

Therefore, these are the references that helped me to find the keywords and information related to my problem “the cursor size”:

  1. Comparing an AMD device with VKMS device: https://drmdb.emersion.fr/devices (keyword 1: DRM_CAP_CURSOR_HEIGHT)
  2. Checking max cursor sizes per driver: https://drmdb.emersion.fr/capabilities (keyword 2: capabilities)
  3. Finding where is the DRM_CAP_CURSOR_HEIGHT value: drivers/gpu/drm/drm_ioctl.c
    (How to define dev->mode_config.cursor_height in vkms?)

    /*
     * Get device/driver capabilities
     */
    static int drm_getcap(struct drm_device *dev, void *data,
                          struct drm_file *file_priv)
    {
        struct drm_get_cap *req = data;
        struct drm_crtc *crtc;

        req->value = 0;
        [..]
        case DRM_CAP_CURSOR_WIDTH:
            if (dev->mode_config.cursor_width)
                req->value = dev->mode_config.cursor_width;
            else
                req->value = 64;
            break;
        case DRM_CAP_CURSOR_HEIGHT:
            if (dev->mode_config.cursor_height)
                req->value = dev->mode_config.cursor_height;
            else
                req->value = 64;
            break;
        [..]
    }

  4. More information: include/uapi/drm/drm.h

    /*
     * The CURSOR_WIDTH and CURSOR_HEIGHT capabilities return a valid widthxheight
     * combination for the hardware cursor. The intention is that a hardware
     * agnostic userspace can query a cursor plane size to use.
     *
     * Note that the cross-driver contract is to merely return a valid size;
     * drivers are free to attach another meaning on top, eg. i915 returns the
     * maximum plane size.
     */
    #define DRM_CAP_CURSOR_WIDTH 0x8
    #define DRM_CAP_CURSOR_HEIGHT 0x9
    
    
  5. Check the documentation for cursor_width:
    **struct drm_mode_config**
    Mode configuration control structure
         
    *cursor_width*: hint to userspace for max cursor width
    *cursor_height*: hint to userspace for max cursor height
       
    
  6. So, where is this mode_config defined in vkms?
    Here: drivers/gpu/drm/vkms/vkms_drv.c
    static int vkms_modeset_init(struct vkms_device *vkmsdev)
    {
     struct drm_device *dev = &vkmsdev->drm;
    
     drm_mode_config_init(dev);
     dev->mode_config.funcs = &vkms_mode_funcs;
     dev->mode_config.min_width = XRES_MIN;
     dev->mode_config.min_height = YRES_MIN;
     dev->mode_config.max_width = XRES_MAX;
     dev->mode_config.max_height = YRES_MAX;
     dev->mode_config.preferred_depth = 24;
     dev->mode_config.helper_private = &vkms_mode_config_helpers;
    
     return vkms_output_init(vkmsdev, 0);
    }
    

     There is nothing about the cursor here, so we need to assign maximum values so userspace doesn’t take the 64x64 default (a minimal sketch of this assignment follows the results table below).

  7. I also found that, on my intel computer, the maximum cursor is 256. Why do tests include 512?
  8. Also, there are subtests in kms_cursor_crc for non-square cursors, but these tests are restricted to i915 devices. Why are they here?
  9. Finally, I developed a simple patch that increases the coverage rate by 15 subtests. Considering the current drm-misc-next, my project state is:
Name Results
pass 22
fail 3
skip 221
crash 0
timeout 0
warn 0
incomplete 0
dmesg-warn 0
dmesg-fail 0
changes 0
fixes 0
regressions 0
total 246
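The simple patch mentioned in item 9 boils down to this kind of assignment in vkms_modeset_init() (a minimal sketch; the 512 value is an assumption matching the largest size the IGT test exercises, not necessarily the exact patch):

    /* hint to userspace the maximum cursor size VKMS accepts, so IGT
     * stops skipping subtests with cursors larger than 64x64 */
    dev->mode_config.cursor_width = 512;
    dev->mode_config.cursor_height = 512;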

Cheer up!

It is also possible to consider the subtests for which we have some idea of the problem and a provisional solution exists (I mean, an optimistic view):

Name Results
pass 24
warn 1
fail 0
skip 221
crash 0
timeout 0
incomplete 0
dmesg-warn 0
dmesg-fail 0
changes 0
fixes 0
regressions 0
total 246

There is still the non-square cursors issue, which I’m not sure we should handle. Subtests with non-square cursors mean 16 skips.

Lessons learned

  1. For me, IRC conversations are inspiring and also show community pulsing. Part of what I question in my master’s research is the lack of interest in academic studies in using IRC as a means of understanding the community under investigation.
    One of the things I like about IRC is that, unlike other instant messengers today, when we are there, we are usually sitting in front of the computer (our work tool). I mean, we are not (I think) lying in a hammock on the beach, for example. Ok, there could be exceptions. :)
    But to be honest, I rarely talk on a channel; I am usually just watching.
  2. There are cases where the complexity lies in understanding what already exists instead of in developing new features. I still don’t know if it’s frustrating or satisfying.
  3. I don’t know if I could understand things without using ctags. The ctags was a great tip from Siqueira even in FLUSP times.

As the GSoC coding time officially started this week (01/06/2020), this post is a line between my activities in the community bonding period and the official development of my project. I used the last month to solve issues in my development environment, improve the tool that supports my development activity (kworkflow), and study concepts and implementations related to my GSoC project.

Communication

Besides e-mail, IRC chat, and Telegram, my mentor (Siqueira) and I are meeting every Wednesday on Jitsi, where we also use tmate for terminal sharing. We also use, together with Trevor, a spreadsheet to schedule tasks, report my daily activity, and write any suggestions.

Issues in my development environment

You have probably had this kind of pending task: a problem that troubles your work, but since you can move ahead despite it, you postpone it until total tiredness. I had some small issues on my VM, from authentication to the kernel update process, that I had been postponing forever. The first I solved with some direct setup, and the last led me to the second task: improving kworkflow.

Improvements for a set of scripts that facilitate my development day-to-day

The lack of kworkflow support for deployment on a VM drove me to start hacking the code. As updating the VM requires understanding more complex things, I started by developing soft changes (i.e., less integrated with the script structure). This work is still in progress, and after discussing it with other kworkflow developers (in the issue, on IRC, and in voice meetings), the proposed changes were refined.

The first phase of my project proposal orbits around the IGT test kms_cursor_crc. I already have preliminary experience with the test code; however, I lacked a lot of knowledge about the steps and concepts behind the implementation. Bearing this in mind, I used part of the community bonding time to immerse myself in this code.

Checking issues in a patchset that was reported by IGT CI

My first project task is to find out why it is not possible to access debugfs files when running kms_cursor_crc (and fix it). Two things could help me solve it: learning about debugfs and dissecting kms_cursor_crc. To guide my studies, my mentor suggested taking a look at a patchset for the IGT write-back test implementation that CI reported a crash on debugfs_test for i915. For this investigation, I installed on another machine (an old netbook) a Debian without a graphical environment, and, accessing via ssh, I applied the patches and ran the test. Well, everything seemed to work (and the subtests passed). Perhaps something has been fixed or changed in IGT since the patchset was sent. Nothing more to do here.


IGT_FORCE_DRIVER=i915 build/tests/debugfs_test 
IGT-Version: 1.25-gf1884574 (x86_64) (Linux: 4.19.0-9-amd64 x86_64)
Force option used: Using driver i915
Force option used: Using driver i915
Starting subtest: sysfs
Subtest sysfs: SUCCESS (0,009s)
Starting subtest: read_all_entries
Subtest read_all_entries: SUCCESS (0,094s)
Starting subtest: read_all_entries_display_on
Subtest read_all_entries_display_on: SUCCESS (0,144s)
Starting subtest: read_all_entries_display_off
Subtest read_all_entries_display_off: SUCCESS (0,316s)

Diving into the kms_cursor_crc test

I’m writing a kind of anatomy from the kms_cursor_crc test. I chose the alpha-transparent subtest as a target and then followed each step necessary to achieve it, understanding each function called, parameters, and also abstractions. I am often confused by something that once seemed clear. Well, it comes to graphic interface stuff and is acceptable that theses abstraction will disorientate me LOL I guess… The result of this work will be my next post. In the meantime, here are links that helped me on this journey

In this post, I describe the steps involved in the execution of a kms_cursor_crc subtest. In my approach, I chose a subtest (pipe-A-cursor-alpha-transparent) as a target and examined the code from the beginning of the test (igt main) until reaching the target subtest and executing it.

This is my zero version. I plan to incrementally expand this document with evaluation/description of the other subtests. I will probably also need to fix some misunderstandings.

As described by IGT, kms_cursor_crc

Uses the display CRC support to validate cursor plane functionality. The test will position the cursor plane either fully onscreen, partially onscreen, or fully offscreen, using either a fully opaque or fully transparent surface. In each case it then reads the PF CRC and compares it with the CRC value obtained when the cursor plane was disabled.

In the past, Haneen has shown something about the test in a blog post. Fixing any issue in VKMS so that it passes all kms_cursor_crc subtests is also part of my GSoC project proposal.

The struct data_t

This struct, used in all subtests, stores many elements such as the DRM file descriptor, framebuffer info, and cursor size info. Also, before the test’s main, a static data_t data is declared globally.
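A trimmed sketch of it, reconstructed only from the fields this post touches (the real IGT definition has more members and needs the IGT headers):

typedef struct {
	int drm_fd;                      /* DRM file descriptor */
	igt_display_t display;
	igt_pipe_crc_t *pipe_crc;        /* pipe CRC capture handle */
	struct igt_fb primary_fb[2];     /* front and restore framebuffers */
	struct igt_fb fb;                /* cursor framebuffer */
	int curw, curh;                  /* current cursor size */
	int cursor_max_w, cursor_max_h;  /* maximum cursor size */
	enum pipe pipe;
} data_t;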

The beginning - igt_main

We can divide the main function into two parts: setup DRM stuff (igt_fixture) and subtest execution.

Initial test setup

igt_fixture {
	data.drm_fd = drm_open_driver_master(DRIVER_ANY);
	ret = drmGetCap(data.drm_fd, DRM_CAP_CURSOR_WIDTH, &cursor_width);
	igt_assert(ret == 0 || errno == EINVAL);
	/* Not making use of cursor_height since it is same as width, still reading */
	ret = drmGetCap(data.drm_fd, DRM_CAP_CURSOR_HEIGHT, &cursor_height);
	igt_assert(ret == 0 || errno == EINVAL);
		

drmGetCap(int fd, uint64_t capability, uint64_t *value) queries a capability of the DRM driver and returns 0 if the capability is supported. DRM_CAP_CURSOR_WIDTH and DRM_CAP_CURSOR_HEIGHT store a valid width/height for the hardware cursor.

/* We assume width and height are same so max is assigned width */
igt_assert_eq(cursor_width, cursor_height);

	kmstest_set_vt_graphics_mode();

void kmstest_set_vt_graphics_mode(void)

From lib/igt_kms.c: This function sets the controlling virtual terminal (VT) into graphics/raw mode and installs an igt exit handler to set the VT back to text mode on exit. All kms tests must call this function to make sure that the framebuffer console doesn’t interfere by e.g. blanking the screen.

    igt_require_pipe_crc(data.drm_fd);

void igt_require_pipe_crc(int fd)

From lib/igt_debugfs: checks whether pipe CRC capturing is supported by the kernel. Uses igt_skip to automatically skip the test/subtest if this isn’t the case.

    igt_display_require(&data.display, data.drm_fd);

void igt_display_require(igt_display_t *display, int drm_fd)

From lib/igt_kms.c: Initializes @display (a pointer to an #igt_display_t structure) and allocates the various resources required. This function automatically skips if the kernel driver doesn’t support any CRTC or outputs.

Run test on pipe

for_each_pipe_static(pipe)
	igt_subtest_group
		run_tests_on_pipe(&data, pipe);

At this point, each subtest of kms_cursor_crc is run, and a pointer to the already set-up struct data_t is passed along.

static void run_tests_on_pipe(data_t *data, enum pipe pipe)

This function runs each subtest grouped by pipe. In the setup, it fills in the passed data_t struct and then starts to call each subtest. In this document version, I only focus on the subtest test_cursor_transparent.

igt_subtest_f("pipe-%s-cursor-alpha-transparent", kmstest_pipe_name(pipe))
	run_test(data, test_cursor_transparent, data->cursor_max_w, data->cursor_max_h);

The execution of test_cursor_transparent

static void run_test(data_t *data, void (*testfunc)(data_t *), int cursor_w, int cursor_h)

The function run_test wraps the habitual preparation for running a subtest and also, after then, a cleanup. Therefore, it basically has three steps:

  1. Prepare CRTC: static void prepare_crtc(data_t *data, igt_output_t *output, int cursor_w, int cursor_h). This function is responsible for:
    • Select the pipe to be used
    • Create Front and Restore framebuffer of primary plane
    • Find a valid plane type for primary plane and cursor plane
    • Pairs primary framebuffer to its plane and sets a default size
    • Create a new pipe CRC
    • Position the cursor fully visible
    • Store test image as cairo surface
    • Start CRC capture process
  2. Run subtest: testfunc(data) » static void test_cursor_transparent(data_t *data) » test_cursor_alpha(data, 0.0)

    The subtest test_cursor_transparent is a variation of test_cursor_alpha where the alpha channel is set to zero (i.e., transparent). So, let’s take a look at the execution of test_cursor_alpha:

static void test_cursor_alpha(data_t *data, double a)
{
	igt_display_t *display = &data->display;
	igt_pipe_crc_t *pipe_crc = data->pipe_crc;
	igt_crc_t crc, ref_crc;
	cairo_t *cr;
	uint32_t fb_id;
	int curw = data->curw;
	int curh = data->curh;
	/*alpha cursor fb*/
	fb_id = igt_create_fb(data->drm_fd, curw, curh,
				    DRM_FORMAT_ARGB8888,
				    LOCAL_DRM_FORMAT_MOD_NONE,
				    &data->fb);

When this subtest starts, it creates the cursor’s framebuffer with the format ARGB8888 , i.e., a framebuffer with RGB plus Alpha channel (pay attention to endianness)

	igt_assert(fb_id);
	cr = igt_get_cairo_ctx(data->drm_fd, &data->fb);
	igt_paint_color_alpha(cr, 0, 0, curw, curh, 1.0, 1.0, 1.0, a);
	igt_put_cairo_ctx(data->drm_fd, &data->fb, cr);

Then, the test uses some Cairo resources to create a cairo surface for the cursor’s framebuffer and allocate a drawing context for it, draws a rectangle in RGB white with the given opacity (alpha channel), and, finally, releases the cairo surface and writes the changes out to the framebuffer. (Disclaimer: looking inside the function igt_put_cairo_ctx, I am not sure it does what the comments say, and I’m also not sure all the parameters are necessary.)

The test is divided into two parts: Hardware test and Software test.

	/*Hardware Test*/
	cursor_enable(data);
	igt_display_commit(display);
	igt_wait_for_vblank(data->drm_fd, data->pipe);
	igt_pipe_crc_get_current(data->drm_fd, pipe_crc, &crc);
	cursor_disable(data);
	igt_remove_fb(data->drm_fd, &data->fb);

The hardware test consists in:

  • Enable cursor: pair the cursor plane with a framebuffer, set the cursor size for its plane, and set the cursor plane size for the framebuffer. Commit the framebuffer changes of each display pipe
  • Wait for vblank. Vblank is a couple of extra scanlines region which aren’t actually displayed on the screen designed to give the electron gun (on CRTs) enough time to move back to the top of the screen to start scanning out the next frame. A vblank interrupt is used to notify the driver when it can start the updating of registers. To achieve tear-free display, users must synchronize page flips and/or rendering to vertical blanking [https://dri.freedesktop.org/docs/drm/gpu/drm-kms.html#vertical-blanking]
  • Calculate and check the current CRC
	cr = igt_get_cairo_ctx(data->drm_fd, &data->primary_fb[FRONTBUFFER]);
	igt_paint_color_alpha(cr, 0, 0, curw, curh, 1.0, 1.0, 1.0, a);
	igt_put_cairo_ctx(data->drm_fd, &data->primary_fb[FRONTBUFFER], cr);
	igt_display_commit(display);
	igt_wait_for_vblank(data->drm_fd, data->pipe);
	igt_pipe_crc_get_current(data->drm_fd, pipe_crc, &ref_crc);
	igt_assert_crc_equal(&crc, &ref_crc);

The software test consists in:

  • Create a cairo surface and drawing context, draw a rectangle on the top left side in RGB white with the given opacity (alpha channel), and then release the cairo context, applying the changes to the framebuffer. Then, commit the framebuffer changes of each display pipe.
  • Wait for vblank
  • Calculate and check the current CRC

Finally, the screen is clean.

And at the run_test level, the CRTC is cleaned up.

Meanwhile, in GSoC:

I took the second week of Community Bonding to make some improvements to my development environment. As I have reported before, I use a QEMU VM to develop kernel contributions. I was initially using an Arch VM for development; however, at the beginning of the year, I reconfigured it to use a Debian VM, since my host is a Debian installation - fewer context changes. In this move, some loose ends remained and I did some workarounds; well… better to round them off.

I also use kworkflow (KW) to ease most of the non-coding tasks included in day-to-day Linux kernel development. KW automates repetitive steps of a developer’s life, such as compiling and installing my kernel modifications; finding information to format and send patches correctly; and mounting or remotely accessing a VM. During the time that preceded the GSoC project submission, I noticed that the feature for installing a kernel inside the VM was incomplete. At that time, I started to use the “remote” option as a palliative. Therefore, I spent the last days learning more features and how to hack kworkflow to improve my development environment (and send the changes back to the kw project).

I have started by sending a minor fix on alert message:

kw: small issue on u/mount alert message

Then I expanded the feature “explore” - looking for a string in directory contents - by adding GNU grep utility in addition to the already used git grep. I gathered many helpful suggestions for this patch, and I applied them together with the reviews received in a new version:

src: add grep utility to explore feature

Finally, after many hours of searching, reading, and learning a little about guestfish, grub, initramfs-tools, and bash, I could create the first proposal of code changes that enable kw to automate the build and install of a kernel in a VM:

add support for deployment in a debian-VM

The main barrier to this feature was figuring out how to update the grub on the VM without running the update-grub command via ssh access. First, I thought about adding a custom file with a new entry to boot. Thinking and researching a little more, I realized that guestfish could solve the problem and, following this logic, I found a blog post describing how to run “update-grub” with guestfish. From that, I made some adaptations that created the solution.

However, in addition to updating grub, the feature still lacked some steps to properly install the kernel on the VM. I checked the missing code by visiting an old FLUSP tutorial that describes the step-by-step of compiling and installing the Linux kernel inside a VM. I also used the implementation of the “remote” mode of “kw deploy” to wrap it up.

Now I use kw to automatically compile and install a custom kernel on my development VM. So, time to sing: “Ooh, that’s why I’m easy; I’m easy as Sunday morning!”

Maybe not now. It’s time to learn more about IGT tests!

August 27, 2020

As a newbie, I consider debugging a guided study. During the process, I have a goal that leads me to browse the code, raise and drop various suspicions, look at the change history, and perhaps relate changes from other parts of the kernel to the current problem. Co-debugging is even more interesting. Approaches are discussed, we open our minds to more possibilities, refine solutions, and share knowledge… in addition to preventing incomplete solutions. Debugging the operation of vkms when composing planes was a very interesting journey. All the shared ideas were so dynamic and open that, in the end, it was linked to another demand: the implementation of writeback support.

And that is how I fell into another debugging journey.

Writeback support on vkms is something Rodrigo Siqueira has been working on for some time. However, with so many comings and goings of versions, other DRM changes were introducing new issues. With each new version, it was necessary to reassess the current state of the DRM structures in addition to incorporating revisions from the previous version.

When Leandro Ribeiro reported that calling drm_crtc_vblank_put frees the writeback work, this indicated that there was also an issue with the simulated vblank execution there. I asked Siqueira a little more about writeback, to try to collaborate with his work. He explained to me what writeback is, how it is implemented in his patches, and the IGT test used for validation.

In that state, writeback was crashing on the last subtest, writeback-check-output, again due to a timeout waiting for resource access. We discussed the possible causes and concluded that the solution lay in defining when the composer work needed to be enabled and how to ensure that vblank keeps happening to allow the writeback work as well. When Daniel suggested creating a small function to set output->composer_enabled, he was already thinking of writeback.

However, something that appeared to be straightforward was somewhat entangled by our initial assumption that the problem was concentrated in writeback-check-output. We later adjusted our investigation to also look at the writeback-fb-id subtest. We began this part of the debugging process using ftrace to check the execution steps of these two test cases. Looking at different execution cases and more parts of the code, we examined the writeback job to find the symmetries, the actions of each element, and the beginning and end of each cycle. We also included some prints in VKMS and the test code, and checked dmesg.

Siqueira concluded that VKMS needs to guarantee that the composer is enabled when starting the writeback job and release it (as well as vblank) in the job’s cleanup (which is currently called when a job is queued or when the state is destroyed). Another concern was to avoid scattered modifications and keep the changes well encapsulated in writeback functions.

With that, Siqueira wrapped up v5 and sent it out. Just fly, little bird:

As soon as it was sent, I checked that the patchset passed not only the kms_writeback subtests, but also that the other IGT tests I use remained stable: kms_flip, kms_cursor_crc and kms_pipe_crc_basic. With that, I noticed that a refactoring in the composer was out of date and reported the problem for correction (since it involved a fix I had recently made). I also indicated that, apart from this small problem (not directly related to writeback), writeback support was working well in the tests performed. However, the series has not yet received any other feedback.

The end? Hmm… never ends.

Extensions Extensions Extensions

Once more, I’ve spun the wheel of topics to blog about and landed on something extension-related. Such is the life of those working with Vulkan.

In this case, the extension is VK_EXT_custom_border_color, the underlying implementation for which I had the luxury of testing on ANV as written by the memelord himself, Ivan Briano.

Let’s get to it.

The setup for this (in zink_screen.c) is the same as for every extension, and I’ve blogged about this previously, so there’s no new ground to cover here.

Border Coloring

Fortunately, this is one of the simpler extensions to implement as well. This code is completely exclusive to the struct pipe_context::create_sampler_state hook, which is what gallium uses to have drivers set up their samplers (not sampler views, as those are separate).

Here’s the current implementation:

static void *
zink_create_sampler_state(struct pipe_context *pctx,
                          const struct pipe_sampler_state *state)
{
   struct zink_screen *screen = zink_screen(pctx->screen);

   VkSamplerCreateInfo sci = {};
   sci.sType = VK_STRUCTURE_TYPE_SAMPLER_CREATE_INFO;
   sci.magFilter = zink_filter(state->mag_img_filter);
   sci.minFilter = zink_filter(state->min_img_filter);

   if (state->min_mip_filter != PIPE_TEX_MIPFILTER_NONE) {
      sci.mipmapMode = sampler_mipmap_mode(state->min_mip_filter);
      sci.minLod = state->min_lod;
      sci.maxLod = state->max_lod;
   } else {
      sci.mipmapMode = VK_SAMPLER_MIPMAP_MODE_NEAREST;
      sci.minLod = 0;
      sci.maxLod = 0;
   }

   sci.addressModeU = sampler_address_mode(state->wrap_s);
   sci.addressModeV = sampler_address_mode(state->wrap_t);
   sci.addressModeW = sampler_address_mode(state->wrap_r);
   sci.mipLodBias = state->lod_bias;

   if (state->compare_mode == PIPE_TEX_COMPARE_NONE)
      sci.compareOp = VK_COMPARE_OP_NEVER;
   else {
      sci.compareOp = compare_op(state->compare_func);
      sci.compareEnable = VK_TRUE;
   }

   sci.borderColor = VK_BORDER_COLOR_FLOAT_TRANSPARENT_BLACK; // TODO
   sci.unnormalizedCoordinates = !state->normalized_coords;

   if (state->max_anisotropy > 1) {
      sci.maxAnisotropy = state->max_anisotropy;
      sci.anisotropyEnable = VK_TRUE;
   }

   struct zink_sampler_state *sampler = CALLOC(1, sizeof(struct zink_sampler_state));
   if (!sampler)
      return NULL;

   if (vkCreateSampler(screen->dev, &sci, NULL, &sampler->sampler) != VK_SUCCESS) {
      FREE(sampler);
      return NULL;
   }

   return sampler;
}

It’s just populating a VkSamplerCreateInfo struct from struct pipe_sampler_state. Nothing too unusual.

Looking at the VK_EXT_custom_border_color spec, the only change required for a sampler to handle a custom border color is to add a populated VkSamplerCustomBorderColorCreateInfoEXT struct into the pNext chain for the VkSamplerCreateInfo.

The struct populating is simple enough:

VkSamplerCustomBorderColorCreateInfoEXT cbci = {};
cbci.sType = VK_STRUCTURE_TYPE_SAMPLER_CUSTOM_BORDER_COLOR_CREATE_INFO_EXT;
assert(screen->border_color_feats.customBorderColorWithoutFormat);
cbci.format = VK_FORMAT_UNDEFINED;
/* these are identical unions */
memcpy(&cbci.customBorderColor, &state->border_color, sizeof(union pipe_color_union));
sci.pNext = &cbci;
sci.borderColor = VK_BORDER_COLOR_INT_CUSTOM_EXT;

Helpfully here, the layouts of union pipe_color_union and the customBorderColor member of the Vulkan struct are the same, so this can just be copied directly. I added an assert to catch the case where the customBorderColorWithoutFormat feature isn’t provided by the driver; currently I’m being a bit lazy since ANV handles it just fine, but this shouldn’t be too much work to fix up if zink is ever run on a driver that doesn’t support it.

With this set, border colors will be drawn as expected.

Conditional

It’s not always the case that border coloring is necessary, however, and there are also limits that must be obeyed.

For the former, the wrap values (struct pipe_sampler_state::wrap_(s|t|r), which are all enum pipe_tex_wrap) can be checked from the sampler state to determine whether specified border colors are actually going to be used, allowing them to be omitted when they’re unnecessary. Specifically, a small helper function like this can be used:

static inline bool
wrap_needs_border_color(unsigned wrap)
{
   return wrap == PIPE_TEX_WRAP_CLAMP || wrap == PIPE_TEX_WRAP_CLAMP_TO_BORDER ||
          wrap == PIPE_TEX_WRAP_MIRROR_CLAMP || wrap == PIPE_TEX_WRAP_MIRROR_CLAMP_TO_BORDER;
}

For other wrap modes, the border color can be ignored.

Looking now to limits, VK_EXT_custom_border_color provides the maxCustomBorderColorSamplers limit, which is the maximum number of samplers using custom border colors that can exist simultaneously. To avoid potential issues with this, a counter is atomically managed on the screen object based on the lifetimes of samplers that use custom border colors—another reason to avoid unnecessarily including custom border color structs here.

Altogether, the changed bits of the new function look like this:


static void *
zink_create_sampler_state(struct pipe_context *pctx,
                          const struct pipe_sampler_state *state)
{
   ...

   bool need_custom = false;
   need_custom |= wrap_needs_border_color(state->wrap_s);
   need_custom |= wrap_needs_border_color(state->wrap_t);
   need_custom |= wrap_needs_border_color(state->wrap_r);

   VkSamplerCustomBorderColorCreateInfoEXT cbci = {};
   if (screen->have_EXT_custom_border_color && need_custom) {
      cbci.sType = VK_STRUCTURE_TYPE_SAMPLER_CUSTOM_BORDER_COLOR_CREATE_INFO_EXT;
      assert(screen->border_color_feats.customBorderColorWithoutFormat);
      cbci.format = VK_FORMAT_UNDEFINED;
      /* these are identical unions */
      memcpy(&cbci.customBorderColor, &state->border_color, sizeof(union pipe_color_union));
      sci.pNext = &cbci;
      sci.borderColor = VK_BORDER_COLOR_INT_CUSTOM_EXT;
      uint32_t check = p_atomic_inc_return(&screen->cur_custom_border_color_samplers);
      assert(check <= screen->max_custom_border_color_samplers);
   } else
      sci.borderColor = VK_BORDER_COLOR_FLOAT_TRANSPARENT_BLACK; // TODO with custom shader if we're super interested?

   ...
}
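
For completeness, the matching destroy hook has to decrement the counter again. Here’s a minimal sketch, assuming the sampler object stores a (hypothetical) custom_border_color flag recorded at creation time:

static void
zink_delete_sampler_state(struct pipe_context *pctx, void *sampler_state)
{
   struct zink_sampler_state *sampler = sampler_state;
   struct zink_screen *screen = zink_screen(pctx->screen);

   /* hypothetical flag set when the custom border color struct was chained in */
   if (sampler->custom_border_color)
      p_atomic_dec(&screen->cur_custom_border_color_samplers);

   /* deferring vkDestroySampler until the sampler is no longer in use
    * is omitted here for brevity */
   vkDestroySampler(screen->dev, sampler->sampler, NULL);
   FREE(sampler);
}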

Borders colored.

August 26, 2020

glClear

Once more jumping around, here’s a brief look at an interesting issue I came across while implementing ARB_clear_texture.

When I started looking at the existing clear code in zink, I discovered that the only existing codepath for clears required being inside a renderpass using vkCmdClearAttachments. This meant that if a renderpass wasn’t active, we started one, which was suboptimal for performance, as unnecessarily starting a renderpass means we may end up having to end the renderpass just as quickly in order to emit some non-renderpass commands.

I decided to do something about this along the way. Vulkan has specific commands for clearing color and depth outside of renderpasses, and the usage is very straightforward. The key is detecting whether they’re capable of being used.

Detection

There’s a number of conditions for using these specialized Vulkan clear functions:

  • cannot be in a renderpass
  • must be intending to clear the entire surface (i.e., no scissor or full-surface scissor)
  • no 3D surfaces unless clearing all layers of the surface

The last is a product of a spec issue that allows drivers to interpret a 3D surface as possessing a single “3D” layer instead of the actual count of layers it contains. As a result, any time a 3D surface is cleared with a nonzero layer specified, it can’t be cleared except with vkCmdClearAttachments.

The second criterion is less tricky, requiring only that the specified scissor region of the clear covers the entire surface being cleared. That can be accomplished like so with some mesa helper APIs:

static bool
clear_needs_rp(unsigned width, unsigned height, struct u_rect *region)
{
   struct u_rect intersect = {0, width, 0, height};

   if (!u_rect_test_intersection(region, &intersect))
      return true;

   u_rect_find_intersection(region, &intersect);
   if (intersect.x0 != 0 || intersect.y0 != 0 ||
       intersect.x1 != width || intersect.y1 != height)
      return true;

   return false;
}

If the intersection of the surface and the scissor region isn’t the entire size of the surface, true is returned and the renderpass codepath is used.

With this check done, and having verified previously that no renderpass is active on the current batch’s command buffer, some helper functions to perform the individual clear operations can be added:

static void
clear_color_no_rp(struct zink_batch *batch, struct zink_resource *res, const union pipe_color_union *pcolor, unsigned level, unsigned layer, unsigned layerCount)
{
   VkImageSubresourceRange range = {};
   range.baseMipLevel = level;
   range.levelCount = 1;
   range.baseArrayLayer = layer;
   range.layerCount = layerCount;
   range.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;

   VkClearColorValue color;
   color.float32[0] = pcolor->f[0];
   color.float32[1] = pcolor->f[1];
   color.float32[2] = pcolor->f[2];
   color.float32[3] = pcolor->f[3];

   if (res->layout != VK_IMAGE_LAYOUT_GENERAL && res->layout != VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL)
      zink_resource_barrier(batch, res, VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL, 0);
   vkCmdClearColorImage(batch->cmdbuf, res->image, res->layout, &color, 1, &range);
}

static void
clear_zs_no_rp(struct zink_batch *batch, struct zink_resource *res, VkImageAspectFlags aspects, double depth, unsigned stencil, unsigned level, unsigned layer, unsigned layerCount)
{
   VkImageSubresourceRange range = {};
   range.baseMipLevel = level;
   range.levelCount = 1;
   range.baseArrayLayer = layer;
   range.layerCount = layerCount;
   range.aspectMask = aspects;

   VkClearDepthStencilValue zs_value = {depth, stencil};

   if (res->layout != VK_IMAGE_LAYOUT_GENERAL && res->layout != VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL)
      zink_resource_barrier(batch, res, VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL, 0);
   vkCmdClearDepthStencilImage(batch->cmdbuf, res->image, res->layout, &zs_value, 1, &range);
}

This is as simple as populating the required structs, adding in a pipeline barrier if necessary to set the image layout, and then emitting the commands.
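
For context, here’s a hypothetical sketch of how the clear hook might choose between the two paths; the in_renderpass() and scissor_region() helpers are made-up names standing in for the real state checks:

if (!in_renderpass(ctx) &&
    !clear_needs_rp(fb->width, fb->height, scissor_region(ctx))) {
   if (buffers & PIPE_CLEAR_COLOR)
      clear_color_no_rp(batch, res, pcolor, level, layer, layerCount);
   if (buffers & PIPE_CLEAR_DEPTHSTENCIL)
      clear_zs_no_rp(batch, res, aspects, depth, stencil, level, layer, layerCount);
   return;
}
/* otherwise fall back to starting a renderpass and using vkCmdClearAttachments */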

Gotcha

As per usual, there’s a gotcha here. Clearing color surfaces in this way doesn’t take into account SRGBness as the renderpass codepath does, which means it’ll actually clear using the wrong colors if the surface is an RGBA view of an SRGB image. This means that the color value has to be converted prior to calling the clear_color_no_rp() helper:

/* local copy of the clear color that eventually gets passed to clear_color_no_rp() */
union pipe_color_union color = *pcolor;
if (psurf->format != res->base.format &&
    !util_format_is_srgb(psurf->format) && util_format_is_srgb(res->base.format)) {
   /* if SRGB mode is disabled for the fb with a backing srgb image then we have to
    * convert this to srgb color
    */
   color.f[0] = util_format_srgb_to_linear_float(pcolor->f[0]);
   color.f[1] = util_format_srgb_to_linear_float(pcolor->f[1]);
   color.f[2] = util_format_srgb_to_linear_float(pcolor->f[2]);
}

Here, psurf is the struct pipe_surface given by the gallium struct pipe_context::clear hook, and res is the struct zink_resource (aka subclassed struct pipe_resource) image that the surface is a view of. The util_format API takes care of checking srgb status, and then it also can be used to handle the conversion back to the expected format.

Now the color can be safely passed along to achieve the expected results, and out-of-renderpass clearing can safely be used for all cases where the original conditions are met.

August 25, 2020

Let's talk about eggs. X has always supported XSendEvent() which allows anyone to send any event to any client [1]. However, this event had a magic bit to make it detectable, so clients detect and subsequently ignore it. Spoofing input that just gets ignored is of course not productive, so in the year 13 BG [2] the XTest extension was conceived. XTest has a few requests that allow you to trigger a keyboard event (press and release, imagine the possibilities), buttons and pointer motion. The name may seem odd until someone explains to you that it was primarily written to support automated testing of X servers. But no-one has the time to explain that.

Having a separate extension worked around the issue of detectability and thus any client could spoof input events. Security concerns were addressed with "well, just ifdef out that extension then" which worked great until other applications started using it for input emulation. Since around ~2008 XTest events are emulated through special XTest devices in the server but that is solely to make the implementation less insane. Technically this means that XTest events are detectable again, except that no-one bothers to actually do that. Having said that, these devices only make it possible to detect an XTest event, but not which client sent that event. And, due to how the device hierarchy works, it's really hard to filter out those events anyway.

Now it's 2020 and we still have an X server that basically allows access to anything and any client to spoof input. This level of security is industry standard for IoT devices but we are trying to be more restrictive than that on your desktop, lest the stuff you type actually matters. The accepted replacement for X is of course Wayland, but Wayland-the-protocol does not provide an extension to emulate input. This makes sense, emulating input is not exactly a display server job [3] but it does leave us with a bunch of things no longer working.

libei

So let's talk about libei. This new library for Emulated Input consists of two components: libei, the client library to be used in applications, and libeis, the part to be used by an Emulated Input Server, to be used in the compositors. libei provides the API to connect to an EIS implementation and send events. libeis provides the API to receive those events and pass them on to the compositor. Let's see how this looks in the stack:


+--------------------+             +------------------+
| Wayland compositor |---wayland---| Wayland client B |
+--------------------+\            +------------------+
| libinput |  libeis | \_wayland______
+----------+---------+               \
     |          |                     +-------+------------------+
 /dev/input/    +---brei--------------| libei | Wayland client A |
                                      +-------+------------------+

"brei" is the communication "Bridge for EI" and is a private protocol between libei and libeis.

Emulated input is not particularly difficult, mostly because when you're emulating input, you know exactly what you are trying to do. There is no interpretation of bad hardware data like libinput has to do, it's all straightforward. Hence the communication between libei and libeis is relatively trivial, it's all button, keyboard, pointer and touch events. Nothing fancy here and I won't go into details because that part will work as you expect it to. The selling point of libei is elsewhere, primarily in separation, distinction and control.

libei is a separate library from libinput or libwayland or Xlib or whatever else you may have. By definition, it is thus a separate input stream in both the compositor and the client. This means the compositor can do things like display a warning message while emulated input is active. Or it can re-route the input automatically, e.g. assign a separate wl_seat to the emulated input devices so they can be independently focused, etc. Having said that, the libeis API is very similar to libinput's API so integration into compositors is quite trivial because most of the hooks to process incoming events are already in place.

libei distinguishes between different clients, so the compositor is always aware of who is currently emulating input. You have synergy, xdotool and a test suite running at the same time? The compositor is aware of which client is sending which events and can thus handle those accordingly.

Finally, libei provides control over the emulated input. This works on multiple levels. A libei client has to request device capabilities (keyboard, touch, pointer) and the compositor can restrict to a subset of those (e.g. "no keyboard for you"). Second, the compositor can suspend/resume a device at any time. And finally, since the input events go through the compositor anyway, it can discard events it doesn't like. For example, even where the compositor allowed keyboards and the device is not suspended, it may still filter out Escape key presses. Or rewrite those to be Caps Lock because we all like a bit of fun, don't we?

The above isn't technically difficult either, libei itself is not overly complex. The interaction between an EI client and an EIS server is usually the following:

  • client connects to server and says hello
  • server disconnects the client, or accepts it
  • client requests one or more devices with a set of capabilities
  • server rejects those devices or allows those devices and/or limits their capabilities
  • server resumes the device (because they are suspended by default)
  • client sends events
  • client disconnects, server disconnects the client, server suspends the device, etc.
Again, nothing earth-shattering here. There is one drawback though: the server must approve of a client and its devices, so a client is not able to connect, send events and exit. It must wait until the server has approved the devices before it can send events. This means tools like xdotool need to be handled in a special way, more on that below.

Flatpaks and portals

With libei we still have the usual difficulties: a client may claim it's synergy when really it's bad-hacker-tool. There's not that much we can do outside a sandbox, but once we are guarded by a portal, things look different:


+--------------------+
| Wayland compositor |_
+--------------------+ \
| libinput |  libeis |  \_wayland______
+----------+---------+                \
     |      [eis-0.socket]             \
 /dev/input/    /    \\                 +-------+------------------+
                |      ================>| libei | Wayland client A |
                |      after handover   +-------+------------------+
         initial|                        /
      connection|                       / initial request
                |                      / dbus[org.freedesktop.portal.EmulatedInput]
       +--------------------+         /
       | xdg-desktop-portal |---------
       +--------------------+
The above shows the interaction when a client is run inside a sandbox with access to the org.freedesktop.portal.Desktop bus. The libei client connects to the portal (it won't see the EIS server otherwise), the portal hands it back a file descriptor to communicate on and from then on it's like a normal EI session. What the portal implementation can do though is write several Restrictions for EIS on the server side of the file descriptor using libreis. Usually, the portal would write the app-id of the client (thus guaranteeing a reliable name) and maybe things like "don't allow keyboard capabilities for any device". Once the fd is handed to the libei client, the restrictions cannot be loosened anymore so a client cannot overwrite its own name.

So the full interaction here becomes:

  • Client connects to org.freedesktop.portal.EmulatedInput
  • Portal implementation verifies that the client is allowed to emulate input
  • Portal implementation obtains a socket to the EIS server
  • Portal implementation sends the app id and any other restrictions to the EIS server
  • Portal implementation returns the socket to the client
  • Client creates devices, sends events, etc.
For a client, connecting to the portal instead of the EIS server directly is currently a one-line change.

Note that the portal implementation is still in its very early stages and there are changes likely to happen. The general principle is unlikely to change though.

Xwayland

Turns out we still have a lot of X clients around so somehow we want to be able to use those. Under Wayland, those clients connect to Xwayland which translates X requests to Wayland requests. X clients will use XTest to emulate input which currently goes to where the dodos went. But we can add libei support to Xwayland and connect XTest, at which point the diagram looks like this:


+--------------------+             +------------------+
| Wayland compositor |---wayland---| Wayland client B |
+--------------------+\            +------------------+
| libinput |  libeis | \_wayland______
+----------+---------+               \
     |          |                     +-------+------------------+
 /dev/input/    +---brei--------------| libei |     XWayland     |
                                      +-------+------------------+
                                                      |
                                                      | XTEST
                                                      |
                                                +-----------+
                                                | X client  |
                                                +-----------+
XWayland is basically just another Wayland client but it knows about XTest requests, and it knows about the X clients that request those. So it can create a separate libei context for each X client, with pointer and keyboard devices that match the XTest abilities. The end result of that is you can have an xdotool libei context, a synergy libei context, etc. And of course, where XWayland runs in a sandbox it could use the libei portal backend. At which point we have a sandboxed X client using a portal to emulate input in the Wayland compositor. Which is pretty close to what we want.

Where to go from here?

So, at this point the libei repo is still sitting in my personal gitlab namespace. Olivier Fourdan has done most of the Xwayland integration work and there's a basic Weston implementation. The portal work (tracked here) is the one needing the most attention right now, followed by the implementations in the real compositors. I think I have tentative agreement from the GNOME and KDE developers that all this is a good idea. Given that the goal of libei is to provide a single approach to emulate input that works on all(-ish) compositors [4], that's a good start.

Meanwhile, if you want to try it, grab libei from git, build it, and run the eis-demo-server and ei-demo-client tools. And for portal support, run the eis-fake-portal tool, just so you don't need to mess with the real one. At the moment, those demo tools will have a client connecting a keyboard and pointer and sending motion, button and 'qwerty' keyboard events every few seconds. The latter with client and/or server-set keymaps because that's possible too.

Eggs

What does all this have to do with eggs? "Ei", "Eis", "Brei", and "Reis" are, respectively, the German words for "egg", "ice" or "ice cream", "mush" (think: porridge) and "rice". There you go, now you can go on holidays to a German speaking country and sustain yourself on a nutritionally imbalanced diet.

[1] The whole "any to any" is a big thing in X and just shows that in the olden days you could apparently trust, well, apparently anyone
[2] "before git", i.e. too bothersome to track down so let's go with the Copyright notice in the protocol specification
[3] XTest the protocol extension exists presumably because we had a hammer and a protocol extension thus looked like a nail
[4] Let's not assume that every compositor having their own different approach to emulation is a good idea

The Tessellating Commences

Tessellation shaders were the second type of new shader I worked to implement after geometry shaders, which I haven’t blogged about yet.

As I always say, why start at the beginning when I could start at the middle, then go back to the beginning, then skip right to the end.

Tessellation happens across two distinct shader stages: tessellation control (TCS) and tessellation evaluation (TES). In short, the former sets up parameters for the latter to generate vertex data with. In OpenGL, TCS can be omitted by specifying default inner and outer levels with glPatchParameterfv which will then be propagated to gl_TessLevelInner and gl_TessLevelOuter, respectively.
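
As a refresher, the GL-side setup being emulated looks roughly like this; a sketch using the standard ARB_tessellation_shader API, with levels picked arbitrarily:

static const GLfloat inner[2] = {4.0f, 4.0f};
static const GLfloat outer[4] = {4.0f, 4.0f, 4.0f, 4.0f};
glPatchParameteri(GL_PATCH_VERTICES, 3);                  /* patch size */
glPatchParameterfv(GL_PATCH_DEFAULT_INNER_LEVEL, inner);  /* feeds gl_TessLevelInner */
glPatchParameterfv(GL_PATCH_DEFAULT_OUTER_LEVEL, outer);  /* feeds gl_TessLevelOuter */
glDrawArrays(GL_PATCHES, 0, 3);                           /* no TCS bound */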

Vulkan, however, will only perform tessellation when both TCS and TES are present for a given graphics pipeline.

There are piglit tests which verify GL’s TCS-less tessellation capability.

Uh-oh.

Faking It

As is the case with many things in this project, Vulkan’s lack of support for the “easier” OpenGL method of doing things requires jumping through a number of hoops. In this case, the hoops amount to getting default inner/outer levels from gallium states into the TES stage.

Going into this problem, I knew nothing about any of these functionalities or shaders (or even that they existed), but I quickly learned a number of key points:

  • TES implicitly expects to read gl_TessLevelInner and gl_TessLevelOuter variables
  • TCS expands and propagates all VS outputs to something like type output_var[gl_MaxPatchVertices] for the TES to read
  • Vulkan has no concept of “default” levels
  • Push constants exist

The last of these was the key to my approach. In short, if a TCS doesn’t exist, I just need to make one. Since I already have full control of the pipeline state, I know when a TCS is absent, and I also know the default tessellation levels. Thus, I can emit them to a shader using a push constant, and I can also create my own passthrough TCS stage to jam into the pipeline.

Ideally, this passthrough shader would look a bit like this pseudoglsl:

#version 150
#extension GL_ARB_tessellation_shader : require

in vec4 some_var[gl_MaxPatchVertices];
out vec4 some_var_out;

layout(push_constant) uniform tcsPushConstants {
    layout(offset = 0) float TessLevelInner[2];
    layout(offset = 8) float TessLevelOuter[4];
} u_tcsPushConstants;
layout(vertices = $vertices_per_patch) out;
void main()
{
  gl_TessLevelInner = u_tcsPushConstants.TessLevelInner;
  gl_TessLevelOuter = u_tcsPushConstants.TessLevelOuter;
  some_var_out = some_var[gl_InvocationID];
}

where all input variables get expanded to gl_MaxPatchVertices, and the default levels are pulled from the push constant block.

Shader Construction

Now, it’s worthwhile to note that at this point, ntv had no ability to load push constants. With that said, I’ve already blogged extensively about the process for UBO handling, and the mechanisms for both types of load are more or less identical. This means that as long as I structure my push constant data the same way I’ve expected my UBOs to look (i.e., a struct containing an array of vec4s), I can just copy and paste the existing code with a couple minor tweaks and simplifications, since I know this codepath will only be hit by fixed-function shader code that I’ve personally verified.

Therefore, I don’t actually have to blog about the specific loading mechanism, and I can just refer to that post and skip to the juicy stuff.

Specifically, I’m talking about writing the above tessellation control shader in NIR.

As a quick note, I’ve created (in patches after the initial implementation) a struct for managing the overall push constant data, which at this point would look like this:

struct zink_push_constant {
   float default_inner_level[2];
   float default_outer_level[4];
};

Now, let’s jump in.

nir_shader *nir = nir_shader_create(NULL, MESA_SHADER_TESS_CTRL, &nir_options, NULL);
nir_function *fn = nir_function_create(nir, "main");
fn->is_entrypoint = true;
nir_function_impl *impl = nir_function_impl_create(fn);

First, I’ve created a new shader for the TCS stage, and I’ve added a “main” function that’s flagged as a shader entrypoint. This is the base of every shader.

nir_builder b;
nir_builder_init(&b, impl);
b.cursor = nir_before_block(nir_start_block(impl));

Here, a nir_builder is instantiated for the shader. nir_builder is the tool which allows modifying existing shaders. It operates using a “cursor” position, which determines where new instructions are being added.

To begin inserting instructions, the cursor is positioned at the start of the function block that’s just been created.

nir_intrinsic_instr *invocation_id = nir_intrinsic_instr_create(b.shader, nir_intrinsic_load_invocation_id);
nir_ssa_dest_init(&invocation_id->instr, &invocation_id->dest, 1, 32, "gl_InvocationID");
nir_builder_instr_insert(&b, &invocation_id->instr);

The first instruction is loading the gl_InvocationID variable, which is a single 32-bit integer. The instruction is created, then the destination is initialized to the type being stored since this instruction type loads a value, and then the instruction is inserted into the shader using the nir_builder.

nir_foreach_shader_out_variable(var, vs->nir) {
   const struct glsl_type *type = var->type;
   const struct glsl_type *in_type = var->type;
   const struct glsl_type *out_type = var->type;
   char buf[1024];
   snprintf(buf, sizeof(buf), "%s_out", var->name);
   in_type = glsl_array_type(type, 32 /* MAX_PATCH_VERTICES */, 0);
   out_type = glsl_array_type(type, vertices_per_patch, 0);

   nir_variable *in = nir_variable_create(nir, nir_var_shader_in, in_type, var->name);
   nir_variable *out = nir_variable_create(nir, nir_var_shader_out, out_type, buf);
   out->data.location = in->data.location = var->data.location;
   out->data.location_frac = in->data.location_frac = var->data.location_frac;

   /* gl_in[] receives values from equivalent built-in output
      variables written by the vertex shader (section 2.14.7).  Each array
      element of gl_in[] is a structure holding values for a specific vertex of
      the input patch.  The length of gl_in[] is equal to the
      implementation-dependent maximum patch size (gl_MaxPatchVertices).
      - ARB_tessellation_shader
    */
   for (unsigned i = 0; i < vertices_per_patch; i++) {
      /* we need to load the invocation-specific value of the vertex output and then store it to the per-patch output */
      nir_if *start_block = nir_push_if(&b, nir_ieq(&b, &invocation_id->dest.ssa, nir_imm_int(&b, i)));
      nir_deref_instr *in_array_var = nir_build_deref_array(&b, nir_build_deref_var(&b, in), &invocation_id->dest.ssa);
      nir_ssa_def *load = nir_load_deref(&b, in_array_var);
      nir_deref_instr *out_array_var = nir_build_deref_array_imm(&b, nir_build_deref_var(&b, out), i);
      nir_store_deref(&b, out_array_var, load, 0xff);
      nir_pop_if(&b, start_block);
   }
}

Next, all the output variables from the vertex shader are looped over and expanded to arrays of MAX_PATCH_VERTICES size, which in the case of mesa is 32. These values are loaded based on gl_InvocationID, so each store to the expanded variable is conditional on the current invocation id matching the patch-vertex index being written, as checked in the nir_push_if line where the conditional is added.

nir_variable *gl_TessLevelInner = nir_variable_create(nir, nir_var_shader_out, glsl_array_type(glsl_float_type(), 2, 0), "gl_TessLevelInner");
gl_TessLevelInner->data.location = VARYING_SLOT_TESS_LEVEL_INNER;
gl_TessLevelInner->data.patch = 1;
nir_variable *gl_TessLevelOuter = nir_variable_create(nir, nir_var_shader_out, glsl_array_type(glsl_float_type(), 4, 0), "gl_TessLevelOuter");
gl_TessLevelOuter->data.location = VARYING_SLOT_TESS_LEVEL_OUTER;
gl_TessLevelOuter->data.patch = 1;

Following this, the gl_TessLevelInner and gl_TessLevelOuter variables must be created so they can be processed in ntv and converted to the corresponding variables that SPIRV and the underlying vulkan driver expect to see. All that’s necessary for this case is to set the types of the variables, the slot location, and then flag them as patch variables.

/* hacks so we can size these right for now */
struct glsl_struct_field *fields = ralloc_size(nir, 2 * sizeof(struct glsl_struct_field));
fields[0].type = glsl_array_type(glsl_uint_type(), 2, 0);
fields[0].name = ralloc_asprintf(nir, "gl_TessLevelInner");
fields[0].offset = 0;
fields[1].type = glsl_array_type(glsl_uint_type(), 4, 0);
fields[1].name = ralloc_asprintf(nir, "gl_TessLevelOuter");
fields[1].offset = 8;
nir_variable *pushconst = nir_variable_create(nir, nir_var_shader_in,
                                              glsl_struct_type(fields, 2, "struct", false), "pushconst");
pushconst->data.location = VARYING_SLOT_VAR0;

Next up, creating a variable to represent the push constant for the TCS stage that’s going to be injected. SPIRV is always explicit about variable sizes, so it’s important that everything match up with the size and offset in the push constant that’s being emitted. The inner level value will be first in the push constant layout, so this has offset 0, and the outer level will be second, so this has an offset of sizeof(gl_TessLevelInner), which is known to be 8.

nir_intrinsic_instr *load_inner = nir_intrinsic_instr_create(b.shader, nir_intrinsic_load_push_constant);
load_inner->src[0] = nir_src_for_ssa(nir_imm_int(&b, offsetof(struct zink_push_constant, default_inner_level)));
nir_intrinsic_set_base(load_inner, offsetof(struct zink_push_constant, default_inner_level));
nir_intrinsic_set_range(load_inner, 8);
load_inner->num_components = 2;
nir_ssa_dest_init(&load_inner->instr, &load_inner->dest, 2, 32, "TessLevelInner");
nir_builder_instr_insert(&b, &load_inner->instr);

nir_intrinsic_instr *load_outer = nir_intrinsic_instr_create(b.shader, nir_intrinsic_load_push_constant);
load_outer->src[0] = nir_src_for_ssa(nir_imm_int(&b, offsetof(struct zink_push_constant, default_outer_level)));
nir_intrinsic_set_base(load_outer, offsetof(struct zink_push_constant, default_outer_level));
nir_intrinsic_set_range(load_outer, 16);
load_outer->num_components = 4;
nir_ssa_dest_init(&load_outer->instr, &load_outer->dest, 4, 32, "TessLevelOuter");
nir_builder_instr_insert(&b, &load_outer->instr);

for (unsigned i = 0; i < 2; i++) {
   nir_deref_instr *store_idx = nir_build_deref_array_imm(&b, nir_build_deref_var(&b, gl_TessLevelInner), i);
   nir_store_deref(&b, store_idx, nir_channel(&b, &load_inner->dest.ssa, i), 0xff);
}
for (unsigned i = 0; i < 4; i++) {
   nir_deref_instr *store_idx = nir_build_deref_array_imm(&b, nir_build_deref_var(&b, gl_TessLevelOuter), i);
   nir_store_deref(&b, store_idx, nir_channel(&b, &load_outer->dest.ssa, i), 0xff);
}

Finally, the push constant data can be set to the created variables. Each variable needs two parts:

  • loading the push constant data
  • storing the loaded data to the variable

Each load of the push constant data requires a base and a range along with a single source value. The base and the source are both the offset of the value being loaded (referenced from struct zink_push_constant), and the range is the size of the value, which is constant based on the variable.

Using the same mechanic as the first intrinsic instruction that was created, each intrinsic is created and has its destination value initialized based on the size and number of components of the type being loaded (first load 2 values, then 4 values). A src is created based on the struct offset, which determines the offset at which data will be loaded, and the instruction is inserted using the nir_builder.

Following this, the loaded value is then dereferenced for each array member and stored component-wise into the actual gl_TessLevel* variable.

nir->info.tess.tcs_vertices_out = vertices_per_patch;
nir_validate_shader(nir, "created");

pushconst->data.mode = nir_var_mem_push_const;

At last, the finishing touches. The vertices_per_patch value is set so that SPIRV can process the shader, and the shader is then validated by NIR to ensure that I didn’t screw anything up.

And then the push constant variable gets its mode swizzled to nir_var_mem_push_const so that ntv can special case it. This is very illegal, and so it has to happen at the point when absolutely no other NIR passes will be run on the shader or everything will explode.

But it’s totally safe, I promise.

With all this done, the only remaining step is to set up the push constant in the Vulkan pipeline, which is trivial by comparison.
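
For illustration, that setup amounts to declaring a push constant range in the pipeline layout and uploading the values before the draw. A minimal sketch, where default_inner, default_outer, cmdbuf, and pipeline_layout are assumed names:

VkPushConstantRange pcr = {};
pcr.stageFlags = VK_SHADER_STAGE_TESSELLATION_CONTROL_BIT;
pcr.offset = 0;
pcr.size = sizeof(struct zink_push_constant);

VkPipelineLayoutCreateInfo plci = {};
plci.sType = VK_STRUCTURE_TYPE_PIPELINE_LAYOUT_CREATE_INFO;
plci.pushConstantRangeCount = 1;
plci.pPushConstantRanges = &pcr;
/* ...descriptor set layouts as usual... */

/* then, before drawing: */
struct zink_push_constant pc;
memcpy(pc.default_inner_level, default_inner, sizeof(pc.default_inner_level));
memcpy(pc.default_outer_level, default_outer, sizeof(pc.default_outer_level));
vkCmdPushConstants(cmdbuf, pipeline_layout,
                   VK_SHADER_STAGE_TESSELLATION_CONTROL_BIT,
                   0, sizeof(pc), &pc);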

The TCS is picked up automatically after being jammed into the existing shader stages, and everything works like it should.

August 24, 2020

After Some Time

Extended renderer info (GLX_MESA_query_renderer):
    Vendor: Collabora Ltd (0x8086)
    Device: zink (Intel(R) Iris(R) Plus Graphics (ICL GT2)) (0x8a52)
    Version: 20.3.0
    Accelerated: yes
    Video memory: 11690MB
    Unified memory: yes
    Preferred profile: core (0x1)
    Max core profile version: 4.6
    Max compat profile version: 3.0
    Max GLES1 profile version: 1.1
    Max GLES[23] profile version: 3.2
OpenGL vendor string: Collabora Ltd
OpenGL renderer string: zink (Intel(R) Iris(R) Plus Graphics (ICL GT2))
OpenGL core profile version string: 4.6 (Core Profile) Mesa 20.3.0-devel (git-b229c29e4d)
OpenGL core profile shading language version string: 4.60

It’s been about three months since I jumped into the project to learn more about graphics drivers, and zink has now gone from supporting GL 3.0 to GL 4.6 and GLES 3.2 compatibility*. Currently I’m at a 91% pass rate on piglit tests while forcing ANV to do unsupported fp64 ops, which seems like a pretty good result.

*in my branch

Not bad for some random guy sitting in his basement.

What’s Next?

Well, first up is probably going to be a bit of a break. I’ve been surfing the gfx pipeline nonstop for the past couple weeks, as has been hinted at by my lack of posts here, and so there’s some non-graphics stuff I’ve been putting off doing.

After that, I imagine I’ll be going back to work on some of these test failures and go through more of the CTS results since there’s a lot of test coverage over there.

There’s also lots of refactoring work that I’ve flagged for doing at some point when I had time or interest.

Also, more blogging.

When Will This Be Merged?

Patch review is a complex process during which code must be carefully examined for defects. It takes time to do it right, and it shouldn’t be rushed.

With that said, I’ve got around 300 patches in my branch (likely far more once I break up some of the giant wip ones), so it’s unlikely that all of it will be landed anytime in the very near future.

Anyone interested in trying things out immediately can pull my zink-wip branch and follow the wiki instructions for building and running the zink driver on top of existing distro-provided vulkan drivers without any other changes, potentially even submitting bug reports for me to look at.

Tomorrow, I’ll be starting a dive here into the tessellation shader implementation and some “cool” things I had to do there.

August 20, 2020

I had some requirements for writing a vulkan software rasterizer within the Mesa project. I took some time to look at the options and realised that just writing a vulkan layer on top of gallium's llvmpipe would be a good answer for this problem. However in doing so I knew people would ask why this wouldn't work for a hardware driver.

tl;dr DO NOT USE VALLIUM OVER A GALLIUM HW DRIVER.

What is vallium?

The vallium layer is a gallium frontend. It takes the Vulkan API and roughly translates it into the gallium API.

How does it do that?

Vulkan is a low-level API; it allows the user to allocate memory, create resources, and record command buffers, amongst other things. When a hw vulkan driver is recording a command buffer, it is putting hw-specific commands into it that will be run directly on the GPU. These command buffers are submitted to queues when the app wants to execute them.

Gallium is a context-level API, i.e. like OpenGL/D3D10. The user has to create resources and contexts, and the driver internally manages command buffers etc. The driver controls internal flushing and queuing of command buffers.
 
In order to bridge the gap, the vallium layer abstracts the gallium context into a separate thread of execution. When recording a vulkan command buffer it creates a CPU side command buffer containing an encoding of the Vulkan API. It passes that recorded CPU command buffer to the thread on queue submission. The thread then creates a gallium context, and replays the whole CPU recorded command buffer into the context, one command at a time.
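
As a toy illustration of that record-and-replay idea (not vallium's actual code), the CPU-side command buffer is essentially an array of tagged unions that the gallium thread decodes at submit time:

enum cmd_type { CMD_BIND_PIPELINE, CMD_DRAW };

struct cmd {
   enum cmd_type type;
   union {
      struct { void *pipeline; } bind;
      struct { uint32_t vertex_count, first_vertex; } draw;
   } u;
};

/* runs on the gallium thread at queue submission */
static void
replay_cmd_buffer(struct pipe_context *pctx, const struct cmd *cmds, unsigned count)
{
   for (unsigned i = 0; i < count; i++) {
      switch (cmds[i].type) {
      case CMD_BIND_PIPELINE:
         /* translate the pipeline's state into gallium CSO binds */
         break;
      case CMD_DRAW:
         /* build a pipe_draw_info and call pctx->draw_vbo() */
         break;
      }
   }
}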

That sounds horrible, isn't it slow?

Yes.

Why doesn't that matter for *software* drivers?

Software rasterizers are a very different proposition from an overhead point of view than real hardware. CPU rasterization is pretty heavy on the CPU load, so nearly always 90% of your CPU time will be in the rasterizer and fragment shader. Having some minor CPU overheads around command submission and queuing isn't going to matter in the overall profile of the user application. CPU rasterization is already slow, the Vulkan->gallium translation overhead isn't going to be the reason for making it much slower.

For real HW drivers which are meant to record their own command buffers in the GPU domain and submit them direct to the hw, adding in a CPU layer that just copies the command buffer data is a massive overhead and one that can't easily be removed from the vallium layer.

The vallium execution context is also pretty horrible: it has to connect all the state pieces like shaders etc. to the gallium context, and disconnect them all at the end of each command buffer. There is only one command submission queue and one context to be used. A lot of hardware exposes more queues etc. that this will never model.

I still don't want to write a vulkan driver, give me more reasons.

Pipeline barriers:

Pipeline barriers in Vulkan are essential to efficient driver hw usage. They are one of the most difficult to understand and hardest to get right pieces of writing a vulkan driver. For a software rasterizer they are also mostly unneeded. When I get a barrier I just completely hardflush the gallium context because I know the sw driver behind it. For a real hardware driver this would be a horrible solution. You spend a lot of time trying to make anything optimal here.

Memory allocation:

Vulkan is built around the idea of separate memory allocation and objects binding to those allocations. Gallium is built around object allocation with the memory allocs happening implicitly. I've added some simple memory allocation objects to the gallium API for swrast. These APIs are in no way useful for hw drivers. There is no way to expose memory types or heaps from gallium usefully. The current memory allocation API works for software drivers because I know all they want is an aligned_malloc. There is no decent way to bridge this gap without writing a new gallium API that looks like Vulkan. (in which case just write a vulkan driver already).

Can this make my non-Vulkan capable hw run Vulkan?

No. If the hardware can't do virtual memory properly, or expose features for vulkan this can't be fixed with a software layer that just introduces overhead.


August 19, 2020

In the past few weeks, I was examining two issues on vkms: the development of writeback support and alpha blending. I try to keep activities in parallel so that one can rescue me from any tiredness caused by the other :P

Alpha blending is a TODO[1] of the VKMS that possibly could solve the warning[2] triggered in the cursor-alpha-transparent testcase. And, you know, if you still have a warning, you still have work.

  1. Blend - blend value at vaddr_src with value at vaddr_dst
     * TODO: Use the alpha value to blend vaddr_src with vaddr_dst
     *	 instead of overwriting it.
    
  2. WARNING: Suspicious CRC: All values are 0

To develop this feature, I needed to understand and “practice” some abstract concepts: alpha composition, bitwise operations, and endianness. The beginning was a little confusing, as I was reading many definitions and rarely seeing practical examples, which is terrible for dealing with abstractions. I searched for information on Google and, little by little, the code took shape in my head.

The code, the problem solved and a new problem found

I combined what I understood from that reading with what was already in VKMS. An important note is that I consider that when we blend an alpha layer onto the background, the final composition is solid (that is, it has an opaque alpha): the result on the screen is not a transparent plate, it is usually a black background. That said, the resulting alpha channel is 100% opaque, so we set it to 0xFF.

When executing my code, the two test cases related to the alpha channel (opaque and transparent) passed cleanly. But for me, it still wasn’t enough. I needed to see the colors read and the result of the composition (putting some pr_info in the code). Besides, only testing extreme values (solid black background, completely white cursor with a totally opaque or transparent alpha) did not convince me. You see, I’m a bit of a pessimist and a little wary.

So I decided to play with the cursor-alpha testcase… and I scraped myself.

Ouch!

I changed the transparent test case from 0.0 to 0.5, and things started to fail. After a few rounds of checking the returned values, what I always saw was a cursor color (ARGB) returned as follows:

  • A: the alpha determined (ok)
  • RGB: the color resulting from a white cursor with the set transparency already applied to a standard solid black background (not ok).

For example, I expected that a white cursor with transparency 0.5 would return ARGB = 80FFFFFF, but VKMS was reading 80808080 (???)

What is that?

I did more experiments to understand what was going on. I checked whether 0.5 transparency worked on i915, and it did. I also changed the test RGB color of the cursor and the level of transparency. The cursor color read was always a combination of a given A plus the RGB blended with black. I could only conclude that the mismatch happens because, in the hardware test, composing a cursor color 80808080 over a background FF000000 (primary) produced the final result FF404040, whereas in the software test step what was drawn had the final color FF808080.

The developed code would also do that if it were getting the right color… how do I deal with this, and why is it not a problem on the i915 driver?

Reading the documentation

I was reading the cairo manual in addition to performing different color-combination tests. I suspected it could be a format problem, but even if it was, I had no idea what was “wrong”. After some days, I found the reason:

CAIRO_FORMAT_ARGB32

each pixel is a 32-bit quantity, with alpha in the upper 8 bits, then red, then
green, then blue. The 32-bit quantities are stored native-endian.
Pre-multiplied alpha is used. (That is, 50% transparent red is 0x80800000, not
0x80ff0000.)

Pre-multiplied alpha is used. Thanks, cairo… or not!

But what is pre-multiplied alpha? I didn’t even know it existed.

So, I needed to read about the difference between straight and premultiplied alpha. Moreover, I needed to figure out how to do alpha blending using this format.

After some searching, I found a post on Nvidia - Gameworks Blog that helped me a lot to understand this approach. Besides, the good old StackOverflow clarified what I needed to do.

So I adapted the code I had previously written for straight alpha blending to work with alpha-premultiplied colors. Of course, I put pr_info in to see the result. I also played a little more with the cursor and background colors in the kms_cursor_crc test for checking. For example, I changed the background color to a solid red and/or the cursor color and/or the cursor alpha value. In all these situations, the patch worked fine.
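
The per-channel math for the premultiplied “over” operator ends up being quite small. Here’s a sketch of the formula (illustrating the math, not VKMS’s actual code):

#include <stdint.h>

/* Premultiplied "over": out = src + dst * (1 - alpha_src), where the src and
 * dst channels have already been multiplied by their respective alphas. */
static inline uint8_t
blend_channel_premul(uint8_t src, uint8_t dst, uint8_t src_alpha)
{
   return src + (uint8_t)((dst * (255 - src_alpha)) / 255);
}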

So, I just sent:

A new interpretation for the warning

Examining this behavior, I discovered a more precise cause of the warning. As VKMS currently only overwrites the cursor over the background, a fully transparent black cursor has a color of 0 and results in a transparent primary layer instead of solid black. When doing XRGB, VKMS zeroes the alpha instead of making it opaque, which seems incorrect and triggers the warning.

I also think that this shows the weakness of the full opaque/transparent tests since they were not able to expose the wrong composition of colors. So I will prepare a patch for IGT to expand the coverage of this type of problem.

Well, now that I’m doing the alpha blending, the warning will never bother me anyway.

Let it go!

Yet Another XFB Post

Sort of.

In the course of working on some code to introduce more granular resource-based pipeline barriers for synchronization, I rediscovered some parts of the xfb implementation we have that use barriers, and this got me thinking about our barrier usage in general for buffer resources.

At present, we have exactly two places in the zink codebase where we emit pipeline barriers for buffer resources:

  • when binding an xfb output buffer as a target for indirect draws (i.e., drawing transform feedback)
  • when binding an xfb output buffer for stream output

That’s it.

So no matter how else we may use buffer resources, whether it’s as a writable SSBO in a vertex shader, or a readable SSBO in a fragment shader, or an index buffer, we never use barriers.

Instead, we mostly just flush the pipeline (vkQueueSubmit) and/or stall by waiting on a fence to work around our lack of barrier usage.

Example

An example of this in the repo now is zink’s pipe_context::set_vertex_buffers hook:

static void
zink_set_vertex_buffers(struct pipe_context *pctx,
                        unsigned start_slot,
                        unsigned num_buffers,
                        const struct pipe_vertex_buffer *buffers)
{
   struct zink_context *ctx = zink_context(pctx);

   if (buffers) {
      for (int i = 0; i < num_buffers; ++i) {
         const struct pipe_vertex_buffer *vb = buffers + i;
         struct zink_resource *res = zink_resource(vb->buffer.resource);

         ctx->gfx_pipeline_state.bindings[start_slot + i].stride = vb->stride;
         if (res && res->needs_xfb_barrier) {
            /* if we're binding a previously-used xfb buffer, we need cmd buffer synchronization to ensure
             * that we use the right buffer data
             */
            pctx->flush(pctx, NULL, 0);
            res->needs_xfb_barrier = false;
         }
      }
      ctx->gfx_pipeline_state.hash = 0;
   }

   util_set_vertex_buffers_mask(ctx->buffers, &ctx->buffers_enabled_mask,
                                buffers, start_slot, num_buffers);
}

Here, the needs_xfb_barrier flag is set any time a resource is used as a stream output buffer for a draw, and then we force a flush (which also ends the current renderpass and forces us to start a new one) before we start the next draw in order to sync.

Full disclosure: I wrote the weird parts here, but probably nobody remembers that far back anyway.

Improvement

Now clearly it would be better for performance reasons to not be unnecessarily flushing our command buffer or forcing a new renderpass here.

Instead, a barrier can be used. The needs_xfb_barrier parts of zink_set_vertex_buffers() can be removed, and now sights are set on the repo-current version of a helper function for zink_draw_vbo() which is used to bind vertex buffers for use:

static void
zink_bind_vertex_buffers(struct zink_batch *batch, struct zink_context *ctx)
{
   VkBuffer buffers[PIPE_MAX_ATTRIBS];
   VkDeviceSize buffer_offsets[PIPE_MAX_ATTRIBS];
   const struct zink_vertex_elements_state *elems = ctx->element_state;
   for (unsigned i = 0; i < elems->hw_state.num_bindings; i++) {
      struct pipe_vertex_buffer *vb = ctx->buffers + ctx->element_state->binding_map[i];
      assert(vb);
      if (vb->buffer.resource) {
         struct zink_resource *res = zink_resource(vb->buffer.resource);
         buffers[i] = res->buffer;
         buffer_offsets[i] = vb->buffer_offset;
         zink_batch_reference_resoure(batch, res);
      } else {
         buffers[i] = zink_resource(ctx->dummy_buffer)->buffer;
         buffer_offsets[i] = 0;
      }
   }

   if (elems->hw_state.num_bindings > 0)
      vkCmdBindVertexBuffers(batch->cmdbuf, 0,
                             elems->hw_state.num_bindings,
                             buffers, buffer_offsets);
}

In this function, the to-be-bound vertex buffers are iterated over to populate the info for vkCmdBindVertexBuffers(), and a reference is also added to the batch they’ll be used on. But, despite being used as vertex inputs, there’s no barrier here to inform the vulkan driver of zink’s synchronization needs.

A revised version of this function now looks like this:


static void
zink_bind_vertex_buffers(struct zink_batch *batch, struct zink_context *ctx)
{
   VkBuffer buffers[PIPE_MAX_ATTRIBS];
   VkDeviceSize buffer_offsets[PIPE_MAX_ATTRIBS];
   const struct zink_vertex_elements_state *elems = ctx->element_state;
   for (unsigned i = 0; i < elems->hw_state.num_bindings; i++) {
      struct pipe_vertex_buffer *vb = ctx->buffers + ctx->element_state->binding_map[i];
      assert(vb);
      if (vb->buffer.resource) {
         struct zink_resource *res = zink_resource(vb->buffer.resource);
         buffers[i] = res->buffer;
         buffer_offsets[i] = vb->buffer_offset;
         zink_resource_buffer_barrier(batch->cmdbuf, res, VK_ACCESS_VERTEX_ATTRIBUTE_READ_BIT,
                                      VK_PIPELINE_STAGE_VERTEX_INPUT_BIT);
         zink_batch_reference_resource_rw(batch, res, false);
      } else {
         buffers[i] = zink_resource(ctx->dummy_buffer)->buffer;
         buffer_offsets[i] = 0;
      }
   }

   if (elems->hw_state.num_bindings > 0)
      vkCmdBindVertexBuffers(batch->cmdbuf, 0,
                             elems->hw_state.num_bindings,
                             buffers, buffer_offsets);
}

There’s now a barrier being added through the zink_resource_buffer_barrier() helper function, using the buffer’s current VkAccessFlags and VkPipelineStageFlags from its last state when a change is needed. This ensures that the driver knows it’s transitioning the buffer from its previous usage to the current usage, that of being a vertex input.
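
For reference, such a helper might look something like the following sketch, assuming zink_resource tracks its last access mask and pipeline stage in (hypothetical) access and access_stage fields:

static void
zink_resource_buffer_barrier(VkCommandBuffer cmdbuf, struct zink_resource *res,
                             VkAccessFlags new_access, VkPipelineStageFlags new_stage)
{
   if (res->access == new_access && res->access_stage == new_stage)
      return; /* no transition needed */

   VkBufferMemoryBarrier bmb = {};
   bmb.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
   bmb.srcAccessMask = res->access;
   bmb.dstAccessMask = new_access;
   bmb.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
   bmb.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
   bmb.buffer = res->buffer;
   bmb.offset = 0;
   bmb.size = VK_WHOLE_SIZE;
   vkCmdPipelineBarrier(cmdbuf,
                        res->access_stage ? res->access_stage : VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,
                        new_stage, 0, 0, NULL, 1, &bmb, 0, NULL);

   /* record the new state for the next transition */
   res->access = new_access;
   res->access_stage = new_stage;
}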

But wait, how does the pipeline know that the buffers have been used for xfb previously?

More changes are needed.

Currently, when zink binds xfb buffers again after pausing transform feedback (and only in this case), the following function will be called:

static void
zink_emit_xfb_counter_barrier(struct zink_context *ctx)
{
   /* Between the pause and resume there needs to be a memory barrier for the counter buffers
    * with a source access of VK_ACCESS_TRANSFORM_FEEDBACK_COUNTER_WRITE_BIT_EXT
    * at pipeline stage VK_PIPELINE_STAGE_TRANSFORM_FEEDBACK_BIT_EXT
    * to a destination access of VK_ACCESS_TRANSFORM_FEEDBACK_COUNTER_READ_BIT_EXT
    * at pipeline stage VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT.
    *
    * - from VK_EXT_transform_feedback spec
    */
   VkBufferMemoryBarrier barriers[PIPE_MAX_SO_OUTPUTS] = {};
   unsigned barrier_count = 0;

   for (unsigned i = 0; i < ctx->num_so_targets; i++) {
      struct zink_so_target *t = zink_so_target(ctx->so_targets[i]);
      if (t->counter_buffer_valid) {
          barriers[i].sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
          barriers[i].srcAccessMask = VK_ACCESS_TRANSFORM_FEEDBACK_COUNTER_WRITE_BIT_EXT;
          barriers[i].dstAccessMask = VK_ACCESS_TRANSFORM_FEEDBACK_COUNTER_READ_BIT_EXT;
          barriers[i].buffer = zink_resource(t->counter_buffer)->buffer;
          barriers[i].size = VK_WHOLE_SIZE;
          barrier_count++;
      }
   }
   struct zink_batch *batch = zink_batch_no_rp(ctx);
   vkCmdPipelineBarrier(batch->cmdbuf,
      VK_PIPELINE_STAGE_TRANSFORM_FEEDBACK_BIT_EXT,
      VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT,
      0,
      0, NULL,
      barrier_count, barriers,
      0, NULL
   );
   ctx->xfb_barrier = false;
}

This works great for resuming xfb, but it doesn’t particularly help with other cases, such as using a buffer for stream output for the first time. An improvement is to check the buffer’s pipeline state when it’s set as a stream output target (in zink_set_stream_output_targets()) and flag the need for a barrier there. At that point, the above function can be changed:

static void
zink_emit_xfb_counter_barrier(struct zink_context *ctx)
{
   /* Between the pause and resume there needs to be a memory barrier for the counter buffers
    * with a source access of VK_ACCESS_TRANSFORM_FEEDBACK_COUNTER_WRITE_BIT_EXT
    * at pipeline stage VK_PIPELINE_STAGE_TRANSFORM_FEEDBACK_BIT_EXT
    * to a destination access of VK_ACCESS_TRANSFORM_FEEDBACK_COUNTER_READ_BIT_EXT
    * at pipeline stage VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT.
    *
    * - from VK_EXT_transform_feedback spec
    */
   struct zink_batch *batch = zink_batch_no_rp(ctx);
   for (unsigned i = 0; i < ctx->num_so_targets; i++) {
      struct zink_so_target *t = zink_so_target(ctx->so_targets[i]);
      struct zink_resource *res = zink_resource(t->counter_buffer);
      if (t->counter_buffer_valid)
          zink_resource_buffer_barrier(batch->cmdbuf, res,
                                       VK_ACCESS_TRANSFORM_FEEDBACK_COUNTER_READ_BIT_EXT,
                                       VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT);
      else
          zink_resource_buffer_barrier(batch->cmdbuf, res,
                                       VK_ACCESS_TRANSFORM_FEEDBACK_COUNTER_WRITE_BIT_EXT,
                                       VK_PIPELINE_STAGE_TRANSFORM_FEEDBACK_BIT_EXT);
   }
   ctx->xfb_barrier = false;
}

Now the barriers are set as needed upon resuming xfb, but they are also emitted when xfb begins, which ensures synchronization both at that starting point and for any future usage of the buffer, such as being read as a vertex input.
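For illustration, the flagging described above might look something like this in zink_set_stream_output_targets(); this is a sketch of the idea rather than the exact driver code:

static void
zink_set_stream_output_targets(struct pipe_context *pctx,
                               unsigned num_targets,
                               struct pipe_stream_output_target **targets,
                               const unsigned *offsets)
{
   struct zink_context *ctx = zink_context(pctx);
   /* ... reference and store the new targets ... */
   ctx->num_so_targets = num_targets;
   /* request that zink_emit_xfb_counter_barrier() runs before the next draw */
   ctx->xfb_barrier = true;
}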

I’ve already done some work to add similar changes throughout the driver, which should yield improved memory consistency for buffers as well as performance gains when a new renderpass would otherwise need to be started after an unnecessary queue submission.

August 17, 2020

XFB: The Re-Return of the Sequel I Didn’t Want

In the course of handling ARB_enhanced_layouts, I came across another interesting xfb-related issue: shaders are now allowed to specify all the xfb buffer info in the layout, and they’re also now allowed to set the location of the variable. And locations aren’t considered to overlap unless they have overlapping components.

For those who recall so many years ago when I explored xfb integration in ntv, you know that this is a problem because I’ve been tracking xfb outputs based solely on the location.

That is to say:

SpvId var_id = spirv_builder_emit_var(&ctx->builder, pointer_type, SpvStorageClassOutput);
ctx->outputs[var->data.location] = var_id;

As the snippet shows, the variable id (and type info) is indexed using nir_variable::data.location, on the assumption that no variables will share the same location.

Oops.

Yeah, in retrospect, this was stupid. I already had some parts of the code here which accounted for collision of base locations (e.g., one component from a variable might be output to buffer A, and another component might be output to buffer B), but I didn’t account for it here.

For the “good” handling, I’ve been using a key value of nir_variable::data.location << 2 | nir_variable::data.location_frac. This expands the base location to a 4-component index (because slots are sized for vec4 types), and then adds in the component usage to give a precise index that has no conflicts.
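In code, that amounts to a one-line change at the point where outputs are recorded (with ctx->outputs now sized for four entries per location):

/* four component slots per location; location_frac selects the component */
unsigned slot = (var->data.location << 2) | var->data.location_frac;
ctx->outputs[slot] = var_id;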

Problem solved.

Almost

In some cases, there are output variables which don’t have a component (location_frac) set, but the xfb output does use the component value. Luckily, this is a simple case to handle: starting from the expanded index of the xfb output’s varying slot, the base index can be found with a simple loop.

SpvId output = ctx->outputs[location];
while (!output)
   output = ctx->outputs[location--];

At most, this will go back three indices before it finds the output, which will always be nonzero.

I’m optimistically hoping that this will be the end of xfb handling, but probably it won’t be.

August 13, 2020

Recently we’ve looked a bit at lockdep annotations in the GPU subsystems, and I figured it’s a good opportunity to explain how this all works, and what the tradeoffs are. Creating working locking hierarchies for the kernel isn’t easy; making sure the kernel’s locking validator lockdep is happy, and that reviewers don’t have their brains explode, even more so.

First things first, and the fundamental issue:

Lockdep is about trading false positives against better testing.

The only way to avoid false positives for deadlocks is to only report a deadlock when the kernel actually deadlocked. Which is useless, since the entire point of lockdep is to catch potential deadlock issues before they actually happen. Hence false positives are not avoidable, at least not in theory, if potential issues are to be reported before they hang the machine. Read on for what to do in practice.

We need to understand how exactly lockdep trades false positives for better discovery of locking inconsistencies. Lockdep makes a few assumptions about how real code does locking in practice:

Invariance of locking rules over time

The first assumption baked into lockdep is that the locking rules for a given lock do not change over the lifetime of the lock’s existence. This already throws out a large chunk of perfectly correct locking designs, since state transitions can control how an object is accessed, and therefore how its lock is used. Examples include different rules for creation and destruction, or whether an object is on a specific list (e.g. only a gpu buffer object that’s in the lru can be evicted). It’s not possible to automatically prove that certain code flat out won’t ever run together with some other code on the same structure, at least not in general. Hence this is pretty much a required assumption to make lockdep useful - if every new lock() call could follow new rules there’s nothing to check, besides realizing that an actual deadlock indeed occurred and all is lost already.

And of course getting such state transitions correct, with a guarantee that all the old code will no longer run, is tricky and very hard on reviewers. It’s a good thing lockdep has problems with such code too.

Common locking rules for the same objects

The second assumption is that all locks initialized by the same code follow the same locking rules. This is achieved by making all lock initializers C macros, which create the corresponding lockdep class as a static variable within the calling function. Again this is pretty much required, since to spot inconsistencies you need as many observations as possible of all the different code paths, which are best shared across all the objects initialized by the same code. Also, a distinct lockdep class for each individual object would explode the runtime overhead in both memory and CPU cycles.
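As an illustration, the kernel’s mutex_init() looks roughly like this:

/* each mutex_init() call site creates its own static lockdep class, which
 * is then shared by every mutex initialized at that site */
#define mutex_init(mutex)                              \
do {                                                   \
        static struct lock_class_key __key;            \
                                                       \
        __mutex_init((mutex), #mutex, &__key);         \
} while (0)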

And again this is good from a code design point too, since having the same data structure and code follow different locking rules for different objects is at best very confusing for reviewers.

Fighting lockdep, badly

Now things go wrong: you have a lockdep splat on your hands, you’ve concluded it’s a false positive, and you go ahead trying to teach lockdep about what’s going on. The first class of annotations are the special lock_nested(lock, subclass) functions. Without lockdep enabled nothing in the generated code changes, but they tell lockdep that for this lock acquisition, we’re using a different class to track the observed locking.

This breaks both the time invariance - nothing is stopping you from using different classes for the same lock at different times - and commonality of locking for the same objects. Worse, you can write code which obviously deadlocks, but lockdep will think everything is perfectly fine:

mutex_init(&A);

mutex_lock(&A);
/* deadlock: A is acquired twice, but lockdep tracks the second
 * acquisition under a different class and reports nothing */
mutex_lock_nested(&A, SINGLE_DEPTH_NESTING);

This is no good and puts a huge burden on reviewers to carefully check all these places themselves, manually. Exactly the kind of tedious and error prone work lockdep was meant to take over.

Slightly better are the annotations which adjust the lockdep class once, when the object is initialized, using lockdep_set_class() and related functions. This at least does not break time invariance, and hence guarantees that lockdep spots the deadlock at the latest when it actually happens. It still reduces how much lockdep can connect what’s going on, but occasionally “rewrite the entire subsystem” to resolve a locking inconsistency is just not a reasonable option.

It still means that reviewers always need to remember the locking rules for all the different types of objects behind the same structure, instead of just one, and then check against every path whether that code needs to work with all of them, or just some, or only one. Again, tedious work that lockdep is really supposed to help with. And if it’s hard to come by a system where you can easily exercise the code for the different types of objects without rebooting, then lockdep cannot help at all.

All these annotations have in common that they don’t change the code logic, only how lockdep interprets what’s going on.

An even more insidious trick on reviewers and lockdep is to push locking into an asynchronous worker of some sort. This hides issues because lockdep does not follow dependencies between threads through waiter/wakee relationships like wait_for_completion() and complete(), or through wait queues. There are lockdep annotations for specific dependencies, like in the kernel’s workqueue code when flushing workers or specific work items with flush_work(). Automatic annotations have been attempted with the lockdep cross-release extension, which for various reasons had to be backed out again. Hand-rolled asynchronous code is therefore a great place to create complexity and hide locking issues from both lockdep and reviewers.

Playing to lockdep’s strength

Except when there’s very strong justification for all the complexity, the real fix is to change the locking and make it simpler. Simple enough for lockdep to understand what’s going on, which also makes reviewers’ lives a lot better. Often this means substantial code rework, but at least in some cases there are useful tricks.

A special kind of annotation is the lock_nest_lock(lock, superlock) family of functions - these tell lockdep that when multiple locks of the same class are acquired, it’s all serialized by the single superlock. Lockdep then validates that the right superlock is indeed held. A great example is mm_take_all_locks(), which, as the name implies, takes all locks related to the given mm_struct. In a sense this is not a pure annotation, unlike the ones above, since it requires that the superlock is actually locked. For reviewers too, not just for lockdep, this is generally an easier scheme to understand than some clever lock-acquisition ordering.
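A sketch of the pattern, using the mutex flavor mutex_lock_nest_lock(); the object names here are made up:

/* all the objects' locks share one lockdep class; acquiring them all in a
 * loop is fine because the (held) superlock serializes such bulk locking */
mutex_lock(&dev->obj_list_lock);            /* the superlock */
list_for_each_entry(obj, &dev->objs, node)
   mutex_lock_nest_lock(&obj->lock, &dev->obj_list_lock);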

A different situation often arises when creating or destroying an object. At that stage, often no other thread has a reference to the object yet, and therefore none can take the lock, so the best way to resolve a locking inconsistency over the lifetime of an object due to creation and destruction code is to not take any locks at all in these paths. There is nothing to protect against, after all!

In all these cases the best option for long term maintainability is to simplify the locking design, not to reduce lockdep’s power by reducing the number of false positives it reports. And that should be the general principle.

tldr; do not fix lockdep false positives, fix your locking

August 12, 2020

After a long debugging process, we finally reached an acceptable solution to the problem of running sequential subtests of IGT tests involving CRC capture with VKMS.

Despite all the time, uncertainty, and failures, this was by far one of the greatest lessons learned in my journey as a FLOSS developer. Through this experience, I learned a lot about how VKMS works, evolved technically, experienced different feelings, and interacted with people I may never meet in person. These people shared some of their knowledge with me through open and accessible communication. The interactions allowed us to build a more suitable and effective solution together, and at that moment, everything I thought was important in the development of FLOSS came to fruition.

I know it sounds very romantic, but after more than a month of it, breathing is not a simple and involuntary movement.

With this patch, I was able to update my progress table on GSoC:

Result/Status   GSoC start   After accepted patches
pass            7            24 (+17)
warn            1            1
fail            2            0 (-2)
skip            236          221 (-15)
crash           0            0
timeout         0            0
incomplete      0            0
dmesg-warn      0            0
dmesg-fail      0            0
changes         0            0
fixes           0            0
regressions     0            0
total           246          246

The endless wait

I believe I have already talked about this problem in some other posts, as it has followed me since the elaboration of my GSoC project.

I was trying to solve the failure of one of the IGT kms_cursor_crc subtests, cursor-alpha-transparent. As this subtest has an implementation similar to cursor-alpha-opaque (which was passing), I used both to observe the ARGB -> XRGB manipulation happening in CRC capture. However, strangely, cursor-alpha-opaque kept alternating between success and failure on the same kernel. Gradually I noticed that this was a widespread problem with the sequential execution of subtests on VKMS.

In the beginning, I only noticed that the test crashed while waiting for a vblank event that was never going to happen. So I looked for solutions that enabled vblanks in time to prevent the subtest from getting stuck. However, as Daniel mentioned, these solutions just looked like duct tape.

The endless debugging

Initially, I found two problems and thought that both were related to the test implementation. So I sent patches to IGT to correct the cleanup process and to force a vblank. While waiting for feedback, I realized that a previously successful subtest was now failing, even with the solution I intended: a bad smell. I searched the history for the patch that changed this behavior and, in fact, reverting it let the subtests flow.

It was then that I saw a patch sent to the DRM mailing list in which someone faced a similar problem, and I decided to participate in the discussion. From this point, doubts and suspicions increased even more. On the bright side, however, I wasn’t the only person looking into it. With our time zone differences, there were three of us sharing findings for whoever took the next shift.

I sent a first patch about this issue on June 25th (in the wrong direction). After so much feedback, adjustments, and some other wrong attempts, the solution was finally accepted on August 8th.

The whole story is documented in the following discussions (patchwork links):

  1. Aiming in the wrong direction
  2. Joining the discussion of people that observed a similar problem
  3. Trying to hit the target … and failing
  4. Hey… I had this idea! But again, it was not ideal
  5. Finally, we come to a solution

And when it all seemed to come to an end, a fourth person appeared, raising doubts about other parts of VKMS that were, in a way, related to the problem at hand.

And that was amazing! There’s still a lot of work out there!

Now I’m still a little tired of it, but I’ll try to write a more technical post about the solution. I’ve changed contexts, so going back to it will take a little mental work.

For the next developments

  • Don’t keep the problem to yourself; share your doubts.
  • When making a patch, don’t forget the credits! It doesn’t make much difference for you, but it can for those who were with you on that journey. See more
  • Each thread/patch is full of information that can help you with other problems.
  • If you can’t dive into the code, browse through it.
  • ftrace is also a good friend
  • The mental effort is less if you write a post before changing the context.

P.S.: I still need to find a way to feel comfortable speaking on IRC.

August 10, 2020

enum pipe_format

One thing that’s everywhere in mesa (at least outside of mesa core) is enum pipe_format. This enum is used to describe image formats. The general way that it works is that there’s a hook in struct pipe_screen:

   /**
    * Check if the given pipe_format is supported as a texture or
    * drawing surface.
    * \param bindings  bitmask of PIPE_BIND_*
    */
   bool (*is_format_supported)( struct pipe_screen *,
                                enum pipe_format format,
                                enum pipe_texture_target target,
                                unsigned sample_count,
                                unsigned storage_sample_count,
                                unsigned bindings );

Each gallium driver implements this hook, and then gallium queries the driver before creating a resource using a given format and then also before using a created resource in various ways.
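To make that concrete, here’s a minimal sketch of how a Vulkan-backed driver could answer the query by mapping bind flags to format features. The screen layout and the vk_format_from_pipe() helper are assumptions for illustration, not the actual zink code:

static bool
sketch_is_format_supported(struct pipe_screen *pscreen,
                           enum pipe_format format,
                           enum pipe_texture_target target,
                           unsigned sample_count,
                           unsigned storage_sample_count,
                           unsigned bindings)
{
   struct sketch_screen *screen = (struct sketch_screen *)pscreen;
   VkFormatProperties props;
   vkGetPhysicalDeviceFormatProperties(screen->physical_device,
                                       vk_format_from_pipe(format), &props);

   /* each PIPE_BIND_* flag demands a corresponding VkFormatFeatureFlagBits */
   if ((bindings & PIPE_BIND_SAMPLER_VIEW) &&
       !(props.optimalTilingFeatures & VK_FORMAT_FEATURE_SAMPLED_IMAGE_BIT))
      return false;
   if ((bindings & PIPE_BIND_RENDER_TARGET) &&
       !(props.optimalTilingFeatures & VK_FORMAT_FEATURE_COLOR_ATTACHMENT_BIT))
      return false;
   if ((bindings & PIPE_BIND_DEPTH_STENCIL) &&
       !(props.optimalTilingFeatures & VK_FORMAT_FEATURE_DEPTH_STENCIL_ATTACHMENT_BIT))
      return false;
   return true;
}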

PIPE_BIND

Of particular note in this hook is the bindings parameter, which specifies the way(s) that the resource will be used. These values are defined as:

/**
 * Resource binding flags -- gallium frontends must specify in advance all
 * the ways a resource might be used.
 */
#define PIPE_BIND_DEPTH_STENCIL        (1 << 0) /* create_surface */
#define PIPE_BIND_RENDER_TARGET        (1 << 1) /* create_surface */
#define PIPE_BIND_BLENDABLE            (1 << 2) /* create_surface */
#define PIPE_BIND_SAMPLER_VIEW         (1 << 3) /* create_sampler_view */
#define PIPE_BIND_VERTEX_BUFFER        (1 << 4) /* set_vertex_buffers */
#define PIPE_BIND_INDEX_BUFFER         (1 << 5) /* draw_elements */
#define PIPE_BIND_CONSTANT_BUFFER      (1 << 6) /* set_constant_buffer */
#define PIPE_BIND_DISPLAY_TARGET       (1 << 7) /* flush_front_buffer */
/* gap */
#define PIPE_BIND_STREAM_OUTPUT        (1 << 10) /* set_stream_output_buffers */
#define PIPE_BIND_CURSOR               (1 << 11) /* mouse cursor */
#define PIPE_BIND_CUSTOM               (1 << 12) /* gallium frontend/winsys usages */
#define PIPE_BIND_GLOBAL               (1 << 13) /* set_global_binding */
#define PIPE_BIND_SHADER_BUFFER        (1 << 14) /* set_shader_buffers */
#define PIPE_BIND_SHADER_IMAGE         (1 << 15) /* set_shader_images */
#define PIPE_BIND_COMPUTE_RESOURCE     (1 << 16) /* set_compute_resources */
#define PIPE_BIND_COMMAND_ARGS_BUFFER  (1 << 17) /* pipe_draw_info.indirect */
#define PIPE_BIND_QUERY_BUFFER         (1 << 18) /* get_query_result_resource */

Each of these corresponds to a type of usage, with the hook that the created resource is likely to pass through noted in the comment.

Problem

As always, this is a problem for zink, mainly because the usage passed before resource creation doesn’t always match the actual usage. For example, a resource created with just PIPE_BIND_SAMPLER_VIEW may eventually use PIPE_BIND_RENDER_TARGET and vice versa.

This is very common, as u_blitter uses PIPE_BIND_SAMPLER_VIEW for blitting, and also many piglit tests will draw an image with PIPE_BIND_SAMPLER_VIEW and then use it as a framebuffer attachment, aka PIPE_BIND_RENDER_TARGET.

The problem here is that Vulkan is a very explicit API, so at no point is it possible to “add” usage capabilities later. Furthermore, it’s also the case that drivers may support either color attachment or sampler usage for a given format, but not both.

There’s two specific parts of zink that this affects:

  • resource creation, where the usage bits need to be set for the right usages so that the underlying driver allocates the right memory
  • blitting/copying, where it’s necessary to know the underlying driver’s capabilities for a format in order to be able to choose the right method for the transfer

Solutions-ish

For the first item, the solution is simple-ish:

/* apparently gallium thinks this is the jack-of-all-trades bind type */
if (templ->bind & PIPE_BIND_SAMPLER_VIEW)
   bci.usage |= VK_BUFFER_USAGE_UNIFORM_TEXEL_BUFFER_BIT |
                VK_BUFFER_USAGE_STORAGE_TEXEL_BUFFER_BIT |
                VK_BUFFER_USAGE_STORAGE_BUFFER_BIT |
                VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT |
                VK_BUFFER_USAGE_INDIRECT_BUFFER_BIT |
                VK_BUFFER_USAGE_INDEX_BUFFER_BIT |
                VK_BUFFER_USAGE_VERTEX_BUFFER_BIT |
                VK_BUFFER_USAGE_TRANSFORM_FEEDBACK_BUFFER_BIT_EXT;

Yes, for this case the simple fix is to just set every possible usage bit. Is it the “most correct” way? Maybe. But gallium just doesn’t give enough information at this point due to deficiencies in the OpenGL spec which allow objects to sort of just be used for whatever.

For the second item, things get messy.

Some gallium drivers (e.g., iris) claim no support for 3-component formats (i.e., RGB instead of RGBA) with PIPE_BIND_SAMPLER_VIEW so that gallium will provide a 4-component image instead. This sort of works for zink, except that the extra component means various Vulkan codepaths are now receiving a different number of bits per block, which breaks everything.

I haven’t really dug into the issue any further than this, but it’s an interesting problem, so I thought I’d blog about it.

August 06, 2020

Post-Storm Posting

I’m back, and today’s topic is testing.

Again.

But this time is different. This time I’m going to be looking into a specific test format, namely piglit shader tests.

Shader tests in piglit are tests which are run through piglit’s undocumented shader_runner binary, which parses *.shader_test files at runtime to automatically produce tests based on GLSL, without requiring any actual GL code and with only minimal boilerplate. This makes writing tests easy and, more importantly for my own use case, makes debugging them easier.

An Example

[require]
GLSL >= 1.50

[vertex shader passthrough]

[fragment shader]
#version 150

uniform float c;

out vec4 color;

void main()
{
	// set the 'green' value of the color to the value of the uniform
	color = vec4(0.0, c, 0.0, 1.0);
}

[test]
clear color 0.2 0.2 0.2 0.2
clear

// set 'c' to 1
uniform float c 1;
// draw a rect starting at -1,-1 (bottom left) with a width of 2 units, covering the fb
draw rect -1 -1 2 2
// probe the bottom-left quadrant of the color buffer and compare against rgb(0.0, 1.0, 0.0)
relative probe rect rgb (0.0, 0.0, 0.5, 0.5) (0.0, 1.0, 0.0)

Here’s an extremely simple example with a passthrough vertex shader (i.e., gl_Position = piglit_vertex;) which draws a full-framebuffer rectangle based on the color passed in as a uniform value and then probes the bottom-left quadrant of the color buffer to check that it’s green.

Debugging

The great part of shader tests like these is that they’re very simple to debug in a visual way. For example, if the above test was failing, I’d have a number of easy ways to narrow down where the problem was:

  • replace c in the output color with a constant value to ensure that uniform loading isn’t an issue
  • change the first two components of the probe command’s coordinates to see if perhaps the failure is from the read failing in the specified location in the color buffer
  • change the size of the rectangle being drawn in the event that it isn’t as large as it should be to verify that the driver isn’t using some hardcoded value here

And through all of this, no rebuilding is necessary; I can just make my edits and then shader_runner will generate the test at runtime.

More Complicated

Shader tests can get a bit more complex as well. For example, here’s another “basic” one that I was using while working on dynamic UBO indexing:

# This test verifies that dynamically uniform indexing of sampler arrays
# in the fragment shader behaves correctly.

[require]
GLSL >= 1.50
GL_ARB_gpu_shader5

[vertex shader passthrough]

[fragment shader]
#version 150
#extension GL_ARB_gpu_shader5: require

uniform sampler2D s[4];

uniform int n;

out vec4 color;

void main()
{
	color = texture(s[n], vec2(0.75, 0.25));
}

[test]
clear color 0.2 0.2 0.2 0.2
clear

uniform int s[0] 0
uniform int s[1] 1
uniform int s[2] 2
uniform int s[3] 3

texture checkerboard 0 0 (32, 32) (0.5, 0.0, 0.0, 0.0) (1.0, 0.0, 0.0, 0.0)
texparameter 2D min nearest
texparameter 2D mag nearest

texture checkerboard 1 0 (32, 32) (0.5, 0.0, 0.0, 0.0) (0.0, 1.0, 0.0, 0.0)
texparameter 2D min nearest
texparameter 2D mag nearest

texture checkerboard 2 0 (32, 32) (0.5, 0.0, 0.0, 0.0) (0.0, 0.0, 1.0, 0.0)
texparameter 2D min nearest
texparameter 2D mag nearest

texture checkerboard 3 0 (32, 32) (0.5, 0.0, 0.0, 0.0) (1.0, 1.0, 1.0, 1.0)
texparameter 2D min nearest
texparameter 2D mag nearest

uniform int n 0
draw rect -1 -1 1 1

relative probe rect rgb (0.0, 0.0, 0.5, 0.5) (1.0, 0.0, 0.0)

uniform int n 1
draw rect 0 -1 1 1

relative probe rect rgb (0.5, 0.0, 0.5, 0.5) (0.0, 1.0, 0.0)

uniform int n 2
draw rect -1 0 1 1

relative probe rect rgb (0.0, 0.5, 0.5, 0.5) (0.0, 0.0, 1.0)

uniform int n 3
draw rect 0 0 1 1

relative probe rect rgb (0.5, 0.5, 0.5, 0.5) (1.0, 1.0, 1.0)

This test again uses a passthrough vertex shader, outputting the color from one of the members of the sampler uniform array using another uniform as the array index. The texture being sampled in each case goes through a utility function piglit_checkerboard_texture(), which takes parameters:

  • tex Name of the texture to be used.
  • level Mipmap level the checkerboard should be written to
  • width Width of the texture image
  • height Height of the texture image
  • horiz_square_size Size of each checkerboard tile along the X axis
  • vert_square_size Size of each checkerboard tile along the Y axis
  • black RGBA color to be used for “black” tiles
  • white RGBA color to be used for “white” tiles

Each draw command then draws a rectangle in a different corner of the framebuffer, sampling from a texture with a different color, which produces:

arb_gpu_shader5-sampler_array_indexing-fs-simple.png

While I was getting this to work, I had a couple options available to me so that I could break down the test and run it in smaller pieces:

  • eliminate all but the first texture, draw, and probe calls for simpler shader output
  • use a constant array index to verify that arrays of samplers were working

Combined with visual inspection of the test output, this made the process a relative snap.

In Summary

Piglit’s shader tests are an easy-to-read, easy-to-debug test format, and it’s a format that’s just won my coveted Preferred Test Format award for excellence in every category that matters.

Remember: working on driver-level code isn’t scary unless you need documentation.

August 03, 2020

Instancing

Once more going way out of order since it’s fresh in my mind, today will be a brief look at gl_InstanceID and a problem I found there while hooking up ARB_base_instance.

gl_InstanceID is an input for vertex shaders which provides the index of the current instance being processed by the shader. It has a range of [0, instanceCount - 1], and this breaks the heck out of Vulkan.

Vulkan Instancing

In Vulkan, this same variable instead has a range of [firstInstance, firstInstance + instanceCount - 1]. This difference isn’t explicitly documented anywhere, which means keen readers will need to pick up on it by comparing Vulkan spec 14.7. Built-In Variables against the OpenGL wiki, because it isn’t even stated in the GLSL 1.40 spec (or anywhere else that I could find) what the range of this variable is.

Fixing

Here’s the generic int builtin loading function zink uses in ntv to handle instructions which load builtin variables:

static void
emit_load_uint_input(struct ntv_context *ctx, nir_intrinsic_instr *intr, SpvId *var_id, const char *var_name, SpvBuiltIn builtin)
{
   SpvId var_type = spirv_builder_type_uint(&ctx->builder, 32);
   if (!*var_id)
      *var_id = create_builtin_var(ctx, var_type,
                                   SpvStorageClassInput,
                                   var_name,
                                   builtin);

   SpvId result = spirv_builder_emit_load(&ctx->builder, var_type, *var_id);
   assert(1 == nir_dest_num_components(intr->dest));
   store_dest(ctx, &intr->dest, result, nir_type_uint);
}

var_id here is a pointer to the SpvId member of struct ntv_context where the created variable is stored for later use, var_name is the name of the variable, and builtin is the SPIRV name of the builtin that’s being loaded.

gl_InstanceID comes through here as SpvBuiltInInstanceIndex. To fix the difference in semantics here, I added a few lines:

static void
emit_load_uint_input(struct ntv_context *ctx, nir_intrinsic_instr *intr, SpvId *var_id, const char *var_name, SpvBuiltIn builtin)
{
   SpvId var_type = spirv_builder_type_uint(&ctx->builder, 32);
   if (!*var_id)
      *var_id = create_builtin_var(ctx, var_type,
                                   SpvStorageClassInput,
                                   var_name,
                                   builtin);

   SpvId result = spirv_builder_emit_load(&ctx->builder, var_type, *var_id);
   assert(1 == nir_dest_num_components(intr->dest));
   if (builtin == SpvBuiltInInstanceIndex) {
      /* GL's gl_InstanceID always begins at 0, so we have to normalize with gl_BaseInstance */
      SpvId base = spirv_builder_emit_load(&ctx->builder, var_type, ctx->base_instance_var);
      result = emit_binop(ctx, SpvOpISub, var_type, result, base);
   }
   store_dest(ctx, &intr->dest, result, nir_type_uint);
}

Now when loading gl_InstanceID, gl_BaseInstance is also loaded and subtracted to give a zero-indexed value.

July 31, 2020
The just released 5.2-rc1 kernel includes improved support for Logitech wireless keyboards and mice. Until now we were relying on the generic HID keyboard and mouse emulation for 27 MHz and non-unifying 2.4 GHz wireless receivers.

Starting with the 5.2 kernel instead we actually look at the devices behind the receiver. This allows us to provide battery monitoring support and to have per device quirks, like device specific HID-code to evdev-code mappings where necessary. Until now device specific quirks were not possible because the receivers have a generic product-id which is the same independent of the device behind the receiver.

The per device key-mapping is especially important for 27 MHz wireless devices: these use the same HID-code for Fn + F1 to Fn + F12 on all devices, but the markings on the keys differ per model. So far it was impossible for Linux to get this mapping right, but now that we have per device product-ids for the devices behind the receiver, we can finally fix this. As is the case with other devices with vendor specific mappings, the actual mapping is done in userspace through hwdb.

If you have a 27 MHz device (often using this receiver, with the keyboard marked as canada 210 or canada 310 on the bottom), please give 5.2 a try. Download the latest 60-keyboard.hwdb file and place it in /lib/udev/hwdb.d (replacing the existing file), run "sudo udevadm hwdb --update", and then boot into the 5.2 kernel. Then run "sudo evemu-record", select your keyboard, and try Fn + F1 through Fn + F12 and any other special keys. If any keys do not work, edit 60-keyboard.hwdb, search for Logitech, and add an entry for your keyboard (see the existing Logitech entries). After editing you need to re-run "sudo udevadm hwdb --update", followed by "sudo udevadm trigger", for the changes to take effect. Once you have a working entry, submit a pull request to systemd to get the changes upstream. If you need any help, drop me an email.

We still have some old code for the generic HID emulation for 27 MHz receivers with a product-id of 046d:c50c, these should work fine with the new code, but we've been unable to test this. I would really like to move the 046d:c50c id over to the new code and remove all the old code. If you've a 27 MHz Logitech device, please run lsusb, if your device has a product-id of 046d:c50c and you are willing to test, please drop me an email.

Likewise I suspect that 2.4GHz receivers with a product-id of 046d:c531 should work fine with the new support for non-unifying 2.4 GHz receivers, if you have one of those also please drop me an email.

Iago recently talked about the work done on testing and supporting well known applications, such as the Vulkan ports of Quake 1, Quake 2 and Quake 3. Let’s get back here to the latest news on feature and bugfixing work.

Pipeline cache

Pipeline cache objects allow the result of pipeline construction to be reused. Usually (and specifically in our implementation) that means caching compiled shaders. Reuse can be achieved between pipeline creations during the same application run by passing the same pipeline cache object when creating multiple pipelines. Reuse across runs of an application is achieved by retrieving the pipeline cache contents in one run, saving them, and using them to preinitialize a pipeline cache on a subsequent run.

Note that a pipeline cache may not improve the performance of an application once it starts to render. This is because application developers are encouraged to create all their pipelines in advance, to avoid any hiccups during rendering; in that situation the pipeline cache instead helps to reduce load times. Creating everything in advance is not always possible, though, and in that case the pipeline cache at least reduces the hiccup, as a cache hit is far faster than a shader recompilation.

One specific detail about our implementation is that we internally keep a default pipeline cache, used when the user doesn’t provide a pipeline cache at pipeline creation, and also to cache the custom shaders we use for internal operations. This allowed us to simplify our code, discarding some custom caches that had already been implemented.

Uniform/storage texel buffer

Uniform texel buffers define a tightly-packed 1-dimensional linear array of texels, with texels going through format conversion when read in a shader in the same way as they are for an image. They are mostly equivalent to OpenGL buffer textures, so you can see them as textures backed by a VkBuffer (through a VkBufferView). With uniform texel buffers you can only do formatted loads.

Storage texel buffers are the equivalent concept, but for storage images instead of textures: unlike uniform texel buffers, they can also be written to, in the same way as storage images.
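Under the hood, both flavors boil down to a VkBufferView over a VkBuffer. A minimal sketch of creating one (the buffer, device and format here are arbitrary choices for illustration):

VkBufferViewCreateInfo bvci = {
   .sType = VK_STRUCTURE_TYPE_BUFFER_VIEW_CREATE_INFO,
   .buffer = buffer,                        /* the backing VkBuffer */
   .format = VK_FORMAT_R32G32B32A32_SFLOAT, /* how texels are interpreted */
   .offset = 0,
   .range = VK_WHOLE_SIZE,
};
VkBufferView view;
vkCreateBufferView(device, &bvci, NULL, &view);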

Multisampling

Multisampling is a technique that reduces aliasing artifacts in images by sampling pixel coverage at multiple subpixel locations and then averaging the subpixel samples to produce a final color value for each pixel. We have already started working on this feature and have included some patches in the development branch, but it is still a work in progress. Having said so, it is already enough to get Sascha Willems’s basic multisampling demo working:

Sascha Willems multisampling demo run on rpi4

Bugfixing

Again, in addition to working on specific features, we also spent some time fixing specific driver bugs, using failing Vulkan CTS tests as a reference. So let’s take a look at some screenshots of Sascha Willems’s demos that are now working:

Sascha Willems deferred demo run on rpi4

Sascha Willems texture array demo run on rpi4

Sascha Willems Compute N-Body demo run on rpi4

Next

We plan to work on supporting the following features next:

  • Robust buffer access
  • Multisampling (finishing it)

Previous updates

Just in case you missed any of the updates of the vulkan driver so far:

Vulkan raspberry pi first triangle
Vulkan update now with added source code
v3dv status update 2020-07-01
V3DV Vulkan driver update: VkQuake1-3 now working

Queries Again

I’ve talked in the past about XFB, and I’ve talked about queries, but I’ve never spent much time talking about XFB queries.

That’s going to change.

XFB is not great, and queries aren’t something I’m too fond of at this point after rewriting the handling so many times, but it was only the other day, while handling streams for XFB3/ARB_gpu_shader5 (humblebrag), that I realized I had been in the infant area of the playground until this moment.

Enter GL_PRIMITIVES_GENERATED

Yes, it’s this query. This one query which behaves differently based on whether XFB is currently active. And in Vulkan terms, this difference is nightmarish for anyone who wants to maintain a “simple” query implementation.

Here’s the gist:

  • When XFB is active, GL_PRIMITIVES_GENERATED needs to return the XFB extension query data for primitives generated
  • When XFB is not active, GL_PRIMITIVES_GENERATED needs to return pipeline statistics query data

But hold on. It gets worse.

For the second case, there’s two subcases:

  • If a geometry shader is present, VK_QUERY_PIPELINE_STATISTIC_GEOMETRY_SHADER_PRIMITIVES_BIT is the value that needs to be returned
  • Otherwise, it’s VK_QUERY_PIPELINE_STATISTIC_INPUT_ASSEMBLY_PRIMITIVES_BIT

Ughhhhhhhhhh

I don’t really have much positive to say here other than “it works”.

The current solution is to maintain two separate VkQueryPool objects for these types of queries, keep them both in sync as far as active/not-active is concerned, and then use results based on have_xfb ? use_xfb : (have_gs ? use_gs : use_vs), and it’s about as gross as you’d imagine.
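In pseudo-ish C, the result selection amounts to the following sketch; the names here are illustrative, not the actual zink code:

/* pick which pool's result to report for GL_PRIMITIVES_GENERATED */
static uint64_t
primitives_generated_result(bool have_xfb, bool have_gs,
                            uint64_t xfb_result,
                            uint64_t gs_prims, uint64_t ia_prims)
{
   if (have_xfb)
      return xfb_result; /* VK_EXT_transform_feedback query data */
   /* pipeline statistics: geometry shader primitives if a GS exists,
    * input assembly primitives otherwise */
   return have_gs ? gs_prims : ia_prims;
}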

July 30, 2020

The Hidden Dangers of State Tracking

It’s another hot one, so let’s cool down by checking out a neat bug I came across.

As I’ve mentioned previously, zink runs a NIR pass to inject a gl_PointSize value into the last vertex processing stage of the pipeline for point draws. This works in three parts:

  • the NIR pass to add the instructions
  • adding the pointsize data into the shader parameters
  • the shader parameters get uploaded to the gpu as a constant buffer

Recently, I found a case where the constant buffer for this data was missing, causing tests to crash. I immediately strapped on my snorkel and dove in.

Is The NIR Pass Running?

Yes, though since this case is in a geometry shader, it relies on some recent work.

Is The Data Being Added Into Shader Parameters?

Again checking out the MR where I added this, I examined the code:

            if (key->lower_point_size) {
               static const gl_state_index16 point_size_state[STATE_LENGTH] =
                  { STATE_INTERNAL, STATE_POINT_SIZE_CLAMPED, 0 };
               _mesa_add_state_reference(params, point_size_state);
               NIR_PASS_V(state.ir.nir, nir_lower_point_size_mov,
                          point_size_state);
               finalize = true;
            }

As seen above, there’s a call to _mesa_add_state_reference() with the pointsize state, copied from the vertex shader callsite, so this is definitely being added to the shader parameters.

Are These Parameters Being Uploaded?

A quick breakpoint on st_update_gs_constants(), which handles uploading the constant buffer data for geometry shaders, revealed that no, there was no upload taking place.

A quick hop over to st_atom_list.h revealed that this function is called when ST_NEW_GS_CONSTANTS is set in struct st_program::affected_states, but that flag would leave out the case where the codepath is triggered by a tessellation evaluation shader, so really it should be ST_NEW_CONSTANTS to be more generic.

On my way out, I decided to check on the original vertex shader handling of the pass, and the state was similarly not being explicitly updated there either, instead relying on other parts of the code to have initiated an upload of the constant buffers.

In summary, when you copy code that has bugs which happen to not exhibit themselves in the original location, you’re basically adding more bugs.

July 28, 2020

How Do Mesa Versions Work?

Today I thought it might be interesting to dive into how mesa detects version support for drivers.

To do so, I’m going to be jumping into mesa/src/mesa/main/version.c, which is where the magic happens.

mesa/main/version.c

In this file are a couple functions which check driver extension support. For the (current) purposes of my zink work, I’ll be looking at compute_version(), the function handling desktop GL. This function takes a pointer to the extensions as well as the constant values set by the driver, and each GL version is determined by support for the required extensions.

As an example, here’s GL 3.0, which is what zink is currently detected as:

   const bool ver_3_0 = (ver_2_1 &&
                         consts->GLSLVersion >= 130 &&
                         (consts->MaxSamples >= 4 || consts->FakeSWMSAA) &&
                         (api == API_OPENGL_CORE ||
                          extensions->ARB_color_buffer_float) &&
                         extensions->ARB_depth_buffer_float &&
                         extensions->ARB_half_float_vertex &&
                         extensions->ARB_map_buffer_range &&
                         extensions->ARB_shader_texture_lod &&
                         extensions->ARB_texture_float &&
                         extensions->ARB_texture_rg &&
                         extensions->ARB_texture_compression_rgtc &&
                         extensions->EXT_draw_buffers2 &&
                         extensions->ARB_framebuffer_object &&
                         extensions->EXT_framebuffer_sRGB &&
                         extensions->EXT_packed_float &&
                         extensions->EXT_texture_array &&
                         extensions->EXT_texture_shared_exponent &&
                         extensions->EXT_transform_feedback &&
                         extensions->NV_conditional_render);

Each member of extensions being checked is the full name of an extension, so it can easily be found by a search. Each member of consts is a constant value exported by the driver. Most notable is GLSLVersion, which increases for every GL version beginning with 3.0.

Thus, saying “zink supports GL 3.0” is really equivalent to passing this check.

But wait, this is mesa core, and zink is a gallium driver.

mesa/state_tracker/st_extensions.c

Midway through this file is st_init_extensions() (direct line link omitted for future compatibility). Here, enum pipe_cap is mapped to extension support, but this is only for some cases:

   static const struct st_extension_cap_mapping cap_mapping[] = {
      { o(ARB_base_instance),                PIPE_CAP_START_INSTANCE                   },
      { o(ARB_bindless_texture),             PIPE_CAP_BINDLESS_TEXTURE                 },
      { o(ARB_buffer_storage),               PIPE_CAP_BUFFER_MAP_PERSISTENT_COHERENT   },
      { o(ARB_clear_texture),                PIPE_CAP_CLEAR_TEXTURE                    },
      { o(ARB_clip_control),                 PIPE_CAP_CLIP_HALFZ                       },
      { o(ARB_color_buffer_float),           PIPE_CAP_VERTEX_COLOR_UNCLAMPED           },
      { o(ARB_conditional_render_inverted),  PIPE_CAP_CONDITIONAL_RENDER_INVERTED      },
      { o(ARB_copy_image),                   PIPE_CAP_COPY_BETWEEN_COMPRESSED_AND_PLAIN_FORMATS },

Setting a nonzero value for these pipe caps enables the corresponding extension for version checking.

Other extensions are determined by values of pipe caps later on:

   if (GLSLVersion >= 400 && !options->disable_arb_gpu_shader5)
      extensions->ARB_gpu_shader5 = GL_TRUE;

In this case, supporting at least GLSL 4.00 and not explicitly disabling the extension is enough to enable it.

In this way, there’s no method to do “enable GL $version”, and everything is instead inferred based on the driver’s capabilities, letting us remain reasonably confident that drivers really do support all parts of every GL version they claim to support.

July 27, 2020

Variable Lists

Today I’m going to briefly go over a big-ish API change that’s taking place as a result of a MR from Jason Ekstrand.

As I’ve talked about in the past, zink does a lot of iterating over variables in shaders. Primarily, this access is limited to the first three variable lists in struct nir_shader:

typedef struct nir_shader {
   /** list of uniforms (nir_variable) */
   struct exec_list uniforms;

   /** list of inputs (nir_variable) */
   struct exec_list inputs;

   /** list of outputs (nir_variable) */
   struct exec_list outputs;

The uniforms list is used in UBO support and texture calls, outputs is used in xfb, and inputs and outputs have both been featured in all my slot remapping adventures.

Less Lists

The question asked by this MR (and the provoking issue) is “why do we have so many lists of variables when we could instead have one list?”

The issue has more discussion about the specifics of why, but for the purposes of zink, the question becomes “how does this affect us?”

Generally speaking, the changes needed here are simple. Instead of nir_foreach_variable() macro iterations over a specific list type, there’s now nir_foreach_variable_with_modes(var, shader, modes), which has an additional param for modes. This param is based on enum nir_variable_mode, which corresponds to nir_variable::data.mode, and it causes the macro to perform a check on each variable it iterates over:

#define nir_foreach_variable_with_modes(var, shader, modes) \
   nir_foreach_variable_in_shader(var, shader) \
      if (_nir_shader_variable_has_mode(var, modes))

Thus, the zink loops like nir_foreach_variable(var, &s->uniforms) look like nir_foreach_variable_with_modes(var, s, nir_var_uniform | nir_var_mem_ubo | nir_var_mem_ssbo) after the change.
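Concretely, the conversion looks roughly like this; handle_uniform() is a made-up stand-in for whatever per-variable work is being done:

/* before: one iteration per dedicated list */
nir_foreach_variable(var, &s->uniforms)
   handle_uniform(ctx, var);

/* after: one shared list, filtered by variable mode */
nir_foreach_variable_with_modes(var, s,
                                nir_var_uniform | nir_var_mem_ubo | nir_var_mem_ssbo)
   handle_uniform(ctx, var);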

Further Improvements

At some point after this lands, it’ll be useful to go through the places in zink where variable iteration occurs and try to combine iterations: each iteration now walks over every variable in the shader, so ideally everything that each separate type-iteration used to handle should be folded into a single loop to keep things lightweight.

July 24, 2020

Spec Harder

At this point, longtime readers are well aware of the SPIR-V spec since I reference it in most posts. This spec covers all the base operations and types used in shaders, which are then (occasionally) documented more fully in the full Vulkan spec.

There comes a time, however, when the core SPIR-V spec is not enough to maintain compatibility with GLSL instructions. In such cases, there are additional specifications which extend the base functionality.

  • The GLSL.std.450 extended instruction set provides many additional ops, primarily for calculation-related functionality
  • Other specs

Perhaps now you see where I’m going with this.

Yes, any time you’re working with one of the “other specs”, e.g., SPV_KHR_vulkan_memory_model, there’s no longer this nice, consistent webpage for viewing the spec. Instead, the SPIR-V landing page directs readers to this GitHub repository, which then has a list of links to all the extensions, whereupon finally the goal is reached at a meta-link that displays the version in the repo.

The frustrating part about this is that you have to know that you need to do this. Upon seeing references to an extension in the SPIR-V spec (and they are most certainly listed there), the next step is naturally to find more info on that extension. But there is no info on any extension in the core spec, and, while the extension names look like they’re links, they aren’t links at all, so there’s no way to get directly to the spec for an extension that you now know you need to use after seeing it as a requirement in the core spec.

The next step is a staple of any workflow. Typically, it’s enough to just throw a “function name” or “specification name” or “extension name” query at the search engine of your choice when working with GL or VK, but in this one odd case, assuming your search engine even finds results, you’ll most likely end up instead at this page for SPV_KHR_vulkan_memory_model, which isn’t very useful.

July 23, 2020

A few weeks ago we shared an update regarding the status of the Vulkan driver for the Raspberry Pi 4. In that post I mentioned that the plan was to focus on completing the feature set for Vulkan 1.0 and then moving on to conformance and bugfixing work before attempting to run actual games and applications. A bit later my colleague Alejandro shared another update detailing some of our recent feature work.

We have been making good progress so far, and at this point we are getting close to having a complete Vulkan 1.0 implementation. I believe the main pending features are pipeline caches, which Alejandro is currently working on, texel buffers, multisampling support, and robust buffer access, so in the last few weeks I decided to take a break from feature development and try to get some Vulkan games running with our driver, using them to guide some initial performance work.

I decided to work with all three vkQuake games, since they run on Linux, the source code is available (which makes things a lot easier to debug), and they seemed to use a subset of the Vulkan API we already supported. For vkQuake we needed compute shaders and input attachments, which we implemented recently, and for vkQuake3 we needed a couple of optional Vulkan features, which I implemented recently so it would run without having to modify the game code. So all of these games are now running on the Raspberry Pi 4 with the V3DV driver. At the same time, our friend Salva from Pi Labs has also been testing the PPSSPP emulator using Vulkan and reporting that some games seem to be working already, which has been great to hear.

I was particularly interested in getting vkQuake3 to work because the project includes both the Vulkan and the original OpenGL renderers, which makes it great for comparing performance between the two APIs. vkQuake3 comes with a GL1 and a GL2 renderer, the GL1 renderer being the faster of the two by a large margin (apparently the GL2 renderer has additional rendering features that make it much slower). I think the Vulkan renderer is based on the GL1 renderer (although I have not actually checked), so I figured it would make the most reasonable comparison, and in our tests we found the Vulkan version to be up to 60% faster. Of course, it could be argued that GL1 is a pretty old API and that the difference against a more modern GL or GLES renderer might be less significant, but it is still a good sign.

To finish the post, here are some pics of the games:


vkQuake

vkQuake2

vkQuake3 OpenGL 1 renderer

vkQuake3 OpenGL 1 renderer

vkQuake3 Vulkan renderer

vkQuake3 Vulkan renderer

A couple of final notes:

  • The Vulkan renderer for vkQuake3 is much darker, but that is just how the renderer operates and not a driver issue; we observed the same behavior on Intel GPUs.
  • For those interested in trying vkQuake3: we noticed that exterior levels have broken sky rendering, which I hope we will get to fix soon.

A Slow Day

It’s a very slow day, and I awoke at 4:30 with the burning need to finish tessellation shaders. Or maybe I was just burning because it’s so hot out.

In either case, I had a couple remaining tests that were failing, and, upon closer inspection and a lot of squinting, I determined that the problem was the use of partial writes to patch output variables in tessellation control shaders.

With this knowledge in hand, I set about resolving the issue in the most bulldozer way possible, with no regard for how much of the code would survive once I discovered which nir pass I probably wanted to be running instead.

Partial Writes

A partial write results from shader code that looks like this:

vec4 somevec = vec4(1.0);
somevec.xz = vec2(0.0);

The second assignment is partial, meaning it only writes to some of the components. SPIR-V, however, doesn’t have methods for writing to only certain components of a composite value, meaning that it requires either the full composite to be written (e.g., the first line of code above) or it requires that each component be written separately.

If you’re a long-time reader of the blog, or if you’re good at inferring from the tone of an article, I’m sure you know what came next.

SPIRV

Here is a normal ntv function which stores a write to an output variable:


static void
emit_store_deref(struct ntv_context *ctx, nir_intrinsic_instr *intr)
{
   SpvId ptr = get_src(ctx, &intr->src[0]);
   SpvId src = get_src(ctx, &intr->src[1]);

   SpvId type = get_glsl_type(ctx, nir_src_as_deref(intr->src[0])->type);
   SpvId result = emit_bitcast(ctx, type, src);
   spirv_builder_emit_store(&ctx->builder, ptr, result);
}

It’s very simple. Simple is good. The only thing to be done is to take the source operand, cast it to the type of the output variable, then store (write) it to the output variable.

Here’s where we’re at now:

static void
emit_store_deref(struct ntv_context *ctx, nir_intrinsic_instr *intr)
{
   SpvId ptr = get_src(ctx, &intr->src[0]);
   SpvId src = get_src(ctx, &intr->src[1]);

   const struct glsl_type *gtype = nir_src_as_deref(intr->src[0])->type;
   SpvId type = get_glsl_type(ctx, gtype);
   unsigned num_writes = util_bitcount(nir_intrinsic_write_mask(intr));
   unsigned wrmask = nir_intrinsic_write_mask(intr);
   if (num_writes != intr->num_components) {
      /* no idea what we do if this fails */
      assert(glsl_type_is_array(gtype) || glsl_type_is_vector(gtype));

      /* this is a partial write, so we have to loop and do a per-component write */
      SpvId result_type;
      SpvId member_type;
      if (glsl_type_is_vector(gtype)) {
         result_type = get_glsl_basetype(ctx, glsl_get_base_type(gtype));
         member_type = get_uvec_type(ctx, 32, 1);
      } else
         member_type = result_type = get_glsl_type(ctx, glsl_get_array_element(gtype));
      SpvId ptr_type = spirv_builder_type_pointer(&ctx->builder,
                                                  SpvStorageClassOutput,
                                                  result_type);
      for (unsigned i = 0; i < 4; i++)
         if ((wrmask >> i) & 1) {
            SpvId idx = emit_uint_const(ctx, 32, i);
            SpvId val = spirv_builder_emit_composite_extract(&ctx->builder, member_type, src, &i, 1);
            val = emit_bitcast(ctx, result_type, val);
            SpvId member = spirv_builder_emit_access_chain(&ctx->builder, ptr_type,
                                                           ptr, &idx, 1);
            spirv_builder_emit_store(&ctx->builder, member, val);
         }
      return;

   }
   SpvId result = emit_bitcast(ctx, type, src);
   spirv_builder_emit_store(&ctx->builder, ptr, result);
}

I’ve now taken this otherwise readable function and changed it into something unnecessarily complex. For cases where the write mask doesn’t cover all of the instruction’s components, this code extracts each written member of the source composite, then creates an access chain (which is basically pointer dereferencing / array or vector access in SPIR-V) for the corresponding member of the output, and finally writes the bitcasted component.

But…

But Mike, I’m sure you’re asking now, shouldn’t nir_lower_io_to_scalar handle this? Or what about nir_lower_io in general?

Well.

Maybe?

But there’s a comment /* TODO: add patch support */, so it doesn’t, and here we are.

July 22, 2020

Last of the Vertex Processors

As I’ve touched upon previously, zink does some work in ntv to perform translations of certain GLSL variables to SPIR-V versions of those variables. Also sometimes we add and remove variables for compatibility reasons.

The key part of these translations, for some cases, is to ensure that they occur on the right shader. As an example that I’ve talked about several times, Vulkan coordinates are different from GL coordinates, specifically in that the Z axis is compressed and thus values there must be converted to perform as expected in the underlying Vulkan driver. This means that gl_Position needs to be converted before reaching the fragment shader.

Conversion Timing

Early approaches to handling this in zink, as in the currently-released versions of mesa, performed all translation in the vertex shader. Only vertex and fragment shaders are supported here, so this is fine and makes total sense.

Once more types of shaders become supported, however, this approach is no longer as good. Consider the following progression:

  • vertex shader converts gl_Position
  • vertex shader outputs gl_Position
  • geometry shader consumes gl_Position
  • geometry shader uses gl_Position for vertex output
  • geometry shader outputs gl_Position

In this scenario, the geometry shader is still executing instructions that assume a GL coordinate input, which means they will not function as expected when they receive the converted VK coordinates. The simplest fix results in:

  • vertex shader converts gl_Position
  • vertex shader outputs gl_Position
  • geometry shader consumes gl_Position
  • geometry shader unconverts gl_Position
  • geometry shader uses gl_Position for vertex output
  • geometry shader converts gl_Position
  • geometry shader outputs gl_Position

I say simplest here because this requires no changes to the shader compiler ordering in zink, meaning that shaders don’t need to be “aware” of each other, e.g., a vertex shader doesn’t need to know whether a geometry shader exists and can just do the same conversion in every case. This is useful because at present, shaders in zink are compiled in a “random” order, meaning that it’s impossible to know whether a geometry shader exists at the time that a vertex shader is being compiled.

This is still not ideal, however, as it means that the vertex and geometry shaders are going to be executing unnecessary instructions, which yields a big frowny face in benchmarks (probably not actually that big, but this is the sort of optimizing that lets you call your code “lightweight”). The situation is further complicated with the introduction of tessellation shaders, where the flow now starts looking like:

  • vertex shader converts gl_Position
  • vertex shader outputs gl_Position
  • tessellation shader consumes gl_Position
  • tessellation shader unconverts gl_Position
  • tessellation shader converts gl_Position
  • tessellation shader outputs gl_Position
  • geometry shader consumes gl_Position
  • geometry shader unconverts gl_Position
  • geometry shader uses gl_Position for vertex output
  • geometry shader converts gl_Position
  • geometry shader outputs gl_Position

Not great.

Once Per Pipeline

The obvious change required here is to ensure that zink compiles shaders in pipeline order. With this done, all the uncompiled shaders are available to scan for info and existence, and the process can now be:

  • vertex shader converts gl_Position if no tessellation or geometry shader is present
  • vertex shader outputs gl_Position
  • tessellation shader consumes gl_Position
  • tessellation shader converts gl_Position if no geometry shader is present
  • tessellation shader outputs gl_Position
  • geometry shader consumes gl_Position
  • geometry shader uses gl_Position for vertex output
  • geometry shader converts gl_Position
  • geometry shader outputs gl_Position

Now that’s some lightweight shader execution. #optimized
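
For the record, the per-stage decision boils down to something like the following sketch, assuming mesa’s gl_shader_stage enum and that shaders are now seen in pipeline order (a sketch of the logic, not the actual zink code):

/* the last pre-rasterization stage performs the GL->VK conversion */
static bool
stage_converts_position(gl_shader_stage stage, bool has_tess, bool has_gs)
{
   switch (stage) {
   case MESA_SHADER_VERTEX:
      return !has_tess && !has_gs;
   case MESA_SHADER_TESS_EVAL:
      return !has_gs;
   case MESA_SHADER_GEOMETRY:
      return true;
   default:
      return false;
   }
}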

July 21, 2020

Directing Indirection

Now on a wildly different topic, I’m going to talk about indirect drawing for a bit, specifically when using it in combination with primitive restart, which I already briefly talked about in a prior post.

In general, indirect drawing is used when an application wants to provide the gpu with a buffer containing the parameters to be used for draw calls. The idea is that the parameters are already “on the gpu”, so there’s no back-and-forth with the CPU needed for cases where these parameters are derived in the course of gpu operations.

The problem here for zink is that indirect drawing can be used with primitive restart, but the problem I brought up previously still exists, namely that OpenGL allows arbitrary values for the restart index, whereas Vulkan requires a fixed value.

This means that in order for indirect draws to work correctly with primitive restart after being translated to Vulkan, zink needs to… harold.jpg

Yes, zink needs to map those buffers used for indirect drawing and rewrite the indices to convert the arbitrary restart index to the fixed one.
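
At its core, the rewrite amounts to nothing fancier than this sketch (assuming a 32-bit index buffer that has already been mapped; the real implementation lives in the utility code covered below):

/* convert an arbitrary GL restart index to Vulkan's fixed one,
 * which for 32-bit indices is 0xffffffff */
static void
rewrite_restart_indices(uint32_t *indices, unsigned count,
                        uint32_t gl_restart_index)
{
   for (unsigned i = 0; i < count; i++) {
      if (indices[i] == gl_restart_index)
         indices[i] = UINT32_MAX;
   }
}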

Utils, Utils, Utils

Thankfully, all of this can be done with the utility functions in u_prim_restart.c that I talked about in the primitive restart post, which provide functionality both for rewriting index buffers to perform restart index conversion and for handling draw calls for unsupported primitive types.

The ARB_draw_indirect spec defines two buffer formats for use with indirect draws:

typedef struct {
  GLuint count;
  GLuint primCount;
  GLuint first;
  GLuint reservedMustBeZero;
} DrawArraysIndirectCommand;

typedef struct {
  GLuint count;
  GLuint primCount;
  GLuint firstIndex;
  GLint  baseVertex;
  GLuint reservedMustBeZero;
} DrawElementsIndirectCommand;

One is for array draws, the other for elements. Happily, only the first three members are of use for this awfulness, so there’s no need to determine which type of draw it is before grabbing that buffer with both hands and telling the GPU to completely stop everything that it wanted to do so we can push up our reading glasses and take a nice, slow read.

With the buffer contents in hand and the GPU performance having dropped to a nonexistent state, the indirect draw command can then be rewritten as a direct draw since the buffer is already mapped, and the entire premise of the indirect draw can be invalidated for the sake of compatibility.
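
Roughly speaking, the readback amounts to this sketch; the struct mirrors the first three members shared by both command layouts above, and the buffer-mapping and draw-submission plumbing is elided:

#include <stdint.h>

struct indirect_head {
   uint32_t count;      /* index or vertex count */
   uint32_t prim_count; /* instance count */
   uint32_t first;      /* first index or first vertex */
};

/* turn mapped indirect parameters into direct-draw parameters */
static void
indirect_to_direct(const struct indirect_head *cmd,
                   uint32_t *count, uint32_t *instance_count,
                   uint32_t *start)
{
   *count = cmd->count;
   *instance_count = cmd->prim_count;
   *start = cmd->first;
}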

For those interested in seeing what this looks like, the MR is here.

July 20, 2020

I’m Back

Last week was my birthday (all week), and I decided to gift myself fewer blog posts.

Now that things are back to normal, however, I’ll be easing my way back into the blogosphere. Today I’m going to look briefly at the benefits of a recent MR by Erik Faye-Lund which improves memory allocation in ntv, letting us better track all the allocations done in the course of compiler work. The key benefit is that we’re no longer leaking the entire thing every time we compile a shader (oops).

Mesa internals use an allocator called ralloc, which was first added to the tree by Kenneth Graunke way back in 7.10.1, or at least that was the earliest reference of it I could find. It’s used everywhere, and it has a number of nice features beyond simple efficiency that justify its use.

For this post, I’ll be looking specifically at its memory context capability.

Contextualize

Typically in C, we have our calls to malloc, and we have to track all these allocations, ensuring ownership is preserved, and then later call free at an appropriate time on everything that we’ve allocated.

Not so with the magic of ralloc!

A call to ralloc_context() creates a new memory context. Subsequent allocations for related objects requiring allocation can then pass this context to the related ralloc_size() calls, and this allocation will then be added to the specified context. At a later point, the context can be freed, taking all the allocated memory with it.

This enables code like:

void *ctx = ralloc_context(NULL);
void *some_mem1 = ralloc_size(ctx, some_size1);
void *some_mem2 = ralloc_size(ctx, some_size2);
void *some_mem3 = ralloc_size(ctx, some_size3);
ralloc_free(ctx);

which will free all allocations created using ctx.

But wait, there’s more! Every allocation made using ralloc is implicitly a context like this, which means that any time a struct is created with allocated members, the struct can be passed as a context, meaning that a destructor function for that struct can simply be written as a call to ralloc_free() which passes the struct, and newly-added members don’t need to be deallocated.

So for example:

struct somestruct *s = ralloc_size(NULL, sizeof(struct somestruct));
s->member1 = ralloc_size(s, member1_size);
ralloc_free(s);

will free both the struct and the member. Later adding, e.g.,:

s->member2 = ralloc_size(s, member2_size);

for a new member at some point requires no changes to the deallocation codepath.

Future Work

At some point, it’s likely we’ll be going through the zink codebase to add more ralloc usage, both for helping to avoid memory leaks and for the other more general benefits that memory allocators bring over our existing usage of raw malloc.

July 17, 2020

In March, I inspected the coverage of kms_cursor_crc on VKMS to develop my GSoC project proposal. Using piglit, I present the evolution of this coverage so far:

Result        GSoC start   Only accepted patches   Fixes under development
pass          7            22                      24
warn          1            0                       1
fail          2            3                       0
skip          236          221                     221
crash         0            0                       0
timeout       0            0                       0
incomplete    0            0                       0
dmesg-warn    0            0                       0
dmesg-fail    0            0                       0
changes       0            0                       0
fixes         0            0                       0
regressions   0            0                       0
total         246          246                     246

+ Instability in the sequential run of subtests; i.e., although the statistics show 7 tests passing at the beginning, this result only became clear after a double-check, because when running all subtests sequentially, 8 tests fail and only 1 succeeds;

+ Warning or failure? My proposed fix for the pipe-A-cursor-alpha-transparent test was passing, but with a warning (CRC all zero)

As a first step, I decided to examine and solve the issue that affected the test results in the most general way: the instability. Solving the instability first (or at least identifying what was going on) would make the work more consistent and fluid, since I would no longer need to double-check each subtest result, and in one run of the entire kms_cursor_crc test I could check for absent features or errors. However, in this investigation, some problems were more linked to IGT and others to VKMS. Identifying the “guilty party” was not simple, and some false accusations happened.

This is a somewhat long story that deserves its own post. The effective solution has not yet been found, but I have already worked out what a proper bug fix should achieve:

  • The stability of the sequential execution of subtests;
  • The success of the following subtests: pipe-A-cursor-dpms and pipe-A-cursor-suspend

As we don’t have that yet, I’ll focus in this post on describing how to allow testing of different cursor sizes in vkms.

The maximum cursor size

In 2018, one of Haneen’s contributions to vkms was adding support for the cursor plane. In this implementation, the cursor has a standard maximum supported size of 64x64, which limited the coverage of kms_cursor_crc to only the cursor subtests of this size.

IGT tests cursor sizes from 64 to 512 (powers of 2), but how do we enable cursors larger than the standard?

Initially, I thought I needed to develop some functionality from scratch, maybe by drawing cursors at larger sizes… as a good beginner :) So I started to read some material and check the code of other drivers to find out how to do this.

During this stage, by some kind of cosmic coincidence (mystic), I came across a conversation on IRC about cursor sizes. This conversation gave me some references that put me on the right path.

Ctags-driven investigation

I tend to investigate Linux through keywords (like ctags in my head?). This approach helps me narrow the scope of reading and experimentation. The Linux kernel is frighteningly large, and I know I still need many years to understand things as a whole (if that’s even possible).

Therefore, these are the references that helped me find the keywords and information related to my problem, the cursor size:

  1. Comparing an AMD device with VKMS device: https://drmdb.emersion.fr/devices (keyword 1: DRM_CAP_CURSOR_HEIGHT)
  2. Checking max cursor sizes per driver: https://drmdb.emersion.fr/capabilities (keyword 2: capabilities)
  3. Finding where the DRM_CAP_CURSOR_HEIGHT value is set: drivers/gpu/drm/drm_ioctl.c
    (How do we define dev->mode_config.cursor_height in vkms?)

    /*
     * Get device/driver capabilities
     */
    static int drm_getcap(struct drm_device *dev, void *data, struct drm_file *file_priv)
    {
       struct drm_get_cap *req = data;
       struct drm_crtc *crtc;

       req->value = 0;
       [..]
       case DRM_CAP_CURSOR_WIDTH:
          if (dev->mode_config.cursor_width)
             req->value = dev->mode_config.cursor_width;
          else
             req->value = 64;
          break;
       case DRM_CAP_CURSOR_HEIGHT:
          if (dev->mode_config.cursor_height)
             req->value = dev->mode_config.cursor_height;
          else
             req->value = 64;
          break;
       [..]
    }

  4. More information: include/uapi/drm/drm.h

    /*
     * The CURSOR_WIDTH and CURSOR_HEIGHT capabilities return a valid widthxheight
     * combination for the hardware cursor. The intention is that a hardware
     * agnostic userspace can query a cursor plane size to use.
     *
     * Note that the cross-driver contract is to merely return a valid size;
     * drivers are free to attach another meaning on top, eg. i915 returns the
     * maximum plane size.
     */
    #define DRM_CAP_CURSOR_WIDTH 0x8
    #define DRM_CAP_CURSOR_HEIGHT 0x9
    
    
  5. Check the documentation for cursor_width in struct drm_mode_config
    (the mode configuration control structure):

    cursor_width: hint to userspace for max cursor width
    cursor_height: hint to userspace for max cursor height
    
  6. So, where is this mode_config defined in vkms?
    Here: drivers/gpu/drm/vkms/vkms_drv.c
    static int vkms_modeset_init(struct vkms_device *vkmsdev)
    {
     struct drm_device *dev = &vkmsdev->drm;
    
     drm_mode_config_init(dev);
     dev->mode_config.funcs = &vkms_mode_funcs;
     dev->mode_config.min_width = XRES_MIN;
     dev->mode_config.min_height = YRES_MIN;
     dev->mode_config.max_width = XRES_MAX;
     dev->mode_config.max_height = YRES_MAX;
     dev->mode_config.preferred_depth = 24;
     dev->mode_config.helper_private = &vkms_mode_config_helpers;
    
     return vkms_output_init(vkmsdev, 0);
    }
    

    There is nothing about the cursor here, so we need to assign maximum values ourselves to avoid the 64x64 default.

  7. I also found that, on my Intel computer, the maximum cursor size is 256. Why do the tests include 512?
  8. Also, there are subtests in kms_cursor_crc for non-square cursors, but these tests are restricted to i915 devices. Why are they here?
  9. Finally, I developed a simple patch (sketched below) that increases the coverage rate by 15 subtests.
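
    The change itself is tiny. A sketch of the idea (512 here is an assumption based on the size range IGT exercises; the exact values come from the real patch):

    /* in vkms_modeset_init(): advertise a maximum cursor size so that
     * DRM_CAP_CURSOR_WIDTH/HEIGHT report more than the 64x64 default */
    dev->mode_config.cursor_width = 512;
    dev->mode_config.cursor_height = 512;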

Considering the current drm-misc-next, my project state is:

Name          Results
pass          22
fail          3
skip          221
crash         0
timeout       0
warn          0
incomplete    0
dmesg-warn    0
dmesg-fail    0
changes       0
fixes         0
regressions   0
total         246

Cheer up!

It is also possible to consider the tests where we have some idea of the problem and a provisional solution exists (an optimistic view, I mean):

Name          Results
pass          24
warn          1
fail          0
skip          221
crash         0
timeout       0
incomplete    0
dmesg-warn    0
dmesg-fail    0
changes       0
fixes         0
regressions   0
total         246

There is still the issue of non-square cursors, which I’m not sure we should handle. Subtests with non-square cursors account for 16 skips.

Lessons learned

  1. For me, IRC conversations are inspiring and also show the community’s pulse. Part of what I question in my master’s research is the lack of academic interest in using IRC as a means of understanding the community under investigation.
    One of the things I like about IRC is that, unlike other instant messengers today, when we are there, we are usually sitting in front of the computer (our work tool). I mean, we are not (I think) lying in a hammock on the beach, for example. OK, there could be exceptions. :)
    But to be honest, I rarely talk on a channel; I am usually just watching.
  2. There are cases where the complexity lies in understanding what already exists instead of developing new features. I still don’t know whether that’s frustrating or satisfying.
  3. I don’t know if I could have understood things without using ctags. Ctags was a great tip from Siqueira back in FLUSP times.

July 15, 2020

We’re Back

I spent a few days locked in a hellish battle against a software implementation of doubles (64-bit floats) for ARB_gpu_shader_fp64 in the raging summer heat. This scholarly infographic roughly documents my progress:

fp64-clown.png

Instead of dwelling on it, I’m going to go back to something very quick and less brain-damaging, namely ARB_timer_query. This extension provides functionality for performing time queries on the gpu, both for elapsed time and just getting regular time values.

In gallium, this is represented by PIPE_CAP_QUERY_TIME_ELAPSED and PIPE_CAP_QUERY_TIMESTAMP capabilities in a driver.

Extensions

Obviously there are Vulkan extensions to go with this. VK_EXT_calibrated_timestamps is the one that’s needed, as it provides the functionality for retrieving time values directly from the gpu without going through a command buffer. There’s the usual copy-and-paste dance for this during driver init, which includes:

  • checking for the extension name
  • checking for the feature
  • enabling the feature
  • enabling the extension
  • checking for the needed extension-specific features (in this case, VK_TIME_DOMAIN_DEVICE_EXT; see the sketch below)
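
For that last check, here’s a hedged sketch of what querying the supported time domains can look like, using only the extension’s documented entrypoint (the surrounding function-pointer loading is elided):

/* returns true if the device can produce VK_TIME_DOMAIN_DEVICE_EXT
 * timestamps via vkGetCalibratedTimestampsEXT */
static bool
supports_device_time_domain(VkPhysicalDevice pdev,
                            PFN_vkGetPhysicalDeviceCalibrateableTimeDomainsEXT get_domains)
{
   uint32_t count = 0;
   get_domains(pdev, &count, NULL);
   if (count > 8)
      count = 8;
   VkTimeDomainEXT domains[8];
   get_domains(pdev, &count, domains);
   for (uint32_t i = 0; i < count; i++) {
      if (domains[i] == VK_TIME_DOMAIN_DEVICE_EXT)
         return true;
   }
   return false;
}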

Query Time

What followed, in this case, was a lot of wrangling of existing query code to remove asserts and finish some stubs where timestamp queries had been partially implemented. The most important part was this function:

 static void
 timestamp_to_nanoseconds(struct zink_screen *screen, uint64_t *timestamp)
 {
    /* The number of valid bits in a timestamp value is determined by
     * the VkQueueFamilyProperties::timestampValidBits property of the queue on which the timestamp is written.
     * - 17.5. Timestamp Queries
     */
    *timestamp &= ((1ull << screen->timestampValidBits) - 1);
    /* The number of nanoseconds it takes for a timestamp value to be incremented by 1
     * can be obtained from VkPhysicalDeviceLimits::timestampPeriod
     * - 17.5. Timestamp Queries
     */
    *timestamp *= screen->props.limits.timestampPeriod;
 }

All time values from the gpu are returned in “ticks”, which are a unit decided by the underlying driver. Applications using zink want nanoseconds, however, so this needs to be converted.

Side Query

But then also there’s the case where a timestamp can be retrieved directly, for example:

glGetInteger64v(GL_TIMESTAMP, &time);

In zink and other gallium drivers, this uses either a struct pipe_screen or struct pipe_context hook:

   /**
    * Query a timestamp in nanoseconds. The returned value should match
    * PIPE_QUERY_TIMESTAMP. This function returns immediately and doesn't
    * wait for rendering to complete (which cannot be achieved with queries).
    */
   uint64_t (*get_timestamp)(struct pipe_screen *);


   /**
    * Query a timestamp in nanoseconds.  This is completely equivalent to
    * pipe_screen::get_timestamp() but takes a context handle for drivers
    * that require a context.
    */
   uint64_t (*get_timestamp)(struct pipe_context *);

The current implementation looks like:

static uint64_t
zink_get_timestamp(struct pipe_context *pctx)
{
   struct zink_screen *screen = zink_screen(pctx->screen);
   uint64_t timestamp, deviation;
   assert(screen->have_EXT_calibrated_timestamps);
   VkCalibratedTimestampInfoEXT cti = {};
   cti.sType = VK_STRUCTURE_TYPE_CALIBRATED_TIMESTAMP_INFO_EXT;
   cti.timeDomain = VK_TIME_DOMAIN_DEVICE_EXT;
   screen->vk_GetCalibratedTimestampsEXT(screen->dev, 1, &cti, &timestamp, &deviation);
   timestamp_to_nanoseconds(screen, &timestamp);
   return timestamp;
}

Using the VK_EXT_calibrated_timestamps functionality from earlier, this is very simple and straightforward, unlike software implementations of ARB_gpu_shader_fp64.

It’s a bit of a leap forward, but that’s another GL 3.3 feature done.