planet.freedesktop.org
July 03, 2020

ARB_arrays_of_arrays

I recently came across a number of failing tests where the problem was related to variable sizing in a shader. Check out this beaut:

#version 130
#extension GL_ARB_separate_shader_objects: require
#extension GL_ARB_arrays_of_arrays: require

out vec4 out_color;

uniform sampler2D s2[2][2];
uniform sampler3D s3[2][2];

void main()
{
    out_color = texture(s2[1][1], vec2(0)) + texture(s3[1][1], vec3(0));
}

When I checked out the corresponding spec, it turned out there’s no limit on array nesting like this, and Zink had no handling for arrays of arrays at all.

Thus I entered the magic of struct glsl_type and its many, many, many helper/wrapper functions. Zink has many checks for glsl_type_is_array(var->type) when processing variables to find arrays, but there were no checks for arrays of arrays, which was a problem.

Thankfully, as is usually the case in mesa development, someone has had this problem before, and so there’s glsl_get_aoa_size() for getting the flattened size of an array in these cases. By using the return from this instead of glsl_get_length() in these places, Zink can now support this extension.
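
In code, the change boils down to a pattern like this (a rough sketch with a hypothetical process_array() helper, not the exact Zink diff):

nir_foreach_variable(var, &nir->uniforms) {
   if (glsl_type_is_array(var->type)) {
      /* glsl_get_length() reports only the outermost dimension, so for
       * 'uniform sampler2D s2[2][2]' it returns 2, while glsl_get_aoa_size()
       * flattens every dimension and returns 4
       */
      unsigned size = glsl_get_aoa_size(var->type);
      process_array(var, size);
   }
}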

July 02, 2020

At last

It’s time to finish up UBO support in this long form patch review blog series. Here’s where the current progress has left things:


static void
emit_load_ubo(struct ntv_context *ctx, nir_intrinsic_instr *intr)
{
   nir_const_value *const_block_index = nir_src_as_const_value(intr->src[0]);
   assert(const_block_index); // no dynamic indexing for now

   nir_const_value *const_offset = nir_src_as_const_value(intr->src[1]);
   if (const_offset) {
      SpvId uvec4_type = get_uvec_type(ctx, 32, 4);
      SpvId pointer_type = spirv_builder_type_pointer(&ctx->builder,
                                                      SpvStorageClassUniform,
                                                      uvec4_type);

      unsigned idx = const_offset->u32 / 16;
      SpvId member = emit_uint_const(ctx, 32, 0);
      SpvId offset = emit_uint_const(ctx, 32, idx);
      SpvId offsets[] = { member, offset };
      SpvId ptr = spirv_builder_emit_access_chain(&ctx->builder, pointer_type,
                                                  ctx->ubos[const_block_index->u32], offsets,
                                                  ARRAY_SIZE(offsets));
      SpvId result = spirv_builder_emit_load(&ctx->builder, uvec4_type, ptr);

      SpvId type = get_dest_uvec_type(ctx, &intr->dest);
      unsigned num_components = nir_dest_num_components(intr->dest);
      if (num_components == 1) {
         uint32_t components[] = { 0 };
         result = spirv_builder_emit_composite_extract(&ctx->builder,
                                                       type,
                                                       result, components,
                                                       1);
      } else if (num_components < 4) {
         SpvId constituents[num_components];
         SpvId uint_type = spirv_builder_type_uint(&ctx->builder, 32);
         for (uint32_t i = 0; i < num_components; ++i)
            constituents[i] = spirv_builder_emit_composite_extract(&ctx->builder,
                                                                   uint_type,
                                                                   result, &i,
                                                                   1);

         result = spirv_builder_emit_composite_construct(&ctx->builder,
                                                         type,
                                                         constituents,
                                                         num_components);
      }

      if (nir_dest_bit_size(intr->dest) == 1)
         result = uvec_to_bvec(ctx, result, num_components);

      store_dest(ctx, &intr->dest, result, nir_type_uint);
   } else
      unreachable("uniform-addressing not yet supported");
}

Remaining work here:

  • handle dynamic offsets in order to e.g., process shaders which use loops to access a UBO member index
  • handle loading an index inside each vec4-sized UBO member in order to be capable of accessing components

More problems

There’s another tangle here when it comes to accessing components of a UBO member though. The Extract operations in SPIR-V all take literals, not SPIR-V ids, which means they can’t be used to support dynamic offsets from shader-side variables. As a result, OpAccessChain is the best solution, but this has some small challenges itself.

The way that OpAccessChain works is that it takes an array of index values that are used to progressively access deeper parts of a composite type. For a case like a vec4[2], passing [0, 2] as the index array would access the first vec4’s third member, as this Op delves based on the composite’s type.
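
As a rough C analogy (purely illustrative; this is not how SPIR-V represents anything internally):

float base[2][4];          /* roughly a vec4[2] */
float *elem = &base[0][2]; /* indices [0, 2]: first vec4, third component */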

However, in emit_load_ubo() here, the instructions passed provide the offset in bytes, not “members”. This means the value passed as src[1] here has to be converted from bytes into “members”, and it has to be done such that OpAccessChain gets three separate index values in order to access the exact component of the UBO that the instruction specifies. The calculation is familiar for people who have worked extensively in C:

index_0 = 0;
index_1 = offset / sizeof(vec4);
index_2 = (offset % sizeof(vec4)) / sizeof(int);
  • The first index is always 0 since the type is a pointer.
  • The second index is determining which vec4 to access; since all UBO members are sized as vec4 types, this is effectively determining which member of the UBO to access
  • The third index accesses the components of that vec4; in the SPIR-V generated by ntv, these are sized as int-based (4-byte) members, so this index is in 4-byte units
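
For example, for a hypothetical access at byte offset 20 into the UBO, the indices work out as:

/* offset = 20, sizeof(vec4) = 16, sizeof(int) = 4 */
index_0 = 0;             /* step through the pointer to 'base'   */
index_1 = 20 / 16;       /* = 1: the second vec4 in the UBO      */
index_2 = (20 % 16) / 4; /* = 1: the second component, base[1].y */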

This is it for loading the UBO, but now those loaded values need to be stored, as that’s the point of this instruction. In the above code segment, the entire vec4 is loaded, and then OpCompositeExtract is used to grab the desired values, creating a new composite at the end for storage. This won’t work for dynamic usage, however, as I mentioned previously: OpCompositeExtract takes literal index values, which means it can only be used with constant offsets.

Instead, a solution which handles both cases is to use OpAccessChain in a loop over each individual component that needs to be loaded, then reassemble the loaded values into a composite at the end.

More code

The end result looks like this:

static void
emit_load_ubo(struct ntv_context *ctx, nir_intrinsic_instr *intr)
{
   nir_const_value *const_block_index = nir_src_as_const_value(intr->src[0]);
   assert(const_block_index); // no dynamic indexing for now

   SpvId uint_type = get_uvec_type(ctx, 32, 1);
   SpvId one = emit_uint_const(ctx, 32, 1);

   /* number of components being loaded */
   unsigned num_components = nir_dest_num_components(intr->dest);
   SpvId constituents[num_components];
   SpvId result;

   /* destination type for the load */
   SpvId type = get_dest_uvec_type(ctx, &intr->dest);
   /* an id of the array stride in bytes */
   SpvId vec4_size = emit_uint_const(ctx, 32, sizeof(uint32_t) * 4);
   /* an id of an array member in bytes */
   SpvId uint_size = emit_uint_const(ctx, 32, sizeof(uint32_t));

   /* we grab a single array member at a time, so it's a pointer to a uint */
   SpvId pointer_type = spirv_builder_type_pointer(&ctx->builder,
                                                   SpvStorageClassUniform,
                                                   uint_type);

   /* our generated uniform has a memory layout like
    *
    * struct {
    *    vec4 base[array_size];
    * };
    *
    * where 'array_size' is set as though every member of the ubo takes up a vec4,
    * even if it's only a vec2 or a float.
    *
    * first, access 'base'
    */
   SpvId member = emit_uint_const(ctx, 32, 0);
   /* this is the offset (in bytes) that we're accessing:
    * it may be a const value or it may be dynamic in the shader
    */
   SpvId offset = get_src(ctx, &intr->src[1]);
   /* convert offset to an array index for 'base' to determine which vec4 to access */
   SpvId vec_offset = emit_binop(ctx, SpvOpUDiv, uint_type, offset, vec4_size);
   /* use the remainder to calculate the byte offset in the vec, which tells us the member
    * that we're going to access
    */
   SpvId vec_member_offset = emit_binop(ctx, SpvOpUDiv, uint_type,
                                        emit_binop(ctx, SpvOpUMod, uint_type, offset, vec4_size),
                                        uint_size);
   /* OpAccessChain takes an array of indices that drill into a hierarchy based on the type:
    * index 0 is accessing 'base'
    * index 1 is accessing 'base[index 1]'
    * index 2 is accessing 'base[index 1][index 2]'
    *
    * we must perform the access this way in case src[1] is dynamic because there's
    * no other spirv method for using an id to access a member of a composite, as
    * (composite|vector)_extract both take literals
    */
   for (unsigned i = 0; i < num_components; i++) {
      SpvId indices[3] = { member, vec_offset, vec_member_offset };
      SpvId ptr = spirv_builder_emit_access_chain(&ctx->builder, pointer_type,
                                                  ctx->ubos[const_block_index->u32], indices,
                                                  ARRAY_SIZE(indices));
      /* load a single value into the constituents array */
      constituents[i] = spirv_builder_emit_load(&ctx->builder, uint_type, ptr);
      /* increment to the next vec4 member index for the next load */
      vec_member_offset = emit_binop(ctx, SpvOpIAdd, uint_type, vec_member_offset, one);
   }

   /* if loading more than 1 value, reassemble the results into the desired type,
    * otherwise just use the loaded result
    */
   if (num_components > 1) {
      result = spirv_builder_emit_composite_construct(&ctx->builder,
                                                      type,
                                                      constituents,
                                                      num_components);
   } else
      result = constituents[0];

   /* explicitly convert to a bool vector if the destination type is a bool */
   if (nir_dest_bit_size(intr->dest) == 1)
      result = uvec_to_bvec(ctx, result, num_components);

   store_dest(ctx, &intr->dest, result, nir_type_uint);
}

But wait! Perhaps some avid reader is now considering how many load operations are potentially being added by this method if the original instruction was intended to load an entire vec4. Surely some optimizing can be done here?

One of the great parts about ntv is that there’s not much need to optimize anything in advance here. Getting things working is usually “good enough”, and the reason for that is once again NIR. While it’s true that loading a vec4 member of a UBO from this code does generate four load_ubo instructions, these instructions will get automatically optimized back to a single load_ubo by a nir_lower_io pass triggered from the underlying Vulkan driver, which means spending any effort pre-optimizing here is wasted time.

Moving on

ARB_uniform_buffer_object is done now, so look forward to new topics again.

July 01, 2020

About three weeks ago there was a big announcement about the status of the Vulkan effort for the Raspberry Pi 4, and now the source code is public. Taking into account the interest that the announcement got, and that the driver is now more usable, we will try to post status updates more regularly. Let’s talk about what’s happened since then.

Input Attachments

Input attachments are one of the main sub-features of Vulkan multipass, and we’ve gained support for them since the announcement. In Vulkan, multipass is supported more tightly by the API: renderpasses can have multiple subpasses, these can have dependencies between each other, and each subpass defines a subset of “attachments”. One attachment that is easy to understand is the color attachment: this is where a given subpass writes a given color. An input attachment, by contrast, is an attachment that was updated in a previous subpass (for example, it was the color attachment in that previous subpass) and that you get as an input in following subpasses. From the shader’s point of view, you interact with it as a texture, with some restrictions. One important restriction is that you can only read the input attachment at the current pixel location; the main reason for this is that on tile-based GPUs (like the rpi4) all primitives are batched on tiles and fragment processing renders one tile at a time. In general, if you can live with those restrictions, Vulkan multipass and input attachments will provide better performance than traditional multipass solutions.

If you are interested in reading more details on this, you can check out ARM’s very nice presentation “Vulkan Multipass mobile deferred done right”, or Sascha Willems’ post “Vulkan input attachments and sub passes”. The latter also includes information about how to use them and code snippets from one of his demos. For reference, this is how the input attachment demo looks on the rpi4:

Compute Shader

Given that this was one of the most requested features after the last update, we expect this will likely be the most popular news from this post: compute shaders are now supported.

Compute shaders give applications the ability to perform non-graphics tasks on the GPU, outside the normal rendering pipeline. For example, they don’t take vertices as input or produce fragments as output, but they can still be used for massively parallel GPGPU algorithms. As an example, this demo from Sascha Willems uses a compute shader to simulate cloth:

Storage Image

Storage images are another recent addition. A storage image is a descriptor type that represents an image view and supports unfiltered loads, stores, and atomics in a shader. In most other ways it is really similar to the well-known OpenGL concept of a texture. Storage images are really common with compute shaders: compute shaders can’t render directly to any image, so if they need to produce one, they will typically update a storage image instead. In fact, the two Sascha Willems demos using storage images also require compute shader support:

Performance

Right now our main focus for the driver is working on features, targeting a compliant Vulkan 1.0 driver. That said, now that we both support a good range of features and can run non-basic applications, we have devoted some time to analyzing whether there were clear points where we could improve performance. Among these we implemented:
1. A buffer object (BO) cache: internally we allocate and free buffer objects very often for basically the same tasks, so there is a constant need for buffers of the same sizes. Each allocation/free requires a DRM call, so we implemented a BO cache (based on the existing one in the OpenGL driver) so that freed BOs are added to a cache and reused if a new BO is allocated with the same size (a rough sketch follows this list).
2. New code paths for buffer to image copies.
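
As promised, a rough sketch of the BO cache idea; all names here are illustrative, not the actual v3dv code:

struct v3dv_bo *
bo_alloc(struct v3dv_device *dev, uint32_t size)
{
   /* look for a previously freed BO of the same size first */
   struct v3dv_bo *bo = bo_cache_take(&dev->bo_cache, size);
   if (bo)
      return bo;                     /* reuse: no DRM call needed */
   return bo_create_drm(dev, size);  /* miss: fall back to the kernel */
}

void
bo_free(struct v3dv_device *dev, struct v3dv_bo *bo)
{
   /* instead of releasing the BO through DRM, park it for later reuse */
   bo_cache_put(&dev->bo_cache, bo);
}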

Bugfixing!!

In addition to working on specific features, we also spent some time fixing specific driver bugs, using failing Vulkan CTS tests as a reference. Thanks to that work, Sascha Willems’ radial blur demo is now rendering properly, even though we didn’t focus specifically on that demo:

Next?

Now that the driver supports a good range of features and we are able to test more applications and run more Vulkan CTS tests with all the needed features implemented, we plan to focus some efforts on bugfixing for a while.

We also plan to start working on support for the pipeline cache, which allows the result of pipeline construction to be reused between pipelines and between runs of an application.

No time to waste

So let’s get down to pixels. The UBO indexing is now fixed-ish, which means moving onto the next step: setting up bindings for the UBOs.

A binding in this context is the numeric id assigned to a UBO for the purposes of accessing it from a shader, which also corresponds to the uniform block index. In mesa, this is the struct nir_variable::data.binding member of a UBO. A load_ubo instruction will take this value as its first parameter, which means there’s a need to ensure that everything matches up just right.

Where to start

Where I started was checking out the existing code, which assumes that nir_variable::data.binding is already set up correctly, since the comment in mesa/src/compiler/nir/nir.h for the member implies that—

Just kidding, that only applies to Vulkan drivers. In Zink, that needs to be manually set up since, at most, the value will have been incremented by 1 in the nir_lower_uniforms_to_ubo pass from yesterday’s post.

With this in mind, it’s time to check out a block from zink_compiler.c:

   nir_foreach_variable(var, &nir->uniforms) {
      if (var->data.mode == nir_var_mem_ubo) {
         int binding = zink_binding(nir->info.stage,
                                    VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER,
                                    var->data.binding);
         ret->bindings[ret->num_bindings].index = var->data.binding;
         ret->bindings[ret->num_bindings].binding = binding;
         ret->bindings[ret->num_bindings].type = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER;
         ret->num_bindings++;

This iterates over the uniform variables, which are now all wrapped in UBOs, setting up the binding table that will later be used in a vkCreateDescriptorSetLayout call, which passes the bindings along to the underlying driver.

Unfortunately, as just mentioned, this assumes that var->data.binding is set, which it isn’t.

Ordering

A number of things need to be kept in mind to effectively assign all the binding values:

  • The UBOs in this list are ordered backwards, with the zero-id UBO at the end of the list. As such, the bindings need to be generated in reverse order as compared to the uniforms list stored onto the shader.
  • The index member of the binding table, however, is not the same as binding: index determines which buffer is used with the specified UBO. If nir_lower_uniforms_to_ubo was run, index begins at 0; otherwise it begins at 1.
  • The point of the binding value is to bind the UBO itself, not variables contained in the UBO. This means that any uniform with a nonzero data.location can be ignored, as this indicates that it’s located at an offset from the base of the UBO and will be accessed by the second parameter of the load_ubo instruction, the offset.

With all this in mind, the following changes can be made:

   uint32_t cur_ubo = 0;
   /* UBO buffers are zero-indexed, but buffer 0 is always the one created by nir_lower_uniforms_to_ubo,
    * which means there is no buffer 0 if there are no uniforms
    */
   int ubo_index = !nir->num_uniforms;
   /* need to set up var->data.binding for UBOs, which means we need to start at
    * the "first" UBO, which is at the end of the list
    */
   foreach_list_typed_reverse(nir_variable, var, node, &nir->uniforms) {
      if (var->data.mode == nir_var_mem_ubo) {
         /* ignore variables being accessed if they aren't the base of the UBO */
         if (var->data.location)
            continue;
         var->data.binding = cur_ubo++;

         int binding = zink_binding(nir->info.stage,
                                    VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER,
                                    var->data.binding);
         ret->bindings[ret->num_bindings].index = ubo_index++;
         ret->bindings[ret->num_bindings].binding = binding;
         ret->bindings[ret->num_bindings].type = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER;
         ret->num_bindings++;

Declaring

Now that the binding values are all taken care of, the next step is to go back to the UBO declarations in ntv:

static void
emit_ubo(struct ntv_context *ctx, struct nir_variable *var)
{
   uint32_t size = glsl_count_attribute_slots(var->type, false);

This is the first line of the function, and it’s the only one that’s important here. Zink is going to pad out every member of a UBO to the size of a vec4 (because PIPE_CAP_PACKED_UNIFORMS is not set by the driver), which is what size here is being assigned as—the number of vec4s needed to declare the passed variable.
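
As a concrete illustration of that padding (my reading of glsl_count_attribute_slots(), not code from the driver):

/* every member is padded out to a full vec4 slot:
 *   uniform float f; -> 1 slot (one vec4, three components unused)
 *   uniform vec2 v;  -> 1 slot (one vec4, two components unused)
 *   uniform mat4 m;  -> 4 slots (one vec4 per column)
 */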

This isn’t what the driver should be doing here. As with the binding table setup above, this is declaring UBOs themselves, not variables inside UBOs. As such, all of these variables can be ignored, but the base variable needs to be sized for the entire UBO.

Helpfully, this type is available as struct nir_variable::interface_type for the overall UBO type, which results in the following small changes:

static void
emit_ubo(struct ntv_context *ctx, struct nir_variable *var)
{
   /* variables accessed inside a uniform block will get merged into a big
    * memory blob and accessed by offset
    */
   if (var->data.location)
      return;

   uint32_t size = glsl_count_attribute_slots(var->interface_type, false);

The UBO list in ntv also has to be walked backwards for its declarations in order to match the part from zink_compiler.c, but this is the only change necessary.

Binding accomplished

Yes, that’s sufficient for setting up the variables and bindings for all the UBOs.

Next time, I’ll finish this with a back-to-the-basics look at loading memory from buffers using offsets, except it’s in SPIR-V so everything is way more complicated.

June 30, 2020

assert()gery

In yesterday’s post, I left off in saying that removing an assert() from the constant block index check wasn’t going to work quite right. Let’s see why that is.

Some context again:

static void
emit_load_ubo(struct ntv_context *ctx, nir_intrinsic_instr *intr)
{
   nir_const_value *const_block_index = nir_src_as_const_value(intr->src[0]);
   assert(const_block_index); // no dynamic indexing for now
   assert(const_block_index->u32 == 0); // we only support the default UBO for now

   nir_const_value *const_offset = nir_src_as_const_value(intr->src[1]);
   if (const_offset) {
      SpvId uvec4_type = get_uvec_type(ctx, 32, 4);
      SpvId pointer_type = spirv_builder_type_pointer(&ctx->builder,
                                                      SpvStorageClassUniform,
                                                      uvec4_type);

      unsigned idx = const_offset->u32;
      SpvId member = emit_uint_const(ctx, 32, 0);
      SpvId offset = emit_uint_const(ctx, 32, idx);
      SpvId offsets[] = { member, offset };
      SpvId ptr = spirv_builder_emit_access_chain(&ctx->builder, pointer_type,
                                                  ctx->ubos[0], offsets,
                                                  ARRAY_SIZE(offsets));
      SpvId result = spirv_builder_emit_load(&ctx->builder, uvec4_type, ptr);

This is the top half of emit_load_ubo(), which performs the load on the desired memory region for the UBO access. In particular, the line I’m going to be exploring today is

   assert(const_block_index->u32 == 0); // we only support the default UBO for now

Which directly corresponds to the explicit 0 in

      SpvId ptr = spirv_builder_emit_access_chain(&ctx->builder, pointer_type,
                                                  ctx->ubos[0], offsets,
                                                  ARRAY_SIZE(offsets));

At a glance, it seems like the assert() can be removed, and const_block_index->u32 can be passed as the index to the ctx->ubos array, which is where all the declared UBO variables are stored, and there won’t be any issue.

Not so.

In fact, there’s a number of problems with this.

NIR resurfaces

Over in zink_compiler.c, Zink runs a nir_lower_uniforms_to_ubo pass on shaders. What this pass does is:

  • rewrites all load_uniform instructions as load_ubo instructions for the UBO bound to 0, which works with Gallium’s merging of all non-block uniforms into UBO with binding point 0 (which is what’s currently handled by Zink)
  • adds the variable for a UBO with binding point 0 if there’s any load_uniform instructions
  • increments the binding points (and load instructions) of all pre-existing UBOs by 1
  • uses a specified multiplier to rewrite the offset values specified by the converted load_ubo instructions which were previously load_uniform

But then there’s a problem: what happens when this pass gets run when there’s no non-block uniforms? Well, the answer is just as expected:

  • rewrites all load_uniform instructions as load_ubo instructions for the UBO bound to 0, which works with Gallium’s merging of all non-block uniforms into UBO with binding point 0 (which is what’s currently handled by Zink)
  • adds the variable for a UBO with binding point 0 if there’s any load_uniform instructions
  • increments the binding points (and load instructions) of all pre-existing UBOs by 1
  • uses a specified multiplier to rewrite the offset values specified by the converted load_ubo instructions which were previously load_uniform

So now, in emit_load_ubo() above, that ctx->ubos[const_block_index->u32] is actually going to translate to ctx->ubos[1] in the case of a shader without any uniforms. Unfortunately, here’s the function which declares the UBO variables:

static void
emit_ubo(struct ntv_context *ctx, struct nir_variable *var)
{
   uint32_t size = glsl_count_attribute_slots(var->type, false);
   SpvId vec4_type = get_uvec_type(ctx, 32, 4);
   SpvId array_length = emit_uint_const(ctx, 32, size);
   SpvId array_type = spirv_builder_type_array(&ctx->builder, vec4_type,
                                               array_length);
   spirv_builder_emit_array_stride(&ctx->builder, array_type, 16);

   // wrap UBO-array in a struct
   SpvId struct_type = spirv_builder_type_struct(&ctx->builder, &array_type, 1);
   if (var->name) {
      char struct_name[100];
      snprintf(struct_name, sizeof(struct_name), "struct_%s", var->name);
      spirv_builder_emit_name(&ctx->builder, struct_type, struct_name);
   }

   spirv_builder_emit_decoration(&ctx->builder, struct_type,
                                 SpvDecorationBlock);
   spirv_builder_emit_member_offset(&ctx->builder, struct_type, 0, 0);


   SpvId pointer_type = spirv_builder_type_pointer(&ctx->builder,
                                                   SpvStorageClassUniform,
                                                   struct_type);

   SpvId var_id = spirv_builder_emit_var(&ctx->builder, pointer_type,
                                         SpvStorageClassUniform);
   if (var->name)
      spirv_builder_emit_name(&ctx->builder, var_id, var->name);

   assert(ctx->num_ubos < ARRAY_SIZE(ctx->ubos));
   ctx->ubos[ctx->num_ubos++] = var_id;

   spirv_builder_emit_descriptor_set(&ctx->builder, var_id,
                                     var->data.descriptor_set);
   int binding = zink_binding(ctx->stage,
                              VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER,
                              var->data.binding);
   spirv_builder_emit_binding(&ctx->builder, var_id, binding);
}

Specifically:

   ctx->ubos[ctx->num_ubos++] = var_id;

Indeed, this is zero-indexed, which means all the UBO access for a shader with no uniforms is going to fail because all the UBO load instructions are using a block index that’s off by one.

Solved

As is my way, I slapped some if (!shader->num_uniforms) flex tape on running the zink_compiler nir_lower_uniforms_to_ubo pass in order to avoid potentially breaking the pass’s other usage over in TGSI by changing the pass itself, and now the problem is solved. The assert() can now be removed.
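
For reference, the flex tape amounts to something like this minimal sketch (the exact call site and the multiplier argument are illustrative):

/* only run the pass when there are uniforms to wrap; otherwise it would
 * uselessly shift every UBO binding up by one
 */
if (shader->num_uniforms)
   NIR_PASS_V(shader, nir_lower_uniforms_to_ubo, 16);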

Yes, sometimes there’s all this work, and analyzing, and debugging, and blogging, and the end result is a sweet, sweet zero/null check.

Tune in next time when I again embark on a journey that definitely, in no way results in more flex tape being used.

June 29, 2020

Long Weekend

Not really, but I didn’t get around to blogging on Friday because I was working until pretty late on something that’s Kind Of A Big Deal.

Not really, but it’s probably more interesting than my posts about unhandled ALUs.

ARB_uniform_buffer_object support

This extension is one of the last remaining items (along with GL_NV_primitive_restart, which is likely to be done soon as well, and some fixups for GLSL-1.40) required for OpenGL 3.1 support, so I decided to take a break from fixing corner case piglit tests to try doing something useful.

At a very basic level, this extension provides shaders with the ability to declare a “struct” type uniform containing explicitly-defined members that can be referenced normally. Here’s a quick vertex shader example from piglit’s rendering test from the arb_uniform_buffer_object extension tests:

#extension GL_ARB_uniform_buffer_object : require

layout(std140) uniform;
uniform ub_pos_size { vec2 pos; float size; };
uniform ub_rot {float rotation; };

void main()
{
   mat2 m;
   m[0][0] = m[1][1] = cos(rotation); 
   m[0][1] = sin(rotation); 
   m[1][0] = -m[0][1]; 
   gl_Position.xy = m * gl_Vertex.xy * vec2(size) + pos;
   gl_Position.zw = vec2(0, 1);
}

Seen here, there’s two UBOs passed as inputs, and the shader’s main() function directly references their members to perform a rotation on the passed vertex.

What does this actually mean?

That was my first question. In essence, what it means is that once PIPE_SHADER_CAP_INDIRECT_CONST_ADDR is enabled for the driver, shaders are going to start being compiled that contain instructions to perform UBO loads with offsets, as the “struct” member access is really just loading memory from a buffer at an offset from the base.

There’s two types of indexing that need to be handled:

  • constant - this is like array[1], where the index is explicitly defined
  • dynamic - this is array[i], where the index has been computed by the shader

Both types of indexing apply to the uniform block index, which determines which UBO is being accessed by the instruction, and to the uniform block offset, which is the precise region in that UBO being accessed.

At present, only constant block indexing is going to be discussed, though both types of addressing need to be handled for block offsets.
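
Here’s a hypothetical GLSL snippet showing both kinds of offset indexing on a UBO member:

#version 130
#extension GL_ARB_uniform_buffer_object : require

uniform int i;
uniform ub_colors { vec4 colors[4]; };

void main()
{
   gl_FragColor = colors[1]    // constant indexing: offset known at compile time
                + colors[i];   // dynamic indexing: offset computed by the shader
}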

Evaluating the existing code

Let’s check out the core of the implementation:

static void
emit_load_ubo(struct ntv_context *ctx, nir_intrinsic_instr *intr)
{
   nir_const_value *const_block_index = nir_src_as_const_value(intr->src[0]);
   assert(const_block_index); // no dynamic indexing for now
   assert(const_block_index->u32 == 0); // we only support the default UBO for now

   nir_const_value *const_offset = nir_src_as_const_value(intr->src[1]);
   if (const_offset) {
      SpvId uvec4_type = get_uvec_type(ctx, 32, 4);
      SpvId pointer_type = spirv_builder_type_pointer(&ctx->builder,
                                                      SpvStorageClassUniform,
                                                      uvec4_type);

      unsigned idx = const_offset->u32;
      SpvId member = emit_uint_const(ctx, 32, 0);
      SpvId offset = emit_uint_const(ctx, 32, idx);
      SpvId offsets[] = { member, offset };
      SpvId ptr = spirv_builder_emit_access_chain(&ctx->builder, pointer_type,
                                                  ctx->ubos[0], offsets,
                                                  ARRAY_SIZE(offsets));
      SpvId result = spirv_builder_emit_load(&ctx->builder, uvec4_type, ptr);

      SpvId type = get_dest_uvec_type(ctx, &intr->dest);
      unsigned num_components = nir_dest_num_components(intr->dest);
      if (num_components == 1) {
         uint32_t components[] = { 0 };
         result = spirv_builder_emit_composite_extract(&ctx->builder,
                                                       type,
                                                       result, components,
                                                       1);
      } else if (num_components < 4) {
         SpvId constituents[num_components];
         SpvId uint_type = spirv_builder_type_uint(&ctx->builder, 32);
         for (uint32_t i = 0; i < num_components; ++i)
            constituents[i] = spirv_builder_emit_composite_extract(&ctx->builder,
                                                                   uint_type,
                                                                   result, &i,
                                                                   1);

         result = spirv_builder_emit_composite_construct(&ctx->builder,
                                                         type,
                                                         constituents,
                                                         num_components);
      }

      if (nir_dest_bit_size(intr->dest) == 1)
         result = uvec_to_bvec(ctx, result, num_components);

      store_dest(ctx, &intr->dest, result, nir_type_uint);
   } else
      unreachable("uniform-addressing not yet supported");
}

This is the handler for the load_ubo instruction in ntv. It performs a load operation on a previously-emitted UBO variable, using the first parameter (intr->src[0]) as the block index and the second parameter (intr->src[1]) as the block offset, and storing the resulting data that was loaded into the destination (intr->dest).

In this implementation, which is what’s currently in the repo, there’s some assert()s which verify that both of the parameters passed are constant rather than dynamic; as previously-mentioned, this is going to need to change, at least for the block offset. Additionally, the block index is restricted to 0, which I’ll explain a bit later, but it’s a problem.

Work items

So at a minimum, these are the following changes that need to be made:

  • Enable nonzero block indexing so that more than one UBO can be accessed
  • Handle block access using dynamic offsets

As with all projects I decide to tackle, however, these are not going to be the only changes required, as this is going to uncover a small tangle if I try to fix it directly by just removing the assert().

Stay tuned as this saga progresses.

June 25, 2020

A quick word

I didn’t budget my time well today, so here’s a very brief post about neat features in piglit, the test suite/runner for mesa.

Piglit is my go-to for verifying OpenGL behaviors. It’s got tons of fiendishly nitpicky tests for core functionality and extensions, and then it also provides this same quality for shaders with an unlimited number of shader tests.

When working on a new extension or modifying existing behavior, it can be useful to do quick runs of the tests verifying the behavior that’s being modified. A full piglit run with Zink takes around 10-20 minutes, which isn’t even enough to get in a good rhythm for some desk curls, so it’s great that there’s functionality for paring down the number of tests being run.

Regex testing

Piglit provides test inclusion and exclusion support using regular expressions.

  • With -t, a regex for tests to run can be specified, e.g., -t '.*arb_uniform_buffer_object.*'
  • With -x, a regex for tests to skip can be specified, e.g., -x '.*glx.*'

These options can be used to cut down runtimes as well as ignore tests with intermittent failures when those results aren’t interesting.
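
For example, a run that exercises only the UBO tests while skipping GLX might look like this (result paths are illustrative):

MESA_LOADER_DRIVER_OVERRIDE=zink ./piglit run -t '.*arb_uniform_buffer_object.*' -x '.*glx.*' gpu results/ubo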

That’s it

No, really, I said it would be brief.

June 24, 2020
Code reviews are a central fact of life in software development. It's important to do them well, and developer quality of life depends on a good review workflow.

Unfortunately, code reviews also appear to be a difficult problem. Many projects are bottlenecked by code reviews, in that reviewers are hard to find and progress gets slowed down by having to wait a long time for reviews.

The "solution" that I've often seen applied in practice is to have lower quality code reviews. Reviewers don't attempt to gain a proper understanding of a change, so reviews become shallower and therefore easier. This is convenient on the surface, but more likely to allow bad code to go through: a subtle corner case that isn't covered by tests (yet?) may be missed, there may be a misunderstanding of a relevant underlying spec, a bad design decision slips through, and so on. This is bound to cause pains later on.

I've experienced a number of different code review workflows in practice, based on a number of tools: GitHub PRs and GitLab MRs, Phabricator, and the e-mail review of patch series that is the original workflow for which Git was designed. Of those, the e-mail review flow produced the highest quality. There may be confounding factors, such as the nature of the projects and the selection of developers working on them, but quality issues aside I certainly feel that the e-mail review flow was the most pleasant to work with. Over time I've been thinking and having discussions a lot about just why that is. I feel that I have distilled it to two key factors, which I've decided to write down here so I can just link to this blog post in the future.

First, the UI experience of e-mail is a lot nicer. All of the alternatives are ultimately web-based, and their UI latency is universally terrible. Perhaps I'm particularly sensitive, but I just cannot stand web UIs for serious work. Give me something that reacts to all input reliably in under 50ms and I will be much happier. E-mail achieves that, web UIs don't. Okay, to be fair, e-mail is probably more in the 100ms range given the general state of the desktop. The point is, web UIs are about an order of magnitude worse. It's incredibly painful. (All of this assumes that you're using a decent native e-mail client. Do yourself a favor and give that a try if you haven't. The CLI warriors all have their favorites, but frankly Thunderbird works just fine. Outlook doesn't.)

Second, I've come to realize that there are conflicting goals in review granularity that e-mail happens to address pretty well, but none of the alternatives do a good job of it. Most of the alternatives don't even seem to understand that there is a problem to begin with! Here's the thing:

Reviews want to be small. The smaller and the more self-contained a change is, the easier it is to wrap your head around and judge. If you do a big refactor that is supposed to have no functional impact, followed by a separate small functional change that is enabled by the refactor, then each change individually is much easier to review. Approving changes at a fine granularity also helps ensure that you've really thought through each individual change and that each change has a reasonable justification. Important details don't get lost in something larger.

Reviews want to be big. A small, self-contained change can be difficult to understand and judge in isolation. You're doing a refactor that moves a function somewhere else? Fine, it's easy to tell that the change is correct, but is it a good change? To judge that, you often need to understand how the refactored result ends up being used in later changes, so it's good to see all those changes at once. Keep in mind though that you don't necessarily have to approve them at the same time. It's entirely possible to say, yes, that refactor looks good, we can go ahead with that, but please fix the way it's being used in a subsequent change.

There is another reason why reviews want to be big. Code reviews have a mental context-switching overhead. As a reviewer, you need to think yourself into the affected code in order to judge it well. If you do many reviews, you typically need to context-switch between each review. This can be very taxing mentally and ultimately unpleasant. A similar, though generally smaller, context-switching overhead applies to the author of the change as well: let's say you send out some changes for review, then go off and do another thing, and come back a day or two later to some asynchronously written reviews. In order to respond to the review, you may now have to page the context of that change back in. The point of all this is that when reviews are big, the context-switching overhead gets amortized better, i.e. the cost per change drops.

Reviews want to be both small and big. Guess what, patch series solve that problem! You get to review an entire feature implementation in the form of a patch series at once, so your context-switching overhead is reduced and you can understand how the different parts of the change play together. At the same time, you can drill down into individual patches and review those. Two levels of detail are available simultaneously.

So why e-mail? Honestly, I don't know. Given that the original use case for Git is based on patch series review, it's mind-boggling in a bad way that web-based Git hosting and code review systems do such a poor job of it, if they handle it at all.

Gerrit is the only system I know of that really takes patch series as an idea seriously, but while I am using it occasionally, I haven't had the opportunity to really stress it. Unfortunately, most people don't even want to consider Gerrit as an option because it's ugly.

Phabricator's stacks are a pretty decent attempt and I've made good use of them in the context of LLVM. However, they're too hidden and clumsy to navigate. Both Phabricator and Gerrit lack a mechanism for discussing a patch series as a whole.

GitHub and GitLab? They're basically unusable. Yes, you can look at individual commits, but then GitHub doesn't even display the commits in the correct order: they're sorted by commit or author date, not by the Git DAG order, which is an obviously and astonishingly bad idea. Comments tend to get lost when authors rebase, which is what authors should do in order to ensure a clean history, and actually reviewing an individual commit is impossible. Part of the power of patch series is the ability to say: "Patches 1-6, 9, and 11 are good to go, the rest needs work."

Oh, and by the way: Commit messages? They're important! Gerrit again is the only system I know of that allows review comments on commit messages. It's as if the authors of Gerrit are the only ones who really understood the problem space. Unfortunately, they seem to lack business and web design skills, and so we ended up in the mess we're in right now.

Mind you, even if the other players got their act together and supported the workflow properly, there'd still be the problem of UI latency. One can dream...

Shader ALUs

Today let’s talk about ALUs a little.

During shader compilation, GLSL gets lowered into NIR’s SSA form, which is what ntv operates on when translating it into SPIR-V. An ALU in the context of Zink (specifically ntv) is an algebraic operation which takes a varying number of inputs and generates an output. This is represented in NIR by a struct, nir_alu_instr, which contains the operation type, the inputs, and the output.
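
Abridged from mesa/src/compiler/nir/nir.h (several members omitted here), the struct looks roughly like this:

typedef struct nir_alu_instr {
   nir_instr instr;    /* base instruction header */
   nir_op op;          /* the operation, e.g. nir_op_fadd or nir_op_fne */
   bool exact;         /* if set, optimizations must preserve exact results */
   nir_alu_dest dest;  /* the output */
   nir_alu_src src[];  /* the variable number of inputs */
} nir_alu_instr;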

When writing GLSL, there’s the general assumption that writing something like 1 + 2 will yield 3, but this is contingent on the driver being able to correctly compile the NIR form of the shader into instructions that the physical hardware runs in order to get that result. In Zink, there’s the need to translate all these NIR instructions into SPIR-V, which is sometimes made trickier by both different semantics between similar GLSL and SPIR-V operations as well as aggressive NIR optimizations.

A deep dive into isnan()

The isnan function checks whether the input is NaN, i.e., not a number. It’s simple enough functionality to describe, but the implementation and its transit through the GLSL->NIR->SPIR-V->NIR pipeline is fraught with perils.

In mesa, isnan(x) is serialized to NIR as fne(x, x), where fne is the operation for float-not-equal, which compares two floats to determine whether they are unequal; since NaN is the only value that compares unequal to itself, this is equivalent. As such, there’s never actually a case where isnan itself gets passed through ntv. Let’s see what this looks like in practice with this failing shader test:

// from piglit's fs-isnan-vec2.shader_test for GLSL 1.30
#version 130
uniform vec2 numerator;
uniform vec2 denominator;

void main()
{
  gl_FragColor = vec4(isnan(numerator/denominator), 0.0, 1.0);
}

In Zink, this yields:

shader: MESA_SHADER_FRAGMENT
inputs: 0
outputs: 0
uniforms: 0
ubos: 1
shared: 0
decl_var ubo INTERP_MODE_NONE struct_uniform_0 uniform_0 (~0, 0, 640)
decl_var shader_out INTERP_MODE_NONE vec4 gl_FragColor (FRAG_RESULT_COLOR.xyzw, 4, 0)
decl_function main (0 params)

impl main {
	block block_0:
	/* preds: */
	vec4 32 ssa_1 = load_const (0x00000000 /* 0.000000 */, 0x00000000 /* 0.000000 */, 0x00000000 /* 0.000000 */, 0x3f800000 /* 1.000000 */)
	vec1 32 ssa_2 = load_const (0x00000000 /* 0.000000 */)
	intrinsic store_output (ssa_1, ssa_2) (8, 15, 0, 160) /* base=8 */ /* wrmask=xyzw */ /* component=0 */ /* type=float32 */	/* gl_FragColor */
	/* succs: block_1 */
	block block_1:
}

As with yesterday’s shader adventure, here’s IRIS as a control:

shader: MESA_SHADER_FRAGMENT
name: GLSL3
inputs: 0
outputs: 1
uniforms: 0
ubos: 1
shared: 0
decl_var uniform INTERP_MODE_NONE vec2 numerator (0, 0, 0)
decl_var uniform INTERP_MODE_NONE vec2 denominator (1, 2, 0)
decl_var ubo INTERP_MODE_NONE vec4[1] uniform_0 (0, 0, 0)
decl_var shader_out INTERP_MODE_NONE vec4 gl_FragColor (FRAG_RESULT_COLOR.xyzw, 4, 0)
decl_function main (0 params)

impl main {
	block block_0:
	/* preds: */
	vec1 32 ssa_0 = load_const (0x00000000 /* 0.000000 */)
	vec1 32 ssa_1 = load_const (0x3f800000 /* 1.000000 */)
	vec1 32 ssa_2 = load_const (0x00000001 /* 0.000000 */)
	vec4 32 ssa_3 = intrinsic load_ubo (ssa_2, ssa_0) (0, 4, 0) /* access=0 */ /* align_mul=4 */ /* align_offset=0 */
	vec1 32 ssa_6 = frcp ssa_3.z
	vec1 32 ssa_7 = frcp ssa_3.w
	vec1 32 ssa_8 = fmul ssa_3.x, ssa_6
	vec1 32 ssa_9 = fmul ssa_3.y, ssa_7
	vec1 32 ssa_10 = fne32 ssa_8, ssa_8
	vec1 32 ssa_12 = b2f32 ssa_10
	vec1 32 ssa_11 = fne32 ssa_9, ssa_9
	vec1 32 ssa_13 = b2f32 ssa_11
	vec4 32 ssa_14 = vec4 ssa_12, ssa_13, ssa_0, ssa_1
	intrinsic store_output (ssa_14, ssa_0) (4, 15, 0, 160) /* base=4 */ /* wrmask=xyzw */ /* component=0 */ /* type=float32 */	/* gl_FragColor */
	/* succs: block_1 */
	block block_1:
}

This is clearly much different. In particular, note that IRIS retains its fne instructions, but Zink has lost them along the way.

Why is this?

The problem comes from how the SPIR-V is translated back to NIR. When fne(a, a) is emitted into SPIR-V as OpFOrdNotEqual, the NaN-ness is ignored: the NaN value gets compared against itself and manages to come out as equivalent, which breaks the test. This is because OpFOrdNotEqual is explicitly defined only for ordered (numeric, non-NaN) operands.

Using OpFUnordNotEqual for this case has no such issue, as this op always returns true if either of the inputs is unordered (NaN), which is exactly the behavior isnan() needs.
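
In ntv terms, the fix boils down to emitting fne with the unordered op; roughly sketched here (the real emission code in nir_to_spirv.c is structured differently):

case nir_op_fne:
   /* OpFUnordNotEqual returns true when either operand is NaN, which
    * preserves the isnan() semantics of fne(x, x)
    */
   result = emit_binop(ctx, SpvOpFUnordNotEqual, dest_type, src[0], src[1]);
   break;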

June 23, 2020

Something different

I’ve talked and rambled about various things, and maybe I’ve given an idea of what it’s like to work on Zink, but the reality is that I spend the majority of my time working on the shader translation pipeline, which converts NIR to SPIR-V.

Today let’s go through the process for fixing an issue in this pipeline, as detected by piglit.

Steps

  • build piglit, as described by its README
  • run piglit; my current run is executed with VK_INSTANCE_LAYERS= MESA_GLSL_CACHE_DISABLE=true MESA_LOADER_DRIVER_OVERRIDE=zink ./piglit run --timeout 3000 gpu results/new
  • generate a viewable summary with e.g., ./piglit summary html results/compare <possibly some previous results> results/new
  • open generated results/compare/index.html in browser

Now there’s a massive list of tests with pass/fail results. Clicking on the results of any test will provide more detail, just like this:

[image: piglit-result-2020-06-23_15-35-31.png]

In this case, spec@glsl-1.30@execution@fs-texelfetchoffset-2d is failing. What does that mean?

Debugging piglit tests

Near the bottom of the above results, there’s a row for Command, which is the command used to run a given test. This command can be run in any tool, such as gdb or valgrind, in order to run only this test.

More importantly, however, in the case of shader tests, it lets someone debugging a given test produce NIR output, as this is usually the best way to figure out what’s going wrong.

To do so for the above test, I’ve run NIR_PRINT=1 LIBGL_DEBUG=verbose MESA_GLSL_CACHE_DISABLE=true MESA_LOADER_DRIVER_OVERRIDE=zink bin/fs-texelFetchOffset-2D -auto -fbo &>| zink. This generates a file zink in my current directory which contains the generated NIR as it progresses through various stages of optimization and translation.

NIR

The specified test is for a fragment shader, as indicated by fs in the name or just reading the test code, which uses the following shader:

"#version 130\n"
"uniform ivec2 pos;\n"
"uniform int lod;\n"
"uniform sampler2D tex;\n"
"void main()\n"
"{\n"
"       const ivec2 offset = ivec2(-2, 2);\n"
"       vec4 texel = texelFetchOffset(tex, pos, lod, offset);\n"
"	gl_FragColor = texel;\n"
"}\n";

Searching through the NIR output for the last output of the fragment shader IR yields:

shader: MESA_SHADER_FRAGMENT
inputs: 0
outputs: 0
uniforms: 8
ubos: 1
shared: 0
decl_var uniform INTERP_MODE_NONE sampler2D tex (~0, 0, 672)
decl_var ubo INTERP_MODE_NONE struct_uniform_0 uniform_0 (~0, 0, 640)
decl_var shader_out INTERP_MODE_NONE vec4 gl_FragData[0] (FRAG_RESULT_DATA0.xyzw, 8, 0)
decl_function main (0 params)

impl main {
	block block_0:
	/* preds: */
	vec1 32 ssa_0 = load_const (0x00000002 /* 0.000000 */)
	vec1 32 ssa_1 = load_const (0x00000000 /* 0.000000 */)
	vec4 32 ssa_2 = intrinsic load_ubo (ssa_0, ssa_1) (0, 4, 0) /* access=0 */ /* align_mul=4 */ /* align_offset=0 */
	vec1 32 ssa_3 = load_const (0x00000010 /* 0.000000 */)
	vec4 32 ssa_4 = intrinsic load_ubo (ssa_0, ssa_3) (0, 16, 0) /* access=0 */ /* align_mul=16 */ /* align_offset=0 */
	vec2 32 ssa_5 = vec2 ssa_2.x, ssa_2.y
	vec1 32 ssa_6 = mov ssa_4.x
	vec4 32 ssa_7 = (float)txf ssa_5 (coord), ssa_6 (lod), 3 (texture), 0 (sampler)
	intrinsic store_output (ssa_7, ssa_1) (8, 15, 0, 160) /* base=8 */ /* wrmask=xyzw */ /* component=0 */ /* type=float32 */	/* gl_FragData[0] */
	/* succs: block_1 */
	block block_1:
}

That’s a lot to go through in a single post, so I’ll be providing a brief overview for now. The most important thing to keep in mind is that the ssa_* values in the IR are in SSA (static single assignment) form, so each value can be traced through execution by following the assignments.

Looking at main in the shader code, an ivec2 is created as (-2, 2), and this is passed into texelFetchOffset() as the offset from the pos uniform.

Looking at main in the IR, the first 5 lines of block_0 (the only block) are used to load resources. It can be assumed they’re generally correct right now, though that won’t always be the case. Next there’s a vec2 formed (ssa_5) from the load_ubo-created ssa_2; as can be seen a couple lines down, this is the coord or P param in texelFetchOffset, which is abbreviated as txf here.

In particular, ssa_5 is formed and then passed directly to the txf instruction. What happened to the offset?

Let’s check out NIR generated for this shader by IRIS, the Intel gallium driver:

shader: MESA_SHADER_FRAGMENT
name: GLSL3
inputs: 0
outputs: 1
uniforms: 0
ubos: 1
shared: 0
decl_var uniform INTERP_MODE_NONE ivec2 pos (1, 0, 0)
decl_var uniform INTERP_MODE_NONE int lod (2, 2, 0)
decl_var uniform INTERP_MODE_NONE sampler2D tex (3, 0, 0)
decl_var ubo INTERP_MODE_NONE vec4[1] uniform_0 (0, 0, 0)
decl_var shader_out INTERP_MODE_NONE vec4 gl_FragColor (FRAG_RESULT_COLOR.xyzw, 4, 0)
decl_function main (0 params)

impl main {
	block block_0:
	/* preds: */
	vec1 32 ssa_0 = load_const (0x00000000 /* 0.000000 */)
	vec1 32 ssa_1 = load_const (0x00000002 /* 0.000000 */)
	vec3 32 ssa_2 = intrinsic load_ubo (ssa_1, ssa_0) (0, 4, 0) /* access=0 */ /* align_mul=4 */ /* align_offset=0 */
	vec1 32 ssa_16 = mov ssa_2.z
	vec1 32 ssa_3 = load_const (0xfffffffe /* -nan */)
	vec1 32 ssa_6 = iadd ssa_2.x, ssa_3
	vec1 32 ssa_7 = iadd ssa_2.y, ssa_1
	vec2 32 ssa_8 = vec2 ssa_6, ssa_7
	vec4 32 ssa_9 = (float)txf ssa_8 (coord), ssa_16 (lod), 1 (texture), 0 (sampler)
	intrinsic store_output (ssa_9, ssa_0) (4, 15, 0, 160) /* base=4 */ /* wrmask=xyzw */ /* component=0 */ /* type=float32 */	/* gl_FragColor */
	/* succs: block_1 */
	block block_1:
}

In particular:

	vec1 32 ssa_6 = iadd ssa_2.x, ssa_3
	vec1 32 ssa_7 = iadd ssa_2.y, ssa_1
	vec2 32 ssa_8 = vec2 ssa_6, ssa_7

As can be seen here, the ssa for the coord param is only formed after a pair of iadd instructions occur, as one would expect to see if a vec2 offset were added to a vec2 coordinate.

Indeed, it seems that Zink is ignoring the offset here.

Fixing

Armed with the knowledge that a txf instruction is involved, a quick search through nir_to_spirv.c reveals the emit_tex() function as a likely starting point, as it’s where txf is handled.

Some excerpts follow:

   for (unsigned i = 0; i < tex->num_srcs; i++) {
      switch (tex->src[i].src_type) {
      case nir_tex_src_coord:
         if (tex->op == nir_texop_txf ||
             tex->op == nir_texop_txf_ms)
            coord = get_src_int(ctx, &tex->src[i].src);
         else
            coord = get_src_float(ctx, &tex->src[i].src);
         coord_components = nir_src_num_components(tex->src[i].src);
         break;

      case nir_tex_src_offset:
         offset = get_src_int(ctx, &tex->src[i].src);
         break;

Here the code iterates through the inputs for the command and then takes action based on their type. In particular, I’ve cut out the coord and offset inputs, as that’s where the issue lies. The implementation is translating the nir_src values (which represent “some value” at runtime) into SpvId values (which also represent “some value” at runtime), so this is okay.

Let’s scroll down a bit:

   if (tex->op == nir_texop_txf ||
       tex->op == nir_texop_txf_ms) {
      SpvId image = spirv_builder_emit_image(&ctx->builder, image_type, load);
      result = spirv_builder_emit_image_fetch(&ctx->builder, dest_type,
                                              image, coord, lod, sample);
   } else {
      result = spirv_builder_emit_image_sample(&ctx->builder,
                                               actual_dest_type, load,
                                               coord,
                                               proj != 0,
                                               lod, bias, dref, dx, dy,
                                               offset);
   }

And here’s the problem. The txf instruction isn’t handling the offset at all, while other instructions (which map to e.g., OpImageSampleImplicitLod) are passing it along as a parameter.

The fix in this case is to add the offset to the coordinates with OpIAdd, which does indeed permit addition of vectors, and so the block can be changed to:

   if (tex->op == nir_texop_txf ||
       tex->op == nir_texop_txf_ms) {
      SpvId image = spirv_builder_emit_image(&ctx->builder, image_type, load);
      if (offset)
         coord = emit_binop(ctx, SpvOpIAdd,
                            /* 'coord_bitsize' here comes from adding
                            
                               coord_bitsize = nir_src_bit_size(tex->src[i].src);
                               
                               to the 'nir_tex_src_coord' switch case in the first block
                             */
                            get_ivec_type(ctx, coord_bitsize, coord_components),
                            coord, offset);
      result = spirv_builder_emit_image_fetch(&ctx->builder, dest_type,
                                              image, coord, lod, sample);
   } else {
      result = spirv_builder_emit_image_sample(&ctx->builder,
                                               actual_dest_type, load,
                                               coord,
                                               proj != 0,
                                               lod, bias, dref, dx, dy,
                                               offset);
   }

This emits an addition instruction for the coord and offset vectors and passes the new coord value to the spirv_builder_emit_image_fetch() function, and now the issue is resolved.

June 22, 2020

Finally, An Introduction

Since the start of this blog, I’ve been going full speed ahead with minimal regard for explaining terminology or architecture. This was partly to bootstrap the blog and get some potentially interesting content out there, but I also wanted to provide some insight into how clueless I was when I started out in mesa.

If you’ve been confused by the previous posts, that’s roughly where I was at the time when I first encountered whatever it was that you’ve been reading about.

What is Gallium?

There’s a lot of good documentation available for it, but much of that documentation assumes that the reader already has fairly deep knowledge about graphics/rendering as well as pipeline architecture.

When I began working on mesa, I did not have that knowledge, so let’s take a little time to go over some parts of the mesa tree, beginning with gallium.

Gallium is the API provided by mesa/src/mesa/state_tracker. state_tracker is a mesa dri driver implementation (like i965 or radeon) which translates the mesa/src/mesa/main API and functionality into something a bit more flexible and easier to write drivers for. In particular, the state tracker is less immediate-mode than core mesa, which enables greater optimization through e.g., batching and deduplication of repeated operations.

What are the main components of the Gallium API?

The main headers for use with gallium drivers can be found in mesa/src/gallium/include/pipe. This contains:

  • struct pipe_screen - an interface for accessing the underlying hardware/device layer, providing the get_param() methods for determining the capabilities (PIPE_CAP_XYZ) that a driver has. In Zink terms, this is the object that all Vulkan commands go through, as struct zink_screen::dev is the VkDevice used for everything.

  • struct pipe_context - an interface created from a struct pipe_screen providing rendering context methods to handle managing states and surface objects as well as VBO drawing. In Zink, this is the component that ties everything together.

  • struct pipe_resource - an object created from a struct pipe_screen representing some sort of buffer or texture. In Zink terms, any time an OpenGL command reads back data from a buffer or directly maps data to a texture, this is the object used.

  • struct pipe_surface - an object created from a struct pipe_context representing a texture view that can be bound as a color/depth/stencil attachment to a framebuffer.

  • struct pipe_query - an object created from a struct pipe_context representing a query of some sort, whether for performance or functional purposes.

  • struct pipe_fence_handle - an object created from a struct pipe_screen representing a fence used for command stream synchronization.
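
To make the relationships between these objects a bit more concrete, here's a rough sketch of how the gallium interfaces fit together. The function pointers come from the real headers, but the flow is simplified and the values are purely illustrative:

/* sketch: tying the gallium objects above together; error handling omitted */
static void
gallium_usage_sketch(struct pipe_screen *screen)
{
   /* query a device capability from the screen */
   int max_so_buffers = screen->get_param(screen, PIPE_CAP_MAX_STREAM_OUTPUT_BUFFERS);

   /* create a rendering context from the screen */
   struct pipe_context *ctx = screen->context_create(screen, NULL, 0);

   /* create a buffer resource from the screen */
   struct pipe_resource templat = {0};
   templat.target = PIPE_BUFFER;
   templat.width0 = 4096;
   templat.height0 = 1;
   templat.depth0 = 1;
   templat.array_size = 1;
   struct pipe_resource *res = screen->resource_create(screen, &templat);

   /* create a query from the context */
   struct pipe_query *query = ctx->create_query(ctx, PIPE_QUERY_PRIMITIVES_GENERATED, 0);

   /* ... render, fetch results, etc. ... */

   ctx->destroy_query(ctx, query);
   pipe_resource_reference(&res, NULL);
   ctx->destroy(ctx);
}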

Other components

Aside from the main gallium API (which has tons more types than just those listed above), there’s also:

  • the GLSL compiler
  • the NIR compiler
  • the SPIR-V compiler
  • the mesa utility API
  • the gallium aux/utility API

These are written in a combination of C/C++/Python, and they’re (mostly) all used in gallium drivers.

June 19, 2020

A new problem

My work in ntv exposed more issues that needed to be resolved, the most significant of which was the ability of Zink to accidentally clobber output variables. What does that actually mean though?

Inputs and outputs in shaders are assigned a location (SPIRV terminology) or a slot (mesa terminology). These are helpfully defined in mesa/src/compiler/shader_enums.h:

/**
 * Indexes for vertex shader outputs, geometry shader inputs/outputs, and
 * fragment shader inputs.
 *
 * Note that some of these values are not available to all pipeline stages.
 *
 * When this enum is updated, the following code must be updated too:
 * - vertResults (in prog_print.c's arb_output_attrib_string())
 * - fragAttribs (in prog_print.c's arb_input_attrib_string())
 * - _mesa_varying_slot_in_fs()
 */
typedef enum
{
   VARYING_SLOT_POS,
   VARYING_SLOT_COL0, /* COL0 and COL1 must be contiguous */
   VARYING_SLOT_COL1,
   VARYING_SLOT_FOGC,
   VARYING_SLOT_TEX0, /* TEX0-TEX7 must be contiguous */
   VARYING_SLOT_TEX1,
   VARYING_SLOT_TEX2,
   VARYING_SLOT_TEX3,
   VARYING_SLOT_TEX4,
   VARYING_SLOT_TEX5,
   VARYING_SLOT_TEX6,
   VARYING_SLOT_TEX7,
   VARYING_SLOT_PSIZ, /* Does not appear in FS */
   VARYING_SLOT_BFC0, /* Does not appear in FS */
   VARYING_SLOT_BFC1, /* Does not appear in FS */
   VARYING_SLOT_EDGE, /* Does not appear in FS */
   VARYING_SLOT_CLIP_VERTEX, /* Does not appear in FS */
   VARYING_SLOT_CLIP_DIST0,
   VARYING_SLOT_CLIP_DIST1,
   VARYING_SLOT_CULL_DIST0,
   VARYING_SLOT_CULL_DIST1,
   VARYING_SLOT_PRIMITIVE_ID, /* Does not appear in VS */
   VARYING_SLOT_LAYER, /* Appears as VS or GS output */
   VARYING_SLOT_VIEWPORT, /* Appears as VS or GS output */
   VARYING_SLOT_FACE, /* FS only */
   VARYING_SLOT_PNTC, /* FS only */
   VARYING_SLOT_TESS_LEVEL_OUTER, /* Only appears as TCS output. */
   VARYING_SLOT_TESS_LEVEL_INNER, /* Only appears as TCS output. */
   VARYING_SLOT_BOUNDING_BOX0, /* Only appears as TCS output. */
   VARYING_SLOT_BOUNDING_BOX1, /* Only appears as TCS output. */
   VARYING_SLOT_VIEW_INDEX,
   VARYING_SLOT_VIEWPORT_MASK, /* Does not appear in FS */
   VARYING_SLOT_VAR0, /* First generic varying slot */
   /* the remaining are simply for the benefit of gl_varying_slot_name()
    * and not to be construed as an upper bound:
    */
   VARYING_SLOT_VAR1,
   VARYING_SLOT_VAR2,
   VARYING_SLOT_VAR3,
   VARYING_SLOT_VAR4,
   VARYING_SLOT_VAR5,
   VARYING_SLOT_VAR6,
   VARYING_SLOT_VAR7,
   VARYING_SLOT_VAR8,
   VARYING_SLOT_VAR9,
   VARYING_SLOT_VAR10,
   VARYING_SLOT_VAR11,
   VARYING_SLOT_VAR12,
   VARYING_SLOT_VAR13,
   VARYING_SLOT_VAR14,
   VARYING_SLOT_VAR15,
   VARYING_SLOT_VAR16,
   VARYING_SLOT_VAR17,
   VARYING_SLOT_VAR18,
   VARYING_SLOT_VAR19,
   VARYING_SLOT_VAR20,
   VARYING_SLOT_VAR21,
   VARYING_SLOT_VAR22,
   VARYING_SLOT_VAR23,
   VARYING_SLOT_VAR24,
   VARYING_SLOT_VAR25,
   VARYING_SLOT_VAR26,
   VARYING_SLOT_VAR27,
   VARYING_SLOT_VAR28,
   VARYING_SLOT_VAR29,
   VARYING_SLOT_VAR30,
   VARYING_SLOT_VAR31,
} gl_varying_slot;

As seen above, there’s a total of 64 slots: 32 for builtins and 32 for other usage. In ntv, the builtins are translated from GLSL -> SPIRV. The problem arises because SPIRV doesn’t have analogues for many of the GLSL builtins, which means they need to take up space in the latter half of the slots.

As an example, VARYING_SLOT_COL0 (GLSL’s gl_Color or gl_FrontColor depending on shader type) does not have a SPIRV builtin. This means it’ll get emitted as VARYING_SLOT_VAR[n]. In such a scenario, any VARYING_SLOT_VAR[n] created from a user-defined varying in the shader will end up clobbering the color value.

More problems

The simple solution here would be to just map the first half (builtin) of the slot range onto the second half (user), but that has its own problem: slot usage must remain within the boundaries of the enum. This means that the slot usage for GLSL builtins needs to be kept to a minimum in order to leave room for user-defined varyings.

Additionally, the slots need to be remapped consistently for all types of shaders, as ntv has no capacity to look at any shader but the one being actively processed. So doing any kind of dynamic remapping is out.

Solutions

Ideally the GLSL slot usage needs to be compacted, so I started creating a remapping array for the builtins so that I could see what was available as a SPIRV builtin and what wasn’t. Then I went over the members lacking SPIRV builtins and assigned them a value. The result was this:

/* this consistently maps slots to a zero-indexed value to avoid wasting slots */
static unsigned slot_pack_map[] = {
   /* Position is builtin */
   [VARYING_SLOT_POS] = UINT_MAX,
   [VARYING_SLOT_COL0] = 0, /* input/output */
   [VARYING_SLOT_COL1] = 1, /* input/output */
   [VARYING_SLOT_FOGC] = 2, /* input/output */
   /* TEX0-7 are translated to VAR0-7 by nir, so we don't need to reserve */
   [VARYING_SLOT_TEX0] = UINT_MAX, /* input/output */
   [VARYING_SLOT_TEX1] = UINT_MAX,
   [VARYING_SLOT_TEX2] = UINT_MAX,
   [VARYING_SLOT_TEX3] = UINT_MAX,
   [VARYING_SLOT_TEX4] = UINT_MAX,
   [VARYING_SLOT_TEX5] = UINT_MAX,
   [VARYING_SLOT_TEX6] = UINT_MAX,
   [VARYING_SLOT_TEX7] = UINT_MAX,

   /* PointSize is builtin */
   [VARYING_SLOT_PSIZ] = UINT_MAX,

   [VARYING_SLOT_BFC0] = 3, /* output only */
   [VARYING_SLOT_BFC1] = 4, /* output only */
   [VARYING_SLOT_EDGE] = 5, /* output only */
   [VARYING_SLOT_CLIP_VERTEX] = 6, /* output only */

   /* ClipDistance is builtin */
   [VARYING_SLOT_CLIP_DIST0] = UINT_MAX,
   [VARYING_SLOT_CLIP_DIST1] = UINT_MAX,

   /* CullDistance is builtin */
   [VARYING_SLOT_CULL_DIST0] = UINT_MAX, /* input/output */
   [VARYING_SLOT_CULL_DIST1] = UINT_MAX, /* never actually used */

   /* PrimitiveId is builtin */
   [VARYING_SLOT_PRIMITIVE_ID] = UINT_MAX,

   /* Layer is builtin */
   [VARYING_SLOT_LAYER] = UINT_MAX, /* input/output */

   /* ViewportIndex is builtin */
   [VARYING_SLOT_VIEWPORT] =  UINT_MAX, /* input/output */

   /* FrontFacing is builtin */
   [VARYING_SLOT_FACE] = UINT_MAX,

   /* PointCoord is builtin */
   [VARYING_SLOT_PNTC] = UINT_MAX, /* input only */

   /* TessLevelOuter is builtin */
   [VARYING_SLOT_TESS_LEVEL_OUTER] = UINT_MAX,
   /* TessLevelInner is builtin */
   [VARYING_SLOT_TESS_LEVEL_INNER] = UINT_MAX,

   [VARYING_SLOT_BOUNDING_BOX0] = 7, /* Only appears as TCS output. */
   [VARYING_SLOT_BOUNDING_BOX1] = 8, /* Only appears as TCS output. */
   [VARYING_SLOT_VIEW_INDEX] = 9, /* input/output */
   [VARYING_SLOT_VIEWPORT_MASK] = 10, /* output only */
};

Now all the GLSL builtins that need slots are compacted into 11 members of the enum, which leaves the other 21 available.

Any input or output coming through ntv now goes through a switch statement: the GLSL builtins that can be translated to SPIRV builtins are, the builtins that can’t are remapped using this array, and the rest of the slots get mapped onto a slot after the reserved slot members.
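
A hypothetical sketch of that remapping logic (the constant and function name here are assumptions for illustration; the real ntv code also emits SpvBuiltIn decorations for the translatable builtins):

static uint32_t
map_slot(unsigned slot)
{
   /* the 11 reserved slots consumed by the packed GLSL builtins above */
   const unsigned NUM_PACKED_SLOTS = 11;

   if (slot < VARYING_SLOT_VAR0) {
      /* a GLSL builtin with no SPIRV analogue: use its packed slot */
      assert(slot_pack_map[slot] != UINT_MAX);
      return slot_pack_map[slot];
   }
   /* user-defined varyings land after the reserved slots */
   return slot - VARYING_SLOT_VAR0 + NUM_PACKED_SLOTS;
}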

Problem solved.

For now.

June 18, 2020

Testing Testing Testing

As always, there’s more tests to run. And when the validation layers are enabled for Vulkan (export VK_INSTANCE_LAYERS=VK_LAYER_KHRONOS_validation once installed), sometimes new errors pop up.

Such was the case when running tests one day when I encountered an error about vkCmdResetQueryPool being called inside a render pass. Indeed, there was an incorrect usage here, and it needed to be resolved. Let’s take a look at why this was happening.

Query Active-ness

Once vkCmdBeginQuery is called, a query is considered active by Vulkan, which means it can’t be destroyed until it’s made inactive. Queries in the active state now have an association with the command buffer (batch) that they’re made active in, but this also means that a query needs to be “transferred” to a new zink_batch any time the current one is flushed and cycled to the next batch. This happens as below:

static void
flush_batch(struct zink_context *ctx)
{
   struct zink_batch *batch = zink_curr_batch(ctx);
   if (batch->rp)
      vkCmdEndRenderPass(batch->cmdbuf);

   zink_end_batch(ctx, batch);

   ctx->curr_batch++;
   if (ctx->curr_batch == ARRAY_SIZE(ctx->batches))
      ctx->curr_batch = 0;

   zink_start_batch(ctx, zink_curr_batch(ctx));
}

This submits the current command buffer (batch), then switches to the next one and activates it. In the process, all active queries on the current batch are stopped, then they get started once more on the new command buffer first thing so that they continue to e.g., track primitives emitted without missing any operations.
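
The resume side of that hand-off might look something like this sketch (a simplified assumption; the real function lives alongside zink_suspend_queries() and does a bit more bookkeeping):

/* sketch: restart every previously-active query on the new batch so that
 * counters keep accumulating across the flush */
static void
resume_queries(struct zink_context *ctx, struct zink_batch *batch)
{
   list_for_each_entry(struct zink_query, query, &ctx->active_queries, active_list)
      begin_query(batch, query);
}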

Diving deeper, zink_end_batch() calls through to zink_suspend_queries(), which calls end_query(), a function I’ve mentioned a couple times previously. After all the reworking, here’s how it looks:

static void
end_query(struct zink_batch *batch, struct zink_query *q)
{
   struct zink_context *ctx = batch->ctx;
   struct zink_screen *screen = zink_screen(ctx->base.screen);
   assert(q->type != PIPE_QUERY_TIMESTAMP);
   q->active = false;
   if (q->vkqtype == VK_QUERY_TYPE_TRANSFORM_FEEDBACK_STREAM_EXT)
      screen->vk_CmdEndQueryIndexedEXT(batch->cmdbuf, q->query_pool, q->curr_query, q->index);
   else
      vkCmdEndQuery(batch->cmdbuf, q->query_pool, q->curr_query);
   if (++q->curr_query == q->num_queries) {
      get_query_result(&ctx->base, (struct pipe_query*)q, false, &q->big_result);
      vkCmdResetQueryPool(batch->cmdbuf, q->query_pool, 0, q->num_queries);
      q->last_checked_query = q->curr_query = 0;
   }
}

As seen, the query is marked inactive (as it relates to Vulkan), ended, and then the query id is incremented. If the id reaches the max number of queries in the pool, the query grabs the current results into the inline result struct discussed previously before resetting the pool.

This is where the API misuse error was coming from: end_query() can be called while the batch is inside a render pass, and that render pass is not destroyed until the batch is reset, which occurs the next time it’s made active, so the vkCmdResetQueryPool here could land inside the render pass.

A small change to the reset block at the end of end_query():

static void
end_query(struct zink_batch *batch, struct zink_query *q)
{
   struct zink_context *ctx = batch->ctx;
   struct zink_screen *screen = zink_screen(ctx->base.screen);
   assert(q->type != PIPE_QUERY_TIMESTAMP);
   q->active = false;
   if (q->vkqtype == VK_QUERY_TYPE_TRANSFORM_FEEDBACK_STREAM_EXT)
      screen->vk_CmdEndQueryIndexedEXT(batch->cmdbuf, q->query_pool, q->curr_query, q->index);
   else
      vkCmdEndQuery(batch->cmdbuf, q->query_pool, q->curr_query);
   if (++q->curr_query == q->num_queries) {
      if (batch->rp)
         q->needs_reset = true;
      else
         reset_pool(batch, q);
   }
}

Now the reset can be deferred until the query is made active again, at which point it’s guaranteed to not be inside a render pass.
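
For completeness, the deferred half might look roughly like this (a sketch; the real begin_query() does more than shown):

static void
begin_query(struct zink_batch *batch, struct zink_query *q)
{
   /* perform the deferred reset now that the batch is guaranteed
    * to be outside a render pass */
   if (q->needs_reset)
      reset_pool(batch, q);
   q->needs_reset = false;

   /* ... vkCmdBeginQuery etc. as before ... */
}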

And that’s it

The updated query code is awaiting review, and Zink no longer assert()s when an application toggles its queries too many times.

June 17, 2020

In Part 1, I showed you how to create your own distribution image using the freedesktop.org CI templates. In Part 2, we'll go a bit further than that by truly embracing nested images.

Our assumption here is that we have two projects (or jobs), with the second one relying heavily on the first one. For example, the base project and a plugin, or a base project and its language bindings. What we'll get out of this blog post is a setup where we have

  • a base image in the base project
  • an image extending that base image in a different project
  • automatic rebuilds of that extended image when the base image changes

And none of your contributors have to care about this. It's all handled automatically, and filing an MR against a project will build against the right image. So let's get started.

Our base project has CI that pushes an image to its registry. The .gitlab-ci.yml contains something like this:


.fedora32:
  variables:
    FDO_DISTRIBUTION_VERSION: '32'
    FDO_DISTRIBUTION_TAG: 'base.0'

build-img:
  extends:
    - .fedora32
    - .fdo.container-build@fedora
  variables:
    FDO_DISTRIBUTION_PACKAGES: "curl wget"

This will build a fedora/32:base.0 image in the project's container registry. That image is built once and then re-used by any job extending .fdo.distribution-image@fedora. So far, so Part 1.

Now, the second project needs to test things on top of this base image, for example language bindings for rust. You want to use the same image that the base project uses (and has successfully completed its CI on) but you need some extra packages or setup. This is where the FDO_BASE_IMAGE comes in. In our dependent project, we have this:


.fedora32:
  variables:
    FDO_DISTRIBUTION_VERSION: '32'
    FDO_DISTRIBUTION_TAG: 'rust.0'

build-rust-image:
  extends:
    - .fedora32
    - .fdo.container-build@fedora
  variables:
    FDO_BASE_IMAGE: "registry.freedesktop.org/baseproject/name/fedora/32:base.0"
    # extra packages we want to install and things we need to set up
    FDO_DISTRIBUTION_PACKAGES: "rust cargo"
    FDO_DISTRIBUTION_EXEC: "bash -x some-setup-script.sh"

test-rust:
  extends:
    - .fedora32
    - .fdo.distribution-image@fedora
  script:
    - cargo build myproject-bindings

And voila, you now have two images: the base image with curl and wget in the base project and an extra image with rust and cargo in the dependent project. And all that is required is to reference the FDO_BASE_IMAGE, everything else is the same. Note how the FDO_BASE_IMAGE is a full path in this example since we assume it's in a different project. For dependent images within the same project, you can just use the image path without the host.

The dependency only matters while the image is built, after that the dependent image is just another standalone image. So even if the base project removes the base image, you still have yours to test on.

But eventually you will need to change the base image and you want the dependent image to update as well. The best solution here is to have a CI job as part of the base repo that pokes the dependent repo's CI whenever the base image updates. The CI templates add the pipeline id as label to an image when it is built. In your base project, you can thus have a job like this:


poke-dependents:
  extends:
    - .fedora32
    - .fdo.distribution-image@fedora
  image: something-with-skopeo-and-jq
  script:
    # FDO_DISTRIBUTION_IMAGE still has indirections
    - DISTRO_IMAGE=$(eval echo ${FDO_DISTRIBUTION_IMAGE})
    # retrieve info from the registry and extract the pipeline id
    - JSON_IMAGE=$(skopeo inspect docker://$DISTRO_IMAGE)
    - IMAGE_PIPELINE_ID=$(echo $JSON_IMAGE | jq -r '.Labels["fdo.pipeline_id"]')
    - |
      if [[ x"$IMAGE_PIPELINE_ID" == x"$CI_PIPELINE_ID" ]]; then
        curl -X POST \
             -F "token=$AUTH_TOKEN_VALUE" \
             -F "ref=master" \
             -F "variables[SOMEVARIABLE]=somevalue" \
             https://gitlab.freedesktop.org/api/v4/projects/dependent${SLASH}project/trigger/pipeline
      fi
  variables:
    SLASH: "%2F"

Let's dissect this: First, we use the .fdo.distribution-image@fedora template to get access to FDO_DISTRIBUTION_IMAGE. We don't need to use the actual image though, anything with skopeo and jq will do. Then we fetch the pipeline id label from the image and compare it to the current pipeline ID. If it is the same, our image was rebuilt as part of the pipeline and we poke the other project's pipeline with a SOMEVARIABLE set to somevalue. The auth token is a standard GitLab token you need to create to allow triggering the pipeline in the dependent project.

In that dependent project you can have a job like this:


rebuild-extra-image:
  extends: build-extra-image
  rules:
    - if: '$SOMEVARIABLE == "somevalue"'
  variables:
    FDO_FORCE_REBUILD: 1

This job is only triggered when the variable is set, and it will force a rebuild of the container image. If you want custom rebuilds of images, set the variables accordingly.

So, as promised above, we now have a base image and a separate image building on that, together with auto-rebuild hooks. The gstreamer-plugins-rs project uses this approach. The base image is built by gstreamer-rs during its CI run which then pokes gstreamer-plugins-rs to rebuild selected dependent images.

The above is most efficient when the base project knows of the dependent projects. Where this is not the case, the dependent project will need a scheduled pipeline to poll the base project and extract the image IDs from that, possibly using creation dates and whatnot. We'll figure that out when we have a use-case for it.

The overflow problem

In past posts I touched on the issue of query pool overflows. The gist of the issue, again, is that each “query” object exposed higher up to the OpenGL API is actually a query pool in Vulkan. Every time an OpenGL query is started or resumed, this consumes a query from the pool. As such, any application which attempts to toggle the active state of a query too many times will end up losing data when the pool gets reset.

Solution

It wasn’t the most difficult of problems. I dove in with the idea of storing partial query results onto the struct zink_query object. At zink_create_query, the union pipe_query_result object gets zeroed, and it can then have partial results concatenated to it. In code, it looks more or less like:

static bool
get_query_result(struct pipe_context *pctx,
                 struct pipe_query *q,
                 bool wait,
                 union pipe_query_result *result)
{
   struct zink_screen *screen = zink_screen(pctx->screen);
   struct zink_query *query = (struct zink_query *)q;
   VkQueryResultFlagBits flags = 0;

   if (wait)
      flags |= VK_QUERY_RESULT_WAIT_BIT;

   if (query->use_64bit)
      flags |= VK_QUERY_RESULT_64_BIT;

   // the below lines added for overflow handling
   if (result != &query->big_result) {
      memcpy(result, &query->big_result, sizeof(query->big_result));
      util_query_clear_result(&query->big_result, query->type);
   } else
      flags |= VK_QUERY_RESULT_PARTIAL_BIT;

So when this is called from the application, the result pointer isn’t the inline query result; the existing partial data gets copied into it, and then the latest query data is concatenated onto it once it’s fetched from Vulkan.

When calling internally from end_query just before resetting the pool, the partial bit is set, which permits the return of “whatever results are available”. Those get stored onto the inline results struct that gets passed for this case, and this process repeats as many times as necessary until the query is destroyed or the user fetches the results.

June 16, 2020

To start with

Reworking queries was a long process, though not nearly as long as getting xfb completely working. I’ll be going over it in more general terms since the amount of code changed is substantial, and it’s a combination of additions and deletions which makes for difficult reading in diff format on a blog post.

The first thing I started looking at was getting vkCmdResetQueryPool out of zink_begin_query. The obvious choice was to move it to zink_create_query; the pool should reset all its queries upon creation so that they can be used immediately. It’s important to remember that this command can’t be called from within a render pass, which means zink_batch_no_rp must be used here.

The only other place that reset is required is now at the time when a query ends in a pool, as each pool is responsible for only a single type of query.

Fences

Now up to this point, queries were sort of just fired off and forgotten, which meant that there was no way to know whether they had completed. This was going to be a problem, as it’s against Vulkan spec to destroy an in-progress query, which means they must be deferred. While violating some parts of the spec might be harmless, this part is not: destroying an in-progress xfb query actually crashes the Intel ANV driver.

To solve this, queries must be attached first to a given struct zink_batch so it becomes possible to know which command buffer they’re running in. With this relation established, a fence finishing for a batch can iterate over the list of active queries for that batch, marking them as having completed. This enables query destruction to be deferred when necessary to avoid violating spec.
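
A hedged sketch of that fence-completion walk; the field and helper names here are assumptions for illustration, not the exact zink code:

/* sketch: when a batch's fence signals, mark its queries as complete and
 * carry out any destruction that had to be deferred */
static void
fence_finish_queries(struct zink_batch *batch)
{
   list_for_each_entry_safe(struct zink_query, query, &batch->active_queries, batch_list) {
      query->batch_completed = true;
      list_del(&query->batch_list);
      if (query->dead)
         destroy_query(query);
   }
}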

Unfortunately, xfb queries allocate specific resources in the underlying driver, and attempting to defer these types of queries can result in those resources being freed by other means, so this requires Zink to block on the corresponding batch’s fence whenever deleting an active xfb query. The queries do know when they’re active now, however, so at least there’s no need to block unnecessarily.

And then also

One of the downsides of the existing query implementation is that starting and stopping a query repeatedly without fetching the results resets the pool and discards any fetched results. Now that the reset occurs on pool creation, this is no longer the case, and the only time that results are discarded is when the pool overflows on query termination. This in itself is a questionable improvement over the existing behavior of assert()ing, but it’s a step in the right direction.

June 15, 2020

I spent the last week investigating the cause of two problems between VKMS and IGT that I faced and reported during the development phase of my GSoC project proposal. One of the issues was a weird behavior, which I described as unstable, in the sequential execution of kms_cursor_crc subtests, or when running the same subtest twice in a row: a subtest that passed in the first run failed in the second and returned to success in the third (and so on).

At first, it was difficult to determine where the error was because:

  1. I had a very superficial comprehension (I still have it, but a little less) of how an IGT test and its subtests are implemented.
  2. The file used by the test to write and verify the data had its access blocked soon after the execution of a subtest.
  3. Currently, when vkms is the driver used, only two subtests of kms_cursor_crc pass.

A previous task that helped me a lot in this investigative process was stopping to read and study the test code to understand its structure and the steps taken during the execution. My mentor guided me to do this study, and I published an initial version of this anatomy in my previous post. With that, I was able to do some checks to evaluate what was leaving the file created by igt_debugfs blocked, which ended up also solving the problem of sequential execution.

Waiting for Vblank

I describe below how I reached the idea of adding a call to wait for a vblank, which solved the problem mentioned above. I verified in some other scenarios that adding this call does not cause regressions, but I still don’t have a good understanding of, or confidence in, why this call is only necessary for VKMS.

Timed out: Opening crc fd, and poll for first CRC

For checking this issue, you have to enable VKMS and run (for example) the subtest alpha-opaque twice:

sudo IGT_FORCE_DRIVER=vkms build/tests/kms_cursor_crc --run-subtest pipe-A-cursor-alpha-opaque

You will see the subtest succeed in the first execution and fail in the second. The debug output will report a timeout on “Opening crc fd, and poll for first CRC”. From this report, I guessed that something previously allocated was not released, i.e., that a kind of “free” was lacking.

  1. Checking the code, much of the allocation and release of resources on each round of a subtest happens in the functions prepare_crtc and cleanup_crtc:
static void run_test(data_t *data, void (*testfunc)(data_t *), int cursor_w, int cursor_h)
{
	prepare_crtc(data, data->output, cursor_w, cursor_h);
	testfunc(data);
	cleanup_crtc(data);
}

From this, I changed the code to skip the call to testfunc(data) - the function that calls a specific subtest - because I wanted to check whether the problem was limited to prepare_crtc/cleanup_crtc or would appear inside any subtest. I validated that the problem was inside the prepare/cleanup operations, since the issue still happened even when no specific subtest was run.

More “freedom”

I checked what was allocated in prepare_crtc so I could ‘mirror’ it when cleaning up. I also took a look at other crc tests to see how they set up and clean up things. I was partially successful when I added a pre-check before the creation of pipe_crc, releasing pipe_crc if it had not been freed at that moment.

	/* create the pipe_crc object for this pipe */
	if(data->pipe_crc)
		igt_pipe_crc_free(data->pipe_crc);
	data->pipe_crc = igt_pipe_crc_new(data->drm_fd, data->pipe,
					  INTEL_PIPE_CRC_SOURCE_AUTO);

With this, the sequential execution of subtests alternated between success and failure (note that no subtest was actually running; the test was only preparing and cleaning up). I also tried to complement the cleanup function by releasing more things, but I didn’t see any improvement.

Waiting forever

I observed that something was generating an infinite busy wait on the debugfs file related to the fd. Then, I narrowed this busy waiting down to the line “poll(&pfd, 1, -1)” in lib/igt_debugfs.c:igt_pipe_crc_start. Changing -1 to a finite timeout seemed to solve the problem, but it could not be a solution, since it would affect the behaviour of all IGT tests, not only kms_cursor_crc.

Not forever, waiting for VBlank

After reading documentation and looking at other crc-related tests in IGT, I examined the process of pipe crc creation, initialization, crc collection, stopping, and freeing in kms_pipe_crc_basic, and two functions caught my attention: igt_pipe_crc_new_nonblock() and igt_wait_for_vblank(). At this point, I remembered a previous talk I had with Siqueira about how VKMS simulates vblank interrupts. It seemed plausible that VKMS was busy waiting for this vblank interrupt… therefore, as the busy wait happened during the start of pipe_crc, I added a call to igt_wait_for_vblank() before igt_pipe_crc_start(), and that seems to solve the problem… but WHY?

	igt_wait_for_vblank(data->drm_fd, data->pipe);
	igt_pipe_crc_start(data->pipe_crc);

I hope to have the answer in a future blog post.

I asked my mentor for help and took a break to check another problem: why does the subtest alpha-transparent succeed using VKMS after the implementation of XRGB, but show a message of WARNING: Suspicious CRC: All values are 0?

Short post today

I got caught up in some exciting synchronization code today and ran out of time to write a post before my brain started to shut down. As a result, let’s talk briefly about synchronization.

What is synchronization?

Synchronization is the idea that application and gpu states match up. When used properly, it means that when a vertex buffer is thrown into a shader, the buffer contains the expected data. When not used properly, nothing works at all.

Zink and synchronization (zinkronization?)

Zink uses the concept of “batches”, aka struct zink_batch, to represent Vulkan command buffers. Zink throws a stream of commands into a batch, and eventually they get executed by the driver. A batch doesn’t necessarily terminate with a draw command, and not every batch will even contain a draw command.

Batches in Zink are dispatched (flushed) to the gpu based on render passes. Some commands must be made inside a render pass, and some must be made outside. When that boundary is reached, and the next command must be inside/outside a render pass while the current batch is outside/inside a render pass, the current batch is flushed and the next batch in the circular batch queue becomes current.
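
As a rough sketch of that boundary handling (zink_batch_no_rp() is a real helper in zink, though the body here is a simplified assumption):

/* sketch: get a batch that's outside a render pass, flushing if needed */
struct zink_batch *
zink_batch_no_rp(struct zink_context *ctx)
{
   struct zink_batch *batch = zink_curr_batch(ctx);
   if (batch->rp) {
      /* the current batch is inside a render pass: flush it and move to
       * the next batch in the circular queue, which starts outside one */
      flush_batch(ctx);
      batch = zink_curr_batch(ctx);
   }
   return batch;
}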

However, just because a batch is flushed doesn’t mean it has been executed.

Fences

A fence is an object in Zink (and other places) which provides a notification when a batch has completed. In the case where, e.g., an application wants to read from an in-flight draw buffer, the current batch is flushed, then a fence is waited on to determine when the pending draw commands are complete. This is synchronization for command buffers.
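
In raw Vulkan terms, the mechanism boils down to something like this sketch (zink wraps it in its own fence object; device, queue, and cmdbuf here are assumed handles):

/* sketch: attach a fence to a submission, then block until it signals */
VkFence fence;
VkFenceCreateInfo fci = { VK_STRUCTURE_TYPE_FENCE_CREATE_INFO };
vkCreateFence(device, &fci, NULL, &fence);

VkSubmitInfo si = { VK_STRUCTURE_TYPE_SUBMIT_INFO };
si.commandBufferCount = 1;
si.pCommandBuffers = &cmdbuf;
vkQueueSubmit(queue, 1, &si, fence);

/* returns once the batch has actually executed on the gpu */
vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);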

Barriers

Barriers are synchronization for memory access. They allow fine-grained control over exactly when a resource transitions between access states. This avoids data races between e.g., a write operation and a read operation occurring in the same command buffer, and it enables the user of the Vulkan API to be explicit about usage in order to maximize performance.
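
For illustration, here's a hedged sketch of a buffer barrier in raw Vulkan; the access masks, stages, and handles are example values, not code from Zink:

/* sketch: make a transfer write visible to a subsequent vertex-shader read */
VkBufferMemoryBarrier bmb = { VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER };
bmb.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
bmb.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
bmb.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
bmb.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
bmb.buffer = buffer;
bmb.size = VK_WHOLE_SIZE;
vkCmdPipelineBarrier(cmdbuf,
                     VK_PIPELINE_STAGE_TRANSFER_BIT,
                     VK_PIPELINE_STAGE_VERTEX_SHADER_BIT,
                     0, 0, NULL, 1, &bmb, 0, NULL);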

Query synchronization

A big issue with the existing query handling, as was previously discussed, relates to synchronization. The queries need to be able to keep the underlying data that they’re fetching from the driver in sync with the application’s idea of what data the query is fetching. This needs to remain consistent across batch flushes and query pool overflows.

More on this to come.

June 12, 2020

The next problem

With xfb defeated, it’s time to move on to the next big issue: improving query handling.

The existing implementation of queries in Zink has a few problems:

  • They’re limited to 100 queries (50 for xfb); a query’s results must be fetched before that limit is reached to avoid hitting an assert()
  • There’s a number of issues related to API correctness, so running in a validator will generate tons of errors (though mostly they’re harmless in the sense that the code still works as expected)

Ideally, queries shouldn’t trigger abort() due to pool depletion, and they shouldn’t trigger validator errors either, so this needs some work.

Step one: understanding the problems

Let’s check out some of the existing code.

static bool
zink_begin_query(struct pipe_context *pctx,
                 struct pipe_query *q)
{
   struct zink_context *ctx = zink_context(pctx);
   struct zink_query *query = (struct zink_query *)q;

   /* ignore begin_query for timestamps */
   if (query->type == PIPE_QUERY_TIMESTAMP)
      return true;

   /* TODO: resetting on begin isn't ideal, as it forces render-pass exit...
    * should instead reset on creation (if possible?)... Or perhaps maintain
    * the pool in the batch instead?
    */
   struct zink_batch *batch = zink_batch_no_rp(zink_context(pctx));
   vkCmdResetQueryPool(batch->cmdbuf, query->query_pool, 0, MIN2(query->curr_query + 1, query->num_queries));
   query->curr_query = 0;

   begin_query(batch, query);
   list_addtail(&query->active_list, &ctx->active_queries);

   return true;
}

Here’s a function that’s called any time a new query is begun by an application.

Notice that vkCmdResetQueryPool is being called here rather than alongside vkCreateQueryPool. According to the Vulkan spec, each query must be reset before being used (which is why the call is here), but with this usage, any time a query is stopped without its results being returned, those results are lost because the query has been reset.

Also, as noted in the comment, because there’s a reset here, the current renderpass gets flushed in order to comply with Vulkan spec, which requires that reset be called outside of a renderpass. This causes a slowdown, which isn’t optimal.

static void
end_query(struct zink_batch *batch, struct zink_query *q)
{
   assert(q->type != PIPE_QUERY_TIMESTAMP);
   vkCmdEndQuery(batch->cmdbuf, q->query_pool, q->curr_query);
   if (++q->curr_query == q->num_queries) {
      assert(0);
      /* need to reset pool! */
   }
}

This is a function that’s called any time a query is ended (more on the multiple meanings of this in a future post) either internally or by an application.

The idea here is that the query is ended, so the “current” query id is incremented in preparation for the next query, as all queries are tied to an id value. If the incremented query id reaches the size of the query pool, then the code triggers an assert() since vkCmdResetQueryPool needs to be emitted, but that can’t happen without discarding the existing query results (as already happens above).

Step two: devising solutions

Ideally, what needs to happen here is:

  • Handle pool overflows by triggering a reset without discarding existing query results
  • Move reset commands out of zink_begin_query()

This turns out to be significantly more difficult than it sounds due to a number of other constraints in the driver, however. Stay tuned for more query-related problem solving!

June 11, 2020

More Fixes

Yesterday I covered the problems related to handling gl_PointSize during the SPIR-V conversion. But there were still more problems to overcome.

Packed outputs are a thing. This is the case when a variable is partially captured by xfb: e.g., when only the x coordinate is selected from a vec4 output, it will be packed into the buffer as a single float. The original code assumed that all captured output types would match the original type, which is clearly not the case in the previously-described scenario.

A lot went into handling this case. Let’s jump into some of the code.

Critical terminology for this post:

  • SpvId - an id in SPIR-V which represents some other value, as denoted by <id> in the SPIR-V spec

Improved variable creation

The first step here was to figure out what the heck the resulting output type would be. Having remapped the struct pipe_stream_output::register_index values, it’s now possible to check the original variable types and use that when creating the xfb output.

/* return the intended xfb output vec type based on base type and vector size */
static SpvId
get_output_type(struct ntv_context *ctx, unsigned register_index, unsigned num_components)
{
   const struct glsl_type *out_type = ctx->so_output_gl_types[register_index];
   enum glsl_base_type base_type = glsl_get_base_type(out_type);
   if (base_type == GLSL_TYPE_ARRAY)
      base_type = glsl_get_base_type(glsl_without_array(out_type));

   switch (base_type) {
   case GLSL_TYPE_BOOL:
      return get_bvec_type(ctx, num_components);

   case GLSL_TYPE_FLOAT:
      return get_fvec_type(ctx, 32, num_components);

   case GLSL_TYPE_INT:
      return get_ivec_type(ctx, 32, num_components);

   case GLSL_TYPE_UINT:
      return get_uvec_type(ctx, 32, num_components);

   default:
      break;
   }
   unreachable("unknown type");
   return 0;
}

/* for streamout create new outputs, as streamout can be done on individual components,
   from complete outputs, so we just can't use the created packed outputs */
static void
emit_so_info(struct ntv_context *ctx, unsigned max_output_location,
             const struct pipe_stream_output_info *so_info, struct pipe_stream_output_info *local_so_info)
{
   for (unsigned i = 0; i < local_so_info->num_outputs; i++) {
      struct pipe_stream_output so_output = local_so_info->output[i];
      SpvId out_type = get_output_type(ctx, so_output.register_index, so_output.num_components);
      SpvId pointer_type = spirv_builder_type_pointer(&ctx->builder,
                                                      SpvStorageClassOutput,
                                                      out_type);
      SpvId var_id = spirv_builder_emit_var(&ctx->builder, pointer_type,
                                            SpvStorageClassOutput);
      char name[10];

      snprintf(name, 10, "xfb%d", i);
      spirv_builder_emit_name(&ctx->builder, var_id, name);
      spirv_builder_emit_offset(&ctx->builder, var_id, (so_output.dst_offset * 4));
      spirv_builder_emit_xfb_buffer(&ctx->builder, var_id, so_output.output_buffer);
      spirv_builder_emit_xfb_stride(&ctx->builder, var_id, so_info->stride[so_output.output_buffer] * 4);

      /* output location is incremented by VARYING_SLOT_VAR0 for non-builtins in vtn
       */
      uint32_t location = so_info->output[i].register_index;
      spirv_builder_emit_location(&ctx->builder, var_id, location);

      /* note: gl_ClipDistance[4] can be the 0-indexed member of VARYING_SLOT_CLIP_DIST1 here,
       * so this is still the 0 component
       */
      if (so_output.start_component)
         spirv_builder_emit_component(&ctx->builder, var_id, so_output.start_component);

      uint32_t *key = ralloc_size(NULL, sizeof(uint32_t));
      *key = (uint32_t)so_output.register_index << 2 | so_output.start_component;
      _mesa_hash_table_insert(ctx->so_outputs, key, (void *)(intptr_t)var_id);

      assert(ctx->num_entry_ifaces < ARRAY_SIZE(ctx->entry_ifaces));
      ctx->entry_ifaces[ctx->num_entry_ifaces++] = var_id;
   }
}

Here’s the slightly changed code for emit_so_info() along with a new helper function. Note that there’s now so_info and local_so_info being passed here: the former is the gallium-produced streamout info, and the latter is the one that’s been re-translated back to enum gl_varying_slot using the code from yesterday’s post.

The remapped value is passed into the helper function, which retrieves the previously-stored struct glsl_type and then returns an SpvId for the necessary xfb output type, which is what’s now used to create the variable.

Improved variable value emission

Now that the variables are correctly created, it’s important to ensure that the correct value is being emitted.

static void
emit_so_outputs(struct ntv_context *ctx,
                const struct pipe_stream_output_info *so_info, struct pipe_stream_output_info *local_so_info)
{
   SpvId loaded_outputs[VARYING_SLOT_MAX] = {};
   for (unsigned i = 0; i < local_so_info->num_outputs; i++) {
      uint32_t components[NIR_MAX_VEC_COMPONENTS];
      struct pipe_stream_output so_output = local_so_info->output[i];
      uint32_t so_key = (uint32_t) so_output.register_index << 2 | so_output.start_component;
      struct hash_entry *he = _mesa_hash_table_search(ctx->so_outputs, &so_key);
      assert(he);
      SpvId so_output_var_id = (SpvId)(intptr_t)he->data;

      SpvId type = get_output_type(ctx, so_output.register_index, so_output.num_components);
      SpvId output = ctx->outputs[so_output.register_index];
      SpvId output_type = ctx->so_output_types[so_output.register_index];
      const struct glsl_type *out_type = ctx->so_output_gl_types[so_output.register_index];

      if (!loaded_outputs[so_output.register_index])
         loaded_outputs[so_output.register_index] = spirv_builder_emit_load(&ctx->builder, output_type, output);
      SpvId src = loaded_outputs[so_output.register_index];

      SpvId result;

      for (unsigned c = 0; c < so_output.num_components; c++) {
         components[c] = so_output.start_component + c;
         /* this is the second half of a 2 * vec4 array */
         if (ctx->stage == MESA_SHADER_VERTEX && so_output.register_index == VARYING_SLOT_CLIP_DIST1)
            components[c] += 4;
      }

Taking a short break here, a lot has changed. There’s now code for getting both the original type of the output as well as the xfb output type, and special handling has been added for gl_ClipDistance, which is potentially a vec8 that’s represented in memory as two vec4 values spanning two separate varying slots.

This takes care of loading the value for the variable corresponding to the xfb output as well as building an array of the components that are going to be emitted in the xfb output.

Now let’s get to the ugly stuff:

      /* if we're emitting a scalar or the type we're emitting matches the output's original type and we're
       * emitting the same number of components, then we can skip any sort of conversion here
       */
      if (glsl_type_is_scalar(out_type) || (type == output_type && glsl_get_length(out_type) == so_output.num_components))
         result = src;
      else {
         if (ctx->stage == MESA_SHADER_VERTEX && so_output.register_index == VARYING_SLOT_POS) {
            /* gl_Position was modified by nir_lower_clip_halfz, so we need to reverse that for streamout here:
             * 
             * opengl gl_Position.z = (vulkan gl_Position.z * 2.0) - vulkan gl_Position.w
             *
             * to do this, we extract the z and w components, perform the multiply and subtract ops, then reinsert
             */
            uint32_t z_component[] = {2};
            uint32_t w_component[] = {3};
            SpvId ftype = spirv_builder_type_float(&ctx->builder, 32);
            SpvId z = spirv_builder_emit_composite_extract(&ctx->builder, ftype, src, z_component, 1);
            SpvId w = spirv_builder_emit_composite_extract(&ctx->builder, ftype, src, w_component, 1);
            SpvId new_z = emit_binop(ctx, SpvOpFMul, ftype, z, spirv_builder_const_float(&ctx->builder, 32, 2.0));
            new_z = emit_binop(ctx, SpvOpFSub, ftype, new_z, w);
            src = spirv_builder_emit_vector_insert(&ctx->builder, type, src, new_z, 2);
         }
         /* OpCompositeExtract can only extract scalars for our use here */
         if (so_output.num_components == 1) {
            result = spirv_builder_emit_composite_extract(&ctx->builder, type, src, components, so_output.num_components);
         } else if (glsl_type_is_vector(out_type)) {
            /* OpVectorShuffle can select vector members into a differently-sized vector */
            result = spirv_builder_emit_vector_shuffle(&ctx->builder, type,
                                                             src, src,
                                                             components, so_output.num_components);
            result = emit_unop(ctx, SpvOpBitcast, type, result);
         } else {
             /* for arrays, we need to manually extract each desired member
              * and re-pack them into the desired output type
              */
             for (unsigned c = 0; c < so_output.num_components; c++) {
                uint32_t member[] = { so_output.start_component + c };
                SpvId base_type = get_glsl_type(ctx, glsl_without_array(out_type));

                if (ctx->stage == MESA_SHADER_VERTEX && so_output.register_index == VARYING_SLOT_CLIP_DIST1)
                   member[0] += 4;
                components[c] = spirv_builder_emit_composite_extract(&ctx->builder, base_type, src, member, 1);
             }
             result = spirv_builder_emit_composite_construct(&ctx->builder, type, components, so_output.num_components);
         }
      }

      spirv_builder_emit_store(&ctx->builder, so_output_var_id, result);
   }
}

There are five blocks here:

  • Handling for cases of either a scalar value or a vec/array type containing the same total number of components that are being output to xfb
  • Handling for gl_Position, which, as was previously discussed, has already been converted to Vulkan coordinates, and so now the value of this variable needs to be un-converted so the expected value can be read back
  • Handling for the case of extracting a single component from a vector/array, which can be done using a single OpCompositeExtract
  • Handling for extracting a sequence of components from a vector, which is the OpVectorShuffle from the original implementation
  • Handling for extracting a sequence of components from an array, which requires manually extracting each desired array element and then re-assembling the resulting component array into the output

And now…

Well, the tests all pass, so maybe this will be the end of it.

Just kidding.

June 10, 2020

There’s always more tests

Helpfully, mesa has a suite of very demanding unit tests, aka piglit, which work great for finding all manner of issues with drivers. While the code from the previous posts handled a number of tests, it turned out that there were still a ton of failing tests.

What happened?

A number of things. The biggest culprits were:

  • a nir bug related to gl_PointSize; Zink (aka Vulkan) has no native point-size variable, so one has to be injected in order to draw points successfully. The problem here was that the variable was being injected unconditionally, which sometimes resulted in two gl_PointSize outputs, breaking handling of outputs in xfb.
  • another nir bug in which xfb shader variables were being optimized such that packing them into the output buffers correctly was failing.

Next up, issues mapping output values back to the correct xfb buffer in SPIR-V conversion. The problem in this case is that gallium translates struct nir_shader::info.outputs_written (a value comprised of bitflags corresponding to the enum gl_varying_slot outputs) to a 0-indexed value as struct pipe_stream_output::register_index, when what’s actually needed is enum gl_varying_slot, since that’s what passes through the output-emitting function in struct nir_variable::data.location. To fix this, some fixup is done on the local copy of struct pipe_stream_output_info that gets stored onto the struct zink_shader that represents the vertex shader being used:

/* check for a genuine gl_PointSize output vs one from nir_lower_point_size_mov */
static bool
check_psiz(struct nir_shader *s)
{
   nir_foreach_variable(var, &s->outputs) {
      if (var->data.location == VARYING_SLOT_PSIZ) {
         /* genuine PSIZ outputs will have this set */
         return !!var->data.explicit_location;
      }
   }
   return false;
}

/* semi-copied from iris */
static void
update_so_info(struct pipe_stream_output_info *so_info,
               uint64_t outputs_written, bool have_psiz)
{
   uint8_t reverse_map[64] = {};
   unsigned slot = 0;
   while (outputs_written) {
      int bit = u_bit_scan64(&outputs_written);
      /* PSIZ from nir_lower_point_size_mov breaks stream output, so always skip it */
      if (bit == VARYING_SLOT_PSIZ && !have_psiz)
         continue;
      reverse_map[slot++] = bit;
   }

   for (unsigned i = 0; i < so_info->num_outputs; i++) {
      struct pipe_stream_output *output = &so_info->output[i];

      /* Map Gallium's condensed "slots" back to real VARYING_SLOT_* enums */
      output->register_index = reverse_map[output->register_index];
   }
}

In these excerpts from zink_compile_nir(), the shader’s variables are scanned for a genuine gl_PointSize variable originating from the shader instead of NIR, then that knowledge can be applied to skip over the faked PSIZ output when rewriting the register_index values.

But this was only the start

Indeed, considerably more work was required to handle the rest of the tests, as there were failures related to packed output buffers and type conversions. It’s a lot of code to go over, and it merits its own post.

June 09, 2020

Just today, a status update of the Vulkan effort for the Raspberry Pi 4 has been published, including the news that we are moving the development of the driver to an open repository. As it is really likely that some people will be interested in testing it, even if it is not complete at all, here you can find a quick guide to compile it and get some demos running.

Dependencies

So let’s start installing some dependencies. My personal recipe, that I use every time I configure a new machine to work on mesa is the following one (sorry if some extra unneeded dependencies slipped):

sudo apt-get install libxcb-randr0-dev libxrandr-dev \
        libxcb-xinerama0-dev libxinerama-dev libxcursor-dev \
        libxcb-cursor-dev libxkbcommon-dev xutils-dev \
        xutils-dev libpthread-stubs0-dev libpciaccess-dev \
        libffi-dev x11proto-xext-dev libxcb1-dev libxcb-*dev \
        bison flex libssl-dev libgnutls28-dev x11proto-dri2-dev \
        x11proto-dri3-dev libx11-dev libxcb-glx0-dev \
        libx11-xcb-dev libxext-dev libxdamage-dev libxfixes-dev \
        libva-dev x11proto-randr-dev x11proto-present-dev \
        libclc-dev libelf-dev git build-essential mesa-utils \
        libvulkan-dev ninja-build libvulkan1 python-mako \
        libdrm-dev libxshmfence-dev libxxf86vm-dev \
        python3-mako

Most Raspbian libraries are recent enough, but some of them have been updated during the past months, so just in case, don’t forget to update:

$ sudo apt-get update
$ sudo apt-get upgrade

Additionally, you would need to install meson. Mesa has just recently bumped the meson version needed, so the Raspbian version is not enough. There is the option to build meson from the tarball (meson-0.52.0 here), but by far the easiest way to get a recent meson version is using pip3:

$ pip3 install meson

2020-07-04 update

It seems that some people had problems if they had installed meson with apt-get on their system, as the build would try the older meson version first. Those people were able to fix that by doing this:

$ sudo apt-get remove meson
$ pip3 install --user meson

Download and build v3dv

This is the simpler recipe to build v3dv:

$ git clone https://gitlab.freedesktop.org/apinheiro/mesa.git mesa
$ cd mesa
$ git checkout wip/igalia/v3dv
$ meson --prefix /home/pi/local-install --libdir lib -Dplatforms=x11,drm -Dvulkan-drivers=broadcom -Ddri-drivers= -Dgallium-drivers=v3d,kmsro,vc4 -Dbuildtype=debug _build
$ ninja -C _build
$ ninja -C _build install

This builds and installs a debug version of v3dv in a local directory. You could do a release build instead, or use any other directory. The recipe also builds the OpenGL driver, just in case anyone wants to compare, but if you are only interested in the vulkan driver, that is not mandatory.

Run some Vulkan demos

Now, the easiest way to ensure that a vulkan program finds the driver is setting the following envvar:

export VK_ICD_FILENAMES=/home/pi/local-install/share/vulkan/icd.d/broadcom_icd.armv7l.json

That envvar is used by the Vulkan loader (installed as one of the dependencies listed before) to know which library to load. This also means that you don’t need to use LD_PRELOAD, LD_LIBRARY_PATH or similar.

So which Vulkan programs work? For example, several of the Sascha Willems Vulkan demos. To make things easier for everybody, here’s another quick recipe for how to get them built:

$ sudo apt-get install libassimp-dev
$ git clone --recursive https://github.com/SaschaWillems/Vulkan.git  sascha-willems
$ cd sascha-willems
$ mkdir build; cd build
$ cmake -DCMAKE_BUILD_TYPE=Debug  ..
$ make
$ cd bin
$ ./gears

If everything went well, doing that would get this familiar image:

If you want a somewhat more eye-candy demo, you will need to download some assets. So:

$ cd ../..
$ python3 download_assets.py
$ cd build/bin
$./scenerendering

As mentioned, not all the demos work. But here is a list of some that we tested and that seem to work:
* distancefieldfonts
* descriptorsets
* dynamicuniformbuffer
* gears
* gltfscene
* imgui
* indirectdraw
* occlusionquery
* parallaxmapping
* pbrbasic
* pbribl
* pbrtexture
* pushconstants
* scenerendering
* shadowmapping
* shadowmappingcascade
* specializationconstants
* sphericalenvmapping
* stencilbuffer
* textoverlay
* texture
* texture3d
* texturecubemap
* triangle
* vulkanscene

Update: rpiMike in the comments, and some people privately, have pointed out some errors in the post. Thanks! And sorry for the inconvenience.

Update 2: Mike Hooper has pointed out more issues on gitlab.

What are extensions?

Extensions are non-core parts of a spec which provide additional features. xfb is an extension in Vulkan. In order to use it, a number of small changes are required.

What are features?

Features in Vulkan inform an application (or Zink) what the underlying driver supports. In order to enable and use an extension, both the extension and feature must be enabled.

Step 1: extension detection

To begin, let’s check out the significant parts of zink_screen.c that got modified, all in zink_internal_create_screen(), which creates a struct zink_screen* object, a subclass of struct pipe_screen*.

VkExtensionProperties *extensions;
vkEnumerateDeviceExtensionProperties(screen->pdev, NULL,
                                     &num_extensions, extensions);

for (uint32_t  i = 0; i < num_extensions; ++i) {
   if (!strcmp(extensions[i].extensionName,
               VK_EXT_TRANSFORM_FEEDBACK_EXTENSION_NAME))
      have_tf_ext = true;

}

There’s already some code doing this in Zink, so it’s a simple case of plugging in another strcmp to check for VK_EXT_TRANSFORM_FEEDBACK_EXTENSION_NAME.

Step 2: feature detection

VkPhysicalDeviceFeatures2 feats = {};
VkPhysicalDeviceTransformFeedbackFeaturesEXT tf_feats = {};

feats.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FEATURES_2;
if (have_tf_ext) {
   tf_feats.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_TRANSFORM_FEEDBACK_FEATURES_EXT;
   tf_feats.pNext = feats.pNext;
   feats.pNext = &tf_feats;
}
vkGetPhysicalDeviceFeatures2(screen->pdev, &feats);

Again, there’s already some code in Zink to handle feature detection, so this just requires plugging in the xfb feature parts.

Step 3: property checking

In addition to extensions and features, there’s also properties, which provide things like device-specific limits for various capabilities that need to be checked in order to avoid requesting resources that the gpu hardware can’t provide.

if (have_tf_ext && tf_feats.transformFeedback)
   screen->have_EXT_transform_feedback = true;

VkPhysicalDeviceProperties2 props = {};
props.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
if (screen->have_EXT_transform_feedback) {
   screen->tf_props.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_TRANSFORM_FEEDBACK_PROPERTIES_EXT;
   screen->tf_props.pNext = NULL;
   props.pNext = &screen->tf_props;
}
vkGetPhysicalDeviceProperties2(screen->pdev, &props);

This was another easy plug-in, since it’s just the xfb parts that need to be added.

Finishing touches

That’s more or less it. The xfb extension name needs to get added into VkDeviceCreateInfo::ppEnabledExtensionNames when it’s passed to vkCreateDevice() a bit later, but xfb is now fully activated if the driver supports it.
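
A rough sketch of that wiring (simplified from the general pattern in zink_screen.c; the queue-family field name on the screen is an assumption):

const char *enabled_exts[8];
uint32_t num_exts = 0;
if (screen->have_EXT_transform_feedback)
   enabled_exts[num_exts++] = VK_EXT_TRANSFORM_FEEDBACK_EXTENSION_NAME;

float priority = 1.0f;
VkDeviceQueueCreateInfo qci = { VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO };
qci.queueFamilyIndex = screen->gfx_queue; /* assumed field name */
qci.queueCount = 1;
qci.pQueuePriorities = &priority;

VkDeviceCreateInfo dci = { VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO };
dci.queueCreateInfoCount = 1;
dci.pQueueCreateInfos = &qci;
dci.ppEnabledExtensionNames = enabled_exts;
dci.enabledExtensionCount = num_exts;
vkCreateDevice(screen->pdev, &dci, NULL, &screen->dev);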

Lastly, the relevant enum pipe_cap members need to be handled in zink_get_param() so that gallium recognizes the newly-activated capabilities:

case PIPE_CAP_MAX_STREAM_OUTPUT_BUFFERS:
   return screen->have_EXT_transform_feedback ? screen->tf_props.maxTransformFeedbackBuffers : 0;
case PIPE_CAP_STREAM_OUTPUT_PAUSE_RESUME:
case PIPE_CAP_STREAM_OUTPUT_INTERLEAVE_BUFFERS:
   return 1;

And that’s it

Everything worked perfectly on the first try, and there were absolutely no issues whatsoever, because working on mesa is just that easy.

June 08, 2020

Queries: What are they?

A query in mesa is where an application asks for info from the underlying driver. There’s a number of APIs related to this, but right now only the xfb related ones matter. Specifically, GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN and GL_PRIMITIVES_GENERATED.

  • GL_PRIMITIVES_GENERATED is a query that, when active, tracks the total number of primitives generated.
  • GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN is a query that, when active, tracks the number of primitives written.

The difference between the two is that not all primitives generated are written; for example, primitives stop being written once the bound xfb buffer runs out of space. In Vulkan, these translate to:

  • GL_PRIMITIVES_GENERATED = VK_QUERY_PIPELINE_STATISTIC_INPUT_ASSEMBLY_PRIMITIVES_BIT
  • GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN = VK_QUERY_TYPE_TRANSFORM_FEEDBACK_STREAM_EXT

It’s important to note that these are very different enum namespaces; one is for pipeline statistics (VK_QUERY_TYPE_PIPELINE_STATISTICS), and one is a regular query type. Also, GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN will be internally translated to PIPE_QUERY_PRIMITIVES_EMITTED, which has a very different name, so that’s good to keep in mind as well.

What does this look like?

Let’s check out some of the important bits. In Zink, starting a query now ends up looking like this, where we use the extension function for starting a query of the xfb query type and the regular vkCmdBeginQuery for all other queries.

if (q->vkqtype == VK_QUERY_TYPE_TRANSFORM_FEEDBACK_STREAM_EXT)
   zink_screen(ctx->base.screen)->vk_CmdBeginQueryIndexedEXT(batch->cmdbuf,
                                                             q->query_pool,
                                                             q->curr_query,
                                                             flags,
                                                             q->index);
else
   vkCmdBeginQuery(batch->cmdbuf, q->query_pool, q->curr_query, flags);
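
As an aside, vk_CmdBeginQueryIndexedEXT here is a function pointer stored on the screen, since extension entrypoints aren’t exported directly by the Vulkan loader. A minimal sketch of fetching it (the real Zink code wraps this up in a helper, so the exact form is illustrative):

screen->vk_CmdBeginQueryIndexedEXT =
   (PFN_vkCmdBeginQueryIndexedEXT)vkGetDeviceProcAddr(screen->dev,
                                                      "vkCmdBeginQueryIndexedEXT");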

Stopping queries is now very similar:

if (q->vkqtype == VK_QUERY_TYPE_TRANSFORM_FEEDBACK_STREAM_EXT)
   screen->vk_CmdEndQueryIndexedEXT(batch->cmdbuf, q->query_pool, q->curr_query, q->index);
else
   vkCmdEndQuery(batch->cmdbuf, q->query_pool, q->curr_query);

Then finally we have the parts for fetching the returned query data:

int num_results;
if (query->vkqtype == VK_QUERY_TYPE_TRANSFORM_FEEDBACK_STREAM_EXT) {
   /* this query emits 2 values */
   assert(query->curr_query <= ARRAY_SIZE(results) / 2);
   num_results = query->curr_query * 2;
   VkResult status = vkGetQueryPoolResults(screen->dev, query->query_pool,
                                           0, query->curr_query,
                                           sizeof(results),
                                           results,
                                           sizeof(uint64_t),
                                           flags);
   if (status != VK_SUCCESS)
      return false;
} else {
   assert(query->curr_query <= ARRAY_SIZE(results));
   num_results = query->curr_query;
   VkResult status = vkGetQueryPoolResults(screen->dev, query->query_pool,
                                           0, query->curr_query,
                                           sizeof(results),
                                           results,
                                           sizeof(uint64_t),
                                           flags);
   if (status != VK_SUCCESS)
      return false;
}

In this block, there’s once again a split between VK_QUERY_TYPE_TRANSFORM_FEEDBACK_STREAM_EXT results and all other query results. In this case, however, the reason is that these queries return two result values, which means the buffer is treated a bit differently.
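
To make that concrete, here’s a hedged sketch of what the results buffer holds for two completed queries; the two member names come from the spec quote in the loop below:

/* xfb query pool (2 values per query):
 *   results[0] = numPrimitivesWritten for query 0
 *   results[1] = numPrimitivesNeeded  for query 0
 *   results[2] = numPrimitivesWritten for query 1
 *   results[3] = numPrimitivesNeeded  for query 1
 * any other query pool (1 value per query):
 *   results[i] = result of query i
 */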

for (int i = 0; i < num_results; ++i) {
   switch (query->type) {
   case PIPE_QUERY_PRIMITIVES_GENERATED:
      result->u32 += results[i];
      break;
   case PIPE_QUERY_PRIMITIVES_EMITTED:
      /* A query pool created with this type will capture 2 integers -
       * numPrimitivesWritten and numPrimitivesNeeded -
       * for the specified vertex stream output from the last vertex processing stage.
       * - from VK_EXT_transform_feedback spec
       */
      result->u64 += results[i];
      i++;
      break;
   }
}

This is where the returned results get looped over. PIPE_QUERY_PRIMITIVES_GENERATED returns a single 32-bit uint, which is added to the result struct using the appropriate member. The xfb query is now PIPE_QUERY_PRIMITIVES_EMITTED in internal types, and it returns a sequence of two 64-bit uints: numPrimitivesWritten and numPrimitivesNeeded. Zink only needs the first value, so it gets added into the struct.

It’s that simple

Compared to the other parts of implementing xfb, this was very simple and straightforward. More or less just translating the GL enum values to VK and mesa, then letting it run its course.
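
As a hedged sketch, the Vulkan side of that translation boils down to something like this (the helper name and the default-case handling are illustrative, not the exact Zink code):

static VkQueryType
convert_query_type(unsigned query_type)
{
   switch (query_type) {
   case PIPE_QUERY_PRIMITIVES_GENERATED:
      /* read back via VK_QUERY_PIPELINE_STATISTIC_INPUT_ASSEMBLY_PRIMITIVES_BIT */
      return VK_QUERY_TYPE_PIPELINE_STATISTICS;
   case PIPE_QUERY_PRIMITIVES_EMITTED:
      return VK_QUERY_TYPE_TRANSFORM_FEEDBACK_STREAM_EXT;
   default:
      unreachable("unexpected query type");
   }
}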

June 05, 2020

Quick Runthrough

This xfb blog series has gone on for a while now, and it’d be great if it ended soon. Unfortunately, there are a lot of corner cases being found by piglit, so the work and fixing continue.

Today let’s look at some of the drawing code for xfb, since that’s probably not going to be changing much in the course of fixing those corner cases.

static void
zink_emit_stream_output_targets(struct pipe_context *pctx)
{
   struct zink_context *ctx = zink_context(pctx);
   struct zink_screen *screen = zink_screen(pctx->screen);
   struct zink_batch *batch = zink_curr_batch(ctx);
   VkBuffer buffers[PIPE_MAX_SO_OUTPUTS];
   VkDeviceSize buffer_offsets[PIPE_MAX_SO_OUTPUTS];
   VkDeviceSize buffer_sizes[PIPE_MAX_SO_OUTPUTS];

   for (unsigned i = 0; i < ctx->num_so_targets; i++) {
      struct zink_so_target *t = (struct zink_so_target *)ctx->so_targets[i];
      buffers[i] = zink_resource(t->base.buffer)->buffer;
      buffer_offsets[i] = t->base.buffer_offset;
      buffer_sizes[i] = t->base.buffer_size;
   }

   screen->vk_CmdBindTransformFeedbackBuffersEXT(batch->cmdbuf, 0, ctx->num_so_targets,
                                                 buffers, buffer_offsets,
                                                 buffer_sizes);
   ctx->dirty_so_targets = false;
}

This is a function called from zink_draw_vbo(), which is the struct pipe_context::draw_vbo hook for drawing primitives. Here, the streamout target buffers are bound in Vulkan in preparation for the upcoming draw, passing along related info into the command buffer.

if (ctx->xfb_barrier) {
   /* Between the pause and resume there needs to be a memory barrier for the counter buffers
    * with a source access of VK_ACCESS_TRANSFORM_FEEDBACK_COUNTER_WRITE_BIT_EXT
    * at pipeline stage VK_PIPELINE_STAGE_TRANSFORM_FEEDBACK_BIT_EXT
    * to a destination access of VK_ACCESS_TRANSFORM_FEEDBACK_COUNTER_READ_BIT_EXT
    * at pipeline stage VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT.
    *
    * - from VK_EXT_transform_feedback spec
    */
   VkBufferMemoryBarrier barriers[PIPE_MAX_SO_OUTPUTS] = {};
   unsigned barrier_count = 0;
   for (unsigned i = 0; i < ctx->num_so_targets; i++) {
      struct zink_so_target *t = zink_so_target(ctx->so_targets[i]);
      if (t->counter_buffer_valid) {
          /* index with barrier_count rather than i so the array stays dense
           * when some targets don't have a valid counter buffer yet */
          barriers[barrier_count].sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
          barriers[barrier_count].srcAccessMask = VK_ACCESS_TRANSFORM_FEEDBACK_COUNTER_WRITE_BIT_EXT;
          barriers[barrier_count].dstAccessMask = VK_ACCESS_TRANSFORM_FEEDBACK_COUNTER_READ_BIT_EXT;
          barriers[barrier_count].buffer = zink_resource(t->counter_buffer)->buffer;
          barriers[barrier_count].size = VK_WHOLE_SIZE;
          barrier_count++;
      }
   }
   batch = zink_batch_no_rp(ctx);
   vkCmdPipelineBarrier(batch->cmdbuf,
      VK_PIPELINE_STAGE_TRANSFORM_FEEDBACK_BIT_EXT,
      VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT,
      0,
      0, NULL,
      barrier_count, barriers,
      0, NULL
   );
   ctx->xfb_barrier = false;
}
if (ctx->dirty_so_targets)
   zink_emit_stream_output_targets(pctx);
if (so_target && so_target->needs_barrier) {
   /* A pipeline barrier is required between using the buffers as
    * transform feedback buffers and vertex buffers to
    * ensure all writes to the transform feedback buffers are visible
    * when the data is read as vertex attributes.
    * The source access is VK_ACCESS_TRANSFORM_FEEDBACK_WRITE_BIT_EXT
    * and the destination access is VK_ACCESS_VERTEX_ATTRIBUTE_READ_BIT
    * for the pipeline stages VK_PIPELINE_STAGE_TRANSFORM_FEEDBACK_BIT_EXT
    * and VK_PIPELINE_STAGE_VERTEX_INPUT_BIT respectively.
    *
    * - 20.3.1. Drawing Transform Feedback
    */
   VkBufferMemoryBarrier barriers[1] = {};
   if (so_target->counter_buffer_valid) {
       barriers[0].sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
        barriers[0].srcAccessMask = VK_ACCESS_TRANSFORM_FEEDBACK_WRITE_BIT_EXT;
       barriers[0].dstAccessMask = VK_ACCESS_VERTEX_ATTRIBUTE_READ_BIT;
       barriers[0].buffer = zink_resource(so_target->base.buffer)->buffer;
       barriers[0].size = VK_WHOLE_SIZE;
   }
   batch = zink_batch_no_rp(ctx);
   zink_batch_reference_resoure(batch, zink_resource(so_target->base.buffer));
   vkCmdPipelineBarrier(batch->cmdbuf,
      VK_PIPELINE_STAGE_TRANSFORM_FEEDBACK_BIT_EXT,
      VK_PIPELINE_STAGE_VERTEX_INPUT_BIT,
      0,
      0, NULL,
      ARRAY_SIZE(barriers), barriers,
      0, NULL
   );
   so_target->needs_barrier = false;
}

This is a block added to zink_draw_vbo() for synchronization of the xfb buffers. The counter buffer needs a specific type of barrier according to the spec, and the streamout target buffer needs a different type of barrier. These need to be emitted outside of a render pass, so zink_batch_no_rp() is used to get a batch that isn’t currently in a render pass (ending the active batch if necessary to switch to a new one). Without these, the Vulkan validation layers will output tons of errors, and your stream output will probably be broken as well.
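
For context, here’s a conceptual sketch of what zink_batch_no_rp() does; the field and helper names below are hypothetical, and the real implementation in the Zink context code differs in detail:

struct zink_batch *
zink_batch_no_rp(struct zink_context *ctx)
{
   struct zink_batch *batch = zink_curr_batch(ctx);
   if (batch->in_render_pass) {    /* hypothetical flag */
      /* flushing ends the render pass and makes a fresh batch current */
      flush_batch(ctx);            /* hypothetical helper */
      batch = zink_curr_batch(ctx);
   }
   return batch;
}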

   if (ctx->num_so_targets) {
      for (unsigned i = 0; i < ctx->num_so_targets; i++) {
         struct zink_so_target *t = zink_so_target(ctx->so_targets[i]);
         if (t->counter_buffer_valid) {
            zink_batch_reference_resoure(batch, zink_resource(t->counter_buffer));
            counter_buffers[i] = zink_resource(t->counter_buffer)->buffer;
            counter_buffer_offsets[i] = t->counter_buffer_offset;
         } else
            counter_buffers[i] = NULL;
         t->needs_barrier = true;
      }
      screen->vk_CmdBeginTransformFeedbackEXT(batch->cmdbuf, 0, ctx->num_so_targets, counter_buffers, counter_buffer_offsets);
   }

/* existing code */
   if (dinfo->index_size > 0) {
      assert(dinfo->index_size != 1);
      VkIndexType index_type = dinfo->index_size == 2 ? VK_INDEX_TYPE_UINT16 : VK_INDEX_TYPE_UINT32;
      struct zink_resource *res = zink_resource(index_buffer);
      vkCmdBindIndexBuffer(batch->cmdbuf, res->buffer, index_offset, index_type);
      zink_batch_reference_resoure(batch, res);
      vkCmdDrawIndexed(batch->cmdbuf,
         dinfo->count, dinfo->instance_count,
         dinfo->start, dinfo->index_bias, dinfo->start_instance);
   } else {
/* new code */
      if (so_target && screen->tf_props.transformFeedbackDraw) {
         zink_batch_reference_resoure(batch, zink_resource(so_target->counter_buffer));
         screen->vk_CmdDrawIndirectByteCountEXT(batch->cmdbuf, dinfo->instance_count, dinfo->start_instance,
                                       zink_resource(so_target->counter_buffer)->buffer, so_target->counter_buffer_offset, 0,
                                       MIN2(so_target->stride, screen->tf_props.maxTransformFeedbackBufferDataStride));
      }
      else
         vkCmdDraw(batch->cmdbuf, dinfo->count, dinfo->instance_count, dinfo->start, dinfo->start_instance);
   }

   if (dinfo->index_size > 0 && dinfo->has_user_indices)
      pipe_resource_reference(&index_buffer, NULL);

   if (ctx->num_so_targets) {
      for (unsigned i = 0; i < ctx->num_so_targets; i++) {
         struct zink_so_target *t = zink_so_target(ctx->so_targets[i]);
         counter_buffers[i] = zink_resource(t->counter_buffer)->buffer;
         counter_buffer_offsets[i] = t->counter_buffer_offset;
         t->counter_buffer_valid = true;
      }
      screen->vk_CmdEndTransformFeedbackEXT(batch->cmdbuf, 0, ctx->num_so_targets, counter_buffers, counter_buffer_offsets);
   }

Excluding a small block that I’ve added a comment for, this is pretty much all added for handling xfb draws. This includes the begin/end calls for xfb and outputting to the counter buffers for each streamout target, and the actual vkCmdDrawIndirectByteCountEXT call for drawing transform feedback when appropriate.

The begin/end calls handle managing the buffer states to work with glPauseTransformFeedback and glResumeTransformFeedback. When resuming, the counter buffer offset is used to track the state and continue with the buffers from the correct location in memory.

Next time

We’ll look at xfb queries and extension/feature enabling, and I’ll start to get into a bit more detail about how more of this stuff works.

June 04, 2020

To Begin…

Stream output is another name for xfb closer to the driver level. Inside mesa (and gallium), we’ll commonly see different types related to struct pipe_stream_output_* which provide info about how the xfb info is output to its corresponding buffers.

Here’s some blocks from the Zink implementation, referenced heavily from the original work by David Airlie:

static struct pipe_stream_output_target *
zink_create_stream_output_target(struct pipe_context *pctx,
                                 struct pipe_resource *pres,
                                 unsigned buffer_offset,
                                 unsigned buffer_size)
{
   struct zink_so_target *t;
   t = CALLOC_STRUCT(zink_so_target);
   if (!t)
      return NULL;

   t->base.reference.count = 1;
   t->base.context = pctx;
   pipe_resource_reference(&t->base.buffer, pres);
   t->base.buffer_offset = buffer_offset;
   t->base.buffer_size = buffer_size;

   /* using PIPE_BIND_CUSTOM here lets us create a custom pipe buffer resource,
    * which allows us to differentiate and use VK_BUFFER_USAGE_TRANSFORM_FEEDBACK_COUNTER_BUFFER_BIT_EXT
    * as we must for this case
    */
   t->counter_buffer = pipe_buffer_create(pctx->screen, PIPE_BIND_STREAM_OUTPUT | PIPE_BIND_CUSTOM,
                                          PIPE_USAGE_DEFAULT, 4);
   if (!t->counter_buffer) {
      FREE(t);
      return NULL;
   }

   return &t->base;
}

Here we have the create_stream_output_target hook for our struct pipe_context, which is called any time we’re initializing stream output. This takes the previously-created stream output buffer (more on buffer creation in a future post) along with offset and size parameters for the buffer. It’s necessary to create a counter buffer here so that we can use it to correctly save and restore states if xfb is paused or resumed (glPauseTransformFeedback and glResumeTransformFeedback).

static void
zink_stream_output_target_destroy(struct pipe_context *pctx,
                                  struct pipe_stream_output_target *psot)
{
   struct zink_so_target *t = (struct zink_so_target *)psot;
   pipe_resource_reference(&t->counter_buffer, NULL);
   pipe_resource_reference(&t->base.buffer, NULL);
   FREE(t);
}

A simple destructor hook for the previously-created object. struct pipe_resource buffers are refcounted, so we always need to make sure to properly unref them here; the first parameter of pipe_resource_reference can be thought of as the resource to unref, and the second is the resource to ref.
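
A tiny illustration of that idiom (hypothetical usage, not from the Zink source):

struct pipe_resource *buf = NULL;
pipe_resource_reference(&buf, pres);  /* ref: buf now holds a reference to pres */
/* ... use buf ... */
pipe_resource_reference(&buf, NULL);  /* unref: drops the reference and NULLs buf */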

static void
zink_set_stream_output_targets(struct pipe_context *pctx,
                               unsigned num_targets,
                               struct pipe_stream_output_target **targets,
                               const unsigned *offsets)
{
   struct zink_context *ctx = zink_context(pctx);

   if (num_targets == 0) {
      for (unsigned i = 0; i < ctx->num_so_targets; i++)
         pipe_so_target_reference(&ctx->so_targets[i], NULL);
      ctx->num_so_targets = 0;
   } else {
      for (unsigned i = 0; i < num_targets; i++)
         pipe_so_target_reference(&ctx->so_targets[i], targets[i]);
      for (unsigned i = num_targets; i < ctx->num_so_targets; i++)
         pipe_so_target_reference(&ctx->so_targets[i], NULL);
      ctx->num_so_targets = num_targets;

      /* emit memory barrier on next draw for synchronization */
      if (offsets[0] == (unsigned)-1)
         ctx->xfb_barrier = true;
      /* TODO: possibly avoid rebinding on resume if resuming from same buffers? */
      ctx->dirty_so_targets = true;
   }
}

This is the hook that gets called any time xfb is started or stopped, whether from glBeginTransformFeedback / glEndTransformFeedback or glPauseTransformFeedback / glResumeTransformFeedback. It’s used to pass the active stream output buffers to the driver context in preparation for upcoming draw calls. We store and ref++ the buffers on activate, when num_targets is nonzero, and then we ref-- and unset the buffers on deactivate, when num_targets is zero. On glResumeTransformFeedback, offsets[0] is -1 (technically UINT_MAX from underflow).

According to the Vulkan spec, we need to emit a different memory barrier for synchronization if we’re resuming xfb, so a flag is set in that case. Ideally some work can be done here to optimize the case of pausing and resuming the same output buffer set, to avoid needing to do additional synchronization later on in the draw, but for now, slow and steady wins the race.

June 03, 2020

In this post, I describe the steps involved in the execution of a kms_cursor_crc subtest. In my approach, I chose a subtest (pipe-A-cursor-alpha-transparent) as a target and examined the code from the beginning of the test (igt_main) until reaching and executing the target subtest.

This is version zero of this document. I plan to incrementally expand it with evaluations/descriptions of the other subtests. I will probably also need to fix some misunderstandings.

As described by IGT, kms_cursor_crc

Uses the display CRC support to validate cursor plane functionality. The test will position the cursor plane either fully onscreen, partially onscreen, or fully offscreen, using either a fully opaque or fully transparent surface. In each case it then reads the PF CRC and compares it with the CRC value obtained when the cursor plane was disabled.

In the past, Haneen has shown something about the test in a blog post. Fixing any issues in VKMS to make it pass all kms_cursor_crc subtests is also part of my GSoC project proposal.

The struct data_t

This struct is used in all subtests and stores many elements, such as the DRM file descriptor, framebuffer info, cursor size info, etc. Also, before the main function of the test, a static data_t data is declared globally.

The beginning - igt_main

We can divide the main function into two parts: setup DRM stuff (igt_fixture) and subtest execution.

Initial test setup

igt_fixture {
	data.drm_fd = drm_open_driver_master(DRIVER_ANY);
	ret = drmGetCap(data.drm_fd, DRM_CAP_CURSOR_WIDTH, &cursor_width);
	igt_assert(ret == 0 || errno == EINVAL);
	/* Not making use of cursor_height since it is same as width, still reading */
	ret = drmGetCap(data.drm_fd, DRM_CAP_CURSOR_HEIGHT, &cursor_height);
	igt_assert(ret == 0 || errno == EINVAL);

drmGetCap(int fd, uint64_t capability, uint64_t *value) queries a capability of the DRM driver, returning 0 if the capability is supported. DRM_CAP_CURSOR_WIDTH and DRM_CAP_CURSOR_HEIGHT store a valid width/height for the hardware cursor.

/* We assume width and height are same so max is assigned width */
igt_assert_eq(cursor_width, cursor_height);

	kmstest_set_vt_graphics_mode();

void kmstest_set_vt_graphics_mode(void)

From lib/igt_kms.c: This function sets the controlling virtual terminal (VT) into graphics/raw mode and installs an igt exit handler to set the VT back to text mode on exit. All kms tests must call this function to make sure that the framebuffer console doesn’t interfere by e.g. blanking the screen.

    igt_require_pipe_crc(data.drm_fd);

void igt_require_pipe_crc(int fd)

From lib/igt_debugfs: checks whether pipe CRC capturing is supported by the kernel. Uses igt_skip to automatically skip the test/subtest if this isn’t the case.

    igt_display_require(&data.display, data.drm_fd);

void igt_display_require(igt_display_t *display, int drm_fd)

From lib/igt_kms.c: Initializes @display (a pointer to an #igt_display_t structure) and allocates the various resources required. This function automatically skips if the kernel driver doesn’t support any CRTC or outputs.

Run test on pipe

for_each_pipe_static(pipe)
	igt_subtest_group
		run_tests_on_pipe(&data, pipe);

At this point, each subtest of kms_cursor_crc is run, and a pointer to the already set up struct data_t is passed along.

static void run_tests_on_pipe(data_t *data, enum pipe pipe)

This function runs each subtest grouped by pipe. In the setup, it completes the passed data_t struct, and then starts to call each subtest. In this version of the document, I only focus on the subtest test_cursor_transparent.

igt_subtest_f("pipe-%s-cursor-alpha-transparent", kmstest_pipe_name(pipe))
	run_test(data, test_cursor_transparent, data->cursor_max_w, data->cursor_max_h);

The execution of test_cursor_transparent

static void run_test(data_t *data, void (*testfunc)(data_t *), int cursor_w, int cursor_h)

The function run_test wraps the usual preparation for running a subtest and, afterwards, a cleanup. Therefore, it basically has three steps:

  1. Prepare CRTC: static void prepare_crtc(data_t *data, igt_output_t *output, int cursor_w, int cursor_h). This function is responsible for:
    • Select the pipe to be used
    • Create Front and Restore framebuffer of primary plane
    • Find a valid plane type for primary plane and cursor plane
    • Pairs primary framebuffer to its plane and sets a default size
    • Create a new pipe CRC
    • Position the cursor fully visible
    • Store test image as cairo surface
    • Start CRC capture process
  2. Run subtest: testfunc(data) » static void test_cursor_transparent(data_t *data) » test_cursor_alpha(data, 0.0)

    The subtest test_cursor_transparent is a variation of test_cursor_alpha where the alpha channel is set to zero (i.e., transparent). So, let’s take a look at the test_cursor_alpha execution:

static void test_cursor_alpha(data_t *data, double a)
{
	igt_display_t *display = &data->display;
	igt_pipe_crc_t *pipe_crc = data->pipe_crc;
	igt_crc_t crc, ref_crc;
	cairo_t *cr;
	uint32_t fb_id;
	int curw = data->curw;
	int curh = data->curh;
	/*alpha cursor fb*/
	fb_id = igt_create_fb(data->drm_fd, curw, curh,
				    DRM_FORMAT_ARGB8888,
				    LOCAL_DRM_FORMAT_MOD_NONE,
				    &data->fb);

When this subtest starts, it creates the cursor’s framebuffer with the format ARGB8888, i.e., a framebuffer with RGB plus an alpha channel (pay attention to endianness).
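
For reference, a quick sketch of how one DRM_FORMAT_ARGB8888 pixel is packed, assuming a, r, g and b are 8-bit values:

/* [31:24] alpha, [23:16] red, [15:8] green, [7:0] blue,
 * stored little-endian, so the bytes in memory are B, G, R, A */
uint32_t pixel = ((uint32_t)a << 24) | ((uint32_t)r << 16) |
                 ((uint32_t)g << 8)  |  (uint32_t)b;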

	igt_assert(fb_id);
	cr = igt_get_cairo_ctx(data->drm_fd, &data->fb);
	igt_paint_color_alpha(cr, 0, 0, curw, curh, 1.0, 1.0, 1.0, a);
	igt_put_cairo_ctx(data->drm_fd, &data->fb, cr);

Then, the test uses some Cairo resources to create a cairo surface for the cursor’s framebuffer and allocate a drawing context for it, draws a rectangle in white (RGB) with the given opacity (alpha channel), and finally releases the cairo surface and writes the changes out to the framebuffer. (Disclaimer: looking inside the function igt_put_cairo_ctx, I am not sure it is doing what its comments say, and I am also not sure all the parameters are necessary.)

The test is divided into two parts: Hardware test and Software test.

	/*Hardware Test*/
	cursor_enable(data);
	igt_display_commit(display);
	igt_wait_for_vblank(data->drm_fd, data->pipe);
	igt_pipe_crc_get_current(data->drm_fd, pipe_crc, &crc);
	cursor_disable(data);
	igt_remove_fb(data->drm_fd, &data->fb);

The hardware test consists in:

  • Enable cursor: pair the cursor plane with a framebuffer, set the cursor size for its plane, and set the cursor plane size for the framebuffer. Commit the framebuffer changes to each display pipe
  • Wait for vblank. Vblank is a couple of extra scanlines region which aren’t actually displayed on the screen designed to give the electron gun (on CRTs) enough time to move back to the top of the screen to start scanning out the next frame. A vblank interrupt is used to notify the driver when it can start the updating of registers. To achieve tear-free display, users must synchronize page flips and/or rendering to vertical blanking [https://dri.freedesktop.org/docs/drm/gpu/drm-kms.html#vertical-blanking]
  • Calculate and check the current CRC
	cr = igt_get_cairo_ctx(data->drm_fd, &data->primary_fb[FRONTBUFFER]);
	igt_paint_color_alpha(cr, 0, 0, curw, curh, 1.0, 1.0, 1.0, a);
	igt_put_cairo_ctx(data->drm_fd, &data->primary_fb[FRONTBUFFER], cr);
	igt_display_commit(display);
	igt_wait_for_vblank(data->drm_fd, data->pipe);
	igt_pipe_crc_get_current(data->drm_fd, pipe_crc, &ref_crc);
	igt_assert_crc_equal(&crc, &ref_crc);

The software test consists in:

  • Create a cairo surface and drawing context, draw a rectangle on the top left side in white (RGB) with the given opacity (alpha channel), and then release the cairo context, applying the changes to the framebuffer. Then, commit the framebuffer changes to each display pipe.
  • Wait for vblank
  • Calculate and check the current CRC

Finally, the screen is cleaned.

And at the level of run_test, the CRTC is cleaned up.

This is what introspection looks like

The libqmi and libmbim libraries are getting more popular every day for controlling QMI or MBIM based devices. One of the things I’ve noticed, though, is that lots of users are writing applications in e.g. Python but then running qmicli or mbimcli commands and parsing their outputs. This approach may work, but there is absolutely no guarantee that the format of the output printed by the command line programs will be kept stable across new releases. And also, the way these operations are performed may be suboptimal (e.g. allocating QMI clients for each operation, instead of reusing them).

Since the new stable libqmi 1.26 and libmbim 1.24 releases, these libraries integrate GObject Introspection support for all their types, and that provides a much better integration within Python applications (or really, any other language supported by GObject Introspection).

The only drawback of using the libraries in this way, if you’re already using and parsing command line interface commands, is that you would need to go deep into how the protocol works in order to use them.

For example, in order to run a DMS Get Capabilities operation, you would need to create a Qmi.Device first, open it, then allocate a Qmi.Client for the DMS service, then run the Qmi.Client.get_capabilities() operation, receive and process the response with Qmi.Client.get_capabilities_finish(), and parse the result with the per-TLV reader method, e.g. output.get_info() to process the Info TLV. Once the client is no longer needed, it would need to be explicitly released before exiting. A full example doing just this is provided in the libqmi sources.

In the case of MBIM operations, there is no need for the extra step of allocating a per-service client, but instead, the user should be more aware of the actual types of messages being transferred, in order to use the correct parsing operations. For example, in order to query device capabilities you would need to create a Mbim.Device first, open it, create a message to query the device capabilities, send the message, receive and process the response, check whether an error is reported, and if it isn’t, fully parse it. A full example doing just this is provided in the libmbim sources.

Of course, all these low level operations can also be done through the qmi-proxy or mbim-proxy, so that ModemManager or other programs can be running at the same time, all sharing access to the same QMI or MBIM ports.

P.S.: not a true Python or GObject Introspection expert here, so please report any issue found or improvements that could be done 😀

And special thanks to Vladimir Podshivalov who is the one that started the hard work of setting everything up in libqmi. Thank you!

Enjoy!

June 02, 2020

As the GSoC coding time officially started this week (01/06/2020), this post draws a line between my activities during the community bonding period and the official development of my project. I used the last month to solve issues in my development environment, improve the tool that supports my development activity (kworkflow), and study concepts and implementations related to my GSoC project.

Communication

Besides e-mail, IRC chat, and Telegram, my mentor (Siqueira) and I are meeting every Wednesday on Jitsi, where we also use tmate for terminal sharing. We also use, together with Trevor, a spreadsheet to schedule tasks, report my daily activity, and write any suggestions.

Issues in my development environment

You have probably had this kind of pending task: a problem that troubles your work, but that you can work around, and so you postpone it until total exhaustion. I had some small issues in my VM, from authentication to the kernel update process, that I kept postponing forever. The first I solved with some direct setup, and the last led me to the second task: improving kworkflow.

Improvements for a set of scripts that facilitate my development day-to-day

The lack of support in kworkflow for deployment on a VM drove me to start hacking the code. As updating the VM requires understanding more complex things, I started by developing soft changes (i.e., less integrated with the scripts’ structure). This work is still in progress, and after discussing with other kworkflow developers (on the issue, on IRC, and in voice meetings), the proposed changes were refined.

The first phase of my project proposal orbits around the IGT test kms_cursor_crc. I already have some preliminary experience with the test code; however, I lack a lot of knowledge about the steps and concepts behind the implementation. Bearing this in mind, I used part of the community bonding time to immerse myself in this code.

Checking issues in a patchset that was reported by IGT CI

My first project task is to find out why it is not possible to access debugfs files when running kms_cursor_crc (and fix it). Two things could help me solve it: learning about debugfs and dissecting kms_cursor_crc. To guide my studies, my mentor suggested taking a look at a patchset for the IGT write-back test implementation, for which CI reported a crash on debugfs_test for i915. For this investigation, I installed Debian without a graphical environment on another machine (an old netbook) and, accessing it via ssh, applied the patches and ran the test. Well, everything seemed to work (and the subtests passed). Perhaps something has been fixed or changed in IGT since the patchset was sent. Nothing more to do here.


IGT_FORCE_DRIVER=i915 build/tests/debugfs_test 
IGT-Version: 1.25-gf1884574 (x86_64) (Linux: 4.19.0-9-amd64 x86_64)
Force option used: Using driver i915
Force option used: Using driver i915
Starting subtest: sysfs
Subtest sysfs: SUCCESS (0,009s)
Starting subtest: read_all_entries
Subtest read_all_entries: SUCCESS (0,094s)
Starting subtest: read_all_entries_display_on
Subtest read_all_entries_display_on: SUCCESS (0,144s)
Starting subtest: read_all_entries_display_off
Subtest read_all_entries_display_off: SUCCESS (0,316s)

Diving into the kms_cursor_crc test

I’m writing a kind of anatomy of the kms_cursor_crc test. I chose the alpha-transparent subtest as a target and then followed each step necessary to reach it, understanding each function called, its parameters, and also the abstractions. I am often confused by something that once seemed clear. Well, this is graphics interface stuff, and it is acceptable that these abstractions will disorient me, LOL, I guess… The result of this work will be my next post. In the meantime, here are links that helped me on this journey

May 28, 2020

So this is a blog post not related to Fedora or Red Hat, but rather my personal experience with getting a robo vacuum and robo mop into the house.

So about two months ago my wife and I decided to get a robo vacuum while shopping at Costco (a US wholesaler outfit). So we brought home the iRobot Roomba 980. Over the next week we ended up also getting the newer iRobot Roomba i7+ and the iRobot Braava m6 mopping robot. Our dream was that we would never have to vacuum or mop again, instead leaving that to our new robots to handle. With two little kids, being able to cut that work from our todo list seemed like a dream come true.

I feel that whenever you get into a new technology it takes some time with your first product in that category to understand what questions to ask and what considerations to make. For instance, I feel a lot more informed and confident in my knowledge about electric cars having owned a Nissan Leaf for a few years now (enough to wish I had a Tesla instead, for instance :). I guess our experience with robot vacuums here is similar.

Anyway, if you are considering buying a robot vacuum or mop, I think the first lesson we learned is that it is definitely not a magic solution. You have to prepare your house quite a bit before each run, from obvious things like tidying up anything on the floor like the kids’ legos etc., to discovering that certain furniture, like the IKEA Poang chairs, are mortal enemies with your robo vacuum. We had to put our chair on top of the sofa as the Roomba would get stuck on it every time we tried to vacuum the floor. Also the door mat in front of our entrance door kept having its corners sucked into the vacuum, getting it stuck. Anyway, our lesson learned is that vacuuming (or mopping) is not something we can do on an impulse or easily on a schedule, as it takes quite a bit of preparation. If you don’t have small kids leaving random stuff all over the house all the time you might be able to just set the vacuum on a schedule, but for us that has turned out to be a big no :). So in practice we only vacuum at night now, when my wife and I have had time to prep the house after the kids have gone to bed.

It is worth noting that we only have one vacuum now. We got the i7+ after we got the 980 due to realizing that the 980 didn’t have features like the smart map allowing you to, for instance, vacuum specific rooms. It also had other niceties like self emptying and it was supposed to be more quiet (which is nice when you run it at night). However, in our experience it also had a less strong vacuum, so we felt it left more crap on the floor than the older 980 model. So in the end we returned the i7+ in favour of the 980, just because we felt it did a better job at vacuuming. It is quite loud though, so we can hear it very loud and clear up on the second floor while trying to fall asleep. So if you need a quiet house to sleep, this setup is not for you.

Another lesson we learned is that the vacuums or mops do not work great in darkness, so we now have to leave the light on downstairs at night when we want to vacuum or mop the floor. We should be able to automate that using Google Home, so Google Home could turn on the lights, start the vacuum and then once done, turn off the lights again. We haven’t actually gotten around to test that yet though.

As for the mop, I would say that it is not a replacement for mopping yourself, but it can reduce the frequency of you mopping yourself and thus help maintain a nice clean floor for longer after you have done a full manual mop yourself. Also the m6 is super sensitive to edges, which I assume is to avoid it trying to mop your rugs and mats, but it also means that it can not traverse even small thresholds. So for us who have small thresholds between our kitchen area and the rest of the house, we have to carry the mop over the thresholds and mop the rest of the first floor as a separate action, which is a bit of an annoyance now that we are running these things at night. That said, the kitchen is the one room which needs mopping more regularly, so in some sense the current setup where the roomba vacuums the whole first floor and the braava mop mops just the kitchen is a workable solution for us. One nice feature here is that they can be set up to run in order, so the mop will only start once the vacuum is done (that feature is the main reason we haven’t tested out other brand mops which might handle the threshold situation better).

So to conclude, would I recommend robot vacuums and robot mops to other parents with young kids? I would say yes, it has definitely helped us keep the house cleaner and nicer and let us spend less time cleaning the house. But it is not a miracle cure in any way or form; it still takes time and effort to prepare and set up the house, and sometimes you still need to do especially the mopping yourself to get things really clean. As for the question of iRobot versus other brands I have no input, as I haven’t really tested any other brands. iRobot is a local company, so their vacuums are available in a lot of stores around me and I drive by their HQ on a regular basis, so that is the more or less random reason I ended up with their products as opposed to competing ones.

May 25, 2020

The KWinFT project with its two major open source offerings KWinFT and Wrapland was announced one month ago. This made quite some headlines back then but I decided to keep it down afterwards and push the project silently forward on a technical side.

Now I am pleased to announce the release of a beta version for the next stable release 5.19 in two weeks. The highlights of this release are a complete redesign of Wrapland's server library and two more projects joining KWinFT.

Wrapland Server library redesign

One of the goals of KWinFT is to facilitate large upsetting changes to the internal structure and technological base of its open source offerings. As mentioned one month ago in the project announcement these changes include pushing back the usage of Qt in lower-level libraries and instead making use of modern C++ to its full potential.

We achieved the first milestone on this route in an impressively short timeframe: the redesign of Wrapland's server library for improved encapsulation of external libwayland types and providing template-enhanced meta-classes for easy extension with new functionality in the future.

This redesign work was organized on a separate branch and merged this weekend into master. In the end that included over 200 commits and 40'000 changed lines. Here I have to thank in particular Adrien Faveraux who joined KWinFT shortly after its announcement and contributed several class refactors. Our combined work enabled us to deliver this major redesign already now with the upcoming release.

Aside from the redesign I used this opportunity to add clang-based tools for static code analysis: clang-format and clang-tidy. Adding to our autotests that run with and without sanitizers Wrapland's CI pipelines now provide efficient means for handling contributions by GitLab merge requests and checking back on the result after merge. You can see a full pipeline with linters, static code analysis, project build and autotests passing in the article picture above or check it out here directly in the project.

New in KWinFT: Disman and KDisplay

With this release Disman and KDisplay join the KWinFT project. Disman is a fork of libkscreen and KDisplay one of KScreen. KScreen is the main UI in a KDE Plasma workspace for display configuration and I was its main contributor and maintainer in the last years.

Disman can be installed in parallel with libkscreen. For KDisplay on the other side it is recommended to remove KScreen when KDisplay is installed. This way not both daemons try to meddle with the display configuration at the same time. KDisplay can make use of plugins for KWinFT, KWin and wlroots so you could also use KDisplay as a general replacement.

Forking libkscreen and KScreen to Disman and KDisplay was an unwanted move from my side because I would have liked to keep maintaining them inside KDE. But my efforts to integrate KWinFT with them were not welcomed by some members of the Plasma team. Form your own opinion by reading the discussion in the patch review.

I am not happy about this development but I decided to make the best out of a bad situation and after forking and rebranding directly created CI pipelines for both projects which now also run linters, project builds and autotests on all merge requests and branch pushes. And I defined some more courageous goals for the two projects now that I have more control.

One would think after years of being the only person putting real work into KScreen I would have had this kind of freedom also inside KDE but that is not how KDE operates.

Does it need to be this way? What are arguments for and against it? That is a discussion for another article in the future.

Very next steps

There is an overall technical vision I am following with KWinFT: building a modern C++ framework for Wayland compositor creation. A framework that is built up from independent yet well interacting small libraries.

Take a look at this task for an overview. The first one of these libraries that we have now put work in was Wrapland. I plan for the directly next one to be the backend library that provides interfacing capabilities with the kernel or a host window system the compositor runs on, what in most cases means talking to the Direct Rendering Manager.

The work in Wrapland is not finished though. After the basic representation of Wayland objects has been improved we can push further by layering the server library like this task describes. The end goal here is to get rid of the Qt dependency and make it an optional facade only.

How to try out or contribute to KWinFT

You can try out KWinFT on Manjaro. At the moment you can install KWinFT and its dependencies on Manjaro's unstable images but it is planned to make this possible also in the stable images with the upcoming 5.19 stable release.

I explicitly recommend the Manjaro distribution nowadays to any user, from newcomers to Linux to experts. I have Manjaro running on several devices and I am very pleased with Manjaro's technology, its development pace and its community.

If you are an advanced user you can also use Arch Linux directly and install a KWinFT AUR package that builds KWinFT and its dependencies directly from Git. I hope a package of KWinFT's stable release will also soon be available from Arch's official repositories.

If you want to contribute to one of the KWinFT projects take a look at the open tasks and come join us in our Gitter channel. I am very happy that already several people joined the project who provide QA feedback and patches. There are also opportunities to work on DevOps, documentation and translations.

I am hoping KWinFT will be a welcoming place for everyone interested in high-quality open source graphics technology. A place with low entry barriers and many opportunities to learn and grow as an engineer.

May 20, 2020

Meanwhile, in GSoC:

I took the second week of Community Bonding to make some improvements in my development environment. As I have reported before, I use a QEMU VM to develop kernel contributions. I was initially using an Arch VM for development; however, at the beginning of the year, I reconfigured it to use a Debian VM, since my host is a Debian installation - fewer context changes. In this move, some ends were left loose and I did some workarounds; well… better to round them off.

I also use kworkflow (KW) to ease most of the non-coding tasks included in day-to-day coding for the Linux kernel. KW automates repetitive steps of a developer’s life, such as compiling and installing my kernel modifications; finding information to format and send patches correctly; mounting or remotely accessing a VM, etc. During the time that preceded the GSoC project submission, I noticed that the feature of installing a kernel inside the VM was incomplete. At that time, I started to use the “remote” option as a palliative. Therefore, I spent the last days learning more features and how to hack kworkflow to improve my development environment (and send the changes back to the kw project).

I have started by sending a minor fix on alert message:

kw: small issue on u/mount alert message

Then I expanded the feature “explore” - looking for a string in directory contents - by adding GNU grep utility in addition to the already used git grep. I gathered many helpful suggestions for this patch, and I applied them together with the reviews received in a new version:

src: add grep utility to explore feature

Finally, after many hours of searching, reading and learning a little about guestfish, grub, initramfs-tools and bash, I could create the first proposal of code changes that enable kw to automate the build and installation of a kernel in a VM:

add support for deployment in a debian-VM

The main barrier to this feature was figuring out how to update grub on the VM without running the update-grub command via ssh. First, I thought about adding a custom file with a new boot entry. Thinking and researching a little more, I realized that guestfish could solve the problem and, following this logic, I found a blog post describing how to run “update-grub” with guestfish. From that, I made some adaptations that produced the solution.

However, in addition to updating grub, the feature still lacked some steps to install the kernel on the VM properly. I figured out the missing code by visiting an old FLUSP tutorial that describes, step by step, how to compile and install the Linux kernel inside a VM. I also used the implementation of the “remote” mode of “kw deploy” to wrap up.

Now I use kw to automatically compile and install a custom kernel on my development VM. So, time to sing: “Ooh, that’s why I’m easy; I’m easy as Sunday morning!”

Maybe not now. It’s time to learn more about IGT tests!

This morning I saw two things that were Microsoft and Linux graphics related.

https://devblogs.microsoft.com/commandline/the-windows-subsystem-for-linux-build-2020-summary/

a) DirectX on Linux for compute workloads
b) Linux GUI apps on Windows

At first I thought these were related, but it appears at least presently these are quite orthogonal projects.

First up, a clarification for the people who jump to insane conclusions:

The DX on Linux is a WSL2-only thing. Microsoft are not in any way bringing DX12 to Linux outside of the Windows environment. They are also in no way open sourcing any of the DX12 driver code. They are recompiling the DX12 userspace drivers (from GPU vendors) into Linux shared libraries, and running them on a kernel driver shim that transfers the kernel interface up to the closed source Windows kernel driver. This is in no way useful for having DX12 on Linux bare metal or anywhere other than in a WSL2 environment. It is not useful for Linux gaming.

Microsoft have submitted to the upstream kernel the shim driver to support this. This driver exposes their D3DKMT kernel interface from Windows over virtual channels into a Linux driver that provides an ioctl interface. The kernel drivers are still all running on the Windows side.

Now, I read the Linux GUI apps bit and assumed these things were the same; well, it turns out the DX12 stuff doesn't address presentation at all. It's currently only for compute/ML workloads using CUDA/DirectML. There isn't a way to put the results of DX12 rendering from the Linux guest applications onto the screen at all. The other project is a wayland/RDP integration server that connects Linux apps via wayland to an RDP client on the Windows display; integrating that with DX12 will be a tricky project, and integrating that upstream with the Linux stack is another step entirely.

Now I'm sure this will be resolved, but it has certain implications on how the driver architecture works and how much of the rest of the Linux graphics ecosystem you have to interact with, and that means that the current driver might not be a great fit in the long run and upstreaming it prematurely might be a bad idea.

From my point of view the kernel shim driver doesn't really bring anything to Linux, it's just a tunnel for some binary data between a host windows kernel binary and a guest linux userspace binary. It doesn't enhance the Linux graphics ecosystem in any useful direction, and as such I'm questioning why we'd want this upstream at all.

May 19, 2020

One of the more common issues we encounter debugging things is that users don't always know whether they're running on a Wayland or X11 session. Which I guess is a good advertisement for how far some of the compositors have come. The question "are you running on Xorg or Wayland" thus comes up a lot and suggestions previously included things like "run xeyes", "grep xinput list", "check xrandr" and so on and so forth. None of those are particularly scriptable, so there's a new tool around now: xisxwayland.

Run without arguments it simply exits with exit code 0 if the X server is Xwayland, or 1 otherwise. Which means you can use it like this:


$ cat my-xorg-only-script.sh
#!/bin/bash

if xisxwayland; then
    echo "This is an Xwayland server!";
    exit 1
fi

...

Or, in the case where you have a human user (gasp!), you can ask them to run:

$ xisxwayland --verbose
Xwayland: YES

And even non-technical users should be able to interpret that.

Note that the script checks for Xwayland (hence the name) via the $DISPLAY environment variable, just like any X application. It does not check whether there's a Wayland compositor running but for most use-cases this doesn't matter anyway. For those where it matters you get to write your own script. Congratulations, I guess.

May 13, 2020

I submitted a project proposal to participate in this year’s Google Summer of Code (GSoC). As I am curious about the DRM subsystem and have already started working on contributing to it, I was looking for an organization that supports this subsystem. So, I applied to the X.Org Foundation and proposed a project to Improve VKMS using IGT GPU Tools. Luckily, on May 4th, I received an e-mail from GSoC announcing that I was accepted! Happy happy happy!

Observing a conversation in the #dri-devel channel, I realized that my project was the only one accepted for X.Org. Wow! I have no idea why the organization has only one intern this year, and even less why it is me! Immediately after I received the result, Trevor Woerner congratulated me and kindly announced my project on his Twitter profile! It was fun to learn that he enjoyed the things that I randomly posted on my blog, and it was so nice to see that he read what I wrote!

Advice? I’m not sure, but I can report my submission experience

From time to time, someone appears on the communication channels of FLUSP (FLOSS@USP) asking how to participate in GSoC or Outreachy. They are usually trying to answer questions about the rules of participation and/or to obtain reports from former participants. In my opinion, we have many highly qualified IT people in Brazil who, for some reason, do not feel safe investing in a career abroad. They are very demanding of themselves. And I believe that groups like FLUSP can be a means of launching good developers internationally.

With this contribution window in view, I consider it worthwhile to report some actions that I took in my application process.

First, take it seriously, and start as soon as possible.

About a year ago, I made my first efforts to understand the project and the Linux community. But that doesn’t mean you need a year in advance for that. What I mean is that, for a project like Linux, I had to dedicate enough hours to studying, understanding, and trying to develop my first contributions. And I, in particular, did not feel skilled enough to propose anything to the community. Participating in GSoC was not my initial plan. What I wanted (and still want) is to become a highly skilled developer on the Linux kernel project.

Second, take your first contribution steps.

Many times I prepared the environment, started to work on some things, and then ended up blocked by all the doubts that arose when I dived into all these lines of code. Breathe… Better to start with the basics. If you have someone who can mentor you, one way is to work “with her/him.” Find issues or code that s/he participates in, so you can start a conversation and get some guidance. I prefer a “peer-to-peer” introduction to a “broadcast” message to the community. And then, things fly.

When the organizations are announced…

Ask the community (or the developers you have already developed a connection with) whether any of the organizations would contemplate the system for which you are interested in proposing a project. You can propose something that you think is relevant, you can look for project suggestions within organizations, or you can (always) ask. Honestly, I’m not very good at asking. In the case of the X.Org Foundation, you can find tips here, and suggestions and potential mentors here and here.

Write your proposal and share!

As soon as you can.

In your proposal, you should introduce yourself and show what you already know so far, that is, present your portfolio. Also, try to be as transparent as possible about tasks, describe everything you already know about the issues, estimate time, and make a schedule. Finally, propose deliverables in addition to code (documentation, tutorials, blog posts). I think this content production will help your work and also the monitoring of your activities. Moreover, it is a way to spread the word and track development issues.

I wasn’t fast enough to share my project, but I saw how my proposal evolved after I received the community feedback. The “create and share” feature is available in the GSoC application platform using Google Docs. I think my internship work plan became stronger due to sharing my document and applying all the suggestions.

Approved, and now? I am the newest X.Org developer

After the welcome, Siqueira and Trevor also gave me some guidance, reading, and initial activities.

To better understand the X.Org organization and the system I will be working on in the coming months, I have read the following links (still in progress):

Hands on, it has already started!

Well, time to revisit the proposal, organize daily activities, and check the development environment. Before the GSoC results, I was working on some kworkflow issues, and so that’s what I’m working on at the moment. I like to study in a “problem-oriented” way, so to dissect kw, I took on some issues to try to solve. I believe I can report my progress in an upcoming post.

To the next!

May 08, 2020

Gather round children, it's analogy time! First, some definitions:

  • Wayland is a protocol to define the communication between a display server (the "compositor") and a client, i.e. an application, though the actual communication is usually done by a toolkit like GTK or Qt.
  • A Wayland Compositor is an implementation of a display server that (usually but not necessarily) handles things like displaying stuff on screen and handling your input devices, among many other things. Common examples for Wayland Compositors are GNOME's mutter, KDE's KWin, weston, sway, etc.[1]

And now for the definitions we need for our analogy:

  • HTTP is a protocol to define the communication between a web server and a client (usually called the "browser")
  • A Web Browser is an implementation that (sometimes but not usually) handles things like displaying web sites correctly, among many other things. Common examples for Web Browsers are Firefox, Chrome, Edge, Safari, etc. [2]

And now for the analogy:

The following complaints are technically correct but otherwise rather pointless to make:

  • HTTP doesn't support CSS
  • HTTP doesn't support adblocking
  • HTTP doesn't render this or that website correctly

And much in the same style, the following complaints are technically correct but otherwise rather pointless to make:

  • Wayland doesn't support keyboard shortcuts
  • Wayland doesn't support screen sharing
  • Wayland doesn't render this or that application correctly

In most cases you may encounter (online or on your setup), saying "Wayland doesn't support $BLAH" is like saying "HTTP doesn't support $BLAH". The problem isn't with Wayland itself, it's a missing piece or bug in the compositor you're using.

Likewise, saying "I don't like Wayland" is like saying "I don't like HTTP". The vast majority of users will have negative feelings towards the browser, not the transport protocol.

[1] Because they're implementations of a display server they can speak multiple protocols and some compositors can also be X11 window managers, much in the same way as you can switch between English and your native language(s).

[2] Because they're implementations of a web browser they can speak multiple protocols and some browsers can also be FTP file managers, much in the same way as... you get the point

May 07, 2020

We recently had a Fedora AMA where one of the questions asked is why GNOME is the default desktop for Fedora Workstation. In the AMA we answered why GNOME had been chosen for Fedora Workstation, but we didn’t challenge the underlying assumption built into the way the question was asked, and the answer to that assumption is that it isn’t the default. What I mean by this is that Fedora Workstation isn’t a box of parts where you have default options that can be replaced; it’s a carefully procured and assembled operating system aimed at developers, sysadmins and makers in general. If you replace one or more parts of it, then it stops being Fedora Workstation and starts being ‘build your own operating system OS’. There is nothing wrong with wanting to or finding it interesting to build your own operating system; I think a lot of us initially got into Linux due to enjoying doing that. And the Fedora project provides a lot of great infrastructure for people who want to build their own operating systems, either by themselves or by teaming up with others, which is why Fedora has so many spins and variants available.
The Fedora Workstation project is something we made using those tools, and it has been tested and developed as an integrated whole, not as a collection of interchangeable components. The Fedora Workstation project might of course replace certain parts with other parts over time, like how we are migrating from X.org to Wayland. But at some point we are going to drop standalone X.org support and only support X applications through XWayland. And that is not the same as if each of our users individually did the same. While it might be technically possible for a skilled user to still get things moved back onto X for some time after we make the formal deprecation, the fact is that you would no longer be using ‘Fedora Workstation’. You would be using a homebrew OS that contains parts taken from Fedora Workstation.

So why am I making this distinction? To be crystal clear, it is not to hate on you for wanting to assemble your own OS; in fact we love having anyone with that passion as part of the Fedora community. I would of course love for you to share our vision and join the Fedora Workstation effort, but the same is true for all the other spins and variant communities we have within the Fedora community too. No, the reason is that we have a very specific goal of creating a stable and well-working experience for our users with Fedora Workstation, and one of the ways we achieve this is by having a tightly integrated operating system that we test and develop as a whole. Because that is the operating system we as the Fedora Workstation project want to make. We believe that doing anything else creates an impossible QA matrix, because if you tell people that ‘hey, any part of this OS is replaceable and should still work’ you have essentially created a testing matrix for yourself of infinite size. And while as software engineers I am sure many of us find experiments like ‘wonder if I can get Fedora Workstation running on a BSD kernel’ or ‘I wonder if I can make it work if I replace glibc with Bionic’ fun and interesting, I am equally sure we all also realize that once we do that we are in self-support territory, and that Fedora Workstation, or any other OS you use as your starting point, can’t be blamed if your system stops working well. And replacing such a core thing as the desktop is no different from those other examples.

Having been in the game of trying to provide a high quality desktop experience both commercially in the form of RHEL Workstation and through our community efforts around Fedora Workstation, I have seen and experienced first hand the problems that the mindset of interchangeable desktops creates. For instance, before we switched to the Fedora Workstation branding and it was all just ‘Fedora’, I experienced reviewers complaining about missing features, features we had actually spent serious effort implementing, because the reviewer decided to review a different spin of Fedora than the GNOME one. Other cases I remember are of customers trying to fix a problem by switching desktops, only to discover that while the initial issue they wanted fixed got resolved by the switch, they now got a new batch of issues that were equally problematic for them. And we were left trying to figure out if we should try to fix the original problem, the new ones, or maybe the problems reported by users of a third desktop option. We also have had cases of users who, just like the reviewer mentioned earlier, assumed something was broken or missing because they were using a different desktop than the one where the feature was added. And at the same time, trying to add every feature everywhere would dilute our limited development resources so much that it would make us move slowly and not have the resources to focus on, for instance, getting ready for major changes in the hardware landscape.
So for RHEL we now only offer GNOME as the desktop, and the same is true in Fedora Workstation. That is not because we don’t understand that people enjoy experimenting with other desktops, but because it allows us to work with our customers, users and hardware partners on fixing the issues they have with our operating system, because it is a clearly defined entity, and on adding the features they need going forward and properly supporting the hardware they are using, as opposed to spreading ourselves so thin that we just run around putting band-aids on the problems reported.
And in the longer run I actually believe this approach benefits those of you who want to build your own OS too, or who use an OS built by another team around a different set of technologies, because while the improvements might come in a bit later for you, the work we now have the ability to undertake due to having a clear focus, like our work on adding HiDPI support, getting Wayland ready for desktop use or enabling Thunderbolt support in Linux, makes it a lot easier for these other projects to eventually add support for these things too.

Update: Adam Jackson’s oft-quoted response to the old ‘Linux is about choice’ meme is also required reading for anyone wanting a high quality operating system.

May 04, 2020
*reality TV show deep voice guy*

In 2016, we added a way to launch apps on the discrete GPU.

*swoosh effects*

In 2019, we added a way for that to work with the NVidia drivers.

*explosions*

In 2020, we're adding a way for applications to launch automatically on the discrete GPU.

*fast cuts of loads of applications being launched and quiet*

Introducing the (badly-named-but-if-you-can-come-up-with-a-better-name-youre-ready-for-computers) “PrefersNonDefaultGPU” desktop entry key.

From the specifications website:
If true, the application prefers to be run on a more powerful discrete GPU if available, which we describe as “a GPU other than the default one” in this spec to avoid the need to define what a discrete GPU is and in which cases it might be considered more powerful than the default GPU. This key is only a hint and support might not be present depending on the implementation. 
And support for that key is coming to GNOME Shell soon.

TL;DR

Add “PrefersNonDefaultGPU=true” to your application's .desktop file if it can benefit from being run on a more powerful GPU.
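
For illustration, a complete desktop file using the key might look like this (the application name and command are made up for the example; only the last line is the new key):

[Desktop Entry]
Type=Application
Name=Hypothetical Game
Exec=hypothetical-game
# Hint to the desktop environment that this application benefits
# from running on the more powerful, non-default GPU.
PrefersNonDefaultGPU=true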

We've also added a switcherooctl command to recent versions of switcheroo-control so you can launch your apps on the right GPU from your scripts and tweaks.

April 28, 2020

As Fedora Workstation 32 was released today, I ended up looking back at our efforts to drain the swamp over the last 6 years. In April of 2014 I wrote a blog post outlining our vision for the Fedora Workstation effort and what we wanted to achieve with it. I hadn’t looked at that blog post in years, but it was interesting going back to it and realizing that while some of the details have changed, it is still the vision we are pursuing today: to keep draining the swamp and make Fedora Workstation a top notch operating system for developers and makers in general. Which I guess is one of the hallmarks of a decent vision, that it allows for the details to change without invalidating it.

One of my pet peeves at the time with Linux as a desktop operating system was that so many of the so-called efforts to make Linux user-friendly were essentially duct-taping over the problems, creating fragile solutions that often made it harder for us to really move forward. In the years since, we addressed a lot of major swamp issues with our efforts around HiDPI & Bolt (getting ahead of hardware enablement for new monitors and Thunderbolt devices respectively), Flatpaks, GNOME Software and AppStream (making applications discoverable, deployable and maintainable), Wayland (making your desktop secure and future proof), LVFS and firmware handling (making firmware easily available for Linux users), fingerprint reader standardization (ensuring your hardware is fully supported) and coming up with ways to improve the lives of developers with improvements to the terminal or Fedora Toolbox, our developer pet container tool.

Working on these and other issues we realized early on that a model where hardware gets enabled in a reactive manner, in response to new laptops being sold, was never going to yield a good result for our users. As long as we followed that model, people were bound to always hit issues with laptops as they came out and then have to deal with those issues for the first 6-12 months of their life. This is why I am so excited about our new partnership with Lenovo that we pre-announced on Friday, as it is both the culmination of our efforts over the last 6 years and the starting point of a new era in terms of how we work with hardware makers. So instead of us spending a ton of time trying to reverse engineer basic drivers, we can now rely on our hardware partner and their component vendors providing them, and we can instead focus on what I call high level hardware enablement. Meaning that as we see new features coming into laptops and computers, we can try to improve the infrastructure in the operating system to be able to take full advantage of said hardware, and we can do so in collaboration with the hardware makers, knowing that once we provide the infrastructure they will provide drivers and the like that fit into that infrastructure. Our work on fingerprint readers and Thunderbolt support, for instance, provides two great early examples of that.

Anyway, you are probably interested to know some of the new things coming in Fedora Workstation 32, so here are some of my personal highlights:

New lock screen

This is more of a cosmetic change, but one that every user will see upon logging into their Fedora system after a new install or upgrade. The new design features a faded version of your desktop background image, and it should also feel smoother as the password dialog now appears on the lock screen page, as opposed to before, where it sort of replaced it. The dialog also tries to inform you more discreetly than before if you’re trying to type in the password while the lock screen is on. A big thanks to Allan Day and the GNOME design team for their work here polishing this part of the user interface.

GNOME extension app

GNOME Shell extensions are little tweaks and additional features for the desktop that our users have gotten accustomed to and enjoy greatly. Extensions are also the technology that powers the GNOME Classic session, which provides those of our users who want it with a more traditional desktop experience. GNOME Shell extensions have gradually evolved in how we work with them since their inception, from something you install through your web browser to now being handled through GNOME Software. With Fedora Workstation 32 we are making the new GNOME Shell extensions management app available as the next step in the evolution of GNOME Shell extensions, making it simple to turn any given extension on or off, or quickly see which extensions you have installed.

GNOME Extensions handling app

Fedora Toolbox

Fedora Toolbox is our helper for making working with containers for development and testing as easy as it possibly can be. Debarshi Ray and Ondřej Míchal have been hard at work porting the Fedora Toolbox from shell to Go for this release. For those wondering why we chose Go as the language, there were basically two reasons: one, we felt that the toolbox had gone as far as it could as a shell script, and two, Go is the language used by all the components we rely on and interact with in the container space, like buildah and podman. We also wanted to make it easy for developers on those projects to contribute by using the same language as they use in their projects.

Fedora Toolbox running on Fedora Workstation 32

Performance improvements

Another area that we always try to give some love is general performance. For example, this time around Christian Hergert identified some really bad behavior of GNOME Shell when running on a system with very high I/O. On the face of it, GNOME Shell didn’t look like it should have been affected, but during some intensive debugging sessions Christian discovered that I/O was triggered by various API calls to do things like string translation. So he put together a set of patches to resolve the high I/O stalls and can now report that GNOME Shell keeps running smooth as silk, even under high disk I/O.

PipeWire

Wim Taymans keeps making great strides forward with PipeWire, our tool for creating a unified media handler for audio, pro audio and video. In Fedora Workstation 32 we will be shipping the 0.3 version, which has quite complete Jack support. In fact we are hoping to team up with the Fedora Jam team to finalize the Jack support during the Fedora 32 lifecycle by testing it extensively. We have a lot of Jack apps already working with PipeWire, including a series of important Jack apps that we have put into Flatpaks in Fedora, like Carla. While the support is there in PipeWire in Fedora 32 right now, there is some convenience work we still need to do; we hope to get that pushed out by next week so that replacing Jack with PipeWire becomes very simple to both do and undo for testing purposes.

The PulseAudio support is the last piece that is still in progress. It works for simple music playback, but it is not a drop-in replacement for PulseAudio yet, so while we hoped to encourage widespread testing in F32, we will aim at delaying that to F33 in order to polish the PulseAudio support more first. But once it is ready we will make it available for testing in a simple manner, just like the Jack support.

There has also been further work on the video side of PipeWire, adding support for zero-copy video capture. This has reduced the overhead of things like screen capturing significantly and should be a nice performance/resource usage improvement for everyone.

Firefox on Wayland

Martin Stransky and Jan Horak have been working hard to improve how Firefox runs and works when used as a Wayland native application, fixing a truckload of bigger and smaller bugs this cycle. We feel that we have turned the corner now in terms of the Wayland version being just as stable and good as the X11 one. In fact we could move beyond just fixing bugs to actually adding features this time around; for instance, Martin Stransky worked on WebGL hardware acceleration support, enabling us to have that enabled by default for the first time. We also made sure to take advantage of the PipeWire zero-copy support to improve video conferencing applications running under Firefox, which turned out to be even more important than we expected considering Covid-19 has everyone working from home.

Looking forward

We spent a lot of time and energy over the last 6 years to get to where we are now, putting in place a lot of the basic building blocks needed to make Linux a great desktop operating system. And it feels great that just as we kick off the new line of Lenovo laptops running Fedora, we are also entering a new phase of development where we can move beyond getting our basic infrastructure in place and really start taking advantage of it to rapidly improve the experience we are providing even more. A good example is the Firefox work mentioned above, where we finally could move on from ‘make it work with Wayland and PipeWire’ to ‘let’s take advantage of these new pieces to make Firefox on Linux better’. Another example here is that Adam Jackson is currently investigating how we can improve how Fedora Workstation performs for remote usage. This work includes looking at things like VNC and RDP and commercial offerings and figuring out how we can make our stack work better with such tools, on top of the improvements that PipeWire brings for such use cases.

There is some more heavy lifting needed before our next generation OS architecture, Silverblue, is ready to be our default offering, but it is improving by leaps and bounds each release and already has a loyal following. Personally I am very excited about the fact that we are quickly moving closer to the point where we can make it our default and through that offer features like bulletproof OS updates, factory resets and solid version rollbacks.

On the Flatpak side, Owen Taylor and Alex Larsson are putting a lot of final touches on our Red Hat infrastructure. For RHEL 8.2 we will finally be able to build Flatpaks in RHEL infrastructure and provide a runtime and SDK for our RHEL customers to use. But equally exciting is that we will be able to offer these to the community at large, meaning that we can offer a high quality Flatpak Long Term Support runtime and SDK that ISVs can use to target not just RHEL users, but also Fedora and other Linux distributions, in a similar vein to how the Red Hat UBI works. We will also be looking at ways to make getting access to these on Fedora very simple for developers, so that developing against this runtime becomes quick and easy on your Fedora system. Alex and Owen are also working on an incremental updates feature to be shared between Kubernetes containers and OCI Flatpaks, making both technologies better and updates a lot smaller.

We are also looking at a host of other smaller improvements, many of them in collaboration with our friends at Lenovo, like lap detection (so you can be sure the laptop doesn’t burn you), privacy features (like making it harder to read your screen from an angle) and far-field microphones. There are also things like Lennart’s HomeD idea, which we will be looking at as a way to improve the end user experience.

So the future is looking bright, and I hope to see many new faces in the Fedora community going forward, whether you download Fedora Workstation 32 to install on your own system or join us by buying a Fedora laptop from Lenovo this summer.

April 26, 2020

The ELF data format divides object files into segments and sections, which has long caused confusion. The terms segment and section can be used interchangeably in almost all cases in the English language ([1], [2]). What is often overlooked is that the ELF specification explicitly meant both to mean almost the same thing. They merely provide two views of the same data, but use different terms to allow referring to each view more easily.

When we look at the defining specification (gABI: System V Application Binary Interface) we find this quote in the introduction:

Object files participate in program linking (building a program) and program execution (running a program). For convenience and efficiency, the object file format provides parallel views of a file’s contents, reflecting the differing needs of those activities.

This is, in my opinion, a crucial detail often overlooked. The ELF data format explicitly provides two views of the same data. The difference between segments and sections is thus not what data they contain, but how they index the same data. The specification goes a step further:

A program header table tells the system how to create a process image. Files used to build a process image (execute a program) must have a program header table; relocatable files do not need one.

A section header table contains information describing the file’s sections. Every section has an entry in the table; each entry gives information such as the section name, the section size, and so on. Files used during linking must have a section header table; other object files may or may not have one.

Keep in mind that the program header table is effectively a segment header table. Therefore, the specification explicitly says that these two data views do not both have to be present in a given file. Depending on the use case, the format allows for only segments or only sections.

To summarize, an ELF object file contains data and machine code of a program, which itself is divided into many parts. The ELF format then provides two different views of this same content: segments and sections. However, these are views of the data present in the file; they do not define the content, but merely index it.
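
To make the two views tangible, here is a minimal C sketch using the standard elf.h definitions (from the command line, readelf -l and readelf -S show the same two views):

#include <elf.h>
#include <stdio.h>

/* Print both views of the same file content: the segment view via the
 * program header table and the section view via the section header
 * table. Either table may be absent; its offset and count are then 0. */
static void print_views(const Elf64_Ehdr *ehdr)
{
    printf("segment view: %u program headers at offset %llu\n",
           (unsigned)ehdr->e_phnum, (unsigned long long)ehdr->e_phoff);
    printf("section view: %u section headers at offset %llu\n",
           (unsigned)ehdr->e_shnum, (unsigned long long)ehdr->e_shoff);
}

A relocatable object will typically report zero program headers, while a fully linked and stripped executable may report zero section headers; both are valid ELF files.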

As a closing note, we must acknowledge how all this evolved over time, though. While the ELF specification provides this neat dual view, a lot of this freedom is not actually used in most ELF files. Instead, most files are effectively split into many small sections, and the segments merely provide a grouping of sequential sections in the file. Sections have become the tool that drives the data in ELF files, and segments have become a view of that data. But this is a purely artificial interpretation and is not rooted in the ELF data format.

April 24, 2020

So you have probably seen the announcement that Lenovo are launching a set of Fedora Workstation based laptops. I am so happy and proud of this effort as it comes as the culmination of our hard effort over the last 6 years to drain the swamp and make Linux a more viable desktop operating system.
I am also so happy and proud that Lenovo was willing to work with us on this effort, as they provide us with an incredible opportunity to reach both new and old Linux users around the globe with these systems, being the world’s biggest laptop maker with the widest global reach. One important aspect of this is that Lenovo will provide these laptops through all their sales channels in all their markets. This means you can of course order them online through their website, but it also means companies can order them through Lenovo’s business-to-business channels, and it means that in any country where Lenovo is present you can order them. So this is not a North America-only or Europe-only offering; this is truly a global offering.

There are a lot of people who have been involved here in helping to make this happen, but special thanks go to Egbert Gracias from Lenovo, who was critical in making this happen, and also to Alberto Ruiz, who spearheaded this effort from our side.

Our engineering team here at Red Hat has also been hard at work ensuring we can support these models very well, be that through bug fixes to kernel drivers or by polishing up things like the Linux fingerprint support. As we go forward we hope to build on this relationship to take Linux laptops to the next level, and I am also very happy to say that we have Jared Dominguez on our team now to help us develop better work practices and closer relationships with our hardware partners and original device manufacturers.


Also a special thanks to Jakub Steiner for putting together the little sizzle video above, it was supposed to be used at our booth at Red Hat Summit next week, but with that going virtual we repurposed it for this announcement.

April 23, 2020
glmark2-es2 -btexture

In Panfrost’s infancy, community members Connor Abbott and Lyude Paul largely reverse-engineered Bifrost and built a proof-of-concept shader dis/assembler. Meanwhile, I focused on the Midgard architecture (Mali T600+), building an OpenGL driver alongside developers like Collaboran Tomeu Vizoso.

As Midgard support has grown – including initial GLES3 support – we have now turned our attention to building a Bifrost driver. We at Collabora got to work in late February, with Tomeu porting the Panfrost command stream, while I built up a new Bifrost compiler.

This week, we’ve reached our first major milestone: the first 3D renders on Bifrost, including basic texture support!

glmark2-es2 -bshading:shading=phong:model=horse
glmark2-es2 -bbuffer

The interface to a modern GPU has two components, the fixed-function command stream and the programmable instruction set architecture. The command stream controls the hardware, dispatching shaders and containing the state required by OpenGL or Vulkan. By contrast, the instruction set encodes the shaders themselves, as with any programmable architecture. Thus the GPU driver contains two major components, generating the command stream and compiling programs respectively.

From Midgard to Bifrost, there have been few changes to the command stream. After all, both architectures feature approximately the same OpenGL and Vulkan capabilities, and the fixed-function hardware has not required much driver-visible optimization. The largest changes involve the interfaces between the shaders and the command stream, including the titular shader descriptors. Indeed, if you squint, command stream traces from Midgard and Bifrost look similar – but the long tail of minor updates implies a nontrivial Panfrost port.

But the Bifrost instruction set, on the other hand? A rather different story.

Let’s dive in.

Compiler

Bifrost’s instruction set was redesigned completely from Midgard’s, requiring us to build a free software compiler targeting Bifrost from scratch. Midgard’s architecture is characterized as:

  • Vector. 128-bit arithmetic logic unit (ALU) allows 32-bit 4-channel SIMD.

  • Very Long Instruction Word. 5 blocks – scalar/vector add/multiply and special function unit – operate with partial parallelism across 2 pipeline stages.

Vector and VLIW architectures promise high-performance in theory, and in theory, theory and practice are the same. In practice, these architectures are extremely difficult to compile efficiently for. Automatic vectorization is difficult; if a shader uses a 3-channel 32-bit vector (vec3), most likely the extra channel will go unused, wasting resources. Likewise, scheduling for VLIW architectures is difficult and slots often go unused, again wasting resources and preventing shaders from reaching peak architectural performance.

Bifrost by contrast is:

  • Scalarish. 32-bit ALU: 32-bit operations are purely scalar, while 16-bit operations are 2-channel SIMD and 8-bit operations are 4-channel SIMD. 64-bit operation requires a special half-performance mode.

  • Slightly Long Instruction Word. Bifrost has 2 blocks – fused multiply-add and add – pipelined without parallelism. Simplified special function unit.

This redesign promises better performance - and a redesign of Panfrost’s compiler, too.

Bifrost Intermediate Representation

At the heart of a modern compiler is one or more Intermediate Representations (IRs). In Mesa, OpenGL shaders are parsed into GLSL IR – a tree IR expressing language-visible types – which is converted to NIR – a flat static single assignment IR enabling powerful optimizations. Backends convert NIR to a custom backend IR, whose design seals the fate of a compiler. A poor IR design can impede the growth and harm the performance of the entire compiler, while a well-designed IR enables complex features to be implemented naturally with simple, fast code. There is no one-size-fits-all IR; the design necessarily is built to make certain compiler passes (like algebraic optimization) simple at the expense of others (like register allocation), justifying the use of multiple IRs within a compiler. Further, IR design permeates every aspect of the compiler, so IR changes to a mature compiler are difficult. Both Intel and AMD GPU compilers have required ground-up rewrites to fix IR issues, so I was eager to get the Bifrost IR (BIR) right from the beginning.

An IR is simply a set of data structures representing a program. Typically backends use a “flat” IR: a list of blocks, where each block contains a list of instructions. But how do we store an instruction?

We could reuse the hardware’s instruction format as-is, with abstract variables instead of registers. It’s tempting, and for simple architectures, it can work. The initial NIR-to-BIR pass is a bit harder than with abstract instructions, but final code generation is trivial, since the packing is already done.

Unfortunately, real-world instruction sets are irregular and as quirky as their compilers’ developers. Further, they are tightly packed to be small instead of simple. For final assembly, we will always need to pack the machine formats, but with this design, we also need to unpack them. Worse, the machine irregularities will permeate every aspect of the compiler, since they are now embedded into the IR. On Bifrost, for example, most operations have multiple unrelated encodings; this design would force much of the compiler to be duplicated.

So what if the IR is entirely machine-independent, compiling in the abstract and converting to the machine-specific form at the very end? Such IRs are helpful; in Mesa, the machine-independent NIR facilitates sharing of powerful optimizations across drivers. Unfortunately, some machine-specific design traits really do affect our compiler. Is there a compromise?

Notice the first IR simplifies packing at the expense of the rest of the compiler, whereas the second simplifies NIR-to-BIR at the expense of machine-specific passes. All designs trade complexity in one part of the compiler for another. Hence a good IR simplifies the hardest compiler passes at the expense of the easiest. For us, scheduling and register allocation are NP-complete problems requiring complex heuristics, whereas NIR-to-BIR and packing are linear translations with straightforward rules. Ideally, our IR simplifies scheduling and register allocation, pushing the complexity into NIR-to-BIR and packing. For Bifrost, this yields one golden rule:

A single BIR instruction corresponds to a single Bifrost instruction.

While individual instructions may move during scheduling and be rewired by register allocation, the operation itself is unaffected. On the other hand, within an instruction:

BIR instructions are purely abstract without packing.

By delaying packing, we simplify manipulation. So NIR-to-BIR is complicated by the one-to-many mapping of NIR to Bifrost opcodes with special functions; meanwhile, final code generation is complicated by the pattern matching required to infer packing from abstract instructions. But by pushing this complexity to the edges, the optimizations in between are greatly simplified.
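
As a rough sketch of what this means in code (the field names here are hypothetical, not the actual Panfrost definitions), an unpacked BIR-style instruction could look like:

/* One IR instruction corresponds to one machine instruction, but is
 * stored unpacked: opcode, operands and types are plain fields rather
 * than bits in one of the machine's many encodings. */
typedef struct {
    unsigned op;            /* abstract opcode, one per hardware operation */
    unsigned dest;          /* destination variable (register after RA) */
    unsigned src[3];        /* source variables */
    unsigned dest_type;     /* explicit type/size, e.g. 32-bit vs 16-bit float */
    unsigned src_types[3];  /* per-source types; see the 16-bit section below */
} bir_instruction;

Scheduling and register allocation can then move and rewire such instructions freely, and only the final packing pass needs to know about the wire encodings.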

But speaking of IR mistakes, there is one other common issue…

16-bit Support

One oversight I made in designing the Midgard IR – an oversight shared by the IRs of many other compilers in Mesa – is assuming instructions operate on 32-bit data only. In OpenGL with older Mesa versions, this assumption was true, as the default float and integer types are 32-bit. However, the assumption is problematic for OpenCL, where 8-, 16-, and 64-bit types are first class. Even for OpenGL, it is suboptimal. While the specification mandates minimum precision requirements for operations, fulfillable with 32-bit arithmetic, for shaders using mediump precision qualifiers we may use 16-bit arithmetic instead. About a month ago, Mesa landed support for optimizing mediump fragment shaders to use 16-bit arithmetic, so for Bifrost, we want to make sure we can take advantage of these optimizations.

The benefit of reduced precision is two-fold. First, shader computations need to be stored in registers, but space in the register file is scarce, so we need to conserve it. If a shader runs out of registers, it spills to main memory, which is slow, so by using 16-bit types instead of 32-bit ones, we can reduce spilling. Second, although Bifrost is scalar for 32-bit, it is 2-channel vector for 16-bit. As mentioned, automatic vectorization is difficult, but if shaders are vectorized to begin with, the compiler can take advantage of this.

As an example, to add 32-bit vector R0-R3 with R4-R7, we need code like:

ADD.f32 R0, R0, R4
ADD.f32 R1, R1, R5
ADD.f32 R2, R2, R6
ADD.f32 R3, R3, R7

But in 16-bit mode with vectors R0-R1 and R2-R3:

ADD.v2f16 R0, R0, R2
ADD.v2f16 R1, R1, R3

Notice both register usage and instruction count are halved. How do we support this in Mesa? Mesa pipes 16-bitness through NIR into our backend, so we must ensure types are respected across our backend compiler. To do so, we include types explicitly in our backend intermediate representation, which the NIR-to-BIR pass simply passes through from NIR. Certain backend optimizations have to be careful to respect these types, whereas others work with little change provided the IR is well-designed. Scheduling is mostly unaffected. But where are there major changes?

Enter register allocation.

Register allocation

A fundamental problem every backend compiler faces is register allocation, mapping abstract IR variables to concrete machine registers:

0 = load uniform #0
1 = load attribute #0
2 = add 0, 1
3 = mul 2, 2

\(\rightarrow\)

R0 = load uniform #0
R1 = load attribute #0
R0 = add R0, R1
R0 = mul R0, R0

Traditionally, register allocation is modeled by a graph, where each IR variable is represented by a node. Any variables that are simultaneously live, in the sense that both of their values will be read later in the program, are said to interfere since they require separate registers to avoid clobbering each other’s data. Interference is represented by edges. Then the problem reduces to graph colouring, finding colours (registers) for each node (variable) that don’t coincide with the colours (registers) of any nodes connected by edges (interfering variables). Initially, Panfrost used a graph colouring approach.

However, the algorithm above is difficult to adapt to irregular vector architectures. One alternative approach is to model the register file explicitly, modeling interference as constraints and using a constraint solver to allocate registers. For the above program, letting \(k_i\) denote the register allocated to variable \(i\), there is a single constraint on the registers, \(k_0 \ne k_1\), since \((0, 1)\) is the only pair of interfering nodes, yielding an optimal valid solution \(k_0 = 0, k_1 = 1, k_2 = 0, k_3 = 0\), corresponding to the allocation above.

As-is, this approach is mathematically equivalent to graph colouring. However, unlike graph colouring, it extends easily to model vectorization, enabling per-component liveness tracking, so some components of a vector are allocated to a register while others are reused for another value. It also extends easily to vectors of varying type sizes, crucial for 16-bit support, whereas the corresponding generalization for graph colouring is much more complicated.
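
As an illustration of how this generalizes (my formulation, not necessarily the exact model Panfrost uses): treat the register file as an array of 16-bit cells, let \(k_i\) be the first cell assigned to variable \(i\) and \(s_i\) its size in cells (2 for a 32-bit value, 1 for a 16-bit one). Two interfering variables \(i\) and \(j\) must then simply not overlap, i.e. \(k_i + s_i \le k_j\) or \(k_j + s_j \le k_i\). Contiguity for vectors falls out for free, since a vector is just a variable with a larger \(s_i\).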

This work was originally conducted in October for the Midgard compiler, but the approach works just as well for Bifrost. Although Bifrost is conceptually “almost” scalar, there are enough corner cases where we dip into vector structures that a vector-aware register allocator is useful. In particular, 16-bit instructions involve subdividing 32-bit registers, and vectorized input/output requires contiguous registers; both are easily modeled with linear-style constraints.

Packing

The final stage of a backend compiler is packing (final code generation), taking the scheduled, register allocated IR and assembling a final binary. Compared to Midgard, packing for Bifrost is incredibly complicated. Why?

Vectorized programs for vector architectures can be smaller than equivalent programs for scalar architectures. The above instruction sequence to add a 4-channel vector corresponds to just a single instruction on Midgard:

VADD.FADD R0.xyzw, R0.xyzw, R1.xyzw

We would like to minimize program size, since accesses to main memory are slow and increasing the size of the instruction cache is expensive. By minimizing size, smaller caches can be used with better efficiency. Unfortunately, naively scalarizing the architecture by a factor of 4 would appear to inflate program size by 4x, requiring a 4x larger cache for the same performance.

We can do better than simply duplicating instructions. First, by simplifying the vector semantics (since we know most code will be scalar or small contiguous vectors), we eliminate vector write masks and swizzles. But this may not be good enough.

Bifrost goes beyond to save instruction bits in any way possible, since small savings in instruction encoding accumulate quickly in complex shaders. For example, Bifrost separates medium-latency register file accesses from low latency “port” accesses, loading registers into ports ahead of the instruction. There are 64 registers (32-bits each), requiring 6-bits to index a register number. The structure to specify which registers should be loaded looks like:

unsigned port_0 : 5;
unsigned port_1 : 6;

We have 6 bits to encode port 1, but only 5 bits for port 0. Does that mean port 0 can only load registers R0-R31, instead of the full range?

Actually, no - if port 0 is smaller than port 1, the port numbers are as you would expect. But Bifrost has a trick: if port 0 is larger than port 1, then the hardware subtracts 63 from both ports to get the actual port number. In effect, the ordering of the ports is used as an implicit extra bit, storing 12-bits of port data in 11-bits on the wire. (What if the port numbers are equal? Then just reuse the same port!)
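
In compiler terms, the decoding rule just described looks something like this (an illustrative sketch, not the actual Panfrost code):

/* Recover two register numbers from the 11 bits on the wire, using the
 * ordering of the two ports as an implicit extra bit. */
static void decode_ports(unsigned port_0, unsigned port_1,
                         unsigned *reg_0, unsigned *reg_1)
{
    if (port_0 < port_1) {
        /* Ports in order: the values are literal register numbers. */
        *reg_0 = port_0;
        *reg_1 = port_1;
    } else if (port_0 > port_1) {
        /* Ports out of order: subtract from 63 to get the real numbers. */
        *reg_0 = 63 - port_0;
        *reg_1 = 63 - port_1;
    } else {
        /* Equal ports: both accesses share one register. */
        *reg_0 = *reg_1 = port_0;
    }
}

The packer has to implement the inverse mapping, choosing whether to swap the stored values depending on which register numbers need the extended range.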

Similar tricks permeate the architecture, a win for code size but a loss for compiler packing complexity. The good news is that our compiler’s packing code is self-contained and unit tested independent of the rest of the compiler.

Conclusion

Putting it all together, we have the beginnings of a Bifrost compiler, sufficient for the screenshots above. Next will be adding support for more complex instructions and scheduling to support more complex shaders.

Architecturally, Bifrost turns Midgard on its head, ideally bringing performance improvements but rather complicating the driver. While Bifrost support in Panfrost is still early, it’s progressing rapidly. The compiler code required for the screenshots above is all upstreamed, and the command stream code is working its way up. So stay tuned, and happy hacking.

Originally posted on Collabora’s blog

April 15, 2020

I am pleased to announce the KWinFT project and with it the first public release of its major open source offerings KWinFT and Wrapland, drop-in replacements for KDE's window manager KWin and its accompanying KWayland library.

The KWinFT project was founded by me at the beginning of this year with the goal of accelerating development significantly in comparison to KWin. Classic KWin can only be moved with caution, since many people rely on it in their daily computing and there are just as many other stakeholders. In this respect, I anticipated being able to push KWinFT forward in a much more dynamic way, at least for some time.

Over time I refined this goal though, and defined additional objectives to supplement the initial vision and ensure its longevity. One of these has become, in my view, equally important: to provide a sane, modern, well-organized development process, something KWinFT users won't notice directly but will hopefully benefit from indirectly, since it enables the initial goal of rapid-paced development while retaining the overall stability of the product.

What is in there for you

If you are primarily a consumer of Linux graphics technology, this announcement may not seem especially exciting to you. Right now and in the near future the focus of KWinFT is inwards: to formulate and establish great structures for its development process, multiplying all later development efforts.

Examples are continuous integration with different code linters, scheduled and automatic builds and tests, as well as policies to increase developers' effectiveness. This will hopefully accelerate KWinFT's progress in many ways in the future.

But there are already some experimental features in the first release that you might look out for:

  • My rework of KWin's composition pipeline which, according to some early feedback last year, improves the presentation greatly on X11 and Wayland. Additionally, a timer was added to minimize the latency from image creation to its depiction on screen.
  • The Wayland viewporter extension was implemented, enabling better presentation of content, for example for video players, and, with the next major XWayland release, emulated resolution changes for many older games.
  • Full support for output rotation and mirroring in the Wayland session.

How to get KWinFT not later than now

Does this sound interesting to you? You are in luck! The first official release is already available and if you use Manjaro Linux with its unstable branch enabled, you can even try it out right now by installing the kwinft package.

And you may switch back to classic KWin in no time by installing the kwin package again. Your dependencies will be updated in both directions without further ado.

I hope KWinFT and Wrapland will soon become available in other distributions as well. But at this moment I must thank the Manjaro team for making this happen on very short notice.

Here I also want to thank Ilya Bizyaev for testing my builds from time to time in the last few weeks and giving me direct feedback when something needed to be fixed.

Future directions

For the rest of this article let me outline the strategy I will follow in the future with the KWinFT project. I will expand on these goals in upcoming blog posts as time permits.

An optimized development process

I already mentioned that defining and maintaining a healthy development process in KWinFT is an absolute focus objective for me.

This is the basis on which any future development effort can pick up momentum, or, if neglected, will be held back. And that was a huge problem with KWin in the last year; I can say with certainty that I was not the only developer who suffered because of it.

I tried to improve the situation in the past inside KWin, but the larger an organisation and the more numerous the stakeholders become, the more difficult it is for any form of change to manifest.

In open source software development we have the amazing advantage of being able to circumvent this blockade by rebooting a project in a fresh organisational paradigm, in what we call a fork. This has as many risks and challenges as one could expect, so such a decision should not be taken lightly and needs a lot of conviction, preparation and dedication.

I won't go into any more detail here, but I plan to write more articles on what I see as current deficits in KDE's organisation of development processes, how I plan to make sure such issues won't plague KWinFT as well, and in which ways these solutions could be at least partly adopted in KDE too in order to improve the situation there.

And while I don't have much positive to say about the current state of KDE right now, don't forget that KDE is an organisation which has withstood the test of time, with a history reaching back more than 20 years. An organisation that has had a positive impact on many people. Such an organisation must not be slated but fostered and enhanced.

KWinFT with focus on Wayland

The project is called KWinFT because its primary offering right now is the window manager KWinFT, a fork of KWin. The strategy I will follow for this open source offering is centered around developer focus.

The time and resources of open source developers are not limitless, and the window compositor is a central building piece in any desktop, making it a natural point of contention.

One solution is permanently trying to make everyone happy through compromise and consensus. That is the normal pattern in large organisations. Dynamic progress is the opposite of that, instead featuring a trial-and-error approach and the sad reality that sometimes corners must be cut and hearts broken to achieve a greater good.

Both these patterns can be valid approaches in different times and contexts. KDE naturally can only employ the first one, KWinFT is destined to employ the second one.

A major focus for the overall KWinFT project, and in particular its window manager, will be Wayland. That doesn't mean making it just about usable at whatever cost, but putting KWinFT's Wayland session on a rock-solid footing, reworking it as often as needed, grinding it out until it is right, without any compromise.

And boy, does it need that! To say it bluntly, even with the risk of getting cited out of context: in 2020 still KWin and KWayland are a mess.

Sure, the superficial impression improved over time, but there are many deep and fundamental flaws in the architecture whose fixes require one or, better, several developers, and not days or weeks but months of project time. Let me skip the details for now and instead go directly to the biggest offender.

Wrapland in modern C++ and without Qt

Wrapland was forked from KWayland alongside KWinFT. Of course I already knew KWayland's architecture, but there is a difference between knowing and understanding. What I have additionally learned about KWayland's internals in the last few months was shocking. And with the current vision that I follow for Wrapland, I would not call Wrapland a fork anymore, but rather a reboot.

The very first issue ticket I opened in Wrapland was somewhat of a gamble back at that time, but in hindsight quite visionary. The issue asks to remove Wrapland's Qt dependency. When I opened this ticket I wasn't aware of all the puzzle pieces; I couldn't be. But now, two months later, I am more convinced of that goal than ever before.

A C++ library that provides a wrapper for the C-style library libwayland is useful: a C++ library in conformance with the C++ Core Guidelines, leveraging the most current C++ standard. That means in particular no Qt concepts like their moc, no raw pointers everywhere, and no prevailing use of dynamic over static inheritance.

KWinFT and potentially many other applications, from games with nested composition up to the UIs of large industrial machinery, could make great use of a well-designed, unit-tested, battle-hardened C++ Wayland library that employs modern C++ features for type and memory safety to their full extent. And that only covers Wrapland's server part. Although clients often use a complete toolkit for their windowing system integration, it is not hard to envision use cases where more direct access is needed and a C++ library is preferred.

The advantage of leveraging Qt, in comparison, would primarily be the possibility of adding QML bindings. This can be useful, as some interesting applications leveraging QtWayland's server part prove. But it is minuscule in KWinFT's use case. And what the compositor that Wrapland is written against does not make use of cannot be a development objective of this library in the foreseeable future.

I am currently rewriting the server part of Wrapland in this spirit. You can check out the overview ticket that I created for planning the rewrite and the current prototype I am working on. Note that the development is still ongoing on a fundamental level and there might be more iterations necessary before a final structure manifests.

While the remodel of the server part is certainly exciting, and I do plan something similar for the client part, that project will need to wait some more. For now I "only" reworked most of the client library to not leak memory in every second place. This allows unit tests to run on the GitLab CI, for merge requests and on push, in a robust manner. This rework, which also contained fixes for the server part, resulted in a massive merge with 40 commits and over 6000 changed lines.

A beacon of modern technologies

Leveraging modern language features of C++ is one objective, but a far more important one for this project lies in the domain KWinFT was really created for: computer graphics and their presentation and organisation in an optimal way for the user.

But here I declare just a single goal: KWinFT must be at the top of every major development in this domain.

Too often in the past KWin was sidelined, or rather sidelined itself, when new technology emerged, only trying to catch up later on. The state of Wayland on Plasma in 2020 is testament to that. In contrast, KWinFT shall be open to new developments in the larger community and, if manpower permits, spearhead such developments itself, not necessarily as a maverick but in concert with the many other great single- and multi-developer projects on freedesktop.org and beyond. This leads over to the final founding principle of KWinFT.

Open to other communities and technologies

A major goal I pursued last year already as a KWin developer and that I want to expand upon with KWinFT is my commitment to building and maintaining meaningful relations with other open source communities.

Meaningful here means, on one side, on a personal level: like when I attended the X.Org Developer and Wine conferences on two consecutive weekends in October last year in Canada, and, on the way back home, the Gnome Shell meetup in the Netherlands.

But meaningful also means working together, being open to the technologies other projects provide, and trying to increase interoperability instead of locking yourself into your own technology stack.

Of course in this respect the primary address that comes to mind is freedesktop.org with the Wayland and X11 projects. But also working together with teams developing other compositors can be very rewarding.

For example in Wrapland I recently added a client implementation of wlroots' output management protocol. This will allow users of wlroots-based compositors in the future to use KScreen for configuring their outputs.

I want to expand upon that by sharing more protocols and more tools with fellow compositor developers. How about an internal wlroots-based compositor for Wrapland's autotests? This would double-check not only Wrapland's protocol implementations but also wlroots' ones at the same time. If you are interested in designing such a solution in greater detail, check out Wrapland's contributing guideline and get in touch.

April 07, 2020

$ time git lg
git lg  13.34s user 0.87s system 84% cpu 16.845 total

# True by default as of git v2.24
git config --global core.commitGraph true
git config --global gc.writeCommitGraph true

# Command added in git v2.20
git commit-graph write

$ time git lg
git lg  0.72s user 0.14s system 74% cpu 1.154 total

This is a speed-up of ~18x compared to the older versions.

The way this works is that the commit-graph file, stored in the .git/objects/info directory, records the commit graph structure along with some extra metadata to speed up graph walks.

April 04, 2020

This is a follow-up to the kernel support for high-resolution wheel scrolling, which you totally forgot about because it's already more than a year in the past and, seriously, who has the attention span these days to remember this. Anyway, I finally found time and motivation to pick this up again and I started lining up the pieces like cans, for it only to be shot down by the commentary of strangers on the internet. The Wayland merge request lists the various pieces (libinput, wayland, weston, mutter, gtk and Xwayland) but for the impatient there's also a Fedora 32 COPR. For all you weirdos inexplicably not running the latest Fedora, well, you'll have to compile this yourself, just like I did.

Let's recap: in v5.0 the kernel added new axes REL_WHEEL_HI_RES and REL_HWHEEL_HI_RES for all devices. On devices that actually support high-resolution wheel scrolling (Logitech and Microsoft mice, primarily) you'll get multiple hires events before the now-legacy REL_WHEEL events. On all other devices those two are in sync.

Integrating this into the userspace stack was a bit of a mess at first, but I think the solution is good enough, even if it has a rather verbose explanation on how to handle it. The actual patches to integrate ended up being relatively simple. So let's see why it's a bit weird:

When Wayland started, back in WhoahReallyThatLongAgo, scrolling was specified as the wl_pointer.axis event with a value in pixels. This works fine for touchpads, not so much for wheels. The early versions of Weston decreed that one wheel click was 10 pixels [1] and, perhaps surprisingly, the world kept on turning. When libinput was forked from Weston an early change was that wheel events would have two values - degrees of movement and click count ("discrete steps"). The wayland protocol was expanded to include the discrete steps as wl_pointer.axis_discrete as well. Then backwards compatibility reared its ugly head and Mutter, Weston, GTK all basically said: one discrete step equals 10 pixels so we multiply the discrete value by 10 and, perhaps surprisingly, the world kept on turning.

This worked out well enough for a few years, but with high resolution wheels we ran into a problem. Discrete steps are integers, so we can't send partial values. And the protocol is defined in a way that any tweaking of the behaviour would result in broken clients which, perhaps surprisingly, is a Bad Thing. This led to the current proposal of separate events: LIBINPUT_EVENT_POINTER_AXIS_WHEEL and, for Wayland, the wl_pointer.axis_v120 event, linked to above. These events are (like the kernel events) a parallel event stream to the previous events and effectively replace the LIBINPUT_EVENT_POINTER_AXIS and Wayland wl_pointer.axis/axis_discrete pair for wheel events (not so for touchpad or button scrolling though).

The compositor side of things is relatively simple: take the events from libinput and pass the hires ones as v120 events and the lowres ones as v120 events with a value of zero. The client side takes the v120 events and uses them over wl_pointer.axis/axis_discrete unless one is zero, in which case you can discard all axis events in that wl_pointer.frame. Since most client implementations already have support for smooth scrolling (because, well, touchpads do exist) it's relatively simple to integrate - the new events just feed into the smooth scrolling code. And since you already have to do wheel emulation for that (because, well, old clients exist) wheel emulation is handled easily too.
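
To sketch the client side in code (a hypothetical handler; the constant comes from the event's definition, where a value of 120 corresponds to one logical wheel click, hence the name):

#include <stdint.h>

/* Accumulated v120 value for one wheel; 120 equals one logical click. */
static double v120_accum = 0.0;

/* Feed each incoming axis_v120 value in and get back the total scroll
 * distance in (possibly fractional) wheel clicks, ready to hand to the
 * existing smooth-scrolling path. */
static double handle_axis_v120(int32_t value)
{
    v120_accum += value;
    return v120_accum / 120.0;
}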

All that to provide buttery smooth [2] wheel scrolling. Or not, if your hardware doesn't support it. In which case, well, live with the warm fuzzy feeling that someone else has a better user experience now. Or soon, anyway.

[1] with, I suspect, the scientific measurement of "yeah, that seems about alright"
[2] like butter out of a fridge, so still chunky but at least less so than before

April 01, 2020
On the road to libfprint and fprintd 2.0, we've been fixing some long-standing bugs, including one that required porting our PAM module from dbus-glib to sd-bus, systemd's D-Bus library implementation.

As you can imagine, I have confidence in my ability to write bug-free code at the first attempt, but the foresight to know that this code will be buggy if it's not tested (and to know there's probably a bug in the tests if they run successfully the first time around). So we will have to test that PAM module, thoroughly, before and after the port.

Replacing fprintd

First, to make it easier to run and instrument, we needed to replace fprintd itself. For this, we used dbusmock, which is both a convenience Python library and a way to write instrumentable D-Bus services, and wrote a template. There are a number of existing templates for a lot of session and system services, in case you want to test the integration of your code with NetworkManager, low-memory-monitor, or any number of other services.

We then used this to write tests for the command-line utilities, so we can both test our new template and test the command-line utilities themselves.

Replacing gdm

Now that we've got a way to replace fprintd and a physical fingerprint reader, we should write some tests for the (old) PAM module to replace sudo, gdm, or the login authentication services.

Co-workers Andreas Schneider and Jakub Hrozek worked on pam_wrapper, an LD_PRELOAD library to mock the PAM library, and Python helpers to write simple PAM services. This LWN article explains how to test PAM applications and PAM modules.

After fixing a few bugs in pam_wrapper, and combining with the fprintd dbusmock work above, we could wrap and test the fprintd PAM module like it never was before.

Porting to sd-bus

Finally, porting the PAM module to sd-bus was pretty trivial, a loop of 1) writing tests that work against the old PAM module, 2) porting a section of the code (like the fingerprint reader enumeration, or the timeout support), and 3) testing against the new sd-bus based code. The result was no regressions that we could test for.

Conclusion

Both dbusmock and pam_wrapper are useful tools in your arsenal for writing tests, and given the (fairly) easy-to-use CIs in GNOME's and FreeDesktop.org's GitLabs, it would be a shame not to use them.

You might also be interested in umockdev, to mock a number of device types, and mocklibc (which, combined with dbusmock, powers polkit's unattended CI).

March 24, 2020

For the last few months, we have been working on two exciting new projects at Collabora, and it’s finally time to share some information about them with the world:

We are partnering with Microsoft DirectX engineers to build OpenCL and OpenGL mapping layers, in order to bring OpenCL 1.2 and OpenGL 3.3 support to all Windows and DirectX 12 enabled devices out there!

This work builds on a lot of previous work. First and foremost, we are building this by using Mesa 3D, with the Gallium interface as the base for the OpenGL layer, and NIR as the base for the OpenCL compiler. We are also using LLVM and the SPIRV-LLVM-Translator from Khronos as the compiler front-end.

In addition, we are taking advantage of Microsoft’s experience in creating their D3D12 Translation Layer, as well as our own experience from developing Zink.

What is Mesa 3D?

Mesa 3D is an open source implementation of several graphics technologies, including OpenCL and OpenGL. The OpenGL implementation in Mesa is robust and is used as the base for several industry-strength OpenGL drivers from multiple GPU vendors.

Among other things, Mesa consists of several API implementations (called state-trackers) as well as the Gallium low-level driver interface. The Gallium interface hides a lot of the legacy OpenGL details and translates OpenGL calls into something that looks more like modern GPU primitives.

Why translate APIs?

Not all Windows-powered devices have consistent support for hardware-accelerated OpenCL and OpenGL. So in order to improve application compatibility, we are building a generic solution to the problem. This means that a GPU vendor only has to implement a D3D12 driver for their hardware in order to support all three APIs.

This mapping layer is also expected to serve as a starting point in porting older OpenCL and OpenGL applications over to D3D12.

In addition, we believe this is good for the wider open source community. A lot of the problems we are solving here are shared with other drivers and translation layers, and we hope that the code will be useful beyond the use cases listed above.

Implementation

The work is largely split into three parts: an OpenCL compiler, an OpenCL runtime, and a Gallium driver that builds and executes command-buffers on the GPU using the D3D12 API.

In addition, there is a shared NIR-to-DXIL shader compiler that both components use. For those not familiar with NIR, it is Mesa’s internal representation for GPU shaders. Similarly, DXIL is Microsoft’s internal representation, which D3D12 drivers will consume and translate into hardware-specific shaders.

OpenCL compiler

The OpenCL compiler uses LLVM and the SPIRV-LLVM-Translator to generate SPIR-V representations of OpenCL kernels. These, in turn, are passed to Mesa's SPIR-V to NIR translator, where some optimizations and semantic translations are done. The NIR representation is then finally passed to NIR-to-DXIL, which produces a DXIL compute shader and the metadata needed so it can be executed on the GPU by the runtime using D3D12.

Here’s a diagram of the complete process, including NIR-to-DXIL, which will be described below:

OpenCL Compiler Overview

OpenCL runtime

While Mesa provides an OpenCL implementation called Clover, we are not using it for this project. Instead, we have a new OpenCL runtime that does a more direct translation to the DirectX 12 API.

NIR-to-DXIL

DXIL is essentially LLVM 3.7 bitcode with some extra metadata and validation. This was a technical choice that made sense for Microsoft because all the major driver vendors already used LLVM in their compiler toolchain. Using an older version of the LLVM bitcode format gives good compatibility with drivers because the LLVM bitcode format is backwards compatible.

Because we depend on a much more recent version of LLVM for the compiler front-end, we sadly cannot easily use the DirectX Shader Compiler as a compiler back-end. The DirectX Shader Compiler is effectively a fork of LLVM 3.7, and we are currently using LLVM 10.0 for the compiler front-end. Using the DirectX Shader Compiler as the back-end would require us to link two different versions of LLVM into the same binary, which would have led to problems.

We also cannot easily use LLVM itself to generate the bitcode. While the LLVM bitcode format is backwards compatible, LLVM itself is not forward compatible. This means that newer versions of LLVM cannot produce a bitcode format that is understood by older versions. This makes sense from LLVM’s point of view because it was never meant as a general interchange format.

So instead, we have decided to implement our own DXIL emitter. This is quite a bit harder than it looks because LLVM bitcode goes to great lengths to try to make the format as dense as possible. For instance, LLVM does not store its bitcode as a sequence of bytes and words, but rather as variable-width bitfields in a long sequence of bits.
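To give a feel for what that means in practice, here is a toy bit-writer in C. This is my own illustration of the variable-width packing (including LLVM's VBR scheme), not mesa's actual DXIL emitter:

#include <stdint.h>
#include <stddef.h>

struct bitstream {
    uint8_t buf[1024]; /* assumed zero-initialized by the caller */
    size_t bit_pos;    /* write position, in bits, not bytes */
};

/* append the low 'width' bits of 'value' to the stream, LSB first */
static void emit_bits(struct bitstream *bs, uint32_t value, unsigned width)
{
    for (unsigned i = 0; i < width; i++) {
        if (value & (1u << i))
            bs->buf[bs->bit_pos / 8] |= 1u << (bs->bit_pos % 8);
        bs->bit_pos++;
    }
}

/* LLVM's VBR encoding: chunks of (width - 1) payload bits, with the
 * top bit of each chunk signalling whether another chunk follows */
static void emit_vbr(struct bitstream *bs, uint32_t value, unsigned width)
{
    unsigned payload_bits = width - 1;
    uint32_t mask = (1u << payload_bits) - 1;

    do {
        uint32_t chunk = value & mask;
        value >>= payload_bits;
        if (value)
            chunk |= 1u << payload_bits; /* continuation bit */
        emit_bits(bs, chunk, width);
    } while (value);
}

Note how nothing here is byte-aligned: a 6-bit field can start in the middle of one byte and end in the next, which is exactly what makes the format dense and the emitter fiddly.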

There are a lot of tricky details to get right, but in the end we have a compiler that works.

D3D12 Gallium driver

The D3D12 Gallium driver is the last piece of the puzzle. Essentially, it takes OpenGL commands and, with the help of the NIR to DXIL translator, turns them into D3D12 command-buffers, which it executes on the GPU using the D3D12 driver.

There are a lot of interesting details that make this tricky as well, but I will save those details for later.

But to not leave you empty-handed, here’s a screenshot of the Windows version of the famous glxgears, wglgears:

wglgears on DirectX12

Source code

In the short term, the source code can be found here. We intend to upstream this work into the main Mesa repository shortly, so this is not its permanent home.

Next steps

This is just the announcement, and a whole lot of work is left to be done. We have something that works in some cases right now, but we are just starting to scratch the surface.

First of all, we need to get up to the feature level that we target. Our goal at the moment is to pass the conformance tests for OpenCL 1.2 and OpenGL 3.3. We have a long way to go, but with some hard work and sweat, I am sure we will get there.

Secondly, we need to work on application compatibility. For now we will be focusing on productivity applications.

We also want to upstream this in Mesa. This way we can keep up with fixes and new features in Mesa, and other drivers can benefit from what we are doing as well.

Acknowledgments

It is also important to point out that I am not the only one working on this. Our team consists of five additional Collabora engineers (Boris Brezillon, Daniel Stone, Elie Tournier, Gert Wollny, Louis-Francis Ratté-Boulianne) and two Microsoft DirectX engineers (Bill Kristiansen, Jesse Natalie).

March 23, 2020

I finished my last post talking about a patch to fix the XRGB operation in the CRC computation function of VKMS.

[PATCH] drm/vkms: use bitfield op to get xrgb on compute crc

I was waiting for the community's feedback, and I received a review with good suggestions for simplifying the solution. Unfortunately, despite the correct and useful guidance, the comment was really rude. What makes someone give this kind of opinion? This behavior has no benefit for anyone: not the sender, not the receiver, and not the community.

This kind of approach can discourage new developers. I know the community has people far more skilled and experienced than me, but people get older, and anyone can pass away at any time… who will be left to ensure the project's continuity?

In short, unnecessary.

.o/

After I began a simple change to the cursor behavior in VKMS, many related issues started to appear. My initial task seemed simple: sending a proposal to enable the cursor by default when loading the vkms module. However, I have now spent a lot of time untangling issues that kept me from validating that change.

This initial task arose after I asked on the dri-devel IRC channel for suggestions on contributing to VKMS, because I am interested in participating in this year's Google internship program (GSoC). As Rodrigo Siqueira is VKMS's mentor, he suggested this warm-up task. Each day I became more familiar with VKMS, so I could quickly see what I needed to modify, and I proposed a simple change with this patch:

[PATCH] drm/vkms: enable cursor by default

However, as a newbie, I was not confident with just the logical change to the code. I was curious to see this behavior in bits (or something like that). For this, I decided to look for the cursor in the framebuffer. As I had already played with the IGT test kms_cursor_crc, I considered using the framebuffer of this test to see the cursor.

First, according to the context, I checked whether the cursor-related variables held values representing the enabled state, adding a lot of pr_info calls to say: "ok, it looks like it's working". Second, I thought I could find a white cursor in the framebuffer of this test whenever the cursor was enabled, since that is also a requirement for the test to run.

With this in mind, I ran the subtest pipe-A-cursor-alpha-transparent to look for this white cursor and watch it become transparent. Then, after some computation, the transparent cursor plane would be blended with the other planes… but, oops… where is the cursor?

“Well, maybe there is a problem with the subtest. Let's run the pipe-A-cursor-alpha-opaque test that I know works and passes!

Oops, a crash? That's weird!”

Facing this weird situation encouraged me to dive into the elements of the IGT test and understand a little more of what was happening: the abstraction involved, the test execution step by step, searching the web for information about the Cairo operations called by the tests, etc.

I realized that pipe-A-cursor-alpha-transparent checks the CRC of a framebuffer into which a transparent cursor has already been blended over a black background; because of this, it is not possible to find a white cursor at any step of the test, which was my initial intention.

This interpretation matches the output of the pixels printed in my investigation, where there are only two possible ARGB pixel values: transparent black and opaque black. I am still not sure my comprehension is correct, but following this logic helped me find a solution for the pipe-A-cursor-alpha-transparent test and put the other things on a backlog for investigation.

Backlog of issues to treat

  1. Using a VM to run both kms_cursor_crc subtests, pipe-A-cursor-alpha-transparent/opaque, I put some pr_info calls in to check the execution path inside VKMS. With this, I could better understand the process of blending planes and computing the CRC, and also verify whether all the steps are executed when I run a subtest.

From the crash episode with the opaque subtest, I figured out an unstable behavior: I verified that the test fails, and nothing was printed, even with a pr_info inside each VKMS function involved in the process. I asked Siqueira about this problem, and he suggested diving into the hrtimer/vblank operations to check whether something is causing a delay. Here, a code snippet in vkms_composer.c deserves attention and may be generating some delay:

   /* The worker can fall behind the vblank hrtimer, make sure we catch up. */
   while (frame_start <= frame_end)
      drm_crtc_add_crc_entry(crtc, true, frame_start++, &crc32);

I also tried to check the file /sys/kernel/debug/dri/0/crtc-0/crc/data that stores framebuffer and CRC values (as explained by Haneen in this post). However, it was blocked and I was not allowed to see its contents. Together, these problems led me to suspect two possible causes: a missed lock/unlock operation or a long, busy write operation.

  2. Maybe this issue is related to the problem above. Although I found a solution for pipe-A-cursor-alpha-transparent, this subtest still displays a warning that needs attention: Suspicious CRC: All values are 0. The weird thing is that when I print the pixels in the test, they actually are all zero (black: 0x000000) after setting the alpha channel to zero, i.e., converting ARGB to XRGB. But this warning does not appear when running past kernel versions on a host machine.

Could it be some debugfs implementation failure or some problem in the construction of the IGT test? I hope to have good news in the next post update.

Finding a solution for pipe-A-cursor-alpha-transparent crash

I needed a strategy to solve these side problems and focus on validating the test's success after changing the operation. Thus, I decided to check how the code works for both the transparent and the opaque cursor, bringing in the idea of complementarity. To do this, I printed pixel values before and after the XRGB operation. This concern arose because, after several executions, I realized that most of the proposed solutions fall into a trap related to mixing orders of magnitude (bits vs. bytes). In my opinion, the ideal solution would prioritize legibility, extracting the RGB values without worrying about interpreting small or large expressions and magnitudes.

Drawing on previous experience contributing to IIO, I thought about using bitwise/bitfield operations to guarantee the interpretation in bits and extract only the RGB bits of interest.

With that, I defined a GENMASK that keeps only the first 24 bits (from right to left) and extracted those bits using the FIELD_GET macro defined in linux/bitfield.h. I learned a little about this operation in a patch I sent a year ago to improve the readability of an IIO staging driver: staging: iio: ad7150: use FIELD_GET and GENMASK. And so my solution was born:

[PATCH] drm/vkms: use bitfield op to get xrgb on compute crc
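For illustration, the core of the idea looks something like this (a minimal kernel-style sketch with a made-up function name, not the exact code from the patch):

/* Strip the alpha channel from an ARGB8888 pixel by keeping only
 * bits 0-23, i.e. the R, G and B bytes. */
#include <linux/bitfield.h>
#include <linux/bits.h>
#include <linux/types.h>

#define RGB_CHANNELS_MASK GENMASK(23, 0) /* hypothetical name */

static u32 argb_to_xrgb(u32 argb_pixel)
{
	/* FIELD_GET() masks off the alpha byte and shifts by the mask's
	 * lowest set bit (zero here), leaving only the RGB payload */
	return FIELD_GET(RGB_CHANNELS_MASK, argb_pixel);
}

The point of GENMASK/FIELD_GET over hand-rolled shifts or memset tricks is exactly the legibility argument above: the mask says in bits which channels survive, with no byte-versus-bit arithmetic to second-guess.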

I am still waiting for community evaluation and feedback. And now I have new knowledge and a lot of new challenges.

March 20, 2020

Benjamin Tissoires and I have been busy anthophila and working on the freedesktop CI templates. This post is primarily of interest if you're working on GitLab, specifically if your repo is hosted on gitlab.freedesktop.org. If either of those applies, prepare to be distracted from the current pandemic, otherwise maybe just prepare to be entertained. I'll do my best to be less miserable than the news.

We all know that CI/CD really helps with finding bugs early. If you don't know that yet, insert a jedi handwave before the previous sentence and now you do. GitLab is the git forge now used by freedesktop.org and it comes with a built-in CI system. I'm leaving out the difficult bits such as actually setting the thing up because this is obviously all handled by Heinzelmännchen and just readily available, hooray. I'm also going to assume that you roughly know how to write GitLab CI jobs or, failing that, at least know how to read YAML without screaming. So for this post, we start with the basic problem that your .gitlab-ci.yml is getting unwieldy, repetitive or generally just kinda sucks to maintain. Which is roughly where libinput and libevdev were a while back which caused Benjamin to start the ci-templates.

Now, what do we want? (other than a COVID-19 cure) Reproducible tests, possibly on different distributions, with the same base system across tests. For my repos the goal was basically "test on the common distributions to catch certain bugs early". [1] For Mesa, the requirement is closer to "have a fixed set of images that 'never' change so tests are reproducible". Both goals have much in common.

Your first venture into CI will look like this:


myjob:
  image: fedora:31
  before_script:
    - dnf update -y
    - dnf install -y onepackage twopackage threepackage floor
  script:
    - meson builddir && ninja -C builddir test
So, in short: take a Fedora 31 docker image, update it [2], install the required packages and then run the actual test part - meson and ninja. Easy.

This works fine but it takes approximately forever because dnf update is slow and you're potentially pulling down gigs of packages on every test run. Which is fun, but less so when you have 10 different jobs and they all do that. So let's call this step 1 and pretend we're more advanced than that. Step 2 is where you start building an image you re-use, steps 3 to N are the bits where you learn more than you want to know about docker, podman, skopeo and how many typos you can put into a YAML file. So, ad break, and we jump right to the part where enlightenment is just around the corner or wherever enlightenment lurks these days.

Using the CI Templates

Here's the .gitlab-ci.yml to build a Fedora 31 image with ci-templates and run the test on that image:


include:
  - project: 'freedesktop/ci-templates'
    ref: 123456deadbeef
    file: '/templates/fedora.yml'

variables:
  # project name of the upstream repo
  FDO_UPSTREAM_REPO: someproject/name

stages:
  - prep
  - test

myimage:
  extends: .fdo.container-build@fedora
  stage: prep
  variables:
    FDO_DISTRIBUTION_VERSION: '31'
    FDO_DISTRIBUTION_PACKAGES: 'onepackage twopackage threepackage floor'
    FDO_DISTRIBUTION_TAG: '2020-03-20.0'

myjob:
  extends: .fdo.distribution-image@fedora
  stage: test
  script:
    - meson builddir && ninja -C builddir test
  variables:
    FDO_DISTRIBUTION_VERSION: '31'
    FDO_DISTRIBUTION_TAG: '2020-03-20.0'
Now, you guessed correctly that the .fdo and FDO_ prefixes are used by the templates. There is a bunch of stuff hidden here. Basically, this will:
  • check if the image exists in your personal project's registry and use that, but if not
  • check if the image exists in the given upstream project's registry and use that, but if not
  • create a Fedora 31 image with the given packages installed and pushes it with the tag to the registry
  • use that image (whether newly created or pre-existing) and run the tests on it
There are a few more details too, but that's roughly the summary of it. For existing tags, the myimage job effectively becomes a noop and the myjob job will re-use the image. The image will be in your registry so you can podman run it locally to reproduce a bug.
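For example, something along these lines should drop you into the image locally (the exact registry path depends on your project and distribution; this one is made up to match the snippet above):

$ podman run --rm -it registry.freedesktop.org/someproject/name/fedora/31:2020-03-20.0 bash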

To build a new image, simply change the tag. Either because you want newer packages or you need extra (or fewer) packages. And the nice thing here: you will build a new image as part of your merge request and run the CI against that new image. But upstream and every other MR will keep using the old image - right up until your MR is merged, at which point every (future) MR will use that new updated image.

Want to build a Debian Stretch image? Replace fedora and 31 with debian and stretch. The same goes for Ubuntu, CentOS, Alpine and Arch, though for the latter two you don't need a version number.

Templating the templates

"But, but, Peter, I want to test on eleventy different distribution like you do" I hear you say. Well, fear not, for this is where the ci-fairy comes in. How about we *gasp* generate the .gitlab-ci.yml file from a base configuration? That can't possibly be a bad idea, so let's do that! First, we save our configuration into the .gitlab-ci/config.yml:


distributions:
  - name: fedora
    tag: 12345
    version: 30
  - name: ubuntu
    tag: abcde
    version: '19.10'
  # and so on, and so forth

packages:
  - curl
  - wget
  - gcc
There is no specific requirement on the structure of the config file, ci-fairy simply loads it and passes it to Jinja2. Your template could thus look like this .gitlab-ci/ci.template file:

include:
{% for d in distributions %}
  - project: 'freedesktop/ci-templates'
    ref: 123456deadbeef
    file: '/templates/{{d.name}}.yml'
{% endfor %}

stages:
  - prep
  - test

{% for d in distributions %}

.{{d.name}}.{{d.version}}:
  variables:
    FDO_DISTRIBUTION_VERSION: '{{d.version}}'
    FDO_DISTRIBUTION_TAG: '{{d.tag}}'

myimage.{{d.name}}.{{d.version}}:
  extends:
    - .fdo.container-build@{{d.name}}
    - .{{d.name}}.{{d.version}}
  stage: prep
  variables:
    FDO_DISTRIBUTION_PACKAGES: "{{' '.join(packages)}}"

myjob.{{d.name}}.{{d.version}}:
  extends:
    - .fdo.distribution-image@{{d.name}}
    - .{{d.name}}.{{d.version}}
  stage: test
  script:
    - meson builddir && ninja -C builddir
{% endfor %}
And to locally generate our .gitlab-ci.yml, all we need to do is

$ pip3 install git+http://gitlab.freedesktop.org/freedesktop/ci-templates
$ cd path/to/project
$ ci-fairy generate-template
$ ci-fairy lint # checks the resulting YAML for syntax errors
$ git commit .gitlab-ci.yml
And, for reference, the file we generated here looks like this:

include:
  - project: 'freedesktop/ci-templates'
    ref: 123456deadbeef
    file: '/templates/fedora.yml'
  - project: 'freedesktop/ci-templates'
    ref: 123456deadbeef
    file: '/templates/ubuntu.yml'

stages:
  - prep
  - test

.fedora.30:
  variables:
    FDO_DISTRIBUTION_VERSION: '30'
    FDO_DISTRIBUTION_TAG: '12345'

myimage.fedora.30:
  extends:
    - .fdo.container-build@fedora
    - .fedora.30
  stage: prep
  variables:
    FDO_DISTRIBUTION_PACKAGES: "curl wget gcc"

myjob.fedora.30:
  extends:
    - .fdo.distribution-image@fedora
    - .fedora.30
  stage: test
  script:
    - meson builddir && ninja -C builddir

.ubuntu.19.10:
  variables:
    FDO_DISTRIBUTION_VERSION: '19.10'
    FDO_DISTRIBUTION_TAG: 'abcde'

myimage.ubuntu.19.10:
  extends:
    - .fdo.container-build@ubuntu
    - .ubuntu.19.10
  stage: prep
  variables:
    FDO_DISTRIBUTION_PACKAGES: "curl wget gcc"

myjob.ubuntu.19.10:
  extends:
    - .fdo.distribution-image@ubuntu
    - .ubuntu.19.10
  stage: test
  script:
    - meson builddir && ninja -C builddir
Aside from the templating, a new thing here is the (e.g.) .fedora.30 template that we extend from. This is an easy way to avoid having to set things like the distribution version and the tag multiple times. And a few things of note: the tag is job-specific (not distribution-specific), so you could have two Fedora 30 images with two different tags. This is also just an example I typed out; a real-world .gitlab-ci.yml will look more complex and different. So only rely on the above to get an idea of what's possible.

A word for non-gitlab.freedesktop.org users: You can use the remote: include directive to use the templates from elsewhere. ci-fairy isn't tied to freedesktop.org either but you'll have to provide more flags to get what you want instead of relying on the default behaviours.

The documentation for CI Templates has more, go and peruse my pretties.

[1] For months the CI was basically just a build test because I couldn't run the test suite in a container
[2] Updating isn't always required but sooner or later you run into a dependency issue if you don't

March 16, 2020

This article isn’t about anything “new”, like the previous ones on AppStream – it rather exists to shine the spotlight on a feature I feel is underutilized. From conversations it appears that the reason simply is that people don’t know that it exists, and of course that’s a pretty bad reason not to make your life easier 😉

Mini-Disclaimer: I’ll be talking about appstreamcli, part of AppStream, in this blogpost exclusively. The appstream-util tool from the appstream-glib project has a similar functionality – check out its help text and look for appdata-to-news if you are interested in using it instead.

What is this about?

AppStream permits software projects to add release information to their MetaInfo files to describe current and upcoming releases. This feature has the following advantages:

  • Distribution-agnostic format for release descriptions
  • Provides versioning information for bundling systems (Flatpak, AppImage, …)
  • Release texts are short and end-user-centric, not technical as the ones provided by distributors usually are
  • Release texts are fully translatable using the normal localization workflow for MetaInfo files
  • Releases can link artifacts (built binaries, source code, …) and have additional machine-readable metadata e.g. one can tag a release as a development release

The disadvantage of all this is that humans have to maintain the release information. Also, people need to write XML for this. Of course, once humans are involved with any technology, things get a lot more complicated. That doesn't mean we can't make things easier for people to use though.

Did you know that you don’t actually have to edit the XML in order to update your release information? To make creating and maintaining release information as easy as possible, the appstreamcli utility has a few helpers built in. And the best thing is that appstreamcli, being part of AppStream, is available pretty ubiquitously on Linux distributions.

Update release information from NEWS data

The NEWS file is a not-very-well-defined text file that lists "user-visible changes worth mentioning" for each version. This maps pretty well to what AppStream release information should contain, so let's generate the latter from a NEWS file!

Since the news format is not defined, but we need to parse this somehow, the amount of things appstreamcli can parse is very limited. We support a format in this style:

Version 0.2.0
~~~~~~~~~~~~~~
Released: 2020-03-14

Notes:
 * Important thing 1
 * Important thing 2

Features:
 * New/changed feature 1
 * New/changed feature 2 (Author Name)
 * ...

Bugfixes:
 * Bugfix 1
 * Bugfix 2
 * ...

Version 0.1.0
~~~~~~~~~~~~~~
Released: 2020-01-10

Features:
 * ...

When parsing a file like this, appstreamcli will allow a lot of errors/"imperfections" and account for quite a few style and string variations. You will need to check whether this format works for you. You can see it in use in appstream itself and in libxmlb for a slightly different style.

So, how do you convert this? We first create our NEWS file, e.g. with this content:

Version 0.2.0
~~~~~~~~~~~~~~
Released: 2020-03-14

Bugfixes:
 * The CPU no longer overheats when you hold down spacebar

Version 0.1.0
~~~~~~~~~~~~~~
Released: 2020-01-10

Features:
 * Now plays a "zap" sound on every character input

For the MetaInfo file, we of course generate one using the MetaInfo Creator. Then we can run the following command to get a preview of the generated file: appstreamcli news-to-metainfo ./NEWS ./org.example.myapp.metainfo.xml - Note the single dash at the end – this is the explicit way of telling appstreamcli to print something to stdout. This is what the result looks like:

<?xml version="1.0" encoding="utf-8"?>
<component type="desktop-application">
  [...]
  <releases>
    <release type="stable" version="0.2.0" date="2020-03-14T00:00:00Z">
      <description>
        <p>This release fixes the following bug:</p>
        <ul>
          <li>The CPU no longer overheats when you hold down spacebar</li>
        </ul>
      </description>
    </release>
    <release type="stable" version="0.1.0" date="2020-01-10T00:00:00Z">
      <description>
        <p>This release adds the following features:</p>
        <ul>
          <li>Now plays a "zap" sound on every character input</li>
        </ul>
      </description>
    </release>
  </releases>
</component>

Neat! If we want to save this to a file instead, we just exchange the dash for a filename. And maybe we don't want to add all releases of the past decade to the final XML? That's no problem either: just pass the --limit flag as well: appstreamcli news-to-metainfo --limit=6 ./NEWS ./org.example.myapp.metainfo.tmpl.xml ./result/org.example.myapp.metainfo.xml

That’s nice on its own, but we really don’t want to do this by hand… The best way to ensure the MetaInfo file is updated, is to simply run this command at build time to generate the final MetaInfo file. For the Meson build system you can achieve this with a code snippet like below (but for CMake this shouldn’t be an issue either – you could even make a nice macro for it there):

ascli_exe = find_program('appstreamcli')
metainfo_with_relinfo = custom_target('gen-metainfo-rel',
    input : ['./NEWS', 'org.example.myapp.metainfo.xml'],
    output : ['org.example.myapp.metainfo.xml'],
    command : [ascli_exe, 'news-to-metainfo', '--limit=6', '@INPUT0@', '@INPUT1@', '@OUTPUT@']
)

In order to also translate releases, you will need to add this to your .pot file generation workflow, so (x)gettext can run on the MetaInfo file with translations merged in.

Release information from YAML files

Since parsing a "no structure, somewhat human-readable file" is hard without baking an AI into appstreamcli, there is also a second option available: generate the XML from a YAML file. YAML is easy to write for humans, but can also be parsed by machines. The YAML structure used here is specific to AppStream, but somewhat maps to the NEWS file contents as well as MetaInfo file data. That makes it more versatile, but in order to use it, you will need to opt into using YAML for writing news entries. If that's okay for you to consider, read on!

A YAML release file has this structure:

---
Version: 0.2.0
Date: 2020-03-14
Type: development
Description:
- The CPU no longer overheats when you hold down spacebar
- Fixed bugs ABC and DEF
---
Version: 0.1.0
Date: 2020-01-10
Description: |-
  This is our first release!

  Now plays a "zap" sound on every character input

As you can see, the release date has to be an ISO 8601 string, just as it is for NEWS files. Unlike in NEWS files, releases can be marked as either stable or development by specifying a Type field. If no Type field is present, stable is implicitly assumed. Each release has a description, which can either be a free-form multi-paragraph text or a list of entries.

Converting the YAML example from above is as easy as using the exact same command that was used before for plain NEWS files: appstreamcli news-to-metainfo --limit=6 ./NEWS.yml ./org.example.myapp.metainfo.tmpl.xml ./result/org.example.myapp.metainfo.xml If appstreamcli fails to autodetect the format, you can help it by specifying it explicitly via the --format=yaml flag. This command would produce the following result:

<?xml version="1.0" encoding="utf-8"?>
<component type="console-application">
  [...]
  <releases>
    <release type="development" version="0.2.0" date="2020-03-14T00:00:00Z">
      <description>
        <ul>
          <li>The CPU no longer overheats when you hold down spacebar</li>
          <li>Fixed bugs ABC and DEF</li>
        </ul>
      </description>
    </release>
    <release type="stable" version="0.1.0" date="2020-01-10T00:00:00Z">
      <description>
        <p>This is our first release!</p>
        <p>Now plays a "zap" sound on every character input</p>
      </description>
    </release>
  </releases>
</component>

Note that the 0.2.0 release is now marked as development release, a thing which was not possible in the plain text NEWS file before.

Going the other way

Maybe you like writing XML, or have some other tool that generates the MetaInfo XML, or you have received your release information from some other source and want to convert it into text. AppStream also has a tool for that! Using appstreamcli metainfo-to-news <metainfo-file> <news-file> you can convert a MetaInfo file that has release entries into a text representation. If you don’t want appstreamcli to autodetect the right format, you can specify it via the --format=<text|yaml> switch.

Future considerations

The release handling is still not something I am entirely happy with. For example, the release information has to be written and translated at release time of the application. For some projects, this workflow isn’t practical. That’s why issue #240 exists in AppStream which basically requests an option to have release notes split out to a separate, remote location (and also translations, but that’s unlikely to happen). Having remote release information is something that will highly likely happen in some way, but implementing this will be a quite disruptive, if not breaking change. That is why I am holding this change back for the AppStream 1.0 release.

In the meantime, besides improving the XML form of release information, I also hope to support a few more NEWS text styles, if they can be autodetected. The format of the systemd project may be a good candidate. The YAML release-notes format variant will also receive a few enhancements, e.g. for specifying a release URL. For all of these things, I very much welcome pull requests or issue reports. I can implement and maintain best the things I use myself, so if I don't use something or don't know about a feature many people want, I won't suddenly implement it or start adding features at random because "they may be useful". That would be a recipe for disaster. This is why, for these features in particular, contributions from people who are using them in their own projects or want their new use case represented are very welcome.

March 07, 2020

This year’s FOSDEM conference was a lot of fun – one of the things I always enjoy most about this particular conference (besides having some of the outstanding food you can get in Brussels and meeting with friends from the free software world) is the ability to meet a large range of new people who I wouldn’t usually have interacted with, or getting people from different communities together who otherwise would not meet in person as each bigger project has their own conference (for example, the amount of VideoLAN people is much lower at GUADEC and Akademy compared to FOSDEM). It’s also really neat to have GNOME and KDE developers within reach at the same place, as I care about both desktops a lot.

An unexpected issue

This blog post however is not about that. It’s about what I learned when talking to people there about AppStream, and the outcome of that. Especially when talking to application authors but also to people who deal with larger software repositories, it became apparent that many app authors don’t really want to deal with the extra effort of writing metadata at all. This was a bit of a surprise to me, as I thought that there would be a strong interest for application authors to make their apps look as good as possible in software catalogs.

A bit less surprising was the fact that people apparently don’t enjoy reading a large specification, reading a long-ish intro guide with lots of dos and don’ts or basically reading any longer text at all before being able to create an AppStream MetaInfo/AppData file describing their software.

Another common problem seems to be that people don’t immediately know what a “reverse-DNS ID” is, the format AppStream uses for uniquely identifying each software component. So naturally, people either have to read about it again (bah, reading! 😜) or make something up, which occasionally is wrong and not the actual component-ID their software component should have.
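To make the concept concrete, here is a toy C sketch of how such an ID can be derived from a homepage domain and an application name; the function name and the fixed-size buffers are made up for this illustration (MetaInfo Creator itself is written in TypeScript):

#include <stdio.h>
#include <string.h>

/* e.g. "example.com" + "MyApp" -> "com.example.MyApp" */
static void make_rdns_id(const char *domain, const char *app,
                         char *out, size_t len)
{
    char tmp[256];
    char *labels[16];
    int n = 0;

    /* split the domain into its dot-separated labels */
    snprintf(tmp, sizeof(tmp), "%s", domain);
    for (char *tok = strtok(tmp, "."); tok && n < 16; tok = strtok(NULL, "."))
        labels[n++] = tok;

    /* join the labels back in reverse order, then append the app name */
    out[0] = '\0';
    for (int i = n - 1; i >= 0; i--) {
        strncat(out, labels[i], len - strlen(out) - 1);
        strncat(out, ".", len - strlen(out) - 1);
    }
    strncat(out, app, len - strlen(out) - 1);
}

int main(void)
{
    char id[256];
    make_rdns_id("example.com", "MyApp", id, sizeof(id));
    printf("%s\n", id); /* prints: com.example.MyApp */
    return 0;
}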

The MetaInfo Creator

It was actually suggested to me twice that what people really would like to have is a simple tool to put together a MetaInfo file for their software. Basically a simple form with a few questions which produces the final file. I always considered this a “nice to have, but not essential” feature, but now I was convinced that this actually has a priority attached to it.

So, instead of jumping into my favourite editor and writing a bunch of C code to create this “make MetaInfo file” form as part of appstreamcli, this time I decided to try what the cool kids are doing and make a web application that runs in your browser and creates all metadata there.

So, behold the MetaInfo Creator! If you click this link, you will end up at an Angular-based web application that will let you generate MetaInfo/AppData files for a few component-types simply by answering a set of questions.

The intent was to make this tool as easy to use as possible for someone who basically doesn’t know anything about AppStream at all. Therefore, the tool will:

  • Generate a rDNS component-ID suggestion automatically based on the software’s homepage and name
  • Fill out default values for anything it thinks it has enough data for
  • Show short hints for what values we expect for certain fields
  • Interactively validate the entered value, so people know immediately when they have entered something invalid
  • Produce a .desktop file as well for GUI applications, if people select the option for it
  • Show additional hints about how to do more with the metadata
  • Create some Meson snippets as pointers how people can integrate the MetaInfo files into projects using the Meson build system

For the Meson feature, the tool simply cannot generate a "use this and be done" script, as each Meson snippet needs to be adjusted for the individual project. So this option is disabled by default, but when enabled, a few simple Meson snippets will be produced which can easily be adjusted for the project they should be part of.

The tool currently does not generate any release information for a MetaInfo file at all; this may be added in the future. The initial goal was to get people to create a MetaInfo file in the first place; having projects also ship release details would be the icing on the cake.

I hope people find this project useful and use it to create better MetaInfo files, so distribution repositories and Flatpak repos look better in software centers. Also, since MetaInfo files can be used to create an “inventory” of software and to install missing stuff as-needed, having more of them will help to build smarter software managers, create smaller OS base installations and introspect what software bundles are made of easily.

I welcome contributions to the MetaInfo Creator! You can find its source code on GitHub. This is my first web application ever, the first time I wrote TypeScript and the first time I used Angular, so I’d bet a veteran developer more familiar with these tools will cringe at what I produced. So, scratch that itch and submit a PR! 😉 Also, if you want to create a form for a new component type, please submit a patch as well.

C developer’s experience notes for Angular, TypeScript, NodeJS

This section is just to ramble a bit about random things I found interesting as a developer who mostly works with C/C++ and Python and stepped into the web-application developer’s world for the first time.

For a project like this, I would usually have gone with my default way of developing something for the web: creating a Flask-based application in Python. I really love Python and Flask, but of course using them would have meant that all processing would have had to be done on the server. On the one hand I could have used libappstream that way to create the XML, format it and validate it, but on the other hand I would have had to host the Python app on my own server, find a place at Purism/Debian/GNOME/KDE or get it housed at Freedesktop somehow (which would have taken a while to arrange) – and I really wanted to have a permanent location for this application immediately. Additionally, I didn't want people to send the details of new unpublished software to my server.

TypeScript

I must say that I really like TypeScript as a language compared to JavaScript. It is not really revolutionary (I looked into Dart and other ways to compile $stuff to JavaScript first), but it removes just enough JavaScript weirdness to be pleasant to use. At the same time, since TS is a superset of JS, JavaScript code is valid TypeScript code, so you can integrate with existing JS code easily. Picking TS up took me much less than an hour, and most of its features you learn organically when working on a project. The optional type-safety is a blessing and actually helped me a few times to find an issue. It being so close to JS is both a strength and weakness: On the one hand you have all the JS oddities in the language (implicit type conversion is really weird sometimes) and have to basically refrain from using them or count on the linter to spot them, but on the other hand you can immediately use the massive amount of JavaScript code available on the web.

Angular

The Angular web framework took a few hours to pick up – there are a lot of concepts to understand. But ultimately, it’s manageable and pretty nice to use. When working at the system level, a lot of complexity is in understanding how the CPU is processing data, managing memory and using the low-level APIs the operating system provides. With the web application stuff, a lot of the complexity for me was in learning about all the moving parts the system is comprised of, what their names are, what they are, and what works with which. And that is not a flat learning curve at all. As C developer, you need to know how the computer works to be efficient, as web developer you need to know a bunch of different tools really well to be productive.

One thing I am still a bit puzzled about is the amount of duplicated HTML templates my project has. I haven’t found a way to reuse template blocks in multiple components with Angular, like I would with Jinja2. The documentation suggests this feature does not exist, but maybe I simply can’t find it or there is a completely different way to achieve the same result.

NPM Ecosystem

The MetaInfo Creator application ultimately doesn’t do much. But according to GitHub, it has 985 (!!!) dependencies in NPM/NodeJS. And that is the bare minimum! I only added one dependency myself to it. I feel really uneasy about this, as I prefer the Python approach of having a rich standard library instead of billions of small modules scattered across the web. If there is a bug in one of the standard library functions, I can submit a patch to Python where some core developer is there to review it. In NodeJS, I imagine fixing some module is much harder.

That being said though, using npm is actually pretty nice – there is a module available for most things, and adding a new dependency is easy. NPM will also manage all the details of your dependency chain, GitHub will warn about security issues in modules you depend on, etc. So, from a usability perspective, there isn’t much to complain about (unlike with Python, where creating or using a module ends up as a “fight the system” event way too often and the question “which random file do I need to create now to achieve what I want?” always exists. Fortunately, Poetry made this a bit more pleasant for me recently).

So, tl;dr for this section: The web application development excursion was actually a lot of fun, and I may make more of those in future, now that I learned more about how to write web applications. Ultimately though, I enjoy the lower-level software development and backend development a bit more.

Summary

Check out the MetaInfo Creator and its source code, if you want to create MetaInfo files for a GUI application, console application, addon or service component quickly.

March 04, 2020

In this post, I talk a little about my first steps introducing myself to the DRM community, a Linux subsystem in which I am looking to learn more and develop skills.

In my opinion, sending contributions is the best way to start fitting in. We can also ask questions or discuss an idea, but since nobody there knows you yet, showing your potential and your interest in practice leads to a more fluid relationship.

Well, I can say that a safe way to start contributing code to a subsystem with which we are barely acquainted is by sending code style improvement patches. They are simple patches, and many believe they have no relevance or do not translate into technical capacity. For me, they don't reflect technical ability; however, they break the first barriers of submission and communication, and make it possible to gradually learn how to modify a large volume of code.

With that in mind, I tried to contribute in three ways:

  1. some code style improvements;
  2. removing unused code; and
  3. fixing bugs

Improving code style

I can see two direct contributions generated by this kind of patch: getting rid of warnings (making room to look at relevant problems), and improving code readability, which has a lot of good consequences for working together remotely.

A good way to discover code style problems is using checkpatch.pl. Even better is using kworkflow to guide you, since the tool has an option to check code style and to discover the maintainers responsible for a file (in case you are sending contributions).

So, I sent a patchset to clean up two functions in an AMD file that is full of code style problems:

[PATCH v2 0/2] drm/amd/display: dc_link: cleaning up some code style issues
[PATCH v2 1/2] drm/amd/display: dc_link: code clean up on enable_link_dp function
[PATCH v2 2/2] drm/amd/display: dc_link: code clean up on detect_dp function

Examples of code style issues include lines longer than 80 characters, comparisons to NULL in conditional clauses, and the alignment of parentheses and indentation.

These simple patches called my attention to the branch I was using for development. Since the file is maintained by the AMD team, instead of using the drm-misc repository on the drm-misc-next branch, I should have based my work on Alex Deucher's repository (agd5f/amdgpu). When I asked him on the IRC channel (#dri-devel) for the right branch to send my contribution to, he said: drm-next. So I needed to rebase my patch from drm-misc onto the right branch.

Removing unused code

The 2017 Linux Kernel Development Report declared:

The kernel has grown steadily since its first release in 1991, when there were only about 10,000 lines of code. At almost 25 million lines (up from nearly 22 million), the kernel is almost three million lines larger than it was at the time of the previous version of this report.

Someone might think "Wow, that's great!", but it is important to notice that large size is not directly related to power or potential. Lines of code need maintenance, which in turn needs skilled developers and organization/management. So unused code seems inoffensive, but it is not: it affects maintainability and readability. Think of a house and how terrible it is to live with a lot of trash or things that you no longer use.

Sticking with the house analogy: when there is a messy room in your house, it is hard to clean up and decide what stays and what goes. So, to get rid of unused code, you need to check whether someone still needs it. In a large project like Linux, there are a lot of files and people involved. So first check the files where the code appears, then send the removal as a patch, and wait to see if someone claims it is still necessary.

Then, I sent a patch to remove an entire function from another AMD file.

[PATCH] drm/amd/display: dcn20: remove an unused function

This was a suggestion from Siqueira. However, I also looked for occurrences of this function in the whole project and in the same file; as I found nothing, I removed it and sent the patch. Done!

Attempt to fix bugs

For me, fixing bugs is a small and localized task, but very challenging. Often, it also requires a breadth of knowledge. Besides, it is a task that, when completed, brings satisfaction and some confidence.

I have tried to solve two mapped problems on vkms: a bug reported by the syzkaller and a bug found by an IGT test.

Bug reported by syzkaller

I sent an e-mail to the dri-devel mailing list asking for help. In this e-mail, I described my attempts:

I tried to reproduce a syzkaller bug found in vkms ("WARNING in vkms_gem_free_object"); however, I was not very successful in this task.

Looking at the bug history at https://syzkaller.appspot.com/bug?extid=e7ad70d406e74d8fc9d0, it seems like the bug still exists.

For testing, I used a VM (QEMU) with Debian 10 with a compiled kernel from https://cgit.freedesktop.org/drm/drm-misc (branch drm-misc-next)

  1. Using the usual .config for my VM, I compiled and installed the kernel and, as root, ran the C program provided by syzkaller: 1

  2. Then, I checked the debug/panic/hacking/drm/i915 debugging/vkms settings on the .config reported by syzkaller: 2 and enabled the same things in my .config. I compiled and installed the kernel and ran the C program. Nothing happened.

  3. So, I reverted my current branch to the commit that generated the bug (as reported: 94e2ec3f7fef86506293a448273b2b4ee21e6195) and used the kernel in that state. Nothing happened.

  4. I decided to use the syzkaller .config without modifications and adaptations for my VM (although I didn't think it felt right). I compiled, installed… some boot problems happened, but the kernel worked. I ran the C program and nothing happened.

Daniel Vetter replied by saying that he also had no experience with syzkaller bug reports and that perhaps it would be better for me to seek guidance from the syzkaller community. I also asked Rodrigo Siqueira and he also had no previous experience in this.

And then I gave up… for a while :)

IGT test failure

So I moved on to the next bug: a failure to compute the CRC with a transparent cursor. The kms_cursor_crc subtest (pipe-A-cursor-alpha-transparent) that checks the CRC when the cursor is transparent was failing, and everything indicated a problem around the XRGB operation that ignored the alpha channel using the C memset function.

This bug took me a lot of time: I had to comprehend the VKMS operations of drawing the cursor, blending the planes and computing the CRC, and also understand the steps of this IGT subtest up to its assertions. See more about this situation in the next post.

February 20, 2020

libinput 1.15.1 had a new feature: it matched the expected touch count with the one actually seen as opposed to the one advertised by the kernel. That is good news for ALPS devices whose kernel driver lies about their capabilities because these days who doesn't. However, in some cases that feature had the side-effect of reducing the touch count to zero - meaning libinput would ignore any touch. This caused a slight UX degradation.

After a bit of debugging and/or cursing, the issue was identified as a libevdev issue, specifically - the way libevdev replays events after a SYN_DROPPED event. And after several days of fixing things, adding stuff to the CI and adding meson support for libevdev so the CI can actually run a few useful things, it's time for a blog post to brain-dump and possibly entertain the occasional reader such as you are. Congratulations, I guess.

The Linux kernel's evdev protocol is a serial protocol where all events have a type, a code and a value. Events are grouped by EV_SYN.SYN_REPORT events, so the event type is EV_SYN (0), the event code is SYN_REPORT (also 0). The value is usually (but not always), you guessed it, zero. A SYN_REPORT signals that the current event sequence (also called a "frame") is to be interpreted as one hardware event [0]. In the simplest case, two hardware events from a mouse could look like this:


EV_REL REL_X 1
EV_SYN SYN_REPORT 0
EV_REL REL_X 1
EV_REL REL_Y 1
EV_SYN SYN_REPORT 0
While we have five evdev events here, those represent one hardware event with an x movement of 1 and a second hardware event with a diagonal movement by 1/1. Glorious, we all understand evdev now (if not, read this and immediately afterwards this, although that second post will be rather reinforced by this post).

Life as software developer would be quite trivial but our universe hates us and we need an extra event code called SYN_DROPPED. This event is used by the kernel when events from the device come in faster than you're reading them. This shouldn't happen given that most input devices scan out at the casual rate of every 7ms or slower and we're not exactly running on carrier pigeons here. But your compositor has been a busy bee rendering all these browser windows containing kitten videos and thus completely neglected to check whether you've moved the finger on the touchpad recently. So the kernel sort-of clears the current event buffer and positions a shiny steaming SYN_DROPPED in there to notify the compositor of its wrongdoing. [1]

Now, we could assume that every evdev client (libinput, every Xorg driver, ...) knows how to handle SYN_DROPPED events correctly but we're self-aware enough that we don't. So SYN_DROPPED handling is wrapped via libevdev, in a way that lets the clients use almost exactly the same processing paths they use for normal events. libevdev gives you a notification that a SYN_DROPPED occured, then you fetch the events one-by-one until libevdev tells you have the complete current state of the device, and back to kittens you go. In pseudo-code, your input stack's event loop works like this:


while (user_wants_kittens):
    event = libevdev_get_event()

    if event is a SYN_DROPPED:
        while (libevdev_is_still_synchronizing):
            event = libevdev_get_event()
            process_event(event)
    else:
        process_event(event)
Now, this works great for keys where you get the required events to release or press new keys. This works great for relative axes because meh, who cares [2]. This works great for absolute axes because you just get the current state of the device and done. This works great for touch because, no wait, that bit is awful.

You see, the multi-touch protocol is ... special. It uses the absolute axes, but it also multiplexes over those axes via the slot protocol. A normal two-touch event looks like this:


EV_ABS ABS_MT_SLOT 0
EV_ABS ABS_MT_POSITION_X 123
EV_ABS ABS_MT_SLOT 1
EV_ABS ABS_MT_POSITION_X 456
EV_ABS ABS_MT_POSITION_Y 789
EV_ABS ABS_X 123
EV_SYN SYN_REPORT 0
The first two evdev events are slot 0 (first touch [3]), the second two evdev events are slot 1 (second touch [3]). Both touches update their X position but the second touch also updates its Y position. But for single-touch emulation we also get the normal absolute axis event [3]. Which is equivalent to the first touch [3] and can be ignored if you're handling the MT axes [3] (I'm getting a lot of mileage out of that footnote). And because things aren't confusing enough: events within an evdev frame are position-independent except the ABS_MT axes which need to be processed in sequence. So that ABS_X events could be anywhere within that frame, but the ABS_MT axes need to be grouped by slot.

About that single-touch emulation... We also have a single-touch multi-touch protocol via EV_KEY. For devices that can only track N fingers but can detect N+M fingers, we have a set of BTN_TOOL defines. Two fingers down sets BTN_TOOL_DOUBLETAP, three fingers down sets BTN_TOOL_TRIPLETAP, etc. Those are just a bitfield though, so no position data is available. And it tops out at BTN_TOOL_QUINTTAP but then again, that's a good maximum backed by a lot of statistical samples from users' hands. On many devices, we have to combine that single-touch MT protocol with the real MT protocol. Synaptics touchpads on PS/2 only support 2 finger positions but detect up to 5 touches otherwise [4]. And remember the ALPS devices? They say they have 4 slots but may only send data for two or three, so we have to detect this at runtime and switch to the BTN_TOOL bits for some touches.
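As a rough sketch (my own illustration, not libinput's actual code), turning those BTN_TOOL bits into a finger count looks something like this:

#include <linux/input-event-codes.h>
#include <stdbool.h>

/* Only one BTN_TOOL_* code is expected to be logically down at a time,
 * so the currently-set code directly encodes the finger count. */
static int finger_count_from_btn_tool(unsigned int code, bool is_down)
{
    if (!is_down)
        return 0;

    switch (code) {
    case BTN_TOOL_FINGER:     return 1;
    case BTN_TOOL_DOUBLETAP:  return 2;
    case BTN_TOOL_TRIPLETAP:  return 3;
    case BTN_TOOL_QUADTAP:    return 4;
    case BTN_TOOL_QUINTTAP:   return 5;
    default:                  return -1; /* not a touch-count code */
    }
}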

So anyway, now that we unfortunately all understand the MT protocol(s), let's look at that libevdev bug. libevdev checks the slot states after SYN_DROPPED to detect whether any touch has stopped or started during SYN_DROPPED. It also detects whether a touch has changed, i.e. the user lifted the finger(s) and put the finger(s) down again while SYN_DROPPED was happening. For those touches it generates the events to stop the original touch, then events to start the new touch. This needs to be done over two event frames, i.e. with a SYN_REPORT in between [5]. But the implementation ended up splitting those changes - any touch that changed was terminated in the first event frame, any touch that outright stopped was terminated in the second event frame. That in itself wasn't the problem yet, the problem was that libevdev didn't emulate the single-touch multi-touch protocol with those emulated frames. So we ended up with event frames where slots would terminate but the single-touch protocol didn't update until a frame later.
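To illustrate, a made-up replay for a changed touch in slot 0 would look roughly like this on the wire, with the touch terminated (tracking ID -1) in one frame and restarted in the next, and the single-touch BTN_TOOL_FINGER bit kept in sync in both frames:

EV_ABS ABS_MT_SLOT 0
EV_ABS ABS_MT_TRACKING_ID -1
EV_KEY BTN_TOOL_FINGER 0
EV_SYN SYN_REPORT 0
EV_ABS ABS_MT_TRACKING_ID 45
EV_ABS ABS_MT_POSITION_X 123
EV_ABS ABS_MT_POSITION_Y 456
EV_KEY BTN_TOOL_FINGER 1
EV_SYN SYN_REPORT 0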

This doesn't matter for most users. Both protocols were still correct-enough in their own bubble, only once you start mixing protocols was where it all started getting wonky. libinput does this because it has to, too many devices out there only track two fingers. So if you want three-finger tapping and pinch gestures, you need to handle both protocols. Despite this we didn't notice until we added the quirk for ALPS devices. Because now libinput sometimes noticed that after a SYN_DROPPED there were no fingers on the touchpad (because they all stopped/changed) but the BTN_TOOL bits were still on so clearly we have a touchpad that cannot track all fingers it detects - in this case zero. [6]

So to recap: libinput's auto-adjustment of the touch count for buggy touchpad devices failed thanks to libevdev's buggy workaround of the device sync. The device sync we need because we can't rely on userspace handling touches correctly across SYN_DROPPED. An event which only gets triggered because the compositor is too buggy to read input events in time. I don't know how to describe it exactly, but what I can see all the way down are definitely not turtles.

And the sad thing about it: if we didn't try to correct for the firmware and accepted that gestures are just broken on ALPS devices because the kernel driver is lying to us, none of the above would have mattered. Likewise, the old xorg synaptics driver won't be affected by this because it doesn't handle multitouch properly anyway, so it doesn't need to care about these discrepancies. Or, in other words and much like real life: the better you try to be, the worse it all gets.

And as the take-home lesson: do upgrade to libinput 1.15.2 and do upgrade to libevdev 1.9.0 when it's out. Your kittens won't care but at least that way it won't make me feel like I've done all this work in vain.

[0] Unless the SYN_REPORT value is nonzero but let's not confuse everyone more than necessary
[1] A SYN_DROPPED is per userspace client, so a debugging tool reading from the same event node may not see that event unless it too is busy with feline renderings.
[2] yes, you'll get pointer jumps because event data is missing but since you've been staring at those bloody cats anyway, you probably didn't even notice
[3] usually, but not always
[4] on those devices, identifying a 3-finger pinch gesture only works if you put the fingers down in the correct order
[5] historical reasons: in theory a touch could change directly but most userspace can't handle it and it's too much effort to add now
[6] libinput 1.15.2 leaves you with 1 finger in that case and that's good enough until libevdev is released

February 07, 2020

The xkeyboard-config project is the repository for all XKB descriptions, or "keyboard layouts" as the layman would say. But languages are weird and thus xkeyboard-config contains an obscene amount of different layouts. And of course there are additional layouts that are more experimental than common [1].

The fault, as usual, lies with us (the pronoun, not the layout). XKB is weird and it's flexible to the point of driving even bananas bananas, but due to some historic accidents it's largely non-editable. All XKB files are installed in system folders and we all know the 11th commandment was "thou shalt not edit things in /usr/share". But, luckily, that is all about to change. Or rather: it has changed as of libxkbcommon 0.10.0, released Jan 20 2020.

xkeyboard-config provides two types of files. The ones that actually set up your keyboard layout and the ones that allow you to keep sane while doing so, despite your best efforts to the contrary.

KcCGST

Let's look at the first set of files. XKB uses "keycodes, compat, geometry, symbols, types", conveniently if unpronounceably called KcCGST. Keycodes map your "physical" scancode to an internal code-name. For example, your key with the digit 1 on it is AE01 (alphabetic key, row E from bottom, key number 1 from left). Then you map those keycodes into symbols (1 and !). This happens based on the key's type which defines the combination of modifiers to produce the symbols [2]. Compat is largely magic weird stuff (locking modifiers, pointer control) and geometry would let you draw a pretty picture of your keyboard if it was defined for your keyboard which it won't be.
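For illustration, the symbols entry for that key looks roughly like this in the us layout (simplified; the two levels are 1 and exclam, i.e. plain and shift):

key <AE01> { [ 1, exclam ] };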

To see the full keyboard layout simply run xkbcomp -xkb $DISPLAY - and marvel. xkeyboard-config keeps all these parts so your X server or Wayland compositor can load it at runtime depending on your layout.

RMLVO

But when it comes to actual layout selection we like our users enough that we don't make them handle KcCGST but rather provide them with RMLVO instead - "rules, models, layouts, variants, options". You select layout "us" and something converts this into the right components to actually load. Run setxkbmap -layout us -print to see this happening.
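On a default US setup the output looks something like this (the exact components depend on your system):

$ setxkbmap -layout us -print
xkb_keymap {
        xkb_keycodes  { include "evdev+aliases(qwerty)" };
        xkb_types     { include "complete" };
        xkb_compat    { include "complete" };
        xkb_symbols   { include "pc+us+inet(evdev)" };
        xkb_geometry  { include "pc(pc105)" };
};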

"layouts" is what you'd usually associate with a country (except politics is still a thing, so more weirdness here) and "variants" are variations of those. Layout "us" gives you QWERTY and "fr" gives you AZERTY but the "us(dvorak)" variant gives you whatever heresy dvorak applies to those physical keys. And of course, things don't stop there - options are tack-on thingies that do stuff. Like remapping caps lock to compose so you're less capable of shouting at me. Come to think of it, it should really be enabled by default for that reason. You can combine multiple options largely at-will. "models" are largely obsolete (except where they aren't) thanks to the Linux kernel evdev interface which makes all keyboards look the same. But they used to be a thing and maybe one day they'll make a comeback like bell bottom jeans. Disliked by everyone but some weirdos insist on using them.

Rules

Finally, we have rules and thus come to the core of the matter of this post. Rules are magic files that tell the various tools how to go from RMLVO to KcCGST. It's a weird format but it's quite understandable, just open /usr/share/X11/xkb/rules/evdev and have a looksie. It'll make you the popular kid at the next frat party.

Many many moons ago, before the Y2K bug was even in its larval stage, the idea was that you could configure all of those because every UNIX tool had to be more flexible than your yoga teacher. I'm unsure to what extent this was actually ever the case but around 2007-ish the old keyboard driver got deprecated and the evdev driver made its grand entrance. And one side-effect of that was that things broke. evdev uses different keycodes, so all those users that copy-pasted unnecessary XKB configuration into their xorg.conf now had broken keys because they were applying the wrong rules. After whacking enough moles that we got in trouble with the RSPCA we started hardcoding the "evdev" ruleset everywhere. The xorg.conf option "XKBRules" became a noop and thus stopped breaking users' setups.

Except that it also stopped users from deploying their own rules files - something that probably didn't really matter anyway. This had some unintended side-effects though. First, to have a working custom XKB layout you basically had to get it merged upstream. Yes, you could edit the files locally but they'd just be overwritten next time you update the packages. Second, getting rid of hardcoded things is hard so we're stuck with the evdev ruleset for the foreseeable future. This was the situation until, well, now.

User-specific rules and layouts

The new libxkbcommon release changes two things: it prepends $XDG_CONFIG_HOME/xkb/ to the lookup path for XKB rules (and other files). So any file in that path will be picked before the system paths. This makes it possible to have KcCGST files in your home directory and actually use them. This was somewhat possible before by passing the right flags to the various tools but now it's on by default - at least where libxkbcommon is used (Wayland).

Secondly, rules files now support an include statement. This means you can set your own rules and include the system rules. Because everything is hardcoded to evdev this effectively means your new rules file will be $XDG_CONFIG_HOME/xkb/rules/evdev and have at least one line: ! include %S/evdev. If you do just that, you get the evdev ruleset from the system installation path. And any lines you add before or after that line will be loaded. Have a look at the git commit for the details but the summary is that you'll have a rules file that looks like this:


$ cat $XDG_CONFIG_HOME/xkb/rules/evdev
! option = symbols
  custom:foo = +custom(foo)
  custom:bar = +custom(baz)

! include %S/evdev
This file will define the option->symbol mappings as above and then include the system-provided evdev rules file, i.e. it'll behave like before with those two added. To get those to do something, you need to have the actual symbols files:

$ cat $XDG_CONFIG_HOME/xkb/symbols/custom
partial alphanumeric_keys
xkb_symbols "foo" {
    key <TLDE> { [ VoidSymbol ] };
};

partial alphanumeric_keys
xkb_symbols "baz" {
    key <AB01> { [ k, K ] };
};

And voila, you can now use the XKB option "custom:foo" and/or "custom:bar" to remap your tilde or Z key. The rest is left to the reader as an exercise in creativity.

Remaining work

The libxkbcommon change was only the first part of the full feature. The remaining part is to have libxkbcommon actually resolve XDG_CONFIG_HOME when running in gnome-shell, which doesn't work right now thanks to secure_getenv() always returning NULL. That's an issue with gnome-shell in particular, thanks to the rt-scheduler feature, enabled by default on Fedora already.

The second part, and the harder one, is to make the new options appear in the various graphical configuration tools. xkeyboard-config ships an XML file [3] that lists every possible combination with some human description for it. This XML file is used by the various tools directly but none of those tools support XML's XInclude statements. So we'd either have to update all those tools to extend the parsing accordingly or, most likely the smarter long-term solution, write a wrapper library that provides a stable API to get at the same info. That way we can update the include paths under the hood without having to update every tool. Of course this requires every tool to update to the new library first, so, well, chicken, egg, usual problem. Anyway, we'll get there eventually.

[1] For example, I suspect a meetup of icelandic dvorak users doesn't qualify for a group discount.
[2] Each key has "levels" with one symbol each and modifiers that switch between those levels. Most keys have two levels - normal and shift. But there's a key type for EIGHT_LEVEL_ALPHABETIC_LEVEL_FIVE_LOCK and you can cry or laugh and both reactions are appropriate
[3] ask your grandparents about that, it's basically JSON for old people
February 02, 2020

Prototyping a Vulkan Extension — VK_MESA_present_period

I've been messing with application presentation through the Vulkan API for quite a while now, first starting by exploring how to make head-mounted displays work by creating DRM leases as described in a few blog posts: 1, 2, 3, 4.

Last year, I presented some work towards improving frame timing accuracy at the X developers conference. Part of that was about the Google Display Timing extension.

VK_GOOGLE_display_timing

VK_GOOGLE_display_timing is really two extensions in one:

  1. Report historical information about when frames were shown to the user.

  2. Allow applications to express when future frames should be shown to the user.

The combination of these two is designed to allow applications to get frames presented to the user at the right time. The biggest barrier to having things work perfectly all of the time is that the GPU has finite rendering performance, and can easily get behind if the application asks it to do more than it can in the time available.

When this happens, the previous frame gets stuck on the screen for extra time, and then the late frame gets displayed. In fact, because the software queues up a pile of stuff, several frames will often get delayed.

Once the application figures out that something bad happened, it can adjust future rendering, but the queued frames are going to get displayed at some point.

The problem is that the application has little control over the cadence of frames once things start going wrong.

Imagine the application is trying to render at 1/2 the native frame rate. Using GOOGLE_display_timing, it sets the display time for each frame by spacing them apart by twice the refresh interval. When a frame misses its target, it will be delayed by one frame. If the subsequent frame is ready in time, it will be displayed just one frame later, instead of two. That means you see two glitches, one for the delayed frame and a second for the "early" frame (not actually early, just early with respect to the delayed frame).

Specifying Presentation Periods

Maybe, instead of specifying when frames should be displayed, we should specify how long frames should be displayed. That way, when a frame is late, subsequent queued frames will still be displayed at the correct relative time. The application can use the first part of GOOGLE_display_timing to figure out what happened and correct at some later point, being careful to avoid generating another obvious glitch.

I really don't know if this is a better plan, but it seems worth experimenting with, so I decided to write some code and see how hard it was to implement.

Going In The Wrong Direction

At first, I assumed I'd have to hack up the X server, and maybe the kernel itself to make this work. So I started specifying changes to the X present extension and writing a pile of code in the X server.

Queuing the first presentation to the kernel was easy; with no previous presentation needing to be kept on the screen for a specified period, it just gets sent right along.

For subsequent presentations, I realized that I needed to wait until I learned when the earlier presentations actually happened, which meant waiting for a kernel event. That kernel event immediately generates an X event back to the Vulkan client, telling it when the presentation occurred.

Once I saw that both X and Vulkan were getting the necessary information at about the same time, I realized that I could wait in the Vulkan code rather than in the X server.

Window-system Independent Implementation

As part of the GOOGLE_display_timing implementation, each window system tells the common code when presentations have happened to record that information for the application. This provides the hook I need to send off pending presentations using that timing information to compute when they should be presented.

Almost. The direct-to-display (DRM) back-end worked great, but the X11 back-end wasn't very prompt about delivering this timing information, preferring to process X events (containing the timing information) only when the application was blocked in vkAcquireNextImageKHR. I hacked in a separate event handling thread so that events would be processed promptly and got things working.

VK_MESA_present_period

An application uses VK_MESA_present_period by including a VkPresentPeriodMESA structure in the pNext chain in the VkPresentInfoKHR structure passed to the vkQueuePresentKHR call.

typedef struct VkPresentPeriodMESA {
    VkStructureType    sType;
    const void*        pNext;
    uint32_t           swapchainCount;
    const int64_t*     pPresentPeriods;
} VkPresentPeriodMESA;

The fields in this structure are:

  • sType. Set to VK_STRUCTURE_TYPE_PRESENT_PERIOD_MESA
  • pNext. Points to the next extension structure in the chain (if any).
  • swapchainCount. A copy of the swapchainCount field in the VkPresentInfoKHR structure.
  • pPresentPeriods. An array, length swapchainCount, of presentation periods for each image in the call.

Positive presentation periods represent nanoseconds. Negative presentation periods represent frames. A zero value means the extension does not affect the associated presentation. Nanosecond values are rounded to the nearest upcoming frame so that a value of n * refresh_interval is the same as using a value of n frames.

The presentation period causes future images to be delayed at least until the specified interval after this image has been presented. Specifying both a presentation period in a previous frame and using GOOGLE_display_timing is well defined -- the presentation will be delayed until the later of the two times.
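Putting it all together, usage looks roughly like this (a sketch against the prototype; the swapchain, image index and queue are assumed to come from the usual swapchain setup, and error handling is elided):

/* keep this image on screen for two refresh intervals;
 * negative values count frames, positive values are nanoseconds */
int64_t periods[] = { -2 };

VkPresentPeriodMESA period_info = {
    .sType = VK_STRUCTURE_TYPE_PRESENT_PERIOD_MESA,
    .swapchainCount = 1,
    .pPresentPeriods = periods,
};

VkPresentInfoKHR present_info = {
    .sType = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR,
    .pNext = &period_info,
    .swapchainCount = 1,
    .pSwapchains = &swapchain,
    .pImageIndices = &image_index,
    /* wait semaphores etc. as usual */
};

vkQueuePresentKHR(queue, &present_info);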

Status and Plans

The prototype (it's a bit haphazard, I'm afraid) code is available in my gitlab mesa repository. It depends on my GOOGLE_display_timing implementation, which has not been merged yet, so you may want to check that out to understand what this patch does.

As far as the API goes, I could easily be convinced to use some better way of switching between frames and nanoseconds, otherwise I think it's in pretty good shape.

I'm looking for feedback on whether this seems like a useful way to improve frame timing in Vulkan. Comments on how the code might be better structured would also be welcome; I'm afraid I open-coded a singly linked list in my haste...

January 28, 2020

Let's say you have a friend (this wouldn't happen to you, of course, just that friend) who is staring at their system logs and wonder why it is full of messages similar to this:


libinput error: client bug: timer event5 debounce short: offset negative (-7ms)
And the question is of course - what is going on here and why hasn't this been fixed yet. Now, the libinput documentation explains this already but it's always worthwhile to fire out a blog post into the void in the hope someone reads it.

libinput uses a specific model to communicate with the Wayland compositor (or the X server). There is a single epoll file descriptor and that fd will trigger whenever something happens that's of interest to libinput. When that fd triggers, the compositor is expected to call libinput_dispatch() which is the main "do stuff" function of libinput.

The actual trigger doesn't matter, it could be an event from a device but it could be something else. The caller doesn't have to care. All that matters is that there is code like this:


if (libinput_fd_triggered_in_select)
    libinput_dispatch();
And then libinput will do the right thing. Whether you also want events from libinput is almost orthogonal to this.
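Fleshed out a little, the loop might look like this (a minimal sketch; a real compositor multiplexes many more fds and handles errors):

#include <sys/epoll.h>
#include <libinput.h>

static void event_loop(struct libinput *li)
{
    int epollfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN };

    epoll_ctl(epollfd, EPOLL_CTL_ADD, libinput_get_fd(li), &ev);

    while (epoll_wait(epollfd, &ev, 1, -1) > 0) {
        libinput_dispatch(li); /* always, whenever the fd triggers */

        struct libinput_event *event;
        while ((event = libinput_get_event(li)) != NULL) {
            /* compositor-specific event handling goes here */
            libinput_event_destroy(event);
        }
    }
}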

libinput uses timerfd internally so any timeouts also trigger the epoll fd. Timeouts are scheduled based on the event's time stamp, so if you get an event with timestamp T, a timeout of 180ms will be scheduled for time T + 180ms. So the process looks something like this:


T(0): kernel button event
T(0): libinput_dispatch(): schedule timeout for T(0+180)
...
T(180): epoll fd triggers
T(180): libinput_dispatch(): process timeout
...
This works generally fine. Even with some delays we don't generally need to worry about the timeouts and they still trigger as expected. But some of the timeouts are "short", as in 8ms short. And this is where these warnings may trigger.

Let's say your compositor is busy doing some rendering. The epoll fd triggers with a button event but the compositor is too busy to handle it immediately. Instead, it finishes whatever it's doing and only then calls libinput_dispatch():


T(0): kernel button event
...
T(12): libinput_dispatch(): schedule timeout for T(0+8)

libinput error: client bug: timer event5 debounce short: offset negative (-4ms)
libinput will still use the event's timestamp instead of the wall clock time, so the scheduled timeout is no longer in the future. And that is when the error message is printed. This isn't a libinput bug, it's always a bug in the compositor. Especially gnome-shell is still struggling with these instances and while great strides have been made to make it more responsive, it can still happen.

The error message may seem cryptic, but it provides a bunch of useful information: event5 is your event node, "debounce short" is the timer name so we know where we got stuck. And 4ms gives us an indication of how much we got delayed.

And for the record: the other end of this issue, where libinput_dispatch() is called only after a timeout should already have triggered, is handled quietly by libinput. For example, if you have a physical event queued and an expired timeout, we will process the earlier one first to make sure the sequences are handled correctly.

January 27, 2020

It has been a while since the last AppStream-related post (or any post for that matter) on this blog, but of course development didn’t stand still all this time. Quite the opposite – it was just me writing less about it, which actually is a problem as some of the new features are much less visible. People don’t seem to re-read the specification constantly for some reason 😉. As a consequence, we have pretty good adoption of features I blogged about (like fonts support), but much of the new stuff is still not widely used. Also, I had to make a promise to several people to blog about the new changes more often, and I am definitely planning to do so. So, expect posts about AppStream stuff a bit more often now.

What actually was AppStream again? The AppStream Freedesktop Specification describes two XML metadata formats to describe software components: One for software developers to describe their software, and one for distributors and software repositories to describe (possibly curated) collections of software. The format written by upstream projects is called Metainfo and encompasses any data installed in /usr/share/metainfo/, while the distribution format is just called Collection Metadata. A reference implementation of the format and related features written in C/GLib exists as well as Qt bindings for it, so the data can be easily accessed by projects which need it.

The software metadata contains a unique ID for the respective software so it can be identified across software repositories. For example the VLC Mediaplayer is known with the ID org.videolan.vlc in every software repository, no matter whether it’s the package archives of Debian, Fedora, Ubuntu or a Flatpak repository. The metadata also contains translatable names, summaries, descriptions, release information etc. as well as a type for the software. In general, any information about a software component that is in some form relevant to displaying it in software centers is or can be present in AppStream. The newest revisions of the specification also provide a lot of technical data for systems to make the right choices on behalf of the user, e.g. Fwupd uses AppStream data to describe compatible devices for a certain firmware, or the mediatype information in AppStream metadata can be used to more easily install applications for an unknown filetype. Information AppStream does not contain is data the software bundling systems are responsible for. So mechanistic data on how to build a software component or how exactly to install it is out of scope.

So, now let’s finally get to the new AppStream features since last time I talked about it – which was almost two years ago, so quite a lot of stuff has accumulated!

Specification Changes/Additions

Web Application component type

(Since v0.11.7) A new component type web-application has been introduced to describe web applications. A web application can for example be GMail, YouTube, Twitter, etc. launched by the browser in a special mode with less chrome. Fundamentally though it is a simple web link. Therefore, web apps need a launchable tag of type url to specify a URL used to launch them. Refer to the specification for details. Here is a (shortened) example metainfo file for the Riot Matrix client web app:

<component type="web-application">
  <id>im.riot.webapp</id>
  <metadata_license>FSFAP</metadata_license>
  <project_license>Apache-2.0</project_license>
  <name>Riot.im</name>
  <summary>A glossy Matrix collaboration client for the web</summary>
  <description>
    <p>Communicate with your team[...]</p>
  </description>
  <icon type="stock">im.riot.webapp</icon>
  <categories>
    <category>Network</category>
    <category>Chat</category>
    <category>VideoConference</category>
  </categories>
  <url type="homepage">https://riot.im/</url>
  <launchable type="url">https://riot.im/app</launchable>
</component>

Repository component type

(Since v0.12.1) The repository component type describes a repository of downloadable content (usually other software) to be added to the system. Once a component of this type is installed, the user has access to the new content. In case the repository contains proprietary software, this component type pairs well with the agreements section.

This component type can be used to provide easy installation of e.g. trusted Debian or Fedora repositories, but also can be used for other downloadable content. Refer to the specification entry for more information.

Operating System component type

(Since v0.12.5) It makes sense for the operating system itself to be represented in the AppStream metadata catalog. Information about it can be used by software centers to display information about the current OS release and also to notify about possible system upgrades. It also serves as a component the software center can use to attribute package updates to that do not have AppStream metadata. The operating-system component type was designed for this and you can find more information about it in the specification documentation.

Icon Theme component type

(Since v0.12.8) While styles, themes, desktop widgets etc. are already covered in AppStream via the addon component type as they are specific to the toolkit and desktop environment, there is one exception: Icon themes are described by a Freedesktop specification and (usually) work independent of the desktop environment. Because of that and on request of desktop environment developers, a new icon-theme component type was introduced to describe icon themes specifically. From the data I see in the wild and in Debian specifically, this component type appears to be very underutilized. So if you are an icon theme developer, consider adding a metainfo file to make the theme show up in software centers! You can find a full description of this component type in the specification.

Runtime component type

(Since v0.12.10) A runtime is mainly known in the context of Flatpak bundles, but it actually is a more universal concept. A runtime describes a defined collection of software components used to run other applications. To represent runtimes in the software catalog, the new runtime AppStream component type was introduced in the specification, but it has been used by Flatpak for a while already as a nonstandard extension.

Release types

(Since v0.12.0) Not all software releases are created equal. Some may be for general use, others may be development releases on the way to becoming an actual final release. In order to reflect that, AppStream introduced a type property for the release tag in a releases block, which can be either set to stable or development. Software centers can then decide to hide or show development releases.
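In a releases block that looks like this (version numbers and dates made up):

<releases>
  <release type="development" version="1.5.1" date="2020-01-15"/>
  <release type="stable" version="1.5.0" date="2020-01-01"/>
</releases>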

End-of-life date for releases

(Since v0.12.5) Some software releases have an end-of-life date from which onward they will no longer be supported by the developers. This is especially true for Linux distributions which are described in an operating-system component. To define an end-of-life date, a release in AppStream can now have a date_eol property using the same syntax as a date property but defining the date when the release will no longer be supported (refer to the releases tag definition).
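For example (dates made up):

<release version="31" date="2019-10-29" date_eol="2020-11-24"/>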

Details URL for releases

(Since v0.12.5) The release descriptions are short, text-only summaries of a release, usually only consisting of a few bullet points. They are intended to give users a fast, quick to read overview of a new release that can be displayed directly in the software updater. But sometimes you want more than that. Maybe you are an application like Blender or Krita and have prepared an extensive website with an in-depth overview, images and videos describing the new release. For these cases, AppStream now permits an url tag in a release tag pointing to a website that contains more information about a particular release.
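A release with such a link could look something like this (a sketch; see the specification for the exact tag attributes):

<release version="2.2" date="2020-01-10">
  <url>https://example.com/releases/2.2.html</url>
</release>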

Release artifacts

(Since v0.12.6) AppStream limited release descriptions to their version numbers and release notes for a while, without linking the actual released artifacts. This was intentional, as any information on how to get or install software should come from the bundling/packaging system that the Collection Metadata was generated for.

But the AppStream metadata has outgrown this more narrowly defined purpose and has since been used for a lot more things, like generating HTML download pages for software, making it the canonical source for all the software metadata in some projects. Coming from Richard Hughes' awesome Fwupd project was also the need to link to firmware binaries from an AppStream metadata file, as the LVFS/Fwupd use AppStream metadata exclusively to provide metadata for firmware. Therefore, the specification was extended with an artifacts tag for releases, to link to the actual release binaries and tarballs. This replaced the previous makeshift “release location” tag.

Release artifacts always have to link to releases directly, so the releases can be acquired by machines immediately and without human intervention. A release can have a type of source or binary, indicating whether a source tarball or binary artifact is linked. Each binary release can also have an associated platform triplet for Linux systems, an identifier for firmware, or any other identifier for a platform. Furthermore, we permit sha256 and blake2 checksums for the release artifacts, as well as specifying sizes. Take a look at the example below, or read the specification for details.

<releases>
  <release version="1.2" date="2014-04-12" urgency="high">
    [...]
    <artifacts>
      <artifact type="binary" platform="x86_64-linux-gnu">
        <location>https://example.com/mytarball.bin.tar.xz</location>
        <checksum type="blake2">852ed4aff45e1a9437fe4774b8997e4edfd31b7db2e79b8866832c4ba0ac1ebb7ca96cd7f95da92d8299da8b2b96ba480f661c614efd1069cf13a35191a8ebf1</checksum>
        <size type="download">12345678</size>
        <size type="installed">42424242</size>
      </artifact>
      <artifact type="source">
        <location>https://example.com/mytarball.tar.xz</location>
        [...]
      </artifact>
    </artifacts>
  </release>
</releases>

Issue listings for releases

(Since v0.12.9) Software releases often fix issues, sometimes security relevant ones that have a CVE ID. AppStream provides a machine-readable way to figure out which components on your system are currently vulnerable to which CVE registered issues. Additionally, a release tag can also just contain references to any normal resolved bugs, via bugtracker URLs. Refer to the specification for details. Example for the issues tag in AppStream Metainfo files:

<issues>
  <issue url="https://example.com/bugzilla/12345">bz#12345</issue>
  <issue type="cve">CVE-2019-123456</issue>
</issues>

Requires and Recommends relations

(Since v0.12.0) Sometimes software has certain requirements that only some systems can satisfy, and sometimes it might recommend specific things on the system it will run on in order to run at full performance.

I was against adding relations to AppStream for quite a while, as doing so would add a more “functional” dimension to it, impacting how and when software is installed, as opposed to being only descriptive and not essential to be read in order to install software correctly. However, AppStream has pretty much outgrown its initial narrow scope and adding relation information to Metainfo files was a natural step to take. For Fwupd it was an essential step, as Fwupd firmware might have certain hard requirements on the system in order to be installed properly. And AppStream requirements and recommendations go way beyond what regular package dependencies could do in Linux distributions so far.

Requirements and recommendations can be on other software components via their id, on a modalias, specific kernel version, existing firmware version or for making system memory recommendations. See the specification for details on how to use this. Example:

<requires>
  <id version="1.0" compare="ge">org.example.MySoftware</id>
  <kernel version="5.6" compare="ge">Linux</kernel>
</requires>
<recommends>
  <memory>2048</memory> <!-- recommend at least 2GiB of memory -->
</recommends>

This means that AppStream currently supports provides, suggests, recommends and requires relations to refer to other software components or system specifications.

Agreements

(Since v0.12.1) The new agreement section in AppStream Metainfo files was added to make it easier for software to be compliant to the EU GDPR. It has since been expanded to be used for EULAs as well, which was a request coming (to no surprise) from people having to deal with corporate and proprietary software components. An agreement consists of individual sections with headers and descriptive texts and should – depending on the type – be shown to the user upon installation or first use of a software component. It can also be very useful in case the software component is a firmware or driver (which often is proprietary – and companies really love their legal documents and EULAs).

Contact URL type

(Since v0.12.4) The contact URL type can be used to simply set a link back to the developer of the software component. This may be a URL to a contact form, their website or even a mailto: link. See the specification for all URL types AppStream supports.
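For example:

<url type="contact">https://example.com/contact</url>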

Videos as software screenshots

(Since v0.12.8) This one was quite long in the making – the feature request for videos as screenshots had been filed in early 2018. I was a bit wary about adding video, as that lets you run into a codec and container hell as well as requiring software centers to support video and potentially requiring the appstream-generator to get into video transcoding, which I really wanted to avoid. Alternatively, we would have had to make AppStream add support for multiple likely proprietary video hosting platforms, which certainly would have been a bad idea on every level. Additionally, I didn’t want to have people add really long introductory videos to their applications.

Ultimately, the problem was solved by simplification and reduction: People can add a video as “screenshot” to their software components, as long as it isn’t the first screenshot in the list. We only permit the vp9 and av1 codecs and the webm and matroska container formats. Developers should expect the audio of their videos to be muted, but if audio is present, the opus codec must be used. Videos will be size-limited, for example Debian imposes a 14MiB limit on video filesize. The appstream-generator will check for all of these requirements and reject a video in case it doesn’t pass one of the checks. This should make implementing videos in software centers easy, and also provide the safety guarantees and flexibility we want.

So far we have not seen many videos used for application screenshots. As always, check the specification for details on videos in AppStream. Example use in a screenshots tag:

<screenshots>
  <screenshot type="default">
    <image type="source" width="1600" height="900">https://example.com/foobar/screenshot-1.png</image>
  </screenshot>
  <screenshot>
    <video codec="av1" width="1600" height="900">https://example.com/foobar/screencast.mkv</video>
  </screenshot>
</screenshots>

Emphasis and code markup in descriptions

(Since v0.12.8) It has long been requested to have a little bit more expressive markup in descriptions in AppStream, at least more than just lists and paragraphs. That has not happened for a while, as it would be a breaking change to all existing AppStream parsers. Additionally, I didn’t want to let AppStream descriptions become long, general-purpose “how to use this software” documents. They are intended to give a quick overview of the software, and not comprehensive information. However ultimately we decided to add support for at least two more elements to format text: Inline code elements as well as em emphases. There may be more to come, but that’s it for now. This change was made about half a year ago, and people are currently advised to use the new styling tags sparingly, as otherwise their software descriptions may look odd when parsed with older AppStream implementation versions.
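Used sparingly, that could look like this (hypothetical flag name):

<description>
  <p>This release adds the new <code>--dry-run</code> flag, which is <em>highly</em> recommended for testing.</p>
</description>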

Remove-component merge mode

(Since v0.12.4) This addition is specified for the Collection Metadata only, as it affects curation. Since AppStream metadata is in one big pool for Linux distributions, and distributions like Debian freeze their repositories, it sometimes is required to merge metadata from different sources on the client system instead of generating it in the right format on the server. This can also be used for curation by vendors of software centers. In order to edit preexisting metadata, special merge components are created. These can permit appending data, replacing data etc. in existing components in the metadata pool. The one thing that was missing was a mode that permitted the complete removal of a component. This was added via a special remove-component merge mode. This mode can be used to pull metadata from a software center’s catalog immediately even if the original metadata was frozen in place in a package repository. This can be very useful in case an inappropriate software component is found in the repository of a Linux distribution post-release. Refer to the specification for details.

Custom metadata

(Since v0.12.1) The AppStream specification is extensive, but it can not fit every single special usecase. Sometimes requests come up that can’t be generalized easily, and occasionally it is useful to prototype a feature first to see if it is actually used before adding it to the specification properly. For that purpose, the custom tag exists. The tag defines a simple key-value structure that people can use to inject arbitrary metadata into an AppStream metainfo file. The libappstream library will read this tag by default, providing easy access to the underlying data. Thereby, the data can easily be used by custom applications designed to parse it. It is important to note that the appstream-generator tool will by default strip the custom data from files unless it has been whitelisted explicitly. That way, the creator of a metadata collection for a (package) repository has some control over what data ends up in the resulting Collection Metadata file. See the specification for more details on this tag.
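A sketch of what that looks like (keys and values are entirely up to you):

<custom>
  <value key="myproject::special-setting">some arbitrary value</value>
</custom>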

Miscellaneous additions

(Since v0.12.9) Additionally to JPEG and PNG, WebP images are now permitted for screenshots in Metainfo files. These images will – like every image – be converted to PNG images by the tool generating Collection Metadata for a repository though.

(Since v0.12.10) The specification now contains a new name_variant_suffix tag, which is a translatable string that software lists may append to the name of a component in case there are multiple components with the same name. This is intended to be primarily used for firmware in Fwupd, where firmware may have the same name but actually be slightly different (e.g. region-specific). In these cases, the additional name suffix is shown to make it easier to distinguish the different components in case multiple are present.
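For example (hypothetical firmware names):

<name>Example Device Firmware</name>
<name_variant_suffix>EU Region</name_variant_suffix>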

(Since v0.12.10) AppStream has a URI format to install applications directly from webpages via the appstream: scheme. This URI scheme now permits alternative IDs for the same component, in case it switched its ID in the past. Take a look at the specification for details about the URI format.

(Since v0.12.10) AppStream now supports version 1.1 of the Open Age Rating Service (OARS), so applications (especially games) can voluntarily age-rate themselves. AppStream does not replace parental guidance here, and all data is purely informational.

Library & Implementation Changes

Of course, besides changes to the specification, the reference implementation also received a lot of improvements. There are too many to list them all, but a few are noteworthy to mention here.

No more automatic desktop-entry file loading

(Since v0.12.3) By default, libappstream was loading information from local .desktop files into the metadata pool of installed applications. This was done to ensure installed apps were represented in software centers to allow them to be uninstalled. This generated much more pain than it was useful for though, with metadata appearing two to three times in software centers because people didn’t set the X-AppStream-Ignore=true tag in their desktop-entry files. Also, the generated data was pretty bad. So, newer versions of AppStream will only load data of installed software that doesn’t have an equivalent in the repository metadata if it ships a metainfo file. One more good reason to ship a metainfo file!

Software centers can override this default behavior change by setting the AS_POOL_FLAG_READ_DESKTOP_FILES flag for AsPool instances (which many already did anyway).

LMDB caches and other caching improvements

(Since v0.12.7) One of the biggest pain points in adding new AppStream features was always adjusting the (de)serialization of the new markup: AppStream exists as a YAML version for Debian-based distributions for Collection Metadata, an XML version based on the Metainfo format as default, and a GVariant binary serialization for on-disk caching. The latter was used to drastically reduce memory consumption and increase speed of software centers: Instead of loading all languages, only the one we currently needed was loaded. The expensive icon-finding logic, building of the token cache for searches and other operations were performed and the result was saved as a binary cache on-disk, so it was instantly ready when the software center was loaded next.

Adjusting three serialization formats was pretty laborious and a very boring task. And at one point I benchmarked the (de)serialization performance of the different formats and found out that the XML reading/writing was actually massively outperforming that of the GVariant cache. Since the XML parser received much more attention, that was only natural (but there were also other issues with GVariant deserializing large dictionary structures).

Ultimately, I removed the GVariant serialization and replaced it with a memory-mapped XML-based cache that reuses 99.9% of the existing XML serialization code. The cache uses LMDB, a small embeddable key-value store. This makes maintaining AppStream much easier, and we are using the same well-tested codepaths for caching now that we also use for normal XML reading/writing. With this change, AppStream also uses even less memory, as we only keep the software components in memory that the software center currently displays. Everything that isn’t directly needed also isn’t in memory. But if we do need the data, it can be pulled from the memory-mapped store very quickly.

While refactoring the caching code, I also decided to give people using libappstream in their own projects a lot more control over the caching behavior. Previously, libappstream was magically handling the cache behind the back of the application that was using it, guessing which behavior was best for the given usecase. But actually, the application using libappstream knows best how caching should be handled, especially when it creates more than one AsPool instance to hold and search metadata. Therefore, libappstream will still pick the best defaults it can, but give the application that uses it all control it needs, down to where to place a cache file, to permit more efficient and more explicit management of caches.

Validator improvements

(Since v0.12.8) The AppStream metadata validator, used by running appstreamcli validate <file>, is the tool that each Metainfo file should be run through to ensure it is conformant to the AppStream specification and to get some useful hints to improve the metadata quality. It knows four issue severities:

  • Pedantic issues are hidden by default (show them with the --pedantic flag) and affect upcoming features or really “nice to have” things that are completely nonessential.
  • Info issues are not directly a problem, but are hints to improve the metadata and get better overall data. Things the specification recommends but doesn’t mandate also fall into this category.
  • Warnings will result in degraded metadata but don’t make the file invalid in its entirety. Yet, they are severe enough that the validation fails. An example is a vanishing screenshot at a URL: most of the data is still valid, but the result may not look as intended. Invalid email addresses, invalid tag properties etc. fall into this category as well: they will all reduce the amount of metadata systems have available. So the metadata should definitely be warning-free in order to be valid.
  • Errors are outright violations of the specification that will likely result in the data being ignored in its entirety or large chunks of it being invalid. Malformed XML or invalid SPDX license expressions fall into this group.

Previously, the validator would always show very long explanations for all the issues it found, giving detailed information on an issue. While this was nice if there were few issues, it produces very noisy output and makes it harder to quickly spot the actual error. So, the whole validator output was changed to be based on issue tags, a concept that is also known from other lint tools such as Debian’s Lintian: Each error has its own tag string, identifying it. By default, we only show the tag string, line of the issue, severity and component name it affects, as well as a short repeat of an invalid value (in case that’s applicable to the issue). If people do want to know detailed information, they can get it by passing --explain to the validation command. This solution has many advantages:

  • It makes the output concise and easy to read by humans and is mostly already self-explanatory
  • Machines can parse the tags easily and identify which issue was emitted, which is very helpful for AppStream’s own testsuite but also for any tool wanting to parse the output
  • We can now have translators translate the explanatory texts

Initially, I didn’t want to have the validator return translated output, as that may be less helpful and harder to search the web for. But now, with the untranslated issue tags and much longer and better explanatory texts, it makes sense to trust the translators to translate the technical explanations well.

Of course, this change broke any tool that was parsing the old output. I had an old request by people to have appstreamcli return machine-readable validator output, so they could integrate it better with preexisting CI pipelines and issue reporting software. Therefore, the tool can now return structured, machine-readable output in the YAML format if you pass --format=yaml to it. That output is guaranteed to be stable and can be parsed by any CI machinery that a project already has running. If needed, other output formats could be added in future, but for now YAML is the only one and people generally seem to be happy with it.
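In practice that means something like this (hypothetical file name):

$ appstreamcli validate org.example.app.metainfo.xml
$ appstreamcli validate --explain org.example.app.metainfo.xml
$ appstreamcli validate --format=yaml org.example.app.metainfo.xml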

Create desktop-entry files from Metainfo

(Since v0.12.9) As you may have noticed, an AppStream Metainfo file contains some information that a desktop-entry file also contains. Yet, the two file formats serve very different purposes: A desktop file is basically launch instructions for an application, with some information about how it is displayed. A Metainfo file is mostly display information and less to none launch instructions. Admittedly though, there is quite a bit of overlap which may make it useful for some projects to simply generate a desktop-entry file from a Metainfo file. This may not work for all projects, most notably ones where multiple desktop-entry files exist for just one AppStream component. But for the simplest and most common of cases, a direct mapping between Metainfo and desktop-entry file, this option is viable.

The appstreamcli tool permits this now, using the appstreamcli make-desktop-file subcommand. It just needs a Metainfo file as first parameter, and a desktop-entry output file as second parameter. If the desktop-entry file already exists, it will be extended with the new data from the Metainfo file. For the Exec field in a desktop-entry file, appstreamcli will read the first binary entry in a provides tag, or use an explicitly provided line passed via the --exec parameter.
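For example (hypothetical file names):

$ appstreamcli make-desktop-file org.example.app.metainfo.xml org.example.app.desktop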

Please take a look at the appstreamcli(1) manual page for more information on how to use this useful feature.

Convert NEWS files to Metainfo and vice versa

(Since v0.12.9) Writing the XML for release entries in Metainfo files can sometimes be a bit tedious. To make this easier and to integrate better with existing workflows, two new subcommands for appstreamcli are now available: news-to-metainfo and metainfo-to-news. They permit converting a NEWS textfile to Metainfo XML and vice versa, and can be integrated with an application’s build process. Take a look at AppStream itself on how it uses that feature.

In addition to generating the NEWS output or reading it, there is also a second YAML-based option available. Since YAML is a structured format, more of the features of AppStream release metadata are available in the format, such as marking development releases as such. You can use the --format flag to switch the output (or input) format to YAML.
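For example (hypothetical file names; I'm assuming input-first argument order here, check the manual page for the exact syntax):

$ appstreamcli news-to-metainfo NEWS org.example.app.metainfo.xml
$ appstreamcli metainfo-to-news org.example.app.metainfo.xml NEWS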

Please take a look at the appstreamcli(1) manual page for a bit more information on how to use this feature in your project.

Support for recent SPDX syntax

(Since v0.12.10) This has been a pain point for quite a while: SPDX is a project supported by the Linux Foundation to (mainly) provide a unified syntax to identify licenses for Open Source projects. They did change the license syntax twice in incompatible ways though, and AppStream already implemented a previous version, so we could not simply jump to the latest version without supporting the old one.

With the latest release of AppStream though, the software should transparently convert between the different version identifiers and also support the most recent SPDX license expressions, including the WITH operator for license exceptions. Please report any issues if you see them!

Future Plans?

First of all, congratulations for reading this far into the blog post! I hope you liked the new features! In case you skipped here, welcome to one of the most interesting sections of this blog post! 😉

So, what is next for AppStream? The 1.0 release, of course! The project is certainly mature enough to warrant that, and originally I wanted to get the 1.0 release out of the door this February, but it doesn’t look like that date is still realistic. But what does “1.0” actually mean for AppStream? Well, here is a list of the intended changes:

  • Removal of almost all deprecated parts of the specification. Some things will remain supported forever though: For example the desktop component type is technically deprecated for desktop-application but is so widely used that we will support it forever. Things like the old application node will certainly go though, and so will the /usr/share/appdata path as metainfo location, the appcategory node that nobody uses anymore and all other legacy cruft. I will be mindful about this though: If a feature still has a lot of users, it will stay supported, potentially forever. I am closely monitoring what is used mainly via the information available via the Debian archive. As a general rule of thumb though: A file for which appstreamcli validate passes today is guaranteed to work and be fine with AppStream 1.0 as well.
  • Removal of all deprecated API in libappstream. If your application still uses API that is flagged as deprecated, consider migrating to the supported functions and you should be good to go! There are a few bigger refactorings planned for some of the API around releases and data serialization, but in general I don’t expect this to be hard to port.
  • The 1.0 specification will be covered by an extended stability promise. When a feature is deprecated, there will be no risk that it is removed or become unsupported (so the removal of deprecated stuff in the specification should only happen once). What is in the 1.0 specification will quite likely be supported forever.

So, what is holding up the 1.0 release besides the API cleanup work? Well, there are a few more points I want to resolve before releasing the 1.0 release:

  • Resolve hosting release information at a remote location, not in the Metainfo file (#240): This will be a disruptive change that will need API adjustments in libappstream for sure, and certainly will – if it happens – need the 1.0 release. Fetching release data from remote locations as opposed to having it installed with software makes a lot of sense, and I either want to have this implemented and specified properly for the 1.0 release, or have it explicitly dismissed.
  • Mobile friendliness / controls metadata (#192 & #55): We need some way to identify applications as “works well on mobile”. I also work for a company called Purism which happens to make a Linux-based smartphone, so this is obviously important for us. But it also is very relevant for users and other Linux mobile projects. The main issue here is to define what “mobile” actually means and what information makes sense to have in the Metainfo file to be future-proof. At the moment, I think we should definitely have data on supported input controls for a GUI application (touch vs mouse), but for this the discussion is still not done.
  • Resolving addon component type complexity (lots of issue reports): At the moment, an addon component can be created to extend an existing application by $whatever thing. This can be a plugin, a theme, a wallpaper, extra content, etc. This is all running in the addon supergroup of components. This makes it difficult for applications and software centers to occasionally group addons into useful groups – a plugin is functionally very different from a theme. Therefore I intend to possibly allow components to name “addon classes” they support and that addons can sort themselves into, allowing easy grouping and sorting of addons. This would of course add extra complexity. So this feature will either go into the 1.0 release, or be rejected.
  • Zero pending feature requests for the specification: Any remaining open feature request for the specification itself in AppStream’s issue tracker should either be accepted & implemented, or explicitly deferred or rejected.

I am not sure yet when the todo list will be completed, but I am certain that the 1.0 release of AppStream will happen this year, most likely before summer. Any input, especially from users of the format, is highly appreciated.

Thanks a lot to everyone who contributed or is contributing to the AppStream implementation or specification, you are great! Also, thanks to you, the reader, for using AppStream in your project 😉. I definitely will give a bit more frequent and certainly shorter updates on the project’s progress from now on. Enjoy your rich software metadata, firmware updates and screenshot videos meanwhile! 😀

January 21, 2020

Linux.conf.au 2020

I just got back from linux.conf.au 2020 on Saturday and am still adjusting to being home again. I had the opportunity to give three presentations during the conference and wanted to provide links to the slides and videos.

Picolibc

My first presentation was part of the Open ISA miniconf on Monday. I summarized the work I've been doing on a fork of Newlib called Picolibc which targets 32- and 64-bit embedded processors.

Snek

Wednesday morning, I presented on my snek language, which is a small Python designed for introducing programming in an embedded environment. I've been using this for the last year or more in a middle-school environment (grades 5-7) as a part of a LEGO robotics class.

X History and Politics

Bradley Kuhn has been encouraging me to talk about the early politics of X and how that has shaped my views on the benefits of copyleft licenses in building strong communities, especially in driving corporate cooperation and collaboration. I would have loved to also give this talk as a part of the Copyleft Conference being held in Brussels after FOSDEM, but I won't be at that event. This talk spans the early years of X, covering events up through 1992 or so.

January 17, 2020
A while ago, as a spin-off of my project to improve support for Logitech wireless keyboards and mice, I have also done some work on improving support for (gaming) keyboards with a built-in LCD panel.

Specifically, if you have a Logitech MX5000, G15, G15 v2 or G510 and you want the LCD panel to show something somewhat useful, then on Fedora 31 you can now install the lcdproc package and it will automatically recognize the keyboard and show "top"-like information on it. No need to manually write an LCDd.conf or anything, this works fully plug and play:

sudo dnf install lcdproc
sudo udevadm trigger


If you have an MX5000 and you do not want the LCD panel to show "top"-like info, you may still want to install the mx5000tools package; this will automatically send the system time to the keyboard, after which it will display the time.

Once the 5.5 kernel becomes available as an update for Fedora you will also be able to use the keys surrounding the LCD panel to control the lcdproc menus on the LCD panel. The 5.5 kernel will also export key-backlight brightness control through the standardized /sys/class/leds API, so that you can control it from e.g. the GNOME control-center's power-settings, and you get a nice OSD when toggling the brightness level using the key on the keyboard.

The 5.5 kernel will also make the "G" keys send standard input events (evdev events); once userspace support for the new keycodes these send has landed, this will allow e.g. binding them to actions in GNOME control-center's keyboard-settings. But only under Wayland, as the new keycodes are > 255 and X11 does not support this.