planet.freedesktop.org
April 08, 2021

In RADV we just added an option to speed up rendering by rendering fewer pixels.

Techniques like this have become more common over the past decade, with checkerboarding, TAA-based upscaling and, recently, DLSS. Fundamentally they all trade rendering quality for rendering cost, and many of them include some amount of postprocessing to improve the curve of that tradeoff. Most notably, DLSS has been so successful at this that many people claim it is barely a quality regression.

Of course, increasing GPU performance by up to 50% or so with barely any quality regression seems like a must-have, and I think it would be pretty cool if we could have the same improvements on Linux. It has the potential to be a game changer, making games playable on APUs or enabling really high resolutions or framerates on desktops.

And today we took our first baby steps in RADV by allowing users to force Variable Rate Shading (VRS) with an experimental environment variable:

RADV_FORCE_VRS=2x2
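
For example, to try it with an arbitrary Vulkan game started from a shell (the executable name here is just a placeholder):

RADV_FORCE_VRS=2x2 ./my_vulkan_game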

VRS is a hardware capability that lets us reduce the number of fragment shader invocations per pixel rendered. You could, for example, configure the hardware to use one fragment shader invocation per 2x2 block of pixels. The hardware still renders the edges of geometry exactly, but the inner area of each triangle is shaded with fewer fragment shader invocations.

There are a couple of ways this capability can be configured:

  1. On a per-draw level
  2. On a per-primitive level (e.g. per triangle)
  3. Using an image to configure on a per-region level

This is a new feature for AMD on RDNA2 hardware.
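
For the per-draw case, the Vulkan-level interface is VK_KHR_fragment_shading_rate. A minimal sketch of what setting a 2x2 rate for subsequent draws looks like, assuming the extension and its pipelineFragmentShadingRate feature are enabled (this only shows the API shape, not how RADV_FORCE_VRS is implemented internally):

/* one fragment shader invocation per 2x2 pixel block for subsequent draws */
VkExtent2D frag_size = { .width = 2, .height = 2 };
VkFragmentShadingRateCombinerOpKHR ops[2] = {
   VK_FRAGMENT_SHADING_RATE_COMBINER_OP_KEEP_KHR, /* KEEP: ignore per-primitive rate */
   VK_FRAGMENT_SHADING_RATE_COMBINER_OP_KEEP_KHR, /* KEEP: ignore attachment rate */
};
vkCmdSetFragmentShadingRateKHR(cmdbuf, &frag_size, ops);
vkCmdDraw(cmdbuf, vertex_count, 1, 0, 0);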

With RADV_FORCE_VRS we use this to improve performance at the cost of visual quality. Since we did not implement any postprocessing, the quality loss can be pretty bad, so we do not reduce the shading rate when we detect one of the following:

  1. Something is rendered in 2D, as that is likely some UI where you’d really want some crispness
  2. When the shader can discard pixels, as this implicitly introduces geometry edges that the hardware doesn’t see but that significantly impact the visual quality.

As a result, there are games where this has barely any effect but where you also don’t notice a quality regression, and there are games where it improves performance by 30%+ but where the quality regression is very noticeable.

VRS is by far the easiest of these techniques to make work in almost all games. Most alternatives, like checkerboarding, TAA and DLSS, need a modified render target size, significant shader fixups, or even proprietary integration with the game. Making changes that deep gets more complicated the more advanced a game is.

If we want to reduce the render resolution (which would be key for e.g. checkerboarding or DLSS) it is very hard to confidently tie all resolution-dependent things together. For example, a big cost for some modern games is raytracing, but the information flow to the main render targets can be very hard to track automatically, so such an approach would require a lot of investigation or a bunch of per-game customizations.

And hence we decided to introduce this first baby step. Enjoy!

April 07, 2021

The lavapipe vulkan software rasterizer in Mesa is now reporting Vulkan 1.1 support.

It passes all the CTS tests for the new 1.1 features, but it still fails the same 1.0 tests as before, so it isn't that close to conformance (line/point rendering is the main problem area).

A bunch of the 1.2 features are also implemented, so that might not be too far away either, though 16-bit shader ops and depth resolve are looking a bit tricky.

If there are any specific features you want to see, or any crazy places/ideas for using lavapipe out there, please either file a gitlab issue or hit me up on twitter @DaveAirlie.


Buffering

The great thing about tomorrow is that it never comes.

Let’s talk about sparse buffers.

What is a sparse buffer? A sparse buffer is a buffer that is not required to be contiguously or fully backed. This means that a buffer larger than the GPU’s available memory can be created, with only some parts of it backed by memory at any given time. Because of the non-resident nature of the backing memory, sparse buffers can never be mapped directly; any host read/write instead has to go through a staging buffer.
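
For reference, on the Vulkan side a sparse buffer is just a regular buffer created with the sparse flags; a minimal sketch, assuming a device that supports sparse binding and residency:

VkBufferCreateInfo info = {
   .sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO,
   .flags = VK_BUFFER_CREATE_SPARSE_BINDING_BIT |
            VK_BUFFER_CREATE_SPARSE_RESIDENCY_BIT,
   /* can be larger than the memory we ever intend to bind */
   .size = 4ull * 1024 * 1024 * 1024,
   .usage = VK_BUFFER_USAGE_STORAGE_BUFFER_BIT,
   .sharingMode = VK_SHARING_MODE_EXCLUSIVE,
};
VkBuffer buffer;
VkResult result = vkCreateBuffer(device, &info, NULL, &buffer);
/* no vkBindBufferMemory here: backing memory is bound later, per range,
 * with vkQueueBindSparse as shown below */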

In a gallium-based driver, provided that an effective implementation for staging buffers exists, sparse buffer implementation goes almost exclusively through the pipe_context::resource_commit hook, which manages residency of a sparse resource’s backing memory, passing a range to change residency for and an on/off switch.

In zink(-wip), the hook looks like this:

static bool
zink_resource_commit(struct pipe_context *pctx, struct pipe_resource *pres, unsigned level, struct pipe_box *box, bool commit)
{
   struct zink_context *ctx = zink_context(pctx);
   struct zink_resource *res = zink_resource(pres);
   struct zink_screen *screen = zink_screen(pctx->screen);

   /* if any current usage exists, flush the queue */
   if (zink_batch_usage_matches(&res->obj->reads, ctx->curr_batch) ||
       zink_batch_usage_matches(&res->obj->writes, ctx->curr_batch))
      zink_flush_queue(ctx);

   /* a single sparse buffer bind operation, no semaphores involved */
   VkBindSparseInfo sparse;
   sparse.sType = VK_STRUCTURE_TYPE_BIND_SPARSE_INFO;
   sparse.pNext = NULL;
   sparse.waitSemaphoreCount = 0;
   sparse.bufferBindCount = 1;
   sparse.imageOpaqueBindCount = 0;
   sparse.imageBindCount = 0;
   sparse.signalSemaphoreCount = 0;

   /* one memory bind for this buffer */
   VkSparseBufferMemoryBindInfo sparse_bind;
   sparse_bind.buffer = res->obj->buffer;
   sparse_bind.bindCount = 1;
   sparse.pBufferBinds = &sparse_bind;

   /* the pipe_box range maps 1:1 onto the buffer and its backing memory;
    * a VK_NULL_HANDLE here un-commits the range */
   VkSparseMemoryBind mem_bind;
   mem_bind.resourceOffset = box->x;
   mem_bind.size = box->width;
   mem_bind.memory = commit ? res->obj->mem : VK_NULL_HANDLE;
   mem_bind.memoryOffset = box->x;
   mem_bind.flags = 0;
   sparse_bind.pBinds = &mem_bind;

   /* submit on the same queue the context renders on */
   VkQueue queue = util_queue_is_initialized(&ctx->batch.flush_queue) ? ctx->batch.thread_queue : ctx->batch.queue;

   VkResult ret = vkQueueBindSparse(queue, 1, &sparse, VK_NULL_HANDLE);
   if (!zink_screen_handle_vkresult(screen, ret)) {
      check_device_lost(ctx);
      return false;
   }
   return true;
}

Naturally there’s a need to enjoy the verbosity of Vulkan structs here, but there are two key takeaways.

The first is that this implementation is likely suboptimal; it should be making better use of semaphores to avoid having to flush the queue if the resource has current-batch usage. That’s complex to implement, however, so I took the same shortcut that RadeonSI does here.

The second is that this is just copying the pipe_box struct into the VkSparseMemoryBind struct. The reason this works with a 1:1 mapping is that the backing memory is allocated with a 1:1 range mapping, so the values can be used directly.

Other than that, the only changes required for this implementation were to add a bunch of checks for the sparse flag on resources during map/unmap to force staging buffers and to use device-local memory instead of host-visible.

Sometimes zink can be simple!

April 06, 2021
For a long time Logitech produced wireless keyboards using 27 MHz as their communications band. Although these have not been produced for a while now, they are still pretty common and a lot of them are still perfectly serviceable.

But there is one downside when using them under Linux: since the communication is one-way, the wireless link is unencrypted by default, which is kinda bad from a security point of view. These keyboards do support an encrypted link, but this requires a one-time setup where the user manually enters a key on the keyboard.

I've written a small Linux utility to do this setup, which should help give these keyboards a new lease on life and stop them from unnecessarily becoming e-waste. Sometimes these keyboards appear to be broken while the only problem is that the keys in the keyboard and receiver are out of sync; the README also contains instructions on how to reset the keyboard without the utility, restoring (unencrypted) functionality.

The 'lg-27MHz-keyboard-encryption-setup' utility is available on Fedora in the 'logitech-27mhz-keyboard-encryption-setup' package.
April 05, 2021

The crocus project was recently mentioned in a phoronix article. The article covered most of the background for the project.

Crocus is a gallium driver to cover the gen4-gen7 families of Intel GPUs. The basic GPU list is 965, GM45, Ironlake, Sandybridge, Ivybridge and Haswell, with some variants thrown in. This hardware currently uses the Intel classic 965 driver. This hardware is all gallium-capable, and since we'd like to put the classic drivers out to pasture and remove support for the old infrastructure, it would be nice to have these generations supported by a modern gallium driver.

The project was initiated by Ilia Mirkin last year, and I've spent some time in small bursts moving it forward. There have been some other small contributions from the community. The basis of the project is a fork of the iris driver with the old relocation-based batchbuffer and state management added back in. I started my focus mostly on the older gen4/5 hardware since it was simpler and only supported GL 2.1 in the current drivers. I've tried to clean up support for Ivybridge along the way.

The current status of the driver is in my crocus branch.

Ironlake is the best supported: it runs openarena and supertuxkart, piglit has only around a 100-test delta vs i965 (mostly edgeflag related), and there is only one missing feature (vertex shader push constants).

Ivybridge just stopped hanging on the second batch submission, and glxgears runs on it. Openarena starts to the menu but misrenders, and a piglit run completes with some GPU hangs and a quite large delta. I expect IVB to move faster now that I've solved the worst hang.

Haswell runs glxgears as well.

I think once I take a closer look at Ivybridge/Haswell and can get Ilia (or anyone else) to do some rudimentary testing on Sandybridge, I will start taking a closer look at upstreaming it into Mesa proper.


Woosh

After last week’s post touting the “final” features being added to the upcoming Mesa release, naturally now that this is a new week, I have to outdo myself.

I’ve heard some speculation about zink’s future regarding features. Specifically regarding all the mesamatrix features that aren’t green-ified for zink yet.

So you want features is what you’re saying.

Let’s see where things stand in today’s zink-wip snapshot:

  • GL_OES_tessellation_shader, GL_OES_gpu_shader5 - this is a mesamatrix bug; zink can’t reach GL 4.0 without supporting them, so obviously they are supported
  • GL_ARB_bindless_texture - the final boss
  • GL_ARB_cl_event - not (yet) supported by mesa
  • GL_ARB_compute_variable_group_size - done
  • GL_ARB_ES3_2_compatibility - missing advanced blend from ES3.2
  • GL_ARB_fragment_shader_interlock - done
  • GL_ARB_gpu_shader_int64 - done
  • GL_ARB_parallel_shader_compile - done
  • GL_ARB_post_depth_coverage - done (thanks ajax)
  • GL_ARB_robustness_isolation - not supported by mesa
  • GL_ARB_sample_locations - done
  • GL_ARB_seamless_cubemap_per_texture - needs a new Vulkan extension
  • GL_ARB_shader_ballot - done
  • GL_ARB_shader_clock - done
  • GL_ARB_shader_stencil_export - done
  • GL_ARB_shader_viewport_layer_array - done
  • GL_ARB_shading_language_include - done
  • GL_ARB_sparse_buffer - done
  • GL_ARB_sparse_texture - not supported by mesa
  • GL_ARB_sparse_texture2 - not supported by mesa
  • GL_ARB_sparse_texture_clamp - not supported by mesa
  • GL_ARB_texture_filter_minmax - done
  • GL_EXT_memory_object - TODO
  • GL_EXT_memory_object_fd - TODO
  • GL_EXT_memory_object_win32 - not supported by mesa
  • GL_EXT_render_snorm - done
  • GL_EXT_semaphore - TODO
  • GL_EXT_semaphore_fd - TODO
  • GL_EXT_semaphore_win32 - not supported by mesa
  • GL_EXT_sRGB_write_control - TODO
  • GL_EXT_texture_norm16 - done
  • GL_EXT_texture_sRGB_R8 - TODO
  • GL_KHR_blend_equation_advanced_coherent - same as regular advanced blend
  • GL_KHR_texture_compression_astc_hdr - TODO
  • GL_KHR_texture_compression_astc_sliced_3d - TODO
  • GL_OES_depth_texture_cube_map - done
  • GL_OES_EGL_image - done
  • GL_OES_EGL_image_external - done
  • GL_OES_EGL_image_external_essl3 - done
  • GL_OES_required_internalformat - done
  • GL_OES_surfaceless_context - done
  • GL_OES_texture_compression_astc - TODO
  • GL_OES_texture_float - done
  • GL_OES_texture_float_linear - done
  • GL_OES_texture_half_float - done
  • GL_OES_texture_half_float_linear - done
  • GL_OES_texture_view - same mesamatrix bug since this is a GL 4.3 extension
  • GL_OES_viewport_array - done
  • GLX_ARB_context_flush_control - not supported by mesa
  • GLX_ARB_robustness_application_isolation - not supported by mesa
  • GLX_ARB_robustness_share_group_isolation - not supported by mesa
  • GL_EXT_shader_group_vote - done
  • GL_EXT_multisampled_render_to_texture - TODO
  • GL_EXT_color_buffer_half_float - TODO
  • GL_EXT_depth_bounds_test - done

By my calculations, that’s 11 TODO, 10 not supported, 2 advanced blend, and 1 final boss, a total of 24 out-of-version-extensions not yet implemented out of 54, meaning that 30 are done, tying with i965 and second only to RadeonSI at 33.

New in today’s snapshot: GL_ARB_fragment_shader_interlock, GL_ARB_sparse_buffer, GL_ARB_sample_locations, GL_ARB_shader_ballot, GL_ARB_shader_clock, GL_ARB_texture_filter_minmax

Cross-referencing

Writing blog posts like this is easy, but you know what’s not easy?

Writing good blog posts.

And new to the blogging game is the one, the only, Bas Nieuwenhuizen of RADV founding fame! If you’re at all curious about how drivers actually work, his is definitely a site to follow, as he’s already gone much deeper into explaining my RPCS3 memcpy fail than I ever did.

Tomorrow

Is sparse buffer implementation 101. I’ve said it, so now the blog post has to happen.

April 03, 2021

In this article I show how reading from VRAM can be a catastrophe for game performance and why.

To illustrate I will go back to fall 2015. AMDGPU was just released, it didn’t even have re-clocking yet and I was just a young student trying to play Skyrim on my new AMD R9 285.

Except it ran slowly. 10-15 FPS slowly. Now, one might think that is no surprise since, due to the lack of re-clocking, the GPU ran with a shader clock of 300 MHz. However, the real surprise was that the game was not GPU bound at all.

As usual with games of that era there was a single thread doing a lot of the work and that thread was very busy doing something inside the game binary. After a bunch of digging with profilers and gdb, it turned out that the majority of time was spent in a single function that accessed less than 1 MiB from a GPU buffer each frame.

At the time DXVK was not a thing yet and I ran the game with wined3d on top of OpenGL. In OpenGL an application does not specify the location of GPU buffers directly, but specifies some properties about how the buffer is going to be used, and the driver decides. Poorly, in this case.

A small tweak to the driver heuristics that choose the memory location more than doubled the frame rate of the game, which was now properly GPU bound.

Some Data

After the anecdote above you might be wondering how slow reading from VRAM can really be. 1 MiB is not a lot of data, so even if it is slow it cannot be that bad, right?

To show you how bad it can be I ran some benchmarks on my system (Threadripper 2990WX, 4-channel DDR4-3200 and an RX 6800 XT). I checked read/write performance using a 16 MiB buffer (512 MiB for system memory, to avoid the test fitting in the L3 cache).

We look into three allocation types that are exposed by the amdgpu Linux kernel driver:

  • VRAM. This lives on the GPU and is mapped with Uncacheable Speculative Write Combining (USWC) on the CPU. This means that accesses from the CPU are not cached, but writes can be write-combined.

  • Cacheable system memory. This is system memory that has caching enabled on the CPU, with cache snooping to ensure the memory stays coherent between the CPU and GPU (up to the top-level caches; the GPU caches do not participate in the coherence).

  • USWC system memory. This is system memory that is mapped with Uncacheable Speculative Write Combining on the CPU. This can lead to slight performance benefits compared to cacheable system memory due to lack of cache snooping.

For context, in Vulkan this would roughly correspond to the following memory types:

  VRAM:
     VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT | VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT

  Cacheable system memory:
     VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT | VK_MEMORY_PROPERTY_HOST_CACHED_BIT

  USWC system memory:
     VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT

The benchmark resulted in the following throughput numbers:

   method (MiB/s)        VRAM   Cacheable System Memory   USWC System Memory
   read via memcpy         15                     11488                  137
   write via memcpy     10028                     18249                11480

I furthermore tested handwritten for-loops accessing 8-, 16-, 32- and 64-bit elements at a time, and those got similar performance.
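
For reference, the core of such a measurement is nothing fancy; a rough sketch (not the exact benchmark I used) that times a memcpy read out of an already-mapped pointer could look like this:

#include <stdlib.h>
#include <string.h>
#include <time.h>

/* returns read throughput in MiB/s; src is a mapping of the memory type
 * under test (e.g. obtained via vkMapMemory), size is the buffer size */
static double
measure_read(const void *src, size_t size)
{
   void *dst = malloc(size);
   struct timespec start, end;

   clock_gettime(CLOCK_MONOTONIC, &start);
   memcpy(dst, src, size);
   clock_gettime(CLOCK_MONOTONIC, &end);

   double secs = (end.tv_sec - start.tv_sec) +
                 (end.tv_nsec - start.tv_nsec) * 1e-9;
   free(dst);
   return (size / (1024.0 * 1024.0)) / secs;
}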

This clearly shows that memcpy reads from VRAM are ~766x slower than memcpy reads from cacheable system memory, and even uncached (USWC) system memory is ~84x slower than cacheable system memory. Reading even small amounts from these memory types can cause severe performance degradation.

Writes show a difference as well, but the difference is not nearly as significant. So if an application does not select the best memory location for data it only writes from the CPU, the result is still likely to be a reasonable experience.

APUs Are Affected Too

Even though APUs do not have VRAM, they are still affected by the same issue. Typically the GPU gets a certain amount of memory pre-allocated at boot time as a carveout. There are some differences in how this carveout is accessed from the GPU, so from the GPU's perspective this memory can be faster.

At the same time the Linux kernel only gives uncached access to that region from the CPU, so one could expect similar performance issues to crop up.

I did the same test as above on a laptop with a Ryzen 5 2500U (Raven Ridge) APU, and got results that are not dissimilar from my workstation.

   method (MiB/s)     Carveout   Snooped System Memory   USWC System Memory
   read via memcpy         108                   10426                  108
   write via memcpy      11797                   20743                11821

The carveout performance is virtually identical to the uncached system memory, which is still ~97x slower than cacheable system memory. So even though it is all system memory on an APU, care still has to be taken with how the memory is allocated.

What To Do Instead

Since the performance cliff is so large, it is best to avoid this issue if at all possible. The following three methods are good ways to do that:

  1. If the data is only written from the CPU, it is advisable to keep a shadow copy in cacheable system memory (it can even live outside of the graphics API, e.g. malloc) and read from that instead (see the sketch after this list).

  2. If the data is written by the GPU, but not frequently, one could consider putting the buffer in snooped system memory. This makes the GPU traffic go over the PCIe bus though, so it has a trade-off.

  3. Let the GPU copy the data to a buffer in snooped system memory. This is basically an extension of the previous item, making sure that the GPU accesses the data exactly once in system memory. The GPU roundtrip can take a non-trivial amount of wall time though (up to ~0.5 ms measured on some low-end APUs), some of which is size-independent, such as command submission. Additionally, this may need to wait until the hardware unit used for the copy is available, which may depend on other GPU work. The SDMA unit (the Vulkan transfer queue) is a good option to avoid that.
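
As an illustration of the first option, the pattern is simply to write both copies and service all CPU reads from the cacheable shadow; a small sketch (the names are made up for the example):

#include <stddef.h>
#include <string.h>

struct shadowed_buffer {
   void *shadow;   /* malloc'ed, cacheable system memory */
   void *gpu_map;  /* write-combined mapping of the VRAM buffer */
   size_t size;
};

static void
buffer_write(struct shadowed_buffer *buf, size_t offset,
             const void *data, size_t size)
{
   /* write-combined writes to VRAM are fast, so keep both copies updated */
   memcpy((char *)buf->shadow + offset, data, size);
   memcpy((char *)buf->gpu_map + offset, data, size);
}

static void
buffer_read(struct shadowed_buffer *buf, size_t offset,
            void *data, size_t size)
{
   /* never read back through the VRAM mapping: use the cacheable shadow */
   memcpy(data, (char *)buf->shadow + offset, size);
}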

Other Limitations

Another problem with CPU access to VRAM is the BAR size. Typically only the first 256 MiB of VRAM is configured to be accessible from the CPU, and for anything else one needs to use DMA.

If the working set of what is allocated in VRAM and accessed from the CPU is large enough the kernel driver may end up moving buffers frequently in the page fault handler. System memory would be an obvious target, but due to the GPU performance trade-off that is not always the decision that gets made.

Luckily, due to the recent push from AMD for Smart Access Memory, large BARs that encompass the entire VRAM are now much more common on consumer platforms.

April 02, 2021

This is the first post of this blog, and with it being past midnight I couldn’t be bothered to make it about a technical topic. So instead, here is an explanation of my plans for the blog.

I was inspired by the prolific blogging of Mike Blumenkrantz and by some discussion on the VKx discord: actually writing updates down can be very useful, and I don’t need to make a paper out of each one.

At the same time I have been involved in some longer running things on the driver side which I think could really use some updates as progress is made. Consider for example raytracing, DRM format modifiers, RGP support and more.

I have no plans to be anywhere near as prolific as Mike, but I think his style of articles is probably a good template for what to expect from this blog.

April 01, 2021

I’m Trying

Blogging is tough, but I’m getting the posts out there one way or another.

Today marks what is likely to be the last of the “big” changes to zink in Mesa 21.1 before the merge window closes in less than two weeks, and what changes they were.

Threaded context support is now implemented, which, on its own, makes zink vaguely competitive against native GL drivers. I’d expect that for many scenarios, people should start seeing upwards of 60-70% native perf when previously the numbers were much lower, excepting things like furmark, where a weird problem with alpha blending is still causing a massive perf hit.

If that wasn’t enough, my timeline semaphore handling also snuck in, providing a reduction in CPU overhead for asynchronous queue-related operations where supported. Special thanks to Vulkan crash test dummy and uninitialized variable enthusiast Lionel Landwerlin for tripping and falling over basically every line of code in this implementation to help get it to the finish line for your consumption.

And if that still wasn’t enough, my RADV draw dispatch refactor also landed yesterday, both paving the way for some totally secret future work of mine and also bringing a 3-4% reduction in CPU overhead for draws that will make your gaming feel faster now that you’re aware of it but realistically won’t have any discernible effect. Basically racing stripes.

March 31, 2021

Do As I Say, Not As Zink Does

Today, a brief post and lamentation.

I’m sure everyone is well aware of Vulkan semantics regarding array vs non-array image types: when using an array type, e.g., VK_IMAGE_TYPE_2D with VkImageCreateInfo::arrayLayers > 1, always use array members for accessing/copying/blitting, and when using a 3D type, always use depth members.

This means array types should use baseArrayLayer and layerCount for copying/blitting and arrayPitch for accessing subresource regions. Non-array types, specifically 3D types, should use VkExtent3D::depth and VkSubresourceLayout::depthPitch.
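
In practice this boils down to how a copy region is filled out. A quick sketch of the distinction (not zink code, just the rule restated with literal sizes):

/* 2D array image: address layers via the subresource, keep depth == 1 */
VkImageCopy array_copy = {
   .srcSubresource = { VK_IMAGE_ASPECT_COLOR_BIT, 0, /*baseArrayLayer*/ 0, /*layerCount*/ 6 },
   .dstSubresource = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 0, 6 },
   .extent = { 256, 256, 1 },
};

/* 3D image: a single layer, the depth goes in the extent instead */
VkImageCopy volume_copy = {
   .srcSubresource = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 0, /*layerCount*/ 1 },
   .dstSubresource = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 0, 1 },
   .extent = { 256, 256, 64 },
};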

This is really important, as I’ve found out over the past week, given that this has not been handled as it should have been in many places throughout the zink stack. Some drivers were cool and didn’t make a big deal about it. Other drivers were more accurate and have just been failing all along.

And Before I Forget

I was recently interviewed by Boiling Steam, a small Linux gaming-oriented news site focused on creating original content and interviewing only the most important figures within the community (like me). If you’ve ever wanted to know more about the rich open source pedigree of Super Good Code, the interview goes deep into the back catalogue of how things got to this point.

March 30, 2021

Yeah, Again

It’s been a while since I blogged about descriptors, so let’s fix that since this site may as well be called Super Good Descriptors.

Last time, I talked a bit about descriptors 3.0: lazy descriptors. The idea I settled on here was to do templated updates, and to do the absolute minimal amount of work possible for each draw while still never reusing any written-to descriptor sets once they’d been cycled out.

Lazy descriptors worked out great, and I’m sure many of you have grown to enjoy the ZINK: USING LAZY DESCRIPTORS log message that got spammed on startup over the past couple months of zink-wip.

Well, times have changed, and this message will no longer fill your terminal by default after today’s (20210330) zink-wip snapshot.

Modes

New today is the ZINK_DESCRIPTORS environment variable, supporting three modes of operation:

  • auto - the default, which attempts to detect system capabilities and use caching with templated updates
  • lazy - the mode we all know and love from the past few months
  • notemplates - this is the old-style caching mechanism
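
So, for example, forcing a particular mode for a single run looks something like this (glxgears is just a stand-in for whatever you’re testing):

ZINK_DESCRIPTORS=lazy MESA_LOADER_DRIVER_OVERRIDE=zink glxgears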

auto mode moves zink one step closer to my eventual goal, which is to be able to use driconf to do application-based mode changes to disable caching for apps which never reuse resources. Also potentially useful would be the ability to dynamically disable caching on a pipeline-by-pipeline basis while an application is running if too many cache misses are detected.

Necessary?

With that said, I’ve come to the conclusion that any form of caching may actually be, at best, equivalent to uncached mode for the general desktop user, and it may only be worthwhile for special cases, like Vulkan drivers which can’t do descriptor templates or embedded devices. In my latest testing (on desktop systems), I have yet to see any scenarios where lazy mode fails to provide the best performance.

ARM in particular seems to gain a lot from it, as the post shows a ~40% perf improvement. It’s unclear to me, however, whether any benchmarking was done against a highly optimized uncached implementation like I’ve done. The overhead from doing basic descriptor updating without templates is definitely significant, so it might just be that things are different now that more functionality is available. On Linux systems, at least, every Vulkan driver that matters supports descriptor templates, so this is functionality that can be relied upon.

Is zink-wip slow now?

No.

The goal of zink-wip is to provide an optimal testing environment with the absolute bleeding edge in terms of performance and features. The auto mode should provide that, and the cases I’ve seen where its performance is noticeably worse number exactly one, and it’s a subtest for drawoverhead. If anyone finds any other cases where auto is worse than lazy, I’m interested, but it shouldn’t be a concern.

With that said, it might be worth doing some benchmarking between the two for some extremely high CPU usage scenarios, as that’s the only case where it may be possible to detect a difference. Gone are the days of zink(-wip) hogging the whole CPU, so probably this is just useless pontificating to fill more of a blog page.

But also, if you’re doing any kind of benchmarking on a high-end CPU, I’d probably recommend going with the lazy mode for now.

Game-changers

I’m pleased with the current state of descriptor caching, but it does bother me that it isn’t dramatically better than the uncached mode on my desktop. I think this ultimately just comes down to the current cache implementation being split into two steps:

  • compute the descriptor cache key
  • lookup the set

This effectively sits on top of the lazy mode, moving work out of the Vulkan driver and into the cache lookup any time there’s a cache hit. As such, I’ve been considering working to shift some of this work into threads, though this is somewhat challenging given the current gallium API. Specifically, only SSBOs and shader images can be per-stage updated immediately after bind, as both UBOs and samplers bind only a single descriptor slot at a time, meaning there’s no way to know when the “final” one is bound.

But then again, I’ve certainly reached the point of diminishing returns now. Most applications that I test have minimal CPU usage for the zink driver thread (e.g., Unigine Superposition is only at about 10% utilization on an i7-6700K), and are instead bottlenecking hard in the GPU, so I think it’s time to call things “good enough” here unless things change in a significant way.

March 29, 2021

A restrictive end-user license agreement is one way a company can exert power over the user. When the free software movement was founded thirty years ago, these restrictive licenses were the primary user-hostile power dynamic, so permissive and copyleft licenses emerged as synonyms to software freedom. Licensing does matter; user autonomy is lost with subscription models, revocable licenses, binary-only software, and onerous legal clauses. Yet these issues pertinent to desktop software do not scratch the surface of today’s digital power dynamics.

Today, companies exert power over their users by: tracking, selling data, psychological manipulation, intrusive advertising, planned obsolescence, and hostile Digital “Rights” Management (DRM) software. These issues affect every digital user, technically inclined or otherwise, on desktops and smartphones alike.

The free software movement promised to right these wrongs via free licenses on the source code, with adherents arguing free licenses provide immunity to these forms of malware since users could modify the code. Unfortunately most users lack the resources to do so. While the most egregious violations of user freedom come from companies publishing proprietary software, these ills can remain unchecked even in open source programs, and not all proprietary software exhibits these issues. The modern browser is nominally free software containing the trifecta of telemetry, advertisement, and DRM; a retro video game is proprietary software but relatively harmless.

As such, it’s not enough to look at the license. It’s not even enough to consider the license and a fixed set of issues endemic to proprietary software; the context matters. Software does not exist in a vacuum. Just as proprietary software tends to integrate with other proprietary software, free software tends to integrate with other free software. Software freedom in context demands a gentle nudge towards software in user interests, rather than corporate interests.

How then should we conceptualize software freedom?

Consider the three adherents to free software and open source: hobbyists, corporations, and activists. Individual hobbyists care about tinkering with the software of their choice, emphasizing freely licensed source code. These concerns do not affect those who do not make a sport out of modifying code. There is nothing wrong with this, but it will never be a household issue.

For their part, large corporations claim to love “open source”. No, they do not care about the social movement, only the cost reduction achieved by taking advantage of permissively licensed software. This corporate emphasis on licensing is often to the detriment of software freedom in the broader context. In fact, it is this irony that motivates software freedom beyond the license.

It is the activist whose ethos must apply to everyone regardless of technical ability or financial status. There is no shortage of open source software, often of corporate origin, but this is insufficient – it is the power dynamic we must fight.

We are not alone. Software freedom is intertwined with contemporary social issues, including copyright reform, privacy, sustainability, and Internet addiction. Each issue arises as a hostile power dynamic between a corporate software author and the user, with complicated interactions with software licensing. Disentangling each issue from licensing provides a framework to address nuanced questions of political reform in the digital era.

Copyright reform generalizes the licensing approaches of the free software and free culture movements. Indeed, free licenses empower us to freely use, adapt, remix, and share media and software alike. However, proprietary licenses micromanaging the core of human community and creativity are doomed to fail. Proprietary licenses have had little success preventing the proliferation of the creative works they seek to “protect”, and the rights to adapt and remix media have long been exercised by dedicated fans of proprietary media, producing volumes of fanfiction and fan art. The same observation applies to software: proprietary end-user license agreements have stopped neither file sharing nor reverse-engineering. In fact, a unique creative fandom around proprietary software has emerged in video game modding communities. Regardless of legal concerns, the human imagination and spirit of sharing persists. As such, we need not judge anyone for proprietary software and media in their life; rather, we must work towards copyright reform and free licensing to protect them from copyright overreach.

Privacy concerns are also traditional in software freedom discourse. True, secure communications software can never be proprietary, given the possibility of backdoors and impossibility of transparent audits. Unfortunately, the converse fails: there are freely licensed programs that inherently compromise user privacy. Consider third-party clients to centralized unencrypted chat systems. Although two users of such a client privately messaging one another are using only free software, if their messages are being data mined, there is still harm. The need for context is once more underscored.

Sustainability is an emergent concern, tying to software freedom via the electronic waste crisis. In the mobile space, where deprecating smartphones after a few short years is the norm and lithium batteries are hanging around in landfills indefinitely, we see the paradox of a freely licensed operating system with an abysmal social track record. A curious implication is the need for free device drivers. Where proprietary drivers force devices into obsolescence shortly after the vendor abandons them in favour of a new product, free drivers enable long-term maintenance. As before, licensing is not enough; the code must also be upstreamed and mainlined. Simply throwing source code over a wall is insufficient to resolve electronic waste, but it is a prerequisite. At risk is the right of a device owner to continue use of a device they have already purchased, even after the manufacturer no longer wishes to support it. Desired by climate activists and the dollar conscious alike, we cannot allow software to override this right.

Beyond copyright, privacy, and sustainability concerns, no software can be truly “free” if the technology itself shackles us, dumbing us down and driving us to outrage for clicks. Thanks to television culture spilling onto the Internet, the typical citizen has less to fear from government wiretaps than from themselves. For every encrypted message broken by an intelligence agency, thousands of messages are willingly broadcast to the public, seeking instant gratification. Why should a corporation or a government bother snooping into our private lives, if we present them on a silver platter? Indeed, popular open source implementations of corrupt technology do not constitute success, an issue epitomized by free software responses to social media. No, even without proprietary software, centralization, or cruel psychological manipulation, the proliferation of social media still endangers society.

Overall, focusing on concrete software freedom issues provides room for nuance, rather than the traditional binary view. End-users may make more informed decisions, with awareness of technologies’ trade-offs beyond the license. Software developers gain a framework to understand how their software fits into the bigger picture, as a free license is necessary but not sufficient for guaranteeing software freedom today. Activists can divide-and-conquer.

Many outside of our immediate sphere understand and care about these issues; long-term success requires these allies. Claims of moral superiority by licenses are unfounded and foolish; there is no success backstabbing our friends. Instead, a nuanced approach broadens our reach. While abstract moral philosophies may be intellectually valid, they are inaccessible to all but academics and the most dedicated supporters. Abstractions are perpetually on the political fringe, but these concrete issues are already understood by the general public. Furthermore, we cannot limit ourselves to technical audiences; understanding network topology cannot be a prerequisite to private conversations. Overemphasizing the role of source code and under-emphasizing the power dynamics at play is a doomed strategy; for decades we have tried and failed. In a post-Snowden world, there is too much at stake for more failures. Reforming the specific issues paves the way to software freedom. After all, social change is harder than writing code, but with incremental social reform, licenses become the easy part.

The nuanced analysis even helps individual software freedom activists. Purist attempts to refuse non-free technology categorically are laudable, but outside a closed community, going against the grain leads to activist burnout. During the day, employers and schools invariably mandate proprietary software, sometimes used to facilitate surveillance. At night, popular hobbies and social connections today are mediated by questionable software, from the DRM in a video game to the surveillance of a chat with a group of friends. Cutting ties with friends and abandoning self-care as a prerequisite to fighting powerful organizations seems noble, but is futile. Even without politics, there remain technical challenges to using only free software. Layering in other concerns, or perhaps foregoing a mobile smartphone, only amplifies the risk of software freedom burnout.

As an application, this approach to software freedom brings to light disparate issues with the modern web raising alarm in the free software community. The traditional issue is proprietary JavaScript, a licensing question, yet considering only JavaScript licensing prompts both imprecise and inaccurate conclusions about web “applications”. Deeper issues include rampant advertising and tracking; the Internet is the largest surveillance network in human history, largely for commercial aims. To some degree, these issues are mitigated by script, advertisement, and tracker blockers; these may be pre-installed in a web browser for harm reduction in pursuit of a gentler web. However, the web’s fatal flaw is yet more fundamental. By design, when a user navigates to a URL, their browser executes whatever code is piped on the wire. Effectively, the web implies an automatic auto-update, regardless of the license of the code. Even if the code is benign, it is still every year more expensive to run, forcing a hardware upgrade cycle deprecating old hardware which would work if only the web weren’t bloated by corporate interests. A subtler point is the “attention economy” tied into the web. While it’s hard to become addicted to reading in a text-only browser, binge-watching DRM-encumbered television is a different story. Half-hearted advances like “Reading Mode” are limited by the ironic distribution of documents over an app store. On the web, disparate issues of DRM, forced auto-update, privacy, sustainability, and psychological dark patterns converge to a single worst case scenario for software freedom. The licenses were only the beginning.

Nevertheless, there is cause for optimism. Framed appropriately, the fight for software freedom is winnable. To fight for software freedom, fight for privacy. Fight for copyright reform. Fight for sustainability. Resist psychological dark patterns. At the heart of each is a software freedom battle – keep fighting and we can win.

See also

Declaration of Digital Autonomy

Local-first software: You own your data, in spite of the cloud

The WWWorst App Store

March 25, 2021

Mike, the zink dev, mentioned that swiftshader seemed slow at some stuff, and I realised I'd never spent much effort comparing swiftshader and llvmpipe in benchmarks.

The thing is, CPU rendering is pretty much going to top out on memory bandwidth fairly quickly, but I decided to do some rough napkin benchmarks using the Vulkan samples from Sascha Willems.

I'd also thought that, since swiftshader has a few dedicated devs and is used by Google instead of Mesa for lots of things, llvmpipe would be slower, as it hasn't really had dedicated development resources.

I picked a random smattering of Vulkan samples and ran them on my Ryzen workstation without doing anything else, in their default window size.

The first number is lavapipe fps, the second swiftshader.

  • gears: 336 309
  • instancing: 3 3
  • ssao: 19 9
  • deferredmultisampling:  11 4
  • computeparticles: 9 8
  • computeshader: 73 57
  • computeshader sharpen: 54 34

I guess "swift" is just a good marketing name. Now I'm not sure why llvmpipe/lavapipe isn't more of a development target for those devs; imagine how much better it could be if it had full-time dedicated devs on it.

Enhance Your Pipe!

It’s no secret that CPU renderers are slower than GPU renderers. But at the same time, CPU renderers are crucial for things like CI and also not hanging your current session by testing on a live GPU when you’re deep into Critical Rewrites.

So I’ve spent some time today doing some rewrites in the *pipe section of mesa, and let’s just say that the pipe was good with zink before, but it’s much, much better now.

Here’s where I started on piglit’s drawoverhead test under Lavapipe:

   #, Test name                                              ,    Thousands draws/s, Difference vs the 1st
   1, DrawElements ( 1 VBO| 0 UBO|  0    ) w/ no state change,                 1517, 100.0%
   2, DrawElements ( 4 VBO| 0 UBO|  0    ) w/ no state change,                 1519, 100.2%
   3, DrawElements (16 VBO| 0 UBO|  0    ) w/ no state change,                 1112, 73.3%
   4, DrawElements ( 1 VBO| 0 UBO| 16 Tex) w/ no state change,                  524, 34.5%
   5, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ no state change,                  583, 38.4%

Nothing too incredible. Modern CPUs with a discrete GPU should be pulling upwards of 15,000k draws/s, and hitting 10% of that is sort of okay.jpg.

Pipe Faster!

But what if I implemented my Mesa multidraw extensions in Lavapipe?

goku.jpg

The gist of this extension is that when no other draw parameters have changed except for the starting vertex/index and the number of vertices/indices, an array of those values can just be dumped into the Vulkan driver to optimize out a lot of draw dispatch/validation code, mirroring Marek Olšák’s multidraw work in Gallium.
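
Conceptually, the interface change looks something like this (an illustrative sketch with made-up names, not the actual Mesa interface):

/* one (start, count) pair per draw; everything else stays constant */
struct draw_range {
   unsigned start;
   unsigned count;
};

/* instead of N calls that each re-validate state ... */
for (unsigned i = 0; i < num_draws; i++)
   draw_single(ctx, info, draws[i].start, draws[i].count);

/* ... the whole array is handed to the driver at once */
draw_multi(ctx, info, draws, num_draws);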

The results, just from doing basic support in Lavapipe without adding any handling in LLVMpipe underneath, speak for themselves:

   #, Test name                                              ,    Thousands draws/s, Difference vs the 1st
   1, DrawElements ( 1 VBO| 0 UBO|  0    ) w/ no state change,                 2945, 100.0%
   2, DrawElements ( 4 VBO| 0 UBO|  0    ) w/ no state change,                 2626, 89.2%
   3, DrawElements (16 VBO| 0 UBO|  0    ) w/ no state change,                 1702, 57.8%
   4, DrawElements ( 1 VBO| 0 UBO| 16 Tex) w/ no state change,                 2881, 97.8%
   5, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ no state change,                 2888, 98.1%

Yup, that’s a 100% perf increase.

Pipe Beyond Space And Time!

But then I thought to myself: What if LLVMpipe could also handle me dumping all this info in?

Indeed, by spending way too much time rewriting and refactoring deep Mesa internals, I have gone beyond:

   #, Test name                                              ,    Thousands draws/s, Difference vs the 1st
   1, DrawElements ( 1 VBO| 0 UBO|  0    ) w/ no state change,                 5273, 100.0%
   2, DrawElements ( 4 VBO| 0 UBO|  0    ) w/ no state change,                 4804, 91.1%
   3, DrawElements (16 VBO| 0 UBO|  0    ) w/ no state change,                 2941, 55.8%
   4, DrawElements ( 1 VBO| 0 UBO| 16 Tex) w/ no state change,                 5052, 95.8%
   5, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ no state change,                 5057, 95.9%

A 3.5x performance boost in drawoverhead seemed pretty good, and it also cut my GLES3 CTS runtime by about 20%, so I think I’m done here for now.

Pipe Fight!

After my recent Swiftshader misadventure, wherein I discovered that it’s missing (at least) transform feedback and conditional render extension support, meaning that it can’t be used to create a legitimate GL 3.0+ context, esoteric rendering expert Dave Airlie took benchmarking matters into his own hands today to battle it out against the other name-brand Vulkan CPU renderer.

The results were shocking.

March 23, 2021

Iago recently wrote a blog post about performance improvements in the v3d compiler, and introduced our plans to improve pipeline caching (specifically of the compiled shaders) (full blog here). We merged some improvements recently, so let’s talk about that work.

Pipeline cache improvements

While analysing the performance of RBDOOM-3-BFG, we noticed that some significant CPU time was spent every frame linking shaders.

RBDOOM-3-BFG screenshot

After some investigation, we found that the game was calling ClearAttachment twice every frame, and that the implementation of those clears relied on a full job with a graphics pipeline. On v3dv any pipeline is created against a pipeline cache by default (either one provided by the user, or a default one). On v3dv (and in general on any Vulkan driver) the main cached data are the compiled shaders, so the main objective of the pipeline cache is to avoid full shader re-compilation for compatible pipelines that are used really often. So why was that time being spent on linking shaders?

The issue was that for each lookup in the pipeline cache we were doing two cache lookups. The first one was against a cache of the shaders in NIR, the main intermediate representation for shaders in Mesa (more info about intermediate representations here). We then used those shaders to fill in the key for a second cache lookup which, if successful, returns the compiled shader in Broadcom (QPU) assembly format.

The reason for this two-step lookup is that to compile a shader we call the common (shared between OpenGL and Vulkan) Broadcom compiler, and we use some data structures containing info that affects the compilation (like whether blending is enabled). When we implemented pipeline cache support, for simplicity, we used the same data structures as part of the cache key. But as such keys were filled with info coming from the NIR shaders, those shaders needed to be linked together in the case of graphics pipelines.

When we analyzed how to improve this, we realized that in order to identify the compiled shader we don’t really need the NIR shaders, as the info derived from them is implicit in the SPIR-V shaders provided to create the pipeline. The NIR shaders are only really needed to compile the shader. The improvement here was to use a different data structure as part of the cache key and replace the two-lookup scheme with a single lookup. We needed to make some additional changes, as parts of the code assumed that the NIR shaders would be available, but now we skip generating them when possible.

Numbers

Let’s start by showing the improvement with a synthetic test. We took one CTS test that uses a complex shader and forced the pipeline to be re-created 1000 times, getting the following times:

  • Before this work, disabling the pipeline cache: 125.79 seconds
  • Before this work, enabling the pipeline cache (the default): 11.41 seconds
  • With this work, enabling the pipeline cache: 0.87 seconds

That’s a clear improvement. So how about real applications? As mentioned, we started this work after analyzing RBDOOM-3-BFG’s use of ClearAttachment. That game got an improvement of ~1 fps. But when we tested other games we didn’t get any improvement from that case. This is because using a full job for ClearAttachment isn’t the preferred option (it is in fact the slowest path), and because the other games are GPU-limited.

But we did get improvements in other cases. As mentioned, the advantage of creating a pipeline with a hot cache is that we avoid the shader recompilation, so it is far faster. And there are some apps that create pipelines at runtime. We found that this happens with the Unreal Engine 4 demos, specifically the Shooter Game. So, for example, we start the demo like this:

UE4 demo screenshot, before shooting gun

and then we decide to shoot for the first time:

UE4 demo screenshot, while shooting gun

in order to render the shooting effect, new pipelines are created and several shaders are compiled. Due to all that extra work we experience a noticeable FPS hiccup. We can visualize it with this graph:

In that graph we can see how we go from ~25 fps to ~2 fps. That is the shooting moment.

For cases like this, the ideal would be to start the game with a pipeline cache loaded with the outcome of previous executions of the game. So, adding a hack to simulate that situation and focusing on the relevant stat in this case, the minimum FPS, we get the following:

  • Cold cache: ~2fps
  • Hot cache, before this work: ~6fps
  • Hot cache, with this work: ~8fps

on-disk-cache

As mentioned, ideally applications would use a pipeline cache when creating their pipelines, store its contents on disk, and load it on subsequent executions. Among other things, this gives the application more control over what to store and load at each moment (for example, storing/loading the pipeline cache for a given level of a game).

The reality is that not all applications do that. As mentioned, the numbers above were simulating that ideal situation, but the application was not actually doing it. To mitigate that, we added support for an on-disk cache. Mesa provides a framework to store and load shaders on disk, so basically for any pipeline cache lookup we now have an extra fallback. In this case, for the UE4 Shooter demo with a hot on-disk cache, we get a minimum of ~7 fps which, as expected, is better than the situation before our work, but worse than the ideal case of the application handling the store/load of the pipeline cache on disk itself.
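
For completeness, the application-side store/load flow mentioned above only needs a couple of core Vulkan calls; a minimal sketch (file I/O and error handling omitted, device and cache assumed to exist already):

/* at shutdown: serialize the cache and write it to disk */
size_t size = 0;
vkGetPipelineCacheData(device, cache, &size, NULL);
void *data = malloc(size);
vkGetPipelineCacheData(device, cache, &size, data);
/* ... write 'data' (size bytes) to a file ... */

/* at startup: seed a new cache with the blob read back from disk */
VkPipelineCacheCreateInfo info = {
   .sType = VK_STRUCTURE_TYPE_PIPELINE_CACHE_CREATE_INFO,
   .initialDataSize = size,
   .pInitialData = data,
};
VkPipelineCache seeded_cache;
vkCreatePipelineCache(device, &info, NULL, &seeded_cache);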

Some conclusions

So, some conclusions from this work that apply to any Vulkan driver, not just v3dv:

  • If possible, pre-create all the pipelines that you would use during your application runtime. Creating pipelines during runtime, even if cached, could lead to performance hiccups.
  • If that is not possible (for example, if the number of variable combinations is too high), use a pipeline cache and store/load it on disk. This reduces the loading time and makes the runtime performance hiccups less noticeable. Note that having support for a general on-disk cache doesn’t mean that v3dv will do this for you.
March 18, 2021

Where We At

The blogging is happening once again, and it feels good.

I did a brief recap a couple posts ago, but I’ve gotten some sleep since then, and now it’s time for the real deal. Here’s what you missed while you were asleep and not following the Mesa RSS feed like all the coolest people I know:

  • zink-wip is shrinking. no, really. it’s down from close to 600 patches to around 200 after the past month
  • descriptor caching has landed, likely yielding around an 80-100% performance increase across the board for applications which were CPU-bound. huge, huge thanks to legendary reviewer, RADV developer, and all around great guy, Bas Nieuwenhuizen for tackling this way-too-fucking-big, 50+ patch MR in an amazingly short amount of time
  • no, zink still can’t render glxgears properly
  • zink now has lavapipe-based CI jobs going to prevent regressions!
  • Erik Faye-Lund has managed a one-liner MR that fixed over a hundred unit tests in CI
  • buffer invalidation is in, which lets zink avoid stalling the GPU in a lot of cases, also improving performance
  • with all this said, another giant thanks to l*v*pipe guru and part-time motivational graphics coach, Dave Airlie, who has reviewed nearly 200 of my patches in this release cycle
  • lavapipe is ramping up towards Vulkan 1.1 support, which will, other than being a great milestone and feat of software engineering, enable zink CI testing through GL 4.5
  • less than a hundred patches to go before zink’s threaded context support is up for merging

Good times.

March 17, 2021
I spent quite a bit of time getting a Sierra Wireless EM7345-LTE modem to work under Linux, so here are some quick instructions to help other people who may hit the same problem.

These modems are somewhat notorious for shipping with broken firmware. They work fine after a firmware upgrade, but under Windows they will only upgrade to "carrier approved" firmware versions, which requires being connected to the mobile network first so that the tool can identify the carrier. And with some carriers, connecting to the network does not work due to the broken firmware (ugh). There are a ton of forum threads on how to work around this under Windows, but they all require that you are at least able to register with the mobile network.

Luckily someone has figured out how to update these under Linux and posted instructions for it. The procedure is actually much more straightforward under Linux. The hardest part is extracting the firmware from the Windows driver download.

One problem is that the necessary Intel FlashTool is no longer available for download from Intel. I needed this tool a while ago for something else, and back then I used the PhoneFlashTool_5.8.4.0.rpm file from https://androiddatahost.com/nm466 . The rpm file in the zip there has a sha256sum of: c5b964ed4fae470d1234c9bf24e0beb15a088a73c7e8e6b9369a68697020c17a

I now see that Intel seems to be offering this for download again itself; you can find it here: https://github.com/projectceladon/tools (projectceladon appears to be an official Intel project). Note that this is not the version I used; I used the PhoneFlashTool_5.8.4.0.rpm version.

Once you have the Intel FlashTool installed, just follow the posted instructions, and after that your modem should start working under Linux.

Since last year, I have been working on Turnip driver development as my daily job at Igalia. One of my tasks was implementing support for the VK_KHR_depth_stencil_resolve extension.

VK_KHR_depth_stencil_resolve

This extension adds support for automatically resolving multisampled depth/stencil attachments in a subpass, in a similar manner to color attachments. It does not, however, add support for resolving MSAA depth/stencil images with the vkCmdResolveImage() command.

As you can imagine, this extension is easy for any application to use. Unless you are using a driver that supports Vulkan 1.2 or higher (VK_KHR_depth_stencil_resolve was promoted to core in Vulkan 1.2), you should first check whether the extension is supported. Having done that, ask the driver which features are supported by extending VkPhysicalDeviceProperties2 with a VkPhysicalDeviceDepthStencilResolveProperties structure when calling vkGetPhysicalDeviceProperties2().

struct VkPhysicalDeviceDepthStencilResolveProperties {
    VkStructureType       sType;
    void*                 pNext;
    VkResolveModeFlags    supportedDepthResolveModes;
    VkResolveModeFlags    supportedStencilResolveModes;
    VkBool32              independentResolveNone;
    VkBool32              independentResolve;
}

This structure will be filled by the driver to indicate the different depth/stencil resolve modes supported by the driver (more info about their meaning and possible values in the spec).
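
The query itself looks roughly like this (a small sketch; physical_device is assumed to be your VkPhysicalDevice):

VkPhysicalDeviceDepthStencilResolveProperties resolve_props = {
    .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_DEPTH_STENCIL_RESOLVE_PROPERTIES,
};
VkPhysicalDeviceProperties2 props = {
    .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2,
    .pNext = &resolve_props,
};
vkGetPhysicalDeviceProperties2(physical_device, &props);

/* e.g. check that averaging is supported for depth resolves */
if (resolve_props.supportedDepthResolveModes & VK_RESOLVE_MODE_AVERAGE_BIT) {
    /* ... */
}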

Next step: just fill a VkSubpassDescriptionDepthStencilResolve struct with the desired resolve mode and the depth/stencil attachment to resolve into. Then chain it into the VkSubpassDescription2 struct. And that’s all.

struct VkSubpassDescriptionDepthStencilResolve {
    VkStructureType                  sType;
    const void*                      pNext;
    VkResolveModeFlagBits            depthResolveMode;
    VkResolveModeFlagBits            stencilResolveMode;
    const VkAttachmentReference2*    pDepthStencilResolveAttachment;
}
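
For illustration, here is a minimal sketch of how those two structs can be chained together before calling vkCreateRenderPass2(); the attachment index, layout and resolve mode below are hypothetical and depend on how your render pass is laid out and on the properties queried above:

VkAttachmentReference2 ds_resolve_ref = {
    .sType = VK_STRUCTURE_TYPE_ATTACHMENT_REFERENCE_2,
    .attachment = 2, /* single-sample depth/stencil attachment in this render pass */
    .layout = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL,
    .aspectMask = VK_IMAGE_ASPECT_DEPTH_BIT | VK_IMAGE_ASPECT_STENCIL_BIT,
};

VkSubpassDescriptionDepthStencilResolve ds_resolve = {
    .sType = VK_STRUCTURE_TYPE_SUBPASS_DESCRIPTION_DEPTH_STENCIL_RESOLVE,
    .depthResolveMode = VK_RESOLVE_MODE_SAMPLE_ZERO_BIT,
    .stencilResolveMode = VK_RESOLVE_MODE_SAMPLE_ZERO_BIT,
    .pDepthStencilResolveAttachment = &ds_resolve_ref,
};

VkSubpassDescription2 subpass = {
    .sType = VK_STRUCTURE_TYPE_SUBPASS_DESCRIPTION_2,
    .pNext = &ds_resolve, /* the extension struct rides on the pNext chain */
    /* ... color attachments, the multisampled depth/stencil attachment, etc. ... */
};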

Turnip implementation

Implementing this extension in Turnip was fairly straightforward, although there were some issues to fix.

For all depth/stencil formats, support was added in both resolve paths used by the driver, one for sysmem (system memory) and one for gmem (tile buffer), including flushing the depth cache when needed. However, for the VK_FORMAT_D32_SFLOAT_S8_UINT format, which stores the stencil part in a separate plane, specific code was needed to extend the 2D path used in the driver for sysmem resolves.

However, the main issue appeared when testing the extension implementation on the Adreno 630 GPU. It turned out that the VK-GL-CTS tests exercising this extension with the MSAA VK_FORMAT_D24_UNORM_S8_UINT format were always failing, except when disabling UBWC via the TU_DEBUG=noubwc environment variable. UBWC (Universal Bandwidth Compression) is a HW feature designed to improve throughput to system memory by minimizing the bandwidth used for data (which also gives some power savings). The problem was that UBWC support for the MSAA VK_FORMAT_D24_UNORM_S8_UINT format is known to be broken on Adreno 630 and Adreno 618 (see the merge request for the freedreno driver). I just needed to disable it in Turnip to fix these failures.

I also found other VK_KHR_depth_stencil_resolve CTS tests failing: the ones testing format compatibility for the VK_FORMAT_D32_SFLOAT_S8_UINT and VK_FORMAT_D24_UNORM_S8_UINT formats. For the VK_FORMAT_D32_SFLOAT_S8_UINT failures, we needed to take into account that the format has a separate plane for the stencil part when resolving it to VK_FORMAT_S8_UINT. For the VK_FORMAT_D24_UNORM_S8_UINT failures, the problem was that we were setting the resolve mode used by the HW incorrectly: it was doing a sample average when we wanted to use the value of sample 0. This merge request fixed both issues.

And that’s all. This was an extension that allowed me to dive into the different resolve paths used by Turnip and learn a thing or two about the HW ;-) Thanks a lot to Jonathan Marek for his reviews and suggestions to improve the implementation of this extension.

Happy hacking!

March 16, 2021

Lately, I have been looking at improving performance of the V3DV Vulkan driver for the Raspberry Pi 4. So far we had been toying a lot with some Vulkan ports of the Quake trilogy but we wanted to have a look at more modern games as well, and for that we started to look at some Unreal Engine 4 samples, particularly the Shooter demo.


Unreal Engine 4 Shooter Demo

In our initial tests with “High” settings, even at 480p we were running the sample at 8-15 fps, with 720p being mostly in the 5-10 fps range. Not great, obviously, but a good opportunity to guide and focus our performance efforts.

One aspect of the UE4 sample that was immediately obvious compared to the Quake games is that the shading is a lot more expensive in general, and more specifically, it involves more texture lookups and UBO loads, which require expensive accesses to memory from the shader cores, so this was the first area we targeted. The good thing about this is that because our Vulkan and OpenGL drivers share the compiler stack, any optimizations we do here benefit both drivers.

What follows is a brief discussion of some of the optimizations we did recently to our backend compiler and the results we have observed from this work.

Optimizing the backend compiler

So the first thing we tackled was better managing the latency of texture lookups. Interestingly, the NIR scheduler was setting this up so that we would try to put instructions that consumed the result of a texture lookup as far away as possible from the instruction that triggered the lookup, but then the backend compiler was not fully taking advantage of this and would still end up waiting on lookup results sooner than it should.

Fixing this helped performance by 1%-3%, although it could go a bit above that in some cases. It has a caveat though: doing this extends the liveness of our lookup sequences, and that makes spilling more difficult (we can’t emit spills/unspills in the middle of an outstanding memory lookup), so when register pressure is high enough that we need to inject register spills to compile the shader, we would typically be a lot more constrained and end up producing significantly worse code, or even worse, failing to register allocate for the shader completely. To avoid this, we recompile the shader without the optimization if we detect that we need to do any spilling. One can use V3D_DEBUG=perf to detect if this is happening for any shaders, looking for messages like this:

Falling back to strategy 'disable TMU pipelining' for MESA_SHADER_FRAGMENT.

While the above optimization was useful, I was expecting it to make a larger impact for this particular demo, so I kept looking for ways to make our memory lookups more efficient. One thing that is relevant to this analysis is that we were using the same hardware unit for both texture and UBO lookups, but for the latter we could use a different strategy by handling our UBOs as uniform streams. This has some caveats, but to make a long story short: because many UBO loads use uniform addresses, because we usually read a bunch of consecutive scalars from a UBO load (such as a full vec4), and because applications usually emit UBO loads for nearby addresses together, we can emit fairly optimal code for many of these, leading to more efficient memory access in general.

Eventually, this turned out to be a big win, yielding 20%-30% improvements for the Shooter demo even with the initial basic implementation, which we would then tune and optimize further.

Again related to memory accesses, I have also been improving how we schedule instructions involved with setting up memory lookups. Our scheduler was more restrictive here than it needed to be, and the extra flexibility can help reduce instruction counts. This will benefit these more modern games the most, as they are more likely to emit a larger number of texture/image operations in their shaders.

Another thing we did was to improve instruction packing. The V3D GPU is able to emit multiple instructions in the same cycle so long as the instructions meet some requirements. Turns out our assembly scheduler was being too restrictive with this and we could do better. Going by shader-db results, this led to ~5% less instructions on our programs and added another modest 1%-2% performance improvement for the Shooter demo.

Another issue we noticed is that a lot of the shaders were passing a lot of varyings from the vertex to the fragment shader, and our setup for fragment shader inputs was not optimal. The issue here was that a specific instruction involved in this process writes two registers, one of them with one instruction of delay, and our dependency tracking was not able to handle this properly, effectively assuming that both registers are written by the same instruction, which then had an impact on how we scheduled instructions. Fixing this required handling this case specially in our scheduler so we could be more effective at scheduling these instructions in a way that enables optimal pipelining of the instructions involved with varying setup for fragment shaders.

Another aspect we improved was related to our uniform handling. Generally, there are many instances in which we need to emit duplicate uniforms. There are reasons for that related to how the GPU works, but in some cases, for example with consecutive UBO loads, we would emit the uniform with the base address of the UBO multiple times very close to each other. We can obviously do better by trying to track previous uses of a uniform/constant value and reusing them in nearby instructions. Of course, this comes at the expense of increasing register pressure (for reasons beyond the scope of this post our shaders typically require a lot of uniforms), so there is a balancing game to play here too. Reducing the size of our uniform streams this way also plays an important role in lowering some of the CPU overhead of the driver, since these streams need to be rebuilt often when certain pipeline states change.

Finally, an optimization that was more targeted at reducing register pressure rather than improving performance: we noticed that we would sometimes put some instructions far away from their consumers with no obvious benefit. This was sometimes bad enough that it would even cause us to be unable to compile some shaders. Mesa has a handy NIR pass for this called nir_opt_sink, which proved to be helpful here. It allowed us to get a few more shaders from shader-db to compile and reduced spilling for a bunch of shaders. For the Shooter demo, this changed a large compute shader involved in histogram post-processing from 48 spills and 50 fills to only 8 spills and 15 fills. While the impact on performance is probably very small since the game only runs this pass once per frame, it made a notable difference in loading time, since compiling a shader with this much spilling is very slow at present. I have a mental note to improve this some day; I know Intel managed to fix this for their compiler, but for the time being, this alone made the loading time much more reasonable.
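
As a rough sketch of what that looks like from a backend's NIR optimization loop (this assumes Mesa's nir.h is available; the exact nir_move_options flags a driver passes are a per-driver choice, and the ones below are only illustrative):

static void
run_backend_nir_opts(nir_shader *s)
{
    bool progress;

    do {
        progress = false;
        /* Sink instructions closer to their uses to shorten live ranges. */
        progress |= nir_opt_sink(s, nir_move_load_ubo | nir_move_comparisons);
        /* ... the rest of the backend's optimization passes ... */
    } while (progress);
}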

Results

First, here are some shader-db stats, which describe how these optimizations change various statistics for a large collection of real-world shaders:

total instructions in shared programs: 14992758 -> 13731927 (-8.41%)
instructions in affected programs: 14003658 -> 12742827 (-9.00%)
helped: 80448
HURT: 4297

total threads in shared programs: 407932 -> 412242 (1.06%)
threads in affected programs: 4514 -> 8824 (95.48%)
helped: 2189
HURT: 34

total uniforms in shared programs: 4069524 -> 3790401 (-6.86%)
uniforms in affected programs: 2267834 -> 1988711 (-12.31%)
helped: 40588
HURT: 1186

total max-temps in shared programs: 2388462 -> 2322009 (-2.78%)
max-temps in affected programs: 897803 -> 831350 (-7.40%)
helped: 30598
HURT: 2464

total spills in shared programs: 6241 -> 5940 (-4.82%)
spills in affected programs: 3963 -> 3662 (-7.60%)
helped: 75
HURT: 24

total fills in shared programs: 14587 -> 13372 (-8.33%)
fills in affected programs: 11192 -> 9977 (-10.86%)
helped: 90
HURT: 14

total sfu-stalls in shared programs: 28106 -> 31489 (12.04%)
sfu-stalls in affected programs: 16039 -> 19422 (21.09%)
helped: 4382
HURT: 6429

total inst-and-stalls in shared programs: 15020864 -> 13763416 (-8.37%)
inst-and-stalls in affected programs: 14028723 -> 12771275 (-8.96%)
helped: 80396
HURT: 4305

Lower is better for all stats except threads. We can see significant improvements across the board: we generally produce shaders with fewer instructions and lower maximum register pressure, we reduce spills and uniform counts, and we can run with more threads. The only stat where we are worse is stalls, but that is generally because we now produce more compact code with fewer instructions, so more stalls are expected.

Another good thing in these stats is the large number of helped shaders compared to hurt shaders, meaning that it is very likely that these optimizations will help most applications to some extent.

But enough of boring compiler statistics; most people won’t care about that, and what they want to know is how this impacts performance on actual games and applications, which is what the following graph shows (these results were obtained by replaying specific traces with gfx-reconstruct). Keep in mind that while I am using a collection of Vulkan samples/games here, these optimizations are expected to apply to OpenGL too.


Before vs After

Framerate improvement after optimization (in %)

As can be observed from the graph above, this optimization work made a significant impact on the observed framerate in all cases. It is not surprising that the UE4 demo is the one that sees the most improvement, considering it is the one we used to guide most of the optimization work.

Other optimizations and future work

In this post I have been focusing exclusively on compiler optimizations, but we have also been improving other parts of the Vulkan driver. While I won’t go into details to avoid making this post too long, we have also been improving aspects of the driver involved with buffer to image copies, depth buffer clears, dirty descriptor state management, usage of the TFU unit for transfer operations and more.

Finally, there is one other aspect of this UE4 demo that is pretty obvious as soon as you start a game: it can compile a lot of shaders in the middle of the gameplay loop, which can lead to significant stutter. While there is not much we can do about this on the driver side, adding support for an on-disk shader cache should eliminate the problem on sessions after the first, so this is something that we may work on in the future.

We will certainly continue to look at improving the driver performance in the future so stay tuned for further updates on our progress or maybe join us at #videocore on Freenode.

Superposition

I got a weird bug report the other day. Apparently Unigine Superposition, the final boss of Unigine benchmarks, was broken in zink(-wip). And it was a regression, which meant that at some point it had worked, but because testing is hard, it then stopped working for a while.

The Problem (superposition wtf u doin?)

I had no idea what the problem was other than a vague impression it might be queue-related given the perf output in the bug report. The first step here was to figure out what I was dealing with. Off I went to the two horsemen of all queue-related bugs: zink_flush() and zink_fence_finish().

Here’s the latter of the two:

static bool
zink_fence_finish(struct zink_screen *screen, struct pipe_context *pctx, struct zink_tc_fence *mfence,
                  uint64_t timeout_ns)
{
   pctx = threaded_context_unwrap_sync(pctx);
   struct zink_context *ctx = zink_context(pctx);

   if (screen->device_lost)
      return true;

   if (pctx && mfence->deferred_ctx == pctx && mfence->deferred_id == ctx->curr_batch) {
      zink_context(pctx)->batch.has_work = true;
      /* this must be the current batch */
      pctx->flush(pctx, NULL, !timeout_ns ? PIPE_FLUSH_ASYNC : 0);
      if (!timeout_ns)
         return false;
   }

   /* need to ensure the tc mfence has been flushed before we wait */
   bool tc_finish = tc_fence_finish(ctx, mfence, &timeout_ns);
   struct zink_fence *fence = mfence->fence;
   if (!tc_finish || (fence && !fence->submitted))
      return fence ? p_atomic_read(&fence->completed) : false;

   /* this was an invalid flush, just return completed */
   if (!mfence->fence)
      return true;
   /* if the zink fence has a different batch id then it must have completed and been recycled already */
   if (mfence->fence->batch_id != mfence->batch_id)
      return true;

   return zink_vkfence_wait(screen, fence, timeout_ns);
}

In short, the control flow goes:

  • if the GPU rejected a previous cmdbuf, fail
  • detect and then submit deferred flushes (i.e., the app has requested a sync object and at some later point decided to wait on it)
  • verify that the cmdbuf represented by this fence has actually been submitted
  • detect cases where the internal fence (mfence->fence) no longer belongs to the gallium sync object (mfence)
  • finally, attempt to actually wait on the fence

So this is all fine and good, and it’s been working terrifically for many months. But I put a printf at the top, and it turns out this was being spammed nonstop during load with the app constantly checking the same sync object to see if it was finished.

It wasn’t finished.

Specifically, the return fence ? p_atomic_read(&fence->completed) : false; case was being hit, which led me to dig deeper into what was going on.

Another printf in zink_flush() (function contents omitted to spare the eyes of anyone under the legal drinking age) revealed that the sync object reaching zink_fence_finish was, in fact, the second one created by the app, which was then being waited upon twenty-something flushes later.

So in this case, mfence->deferred_id was something like 2 and the current batch id in ctx->curr_batch was around 20. The last-completed batch id was also around 20.

Revision

A slight modification here resolved the issue:

if (pctx && mfence->deferred_ctx == pctx) {
   if (mfence->deferred_id == ctx->curr_batch) {
      zink_context(pctx)->batch.has_work = true;
      /* this must be the current batch */
      pctx->flush(pctx, NULL, !timeout_ns ? PIPE_FLUSH_ASYNC : 0);
      if (!timeout_ns)
         return false;
   }
   /* this batch is known to have finished */
   if (mfence->deferred_id <= screen->last_finished)
      return true;
}

This way, cases where an app is randomly waiting on something that has finished so far in the past that there’s no longer any trace of it can just become a no-op success, and everything is happy.

Probably.

Testing is ongoing, but it seems like everything is happy now.

March 15, 2021

As we are heading towards April and the release of Fedora Workstation 34 I wanted to post an update on what we are working on for this release and what we are looking at going forward. 2020 was a year where we focused a lot on polishing what we had and getting things past the finish line and Fedora Workstation 34 is going to be the culmination of that effort in many ways.

Wayland:
The big ticket item we have wanted to close off on was Wayland, because while Wayland has been production ready for most of us for a while, there were still some cases it didn’t cover as well as X.org. The biggest of these was of course the lack of accelerated XWayland support with the binary NVidia driver. Fixing that issue of course wasn’t something we could do ourselves, but we have been working diligently with our friends at NVidia to help ensure everything was in place for them to enable that support in their driver, so I have been very happy to see the public reports confirming that NVidia will have accelerated 3D in the summer release of their driver. The other Wayland area we have put a lot of effort into has been the work undertaken by Jonas Ådahl to get headless display support working with Wayland. This is a critical feature for people who, for instance, want a desktop instance on their servers or in the cloud that they access through things like VNC or RDP for sysadmin-related tasks. Jonas spent a lot of time laying the groundwork for this over the course of last year and we are now in the final stages of merging the patches to enable this feature in GNOME and Wayland in preparation for Fedora Workstation 34. Once those two items are out we consider our Wayland ramp-up/rollout to be complete, so while there will of course continue to be bugfixes and new features implemented, that will be part of a natural evolution of Wayland and not part of a ‘close gaps with X11’ effort like now.

PipeWire
Another big ticket item we are hoping to release fully in Fedora Workstation 34 is PipeWire. PipeWire, as most of you know, is the engine we use to deal with handling video streams in a secure and shareable way in Fedora Workstation, so when you interact with your web camera(s) or do screen casting or make screenshots it is all routed and handled by PipeWire. But in Fedora Workstation 34 we are aiming to also switch to using PipeWire for audio, to replace both PulseAudio and Jack. For those of you who have read any of my previous blog posts you will know what an important step forward this will be, as we would finally be making the pro-audio community first class citizens in Fedora Workstation and Linux in general. When we decided to try to switch to PipeWire in Fedora Workstation 34 I have to admit I was a little skeptical about whether we would be able to get everything ready in time, as there are so many things that need to be tested and fixed when you switch out such a critical component. Up to that point we had a lot of people interested in PipeWire but only limited community involvement; however, I feel the announcement of bringing in PipeWire for Fedora Workstation 34 galvanized the community around the project and we now have a very active community around PipeWire in #pipewire on the freenode IRC network. Not only is Wim Taymans getting a ton of help with testing and verification, but we also see a steady stream of patches coming in, with for instance improved Bluetooth audio support being contributed. In fact, I believe that PipeWire will be able to usher in better Bluetooth audio support in Fedora than we ever had before, with great support for high-quality Bluetooth audio codecs like LDAC.

I am especially happy to see so many of the key members of the pro-audio Linux community taking part in this effort and am of course also happy to see many pro-audio folks testing Fedora Workstation for the first time due to this effort. The community is working closely with Wim to test and verify as many important pro-audio applications as possible and to update Fedora packaging as needed to ensure they can transition from Jack to PipeWire without dependency conflicts or issues. One last item to mention here is that you might have seen that Red Hat is getting into the automotive space. I can’t share a lot of details about that effort, but one thing I can say is that PipeWire will be a core part of it and thus we will soon be looking to hire more engineers to work on PipeWire, so if that is of interest to you be sure to track my twitter feed or blog as I will announce our job openings there as they become available. For the community at large this should be great too, as it means that we can get a lot of synergy between automotive and the desktop around audio and video handling.

It is still somewhat of an open question if we end up actually switching to PipeWire in Fedora Workstation 34, but things are looking good at this point in time and worst case scenario it will be in place for Fedora Workstation 35.

Toolbox
Toolbox is another effort that is in a great spot now. Toolbox is our tool for making working with pet containers a breeze for developers. The initial version was prototyped quickly by writing it as a shell script, but we spent time last year getting it rewritten in Go in order to make it possible to keep expanding the project and allow us to implement all the things we envision for it. With that done, feature work is now in focus again and Ondřej Michal has done some great work making it possible to set up RHEL containers in Toolbox. This means that you can run Fedora on your laptop and get the latest and greatest features that way, but do your development in a RHEL pet container, so you get an environment identical to what your applications will see once they are deployed into the cloud or onto company hardware. This gives you the best of both worlds in my opinion: the fast-moving Fedora Workstation that gets the most out of your laptop and desktop hardware, but still easy access to the RHEL platform for development and testing. You can even test this today on Fedora Workstation 33, just open a terminal and type ‘toolbox create --distro rhel --release 8.3‘. The resulting toolbox will then be based on RHEL and not Fedora and thus perfect for doing RHEL-targeted development. You will need to use the subscription-manager tool to register it (be sure to register on developer.redhat.com for your free RHEL developer subscription). Over time we hope to integrate this into GNOME Online Accounts like we do for the RHEL virtual machines you can set up with GNOME Boxes, so that once you set up your RHEL account you can create RHEL virtual machines and RHEL containers easily on Fedora Workstation.

Toolbox with RHEL

Toolbox pet container with RHEL UBI

Flatpak
Owen Taylor has been doing some incredible work behind the scenes for the last year trying to ensure the infrastructure we have in RHEL and Fedora provides a great integrated Flatpak experience. As we move forward we expect Flatpaks to be the primary packaging format in which Fedora users consume their applications, but to make that a reality we needed to ensure the experience is good both for Fedora maintainers and for end users. So one of the big ticket items Owen has been working on is getting incremental updates working in Fedora. If you have used applications from Flathub you probably noticed that their updates are small and nimble despite being packaged as Flatpak containers, while the Fedora Flatpaks cause big updates each time. The reason for this is that the Fedora Flatpaks are shipped as OCI (Open Container Initiative) images, while the Flatpaks on Flathub are shipped as OSTree repositories (if you don’t know OSTree, think of it as git for binaries). Shipping the Flatpaks as OCI images has advantages in the form of being the same format we at Red Hat use for our kubernetes/docker/openshift containers, and thus it allows us to reuse a lot of the work that Red Hat has put into ensuring we can provide and keep such containers up to date and secure, but the downside until now has been that these containers were shipped in a way which caused each update, no matter how small the change, to be a full re-download of the whole image. Well, Owen Taylor and Alex Larsson worked together to resolve this and came up with a method to allow incremental updates of such containers and thus bring the update sizes in line with what you see on Flathub for Flatpaks. This should be deployed in time for Fedora Workstation 34 and we also hope to eventually deploy it for kubernetes/docker containers too. Finally, to make even more applications available, we are doing work to enable people to get access to Flathub.org out of the box in Fedora when you enable 3rd party repositories in initial setup, so that your out of the box application selection will be even bigger.

Flathub frontpage

Flathub webpage

GNOME 40
Another major change in Fedora Workstation 34 is GNOME 40, which contains a revamp of the GNOME 3 user interface. This was a collaborative effort between a lot of GNOME 3 stakeholders, with Allan Day representing Red Hat. This was also an effort by the GNOME design community to up their game, and thus as part of the development process the GNOME Foundation paid a professional company to do user testing on the proposed changes and some of the alternatives. This means that the changes were verified to actually be experienced as an improvement by the experienced GNOME users participating and to feel intuitive for new users of GNOME. One advantage we have in Fedora is that we don’t do major tweaking of the GNOME user interface, which means that once Fedora Workstation 34 ships you are set to enjoy the new GNOME experience from day one. For long time GNOME users I hope and expect that the updates will be a welcome refresh, and at the same time that the changes provide an easier onramp for new GNOME and Fedora Workstation users. Some of the early versions did leave some long-term fans of how multi-monitor support worked in GNOME 3 a bit concerned, but be assured that multi-monitor is a critical use case in our opinion and something we have been looking at and will keep improving. In fact, Allan Day wrote a great blog post about GNOME 40 multi-monitor support recently to explain what we are doing and how we see it evolving going forward.

Input
Another area where we keep putting in a lot of effort is input. Thanks to Peter Hutterer and Benjamin Tissoires we keep making sure Fedora Workstation and the world of Linux have access to the newest and best in input. The latest effort they are working on has been to enable haptic touchpads. Haptic touchpads should be familiar to people who have tried Apple hardware, but they are expected to appear in force on laptops in general this year, so we have been putting in the effort to ensure that we can support this new type of device as they come out. So if you see laptops you want with haptic touchpads then Fedora Workstation should be ready for them, but of course until these devices are commonplace and we have had a chance to test and verify, I can make no guarantees.

Another major effort that we undertook in relation to input was moving GNOME input handling to a separate thread. Carlos Garnacho worked on the patch to make that happen. This should provide a smoother experience with Fedora Workstation 34, as it means the mouse should not stall due to the main thread running Wayland being busy. This was done as part of the overall performance work we have been doing continuously over the last few years to address performance issues and give Fedora and GNOME the best performance and performance-related behaviour possible.

Lenovo Laptops
So one of the major announcements of last year was Lenovo laptops with Fedora Linux pre-installed. There are currently two models available with Linux, the X1 Carbon and the Lenovo P1. Between them they cover the two most common requests we see: an ultra-lightweight laptop with the X1 and a more powerful ‘portable workstation’ model with the P1. We are working on a couple more models and also on getting them sold globally, which was a key goal of the effort. Be aware that both models are on sale as I am writing this (hopefully still true when you read this), so it is a good time to grab a great laptop with a great OS.

Lenovo P1

Lenovo P1

Vision
So one thing I wanted to do is tie the work we do in Fedora Workstation together by articulating what we are trying to achieve. Fedora has for the longest time been the place where Linux as an operating system is evolving and being developed. There are very few major innovations that have come to Linux that haven’t been developed and incubated in Fedora by Fedora contributors, including of course Red Hat. This includes things like the Linux Vendor Firmware Service, Wayland, Flatpak, Silverblue, PipeWire, systemd, flicker free boot, HiDPI support, gaming mouse support and so much more. We have always done this work in close cooperation with the upstreams we are collaborating with, which is why the patch delta in any given Fedora release is small. We work hard to get improvements into the upstream kernel and into GNOME and so on right away, to avoid needing to ship downstream patches in Fedora. That of course saves us from having to maintain temporary forks, but more importantly it is the right way to collaborate with an open source community.

So looking back to when we launched Fedora Workstation we realized that being at the front like that had come at the cost of not being stable and user friendly. So the big question we tried to ask ourselves when launching Fedora Workstation, and the question that still drives a lot of our decision making and focus, is: how can we preserve being the heart and center of Linux OS development, but at the same time provide end users with a stable and well functioning system? To achieve that we have made a lot of changes over the last few years, ranging from policy changes in terms of how and when we bring changes into Fedora to, maybe even more importantly, a big focus on figuring out ways to reduce the challenges caused by a rapidly evolving OS, like the introduction of Flatpaks to allow applications to be developed and released without strong ties to the host system libraries, the concepts we are maturing in Silverblue around image-based operating systems, and how we are looking at pet container development with Toolbox. All of these things combined remove a lot of the fragility we have seen in Linux up to this point and instead let us treat the rapidly evolving Linux landscape as a strength.

So where we are today is that I think we are very close to realizing the vision of letting Fedora be the place where exciting new stuff happens, yet at the same time providing the robustness and polish that end users need to be able to use it as their daily driver. It has been my daily driver for many years now, and judging by the rapid growth of users we have seen in Fedora over the last 5 years I think that is true for a lot of other people too. The goal is to allow the wider community around Linux, especially the developers, sysadmins and creators relying on Linux to do their work, to come to Fedora and interact and collaborate with the developers working on the OS itself to the benefit of all. You are all probably better judges than me of whether we are succeeding with that, but I do take the increased chatter about and adoption of Fedora by a lot of people doing Linux-related podcasts, news sites and so on as a sign that we are succeeding. And PipeWire for me is a perfect example of how this can look, where we want to bring in the pro-audio creators to Fedora Workstation and let them interact and work closely with Wim Taymans and the PipeWire development team to make the experience even better for themselves and their fellow creators, and at the same time give them a great stable platform to create their music on.

March 12, 2021

A New Post

Blogging is hard.

There’s a big switch that a brain needs to do in order to go from ramming code into a project at full speed to carefully crafting an internet post that potentially people might want to read, and some days the effort required to make that switch exceeds the available energy.

Such is (at least one of) the reason(s) why more graphics driver-y people don’t blog more, since it’s taxing enough for me to try and get posts out and I mostly just post memes.

But this time is definitely going to be a return to the normalcy that is posting more than once per week. The key is just to start posting and keep doing it.

So what’s been going on since the last post?

Well.

Lots.

Since Our Last Correspondence…

  • Some number of patches landed
    • zink is now conformant for the enhanced layouts tests in CTS (KHR-GL46.enhanced_layouts*)
  • Mesa 21.0 is going out (has already gone out? I’m bad at time) with GL 4.1 enabled in zink
    • enjoy crashes and GPU hangs in style with a driver that nobody else is using!
  • Definitely other stuff that’s good and useful but isn’t in the scope of this post

Easing Back In

Tomorrow’s post is going to be back to the usual. In fact, I’m going to be posting about some tc-related stuff, and now that I’ve said it, I can’t back out.

March 10, 2021

Over the last year and then some, we at Collabora have been working with Microsoft on their D3D12 mapping layer, which I announced in my previous blog post. In July, Louis-Francis Ratté-Boulianne wrote an update on the status on the Collabora blog, but a lot has happened since then, so it’s time for another update.

There are two major things that have happened since then: we have passed the OpenGL 3.3 conformance tests, and we have upstreamed the code into Mesa 3D.

Photoshop support

It might not be a big surprise, but one of the motivations for this work was to be able to run applications like Photoshop on Windows devices without full OpenGL support.

I’m happy to report that Microsoft has released their compatibility pack that uses our work to provide OpenGL (and OpenCL) support, so Photoshop can now run on Windows on ARM CPUs! It is pretty exciting to see high-profile applications like that benefit from our work!

OpenGL 3.3 Conformance Test Suite

First of all, I would like to point out that having passed the OpenGL CTS isn’t necessarily the same as being formally conformant. There are some details about how to formally become conformant on layered implementations that are complicated, and I’ll leave the question about formal conformance up to Microsoft and Khronos.

Instead, I want to talk a bit about passing the OpenGL CTS.

Challenges for layered implementations

A problem we face with layered implementations is that we are subject to a few more sources of issues, some of which are entirely out of our control. A normal, non-layered OpenGL implementation has two primary sources of issues:

  1. The OpenGL driver
  2. The hardware

Issues with the OpenGL driver itself that lead to failing tests always have to be fixed before results are submitted. Issues with the hardware generally require software workarounds, but this is not always feasible, so Khronos has a system where a vendor can file a waiver for a hardware issue; if approved, they can mark test failures as waived and the appropriate failures will be ignored.

So far so good.

But for our layered implementations, our world looks a bit different. We don’t really see the hardware, but instead we see D3D12 and the D3D12 driver. This means our sources of issues are:

  1. The OpenGL driver
  2. The D3D12 run-time
  3. The D3D12 vendor-driver
  4. The hardware

For the OpenGL driver, the story is the same as for a non-layered implementation, but from there on things start changing.

Problems in the D3D12 run-time must also be fixed before submitting results. We work together with Microsoft to get these issues fixed as appropriate. Such fixes can take a while to trickle all the way into a Windows build and to end-users, but they will eventually show up.

But for the D3D12 vendor-driver and below, things get complicated. First of all, it’s not always possible for us to tell vendor-driver issues and hardware issues apart. And worse, as these are developed by third-party companies, we have little insight there. We can’t affect their priorities, so it’s hard to know when or even if an issue gets resolved.

It’s also not really a good idea to work around such issues, because if they turn out to be fixable software problems, we don’t know when they will be fixed, so we can’t really tell when to disable the work-around. We also don’t know exactly what combination of hardware and software these issues apply to.

But there’s one case where we have full insight, and that’s when the D3D12 vendor-driver is WARP, a high-performance software rasterizer. That component is developed by Microsoft, and we have channels to report issues and even make sure they get resolved!

Bugs, bugs, bugs

When developing something new, there are always going to be bugs. But usually also when using something existing in a new way. We encountered a lot of bugs on our way, and here’s a quick overview of some of them. This is in no way exhaustive, and most of our own bugs are not that interesting. So this is mostly about problems unique to layered implementations.

64-bit shifts

It turned out early on that the DXIL validator in D3D12 had a requirement when parsing the LLVM bitcode that shift amounts were always 32-bit values. While this seems fine by itself, LLVM itself requires that all operands to binops have the same bit-size. This obviously meant that only 32-bit shifts could pass the validator.

Microsoft quickly removed this requirement once we figured out what was going on.

Aligned block-compressed textures

In D3D12, one requirement for block-compressed textures is that the base level is aligned to the block size. This requirement does not apply to mip-levels, and OpenGL has no such requirement. This isn’t technically speaking a bug, but a documented limitation in DirectX.

It turns out that this limitation was an artificial historical left-over, and after a bunch of testing (and fixing of WARP), we got this limitation lifted. Great :)

D3D12 vendor-driver bugs

Something that has been much more frustrating is bugs in the vendor-drivers. The problem here is that even though we have channels to file bugs, we don’t have any influence or even insight into their prioritization and release schedule.

I think it suffices to say that there have been several bugs reported to all the vendors we’ve actively been running the OpenGL CTS on top of. We believe fixes are underway for at least some of these, but we can’t make any promises here.

Current status

Right now, the only configurations we’re cleanly passing the OpenGL 3.3 CTS on are WARP (which became conformant on November 24th, 2020), and NVIDIA (which became conformant on February 26th, 2021).

Having these multiple independent implementations of DirectX drivers passing in conjunction with the Mesa/D3D12 layer shows that we are able to implement GLon12 in a vendor-neutral way, which allowed us to bring the layer to conformance. Many thanks to Khronos for their assistance through this process.

We’ve also submitted results on top of an Intel GPU, but that submission has been halted due to failures, and will as far as I know be updated as soon as Intel publish updated drivers.

The conformance tests have been run against our downstream fork, which is no longer actively maintained, because:

Upstreaming

The D3D12 driver was upstreamed in Mesa in Merge-Request 7477, and the OpenCL compiler followed in Merge-Request 7565. There’s been a lot more merge-requests since then, and even more is expected in the future.

The process of upstreaming the driver into Mesa 3D went relatively smoothly, but there were quite a lot of regressions that happened quickly after we upstreamed the code, so to keep this from becoming a big problem we’ve added the D3D12 driver to Mesa’s set of GitLab CI tests. We now build and test the D3D12 driver on top of WARP in CI, as well as running some basic sanity tests for the OpenCL compiler.

All in all, this seems to work very well right now, and we’re looking forward to the future. Next step, WSL support!

I’m no longer working full-time on this project, instead I’m trying to take some of the lessons learned and apply them to Zink. I’m sure there’s even more room for code-reuse than what we currently have, but it will probably take some time to figure out.

March 08, 2021

b4 was created by Konstantin Ryabitsev and has become a very frequently used tool for me.

It supports a lot of different ways of interacting with the Linux kernel mailing lists. Of these, the b4 am subcommand is what I primarily use. This subcommand downloads all of the patches belonging to a patch series and drops them into a .mbox file. But! It doesn't apply them to the repository we're currently in, and herein lies the itch that I would like to scratch.

The inspiration for this post is the script that @stellarhopper authored and @widawsky pointed out to me.

The Good, the Bad & the Ugly

After first publishing this post, people on the twittersphere suggested some alternative approaches, and it would seem that there …

March 03, 2021

One of the best decisions I made in my life was joining Igalia in 2012. Inside Igalia, I have been working on different open-source projects, most of the time related to graphics technologies, interacting with different communities, giving talks, organizing conferences and, more importantly, contributing to free software as my daily job.

Now I’m thrilled to announce that we are hiring for our Graphics team!

Igalia

Right now we have two open positions:

  • Graphics Developer.

    We are looking for candidates that would like to contribute to open-source OpenGL/Vulkan graphics drivers (Mesa), or other areas of the open-source graphics stack such as X11 or Wayland, among others. If you have experience with them, or you are very motivated to become an expert there, just send us your CV!

  • Kernel Developer.

    We are looking for candidates that either have experience with kernel development or can ramp up quickly to contribute to Linux kernel drivers. Although no specific subsystem is mentioned in the job position, I encourage you to apply if you have DRM experience and/or ARM[64]/MIPS related knowledge.

Graphics technologies are not your cup of tea? We have positions in other areas like browsers, compilers, multimedia… Just check out our job offers on our website!

What we offer is to work in an open-source consultancy in which you can participate equally in the management and decision-making process of the company via our democratic, consensus-based assembly structure. As all of our positions are remote-friendly, we welcome submissions from any part of the world.

Are you still a student? We have launched the 2021 edition of our Coding Experience program. Check it out!

Igalia's office

February 27, 2021

It is extremely rare that a hobby software project of mine gets completed, but now it has happened. Behold! Fourbyfour!

Have you ever had to implement a mathematical algorithm, say, matrix inversion? You want it to be fast and measuring the speed is fairly simple, right? But what about correctness? Or precision? Behavior around inputs that are on the edge? You can hand-pick a few example inputs, put those into your test suite, and verify the result is what you expect. If you do not pick only trivial inputs, this is usually enough to guarantee your algorithm does not have fundamental mistakes. But what about those almost invalid inputs, can you trust your algorithm to not go haywire on them? How close to invalid can your inputs be before things break down? Does your algorithm know when it stops working and tell you?

Inverting a square matrix requires that the inverse matrix exists to begin with. Matrices that do not mathematically have an inverse matrix are called singular. Can your matrix inversion algorithm tell you when you are trying to invert a matrix that cannot be inverted, or does it just give you a bad result pretending it is ok?

Working with computers often means working with floating-point numbers. With floating-point, the usual mathematics is not enough, it can actually break down. You calculate something and the result a computer gives you is total nonsense, like 1+2=2 in spirit. In the case of matrix inversion, it's not enough that the input matrix is not singular mathematically, it needs to be "nice enough" numerically as well. How do you test your matrix inversion algorithm with this in mind?

These questions I tried to answer with Fourbyfour. The README has the links to the sub-pages discussing how I solved this, so I will not repeat it here. However, as the TL;DR, if there is one thing you should remember, it is this:

    Do not use the matrix determinant to test if a matrix is invertible!

Yes, the determinant is zero for a singular matrix. No, close to zero determinant does not tell you how close to singular the matrix is. There are better ways.

However, the conclusion I came to is that if you want a clear answer for a specific input matrix, whether it is invertible, the only way to know for sure is to actually invert it, multiply the input matrix with the inverse you computed, and measure how far off from the identity matrix the result is. Of course, you also need to set a threshold for how close to the identity matrix is close enough for your application, because with numerical algorithms you will almost never get the exact answer. Also, pick an appropriate matrix norm for the matrix difference.
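
As a minimal sketch of that check for a 4x4 matrix (this is not Fourbyfour's code; the max-norm choice and the threshold are placeholders for whatever your application actually needs):

#include <math.h>
#include <stdbool.h>

static bool
inverse_is_good(const double A[4][4], const double Ainv[4][4], double threshold)
{
    double max_err = 0.0;

    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            double sum = 0.0;
            for (int k = 0; k < 4; k++)
                sum += A[i][k] * Ainv[k][j];
            /* element-wise distance from the identity matrix */
            double err = fabs(sum - (i == j ? 1.0 : 0.0));
            if (err > max_err)
                max_err = err;
        }
    }
    return max_err <= threshold;
}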

The reason for this conclusion is what one of the tools I wrote tells me about a matrix that would be typical for a display server with two full-HD monitors. The matrix is simply the pixel offset of the second monitor on the desktop. The analysis of the matrix is the example I used to demonstrate fourbyfour-analyse. If you read through it, you should be shocked. The mathematics, as far as I can understand, seems to tell us that if you use 32-bit floating-point, inverting this matrix gives us a result that leads to no correct digits at all. Obviously this is nonsense, the inverse is trivial and algorithms should give the exact correct result. However, the math does not lie (unless I did). If I did my research right, then what fourbyfour-analyse tells us is true, with an important detail: it is the upper error bound. It guarantees that we cannot get errors larger than that (heh, zero correct digits is pretty hard to make much worse). But I also read that there is no better error bound possible for a generic matrix inversion algorithm. (If you take the obvious-to-human constraints into account that those elements must be one and those must be zero, the analysis would likely be very different.) Therefore the only thing left to do is to actually go on with the matrix inversion and then verify the result.

Here is a list of the cool things the Fourbyfour project does or has:

  • Generates random matrices arbitrarily close to singular in a controlled way. If you simply generated random matrices for testing, they would almost never be close to singular. With this code, you can define how close to singular you want the matrices to be, to really torture your inversion algorithms.
  • Generates random matrices with a given determinant value. This is orthogonal to choosing how close to singular the generated matrices are. You can independently pick the determinant value and the condition number, and have the random matrices have both simultaneously.
  • Plot a graph about a matrix inversion algorithm's behavior when inputs get closer to singular, to see exactly when it breaks down.
  • A tutorial on how mathematical matrix notation works and how it relates to row- vs. column-major layouts (spoiler: it does not).
  • A comparison between Graphene and Weston matrix inversion algorithms.
In this project I also tried out several project quality assurance features:

  • Use Gitlab CI to run the test suite for the main branch, tags, and all merge requests, but not for other git branches.
  • Use Freedesktop ci-templates to easily generate the Docker image in CI under which CI testing will run.
  • Generate LCOV test code coverage report from CI.
  • Use the reuse lint tool in CI to ensure every single file has a defined, machine-readable license. Using well-known licenses clearly is important if you want your code to be attractive. Fourbyfour also uses the Developer Certificate of Origin.
  • Use ci-fairy to ensure every commit has a Signed-off-by and every merge request allows maintainer pushes.
  • Good CI test coverage. Even the pseudo-random number generator in the test suite is tested to check that it roughly follows the intended distribution.
  • CONTRIBUTING file. I believe that every open source project regardless of size needs this to set up the people's expectations when they see your project, whether you expect or accept contributions or not.
I'm really happy this project is now "done", well, version 1.0.0 so to say. One thing I have realized it is still missing is a determinant sweep mode. The precision testing mode sweeps over condition numbers and allows plotting the inversion behavior. It should have another mode where the sweep controls the determinant value, with some fixed condition number for the random test matrices. This determinant mode could point out inversion algorithms that use determinant value for matrix singularity testing and show how it leads to completely arbitrary results.

If you want to learn about numerical methods for matrices, I recommend the book Gene H. Golub, Charles F. van Loan, Matrix Computations. The Johns Hopkins University Press. I used the third edition, 1996, when implementing the Weston matrix inversion years ago.

February 26, 2021

Introduction

In a now pretty well established tradition on my part, I am posting on things I no longer work on!

I gave a talk on modifiers at XDC 2017 and Linux Plumbers 2017 (audio only). It was always my goal to have a blog post accompany the work. Relatively shortly after the talks, I ended up leaving graphics and so it dropped on the priority list.

I'm splitting this up into two posts. This post will go over the problem, and solutions. The next post will go over the implementation details.

Modifiers

Each 3d computational unit in an Intel GPU is called an Execution Unit (EU). Aside from what you might expect them to do, like execute shaders, they may be used for copy operations (itself a shader), or compute operations (also, shaders). All of these things require memory bandwidth in order to complete their task in a timely manner.

Modifiers were the chosen solution to allow end-to-end renderbuffer [de]compression to work, which is itself designed to reduce memory bandwidth needs in the GPU and display pipeline. End-to-end renderbuffer compression simply means that through all parts of the GPU and display pipeline, assets are read and written using a compression scheme that is capable of reducing bandwidth (more on this later).

Modifiers are a relatively simple concept. They are modifications that are applied to a buffer's layout. Typically a buffer has a few properties: width, height, and pixel format, to name a few. Modifiers can be thought of as ancillary information that is passed along with the pixel data. It will impact how the data is processed or displayed. One such example might be to support tiling, which is a mechanism to change how pixels are stored (not sequentially) in order for operations to make better use of locality for caching and other similar reasons. Modifiers were primarily designed to help negotiate modified buffers between the GPU rendering engine and the display engine (usually by way of the compositor). In addition, other uses can crop up, such as the video decode/encode engines.
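
As a rough illustration of where a modifier surfaces in the API (this is not from the original work; the buffer handle, pitch and the particular modifier are assumed to come from the allocator and from prior negotiation with the producer), creating a KMS framebuffer with an explicit modifier through libdrm might look like this:

#include <stdint.h>
#include <xf86drm.h>
#include <xf86drmMode.h>
#include <drm_fourcc.h>

static int
add_fb_with_modifier(int drm_fd, uint32_t width, uint32_t height,
                     uint32_t bo_handle, uint32_t pitch, uint64_t modifier,
                     uint32_t *out_fb_id)
{
    uint32_t handles[4] = { bo_handle };
    uint32_t pitches[4] = { pitch };
    uint32_t offsets[4] = { 0 };
    /* e.g. DRM_FORMAT_MOD_LINEAR or a vendor tiling/compression layout */
    uint64_t modifiers[4] = { modifier };

    /* The modifier travels alongside width/height/format and tells the
     * display engine how the pixel data is actually laid out. */
    return drmModeAddFB2WithModifiers(drm_fd, width, height,
                                      DRM_FORMAT_XRGB8888,
                                      handles, pitches, offsets, modifiers,
                                      out_fb_id, DRM_MODE_FB_MODIFIERS);
}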

A Waste of Time and Gates

My understanding is that even now, 3 years later, full modifier support isn't readily available across all corners of the graphics ecosystem. Many hardware features are being entirely unrealized. Upstreaming sweeping graphics features like this one can be very time consuming and I seriously would advise hardware designers to take that into consideration (or better yet, ask your local driver maintainer) before they spend the gates. If you can make changes that don't require software, just do it. If you need software involvement, the longer you wait, the worse it will be.

They weren't new even when I made the presentation 3.5 years ago.

commit e3eb3250d84ef97b766312345774367b6a310db8
Author: Rob Clark <robdclark@gmail.com>
Date:   6 years ago

    drm: add support for tiled/compressed/etc modifier in addfb2

I managed to land some stuff:

commit db1689aa61bd1efb5ce9b896e7aa860a85b7f1b6
Author: Ben Widawsky <ben@bwidawsk.net>
Date:   3 years, 7 months ago

    drm: Create a format/modifier blob

Admiring the Problem

A back-of-the-envelope bandwidth requirement for a midrange Skylake GPU from the time can be calculated relatively easily. 4 years ago, at the frequencies we ran our GPUs and given their ISA, we could expect roughly 1 GB/s for each of the 24 EUs.

A 4k display:

3840 px × 2160 rows × 4 Bpp × 60 Hz = 1.85 GB/s

24 GB/s + 1.85 GB/s = 25.85 GB/s

This by itself will oversaturate single-channel DDR4 bandwidth (which was what was around at the time) at the fastest possible clock. As it turns out, it gets even worse with compositing. Most laptops sporting a SKL of this range wouldn't have a 4k display, but you get the idea.
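
For reference, a single 64-bit DDR4 channel at 3200 MT/s moves 8 B × 3200 MT/s = 25.6 GB/s, so the 25.85 GB/s above already exceeds it before anything else in the system touches memory.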

The picture (click for larger SVG) is a typical "flow" for a composited desktop using direct rendering with X or a Wayland compositor using EGL. In this case, drawing a Rubik's cube looking thing into a black window.

Admiring the problem

Using this simple Rubik's cube example I'll explain each of the steps so that we can understand where our bandwidth is going and how we might mitigate that. This is just the overview, so feel free to move on to the next section. Since the example will be trivial, and the window is small (and only a singleton) it won't saturate the bandwidth, but it will demonstrate where the bandwidth is being consumed, and open up a discussion on how savings can be achieved.

Rendering and Texturing

For the example, no processing happens other than texturing. In a simple world, the processing of the shader instructions doesn't increase the memory bandwidth cost. As such, we'll omit that from the details.

The main steps on how you get this Rubik's cube displayed are

  • upload a static texture
  • read from static texture
  • write to offscreen buffer
  • copy to output frame
  • scanout from output

More details below...

Texture Upload

Getting the texture from the application, usually from disk, into main memory, is what I'm referring to as texture upload. In terms of memory bandwidth, you are using write bandwidth to write into the memory.

Assets are transferred from persistent storage to memory

Textures may either be generated by the 3d application, which would be trivial for this example, or they may be authored using a set of offline tools and baked into the application. For any consequential use, the latter is predominantly used. Certain surface types are often dynamically generated though, for example, the shadow mapping technique will generate depth maps. Those dynamically generated surfaces actually will benefit even more (more later).

This is pseudo code (but close to real) to upload the texture in OpenGL:

const unsigned height = 128;
const unsigned width = 64;
const void *data = ... // rubik's cube
GLuint tex;

glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_2D, tex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB, width, height, 0, GL_RGB, GL_UNSIGNED_BYTE, data);
glGenerateMipmap(GL_TEXTURE_2D);

I'm going to punt on explaining mipmaps, which are themselves a mechanism to conserve memory bandwidth. If you have no understanding, I'd recommend reading up on mipmaps. This wikipedia article looks decent to me.

Texture Sampling

Once the texture is bound, the graphics runtime can execute shaders which can reference those textures. When the shader requests a color value (also known as sampling) from the texture, it's likely that the calculated coordinate within the texture will be in between pixels. The hardware has to return a single color value for the sample point, and the way it interpolates is chosen by the graphics runtime. This is referred to as a filter.

Texture Fetch/Filtering
  • Nearest: Take the value of the single closest pixel; no interpolation.
  • Bilinear: Take the surrounding 4 pixels and interpolate based on distance from the texture coordinate.
  • Trilinear: Bilinear, but also interpolate between the two closest mipmaps. (I skipped discussing mipmaps, but part of the texture fetch involves finding the nearest miplevel.)
  • Anisotropic: It's complicated. Let's say 16x trilinear. (The sketch below shows how these modes map to GL sampler state.)
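
Not from the original post, but for concreteness, here is how those filter modes are usually selected in OpenGL, continuing the upload example above (the anisotropic line assumes the EXT_texture_filter_anisotropic extension is available):

/* Pick one MIN filter per texture; each step down the list costs more texel
 * fetches per sample, and therefore more bandwidth. */
glBindTexture(GL_TEXTURE_2D, tex);

glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);              /* nearest: 1 texel    */
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);               /* bilinear: 4 texels  */
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR); /* trilinear: 8 texels */

/* anisotropic: up to 16 trilinear samples, i.e. up to 128 texels */
glTexParameterf(GL_TEXTURE_2D, GL_TEXTURE_MAX_ANISOTROPY_EXT, 16.0f);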

Here's the GLSL to fetch the texture:

#version 330

uniform sampler2D tex;
in vec2 texCoord;
out vec4 fragColor;

void main() {
    vec4 temp = texelFetch(tex, ivec2(texCoord), 0);
    fragColor = temp;
}

The above actually does something that's perhaps not immediately obvious: fragColor = temp;. This instructs the fragment shader to write out that value to a surface which is bound for output (usually a framebuffer). In other words, there are two steps here: read and filter a value from the texture, then write it back out.

The part of the overall diagram that represents this step:

Composition

In the old days of X, and even still when not using the composite extension, the graphics applications could be given a window to write the pixels directly into the resulting output. The X window manager would mediate resize and move events, letting the client update as needed. This has a lot of downsides which I'll say are out of scope here. There is one upside that is in scope though, there's no extra copy needed to create the screen composition. It just is what it is, tearing and all.

If you don't know whether you're currently using a compositor, you almost certainly are using one. Wayland only composites, and very few X window managers don't composite. So what exactly is compositing? Simply put, the compositor is a window manager that marshals frame updates from clients and is responsible for drawing them on the final output. Often the compositor may add its own effects such as the infamous wobbly windows. Those effects themselves may use up bandwidth!

Simplified compositor block diagram

Applications will write their output into what's referred to as an offscreen buffer. 👋👋 The compositor will read the output and copy it into what will become the next frame. What this means from a bandwidth consumption perspective is that the compositor will need to use both read and write bandwidth just to build the final frame. 👋👋

Display

It's the mundane part of this whole thing. Pixels are fetched from memory and pushed out over whatever display protocol is in use.

Display Engine

Perhaps the interesting thing about the display engine is it has fairly isochronous timing requirements and can't tolerate latency very well. As such, it will likely have a dedicated port into memory that bypasses arbitration with other agents in the system that are generating memory traffic.

Out of scope here, but I'll briefly mention that this also gets a bit into tiling. Display wants to read things row by row, whereas rendering works a bit differently. In short, this is the difference between X-tiling (good for display) and Y-tiling (good for rendering). Until Skylake, the display engine couldn't even understand Y-tiled buffers.

Summing the Bandwidth Cost

Running through our 64x64 example...

| Operation | Color Depth | Description | Bandwidth | R/W |
|---|---|---|---|---|
| Texture Upload | 1Bpc (RGBX8) | File to DRAM | 16KB (64 × 64 × 4) | W |
| Texel Fetch (nearest) | 1Bpc | DRAM to Sampler | 16KB (64 × 64 × 4) | R |
| FB Write | 1Bpc | GPU to DRAM | 16KB (64 × 64 × 4) | W |
| Compositing | 1Bpc | DRAM to DRAM | 32KB (64 × 64 × 4 × 2) | R+W |
| Scanout | 1Bpc | DRAM to PHY | 16KB (64 × 64 × 4) | R |

Total = (16 + 16 + 16 + 32 + 16) × 60Hz = 5.625MBs

But actually, the Display Engine will always scan out the whole thing, so really with a 4k display:

Total = (16 + 16 + 16 + 32 + 32400) × 60Hz = 1.9GBs

Don't forget about those various filter modes though!

| Filter Mode | Multiplier (texel fetch) | Total Bandwidth |
|---|---|---|
| Bilinear | 4x | 11.25MBs |
| Trilinear | 8x | 18.75MBs |
| Aniso 4x | 32x | 63.75MBs |
| Aniso 16x | 128x | 243.75MBs |

Proposing some solutions

Without actually doing the math, I think cache is probably the biggest win you can get. One spot where caching could help is the framebuffer write step followed by the composition step, which could avoid the trip to main memory. Another is texture upload and fetch. Assuming you don't blow out your cache, you can avoid the main memory trip.

While caching can buy you some relief, ultimately you have to flush your caches to get the display engine to be able to read your buffer. At least as of 2017, I was unaware of an architecture that had a shared cache between display and 3d.

Also, cache sizes are limited...

Wait for DRAM to get faster

Instead of doing anything, why not just wait until memory gets higher bandwidth?

Here's a quick breakdown of the progression at the high end of the specs. For the DDR memory types, I took a swag at the number of populated channels. For a fair comparison, the expectation with DDR is that you'll have at least dual channel nowadays.

Bandwidth

Looking at the graph, it seems like the memory vendors aren't hitting Moore's Law any time soon, and if they are, they're fooling me. A similar chart should be made for execution unit counts, but I'm too lazy. A Tigerlake GT2 has 96 EUs. If you go back to our back-of-the-envelope calculation we had a midrange GPU at 24 EUs, so that has quadrupled. In other words, the system architects will use all the bandwidth they can get.

Improving memory technologies is vitally important, it just isn't enough.

TOTAL SAVINGS = 0%

Hardware Composition

One obvious place we want to try to reduce bandwidth is composition. It was after all the biggest individual consumer of available memory bandwidth.

With composition as we described earlier, there was presumed to be a single plane. Software would arrange the various windows onto the plane, which if you recall from the section on composition added quite a bit to the bandwidth consumption, then the display engine could display from that plane.

Hardware composition is the notion that each of those windows could have a separate display plane, directly write into that, and all the compositor would have to do is make sure those display planes occupied the right part of the overall screen. It's conceptually similar to the direct scanout we described earlier in the section on composition.
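
The post doesn't include code for this, but under Linux this is what KMS planes expose. A minimal libdrm sketch, assuming the plane, CRTC, and framebuffer IDs have already been discovered through drmModeGetPlaneResources and friends:

#include <stdint.h>
#include <xf86drmMode.h>

/* Put a client's 64x64 buffer (fb_id) on its own hardware plane at (x, y),
 * so the display engine composites it instead of the GPU copying it. */
static int show_window_on_plane(int drm_fd, uint32_t plane_id, uint32_t crtc_id,
                                uint32_t fb_id, int32_t x, int32_t y)
{
    return drmModeSetPlane(drm_fd, plane_id, crtc_id, fb_id, 0 /* flags */,
                           x, y, 64, 64,              /* destination rect on the CRTC */
                           0, 0, 64 << 16, 64 << 16); /* source rect, 16.16 fixed point */
}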

| Operation | Color Depth | Description | Bandwidth | R/W |
|---|---|---|---|---|
| Compositing | 1Bpc | DRAM to DRAM | 32KB (64 × 64 × 4 × 2) | R+W |

TOTAL SAVINGS = 1.875MBs (33% savings)

Hardware Composition Verdict

33% savings is really really good, and certainly if you have hardware with this capability, the driver should enable it, but there are some problems that come along with this that make it not so appealing.

  1. Hardware has a limited number of planes.
  2. Formats. One thing I left out about the compositor earlier is that one of the things it may opt to do is convert the application's window into a format that the display hardware understands. This means some amount of negotiation has to take place so the application knows about this. Prior to this work, that wasn't in place.
  3. Doesn't reduce any other part of the process, i.e. a full screen application wouldn't benefit at all.

Texture Compression

So far, in order to solve the not-enough-bandwidth problem, we've tried to add more bandwidth and to reduce usage with hardware composition. The next place to go is to tackle the bandwidth consumed by texturing.

If you recall, we split texturing up into two stages: texture upload and texture fetch. This third proposed solution attempts to reduce bandwidth by storing a compressed texture in memory. Texture upload will compress it while uploading, and texture sampling can understand the compression scheme and avoid doing all the lookups. Compressing the texture usually comes with some imperceptible degradation. In terms of the sampling, it's a bit handwavy to say you reduce bandwidth by the compression factor, but for simplicity's sake, let's say that's what it does.

Some common formats at the time of the original materials were

| Format | Compression Ratio |
|---|---|
| DXT1 | 8:1 |
| ETC2 | 4:1 |
| ASTC | Variable, 6:1 |
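
Not in the original post, but to make the upload side concrete: this is roughly what the earlier glTexImage2D call becomes when the application ships a texture pre-compressed as DXT1/S3TC (assuming the implementation exposes EXT_texture_compression_s3tc):

const unsigned height = 128;
const unsigned width = 64;
/* DXT1 packs every 4x4 block of pixels into 8 bytes: 8:1 versus RGBX8 */
const unsigned image_size = ((width + 3) / 4) * ((height + 3) / 4) * 8;
const void *dxt1_data = ... // rubik's cube, compressed offline
GLuint tex;

glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_2D, tex);
glCompressedTexImage2D(GL_TEXTURE_2D, 0, GL_COMPRESSED_RGB_S3TC_DXT1_EXT,
                       width, height, 0, image_size, dxt1_data);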

Using DXT1 as an example of the savings:

| Operation | Color Depth | Bandwidth | R/W |
|---|---|---|---|
| Texture Upload | DXT1 | 2KB (64 × 64 × 4 / 8) | W |
| Texel Fetch (nearest) | DXT1 | 2KB (64 × 64 × 4 / 8) | R |
| FB Write | 1Bpc | 16KB (64 × 64 × 4) | W |
| Compositing | 1Bpc | 32KB (64 × 64 × 4 × 2) | R+W |
| Scanout | 1Bpc | 16KB (64 × 64 × 4) | R |

Here's an example with the simple DXT1 format:

Texture Compression Verdict

Texture compression solves a couple of the limitations that hardware composition left. Namely it can work for full screen applications, and if your hardware supports it, there isn't a limit to how many applications can make use of it. Furthermore, it scales a bit better because an application might use many many textures but only have 1 visible window.

There are of course some downsides.


For comparison, here is the same cube scaled down with an 8:1 ratio. As you can see DXT1 does a really good job.

Scaled cube

We can't ignore the degradation though as certain rendering may get very distorted as a result.

  • Lossy (perhaps).
  • Hardware compatibility. Application developers need to be ready and able to compress to different formats and select the right things at runtime based on what the hardware supports. This takes effort both in the authoring, as well as the programming.
  • Patents.
  • Display doesn't understand this, so you need to decompress before the display engine will read from a surface that has this compression.
  • Doesn't work well for all filtering types, like anisotropic.

*TOTAL SAVINGS (DXT1) = 1.64MBs (30% savings)

*total savings here is kind of a theoretical max

End to end lossless compression

So what if I told you there was a way to reduce your memory bandwidth consumption without having to modify your application, without being subject to hardware limits on planes, and without having to wait for new memory technologies to arrive?

End to end lossless compression attempts to provide both "end to end" and "lossless" compression transparently to software. Explanation coming up.

End to End

As mentioned in the previous section on texture compression, one of the pitfalls is that you'd have to decompress the texture in order for it to be used outside of your 3d engine. Typically that consumer would be the display engine scanning out from the surface, but you could also envision a case where perhaps you'd like to share these surfaces with the hardware video encoder. The nice thing about this "end to end" attribute is that every stage we mentioned in previous sections that required bandwidth can get the savings just by running on hardware and drivers that enable this.

Lossless

Now because this is all transparent to the application running, a lossless compression scheme has to be used so that there aren't any unexpected results. While lossless might sound great on the surface (why would you want to lose quality?), it reduces the potential savings because lossless compression algorithms are always less efficient than lossy ones, but it's still a pretty big win.

What's with the box, bro?

I want to provide an example of how this can be possible. Going back to our original image of the full picture, everything looks sort of the same. The only difference is that there is a little display engine decompression step, and all of the sampler and framebuffer write steps now have a little purple box accompanying them.

One sort of surprising aspect of this compression is it reduces bandwidth, not overall memory usage (that's also true of the Intel implementation). In order to store the compression information, hardware carves off a little bit of extra memory which is referenced for each operation on a texture (yes, that might use bandwidth too if it's not cached).

Here's a made-up implementation which tracks state in a similar way to Skylake era hardware, but the rest is entirely made up by me. It shows that even a naive implementation can get up to a lossless 2:1 compression ratio. Remember though, this comes at the cost of adding gates to the design and so you'd probably want something better performing than this.

2:1 compression

Everything is tracked as cacheline pairs. In this example we have state called "CCS". For every pair of cachelines in the image, 2b are in this state to track the current compression. When the pair of cachelines uses 12 or fewer colors (which is surprisingly often in real life), we're able to compress the data into a single cacheline (state becomes '01'). When the data is compressed, we can reassemble the image losslessly from a single cacheline, this is 2:1 compression because 1 cacheline gets us back 2 cachelines worth of pixel data.
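
The tracking scheme above is made up, and so is the C sketch below; it's purely illustrative of how a consumer of such a CCS might decide which read path to take, and none of the names correspond to real hardware or driver code.

#include <stdint.h>

/* Hypothetical CCS encodings for one cacheline pair (2 bits per pair),
 * mirroring the made-up scheme described above */
enum ccs_state {
    CCS_CLEAR        = 0x0, /* clear color; ignore for this example */
    CCS_COMPRESSED   = 0x1, /* 12 or fewer colors: 1 cacheline expands to 2 via the LUT */
    CCS_UNCOMPRESSED = 0x3, /* read both cachelines as-is */
};

/* 2 bits of CCS per cacheline pair means 4 pairs are tracked per CCS byte */
static enum ccs_state ccs_lookup(const uint8_t *ccs, unsigned pair_index)
{
    unsigned byte  = pair_index / 4;
    unsigned shift = (pair_index % 4) * 2;
    return (enum ccs_state)((ccs[byte] >> shift) & 0x3);
}

/* A consumer (sampler, display engine, ...) checks the CCS before reading:
 * '01' means fetch one cacheline and reconstruct two from the LUT,
 * anything else means fetch the pair directly. */
static unsigned cachelines_to_fetch(const uint8_t *ccs, unsigned pair_index)
{
    return ccs_lookup(ccs, pair_index) == CCS_COMPRESSED ? 1 : 2;
}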

Walking through the example we've been using of the Rubik's cube:

  1. As the texture is being uploaded, the hardware observes all the runs of the same color and stores them in this compressed manner by building the lookup table. On doing this it modifies the state bits in the CCS to be 01 for those cachelines.
  2. On texture fetch, the texture sampler checks the CCS. If the encoding is 01, then the hardware knows to use the LUT mechanism instead for all the color values.
  3. Throughout the rest of rendering, steps 1 & 2 are repeated as needed.
  4. When display is ready to scan out the next frame, it too can look at the CCS to determine if there is compression, and decompress as it's doing the scanout.

The memory consumed is minimal, which also means that any bandwidth usage overhead is minimal. In the example we have a 64x128 image. In total that's 512 cachelines. At 2 bits per pair of cachelines, the CCS for the example fits in a single cacheline: (512 / 2) pairs × 2b = 512b = 64B.

* Unless you really want to understand how hardware might actually work, ignore the 00 encoding for clear color.

* There's a caveat here that we assume texture upload and fetch use the sampler. At the time of the original presentation, this was not usually the case and so until the FB write occurred, you didn't actually get compression.

Theoretical best savings would compress everything:

| Operation | Color Depth | Description | Bandwidth | R/W |
|---|---|---|---|---|
| Texture Upload | 1Bpc compressed | File to DRAM | 8KB (64 × 64 × 4) / 2 | W |
| Texel Fetch (nearest) | 1Bpc compressed | DRAM to Sampler | 8KB (64 × 64 × 4) / 2 | R |
| FB Write | 1Bpc compressed | GPU to DRAM | 8KB (64 × 64 × 4) / 2 | W |
| Compositing | 1Bpc compressed | DRAM to DRAM | 16KB (64 × 64 × 4 × 2) / 2 | R+W |
| Scanout | 1Bpc compressed | DRAM to PHY | 8KB (64 × 64 × 4) / 2 | R |

TOTAL SAVINGS = 2.8125MBs (50% savings)

And if you use HW compositing in addition to this...

TOTAL SAVINGS = 3.75MBs (66% savings)

Ending Notes

Hopefully it's somewhat clear how 3d applications are consuming memory bandwidth, and how quickly the consumption grows when adding more applications, textures, screen size, and refresh rate.

End to end lossless compression isn't always going to be a huge win, but in many cases it can really chip away at the problem enough to be measurable. The challenge, as it turns out, is actually getting it hooked up in the driver and the rest of the graphics software stack. As I said earlier, just because a feature seems good doesn't necessarily mean it would be worth the software effort to implement it. End to end lossless compression is one feature that you cannot just turn on by setting a bit, and the fact that it's still not enabled anywhere, to me, is an indication that the effort and gates may have been better spent elsewhere.

However, the next section will be all about how we got it hooked up through the graphics stack.

If you've made it this far, you probably could use a drink. I know I can.

February 25, 2021

A Losing Battle

For a long time, I’ve tried very, very, very, very hard to work around problems with NIR variables when it comes to UBOs and SSBOs.

Really, I have.

But the bottom line is that, at least for gallium-based drivers, they’re unusable. They’re so unreliable that it’s only by sheer luck (and a considerable amount of it) that zink has worked at all until this point.

Don’t believe me? Here’s a list of just some of the hacks that are currently in use by zink to handle support for these descriptor types, along with the reason(s) why they’re needed:

| Hack | Reason It’s Used | Bad Because? |
|---|---|---|
| iterating the list of variables backwards | this indexing vaguely matches the value used by shaders to access the descriptor | only works coincidentally for as long as nothing changes this ordering and explodes entirely with GL-SPIRV |
| skipping non-array variables with data.location > 0 | these are (usually) explicit references to components of a block BO object in a shader | sometimes they’re the only reference to the BO, and skipping them means the whole BO interface gets skipped |
| using different indexing for SSBO variables depending on whether data.explicit_binding is set | this (sometimes) works to fix indexing for SSBOs with random bindings and also atomic counters | the value is set randomly by other optimization passes and so it isn’t actually reliable |
| atomic counters are identified by using !strcmp(glsl_get_type_name(var->interface_type), "counters") | counters get converted to SSBOs, but they require different indexing in order to be accessed correctly | c’mon. |
| runtime arrays (array[]) are randomly tacked onto SPIRV SSBO variables based on the variable type | fixes atomic counter array access and SSBO length() method | not actually needed most of the time |

And then there’s this monstrosity that’s used for linking up SSBO variable indices with their instruction’s access value (comments included for posterity):

unsigned ssbo_idx = 0;
if (!is_ubo_array && var->data.explicit_binding &&
    (glsl_type_is_unsized_array(var->type) || glsl_get_length(var->interface_type) == 1)) {
    /* - block ssbos get their binding broken in gl_nir_lower_buffers,
     *   but also they're totally indistinguishable from lowered counter buffers which have valid bindings
     *
     * hopefully this is a counter or some other non-block variable, but if not then we're probably fucked
     */
    ssbo_idx = var->data.binding;
} else if (base >= 0)
   /* we're indexing into a ssbo array and already have the base index */
   ssbo_idx = base + i;
else {
   if (ctx->ssbo_mask & 1) {
      /* 0 index is used, iterate through the used blocks until we find the first unused one */
      for (unsigned j = 1; j < ctx->num_ssbos; j++)
         if (!(ctx->ssbo_mask & (1 << j))) {
            /* we're iterating forward through the blocks, so the first available one should be
             * what we're looking for
             */
            base = ssbo_idx = j;
            break;
         }
   } else
      /* we're iterating forward through the ssbos, so always assign 0 first */
      base = ssbo_idx = 0;
   assert(ssbo_idx < ctx->num_ssbos);
}
assert(!ctx->ssbos[ssbo_idx]);
ctx->ssbos[ssbo_idx] = var_id;
ctx->ssbo_mask |= 1 << ssbo_idx;
ctx->ssbo_vars[ssbo_idx] = var;

Does it work?

Amazingly, yes, it does work the majority of the time.

But is this really how we should live our lives?

A Methodology To Live By

As the great compiler-warrior Jasonus Ekstrandimus once said, “Just Delete All The Code”.

Truly this is a pivotal revelation, one that can induce many days of deep thinking, but how can it be applied to this scenario?

Today I present the latest in zink code deletion: a NIR pass that deletes all the broken variables and makes new ones.

bender.jpg

Let’s get into it.

uint32_t ssbo_used = 0;
uint32_t ubo_used = 0;
uint64_t max_ssbo_size = 0;
uint64_t max_ubo_size = 0;
bool ssbo_sizes[PIPE_MAX_SHADER_BUFFERS] = {false};

if (!shader->info.num_ssbos && !shader->info.num_ubos && !shader->num_uniforms)
   return false;
nir_function_impl *impl = nir_shader_get_entrypoint(shader);
nir_foreach_block(block, impl) {
   nir_foreach_instr(instr, block) {
      if (instr->type != nir_instr_type_intrinsic)
         continue;

      nir_intrinsic_instr *intrin = nir_instr_as_intrinsic(instr);
      switch (intrin->intrinsic) {
      case nir_intrinsic_store_ssbo:
         ssbo_used |= BITFIELD_BIT(nir_src_as_uint(intrin->src[1]));
         break;

      case nir_intrinsic_get_ssbo_size: {
         uint32_t slot = nir_src_as_uint(intrin->src[0]);
         ssbo_used |= BITFIELD_BIT(slot);
         ssbo_sizes[slot] = true;
         break;
      }
      case nir_intrinsic_ssbo_atomic_add:
      case nir_intrinsic_ssbo_atomic_imin:
      case nir_intrinsic_ssbo_atomic_umin:
      case nir_intrinsic_ssbo_atomic_imax:
      case nir_intrinsic_ssbo_atomic_umax:
      case nir_intrinsic_ssbo_atomic_and:
      case nir_intrinsic_ssbo_atomic_or:
      case nir_intrinsic_ssbo_atomic_xor:
      case nir_intrinsic_ssbo_atomic_exchange:
      case nir_intrinsic_ssbo_atomic_comp_swap:
      case nir_intrinsic_ssbo_atomic_fmin:
      case nir_intrinsic_ssbo_atomic_fmax:
      case nir_intrinsic_ssbo_atomic_fcomp_swap:
      case nir_intrinsic_load_ssbo:
         ssbo_used |= BITFIELD_BIT(nir_src_as_uint(intrin->src[0]));
         break;
      case nir_intrinsic_load_ubo:
      case nir_intrinsic_load_ubo_vec4:
         ubo_used |= BITFIELD_BIT(nir_src_as_uint(intrin->src[0]));
         break;
      default:
         break;
      }
   }
}

The start of the pass iterates over the instructions in the shader. All UBOs and SSBOs that are used get tagged into a bitfield of their index, and any SSBOs which have the length() method called are similarly tagged.


nir_foreach_variable_with_modes(var, shader, nir_var_mem_ssbo | nir_var_mem_ubo) {
   const struct glsl_type *type = glsl_without_array(var->type);
   if (type_is_counter(type))
      continue;
   unsigned size = glsl_count_attribute_slots(type, false);
   if (var->data.mode == nir_var_mem_ubo)
      max_ubo_size = MAX2(max_ubo_size, size);
   else
      max_ssbo_size = MAX2(max_ssbo_size, size);
   var->data.mode = nir_var_shader_temp;
}
nir_fixup_deref_modes(shader);
NIR_PASS_V(shader, nir_remove_dead_variables, nir_var_shader_temp, NULL);
optimize_nir(shader);

Next, the existing SSBO and UBO variables get iterated over. A maximum size is stored for each type, and then the variable mode is set to temp so it can be deleted. These variables aren’t actually used by the shader anymore, so this is definitely okay.

Boom.

if (!ssbo_used && !ubo_used)
   return false;

Early return if it turns out that there’s not actually any UBO or SSBO use in the shader, and all the variables are gone to boot.

struct glsl_struct_field *fields = rzalloc_array(shader, struct glsl_struct_field, 2);
fields[0].name = ralloc_strdup(shader, "base");
fields[1].name = ralloc_strdup(shader, "unsized");

The new variables are all going to be the same type, one which matches what’s actually used during SPIRV translation: a simple struct containing an array of uints, aka base. SSBO variables which need the length() method will get a second struct member that’s a runtime array, aka unsized.

if (ubo_used) {
   const struct glsl_type *ubo_type = glsl_array_type(glsl_uint_type(), max_ubo_size * 4, 4);
   fields[0].type = ubo_type;
   u_foreach_bit(slot, ubo_used) {
      char buf[64];
      snprintf(buf, sizeof(buf), "ubo_slot_%u", slot);
      nir_variable *var = nir_variable_create(shader, nir_var_mem_ubo, glsl_struct_type(fields, 1, "struct", false), buf);
      var->interface_type = var->type;
      var->data.driver_location = slot;
   }
}

If there’s a valid bitmask of UBOs that are used by the shader, the index slots get iterated over, and a variable is created for that slot using the same type for each one. The size is determined by the size of the biggest UBO variable that previously existed, which ensures that there won’t be any errors or weirdness with access past the boundary of the variable. All the GLSL compilation and NIR passes to this point have already handled bounds detection, so this is also fine.

if (ssbo_used) {
   const struct glsl_type *ssbo_type = glsl_array_type(glsl_uint_type(), max_ssbo_size * 4, 4);
   const struct glsl_type *unsized = glsl_array_type(glsl_uint_type(), 0, 4);
   fields[0].type = ssbo_type;
   u_foreach_bit(slot, ssbo_used) {
      char buf[64];
      snprintf(buf, sizeof(buf), "ssbo_slot_%u", slot);
      if (ssbo_sizes[slot])
         fields[1].type = unsized;
      else
         fields[1].type = NULL;
      nir_variable *var = nir_variable_create(shader, nir_var_mem_ssbo,
                                              glsl_struct_type(fields, 1 + !!ssbo_sizes[slot], "struct", false), buf);
      var->interface_type = var->type;
      var->data.driver_location = slot;
   }
}

SSBOs are almost the same, but as previously mentioned, they also get a bonus member if they need the length() method. The GLSL compiler has already pre-computed the adjustment for the value that will be returned by length(), so it doesn’t actually matter what the size of the variable is anymore.

And that’s it! The entire encyclopedia of hacks can now be removed, and I can avoid ever having to look at any of this again.

February 23, 2021

Linaro has been working together with Qualcomm to enable camera support on their platforms since 2017. The Open Source CAMSS driver was written to support the ISP IP-block with the same name that is present on Qualcomm SoCs coming from the smartphone space.

The first development board targeted by this work was the DragonBoard 410C, which was followed in 2018 by DragonBoard 820C support. Recently support for the Snapdragon 660 SoC was added to the driver, which will be part of the v5.11 Linux Kernel release. These SoCs all contain the CAMSS (Camera SubSystem) version of the ISP architecture.

Currently, support for the ISP found in the Snapdragon 845 SoC and the DragonBoard 845C is in the process of being upstreamed to the mailinglists. Having …

February 19, 2021

Quickly: ES 3.2

I’ve been getting a lot of pings over the past week or two about ES 3.2 support.

Here’s the deal.

It’s not happening soon. Probably.

Zink currently supports every 3.2 extension except for photoshop. There’s two ways to achieve support for that extension at present:

  • the nice, simple, VK_EXT_blend_operation_advanced which nobody* supports
  • the difficult, excruciating fbfetch method using shader rewrites, which also requires extensions that nobody supports

* Yes, I know that Nvidia supports advanced blend, but zink+nvidia is not currently capable of doing ES of any version, so that’s not testable.

So in short, it’s going to be a while.

But there’s not really a technical reason to rush towards full ES 3.2 anyway other than to fill out a box on mesamatrix. If you have an app that requires 3.2, chances are that it probably doesn’t really require it; few apps actually use the advanced blend extension, and so it should be possible for the app to require only 3.1 and then verify the presence of whatever 3.2-based extensions it may use in order to be more compatible.
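
For illustration (this isn't from the post), the check an app can do after creating a 3.1 context uses nothing beyond core ES 3.0 API; the extension name here is just one example of a 3.2-era feature:

#include <stdbool.h>
#include <string.h>
#include <GLES3/gl3.h>

/* Returns true if the current context exposes the named extension,
 * e.g. has_gl_extension("GL_KHR_blend_equation_advanced"). */
static bool has_gl_extension(const char *name)
{
    GLint count = 0;
    glGetIntegerv(GL_NUM_EXTENSIONS, &count);
    for (GLint i = 0; i < count; i++) {
        if (!strcmp((const char *)glGetStringi(GL_EXTENSIONS, i), name))
            return true;
    }
    return false;
}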

Of course, this is unlikely to happen. It’s far easier for app developers to just say “give me 3.2” if maybe they just want geometry shaders, and I wouldn’t expect anyone is going to be special-casing things just to run on zink.

Nonetheless, it’s really not a priority for me given the current state of the Vulkan ecosystem. As time moves on and various extensions/features become more common that may change, but for now I’m focusing on things that are going to be the most useful.

Background

After a lot of effort over short stints in the last several months, I have completed my blog migration to Lektor in the hopes that when I migrate again in the future, it won't be as painful.

Despite my efforts, many old posts might not be perfect. This is a job for the wayback machine.

In case you're curious I did this primarily for one reason (and lots of smaller ones). I wanted my data back. Wordpress is an open source blogging platform with huge adoption. It has a very large plugin ecosystem and is very actively updated and maintained. While security issues have come up here and there, at some point automatic updates became an option and that helped a bit. In 2010 it was the obvious choice.

If you've gained anything from my blog posts, you should thank Wordpress. Wordpress' ease of setup and relative ease of use is a big reason I was able to author things as well as I did.

So what happened - plugins

I wanted my data back. It was a self hosted instance and I had all my information stored in a SQL database. Obviously I never lost my data, but...

Plugins.

I used plugins for my tables (multiple plugins). I used plugins for code highlighting. Plugins for LaTeX. Plugins for table of contents, social media integration, post tagging, image captioning and formatting, spelling. You get the idea. The result of all this was I ended up with a blog post that was entirely useless in its text only form. Plugins storing the data in non-standard places so it can be processed and look fancy.

The WYSIWYG editor interface was a huge plus for me. I spent all day in front of a terminal breaking graphics and display (meaning I really was in front of an 80x24 terminal at times). I didn't want to have to deal with fanciful layout engines or styles. Those plugins ended up destroying the WYSIWYG editor experience and I ended up doing everything in quasi markdown anyway.

Plugins themselves introduced security issues, when they weren't intentionally malicious to begin with.

What was next?

These static site generators seemed really appealing as a solution to this problem. Everything in markdown. Assets stored together in the filesystem. Jekyll is obviously hugely popular. Hugo, Pelican, Gatsby, and Sphinx are all generators I considered. The number of static site generators is staggering. I wish I could remember what made me choose Lektor, but I can't - being Python-based was my only requirement.

Python because I wanted a platform that did most of what I wanted but was extendible by me if absolutely necessary.

Migrating was definitely a lot of work. I was tempted several times to abort the effort and just rely on the wayback machine. Ultimately I decided that migrating the posts would be a good way to learn how well the platform would meet my needs (that being an annual blog post or so).

There are definitely some features I miss that I may or may not get to.

  1. Comments. There is disqus integration. I'm not convinced this is what I want.
  2. Post grouping. There are categories. It was too complicated for me to figure out in a short time, so I'm punting on it for now.
  3. I'd really like to not have to learn CSS and jinja2. I can scrape by a bit, but changing anything drastic takes a lot of effort for me.

Migration

I followed this. I did have to make some minor changes specific to my needs and posts did still require some touchups, in large part due to plugins and my obsessive use of SVG.

See you soon

Now that I'm back, I hope to post more often. Next up will be a recap of some of the pathfinding projects I worked on after the FreeBSD enabling work.

February 18, 2021

Last year I wrote about how to create a user-specific XKB layout, followed by a post explaining that this won't work in X. But there's a pandemic going on, which is presumably the only reason people haven't all switched to Wayland yet. So it was time to figure out a workaround for those still running X.

This Merge Request (scheduled for xkeyboard-config 2.33) adds a "custom" layout to the evdev.xml and base.xml files. These XML files are parsed by the various GUI tools to display the selection of available layouts. An entry in there will thus show up in the GUI tool.

Our rulesets, i.e. the files that convert a layout/variant configuration into the components to actually load, already have wildcard matching [1]. So the custom layout will resolve to the symbols/custom file in your XKB data dir - usually /usr/share/X11/xkb/symbols/custom.

This file is not provided by xkeyboard-config. It can be created by the user though and whatever configuration is in there will be the "custom" keyboard layout. Because xkeyboard-config does not supply this file, it will not get overwritten on update.

From XKB's POV it is just another layout and it thus uses the same syntax. For example, to override the +/* key on the German keyboard layout with a key that produces a/b/c/d on the various Shift/Alt combinations, use this:


default
xkb_symbols "basic" {
    include "de(basic)"
    key <AD12> { [ a, b, c, d ] };
};

This example includes the "basic" section from the symbols/de file (i.e. the default German layout), then overrides the 12th alphanumeric key from left in the 4th row from bottom (D) with the given symbols. I'll leave it up to the reader to come up with a less useful example.

There are a few drawbacks:

  • If the file is missing and the user selects the custom layout, the results are... undefined. For run-time configuration like GNOME it doesn't really matter - the layout compilation fails and you end up with the one the device already had (i.e. the default one built into X, usually the US layout).
  • If the file is missing and the custom layout is selected in the xorg.conf, the results are... undefined. I tested it and ended up with the US layout but that seems more by accident than design. My recommendation is to not do that.
  • No variants are available in the XML files, so the only accessible section is the one marked default.
  • If a commandline tool uses a variant of custom, the GUI will not reflect this. If the GUI goes boom, that's a bug in the GUI.

So overall, it's a hack[2]. But it's a hack that fixes real user issues and given we're talking about X, I doubt anyone notices another hack anyway.

[1] If you don't care about GUIs, setxkbmap -layout custom -variant foobar has been possible for years.
[2] Sticking with the UNIX principle, it's a hack that fixes the issue at hand, is badly integrated, and weird to configure.

What’s Next

It’s been a busy week. The CTS fixes and patch drops are coming faster and faster, and progress is swift. Here’s a quick note on some things that are on the horizon.

Features Landing Soon

Zink’s in a tough spot right now in master. GL 4.6 is available, but there’s still plenty of things that won’t work, e.g., running anything at 60fps. These are things I expect (hope) to see land in the repo in the next month or so:

  • improved barrier support, which frees up some opportunities with queue refactoring
  • removing explicit pre-fencing on every frame (and sometimes multiple times per frame)
  • descriptor caching
  • various bugfixes which weren’t feasible due to architectural issues

All told, just as an example, Unigine Heaven (which can now run in color!) should see roughly a 100% performance improvement (possibly more) once this is in, and I’d expect substantial performance gains across the board.

Will you be able to suddenly play all your favorite GL-based Steam games?

No.

I can’t even play all your favorite GL-based Steam games yet, so it’s a long ways off for everyone else.

But you’ll probably be able to get surprisingly good speed on what things you can run considering the amount of time that will pass between hitting 4.6 and these patchsets merging.

Features I’m Working On

I spent some time working on Wolfenstein over the past week, but there’s some non-zink issues in the way, so that’s on the backburner for a while. Instead, I’ve turned my attention to CTS and begun unloading a dumptruck of resulting fixes into the codebase.

There comes a time when performance is “good enough” for a while, and, after some intense optimizing since the start of the year, that time has come. So now it’s back to stabilization mode, and I’m now aiming to have a vaguely decent pass rate in the near term.

Hopefully I’ll find some time to post some of the crazy bugs I’ve been hunting, but maybe not. Time will tell.

February 11, 2021

By Now

…or in the very near future, the ol’ bumperino will have landed, putting zink at GL 4.5.

But that’s boring, so let’s check out something very slightly more interesting.

Steam Games

What are they and how do they work?

I’m not going to answer these questions, but I am going to be looking into getting them working on zink.

To that end, as I hinted at yesterday, I began with Wolfenstein: The New Order, as chosen by Daniel Schuermann, the lucky winner of the What Steam Game Should Zink Use As Its Primary Test Case And Benchmark contest that was recently held.

Early tests of this game were unimpressive. That is to say I got an immediate crash. It turns out that having the GL compatibility context restricted to 3.0 is bad for getting AAA games running, so zink-wip now enables 4.6 compat contexts.

But then I was still getting a crash without any clear error message. Suddenly, I was back in 2004 trying to figure out how to debug wine apps.

Things are much simpler now, however. PROTON_DUMP_DEBUG_COMMANDS enables dumping scripts for debugging from steam, including one which attaches a debugger to the game. This solved the problem of getting a debugger in before the almost-immediate crash, but it didn’t get me closer to a resolution.

The problem now is that I’d attached a debugger to the in-wine process, which is just a sandbox for the Windows API. What I actually wanted was to attach to the wine process itself so I could see what was going on in the driver.

gdb --pid=$(pidof WolfNewOrder_x64.exe) ended up being what I needed, but this was complicated by the fact that I had to attach before the game crashed and without triggering the steam error reporter. So in the end, I had to attach using the proton script, then, while it was paused, attach to the outer process for driver debugging. But I also had to attach to the outer process after zink was loaded, so it was a real struggle.

Then, as per usual, another problem: I had no symbols loaded because proton runs a static binary. After cluelessly asking around in the DXVK discord, @Herbert helpfully provided a gdb python script for proton in-process debugging that I was able to repurpose for my needs. The gist (haha) of the script is that it scans /proc/$pid/maps and then manually loads the required library files.

At last, I had attached to the game, I had symbols, and I could see that I was hitting a zink assert I’d added to catch int overflows. A quick one-liner to change the order of a calculation fixed that, and now I’m on to an entirely new class of bugs.

February 10, 2021

This article is part of a series on how to setup a bare-metal CI system for Linux driver development. Check out part 1 where we expose the context/high-level principles of the whole CI system, and make the machine fully controllable remotely (power on, OS to boot, keyboard/screen emulation using a serial console).

In this article, we will start demystifying the boot process, and discuss different ways to generate and boot an OS image along with a kernel for your machine. Finally, we will introduce boot2container, a project that makes running containers on bare metal a breeze!

This work is sponsored by the Valve Corporation.

Generating a kernel & rootfs for your Linux-based testing

To boot your test environment, you will need to generate the following items:

  • A kernel, providing all the necessary drivers for your test;
  • A userspace, containing all the dependencies of your test (rootfs);
  • An initramfs (optional), containing the drivers/firmwares needed to access the userspace image, along with an init script performing the early boot sequence of the machine;

The initramfs is optional because the drivers and their firmwares can be built in the kernel directly.

Let's not generate these items just yet, but instead let's look at the different ways one could generate them, depending on their experience.

The embedded way

Buildroot's logo

If you are used to dealing with embedded devices, you are already familiar with projects such as Yocto or Buildroot. They are well-suited to generate a tiny rootfs which can be useful for netbooted systems such as the one we set up in part 1 of this series. They usually allow you to describe everything you want on your rootfs, then will configure, compile, and install all the wanted programs in the rootfs.

If you are wondering which one to use, I suggest you check out the presentation from Alexandre Belloni / Thomas Pettazoni which will give you an overview of both projects, and help you decide on what you need.

Pros:

  • Minimal size: Only what is needed is included
  • Complete: Configures and compiles the kernel for you

Cons:

  • Slow to generate: Everything is compiled from source
  • Small selection of software/libraries: Adding build recipes is however relatively easy

The Linux distribution way

Debian Logo, www.debian.org

If you are used to installing Linux distributions, your first instinct might be to install your distribution of choice in a chroot or a Virtual Machine, install the packages you want, and package the folder/virtual disk into a tarball.

Some tools such as debos, or virt-builder make this process relatively painless, although they will not be compiling an initramfs, nor a kernel for you.

Fortunately, building the kernel is relatively simple, and there are plenty of tutorials on the topic (see ArchLinux's wiki). Just make sure to compile modules and firmware in the kernel, to avoid the complication of using an initramfs. Don't forget to also compress your kernel if you decide to netboot it!

Pros:

  • Relatively fast: No compilation necessary (except for the kernel)
  • Familiar environment: Closer to what users/developers use in the wild

Cons:

  • Larger: Packages tend to bring a lot of unwanted dependencies, drastically increasing the size of the image
  • Limited choice of distros: Not all distributions are easy to install in a chroot
  • Insecure: Requires root rights to generate the image, which may accidentally trash your distro
  • Poor reproducibility: Distributions get updates continuously, leading to different outcomes when running the same command
  • No caching: all the steps to generate the rootfs are re-done every time
  • Incomplete: does not generate a kernel or initramfs for you

The refined distribution way: containers

Docker and the Docker logo are trademarks or registered trademarks of Docker, Inc.

Containers are an evolution of the old chroot trick, but made secure thanks to the addition of multiple namespaces to Linux. Containers and their runtimes have been addressing pretty much all the cons of the "Linux distribution way", and became a standard way to share applications.

On top of generating a rootfs, containers also allow setting environment variables and controlling the command line of the program, and they come with a standardized transport mechanism which simplifies sharing images.

Finally, container images are constituted of cacheable layers, which can be used to share base images between containers, and also speed up the generation of the container image by only re-computing the layer that changed and all the layers applied on top of it.

The biggest drawback of containers is that they usually are meant to be run on pre-configured hosts. This means that if you want to run the container directly, you will need to make sure to include an initscript or install systemd in your container, and set it as the entrypoint of the container. It is however possible to perform these tasks before running the container, as we'll explain in the following sections.

Pros:

  • Fastest: No compilation necessary (except for the kernel), and layers cached
  • Familiar: Shared environment between developers and the test system
  • Flexible: Full choice of distro
  • Secure: No root rights needed, everything is done in a user namespace
  • Shareable: Containers come with a transport/storage mechanism (registries)
  • Reproducible: Easily run the exact same userspace on your dev and test machines

Cons:

  • Larger: Packages tend to bring a lot of dependencies, drastically increasing the size of the image
  • Incomplete: does not generate a kernel or initramfs for you

Deploying and booting a rootfs

Now we know how we could generate a rootfs, so the next step is to be able to deploy and boot it!

Challenge #1: Deploying the Kernel / Initramfs

There are multiple ways to deploy an operating system:

  • Flash and reboot: Typical on ARM boards / Android phones;
  • Netboot: Typical in big organizations that manage thousands of machines.

The former solution is great at preventing the bricking of a device that depends on an Operating System to be flashed again, as it enables checking the deployment on the device itself before rebooting.

The latter solution enables diskless test machines, which is an effective way to reduce state (the enemy #1 of reproducible results). It also enables a faster deployment/boot time as the CI system would not have to boot the machine, flash it, then reboot. Instead, the machine simply starts up, requests an IP address through BOOTP/DHCP, downloads the kernel/initramfs, and executes the kernel. This was the solution we opted for in part 1 of this blog series.

Whatever solution you end up picking, you will now be presented with your next challenge: making sure the rootfs remains the same across reboots.

Challenge #2: Deploying the rootfs efficiently

If you have chosen the Flash and reboot deployment method, you may be prepared to re-flash the entire Operating System image every time you boot. This would make sure that the state of a previous boot won't leak into following boots.

This method can however become a big burden on your network when scaled to tens of machines, so you may be tempted to use a Network File System such as NFS to spread the load over a longer period of time. Unfortunately, using NFS brings its own set of challenges (how deep is this rabbit hole?):

  • The same rootfs directory cannot be shared across machines without duplication unless mounted read-only, as machines should not be able to influence each other's execution;
  • The NFS server needs to remain available as long as at least one test machine is running;
  • Network congestion might influence the testing happening on the machine, which can affect functional testing, but will definitely affect performance testing.

So, instead of trying to spread the load, we could try to reduce the size of the rootfs by only sending the content that changed. For example, the rootfs could be split into the following layers:

  • The base Operating System needed for the testing;
  • The driver you want to test (if it wasn't in the kernel);
  • The test suite(s) you want to run.

Layers can be downloaded by the test machine, through a short-lived-state network protocol such as HTTP, as individual SquashFS images. Additionally, SquashFS provides compression which further reduces the storage/network bandwidth needs.

The layers can then be directly combined by first mounting the layers to separate folders in read-only mode (the only mode supported by SquashFS), then merging them using OverlayFS. OverlayFS will store all the writes done to this file system into the upper directory (alongside a work directory it uses internally). If these directories are backed by a ramdisk (tmpfs) or a never-reused temporary directory, then this would guarantee that no information from previous boots would impact the new boots!
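
This isn't from the original article, but a minimal sketch of that merge step in C, assuming the SquashFS layers are already loop-mounted under /layers and that /scratch is a tmpfs (all paths here are made up):

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* Merge the read-only layers; all writes land in the tmpfs-backed upperdir,
     * so nothing survives a reboot. The leftmost lowerdir is the topmost layer. */
    const char *opts = "lowerdir=/layers/tests:/layers/driver:/layers/base,"
                       "upperdir=/scratch/upper,workdir=/scratch/work";

    if (mount("overlay", "/newroot", "overlay", 0, opts) != 0) {
        perror("overlay mount");
        return 1;
    }
    return 0;
}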

If you are familiar with containers, you may have recognized this approach as what is used by containers: layers + overlay2 storage driver. The only difference is that container runtimes depend on tarballs rather than SquashFS images, probably because SquashFS is a Linux-only filesystem.

If you are anything like me, you should now be pretty tempted to simply use containers for the rootfs generation, transport, and boot! That would be a wise move, given that thousands of engineers have been working on them over the last decade or so, and whatever solution you may come up with will inevitably have even more quirks than these industry standards.

I would thus recommend using containers to generate your rootfs, as there are plenty of tools that will generate them for you, with varying degrees of complexity. Check out buildah if Docker or Podman are too high-level for your needs!

Let's now brace for the next challenge, deploying a container runtime!

Challenge #3: Deploying a container runtime to run the test image

In the previous challenge, we realized that a great way to deploy a rootfs efficiently was to simply use a container runtime to do everything for us, rather than re-inventing the wheel.

This would enable us to create an initramfs which would be downloaded along with the kernel through the usual netboot process, and would be responsible for initializing the machine, connecting to the network, mounting the layer cache partition, setting the time, downloading a container, then executing it. The last two steps would be performed by the container runtime of our choice.

Generating an initramfs is way easier than one can expect. Projects like dracut are meant to simplify their creation, but my favourite has been u-root, coming from the LinuxBoot project. I generated my first initramfs in less than 5 minutes, so I was incredibly hopeful to achieve the outlined goals in no time!

Unfortunately, the first setback came quickly: container runtimes (Docker or Podman) are huge (~150 to 300 MB), if we are to believe the size of their respective Alpine Linux packages and dependencies! While this may not be a problem for the Flash and reboot method, it is definitely a significant issue for the Netboot method, which would need to download the runtime on every boot.

Challenge #3.5: Minifying the container runtime

After spending a significant amount of time studying container runtimes, I identified the following functions:

  • Transport / distribution: Downloading a container image from a container registry to the local storage (spec);
  • De-layer the rootfs: Unpack the layers' tarball, and use OverlayFS to merge them (default storage driver, but there are many other ways);
  • Generate the container manifest: A JSON-based config file specifying how the container should be run;
  • Executing the container

Thus started my quest to find lightweight solutions that could do all of these steps... and wonder just how deep is this rabbit hole??

The usual executor found in the likes of Podman and Docker is runc. It is written in Golang, which compiles everything statically and leads to giant binaries. In this case, runc clocks in at ~12MB. Fortunately, a knight in shining armour came to the rescue, re-implemented runc in C, and named it crun. The final binary size is ~400 KB, and it is fully compatible with runc. That's good enough for me!

To download and unpack the rootfs from the container image, I found genuinetools/img which supports that out of the box! Its size was however much bigger than expected, at ~28.5MB. Fortunately, compiling it ourselves, stripping the symbols, then compressing it using UPX led to a much more manageable ~9MB!

What was left was to generate the container manifest according to the runtime spec. I started by hardcoding it to verify that I could indeed run the container. I was relieved to see it would work on my development machine, even though it failed in my initramfs. After spending a couple of hours diffing straces, poking at a couple of sysfs/config files, and realizing that pivot_root does not work in an initramfs, I finally managed to run the container with crun run --no-pivot!

I was over the moon, as the only thing left was to generate the container manifest by patching genuinetools/img to generate it according to the container image manifest (like docker or podman does). This is where I started losing grip: lured by the prospect of a simple initramfs solving all my problems, being so close to the goal, I started free-falling down what felt like the deepest rabbit hole of my engineering career... Fortunately, after a couple of weeks, I emerged, covered in mud but victorious! Cue the gory battle log :)

When trying to access the container image's manifest in img, I realized that it was re-creating the layers and manifest, and thus was losing the information such as entrypoint, environment variables, and other important parameters. After scouring through its source code and its 500 kLOC of dependencies, I came to the conclusion that it would be easier to start a project from scratch that would use Red Hat's image and storage libraries to download and store the container on the cache partition. I then needed to unpack the layers, generate the container manifest, and start runc. After a couple of days, ~250 lines of code, and tons of staring at straces to get it working, it finally did! Out was img, and the new runtime's size was under 10 MB \o/!

The last missing piece in the puzzle was performance-related: use OverlayFS to merge the layers, rather than unpacking them ourselves.

This is when I decided to have another look at Podman, saw that they have their own internal library for all the major functions, and decided to compile podman to try it out. The binary size was ~50 MB, but after removing some features, setting the -w -s LDFLAGS, and compressing it using upx --best, I got the final size to be ~14 MB! Of course, Podman is more than just one binary, so trying to run a container with it failed. However, after a bit of experimentation and stracing, I realized that running the container with --privileged --network=host would work using crun... provided we force-added the --no-pivot parameter to crun. My happiness was however short-lived, replaced by a MAJOR FACEPALM MOMENT:

After a couple of minutes of constant facepalming, I realized I was also relieved, as Podman is a battle-tested container runtime, and I would not need to maintain a single line of Go! Also, I now knew how deep the rabbit hole was, and we just needed to package everything nicely in an initramfs and we would be good. Success, at last!

Boot2container: Run your containers from an initramfs!

If you have managed to read through the article up to this point, congratulations! For the others who just gave up and jumped straight to this section, I forgive you for teleporting yourself to the bottom of the rabbit hole directly! In both cases, you are likely wondering: where is this breeze you were promised in the introduction?

     Boot2container enters the chat.

Boot2container is a lightweight (sub-20 MB) and fast initramfs I developed that will allow you to ignore the subtleties of operating a container runtime and focus on what matters, your test environment!

Here is an example of how to run boot2container, using SYSLINUX:

LABEL root
    MENU LABEL Run docker's hello world container, with caching disabled
    LINUX /vmlinuz-linux
    APPEND b2c.container=docker://hello-world b2c.cache_device=none b2c.ntp_peer=auto
    INITRD /initramfs.linux_amd64.cpio.xz

The hello-world container image will be run in privileged mode, without the host network, which is what you want when running the container for bare metal testing!

Make sure to check out the list of features and options before either generating the initramfs yourself or downloading it from the releases page. Try it out with your kernel, or the example one bundled in the release!

With this project mostly done, we pretty much conclude the work needed to set up the test machines, and the next articles in this series will be focusing on the infrastructure needed to support a fleet of test machines, and expose it to Gitlab/Github/...

That's all for now, thanks for reading that far!


February 09, 2021

If you don’t know what traces based rendering regression testing is, read the appendix before continuing.


The Mesa community has witnessed an explosion of interest in Continuous Integration over the last two years.

In addition to checking that the project builds properly, integrating the testing of its functional correctness has become a priority. User space graphics drivers are exercised by a wide variety of tests and test suites. One of those kinds is traces-based rendering regression testing.

The public effort to add this kind of tests into Mesa’s CI started with this mail from Alexandros Frantzis.

At some point, we had support for replaying OpenGL, Vulkan and D3D11 traces using apitrace, RenderDoc and GFXReconstruct with the in-tree tool tracie. However, it was a very custom solution tailored to the needs of Mesa, so I proposed moving this codebase and integrating it into the piglit test suite. It was a natural step forward.

This is how replayer was born into piglit.

replayer

The first step to test a trace is, actually, obtaining a trace. I won’t go into the details about how to create one from scratch. The process is well documented for each of the tools listed above. However, the Mesa community has been collecting publicly distributable traces for a while and placing them in traces-db, whose CI copies them to Freedesktop.org’s MinIO instance.

To make things simple, once we have built and installed piglit, if we want to test an apitrace-created OpenGL trace, we can download it from there with:

$ replayer.py download \
 	 --download-url https://minio-packet.freedesktop.org/mesa-tracie-public/ \
 	 --db-path ./traces-db \
 	 --force-download \
 	 glxgears/glxgears-2.trace

The parameters are self-explanatory. The downloaded trace will now exist at ./traces-db/glxgears/glxgears-2.trace.

The next step will be to dump an image from the trace. Since it is a .trace file we will need to have apitrace installed in the system. If we do not specify the call(s) from which to dump the image(s), we will just get the last frame of the trace:

$ replayer.py dump ./traces-db/glxgears/glxgears-2.trace

The dumped PNG image will be at ./results/glxgears-2.trace-0000001413.png. Notice, the number suffix is the snapshot id from the trace.

Dumping from a trace may result in a range of different possible images. One example is when the trace makes use of uninitialized values, leading to undefined behaviors.

However, since the original aim was performing pre-merge rendering regression testing in Mesa’s CI, the idea is that replaying any of the provided traces will be quick and the dumped image will be consistent. In other words, if we dump the same frame of a trace several times with the same GFX stack, the image will always be the same.

With this precondition, we can test whether two different images are the same just by hashing their contents. replayer can obtain the hash for the generated dumped image:

$ replayer.py checksum ./results/glxgears-2.trace-0000001413.png 
f8eba0fec6e3e0af9cb09844bc73bdc8

Now, if we build a different commit of Mesa, we can check the image generated at this new point against the previously generated reference image. If everything goes well, we will see something like:

$ replayer.py compare trace \
 	 --download-url https://minio-packet.freedesktop.org/mesa-tracie-public/ \
 	 --device-name gl-vmware-llvmpipe \
 	 --db-path ./traces-db \
 	 --keep-image \
 	 glxgears/glxgears-2.trace f8eba0fec6e3e0af9cb09844bc73bdc8
[dump_trace_images] Info: Dumping trace ./traces-db/glxgears/glxgears-2.trace...
[dump_trace_images] Running: apitrace dump --calls=frame ./traces-db/glxgears/glxgears-2.trace
// process.name = "/usr/bin/glxgears"
1384 glXSwapBuffers(dpy = 0x56060e921f80, drawable = 31457282)

1413 glXSwapBuffers(dpy = 0x56060e921f80, drawable = 31457282)

error: drawable failed to resize: expected 1515x843, got 300x300
[dump_trace_images] Running: eglretrace --headless --snapshot=1413 --snapshot-prefix=./results/trace/gl-vmware-llvmpipe/glxgears/glxgears-2.trace- ./blog-traces-db/glxgears/glxgears-2.trace
Wrote ./results/trace/gl-vmware-llvmpipe/glxgears/glxgears-2.trace-0000001413.png

OK
[check_image]
    actual: f8eba0fec6e3e0af9cb09844bc73bdc8
  expected: f8eba0fec6e3e0af9cb09844bc73bdc8
[check_image] Images match for:
  glxgears/glxgears-2.trace

PIGLIT: {"images": [{"image_desc": "glxgears/glxgears-2.trace", "image_ref": "f8eba0fec6e3e0af9cb09844bc73bdc8.png", "image_render": "./results/trace/gl-vmware-llvmpipe/glxgears/glxgears-2.trace-0000001413-f8eba0fec6e3e0af9cb09844bc73bdc8.png"}], "result": "pass"}

replayer's compare subcommand is the one that emits the piglit-formatted test expectations output.

Putting everything together

We can make the whole process way simpler by passing replayer a YAML test list file. For example:

$ cat testing-traces.yml
traces-db:
  download-url: https://minio-packet.freedesktop.org/mesa-tracie-public/

traces:
  - path: gputest/triangle.trace
    expectations:
      - device: gl-vmware-llvmpipe
        checksum: c8848dec77ee0c55292417f54c0a1a49
  - path: glxgears/glxgears-2.trace
    expectations:
      - device: gl-vmware-llvmpipe
        checksum: f53ac20e17da91c0359c31f2fa3f401e
$ replayer.py compare yaml \
 	 --device-name gl-vmware-llvmpipe \
 	 --yaml-file testing-traces.yml 
[check_image] Downloading file gputest/triangle.trace took 5s.
[dump_trace_images] Info: Dumping trace ./replayer-db/gputest/triangle.trace...
[dump_trace_images] Running: apitrace dump --calls=frame ./replayer-db/gputest/triangle.trace
// process.name = "/home/anholt/GpuTest_Linux_x64_0.7.0/GpuTest"
397 glXSwapBuffers(dpy = 0x7f0ad0005a90, drawable = 56623106)

510 glXSwapBuffers(dpy = 0x7f0ad0005a90, drawable = 56623106)


/home/anholt/GpuTest_Linux_x64_0.7.0/GpuTest
[dump_trace_images] Running: eglretrace --headless --snapshot=510 --snapshot-prefix=./results/trace/gl-vmware-llvmpipe/gputest/triangle.trace- ./replayer-db/gputest/triangle.trace
Wrote ./results/trace/gl-vmware-llvmpipe/gputest/triangle.trace-0000000510.png

OK
[check_image]
    actual: c8848dec77ee0c55292417f54c0a1a49
  expected: c8848dec77ee0c55292417f54c0a1a49
[check_image] Images match for:
  gputest/triangle.trace

[check_image] Downloading file glxgears/glxgears-2.trace took 5s.
[dump_trace_images] Info: Dumping trace ./replayer-db/glxgears/glxgears-2.trace...
[dump_trace_images] Running: apitrace dump --calls=frame ./replayer-db/glxgears/glxgears-2.trace
// process.name = "/usr/bin/glxgears"
1384 glXSwapBuffers(dpy = 0x56060e921f80, drawable = 31457282)

1413 glXSwapBuffers(dpy = 0x56060e921f80, drawable = 31457282)


/usr/bin/glxgears
error: drawable failed to resize: expected 1515x843, got 300x300
[dump_trace_images] Running: eglretrace --headless --snapshot=1413 --snapshot-prefix=./results/trace/gl-vmware-llvmpipe/glxgears/glxgears-2.trace- ./replayer-db/glxgears/glxgears-2.trace
Wrote ./results/trace/gl-vmware-llvmpipe/glxgears/glxgears-2.trace-0000001413.png

OK
[check_image]
    actual: f8eba0fec6e3e0af9cb09844bc73bdc8
  expected: f8eba0fec6e3e0af9cb09844bc73bdc8
[check_image] Images match for:
  glxgears/glxgears-2.trace

replayer also features the query subcommand, which is just a helper for reading the YAML files with the test configuration.

Testing the other kinds of supported 3D traces doesn’t change much from what’s shown here. Just make sure to have the needed tools installed: RenderDoc, GFXReconstruct, the VK_LAYER_LUNARG_screenshot layer, Wine and DXVK. A good reference for building, installing and configuring these tools is Mesa’s GL and VK test containers building scripts.

replayer also accepts several configuration options to tweak how it behaves and where to find the actual tracing tools needed for replaying the different types of traces. Make sure to check the replay section in piglit’s configuration example file.

replayer's README.md file is also a good read for further information.

piglit

replayer is a test runner in a similar fashion to shader_runner or glslparsertest. What we are missing now is how it integrates so that we can do piglit runs which produce piglit-formatted results.

This is done through the replay test profile.

This profile needs a couple of configuration values. The easiest approach is to set the PIGLIT_REPLAY_DESCRIPTION_FILE and PIGLIT_REPLAY_DEVICE_NAME env variables. They are self-explanatory, but make sure to check the documentation for this and other configuration options for this profile.

The following example features a similar run to the one done above invoking directly replayer but with piglit integration, providing formatted results:

$ PIGLIT_REPLAY_DESCRIPTION_FILE=testing-traces.yml PIGLIT_REPLAY_DEVICE_NAME=gl-vmware-llvmpipe piglit run replay -n replay-example replay-results
[2/2] pass: 2   
Thank you for running Piglit!
Results have been written to replay-results

We can create some summary based on the results:

# piglit summary console replay-results/
trace/gl-vmware-llvmpipe/glxgears/glxgears-2.trace: pass
trace/gl-vmware-llvmpipe/gputest/triangle.trace: pass
summary:
       name: replay-example
       ----  --------------
       pass:              2
       fail:              0
      crash:              0
       skip:              0
    timeout:              0
       warn:              0
 incomplete:              0
 dmesg-warn:              0
 dmesg-fail:              0
    changes:              0
      fixes:              0
regressions:              0
      total:              2
       time:       00:00:00

Creating an HTML summary may also be interesting, especially when hunting down failures!
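
For example, something along these lines should produce one (the replay-summary output directory name is an arbitrary choice):

$ piglit summary html replay-summary replay-results/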

Wishlist

  • Through different backends, replayer supports running apitrace, RenderDoc and GFXReconstruct traces. We may want to support other tracing tools in the future. The dummy backend used for functional testing is a good starting point when writing a new backend.
  • The solution chosen for checking whether we detect a rendering regression is dependent on having consistent results, as said before. It’d be great if we could add a secondary testing method for whenever the expected rendered image is variable. Off the top of my head, using exclusion masks could be a quick single-run solution when we know which specific areas in a rendered scenario are the ones fluctuating. For more complex variations, a multi-run based solution seems to be the best option. EzBench has a great statistical approach for this!
  • The current syntax of the YAML test list files implies running the compare subcommand with the default behavior of checking against the last frame of the tested trace. This means figuring out which call number is the one of the last frame first. It would be great to support providing the call numbers directly in the YAML files to be able to test more than just the last frame and, additionally, cut down the time taken to run the test.
  • The HTML generated summary allows us to see the reference and generated images side by side when a test fails. It’d be great to also have some easy way of checking their differences. Using Rembrandt.js could be a possible solution.

Thanks a lot to the whole Mesa community for helping with the creation of this tool. Alexandros Frantzis, Rohan Garg and Tomeu Vizoso did a lot of the initial development for the in-tree tracie tool, and Dylan Baker was very patient while reviewing my patches for the piglit integration.

Finally, thanks to Igalia for allowing me to work on this.


Appendix

In 3D computer graphics we say “traces”, for short, to refer to the files generated by 3D API capturing tools. They store not only the calls to the specific 3D API but also the internal state of the 3D program during the capturing process: shaders, textures, buffers, etc.

Being able to “record” the execution of a 3D program is very useful. Usually, it allows us to replay the execution without needing the original program from which we generated the trace. It also allows in-depth analysis for debugging and performance optimization, it’s a very good solution for sharing with other developers, and, in some cases, it lets us check how the replay behaves with different GPUs.

In this post, however, I focus on a specific usage: rendering regression testing.

When doing a regression test, we compare a specific metric obtained by replaying the trace with one version of the GFX software stack against the same metric obtained from a different version of the GFX stack. If the value of the metric changes, we have found a regression (or an improvement!).

To make things simpler, we would like to check changes happening just in one of the many elements of the software stack. The most relevant component is the user space driver. In particular, I care about the Mesa drivers and the GNU/Linux stack.

Mainly, there are two kinds of regression testing we can do with a trace: performance or rendering regression testing. In the performance case, the checked metric(s) are usually speed or memory usage. In the rendering case, we compare the rendered output at one (or many) points during the trace replay. This output, a bitmap image, is the metric that we compare between two different points of the Mesa driver. If the images differ, we may have found a regression (artifacts, improper colors, etc.), or an enhancement, if the reference image is the one featuring any of these problems.

If you’re on Intel…

Your zink built from git master now has GL 4.3.

Turns out having actual hardware available when doing feature support is important, so I need to do some fixups there for stencil texturing before you can enjoy things.

February 08, 2021

As part of contracting work for Valve Corporation, I have been working with Charlie Turner and Andres Gomez from Igalia to develop a CI test farm for driver testing (mostly graphics).

This is now the fifth CI system I have worked with / on, and I am growing tired of not being able to re-use components from the previous systems because of how deeply integrated their components are, and how implementation details permeate from one component to another. Additionally, such designs limit the ability of the system to grow, as updating one component can impact many others, making changes difficult or even impossible without rewriting the system or taking it down for multiple hours.

With this new system, I am putting emphasis on designing good interfaces between components in order to create an open source toolbox that CI systems can re-use freely and tailor to their needs, while not painting themselves in a corner.

I aim to blog about all the different components/interfaces we will be making for this test system, but in this article, I would like to start with the basics: proposing design goals, and setting up a machine to be controllable remotely by a test system.

Overall design principles

When designing a test system, it is important to keep in mind that test results need to be:

  • Stable: Re-executing the same test should yield the same result;
  • Reproducible: The test should be runnable on other machines with the same hardware, and yield the same result;

What this means is that we should use the default configuration as much as possible (no weird setup in CI). Additionally, we need to reduce the amount of state in the system to the absolute minimum. This can be achieved in the following way:

  • Power cycle the machine between each test cycle: this helps reset the hardware;
  • Go diskless if at all possible, or treat the disk as a cache that can be flushed when testing fails;
  • Pre-compute as much as possible outside of the test machine, to reduce the impact of the environment of the machine running the test.

Finally, the machine should not restrict which kernel / Operating System can be loaded for testing. An easy way to achieve this is to use netboot (PXE), which is a common BIOS feature allowing diskless machines to boot from the network.

Converting a machine for testing

Now that we have a pretty good idea about the design principles behind preparing a machine for CI, let's try to apply them to an actual machine.

Step 1: Powering up the machine remotely

In order to power up, a machine needs both power and a signal to start. The latter is usually provided by a power button, but additional ways exist (non-exhaustive):

  • Wake on LAN: An Ethernet frame sent to the network adapter triggers the boot;
  • Power on by Mouse/Keyboard: Any activity on the mouse or the keyboard will boot the computer;
  • Power on AC: Providing power to the machine will automatically turn it on;
  • Timer: Boot at a specified time.

An Intel motherboard's list of wakeup events

Unfortunately, none of these triggers can be used to also turn off the machine. The only way to guarantee that a machine will power down and reset its internal state completely is to cut its power supply for a significant amount of time. A safe way to provide/cut power is to use a remotely-switchable Power Distribution Unit (example), or simply some smart plug such as Ikea's TRÅDFRI. In any case, make sure you rely on as few services as possible (no cloud!), that you won't exceed the ratings (voltage, power, and cycles), and that you can read back the state to make sure the command was well received. If you opt for an industrial PDU, make sure to check out PDU Gateway, our REST service to control the machines.

An example of a PDU

Now that we can reliably cut/provide power, we still need to control the boot signal. The difficulty here is that the signal needs to be received after the machine has received power and initialized enough to handle this event. To make things as easy as possible, configure the BIOS to boot as soon as power is supplied to the computer. This is usually called "Boot on AC". If your computer does not support this feature, you may want to try the other triggers, or use a microcontroller to press the power button for you when powering up (see the HELP! My machine can't ... Boot on AC section at the end of this article).

Step 2: Net booting

Net booting is quite commonly supported on x86 and ARM bootloaders. On x86 platforms, you can generally find this option in the boot option priorities under the name PXE boot or network boot. You may also need to enable the LAN option ROM, LAN controller, or the UEFI network stack. Reboot, and check that your machine is trying to get an IP!

The next step will be to set up a machine, called the Testing Gateway, that will provide a PXE service. This machine should have two network interfaces, one connected to a public network, and one connected to the test machines (through a switch). Setting up this machine will be the subject of an upcoming blog post, but if you are impatient, you may use our valve-infra container.

Step 3: Emulating your screen and keyboard using a serial console

Thanks to the previous steps, we can now boot in any Operating System we want, but we cannot interact with it...

One solution could be to run an SSH server on the Operating System, but until we can connect to it, there is no way to know what is going on. Instead, we can use an ancient technology, a serial port, to drive a console. This solution is often called a "Serial console" and is supported by most Operating Systems. Serial ports come in two types:

  • UART: voltage changing between 0 and VCC (TTL signalling), more common in the System-on-Chip (SoC) and microcontrollers world;
  • RS-232: voltage changing between a positive and negative voltage, more common in the desktop and datacenter world.

In any case, I suggest you find a serial-to-USB adapter suited to the computer you are trying to connect:

On Linux, using a serial console is relatively simple, just add the following in the command line to get a console on your screen AND over the /dev/ttyS0 serial port running at 9600 bauds:

console=tty0 console=ttyS0,9600 earlyprintk=vga,keep

If your machine does not have a serial port but has USB ports, which is more the norm than the exception in the desktop/laptop world, you may want to connect two RS-232-to-USB adapters together, using a Null modem cable:

Test Machine <-> USB <-> RS-232 <-> NULL modem cable <-> RS-232 <-> USB Hub <-> Gateway

And the kernel command line should use ttyACM0 / ttyUSB0 instead of ttyS0.
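
For example, assuming the USB adapter shows up as /dev/ttyACM0, the command line above would become:

console=tty0 console=ttyACM0,9600 earlyprintk=vga,keep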

Putting it all together

Start by removing the internal battery if it has one (laptops), and any built-in wireless antenna. Then set the BIOS to boot on AC, and use netboot.

Steps for an AMD motherboard:

Steps for an Intel motherboard:

Finally, connect the test machine to the wider infrastructure in this way:

      Internet /   ------------------------------+
    Public network                               |
                                       +---------+--------+                USB
                                       |                  +-----------------------------------+
                                       |      Testing     | Private network                   |
Main power (240 V) ---------+          |      Gateway     +-----------------+                 |
                            |          +---------+--------+                 |                 |
                            |                    | Serial /                 |                 |
                            |                    | Ethernet                 |                 |
                            |                    |                          |                 |
                +-----------+--------------------+--------------+   +-------+--------+   +----+----+
                |                 Switchable PDU                |   |   RJ45 switch  |   | USB Hub |
                |  Port 0    Port 1        ...          Port N  |   |                |   |         |
                +----+------------------------------------------+   +---+------------+   +-+-------+
                     |                                                  |                  |
                Main |                                                  |                  |
                Power|                                                  |                  |
            +--------|--------+               Ethernet                  |                  |
            |                 +-----------------------------------------+   +----+----+    |
            |  Test Machine 1 |            Serial (RS-232 / TTL)            |  Serial |    |
            |                 +---------------------------------------------+  2 USB  +----+ USB
            +-----------------+                                             +---------+

If you managed to do all this, then congratulations, you are set! If you got some issues finding the BIOS parameters, brace yourself, and check out the following section!

HELP! My machine can't ...

Net boot

It's annoying, but it is super simple to work around that. What you need is to install a bootloader on a drive or USB stick which supports PXE. I would recommend you look into SYSLINUX, and Arch Linux's wiki page about it.

Boot on AC

Well, that's a bummer, but that's not the end of the line either if you have some experience dealing with microcontrollers, such as Arduino. Provided you can find the following 4 wires, you should be fine:

  • Ground: The easiest to find;
  • Power rail: 3.3 or 5V depending on what your controller expects;
  • Power LED: A signal that will change when the computer turns on/off;
  • Power Switch: A signal to pull-up/down to start the computer.

On desktop PCs, all these wires can be easily found in the motherboard's manual. For laptops, you'll need to scour the motherboard for these signals using a multimeter. Pay extra attention when looking for the power rail, as it needs to be able to source enough current for your microcontroller. If you are struggling to find one, look for the VCC pins of some of the chips and you'll be set.

Next, you'll just need to figure out what voltage the power LED is at when the machine is ON or OFF. Make sure to check that this voltage is compatible with your microcontroller's input rating and plug it directly into a GPIO of your microcontroller.

Let's then do the same work for the power switch, except this time we also need to check how much current will flow through it when it is activated. To do that, just use a multimeter to check how much current is flowing when you connect the two wires of the power switch. Check that this amount of current can be sourced/sunk by the microcontroller, and then connect it to a GPIO.

Finally, we need to find power for the microcontroller that will be present as soon as we plug in the machine. For desktop PCs, you would find this on Pin 9 of the ATX connector. For laptops, you will need to probe the motherboard until you find a pin with a voltage suitable for your microcontroller (5 or 3.3V). However, make sure it is able to source enough current without the voltage dropping below the minimum acceptable VCC of your microcontroller. The best way to make sure of that is to connect this rail to ground through a ~100 Ohm resistor and check the voltage at the leads of the resistor, and keep on trying until you find a suitable place (it took me 3 attempts). Connect your microcontroller's VCC and ground to these pads.

The last step will be to edit this Arduino code for your needs, flash it to your microcontroller, and iterate until it works!

Here is a photo summary of all the above steps:

Thanks to Arkadiusz Hiler for giving me a couple of these BluePills, as I did not have any microcontroller that would be small-enough to fit in place of a laptop speaker. If you are a novice, I would suggest you pick an Arduino nano instead.

Oh, and if you want to create a board that would be generic-enough for most motherboards, check out the schematics from my 8 year-old blog post about doing just that!

Boot without a battery

So far, I have never heard of any laptop that would completely refuse to boot when disconnecting the battery. The worst I have heard of was that the laptop would take 30s before starting to boot.

Let's be real though, your time is valuable, and I would suggest you buy/get another laptop. However, if this is the only model you can get, and you really want to test it, then it will be .... juuuuuust fine!

Your state of mind, right now!

There are multiple options here, depending on how far down the stack you want to go:

  1. Rework the Embedded Controller (EC) to drop this delay: Applicable when you have access to the EC's source code, like for chromebooks;
  2. Impersonate the battery: Replacing the battery with a microcontroller that will respond to the EC's commands just like the real battery;
  3. Reuse the battery controller, but replace the battery cells with ... capacitors: The fastest way forward, but it can be really tricky without some knowledge of dealing with Li-ion cells.

I will not explain what needs to be done in option 1, as this is highly-dependent on your platform of choice, but it is by far the safest and the least hacky.

Option 2 is the next best option if the EC is not open or easy to flash. What you will want is to figure out which are the I2C lines in the battery's connector, and then attach a protocol analyser to them. Boot the machine, then inspect the logs and try to figure out the pattern. Re-implement as much of it as needed in a microcontroller, until the system boots reliably. Should be a good weekend project!

Option 3 is by far the hackiest and requires the most skill, even if it is the fastest to implement IF you have an oscilloscope and some super capacitors with a low discharge rate lying around (who doesn't?). What you'll need to do is open the battery, rip off the battery cells, and replace them with the super capacitors. They will simulate the battery cells well enough for most controllers, but beware that the controller might not like starting with the capacitors discharged, so you may need to force-charge them to [2, 3.6]V (the range between a fully discharged and a fully charged cell); consider using a 3.3V power rail. Beware that the battery cells might be wired in series, so you should not connect their negative pole to ground, as it would short one or more cells! In my case, the controller was happy with seeing an empty battery, and it was fun to see the battery go from 50% to 100% in a second when booting :D

That's all, folks!

February 04, 2021

All In A Day’s Work

I posted yesterday in brief about zink on nvidia blob, But what actually was necessary to get this working?

In short, there were three problem areas that needed to be worked on:

  • GLX
  • image creation
  • memory allocation

Let’s go over some of these in a bit more depth.

GLX

GLX is the layer which connects GL to X. In the context of this post and zink, this can be thought of as the method by which zink negotiates how it’s going to draw to the screen.

The way this works, at a very high level and with only the barest concern for accuracy and depth of subject matter, is that Mesa grabs a bunch of properties and features from the driver (zink) and then generates a series of configurations that can be used for X window outputs. GLX then compares these configurations against the one requested by the driver. If a match is found, the driver gets to draw. Otherwise, an error like this one I found on StackOverflow is printed:

X Error of failed request:  BadMatch (invalid parameter attributes)
Major opcode of failed request:  128 (GLX)
Minor opcode of failed request:  34 ()
Serial number of failed request:  39
Current serial number in output stream:  40

I didn’t do much digging into this. Here’s a visual representation of the process by which, with the assistance of GLX Hall of Famer Adam Jackson, I resolved the issue:

giphy.gif

In short, he found this problem some time ago and disabled some of the less relevant checks for configuration matching. All I had to do was apply the patch.

#teamwork

Image Creation

A core philosophy of Vulkan is that it’s an explicit API. This means that in the case of images, the exact format, tiling mode, target, usage, etc are specified at the time of creation. This also means that drivers have things they support and things they don’t support.

Historically, zink has just assumed that everything is supported and then either crashed horribly or succeeded because ANV is a pretty cool driver that makes a lot of stuff work when it shouldn’t.

To improve this situation, I’ve added two tiers of validation for image resource creation:

  • first, check all VkImageUsageFlags that are needed for an image are supported by the format being used
  • second, run the exact image creation params by the Vulkan driver before creation to see if it’ll work (and abort() if it doesn’t)

Roughly, these correspond to vkGetPhysicalDeviceFormatProperties (which previously had some, but not comprehensive use for this type of validation) and vkGetPhysicalDeviceImageFormatProperties (which has never been used previously in zink).
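
To make that second tier concrete, here is a minimal sketch, not zink's actual code (the helper name and the fixed 2D/no-flags parameters are just illustrative), of asking the Vulkan driver whether an exact set of image creation parameters can work at all:

#include <stdbool.h>
#include <vulkan/vulkan.h>

/* Returns false when this exact combination of format, type, tiling and
 * usage cannot be used to create an image on this physical device. */
static bool
image_params_supported(VkPhysicalDevice pdev, VkFormat format,
                       VkImageTiling tiling, VkImageUsageFlags usage)
{
   VkImageFormatProperties props;
   VkResult ret =
      vkGetPhysicalDeviceImageFormatProperties(pdev, format, VK_IMAGE_TYPE_2D,
                                               tiling, usage,
                                               0 /* VkImageCreateFlags */,
                                               &props);
   return ret == VK_SUCCESS;
}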

A major issue that I found along the way is that there are many cases where zink needs linear image tiling, as this is the only type of layout which enables reading the image back into a buffer. Zink has been assuming that if the format has any (e.g., non-linear) support for the image’s usage, linear is also fine. This is not the case, however, so now there are more checks which enforce a series of hoops to be jumped through when it’s necessary to do readback from images which have no support at all for linear tiling.

Memory Allocation

This is more or less the same as the issue that existed with image creation: zink tried to allocate memory from the wrong bucket (usually HOST_VISIBLE), and the Vulkan driver couldn’t handle that, so everything blew up.

Now this is handled by using device memory for almost all allocations, and everything works well enough.
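
For reference, a minimal sketch (assumed helper name, not the actual zink code) of what picking the right "bucket" amounts to in Vulkan: walk the memory types reported by the device and prefer a DEVICE_LOCAL one that is compatible with the allocation.

#include <stdint.h>
#include <vulkan/vulkan.h>

/* memory_type_bits comes from VkMemoryRequirements::memoryTypeBits of the
 * resource being allocated. Returns -1 if no suitable type exists. */
static int32_t
find_device_local_type(VkPhysicalDevice pdev, uint32_t memory_type_bits)
{
   VkPhysicalDeviceMemoryProperties mem_props;
   vkGetPhysicalDeviceMemoryProperties(pdev, &mem_props);
   for (uint32_t i = 0; i < mem_props.memoryTypeCount; i++) {
      if ((memory_type_bits & (1u << i)) &&
          (mem_props.memoryTypes[i].propertyFlags &
           VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT))
         return (int32_t)i;
   }
   return -1;
}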

Closing Thoughts

Nvidia GPUs should work pretty well from today on in my branch; I’m at a ~95% pass rate in piglit tests as of my last run, putting it solidly in second place on the Zink Preferred GPU List behind ANV, where I’m getting upwards of 97% of tests passing.

This didn’t make it into yesterday’s post, but everyone’s favorite benchmark also runs on zink+nvidia now:

nvidia-heaven.png

Here’s the caveat for all of the above: at present, zink on NV is unusably slow. The primary reason for this is that every frame that gets displayed has to be copied multiple times:

  • first, a GPU copy to a staging image that has HOST_VISIBLE (CPU-readable) memory
  • second, a CPU copy to another image which will then be used for displaying the frame

The first step in the process not only introduces more GPU work, it also forces an explicit fence, meaning that this is effectively right back where zink from master/release versions is at in forcing a complete stop of all work prior to each frame being finished.

The second step is also pretty bad, as it delays queuing any further work by the driver until the entire frame has once again been copied.

There’s not really any way to improve the current situation, but that’s only temporary. In the “soon” future, we’ll be landing WSI support, which will greatly improve things for all the drivers that zink supports, but probably mostly this one.

February 03, 2021

So some months have passed since our last update, when we announced that v3dv became Vulkan 1.0 conformant. The main reason for not publishing many posts is that we saw the 1.0 checkpoint as a good moment to hold off on adding big new features, and focus on improving the codebase (refactors, clean-ups, etc.) and the already existing features. For the latter we did a lot of work on performance. That alone would deserve a specific blog post, so in this one I will summarize the other stuff we did.

New features

Even if we didn’t focus on adding new features, we were still able to add some:

  • The following optional 1.0 features were enabled: logicOp, alphaToOne, independentBlend, drawIndirectFirstInstance, and shaderStorageImageExtendedFormats.
  • Added support for timestamp queries.
  • Added implementation for VK_KHR_maintenance1, VK_EXT_private_data, and VK_KHR_display extensions
  • Added support for Wayland WSI.

Here I would like to highlight that we started to get feature contributions from outside the initial core of developers that created the driver. VK_KHR_display was submitted by Steven Houston, and Wayland WSI support was submitted by Ella-0. Thanks a lot for it, really appreciated! We hope that this will begin a trend of having more things implemented by the rpi/mesa community as a whole.

Bugfixing and vulkan tools

Even after the driver became conformant, we kept testing it with several demos and applications, and provided fixes. As an example, we got Sascha Willems’ oit (Order Independent Transparency) demo working:

Sascha Willems’ oit demo on the rpi4

Among those applications that we were testing, we can highlight RenderDoc and GFXReconstruct. The former is a frame-capture based graphics debugger and the latter is a tool that allows capturing and replaying several frames. Both tools are heavily used when debugging and testing Vulkan applications. We tested that they work on the rpi4 (fixing some bugs while doing it), and also started to use them to help/guide the performance work we are doing.

Fosdem 2021

If you are interested in an overview of the development of the driver during the last year, we are going to present “Overview of the Open Source Vulkan Driver for Raspberry Pi 4” at FOSDEM this weekend (presentation details here).

Previous updates

Just in case you missed any of the updates of the vulkan driver so far:

Vulkan raspberry pi first triangle
Vulkan update now with added source code
v3dv status update 2020-07-01
V3DV Vulkan driver update: VkQuake1-3 now working
v3dv status update 2020-07-31
v3dv status update 2020-09-07
Vulkan update: we’re conformant!

What Is Even Happening Anymore

Thanks once again to my generous sponsors at Valve, as well as a patch from a GLX guru with the very accurate commit log “I hate everything.”, I put in a little time today and came up with this:

nvidia.png

So now zink+nvidia is a thing.

Also: snow. Why is that a thing and can it stop being a thing for the rest of the week so I can stop shoveling?

January 28, 2021

(This post was first published with Collabora on Nov 19, 2020.) (Fixed a broken link on Jan 28, 2021.)

Wayland (the protocol and architecture) is still lacking proper consideration for color management. Wayland also lacks support for high dynamic range (HDR) imagery, which has been around in the movie and broadcasting industries for a while now (e.g. Netflix HDR UI).

While there are well established tools and workflows for how to do color management on X11, even X11 has not gained support for HDR. There were plans for it (Alex Goins, DeepColor Visuals), but as far as I know nothing really materialized from them. Right now, the only way to watch HDR content on an HDR monitor in Linux is to use the DRM KMS API directly, in other words, not use any window system, which means not using any desktop environment. Kodi is one of the very few applications that can do this at all.

This is a story about starting the efforts to fix the situation on Wayland.

History and People

Color management for Wayland has been talked about on and off for many years by dozens of people. To me it was obvious from the start that color management architecture on Wayland must be fundamentally different from X11. I thought the display server must be part of the color management stack instead of an untrusted, unknown entity that must be bypassed and overridden by applications that fight each other for who gets to configure the display. This opinion was wildly controversial and it took a long time to get my point across, but over the years some color management experts started to open up to new ideas and other people joined in the opinion as well.  Whether these new ideas are actually better than the ways of old remains to be seen, though. I think the promise of getting everything and more to work better is far too great to not try it out.

The discussions started several times over the years, but they always dried out mostly without any tangible progress. Color management is a wide, deep and difficult topic, and the required skills, knowledge, interest, and available time did not come together until fairly recently. People did write draft protocol extensions, but I would claim that it was not really until Sebastian Wick started building on top of them that things started moving forward. But one person cannot push such a huge effort alone even for the simple reason that there must be at least one reviewer before anything can be merged upstream. I was very lucky that since summer 2020 I have been able to work on Wayland color management and HDR for improving ChromeOS, letting me support Sebastian's efforts on a daily basis. Vitaly Prosyak joined the effort this year as well, researching how to combine the two seemingly different worlds of ICC and HDR, and how tone-mapping could be implemented.

I must also note the past efforts of Harish Krupo, who submitted a major Weston merge request, but unfortunately at the time reviewers in Weston upstream were not readily available. Even before that, there were experiments by Ville Syrjälä. All these are now mostly superseded by the on-going work.

Currently the active people around the topic are me (Collabora), Vitaly Prosyak (AMD), and Naveen Kumar (Intel). Sebastian Wick (unaffiliated) is still around as well. None of us is a color management or HDR expert by trade, so we are all learning things as we go.

Design

The foundation for the color management protocol are ICC profile files for describing both output and content color spaces. The aim is for ICCv4, also allowing ICCv2, as these are known and supported well in general. Adding iccMAX support or anything else will be possible any time in the future.

As color management is all about color spaces and gamuts, and high dynamic range (HDR) is also very much about color spaces and gamuts plus extended luminance range, Sebastian and I decided that Wayland color management extension should cater for both from the beginning. Combining traditional color management and HDR is a fairly new thing as far as I know, and I'm not sure we have much prior art to base upon, so this is an interesting research journey as well. There is a lot of prior art on HDR and color management separately, but they tend to have fundamental differences that makes the combination not obvious.

To help us keep focused and explain to the community what we actually intend with Wayland color management and HDR support, I wrote the section "Wayland Color Management and HDR Design Goals" in color.rst (draft). I very much recommend you read it so that you get a picture of what we (or I, at least) want to aim for.

Elle Stone explains in their article how color management should work on X11. As I wanted to avoid repeating the massive email threads that were had on the wayland-devel mailing list, I wrote the section "Color Pipeline Overview" in color.rst (draft) more or less as a response to her article, trying to explain in what ways Wayland will be different from X11. I think that understanding that section is paramount before anyone makes any comment on our efforts with the Wayland protocol extension.

HDR brings even more reasons to put color space conversions in the display server than just the idea that all applications should be color managed if not explicitly then implicitly.  Most of the desktop applications (well, literally all right now) are using Standard Dynamic Range (SDR).  SDR is a fuzzy concept referring to all traditional, non-HDR image content.  Therefore, your desktop is usually 100% SDR. You run your fancy new HDR monitor in SDR mode, which means it looks just like any old monitor with nothing fancy.  What if you want to watch a HDR video? The monitor won't display HDR in SDR mode.  If you simply switch the monitor to HDR mode, you will be blinded by all the over-bright SDR applications.  Switching monitor modes may also cause flicker and take a bit of time. That would be a pretty bad user experience, right?

A solution is to run your monitor in HDR mode all the time, and have the window system compositor convert all SDR application windows appropriately to the HDR luminance, so that they look normal in spite of the HDR mode. There will always be applications that will never support HDR at all, so the compositor doing the conversion is practically the only way.

For the protocol, we are currently exploring the use of relative luminance.  The reason is that people look at monitors in wildly varying viewing environments, under standard office lighting for example. The environment and personal preferences affect what monitor brightness you want. Also monitors themselves can be wildly different in their capabilities. Most prior art on HDR uses absolute luminance, but absolute luminance has the problem that it assumes a specific viewing environment, usually a dark room, similar to a movie theatre.  If a display server would show a movie with the absolute luminance it was mastered for, in most cases it would be far too dark to see. Whether using relative luminance at the protocol level turns out to be a good idea or not, we shall see.

Development

The Wayland color management and HDR protocol extension proposal is known as wayland/wayland-protocols!14 (MR14). Because it is a very long running merge request (the bar for landing a new protocol into wayland-protocols is high) and there are several people working on it, we started using sub-merge-requests to modify the proposal. You can find the sub-MRs in Sebastian's fork. If you have a change to propose, that is how to do it.

Obviously using sub-MRs also splits the review discussions into multiple places, but in this case I think it is a good thing, because the discussion threads in Gitlab are already massive.

There are several big and small open questions we haven't had the time to tackle yet even among the active group; questions that I feel we should have some tentative answers to before asking for wider community comments. There is also no set schedule, so don't hold your breath. This work is likely to take months still before there is a complete tentative protocol, and probably years until these features are available in your favourite Wayland desktop environments.

If you are an expert on the topics of color management or HDR displays and content, you are warmly welcome to join the development.

If you are an interested developer or an end user looking to try out things, sorry, there is nothing really for you yet.

January 26, 2021

Overhead Migration

The goal in this post is to migrate a truckload of code I wrote to handle sampler updating out of zink and into Gallium, thereby creating several days’ worth of rebase work for myself but also removing a costly codepath from the driver thread.

The first step in getting sampler creation to work right in zink is getting Gallium to create samplers with the correct filters in accordance with Chapter 42 of the Vulkan Spec:

VK_FORMAT_FEATURE_SAMPLED_IMAGE_FILTER_LINEAR_BIT specifies that if VK_FORMAT_FEATURE_SAMPLED_IMAGE_BIT is also set, an image view can be used with a sampler that has either of magFilter or minFilter set to VK_FILTER_LINEAR, or mipmapMode set to VK_SAMPLER_MIPMAP_MODE_LINEAR. If VK_FORMAT_FEATURE_BLIT_SRC_BIT is also set, an image can be used as the srcImage to vkCmdBlitImage2KHR and vkCmdBlitImage with a filter of VK_FILTER_LINEAR. This bit must only be exposed for formats that also support the VK_FORMAT_FEATURE_SAMPLED_IMAGE_BIT or VK_FORMAT_FEATURE_BLIT_SRC_BIT.

If the format being queried is a depth/stencil format, this bit only specifies that the depth aspect (not the stencil aspect) of an image of this format supports linear filtering, and that linear filtering of the depth aspect is supported whether depth compare is enabled in the sampler or not. If this bit is not present, linear filtering with depth compare disabled is unsupported and linear filtering with depth compare enabled is supported, but may compute the filtered value in an implementation-dependent manner which differs from the normal rules of linear filtering. The resulting value must be in the range [0,1] and should be proportional to, or a weighted average of, the number of comparison passes or failures.

Here’s the (primary) function that I’ll be modifying to get everything working:

void
st_convert_sampler(const struct st_context *st,
                   const struct gl_texture_object *texobj,
                   const struct gl_sampler_object *msamp,
                   float tex_unit_lod_bias,
                   struct pipe_sampler_state *sampler)
{
   memset(sampler, 0, sizeof(*sampler));
   sampler->wrap_s = gl_wrap_xlate(msamp->Attrib.WrapS);
   sampler->wrap_t = gl_wrap_xlate(msamp->Attrib.WrapT);
   sampler->wrap_r = gl_wrap_xlate(msamp->Attrib.WrapR);

   if (texobj->_IsIntegerFormat && st->ctx->Const.ForceIntegerTexNearest) {
      sampler->min_img_filter = gl_filter_to_img_filter(GL_NEAREST);
      sampler->mag_img_filter = gl_filter_to_img_filter(GL_NEAREST);
   } else {
      sampler->min_img_filter = gl_filter_to_img_filter(msamp->Attrib.MinFilter);
      sampler->mag_img_filter = gl_filter_to_img_filter(msamp->Attrib.MagFilter);
   }
   sampler->min_mip_filter = gl_filter_to_mip_filter(msamp->Attrib.MinFilter);

   if (texobj->Target != GL_TEXTURE_RECTANGLE_ARB)
      sampler->normalized_coords = 1;

   sampler->lod_bias = msamp->Attrib.LodBias + tex_unit_lod_bias;
   /* Reduce the number of states by allowing only the values that AMD GCN
    * can represent. Apps use lod_bias for smooth transitions to bigger mipmap
    * levels.
    */
   sampler->lod_bias = CLAMP(sampler->lod_bias, -16, 16);
   sampler->lod_bias = roundf(sampler->lod_bias * 256) / 256;

   sampler->min_lod = MAX2(msamp->Attrib.MinLod, 0.0f);
   sampler->max_lod = msamp->Attrib.MaxLod;
   if (sampler->max_lod < sampler->min_lod) {
      /* The GL spec doesn't seem to specify what to do in this case.
       * Swap the values.
       */
      float tmp = sampler->max_lod;
      sampler->max_lod = sampler->min_lod;
      sampler->min_lod = tmp;
      assert(sampler->min_lod <= sampler->max_lod);
   }

   /* Check that only wrap modes using the border color have the first bit
    * set.
    */
   STATIC_ASSERT(PIPE_TEX_WRAP_CLAMP & 0x1);
   STATIC_ASSERT(PIPE_TEX_WRAP_CLAMP_TO_BORDER & 0x1);
   STATIC_ASSERT(PIPE_TEX_WRAP_MIRROR_CLAMP & 0x1);
   STATIC_ASSERT(PIPE_TEX_WRAP_MIRROR_CLAMP_TO_BORDER & 0x1);
   STATIC_ASSERT(((PIPE_TEX_WRAP_REPEAT |
                   PIPE_TEX_WRAP_CLAMP_TO_EDGE |
                   PIPE_TEX_WRAP_MIRROR_REPEAT |
                   PIPE_TEX_WRAP_MIRROR_CLAMP_TO_EDGE) & 0x1) == 0);

   /* For non-black borders... */
   if (/* This is true if wrap modes are using the border color: */
       (sampler->wrap_s | sampler->wrap_t | sampler->wrap_r) & 0x1 &&
       (msamp->Attrib.BorderColor.ui[0] ||
        msamp->Attrib.BorderColor.ui[1] ||
        msamp->Attrib.BorderColor.ui[2] ||
        msamp->Attrib.BorderColor.ui[3])) {
      const GLboolean is_integer = texobj->_IsIntegerFormat;
      GLenum texBaseFormat = _mesa_base_tex_image(texobj)->_BaseFormat;

      if (texobj->Attrib.StencilSampling)
         texBaseFormat = GL_STENCIL_INDEX;

      if (st->apply_texture_swizzle_to_border_color) {
         const struct st_texture_object *stobj = st_texture_object_const(texobj);
         /* XXX: clean that up to not use the sampler view at all */
         const struct st_sampler_view *sv = st_texture_get_current_sampler_view(st, stobj);

         if (sv) {
            struct pipe_sampler_view *view = sv->view;
            union pipe_color_union tmp;
            const unsigned char swz[4] =
            {
               view->swizzle_r,
               view->swizzle_g,
               view->swizzle_b,
               view->swizzle_a,
            };

            st_translate_color(&msamp->Attrib.BorderColor, &tmp,
                               texBaseFormat, is_integer);

            util_format_apply_color_swizzle(&sampler->border_color,
                                            &tmp, swz, is_integer);
         } else {
            st_translate_color(&msamp->Attrib.BorderColor,
                               &sampler->border_color,
                               texBaseFormat, is_integer);
         }
      } else {
         st_translate_color(&msamp->Attrib.BorderColor,
                            &sampler->border_color,
                            texBaseFormat, is_integer);
      }
   }

   sampler->max_anisotropy = (msamp->Attrib.MaxAnisotropy == 1.0 ?
                              0 : (GLuint) msamp->Attrib.MaxAnisotropy);

   /* If sampling a depth texture and using shadow comparison */
   if (msamp->Attrib.CompareMode == GL_COMPARE_R_TO_TEXTURE) {
      GLenum texBaseFormat = _mesa_base_tex_image(texobj)->_BaseFormat;

      if (texBaseFormat == GL_DEPTH_COMPONENT ||
          (texBaseFormat == GL_DEPTH_STENCIL && !texobj->Attrib.StencilSampling)) {
         sampler->compare_mode = PIPE_TEX_COMPARE_R_TO_TEXTURE;
         sampler->compare_func = st_compare_func_to_pipe(msamp->Attrib.CompareFunc);
      }
   }

   /* Only set the seamless cube map texture parameter because the per-context
    * enable should be ignored and treated as disabled when using texture
    * handles, as specified by ARB_bindless_texture.
    */
   sampler->seamless_cube_map = msamp->Attrib.CubeMapSeamless;
}

texobj here is the texture being sampled, msamp is the GL sampler object, and sampler is the template for the driver-backed sampler object that will be created with the pipe_context::create_sampler_state hook. The first half of the function deals with setting up filtering and wrap modes. The second half is mostly for border color pre-swizzling (i.e., what the Vulkan spec claims is the way that drivers should be handling border colors).

First: LINEAR Availability

First, if a driver doesn’t provide the format feature for linear filtering, linear filtering can’t be used.

I added a struct pipe_screen hook for this:

/**
 * Check if the given pipe_format and resource is supported for linear filtering
 * as a sampler view.
 * \param format The format to check.
 * \param pres The resource to check.
 */
bool (*is_linear_filtering_supported)( struct pipe_screen *,
                                       enum pipe_format format,
                                       struct pipe_resource *pres );

This gets called in st_convert_sampler(), which is the path that all user-managed samplers go through:

void
st_convert_sampler(const struct st_context *st,
                   const struct gl_texture_object *texobj,
                   const struct gl_sampler_object *msamp,
                   float tex_unit_lod_bias,
                   struct pipe_sampler_state *sampler)
{
   const struct st_texture_object *stobj = NULL;
   const struct st_sampler_view *sv = NULL;

   memset(sampler, 0, sizeof(*sampler));
   sampler->wrap_s = gl_wrap_xlate(msamp->Attrib.WrapS);
   sampler->wrap_t = gl_wrap_xlate(msamp->Attrib.WrapT);
   sampler->wrap_r = gl_wrap_xlate(msamp->Attrib.WrapR);

   bool is_linear_filtering_supported = true;

   if (st->pipe->screen->is_linear_filtering_supported) {
      enum pipe_format fmt = PIPE_FORMAT_NONE;
      stobj = st_texture_object_const(texobj);
      if (stobj->surface_based)
         fmt = stobj->surface_format;
      else {
         sv = st_texture_get_current_sampler_view(st, stobj);
         if (sv)
            fmt = sv->view->format;
         else
            fmt = stobj->pt->format;
      }
      assert(fmt != PIPE_FORMAT_NONE);
      is_linear_filtering_supported =
         st->pipe->screen->is_linear_filtering_supported(st->pipe->screen, fmt, stobj->pt);
   }

   if (!is_linear_filtering_supported ||
       (texobj->_IsIntegerFormat && st->ctx->Const.ForceIntegerTexNearest)) {
      sampler->min_img_filter = gl_filter_to_img_filter(GL_NEAREST);
      sampler->mag_img_filter = gl_filter_to_img_filter(GL_NEAREST);
   } else {
      sampler->min_img_filter = gl_filter_to_img_filter(msamp->Attrib.MinFilter);
      sampler->mag_img_filter = gl_filter_to_img_filter(msamp->Attrib.MagFilter);
   }

   if (is_linear_filtering_supported)
      sampler->min_mip_filter = gl_filter_to_mip_filter(msamp->Attrib.MinFilter);
   else
      sampler->min_mip_filter = gl_filter_to_img_filter(GL_NEAREST);

The code automatically assumes that linear filtering is available for all formats, utilizing the new is_linear_filtering_supported method if it’s available in order to override that value. The filtering modes are then updated based on the (Vulkan) driver’s capabilities.
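
For context, here is a rough sketch of how a Vulkan-backed driver like zink could implement the new hook; zink_screen(), zink_get_format(), and the pdev field are assumptions standing in for the driver's real plumbing, not the exact code:

static bool
zink_is_linear_filtering_supported(struct pipe_screen *pscreen,
                                   enum pipe_format format,
                                   struct pipe_resource *pres)
{
   struct zink_screen *screen = zink_screen(pscreen);   /* assumed helper */
   VkFormat vkformat = zink_get_format(screen, format); /* assumed helper */
   VkFormatProperties props;

   /* Linear filtering of sampled images is only allowed when the format
    * advertises SAMPLED_IMAGE_FILTER_LINEAR_BIT. */
   vkGetPhysicalDeviceFormatProperties(screen->pdev, vkformat, &props);
   return (props.optimalTilingFeatures &
           VK_FORMAT_FEATURE_SAMPLED_IMAGE_FILTER_LINEAR_BIT) != 0;
}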

Easy.

Second: LINEAR Depth Filtering

The spec allows linear filtering unconditionally for formats containing a depth aspect so long as depth compare is enabled:

If this bit is not present, linear filtering with depth compare disabled is unsupported and linear filtering with depth compare enabled is supported

For this, I added PIPE_CAP_LINEAR_DEPTH_FILTERING and some interaction with the pipe_screen::is_linear_filtering_supported hook:

...
   bool is_linear_filtering_supported = true;
   bool has_depth = false;

   if (st->pipe->screen->is_linear_filtering_supported) {
      enum pipe_format fmt = PIPE_FORMAT_NONE;
      stobj = st_texture_object_const(texobj);
      if (stobj->surface_based)
         fmt = stobj->surface_format;
      else {
         sv = st_texture_get_current_sampler_view(st, stobj);
         if (sv)
            fmt = sv->view->format;
         else
            fmt = stobj->pt->format;
      }
      assert(fmt != PIPE_FORMAT_NONE);
      is_linear_filtering_supported =
         st->pipe->screen->is_linear_filtering_supported(st->pipe->screen, fmt, stobj->pt);
      if (st->linear_depth_filtering_semantics)
         has_depth = util_format_has_depth(util_format_description(fmt));
   }

   /* PIPE_CAP_LINEAR_DEPTH_FILTERING */
   if (has_depth &&
       !is_linear_filtering_supported) {
      /* this conditional has the same result as the one after it,
       * but its complexity makes splitting it more readable
       */
      sampler->min_img_filter = gl_filter_to_img_filter(GL_NEAREST);
      sampler->mag_img_filter = gl_filter_to_img_filter(GL_NEAREST);
   } else if ((!is_linear_filtering_supported && !has_depth) ||
       (texobj->_IsIntegerFormat && st->ctx->Const.ForceIntegerTexNearest)) {
      sampler->min_img_filter = gl_filter_to_img_filter(GL_NEAREST);
      sampler->mag_img_filter = gl_filter_to_img_filter(GL_NEAREST);
   } else {
      sampler->min_img_filter = gl_filter_to_img_filter(msamp->Attrib.MinFilter);
      sampler->mag_img_filter = gl_filter_to_img_filter(msamp->Attrib.MagFilter);
   }

   if (is_linear_filtering_supported || has_depth)
      sampler->min_mip_filter = gl_filter_to_mip_filter(msamp->Attrib.MinFilter);
   else
      sampler->min_mip_filter = gl_filter_to_img_filter(GL_NEAREST);
...
   /* PIPE_CAP_LINEAR_DEPTH_FILTERING */
   if (sampler->compare_mode == PIPE_TEX_COMPARE_NONE &&
       has_depth && !is_linear_filtering_supported &&
       (sampler->mag_img_filter == PIPE_TEX_FILTER_LINEAR ||
        sampler->min_img_filter == PIPE_TEX_FILTER_LINEAR ||
        sampler->min_mip_filter == PIPE_TEX_FILTER_LINEAR)) {
      sampler->compare_mode = PIPE_TEX_COMPARE_R_TO_TEXTURE;
      sampler->compare_func = PIPE_FUNC_ALWAYS;
   }

If the format has a depth component, allow linear filtering and, if necessary, enable depth compare.

Nothing too complex here either.

Third: Handle Int-based Border Colors

Now it’s time to start getting grimy. Vulkan’s custom border color extension is, at best, functional. One component of the spec for this is that border colors can be specified as either integer or float values, and this is actually significant, so ideally this should just be passed through using the same value the user specified.

I made PIPE_CAP_NEED_BORDER_COLOR_TYPE for this. When specified, the border_color_is_integer member that I added to struct pipe_sampler_state will be treated as a disambiguating value for sampler states, and thus zink can use it to set the right type of border color values. It affects this glorious micro-optimization in the CSO state cache:

void
cso_single_sampler(struct cso_context *ctx, enum pipe_shader_type shader_stage,
                   unsigned idx, const struct pipe_sampler_state *templ)
{
   if (templ) {
      unsigned key_size = ctx->needs_sampler_border_color_type ?
                             sizeof(struct pipe_sampler_state) :
                             offsetof(struct pipe_sampler_state, border_color_is_integer);
      unsigned hash_key = cso_construct_key((void*)templ, key_size);
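
On the Vulkan side, a driver could then consume the flag roughly like this when building its sampler. Again a hedged sketch rather than the actual zink code; it assumes the customBorderColorWithoutFormat feature so the format member can stay undefined.

static VkSamplerCustomBorderColorCreateInfoEXT
get_custom_border_color(const struct pipe_sampler_state *state)
{
   VkSamplerCustomBorderColorCreateInfoEXT cbc = {
      .sType = VK_STRUCTURE_TYPE_SAMPLER_CUSTOM_BORDER_COLOR_CREATE_INFO_EXT,
      .format = VK_FORMAT_UNDEFINED,
   };

   /* border_color_is_integer picks the union member the app actually specified */
   if (state->border_color_is_integer)
      memcpy(cbc.customBorderColor.uint32, state->border_color.ui,
             sizeof(cbc.customBorderColor.uint32));
   else
      memcpy(cbc.customBorderColor.float32, state->border_color.f,
             sizeof(cbc.customBorderColor.float32));

   return cbc;
}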

Finally: GL_CLAMP

Get ready to roll around in some mud.

I added PIPE_CAP_EMULATE_GL_CLAMP_PLZ (named by Kayden, not up for discussion) to handle this awfulness. Functionally, it comprises 3 components:

  • handling in Mesa core to flag shader and sampler state updates any time a sampler sets or unsets GL_CLAMP (or GL_MIRROR_CLAMP)
  • handling in Gallium to force the sampler to either CLAMP_TO_BORDER or CLAMP_TO_EDGE depending on linear filtering availability for a given resource
  • shader rewrites in Gallium to run nir_lower_tex with bitfield info set for the GL_CLAMP samplers (henceforth gl_clamplers) that need to be rewritten

Since I’m already going deep into this function, let’s go once more into st_convert_sampler():

...
   /* PIPE_CAP_LINEAR_DEPTH_FILTERING */
   if (has_depth &&
       !is_linear_filtering_supported &&
       (!st->emulate_gl_clamp || (
       sampler->wrap_s != PIPE_TEX_WRAP_CLAMP &&
       sampler->wrap_s != PIPE_TEX_WRAP_MIRROR_CLAMP &&
       sampler->wrap_t != PIPE_TEX_WRAP_CLAMP &&
       sampler->wrap_t != PIPE_TEX_WRAP_MIRROR_CLAMP &&
       sampler->wrap_r != PIPE_TEX_WRAP_CLAMP &&
       sampler->wrap_r != PIPE_TEX_WRAP_MIRROR_CLAMP))) {
      /* this conditional has the same result as the one after it,
       * but its complexity makes splitting it more readable
       */
      sampler->min_img_filter = gl_filter_to_img_filter(GL_NEAREST);
      sampler->mag_img_filter = gl_filter_to_img_filter(GL_NEAREST);
   } else if ((!is_linear_filtering_supported && !has_depth) ||
       (texobj->_IsIntegerFormat && st->ctx->Const.ForceIntegerTexNearest)) {
      sampler->min_img_filter = gl_filter_to_img_filter(GL_NEAREST);
      sampler->mag_img_filter = gl_filter_to_img_filter(GL_NEAREST);
   } else {
      sampler->min_img_filter = min_img_filter;
      sampler->mag_img_filter = gl_filter_to_img_filter(msamp->Attrib.MagFilter);
   }
...
   if (st->emulate_gl_clamp) {
      bool clamp_to_border = (is_linear_filtering_supported || has_depth) &&
                             min_img_filter != PIPE_TEX_FILTER_NEAREST;
      if (sampler->wrap_s == PIPE_TEX_WRAP_CLAMP)
         sampler->wrap_s = clamp_to_border ? PIPE_TEX_WRAP_CLAMP_TO_BORDER :
                                             PIPE_TEX_WRAP_CLAMP_TO_EDGE;
      else if (sampler->wrap_s == PIPE_TEX_WRAP_MIRROR_CLAMP)
         sampler->wrap_s = clamp_to_border ? PIPE_TEX_WRAP_MIRROR_CLAMP_TO_BORDER :
                                             PIPE_TEX_WRAP_MIRROR_CLAMP_TO_EDGE;

      if (sampler->wrap_t == PIPE_TEX_WRAP_CLAMP)
         sampler->wrap_t = clamp_to_border ? PIPE_TEX_WRAP_CLAMP_TO_BORDER :
                                             PIPE_TEX_WRAP_CLAMP_TO_EDGE;
      else if (sampler->wrap_t == PIPE_TEX_WRAP_MIRROR_CLAMP)
         sampler->wrap_t = clamp_to_border ? PIPE_TEX_WRAP_MIRROR_CLAMP_TO_BORDER :
                                             PIPE_TEX_WRAP_MIRROR_CLAMP_TO_EDGE;

      if (sampler->wrap_r == PIPE_TEX_WRAP_CLAMP)
         sampler->wrap_r = clamp_to_border ? PIPE_TEX_WRAP_CLAMP_TO_BORDER :
                                             PIPE_TEX_WRAP_CLAMP_TO_EDGE;
      else if (sampler->wrap_r == PIPE_TEX_WRAP_MIRROR_CLAMP)
         sampler->wrap_r = clamp_to_border ? PIPE_TEX_WRAP_MIRROR_CLAMP_TO_BORDER :
                                             PIPE_TEX_WRAP_MIRROR_CLAMP_TO_EDGE;
   }
...

The depth component codepath is a little trickier, so I’ve split it off for readability even though it could be collapsed into the conditional that follows. In short, this only allows linear filtering for unsupported depth formats when GL_CLAMP is also used.

With the filtering and wrap modes set, the next step here is to adjust the wrap modes based on whether linear filtering is available and the min filter mode. If linear is available and the min filter is linear, GL_CLAMP becomes CLAMP_TO_BORDER, otherwise it’s CLAMP_TO_EDGE. In conjunction with the NIR pass, this ends up replicating the expected behavior.

And to get that info to the NIR pass, more awfulness is required:

static inline GLboolean
is_wrap_gl_clamp(GLint param)
{
   return param == GL_CLAMP || param == GL_MIRROR_CLAMP_EXT;
}

static void
update_gl_clamplers(struct st_context *st, struct gl_program *prog, uint32_t *gl_clamplers)
{
   if (!st->emulate_gl_clamp)
      return;

   gl_clamplers[0] = gl_clamplers[1] = gl_clamplers[2] = 0;
   GLbitfield samplers_used = prog->SamplersUsed;
   unsigned unit;
   /* same as st_atom_sampler.c */
   for (unit = 0; samplers_used; unit++, samplers_used >>= 1) {
      unsigned tex_unit = prog->SamplerUnits[unit];
      if (samplers_used & 1 &&
          (st->ctx->Texture.Unit[tex_unit]._Current->Target != GL_TEXTURE_BUFFER ||
           st->texture_buffer_sampler)) {
         const struct gl_texture_object *texobj;
         struct gl_context *ctx = st->ctx;
         const struct gl_sampler_object *msamp;

         texobj = ctx->Texture.Unit[tex_unit]._Current;
         assert(texobj);

         msamp = _mesa_get_samplerobj(ctx, tex_unit);
         if (is_wrap_gl_clamp(msamp->Attrib.WrapS))
            gl_clamplers[0] |= BITFIELD64_BIT(unit);
         if (is_wrap_gl_clamp(msamp->Attrib.WrapT))
            gl_clamplers[1] |= BITFIELD64_BIT(unit);
         if (is_wrap_gl_clamp(msamp->Attrib.WrapR))
            gl_clamplers[2] |= BITFIELD64_BIT(unit);
      }
   }
}

This function iterates over all the samplers used by a given shader (struct gl_program), checking the wrap modes for GL_CLAMP and updating the bitfields which correspond to struct nir_lower_tex_options::saturate_{s,t,r} when one is found. Each Gallium shader key is updated to include these values, though I’ve helpfully reduced the key size used for comparisons for drivers which don’t set the pipe cap, as well as for those which do but have yet to see a gl_clampler.
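
As a rough illustration of the final step (not the actual branch code; the wrapper and the way the bitfields travel with the shader key are simplified assumptions), feeding those bitfields into nir_lower_tex might look like:

static void
lower_gl_clamplers(nir_shader *nir, const uint32_t gl_clamplers[3])
{
   nir_lower_tex_options opts = {0};

   /* clamp the coordinates of the flagged samplers; combined with the
    * CLAMP_TO_BORDER/CLAMP_TO_EDGE wrap modes set earlier, this replicates
    * GL_CLAMP behavior */
   opts.saturate_s = gl_clamplers[0];
   opts.saturate_t = gl_clamplers[1];
   opts.saturate_r = gl_clamplers[2];

   nir_lower_tex(nir, &opts);
}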

The Result

By setting all these pipe caps and adding a trivial function, zink no longer needs to internally create and track sampler variants based on the above factors in order to support various sampler modes. Additionally, all Gallium-based drivers which emulate GL_CLAMP (there are several) can switch over to this and delete a bunch of code.

Hooray.

January 22, 2021

Less than a month ago, I began investigating the Apple M1 GPU in hopes of developing a free and open-source driver. This week, I’ve reached a second milestone: drawing a triangle with my own open-source code. The vertex and fragment shaders are handwritten in machine code, and I interface with the hardware via the IOKit kernel driver in an identical fashion to the system’s Metal userspace driver.

A triangle rendered on the M1 with open-source code

The bulk of the new code is responsible for constructing the various command buffers and descriptors resident in shared memory, used to control the GPU’s behaviour. Any state accessible from Metal corresponds to bits in these buffers, so understanding them will be the next major task. So far, I have focused less on the content and more on the connections between them. In particular, the structures contain pointers to one another, sometimes nested multiple layers deep. The bring-up process for the project’s triangle provides a bird’s eye view of how all these disparate pieces in memory fit together.

As an example, the application-provided vertex data are in their own buffers. An internal table in yet another buffer points to each of these vertex buffers. That internal table is passed directly as input to the vertex shader, specified in another buffer. That description of the vertex shader, including the address of the code in executable memory, is pointed to by another buffer, itself referenced from the main command buffer, which is referenced by a handle in the IOKit call to submit a command buffer. Whew!

In other words, the demo code is not yet intended to demonstrate an understanding of the fine-grained details of the command buffers, but rather to demonstrate there is “nothing missing”. Since GPU virtual addresses change from run to run, the demo validates that all of the pointers required are identified and can be relocated freely in memory using our own (trivial) allocator. As there is a bit of “magic” around memory and command buffer allocation on macOS, having this code working at an early stage gives peace of mind going forward.

I employed a piecemeal bring-up process. Since my IOKit wrapper exists in the same address space as the Metal application, the wrapper may modify command buffers just before submission to the GPU. As an early “hello world”, I identified the encoding of the render target’s clear colour in memory, and demonstrated that I could modify the colour as I pleased. Similarly, while learning about the instruction set to bring up the disassembler, I replaced shaders with handwritten equivalents and confirmed I could execute code on the GPU, provided I wrote out the machine code. But it’s not necessary to stop at these “leaf nodes” of the system; after modifying the shader code, I tried uploading shader code to a different part of the executable buffer while modifying the command buffer’s pointer to the code to compensate. After that, I could try uploading the commands for the shader myself. Iterating in this fashion, I could build up every structure needed while testing each in isolation.

Despite curveballs, this procedure worked out far better than the alternative of jumping straight to constructing buffers, perhaps via a “replay”. I had used that alternate technique to bring-up Mali a few years back, but it comes with the substantial drawback of fiendishly difficult debugging. If there is a single typo in five hundred lines of magic numbers, there would be no feedback, except an error from the GPU. However, by working one bit at a time, errors could be pinpointed and fixed immediately, providing a faster turn around time and a more pleasant bring-up experience.

But curveballs there were! My momentary elation at modifying the clear colours disappeared when I attempted to allocate a buffer for the colours. Despite encoding the same bits as before, the GPU would fail to clear correctly. Wondering if there was something wrong with the way I modified the pointer, I tried placing the colour in an unused part of memory that was already created by the Metal driver – that worked. The contents were the same, the way I modified the pointers was the same, but somehow the GPU didn’t like my memory allocation. I wondered if there was something wrong with the way I allocated memory, but the arguments I used to invoke the memory allocation IOKit call were bit-identical to those used by Metal, as confirmed by wrap. My last-ditch effort was checking if GPU memory had to be mapped explicitly via some side channel, like the mmap system call. IOKit does feature a device-independent memory map call, but no amount of fortified tracing found any evidence of side-channel system call mappings.

Trouble was brewing. Feeling delirious after so much time chasing an “impossible” bug, I wondered if there wasn’t something “magic” in the system call… but rather in the GPU memory itself. It was a silly theory since it produces a serious chicken-and-egg problem if true: if a GPU allocation has to be blessed by another GPU allocation, who blesses the first allocation?

But feeling silly and perhaps desperate, I pressed forward to test the theory by inserting a memory allocation call in the middle of the application flow, such that every subsequent allocation would be at a different address. Dumping GPU memory before and after this change and checking for differences revealed my first horror: an auxiliary buffer in GPU memory tracked all of the required allocations. In particular, I noticed values in this buffer increasing by one at a predictable offset (every 0x40 bytes), suggesting that the buffer contained an array of handles to allocations. Indeed, these values corresponded exactly to handles returned from the kernel on GPU memory allocation calls.

Putting aside the obvious problems with this theory, I tested it anyway, modifying this table to include an extra entry at the end with the handle of my new allocation, and modifying the header data structure to bump the number of entries by one. Still no dice. Discouraging as it was, that did not sink the theory entirely. In fact, I noticed something peculiar about the entries: contrary to what I thought, not all of them corresponded to valid handles. No, all but the last entry were valid. The handles from the kernel are 1-indexed, yet in each memory dump, the final handle was always 0, nonexistent. Perhaps this acts as a sentinel value, analogous to NULL-terminated strings in C. That explanation begs the question of why? If the header already contains a count of entries, a sentinel value is redundant.

I pressed on. Instead of adding on an extra entry with my handle, I copied the last entry n to the extra entry n + 1 and overwrote the (now second to last) entry n with the new handle.

Suddenly my clear colour showed up.

Is the mystery solved? I got the code working, so in some sense, the answer must be yes. But this is hardly a satisfying explanation; at every step, the unlikely solution only raises more questions. The chicken-and-egg problem is the easiest to resolve: this mapping table, along with the root command buffer, is allocated via a special IOKit selector independent from the general buffer allocation, and the handle to the mapping table is passed along with the submit command buffer selector. Further, the idea of passing required handles with command buffer submission is not unheard of; a similar mechanism is used on mainline Linux drivers. Nevertheless, the rationale for using 64-byte table entries in shared memory, as opposed to a simple CPU-side array, remains totally elusive.

Putting memory allocation woes behind me, the road ahead was not without bumps (and potholes), but with patience, I iterated until I had constructed the entirety of GPU memory myself in parallel to Metal, relying on the proprietary userspace only to initialize the device. Finally, all that remained was a leap of faith to kick off the IOKit handshake myself, and I had my first triangle.

These changes amount to around 1700 lines of code since the last blog post, available on GitHub. I’ve pieced together a simple demo animating a triangle with the GPU on-screen. The window system integration is effectively nonexistent at this point: XQuartz is required and detiling the (64x64 Morton-order interleaved) framebuffer occurs in software with naive scalar code. Nevertheless, the M1’s CPU is more than fast enough to cope.

Now that each part of the userspace driver is bootstrapped, going forward we can iterate on the instruction set and the command buffers in isolation. We can tease apart the little details and bit-by-bit transform the code from hundreds of inexplicable magic constants to a real driver. Onwards!

Passion Led Us Here

How our for-profit company became a nonprofit, to better tackle the digital divide.

Originally posted on the Endless OS Foundation blog.

An 8-year journey to a nonprofit

On the 1st of April 2020, our for-profit Endless Mobile officially became a nonprofit as the Endless OS Foundation. Our launch as a nonprofit just as the global pandemic took hold was, predictably, hardly noticed, but for us the timing was incredible: as the world collectively asked “What can we do to help others in need?”, we framed our mission statement and launched our .org with the same very important question in mind. Endless always had a social impact mission at its heart, and the challenges related to students, families, and communities falling further into the digital divide during COVID-19 brought new urgency and purpose to our team’s decision to officially step in the social welfare space.

On April 1st 2020, our for-profit Endless Mobile officially became a nonprofit as the Endless OS Foundation, focused on the #DigitalDivide.

Our updated status was a long time coming: we began our transformation to a nonprofit organization in late 2019 with the realization that the true charter and passions of our team would be greatly accelerated without the constraints of for-profit goals, investors and sales strategies standing in the way of our mission of digital access and equity for all. 

But for 8 years we made a go of it commercially, headquartered in Silicon Valley and framing ourselves as a tech startup with access to the venture capital and partnerships on our doorstep. We believed that a successful commercial channel would be the most efficient way to scale the impact of bringing computer devices and access to communities in need. We still believe this – we’ve just learned through our experience that we don’t have the funding to enter the computer and OS marketplace head-on. With the social impact goal first, and the hope of any revenue a secondary goal, we have had many successes in those 8 years bridging the digital divide throughout the world, from Brazil, to Kenya, and the USA. We’ve learned a huge amount which will go on to inform our strategy as a nonprofit.

Endless always had a social impact mission at its heart. COVID-19 brought new urgency and purpose to our team’s decision to officially step in the social welfare space.

Our unique perspective

One thing we learned as a for-profit is that the OS and technology we’ve built has some unique properties which are hugely impactful as a working solution to digital equity barriers. And our experience deploying in the field around the world for 8 years has left us uniquely informed via many iterations and incremental improvements.

Endless OS designer in discussion with prospective user

With this knowledge in-hand, we’ve been refining our strategy throughout 2020 and now starting to focus on what it really means to become an effective nonprofit and make that impact. In many ways it is liberating to abandon the goals and constraints of being a for-profit entity, and in other ways it’s been a challenging journey for me and the team to adjust our way of thinking and let these for-profit notions and models go. Previously we exclusively built and sold a product that defined our success; and any impact we achieved was a secondary consequence of that success and seen through that lens. Now our success is defined purely in terms of social impact, and through our actions, those positive impacts can be made with or without our “product”. That means that we may develop and introduce technology to solve a problem, but it is equally as valid to find another organization’s existing offering and design a way to increase that positive impact and scale.

We develop technology to solve access equity issues, but it’s equally as valid to find another organization’s offering and partner in a way that increases their positive impact.

The analogy to Free and Open Source Software is very strong – while Endless has always used and contributed to a wide variety of FOSS projects, we’ve also had a tension where we’ve been trying to hold some pieces back and capture value – such as our own application or content ecosystem, our own hardware platform – necessarily making us competitors to other organisations even though they were hoping to achieve the same things as us. As a nonprofit we can let these ideas go and just pick the best partners and technologies to help the people we’re trying to reach.

School kids writing on paper

Digital equity … 4 barriers we need to overcome

In future, our decisions around which projects to build or engage with will revolve around 4 barriers to digital equity, and how our Endless OS, Endless projects, or our partners’ offerings can help to solve them. We define these 4 equity barriers as: barriers to devices, barriers to connectivity, barriers to literacy in terms of your ability to use the technology, and barriers to engagement in terms of whether using the system is rewarding and worthwhile.

We define the 4 digital equity barriers we exist to impact as:
1. barriers to devices
2. barriers to connectivity
3. barriers to literacy
4. barriers to engagement

It doesn’t matter who makes the solutions that break these barriers; what matters is how we assist in enabling people to use technology to gain access to the education and opportunities these barriers block. Our goal therefore is to simply ensure that solutions exist – building them ourselves and with partners such as the FOSS community and other nonprofits – proving them with real-world deployments, and sharing our results as widely as possible to allow for better adoption globally.

If we define our goal purely in terms of whether people are using Endless OS, we are effectively restricting the reach and scale of our solutions to the audience we can reach directly with Endless OS downloads, installs and propagation. Conversely, partnerships that scale impact are a win-win-win for us, our partners, and the communities we all serve. 

Engineering impact

Our Endless engineering roots and capabilities feed our unique ability to build and deploy all of our solutions, and the practical experience of deploying them gives us evidence and credibility as we advocate for their use. Either activity would be weaker without the other.

Our engineering roots and capabilities feed our unique ability to build and deploy digital divide solutions.

Our partners in various engineering communities will have already seen our change in approach. Particularly, with GNOME we are working hard to invest in upstream and reconcile the long-standing differences between our experience and GNOME. If successful, many more people can benefit from our work than just users of Endless OS. We’re working with Learning Equality on Kolibri to build a better app experience for Linux desktop users and bring content publishers into our ecosystem for the first time, and we’ve also taken our very own Hack, the immersive and fun destination for kids learning to code, released it for non-Endless systems on Flathub, and made it fully open-source.

Planning tasks with sticky notes on a whiteboard

What’s next for our OS?

What then is in store for the future of Endless OS, the place where we have invested so much time and planning through years of iterations? For the immediate future, we need the capacity to deploy everything we’ve built – all at once, to our partners. We built an OS that we feel is very unique and valuable, containing a number of world-firsts: first production OS shipped with OSTree, first Flatpak-only desktop, built-in support for updating OS and apps from USBs, while still providing a great deal of reliability and convenience for deployments in offline and educational-safe environments with great apps and content loaded on every system.

However, we need to find a way to deliver this Linux-based experience in a more efficient way, and we’d love to talk if you have ideas about how we can do this, perhaps as partners. Can the idea of “Endless OS” evolve to become a spec that is provided by different platforms in the future, maybe remixes of Debian, Fedora, openSUSE or Ubuntu? 

Build, Validate, Advocate

Beyond the OS, the Endless OS Foundation has identified multiple programs to help underserved communities, and in each case we are adopting our “build, validate, advocate” strategy. This approach underpins all of our projects: can we build the technology (or assist in the making), will a community in-need validate it by adoption, and can we inspire others by telling the story and advocating for its wider use?

We are adopting a “build, validate, advocate” strategy.
1. build the technology (or assist in the making)
2. validate by community adoption
3. advocate for its wider use

As examples, we have just launched the Endless Key (link) as an offline solution for students during the COVID-19 at-home distance learning challenges. This project is also establishing a first-ever partnership of well-known online educational brands to reach an underserved offline audience with valuable learning resources. We are developing a pay-as-you-go platform and new partnerships that will allow families to own laptops via micro-payments that are built directly into the operating system, even if they cannot qualify for standard retail financing. And during the pandemic, we’ve partnered with Teach For America to focus on very practical digital equity needs in the USA’s urban and rural communities.

One part of the world-wide digital divide solution

We are one solution provider for the complex matrix of issues known collectively as the #DigitalDivide, and these issues will not disappear after the pandemic. Digital equity was an issue long before COVID-19, and we are not so naive to think it can be solved by any single institution, or by the time the pandemic recedes. It will take time and a coalition of partnerships to win. We are in for the long-haul and we are always looking for partners, especially now as we are finding our feet in the nonprofit world. We’d love to hear from you, so please feel free to reach out to me – I’m ramcq on IRC, RocketChat, Twitter, LinkedIn or rob@endlessos.org.

Your XKB keymap contains two important parts. One is the mapping from the hardware scancode to some internal representation, for example:

  <AB10> = 61;  

Which basically means Alphanumeric key in row B (from bottom), 10th key from the left. In other words: the /? key on a US keyboard.

The second part is mapping that internal representation to a keysym, for example:

  key <AB10> {        [     slash,    question        ]       }; 

This is the actual layout mapping - once in place this key really produces a slash or question mark (on level2, i.e. when Shift is down).

This two-part approach exists so either part can be swapped without affecting the other. Swap the second part to an exclamation mark and paragraph symbol and you have the French version of this key, swap it to dash/underscore and you have the German version of the key - all without having to change the keycode.
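
For example, the German variant of that same key maps it to dash/underscore (shown here without the extra levels the real layout defines):

  key <AB10> {        [     minus,    underscore      ]       };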

Back in the golden days of everyone-does-what-they-feel-like, keyboard manufacturers (presumably happily so) changed the key codes and we needed model-specific keycodes in XKB. The XkbModel configuration is a leftover from these trying times.

The Linux kernel's evdev API has largely done away with this. It provides a standardised set of keycodes, defined in linux/input-event-codes.h, and ensures, with the help of udev [0], that all keyboards actually conform to that. An evdev XKB keycode is a simple "kernel keycode + 8" [1] and that applies to all keyboards. On top of that, the kernel uses semantic definitions for the keys as they'd be in the US layout. KEY_Q is the key that would, behold!, produce a Q. Or an A in the French layout because they just have to be different, don't they? Either way, with evdev the Xkb Model configuration largely points to nothing and only wastes a few cycles with string parsing.
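
To make the "+ 8" concrete: the kernel defines KEY_Q as 16 in linux/input-event-codes.h, so the matching entry in the evdev keycodes file is simply that value plus 8:

  <AD01> = 24;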

The second part, the keysym mapping, uses two approaches. One is to use a named #define like the "slash", "question" outlined above (see X11/keysymdef.h for the defines). The other is to use unicode directly like this example from the Devanagari layout:

  key <AB10> { [ U092f, U095f, slash, question ] };

As you can see, mix and match is available too. Using Unicode code points of course makes the layouts less immediately readable but on the other hand we don't need to #define the whole of Unicode. So from a maintenance perspective it's a win.

However, there's a third type of key that we care about: functional keys. Those are the multimedia (historically: "internet") keys that most devices have these days. Volume up, touchpad on/off, cycle display connectors, etc. Those keys are special in that they don't have a Unicode representation and they are always mapped to the same fixed functionality. Even Dvorak users want their volume keys to do what it says on the key.

Because they have no Unicode code points, those keys are defined, historically, in XF86keysyms.h:

  #define XF86XK_MonBrightnessUp    0x1008FF02  /* Monitor/panel brightness */

And mapping such a key looks like this [2]:

  key <I21>   {       [ XF86Calculator        ] };

The only drawback: every key needs to be added manually. This has been done for some, but not for others. And some keys were added with different names than what the kernel uses [3].

So we're in this weird situation where we have a flexible keymap system but the kernel already tells us what a key does anyway and we don't want to change that. Virtually all keys added in the last decade or so fall into that group of keys, but to actually make use of them requires a #define in xorgproto and an update to the keycodes and symbols in xkeyboard-config. That again introduces discrepancies and we end up in the situation we're in right now: some keys don't work until someone files a bug, and then the users still need to wait for several components to be released and for those releases to trickle into the distributions.

10 years ago would've been a good time to make this more efficient. The situation wasn't that urgent then, most of the kernel keycodes added are >255 which means they cannot be used in X anyway. [4] The second best time to do it is now. What we need is basically a pass-through from kernel code to symbol and that's currently sitting in various MRs:

- xkeyboard-config can generate the keycodes/evdev file based on the list of kernel keycodes, so all kernel keycodes are mapped to internal representations by default

- xorgproto has reserved a range within the XF86 keysym reserved range for pass-through mappings, i.e. any KEY_FOO define from the kernel is mapped to XF86XK_Foo with a specific value [5]. The #define format is fixed so it can be parsed.

- xkeyboard-config parses these XF86 keysyms and sets up a keysym mapping in the default keymap.

This is semi-automatic, i.e. there are helper scripts that detect changes and notify us, hooked into the CI, but the actual work must be done manually. These keysyms immediately become set-in-stone API so we don't want some unsupervised script to go wild on them.

There's a huge backlog of keys to be added (dating to kernels pre-v3.18) and I'll go through them one-by-one over the next weeks to make sure they're correct. But eventually they'll be done and we have a full keymap for all kernel keys to be immediately available in the XKB layout.

The last part of all of this is a calendar reminder for me to do this after every new kernel release. Let's hope this crucial part isn't the first to fail.

[0] 60-keyboard.hwdb has a mere ~1800 lines!
[1] Historical reasons, you don't want to know. *jedi wave*
[2] the XK_ part of the key name is dropped, implementation detail.
[3] This can also happen when a kernel define is renamed/aliased but we cannot easily do so for this header.
[4] X has an 8 bit keycode limit and that won't change until someone develops XKB2 with support for 32-bit keycodes, i.e. never.

[5] The actual value is an implementation detail and no client must care


Return of the Blog

I meant to blog a couple times this week, but I kept getting sidetracked by various matters. Here’s a very brief recap on what’s happened in zinkland over the past week.

Important MRs

And a bunch more extensions related to GL 4.3+ are now enabled.

A new zink-wip snapshot is finally up after most of a week spent fighting regressions. Anyone updating from a previous (e.g., 20201230) snapshot will find:

  • a ton of garbage patches which haven’t been properly split or pruned in any way and are likely unreadable/unbisectable
  • a new descriptor manager which doesn’t cache and instead uses templates and incremental updates to provide comparable performance to the caching one; ZINK_CACHE_DESCRIPTORS=1 will use the caching version
  • tons of optimizations for reducing driver overhead
  • a prototype for a new Vulkan extension providing direct multidraw functionality (as well as accompanying implementations for both ANV and RADV) which very slightly improves performance
  • lots of corner case bug fixes for things much earlier in the branch

All told, zink should now be (slightly, possibly even imperceptibly) faster as well as being even less bug-prone.

I’m pretty beat after this week, so that’s all for now. Hoping to return to more normal, in-depth coverage of driver internals next week.

January 14, 2021

The open source Panfrost driver for Arm Mali Midgard and Bifrost GPUs now provides non-conformant OpenGL ES 3.0 on Bifrost and desktop OpenGL 3.1 on Midgard (Mali T760 and newer) and Bifrost, in time for Mesa’s first release of 2021. This follows the OpenGL ES 3.0 support on Midgard that landed over the summer, as well as the initial OpenGL ES 2.0 support that recently debuted for Bifrost. OpenGL ES 3.0 is now tested on Mali G52 in Mesa’s continuous integration, achieving a 99.9% pass rate on the corresponding drawElements Quality Program tests.

Architecturally, Bifrost shares most of its fixed-function data structures with Midgard, but features a brand new instruction set. Our work for bringing up OpenGL ES 3.0 on Bifrost reflects this division. Some fixed-function features, like instancing and transform feedback, worked without any Bifrost-specific changes since we already did bring-up on Midgard. Other shader features, like uniform buffer objects, required “from scratch” implementations in the Bifrost compiler, a task facilitated by the compiler’s maturing intermediate representation with first-class builder support. Yet other features like multiple render targets required some Bifrost-specific code while leveraging other code shared with Midgard. All in all, the work progressed much more quickly the second time around, a testament to the power of code sharing. But there is no need to limit sharing to just Panfrost GPUs; open source drivers can share code across vendors.

Indeed, since Mali is an embedded GPU, the proprietary driver only exposes OpenGL ES, not desktop OpenGL. However, desktop OpenGL 3.1 support comes nearly “for free” for us as an upstream Mesa driver by leveraging common infrastructure. This milestone shows the technical advantage of open source development: Compared to layered implementations of desktop GL like gl4es or Zink, Panfrost’s desktop OpenGL support is native, reducing CPU overhead. Furthermore, applications can make use of the hardware’s hidden features, like explicit primitive restart indices, alpha testing, and quadrilaterals. Although these features could be emulated, the native solutions are more efficient.

Mesa’s shared code also extends to OpenCL support via Clover. Once a driver supports compute shaders and sufficient compiler features, baseline OpenCL is just a few patches and a bug-fixing spree away. While OpenCL implementations could be layered (for example with clvk), an open source Mesa driver avoids the indirection.

I would like to thank Collaboran Boris Brezillon, who has worked tirelessly to bring OpenGL ES 3.0 support to Bifrost, as well as the prolific Icecream95, who has spearheaded OpenCL and desktop OpenGL support.

Originally posted on Collabora’s blog

A Quintessential Metric

There’s been a lot of talk about driver overhead in the Mesa community as of late, in large part begun by Marek Olšák and his daredevil stunts driving RadeonSI through flaming hoops while juggling chainsaws.

While zink isn’t quite at that level yet (and neither am I), there’s still some progress being made that I’d like to dig into a bit.

What Is Overhead?

As in all software, overhead is the performance penalty that is incurred as compared to a baseline measurement. In Mesa, a lot of people know of driver overhead as “Gallium sucks” and/or “A Gallium-based driver is slow” due to the fact that Gallium does incur some amount of overhead as compared to the old-style immediate mode DRI drivers.

While it’s true that there is an amount of performance lost by using Gallium in this sense, it’s also true that the performance gained is much greater. The reason for this is that Gallium is able to batch commands and state changes for every driver using it, allowing redundant calls to avoid triggering any work in the GPU.

It also makes for an easier time profiling and improving upon the CPU usage that’s required to handle the state changes emitted by Gallium. Instead of having a ton of core Mesa callbacks which need to be handled, each one potentially leading to a no-op that can be analyzed and deferred by the driver, Gallium provides a more cohesive API where each driver hook is a necessary change that must be handled. Because of this, the job of optimizing for those changes is simplified.

How Can Overhead Be Measured?

Other than the obvious method of running apps on a driver and checking the fps counter, piglit provides a facility for this: the drawoverhead test. This test has over a hundred subtests which perform sequences of draw operations with various state changes, each with its own result relative to a baseline, enabling a developer to easily profile and optimize a given codepath.

How Is Zink Doing Here?

To answer this, let’s look at some preliminary results from zink in master, the code which will soon be shipping in Mesa 21.0.0. All numbers here are, in contrast to my usual benchmarking, from an AMD 5700XT GPU. More on this later.

ZINK: MASTER

   #, Test name                                              ,    Thousands draws/s, Difference vs the 1st
   1, DrawElements ( 1 VBO| 0 UBO|  0    ) w/ no state change,                  818, 100.0%
   2, DrawElements ( 4 VBO| 0 UBO|  0    ) w/ no state change,                  686, 83.9%
   3, DrawElements (16 VBO| 0 UBO|  0    ) w/ no state change,                  411, 50.3%
   4, DrawElements ( 1 VBO| 0 UBO| 16 Tex) w/ no state change,                  232, 28.4%
   5, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ no state change,                  258, 31.5%
   6, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ shader program change,             87, 10.7%
   7, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ vertex attrib change,             162, 19.9%
   8, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 1 texture change,                 150, 18.3%
   9, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 8 textures change,                120, 14.7%
  10, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 1 TBO change,                     192, 23.5%
  11, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 8 TBOs change,                    146, 17.9%

After this point, the test aborts because shader images are not yet implemented, but it’s enough for a baseline.

These numbers are…not great. Primarily, at least to start, I’ll be focusing on the first row where zink is performing 818,000 draws per second.

Let’s check out some performance from zink-wip (20201230 snapshot), specifically with GALLIUM_THREAD=0 set to disable threaded context. This means I’m adding in descriptor caching and unlimited command buffer counts (vs forcing a stall after every submit from the 4th one onwards to reset a batch):

ZINK: WIP (CACHED, NO THREAD)

   #, Test name                                              ,    Thousands draws/s, Difference vs the 1st
   1, DrawElements ( 1 VBO| 0 UBO|  0    ) w/ no state change,                  766, 100.0%
   2, DrawElements ( 4 VBO| 0 UBO|  0    ) w/ no state change,                  633, 82.6%
   3, DrawElements (16 VBO| 0 UBO|  0    ) w/ no state change,                  407, 53.1%
   4, DrawElements ( 1 VBO| 0 UBO| 16 Tex) w/ no state change,                  500, 65.3%
   5, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ no state change,                  449, 58.6%
   6, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ shader program change,             85, 11.2%
   7, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ vertex attrib change,             235, 30.7%
   8, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 1 texture change,                 159, 20.8%
   9, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 8 textures change,                128, 16.7%
  10, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 1 TBO change,                     179, 23.4%
  11, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 8 TBOs change,                    139, 18.2%

This is actually worse for a lot of cases!

But why is that?

It turns out that in the base draw case, threaded context is really needed to benefit from caching and using more command buffers. There are sizable gains in the baseline texture cases (+100% or so each) and for a vertex attribute change (+50%), but fundamentally the overhead for the driver seems higher.

What happens if threading is enabled though?

ZINK: WIP (CACHED, THREAD)

   #, Test name                                              ,    Thousands draws/s, Difference vs the 1st
   1, DrawElements ( 1 VBO| 0 UBO|  0    ) w/ no state change,                 5206, 100.0%
   2, DrawElements ( 4 VBO| 0 UBO|  0    ) w/ no state change,                 5149, 98.9%
   3, DrawElements (16 VBO| 0 UBO|  0    ) w/ no state change,                 5187, 99.6%
   4, DrawElements ( 1 VBO| 0 UBO| 16 Tex) w/ no state change,                 5210, 100.1%
   5, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ no state change,                 4684, 90.0%
   6, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ shader program change,            137, 2.6%
   7, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ vertex attrib change,             252, 4.8%
   8, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 1 texture change,                 243, 4.7%
   9, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 8 textures change,                222, 4.3%
  10, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 1 TBO change,                     213, 4.1%
  11, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 8 TBOs change,                    208, 4.0%

blink.gif

Indeed, threading yields almost a 700% performance improvement for the baseline cases. It turns out that synchronously performing expensive tasks like computing hash values for descriptor sets is bad. Who could have guessed.

State Changes

Looking at the other values, however, is a bit more pertinent for the purpose of this post. Overhead is incurred when state changes are triggered by descriptors being changed, and this is much closer to a real world scenario (i.e., gaming) than simply running draw calls with no changes. Caching yields roughly a 50% performance improvement for this case.

Further Improvements

As I’d mentioned previously, I’m doing some work now on descriptor management with an aim to further lower this overhead. Let’s see what that looks like.

ZINK: TEST (UNCACHED, THREAD)

   #, Test name                                              ,    Thousands draws/s, Difference vs the 1st
   1, DrawElements ( 1 VBO| 0 UBO|  0    ) w/ no state change,                 5426, 100.0%
   2, DrawElements ( 4 VBO| 0 UBO|  0    ) w/ no state change,                 5423, 99.9%
   3, DrawElements (16 VBO| 0 UBO|  0    ) w/ no state change,                 5432, 100.1%
   4, DrawElements ( 1 VBO| 0 UBO| 16 Tex) w/ no state change,                 5246, 96.7%
   5, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ no state change,                 5177, 95.4%
   6, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ shader program change,            153, 2.8%
   7, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ vertex attrib change,             229, 4.2%
   8, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 1 texture change,                 247, 4.6%
   9, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 8 textures change,                228, 4.2%
  10, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 1 TBO change,                     237, 4.4%
  11, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 8 TBOs change,                    223, 4.1%

While there’s a small (~4%) improvement for the baseline numbers, what’s much more interesting is the values where descriptor states are changed. They are, in fact, about as good or even slightly better than the caching version of descriptor management.

This is huge. Specifically, it’s huge because it means that I can likely port over some of the techniques used in this approach to the cached version in order to drive further reductions in overhead.

Closing Remarks

Before I go, let’s check out some numbers from a real driver. Specifically, RadeonSI: the pinnacle of Gallium-based drivers.

RADEONSI

   #, Test name                                              ,    Thousands draws/s, Difference vs the 1st
   1, DrawElements ( 1 VBO| 0 UBO|  0    ) w/ no state change,                 6221, 100.0%
   2, DrawElements ( 4 VBO| 0 UBO|  0    ) w/ no state change,                 6261, 100.7%
   3, DrawElements (16 VBO| 0 UBO|  0    ) w/ no state change,                 6236, 100.2%
   4, DrawElements ( 1 VBO| 0 UBO| 16 Tex) w/ no state change,                 6263, 100.7%
   5, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ no state change,                 6243, 100.4%
   6, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ shader program change,            217, 3.5%
   7, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ vertex attrib change,            1467, 23.6%
   8, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 1 texture change,                 374, 6.0%
   9, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 8 textures change,                218, 3.5%
  10, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 1 TBO change,                     680, 10.9%
  11, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 8 TBOs change,                    318, 5.1%

Yikes. Especially intimidating here is the relative performance for vertex attribute changes, where RadeonSI is able to retain almost 25% of its baseline performance relative to zink not even managing 5%.

Hopefully these figures get closer to each other in the future, but this just shows that there’s still a long way to go.

January 13, 2021

This post explains how to parse the HID Unit Global Item as explained by the HID Specification, page 37. The table there is quite confusing and it took me a while to fully understand it (Benjamin Tissoires was really the one who cracked it). I couldn't find any better explanation online which means either I'm incredibly dense and everyone's figured it out or no-one has posted a better explanation. On the off-chance it's the latter [1], here are the instructions on how to parse this item.

We know a HID Report Descriptor consists of a number of items that describe the content of each HID Report (read: an event from a device). These Items include things like Logical Minimum/Maximum for axis ranges, etc. A HID Unit item specifies the physical unit to apply. For example, a Report Descriptor may specify that X and Y axes are in mm which can be quite useful for all the obvious reasons.

Like most HID items, a HID Unit Item consists of a one-byte item tag and 1, 2 or 4 byte payload. The Unit item in the Report Descriptor itself has the binary value 0110 01nn where the nn is either 1, 2, or 3 indicating 1, 2 or 4 bytes of payload, respectively. That's standard HID.

The payload is divided into nibbles (4-bit units) and goes from LSB to MSB. The lowest-order 4 bits (first byte & 0xf) define the unit System to apply: one of SI Linear, SI Rotation, English Linear or English Rotation (well, or None/Reserved). The rest of the nibbles are in this order: "length", "mass", "time", "temperature", "current", "luminous intensity". In something resembling code this means:


system = value & 0xf
length_exponent = (value & 0xf0) >> 4
mass_exponent = (value & 0xf00) >> 8
time_exponent = (value & 0xf000) >> 12
...

The System defines which unit is used for length (e.g. SILinear means length is in cm). The actual value of each nibble is the exponent for the unit in use [2]. In something resembling code:

switch (system) {
case SILinear:
    print("length is in cm^{length_exponent}");
    break;
case SIRotation:
    print("length is in rad^{length_exponent}");
    break;
case EnglishLinear:
    print("length is in in^{length_exponent}");
    break;
case EnglishRotation:
    print("length is in deg^{length_exponent}");
    break;
case None:
case Reserved:
    print("boo!");
    break;
}

For example, the value 0x321 means "SI Linear" (0x1) so the remaining nibbles represent, in ascending nibble order: Centimeters, Grams, Seconds, Kelvin, Ampere, Candela. The length nibble has a value of 0x2 so it's square cm, the mass nibble has a value of 0x3 so it is cubic grams (well, it's just an example, so...). This means that any report containing this item comes in cm²g³. As a more realistic example: 0xF011 would be cm/s.

If we changed the lowest nibble to English Rotation (0x4), i.e. our value is now 0x324, the units represent: Degrees, Slug, Seconds, F, Ampere, Candela [3]. The length nibble 0x2 means square degrees, the mass nibble is cubic slugs. As a more realistic example, 0xF014 would be degrees/s.

Any nibble with value 0 means the unit isn't in use, so the example from the spec with value 0x00F0D121 is SI linear, units cm² g s⁻³ A⁻¹, which is... Voltage! Of course you knew that and totally didn't have to double-check with wikipedia.
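
Putting the above together, a small C program can decode any Unit value (a quick sketch that only handles the nibble decoding described here, not full report descriptor parsing):

#include <stdio.h>
#include <stdint.h>

/* decode one nibble of a HID Unit value into its signed exponent */
static int nibble_exponent(uint32_t unit, int index)
{
    int v = (unit >> (4 * index)) & 0xf;
    return v >= 8 ? v - 16 : v;   /* twos-complement 4-bit value */
}

int main(void)
{
    uint32_t unit = 0x00F0D121;   /* the Voltage example from the spec */
    const char *names[] = { "system", "length", "mass", "time",
                            "temperature", "current", "luminous intensity" };

    printf("system: 0x%x\n", (unsigned)(unit & 0xf));
    for (int i = 1; i < 7; i++)
        printf("%s exponent: %d\n", names[i], nibble_exponent(unit, i));

    return 0;
}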

Because bits are expensive and the base units are of course either too big or too small or otherwise not quite right, HID also provides a Unit Exponent item. The Unit Exponent item (a separate item to Unit in the Report Descriptor) then describes the exponent to be applied to the actual value in the report. For example, a Unit Exponent of -3 means 10⁻³ is to be applied to the value. If the report descriptor specifies an item of Unit 0x00F0D121 (i.e. V) and Unit Exponent -3, the value of this item is mV (milliVolt); a Unit Exponent of 3 would be kV (kiloVolt).

Now, in hindsight all this is pretty obvious and maybe even sensible. It'd have been nice if the spec would've explained it a bit clearer but then I would have nothing to write about, so I guess overall I call it a draw.

[1] This whole adventure was started because there's a touchpad out there that measures touch pressure in radians, so at least one other person out there struggled with the docs...
[2] The nibble value is twos complement (i.e. it's a signed 4-bit integer). Values 0x1-0x7 are exponents 1 to 7, values 0x8-0xf are exponents -8 to -1.
[3] English Linear should've trolled everyone and used Centimetres instead of Centimeters in SI Linear.

January 12, 2021

TL;DR: It's now easy to unlock your LUKS2 volume with a FIDO2 security token (e.g. YubiKey or Nitrokey FIDO2). And TPM2 unlocking is easy now too.

Blogging is a lot of work, and a lot less fun than hacking. I mostly focus on the latter because of that, but from time to time I guess stuff is just too interesting to not be blogged about. Hence here, finally, another blog story about exciting new features in systemd.

With the upcoming systemd v248 the systemd-cryptsetup component of systemd (which is responsible for assembling encrypted volumes during boot) gained direct support for unlocking encrypted storage with three types of security hardware:

  1. Unlocking with FIDO2 security tokens (well, at least with those which implement the hmac-secret extension, most do). i.e. your YubiKeys (series 5 and above), or Nitrokey FIDO2 and such.

  2. Unlocking with TPM2 security chips (pretty ubiquitous on non-budget PCs/laptops/…)

  3. Unlocking with PKCS#11 security tokens, i.e. your smartcards and older YubiKeys (the ones that implement PIV). (Strictly speaking this was supported on older systemd already, but was a lot more "manual".)

For completeness' sake, let's keep in mind that the component also allows unlocking with these more traditional mechanisms:

  1. Unlocking interactively with a user-entered passphrase (i.e. the way most people probably already deploy it, supported since about forever)

  2. Unlocking via key file on disk (optionally on removable media plugged in at boot), supported since forever.

  3. Unlocking via a key acquired through trivial AF_UNIX/SOCK_STREAM socket IPC. (Also new in v248)

  4. Unlocking via recovery keys. These are pretty much the same thing as a regular passphrase (and in fact can be entered wherever a passphrase is requested) — the main difference being that they are always generated by the computer, and thus have guaranteed high entropy, typically higher than user-chosen passphrases. They are generated in a way they are easy to type, in many cases even if the local key map is misconfigured. (Also new in v248)

In this blog story, let's focus on the first three items, i.e. those that talk to specific types of hardware for implementing unlocking.

To make working with security tokens and TPM2 easy, a new, small tool was added to the systemd tool set: systemd-cryptenroll. Its only purpose is to make it easy to enroll your security token/chip of choice into an encrypted volume. It works with any LUKS2 volume, and embeds a tiny bit of meta-information into the LUKS2 header with parameters necessary for the unlock operation.

Unlocking with FIDO2

So, let's see how this fits together in the FIDO2 case. Most likely this is what you want to use if you have one of these fancy FIDO2 tokens (which need to implement the hmac-secret extension, as mentioned). Let's say you already have your LUKS2 volume set up, and previously unlocked it with a simple passphrase. Plug in your token, and run:

# systemd-cryptenroll --fido2-device=auto /dev/sda5

(Replace /dev/sda5 with the underlying block device of your volume).

This will enroll the key as an additional way to unlock the volume, and embeds all necessary information for it in the LUKS2 volume header. Before we can unlock the volume with this at boot, we need to allow FIDO2 unlocking via /etc/crypttab. For that, find the right entry for your volume in that file, and edit it like so:

myvolume /dev/sda5 - fido2-device=auto

Replace myvolume and /dev/sda5 with the right volume name, and underlying device of course. Key here is the fido2-device=auto option you need to add to the fourth column in the file. It tells systemd-cryptsetup to use the FIDO2 metadata now embedded in the LUKS2 header, wait for the FIDO2 token to be plugged in at boot (utilizing systemd-udevd, …) and unlock the volume with it.

And that's it already. Easy-peasy, no?

Note that all of this doesn't modify the FIDO2 token itself in any way. Moreover you can enroll the same token in as many volumes as you like. Since all enrollment information is stored in the LUKS2 header (and not on the token) there are no bounds on any of this. (OK, well, admittedly, there's a cap on LUKS2 key slots per volume, i.e. you can't enroll more than a bunch of keys per volume.)

Unlocking with PKCS#11

Let's now have a closer look how the same works with a PKCS#11 compatible security token or smartcard. For this to work, you need a device that can store an RSA key pair. I figure most security tokens/smartcards that implement PIV qualify. How you actually get the keys onto the device might differ though. Here's how you do this for any YubiKey that implements the PIV feature:

# ykman piv reset
# ykman piv generate-key -a RSA2048 9d pubkey.pem
# ykman piv generate-certificate --subject "Knobelei" 9d pubkey.pem
# rm pubkey.pem

(This chain of commands erases what was stored in the PIV feature of your token before, be careful!)

For tokens/smartcards from other vendors a different series of commands might work. Once you have a key pair on it, you can enroll it with a LUKS2 volume like so:

# systemd-cryptenroll --pkcs11-token-uri=auto /dev/sda5

Just like the same command's invocation in the FIDO2 case, this enrolls the security token as an additional way to unlock the volume; any passphrases you already have enrolled remain enrolled.

For the PKCS#11 case you need to edit your /etc/crypttab entry like this:

myvolume /dev/sda5 - pkcs11-uri=auto

If you have a security token that implements both PKCS#11 PIV and FIDO2 I'd probably enroll it as a FIDO2 device, given it's the more contemporary, future-proof standard. Moreover, it requires no special preparation in order to get an RSA key onto the device: FIDO2 keys typically just work.

Unlocking with TPM2

Most modern (non-budget) PC hardware (and other kinds of hardware too) nowadays comes with a TPM2 security chip. In many ways a TPM2 chip is a smartcard that is soldered onto the mainboard of your system. Unlike your usual USB-connected security tokens, you thus cannot remove it from your PC, which means it addresses quite a different security scenario: it isn't really comparable to a physical key you can take with you to unlock some door, but rather a key you leave at the door that refuses to be turned by anyone but you.

Even though this sounds a lot weaker than the FIDO2/PKCS#11 model, TPM2 still brings benefits for securing your systems: because the cryptographic key material stored in TPM2 devices cannot be extracted (at least that's the theory), if you bind your hard disk encryption to it, it means attackers cannot just copy your disk and analyze it offline — they always need access to the TPM2 chip too to have a chance to acquire the necessary cryptographic keys. Thus, they can still steal your whole PC and analyze it, but they cannot just copy the disk without you noticing and analyze the copy.

Moreover, you can bind the ability to unlock the harddisk to specific software versions: for example you could say that only your trusted Fedora Linux can unlock the device, but not any arbitrary OS some hacker might boot from a USB stick they plugged in. Thus, if you trust your OS vendor, you can entrust storage unlocking to the vendor's OS together with your TPM2 device, and thus can be reasonably sure intruders cannot decrypt your data unless they both hack your OS vendor and steal/break your TPM2 chip.

Here's how you enroll your LUKS2 volume with your TPM2 chip:

# systemd-cryptenroll --tpm2-device=auto --tpm2-pcrs=7 /dev/sda5

This looks almost as straightforward as the two earlier systemd-cryptenroll command lines — if it wasn't for the --tpm2-pcrs= part. With that option you can specify to which TPM2 PCRs you want to bind the enrollment. TPM2 PCRs are a set of (typically 24) hash values that every TPM2-equipped system calculates at boot from all the software that is invoked during the boot sequence, in a secure, unfakable way (this is called "measurement"). If you bind unlocking to a specific value of a specific PCR, you thus require that the system follow the same sequence of software at boot to re-acquire the disk encryption key. Sounds complex? Well, that's because it is.
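
To give PCR 7 a bit more context: it covers the Secure Boot policy, which is why it makes a reasonably stable default to bind against. You can also bind to several PCRs at once. As a hedged example (double-check the option syntax against the systemd-cryptenroll man page of your version), binding to both PCR 0 (the core firmware measurements) and PCR 7 would look roughly like this:

# systemd-cryptenroll --tpm2-device=auto --tpm2-pcrs=0+7 /dev/sda5

The tighter you bind, the stronger the guarantee, and also the more likely a perfectly legitimate firmware update will lock you out of the TPM2 key slot. More on that trade-off below.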

For now, let's see how we have to modify your /etc/crypttab to unlock via TPM2:

myvolume /dev/sda5 - tpm2-device=auto

This part is easy again: the tpm2-device= option is what tells systemd-cryptsetup to use the TPM2 metadata from the LUKS2 header and to wait for the TPM2 device to show up.

Bonus: Recovery Key Enrollment

FIDO2, PKCS#11 and TPM2 security tokens and chips pair well with recovery keys: since you don't need to type in your password every day anymore it makes sense to get rid of it, and instead enroll a high-entropy recovery key you then print out or scan off screen and store in a safe, physical location. I.e. forget about good ol' passphrase-based unlocking, go for FIDO2 plus recovery key instead! Here's how you do it:

# systemd-cryptenroll --recovery-key /dev/sda5

This will generate a key, enroll it in the LUKS2 volume, show it to you on screen and generate a QR code you may scan off screen if you like. The key has very high entropy, and can be entered wherever you can enter a passphrase. Because of that you don't have to modify /etc/crypttab to make the recovery key work.

Future

There's still plenty of room for further improvement in all of this. In particular for the TPM2 case: what the text above doesn't really mention is that binding your encrypted volume unlocking to specific software versions (i.e. kernel + initrd + OS versions) actually sucks hard: if you naively update your system to newer versions you might lose access to your TPM2 enrolled keys (which isn't terrible, after all you did enroll a recovery key — right? — which you then can use to regain access). To solve this some more integration with distributions would be necessary: whenever they upgrade the system they'd have to make sure to enroll the TPM2 again — with the PCR hashes matching the new version. And whenever they remove an old version of the system they need to remove the old TPM2 enrollment. Alternatively TPM2 also knows a concept of signed PCR hash values. In this mode the distro could just ship a set of PCR signatures which would unlock the TPM2 keys. (But quite frankly I don't really see the point: whether you drop in a signature file on each system update, or enroll a new set of PCR hashes in the LUKS2 header doesn't make much of a difference). Either way, to make TPM2 enrollment smooth some more integration work with your distribution's system update mechanisms needs to happen. And yes, because of this OS updating complexity the example above — where I referenced your trusty Fedora Linux — doesn't actually work IRL (yet? hopefully…). Nothing updates the enrollment automatically after you initially enrolled it, hence after the first kernel/initrd update you have to manually re-enroll things again, and again, and again … after every update.
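
To make that manual dance concrete, the after-every-update ritual currently looks roughly like this (hedged: --wipe-slot= is how I remember the option being spelled, check the systemd-cryptenroll man page of your version):

# systemd-cryptenroll --wipe-slot=tpm2 --tpm2-device=auto --tpm2-pcrs=7 /dev/sda5

i.e. after booting the updated system, drop the now-stale TPM2 enrollment and create a fresh one against the current PCR values. Rinse and repeat after every update, until the distribution integration described above exists.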

The TPM2 could also be used for other kinds of key policies, which we might look into adding later too. For example, Windows uses TPM2 stuff to allow short (4 digits or so) "PINs" for unlocking the harddisk, i.e. kind of a low-entropy password you type in. The reason this is reasonably safe is that in this case the PIN is passed to the TPM2, which enforces that no more than some limited number of unlock attempts may be made within some time frame, and that after too many attempts the PIN is invalidated altogether. This makes dictionary attacks harder (which would otherwise be easy given the short length of the PINs).

Postscript

(BTW: Yubico sent me two YubiKeys for testing and Nitrokey a Nitrokey FIDO2, thank you! — That's why you see all those references to YubiKey/Nitrokey devices in the text above: it's the hardware I had to test this with. That said, I also tested the FIDO2 stuff with a SoloKey I bought, where it also worked fine. And yes, you!, other vendors!, who might be reading this, please send me your security tokens for free, too, and I might test things with them as well. No promises though. And I am not going to give them back, if you do, sorry. ;-))

January 11, 2021

A Different Strategy

As the merge window for the upcoming Mesa release looms, Erik and I have decided on a new strategy for development: we’re just going to stop merging patches.

At this point in time, we have no regressions as compared to the last release, so we’re just doing a full stop until after the branch point in order to save ourselves time potentially tracking down any issues in further feature additions.

Slowdowns

Some of you may have noticed that zink-wip has yet to update this year. This isn’t due to a lack of work, but rather due to lack of stability. I’ve been tinkering with a new descriptor management infrastructure (yes, I’m back on the horse), and it’s… capable of drawing frames is maybe the best way to describe it. I’ve gone through probably about ten iterations on it so far based on all the ideas I’ve had.

This is hardly an exhaustive list, but here’s some of the ideas that I’ve cycled through:

  • async descriptor updating - It seems like this should be good on paper given that it’s legal to do descriptor updates in threads, but the overhead from signalling the task thread in this case ended up being, on average, about 10-20x the cost of just doing the updating synchronously.

  • all push descriptors all the time - Just for hahas along the way I jammed everything into a pushed descriptor set. Or at least I was going to try. About halfway through, I realized this was way more work to execute than it’d be worth for the hahas considering I wouldn’t ever be able to use this in reality.

  • zero iteration updates - The gist of this idea is that looking at the descriptor updating code, there's a ton of iterating going on. This is an extreme hotpath, so any amount of looping that can be avoided is great, and the underlying Vulkan driver has to iterate the sets anyway, so… Eventually I managed to throw a bunch of memory at the problem and do all the setup during pipeline init, giving me pre-initialized blobs of memory in the form of VkWriteDescriptorSet arrays with the associated sub-types for descriptors. With this in place, naturally I turned to…

  • templates - Descriptor templates are a way of giving the Vulkan driver the raw memory of the descriptor info as a blob and letting it huck that directly into a buffer. Since I already had the memory set up for this, it was an easy swap over, though the gains were less impressive than I'd expected. (For anyone unfamiliar with the API, there's a rough sketch just after this list.)
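
Here's a minimal sketch of what that API looks like (my own illustration, not zink's code; the single-UBO layout and all the names in it are invented for the example). The idea is to describe the memory layout of your descriptor data once up front, then update a whole set from a raw blob with a single call:

#include <stddef.h>
#include <vulkan/vulkan.h>

/* Sketch only: one uniform buffer at binding 0. */
struct ubo_blob {
   VkDescriptorBufferInfo ubo;   /* the raw memory the template reads */
};

static VkDescriptorUpdateTemplate
create_ubo_template(VkDevice device, VkDescriptorSetLayout set_layout)
{
   VkDescriptorUpdateTemplateEntry entry = {
      .dstBinding = 0,
      .dstArrayElement = 0,
      .descriptorCount = 1,
      .descriptorType = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER,
      .offset = offsetof(struct ubo_blob, ubo),
      .stride = sizeof(VkDescriptorBufferInfo),
   };
   VkDescriptorUpdateTemplateCreateInfo info = {
      .sType = VK_STRUCTURE_TYPE_DESCRIPTOR_UPDATE_TEMPLATE_CREATE_INFO,
      .descriptorUpdateEntryCount = 1,
      .pDescriptorUpdateEntries = &entry,
      .templateType = VK_DESCRIPTOR_UPDATE_TEMPLATE_TYPE_DESCRIPTOR_SET,
      .descriptorSetLayout = set_layout,
   };
   VkDescriptorUpdateTemplate template;
   vkCreateDescriptorUpdateTemplate(device, &info, NULL, &template);
   return template;
}

/* Per-draw: fill the blob and write the whole set with one call. */
static void
update_ubo_set(VkDevice device, VkDescriptorSet set,
               VkDescriptorUpdateTemplate template,
               VkBuffer buffer, VkDeviceSize size)
{
   struct ubo_blob blob = {
      .ubo = { .buffer = buffer, .offset = 0, .range = size },
   };
   vkUpdateDescriptorSetWithTemplate(device, set, template, &blob);
}

The per-draw path then stops being a pile of VkWriteDescriptorSet bookkeeping and becomes a single blob copy, which is exactly why it slots so neatly into the pre-initialized memory described above.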

At last I’ve settled on a model of uncached, templated descriptors with an extra push set for handling uniform data for further exploration. Initial results for real world use (e.g., graphical benchmarks) are good, but piglit’s drawoverhead test shows there’s still a lot of work to be done to catch up to caching.

Big thanks to Hans-Kristian Arntzen, aka themaister, aka low-level graphics swashbuckler, for providing insight and consults along the process of this.

January 08, 2021

VkRunner is a Vulkan shader tester based on Piglit’s shader_runner (I already talked about it in my blog). This tool is very helpful for creating simple Vulkan tests without writing hundreds of lines of code. In the Graphics Team at Igalia, we use it extensively to help us with open-source driver development in Mesa, such as the V3D and Turnip drivers.

As a hobby project over the last Christmas holiday season, I wrote the .spec file for VkRunner and uploaded it to Fedora’s Copr and OpenSUSE Build Service (OBS) for generating the respective RPM packages.

This is the first time I have created a package, and thanks to the documentation on how to create RPM packages, the process was simpler than I initially thought. If I find the time to read the Debian New Maintainers’ Guide, I will create a DEB package as well.

Anyway, if you have installed Fedora or OpenSUSE on your computer and you want to try VkRunner, just follow these steps:


  • Fedora:
$ sudo dnf copr enable samuelig/vkrunner
$ sudo dnf install vkrunner


  • OpenSUSE / SLE:
$ sudo zypper addrepo https://download.opensuse.org/repositories/home:samuelig/openSUSE_Leap_15.2/home:samuelig.repo
$ sudo zypper refresh
$ sudo zypper install vkrunner

Enjoy it!

January 07, 2021

Apple’s latest line of Macs includes their in-house “M1” system-on-chip, featuring a custom GPU. This poses a problem for those of us in the Asahi Linux project who wish to run Linux on our devices, as this custom Apple GPU has neither public documentation nor open source drivers. Some speculate it might descend from PowerVR GPUs, as used in older iPhones, while others believe the GPU to be completely custom. But rumours and speculations are no fun when we can peek under the hood ourselves!

A few weeks ago, I purchased a Mac Mini with an M1 GPU as a development target to study the instruction set and command stream, to understand the GPU’s architecture at a level not previously publicly understood, and ultimately to accelerate the development of a Mesa driver for the hardware. Today I’ve reached my first milestone: I now understand enough of the instruction set to disassemble simple shaders with a free and open-source tool chain, released on GitHub here.

The process for decoding the instruction set and command stream of the GPU parallels the same process I used for reverse-engineering Mali GPUs in the Panfrost project, originally pioneered by the Lima, Freedreno, and Nouveau free software driver projects. Typically, for Linux or Android driver reverse-engineering, a small wrapper library will be written to inject into a test application via LD_PRELOAD that hooks key system calls like ioctl and mmap in order to analyze user-kernel interactions. Once the “submit command buffer” call is issued, the library can dump all (mapped) shared memory for offline analysis.
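
As a generic illustration of that technique (a minimal sketch of my own, not the actual tracing library used for this project), the Linux side of such a shim boils down to overriding the libc symbol and forwarding to the real implementation:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdarg.h>
#include <stdio.h>

/* Generic LD_PRELOAD shim: log every ioctl(), then hand it to libc. */
int
ioctl(int fd, unsigned long request, ...)
{
   static int (*real_ioctl)(int, unsigned long, ...);
   if (!real_ioctl)
      real_ioctl = (int (*)(int, unsigned long, ...))dlsym(RTLD_NEXT, "ioctl");

   /* Most ioctls carry a single pointer (or integer) argument. */
   va_list ap;
   va_start(ap, request);
   void *arg = va_arg(ap, void *);
   va_end(ap);

   fprintf(stderr, "ioctl(fd=%d, request=0x%lx, arg=%p)\n", fd, request, arg);
   return real_ioctl(fd, request, arg);
}

/* Build as a shared object and inject it into the target application:
 *
 *    cc -shared -fPIC -o wrap.so wrap.c -ldl
 *    LD_PRELOAD=./wrap.so ./test-app
 */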

The same overall process will work for the M1, but there are some macOSisms that need to be translated. First, there is no LD_PRELOAD on macOS; the equivalent is DYLD_INSERT_LIBRARIES, which has some extra security features which are easy enough to turn off for our purposes. Second, while the standard Linux/BSD system calls do exist on macOS, they are not used for graphics drivers. Instead, Apple’s own IOKit framework is used for both kernel and userspace drivers, with the critical entry point of IOConnectCallMethod, an analogue of ioctl. These differences are easy enough to paper over, but they do add a layer of distance from the standard Linux tooling.

The bigger issue is orienting ourselves in the IOKit world. Since Linux is under a copyleft license, (legal) kernel drivers are open source, so the ioctl interface is public, albeit vendor-specific. macOS’s kernel (XNU) being under a permissive license brings no such obligations; the kernel interface is proprietary and undocumented. Even after wrapping IOConnectCallMethod, it took some elbow grease to identify the three critical calls: memory allocation, command buffer creation, and command buffer submission. Wrapping the allocation and creation calls is essential for tracking GPU-visible memory (what we are interested in studying), and wrapping the submission call is essential for timing the memory dump.

With those obstacles cleared, we can finally get to the shader binaries, black boxes in themselves. However, the process from here on out is standard: start with the simplest fragment or compute shader possible, make a small change in the input source code, and compare the output binaries. Iterating on this process is tedious but will quickly reveal key structures, including opcode numbers.

The findings of the process documented in the free software disassembler confirm a number of traits of the GPU:

One, the architecture is scalar. Unlike some GPUs that are scalar for 32-bits but vectorized for 16-bits, the M1’s GPU is scalar at all bit sizes. Yet Metal optimization resources imply 16-bit arithmetic should be significantly faster, in addition to a reduction of register usage leading to higher thread count (occupancy). This suggests the hardware is superscalar, with more 16-bit ALUs than 32-bit ALUs, allowing the part to benefit from low-precision graphics shaders much more than competing chips can, while removing a great deal of complexity from the compiler.

Two, the architecture seems to handle scheduling in hardware, common among desktop GPUs but less so in the embedded space. This again makes the compiler simpler at the expense of more hardware. Instructions seem to have minimal encoding overhead, unlike other architectures which need to pad out instructions with nop’s to accommodate highly constrained instruction sets.

Three, various modifiers are supported. Floating-point ALUs can do clamps (saturate), negates, and absolute value modifiers “for free”, a common shader architecture trait. Further, most (all?) instructions can type-convert between 16-bit and 32-bit “for free” on both the destination and the sources, which allows the compiler to be much more aggressive about using 16-bit operations without risking conversion overheads. On the integer side, various bitwise complements and shifts are allowed on certain instructions for free. None of this is unique to Apple’s design, but it’s worth noting all the same.

Finally, not all ALU instructions have the same timing. Instructions like imad, used to multiply two integers and add a third, are avoided in favour of repeated iadd integer addition instructions where possible. This also suggests a superscalar architecture; software-scheduled designs like those I work on for my day job cannot exploit differences in pipeline length, inadvertently slowing down simple instructions to match the speed of complex ones.

From my prior experience working with GPUs, I expect to find some eldritch horror waiting in the instruction set, to balloon compiler complexity. Though the above work currently covers only a small surface area of the instruction set, so far everything seems sound. There are no convoluted optimization tricks, but doing away with the trickery creates a streamlined, efficient design that does one thing and does it well. Maybe Apple’s hardware engineers discovered it’s hard to beat simplicity.

Alas, a shader tool chain isn’t much use without an open source userspace driver. Next up: dissecting the command stream!

Disclaimer: This work is a hobby project conducted based on public information. Opinions expressed may not reflect those of my employer.

January 05, 2021

Some People Will Appreciate This

[Image: border_colors.png]

Also zink hit GL 4.1 today.

January 04, 2021

It Happens

As long-time readers of the blog know, SGC is a safe space where making mistakes is not only accepted, it’s a way of life. So it is once again that I need to amend statements previously made regarding Xorg synchronization after Michel Dänzer, also known for anchoring the award-winning series Why Is My MR Failing CI Today?, pointed out that while I was indeed addressing the correct problem, I was addressing it from the wrong side.

Looking Closer

The issue here is that WSI synchronizes with the display server using a file descriptor for the swapchain image that the Vulkan driver manages. But what if the Vulkan driver never configures itself to be used for WSI (genuine or faked) in the first place?

Yes, this indeed appeared to be the true problem. Iago Toral Quiroga added handling for this specific to the V3DV driver back in October, and it’s the same mechanism: setting up a Mesa-internal struct during resource initialization.

So I extended this to the ANV codepath and…

And obviously it didn’t work.

But why was this the case?

A script-based git blame revealed that ANV has a different handling for implicit sync than other Vulkan drivers. After a well-hidden patch, ANV relies entirely on a struct attached to VkSubmitInfo which contains the swapchain image’s memory pointer in order to handle implicit sync. Thus by attaching a wsi_memory_signal_submit_info struct, everything was resolved.
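
For the curious, the fix boils down to chaining that struct into the queue submission. The sketch below is my reconstruction from memory, not a paste from the driver; in particular the sType enum name is a best guess at Mesa's internal naming, so don't copy it verbatim:

   struct wsi_memory_signal_submit_info wsi_info = {
      .sType = VK_STRUCTURE_TYPE_WSI_MEMORY_SIGNAL_SUBMIT_INFO_MESA, /* name from memory */
      .pNext = NULL,
      .memory = res->obj->mem, /* the swapchain image's memory */
   };
   VkSubmitInfo si = {
      .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
      .pNext = &wsi_info,
      /* command buffers and wait/signal semaphores as usual */
   };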

Is it a great fix? No. Does it work? Yes.

Questions

If ANV wasn’t configuring itself to handle implicit sync, why was poll() working?

Luck.

Why does RADV work without any of this?

Also probably luck.

January 01, 2021

Reviving Very Old X Code

I've taken the week between Christmas and New Year's off this year. I didn't really have anything serious planned, just taking a break from the usual routine. As often happens, I got sucked into doing a project when I received this simple bug report, Debian Bug #974011:

I have been researching old terminal and X games recently, and realized
that much of the code from 'xmille' originated from the terminal game
'mille', which is part of bsdgames.

...

[The copyright and license information] has been stripped out of all
code in the xmille distribution.  Also, none of the included materials
give credit to the original author, Ken Arnold.

The reason the 'xmille' source is missing copyright and license information from the 'mille' files is that they were copied in before that information was added upstream. Xmille forked from Mille around 1987 or so. I wrote the UI parts for the system I had at the time, which was running X10R4. A very basic port to X11 was done at some point, and that's what Debian has in the archive today.

At some point in the 90s, I ported Xmille to the Athena widget set, including several custom widgets in an Xaw extension library, Xkw. It's a lot better than the version in Debian, including displaying the cards correctly (the Debian version has some pretty bad color issues).

Here's what the current Debian version looks like:

Fixing The Bug

To fix the missing copyright and license information, I imported the mille source code into the "latest" Xaw-based version. The updated mille code had a number of bug fixes and improvements, along with the copyright information.

That should have been sufficient to resolve the issue and I could have constructed a suitable source package from whatever bits were needed and uploaded that as a replacement 'xmille' package.

However, at some later point, I had actually merged xmille into a larger package, 'kgames', which also included a number of other games, including Reversi, Dominoes, Cribbage and ten Solitaire/Patience variants. (as an aside, those last ten games formed the basis for my Patience Palm Pilot application, which seems to have inspired an Android App of the same name...)

So began my yak shaving holiday.

Building Kgames in 2020

Ok, so getting this old source code running should be easy, right? It's just a bunch of C code designed in the 80s and 90s to work on VAXen and their kin. How hard could it be?

  1. Everything was a 32-bit computer back then; pointers and ints were both 32 bits, so you could cast them with wild abandon and cause no problems. Today, testing revealed segfaults in some corners of the code.

  2. It's K&R C code. Remember that the first version of ANSI-C didn't come out until 1989, and it was years later that we could reliably expect to find an ANSI compiler on a random Unix box.

  3. It's X11 code. Fortunately (?), X11 hasn't changed since these applications were written, so at least that part still works just fine. Imagine trying to build Windows or Mac OS code from the early 90's on a modern OS...

I decided to dig in and add prototypes everywhere; that found a lot of pointer/int casting issues, as well as several lurking bugs where the code was just plain broken.
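
To illustrate the kind of thing that turns up (an invented example, not actual kgames code):

/* Invented example of the K&R-era failure mode, not actual kgames code. */
#include <stdio.h>

/* K&R-style definition: parameter types live in a separate declaration
 * list, and callers that never see a prototype assume an int return. */
char *
card_name(suit, rank)
    int suit, rank;
{
    static char name[3];
    name[0] = "CDHS"[suit];    /* simplified: assumes 0 <= suit <= 3 */
    name[1] = '0' + rank;      /* simplified: assumes 0 <= rank <= 9 */
    name[2] = '\0';
    return name;
}

/*
 * In another file, 1980s code would happily call this with no declaration
 * in scope:
 *
 *     printf("%s\n", card_name(s, r));
 *
 * The compiler assumes an int return value; truncating the pointer to 32
 * bits was harmless on a VAX but segfaults on a 64-bit machine. Adding an
 * ANSI prototype to a shared header,
 *
 *     char *card_name(int suit, int rank);
 *
 * lets the compiler flag every such call site.
 */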

After a day or so, I had things building and running and was no longer hitting crashes.

Kgames 1.0 uploaded to Debian New Queue

With that done, I decided I could at least upload the working bits to the Debian archive and close the bug reported above. kgames 1.0-2 may eventually get into unstable, presumably once the Debian FTP team realizes just how important fixing this bug is. Or something.

Here's what xmille looks like in this version:

And here's my favorite solitaire variant too:

But They Look So Old

Yeah, Xaw applications have a rustic appearance which may appeal to some, but for people with higher resolution monitors and “well seasoned” eyesight, squinting at the tiny images and text makes it difficult to enjoy these games today.

How hard could it be to update them to use larger cards and scalable fonts?

Xkw version 2.0

I decided to dig in and start hacking the code, starting by adding new widgets to the Xkw library that used cairo for drawing instead of core X calls. Fortunately, the needs of the games were pretty limited, so I only needed to implement a handful of widgets:

  • KLabel. Shows a text string. It allows the string to be left, center or right justified. And that's about it.

  • KCommand. A push button, which uses KLabel for the underlying presentation.

  • KToggle. A push-on/push-off button, which uses KCommand for most of the implementation. Also supports 'radio groups' where pushing one on makes the others in the group turn off.

  • KMenuButton. A button for bringing up a menu widget; this is some pretty simple behavior built on top of KCommand.

  • KSimpleMenu, KSmeBSB, KSmeLine. These three create pop-up menus; KSimpleMenu creates a container which can hold any number of KSmeBSB (string) and KSmeLine (separator line) objects.

  • KTextLine. A single line text entry widget.

The other Xkw widgets all got their rendering switched to using cairo, plus using double buffering to make updates look better.

SVG Playing Cards

Looking on wikimedia, I found a page referencing a large number of playing cards in SVG form. That led me to Adrian Kennard's playing card web site that let me customize and download a deck of cards, licensed using the CC0 Public Domain license.

With these cards, I set about rewriting the Xkw playing card widget, stripping out three different versions of bitmap playing cards and replacing them with just these new SVG versions.

SVG Xmille Cards

Ok, so getting regular playing cards was good, but the original goal was to update Xmille, and that has cards hand drawn by me. I could just use those images, import them into cairo and let it scale them to suit on the screen. I decided to experiment with inkscape's bitmap tracing code to see what it could do with them.

First, I had to get them into a format that inkscape could parse. That turned out to be a bit tricky; the original format is a set of X bitmap layers, each layer painting a single color. I ended up hacking the Xmille source code to generate the images using X, then fetching them with XGetImage and walking them to construct XPM format files which could then be fed into the portable bitmap tools to create PNG files that inkscape could handle.
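
The fetch-and-walk step looks roughly like the sketch below (a reconstruction for illustration, not the actual hacked-up kgames code); the XPM and PNG conversion then happens in external tools:

#include <stdio.h>
#include <X11/Xlib.h>

/* Rough sketch: pull a rendered card back from the server as a ZPixmap
 * and visit every pixel so it can be re-emitted as XPM text. */
static void
dump_card_pixels(Display *dpy, Pixmap card, unsigned width, unsigned height)
{
    XImage *img = XGetImage(dpy, card, 0, 0, width, height,
                            AllPlanes, ZPixmap);
    for (unsigned y = 0; y < height; y++) {
        for (unsigned x = 0; x < width; x++) {
            unsigned long pixel = XGetPixel(img, x, y);
            /* map 'pixel' to an XPM color character and emit it;
             * a two-color mapping stands in for the real palette here */
            putchar(pixel ? '#' : '.');
        }
        putchar('\n');
    }
    XDestroyImage(img);
}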

The resulting images have a certain charm:

I did replace the text in the images to make it readable, otherwise these are untouched from what inkscape generated.

The Results

Remember that all of these are applications built using the venerable X toolkit; there are still some non-antialiased graphics visible as the shaped buttons use the X Shape extension. But, all rendering is now done with cairo, so it's all anti-aliased and all scalable.

Here's what Xmille looks like after the upgrades:

And here's spider:

Once kgames 1.0 reaches Debian unstable, I'll upload these new versions.

December 30, 2020

A Different Sort Of Optimization

There’s a number of strange hacks in zink that provide compatibility for some of the layers in mesa. One of these hacks is the NIR pass used for managing non-constant UBO/SSBO array indexing, made necessary because SPIRV operates by directly accessing variables, and so it’s impossible to have a non-constant index because then when generating the SPIRV there’s no way to know which variable is being accessed.

In its current state from zink-wip it looks like this:

static nir_ssa_def *
recursive_generate_bo_ssa_def(nir_builder *b, nir_intrinsic_instr *instr, nir_ssa_def *index, unsigned start, unsigned end)
{
   if (start == end - 1) {
      /* block index src is 1 for this op */
      unsigned block_idx = instr->intrinsic == nir_intrinsic_store_ssbo;
      nir_intrinsic_instr *new_instr = nir_intrinsic_instr_create(b->shader, instr->intrinsic);
      new_instr->src[block_idx] = nir_src_for_ssa(nir_imm_int(b, start));
      for (unsigned i = 0; i < nir_intrinsic_infos[instr->intrinsic].num_srcs; i++) {
         if (i != block_idx)
            nir_src_copy(&new_instr->src[i], &instr->src[i], &new_instr->instr);
      }
      if (instr->intrinsic != nir_intrinsic_load_ubo_vec4) {
         nir_intrinsic_set_align(new_instr, nir_intrinsic_align_mul(instr), nir_intrinsic_align_offset(instr));
         if (instr->intrinsic != nir_intrinsic_load_ssbo)
            nir_intrinsic_set_range(new_instr, nir_intrinsic_range(instr));
      }
      new_instr->num_components = instr->num_components;
      if (instr->intrinsic != nir_intrinsic_store_ssbo)
         nir_ssa_dest_init(&new_instr->instr, &new_instr->dest,
                           nir_dest_num_components(instr->dest),
                           nir_dest_bit_size(instr->dest), NULL);
      nir_builder_instr_insert(b, &new_instr->instr);
      return &new_instr->dest.ssa;
   }

   unsigned mid = start + (end - start) / 2;
   return nir_build_alu(b, nir_op_bcsel, nir_build_alu(b, nir_op_ilt, index, nir_imm_int(b, mid), NULL, NULL),
      recursive_generate_bo_ssa_def(b, instr, index, start, mid),
      recursive_generate_bo_ssa_def(b, instr, index, mid, end),
      NULL
   );
}

static bool
lower_dynamic_bo_access_instr(nir_intrinsic_instr *instr, nir_builder *b)
{
   if (instr->intrinsic != nir_intrinsic_load_ubo &&
       instr->intrinsic != nir_intrinsic_load_ubo_vec4 &&
       instr->intrinsic != nir_intrinsic_get_ssbo_size &&
       instr->intrinsic != nir_intrinsic_load_ssbo &&
       instr->intrinsic != nir_intrinsic_store_ssbo)
      return false;
   /* block index src is 1 for this op */
   unsigned block_idx = instr->intrinsic == nir_intrinsic_store_ssbo;
   if (nir_src_is_const(instr->src[block_idx]))
      return false;
   b->cursor = nir_after_instr(&instr->instr);
   bool ssbo_mode = instr->intrinsic != nir_intrinsic_load_ubo && instr->intrinsic != nir_intrinsic_load_ubo_vec4;
   unsigned first_idx = 0, last_idx;
   if (ssbo_mode) {
      last_idx = first_idx + b->shader->info.num_ssbos;
   } else {
      /* skip 0 index if uniform_0 is one we created previously */
      first_idx = !b->shader->info.first_ubo_is_default_ubo;
      last_idx = first_idx + b->shader->info.num_ubos;
   }

   /* now create the composite dest with a bcsel chain based on the original value */
   nir_ssa_def *new_dest = recursive_generate_bo_ssa_def(b, instr,
                                                       instr->src[block_idx].ssa,
                                                       first_idx, last_idx);

   if (instr->intrinsic != nir_intrinsic_store_ssbo)
      /* now use the composite dest in all cases where the original dest (from the dynamic index)
       * was used and remove the dynamically-indexed load_*bo instruction
       */
      nir_ssa_def_rewrite_uses_after(&instr->dest.ssa, nir_src_for_ssa(new_dest), &instr->instr);

   nir_instr_remove(&instr->instr);
   return true;
}

In brief, lower_dynamic_bo_access_instr() is used to detect UBO/SSBO instructions with a non-constant index, e.g., array_of_ubos[n] where n is a uniform. Following this, recursive_generate_bo_ssa_def() generates a chain of bcsel instructions which checks the non-constant array index against constant values and then, upon matching, uses the value loaded from that UBO.

Without going into more depth about the exact mechanics of this pass for the sake of time, I’ll instead provide a better explanation by example. Here’s a stripped down version of one of the simplest piglit shader tests for non-constant uniform indexing (fs-array-nonconst):

[require]
GLSL >= 1.50
GL_ARB_gpu_shader5

[vertex shader passthrough]

[fragment shader]
#version 150
#extension GL_ARB_gpu_shader5: require

uniform block {
	vec4 color[2];
} arr[4];

uniform int n;
uniform int m;

out vec4 color;

void main()
{
	color = arr[n].color[m];
}

[test]
clear color 0.2 0.2 0.2 0.2
clear

ubo array index 0
uniform vec4 block.color[0] 0.0 1.0 1.0 0.0
uniform vec4 block.color[1] 1.0 0.0 0.0 0.0

uniform int n 0
uniform int m 1
draw rect -1 -1 1 1

relative probe rect rgb (0.0, 0.0, 0.5, 0.5) (1.0, 0.0, 0.0)

Using two uniforms, a color is indexed from a UBO as the FS output.

In the currently shipping version of zink, the final NIR output from ANV of the fragment shader might look something like this:

shader: MESA_SHADER_FRAGMENT
inputs: 0
outputs: 0
uniforms: 8
ubos: 5
shared: 0
decl_var shader_out INTERP_MODE_NONE vec4 color (FRAG_RESULT_DATA0.xyzw, 8, 0)
decl_function main (0 params)

impl main {
	block block_0:
	/* preds: */
	vec1 32 ssa_0 = load_const (0x00000002 /* 0.000000 */)
	vec1 32 ssa_1 = load_const (0x00000001 /* 0.000000 */)
	vec1 32 ssa_2 = load_const (0x00000004 /* 0.000000 */)
	vec1 32 ssa_3 = load_const (0x00000003 /* 0.000000 */)
	vec1 32 ssa_4 = load_const (0x00000010 /* 0.000000 */)
	vec1 32 ssa_5 = intrinsic load_ubo (ssa_1, ssa_4) (0, 1073741824, 16, 0, -1) /* access=0 */ /* align_mul=1073741824 */ /* align_offset=16 */ /* range_base=0 */ /* range=-1 */
	vec1 32 ssa_6 = load_const (0x00000000 /* 0.000000 */)
	vec1 32 ssa_7 = intrinsic load_ubo (ssa_1, ssa_6) (0, 1073741824, 0, 0, -1) /* access=0 */ /* align_mul=1073741824 */ /* align_offset=0 */ /* range_base=0 */ /* range=-1 */
	vec1 32 ssa_8 = umin ssa_7, ssa_3
	vec1 32 ssa_9 = ishl ssa_5, ssa_2
	vec1 32 ssa_10 = iadd ssa_8, ssa_1
	vec1 32 ssa_11 = load_const (0xfffffffc /* -nan */)
	vec1 32 ssa_12 = iand ssa_9, ssa_11
	vec1 32 ssa_13 = load_const (0x00000005 /* 0.000000 */)
	vec4 32 ssa_14 = intrinsic load_ubo (ssa_13, ssa_12) (0, 4, 0, 0, -1) /* access=0 */ /* align_mul=4 */ /* align_offset=0 */ /* range_base=0 */ /* range=-1 */
	vec4 32 ssa_23 = intrinsic load_ubo (ssa_2, ssa_12) (0, 4, 0, 0, -1) /* access=0 */ /* align_mul=4 */ /* align_offset=0 */ /* range_base=0 */ /* range=-1 */
	vec1 32 ssa_27 = ilt32 ssa_10, ssa_2
	vec1 32 ssa_28 = b32csel ssa_27, ssa_23.x, ssa_14.x
	vec1 32 ssa_29 = b32csel ssa_27, ssa_23.y, ssa_14.y
	vec1 32 ssa_30 = b32csel ssa_27, ssa_23.z, ssa_14.z
	vec1 32 ssa_31 = b32csel ssa_27, ssa_23.w, ssa_14.w
	vec4 32 ssa_32 = intrinsic load_ubo (ssa_3, ssa_12) (0, 4, 0, 0, -1) /* access=0 */ /* align_mul=4 */ /* align_offset=0 */ /* range_base=0 */ /* range=-1 */
	vec1 32 ssa_36 = ilt32 ssa_10, ssa_3
	vec1 32 ssa_37 = b32csel ssa_36, ssa_32.x, ssa_28
	vec1 32 ssa_38 = b32csel ssa_36, ssa_32.y, ssa_29
	vec1 32 ssa_39 = b32csel ssa_36, ssa_32.z, ssa_30
	vec1 32 ssa_40 = b32csel ssa_36, ssa_32.w, ssa_31
	vec4 32 ssa_41 = intrinsic load_ubo (ssa_0, ssa_12) (0, 4, 0, 0, -1) /* access=0 */ /* align_mul=4 */ /* align_offset=0 */ /* range_base=0 */ /* range=-1 */
	vec4 32 ssa_45 = intrinsic load_ubo (ssa_1, ssa_12) (0, 4, 0, 0, -1) /* access=0 */ /* align_mul=4 */ /* align_offset=0 */ /* range_base=0 */ /* range=-1 */
	vec1 32 ssa_49 = ilt32 ssa_10, ssa_1
	vec1 32 ssa_50 = b32csel ssa_49, ssa_45.x, ssa_41.x
	vec1 32 ssa_51 = b32csel ssa_49, ssa_45.y, ssa_41.y
	vec1 32 ssa_52 = b32csel ssa_49, ssa_45.z, ssa_41.z
	vec1 32 ssa_53 = b32csel ssa_49, ssa_45.w, ssa_41.w
	vec1 32 ssa_54 = ilt32 ssa_10, ssa_0
	vec1 32 ssa_55 = b32csel ssa_54, ssa_50, ssa_37
	vec1 32 ssa_56 = b32csel ssa_54, ssa_51, ssa_38
	vec1 32 ssa_57 = b32csel ssa_54, ssa_52, ssa_39
	vec1 32 ssa_58 = b32csel ssa_54, ssa_53, ssa_40
	vec4 32 ssa_59 = vec4 ssa_55, ssa_56, ssa_57, ssa_58
	intrinsic store_output (ssa_59, ssa_6) (8, 15, 0, 160, 132) /* base=8 */ /* wrmask=xyzw */ /* component=0 */ /* src_type=float32 */ /* location=4 slots=1 */	/* color */
	/* succs: block_1 */
	block block_1:
}

All the b32csel ops are generated by the above NIR pass, with each one “checking” a non-constant index against a constant value. At the end of the shader, the store_output uses the correct values, but this is pretty gross.

And Then Inlining

Some time ago, noted Gallium professor Marek Olšák authored a series which provided a codepath for inlining uniform data directly into shaders. The process for this is two steps:

  • Detect and designate uniforms to be inlined
  • Replace shader loads of these uniforms with the actual uniforms

The purpose of this is specifically to eliminate complex conditionals resulting from uniform data, so the detection NIR pass specifically looks for conditionals which use only constants and uniform data as the sources. Something like if (uniform_variable_expression) then becomes if (constant_value_expression) which can then be optimized out, greatly simplifying the eventual shader instructions.

Looking at the above NIR, this seems like a good target for inlining as well, so I took my hatchet to the detection pass and added in support for the bcsel and fcsel ALU ops when their result sources were the results of intrinsics, e.g., loads. The results are good to say the least:

shader: MESA_SHADER_FRAGMENT
inputs: 0
outputs: 0
uniforms: 8
ubos: 5
shared: 0
decl_var shader_out INTERP_MODE_NONE vec4 color (FRAG_RESULT_DATA0.xyzw, 8, 0)
decl_function main (0 params)

impl main {
	block block_0:
	/* preds: */
	vec1 32 ssa_0 = load_const (0x00000001 /* 0.000000 */)
	vec1 32 ssa_1 = load_const (0x00000004 /* 0.000000 */)
	vec1 32 ssa_2 = load_const (0x00000010 /* 0.000000 */)
	vec1 32 ssa_3 = intrinsic load_ubo (ssa_0, ssa_2) (0, 1073741824, 16, 0, -1) /* access=0 */ /* align_mul=1073741824 */ /* align_offset=16 */ /* range_base=0 */ /* range=-1 */
	vec1 32 ssa_4 = ishl ssa_3, ssa_1
	vec1 32 ssa_5 = load_const (0x00000002 /* 0.000000 */)
	vec1 32 ssa_6 = load_const (0xfffffffc /* -nan */)
	vec1 32 ssa_7 = iand ssa_4, ssa_6
	vec1 32 ssa_8 = load_const (0x00000000 /* 0.000000 */)
	vec4 32 ssa_9 = intrinsic load_ubo (ssa_5, ssa_7) (0, 4, 0, 0, -1) /* access=0 */ /* align_mul=4 */ /* align_offset=0 */ /* range_base=0 */ /* range=-1 */
	intrinsic store_output (ssa_9, ssa_8) (8, 15, 0, 160, 132) /* base=8 */ /* wrmask=xyzw */ /* component=0 */ /* src_type=float32 */ /* location=4 slots=1 */	/* color */
	/* succs: block_1 */
	block block_1:
}

The second load_ubo here is using the inlined uniform data to determine that it needs to load the 0 index, greatly reducing the shader’s complexity.

This still needs a bit of tuning, but I’m hoping to get it finalized soonish.

December 29, 2020

A New Sync

For some time now I’ve been talking about zink’s lack of WSI and the forthcoming, near-messianic work by FPS sherpa Adam Jackson to implement it.

This is an extremely challenging project, however, and some work needs to be done in the meanwhile to ensure that zink doesn’t drive off a cliff.

Swapchain Strikes Back

Any swapchain master is already well acquainted with the mechanism by which images are displayed on the screen, but the gist of it for anyone unfamiliar is that there’s N image resources that are swapped back and forth (2 for double-buffered, 3 for triple-buffered, …). An image being rendered to is a backbuffer, and an image being displayed is a frontbuffer.

Ideally, a frontbuffer shouldn’t be drawn to while it’s in the process of being presented since such an action obliterates the app’s usefulness. The knowledge of exactly when a resource is done presenting is gained through WSI. On Xorg, however, it’s a bit tricky, to say the least. DRI3 is intended to address the underlying problems there with the XPresent extension, and the Mesa DRI frontend utilizes this to determine when an image is safe to use.

All this is great, and I’m sure it works terrifically in other cases, but zink is not like other cases. Zink lacks direct WSI integration. Under Xorg, this means it relies entirely on the DRI frontend to determine when it’s safe to start rendering onto an image resource.

But what if the DRI frontend gets it wrong?

Indeed, due to quirks in the protocol/xserver, XPresent idle events can be received for a “presented” image immediately, even if it’s still in use and has not finished presenting.

[Image: scumbag-xorg.png]

In apps like SuperTuxKart, this results in insane flickering due to always rendering over the current frame before it’s finished being presented.

Return Of The Poll

To solve this problem, a wise, reclusive ghostwriter took time off from being at his local pub to offer me a suggestion:

Why not just rip the implicit fence of the DMAbuf out of the image object?

It was a great idea. But what did this pub enthusiast mean?

In short, WSI handles this problem by internally poll()ing on the image resource’s underlying file descriptor. When there’s no more events to poll() for, the image is safe to write on.

So now it’s back to the (semi) basics of programming. First, get the file descriptor of the image using normal Vulkan function calls:

static int
get_resource_fd(struct zink_screen *screen, struct zink_resource *res)
{
   VkMemoryGetFdInfoKHR fd_info = {};
   int fd;
   fd_info.sType = VK_STRUCTURE_TYPE_MEMORY_GET_FD_INFO_KHR;
   fd_info.memory = res->obj->mem;
   fd_info.handleType = VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD_BIT;
   VkResult result = (*screen->vk_GetMemoryFdKHR)(screen->dev, &fd_info, &fd);
   return result == VK_SUCCESS ? fd : -1;
}

This provides a file descriptor that can be used for more nefarious purposes. Any time the gallium pipe_context::flush hook is called, the flushed resource (swapchain image) must be synchronized by poll()ing as in this snippet:

static void
zink_flush(struct pipe_context *pctx,
           struct pipe_fence_handle **pfence,
           enum pipe_flush_flags flags)
{
   struct zink_context *ctx = zink_context(pctx);

   if (flags & PIPE_FLUSH_END_OF_FRAME && ctx->flush_res) {
      if (ctx->flush_res->obj->fd != -1) {
          /* FIXME: remove this garbage once we get wsi */
          struct pollfd p = {};
          p.fd = ctx->flush_res->obj->fd;
          p.events = POLLOUT;
          assert(poll(&p, 1, -1) == 1);
          assert(p.revents & POLLOUT);
      }
      ctx->flush_res = NULL;
   }

The POLLOUT event flag is used to determine when it’s safe to write. If there’s no pending usage during present then this will return immediately, otherwise it will wait until the image is safe to use.

hacks++.