planet.freedesktop.org
August 13, 2022

Hi all!

This month I’ve been pondering offline-first apps. The online aspect of modern apps is an important feature for many use-cases: it enables collaboration between multiple people and seamless transition between devices (e.g. I often switch between my personal workstation, my laptop, and my phone). However, many modern apps come with a cost: oftentimes they only work with a fixed proprietary server, and only work online. I think that for many use-cases, allowing users to pick their own open-source server instance and designing offline-friendly apps is a good compromise between freedom and ease-of-use/simplicity. That’s not to say that peer-to-peer or fully distributed apps are always a bad choice, but they come at a significantly higher complexity cost, which makes them more annoying to both build and use.

The main hurdle when writing an offline-first app is synchronization. All devices must have a local copy of the database for offline use, and they need to push changes to the server when the device comes online. Of course, it’s perfectly possible that changes were made on multiple devices while offline, so some kind of conflict resolution is necessary. Instead of presenting an “Oops, we’ve got a conflict, which version would you like to keep?” dialog to the user, it’d be much nicer to just Do The Right Thing™. CRDTs are a solution to that problem. They look a bit scary at first because of all of the obscure naming (PN-Counter? LWW-Element-Set? anyone?) and intimidating theory in papers. However I like to think of CRDTs as “use this one easy trick to make synchronization work well”, and not some kind of complicated abstract machinery. In other words, by following some simple rules, it’s not too difficult to write well-behaved synchronization logic.
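
To give a taste of what “simple rules” means in practice, here’s a tiny last-writer-wins register sketched out in C (my own illustration, not code from seda or any other project): merging is commutative, associative and idempotent, so replicas can exchange state in any order, any number of times, and still converge on the same value.

#include <stdint.h>
#include <string.h>

/* A last-writer-wins (LWW) register: every write is tagged with a
 * timestamp plus a replica ID for tie-breaking. */
struct lww_register {
	char value[64];
	uint64_t timestamp;
	uint32_t replica_id;
};

static void lww_write(struct lww_register *r, const char *value,
		      uint64_t now, uint32_t replica_id)
{
	strncpy(r->value, value, sizeof(r->value) - 1);
	r->value[sizeof(r->value) - 1] = '\0';
	r->timestamp = now;
	r->replica_id = replica_id;
}

/* Fold remote state into the local replica: simply keep the newest write. */
static void lww_merge(struct lww_register *local,
		      const struct lww_register *remote)
{
	if (remote->timestamp > local->timestamp ||
	    (remote->timestamp == local->timestamp &&
	     remote->replica_id > local->replica_id))
		*local = *remote;
}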

So, long story short, I’ve been experimenting with CRDTs this month. To get some hands-on experience, I’ve started working on a small hacky group expense tracking app, seda. I got the idea for this NPotM after realizing that there’s no good open-source, user-friendly, collaborative, offline-capable (!) alternative yet. That said, it’s just a toy for now, nothing serious yet. If you want to play with it, you can have a look at the demo (feel free to toggle offline mode in the dev tools, then make some changes, then go online). There’s still a lot to be done: in particular, things get a bit hairy when one device deletes a participant and another creates a transaction with that participant at the same time. I plan to write some docs and maybe a blog post about my findings.

I’ve released two new versions of software I maintain. soju 0.5.0 adds support for push notifications, a new IRC extension to search the chat history, support for more IRCv3 extensions (some of which originate from soju itself), and many other enhancements. hut 0.2.0 adds numerous new commands and export functionality.

In graphics news, I’ve been working on small tasks in various projects. As part of my Valve contract, I’ve been investigating some DisplayPort MST issues in the core kernel DRM code, and I’ve introduced a function to unset layer properties in libliftoff. With the help of Peter Hutterer we’ve introduced a new X11 XWAYLAND extension as a more reliable way for clients to figure out whether they’re running under Xwayland. Lastly, I’ve continued ticking more boxes in libdisplay-info’s TODO list. We aren’t too far from having a complete EDID parser, but there are still so many extension blocks to add support for, and a whole new high-level API to design.

See you next month!

August 11, 2022

As of xorgproto 2022.2, we have a new X11 protocol extension. First, you may rightly say "whaaaat? why add new extensions to the X protocol?" in a rather unnecessarily accusing way, followed up by "that's like adding lipstick to a dodo!". And that's not completely wrong, but nevertheless, we have a new protocol extension to the ... [checks calendar] almost 40 year old X protocol. And that extension is, ever creatively, named "XWAYLAND".

If you recall, Xwayland is a different X server than Xorg. It doesn’t try to render directly to the hardware; instead it’s a translation layer between the X protocol and the Wayland protocol so that X clients can continue to function on a Wayland compositor. The X application is generally unaware that it isn’t running on Xorg, and Xwayland (and the compositor) will do their best to accommodate all the quirks that the application expects because it only speaks X. In a way, it’s like calling a restaurant and ordering a burger because the person answering speaks American English. Without realising that you just called the local fancy French joint and now the chefs will have to make a burger for you, totally without avec.

Anyway, sometimes it is necessary for a client (or a user) to know whether the X server is indeed Xwayland. Previously, this was done through heuristics: the xisxwayland tool checks for XRandR properties, the xinput tool checks for input device names, and so on. These heuristics are just that, though, so they can become unreliable as Xwayland gets closer to emulating Xorg or things just change. And properties in general are problematic since they could be set by other clients. To solve this, we now have a new extension.

The XWAYLAND extension doesn’t actually do anything, it’s the bare minimum required for an extension. It just needs to exist, and clients only need to XQueryExtension for it or check for it in XListExtensions (the equivalent of xdpyinfo | grep XWAYLAND). Hence, no dedicated Xlib or libxcb API support is planned. So of all the nightmares you’ve had in the last 2 years, the one of misidentifying Xwayland will soon be in the past.
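
For the record, checking for it from a client is a handful of lines of plain Xlib, something like this (quick untested sketch):

#include <stdio.h>
#include <X11/Xlib.h>

int main(void)
{
	Display *dpy = XOpenDisplay(NULL);
	int opcode, event, error;

	if (!dpy)
		return 1;

	/* The extension has no requests, events or errors; its mere
	 * presence is the answer. */
	if (XQueryExtension(dpy, "XWAYLAND", &opcode, &event, &error))
		printf("running under Xwayland\n");
	else
		printf("not Xwayland\n");

	XCloseDisplay(dpy);
	return 0;
}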

August 10, 2022

New Month, New Post

I’m going to kick off this post and month by saying that in my defense, I was going to write this post weeks ago, but then I didn’t, and then I got sidetracked, but I had the screenshots open the whole time so it’s not like I forgot, but then I did forget for a little while, and then my session died because the man the myth the legend the as-seen-on-the-web-with-a-different-meaning Adam “ajax” Jackson pranked me with a GLX patch, but I started a new session, and gimp recovered my screenshots, and I remembered I needed to post, and I got distracted even more, and now it’s like three whole weeks later and here we are at the post I was going to write last month but didn’t get around to but now it’s totally been gotten to.

You’re welcome.

Release

There’s a new Mesa release in the pipeline, and its name is 22.2. Cool stuff happening in that release: Zink passes GL 4.6 conformance tests on Lavapipe and ANV, would totally pass on RADV if Mesa could even compile shaders for sparse texture tests, and is reasonably close to passing on NVIDIA’s beta driver except for the cases which don’t pass because of unfixed NVIDIA driver bugs. Also Kopper is way better. And some other stuff that I could handwave about but I’m too tired and already losing focus.

Recap over, now let’s get down to some technical mumbo-jumbo.

Render Passes

They exist in Vulkan, and I hate them, but I’m not talking specifically about VkRenderPass. Instead, I’m referring to the abstract concept of “render passes” which includes the better-in-every-way dynamic rendering variant.

Shut up, subpasses, nobody was talking to you.

Render passes control how rendering works. There’s load operations which determine how data is retrieved from the framebuffer (I also hate framebuffers) attachments, there’s store operations which determine how data is stored back to the attachments (I hate this part too), and then there’s “dependencies” (better believe I hate these) which manage synchronization between operations, and input attachments (everyone hates these) which enable reading attachment data in shaders, and then also render pass instances have to be started and stopped any time any attachments or framebuffer geometry changes (this sucks), and to top it all off, transfer operations can’t be executed while render passes are active (mega sucks).

Also there’s nested render passes, but I’m literally fearing for my life even mentioning them where other driver developers can see, so let’s move on.

On Topic

Today’s suck is focused on render passes and transfer operations: why can’t they just get along?

Here’s a sample command stream from a bug I recently solved:

[Figure: reorder-before.png — the command stream before reordering]

Notice how the command stream continually starts and stops render passes to execute transfer operations. If you’re on a desktop, you’re like whatever just give me frames, but if you’re on a mobile device with a tiling GPU, this is probably sending you sprinting for your Tik-Tok so you don’t have to look at the scary perf demons anymore.

Until recently, I would’ve been in the former camp, but guess what: as of Mesa 22.2, Zink on Turnip is Officially Supported. Not only that, it has only 14 failures for all of GL 4.6 conformance, so it can render stuff pretty accurately too. Given this as well as the rapid response from Turnip developers, it seems only right that some changes be made to improve Zink on tiling GPUs so that performance isn’t an unmitigated disaster (really hate when this happens).

When running on a tiling GPU, it’s important to avoid VK_ATTACHMENT_LOAD_OP_LOAD at all costs. These GPUs incur a penalty whenever they load attachment data, so ideally it’s best to either VK_ATTACHMENT_LOAD_OP_CLEAR or, for attachments without valid data, VK_ATTACHMENT_LOAD_OP_DONT_CARE. Failure to abide by this guideline will result in performance going straight to the dumpster (again, really hate this).
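
In sketch form (this is an illustration, not actual Zink code), the load-op decision for a dynamic rendering color attachment boils down to:

#include <stdbool.h>
#include <vulkan/vulkan.h>

/* Pick the cheapest load op a tiling GPU can get away with. */
static VkAttachmentLoadOp
choose_load_op(bool will_clear, bool has_valid_data)
{
	if (will_clear)
		return VK_ATTACHMENT_LOAD_OP_CLEAR;
	if (!has_valid_data)
		return VK_ATTACHMENT_LOAD_OP_DONT_CARE;
	/* Only pay for a load when the previous contents actually matter. */
	return VK_ATTACHMENT_LOAD_OP_LOAD;
}

/* Assumes *att was zeroed by the caller (pNext, resolve fields, clearValue). */
static void
fill_color_attachment(VkRenderingAttachmentInfo *att, VkImageView view,
		      bool will_clear, bool has_valid_data)
{
	att->sType = VK_STRUCTURE_TYPE_RENDERING_ATTACHMENT_INFO;
	att->imageView = view;
	att->imageLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
	att->loadOp = choose_load_op(will_clear, has_valid_data);
	att->storeOp = VK_ATTACHMENT_STORE_OP_STORE;
}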

With this in mind, it may seem obvious, but the above command stream is terrible for tiling GPUs. Each render pass contains only a single draw before terminating to execute a transfer operation. Sure, an argument could be made that this is running a desktop GL application on mobile and thus any performance issues are expected, but I hate it when people argument.

Stop argumenting and get back to writing software.

Problem Space

Problem Space is a cool term used by real experts, so I can only imagine how much SEO using it twice gets me. In this Problem Space, none of the problem space problems that are spacing are new to people who have been in the problem space for longer than me, which is basically everyone since I only got to this problem space a couple years ago. Looking at Freedreno, there’s already a mechanism in place that I can copy to solve one problem Zink has, namely that of never using VK_ATTACHMENT_LOAD_OP_DONT_CARE.

Delving deeper into the problem space, use of VK_ATTACHMENT_LOAD_OP_DONT_CARE is predicated upon the driver being able to determine that a framebuffer attachment does not have valid data when beginning a render pass instance. Thus, the first step is to track that status on the image. It turns out there’s not many places where this needs to be managed. On write-map, import, and write operations, mark the image as valid, and if the image is explicitly invalidated, mark it as invalid. This flag can then be leveraged to skip loading attachment data where possible.

Problem spaced. Or at least one of them.

Nowing Thens

Historically, Zink has always been a driver of order. It gets the GL commands as Gallium callbacks, it records the corresponding Vulkan command, and then it sends the command buffer to the GPU. It’s simple, yet effective.

But what if there was a better way? What if Zink could see all these operations that were splitting the render passes and determine whether they needed to occur “now” or could instead occur “then”?

This is entirely possible, and it’s similar to the tracking for invalidation. Think of it like a sieve:

  • Create two command buffers A and B
  • A is the main command buffer
  • B is for “unordered” commands
  • When recording commands, implement an algorithm to determine whether a given command can be reordered to occur earlier in the command stream
    • If yes, record command into B
    • If no, record command into A
  • At submission time, execute B before A

It sounds simple, but the details are a bit more complex. To handle this in Zink, I added more tracking data to each buffer and image resource such that:

  • At the start of every command buffer, all transfer operations can be promoted to occur on cmdbuf B
  • If a transfer operation is promoted to cmdbuf B, the resources involved are tagged with the appropriate unordered read/write access flag(s)
  • Any time a resource is used for a draw or dispatch operation, it gets flagged with read or write access that unsets the corresponding unordered access flag(s)
  • When evaluating transfer operations, operations can be promoted to B if one of the following is true:
    • The resources involved are already used only in B
    • There is no read access occurring on A for the resource being written to AND ONE OF
      • There is no write access for the resources involved
      • The only write access for the resources involved is on B
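
Paraphrased as code (again a sketch, not the real Zink implementation), the promotion check for a transfer op that writes dst and reads src looks roughly like:

#include <stdbool.h>

struct res_access {
	bool read_a, write_a;  /* accesses already recorded on cmdbuf A */
	bool read_b, write_b;  /* accesses already recorded on cmdbuf B */
};

static bool only_used_on_b(const struct res_access *r)
{
	return !r->read_a && !r->write_a;
}

static bool can_promote_transfer(const struct res_access *dst,
				 const struct res_access *src)
{
	/* Already unordered: keep it that way. */
	if (only_used_on_b(dst) && only_used_on_b(src))
		return true;

	/* No ordered read of the resource being written to ... */
	if (dst->read_a)
		return false;

	/* ... and any writes to the involved resources live on B only. */
	return !dst->write_a && !src->write_a;
}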

By following this set of rules, transfer operations can be effectively filtered out of render passes on cmdbuf A and into the dumpster of cmdbuf B where all the other transfer operations live. Using a bit more creativity, the above command stream can be rewritten to look like this:

[Figure: reorder-after.png — the command stream after reordering]

It’s not perfect, but it’s quite effective in not turning the driver code to spaghetti while yielding some nice performance boosts.

It’s also shipping along with the other changes in Mesa 22.2.

Future Improvements

While both of these changesets solved real problems, there’s still work to be done in the future. Command stream reordering currently disallows moving draw and dispatch operations from A to B, but in some cases, this may actually be optimal. Finding a method to manage this could yield further gains in performance by avoiding more VK_ATTACHMENT_LOAD_OP_LOAD occurrences.

August 08, 2022

Descriptors are hard

Over the weekend, I asked on twitter if people would be interested in a rant about descriptor sets. As of the writing of this post, it has 46 likes so I’ll count that as a yes.

I kind-of hate descriptor sets…

Well, not descriptor sets per se. More descriptor set layouts. The fundamental problem, I think, was that we too closely tied memory layout to the shader interface. The Vulkan model works ok if your objective is to implement GL on top of Vulkan. You want 32 textures, 16 images, 24 UBOs, etc. and everything in your engine fits into those limits. As long as they’re always separate bindings in the shader, it works fine. It also works fine if you attempt to implement HLSL SM6.6 bindless on top of it. Have one giant descriptor set with all resources ever in giant arrays and pass indices into the shader somehow as part of the material.

The moment you want to use different binding interfaces in different shaders (pretty common if artists author shaders), things start to get painful. If you want to avoid excess descriptor set switching, you need multiple pipelines with different interfaces to use the same set. This makes the already painful situation with pipelines worse. Now you need to know the binding interfaces of all pipelines that are going to be used together so you can build the combined descriptor set layout and you need to know that before you can compile ANY pipelines. We tried to solve this a bit with multiple descriptor sets and pipeline layout compatibility which is supposed to let you mix-and-match a bit. It’s probably good enough for VS/FS mixing but not for mixing whole materials.

The problem space

So, how did we get here? As with most things in Vulkan, a big part of the problem is that Vulkan targets a very diverse spread of hardware and everyone does descriptor binding a bit differently. In order to understand the problem space a bit, we need to look at the hardware…

DISCLAIMER:

I’m about to spill a truckload of hardware beans. Let me reassure you all that I am not violating any NDAs here. Everything I’m about to tell you is either publicly documented (AMD and Intel) or can be gleaned from reading public Mesa source code.

Descriptor binding methods in hardware can be roughly broken down into 4 broad categories, each with its own advantages and disadvantages:

  1. Direct access (D): This is where the shader passes the entire descriptor to the access instruction directly. The descriptor may have been loaded from a buffer somewhere but the shader instructions do not reference that buffer in any way; they just take what they’re given. The classic example here is implementing SSBOs as “raw” pointer access. Direct access is extremely flexible because the descriptors can live literally anywhere but it comes at the cost of having to pass the full descriptor through the shader every time.

  2. Descriptor buffers (B): Instead of passing the entire descriptor through the shader, descriptors live in a buffer. The buffers themselves are bound to fixed binding points or have their base addresses pushed into the shader somehow. The shader instruction takes either a fixed descriptor buffer binding index or a base address (as appropriate) along with some form of offset to the descriptor in the buffer. The difference between this and the direct access model is that the descriptor data lives in some other bit of memory that the hardware must first read before it can do the actual access. Changing buffer bindings, while definitely not free, is typically not incredibly expensive.

  3. Descriptor heaps (H): Descriptors of a particular type all live in a single global table or heap. Because the table is global, changing it typically involves a full GPU stall and maybe dumping a bunch of caches. This makes changing out the table fairly expensive. Shader instructions which access these descriptors are passed an index into the global table. Because everything is fixed and global, this requires the least amount of data to pass through the shader of the three bindless mechanisms.

  4. Fixed HW bindings (F): In this model, resources are bound to fixed HW slots, often by setting registers from the command streamer or filling out small tables in memory. With the push towards bindless, fixed HW bindings are typically only used for fixed-function things on modern hardware such as render targets and vertex, index, and streamout buffers. However, we still need to consider them because Vulkan 1.0 was designed to support pre-bindless hardware which might not be quite as nice.

Here’s a quick run-down on where things sit with most of the hardware shipping today:

Hardware             Textures  Images  Samplers  Border Colors  Typed buffers  UBOs    SSBOs
NVIDIA (Kepler+)     H         H       H         H              H              D/F     D
AMD                  D         D       D         H              D              D       D
Intel (Skylake+)     H         H       H         H              H              H/D/F   H/D
Intel (pre-Skylake)  F         F       F         F              F              D/F     F
Arm (Valhall+)       B         B       B         B              B              B/D/F   B/D
Arm (pre-Valhall)    F         F       F         F              F              D/F     D
Qualcomm (a5xx+)     B         B       B         B              B              B       B
Broadcom (vc5)       D         D       D         D              D              D       D

The line above for “Intel (pre-Skylake)” is a bit misleading. I’m labeling everything as fixed HW bindings but it’s actually a bit more flexible than most fixed HW binding mechanisms. It’s a sort of heap model, but where, instead of indexing into heaps directly from the shader, everything goes through a second layer of indirection called a binding table which is restricted to 240 entries. On Skylake and later hardware, the binding table hardware still exists and uses a different set of heaps, which provides a nice back-door for drivers. More on that when we talk about D3D12.

The Vulkan 1.0 descriptor set model

As you can see from above, the hardware landscape is quite diverse when it comes to descriptor binding. Everyone has made slightly different choices depending on the type of descriptor and picking a single model for everyone isn’t easy. The Vulkan answer to this was, of course, descriptor sets and their dreaded layouts.

Ignoring UBOs for the moment, the mapping from the Vulkan API to these hardware descriptors is conceptually fairly simple. The descriptor set layout describe a set of bindings, each with a binding type and a number of descriptors in that binding. The driver maps the binding type to the type of HW binding it uses and computes how much GPU or CPU memory is needed to store all the bindings. Fixed HW bindings are typically stored CPU-side and the actual bindings get set as part of vkCmdBindDescriptorSets() or vkCmdDraw/Dispatch(). For everything in one of the three bindless categories, they allocate GPU memory. For heap descriptors, descriptors may be allocated as part of the descriptor set or, to save memory, as part of the image or buffer view object. Given that descriptor heaps are often limited in size, allocating them as part of the view object is often preferred.
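
As a purely hypothetical sketch of that size computation, a driver walking a set layout might do something like the following. The per-type sizes are made up; real drivers have their own hardware-specific sizes and alignment rules:

#include <stdint.h>
#include <vulkan/vulkan.h>

struct example_binding {
	VkDescriptorType type;
	uint32_t count;
};

static uint32_t descriptor_size(VkDescriptorType type)
{
	/* Made-up sizes, purely for illustration. */
	switch (type) {
	case VK_DESCRIPTOR_TYPE_SAMPLER:                return 16;
	case VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER: return 48;
	case VK_DESCRIPTOR_TYPE_SAMPLED_IMAGE:          return 32;
	case VK_DESCRIPTOR_TYPE_STORAGE_BUFFER:         return 16;
	default:                                        return 32;
	}
}

static uint64_t layout_gpu_size(const struct example_binding *bindings,
				uint32_t binding_count)
{
	uint64_t size = 0;

	for (uint32_t i = 0; i < binding_count; i++)
		size += (uint64_t)descriptor_size(bindings[i].type) *
			bindings[i].count;
	return size;
}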

UBOs get weird. I’m not going to try and go into all of the details because there are often heuristics involved and it gets complicated fast. However, as you can see from the above table, most hardware has some sort of fixed HW binding for UBOs, even on bindless hardware. This is because UBOs are the hottest of hot paths and even small differences in UBO fetch speed turn into real FPS differences in games. This is why, even with descriptor indexing, UBOs aren’t required to support update-after-bind. The Intel Linux driver has three or four different paths a UBO may take based on how often it’s used relative to other UBOs, update-after-bind, and which shader stage it’s being accessed from.

The other thing I have yet to mention is dynamic buffers. These typically look like a fixed HW binding. How they’re implemented varies by hardware and driver. Often they use fixed HW bindings or the descriptors are loaded into the shader as push constants. Even if the buffer pointer comes from descriptor set memory, the dynamic offset has to get loaded in via some push-like mechanism.

The D3D12 descriptor heap

DISCLAIMER:

Again, I’m going to talk details here. Again, in spite of the fact that there are exactly zero open-source D3D12 drivers, I can safely say that I’m not violating any NDAs. I’ve literally never seen the inside of a D3D12 driver. I’ve just read public documentation and am familiar with how hardware works and is driven. This is all based on D3D12 drivers I’ve written in my head, not the real deal. I may get a few things wrong.

For D3D12, Microsoft took a very different approach. They embraced heaps. D3D12 has these heavy-weight descriptor heap objects which have to be bound before you can execute any 3D or compute commands. Shaders have the usual HLSL register notation for describing the descriptor interface. When shaders are compiled into pipelines, descriptor tables are used to map the bindings in the shader to ranges in the relevant descriptor heap. While the size of a descriptor heap range remains fixed, each such range has a dynamic offset which allows the application to move it around at will.

With SM6.6, Microsoft added significant flexibility and further embraced heaps. Now, instead of having to use descriptor tables in the root descriptor, applications can use heap indices directly. This provides a full bindless experience. All the application developer has to do is manage heap allocations with resource lifetimes and figure out how to get indices into their shader. Gone are the days of fiddling with fixed interface layouts through side-band pipeline create APIs. From what I’ve heard, most developers love it.

If D3D12 has embraced heaps, how does it work on AMD? They use descriptor buffers, don’t they? Yup. But, fortunately for Microsoft, a descriptor heap is just a very restrictive descriptor buffer. The AMD driver just uses two of their descriptor buffer bindings (resource and sampler heaps are separate in D3D12) and implements the heap as a descriptor buffer.

One downside to the descriptor heap approach is that it forces some amount of extra indirection, especially with the SM6.6 bindless model. If your application is using bindless, you first have to load a heap index from a constant buffer somewhere and then pass that to the load/store op. The load/store turns into a sequence of instructions that fetches the descriptor from the heap, does the offset calculation, and then does the actual load or store from the corresponding pointer. Depending on how often the shader does this, how many unique descriptors are involved, and the compiler’s ability to optimize away redundant descriptor fetches, this can add up to real shader time in a hurry.
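
In plain C terms (a conceptual model of what the compiled shader ends up doing, not anything a real compiler emits verbatim), the indirection chain looks like this:

#include <stdint.h>

struct descriptor {
	uint64_t base_addr;
	uint32_t size;
	uint32_t flags;
};

/* Two dependent fetches before we even touch the data. */
static uint32_t bindless_load_u32(const struct descriptor *heap,
				  const uint32_t *material_constants,
				  uint32_t which_resource,
				  uint32_t byte_offset)
{
	uint32_t heap_index = material_constants[which_resource]; /* fetch #1 */
	struct descriptor desc = heap[heap_index];                /* fetch #2 */
	const uint32_t *data = (const uint32_t *)(uintptr_t)desc.base_addr;

	return data[byte_offset / 4];                             /* the real load */
}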

The other major downside to the D3D12 model is that handing control of the hardware heaps to the application really ties driver writers’ hands. Any time the client does a copy or blit operation which isn’t implemented directly in the DMA hardware, the driver has to spin up the 3D hardware, set up a pipeline, and do a few draws. In order to do a blit, the pixel shader needs to be able to read from the blit source image. This means it needs a texture or UAV descriptor which needs to live in the heap which is now owned by the client. On AMD, this isn’t a problem because they can re-bind descriptor sets relatively cheaply or just use one of the high descriptor set bindings which they’re not using for heaps. On Intel, they have the very convenient back-door I mentioned above where the old binding table hardware still exists for fragment shaders.

Where this gets especially bad is on NVIDIA, which is a bit ironic given that the D3D12 model is basically exactly NVIDIA hardware. NVIDIA hardware only has one texture/image heap and switching it is expensive. How do they implement these DMA operations, then? First off, as far as I can tell, the only DMA operation in D3D12 that isn’t directly supported by NVIDIA’s DMA engine is MSAA resolves. D3D12 doesn’t have an equivalent of vkCmdBlitImage(). Applications are told to implement that themselves if they really want it. What saves them, I think (I can’t confirm), is that D3D12 exposes 10^6 descriptors to the application but NVIDIA hardware supports 2^20 descriptors. That leaves about 48k descriptors for internal usage. Some of those are reserved by Microsoft for tools such as PIX but I’m guessing a few of them are reserved for the driver as well. As long as the hardware is able to copy descriptors around a bit (NVIDIA is very good at doing tiny DMA ops), they can manage their internal descriptors inside this range. It’s not ideal, but it does work.

Towards a better future?

I have nothing to announce, but others and I have been thinking about descriptors in Vulkan and how to make them better. I think we should be able to do something that’s better than the descriptor sets we have today. What is that? I’m personally not sure yet.

The good news is that, if we’re willing to ignore non-bindless hardware (I think we are for forward-looking things), there are really only two models: heaps and buffers. (Anything direct access can be stored in the heap or buffer and it won’t hurt anything.) I too can hear the siren call of D3D12 heaps but I’d really like to avoid tying drivers’ hands like that. Even if NVIDIA were to rework their hardware to support two heaps today to get around the internal descriptors problem and make it part of the next generation of GPUs, we wouldn’t be able to rely on users having that for 5-10 years, longer depending on application targets.

If we keep letting drivers manage their own heaps, D3D12 layering on top of Vulkan becomes difficult. D3D12 doesn’t have image or buffer view objects in the same sense that Vulkan does. You just create descriptors and stick them in the heap somewhere. This means we either need to come up with a way to get rid of view objects in Vulkan or a D3D12 layer needs a giant cache of view objects, the lifetimes of which are difficult to manage, to say the least. It’s quite the pickle.

As with many of my rant posts, I don’t really have a solution. I’m not even really asking for feedback and ideas. My primary goal is to educate people and help them understand the problem space. Graphics is insanely complicated and hardware vendors are notoriously cagey about the details. I’m hoping that, by demystifying things a bit, I can at the very least garner a bit of sympathy for what we at Khronos are trying to do and help people understand that it’s a near miracle that we’ve gotten where we are. 😅

August 04, 2022
People write code. Test coverage is never enough. Some angry contributor will disable the CI. And we all write bugs. But that’s OK, it’s part of the job. Programming is hard, and sometimes we may miss a corner case, forget that numbers overflow, and all the other strange things that computers can do. One easy thing we can do to help the poor developer who needs to find which change in the code stopped their printer from working properly is to keep the project bisectable.
August 03, 2022

The first part of this series covered principles of locking engineering. This part goes through a pile of locking patterns and designs, from the most favourable and easiest to adjust, and hence resulting in a long-term maintainable code base, down to the least favourable, since they are the hardest to ensure they work correctly and stay that way while the code evolves. For convenience they’re even color coded, with the dangerous levels getting progressively more crispy red, indicating how close to the burning fire you are! Think of it as Dante’s Inferno, but for locking.

As a reminder from the intro of the first part, with locking engineering I mean the art of ensuring that there’s sufficient consistency in reading and manipulating data structures, and not just sprinkling mutex_lock() and mutex_unlock() calls around until the result looks reasonable and lockdep has gone quiet.

Level 0: No Locking

The dumbest possible locking is no need for locking at all. Which does not mean extremely clever lockless tricks for a “look, no calls to mutex_lock()” feint, but an overall design which guarantees that any writers cannot exist concurrently with any other access at all. This removes the need for consistency guarantees while accessing an object at the architectural level.

There’s a few standard patterns to achieve locking nirvana.

Locking Pattern: Immutable State

The lesson in graphics API design over the last decade is that immutable state objects rule, because they both lead to simpler driver stacks and also better performance. Vulkan instead of OpenGL with its ridiculous amount of mutable and implicit state is the big example, but atomic instead of legacy kernel mode setting, or Wayland instead of X11, are also built on the assumption that immutable state objects are a Great Thing (tm).

The usual pattern is:

  1. A single thread fully constructs an object, including any sub structures and anything else you might need. Often subsystems provide initialization helpers for objects that drivers can subclass through embedding, e.g. drm_connector_init() for initializing a kernel modesetting output object. Additional functions can set up different or optional aspects of an object, e.g. drm_connector_attach_encoder() sets up the invariant links to the preceding element in a kernel modesetting display chain.

  2. The fully formed object is published to the world, in the kernel this often happens by registering it under some kind of identifier. This could be a global identifier like register_chrdev() for character devices, something attached to a device like registering a new display output on a driver with drm_connector_register() or some struct xarray in the file private structure. Note that this step here requires memory barriers of some sort. If you hand roll the data structure like a list or lookup tree with your own fancy locking scheme instead of using existing standard interfaces you are on a fast path to level 3 locking hell. Don’t do that.

  3. From this point on there are no consistency issues anymore and all threads can access the object without any locking.
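
Put together, the pattern looks like this sketch of a hypothetical driver’s output setup (the foo_* names are made up, error handling mostly omitted):

static int foo_output_init(struct foo_device *fdev)
{
	struct foo_output *out;

	/* Step 1: a single thread fully constructs the object. */
	out = kzalloc(sizeof(*out), GFP_KERNEL);
	if (!out)
		return -ENOMEM;
	drm_connector_init(&fdev->drm, &out->connector,
			   &foo_connector_funcs, DRM_MODE_CONNECTOR_HDMIA);
	drm_connector_attach_encoder(&out->connector, &fdev->encoder);

	/* Step 2: publish it through the standard interface, which
	 * contains the required memory barriers. */
	drm_connector_register(&out->connector);

	/* Step 3: any thread may now look at the object without locking,
	 * since it will never change again. */
	return 0;
}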

Locking Pattern: Single Owner

Another way to ensure there’s no concurrent access is by only allowing one thread to own an object at a given point of time, and have well defined handover points if that is necessary.

Most often this pattern is used for asynchronously processing a userspace request:

  1. The syscall or IOCTL constructs an object with sufficient information to process the userspace’s request.

  2. That object is handed over to a worker thread with e.g. queue_work().

  3. The worker thread is now the sole owner of that piece of memory and can do whatever it feels like with it.
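
A sketch of the handover, again with made-up foo_* names and error paths trimmed:

struct foo_request {
	struct work_struct work;
	u32 param;
};

static void foo_request_work(struct work_struct *work)
{
	struct foo_request *req = container_of(work, struct foo_request, work);

	/* Step 3: sole owner now, no locking needed at all. */
	/* ... process req->param ... */
	kfree(req);
}

static int foo_ioctl(struct foo_device *fdev, u32 param)
{
	struct foo_request *req;

	/* Step 1: construct the request from the userspace arguments. */
	req = kzalloc(sizeof(*req), GFP_KERNEL);
	if (!req)
		return -ENOMEM;
	req->param = param;
	INIT_WORK(&req->work, foo_request_work);

	/* Step 2: hand over ownership; queue_work() provides the barriers. */
	queue_work(system_wq, &req->work);
	return 0;
}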

Again the second step requires memory barriers, which means if you hand roll your own lockless queue you’re firmly in level 3 territory and won’t get rid of the burned in red hot afterglow in your retina for quite some time. Use standard interfaces like struct completion or even better libraries like the workqueue subsystem here.

Note that the handover can also be chained or split up, e.g. for a nonblocking atomic kernel modeset requests there’s three asynchronous processing pieces involved:

  • The main worker, which pushes the display state update to the hardware and which is enqueued with queue_work().

  • The userspace completion event handling built around struct drm_pending_event and generally handed off to the interrupt handler of the driver from the main worker and processed in the interrupt handler.

  • The cleanup of the no longer used old scanout buffers from the preceding update. The synchronization between the preceding update and the cleanup is done through struct completion to ensure that there’s only ever a single worker which owns a state structure and is allowed to change it.

Locking Pattern: Reference Counting

Users generally don’t appreciate if the kernel leaks memory too much, and cleaning up objects by freeing their memory and releasing any other resources tends to be an operation of the very much mutable kind. Reference counting to the rescue!

  • Every pointer to the reference counted object must guarantee that a reference exists for as long as the pointer is in use. Usually that’s done by calling kref_get() when making a copy of the pointer, but implied references by e.g. continuing to hold a lock that protects a different pointer are often good enough too for a temporary pointer.

  • The cleanup code runs when the last reference is released with kref_put(). Note that this again requires memory barriers to work correctly, which means if you’re not using struct kref then it’s safe to assume you’ve screwed up.
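
The standard shape of this with struct kref, on a hypothetical object, is short enough to spell out:

struct foo_object {
	struct kref kref;
	/* ... payload ... */
};

static struct foo_object *foo_object_create(void)
{
	struct foo_object *obj = kzalloc(sizeof(*obj), GFP_KERNEL);

	if (obj)
		kref_init(&obj->kref); /* refcount starts at 1 */
	return obj;
}

static void foo_object_release(struct kref *kref)
{
	struct foo_object *obj = container_of(kref, struct foo_object, kref);

	/* no locking needed here, we are the last user */
	kfree(obj);
}

/* Call when making a copy of a pointer to the object. */
static struct foo_object *foo_object_get(struct foo_object *obj)
{
	kref_get(&obj->kref);
	return obj;
}

/* Call when a pointer stops being used. */
static void foo_object_put(struct foo_object *obj)
{
	kref_put(&obj->kref, foo_object_release);
}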

Note that this scheme falls apart when released objects are put into some kind of cache and can be resurrected. In that case your cleanup code needs to somehow deal with these zombies and ensure there’s no confusion, and vice versa any code that resurrects a zombie needs to deal with the wooden spikes the cleanup code might throw at an inopportune time. The worst example of this kind is SLAB_TYPESAFE_BY_RCU, where readers that are only protected with rcu_read_lock() may need to deal with objects potentially going through simultaneous zombie resurrections, potentially multiple times, while the readers are trying to figure out what is going on. This generally leads to lots of sorrow, wailing and ill-tempered maintainers, as the GPU subsystem has experienced and continues to experience with struct dma_fence.

Hence use standard reference counting, and don’t be tempted by the siren of trying to implement clever caching of any kind.

Level 1: Big Dumb Lock

It would be great if nothing ever changes, but sometimes that cannot be avoided. At that point you add a single lock for each logical object. An object could be just a single structure, but it could also be multiple structures that are dynamically allocated and freed under the protection of that single big dumb lock, e.g. when managing GPU virtual address space with different mappings.

The tricky part is figuring out what is an object to ensure that your lock is neither too big nor too small:

  • If you make your lock too big you run the risk of creating a dreaded subsystem lock, or violating the “Protect Data, not Code” principle in some other way. Split your locking further so that a single lock really only protects a single object, and not a random collection of unrelated ones. So one lock per device instance, not one lock for all the device instances in a driver or worse in an entire subsystem.

    The trouble is that once a lock is too big and has firmly moved into “protects some vague collection of code” territory, it’s very hard to get out of that hole.

  • Different problems strike when the locking scheme is too fine-grained, e.g. in the GPU virtual memory management example when every address mapping in the big vma tree has its own private lock. Or when a structure has a lot of different locks for different member fields.

    One issue is that locks aren’t free, the overhead of fine-grained locking can seriously hurt, especially when common operations have to take most of the locks anyway and so there’s no chance of any concurrency benefit. Furthermore fine-grained locking leads to the temptation of solving locking overhead with ever more clever lockless tricks, instead of radically simplifying the design.

    The other issue is that more locks improve the odds for locking inversions, and those can be tough nuts to crack. Again trying to solve this with more lockless tricks to avoid inversions is tempting, and again in most cases the wrong approach.

Ideally, your big dumb lock would always be right-sized every time the requirements on the data structures change. But working magic 8 balls tend to be in short supply, and you tend to only find out that your guess was wrong when the pain of the lock being too big or too small is already substantial. The inherent struggles of resizing a lock as the code evolves then keep pushing you further away from the optimum instead of closer. Good luck!

Level 2: Fine-grained Locking

It would be great if this is all the locking we ever need, but sometimes there’s functional reasons that force us to go beyond the single lock for each logical object approach. This section will go through a few of the common examples, and the usual pitfalls to avoid.

But before we delve into details, remember to document in kerneldoc with the inline per-member kerneldoc comment style once you go beyond a simple single lock per object approach. It’s the best place for future bug fixers and reviewers - meaning you - to find the rules for how things were at least meant to work.

Locking Pattern: Object Tracking Lists

One of the main duties of the kernel is to track everything, not least to make sure there are no leaks and everything gets cleaned up again. But there are other reasons to maintain lists (or other container structures) of objects.

Now sometimes there’s a clear parent object, with its own lock, which could also protect the list with all the objects, but this does not always work:

  • It might force the lock of the parent object to essentially become a subsystem lock and so protect much more than it should when following the “Protect Data, not Code” principle. In that case it’s better to have a separate (spin-)lock just for the list to be able to clearly untangle what the parent and subordinate object’s lock each protect.

  • Different code paths might need to walk and possibly manipulate the list both from the container object and the contained object, which would lead to locking inversion if the list isn’t protected by its own stand-alone (nested) lock. This tends to especially happen when an object can be attached to multiple other objects, like a GPU buffer object can be mapped into multiple GPU virtual address spaces of different processes.

  • The constraints of calling contexts for adding or removing objects from the list could be different and incompatible from the requirements when walking the list itself. The main example here are LRU lists where the shrinker needs to be able to walk the list from reclaim context, whereas the superior object locks often have a need to allocate memory while holding each lock. Those object locks the shrinker can then only trylock, which is generally good enough, but only being able to trylock the LRU list lock itself is not.

Simplicity should still win, therefore only add a (nested) lock for lists or other container objects if there’s really no suitable object lock that could do the job instead.

Locking Pattern: Interrupt Handler State

Another example that requires nested locking is when part of the object is manipulated from a different execution context. The prime example here are interrupt handlers. Interrupt handlers can only use interrupt safe spinlocks, but often the main object lock must be a mutex to allow sleeping or allocating memory or nesting with other mutexes.

Hence the need for a nested spinlock to just protect the object state shared between the interrupt handler and code running from process context. Process context should generally only acquire the spinlock nested with the main object lock, to avoid surprises and limit any concurrency issues to just the singleton interrupt handler.
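
A sketch of this nesting, with made-up foo_* names:

foo_irq_handler(irq, data)
{
	struct foo_device *fdev = data;

	spin_lock(&fdev->irq_lock);
	/* update the state shared with process context */
	spin_unlock(&fdev->irq_lock);

	return IRQ_HANDLED;
}

foo_update(fdev)
{
	/* the main object lock, always taken first */
	mutex_lock(&fdev->lock);

	/* the nested interrupt-safe lock, only around the shared state */
	spin_lock_irq(&fdev->irq_lock);
	/* touch the state shared with the interrupt handler */
	spin_unlock_irq(&fdev->irq_lock);

	mutex_unlock(&fdev->lock);
}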

Locking Pattern: Async Processing

Very similar to the interrupt handler problems is coordination with async workers. The best approach is the single owner pattern, but often state needs to be shared between the worker and other threads operating on the same object.

The naive approach of just using a single object lock tends to deadlock:

start_processing(obj)
{
	mutex_lock(&obj->lock);
	/* set up the data for the async work */;
	schedule_work(&obj->work);
	mutex_unlock(&obj->lock);
}

stop_processing(obj)
{
	mutex_lock(&obj->lock);
	/* clear the data for the async work */;
	cancel_work_sync(&obj->work);
	mutex_unlock(&obj->lock);
}

work_fn(work)
{
	obj = container_of(work, struct obj, work);

	mutex_lock(&obj->lock);
	/* do some processing */
	mutex_unlock(&obj->lock);
}

Do not worry if you don’t spot the deadlock, because it is a cross-release dependency between the entire work_fn() and cancel_work_sync(), and these are a lot trickier to spot. Since cross-release dependencies are an entire huge topic of their own I won’t go into more detail; a good starting point is this LWN article.

There’s a bunch of variations of this theme, with problems in different scenarios:

  • Replacing the cancel_work_sync() with cancel_work() avoids the deadlock, but often means the work_fn() is prone to use-after-free issues.

  • Calling cancel_work_sync() before taking the mutex can work in some cases, but falls apart when the work is self-rearming. Or maybe the work or the overall object isn’t guaranteed to exist without holding its lock, e.g. if this is part of an async processing queue for a parent structure.

  • Cancelling the work after the call to mutex_unlock() might race with concurrent restarting of the work and upset the bookkeeping.

Like with interrupt handlers the clean solution tends to be an additional nested lock which protects just the mutable state shared with the work function and nests within the main object lock. That way work can be cancelled while the main object lock is held, which avoids a ton of races. But without holding the sublock that work_fn() needs, which avoids the deadlock.
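
In the same schematic style as the broken example above, the fixed version looks roughly like this (only the changed functions shown):

stop_processing(obj)
{
	mutex_lock(&obj->lock);

	spin_lock(&obj->worker_lock);
	/* clear the data for the async work */;
	spin_unlock(&obj->worker_lock);

	/* safe now: work_fn() only ever takes worker_lock, not obj->lock */
	cancel_work_sync(&obj->work);

	mutex_unlock(&obj->lock);
}

work_fn(work)
{
	obj = container_of(work, struct obj, work);

	spin_lock(&obj->worker_lock);
	/* do some processing */
	spin_unlock(&obj->worker_lock);
}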

Note that in some cases the superior lock doesn’t need to exist, e.g. struct drm_connector_state is protected by the single owner pattern, but drivers might have a need for some further decoupled asynchronous processing, e.g. for handling the content protection or link training machinery. In that case only the sublock for the mutable driver private state shared with the worker exists.

Locking Pattern: Weak References

Reference counting is a great pattern, but sometimes you need to be able to store pointers without them holding a full reference. This could be for lookup caches, or because your userspace API mandates that some references do not keep the object alive - we’ve unfortunately committed that mistake in the GPU world. Or because holding full references everywhere would lead to unreclaimable reference loops and there’s no better way to break them than to make some of the references weak. In languages with a garbage collector weak references are implemented by the runtime, and so are no real worry. But in the kernel the concept has to be implemented by hand.

Since weak references are such a standard pattern struct kref has ready-made support for them. The simple approach is using kref_put_mutex() with the same lock that also protects the structure containing the weak reference. This guarantees that either the weak reference pointer is gone too, or there is at least somewhere still a strong reference around and it is therefore safe to call kref_get(). But there are some issues with this approach:

  • It doesn’t compose to multiple weak references, at least if they are protected by different locks - all the locks need to be taken before the final kref_put() is called, which means minimally some pain with lock nesting and you get to hand-roll it all to boot.

  • The mutex required to be held during the final put is the one which protects the structure with the weak reference, and often has little to do with the object that’s being destroyed. So a pretty nasty violation of the big dumb lock pattern. Furthermore the lock is held over the entire cleanup function, which defeats the point of the reference counting pattern, which is meant to enable “no locking” cleanup code. It becomes very tempting to stuff random other pieces of code under the protection of this lock, making it a sprawling mess and violating the principle to protect data, not code: the lock held during the entire cleanup operation is protecting against that cleanup code doing things, and no longer a specific data structure.

The much better approach is using kref_get_unless_zero(), together with a spinlock for your data structure containing the weak reference. This looks especially nifty in combination with struct xarray.

obj_find_in_cache(id)
{
	xa_lock();
	obj = xa_find(id);
	if (obj && !kref_get_unless_zero(&obj->kref))
		obj = NULL;
	xa_unlock();

	return obj;
}

With this all the issues are resolved:

  • Arbitrary amounts of weak references in any kind of structures protected by their own spinlock can be added, without causing dependencies between them.

  • In the object’s cleanup function the same spinlock only needs to be held right around when the weak references are removed from the lookup structure. The lock critical section is no longer needlessly enlarged, we’re back to protecting data instead of code.

With both together the locking no longer leaks beyond the lookup structure and its associated code, unlike with kref_put_mutex() and similar approaches. Thankfully kref_get_unless_zero() has become the much more popular approach since it was added 10 years ago!
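
For completeness, the cleanup-side counterpart in the same schematic style only takes the lookup lock around the removal of the weak reference:

obj_release(kref)
{
	obj = container_of(kref, struct obj, kref);

	/* only the weak reference removal needs the lookup lock */
	xa_lock();
	xa_erase(obj->id);
	xa_unlock();

	/* no locks held for the actual cleanup */
	kfree(obj);
}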

Locking Antipattern: Confusing Object Lifetime and Data Consistency

We’ve now seen a few examples where the “no locking” patterns from level 0 collide in annoying ways when more locking is added, to the point where we seem to violate the principle to protect data, not code. It’s worth looking at this a bit more closely, since we can generalize what’s going on here to a fairly high-level antipattern.

The key insight is that the “no locking” patterns all rely on memory barrier primitives in disguise, not classic locks, to synchronize access between multiple threads. In the case of the single owner pattern there might also be blocking semantics involved, when the next owner needs to wait for the previous owner to finish processing first. These are functions like flush_work() or the various wait functions like wait_event() or wait_completion().

Calling these barrier functions while holding locks commonly leads to issues:

  • Blocking functions like flush_work() pull in every lock or other dependency that the work we wait on - or more generally, any of the previous owners of the object - needed, as a so-called cross-release dependency. Unfortunately lockdep does not understand these natively, and the usual tricks to add manual annotations have severe limitations. There’s work ongoing to add cross-release dependency tracking to lockdep, but nothing looks anywhere near ready to merge. Since these dependency chains can be really long and get ever longer when more code is added to a worker - dependencies are pulled in even if only a single lock is held at any given time - this can quickly become a nightmare to untangle.

  • Often the requirement to hold a lock over these barrier type functions comes from the fact that otherwise the object would disappear, or undergo some serious confusion about its lifetime state - not just whether it’s still alive or getting destroyed, but also who exactly owns it or whether it’s maybe a resurrected zombie representing a different instance now. This encourages the lock to morph from a “protects some specific data” design to a “protects specific code from running” design, leading to all the code maintenance issues discussed in the protect data, not code principle.

For these reasons try as hard as possible to not hold any locks, or as few as feasible, when calling any of these memory-barriers-in-disguise functions used to manage object lifetime or ownership in general. The antipattern here is abusing locks to fix lifetime issues. We have seen two specific instances thus far:

  • Holding the main object lock around cancel_work_sync() or flush_work() in the async processing pattern, instead of splitting the shared state out under its own nested sublock.

  • Holding the lock protecting the weak reference over the entire cleanup function with kref_put_mutex(), instead of using kref_get_unless_zero() and only taking that lock around the removal of the weak reference itself.

We will see some more, but the antipattern holds in general as a source of troubles.

Level 2.5: Splitting Locks for Performance Reasons

We’ve looked at a pile of functional reasons for complicating the locking design, but sometimes you need to add more fine-grained locking for performance reasons. This is already getting dangerous, because it’s very tempting to tune some microbenchmark just because we can, or maybe delude ourselves that it will be needed in the future. Therefore only complicate your locking if:

  • You have actual real world benchmarks with workloads relevant to users that show measurable gains outside of statistical noise.

  • You’ve fully exhausted architectural changes to outright avoid the overhead, like io_uring pre-registering file descriptors locally to avoid manipulating the file descriptor table.

  • You’ve fully exhausted algorithm improvements like batching up operations to amortize locking overhead better.

Only then make your future maintenance pain guaranteed worse by applying more tricky locking than the bare minimum necessary for correctness. Still, go with the simplest approach, often converting a lock to its read-write variant is good enough.

Sometimes this isn’t enough, and you actually have to split up a lock into more fine-grained locks to achieve more parallelism and less contention among threads. Note that doing so blindly will backfire because locks are not free. When common operations still have to take most of the locks anyway, even if it’s only for short time and in strict succession, the performance hit on single threaded workloads will not justify any benefit in more threaded use-cases.

Another issue with more fine-grained locking is that often you cannot define a strict nesting hierarchy, or worse might need to take multiple locks of the same object or lock class. I’ve written previously about this specific issue, and more importantly, how to teach lockdep about lock nesting, the bad and the good ways.

One really entertaining story from the GPU subsystem, for bystanders at least, is that we really screwed this up for good by de facto allowing userspace to control the lock order of all the objects involved in an IOCTL. Furthermore, disjoint operations should actually proceed without contention. If you ever manage to repeat this feat you can take a look at the wait-wound mutexes. Or if you just want some pretty graphs, LWN has an old article about wait-wound mutexes too.

Level 3: Lockless Tricks

Do not go here wanderer!

Seriously, I have seen a lot of very fancy driver subsystem locking designs, but I have not yet found many that were actually justified. Because only real world, non-contrived performance issues can ever justify reaching for this level, and in almost all cases algorithmic or architectural fixes yield much better improvements than any kind of (locking) micro-optimization could ever hope for.

Hence this is just a long list of antipatterns, so that people who do not yet have a grumpy expression permanently chiseled into their facial structure know when they’re in trouble.

Note that this section isn’t limited to lockless tricks in the academic sense of guaranteed constant overhead forward progress, meaning no spinning or retrying anywhere at all. It’s for everything which doesn’t use standard locks like struct mutex, spinlock_t, struct rw_semaphore, or any of the others provided in the Linux kernel.

Locking Antipattern: Using RCU

Yeah RCU is really awesome and impressive, but it comes at serious costs:

  • By design, at least with standard usage, RCU elevates mixing up lifetime and consistency concerns to a virtue. rcu_read_lock() gives you both a read-side critical section and it extends the lifetime of any RCU protected object. There’s absolutely no way you can avoid that antipattern, it’s built in.

    Worse, RCU read-side critical sections nest rather freely, which means that, unlike with real locks, abusing them to keep objects alive won’t run into nasty locking inversion issues when you pull that stunt with nesting different objects or classes of objects. Using locks to paper over lifetime issues is bad, but with RCU it’s weapons-grade levels of dangerous.

  • Equally nasty, RCU practically forces you to deal with zombie objects, which breaks the reference counting pattern in annoying ways.

  • On top of all this, breaking out of RCU is costly and kinda defeats the point, and hence there’s a huge temptation to delay this as long as possible. Meaning check as many things and dereference as many pointers under RCU protection as you can, before you take a real lock or upgrade to a proper reference with kref_get_unless_zero().

    Unless extreme restraint is applied, this results in RCU leading you towards locking antipatterns. Worse, RCU tends to spread them to ever more objects and ever more fields within them.

All told, all that freely using RCU achieves is proving that there really is no bottom on the code maintainability scale. It is not a great day when your driver dies in synchronize_rcu() and lockdep has no idea what’s going on, and I’ve seen such days.

Personally I think that in driver subsystems the most that’s still a legit and justified use of RCU is object lookup with struct xarray and kref_get_unless_zero(), with cleanup handled entirely by kfree_rcu(). Anything more and you’re very likely chasing a rabbit down its hole and have not realized it yet.

Locking Antipattern: Atomics

Firstly, Linux atomics have two annoying properties just to start:

  • Unlike e.g. C++ atomics in userspace they are unordered or weakly ordered by default in a lot of cases. A lot of people are surprised by that, and then have an even harder time understanding the memory barriers they need to sprinkle over the code to make it work correctly.

  • Worse, many atomic functions neither operate on the atomic types atomic_t and atomic64_t nor have atomic anywhere in their names, and so pose serious pitfalls to reviewers:

    • READ_ONCE() and WRITE_ONCE() for volatile loads and stores.
    • cmpxchg() and the various variants of atomic exchange with or without a compare operation.
    • Atomic bitops like set_bit() are all atomic. Worse, their non-atomic variants have the __set_bit() double underscores to scare you away from using them, despite these being the ones you really want by default.

Those are a lot of unnecessary trap doors, but the real bad part is what people tend to build with atomic instructions:

  • I’ve seen at least three different, incomplete and ill-defined reimplementations of read write semaphores without lockdep support. Reinventing completions is also pretty popular. Worse, the folks involved didn’t realize what they built. That’s an impressive violation of the “Make it Correct” principle.

  • It seems very tempting to build terrible variations of the “no locking” patterns. It’s very easy to screw them up by extending them in a bad way, e.g. reference counting with weak reference or RCU optimizations done wrong very quickly leads to a complete mess. There are reasons why you should never deviate from these.

  • What looks innocent are statistical counters with atomics, but almost always there’s already a lock you could take instead of unordered counter updates. Often this results in better code organization to boot, since the statistics for a list and its manipulation are then closer together. There are some exceptions with real performance justification; a recent one I’ve seen is memory shrinkers, where you really want your shrinker->count_objects() to not have to acquire any locks. Otherwise in a memory-intensive workload all threads are stuck on the one thread doing actual reclaim holding the same lock in your shrinker->scan_objects() function.

In short, unless you’re actually building a new locking or synchronization primitive in the core kernel, you most likely do not want to get seen even looking at atomic operations as an option.

Locking Antipattern: preempt/local_irq/bh_disable() and Friends …

This one is simple: Lockdep doesn’t understand them. The real-time folks hate them. Whatever it is you’re doing, use proper primitives instead, and at least read up on the LWN coverage on why these are problematic and what to do instead. If you need some kind of synchronization primitive - maybe to avoid the lifetime vs. consistency antipattern pitfalls - then use the proper functions for that like synchronize_irq().

Locking Antipattern: Memory Barriers

Or more often, lack of them, incorrect or imbalanced use of barriers, badly or wrongly or just not at all documented memory barriers, or …

Fact is that the vast majority of kernel hackers, and even more so driver people, have no useful understanding of the Linux kernel’s memory model, and should never be caught entertaining use of explicit memory barriers in production code. Personally I’m pretty good at spotting holes, but I’ve had to learn the hard way that I’m not even close to being able to positively prove correctness. And for better or worse, nothing short of that tends to cut it.

For a still fairly cursory discussion read the LWN series on lockless algorithms. If the code comments and commit message are anything less rigorous than that it’s fairly safe to assume there’s an issue.

Now don’t get me wrong, I love to read an article or watch a talk by Paul McKenney on RCU like anyone else to get my brain fried properly. But aside from extreme exceptions this kind of maintenance cost has simply no justification in a driver subsystem. At least unless it’s packaged in a driver hacker proof library or core kernel service of some sorts with all the memory barriers well hidden away where ordinary fools like me can’t touch them.

Closing Thoughts

I hope you enjoyed this little tour of progressively more worrying levels of locking engineering, with really just one key take away:

Simple, dumb locking is good locking, since with that you have a fighting chance to make it correct locking.

Thanks to Daniel Stone and Jason Ekstrand for reading and commenting on drafts of this text.

July 27, 2022

For various reasons I spent way too much of the last two years looking at code with terrible locking design and trying to rectify it, instead of a lot more actual building of cool things. Symptomatic that the last post here on my neglected blog is also a rant on lockdep abuse.

I tried to distill all the lessons learned into some training slides, and this two-part series is the writeup of the same. There are some GPU-specific rules, but I think the key points should at least apply to kernel drivers in general.

The first part here lays out some principles, the second part builds a locking engineering design pattern hierarchy from the easiest to understand and maintain to the most nightmare-inducing approaches.

Also, by locking engineering I mean the general problem of protecting data structures against concurrent access by multiple threads, trying to ensure that each thread reads a sufficiently consistent view of the data and that the updates it commits won’t result in confusion. What exactly “sufficiently consistent” means of course depends highly upon the precise requirements, but figuring out these kinds of questions is out of scope for this little series here.

Priorities in Locking Engineering

Designing a correct locking scheme is hard, validating that your code actually implements your design is harder, and then debugging when - not if! - you screwed up is even worse. Therefore the absolute most important rule in locking engineering, at least if you want to have any chance at winning this game, is to make the design as simple and dumb as possible.

1. Make it Dumb

Since this is the key principle the entire second part of this series will go through a lot of different locking design patterns, from the simplest and dumbest and easiest to understand, to the most hair-raising horrors of complexity and trickiness.

Meanwhile let’s continue to look at everything else that matters.

2. Make it Correct

Since simple doesn’t necessarily mean correct, especially when transferring a concept from design to code, we need guidelines. On the design front the most important one is to design for lockdep, and not fight it, for which I already wrote a full length rant. Here I will only go through the main lessons: Validating locking by hand against all the other locking designs and nesting rules the kernel has overall is nigh impossible, extremely slow, something only a few people can do with any chance of success, and hence in almost all cases a complete waste of time. We need tools to automate this, and in the Linux kernel this is lockdep.

Therefore if lockdep doesn’t understand your locking design your design is at fault, not lockdep. Adjust accordingly.

A corollary is that you actually need to teach lockdep your locking rules, because otherwise different drivers or subsystems will end up with de facto incompatible nesting and dependencies. Which, as long as you never exercise them on the same kernel boot-up, much less the same machine, won’t make lockdep grumpy. But it will make maintainers very much question why they are doing what they’re doing.

Hence at driver/subsystem/whatever load time, when CONFIG_LOCKDEP is enabled, take all key locks in the correct order. One example for this relevant to GPU drivers is in the dma-buf subsystem.
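A minimal sketch of what that priming can look like, with made-up lock names and an assumed nesting rule of lock_a before lock_b (not the actual dma-buf code):

#include <linux/module.h>
#include <linux/mutex.h>

static DEFINE_MUTEX(mydrv_lock_a);
static DEFINE_MUTEX(mydrv_lock_b);

static int __init mydrv_init(void)
{
	/* Take the locks once in the documented order so lockdep records
	 * the dependency, even if no real code path happens to exercise
	 * it on this particular boot. */
	if (IS_ENABLED(CONFIG_LOCKDEP)) {
		mutex_lock(&mydrv_lock_a);
		mutex_lock(&mydrv_lock_b);
		mutex_unlock(&mydrv_lock_b);
		mutex_unlock(&mydrv_lock_a);
	}
	return 0;
}
module_init(mydrv_init);

MODULE_LICENSE("GPL");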

In the same spirit, at every entry point to your library or subsystem, or anything else big, validate that the callers hold up the locking contract with might_lock(), might_sleep(), might_alloc() and all the variants and more specific implementations of this. Note that there’s a huge overlap between locking contracts and calling context in general (like interrupt safety, or whether memory allocation is allowed to call into direct reclaim), and since all these functions compile away to nothing when debugging is disabled there’s really no cost in sprinkling them around very liberally.
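A hypothetical entry point with such contract annotations might look like this (the structure and function names are made up):

#include <linux/gfp.h>
#include <linux/lockdep.h>
#include <linux/mutex.h>
#include <linux/sched/mm.h>

struct mydrv_device {
	struct mutex lock;
};

int mydrv_do_work(struct mydrv_device *dev)
{
	might_sleep();			/* callers must be able to block */
	might_lock(&dev->lock);		/* we may take dev->lock */
	might_alloc(GFP_KERNEL);	/* we may allocate and hit direct reclaim */

	/* ... actual work ... */
	return 0;
}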

On the implementation and coding side there’s a few rules of thumb to follow:

  • Never invent your own locking primitives, you’ll get them wrong, or at least build something that’s slow. The kernel’s locks are built and tuned by people who’ve done nothing else their entire career, you won’t beat them except in bug count, and that by a lot.

  • The same holds for synchronization primitives - don’t build your own with a struct wait_queue_head, or worse, hand-roll your own wait queue. Instead use the most specific existing function that provides the synchronization you need, e.g. flush_work() or flush_workqueue() and the enormous pile of variants available for synchronizing against scheduled work items.

    A key reason here is that very often these more specific functions already come with elaborate lockdep annotations, whereas anything hand-rolled tends to require much more manual design validation.

  • Finally at the intersection of “make it dumb” and “make it correct”, pick the simplest lock that works, like a normal mutex instead of a read-write semaphore. This is because in general, stricter rules catch bugs and design issues quicker, hence picking a very fancy “anything goes” locking primitive is a bad choice.

    As another example, pick spinlocks over mutexes because spinlocks are a lot more strict in what code they allow in their critical section. Hence there’s much less risk that you put something silly in there by accident and close a dependency loop that could lead to a deadlock.

3. Make it Fast

Speed doesn’t matter if you don’t understand the design anymore in the future, you need simplicity first.

Speed doesn’t matter if all you’re doing is crashing faster. You need correctness before speed.

Finally speed doesn’t matter where users don’t notice it. If you micro-optimize a path that doesn’t even show up in real world workloads users care about, all you’ve done is wasted time and committed to future maintenance pain for no gain at all.

Similarly, optimizing code paths which should never be run in the first place, instead of improving your design, is not worth it. This holds especially for GPU drivers, where the real application interfaces are OpenGL, Vulkan or similar, and there’s an entire driver in the userspace side - the right fix for performance issues is very often to radically update the contract and sharing of responsibilities between the userspace and kernel driver parts.

The big example here is GPU address patch list processing at command submission time, which was necessary for old hardware that completely lacked any useful concept of a per-process virtual address space. But that has changed, which means virtual addresses can stay constant, while the kernel can still freely manage the physical memory by manipulating pagetables, like on the CPU. Unfortunately one driver in the DRM subsystem instead spent easily an engineering decade of effort to tune relocations, write lots of testcases for the resulting corner cases in the multi-level fastpath fallbacks, and even more time handling the impressive amounts of fallout in the form of bugs and future headaches due to the resulting unmaintainable code complexity …

In other subsystems where the kernel ABI is the actual application contract these kind of design simplifications might instead need to be handled between the subsystem’s code and driver implementations. This is what we’ve done when moving from the old kernel modesetting infrastructure to atomic modesetting. But sometimes no clever tricks at all help and you only get true speed with a radically revamped uAPI - io_uring is a great example here.

Protect Data, not Code

A common pitfall is to design locking by looking at the code, perhaps just sprinkling locking calls over it until it feels like it’s good enough. The right approach is to design locking for the data structures, which means specifying for each structure or member field how it is protected against concurrent changes, and how the necessary amount of consistency is maintained across the entire data structure with rules that stay invariant, irrespective of how code operates on the data. Then roll it out consistently to all the functions, because the code-first approach tends to have a lot of issues:

  • A code centric approach to locking often leads to locking rules changing over the lifetime of an object, e.g. with different rules for a structure or member field depending upon whether an object is in active use, maybe just cached or undergoing reclaim. This is hard to teach to lockdep, especially when the nesting rules change for different states. Lockdep assumes that the locking rules are completely invariant over the lifetime of the entire kernel, not just over the lifetime of an individual object or structure even.

    Starting from the data structures on the other hand encourages that locking rules stay the same for a structure or member field.

  • Locking design that changes depending upon the code that can touch the data needs either complicated documentation entirely separate from the code - with a high risk of becoming stale - or the explanations, if there are any, end up sprinkled over the various functions, which means reviewers need to re-read the entire relevant chunks of the code base to make sure they don’t miss an odd corner case.

    With data structure driven locking design there’s a perfect, because unique, place to document the rules - in the kerneldoc of each structure or member field (see the sketch after this list).

  • A consequence for code review is that to recheck the locking design for a code-first approach every function and flow has to be checked against all others, and changes need to be checked against all the existing code. If this is not done you might miss a corner case where the locking falls apart with a race condition or could deadlock.

    With a data first approach to locking changes can be reviewed incrementally against the invariant rules, which means review of especially big or complex subsystems actually scales.

  • When facing a locking bug it’s tempting to try and fix it just in the affected code. By repeating that often enough a locking scheme that protects data acquires code specific special cases. Therefore locking issues always need to be first mapped back to new or changed requirements on the data structures and how they are protected.
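As a sketch of what that kerneldoc can look like, purely illustrative with all names made up:

#include <linux/list.h>
#include <linux/mutex.h>

/**
 * struct mydrv_buffer - example of documenting locking rules in kerneldoc
 * @lock: protects @state and @refcount
 * @state: current buffer state, protected by @lock
 * @refcount: users of this buffer, protected by @lock
 * @node: entry in the owning device's buffer list, protected by that
 *	device's list lock, not by @lock
 */
struct mydrv_buffer {
	struct mutex lock;
	int state;
	int refcount;
	struct list_head node;
};

The point is that each field’s protection rule is written down right next to the field, and stays invariant for the object’s entire lifetime.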

The big antipattern of how you end up with code centric locking is to protect an entire subsystem (or worse, a group of related subsystems) with a single huge lock. The canonical example was the big kernel lock BKL, that’s gone, but in many cases it’s just replaced by smaller, but still huge locks like console_lock().

This results in a lot of long term problems when trying to adjust the locking design later on:

  • Since the big lock protects everything, it’s often very hard to tell what it does not protect. Locking at the fringes tends to be inconsistent, and due to that its coverage tends to creep ever further when people try to fix bugs where a given structure is not consistently protected by the same lock.

  • Also often subsystems have different entry points, e.g. consoles can be reached through the console subsystem directly, through vt, tty subsystems and also through an enormous pile of driver specific interfaces with the fbcon IOCTLs as an example. Attempting to split the big lock into smaller per-structure locks pretty much guarantees that different entry points have to take the per-object locks in opposite order, which often can only be resolved through a large-scale rewrite of all impacted subsystems.

    Worse, as long as the big subsystem lock continues to be in use no one is spotting these design issues in the code flow. Hence they will slowly get worse instead of the code moving towards a better structure.

For these reasons big subsystem locks tend to live way past their justified usefulness until code maintenance becomes nigh impossible: Because no individual bugfix is worth the task to really rectify the design, but each bugfix tends to make the situation worse.

From Principles to Practice

Stay tuned for next week’s installment, which will cover what these principles mean when applied in practice: going through a large pile of locking design patterns from the most desirable to the most hair-raisingly complex.

July 24, 2022

Steady progress has been made since :tada: Turnip is Vulkan 1.1 Conformant :tada:. We now support GL 4.6 via Zink, implemented a lot of extensions, and are close to Vulkan 1.3 conformance.

Support of real-world games is also looking good, here is a video of Adreno 660 rendering “The Witcher 3”, “The Talos Principle”, and “OMD2”:

All of them have reasonable frame rates. However, there was a bit of “cheating” involved. Only “The Talos Principle” was fully running on the development board (via box64); the other two games were rendered in real time on the Adreno GPU but were run on an x86-64 laptop with their VK commands being streamed to the dev board. You can read about this method in my post “Testing Vulkan drivers with games that cannot run on the target device”.

The video was captured directly on the device via OBS with obs-vkcapture, which worked surprisingly well after fighting a bunch of issues due to the lack of a binary package for it and a somewhat dated Ubuntu installation.

Zink (GL over Vulkan)

A number of extensions were implemented that are required for Zink to support higher GL versions. As of now Turnip supports OpenGL 4.6 via Zink, and while not yet conformant, only a handful of GL CTS tests are failing. For perspective, Freedreno (our GL driver for Adreno) supports only OpenGL 3.3.

For Zink adventures and profound post titles check out Mike Blumenkrantz’s awesome blog supergoodcode.com

If you are interested in the Zink over Turnip bring-up in particular, you should read:

Low Resolution Z improvements

A major improvement for low resolution Z optimization (LRZ) was recently made in Turnip, read about it in the previous post of mine: LRZ on Adreno GPUs

Extensions

Anyway, since the last update Turnip supports many more extensions (in no particular order):

What about Vulkan conformance?

[Screenshot of mesamatrix.net showing how many extensions are left for Turnip to implement to be Vulkan 1.3 conformant, from mesamatrix.net/#Vulkan1.3]

For Vulkan 1.3 conformance there are only a few extensions left to implement. The only major ones are VK_KHR_dynamic_rendering and VK_EXT_inline_uniform_block, both required for Vulkan 1.3. VK_KHR_dynamic_rendering is currently being reviewed and the foundation for VK_EXT_inline_uniform_block was recently merged.

That’s all for today!

July 20, 2022
$ vlan
Device not provided

    vlan $DEV $VLAN $SUBNET

    vlan eth0 42 10.31.155.1/27

This is achieved by pasting the below function into your .bashrc / .zshrc and issuing a source .bashrc or source .zshrc correspondingly.

function vlan {
    DEV=$1
    VLAN=$2
    ADDR=$3

    HELP="

    vlan \$DEV \$VLAN \$SUBNET

    vlan eth0 42 10.31.155.1/27
"

    if [ -z "$DEV" ]; then
        echo "Device not provided"
        echo "$HELP"
        return 1
    fi

    ip link | grep "${DEV}: " >/dev/null 2>&1
    if [ $? -ne 0 ]; then
        echo "\"$DEV\" is not a valid device"
        echo "$HELP"
        return 1
    fi

    if [ -z "$VLAN" ]; then
        echo "VLAN not provided"
        echo "$HELP"
        return 1
    fi
    REGEX='^[0-9]+$'
    if ! [[ $VLAN =~ $REGEX ]] ; then
        echo "\"$VLAN\" is not a number" >&2; exit 1
        echo …

The software Vulkan renderer in Mesa, lavapipe, achieved official Vulkan 1.3 conformance. The official entry in the table is here. We can now remove the nonconformant warning from the driver. Thanks to everyone involved!

July 19, 2022

The Future Comes

Anyone else remember way back years ago when I implemented descriptor caching for zink because I couldn’t even hit 60 fps in Unigine Heaven due to extreme CPU bottlenecking? Also because I got to make incredible flowcharts to post here?

Good times.

Simpler times.

But now times have changed.

The New Perf

As recently as a year ago I blogged about new descriptor infrastructure I was working on, specifically “lazy” descriptors. This endeavor was the culmination of a month or two of wild exploration into the problem space, trying out a number of less successful options.

But it turns out the most performant option was always going to be the stupidest one: create new descriptors on every draw and jam them into the GPU.

But why is this the case?

The answer lies in an extension that has become much more widely adopted, VK_KHR_descriptor_update_template. This extension allows applications to allocate a template that the driver can use to jam those buffers/images into descriptor sets much more efficiently than the standard vkUpdateDescriptorSets. When combined with extreme bucket allocating, this methodology ends up being more performant than caching.
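To give a rough idea of the API shape (a hedged sketch, not zink’s actual code; the extension is core in Vulkan 1.1 so the non-KHR entry points are used, and all handles are assumed to already exist):

#include <stddef.h>
#include <vulkan/vulkan.h>

static void update_with_template(VkDevice device,
                                 VkDescriptorSetLayout set_layout,
                                 VkDescriptorSet set,
                                 VkBuffer buffer)
{
	/* Local struct holding the data the template reads from. */
	struct my_data {
		VkDescriptorBufferInfo ubo;
	};

	/* Describe once how binding 0 maps to bytes in struct my_data. */
	const VkDescriptorUpdateTemplateEntry entry = {
		.dstBinding = 0,
		.descriptorCount = 1,
		.descriptorType = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER,
		.offset = offsetof(struct my_data, ubo),
		.stride = sizeof(struct my_data),
	};
	const VkDescriptorUpdateTemplateCreateInfo info = {
		.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_UPDATE_TEMPLATE_CREATE_INFO,
		.descriptorUpdateEntryCount = 1,
		.pDescriptorUpdateEntries = &entry,
		.templateType = VK_DESCRIPTOR_UPDATE_TEMPLATE_TYPE_DESCRIPTOR_SET,
		.descriptorSetLayout = set_layout,
	};
	VkDescriptorUpdateTemplate tmpl;

	vkCreateDescriptorUpdateTemplate(device, &info, NULL, &tmpl);

	/* Per update: one call instead of building VkWriteDescriptorSet arrays. */
	struct my_data data = { .ubo = { buffer, 0, VK_WHOLE_SIZE } };
	vkUpdateDescriptorSetWithTemplate(device, set, tmpl, &data);

	vkDestroyDescriptorUpdateTemplate(device, tmpl, NULL);
}

Combined with the bucket allocation mentioned above, each draw just grabs a fresh set and does a single template update.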

And sometimes by significant margins.

In Minecraft, for example, you might see a 30-50% FPS increase. In a game like Tomb Raider (2013) it’ll be closer to 10-20%.

But, most likely, there won’t be any scenario in which FPS goes down.

So wave farewell to the old code that I’ll probably delete altogether in Mesa 22.3, and embrace the new code that just works better and has been undergoing heavy testing for the past year.

Welcome to the future.

As you can probably imagine, the Linux graphics stack comprises many layers of abstractions, from the pretty little button from which you open your Proton-able AAA Steam title to the actual bytecode that runs on whatever graphics card you have installed.

These many abstractions are what allow us to have our glorious moments of being a hero (or maybe a villain, whatever you’re up to…) without even noticing what’s happening, and – most importantly – that allow game devs to make such complex games without having to worry about an awful lot of details. They’re really a marvel of engineering (!), but not by accident!

All of those abstraction layers come with a history of their own1, and it’s kinda amazing that we can even have such a smooth experience with all of those, community-powered, beautifully thought-out, moving pieces, twisting and turning in a life of their own.

But enough mystery! Let’s hop into it already, shall we?

Though in most cases we start from the bottom of the stack and build our way towards the top, here I think it makes more sense for us to build it upside down, as that’s what we’re used to interacting with.

On a funny thought experiment, maybe driver designers do live upside down, who knows…

Window servers and TTYs

Even though Linux is generally regarded as a developer/hacker OS, most modern distributions don’t require you to ever leave a graphical environment. Even if you love using the “terminal” on your distro, that’s simply a terminal emulator, which simulates using a TTY, which used to be an endpoint for interaction with a central computer that held all the resources for users. As Linux, and specifically the X server, were created during an era where computation was very much migrating away from that model of “distributed access”, you can still peek into a TTY if you want to! On most distributions, pressing CTRL+ALT+F1 (or +F2, +F3 and so on) will take you to an old-school text-only display (of course one of these TTYs will also contain the display server session you began with).

But then, it seems logical that the graphic session we’re used to is one layer above the OS itself, which only provides those TTYs and, quite likely, the ABI needed to “talk” to the OS, which is indeed the case (currently) :).

In an X session for example, the X server (or your compositor) will render its windows through this ABI, and not directly through the hardware. This is actually a recent development, as before the current graphics stack was implemented (it’s called DRI, which is short for Direct Rendering Interface) the X server used to access hardware directly.

But enough history techno-babble! How does our pretty game, then, render its stuff using a window server? The short answer is: it doesn’t!

Actually anything that wants to render elements that are independent of a window server will have to use a graphics API like OpenGL, and that’s where Mesa comes in.

Mesa and graphics APIs

First, notice that we’re already down a level: the window server shows us cute windows and deals with user input in that interface, wonderful! But then I open Minecraft, and we’re already asking for 3D objects which the window server can’t possibly handle effectively. So Mesa was introduced to provide a second route for applications to send their complex commands directly to the kernel, without the need to go through the display session’s (bloated and slow) rendering mechanisms, which are quite often single-handedly optimized to show us application interfaces (the infamous GUIs).

As a side note, the X server is also OpenGL capable, and you can see this through its glx* API commands, or even some of its utilities (glxgears is an infamous example!).

From a bird’s-eye view, our (supposed) game will try to run its 3D routines, which are written in shader-speak (for example OpenGL’s, whose language is called GLSL). Mesa handles that for us, compiling and optimizing it for our specific GPU, then sending it to be run by the kernel, just like your window server does, but for anything!

An important point to notice here is that, of course, the window server is still needed for user interaction in our game. It will query and send user commands to the application, and will also handle windowing and displaying stuff (including our game, as its window is still managed by the window server), as well as dealing with many other system-related interactions that would be a nightmare for game devs to implement.

Another important point is that writing user applications which don’t require using shaders would be a total nightmare without a display manager, as it provides many useful abstractions for that use case (which already covers >90% of the use cases for most people).

We’re getting close! Next stop is: the Linux kernel!

The DRM and KMS subsystems

The Linux kernel is, of course, comprised of many moving parts, including inside the DRM (short for Direct Rendering Manager) itself, which I cannot possibly explain in this one blog post (sorry about that!).

The DRM is addressed using ioctls (short for I/O Control), which are syscalls (short for System Call) used for device-specific control, as providing generalist syscalls would be nearly impossible – it’s more feasible to create one ioctl(device, function, parameters) and then handle it in a driver than to create 2000 mygpu_do_something_syscall(parameters) for “obvious” reasons.

For the inexperienced reader, it’s interesting to notice here that the approach of one syscall per driver-specific command would bloat the general kernel ABI with too many generally useless functions, as the large majority of them would be used for one and only one driver.

Just for reference, in my kernel I have 465 syscalls as defined in the system manual. Comparing this against a simple estimate of (drivers in the kernel) * (commands only they use) should give you some perspective on the issue.

Just use man syscalls | grep -E '.*\(2.\s+[0-9.]+' | wc -l if you want to query the syscalls defined in your kernel.

Those ioctls are then wrapped inside a libdrm library that provides a more comprehensible interface for the poor Mesa developers, which would otherwise need to keep checking every ioctl they want to use for the unusual macro names provided by the kernel’s userspace API.
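As a small, hedged illustration of that wrapping (not how Mesa itself does it, just the idea): the DRM_IOCTL_VERSION ioctl has a friendly libdrm wrapper, drmGetVersion().

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <xf86drm.h>

int main(void)
{
	/* Assumes /dev/dri/card0 exists and you have permission to open it. */
	int fd = open("/dev/dri/card0", O_RDWR);
	if (fd < 0)
		return 1;

	/* libdrm issues DRM_IOCTL_VERSION under the hood and hands us a
	 * nicely typed structure back. */
	drmVersionPtr v = drmGetVersion(fd);
	if (v) {
		printf("driver: %s %d.%d.%d\n", v->name,
		       v->version_major, v->version_minor, v->version_patchlevel);
		drmFreeVersion(v);
	}
	close(fd);
	return 0;
}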

One very shady aspect of graphics rendering that the DRM deals with is GPU memory management, and it does this through two interfaces, namely:

  • GEM – short for Graphics Execution Manager
  • TTM – short for Translation Table Maps

The older of those two is TTM, which was a generalist approach for memory management, and it provides literally everything anyone could ever hope for. That being said, TTM is regarded as too difficult to use23 as it provides a gigantic API, and very convoluted must-have features that end up being unyielding. For instance, TTM’s fencing mechanism – which is responsible for coordinating memory access between the GPU and CPU, just like semaphores if you’re used to them at all – has a very odd interface. We could also talk about TTM’s general inefficiencies which have been noted time and time again, as well as its “wicked ways” of doing things, which abuse the DMA (short for Direct Memory Access) API for one (check out König’s talk for more4).

Just like evolution, an alternative to TTM had to come along, and that was our friend GEM, Intel’s interface for memory management. It’s much easier to use by comparison, but also much more simplified and thus only serves their specific use case – that is, integrated video cards5 –, as it’s limited to addressing memory shared by both GPU and CPU (no discrete video card support at all, really). It also won’t handle any fencing, and simply “waits” for the GPU to finish its thing before moving on, which is a no-no for those beefy discrete GPUs.

Then, as anyone sane would much rather be dealing with GEM, all software engineers fired themselves from other companies and went to Intel. I hear they’re currently working hard to deprecate TTM in an integrated graphics supremacy movement. Jokes aside, what actually happened is that DRM drivers will usually implement the needed memory management (including fencing) functionality in TTM, but provide GEM-like APIs for those things, so that everyone ends up happy (except for the people implementing these interfaces, as they’re probably quite depressed).

These memory related aspects are a rabbit hole of their own, and if you’d like to have a deeper look into this, I recommend these resources:

As my mentor recently pointed out, kernel devs are working hard on making TTM nicer all around, as it’s used by too many drivers to simply, paraphrasing König, “be set on fire” and start a memory interface anew. If you want some perspective, take a look at Christian König’s amazing talk1, where he talks about TTM from the viewpoint of a maintainer.

Notice, however, that the DRM is only responsible for graphics rendering; display mode-setting (that is, basically, setting resolution and refresh rate) is done in a separate (but related) subsystem, called KMS (short for Kernel Mode Setting).

The KMS logically separates various aspects of image transmission, such as

  • connectors – basically outputs on your GPU
  • CRTCs – representing a controller that reads the final information (known as scanout buffer or framebuffer) to send to those connectors
  • encoders – how that signal should be transmitted through the connector
  • planes – which feed the CRTCs framebuffers with data
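As a small, hedged illustration of these objects from userspace (error handling kept minimal, and /dev/dri/card0 assumed to be the right node):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <xf86drmMode.h>

int main(void)
{
	int fd = open("/dev/dri/card0", O_RDWR);
	if (fd < 0)
		return 1;

	/* Ask KMS for its resources, then walk the connectors (outputs). */
	drmModeRes *res = drmModeGetResources(fd);
	for (int i = 0; res && i < res->count_connectors; i++) {
		drmModeConnector *conn = drmModeGetConnector(fd, res->connectors[i]);
		if (!conn)
			continue;
		printf("connector %u: %sconnected\n", conn->connector_id,
		       conn->connection == DRM_MODE_CONNECTED ? "" : "dis");
		drmModeFreeConnector(conn);
	}
	if (res)
		drmModeFreeResources(res);
	close(fd);
	return 0;
}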

For a quick overview, check out this image:

[Diagram: the Linux graphics stack]

I can already hear you say

OMG Isinya, that was a hell of a lot of information 😵.

But hey, you just read it through (what a nerd)! On a following post I’ll talk about testing the kernel pieces of this puzzle, hope to see you then :).

1

Which I might even talk about in another post – excuse me, but that requires a little too much research for the time I have currently.

5

The attentive reader might have thought about Intel’s discrete video cards at this point, and as a matter of fact Intel is actually working with TTM to support those products. See: phoronix.com - Linux 5.14 enabling Intel graphics TTM usage for their dGPUs

July 15, 2022

Again.

You know what I’m about to talk about.

You knew it as soon as you opened up the page.

I’ve said I was done with it a number of times, but deep down we all knew that was a lie.

Let’s talk about XFB.

XFB and Zink: Recap

For newcomers to the blog, Zink has two methods of emitting streamout info:

  • inlined emission, where the shader output variable and XFB data are written simultaneously
  • explicit emission, where the shader output variable is written and then XFB data is written later with its own explicit variables

The former is obviously the better option since it’s simpler. But also it has another benefit: it doesn’t need more variables. On the surface, it seems like this should just be the same as the first reason, namely that I don’t need to run some 300 line giga-function to wrangle the XFB outputs after the shader has ended.

There shouldn’t be any other reason. I’ve got the shader. I tell it to write some XFB data. Everything works as expected.

I know this.

You know this.

But then there’s someone we didn’t consult, isn’t there.

vulkan.png

Obviously.

And when we do consult the spec, this seemingly-benign restriction is imposed:

VUID-StandaloneSpirv-Location-04916

The Location decorations must be used on user-defined variables

So any user-defined variable must have a location. Seems fine. Until that restriction applies to the explicit XFB outputs. The ones that are never counted as inter-stage IO and are eliminated by the Vulkan driver’s compiler for all purposes except XFB. And since locations are consumed, the restrictions for locations apply: maxVertexOutputComponents, maxTessellationEvaluationOutputComponents, and maxGeometryOutputComponents restrict the maximum number of locations that can be used.

And surely that would never be a problem.

Narrator: It was.

It’s XFB, so obviously it’s a problem. The standard number of locations that can be relied upon for user variables is 32, which means a total of 128 components can be used for XFB. CTS specifically hits this in the Dragonball Z region of the test suite (Enhanced Layouts, for those of you who don’t speak GLCTS) with shaders that increase in complexity at a geometric rate.

Let’s take a look at one of the simpler ones out of the 800ish cases in KHR-Single-GL46.enhanced_layouts.xfb_struct_explicit_location:

#version 430 core
#extension GL_ARB_enhanced_layouts : require

layout(isolines, point_mode) in;

struct TestStruct {
   dmat3x4 a;
   double b;
   float c;
   dvec2 d;
};
struct OuterStruct {
    TestStruct inner_struct_a;
    TestStruct inner_struct_b;
};
layout (location = 0, xfb_offset = 0) flat out OuterStruct goku;

layout(std140, binding = 0) uniform Goku {
    TestStruct uni_goku;
};

void main()
{

    goku.inner_struct_a = uni_goku;
    goku.inner_struct_b = uni_goku;
}

This here is a tessellation evaluation shader that uses XFB on a very complex output type. Zink’s ability to inline such types is limited, which means Goku is about to throw that spirit bomb and kill off the chances of a clean CTS run.

The reasoning is how the XFB data is emitted. Rather than get a simple piece of shader info that says “output this entire struct with XFB”, Gallium instead provides the incredibly helpful struct pipe_stream_output_info data type:

/**
 * A single output for vertex transform feedback.
 */
struct pipe_stream_output
{
   unsigned register_index:6;  /**< 0 to 63 (OUT index) */
   unsigned start_component:2; /** 0 to 3 */
   unsigned num_components:3;  /** 1 to 4 */
   unsigned output_buffer:3;   /**< 0 to PIPE_MAX_SO_BUFFERS */
   unsigned dst_offset:16;     /**< offset into the buffer in dwords */
   unsigned stream:2;          /**< 0 to 3 */
};

/**
 * Stream output for vertex transform feedback.
 */
struct pipe_stream_output_info
{
   unsigned num_outputs;
   /** stride for an entire vertex for each buffer in dwords */
   uint16_t stride[PIPE_MAX_SO_BUFFERS];

   /**
    * Array of stream outputs, in the order they are to be written in.
    * Selected components are tightly packed into the output buffer.
    */
   struct pipe_stream_output output[PIPE_MAX_SO_OUTPUTS];
};

Each output variable is deconstructed into the number of 32-bit components that it writes (num_components) with an offset (start_component) and a zero-indexed “location” (register_index) which increments sequentially for all the variables output by the shader.

In short, it has absolutely no relation to the actual shader variables, and matching them back up is a considerable amount of work.

But this is XFB, so it was always going to be terrible no matter what.

Problems

Going back to the location disaster again, you might be wondering where the problem lies.

Let’s break it down with some location analysis. According to GLSL rules, here’s how locations are assigned for the output variables:

struct TestStruct {
   dmat3x4 a; <--this is effectively dvec3[4];
                 a dvec3 consumes 2 locations
                 so 4 * 2 is 8, so this consumes locations [0,7]
   double b; <--location 8
   float c; <--location 9
   dvec2 d; <--location 10
};
struct OuterStruct {
    TestStruct inner_struct_a; <--locations [0,10]
    TestStruct inner_struct_b; <--locations [11,21]
};

In total, and assuming I did my calculations right, 22 locations are consumed by this struct.

And this is zink, so point size must always be emitted, which means 23 locations are now consumed. This leaves 32 - 23 = 9 locations remaining for XFB.

Given that XFB works at the component level with tight packing, this means a minimum of 2 * (3 * 4) + 2 + 1 + 2 * 2 = 31 components are needed for each struct, but there’s two structs, which means it’s 62 components. And ignoring all other rules for a moment, it’s definitely true that locations are assigned in vec4 groupings, so a minimum of ceil(62 / 4) = 16 locations are needed to do explicit emission of this type.

But only 9 remain.

Whoops.

Solutions?

There’s a lot of ways to fix this.

The “best” way to fix it would be to improve/overhaul the inlining detection to ensure that crazy types like this are always inlined.

That’s really hard to do though, and the inlining code is already ridiculously complex to the point where I’d prefer not to ever touch it again to avoid jobbing like Vegeta in any future XFB battles.

The “easy” way to fix it would be to make the existing code work in this scenario without changing it. This is the approach I took, namely to decompose the struct before inlining so that the inline analysis has an easier time and can successfully inline all (or most) of the outputs.

It’s much simpler (and more accurate) to inline a shader that uses an output interface with no blocks like:

dmat3x4 a;
double b;
float c;
dvec2 d;
dmat3x4 a2;
double b2;
float c2;
dvec2 d2;

Thus the block splitting pass was created:


static bool
split_blocks(nir_shader *nir)
{
   bool progress = false;
   bool changed = true;
   do {
      progress = false;
      nir_foreach_shader_out_variable(var, nir) {
         const struct glsl_type *base_type = glsl_without_array(var->type);
         nir_variable *members[32]; //can't have more than this without breaking NIR
         if (!glsl_type_is_struct(base_type))
            continue;
         if (!glsl_type_is_struct(var->type) || glsl_get_length(var->type) == 1)
            continue;
         if (glsl_count_attribute_slots(var->type, false) == 1)
            continue;
         unsigned offset = 0;
         for (unsigned i = 0; i < glsl_get_length(var->type); i++) {
            members[i] = nir_variable_clone(var, nir);
            members[i]->type = glsl_get_struct_field(var->type, i);
            members[i]->name = (void*)glsl_get_struct_elem_name(var->type, i);
            members[i]->data.location += offset;
            offset += glsl_count_attribute_slots(members[i]->type, false);
            nir_shader_add_variable(nir, members[i]);
         }
         nir_foreach_function(function, nir) {
            bool func_progress = false;
            if (!function->impl)
               continue;
            nir_builder b;
            nir_builder_init(&b, function->impl);
            nir_foreach_block(block, function->impl) {
               nir_foreach_instr_safe(instr, block) {
                  switch (instr->type) {
                  case nir_instr_type_deref: {
                  nir_deref_instr *deref = nir_instr_as_deref(instr);
                  if (!(deref->modes & nir_var_shader_out))
                     continue;
                  if (nir_deref_instr_get_variable(deref) != var)
                     continue;
                  if (deref->deref_type != nir_deref_type_struct)
                     continue;
                  nir_deref_instr *parent = nir_deref_instr_parent(deref);
                  if (parent->deref_type != nir_deref_type_var)
                     continue;
                  deref->modes = nir_var_shader_temp;
                  parent->modes = nir_var_shader_temp;
                  b.cursor = nir_before_instr(instr);
                  nir_ssa_def *dest = &nir_build_deref_var(&b, members[deref->strct.index])->dest.ssa;
                  nir_ssa_def_rewrite_uses_after(&deref->dest.ssa, dest, &deref->instr);
                  nir_instr_remove(&deref->instr);
                  func_progress = true;
                  break;
                  }
                  default: break;
                  }
               }
            }
            if (func_progress)
               nir_metadata_preserve(function->impl, nir_metadata_none);
         }
         var->data.mode = nir_var_shader_temp;
         changed = true;
         progress = true;
      }
   } while (progress);
   return changed;
}

This is a simple pass that does three things:

  • walks over the shader outputs and, for every struct-type variable, splits out all the struct members into new variables at their designated location
  • rewrites all derefs of the struct variable’s members to instead access the new, non-struct variables
  • deletes the original variable and any instructions referencing it

The result is that it’s much rarer to need explicit XFB emission, and it’s generally easier to debug XFB problems involving shaders with struct blocks since the blocks will be eliminated.

Now as long as Goku doesn’t have any senzu beans remaining, I think this might actually be the last post I ever make about XFB.

Stay tuned next week for a more interesting and exciting post. Maybe.

July 13, 2022

Hi!

I’ve continued working on my IRC suite this month. Two of our extensions have been accepted in IRCv3: read-marker synchronizes read markers between multiple devices belonging to the same user, and channel-context adds a machine-readable tag to indicate the context of a message. Some other server and client developers have already implemented them!

soju has gained a few quality-of-life features. Thanks to gildarts, there is a new channel update -detached flag to attach or detach a channel, a new contrib/migrate-db script to migrate the database between backends, users can now delete their own account, and the user password hashes are upgraded when logging in. Additionally, read markers are broadcast using Web Push (to dismiss notifications when a message is read on another client), users can now set a default nickname to use for all networks, and the logic to regain the configured nick should work on servers missing MONITOR support (so I should no longer be stuck with “emersion_” on OFTC).

Goguma now supports irc:// URLs, so it should be easier to click on these links in various project pages. If the user hasn’t joined the target channel or network yet, a confirmation dialog will be displayed. In the network settings page, a new button opens a UI to authenticate on the IRC network. A ton of other minor fixes and improvements have been pushed as well.

I’ve released gamja 1.0 beta 6. This new version adds a settings dialog with options to customize how timestamps and chat events (join, part, and so on) are displayed.

In Wayland news, new Wayland and wayland-protocols releases have been shipped. The Wayland release includes an axis_value120 event for high-resolution wheel scrolling, the wayland-protocols releases includes the single-pixel-buffer and the xdg-shell wm_capabilities event I’ve talked about last month. I’ve posted a new color-representation protocol extension to improve YUV buffer presentation.

Now for the mixed bag of miscellaneous updates. mako 1.7 has been released, it includes support for multiple modes, which should make it possible to e.g. create a light/dark mode and do-not-disturb mode without conflicts. kimchi now supports live reloading its configuration file (previously a restart was necessary). dalligi now mirrors sr.ht build logs on GitLab (as seen for instance in this wlroots pipeline). Last but not least, hut has gained better webhook support thanks to Thorben Günther, and more complete export functionality thanks to Drew DeVault.

That’s all for this month! No NPotM this time, sadly. See you!

July 12, 2022

Table of Contents

What is LRZ?

Citing official Adreno documentation:

[A Low Resolution Z (LRZ)] pass is also referred to as draw order independent depth rejection. During the binning pass, a low resolution Z-buffer is constructed, and can reject LRZ-tile wide contributions to boost binning performance. This LRZ is then used during the rendering pass to reject pixels efficiently before testing against the full resolution Z-buffer.

My colleague Samuel Iglesias did the initial reverse-engineering of this feature; for its in-depth overview you could read his great blog post Low Resolution Z Buffer support on Turnip.

Here are a few excerpts from this post describing what LRZ is:

To understand better how LRZ works, we need to talk a bit about tiled-based rendering. This is a way of rendering based on subdividing the framebuffer in tiles and rendering each tile separately.

The binning pass processes the geometry of the scene and records in a table on which tiles a primitive will be rendered. By doing this, the HW only needs to render the primitives that affect a specific tile when it is processed.

The rendering pass gets the rasterized primitives and executes all the fragment related processes of the pipeline. Once it finishes, the resolve pass starts.

Where is LRZ used then? Well, in both binning and rendering passes. In the binning pass, it is possible to store the depth value of each vertex of the geometries of the scene in a buffer as the HW has that data available. That is the depth buffer used internally for LRZ. It has lower resolution as too much detail is not needed, which helps to save bandwidth while transferring its contents to system memory.

Thanks to LRZ, the rendering pass is only executed on the fragments that are going to be visible at the end.

LRZ brings a couple of things to the table that make it interesting. One is that applications don’t need to reorder their primitives before submission to be more efficient, that is done by the HW with LRZ automatically.

Now, a year later, I returned to this feature to make some important improvements, for nitty-gritty details you could dive into Mesa MR#16251 “tu: Overhaul LRZ, implement on-GPU dir tracking and LRZ fast-clear”. There I implemented on-GPU LRZ direction tracking, LRZ reuse between renderpasses, and fast-clear of LRZ.

In this post I want to give some practical advice, based on things I learnt while reverse-engineering this feature, on how to help the driver enable LRZ. Some of it could be self-evident, some is already written in the official docs, and some cannot be found there. It should be applicable for Vulkan, GLES, and likely for Direct3D.

Do not change the direction of depth comparisons

Or rather, when writing depth - do not change the direction of depth comparisons. If the depth comparison direction is changed while writing into the depth buffer - LRZ would have to be disabled.

Why? Because if depth comparison direction is GREATER - LRZ stores the lowest depth value of the block of pixels, if direction is LESS - it stores the highest value of the block. So if direction is changed the LRZ value becomes wrong for the new direction.

A few examples:

  • :thumbsup: Going from VK_COMPARE_OP_GREATER -> VK_COMPARE_OP_GREATER_OR_EQUAL is good;
  • :x: Going from VK_COMPARE_OP_GREATER -> VK_COMPARE_OP_LESS is bad;
  • :neutral_face: From VK_COMPARE_OP_GREATER with depth write -> VK_COMPARE_OP_LESS without depth write is ok;
    • LRZ would be just temporarily disabled for VK_COMPARE_OP_LESS draw calls.

The rules could be summarized as:

  • Changing depth write direction disables LRZ;
  • For calls with a different direction but without depth write LRZ is temporarily disabled;
  • VK_COMPARE_OP_GREATER and VK_COMPARE_OP_GREATER_OR_EQUAL have the same direction;
  • VK_COMPARE_OP_LESS and VK_COMPARE_OP_LESS_OR_EQUAL have the same direction;
  • VK_COMPARE_OP_EQUAL and VK_COMPARE_OP_NEVER don’t have a direction, LRZ is temporarily disabled;
    • Surprise, your VK_COMPARE_OP_EQUAL compares don’t benefit from LRZ;
  • VK_COMPARE_OP_ALWAYS and VK_COMPARE_OP_NOT_EQUAL either temporarily or completely disable LRZ, depending on depth being written.
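To make the first rule concrete, here is a minimal Vulkan sketch of two pipelines that keep the same depth comparison direction, so LRZ can stay enabled across them (illustrative only; the rest of the pipeline state is omitted):

#include <vulkan/vulkan.h>

/* Depth pre-pass: writes depth with a GREATER-style comparison. */
const VkPipelineDepthStencilStateCreateInfo prepass_ds = {
	.sType = VK_STRUCTURE_TYPE_PIPELINE_DEPTH_STENCIL_STATE_CREATE_INFO,
	.depthTestEnable = VK_TRUE,
	.depthWriteEnable = VK_TRUE,
	.depthCompareOp = VK_COMPARE_OP_GREATER,
};

/* Main pass: same direction, no depth write - LRZ-friendly.
 * Switching this to VK_COMPARE_OP_LESS while still writing depth
 * would force LRZ off instead. */
const VkPipelineDepthStencilStateCreateInfo main_ds = {
	.sType = VK_STRUCTURE_TYPE_PIPELINE_DEPTH_STENCIL_STATE_CREATE_INFO,
	.depthTestEnable = VK_TRUE,
	.depthWriteEnable = VK_FALSE,
	.depthCompareOp = VK_COMPARE_OP_GREATER_OR_EQUAL,
};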

Simple rules for fragment shader

Do not write depth

This obviously makes resulting depth value unpredictable, so LRZ has to be completely disabled.

Note, the output values of manually written depth can be bounded by a conservative depth modifier, which for GLSL is achieved by the GL_ARB_conservative_depth extension, like this:

layout (depth_greater) out float gl_FragDepth;

However, Turnip at the moment does not consider this hint, and it is unknown if Qualcomm’s proprietary driver does.

Do not use Blending/Logic OPs/colorWriteMask

All of them make a new fragment value depend on the old fragment value. LRZ is temporarily disabled in this case.

Do not have side-effects in fragment shaders

Writing to SSBO, images, … from fragment shader forces late Z, thus it is incompatible with LRZ. At the moment Turnip completely disables LRZ when shader has such side-effects.

Do not discard fragments

Discarding fragments moves the decision whether a fragment contributes to the depth buffer to the time of fragment shader execution. LRZ is temporarily disabled in this case.

LRZ in secondary command buffers and dynamic rendering

TLDR: Since Snapdragon 865 (Adreno 650) LRZ is supported in secondary command buffers.

TLDR: LRZ works with VK_KHR_dynamic_rendering, but you may want to avoid using this extension because it isn’t nice to tilers.


Official docs state that LRZ is disabled with “Use of secondary command buffers (Vulkan)”, and on another page that “Snapdragon 865 and newer will not disable LRZ based on this criteria”.

Why?

Because up to Snapdragon 865, tracking of the direction is done on the CPU, meaning that the LRZ direction is kept in an internal renderpass object, updated and checked without any GPU involvement.

But starting from Snapdragon 865 the direction can be tracked on the GPU, which allows the driver not to know the previous LRZ direction during command buffer construction. Therefore secondary command buffers can now use LRZ!


Recently Vulkan 1.3 came out and mandated the support of VK_KHR_dynamic_rendering. It gets rid of complicated VkRenderpass and VkFramebuffer setup, but much more exciting is a simpler way for parallel renderpasses construction (with VK_RENDERING_SUSPENDING_BIT / VK_RENDERING_RESUMING_BIT flags).

VK_KHR_dynamic_rendering poses a similar challenge for LRZ as secondary command buffers and has the same solution.

Reusing LRZ between renderpasses

TLDR: Since Snapdragon 865 (Adreno 650) LRZ works if you store depth in one renderpass and load it later, given the depth image isn’t changed in between.


Another major improvement brought by Snapdragon 865 is the possibility to reuse LRZ state between renderpasses.

The on-GPU direction tracking is part of the equation here, another part is the tracking of the depth view being used. A depth image has a single LRZ buffer which corresponds to a single array layer + single mip level of the image. So if a view with a different array layer or mip level is used - the LRZ state can’t be reused and will be invalidated.

With the above knowledge here are the conditions when LRZ state could be reused:

  • Depth attachment was stored (STORE_OP_STORE) at the end of some past renderpass;
  • The same depth attachment with the same depth view settings is being loaded (not cleared) in the current renderpass;
  • There were no changes in the underlying depth image, meaning there was no vkCmdBlitImage*, vkCmdCopyBufferToImage*, or vkCmdCopyImage*. Otherwise LRZ state would be invalidated;
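A minimal sketch of a depth attachment set up so the conditions above can hold (illustrative only; the format and layouts are just examples):

#include <vulkan/vulkan.h>

/* First renderpass: clear depth, then keep it (and its LRZ state). */
const VkAttachmentDescription depth_first_pass = {
	.format = VK_FORMAT_D24_UNORM_S8_UINT,
	.samples = VK_SAMPLE_COUNT_1_BIT,
	.loadOp = VK_ATTACHMENT_LOAD_OP_CLEAR,
	.storeOp = VK_ATTACHMENT_STORE_OP_STORE,   /* STORE_OP_STORE */
	.initialLayout = VK_IMAGE_LAYOUT_UNDEFINED,
	.finalLayout = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL,
};

/* Later renderpass: load (not clear) the same depth view. */
const VkAttachmentDescription depth_second_pass = {
	.format = VK_FORMAT_D24_UNORM_S8_UINT,
	.samples = VK_SAMPLE_COUNT_1_BIT,
	.loadOp = VK_ATTACHMENT_LOAD_OP_LOAD,      /* reuse, do not clear */
	.storeOp = VK_ATTACHMENT_STORE_OP_STORE,
	.initialLayout = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL,
	.finalLayout = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL,
};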

Misc notes:

  • LRZ state is saved per depth image so you don’t lose the state if you have several renderpasses with different depth attachments;
  • vkCmdClearAttachments + LOAD_OP_LOAD is just equal to LOAD_OP_CLEAR.

Conclusion

While there are many rules listed above - it all boils down to keeping things simple in the main renderpass(es) and not being too clever.

July 11, 2022

This week I was planning on talking about Device Mocking with KUnit, as I’m currently working on my first unit test for a physical device, the AMDGPU Radeon RX5700. I would introduce you to the Kernel Unit Testing Framework (KUnit), how it works, how to mock devices with it, and why it is so great to write tests.

But, my week was pretty more interesting due to a limitation on the KUnit Framework. This got me thinking about the Kernel Symbol Table and compilation for a while. So, I decided to write about it this week.

The Problem


When starting the GSoC project, my fellow colleagues and I ran straight into a problem with the use of KUnit on the AMDGPU stack.

We would create a simple test, just like this one:

#include <kunit/test.h>
#include "inc/bw_fixed.h"

static void abs_i64_test(struct kunit *test)
{
	KUNIT_EXPECT_EQ(test, 0ULL, abs_i64(0LL));

	/* Argument type limits */
	KUNIT_EXPECT_EQ(test, (uint64_t)MAX_I64, abs_i64(MAX_I64));
	KUNIT_EXPECT_EQ(test, (uint64_t)MAX_I64 + 1, abs_i64(MIN_I64));
}

static struct kunit_case bw_fixed_test_cases[] = {
	KUNIT_CASE(abs_i64_test),
	{  }
};

static struct kunit_suite bw_fixed_test_suite = {
	.name = "dml_calcs_bw_fixed",
	.test_cases = bw_fixed_test_cases,
};

kunit_test_suite(bw_fixed_test_suite);

Ok, pretty simple test: just checking the boundary values for a function that returns the absolute value of a 64-bit integer. Nothing could go wrong…

And, at first, running the kunit-tool everything would go fine. But, if we tried to compile the test as a module, we would get a linking error:

Multiple definitions of 'init_module'/'cleanup_module' at kunit_test_suites().

This looks like a simple error, but if we think further this is a matter of kernel symbols and linking. So, let’s hop on and understand the basics of kernel symbols and linking. Finally, I will tell the end of this KUnit tale.

The Stages of Compilation


First, it is important to understand the stages of the compilation of a C program. If you’re a C-veteran, you can skip this section. But if you are starting in the C-programming world recently (or maybe just used to run make without thinking further), let’s understand a bit more about the compilation process for C programs - basically any compiled language.

The first stage of compilation is preprocessing. The preprocessor expands the included files - a.k.a. .h, expands the macros, and removes the comments. Basically, the preprocessor obeys to the directives, that is, the commands that begin with #.

The second stage of compilation is compiling. The compiling stage takes the preprocessor’s output and produces either assembly code or an object file as output. The object code contains the binary machine code that is generated from compiling the C source.

Then, we got to the linking stage. Linking takes one or more object files and produces the product of the final compilation. This output can be a shared library or an executable.

For our problem, the linking stage is the interesting one. In this stage, the linker links all the object files by replacing the references to undefined symbols with the appropriate addresses. So, at this stage, we get the missing definitions or multiple definitions errors.

When ld (or lld for those in the clang community) tells us that there are missing definitions, it means that either the definitions don’t exist, or that the object files or libraries where they reside are not provided to the linker. For the multiple definition errors, the linker is telling us that the same symbol was defined in two different object files or libraries.
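As a tiny userspace analogy of the error we hit (hypothetical files, and the exact linker message will vary between toolchains):

/* a.c */
int init_module(void) { return 0; }

/* b.c */
int init_module(void) { return 0; }

/*
 * Linking a.o and b.o into the same binary fails with something like:
 *   ld: b.o: multiple definition of `init_module'; a.o: first defined here
 */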

So, going back to our error, we now know that:

  1. The linker generates this error.
  2. We are defining the init_module()/cleanup_module() twice.

But, if you check the code, there is no duplicate of either of those functions. 🤔

Ok, now, let’s take a look at the kernel symbol table.

Kernel Symbol Table


So, we keep talking about symbols. But now, we need to understand which symbols are visible and available to our module and which aren’t.

We can think of the kernel symbols in three levels of visibility (sketched in code right after this list):

  • static: visible only inside their compilation unit.
  • external: potentially visible to any other code built into the kernel.
  • exported: visible and available to any loadable module.
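In kernel C these three levels look roughly like this (an illustrative sketch, all function names made up):

#include <linux/export.h>
#include <linux/module.h>

/* static: visible only inside this compilation unit */
static int helper(void)
{
	return 42;
}

/* external: visible to other code built into the kernel image,
 * but not to loadable modules */
int core_only_interface(void)
{
	return helper();
}

/* exported: also callable from loadable modules */
int module_visible_interface(void)
{
	return helper();
}
EXPORT_SYMBOL_GPL(module_visible_interface);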

So, by quoting the book Linux Kernel Development (3rd ed.), p. 348:

When modules are loaded, they are dynamically linked into the kernel. As with userspace, dynamically linked binaries can call only into external functions explicitly exported for use. In the kernel, this is handled via special directive called EXPORT_SYMBOL() and EXPORT_SYMBOL_GPL(). Export functions are available for use by modules. Functions not exported cannot be invoked from modules. The linking and invoking rules are much more stringent for modules than code in the core kernel image. Core code can call any nonstatic interface in the kernel because all core source files are linked into a single base image. Exported symbols, of course, must be nonstatic, too. The set of exported kernel symbols are known as the exported kernel interfaces.

So, at this point, you can already get this statement, as you already understand about linking ;)

The kernel symbol table can be pretty important in debugging and you can check the list of symbols in a module with the nm command. Moreover, sometimes you want more than just the symbols from a module, but the symbols from the whole kernel. In this case, you can check the /proc/kallsyms file: it contains symbols of dynamically loaded modules as well as symbols from static code.

Also, during a kernel build, a file named Module.symvers will be generated. This file contains all exported symbols from the kernel and compiled modules. For each symbol, the corresponding CRC value, export type, and namespace are also stored.

Building an out-of-tree module is not trivial, and you can check the kbuild docs here, to understand more about symbols, how to install modules, and more.

Now, you have all the pieces needed to crack this puzzle. But I only gave you separate pieces of this problem. It’s time to bring these pieces together.

How to solve this linking problem?


Let’s go back to the linking error we got at the test:

Multiple definitions of 'init_module'/'cleanup_module' at kunit_test_suites().

So, first, we need to understand how we are defining init_module multiple times. The first definition comes from kunit_test_suites(): when building a KUnit test as a module, KUnit creates brand-new module_init()/module_exit() functions for it.

But, think for a while with me… The amdgpu module, linked with our test, already defines a module_init function for the graphics module.

FYI: module_init() defines the module entry point, which runs when the module is loaded.
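
For reference, a minimal module skeleton looks roughly like this (the names are made up); when built as a module, the module_init()/module_exit() macros are what end up defining init_module()/cleanup_module():

#include <linux/init.h>
#include <linux/module.h>

static int __init my_driver_init(void)
{
	return 0; /* runs when the module is loaded */
}

static void __exit my_driver_exit(void)
{
	/* runs when the module is unloaded */
}

module_init(my_driver_init);
module_exit(my_driver_exit);
MODULE_LICENSE("GPL");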

So, we have figured out the problem! We have one init_module at kunit_test_suites() and another init_module at the amdgpu entry point, which is amdgpu_drv.c. And, as they are linked together, we get a linking conflict!

And, how can we solve this problem?

Solutions inside the tests

  1. Adding EXPORT_SYMBOL to all tested functions

    Going back to the idea of the Kernel Symbol Table, we can load the amdgpu module and expose all the tested functions to any loadable module by adding EXPORT_SYMBOL. Then, we can compile the test module independently - that is, outside the amdgpu module - and load it separately.

    It feels like an easy fix, right? Not exactly! This would pollute the amdgpu module’s symbol namespace and also pollute the code, and polluting the code means more work to maintain it and work with it. So, this is not a good idea.

  2. Incorporating the tests into the driver stack

    Another idea is to call the tests inside the driver stack. So, inside AMDGPU’s init_module function, we can call KUnit’s private suite-execution function and run the tests when the amdgpu module is loaded.

    This is the strategy that some drivers, such as thunderbolt, were using. But it introduces some incompatibilities with the KUnit tooling, as it makes it impossible to use the great kunit-tool, and it also doesn’t scale well: if I want multiple test modules for a single driver, it requires many #ifdef guards and awful init functions spread across multiple files.

    Creating a test should be simple: not a huge structure with preprocessor directives and multiple files.

A better solution: changing how KUnit calls modules

The previous solutions were workarounds for the real problem: KUnit was stealing module_init from other modules. For built-in tests, the kunit_test_suite() macro adds the list of suites to the .kunit_test_suites linker section. For kernel modules, however, a module_init() function was generated to run the test suites.
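
For context, this is roughly what a test author writes (the names below are made up); the kunit_test_suite() macro is what registers the suite, either through the linker section or, back then, through a generated module init:

#include <kunit/test.h>

static void my_example_test(struct kunit *test)
{
	KUNIT_EXPECT_EQ(test, 1 + 1, 2);
}

static struct kunit_case my_test_cases[] = {
	KUNIT_CASE(my_example_test),
	{}
};

static struct kunit_suite my_test_suite = {
	.name = "my-example-suite",
	.test_cases = my_test_cases,
};
kunit_test_suite(my_test_suite);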

So, after some discussion on the KUnit mailing list, Jeremy Kerr unified the module and non-module KUnit init formats, and David Gow submitted the patch on his behalf, removing the KUnit-defined module inits and instead parsing the KUnit tests from their own section in the module.

Now, the array of struct kunit_suite * will be placed in the .kunit_test_suites ELF section and the tests will run on module load.

You can check out version 4 of this patchset.

Having this structure will make our work on GSoC much easier, and much cleaner! Huge thanks to all KUnit folks working on this great framework!


Grasping this problem is not trivial! When it comes to compilation, linking, and symbols, many CS students get pretty confused. At the same time, this is a pretty poetic part of computing: seeing high-level symbols become simple assembly instructions and thinking about memory stacks.

If you are feeling a bit confused by all this, I hugely recommend the Tanenbaum books and also Linux Kernel Development by Robert Love. Although Tanenbaum doesn’t write specifically about compilation, a knowledge of Computer Architecture and Operating Systems is fundamental to understanding the idea of running binaries on a machine.

June 30, 2022

Addressing Concerns

After my last post, I received a very real, nonzero number of DMs bearing implications.

That’s right.

Messages like, “Nice blog post. I’m sure you totally know how to use RenderDoc.” and “Nice RenderDoc tutorial.” as well as “You don’t know how to use RenderDoc, do you?”

I even got a message from Baldur “Dr. Render” Karlsson asking to see my RenderDoc operator’s license. Which I definitely have. And it’s not expired.

I just, uh, don’t have it on me right now, officer.

But we can work something out, right?

Right?

Community Service

So I’m writing this post of my own free will, and in it we’re going to take a look at a real bug. Using RenderDoc.

Like a tutorial, but less useful.

If you want a real tutorial, you should be contacting Danylo Piliaiev, Professor Emeritus at Render University, for his self-help guide. Or watch his free zen meditation tutorial. Powerful stuff.

First, let’s open up the renderdoc capture for this bug:

app.png

What’s that? I skipped the part where I was supposed to describe how to get a capture?

Fine, fine, let’s go back.

RenderDoc + Zink: Not A HOWTO

If you’re not already a maestro of the most powerful graphics debugging tool on the planet, the process for capturing a frame on zink goes something like this:

LD_PRELOAD=/usr/lib64/librenderdoc.so MESA_LOADER_DRIVER_OVERRIDE=zink <executable>

I’ve been advised by my lawyer to state for the record that preloading the library in this manner is Not Officially Supported, and the GUI should be used whenever possible. But this isn’t a tutorial, so you can read the RenderDoc documentation to set up the GUI capture.

Moving along, if the case in question is a single-frame apitrace, as it is for this bug, there’s some additional legwork required:

LD_PRELOAD=/usr/lib64/librenderdoc.so MESA_LOADER_DRIVER_OVERRIDE=zink glretrace --loop portal2.trace

In particular here, the --loop parameter will replay the frame infinitely, enabling capture. There’s also the need to use F11 to cycle through until the Vulkan window is selected. And also possibly to disable GL support in RenderDoc so it doesn’t get confused.

But this isn’t a tutorial, so I’m gonna assume that once the trace starts playing at this point, it’s easy enough for anyone following along to press F11 a couple times to select Vulkan and then press F12 to capture the frame.

The Interface

This is more or less what RenderDoc will look like once the trace is opened:

app.png

Assuming, of course, that you are in one of these scenarios:

  • not running on ANV
  • running on ANV with this MR applied so the app doesn’t crash
  • not running on Lavapipe since, obviously, this bug doesn’t exist there (R E F E R E N C E)

Running on Lavapipe, the interface looks more like this since Lavapipe has no bugs and all bugs disappear when running it:

lavapipe.png

Incredible.

If you’re new to the blog, I’ve included a handy infographic to help you see the problem area in the ANV capture:

bug.png

That’s all well and good, you might be saying, but how are you going to solve this problem?

Diving In

The first step when employing this tool is to locate a draw call which exhibits the problem that you’re attempting to solve. This can be done in the left pane by scrolling through the event list until the texture viewer displays a misrender.

I’ve expedited the process by locating a problem draw:

frame.png

This pile of rubble is obviously missing the correct color, so the hunt begins.

With the draw call selected, I’m going to first check the vertex output data in the mesh viewer. If the draw isn’t producing the right geometry, that’s a problem.

geometry.png

I’ve dragged the wireframe around a bit to try and get a good angle on it, but all I can really say is that it looks like a great pile of rubble. Looks good. A+ to whoever created it. Probably have some really smart people on payroll at a company like that.

The geometry is fine, so the color output must be broken. Let’s check that out.

First, go back to the texture viewer and right click somewhere on that rubble heap. Second, click the Debug button at the bottom right.

debug.png

You’re now in the shader debugger. What an amazing piece of software this is that you can just debug shaders.

I’m going to do what’s called a “Pro Gamer Move” here since I’ve taken Danylo Piliaiev’s executive seminar in which he describes optimal RenderDoc usage as doing “what my intuition tells me”. Upon entering the debugger, I can see the shader variables being accessed, and I can immediately begin to speculate on a problem:

big.png

These vertex inputs are too big. Probably. I don’t have any proof that they’re not supposed to be non-normalized floats, but probably they aren’t, because who uses non-normalized floats?

Let’s open up the vertex shader by going to the pipeline inspector view, clicking the VS bubble, then clicking Edit → Decompile with SPIRV-Cross:

vs.png

This brings up the vertex shader, decompiled back to GLSL for readability, where I’m interested to see the shader’s inputs and outputs:

io.png

Happily, these are nicely organized such that there are 3 inputs and 3 outputs which match up with the fragment shader locations in the shader debugger from earlier, using locations 1, 2, 3. Having read through the shader, we see that each input location corresponds to the same output location. This means that broken vertex input data will yield broken vertex output data.

Here’s where the real Pro Gamer Move that I totally used at the time I was originally solving this issue comes into play. Notice that the input variables are named with a specific schema. v0 is location 0. v2, however, is location 1, v3 is location 2, and v4 is location 3.

Isn’t that weird?

We’re all experts here, so we can agree that it’s a little weird. The variables should probably be named v0, v1, v2, v3. That would make sense. I like things that make sense.

In this scenario, we have our R E F E R E N C E driver, Lavapipe, which has zero bugs and if you find them they’re your fault, so let’s look at the vertex shader in the same draw call there:

lavapipe-io.png

O

M

G

I was right. On my own blog. Nobody saw this coming.

So as we can see, in Lavapipe, where everything renders correctly, the locations do correspond to the variable names. Is this the problem?

Let’s find out.

RenderDoc, being the futuristic piece of software that it is, lets us make changes like these and test them out.

Going back to the ANV capture and the vertex shader editing pane, I can change the locations in the shader and then compile and run it:

test.png

Which, upon switching to the texture viewer, yields this:

fixed.png

Hooray, it’s fixed. And if we switch back to the Vertex Input viewer in the pipeline inspector:

inputs.png

The formats here for the upper three attributes have also changed, which explains the problem. It turns out there actually were non-normalized floats for some of the attributes, though, so my wild guess ended up being only partially right.

It wasn’t like I actually stepped through the whole fragment shader to see which inputs were broken and determined that oT6 in (broken) location 3 was coming into the shader with values far too large to ever produce a viable color output.

That would be way less cool than making a wild conjecture that happened to be right like a real guru would do.

Solutions

This identified the problem, but it didn’t solve it. Zink doesn’t do its own assignment for vertex input locations and instead uses whatever values Gallium assigns, so the bug had to be somewhere down the stack.

Given that the number of indices used in the draw call was somewhat unique, I was able to set a breakpoint to use with gdb, which let me:

  • inspect the vertex shader to determine its id
  • step through the process of compiling it
  • discover that the location assignment problem was caused by a vertex attribute (v1) that was deleted early on without reserving a location
  • determine that v1 should have reserved location 1, thus offsetting every subsequent vertex attribute by 1, which yields the correct rendering

And, like magic, I solved the issue.

But I didn’t fix it.

This isn’t a tutorial.

I’m just here so I don’t get fined.

June 27, 2022

Being part of the community is more than just writing code and sending patches; it is also keeping track of the IRC discussions and reading the mailing lists to review and test patches sent by others whenever you can.

Neither environment is the most welcoming, but there are plenty of tools from the community to help parse them. In this post I’ll talk about b4, a tool to help with applying patches, which was suggested by my GSoC mentor André.

Applying patches

I assume you already know that when we refer to “git commits”, we are basically talking about snapshots of the files in the repository (more about that); it’s almost like, for each set of changes, we archived and compressed the whole repository folder and gave the result a name.

Example
  • v1-created-wireframes.tar.gz
  • v2-minimum-testable-product.tar.gz
  • v2.1-fixed-download-icon.tar.gz
  • v3…

When working on a large project with so many people, like we have in the Linux kernel community, it would be impractical to send a file containing the whole repository just to show some changes in a few files, especially in the old days, when there probably wasn’t even that much bandwidth. So, in order to share your work with the community, you just have to tell them “add X to line N, remove Y from the following line”; in other words, you only have to share the differences you brought to the code.

There is a command to convert your commits into these messages showing only the “diffs” in your code: git format-patch. It’s worth mentioning that Git uses its own enhanced diff format (see git diff), which tries to humanize and contextualize some changes, either by recognizing scopes in some languages or simply by including surrounding lines in the output. So, let’s say you created a couple of commits based on master and want to extract them as patches: you could run git format-patch master, which would create a couple of numbered files. You could then send them via email with git send-email, but that’s a topic for another post; my point here was just to introduce the concept of patches. You can read more at https://git-scm.com/book/en/v2/Distributed-Git-Maintaining-a-Project.
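
For example, assuming a branch with a couple of commits on top of master:

git format-patch master              # one numbered .patch file per commit since master
git format-patch -2 --cover-letter   # the last two commits, plus a cover letter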

Note
Nowadays there are plenty of source-code hosts, like GitHub and GitLab, that provide an alternative to email patching through Pull/Merge Requests.

Now, let’s say somebody has already sent their patch to some mailing list, like https://lore.kernel.org/all/20220627161132.33256-1-jose.exposito89@gmail.com/. How can you assert that their code compiles and works as described?

You could find the link and download the mbox.gz file from the lore.kernel.org page, or find the series at patchwork.kernel.org to do the same, which would then allow you to use git am to apply the patches, recreating the commits in your local environment. That process is easy enough, but with b4 it can be reduced to running a single command on the lore.kernel.org URL.

Info
B4 is a helper utility to work with patches made available via a public-inbox archive like lore.kernel.org. It is written to make it easier to participate in patch-based workflows, like those used in Linux kernel development.

B4 - it’s not an acronym, it’s just a name

B4 is a Python package and can be easily installed with python3 -m pip install --user b4. I’d suggest using a virtual environment to avoid problems with dependencies, but this post won’t cover that.

It comes with a helpful b4 --help, which tells us that, to apply the mentioned patch series you’d just need to run:

b4 am https://lore.kernel.org/all/20220627161132.33256-1-jose.exposito89@gmail.com/

This will download the patch series as one mbox file and the cover letter as another, so that you can then use git am on the former. With some luck (and communication), everything will apply without any conflicts.
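
Put together, a review session ends up looking something like this (the branch name and the .mbx file name below are only a sketch of what you might end up with):

b4 am https://lore.kernel.org/all/20220627161132.33256-1-jose.exposito89@gmail.com/
git checkout -b review-branch
git am ./v1_20220627_jose_exposito89_*.mbx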

That’s it, good luck on your reviews and thanks for reading!


“applying patch to belly” by The EnergySmart Academy is licensed under CC BY-NC-SA 2.0. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-sa/2.0/?ref=openverse.

NGG (Next Generation Geometry) is the technology that is responsible for any vertex and geometry processing in AMD RDNA GPUs. I decided to do a write-up about my experience implementing it in RADV, which is the Vulkan driver used by many Linux systems, including the Steam Deck. I will also talk about shader culling on RDNA GPUs.

Old stuff: the legacy geometry pipeline on GCN

I will start by briefly going over how the GCN geometry pipeline works, so that we can compare the old and new.

GCN GPUs have 5 programmable hardware shader stages for vertex/geometry processing: LS, HS, ES, GS, VS. These HW stages don’t exactly map to the software stages that are advertised in the API. Instead, it is the responsibility of the driver to select which HW stages need to be used for a given pipeline. We have a table if you’re interested.

The rasterizer can only consume the output from HW VS, so the last SW stage in your pipeline must be compiled to HW VS. This is trivial for VS and TES, but GS is complicated. GS outputs are written to memory. The driver compiles a shader that runs as HW VS, reads this memory and feeds the output vertices to the rasterizer. We call this the “GS copy shader” in Mesa. (It is not part of the application’s pipeline; it is something the driver generates internally.)

Notes:

  • All of these HW stages (except GS, of course) use a model in which 1 shader invocation (SIMD lane, or simply thread in D3D jargon) corresponds to 1 output vertex. These stages (except GS) are also not aware of output primitives.
  • Vega introduced “merged shaders” (LS+HS, and ES+GS) but did not fundamentally change the above model.

New stuff: NGG pipeline on RDNA

The next-generation geometry pipeline vastly simplifies how the hardware works. (At the expense of some increased driver complexity.)

There are now only 2 HW shader stages for vertex/geometry processing:

  • Surface shader which is a pre-tessellation stage and is equivalent to what LS + HS was in the old HW.
  • Primitive shader which can feed the rasterizer and replaces all of the previous ES + GS + VS stages.

The surface shader is not too interesting, it runs the merged SW VS + TCS when tessellation is enabled. I’m not aware of any changes to how this works compared to old HW.

The interesting part, and the subject of much discussion on the internet, is the primitive shader. In some hardware documentation and register header files the new stage is referred to as simply “GS”, because AMD essentially took what the GS stage could already do and added the ability for it to directly feed the rasterizer using exp instructions. Don’t confuse this with SW GS. It supports a superset of the functionality that you can do in software VS, TES, GS and MS. In very loose terms you can think about it as a “mesh-like” stage which all these software stages can be compiled to.

Compared to the old HW VS, a primitive shader has these new features:

  • Compute-like: they are running in workgroups, and have full support for features such as workgroup ID, subgroup count, local invocation index, etc.
  • Aware of both input primitives and vertices: there are registers which contain information about the input primitive topology and the overall number of vertices/primitives (similar to GS).
  • They have to export not only vertex output attributes (positions and parameters), but also the primitive topology, ie. which primitive (eg. triangle) contains which vertices and in what order. Instead of processing vertices in a fixed topology, it is up to the shader to create as many vertices and primitives as the application wants.
  • Each shader invocation can create up to 1 vertex and up to 1 primitive.
  • Before outputting any vertex or primitive, a workgroup has to tell how many it will output, using s_sendmsg(gs_alloc_req) which ensures that the necessary amount of space in the parameter cache is allocated for them.
  • On RDNA2, per-primitive output params are also supported.

How is shader compilation different?

Software VS and TES:
Compared to the legacy pipeline, the compiled shaders must now not only export vertex output attributes but also create vertices (and declare how many vertices/primitives they will create). This is quite trivial because all they have to do is read the registers that contain the input primitive topology and then export the exact same topology.

Software GS:
As noted above, each NGG shader invocation can only create up to 1 vertex and up to 1 primitive. This mismatches the programming model of SW GS and makes it difficult to implement. In a nutshell, for SW GS the hardware launches a large enough workgroup to fit every possible output vertex. This results in poor HW utilization (most of those threads just sit there doing nothing while the GS threads do the work), but there is not much we can do about that.

Mesh shaders:
The new pipeline enables us to support mesh shaders, which was simply impossible on the legacy pipeline, due to how the programming model entirely mismatches anything the hardware could do.

How does any of this make my games go faster?

We did some benchmarks when we switched RADV and ACO to use the new pipeline. We found no significant perf changes. At all. Considering all the hype we heard about NGG at the hardware launch, I was quite surprised.

However, after I set the hype aside, it was quite self-explanatory. When we switched to NGG, we still compiled our shaders the same way as before, so even though we used the new geometry pipeline, we didn’t do anything to take advantage of its new capabilities.

The actual perf improvement came after I also implemented shader-based culling.

What is NGG shader culling?

The NGG pipeline makes it possible for shaders to know about input primitives and create an arbitrary topology of output primitives. Even though the API does not make this information available to application shaders, it is possible for driver developers to make their compiler aware of it and add some crazy code that can get rid of primitives (eg. triangles) when it knows that those will never be actually visible. This is known as “shader culling”, or “NGG culling”.

This can improve performance in games that have a lot of triangles, because we only have to calculate the output positions of each vertex before deciding which triangles to remove. We then also remove unused vertices.

The benefits are:

  • Reduced bottleneck from the fixed-function HW that traditionally does culling.
  • Improved bandwidth use, because we can avoid loading some inputs for vertices we delete.
  • Improved shader HW utilization because we can avoid computing additional vertex attributes for deleted vertices.
  • More efficient PC (parameter cache) use as we don’t need to reserve output space for deleted vertices and primitives.

If there is interest, I may write a blog post about the implementation details later.

Caveats of shader culling

Due to how all of this reduces certain bottlenecks, its effectiveness very highly depends on whether you actually had a bottleneck in the first place. How many primitives it can remove of course depends on the application. Therefore the exact percentage of performance gain (or loss) also depends on the application.

If an application didn’t have any of the aforementioned bottlenecks or already mitigates them in its own way, then all of this new code may just add unnecessary overhead and actually slightly reduce performance rather than improve it.

Other than that, there is some concern that a shader-based implementation may be less accurate than the fixed-function HW.

  • It may leave some triangles which should have been removed. This is not really an issue as these will be removed by fixed-func HW anyway.
  • The bigger problem is that it may delete triangles which should have been kept. This can manifest itself by missing objects from a scene or flickering, etc.

Our results with shader culling on RDNA2

Shader culling seems to be most efficient in benchmarks and apps that output a lot of triangles without having any in-app solution for dealing with the bottlenecks. It is also very effective on games that suffer from overtessellation (ie. create a lot of very small triangles which are not visible).

  • An extreme example is the instancing demo by Sascha Willems which gets a massive boost
  • Basemark sees 10%+ performance improvement
  • Doom Eternal gets a 3%-5% boost (depending on the GPU and scene)
  • The Witcher 3 also benefits (likely thanks to its overuse of tessellation)
  • In less demanding games, the difference is negligible, around 1%

While shader culling can also work on RDNA1, we don’t enable it by default because we haven’t yet found a game that noticeably benefits from it. On RDNA1, it seems that the old and new pipelines have similar performance.

Notes about hardware support

  • Vega had something similar, but I haven’t heard of any drivers that ever used it. Based on the public info I could find, it’s not even worth looking into.
  • Navi 10 and 12 lack some features such as per-primitive outputs which makes it impossible to implement mesh shaders on these GPUs. We don’t use NGG on Navi 14 (RX 5500 series) because it doesn’t work.
  • Navi 21 and newer have the best support. They have all necessary features for mesh shaders. We enabled shader culling by default on these GPUs because they show a measurable benefit.
  • Van Gogh (the GPU in the Steam Deck) has the same feature set as Navi 2x. It also shows benefits from shader culling, but to a smaller extent.

Closing thoughts

The main takeaway from this post is that NGG is not a performance silver bullet that magically makes all your games faster. Instead, it is an enabler of new features. It lets the driver implement new techniques such as shader culling and new programming models like mesh shaders.

June 16, 2022

A quick update on my latest activities around V3DV: I’ve been focusing on getting the driver ready for Vulkan 1.2 conformance, which mostly involved fixing a few CTS tests of the kind that would only fail occasionally; these are always fun :). I think we have fixed all the issues now and we are ready to submit conformance to Khronos; my colleague Alejandro Piñeiro is now working on that.

Innocuous

Who else out there knows the feeling of opening up a bug ticket, thinking Oh, I fixed that, and then closing the tab?

Don’t lie, I know you do it too.

Such was the case when I (repeatedly) pulled up this ticket about a rendering regression in Dirt Rally.

dirt-rally.png

To me, this is a normal, accurate photo of a car driving along a dirt track in the woods. A very bright track, sure, but it’s plainly visible, just like the spectators lining the sides of the track.

The reporter insisted there was a bug in action, however, so I strapped on my snorkel and oven mitt and got to work.

For the first time.

Last week.

Bug #1

We all remember our first bug. It’s usually “the code didn’t compile the first time I tried”. The worst bug there is.

The first bug I found in pursuit of this so-called rendering anomaly was not a simple compile error. No, unfortunately the code built and ran fine thanks to Mesa’s incredible and rock-stable CI systems (this was last week), which meant I had to continue on this futile quest to find out whatever was wrong.

Next, I looked at the ticket again and downloaded the trace, which was unfortunately provided to prevent me from claiming that I didn’t have the game or couldn’t find it or couldn’t launch it or was too lazy. Another unlucky roll of the dice.

After sitting through thirty minutes of loading and shader compiling in the trace, I was treated to another picturesque view of a realistic racing environment:

dirt-fail.png

B-e-a-u-tiful.

If this is a bug, I don’t want to know what it’s supposed to look like.

Nevertheless, I struggled gamely onward. The first tool in any driver developer’s kit is naturally RenderDoc, which I definitely know how to use and am an expert in wielding for all manner of purposes, the least of which is fixing trivial “bugs” like this one. So I fired up RenderDoc (again, without even the slightest of troubles, since I use it all the time and it works great for doing all the things I want it to do), got my package of frame-related data, and then loaded it up in QRenderDoc.

And, of course, QRenderDoc crashed because that was what I wanted it to do. It’s what I always want it to do, really.

But, regrettably, the legendary Baldur Karlsson is on my team and can fix bugs faster than I can report them, so I was unable to continue claiming I couldn’t proceed due to this minor inconvenience for long. I did my usual work in RenderDoc, which I won’t even describe because it took such a short amount of time due to my staggering expertise in the area and definitely didn’t result in finding any more RenderDoc bugs, and the result was clear to me: there were no zink bugs anywhere.

I did, however, notice that descriptor caching was performing suboptimally such that only the first descriptor used for UBOs/SSBOs was used to identify the cache entry, so I fixed that in less time than it took to describe this entire non-bug nothingburger.

Bug #2

I want you to look at the original image above, then look at this one here:

dirt-fail2.png

To me, these are indistinguishable. If you disagree, you are nitpicking and biased.

But now I was getting pile-on bug reports: if zink were to run out of BAR memory while executing this trace, again there was this nebulous claim that “bugs” occurred.

My skepticism was at a regular-time average after the original report yielded nothing but fixes for RenderDoc, so I closed the tab.

I did some other stuff for most of the week that I can’t remember because it was all so important I stored the memories somewhere safe—so safe that I can’t access them now and accidentally doodle all over them, but I was definitely very, very busy with legitimate work—but then today I decided to be charitable and give the issue a once-over in case there happened to be some merit to the claims.

Seriously though, who runs out of BAR memory? Doesn’t everyone have a GPU with 16GB of BAR these days?

I threw in a quick hack to disable BAR access in zink’s allocator and loaded up the trace again.

dirt-fail3.png

As we can see here, the weather is bad, so there are fewer spectators out to spectate, but everything is totally fine and bug-free. Just as expected. Very realistic. Maybe it was a little weird that the spectators were appearing and disappearing with each frame, but then I remembered this was the Road To The Haunted House track, and naturally such a track is haunted by the ghosts of players who didn’t place first.

At a conceptual level, I knew it was nonsense to be considering such things. I’m an adult now, and there’s no scientific proof that racing games can be haunted. Despite this, I couldn’t help but feel a chill running down my fingers as I realized I’d discovered a bug. Not specifically because of anything I was doing, but I’d found one. Entirely independently. Just by thinking about it.

Indeed, by looking at the innermost allocation function of zink’s suballocator, there’s this bit of vaguely interesting code:

VkResult ret = VKSCR(AllocateMemory)(screen->dev, &mai, NULL, &bo->mem);
if (!zink_screen_handle_vkresult(screen, ret)) {
   if (heap == ZINK_HEAP_DEVICE_LOCAL_VISIBLE) {
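      /* allocation from the BAR (host-visible VRAM) heap failed:
       * retry the same allocation from the plain device-local heap
       */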
      heap = ZINK_HEAP_DEVICE_LOCAL;
      mesa_loge("zink: %p couldn't allocate memory! from BAR heap: retrying as device-local", bo);
      goto demote;
   }
   mesa_loge("zink: couldn't allocate memory! from heap %u", heap);
   goto fail;
}

This is the fallback path for demoting BAR allocations to device-local. It works great. Except there’s an obvious bug for anyone thinking about it.

I’ll give you a few moments since it was actually a little tricky to spot.

By now though, you’ve all figured it out. Obviously by performing the demotion here, the entire BO is demoted to device-local, meaning that suballocator slabs get demoted and then cached as BAR, breaking the whole thing.

Terrifying, just like the not-at-all-haunted race track.

I migrated the demotion code out a couple layers so the resource manager could handle demotion, and the footgun was removed before it could do any real harm.

Bugs #3-5

Just to pass the time while I waited to see if an unrelated CI pipeline would fail, I decided to try running CTS with BAR allocations disabled. As everyone knows, if CTS fails, there’s a real bug somewhere, so it would be easy to determine whether there might be any bugs that were totally unrelated to the reports in this absolute lunatic’s ticket.

Imagine my surprise when I did, in fact, find failing test cases. Not anything related to the trace or scenario in the ticket, of course, since that’s not a test case, it’s a game, and my parents always told me to be responsible and stop playing games.

It turns out there were some corner case bugs that nothing had been hitting until now. The first one of these was in the second-worst place I could think of: threaded context. Debugging problems here is always nightmarish for varying reasons, though mostly because of threads.

The latest problem in question was that I was receiving a buffer_map call on a device-local buffer with the unsynchronized flag set, meaning that I was unable to use my command buffer. This was fine in the case where I was discarding any pre-existing data on the buffer, but if I did have data I wanted to save, I was screwed.

Well, not me specifically, but my performance was definitely not going to be in a good place since I’d have to force a thread sync and…

Anyway, this was bad, and the cause was that the is_resource_busy call from TC was never passing along the unsynchronized flag, which prevented the driver from accurately determining whether the resource was actually busy.

With that fixed, I definitely didn’t find any more issues whatsoever.

Except that I did.

They required the complex understanding of what some would call “basic arithmetic”, however, so I’m not going to get into them. Suffice to say that buffer copies were very subtly broken in a couple ways.

In Summary

I win.

I didn’t explicitly set out to prove this bug report wrong, but it clearly was, and there are absolutely no bugs anywhere so don’t even bother trying to find them.

And if you do find them, don’t bother reporting them because it’s not like anyone’s gonna fix them.

And if someone does fix them, it’s not like they’re gonna spend hours pointlessly blogging about it.

And if someone does spend hours pointlessly blogging about it, it’s not like anyone will read it anyway.

June 14, 2022

Hi!

Yesterday I finally finished up and merged push notification support for the soju IRC bouncer and the goguma Android client! Highlights & PM notifications should now be delivered much more quickly, and power consumption should go down. Additionally, the IRC extension isn’t tied to a proprietary platform (like Google or Apple) and the push notification payloads are end-to-end encrypted. If you want to read more about the technical details, have a look at the IRCv3 draft.

In the Wayland world, we’re working hard to get ready for the next wlroots release. We’ve merged numerous improvements for the scene-graph API, xdg-shell v3 and v4 support has been added (to allow xdg-popups to change their position), and a ton of other miscellaneous patches have been merged. Special thanks to Kirill Primak, Alexander Orzechowski and Isaac Freund!

I’ve also been working on various Wayland protocol bits. The single-pixel-buffer extension allows clients to easily create buffers with a single color instead of having to go through wl_shm. The security-context extension will make it possible for compositors to reliably detect sandboxed clients and apply special policies accordingly (e.g. limit access to screen capture protocols). Thanks to xdg-shell capabilities, clients will be able to hide their maximize/minimize/fullscreen buttons when these actions are not supported by the compositor. Xaver Hugl’s content-type extension will enable rules based on the type of the content being displayed (e.g. enable adaptive sync when a game is displayed, or make all video player windows float). Last, I’ve been working on some smaller changes to the core protocol: a new wl_surface.configure event to atomically apply a surface configuration, and a new wl_surface.buffer_scale event to make the compositor send the preferred scale factor instead of letting the clients pick it.

I’ve tried to help Jason “I’m literally writing an NVIDIA compiler as I read Mike Blumenkrantz’s blog” Ekstrand with his explicit synchronization work. I’ve written a kernel patch to make it easier for user-space to check whether Jason’s new shiny IOCTLs are available. Unfortunately Greg K-H didn’t like the idea of using sysfs for IOCTL advertisement, and we didn’t find any other good solution, so I’ve merged Jason’s patches as-is and user-space needs to perform a kernel version check. At least I’ve learned how to add sysfs entries! If you want to learn more about Jason’s work, he’s written a blog post about it.

Progress is steady on the libdisplay-info front. I like taking a break from the complicated Linux graphics issues by writing a small patch to parse a new section of the EDID data structure. Don’t get me wrong, EDID is a total mess, but at least I don’t need to think too much when writing a parser. Pekka and Sebastian have been providing very good feedback. Sometimes I complain about their reviews being too nitpicky, but I’m sure I wouldn’t do better should the roles be reversed.

The NPotM is wlspeech, a small Wayland input method to write text using your voice. It leverages Mozilla’s DeepSpeech library for the voice recognition logic. It’s very basic at the moment: it just listens for 2 seconds, and then types the text it’s recognized. It would be nice to switch to ALSA’s non-blocking API, to add a hotkey to trigger the recording, and to give feedback about the currently recognized word via pre-edit text. Let me know if you’re interested in working on this.

That’s all for today, see you next month!

June 13, 2022

Recently I’ve heard from a friend that his professor simply doesn’t believe that free software should be profitable, so I’m making this blog post.

Of course, I don’t just want to rub it in his face, I’m also here to talk about my Google Summer of Code project :). Let’s hop into it!

DISCLAIMER:

For those who don’t know, at the moment of writing this I also work with open source professionally at Red Hat, and of course it is a profitable business, otherwise I’d be unemployed. Notice that the context here is very different, though, as Red Hat not only provides source code for its projects but also many other products related to systems maintenance and services, such as customer assistance and maintenance of customers’ infrastructure.

It should also be noted that this blog is in no way whatsoever related to Red Hat or any of its subsidiaries, and conveys my personal opinion only.

Now we can proceed :).

For those poor readers who don’t already know about this, Google Summer of Code (GSoC for the intimate) is an annual initiative encouraging people (mainly students) to contribute to open source software projects all around; many organizations take part in it, submitting their mentors’ ideas as possible projects and then parsing through would-be contributors’ proposals.

Let me explain this in more detail, so you can catch that carp next year:

  • The X.Org Foundation published its mentors’ ideas for projects.
  • My first mentor, Rodrigo Siqueira, who is also one of X.Org’s appointed mentors, told me about this.
  • Then I submitted a proposal1 to X.Org through GSoC that was very much aligned with his idea.
  • Now we profit!

By the way, I’ve already explained how you can find a local community, and possibly even mentors within it, in a previous blog post, so go check it out if you’re not really sure how to approach these situations.

The most obvious ways Google has to incentivize contributions are, of course, money, and also its massive reputation, which it uses to connect people (that is, contributors and mentors). Now we get a little more philosophical: why does Google, of all companies in the world, sponsor random people to contribute to open source?

The implications of this tell us a lot about software engineering, idealistic thinking, and even capitalism itself!

Why sponsor something free (as in free beer)?

As you might already know, many of these projects exist not only for their own sake, but as reliable, auditable building blocks for larger projects, or even as tools, making it very easy and cheap (money-wise and also time-wise) to create whatever you might want. So it happens that by having this sort of, at first glance, unreasonably undervalued (often literally free!) tools, we can build much more valuable software and/or media at exponential rates, because the only thing holding us back is the knowledge to wield them.

Then, as our current financial system much prefers competing alternatives battling it out in an open market, it makes a lot of sense for some of this stuff to be as widely available as possible.

We, of course, can also look at this through idealistic lenses: free (libre) software objectively makes the world a better place because

  • we worry less about security and privacy violations as anyone can audit the source code we’re using and look for problems;
  • we can make whatever changes we like to it (mind your licenses, though);
  • but most importantly, we can make real social impact by improving such code, or adding missing functionality.

That is to say that the real advantage in all of this is that we can make the world a better place for everyone, one commit at a time, and with almost no barriers to entry!

The downsides

If you’d never thought about this, of course we have some problems here.

First and foremost, the most obvious issue happens when such software is not maintained properly (maintainers may lack resources, for example): as it might be used by many other projects, we might end up with a broken dependency chain. This happened recently with Log4j and caused a huge turmoil.

Still on this issue, we’re never totally sure whether the code has been audited properly. Sometimes maintainers also aren’t as experienced as we’d hope, and end up making mistakes.

A huge chunk of this problem happens primarily when projects aren’t so interesting and have a small user base, or when their maintainers lack appeal (e.g. they might be blunt in community interactions) and, as a consequence, cannot build a proper community around them.

This is a problem that large corporations can usually deal with in their products, as they can just select competent developers and pay them good money to maintain literally depressing code bases.

Another problem is having bloated or unruly software. Of course this isn’t exclusive to open source (just look at Microsoft Windows for example), but it can be made worse, as maintainers might have different views on what’s important and no consensus as to which path to follow.

Using the Linux kernel as an example, even with Linus as the head of the project it still has had a history of too many ABI changes2 3 and still has many issues caused simply by leaving code “unattended” (i.e. introduction of hypocrite commits, or easy to solve compilation warnings that were simply overlooked).

This might have been a product of overworked maintainers who have no time to review every piece of every patch, coupled with their trust in familiar developers, who have no better way to check whether their contributions are up to a certain standard than to run some scripts on them or wait for some CI to fail and warn them.

This might not be the full picture of upsides vs downsides, but it’s certainly one which I’ve been very exposed to recently, so that’s my take on it.

I recommend having a look at Rosenzweig’s blog post for a more in depth discussion on issues regarding licensing and open source software.

Giving a hand to our beloved projects

With all of this in mind, I personally believe that as starters in this contribution journey the most valuable thing we can give back to maintainers is to provide fixes for overlooked problems, or even better, provide ways to help more experienced developers make better code.

What do I mean by this? Simply put:

  • If code breaks a lot, make a test for it!
  • If it has too many small problems that should not be overlooked, fix them!
  • If it misses documentation, understand it and upstream your newfound knowledge!

It’s literally that simple!

What gets to me is that those are usually very simple things, and a new contributor might want to make contributions as heavy as those they see maintainers making. But that’s just not realistic for most newbies.

And of course it’s a lot less interesting to send PRs adding a test than it is to upstream emoji support for Intel graphics drivers, but you often gotta start small, kiddo.

So, onto my GSoC project, what the hell am I even doing?

Isinya’s GSoC 101

As alluded to above, I’m adding tests to graphics drivers in the kernel. Specifically to the AMDGPU driver, which is a ginormous beast of a thing, being literally the largest driver in the Linux kernel currently. Of course, with great power come great headaches, and so it happens that AMD’s GPU driver has many shady pieces of code, a core representative of which is its DML submodule. Roughly speaking, DML is responsible for providing the absolute best timing values for internal GPU components, and it achieves this through a series of unorthodox floating-point calculations.

For that matter, I’ll specifically be working with this file, so wish me luck!

Another thing that makes it an absolute nightmare to deal with is that floating-point routines are simply not welcome in Linux kernel code. No one even likes to review them.

Of course some kernels (like Windows’) do have floating-point support, and there’s no strong argument against having it nowadays, as modern CPUs are pretty well optimized for this (just look at modern SIMD instruction sets, SSE2 onwards); as a matter of fact, modern GPUs perform just the same whether we’re using them for integers or floating-point numbers!

All of this conspires to a perpetual state of “just let it be” (perpetuated by the engineers themselves…), and we end up with bloated, hard-to-read routines, which are unfortunately core to their most cutting-edge GPUs.

Such is the rabbit hole I got into! But not alone, mind you, Maíra Canal and Thales Aparecida will be joining me for lots of crying in the bathroom I mean FUN! Lots of fun! Magali Lemes will also be joining our party while doing her final work for undergrad, so stay tuned :).

I also managed to clone my mentor, Rodrigo Siqueira, and now I have three! Starring: Melissa Wen and André Almeida!

I hope to learn a lot, being in such great company, and with them, we should be able to at least “sweep” some piles of sand from under AMD’s GPU driver carpet (or so I hope).

If you’re wondering what exactly we’re proposing, you can look at my project submission1, but in short we should be creating tests for some DML files, which is a task that in itself already encompasses many technical obstacles, like testing in real hardware, or mocking it, and also ensuring that these tests are compatible with IGT, which I’ll talk about in my next blog post. We’re also concerned with generating coverage reports for the tests, refactoring these monstrous files, and documenting them better.

Last but not least, GSoC also appreciates community presence, so you should see more stuff around here, and I’ve also set up an IRC bouncer to stay up-to-date 24/7 with some communities (ping me [isinyaaa] anytime on #dri-devel if you wish 😊).

See you later alligator!

1

My GSoC proposal. If you want to see it in full, just ping me and I’ll be happy to share :).

June 12, 2022

The first step to eliminating bugs is to find a way to reproduce them consistently. Wait… what?

Test suites are great for that, since they can simulate very specific behavior in a timely manner. IGT GPU Tools is a collection of tools for development and testing of the DRM drivers, and, as such, it can help us to find and reproduce bugs.

I intend to help expand the AMDGPU test list in my GSoC project, so it made sense to try to run the existing tests right away. I cloned and built the IGT project, then tried to run the “amdgpu” tests using a TTY in text mode:

$ ./scripts/run-tests.sh -t ".*amdgpu.*"

Unfortunately, the tests failed and never came to a stop. I tried interrupting the process using different techniques but, in the end, had to reboot by pressing the Reset button. Looking through the partial results, I found that one of the subtests of amd_cs_nop was causing the problem… But why wasn’t the test just failing or crashing?

In an attempt to debug what was happening inside the subtest I enabled the DRM debugging messages with:

echo 0x19F | sudo tee /sys/module/drm/parameters/debug

This mask activates all debugging logs except “verbose vblank” and “verbose atomic state”. Unfortunately, again, these debug logs only managed to show me, with my limited knowledge, that something was stuck in an infinite loop.
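
For reference, the mask decomposes into the per-category bits defined in the kernel’s drm_print.h (values below reflect the header at the time of writing):

# 0x19F = 0x001 (CORE)   | 0x002 (DRIVER) | 0x004 (KMS) | 0x008 (PRIME)
#       | 0x010 (ATOMIC) | 0x080 (LEASE)  | 0x100 (DP)
# i.e. everything except 0x020 (VBL, "verbose vblank") and 0x040 (STATE, "verbose atomic state")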

Tip
Ask for help!

After sharing my experience with my mentors, they assured me the problem was probably in the kernel code itself, not in my setup or my compilation .config. So, to pinpoint what part of the code was causing the problem, I could look through space (stepping through the calls with gdb) or time (using git), and I chose the latter.

So now the plan was to find when the bug was introduced. Luckily, establishing a good starting point wasn’t so hard: a single git checkout to the previous release in the amd-staging-drm-next branch, v5.16, only 1000 commits or so behind the HEAD of the branch. It might seem like a lot to go through, but our time machine is quite fast: git bisect.

According to Wikipedia, the bisection method is a root-finding method that applies to any continuous function for which one knows two values with opposite signs. You probably already know it as binary search. In this case, we are searching for the first commit in which the test fails, so, to start our journey:

git bisect start
git bisect bad # HEAD
git bisect good v5.16

And with that, git informs us there will be around 10 steps and checks out the commit in the middle of that range. For each step: build, deploy, reboot into the new kernel, and run the tests. Quite tedious and time-consuming. Right now, I’m not sure I could do it in a QEMU environment, but if I ever need to bisect the kernel again I’ll certainly look into it, because in that case I would be able to use git bisect run, which would allow the whole process to be automated.
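
In such a setup, the bisection could be driven by a small script (the script name and its contents are hypothetical here) that builds the kernel, boots it, runs the test and reports the result through its exit code:

git bisect start
git bisect bad HEAD
git bisect good v5.16
git bisect run ./check-amd-cs-nop.sh   # exit 0 = good, 1-124 or 126-127 = bad, 125 = skip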

Tip
Always read the build result before deploying.

In the end, the whole process took a full afternoon… just to find myself with a failed bisection. I had the misfortune of making some mistake in the middle of the road, probably skipping the build step by accident due to a compilation error, and marked a commit with the wrong flag. After restarting the bisect the following day, I made sure to avoid my mistake by renaming each build with the short hash of the commit I was compiling. And finally 🎉, I found the first broken commit: https://gitlab.freedesktop.org/agd5f/linux/-/commit/e68efb27647f2106d6b545667f35b2ea39746b57

Well… at least, I found the commit where amd_cs_nop started failing. It looks promising, given that it handles a mutex lock, and my mentors think so as well. The next step was reporting the bug, simply by creating an issue following the “BUG template” given at https://gitlab.freedesktop.org/drm/amd/-/issues/.

And that’s that. Thanks for reading. ❤️


“Repair Bug” by AZRainman is licensed under CC BY 2.0. To view a copy of this license, visit https://creativecommons.org/licenses/by/2.0/?ref=openverse.

June 11, 2022

I’m a Fedora fan. I mean: I have two laptops for development, and both of them run Fedora. I also have a deployment machine. Guess what? It runs Fedora. Stickers? The Fedora logo stuck forever on my laptop.

So, by now, you know: I’m really a Fedora fan.

So, when I started working with the Linux kernel, I really wanted to develop in a Fedora environment. But, without any kind of script, the work of a Linux kernel developer is thankless. I mean, do you want to deploy to a remote machine? Be ready for network configuration, GRUB configuration, generating the initramfs image, and tons of commands. Do you want to manage your config files? Basically, you are back to the ancient times, saving tons of config files in folders.

So, it’s a tremendous setup overhead. But, there is a tool to unify all this infrastructure: the kworkflow.

Kworkflow is great. It unifies all the tools for sending email, managing configs, debugging, and deploying. The problem? It didn’t use to have support for Fedora.

As a defender of the Fedora ecosystem, I couldn’t let that go, I had a mission: bring support for Fedora to kw.

Why use Fedora for development?

Ok, first of all, you need to understand my case of love for Fedora.

My first Linux distribution was Pop!_OS. It was great until they made the update to Cosmic. And I hated Cosmic with all my guts. I felt cheated by System76 (and if you check Reddit, you’re gonna see that it wasn’t only me). My computer was now slow and crashing, and I simply hated all those buttons. I missed my classic and simple GNOME, and I couldn’t understand how a company could go so against its community.

So, I ended up changing it for Fedora 34. Yes, my first Fedora…

And it was simply perfect: simple and plain GNOME 40. Moreover, Fedora is a bleeding-edge distribution: it is always rolling out the latest Linux features, driver updates, and software. It is very innovative: Fedora comes with Wayland and PipeWire out of the box, for example. And, although it is very innovative, it is incredibly stable.

But, we still haven’t solved the problem of Red Hat simply making a disgusting change that the community hates, right? In fact, we have! Fedora is incredibly democratic. The Fedora community is the most important part of the development of each new version, and any change is submitted to the community through a proposal.

And that’s why I love Fedora so much: stability, innovation, and a great community.

But, if none of that convinced you to move to Fedora, maybe Linus Torvalds does. It is a pretty well-known fact that the creator of Linux is a Fedora user.

Implementing kw deploy for Fedora

Fedora is great, but… It didn’t use to have kw support. I couldn’t let that go. So, I started to work on the kw support for Fedora.

First of all, we needed to make it possible to install kw on a Fedora system with the setup.sh script. This was pretty simple after all: I just needed to add a package dependency list to the kw files and specify how to install those packages, basically with dnf.

To get the dependencies, I based myself on the Arch Linux dependency list. If you want to learn more about this feature, check the PR.

So, now the biggest challenge for me: introduce kw deploy for Fedora.

The first step was to learn how to manually install a custom kernel on a Fedora system, how to generate the initramfs, what the default bootloader was, and how to update it.

Fedora has great documentation, so I checked out the guide Building a Custom Kernel.

On Fedora, to build the kernel, you run:

make oldconfig
make bzImage
make modules

And then, to install the kernel, you run:

sudo make modules_install
sudo make install

But, there is a small detail: we don’t simply want to install the kernel on the local system, we want to deploy it remotely.

So, we must generate the initramfs, and send it to the remote machine. The mechanism to send the modules and the initramfs was already coded on kw. So, I had only one task: find out how Fedora generates its initramfs.

I found an article from Fedora Magazine that explained how Fedora uses dracut to generate the initramfs. Generating it is as simple as running:

dracut --force --kver {KERNEL_NAME}

Next step: check out the default bootloader for Fedora. Fedora uses GRUB2, as can be seen in this article. So, it looked pretty simple to me, as kw already had GRUB support.

But all the GRUB support in kw was based on the grub-mkconfig command, and Fedora ships the grub2-mkconfig command instead. So, the first thing I had to work on was adding support for the grub2-mkconfig command to kw, as I found it unfair to simply install another GRUB package on the kw user’s system.
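
For reference, regenerating the GRUB configuration on Fedora boils down to something like this (the output path may differ on some setups):

sudo grub2-mkconfig -o /boot/grub2/grub.cfg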

That said, I based the work on the Debian deployment script and wrote the Fedora deployment script using the Fedora-specific tools: dracut and dnf.

But a new problem arrived: the GRUB menu didn’t seem to show up on my machine. It turns out that Fedora hides the GRUB menu by default. To display GRUB, you must run:

grub2-editenv - unset menu_auto_hide

Ok, now GRUB showed up, but the newly compiled kernel didn’t show up in the GRUB menu. I found out that Fedora comes with GRUB_ENABLE_BLSCFG set to true by default. It was pretty simple to fix this: simply run:

sed -i -e 's/GRUB_ENABLE_BLSCFG=true/GRUB_ENABLE_BLSCFG=false/g' /etc/default/grub

And that’s it! Now, kw d worked like a charm with Fedora. I tested it on my local machine with Fedora 35 and remotely with Fedora 35 and Fedora 36.
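
For reference, here is a rough sketch of the manual sequence that kw d automates on Fedora, combining the commands above (the kernel version string is a placeholder, and the grub.cfg path may differ on some UEFI setups):

# build and install the kernel and its modules
make oldconfig
make bzImage
make modules
sudo make modules_install
sudo make install

# regenerate the initramfs for the new kernel
sudo dracut --force --kver <kernel-version>

# unhide the GRUB menu, disable BLS entries and regenerate the GRUB config
sudo grub2-editenv - unset menu_auto_hide
sudo sed -i -e 's/GRUB_ENABLE_BLSCFG=true/GRUB_ENABLE_BLSCFG=false/g' /etc/default/grub
sudo grub2-mkconfig -o /boot/grub2/grub.cfg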

If you wanna check this feature out, take a look at this PR.

My workflow with Fedora

As I said, I have two development laptops: a Lenovo IdeaPad and a Dell Inspiron 15. The Lenovo is my less powerful notebook, so I basically always carry it in my backpack. And the Dell Inspiron is my precious baby: it has an Intel Core i7, 16 GB of DRAM, and Nvidia graphics.

Moreover, I have a testing machine with an AMD Radeon RX 5700 XT 50th Anniversary. This is where I run my kernel tests (especially graphics tests) and run IGT GPU Tools.

The problem? Setting up a network with three machines without it becoming a cable hell. That’s why I use Tailscale to connect all my machines through one network. It’s a great tool for people with multiple machines.

Next step: make deployment easy and simple. Basically, I want to compile the Kernel on my Dell Inspiron 15 and deploy it to my testing machine through the network.

First thing: have a good config file.

I manage my config files through kw configm and I have three config files: STD_DCN_CONFIG, KUNIT_CONFIG, and STD_CONFIG.

The STD_DCN_CONFIG has the AMDGPU drivers and DCN stuff enabled, plus the TCP configurations needed to make Tailscale work properly. Also, I use Btrfs as my filesystem (as it comes out of the box with Fedora), so I also had to configure the FS.

The KUNIT_CONFIG is almost the same as STD_DCN_CONFIG, but with the addition of the KUNIT module. As I’m working on GSoC, this has been my go-to config recently.

And the STD_CONFIG is the config that comes with Fedora. I don’t use it that much. It’s pretty loaded with modules and it takes too long to compile.

Ok, now it’s time to compile it. kw has a great tool to build Linux images, kw b, but it doesn’t support clang yet (Isa is working on it, though). So, as I like to use the LLVM toolchain and ccache to speed up builds, I run:

make CC="ccache clang" -j8

Great! Now we have a vmlinuz image. But we still need to deploy it. This is where kw really shines for me. I simply run kw d with the reboot option set in my kworkflow.config and that’s it. I just go to my deployment machine and choose a kernel to boot from.

It is incredibly simple, right?

Next Steps

Hope I have inspired you to try out Fedora 36 and kw! These great tools will make your development simple and fast.

My next step with Fedora: introduce kw to the dnf package manager. But that’s a talk for another post.

June 10, 2022

Adding Applications to the GNOME Software Center

Written by Richard Hughes and Christian F.K. Schaller

This blog post is based on a white paper style writeup Richard and I did a few years ago. Since I noticed this week that there wasn’t any other comprehensive writeup online on how to add the required metadata to get an application to appear in GNOME Software (or any other major open source appstore), I decided to turn the writeup into a blog post, hopefully useful to the wider community. I tried to clean it up a bit as I converted it from the old white paper, so hopefully all the information in here is valid as of this posting.

Abstract

Traditionally we have had little information about Linux applications before they have been installed. With the creation of a software center we require access to a rich set of metadata about an application before it is deployed so it can be displayed to the user and easily installed. This document is meant to be a guide for developers who wish to get their software appearing in the software stores in Fedora Workstation and other distributions. Without the metadata described in this document your application is likely to go undiscovered by many or most Linux users, but by reading this document you should be able to prepare your application relatively quickly.

Introduction

GNOME Software

Installing applications on Linux has traditionally involved copying binary and data files into a directory and just writing a single desktop file into a per-user or per-system directory so that it shows up in the desktop environment. In this document we refer to applications as graphical programs, rather than other system add-on components like drivers and codecs. This document will explain why the extra metadata is required and what is required for an application to be visible in the software center. We will try to document how to do this regardless of whether you choose to package your application as an rpm package or as a flatpak bundle. The current rules are a combination of various standards that have evolved over the years, and we will try to summarize and explain them here, going from bottom to top.

System Architecture

Linux File Hierarchy

Traditionally applications on Linux are expected to install binary files to /usr/bin, architecture-independent data files to /usr/share/ and configuration files to /etc. If you want to package your application as a flatpak the prefix used will be /app, so it is critical for applications to respect the prefix setting. Small temporary files can be stored in /tmp and much larger files in /var/tmp. Per-user configuration is either stored in the user’s home directory (in ~/.config) or stored in a binary settings store such as dconf. As an application developer, never hardcode these paths; derive them following the XDG standard so that they relocate correctly inside a Flatpak.
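
For illustration, here is a minimal shell sketch (not from the original white paper; the application name is hypothetical) of resolving per-user directories via the XDG base directory variables rather than hardcoded paths:

# fall back to the XDG defaults when the variables are unset
config_dir="${XDG_CONFIG_HOME:-$HOME/.config}/myapp"
cache_dir="${XDG_CACHE_HOME:-$HOME/.cache}/myapp"
mkdir -p "$config_dir" "$cache_dir"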

Desktop files

Desktop files have been around for a long while now and are used by almost all Linux desktops to provide the basic description of a desktop application that your desktop environment will display, such as a human-readable name and an icon.

So the creation of a desktop file on Linux allows a program to be visible to the graphical environment, e.g. KDE or GNOME Shell. If applications do not have a desktop file they must be manually launched using a terminal emulator. Desktop files must adhere to the Desktop File Specification and provide metadata in an ini-style format such as:

  • Binary type, typically ‘Application’
  • Program name (optionally localized)
  • Icon to use in the desktop shell
  • Program binary name to use for launching
  • Any mime types that can be opened by the applications (optional)
  • The standard categories the application should be included in (optional)
  • Keywords (optional, and optionally localized)
  • Short one-line summary (optional, and optionally localized)

The desktop file should be installed into /usr/share/applications for applications that are installed system wide. An example desktop file is provided below:


[Desktop Entry]
Type=Application
Name=OpenSCAD
Icon=openscad
Exec=openscad %f
MimeType=application/x-openscad;
Categories=Graphics;3DGraphics;Engineering;
Keywords=3d;solid;geometry;csg;model;stl;

The desktop files are used when creating the software center metadata, so you should verify that you ship a .desktop file for each built application, that these keys exist: Name, Comment, Icon, Categories, Keywords and Exec, and that desktop-file-validate correctly validates the file. There should also be only one desktop file for each application.

The application icon should be in PNG format with a transparent background and installed in /usr/share/icons, /usr/share/icons/hicolor/<size>x<size>/apps/, or /usr/share/${app_name}/icons/*. The icon should be at least 128×128 in size (as this is the minimum size required by Flathub).
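
For example, a build system install step for the icon could look like the following shell sketch (the file name and application ID are hypothetical):

# install a 128x128 icon into the hicolor theme under the application ID
install -Dm644 data/icons/myapp-128.png \
    /usr/share/icons/hicolor/128x128/apps/org.example.MyApp.png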

The file name of the desktop file is also very important, as this is the assigned ‘application ID’. New applications typically use a reverse-DNS style, e.g. org.gnome.Nautilus would be the app-id. And the .desktop entry file should thus be named org.gnome.Nautilus.desktop, but older programs may just use a short name, e.g. gimp.desktop. It is important to note that the file extension is also included as part of the desktop ID.

You can verify your desktop file using the command ‘desktop-file-validate’. You just run it like this:


desktop-file-validate myapp.desktop

This tool is available through the desktop-file-utils package, which you can install on Fedora Workstation using this command:


dnf install desktop-file-utils

You also need what is called a metainfo file (previously known as an AppData file). This is a file with the suffix .metainfo.xml (some applications still use the older .appdata.xml name) and it should be installed into /usr/share/metainfo with a name that matches the name of the .desktop file, e.g. gimp.desktop & gimp.metainfo.xml or org.gnome.Nautilus.desktop & org.gnome.Nautilus.metainfo.xml.

In the metainfo file you should include several 16:9 aspect screenshots along with a compelling translated description made up of multiple paragraphs.

In order to make it easier for you to do screenshots in 16:9 format we created a small GNOME Shell extension called ‘Screenshot Window Sizer’. You can install it from the GNOME Extensions site.

Once it is installed you can resize the window of your application to 16:9 format by focusing it and pressing ‘ctrl+alt+s’ (you can press the key combo multiple times to get the correct size). It should resize your application window to a perfect 16:9 aspect ratio and let you screenshot it.

Make sure you follow the style guide, which can be tested using the appstreamcli command line tool. appstreamcli is part of the ‘appstream’ package in Fedora Workstation:


appstreamcli validate foo.metainfo.xml

If you don’t already have the appstreamcli installed it can be installed using this command on Fedora Workstation:

dnf install appstream

What is allowed in a metainfo file is defined in the AppStream specification, but common items that typical applications add are:

  • License of the upstream project in SPDX identifier format [6], or ‘Proprietary’
  • A translated name and short description to show in the software center search results
  • A translated long description, consisting of multiple paragraphs, itemized and ordered lists.
  • A number of screenshots, with localized captions, typically in 16:9 aspect ratio
  • An optional list of releases with the update details and release information.
  • An optional list of kudos which tells the software center about the integration level of the
    application
  • A set of URLs that allow the software center to provide links to help or bug information
  • Content ratings and hardware compatibility
  • An optional gettext or QT translation domain which allows the AppStream generator to collect statistics on shipped application translations.

A typical (albeit somewhat truncated) metainfo file is shown below:

<?xml version="1.0" encoding="UTF-8"?>
<component type="desktop-application">
<id>org.gnome.Terminal</id>
<metadata_license>GPL-3.0+ or GFDL-1.3-only</metadata_license>
<project_license>GPL-3.0+</project_license>
<name>Terminal</name>
<name xml:lang="ar">الطرفية</name>
<name xml:lang="an">Terminal</name>
<summary>Use the command line</summary>
<summary xml:lang="ar">استعمل سطر الأوامر</summary>
<summary xml:lang="an">Emplega la linia de comandos</summary>
<description>
<p>GNOME Terminal is a terminal emulator application for accessing a UNIX shell environment which can be used to run programs available on your system.</p>
<p xml:lang="ar">يدعم تشكيلات مختلفة، و الألسنة و العديد من اختصارات لوحة المفاتيح.</p>
<p xml:lang="an">Suporta quantos perfils, quantas pestanyas y implementa quantos alcorces de teclau.</p>
</description>
<recommends>
<control>console</control>
<control>keyboard</control>
<control>pointing</control>
</recommends>
<screenshots>
<screenshot type="default">https://help.gnome.org/users/gnome-terminal/stable/figures/gnome-terminal.png</screenshot>
</screenshots>
<kudos>
<kudo>HiDpiIcon</kudo>
<kudo>HighContrast</kudo>
<kudo>ModernToolkit</kudo>
<kudo>SearchProvider</kudo>
<kudo>UserDocs</kudo>
</kudos>
<content_rating type="oars-1.1"/>
<url type="homepage">https://wiki.gnome.org/Apps/Terminal</url>
<project_group>GNOME</project_group>
<update_contact>https://wiki.gnome.org/Apps/Terminal/ReportingBugs</update_contact>
</component>

Some AppStream background

The AppStream specification is a mature and evolving standard that allows upstream applications to provide metadata such as localized descriptions, screenshots, extra keywords and content ratings for parental control. This introduction just touches the surface of what it provides, so I recommend reading the specification through once you have understood the basics. The core concept is that the upstream project ships one extra metainfo XML file which is used to build a global application catalog. Thousands of open source projects now include metainfo files, and the software center shipped in Fedora, Ubuntu and OpenSuse is now an easy to use application filled with useful application metadata. Applications without metainfo files are no longer shown, which provides quite some incentive to upstream projects wanting visibility in popular desktop environments. AppStream was first introduced in 2008 and since then many people have contributed to the specification. It is being used primarily for application metadata but is now also used for drivers, firmware, input methods and fonts. There are multiple projects producing AppStream metadata and also a number of projects consuming the final XML metadata.

When applications are being built as packages by a distribution then the AppStream generation is done automatically, and you do not need to do anything other than installing a .desktop file and a metainfo.xml file in the upstream tarball or zip file. If the application is being built on your own machines or cloud instance then the distributor will need to generate the AppStream metadata manually. This would for example be the case when internal-only or closed source software is being either used or produced. This document assumes you are currently building RPM packages and exporting yum-style repository metadata for Fedora or RHEL, although the concepts are the same for rpm-on-OpenSuse or deb-on-Ubuntu.

NOTE: If you are building packages, make sure that there are not two applications installed with one single package. If this is currently the case, split up the package so that there are multiple subpackages, or mark one of the .desktop files as NoDisplay=true. Make sure the application subpackages depend on any -common subpackage and deal with upgrades (perhaps using a metapackage) if you’ve shipped the application before.

Summary of Package building

So the steps outlined above explain the extra metadata you need to have your application show up in GNOME Software. This tutorial does not cover how to set up your build system to build these, but for both Meson and autotools you should be able to find a wide range of examples online. There are also major resources available that explain how to create a Fedora RPM or how to build a Flatpak. You probably also want to tie both the desktop file and the metainfo file into your i18n system so the metadata in them can be translated. It is worth noting here that while this document explains how you can do everything yourself, we generally recommend relying on existing community infrastructure for hosting source code and packages if you can (for instance if your application is open source), as it will save you work and effort over time. For instance, putting your source code into the GNOME git will give you free access to the translator community in GNOME and thus significantly increase the chance your application is internationalized. By building your package in Fedora you can get peer review of your package and free hosting of the resulting package. And by putting your package up on Flathub you get wide cross-distribution availability.

Setting up hosting infrastructure for your package

We will here explain how you set up a Yum repository for RPM packages that provides the needed metadata. If you are making a Flatpak we recommend skipping ahead to the Flatpak section a bit further down.

Yum hosting and Metadata:

When GNOME Software checks for updates it downloads various metadata files from the server describing the packages available in the repository. GNOME Software can also download AppStream metadata at the same time, allowing add-on repositories to include applications that are visible in the software center. In most cases distributors are already building binary RPMs and then building metadata as an additional step by running something like this to generate the repomd files on a directory of packages. The tool for creating the repository metadata is called createrepo_c and is part of the package createrepo_c in Fedora. You can install it by running the command:


dnf install createrepo_c

Once the tool is installed you can run these commands to generate your metadata:


$ createrepo_c --no-database --simple-md-filenames SRPMS/
$ createrepo_c --no-database --simple-md-filenames x86_64/

This creates the primary and filelist metadata required for updating on the command line. Next, to build the metadata required for the software center, we need to actually generate the AppStream XML. The tool you need for this is called appstream-builder. This works by decompressing .rpm files, merging together the .desktop file and the .metainfo.xml file, and preprocessing the icons. Remember, only applications installing AppData files will be included in the metadata.

You can install appstream-builder on Fedora Workstation by using this command:

dnf install libappstream-glib-builder

Once it is installed you can run it by using the following syntax:

$ appstream-builder \
   --origin=yourcompanyname \
   --basename=appstream \
   --cache-dir=/tmp/asb-cache \
   --enable-hidpi \
   --max-threads=1 \
   --min-icon-size=32 \
   --output-dir=/tmp/asb-md \
   --packages-dir=x86_64/ \
   --temp-dir=/tmp/asb-icons

This takes a few minutes and generates some files in the output directory. Your output should look something like this:


Scanning packages...
Processing packages...
Merging applications...
Writing /tmp/asb-md/appstream.xml.gz...
Writing /tmp/asb-md/appstream-icons.tar.gz...
Writing /tmp/asb-md/appstream-screenshots.tar...Done!

The actual build output will depend on your compose server configuration. At this point you can also verify that your application is visible in the generated appstream.xml.gz file.
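
For example, you could quickly check that your application ID shows up in the generated metadata with something like this (the application ID here is hypothetical):

zcat /tmp/asb-md/appstream.xml.gz | grep -A3 '<id>org.example.MyApp</id>'
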
We then have to take the generated XML and the tarball of icons and add it to the repomd.xml master document so that GNOME Software automatically downloads the content for searching.
This is as simple as doing:

modifyrepo_c \
    --no-compress \
    --simple-md-filenames \
    /tmp/asb-md/appstream.xml.gz \
    x86_64/repodata/
modifyrepo_c \
    --no-compress \
    --simple-md-filenames \
    /tmp/asb-md/appstream-icons.tar.gz \
    x86_64/repodata/

 

Deploying this metadata will allow GNOME Software to add the application metadata the next time the repository is refreshed, typically once per day.

Hosting your Yum repository on GitHub

GitHub isn’t really set up for hosting Yum repositories, but here is a method that currently works. So once you have created a local copy of your repository, create a new project on GitHub. Then use the following commands to import your repository into GitHub.


cd ~/src/myrepository
git init
git add -A
git commit -a -m "first commit"
git remote add origin git@github.com:yourgitaccount/myrepo.git
git push -u origin master

Once everything is imported, go into the GitHub web interface and drill down in the file tree until you find the file called ‘repomd.xml’ and click on it. You should now see a button in the GitHub interface called ‘Raw’. Once you click that you get the raw version of the XML file, and in the URL bar of your browser you should see a URL looking something like this:
https://raw.githubusercontent.com/cschalle/hubyum/master/noarch/repodata/repomd.xml
Copy that URL as you will need the information from it to create your .repo file, which is what distributions and users need in order to reach your new repository. To create your .repo file copy this example and edit it to match your data:


[remarkable]
name=Remarkable Markdown editor software and updates
baseurl=https://raw.githubusercontent.com/cschalle/hubyum/master/noarch
gpgcheck=0
enabled=1
enabled_metadata=1

So on top is your repo shortname inside the brackets, then a name field with a more extensive name. For the baseurl, paste the URL you copied earlier and remove the last bits until you are left with either the ‘noarch’ directory or your platform directory, for instance x86_64. Once you have that file completed, put it into /etc/yum.repos.d on your computer and load up GNOME Software. Click on the ‘Updates’ button in GNOME Software and then on the refresh button in the top left corner to ensure your database is up to date. If everything works as expected you should then be able to do a search in GNOME Software and find your new application showing up.
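
As a rough sketch, deploying and testing the repository file from the command line could look like this (assuming you saved the example above as remarkable.repo):

sudo cp remarkable.repo /etc/yum.repos.d/
sudo dnf --refresh search remarkable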

Example of self hosted RPM

Flatpak hosting and Metadata

The flatpak-builder binary generates AppStream metadata automatically when building applications if the appstream-compose tool is installed on the flatpak build machine. Flatpak remotes are exported with a separate ‘appstream’ branch which is automatically downloaded by GNOME Software, and no additional work is required when building your application or updating the remote. Adding the remote is enough to add the application to the software center, on the assumption the AppData file is valid.
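
As a rough sketch, building and testing such a Flatpak locally could look like this (the manifest name and application ID are hypothetical):

# build the application and export it, including the appstream branch, into a local repo
flatpak-builder --repo=repo --force-clean build-dir org.example.MyApp.json

# add the local repo as a remote and install the application from it
flatpak remote-add --no-gpg-verify myrepo repo
flatpak install myrepo org.example.MyApp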

Conclusions

AppStream files allow us to build a modern software center experience either using distro packages with yum-style metadata or with the new flatpak application deployment framework. By including a desktop file and AppData file for your Linux binary build, your application can be easily found and installed by end users, greatly expanding its userbase.

June 07, 2022

I Remembered

After my last blog post I was so exhausted I had to take a week off, but I’m back. In the course of my blog-free week, I remembered the secret to blogging: blog before I start real work for the day.

It seems obvious, but once that code starts flowing, the prose ain’t coming.

Updates

It’s been a busy couple weeks.

As I mentioned, I’ve been working on getting zink running on an Adreno chromebook using turnip—the open source Qualcomm driver. When I first fired up a CTS run, I was at well over 5000 failures. Today, two weeks later, that number is closer to 150. Performance in SuperTuxKart, the only app I have that’s supposed to run well, is a bit lacking, but it’s unclear whether this is anything related to zink. Details to come in future posts.

Some time ago I implemented dmabuf support for lavapipe. This is finally landing, assuming CI doesn’t get lost at the pub on the way to ramming the patches into the repo. Enjoy running Vulkan-based display servers in style. I’m still waiting on Cyberpunk to finish loading, but I’m planning to test out a full lavapipe system once I finish the game.

Also in lavapipe-land, 1.3 conformance submissions are pending. While 1.2 conformance went through and was blogged about to great acclaim, this unfortunately can’t be listed in main or a current release since the driver has 1.2 conformance but advertises 1.3 support. The goalpost is moving, but we’ve got our hustle shoes on.

Real Work

On occasion I manage to get real work done. A couple weeks ago, in the course of working on turnip support, I discovered something terrible: Qualcomm has no 64bit shader support.

This is different from old-time ANV, where 64bit support isn’t in the hardware but is still handled correctly in the backend and so everything works even if I clumsily fumble some dmat3 types into the GPU. On Qualcomm, such tomfoolery is Not Appreciated and will either fail or just crash altogether.

So how well did 64bit -> 32bit conversions work in zink?

Not well.

Very not well.

A Plethora Of Failures

Before I get into all the methods by which zink fails, let’s talk about it at a conceptual level: what 64bit operations are needed here?

There are two types of 64bit shader operations, Int64 and Float64. At the API level, these correspond to doing 64bit integer and 64bit float operations respectively. At the functional level, however, they’re a bit different. Due to how zink handles shader i/o, Int64 is what determines support for 64bit descriptor interfaces as well as cross-stage interfaces. Float64 is solely for handling ALU operations involving 64bit floating point values.

With that in mind, let’s take a look at all the ways this failed:

  • 64bit UBO loads
  • 64bit SSBO loads
  • 64bit SSBO stores
  • 64bit shared memory loads
  • 64bit shared memory stores
  • 64bit variable loads
  • 64bit variable stores
  • 64bit XFB stores

Basically everything.

Oops.

Fixing: Part One

There’s a lot of code involved in addressing all these issues, so rather than delving too deeply into it (you can see the MR here), I’ll go over the issues at a pseudo code level.

Like a whiteboard section of an interview.

Except useful.

First, let’s take a look at descriptor handling. This is:

  • 64bit UBO loads
  • 64bit SSBO loads
  • 64bit SSBO stores
  • 64bit shared memory loads
  • 64bit shared memory stores

All of these are handled in two phases. Initially, the load/store operation is rewritten. This involves two steps:

  • rewrite the offset of the operation in terms of dwords rather than bytes (zink loads array<uint> variables)
  • rewrite 64bit operations to 2x32 if 64bit support is not present

As expected, at least one of these was broken in various ways.

Offset handling was, generally speaking, fine. I got something right.

What I didn’t get right, however, was the 64bit rewrites. While certain operations were being converted, I had (somehow) failed to catch the case of missing Int64 support as a place where such conversions were required, and so the rewrites weren’t being done. With that fixed, I discovered even more issues:

  • more 64bit operations were being added in the course of converting from 64bit to 32bit
  • non-scalar loads/stores broke lots of things

The first issue was simple to fix: instead of doing manual bitshifts and casts, use NIR intrinsics like nir_op_pack_64_2x32 to handle rewrites since these can be optimized out upon meeting a matching nir_op_unpack_64_2x32 later on.

The second issue was a little trickier, but it mostly just amounted to forcing scalarization during the rewrite process to avoid issues.

Except there were other issues, because of course there were.

Some of the optimization passes required for all this to work properly weren’t handling atomic load/store ops correctly and were deleting loads that occurred after atomic stores. This spawned a massive amount of galaxy brain-level discussion that culminated in some patches that might land someday.

So with all this solved, there were only a couple issues remaining.

How hard could it be?

FFFFFFFUUUUUUUUUUUU

Anyway, so there were more issues, but really it was just two issues:

  • 64bit shader variable handling
  • 64bit XFB handling

Tackling the first issue first, the way I saw it, 64bit variables had to be rewritten into expanded (e.g., dvec2 -> vec4) types and then load/store operations had to be rewritten to split the values up and write to those types. My plan for this was:

  • scan the shader for variables containing 64bit types
  • rewrite the type for a matching variable
  • rewrite the i/o for matching variable to use the new type

This obviously hit a snag with the larger types (e.g., dvec3, dvec4) where there was more than a vec4 of expanded components. I ended up converting these to a struct containing vec4 members that could then be indexed based on the offset of the original load.

But hold on, Jason “I’m literally writing an NVIDIA compiler as I read your blog post” Ekstrand is probably thinking, what about even bigger types, like dmat3?

I’m so glad you asked, Jason.

The difference with a 64bit matrix type is that it’s treated as an array<dvec> by NIR, which means that the access chain goes something like:

  • deref var
  • deref array (row/column)
  • load/store

That intermediate step means indexing from a struct isn’t going to work since the array index might not be constant.

It’s not going to work, right?

anakin.png

It turns out that if you use the phi, anything is possible.

Here’s a taste of how 64bit matrix loads are handled:

/* matrix types always come from array (row) derefs */
assert(deref->deref_type == nir_deref_type_array);
nir_deref_instr *var_deref = nir_deref_instr_parent(deref);
/* let optimization clean up consts later */
nir_ssa_def *index = deref->arr.index.ssa;
/* this might be an indirect array index:
 * - iterate over matrix columns
 * - add if blocks for each column
 * - phi the loads using the array index
 */
unsigned cols = glsl_get_matrix_columns(matrix);
nir_ssa_def *dests[4];
for (unsigned idx = 0; idx < cols; idx++) {
   /* don't add an if for the final row: this will be handled in the else */
   if (idx < cols - 1)
      nir_push_if(&b, nir_ieq_imm(&b, index, idx));
   unsigned vec_components = glsl_get_vector_elements(matrix);
   /* always clamp dvec3 to 4 components */
   if (vec_components == 3)
      vec_components = 4;
   unsigned start_component = idx * vec_components * 2;
   /* struct member */
   unsigned member = start_component / 4;
   /* number of components remaining */
   unsigned remaining = num_components;
   /* component index */
   unsigned comp_idx = 0;
   for (unsigned i = 0; i < num_components; member++) {
      assert(member < glsl_get_length(var_deref->type));
      nir_deref_instr *strct = nir_build_deref_struct(&b, var_deref, member);
      nir_ssa_def *load = nir_load_deref(&b, strct);
      unsigned incr = MIN2(remaining, 4);
      /* repack the loads to 64bit */
      for (unsigned c = 0; c < incr / 2; c++, comp_idx++)
         comp[comp_idx] = nir_pack_64_2x32(&b, nir_channels(&b, load, BITFIELD_RANGE(c * 2, 2)));
      remaining -= incr;
      i += incr;
   }
   dest = dests[idx] = nir_vec(&b, comp, intr->num_components);
   if (idx < cols - 1)
      nir_push_else(&b, NULL);
}
/* loop over all the if blocks that were made, pop them, and phi the loaded+packed results */
for (unsigned idx = cols - 1; idx >= 1; idx--) {
   nir_pop_if(&b, NULL);
   dest = nir_if_phi(&b, dests[idx - 1], dest);
}

I just saw Big Triangle reaching for a phone, probably to call the police regarding illegal amounts of compiler horseplay, so let’s move on so I can maybe finish this before I get hauled off.

The only remaining issue now is

FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

Look, there’s a reason why entire blog posts are devoted to the awfulness of XFB.

It’s bad.

Really bad.

I’m not going to go deep into the changes required. They’re terrible. The gist of it is that zink utilizes two mechanisms to handle XFB, and one is worse than the other. That one is the “manual emission” handler, in which the output of a variable is read back and then explicitly stored to an XFB output. This has to handle all the variable types (in spirv) for both the load and the store, and then it also has to untangle the abomination that is Gallium XFB handling, and it’s just terrible.

But it’s done.

And with it comes the marking off of a ton of CTS failures for turnip and other platforms that lack 64bit support.

June 06, 2022

So it begins!

With some pushes and pulls from friends, I’ve been studying the Linux graphics stack for some time now. After some minor patches to both Mesa and the Linux kernel, I followed the instructions thoroughly and landed a successful Google Summer of Code proposal:

Introduce Unit Tests to the AMDGPU “DCE” Component

My project’s primary goal is to create unit tests using KUnit for the AMDGPU driver focused on the Display and Compositing Engine (DCE) 11.2, which will be tested on the GPU “RX 580”.

The motivation for that is not only to assert that the APIs work as expected, but also to keep their behavior stable across minor changes in their code, which allows for great improvements to code readability and maintainability.

For the implementation of the tests, we decided to go with the Kernel Unit Testing Framework (KUnit). KUnit makes it possible to run test suites on kernel boot or to load the tests as a module. It reports all test case results through TAP (Test Anything Protocol) output in the kernel log.
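
For instance, a typical way to run a KUnit suite from a kernel tree is through the kunit.py wrapper; a minimal sketch (the kunitconfig path here is hypothetical):

# build a UML kernel with the selected tests and parse the TAP output
./tools/testing/kunit/kunit.py run --kunitconfig=path/to/my/kunitconfig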

There is a good chance that KUnit will have some limitations with regard to testing GPU driver functions, so the secondary goal will be to enhance its capabilities. There will be other people working with KUnit on DCN in parallel, so there will be a lot of code review to be done as well. I will keep track of my weekly progress on my blog, reporting the challenges I face and trying to create introductory material that could help future newcomers.

Mentors and Teammates

During this summer I’ll have by my side Isabella Basso and Maíra Canal, sharing an overall similar GSOC proposal but working with DCN, which is used by newer GPUs, and Magali Lemes, working on her related capstone project. We all will be mentored by three awesome FLOSS contributors:

Community

When talking about FLOSS, communication must be plenty and time-travel compatible 🙃!

Jokes aside, the two main channels to chat and exchange patches are IRC and mailing lists, respectively.

Mailing Lists

Right now, I’m still overwhelmed by the volume of emails arriving (even after setting up some filters). Searching for relevant threads at lore.kernel.org has proven useful. Currently, I’m subscribed to:

IRC channels

IRC is an important tool used by the community to keep in touch in real time. Similar to old-school chat rooms, there’s no chat history by default, so, to circumvent that, I’ve been using thelounge kindly deployed by André, which acts not only as an IRC web client but also as an IRC bouncer, meaning that it keeps me connected and stores any messages while I’m away.

I’ve joined the following IRC channels:

  • #kunit-usp: Where daily discussions from our team are being held, in Portuguese.
  • #kunit: KUnit development channel.
  • #kw-devel: Kworkflow development channel.
  • #dri-devel: Pretty active channel shared by Mesa and Kernel graphics (filled with light hearted people, highly recommend!).
  • #freedesktop: freedesktop.org infrastructure and online services.
  • #radeon: Support and development for open-source radeon/amdgpu drivers.
  • #xorg-devel: X.Org development discussion.

“Happy Birthday Penguin Cake” by foamcow is licensed under CC BY-NC-SA 2.0. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-sa/2.0/?ref=openverse.

In recent days I have been testing how modifying the default CPU and GPU frequencies on the rpi4 increases the performance of our reference Vulkan applications. By default Raspbian uses 1500MHz and 500MHz respectively. But with good heat dissipation (a good fan, the rpi400 heat spreader, etc.) you can play a little with those values.
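
For reference, on Raspbian/Raspberry Pi OS those frequencies are typically raised from /boot/config.txt; a minimal sketch with the values used later in this post (which values your board can sustain, and whether extra voltage is needed, depends on your cooling):

# /boot/config.txt
arm_freq=1800
gpu_freq=750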

One of the tools we usually use to check performance changes is gfxreconstruct. This tool allows you to record all the Vulkan calls during the execution of an application, and then replay the captured file. So we have traces of several applications, and we use them to test any hypothetical performance improvement, or to verify that some change doesn’t cause a performance drop.
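
As a rough sketch, capturing and replaying a trace with gfxreconstruct looks like this (the application and file names are hypothetical):

# capture: enable the GFXReconstruct Vulkan layer and pick an output file
export VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_gfxreconstruct
export GFXRECON_CAPTURE_FILE=suntemple.gfxr
./SunTemple

# replay the captured Vulkan calls
gfxrecon-replay suntemple.gfxr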

So, let’s see what we get if we increase the CPU/GPU frequencies, focusing on the Unreal Engine 4 demos, which are the most shader intensive:

Unreal Engine 4 demos FPS chart

So as expected, with higher clock speed we see a good boost in performance of ~10FPS for several of these demos.

Some could wonder why the increase in the CPU frequency had so little impact. As I mentioned, we didn’t get those values from the real applications, but from gfxreconstruct traces. Those only capture the Vulkan calls. So on those replays there are no tasks like collision detection, user input, etc. that are usually handled on the CPU. Also, as mentioned, all the Unreal Engine 4 demos use really complex shaders, so the “bottleneck” there is the GPU.

Let’s move now from the cold numbers, and test the real applications. Let’s start with the Unreal Engine 4 SunTemple demo, using the default CPU/GPU frequencies (1500/500):

Even if it runs fairly smoothly most of the time at ~24 FPS, there are some places where it dips below 18 FPS. Let’s see it now with the CPU/GPU frequencies increased to 1800/750:

Now the demo runs at ~34 FPS most of the time. The worst dip is ~24 FPS. It is a lot smoother than before.

Here is another example with the Unreal Engine 4 Shooter demo, already increasing the CPU/GPU frequencies:

Here the FPS never dips below 34 FPS, staying at ~40 FPS most of the time.

It has been around a year and a half since we announced a Vulkan 1.0 driver for Raspberry Pi 4, and since then we have made significant performance improvements, mostly around our compiler stack, which have notably improved some of these demos. In some cases (like the Unreal Engine 4 Shooter demo) we got a 50%-60% improvement (if you want more details about the compiler work, you can read the details here).

In this post we can see how, on top of that work and taking advantage of increased CPU and GPU frequencies, we can really start to get reasonable framerates in more demanding demos. Even if this is still at low resolutions (for this post all the demos were running at 640×480), it is still great to see this on a Raspberry Pi.

June 01, 2022

Intro

Compute Express Link (CXL) is the next spec of significance for connecting hardware devices. It will replace or supplement existing stalwarts like PCIe. The adoption is starting in the datacenter, and the specification definitely provides interesting possibilities for client and embedded devices. A few years ago, the picture wasn't so clear. The original release of the CXL specification wasn't Earth shattering. There were competing standards with intent hardware vendors behind them. The drive to, and release of, the Compute Express Link 2.0 specification changed much of that.

There are a bunch of really great materials hosted by the CXL consortium. I find that these are primarily geared toward use cases, hardware vendors, and sales & marketing. This blog series will dissect CXL with the scalpel of a software engineer. I intend to provide an in depth overview of the driver architecture, and go over how the development was done before there was hardware. This post will go over the important parts of the specification as I see them.

All spec references are relative to the 2.0 specification which can be obtained here.

What it is

Let's start with two practical examples.

  1. If one were creating a PCIe based memory expansion device today, and desired that device expose coherent byte addressable memory, there would really be only two viable options. One could expose this memory as memory mapped input/output (MMIO) via a Base Address Register (BAR). Without horrendous hacks, and to have ubiquitous CPU support, the only sane way this can work is if you map the MMIO as uncached (UC), which has a distinct performance penalty. For more details on coherent memory access to the GPU, see my previous blog post. Access to the device memory should be fast (at least, not limited by the protocol's restrictions), and we haven't managed to accomplish that. In fact, the NVMe 1.4 specification introduces the Persistent Memory Region (PMR) which sort of does this but is still limited.
  2. If one were creating a PCIe based device whose main job was to do Network Address Translation (NAT) (or some other IP packet mutation) that was to be done by the CPU, critical memory bandwidth would be needed for this. This is because the CPU will have to read data in from the device, modify it, and write it back out, and the only way to do this via PCIe is to go through main memory. (More on this below.)

CXL defines 3 protocols that work on top of PCIe that enable (Chapter 3 of the CXL 2.0 specification) a general purpose way to implement the examples. Two of these protocols help address our 'coherent but fast' problem above. I'll call them the data protocols. The third, CXL.io can be thought of as a stricter set of requirements over PCIe config cycles. I'll call that the enumeration/configuration protocol. We'll not discuss that in any depth as it's not particularly interesting.

There's plenty of great overviews such as this one. The point of this blog is to focus on the specific aspects driver writers and reviewers might care about.

CXL.cache

But first, a bit on PCIe coherency. Modern x86 architectures have cache coherent PCIe DMA. For DMA reads this simply means that the DMA engine obtains the most recent copy of the data by requesting it from the fabric. For writes, once the DMA is complete the DMA engine will send an invalidation request to the host(s) to invalidate the range that was DMA'd. Fundamentally however, using this is generally not optimal since keeping coherency would require the CPU to basically snoop the PCIe interconnect all the time. This would be bad for power and performance. As such, drivers generally manage coherency via software mechanisms. There are exceptions to this rule, but I'm not intimately familiar with them, so I won't add more detail.

Driving a PCIe NIC

CXL.cache is interesting because it allows the device to participate in the CPU cache coherency protocol as if it were another CPU rather than being a device. From a software perspective, it's the less interesting of the two data protocols. Chapter 3.2.x has a lot of words around what this protocol is for, and how it is designed to work. This protocol is targeted towards accelerators which do not have anything to provide to the system in terms of resources, but instead utilize host-attached memory and a local cache. The CXL.cache protocol, if successfully negotiated throughout the CXL topology, host to endpoint, should just work. It permits the device to have a coherent view of memory without software intervention and, on x86, without the potential negative ramifications of the snoops. Similarly, the host is able to read from the device caches without using main memory as a stopping point. Main memory can be skipped and instead data can be directly transferred over the CXL.cache protocol. The protocol describes snoop filtering and the necessary messages to keep coherency. As a software person, I consider it a more efficient version of PCIe coherency, and one which transcends x86 specificity.

Protocol

CXL.cache has a bidirectional request/response protocol where a request can be made from host to device (H2D) or vice-versa (D2H). The set of commands are what you'd expect to provide proper snooping. For example, the host issues H2D requests with one of the 3 Snp* opcodes defined in 3.2.4.3.X; these allow it to gain exclusive access to a line, get shared access, or just get the current value, while the device uses one of several commands in Table 18 to read/write/invalidate/flush (similar uses).

One might also notice that the peer to peer case isn't covered. The CXL model however makes every device/CPU a peer in the CXL.cache domain. While the current CXL specification doesn't address skipping CPU caches in this matter entirely, it'd be a safe bet to assume a specification so comprehensive would be getting there soon. CXL would allow this more generically than NVMe.

To summarize, CXL.cache essentially lets CPU and device caches remain coherent without needing to use main memory as the synchronization barrier.

CXL.mem

If CXL.cache is for devices that don't provide resources, CXL.mem is exactly the opposite. CXL.mem allows the CPU to have coherent byte addressable access to device-attached memory while maintaining its own internal cache. Unlike CXL.cache, where every entity is a peer and either the device or the host can send requests and responses, in CXL.mem the CPU, known as the "master" in the CXL spec, is responsible for sending requests, and the CXL subordinate (device) sends the response. Introduced in CXL 1.1, CXL.mem was added for Type 2 devices. Requests from the master to the subordinate are "M2S" and responses are "S2M".

When CXL.cache isn't also present, CXL.mem is very straightforward. All requests boil down to a read, or a write. When CXL.cache is present, the situation is more tricky. For performance improvements, the host will tell the subordinate about certain ranges of memory which may not need to handle coherency between the device cache, device-attached memory, and the host cache. There is also meta data passed along to inform the device about the current cacheline state. Both the master and subordinate need to keep their cache state in harmony.

Protocol

The CXL.mem protocol is straightforward, especially when the device doesn't also use CXL.cache (i.e. it has no local cache).

The requests are known as

  1. Req - Request without data. These are generally reads and invalidates, where the response will put the data on the S2M's data channel.
  2. RwD - Request with data. These are generally writes, where the data channel has the data to write.

The responses:

  1. NDR - No Data Response. These are generally completions on state changes on the device side, such as, writeback complete.
  2. DRS - Data Response. This is used for returning data, ie. on a read request from the master.

Bias controls

Unsurprisingly, strict coherency often negatively impacts bandwidth, or latency, or both. While it's generally ideal from a software model to be coherent, it likely won't be ideal for performance. The CXL specification has a solution for this. Chapter 2.2.1 describes a knob which provides a hint about which entity should pay for that coherency (CPU vs. device). For many HPC workloads, such as weather modeling, large sets of data are uploaded to an accelerator's device-attached memory via the CPU, then the accelerator crunches numbers on the data, and finally the CPU downloads the results. In CXL, at all times, the model data is coherent for both the device and the CPU. Depending on the bias however, one or the other will take a performance hit.

Host vs. device bias

Using the weather modeling example, there are 4 interesting flows.

  1. CPU writes data to device-attached memory.
  2. GPU reads data from device-attached memory.
  3. *GPU writes data to device-attached memory.
  4. CPU reads data from device-attached memory.

*#3 above poses an interesting situation that was possible only with bespoke hardware. The GPU could in theory write that data out via CXL.cache and short-circuit another bias change. In practice though, many such usages would blow out the cache.

The CPU coherency engine has been a thing for a long time. One might ask, why not just use that and be done with it? Well, easy one first: a Device Coherency Engine (DCOH) was already required for CXL.cache protocol support. More practically however, the hit to latency and bandwidth is significant if every cacheline access requires a check-in with the CPU's coherency engine. What this means is that when the device wishes to access data from this line, it must first determine the cacheline state (DCOH can track this); if that line isn't exclusive to the accelerator, the accelerator needs to use the CXL.cache protocol to request the CPU make things coherent, and only once that is complete can it access its device-attached memory. Why is that? If you recall, CXL.cache is essentially where the device is the initiator of the request, and CXL.mem is where the CPU is the initiator.

So suppose we continue on this CPU owns coherency adventure. #1 looks great, the CPU can quickly upload the dataset. However, #2 will immediately hit the bottleneck just mentioned. Similarly for #3, even though a flush won't have to occur, the accelerator will still need to send a request to the CPU to make sure the line gets invalidated. To sum up, we have coherency, but half of our operations are slower than they need to be.

To address this, a [fairly vague] description of bias controls is defined. When in host bias mode, the CPU coherency engine effectively owns the cacheline state (the contents are shared of course) by requiring the device to use CXL.cache for coherency. In device bias mode however, the host will use CXL.mem commands to ensure coherency. This is why Type 2 devices need both CXL.cache and CXL.mem.

Device types

I'd like to know why they didn't start numbering at 0. I've already talked quite a bit about device types. I believe it made sense to define the protocols first though so that device types would make more sense. CXL 1.1 introduced two device types, and CXL 2.0 added a third. All types implement CXL.io, the less than exciting protocol we ignore.

Type   CXL.cache   CXL.mem
1      y           n
2      y           y
3      n           y

Just from looking at the table it'd be wise to ask: if Type 2 does both protocols, why do Type 1 and Type 3 devices exist? In short, gate savings can be had with Type 1 devices not needing CXL.mem, and Type 3 devices offer gate savings and increased performance because they don't have to manage internal cache coherency. More on this next...

CXL Type 1 Devices

These are your accelerators without local memory.

The quintessential Type 1 device is the NIC. A NIC pushes data from memory out onto the wire, or pulls from the wire and into memory. It might perform many steps, such as repackaging a packet, or encryption, or reordering packets (I dunno, not a networking person). Our NAT example above is one such case.

How you might envision that working is the PCIe device would write the incoming packet into the Rx buffer. The CPU would copy that packet out of the Rx buffer, update the IP and port, then write it into the Tx buffer. This set of steps would use memory write bandwidth when the device wrote into the Rx buffer, memory read bandwidth when the CPU copied the packet, and memory write bandwidth when the CPU writes into the Tx buffer. Again, NVMe has a concept to support a subset of this case for peer to peer DMA called Controller Memory Buffers (CMB), but this is limited to NVMe based devices, and doesn't help with coherency on the CPU. Summarizing, (D is device cache, M is memory, H is Host/CPU cache)

  1. (D2M) Device writes into Rx Queue
  2. (M2H) Host copies out buffer
  3. (H2M) Host writes into Tx Queue
  4. (M2D) Device reads from Tx Queue

Post-CXL this becomes a matter of managing cache ownership throughout the pipeline. The NIC would write the incoming packet into the Rx buffer. The CPU would likely copy it out so as to prevent blocking future packets from coming in. Once done, the CPU has the buffer in its cache. The packet information could be mutated all in the cache, and then delivered to the Tx queue for sending out. Since the NIC may decide to mutate the packet further before going out, it'd issue the RdOwn opcode (3.2.4.1.7), from which point it would effectively own that cacheline.

  1. (D2M) Device writes into Rx Queue
  2. (M2H) Host copies out buffer
  3. (H2D) Host transfers ownership into Tx Queue

With accelerators that don't have the possibility of causing backpressure like the Rx queue does, step 2 could be removed.

CXL Type 2 Devices

These are your accelerators with local memory.

Type 2 devices are mandated to support both data protocols and as such, must implement their own DCOH engine (this will vary in complexity based on the underlying device's cache hierarchy complexity). One can think of this problem the same way as multiple CPUs where each has its own L1/L2, but a shared L3 (like Intel CPUs have, where the L3 is the LLC). Each CPU would need to track transitions between its local L1/L2 and the shared L3. TL;DR on this is, for Type 2 devices, there's a relatively complex flow to manage local cache state on the device in relation to the host-attached memory they are using.

In a pre-CXL world, if a device wants to access its own memory, caches or no, it would have the logic to do so. For example, in GPUs, the sampler generally has a cache. If you try to access texture data via the sampler that is already in the sampler cache, everything remains internal to the device. Similarly, if the CPU wishes to modify the texture, an explicit command to invalidate the GPUs sampler cache must be issued before it can be reliably used by the GPU (or flushed if your GPU was modifying the texture).

Continuing with this example in the post-CXL world, the texture lives in graphics memory on the card, and that graphics memory is participating in CXL.mem protocol. That would imply that should the CPU want to inspect, or worse, modify the texture it can do so in a coherent fashion. Later in Type 3 devices, we'll figure out how none of this needs to be that complex for memory expanders.

CXL Type 3 Devices

These are your memory modules. They provide memory capacity that's persistent, volatile, or a combination.

Even though a Type 2 device could technically behave as a memory expander, it's not ideal to do so. The nature of a Type 2 device is that it has a cache which also needs to be maintained. Even with meticulous use of bias controls, extra invalidations and flushes will need to occur, and of course, extra gates are needed to handle this logic. The host CPU does not know a Type 2 device has no cache. To address this, the CXL 2.0 specification introduces a new type, Type 3, which is a "dumb" memory expander device. Since this device has no visible caches (because there is no accelerator), a reduced set of CXL.mem protocol can be used, the CPU will never need to snoop the device, which means the CPU's cache is the cache of truth. What this also implies is a CXL type 3 device simply provides device-attached memory to the system for any use. Hotplug is permitted. Type 3 peer to peer is absent from the 2.0 spec, and unlike CXL.cache, it's not as clear to see the path forward because CXL.mem is a Master/Subordinate protocol.

In a pre-CXL world the closest thing you find to this are a combination of PCIe based NVMe devices (for persistent capacity), NVDIMM devices, and of course, attached DRAM. Generally, DRAM isn't available as expansion cards because a single DDR4 DIMM (which is internally dual channel), only has 21.6 GB/s of bandwidth. PCIe can keep up with that, but it requires all 16 lanes, which I guess isn't scalable, or cost effective, or something. But mostly, it's not a good use of DRAM when the platform based interleaving can yield bandwidth in the hundreds of gigabytes per second.

Type            Max bandwidth (GB/s)
PCIe 4.0 x16    32
PCIe 5.0 x16    64
PCIe 6.0 x16    256
DDR4 (1 DIMM)   25.6
DDR5 (1 DIMM)   51.2
HBM3            819

In a post-CXL world the story is changed in the sense that the OS is responsible for much of the configuration, and this is why Type 3 devices are the most interesting from a software perspective. Even though CXL currently runs on PCIe 5.0, CXL offers the ability to interleave across multiple devices, thus increasing the bandwidth in multiples of the number of interleave ways. When you take PCIe 6.0 bandwidth, and interleaving, CXL offers quite a robust alternative to HBM, and can even scale to GPU-level memory bandwidth with DDR.

Memory types

Host physical address space management

This would apply to Type 3 devices, but technically could also apply to Type 2 devices.

Even though the protocols and use-cases should be understood, the devil is in the details with software enabling. Type 1 and Type 2 devices will largely gain benefit just from hardware; perhaps some flows might need driver changes, ie. reducing flushes and/or copies which wouldn't be needed. Type 3 devices on the other hand are a whole new ball of wax.

Type 3 devices will need host physical address space allocated dynamically (it's not entirely unlike memory hot plug, but it is trickier in some ways). The devices will need to be programmed to accept those addresses. And last but not least, those devices will need to be maintained using a spec defined mailbox interface.

The next chapter will start the same way the driver does: with the mailbox interface used for device information and configuration.

Summary

Important takeaways are as follows:

  1. CXL.cache allows CPUs and devices to operate on host caches uniformly.
  2. CXL.mem allows devices to export their memory coherently.
  3. Bias controls mitigate the performance penalties of CXL.mem coherency.
  4. Type 3 devices provide a subset of CXL.mem, for memory expanders.
May 30, 2022
The goal of this adventure is to have hardware acceleration for applications when we have Glamor disabled in the X server.

What is Glamor ?

Glamor is a GL-based rendering acceleration library for the X server that can use OpenGL, EGL, or GBM. It uses GL functions & shaders to complete 2D graphics operations, and uses normal textures to represent drawable pixmaps where possible. Glamor calls GL functions to render to a texture directly and is thus largely hardware-independent. If the GL rendering cannot complete due to failure (or lack of support), then Glamor will fall back to software rendering (via llvmpipe), which uses framebuffer functions.

Why disable Glamor ?

On current RPi images like bullseye, Glamor is disabled by default for RPi 1-3 devices. This means that there is no hardware acceleration out of the box. The main reason for not using Glamor on RPi 1-3 hardware is that it uses GPU memory (CMA memory), which is limited to 256 MB. If you run out of CMA memory, then the X server cannot allocate memory for pixmaps and your system will crash. RPi 1-3 devices currently use V3D as the render GPU. V3D can only sample from tiled buffers, but it can render to tiled or linear buffers. If V3D needs to sample from a linear buffer, then we allocate a shadow buffer, transform the linear image to a tiled layout in the shadow buffer, and sample from the shadow buffer. Any update of the linear texture implies updating the shadow image… and that is SLOW. With Glamor enabled in this scenario, you will quickly run out of CMA memory and crash. This issue is especially apparent if you try launching Chromium in full screen with many tabs opened.
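To get a feel for why those shadow updates hurt, here is a toy linear-to-tiled copy in C. This is an illustration only: it uses a made-up 4x4 tile layout, not the actual VC4/V3D T-format, and assumes dimensions that are multiples of the tile size. The point is simply that every texel has to be relocated on every update.

#include <stdint.h>

/* Simplified linear-to-tiled copy showing why shadow-buffer updates are
 * costly: every texel gets relocated. Toy 4x4 tile layout, not the real
 * VC4/V3D T-format; width and height must be multiples of TILE. */
#define TILE 4

static void linear_to_tiled(uint32_t *dst, const uint32_t *src,
                            unsigned width, unsigned height)
{
    unsigned tiles_per_row = width / TILE;

    for (unsigned y = 0; y < height; y++) {
        for (unsigned x = 0; x < width; x++) {
            unsigned tile = (y / TILE) * tiles_per_row + (x / TILE);
            unsigned in_tile = (y % TILE) * TILE + (x % TILE);
            dst[tile * TILE * TILE + in_tile] = src[y * width + x];
        }
    }
}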

Where has my hardware acceleration gone ?

On RPi 1-3 devices, we default to the modesetting driver from the X server. For those not aware, ‘modesetting’ is an Xorg driver for Kernel Modesetting (KMS) devices. The driver supports TrueColor visuals at various framebuffer depths and also supports RandR 1.2 for multi-head configurations. It supports all hardware where a KMS device is available and uses the Linux DRM ioctls or dumb buffer objects to create & map memory for applications to use. This driver can be used with Glamor to provide hardware acceleration, however that can lead to the X server crashing as mentioned above. Without Glamor enabled, the modesetting driver cannot do hardware acceleration and applications will render using software (dumb buffer objects). So how can we get hardware acceleration without Glamor ? Let’s take an adventure into the land of Direct Rendering…

What is Direct Rendering ?

Direct rendering allows X client applications to perform 3D rendering using direct access to the graphics hardware. User-space programs can use the DRM API to command the GPU to do hardware-accelerated 3D rendering and video decoding. You may be thinking “Wow, this could solve the problem” and you would be correct. If this could be enabled in the modesetting driver without using Glamor, then we could have hardware acceleration without having to worry about the X server crashing. It cannot be that difficult, right ? Well, as it turns out, things are not so simple. The biggest problem with this approach is that the DRI2 implementation inside the modesetting driver depends on Glamor. DRI2 is a version of the Direct Rendering Infrastructure (DRI), a framework in the modern Linux graphics stack that allows unprivileged user-space programs to use graphics hardware. The main use of DRI is to provide hardware acceleration for the Mesa implementation of OpenGL. So what approach should be taken ? Do we modify the modesetting driver code to support DRI2 without Glamor ? Is there a better way to get direct rendering without DRI2 ? As it turns out, there is a better way…enter DRI3.

DRI3 to the rescue ?

The main purpose of the DRI3 extension is to implement the mechanism to share direct-rendered buffers between DRI clients and the X server. With DRI3, clients can allocate the render buffers themselves instead of relying on the X server to do the allocation. DRI3 clients allocate and use GEM buffer objects as rendering targets, while the X server represents these render buffers using a pixmap. After initialization the client doesn’t make any extra calls to the X server, except perhaps in the case of window resizing. Utilizing this method, we should be able to avoid crashing the X server if we run out of memory, right ? Well once again, things are not as simple as they appear to be…

So using DRI3 & GEM can save the day ?

With GEM, a user-space program can create, handle and destroy memory objects living in the GPU memory. When a user-space program needs video memory (for a framebuffer, texture or any other data), it requests the allocation from the DRM driver using the GEM API. The DRM driver keeps track of the used video memory and is able to comply with the request if there is free memory available. You may recall from earlier that the main reason for not using Glamor on RPi 1-3 hardware is that it uses GPU memory (CMA memory), which is limited to 256 MB, so how can using DRI3 with GEM help us ? The short answer is “it does not”…at least, not if we utilize GEM.
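For reference, this is roughly what an allocation request to the DRM driver looks like from userspace, using the generic dumb-buffer ioctl (which also hands back a GEM handle). As far as I understand, on vc4 this memory comes out of the same CMA pool, which is exactly the problem. The device path and dimensions are placeholders.

/* build with: cc example.c $(pkg-config --cflags --libs libdrm) */
#include <fcntl.h>
#include <stdio.h>
#include <xf86drm.h>

int main(void)
{
    /* Open the KMS device node (the path may differ on your system). */
    int fd = open("/dev/dri/card0", O_RDWR);
    if (fd < 0)
        return 1;

    /* Ask the DRM driver for a "dumb" buffer; the returned handle is a
     * GEM handle, and on vc4 the backing memory is CMA. */
    struct drm_mode_create_dumb create = {
        .width = 1920,
        .height = 1080,
        .bpp = 32,
    };
    if (drmIoctl(fd, DRM_IOCTL_MODE_CREATE_DUMB, &create) == 0)
        printf("GEM handle %u, %llu bytes of GPU memory used\n",
               create.handle, (unsigned long long)create.size);
    return 0;
}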

Where do we go next ?

Surely there must be a way to have hardware acceleration without using all of our GPU memory ? I am glad you asked because there is a solution that we will explore in my next blog post.
May 27, 2022

The software Vulkan renderer in Mesa, lavapipe, achieved official Vulkan 1.2 conformance. The non-obvious entry in the table is here.

Thanks to all the Mesa team who helped achieve this. Shout-outs to Mike of Zink fame, who drove a bunch of pieces over the line, and to Roland, who helped review some of the funkier changes.

We will be submitting 1.3 conformance soon, just a few things to iron out.

May 26, 2022

This year I had a goal: to improve my abilities as a kernel developer. After some research, I found out about the Google Summer of Code (GSoC) initiative.

Last year, I had contact with Isabella Basso during the LKCAMP Hackathon and she told me that she was being mentored by Rodrigo Siqueira. So, in early ’22, I tried to get in touch with him. He introduced me to this great group of women who were building a project for GSoC.

So, since January, Isabella Basso, Magali Lemes, and I have been building a GSoC project proposal for the X.Org Foundation.

Essentially, we had the idea to introduce unit testing to the Display Mode Library (DML) of the AMDGPU driver with KUnit.

So, we divided the DML between us and also welcomed Tales Almeida to the project. Each of us submitted a project proposal to the X.Org Foundation, and I proposed a project to Introduce Unit Testing to the Display Mode VBA Library.

Happily, on May 20th, I received an e-mail from GSoC congratulating me. I got extremely excited that I’ll spend the summer working on something that I love. Moreover, I was extremely happy that Isabella and Tales were approved alongside me.

I am thrilled to be part of this project and I am super excited for the summer of ’22. I’m looking forward to working on what I love most.

So, let me talk a bit more about my project.

Why implement Unit Testing in the AMDGPU drivers?

The modern AMD Linux kernel graphics driver is the single largest driver in the mainline Linux codebase. As AMD releases new GPU models, the AMDGPU driver only keeps getting larger.

As of Linux 5.18, the modern AMD driver is approaching 4 million lines of code, which is more than 10% of the entire Linux kernel codebase.

With such a huge codebase, assuring the quality and reliability of the driver becomes a hard task without systematic testing, especially for graphics drivers, which usually involve tons of complex calculations. Finding bugs also becomes increasingly difficult.

Moreover, a large codebase usually indicates the need for code refactoring. But refactoring a large codebase without any tests in place will probably generate tons of bugs, as there is no way to measure quality.

In that sense, it is possible to argue for the benefits of implementing unit testing in the AMDGPU drivers. Unit tests will help developers recognize bugs before they are merged into mainline and also make future code refactors of the AMDGPU driver feasible.

When analyzing the AMDGPU driver, a particular part of the driver highlights itself as a good candidate for the implementation of unit tests: the Display Mode Library (DML). DML is fundamental to the functioning of AMD Display Core Next (DCN) since DML calculates the signals - VSTARTUP, VUPDATE, and VREADY - used for Global Sync. DML calculates the signals based on a large number of parameters and ensures our hardware is able to feed the DCN pipeline without underflows or hangs in any given system configuration.

My project: Introducing Unit Testing to the Display Mode VBA Library

The general idea of the project is to implement unit testing in the Display Mode Library (DML) with the help of the KUnit framework. Nonetheless, implementing unit testing for all DML functions doesn’t seem viable for a 12-week project. So, my project intends to focus on the Display Mode VBA libraries, especially the dml/display_mode_vba.h and dml/dcn20/display_mode_vba_20.h libraries.

In my project, I intend to create unit tests for all functions in the dml/display_mode_vba.h library, because those functions are used by the VBA libraries of all DCN models. Moreover, considering I have access to an AMD Radeon™ RX 5700 XT, I also intend to create unit tests for the functions in the dml/dcn20/display_mode_vba_20.h library.

The static functions at the libraries won’t be tested, because the main intention is to test the public API of the Display Mode VBA library.

The Display Mode VBA libraries have a very intricate codebase, so I hope that the implementation of unit tests with KUnit, and the possible integration with IGT, inspires developers to work on a code refactor.
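To give an idea of what this could look like, here is a minimal hypothetical KUnit sketch. The function under test (round_to_request_size) and the expected values are made up for illustration; they are not taken from the real DML code.

#include <kunit/test.h>

/* Stand-in for a hypothetical VBA helper; in a real test this would be
 * a function provided by the DML code under test. */
static unsigned int round_to_request_size(unsigned int bytes)
{
	return (bytes + 255) / 256 * 256;
}

static void dcn20_round_to_request_size_test(struct kunit *test)
{
	KUNIT_EXPECT_EQ(test, 256U, round_to_request_size(200U));
	KUNIT_EXPECT_EQ(test, 256U, round_to_request_size(256U));
}

static struct kunit_case display_mode_vba_20_test_cases[] = {
	KUNIT_CASE(dcn20_round_to_request_size_test),
	{}
};

static struct kunit_suite display_mode_vba_20_test_suite = {
	.name = "dml_display_mode_vba_20",
	.test_cases = display_mode_vba_20_test_cases,
};

kunit_test_suite(display_mode_vba_20_test_suite);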

So, now what?

I have already gotten my hands dirty during the Community Bonding Period, primarily working on some workflow issues. I use kworkflow to ease my development workflow, but kw doesn’t support deploying to Fedora-based systems yet. As a Fedora user (and huge fan), I couldn’t leave it at that and switch to an Arch system. So, I’m currently working on deployment support for Fedora systems.

Also, I’m taking time to get familiar with the mailing lists and to read the AMDGPU documentation. Melissa recommended some great links and I plan to read them in the next couple of days.

Moreover, I’m organizing my daily life for this new exciting period. I’m excited to start working and developing in this great community.

May 23, 2022

Hi all! This month’s status update will be shorter than usual, because I’ve taken some time off to visit Napoli. Discovering the city and the surrounding region was great! Of course the main reason to visit is to taste true Neapolitan pizza. I must admit, this wasn’t a let-down. Should you also be interested in pizza, I found some time to put together a tiny NPotM which displays a map of Neapolitan pizzerias. The source is on sr.ht.

Landscape of the Amalfi coast

But that’s enough of that, let’s get down to business. This month Conrad Hoffmann has been submitting a lot of patches for go-webdav. The end goal is to build a brand new CardDAV/CalDAV server, tokidoki. Conrad has upstreamed a CalDAV server implementation and a lot of other improvements (mostly to server code, but a few changes also benefit the client implementation). While tokidoki is missing a bunch of features, it’s already pretty usable!

I’ve released a new version of the goguma mobile IRC client yesterday. A network details page has been added, delthas has implemented typing indicators (disabled by default), and a whole lot of reliability improvements have been pushed. I’ve continued polishing my work-in-progress branch with push notifications support, hopefully we’ll be able to merge this soon.

I’ve started working on the gyosu C documentation generator again, and it’s shaping up pretty well. I’ve published an example of rendered docs for wlroots. gyosu will now generate one doc page per header file, and correctly link between these pages. Next I’ll focus on retaining comments inside structs/enums and grouping functions per type.

Pekka, Sebastian and I have finally come to a consensus regarding the libdisplay-info API. The basic API has been merged alongside the testing infrastructure, and we’ll incrementally add new features on top. Because EDID is such a beautiful format, there are still a lot of surprising API decisions that we need to make.

In other FOSS graphics news, I’ve updated drmdb with a new page to list devices supporting a KMS property (example: alpha property), and I’ve improved the filtering UI a bit. I’ve automated the Wayland docs deployment on GitLab, and I’ve planned a new release.

That’s it for now, see you next month!

Lately I have been exposing a bit more functionality in V3DV and was wondering how far we are from Vulkan 1.2. Turns out that a lot of the new Vulkan 1.2 features are actually optional and what we have right now (missing a few trivial patches to expose a few things) seems to be sufficient for a minimal implementation.

We actually did a test run with CTS enabling Vulkan 1.2 to verify this and it went surprisingly well, with just a few test failures that I am currently looking into, so I think we should be able to submit conformance soon.

For those who may be interested, here is a list of what we are not supporting (all of these are optional features in Vulkan 1.2):

VK_EXT_descriptor_indexing

I think we should be able to support this in the future.

VK_KHR_shader_float16_int8

This we can support in theory, since the hardware has support for half-float. However, the way this is designed in hardware comes with significant caveats that I think would make it really difficult to take advantage of in practice. It would also require significant work, so it is not something we are planning at present.

VK_KHR_buffer_device_address

We can’t implement this without hacks because the Vulkan spec explicitly defines these addresses as 64-bit values, while the V3D GPU only deals with 32-bit addresses and is not capable of any kind of native 64-bit operation. At first I thought we could just lower these to 32-bit (since we know they will be 32-bit), but because the spec makes these explicit 64-bit values, it allows shaders to cast a device address from/to uvec2, which generates 64-bit bitcast instructions, and those require both the destination and source to be 64-bit values.

VK_EXT_sampler_filter_minmax
VK_KHR_draw_indirect_count
VK_EXT_scalar_block_layout
VK_EXT_shader_viewport_index_layer
VK_KHR_shader_atomic_int64

These lack required hardware support, so we don’t expect to implement them.

May 21, 2022

Previously, I gave you an introduction to mesh/task shaders and wrote up some details about how mesh shaders are implemented in the driver. But I left out the important details of how task shaders (aka. amplification shaders) work in the driver. In this post, I aim to give you some details about how task shaders work under the hood. Like before, this is based on my experience implementing task shaders in RADV and all details are already public information.

Refresher about the task shader API

The task shader (aka. amplification shader in D3D12) is a new stage that runs in workgroups similar to compute shaders. Each task shader workgroup has two jobs: determine how many mesh shader workgroups it should launch (dispatch size), and optionally create a “payload” (up to 16K of data of your choice) which is passed to mesh shaders.

Additionally, the API allows task shaders to perform atomic operations on the payload variables.

Typical uses of task shaders include cluster culling, LOD selection, and geometry amplification.

Expectations on task shaders

Before we get into any HW specific details, there are a few things we should unpack first. Based on the API programming model, let’s think about some expectations on a good driver implementation.

Storing the output task payload. There must exist some kind of buffer where the task payload is stored, and the size of this buffer will obviously be a limiting factor on how many task shader workgroups can run in parallel. Therefore, the implementation must ensure that only as many task workgroups run as there is space in this buffer. Preferably this would be a ring buffer whose entries get reused between different task shader workgroups.

Analogy with the tessellator. The above requirements are pretty similar to what tessellation can already do. So a natural conclusion might be that we could implement task shaders by abusing the tessellator. However, this would introduce a potential bottleneck on fixed-function hardware, which we would prefer to avoid.

Analogy with a compute pre-pass. Another similar thing that comes to mind is a compute pre-pass. Many games already do something like this: some pre-processing in a compute dispatch that is executed before a draw call. Of course, the application has to insert a barrier between the dispatch and the draw, which means the draw can’t start before every invocation in the dispatch is finished. In reality, not every graphics shader invocation depends on the results of all compute invocations, but there is no way to express a more fine-grained dependency. For task shaders, it is preferable to avoid this barrier and allow task and mesh shader invocations to overlap.
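As a mental model only (not how the firmware actually implements it), you can picture the ring buffer from the first expectation above as a fixed pool of payload slots that task workgroups acquire and that finished mesh dispatches give back; the slot count below is a made-up number.

#include <stdbool.h>
#include <stdio.h>

/* Toy, single-threaded model of the payload ring: a task workgroup may
 * only launch if a slot is free, and a slot becomes reusable once the
 * corresponding mesh dispatch has finished. Slot count is made up. */
#define NUM_SLOTS 4

static bool slot_busy[NUM_SLOTS];

static int acquire_slot(void)
{
    for (int i = 0; i < NUM_SLOTS; i++) {
        if (!slot_busy[i]) {
            slot_busy[i] = true;
            return i;
        }
    }
    return -1; /* ring full: the next task workgroup has to wait */
}

int main(void)
{
    for (int wg = 0; wg < 8; wg++) {
        int slot = acquire_slot();
        if (slot < 0) {
            printf("task workgroup %d waits for a mesh dispatch to finish\n", wg);
            slot_busy[0] = false; /* pretend the oldest mesh dispatch finished */
            slot = acquire_slot();
        }
        printf("task workgroup %d uses payload slot %d\n", wg, slot);
    }
    return 0;
}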

Task shaders on AMD HW

What I discuss here is based on information that is already publicly available in open source drivers. If you are already familiar with how AMD’s own PAL-based drivers work, you won’t find any surprises here.

First things first. Under the hood, task shaders are compiled to a plain old compute shader. The task payload is located in VRAM. The shader code that stores the mesh dispatch size and payload is compiled to memory writes which store these in VRAM ring buffers. Even though they are compute shaders as far as the AMD HW is concerned, task shaders do not work like a compute pre-pass. Instead, task shaders are dispatched on an async compute queue while at the same time the mesh shader work is executed on the graphics queue in parallel.

The task+mesh dispatch packets are different from a regular compute dispatch. The compute and graphics queue firmwares work together in parallel:

  • Compute queue launches up to as many task workgroups as it has space available in the ring buffer.
  • Graphics queue waits until a task workgroup is finished and can launch mesh shader workgroups immediately. Execution of mesh dispatches from a finished task workgroup can therefore overlap with other task workgroups.
  • When a mesh dispatch from a task workgroup is finished, its slot in the ring buffer can be reused and a new task workgroup can be launched.
  • When the ring buffer is full, the compute queue waits until a mesh dispatch is finished, before launching the next task workgroup.

You can find out the exact concrete details in the PAL source code, or RADV merge requests.

Side note, getting some implementation details wrong can easily cause a deadlock on the GPU. It is great fun to debug these.

The relevant details here are that most of the hard work is implemented in the firmware (good news, because that means I don’t have to implement it), that task shaders are executed on an async compute queue, and that the driver now has to submit compute and graphics work in parallel.

Keep in mind that the API hides this detail and pretends that the mesh shading pipeline is just another graphics pipeline that the application can submit to a graphics queue. So, once again we have a mismatch between the API programming model and what the HW actually does.

Squeezing a hidden compute pipeline in your graphics

In order to use this beautiful scheme provided by the firmware, the driver needs to do two things:

  • Create a compute pipeline from the task shader.
  • Submit the task shader work on the async compute queue while at the same time also submitting the mesh and pixel shader work on the graphics queue.

We already had good support for compute pipelines in RADV (as much as the API needs), but internally in the driver we’ve never had this kind of close cooperation between graphics and compute.

When you use a draw call in a command buffer with a pipeline that has a task shader, RADV must create a hidden, internal compute command buffer. This internal compute command buffer contains the task shader dispatch packet, while the graphics command buffer contains the packet that dispatches the mesh shaders. We must also ensure correct synchronization between these two command buffers according to application barriers: because of the API mismatch, it must work as if the internal compute cmdbuf were part of the graphics cmdbuf. We also need to emit the same descriptors, push constants, etc. When the application submits to the graphics queue, this new, internal compute command buffer is then submitted to the async compute queue.

Thus far, this sounds pretty logical and easy.

The actual hard work is to make it possible for the driver to submit work to different queues at the same time. RADV’s queue code was written assuming that there is a 1:1 mapping between radv_queue objects and HW queues. To make task shaders work we must now break this assumption.

So, of course I had to do some crazy refactor to enable this. At the time of writing the AMDGPU Linux kernel driver doesn’t support “gang submit” yet, so I use scheduled dependencies instead. This has the drawback of submitting to the two queues sequentially rather than doing everything in the same submit.

Conclusion, perf considerations

Let’s turn the above wall of text into some performance considerations that you can actually use when you write your next mesh shading application.

  1. Because task shaders are executed on a different HW queue, there is some overhead. Don’t use task shaders for small draws or other cases when this overhead may be more than what you gain from them.
  2. For the same reason, barriers may require the driver to emit some commands that stall the async compute queue. Be mindful of your barriers (eg. top of pipe, etc) and only use these when your task shader actually depends on some previous graphics work.
  3. Because task payload is written to VRAM by the task shader, and has to be read from VRAM by the mesh shader, there is some latency. Only use as much payload memory as you need. Try to compact the memory use by packing your data etc.
  4. When you have a lot of geometry data, it is beneficial to implement cluster culling in your task shader. After you’ve done this, it may or may not be worth it to implement per-triangle culling in your mesh shader.
  5. Don’t try to reimplement the classic vertex processing pipeline or emulate fixed-function HW with task+mesh shaders. Instead, come up with simpler ways that work better for your app.

NVidia also has some perf recommendations here which mostly apply to any other HW, except for the recommended number of vertices and primitives per meshlet because the sweet spot for that can differ between GPU architectures.

Stay tuned

It has been officially confirmed that a Vulkan cross-vendor mesh shading extension is coming soon.

While I can’t give you any details about the new extension, I think it won’t be a surprise to anyone that it may have been the motivation for my work on mesh and task shaders.

Once the new extension goes public, I will post some thoughts about it and a comparison to the vendor-specific NV_mesh_shader extension.

May 20, 2022
Now that FESCo has decided that Fedora will keep supporting BIOS booting, the people working on Fedora's bootloader stack will need help from the Fedora community to keep Fedora booting on systems which require Legacy BIOS to boot.

To help with this the Fedora BIOS boot SIG (special interest group) has been formed. The main goal of this SIG is to help the Fedora bootloader people by:

  1. Doing regular testing of nightly Fedora N + 1 composes on hardware
    which only supports BIOS booting

  2. Triaging and/or fixing BIOS boot related bugs.


A biosboot-sig@lists.fedoraproject.org mailing list and a Bugzilla account have been created, which will be used to discuss testing results and as the assignee / Cc for bootloader bugzillas related to BIOS booting.

If you are interested in helping with Fedora BIOS boot support please:

  1. Subscribe to the email-list

  2. Add yourself to the Members section of the SIG's wiki page

Again

Two posts in one month is a record for May of 2022. Might even shoot for three at this rate.

Yesterday I posted a hasty roundup of what’s been going on with zink.

It was not comprehensive.

What else have I been working on, you might ask.

Introducing A New Zink-enabled Platform

We all know I’m champing at the bit to further my goal of world domination with zink on all platforms, so it was only a matter of time before I branched out from the big two-point-five of desktop GPU vendors (NVIDIA, AMD, and also maybe Intel sometimes).

Thus it was that a glorious bounty descended from on high at Valve: a shiny new A630 Coachz Chromebook that runs on the open source Qualcomm driver, Turnip.

How did the initial testing go?

My initial combined, all-the-tests-at-once CTS run (KHR46, 4.6 confidential, all dEQP) yielded over 1500 crashes and over 3000 failures.

Brutal.

I then accidentally bricked my kernel by foolishly allowing my distro to upgrade it for me, at which point I defenestrated the device.

Problems

There were many problems, but two were glaring:

  • Maximum of 4 descriptor sets allowed
  • No 64-bit support

It’s tough to say which issue is a bigger problem. As zinkologists know, the preferred minimum number of descriptor sets is six:

  • Constants
  • UBOs
  • Samplers
  • SSBOs
  • Images
  • Bindless

Thus, almost all the crashes I encountered were from the tests which attempted to use shader images, as they access an out-of-bounds set.

On the other hand, while there was an amount of time in which I had to use zink without 64-bit support on my Intel machine before emulation was added at the driver level, Intel’s driver has always been helpfully tolerant of my continuing to jam 64-bit i/o into shaders without adverse effects since the backend compiler supported such operations. This was not the case with Turnip, and all I got for my laziness was crashing and failing.

Solutions?

Both problems were varying degrees of excruciating to fix, so I chose the one I thought would be worse to start off: descriptors.

We all know zink has various modes for its descriptor management:

  • caching
  • templates
  • caching without templates

The default is currently the caching mode, which calculates hash values for all the descriptors being used for a draw/dispatch and then stores the sets for reuse. My plan was to collapse the like sets: merging UBOs and SSBOs into one set as well as Samplers and Images into another set. This would use a total of four sets including the bindless one, opening up all the features.

The project actually turniped out to be easier than expected. All I needed to do was add indirection to the existing set indexing, then modify the indirect indices on startup based on how many sets were available. In this way, accessing e.g., descriptor_set[SSBO] would actually access descriptor_set[UBO] in the new compact mode, and everything else would work the same.
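Here is a sketch of that indirection, with made-up names rather than zink's actual enums: logical set indices stay the same everywhere, and a small remap table filled in at startup decides which real descriptor set each one lands in.

/* Illustrative only; zink's real enums, limits, and layout differ. */
enum set_type { SET_CONSTANTS, SET_UBO, SET_SSBO, SET_SAMPLER,
                SET_IMAGE, SET_BINDLESS, SET_COUNT };

static unsigned set_remap[SET_COUNT];

static void init_set_remap(unsigned max_sets)
{
    if (max_sets >= 6) {
        /* one descriptor set per logical type */
        for (unsigned i = 0; i < SET_COUNT; i++)
            set_remap[i] = i;
    } else {
        /* compact mode: fold SSBOs into the UBO set and
         * images into the sampler set, four sets total */
        set_remap[SET_CONSTANTS] = 0;
        set_remap[SET_UBO]       = 1;
        set_remap[SET_SSBO]      = 1;
        set_remap[SET_SAMPLER]   = 2;
        set_remap[SET_IMAGE]     = 2;
        set_remap[SET_BINDLESS]  = 3;
    }
}

/* callers then index descriptor sets as descriptor_set[set_remap[SET_SSBO]] */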

It was then that I remembered the tricky part of doing anything descriptor-related: updating the caching mechanism.

In time, I settled on merging the hash values from the merged sets just as carefully as the fusion dance must be performed, and things seemed to be working, bringing results down to just over 1000 total CTS fails.

The harder problem was actually the easier one.

Which means I’m gonna need another post.

May 19, 2022

Well

I’ve been meaning to blog for a while.

As per usual.

Finally though, I’m doing it. It’s been a month. Roundup time.

22.1

Mesa 22.1 is out, and it has the best zink release of all time. You’ve got features like:

  • kopper
  • kopper
  • kopper

Really not sure I’m getting across how momentous kopper is, but it’s probably the biggest thing to happen to zink since… I don’t know. Threaded context?

B I G.

Bugs

There’s a lot of bugs in the world. As of now, there’s a lot fewer bugs in zink than there used to be.

How many fewer, you might ask?

Let’s take a gander.

  • 0 GL4.6 CTS fails for lavapipe
  • 0 GL4.6 CTS fails for the new ANV CI job I just added
  • less than 5 GL4.6 CTS fails for RADV
  • this space reserved for NVIDIA, which has bugs in CTS itself and (probably) driver issues, but is still a very low count

In short, conformance submission pending.

Anniversary

This roughly marks 2 years since I started working on zink full-time.

Hooray.

Windows

A quick follow-up to the previous post: there was an issue in the initial WGL handling which prevented zink from loading as expected. This is now fixed.

There are many ways to bring Real-Time to Linux. A standard Linux distribution can provide a reasonable latency to a soft real-time application. But, if you are dealing with applications with harsh timing restrictions, you might be unsatisfied with the results provided by a standard Linux distro.

There are basically 3 options for building a Linux Real-Time system:

  1. Dual-Kernel/Hypervisor approach: A combination of a micro-kernel and the Linux kernel, where the micro-kernel has priority over the Linux kernel and manages the real-time tasks. This approach has a major disadvantage: the need to maintain the micro-kernel, doubling the work to develop and maintain drivers and architecture-specific software.

  2. Heterogeneous Asymmetric Multi-Core System: A system where a Linux and a deterministic kernel, such as FreeRTOS, run independently on different cores. The deterministic kernel is then responsible for the real-time tasks.

  3. Single-Kernel approach: Make the Linux Kernel more real-time capable by improving its preemptiveness.

Xenomai has two options to deliver real-time: a dual-kernel approach and a single-kernel approach.

The dual-kernel approach is named Cobalt. The Cobalt extension is built into the Linux kernel and deals with all real-time tasks by scheduling real-time threads. The Cobalt core has higher priority than the native activities of the Linux kernel.

Moreover, Xenomai also provides the Mercury core, which relies on the real-time capabilities of the native Linux kernel.

While setting up a system for my undergraduate research, I was trying to install Xenomai on the Beaglebone Black. I found a bunch of tutorials online, but many of them were outdated and some simply did not work for me. So, I decided to synthesize all my work installing Xenomai on the Beaglebone Black.

There are two ways to install Xenomai on the Beaglebone Black:

  1. By recompiling the Linux Kernel and applying the appropriate patches.

  2. By using a precompiled kernel with the Xenomai patches already applied.

I used the second approach because it is slightly faster. But, if you want to stick with the first approach, I recommend checking the Xenomai documentation.

So, let’s install Xenomai.

Installing Xenomai

1. Install Debian on the Beaglebone Black

If you already have Debian installed on your Beaglebone Black, then just skip this step.

Otherwise, you can follow the tutorial from Derek Molloy on how to write a Debian Image to the Beaglebone Black.

2. Install the Cobalt Core

First, we need to access the Beaglebone Black through SSH.

ssh debian@192.168.7.2

In order to keep the repositories and packages updated before we start the Cobalt core installation, we can run:

sudo apt update
sudo apt upgrade

To install the Cobalt core, first, we need to know the version of the Linux image we will install. A couple of pre-compiled kernel versions are provided by Beagleboard for Debian Buster and they are listed here.

After deciding on the kernel version, we can just run the following command to install and update the kernel.

sudo apt install linux-image-{KERNEL TAGFORM}

I installed the 4.19.94-ti-xenomai-r64 version, so I ran:

sudo apt install linux-image-4.19.94-ti-xenomai-r64

To load the new kernel, we need to reboot the machine and reconnect through SSH.

sudo reboot
ssh debian@192.168.7.2

To check that the kernel was properly installed, we can check the kernel version with:

uname -r

The output must be the kernel tag form that you selected previously. In my case, the output was 4.19.94-ti-xenomai-r64.

We can also check the kernel log and search for Xenomai references. Looking at dmesg, we will find something like this:

debian@beaglebone:~$ dmesg | grep -i xenomai
[    0.000000] Linux version 4.19.94-ti-xenomai-r64 (voodoo@rpi4b4g-06) (gcc version 8.3.0 (Debian 8.3.0-6)) #1buster SMP PREEMPT Sat May 22 01:02:28 UTC 2021
[    1.220506] [Xenomai] scheduling class idle registered.
[    1.220521] [Xenomai] scheduling class rt registered.
[    1.220676] I-pipe: head domain Xenomai registered.
[    1.225554] [Xenomai] Cobalt v3.1
[    1.753962] usb usb1: Manufacturer: Linux 4.19.94-ti-xenomai-r64 musb-hcd

Note that we are running Cobalt v3.1; this version is extremely important for the next step.

3. Install Xenomai userspace tools and bindings

First, we need to install the appropriate Xenomai bindings. From the kernel log, I could check that I’m running Cobalt v3.1, so I’m going to download the Xenomai 3.1 tarball.

wget https://xenomai.org/downloads/xenomai/stable/xenomai-3.1.tar.bz2

If you are running another version of Cobalt, just change the version tag in the URL.

Next, we can decompress the tarball and get inside the Xenomai folder.

tar xf xenomai-3.1.tar.bz2
cd xenomai-3.1

Now it is time to build and install the Xenomai bindings. First, we need to configure the build environment by running:

./configure --enable-smp

Although the Beaglebone Black has a single-core processor, the flag --enable-smp is important, because the precompiled kernel versions from Beagleboard enable CONFIG_SMP by default.

Then, finally, we can build and install Xenomai.

make
sudo make install

And then, you are done!

You can test the real-time system by running:

sudo su
/usr/xenomai/bin/latency

The output will be similar to this:

== Sampling period: 1000 us
== Test mode: periodic user-mode task
== All results in microseconds
warming up...
RTT|  00:00:01  (periodic user-mode task, 1000 us period, priority 99)
RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
RTD|      7.875|     13.579|     50.625|       0|     0|      7.875|     50.625
RTD|     11.458|     15.983|     53.958|       0|     0|      7.875|     53.958
RTD|     11.458|     13.997|     50.750|       0|     0|      7.875|     53.958
RTD|     11.541|     15.578|     55.999|       0|     0|      7.875|     55.999
RTD|     11.416|     13.186|     52.208|       0|     0|      7.875|     55.999
RTD|     11.499|     14.507|     57.249|       0|     0|      7.875|     57.249
RTD|     11.499|     13.787|     48.707|       0|     0|      7.875|     57.249
RTD|     11.540|     13.694|     50.582|       0|     0|      7.875|     57.249
RTD|     11.456|     15.118|     49.498|       0|     0|      7.875|     57.249
RTD|     11.373|     13.618|     51.290|       0|     0|      7.875|     57.249
RTD|     11.498|     15.844|     48.914|       0|     0|      7.875|     57.249
RTD|     11.539|     17.654|     55.581|       0|     0|      7.875|     57.249
RTD|     11.539|     15.403|     52.622|       0|     0|      7.875|     57.249
RTD|     11.539|     12.955|     51.580|       0|     0|      7.875|     57.249
RTD|     10.747|     13.254|     52.163|       0|     0|      7.875|     57.249
^C---|-----------|-----------|-----------|--------|------|-------------------------
RTS|      7.875|     14.543|     57.249|       0|     0|    00:00:15/00:00:15

This command displays a message every second with minimum, maximum, and average latency values. Notice that all the latencies are in the order of microseconds.

So, now, you can go on and build a real-time application with the Xenomai userspace API on the Beaglebone Black.
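As a starting point, a minimal periodic task using the Alchemy skin could look roughly like the sketch below. The period (1 ms) and priority (50) are arbitrary example values, and you would normally build it with the compiler/linker flags reported by xeno-config --skin=alchemy.

#include <alchemy/task.h>
#include <alchemy/timer.h>
#include <stdio.h>

/* Minimal periodic real-time task sketch using the Alchemy API.
 * Period and priority are arbitrary example values. */
static RT_TASK periodic_task;

static void task_body(void *arg)
{
    (void)arg;
    /* 1 ms period, expressed in nanoseconds, starting now. */
    rt_task_set_periodic(NULL, TM_NOW, 1000000);

    for (int i = 0; i < 10; i++) {
        rt_task_wait_period(NULL);
        printf("tick %d at %llu ns\n", i,
               (unsigned long long)rt_timer_read());
    }
}

int main(void)
{
    rt_task_create(&periodic_task, "periodic", 0, 50, T_JOINABLE);
    rt_task_start(&periodic_task, task_body, NULL);
    rt_task_join(&periodic_task);
    return 0;
}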

May 16, 2022

We haven’t posted updates to the work done on the V3DV driver since we announced the driver becoming Vulkan 1.1 Conformant.

But after reaching that milestone, we’ve been very busy working on more improvements, so let’s summarize the work done since then.

Multisync support

As mentioned in past posts, for the Vulkan driver we tried to focus as much as possible on the userspace part. So we tried to re-use the already existing kernel interface that we had for V3D, used by the OpenGL driver, without modifying/extending it.

This worked fine in general, except for synchronization. The V3D kernel interface only supported one synchronization object per submission. This didn’t map properly to Vulkan synchronization, which is more detailed and complex, and allows defining several semaphores/fences. We initially handled the situation with workarounds, and left some optional features as unsupported.

After our 1.1 conformance work, our colleague Melissa Wen started working on adding support for multiple semaphores on the V3D kernel side. Then she also implemented the changes in V3DV to use this new feature. If you want more technical info, she wrote a very detailed explanation on her blog (part1 and part2).

For now the driver has two codepaths that are used depending on whether the kernel supports this new feature or not. That also means that, depending on the kernel, the V3DV driver could expose a slightly different set of supported features.

More common code – Migration to the common synchronization framework

For a while, Mesa developers have been doing a great effort to refactor and move common functionality to a single place, so it can be used by all drivers, reducing the amount of code each driver needs to maintain.

During these months we have been porting V3DV to some of that infrastructure, from small bits (common VkShaderModule to NIR code), to a really big one: common synchronization framework.

As mentioned, the Vulkan synchronization model is really detailed and powerful. But that also means it is complex. V3DV support for Vulkan synchronization included heavy use of threads. For example, V3DV needed to rely on a CPU wait (polling with threads) to implement vkCmdWaitEvents, as the GPU lacked a mechanism for this.

This was common to several drivers. So at some point there were multiple versions of complex synchronization code, one per driver. But, some months ago, Jason Ekstrand refactored Anvil support and collaborated with other driver developers to create a common framework. Obviously each driver would have their own needs, but the framework provides enough hooks for that.

After some gitlab and IRC chats, Jason provided a Merge Request with the port of V3DV to this new common framework, that we iterated and tested through the review process.

Also, with this port we got timeline semaphore support for free. Thanks to this change, we have ~1.2k fewer lines of code in total (and more features!).

Again, we want to thank Jason Ekstrand for all his help.

Support for more extensions:

Since 1.1 was announced, the following extensions have been implemented and exposed:

  • VK_EXT_debug_utils
  • VK_KHR_timeline_semaphore
  • VK_KHR_create_renderpass2
  • VK_EXT_4444_formats
  • VK_KHR_driver_properties
  • VK_KHR_16bit_storage and VK_KHR_8bit_storage
  • VK_KHR_imageless_framebuffer
  • VK_KHR_depth_stencil_resolve
  • VK_EXT_image_drm_format_modifier
  • VK_EXT_line_rasterization
  • VK_EXT_inline_uniform_block
  • VK_EXT_separate_stencil_usage
  • VK_KHR_separate_depth_stencil_layouts
  • VK_KHR_pipeline_executable_properties
  • VK_KHR_shader_float_controls
  • VK_KHR_spirv_1_4

If you want more details about VK_KHR_pipeline_executable_properties, Iago recently wrote a blog post about it (here).

Android support

Android support for V3DV was added thanks to the work of Roman Stratiienko, who implemented this and submitted Mesa patches. We also want to thank the Android RPi team, and the Lineage RPi maintainer (Konsta) who also created and tested an initial version of that support, which was used as the baseline for the code that Roman submitted. I didn’t test it myself (it’s in my personal TO-DO list), but LineageOS images for the RPi4 are already available.

Performance

In addition to new functionality, we also have been working on improving performance. Most of the focus was done on the V3D shader compiler, as improvements to it would be shared among the OpenGL and Vulkan drivers.

But one of the features specific to the Vulkan driver (pending a port to OpenGL) is that we have implemented double-buffer mode, which is only available if MSAA is not enabled. This mode splits the tile buffer size in half, so the driver can start processing the next tile while the current one is being stored in memory.

In theory this could improve performance by reducing tile store overhead, so it would be most beneficial when vertex/geometry shaders aren’t too expensive. However, it comes at the cost of reducing the tile size, which also causes some overhead of its own.

Testing shows that this helps in some cases (e.g. the Vulkan Quake ports) but hurts in others (e.g. Unreal Engine 4), so for the time being we don’t enable it by default. It can be enabled selectively by adding V3D_DEBUG=db to the environment variables. The idea for the future would be to implement a heuristic that decides when to activate this mode.

FOSDEM 2022

If you are interested in watching an overview of the improvements and changes to the driver during the last year, we gave a presentation at FOSDEM 2022: “v3dv: Status Update for Open Source Vulkan Driver for Raspberry Pi 4”.

May 13, 2022

In late 2020, Apple debuted the M1 with Apple’s GPU architecture, AGX, rumoured to be derived from Imagination’s PowerVR series. Since then, we’ve been reverse-engineering AGX and building open source graphics drivers. Last January, I rendered a triangle with my own code, but there has since been a heinous bug lurking:

The driver fails to render large amounts of geometry.

Spinning a cube is fine, low polygon geometry is okay, but detailed models won’t render. Instead, the GPU renders only part of the model and then faults.

Partially rendered bunny

It’s hard to pinpoint how much we can render without faults. It’s not just the geometry complexity that matters. The same geometry can render with simple shaders but fault with complex ones.

That suggests rendering detailed geometry with a complex shader “takes too long”, and the GPU is timing out. Maybe it renders only the parts it finished in time.

Given the hardware architecture, this explanation is unlikely.

This hypothesis is easy to test, because we can control for timing with a shader that takes as long as we like:

for (int i = 0; i < LARGE_NUMBER; ++i) {
    /* some work to prevent the optimizer from removing the loop */
}

After experimenting with such a shader, we learn…

  • If shaders have a time limit to protect against infinite loops, it’s astronomically high. There’s no way our bunny hits that limit.
  • The symptoms of timing out differ from the symptoms of our driver rendering too much geometry.

That theory is out.

Let’s experiment more. Modifying the shader and seeing where it breaks, we find the only part of the shader contributing to the bug: the amount of data interpolated per vertex. Modern graphics APIs allow specifying “varying” data for each vertex, like the colour or the surface normal. Then, for each triangle the hardware renders, these “varyings” are interpolated across the triangle to provide smooth inputs to the fragment shader, allowing efficient implementation of common graphics techniques like Blinn-Phong shading.

Putting the pieces together, what matters is the product of the number of vertices (geometry complexity) times amount of data per vertex (“shading” complexity). That product is “total amount of per-vertex data”. The GPU faults if we use too much total per-vertex data.
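To put rough numbers on it (these are invented for illustration; the actual buffer size on AGX isn't known at this point in the story):

#include <stdio.h>

/* Illustration of the "total per-vertex data" product with made-up
 * numbers: a detailed model with a handful of varyings per vertex. */
int main(void)
{
    const unsigned vertices          = 70000;      /* detailed model */
    const unsigned floats_per_vertex = 4 + 3 + 2;  /* position + normal + UV */
    const unsigned bytes             = vertices * floats_per_vertex * 4;

    printf("~%.1f MiB of per-vertex data for one draw\n",
           bytes / (1024.0 * 1024.0));
    /* Halving either factor halves the total, which is why dense meshes
     * with simple shaders and coarse meshes with complex shaders both
     * manage to stay under the limit. */
    return 0;
}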

Why?

When the hardware processes each vertex, the vertex shader produces per-vertex data. That data has to go somewhere. How this works depends on the hardware architecture. Let’s consider common GPU architectures.1

Traditional immediate mode renderers render directly into the framebuffer. They first run the vertex shader for each vertex of a triangle, then run the fragment shader for each pixel in the triangle. Per-vertex “varying” data is passed almost directly between the shaders, so immediate mode renderers are efficient for complex scenes.

There is a drawback: rendering directly into the framebuffer requires tremendous amounts of memory access to constantly write the results of the fragment shader and to read out back results when blending. Immediate mode renderers are suited to discrete, power-hungry desktop GPUs with dedicated video RAM.

By contrast, tile-based deferred renderers split rendering into two passes. First, the hardware runs all vertex shaders for the entire frame, not just for a single model. Then the framebuffer is divided into small tiles, and dedicated hardware called a tiler determines which triangles are in each tile. Finally, for each tile, the hardware runs all relevant fragment shaders and writes the final blended result to memory.

Tilers reduce the memory traffic required for the framebuffer. As the hardware renders a single tile at a time, it keeps a “cached” copy of that tile of the framebuffer (called the “tilebuffer”). The tilebuffer is small, just a few kilobytes, but tilebuffer access is fast. Writing to the tilebuffer is cheap, and unlike immediate renderers, blending is almost free. Because main memory access is expensive and mobile GPUs can’t afford dedicated video memory, tilers are suited to mobile GPUs, like Arm’s Mali, Imagination’s PowerVR, and Apple’s AGX.

Yes, AGX is a mobile GPU, designed for the iPhone. The M1 is a screaming fast desktop, but its unified memory and tiler GPU have roots in mobile phones. Tilers work well on the desktop, but there are some drawbacks.

First, at the start of a frame, the contents of the tilebuffer are undefined. If the application needs to preserve existing framebuffer contents, the driver needs to load the framebuffer from main memory and store it into the tilebuffer. This is expensive.

Second, because all vertex shaders are run before any fragment shaders, the hardware needs a buffer to store the outputs of all vertex shaders. In general, there is much more data required than space inside the GPU, so this buffer must be in main memory. This is also expensive.

Ah-ha. Because AGX is a tiler, it requires a buffer of all per-vertex data. We fault when we use too much total per-vertex data, overflowing the buffer.

…So how do we allocate a larger buffer?

On some tilers, like older versions of Arm’s Mali GPU, the userspace driver computes how large this “varyings” buffer should be and allocates it.2 To fix the faults, we can try increasing the sizes of all buffers we allocate, in the hopes that one of them contains the per-vertex data.

No dice.

It’s prudent to observe what Apple’s Metal driver does. We can cook up a Metal program drawing variable amounts of geometry and trace all GPU memory allocations that Metal performs while running our program. Doing so, we learn that increasing the amount of geometry drawn does not increase the sizes of any allocated buffers. In fact, it doesn’t change anything in the command buffer submitted to the kernel, except for the single “number of vertices” field in the draw command.

We know that buffer exists. If it’s not allocated by userspace – and by now it seems that it’s not – it must be allocated by the kernel or firmware.

Here’s a funny thought: maybe we don’t specify the size of the buffer at all. Maybe it’s okay for it to overflow, and there’s a way to handle the overflow.

It’s time for a little reconnaissance. Digging through what little public documentation exists for AGX, we learn from one WWDC presentation:

The Tiled Vertex Buffer stores the Tiling phase output, which includes the post-transform vertex data…

But it may cause a Partial Render if full. A Partial Render is when the GPU splits the render pass in order to flush the contents of that buffer.

Bullseye. The buffer we’re chasing, the “tiled vertex buffer”, can overflow. To cope, the GPU stops accepting new geometry, renders the existing geometry, and restarts rendering.

Since partial renders hurt performance, Metal application developers need to know about them to optimize their applications. There should be performance counters flagging this issue. Poking around, we find two:

  • Number of partial renders.
  • Number of bytes used of the parameter buffer.

Wait, what’s a “parameter buffer”?

Remember the rumours that AGX is derived from PowerVR? The public PowerVR optimization guides explain:

[The] list containing pointers to each vertex passed in from the application… is called the parameter buffer (PB) and is stored in system memory along with the vertex data.

Each varying requires additional space in the parameter buffer.

The Tiled Vertex Buffer is the Parameter Buffer. PB is the PowerVR name, TVB is the public Apple name, and PB is still an internal Apple name.

What happens when PowerVR overflows the parameter buffer?

An old PowerVR presentation says that when the parameter buffer is full, the “render is flushed”, meaning “flushed data must be retrieved from the frame buffer as successive tile renders are performed”. In other words, it performs a partial render.

Back to the Apple M1, it seems the hardware is failing to perform a partial render. Let’s revisit the broken render.

Partially rendered bunny, again

Notice parts of the model are correctly rendered. The parts that are not show only the black clear colour rendered at the start of the scene. Let’s consider the logical order of events.

First, the hardware runs vertex shaders for the bunny until the parameter buffer overflows. This works: the partial geometry is correct.

Second, the hardware rasterizes the partial geometry and runs the fragment shaders. This works: the shading is correct.

Third, the hardware flushes the partial render to the framebuffer. This must work for us to see anything at all.

Fourth, the hardware runs vertex shaders for the rest of the bunny’s geometry. This ought to work: the configuration is identical to the original vertex shaders.

Fifth, the hardware rasterizes and shades the rest of the geometry, blending with the old partial render. Because AGX is a tiler, to preserve that existing partial render, the hardware needs to load it back into the tilebuffer. We have no idea how it does this.

Finally, the hardware flushes the render to the framebuffer. This should work as it did the first time.

The only problematic step is loading the framebuffer back into the tilebuffer after a partial render. Usually, the driver supplies two “extra” fragment shaders. One clears the tilebuffer at the start, and the other flushes out the tilebuffer contents at the end.

If the application needs the existing framebuffer contents preserved, instead of writing a clear colour, the “load tilebuffer” program instead reads from the framebuffer to reload the contents. Handling this requires quite a bit of code, but it works in our driver.

Looking closer, AGX requires more auxiliary programs.

The “store” program is supplied twice. I noticed this when initially bringing up the hardware, but the reason for the duplication was unclear. Omitting each copy separately and seeing what breaks, the reason becomes clear: one program flushes the final render, and the other flushes a partial render.3

…What about the program that loads the framebuffer into the tilebuffer?

When a partial render is possible, there are two “load” programs. One writes the clear colour or loads the framebuffer, depending on the application setting. We understand this one. The other always loads the framebuffer.

…Always loads the framebuffer, as in, for loading back with a partial render even if there is a clear at the start of the frame?

If this program is the issue, we can confirm easily. Metal must require it to draw the same bunny, so we can write a Metal application drawing the bunny and stomp over its GPU memory to replace this auxiliary load program with one always loading with black.

Metal drawing the bunny, stomping over its memory.

Doing so, Metal fails in a similar way. That means we’re at the root cause. Looking at our own driver code, we don’t specify any program for this partial render load. Up until now, that’s worked okay. If the parameter buffer is never overflowed, this program is unused. As soon as a partial render is required, however, failing to provide this program means the GPU dereferences a null pointer and faults. That explains our GPU faults at the beginning.

Following Metal, we supply our own program to load back the tilebuffer after a partial render…

Bunny with the fix

…which does not fix the rendering! Cursed, this GPU. The faults go away, but the render still isn’t quite right for the first few frames, indicating partial renders are still broken. Notice the weird artefacts on the feet.

Curiously, the render “repairs itself” after a few frames, suggesting the parameter buffer stops overflowing. This implies the parameter buffer can be resized (by the kernel or by the firmware), and the system is growing the parameter buffer after a few frames in response to overflow. This mechanism makes sense:

  • The hardware can’t allocate more parameter buffer space itself.
  • Overflowing the parameter buffer is expensive, as partial renders require tremendous memory bandwidth.
  • Overallocating the parameter buffer wastes memory for applications rendering simple geometry.

Starting the parameter buffer small and growing in response to overflow provides a balance, reducing the GPU’s memory footprint and minimizing partial renders.

Back to our misrendering. There are actually two buffers being used by our program, a colour buffer (framebuffer)… and a depth buffer. The depth buffer isn’t directly visible, but facilitates the “depth test”, which discards far pixels that are occluded by other close pixels. While the partial render mechanism discards geometry, the depth test discards pixels.

That would explain the missing pixels on our bunny. The depth test is broken with partial renders. Why? The depth test depends on the depth buffer, so the depth buffer must also be stored after a partial render and loaded back when resuming. Comparing a trace from our driver to a trace from Metal, looking for any relevant difference, we eventually stumble on the configuration required to make depth buffer flushes work.

And with that, we get our bunny.

The final Phong shaded bunny

  1. These explanations are massive oversimplifications of how modern GPUs work, but it’s good enough for our purposes here.↩︎

  2. This is a worse idea than it sounds. Starting with the new Valhall architecture, Mali allocates varyings much more efficiently.↩︎

  3. Why the duplication? I have not yet observed Metal using different programs for each. However, for front buffer rendering, partial renders need to be flushed to a temporary buffer for this scheme to work. Of course, you may as well use double buffering at that point.↩︎

May 12, 2022

In part 1 I gave a brief introduction on what mesh and task shaders are from the perspective of application developers. Now it’s time to dive deeper and talk about how mesh shaders are implemented in a Vulkan driver on AMD HW. Note that everything I discuss here is based on my experience and understanding as I was working on mesh shader support in RADV and is already available as public information in open source driver code. The goal of this blog post is to elaborate on how mesh shaders are implemented on the NGG hardware in AMD RDNA2 GPUs, and to show how these details affect shader performance. Hopefully, this helps the reader better understand how the concepts in the API are translated to the HW and what pitfalls to avoid to get good perf.

Short intro to NGG

NGG (Next Generation Geometry) is the technology responsible for any vertex and geometry processing in RDNA GPUs (with some caveats). It is also known as the “primitive shader”, and its main innovations are:

  • Shaders are aware of not only vertices, but also primitives (this is why they are called primitive shaders).
  • The output topology is entirely up to the shader, meaning that it can create output vertices and primitives with an arbitrary topology regardless of its input.
  • On RDNA2 and newer, per-primitive output attributes are also supported.

This flexibility allows the driver to implement every vertex/geometry processing stage using NGG. Vertex, tess eval and geometry shaders can all be compiled to NGG “primitive shaders”. The only major limiting factor is that each thread (SIMD lane) can only output up to 1 vertex and 1 primitive (with caveats).

The driver is also capable of extending the application shaders with sweet stuff such as per-triangle culling, but this is not the main focus of this blog post. I also won’t cover the caveats here, but I may write more about NGG in the future.

Mapping the mesh shader API to NGG

The draw commands executed by the GPU only understand a number of input vertices, but the mesh shader API draw calls specify a number of workgroups instead. To make this work, we configure the shader such that the number of input vertices per workgroup is 1, and the number of input vertices for the draw is set to the workgroup count you passed into the API. This way, the firmware can figure out how many workgroups it really needs to launch.
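
To illustrate the API side (my own hedged example, not from the post), a mesh shader draw with the NV_mesh_shader Vulkan extension only passes a workgroup count; there are no vertex buffers or vertex counts involved. The helper name here is made up:

#include <vulkan/vulkan.h>

/* Hypothetical sketch: record a mesh shader draw with VK_NV_mesh_shader.
 * The draw only specifies how many task/mesh workgroups to launch. */
static void record_mesh_draw(VkDevice device, VkCommandBuffer cmd,
                             VkPipeline mesh_pipeline, uint32_t workgroup_count)
{
    /* Extension entry points are not exported by the loader, so fetch it. */
    PFN_vkCmdDrawMeshTasksNV draw_mesh_tasks =
        (PFN_vkCmdDrawMeshTasksNV)vkGetDeviceProcAddr(device, "vkCmdDrawMeshTasksNV");

    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, mesh_pipeline);

    /* Launch workgroup_count workgroups, starting at index 0. */
    draw_mesh_tasks(cmd, workgroup_count, 0);
}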

The driver has to accommodate the HW limitation above, so we must ensure that in the compiled shader, each thread only outputs up to 1 vertex and 1 primitive. Reminder: the API programming model allows any shader invocation to write any vertex and/or primitive. So, there is a fundamental mismatch between the programming model and what the HW can do.

This raises a few interesting questions.

How do we allow any thread to write any vertex/primitive? The driver allocates some LDS (shared memory) space, and writes all mesh shader outputs there. At the very end of the shader, each thread reads the attributes of the vertex and primitive that matches the thread ID and outputs that single vertex and primitive. This roundtrip to the LDS can be omitted if an output is only written by the thread with matching thread ID. (Note: at the time of writing, I haven’t implemented this optimization yet, but I plan to.)

What if the MS workgroup size is less than the max number of output vertices or primitives? Each HW thread can create up to 1 vertex and 1 primitive. The driver has to set the real workgroup size accordingly:
hw workgroup size = max(api workgroup size, max vertex count, max primitive count)
The result is that the HW will get a workgroup that has some threads that execute the code you wrote (the “API shader”), and then some that won’t do anything but wait until the very end to output their up to 1 vertex and 1 primitive. It can result in poor occupancy (low HW utilization = bad performance).
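
As a minimal sketch of the same rule in C (my own illustration, not RADV code; the function name is made up):

#include <stdint.h>

/* Rough sketch of how a driver could pick the real (HW) workgroup size for a
 * mesh shader: enough threads to run the API shader, and enough threads to
 * emit the declared maximum of vertices and primitives, one of each per thread. */
static uint32_t hw_workgroup_size(uint32_t api_workgroup_size,
                                  uint32_t max_vertex_count,
                                  uint32_t max_primitive_count)
{
    uint32_t size = api_workgroup_size;
    if (max_vertex_count > size)
        size = max_vertex_count;
    if (max_primitive_count > size)
        size = max_primitive_count;
    return size;
}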

What if the shader also has barriers in it? This is now turning into a headache. The driver has to ensure that the threads that “do nothing” also execute an equal amount of barriers as those that run your API shader. If the HW workgroup has the same number of waves as the API shader, this is trivial. Otherwise, we have to emit some extra code that keeps the extra waves running in a loop executing barriers. This is the worst.

What if the API shader also uses shared memory, or not all outputs fit the LDS? The D3D12 spec requires the driver to have at least 28K shared memory (LDS) available to the shader. However, graphics shaders can only access up to 32K LDS. How do we make this work, considering the above fact that the driver has to write mesh shader outputs to LDS? This is getting really ugly now, but in that case, the driver is forced to write MS outputs to VRAM instead of LDS. (Note: at the time of writing, I haven’t implemented this edge case yet, but I plan to.)

How do you deal with the compute-like stuff, e.g. workgroup ID, subgroup ID, etc.? Fortunately, most of these were already available to the shader, just not exposed in the traditional VS, TES, GS programming model. The only pain point is the workgroup ID, which needs some trickery. I already mentioned above that the HW is tricked into thinking that each MS workgroup has 1 input vertex. So we can just use the same register that contains the vertex ID for getting the workgroup ID.

Conclusion, performance considerations

The above implementation details can be turned into performance recommendations.

Specify a MS workgroup size that matches the maximum number of vertices and primitives. Also, distribute the work among the full workgroup rather than leaving some threads doing nothing. If you do this, you ensure that the hardware is optimally utilized. This is the most important recommendation here today.

Try to only write to the mesh output array indices from the corresponding thread. If you do this, you hit an optimal code path in the driver, so it won’t have to write those outputs to LDS and read them back at the end.

Use shared memory, but not excessively. Implementing any nice algorithm in your mesh shader will likely need you to share data between threads. Don’t be afraid to use shared memory, but prefer to use subgroup functionality instead when possible.

What if you don’t want to do any of the above?

That is perfectly fine. Don’t use mesh shaders then.

The main takeaway about mesh shading is that it’s a very low level tool. The driver can implement the full programming model, but it can’t hold your hand as well as it could for traditional vertex processing. You may have to implement things (e.g. vertex inputs, culling, etc.) that the driver would previously do for you. Essentially, if you write a mesh shader you are trying to beat the driver at its own game.

Wait, aren’t we forgetting something?

I think this post is already dense enough with technical detail. Brace yourself for the next post, where I’m going to blow your mind even more and talk about how task shaders are implemented.

May 11, 2022

Background
Today NVIDIA announced that they are releasing an open source kernel driver for their GPUs, so I want to share with you some background information and how I think this will impact Linux graphics and compute going forward.

One thing many people are not aware of is that Red Hat is the only Linux OS company that has a strong presence in the Linux compute and graphics engineering space. There are of course a lot of other people working in the space too, like engineers working for Intel, AMD and NVIDIA, people working for consultancy companies like Collabora, or individual community members, but Red Hat as an OS integration company has been very active in trying to ensure we have a maintainable and shared upstream open source stack. This engineering presence is also what has allowed us to move important technologies forward, like getting hiDPI support for Linux some years ago, or working with NVIDIA to get glvnd implemented to remove a pain point for our users, since the original OpenGL design only allowed for one OpenGL implementation to be installed at a time. We see ourselves as the open source community’s partner here, fighting to keep the Linux graphics stack coherent and maintainable, and as a partner for the hardware OEMs to work with when they need help pushing major new initiatives around GPUs for Linux forward. And as the only Linux vendor with a significant engineering footprint in GPUs, we have been working closely with NVIDIA. People like Kevin Martin, the manager of our GPU technologies team, Ben Skeggs, the maintainer of Nouveau, Dave Airlie, the upstream kernel maintainer for the graphics subsystem, Nouveau contributor Karol Herbst, and our accelerator lead Tom Rix have all taken part in meetings, code reviews and discussions with NVIDIA. So let me talk a little about what this release means (and also what it doesn’t mean) and what we hope to see come out of this long term.

First of all, what is in this new driver?
What has been released is an out-of-tree source code kernel driver which has been tested to support CUDA usecases on datacenter GPUs. There is code in there to support display, but it is not complete or fully tested yet. Also, this is only the kernel part; a big part of a modern graphics driver is to be found in the firmware and userspace components, and those are still closed source. But it does mean we now have an NVIDIA kernel driver that will start being able to consume the GPL-only APIs in the Linux kernel, although this initial release doesn’t consume any APIs the old driver wasn’t already using. The driver also only supports NVIDIA Turing chip GPUs and newer, which means it is not targeting GPUs from before 2018. So for the average Linux desktop user, while this is a great first step and hopefully a sign of what is to come, it is not something you are going to start using tomorrow.

What does it mean for the NVIDIA binary driver?
Not too much immediately. The binary kernel driver will continue to be needed for older pre-Turing NVIDIA GPUs, and until the open source kernel module is fully tested and extended for display usecases you are likely to continue using it for your system even if you are on Turing or newer. Also, as mentioned above regarding the firmware and userspace bits, the binary driver is going to continue to be around even once the open source kernel driver is fully capable.

What does it mean for Nouveau?
Let me start with the obvious: this is actually great news for the Nouveau community and the Nouveau driver, and NVIDIA has done a great favour to the open source graphics community with this release. For those unfamiliar with Nouveau: Nouveau is the in-kernel graphics driver for NVIDIA GPUs today, which was originally developed as a reverse engineered driver, but which over recent years has actually had active support from NVIDIA. It is fully functional, but is severely hampered by not having had the ability to, for instance, re-clock the NVIDIA card, meaning that it can’t give you full performance like the binary driver can. This was something we were working with NVIDIA to remedy, but this new release provides us with a better path forward. So what does this new driver mean for Nouveau? Less initially, but a lot in the long run. To give a little background first: the Linux kernel does not allow multiple drivers for the same hardware, so in order for a new NVIDIA kernel driver to go in, the current one will have to go out or at least be limited to a different set of hardware. The current one is Nouveau. And just like the binary driver, a big chunk of Nouveau is not in the kernel, but in the userspace pieces found in Mesa and the Nouveau-specific firmware that NVIDIA currently kindly makes available. So regardless of the long term effort to create a new open source in-tree kernel driver based on this new open source driver for NVIDIA hardware, Nouveau will very likely be staying around to support pre-Turing hardware, just like the NVIDIA binary kernel driver will.

The plan we are working towards from our side, but which is likely to take a few years to come to full fruition, is to come up with a way for the NVIDIA binary driver and Mesa to share a kernel driver. The details of how we will do that are something we are still working on and discussing with our friends at NVIDIA, to address both the needs of the NVIDIA userspace and the needs of the Mesa userspace. Along with that evolution we hope to work with NVIDIA engineers to refactor the userspace bits of Mesa that are now targeting just Nouveau to be able to interact with this new kernel driver, and also work so that the binary driver and Nouveau can share the same firmware. This has clear advantages for both the open source community and NVIDIA. For the open source community it means that we will now have a kernel driver and firmware that allow things like changing the clocking of the GPU to provide the kind of performance people expect from NVIDIA graphics cards, and it means that we will have an open source driver that will have access to the firmware and kernel updates from day one for new generations of NVIDIA hardware. For the “binary” driver, and I put that in quote signs because it will now be less binary :), it means as stated above that it can start taking advantage of the GPL-only APIs in the kernel, distros can ship it and enable secure boot, and it gets an open source consumer of its kernel driver, allowing it to go upstream.
Whether this new shared kernel driver will be known as Nouveau or something completely different is still an open question, and of course it happening at all depends on whether we, the rest of the open source community and NVIDIA are able to find a path together to make it happen, but so far everyone seems to be of good will.

What does this release mean for linux distributions like Fedora and RHEL?

Over time it provides a pathway to radically simplify supporting NVIDIA hardware, due to the opportunities discussed elsewhere in this document. Long term, we hope to be able to provide a better user experience with NVIDIA hardware in terms of out-of-the-box functionality. That means day 1 support for new chipsets, a high performance open source Mesa driver for NVIDIA, and the ability to sign the NVIDIA driver alongside the rest of the kernel to enable things like secureboot support. Since this first release is targeting compute, one can expect that these options will first be available for compute users and then for graphics at a later time.

What are the next steps
Well, there is a lot of work to do here. NVIDIA needs to continue the effort to make this new driver feature complete for both compute and graphics/display usecases, we’d like to work together to come up with a plan for what the future unified kernel driver can look like and a model around it that works for both the community and NVIDIA, and we need to add things like a Mesa Vulkan driver. We at Red Hat will be playing an active part in this work as the only Linux vendor with the capacity to do so, and we will also work to ensure that the wider open source community has a chance to participate fully, like we do for all open source efforts we are part of.

If you want to hear more about this, I did talk with Chris Fisher on Linux Action News about this topic. Note: I did state some timelines in that interview which I didn’t make clear were my guesstimates and not in any form official NVIDIA timelines, so I apologize for the confusion.

May 10, 2022

In the previous post, I described how we enable multiple syncobjs capabilities in the V3D kernel driver. Now I will tell you what was changed on the userspace side, where we reworked the V3DV sync mechanisms to use Vulkan multiple wait and signal semaphores directly. This change represents greater adherence to the Vulkan submission framework.

I was not familiar with Vulkan concepts or the V3DV driver. Fortunately, I counted on the guidance of Igalia’s Graphics team, mainly Iago Toral (thanks!), to understand the Vulkan Graphics Pipeline, sync scopes, and submission order. Therefore, we changed the original V3DV implementation of vkQueueSubmit and all related functions to allow direct mapping of multiple semaphores from V3DV to the V3D kernel interface.

Disclaimer: Here’s a brief and probably inaccurate background, which we’ll go into in more detail later on.

In Vulkan, GPU work submissions are described as command buffers. These command buffers, with GPU jobs, are grouped in a command buffer submission batch, specified by vkSubmitInfo, and submitted to a queue for execution. vkQueueSubmit is the command called to submit command buffers to a queue. Besides command buffers, vkSubmitInfo also specifies semaphores to wait before starting the batch execution and semaphores to signal when all command buffers in the batch are complete. Moreover, a fence in vkQueueSubmit can be signaled when all command buffer batches have completed execution.
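
To make this concrete, here is a minimal hypothetical example of such a submission with one wait semaphore, one signal semaphore and a fence; the handles are assumed to exist already and the helper name is made up:

#include <vulkan/vulkan.h>

/* Hypothetical illustration: one command buffer that waits on wait_sem before
 * executing, signals signal_sem when it completes, and signals fence when the
 * whole batch has finished on the GPU. */
void submit_batch(VkQueue queue, VkCommandBuffer cmd_buf,
                  VkSemaphore wait_sem, VkSemaphore signal_sem, VkFence fence)
{
    VkPipelineStageFlags wait_stage = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;

    VkSubmitInfo submit_info = {
        .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
        .waitSemaphoreCount = 1,
        .pWaitSemaphores = &wait_sem,
        .pWaitDstStageMask = &wait_stage,
        .commandBufferCount = 1,
        .pCommandBuffers = &cmd_buf,
        .signalSemaphoreCount = 1,
        .pSignalSemaphores = &signal_sem,
    };

    vkQueueSubmit(queue, 1, &submit_info, fence);
}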

From this sequence, we can see some implicit ordering guarantees. Submission order defines the start order of execution between command buffers; in other words, it is determined by the order in which pSubmits appear in vkQueueSubmit and pCommandBuffers appear in VkSubmitInfo. However, we don’t have any completion guarantees for jobs submitted to different GPU queues, which means they may overlap and complete out of order. Of course, jobs submitted to the same GPU engine follow start and finish order. For signal operation order, a fence is ordered after all semaphore signal operations. In addition to implicit sync, we also have some explicit sync resources, such as semaphores, fences, and events.

Considering these implicit and explicit sync mechanisms, we reworked the V3DV implementation of queue submissions to better use the multiple syncobjs capabilities from the kernel. You can find this work in this merge request: v3dv: add support to multiple wait and signal semaphores. In this blog post, we run through each scope of change of this merge request for a V3D driver-guided description of the multisync support implementation.

Groundwork and basic code clean-up:

As the original V3D kernel interface allowed only one semaphore, V3DV resorted to booleans to “translate” multiple semaphores into one. Consequently, if a command buffer batch had at least one semaphore, it needed to wait for all previously submitted jobs to complete before starting its execution. So, instead of just a boolean, we created and changed the structs that store semaphore information to accept an actual list of wait semaphores.

Expose multisync kernel interface to the driver:

In the two commits below, we basically updated the DRM V3D interface from the one defined in the kernel and verified whether the multisync capability is available for use.

Handle multiple semaphores for all GPU job types:

At this point, we were only changing the submission design to consider multiple wait semaphores. Before supporting multisync, V3DV was waiting for the last job submitted to be signaled when at least one wait semaphore was defined, even when serialization wasn’t required. V3DV handles GPU jobs according to the GPU queue in which they are submitted:

  • Control List (CL) for binning and rendering
  • Texture Formatting Unit (TFU)
  • Compute Shader Dispatch (CSD)

Therefore, we changed their submission setup so that jobs submitted to any GPU queue are able to handle more than one wait semaphore.

These commits created all mechanisms to set arrays of wait and signal semaphores for GPU job submissions:

  • Checking the conditions to define the wait_stage.
  • Wrapping them in a multisync extension.
  • Configuring the generic extension as a multisync extension, according to the kernel interface (described in the previous blog post).

Finally, we extended the ability of GPU jobs to handle multiple signal semaphores, but at this point, no GPU job is actually in charge of signaling them. With this in place, we could rework part of the code that tracks CPU and GPU job completions by verifying the GPU status and threads spawned by Event jobs.

Rework the QueueWaitIdle mechanism to track the syncobj of the last job submitted in each queue:

As we had only a single in/out syncobj interface for semaphores, we used a single last_job_sync to synchronize job dependencies of the previous submission. Although the DRM scheduler guarantees the starting order of jobs in the same queue in kernel space, the order of completion isn’t predictable. On the other hand, we still needed to use syncobjs to follow job completion since we have event threads on the CPU side. Therefore, a more accurate implementation requires last_job syncobjs to track when each engine (CL, TFU, and CSD) is idle. We also needed to keep the driver working on previous versions of the v3d kernel driver with single semaphores, so we kept tracking an ANY last_job_sync to preserve the previous implementation.

Rework synchronization and submission design to let the jobs handle wait and signal semaphores:

With multiple semaphores support, the conditions for waiting on and signaling semaphores changed according to the particularities of each GPU job (CL, CSD, TFU) and the restrictions of CPU jobs (Events, CSD indirect, etc.). In this sense, we redesigned V3DV semaphore handling and job submission for command buffer batches in vkQueueSubmit.

We scrutinized possible scenarios for submitting command buffer batches to change the original implementation carefully. It resulted in three more commits:

We keep track of whether we have submitted a job to each GPU queue (CSD, TFU, CL) and a CPU job for each command buffer. We use syncobjs to track the last job submitted to each GPU queue and a flag that indicates if this represents the beginning of a command buffer.

The first GPU job submitted to a GPU queue in a command buffer should wait on wait semaphores. The first CPU job submitted in a command buffer should call v3dv_QueueWaitIdle() to do the waiting and ignore semaphores (because it is waiting for everything).

If the job is not the first but has the serialize flag set, it should wait on the completion of all last jobs submitted to any GPU queue before running. In practice, this means using syncobjs to track the last job submitted per queue and adding these syncobjs as dependencies of the serialized job.

If this job is the last job of a command buffer batch, it may be used to signal semaphores if this command buffer batch has only one type of GPU job (because we have guarantees of execution ordering). Otherwise, we emit a no-op job just to signal semaphores: it waits on the completion of all last jobs submitted to any GPU queue and then signals the semaphores. Note: we later changed this approach to correctly deal with ordering changes caused by event threads. Whenever we have an event job in the command buffer, we cannot rely on the last-job-in-the-last-command-buffer assumption; we have to wait for all event threads to complete before signaling.

After submitting all command buffers, we emit a no-op job that waits on the completion of all last jobs per queue and signals the fence. Note: at some point, we changed this approach to correctly deal with ordering changes caused by event threads, as mentioned before.

Final considerations

With many changes and many rounds of reviews, the patchset was merged. After more validation and code review, we polished and fixed the implementation together with external contributions.

Also, multisync capabilities enabled us to add new features to V3DV and switch the driver to the common synchronization and submission framework:

  • v3dv: expose support for semaphore imports

    This was waiting for multisync support in the v3d kernel, which is already available. Exposing this feature however enabled a few more CTS tests that exposed pre-existing bugs in the user-space driver so we fix those here before exposing the feature.

  • v3dv: Switch to the common submit framework

    This should give you emulated timeline semaphores for free and kernel-assisted sharable timeline semaphores for cheap once you have the kernel interface wired in.

We used a set of games to ensure no performance regression in the new implementation. For this, we used GFXReconstruct to capture Vulkan API calls while playing those games. Then, we compared results with and without the multisync caps in the kernel space, and also with multisync enabled in v3dv. We didn’t observe any compromise in performance, but we did see improvements when replaying scenes of the vkQuake game.

As you may already know, we at Igalia have been working on several improvements to the 3D rendering drivers of the Broadcom VideoCore GPU, found in Raspberry Pi 4 devices. One of our recent efforts focused on improving the adherence of the V3D(V) drivers to the Vulkan submission and synchronization framework. We had to cross various layers of the Linux graphics stack to add support for multiple syncobjs to V3D(V), from the Linux/DRM kernel to the Vulkan driver. We have delivered bug fixes, a generic gate to extend job submission interfaces, and a more direct sync mapping of the Vulkan framework. These changes did not impact the performance of the tested games and brought greater precision to the synchronization mechanisms. Ultimately, support for multiple syncobjs opened the door to new features and other improvements to the V3DV submission framework.

DRM Syncobjs

But, first, what are DRM sync objs?

* DRM synchronization objects (syncobj, see struct &drm_syncobj) provide a
* container for a synchronization primitive which can be used by userspace
* to explicitly synchronize GPU commands, can be shared between userspace
* processes, and can be shared between different DRM drivers.
* Their primary use-case is to implement Vulkan fences and semaphores.
[...]
* At it's core, a syncobj is simply a wrapper around a pointer to a struct
* &dma_fence which may be NULL.

And Jason Ekstrand summarized the dma_fence features well in a talk at the Linux Plumbers Conference 2021:

A struct that represents a (potentially future) event:

  • Has a boolean “signaled” state
  • Has a bunch of useful utility helpers/concepts, such as refcount, callback wait mechanisms, etc.

Provides two guarantees:

  • One-shot: once signaled, it will be signaled forever
  • Finite-time: once exposed, it is guaranteed to signal in a reasonable amount of time

What does multiple semaphores support mean for Raspberry Pi 4 GPU drivers?

For our main purpose, the multiple syncobjs support means that V3DV can submit jobs with more than one wait and signal semaphore. In the kernel space, wait semaphores become explicit job dependencies to wait on before executing the job. Signal semaphores (or post dependencies), in turn, work as fences to be signaled when the job completes its execution, unlocking following jobs that depend on its completion.

The development of multisync support comprised many decision-making points and steps, summarized as follows:

  • added capabilities to the v3d kernel driver to handle multiple syncobjs;
  • exposed multisync capabilities to the userspace through a generic extension;
  • reworked the synchronization mechanisms of the V3DV driver to benefit from this feature;
  • enabled the simulator to work with multiple semaphores; and
  • tested on Vulkan games to verify correctness and possible performance enhancements.

We decided to refactor parts of the V3D(V) submission design in kernel space and userspace during this development. We improved job scheduling in the V3D kernel driver and the V3DV job submission design. We also delivered more accurate synchronization mechanisms and further updates to the Broadcom Vulkan driver running on Raspberry Pi 4. Below, we summarize the changes in the kernel space, describing the previous state of the driver, the decisions taken, side improvements, and fixes.

From single to multiple binary in/out syncobjs:

Initially, V3D was very limited in the number of syncobjs per job submission. The V3D job interfaces (CL, CSD, and TFU) only supported one syncobj (in_sync) to be added as an execution dependency and one syncobj (out_sync) to be signaled when a submission completes. The only exception was CL submission, which accepts two in_syncs (one for the binner job and another for the render job), but even that didn’t change the limited options.

Meanwhile, in the userspace, the V3DV driver followed alternative paths to meet Vulkan’s synchronization and submission framework. It needed to handle multiple wait and signal semaphores, but the V3D kernel-driver interface only accepted one in_sync and one out_sync. In short, V3DV had to fit multiple semaphores into one when submitting every GPU job.

Generic ioctl extension

The first decision was how to extend the V3D interface to accept multiple in and out syncobjs. We could extend each ioctl with two syncobj arrays and two entries for their counters. We could create new ioctls with multiple in/out syncobjs. But after examining how other drivers have extended their submission interfaces, we decided to extend the V3D ioctls (v3d_cl_submit_ioctl, v3d_csd_submit_ioctl, v3d_tfu_submit_ioctl) with a generic ioctl extension.

I found a curious commit message when I was examining how other developers handled the issue in the past:

Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Mar 22 09:23:22 2019 +0000

    drm/i915: Introduce the i915_user_extension_method
    
    An idea for extending uABI inspired by Vulkan's extension chains.
    Instead of expanding the data struct for each ioctl every time we need
    to add a new feature, define an extension chain instead. As we add
    optional interfaces to control the ioctl, we define a new extension
    struct that can be linked into the ioctl data only when required by the
    user. The key advantage being able to ignore large control structs for
    optional interfaces/extensions, while being able to process them in a
    consistent manner.
    
    In comparison to other extensible ioctls, the key difference is the
    use of a linked chain of extension structs vs an array of tagged
    pointers. For example,
    
    struct drm_amdgpu_cs_chunk {
    	__u32		chunk_id;
        __u32		length_dw;
        __u64		chunk_data;
    };
[...]

So, inspired by amdgpu_cs_chunk and i915_user_extension, we opted to extend the V3D interface through a generic interface. After applying some suggestions from Iago Toral (Igalia) and Daniel Vetter, we reached the following struct:

struct drm_v3d_extension {
	__u64 next;
	__u32 id;
#define DRM_V3D_EXT_ID_MULTI_SYNC		0x01
	__u32 flags; /* mbz */
};

This generic extension has an id to identify the feature/extension we are adding to an ioctl (that maps the related struct type), a pointer to the next extension, and flags (if needed). Whenever we need to extend the V3D interface again for another specific feature, we subclass this generic extension into the specific one instead of extending ioctls indefinitely.

Multisync extension

For the multiple syncobjs extension, we define a multi_sync extension struct that subclasses the generic extension struct. It has arrays of in and out syncobjs, the respective number of elements in each of them, and a wait_stage value used in CL submissions to determine which job needs to wait for syncobjs before running.

struct drm_v3d_multi_sync {
	struct drm_v3d_extension base;
	/* Array of wait and signal semaphores */
	__u64 in_syncs;
	__u64 out_syncs;

	/* Number of entries */
	__u32 in_sync_count;
	__u32 out_sync_count;

	/* set the stage (v3d_queue) to sync */
	__u32 wait_stage;

	__u32 pad; /* mbz */
};

And if a multisync extension is defined, the V3D driver ignores the previous interface of single in/out syncobjs.
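
For illustration only, here is a rough sketch of how userspace might chain this extension into a job submission. struct example_submit and its extensions field are stand-ins for the real V3D submit ioctl structs (in the actual driver these come from the v3d uAPI header), and chain_multisync is a made-up helper:

#include <stdint.h>

/* Stand-in for the real v3d submit ioctl structs, which gained an
 * "extensions" pointer with this work. */
struct example_submit {
	uint64_t extensions;	/* pointer to the first drm_v3d_extension in the chain */
	/* ... the rest of the submit arguments ... */
};

static void chain_multisync(struct example_submit *submit,
			    struct drm_v3d_multi_sync *ms,
			    uint64_t in_syncs, uint32_t in_sync_count,
			    uint64_t out_syncs, uint32_t out_sync_count,
			    uint32_t wait_stage)
{
	ms->base.id = DRM_V3D_EXT_ID_MULTI_SYNC;
	ms->base.next = 0;		/* no further extensions in the chain */
	ms->base.flags = 0;		/* mbz */
	ms->in_syncs = in_syncs;	/* userspace pointers to arrays of syncobj handles */
	ms->out_syncs = out_syncs;
	ms->in_sync_count = in_sync_count;
	ms->out_sync_count = out_sync_count;
	ms->wait_stage = wait_stage;
	ms->pad = 0;

	submit->extensions = (uint64_t)(uintptr_t)ms;
}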

Once we had the interface to support multiple in/out syncobjs, the v3d kernel driver needed to handle it. As V3D uses the DRM scheduler for job execution, changing from a single syncobj to multiple ones is quite straightforward. V3D copies the in syncobjs from userspace and uses drm_syncobj_find_fence() + drm_sched_job_add_dependency() to add all in_syncs (wait semaphores) as job dependencies, i.e. syncobjs to be checked by the scheduler before running the job. On CL submissions, we have the bin and render jobs, so V3D follows the value of wait_stage to determine which job depends on those in_syncs to start its execution.

When V3D defines the last job in a submission, it replaces dma_fence of out_syncs with the done_fence from this last job. It uses drm_syncobj_find() + drm_syncobj_replace_fence() to do that. Therefore, when a job completes its execution and signals done_fence, all out_syncs are signaled too.
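
A heavily simplified sketch of those two steps, using the kernel helpers named above but with made-up function names and with error and reference handling trimmed down, might look like this:

#include <drm/drm_syncobj.h>
#include <drm/gpu_scheduler.h>

/* Add every in_sync (wait semaphore) as a dependency of the scheduler job. */
static int add_wait_deps(struct drm_file *file_priv, struct drm_sched_job *job,
			 const u32 *in_syncs, u32 in_sync_count)
{
	struct dma_fence *fence;
	int ret;
	u32 i;

	for (i = 0; i < in_sync_count; i++) {
		ret = drm_syncobj_find_fence(file_priv, in_syncs[i], 0, 0, &fence);
		if (ret)
			return ret;
		/* The scheduler takes ownership of the fence reference. */
		ret = drm_sched_job_add_dependency(job, fence);
		if (ret)
			return ret;
	}
	return 0;
}

/* Make every out_sync (signal semaphore) carry the done fence of the last job. */
static void signal_out_syncs(struct drm_file *file_priv, struct dma_fence *done_fence,
			     const u32 *out_syncs, u32 out_sync_count)
{
	struct drm_syncobj *syncobj;
	u32 i;

	for (i = 0; i < out_sync_count; i++) {
		syncobj = drm_syncobj_find(file_priv, out_syncs[i]);
		if (!syncobj)
			continue;
		drm_syncobj_replace_fence(syncobj, done_fence);
		drm_syncobj_put(syncobj);
	}
}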

Other improvements to v3d kernel driver

This work also made possible some improvements in the original implementation. Following Iago’s suggestions, we refactored the job’s initialization code to allocate memory and initialize a job in one go. With this, we started to clean up resources more cohesively, clearly distinguishing cleanups in case of failure from job completion. We also fixed the resource cleanup when a job is aborted before the DRM scheduler arms it - at that point, drm_sched_job_arm() had recently been introduced to job initialization. Finally, we prepared the semaphore interface to implement timeline syncobjs in the future.

Going Up

The patchset that adds multiple syncobjs support and improvements to V3D is available here and comprises four patches:

  • drm/v3d: decouple adding job dependencies steps from job init
  • drm/v3d: alloc and init job in one shot
  • drm/v3d: add generic ioctl extension
  • drm/v3d: add multiple syncobjs support

After extending the V3D kernel interface to accept multiple syncobjs, we worked on V3DV to benefit from V3D multisync capabilities. In the next post, I will describe a little of this work.

May 09, 2022

As a board, we have been working on several initiatives to make the Foundation a better asset for the GNOME Project. We’re working on a number of threads in parallel, so I wanted to explain the “big picture” a bit more to try and connect together things like the new ED search and the bylaw changes.

We’re all here to see free and open source software succeed and thrive, so that people can be truly empowered with agency over their technology, rather than being passive consumers. We want to bring GNOME to as many people as possible so that they have computing devices that they can inspect, trust, share and learn from.

In previous years we’ve tried to boost the relevance of GNOME (or technologies such as GTK) or solicit donations from businesses and individuals with existing engagement in FOSS ideology and technology. The problem with this approach is that we’re mostly addressing people and organisations who are already supporting or contributing FOSS in some way. To truly scale our impact, we need to look to the outside world, build better awareness of GNOME outside of our current user base, and find opportunities to secure funding to invest back into the GNOME project.

The Foundation supports the GNOME project with infrastructure, arranging conferences, sponsoring hackfests and travel, design work, legal support, managing sponsorships, the advisory board, and being the fiscal sponsor of GNOME, GTK, Flathub… and we will keep doing all of these things. What we’re talking about here are additional ways for the Foundation to support the GNOME project – we want to go beyond these activities, and invest in GNOME to grow its adoption amongst people who need it. This has a cost, and that means that, in parallel with these initiatives, we need to find partners to fund this work.

Neil has previously talked about themes such as education, advocacy, privacy, but we’ve not previously translated these into clear specific initiatives that we would establish in addition to the Foundation’s existing work. This is all a work in progress and we welcome any feedback from the community about refining these ideas, but here are the current strategic initiatives the board is working on. We’ve been thinking about growing our community by encouraging and retaining diverse contributors, and addressing evolving computing needs which aren’t currently well served on the desktop.

Initiative 1. Welcoming newcomers. The community is already spending a lot of time welcoming newcomers and teaching them the best practices. Those activities are as time consuming as they are important, but currently a handful of individuals are running initiatives such as GSoC, Outreachy and outreach to Universities. These activities help bring diverse individuals and perspectives into the community, and help them develop skills and experience of collaborating to create Open Source projects. We want to make those efforts more sustainable by finding sponsors for these activities. With funding, we can hire people to dedicate their time to operating these programs, including paid mentors and creating materials to support newcomers in future, such as developer documentation, examples and tutorials. This is the initiative that needs to be refined the most before we can turn it into something real.

Initiative 2: Diverse and sustainable Linux app ecosystem. I spoke at the Linux App Summit about the work that GNOME and Endless have been supporting in Flathub, but this is an example of something which has a great overlap between commercial, technical and mission-based advantages. The key goal here is to improve the financial sustainability of participating in our community, which in turn has an impact on the diversity of who we can expect to afford to enter and remain in our community. We believe the existence of this is critically important for individual developers and contributors to unlock earning potential from our ecosystem, through donations or app sales. In turn, a healthy app ecosystem also improves the usefulness of the Linux desktop as a whole for potential users. We believe that we can build a case for commercial vendors in the space to join an advisory board alongside GNOME, KDE, etc. to input into the governance and contribute to the costs of growing Flathub.

Initiative 3: Local-first applications for the GNOME desktop. This is what Thib has been starting to discuss on Discourse, in this thread. There are many different threats to free access to computing and information in today’s world. The GNOME desktop and apps need to give users convenient and reliable access to technology which works similarly to the tools they already use everyday, but keeps them and their data safe from surveillance, censorship, filtering or just being completely cut off from the Internet. We believe that we can seek both philanthropic and grant funding for this work. It will make GNOME a more appealing and comprehensive offering for the many people who want to protect their privacy.

The idea is that these initiatives all sit on the boundary between the GNOME community and the outside world. If the Foundation can grow and deliver these kinds of projects, we are reaching to new people, new contributors and new funding. These contributions and investments back into GNOME represent a true “win-win” for the newcomers and our existing community.

(Originally posted to GNOME Discourse, please feel free to join the discussion there.)

Mesh and task shaders (amplification shaders in D3D jargon) are a new way to produce geometry in 3D applications. First proposed by NVIDIA in 2018 and initially available in the “Turing” series of GPUs, they are now supported on RDNA2 GPUs; on the API side they are part of the D3D12 API and are also available as vendor-specific extensions to Vulkan and OpenGL. In this post I’m going to talk about what mesh shaders are, and in part 2 I’m going to talk about how they are implemented on the driver side.

Problems with the old geometry pipeline

The problem with the traditional vertex processing pipeline is that it is mainly designed assuming several fixed-function hardware units in the GPU and offers very little flexibility for the user to customize it. The main issues with the traditional pipeline are:

  • Vertex buffers and vertex shader inputs are annoying (especially from the driver’s perspective) and input assembly may be a bottleneck on some HW in some cases.
  • The user has no control over how the input vertices and primitives are arranged, so the vertex shader may uselessly run for primitives that are invisible (e.g. occluded, backfacing, etc.), meaning that compute resources are wasted on things that don’t actually produce any pixels.
  • Geometry amplification depends on fixed function tessellation HW and offers poor customizability.
  • The programming model allows very poor control over input and output primitives. Geometry shaders have a horrible programming model that results in low HW occupancy and limited topologies.

The mesh shading pipeline is a graphics pipeline that addresses these issues by completely replacing the entire traditional vertex processing pipeline with two new stages: the task and mesh shader.

What is a mesh shader?

A mesh shader is a compute-like stage which allows the application to fully customize its inputs and outputs, including output primitives.

  • Mesh shader vs. Vertex shader: A mesh shader is responsible for creating its output vertices and primitives. In comparison, a vertex shader is only capable of loading a fixed number of vertices, doing some processing on them, and has no awareness of primitives.
  • Mesh shader vs. Geometry shader: As opposed to a geometry shader which can only use a fixed output topology, a mesh shader is free to define whatever topology it wants. You can think about it as if the mesh shader produces an indexed triangle list.

What does it mean that the mesh shader is compute-like?

You can use all the sweet good stuff that compute shaders already can do, but vertex shaders couldn’t, for example: use shared memory, run in workgroups, rely on workgroup ID, subgroup ID, etc.

The API allows any mesh shader invocation to write to any vertex or primitive. The invocations in each mesh shader workgroup are meant to co-operatively produce a small set of output vertices and primitives (this is sometimes called a “meshlet”). All workgroups together create the full output geometry (the “mesh”).

What does the mesh shader do?

First, it needs to figure out how many vertices and primitives it wants to create, then it can write these to its output arrays. How it does this is entirely up to the application developer. There are, though, some performance recommendations which you should follow to make it go fast. I’m going to talk about these more in Part 2.

The input assembler step is entirely eliminated, which means that the application is now in full control of how (or if at all) the input vertex data is fetched, meaning that you can save bandwidth on things that don’t need to be loaded, etc. There is no “input” in the traditional sense, but you can rely on things like push constants, UBO, SSBO etc.

For example, a mesh shader could perform per-triangle culling in such a manner that it wouldn’t need to load data for primitives that are culled, therefore saving bandwidth.

What is a task shader aka. amplification shader?

The task shader is an optional stage which operates like a compute shader. Each task shader workgroup has two main purposes:

  • Decide how many mesh shader workgroups need to be launched.
  • Create an optional “task payload” which can be passed to mesh shaders.

The “geometry amplification” is achieved by choosing to launch more (or fewer) mesh shader workgroups. As opposed to the fixed-function tessellator in the traditional pipeline, it is now entirely up to the application how to create the vertices.

While you could re-implement the old fixed-function tessellation with mesh shading, this may not actually be necessary and your application may work fine with some other simpler algorithm.

Another interesting use case for task shaders is per-meshlet culling, meaning that a task shader is a good place to decide which meshlets you actually want to render and eliminate entire mesh shader workgroups which would otherwise operate on invisible primitives.

  • Task shader vs. Tessellation. Tessellation relies on fixed-function hardware and makes users do complicated shader I/O gymnastics with tess control shaders. The task shader is straightforward and only as complicated as you want it to be.
  • Task shader vs. Geometry shader. Geometry shaders operate on input primitives directly and replace them with “strip” primitives. Task shaders don’t get directly involved with the geometry output; they just let you specify how many mesh shader workgroups to launch and let the mesh shader deal with the nitty-gritty details.

Usefulness

For now, I’ll just discuss a few basic use cases.

Meshlets. If your application loads input vertex data from somewhere, it is recommended that you subdivide that data (the “mesh”) into smaller chunks called “meshlets”. Then you can write your shaders such that each mesh shader workgroup processes a single meshlet.

Procedural geometry. Your application can generate all its vertices and primitives based on a mathematical formula that is implemented in the shader. In this case, you don’t need to load any inputs, just implement your formula as if you were writing a compute shader, then store the results into the mesh shader output arrays.

Replacing compute pre-passes. Many modern games use a compute pre-pass. They launch some compute shaders that do some pre-processing on the geometry before the graphics work. These are no longer necessary. The compute work can be made part of either the task or mesh shader, which removes the overhead of the additional submission.

Note that mesh shader workgroups may be launched as soon as the corresponding task shader workgroup is finished, so mesh shader execution (of the already finished tasks) may overlap with task shader execution, removing the need for extra synchronization on the application side.

Conclusion

Thus far, I’ve sold you on how awesome and flexible mesh shading is, so it’s time to ask the million dollar question.

Is mesh shading for you?

The answer, as always, is: It depends.

Yes, mesh and task shaders do give you a lot of opportunities to implement things just the way you like them without the stupid hardware getting in your way, but as with any low-level tool, this also means that you get a lot of possibilities for shooting yourself in the foot.

The traditional vertex processing pipeline has been around for so long that on most hardware it’s extremely well optimized because the drivers do a lot of optimization work for you. Therefore, just because an app uses mesh shaders, doesn’t automatically mean that it’s going to be faster or better in any way. It’s only worth it if you are willing to do it well.

That being said, perhaps the easiest way to start experimenting with mesh shaders is to rewrite parts of your application that used to use geometry shaders. Geometry shaders are so horribly inefficient that it’ll be difficult to write a worse mesh shader.

How is mesh shading implemented under the hood?

Stay tuned for Part 2 if you are curious about that! In Part 2, I’m going to talk about how mesh and task shaders are implemented in a driver. This will shed some light on how these shaders work internally and why certain things perform really badly.

Sometimes you want to go and inspect details of the shaders that are used with specific draw calls in a frame. With RenderDoc this is really easy if the driver implements VK_KHR_pipeline_executable_properties. This extension allows applications to query the driver about various aspects of the executable code generated for a Vulkan pipeline.

I implemented this extension for V3DV, the Vulkan driver for Raspberry Pi 4, last week (it is currently in the review process) because I was tired of jumping through hoops to get the info I needed when looking at traces. For V3DV we expose the NIR and QPU assembly code as well as various other stats, some of which are quite relevant to performance, such as spill or thread counts.
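
For reference, this is roughly what the application (or tool) side looks like when enumerating a pipeline’s executables with this extension. A hypothetical sketch, assuming the extension and its feature were enabled at device creation; the helper name is made up and error handling is omitted:

#include <stdio.h>
#include <vulkan/vulkan.h>

/* List the executables (shaders) of a pipeline using
 * VK_KHR_pipeline_executable_properties. */
static void list_pipeline_executables(VkDevice device, VkPipeline pipeline)
{
    PFN_vkGetPipelineExecutablePropertiesKHR get_props =
        (PFN_vkGetPipelineExecutablePropertiesKHR)
            vkGetDeviceProcAddr(device, "vkGetPipelineExecutablePropertiesKHR");

    VkPipelineInfoKHR pipeline_info = {
        .sType = VK_STRUCTURE_TYPE_PIPELINE_INFO_KHR,
        .pipeline = pipeline,
    };

    uint32_t count = 0;
    get_props(device, &pipeline_info, &count, NULL);
    if (count > 16)
        count = 16;

    VkPipelineExecutablePropertiesKHR props[16] = {0};
    for (uint32_t i = 0; i < count; i++)
        props[i].sType = VK_STRUCTURE_TYPE_PIPELINE_EXECUTABLE_PROPERTIES_KHR;
    get_props(device, &pipeline_info, &count, props);

    for (uint32_t i = 0; i < count; i++)
        printf("%s: %s\n", props[i].name, props[i].description);
}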


Some shader statistics

Final NIR code

QPU assembly
May 02, 2022

TLDR: Hermetic /usr/ is awesome; let's popularize image-based OSes with modernized security properties built around immutability, SecureBoot, TPM2, adaptability, auto-updating, factory reset, uniformity – built from traditional distribution packages, but deployed via images.

Over the past years, systemd gained a number of components for building Linux-based operating systems. While these components individually have been adopted by many distributions and products for specific purposes, we did not publicly communicate a broader vision of how they should all fit together in the long run. In this blog story I hope to provide that from my personal perspective, i.e. explain how I personally would build an OS and where I personally think OS development with Linux should go.

I figure this is going to be a longer blog story, but I hope it will be equally enlightening. Please understand though that everything I write about OS design here is my personal opinion, and not one of my employer.

For the last 12 years or so I have been working on Linux OS development, mostly around systemd. In all those years I have had a lot of time to think about the Linux platform, and specifically traditional Linux distributions and their strengths and weaknesses. I have seen many attempts to reinvent Linux distributions in one way or another, with varying success. After all this, most would probably agree that the traditional RPM or dpkg/apt-based distributions still define the Linux platform more than others (for 25+ years now), even though some Linux-based OSes (Android, ChromeOS) probably outnumber them in installations overall.

And over all those 12 years I kept wondering, how would I actually build an OS for a system or for an appliance, and what are the components necessary to achieve that. And most importantly, how can we make these components generic enough so that they are useful in generic/traditional distributions too, and in other use cases than my own.

The Project

Before figuring out how I would build an OS it's probably good to figure out what type of OS I actually want to build, what purpose I intend to cover. I think a desktop OS is probably the most interesting. Why is that? Well, first of all, I use one of these for my job every single day, so I care immediately, it's my primary tool of work. But more importantly: I think building a desktop OS is one of the most complex overall OS projects you can work on, simply because desktops are so much more versatile and variable than servers or embedded devices. If one figures out the desktop case, I think there's a lot more to learn from, and reuse in the server or embedded case, than going the other way. After all, there's a reason why so much of the widely accepted Linux userspace stack comes from people with a desktop background (including systemd, BTW).

So, let's see how I would build a desktop OS. If you press me hard, and ask me why I would do that given that ChromeOS already exists and more or less is a Linux desktop OS: there's plenty I am missing in ChromeOS, but most importantly, I am a lot more interested in building something people can easily and naturally rebuild and hack on, i.e. Google-style over-the-wall open source with its skewed power dynamic is not particularly attractive to me. I much prefer building this within the framework of a proper open source community, out in the open, and basing all this strongly on the status quo ante, i.e. the existing distributions. I think it is crucial to provide a clear avenue to build a modern OS based on the existing distribution model, if there shall ever be a chance to make this interesting for a larger audience.

(Let me underline though: even though I am going to focus on a desktop here, most of this is directly relevant for servers as well, in particular container host OSes and suchlike, or embedded devices, e.g. car IVI systems and so on.)

Design Goals

  1. First and foremost, I think the focus must be on an image-based design rather than a package-based one. For robustness and security it is essential to operate with reproducible, immutable images that describe the OS or large parts of it in full, rather than operating always with fine-grained RPM/dpkg style packages. That's not to say that packages are not relevant (I actually think they matter a lot!), but I think they should be less a tool for deploying code and more one for building the objects to deploy. A different way to see this: any OS built like this must be easy to replicate in a large number of instances, with minimal variability. Regardless of whether we talk about desktops, servers or embedded devices: the focus for my OS should be on "cattle", not "pets", i.e. that from the start it's trivial to reuse the well-tested, cryptographically signed combination of software over a large set of devices the same way, with a maximum of bit-exact reuse and a minimum of local variances.

  2. The trust chain matters, from the boot loader all the way to the apps. This means all code that is run must be cryptographically validated before it is run. All storage must be cryptographically protected: public data must be integrity checked; private data must remain confidential.

    This is in fact where big distributions currently fail pretty badly. I would go as far as saying that SecureBoot on Linux distributions is mostly security theater at this point, if you so will. That's because the initrd that unlocks your FDE (i.e. the cryptographic concept that protects the rest of your system) is not signed or protected in any way. It's trivial for an attacker with access to your hard disk to modify it in an undetectable way and collect your FDE passphrase. The involved bureaucracy around the implementation of UEFI SecureBoot in the big distributions is to a large degree pointless if you ask me, given that once the kernel is assumed to be in a good state, the system as the next step invokes completely unsafe code with full privileges.

    This is a fault of current Linux distributions though, not of SecureBoot in general. Other OSes use this functionality in more useful ways, and we should correct that too.

  3. Pretty much the same thing: offline security matters. I want my data to be reasonably safe at rest, i.e. cryptographically inaccessible even when I leave my laptop in my hotel room, suspended.

  4. Everything should be cryptographically measured, so that remote attestation is supported for as much software shipped on the OS as possible.

  5. Everything should be self descriptive, have single sources of truths that are closely attached to the object itself, instead of stored externally.

  6. Everything should be self-updating. Today we know that software is never bug-free, and thus requires a continuous update cycle. Not only the OS itself, but also any extensions, services and apps running on it.

  7. Everything should be robust in respect to aborted OS operations, power loss and so on. It should be robust towards hosed OS updates (regardless if the download process failed, or the image was buggy), and not require user interaction to recover from them.

  8. There must always be a way to put the system back into a well-defined, guaranteed safe state ("factory reset"). This includes that all sensitive data from earlier uses becomes cryptographically inaccessible.

  9. The OS should enforce clear separation between vendor resources, system resources and user resources: conceptually and when it comes to cryptographical protection.

  10. Things should be adaptive: the system should come up and make the best of the system it runs on, adapt to the storage and hardware. Moreover, the system should support execution on bare metal equally well as execution in a VM environment and in a container environment (i.e. systemd-nspawn).

  11. Things should not require explicit installation, i.e. every image should be a live image. For installation it should be sufficient to dd an OS image onto disk. Thus, strong focus on "instantiate on first boot", rather than "instantiate before first boot".

  12. Things should be reasonably minimal. The image the system starts its life with should be quick to download, and not include resources that can as well be created locally later.

  13. System identity, local cryptographic keys and so on should be generated locally, not be pre-provisioned, so that there's no leak of sensitive data during the transport onto the system possible.

  14. Things should be reasonably democratic and hackable. It should be easy to fork an OS, to modify an OS and still get reasonable cryptographic protection. Modifying your OS should not necessarily imply that your "warranty is voided" and you lose all good properties of the OS, if you so will.

  15. Things should be reasonably modular. The privileged part of the core OS must be extensible, including on the individual system. It's not sufficient to support extensibility just through high-level UI applications.

  16. Things should be reasonably uniform, i.e. ideally the same formats and cryptographic properties are used for all components of the system, regardless if for the host OS itself or the payloads it receives and runs.

  17. Even taking all these goals into consideration, it should still be close to traditional Linux distributions, and take advantage of what they are really good at: integration and security update cycles.

Now that we know our goals and requirements, let's start designing the OS along these lines.

Hermetic /usr/

First of all the OS resources (code, data files, …) should be hermetic in an immutable /usr/. This means that a /usr/ tree should carry everything needed to set up the minimal set of directories and files outside of /usr/ to make the system work. This /usr/ tree can then be mounted read-only into the writable root file system that then will eventually carry the local configuration, state and user data in /etc/, /var/ and /home/ as usual.

Thankfully, modern distributions are surprisingly close to working without issues in such a hermetic context. Specifically, Fedora works mostly just fine: it has adopted the /usr/ merge and the declarative systemd-sysusers and systemd-tmpfiles components quite comprehensively, which means the directory trees outside of /usr/ are automatically generated as needed if missing. In particular /etc/passwd and /etc/group (and related files) are appropriately populated, should they be missing entries.
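
To make "declarative" a bit more concrete, here is a minimal sketch of the two kinds of snippets involved (the service name "foosrv" and the paths are made up for illustration): a sysusers.d fragment declares a system user, and a tmpfiles.d fragment declares the state directory and factory-copied configuration the service needs, so that all of this can be recreated on any boot where it is missing.

    # /usr/lib/sysusers.d/foosrv.conf  (hypothetical)
    # type  name    id  GECOS          home
    u       foosrv  -   "Foo Service"  /var/lib/foosrv

    # /usr/lib/tmpfiles.d/foosrv.conf  (hypothetical)
    # create the state directory if missing
    d /var/lib/foosrv 0750 foosrv foosrv -
    # copy the default config from the factory tree if /etc/ lacks it
    C /etc/foosrv.conf - - - - /usr/share/factory/etc/foosrv.conf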

In my model a hermetic OS is hence comprehensively defined within /usr/: combine the /usr/ tree with an empty, otherwise unpopulated root file system, and it will boot up successfully, automatically adding the files and resources strictly necessary to boot up.

Monopolizing vendor OS resources and definitions in an immutable /usr/ opens multiple doors to us:

  • We can apply dm-verity to the whole /usr/ tree, i.e. guarantee structural, cryptographic integrity on the whole vendor OS resources at once, with full file system metadata.

  • We can implement updates to the OS easily: by implementing an A/B update scheme on the /usr/ tree we can update the OS resources atomically and robustly, while leaving the rest of the OS environment untouched.

  • We can implement factory reset easily: erase the root file system and reboot. The hermetic OS in /usr/ has all the information it needs to set up the root file system afresh — exactly like in a new installation.

Initial Look at the Partition Table

So let's have a look at a suitable partition table, taking a hermetic /usr/ into account. Let's conceptually start with a table of four entries:

  1. A UEFI System Partition (required by the firmware to boot)

  2. Immutable, Verity-protected, signed file system with the /usr/ tree in version A

  3. Immutable, Verity-protected, signed file system with the /usr/ tree in version B

  4. A writable, encrypted root file system

(This is just for initial illustration here, as we'll see later it's going to be a bit more complex in the end.)

The Discoverable Partitions Specification provides suitable partition type UUIDs for all of the above partitions. That's great, because it makes the image self-descriptive: simply by looking at the image's GPT table we know what to mount where. This means we do not need a manual /etc/fstab, and a multitude of tools such as systemd-nspawn and similar can operate directly on the disk image and boot it up.
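
For example, because the partition types are self-descriptive, such an image (here a hypothetical foo.raw) can be booted as a container without any further configuration:

    # Boot the image as a container: systemd-nspawn discovers the ESP, /usr/
    # and root partitions purely from their GPT partition type UUIDs,
    # no /etc/fstab or other configuration needed
    systemd-nspawn --image=foo.raw --boot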

Booting

Now that we have a rough idea how to organize the partition table, let's look a bit at how to boot into that. In my model, "unified kernels" are the way to go, specifically those implementing Boot Loader Specification Type #2. These are basically kernel images that have an initial RAM disk attached to them, as well as a kernel command line, a boot splash image and possibly more, all wrapped into a single UEFI PE binary. By combining these into one we achieve two goals: they become extremely easy to update (i.e. drop in one file, and you update kernel+initrd at once) and more importantly, you can sign them as one for the purpose of UEFI SecureBoot.
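
As a rough sketch of how such a unified kernel can be put together today, one glues the pieces onto systemd's EFI stub with objcopy and then signs the result for SecureBoot. The section offsets are illustrative, and the input file names (cmdline.txt, splash.bmp, the initrd, the SecureBoot key/certificate) are assumptions for this example:

    objcopy \
        --add-section .osrel=/usr/lib/os-release --change-section-vma .osrel=0x20000 \
        --add-section .cmdline=cmdline.txt       --change-section-vma .cmdline=0x30000 \
        --add-section .splash=splash.bmp         --change-section-vma .splash=0x40000 \
        --add-section .linux=vmlinuz             --change-section-vma .linux=0x2000000 \
        --add-section .initrd=initrd.cpio.zst    --change-section-vma .initrd=0x3000000 \
        /usr/lib/systemd/boot/efi/linuxx64.efi.stub fooOS_0.7.efi

    # Sign the whole thing as one PE binary for UEFI SecureBoot
    sbsign --key db.key --cert db.crt --output fooOS_0.7.signed.efi fooOS_0.7.efi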

In my model, each version of such a kernel would be associated with exactly one version of the /usr/ tree: both are always updated at the same time. An update then becomes relatively simple: drop in one new /usr/ file system plus one kernel, and the update is complete.

The boot loader used for all this would be systemd-boot, of course. It's a very simple loader, and implements the aforementioned boot loader specification. This means it requires no explicit configuration or anything: it's entirely sufficient to drop in one such unified kernel file, and it will be picked up, and be made a candidate to boot into.

You might wonder how to configure the root file system to boot from with such a unified kernel that contains the kernel command line and is signed as a whole and thus immutable. The idea here is to use the usrhash= kernel command line option implemented by systemd-veritysetup-generator and systemd-fstab-generator. It does two things: it will search for and set up a dm-verity volume for the /usr/ file system, and then mount it. It takes the root hash value of the dm-verity Merkle tree as the parameter. This hash is then also used to find the /usr/ partition in the GPT partition table, under the assumption that the partition UUIDs are derived from it, as per the suggestions in the discoverable partitions specification (see above).
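
A small sketch of how these pieces connect (the device names and the hash are made up and shortened): veritysetup prints the root hash when generating the Verity data for the /usr/ file system, and that same hash ends up on the kernel command line that is baked into the unified kernel.

    # Generate the dm-verity hash data for the /usr/ file system
    veritysetup format /dev/disk/by-partlabel/fooOS_0.7 /dev/disk/by-partlabel/fooOS_0.7_verity
    #   ...
    #   Root hash: 59e95e9bc9b7a89d32f1f3...

    # The kernel command line baked into the matching unified kernel then carries:
    #   usrhash=59e95e9bc9b7a89d32f1f3...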

systemd-boot (if not told otherwise) will do a version sort of the kernel image files it finds, and then automatically boot the newest one. Picking a specific kernel to boot will also fixate which version of the /usr/ tree to boot into, because — as mentioned — the Verity root hash of it is built into the kernel command line the unified kernel image contains.

In my model I'd place the kernels directly into the UEFI System Partition (ESP), in order to simplify things. (systemd-boot also supports reading them from a separate boot partition, but let's not complicate things needlessly, at least for now.)

So, with all this, we now already have a boot chain that goes something like this: once the boot loader is run, it will pick the newest kernel, which includes the initial RAM disk and a secure reference to the /usr/ file system to use. This is already great. But a /usr/ alone won't make us happy; we also need a root file system. In my model, that file system would be writable, and the /etc/ and /var/ hierarchies would be located directly on it. Since these trees potentially contain secrets (SSH keys, …) the root file system needs to be encrypted. We'll use LUKS2 for this, of course. In my model, I'd bind this to the TPM2 chip (for compatibility with systems lacking one, we can find a suitable fallback, which then provides weaker guarantees, see below). A TPM2 is a security chip available in most modern PCs. Among other things it contains a persistent secret key that can be used to encrypt data, in such a way that the data can only be decrypted again if you have access to the chip and can prove you are running validated software. The cryptographic measuring I mentioned earlier is what allows this to work. But … let's not get lost too much in the details of TPM2 devices; that'd be material for a novel, and this blog story is going to be way too long already.

What does using a TPM2-bound key for unlocking the root file system get us? We can encrypt the root file system with it, and you can only read or make changes to the root file system if you also possess the TPM2 chip and run our validated version of the OS. This protects us against an evil maid scenario to some level: an attacker cannot just copy the hard disk of your laptop while you leave it in your hotel room, because unless the attacker also steals the TPM2 device the disk cannot be decrypted. Nor can the attacker simply modify the root file system: such changes would be detected on the next boot, because they weren't made with the right cryptographic key.
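
For completeness, here is what TPM2 enrollment of a LUKS2 volume looks like with today's tooling (the device path is illustrative); as discussed further below, this currently binds to PCR hash values rather than to signatures of PCR hashes.

    # Enroll a TPM2-bound key slot into the LUKS2 root volume, bound to PCR 7
    # (the UEFI SecureBoot policy measurement)
    systemd-cryptenroll --tpm2-device=auto --tpm2-pcrs=7 /dev/sda4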

So, now we have a system that already can boot up somewhat completely, and run userspace services. All code that is run is verified in some way: the /usr/ file system is Verity protected, and the root hash of it is included in the kernel that is signed via UEFI SecureBoot. And the root file system is locked to the TPM2 where the secret key is only accessible if our signed OS + /usr/ tree is used.

(One brief intermission here: so far all the components I am referencing exist already and have shipped in systemd and other projects, including the TPM2-based disk encryption. There's one thing missing at the moment, however, that still needs to be developed (happy to take PRs!): right now TPM2-based LUKS2 unlocking is bound to PCR hash values. This is hard to work with when implementing updates — what we'd need instead is unlocking by signatures of PCR hashes. TPM2 supports this, but we don't support it yet in our systemd-cryptsetup + systemd-cryptenroll stack.)

One of the goals mentioned above is that cryptographic key material should always be generated locally on first boot, rather than pre-provisioned. This of course has implications for the encryption key of the root file system: if we want to boot into this system we need the root file system to exist, and thus a key it is encrypted with must already have been generated. But where precisely would we generate it, given that we have no installer that could do so at installation time (as traditional Linux distribution installers do)? My proposed solution here is to use systemd-repart, which is a declarative, purely additive repartitioner. It can run from the initrd to create and format partitions on boot, before transitioning into the root file system. It can also encrypt the partitions it creates, automatically enrolling a TPM2-bound key.
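
What such systemd-repart definitions could look like is sketched below (the file names are made up, and the exact option set depends on the systemd version): one drop-in for the root file system that is created, formatted and TPM2-encrypted on first boot if missing, and one reserving the second, initially empty /usr/ slot.

    # /usr/lib/repart.d/50-root.conf  (hypothetical)
    [Partition]
    Type=root
    Format=btrfs
    Encrypt=tpm2

    # /usr/lib/repart.d/20-usr-B.conf  (hypothetical)
    [Partition]
    Type=usr
    Label=_empty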

So, let's revisit the partition table we mentioned earlier. Here's what in my model we'd actually ship in the initial image:

  1. A UEFI System Partition (ESP)

  2. An immutable, Verity-protected, signed file system with the /usr/ tree in version A

And that's already it. No root file system, no B /usr/ partition, nothing else. Only two partitions are shipped: the ESP with the systemd-boot loader and one unified kernel image, and the A version of the /usr/ partition. On first boot systemd-repart will then notice that the root file system doesn't exist yet, and will create it, format it, encrypt it, and enroll the key into the TPM2. It will also create the second /usr/ partition (B) that we'll need for later A/B updates (it remains empty for now, until the first update operation actually takes place, see below). Once that is done the initrd will combine the fresh root file system with the shipped /usr/ tree, and transition into it. Because the OS is hermetic in /usr/ and contains all the systemd-tmpfiles and systemd-sysusers information it can then set up the root file system properly, creating any directories and symlinks (and maybe a few files) necessary to operate.

Besides the fact that the root file system's encryption keys are generated on the system we boot from and never leave it, it is also pretty nice that the root file system will be sized dynamically, taking into account the physical size of the backing storage. This is perfect, because on first boot the image will automatically adapt to what it has been dd'ed onto.

Factory Reset

This is a good point to talk about the factory reset logic, i.e. the mechanism to place the system back into a known good state. This is important for two reasons: in our laptop use case, once you want to pass the laptop on to someone else, you want to ensure your data is fully and comprehensively erased. Moreover, if you have reason to believe your device was hacked you want to revert the device to a known good state, i.e. ensure that exploits cannot persist. systemd-repart already has a mechanism for this: in the declarations of the partitions the system should have, entries may be marked as candidates for erasure on factory reset. The actual factory reset is then requested by one of two means: by specifying a specific kernel command line option (which is not too interesting here, given we lock that down via UEFI SecureBoot; but then again, one could also add a second kernel to the ESP that is identical to the first, differing only in that it lists this command line option, so that selecting this entry initiates a factory reset) — or via an EFI variable that can be set and is honoured on the immediately following boot. So here's how a factory reset would then go down: once the factory reset is requested, it's enough to reboot. On the subsequent boot systemd-repart runs from the initrd, where it will honour the request and erase the partitions marked for erasure. Once that is complete the system is back in the state we shipped it in: only the ESP and the /usr/ file system exist, but the root file system is gone. And from here we can continue as on the original first boot: create a new root file system (and any other partitions), and encrypt/set it up afresh.
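
To sketch how the pieces fit together (the option and EFI variable names are as documented for systemd-repart, but treat the details as assumptions): partitions to be wiped are marked in their repart.d definitions, and the reset itself is requested via the kernel command line or an EFI variable and then honoured on the next boot.

    # In the repart.d definition of e.g. the root partition:
    [Partition]
    Type=root
    Format=btrfs
    Encrypt=tpm2
    FactoryReset=yes

    # Requesting the reset, variant 1: a second boot menu entry whose kernel
    # command line additionally carries:
    #     systemd.factory_reset=yes
    #
    # Requesting the reset, variant 2: set the "FactoryReset" EFI variable
    # before rebooting; systemd-repart honours it on the subsequent boot.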

So now we have a nice setup, where everything is either signed or encrypted securely. The system can adapt to the system it is booted on automatically on first boot, and can easily be brought back into a well defined state identical to the way it was shipped in.

Modularity

But of course, such a monolithic, immutable system is only useful for very specific purposes. If /usr/ can't be written to (at least not in the traditional sense), one cannot simply install a new software package one needs. So here two goals are superficially conflicting: on one hand one wants modularity, i.e. the ability to add components to the system, and on the other immutability, i.e. that precisely this is prohibited.

So let's see what I propose as a middle ground in my model. First, what's the precise use case for such modularity? I see a couple of different ones:

  1. For some cases it is necessary to extend the system itself at the lowest level, so that the components added in extend (or maybe even replace) the resources shipped in the base OS image, so that they live in the same namespace, and are subject to the same security restrictions and privileges. Exposure to the details of the base OS and its interface for this kind of modularity is at the maximum.

    Example: a module that adds a debugger or tracing tools into the system. Or maybe an optional hardware driver module.

  2. In other cases, more isolation is preferable: instead of extending the system resources directly, additional services shall be added in that bring their own files, can live in their own namespace (but with "windows" into the host namespaces), however still are system components, and provide services to other programs, whether local or remote. Exposure to the details of the base OS for this kind of modularity is restricted: it mostly focuses on the ability to consume and provide IPC APIs from/to the system. Components of this type can still be highly privileged, but the level of integration is substantially smaller than for the type explained above.

    Example: a module that adds a specific VPN connection service to the OS.

  3. Finally, there's the actual payload of the OS. This stuff is relatively isolated from the OS and definitely from each other. It mostly consumes OS APIs, and generally doesn't provide OS APIs. This kind of stuff runs with minimal privileges, and in its own namespace of concepts.

    Example: a desktop app, for reading your emails.

Of course, the lines between these three types of modules are blurry, but I think distinguishing them does make sense, as I think different mechanisms are appropriate for each. So here's what I'd propose in my model to use for this.

  1. For the system extension case I think the systemd-sysext images are appropriate. This tool operates on system extension images that are very similar to the host's disk image: they also contain a /usr/ partition, protected by Verity. However, they just include additions to the host image: binaries that extend the host. When such a system extension image is activated, it is merged via an immutable overlayfs mount into the host's /usr/ tree. Thus any file shipped in such a system extension will suddenly appear as if it was part of the host OS itself. For optional components that should more or less be considered part of the OS, this is a very simple and powerful way to combine an immutable OS with an immutable extension. Note that extensions for an OS built with this tool should most likely be built at the same time, within the same update cycle, as the host OS itself. After all, the files included in the extensions will have dependencies on files in the system OS image, and care must be taken that these dependencies remain in order.

  2. For adding in additional somewhat isolated system services in my model, Portable Services are the proposed tool of choice. Portable services are in most ways just like regular system services; they could be included in the system OS image or an extension image. However, portable services use RootImage= to run off separate disk images, thus within their own namespace. Images set up this way have various ways to integrate into the host OS, as they are in most ways regular system services, which just happen to bring their own directory tree. Also, unlike regular system services, for them sandboxing is opt-out rather than opt-in. In my model, here too the disk images are Verity protected and thus immutable. Just like the host OS they are GPT disk images that come with a /usr/ partition and Verity data, along with signing.

  3. Finally, the actual payload of the OS, i.e. the apps. To be useful in real life here it is important to hook into existing ecosystems, so that a large set of apps are available. Given that on Linux flatpak (or on servers OCI containers) are the established format that pretty much won they are probably the way to go. That said, I think both of these mechanisms have relatively weak properties, in particular when it comes to security, since immutability/measurements and similar are not provided. This means, unlike for system extensions and portable services a complete trust chain with attestation and per-app cryptographically protected data is much harder to implement sanely.

What I'd like to underline here is that the main system OS image, as well as the system extension images and the portable service images, are put together the same way: they are GPT disk images, with one immutable file system and associated Verity data. The latter two should also contain a PKCS#7 signature for the top-level Verity hash. This uniformity has many benefits: you can use the same tools to build and process these images, but most importantly, by using a single way to validate them throughout the stack (i.e. Verity, in the latter cases with PKCS#7 signatures), validation and measurement are straightforward. In fact it's so obvious that we don't even have to implement it in systemd: the kernel already has native support for this Verity signature checking (IMA).
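
To make the uniformity concrete, here is a sketch of how such images are consumed by the different mechanisms (the image names are made up for illustration):

    # Activate a system extension: drop the image into place and merge it
    cp debug-tools.raw /var/lib/extensions/
    systemd-sysext merge

    # Attach a portable service shipped as the same kind of signed GPT image
    portablectl attach ./foo-vpn.raw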

So, by composing a system at runtime from a host image, extension images and portable service images we have a nicely modular system where every single component is cryptographically validated on every single IO operation, and every component is measured, in its entire combination, directly in the kernel's IMA subsystem.

(Of course, once you add the desktop apps or OCI containers on top, then these properties are lost further down the chain. But well, a lot is already won, if you can close the chain that far down.)

Note that system extensions are not designed to replicate the fine grained packaging logic of RPM/dpkg. Of course, systemd-sysext is a generic tool, so you can use it for whatever you want, but there's a reason it does not bring support for a dependency language: the goal here is not to replicate traditional Linux packaging (we have that already, in RPM/dpkg, and I think they are actually OK for what they do) but to provide delivery of larger, coarser sets of functionality, in lockstep with the underlying OS' life-cycle and in particular with no interdependencies, except on the underlying OS.

Also note that depending on the use case it might make sense to also use system extensions to modularize the initrd. This is probably less relevant for a desktop OS, but for server systems it might make sense to package up support for specific complex storage in a systemd-sysext system extension, which can be applied to the initrd that is built into the unified kernel. (In fact, we have been working on bringing signed yet modular initrd support to general purpose Fedora this way.)

Note that portable services are composable from system extensions too, by the way. This makes them even more useful, as you can share a common runtime between multiple portable services, or even use the host image as the common runtime for portable services. In this model a common runtime image is shared between one or more system extensions, and composed at runtime via an overlayfs instance.

More Modularity: Secondary OS Installs

Having an immutable, cryptographically locked down host OS is great I think, and if we have some moderate modularity on top, that's also great. But oftentimes it's useful to be able to depart from or compromise on that for some specific use cases, i.e. to provide a bridge that, for example, allows workloads designed around RPM/dpkg package management to coexist reasonably nicely with such an immutable host.

For this purpose in my model I'd propose using systemd-nspawn containers. The containers are focused on OS containerization, i.e. they allow you to run a full OS with init system and everything as payload (unlike for example Docker containers which focus on a single service, and where running a full OS in it is a mess).

Running systemd-nspawn containers for such secondary OS installs has various nice properties. One of course is that systemd-nspawn supports the same level of cryptographic image validation that we rely on for the host itself. Thus, to some level the whole OS trust chain is reasonably recursive if desired: the firmware validates the OS, and the OS can validate a secondary OS installed within it. In fact, we can run our trusted OS recursively on itself and get similar security guarantees! Besides these security aspects, systemd-nspawn also has really nice properties when it comes to integration with the host. For example the --bind-user= option permits binding a host user record and their home directory into a container as a simple one-step operation. This makes it extremely easy to have a single user and $HOME but share it concurrently with the host and a zoo of secondary OSes in systemd-nspawn containers, each of which could even run a different distribution.
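
A sketch of what such a secondary OS install could look like in practice (the image and user names are made up):

    # Boot a secondary OS from a signed GPT image, binding the host user
    # "alice" and her home directory into the container
    systemd-nspawn --image=./fedora-workstation.raw --bind-user=alice --boot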

Developer Mode

Superficially, an OS with an immutable /usr/ appears much less hackable than an OS where everything is writable. Moreover, an OS where everything must be signed and cryptographically validated makes it hard to insert your own code, given you are unlikely to possess access to the signing keys.

To address this issue other systems have supported a "developer" mode: when entered, the security guarantees are disabled, and the system can be freely modified, without cryptographic validation. While that's a great concept to have, I doubt it's what most developers really want: the cryptographic properties of the OS are great after all, and it sucks having to give them up the moment developer mode is activated.

In my model I'd thus propose two different approaches to this problem. First of all, I think there's value in allowing users to additively extend/override the OS via local developer system extensions. With this scheme the underlying cryptographic validation would remain intact, but (if this form of development mode is explicitly enabled) the developer could add in more resources from local storage that are not tied to the OS vendor's chain of trust, but to a local one (i.e. simply backed by encrypted storage of some form).

The second approach is to make it easy to extend (or in fact replace) the set of trusted validation keys with local ones that are under the control of the user, in order to make it easy to operate with kernel, OS, extension, portable service or container images signed by the local developer without involvement of the OS builder. This is relatively easy to do for components further down the trust chain: the elements further up the chain should optionally accept additional certificates to validate against.

(Note that systemd currently has no explicit support for a "developer" mode like this. I think we should add that sooner or later however.)

Democratizing Code Signing

Closely related to the question of developer mode is the question of code signing. If you ask me, the status quo of UEFI SecureBoot code signing in the major Linux distributions is pretty sad. The work to get stuff signed is massive, but in effect it delivers very little in return: because initrds are entirely unprotected and reside on partitions lacking any form of cryptographic integrity protection, any attacker can trivially modify the boot process of any such Linux system and freely collect any FDE passphrases entered. There's little value in signing the boot loader and kernel in a complex bureaucracy if it then happily loads entirely unprotected code that processes the actually relevant security credentials: the FDE keys.

In my model, through use of unified kernels this important gap is closed, hence UEFI SecureBoot code signing becomes an integral part of the boot chain from firmware to the host OS. Unfortunately, code signing and having something a user can locally hack are to some level conflicting goals. However, I think we can improve the situation here, and put more emphasis on enrolling developer keys in the trust chain easily. Specifically, I see one relevant approach here: enrolling keys directly in the firmware is something that we should make less of a theoretical exercise and more something we can realistically deploy. See this work in progress making this more automatic and eventually safe. Other approaches are conceivable (including some that build on existing MokManager infrastructure), but given the politics involved, are harder to conclusively implement.

Running the OS itself in a container

What I explain above is put together with running on a bare metal system in mind. However, one of the stated goals is to make the OS adaptive enough to also run in a container environment (specifically: systemd-nspawn) nicely. Booting a disk image on bare metal or in a VM generally means that the UEFI firmware validates and invokes the boot loader, and the boot loader invokes the kernel which then transitions into the final system. This is different for containers: here the container manager immediately calls the init system, i.e. PID 1. Thus the validation logic must be different: cryptographic validation must be done by the container manager. In my model this is solved by shipping the OS image not only with a Verity data partition (as is already necessary for the UEFI SecureBoot trust chain, see above), but also with another partition, containing a PKCS#7 signature of the root hash of said Verity partition. This of course is exactly what I propose for both the system extension and portable service image. Thus, in my model the images for all three uses are put together the same way: an immutable /usr/ partition, accompanied by a Verity partition and a PKCS#7 signature partition. The OS image itself then has two ways "into" the trust chain: either through the signed unified kernel in the ESP (which is used for bare metal and VM boots) or by using the PKCS#7 signature stored in the partition (which is used for container/systemd-nspawn boots).

Parameterizing Kernels

A fully immutable and signed OS has to establish trust in the user data it makes use of before doing so. In the model I describe here, for /etc/ and /var/ we do this via disk encryption of the root file system (in combination with integrity checking). But the point where the root file system is mounted comes relatively late in the boot process, and thus cannot be used to parameterize the boot itself. In many cases it's important to be able to parameterize the boot process however.

For example, for the implementation of the developer mode indicated above it's useful to be able to pass this fact safely to the initrd, in combination with other fields (e.g. a hashed root password for allowing in-initrd logins for debug purposes). After all, if the initrd is pre-built by the vendor and signed as a whole together with the kernel, it cannot be modified to carry such data directly (which is in fact how parameterization of the initrd was traditionally done, to a large degree).

In my model this is achieved through system credentials, which allow passing parameters to systems (and services, for that matter) in an encrypted and authenticated fashion, bound to the TPM2 chip. This means that we can securely pass data into the initrd so that it can be authenticated and decrypted only on the system it is intended for and with the unified kernel image it was intended for.
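
A sketch with today's tooling (the credential name, content and path are made up): systemd-creds can encrypt a credential sealed against the local TPM2, which a service can then consume in authenticated, decrypted form.

    # Encrypt a small parameter, sealed against the local TPM2
    echo -n "devmode=1" | systemd-creds encrypt --with-key=tpm2 --name=devmode - devmode.cred

    # A service consumes it via its unit file:
    [Service]
    LoadCredentialEncrypted=devmode:/path/to/devmode.cred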

Swap

In my model the OS would also carry a swap partition, for the simple reason that only then can systemd-oomd.service provide the best results. Also see In defence of swap: common misconceptions.

Updating Images

We now have a rough idea how the system shall be organized, so let's next focus on the deployment cycle: software needs regular update cycles, and software that is not updated regularly is a security problem. Thus, I am sure that any modern system must be automatically updated, without this requiring avoidable user interaction.

In my model, this is the job for systemd-sysupdate. It's a relatively simple A/B image updater: it operates either on partitions, on regular files in a directory, or on subdirectories in a directory. Each entry has a version (which is encoded in the GPT partition label for partitions, and in the filename for regular files and directories): whenever an update is initiated the oldest version is erased, and the newest version is downloaded.

With the setup described above a system update becomes a really simple operation. On each update the systemd-sysupdate tool downloads a /usr/ file system partition, an accompanying Verity partition and a PKCS#7 signature partition, and drops them into the host's partition table (where they possibly replace the oldest versions stored there so far). Then it downloads a unified kernel image and drops it into the EFI System Partition's /EFI/Linux (as per the Boot Loader Specification; possibly erasing the oldest such file there). And that's already the whole update process: four files are downloaded from the server, unpacked and put in the most straightforward of ways into the partition table or file system. Unlike in other OS designs there's no mechanism required to explicitly switch to the newer version; the aforementioned systemd-boot logic will automatically pick the newest kernel once it is dropped in.
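
A sketch of what a transfer definition for such an update could look like (the URL, patterns and compression are made up for illustration; see the systemd-sysupdate documentation for the exact semantics). Similar definitions would cover the Verity, signature and unified kernel transfers.

    # /usr/lib/sysupdate.d/50-usr.conf  (hypothetical)
    [Source]
    Type=url-file
    Path=https://download.example.com/fooOS/
    MatchPattern=fooOS_@v.usr.raw.xz

    [Target]
    Type=partition
    Path=auto
    MatchPattern=fooOS_@v
    MatchPartitionType=usr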

Above we talked a lot about modularity, and how to put systems together as a combination of a host OS image, system extension images for the initrd and the host, portable service images and systemd-nspawn container images. I already emphasized that these image files are actually always the same: GPT disk images with partition definitions that match the Discoverable Partitions Specification. This comes in very handy when thinking about updating: we can use the exact same systemd-sysupdate tool for updating these other images as we use for the host image. The uniformity of the on-disk format allows us to update them uniformly too.

Boot Counting + Assessment

Automatic OS updates do not come without risks: if they happen automatically and an update goes wrong, your system might be automatically updated into a brick. This of course is less than ideal. Hence it is essential to address this reasonably automatically. In my model, there's systemd's Automatic Boot Assessment for that. The mechanism is simple: whenever a new unified kernel image is dropped into the system it will be stored with a small integer counter value included in the filename. Whenever the unified kernel image is selected for booting by systemd-boot, the counter is decreased by one. Once the system has booted up successfully (which is determined by userspace) the counter is removed from the file name (which indicates "this entry is known to work"). If the counter ever hits zero, this indicates that the entry was tried a couple of times and failed each time, and is thus apparently "bad". In this case systemd-boot will not consider the kernel anymore, and will revert to the next older one (that doesn't have a counter of zero).

By sticking the boot counter into the filename of the unified kernel we can directly attach this information to the kernel, and thus need not concern ourselves with cleaning up secondary information about the kernel when the kernel is removed. Updating with a tool like systemd-sysupdate hence remains a very simple operation: drop one old file, add one new file.
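
To illustrate the counter encoding (the file names are made up): the counter is carried in a "+LEFT-DONE" suffix of the file name, which systemd-boot updates on each attempted boot, and which systemd-bless-boot drops once userspace declares the boot successful.

    fooOS_0.8+3.efi      # freshly dropped in: 3 tries left
    fooOS_0.8+2-1.efi    # after one boot attempt that wasn't marked successful
    fooOS_0.8.efi        # counter removed: this entry is known to work
    fooOS_0.8+0-3.efi    # all tries used up: systemd-boot skips this entry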

Picking the Newest Version

I already mentioned that systemd-boot automatically picks the newest unified kernel image to boot, by looking at the version encoded in the filename. This is done via a simple strverscmp() call (well, truth be told, it's a modified version of that call, different from the one implemented in libc, because real-life package managers use more complex rules for comparing versions these days, and hence it made sense to do that here too). The concept of having multiple entries of some resource in a directory, and picking the newest one automatically is a powerful concept, I think. It means adding/removing new versions is extremely easy (as we discussed above, in systemd-sysupdate context), and allows stateless determination of what to use.

If systemd-boot can do that, what about system extension images, portable service images, or systemd-nspawn container images that do not actually use systemd-boot as the entrypoint? All these tools actually implement the very same logic, but on the partition level: if multiple suitable /usr/ partitions exist, then the newest is determined by comparing their GPT partition labels.

This is in a way the counterpart to the systemd-sysupdate update logic described above: we always need a way to determine which partition to actually use after an update took place, and that becomes very easy each time: enumerate the possible entries, and pick the newest as per the (modified) strverscmp() result.

Home Directory Management

In my model the device's users and their home directories are managed by systemd-homed. This means they are relatively self-contained and can be migrated easily between devices. The numeric UID assignment for each user is done at the moment of login only, and the files in the home directory are mapped as needed via a uidmap mount. It also allows us to protect the data of each user individually with a credential that belongs to the user itself: instead of binding the confidentiality of the user's data to the system-wide full-disk encryption, each user gets their own encrypted home directory where the user's authentication token (password, FIDO2 token, PKCS#11 token, recovery key, …) is used as authentication and decryption key for the user's data. This brings a major improvement for security, as it means the user's data is cryptographically inaccessible except when the user is actually logged in.
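
For illustration, creating such a self-contained, individually encrypted home area with today's tooling could look like this (the user name is made up):

    # Create a LUKS2-encrypted, self-contained home directory for user "alice"
    homectl create alice --storage=luks --fs-type=btrfs --real-name="Alice Example"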

It also allows us to correct another major issue with traditional Linux systems: the way data encryption works during system suspend. Traditionally on Linux the disk encryption credentials (e.g. the LUKS passphrase) are kept in memory while the system is suspended. This is a bad choice for security, since many (most?) of us probably never turn off their laptop but suspend it instead. But if the decryption key is always present in unencrypted form while the system is suspended, then it could potentially be read from there by a sufficiently equipped attacker.

By encrypting the user's home directory with the user's authentication token we can first safely "suspend" the home directory before going to the system suspend state (i.e. flush out the cryptographic keys needed to access it). This means any process currently accessing the home directory will be frozen for the time of the suspend, but that's expected anyway during a system suspend cycle. Why is this better than the status quo ante? In this model the home directory's cryptographic key material is erased during suspend, but it can be safely reacquired on resume, from system code. If the system is only encrypted as a whole however, then the system code itself couldn't reauthenticate the user, because it would be frozen too. By separating home directory encryption from the root file system encryption we can avoid this problem.

Partition Setup

So we discussed the organization of the partitions of the OS images multiple times above, each time focusing on a specific aspect. Let's now summarize how this should look all together.

In my model, the initial, shipped OS image should look roughly like this:

  • (1) A UEFI System Partition, with systemd-boot as boot loader and one unified kernel
  • (2) A /usr/ partition (version "A"), with a label fooOS_0.7 (under the assumption we called our project fooOS and the image version is 0.7).
  • (3) A Verity partition for the /usr/ partition (version "A"), with the same label
  • (4) A partition carrying the Verity root hash for the /usr/ partition (version "A"), along with a PKCS#7 signature of it, also with the same label

On first boot this is augmented by systemd-repart like this:

  • (5) A second /usr/ partition (version "B"), initially with a label _empty (which is the label systemd-sysupdate uses to mark partitions that currently carry no valid payload)
  • (6) A Verity partition for that (version "B"), similar to the above case, also labelled _empty
  • (7) And ditto a Verity root hash partition with a PKCS#7 signature (version "B"), also labelled _empty
  • (8) A root file system, encrypted and locked to the TPM2
  • (9) A home file system, integrity protected via a key also in TPM2 (encryption is unnecessary, since systemd-homed adds that on its own, and it's nice to avoid duplicate encryption)
  • (10) A swap partition, encrypted and locked to the TPM2

Then, on the first OS update the partitions 5, 6, 7 are filled with a new version of the OS (let's say 0.8) and thus get their label updated to fooOS_0.8. After a boot, this version is active.

On a subsequent update the three partitions fooOS_0.7 get wiped and replaced by fooOS_0.9 and so on.

On factory reset, the partitions 8, 9, 10 are deleted, so that systemd-repart recreates them, using a new set of cryptographic keys.

Here's a graphic that hopefully illustrates the partition table from the shipped image, through first boot, multiple update cycles and eventual factory reset:

Partitions Overview

Trust Chain

So let's summarize the intended chain of trust (for bare metal/VM boots) that ensures every piece of code in this model is signed and validated, and any system secret is locked to TPM2.

  1. First, firmware (or possibly shim) authenticates systemd-boot.

  2. Once systemd-boot picks a unified kernel image to boot, it is also authenticated by firmware/shim.

  3. The unified kernel image contains an initrd, which is the first userspace component that runs. It finds any system extensions passed into the initrd, and sets them up through Verity. The kernel will validate the Verity root hash signature of these system extension images against its usual keyring.

  4. The initrd also finds credentials passed in, then securely unlocks (which means: decrypts + authenticates) them with a secret from the TPM2 chip, locked to the kernel image itself.

  5. The kernel image also contains a kernel command line which contains a usrhash= option that pins the root hash of the /usr/ partition to use.

  6. The initrd then unlocks the encrypted root file system, with a secret bound to the TPM2 chip.

  7. The system then transitions into the main system, i.e. the combination of the Verity protected /usr/ and the encrypted root file system. It then activates two more encrypted (and/or integrity protected) volumes for /home/ and swap, also with a secret tied to the TPM2 chip.

Here's an attempt to illustrate the above graphically:

Trust Chain

This is the trust chain of the basic OS. Validation of system extension images, portable service images, systemd-nspawn container images always takes place the same way: the kernel validates these Verity images along with their PKCS#7 signatures against the kernel's keyring.

File System Choice

In the above I left the choice of file systems unspecified. For the immutable /usr/ partitions squashfs might be a good candidate, but any other that works nicely in a read-only fashion and generates reproducible results is a good choice, too. The home directories as managed by systemd-homed should certainly use btrfs, because it's the only general purpose file system supporting online grow and shrink, which systemd-homed can take advantage of to manage storage.

For the root file system btrfs is likely also the best idea. That's because we intend to use LUKS/dm-crypt underneath, which by default only provides confidentiality, not authenticity of the data (unless combined with dm-integrity). Since btrfs (unlike xfs/ext4) does full data checksumming it's probably the best choice here, since it means we don't have to use dm-integrity (which comes at a higher performance cost).

OS Installation vs. OS Instantiation

In the discussion above a lot of focus was put on setting up the OS and completing the partition layout and such on first boot. This means installing the OS becomes as simple as dd-ing (i.e. "streaming") the shipped disk image into the final HDD medium. Simple, isn't it?

Of course, such a scheme is just too simple for many setups in real life. Whenever multi-boot is required (i.e. co-installing an OS implementing this model with another unrelated one), dd-ing a disk image onto the HDD is going to overwrite user data that was supposed to be kept around.

In order to cover this case, in my model we'd use systemd-repart (again!) to allow streaming the source disk image into the target HDD in a smarter, additive way. The tool after all is purely additive: it will add in partitions or grow them if they are missing or too small. systemd-repart already has all the necessary provisions to not only create a partition on the target disk, but also copy blocks from a raw installer disk. An install operation would then become a two-step process: one invocation of systemd-repart that adds in the /usr/, its Verity and the signature partition to the target medium, populated with a copy of the same partitions of the installer medium. And one invocation of bootctl that installs the systemd-boot boot loader in the ESP. (Well, there's one thing missing here: the unified OS kernel also needs to be dropped into the ESP. For now, this can be done with a simple cp call. In the long run, this should probably be something bootctl can do as well, if told so.)
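
A sketch of that two-step operation (the device, mount paths and definitions directory are made up; the repart.d definitions would use CopyBlocks= to stream the partition contents from the installer medium):

    # Step 1: add the /usr/, Verity and signature partitions to the target disk,
    # copying their contents from the installer image as declared via CopyBlocks=
    systemd-repart --definitions=/usr/lib/repart.install.d --dry-run=no /dev/sda

    # Step 2: install the systemd-boot boot loader into the ESP
    bootctl install --esp-path=/mnt/esp

    # Plus, for now: copy the unified kernel into the ESP manually
    cp fooOS_0.7.efi /mnt/esp/EFI/Linux/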

So, with this we have a simple scheme to cover all bases: we can either just dd an image to disk, or we can stream an image onto an existing HDD, adding a couple of new partitions and files to the ESP.

Of course, in reality things are more complex than that even: there's a good chance that the existing ESP is simply too small to carry multiple unified kernels. In my model, the way to address this is by shipping two slightly different systemd-repart partition definition file sets: the ideal case when the ESP is large enough, and a fallback case, where it isn't and where we then add in an additional XBOOTLDR partition (as per the Discoverable Partitions Specification). In that mode the ESP carries the boot loader, but the unified kernels are stored in the XBOOTLDR partition. This scenario is not quite as simple as the XBOOTLDR-less scenario described first, but is equally well supported in the various tools. Note that systemd-repart can be told size constraints on the partitions it shall create or augment, thus to implement this scheme it's enough to invoke the tool with the fallback partition scheme if invocation with the ideal scheme fails.

Either way: regardless how the partitions, the boot loader and the unified kernels ended up on the system's hard disk, on first boot the code paths are the same again: systemd-repart will be called to augment the partition table with the root file system, and properly encrypt it, as was already discussed earlier here. This means: all cryptographic key material used for disk encryption is generated on first boot only, the installer phase does not encrypt anything.

Live Systems vs. Installer Systems vs. Installed Systems

Traditionally on Linux three types of systems were common: "installed" systems, i.e. that are stored on the main storage of the device and are the primary place people spend their time in; "installer" systems which are used to install them and whose job is to copy and set up the packages that make up the installed system; and "live" systems, which were a middle ground: a system that behaves like an installed system in most ways, but lives on removable media.

In my model I'd like to remove the distinction between these three concepts as much as possible: each of these three images should carry the exact same /usr/ file system, and should be suitable to be replicated the same way. Once installed the resulting image can also act as an installer for another system, and so on, creating a certain "viral" effect: if you have one image or installation it's automatically something you can replicate 1:1 with a simple systemd-repart invocation.

Building Images According to this Model

The above explains what the image should look like and how its first boot and update cycle will modify it. But this leaves one question unanswered: how do we actually build the initial image for OS instances following this model?

Note that there's nothing too special about images following this model: they are ultimately just GPT disk images with Linux file systems, following the Discoverable Partitions Specification. This means you can use any tool of your choice that can put together compliant GPT disk images.

I personally would use mkosi for this purpose though. It's designed to generate compliant images, and has a rich toolset for SecureBoot and signed/Verity file systems already in place.
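
As a rough sketch, a minimal mkosi configuration for such an image could look like this. The option names reflect mkosi at the time of writing, so treat the details as assumptions that may differ between versions:

    # mkosi.default  (hypothetical project configuration)
    [Distribution]
    Distribution=fedora
    Release=36

    [Output]
    Format=gpt_squashfs
    Bootable=yes
    Verity=yes

    [Packages]
    Packages=systemd udev kernel-core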

What is key here is that this model doesn't depart from RPM and dpkg; instead, it builds on top of them: in this model they are excellent for putting together images on the build host, but deployment onto the runtime host does not involve individual packages.

I think one cannot overstate the value traditional distributions bring regarding security, integration and general polishing. The concepts I describe above are inherited from this, but depart from the idea that distribution packages are a runtime concept, making them a build-time concept instead.

Note that the above is pretty much independent from the underlying distribution.

Final Words

I have no illusions: general purpose distributions are not going to adopt this model as their default any time soon, and it's not even my goal that they do. The above is my personal vision, and I don't expect people to buy into it 100%, and that's fine. However, what I am interested in is finding the overlaps, i.e. working with people who buy 50% into this vision, and sharing the components.

My goals here thus are to:

  1. Get distributions to move to a model where images like this can be built from the distribution easily. Specifically this means that distributions make their OS hermetic in /usr/.

  2. Find the overlaps, and share components with other projects to revisit how distributions are put together. This is already happening (see systemd-tmpfiles and systemd-sysusers support in various distributions), but I think there's more to share.

  3. Make people interested in building actual real-world images based on general purpose distributions adhering to the model described above. I'd love a "GnomeBook" image with full trust properties, that is built from true Linux distros, such as Fedora or ArchLinux.

FAQ

  1. What about ostree? Doesn't ostree already deliver what this blog story describes?

    ostree is fine technology, but with respect to security and robustness properties it's not too interesting I think, because unlike image-based approaches it cannot really deliver integrity/robustness guarantees over the whole tree easily. To be able to trust an ostree setup you have to establish trust in the underlying file system first, and the complexity of the file system makes that challenging. To provide an effective offline-secure trust chain through the whole depth of the stack it is essential to cryptographically validate every single I/O operation. In an image-based model this is trivially easy, but in the ostree model it's not possible with current file system technology, and even if this is added in one way or another in the future (though I am not aware of anyone doing on-access file-based integrity that spans a whole hierarchy of files and is compatible with ostree's hardlink farm model) I think validation would still be at too high a level, since Linux file system developers have made very clear that their implementations are not robust against rogue images. (There's this stuff planned, but doing structural authentication ahead of time instead of on access makes the idea too weak — and I'd expect too slow — in my eyes.)

    With my design I want to deliver similar security guarantees as ChromeOS does, but ostree is much weaker there, and I see no perspective of this changing. In a way ostree's integrity checks are similar to RPM's, and are enforced on download rather than on access. In the model I suggest above, validation always happens on access, and is thus safe against offline attacks (i.e. evil maid attacks). In today's world, I think offline security is absolutely necessary.

    That said, ostree does have some benefits over the model described above: it naturally shares file system inodes if many of the modules/images involved share the same data. It's thus more space efficient on disk (and thus also in RAM/cache to some degree) by default. In my model it would be up to the image builders to minimize shipping overly redundant disk images, by making good use of suitably composable system extensions.

  2. What about configuration management?

    At first glance immutable systems and configuration management don't go that well together. However, do note that in the model I propose above the root file system with all its contents, including /etc/ and /var/, is actually writable and can be modified like on any other typical Linux distribution. The only exception is /usr/, where the immutable OS is hermetic. That means configuration management tools should work just fine in this model – up to the point where they are used to install additional RPM/dpkg packages, because that's something not allowed in the model above: packages need to be installed at image build time, and thus on the image build host, not the runtime host.

  3. What about non-UEFI and non-TPM2 systems?

    The above is designed around the feature set of contemporary PCs, and this means UEFI and TPM2 being available (simply because the PC is pretty much defined by the Windows platform, and current versions of Windows require both).

    I think it's important to make the best of the features of today's PC hardware, and then find suitable fallbacks on more limited hardware. Specifically this means: if there's a desire to implement something like this on non-UEFI or non-TPM2 hardware we should look for suitable fallbacks for the individual functionality, but generally try to add glue to the old systems so that conceptually they behave more like the new systems, instead of the other way round. Or in other words: most of the above is not strictly tied to UEFI or TPM2, and for many cases there already are reasonable fallbacks in place for more limited systems. Of course, without TPM2 many of the security guarantees will be weakened.

  4. How would you name an OS built that way?

    I think a desktop OS built this way if it has the GNOME desktop should of course be called GnomeBook, to mimic the ChromeBook name. ;-)

    But in general, I'd call hermetic, adaptive, immutable OSes like this "particles".

How can you help?

  1. Help making Distributions Hermetic in /usr/!

    One of the core ideas of the approach described above is to make the OS hermetic in /usr/, i.e. make it carry a comprehensive description of what needs to be set up outside of it when instantiated. Specifically, this means that system users that are needed are declared in systemd-sysusers snippets, and skeleton files and directories are created via systemd-tmpfiles. Moreover additional partitions should be declared via systemd-repart drop-ins.

    At this point some distributions (such as Fedora) are (probably more by accident than on purpose) already mostly hermetic in /usr/, at least for the most basic parts of the OS. However, this is not complete: many daemons require specific resources to be set up in /var/ or /etc/ before they can work, and the relevant packages do not carry systemd-tmpfiles descriptions that add them if missing. So there are two ways you could help here: politically, it would be highly relevant to convince distributions that an OS that is hermetic in /usr/ is highly desirable and a worthy goal for packagers to work towards. More specifically, it would be desirable if RPM/dpkg packages shipped with enough systemd-tmpfiles information so that configuration files the packages strictly need for operation are symlinked (or copied) from /usr/share/factory/ if they are missing (even better of course would be if packages in their upstream sources would just work with an empty /etc/ and /var/, creating what they need themselves and defaulting to sensible behaviour in the absence of configuration files).

    Note that distributions that adopted systemd-sysusers, systemd-tmpfiles and the /usr/ merge are already quite close to providing an OS that is hermetic in /usr/. These were the big, the major advancements: making the image fully hermetic should be less controversial – at least that's my guess.

    Also note that making the OS hermetic in /usr/ is not just useful in scenarios like the above. It also means that stuff like this and like this can work well.

  2. Fill in the gaps!

    I already mentioned a couple of missing bits and pieces in the implementation of the overall vision. In the systemd project we'd be delighted to review/merge any PRs that fill in the voids.

  3. Build your own OS like this!

    Of course, while we built all these building blocks and they have been adopted to various degrees and for various purposes in the various distributions, no one so far has built an OS that puts things together just like that. It would be excellent if we had communities that work on building images like what I propose above, i.e. if you want to work on making a secure GnomeBook, as I suggest above, a reality, that would be more than welcome.

    What could this look like specifically? Pick an existing distribution, write a set of mkosi descriptions plus some additional drop-in files, and then build this on some build infrastructure. While doing so, report the gaps, and help us address them.

Further Documentation of Used Components and Concepts

  1. systemd-tmpfiles
  2. systemd-sysusers
  3. systemd-boot
  4. systemd-stub
  5. systemd-sysext
  6. systemd-portabled, Portable Services Introduction
  7. systemd-repart
  8. systemd-nspawn
  9. systemd-sysupdate
  10. systemd-creds, System and Service Credentials
  11. systemd-homed
  12. Automatic Boot Assessment
  13. Boot Loader Specification
  14. Discoverable Partitions Specification
  15. Safely Building Images

Earlier Blog Stories Related to this Topic

  1. The Strange State of Authenticated Boot and Disk Encryption on Generic Linux Distributions
  2. The Wondrous World of Discoverable GPT Disk Images
  3. Unlocking LUKS2 volumes with TPM2, FIDO2, PKCS#11 Security Hardware on systemd 248
  4. Portable Services with systemd v239
  5. mkosi — A Tool for Generating OS Images

And that's all for now.

April 29, 2022

I've been working on kopper recently, which is a complementary project to zink. Just as zink implements OpenGL in terms of Vulkan, kopper seeks to implement the GL window system bindings - like EGL and GLX - in terms of the Vulkan WSI extensions. There are several benefits to doing this, which I'll get into in a future post, but today's story is really about libX11 and libxcb.

Yes, again.

One important GLX feature is the ability to set the swap interval, which is how you get tear-free rendering by syncing buffer swaps to the vertical retrace. A swap interval of 1 is the typical case, where an image update happens once per frame. The Vulkan way to do this is to set the swapchain present mode to FIFO, since FIFO updates are implicitly synced to vblank. Mesa's WSI code for X11 uses a swapchain management thread for FIFO present modes. This thread is started from inside the vulkan driver, and it only uses libxcb to talk to the X server. But libGL is a libX11 client library, so in this scenario there is always an "xlib thread" as well.

libX11 uses libxcb internally these days, because otherwise there would be no way to intermix xlib and xcb calls in the same process. But it does not use libxcb's reflection of the protocol: XGetGeometry does not call xcb_get_geometry, for example. Instead, libxcb has an API to allow other code to take over the write side of the display socket, with a callback mechanism to get it back when another xcb client issues a request. The callback function libX11 uses here is straightforward: lock the Display, flush out any internally buffered requests, and return the sequence number of the last request written. Both libraries need this sequence number for various reasons internally; xcb for example uses it to make sure replies go back to the thread that issued the request.

But "lock the Display" here really means call into a vtable in the Display struct. That vtable is filled in during XOpenDisplay, but the individual function pointers are only non-NULL if you called XInitThreads beforehand. And if you're libGL, you have no way to enforce that, your public-facing API operates on a Display that was already created.

So now we see the race. The queue management thread calls into libxcb while the main thread is somewhere inside libX11. Since libX11 has taken the socket, the xcb thread runs the release callback. Since the Display was not made thread-safe at XOpenDisplay time, the release callback does not block, so the xlib thread's work won't be correctly accounted for. If you're lucky the two sides will at least write to the socket atomically with respect to each other, but at this point they have diverging opinions about the request sequence numbering, and it's only a matter of time until you crash.

It turns out kopper makes this really easy to hit. Like "resize a glxgears window" easy. However, this isn't just a kopper issue; this race exists for every program that uses xcb on a not-necessarily-thread-safe Display. The only reasonable fix is for libX11 to just always be thread-safe.

So now, it is.


April 26, 2022

I recently blogged about how to run a volatile systemd-nspawn container from your host's /usr/ tree, for quickly testing stuff in your host environment, sharing your home directory, but all that without making a single modification to your host, and on an isolated node.

The one-liner discussed in that blog story is great for testing during system software development. Let's have a look at another systemd tool that I regularly use to test things during systemd development, in a relatively safe environment, but still taking full benefit of my host's setup.

For a while now, systemd has been shipping with a simple component called systemd-sysext. Its primary use case goes something like this: on one hand OS systems with immutable /usr/ hierarchies are fantastic for security, robustness, updating and simplicity, but on the other hand not being able to quickly add stuff to /usr/ is just annoying.

systemd-sysext is supposed to bridge this contradiction: when invoked it will merge a bunch of "system extension" images into /usr/ (and /opt/ as a matter of fact) through the use of read-only overlayfs, making all files shipped in the image instantly and atomically appear in /usr/ during runtime — as if they had always been there. Now, let's say you are building your locked down OS, with an immutable /usr/ tree, and it comes without the ability to log in, without debugging tools, without anything you want and need when trying to debug and fix something in the system. With systemd-sysext you could use a system extension image that contains all this, drop it into the system, and activate it with systemd-sysext so that it genuinely extends the host system.
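
For instance, as a minimal, hypothetical sketch (the image name is made up, and the image is assumed to have been built for this OS version), dropping such a debugging extension in and activating it could look like this:

sudo cp debug-tools.raw /run/extensions/
sudo systemd-sysext merge
systemd-sysext status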

(There are many other use cases for this tool. For example, you could build systems that at their base use a generic image, but that get extended with additional, more specific functionality, or drivers, or similar, by installing one or more system extensions. The tool is generic, use it for whatever you want, but for now let's not get lost in listing all the possibilities.)

What's particularly nice about the tool is that it supports automatically discovered dm-verity images, with signatures and everything. So you can even do this in a fully authenticated, measured, safe way. But I am digressing…

Now that we (hopefully) have a rough understanding of what systemd-sysext is and does, let's discuss how specifically we can use this in the context of system software development, to safely use and test bleeding edge development code — built freshly from your project's build tree — in your host OS without having to risk that the host OS is corrupted or becomes unbootable by stuff that didn't quite yet work the way it was envisioned:

The images systemd-sysext merges into /usr/ can be of two kinds: disk images with a file system/verity/signature, or simple, plain directory trees. To make these images available to the tool, they can be placed or symlinked into /usr/lib/extensions/, /var/lib/extensions/, /run/extensions/ (and a bunch of others). So if we now install our freshly built development software into a subdirectory of those paths, then that's entirely sufficient to make it a valid system extension image in the sense of systemd-sysext, and it can thus be merged into /usr/ to try it out.

To be more specific: when I develop systemd itself, here's what I do regularly, to see how my new development version would behave on my host system. As preparation I checked out the systemd development git tree first of course, hacked around in it a bit, then built it with meson/ninja. And now I want to test what I just built:

sudo DESTDIR=/run/extensions/systemd-test meson install -C build --quiet --no-rebuild &&
        sudo systemd-sysext refresh --force

Explanation: first, we'll install my current build tree as a system extension into /run/extensions/systemd-test/. And then we apply it to the host via the systemd-sysext refresh command. This command will search for all installed system extension images in the aforementioned directories, then unmount (i.e. "unmerge") any previously merged dirs from /usr/ and then freshly mount (i.e. "merge") the new set of system extensions on top of /usr/. And just like that, I have installed my development tree of systemd into the host OS, and all that without actually modifying/replacing even a single file on the host at all. Nothing here actually hit the disk!

Note that all this works on any system really, it is not necessary that the underlying OS is even designed with immutability in mind. Just because the tool was developed with immutable systems in mind it doesn't mean you couldn't use it on traditional systems where /usr/ is mutable as well. In fact, my development box actually runs regular Fedora, i.e. is RPM-based and thus has a mutable /usr/ tree. As long as system extensions are applied the whole of /usr/ becomes read-only though.

Once I am done testing, when I want to revert to how things were without the image installed, it is sufficient to call:

sudo systemd-sysext unmerge

And there you go, all files my development tree generated are gone again, and the host system is as it was before (and /usr/ mutable again, in case one is on a traditional Linux distribution).

Also note that a reboot (regardless if a clean one or an abnormal shutdown) will undo the whole thing automatically, since we installed our build tree into /run/ after all, i.e. a tmpfs instance that is flushed on boot. And given that the overlayfs merge is a runtime thing, too, the whole operation was executed without any persistence. Isn't that great?

(You might wonder why I specified --force on the systemd-sysext refresh line earlier. That's because systemd-sysext actually does some minimal version compatibility checks when applying system extension images. For that it will compare the host's /etc/os-release file with the image's /usr/lib/extension-release.d/extension-release.<name>, and refuse operation if the image is not actually built for the host OS version. Here we don't want to bother with dropping that file in there, we know already that the extension image is compatible with the host, as we just built it on it. --force allows us to skip the version check.)
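
For completeness, here's a minimal sketch of how to make that version check pass instead of using --force, assuming the directory-based extension from the example above: the host's /etc/os-release already carries the ID= and VERSION_ID= fields the check looks for, so we can simply copy it into place.

sudo mkdir -p /run/extensions/systemd-test/usr/lib/extension-release.d
sudo cp /etc/os-release \
        /run/extensions/systemd-test/usr/lib/extension-release.d/extension-release.systemd-test
sudo systemd-sysext refresh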

You might wonder: what about the combination of the idea from the previous blog story (regarding running containers off the host /usr/ tree) with system extensions? Glad you asked. Right now we have no support for this, but it's high on our TODO list (patches welcome, of course!). i.e. a new switch for systemd-nspawn called --system-extension= that would allow merging one or more such extensions into the booted container tree would be stellar. With that, with a single command I could run a container off my host OS but with a development version of systemd dropped in, all without any persistence. How awesome would that be?

(Oh, and in case you wonder, all of this only works with distributions that have completed the /usr/ merge. On legacy distributions that didn't do that and still scatter parts of what belongs in /usr/ all over the hierarchy the above won't work, since merging /usr/ trees via overlayfs is pretty pointless if the OS is not hermetic in /usr/.)

And that's all for now. Happy hacking!

April 24, 2022

The title might be a bit hyperbolic here, but we’re indeed exploring a first step in that direction with radv. The impetus here is the ExecuteIndirect command in Direct3D 12 and some games that are using it in non-trivial ways. (e.g. Halo Infinite)

ExecuteIndirect can be seen as an extension of what we have in Vulkan with vkCmdDrawIndirectCount. It adds extra capabilities. To support that with vkd3d-proton we need the following indirect Vulkan capabilities:

  1. Binding vertex buffers.
  2. Binding index buffers.
  3. Updating push constants.

This functionality happens to be a subset of VK_NV_device_generated_commands and hence I’ve been working on implementing a subset of that extension on radv. Unfortunately, we can’t really give the firmware an “extended indirect draw call” and execute stuff, so we’re stuck generating command buffers on the GPU.

The way the extension works, the application specifies a command “signature” on the CPU, which specifies that for each draw call the application is going to update A, B and C. Then, at runtime, the application provides a buffer with the data for A, B and C for each draw call. The driver then processes that into a command buffer and executes it like a secondary command buffer.

The workflow is then as follows:

  1. The application (or vkd3d-proton) provides the command signature to the driver which creates an object out of it.
  2. The application queries how big a command buffer (“preprocess buffer”) of n draws with that signature would be.
  3. The application allocates the preprocess buffer.
  4. The application does its stuff to generate some commands.
  5. The application calls vkCmdPreprocessGeneratedCommandsNV, which converts the application buffer into a command buffer (in the preprocess buffer).
  6. The application calls vkCmdExecuteGeneratedCommandsNV to execute the generated command buffer.

What goes into a draw in radv

When the application triggers a draw command in Vulkan, the driver generates GPU commands to do the following:

  1. Flush caches if needed
  2. Set some registers.
  3. Trigger the draw.

Of course we skip any of these steps (or parts of them) when they’re redundant. The majority of the complexity is in the register state we have to set. There are multiple parts here:

  1. Fixed function state:

    1. subpass attachments
    2. static/dynamic state (viewports, scissors, etc.)
    3. index buffers
    4. some derived state from the shaders (some tessellation stuff, fragment shader export types, varyings, etc.)
  2. shaders (start address, number of registers, builtins used)
  3. user SGPRs (i.e. registers that are available at the start of a shader invocation)

Overall, most of the pipeline state is fairly easy to emit: we just precompute it on pipeline creation and memcpy it over if we switch shaders. The most difficult part is probably the user SGPRs, because they are derived from a lot of the remaining API state. Note that the list above doesn’t include push constants, descriptor sets or vertex buffers. The driver computes all of these, and generates the user SGPR data from that.

Descriptor sets in radv are just a piece of GPU memory, and radv binds a descriptor set by providing the shader with a pointer to that GPU memory in a user SGPR. Similarly, we have no hardware support for vertex buffers, so radv generates a push descriptor set containing internal texel buffers and then provides a user SGPR with a pointer to that descriptor set.

For push constants, radv has two modes: a portion of the data can be passed in user SGPRs directly, but sometimes a chunk of memory gets allocated and then a pointer to that memory is provided in a user SGPR. This fallback exists because the hardware doesn’t always have enough user SGPRs to fit all the data.

On Vega and later there are 32 user SGPRs, and on earlier GCN GPUs there are 16. This needs to fit pointers to all the referenced descriptor sets (including internal ones like the one for vertex buffers), push constants, builtins like the start vertex and start instance etc. To get the best performance here, radv determines a mapping of API object to user SGPR at shader compile time and then at draw time radv uses that mapping to write user SGPRs.

This results in some interesting behavior, like the fact that switching pipelines causes the driver to update all the user SGPRs, because the mapping might have changed.

Furthermore, as an interesting performance hack radv allocates all upload buffers (for the push constants and push descriptor sets), shaders and descriptor pools in a single 4 GiB region of memory so that we can pass only the bottom 32 bits of all the pointers in a user SGPR, getting us further with the limited number of user SGPRs. We will see later how that makes things difficult for us.

Generating a command buffer on the GPU

As shown above, radv has a bunch of complexity around state for draw calls, and if we start generating command buffers on the GPU that risks copying a significant part of that complexity to a shader. Luckily ExecuteIndirect and VK_NV_device_generated_commands have some limitations that make this easier. The app can only change:

  1. vertex buffers
  2. index buffers
  3. push constants

VK_NV_device_generated_commands also allows changing shaders and the winding order that determines which side of a primitive is the backface, but we’ve chosen to ignore that for now since it isn’t needed for ExecuteIndirect (though the shader switching in particular could be useful for an application).

The second curveball is that the buffer the application provides needs to provide the same set of data for every draw call. This avoids having to do a lot of serial processing to figure out what the previous state was, which allows processing every draw command in a separate shader invocation. Unfortunately we’re still a bit dependent on the old state that is bound before the indirect command buffer execution:

  1. The previously bound index buffer
  2. Previously bound vertex buffers.
  3. Previously bound push constants.

Remember that for vertex buffers and push constants we may put them in a piece of memory. That piece of memory needs to contain all the vertex buffers/push constants for that draw call, so even if we modify only one of them, we have to copy the rest over. The index buffer is different: in the draw packets for the GPU there is a field that is derived from the index buffer size.

So in vkCmdPreprocessGeneratedCommandsNV radv partitions the preprocess buffer into a command buffer and an upload buffer (for the vertex buffers & push constants), both with a fixed stride based on the command signature. Then it launches a shader which processes a draw call in each invocation:

   if (shader used vertex buffers && we change a vertex buffer) {
      copy all vertex buffers 
      update the changed vertex buffers
      emit a new vertex descriptor set pointer
   }
   if (we change a push constant) {
      if (we change a push constant in memory) {
         copy all push constants
         update changed push constants
         emit a new push constant pointer
      }
      emit all changed inline push constants into user SGPRs
   }
   if (we change the index buffer) {
      emit new index buffers
   }
   emit a draw command
   insert NOPs up to the stride

In vkCmdExecuteGeneratedCommandsNV radv uses the internal equivalent of vkCmdExecuteCommands to execute as if the generated command buffer is a secondary command buffer.

Challenges

Of course one does not simply move part of the driver to GPU shaders without any challenges. In fact we have a whole bunch of them. Some of them just need a bunch of work to solve, some need some extension specification tweaking and some are hard to solve without significant tradeoffs.

Code maintainability

A big problem is that the code needed for the limited subset of state that is supported is now in 3 places:

  1. The traditional CPU path
  2. For determining how large the preprocess buffer needs to be
  3. For the shader called in vkCmdPreprocessGeneratedCommandsNV to build the preprocess buffer.

Having the same functionality in multiple places is a recipe for things going out of sync. This makes it harder to change this code and much easier for bugs to sneak in. This can be mitigated with a lot of testing, but testing a bunch of GPU work gets complicated quickly (e.g. a preprocess buffer that is larger than needed still gives correct results, and getting a second opinion from the shader to check the size adds significant complexity).

nir_builder gets old quickly

In the driver at the moment we have no good high-level shader compiler. As a result a lot of the internal helper shaders are written using the nir_builder helpers to generate NIR, the shader compiler's intermediate representation. Example fragment:

   nir_push_loop(b);
   {
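      /* Each loop iteration emits one PKT3_NOP packet of at most 0x3ffc dwords,
         until dst_buf is filled up to cmd_buf_size. */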
      nir_ssa_def *curr_offset = nir_load_var(b, offset);

      nir_push_if(b, nir_ieq(b, curr_offset, cmd_buf_size));
      {
         nir_jump(b, nir_jump_break);
      }
      nir_pop_if(b, NULL);

      nir_ssa_def *packet_size = nir_isub(b, cmd_buf_size, curr_offset);
      packet_size = nir_umin(b, packet_size, nir_imm_int(b, 0x3ffc * 4));

      nir_ssa_def *len = nir_ushr_imm(b, packet_size, 2);
      len = nir_iadd_imm(b, len, -2);
      nir_ssa_def *packet = nir_pkt3(b, PKT3_NOP, len);

      nir_store_ssbo(b, packet, dst_buf, curr_offset, .write_mask = 0x1,
                     .access = ACCESS_NON_READABLE, .align_mul = 4);
      nir_store_var(b, offset, nir_iadd(b, curr_offset, packet_size), 0x1);
   }
   nir_pop_loop(b, NULL);

It is clear that this all gets very verbose very quickly. This is somewhat fine as long as all the internal shaders are tiny. However, between this and raytracing our internal shaders are getting significantly bigger and the verbosity really becomes a problem.

Interesting things to explore here are to use glslang, or even to try writing our shaders in OpenCL C and then compiling it to SPIR-V at build time. The challenge there is that radv is built on a diverse set of platforms (including Windows, Android and desktop Linux) which can make significant dependencies a struggle.

Preprocessing

Ideally your GPU work is very suitable for pipelining to avoid synchronization cost on the GPU. If we generate the command buffer and then execute it we need to have a full GPU sync point in between, which can get very expensive as it waits until the GPU is idle. To avoid this VK_NV_device_generated_commands has added the separate vkCmdPreprocessGeneratedCommandsNV command, so that the application can batch up a bunch of work before incurring the cost of a sync point.

However, in radv we have to do the command buffer generation in vkCmdExecuteGeneratedCommandsNV as our command buffer generation depends on some of the other state that is bound, but might not be bound yet when the application calls vkCmdPreprocessGeneratedCommandsNV.

Which brings up a slight spec problem: The extension specification doesn’t specify whether the application is allowed to execute vkCmdExecuteGeneratedCommandsNV on multiple queues concurrently with the same preprocess buffer. If all the writing of that happens in vkCmdPreprocessGeneratedCommandsNV that would result in correct behavior, but if the writing happens in vkCmdExecuteGeneratedCommandsNV this results in a race condition.

The 32-bit pointers

Remember that radv only passes the bottom 32-bits of some pointers around. As a result the application needs to allocate the preprocess buffer in that 4-GiB range. This in itself is easy: just add a new memory type and require it for this usage. However, the devil is in the details.

For example, what should we do for memory budget queries? That is per memory heap, not memory type. However, a new memory heap does not make sense, as the memory is also still subject to physical availability of VRAM, not only address space.

Furthermore, this 4-GiB region is more constrained than other memory, so it would be a shame if applications started allocating random stuff in it. If we look at the existing usage for a pretty heavy game (HZD) we get about:

  1. 40 MiB of command buffers + upload buffers
  2. 200 MiB of descriptor pools
  3. 400 MiB of shaders

So typically we have a lot of room available. Ideally the ordering of memory types would get an application to prefer another memory type when we do not need this special region. However, memory object caching poses a big risk here: would you choose a memory object in the cache that you can reuse/suballocate (potentially in that limited region), or allocate a new one from a “better” memory type?

Luckily we have not seen that risk play out, but the only real tested user at this point has been vkd3d-proton.

Secondary command buffers

When executing the generated command buffer radv does that the same way as calling a secondary command buffer. This has a significant limitation: A secondary command buffer cannot call a secondary command buffer on the hardware. As a result the current implementation has a problem if vkCmdExecuteGeneratedCommandsNV gets called on a secondary command buffer.

It is possible to work around this. An example would be to split the secondary command buffer into 3 parts: pre, generated, post. However, that needs a bunch of refactoring to allow multiple internal command buffers per API command buffer.

Where to go next

Don’t expect this upstream very quickly. The main reason for exploring this in radv is ExecuteIndirect support for Halo Infinite, and after some recent updates we’re back into GPU hang limbo with radv/vkd3d-proton there. So while we’re solving that I’m holding off on upstreaming in case the hangs are caused by the implementation of this extension.

Furthermore, this is only a partial implementation of the extension anyway, with a fair number of limitations that we’d ideally eliminate before fully exposing this extension.

April 20, 2022

Let Your Memes Be Dreams

With Mesa 22.1 RC1 firmly out the door, most eyes have turned towards Mesa 22.2.

But not all eyes.

No, while most expected me to be rocketing off towards the next shiny feature, one ticket caught my eye:

Mesa 22.1rc1: Zink on Windows doesn’t work even simple wglgears app fails..

Sadly, I don’t support Windows. I don’t have a test machine to run it, and I don’t even have a VM I could spin up to run Lavapipe. I knew that Kopper was going to cause problems with other frontends, but I didn’t know how many other frontends were actually being used.

The answer was not zero, unfortunately. Plenty of users were enjoying the slow, software driver speed of Zink on Windows to spin those gears, and I had just crushed their dreams.

As I had no plans to change anything here, it would take a new hero to set things right.

The Hero We Deserve

Who here loves X-Plane?

I love X-Plane. It’s my favorite flight simulator. If I could, I’d play it all day every day. And do you know who my favorite X-Plane developer is?

Friend of the blog and part-time Zink developer, Sidney Just.

Some of you might know him from his extensive collection of artisanal blog posts. Some might have seen his work enabling Vulkan<->OpenGL interop in Mesa on Windows.

But did you know that Sid’s latest project is much more groundbreaking than just bumping Zink’s supported extension count far beyond the reach of every other driver?

What if I told you that this image

gears.png

is Zink running wglgears on an NVIDIA 2070 GPU on Windows at full speed? No software-copy scanout. Just Kopper.

Full Support: Windows Ultimate Home Professional Edition

Over the past couple days, Sid’s done the esoteric work of hammering out WSI support for Zink on Windows, making us the first hardware-accelerated, GL 4.6-capable Mesa driver to run natively on Windows.

Don’t believe me?

Recognize a little Aztec Ruins action from GFXBench?

aztec.png

The results are about what we’d expect of an app I’ve literally never run myself:

Zink

zink-aztec.png

NVIDIA

nv-aztec.png

Not too bad at all!

In Summary

I think we can safely say that Sid has managed to fix the original bug. Thanks, Sid!

But why is an X-Plane developer working on Zink?

The man himself has this to say on the topic:

X-Plane has traditionally been using OpenGL directly for all of its rendering needs. As a result, for years our plugin SDK has directly exposed the game's OpenGL context to third party plugins, which have used it to render custom avionic screens and GUI elements. When we finally did the switch to Vulkan and Metal in 2020, one of the big issues we faced was how to deal with plugins. Our solution so far has been to rely on native Vulkan/OpenGL driver interop via extensions, which has mostly worked and allowed us to ship with modern backends.

Unfortunately this puts us at the mercy of the driver to provide good interop. Sadly on some platforms, this just isn’t available at all. On others, the drivers are broken, leading to artifacts when mixing Vulkan and GL rendering. To date, our solution has been to just shrug it off and hope for better drivers. X-Plane plugins make use of compatibility profile GL features, as well as core profile features, depending on the author's skill, so libraries like ANGLE were not an option for us.

This is where Zink comes in for us: Being a real GL driver, it has support for all of the features that we need. Being open source also means that any issues that we do discover are much easier to fix ourselves. We’ve made some progress including Zink into the next major version of X-Plane, X-Plane 12, and it’s looking very promising so far. Our hope is to ship X-Plane 12 with Zink as the GL backend for plugins and leave driver interop issues in the past.

The roots of this interest can also be seen in his blog post from last year where he touches on the future of GL plugin support.

Awesome!

Big Triangle’s definitely paying attention now.

And if any of my readers think this work is cool, go buy yourself a copy of X-Plane to say thanks for contributing back to open source.

April 15, 2022

This article is part of a series on how to setup a bare-metal CI system for Linux driver development. Here are the different articles so far:

  • Part 1: The high-level view of the whole CI system, and how to fully control test machines remotely (power on, OS to boot, keyboard/screen emulation using a serial console);
  • Part 2: A comparison of the different ways to generate the rootfs of your test environment, and introducing the boot2container project;
  • Part 3: Analysis of the requirements for the CI gateway, catching regressions before deployment, easy roll-back, and netbooting the CI gateway securely over the internet.

In this article, we will finally focus on generating the rootfs/container image of the CI Gateway in a way that enables live patching the system without always needing to reboot.

This work is sponsored by the Valve Corporation.

Introduction: The impact of updates

System updates are a necessary evil for any internet-facing server, unless you want your system to become part of a botnet. This is especially true for CI systems since they let people on the internet run code on machines, often leading to unfair use such as cryptomining (this one is hard to avoid though)!

The problem with system updates is not the 2 or 3 minutes of downtime that it takes to reboot, it is that we cannot reboot while any CI job is running. Scheduling a reboot thus first requires us to stop accepting new jobs, wait for the current ones to finish, then finally reboot. This solution may be acceptable if your jobs take ~30 minutes, but what if they last 6h? A reboot suddenly gets close to a typical 8h work day, and we definitely want to have someone looking over the reboot sequence so they can revert to a previous boot configuration if the new one failed.

This problem may be addressed in a cloud environment by live-migrating services/containers/VMs from a non-updated host to an updated one. This is unfortunately a lot more complex to pull off for a bare-metal CI without having a second CI gateway and designing synchronization systems/hardware to arbitrate access to the test machines' power/serial consoles/boot configuration.

So, while we cannot always avoid the need to drain the CI jobs before rebooting, what we can do is reduce the cases in which we need to perform this action. Unfortunately, containers have been designed with atomic updates in mind (this is why we want to use them), but that means that trivial operations such as adding an ssh key, a WireGuard peer, or updating a firewall rule will require a reboot. A hacky solution may be for the admins to update the infra container, then log into the different CI gateways and manually reproduce the changes they have made in the new container. These changes would be lost at the next reboot, but this is not a problem since the CI gateway would use the latest container when rebooting, which already contains the updates. While possible, this solution is error-prone and not testable ahead of time, which is against the requirements for the gateway we laid out in Part 3.

Live patching containers

An improvement to live-updating containers by hand would be to use tools such as Ansible, Salt, or even Puppet to manage and deploy non-critical services and configuration. This would enable live-updating the currently-running container but would need to be run after every reboot. An Ansible playbook may be run locally, so it is not inconceivable for a service to be run at boot that would download the latest playbook and run it. This solution is however forcing developers/admins to decide which services need to have their configuration baked in the container and which services should be deployed using a tool like Ansible... unless...

We could use a tool like Ansible to describe all the packages and services to install, along with their configuration. Creating a container would then be achieved by running the Ansible playbook on a base container image. Assuming that the playbook is truly idempotent (running it multiple times leads to the same final state), this would mean that there would be no differences between the live-patched container and the new container we created. In other words, we simply morph the currently-running container into the wanted configuration by running the same Ansible playbook we used to create the container, but against the live CI gateway! This will not always remove the need to reboot the CI gateways from time to time (updating the kernel, or services which don't support live-updates without affecting CI jobs), but all the smaller changes can get applied in-situ!
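
As a rough sketch (the playbook name is made up here), the live-patching step could be as simple as re-running the same playbook locally on the gateway, using Ansible's local connection mode:

$ ansible-playbook --connection=local --inventory localhost, gateway.yml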

The base container image has to contain the basic dependencies of the tool like Ansible, but if it were made to contain all the OS packages, it would split the final image into three container layers: the base OS container, the packages needed, and the configuration. Updating the configuration would thus result in only a few megabytes of update to download at the next reboot rather than the full OS image, thus reducing the reboot time.

Limits to live-patching containers

Ansible is perfectly suited to morph a container into its newest version, provided that all the resources used remain static between when the new container was created and when the currently-running container gets live-patched. This is because of Ansible's core principle of idempotency of operations: rather than running commands blindly like in a shell script, it first checks what the current state is and then, if needed, updates the state to match the desired target. This makes it safe to run the playbook multiple times, but it will also allow us to only restart services if their configuration or one of their dependencies changed.

When version pinning of packages is possible (Python, Ruby, Rust, Golang, ...), Ansible can guarantee the idempotency that makes live-patching safe. Unfortunately, package managers of Linux distributions are usually not idempotent: they were designed to ship updates, not pin software versions! In practice, this means that there are no guarantees that the package installed during live-patching will be the same as the one installed in the new base container, thus exposing oneself to potential differences in behaviour between the two deployment methods... The only way out of this issue is to create your own package repository and make sure its content will not change between the creation of the new container and the live-patching of all the CI gateways. Failing that, all I can advise you to do is pick a stable distribution which will try its best to limit functional changes between updates within the same distribution version (Alpine Linux, CentOS, Debian, ...).

In the end, Ansible won't always be able to make live-updating your container strictly equivalent to rebooting into its latest version, but as long as you are aware of its limitations (or work around them), it will make updating your CI gateways way less trouble than it would be otherwise! You will need to find the right balance between live-updatability and ease of maintenance of your gateway's code base.

Putting it all together: The example of valve-infra-container

At this point, you may be wondering how all of this looks in practice! Here is the example of the CI gateways we have been developing for Valve:

  • Ansible playbook: You will find here the entire configuration of our CI gateways. NOTE: we are still working on live-patching!;
  • Valve-infra-base-container: The buildah script used to generate the base container;
  • Valve-infra-container: The buildah script used to generate the final container by running the Ansible playbook.

And if you are wondering how we can go from these scripts to working containers, here is how:

$ podman run --rm -d -p 8088:5000 --name registry docker.io/library/registry:2
$ env \
    IMAGE_NAME=localhost:8088/valve-infra-base-container \
    BASE_IMAGE=archlinux \
    buildah unshare -- .gitlab-ci/valve-infra-base-container-build.sh
$ env \
    IMAGE_NAME=localhost:8088/valve-infra-container \
    BASE_IMAGE=valve-infra-base-container \
    ANSIBLE_EXTRA_ARGS='--extra-vars service_mgr_override=inside_container -e development=true' \
    buildah unshare -- .gitlab-ci/valve-infra-container-build.sh

And if you were willing to use our Makefile, it gets even easier:

$ make valve-infra-base-container BASE_IMAGE=archlinux IMAGE_NAME=localhost:8088/valve-infra-base-container
$ make valve-infra-container BASE_IMAGE=localhost:8088/valve-infra-base-container IMAGE_NAME=localhost:8088/valve-infra-container

Not too bad, right?

PS: These scripts are constantly being updated, so make sure to check out their current version!

Conclusion

In this post, we highlighted the difficulty of keeping the CI Gateways up to date when CI jobs can take multiple hours to complete, preventing new jobs from starting until the current queue is emptied and the gateway has rebooted.

We have then shown that despite looking like competing solutions to deploy services in production, containers and tools like Ansible can actually work well together to reduce the need for reboots by morphing the currently-running container into the updated one. There are however some limits to this solution which are important to keep in mind when designing the system.

In the next post, we will be designing the executor service which is responsible for time-sharing the test machines between different CI/manual jobs. We will thus be talking about deploying test environments, BOOTP, and serial consoles!

That's all for now, thanks for making it to the end!

April 14, 2022

Hi! This month I’ve continued working on Goguma, my IRC client for Android. I’ve released version 0.2.0 earlier today. Tons of new features and bug fixes have been shipped! delthas has added a new old-school compact mode for the message list, has implemented pinned and muted conversations, and has added a new UI to manage IRC networks configured on the bouncer. Noah Loomans has come up with a cool new look for /me messages. I’ve redesigned the UI to create a new conversation, and added knobs to edit the user profile and channel topics.

(Screenshots: settings, compact message list, new conversation, /me message)

We now have our own F-Droid repository with nightly builds, so trying out Goguma should be easier than manually grabbing APK files. Goguma is also available on the official F-Droid repository, although the version published there lags behind.

On the soju bouncer side, delthas has implemented a new soju.im/search extension. This allows clients to search the server-side message history without having to download the full logs. I’ve added a tiny soju.im/no-implicit-names extension which allows clients to opt-out of some chatter sent when connecting. This is useful for mobile devices where re-connections are frequent, data plans are limited and latency is high. Last, delthas has added support for the echo-message extension for upstream servers in order to properly display messages mutated by the server (e.g. to strip formatting).

In SourceHut news, the initial release of hut has been published! Some distributions have already started shipping it in their official repositories. hut is now integrated in builds.sr.ht itself: hut will automatically pick up the OAuth 2.0 token generated by the oauth directive. For example:

image: alpine/edge
packages:
- hut
oauth: meta.sr.ht/PROFILE:RO
tasks:
- hi: |
        hut meta show

I’ve also resumed work on go-emailthreads again. It’s the successor to python-emailthreads which currently powers lists.sr.ht’s patch review UI. I’ve finished up the basics and integrated it into lists.sr.ht’s GraphQL API. The next step is to update the Python frontend to use data from the GraphQL API instead of using python-emailthreads.

The NPotM is libdisplay-info. This is a shared project with other Wayland developers (Pekka Paalanen, Sebastian Wick) and driver developers (AMD, Intel). The goal is to build a small library for EDID and DisplayID, two standards for display device metadata. Right now compositors don’t need to parse a lot of the EDID, but that’s going to change with upcoming color management support. Some interesting discussions have been sparked, but progress is slow. I guess that’s expected when collaboratively building a new library from scratch: it takes time to make everybody agree.

That’s all for now! See you next month.

April 12, 2022

Another Quarter Down

As everyone who’s anyone knows, the next Mesa release branchpoint is coming up tomorrow. Like usual, here’s the rundown on what to expect from zink in this release:

  • zero performance improvements (that I’m aware of)
  • Kopper has landed: Vulkan WSI is now used and NVIDIA drivers can finally run at full speed
  • lots of bugs fixed
  • seriously so many bugs
  • I’m not even joking
  • literally this whole quarter was just fixing bugs

So if you find a zink problem in the 22.1 release of Mesa, it’s definitely because of Kopper and not actually anything zink-related.

Piping

But also this is sort-of-almost-maybe a lavapipe blog, and that driver has had a much more exciting quarter. Here’s a rundown.

New Extensions:

  • VK_EXT_debug_utils
  • VK_EXT_depth_clip_control
  • VK_EXT_graphics_pipeline_library
  • VK_EXT_image_2d_view_of_3d
  • VK_EXT_image_robustness
  • VK_EXT_inline_uniform_block
  • VK_EXT_pipeline_creation_cache_control
  • VK_EXT_pipeline_creation_feedback
  • VK_EXT_primitives_generated_query
  • VK_EXT_shader_demote_to_helper_invocation
  • VK_EXT_subgroup_size_control
  • VK_EXT_texel_buffer_alignment
  • VK_KHR_format_feature_flags2
  • VK_KHR_memory_model
  • VK_KHR_pipeline_library
  • VK_KHR_shader_integer_dot_product
  • VK_KHR_shader_terminate_invocation
  • VK_KHR_swapchain_mutable_format
  • VK_KHR_synchronization2
  • VK_KHR_zero_initialize_workgroup_memory

Vulkan 1.3 is now supported. We’ve landed a number of big optimizations as well, leading to massively improved CI performance.

Lavapipe: the cutting-edge software implementation of Vulkan.

…as long as you don’t need descriptor indexing.

April 07, 2022

Since Kopper got merged today upstream I wanted to write a little about it as I think the value it brings can be unclear for the uninitiated.

Adam Jackson in our graphics team has been working for the last few months, together with other community members like Mike Blumenkrantz, on implementing Kopper. For those unaware, Zink is an OpenGL implementation running on top of Vulkan, and Kopper is the layer that allows you to translate OpenGL and GLX window handling to Vulkan WSI handling. This means that you can get full OpenGL support even if your GPU only has a Vulkan driver available, and it also means you can, for instance, run GNOME on top of this stack thanks to the addition of Kopper to Zink.

During the lifecycle of the soon to be released Fedora Workstation 36 we expect to let you turn on doing OpenGL through Kopper and Zink as an experimental feature once we update Fedora 36 to Mesa 22.1.

So you might ask why would I care about this as an end user? Well initially you probably will not care much, but over time it is likely that GPU makers will eventually stop developing native OpenGL drivers and just focus on their Vulkan drivers. At that point Zink and Kopper provide you with a backwards compatibility solution for your OpenGL applications. And for Linux distributions it will also, at some point, help significantly reduce the amount of code we need to ship and maintain, as we can just rely on Zink and Kopper everywhere, which of course reduces the workload for maintainers.

This is not going to be an overnight transition though; Zink and Kopper will need some time to stabilize and further improve performance. At the moment performance is generally a bit slower than the native drivers, although we have seen some examples of games which actually got better performance with specific driver combinations, and over time we expect to see the negative performance delta shrink. The delta is unlikely to ever fully go away due to the cost of translating between the two APIs, but on the other hand we are going to be in a situation in a few years where all current/new applications use Vulkan natively (or through Proton) and thus the stuff that relies on OpenGL will be older software, so combined with faster GPUs you should still get more than good enough performance. And at that point Zink will be a lifesaver for your old OpenGL based applications and games.

April 06, 2022

Just In Time

By the time you read this, Kopper will have landed. This means a number of things have changed:

  • Zink now uses Vulkan WSI and has actual swapchains
  • Combinations of clunky Mesa environment variables are no longer needed; MESA_LOADER_DRIVER_OVERRIDE=zink will work for all drivers (see the example after this list)
  • Some things that didn’t used to work now work
  • Some things that used to work now don’t
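
For instance, assuming a Mesa build with zink enabled, pointing any GL app at it is now just a matter of:

MESA_LOADER_DRIVER_OVERRIDE=zink glxgears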

In particular, lots of cases of garbled/flickering rendering (I’m looking at you, Supertuxkart on ANV) will now be perfectly smooth and without issue.

Also there’s no swapinterval control yet, so X11 clients will have no choice but to churn out the maximum amount of FPS possible at all times.

You (probably?) aren’t going to be able to run a compositor on zink just yet, but it’s on the 22.1 TODO list.

Big thanks to Adam Jackson for carrying this project on his back.

April 05, 2022

Apparently, in some parts of this world, the /usr/-merge transition is still ongoing. Let's take the opportunity to have a look at one specific way to take benefit of the /usr/-merge (and associated work) IRL.

I develop system-level software as you might know. Oftentimes I want to run my development code on my PC but be reasonably sure it cannot destroy or otherwise negatively affect my host system. Now I could set up a container tree for that, and boot into that. But often I am too lazy for that; I don't want to bother with a slow package manager setting up a new OS tree for me. So here's what I often do instead — and this only works because of the /usr/-merge.

I run a command like the following (without any preparatory work):

systemd-nspawn \
        --directory=/ \
        --volatile=yes \
        -U \
        --set-credential=passwd.hashed-password.root:$(mkpasswd mysecret) \
        --set-credential=firstboot.locale:C.UTF-8 \
        --bind-user=lennart \
        -b

And then I very quickly get a login prompt on a container that runs the exact same software as my host — but is also isolated from the host. I do not need to prepare any separate OS tree or anything else. It just works. And my host user lennart is just there, ready for me to log in as.

So here's what these systemd-nspawn options specifically do:

  • --directory=/ tells systemd-nspawn to run off the host OS' file hierarchy. That smells like danger of course, running two OS instances off the same directory hierarchy. But don't be scared, because:

  • --volatile=yes enables volatile mode. Specifically this means what we configured with --directory=/ as root file system is slightly rearranged. Instead of mounting that tree as it is, we'll mount a tmpfs instance as actual root file system, and then mount the /usr/ subdirectory of the specified hierarchy into the /usr/ subdirectory of the container file hierarchy in read-only fashion – and only that directory. So now we have a container directory tree that is basically empty, but imports all host OS binaries and libraries into its /usr/ tree. All software installed on the host is also available in the container with no manual work. This mechanism only works because on /usr/-merged OSes vendor resources are monopolized at a single place: /usr/. It's sufficient to share that one directory with the container to get a second instance of the host OS running. Note that this means /etc/ and /var/ will be entirely empty initially when this second system boots up. Thankfully, forward looking distributions (such as Fedora) have adopted systemd-tmpfiles and systemd-sysusers quite pervasively, so that system users and files/directories required for operation are created automatically should they be missing. Thus, even though at boot the mentioned directories are initially empty, once the system is booted up they are sufficiently populated for things to just work.

  • -U means we'll enable user namespacing, in fully automatic mode. This does three things: it picks a free host UID range dynamically for the container, then sets up user namespacing for the container processes, mapping the host UID range to UIDs 0…65534 in the container. It then sets up a similar UID mapped mount on the /usr/ tree of the container. Net effect: file ownerships as set on the host OS tree appear as if they belonged to the very same users inside of the container environment, except that we use user namespacing for everything, and thus the users are actually neatly isolated from the host.

  • --set-credential=passwd.hashed-password.root:$(mkpasswd mysecret) passes a credential to the container. Credentials are bits of data that you can pass to systemd services and whole systems. They are actually awesome concepts (e.g. they support TPM2 authentication/encryption that just works!) but I am not going to go into details around that, given it's off-topic in this specific scenario. Here we just take benefit of the fact that systemd-sysusers looks for a credential called passwd.hashed-password.root to initialize the root password of the system from. We set it to mysecret. This means once the system is booted up we can log in as root with the supplied password. Yay. (Remember, /etc/ is initially empty on this container, and thus also carries no /etc/passwd or /etc/shadow, and thus has no root user record, and thus no root password.)

    mkpasswd is a tool that converts a plain text password into a UNIX hashed password, which is what this specific credential expects.

  • Similarly, --set-credential=firstboot.locale:C.UTF-8 tells the systemd-firstboot service in the container to initialize /etc/locale.conf with this locale.

  • --bind-user=lennart binds the host user lennart into the container, also as user lennart. This does two things: it mounts the host user's home directory into the container, and it copies a minimal user record of the specified user into the container that nss-systemd then picks up and includes in the regular user database. This means, once the container is booted up I can log in as lennart with my regular password, and once I have logged in I will see my regular host home directory, and can make changes to it. Yippieh! (This does a couple more things, such as UID mapping, but let's not get lost in too much detail.)

So, if I run this, I will very quickly get a login prompt, where I can log into as my regular user. I have full access to my host home directory, but otherwise everything is nicely isolated from the host, and changes outside of the home directory are either prohibited or are volatile, i.e. go to a tmpfs instance whose lifetime is bound to the container's lifetime: when I shut down the container I just started, then any changes outside of my user's home directory are lost.

Note that while here I use --volatile=yes in combination with --directory=/ you can actually use it on any OS hierarchy, i.e. just about any directory that contains OS binaries.

Similarly, the --bind-user= stuff works with any OS hierarchy too (but do note that only systemd 249 and newer will pick up the user records passed to the container that way, i.e. this requires at least v249 both on the host and in the container to work).

Or in short: the possibilities are endless!

Requirements

For this all to work, you need:

  1. A recent kernel (5.15 should suffice, as it brings UID mapped mounts for the most common file systems, so that -U and --bind-user= can work well.)

  2. A recent systemd (249 should suffice, which brings --bind-user=, and a -U switch backed by UID mapped mounts).

  3. A distribution that adopted the /usr/-merge, systemd-tmpfiles and systemd-sysusers so that the directory hierarchy and user databases are automatically populated when empty at boot. (Fedora 35 should suffice.)

Limitations

While a lot of today's software actually works well out of the box on systems that come up with an unpopulated /etc/ and /var/, and either falls back to reasonable built-in defaults or deploys systemd-tmpfiles to create what is missing, things aren't perfect: some software typically installed on desktop OSes will fail to start when invoked in such a container, and be visible as ugly failed services, but it won't stop me from logging in and using the system for what I want to use it for. It would be excellent to get that fixed, though. This can either be fixed in the relevant software upstream (i.e. if opening your configuration file fails with ENOENT, then just default to reasonable defaults), or in the distribution packaging (i.e. add a tmpfiles.d/ file that copies or symlinks in skeleton configuration from /usr/share/factory/etc/ via the C or L line types).
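
As a sketch of that packaging-side fix (the file name and configuration file here are hypothetical), a package could ship a tmpfiles.d snippet like the following, so that systemd-tmpfiles copies the factory default into /etc/ at boot whenever it is missing:

cat > /usr/lib/tmpfiles.d/example.conf <<'EOF'
C /etc/example.conf - - - - /usr/share/factory/etc/example.conf
EOF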

And then there's certain software dealing with hardware management and similar that simply cannot reasonably work in a container (as device APIs on Linux are generally not virtualized for containers). It would be excellent if software like that would be updated to carry ConditionVirtualization=!container or ConditionPathIsReadWrite=/sys conditionalization in their unit files, so that it is automatically – cleanly – skipped when executed in such a container environment.
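
Until upstreams do that, one can also add the condition locally via a drop-in; here's a minimal sketch with a made-up unit name:

sudo mkdir -p /etc/systemd/system/example-hw.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/example-hw.service.d/container.conf
[Unit]
ConditionVirtualization=!container
EOF
sudo systemctl daemon-reload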

And that's all for now.