Lately I have been exposing a bit more functionality in V3DV and was wondering how far we are from Vulkan 1.2. Turns out that a lot of the new Vulkan 1.2 features are actually optional and what we have right now (missing a few trivial patches to expose a few things) seems to be sufficient for a minimal implementation.
We actually did a test run with CTS enabling Vulkan 1.2 to verify this and it went surprisingly well, with just a few test failures that I am currently looking into, so I think we should be able to submit conformance soon.
For those who may be interested, here is a list of what we are not supporting (all of these are optional features in Vulkan 1.2):
VK_EXT_descriptor_indexing
I think we should be able to support this in the future.
VK_KHR_shader_float16_int8
This we can support in theory, since the hardware has support for half-float, however, the way this is designed in hardware comes with significant caveats that I think would make it really difficult to take advantage of it in practice. It would also require significant work, so it is not something we are planning at present.
VK_KHR_buffer_device_address
We can’t implement this without hacks because the Vulkan spec explicitly defines these addresses to be 64-bit values, while the V3D GPU only deals with 32-bit addresses and is not capable of any kind of native 64-bit operation. At first I thought we could just lower these to 32-bit (since we know they will be 32-bit), but because the spec makes these explicit 64-bit values, it allows shaders to cast a device address from/to uvec2, which generates 64-bit bitcast instructions that require both the destination and source to be 64-bit values.
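To illustrate the problematic pattern, here is a minimal GLSL sketch (hypothetical example code, not anything from the driver or the CTS) where constructing a buffer reference from a uvec2 is exactly the kind of 64-bit bitcast described above:
#version 450
#extension GL_EXT_buffer_reference : require
#extension GL_EXT_buffer_reference_uvec2 : require
layout(local_size_x = 1) in;
// A buffer reference type; its value is a 64-bit device address.
layout(buffer_reference, std430) buffer DataRef { uint values[]; };
// Address passed in as a uvec2, as the spec explicitly allows.
layout(push_constant) uniform PC { uvec2 addr; } pc;
void main()
{
    // Constructing the reference from a uvec2 becomes a 64-bit bitcast in the
    // resulting IR, which V3D cannot do natively.
    DataRef data = DataRef(pc.addr);
    data.values[gl_GlobalInvocationID.x] = 1u;
}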
VK_EXT_sampler_filter_minmax
VK_KHR_draw_indirect_count
VK_EXT_scalar_block_layout
VK_EXT_shader_viewport_index_layer
VK_KHR_shader_atomic_int64
These lack required hardware support, so we don’t expect to implement them.
Previously, I gave you an introduction to mesh/task shaders and wrote up some details about how mesh shaders are implemented in the driver. But I left out the important details of how task shaders (aka. amplification shaders) work in the driver. In this post, I aim to give you some details about how task shaders work under the hood. Like before, this is based on my experience implementing task shaders in RADV and all details are already public information.
The task shader (aka amplification shader in D3D12) is a new stage that runs in workgroups, similar to compute shaders. Each task shader workgroup has two jobs: determine how many mesh shader workgroups it should launch (the dispatch size), and optionally create a “payload” (up to 16K of data of your choice) which is passed to the mesh shaders.
Additionally, the API allows task shaders to perform atomic operations on the payload variables.
Typical use of task shaders can be: cluster culling, LOD selection, geometry amplification.
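As a concrete illustration of those two jobs, here is a minimal task shader sketch in the NV_mesh_shader GLSL syntax (the buffer, payload members and the visibility test are made up for this example); it culls meshlets and launches one mesh workgroup per surviving meshlet:
#version 450
#extension GL_NV_mesh_shader : require
layout(local_size_x = 32) in;
// Task payload passed to every mesh shader workgroup we launch (16K max).
taskNV out Payload {
    uint meshletIndices[32];
} OUT;
// Hypothetical per-meshlet bounding data used for culling.
layout(std430, set = 0, binding = 0) readonly buffer MeshletBounds {
    vec4 boundingSpheres[];
};
shared uint sharedCount;
void main()
{
    if (gl_LocalInvocationID.x == 0)
        sharedCount = 0;
    barrier();
    uint meshletID = gl_GlobalInvocationID.x;
    // Placeholder visibility test; a real shader would do frustum/occlusion culling.
    bool visible = boundingSpheres[meshletID].w > 0.0;
    if (visible) {
        uint slot = atomicAdd(sharedCount, 1u);
        OUT.meshletIndices[slot] = meshletID;
    }
    barrier();
    // Dispatch size: launch one mesh workgroup per surviving meshlet.
    if (gl_LocalInvocationID.x == 0)
        gl_TaskCountNV = sharedCount;
}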
Before we get into any HW specific details, there are a few things we should unpack first. Based on the API programming model, let’s think about some expectations for a good driver implementation.
Storing the output task payload. There must exist some kind of buffer where the task payload is stored, and the size of this buffer will obviously be a limiting factor on how many task shader workgroups can run in parallel. Therefore, the implementation must ensure that only as many task workgroups run as there is space in this buffer. Preferably this would be a ring buffer whose entries get reused between different task shader workgroups.
Analogy with the tessellator. The above requirements are pretty similar to what tessellation can already do. So a natural conclusion might be that we could implement task shaders by abusing the tessellator. However, this would introduce a potential bottleneck on fixed-function hardware, which we would prefer to avoid.
Analogy with a compute pre-pass. Another similar thing that comes to mind is a compute pre-pass. Many games already do something like this: some pre-processing in a compute dispatch that is executed before a draw call. Of course, the application has to insert a barrier between the dispatch and the draw, which means the draw can’t start before every invocation in the dispatch is finished. In reality, not every graphics shader invocation depends on the results of all compute invocations, but there is no way to express a more fine-grained dependency. For task shaders, it is preferable to avoid this barrier and allow task and mesh shader invocations to overlap.
What I discuss here is based on information that is already publicly available in open source drivers. If you are already familiar with how AMD’s own PAL-based drivers work, you won’t find any surprises here.
First things first. Under the hood, task shaders are compiled to a plain old compute shader. The task payload is located in VRAM. The shader code that stores the mesh dispatch size and payload is compiled to memory writes which store these in VRAM ring buffers. Even though they are compute shaders as far as the AMD HW is concerned, task shaders do not work like a compute pre-pass. Instead, task shaders are dispatched on an async compute queue while at the same time the mesh shader work is executed on the graphics queue in parallel.
The task+mesh dispatch packets are different from a regular compute dispatch. The compute and graphics queue firmwares work together in parallel:
You can find out the exact concrete details in the PAL source code, or RADV merge requests.
Side note, getting some implementation details wrong can easily cause a deadlock on the GPU. It is great fun to debug these.
The relevant details here are that most of the hard work is implemented in the firmware (good news, because that means I don’t have to implement it), that task shaders are executed on an async compute queue, and that the driver now has to submit compute and graphics work in parallel.
Keep in mind that the API hides this detail and pretends that the mesh shading pipeline is just another graphics pipeline that the application can submit to a graphics queue. So, once again we have a mismatch between the API programming model and what the HW actually does.
In order to use this beautiful scheme provided by the firmware, the driver needs to do two things:
We already had good support for compute pipelines in RADV (as much as the API needs), but internally in the driver we’ve never had this kind of close cooperation between graphics and compute.
When you use a draw call in a command buffer with a pipeline that has a task shader, RADV must create a hidden, internal compute command buffer. This internal compute command buffer contains the task shader dispatch packet, while the graphics command buffer contains the packet that dispatches the mesh shaders. We must also ensure correct synchronization between these two command buffers according to application barriers ― because of the API mismatch it must work as if the internal compute cmdbuf were part of the graphics cmdbuf. We also need to emit the same descriptors, push constants, etc. When the application submits to the graphics queue, this new internal compute command buffer is then submitted to the async compute queue.
Thus far, this sounds pretty logical and easy.
The actual hard work is to make it possible for the driver to submit work to different queues at the same time. RADV’s queue code was written assuming that there is a 1:1 mapping between radv_queue objects and HW queues. To make task shaders work we must now break this assumption.
So, of course I had to do some crazy refactor to enable this. At the time of writing the AMDGPU Linux kernel driver doesn’t support “gang submit” yet, so I use scheduled dependencies instead. This has the drawback of submitting to the two queues sequentially rather than doing everything in the same submit.
Let’s turn the above wall of text into some performance considerations that you can actually use when you write your next mesh shading application.
NVIDIA also has some perf recommendations here, most of which apply to other HW as well, except for the recommended number of vertices and primitives per meshlet, because the sweet spot for that can differ between GPU architectures.
It has been officially confirmed that a Vulkan cross-vendor mesh shading extension is coming soon.
While I can’t give you any details about the new extension, I think it won’t be a surprise to anyone that it may have been the motivation for my work on mesh and task shaders.
Once the new extension goes public, I will post some thoughts about it and a comparison to the vendor-specific NV_mesh_shader extension.
Two posts in one month is a record for May of 2022. Might even shoot for three at this rate.
Yesterday I posted a hasty roundup of what’s been going on with zink.
It was not comprehensive.
What else have I been working on, you might ask.
We all know I’m champing at the bit to further my goal of world domination with zink on all platforms, so it was only a matter of time before I branched out from the big two-point-five of desktop GPU vendors (NVIDIA, AMD, and also maybe Intel sometimes).
Thus it was that a glorious bounty descended from on high at Valve: a shiny new A630 Coachz Chromebook that runs on the open source Qualcomm driver, Turnip.
How did the initial testing go?
My initial combined, all-the-tests-at-once CTS run (KHR46, 4.6 confidential, all dEQP) yielded over 1500 crashes and over 3000 failures.
Brutal.
I then accidentally bricked my kernel by foolishly allowing my distro to upgrade it for me, at which point I defenestrated the device.
There were many problems, but two were glaring:
It’s tough to say which issue is a bigger problem. As zinkologists know, the preferred minimum number of descriptor sets is six:
Thus, almost all the crashes I encountered were from the tests which attempted to use shader images, as they access an out-of-bounds set.
On the other hand, while there was an amount of time in which I had to use zink without 64-bit support on my Intel machine before emulation was added at the driver level, Intel’s driver has always been helpfully tolerant of my continuing to jam 64-bit i/o into shaders without adverse effects since the backend compiler supported such operations. This was not the case with Turnip, and all I got for my laziness was crashing and failing.
Both problems were varying degrees of excruciating to fix, so I chose the one I thought would be worse to start off: descriptors.
We all know zink has various modes for its descriptor management:
The default is currently the caching mode, which calculates hash values for all the descriptors being used for a draw/dispatch and then stores the sets for reuse. My plan was to collapse the like sets: merging UBOs and SSBOs into one set as well as Samplers and Images into another set. This would use a total of four sets including the bindless one, opening up all the features.
The project actually turniped out to be easier than expected. All I needed to do was add indirection to the existing set indexing, then modify the indirect indices on startup based on how many sets were available. In this way, accessing e.g., descriptor_set[SSBO] would actually access descriptor_set[UBO] in the new compact mode, and everything else would work the same.
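Roughly, the indirection looks something like this (a sketch of the idea only, not the actual zink code; names are illustrative):
/* Set indices are looked up through a remap table that gets collapsed
 * at startup when only four descriptor sets are available. */
enum descriptor_type {
   TYPE_UBO,
   TYPE_SAMPLER,
   TYPE_SSBO,
   TYPE_IMAGE,
   NUM_TYPES,
};
static unsigned set_remap[NUM_TYPES];
static void
init_set_remap(bool compact)
{
   for (unsigned i = 0; i < NUM_TYPES; i++)
      set_remap[i] = i;
   if (compact) {
      /* merge like sets: SSBOs share the UBO set, images share the sampler set */
      set_remap[TYPE_SSBO] = TYPE_UBO;
      set_remap[TYPE_IMAGE] = TYPE_SAMPLER;
   }
}
/* every place that used descriptor_set[type] now uses descriptor_set[set_remap[type]] */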
It was then that I remembered the tricky part of doing anything descriptor-related: updating the caching mechanism.
In time, I finally managed to settle on merging the hash values from the merged sets, performed just as carefully as the fusion dance, and things seemed to be working, bringing results down to just over 1000 total CTS fails.
The harder problem was actually the easier one.
Which means I’m gonna need another post.
I’ve been meaning to blog for a while.
As per usual.
Finally though, I’m doing it. It’s been a month. Roundup time.
Mesa 22.1 is out, and it has the best zink release of all time. You’ve got features like:
Really not sure I’m getting across how momentous kopper is, but it’s probably the biggest thing to happen to zink since… I don’t know. Threaded context?
B I G.
There’s a lot of bugs in the world. As of now, there’s a lot fewer bugs in zink than there used to be.
How many fewer, you might ask?
Let’s take a gander.
In short, conformance submission pending.
This roughly marks 2 years since I started working on zink full-time.
Hooray.
A quick follow-up to the previous post: there was an issue in the initial WGL handling which prevented zink from loading as expected. This is now fixed.
We haven’t posted updates to the work done on the V3DV driver since we announced the driver becoming Vulkan 1.1 Conformant.
But after reaching that milestone, we’ve been very busy working on more improvements, so let’s summarize the work done since then.
As mentioned in past posts, for the Vulkan driver we tried to focus as much as possible on the userspace part. So we tried to re-use the already existing kernel interface that we had for V3D, used by the OpenGL driver, without modifying/extending it.
This worked fine in general, except for synchronization. The V3D kernel interface only supported one synchronization object per submission. This didn’t properly map with Vulkan synchronization, which is more detailed and complex, and allowed defining several semaphores/fences. We initially handled the situation with workarounds, and left some optional features as unsupported.
After our 1.1 conformance work, our colleague Melissa Wen started to work on adding support for multiple semaphores on the V3D kernel side. Then she also implemented the changes on V3DV to use this new feature. If you want more technical info, she wrote a very detailed explanation on her blog (part1 and part2).
For now the driver has two codepaths that are used depending on whether the kernel supports this new feature or not. That also means that, depending on the kernel, the V3DV driver could expose a slightly different set of supported features.
For a while, Mesa developers have been doing a great effort to refactor and move common functionality to a single place, so it can be used by all drivers, reducing the amount of code each driver needs to maintain.
During these months we have been porting V3DV to some of that infrastructure, from small bits (common VkShaderModule to NIR code), to a really big one: common synchronization framework.
As mentioned, the Vulkan synchronization model is really detailed and powerful. But that also means it is complex. V3DV support for Vulkan synchronization included heavy use of threads. For example, V3DV needed to rely on a CPU wait (polling with threads) to implement vkCmdWaitEvents, as the GPU lacked a mechanism for this.
This was common to several drivers. So at some point there were multiple versions of complex synchronization code, one per driver. But, some months ago, Jason Ekstrand refactored Anvil support and collaborated with other driver developers to create a common framework. Obviously each driver would have their own needs, but the framework provides enough hooks for that.
After some gitlab and IRC chats, Jason provided a Merge Request with the port of V3DV to this new common framework, that we iterated and tested through the review process.
Also, with this port we got timeline semaphore support for free. Thanks to this change, we have ~1.2k fewer total lines of code (and more features!).
Again, we want to thank Jason Ekstrand for all his help.
Since 1.1 got announced, the following extensions got implemented and exposed:
If you want more details about VK_KHR_pipeline_executable_properties, Iago recently wrote a blog post about it (here).
Android support for V3DV was added thanks to the work of Roman Stratiienko, who implemented this and submitted Mesa patches. We also want to thank the Android RPi team, and the Lineage RPi maintainer (Konsta) who also created and tested an initial version of that support, which was used as the baseline for the code that Roman submitted. I didn’t test it myself (it’s in my personal TO-DO list), but LineageOS images for the RPi4 are already available.
In addition to new functionality, we also have been working on improving performance. Most of the focus was done on the V3D shader compiler, as improvements to it would be shared among the OpenGL and Vulkan drivers.
But one of the features specific to the Vulkan driver (pending to be ported to OpenGL) is that we have implemented double buffer mode, which is only available when MSAA is not enabled. This mode splits the tile buffer size in half, so the driver can start processing the next tile while the current one is being stored in memory.
In theory this could improve performance by reducing tile store overhead, so it would be more beneficial when vertex/geometry shaders aren’t too expensive. However, it comes at the cost of reducing the tile size, which also causes some overhead of its own.
Testing shows that this helps in some cases (e.g. the Vulkan Quake ports) but hurts in others (e.g. Unreal Engine 4), so for the time being we don’t enable it by default. It can be enabled selectively by adding V3D_DEBUG=db to the environment variables. The idea for the future would be to implement a heuristic that decides when to activate this mode.
If you are interested in watching an overview of the improvements and changes to the driver during the last year, we gave a presentation at FOSDEM 2022:
“v3dv: Status Update for Open Source Vulkan Driver for Raspberry Pi 4”
In late 2020, Apple debuted the M1 with Apple’s GPU architecture, AGX, rumoured to be derived from Imagination’s PowerVR series. Since then, we’ve been reverse-engineering AGX and building open source graphics drivers. Last January, I rendered a triangle with my own code, but there has since been a heinous bug lurking:
The driver fails to render large amounts of geometry.
Spinning a cube is fine, low polygon geometry is okay, but detailed models won’t render. Instead, the GPU renders only part of the model and then faults.
It’s hard to pinpoint how much we can render without faults. It’s not just the geometry complexity that matters. The same geometry can render with simple shaders but fault with complex ones.
That suggests rendering detailed geometry with a complex shader “takes too long”, and the GPU is timing out. Maybe it renders only the parts it finished in time.
Given the hardware architecture, this explanation is unlikely.
This hypothesis is easy to test, because we can control for timing with a shader that takes as long as we like:
for (int i = 0; i < LARGE_NUMBER; ++i) {
/* some work to prevent the optimizer from removing the loop */
}
After experimenting with such a shader, we learn…
That theory is out.
Let’s experiment more. Modifying the shader and seeing where it breaks, we find the only part of the shader contributing to the bug: the amount of data interpolated per vertex. Modern graphics APIs allow specifying “varying” data for each vertex, like the colour or the surface normal. Then, for each triangle the hardware renders, these “varyings” are interpolated across the triangle to provide smooth inputs to the fragment shader, allowing efficient implementation of common graphics techniques like Blinn-Phong shading.
Putting the pieces together, what matters is the product of the number of vertices (geometry complexity) times amount of data per vertex (“shading” complexity). That product is “total amount of per-vertex data”. The GPU faults if we use too much total per-vertex data.
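To put rough, made-up numbers on it: a model with 100,000 vertices, each writing 8 vec4 varyings (128 bytes), produces on the order of
100,000 vertices × 128 bytes per vertex ≈ 12.8 MB of per-vertex data per frame
which is far more than any on-chip storage could hold.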
Why?
When the hardware processes each vertex, the vertex shader produces per-vertex data. That data has to go somewhere. How this works depends on the hardware architecture. Let’s consider common GPU architectures.1
Traditional immediate mode renderers render directly into the framebuffer. They first run the vertex shader for each vertex of a triangle, then run the fragment shader for each pixel in the triangle. Per-vertex “varying” data is passed almost directly between the shaders, so immediate mode renderers are efficient for complex scenes.
There is a drawback: rendering directly into the framebuffer requires tremendous amounts of memory access to constantly write the results of the fragment shader and to read out back results when blending. Immediate mode renderers are suited to discrete, power-hungry desktop GPUs with dedicated video RAM.
By contrast, tile-based deferred renderers split rendering into two passes. First, the hardware runs all vertex shaders for the entire frame, not just for a single model. Then the framebuffer is divided into small tiles, and dedicated hardware called a tiler determines which triangles are in each tile. Finally, for each tile, the hardware runs all relevant fragment shaders and writes the final blended result to memory.
Tilers reduce memory traffic required for the framebuffer. As the hardware renders a single tile at a time, it keeps a “cached” copy of that tile of the framebuffer (called the “tilebuffer”). The tilebuffer is small, just a few kilobytes, but tilebuffer access is fast. Writing to the tilebuffer is cheap, and unlike immediate renderers, blending is almost free. Because main memory access is expensive and mobile GPUs can’t afford dedicated video memory, tilers are suited to mobile GPUs, like Arm’s Mali, Imagination’s PowerVR, and Apple’s AGX.
Yes, AGX is a mobile GPU, designed for the iPhone. The M1 is a screaming fast desktop, but its unified memory and tiler GPU have roots in mobile phones. Tilers work well on the desktop, but there are some drawbacks.
First, at the start of a frame, the contents of the tilebuffer are undefined. If the application needs to preserve existing framebuffer contents, the driver needs to load the framebuffer from main memory and store it into the tilebuffer. This is expensive.
Second, because all vertex shaders are run before any fragment shaders, the hardware needs a buffer to store the outputs of all vertex shaders. In general, there is much more data required than space inside the GPU, so this buffer must be in main memory. This is also expensive.
Ah-ha. Because AGX is a tiler, it requires a buffer of all per-vertex data. We fault when we use too much total per-vertex data, overflowing the buffer.
…So how do we allocate a larger buffer?
On some tilers, like older versions of Arm’s Mali GPU, the userspace driver computes how large this “varyings” buffer should be and allocates it.2 To fix the faults, we can try increasing the sizes of all buffers we allocate, in the hopes that one of them contains the per-vertex data.
No dice.
It’s prudent to observe what Apple’s Metal driver does. We can cook up a Metal program drawing variable amounts of geometry and trace all GPU memory allocations that Metal performs while running our program. Doing so, we learn that increasing the amount of geometry drawn does not increase the sizes of any allocated buffers. In fact, it doesn’t change anything in the command buffer submitted to the kernel, except for the single “number of vertices” field in the draw command.
We know that buffer exists. If it’s not allocated by userspace – and by now it seems that it’s not – it must be allocated by the kernel or firmware.
Here’s a funny thought: maybe we don’t specify the size of the buffer at all. Maybe it’s okay for it to overflow, and there’s a way to handle the overflow.
It’s time for a little reconnaissance. Digging through what little public documentation exists for AGX, we learn from one WWDC presentation:
The Tiled Vertex Buffer stores the Tiling phase output, which includes the post-transform vertex data…
But it may cause a Partial Render if full. A Partial Render is when the GPU splits the render pass in order to flush the contents of that buffer.
Bullseye. The buffer we’re chasing, the “tiled vertex buffer”, can overflow. To cope, the GPU stops accepting new geometry, renders the existing geometry, and restarts rendering.
Since partial renders hurt performance, Metal application developers need to know about them to optimize their applications. There should be performance counters flagging this issue. Poking around, we find two:
Wait, what’s a “parameter buffer”?
Remember the rumours that AGX is derived from PowerVR? The public PowerVR optimization guides explain:
[The] list containing pointers to each vertex passed in from the application… is called the parameter buffer (PB) and is stored in system memory along with the vertex data.
Each varying requires additional space in the parameter buffer.
The Tiled Vertex Buffer is the Parameter Buffer. PB is the PowerVR name, TVB is the public Apple name, and PB is still an internal Apple name.
What happens when PowerVR overflows the parameter buffer?
An old PowerVR presentation says that when the parameter buffer is full, the “render is flushed”, meaning “flushed data must be retrieved from the frame buffer as successive tile renders are performed”. In other words, it performs a partial render.
Back to the Apple M1, it seems the hardware is failing to perform a partial render. Let’s revisit the broken render.
Notice that parts of the model are correctly rendered. The parts that are not show only the black clear colour of the scene rendered at the start. Let’s consider the logical order of events.
First, the hardware runs vertex shaders for the bunny until the parameter buffer overflows. This works: the partial geometry is correct.
Second, the hardware rasterizes the partial geometry and runs the fragment shaders. This works: the shading is correct.
Third, the hardware flushes the partial render to the framebuffer. This must work for us to see anything at all.
Fourth, the hardware runs vertex shaders for the rest of the bunny’s geometry. This ought to work: the configuration is identical to the original vertex shaders.
Fifth, the hardware rasterizes and shades the rest of the geometry, blending with the old partial render. Because AGX is a tiler, to preserve that existing partial render, the hardware needs to load it back into the tilebuffer. We have no idea how it does this.
Finally, the hardware flushes the render to the framebuffer. This should work as it did the first time.
The only problematic step is loading the framebuffer back into the tilebuffer after a partial render. Usually, the driver supplies two “extra” fragment shaders. One clears the tilebuffer at the start, and the other flushes out the tilebuffer contents at the end.
If the application needs the existing framebuffer contents preserved, instead of writing a clear colour, the “load tilebuffer” program instead reads from the framebuffer to reload the contents. Handling this requires quite a bit of code, but it works in our driver.
Looking closer, AGX requires more auxiliary programs.
The “store” program is supplied twice. I noticed this when initially bringing up the hardware, but the reason for the duplication was unclear. Omitting each copy separately and seeing what breaks, the reason becomes clear: one program flushes the final render, and the other flushes a partial render.3
…What about the program that loads the framebuffer into the tilebuffer?
When a partial render is possible, there are two “load” programs. One writes the clear colour or loads the framebuffer, depending on the application setting. We understand this one. The other always loads the framebuffer.
…Always loads the framebuffer, as in, for loading back with a partial render even if there is a clear at the start of the frame?
If this program is the issue, we can confirm easily. Metal must require it to draw the same bunny, so we can write a Metal application drawing the bunny and stomp over its GPU memory to replace this auxiliary load program with one always loading with black.
Doing so, Metal fails in a similar way. That means we’re at the root cause. Looking at our own driver code, we don’t specify any program for this partial render load. Up until now, that’s worked okay. If the parameter buffer is never overflowed, this program is unused. As soon as a partial render is required, however, failing to provide this program means the GPU dereferences a null pointer and faults. That explains our GPU faults at the beginning.
Following Metal, we supply our own program to load back the tilebuffer after a partial render…
…which does not fix the rendering! Cursed, this GPU. The faults go away, but the render still isn’t quite right for the first few frames, indicating partial renders are still broken. Notice the weird artefacts on the feet.
Curiously, the render “repairs itself” after a few frames, suggesting the parameter buffer stops overflowing. This implies the parameter buffer can be resized (by the kernel or by the firmware), and the system is growing the parameter buffer after a few frames in response to overflow. This mechanism makes sense:
Starting the parameter buffer small and growing in response to overflow provides a balance, reducing the GPU’s memory footprint and minimizing partial renders.
Back to our misrendering. There are actually two buffers being used by our program, a colour buffer (framebuffer)… and a depth buffer. The depth buffer isn’t directly visible, but facilitates the “depth test”, which discards far pixels that are occluded by other close pixels. While the partial render mechanism discards geometry, the depth test discards pixels.
That would explain the missing pixels on our bunny. The depth test is broken with partial renders. Why? The depth test depends on the depth buffer, so the depth buffer must also be stored after a partial render and loaded back when resuming. Comparing a trace from our driver to a trace from Metal, looking for any relevant difference, we eventually stumble on the configuration required to make depth buffer flushes work.
And with that, we get our bunny.
These explanations are massive oversimplifications of how modern GPUs work, but it’s good enough for our purposes here.↩︎
This is a worse idea than it sounds. Starting with the new Valhall architecture, Mali allocates varyings much more efficiently.↩︎
Why the duplication? I have not yet observed Metal using different programs for each. However, for front buffer rendering, partial renders need to be flushed to a temporary buffer for this scheme to work. Of course, you may as well use double buffering at that point.↩︎
In part 1 I gave a brief introduction on what mesh and task shaders are from the perspective of application developers. Now it’s time to dive deeper and talk about how mesh shaders are implemented in a Vulkan driver on AMD HW. Note that everything I discuss here is based on my experience and understanding as I was working on mesh shader support in RADV and is already available as public information in open source driver code. The goal of this blog post is to elaborate on how mesh shaders are implemented on the NGG hardware in AMD RDNA2 GPUs, and to show how these details affect shader performance. Hopefully, this helps the reader better understand how the concepts in the API are translated to the HW and what pitfalls to avoid to get good perf.
NGG (Next Generation Geometry) is the technology that is responsible for any vertex and geometry processing in RDNA GPUs (with some caveats). Also known as “primitive shader”, the main innovations of NGG are:
This flexibility allows the driver to implement every vertex/geometry processing stage using NGG. Vertex, tess eval and geometry shaders can all be compiled to NGG “primitive shaders”. The only major limiting factor is that each thread (SIMD lane) can only output up to 1 vertex and 1 primitive (with caveats).
The driver is also capable of extending the application shaders with sweet stuff such as per-triangle culling, but this is not the main focus of this blog post. I also won’t cover the caveats here, but I may write more about NGG in the future.
The draw commands as executed on the GPU only understand a number of input vertices, but the mesh shader API draw calls specify a number of workgroups instead. To make this work, we configure the shader such that the number of input vertices per workgroup is 1, and set the output to what you passed into the API. This way, the FW can figure out how many workgroups it really needs to launch.
The driver has to accommodate the HW limitation above, so we must ensure that in the compiled shader, each thread only outputs up to 1 vertex and 1 primitive. Reminder: the API programming model allows any shader invocation to write any vertex and/or primitive. So, there is a fundamental mismatch between the programming model and what the HW can do.
This raises a few interesting questions.
How do we allow any thread to write any vertex/primitive? The driver allocates some LDS (shared memory) space, and writes all mesh shader outputs there. At the very end of the shader, each thread reads the attributes of the vertex and primitive that matches the thread ID and outputs that single vertex and primitive. This roundtrip to the LDS can be omitted if an output is only written by the thread with matching thread ID. (Note: at the time of writing, I haven’t implemented this optimization yet, but I plan to.)
What if the MS workgroup size is less than the max number of output vertices or primitives?
Each HW thread can create up to 1 vertex and 1 primitive.
The driver has to set the real workgroup size accordingly:
hw workgroup size = max(api workgroup size, max vertex count, max primitive count)
The result is that the HW will get a workgroup that has some threads that execute the code you wrote (the “API shader”), and then some that won’t do anything but wait until the very end to output their up to 1 vertex and 1 primitive. It can result in poor occupancy (low HW utilization = bad performance).
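To make this concrete with made-up numbers: suppose the API workgroup size is 32, max_vertices is 64 and max_primitives is 126. Then:
hw workgroup size = max(32, 64, 126) = 126
so only 32 of the 126 threads execute your API shader, while the other 94 idle until the final vertex/primitive export.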
What if the shader also has barriers in it? This is now turning into a headache. The driver has to ensure that the threads that “do nothing” also execute an equal amount of barriers as those that run your API shader. If the HW workgroup has the same number of waves as the API shader, this is trivial. Otherwise, we have to emit some extra code that keeps the extra waves running in a loop executing barriers. This is the worst.
What if the API shader also uses shared memory, or not all outputs fit in the LDS? The D3D12 spec requires the driver to have at least 28K of shared memory (LDS) available to the shader. However, graphics shaders can only access up to 32K of LDS. How do we make this work, considering the above fact that the driver has to write mesh shader outputs to LDS? This is getting really ugly now, but in that case, the driver is forced to write MS outputs to VRAM instead of LDS. (Note: at the time of writing, I haven’t implemented this edge case yet, but I plan to.)
How do you deal with the compute-like stuff, e.g. workgroup ID, subgroup ID, etc.? Fortunately, most of these were already available to the shader, just not exposed in the traditional VS, TES, GS programming model. The only pain point is the workgroup ID, which needs trickery. I already mentioned above that the HW is tricked into thinking that each MS workgroup has 1 input vertex. So we can just use the register that contains the vertex ID to get the workgroup ID.
The above implementation details can be turned into performance recommendations.
Specify a MS workgroup size that matches the maximum amount of vertices and primitives. Also, distribute the work among the full workgroup rather than leaving some threads doing nothing. If you do this, you ensure that the hardware is optimally utilized. This is the most important recommendation here today.
Try to only write to the mesh output array indices from the corresponding thread. If you do this, you hit an optimal code path in the driver, so it won’t have to write those outputs to LDS and read them back at the end.
Use shared memory, but not excessively. Implementing any nice algorithm in your mesh shader will likely need you to share data between threads. Don’t be afraid to use shared memory, but prefer to use subgroup functionality instead when possible.
That is perfectly fine. Don’t use mesh shaders then.
The main takeaway about mesh shading is that it’s a very low level tool. The driver can implement the full programming model, but it can’t hold your hand as well as it could for traditional vertex processing. You may have to implement things (e.g. vertex inputs, culling, etc.) that the driver would previously do for you. Essentially, if you write a mesh shader you are trying to beat the driver at its own game.
I think this post is already dense enough with technical detail. Brace yourself for the next post, where I’m going to blow your mind even more and talk about how task shaders are implemented.
Background
Today NVIDIA announced that they are releasing an open source kernel driver for their GPUs, so I want to share with you some background information and how I think this will impact Linux graphics and compute going forward.
One thing many people are not aware of is that Red Hat is the only Linux OS company that has a strong presence in the Linux compute and graphics engineering space. There are of course a lot of other people working in the space too, like engineers working for Intel, AMD and NVIDIA, people working for consultancy companies like Collabora, or individual community members, but Red Hat as an OS integration company has been very active in trying to ensure we have a maintainable and shared upstream open source stack. This engineering presence is also what has allowed us to move important technologies forward, like getting HiDPI support for Linux some years ago, or working with NVIDIA to get glvnd implemented to remove a pain point for our users, since the original OpenGL design only allowed for one OpenGL implementation to be installed at a time. We see ourselves as the open source community’s partner here, fighting to keep the Linux graphics stack coherent and maintainable, and as a partner for the hardware OEMs to work with when they need help pushing major new initiatives around GPUs for Linux forward. And as the only Linux vendor with a significant engineering footprint in GPUs, we have been working closely with NVIDIA. People like Kevin Martin, the manager for our GPU technologies team, Ben Skeggs, the maintainer of Nouveau, Dave Airlie, the upstream kernel maintainer for the graphics subsystem, Nouveau contributor Karol Herbst, and our accelerator lead Tom Rix have all taken part in meetings, code reviews and discussions with NVIDIA. So let me talk a little about what this release means (and also what it doesn’t mean) and what we hope to see come out of this long term.
First of all, what is in this new driver?
What has been released is an out-of-tree source code kernel driver which has been tested to support CUDA usecases on datacenter GPUs. There is code in there to support display, but it is not complete or fully tested yet. Also, this is only the kernel part; a big part of a modern graphics driver is found in the firmware and userspace components, and those are still closed source. But it does mean we now have an NVIDIA kernel driver that will be able to start consuming the GPL-only APIs in the Linux kernel, although this initial release doesn’t consume any APIs the old driver wasn’t already using. The driver also only supports NVIDIA Turing chip GPUs and newer, which means it is not targeting GPUs from before 2018. So for the average Linux desktop user, while this is a great first step and hopefully a sign of what is to come, it is not something you are going to start using tomorrow.
What does it mean for the NVidia binary driver?
Not too much immediately. This binary kernel driver will continue to be needed for older pre-Turing NVIDIA GPUs, and until the open source kernel module is fully tested and extended for display usecases you are likely to continue using it for your system even if you are on Turing or newer. Also, as mentioned above regarding the firmware and userspace bits, the binary driver is going to continue to be around even once the open source kernel driver is fully capable.
What does it mean for Nouveau?
Let me start with the obvious: this is actually great news for the Nouveau community and the Nouveau driver, and NVIDIA has done a great favour to the open source graphics community with this release. For those unfamiliar with Nouveau, it is the in-kernel graphics driver for NVIDIA GPUs today, which was originally developed as a reverse engineered driver, but which over recent years has had active support from NVIDIA. It is fully functional, but is severely hampered by not having had the ability to, for instance, re-clock the NVIDIA card, meaning that it can’t give you full performance like the binary driver can. This was something we were working with NVIDIA to remedy, but this new release provides us with a better path forward. So what does this new driver mean for Nouveau? Less initially, but a lot in the long run. To give a little background first: the Linux kernel does not allow multiple drivers for the same hardware, so in order for a new NVIDIA kernel driver to go in, the current one will have to go out or at least be limited to a different set of hardware. The current one is Nouveau. And just like the binary driver, a big chunk of Nouveau is not in the kernel, but in the userspace pieces found in Mesa and the Nouveau-specific firmware that NVIDIA currently kindly makes available. So regardless of the long term effort to create a new open source in-tree kernel driver based on this new open source driver for NVIDIA hardware, Nouveau will very likely be staying around to support pre-Turing hardware, just like the NVIDIA binary kernel driver will.
The plan we are working towards from our side, but which is likely to take a few years to come to full fruition, is to come up with a way for the NVIDIA binary driver and Mesa to share a kernel driver. The details of how we will do that are something we are still working on and discussing with our friends at NVIDIA, to address both the needs of the NVIDIA userspace and the needs of the Mesa userspace. Along with that evolution we hope to work with NVIDIA engineers to refactor the userspace bits of Mesa that are now targeting just Nouveau to be able to interact with this new kernel driver, and also work so that the binary driver and Nouveau can share the same firmware. This has clear advantages for both the open source community and NVIDIA. For the open source community it means that we will now have a kernel driver and firmware that allow things like changing the clocking of the GPU to provide the kind of performance people expect from NVIDIA graphics cards, and it means that we will have an open source driver that will have access to the firmware and kernel updates from day one for new generations of NVIDIA hardware. For the ‘binary’ driver, and I put that in quotes because it will now be less binary :), it means, as stated above, that it can start taking advantage of the GPL-only APIs in the kernel, distros can ship it and enable secure boot, and it gets an open source consumer of its kernel driver, allowing that driver to go upstream.
Whether this new shared kernel driver will be known as Nouveau or something completely different is still an open question, and of course whether it happens at all depends on whether we, the rest of the open source community, and NVIDIA are able to find a path together to make it happen, but so far everyone seems to be of good will.
What does this release mean for linux distributions like Fedora and RHEL?
Over time it provides a pathway to radically simplify supporting NVIDIA hardware, due to the opportunities discussed elsewhere in this document. Long term we hope to be able to offer a better user experience with NVIDIA hardware in terms of out-of-the-box functionality, which means day 1 support for new chipsets, a high performance open source Mesa driver for NVIDIA hardware, and the ability to sign the NVIDIA driver alongside the rest of the kernel to enable things like Secure Boot support. Since this first release is targeting compute, one can expect that these options will first be available for compute users and then for graphics at a later time.
What are the next steps?
Well, there is a lot of work to do here. NVIDIA needs to continue the effort to make this new driver feature complete for both compute and graphics/display usecases, we’d like to work together to come up with a plan for what the future unified kernel driver can look like and a model around it that works for both the community and NVIDIA, and we need to add things like a Mesa Vulkan driver. We at Red Hat will be playing an active part in this work as the only Linux vendor with the capacity to do so, and we will also work to ensure that the wider open source community has a chance to participate fully, like we do for all open source efforts we are part of.
If you want to hear more about this, I did talk with Chris Fisher and Linux Action News about this topic. Note: I stated some timelines in that interview which I didn’t make clear were my guesstimates and not in any form official NVIDIA timelines, so I apologize for the confusion.
In the previous post, I described how we enable multiple syncobjs capabilities in the V3D kernel driver. Now I will tell you what was changed on the userspace side, where we reworked the V3DV sync mechanisms to use Vulkan multiple wait and signal semaphores directly. This change represents greater adherence to the Vulkan submission framework.
I was not used to Vulkan concepts and the V3DV driver. Fortunately, I could count on the guidance of Igalia’s Graphics team, mainly Iago Toral (thanks!), to understand the Vulkan Graphics Pipeline, sync scopes, and submission order. Therefore, we changed the original V3DV implementation of vkQueueSubmit and all related functions to allow direct mapping of multiple semaphores from V3DV to the V3D kernel interface.
Disclaimer: Here’s a brief and probably inaccurate background, which we’ll go into more detail later on.
In Vulkan, GPU work submissions are described as command buffers. These command buffers, with GPU jobs, are grouped in a command buffer submission batch, specified by vkSubmitInfo, and submitted to a queue for execution. vkQueueSubmit is the command called to submit command buffers to a queue. Besides command buffers, vkSubmitInfo also specifies semaphores to wait before starting the batch execution and semaphores to signal when all command buffers in the batch are complete. Moreover, a fence in vkQueueSubmit can be signaled when all command buffer batches have completed execution.
From this sequence, we can see some implicit ordering guarantees. Submission order defines the start order of execution between command buffers; in other words, it is determined by the order in which pSubmits appear in VkQueueSubmit and pCommandBuffers appear in VkSubmitInfo. However, we don’t have any completion guarantees for jobs submitted to different GPU queues, which means they may overlap and complete out of order. Of course, jobs submitted to the same GPU engine follow start and finish order. In terms of signal operation order, a fence is ordered after all semaphore signal operations. In addition to implicit sync, we also have some explicit sync resources, such as semaphores, fences, and events.
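For reference, a minimal sketch of such a submission (the semaphore, command buffer, queue and fence handles are assumed to have been created elsewhere; error handling omitted):
VkSemaphore wait_sems[2] = { sem_a, sem_b };
VkPipelineStageFlags wait_stages[2] = {
    VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
    VK_PIPELINE_STAGE_VERTEX_INPUT_BIT,
};
VkSemaphore signal_sems[1] = { sem_c };
VkSubmitInfo submit_info = {
    .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
    .waitSemaphoreCount = 2,
    .pWaitSemaphores = wait_sems,        /* wait before starting the batch */
    .pWaitDstStageMask = wait_stages,
    .commandBufferCount = 1,
    .pCommandBuffers = &cmd_buffer,
    .signalSemaphoreCount = 1,
    .pSignalSemaphores = signal_sems,    /* signaled when the batch completes */
};
/* The fence is signaled when all submitted batches have completed execution. */
vkQueueSubmit(queue, 1, &submit_info, fence);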
Considering these implicit and explicit sync mechanisms, we rework the V3DV implementation of queue submissions to better use multiple syncobjs capabilities from the kernel. In this merge request, you can find this work: v3dv: add support to multiple wait and signal semaphores. In this blog post, we run through each scope of change of this merge request for a V3D driver-guided description of the multisync support implementation.
As the original V3D kernel interface allowed only one semaphore, V3DV resorted to booleans to “translate” multiple semaphores into one. Consequently, if a command buffer batch had at least one semaphore, it needed to wait for all previously submitted jobs to complete before starting its execution. So, instead of just booleans, we created and changed the structs that store semaphore information to accept an actual list of wait semaphores.
In the two commits below, we basically updated the DRM V3D interface from that one defined in the kernel and verified if the multisync capability is available for use.
At this point, we were only changing the submission design to consider multiple wait semaphores. Before supporting multisync, V3DV waited for the last submitted job to be signaled when at least one wait semaphore was defined, even when serialization wasn’t required. V3DV handles GPU jobs according to the GPU queue in which they are submitted:
Therefore, we changed their submission setup so that jobs submitted to any GPU queue can handle more than one wait semaphore.
These commits created all mechanisms to set arrays of wait and signal semaphores for GPU job submissions:
Finally, we extended the ability of GPU jobs to handle multiple signal semaphores, but at this point, no GPU job is actually in charge of signaling them. With this in place, we could rework part of the code that tracks CPU and GPU job completions by verifying the GPU status and threads spawned by Event jobs.
As we had only single in/out syncobj interfaces for semaphores, we used a single last_job_sync to synchronize job dependencies of the previous submission. Although the DRM scheduler guarantees the order of starting to execute a job in the same queue in the kernel space, the order of completion isn’t predictable. On the other hand, we still needed to use syncobjs to follow job completion since we have event threads on the CPU side. Therefore, a more accurate implementation requires last_job syncobjs to track when each engine (CL, TFU, and CSD) is idle. We also needed to keep the driver working on previous versions of the v3d kernel driver with single semaphores, so we kept tracking ANY last_job_sync to preserve the previous implementation.
With multiple semaphore support, the conditions for waiting and signaling semaphores changed according to the particularities of each GPU job (CL, CSD, TFU) and CPU job restrictions (Events, CSD indirect, etc.). In this sense, we redesigned V3DV’s semaphore handling and job submissions for command buffer batches in vkQueueSubmit.
We scrutinized possible scenarios for submitting command buffer batches to change the original implementation carefully. It resulted in three commits more:
We keep track of whether we have submitted a job to each GPU queue (CSD, TFU, CL) and a CPU job for each command buffer. We use syncobjs to track the last job submitted to each GPU queue and a flag that indicates if this represents the beginning of a command buffer.
The first GPU job submitted to a GPU queue in a command buffer should wait on wait semaphores. The first CPU job submitted in a command buffer should call v3dv_QueueWaitIdle() to do the waiting and ignore semaphores (because it is waiting for everything).
If the job is not the first but has the serialize flag set, it should wait for the completion of all last jobs submitted to any GPU queue before running. In practice, this means using syncobjs to track the last job submitted to each queue and adding these syncobjs as job dependencies of the serialized job.
If this job is the last job of a command buffer batch, it may be used to signal semaphores if this command buffer batch has only one type of GPU job (because then we have guarantees of execution ordering). Otherwise, we emit a no-op job just to signal semaphores. It waits for the completion of all last jobs submitted to any GPU queue and then signals the semaphores. Note: we later changed this approach to correctly deal with ordering changes caused by event threads. Whenever we have an event job in the command buffer, we cannot rely on the last-job-in-the-last-command-buffer assumption; we have to wait for all event threads to complete before signaling.
After submitting all command buffers, we emit a no-op job to wait on the completion of all last jobs by queue and signal the fence. Note: at some point, we changed this approach to correctly deal with ordering changes caused by event threads, as mentioned before.
With many changes and many rounds of reviews, the patchset was merged. After more validations and code review, we polished and fixed the implementation together with external contributions:
Also, multisync capabilities enabled us to add new features to V3DV and switch the driver to the common synchronization and submission framework:
This was waiting for multisync support in the v3d kernel, which is already available. Exposing this feature however enabled a few more CTS tests that exposed pre-existing bugs in the user-space driver so we fix those here before exposing the feature.
This should give you emulated timeline semaphores for free and kernel-assisted sharable timeline semaphores for cheap once you have the kernel interface wired in.
We used a set of games to ensure there was no performance regression in the new implementation. For this, we used GFXReconstruct to capture Vulkan API calls while playing those games. Then, we compared results with and without multisync caps in the kernel space, and also with multisync enabled on v3dv. We didn’t observe any compromise in performance, but we did see improvements when replaying scenes of the vkQuake game.
As you may already know, we at Igalia have been working on several improvements to the 3D rendering drivers of Broadcom Videocore GPU, found in Raspberry Pi 4 devices. One of our recent works focused on improving V3D(V) drivers adherence to Vulkan submission and synchronization framework. We had to cross various layers from the Linux Graphics stack to add support for multiple syncobjs to V3D(V), from the Linux/DRM kernel to the Vulkan driver. We have delivered bug fixes, a generic gate to extend job submission interfaces, and a more direct sync mapping of the Vulkan framework. These changes did not impact the performance of the tested games and brought greater precision to the synchronization mechanisms. Ultimately, support for multiple syncobjs opened the door to new features and other improvements to the V3DV submission framework.
But, first, what are DRM sync objs?
* DRM synchronization objects (syncobj, see struct &drm_syncobj) provide a
* container for a synchronization primitive which can be used by userspace
* to explicitly synchronize GPU commands, can be shared between userspace
* processes, and can be shared between different DRM drivers.
* Their primary use-case is to implement Vulkan fences and semaphores.
[...]
* At it's core, a syncobj is simply a wrapper around a pointer to a struct
* &dma_fence which may be NULL.
And Jason Ekstrand well-summarized dma_fence features in a talk at the Linux Plumbers Conference 2021:
A struct that represents a (potentially future) event:
- Has a boolean “signaled” state
- Has a bunch of useful utility helpers/concepts, such as refcount, callback wait mechanisms, etc.
Provides two guarantees:
- One-shot: once signaled, it will be signaled forever
- Finite-time: once exposed, is guaranteed to signal in a reasonable amount of time
For our main purpose, the multiple syncobjs support means that V3DV can submit jobs with more than one wait and signal semaphore. In the kernel space, wait semaphores become explicit job dependencies to wait on before executing the job. Signal semaphores (or post dependencies), in turn, work as fences to be signaled when the job completes its execution, unlocking following jobs that depend on its completion.
The multisync support development comprised many decision-making points and steps, summarized as follows:
We decided to refactor parts of the V3D(V) submission design in kernel space and userspace during this development. We improved job scheduling in the V3D kernel driver and the V3DV job submission design. We also delivered more accurate synchronization mechanisms and further updates to the Broadcom Vulkan driver running on Raspberry Pi 4. Therefore, we summarize here the changes in the kernel space, describing the previous state of the driver, the decisions taken, side improvements, and fixes.
Initially, V3D was very limited in the number of syncobjs per job submission. The V3D job interfaces (CL, CSD, and TFU) only supported one syncobj (in_sync) to be added as an execution dependency and one syncobj (out_sync) to be signaled when a submission completes. The exception was the CL submission, which accepts two in_syncs (one for the binner and another for the render job), but even that didn’t change the limited options much.
Meanwhile in the userspace, the V3DV driver followed alternative paths to meet Vulkan’s synchronization and submission framework. It needed to handle multiple wait and signal semaphores, but the V3D kernel-driver interface only accepts one in_sync and one out_sync. In short, V3DV had to fit multiple semaphores into one when submitting every GPU job.
The first decision was how to extend the V3D interface to accept multiple in and out syncobjs. We could extend each ioctl with two syncobj arrays and two entries for their counters, or we could create new ioctls with multiple in/out syncobjs. But after examining how other drivers extend their submission interfaces, we decided to extend the V3D ioctls (v3d_cl_submit_ioctl, v3d_csd_submit_ioctl, v3d_tfu_submit_ioctl) through a generic ioctl extension.
I found a curious commit message when I was examining how other developers handled the issue in the past:
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Fri Mar 22 09:23:22 2019 +0000
drm/i915: Introduce the i915_user_extension_method
An idea for extending uABI inspired by Vulkan's extension chains.
Instead of expanding the data struct for each ioctl every time we need
to add a new feature, define an extension chain instead. As we add
optional interfaces to control the ioctl, we define a new extension
struct that can be linked into the ioctl data only when required by the
user. The key advantage being able to ignore large control structs for
optional interfaces/extensions, while being able to process them in a
consistent manner.
In comparison to other extensible ioctls, the key difference is the
use of a linked chain of extension structs vs an array of tagged
pointers. For example,
struct drm_amdgpu_cs_chunk {
        __u32 chunk_id;
        __u32 length_dw;
        __u64 chunk_data;
};
[...]
So, inspired by amdgpu_cs_chunk and i915_user_extension, we opted to extend the V3D interface through a generic extension mechanism. After applying some suggestions from Iago Toral (Igalia) and Daniel Vetter, we reached the following struct:
struct drm_v3d_extension {
        __u64 next;
        __u32 id;
#define DRM_V3D_EXT_ID_MULTI_SYNC 0x01
        __u32 flags; /* mbz */
};
This generic extension has an id to identify the feature/extension we are adding to an ioctl (which maps to the related struct type), a pointer to the next extension, and flags (if needed). Whenever we need to extend the V3D interface again for another specific feature, we subclass this generic extension into the specific one instead of extending the ioctls indefinitely.
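To make the mechanism a bit more concrete, here is a hypothetical sketch of how the kernel side could walk such an extension chain; copy_from_user() and u64_to_user_ptr() are real kernel helpers, but the function name and the empty multisync case are placeholders, not the actual V3D code.

#include <linux/uaccess.h>
#include <drm/drm_drv.h>

/* Illustrative only: walk a chain of drm_v3d_extension structs passed
 * from userspace. Each extension starts with the generic header, so we
 * can read the id first and then dispatch on it. */
static int example_get_extensions(__u64 ext_handles)
{
        struct drm_v3d_extension __user *user_ext = u64_to_user_ptr(ext_handles);

        while (user_ext) {
                struct drm_v3d_extension ext;

                if (copy_from_user(&ext, user_ext, sizeof(ext)))
                        return -EFAULT;

                switch (ext.id) {
                case DRM_V3D_EXT_ID_MULTI_SYNC:
                        /* Parse the full drm_v3d_multi_sync struct here;
                         * since it subclasses the generic header, user_ext
                         * also points at it. */
                        break;
                default:
                        return -EINVAL; /* unknown extension id */
                }

                user_ext = u64_to_user_ptr(ext.next); /* follow the chain */
        }

        return 0;
}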
For the multiple syncobjs extension, we define a multi_sync extension struct that subclasses the generic extension struct. It has arrays of in and out syncobjs, the respective number of elements in each of them, and a wait_stage value used in CL submissions to determine which job needs to wait for syncobjs before running.
struct drm_v3d_multi_sync {
        struct drm_v3d_extension base;
        /* Array of wait and signal semaphores */
        __u64 in_syncs;
        __u64 out_syncs;
        /* Number of entries */
        __u32 in_sync_count;
        __u32 out_sync_count;
        /* set the stage (v3d_queue) to sync */
        __u32 wait_stage;
        __u32 pad; /* mbz */
};
And if a multisync extension is defined, the V3D driver ignores the previous interface of single in/out syncobjs.
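As a rough illustration of the userspace side, the sketch below fills in the multisync extension using only the fields shown above; the helper name, the handle arrays, the counters and the wait_queue value are hypothetical placeholders, and the way the chain head is then attached to the submit struct (an extensions pointer plus a flag) is the part added by the patchset.

#include <stdint.h>
#include <drm/v3d_drm.h> /* assumed include path for the UAPI structs shown above */

/* Hypothetical helper: fill in a multisync extension for a submission.
 * The handle arrays and the wait queue value are chosen by the caller;
 * only the struct layout comes from the UAPI definitions above. */
static void fill_multisync(struct drm_v3d_multi_sync *ms,
                           const uint32_t *in_handles, uint32_t in_count,
                           const uint32_t *out_handles, uint32_t out_count,
                           uint32_t wait_queue)
{
    ms->base.id        = DRM_V3D_EXT_ID_MULTI_SYNC;
    ms->base.next      = 0;                 /* end of the extension chain */
    ms->base.flags     = 0;                 /* mbz */
    ms->in_syncs       = (uintptr_t)in_handles;
    ms->out_syncs      = (uintptr_t)out_handles;
    ms->in_sync_count  = in_count;
    ms->out_sync_count = out_count;
    ms->wait_stage     = wait_queue;        /* e.g. the binner queue for CL jobs */
    ms->pad            = 0;                 /* mbz */
}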
Once we had the interface to support multiple in/out syncobjs, the V3D kernel driver needed to handle it. As V3D uses the DRM scheduler for job execution, changing from a single syncobj to multiple ones is quite straightforward. V3D copies the in syncobjs from userspace and uses drm_syncobj_find_fence() + drm_sched_job_add_dependency() to add all in_syncs (wait semaphores) as job dependencies, i.e. syncobjs to be checked by the scheduler before running the job. On CL submissions we have the bin and render jobs, so V3D follows the value of wait_stage to determine which of the two depends on those in_syncs to start its execution.
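A hedged sketch of that dependency-adding step could look like this; drm_syncobj_find_fence() and drm_sched_job_add_dependency() are the real DRM helpers named above, while the loop and the handle array are simplified placeholders, not the actual V3D code.

#include <drm/drm_syncobj.h>
#include <drm/gpu_scheduler.h>

/* Illustrative only: turn an array of in_sync handles into scheduler
 * dependencies for a job. */
static int example_add_wait_deps(struct drm_file *file_priv,
                                 struct drm_sched_job *job,
                                 const u32 *handles, u32 count)
{
        u32 i;

        for (i = 0; i < count; i++) {
                struct dma_fence *fence;
                int ret;

                /* Look up the fence currently backing this syncobj. */
                ret = drm_syncobj_find_fence(file_priv, handles[i], 0, 0, &fence);
                if (ret)
                        return ret;

                /* Hand it to the DRM scheduler; the job will not run before
                 * this fence signals. The scheduler takes over the reference. */
                ret = drm_sched_job_add_dependency(job, fence);
                if (ret)
                        return ret;
        }

        return 0;
}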
When V3D defines the last job in a submission, it replaces the dma_fence of the out_syncs with the done_fence from this last job. It uses drm_syncobj_find() + drm_syncobj_replace_fence() to do that. Therefore, when a job completes its execution and signals its done_fence, all out_syncs are signaled too.
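And a similarly hedged sketch of the signaling side, using the drm_syncobj_find() and drm_syncobj_replace_fence() helpers named above; again, the function and its parameters are illustrative only.

/* Illustrative only: point every out_sync at the last job's done fence,
 * so they all signal when that job finishes. */
static int example_attach_out_syncs(struct drm_file *file_priv,
                                    struct dma_fence *done_fence,
                                    const u32 *handles, u32 count)
{
        u32 i;

        for (i = 0; i < count; i++) {
                struct drm_syncobj *syncobj;

                syncobj = drm_syncobj_find(file_priv, handles[i]);
                if (!syncobj)
                        return -ENOENT;

                /* Replace whatever fence the syncobj held before. */
                drm_syncobj_replace_fence(syncobj, done_fence);
                drm_syncobj_put(syncobj); /* drop the lookup reference */
        }

        return 0;
}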
This work also made possible some improvements in the original implementation.
Following Iago’s suggestions, we refactored the job’s initialization code to
allocate memory and initialize a job in one go. With this, we started to clean
up resources more cohesively, clearly distinguishing cleanups in case of
failure from job completion. We also fixed the resource cleanup when a job is
aborted before the DRM scheduler arms it - at that point, drm_sched_job_arm()
had recently been introduced to job initialization. Finally, we prepared the
semaphore interface to implement timeline syncobjs in the future.
The patchset that adds multiple syncobjs support and improvements to V3D is available here and comprises four patches:
After extending the V3D kernel interface to accept multiple syncobjs, we worked on V3DV to benefit from V3D multisync capabilities. In the next post, I will describe a little of this work.
As a board, we have been working on several initiatives to make the Foundation a better asset for the GNOME Project. We’re working on a number of threads in parallel, so I wanted to explain the “big picture” a bit more to try and connect together things like the new ED search and the bylaw changes.
We’re all here to see free and open source software succeed and thrive, so that people can be be truly empowered with agency over their technology, rather than being passive consumers. We want to bring GNOME to as many people as possible so that they have computing devices that they can inspect, trust, share and learn from.
In previous years we’ve tried to boost the relevance of GNOME (or technologies such as GTK) or solicit donations from businesses and individuals with existing engagement in FOSS ideology and technology. The problem with this approach is that we’re mostly addressing people and organisations who are already supporting or contributing FOSS in some way. To truly scale our impact, we need to look to the outside world, build better awareness of GNOME outside of our current user base, and find opportunities to secure funding to invest back into the GNOME project.
The Foundation supports the GNOME project with infrastructure, arranging conferences, sponsoring hackfests and travel, design work, legal support, managing sponsorships, advisory board, being the fiscal sponsor of GNOME, GTK, Flathub… and we will keep doing all of these things. What we’re talking about here are additional ways for the Foundation to support the GNOME project – we want to go beyond these activities, and invest into GNOME to grow its adoption amongst people who need it. This has a cost, and that means in parallel with these initiatives, we need to find partners to fund this work.
Neil has previously talked about themes such as education, advocacy, privacy, but we’ve not previously translated these into clear specific initiatives that we would establish in addition to the Foundation’s existing work. This is all a work in progress and we welcome any feedback from the community about refining these ideas, but here are the current strategic initiatives the board is working on. We’ve been thinking about growing our community by encouraging and retaining diverse contributors, and addressing evolving computing needs which aren’t currently well served on the desktop.
Initiative 1. Welcoming newcomers. The community is already spending a lot of time welcoming newcomers and teaching them the best practices. Those activities are as time consuming as they are important, but currently a handful of individuals are running initiatives such as GSoC, Outreachy and outreach to universities. These activities help bring diverse individuals and perspectives into the community, and help them develop skills and experience of collaborating to create Open Source projects. We want to make those efforts more sustainable by finding sponsors for these activities. With funding, we can hire people to dedicate their time to operating these programs, including paid mentors and creating materials to support newcomers in the future, such as developer documentation, examples and tutorials. This is the initiative that needs to be refined the most before we can turn it into something real.
Initiative 2: Diverse and sustainable Linux app ecosystem. I spoke at the Linux App Summit about the work that GNOME and Endless have been supporting in Flathub, but this is an example of something which has a great overlap between commercial, technical and mission-based advantages. The key goal here is to improve the financial sustainability of participating in our community, which in turn has an impact on the diversity of who we can expect to afford to enter and remain in our community. We believe the existence of this is critically important for individual developers and contributors to unlock earning potential from our ecosystem, through donations or app sales. In turn, a healthy app ecosystem also improves the usefulness of the Linux desktop as a whole for potential users. We believe that we can build a case for commercial vendors in the space to join an advisory board alongside GNOME, KDE, etc. to provide input into the governance and contribute to the costs of growing Flathub.
Initiative 3: Local-first applications for the GNOME desktop. This is what Thib has been starting to discuss on Discourse, in this thread. There are many different threats to free access to computing and information in today’s world. The GNOME desktop and apps need to give users convenient and reliable access to technology which works similarly to the tools they already use everyday, but keeps them and their data safe from surveillance, censorship, filtering or just being completely cut off from the Internet. We believe that we can seek both philanthropic and grant funding for this work. It will make GNOME a more appealing and comprehensive offering for the many people who want to protect their privacy.
The idea is that these initiatives all sit on the boundary between the GNOME community and the outside world. If the Foundation can grow and deliver these kinds of projects, we are reaching to new people, new contributors and new funding. These contributions and investments back into GNOME represent a true “win-win” for the newcomers and our existing community.
(Originally posted to GNOME Discourse, please feel free to join the discussion there.)
Mesh and task shaders (amplification shaders in D3D jargon) are a new way to produce geometry in 3D applications. First proposed by NVidia in 2018 and initially available in the “Turing” series of GPUs, they are now supported on RDNA2 GPUs and, on the API side, are part of D3D12 and also available as vendor-specific extensions to Vulkan and OpenGL. In this post I’m going to talk about what mesh shaders are, and in Part 2 I’m going to talk about how they are implemented on the driver side.
The problem with the traditional vertex processing pipeline is that it is mainly designed assuming several fixed-function hardware units in the GPU and offers very little flexibility for the user to customize it. The main issues with the traditional pipeline are:
The mesh shading pipeline is a graphics pipeline that addresses these issues by completely replacing the entire traditional vertex processing pipeline with two new stages: the task and mesh shader.
A mesh shader is a compute-like stage which allows the application to fully customize its inputs and outputs, including output primitives.
You can use all the sweet good stuff that compute shaders already can do, but vertex shaders couldn’t, for example: use shared memory, run in workgroups, rely on workgroup ID, subgroup ID, etc.
The API allows any mesh shader invocation to write to any vertex or primitive. The invocations in each mesh shader workgroup are meant to co-operatively produce a small set of output vertices and primitives (this is sometimes called a “meshlet”). All workgroups together create the full output geometry (the “mesh”).
First, it needs to figure out how many vertices and primitives it wants to create, then it can write these to its output arrays. How it does this, is entirely up to the application developer. Though, there are some performance recommendations which you should follow to make it go fast. I’m going to talk about these more in Part 2.
The input assembler step is entirely eliminated, which means that the application is now in full control of how (or if at all) the input vertex data is fetched, meaning that you can save bandwidth on things that don’t need to be loaded, etc. There is no “input” in the traditional sense, but you can rely on things like push constants, UBO, SSBO etc.
For example, a mesh shader could perform per-triangle culling in such a manner that it wouldn’t need to load data for primitives that are culled, therefore saving bandwidth.
The task shader is an optional stage which operates like a compute shader. Each task shader workgroup has two main purposes:
The “geometry amplification” is achieved by choosing to launch more (or fewer) mesh shader workgroups. As opposed to the fixed-function tessellator in the traditional pipeline, it is now entirely up to the application how to create the vertices.
While you could re-implement the old fixed-function tessellation with mesh shading, this may not actually be necessary and your application may work fine with some other simpler algorithm.
Another interesting use case for task shaders is per-meshlet culling, meaning that a task shader is a good place to decide which meshlets you actually want to render and eliminate entire mesh shader workgroups which would otherwise operate on invisible primitives.
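For context, both the mesh-only and the task+mesh variants of this pipeline are kicked off from the CPU with a single draw call. The sketch below uses the NV_mesh_shader vendor extension entry point; the command buffer, pipeline, function pointer and task count are application-provided placeholders of this example.

#include <vulkan/vulkan.h>

/* Hypothetical example: record a mesh draw. "taskCount" is how many task
 * shader workgroups to launch; each of them then decides how many mesh
 * shader workgroups it spawns and optionally fills in the payload. */
void record_mesh_draw(VkCommandBuffer cmd, VkPipeline meshPipeline,
                      PFN_vkCmdDrawMeshTasksNV pfnCmdDrawMeshTasksNV,
                      uint32_t taskCount)
{
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, meshPipeline);
    pfnCmdDrawMeshTasksNV(cmd, taskCount, 0 /* firstTask */);
}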
For now, I’ll just discuss a few basic use cases.
Meshlets. If your application loads input vertex data from somewhere, it is recommended that you subdivide that data (the “mesh”) into smaller chunks called “meshlets”. Then you can write your shaders such that each mesh shader workgroup processes a single meshlet.
Procedural geometry. Your application can generate all its vertices and primitives based on a mathematical formula that is implemented in the shader. In this case, you don’t need to load any inputs, just implement your formula as if you were writing a compute shader, then store the results into the mesh shader output arrays.
Replacing compute pre-passes. Many modern games use a compute pre-pass. They launch some compute shaders that do some pre-processing on the geometry before the graphics work. These are no longer necessary. The compute work can be made part of either the task or mesh shader, which removes the overhead of the additional submission.
Note that mesh shader workgroups may be launched as soon as the corresponding task shader workgroup is finished, so mesh shader execution (of the already finished tasks) may overlap with task shader execution, removing the need for extra synchronization on the application side.
Thus far, I’ve sold you on how awesome and flexible mesh shading is, so it’s time to ask the million dollar question.
The answer, as always, is: It depends.
Yes, mesh and task shaders do give you a lot of opportunities to implement things just the way you like them without the stupid hardware getting in your way, but as with any low-level tools, this also means that you get a lot of possibilities for shooting yourself in the foot.
The traditional vertex processing pipeline has been around for so long that on most hardware it’s extremely well optimized because the drivers do a lot of optimization work for you. Therefore, just because an app uses mesh shaders, doesn’t automatically mean that it’s going to be faster or better in any way. It’s only worth it if you are willing to do it well.
That being said, perhaps the easiest way to start experimenting with mesh shaders is to rewrite parts of your application that used to use geometry shaders. Geometry shaders are so horribly inefficient that it’ll be difficult to write a worse mesh shader.
Stay tuned for Part 2 if you are curious about that! In Part 2, I’m going to talk about how mesh and task shaders are implemented in a driver. This will shed some light on how these shaders work internally and why certain things perform really badly.
Sometimes you want to go and inspect details of the shaders that are used with specific draw calls in a frame. With RenderDoc this is really easy if the driver implements VK_KHR_pipeline_executable_properties. This extension allows applications to query the driver about various aspects of the executable code generated for a Vulkan pipeline.
I implemented this extension for V3DV, the Vulkan driver for Raspberry Pi 4, last week (it is currently in the review process) because I was tired of jumping through hoops to get the info I needed when looking at traces. For V3DV we expose the NIR and QPU assembly code as well as various other stats, some of which are quite relevant to performance, such as spill or thread counts.
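For reference, this is roughly how an application (or a tool) can enumerate those executables through the extension; the function pointer loading and the fixed-size array are choices of this example, while the entry point and struct names come from VK_KHR_pipeline_executable_properties itself.

#include <stdio.h>
#include <vulkan/vulkan.h>

/* Hypothetical example: list the executables the driver generated for a
 * pipeline. Statistics and internal representations (e.g. NIR/QPU) use
 * the sibling entry points and require the capture flags at pipeline
 * creation time. */
void list_pipeline_executables(VkDevice device, VkPipeline pipeline)
{
    PFN_vkGetPipelineExecutablePropertiesKHR pfnGetProps =
        (PFN_vkGetPipelineExecutablePropertiesKHR)
        vkGetDeviceProcAddr(device, "vkGetPipelineExecutablePropertiesKHR");

    VkPipelineInfoKHR pipelineInfo = {
        .sType = VK_STRUCTURE_TYPE_PIPELINE_INFO_KHR,
        .pipeline = pipeline,
    };

    uint32_t count = 0;
    pfnGetProps(device, &pipelineInfo, &count, NULL);

    VkPipelineExecutablePropertiesKHR props[16] = {0};
    if (count > 16)
        count = 16;
    for (uint32_t i = 0; i < count; i++)
        props[i].sType = VK_STRUCTURE_TYPE_PIPELINE_EXECUTABLE_PROPERTIES_KHR;
    pfnGetProps(device, &pipelineInfo, &count, props);

    for (uint32_t i = 0; i < count; i++)
        printf("%s: %s\n", props[i].name, props[i].description);
}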
TLDR: Hermetic /usr/ is awesome; let's popularize image-based OSes with modernized security properties built around immutability, SecureBoot, TPM2, adaptability, auto-updating, factory reset, uniformity – built from traditional distribution packages, but deployed via images.
Over the past years, systemd gained a number of components for building Linux-based operating systems. While these components individually have been adopted by many distributions and products for specific purposes, we did not publicly communicate a broader vision of how they should all fit together in the long run. In this blog story I hope to provide that from my personal perspective, i.e. explain how I personally would build an OS and where I personally think OS development with Linux should go.
I figure this is going to be a longer blog story, but I hope it will be equally enlightening. Please understand though that everything I write about OS design here is my personal opinion, and not one of my employer.
For the last 12 years or so I have been working on Linux OS development, mostly around systemd. In all those years I had a lot of time thinking about the Linux platform, and specifically traditional Linux distributions and their strengths and weaknesses. I have seen many attempts to reinvent Linux distributions in one way or another, to varying success. After all this most would probably agree that the traditional RPM or dpkg/apt-based distributions still define the Linux platform more than others (for 25+ years now), even though some Linux-based OSes (Android, ChromeOS) probably outnumber the installations overall.
And over all those 12 years I kept wondering, how would I actually build an OS for a system or for an appliance, and what are the components necessary to achieve that. And most importantly, how can we make these components generic enough so that they are useful in generic/traditional distributions too, and in other use cases than my own.
Before figuring out how I would build an OS it's probably good to figure out what type of OS I actually want to build, what purpose I intend to cover. I think a desktop OS is probably the most interesting. Why is that? Well, first of all, I use one of these for my job every single day, so I care immediately, it's my primary tool of work. But more importantly: I think building a desktop OS is one of the most complex overall OS projects you can work on, simply because desktops are so much more versatile and variable than servers or embedded devices. If one figures out the desktop case, I think there's a lot more to learn from, and reuse in the server or embedded case, than going the other way. After all, there's a reason why so much of the widely accepted Linux userspace stack comes from people with a desktop background (including systemd, BTW).
So, let's see how I would build a desktop OS. If you press me hard, and ask me why I would do that given that ChromeOS already exists and more or less is a Linux desktop OS: there's plenty I am missing in ChromeOS, but most importantly, I am a lot more interested in building something people can easily and naturally rebuild and hack on, i.e. Google-style over-the-wall open source with its skewed power dynamic is not particularly attractive to me. I much prefer building this within the framework of a proper open source community, out in the open, and basing all this strongly on the status quo ante, i.e. the existing distributions. I think it is crucial to provide a clear avenue to build a modern OS based on the existing distribution model, if there shall ever be a chance to make this interesting for a larger audience.
(Let me underline though: even though I am going to focus on a desktop here, most of this is directly relevant for servers as well, in particular container host OSes and suchlike, or embedded devices, e.g. car IVI systems and so on.)
First and foremost, I think the focus must be on an image-based design rather than a package-based one. For robustness and security it is essential to operate with reproducible, immutable images that describe the OS or large parts of it in full, rather than operating always with fine-grained RPM/dpkg style packages. That's not to say that packages are not relevant (I actually think they matter a lot!), but I think they should be less of a tool for deploying code and more one for building the objects to deploy. A different way to see this: any OS built like this must be easy to replicate in a large number of instances, with minimal variability. Regardless if we talk about desktops, servers or embedded devices: the focus for my OS should be on "cattle", not "pets", i.e. that from the start it's trivial to reuse the well-tested, cryptographically signed combination of software over a large set of devices the same way, with a maximum of bit-exact reuse and a minimum of local variances.
The trust chain matters, from the boot loader all the way to the apps. This means all code that is run must be cryptographically validated before it is run. All storage must be cryptographically protected: public data must be integrity checked; private data must remain confidential.
This is in fact where big distributions currently fail pretty badly. I would go as far as saying that SecureBoot on Linux distributions is mostly security theater at this point, if you so will. That's because the initrd that unlocks your FDE (i.e. the cryptographic concept that protects the rest of your system) is not signed or protected in any way. It's trivial to modify for an attacker with access to your hard disk in an undetectable way, and collect your FDE passphrase. The involved bureaucracy around the implementation of UEFI SecureBoot of the big distributions is to a large degree pointless if you ask me, given that once the kernel is assumed to be in a good state, as the next step the system invokes completely unsafe code with full privileges.
This is a fault of current Linux distributions though, not of SecureBoot in general. Other OSes use this functionality in more useful ways, and we should correct that too.
Pretty much the same thing: offline security matters. I want my data to be reasonably safe at rest, i.e. cryptographically inaccessible even when I leave my laptop in my hotel room, suspended.
Everything should be cryptographically measured, so that remote attestation is supported for as much software shipped on the OS as possible.
Everything should be self-descriptive, and have single sources of truth that are closely attached to the object itself, instead of stored externally.
Everything should be self-updating. Today we know that software is never bug-free, and thus requires a continuous update cycle. Not only the OS itself, but also any extensions, services and apps running on it.
Everything should be robust with respect to aborted OS operations, power loss and so on. It should be robust towards hosed OS updates (regardless of whether the download process failed or the image was buggy), and not require user interaction to recover from them.
There must always be a way to put the system back into a well-defined, guaranteed safe state ("factory reset"). This includes that all sensitive data from earlier uses becomes cryptographically inaccessible.
The OS should enforce clear separation between vendor resources, system resources and user resources: conceptually and when it comes to cryptographic protection.
Things should be adaptive: the system should come up and make the best of the system it runs on, adapt to the storage and hardware. Moreover, the system should support execution on bare metal equally well as execution in a VM environment and in a container environment (i.e. systemd-nspawn).
Things should not require explicit installation, i.e. every image should be a live image. For installation it should be sufficient to dd an OS image onto disk. Thus, strong focus on "instantiate on first boot", rather than "instantiate before first boot".
Things should be reasonably minimal. The image the system starts its life with should be quick to download, and not include resources that can as well be created locally later.
System identity, local cryptographic keys and so on should be generated locally, not pre-provisioned, so that no leak of sensitive data during transport onto the system is possible.
Things should be reasonably democratic and hackable. It should be easy to fork an OS, to modify an OS and still get reasonable cryptographic protection. Modifying your OS should not necessarily imply that your "warranty is voided" and you lose all good properties of the OS, if you so will.
Things should be reasonably modular. The privileged part of the core OS must be extensible, including on the individual system. It's not sufficient to support extensibility just through high-level UI applications.
Things should be reasonably uniform, i.e. ideally the same formats and cryptographic properties are used for all components of the system, regardless if for the host OS itself or the payloads it receives and runs.
Even taking all these goals into consideration, it should still be close to traditional Linux distributions, and take advantage of what they are really good at: integration and security update cycles.
Now that we know our goals and requirements, let's start designing the OS along these lines.
Hermetic /usr/
First of all the OS resources (code, data files, …) should be hermetic in an immutable /usr/. This means that a /usr/ tree should carry everything needed to set up the minimal set of directories and files outside of /usr/ to make the system work. This /usr/ tree can then be mounted read-only into the writable root file system that then will eventually carry the local configuration, state and user data in /etc/, /var/ and /home/ as usual.
Thankfully, modern distributions are surprisingly close to working without issues in such a hermetic context. Specifically, Fedora works mostly just fine: it has adopted the /usr/ merge and the declarative systemd-sysusers and systemd-tmpfiles components quite comprehensively, which means the directory trees outside of /usr/ are automatically generated as needed if missing. In particular /etc/passwd and /etc/group (and related files) are appropriately populated, should they be missing entries.
In my model a hermetic OS is hence comprehensively defined within /usr/: combine the /usr/ tree with an empty, otherwise unpopulated root file system, and it will boot up successfully, automatically adding the strictly necessary files and resources that are necessary to boot up.
Monopolizing vendor OS resources and definitions in an immutable /usr/ opens multiple doors to us:
We can apply dm-verity to the whole /usr/ tree, i.e. guarantee structural, cryptographic integrity on the whole vendor OS resources at once, with full file system metadata.
We can implement updates to the OS easily: by implementing an A/B update scheme on the /usr/ tree we can update the OS resources atomically and robustly, while leaving the rest of the OS environment untouched.
We can implement factory reset easily: erase the root file system and reboot. The hermetic OS in /usr/ has all the information it needs to set up the root file system afresh — exactly like in a new installation.
So let's have a look at a suitable partition table, taking a hermetic /usr/ into account. Let's conceptually start with a table of four entries:
An UEFI System Partition (required by firmware to boot)
Immutable, Verity-protected, signed file system with the /usr/ tree in version A
Immutable, Verity-protected, signed file system with the /usr/ tree in version B
A writable, encrypted root file system
(This is just for initial illustration here, as we'll see later it's going to be a bit more complex in the end.)
The Discoverable Partitions Specification provides suitable partition type UUIDs for all of the above partitions. Which is great, because it makes the image self-descriptive: simply by looking at the image's GPT table we know what to mount where. This means we do not need a manual /etc/fstab, and a multitude of tools such as systemd-nspawn and similar can operate directly on the disk image and boot it up.
Now that we have a rough idea how to organize the partition table, let's look a bit at how to boot into that. Specifically, in my model "unified kernels" are the way to go, specifically those implementing Boot Loader Specification Type #2. These are basically kernel images that have an initial RAM disk attached to them, as well as a kernel command line, a boot splash image and possibly more, all wrapped into a single UEFI PE binary. By combining these into one we achieve two goals: they become extremely easy to update (i.e. drop in one file, and you update kernel+initrd) and more importantly, you can sign them as one for the purpose of UEFI SecureBoot.
In my model, each version of such a kernel would be associated with exactly one version of the /usr/ tree: both are always updated at the same time. An update then becomes relatively simple: drop in one new /usr/ file system plus one kernel, and the update is complete.
The boot loader used for all this would be systemd-boot, of course. It's a very simple loader, and implements the aforementioned boot loader specification. This means it requires no explicit configuration or anything: it's entirely sufficient to drop in one such unified kernel file, and it will be picked up, and be made a candidate to boot into.
You might wonder how to configure the root file system to boot from with such a unified kernel that contains the kernel command line and is signed as a whole and thus immutable. The idea here is to use the usrhash= kernel command line option implemented by systemd-veritysetup-generator and systemd-fstab-generator. It does two things: it will search and set up a dm-verity volume for the /usr/ file system, and then mount it. It takes the root hash value of the dm-verity Merkle tree as the parameter. This hash is then also used to find the /usr/ partition in the GPT partition table, under the assumption that the partition UUIDs are derived from it, as per the suggestions in the discoverable partitions specification (see above).
systemd-boot (if not told otherwise) will do a version sort of the kernel image files it finds, and then automatically boot the newest one. Picking a specific kernel to boot will also fixate which version of the /usr/ tree to boot into, because — as mentioned — the Verity root hash of it is built into the kernel command line the unified kernel image contains.
In my model I'd place the kernels directly into the UEFI System Partition (ESP), in order to simplify things. (systemd-boot also supports reading them from a separate boot partition, but let's not complicate things needlessly, at least for now.)
So, with all this, we now already have a boot chain that goes something like this: once the boot loader is run, it will pick the newest kernel, which includes the initial RAM disk and a secure reference to the /usr/ file system to use. This is already great. But a /usr/ alone won't make us happy, we also need a root file system. In my model, that file system would be writable, and the /etc/ and /var/ hierarchies would be located directly on it. Since these trees potentially contain secrets (SSH keys, …) the root file system needs to be encrypted. We'll use LUKS2 for this, of course. In my model, I'd bind this to the TPM2 chip (for compatibility with systems lacking one, we can find a suitable fallback, which then provides weaker guarantees, see below). A TPM2 is a security chip available in most modern PCs. Among other things it contains a persistent secret key that can be used to encrypt data, in a way that you can only decrypt it again if you possess access to the chip and can prove you are using validated software. The cryptographic measuring I mentioned earlier is what allows this to work. But … let's not get lost too much in the details of TPM2 devices, that'd be material for a novel, and this blog story is going to be way too long already.
What does using a TPM2 bound key for unlocking the root file system get us? We can encrypt the root file system with it, and you can only read or make changes to the root file system if you also possess the TPM2 chip and run our validated version of the OS. This protects us against an evil maid scenario to some level: an attacker cannot just copy the hard disk of your laptop while you leave it in your hotel room, because unless the attacker also steals the TPM2 device it cannot be decrypted. The attacker can also not just modify the root file system, because such changes would be detected on next boot because they aren't done with the right cryptographic key.
So, now we have a system that already can boot up somewhat completely, and run userspace services. All code that is run is verified in some way: the /usr/ file system is Verity protected, and the root hash of it is included in the kernel that is signed via UEFI SecureBoot. And the root file system is locked to the TPM2 where the secret key is only accessible if our signed OS + /usr/ tree is used.
(One brief intermission here: so far all the components I am referencing here exist already, and have been shipped in systemd and other projects already, including the TPM2 based disk encryption. There's one thing missing here however at the moment that still needs to be developed (happy to take PRs!): right now TPM2 based LUKS2 unlocking is bound to PCR hash values. This is hard to work with when implementing updates — what we'd need instead is unlocking by signatures of PCR hashes. TPM2 supports this, but we don't support it yet in our systemd-cryptsetup + systemd-cryptenroll stack.)
One of the goals mentioned above is that cryptographic key material should always be generated locally on first boot, rather than pre-provisioned. This of course has implications for the encryption key of the root file system: if we want to boot into this system we need the root file system to exist, and thus a key already generated that it is encrypted with. But where precisely would we generate it, given that we have no installer which could generate it while installing (as is done in traditional Linux distribution installers)? My proposed solution here is to use systemd-repart, which is a declarative, purely additive repartitioner. It can run from the initrd to create and format partitions on boot, before transitioning into the root file system. It can also format the partitions it creates and encrypt them, automatically enrolling a TPM2-bound key.
So, let's revisit the partition table we mentioned earlier. Here's what in my model we'd actually ship in the initial image:
An UEFI System Partition (ESP)
An immutable, Verity-protected, signed file system with the /usr/ tree in version A
And that's already it. No root file system, no B /usr/ partition, nothing else. Only two partitions are shipped: the ESP with the systemd-boot loader and one unified kernel image, and the A version of the /usr/ partition. Then, on first boot systemd-repart will notice that the root file system doesn't exist yet, and will create it, encrypt it, and format it, and enroll the key into the TPM2. It will also create the second /usr/ partition (B) that we'll need for later A/B updates (which will be created empty for now, until the first update operation actually takes place, see below). Once done the initrd will combine the fresh root file system with the shipped /usr/ tree, and transition into it. Because the OS is hermetic in /usr/ and contains all the systemd-tmpfiles and systemd-sysusers information it can then set up the root file system properly and create any directories and symlinks (and maybe a few files) necessary to operate.
Besides the fact that the root file system's encryption keys are generated on the system we boot from and never leave it, it is also pretty nice that the root file system will be sized dynamically, taking into account the physical size of the backing storage. This is perfect, because on first boot the image will automatically adapt to what it has been dd'ed onto.
This is a good point to talk about the factory reset logic, i.e. the mechanism to place the system back into a known good state. This is important for two reasons: in our laptop use case, once you want to pass the laptop to someone else, you want to ensure your data is fully and comprehensively erased. Moreover, if you have reason to believe your device was hacked you want to revert the device to a known good state, i.e. ensure that exploits cannot persist. systemd-repart already has a mechanism for it. In the declarations of the partitions the system should have, entries may be marked as candidates for erasing on factory reset. The actual factory reset is then requested by one of two means: by specifying a specific kernel command line option (which is not too interesting here, given we lock that down via UEFI SecureBoot; but then again, one could also add a second kernel to the ESP that is identical to the first, with the only difference that it lists this command line option: thus when the user selects this entry it will initiate a factory reset) — or via an EFI variable that can be set and is honoured on the immediately following boot. So here's how a factory reset would then go down: once the factory reset is requested it's enough to reboot. On the subsequent boot systemd-repart runs from the initrd, where it will honour the request and erase the partitions marked for erasing. Once that is complete the system is back in the state we shipped the system in: only the ESP and the /usr/ file system will exist, but the root file system is gone. And from here we can continue as on the original first boot: create a new root file system (and any other partitions), and encrypt/set it up afresh.
So now we have a nice setup, where everything is either signed or encrypted securely. The system can adapt to the system it is booted on automatically on first boot, and can easily be brought back into a well defined state identical to the way it was shipped in.
But of course, such a monolithic, immutable system is only useful for very specific purposes. If /usr/ can't be written to – at least in the traditional sense – one cannot just go and install a new software package that one needs. So here two goals are superficially conflicting: on one hand one wants modularity, i.e. the ability to add components to the system, and on the other immutability, i.e. that precisely this is prohibited.
So let's see what I propose as a middle ground in my model. First, what's the precise use case for such modularity? I see a couple of different ones:
For some cases it is necessary to extend the system itself at the lowest level, so that the components added in extend (or maybe even replace) the resources shipped in the base OS image, so that they live in the same namespace, and are subject to the same security restrictions and privileges. Exposure to the details of the base OS and its interface for this kind of modularity is at the maximum.
Example: a module that adds a debugger or tracing tools into the system. Or maybe an optional hardware driver module.
In other cases, more isolation is preferable: instead of extending the system resources directly, additional services shall be added in that bring their own files, can live in their own namespace (but with "windows" into the host namespaces), however still are system components, and provide services to other programs, whether local or remote. Exposure to the details of the base OS for this kind of modularity is restricted: it mostly focuses on the ability to consume and provide IPC APIs from/to the system. Components of this type can still be highly privileged, but the level of integration is substantially smaller than for the type explained above.
Example: a module that adds a specific VPN connection service to the OS.
Finally, there's the actual payload of the OS. This stuff is relatively isolated from the OS and definitely from each other. It mostly consumes OS APIs, and generally doesn't provide OS APIs. This kind of stuff runs with minimal privileges, and in its own namespace of concepts.
Example: a desktop app, for reading your emails.
Of course, the lines between these three types of modules are blurry, but I think distinguishing them does make sense, as I think different mechanisms are appropriate for each. So here's what I'd propose in my model to use for this.
For the system extension case I think the systemd-sysext images are appropriate. This tool operates on system extension images that are very similar to the host's disk image: they also contain a /usr/ partition, protected by Verity. However, they just include additions to the host image: binaries that extend the host. When such a system extension image is activated, it is merged via an immutable overlayfs mount into the host's /usr/ tree. Thus any file shipped in such a system extension will suddenly appear as if it was part of the host OS itself. For optional components that should be considered part of the OS more or less this is a very simple and powerful way to combine an immutable OS with an immutable extension. Note that most likely extensions for an OS matching this tool should be built at the same time within the same update cycle scheme as the host OS itself. After all, the files included in the extensions will have dependencies on files in the system OS image, and care must be taken that these dependencies remain in order.
For adding in additional somewhat isolated system services in my model, Portable Services are the proposed tool of choice. Portable services are in most ways just like regular system services; they could be included in the system OS image or an extension image. However, portable services use RootImage= to run off separate disk images, thus within their own namespace. Images set up this way have various ways to integrate into the host OS, as they are in most ways regular system services, which just happen to bring their own directory tree. Also, unlike regular system services, for them sandboxing is opt-out rather than opt-in. In my model, here too the disk images are Verity protected and thus immutable. Just like the host OS they are GPT disk images that come with a /usr/ partition and Verity data, along with signing.
Finally, the actual payload of the OS, i.e. the apps. To be useful in real life here it is important to hook into existing ecosystems, so that a large set of apps are available. Given that on Linux flatpak (or on servers OCI containers) are the established formats that pretty much won, they are probably the way to go. That said, I think both of these mechanisms have relatively weak properties, in particular when it comes to security, since immutability/measurements and similar are not provided. This means, unlike for system extensions and portable services, a complete trust chain with attestation and per-app cryptographically protected data is much harder to implement sanely.
What I'd like to underline here is that the main system OS image, as well as the system extension images and the portable service images are put together the same way: they are GPT disk images, with one immutable file system and associated Verity data. The latter two should also contain a PKCS#7 signature for the top-level Verity hash. This uniformity has many benefits: you can use the same tools to build and process these images, but most importantly: by using a single way to validate them throughout the stack (i.e. Verity, in the latter cases with PKCS#7 signatures), validation and measurement is straightforward. In fact it's so obvious that we don't even have to implement it in systemd: the kernel has direct support for this Verity signature checking natively already (IMA).
So, by composing a system at runtime from a host image, extension images and portable service images we have a nicely modular system where every single component is cryptographically validated on every single IO operation, and every component is measured, in its entire combination, directly in the kernel's IMA subsystem.
(Of course, once you add the desktop apps or OCI containers on top, then these properties are lost further down the chain. But well, a lot is already won, if you can close the chain that far down.)
Note that system extensions are not designed to replicate the fine-grained packaging logic of RPM/dpkg. Of course, systemd-sysext is a generic tool, so you can use it for whatever you want, but there's a reason it does not bring support for a dependency language: the goal here is not to replicate traditional Linux packaging (we have that already, in RPM/dpkg, and I think they are actually OK for what they do) but to provide delivery of larger, coarser sets of functionality, in lockstep with the underlying OS' life-cycle and in particular with no interdependencies, except on the underlying OS.
Also note that depending on the use case it might make sense to also use system extensions to modularize the initrd step. This is probably less relevant for a desktop OS, but for server systems it might make sense to package up support for specific complex storage in a systemd-sysext system extension, which can be applied to the initrd that is built into the unified kernel. (In fact, we have been working on bringing signed yet modular initrd support to general purpose Fedora this way.)
Note that portable services are composable from system extensions too, by the way. This makes them even more useful, as you can share a common runtime between multiple portable services, or even use the host image as a common runtime for portable services. In this model a common runtime image is shared between one or more system extensions, and composed at runtime via an overlayfs instance.
Having an immutable, cryptographically locked down host OS is great I think, and if we have some moderate modularity on top, that's also great. But oftentimes it's useful to be able to depart/compromise for some specific use cases from that, i.e. provide a bridge for example to allow workloads designed around RPM/dpkg package management to coexist reasonably nicely with such an immutable host.
For this purpose in my model I'd propose using systemd-nspawn containers. The containers are focused on OS containerization, i.e. they allow you to run a full OS with init system and everything as payload (unlike for example Docker containers which focus on a single service, and where running a full OS in it is a mess).
Running systemd-nspawn containers for such secondary OS installs has various nice properties. One of course is that systemd-nspawn supports the same level of cryptographic image validation that we rely on for the host itself. Thus, to some level the whole OS trust chain is reasonably recursive if desired: the firmware validates the OS, and the OS can validate a secondary OS installed within it. In fact, we can run our trusted OS recursively on itself and get similar security guarantees!
Besides these security aspects, systemd-nspawn also has really nice properties when it comes to integration with the host. For example the --bind-user= switch permits binding a host user record and their directory into a container as a simple one-step operation. This makes it extremely easy to have a single user and $HOME but share it concurrently with the host and a zoo of secondary OSes in systemd-nspawn containers, each of which could even run a different distribution.
Superficially, an OS with an immutable /usr/ appears much less hackable than an OS where everything is writable. Moreover, an OS where everything must be signed and cryptographically validated makes it hard to insert your own code, given you are unlikely to possess access to the signing keys.
To address this issue other systems have supported a "developer" mode: when entered the security guarantees are disabled, and the system can be freely modified, without cryptographic validation. While that's a great concept to have I doubt it's what most developers really want: the cryptographic properties of the OS are great after all, it sucks having to give them up once developer mode is activated.
In my model I'd thus propose two different approaches to this problem. First of all, I think there's value in allowing users to additively extend/override the OS via local developer system extensions. With this scheme the underlying cryptographic validation would remain intact, but — if this form of development mode is explicitly enabled – the developer could add in more resources from local storage that are not tied to the OS builder's chain of trust, but to a local one (i.e. simply backed by encrypted storage of some form).
The second approach is to make it easy to extend (or in fact replace) the set of trusted validation keys with local ones that are under the control of the user, in order to make it easy to operate with kernel, OS, extension, portable service or container images signed by the local developer without involvement of the OS builder. This is relatively easy to do for components down the trust chain, i.e. the elements further up the chain should optionally accept additional certificates to validate against.
(Note that systemd currently has no explicit support for a "developer" mode like this. I think we should add that sooner or later however.)
Closely related to the question of developer mode is the question of code signing. If you ask me, the status quo of UEFI SecureBoot code signing in the major Linux distributions is pretty sad. The work to get stuff signed is massive, but in effect it delivers very little in return: because initrds are entirely unprotected, and reside on partitions lacking any form of cryptographic integrity protection, any attacker can trivially modify the boot process of any such Linux system and freely collect FDE passphrases as they are entered. There's little value in signing the boot loader and kernel in a complex bureaucracy if they then happily load entirely unprotected code that processes the actually relevant security credentials: the FDE keys.
In my model, through the use of unified kernels this important gap is closed, and UEFI SecureBoot code signing becomes an integral part of the boot chain from firmware to the host OS. Unfortunately, code signing and having something a user can locally hack are to some level conflicting. However, I think we can improve the situation here, and put more emphasis on enrolling developer keys in the trust chain easily. Specifically, I see one relevant approach here: enrolling keys directly in the firmware is something that we should make less of a theoretical exercise and more something we can realistically deploy. See this work in progress making this more automatic and eventually safe. Other approaches are thinkable (including some that build on the existing MokManager infrastructure), but given the politics involved they are harder to conclusively implement.
What I explain above is put together with running on a bare metal system in mind. However, one of the stated goals is to make the OS adaptive enough to also run in a container environment (specifically: systemd-nspawn) nicely. Booting a disk image on bare metal or in a VM generally means that the UEFI firmware validates and invokes the boot loader, and the boot loader invokes the kernel which then transitions into the final system. This is different for containers: here the container manager immediately calls the init system, i.e. PID 1. Thus the validation logic must be different: cryptographic validation must be done by the container manager. In my model this is solved by shipping the OS image not only with a Verity data partition (as is already necessary for the UEFI SecureBoot trust chain, see above), but also with another partition, containing a PKCS#7 signature of the root hash of said Verity partition. This of course is exactly what I propose for both the system extension and portable service image. Thus, in my model the images for all three uses are put together the same way: an immutable /usr/ partition, accompanied by a Verity partition and a PKCS#7 signature partition. The OS image itself then has two ways "into" the trust chain: either through the signed unified kernel in the ESP (which is used for bare metal and VM boots) or by using the PKCS#7 signature stored in the partition (which is used for container/systemd-nspawn boots).
A fully immutable and signed OS has to establish trust in the user data it makes use of before doing so. In the model I describe here, for /etc/ and /var/ we do this via disk encryption of the root file system (in combination with integrity checking). But the point where the root file system is mounted comes relatively late in the boot process, and thus cannot be used to parameterize the boot itself. In many cases it's important to be able to parameterize the boot process however.
For example, for the implementation of the developer mode indicated above it's useful to be able to pass this fact safely to the initrd, in combination with other fields (e.g. a hashed root password for allowing in-initrd logins for debug purposes). After all, if the initrd is pre-built by the vendor and signed as a whole together with the kernel, it cannot be modified to carry such data directly (which is in fact how parameterizing of the initrd was traditionally done to a large degree).
In my model this is achieved through system credentials, which allow passing parameters to systems (and services for the matter) in an encrypted and authenticated fashion, bound to the TPM2 chip. This means that we can securely pass data into the initrd so that it can be authenticated and decrypted only on the system it is intended for and with the unified kernel image it was intended for.
In my model the OS would also carry a swap partition. For the simple reason that only then can systemd-oomd.service provide the best results. Also see In defence of swap: common misconceptions.
We have a rough idea how the system shall be organized now, let's next focus on the deployment cycle: software needs regular update cycles, and software that is not updated regularly is a security problem. Thus, I am sure that any modern system must be automatically updated, without this requiring avoidable user interaction.
In my model, this is the job for systemd-sysupdate. It's a relatively simple A/B image updater: it operates either on partitions, on regular files in a directory, or on subdirectories in a directory. Each entry has a version (which is encoded in the GPT partition label for partitions, and in the filename for regular files and directories): whenever an update is initiated the oldest version is erased, and the newest version is downloaded.
With the setup described above a system update becomes a really simple operation. On each update the systemd-sysupdate tool downloads a /usr/ file system partition, an accompanying Verity partition and a PKCS#7 signature partition, and drops them into the host's partition table (where they possibly replace the oldest version so far stored there). Then it downloads a unified kernel image and drops it into the EFI System Partition's /EFI/Linux (as per the Boot Loader Specification; possibly erasing the oldest such file there). And that's already the whole update process: four files are downloaded from the server, unpacked and put in the most straightforward of ways into the partition table or file system. Unlike in other OS designs there's no mechanism required to explicitly switch to the newer version; the aforementioned systemd-boot logic will automatically pick the newest kernel once it is dropped in.
Above we talked a lot about modularity, and how to put systems together as a combination of a host OS image, system extension images for the initrd and the host, portable service images and systemd-nspawn container images. I already emphasized that these image files are actually always the same: GPT disk images with partition definitions that match the Discoverable Partition Specification. This comes in very handy when thinking about updating: we can use the exact same systemd-sysupdate tool for updating these other images as we use for the host image. The uniformity of the on-disk format allows us to update them uniformly too.
Automatic OS updates do not come without risks: if they happen automatically and an update goes wrong, your system might be automatically updated into a brick. This of course is less than ideal. Hence it is essential to address this reasonably automatically. In my model, there's systemd's Automatic Boot Assessment for that. The mechanism is simple: whenever a new unified kernel image is dropped into the system it will be stored with a small integer counter value included in the filename. Whenever the unified kernel image is selected for booting by systemd-boot, the counter is decreased by one. Once the system booted up successfully (which is determined by userspace) the counter is removed from the file name (which indicates "this entry is known to work"). If the counter ever hits zero, this indicates that the system tried to boot it a couple of times and failed each time, i.e. the entry is apparently "bad". In this case systemd-boot will not consider the kernel anymore, and will revert to the next older one (that doesn't have a counter of zero).
By sticking the boot counter into the filename of the unified kernel
we can directly attach this information to the kernel, and thus need
not concern ourselves with cleaning up secondary information about the
kernel when the kernel is removed. Updating with a tool like systemd-sysupdate hence remains a very simple operation: drop one old file, add one new file.
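To illustrate the counter mechanics, here is a hedged sketch of what effectively happens to the file name on disk; the fooOS name is made up, and in reality the renames are done by systemd-boot and systemd-bless-boot.service, not by hand:
# freshly installed by the update tool: three boot attempts left
cp fooOS_0.8.efi /efi/EFI/Linux/fooOS_0.8+3.efi
# after one failed boot attempt: two tries left, one used
mv /efi/EFI/Linux/fooOS_0.8+3.efi /efi/EFI/Linux/fooOS_0.8+2-1.efi
# after a successful boot the counter is dropped: the entry is "known good"
mv /efi/EFI/Linux/fooOS_0.8+2-1.efi /efi/EFI/Linux/fooOS_0.8.efi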
I already mentioned that systemd-boot
automatically picks the newest
unified kernel image to boot, by looking at the version encoded in the
filename. This is done via a simple
strverscmp()
call (well, truth be told, it's a modified version of that call,
different from the one implemented in libc, because real-life package
managers use more complex rules for comparing versions these days, and
hence it made sense to do that here too). Having multiple entries of some resource in a directory and automatically picking the newest one is a powerful concept, I think. It means adding/removing versions is extremely easy (as we discussed above, in the systemd-sysupdate context), and allows stateless determination of what to use.
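As a rough shell approximation of that idea (sort -V is not the modified strverscmp() systemd actually uses, but it illustrates the "pick the newest by name" concept):
# list all installed unified kernels and pick the highest version
ls /efi/EFI/Linux/*.efi | sort -V | tail -n 1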
If systemd-boot
can do that, what about system extension images,
portable service images, or systemd-nspawn
container images that do
not actually use systemd-boot
as the entrypoint? All these tools
actually implement the very same logic, but on the partition level: if
multiple suitable /usr/ partitions exist, then the newest is determined by comparing their GPT partition labels.
This is in a way the counterpart to the systemd-sysupdate
update
logic described above: we always need a way to determine which partition to actually use after the update took place, and that is always easy: enumerate the possible entries, pick the newest as per the (modified) strverscmp() result.
In my model the device's users and their home directories are managed
by
systemd-homed
. This
means they are relatively self-contained and can be migrated easily
between devices. The numeric UID assignment for each user is done at
the moment of login only, and the files in the home directory are
mapped as needed via a uidmap
mount. It also allows us to protect
the data of each user individually with a credential that belongs to the user itself: instead of binding the confidentiality of the user's data to the system-wide full-disk encryption, each user gets their own encrypted home directory where the user's authentication token (password, FIDO2 token, PKCS#11 token, recovery key…) is used as authentication and decryption key for the user's data. This brings
a major improvement for security as it means the user's data is
cryptographically inaccessible except when the user is actually logged
in.
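For completeness, here's what creating such a self-contained, individually encrypted home area looks like with homectl; the user name and size are of course just examples:
# create a LUKS-encrypted home directory, unlocked by the user's own credentials
homectl create testuser --storage=luks --disk-size=100G
# inspect the resulting user record and storage backend
homectl inspect testuser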
It also allows us to correct another major issue with traditional Linux systems: the way data encryption works during system suspend. Traditionally on Linux the disk encryption credentials (e.g. the LUKS passphrase) are kept in memory even while the system is suspended. This is a bad choice for security, since many (most?) of us probably never turn off our laptops but suspend them instead. But if the decryption key is always present in unencrypted form while the system is suspended, then it could potentially be read from there by a sufficiently equipped attacker.
By encrypting the user's home directory with the user's authentication token we can first safely "suspend" the home directory before going to the system suspend state (i.e. flush out the cryptographic keys needed to access it). This means any process currently accessing the home directory will be frozen for the time of the suspend, but that's expected anyway during a system suspend cycle. Why is this better than the status quo ante? In this model the home directory's cryptographic key material is erased during suspend, but it can be safely reacquired on resume, from system code. If the system is only encrypted as a whole however, then the system code itself couldn't reauthenticate the user, because it would be frozen too. By separating home directory encryption from the root file system encryption we can avoid this problem.
So we discussed the organization of the OS images' partitions multiple times in the above, each time focusing on a specific aspect. Let's now summarize how this should look all together.
In my model, the initial, shipped OS image should look roughly like this:
- an EFI System Partition, with systemd-boot as boot loader and one unified kernel
- a /usr/ partition (version "A"), with a label fooOS_0.7 (under the assumption we called our project fooOS and the image version is 0.7)
- a Verity partition for that /usr/ partition (version "A"), with the same label
- a partition carrying the Verity root hash for that /usr/ partition (version "A"), along with a PKCS#7 signature of it, also with the same label
On first boot this is augmented by systemd-repart like this:
- a second /usr/ partition (version "B"), initially with the label _empty (which is the label systemd-sysupdate uses to mark partitions that currently carry no valid payload)
- a second Verity partition (version "B"), initially with the label _empty
- a second Verity signature partition (version "B"), initially with the label _empty
- an encrypted root file system partition
- an encrypted swap partition
- a /home/ partition, left unencrypted by systemd-repart itself (systemd-homed adds that on its own, and it's nice to avoid duplicate encryption)
Then, on the first OS update the partitions 5, 6, 7 are filled with a
new version of the OS (let's say 0.8
) and thus get their label
updated to fooOS_0.8
. After a boot, this version is active.
On a subsequent update the three partitions fooOS_0.7
get wiped and
replaced by fooOS_0.9
and so on.
On factory reset, the partitions 8, 9, 10 are deleted, so that
systemd-repart
recreates them, using a new set of cryptographic
keys.
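As a hedged sketch, one of the systemd-repart definitions behind the partitions added on first boot could look roughly like this, here for the encrypted root file system; the file name is arbitrary and the exact option set depends on the systemd version:
cat > /usr/lib/repart.d/50-root.conf <<'EOF'
[Partition]
Type=root
Format=btrfs
Encrypt=tpm2
FactoryReset=yes
EOF
Partitions marked with FactoryReset=yes are the ones deleted and recreated (with fresh cryptographic keys) when a factory reset is requested, which matches the behaviour described above.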
Here's a graphic that hopefully illustrates how the partition table evolves from the shipped image, through first boot, multiple update cycles and eventual factory reset.
So let's summarize the intended chain of trust (for bare metal/VM boots) that ensures every piece of code in this model is signed and validated, and any system secret is locked to TPM2.
First, firmware (or possibly shim) authenticates systemd-boot
.
Once systemd-boot
picks a unified kernel image to boot, it is
also authenticated by firmware/shim.
The unified kernel image contains an initrd, which is the first userspace component that runs. It finds any system extensions passed into the initrd, and sets them up through Verity. The kernel will validate the Verity root hash signature of these system extension images against its usual keyring.
The initrd also finds credentials passed in, then securely unlocks (which means: decrypts + authenticates) them with a secret from the TPM2 chip, locked to the kernel image itself.
The kernel image also contains a kernel command line which contains
a usrhash=
option that pins the root hash of the /usr/
partition
to use.
The initrd then unlocks the encrypted root file system, with a secret bound to the TPM2 chip.
The system then transitions into the main system, i.e. the
combination of the Verity protected /usr/
and the encrypted root
files system. It then activates two more encrypted (and/or
integrity protected) volumes for /home/
and swap, also with a
secret tied to the TPM2 chip.
Here's an attempt to illustrate the above graphically:
This is the trust chain of the basic OS. Validation of system
extension images, portable service images, systemd-nspawn
container
images always takes place the same way: the kernel validates these
Verity images along with their PKCS#7 signatures against the kernel's
keyring.
In the above I left the choice of file systems unspecified. For the
immutable /usr/
partitions squashfs
might be a good candidate, but
any other that works nicely in a read-only fashion and generates
reproducible results is a good choice, too. The home directories as managed by systemd-homed should certainly use btrfs, because it's the only general purpose file system supporting online grow and shrink, which systemd-homed can take advantage of to manage storage.
For the root file system btrfs
is likely also the best idea. That's
because we intend to use LUKS/dm-crypt
underneath, which by default
only provides confidentiality, not authenticity of the data (unless
combined with dm-integrity
). Since btrfs
(unlike xfs/ext4) does
full data checksumming it's probably the best choice here, since it
means we don't have to use dm-integrity
(which comes at a higher
performance cost).
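As a minimal sketch of what producing such an immutable /usr/ image could look like at build time (simplified; in the full model the printed root hash would additionally be signed and the signature stored in its own partition):
# pack the /usr/ tree into a read-only squashfs image
mksquashfs usr-tree/ usr.squashfs -noappend
# generate the dm-verity hash tree; the printed root hash is what usrhash= on the kernel command line pins
veritysetup format usr.squashfs usr.verity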
In the discussion above a lot of focus was put on setting up the OS
and completing the partition layout and such on first boot. This means
installing the OS becomes as simple as dd
-ing (i.e. "streaming") the
shipped disk image into the final HDD medium. Simple, isn't it?
Of course, such a scheme is just too simple for many setups in real
life. Whenever multi-boot is required (i.e. co-installing an OS
implementing this model with another unrelated one), dd
-ing a disk
image onto the HDD is going to overwrite user data that was supposed
to be kept around.
In order to cover for this case, in my model, we'd use
systemd-repart
(again!) to allow streaming the source disk image
into the target HDD in a smarter, additive way. The tool after all is
purely additive: it will add in partitions or grow them if they are
missing or too small. systemd-repart
already has all the necessary
provisions to not only create a partition on the target disk, but also
copy blocks from a raw installer disk. An install operation would then become a two-step process: one invocation of systemd-repart
that
adds in the /usr/
, its Verity and the signature partition to the
target medium, populated with a copy of the same partition of the
installer medium. And one invocation of bootctl
that installs the
systemd-boot
boot loader in the ESP. (Well, there's one thing
missing here: the unified OS kernel also needs to be dropped into the
ESP. For now, this can be done with a simple cp
call. In the long
run, this should probably be something bootctl
can do as well, if
told so.)
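Assuming repart definitions that reference the installer medium via CopyBlocks=, the two-step install described above could look roughly like this; device names and paths are placeholders:
# step 1: add the /usr/, Verity and signature partitions to the target disk,
# populating them from the installer medium
systemd-repart --definitions=/path/to/install.repart.d --dry-run=no /dev/sda
# step 2: install the systemd-boot boot loader into the ESP
bootctl install --esp-path=/mnt/esp
# for now, the unified kernel image is dropped in manually
cp fooOS_0.7.efi /mnt/esp/EFI/Linux/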
So, with this scheme we have a simple scheme to cover all bases: we
can either just dd
an image to disk, or we can stream an image onto
an existing HDD, adding a couple of new partitions and files to the
ESP.
Of course, in reality things are more complex than that even: there's
a good chance that the existing ESP is simply too small to carry
multiple unified kernels. In my model, the way to address this is by
shipping two slightly different systemd-repart
partition definition
file sets: the ideal case when the ESP is large enough, and a
fallback case, where it isn't and where we then add in an additional
XBOOTLDR partition (as per the Discoverable Partitions
Specification). In that mode the ESP carries the boot loader, but the
unified kernels are stored in the XBOOTLDR partition. This scenario is
not quite as simple as the XBOOTLDR-less scenario described first, but
is equally well supported in the various tools. Note that
systemd-repart
can be told size constraints on the partitions it
shall create or augment, thus to implement this scheme it's enough to
invoke the tool with the fallback partition scheme if invocation with
the ideal scheme fails.
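In shell terms that fallback logic can be as simple as trying the ideal definition set first and falling back to the XBOOTLDR variant if it fails; the directory names here are made up:
systemd-repart --definitions=/usr/lib/repart.ideal.d --dry-run=no /dev/sda ||
    systemd-repart --definitions=/usr/lib/repart.xbootldr.d --dry-run=no /dev/sda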
Either way: regardless how the partitions, the boot loader and the
unified kernels ended up on the system's hard disk, on first boot the
code paths are the same again: systemd-repart
will be called to
augment the partition table with the root file system, and properly
encrypt it, as was already discussed earlier here. This means: all
cryptographic key material used for disk encryption is generated on
first boot only, the installer phase does not encrypt anything.
Traditionally on Linux three types of systems were common: "installed" systems, i.e. systems that are stored on the main storage of the device and are the primary place people spend their time in; "installer" systems, which are used to install them and whose job is to copy and set up the packages that make up the installed system; and "live" systems, which were a middle ground: a system that behaves like an installed system in most ways, but lives on removable media.
In my model I'd like to remove the distinction between these three
concepts as much as possible: each of these three images should carry
the exact same /usr/
file system, and should be suitable to be
replicated the same way. Once installed the resulting image can also
act as an installer for another system, and so on, creating a certain
"viral" effect: if you have one image or installation it's
automatically something you can replicate 1:1 with a simple
systemd-repart
invocation.
The above explains how the image should look and how its first boot and update cycle will modify it. But this leaves one question unanswered: how to actually build the initial image for OS instances according to this model?
Note that there's nothing too special about the images following this model: they are ultimately just GPT disk images with Linux file systems, following the Discoverable Partition Specification. This means you can use any set of tools of your choice that can put together compliant GPT disk images.
I personally would use mkosi
for
this purpose though. It's designed to generate compliant images, and
has a rich toolset for SecureBoot and signed/Verity file systems
already in place.
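A hedged sketch of building and test-driving such an image with mkosi; option and verb names have shifted a bit between mkosi versions, so treat this as illustrative only:
# build a bootable GPT disk image for the chosen distribution
mkosi --distribution=fedora --bootable=yes build
# boot the freshly built image in a local VM to give it a spin
mkosi qemu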
What is key here is that this model doesn't depart from RPM and dpkg, instead it builds on top of them: in this model they are excellent for putting together images on the build host, but deployment onto the runtime host does not involve individual packages.
I think one should not underestimate the value traditional distributions bring regarding security, integration and general polish. The concepts I describe above are inherited from this, but depart from the idea that distribution packages are a runtime concept, making them a build-time concept instead.
Note that the above is pretty much independent from the underlying distribution.
I have no illusions, general purpose distributions are not going to adopt this model as their default any time soon, and it's not even my goal that they do that. The above is my personal vision, and I don't expect people to buy into it 100%, and that's fine. However, what I am interested in is finding the overlaps, i.e. work with people who buy 50% into this vision, and share the components.
My goals here thus are to:
Get distributions to move to a model where images like this can be
built from the distribution easily. Specifically this means that
distributions make their OS hermetic in /usr/
.
Find the overlaps, share components with other projects to revisit
how distributions are put together. This is already happening, see
systemd-tmpfiles
and systemd-sysusers
support in various
distributions, but I think there's more to share.
Make people interested in building actual real-world images based on general purpose distributions adhering to the model described above. I'd love a "GnomeBook" image with full trust properties, that is built from true Linux distros, such as Fedora or ArchLinux.
What about ostree
? Doesn't ostree
already deliver what this blog story describes?
ostree
is fine technology, but in respect to security and
robustness properties it's not too interesting I think, because
unlike image-based approaches it cannot really deliver
integrity/robustness guarantees over the whole tree easily. To be
able to trust an ostree
setup you have to establish trust in the
underlying file system first, and the complexity of the file
system makes that challenging. To provide an effective
offline-secure trust chain through the whole depth of the stack it
is essential to cryptographically validate every single I/O
operation. In an image-based model this is trivially easy, but in the ostree model it's not possible with current file system technology, and even if this is added in one way or another in the future (though I am not aware of anyone doing on-access file-based integrity that spans a whole hierarchy of files and is compatible with ostree's hardlink farm model), I think validation would still be at too high a level, since Linux file system developers have made very clear that their implementations are not robust to rogue images. (There's this stuff planned, but doing structural authentication ahead of time instead of on access makes the idea too weak, and I'd expect too slow, in my eyes.)
With my design I want to deliver similar security guarantees as
ChromeOS does, but ostree
is much weaker there, and I see no
perspective of this changing. In a way ostree
's integrity checks
are similar to RPM's and enforced on download rather than on
access. In the model I suggest above, it's always on access, and
thus safe towards offline attacks (i.e. evil maid attacks). In
today's world, I think offline security is absolutely necessary
though.
That said, ostree
does have some benefits over the model
described above: it naturally shares file system inodes if many of
the modules/images involved share the same data. It's thus more
space efficient on disk (and thus also in RAM/cache to some
degree) by default. In my model it would be up to the image
builders to minimize shipping overly redundant disk images, by
making good use of suitably composable system extensions.
What about configuration management?
At first glance immutable systems and configuration management
don't go that well together. However, do note that in the model
I propose above the root file system with all its contents,
including /etc/
and /var/
is actually writable and can be
modified like on any other typical Linux distribution. The only
exception is /usr/
where the immutable OS is hermetic. That
means configuration management tools should work just fine in this
model – up to the point where they are used to install additional
RPM/dpkg packages, because that's something not allowed in the
model above: packages need to be installed at image build time and
thus on the image build host, not the runtime host.
What about non-UEFI and non-TPM2 systems?
The above is designed around the feature set of contemporary PCs, and this means UEFI and TPM2 being available (simply because the PC is pretty much defined by the Windows platform, and current versions of Windows require both).
I think it's important to make the best of the features of today's PC hardware, and then find suitable fallbacks on more limited hardware. Specifically this means: if there's desire to implement something like this on non-UEFI or non-TPM2 hardware we should look for suitable fallbacks for the individual functionality, but generally try to add glue to the old systems so that conceptually they behave more like the new systems instead of the other way round. Or in other words: most of the above is not strictly tied to UEFI or TPM2, and for many cases there are already reasonable fallbacks in place for more limited systems. Of course, without TPM2 many of the security guarantees will be weakened.
How would you name an OS built that way?
I think a desktop OS built this way if it has the GNOME desktop should of course be called GnomeBook, to mimic the ChromeBook name. ;-)
But in general, I'd call hermetic, adaptive, immutable OSes like this "particles".
Help making Distributions Hermetic in /usr/!
One of the core ideas of the approach described above is to make
the OS hermetic in /usr/
, i.e. make it carry a comprehensive
description of what needs to be set up outside of it when
instantiated. Specifically, this means that system users that are
needed are declared in systemd-sysusers
snippets, and skeleton
files and directories are created via systemd-tmpfiles
. Moreover
additional partitions should be declared via systemd-repart
drop-ins.
At this point some distributions (such as Fedora) are (probably
more by accident than on purpose) already mostly hermetic in
/usr/
, at least for the most basic parts of the OS. However,
this is not complete: many daemons require specific resources to be set up in /var/ or /etc/ before they can work, and
the relevant packages do not carry systemd-tmpfiles
descriptions
that add them if missing. So there are two ways you could help
here: politically, it would be highly relevant to convince
distributions that an OS that is hermetic in /usr/
is highly
desirable and it's a worthy goal for packagers to get there. More
specifically, it would be desirable if RPM/dpkg packages would
ship with enough systemd-tmpfiles
information so that
configuration files the packages strictly need for operation are
symlinked (or copied) from /usr/share/factory/
if they are
missing (even better of course would be if packages straight from their upstream sources would just work with an empty /etc/ and /var/, creating what they need themselves and falling back to good defaults in the absence of configuration files).
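To make the factory-copy idea concrete, here is a hedged sketch of the kind of snippets a package could ship; the service name food and all paths are made up, and the tmpfiles 'C' line relies on the convention that an omitted source argument copies from /usr/share/factory/:
# declare the system user the daemon needs (sysusers.d: type name id gecos home)
cat > /usr/lib/sysusers.d/food.conf <<'EOF'
u food - "foo daemon" /var/lib/food
EOF
# create the state directory and copy the default config from /usr/share/factory/ if missing
cat > /usr/lib/tmpfiles.d/food.conf <<'EOF'
d /var/lib/food 0750 food food -
C /etc/food.conf - - - -
EOF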
Note that distributions that adopted systemd-sysusers
,
systemd-tmpfiles
and the /usr/
merge are already quite close
to providing an OS that is hermetic in /usr/
. These were the
big, the major advancements: making the image fully hermetic
should be less controversial – at least that's my guess.
Also note that making the OS hermetic in /usr/
is not just useful in
scenarios like the above. It also means that stuff like
this
and like
this
can work well.
Fill in the gaps!
I already mentioned a couple of missing bits and pieces in the
implementation of the overall vision. In the systemd
project
we'd be delighted to review/merge any PRs that fill in the voids.
Build your own OS like this!
Of course, while we built all these building blocks and they have been adopted to varying degrees and for various purposes in the different distributions, no one so far has built an OS that puts things together just like that. It would be excellent if we had communities that work on building images like what I propose above, i.e. if you want to work on making a secure GnomeBook as I suggest above a reality, that would be more than welcome.
How could this look specifically? Pick an existing
distribution, write a set of mkosi
descriptions plus some
additional drop-in files, and then build this on some build
infrastructure. While doing so, report the gaps, and help us
address them.
- systemd-tmpfiles
- systemd-sysusers
- systemd-boot
- systemd-stub
- systemd-sysext
- systemd-portabled, Portable Services Introduction
- systemd-repart
- systemd-nspawn
- systemd-sysupdate
- systemd-creds, System and Service Credentials
- systemd-homed
And that's all for now.
I've been working on kopper recently, which is a complementary project to zink. Just as zink implements OpenGL in terms of Vulkan, kopper seeks to implement the GL window system bindings - like EGL and GLX - in terms of the Vulkan WSI extensions. There are several benefits to doing this, which I'll get into in a future post, but today's story is really about libX11 and libxcb.
Yes, again.
One important GLX feature is the ability to set the swap interval, which is how you get tear-free rendering by syncing buffer swaps to the vertical retrace. A swap interval of 1 is the typical case, where an image update happens once per frame. The Vulkan way to do this is to set the swapchain present mode to FIFO, since FIFO updates are implicitly synced to vblank. Mesa's WSI code for X11 uses a swapchain management thread for FIFO present modes. This thread is started from inside the vulkan driver, and it only uses libxcb to talk to the X server. But libGL is a libX11 client library, so in this scenario there is always an "xlib thread" as well.
libX11 uses libxcb internally these days, because otherwise there would be no way to intermix xlib and xcb calls in the same process. But it does not use libxcb's reflection of the protocol, XGetGeometry does not call xcb_get_geometry for example. Instead, libxcb has an API to allow other code to take over the write side of the display socket, with a callback mechanism to get it back when another xcb client issues a request. The callback function libX11 uses here is straightforward: lock the Display, flush out any internally buffered requests, and return the sequence number of the last request written. Both libraries need this sequence number for various reasons internally, xcb for example uses it to make sure replies go back to the thread that issued the request.
But "lock the Display" here really means call into a vtable in the Display struct. That vtable is filled in during XOpenDisplay, but the individual function pointers are only non-NULL if you called XInitThreads beforehand. And if you're libGL, you have no way to enforce that, your public-facing API operates on a Display that was already created.
So now we see the race. The queue management thread calls into libxcb while the main thread is somewhere inside libX11. Since libX11 has taken the socket, the xcb thread runs the release callback. Since the Display was not made thread-safe at XOpenDisplay time, the release callback does not block, so the xlib thread's work won't be correctly accounted for. If you're lucky the two sides will at least write to the socket atomically with respect to each other, but at this point they have diverging opinions about the request sequence numbering, and it's a matter of time until you crash.
It turns out kopper makes this really easy to hit. Like "resize a glxgears window" easy. However, this isn't just a kopper issue, this race exists for every program that uses xcb on a not-necessarily-thread-safe Display. The only reasonable fix is for libX11 to just always be thread-safe.
I recently
blogged
about how to run a volatile systemd-nspawn
container from your
host's /usr/
tree, for quickly testing stuff in your host
environment, sharing your home directory, but all that without making a
single modification to your host, and on an isolated node.
The one-liner discussed in that blog story is great for testing during
system software development. Let's have a look at another systemd
tool that I regularly use to test things during systemd
development,
in a relatively safe environment, but still taking full benefit of my
host's setup.
For a while now, systemd has been shipping with a simple component called systemd-sysext. Its primary use case goes something like this: on one hand, operating systems with immutable /usr/ hierarchies are fantastic for security, robustness,
updating and simplicity, but on the other hand not being able to
quickly add stuff to /usr/
is just annoying.
systemd-sysext
is supposed to bridge this contradiction: when
invoked it will merge a bunch of "system extension" images into
/usr/
(and /opt/
as a matter of fact) through the use of read-only
overlayfs
, making all files shipped in the image instantly and
atomically appear in /usr/
during runtime — as if they always had
been there. Now, let's say you are building your locked down OS, with
an immutable /usr/
tree, and it comes without the ability to log in,
without debugging tools, without anything you want and need when
trying to debug and fix something in the system. With systemd-sysext
you could use a system extension image that contains all this, drop it
into the system, and activate it with systemd-sysext
so that it
genuinely extends the host system.
(There are many other use cases for this tool; for example, you could build systems that at their base use a generic image, but by installing one or more system extensions get extended with additional, more specific functionality, or drivers, or similar. The tool is generic, use it for whatever you want, but for now let's not get lost in listing all the possibilities.)
What's particularly nice about the tool is that it supports
automatically discovered dm-verity
images, with signatures and
everything. So you can even do this in a fully authenticated,
measured, safe way. But I am digressing…
Now that we (hopefully) have a rough understanding what
systemd-sysext
is and does, let's discuss how specifically we can
use this in the context of system software development, to safely use
and test bleeding edge development code — built freshly from your
project's build tree – in your host OS without having to risk that the
host OS is corrupted or becomes unbootable by stuff that didn't quite
yet work the way it was envisioned:
The images systemd-sysext
merges into /usr/
can be of two kinds:
disk images with a file system/verity/signature, or simple, plain
directory trees. To make these images available to the tool, they can
be placed or symlinked into /usr/lib/extensions/
,
/var/lib/extensions/
, /run/extensions/
(and a bunch of
others). So if we now install our freshly built development software
into a subdirectory of those paths, then that's entirely sufficient to
make them valid system extension images in the sense of
systemd-sysext
, and thus can be merged into /usr/
to try them out.
To be more specific: when I develop systemd
itself, here's what I do
regularly, to see how my new development version would behave on my
host system. As preparation I checked out the systemd development git
tree first of course, hacked around in it a bit, then built it with
meson/ninja. And now I want to test what I just built:
sudo DESTDIR=/run/extensions/systemd-test meson install -C build --quiet --no-rebuild &&
sudo systemd-sysext refresh --force
Explanation: first, we'll install my current build tree as a system
extension into /run/extensions/systemd-test/
. And then we apply it
to the host via the systemd-sysext refresh
command. This command
will search for all installed system extension images in the
aforementioned directories, then unmount (i.e. "unmerge") any
previously merged dirs from /usr/
and then freshly mount
(i.e. "merge") the new set of system extensions on top of /usr/
. And
just like that, I have installed my development tree of systemd
into
the host OS, and all that without actually modifying/replacing even a
single file on the host at all. Nothing here actually hit the disk!
Note that all this works on any system really, it is not necessary
that the underlying OS even is designed with immutability in
mind. Just because the tool was developed with immutable systems in
mind it doesn't mean you couldn't use it on traditional systems where
/usr/
is mutable as well. In fact, my development box actually runs
regular Fedora, i.e. is RPM-based and thus has a mutable /usr/
tree. As long as system extensions are applied the whole of /usr/
becomes read-only though.
Once I am done testing, when I want to revert to how things were without the image installed, it is sufficient to call:
sudo systemd-sysext unmerge
And there you go, all files my development tree generated are gone
again, and the host system is as it was before (and /usr/
mutable
again, in case one is on a traditional Linux distribution).
Also note that a reboot (regardless if a clean one or an abnormal
shutdown) will undo the whole thing automatically, since we installed
our build tree into /run/
after all, i.e. a tmpfs
instance that is
flushed on boot. And given that the overlayfs
merge is a runtime
thing, too, the whole operation was executed without any
persistence. Isn't that great?
(You might wonder why I specified --force
on the systemd-sysext
refresh
line earlier. That's because systemd-sysext
actually does
some minimal version compatibility checks when applying system
extension images. For that it will compare the host's /etc/os-release file with the image's /usr/lib/extension-release.d/extension-release.<name> file, and refuse operation if the image is not actually built for the host OS version. Here we don't want to bother with dropping that file in
there, we know already that the extension image is compatible with
the host, as we just built it on it. --force
allows us to skip the
version check.)
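For completeness: instead of --force one could also drop a matching extension-release file into the extension tree, so that the compatibility check passes on its own. A quick hedged sketch, simply reusing the host's os-release, which by definition matches:
sudo mkdir -p /run/extensions/systemd-test/usr/lib/extension-release.d
sudo cp /etc/os-release /run/extensions/systemd-test/usr/lib/extension-release.d/extension-release.systemd-test
sudo systemd-sysext refresh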
You might wonder: what about the combination of the idea from the
previous blog story (regarding running containers off the host
/usr/
tree) with system extensions? Glad you asked. Right now we
have no support for this, but it's high on our TODO list (patches
welcome, of course!). i.e. a new switch for systemd-nspawn
called
--system-extension=
that would allow merging one or more such
extensions into the booted container tree would be stellar. With that,
with a single command I could run a container off my host OS but with
a development version of systemd dropped in, all without any
persistence. How awesome would that be?
(Oh, and in case you wonder, all of this only works with distributions
that have completed the /usr/
merge. On legacy distributions that
didn't do that and still place parts of /usr/
all over the hierarchy
the above won't work, since merging /usr/
trees via overlayfs
is
pretty pointless if the OS is not hermetic in /usr/
.)
And that's all for now. Happy hacking!
The title might be a bit hyperbolic here, but we’re indeed exploring a first step in that direction with radv. The impetus here is the ExecuteIndirect
command in Direct3D 12 and some games that are using it in non-trivial ways (e.g. Halo Infinite).
ExecuteIndirect
can be seen as an extension of what we have in Vulkan with vkCmdDrawIndirectCount
. It adds extra capabilities on top of that. To support it with vkd3d-proton we need indirect Vulkan capabilities that go further, in particular being able to change things like vertex buffers, push constants and the index buffer per draw (more on that below).
This functionality happens to be a subset of VK_NV_device_generated_commands
and hence I’ve been working on implementing a subset of that extension in radv. Unfortunately, we can’t really give the firmware an “extended indirect draw call” and have it execute stuff, so we’re stuck generating command buffers on the GPU.
The way the extension works, the application specifies a command “signature” on the CPU, which specifies that for each draw call the application is going to update A, B and C. Then, at runtime, the application provides a buffer with the data for A, B and C for each draw call. The driver then processes that into a command buffer and executes it as if it were a secondary command buffer.
The workflow is then as follows:
- vkCmdPreprocessGeneratedCommandsNV, which converts the application buffer into a command buffer (in the preprocess buffer)
- vkCmdExecuteGeneratedCommandsNV to execute the generated command buffer.
When the application triggers a draw command in Vulkan, the driver generates GPU commands for a number of steps (mostly setting register state, then emitting the actual draw packet).
Of course we skip any of these steps (or parts of them) when they’re redundant. The majority of the complexity is in the register state we have to set. There are multiple parts to that: fixed function state, the shader/pipeline state, and the user SGPRs.
Overall, most of the pipeline state is fairly easy to emit: we just precompute it on pipeline creation and memcpy
it over if we switch shaders. The most difficult part is probably the user SGPRs, and the reason for that is that they are derived from a lot of the remaining API state. Note that the list above doesn’t include push constants, descriptor sets or vertex buffers. The driver computes all of these, and generates the user SGPR data from that.
Descriptor sets in radv are just a piece of GPU memory, and radv binds a descriptor set by providing the shader with a pointer to that GPU memory in a user SGPR. Similarly, we have no hardware support for vertex buffers, so radv generates a push descriptor set containing internal texel buffers and then provides a user SGPR with a pointer to that descriptor set.
For push constants, radv has two modes: a portion of the data can be passed in user SGPRs directly, but sometimes a chunk of memory gets allocated and then a pointer to that memory is provided in a user SGPR. This fallback exists because the hardware doesn’t always have enough user SGPRs to fit all the data.
On Vega and later there are 32 user SGPRs, and on earlier GCN GPUs there are 16. This needs to fit pointers to all the referenced descriptor sets (including internal ones like the one for vertex buffers), push constants, builtins like the start vertex and start instance etc. To get the best performance here, radv determines a mapping of API object to user SGPR at shader compile time and then at draw time radv uses that mapping to write user SGPRs.
This results in some interesting behavior, like switching pipelines does cause the driver to update all the user SGPRs because the mapping might have changed.
Furthermore, as an interesting performance hack radv allocates all upload buffers (for the push constant and push descriptor sets), shaders and descriptor pools in a single 4 GiB region of memory so that we can pass only the bottom 32 bits of all the pointers in a user SGPR, getting us further with the limited number of user SGPRs. We will see later how that makes things difficult for us.
As shown above radv has a bunch of complexity around state for draw calls and if we start generating command buffers on the GPU that risks copying a significant part of that complexity to a shader. Luckily ExecuteIndirect
and VK_NV_device_generated_commands
have some limitations that make this easier. The app can only change a small set of state per draw: things like vertex buffers, push constants and the index buffer.
VK_NV_device_generated_commands
also allows changing shaders and the winding order that decides which side of a primitive is the backface, but we’ve chosen to ignore that for now since it isn’t needed for ExecuteIndirect
(though especially the shader switching could be useful for an application).
The second curveball is that the buffer the application provides needs to provide the same set of data for every draw call. This avoids having to do a lot of serial processing to figure out what the previous state was, which allows processing every draw command in a separate shader invocation. Unfortunately we’re still a bit dependent on the old state that is bound before the indirect command buffer execution:
Remember that for vertex buffers and push constants we may put them in a piece of memory. That piece of memory needs to contain all the vertex buffers/push constants for that draw call, so even if we modify only one of them, we have to copy the rest over. The index buffer is different: in the draw packets for the GPU there is a field that is derived from the index buffer size.
So in vkCmdPreprocessGeneratedCommandsNV
radv partitions the preprocess buffer into a command buffer and an upload buffer (for the vertex buffers & push constants), both with a fixed stride based on the command signature. Then it launches a shader which processes a draw call in each invocation:
if (shader used vertex buffers && we change a vertex buffer) {
    copy all vertex buffers
    update the changed vertex buffers
    emit a new vertex descriptor set pointer
}
if (we change a push constant) {
    if (we change a push constant in memory) {
        copy all push constants
        update the changed push constants
        emit a new push constant pointer
    }
    emit all changed inline push constants into user SGPRs
}
if (we change the index buffer) {
    emit the new index buffer
}
emit a draw command
insert NOPs up to the stride
In vkCmdExecuteGeneratedCommandsNV
radv uses the internal equivalent of vkCmdExecuteCommands
to execute as if the generated command buffer is a secondary command buffer.
Of course one does not simply move part of the driver to GPU shaders without any challenges. In fact we have a whole bunch of them. Some of them just need a bunch of work to solve, some need some extension specification tweaking and some are hard to solve without significant tradeoffs.
A big problem is that the code needed for the limited subset of state that is supported now lives in 3 places: the regular CPU draw path in the driver, the shader used by vkCmdPreprocessGeneratedCommandsNV to build the preprocess buffer, and the CPU code that computes how large that preprocess buffer needs to be.
to build the preprocess buffer.Having the same functionality in multiple places is a recipe for things going out of sync. This makes it harder to change this code and much easier for bugs to sneak in. This can be mitigated with a lot of testing, but a bunch of GPU work gets complicated quickly. (e.g. the preprocess buffer being larger than needed still results in correct results, getting a second opinion from the shader to check adds significant complexity).
nir_builder gets old quickly
helper to generate nir
, the intermediate IR of the shader compiler. Example fragment:
nir_push_loop(b);
{
nir_ssa_def *curr_offset = nir_load_var(b, offset);
nir_push_if(b, nir_ieq(b, curr_offset, cmd_buf_size));
{
nir_jump(b, nir_jump_break);
}
nir_pop_if(b, NULL);
nir_ssa_def *packet_size = nir_isub(b, cmd_buf_size, curr_offset);
packet_size = nir_umin(b, packet_size, nir_imm_int(b, 0x3ffc * 4));
nir_ssa_def *len = nir_ushr_imm(b, packet_size, 2);
len = nir_iadd_imm(b, len, -2);
nir_ssa_def *packet = nir_pkt3(b, PKT3_NOP, len);
nir_store_ssbo(b, packet, dst_buf, curr_offset, .write_mask = 0x1,
.access = ACCESS_NON_READABLE, .align_mul = 4);
nir_store_var(b, offset, nir_iadd(b, curr_offset, packet_size), 0x1);
}
nir_pop_loop(b, NULL);
It is clear that this all gets very verbose very quickly. This is somewhat fine as long as all the internal shaders are tiny. However, between this and raytracing our internal shaders are getting significantly bigger and the verbosity really becomes a problem.
Interesting things to explore here are to use glslang, or even to try writing our shaders in OpenCL C and then compiling it to SPIR-V at build time. The challenge there is that radv is built on a diverse set of platforms (including Windows, Android and desktop Linux) which can make significant dependencies a struggle.
Ideally your GPU work is very suitable for pipelining to avoid synchronization cost on the GPU. If we generate the command buffer and then execute it we need to have a full GPU sync point in between, which can get very expensive as it waits until the GPU is idle. To avoid this VK_NV_device_generated_commands
has added the separate vkCmdPreprocessGeneratedCommandsNV
command, so that the application can batch up a bunch of work before incurring the cost of a sync point.
However, in radv we have to do the command buffer generation in vkCmdExecuteGeneratedCommandsNV
as our command buffer generation depends on some of the other state that is bound, but might not be bound yet when the application calls vkCmdPreprocessGeneratedCommandsNV
.
Which brings up a slight spec problem: The extension specification doesn’t specify whether the application is allowed to execute vkCmdExecuteGeneratedCommandsNV
on multiple queues concurrently with the same preprocess buffer. If all the writing of that happens in vkCmdPreprocessGeneratedCommandsNV
that would result in correct behavior, but if the writing happens in vkCmdExecuteGeneratedCommandsNV
this results in a race condition.
Remember that radv only passes the bottom 32-bits of some pointers around. As a result the application needs to allocate the preprocess buffer in that 4-GiB range. This in itself is easy: just add a new memory type and require it for this usage. However, the devil is in the details.
For example, what should we do for memory budget queries? That is per memory heap, not memory type. However, a new memory heap does not make sense, as the memory is also still subject to physical availability of VRAM, not only address space.
Furthermore, this 4 GiB region is more constrained than other memory, so it would be a shame if applications started allocating random stuff in it. If we look at the existing usage for a pretty heavy game (HZD), the region turns out to be far from full.
So typically we have a lot of room available. Ideally the ordering of memory types would get an application to prefer another memory type when we do not need this special region. However, memory object caching poses a big risk here: Would you choose a memory object in the cache that you can reuse/suballocate (potentially in that limited region), or allocate a new one from a “better” memory type?
Luckily we have not seen that risk play out, but the only real tested user at this point has been vkd3d-proton.
When executing the generated command buffer radv does that the same way as calling a secondary command buffer. This has a significant limitation: A secondary command buffer cannot call a secondary command buffer on the hardware. As a result the current implementation has a problem if vkCmdExecuteGeneratedCommandsNV
gets called on a secondary command buffer.
It is possible to work around this. An example would be to split the secondary command buffer into 3 parts: pre, generated, post. However, that needs a bunch of refactoring to allow multiple internal command buffers per API command buffer.
Don’t expect this upstream very quickly. The main reason for exploring this in radv is ExecuteIndirect
support for Halo Infinite, and after some recent updates we’re back into GPU hang limbo with radv/vkd3d-proton there. So while we’re solving that I’m holding off on upstreaming in case the hangs are caused by the implementation of this extension.
Furthermore, this is only a partial implementation of the extension anyways, with a fair number of limitations that we’d ideally eliminate before fully exposing this extension.
With Mesa 22.1 RC1 firmly out the door, most eyes have turned towards Mesa 22.2.
But not all eyes.
No, while most expected me to be rocketing off towards the next shiny feature, one ticket caught my eye:
Mesa 22.1rc1: Zink on Windows doesn’t work even simple wglgears app fails..
Sadly, I don’t support Windows. I don’t have a test machine to run it, and I don’t even have a VM I could spin up to run Lavapipe. I knew that Kopper was going to cause problems with other frontends, but I didn’t know how many other frontends were actually being used.
The answer was not zero, unfortunately. Plenty of users were enjoying the slow, software driver speed of Zink on Windows to spin those gears, and I had just crushed their dreams.
As I had no plans to change anything here, it would take a new hero to set things right.
Who here loves X-Plane?
I love X-Plane. It’s my favorite flight simulator. If I could, I’d play it all day every day. And do you know who my favorite X-Plane developer is?
Friend of the blog and part-time Zink developer, Sidney Just.
Some of you might know him from his extensive collection of artisanal blog posts. Some might have seen his work enabling Vulkan<->OpenGL interop in Mesa on Windows.
But did you know that Sid’s latest project is much more groundbreaking than just bumping Zink’s supported extension count far beyond the reach of every other driver?
What if I told you that this image
is Zink running wglgears on an NVIDIA 2070 GPU on Windows at full speed? No software-copy scanout. Just Kopper.
Over the past couple days, Sid’s done the esoteric work of hammering out WSI support for Zink on Windows, making us the first hardware-accelerated, GL 4.6-capable Mesa driver to run natively on Windows.
Don’t believe me?
Recognize a little Aztec Ruins action from GFXBench?
The results, comparing Zink against the native NVIDIA driver, are about what we’d expect of an app I’ve literally never run myself.
Not too bad at all!
I think we can safely say that Sid has managed to fix the original bug. Thanks, Sid!
But why is an X-Plane developer working on Zink?
The man himself has this to say on the topic:
X-Plane has traditionally been using OpenGL directly for all of its rendering needs. As a result, for years our plugin SDK has directly exposed the game’s OpenGL context to third party plugins, which have used it to render custom avionic screens and GUI elements. When we finally did the switch to Vulkan and Metal in 2020, one of the big issues we faced was how to deal with plugins. Our solution so far has been to rely on native Vulkan/OpenGL driver interop via extensions, which has mostly worked and allowed us to ship with modern backends.
Unfortunately this puts us at the mercy of the driver to provide good interop. Sadly on some platforms, this just isn’t available at all. On others, the drivers are broken, leading to artifacts when mixing Vulkan and GL rendering. To date, our solution has been to just shrug it off and hope for better drivers. X-Plane plugins make use of compatibility profile GL features, as well as core profile features, depending on the author’s skill, so libraries like ANGLE were not an option for us.
This is where Zink comes in for us: Being a real GL driver, it has support for all of the features that we need. Being open source also means that any issues that we do discover are much easier to fix ourselves. We’ve made some progress including Zink into the next major version of X-Plane, X-Plane 12, and it’s looking very promising so far. Our hope is to ship X-Plane 12 with Zink as the GL backend for plugins and leave driver interop issues in the past.
The roots of this interest can also be seen in his blog post from last year where he touches on the future of GL plugin support.
Awesome!
Big Triangle’s definitely paying attention now.
And if any of my readers think this work is cool, go buy yourself a copy of X-Plane to say thanks for contributing back to open source.
This article is part of a series on how to setup a bare-metal CI system for Linux driver development. Here are the different articles so far:
In this article, we will finally focus on generating the rootfs/container image of the CI Gateway in a way that enables live patching the system without always needing to reboot.
This work is sponsored by the Valve Corporation.
System updates are a necessary evil for any internet-facing server, unless you want your system to become part of a botnet. This is especially true for CI systems since they let people on the internet run code on machines, often leading to unfair use such as cryptomining (this one is hard to avoid though)!
The problem with system updates is not the 2 or 3 minutes of downtime that it takes to reboot, it is that we cannot reboot while any CI job is running. Scheduling a reboot thus first requires stopping the acceptance of new jobs, waiting for the current ones to finish, and only then rebooting. This solution may be acceptable if your jobs take ~30 minutes, but what if they last 6h? The whole reboot procedure suddenly gets close to a typical 8h work day, and we definitely want to have someone looking over the reboot sequence so they can revert to a previous boot configuration if the new one fails.
This problem may be addressed in a cloud environment by live-migrating services/containers/VMs from a non-updated host to an updated one. This is unfortunately a lot more complex to pull off for a bare-metal CI without having a second CI gateway and designing synchronization systems/hardware to arbitrate access to the test machines' power/serial consoles/boot configuration.
So, while we cannot always avoid the need to drain the CI jobs before rebooting, what we can do is reduce the cases in which we need to perform this action. Unfortunately, containers have been designed with atomic updates in mind (this is why we want to use them), but that means that trivial operations such as adding an ssh key, a Wireguard peer, or updating a firewall rule will require a reboot. A hacky solution may be for the admins to update the infra container, then log into the different CI gateways and manually reproduce the changes they made in the new container. These changes would be lost at the next reboot, but this is not a problem since the CI gateway would use the latest container when rebooting, which already contains the updates. While possible, this solution is error-prone and not testable ahead of time, which is against the requirements for the gateway we laid out in Part 3.
An improvement to live-updating containers by hand would be to use tools such as Ansible, Salt, or even Puppet to manage and deploy non-critical services and configuration. This would enable live-updating the currently-running container but would need to be run after every reboot. An Ansible playbook may be run locally, so it is not inconceivable for a service to be run at boot that would download the latest playbook and run it. This solution is however forcing developers/admins to decide which services need to have their configuration baked in the container and which services should be deployed using a tool like Ansible... unless...
We could use a tool like Ansible to describe all the packages and services to install, along with their configuration. Creating a container would then be achieved by running the Ansible playbook on a base container image. Assuming that the playbook would truly be idempotent (running the playbook multiple times will lead to the same final state), this would mean that there would be no differences between the live-patched container and the new container we created. In other words, we simply morph the currently-running container into the wanted configuration by running the same Ansible playbook we used to create the container, but against the live CI gateway! This will not always remove the need to reboot the CI gateways from time to time (updating the kernel, or services which don't support live-updates without affecting CI jobs), but all the smaller changes can get applied in-situ!
The base container image has to contain the basic dependencies of the tool like Ansible, but if it were made to contain all the OS packages, it would split the final image into three container layers: the base OS container, the packages needed, and the configuration. Updating the configuration would thus result in only a few megabytes of update to download at the next reboot rather than the full OS image, thus reducing the reboot time.
Ansible is perfectly suited to morphing a container into its newest version, provided that all the resources used remain static between when the new container was created and when the currently-running container gets live-patched. This is because of Ansible's core principle of idempotency of operations: rather than running commands blindly like in a shell script, it first checks what the current state is and then, if needed, updates the state to match the desired target. This makes it safe to run the playbook multiple times, but also allows us to only restart services if their configuration or one of their dependencies changed.
When version pinning of packages is possible (Python, Ruby, Rust, Golang, ...), Ansible can guarantee the idempotency that makes live-patching safe. Unfortunately, package managers of Linux distributions are usually not idempotent: They were designed to ship updates, not pin software versions! In practice, this means that there are no guarantees that the package installed during live-patching will be the same as the one installed in the new base container, thus exposing oneself to potential differences in behaviour between the two deployment methods... The only way out of this issue is to create your own package repository and make sure its content will not change between the creation of the new container and the live-patching of all the CI Gateways. Failing that, all I can advise you to do is pick a stable distribution which will try its best to limit functional changes between updates within the same distribution version (Alpine Linux, CentOS, Debian, ...).
In the end, Ansible won't always be able to make live-updating your container strictly equivalent to rebooting into its latest version, but as long as you are aware of its limitations (or work around them), it will make updating your CI gateways way less of a hassle than it would be otherwise! You will need to find the right balance between live-updatability and ease of maintenance of your gateway's code base.
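In practice, the morphing step then boils down to re-running the same playbook that was used to build the container, but against the live gateway. A hedged sketch; the playbook file name is made up, while the development variable mirrors the one used in the build scripts below:
# run the container's playbook locally, against the currently-running system
ansible-playbook -i localhost, -c local valve-infra-container.yml -e development=true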
At this point, you may be wondering how all of this looks in practice! Here is the example of the CI gateways we have been developing for Valve:
And if you are wondering how we can go from these scripts to working containers, here is how:
$ podman run --rm -d -p 8088:5000 --name registry docker.io/library/registry:2
$ env \
IMAGE_NAME=localhost:8088/valve-infra-base-container \
BASE_IMAGE=archlinux \
buildah unshare -- .gitlab-ci/valve-infra-base-container-build.sh
$ env \
IMAGE_NAME=localhost:8088/valve-infra-container \
BASE_IMAGE=valve-infra-base-container \
ANSIBLE_EXTRA_ARGS='--extra-vars service_mgr_override=inside_container -e development=true' \
buildah unshare -- .gitlab-ci/valve-infra-container-build.sh
And if you were willing to use our Makefile, it gets even easier:
$ make valve-infra-base-container BASE_IMAGE=archlinux IMAGE_NAME=localhost:8088/valve-infra-base-container
$ make valve-infra-container BASE_IMAGE=localhost:8088/valve-infra-base-container IMAGE_NAME=localhost:8088/valve-infra-container
Not too bad, right?
PS: These scripts are constantly being updated, so make sure to check out their current version!
In this post, we highlighted the difficulty of keeping the CI Gateways up to date when CI jobs can take multiple hours to complete, preventing new jobs from starting until the current queue is emptied and the gateway has rebooted.
We have then shown that despite looking like competing solutions to deploy services in production, containers and tools like Ansible can actually work well together to reduce the need for reboots by morphing the currently-running container into the updated one. There are however some limits to this solution which are important to keep in mind when designing the system.
In the next post, we will be designing the executor service which is responsible for time-sharing the test machines between different CI/manual jobs. We will thus be talking about deploying test environments, BOOTP, and serial consoles!
That's all for now, thanks for making it to the end!
As everyone who’s anyone knows, the next Mesa release branchpoint is coming up tomorrow. Like usual, here’s the rundown on what to expect from zink in this release:
So if you find a zink problem in the 22.1 release of Mesa, it’s definitely because of Kopper and not actually anything zink-related.
But also this is sort-of-almost-maybe a lavapipe blog, and that driver has had a much more exciting quarter. Here’s a rundown.
New Extensions:
Vulkan 1.3 is now supported. We’ve landed a number of big optimizations as well, leading to massively improved CI performance.
Lavapipe: the cutting-edge software implementation of Vulkan.
…as long as you don’t need descriptor indexing.
Since Kopper got merged today upstream I wanted to write a little about it as I think the value it brings can be unclear for the uninitiated.
Adam Jackson in our graphics team has been working for the last months together with other community members like Mike Blumenkrantz on implementing Kopper. For those unaware, Zink is an OpenGL implementation running on top of Vulkan, and Kopper is the layer that allows you to translate OpenGL and GLX window handling to Vulkan WSI handling. This means that you can get full OpenGL support even if your GPU only has a Vulkan driver available, and it also means you can, for instance, run GNOME on top of this stack thanks to the addition of Kopper to Zink.
During the lifecycle of the soon-to-be-released Fedora Workstation 36 we expect to let you turn on OpenGL through Kopper and Zink as an experimental feature once we update Fedora 36 to Mesa 22.1.
So you might ask: why would I care about this as an end user? Well, initially you probably will not care much, but over time it is likely that GPU makers will stop developing native OpenGL drivers and focus only on their Vulkan drivers. At that point Zink and Kopper provide you with a backwards-compatibility solution for your OpenGL applications. For Linux distributions it will also, at some point, help significantly reduce the amount of code we need to ship and maintain, as we can just rely on Zink and Kopper everywhere, which of course reduces the workload for maintainers.
This is not going to be an overnight transition though; Zink and Kopper will need some time to stabilize and further improve performance. At the moment performance is generally a bit slower than the native drivers, although we have seen some examples of games which actually got better performance with specific driver combinations, and over time we expect the negative performance delta to shrink. The delta is unlikely to ever fully go away due to the cost of translating between the two APIs, but on the other hand we are going to be in a situation in a few years where all current/new applications use Vulkan natively (or through Proton), and thus the stuff that relies on OpenGL will be older software, so combined with faster GPUs you should still get more than good enough performance. And at that point Zink will be a lifesaver for your old OpenGL-based applications and games.
By the time you read this, Kopper will have landed. This means a number of things have changed:
MESA_LOADER_DRIVER_OVERRIDE=zink will work for all drivers. In particular, lots of cases of garbled/flickering rendering (I’m looking at you, Supertuxkart on ANV) will now be perfectly smooth and without issue.
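For instance (assuming glxinfo is installed), forcing a GL app through zink and checking which driver is in use is as simple as:
$ MESA_LOADER_DRIVER_OVERRIDE=zink glxinfo | grep "OpenGL renderer"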
Also there’s no swapinterval control yet, so X11 clients will have no choice but to churn out the maximum amount of FPS possible at all times.
You (probably?) aren’t going to be able to run a compositor on zink just yet, but it’s on the 22.1 TODO list.
Big thanks to Adam Jackson for carrying this project on his back.
Apparently, in some parts of this world, the /usr/-merge transition is still ongoing. Let's take the opportunity to have a look at one specific way to take benefit of the /usr/-merge (and associated work) IRL.
I develop system-level software as you might know. Oftentimes I want to run my development code on my PC but be reasonably sure it cannot destroy or otherwise negatively affect my host system. Now I could set up a container tree for that, and boot into that. But often I am too lazy for that, I don't want to bother with a slow package manager setting up a new OS tree for me. So here's what I often do instead — and this only works because of the /usr/-merge.
I run a command like the following (without any preparatory work):
systemd-nspawn \
--directory=/ \
--volatile=yes \
-U \
--set-credential=passwd.hashed-password.root:$(mkpasswd mysecret) \
--set-credential=firstboot.locale:C.UTF-8 \
--bind-user=lennart \
-b
And then I very quickly get a login prompt on a container that runs the exact same software as my host — but is also isolated from the host. I do not need to prepare any separate OS tree or anything else. It just works. And my host user lennart is just there, ready for me to log into.
So here's what these systemd-nspawn options specifically do:
--directory=/ tells systemd-nspawn to run off the host OS' file hierarchy. That smells like danger of course, running two OS instances off the same directory hierarchy. But don't be scared, because:
--volatile=yes enables volatile mode. Specifically this means what we configured with --directory=/ as root file system is slightly rearranged. Instead of mounting that tree as it is, we'll mount a tmpfs instance as actual root file system, and then mount the /usr/ subdirectory of the specified hierarchy into the /usr/ subdirectory of the container file hierarchy in read-only fashion – and only that directory. So now we have a container directory tree that is basically empty, but imports all host OS binaries and libraries into its /usr/ tree. All software installed on the host is also available in the container with no manual work. This mechanism only works because on /usr/-merged OSes vendor resources are monopolized at a single place: /usr/. It's sufficient to share that one directory with the container to get a second instance of the host OS running. Note that this means /etc/ and /var/ will be entirely empty initially when this second system boots up. Thankfully, forward-looking distributions (such as Fedora) have adopted systemd-tmpfiles and systemd-sysusers quite pervasively, so that system users and files/directories required for operation are created automatically should they be missing. Thus, even though at boot the mentioned directories are initially empty, once the system is booted up they are sufficiently populated for things to just work.
-U means we'll enable user namespacing, in fully automatic mode. This does three things: it picks a free host UID range dynamically for the container, then sets up user namespacing for the container processes, mapping that host UID range to UIDs 0…65534 in the container. It then sets up a similar UID mapped mount on the /usr/ tree of the container. Net effect: file ownerships as set on the host OS tree appear as if they belong to the very same users inside of the container environment, except that we use user namespacing for everything, and thus the users are actually neatly isolated from the host.
--set-credential=passwd.hashed-password.root:$(mkpasswd mysecret) passes a credential to the container. Credentials are bits of data that you can pass to systemd services and whole systems. They are actually awesome concepts (e.g. they support TPM2 authentication/encryption that just works!) but I am not going to go into details around that, given it's off-topic in this specific scenario. Here we just take benefit of the fact that systemd-sysusers looks for a credential called passwd.hashed-password.root to initialize the root password of the system from. We set it to mysecret. This means once the system is booted up we can log in as root with the supplied password. Yay. (Remember, /etc/ is initially empty on this container, and thus also carries no /etc/passwd or /etc/shadow, and thus has no root user record, and thus no root password.)
mkpasswd is a tool that converts a plain text password into a UNIX hashed password, which is what this specific credential expects.
Similarly, --set-credential=firstboot.locale:C.UTF-8 tells the systemd-firstboot service in the container to initialize /etc/locale.conf with this locale.
--bind-user=lennart binds the host user lennart into the container, also as user lennart. This does two things: it mounts the host user's home directory into the container, and it copies a minimal user record of the specified user into the container that nss-systemd then picks up and includes in the regular user database. This means that once the container is booted up I can log in as lennart with my regular password, and once I am logged in I will see my regular host home directory, and can make changes to it. Yippieh! (This does a couple more things, such as UID mapping, but let's not get lost in too many details.)
So, if I run this, I will very quickly get a login prompt, where I can log in as my regular user. I have full access to my host home directory, but otherwise everything is nicely isolated from the host, and changes outside of the home directory are either prohibited or volatile, i.e. they go to a tmpfs instance whose lifetime is bound to the container's lifetime: when I shut down the container I just started, any changes outside of my user's home directory are lost.
Note that while here I use --volatile=yes in combination with --directory=/ you can actually use it on any OS hierarchy, i.e. just about any directory that contains OS binaries.
Similarly, the --bind-user= stuff works with any OS hierarchy too (but do note that only systemd 249 and newer will pick up the user records passed to the container that way, i.e. this requires at least v249 both on the host and in the container to work).
Or in short: the possibilities are endless!
For this all to work, you need:
A recent kernel (5.15 should suffice, as it brings UID mapped mounts for the most common file systems, so that -U and --bind-user= can work well.)
A recent systemd (249 should suffice, which brings --bind-user=, and a -U switch backed by UID mapped mounts).
A distribution that adopted the /usr/-merge, systemd-tmpfiles and systemd-sysusers so that the directory hierarchy and user databases are automatically populated when empty at boot. (Fedora 35 should suffice.)
While a lot of today's software works well out of the box on systems that come up with an unpopulated /etc/ and /var/, and either falls back to reasonable built-in defaults or deploys systemd-tmpfiles to create what is missing, things aren't perfect: some software typically installed on desktop OSes will fail to start when invoked in such a container, and be visible as ugly failed services, but it won't stop me from logging in and using the system for what I want to use it for. It would be excellent to get that fixed, though. This can either be fixed in the relevant software upstream (i.e. if opening your configuration file fails with ENOENT, just fall back to reasonable defaults), or in the distribution packaging (i.e. add a tmpfiles.d/ file that copies or symlinks in skeleton configuration from /usr/share/factory/etc/ via the C or L line types).
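As a sketch (the file and service names here are invented for illustration), such a packaging fix can be as small as a tmpfiles.d drop-in:
# /usr/lib/tmpfiles.d/myservice.conf (hypothetical example)
# Copy the skeleton configuration from the factory tree if it is missing;
# an "L" line would create a symlink instead of a copy.
C /etc/myservice.conf - - - - /usr/share/factory/etc/myservice.conf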
And then there's certain software dealing with hardware management and similar that simply cannot reasonably work in a container (as device APIs on Linux are generally not virtualized for containers). It would be excellent if software like that were updated to carry ConditionVirtualization=!container or ConditionPathIsReadWrite=/sys conditionalization in their unit files, so that it is automatically – cleanly – skipped when executed in such a container environment.
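In unit-file terms that is a one-line addition, for example via a drop-in (the service name here is hypothetical):
# /etc/systemd/system/hw-manager.service.d/container.conf (hypothetical)
[Unit]
ConditionVirtualization=!container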
And that's all for now.
Do you want to start a career in open-source? Do you want to learn amazing skills while getting paid? Keep reading!
Igalia has a grant program that gives students with a background in Computer Science, Information Technology and Free Software their first exposure to the professional world, working hand in hand with Igalia programmers and learning with them. It is called Igalia Coding Experience.
While this experience is open for everyone, Igalia expressly invites women (both cis and trans), trans men, and genderqueer people to apply. The Coding Experience program gives preference to applications coming from underrepresented groups in our industry.
You can apply to any of the offered grants this year: Web Standards, WebKit, Chromium, Compilers and Graphics.
In the case of Graphics, the student will have the opportunity to deal with the Linux DRM subsystem. Specifically, the student will improve the test coverage of DRM drivers through IGT, a testing framework designed for this purpose. This includes learning how to contribute to the Linux kernel/DRM, interacting with the DRI-devel community, understanding DRM core functionality, and increasing the test coverage of the IGT tool.
The conditions of our Coding Experience program are:
The submission period goes from March 16th until April 30th. Students will be selected in May. We will work with the student to arrange a suitable starting date during 2022, from June onwards, and finishing on a date to be agreed that suits their schedule.
The popular Google Summer of Code is another option for students. This year, the X.Org Foundation participates as an Open Source organization. We have some proposed ideas, but you can propose any project idea as well.
The timeline for proposals runs from April 4th to April 19th. However, you should contact us beforehand in order to discuss your ideas with potential mentors.
GSoC gives a stipend to students too (from 1,500 to 6,000 USD depending on the size of the project and your location). The hours needed to complete the project vary from 175 to 350, depending on the size of the project as well.
Of course, this is a remote-friendly program, so any student in the world can participate in it.
Outreachy is another internship program for applicants from around the world who face under-representation, systemic bias or discrimination in the technology industry of their country. Outreachy supports diversity in free and open source software!
Outreachy internships are remote, paid ($7,000), and last three months. Outreachy internships run from May to August and December to March. Applications open in January and August.
The projects listed cover many areas of the open-source software stack: from kernel to distributions work. Please check current proposals to find anything that is interesting for you!
The X.Org Foundation voted in 2008 to initiate a program known as the X.Org Endless Vacation of Code (EVoC) program, in order to give more flexibility to students: an EVoC mentorship can be initiated at any time during the calendar year, and the Board can fund as many of these mentorships as it sees fit.
Like the other programs, EVoC is remote-friendly as well. The stipend goes as follows: an initial payment of 500 USD and two further payments of 2,250 USD upon completion of project milestones. EVoC does not set limits on hours, but there are some requirements and steps to complete before applying. Please read the X.Org Endless Vacation of Code website to learn more.
As you can see, there are many ways to enter the Open Source community. Although I focused on the programs related to the open-source graphics stack, there are many more.
With all of these possibilities (and many more, including internships at companies), I hope that you can apply and that the experience will encourage you to start a career in the open-source community.
Happy hacking!
Today marks (at last) the release of some cool extensions I’ve had the pleasure of working on:
This extension revolutionizes how PSOs can be managed by the application, and it’s the first step towards solving the dreaded stuttering that zink suffers from when attempting to play any sort of game. There’s definitely going to be more posts from me on this in the future.
Currently, zink has to do some awfulness internally to replicate the awfulness of GL_PRIMITIVES_GENERATED. With this extension, at least some of that awfulness can be pushed down to the driver. And the spec, of course. You can’t scrub this filth out of your soul.
The mesa community being awesome as it is, support for these extensions is already underway:
But obviously Lavapipe, being the greatest of all drivers, will already have support landed by the time you read this post.
Let the bug reports flow!
OpenSSH has this very nice setting, VerifyHostKeyDNS, which when enabled, will pull SSH host keys from DNS, and you no longer need to either trust on first use, or copy host keys around out of band. Naturally, trusting unsecured DNS is a bit scary, so this requires the record to be signed using DNSSEC. This has worked for a long time, but then broke, seemingly out of the blue. Running ssh -vvv gave output similar to
debug1: found 4 insecure fingerprints in DNS
debug3: verify_host_key_dns: checking SSHFP type 1 fptype 2
debug3: verify_host_key_dns: checking SSHFP type 4 fptype 2
debug1: verify_host_key_dns: matched SSHFP type 4 fptype 2
debug3: verify_host_key_dns: checking SSHFP type 4 fptype 1
debug1: verify_host_key_dns: matched SSHFP type 4 fptype 1
debug3: verify_host_key_dns: checking SSHFP type 1 fptype 1
debug1: matching host key fingerprint found in DNS
even though the zone was signed, the resolver was checking the signature and I even checked that the DNS response had the AD bit set.
The fix was to add options trust-ad to /etc/resolv.conf. Without this, glibc will discard the AD bit from any upstream DNS servers. Note that you should only add this if you actually have a trusted DNS resolver. I run unbound on localhost, so if somebody can do a man-in-the-middle attack on that traffic, I have other problems.
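For reference, here is roughly what the relevant configuration ends up looking like with a local validating resolver (and ssh-keygen -r <hostname> is the tool that generates the SSHFP records to publish in the zone):
# /etc/resolv.conf: local validating resolver, plus trust-ad
nameserver 127.0.0.1
options trust-ad

# ~/.ssh/config: opt in to DNS-based host key verification
Host *
    VerifyHostKeyDNS yes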
Anyone who knows me knows that I hate cardio.
Full stop.
I’m not picking up and putting down all these heavy weights just so I can go for a jog afterwards and lose all my gains.
Similarly, I’m not trying to set a world record for speed-writing code. This stuff takes time, and it can’t be rushed.
Unless…
Today we’re a Lavapipe blog.
Lavapipe is, of course, the software implementation of Vulkan that ships with Mesa, originally braindumped directly into the repo by graphics god and part-time Twitter executive, Dave Airlie. For a long time, the Lavapipe meme has been “Try it on Lavapipe—it’s not conformant, but it still works pretty good haha” and we’ve all had a good chuckle at the idea that anything not officially signed and stamped by Khronos could ever draw a single triangle properly.
But, pending a single MR that fixes the four outstanding failures for Vulkan 1.2 conformance, as of last week, Lavapipe passes 100% of conformance tests. Thus, pending a merge and a Mesa bugfix release, Lavapipe will achieve official conformance.
And then we’ll have a new meme: Vulkan 1.3 When?
As some have noticed, I’ve been quietly (very, very, very, very, very, very, very, very, very, very, very, very quietly) implementing a number of features for Lavapipe over the past week.
But why?
Khronos-fanatics will immediately recognize that these features are all part of Vulkan 1.3.
Which Lavapipe also now supports, pending more merges which I expect to happen early next week.
This is what a sprint looks like.
We had a busy 2021 within GNU/Linux graphics stack at Igalia.
Would you like to know what we have done last year? Keep reading!
Last year both the OpenGL and the Vulkan drivers received a lot of love. For example, we implemented several optimizations, such as improvements to the v3dv pipeline cache. In this blog post, Alejandro Piñeiro presents how we improved the v3dv pipeline cache times by replacing the previous two-cache-lookup scheme with a single lookup, and shows some numbers on both a synthetic test (a modified CTS test) and some games.
We also made performance improvements to the v3d compilers for OpenGL and Vulkan. Iago Toral explains our work on optimizing the backend compiler with techniques such as improving memory lookup efficiency, reducing instruction counts, instruction packing, and uniform handling, among others. There are some numbers that show framerate improvements from ~6% to ~62% on different games / demos.
Framerate improvement after optimization (in %). Taken from Iago’s blogpost
Of course, there was work related to feature implementation. This blog post from Iago lists some Vulkan extensions implemented in the v3dv driver in 2021… Although not all the implemented extensions are listed there, you can see the driver is quickly catching up in its Vulkan extension support.
My colleague Juan A. Suárez implemented performance counters in the v3d driver (an OpenGL driver) which required modifications in the kernel and in the Mesa driver. More info in his blog post.
There was more work in other areas done in 2021 too, like the improved support for RenderDoc and GFXReconstruct. And not to forget the kernel contributions to the DRM driver done by Melissa Wen, who not only worked on developing features for it, but also reviewed all the patches that came from the community.
However, the biggest milestone for the v3dv driver was becoming Vulkan 1.1 conformant in the last quarter of 2021, just one year after becoming Vulkan 1.0 conformant. As you can imagine, that implied a lot of work implementing features, fixing bugs and, of course, improving the driver in many different ways. Great job, folks!
If you want to know more about all the work done on these drivers during 2021, there is an awesome talk from my colleague Alejandro Piñeiro at FOSDEM 2022: “v3dv: Status Update for Open Source Vulkan Driver for Raspberry Pi 4”, and another one from my colleague Iago Toral at XDC 2021: “Raspberry Pi Vulkan driver update”. Below you can find the video recordings of both talks.
FOSDEM 2022 talk: “v3dv: Status Update for Open Source Vulkan Driver for Raspberry Pi 4”
XDC 2021 talk: “Raspberry Pi Vulkan driver update”
Photo of the Qualcomm® Robotics RB3 Platform embedded board that I use for Turnip development.
There were also several achievements done by igalians on both Freedreno and Turnip drivers. These are reverse engineered open-source drivers for Qualcomm Adreno GPUs: Freedreno for OpenGL and Turnip for Vulkan.
Starting 2021, my colleague Danylo Piliaiev helped with implementing the missing bits in Freedreno for supporting OpenGL 3.3 on Adreno 6xx GPUs. His blog post explained his work, such as implementing ARB_blend_func_extended, ARB_shader_stencil_export and fixing a variety of CTS test failures.
Related to this, my colleague Guilherme G. Piccoli worked on porting a recent kernel to one of the boards we use for Freedreno development: the Inforce 6640. He did an awesome job getting a 5.14 kernel booting on that embedded board. If you want to know more, please read the blog post he wrote explaining all the issues he found and how he fixed them!
Picture of the Inforce 6640 board that Guilherme used for his development. Image from his blog post.
However the biggest chunk of work was done in Turnip driver. We have implemented a long list of Vulkan extensions: VK_KHR_buffer_device_address, VK_KHR_depth_stencil_resolve, VK_EXT_image_view_min_lod, VK_KHR_spirv_1_4, VK_EXT_descriptor_indexing, VK_KHR_timeline_semaphore, VK_KHR_16bit_storage, VK_KHR_shader_float16, VK_KHR_uniform_buffer_standard_layout, VK_EXT_extended_dynamic_state, VK_KHR_pipeline_executable_properties, VK_VALVE_mutable_descriptor_type, VK_KHR_vulkan_memory_model and many others. Danylo Piliaiev and Hyunjun Ko are terrific developers!
But not all our work was related to feature development. For example, I implemented the Low-Resolution Z-buffer (LRZ) HW optimization, and Danylo fixed a long list of rendering bugs that happened in real-world applications (blog post 1, blog post 2) like D3D games run on Vulkan (thanks to DXVK and VKD3D), and instrumented the backend compiler to dump register values, among many other fixes and optimizations.
However, the biggest achievement was getting Vulkan 1.1 conformance for Turnip. Danylo wrote a blog post mentioning all the work we did to achieve that this year.
If you want to know more, don’t miss this FOSDEM 2022 talk given by my colleague Hyunjun Ko called “The status of turnip driver development. What happened in 2021 and will happen in 2022 for turnip.”. Video below.
FOSDEM 2022 talk: “The status of turnip driver development. What happened in 2021 and will happen in 2022 for turnip.”
Our graphics work doesn’t only cover driver development: we also participate in the Khronos Group as Vulkan Conformance Test Suite developers and even as spec contributors.
My colleague Ricardo Garcia is a very productive developer. He worked on implementing tests for Vulkan Ray Tracing extensions (read his blog post about ray tracing for more info about this big Vulkan feature), implemented tests for a long list of Vulkan extensions like VK_KHR_present_id and VK_KHR_present_wait, VK_EXT_multi_draw (watch his talk at XDC 2021), VK_EXT_border_color_swizzle (watch his talk at FOSDEM 2022) among many others. In many of these extensions, he contributed to their respective specifications in a significant way (just search for his name in the Vulkan spec!).
XDC 2021 talk: “Quick Overview of VK_EXT_multi_draw”
FOSDEM 2022 talk: “Fun with border colors in Vulkan. An overview of the story behind VK_EXT_border_color_swizzle”
Similarly, I participated modestly in this effort by developing tests for some extensions like VK_EXT_image_view_min_lod (blog post). Of course, both Ricardo and I implemented many new CTS tests, added coverage to existing ones, fixed lots of bugs in existing tests, and reported dozens of driver issues to the respective Mesa developers.
Not only that, both Ricardo and I appeared as Vulkan 1.3 spec contributors.
Another interesting piece of work we started in 2021 is Vulkan Video support in GStreamer. My colleague Víctor Jaquez presented the Vulkan Video extension at XDC 2021 and soon after started working on Vulkan Video’s h264 decoder support. You can find more information in his blog post, or by watching his talk below:
FOSDEM 2022 talk: “Video decoding in Vulkan: VK_KHR_video_queue/decode APIs”
Before I leave this section, don’t forget to take a look at Ricardo’s blogpost on debugPrintfEXT feature. If you are a Graphics developer, you will find this feature very interesting for debugging issues in your applications!
Along those lines, Danylo presented at XDC 2021 a talk about dissecting and fixing Vulkan rendering issues in drivers with RenderDoc. Very useful for driver developers! Watch the talk below:
XDC 2021 talk: “Dissecting Vulkan rendering issues in drivers with RenderDoc”
To finalize this blog post, remember that you now have vkrunner (the Vulkan shader tester created by Igalia) available for RPM-based GNU/Linux distributions. In case you are working with embedded systems, maybe my blog post about cross-compiling with icecream will help to speed up your builds.
This is just a summary of the highlights we did last year. I’m sorry if I am missing more work from my colleagues.
Those of you in-the-know are well aware that Zink has always had a crippling addiction to seamless cubemaps. Specifically, Vulkan doesn’t support non-seamless cubemaps since nobody wants those anymore, but this is the default mode of sampling for OpenGL.
Thus, it is impossible for Zink to pass GL 4.6 conformance until this issue is resolved.
But what does this even mean?
As veterans of intro to geometry courses all know*, a cube is a 3D shape that has six identically-sized sides called “faces”. In computer graphics, each of these faces has its own content that can be read and written discretely.
When interpolating data from a cube-type texture, there are two methods: seamless, where samples at the edge of a face blend in texels from the adjacent faces, and non-seamless, where sampling is clamped or wrapped within the face being sampled; Vulkan only does the former.
This effectively results in Zink interpolating across the boundaries of cube faces when it should instead be clamping/wrapping pixel data to a single face.
But who cares about all that math nonsense when the result is that Zink is still failing CTS cases?
*Disclosure: I have been advised by my lawyer to state on the record that I have never taken an intro to geometry course and have instead copied this entire blog post off StackOverflow.
In order to replicate this basic OpenGL behavior, a substantial amount of code is required—most of it terrible.
The first step is to determine when a cube should be sampled as non-seamless. OpenGL helpfully has only one extension (plus this other extension) which control seamless access of cubemaps, so as long as that one state (plus the other state) isn’t enabled, a cube shouldn’t be interpreted seamlessly.
With this done, what happens at coordinates that lie at the edge of a face? The OpenGL wrap enum covers this. For the purposes of this blog post, only two wrap modes exist:
coord = extent
or coord = 0
)coord %= extent
)So now non-seamless cubes are detected, and the behavior for handling their non-seamlessness is known, but how can this actually be done?
In short, this requires shader rewrites to handle coordinate clamping, then wrapping. Since it’s not going to be desirable to have a different shader variant for every time the wrap mode changes, this means loading the parameters from a UBO. Since it’s further not going to be desirable to have shader variants for each per-texture seamless/non-seamless cube combination, this means also making the rewrite handle the no-op case of continuing to use the original, seamless coordinates after doing all the calculations for the non-seamless case.
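As a rough GLSL-level illustration of what the rewrite amounts to (the real thing is a NIR rewrite, and all names here are invented), the injected per-coordinate logic looks something like this:
// Hypothetical sketch: clamp or wrap a cube face coordinate for non-seamless sampling.
// 'wrap' and 'extent' would come from the UBO described above; the seamless case
// falls through to the original coordinates so the same shader handles both.
vec2 wrap_face_coord(vec2 coord, float extent, uint wrap, bool seamless)
{
    if (seamless)
        return coord;                                 // no-op path: keep seamless behavior
    if (wrap == 0u /* clamp */)
        return clamp(coord, vec2(0.0), vec2(extent)); // coord = 0 or coord = extent
    return mod(coord, vec2(extent));                  // repeat: coord %= extent
}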
Worst of all, this has to be plumbed through the Rube Goldberg machine that is Mesa.
It was terrible, and it continues to be terrible.
If I were another blogger, I would probably take this opportunity to flex my Calculus credentials by putting all kinds of math here, but nobody really wants to read that, and the hell if I know how to make markdown do that whiteboard thing so I can doodle in fancy formulas or whatever from the spec.
Instead, you can read the merge request if you’re that deeply invested in cube mechanics.
Subgroup operations or wave intrinsics, such as reducing a value across the threads of a shader subgroup or wave, were introduced in GPU programming languages a while ago. They communicate with other threads of the same wave, for example to exchange the input values of a reduction, but not necessarily with all of them if there is divergent control flow.
In LLVM, we call such operations convergent. Unfortunately, LLVM does not define how the set of communicating threads in convergent operations -- the set of converged threads -- is affected by control flow.
If you're used to thinking in terms of structured control flow, this may seem trivial. Obviously, there is a tree of control flow constructs: loops, if-statements, and perhaps a few others depending on the language. Two threads are converged in the body of a child construct if and only if both execute that body and they are converged in the parent. Throw in some simple and intuitive rules about loop counters and early exits (nested return, break and continue, that sort of thing) and you're done.
In an unstructured control flow graph, the answer is not obvious at all. I gave a presentation at the 2020 LLVM Developers' Meeting that explains some of the challenges as well as a solution proposal that involves adding convergence control tokens to the IR.
Very briefly, convergent operations in the proposal use a token variable that is defined by a convergence control intrinsic. Two dynamic instances of the same static convergent operation from two different threads are converged if and only if the dynamic instances of the control intrinsic producing the used token values were converged.
(The published draft of the proposal talks of multiple threads executing the same dynamic instance. I have since been convinced that it's easier to teach this matter if we instead always give every thread its own dynamic instances and talk about a convergence equivalence relation between dynamic instances. This doesn't change the resulting semantics.)
The draft has three such control intrinsics: anchor, entry, and (loop) heart. Of particular interest here is the heart. For the most common and intuitive use cases, a heart intrinsic is placed in the header of natural loops. The token it defines is used by convergent operations in the loop. The heart intrinsic itself also uses a token that is defined outside the loop: either by another heart in the case of nested loops, or by an anchor or entry. The heart combines two intuitive behaviors: threads that are converged at the definition of the token used by the heart are converged when they first execute the heart, and on every later encounter threads are converged if and only if they were converged during the previous encounter -- in other words, convergence in the loop follows the iteration count.
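To make this a bit more concrete, here is a rough sketch of what the IR could look like, using the intrinsic names and the convergencectrl operand bundle from the draft proposal (details may differ from the final form):
define void @kernel(i32 %n) convergent {
entry:
  %entry.tok = call token @llvm.experimental.convergence.entry()
  br label %loop

loop:
  %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
  ; the heart: ties convergence in the loop body to convergence at %entry.tok,
  ; iteration by iteration
  %heart.tok = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %entry.tok) ]
  ; a convergent operation (e.g. a subgroup reduction) anchored to the heart
  call void @convergent.op() [ "convergencectrl"(token %heart.tok) ]
  %i.next = add i32 %i, 1
  %cond = icmp slt i32 %i.next, %n
  br i1 %cond, label %loop, label %exit

exit:
  ret void
}

declare token @llvm.experimental.convergence.entry()
declare token @llvm.experimental.convergence.loop()
declare void @convergent.op() convergent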
Viewed from this angle, how about we define a weaker version of these rules that lies somewhere between an anchor and a loop heart? We could call it a "light heart", though I will stick with "iterating anchor". The iterating anchor defines a token but has no arguments. Like for the anchor, the set of converged threads is implementation-defined -- when the iterating anchor is first encountered. When threads encounter the iterating anchor again without leaving the dominance region of its containing basic block, they are converged if and only if they were converged during their previous encounter of the iterating anchor.
The notion of an iterating anchor came up when discussing the convergence behaviors that can be guaranteed for natural loops. Is it possible to guarantee that natural loops always behave in the natural way -- according to their loop counter -- when it comes to convergence?
Naively, this should be possible: just put hearts into loop headers! Unfortunately, that's not so straightforward when multiple natural loops are contained in an irreducible loop:
Hearts in A and C must refer to a token defined outside the loops; that is, a token defined in E. The resulting program is ill-formed because it has a closed path that goes through two hearts that use the same token, but the path does not go through the definition of that token. This well-formedness rule exists because the rules about heart semantics are unsatisfiable if the rule is broken.
The underlying intuitive issue is that if the branch at E is divergent in a typical implementation, the wave (or subgroup) must choose whether A or C is executed first. Neither choice works. The heart in A indicates that (among the threads that are converged in E) all threads that visit A (whether immediately or via C) must be converged during their first visit of A. But if the wave executes A first, then threads which branch directly from E to A cannot be converged with those that first branch to C. The opposite conflict exists if the wave executes C first.
If we replace the hearts in A and C by iterating anchors, this problem goes away because the convergence during the initial visit of each block is implementation-defined. In practice, it should fall out of which of the blocks the implementation decides to execute first.
So it seems that iterating anchors can fill a gap in the expressiveness of the convergence control design. But are they really a sound addition? There are two main questions:
Consider the following simple CFG with an iterating anchor in A and a heart in B that refers back to a token defined in E:
Now consider two threads that are initially converged with execution traces:
One could try to resolve the paradox by saying that the threads cannot be converged in A at all, but this would mean that the threads must diverge before a divergent branch occurs. That seems unreasonable, since typical implementations want to avoid divergence as long as control flow is uniform.
The example arguably breaks the spirit of the rule about convergence regions from the draft proposal linked above, and so a minor change to the definition of convergence region may be used to exclude it.
What if the CFG instead looks as follows, which does not break any rules about convergence regions:
For the same execution traces, the heart rule again implies that the threads must be converged in B. The convergence of the first dynamic instances of A are technically implementation-defined, but we'd expect most implementations to be converged there.
The second dynamic instances of A cannot be converged due to the convergence of the dynamic instances of B. That's okay: the second dynamic instance of A in thread 2 is a re-entry into the dominance region of A, and so its convergence is unrelated to any convergence of earlier dynamic instances of A.
Unfortunately, we still cannot allow this second example. A program transform may find that the conditional branch in E is constant and the edge from E to B is dead. Removing that edge brings us back to the previous example which is ill-formed. However, a transform which removes the dead edge would not normally inspect the blocks A and B or their dominance relation in detail. The program becomes ill-formed by spooky action at a distance.
The following static rule forbids both example CFGs: if there is a closed path through a heart and an iterating anchor, but not through the definition of the token that the heart uses, then the heart must dominate the iterating anchor.
There is at least one other issue of spooky action at a distance. If the iterating anchor is not the first (non-phi) instruction of its basic block, then it may be preceded by a function call in the same block. The callee may contain control flow that ends up being inlined. Back edges that previously pointed at the block containing the iterating anchor will then point to a different block, which changes the behavior quite drastically. Essentially, the iterating anchor is reduced to a plain anchor.
What can we do about that? It's tempting to decree that an iterating anchor must always be the first (non-phi) instruction of a basic block. Unfortunately, this is not easily done in LLVM in the face of general transforms that might sink instructions or merge basic blocks.
We could chew through some other ideas for making iterating anchors work, but that turns out to be unnecessary. The desired behavior of iterating anchors can be obtained by inserting preheader blocks. The initial example of two natural loops contained in an irreducible loop becomes:
Place anchors in Ap and Cp and hearts in A and C that use the token defined by their respective dominating anchor. Convergence at the anchors is implementation-defined, but relative to this initial convergence at the anchor, convergence inside the natural loops headed by A and C behaves in the natural way, based on a virtual loop counter. The transform of inserting an anchor in the preheader is easily generalized.
To sum it up: We've concluded that defining an "iterating anchor" convergence control intrinsic is problematic, but luckily also unnecessary. The control intrinsics defined in the original proposal are sufficient. I hope that the discussion that led to those conclusions helps illustrate some aspects of the convergence control proposal for LLVM as well as the goals and principles that drove it.
A quick reminder: libei is the library for emulated input. It comes as a pair of C libraries, libei for the client side and libeis for the server side.
libei has been sitting mostly untouched since the last status update. There are two use-cases we need to solve for input emulation in Wayland - the ability to emulate input (think xdotool, or a Synergy/Barrier/InputLeap client) and the ability to capture input (think a Synergy/Barrier/InputLeap server). The latter effectively blocked development in libei [1]: until that use-case was sorted out there wasn't much point investing too much into libei - after all, it might get thrown out as a bad idea. And epiphanies were as elusive as toilet paper and RATs, so not much got done. This changed about a week or two ago when the required lightbulb finally arrived, pre-lit from the factory.
So, the solution to the input capturing use-case is going to be a so-called "passive context" for libei. In the traditional [2] "active context" approach for libei we have the EIS implementation in the compositor and a client using libei to connect to that. The compositor sets up a seat or more, then some devices within that seat that typically represent the available screens. libei then sends events through these devices, causing input to appear in the compositor which moves the cursor around. In a typical and simple use-case you'd get a 1920x1080 absolute pointer device and a keyboard with a $layout keymap; libei then sends events to position the cursor and/or happily type away on-screen.
In the "passive context" <deja-vu> approach for libei we have the EIS implementation in the compositor and a client using libei to connect to that. The compositor sets up a seat or more, then some devices within that seat </deja-vu> that typically represent the physical devices connected to the host computer. libei then receives events from these devices, causing input to be generated in the libei client. In a typical and simple use-case you'd get a relative pointer device and a keyboard device with a $layout keymap, the compositor then sends events matching the relative input of the connected mouse or touchpad.
The two notable differences are thus: events flow from EIS to libei and the devices don't represent the screen but rather the physical [3] input devices.
This changes libei from a library for emulated input to an input event transport layer between two processes. On a much higher level than e.g. evdev or HID and with more contextual information (seats, devices are logically abstracted, etc.). And of course, the EIS implementation is always in control of the events, regardless which direction they flow. A compositor can implement an event filter or designate key to break the connection to the libei client. In pseudocode, the compositor's input event processing function will look like this:
function handle_input_events():
    real_events = libinput.get_events()
    for e in real_events:
        if input_capture_active:
            send_event_to_passive_libei_client(e)
        else:
            process_event(e)
    emulated_events = eis.get_events_from_active_clients()
    for e in emulated_events:
        process_event(e)
Not shown here are the various appropriate filters and conversions in between (e.g. all relative events from libinput devices would likely be sent through the single relative device exposed on the EIS context). Again, the compositor is in control so it would be trivial to implement e.g. capturing of the touchpad only but not the mouse.
In the current design, a libei context can only be active or passive, not both. The EIS context is both, it's up to the implementation to disconnect active or passive clients if it doesn't support those.
Notably, the above only caters for the transport of input events; it doesn't actually make any decision on when to capture events. This is handled by the CaptureInput XDG Desktop Portal [4]. The idea here is that an application like the Synergy/Barrier/InputLeap server connects to the CaptureInput portal and requests a CaptureInput session. In that session it can define pointer barriers (left edge, right edge, etc.) and, in the future, maybe other triggers. In return it gets a libei socket that it can initialize a libei context from. When the compositor decides that the pointer barrier has been crossed, it re-routes the input events through the EIS context so they pop out in the application. Synergy/Barrier/InputLeap then converts that to the global position, passes it to the right remote Synergy/Barrier/InputLeap client and replays it there through an active libei context where it feeds into the local compositor.
Because the management of when to capture input is handled by the portal and the respective backends, it can be natively integrated into the UI. Because the actual input events are a direct flow between compositor and application, the latency should be minimal. Because it's a high-level event library, you don't need to care about hardware-specific details (unlike, say, the inputfd proposal from 2017). Because the negotiation of when to capture input is through the portal, the application itself can run inside a sandbox. And because libei only handles the transport layer, compositors that don't want to support sandboxes can set up their own negotiation protocol.
So overall, right now this seems like a workable solution.
[1] "blocked" is probably overstating it a bit but no-one else tried to push it forward, so..
[2] "traditional" is probably overstating it for a project that's barely out of alpha development
[3] "physical" is probably overstating it since it's likely to be a logical representation of the types of inputs, e.g. one relative device for all mice/touchpads/trackpoints
[4] "handled by" is probably overstating it since at the time of writing the portal is merely a draft of an XML file
Yeah, my b, I forgot this was a thing.
Fuck it though, I’m a professional, so I’m gonna pretend I didn’t just skip a month of blogs and get right back into it.
Gallivm is the nir/tgsi-to-llvm translation layer in Gallium that LLVMpipe (and thus Lavapipe) uses to generate the JIT functions which make triangles. It’s very old code in that it predates me knowing how triangles work, but that doesn’t mean it doesn’t have bugs.
And Gallivm bugs are the worst bugs.
For a long time, I’ve had SIGILL crashes on exactly one machine locally for the CTS glob dEQP-GLES31.functional.program_uniform.by*sampler2D_samplerCube*. These tests pass on everyone else’s machines including CI.
Like I said, Gallivm bugs are the worst bugs.
How does one debug JIT code? GDB can’t be used, valgrind doesn’t work, and, despite what LLVM developers would tell you, building an assert-enabled LLVM doesn’t help at all in most cases here since that will only catch invalid behavior, not questionably valid behavior that very obviously produces invalid results.
So we enter the world of lp_build_print debugging. Much like standard printf debugging, the strategy here is to just lp_build_print_value or lp_build_printf("I hate this part of the shader too") our way to figuring out where in the shader the crash occurs.
Here’s an example shader from dEQP-GLES31.functional.program_uniform.by_pointer.render.basic_struct.sampler2D_samplerCube_vertex that crashes:
#version 310 es
in highp vec4 a_position;
out mediump float v_vtxOut;
struct structType
{
    mediump sampler2D m0;
    mediump samplerCube m1;
};
uniform structType u_var;
mediump float compare_float (mediump float a, mediump float b) { return abs(a - b) < 0.05 ? 1.0 : 0.0; }
mediump float compare_vec4 (mediump vec4 a, mediump vec4 b) { return compare_float(a.x, b.x)*compare_float(a.y, b.y)*compare_float(a.z, b.z)*compare_float(a.w, b.w); }
void main (void)
{
    gl_Position = a_position;
    v_vtxOut = 1.0;
    v_vtxOut *= compare_vec4(texture(u_var.m0, vec2(0.0)), vec4(0.15, 0.52, 0.26, 0.35));
    v_vtxOut *= compare_vec4(texture(u_var.m1, vec3(0.0)), vec4(0.88, 0.09, 0.30, 0.61));
}
Which, in llvmpipe NIR, is:
shader: MESA_SHADER_VERTEX
source_sha1: {0xcb00c93e, 0x64db3b0f, 0xf4764ad3, 0x12b69222, 0x7fb42437}
inputs: 1
outputs: 2
uniforms: 0
shared: 0
ray queries: 0
decl_var uniform INTERP_MODE_NONE sampler2D lower@u_var.m0 (0, 0, 0)
decl_var uniform INTERP_MODE_NONE samplerCube lower@u_var.m1 (0, 0, 1)
decl_function main (0 params)
impl main {
block block_0:
/* preds: */
vec1 32 ssa_0 = deref_var &a_position (shader_in vec4)
vec4 32 ssa_1 = intrinsic load_deref (ssa_0) (access=0)
vec1 16 ssa_2 = load_const (0xb0cd = -0.150024)
vec1 16 ssa_3 = load_const (0x2a66 = 0.049988)
vec1 16 ssa_4 = load_const (0xb829 = -0.520020)
vec1 16 ssa_5 = load_const (0xb429 = -0.260010)
vec1 16 ssa_6 = load_const (0xb59a = -0.350098)
vec1 16 ssa_7 = load_const (0xbb0a = -0.879883)
vec1 16 ssa_8 = load_const (0xadc3 = -0.090027)
vec1 16 ssa_9 = load_const (0xb4cd = -0.300049)
vec1 16 ssa_10 = load_const (0xb8e1 = -0.609863)
vec2 32 ssa_13 = load_const (0x00000000, 0x00000000) = (0.000000, 0.000000)
vec1 32 ssa_49 = load_const (0x00000000 = 0.000000)
vec4 16 ssa_14 = (float16)txl ssa_13 (coord), ssa_49 (lod), 0 (texture), 0 (sampler)
vec1 16 ssa_15 = fadd ssa_14.x, ssa_2
vec1 16 ssa_16 = fabs ssa_15
vec1 16 ssa_17 = fadd ssa_14.y, ssa_4
vec1 16 ssa_18 = fabs ssa_17
vec1 16 ssa_19 = fadd ssa_14.z, ssa_5
vec1 16 ssa_20 = fabs ssa_19
vec1 16 ssa_21 = fadd ssa_14.w, ssa_6
vec1 16 ssa_22 = fabs ssa_21
vec1 16 ssa_23 = fmax ssa_16, ssa_18
vec1 16 ssa_24 = fmax ssa_23, ssa_20
vec1 16 ssa_25 = fmax ssa_24, ssa_22
vec3 32 ssa_27 = load_const (0x00000000, 0x00000000, 0x00000000) = (0.000000, 0.000000, 0.000000)
vec1 32 ssa_50 = load_const (0x00000000 = 0.000000)
vec4 16 ssa_28 = (float16)txl ssa_27 (coord), ssa_50 (lod), 1 (texture), 1 (sampler)
vec1 16 ssa_29 = fadd ssa_28.x, ssa_7
vec1 16 ssa_30 = fabs ssa_29
vec1 16 ssa_31 = fadd ssa_28.y, ssa_8
vec1 16 ssa_32 = fabs ssa_31
vec1 16 ssa_33 = fadd ssa_28.z, ssa_9
vec1 16 ssa_34 = fabs ssa_33
vec1 16 ssa_35 = fadd ssa_28.w, ssa_10
vec1 16 ssa_36 = fabs ssa_35
vec1 16 ssa_37 = fmax ssa_30, ssa_32
vec1 16 ssa_38 = fmax ssa_37, ssa_34
vec1 16 ssa_39 = fmax ssa_38, ssa_36
vec1 16 ssa_40 = fmax ssa_25, ssa_39
vec1 32 ssa_41 = flt32 ssa_40, ssa_3
vec1 32 ssa_42 = b2f32 ssa_41
vec1 32 ssa_43 = deref_var &gl_Position (shader_out vec4)
intrinsic store_deref (ssa_43, ssa_1) (wrmask=xyzw /*15*/, access=0)
vec1 32 ssa_44 = deref_var &v_vtxOut (shader_out float)
intrinsic store_deref (ssa_44, ssa_42) (wrmask=x /*1*/, access=0)
/* succs: block_1 */
block block_1:
}
There’s two sample ops (txl), and since these tests only do simple texture() calls, it seems reasonable to assume that one of them is causing the crash. Sticking a lp_build_print_value on the texel values fetched by the sample operations will reveal whether the crash occurs before or after them.
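In practice that just means dropping a few lines like these (variable names are illustrative) into the sampling code right after the texel values are produced:
/* print the four fetched texel channels at shader runtime; lp_build_print_value()
 * emits a call into the JITed code that dumps the whole SIMD vector */
for (unsigned chan = 0; chan < 4; chan++)
   lp_build_print_value(gallivm, "texel", texel_out[chan]);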
What output does this yield?
Test case 'dEQP-GLES31.functional.program_uniform.by_pointer.render.basic_struct.sampler2D_samplerCube_vertex'..
texel 1.43279037e-322 6.95333598e-310 0 1.43279037e-322 1.08694442e-322 1.43279037e-322 1.08694442e-322 0
texel 1.43279037e-322 6.95333598e-310 0 1.43279037e-322 1.08694442e-322 1.43279037e-322 1.08694442e-322 0
texel 1.43279037e-322 6.95333598e-310 0 1.43279037e-322 1.08694442e-322 1.43279037e-322 1.08694442e-322 0
texel 1.43279037e-322 6.95333598e-310 0 1.43279037e-322 1.08694442e-322 1.43279037e-322 1.08694442e-322 0
[1] 3500332 illegal hardware instruction (core dumped)
Each txl op fetches four values, which means this is the result from the first instruction, but the second one isn’t reached before the crash. Unsurprisingly, this is also the cube sampling instruction, which makes sense given that all the crashes of this type I get are from cube sampling tests.
Now that it’s been determined the second txl is causing the crash, it’s reasonable to assume that the construction of that sampling op is the cause rather than the op itself, as proven by sticking some simple lp_build_printf("What am I doing with my life") calls in just before that op. Indeed, as the printfs confirm, I’m still questioning the life choices that led me to this point, so it’s now proven that the txl instruction itself is the problem.
Cube sampling has a lot of complex math involved for face selection, and I’ve spent a lot of time in there recently. My first guess was that the cube coordinates were bogus. Printing them yielded results:
Test case 'dEQP-GLES31.functional.program_uniform.by_pointer.render.basic_struct.sampler2D_samplerCube_vertex'..
texel 6.9008994e-310 0 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617
texel 6.9008994e-310 0 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617
texel 6.9008994e-310 0 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617
texel 6.9008994e-310 0 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617
cubecoords nan nan nan nan nan nan nan nan
cubecoords nan nan nan nan nan nan nan nan
These cube coords have more NaNs than a 1960s Batman TV series, so it looks like I was right in my hunch. Printing the cube S-face value next yields more NaNs. My printf search continued a couple more iterations until I wound up at this function:
static LLVMValueRef
lp_build_cube_imapos(struct lp_build_context *coord_bld, LLVMValueRef coord)
{
   /* ima = +0.5 / abs(coord); */
   LLVMValueRef posHalf = lp_build_const_vec(coord_bld->gallivm, coord_bld->type, 0.5);
   LLVMValueRef absCoord = lp_build_abs(coord_bld, coord);
   LLVMValueRef ima = lp_build_div(coord_bld, posHalf, absCoord);
   return ima;
}
Immediately, all of us multiverse-brain engineers spot something suspicious: this has a division operation with a user-provided divisor. Printing absCoord here yielded all zeroes, which was about where my remaining energy was at this Friday morning, so I mangled the code slightly:
static LLVMValueRef
lp_build_cube_imapos(struct lp_build_context *coord_bld, LLVMValueRef coord)
{
   /* ima = +0.5 / abs(coord); */
   LLVMValueRef posHalf = lp_build_const_vec(coord_bld->gallivm, coord_bld->type, 0.5);
   LLVMValueRef absCoord = lp_build_abs(coord_bld, coord);
   /* avoid div by zero */
   LLVMValueRef sel = lp_build_cmp(coord_bld, PIPE_FUNC_GREATER, absCoord, coord_bld->zero);
   LLVMValueRef div = lp_build_div(coord_bld, posHalf, absCoord);
   LLVMValueRef ima = lp_build_select(coord_bld, sel, div, coord_bld->zero);
   return ima;
}
And blammo, now that Gallivm could no longer divide by zero, the test was now passing. And so were a lot of others.
There’s been some speculation about how close Zink really is to being “useful”, where “useful” is determined by the majesty of passing GL4.6 CTS.
So how close is it? The answer might shock you.
Remaining Lavapipe Fails: 17
Remaining ANV Fails (Icelake): 9
Big Triangle better keep a careful eye on us now.
Around 2 years ago while I was working on tessellation support for llvmpipe, and running the heaven benchmark on my Ryzen, I noticed that heaven despite running slowly wasn't saturating all the cores. I dug in a bit, and found that llvmpipe despite threading rasterization, fragment shading and blending stages, never did anything else while those were happening.
I dug into the code as I clearly remembered seeing a concept of a "scene" where all the primitives were binned into and then dispatched. It turned out the "scene" was always executed synchronously.
At the time I wrote support to allow multiple scenes to exist, so while one scene was executing, the vertex shading and binning for the next scene could execute and be queued up. For heaven at the time I saw some places where it would build 36 scenes. However heaven was still 1fps with tess, and regressions in other areas were rampant, so I mostly left the patches in a branch.
The reason so many things were broken by the patches was that large parts of llvmpipe, and also lavapipe, weren't ready for the async pipeline processing. The concept of a fence after the pipeline finished was there, but wasn't used properly everywhere. A lot of operations assumed there was nothing going on behind the scenes so never fenced. Lots of things like queries broke due to the fact that a query would always be ready in the old model, but now query availability could return unavailable like a real hw driver. Resource tracking existed but was incomplete, so knowing when to flush wasn't always accurate. Presentation was broken due to incorrect waiting both for GL and Lavapipe. Lavapipe needed semaphore support that actually did things, as apps used it between the render and present pipeline pieces.
Mesa CI recently got some paraview traces added to it, and I was doing some perf traces with them. Paraview is a data visualization tool, and it generates vertex heavy workloads, as opposed to compositors and even games. It turned out binning was most of the overhead, and I realized the overlapping series could help this sort of workload. I dusted off the patch series and nailed down all the issues.
Emma Anholt ran some benchmarks on the results with the paraview traces and got
I've got it all lined up in a merge request and it doesn't break CI anymore, so I hope to get it landed in the next while, once I clean up any misc bits.
Earlier this week, Neil McGovern announced that he is due to be stepping down as the Executive Director of the GNOME Foundation later this year. As the President of the board and Neil’s effective manager together with the Executive Committee, I wanted to take a moment to reflect on his achievements in the past 5 years and explain a little about what the next steps would be.
Since joining in 2017, Neil has overseen a productive period of growth and maturity for the Foundation, increasing our influence both within the GNOME project and the wider Free and Open Source Software community. Here are a few highlights of what he’s achieved together with the Foundation team and the community:
Recognizing and appreciating the amazing progress that GNOME has made with Neil’s support, the search for a new Executive Director provides the opportunity for the Foundation board to set the agenda and next high-level goals we’d like to achieve together with our new Executive Director.
In terms of the desktop, applications, technology, design and development processes, whilst there are always improvements to be made, the board’s general feeling is that thanks to the work of our amazing community of contributors, GNOME is doing very well in terms of what we produce and publish. Recent desktop releases have looked great, highly polished and well-received, and the application ecosystem is growing and improving through new developers and applications bringing great energy at the moment. From here, our largest opportunity in terms of growing the community and our user base is being able to articulate the benefits of what we’ve produced to a wider public audience, and deliver impact which allows us to secure and grow new and sustainable sources of funding.
For individuals, we are able to offer an exceedingly high quality desktop experience and a broad range of powerful applications which are affordable to all, backed by a nonprofit which can be trusted to look after your data, digital security and your best interests as an individual. From the perspective of being a public charity in the US, we also have the opportunity to establish programs that draw upon our community, technology and products to deliver impact such as developing employable skills, incubating new Open Source contributors, learning to program and more.
For our next Executive Director, we will be looking for an individual with existing experience in that nonprofit landscape, ideally with prior experience establishing and raising funds for programs that deliver impact through technology, and appreciation for the values that bring people to Free, Open Source and other Open Culture organizations. Working closely with the existing members, contributors, volunteers and whole GNOME community, and managing our relationships with the Advisory Board and other key partners, we hope to find a candidate that can build public awareness and help people learn about, use and benefit from what GNOME has built over the past two decades.
Neil has agreed to stay in his position for a 6 month transition period, during which he will support the board in our search for a new Executive Director and support a smooth hand-over. Over the coming weeks we will publish the job description for the new ED, and establish a search committee who will be responsible for sourcing and interviewing candidates to make a recommendation to the board for Neil’s successor – a hard act to follow!
I’m confident the community will join me and the board in personally thanking Neil for his 5 years of dedicated service in support of GNOME and the Foundation. Should you have any queries regarding the process, or offers of assistance in the coming hiring process, please don’t hesitate to join the discussion or reach out directly to the board.
After roughly 20 years and counting up to 0.40 in release numbers, I've decided to call the next version of the xf86-input-wacom driver the 1.0 release. [1] This cycle has seen a bulk of development (>180 patches), which is roughly as much as the last 12 releases together. None of these patches actually added user-visible features, so let's talk about technical debt and what turned out to be an interesting way of reducing it.
The wacom driver's git history goes back to 2002 and the current batch of maintainers (Ping, Jason and I) have all been working on it for one to two decades. It used to be a Wacom-only driver but with the improvements made to the kernel over the years the driver should work with most tablets that have a kernel driver, albeit some of the more quirky niche features will be more limited (but your non-Wacom devices probably don't have those features anyway).
The one constant was always: the driver was extremely difficult to test, something common to all X input drivers. Development is a cycle of restarting the X server a billion times; testing is mostly plugging hardware in and moving things around in the hope that you can spot the bugs. On a driver that doesn't move much, this isn't necessarily a problem. Until a bug comes along that requires some core rework of the event handling - in the kernel, libinput and, yes, the wacom driver.
After years of libinput development, I wasn't really in the mood for the whole "plug every tablet in and test it, for every commit". In a rather caffeine-driven development cycle [2], the driver was separated into two logical entities: the core driver and the "frontend". The default frontend is the X11 one which is now a relatively thin layer around the core driver parts, primarily to translate events into the X Server's API. So, not unlike libinput + xf86-input-libinput in terms of architecture. In ascii-art:
                     +--------------------+      |
 /dev/input/event0 ->| core driver | x11  |----->|  big giant
                     +--------------------+      |  X server
                                                 |  process
Now, that logical separation means we can have another frontend, which I implemented as a relatively light GObject wrapper; it is now a library creatively called libgwacom:

                     +------------------------+    |
 /dev/input/event0 ->| core driver | gwacom   |--->|  tools or test suites
                     +------------------------+    |

This isn't a public library or API and it's very much focused on the needs of the X driver, so there are some peculiarities in there. What it allows us though is a new wacom-record tool that can hook onto event nodes and print the events as they come out of the driver. The output is YAML, which means we can process it for comparison or just search for things. So instead of having to restart X and move and click things, you get this:
$ ./builddir/wacom-record
wacom-record:
version: 0.99.2
git: xf86-input-wacom-0.99.2-17-g404dfd5a
device:
path: /dev/input/event6
name: "Wacom Intuos Pro M Pen"
events:
- source: 0
event: new-device
name: "Wacom Intuos Pro M Pen"
type: stylus
capabilities:
keys: true
is-absolute: true
is-direct-touch: false
ntouches: 0
naxes: 6
axes:
- {type: x , range: [ 0, 44800], resolution: 200000}
- {type: y , range: [ 0, 29600], resolution: 200000}
- {type: pressure , range: [ 0, 65536], resolution: 0}
- {type: tilt_x , range: [ -64, 63], resolution: 57}
- {type: tilt_y , range: [ -64, 63], resolution: 57}
- {type: wheel , range: [ -900, 899], resolution: 0}
...
- source: 0
mode: absolute
event: motion
mask: [ "x", "y", "pressure", "tilt-x", "tilt-y", "wheel" ]
axes: { x: 28066, y: 17643, pressure: 0, tilt: [ -4, 56], rotation: 0, throttle: 0, wheel: -108, rings: [ 0, 0] }
A tool to quickly analyse data makes for faster development iterations, but it's still a far cry from reliable regression testing (and writing a test suite is a daunting task at best). But one nice thing about GObject is that it's accessible from other languages, including Python. So our test suite can be in Python, using pytest and all its capabilities, plus all the advantages Python has over C. Most of the driver testing comes down to: create a uinput device, set up the driver with some options, push events through that device and verify they come out of the driver in the right sequence and format. I don't need C for that. So there's a pull request sitting out there doing exactly that - adding a pytest test suite for a 20-year-old X driver written in C. That this is a) possible and b) a lot less work than expected got me quite unreasonably excited. If you do have to maintain an old C library, maybe consider whether it's possible to do the same, because there's nothing like the warm fuzzy feeling a green tick on a CI pipeline gives you.
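To illustrate the core trick the tests rely on, here is a hedged C sketch using libevdev's uinput support (the actual suite is Python + pytest; the device name and axis range below are just examples):

#include <linux/input.h>
#include <libevdev/libevdev.h>
#include <libevdev/libevdev-uinput.h>

/* Create a virtual pen-like device; the kernel gives us a real
 * /dev/input/eventN node that the driver under test can open. */
static struct libevdev_uinput *
create_fake_pen(void)
{
    struct libevdev *dev = libevdev_new();
    struct input_absinfo abs_x = { .minimum = 0, .maximum = 44800, .resolution = 200 };
    struct libevdev_uinput *uidev = NULL;

    libevdev_set_name(dev, "Fake Tablet Pen");
    libevdev_enable_event_type(dev, EV_KEY);
    libevdev_enable_event_code(dev, EV_KEY, BTN_TOOL_PEN, NULL);
    libevdev_enable_event_type(dev, EV_ABS);
    libevdev_enable_event_code(dev, EV_ABS, ABS_X, &abs_x);

    libevdev_uinput_create_from_device(dev, LIBEVDEV_UINPUT_OPEN_MANAGED, &uidev);
    libevdev_free(dev);
    return uidev;
}

/* Push a motion event through the kernel into the driver, then check what
 * wacom-record (or the gwacom API) reports on the other side. */
static void
move_pen(struct libevdev_uinput *uidev, int x)
{
    libevdev_uinput_write_event(uidev, EV_ABS, ABS_X, x);
    libevdev_uinput_write_event(uidev, EV_SYN, SYN_REPORT, 0);
}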
[1] As scholars of version numbers know, they make as much sense as your stereotypical uncle's facebook opinion, so why not.
[2] The Colombian GDP probably went up a bit
FOSDEM 2022 took place this past weekend, on February 5th and 6th. It was a virtual event for the second year in a row, but this year the Graphics devroom made a comeback and I participated in it with a talk titled “Fun with border colors in Vulkan”. In the talk, I explained the context and origins behind the VK_EXT_border_color_swizzle extension that was published last year and in which I’m listed as one of the contributors.
Big kudos and a big thank you to the FOSDEM organizers one more year. FOSDEM is arguably the most important free and open source software conference in Europe and one of the most important FOSS conferences in the world. It’s run entirely by volunteers, doing an incredible amount of work that makes it possible to have hundreds of talks and dozens of different devrooms in the span of two days. Special thanks to the Graphics devroom organizers.
For the virtual setup, one more year FOSDEM relied on Matrix. It’s great because at Igalia we also use Matrix for our internal communications and, thanks to the federated nature of the service, I could join the FOSDEM virtual rooms using the same interface, client and account I normally use for work. The FOSDEM organizers also let participants create ad-hoc accounts to join the conference, in case they didn’t have a Matrix account previously. Thanks to Matrix widgets, each virtual devroom had its corresponding video stream, which you could also watch freely on their site, embedded in each of the virtual devrooms, so participants wanting to watch the talks and ask questions had everything in a single page.
Talks were pre-recorded and submitted in advance, played at the scheduled times, and Jitsi was used for post-talk Q&A sessions, in which moderators and devroom organizers read aloud the most voted questions in the devroom chat.
Of course, a conference this big is not without its glitches. Video feeds from presenters and moderators were sometimes cut automatically by Jitsi allegedly due to insufficient bandwidth. It also happened to me during my Q&A section while I was using a wired connection on a 300 Mbps symmetric FTTH line. I can only suppose the pipe was not wide enough on the other end to handle dozens of streams at the same time, or Jitsi was playing games as it sometimes does. In any case, audio was flawless.
In addition, some of the pre-recorded videos could not be played at the scheduled time, resulting in a black screen with no sound, due to an apparent bug in the video system. It’s worth noting all pre-recorded talks had been submitted, processed and reviewed prior to the conference, so this was an unexpected problem. This happened with my talk and I had to do the presentation live. Fortunately, I had written a script for the talk and could use it to deliver it without issues by sharing my screen with the slides over Jitsi.
Finally, a possible improvement point for future virtual or mixed editions is the fact that the deadline for submitting talk videos was only communicated directly and prominently by email on the day the deadline ended, a couple of weeks before the conference. It was also mentioned in the presenter’s guide that was linked in a previous email message, but an explicit warning a few days or a week before the deadline would have been useful to avoid last-minute rushes and delays submitting talks.
In any case, those small problems don’t take away the great online-only experience we had this year.
Another advantage of having a written script for the talk is that I can use it to provide a pseudo-transcription of its contents for those that prefer not to watch a video or are unable to do so. I’ve also summed up the Q&A section at the end below. The slides are available as an attachment in the talk page.
Enjoy and see you next year, hopefully in Brussels this time.
Hello, my name is Ricardo Garcia. I work at Igalia as part of its Graphics Team, where I mostly work on the CTS project creating new Vulkan tests and fixing existing ones. Sometimes this means I also contribute to the specification text and other pieces of the Vulkan ecosystem.
Today I’m going to talk about the story behind the “border color swizzle” extension that was published last year. I created tests for this one and I also participated in its release process, so I’m listed as one of the contributors.
I’ve already started mentioning border colors, so before we dive directly into the extension let me give you a brief introduction to sampling operations in Vulkan and explain where border colors fit in that.
Sampling means reading pixels from an image view and is typically done in the fragment shader, for example to apply a texture to some geometry.
In the example you see here, we have an image view with 3 8-bit color components in BGR order and in unsigned normalized format. This means we’ll suppose each image pixel is stored in memory using 3 bytes, with each byte corresponding to the blue, green and red components in that order.
However, when we read pixels from that image view, we want to get back normalized floating point values between 0 (for the lowest value) and 1 (for the highest value, i.e. when all bits are 1 and the natural number in memory is 255).
As you can see in the GLSL code, the result of the operation is a vector of 4 floating point numbers. Since the image does not have alpha information, it’s natural to think the output vector may have a 1 in the last component, making the color opaque.
If the coordinates of the sample operation make us read the pixel represented there, we would get the values you see on the right.
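As a small aside, the unorm conversion mentioned above is just a division by the largest representable value; a minimal sketch (the specific byte values here are my own example, not the ones from the slide):

#include <stdint.h>

/* UNORM8: a stored byte of 255 samples as 1.0, a byte of 0 samples as 0.0. */
static inline float
unorm8_to_float(uint8_t v)
{
    return v / 255.0f;
}

/* e.g. a B8G8R8 texel stored as (51, 102, 204) becomes, once put back in
 * RGBA order and with alpha filled in, (0.2, 0.4, 0.8, 1.0) reversed to
 * red-first: r = 204/255 = 0.8, g = 0.4, b = 0.2, a = 1.0. */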
It’s also worth noting the sampler argument is a combination of two objects in Vulkan: an image view and a sampler object that specifies how sampling is done.
Focusing a bit on the coordinates used to sample from the image, the most common case is using normalized coordinates, which means using floating point values between 0 and 1 in each of the image axes, like the 2D case you see on the right.
But, what happens if the coordinates fall outside that range? That means sampling outside the original image, in points around it like the red marks you see on the right.
That depends on how the sampler is configured. When creating it, we can specify a so-called “address mode” independently for each of the 3 texture coordinate axes that may be used (2 in our example).
There are several possible address modes. The most common one is probably the one you see here on the bottom left, the repeat addressing mode, which applies a kind of modulo operation to the coordinates as if the texture were virtually repeating in the selected axis.
There’s also the clamp mode on the top right, for example, which clamps coordinates to 0 and 1 and produces the effect of the texture borders extending beyond the image edge.
The case we’re interested in is the one on the top left, which is the border mode. When sampling outside we get a border color, as if the image was surrounded by a virtually infinite frame of a chosen color.
The border color is specified when creating the sampler, and initially could only be chosen among a restricted set of values: transparent black (all zeros), opaque white (all ones) or the “special” opaque black color, which has a zero in all color components and a 1 in the alpha component.
The “custom border color” extension introduced the possibility of specifying arbitrary RGBA colors when creating the sampler.
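As a hedged sketch (field names from my reading of VK_EXT_custom_border_color, with most sampler fields omitted), specifying an arbitrary border color looks roughly like this:

#include <vulkan/vulkan.h>

VkSamplerCustomBorderColorCreateInfoEXT custom_border = {
    .sType = VK_STRUCTURE_TYPE_SAMPLER_CUSTOM_BORDER_COLOR_CREATE_INFO_EXT,
    .customBorderColor = { .float32 = { 0.0f, 0.0f, 1.0f, 1.0f } }, /* opaque blue */
    .format = VK_FORMAT_B8G8R8A8_UNORM,
};

VkSamplerCreateInfo sampler_info = {
    .sType        = VK_STRUCTURE_TYPE_SAMPLER_CREATE_INFO,
    .pNext        = &custom_border,
    .addressModeU = VK_SAMPLER_ADDRESS_MODE_CLAMP_TO_BORDER,
    .addressModeV = VK_SAMPLER_ADDRESS_MODE_CLAMP_TO_BORDER,
    .addressModeW = VK_SAMPLER_ADDRESS_MODE_CLAMP_TO_BORDER,
    .borderColor  = VK_BORDER_COLOR_FLOAT_CUSTOM_EXT,
    /* filters, mipmapping, anisotropy, etc. omitted for brevity */
};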
However, sampling operations are also affected by one parameter that’s not part of the sampler object. It’s part of the image view and it’s called the component swizzle.
In the example I gave you before we got some color values back, but that was supposing the component swizzle was the identity swizzle (i.e. color components were not reordered or replaced).
It’s possible, however, to specify other swizzles indicating what the resulting final color should be for each of the 4 components: you can reorder the components arbitrarily (e.g. saying the red component should actually come from the original blue one), you can force some of them to be zero or one, you can replicate one of the original components in multiple positions of the final color, etc. It’s a very flexible operation.
While working on the Zink Mesa driver, Mike discovered that the interaction between non-identity swizzle and custom border colors produced different results for different implementations, and wondered if the result was specified at all.
Let me give you an example: you specify a custom border color of 0, 0, 1, 1 (opaque blue) and an addressing mode of clamping to border in the sampler.
The image view has this strange swizzle in which the red component should come from the original blue, the green component is always zero, the blue component comes from the original green and the alpha component is not modified.
If the swizzle applies to the border color you get red. If it does not, you get blue.
Either option is reasonable: if the border color is specified as part of the sampler, maybe you want to get that color no matter which image view you use that sampler on, and expect to always get a blue border.
If the border color is supposed to act as if it came from the original image, it should be affected by the swizzle as the normal pixels are and you’d get red.
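For reference, the image view side of that example would look roughly like this hedged sketch (the swizzle lives in VkImageViewCreateInfo, not in the sampler):

#include <vulkan/vulkan.h>

VkImageViewCreateInfo view_info = {
    .sType  = VK_STRUCTURE_TYPE_IMAGE_VIEW_CREATE_INFO,
    .format = VK_FORMAT_B8G8R8A8_UNORM,
    .components = {
        .r = VK_COMPONENT_SWIZZLE_B,        /* red taken from the original blue */
        .g = VK_COMPONENT_SWIZZLE_ZERO,     /* green forced to zero */
        .b = VK_COMPONENT_SWIZZLE_G,        /* blue taken from the original green */
        .a = VK_COMPONENT_SWIZZLE_IDENTITY, /* alpha left untouched */
    },
    /* image, viewType and subresourceRange omitted for brevity */
};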
Jason pointed out the spec laid out the rules in a section called “Texel Input Operations”, which specifies that swizzling should affect border colors, and non-identity swizzles could be applied to custom border colors without restrictions according to the spec, contrary to “opaque black”, which was considered a special value and non-identity swizzles would result in undefined values with that border.
The Texel Input Operations spec section describes what the expected result is according to some steps which are supposed to happen in a defined order. It doesn’t mean the hardware has to work like this. It may need instructions before or after the hardware sampling operation to simulate that things happen in the order described there.
I’ve simplified and removed some of the steps but if border color needs to be applied we’re interested in the steps we can see in bold, and step 5 (border color applied) comes before step 7 (applying the image view swizzle).
I’ll describe the steps with a bit more detail now.
Step 1 is coordinate conversion: this includes converting normalized coordinates to integer texel coordinates for the image view and clamping and modifying those values depending on the addressing mode.
Once that is done, step 2 is validating the coordinates. Here, we’ll decide if texel replacement takes place or not, which may imply using the border color. In other sampling modes, robustness features will also be taken into account.
Step 3 happens when the coordinates are valid, and is reading the actual texel from the image. This immediately implies reordering components from the in-memory layout to the standard RGBA layout, which means a BGR image view gets its components immediately put in RGB order after reading.
Step 4 also applies if an actual texel was read from the image and is format conversion. For example, unsigned normalized formats need to convert pixel values (stored as natural numbers in memory) to floating point values.
Our example texel, already in RGB order, results in the values you see on the right.
Step 5 is texel replacement, and is the alternative to the previous two steps when the coordinates were not valid. In the case of border colors, this means taking the border color and cutting it short so it only has the components present in the original image view, to act as if the border color was actually part of the image.
Because this happens after the color components have already been reordered, the border color is always specified in standard red, green, blue and alpha order when creating the sampler. The fact that the original image view was in BGR order is irrelevant for the border color. We care about the alpha component being missing, but not about the in-memory order of the image view.
Our transparent blue border is converted to just “blue” in this step.
Step 6 takes us back to a unified flow of steps: it applies to the color no matter where it came from. The color is expanded to always have 4 components as expected in the shader. Missing color components are replaced with zeros and the alpha component, if missing, is set to one.
Our original transparent blue border is now opaque blue.
Step 7, finally the swizzle is applied. Let’s suppose our image view had that strange swizzle in which the red component is copied from the original blue, the green component is set to zero, the blue one is set to one and the alpha component is not modified.
Our original transparent blue border is now opaque magenta.
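Putting steps 5 to 7 together, here is a minimal model in C (my own sketch, not driver code) of what happens to a border texel; with the example inputs from the talk (a 0,0,1,0 border, a BGR image with no alpha, and the swizzle just described) it produces the same opaque magenta:

#include <stdbool.h>

enum swz { SWZ_R, SWZ_G, SWZ_B, SWZ_A, SWZ_ZERO, SWZ_ONE };

struct vec4 { float c[4]; };

static struct vec4
border_texel(struct vec4 border, const bool has_comp[4], const enum swz swizzle[4])
{
    struct vec4 tmp, out;

    /* Step 5 + 6: keep only the components the image format has, then expand
     * back to RGBA: missing colors become 0, a missing alpha becomes 1. */
    for (int i = 0; i < 4; i++)
        tmp.c[i] = has_comp[i] ? border.c[i] : (i == 3 ? 1.0f : 0.0f);

    /* Step 7: apply the image view component swizzle at the very end. */
    for (int i = 0; i < 4; i++) {
        switch (swizzle[i]) {
        case SWZ_ZERO: out.c[i] = 0.0f; break;
        case SWZ_ONE:  out.c[i] = 1.0f; break;
        default:       out.c[i] = tmp.c[swizzle[i]]; break;
        }
    }
    return out;
}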
So we had this situation in which some implementations swizzled the border color and others did not. What could we do?
We could double down on the existing spec and ask vendors to fix their implementations but, what happens if they cannot fix them? Or if the fix is impractical due to its impact on performance?
Unfortunately, that was the actual situation: some implementations could not be fixed. After discovering this problem, CTS tests were going to be created for these cases. If an implementation failed to behave as mandated by the spec, it wouldn’t pass conformance, so those implementations only had one way out: stop supporting custom border colors, but that’s also a loss for users if those implementations are in widespread use (and they were).
The second option is backpedaling a bit, making behavior undefined unless some other feature is present and designing a mechanism that would allow custom border colors to be used with non-identity swizzles at least in some of the implementations.
And that’s how the “border color swizzle” extension was created last year. Custom colors with non-identity swizzle produced undefined results unless the borderColorSwizzle feature was available and enabled. Some implementations could advertise support for this almost “for free” and others could advertise lack of support for this feature.
In the middle ground, some implementations can indicate they support the case, but the component swizzle has to be indicated when creating the sampler as well. So it’s both part of the image view and part of the sampler. Samplers created this way can only be used with image views having a matching component swizzle (which means they are no longer generic samplers).
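From my reading of VK_EXT_border_color_swizzle, that middle ground looks roughly like the sketch below: the expected component mapping is chained into the sampler at creation time, alongside the custom border color.

#include <vulkan/vulkan.h>

VkSamplerBorderColorComponentMappingCreateInfoEXT border_swizzle = {
    .sType = VK_STRUCTURE_TYPE_SAMPLER_BORDER_COLOR_COMPONENT_MAPPING_CREATE_INFO_EXT,
    .components = {
        .r = VK_COMPONENT_SWIZZLE_B,
        .g = VK_COMPONENT_SWIZZLE_ZERO,
        .b = VK_COMPONENT_SWIZZLE_G,
        .a = VK_COMPONENT_SWIZZLE_IDENTITY,
    },
    .srgb = VK_FALSE,
};
/* Chained into VkSamplerCreateInfo::pNext; the resulting sampler may then only
 * be used with image views whose component mapping matches. */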
The drawback of this extension, apart from the obvious observation that it should’ve been part of the original custom border color extension, is that it somehow lowers the bar for applications that want to use a single code path for every vendor. If borderColorSwizzle is supported, it’s always legal to pass the swizzle when creating the sampler. Some implementations will need it and the rest can ignore it, so the unified code path is now harder or more specific.
And that’s basically it. Sometimes the Vulkan Working Group in Khronos has had to backpedal and mark as undefined something that previous versions of the Vulkan spec considered defined. It’s not frequent nor ideal, but it happens. But it usually does not go as far as publishing a new extension as part of the fix, which is why I considered this interesting.
Thanks for watching! Let me know if you have any questions.
Martin: The first question is from "ancurio" and he’s asking if swizzling is implemented in hardware.
Me: I don’t work on implementations so take my answer with a grain of salt. It’s my understanding you can usually program that in hardware and the hardware does the swizzling for you. There may be implementations which need to do the swizzling in software, emitting extra instructions.
Martin: Another question from "ancurio". When you said lowering the bar do you mean raising it?
I explain that, yes, I meant to say raising the bar for the application. Note: I meant to say that it lowers the bar for the specification and API, which means a more complicated solution has been accepted.
Martin: "enunes" asks if this was originally motivated by some real application bug or by something like conformance tests/spec disambiguation?
I explain it has both factors. Mike found the problem while developing Zink, so a real application hit the problematic case, and then the Vulkan Working Group inside Khronos wanted to fix this, make the spec clear and provide a solution for apps that wanted to use non-identity swizzle with border colors, as it was originally allowed.
Martin: no more questions in the room but I have one more for you. How was your experience dealing with Khronos coordinating with different vendors and figuring out what was the acceptable solution for everyone?
I explain that the main driver behind the extension in Khronos was Piers Daniell from NVIDIA (NB: listed as the extension author). I mention that my experience was positive, that the Working Group is composed of people who are really interested in making a good specification and implementations that serve app developers. When this problem was detected I created some tests that worked as a poll to see which vendors could make this work easily and what others may need to make this work, if at all. Then this was discussed in the Working Group, a solution was proposed (the extension), more vendors reviewed and commented on it, then tests were adapted to the final solution, and finally the extension was published.
Martin: How long did this whole process take?
Me: A few months. Take into account the Working Group does not meet every day, and they have a backlog of issues to discuss. Each of the previous steps takes several weeks, so you end up with a few months, which is not bad.
Martin: Not bad at all.
Me: Not at all, I think it works reasonably well.
Martin: Specially when you realize you broke something and the specification needs fixing. Definitely decent.
(I nearly went with clutterectomy, but that would be doing our old servant project a disservice.)
Yesterday, I finally merged the work-in-progress branch porting totem to GStreamer's GTK GL sink widget, undoing a lot of the work done in 2011 and 2014 to port the video widget and then to finally make use of its features.
But GTK has been modernised (in GTK3 but in GTK4 even more so), GStreamer grew a collection of GL plugins, Wayland and VA-API matured and clutter (and its siblings clutter-gtk, and clutter-gst) didn't get the resources they needed to follow.
A screenshot with practically no changes, as expected
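For the curious, hooking up GStreamer's GTK GL sink in a player looks roughly like this hedged sketch (a simplified example, not Totem's actual code; error handling omitted):

#include <gst/gst.h>
#include <gtk/gtk.h>

/* Wrap gtkglsink in glsinkbin (which adds glupload/glcolorconvert), fetch the
 * GtkWidget it renders into, and hand the bin to playbin as its video sink. */
static GtkWidget *
setup_video_sink(GstElement *playbin)
{
    GstElement *gtkglsink = gst_element_factory_make("gtkglsink", NULL);
    GstElement *glsinkbin = gst_element_factory_make("glsinkbin", NULL);
    GtkWidget *video_widget = NULL;

    g_object_set(glsinkbin, "sink", gtkglsink, NULL);
    g_object_get(gtkglsink, "widget", &video_widget, NULL);
    g_object_set(playbin, "video-sink", glsinkbin, NULL);

    return video_widget; /* pack this widget into the window */
}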
The list of bug fixes and enhancements is substantial:
Until the port to GTK4, we expect an overall drop in performance on systems where there's no VA-API support, and the GTK4 port should bring it on par with the fastest players available for GNOME.
You can install a Preview version right now by running:
$ flatpak install --user https://flathub.org/beta-repo/appstream/org.gnome.Totem.Devel.flatpakref
and filing bugs in the GNOME GitLab.
Next stop, a GTK4 port!
I always do one of these big roundups for each Mesa release, so here’s what you can expect to see from zink in the upcoming release:
(use --i-know-this-is-not-a-benchmark to see the real speed)

All around, looking like another great release.
Yes, we’re here.
After literally years of awfulness, I’ve finally solved (for good) the debacle that is point size conversion from GL to Vulkan.
What’s so awful about it, you might be asking. How hard can it be to just add gl_PointSize to a shader, you follow up with, as you push your glasses higher up your nose.
Allow me to explain.
In Vulkan, there is exactly one method for setting the size of points: the gl_PointSize shader output controls it, and that’s it.
14.4 Points

If program point size mode is enabled, the derived point size is taken from the (potentially clipped) shader built-in gl_PointSize written by the last vertex processing stage and clamped to the implementation-dependent point size range. If the value written to gl_PointSize is less than or equal to zero, or if no value was written to gl_PointSize, results are undefined. If program point size mode is disabled, the derived point size is specified with the command
void PointSize( float size );
Tessellation evaluation shaders have a number of built-in output variables used to pass values to equivalent built-in input variables read by subsequent shader stages or to subsequent fixed functionality vertex processing pipeline stages. These variables are gl_Position, gl_PointSize, gl_ClipDistance, and gl_CullDistance, and all behave identically to equivalently named vertex shader outputs.
The built-in output gl_PointSize, if written, holds the size of the point to be rasterized, measured in pixels
In short, if PROGRAM_POINT_SIZE is enabled, then points are sized based on the gl_PointSize shader output of the last vertex stage.
In OpenGL ES (versions 2.0, 3.0, 3.1):
The point size is taken from the shader built-in gl_PointSize written by the vertex shader, and clamped to the implementation-dependent point size range.
In OpenGL ES (version 3.2):
The point size is determined by the last vertex processing stage. If the last vertex processing stage is not a vertex shader, the point size is 1.0. If the last vertex processing stage is a vertex shader, the point size is taken from the shader built-in gl_PointSize written by the vertex shader, and is clamped to the implementation-dependent point size range.
Thus for an ES context, the point size always comes from the last vertex stage, which means it can be anything it wants to be if that stage is a vertex shader and cannot be written to for all other stages because it is not a valid output (this last, bolded part is going to be really funny in a minute or two).
What do the specs agree on?
gl_PointSize
Literally that’s it.
Awesome.
As we know, Vulkan has a very simple and clearly defined model for point size:
The point size is taken from the (potentially clipped) shader built-in PointSize written by:
• the geometry shader, if active;
• the tessellation evaluation shader, if active and no geometry shader is active;
• the vertex shader, otherwise
- 27.10. Points
It really can be that simple.
So one would think that we can just hook up some conditionals based on the GL rules and then export the correct value.
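Something like this hypothetical sketch (the types and helpers are made up, purely to show the naive rule):

#include <stdbool.h>

enum point_size_source { POINT_SIZE_FROM_SHADER, POINT_SIZE_FROM_STATE };

/* Decide, per draw, whether the Vulkan pipeline should honor the shader's
 * gl_PointSize output or a fixed point size from glPointSize(). */
static enum point_size_source
choose_point_size_source(bool is_gles, bool program_point_size,
                         bool last_stage_is_vertex_shader)
{
    if (is_gles) {
        /* ES: only a vertex shader may write gl_PointSize; otherwise it's 1.0 */
        return last_stage_is_vertex_shader ? POINT_SIZE_FROM_SHADER
                                           : POINT_SIZE_FROM_STATE;
    }
    /* Desktop GL: honor gl_PointSize only if PROGRAM_POINT_SIZE is enabled */
    return program_point_size ? POINT_SIZE_FROM_SHADER : POINT_SIZE_FROM_STATE;
}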
That would be easy.
Simple.
It would make sense.
hahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahaha
It gets worse (obviously).
gl_PointSize is a valid XFB varying, which means it must be exported correctly to the transform feedback buffer. For the ES case, it’s simple, but for desktop GL, there’s a little something called PROGRAM_POINT_SIZE state which totally fucks that up. Because, as we know, Vulkan has exactly one way of setting point size, and it’s the shader variable.
Thus, if there is a desktop GL context using a vertex shader as its last vertex stage for a draw, and if that shader has its own gl_PointSize value, this value must be exported for XFB.
But not used for point rasterization.
…Because in order to pass CTS for ES 3.2, your implementation also has to be able to violate spec.
Remember above when I said it was going to be funny that gl_PointSize is not a legal output for non-vertex stages in ES contexts?
CTS explicitly has “wide points” tests which verify illegal point sizes that are exported by the tessellation and geometry shader stages. Isn’t that cool?
Also, let’s be reasonable people for a moment, who actually wants a point that’s just one pixel? Nobody can see that on their 8k display.
I hate GL point size, and so should you.
I keep meaning to blog, but then I get sidetracked by not blogging.
Truly a tough life.
So what’s new in zink-land?
Nothing too exciting. Mostly bug fixes. I managed to sneak ARB_sparse_texture_clamp in for zink just before the branchpoint, so all the sparse texturing features supported by Mesa will be supported by zink. But only on NVIDIA since they’re the only driver that fully supports Vulkan sparse texturing.
The past couple days I’ve been doing some truly awful things with gl_PointSize to try and make this conformant for all possible cases. It’s a real debacle, and I’ll probably post more in-depth about it so everyone can get a good chuckle.
The one unusual part of my daily routine is that I haven’t rebased my testing branch in at least a couple weeks now since I’ve been trying to iron out regressions. Will I find that everything crashes and fails as soon as I do?
Probably.
More posts to come.
There was an article on Open for Everyone today about Nobara, a Fedora-based distribution optimized for gaming. To be clear, I have no beef with Tomas Crider or any other creator/maintainer of a distribution targeting a specific use case. In fact they are usually trying to solve or work around real problems and make things easier for people. That said, I have for years felt that the need for these things is a failing in itself, and it has been a goal for me in the context of Fedora Workstation to figure out what we can do to remove the need for ‘usecase distros’. So I thought it would be of interest if I talk a bit about how I’ve been viewing these things and the concrete efforts we’ve taken to reduce the need for usecase-oriented distributions. It is worth noting that the usecase distributions have of course proven useful for this too, in the sense that they to some degree also function as a very detailed ‘bug report’ for why the general-case OS is not enough.
Before I start, you might say: but isn’t Fedora Workstation a usecase OS too? You often talk about having a developer focus? Yes, developers are something we care deeply about, but for instance that doesn’t mean we pre-install 50 IDEs in Fedora Workstation. Fedora Workstation should be a great general purpose OS out of the box, and then we should have tools like GNOME Software and Toolbx available to let you quickly and easily tweak it into your ideal development system. But at the same time, by being a general purpose OS at heart, it should be equally easy to install Steam and Lutris to start gaming or install Carla and Ardour to start doing audio production. Or install OBS Studio to do video streaming.
Looking back over the years, one of the first conclusions I drew from looking at all the usecase distributions out there was that they often were mostly the standard distro, but with a carefully procured list of pre-installed software; for instance the old Fedora game spin was exactly that, a copy of Fedora with a lot of games pre-installed. So why was this valuable to people? Those of us who have been around for a while remember that the average Linux ‘app store’ was a very basic GUI which listed available software by name (usually quite cryptic names) and at best with a small icon. There was almost no other metadata available and search functionality was limited at best. So finding software was not simple, and it was usually more of a ‘search the internet and if you find something interesting see if it’s packaged for your distro’. So the usecase distros that focused on having procured pre-installed software, be that games, pro-audio software, graphics tools or whatever their focus was, were basically responding to the fact that finding software was non-trivial, and a lot of people maybe missed out on software that could be useful to them since they simply never learned about its existence.
So when we kicked off the creation of GNOME Software, one of the big focuses early on was to create a system for providing good metadata and displaying that metadata in a useful manner. As an end user, the most obvious change was of course the richer UI of GNOME Software, but maybe just as important was the creation of AppStream, a specification for applications to ship metadata that allows GNOME Software and others to display much more in-depth information about the application and provide screenshots and so on.
So I do believe that by working on a better ‘App Store’ story for Linux, both through GNOME Software as the actual UI and by working with many stakeholders in the Linux ecosystem to define metadata standards like AppStream, we made software a lot more discoverable on Linux and thus reduced the need for pre-loading significantly. This work also provided an important baseline for things like Flathub to thrive, as it then had a clear way to provide metadata about the applications it hosts.
We do continue to polish that user experience on an ongoing basis, but I do feel we reduced the need to pre-load a ton of software very significantly already with this.
Of course another aspect of this is application availability, which is why we worked to ensure things like Steam are available in GNOME Software on Fedora Workstation, and which we have now expanded on by starting to include more and more software listings from Flathub. These things make it easy for our users to find the software they want, but at the same time we are still staying true to our mission of only shipping free software by default in Fedora.
The second major reason for usecase distributions has been that the generic version of the OS didn’t really have the right settings or setup to handle an important usecase. I think pro-audio is the best example of this, where usecase distros like Fedora Jam or Ubuntu Studio popped up. Pre-installing a lot of relevant software was definitely part of their DNA too, but there were also other issues involved, like the need for a special audio setup with JACK and often also kernel real-time patches applied. When we decided to include pro-audio support in PipeWire, resolving these issues was a big part of it. I strongly believe that we should be able to provide a simple and good out-of-the-box experience for musicians and audio engineers on Linux without needing the OS to be specifically configured for the task. The strong and positive response we’ve gotten from the pro-audio community for PipeWire suggests, I believe, that we are moving in the right direction there. I am not claiming things are 100% yet, but we feel very confident that we will get there with PipeWire and make the pro-audio folks full-fledged members of the Fedora WS community. Interestingly, we also spent quite a bit of time trying to ensure the pro-audio tools in Fedora have proper AppStream metadata so that they would appear in GNOME Software as part of this. One area we are still looking at is the real-time kernel stuff; our current take is that the remaining unmerged patches are not strictly needed anymore, as most of the important stuff has already been merged, but we are monitoring it as we keep developing and benchmarking PipeWire for the pro-audio usecase.
Another reason that I often saw drive the creation of a usecase distribution is special hardware support, and not necessarily that special of hardware: the NVidia driver, for instance, has triggered a lot of these attempts. The NVidia driver is challenging on a lot of levels and has been something we have been constantly working on. There were technical issues, for instance, like the NVidia driver and Mesa fighting over who owned the OpenGL.so implementation, which we fixed with the introduction of glvnd a few years ago. But for a distro like Fedora that also cares deeply about free and open source software, it also provided us with a lot of philosophical challenges. We had to answer the question of how we could, on one side, make sure our users had easy access to the driver without abandoning our principle of Fedora only shipping free software out of the box. I think we found a good compromise today where the NVidia driver is available in Fedora Workstation for easy install through GNOME Software, but at the same time we default to Nouveau out of the box. That said, this is a part of the story where we are still hard at work to improve things further, and while I am not at liberty to mention any details, I think I can at least mention that we are meeting with our engineering counterparts at NVidia on almost a weekly basis to discuss how to improve things, not just for graphics, but around compute and other shared areas of interest. The most recent public result of that collaboration was of course the XWayland support in recent NVidia drivers, but I promise you that this is something we keep focusing on, and I expect that we will be able to share more cool news and important progress over the course of the year, both for users of the NVidia binary driver and for users of Nouveau.
What are we still looking at in terms of addressing issues like this? Well, one thing we are talking about is whether there is value/need for a facility to install specific software based on detected hardware or software. For instance, if we detect a high-end gaming mouse connected to your system, should we install Piper/ratbag or at least make GNOME Software suggest it? And if we detect that you installed Lutris and Steam, are there other tools we should recommend you install, like the gamemode GNOME Shell extension? It is a somewhat hard question to answer, which is why we are still pondering it; on one side it seems like a nice addition, but such connections would mean that we need to have a big database we constantly maintain, which isn’t trivial, and also having something running on your system to, let’s say, check for those high-end mice does add a little overhead that might be a waste for many users.
Another area that we are looking at is the issue of codecs. We did a big effort a couple of years ago and got AC3, mp3, AAC and mpeg2 video cleared for inclusion, and also got the OpenH264 implementation from Cisco made available. That solved a lot of issues, but today, with so many more people getting into media creation, I believe we need to take another stab at it and for instance try to get reliable hardware-accelerated encoding and decoding of video. I am not ready to announce anything, but we’ve got a few ideas and leads we are looking at for how to move the needle there in a significant way.
So to summarize, I am not criticizing anyone for putting together what I call usecase distros, but at the same time I really want to get to a point where they are rarely needed, because we should be able to cater to most needs within the context of a general purpose Linux operating system. That said, I do appreciate the effort of these distro makers, both in trying to help users have a better experience on Linux and in indirectly helping us by showcasing potential solutions and highlighting the major pain points that still need addressing in a general purpose Linux desktop operating system.
NIR has been an integral part of the Mesa driver stack for about six or seven years now (depending on how you count) and a lot has changed since NIR first landed at the end of 2014 and I wrote my initial NIR notes. Also, for various reasons, I’ve had to give my NIR elevator pitch a few times lately. I think it’s time for a new post. This time on why, after working on this mess for seven years, I still think NIR was the right call.
Shortly after I joined the Mesa team at Intel in the summer of 2014, I was sitting in the cube area asking Ken questions, trying to figure out how Mesa was put together, and I asked, “Why don’t you use LLVM?” Suddenly, all eyes turned towards Ken and myself and I realized I’d poked a bear. Ken calmly explained a bunch of the packaging/shipping issues around having your compiler in a different project as well as issues radeonsi had run into with apps bundling their own LLVM that didn’t work. But for the more technical question of whether or not it was a good idea, his answer was something about trade-offs and how it’s really not clear if LLVM would really gain them much.
That same summer, Connor Abbott showed up as our intern and started developing NIR. By the end of the summer, he had a bunch of data structures, a few mostly untested passes, and a validator. He also had most of a GLSL IR to NIR pass which mostly passed validation. Later that year, after Connor had gone off to school, I took over NIR, finished the Intel scalar back-end NIR consumer, fixed piles of bugs, and wrote out-of-SSA and a bunch of optimization passes to get it to the point where we could finally land it in the tree at the end of 2014. Initially, it was only a few Intel folks and Emma Anholt (Broadcom, at the time) who were all that interested in NIR. Today, it’s integral to the Mesa project and at the core of every driver that’s still seeing active development. Over the past seven years, we (the Mesa community) have poured thousands of man hours (probably millions of engineering dollars) into NIR and it’s gone from something only capable of handling fragment shaders to supporting full Vulkan 1.2 plus ray-tracing (task and mesh are coming) along with OpenCL 1.2 compute.
Was it worth it? That’s the multi-million dollar (literally) question. 2014 was a simpler time. Compute shaders were still newish and people didn’t use them for all that much more than they would have used a fancy fragment shader for a couple years earlier. More advanced features like Vulkan’s variable pointers weren’t even on the horizon. Had I known at the time how much work we’d have to put into NIR to keep up, I may have said, “Nah, this is too much effort; let’s just use LLVM.” If I had, I think I would have made the wrong call.
I’d like to get this one out of the way first because, while these issues are definitely real, it’s easily the least compelling reason to write a whole new piece of software. Having your compiler in a separate project and in LLVM in particular comes with an annoying set of problems.
First, there’s release cycles. Mesa releases on a rough 3-month cadence whereas LLVM releases on a 6-month cadence and there’s nothing syncing the two release cycles. This means that any new feature enabled in Mesa that requires new LLVM compiler work can’t be enabled until a new LLVM is picked up. Not only does this make the question “what Mesa version has X?” unanswerable, it also means every one of these features needs conditional paths in the driver to be enabled or not depending on the LLVM version. Also, because we can’t guarantee which LLVM version a distro will choose to pair with any given Mesa version, radeonsi (the only LLVM-based hardware driver in Mesa) has to support the latest two releases of LLVM as well as tip-of-tree at all times. While this has certainly gotten better in recent years, it used to be that LLVM would switch around C++ data structures on you, requiring a bunch of wrapper classes in Mesa to deal with the mess. (They still reserve the right, it just happens less these days.)
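As a small illustration of what those conditional paths tend to look like (the struct and function names here are made up; only the LLVM_VERSION_MAJOR macro is real, coming from llvm-config):

#include <llvm/Config/llvm-config.h> /* provides LLVM_VERSION_MAJOR */

static void
setup_compiler(struct driver_ctx *ctx)
{
#if LLVM_VERSION_MAJOR >= 15
   /* path for behavior only available in newer LLVM releases */
   use_shiny_new_feature(ctx);
#else
   /* fallback kept alive for the older LLVMs distros still ship */
   use_old_workaround(ctx);
#endif
}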
Second is bug fixing. What do you do if there’s a compiler bug? You fix it in LLVM, of course, right? But what if the bug is in an old version of the AMD LLVM back-end and AMD’s LLVM people refuse to back-port the fix? You work around it in Mesa, of course! Yup, even though Mesa and LLVM are both open-source projects that theoretically have a stable bugfix release cycle, Mesa has to carry LLVM work-around patches because we can’t get the other team/project to back-port fixes. Things also get sticky whenever there’s a compiler bug which touches on the interface between the LLVM back-end compiler and the driver. How do you fix that in a backwards-compatible way? Sometimes, you don’t. Those interfaces can be absurdly subtle and complex and sometimes the bug is in the interface itself, so you either have to fix it in LLVM tip-of-tree and work around it in Mesa for older versions, or you have to break backwards compatibility somewhere and hope users pick up the LLVM bug-fix release.
Third is that some games actually link against LLVM and, historically, LLVM hasn’t done well with two different versions of it loaded at the same time. Some of this is LLVM and some of it is the way C++ shared library loading is handled on Linux. I won’t get into all the details but the point is that there have been some games in the past which simply can’t run on radeonsi because of LLVM library version conflicts. Some of this could probably be solved if Mesa were linked against LLVM statically but distros tend to be pretty sour on static linking unless you have a really good reason. A closed-source game pulling in their own LLVM isn’t generally considered to be a good reason.
And that, in the words of Forrest Gump, is all I have to say about that.
One of the key differences between NIR and LLVM is that NIR is a GPU-focused compiler whereas LLVM is CPU-focused. Yes, AMD has an upstream LLVM back-end for their GPU hardware, Intel likes to brag about their out-of-tree LLVM back-end, and many other vendors use it in their drivers as well, even if their back-ends are closed-source and internal. However, none of that actually means that LLVM understands GPUs or is any good at compiling for them. Most HW vendors have made that choice because they needed LLVM for OpenCL support and they wanted a unified compiler, so they figured out how to make LLVM do graphics. It works, but that doesn’t mean it works well.
To demonstrate this, let’s look at the following GLSL shader I stole from the texelFetch piglit test:
#version 120
#extension GL_EXT_gpu_shader4: require
#define ivec1 int
flat varying ivec4 tc;
uniform vec4 divisor;
uniform sampler2D tex;
out vec4 fragColor;
void main()
{
vec4 color = texelFetch2D(tex, ivec2(tc), tc.w);
fragColor = color/divisor;
}
When compiled to NIR, this turns into
shader: MESA_SHADER_FRAGMENT
name: GLSL3
inputs: 1
outputs: 1
uniforms: 1
ubos: 1
shared: 0
decl_var uniform INTERP_MODE_NONE sampler2D tex (1, 0, 0)
decl_var ubo INTERP_MODE_NONE vec4[1] uniform_0 (0, 0, 0)
decl_function main (0 params)
impl main {
block block_0:
/* preds: */
vec1 32 ssa_0 = load_const (0x00000000 /* 0.000000 */)
vec3 32 ssa_1 = intrinsic load_input (ssa_0) (0, 0, 34, 160) /* base=0 */ /* component=0 */ /* dest_type=int32 */ /* location=32 slots=1 */
vec1 32 ssa_2 = deref_var &tex (uniform sampler2D)
vec2 32 ssa_3 = vec2 ssa_1.x, ssa_1.y
vec1 32 ssa_4 = mov ssa_1.z
vec4 32 ssa_5 = (float32)txf ssa_2 (texture_deref), ssa_2 (sampler_deref), ssa_3 (coord), ssa_4 (lod)
vec4 32 ssa_6 = intrinsic load_ubo (ssa_0, ssa_0) (0, 1073741824, 0, 0, 16) /* access=0 */ /* align_mul=1073741824 */ /* align_offset=0 */ /* range_base=0 */ /* range=16 */
vec1 32 ssa_7 = frcp ssa_6.x
vec1 32 ssa_8 = frcp ssa_6.y
vec1 32 ssa_9 = frcp ssa_6.z
vec1 32 ssa_10 = frcp ssa_6.w
vec1 32 ssa_11 = fmul ssa_5.x, ssa_7
vec1 32 ssa_12 = fmul ssa_5.y, ssa_8
vec1 32 ssa_13 = fmul ssa_5.z, ssa_9
vec1 32 ssa_14 = fmul ssa_5.w, ssa_10
vec4 32 ssa_15 = vec4 ssa_11, ssa_12, ssa_13, ssa_14
intrinsic store_output (ssa_15, ssa_0) (0, 15, 0, 160, 132) /* base=0 */ /* wrmask=xyzw */ /* component=0 */ /* src_type=float32 */ /* location=4 slots=1 */
/* succs: block_1 */
block block_1:
}
Then, the AMD driver turns it into the following LLVM IR:
; ModuleID = 'mesa-shader'
source_filename = "mesa-shader"
target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7"
target triple = "amdgcn--"
define amdgpu_ps <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> @main(<4 x i32> addrspace(6)* inreg noalias align 4 dereferenceable(18446744073709551615) %0, <8 x i32> addrspace(6)* inreg noalias align 4 dereferenceable(18446744073709551615) %1, float addrspace(6)* inreg noalias align 4 dereferenceable(18446744073709551615) %2, <8 x i32> addrspace(6)* inreg noalias align 4 dereferenceable(18446744073709551615) %3, i32 inreg %4, i32 inreg %5, <2 x i32> %6, <2 x i32> %7, <2 x i32> %8, <3 x i32> %9, <2 x i32> %10, <2 x i32> %11, <2 x i32> %12, float %13, float %14, float %15, float %16, float %17, i32 %18, i32 %19, float %20, i32 %21) #0 {
main_body:
%22 = call nsz arcp float @llvm.amdgcn.interp.mov(i32 2, i32 0, i32 0, i32 %5) #4
%23 = bitcast float %22 to i32
%24 = call nsz arcp float @llvm.amdgcn.interp.mov(i32 2, i32 1, i32 0, i32 %5) #4
%25 = bitcast float %24 to i32
%26 = call nsz arcp float @llvm.amdgcn.interp.mov(i32 2, i32 2, i32 0, i32 %5) #4
%27 = bitcast float %26 to i32
%28 = getelementptr inbounds <8 x i32>, <8 x i32> addrspace(6)* %3, i32 32, !amdgpu.uniform !0
%29 = load <8 x i32>, <8 x i32> addrspace(6)* %28, align 4, !invariant.load !0
%30 = call nsz arcp <4 x float> @llvm.amdgcn.image.load.mip.2d.v4f32.i32(i32 15, i32 %23, i32 %25, i32 %27, <8 x i32> %29, i32 0, i32 0) #4
%31 = ptrtoint float addrspace(6)* %2 to i32
%32 = insertelement <4 x i32> <i32 poison, i32 0, i32 16, i32 163756>, i32 %31, i32 0
%33 = call nsz arcp float @llvm.amdgcn.s.buffer.load.f32(<4 x i32> %32, i32 0, i32 0) #4
%34 = call nsz arcp float @llvm.amdgcn.s.buffer.load.f32(<4 x i32> %32, i32 4, i32 0) #4
%35 = call nsz arcp float @llvm.amdgcn.s.buffer.load.f32(<4 x i32> %32, i32 8, i32 0) #4
%36 = call nsz arcp float @llvm.amdgcn.s.buffer.load.f32(<4 x i32> %32, i32 12, i32 0) #4
%37 = call nsz arcp float @llvm.amdgcn.rcp.f32(float %33) #4
%38 = call nsz arcp float @llvm.amdgcn.rcp.f32(float %34) #4
%39 = call nsz arcp float @llvm.amdgcn.rcp.f32(float %35) #4
%40 = call nsz arcp float @llvm.amdgcn.rcp.f32(float %36) #4
%41 = extractelement <4 x float> %30, i32 0
%42 = fmul nsz arcp float %41, %37
%43 = extractelement <4 x float> %30, i32 1
%44 = fmul nsz arcp float %43, %38
%45 = extractelement <4 x float> %30, i32 2
%46 = fmul nsz arcp float %45, %39
%47 = extractelement <4 x float> %30, i32 3
%48 = fmul nsz arcp float %47, %40
%49 = insertvalue <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> undef, i32 %4, 4
%50 = insertvalue <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> %49, float %42, 5
%51 = insertvalue <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> %50, float %44, 6
%52 = insertvalue <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> %51, float %46, 7
%53 = insertvalue <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> %52, float %48, 8
%54 = insertvalue <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> %53, float %20, 19
ret <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> %54
}
; Function Attrs: nounwind readnone speculatable willreturn
declare float @llvm.amdgcn.interp.mov(i32 immarg, i32 immarg, i32 immarg, i32) #1
; Function Attrs: nounwind readonly willreturn
declare <4 x float> @llvm.amdgcn.image.load.mip.2d.v4f32.i32(i32 immarg, i32, i32, i32, <8 x i32>, i32 immarg, i32 immarg) #2
; Function Attrs: nounwind readnone willreturn
declare float @llvm.amdgcn.s.buffer.load.f32(<4 x i32>, i32, i32 immarg) #3
; Function Attrs: nounwind readnone speculatable willreturn
declare float @llvm.amdgcn.rcp.f32(float) #1
attributes #0 = { "InitialPSInputAddr"="0xb077" "denormal-fp-math"="ieee,ieee" "denormal-fp-math-f32"="preserve-sign,preserve-sign" "target-features"="+DumpCode" }
attributes #1 = { nounwind readnone speculatable willreturn }
attributes #2 = { nounwind readonly willreturn }
attributes #3 = { nounwind readnone willreturn }
attributes #4 = { nounwind readnone }
!0 = !{}
For those of you who can’t read NIR and/or LLVM or don’t want to sift through all that, let me reduce it down to the important lines:
GLSL:
vec4 color = texelFetch2D(tex, ivec2(tc), tc.w);
NIR:
vec4 32 ssa_5 = (float32)txf ssa_2 (texture_deref), ssa_2 (sampler_deref), ssa_3 (coord), ssa_4 (lod)
LLVM:
%30 = call nsz arcp <4 x float> @llvm.amdgcn.image.load.mip.2d.v4f32.i32(i32 15, i32 %23, i32 %25, i32 %27, <8 x i32> %29, i32 0, i32 0) #4
; Function Attrs: nounwind readonly willreturn
declare <4 x float> @llvm.amdgcn.image.load.mip.2d.v4f32.i32(i32 immarg, i32, i32, i32, <8 x i32>, i32 immarg, i32 immarg) #2
attributes #2 = { nounwind readonly willreturn }
attributes #4 = { nounwind readnone }
In NIR, a texelFetch()
shows up as a texture instruction. NIR has a special instruction type just for textures called nir_tex_instr
to handle the combinatorial explosion of possibilities when it comes to all the different ways you can access a texture. In this particular case, the texture opcode is nir_texop_txf
for a texel fetch and it is passed a texture, a sampler, a coordinate and an LOD. Pretty standard stuff.
In AMD-flavored LLVM IR, this turns into a magic intrinsic function called llvm.amdgcn.image.load.mip.2d.v4f32.i32
. A bunch of information about the operation such as the fact that it takes a mip parameter and returns a vec4
is encoded in the function name. The AMD back-end then knows how to turn this into the right sequence of hardware instructions to load from a texture.
There are a couple of important things to note here. First is the @llvm.amdgcn
prefix on the function name. This is an entirely AMD-specific function. If I dumped out the LLVM from the Intel Windows drivers for that same GLSL, it would use a different function name with a different encoding for the various bits of ancillary information such as the return type. Even though both drivers share LLVM in theory, the way they encode graphics operations is entirely different. If you looked at NVIDIA, you would find a third encoding. There is no standardization.
Why is this important? Well, one of the most common arguments I hear from people for why we should all be using LLVM for graphics is because it allows for code sharing. Everyone can leverage all that great work that happens in upstream LLVM. Except it doesn’t. Not really. Sure, you can get LLVM’s algebraic optimizations and code motion etc. But you can’t share any of the optimizations that are really interesting for graphics because nothing graphics-related is common. Could it be standardized? Probably. But, in the state it’s in today, any claim that two graphics compilers share significant optimizations because they’re both LLVM-based is a half-truth at best. And it will never become standardized unless someone other than AMD decides to put their back-end into upstream LLVM and they decide to work together.
The second important bit about that LLVM function call is that LLVM has absolutely no idea what that function does. All it knows is that it’s been decorated nounwind
, readonly
, and willreturn
. The readonly
gives it a bit of information so it knows it can move the function call around a bit since it doesn’t write memory. However, it can’t even eliminate redundant texture ops because, for all LLVM knows, a second call will return a different result. While LLVM has pretty good visibility into the basic math in the shader, when it comes to anything that touches image or buffer memory, it’s flying entirely blind. The Intel LLVM-based graphics compiler tries to improve this somewhat by using actual LLVM pointers for buffer memory so LLVM gets a bit more visibility but you still end up with a pile of out-of-thin-air pointers that all potentially alias each other so it’s pretty limited.
In contrast, NIR knows exactly what sort of thing nir_texop_txf
is and what it does. It knows, for instance, that, even though it accesses external memory, the API guarantees that nothing shifts out from under you so it’s fine to eliminate redundant texture calls. For nir_texop_tex
(texture()
in GLSL), it knows that it takes implicit derivatives and so it can’t be moved into non-uniform control-flow. For things like SSBO and workgroup memory, we know what kind of memory they’re touching and can do alias analysis that’s actually aware of buffer bindings.
When people try to justify their use of LLVM to me, there are typically two major benefits they cite. The first is that LLVM lets them take advantage of all this academic compiler work. In the previous section, I explained why this is a weak argument at best. The second is that embracing LLVM for graphics lets them share code with their compute compiler. Does that mean that we’re against sharing code? Not at all! In fact, NIR lets us get far more code sharing than most companies do by using LLVM.
The difference is the axis for sharing. This is something I ran into trying to explain myself to people at Intel all the time. They’re usually only thinking about how to get the Intel OpenCL driver and the Intel D3D12 driver to share code. With NIR, we have compiler code shared effectively across 20 years of hardware from 8 different vendors and at least 4 APIs. So while Intel’s Linux Vulkan and OpenCL drivers don’t share a single line of compiler code, it’s not like we went off and hand-coded a whole compiler stack just for Intel Linux Vulkan.
As an example of this, consider nir_lower_tex(), a pass that lowers various different types of texture operations to other texture operations. It can, among other things:
Lower texture projectors away by doing the division in the shader,
Lower texelFetchOffset()
to texelFetch()
,
Lower rectangle textures by dividing the coordinate by the result of textureSize()
,
Lower texture swizzles to swizzling in the shader,
Lower various forms of textureGrad*()
to textureLod*()
under various conditions,
Lower imageSize(i, lod)
with an LOD to imageSize(i, 0)
and some shader math,
And much more…
Exactly what lowering is needed is highly hardware-dependent (except projectors; only old Qualcomm hardware has those) but most of them are needed by at least two different vendors’ hardware. While most of these are pretty simple, when you get into things like turning derivatives into LODs, the calculations get complex and we really don’t want everyone typing them out themselves if we can avoid it.
And texture lowering is just one example. We’ve got dozens of passes for everything from lowering read-only images to textures for OpenCL to lowering built-in functions like frexp()
to simpler math to flipping gl_FragCoord
and gl_PointCoord
when rendering upside down, which is required to implement OpenGL on Linux window-systems. All that code is in one central place where it’s usable by all the graphics drivers on Linux.
I mentioned earlier that having your compiler out-of-tree is painful from a packaging and release point-of-view. What I haven’t addressed yet is just how tight driver/compiler integration has to be. It depends a lot on the API and hardware, of course, but the interface between compiler and driver is often very complex. We make it look very simple on the API side where you have descriptor sets (or bindings in GL) and then you access things from them in the shader. Simple, right? Hah!
In the Intel Linux Vulkan driver, we can access a UBO in one of four ways, depending on a complex heuristic (roughly sketched in code after this list):
We try to find up to 4 small ranges of commonly used UBO constants and push those into the shader as push constants.
If we can’t push it all and it fits inside the hardware’s 240 entry binding table, we create a descriptor for it and put it in the binding table.
Depending on the hardware generation, UBOs successfully bound to descriptors might be accessed as SSBOs or we might access them through the texture unit.
If we run out of entries in the binding table or if it’s in a ray-tracing stage (those don’t have binding tables), we fall back to doing bounds checking in the shader and access it using raw 64-bit GPU addresses.
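To make that decision flow a bit more concrete, here is a rough paraphrase of the heuristic in Python. All of the names here are invented for illustration; the real logic lives in the driver and is considerably more involved.
def choose_ubo_access(pushable_ranges, binding_table_slots_left, is_ray_tracing):
    # A paraphrase of the heuristic above with invented names; not real driver code.
    plan = []
    # 1. Push up to 4 small, commonly used ranges as push constants.
    plan += [('push_constants', r) for r in pushable_ranges[:4]]
    if not is_ray_tracing and binding_table_slots_left > 0:
        # 2./3. Bind a descriptor in the 240-entry binding table; whether it
        #       then gets accessed like an SSBO or through the texture unit
        #       depends on the hardware generation.
        plan.append(('binding_table', 'ssbo_or_texture_depending_on_gen'))
    else:
        # 4. Fall back to raw 64-bit GPU addresses with bounds checking done
        #    in the shader.
        plan.append(('global_address', 'bounds_checked_in_shader'))
    return plan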
And that’s just UBOs! SSBO binding has a similar level of complexity and also depends on the SSBO operations done in the shader. Textures have silent fall-back to bindless if we have too many, etc. In order to handle all this insanity, we have a compiler pass called anv_nir_apply_pipeline_layout()
which lives in the driver. The interface between that pass and the rest of the driver is quite complex and can communicate information about exactly how things are actually laid out. We do have to serialize it to put it all in the pipeline cache so that limits the complexity some but we don’t have to worry about keeping the interface stable at all because it lives in the driver.
We also have passes for handling YCbCr format conversion, turning multiview into instanced rendering and constructing a gl_ViewID
in the shader based on the view mask and the instance number, and a handful of other tasks. Each of these requires information from the VkPipelineCreateInfo
and some of them result in magic push constants which the driver has to know need pushing.
Trying to do that with your compiler in another project would be insane. So how does AMD do it with their LLVM compiler? Good question! They either do it in NIR or as part of the NIR to LLVM conversion. By the time the shader gets to LLVM, most of the GL or Vulkanisms have been translated to simpler constructs, keeping the driver/LLVM interface manageable. It also helps that AMD’s hardware binding model is crazy simple and was basically designed for an API like Vulkan.
One of the riskier decisions we made when designing NIR was to make all control-flow inherently structured. Instead of branch and conditional branch instructions like LLVM or SPIR-V has, NIR has control-flow nodes in a tree structure. The root of the tree is always a nir_function_impl
. Inside each function is a list of control-flow nodes that may be nir_block
, nir_if
, or nir_loop
. An if has a condition along with then and else cases. A loop is a simple infinite loop and there are nir_jump_break
and nir_jump_continue
instructions which act exactly as their C counterparts.
At the time, this decision was made from pure pragmatism. We had structure coming out of GLSL and most of the back-ends expected structure. Why break everything? It did mean that, when we started writing control-flow manipulation passes, things were a lot harder. A dead control-flow pass in an unstructured IR is trivial: replace any conditional branch whose condition is a constant with an unconditional branch to the appropriate target, then delete any unreachable blocks and merge blocks as necessary. Done. In a structured IR, it’s a lot more fiddly. You have to manually collapse if ladders and deleting the unconditional break at the end of a loop is equivalent to loop unrolling. But we got over that hump, built tools to make it less painful, and have implemented most of the important control-flow optimizations at this point. In exchange, back-ends get structure, which is something most GPUs want thanks to the SIMT model they use.
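To show just how mechanical the unstructured version is, here is a toy dead control-flow pass over an invented CFG encoding. This is not NIR code, and it skips the block-merging step:
# Toy pass: blocks end in ('jump', target), ('branch', cond, then_t, else_t),
# or ('ret',), where cond is True, False, or an unknown value.
def dead_cf(blocks, entry):
    # 1. Turn constant conditional branches into unconditional jumps.
    for blk in blocks.values():
        term = blk['term']
        if term[0] == 'branch' and isinstance(term[1], bool):
            blk['term'] = ('jump', term[2] if term[1] else term[3])
    # 2. Drop anything that is no longer reachable from the entry block.
    reachable, stack = set(), [entry]
    while stack:
        name = stack.pop()
        if name in reachable:
            continue
        reachable.add(name)
        term = blocks[name]['term']
        if term[0] == 'jump':
            stack.append(term[1])
        elif term[0] == 'branch':
            stack += [term[2], term[3]]
    return {name: blk for name, blk in blocks.items() if name in reachable}

blocks = {
    'start': {'term': ('branch', False, 'then', 'else')},
    'then':  {'term': ('jump', 'end')},
    'else':  {'term': ('jump', 'end')},
    'end':   {'term': ('ret',)},
}
print(sorted(dead_cf(blocks, 'start')))   # ['else', 'end', 'start']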
What we didn’t see coming when we made that decision (2014, remember?) was wave/subgroup ops. In the last several years, the SIMT nature of shader execution has slowly gone from an implementation detail to something that’s baked into all modern 3D and compute APIs and shader languages. With that shift has come the need to be consistent about re-convergence. If we say “texture()
has to be in uniform control flow”, is the following shader ok?
#version 120
varying vec2 tc;
uniform sampler2D tex;
out vec4 fragColor;
void main()
{
if (tc.x > 1.0)
tc.x = 1.0;
fragColor = texture(tex, tc);
}
Obviously, it should be. But what guarantees that you’re actually in uniform control-flow by the time you get to the texture()
call? In an unstructured IR, once you diverge, it’s really hard to guarantee convergence. Of course, every GPU vendor with an LLVM-based compiler has algorithms for trying to maintain or re-create the structure but it’s always a bit fragile. Here’s an even more subtle example:
#version 120
varying vec2 tc;
uniform sampler2D tex;
out vec4 fragColor;
void main()
{
/* Block 0 */
float x = tc.x;
while (1) {
/* Block 1 */
if (x < 1.0) {
/* Block 2 */
tc.x = x;
break;
}
/* Block 3 */
x = x - 1.0;
}
/* Block 4 */
fragColor = texture(tex, tc);
}
The same question of validity holds but there’s something even trickier in here. Can the compiler merge block 4 and block 2? If so, where should it put it? To a CPU-centric compiler like LLVM, it looks like it would be fine to merge the two and put it all in block 2. In fact, since texture ops are expensive and block 2 is deeper inside control-flow, it may think the resulting shader would be more efficient if it did. And it would be wrong on both counts.
First, the loop exit condition is non-uniform and, since texture()
takes derivatives, it’s illegal to put it in non-uniform control-flow. (Yes, in this particular case, the result of those derivatives might be a bit wonky.) Second, due to the SIMT nature of execution, you really don’t want the texture op in the loop. In the worst case, a 32-wide execution will hit block 2 32 separate times whereas, if you guarantee re-convergence, it only hits block 4 once.
The fact that NIR’s control-flow is structured from start to finish has been a hidden blessing here. Once we get the structure figured out from SPIR-V decorations (which is annoyingly challenging at times), we never lose that structure and the re-convergence information it implies. NIR knows better than to move derivatives into non-uniform control-flow and its code-motion passes are tuned assuming a SIMT execution model. What has become a constant fight for people working with LLVM is a non-issue for us. The only thing that has been a challenge has been dealing with SPIR-V’s less than obvious structure rules and trying to make sure we properly structurize everything that’s legal. (It’s been getting better recently.)
Side-note: NIR does support OpenCL SPIR-V which is unstructured. To handle this, we have
nir_jump_goto
and nir_jump_goto_if
instructions which are allowed only for a very brief period of time. After the initial SPIR-V to NIR conversion, we run a couple passes and then structurize. After that, it remains structured for the rest of the compile.
Every GPU compiler engineer has horror stories about something some app developer did in a shader. Sometimes it’s the fault of the developer and sometimes it’s just an artifact of whatever node-based visual shader building system the game engine presents to the artists and how it’s been abused. On Linux, however, it can get even more entertaining. Not only do we have those shaders that were written for DX9, where someone lost the code so they ran them through a DX9 to HLSL translator and then through FXC, but then, when they ported the app to OpenGL so it could run on Linux, they did a DXBC to GLSL conversion with some horrid tool. The end result is x != 0
implemented with three levels of nested function calls, multiple splats out to a vec4
and a truly impressive pile of control-flow. I only wish I were joking….
To chew through this mess, we have nir_opt_algebraic()
. We’ve implemented a little language in nir_opt_algebraic.py for expressing these search-and-replace patterns as Python tuples. To get a sense for what this looks like, let’s look at some excerpts from nir_opt_algebraic.py
starting with the simple description at the top:
# Written in the form (<search>, <replace>) where <search> is an expression
# and <replace> is either an expression or a value. An expression is
# defined as a tuple of the form ([~]<op>, <src0>, <src1>, <src2>, <src3>)
# where each source is either an expression or a value. A value can be
# either a numeric constant or a string representing a variable name.
#
# <more details>
optimizations = [
...
(('iadd', a, 0), a),
This rule is a good starting example because it’s so straightforward. It looks for an integer add of something with zero and replaces it with just that something. A slightly more complex example removes redundant fmax
opcodes:
(('fmax', ('fmax', a, b), b), ('fmax', a, b)),
Since it’s written in Python, we can also write little rule generators if the same thing applies to a bunch of opcodes or if you want to generalize across types:
# For any float comparison operation, "cmp", if you have "a == a && a cmp b"
# then the "a == a" is redundant because it's equivalent to "a is not NaN"
# and, if a is a NaN then the second comparison will fail anyway.
for op in ['flt', 'fge', 'feq']:
optimizations += [
(('iand', ('feq', a, a), (op, a, b)), ('!' + op, a, b)),
(('iand', ('feq', a, a), (op, b, a)), ('!' + op, b, a)),
]
Because we’ve made adding new optimizations so incredibly easy, we have a lot of them. Not just the simple stuff I’ve highlighted above, either. We’ve got at least two cases where someone hand-rolled bitfieldReverse()
and we match a giant pattern and turn it into a single HW instruction. (Some UE4 demo and Cyberpunk 2077, if you want to know who to blame. They hand-roll it differently, of course.) We also have patterns to chew through all the garbage from D3D9 to HLSL conversion where they emit piles of x ? 1.0 : 0.0
everywhere because D3D9 didn’t have real Boolean types. All told, as of the writing of this blog post, we have 1911 such search and replace patterns.
Not only have we made it easy to add new patterns but the nir_search
framework has some pretty useful smarts in it. The expression I first showed matches a + 0
and replaces it with a
but nir_search
is smart enough to know that nir_op_iadd
is commutative and so it also matches 0 + a
without having to write two expressions. We also have syntax for detecting constants, handling different bit sizes, and applying arbitrary C predicates based on the SSA value. Since NIR is actually a vector IR (we support a lot of vec4-based hardware), nir_search
also magically handles swizzles for you.
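To make that concrete, here are a couple of made-up (but representative) rules showing some of that extra syntax. As in the excerpts above, a, b, and c are pattern variables declared earlier in the file; these exact rules are illustrative and not copied from nir_opt_algebraic.py:
optimizations += [
    # '#b' and '#c' only match constants; the leading '~' marks the rule as
    # inexact, so it is skipped when precise float semantics are required.
    # Re-associating lets the two constants get folded together.
    (('~fmul', ('fmul', a, '#b'), '#c'), ('fmul', a, ('fmul', b, c))),
    # 'a@32' only matches 32-bit values: ANDing one with an all-ones mask
    # is a no-op.
    (('iand', 'a@32', 0xffffffff), a),
]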
You might think 1911 patterns is a lot and it is. Doesn’t that take forever? Isn’t it O(NPS) where N is the number of instructions, P is the number of patterns, and S is the average pattern size or something like that? Nope! A couple years ago, Connor Abbot converted it to use a finite-state automaton, built at driver compile time, to filter out impossible matches as we go. The result is that the whole pass effectively runs in linear time in the number of instructions.
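Here is a toy illustration of that filtering idea in plain Python. It is nothing like the real nir_search automaton, which is generated when the driver is built, but it shows why a cheap bottom-up pre-filter keeps the pass roughly linear: the expensive matcher only runs where a whole search pattern could possibly apply.
# Toy illustration only: variables and constants are left for the real
# matcher; we filter purely on opcodes and operand counts.
PATTERNS = [
    ('fmax', ('fmax', 'a', 'b'), 'b'),
    ('iadd', 'a', 0),
]

def subtrees(expr):
    if isinstance(expr, tuple):
        yield expr
        for src in expr[1:]:
            yield from subtrees(src)

SUBTREES = [t for p in PATTERNS for t in subtrees(p)]

def possible_matches(instructions):
    """instructions: (ssa_name, op, [source names or constants]) in SSA order."""
    states = {}
    for name, op, srcs in instructions:
        matched = set()
        for tree in SUBTREES:
            if tree[0] != op or len(tree) - 1 != len(srcs):
                continue
            # A sub-pattern source must already have matched in that source.
            if all(not isinstance(p, tuple) or p in states.get(s, set())
                   for p, s in zip(tree[1:], srcs)):
                matched.add(tree)
        states[name] = matched
    return states

instrs = [('ssa_1', 'fmax', ['x', 'y']),
          ('ssa_2', 'fmax', ['ssa_1', 'y']),
          ('ssa_3', 'fadd', ['ssa_2', 'z'])]
states = possible_matches(instrs)
print(PATTERNS[0] in states['ssa_2'])  # True: worth running the full matcher here
print(states['ssa_3'])                 # set(): no pattern can possibly match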
This one continues to surprise me. When we set out to design NIR, the goal was something that was SSA and used flat lists of instructions (not expression trees). That was pretty much the extent of the design requirements. However, whenever you build an IR, you inevitably make a series of choices about what kinds of things you’re going to support natively and what things are going to require emulation or be a bit more painful.
One of the most fundamental choices we made in NIR was that SSA values would be typeless vectors. Each nir_ssa_def
has a bit size and a number of vector components and that’s it. We don’t distinguish between integers and floats and we don’t support matrix or composite types. Not supporting matrix types was a bit controversial but it’s turned out fine. We also have to do a bit of juggling to support hardware that doesn’t have native integers because we have to lower integer operations to float and we’ve lost the type information. When working with shaders that come from D3D to OpenGL or Vulkan translators, the type information does more harm than good. I can’t count the number of shaders I’ve seen where they declare vec4 x1
through vec4 x80
at the top and then uintBitsToFloat()
and floatBitsToUint()
all over the place.
We also made adding new ALU ops and intrinsics really easy but also added a fairly powerful metadata system for both so the compiler can still reason about them. The lines we drew between ALU ops, intrinsics, texture instructions, and control-flow like break
and continue
were pretty arbitrary at the time if we’re honest. Texturing was going to be a lot of intrinsics so Connor added an instruction type. That was pretty much it.
The end result, however, has been an IR that’s incredibly versatile. It’s somehow both a high-level and low-level IR at the same time. When we do SPIR-V to NIR translation, we don’t have a separate IR for parsing SPIR-V. We have some data structures to deal with composite types and a handful of other stuff but when we parse SPIR-V opcodes, we go straight to NIR. We’ve got variables with fairly standard dereference chains (those do support composite types), bindings, all the crazy built-ins like frexp()
, and a bunch of other language-level stuff. By the time the NIR shows up in your back-end, however, all that’s gone. Crazy built-in functions have been lowered. GL/Vulkan binding with derefs, descriptors, and locations has been turned into byte offsets and indices in a flat binding table. Some drivers have even attempted to emit hardware instructions directly from NIR. (It’s never quite worked, but it says a lot that they even tried.)
The Intel compiler back-end has probably shrunk by half in terms of optimization and lowering passes in the last seven years because we’re able to do so much in NIR. We’ve got code that lowers storage image access with unsupported formats to other image formats or even SSBO access, splitting of vector UBO/SSBO access that’s too wide for hardware, workarounds for imprecise trig ops, and a bunch of others. All of the interesting lowering is done in NIR. One reason for this is that Intel has two back-ends, one for scalar and one that’s vec4 and any lowering we can do in NIR is lowering that only happens once. But, also, it’s nice to be able to have the full power of NIR’s optimizer run on your lowered code.
As I said earlier, I find the versatility of NIR astounding. We never intended to write an IR that could get that close to hardware. We just wanted SSA for easier optimization writing. But the end result has been absolutely fantastic and has done a lot to accelerate driver development in Mesa.
If you’ve gotten this far, I both applaud and thank you! NIR has been a lot of fun to build and, as you can probably tell, I’m quite proud of it. It’s also been a huge investment involving thousands of man hours but I think it’s been well worth it. There’s a lot more work to do, of course. We still don’t have the ray-tracing situation where it needs to be and OpenCL-style compute needs some help to be really competent. But it’s come an incredibly long way in the last seven years and I’m incredibly proud of what we’ve built and forever thankful to the many many developers who have chipped in and fixed bugs and contributed optimization and lowering passes.
Hopefully, this post provides some additional background and explanation for the big question of why Mesa carries its own compiler stack. And maybe, just maybe, someone will get excited enough about it to play around with it and even contribute! One can hope, right?
(This post was first published with Collabora on Jan 25, 2022.)
My work on Wayland and Weston color management and HDR support has been full of learning new concepts and terms. Many of them are crucial for understanding how color works. I started out so ignorant that I did not know how to blend two pixels together correctly. I did not even know that I did not know - I was just doing the obvious blend, and that was wrong. Now I think I know what I know and do not know, and I also feel that most developers around window systems and graphical applications are as uneducated as I was.
Color knowledge is surprisingly scarce in my field it seems. It is not enough that I educate myself. I need other people to talk to, to review my work, and to write patches that I will be reviewing. With the hope of making it even a little bit easier to understand what is going on with color I wrote the article: A Pixel's Color.
The article goes through most of the important concepts, trying to give you, a programmer, a vague idea of what they are. It does not explain everything too well, because I want you to be able to read through it, but it still got longer than I expected. My intention is to tell you about things you might not know about, so that you would at least know what you do not know.
A warm thank you to everyone who reviewed and commented on the article.
Originally Wayland CM&HDR extension merge request included documentation about how color management would work on Wayland. The actual protocol extension specification cannot even begin to explain all that.
To make that documentation easier to revise and contribute to, I proposed to move it into a new repository: color-and-hdr. That also allowed us to widen the scope of the documentation, so we can easily include things outside of Wayland: EGL, Vulkan WSI, DRM KMS, and more.
I hope that color-and-hdr documentation repository gains traction and becomes a community maintained effort in gathering information about color and HDR on Linux, and that we can eventually move it out of my personal namespace to become truly community owned.
Ever since I announced that I was leaving Intel, there’s been a lot of speculation as to where I’d end up. I left it a bit quiet over the holidays but, now that we’re solidly in 2022, it’s time to let it spill. As of January 24, I’ll be at Collabora!
For those of you that don’t know, Collabora is an open-source consultancy. They sell engineering services to companies who are making devices that run Linux and want to contribute to open-source technologies. They’ve worked on everything from automotive to gaming consoles to smart TVs to infotainment systems to VR platforms. I’m not an expert on what Collabora has done over the years so I’ll refer you to their brag sheet for that. Unlike some contract houses, Collabora doesn’t just do engineering for hire. They’re also an ideologically driven company that really believes in upstream and invests directly in upstream projects such as Mesa, Wayland, and others.
My personal history with Collabora is as old as my history as an open-source software developer. My first real upstream work was on Wayland in early 2013. I jumped in with a cunning plan for running a graphics-enabled desktop Linux chroot on an Android device and absolutely no idea what I was getting myself into. Two of the people who not only helped me understand the underbelly of Linux window systems but also helped me learn to navigate the world of open-source software were Daniel Stone and Pekka Paalanen, both of whom were at Collabora then and still are today.
After switching to Mesa when I joined Intel in 2014, I didn’t interact with Collabora devs quite as much since they mostly stayed in the window-system world and I tried to stay in 3D. In the last few years, however, they’ve been building up their 3D team and doing some really interesting work. Alyssa Rosenzweig and I have worked quite a bit together on various NIR passes as part of her work on Panfrost and now agx. I also worked with Boris Brezillon and Erik Faye-Lund on some of the CLOn12, GLOn12, and Zink work which layers OpenGL and OpenCL on top of D3D12 and Vulkan. In case you haven’t figured it out already from my glowing review, Collabora has some top-notch people who are doing great work and I’m excited to be joining the team and working more closely with them.
So how did this happen? What convinced me to leave the cushy corporate job and join a tiny (compared to Intel) open-source company? It’s not been for lack of opportunities. I get pinged by recruiters on LinkedIn on a regular basis and certain teams in the industry have been rather persistent. I’ve thought quite a lot over the years about where I’d want to go if I ever left Intel. Intel has been my engineering home for 7.5 years and has provided the strange cocktail on which I’ve built my career: a stable team, well-funded upstream open-source work, fairly cutting edge hardware, and an IHV seat at Khronos. Every place I’d ever considered going would mean losing one or more of those things and, until Collabora, no one had given me a good enough reason to give any of that up.
Back in September, I was chatting on IRC with other Mesa devs about OpenCL, SPIR-V, and some corner-case we were missing in the compiler when the following exchange happened:
11:39 < jenatali> I hope I get time to get back to CL at some point, I
hate leaving it half-finished, but stupid corporate priorities
mean I have to do other stuff instead :P
11:41 < jekstrand> Yeah... Corporations... Why do we work for them
again? Oh, right, so we can afford to eat.
About an hour later, Daniel Stone replied privately:
12:40 <daniels> hey so if corporations ever get you down, there are
always less-corporate options … :)
12:40 <daniels> timing completely coincidental of course
12:42 <jekstrand> Of course...
12:42 <jekstrand> I'm always open to new things if the offer is
right...
This kicked off the weirdest and most interesting career conversation I’ve had to date. At first, I didn’t believe him. The job he was describing doesn’t exist. No one gets that offer. Not unless you’re Dave Airlie or Linus Torvalds. But, after multiple 1 – 2 hour video chats, more IRC chatter, and an hour chatting with Philippe Kalaf (Collabora’s CEO), they had me convinced. This is real.
So what did Collabora finally offer me that no one else has? Total autonomy. In my new role at Collabora, my mandate consists of two things: invest in and mentor the Collabora 3D graphics team and invest in upstream Linux and open-source graphics however I see fit. I won’t be expected to do any contract work. I may meet with clients from time to time and I’ll likely get involved more with the various Collabora-driven Mesa projects but my primary focus will be on ensuring that upstream is healthy. I won’t be tied to any one driver or hardware vendor either. Sure, it’d be good to do a bit of Panfrost work so I can help Alyssa out since she’s now my coworker and I’ll likely still work on Intel drivers a bit since that’s my home turf. But, at the end of the day, I’m now free to put my effort wherever it’s needed in the stack without concern for corporate priorities. Ray-tracing in RADV? Why not. OpenCL 3.0 for everyone? Sure. Hacking on a new kernel interface for Freedreno? That’s fine too. As far as I’m concerned, when it comes to how I spend my engineering effort, I now report directly to upstream. No strings attached.
One of the interesting side-effects of this is how it will affect my role within Khronos. Collabora is a Khronos member so I still plan to be involved there but it will look different. For several years now (as long as RADV has been a competent driver, really), I’ve always worn two hats at Khronos: Intel and Mesa/Linux. Most of the time, I’m representing Intel but there were always those weird awkward moments where I helped out the Igalia team working on V3DV or the RADV team. Now that I’m no longer at a hardware vendor, I can really embrace the role of representing Mesa and Linux upstream within Khronos. This doesn’t mean that I’m suddenly going to fix all your Vulkan spec problems overnight but it does mean I’ll be paying a bit more attention to the non-Intel drivers and doing what I can to ensure that all the Vulkan drivers in Mesa are in good shape.
Honestly, I’m still in shock that I was offered this role. It’s a great testament to Collabora’s belief in upstream that they’re willing to fund such a role and it shows an incredible amount of faith in my work. At Intel, I was blessed to be able to work upstream as part of my day job, which isn’t something most open-source software developers get. To have someone believe in your work so much that they’re willing to cut you a pay check just to keep doing what you’re doing is mind boggling. I’m truly honored and I hope the work I do in the days, months, and years to come will prove that their faith was well placed.
So, what am I going to be working on with my new found freedom? Do I have any cool new projects planned that are going to turn the industry upside-down? Of course I do! But those are topics for other blog posts.
This article is part of a series on how to setup a bare-metal CI system for Linux driver development. Here are the different articles so far:
In this article, we will further discuss the role of the CI gateway, and which steps we can take to simplify its deployment, maintenance, and disaster recovery.
This work is sponsored by the Valve Corporation.
As seen in part 1 of this CI series, the testing gateway sits between the test machines and the public network/internet:
Internet / ------------------------------+
Public network |
+---------+--------+ USB
| +-----------------------------------+
| Testing | Private network |
Main power (120/240 V) -----+ | Gateway +-----------------+ |
| +------+--+--------+ | |
| | | Serial / | |
| Main | | Ethernet | |
| Power| | | |
+-----------+-----------------|--+--------------+ +-------+--------+ +----+----+
| Switchable PDU | | | RJ45 switch | | USB Hub |
| Port 0 Port 1 ...| Port N | | | | |
+----+------------------------+-----------------+ +---+------------+ +-+-------+
| | |
Main | | |
Power| | |
+--------|--------+ Ethernet | |
| +-----------------------------------------+ +----+----+ |
| Test Machine 1 | Serial (RS-232 / TTL) | Serial | |
| +---------------------------------------------+ 2 USB +----+ USB
+-----------------+ +---------+
The testing gateway's role is to expose the test machines to the users, either directly or via GitLab/Github. As such, it will likely require the following components:
Since the gateway is connected to the internet, both the OS and the different services need to be kept updated relatively often to prevent your CI farm from becoming part of a botnet. This creates interesting issues:
These issues can thankfully be addressed by running all the services in a container (as systemd units), started using boot2container. Updating the operating system and the services would simply be done by generating a new container, running tests to validate it, pushing it to a container registry, rebooting the gateway, then waiting while the gateway downloads and executes the new services.
Using boot2container does not, however, fix the issue of how to update the kernel or boot configuration when the system fails to boot the current one. Indeed, if the kernel, boot2container, and kernel command line are stored locally, they can only be modified via an SSH connection, which requires the machine to always be reachable; otherwise, the gateway will be bricked until an operator boots an alternative operating system.
The easiest way not to brick your gateway after a broken update is to power it through a switchable PDU (so that we can power cycle the machine), and to download the kernel, initramfs (boot2container), and the kernel command line from a remote server at boot time. This is fortunately possible even through the internet by using fancy bootloaders, such as iPXE, and this will be the focus of this article!
Tune in for part 4 to learn more about how to create the container.
iPXE is a tiny bootloader that packs a punch! Not only can it boot kernels from local partitions, but it can also connect to the internet, and download kernels/initramfs using HTTP(S). Even more impressive is the little scripting engine which executes boot scripts instead of declarative boot configurations like grub. This enables creating loops, endlessly trying to boot until one method finally succeeds!
Let's start with a basic example, and build towards a production-ready solution!
In this example, we will focus on netbooting the gateway from a local HTTP
server. Let's start by reviewing a simple script that makes iPXE acquire an IP
from the local DHCP server, then download and execute another iPXE script from
http://<ip of your dev machine>:8000/boot.ipxe
.
If any step failed, the script will be restarted from the start until a successful
boot is achieved.
#!ipxe
echo Welcome to Valve infra's iPXE boot script
:retry
echo Acquiring an IP
dhcp || goto retry # Keep retrying getting an IP, until we get one
echo Got the IP: $${netX/ip} / $${netX/netmask}
echo
echo Chainloading from the iPXE server...
chain http://<ip of your dev machine>:8000/boot.ipxe
# The boot failed, let's restart!
goto retry
Neat, right? Now, we need to generate a bootable ISO image starting iPXE with the above script run as a default. We will then flash this ISO to a USB pendrive:
$ git clone git://git.ipxe.org/ipxe.git
$ make -C ipxe/src -j`nproc` bin/ipxe.iso EMBED=<boot script file>
$ sudo dd if=ipxe/src/bin/ipxe.iso of=/dev/sdX bs=1M conv=fsync status=progress
Once the pendrive is connected to the gateway, ensure that you boot from it, and you
should see the iPXE bootloader trying to boot, but failing to download
the script from
http://<ip of your dev machine>:8000/boot.ipxe
. So, let's write one:
#!ipxe
kernel /files/kernel b2c.container="docker://hello-world"
initrd /files/initrd
boot
This script specifies the following elements:
The kernel, downloaded from http://<ip of your dev machine>:8000/files/kernel, with the kernel command line set to ask boot2container to start the hello-world container
The initramfs (boot2container), downloaded from http://<ip of your dev machine>:8000/files/initrd
Assuming your gateway has an architecture supported by boot2container, you may now download the kernel and initrd from boot2container's releases page. In case it is unsupported, create an issue, or a merge request to add support for it!
Now that you have created all the necessary files for the boot, start the web server on your development machine:
$ ls
boot.ipxe initrd kernel
$ python -m http.server 8000
Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...
<ip of your gateway> - - [09/Jan/2022 15:32:52] "GET /boot.ipxe HTTP/1.1" 200 -
<ip of your gateway> - - [09/Jan/2022 15:32:56] "GET /kernel HTTP/1.1" 200 -
<ip of your gateway> - - [09/Jan/2022 15:32:54] "GET /initrd HTTP/1.1" 200 -
If everything went well, the gateway should, after a couple of seconds, start downloading the boot script, then the kernel, and finally the initramfs. Once done, your gateway should boot Linux, run docker's hello-world container, then shut down.
Congratulations on netbooting your gateway! However, the current solution has one annoying constraint: it requires a trusted local network and server because we are using HTTP rather than HTTPS... On an untrusted network, a man in the middle could override your boot configuration and take over your CI...
If we were using HTTPS, we could download our boot script/kernel/initramfs directly from any public server, even GIT forges, without fear of any man in the middle! Let's try to achieve this!
In the previous section, we managed to netboot our gateway from the local network. In this section, we try to improve on it by netbooting using HTTPS. This enables booting from a public server hosted at places such as Linode for $5/month.
As I said earlier, iPXE supports HTTPS. However, if you are anyone like me, you may be wondering how such a small bootloader could know which root certificates to trust. The answer is that iPXE generates an SSL certificate at compilation time which is then used to sign all of the root certificates trusted by Mozilla (default), or any amount of certificate you may want. See iPXE's crypto page for more information.
WARNING: iPXE currently does not like certificates exceeding 4096 bits. This can be a limiting factor when trying to connect to existing servers. We hope to one day fix this bug, but in the meantime, you may be forced to use a 2048-bit Let's Encrypt certificate on a self-hosted web server. See our issue for more information.
WARNING 2: iPXE only supports a limited set of ciphers. You'll need to make
sure they are listed in nginx's ssl_ciphers
configuration:
AES-128-CBC:AES-256-CBC:AES256-SHA256 and AES128-SHA256:AES256-SHA:AES128-SHA
To get started, install NGINX + Let's Encrypt on your server, following your
favourite tutorial, copy the boot.ipxe
, kernel
, and initrd
files to the root
of the web server, then make sure you can download them using your browser.
With this done, we just need to edit iPXE's general config C header to enable HTTPS support:
$ sed -i 's/#undef\tDOWNLOAD_PROTO_HTTPS/#define\tDOWNLOAD_PROTO_HTTPS/' ipxe/src/config/general.h
Then, let's update our boot script to point to the new server:
#!ipxe
echo Welcome to Valve infra's iPXE boot script
:retry
echo Acquiring an IP
dhcp || goto retry # Keep retrying getting an IP, until we get one
echo Got the IP: $${netX/ip} / $${netX/netmask}
echo
echo Chainloading from the iPXE server...
chain https://<your server>/boot.ipxe
# The boot failed, let's restart!
goto retry
And finally, let's re-compile iPXE, reflash the gateway pendrive, and boot the gateway!
$ make -C ipxe/src -j`nproc` bin/ipxe.iso EMBED=<boot script file>
$ sudo dd if=ipxe/src/bin/ipxe.iso of=/dev/sdX bs=1M conv=fsync status=progress
If all went well, the gateway should boot and run the hello world container once again! Let's continue our journey by provisioning and backing up the local storage of the gateway!
In the previous section, we managed to control the boot configuration of our gateway via a public HTTPS server. In this section, we will improve on that by provisioning and backing up any local file the gateway container may need.
Boot2container has a nice feature that enables you to create a volume, provision it from a bucket in an S3-compatible cloud storage service, and sync back any local change. This is done by adding the following arguments to the kernel command line:
b2c.minio="s3,${s3_endpoint},${s3_access_key_id},${s3_access_key}"
: URL and credentials to the S3 serviceb2c.volume="perm,mirror=s3/${s3_bucket_name},pull_on=pipeline_start,push_on=changes,overwrite,delete"
: Create a perm
podman volume, mirror it from the bucket ${s3_bucket_name}
when booting the gateway, then push any local change back to the bucket. Delete or overwrite any existing file when mirroring.b2c.container="-ti -v perm:/mnt/perm docker://alpine"
: Start an alpine container, and mount the perm
container volume to /mnt/perm
Pretty, isn't it? Provided that your bucket is configured to save all the revisions of every file, this trick will kill three birds with one stone: initial provisioning, backup, and automatic recovery of the files in case the local disk fails and gets replaced with a new one!
The issue is that the boot configuration is currently open for everyone to see, if they know where to look. This means that anyone could tamper with your local storage or even use your bucket to store their files...
To prevent attackers from stealing our S3 credentials by simply pointing their web browser to the right URL, we can authenticate incoming HTTPS requests by using an SSL client certificate. A different certificate would be embedded in every gateway's iPXE bootloader and checked by NGINX before serving the boot configuration for this precise gateway. By limiting access to a machine's boot configuration to its associated client certificate fingerprint, we even prevent compromised machines from accessing the data of other machines.
Additionally, secrets should not be kept in the kernel command line, as any
process executed on the gateway could easily gain access to it by reading
/proc/cmdline
. To address this issue, boot2container has a
b2c.extra_args_url
argument to source additional parameters from this URL.
If this URL is generated every time the gateway is downloading its boot
configuration, can be accessed only once, and expires soon after being created,
then secrets can be kept private inside boot2container and not be exposed to
the containers it starts.
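To make the one-time URL idea more concrete, here is a toy sketch of such an endpoint in Python: plain HTTP, no client certificates, no expiry, and invented paths and secrets, so it only demonstrates the "mint a token, serve the secrets once" part.
# Toy sketch only; nothing here is production-ready.
import secrets
from http.server import BaseHTTPRequestHandler, HTTPServer

one_time_secrets = {}   # token -> extra kernel command line arguments

class BootConfigHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/boot.ipxe':
            # Mint a fresh token every time the boot config is fetched.
            token = secrets.token_urlsafe(32)
            one_time_secrets[token] = b'b2c.minio="..." b2c.volume="..."'
            body = ('#!ipxe\n'
                    'kernel /files/kernel '
                    f'b2c.extra_args_url="https://example.com/secrets/{token}"\n'
                    'initrd /files/initrd\nboot\n').encode()
        elif self.path.startswith('/secrets/'):
            # pop() means the secrets can only ever be fetched once.
            body = one_time_secrets.pop(self.path[len('/secrets/'):], None)
            if body is None:
                self.send_error(404)
                return
        else:
            self.send_error(404)
            return
        self.send_response(200)
        self.end_headers()
        self.wfile.write(body)

HTTPServer(('', 8080), BootConfigHandler).serve_forever()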
Implementing these suggestions in a blog post is a little tricky, so I suggest you check out valve-infra's ipxe-boot-server component for more details. It provides a Makefile that makes it super easy to generate working certificates and create bootable gateway ISOs, a small python-based web service that will serve the right configuration to every gateway (including one-time secrets), and step-by-step instructions to deploy everything!
Assuming you decided to use this component and followed the README, you should then configure the gateway in this way:
$ pwd
/home/ipxe/valve-infra/ipxe-boot-server/files/<fingerprint of your gateway>/
$ ls
boot.ipxe initrd kernel secrets
$ cat boot.ipxe
#!ipxe
kernel /files/kernel b2c.extra_args_url="${secrets_url}" b2c.container="-v perm:/mnt/perm docker://alpine" b2c.ntp_peer=auto b2c.cache_device=auto
initrd /files/initrd
boot
$ cat secrets
b2c.minio="bbz,${s3_endpoint},${s3_access_key_id},${s3_access_key}" b2c.volume="perm,mirror=bbz/${s3_bucket_name},pull_on=pipeline_start,push_on=changes,overwrite,delete"
And that's it! We finally made it to the end, and created a secure way to provision our CI gateways with the desired kernel, operating system, and even local files!
When Charlie Turner and I started designing this system, we felt it would be a clean and simple way to solve our problems with our CI gateways, but the implementation ended up being quite a bit trickier than the high-level view... especially the SSL certificates! However, the certainty that we can now deploy updates and fix our CI gateways even when they are physically inaccessible to us (provided the hardware and PDU are fine) definitely made it all worth it and made the prospect of having users depending on our systems less scary!
Let us know how you feel about it!
In this post, we focused on provisioning the CI gateway with its boot configuration and local files via the internet. This drastically reduces the risk that updating the gateway's kernel would result in an extended loss of service, as the kernel configuration can quickly be reverted by changing the boot config files, which are served from a cloud service provider.
The local file provisioning system also doubles as a backup, and disaster recovery system which will automatically kick in in case of hardware failure thanks to the constant mirroring of the local files with an S3-compatible cloud storage bucket.
In the next post, we will be talking about how to create the infra container, and how we can minimize down time during updates by not needing to reboot the gateway.
That's all for now, thanks for making it to the end!
A few months ago, I went on a quest to better digitize and collect a bunch of the recipes I use on a regular basis. Like most people, I’ve got a 3-ring binder of stuff I’ve printed from the internet, a box with the usual 4x6 cards, most of which are hand-written, and a stack of cookbooks. I wanted something that could be both digital and physical and which would make recipes easier to share. I also wanted whatever storage system I developed to be something stupid simple. If there’s one thing I’ve learned about myself over the years it’s that if I make something too hard, I’ll never get around to it.
This led to a few requirements:
A simple human-writable file format
Recipes organized as individual files in a Git repo
A simple tool to convert to a web page and print formats
That tool should be able to produce 4x6 cards
Enter PyCook. It does pretty much exactly what the above says and not much more. The file format is based on YAML and, IMO, is beautiful to look at all by itself and very easy to remember how to type:
name: Swedish Meatballs
from: Grandma Ekstrand
ingredients:
- 1 [lb] Ground beef (80%)
- 1/4 [lb] Ground pork
- 1 [tsp] Salt
- 1 [tsp] Sugar
- 1 Egg, beaten
- 1 [tbsp] Catsup
- 1 Onion, grated
- 1 [cup] Bread crumbs
- 1 [cup] Milk
instructions:
- Soak crumbs in milk
- Mix all together well
- Make into 1/2 -- 3/4 [in] balls and brown in a pan on the stove
(you don't need to cook them through; just brown)
- Heat in a casserole dish in the oven until cooked through.
Over Christmas this year, I got my mom and a couple siblings together to put together a list of family favorite recipes. The format is simple enough that my non-technical mother was able to help type up recipes and they needed very little editing before PyCook would consume them. To me, that’s a pretty good indicator that the file format is simple enough. 😁
The one bit of smarts it does have is around quantities and units. One thing that constantly annoys me with recipes is the inconsistency with abbreviations of units. Take tablespoons, for instance. I’ve seen it abbreviated “T” (as opposed to “t” for teaspoon), “tbsp”, or “tblsp”, sometimes with a “.” after the abbreviation and sometimes not. To handle this, I have a tiny macro language where units have standard abbreviations and are placed in brackets. This is then substituted with the correct abbreviation to let me change them all in one go if I ever want to. It’s also capable of handling plurals properly so when you type [cup]
it will turn into either “cup” or “cups” depending on the associated number. It also has smarts to detect “1/4” and turn that into the vulgar fraction character “¼” for HTML output or a nice LaTeX fraction when doing PDF output.
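For illustration, here’s a tiny sketch of that kind of substitution. The unit table and abbreviations are invented for the example; the real implementation handles many more units as well as the LaTeX output path.
import re
from fractions import Fraction

# Made-up unit table: [unit] -> (singular, plural) abbreviation.
UNITS = {'tsp': ('tsp', 'tsp'), 'tbsp': ('Tbsp', 'Tbsp'),
         'cup': ('cup', 'cups'), 'lb': ('lb', 'lbs'), 'in': ('in', 'in')}
VULGAR = {'1/4': '¼', '1/2': '½', '3/4': '¾', '1/3': '⅓', '2/3': '⅔'}

def render(line):
    # "<number> [unit]" -> the right singular/plural abbreviation
    def unit_sub(m):
        qty, unit = m.group(1), m.group(2)
        singular, plural = UNITS[unit]
        return f'{qty} {singular if Fraction(qty) <= 1 else plural}'
    line = re.sub(r'([0-9]+(?:/[0-9]+)?)\s*\[(\w+)\]', unit_sub, line)
    # Swap simple fractions for their vulgar-fraction characters (HTML path)
    for frac, char in VULGAR.items():
        line = line.replace(frac, char)
    return line

print(render('1/4 [cup] Milk'))         # ¼ cup Milk
print(render('2 [cup] Bread crumbs'))   # 2 cups Bread crumbs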
That brings me to the PDF generator. One of my requirements was the ability to produce 4x6 cards that I could use while cooking instead of having an electronic device near all those fluids. I started with the LaTeX template I got from my friend Steve Schulteis for making a PDF cookbook and adapted it to 4x6 cards, one recipe per card. And the results look pretty nice, I think.
It’s even able to nicely paginate the 4x6 cards into a double-sided 8.5x11 PDF which prints onto Avery 5389 templates.
Other output formats include an 8.5x11 PDF cookbook with as many recipes per page as will fit and an HTML web version generated using Sphinx which is searchable and has a nice index.
P.S. Yes, that’s a real Swedish meatball recipe and it makes way better meatballs than the ones from IKEA.
This week, I’ve taken advantage of some of my funemployment time to rework my website infrastructure. As part of my new job (more on that soon!) I’ll be communicating publicly more and I need a better blog space. I set up a Blogger blog a few years ago but I really hate it. It’s not that Blogger is a terrible platform per se. It just doesn’t integrate well with the rest of my website and the comments I got were 90% spam.
At the time, I used Blogger because I didn’t want to mess with implementing a blog on my own website infrastructure. Why? The honest answer is an object lesson in software engineering. The last time I re-built my website, I thought that building a website generator sounded like a fantastic excuse to learn some Ruby. Why not? It’s a great programming language for web stuff and learning new languages is good, right? While those are both true, software maintainability is important. When I went to try and add a blog, it’d been nearly 4 years since I’d written a line of Ruby code and the static site generation framework I was using (nanoc) had moved to a backwards-incompatible new version and I had no clue how to move my site forward without re-learning Ruby and nanoc and rewriting it from scratch. I didn’t have the time for that.
This time, I learned my lesson and went all C programmer on it. The new infra is built in Python, a language I use nearly daily in my Mesa work and, instead of using someone else’s static site generation infra, I rolled my own. I’m using Jinja2 for templating as well as Pygments for syntax highlighting and Pandoc for Markdown to HTML conversion. They’re all libraries, not frameworks, so they take a lot less learning to figure out and remember how to use. Jinja2 is something of a framework (it’s a templating language) but it’s so obvious and user-friendly that it’s basically impossible to forget how to use.
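To give a sense of how little is involved, the core of such a generator fits in a dozen lines. The following is a simplified sketch with made-up paths and template names rather than the actual build script:
#!/usr/bin/env python3
# Simplified sketch: Jinja2 for templating, pandoc for Markdown -> HTML.
import pathlib
import subprocess

import jinja2

env = jinja2.Environment(loader=jinja2.FileSystemLoader('templates'))
template = env.get_template('page.html')

for md in pathlib.Path('content').glob('**/*.md'):
    # Convert the Markdown body to an HTML fragment with pandoc
    body = subprocess.run(['pandoc', '-f', 'markdown', '-t', 'html', str(md)],
                          capture_output=True, text=True, check=True).stdout
    out = pathlib.Path('public') / md.relative_to('content').with_suffix('.html')
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(template.render(title=md.stem, body=body))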
Sure, my infra isn’t nearly as nice as some. I don’t have caching and I have to re-build the whole thing every time. I also didn’t bother to implement RSS feed support, comments, or any of those other bloggy features. But, frankly, I don’t care. Even on a crappy raspberry pi, my website builds in a few seconds. As far as RSS goes, who actually uses RSS readers these days? Just follow me on Twitter (@jekstrand_) if you want to know when I blog stuff. Comments? 95% of the comments I got on Blogger were spam anyway. If you want to comment, reply to my tweet.
The old blog lived at jason-blog.jlekstrand.net while the new one lives at www.jlekstrand.net/jason/blog/. I’ve set up my webserver to re-direct from the old one and gone through the painful process of checking every single post to ensure the re-directs work. Any links you may have should still work.
So there you have it! A new website framework written in all of 2 days. Hopefully, this one will be maintainable and, if I need to extend it, I can without re-learning whole programming languages. I’m also hoping that now that blogging is as easy as writing some markdown and doing git push
, I’ll actually blog more.
It appears that Google created a handy tool that helps find the command which causes a GPU hang/crash. It is called Graphics Flight Recorder (GFR) and was open-sourced a year ago but didn’t receive any attention. From the readme:
The Graphics Flight Recorder (GFR) is a Vulkan layer to help trackdown and identify the cause of GPU hangs and crashes. It works by instrumenting command buffers with completion tags. When an error is detected a log file containing incomplete command buffers is written. Often the last complete or incomplete commands are responsible for the crash.
It requires VK_AMD_buffer_marker
support; however, this extension is rather trivial to implement - I only had to copy-paste the code from our vkCmdSetEvent
implementation and that was it. Note that, at the moment of writing, GFR unconditionally uses VK_AMD_device_coherent_memory
, which could be manually patched out for it to run on other GPUs.
GFR already helped me to fix hangs in “Alien: Isolation” and “Digital Combat Simulator”. In both cases the hang was in a compute shader and the output from GFR looked like:
...
- # Command:
  id: 6/9
  markerValue: 0x000A0006
  name: vkCmdBindPipeline
  state: [SUBMITTED_EXECUTION_COMPLETE]
  parameters:
    - # parameter:
      name: commandBuffer
      value: 0x000000558CFD2A10
    - # parameter:
      name: pipelineBindPoint
      value: 1
    - # parameter:
      name: pipeline
      value: 0x000000558D3D6750
- # Command:
  id: 6/9
  message: '>>>>>>>>>>>>>> LAST COMPLETE COMMAND <<<<<<<<<<<<<<'
- # Command:
  id: 7/9
  markerValue: 0x000A0007
  name: vkCmdDispatch
  state: [SUBMITTED_EXECUTION_INCOMPLETE]
  parameters:
    - # parameter:
      name: commandBuffer
      value: 0x000000558CFD2A10
    - # parameter:
      name: groupCountX
      value: 5
    - # parameter:
      name: groupCountY
      value: 1
    - # parameter:
      name: groupCountZ
      value: 1
  internalState:
    pipeline:
      vkHandle: 0x000000558D3D6750
      bindPoint: compute
    shaderInfos:
      - # shaderInfo:
        stage: cs
        module: (0x000000558F82B2A0)
        entry: "main"
    descriptorSets:
      - # descriptorSet:
        index: 0
        set: 0x000000558E498728
- # Command:
  id: 8/9
  markerValue: 0x000A0008
  name: vkCmdPipelineBarrier
  state: [SUBMITTED_EXECUTION_NOT_STARTED]
...
After confirming that the corresponding vkCmdDispatch is indeed the call which hangs, in both cases I made an Amber test which fully simulated the call. For a compute shader this is relatively easy to do, since all you need is to save the decompiled shader and the buffers it uses. Luckily, in both cases these Amber tests reproduced the hangs.
With standalone reproducers, the problems were much easier to debug, and fixes were made shortly: MR#14044 for “Alien: Isolation” and MR#14110 for “Digital Combat Simulator”.
Unfortunately, this tool is not a panacea.
Anyway, it’s easy to use, so you should give it a try.
One of the big issues I have when working on Turnip driver development is that compiling either Mesa or VK-GL-CTS takes a lot of time to complete, no matter how powerful the embedded board is. There are reasons for that: typically those boards have a limited amount of RAM (8 GB in the best case), slow storage (typically UFS 2.1 on-board storage) and CPUs that are not as powerful as their x86_64 desktop counterparts.
Photo of the Qualcomm® Robotics RB3 Platform embedded board that I use for Turnip development.
To fix this, it is recommended to do cross-compilation; however, installing a development environment for cross-compilation can be cumbersome and error-prone depending on the toolchain you use. One alternative is to use a distributed compilation system that supports cross-compilation, such as Icecream.
Icecream is a distributed compilation system that is very useful when you have to compile big projects and/or on low-spec machines, while having powerful machines in the local network that can do that job instead. However, it is not perfect: the linking stage is still done on the machine that submits the job, which, depending on its available RAM, could be too much for it (you can alleviate this a bit by using ZRAM, for example).
One of the features that icecream has over its alternatives is that there is no need to install the same toolchain on all the machines, as it is able to share the toolchain among all of them. This is very useful, as we will see below in this post.
For Debian-based systems:
$ sudo apt install icecc
For Fedora:
$ sudo dnf install icecream
Or you can compile it from sources.
You need to have an icecc scheduler in the local network that will balance the load among all the available nodes connected to it.
It does not matter which machine is the scheduler, you can use any of them as it is quite lightweight. To run the scheduler execute the following command:
$ sudo icecc-scheduler
Notice that the machine running this command is going to be the scheduler, but it will not participate in the compilation process by default unless you also run the iceccd daemon on it (see the next step).
First you need to run the iceccd daemon as root. This is not needed on Debian-based systems, as its systemd unit is enabled by default.
You can do that using systemd in the following way:
$ sudo systemctl start iceccd
Or you can enable the daemon at startup time:
$ sudo systemctl enable iceccd
The daemon will connect automatically to the scheduler that is running in the local network. If that’s not the case, or if there is more than one scheduler, you can run it standalone and pass the scheduler’s IP as a parameter:
$ sudo iceccd -s <ip_scheduler>
If you use ccache (recommended option), you just need to add the following to your .bashrc:
export CCACHE_PREFIX=icecc
To use it without ccache, you need to add its path to the $PATH envvar so it is picked up before the system compilers:
export PATH=/usr/lib/icecc/bin:$PATH
If you followed the previous steps, any time you compile C/C++ code, the work will be distributed among the fastest nodes in the network. Notice that icecc takes into account system load, network connection, number of cores, among other variables, to decide which node will compile each object file.
Remember that the linking stage is always done in the machine that submits the job.
Icemon showing my x86_64 desktop (maxwell) cross-compiling a job for my aarch64 board (rb3).
On one x86_64 machine, you need to create a toolchain. This is not done automatically by icecc, as you can have different toolchains for cross-compilation.
For example, you can install the cross-compiler from the distribution repositories:
For Debian-based systems:
$ sudo apt install crossbuild-essential-arm64
For Fedora:
$ sudo dnf install gcc-aarch64-linux-gnu gcc-c++-aarch64-linux-gnu
Finally, to create the toolchain to share in icecc:
$ icecc-create-env --gcc /usr/bin/aarch64-linux-gnu-gcc /usr/bin/aarch64-linux-gnu-g++
This will create a <hash>.tar.gz file. The <hash> is used to identify the toolchain to distribute among the nodes in case there is more than one. But don’t worry: once it is copied to a node, it won’t be copied again, as icecc detects it is already present.
Note: it is important that the toolchain is compatible with the target machine. For example, if my aarch64 board is using Debian 11 Bullseye, it is better if the cross-compilation toolchain is created from a Debian Bullseye x86_64 machine (a VM also works), because you avoid incompatibilities like having different glibc versions.
If you have installed Debian 11 Bullseye in your aarch64, you can use my own cross-compilation toolchain for x86_64 and skip this step.
$ scp <hash>.tar.gz aarch64-machine-hostname:
Once the toolchain (<hash>.tar.gz) is copied to the aarch64 machine, you just need to export this in your .bashrc:
# Icecc setup for crosscompilation
export CCACHE_PREFIX=icecc
export ICECC_VERSION=x86_64:~/<hash>.tar.gz
Just compile on the aarch64 machine and the jobs will be distributed among your x86_64 machines as well. Take into account that the jobs will also be shared among other aarch64 machines if icecc decides so, therefore there is no need to do any extra steps.
It is important to remark that the cross-compilation toolchain creation is only needed once, as icecream will copy it to all the x86_64 machines that execute any job launched by this aarch64 machine. However, you need to copy this toolchain to any other aarch64 machine that will use icecream resources for cross-compiling.
Icemon is an interesting graphical tool to see the status of the icecc nodes and the jobs under execution.
For Debian-based systems:
$ sudo apt install icecc-monitor
For Fedora:
$ sudo dnf install icemon
Or you can compile it from sources.
Even though icecream has good cross-compilation documentation, it was the post written 8 years ago by my Igalia colleague Víctor Jáquez that convinced me to set up icecream as explained in this post.
Hope you find this info as useful as I did :-)
On the road to AppStream 1.0, a lot of items from the long todo list have been done so far – only one major feature is remaining: external release descriptions, which is a tricky one to implement and specify. For AppStream 1.0 it needs to either be present or be rejected, though, as it would be a major change in how release data is handled in AppStream.
Besides 1.0 preparation work, the recent 0.15 release and the releases before it come with their very own large set of changes that are worth a look and may be interesting for your application to support. But first, a change that affects the implementation and not the XML format:
Keeping all AppStream data in memory is expensive, especially if the data is huge (as on Debian and Ubuntu with their large repositories generated from desktop-entry files as well) and if processes using AppStream are long-running. The latter is more and more the case: not only does GNOME Software run in the background, KDE uses AppStream in KRunner and Phosh will use it too for reading form factor information. Therefore, AppStream via libappstream provides an on-disk cache that is memory-mapped, so data only consumes RAM if we are actually doing anything with it.
Previously, AppStream used an LMDB-based cache in the background, with indices for fulltext search and other common search operations. This was a very fast solution, but it also came with limitations: LMDB’s maximum key size of 511 bytes became a problem quite often, adjusting the maximum database size (since it has to be set at opening time) was annoyingly tricky, and building dedicated indices for each search operation was very inflexible. In addition to that, the caching code was changed multiple times in the past to allow system-wide metadata to be cached per-user, as some distributions didn’t (want to) build a system-wide cache and therefore ran into performance issues when XML was parsed repeatedly for generation of a temporary cache. In addition to all that, the cache was designed around the concept of “one cache for data from all sources”, which meant that we had to rebuild it entirely if just a small aspect changed, like a MetaInfo file being added to /usr/share/metainfo, which was very inefficient.
To shorten a long story, the old caching code was rewritten with the new concepts of caches not necessarily being system-wide and caches existing for more fine-grained groups of files in mind. The new caching code uses Richard Hughes’ excellent libxmlb internally for memory-mapped data storage. Unlike LMDB, libxmlb knows about the XML document model, so queries can be much more powerful and we do not need to build indices manually. The library is also already used by GNOME Software and fwupd for parsing of (refined) AppStream metadata, so it works quite well for that usecase. As a result, search queries via libappstream are now a bit slower (it very much depends on the query, roughly 20% on average), but they can be much more powerful. The caching code is a lot more robust, which should speed up startup time of applications. And in addition to all of that, the AsPool class has gained a flag to allow it to monitor AppStream source data for changes and refresh the cache fully automatically and transparently in the background.
All software written against the previous version of the libappstream library should continue to work with the new caching code, but to make use of some of the new features, software using it may need adjustments. A lot of methods have now been deprecated, too.
Compiling MetaInfo and other metadata into AppStream collection metadata, extracting icons, language information, refining data and caching media is an involved process. The appstream-generator tool does this very well for data from Linux distribution sources, but the tool is also pretty “heavyweight” with lots of knobs to adjust, an underlying database and a complex algorithm for icon extraction. Embedding it into other tools via anything else but its command-line API is also not easy (due to D’s GC initialization, and because it was never written with that feature in mind). Sometimes a simpler tool is all you need, so the libappstream-compose library as well as appstreamcli compose are being developed at the moment. The library contains building blocks for developing a tool like appstream-generator, while the CLI tool allows you to simply extract metadata from any directory tree, which can be used by e.g. Flatpak. For this to work well, a lot of appstream-generator’s D code is being translated into plain C, so the implementation stays identical but the language changes.
Ultimately, the generator tool will use libappstream-compose for any general data refinement, and only implement the things necessary to extract data from distribution archives. New applications (e.g. for new bundling systems and other purposes) can then use the same building blocks to implement new data generators similar to appstream-generator with ease, sharing much of the code that would otherwise be identical between implementations anyway.
Want to advertise that your application supports touch input? Keyboard input? Has support for graphics tablets? Gamepads? Sure, nothing is easier than that with the new control relation item and supports relation kind (since 0.12.11 / 0.15.0, details):
<supports>
  <control>pointing</control>
  <control>keyboard</control>
  <control>touch</control>
  <control>tablet</control>
</supports>
Some applications are unusable below a certain window size, so you do not want to display them in a software center that is running on a device with a small screen, like a phone. In order to encode this information in a flexible way, AppStream now contains a display_length relation item to require or recommend a minimum (or maximum) display size that the described GUI application can work with. For example:
<requires>
  <display_length compare="ge">360</display_length>
</requires>
This will make the application require a display length greater than or equal to 360 logical pixels. A logical pixel (also known as a device-independent pixel) is the number of pixels that the application can draw in one direction. Since screens, especially phone screens but also screens on a desktop, can be rotated, the display_length value will be checked against the longest edge of a display by default (by explicitly specifying the shorter edge, this can be changed).
This feature is available since 0.13.0, details. See also Tobias Bernard’s blog entry on this topic.
This is a feature that was originally requested for the LVFS/fwupd, but one of the great things about AppStream is that we can take very project-specific ideas and generalize them so that something comes out of them that is useful for many. The new tags tag allows people to tag components with an arbitrary namespaced string. This can be useful for project-internal organization of applications, as well as to convey certain additional properties to a software center, e.g. an application could mark itself as “featured” in a specific software center only. Metadata generators may also add their own tags to components to improve organization. AppStream gives no recommendations as to how these tags are to be interpreted except for them being a strictly optional feature, so any meaning is something clients and metadata authors need to negotiate. It therefore is a more specialized usecase of the already existing custom tag, and I expect it to be primarily useful within larger organizations that produce a lot of software components that need sorting. For example:
<tags>
  <tag namespace="lvfs">vendor-2021q1</tag>
  <tag namespace="plasma">featured</tag>
</tags>
This feature is available since 0.15.0, details.
The MetaInfo Creator (source) tool is a very simple web application that provides you with a form to fill out and will then generate MetaInfo XML to add to your project after you have answered all of its questions. It is an easy way for developers to add the required metadata without having to read the specification or any guides at all.
Recently, I added support for the new control and display_length tags, resolved a few minor issues and also added a button to instantly copy the generated output to the clipboard so people can paste it into their project. If you want to create a new MetaInfo file, this tool is the best way to do it!
The creator tool will also not transfer any data out of your web browser; it is strictly a client-side application.
And that is about it for the most notable changes in AppStream land! Of course there is a lot more: additional tags for the LVFS and content rating have been added, lots of bugs have been squashed, the documentation has been refined a lot and the library has gained a lot of new API to make building software centers easier. Still, there is a lot to do and quite a few open feature requests too. Onwards to 1.0!
Khronos submission indicating Vulkan 1.1 conformance for Turnip on Adreno 618 GPU.
It is a great feat, especially for a driver created without hardware documentation. And we support features far beyond the bare minimum required for conformance.
But first of all, I want to thank and congratulate everyone working on the driver: Connor Abbott, Rob Clark, Emma Anholt, Jonathan Marek, Hyunjun Ko, Samuel Iglesias. And special thanks to Samuel Iglesias and Ricardo Garcia for tirelessly improving Khronos Vulkan Conformance Tests.
At the start of the year, when I started working on Turnip, I looked at the list of failing tests and thought “It wouldn’t take a lot to fix them!”; right, sure… And so I started fixing issues alongside looking for missing features.
In June there were even more failures than there were in January. How could that be? Of course, we were adding new features, and that accounted for some of them. However, even this list was likely not exhaustive, because in GitLab CI, instead of running the whole Vulkan CTS suite, we ran 1/3 of it. We didn’t have enough devices to run the whole suite fast enough to make it usable in CI, so I just ran it locally from time to time.
1/3 of the tests doesn’t sound bad, and for the most part it’s good enough, since we have a huge number of tests looking like this:
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_clear_copy
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_clear_copy_format_list
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_clear_load
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_clear_load_format_list
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_clear_texture
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_clear_texture_format_list
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_copy_copy
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_copy_copy_format_list
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_copy_load
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_copy_load_format_list
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_copy_texture
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_copy_texture_format_list
...
Every format, every operation, etc. Tens of thousands of them.
Unfortunately, the selection of tests for a fractional run is as straightforward as possible: just every third test. This bites us when there are single, unique tests, like:
dEQP-VK.fragment_operations.early_fragment.no_early_fragment_tests_depth
dEQP-VK.fragment_operations.early_fragment.no_early_fragment_tests_stencil
dEQP-VK.fragment_operations.early_fragment.early_fragment_tests_depth
dEQP-VK.fragment_operations.early_fragment.early_fragment_tests_stencil
dEQP-VK.fragment_operations.early_fragment.no_early_fragment_tests_depth_no_attachment
dEQP-VK.fragment_operations.early_fragment.no_early_fragment_tests_stencil_no_attachment
dEQP-VK.fragment_operations.early_fragment.early_fragment_tests_depth_no_attachment
dEQP-VK.fragment_operations.early_fragment.early_fragment_tests_stencil_no_attachment
...
Most of them test something unique that has a much higher probability of triggering a special path in the driver compared to the uncountable image tests. And they fell through the cracks. I even had to fix one test twice because CI didn’t run it.
A possible solution is to skip tests only when there is a large swath of them and run smaller groups as-is. But it’s likely more productive to just throw more hardware at the issue =).
Another problem is that we had only one 6xx sub-generation present in CI - Adreno 630. We distinguish four sub-generations. Not only do they have some different capabilities, there are also differences in the capabilities they share, causing the same test to pass in CI while being broken on another, newer GPU. Presently in CI we test only Adreno 618 and 630, which are “Gen 1” GPUs, and we claimed conformance only for Adreno 618.
Yet another issue is that we can render in either tiling or bypass (sysmem) mode. That’s because there are a few features we can support only when there is no tiling and we render directly into sysmem, and sometimes rendering directly into sysmem is simply faster. At the moment we use tiled rendering by default unless we hit such an edge case, so by default CTS tests only the tiling path.
We force sysmem mode for a subset of tests in CI; however, it’s not enough, because the difference between the modes is relevant for more than just a few tests. Ideally we should run twice as many tests, and even better would be three times as many, to account for tiling mode without the binning vertex shader.
That issue became apparent when I implemented a magical eight-ball that chooses between tiling and bypass modes depending on run-time information in order to squeeze out more performance (it’s still a work in progress). The basic idea is that a single draw call, or a few small draw calls, is faster to render directly into system memory than to load the framebuffer into tile memory and store it back. But almost every single CTS test does exactly this: a single draw call or a few draw calls per render pass, which causes all tests to run in bypass mode. Fun!
Now we are forced to deal with this issue, since with the magic eight-ball games would run partly in tiling mode and partly in bypass, making both modes equally important for real-world workloads.
Unfortunately, no test suite could wholly reflect what game developers do in their games. However, the number of tests keeps growing and new tests are being contributed based on issues found in games and other applications.
When I ran my stash of D3D11 game traces through DXVK on Turnip for the first time, I found a bunch of new crashes and hangs, but it took fixing just a few of them for the majority of games to render correctly. This shows that the Khronos Vulkan Conformance Tests are doing their job, and we at Igalia are striving to make them even better.
One of the extensions released as part of Vulkan 1.2.199 was the VK_EXT_image_view_min_lod extension. I’m happy to see it published, as I have participated in the release process of this extension: from reviewing the spec exhaustively (I even contributed a few things to improve it!) to developing CTS tests for it that will eventually be merged into the CTS repo.
This extension was proposed by Valve to mirror a feature present in Direct3D 12 (check ResourceMinLODClamp here) and Direct3D 11 (check SetResourceMinLOD here). In other words, this extension allows clamping the minimum LOD value accessed by an image view to a minLod value set at image view creation time.
That way, any library or API layer that translates Direct3D 11/12 calls to Vulkan can use the extension to mirror the behavior above on Vulkan directly without workarounds, facilitating the port of Direct3D applications such as games to Vulkan. For example, projects like Vkd3d, Vkd3d-proton and DXVK could benefit from it.
Going into more detail, this extension changes how the image level selection is calculated and, when enabled, sets an additional minimum on the image level used for integer texel coordinate operations.
The way to use this feature in an application is very simple: first, check whether the implementation supports it by querying the following feature structure (chained into VkPhysicalDeviceFeatures2 via pNext):
// Provided by VK_EXT_image_view_min_lod
typedef struct VkPhysicalDeviceImageViewMinLodFeaturesEXT {
    VkStructureType    sType;
    void*              pNext;
    VkBool32           minLod;
} VkPhysicalDeviceImageViewMinLodFeaturesEXT;
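A minimal sketch of that query could look like this (the helper name and the physical_device parameter are just placeholders for illustration):

// assumes #include <vulkan/vulkan.h>
static VkBool32 supports_image_view_min_lod(VkPhysicalDevice physical_device)
{
    VkPhysicalDeviceImageViewMinLodFeaturesEXT min_lod_features = {
        .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_IMAGE_VIEW_MIN_LOD_FEATURES_EXT,
    };
    VkPhysicalDeviceFeatures2 features2 = {
        .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FEATURES_2,
        .pNext = &min_lod_features,
    };

    /* The driver fills in min_lod_features.minLod to indicate support. */
    vkGetPhysicalDeviceFeatures2(physical_device, &features2);
    return min_lod_features.minLod;
}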
Once you know everything is working, enable both the extension and the feature when creating the device.
When you want to create a VkImageView that defines a minLod for image accesses, add the following structure, filled with the value you want, to VkImageViewCreateInfo’s pNext chain:
// Provided by VK_EXT_image_view_min_lod
typedef struct VkImageViewMinLodCreateInfoEXT {
    VkStructureType    sType;
    const void*        pNext;
    float              minLod;
} VkImageViewMinLodCreateInfoEXT;
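For example, a sketch of chaining it at image view creation time (the helper name, the 2.5 value and the 2D color view parameters are arbitrary illustration choices, not requirements of the extension):

// assumes #include <vulkan/vulkan.h>
static VkImageView create_min_lod_view(VkDevice device, VkImage image,
                                       VkFormat format)
{
    VkImageViewMinLodCreateInfoEXT min_lod_info = {
        .sType = VK_STRUCTURE_TYPE_IMAGE_VIEW_MIN_LOD_CREATE_INFO_EXT,
        .minLod = 2.5f, /* accesses through this view are clamped to LOD >= 2.5 */
    };
    VkImageViewCreateInfo view_info = {
        .sType = VK_STRUCTURE_TYPE_IMAGE_VIEW_CREATE_INFO,
        .pNext = &min_lod_info,
        .image = image,
        .viewType = VK_IMAGE_VIEW_TYPE_2D,
        .format = format,
        .subresourceRange = {
            .aspectMask = VK_IMAGE_ASPECT_COLOR_BIT,
            .levelCount = VK_REMAINING_MIP_LEVELS,
            .layerCount = 1,
        },
    };

    VkImageView view = VK_NULL_HANDLE;
    vkCreateImageView(device, &view_info, NULL, &view);
    return view;
}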
And that’s all! As you see, it is a very simple extension.
Happy hacking!
I was interested in how much work a vaapi on top of vulkan video proof of concept would be.
My main reason for being interested is actually video encoding, there is no good vulkan video encoding demo yet, and I'm not experienced enough in the area to write one, but I can hack stuff. I think it is probably easier to hack a vaapi encode to vulkan video encode than write a demo app myself.
With that in mind I decided to see what decode would look like first. I talked to Mike B (most famous zink author) before he left for holidays, then I ignored everything he told me and wrote a super hack.
This morning I convinced zink vaapi on top of anv, with iris GL doing the presents in mpv, to show me some useful frames of video. However, zink vaapi on anv with zink GL is failing miserably (well, green jellyfish).
I'm not sure how much more I'll push on the decode side at this stage, I really wanted it to validate the driver side code, and I've found a few bugs in there already.
The WIP hacks are at [1]. I might push on to the encode side and see if I can work out what it entails, though the encode spec work is a lot more changeable at the moment.
[1] https://gitlab.freedesktop.org/airlied/mesa/-/commits/zink-video-wip
If you own a laptop (Dell, HP, Lenovo) with a WWAN module, it is very likely that the modules are FCC-locked on every boot, and the special FCC unlock procedure needs to be run before they can be used.
Until ModemManager 1.18.2, the procedure was automatically run for the FCC unlock procedures we knew about, but this will no longer happen. Once 1.18.4 is out, the procedure will need to be explicitly enabled by each user, under their own responsibility, or otherwise implicitly enabled after installing an official FCC unlock tool provided by the manufacturer itself.
See a full description of the rationale behind this change in the ModemManager documentation site and the suggested code changes in the gitlab merge request.
If you want to enable the ModemManager-provided unofficial FCC unlock tools once you have installed 1.18.4, run (assuming sysconfdir=/etc and datadir=/usr/share) this command (*):
sudo ln -sft /etc/ModemManager/fcc-unlock.d /usr/share/ModemManager/fcc-unlock.available.d/*
The user-enabled tools in /etc should not be removed during package upgrades, so this should be a one-time setup.
(*) Updated to have one single command instead of a for loop; thanks heftig!
Previously I mentioned having AMD VCN h264 support. Today I added initial support for the older UVD engine[1]. This is found on chips from Vega back to SI.
I've only tested it on my Vega so far.
I also worked out the “correct” answer to how to send the reset command correctly; however, the nvidia player I’m using as a demo doesn’t do things that way yet, so I’ve forked it for now[2].
The answer is to use vkCmdControlVideoCodingKHR to send a reset the first time a session is used. However, I can’t see how the app is meant to know this is necessary, but I’ve asked the appropriate people.
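For reference, this is roughly what that reset looks like on the application side. This is a sketch based on the VK_KHR_video_queue API as it was later ratified; the provisional spec this post refers to may differ, and the extension entry point would normally be loaded via vkGetDeviceProcAddr.

// assumes #include <vulkan/vulkan.h>
/* Recorded between vkCmdBeginVideoCodingKHR() and vkCmdEndVideoCodingKHR()
 * the first time the session is used, before any decode commands. */
static void reset_video_session(VkCommandBuffer cmd_buffer)
{
    VkVideoCodingControlInfoKHR reset_info = {
        .sType = VK_STRUCTURE_TYPE_VIDEO_CODING_CONTROL_INFO_KHR,
        .flags = VK_VIDEO_CODING_CONTROL_RESET_BIT_KHR,
    };
    vkCmdControlVideoCodingKHR(cmd_buffer, &reset_info);
}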
The initial anv branch I mentioned last week is now here[3].
[1] https://gitlab.freedesktop.org/airlied/mesa/-/commits/radv-vulkan-video-uvd-h264
[2] https://github.com/airlied/vk_video_samples/tree/radv-fixes
[3] https://gitlab.freedesktop.org/airlied/mesa/-/tree/anv-vulkan-video-prelim-decode
Last week I mentioned I had the basics of h264 decode using the proposed vulkan video on radv. This week I attempted to do the same thing with Intel's Mesa vulkan driver "anv".
Now I'd previously unsuccessfully tried to get vaapi on crocus working but got sidetracked back into other projects. The Intel h264 decoder hasn't changed a lot between ivb/hsw/gen8/gen9 era. I ported what I had from crocus to anv and started trying to get something to decode on my WhiskeyLake.
I wrote the code pretty early on, figured out all the things I had to send the hardware.
The first anv-side bridge to cross was that Vulkan does picture-level H264 decode, so you get handed the encoded slice data. However, to program the Intel hardware you need to decode the slice header, so I wrote a slice header decoder in some common code. The other thing you need to give the Intel hardware is the number of bits in the slice header, which in some encoding schemes is rounded to bytes and in some isn’t. Slice headers also have a 3-byte header on them, which the Intel hardware wants you to discard or skip before handing the data to it.
Once I'd fixed up that sort of thing in anv + crocus, I started getting grey I-frames decoded with later B/P frames using the grey frames as references so you'd see this kinda wierd motion.
That was, I think, 3 days ago. I have stared at this intently for those 3 days, blaming everything from bitstream encoding to rechecking all my packets (not enough times, though). I had someone else verify they could see grey frames.
Today, after a long discussion about possibilities, I was randomly comparing a frame from the intel-vaapi-driver and one from crocus, and I spotted a packet header that the docs say is 34 dwords long, but intel-vaapi was only encoding 16 dwords. I switched crocus to explicitly state a 16-dword length and I started seeing my I-frames.
Now the B/P frames still have issues. I don't think I'm getting the ref frames logic right yet, but it felt like a decent win after 3 days of staring at it.
The crocus code is [1]. The anv code isn't cleaned up enough to post a pointer to yet, enterprising people might find it. Next week I'll clean it all up, and then start to ponder upstream paths and shared code for radv + anv. Then h265 maybe.
[1]https://gitlab.freedesktop.org/airlied/mesa/-/tree/crocus-media-wip
A few weeks ago I watched Victor's excellent talk on Vulkan Video. This made me question my skills in this area. I'm pretty vague on video processing hardware, I really have no understanding of H264 or any of the standards. I've been loosely following the Vulkan video group inside of Khronos, but I can't say I've understood it or been useful.
radeonsi has a gallium vaapi driver that talks to a firmware-driven decoder/encoder on the hardware; surely copying what it programs can’t be that hard. I got an mpv/vaapi setup running and tested some videos on that setup just to get comfortable, and looked at what sort of data was being pushed about.
The thing is, the firmware is doing all the work here; the driver is mostly just responsible for taking semi-parsed h264 bitstream data structures and handing them, in memory buffers, to the firmware API. The resulting decoded image should then magically appear in a buffer.
I then got the demo nvidia video decoder application mentioned in Victor's talk.
I ported the code to radv in a couple of days, but then began a long journey into the unknown. The firmware is quite expectant on exactly what it wants and when it wants it. After fixing some interactions with the video player, I started to dig.
Now vaapi and DXVA (on Windows) are context-based APIs. This means they are like OpenGL: you create a context, do a bunch of work, and tear it down, and the driver does all the hardware queuing of commands internally. All the state is held in the context. Vulkan is a command-buffer-based API. The application records command buffers and then enqueues those command buffers to the hardware itself.
So the vaapi driver works like this for a video:
create hw ctx, flush, decode, flush, decode, flush, decode, flush, decode, flush, destroy hw ctx, flush
However, Vulkan wants things to be more like:
Create Session, record command buffer with (begin, decode, end), send to hw, record (begin, decode, end), send to hw, End Session
There is no way at session create/end time to submit things to the hardware.
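In Vulkan terms, the per-frame part of that pattern looks roughly like the sketch below. This is based on the VK_KHR_video_queue / VK_KHR_video_decode_queue entry points as later ratified, with most structure fields omitted; the provisional spec this post was written against may differ, and the function and variable names are just for illustration.

// assumes #include <vulkan/vulkan.h>; extension entry points would normally
// be loaded via vkGetDeviceProcAddr in a real application
static void record_decode_frame(VkCommandBuffer cmd_buffer,
                                VkVideoSessionKHR session)
{
    VkVideoBeginCodingInfoKHR begin_info = {
        .sType = VK_STRUCTURE_TYPE_VIDEO_BEGIN_CODING_INFO_KHR,
        .videoSession = session,
        /* session parameters and reference slot state omitted */
    };
    vkCmdBeginVideoCodingKHR(cmd_buffer, &begin_info);

    VkVideoDecodeInfoKHR decode_info = {
        .sType = VK_STRUCTURE_TYPE_VIDEO_DECODE_INFO_KHR,
        /* bitstream buffer, output picture and reference slots omitted,
         * as is the codec-specific (h264) pNext chain */
    };
    vkCmdDecodeVideoKHR(cmd_buffer, &decode_info);

    VkVideoEndCodingInfoKHR end_info = {
        .sType = VK_STRUCTURE_TYPE_VIDEO_END_CODING_INFO_KHR,
    };
    vkCmdEndVideoCodingKHR(cmd_buffer, &end_info);

    /* The application then submits cmd_buffer to a video-capable queue
     * itself (vkQueueSubmit); there is no implicit flush as in vaapi. */
}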
After a week or two of hair removal and insightful IRC chats, I stumbled over a decent enough workaround to avoid the hardware dying and managed to decode an H264 video of some jellyfish.
The work is based on a bunch of other stuff, and is in no way suitable for upstreaming yet, not to mention the Vulkan specification is only beta/provisional so it can’t be used anywhere outside of development.
The preliminary code is in my gitlab repo here[1]. It has a start on h265 decode, but it's not working at all yet, and I think the h264 code is a bit hangy randomly.
I'm not sure where this is going yet, but it was definitely an interesting experiment.
[1]: https://gitlab.freedesktop.org/airlied/mesa/-/commits/radv-vulkan-video-prelim-decode
A basic example of the git alias function syntax looks like this.
[alias]
shortcut = "!f() \
{\
echo Hello world!; \
}; f"
This syntax defines a function f and then calls it. These aliases are executed in a sh shell, which means there’s no access to Bash / Zsh specific functionality.
Every command is ended with a ; and each line is ended with a \. This is easy enough to grok. But when we try to clean up the above snippet and add some quotes to "Hello world!", we hit this obtuse error message.
}; f: 1: Syntax error: end of file unexpected (expecting "}")
This syntax error is caused by quotes needing to be escaped. The reason for this comes down to how git tokenizes and executes these functions. If you're curious …