planet.freedesktop.org
March 22, 2022
OpenSSH has this very nice setting, VerifyHostKeyDNS, which, when enabled, will pull SSH host keys from DNS, so you no longer need to either trust on first use or copy host keys around out of band. Naturally, trusting unsecured DNS is a bit scary, so this requires the record to be signed using DNSSEC. This has worked for a long time, but then broke, seemingly out of the blue. Running ssh -vvv gave output similar to:

debug1: found 4 insecure fingerprints in DNS
debug3: verify_host_key_dns: checking SSHFP type 1 fptype 2
debug3: verify_host_key_dns: checking SSHFP type 4 fptype 2
debug1: verify_host_key_dns: matched SSHFP type 4 fptype 2
debug3: verify_host_key_dns: checking SSHFP type 4 fptype 1
debug1: verify_host_key_dns: matched SSHFP type 4 fptype 1
debug3: verify_host_key_dns: checking SSHFP type 1 fptype 1
debug1: matching host key fingerprint found in DNS

This happened even though the zone was signed, the resolver was checking the signature, and I had even checked that the DNS response had the AD bit set. The fix was to add "options trust-ad" to /etc/resolv.conf. Without this, glibc will discard the AD bit from any upstream DNS servers. Note that you should only add this if you actually have a trusted DNS resolver. I run unbound on localhost, so if somebody can do a man-in-the-middle attack on that traffic, I have other problems.
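As a sketch of the two sides of the setup (the hostname is illustrative; `ssh-keygen -r` is the standard way to print SSHFP records for pasting into a zone file):

```shell
# Server side: print SSHFP records for the DNSSEC-signed zone.
# The hostname is illustrative; the printed fingerprints will differ.
ssh-keygen -r myhost.example.com

# Client side: let glibc keep the AD bit from the (trusted, local)
# resolver, and turn on DNS-based host key verification in ssh.
echo 'options trust-ad' >> /etc/resolv.conf
echo 'VerifyHostKeyDNS yes' >> ~/.ssh/config
```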
March 18, 2022

March Forward

Anyone who knows me knows that I hate cardio.

Full stop.

I’m not picking up and putting down all these heavy weights just so I can go for a jog afterwards and lose all my gains.

Similarly, I’m not trying to set a world record for speed-writing code. This stuff takes time, and it can’t be rushed.

Unless…

Lavapipe: The Best Driver

Today we’re a Lavapipe blog.

Lavapipe is, of course, the software implementation of Vulkan that ships with Mesa, originally braindumped directly into the repo by graphics god and part-time Twitter executive, Dave Airlie. For a long time, the Lavapipe meme has been “Try it on Lavapipe—it’s not conformant, but it still works pretty good haha” and we’ve all had a good chuckle at the idea that anything not officially signed and stamped by Khronos could ever draw a single triangle properly.

But, pending a single MR that fixes the four outstanding failures for Vulkan 1.2 conformance, as of last week, Lavapipe passes 100% of conformance tests. Thus, pending a merge and a Mesa bugfix release, Lavapipe will achieve official conformance.

And then we’ll have a new meme: Vulkan 1.3 When?

Meme Over

As some have noticed, I’ve been quietly (very, very, very, very, very, very, very, very, very, very, very, very quietly) implementing a number of features for Lavapipe over the past week.

But why?

Khronos-fanatics will immediately recognize that these features are all part of Vulkan 1.3.

Which Lavapipe also now supports, pending more merges which I expect to happen early next week.

This is what a sprint looks like.

March 17, 2022

We had a busy 2021 within the GNU/Linux graphics stack at Igalia.

Would you like to know what we did last year? Keep reading!

Open Source Raspberry Pi GPU (VideoCore) drivers

Raspberry Pi 4, model B

Last year both the OpenGL and the Vulkan drivers received a lot of love. For example, we implemented several optimizations, such as improvements to the v3dv pipeline cache. In this blog post, Alejandro Piñeiro presents how we improved v3dv pipeline cache times by reducing the previous two cache lookups to just one, and shows some numbers on both a synthetic test (a modified CTS test) and some games.

We also improved the performance of the v3d compilers for OpenGL and Vulkan. Iago Toral explains our work on optimizing the backend compiler with techniques such as improving memory lookup efficiency, reducing instruction counts, instruction packing, and uniform handling, among others. There are some numbers that show framerate improvements from ~6% to ~62% on different games and demos.

Framerate improvements Framerate improvement after optimization (in %). Taken from Iago’s blogpost

Of course, there was also work related to feature implementation. This blog post from Iago lists some of the Vulkan extensions implemented in the v3dv driver in 2021. Although not all the implemented extensions are listed there, you can see the driver is quickly catching up in its Vulkan extension support.

My colleague Juan A. Suárez implemented performance counters in the v3d driver (an OpenGL driver) which required modifications in the kernel and in the Mesa driver. More info in his blog post.

There was work in other areas in 2021 too, like improved support for RenderDoc and GFXReconstruct. And let's not forget the kernel contributions to the DRM driver by Melissa Wen, who not only developed features for it, but also reviewed all the patches that came from the community.

However, the biggest milestone for the v3dv driver was becoming Vulkan 1.1 conformant in the last quarter of 2021, just one year after becoming Vulkan 1.0 conformant. As you can imagine, that implied a lot of work implementing features, fixing bugs and, of course, improving the driver in many different ways. Great job, folks!

If you want to know more about all the work done on these drivers during 2021, there is an awesome talk from my colleague Alejandro Piñeiro at FOSDEM 2022: “v3dv: Status Update for Open Source Vulkan Driver for Raspberry Pi 4”, and another one from my colleague Iago Toral at XDC 2021: “Raspberry Pi Vulkan driver update”. Below you can find the video recordings of both talks.

FOSDEM 2022 talk: “v3dv: Status Update for Open Source Vulkan Driver for Raspberry Pi 4”

XDC 2021 talk: “Raspberry Pi Vulkan driver update”

Open Source Qualcomm Adreno GPU drivers

RB3 Photo of the Qualcomm® Robotics RB3 Platform embedded board that I use for Turnip development.

There were also several achievements by Igalians on both the Freedreno and Turnip drivers. These are reverse-engineered open-source drivers for Qualcomm Adreno GPUs: Freedreno for OpenGL and Turnip for Vulkan.

At the start of 2021, my colleague Danylo Piliaiev helped implement the missing bits in Freedreno to support OpenGL 3.3 on Adreno 6xx GPUs. His blog post explains this work, such as implementing ARB_blend_func_extended and ARB_shader_stencil_export, and fixing a variety of CTS test failures.

Related to this, my colleague Guilherme G. Piccoli worked on porting a recent kernel to one of the boards we use for Freedreno development: the Inforce 6640. He did an awesome job getting a 5.14 kernel booting on that embedded board. If you want to know more, please read the blog post he wrote explaining all the issues he found and how he fixed them!

Inforce6640 Picture of the Inforce 6640 board that Guilherme used for his development. Image from his blog post.

However, the biggest chunk of work was done in the Turnip driver. We implemented a long list of Vulkan extensions: VK_KHR_buffer_device_address, VK_KHR_depth_stencil_resolve, VK_EXT_image_view_min_lod, VK_KHR_spirv_1_4, VK_EXT_descriptor_indexing, VK_KHR_timeline_semaphore, VK_KHR_16bit_storage, VK_KHR_shader_float16, VK_KHR_uniform_buffer_standard_layout, VK_EXT_extended_dynamic_state, VK_KHR_pipeline_executable_properties, VK_VALVE_mutable_descriptor_type, VK_KHR_vulkan_memory_model and many others. Danylo Piliaiev and Hyunjun Ko are terrific developers!

But not all our work was related to feature development. For example, I implemented the Low-Resolution Z-buffer (LRZ) HW optimization, and Danylo fixed a long list of rendering bugs that happened in real-world applications (blog post 1, blog post 2), like D3D games run on Vulkan (thanks to DXVK and VKD3D), and instrumented the backend compiler to dump register values, among many other fixes and optimizations.

However, the biggest achievement was getting Vulkan 1.1 conformance for Turnip. Danylo wrote a blog post mentioning all the work we did to achieve that this year.

If you want to know more, don’t miss this FOSDEM 2022 talk given by my colleague Hyunjun Ko called “The status of turnip driver development. What happened in 2021 and will happen in 2022 for turnip.”. Video below.

FOSDEM 2022 talk: “The status of turnip driver development. What happened in 2021 and will happen in 2022 for turnip.”

Vulkan contributions

Our graphics work doesn’t only cover driver development; we also participate in the Khronos Group as Vulkan Conformance Test Suite developers and even as spec contributors.

My colleague Ricardo Garcia is a very productive developer. He worked on implementing tests for Vulkan Ray Tracing extensions (read his blog post about ray tracing for more info about this big Vulkan feature), implemented tests for a long list of Vulkan extensions like VK_KHR_present_id and VK_KHR_present_wait, VK_EXT_multi_draw (watch his talk at XDC 2021), VK_EXT_border_color_swizzle (watch his talk at FOSDEM 2022) among many others. In many of these extensions, he contributed to their respective specifications in a significant way (just search for his name in the Vulkan spec!).

XDC 2021 talk: “Quick Overview of VK_EXT_multi_draw”

FOSDEM 2022 talk: “Fun with border colors in Vulkan. An overview of the story behind VK_EXT_border_color_swizzle”

Similarly, I participated modestly in this effort by developing tests for some extensions like VK_EXT_image_view_min_lod (blog post). Of course, both Ricardo and I implemented many new CTS tests, added coverage to existing ones, fixed lots of bugs in them, and reported dozens of driver issues to the respective Mesa developers.

Not only that, both Ricardo and I appeared as Vulkan 1.3 spec contributors.

Vulkan 1.3

Another interesting effort we started in 2021 is Vulkan Video support in GStreamer. My colleague Víctor Jaquez presented the Vulkan Video extension at XDC 2021 and soon after started working on Vulkan Video's H.264 decode support. You can find more information in his blog post, or by watching his talk below:

FOSDEM 2022 talk: “Video decoding in Vulkan: VK_KHR_video_queue/decode APIs”

Before I leave this section, don’t forget to take a look at Ricardo’s blog post on the debugPrintfEXT feature. If you are a graphics developer, you will find this feature very useful for debugging issues in your applications!

Along those lines, Danylo presented at XDC 2021 a talk about dissecting and fixing Vulkan rendering issues in drivers with RenderDoc. Very useful for driver developers! Watch the talk below:

XDC 2021 talk: “Dissecting Vulkan rendering issues in drivers with RenderDoc”

To wrap up this blog post, remember that vkrunner (the Vulkan shader tester created by Igalia) is now available for RPM-based GNU/Linux distributions. If you are working with embedded systems, my blog post about cross-compiling with icecream may help speed up your builds.

This is just a summary of last year’s highlights. I’m sorry if I missed some work from my colleagues.

March 16, 2022

At Last

Those of you in-the-know are well aware that Zink has always had a crippling addiction to seamless cubemaps. Specifically, Vulkan doesn’t support non-seamless cubemaps since nobody wants those anymore, but this is the default mode of sampling for OpenGL.

Thus, it is impossible for Zink to pass GL 4.6 conformance until this issue is resolved.

But what does this even mean?

Cubes: They Have Faces Just Like You And Me

As veterans of intro to geometry courses all know*, a cube is a 3D shape that has six identically-sized sides called “faces”. In computer graphics, each of these faces has its own content that can be read and written discretely.

When interpolating data from a cube-type texture, there are two methods:

  • Using a seamless interpretation of a cube yields cases where pixels may be interpolated across the boundaries of faces
  • Using a non-seamless interpretation of a cube yields cases where pixels may be clamped/wrapped at the boundary of a face

This effectively results in Zink interpolating across the boundaries of cube faces when it should instead be clamping/wrapping pixel data to a single face.

But who cares about all that math nonsense when the result is that Zink is still failing CTS cases?

*Disclosure: I have been advised by my lawyer to state on the record that I have never taken an intro to geometry course and have instead copied this entire blog post off StackOverflow.

How To Make Cubes Worse 101

In order to replicate this basic OpenGL behavior, a substantial amount of code is required—most of it terrible.

The first step is to determine when a cube should be sampled as non-seamless. OpenGL helpfully has only one extension (plus this other extension) which control seamless access of cubemaps, so as long as that one state (plus the other state) isn’t enabled, a cube shouldn’t be interpreted seamlessly.

With this done, what happens at coordinates that lie at the edge of a face? The OpenGL wrap enum covers this. For the purposes of this blog post, only two wrap modes exist:

  • edge clamping - clamp the coordinate to the edge of the face (coord = extent or coord = 0)
  • repeat - pretend the face repeats infinitely (coord %= extent)
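As a rough sketch of those two wrap modes for integer texel coordinates (in Python, with illustrative names; the real work happens in the generated shader code, not in Mesa's C API):

```python
def wrap_coord(coord: int, extent: int, mode: str) -> int:
    """Apply an OpenGL-style wrap mode to a texel coordinate.

    'clamp' corresponds to edge clamping, 'repeat' to pretending the
    face tiles infinitely. Names are illustrative, not a real API.
    """
    if mode == "clamp":
        # Clamp into the valid texel range [0, extent - 1].
        return max(0, min(coord, extent - 1))
    if mode == "repeat":
        # Python's % yields a non-negative result for positive
        # extent, which matches the repeat behavior for negatives.
        return coord % extent
    raise ValueError(f"unknown wrap mode: {mode}")

# A coordinate just past the right edge of a 256-texel face:
print(wrap_coord(256, 256, "clamp"))   # 255
print(wrap_coord(256, 256, "repeat"))  # 0
# And one just past the left edge:
print(wrap_coord(-1, 256, "clamp"))    # 0
print(wrap_coord(-1, 256, "repeat"))   # 255
```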

So now non-seamless cubes are detected, and the behavior for handling their non-seamlessness is known, but how can this actually be done?

Excruciating

In short, this requires shader rewrites to handle coordinate clamping, then wrapping. Since it’s not going to be desirable to have a different shader variant for every time the wrap mode changes, this means loading the parameters from a UBO. Since it’s further not going to be desirable to have shader variants for each per-texture seamless/non-seamless cube combination, this means also making the rewrite handle the no-op case of continuing to use the original, seamless coordinates after doing all the calculations for the non-seamless case.

Worst of all, this has to be plumbed through the Rube Goldberg machine that is Mesa.

It was terrible, and it continues to be terrible.

If I were another blogger, I would probably take this opportunity to flex my Calculus credentials by putting all kinds of math here, but nobody really wants to read that, and the hell if I know how to make markdown do that whiteboard thing so I can doodle in fancy formulas or whatever from the spec.

Instead, you can read the merge request if you’re that deeply invested in cube mechanics.

March 09, 2022

Subgroup operations or wave intrinsics, such as reducing a value across the threads of a shader subgroup or wave, were introduced in GPU programming languages a while ago. They communicate with other threads of the same wave, for example to exchange the input values of a reduction, but not necessarily with all of them if there is divergent control flow.
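To make that concrete, here is a small Python model (illustrative only, not LLVM or GPU code) of a wave-wide reduction: only the lanes that are currently active, i.e. converged at the operation, contribute to and receive the sum.

```python
def subgroup_add(values, active_mask):
    """Model of a wave-wide reduction (think subgroupAdd).

    Only active lanes (those executing the convergent operation
    together) contribute to the sum; lanes that took a divergent
    branch elsewhere neither contribute nor receive a result.
    """
    total = sum(v for v, active in zip(values, active_mask) if active)
    return [total if active else None for active in active_mask]

# A 4-lane wave where lanes 1 and 3 took a divergent branch:
lane_values = [1, 2, 3, 4]
active = [True, False, True, False]
print(subgroup_add(lane_values, active))  # [4, None, 4, None]
```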

In LLVM, we call such operations convergent. Unfortunately, LLVM does not define how the set of communicating threads in convergent operations -- the set of converged threads -- is affected by control flow.

If you're used to thinking in terms of structured control flow, this may seem trivial. Obviously, there is a tree of control flow constructs: loops, if-statements, and perhaps a few others depending on the language. Two threads are converged in the body of a child construct if and only if both execute that body and they are converged in the parent. Throw in some simple and intuitive rules about loop counters and early exits (nested return, break and continue, that sort of thing) and you're done.

In an unstructured control flow graph, the answer is not obvious at all. I gave a presentation at the 2020 LLVM Developers' Meeting that explains some of the challenges as well as a solution proposal that involves adding convergence control tokens to the IR.

Very briefly, convergent operations in the proposal use a token variable that is defined by a convergence control intrinsic. Two dynamic instances of the same static convergent operation from two different threads are converged if and only if the dynamic instances of the control intrinsic producing the used token values were converged.

(The published draft of the proposal talks of multiple threads executing the same dynamic instance. I have since been convinced that it's easier to teach this matter if we instead always give every thread its own dynamic instances and talk about a convergence equivalence relation between dynamic instances. This doesn't change the resulting semantics.)

The draft has three such control intrinsics: anchor, entry, and (loop) heart. Of particular interest here is the heart. For the most common and intuitive use cases, a heart intrinsic is placed in the header of natural loops. The token it defines is used by convergent operations in the loop. The heart intrinsic itself also uses a token that is defined outside the loop: either by another heart in the case of nested loops, or by an anchor or entry. The heart combines two intuitive behaviors:

  • It uses a token in much the same way that convergent operations do: two threads are converged for their first execution of the heart if and only if they were converged at the intrinsic that defined the used token.
  • Two threads are converged at subsequent executions of the heart if and only if they were converged for the first execution and they are currently at the same loop iteration, where iterations are counted by a virtual loop counter that is incremented at the heart.
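The two heart rules above can be written as a tiny predicate (a Python model of the semantics with illustrative names, not LLVM code):

```python
def converged_at_heart(converged_at_token_def: bool,
                       iter_a: int, iter_b: int) -> bool:
    """Model of loop-heart convergence: two threads' dynamic
    instances of a heart are converged iff the threads were converged
    at the intrinsic defining the heart's token AND they are on the
    same iteration of the virtual loop counter counted at the heart."""
    return converged_at_token_def and iter_a == iter_b

# Converged at the token definition, same iteration: converged.
print(converged_at_heart(True, 0, 0))   # True
# Same iteration, but diverged before the loop: not converged.
print(converged_at_heart(False, 0, 0))  # False
# Converged at the token definition, different iterations: not converged.
print(converged_at_heart(True, 1, 2))   # False
```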

Viewed from this angle, how about we define a weaker version of these rules that lies somewhere between an anchor and a loop heart? We could call it a "light heart", though I will stick with "iterating anchor". The iterating anchor defines a token but has no arguments. Like for the anchor, the set of converged threads is implementation-defined -- when the iterating anchor is first encountered. When threads encounter the iterating anchor again without leaving the dominance region of its containing basic block, they are converged if and only if they were converged during their previous encounter of the iterating anchor.

The notion of an iterating anchor came up when discussing the convergence behaviors that can be guaranteed for natural loops. Is it possible to guarantee that natural loops always behave in the natural way -- according to their loop counter -- when it comes to convergence?

Naively, this should be possible: just put hearts into loop headers! Unfortunately, that's not so straightforward when multiple natural loops are contained in an irreducible loop: 

Hearts in A and C must refer to a token defined outside the loops; that is, a token defined in E. The resulting program is ill-formed because it has a closed path that goes through two hearts that use the same token, but the path does not go through the definition of that token. This well-formedness rule exists because the rules about heart semantics are unsatisfiable if the rule is broken.

The underlying intuitive issue is that if the branch at E is divergent in a typical implementation, the wave (or subgroup) must choose whether A or C is executed first. Neither choice works. The heart in A indicates that (among the threads that are converged in E) all threads that visit A (whether immediately or via C) must be converged during their first visit of A. But if the wave executes A first, then threads which branch directly from E to A cannot be converged with those that first branch to C. The opposite conflict exists if the wave executes C first.

If we replace the hearts in A and C by iterating anchors, this problem goes away because the convergence during the initial visit of each block is implementation-defined. In practice, it should fall out of which of the blocks the implementation decides to execute first.

So it seems that iterating anchors can fill a gap in the expressiveness of the convergence control design. But are they really a sound addition? There are two main questions:

  • Satisfiability: Can the constraints imposed by iterating anchors be satisfied, or can they cause the sort of logical contradiction discussed for the example above? And if so, is there a simple static rule that prevents such contradictions?
  • Spooky action at a distance: Are there generic code transforms which change semantics while changing a part of the code that is distant from the iterating anchor?

The second question is important because we want to add convergence control to LLVM without having to audit and change the existing generic transforms. We certainly don't want to hurt compile-time performance by increasing the amount of code that generic transforms have to examine for making their decisions.

Satisfiability 

Consider the following simple CFG with an iterating anchor in A and a heart in B that refers back to a token defined in E:

Now consider two threads that are initially converged with execution traces:

  1. E - A - A - B - X
  2. E - A - B - A - X

The heart rule implies that the threads must be converged in B. The iterating anchor rule implies that if the threads are converged in their first dynamic instances of A, then they must also be converged in their second dynamic instances of A, which leads to a temporal paradox.

One could try to resolve the paradox by saying that the threads cannot be converged in A at all, but this would mean that the threads must diverge before a divergent branch occurs. That seems unreasonable, since typical implementations want to avoid divergence as long as control flow is uniform.

The example arguably breaks the spirit of the rule about convergence regions from the draft proposal linked above, and so a minor change to the definition of convergence region may be used to exclude it.

What if the CFG instead looks as follows, which does not break any rules about convergence regions:

For the same execution traces, the heart rule again implies that the threads must be converged in B. The convergence of the first dynamic instances of A are technically implementation-defined, but we'd expect most implementations to be converged there.

The second dynamic instances of A cannot be converged due to the convergence of the dynamic instances of B. That's okay: the second dynamic instance of A in thread 2 is a re-entry into the dominance region of A, and so its convergence is unrelated to any convergence of earlier dynamic instances of A.

Spooky action at a distance 

Unfortunately, we still cannot allow this second example. A program transform may find that the conditional branch in E is constant and the edge from E to B is dead. Removing that edge brings us back to the previous example which is ill-formed. However, a transform which removes the dead edge would not normally inspect the blocks A and B or their dominance relation in detail. The program becomes ill-formed by spooky action at a distance.

The following static rule forbids both example CFGs: if there is a closed path through a heart and an iterating anchor, but not through the definition of the token that the heart uses, then the heart must dominate the iterating anchor.

There is at least one other issue of spooky action at a distance. If the iterating anchor is not the first (non-phi) instruction of its basic block, then it may be preceded by a function call in the same block. The callee may contain control flow that ends up being inlined. Back edges that previously pointed at the block containing the iterating anchor will then point to a different block, which changes the behavior quite drastically. Essentially, the iterating anchor is reduced to a plain anchor.

What can we do about that? It's tempting to decree that an iterating anchor must always be the first (non-phi) instruction of a basic block. Unfortunately, this is not easily done in LLVM in the face of general transforms that might sink instructions or merge basic blocks.

Preheaders to the rescue 

We could chew through some other ideas for making iterating anchors work, but that turns out to be unnecessary. The desired behavior of iterating anchors can be obtained by inserting preheader blocks. The initial example of two natural loops contained in an irreducible loop becomes: 

Place anchors in Ap and Cp and hearts in A and C that use the token defined by their respective dominating anchor. Convergence at the anchors is implementation-defined, but relative to this initial convergence at the anchor, convergence inside the natural loops headed by A and C behaves in the natural way, based on a virtual loop counter. The transform of inserting an anchor in the preheader is easily generalized.

To sum it up: We've concluded that defining an "iterating anchor" convergence control intrinsic is problematic, but luckily also unnecessary. The control intrinsics defined in the original proposal are sufficient. I hope that the discussion that led to those conclusions helps illustrate some aspects of the convergence control proposal for LLVM as well as the goals and principles that drove it.

March 07, 2022

They Say An Image Macro Conveys An Entire Day Of Shouting At The Computer

tess-bugs.png

March 04, 2022

A quick reminder: libei is the library for emulated input. It comes as a pair of C libraries, libei for the client side and libeis for the server side.

libei has been sitting mostly untouched since the last status update. There are two use-cases we need to solve for input emulation in Wayland: the ability to emulate input (think xdotool, or a Synergy/Barrier/InputLeap client) and the ability to capture input (think a Synergy/Barrier/InputLeap server). The latter effectively blocked development in libei [1]; until that use-case was sorted, there wasn't much point investing too much into libei - after all, it might get thrown out as a bad idea. And epiphanies were as elusive as toilet paper and RATs, so not much got done. This changed about a week or two ago when the required lightbulb finally arrived, pre-lit from the factory.

So, the solution to the input capturing use-case is going to be a so-called "passive context" for libei. In the traditional [2] "active context" approach for libei, we have the EIS implementation in the compositor and a client using libei to connect to that. The compositor sets up a seat or more, then some devices within that seat that typically represent the available screens. libei then sends events through these devices, causing input to appear in the compositor, which moves the cursor around. In a typical and simple use-case you'd get a 1920x1080 absolute pointer device and a keyboard with a $layout keymap; libei then sends events to position the cursor and/or happily type away on-screen.

In the "passive context" <deja-vu> approach for libei we have the EIS implementation in the compositor and a client using libei to connect to that. The compositor sets up a seat or more, then some devices within that seat </deja-vu> that typically represent the physical devices connected to the host computer. libei then receives events from these devices, causing input to be generated in the libei client. In a typical and simple use-case you'd get a relative pointer device and a keyboard device with a $layout keymap, the compositor then sends events matching the relative input of the connected mouse or touchpad.

The two notable differences are thus: events flow from EIS to libei and the devices don't represent the screen but rather the physical [3] input devices.

This changes libei from a library for emulated input to an input event transport layer between two processes, on a much higher level than e.g. evdev or HID and with more contextual information (seats, logically abstracted devices, etc.). And of course, the EIS implementation is always in control of the events, regardless of which direction they flow. A compositor can implement an event filter or designate a key to break the connection to the libei client. In pseudocode, the compositor's input event processing function will look like this:


function handle_input_events():
    real_events = libinput.get_events()
    for e in real_events:
        if input_capture_active:
            send_event_to_passive_libei_client(e)
        else:
            process_event(e)

    emulated_events = eis.get_events_from_active_clients()
    for e in emulated_events:
        process_event(e)

Not shown here are the various appropriate filters and conversions in between (e.g. all relative events from libinput devices would likely be sent through the single relative device exposed on the EIS context). Again, the compositor is in control so it would be trivial to implement e.g. capturing of the touchpad only but not the mouse.

In the current design, a libei context can only be active or passive, not both. The EIS context is both, it's up to the implementation to disconnect active or passive clients if it doesn't support those.

Notably, the above only caters for the transport of input events; it doesn't actually make any decision on when to capture events. This is handled by the CaptureInput XDG Desktop Portal [4]. The idea here is that an application like a Synergy/Barrier/InputLeap server connects to the CaptureInput portal and requests a CaptureInput session. In that session it can define pointer barriers (left edge, right edge, etc.) and, in the future, maybe other triggers. In return it gets a libei socket from which it can initialize a libei context. When the compositor decides that the pointer barrier has been crossed, it re-routes the input events through the EIS context so they pop out in the application. Synergy/Barrier/InputLeap then converts that to the global position, passes it to the right remote Synergy/Barrier/InputLeap client and replays it there through an active libei context, where it feeds into the local compositor.

Because the management of when to capture input is handled by the portal and the respective backends, it can be natively integrated into the UI. Because the actual input events are a direct flow between compositor and application, the latency should be minimal. Because it's a high-level event library, you don't need to care about hardware-specific details (unlike, say, the inputfd proposal from 2017). Because the negotiation of when to capture input is through the portal, the application itself can run inside a sandbox. And because libei only handles the transport layer, compositors that don't want to support sandboxes can set up their own negotiation protocol.

So overall, right now this seems like a workable solution.

[1] "blocked" is probably overstating it a bit but no-one else tried to push it forward, so..
[2] "traditional" is probably overstating it for a project that's barely out of alpha development
[3] "physical" is probably overstating it since it's likely to be a logical representation of the types of inputs, e.g. one relative device for all mice/touchpads/trackpoints
[4] "handled by" is probably overstating it since at the time of writing the portal is merely a draft of an XML file

Blogging: That Thing I Forgot About

Yeah, my b, I forgot this was a thing.

Fuck it though, I’m a professional, so I’m gonna pretend I didn’t just skip a month of blogs and get right back into it.

Gallivm

Gallivm is the nir/tgsi-to-llvm translation layer in Gallium that LLVMpipe (and thus Lavapipe) uses to generate the JIT functions which make triangles. It’s very old code in that it predates me knowing how triangles work, but that doesn’t mean it doesn’t have bugs.

And Gallivm bugs are the worst bugs.

For a long time, I’ve had SIGILL crashes on exactly one machine locally for the CTS glob dEQP-GLES31.functional.program_uniform.by*sampler2D_samplerCube*. These tests pass on everyone else’s machines including CI.

Like I said, Gallivm bugs are the worst bugs.

Debugging

How does one debug JIT code? GDB can’t be used, valgrind doesn’t work, and, despite what LLVM developers would tell you, building an assert-enabled LLVM doesn’t help at all in most cases here since that will only catch invalid behavior, not questionably valid behavior that very obviously produces invalid results.

So we enter the world of lp_build_print debugging. Much like standard printf debugging, the strategy here is to just lp_build_print_value or lp_build_printf("I hate this part of the shader too") our way to figuring out where in the shader the crash occurs.

Here’s an example shader from dEQP-GLES31.functional.program_uniform.by_pointer.render.basic_struct.sampler2D_samplerCube_vertex that crashes:

#version 310 es
in highp vec4 a_position;
out mediump float v_vtxOut;

struct structType
{
	mediump sampler2D m0;
	mediump samplerCube m1;
};
uniform structType u_var;

mediump float compare_float    (mediump float a, mediump float b)  { return abs(a - b) < 0.05 ? 1.0 : 0.0; }
mediump float compare_vec4     (mediump vec4 a, mediump vec4 b)    { return compare_float(a.x, b.x)*compare_float(a.y, b.y)*compare_float(a.z, b.z)*compare_float(a.w, b.w); }

void main (void)
{
	gl_Position = a_position;
	v_vtxOut = 1.0;
	v_vtxOut *= compare_vec4(texture(u_var.m0, vec2(0.0)), vec4(0.15, 0.52, 0.26, 0.35));
	v_vtxOut *= compare_vec4(texture(u_var.m1, vec3(0.0)), vec4(0.88, 0.09, 0.30, 0.61));
}

Which, in llvmpipe NIR, is:

shader: MESA_SHADER_VERTEX
source_sha1: {0xcb00c93e, 0x64db3b0f, 0xf4764ad3, 0x12b69222, 0x7fb42437}
inputs: 1
outputs: 2
uniforms: 0
shared: 0
ray queries: 0
decl_var uniform INTERP_MODE_NONE sampler2D lower@u_var.m0 (0, 0, 0)
decl_var uniform INTERP_MODE_NONE samplerCube lower@u_var.m1 (0, 0, 1)
decl_function main (0 params)

impl main {
	block block_0:
	/* preds: */
	vec1 32 ssa_0 = deref_var &a_position (shader_in vec4) 
	vec4 32 ssa_1 = intrinsic load_deref (ssa_0) (access=0)
	vec1 16 ssa_2 = load_const (0xb0cd = -0.150024)
	vec1 16 ssa_3 = load_const (0x2a66 = 0.049988)
	vec1 16 ssa_4 = load_const (0xb829 = -0.520020)
	vec1 16 ssa_5 = load_const (0xb429 = -0.260010)
	vec1 16 ssa_6 = load_const (0xb59a = -0.350098)
	vec1 16 ssa_7 = load_const (0xbb0a = -0.879883)
	vec1 16 ssa_8 = load_const (0xadc3 = -0.090027)
	vec1 16 ssa_9 = load_const (0xb4cd = -0.300049)
	vec1 16 ssa_10 = load_const (0xb8e1 = -0.609863)
	vec2 32 ssa_13 = load_const (0x00000000, 0x00000000) = (0.000000, 0.000000)
	vec1 32 ssa_49 = load_const (0x00000000 = 0.000000)
	vec4 16 ssa_14 = (float16)txl ssa_13 (coord), ssa_49 (lod), 0 (texture), 0 (sampler)
	vec1 16 ssa_15 = fadd ssa_14.x, ssa_2
	vec1 16 ssa_16 = fabs ssa_15
	vec1 16 ssa_17 = fadd ssa_14.y, ssa_4
	vec1 16 ssa_18 = fabs ssa_17
	vec1 16 ssa_19 = fadd ssa_14.z, ssa_5
	vec1 16 ssa_20 = fabs ssa_19
	vec1 16 ssa_21 = fadd ssa_14.w, ssa_6
	vec1 16 ssa_22 = fabs ssa_21
	vec1 16 ssa_23 = fmax ssa_16, ssa_18
	vec1 16 ssa_24 = fmax ssa_23, ssa_20
	vec1 16 ssa_25 = fmax ssa_24, ssa_22
	vec3 32 ssa_27 = load_const (0x00000000, 0x00000000, 0x00000000) = (0.000000, 0.000000, 0.000000)
	vec1 32 ssa_50 = load_const (0x00000000 = 0.000000)
	vec4 16 ssa_28 = (float16)txl ssa_27 (coord), ssa_50 (lod), 1 (texture), 1 (sampler)
	vec1 16 ssa_29 = fadd ssa_28.x, ssa_7
	vec1 16 ssa_30 = fabs ssa_29
	vec1 16 ssa_31 = fadd ssa_28.y, ssa_8
	vec1 16 ssa_32 = fabs ssa_31
	vec1 16 ssa_33 = fadd ssa_28.z, ssa_9
	vec1 16 ssa_34 = fabs ssa_33
	vec1 16 ssa_35 = fadd ssa_28.w, ssa_10
	vec1 16 ssa_36 = fabs ssa_35
	vec1 16 ssa_37 = fmax ssa_30, ssa_32
	vec1 16 ssa_38 = fmax ssa_37, ssa_34
	vec1 16 ssa_39 = fmax ssa_38, ssa_36
	vec1 16 ssa_40 = fmax ssa_25, ssa_39
	vec1 32 ssa_41 = flt32 ssa_40, ssa_3
	vec1 32 ssa_42 = b2f32 ssa_41
	vec1 32 ssa_43 = deref_var &gl_Position (shader_out vec4) 
	intrinsic store_deref (ssa_43, ssa_1) (wrmask=xyzw /*15*/, access=0)
	vec1 32 ssa_44 = deref_var &v_vtxOut (shader_out float) 
	intrinsic store_deref (ssa_44, ssa_42) (wrmask=x /*1*/, access=0)
	/* succs: block_1 */
	block block_1:
}

There’s two sample ops (txl), and since these tests only do simple texture() calls, it seems reasonable to assume that one of them is causing the crash. Sticking a lp_build_print_value on the texel values fetched by the sample operations will reveal whether the crash occurs before or after them.

What output does this yield?

Test case 'dEQP-GLES31.functional.program_uniform.by_pointer.render.basic_struct.sampler2D_samplerCube_vertex'..
texel 1.43279037e-322 6.95333598e-310 0 1.43279037e-322 1.08694442e-322 1.43279037e-322 1.08694442e-322 0
texel 1.43279037e-322 6.95333598e-310 0 1.43279037e-322 1.08694442e-322 1.43279037e-322 1.08694442e-322 0
texel 1.43279037e-322 6.95333598e-310 0 1.43279037e-322 1.08694442e-322 1.43279037e-322 1.08694442e-322 0
texel 1.43279037e-322 6.95333598e-310 0 1.43279037e-322 1.08694442e-322 1.43279037e-322 1.08694442e-322 0
[1]    3500332 illegal hardware instruction (core dumped)

Each txl op fetches four values, which means this is the result from the first instruction, but the second one isn’t reached before the crash. Unsurprisingly, this is also the cube sampling instruction, which makes sense given that all the crashes of this type I get are from cube sampling tests.

Now that it’s been determined the second txl is causing the crash, the next question is whether the construction of that sampling op is at fault rather than the op itself. That’s easy to check by sticking some simple lp_build_printf("What am I doing with my life") calls in just before the op. Indeed, the printfs fire (and I’m still questioning the life choices that led me to this point), so the txl instruction itself is the problem.

Cube sampling has a lot of complex math involved for face selection, and I’ve spent a lot of time in there recently. My first guess was that the cube coordinates were bogus. Printing them yielded results:

Test case 'dEQP-GLES31.functional.program_uniform.by_pointer.render.basic_struct.sampler2D_samplerCube_vertex'..
texel 6.9008994e-310 0 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617
texel 6.9008994e-310 0 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617
texel 6.9008994e-310 0 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617
texel 6.9008994e-310 0 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617
cubecoords nan nan nan nan nan nan nan nan
cubecoords nan nan nan nan nan nan nan nan

These cube coords have more NaNs than a 1960s Batman TV series, so it looks like I was right in my hunch. Printing the cube S-face value next yields more NaNs. My printf search continued a couple more iterations until I wound up at this function:

static LLVMValueRef
lp_build_cube_imapos(struct lp_build_context *coord_bld, LLVMValueRef coord)
{
   /* ima = +0.5 / abs(coord); */
   LLVMValueRef posHalf = lp_build_const_vec(coord_bld->gallivm, coord_bld->type, 0.5);
   LLVMValueRef absCoord = lp_build_abs(coord_bld, coord);
   LLVMValueRef ima = lp_build_div(coord_bld, posHalf, absCoord);
   return ima;
}

Immediately, all of us multiverse-brain engineers spot something suspicious: this has a division operation with a user-provided divisor. Printing absCoord here yielded all zeroes, which was about where my remaining energy was at this Friday morning, so I mangled the code slightly:

static LLVMValueRef
lp_build_cube_imapos(struct lp_build_context *coord_bld, LLVMValueRef coord)
{
   /* ima = +0.5 / abs(coord); */
   LLVMValueRef posHalf = lp_build_const_vec(coord_bld->gallivm, coord_bld->type, 0.5);
   LLVMValueRef absCoord = lp_build_abs(coord_bld, coord);
   /* avoid div by zero */
   LLVMValueRef sel = lp_build_cmp(coord_bld, PIPE_FUNC_GREATER, absCoord, coord_bld->zero);
   LLVMValueRef div = lp_build_div(coord_bld, posHalf, absCoord);
   LLVMValueRef ima = lp_build_select(coord_bld, sel, div, coord_bld->zero);
   return ima;
}

And blammo, now that Gallivm could no longer divide by zero, the test passed. And so did a lot of others.
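
To see why a zero coordinate blows up here, the arithmetic can be modeled per lane in plain C (hypothetical scalar names for illustration; Gallivm actually emits vectorized LLVM IR, not scalar C):

```c
#include <math.h>

/* Scalar model of what lp_build_cube_imapos computed per SIMD lane.
 * With coord == 0.0, 0.5 / fabsf(coord) is +Inf; the face-selection
 * math later multiplies ima by a coordinate, and Inf * 0 is NaN --
 * exactly the wall of NaN cube coords seen in the debug output. */
static float imapos_unguarded(float coord)
{
    return 0.5f / fabsf(coord);
}

/* Guarded version, mirroring the cmp/select fix above:
 * select 0 when abs(coord) is not greater than zero. */
static float imapos_guarded(float coord)
{
    float abs_coord = fabsf(coord);
    return (abs_coord > 0.0f) ? 0.5f / abs_coord : 0.0f;
}
```

The select doesn’t need to produce a meaningful value for the zero case; it only needs to keep Inf and NaN out of the subsequent math.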

Progress

There’s been some speculation about how close Zink really is to being “useful”, where “useful” is determined by the majesty of passing GL4.6 CTS.

So how close is it? The answer might shock you.

Remaining Lavapipe Fails: 17

  • KHR-GL46.gpu_shader_fp64.builtin.mod_dvec2,Fail
  • KHR-GL46.gpu_shader_fp64.builtin.mod_dvec3,Fail
  • KHR-GL46.gpu_shader_fp64.builtin.mod_dvec4,Fail
  • KHR-GL46.pipeline_statistics_query_tests_ARB.functional_primitives_vertices_submitted_and_clipping_input_output_primitives,Fail
  • KHR-GL46.tessellation_shader.single.isolines_tessellation,Fail
  • KHR-GL46.tessellation_shader.tessellation_control_to_tessellation_evaluation.data_pass_through,Fail
  • KHR-GL46.tessellation_shader.tessellation_invariance.invariance_rule3,Fail
  • KHR-GL46.tessellation_shader.tessellation_shader_point_mode.points_verification,Fail
  • KHR-GL46.tessellation_shader.tessellation_shader_quads_tessellation.degenerate_case,Fail
  • KHR-GL46.tessellation_shader.tessellation_shader_tessellation.gl_InvocationID_PatchVerticesIn_PrimitiveID,Fail
  • KHR-GL46.tessellation_shader.vertex.vertex_spacing,Fail
  • KHR-GL46.texture_barrier.disjoint-texels,Fail
  • KHR-GL46.texture_barrier.overlapping-texels,Fail
  • KHR-GL46.texture_barrier_ARB.disjoint-texels,Fail
  • KHR-GL46.texture_barrier_ARB.overlapping-texels,Fail
  • KHR-GL46.texture_swizzle.functional,Fail
  • KHR-GL46.tessellation_shader.tessellation_shader_quads_tessellation.inner_tessellation_level_rounding,Crash

Remaining ANV Fails (Icelake): 9

  • KHR-GL46.pipeline_statistics_query_tests_ARB.functional_primitives_vertices_submitted_and_clipping_input_output_primitives,Fail
  • KHR-GL46.tessellation_shader.single.isolines_tessellation,Fail
  • KHR-GL46.tessellation_shader.tessellation_control_to_tessellation_evaluation.data_pass_through,Fail
  • KHR-GL46.tessellation_shader.tessellation_invariance.invariance_rule3,Fail
  • KHR-GL46.tessellation_shader.tessellation_shader_point_mode.points_verification,Fail
  • KHR-GL46.tessellation_shader.tessellation_shader_quads_tessellation.degenerate_case,Fail
  • KHR-GL46.tessellation_shader.tessellation_shader_quads_tessellation.inner_tessellation_level_rounding,Fail
  • KHR-GL46.tessellation_shader.tessellation_shader_tessellation.gl_InvocationID_PatchVerticesIn_PrimitiveID,Fail
  • KHR-GL46.tessellation_shader.vertex.vertex_spacing,Fail

Big Triangle better keep a careful eye on us now.

February 17, 2022

Around 2 years ago, while I was working on tessellation support for llvmpipe and running the heaven benchmark on my Ryzen, I noticed that heaven, despite running slowly, wasn't saturating all the cores. I dug in a bit and found that llvmpipe, despite threading the rasterization, fragment shading and blending stages, never did anything else while those were happening.

I dug into the code as I clearly remembered seeing a concept of a "scene" where all the primitives were binned into and then dispatched. It turned out the "scene" was always executed synchronously.

At the time I wrote support to allow multiple scenes to exist, so that while one scene was executing, the vertex shading and binning for the next scene could proceed and be queued up. For heaven at the time I saw some places where it would build 36 scenes. However, heaven was still 1fps with tessellation, regressions in other areas were rampant, and I mostly left the patches in a branch.

The reason so many things were broken by the patches was that large parts of llvmpipe, and also lavapipe, weren't ready for asynchronous pipeline processing. The concept of a fence after the pipeline finished was there, but wasn't used properly everywhere. A lot of operations assumed there was nothing going on behind the scenes, so they never fenced. Things like queries broke due to the fact that a query would always be ready in the old model, but now query availability could return unavailable, like a real hardware driver. Resource tracking existed but was incomplete, so knowing when to flush wasn't always accurate. Presentation was broken due to incorrect waiting, both for GL and Lavapipe. Lavapipe needed semaphore support that actually did something, since apps use semaphores between the render and present pieces of the pipeline.
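
The fence concept at the heart of this can be sketched in C (hypothetical names, not llvmpipe's actual API): once scenes execute asynchronously, a completion fence must support both a non-blocking poll, so a query can honestly report "unavailable", and a blocking wait, so presentation doesn't read an unfinished framebuffer.

```c
#include <pthread.h>
#include <stdbool.h>

/* Minimal sketch of a scene-completion fence. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  signalled;
    bool            done;
} fence_t;

void fence_init(fence_t *f)
{
    pthread_mutex_init(&f->lock, NULL);
    pthread_cond_init(&f->signalled, NULL);
    f->done = false;
}

/* Called by the rasterizer thread when the scene finishes. */
void fence_signal(fence_t *f)
{
    pthread_mutex_lock(&f->lock);
    f->done = true;
    pthread_cond_broadcast(&f->signalled);
    pthread_mutex_unlock(&f->lock);
}

/* Non-blocking check: this is what lets query availability return
 * "unavailable" instead of always being ready. */
bool fence_is_signalled(fence_t *f)
{
    pthread_mutex_lock(&f->lock);
    bool done = f->done;
    pthread_mutex_unlock(&f->lock);
    return done;
}

/* Blocking wait, used e.g. before presentation. */
void fence_wait(fence_t *f)
{
    pthread_mutex_lock(&f->lock);
    while (!f->done)
        pthread_cond_wait(&f->signalled, &f->lock);
    pthread_mutex_unlock(&f->lock);
}
```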

Mesa CI recently got some paraview traces added to it, and I was doing some perf traces with them. Paraview is a data visualization tool, and it generates vertex heavy workloads, as opposed to compositors and even games. It turned out binning was most of the overhead, and I realized the overlapping series could help this sort of workload. I dusted off the patch series and nailed down all the issues.

Emma Anholt ran some benchmarks on the results with the paraview traces and got

  • pv-waveletvolume fps +13.9279% +/- 4.91667% (n=15)
  • pv-waveletcountour fps +67.8306% +/- 11.4762% (n=3)

which seems like a good return on the investment.

I've got it all lined up in a merge request and it doesn't break CI anymore, so hopefully I can get it landed in the near future, once I clean up any miscellaneous bits.

February 16, 2022

Earlier this week, Neil McGovern announced that he is due to step down as the Executive Director of the GNOME Foundation later this year. As the President of the board and Neil’s effective manager together with the Executive Committee, I wanted to take a moment to reflect on his achievements in the past 5 years and explain a little about what the next steps would be.

Since joining in 2017, Neil has overseen a productive period of growth and maturity for the Foundation, increasing our influence both within the GNOME project and the wider Free and Open Source Software community. Here’s a few highlights of what he’s achieved together with the Foundation team and the community:

  • Improved public perception of GNOME as a desktop and GTK as a development platform, helping to align interests between key contributors and wider ecosystem stakeholders and establishing an ongoing collaboration with KDE around the Linux App Summit.
  • Worked with the board to improve the maturity of the board itself and allow it to work at a more strategic level, instigating staggered two-year terms for directors providing much-needed stability, and established the Executive and Finance committees to handle specific topics and the Governance committees to take a longer-term look at the board’s composition and capabilities.
  • Arranged 3 major grants to the Foundation totaling $2M and raised a further $250k through targeted fundraising initiatives.
  • Grown the Foundation team to its largest ever size, investing in staff development, and established ongoing direct contributions to GNOME, GTK and Flathub by Foundation staff and contractors.
  • Launched and incubated Flathub as an inclusive and sustainable ecosystem for Linux app developers to engage directly with their users, and delivered the Community Engagement Challenge to invest in the sustainability of our contributor base – the Foundation’s largest and most substantial programs outside of GNOME itself since Outreachy.
  • Achieved a fantastic resolution for GNOME and the wider community, by negotiating a settlement which protects FOSS developers from patent enforcement by the Rothschild group of non-practicing entities.
  • Stood for a diverse and inclusive Foundation, implementing a code of conduct for GNOME events and online spaces, establishing our first code of conduct committee and updating the bylaws to be gender-neutral.
  • Established the GNOME Circle program together with the board, broadening the membership base of the foundation by welcoming app and library developers from the wider ecosystem.

Recognizing and appreciating the amazing progress that GNOME has made with Neil’s support, the search for a new Executive Director provides the opportunity for the Foundation board to set the agenda and next high-level goals we’d like to achieve together with our new Executive Director.

In terms of the desktop, applications, technology, design and development processes, whilst there are always improvements to be made, the board’s general feeling is that thanks to the work of our amazing community of contributors, GNOME is doing very well in terms of what we produce and publish. Recent desktop releases have looked great, highly polished and well-received, and the application ecosystem is growing and improving through new developers and applications bringing great energy at the moment. From here, our largest opportunity in terms of growing the community and our user base is being able to articulate the benefits of what we’ve produced to a wider public audience, and deliver impact which allows us to secure and grow new and sustainable sources of funding.

For individuals, we are able to offer an exceedingly high quality desktop experience and a broad range of powerful applications which are affordable to all, backed by a nonprofit which can be trusted to look after your data, digital security and your best interests as an individual. From the perspective of being a public charity in the US, we also have the opportunity to establish programs that draw upon our community, technology and products to deliver impact such as developing employable skills, incubating new Open Source contributors, learning to program and more.

For our next Executive Director, we will be looking for an individual with existing experience in that nonprofit landscape, ideally with prior experience establishing and raising funds for programs that deliver impact through technology, and appreciation for the values that bring people to Free, Open Source and other Open Culture organizations. Working closely with the existing members, contributors, volunteers and whole GNOME community, and managing our relationships with the Advisory Board and other key partners, we hope to find a candidate that can build public awareness and help people learn about, use and benefit from what GNOME has built over the past two decades.

Neil has agreed to stay in his position for a 6 month transition period, during which he will support the board in our search for a new Executive Director and support a smooth hand-over. Over the coming weeks we will publish the job description for the new ED, and establish a search committee who will be responsible for sourcing and interviewing candidates to make a recommendation to the board for Neil’s successor – a hard act to follow!

I’m confident the community will join me and the board in personally thanking Neil for his 5 years of dedicated service in support of GNOME and the Foundation. Should you have any queries regarding the process, or offers of assistance in the coming hiring process, please don’t hesitate to join the discussion or reach out directly to the board.

February 15, 2022

After roughly 20 years and release numbers counting up to 0.40, I've decided to call the next version of the xf86-input-wacom driver the 1.0 release. [1] This cycle has seen a lot of development (>180 patches), roughly as much as the last 12 releases together. None of these patches actually added user-visible features, so let's talk about technical debt and what turned out to be an interesting way of reducing it.

The wacom driver's git history goes back to 2002 and the current batch of maintainers (Ping, Jason and I) have all been working on it for one to two decades. It used to be a Wacom-only driver but with the improvements made to the kernel over the years the driver should work with most tablets that have a kernel driver, albeit some of the more quirky niche features will be more limited (but your non-Wacom devices probably don't have those features anyway).

The one constant was always: the driver was extremely difficult to test, something common to all X input drivers. Development is a cycle of restarting the X server a billion times; testing is mostly plugging hardware in and moving things around in the hope that you can spot the bugs. On a driver that doesn't move much, this isn't necessarily a problem. Until a bug comes along that requires some core rework of the event handling - in the kernel, libinput and, yes, the wacom driver.

After years of libinput development, I wasn't really in the mood for the whole "plug every tablet in and test it, for every commit". In a rather caffeine-driven development cycle [2], the driver was separated into two logical entities: the core driver and the "frontend". The default frontend is the X11 one which is now a relatively thin layer around the core driver parts, primarily to translate events into the X Server's API. So, not unlike libinput + xf86-input-libinput in terms of architecture. In ascii-art:


                   +---------------------+      | big giant |
/dev/input/event0->| core driver  | x11  |----->| X server  |
                   +---------------------+      | process   |

Now, that logical separation means we can have another frontend which I implemented as a relatively light GObject wrapper and is now a library creatively called libgwacom:



                   +------------------------+
/dev/input/event0->| core driver  | gwacom  |---> tools or test suites
                   +------------------------+

This isn't a public library or API and it's very much focused on the needs of the X driver, so there are some peculiarities in there. What it enables, though, is a new wacom-record tool that can hook onto event nodes and print the events as they come out of the driver. So instead of having to restart X and move and click things, you get this:

$ ./builddir/wacom-record
wacom-record:
  version: 0.99.2
  git: xf86-input-wacom-0.99.2-17-g404dfd5a
  device:
    path: /dev/input/event6
    name: "Wacom Intuos Pro M Pen"
  events:
  - source: 0
    event: new-device
    name: "Wacom Intuos Pro M Pen"
    type: stylus
    capabilities:
      keys: true
      is-absolute: true
      is-direct-touch: false
      ntouches: 0
      naxes: 6
      axes:
      - {type: x        , range: [    0, 44800], resolution: 200000}
      - {type: y        , range: [    0, 29600], resolution: 200000}
      - {type: pressure , range: [    0, 65536], resolution: 0}
      - {type: tilt_x   , range: [  -64,    63], resolution: 57}
      - {type: tilt_y   , range: [  -64,    63], resolution: 57}
      - {type: wheel    , range: [ -900,   899], resolution: 0}
  ...
  - source: 0
    mode: absolute
    event: motion
    mask: [ "x", "y", "pressure", "tilt-x", "tilt-y", "wheel" ]
    axes: { x: 28066, y: 17643, pressure: 0, tilt: [ -4, 56], rotation: 0, throttle: 0, wheel: -108, rings: [ 0, 0] }

This is YAML, which means we can process the output for comparison or just to search for things.

A tool to quickly analyse data makes for faster development iterations, but it's still a far cry from reliable regression testing (and writing a test suite is a daunting task at best). One nice thing about GObject, though, is that it's accessible from other languages, including Python. So our test suite can be in Python, using pytest and all its capabilities, plus all the advantages Python has over C. Most driver testing comes down to: create a uinput device, set up the driver with some options, push events through that device and verify they come out of the driver in the right sequence and format. I don't need C for that. So there's a pull request sitting out there doing exactly that - adding a pytest test suite for a 20-year-old X driver written in C. That this is a) possible and b) a lot less work than expected got me quite unreasonably excited. If you do have to maintain an old C library, maybe consider whether it's possible to do the same, because there's nothing like the warm fuzzy feeling a green tick on a CI pipeline gives you.

[1] As scholars of version numbers know, they make as much sense as your stereotypical uncle's facebook opinion, so why not.
[2] The Colombian GDP probably went up a bit

February 12, 2022

FOSDEM 2022 took place this past weekend, on February 5th and 6th. It was a virtual event for the second year in a row, but this year the Graphics devroom made a comeback and I participated in it with a talk titled “Fun with border colors in Vulkan”. In the talk, I explained the context and origins behind the VK_EXT_border_color_swizzle extension that was published last year and in which I’m listed as one of the contributors.

Big kudos and a big thank you to the FOSDEM organizers one more year. FOSDEM is arguably the most important free and open source software conference in Europe and one of the most important FOSS conferences in the world. It’s run entirely by volunteers, doing an incredible amount of work that makes it possible to have hundreds of talks and dozens of different devrooms in the span of two days. Special thanks to the Graphics devroom organizers.

For the virtual setup, FOSDEM relied on Matrix once more. That's great because at Igalia we also use Matrix for our internal communications and, thanks to the federated nature of the service, I could join the FOSDEM virtual rooms using the same interface, client and account I normally use for work. The FOSDEM organizers also let participants create ad-hoc accounts to join the conference, in case they didn't have a Matrix account already. Thanks to Matrix widgets, each virtual devroom had its corresponding video stream embedded in it (the streams could also be watched freely on the FOSDEM site), so participants wanting to watch the talks and ask questions had everything on a single page.

Talks were pre-recorded and submitted in advance, played at the scheduled times, and Jitsi was used for post-talk Q&A sessions, in which moderators and devroom organizers read aloud the most voted questions in the devroom chat.

Of course, a conference this big is not without its glitches. Video feeds from presenters and moderators were sometimes cut automatically by Jitsi allegedly due to insufficient bandwidth. It also happened to me during my Q&A section while I was using a wired connection on a 300 Mbps symmetric FTTH line. I can only suppose the pipe was not wide enough on the other end to handle dozens of streams at the same time, or Jitsi was playing games as it sometimes does. In any case, audio was flawless.

In addition, some of the pre-recorded videos could not be played at the scheduled time, resulting in a black screen with no sound, due to an apparent bug in the video system. It’s worth noting all pre-recorded talks had been submitted, processed and reviewed prior to the conference, so this was an unexpected problem. This happened with my talk and I had to do the presentation live. Fortunately, I had written a script for the talk and could use it to deliver it without issues by sharing my screen with the slides over Jitsi.

Finally, a possible improvement point for future virtual or mixed editions is the fact that the deadline for submitting talk videos was only communicated directly and prominently by email on the day the deadline ended, a couple of weeks before the conference. It was also mentioned in the presenter’s guide that was linked in a previous email message, but an explicit warning a few days or a week before the deadline would have been useful to avoid last-minute rushes and delays submitting talks.

In any case, those small problems don’t take away the great online-only experience we had this year.

Transcription

Another advantage of having a written script for the talk is that I can use it to provide a pseudo-transcription of its contents for those that prefer not to watch a video or are unable to do so. I’ve also summed up the Q&A section at the end below. The slides are available as an attachment in the talk page.

Enjoy and see you next year, hopefully in Brussels this time.

Slide 1 (Talk cover)

Hello, my name is Ricardo Garcia. I work at Igalia as part of its Graphics Team, where I mostly work on the CTS project creating new Vulkan tests and fixing existing ones. Sometimes this means I also contribute to the specification text and other pieces of the Vulkan ecosystem.

Today I’m going to talk about the story behind the “border color swizzle” extension that was published last year. I created tests for this one and I also participated in its release process, so I’m listed as one of the contributors.

Slide 2 (Sampling in Vulkan)

I’ve already started mentioning border colors, so before we dive directly into the extension let me give you a brief introduction to sampling operations in Vulkan and explain where border colors fit in that.

Sampling means reading pixels from an image view and is typically done in the fragment shader, for example to apply a texture to some geometry.

In the example you see here, we have an image view with 3 8-bit color components in BGR order and in unsigned normalized format. This means we’ll suppose each image pixel is stored in memory using 3 bytes, with each byte corresponding to the blue, green and red components in that order.

However, when we read pixels from that image view, we want to get back normalized floating point values between 0 (for the lowest value) and 1 (for the highest value, i.e. when all bits are 1 and the natural number in memory is 255).

As you can see in the GLSL code, the result of the operation is a vector of 4 floating point numbers. Since the image does not have alpha information, it’s natural to think the output vector may have a 1 in the last component, making the color opaque.

If the coordinates of the sample operation make us read the pixel represented there, we would get the values you see on the right.

It’s also worth noting the sampler argument is a combination of two objects in Vulkan: an image view and a sampler object that specifies how sampling is done.
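
The unorm conversion described above can be sketched in C (hypothetical names; real drivers do this in generated code or hardware):

```c
/* A B8G8R8 unorm texel in memory becomes a vec4 of floats in [0, 1],
 * with alpha forced to 1 because the format has no alpha component. */
typedef struct { float r, g, b, a; } vec4;

static vec4 fetch_b8g8r8_unorm(const unsigned char texel[3])
{
    vec4 out;
    out.b = texel[0] / 255.0f;  /* first byte is blue   */
    out.g = texel[1] / 255.0f;  /* second byte is green */
    out.r = texel[2] / 255.0f;  /* third byte is red    */
    out.a = 1.0f;               /* no alpha in format: opaque */
    return out;
}
```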

Slide 3 (Normalized Coordinates)

Focusing a bit on the coordinates used to sample from the image, the most common case is using normalized coordinates, which means using floating point values between 0 and 1 in each of the image axis, like the 2D case you see on the right.

But, what happens if the coordinates fall outside that range? That means sampling outside the original image, in points around it like the red marks you see on the right.

That depends on how the sampler is configured. When creating it, we can specify a so-called “address mode” independently for each of the 3 texture coordinate axis that may be used (2 in our example).

Slide 4 (Address Mode)

There are several possible address modes. The most common one is probably the one you see here on the bottom left, which is the repeat addressing mode, which applies some kind of module operation to the coordinates as if the texture was virtually repeating in the selected axis.

There’s also the clamp mode on the top right, for example, which clamps coordinates to 0 and 1 and produces the effect of the texture borders extending beyond the image edge.

The case we’re interested in is the one on the top left, which is the border mode. When sampling outside we get a border color, as if the image was surrounded by a virtually infinite frame of a chosen color.

Slide 5 (Border Color)

The border color is specified when creating the sampler, and initially could only be chosen among a restricted set of values: transparent black (all zeros), opaque white (all ones) or the “special” opaque black color, which has a zero in all color components and a 1 in the alpha component.

The “custom border color” extension introduced the possibility of specifying arbitrary RGBA colors when creating the sampler.

Slide 6 (Image View Swizzle)

However, sampling operations are also affected by one parameter that’s not part of the sampler object. It’s part of the image view and it’s called the component swizzle.

In the example I gave you before we got some color values back, but that was supposing the component swizzle was the identity swizzle (i.e. color components were not reordered or replaced).

It’s possible, however, to specify other swizzles indicating what the resulting final color should be for each of the 4 components: you can reorder the components arbitrarily (e.g. saying the red component should actually come from the original blue one), you can force some of them to be zero or one, you can replicate one of the original components in multiple positions of the final color, etc. It’s a very flexible operation.

Slide 7 (Border Color and Swizzle pt. 1)

While working on the Zink Mesa driver, Mike discovered that the interaction between non-identity swizzle and custom border colors produced different results for different implementations, and wondered if the result was specified at all.

Slide 8 (Border Color and Swizzle pt. 2)

Let me give you an example: you specify a custom border color of 0, 0, 1, 1 (opaque blue) and an addressing mode of clamping to border in the sampler.

The image view has this strange swizzle in which the red component should come from the original blue, the green component is always zero, the blue component comes from the original green and the alpha component is not modified.

If the swizzle applies to the border color you get red. If it does not, you get blue.

Either option is reasonable: if the border color is specified as part of the sampler, maybe you want to get that color no matter which image view you use that sampler on, and expect to always get a blue border.

If the border color is supposed to act as if it came from the original image, it should be affected by the swizzle as the normal pixels are and you’d get red.

Slide 9 (Border Color and Swizzle pt. 3)

Jason pointed out that the spec laid out the rules in a section called “Texel Input Operations”, which specifies that swizzling should affect border colors. According to the spec, non-identity swizzles could be applied to custom border colors without restrictions, contrary to “opaque black”, which was considered a special value: non-identity swizzles would result in undefined values with that border.

Slide 10 (Texel Input Operations)

The Texel Input Operations spec section describes the expected result according to some steps which are supposed to happen in a defined order. That doesn’t mean the hardware has to work like this: an implementation may need instructions before or after the hardware sampling operation to simulate that things happen in the order described there.

I’ve simplified and removed some of the steps but if border color needs to be applied we’re interested in the steps we can see in bold, and step 5 (border color applied) comes before step 7 (applying the image view swizzle).

I’ll describe the steps with a bit more detail now.

Slide 11 (Coordinate Conversion)

Step 1 is coordinate conversion: this includes converting normalized coordinates to integer texel coordinates for the image view and clamping and modifying those values depending on the addressing mode.
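The first part of that step, assuming nearest filtering, can be sketched as follows. A normalized coordinate (0..1 across the image) becomes an integer texel coordinate, and out-of-range results are then handled by the addressing mode; the helper name is illustrative, not the spec's pseudocode:

```c
/* Convert a normalized coordinate to an integer texel coordinate
 * along an axis of `size` texels (floor, handling negatives too). */
static int texel_coord(float u, int size)
{
    float scaled = u * (float)size;
    int i = (int)scaled;                  /* truncates toward zero */
    if (scaled < 0.0f && scaled != (float)i)
        i -= 1;                           /* proper floor() for negatives */
    return i;
}
```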

Slide 12 (Coordinate Validation)

Once that is done, step 2 is validating the coordinates. Here, we’ll decide if texel replacement takes place or not, which may imply using the border color. In other sampling modes, robustness features will also be taken into account.

Slide 13 (Reading Texel from Image)

Step 3 happens when the coordinates are valid, and is reading the actual texel from the image. This implies reordering components from the in-memory layout to the standard RGBA layout, which means a BGR image view gets its components put in RGB order immediately after reading.

Slide 14 (Format Conversion)

Step 4 also applies if an actual texel was read from the image and is format conversion. For example, unsigned normalized formats need to convert pixel values (stored as natural numbers in memory) to floating point values.

Our example texel, already in RGB order, results in the values you see on the right.
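For an 8-bit UNORM component, that conversion is simply the stored integer divided by the maximum representable value, as in this small sketch:

```c
/* Convert an 8-bit unsigned-normalized component (0..255 in memory)
 * to a float in the 0.0..1.0 range. */
static float unorm8_to_float(unsigned char v)
{
    return (float)v / 255.0f;
}
```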

Slide 15 (Texel Replacement)

Step 5 is texel replacement, and is the alternative to the previous two steps when the coordinates were not valid. In the case of border colors, this means taking the border color and cutting it short so it only has the components present in the original image view, to act as if the border color was actually part of the image.

Because this happens after the color components have already been reordered, the border color is always specified in standard red, green, blue and alpha order when creating the sampler. The fact that the original image view was in BGR order is irrelevant for the border color. We care about the alpha component being missing, but not about the in-memory order of the image view.

Our transparent blue border is converted to just “blue” in this step.

Slide 16 (Expansion to RGBA)

Step 6 takes us back to a unified flow of steps: it applies to the color no matter where it came from. The color is expanded to always have 4 components as expected in the shader. Missing color components are replaced with zeros and the alpha component, if missing, is set to one.

Our original transparent blue border is now opaque blue.

Slide 17 (Component Swizzle)

Step 7: finally, the swizzle is applied. Let’s suppose our image view had a strange swizzle in which the red component is copied from the original blue, the green component is set to zero, the blue one is set to one and the alpha component is not modified.

Our original transparent blue border is now opaque magenta.
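Steps 5 through 7 for the border-color path can be condensed into one self-contained sketch. Types and names are local stand-ins, not the real Vulkan API:

```c
#include <stdbool.h>

typedef enum { SW_IDENTITY, SW_ZERO, SW_ONE, SW_R, SW_G, SW_B, SW_A } sw_t;

/* Simulate texel replacement, expansion to RGBA and the component
 * swizzle for a sample that fell on the border. `present` says which
 * components exist in the image view's format. */
static void sample_border(const float border[4], const bool present[4],
                          const sw_t sw[4], float out[4])
{
    float color[4];

    /* Step 5: texel replacement keeps only the components present in
     * the image view's format, as if the border were part of the image. */
    for (int i = 0; i < 4; i++)
        color[i] = present[i] ? border[i] : 0.0f;

    /* Step 6: expansion to RGBA; a missing alpha becomes 1. */
    if (!present[3])
        color[3] = 1.0f;

    /* Step 7: apply the image view's component swizzle. */
    for (int i = 0; i < 4; i++) {
        switch (sw[i]) {
        case SW_ZERO: out[i] = 0.0f;     break;
        case SW_ONE:  out[i] = 1.0f;     break;
        case SW_R:    out[i] = color[0]; break;
        case SW_G:    out[i] = color[1]; break;
        case SW_B:    out[i] = color[2]; break;
        case SW_A:    out[i] = color[3]; break;
        default:      out[i] = color[i]; break; /* identity */
        }
    }
}
```

Feeding it the talk's example, a transparent blue border (0, 0, 1, 0) with an alpha-less RGB view and the swizzle {B, ZERO, ONE, IDENTITY}, yields opaque magenta (1, 0, 1, 1).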

Slide 18 (VK_EXT_custom_border_color)

So we had this situation in which some implementations swizzled the border color and others did not. What could we do?

We could double down on the existing spec and ask vendors to fix their implementations, but what happens if they cannot fix them? Or if the fix is impractical due to its impact on performance?

Unfortunately, that was the actual situation: some implementations could not be fixed. After discovering this problem, CTS tests were going to be created for these cases. If an implementation failed to behave as mandated by the spec, it wouldn’t pass conformance, so those implementations had only one way out: stop supporting custom border colors. But that’s also a loss for users if those implementations are in widespread use (and they were).

The second option was to backpedal a bit: make the behavior undefined unless some other feature is present, and design a mechanism that would allow custom border colors to be used with non-identity swizzles in at least some of the implementations.

Slide 19 (VK_EXT_border_color_swizzle)

And that’s how the “border color swizzle” extension was created last year. Custom border colors with non-identity swizzles produced undefined results unless the borderColorSwizzle feature was available and enabled. Some implementations could advertise support for this almost “for free”, and others could advertise lack of support for it.

In the middle ground, some implementations can indicate they support the case, but the component swizzle has to be indicated when creating the sampler as well. So it’s both part of the image view and part of the sampler. Samplers created this way can only be used with image views having a matching component swizzle (which means they are no longer generic samplers).
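For that middle-ground case, VK_EXT_border_color_swizzle lets the image view's component mapping also be passed at sampler creation time. A rough sketch (only the extension-specific parts shown):

```c
#include <vulkan/vulkan.h>

/* Sketch: duplicating the image view's swizzle into the sampler for
 * implementations that report borderColorSwizzleFromImage as false.
 * Such a sampler can only be used with image views that have a
 * matching component mapping. */
VkSamplerBorderColorComponentMappingCreateInfoEXT border_swizzle = {
    .sType = VK_STRUCTURE_TYPE_SAMPLER_BORDER_COLOR_COMPONENT_MAPPING_CREATE_INFO_EXT,
    .components = {
        .r = VK_COMPONENT_SWIZZLE_B,
        .g = VK_COMPONENT_SWIZZLE_ZERO,
        .b = VK_COMPONENT_SWIZZLE_ONE,
        .a = VK_COMPONENT_SWIZZLE_IDENTITY,
    },
    .srgb = VK_FALSE, /* whether the matching image view uses an sRGB format */
};

VkSamplerCreateInfo sampler_info = {
    .sType = VK_STRUCTURE_TYPE_SAMPLER_CREATE_INFO,
    .pNext = &border_swizzle,
    /* ... border color and address modes as before ... */
};
```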

The drawback of this extension, apart from the obvious observation that it should’ve been part of the original custom border color extension, is that it somehow lowers the bar for applications that want to use a single code path for every vendor. If borderColorSwizzle is supported, it’s always legal to pass the swizzle when creating the sampler. Some implementations will need it and the rest can ignore it, so the unified code path is now harder or more specific.

And that’s basically it. Sometimes the Vulkan Working Group in Khronos has had to backpedal and mark as undefined something that previous versions of the Vulkan spec considered defined. It’s neither frequent nor ideal, but it happens. It usually does not go as far as publishing a new extension as part of the fix, though, which is why I considered this interesting.

Slide 20 (Questions?)

Thanks for watching! Let me know if you have any questions.

Q&A Section

Martin: The first question is from "ancurio" and he’s asking if swizzling is implemented in hardware.

Me: I don’t work on implementations so take my answer with a grain of salt. It’s my understanding you can usually program that in hardware and the hardware does the swizzling for you. There may be implementations which need to do the swizzling in software, emitting extra instructions.

Martin: Another question from "ancurio". When you said lowering the bar do you mean raising it?

I explain that, yes, I meant raising the bar for the application. (Note: what I meant is that the extension lowers the bar for the specification and API, in the sense that a more complicated solution has been accepted.)

Martin: "enunes" asks if this was originally motivated by some real application bug or by something like conformance tests/spec disambiguation?

I explain that both factors were involved. Mike found the problem while developing Zink, so a real application hit the problematic case, and then the Vulkan Working Group inside Khronos wanted to fix this, make the spec clear and provide a solution for apps that wanted to use non-identity swizzles with border colors, as was originally allowed.

Martin: no more questions in the room but I have one more for you. How was your experience dealing with Khronos coordinating with different vendors and figuring out what was the acceptable solution for everyone?

I explain that the main driver behind the extension in Khronos was Piers Daniell from NVIDIA (NB: listed as the extension author). I mention that my experience was positive, and that the Working Group is composed of people who are really interested in making a good specification and implementations that serve app developers. When this problem was detected I created some tests that worked as a poll to see which vendors could make this work easily, and what others might need to make it work, if at all. Then this was discussed in the Working Group, a solution was proposed (the extension), more vendors reviewed and commented on it, tests were adapted to the final solution, and finally the extension was published.

Martin: How long did this whole process take?

Me: A few months. Take into account that the Working Group does not meet every day, and they have a backlog of issues to discuss. Each of the previous steps takes several weeks, so you end up with a few months, which is not bad.

Martin: Not bad at all.

Me: Not at all, I think it works reasonably well.

Martin: Especially when you realize you broke something and the specification needs fixing. Definitely decent.

February 05, 2022

(I nearly went with clutterectomy, but that would be doing our old servant project a disservice.)

Yesterday, I finally merged the work-in-progress branch porting totem to GStreamer's GTK GL sink widget, undoing a lot of the work done in 2011 and 2014 to port the video widget and then to finally make use of its features.

But GTK has been modernised (in GTK3 but in GTK4 even more so), GStreamer grew a collection of GL plugins, Wayland and VA-API matured and clutter (and its siblings clutter-gtk, and clutter-gst) didn't get the resources they needed to follow.

A screenshot with practically no changes, as expected

The list of bug fixes and enhancements is substantial:

  • Makes some files that threw shader warnings playable
  • Fixes resize lag for the widgets embedded in the video widget
  • Fixes interactions with widgets on some HDR capable systems, or even widgets disappearing sometimes (!)
  • Gets rid of the floating blank windows under Wayland
  • Should help with tearing, although that's highly dependent on the system
  • Hi-DPI support
  • Hardware acceleration (through libva)

Until the port to GTK4, we expect an overall drop in performance on systems without VA-API support, and the GTK4 port should bring it on par with the fastest players available for GNOME.

You can install a Preview version right now by running:

$ flatpak install --user https://flathub.org/beta-repo/appstream/org.gnome.Totem.Devel.flatpakref

and filing bugs in the GNOME GitLab.

Next stop, a GTK4 port!

February 03, 2022

22.0

I always do one of these big roundups for each Mesa release, so here’s what you can expect to see from zink in the upcoming release:

  • fewer hangs on RADV
  • massively improved usability on NVIDIA
  • greatly improved performance with unsupported texture download formats (e.g., CS:GO, L4D2)
  • more extensions: ARB_sparse_texture, ARB_sparse_texture2, ARB_sparse_texture_clamp, EXT_memory_object, EXT_memory_object_fd, GL_EXT_semaphore, GL_EXT_semaphore_fd
  • ~1000% improved glxgears performance (be sure to run with --i-know-this-is-not-a-benchmark to see the real speed)
  • tons and tons and tons of bug fixes

All around looking like another great release.

I Hate gl_PointSize And So Can You

Yes, we’re here.

After literally years of awfulness, I’ve finally solved (for good) the debacle that is point size conversion from GL to Vulkan.

What’s so awful about it, you might be asking. How hard can it be to just add gl_PointSize to a shader, you follow up with as you push your glasses higher up your nose.

Allow me to explain.

In Vulkan, there is exactly one method for setting the size of points: the gl_PointSize shader output controls it, and that’s it.

In OpenGL (core profile):

  • 14.4 Points If program point size mode is enabled, the derived point size is taken from the (potentially clipped) shader built-in gl_PointSize written by the last vertex processing stage and clamped to the implementation-dependent point size range. If the value written to gl_PointSize is less than or equal to zero, or if no value was written to gl_PointSize, results are undefined. If program point size mode is disabled, the derived point size is specified with the command

    void PointSize( float size );

  • 11.2.3.4 Tessellation Evaluation Shader Outputs Tessellation evaluation shaders have a number of built-in output variables used to pass values to equivalent built-in input variables read by subsequent shader stages or to subsequent fixed functionality vertex processing pipeline stages. These variables are gl_Position, gl_PointSize, gl_ClipDistance, and gl_CullDistance, and all behave identically to equivalently named vertex shader outputs.
  • 11.3.4.5 Geometry Shader Outputs The built-in output gl_PointSize, if written, holds the size of the point to be rasterized, measured in pixels

In short, if PROGRAM_POINT_SIZE is enabled, then points are sized based on the gl_PointSize shader output of the last vertex stage.
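The desktop rule just summarized can be sketched as a tiny decision function. The names here are illustrative, not a real driver interface:

```c
#include <stdbool.h>

/* Where the rasterized point size comes from in desktop GL. With
 * PROGRAM_POINT_SIZE enabled the shader output wins (results are
 * undefined if the last vertex stage never wrote gl_PointSize, or
 * wrote a value <= 0); otherwise the glPointSize() state is used. */
static float derived_point_size(bool program_point_size_enabled,
                                float shader_point_size, /* gl_PointSize */
                                float point_size_state)  /* glPointSize() */
{
    return program_point_size_enabled ? shader_point_size
                                      : point_size_state;
}
```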

In OpenGL ES (versions 2.0, 3.0, 3.1):

  • (3.3 | 3.4 | 13.3) Points The point size is taken from the shader built-in gl_PointSize written by the vertex shader, and clamped to the implementation-dependent point size range.

In OpenGL ES (version 3.2):

  • 13.5 Points The point size is determined by the last vertex processing stage. If the last vertex processing stage is not a vertex shader, the point size is 1.0. If the last vertex processing stage is a vertex shader, the point size is taken from the shader built-in gl_PointSize written by the vertex shader, and is clamped to the implementation-dependent point size range.

Thus for an ES context, the point size always comes from the last vertex stage, which means it can be anything it wants to be if that stage is a vertex shader, and cannot be written at all in the other stages because it is not a valid output there (this last, bolded part is going to be really funny in a minute or two).

What do the specs agree on?

  • If a vertex shader is the last vertex stage, it can write gl_PointSize

Literally that’s it.

Awesome.

Zink

As we know, Vulkan has a very simple and clearly defined model for point size:

The point size is taken from the (potentially clipped) shader built-in PointSize written by:
• the geometry shader, if active;
• the tessellation evaluation shader, if active and no geometry shader is active;
• the vertex shader, otherwise
- 27.10. Points

It really can be that simple.

So one would think that we can just hook up some conditionals based on the GL rules and then export the correct value.

That would be easy.

Simple.

It would make sense.

HAHA

hahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahaha
hahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahaha

XFB

It gets worse (obviously).

gl_PointSize is a valid XFB varying, which means it must be exported correctly to the transform feedback buffer. For the ES case, it’s simple, but for desktop GL, there’s a little something called PROGRAM_POINT_SIZE state which totally fucks that up. Because, as we know, Vulkan has exactly one way of setting point size, and it’s the shader variable.

Thus, if there is a desktop GL context using a vertex shader as its last vertex stage for a draw, and if that shader has its own gl_PointSize value, this value must be exported for XFB.

But not used for point rasterization.

It’s Actually Even Worse Than That

…Because in order to pass CTS for ES 3.2, your implementation also has to be able to violate spec.

Remember above when I said it was going to be funny that gl_PointSize is not a legal output for non-vertex stages in ES contexts?

CTS explicitly has “wide points” tests which verify illegal point sizes that are exported by the tessellation and geometry shader stages. Isn’t that cool?

Also, let’s be reasonable people for a moment, who actually wants a point that’s just one pixel? Nobody can see that on their 8k display.

To Sum Up

I hate GL point size, and so should you.

February 02, 2022

Checking In

I keep meaning to blog, but then I get sidetracked by not blogging.

Truly a tough life.

So what’s new in zink-land?

Nothing too exciting. Mostly bug fixes. I managed to sneak ARB_sparse_texture_clamp in for zink just before the branchpoint, so all the sparse texturing features supported by Mesa will be supported by zink. But only on NVIDIA since they’re the only driver that fully supports Vulkan sparse texturing.

The past couple days I’ve been doing some truly awful things with gl_PointSize to try and make this conformant for all possible cases. It’s a real debacle, and I’ll probably post more in-depth about it so everyone can get a good chuckle.

The one unusual part of my daily routine is that I haven’t rebased my testing branch in at least a couple weeks now since I’ve been trying to iron out regressions. Will I find that everything crashes and fails as soon as I do?

Probably.

More posts to come.

February 01, 2022

There was an article on Open for Everyone today about Nobara, a Fedora-based distribution optimized for gaming. Let me say up front that I have no beef with Tomas Crider or any other creator/maintainer of a distribution targeting a specific use case. In fact they are usually trying to solve or work around real problems and make things easier for people. That said, I have for years felt that the need for these things is a failing in itself, and it has been a goal for me in the context of Fedora Workstation to figure out what we can do to remove the need for ‘usecase distros’. So I thought it would be of interest to talk a bit about how I have been viewing these things and the concrete efforts we have made to reduce the need for usecase-oriented distributions. It is worth noting that the usecase distributions have of course proven useful for this too, in the sense that they to some degree also function as a very detailed ‘bug report’ for why the general-case OS is not enough.
Before I start, you might say, but isn’t Fedora Workstation as usecase OS too? You often talk about having a developer focus? Yes, developers are something we care deeply about, but for instance that doesn’t mean we pre-install 50 IDEs in Fedora Workstation. Fedora Workstation should be a great general purpose OS out of the box and then we should have tools like GNOME Software and Toolbx available to let you quickly and easily tweak it into your ideal development system. But at the same time by being a general purpose OS at heart, it should be equally easy to install Steam and Lutris to start gaming or install Carla and Ardour to start doing audio production. Or install OBS Studio to do video streaming.

Looking back over the years, one of the first conclusions I drew from looking at all the usecase distributions out there was that they were often mostly the standard distro, but with a carefully procured list of pre-installed software; for instance, the old Fedora game spin was exactly that, a copy of Fedora with a lot of games pre-installed. So why was this valuable to people? Those of us who have been around for a while remember that the average Linux ‘app store’ was a very basic GUI which listed available software by name (usually quite cryptic names), at best with a small icon. There was almost no other metadata available and search functionality was limited at best. So finding software was not simple, and it was usually more of a ‘search the internet and if you find something interesting see if it’s packaged for your distro’ affair. The usecase distros which focused on procured pre-installed software, be that games, pro-audio software, graphics tools or whatever their focus was, were basically responding to the fact that finding software was non-trivial, and a lot of people probably missed out on software that could have been useful to them because they simply never learned of its existence.
So when we kicked off the creation of GNOME Software, one of the big focuses early on was to create a system for providing good metadata and displaying that metadata in a useful manner. As an end user, the most obvious change was of course the richer UI of GNOME Software, but maybe just as important was the creation of AppStream, a specification for applications to ship metadata that allows GNOME Software and others to display much more in-depth information about the application, provide screenshots and so on.

So I do believe that, between the work on GNOME Software as the actual UI and the work with many stakeholders in the Linux ecosystem to define metadata standards like AppStream, we made software a lot more discoverable on Linux and thus reduced the need for pre-loading significantly. This work also provided an important baseline for things like Flathub to thrive, as it then had a clear way to provide metadata about the applications it hosts.
We do continue to polish that user experience on an ongoing basis, but I do feel we reduced the need to pre-load a ton of software very significantly already with this.

Of course another aspect of this is application availability, which is why we worked to ensure things like Steam are available in GNOME Software on Fedora Workstation, and which we have now expanded on by starting to include more and more software listings from Flathub. These things make it easy for our users to find the software they want, while at the same time we are still staying true to our mission of only shipping free software by default in Fedora.

The second major reason for usecase distributions has been that the generic version of the OS didn’t really have the right settings or setup to handle an important usecase. I think pro-audio is the best example of this, where usecase distros like Fedora Jam or Ubuntu Studio popped up. Pre-installing a lot of relevant software was definitely part of their DNA too, but there were also other issues involved, like the need for a special audio setup with JACK and often also kernel real-time patches applied. When we decided to include pro-audio support in PipeWire, resolving these issues was a big part of it. I strongly believe that we should be able to provide a simple and good out-of-the-box experience for musicians and audio engineers on Linux without needing the OS to be specifically configured for the task. The strong and positive response we have gotten from the pro-audio community for PipeWire suggests to me that we are moving in the right direction there. I am not claiming things are 100% there yet, but we feel very confident that we will get there with PipeWire and make the pro-audio folks full-fledged members of the Fedora Workstation community. Interestingly, we also spent quite a bit of time trying to ensure the pro-audio tools in Fedora have proper AppStream metadata so that they appear in GNOME Software as part of this. One area we are still looking at is the real-time kernel stuff; our current take is that the remaining unmerged patches are not strictly needed anymore, as most of the important stuff has already been merged, but we are monitoring it as we keep developing and benchmarking PipeWire for the pro-audio usecase.

Another reason that I often saw driving the creation of a usecase distribution is special hardware support, and the hardware doesn't even need to be that special: the NVidia driver, for instance, has triggered a lot of these attempts. The NVidia driver is challenging on a lot of levels and has been something we have been constantly working on. There were technical issues, for instance, like the NVidia driver and Mesa fighting over who owned the OpenGL.so implementation, which we fixed by introducing glvnd a few years ago. But for a distro like Fedora that also cares deeply about free and open source software it also provided us with a lot of philosophical challenges. We had to answer the question of how we could, on one hand, make sure our users had easy access to the driver without, on the other, abandoning our principle of Fedora only shipping free software out of the box. I think we found a good compromise today where the NVidia driver is available in Fedora Workstation for easy install through GNOME Software, while we default to Nouveau out of the box. That said, this is a part of the story where we are still hard at work to improve things further, and while I am not at liberty to mention any details, I think I can at least mention that we are meeting with our engineering counterparts at NVidia on almost a weekly basis to discuss how to improve things, not just for graphics, but around compute and other shared areas of interest. The most recent public result of that collaboration was of course the XWayland support in recent NVidia drivers, but I promise you that this is something we keep focusing on, and I expect that we will be able to share more cool news and important progress over the course of the year, both for users of the NVidia binary driver and for users of Nouveau.

What are we still looking at in terms of addressing issues like this? Well, one thing we are talking about is whether there is value in, or need for, a facility to install specific software based on detected hardware or software. For instance, if we detect a high-end gaming mouse connected to your system, should we install Piper/ratbag, or at least make GNOME Software suggest it? And if we detect that you installed Lutris and Steam, are there other tools we should recommend you install, like the gamemode GNOME Shell extension? It is a somewhat hard question to answer, which is why we are still pondering it; on one side it seems like a nice addition, but such connections would mean maintaining a big database, which isn’t trivial, and having something running on your system to, let's say, check for those high-end mice adds a little overhead that might be a waste for many users.

Another area that we are looking at is the issue of codecs. We did a big effort a couple of years ago and got AC3, MP3, AAC and MPEG-2 video cleared for inclusion, and also got the OpenH264 implementation from Cisco made available. That solved a lot of issues, but today, with so many more people getting into media creation, I believe we need to take another stab at it and for instance try to get reliable hardware-accelerated encoding and decoding of video. I am not ready to announce anything, but we have a few ideas and leads we are looking at for how to move the needle there in a significant way.

So to summarize, I am not criticizing anyone for putting together what I call usecase distros, but at the same time I really want to get to a point where they are rarely needed, because we should be able to cater to most needs within the context of a general purpose Linux operating system. That said I do appreciate the effort of these distro makers both in terms of trying to help users have a better experience on linux and in indirectly helping us showcase both potential solutions or highlight the major pain points that still needs addressing in a general purpose Linux desktop operating system.

January 27, 2022

In defense of NIR

NIR has been an integral part of the Mesa driver stack for about six or seven years now (depending on how you count) and a lot has changed since NIR first landed at the end of 2014 and I wrote my initial NIR notes. Also, for various reasons, I’ve had to give my NIR elevator pitch a few times lately. I think it’s time for a new post. This time on why, after working on this mess for seven years, I still think NIR was the right call.

A bit of history

Shortly after I joined the Mesa team at Intel in the summer of 2014, I was sitting in the cube area asking Ken questions, trying to figure out how Mesa was put together, and I asked, “Why don’t you use LLVM?” Suddenly, all eyes turned towards Ken and myself and I realized I’d poked a bear. Ken calmly explained a bunch of the packaging/shipping issues around having your compiler in a different project as well as issues radeonsi had run into with apps bundling their own LLVM that didn’t work. But for the more technical question of whether or not it was a good idea, his answer was something about trade-offs and how it’s really not clear if LLVM would really gain them much.

That same summer, Connor Abbott showed up as our intern and started developing NIR. By the end of the summer, he had a bunch of data structures, a few mostly untested passes, and a validator. He also had most of a GLSL IR to NIR pass which mostly passed validation. Later that year, after Connor had gone off to school, I took over NIR, finished the Intel scalar back-end NIR consumer, fixed piles of bugs, and wrote out-of-SSA and a bunch of optimization passes to get it to the point where we could finally land it in the tree at the end of 2014. Initially, it was only a few Intel folks and Emma Anholt (Broadcom, at the time) who were all that interested in NIR. Today, it’s integral to the Mesa project and at the core of every driver that’s still seeing active development. Over the past seven years, we (the Mesa community) have poured thousands of man hours (probably millions of engineering dollars) into NIR, and it’s gone from something only capable of handling fragment shaders to supporting full Vulkan 1.2 plus ray-tracing (task and mesh are coming) along with OpenCL 1.2 compute.

Was it worth it? That’s the multi-million dollar (literally) question. 2014 was a simpler time. Compute shaders were still newish and people didn’t use them for much more than they would have used a fancy fragment shader for a couple of years earlier. More advanced features like Vulkan’s variable pointers weren’t even on the horizon. Had I known at the time how much work we’d have to put into NIR to keep up, I might have said, “Nah, this is too much effort; let’s just use LLVM.” If I had, I think I would have made the wrong call.

Distro and packaging issues

I’d like to get this one out of the way first because, while these issues are definitely real, it’s easily the least compelling reason to write a whole new piece of software. Having your compiler in a separate project, and in LLVM in particular, comes with an annoying set of problems.

First, there’s release cycles. Mesa releases on a rough 3-month cadence whereas LLVM releases on a 6-month cadence, and there’s nothing syncing the two release cycles. This means that any new feature in Mesa that requires new LLVM compiler work can’t be enabled until distros pick up a new LLVM. Not only does this make the question “what Mesa version has X?” unanswerable, it also means every one of these features needs conditional paths in the driver to be enabled or not depending on the LLVM version. Also, because we can’t guarantee which LLVM version a distro will choose to pair with any given Mesa version, radeonsi (the only LLVM-based hardware driver in Mesa) has to support the latest two releases of LLVM as well as tip-of-tree at all times. While this has certainly gotten better in recent years, it used to be that LLVM would switch around C++ data structures on you, requiring a bunch of wrapper classes in Mesa to deal with the mess. (They still reserve the right; it just happens less these days.)

Second is bug fixing. What do you do if there’s a compiler bug? You fix it in LLVM, of course, right? But what if the bug is in an old version of the AMD LLVM back-end and AMD’s LLVM people refuse to back-port the fix? You work around it in Mesa, of course! Yup, even though Mesa and LLVM are both open-source projects that theoretically have a stable bugfix release cycle, Mesa has to carry LLVM work-around patches because we can’t get the other team/project to back-port fixes. Things also get sticky whenever there’s a compiler bug which touches on the interface between the LLVM back-end compiler and the driver. How do you fix that in a backwards-compatible way? Sometimes, you don’t. Those interfaces can be absurdly subtle and complex, and sometimes the bug is in the interface itself, so you either have to fix it in LLVM tip-of-tree and work around it in Mesa for older versions, or you have to break backwards compatibility somewhere and hope users pick up the LLVM bug-fix release.

Third is that some games actually link against LLVM and, historically, LLVM hasn’t done well with two different versions of it loaded at the same time. Some of this is LLVM and some of it is the way C++ shared library loading is handled on Linux. I won’t get into all the details but the point is that there have been some games in the past which simply can’t run on radeonsi because of LLVM library version conflicts. Some of this could probably be solved if Mesa were linked against LLVM statically but distros tend to be pretty sour on static linking unless you have a really good reason. A closed-source game pulling in their own LLVM isn’t generally considered to be a good reason.

And that, in the words of Forrest Gump, is all I have to say about that.

A compiler built for GPUs

One of the key differences between NIR and LLVM is that NIR is a GPU-focused compiler whereas LLVM is CPU-focused. Yes, AMD has an upstream LLVM back-end for their GPU hardware, Intel likes to brag about their out-of-tree LLVM back-end, and many other vendors use it in their drivers as well even if their back-ends are closed-source and internal. However, none of that actually means that LLVM understands GPUs or is any good at compiling for them. Most HW vendors have made that choice because they needed LLVM for OpenCL support and they wanted a unified compiler, so they figured out how to make LLVM do graphics. It works, but that doesn’t mean it works well.

To demonstrate this, let’s look at the following GLSL shader I stole from the texelFetch piglit test:

#version 120

#extension GL_EXT_gpu_shader4: require
#define ivec1 int
flat varying ivec4 tc;
uniform vec4 divisor;
uniform sampler2D tex;
out vec4 fragColor;
void main()
{
    vec4 color = texelFetch2D(tex, ivec2(tc), tc.w);
    fragColor = color/divisor;
}

When compiled to NIR, this turns into

shader: MESA_SHADER_FRAGMENT
name: GLSL3
inputs: 1
outputs: 1
uniforms: 1
ubos: 1
shared: 0
decl_var uniform INTERP_MODE_NONE sampler2D tex (1, 0, 0)
decl_var ubo INTERP_MODE_NONE vec4[1] uniform_0 (0, 0, 0)
decl_function main (0 params)

impl main {
    block block_0:
    /* preds: */
    vec1 32 ssa_0 = load_const (0x00000000 /* 0.000000 */)
    vec3 32 ssa_1 = intrinsic load_input (ssa_0) (0, 0, 34, 160) /* base=0 */ /* component=0 */ /* dest_type=int32 */ /* location=32 slots=1 */
    vec1 32 ssa_2 = deref_var &tex (uniform sampler2D)
    vec2 32 ssa_3 = vec2 ssa_1.x, ssa_1.y
    vec1 32 ssa_4 = mov ssa_1.z
    vec4 32 ssa_5 = (float32)txf ssa_2 (texture_deref), ssa_2 (sampler_deref), ssa_3 (coord), ssa_4 (lod)
    vec4 32 ssa_6 = intrinsic load_ubo (ssa_0, ssa_0) (0, 1073741824, 0, 0, 16) /* access=0 */ /* align_mul=1073741824 */ /* align_offset=0 */ /* range_base=0 */ /* range=16 */
    vec1 32 ssa_7 = frcp ssa_6.x
    vec1 32 ssa_8 = frcp ssa_6.y
    vec1 32 ssa_9 = frcp ssa_6.z
    vec1 32 ssa_10 = frcp ssa_6.w
    vec1 32 ssa_11 = fmul ssa_5.x, ssa_7
    vec1 32 ssa_12 = fmul ssa_5.y, ssa_8
    vec1 32 ssa_13 = fmul ssa_5.z, ssa_9
    vec1 32 ssa_14 = fmul ssa_5.w, ssa_10
    vec4 32 ssa_15 = vec4 ssa_11, ssa_12, ssa_13, ssa_14
    intrinsic store_output (ssa_15, ssa_0) (0, 15, 0, 160, 132) /* base=0 */ /* wrmask=xyzw */ /* component=0 */ /* src_type=float32 */ /* location=4 slots=1 */
    /* succs: block_1 */
    block block_1:
}

Then, the AMD driver turns it into the following LLVM IR:

; ModuleID = 'mesa-shader'
source_filename = "mesa-shader"
target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7"
target triple = "amdgcn--"

define amdgpu_ps <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> @main(<4 x i32> addrspace(6)* inreg noalias align 4 dereferenceable(18446744073709551615) %0, <8 x i32> addrspace(6)* inreg noalias align 4 dereferenceable(18446744073709551615) %1, float addrspace(6)* inreg noalias align 4 dereferenceable(18446744073709551615) %2, <8 x i32> addrspace(6)* inreg noalias align 4 dereferenceable(18446744073709551615) %3, i32 inreg %4, i32 inreg %5, <2 x i32> %6, <2 x i32> %7, <2 x i32> %8, <3 x i32> %9, <2 x i32> %10, <2 x i32> %11, <2 x i32> %12, float %13, float %14, float %15, float %16, float %17, i32 %18, i32 %19, float %20, i32 %21) #0 {
main_body:
  %22 = call nsz arcp float @llvm.amdgcn.interp.mov(i32 2, i32 0, i32 0, i32 %5) #4
  %23 = bitcast float %22 to i32
  %24 = call nsz arcp float @llvm.amdgcn.interp.mov(i32 2, i32 1, i32 0, i32 %5) #4
  %25 = bitcast float %24 to i32
  %26 = call nsz arcp float @llvm.amdgcn.interp.mov(i32 2, i32 2, i32 0, i32 %5) #4
  %27 = bitcast float %26 to i32
  %28 = getelementptr inbounds <8 x i32>, <8 x i32> addrspace(6)* %3, i32 32, !amdgpu.uniform !0
  %29 = load <8 x i32>, <8 x i32> addrspace(6)* %28, align 4, !invariant.load !0
  %30 = call nsz arcp <4 x float> @llvm.amdgcn.image.load.mip.2d.v4f32.i32(i32 15, i32 %23, i32 %25, i32 %27, <8 x i32> %29, i32 0, i32 0) #4
  %31 = ptrtoint float addrspace(6)* %2 to i32
  %32 = insertelement <4 x i32> <i32 poison, i32 0, i32 16, i32 163756>, i32 %31, i32 0
  %33 = call nsz arcp float @llvm.amdgcn.s.buffer.load.f32(<4 x i32> %32, i32 0, i32 0) #4
  %34 = call nsz arcp float @llvm.amdgcn.s.buffer.load.f32(<4 x i32> %32, i32 4, i32 0) #4
  %35 = call nsz arcp float @llvm.amdgcn.s.buffer.load.f32(<4 x i32> %32, i32 8, i32 0) #4
  %36 = call nsz arcp float @llvm.amdgcn.s.buffer.load.f32(<4 x i32> %32, i32 12, i32 0) #4
  %37 = call nsz arcp float @llvm.amdgcn.rcp.f32(float %33) #4
  %38 = call nsz arcp float @llvm.amdgcn.rcp.f32(float %34) #4
  %39 = call nsz arcp float @llvm.amdgcn.rcp.f32(float %35) #4
  %40 = call nsz arcp float @llvm.amdgcn.rcp.f32(float %36) #4
  %41 = extractelement <4 x float> %30, i32 0
  %42 = fmul nsz arcp float %41, %37
  %43 = extractelement <4 x float> %30, i32 1
  %44 = fmul nsz arcp float %43, %38
  %45 = extractelement <4 x float> %30, i32 2
  %46 = fmul nsz arcp float %45, %39
  %47 = extractelement <4 x float> %30, i32 3
  %48 = fmul nsz arcp float %47, %40
  %49 = insertvalue <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> undef, i32 %4, 4
  %50 = insertvalue <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> %49, float %42, 5
  %51 = insertvalue <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> %50, float %44, 6
  %52 = insertvalue <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> %51, float %46, 7
  %53 = insertvalue <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> %52, float %48, 8
  %54 = insertvalue <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> %53, float %20, 19
  ret <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> %54
}

; Function Attrs: nounwind readnone speculatable willreturn
declare float @llvm.amdgcn.interp.mov(i32 immarg, i32 immarg, i32 immarg, i32) #1

; Function Attrs: nounwind readonly willreturn
declare <4 x float> @llvm.amdgcn.image.load.mip.2d.v4f32.i32(i32 immarg, i32, i32, i32, <8 x i32>, i32 immarg, i32 immarg) #2

; Function Attrs: nounwind readnone willreturn
declare float @llvm.amdgcn.s.buffer.load.f32(<4 x i32>, i32, i32 immarg) #3

; Function Attrs: nounwind readnone speculatable willreturn
declare float @llvm.amdgcn.rcp.f32(float) #1

attributes #0 = { "InitialPSInputAddr"="0xb077" "denormal-fp-math"="ieee,ieee" "denormal-fp-math-f32"="preserve-sign,preserve-sign" "target-features"="+DumpCode" }
attributes #1 = { nounwind readnone speculatable willreturn }
attributes #2 = { nounwind readonly willreturn }
attributes #3 = { nounwind readnone willreturn }
attributes #4 = { nounwind readnone }

!0 = !{}

For those of you who can’t read NIR and/or LLVM or don’t want to sift through all that, let me reduce it down to the important lines:

GLSL:

vec4 color = texelFetch2D(tex, ivec2(tc), tc.w);

NIR:

vec4 32 ssa_5 = (float32)txf ssa_2 (texture_deref), ssa_2 (sampler_deref), ssa_3 (coord), ssa_4 (lod)

LLVM:

%30 = call nsz arcp <4 x float> @llvm.amdgcn.image.load.mip.2d.v4f32.i32(i32 15, i32 %23, i32 %25, i32 %27, <8 x i32> %29, i32 0, i32 0) #4

; Function Attrs: nounwind readonly willreturn
declare <4 x float> @llvm.amdgcn.image.load.mip.2d.v4f32.i32(i32 immarg, i32, i32, i32, <8 x i32>, i32 immarg, i32 immarg) #2

attributes #2 = { nounwind readonly willreturn }
attributes #4 = { nounwind readnone }

In NIR, a texelFetch() shows up as a texture instruction. NIR has a special instruction type just for textures, called nir_tex_instr, to handle the combinatorial explosion of possibilities when it comes to all the different ways you can access a texture. In this particular case, the texture opcode is nir_texop_txf for a texel fetch and it is passed a texture, a sampler, a coordinate, and an LOD. Pretty standard stuff.

In AMD-flavored LLVM IR, this turns into a magic intrinsic function called llvm.amdgcn.image.load.mip.2d.v4f32.i32. A bunch of information about the operation, such as the fact that it takes a mip parameter and returns a vec4, is encoded in the function name. The AMD back-end then knows how to turn this into the right sequence of hardware instructions to load from a texture.

There are a couple of important things to note here. First is the @llvm.amdgcn prefix on the function name. This is an entirely AMD-specific function. If I dumped out the LLVM from the Intel windows drivers for that same GLSL, it would use a different function name with a different encoding for the various bits of ancillary information such as the return type. Even though both drivers share LLVM, in theory, the way they encode graphics operations is entirely different. If you looked at NVIDIA, you would find a third encoding. There is no standardization.

Why is this important? Well, one of the most common arguments I hear from people for why we should all be using LLVM for graphics is that it allows for code sharing. Everyone can leverage all that great work that happens in upstream LLVM. Except it doesn’t. Not really. Sure, you get LLVM’s algebraic optimizations, code motion, etc. But you can’t share any of the optimizations that are really interesting for graphics because nothing graphics-related is common. Could it be standardized? Probably. But, in the state it’s in today, any claim that two graphics compilers share significant optimizations because they’re both LLVM-based is a half-truth at best. And it will never become standardized unless vendors other than AMD decide to put their back-ends into upstream LLVM and they all decide to work together.

The second important bit about that LLVM function call is that LLVM has absolutely no idea what that function does. All it knows is that it’s been decorated nounwind, readonly, and willreturn. The readonly gives it a bit of information so it knows it can move the function call around a bit since it won’t write memory. However, it can’t even eliminate redundant texture ops because, for all LLVM knows, a second call could return a different result. While LLVM has pretty good visibility into the basic math in the shader, when it comes to anything that touches image or buffer memory, it’s flying entirely blind. The Intel LLVM-based graphics compiler tries to improve this somewhat by using actual LLVM pointers for buffer memory so LLVM gets a bit more visibility, but you still end up with a pile of out-of-thin-air pointers that all potentially alias each other, so it’s pretty limited.

In contrast, NIR knows exactly what sort of thing nir_texop_txf is and what it does. It knows, for instance, that, even though it accesses external memory, the API guarantees that nothing shifts out from under you so it’s fine to eliminate redundant texture calls. For nir_texop_tex (texture() in GLSL), it knows that it takes implicit derivatives and so it can’t be moved into non-uniform control-flow. For things like SSBO and workgroup memory, we know what kind of memory they’re touching and can do alias analysis that’s actually aware of buffer bindings.

Code sharing

When people try to justify their use of LLVM to me, there are typically two major benefits they cite. The first is that LLVM lets them take advantage of all this academic compiler work. In the previous section, I explained why this is a weak argument at best. The second is that embracing LLVM for graphics lets them share code with their compute compiler. Does that mean that we’re against sharing code? Not at all! In fact, NIR lets us get far more code sharing than most companies do by using LLVM.

The difference is the axis of sharing. This is something I ran into trying to explain myself to people at Intel all the time. They’re usually only thinking about how to get the Intel OpenCL driver and the Intel D3D12 driver to share code. With NIR, we have compiler code shared effectively across 20 years of hardware from 8 different vendors and at least 4 APIs. So while Intel’s Linux Vulkan and OpenCL drivers don’t share a single line of compiler code, it’s not like we went off and hand-coded a whole compiler stack just for Intel Linux Vulkan.

As an example of this, consider nir_lower_tex(), a pass that lowers various types of texture operations to other texture operations. It can, among other things:

  • Lower texture projectors away by doing the division in the shader,

  • Lower texelFetchOffset() to texelFetch(),

  • Lower rectangle textures by dividing the coordinate by the result of textureSize(),

  • Lower texture swizzles to swizzling in the shader,

  • Lower various forms of textureGrad*() to textureLod*() under various conditions,

  • Lower imageSize(i, lod) with an LOD to imageSize(i, 0) and some shader math,

  • And much more…

Exactly what lowering is needed is highly hardware dependent (except projectors; only old Qualcomm hardware has those) but most of them are needed by at least two different vendor’s hardware. While most of these are pretty simple, when you get into things like turning derivatives into LODs, the calculations get complex and we really don’t want everyone typing it themselves if we can avoid it.
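The projector case is simple enough to sketch. A projective coordinate (x, y, q) is equivalent to an ordinary lookup at (x/q, y/q), so the lowering emits the division explicitly (an frcp plus multiplies in the real IR) and drops the projector. This toy Python version uses plain floats in place of SSA values:

```python
# Sketch of lowering a texture projector: divide every coordinate
# component by the projector (the final component), then do a normal,
# non-projective texture lookup with the result.
def lower_projector(coord):
    """coord: tuple (..., q) with the projector q in the last slot."""
    *xy, q = coord
    inv_q = 1.0 / q                    # frcp in the real IR
    return tuple(c * inv_q for c in xy)
```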

And texture lowering is just one example. We’ve got dozens of passes for everything from lowering read-only images to textures for OpenCL, to lowering built-in functions like frexp() to simpler math, to flipping gl_FragCoord and gl_PointCoord when rendering upside down, which is required to implement OpenGL on Linux window-systems. All that code is in one central place where it’s usable by all the graphics drivers on Linux.

Tight driver integration

I mentioned earlier that having your compiler out-of-tree is painful from a packaging and release point-of-view. What I haven’t addressed yet is just how tight driver/compiler integration has to be. It depends a lot on the API and hardware, of course, but the interface between compiler and driver is often very complex. We make it look very simple on the API side where you have descriptor sets (or bindings in GL) and then you access things from them in the shader. Simple, right? Hah!

In the Intel Linux Vulkan driver, we can access a UBO one of four ways depending on a complex heuristic:

  • We try to find up to 4 small ranges of commonly used UBO constants and push those into the shader as push constants.

  • If we can’t push it all and it fits inside the hardware’s 240 entry binding table, we create a descriptor for it and put it in the binding table.

  • Depending on the hardware generation, UBOs successfully bound to descriptors might be accessed as SSBOs or we might access them through the texture unit.

  • If we run out of entries in the binding table, or if the access is in a ray-tracing stage (those don’t have binding tables), we fall back to doing bounds checking in the shader and access it using raw 64-bit GPU addresses.

And that’s just UBOs! SSBO binding has a similar level of complexity and also depends on the SSBO operations done in the shader. Textures have silent fall-back to bindless if we have too many, etc. In order to handle all this insanity, we have a compiler pass called anv_nir_apply_pipeline_layout() which lives in the driver. The interface between that pass and the rest of the driver is quite complex and can communicate information about exactly how things are actually laid out. We do have to serialize it to put it all in the pipeline cache so that limits the complexity some but we don’t have to worry about keeping the interface stable at all because it lives in the driver.
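For a rough feel of the decision described above, here is a heavily simplified Python sketch. Everything except the 240-entry binding table limit is invented (the function name, the "up to 4 push ranges" encoding, the return labels), and the real heuristic in anv_nir_apply_pipeline_layout weighs much more:

```python
# Hypothetical sketch of the four-way UBO access decision.
MAX_BINDING_TABLE = 240   # hardware binding table size mentioned above

def classify_ubo_access(pushed_ranges, table_entries, is_ray_tracing,
                        fits_in_push):
    # Small, hot ranges get promoted to push constants (up to 4 ranges).
    if fits_in_push and len(pushed_ranges) < 4:
        return "push_constants"
    # Otherwise use a descriptor in the binding table if there's room;
    # ray-tracing stages have no binding tables at all.
    if not is_ray_tracing and table_entries < MAX_BINDING_TABLE:
        return "binding_table"
    # Last resort: raw 64-bit address with bounds checks in the shader.
    return "bindless_64bit"
```

Whether a binding-table access then goes through the data port or the texture unit is a further hardware-generation detail the sketch leaves out.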

We also have passes for handling YCbCr format conversion, turning multiview into instanced rendering and constructing a gl_ViewID in the shader based on the view mask and the instance number, and a handful of other tasks. Each of these requires information from the VkPipelineCreateInfo and some of them result in magic push constants which the driver has to know need pushing.

Trying to do that with your compiler in another project would be insane. So how does AMD do it with their LLVM compiler? Good question! They either do it in NIR or as part of the NIR to LLVM conversion. By the time the shader gets to LLVM, most of the GL or Vulkanisms have been translated to simpler constructs, keeping the driver/LLVM interface manageable. It also helps that AMD’s hardware binding model is crazy simple and was basically designed for an API like Vulkan.

Structured control-flow

One of the riskier decisions we made when designing NIR was to make all control-flow inherently structured. Instead of the branch and conditional branch instructions that LLVM or SPIR-V have, NIR has control-flow nodes in a tree structure. The root of the tree is always a nir_function_impl. In each function is a list of control-flow nodes, each of which may be a nir_block, nir_if, or nir_loop. An if has a condition along with then and else cases. A loop is a simple infinite loop, and the nir_jump_break and nir_jump_continue instructions act exactly like their C counterparts.
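In Python-flavored pseudocode (NIR itself is C, and these class names are illustrative stand-ins for the real nir_* structs), the tree looks something like this; the walk shows how visiting every block in program order stays a trivial recursion:

```python
# Illustrative sketch of NIR's structured control-flow tree: a function
# body is a list of CF nodes, and if/loop nodes recursively contain
# more such lists.
from dataclasses import dataclass, field

@dataclass
class Block:                    # straight-line run of instructions
    instrs: list = field(default_factory=list)

@dataclass
class If:                       # condition plus structured then/else lists
    condition: object
    then_list: list = field(default_factory=list)
    else_list: list = field(default_factory=list)

@dataclass
class Loop:                     # infinite loop; break/continue are instrs
    body: list = field(default_factory=list)

def iter_blocks(cf_list):
    """Depth-first walk yielding every Block in program order."""
    for node in cf_list:
        if isinstance(node, Block):
            yield node
        elif isinstance(node, If):
            yield from iter_blocks(node.then_list)
            yield from iter_blocks(node.else_list)
        elif isinstance(node, Loop):
            yield from iter_blocks(node.body)
```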

At the time, this decision was made from pure pragmatism. We had structure coming out of GLSL and most of the back-ends expected structure. Why break everything? It did mean that, when we started writing control-flow manipulation passes, things were a lot harder. A dead control-flow pass in an unstructured IR is trivial: replace any conditional branch with a constant condition by an unconditional branch to the taken side, then delete any unreachable blocks and merge blocks as necessary. Done. In a structured IR, it’s a lot more fiddly. You have to manually collapse if ladders, and deleting the unconditional break at the end of a loop is equivalent to loop unrolling. But we got over that hump, built tools to make it less painful, and have implemented most of the important control-flow optimizations at this point. In exchange, back-ends get structure, which is something most GPUs want thanks to the SIMT model they use.

What we didn’t see coming when we made that decision (2014, remember?) was wave/subgroup ops. In the last several years, the SIMT nature of shader execution has slowly gone from an implementation detail to something that’s baked into all modern 3D and compute APIs and shader languages. With that shift has come the need to be consistent about re-convergence. If we say “texture() has to be in uniform control flow”, is the following shader ok?

#version 120

varying vec2 tc;
uniform sampler2D tex;
out vec4 fragColor;
void main()
{
    if (tc.x > 1.0)
        tc.x = 1.0;

    fragColor = texture(tex, tc);
}

Obviously, it should be. But what guarantees that you’re actually in uniform control-flow by the time you get to the texture() call? In an unstructured IR, once you diverge, it’s really hard to guarantee convergence. Of course, every GPU vendor with an LLVM-based compiler has algorithms for trying to maintain or re-create the structure but it’s always a bit fragile. Here’s an even more subtle example:

#version 120

varying vec2 tc;
uniform sampler2D tex;
out vec4 fragColor;
void main()
{
    /* Block 0 */
    float x = tc.x;
    while (1) {
        /* Block 1 */
        if (x < 1.0) {
            /* Block 2 */
            tc.x = x;
            break;
        }

        /* Block 3 */
        x = x - 1.0;
    }

    /* Block 4 */
    fragColor = texture(tex, tc);
}

The same question of validity holds but there’s something even trickier in here. Can the compiler merge block 4 and block 2? If so, where should it put it? To a CPU-centric compiler like LLVM, it looks like it would be fine to merge the two and put it all in block 2. In fact, since texture ops are expensive and block 2 is deeper inside control-flow, it may think the resulting shader would be more efficient if it did. And it would be wrong on both counts.

First, the loop exit condition is non-uniform and, since texture() takes derivatives, it’s illegal to put it in non-uniform control-flow. (Yes, in this particular case, the result of those derivatives might be a bit wonky.) Second, due to the SIMT nature of execution, you really don’t want the texture op in the loop. In the worst case, a 32-wide execution will hit block 2 32 separate times whereas, if you guarantee re-convergence, it only hits block 4 once.

The fact that NIR’s control-flow is structured from start to finish has been a hidden blessing here. Once we get the structure figured out from SPIR-V decorations (which is annoyingly challenging at times), we never lose that structure and the re-convergence information it implies. NIR knows better than to move derivatives into non-uniform control-flow and its code-motion passes are tuned assuming a SIMT execution model. What has become a constant fight for people working with LLVM is a non-issue for us. The only thing that has been a challenge has been dealing with SPIR-V’s less than obvious structure rules and trying to make sure we properly structurize everything that’s legal. (It’s been getting better recently.)
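Here is a minimal sketch, with an invented node encoding, of the kind of query structure makes easy: flag any implicit-derivative op (a texture() in GLSL terms) that sits under an if whose condition is not wave-uniform. With the CF tree intact, "am I in non-uniform control flow?" is just "does any enclosing condition diverge?":

```python
# Nodes are ('op', name, needs_derivatives) for instructions or
# ('if', condition_is_uniform, then_list, else_list) for structured ifs.
# This encoding is made up for illustration; NIR's real divergence
# analysis is considerably more involved.
def find_illegal_derivative_ops(cf_list, divergent_path=False):
    bad = []
    for node in cf_list:
        if node[0] == "op":
            _, name, needs_derivs = node
            # Implicit derivatives require uniform control flow.
            if needs_derivs and divergent_path:
                bad.append(name)
        elif node[0] == "if":
            _, cond_uniform, then_l, else_l = node
            diverge = divergent_path or not cond_uniform
            bad += find_illegal_derivative_ops(then_l, diverge)
            bad += find_illegal_derivative_ops(else_l, diverge)
    return bad
```

In an unstructured IR the equivalent check requires reconstructing (or having carefully preserved) the re-convergence points first.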

Side-note: NIR does support OpenCL SPIR-V which is unstructured. To handle this, we have nir_jump_goto and nir_jump_goto_if instructions which are allowed only for a very brief period of time. After the initial SPIR-V to NIR conversion, we run a couple passes and then structurize. After that, it remains structured for the rest of the compile.

Algebraic optimizations

Every GPU compiler engineer has horror stories about something some app developer did in a shader. Sometimes it’s the fault of the developer and sometimes it’s just an artifact of whatever node-based visual shader building system the game engine presents to the artists and how it’s been abused. On Linux, however, it can get even more entertaining. Not only do we have those shaders that were written for DX9, where someone lost the code so they ran them through a DX9 to HLSL translator and then through FXC, but then, to port the app to OpenGL so it could run on Linux, they did a DXBC to GLSL conversion with some horrid tool. The end result is x != 0 implemented with three levels of nested function calls, multiple splats out to a vec4, and a truly impressive pile of control-flow. I only wish I were joking….

To chew through this mess, we have nir_opt_algebraic(). We’ve implemented a little language for expressing these expression trees using python tuples and nir_opt_algebraic.py. To get a sense for what this looks like, let’s look at some excerpts from nir_opt_algebraic.py starting with the simple description at the top:

# Written in the form (<search>, <replace>) where <search> is an expression
# and <replace> is either an expression or a value.  An expression is
# defined as a tuple of the form ([~]<op>, <src0>, <src1>, <src2>, <src3>)
# where each source is either an expression or a value.  A value can be
# either a numeric constant or a string representing a variable name.
#
# <more details>

optimizations = [
   ...
   (('iadd', a, 0), a),

This rule is a good starting example because it’s so straightforward. It looks for an integer add operation of something with zero and gets rid of it. A slightly more complex example removes redundant fmax opcodes:

(('fmax', ('fmax', a, b), b), ('fmax', a, b)),

Since it’s written in python, we can also write little rule generators if the same thing applies to a bunch of opcodes or if you want to generalize across types:

# For any float comparison operation, "cmp", if you have "a == a && a cmp b"
# then the "a == a" is redundant because it's equivalent to "a is not NaN"
# and, if a is a NaN then the second comparison will fail anyway.
for op in ['flt', 'fge', 'feq']:
   optimizations += [
      (('iand', ('feq', a, a), (op, a, b)), ('!' + op, a, b)),
      (('iand', ('feq', a, a), (op, b, a)), ('!' + op, b, a)),
   ]

Because we’ve made adding new optimizations so incredibly easy, we have a lot of them. Not just the simple stuff I’ve highlighted above, either. We’ve got at least two cases where someone hand-rolled bitfieldReverse() and we match a giant pattern and turn it into a single HW instruction. (Some UE4 demo and Cyberpunk 2077, if you want to know who to blame. They hand-roll it differently, of course.) We also have patterns to chew through all the garbage from D3D9 to HLSL conversion where they emit piles of x ? 1.0 : 0.0 everywhere because D3D9 didn’t have real Boolean types. All told, as of the writing of this blog post, we have 1911 such search and replace patterns.

Not only have we made it easy to add new patterns but the nir_search framework has some pretty useful smarts in it. The expression I first showed matches a + 0 and replaces it with a but nir_search is smart enough to know that nir_op_iadd is commutative and so it also matches 0 + a without having to write two expressions. We also have syntax for detecting constants, handling different bit sizes, and applying arbitrary C predicates based on the SSA value. Since NIR is actually a vector IR (we support a lot of vec4-based hardware), nir_search also magically handles swizzles for you.
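A toy matcher conveys the commutativity idea. This is nowhere near nir_search (no constant folding, bit sizes, predicates, or swizzles), but it shows how one written pattern such as ('iadd', a, 0) can also match the commuted 0 + a without a second rule:

```python
# Patterns and expressions are nested tuples (op, src0, src1); a bare
# string in a pattern is a variable to bind, anything else is a literal.
COMMUTATIVE = {"iadd", "fmul", "fmax"}

def match(pattern, expr, env=None):
    """Return a variable binding dict on success, else None."""
    env = dict(env or {})
    if isinstance(pattern, str):              # variable: bind or re-check
        if pattern in env and env[pattern] != expr:
            return None
        env[pattern] = expr
        return env
    if not isinstance(pattern, tuple):        # literal constant
        return env if pattern == expr else None
    if not isinstance(expr, tuple) or expr[0] != pattern[0]:
        return None
    op, *psrcs = pattern
    esrcs = list(expr[1:])
    orders = [esrcs]
    if op in COMMUTATIVE and len(esrcs) == 2:
        orders.append(esrcs[::-1])            # also try swapped sources
    for order in orders:
        e = env
        for p, s in zip(psrcs, order):
            e = match(p, s, e)
            if e is None:
                break
        else:
            return e
    return None
```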

You might think 1911 patterns is a lot, and it is. Doesn’t that take forever? Isn’t it O(NPS) where N is the number of instructions, P is the number of patterns, and S is the average pattern size, or something like that? Nope! A couple of years ago, Connor Abbott converted it to use a finite-state automaton, built at driver compile time, to filter out impossible matches as we go. The result is that the whole pass effectively runs in linear time in the number of instructions.

NIR is a low(ish) level IR

This one continues to surprise me. When we set out to design NIR, the goal was something that was SSA and used flat lists of instructions (not expression trees). That was pretty much the extent of the design requirements. However, whenever you build an IR, you inevitably make a series of choices about what kinds of things you’re going to support natively and what things are going to require emulation or be a bit more painful.

One of the most fundamental choices we made in NIR was that SSA values would be typeless vectors. Each nir_ssa_def has a bit size and a number of vector components and that’s it. We don’t distinguish between integers and floats and we don’t support matrix or composite types. Not supporting matrix types was a bit controversial but it’s turned out fine. We also have to do a bit of juggling to support hardware that doesn’t have native integers because we have to lower integer operations to float and we’ve lost the type information. When working with shaders that come from D3D to OpenGL or Vulkan translators, the type information does more harm than good. I can’t count the number of shaders I’ve seen where they declare vec4 x1 through vec4 x80 at the top and then uintBitsToFloat() and floatBitsToUint() all over everywhere.

We also made adding new ALU ops and intrinsics really easy, and we added a fairly powerful metadata system for both so the compiler can still reason about them. If we’re honest, the lines we drew between ALU ops, intrinsics, texture instructions, and control-flow like break and continue were pretty arbitrary at the time. Texturing was going to require a lot of intrinsics, so Connor added an instruction type. That was pretty much it.

The end result, however, has been an IR that’s incredibly versatile. It’s somehow both a high-level and low-level IR at the same time. When we do SPIR-V to NIR translation, we don’t have a separate IR for parsing SPIR-V. We have some data structures to deal with composite types and a handful of other stuff but when we parse SPIR-V opcodes, we go straight to NIR. We’ve got variables with fairly standard dereference chains (those do support composite types), bindings, all the crazy built-ins like frexp(), and a bunch of other language-level stuff. By the time the NIR shows up in your back-end, however, all that’s gone. Crazy built-in functions have been lowered. GL/Vulkan binding with derefs, descriptors, and locations has been turned into byte offsets and indices in a flat binding table. Some drivers have even attempted to emit hardware instructions directly from NIR. (It’s never quite worked, but it says a lot that they even tried.)

The Intel compiler back-end has probably shrunk by half in terms of optimization and lowering passes in the last seven years because we’re able to do so much in NIR. We’ve got code that lowers storage image access with unsupported formats to other image formats or even SSBO access, splitting of vector UBO/SSBO access that’s too wide for hardware, workarounds for imprecise trig ops, and a bunch of others. All of the interesting lowering is done in NIR. One reason for this is that Intel has two back-ends, one for scalar and one that’s vec4 and any lowering we can do in NIR is lowering that only happens once. But, also, it’s nice to be able to have the full power of NIR’s optimizer run on your lowered code.

As I said earlier, I find the versatility of NIR astounding. We never intended to write an IR that could get that close to hardware. We just wanted SSA for easier optimization writing. But the end result has been absolutely fantastic and has done a lot to accelerate driver development in Mesa.

Conclusion

If you’ve gotten this far, I both applaud and thank you! NIR has been a lot of fun to build and, as you can probably tell, I’m quite proud of it. It’s also been a huge investment involving thousands of man-hours, but I think it’s been well worth it. There’s a lot more work to do, of course. The ray-tracing situation still isn’t where it needs to be, and OpenCL-style compute needs some help to be truly competent. But it’s come an incredibly long way in the last seven years, and I’m incredibly proud of what we’ve built and forever thankful to the many, many developers who have chipped in, fixed bugs, and contributed optimization and lowering passes.

Hopefully, this post provides some additional background and explanation for the big question of why Mesa carries its own compiler stack. And maybe, just maybe, someone will get excited enough about it to play around with it and even contribute! One can hope, right?

January 26, 2022

(This post was first published with Collabora on Jan 25, 2022.)

A Pixel's Color

My work on Wayland and Weston color management and HDR support has been full of learning new concepts and terms. Many of them are crucial for understanding how color works. I started out so ignorant that I did not know how to blend two pixels together correctly. I did not even know that I did not know - I was just doing the obvious blend, and that was wrong. Now I think I know what I know and do not know, and I also feel that most developers around window systems and graphical applications are as uneducated as I was.
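For the curious, the classic version of that mistake is blending sRGB-encoded values directly instead of decoding to linear light first. A minimal single-channel sketch (the constants are the standard sRGB transfer function ones):

```python
def srgb_to_linear(c: float) -> float:
    # sRGB electro-optical transfer function (per channel, [0, 1] range)
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

def linear_to_srgb(c: float) -> float:
    # Inverse of the above
    return c * 12.92 if c <= 0.0031308 else 1.055 * c ** (1 / 2.4) - 0.055

def blend_naive(a, b, t):
    # The "obvious" (wrong) blend: averaging the encoded values directly
    return a * (1 - t) + b * t

def blend_linear(a, b, t):
    # Decode, blend in linear light, re-encode
    la, lb = srgb_to_linear(a), srgb_to_linear(b)
    return linear_to_srgb(la * (1 - t) + lb * t)

# 50/50 blend of black and white:
print(blend_naive(0.0, 1.0, 0.5))             # 0.5 -- noticeably too dark
print(round(blend_linear(0.0, 1.0, 0.5), 4))  # 0.7354 -- the physically correct result
```

The two answers differ by almost a quarter of the encoding range, which is why getting this wrong is so visible.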

Color knowledge, it seems, is surprisingly scarce in my field. It is not enough that I educate myself. I need other people to talk to, to review my work, and to write patches that I will be reviewing. With the hope of making it even a little bit easier to understand what is going on with color, I wrote the article: A Pixel's Color.

The article goes through most of the important concepts, trying to give you, a programmer, a vague idea of what they are. It does not explain everything too well, because I want you to be able to read through it, but it still got longer than I expected. My intention is to tell you about things you might not know about, so that you would at least know what you do not know.

A warm thank you to everyone who reviewed and commented on the article.

A New Documentation Repository

Originally Wayland CM&HDR extension merge request included documentation about how color management would work on Wayland. The actual protocol extension specification cannot even begin to explain all that.

To make that documentation easier to revise and contribute to, I proposed to move it into a new repository: color-and-hdr. That also allowed us to widen the scope of the documentation, so we can easily include things outside of Wayland: EGL, Vulkan WSI, DRM KMS, and more.

I hope that color-and-hdr documentation repository gains traction and becomes a community maintained effort in gathering information about color and HDR on Linux, and that we can eventually move it out of my personal namespace to become truly community owned.

January 20, 2022

It’s Happening (For Real)

After weeks of hunting for the latest rumors of jekstrand’s future job prospects, I’ve finally done it: zink now supports more extensions than any other OpenGL driver in Mesa.

That’s right.

Check it on mesamatrix if you don’t believe me.

A couple days ago I merged support for the external memory extensions that I’d been putting off, and today we got sparse textures thanks to Qiang Yu at AMD doing 99% of the work to plumb the extensions through the rest of Mesa.

There’s even another sparse texture extension, which I’ve already landed all the support for in zink, that should be enabled for the upcoming release.

What’s Next?

Zink (sometimes) has the performance, and now it has the features, so naturally the focus is going to shift to compatibility and correctness. Kopper is going to mostly take care of the former, which leaves the latter. There aren’t a ton of CTS cases failing.

Ideally, by the end of the year, there won’t be any.

January 18, 2022

Suspicion

The last thing I remember Thursday was trying to get the truth out about Jason Ekstrand’s new role. Days have now passed, and I can’t remember what I was about to say or what I did over the extended weekend.

But Big Triangle sure has been busy. It’s clear I was on to something, because otherwise they wouldn’t have taken such drastic measures. Look at this: jekstrand is claiming Collabora has hired him. This is clearly part of a larger coverup, and the graphics news media are eating it up.

Congratulations to him, sure, but it’s obvious this is just another attempt to throw us off the trail. We may never find out what Jason’s real new job is, but that doesn’t mean we’re going to stop following the hints and clues as they accumulate. Sooner or later, Big Triangle is going to slip up, and then we’ll all know the truth.

Progress

In the meantime, zink goes on. I’ve spent quite a long while tinkering with NVIDIA and getting a solid baseline of CTS results. At present, I’m down to about 800 combined fails for GL 4.6 and ES 3.2. Given that lavapipe is at around 80 and RADV is just over 600, both excluding the confidential test suites, this is a pretty decent start.

This is probably going to be the last time I’m on nvidia for a while, and it hasn’t been too bad overall.

The Year’s First Rebrand

The (second) biggest news story for today is a rebrand.

Copper is being renamed.

It will, in fact, be named Kopper to match the zink/vulkan naming scheme.

I can’t overstate how significant this change is and how massive the ecosystem changes around it will be.

Just huge. Like the number of words in this blog post.

January 17, 2022

Hello, Collabora!

Ever since I announced that I was leaving Intel, there’s been a lot of speculation as to where I’d end up. I left it a bit quiet over the holidays but, now that we’re solidly in 2022, it’s time to let it spill. As of January 24, I’ll be at Collabora!

For those of you that don’t know, Collabora is an open-source consultancy. They sell engineering services to companies who are making devices that run Linux and want to contribute to open-source technologies. They’ve worked on everything from automotive to gaming consoles to smart TVs to infotainment systems to VR platforms. I’m not an expert on what Collabora has done over the years so I’ll refer you to their brag sheet for that. Unlike some contract houses, Collabora doesn’t just do engineering for hire. They’re also an ideologically driven company that really believes in upstream and invests directly in upstream projects such as Mesa, Wayland, and others.

My personal history with Collabora is as old as my history as an open-source software developer. My first real upstream work was on Wayland in early 2013. I jumped in with a cunning plan for running a graphics-enabled desktop Linux chroot on an Android device and absolutely no idea what I was getting myself into. Two of the people who not only helped me understand the underbelly of Linux window systems but also helped me learn to navigate the world of open-source software were Daniel Stone and Pekka Paalanen, both of whom were at Collabora then and still are today.

After switching to Mesa when I joined Intel in 2014, I didn’t interact with Collabora devs quite as much since they mostly stayed in the window-system world and I tried to stay in 3D. In the last few years, however, they’ve been building up their 3D team and doing some really interesting work. Alyssa Rosenzweig and I have worked quite a bit together on various NIR passes as part of her work on Panfrost and now agx. I also worked with Boris Brezillon and Erik Faye-Lund on some of the CLOn12, GLOn12, and Zink work which layers OpenGL and OpenCL on top of D3D12 and Vulkan. In case you haven’t figured it out already from my glowing review, Collabora has some top-notch people who are doing great work and I’m excited to be joining the team and working more closely with them.

So how did this happen? What convinced me to leave the cushy corporate job and join a tiny (compared to Intel) open-source company? It’s not been for lack of opportunities. I get pinged by recruiters on LinkedIn on a regular basis and certain teams in the industry have been rather persistent. I’ve thought quite a lot over the years about where I’d want to go if I ever left Intel. Intel has been my engineering home for 7.5 years and has provided the strange cocktail on which I’ve built my career: a stable team, well-funded upstream open-source work, fairly cutting edge hardware, and an IHV seat at Khronos. Every place I’d ever considered going would mean losing one or more of those things and, until Collabora, no one had given me a good enough reason to give any of that up.

Back in September, I was chatting on IRC with other Mesa devs about OpenCL, SPIR-V, and some corner-case we were missing in the compiler when the following exchange happened:

11:39 < jenatali> I hope I get time to get back to CL at some point, I
      hate leaving it half-finished, but stupid corporate priorities
      mean I have to do other stuff instead :P
11:41 < jekstrand> Yeah... Corporations... Why do we work for them
      again?  Oh, right, so we can afford to eat.

About an hour later, Daniel Stone replied privately:

12:40 <daniels> hey so if corporations ever get you down, there are
      always less-corporate options … :)
12:40 <daniels> timing completely coincidental of course
12:42 <jekstrand> Of course...
12:42 <jekstrand> I'm always open to new things if the offer is
      right...

This kicked off the weirdest and most interesting career conversation I’ve had to date. At first, I didn’t believe him. The job he was describing doesn’t exist. No one gets that offer. Not unless you’re Dave Airlie or Linus Torvalds. But, after multiple 1–2 hour video chats, more IRC chatter, and an hour chatting with Philippe Kalaf (Collabora’s CEO), they had me convinced. This is real.

So what did Collabora finally offer me that no one else has? Total autonomy. In my new role at Collabora, my mandate consists of two things: invest in and mentor the Collabora 3D graphics team and invest in upstream Linux and open-source graphics however I see fit. I won’t be expected to do any contract work. I may meet with clients from time to time and I’ll likely get involved more with the various Collabora-driven Mesa projects but my primary focus will be on ensuring that upstream is healthy. I won’t be tied to any one driver or hardware vendor either. Sure, it’d be good to do a bit of Panfrost work so I can help Alyssa out since she’s now my coworker and I’ll likely still work on Intel drivers a bit since that’s my home turf. But, at the end of the day, I’m now free to put my effort wherever it’s needed in the stack without concern for corporate priorities. Ray-tracing in RADV? Why not. OpenCL 3.0 for everyone? Sure. Hacking on a new kernel interface for Freedreno? That’s fine too. As far as I’m concerned, when it comes to how I spend my engineering effort, I now report directly to upstream. No strings attached.

One of the interesting side-effects of this is how it will affect my role within Khronos. Collabora is a Khronos member so I still plan to be involved there, but it will look different. For several years now (as long as RADV has been a competent driver, really), I’ve worn two hats at Khronos: Intel and Mesa/Linux. Most of the time, I was representing Intel, but there were always those weird, awkward moments where I’d help out the Igalia team working on V3DV or the RADV team. Now that I’m no longer at a hardware vendor, I can really embrace the role of representing Mesa and Linux upstream within Khronos. This doesn’t mean that I’m suddenly going to fix all your Vulkan spec problems overnight, but it does mean I’ll be paying a bit more attention to the non-Intel drivers and doing what I can to ensure that all the Vulkan drivers in Mesa are in good shape.

Honestly, I’m still in shock that I was offered this role. It’s a great testament to Collabora’s belief in upstream that they’re willing to fund such a role, and it shows an incredible amount of faith in my work. At Intel, I was blessed to be able to work upstream as part of my day job, which isn’t something most open-source software developers get. To have someone believe in your work so much that they’re willing to cut you a paycheck just to keep doing what you’re doing is mind-boggling. I’m truly honored, and I hope the work I do in the days, months, and years to come will prove that their faith was well placed.

So, what am I going to be working on with my new found freedom? Do I have any cool new projects planned that are going to turn the industry upside-down? Of course I do! But those are topics for other blog posts.

January 13, 2022

We Need To Talk

It’s come to my attention that there’s a lot of rumors flying around about what exactly I’m doing aside from posting the latest info about where Jason Ekstrand, who coined the phrase “If it compiles, we should ship it,” is going to end up.

Everyone knows that jekstrand’s next career move is big news—the kind of industry-shaking maneuvering that has every BigCo from Alphabet to Meta on tenterhooks. This post is going to debunk some of the most common nonsense I’ve been hearing as well as give some updates about what else I’ve been doing besides scouring the internet for even the tiniest clue about what’s coming for this man’s career in 2022.

Is Jason going to Apple to work on a modernized, open source implementation of Mac OS with a new Finder based on Vulkan?

My sources were very keen on this rumor up until Tuesday, when, in an undisclosed IRC channel, Jason himself had the following to say:

<jekstrand> Sachiel: Contrary to popular belief, I can't work on every idea in the multiverse simultaneously.  I'm limited to the same N dimensions as the rest of you.

This absolutely blew all the existing chatter out of the water. Until now, in the course of working on more sparse texturing extensions, I had the firm impression that we’d be seeing a return to form, likely with a Khronos member company, continuing to work on graphics. But now? With this? Clearly everyone was thinking too small.

Everyone except jekstrand himself, who will be taking up a position at CERN devising new display technology for particle accelerators.

Or at least, that’s what I thought until yesterday.

Is Jason really going to be working at CERN? How well does GPU knowledge translate to theoretical physics?

Unfortunately, this turned out to be bogus, no more than chaff deployed to stop us from getting to the truth because we were too close. Later, while I was pondering how buggy NVIDIA’s sparse image functionality was in the latest beta drivers and attempting to pass what few equally buggy CTS cases there were for ARB_sparse_texture2, I stumbled upon the obvious.

It’s so obvious, in fact, that everyone overlooked it because of how obvious it is.

Jason has left Intel and turned in his badge because he’s on vacation.

As everyone knows, he’s the kind of person who literally does not comprehend time in the same way that the rest of us do. It was his assessment of the HR policy that in order to take time off and leave the office, he had to quit. My latest intel (no pun intended) revealed that managers and executives alike were still scrambling, trying to figure out how to explain the company’s vacation policy using SSA-based compiler terminology, but optimizer passes left their attempts to engage him as no-ops.

Tragic.

So this whole thing was just a ruse?

I’ll be completely honest with you since you’ve read this far: I’ve just heard breaking news today. This is so fresh, so hot-off-the-presses that it’s almost as difficult to reveal as it is that I’ve implemented another 4 GL extensions. When the totality of all my MRs are landed, zink will become the GL driver in Mesa supporting the most extensions, and this is likely to be the case for the next release. Shocking, I know.

But not nearly as shocking as the fact that Jason is actually starting at Texas Instruments working on Vulkan for graphing calculators.

Think about it.

Anyone who knows jekstrand even the smallest amount knows how much sense this makes on both sides. He gets unlimited graphing calculators, and that’s all he had to hear before signing the contract. It’s that simple.

Graphing Calculators? Does Anyone Even Use Those Anymore?

I know at least one person who does, and it’s not Jason Ekstrand. Because in the time that I was writing out the last (and now deprecated) information I had available, there’s been more, even later breaking news.

Copper now has a real MR open for it.

I realize it’s entirely off-topic now to be talking about some measly merge request, but it has the WSI tag on it, which means Jason has no choice but to read through the entire thing.

That’s because he’ll be working for Khronos as the Assistant Deputy Director of Presentation. If there’s presentations to be done by anyone in the graphics space, for any reason, they’ll have to go through jekstrand first. I don’t envy the responsibility and accountability that this sort of role demands; when it comes to shedsmanship, people in the presentation space are several levels above the rest.

We can only hope he’s up to the challenge.

Or at least, we would if that were actually where he was going, because I’ve just heard from

January 10, 2022

This article is part of a series on how to setup a bare-metal CI system for Linux driver development. Here are the different articles so far:

  • Part 1: The high-level view of the whole CI system, and how to fully control test machines remotely (power on, OS to boot, keyboard/screen emulation using a serial console);
  • Part 2: Comparing the different ways to generate the rootfs of your test environment, and introducing the boot2container project.

In this article, we will further discuss the role of the CI gateway, and which steps we can take to simplify its deployment, maintenance, and disaster recovery.

This work is sponsored by the Valve Corporation.

Requirements for the CI gateway

As seen in part 1 of this CI series, the testing gateway sits between the test machines and the public network/internet:

      Internet /   ------------------------------+
    Public network                               |
                                       +---------+--------+                USB
                                       |                  +-----------------------------------+
                                       |      Testing     | Private network                   |
Main power (120/240 V) -----+          |      Gateway     +-----------------+                 |
                            |          +------+--+--------+                 |                 |
                            |                 |  | Serial /                 |                 |
                            |            Main |  | Ethernet                 |                 |
                            |            Power|  |                          |                 |
                +-----------+-----------------|--+--------------+   +-------+--------+   +----+----+
                |              Switchable PDU |                |   |   RJ45 switch  |   | USB Hub |
                |  Port 0    Port 1        ...|         Port N  |   |                |   |         |
                +----+------------------------+-----------------+   +---+------------+   +-+-------+
                     |                                                  |                  |
                Main |                                                  |                  |
                Power|                                                  |                  |
            +--------|--------+               Ethernet                  |                  |
            |                 +-----------------------------------------+   +----+----+    |
            |  Test Machine 1 |            Serial (RS-232 / TTL)            |  Serial |    |
            |                 +---------------------------------------------+  2 USB  +----+ USB
            +-----------------+                                             +---------+

The testing gateway's role is to expose the test machines to the users, either directly or via GitLab/Github. As such, it will likely require the following components:

  • a host Operating System;
  • a config file describing the different test machines;
  • a bunch of services to expose said machines and deploy their test environment on demand.

Since the gateway is connected to the internet, both the OS and the different services need to be kept updated relatively often to prevent your CI farm from becoming part of a botnet. This creates interesting issues:

  1. How do we test updates ahead of deployment, to minimize downtime due to bad updates?
  2. How do we make updates atomic, so that we never end up with a partially-updated system?
  3. How do we rollback updates, so that broken updates can be quickly reverted?

These issues can thankfully be addressed by running all the services in a container (as systemd units), started using boot2container. Updating the operating system and the services is then simply a matter of generating a new container, running tests to validate it, pushing it to a container registry, rebooting the gateway, then waiting while the gateway downloads and executes the new services.

Using boot2container does not, however, fix the issue of how to update the kernel or boot configuration when the system fails to boot the current one. Indeed, if the kernel, boot2container, and kernel command line are stored locally, they can only be modified via an SSH connection, which requires the machine to be reachable; otherwise, the gateway will remain bricked until an operator boots an alternative operating system.

The easiest way not to brick your gateway after a broken update is to power it through a switchable PDU (so that we can power cycle the machine), and to download the kernel, initramfs (boot2container), and the kernel command line from a remote server at boot time. This is fortunately possible even through the internet by using fancy bootloaders, such as iPXE, and this will be the focus of this article!

Tune in for part 4 to learn more about how to create the container.

iPXE + boot2container: Netbooting your CI infrastructure from anywhere

iPXE is a tiny bootloader that packs a punch! Not only can it boot kernels from local partitions, it can also connect to the internet and download kernels/initramfs using HTTP(S). Even more impressive is its little scripting engine, which executes boot scripts rather than the declarative boot configurations of bootloaders like GRUB. This enables creating loops, endlessly retrying boot methods until one finally succeeds!

Let's start with a basic example, and build towards a production-ready solution!

Netbooting from a local server

In this example, we will focus on netbooting the gateway from a local HTTP server. Let's start by reviewing a simple script that makes iPXE acquire an IP from the local DHCP server, then download and execute another iPXE script from http://<ip of your dev machine>:8000/boot.ipxe. If any step fails, the script restarts from the beginning until a successful boot is achieved.

#!ipxe

echo Welcome to Valve infra's iPXE boot script

:retry
echo Acquiring an IP
dhcp || goto retry # Keep retrying getting an IP, until we get one
echo Got the IP: $${netX/ip} / $${netX/netmask}

echo

echo Chainloading from the iPXE server...
chain http://<ip of your dev machine>:8000/boot.ipxe

# The boot failed, let's restart!
goto retry

Neat, right? Now, we need to generate a bootable ISO image starting iPXE with the above script run as a default. We will then flash this ISO to a USB pendrive:

$ git clone git://git.ipxe.org/ipxe.git
$ make -C ipxe/src -j`nproc` bin/ipxe.iso EMBED=<boot script file>
$ sudo dd if=ipxe/src/bin/ipxe.iso of=/dev/sdX bs=1M conv=fsync status=progress

Once connected to the gateway, ensure that you boot from the pendrive, and you should see the iPXE bootloader acquire an IP, then fail to download the script from http://<ip of your dev machine>:8000/boot.ipxe. So, let's write one:

#!ipxe

kernel /files/kernel b2c.container="docker://hello-world"
initrd /files/initrd
boot

This script specifies the following elements:

  • kernel: Download the kernel at http://<ip of your dev machine>:8000/files/kernel, and set the kernel command line to ask boot2container to start the hello-world container
  • initrd: Download the initramfs at http://<ip of your dev machine>:8000/files/initrd
  • boot: Boot the specified boot configuration

Assuming your gateway has an architecture supported by boot2container, you may now download the kernel and initrd from boot2container's releases page. In case it is unsupported, create an issue or a merge request to add support for it!

Now that you have created all the necessary files for the boot, start the web server on your development machine:

$ ls
boot.ipxe  initrd  kernel
$ python -m http.server 8000
Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...
<ip of your gateway> - - [09/Jan/2022 15:32:52] "GET /boot.ipxe HTTP/1.1" 200 -
<ip of your gateway> - - [09/Jan/2022 15:32:54] "GET /kernel HTTP/1.1" 200 -
<ip of your gateway> - - [09/Jan/2022 15:32:56] "GET /initrd HTTP/1.1" 200 -

If everything went well, the gateway should, after a couple of seconds, start downloading the boot script, then the kernel, and finally the initramfs. Once done, your gateway should boot Linux, run docker's hello-world container, then shut down.

Congratulations on netbooting your gateway! However, the current solution has one annoying constraint: it requires a trusted local network and server, because we are using HTTP rather than HTTPS... On an untrusted network, a man in the middle could override your boot configuration and take over your CI...

If we were using HTTPS, we could download our boot script/kernel/initramfs directly from any public server, even GIT forges, without fear of any man in the middle! Let's try to achieve this!

Netbooting from public servers

In the previous section, we managed to netboot our gateway from the local network. In this section, we try to improve on it by netbooting using HTTPS. This enables booting from a public server hosted at places such as Linode for $5/month.

As I said earlier, iPXE supports HTTPS. However, if you are anything like me, you may be wondering how such a small bootloader could know which root certificates to trust. The answer is that iPXE generates an SSL certificate at compilation time, which is then used to sign all of the root certificates trusted by Mozilla (by default), or any set of certificates you may want. See iPXE's crypto page for more information.

WARNING: iPXE currently does not like certificates exceeding 4096 bits. This can be a limiting factor when trying to connect to existing servers. We hope to one day fix this bug, but in the meantime, you may be forced to use a 2048-bit Let's Encrypt certificate on a self-hosted web server. See our issue for more information.

WARNING 2: iPXE only supports a limited set of ciphers. You'll need to make sure they are listed in nginx's ssl_ciphers configuration: AES-128-CBC:AES-256-CBC:AES256-SHA256, and AES128-SHA256:AES256-SHA:AES128-SHA

To get started, install NGINX + Let's Encrypt on your server, following your favourite tutorial, copy the boot.ipxe, kernel, and initrd files to the root of the web server, then make sure you can download them using your browser.

With this done, we just need to edit iPXE's general config C header to enable HTTPS support:

$ sed -i 's/#undef\tDOWNLOAD_PROTO_HTTPS/#define\tDOWNLOAD_PROTO_HTTPS/' ipxe/src/config/general.h

Then, let's update our boot script to point to the new server:

#!ipxe

echo Welcome to Valve infra's iPXE boot script

:retry
echo Acquiring an IP
dhcp || goto retry # Keep retrying getting an IP, until we get one
echo Got the IP: $${netX/ip} / $${netX/netmask}

echo

echo Chainloading from the iPXE server...
chain https://<your server>/boot.ipxe

# The boot failed, let's restart!
goto retry

And finally, let's re-compile iPXE, reflash the gateway pendrive, and boot the gateway!

$ make -C ipxe/src -j`nproc` bin/ipxe.iso EMBED=<boot script file>
$ sudo dd if=ipxe/src/bin/ipxe.iso of=/dev/sdX bs=1M conv=fsync status=progress

If all went well, the gateway should boot and run the hello-world container once again! Let's continue our journey by provisioning and backing up the local storage of the gateway!

Provisioning and backups of the local storage

In the previous section, we managed to control the boot configuration of our gateway via a public HTTPS server. In this section, we will improve on that by provisioning and backing up any local file the gateway container may need.

Boot2container has a nice feature that enables you to create a volume, provision it from a bucket in an S3-compatible cloud storage, and sync back any local change. This is done by adding the following arguments to the kernel command line:

  • b2c.minio="s3,${s3_endpoint},${s3_access_key_id},${s3_access_key}": URL and credentials to the S3 service
  • b2c.volume="perm,mirror=s3/${s3_bucket_name},pull_on=pipeline_start,push_on=changes,overwrite,delete": Create a perm podman volume, mirror it from the bucket ${s3_bucket_name} when booting the gateway, then push any local change back to the bucket. Delete or overwrite any existing file when mirroring.
  • b2c.container="-ti -v perm:/mnt/perm docker://alpine": Start an alpine container, and mount the perm container volume to /mnt/perm

Pretty neat, isn't it? Provided that your bucket is configured to save all the revisions of every file, this trick kills three birds with one stone: initial provisioning, backups, and automatic recovery of the files in case the local disk fails and gets replaced with a new one!
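As a concrete illustration, here is a small Python sketch that assembles such a kernel command line from the b2c parameters listed above. The parameter names come from boot2container; the endpoint, credentials, and bucket name are placeholder values:

```python
# Sketch: assemble the boot2container kernel command-line arguments.
# All values below are placeholders, not real credentials.
s3_endpoint = "s3.example.com"
s3_access_key_id = "AKIAEXAMPLE"
s3_access_key = "secretkey"
s3_bucket_name = "gateway-files"

cmdline = " ".join([
    f'b2c.minio="s3,{s3_endpoint},{s3_access_key_id},{s3_access_key}"',
    f'b2c.volume="perm,mirror=s3/{s3_bucket_name},pull_on=pipeline_start,'
    'push_on=changes,overwrite,delete"',
    'b2c.container="-ti -v perm:/mnt/perm docker://alpine"',
])
print(cmdline)
```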

The issue is that the boot configuration is currently open for everyone to see, if they know where to look. This means that anyone could tamper with your local storage or even use your bucket to store their files...

Securing the access to the local storage

To prevent attackers from stealing our S3 credentials by simply pointing their web browser to the right URL, we can authenticate incoming HTTPS requests by using an SSL client certificate. A different certificate would be embedded in every gateway's iPXE bootloader and checked by NGINX before serving the boot configuration for this precise gateway. By limiting access to a machine's boot configuration to its associated client certificate fingerprint, we even prevent compromised machines from accessing the data of other machines.
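As a rough sketch of what this could look like on the NGINX side (ssl_verify_client and the $ssl_client_fingerprint variable are standard NGINX features, but the certificate paths and directory layout here are assumptions, not the actual valve-infra configuration):

```nginx
server {
    listen 443 ssl;
    ssl_certificate        /etc/letsencrypt/live/example.com/fullchain.pem;
    ssl_certificate_key    /etc/letsencrypt/live/example.com/privkey.pem;

    # CA that signed the per-gateway client certificates
    ssl_client_certificate /etc/nginx/gateway-ca.pem;
    ssl_verify_client      on;

    location /files/ {
        # $ssl_client_fingerprint is the SHA1 fingerprint of the presented
        # client certificate, so each gateway can only reach its own directory
        alias /srv/ipxe/files/$ssl_client_fingerprint/;
    }
}
```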

Additionally, secrets should not be kept in the kernel command line, as any process executed on the gateway could easily gain access to them by reading /proc/cmdline. To address this issue, boot2container has a b2c.extra_args_url argument to source additional parameters from a URL. If this URL is generated every time the gateway downloads its boot configuration, can be accessed only once, and expires soon after being created, then secrets can be kept private inside boot2container and not be exposed to the containers it starts.
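To make the one-time-URL idea concrete, here is a hypothetical Python sketch; the function names and the in-memory store are inventions for illustration, not how the actual valve-infra service is implemented:

```python
import secrets
import time

# Sketch of a one-time, expiring secrets URL: each token can be
# redeemed exactly once, and only within its time window.
_pending = {}  # token -> (expiry timestamp, secret payload)

def create_secrets_url(payload, ttl=30):
    """Generate a fresh single-use URL carrying the given secret payload."""
    token = secrets.token_urlsafe(32)
    _pending[token] = (time.monotonic() + ttl, payload)
    return f"/secrets/{token}"

def redeem(token):
    """Return the payload on first use within the TTL, None otherwise."""
    entry = _pending.pop(token, None)  # one shot: always removed
    if entry is None:
        return None
    expiry, payload = entry
    if time.monotonic() > expiry:
        return None
    return payload
```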

Implementing these suggestions in a blog post is a little tricky, so I suggest you check out valve-infra's ipxe-boot-server component for more details. It provides a Makefile that makes it super easy to generate working certificates and create bootable gateway ISOs, a small python-based web service that will serve the right configuration to every gateway (including one-time secrets), and step-by-step instructions to deploy everything!

Assuming you decided to use this component and followed the README, you should then configure the gateway in this way:

$ pwd
/home/ipxe/valve-infra/ipxe-boot-server/files/<fingerprint of your gateway>/
$ ls
boot.ipxe  initrd  kernel  secrets
$ cat boot.ipxe
#!ipxe

kernel /files/kernel b2c.extra_args_url="${secrets_url}" b2c.container="-v perm:/mnt/perm docker://alpine" b2c.ntp_peer=auto b2c.cache_device=auto
initrd /files/initrd
boot
$ cat secrets
b2c.minio="bbz,${s3_endpoint},${s3_access_key_id},${s3_access_key}" b2c.volume="perm,mirror=bbz/${s3_bucket_name},pull_on=pipeline_start,push_on=changes,overwrite,delete"

And that's it! We finally made it to the end, and created a secure way to provision our CI gateways with the desired kernel, operating system, and even local files!

When Charlie Turner and I started designing this system, we felt it would be a clean and simple way to solve our problems with our CI gateways, but the implementation ended up being quite a bit trickier than the high-level view... especially the SSL certificates! However, the certainty that we can now deploy updates and fix our CI gateways even when they are physically inaccessible to us (provided the hardware and PDU are fine) definitely made it all worth it and made the prospect of having users depending on our systems less scary!

Let us know how you feel about it!

Conclusion

In this post, we focused on provisioning the CI gateway with its boot configuration and local files via the internet. This drastically reduces the risk that updating the gateway's kernel would result in an extended loss of service, as the kernel configuration can quickly be reverted by changing the boot config files, which are served from a cloud service provider.

The local file provisioning system also doubles as a backup and disaster-recovery system, which will automatically kick in in case of hardware failure thanks to the constant mirroring of the local files to an S3-compatible cloud storage bucket.

In the next post, we will be talking about how to create the infra container, and how we can minimize downtime during updates by not needing to reboot the gateway.

That's all for now, thanks for making it to the end!

This Is A Serious Blog

I posted some fun fluff pieces last week to kick off the new year, but now it’s time to get down to brass tacks.

Everyone knows adding features is just flipping on the enable button. Now it’s time to see some real work.

If you don’t like real work, stop reading. Stop right now. Now.

Alright, now that all the haters are gone, let’s put on our bisecting snorkels and dive in.

Regressions Suck

The dream of 2022 was that I’d come back and everything would work exactly how I left it. All the same tests would pass, all the perf would be there, and my driver would compile.

I got two of those things, which isn’t too bad.

After spending a while bisecting and debugging last week, I attributed a number of regressions to RADV problems which probably only affect me since there are no Vulkan CTS cases for them (yet). But today I came to the last of the problem cases: dEQP-GLES31.functional.tessellation_geometry_interaction.feedback.tessellation_output_quads_geometry_output_points.

There’s nothing too remarkable about the test. It’s XFB, so, according to Jason Ekstrand, future Head of Graphic Wows at Pixar, it’s terrible.

What is remarkable, however, is that the test passes fine when run in isolation.

Here We Go Again

Anyone who’s anyone knows what comes next.

  • You find the caselist of the other 499 tests that were run in this block
  • You run the caselist
  • You find out that the test still fails in that caselist
  • You tip your head back to stare at the ceiling and groan

Then it’s another X minutes (where X is usually between 5 and 180 depending on test runtimes) to slowly pare down the caselist to the sequence which actually triggers the failure. For those not in the know, this type of failure indicates a pathological driver bug where a sequence of commands triggers different results if tests are run in a different order.

There is, to my knowledge, no ‘automatic’ way to determine exactly which tests are required to trigger this type of failure from a caselist. It would be great if there was, and it would save me (and probably others who are similarly unaware) considerable time doing this type of caselist fuzzing.
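In the meantime, the manual process can at least be semi-automated with a greedy reduction loop, sketched below in Python. Here `still_fails` is a stand-in for actually running the caselist through deqp and checking whether the target test fails:

```python
def reduce_caselist(cases, still_fails, target):
    """Greedily drop tests that aren't needed for `target` to fail.

    `still_fails(caselist)` must run the tests in order and report
    whether `target` still fails; in practice it would invoke the
    deqp runner on a temporary caselist file.
    """
    kept = list(cases)
    for case in list(cases):
        if case == target:
            continue  # never drop the test we're trying to reproduce
        trial = [c for c in kept if c != case]
        if still_fails(trial):
            kept = trial  # this test wasn't needed to trigger the bug
    return kept
```

This is a greedy reduction needing one full run per test in the block, not a true delta-debugging minimizer, but it automates the tedious part of the caselist fuzzing.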

Finally, I was left with this shortened caselist:

dEQP-GLES31.functional.shaders.builtin_constants.tessellation_shader.max_tess_evaluation_texture_image_units
dEQP-GLES31.functional.tessellation_geometry_interaction.feedback.tessellation_output_quads_geometry_output_points
dEQP-GLES31.functional.ubo.random.all_per_block_buffers.25

What Now?

Ideally, it would be great to be able to use something like gfxreconstruct for this. I could record two captures—one of the test failing in the caselist and one where it passes in isolation—and then compare them.

Here’s an excerpt from that attempt:

"[790]vkCreateShaderModule": {
    "return": "VK_SUCCESS",
    "device": "0x0x4",
    "pCreateInfo": {
        "sType": "VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO",
        "pNext": null,
        "flags": 0,
        "codeSize": Unhandled VkFormatFeatureFlagBits2KHR,
        "pCode": "0x0x285c8e0"
    },
    "pAllocator": null,
    "[out]pShaderModule": "0x0xe0"
},

Why is it trying to print an enum value for codeSize you might ask?

I’m not the only one to ask, and it’s still an unresolved mystery.

I was successful in doing the comparison with gfxreconstruct, but it yielded nothing of interest.

Puzzled, I decided to try the test out on lavapipe. Would it pass?

No.

It similarly fails on llvmpipe and IRIS.

But my lavapipe testing revealed an important clue. Given that there are no synchronization issues with lavapipe, this meant I could be certain this was a zink bug. Furthermore, the test failed both when the bug was exhibiting and when it wasn’t, meaning that I could actually see the “passing” values in addition to the failing ones for comparison.

Here’s the failing error output:

Verifying feedback results.
Element at index 0 (tessellation invocation 0) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.166663, 0.5, 1)
Element at index 1 (tessellation invocation 1) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.166663, 0.5, 1)
Element at index 2 (tessellation invocation 2) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.166663, 0.5, 1)
Element at index 3 (tessellation invocation 3) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.433337, 0.5, 1)
Element at index 4 (tessellation invocation 4) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.433337, 0.5, 1)
Element at index 5 (tessellation invocation 5) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.433337, 0.5, 1)
Element at index 6 (tessellation invocation 6) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.4, -0.433337, 0.5, 1)
Element at index 7 (tessellation invocation 7) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.4, -0.433337, 0.5, 1)
Omitted 24 error(s).

And here’s the passing error output:

Verifying feedback results.
Element at index 3 (tessellation invocation 1) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.433337, 0, 1)
Element at index 4 (tessellation invocation 2) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.433337, 0, 1)
Element at index 5 (tessellation invocation 3) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.433337, 0, 1)
Element at index 6 (tessellation invocation 4) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.4, -0.433337, 0, 1)
Element at index 7 (tessellation invocation 5) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.4, -0.433337, 0, 1)
Element at index 8 (tessellation invocation 6) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.4, -0.433337, 0, 1)
Element at index 9 (tessellation invocation 7) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.433337, 0, 1)
Element at index 10 (tessellation invocation 8) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.433337, 0, 1)
Omitted 18 error(s).

This might not look like much, but to any zinkologists, there’s an immediate red flag: the Z component of the vertex is 0.5 in the failing case.

What does this remind us of?

Naturally it reminds us of nir_lower_clip_halfz, the compiler pass which converts OpenGL Z coordinate ranges ([-1, 1]) to Vulkan ([0, 1]). This pass is run on the last vertex stage, but if it gets run more than once, a value of -1 becomes 0.5.
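The arithmetic behind that 0.5 is easy to check: with w = 1, the pass computes z' = (z + w) / 2, so running it a second time turns a correctly lowered near-plane value into garbage (a toy calculation, not the actual NIR pass):

```python
def lower_clip_halfz(z, w=1.0):
    # GL clip space [-w, w] -> Vulkan clip space [0, w]
    return (z + w) * 0.5

z = -1.0                        # GL near plane
once = lower_clip_halfz(z)      # 0.0: correct Vulkan near plane
twice = lower_clip_halfz(once)  # 0.5: the bogus value from the failing run
print(once, twice)
```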

Thus, it looks like the pass is being run twice in this test. How can this be verified?

ZINK_DEBUG=spirv will export all the SPIR-V shaders used by an app. Therefore, dumping the shaders for both passing and failing runs and comparing them should confirm whether the conversion pass is being run an extra time. The verdict?

@@ -1,7 +1,7 @@
 ; SPIR-V
 ; Version: 1.5
 ; Generator: Khronos; 0
-; Bound: 23
+; Bound: 38
 ; Schema: 0
                OpCapability TransformFeedback
                OpCapability Shader
@@ -36,13 +36,28 @@
 %_ptr_Output_v4float = OpTypePointer Output %v4float
 %gl_Position = OpVariable %_ptr_Output_v4float Output
      %v4uint = OpTypeVector %uint 4
+%uint_1056964608 = OpConstant %uint 1056964608
        %main = OpFunction %void None %3
          %18 = OpLabel
                OpBranch %17
          %17 = OpLabel
          %19 = OpLoad %v4float %a_position
          %21 = OpBitcast %v4uint %19
-         %22 = OpBitcast %v4float %21
-               OpStore %gl_Position %22
+         %22 = OpCompositeExtract %uint %21 3
+         %23 = OpCompositeExtract %uint %21 3
+         %24 = OpCompositeExtract %uint %21 2
+         %25 = OpBitcast %float %24
+         %26 = OpBitcast %float %23
+         %27 = OpFAdd %float %25 %26
+         %28 = OpBitcast %uint %27
+         %30 = OpBitcast %float %28
+         %31 = OpBitcast %float %uint_1056964608
+         %32 = OpFMul %float %30 %31
+         %33 = OpBitcast %uint %32
+         %34 = OpCompositeExtract %uint %21 1
+         %35 = OpCompositeExtract %uint %21 0
+         %36 = OpCompositeConstruct %v4uint %35 %34 %33 %22
+         %37 = OpBitcast %v4float %36
+               OpStore %gl_Position %37
                OpReturn
                OpFunctionEnd

And, as is the rule for such things, the fix was a simple one-liner to unset values in the vertex shader key.

It wasn’t technically a regression, but it manifested as such, and fixing it yielded another dozen or so fixes for cases which were affected by the same issue.

Blammo.

January 07, 2022

PyCook

A few months ago, I went on a quest to better digitize and collect a bunch of the recipes I use on a regular basis. Like most people, I’ve got a 3-ring binder of stuff I’ve printed from the internet, a box with the usual 4x6 cards, most of which are hand-written, and a stack of cookbooks. I wanted something that could be both digital and physical and which would make recipes easier to share. I also wanted whatever storage system I developed to be something stupid simple. If there’s one thing I’ve learned about myself over the years it’s that if I make something too hard, I’ll never get around to it.

This led to a few requirements:

  1. A simple human-writable file format

  2. Recipes organized as individual files in a Git repo

  3. A simple tool to convert to a web page and print formats

  4. That tool should be able to produce 4x6 cards

Enter PyCook. It does pretty much exactly what the above says and not much more. The file format is based on YAML and, IMO, is beautiful to look at all by itself and very easy to remember how to type:

name: Swedish Meatballs
from: Grandma Ekstrand

ingredients:
  - 1 [lb] Ground beef (80%)
  - 1/4 [lb] Ground pork
  - 1 [tsp] Salt
  - 1 [tsp] Sugar
  - 1 Egg, beaten
  - 1 [tbsp] Catsup
  - 1 Onion, grated
  - 1 [cup] Bread crumbs
  - 1 [cup] Milk

instructions:
  - Soak crumbs in milk

  - Mix all together well

  - Make into 1/2 -- 3/4 [in] balls and brown in a pan on the stove
    (you don't need to cook them through; just brown)

  - Heat in a casserole dish in the oven until cooked through.

Over Christmas this year, I got my mom and a couple siblings together to compile a list of family favorite recipes. The format is simple enough that my non-technical mother was able to help type up recipes and they needed very little editing before PyCook would consume them. To me, that’s a pretty good indicator that the file format is simple enough. 😁

The one bit of smarts it does have is around quantities and units. One thing that constantly annoys me with recipes is the inconsistency with abbreviations of units. Take tablespoons, for instance. I’ve seen it abbreviated “T” (as opposed to “t” for teaspoon), “tbsp”, or “tblsp”, sometimes with a “.” after the abbreviation and sometimes not. To handle this, I have a tiny macro language where units have standard abbreviations and are placed in brackets. This is then substituted with the correct abbreviation to let me change them all in one go if I ever want to. It’s also capable of handling plurals properly so when you type [cup] it will turn into either “cup” or “cups” depending on the associated number. It also has smarts to detect “1/4” and turn that into vulgar fraction character “¼” for HTML output or a nice LaTeX fraction when doing PDF output.
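A toy version of that substitution might look like the following (this is an illustrative sketch, not PyCook’s actual implementation; the unit table and regex are made up for the example):

```python
import re

# Hypothetical sketch of PyCook-style unit macros: "[cup]" etc. expand
# to a canonical abbreviation, pluralized based on the preceding number.
UNITS = {
    "tsp": ("tsp", "tsp"),
    "tbsp": ("tbsp", "tbsp"),
    "cup": ("cup", "cups"),
    "lb": ("lb", "lb"),
    "in": ("in", "in"),
}

def expand_units(line):
    def sub(match):
        qty, unit = match.groups()
        singular, plural = UNITS[unit]
        # "1" and fractions like "1/2" stay singular; anything else pluralizes
        is_singular = qty == "1" or "/" in qty
        return f"{qty} {singular if is_singular else plural}"
    return re.sub(r"(\S+) \[(\w+)\]", sub, line)

print(expand_units("2 [cup] Milk"))     # "2 cups Milk"
print(expand_units("1 [tbsp] Catsup"))  # "1 tbsp Catsup"
```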

That brings me to the PDF generator. One of my requirements was the ability to produce 4x6 cards that I could use while cooking instead of having an electronic device near all those fluids. I started with the LaTeX template I got from my friend Steve Schulteis for making a PDF cookbook and adapted it to 4x6 cards, one recipe per card. And the results look pretty nice, I think.

Meatballs recipe as a 4x6 card

It’s even able to nicely paginate the 4x6 cards into a double-sided 8.5x11 PDF which prints onto Avery 5389 templates.

Other output formats include an 8.5x11 PDF cookbook with as many recipes per page as will fit and an HTML web version generated using Sphinx which is searchable and has a nice index.

P.S. Yes, that’s a real Swedish meatball recipe and it makes way better meatballs than the ones from IKEA.

New blog!

This week, I’ve taken advantage of some of my funemployment time to rework our website infrastructure. As part of my new job (more on that soon!) I’ll be communicating more publicly and I need a better blog space. I set up a Blogger blog a few years ago but I really hate it. It’s not that Blogger is a terrible platform per se. It just doesn’t integrate well with the rest of my website and the comments I got were 90% spam.

At the time, I used Blogger because I didn’t want to mess with implementing a blog on my own website infrastructure. Why? The honest answer is an object lesson in software engineering. The last time I re-built my website, I thought that building a website generator sounded like a fantastic excuse to learn some Ruby. Why not? It’s a great programming language for web stuff and learning new languages is good, right? While those are both true, software maintainability is important. When I went to try and add a blog, it’d been nearly 4 years since I’d written a line of Ruby code, the static site generation framework I was using (nanoc) had moved to a backwards-incompatible new version, and I had no clue how to move my site forward without re-learning Ruby and nanoc and rewriting it from scratch. I didn’t have the time for that.

This time, I learned my lesson and went all C programmer on it. The new infra is built in Python, a language I use nearly daily in my Mesa work and, instead of using someone else’s static site generation infra, I rolled my own. I’m using Jinja2 for templating as well as Pygments for syntax highlighting and Pandoc for Markdown to HTML conversion. They’re all libraries, not frameworks, so they take a lot less learning to figure out and remember how to use. Jinja2 is something of a framework (it’s a templating language) but it’s so obvious and user-friendly that it’s basically impossible to forget how to use.

Sure, my infra isn’t nearly as nice as some. I don’t have caching and I have to re-build the whole thing every time. I also didn’t bother to implement RSS feed support, comments, or any of those other bloggy features. But, frankly, I don’t care. Even on a crappy raspberry pi, my website builds in a few seconds. As far as RSS goes, who actually uses RSS readers these days? Just follow me on Twitter (@jekstrand_) if you want to know when I blog stuff. Comments? 95% of the comments I got on Blogger were spam anyway. If you want to comment, reply to my tweet.

The old blog lived at jason-blog.jlekstrand.net while the new one lives at www.jlekstrand.net/jason/blog/. I’ve set up my webserver to re-direct from the old one and gone through the painful process of checking every single post to ensure the re-directs work. Any links you may have should still work.

So there you have it! A new website framework written in all of 2 days. Hopefully, this one will be maintainable and, if I need to extend it, I can without re-learning whole programming languages. I’m also hoping that now that blogging is as easy as writing some markdown and doing git push, I’ll actually blog more.

January 03, 2022

It appears that Google created a handy tool that helps find the command which causes a GPU hang/crash. It is called Graphics Flight Recorder (GFR) and was open-sourced a year ago but didn’t receive any attention. From the readme:

The Graphics Flight Recorder (GFR) is a Vulkan layer to help trackdown and identify the cause of GPU hangs and crashes. It works by instrumenting command buffers with completion tags. When an error is detected a log file containing incomplete command buffers is written. Often the last complete or incomplete commands are responsible for the crash.

It requires VK_AMD_buffer_marker support; however, this extension is rather trivial to implement - I only had to copy-paste the code from our vkCmdSetEvent implementation and that was it. Note that, at the moment of writing, GFR unconditionally uses VK_AMD_device_coherent_memory, which could be manually patched out for it to run on other GPUs.

GFR already helped me to fix hangs in “Alien: Isolation” and “Digital Combat Simulator”. In both cases the hang was in a compute shader and the output from GFR looked like:

...
      - # Command:
        id: 6/9
        markerValue: 0x000A0006
        name: vkCmdBindPipeline
        state: [SUBMITTED_EXECUTION_COMPLETE]
        parameters:
          - # parameter:
            name: commandBuffer
            value: 0x000000558CFD2A10
          - # parameter:
            name: pipelineBindPoint
            value: 1
          - # parameter:
            name: pipeline
            value: 0x000000558D3D6750
      - # Command:
        id: 6/9
        message: '>>>>>>>>>>>>>> LAST COMPLETE COMMAND <<<<<<<<<<<<<<'
      - # Command:
        id: 7/9
        markerValue: 0x000A0007
        name: vkCmdDispatch
        state: [SUBMITTED_EXECUTION_INCOMPLETE]
        parameters:
          - # parameter:
            name: commandBuffer
            value: 0x000000558CFD2A10
          - # parameter:
            name: groupCountX
            value: 5
          - # parameter:
            name: groupCountY
            value: 1
          - # parameter:
            name: groupCountZ
            value: 1
        internalState:
          pipeline:
            vkHandle: 0x000000558D3D6750
            bindPoint: compute
            shaderInfos:
              - # shaderInfo:
                stage: cs
                module: (0x000000558F82B2A0)
                entry: "main"
          descriptorSets:
            - # descriptorSet:
              index: 0
              set: 0x000000558E498728
      - # Command:
        id: 8/9
        markerValue: 0x000A0008
        name: vkCmdPipelineBarrier
        state: [SUBMITTED_EXECUTION_NOT_STARTED]
...

After confirming that the corresponding vkCmdDispatch was indeed the call which hangs, in both cases I made an Amber test which fully simulated the call. For a compute shader, this is relatively easy to do since all you need to do is save the decompiled shader and the buffers it uses. Luckily, in both cases these Amber tests reproduced the hangs.

With standalone reproducers, the problems were much easier to debug, and fixes were made shortly: MR#14044 for “Alien: Isolation” and MR#14110 for “Digital Combat Simulator”.

Unfortunately this tool is not a panacea:

  • It likely would fail to help with unrecoverable hangs where it would be impossible to read the completion tags back.
  • Or when the mere addition of the tags could “fix” the issue which may happen with synchronization issues.
  • If draw/dispatch calls run in parallel on the GPU, writing tags may force them to execute sequentially, or the tags may be imprecise.

Anyway, it’s easy to use so you should give it a try.

December 09, 2021
Starting with kernel 5.17 the kernel supports the builtin privacy screens built into the LCD panel of some new laptop models.

This means that the drm drivers will now return -EPROBE_DEFER from their probe() method on models with a builtin privacy screen when the privacy screen provider driver has not been loaded yet.

To avoid any regressions, distros should modify their initrd generation tools to include privacy screen provider drivers in the initrd (at least on systems with a privacy screen), before 5.17 kernels start showing up in their repos.

If this change is not made, then users using a graphical bootsplash (plymouth) will get an extra boot delay of up to 8 seconds (DeviceTimeout in plymouthd.defaults) before plymouth shows, and when using disk encryption where the LUKS password is requested from the initrd, the system will fall back to text mode after these 8 seconds.

I've written a patch with the necessary changes for dracut, which might be useful as an example for how to deal with this in other initrd generators, see: https://github.com/dracutdevs/dracut/pull/1666

I've also filed bugs for tracking this for Fedora, openSUSE, Arch, Debian and Ubuntu.

Introduction

One of the big issues I have when working on Turnip driver development is that compiling either Mesa or VK-GL-CTS takes a lot of time to complete, no matter how powerful the embedded board is. There are reasons for that: typically those boards have a limited amount of RAM (8 GB in the best case), a slow storage disk (typically UFS 2.1 on-board storage), and CPUs that are not so powerful compared with x86_64 desktop alternatives.

Photo of the Qualcomm® Robotics RB3 Platform embedded board that I use for Turnip development.

To fix this, it is recommended to cross-compile; however, installing the development environment for cross-compilation can be cumbersome and error-prone depending on the toolchain you use. One alternative is to use a distributed compilation system that allows cross-compilation, like Icecream.

Icecream is a distributed compilation system that is very useful when you have to compile big projects and/or on low-spec machines, while having powerful machines in the local network that can do that job instead. However, it is not perfect: the linking stage is still done on the machine that submits the job, which, depending on the available RAM, could be too much for it (however, you can alleviate this a bit by using ZRAM, for example).

One of the features that icecream has over its alternatives is that there is no need to install the same toolchain in all the machines as it is able to share the toolchain among all of them. This is very useful as we will see below in this post.

Installation

Debian-based systems

$ sudo apt install icecc

Fedora systems

$ sudo dnf install icecream

Compile it from sources

You can compile it from sources.

Configuration of icecc scheduler

You need to have an icecc scheduler in the local network that will balance the load among all the available nodes connected to it.

It does not matter which machine is the scheduler, you can use any of them as it is quite lightweight. To run the scheduler execute the following command:

$ sudo icecc-scheduler

Notice that the machine running this command is going to be the scheduler, but it will not participate in the compilation process by default unless you run the iceccd daemon as well (see next step).

Setup on icecc nodes

Launch daemon

First you need to run the iceccd daemon as root. This is not needed on Debian-based systems, as its systemd unit is enabled by default.

You can do that using systemd in the following way:

$ sudo systemctl start iceccd

Or you can enable the daemon at startup time:

$ sudo systemctl enable iceccd

The daemon will connect automatically to the scheduler that is running in the local network. If that’s not the case, or if there is more than one scheduler, you can run it standalone and give the scheduler’s IP as a parameter:

$ sudo iceccd -s <ip_scheduler>

Enable icecc compilation

With ccache

If you use ccache (recommended option), you just need to add the following in your .bashrc:

export CCACHE_PREFIX=icecc

Without ccache

To use it without ccache, you need to add its path to $PATH envvar so it is picked before the system compilers:

export PATH=/usr/lib/icecc/bin:$PATH

Execution

Same architecture

If you followed the previous steps, any time you compile C/C++ code, the work will be distributed among the fastest nodes in the network. Notice that it will take into account system load, network connection, and number of cores, among other variables, to decide which node will compile the object file.

Remember that the linking stage is always done on the machine that submits the job.

Different architectures (example cross-compiling for aarch64 on x86_64 nodes)

Icecream Icemon showing my x86_64 desktop (maxwell) cross-compiling a job for my aarch64 board (rb3).

Preparation on x86_64 machine

On one x86_64 machine, you need to create a toolchain. This is not automatically done by icecc, as you can have different toolchains for cross-compilation.

Install cross-compiler

For example, you can install the cross-compiler from the distribution repositories:

For Debian-based systems:

$ sudo apt install crossbuild-essential-arm64

For Fedora:

$ sudo dnf install gcc-aarch64-linux-gnu gcc-c++-aarch64-linux-gnu

Create toolchain for icecc

Finally, to create the toolchain to share in icecc:

$ icecc-create-env --gcc /usr/bin/aarch64-linux-gnu-gcc /usr/bin/aarch64-linux-gnu-g++

This will create a <hash>.tar.gz file. The <hash> is used to identify the toolchain to distribute among the nodes in case there is more than one. But don’t worry, once it is copied to a node, it won’t be copied again as it detects it is already present.

Note: it is important that the toolchain is compatible with the target machine. For example, if my aarch64 board is using Debian 11 Bullseye, it is better if the cross-compilation toolchain is created from a Debian Bullseye x86_64 machine (a VM also works), because you avoid incompatibilities like having different glibc versions.

If you have installed Debian 11 Bullseye in your aarch64, you can use my own cross-compilation toolchain for x86_64 and skip this step.

Copy the toolchain to the aarch64 machine

$ scp <hash>.tar.gz aarch64-machine-hostname:

Preparation on aarch64

Once the toolchain (<hash>.tar.gz) is copied to the aarch64 machine, you just need to export this in your .bashrc:

# Icecc setup for crosscompilation
export CCACHE_PREFIX=icecc
export ICECC_VERSION=x86_64:~/<hash>.tar.gz

Execute

Just compile on the aarch64 machine and the jobs will be distributed among your x86_64 machines as well. Take into account that the jobs may also be shared among other aarch64 machines if icecc decides so; there is no need for any extra steps.

It is important to remark that the cross-compilation toolchain creation is only needed once, as icecream will copy it on all the x86_64 machines that will execute any job launched by this aarch64 machine. However, you need to copy this toolchain to any aarch64 machines that will use icecream resources for cross-compiling.

Icecream monitor

Icemon

This is an interesting graphical tool to see the status of the icecc nodes and the jobs under execution.

Install on Debian-based systems

$ sudo apt install icecc-monitor

Install on Fedora

$ sudo dnf install icemon

Install it from sources

You can compile it from sources.

Acknowledgments

Even though icecream has good cross-compilation documentation, it was a post written 8 years ago by my Igalia colleague Víctor Jáquez that convinced me to set up icecream as explained in this post.

Hope you find this info as useful as I did :-)

December 06, 2021

On the road to AppStream 1.0, a lot of items from the long todo list have been done so far – only one major feature is remaining, external release descriptions, which is a tricky one to implement and specify. For AppStream 1.0 it needs to be present or be rejected though, as it would be a major change in how release data is handled in AppStream.

Besides 1.0 preparation work, the recent 0.15 release and the releases before it come with their very own large set of changes, that are worth a look and may be interesting for your application to support. But first, for a change that affects the implementation and not the XML format:

1. Completely rewritten caching code

Keeping all AppStream data in memory is expensive, especially if the data is huge (as on Debian and Ubuntu with their large repositories generated from desktop-entry files) and if processes using AppStream are long-running. The latter is increasingly the case: not only does GNOME Software run in the background, KDE uses AppStream in KRunner and Phosh will use it too for reading form factor information. Therefore, AppStream via libappstream provides an on-disk cache that is memory-mapped, so data only consumes RAM if we are actually doing anything with it.

Previously, AppStream used an LMDB-based cache in the background, with indices for fulltext search and other common search operations. This was a very fast solution, but it came with limitations: LMDB’s maximum key size of 511 bytes became a problem quite often, adjusting the maximum database size (since it has to be set at opening time) was annoyingly tricky, and building dedicated indices for each search operation was very inflexible. The caching code had also been changed multiple times in the past to allow system-wide metadata to be cached per user, as some distributions didn’t (want to) build a system-wide cache and therefore ran into performance issues when XML was parsed repeatedly to generate a temporary cache. On top of all that, the cache was designed around the concept of “one cache for data from all sources”, which meant that we had to rebuild it entirely if just a small aspect changed, like a MetaInfo file being added to /usr/share/metainfo, which was very inefficient.

To make a long story short, the old caching code was rewritten around the new concepts of caches not necessarily being system-wide and of caches existing for more fine-grained groups of files. The new caching code uses Richard Hughes’ excellent libxmlb internally for memory-mapped data storage. Unlike LMDB, libxmlb knows about the XML document model, so queries can be much more powerful and we do not need to build indices manually. The library is also already used by GNOME Software and fwupd for parsing of (refined) AppStream metadata, so it works quite well for that usecase. As a result, search queries via libappstream are now a bit slower (this very much depends on the query, roughly 20% on average), but can be much more powerful. The caching code is a lot more robust, which should speed up the startup time of applications. And in addition to all of that, the AsPool class has gained a flag to allow it to monitor AppStream source data for changes and refresh the cache fully automatically and transparently in the background.

All software written against the previous version of the libappstream library should continue to work with the new caching code, but to make use of some of the new features, software using it may need adjustments. A lot of methods have been deprecated too now.

2. Experimental compose support

Compiling MetaInfo and other metadata into AppStream collection metadata, extracting icons, language information, refining data and caching media is an involved process. The appstream-generator tool does this very well for data from Linux distribution sources, but the tool is also pretty “heavyweight”, with lots of knobs to adjust, an underlying database and a complex algorithm for icon extraction. Embedding it into other tools via anything but its command-line API is also not easy (due to D’s GC initialization, and because it was never written with that feature in mind). Sometimes a simpler tool is all you need, so the libappstream-compose library as well as appstreamcli compose are being developed at the moment. The library contains building blocks for developing a tool like appstream-generator, while the CLI tool allows you to simply extract metadata from any directory tree, which can be used by e.g. Flatpak. For this to work well, a lot of appstream-generator‘s D code is being translated into plain C, so the implementation stays identical but the language changes.

Ultimately, the generator tool will use libappstream-compose for any general data refinement, and only implement things necessary to extract data from the archive of distributions. New applications (e.g. for new bundling systems and other purposes) can then use the same building blocks to implement new data generators similar to appstream-generator with ease, sharing much of the code that would be identical between implementations anyway.

2. Supporting user input controls

Want to advertise that your application supports touch input? Keyboard input? Has support for graphics tablets? Gamepads? Sure, nothing is easier than that with the new control relation item and supports relation kind (since 0.12.11 / 0.15.0, details):

<supports>
  <control>pointing</control>
  <control>keyboard</control>
  <control>touch</control>
  <control>tablet</control>
</supports>

3. Defining minimum display size requirements

Some applications are unusable below a certain window size, so you do not want to display them in a software center that is running on a device with a small screen, like a phone. In order to encode this information in a flexible way, AppStream now contains a display_length relation item to require or recommend a minimum (or maximum) display size that the described GUI application can work with. For example:

<requires>
  <display_length compare="ge">360</display_length>
</requires>

This will make the application require a display length greater than or equal to 360 logical pixels. A logical pixel (also called a device-independent pixel) is the number of pixels that the application can draw in one direction. Since screens, especially phone screens but also desktop screens, can be rotated, the display_length value will be checked against the longest edge of a display by default (this can be changed by explicitly specifying the shorter edge).

This feature is available since 0.13.0, details. See also Tobias Bernard’s blog entry on this topic.

4. Tags

This is a feature that was originally requested for the LVFS/fwupd, but one of the great things about AppStream is that we can take very project-specific ideas and generalize them so that something comes out that is useful for many. The new tags tag allows people to tag components with an arbitrary namespaced string. This can be useful for project-internal organization of applications, as well as to convey certain additional properties to a software center; e.g. an application could mark itself as “featured” in a specific software center only. Metadata generators may also add their own tags to components to improve organization. AppStream gives no recommendations as to how these tags are to be interpreted, except that they are a strictly optional feature, so any meaning is something clients and metadata authors need to negotiate. It therefore is a more specialized usecase of the already existing custom tag, and I expect it to be primarily useful within larger organizations that produce a lot of software components that need sorting. For example:

<tags>
  <tag namespace="lvfs">vendor-2021q1</tag>
  <tag namespace="plasma">featured</tag>
</tags>

This feature is available since 0.15.0, details.

5. MetaInfo Creator changes

The MetaInfo Creator (source) tool is a very simple web application that provides you with a form to fill out and will then generate MetaInfo XML to add to your project after you have answered all of its questions. It is an easy way for developers to add the required metadata without having to read the specification or any guides at all.

Recently, I added support for the new control and display_length tags, resolved a few minor issues and also added a button to instantly copy the generated output to the clipboard so people can paste it into their project. If you want to create a new MetaInfo file, this tool is the best way to do it!

The creator tool also does not transfer any data out of your web browser; it is strictly a client-side application.

And that is about it for the most notable changes in AppStream land! Of course there is a lot more, additional tags for the LVFS and content rating have been added, lots of bugs have been squashed, the documentation has been refined a lot and the library has gained a lot of new API to make building software centers easier. Still, there is a lot to do and quite a few open feature requests too. Onwards to 1.0!

December 02, 2021

Khronos submission indicating Vulkan 1.1 conformance for Turnip on Adreno 618 GPU.

It is a great feat, especially for a driver created without hardware documentation. And we support features well beyond the bare minimum required for conformance.

But first of all, I want to thank and congratulate everyone working on the driver: Connor Abbott, Rob Clark, Emma Anholt, Jonathan Marek, Hyunjun Ko, Samuel Iglesias. And special thanks to Samuel Iglesias and Ricardo Garcia for tirelessly improving Khronos Vulkan Conformance Tests.


At the start of the year, when I started working on Turnip, I looked at the list of failing tests and thought “It wouldn’t take a lot to fix them!”, right, sure… And so I started fixing issues alongside of looking for missing features.

In June there were even more failures than in January. How could that be? Of course we were adding new features, and that accounted for some of them. However, even this list was likely not exhaustive, because in GitLab CI, instead of running the whole Vulkan CTS suite, we ran 1/3 of it. We didn’t have enough devices to run the whole suite fast enough to make it usable in CI, so I just ran it locally from time to time.

1/3 of the tests doesn’t sound bad and for the most part it’s good enough since we have a huge amount of tests looking like this:

dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_clear_copy
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_clear_copy_format_list
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_clear_load
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_clear_load_format_list
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_clear_texture
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_clear_texture_format_list
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_copy_copy
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_copy_copy_format_list
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_copy_load
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_copy_load_format_list
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_copy_texture
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_copy_texture_format_list
...

Every format, every operation, etc. Tens of thousands of them.

Unfortunately, the selection of tests for a fractional run is as straightforward as possible: just every third test. This bites us when there are single unique tests, like:

dEQP-VK.fragment_operations.early_fragment.no_early_fragment_tests_depth
dEQP-VK.fragment_operations.early_fragment.no_early_fragment_tests_stencil
dEQP-VK.fragment_operations.early_fragment.early_fragment_tests_depth
dEQP-VK.fragment_operations.early_fragment.early_fragment_tests_stencil
dEQP-VK.fragment_operations.early_fragment.no_early_fragment_tests_depth_no_attachment
dEQP-VK.fragment_operations.early_fragment.no_early_fragment_tests_stencil_no_attachment
dEQP-VK.fragment_operations.early_fragment.early_fragment_tests_depth_no_attachment
dEQP-VK.fragment_operations.early_fragment.early_fragment_tests_stencil_no_attachment
...

Most of them test something unique that has a much higher probability of triggering a special path in a driver than the uncountable image tests. And they fell through the cracks. I even had to fix one test twice because the CI didn’t run it.

A possible solution is to skip tests only when there is a large swath of them and run smaller groups as-is. But it’s likely more productive to just throw more hardware at the issue =).
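In effect, the naive fractional selection described above boils down to taking every Nth line of the caselist. A toy sketch (the real CI tooling does this internally; the file name and test names here are made up):

```shell
# Build a toy caselist, then take "every third test", starting at the first.
printf '%s\n' test-1 test-2 test-3 test-4 test-5 test-6 > /tmp/caselist.txt
awk 'NR % 3 == 1' /tmp/caselist.txt
```

This selects test-1 and test-4, and any unique test that happens to land on the other two thirds simply never runs.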

Not enough hardware in CI

Another problem is that we had only one 6xx sub-generation present in CI: Adreno 630. We distinguish four sub-generations. Not only do they have somewhat different capabilities, there are also differences in the capabilities they share, which can cause the same test to pass on CI while being broken on another, newer GPU. Presently in CI we test only Adreno 618 and 630, which are “Gen 1” GPUs, and we claimed conformance only for Adreno 618.

Yet another issue is that we could render in tiling and bypass (sysmem) modes. That’s because there are a few features we could support only when there is no tiling and we render directly into the sysmem, and sometimes rendering directly into sysmem is just faster. At the moment we use tiling rendering by default unless we meet an edge case, so by default CTS tests only tiling rendering.

We are forcing sysmem mode for a subset of tests on CI, however it’s not enough because the difference between modes is relevant for more than just a few tests. Thus ideally we should run twice as many tests, and even better would be thrice as many to account for tiling mode without binning vertex shader.

That issue became apparent when I implemented a magical eight-ball to choose between tiling and bypass modes depending on run-time information in order to squeeze out more performance (it’s still work-in-progress). The basic idea is that a single draw call, or a few small draw calls, is faster to render directly into system memory than to load the framebuffer into tile memory and store it back. But almost every single CTS test does exactly this: a single draw call or a few of them per render pass, which causes all tests to run in bypass mode. Fun!

Now we will be forced to deal with this issue, since with the magic eight-ball games would partly run in tiling mode and partly in bypass mode, making both modes equally important for real-world workloads.

Does conformance matter? Does it reflect anything real-world?

Unfortunately no test suite could wholly reflect what game developers do in their games. However, the amount of tests grows and new tests are getting contributed based on issues found in games and other applications.

When I ran my stash of D3D11 game traces through DXVK on Turnip for the first time, I found a bunch of new crashes and hangs, but it took fixing just a few of them for the majority of games to render correctly. This shows that the Khronos Vulkan Conformance Tests are doing their job, and we at Igalia are striving to make them even better.

One of the extensions released as part of Vulkan 1.2.199 was the VK_EXT_image_view_min_lod extension. I’m happy to see it published, as I participated in the release process of this extension: from reviewing the spec exhaustively (I even contributed a few things to improve it!) to developing CTS tests for it that will eventually be merged into the CTS repo.

This extension was proposed by Valve to mirror a feature present in Direct3D 12 (check ResourceMinLODClamp here) and Direct3D 11 (check SetResourceMinLOD here). In other words, this extension allows clamping the minimum LOD value accessed by an image view to a minLod value set at image view creation time.

That way, any library or API layer that translates Direct3D 11/12 calls to Vulkan can use the extension to mirror the behavior above on Vulkan directly without workarounds, facilitating the port of Direct3D applications such as games to Vulkan. For example, projects like Vkd3d, Vkd3d-proton and DXVK could benefit from it.

Going into more detail, this extension changes how the image level selection is calculated and, when enabled, sets an additional minimum on the image level used for integer texel coordinate operations.

The way to use this feature in an application is very simple:

  • Check that the extension is supported and that the physical device supports the respective feature:
// Provided by VK_EXT_image_view_min_lod
typedef struct VkPhysicalDeviceImageViewMinLodFeaturesEXT {
    VkStructureType    sType;
    void*              pNext;
    VkBool32           minLod;
} VkPhysicalDeviceImageViewMinLodFeaturesEXT;
  • Once you know everything is working, enable both the extension and the feature when creating the device.

  • When you want to create a VkImageView that defines a minLod for image accesses, then add the following structure filled with the value you want in VkImageViewCreateInfo’s pNext.

// Provided by VK_EXT_image_view_min_lod
typedef struct VkImageViewMinLodCreateInfoEXT {
    VkStructureType    sType;
    const void*        pNext;
    float              minLod;
} VkImageViewMinLodCreateInfoEXT;

And that’s all! As you see, it is a very simple extension.

Happy hacking!

November 24, 2021

I was interested in how much work a vaapi on top of vulkan video proof of concept would be.

My main reason for being interested is actually video encoding, there is no good vulkan video encoding demo yet, and I'm not experienced enough in the area to write one, but I can hack stuff. I think it is probably easier to hack a vaapi encode to vulkan video encode than write a demo app myself.

With that in mind I decided to see what decode would look like first. I talked to Mike B (most famous zink author) before he left for holidays, then I ignored everything he told me and wrote a super hack.

This morning I convinced zink vaapi on top of anv, with iris GL doing the presents in mpv, to show me some useful frames of video. However zink vaapi on anv with zink GL is failing miserably (well, green jellyfish).

I'm not sure how much more I'll push on the decode side at this stage, I really wanted it to validate the driver side code, and I've found a few bugs in there already.

The WIP hacks are at [1]. I might push on to the encode side and see if I can work out what it entails, though the encode spec work is a lot more changeable at the moment.

[1] https://gitlab.freedesktop.org/airlied/mesa/-/commits/zink-video-wip

November 18, 2021

If you own a laptop (Dell, HP, Lenovo) with a WWAN module, it is very likely that the modules are FCC-locked on every boot, and the special FCC unlock procedure needs to be run before they can be used.

Until ModemManager 1.18.2, the procedure was automatically run for the FCC unlock procedures we knew about, but this will no longer happen. Once 1.18.4 is out, the procedure will need to be explicitly enabled by each user, under their own responsibility, or otherwise implicitly enabled after installing an official FCC unlock tool provided by the manufacturer itself.

See a full description of the rationale behind this change in the ModemManager documentation site and the suggested code changes in the gitlab merge request.

If you want to enable the ModemManager provided unofficial FCC unlock tools once you have installed 1.18.4, run (assuming sysconfdir=/etc and datadir=/usr/share) this command (*):

sudo ln -sft /etc/ModemManager/fcc-unlock.d /usr/share/ModemManager/fcc-unlock.available.d/*

The user-enabled tools in /etc should not be removed during package upgrades, so this should be a one-time setup.

(*) Updated to have one single command instead of a for loop; thanks heftig!

November 15, 2021

Previously I mentioned having AMD VCN h264 support. Today I added initial support for the older UVD engine[1]. This is found on chips from Vega back to SI.

I've only tested it on my Vega so far.

I also worked out the "correct" answer to how to send the reset command properly; however, the nvidia player I'm using as a demo doesn't do things that way yet, so I've forked it for now[2].

The answer is to use vkCmdControlVideoCodingKHR to send a reset the first time a session is used. However, I can't see how the app is meant to know this is necessary; I've asked the appropriate people.

The initial anv branch I mentioned last week is now here[3].


[1] https://gitlab.freedesktop.org/airlied/mesa/-/commits/radv-vulkan-video-uvd-h264

[2] https://github.com/airlied/vk_video_samples/tree/radv-fixes

[3] https://gitlab.freedesktop.org/airlied/mesa/-/tree/anv-vulkan-video-prelim-decode

November 12, 2021

Last week I mentioned I had the basics of h264 decode using the proposed vulkan video on radv. This week I attempted to do the same thing with Intel's Mesa vulkan driver "anv".

Now, I'd previously unsuccessfully tried to get vaapi on crocus working, but got sidetracked into other projects. The Intel h264 decoder hasn't changed a lot across the ivb/hsw/gen8/gen9 era. I ported what I had from crocus to anv and started trying to get something to decode on my WhiskeyLake.

I wrote the code pretty early on, figured out all the things I had to send the hardware.

The first bridge to cross on the anv side was that Vulkan exposes an H264 picture-level decode API, which means you get handed the encoded slice data. However, to program the Intel hardware you need to decode the slice header, so I wrote a slice header decoder in some common code. The other thing you need to give the Intel hardware is the number of bits in the slice header, which in some encoding schemes is rounded to bytes and in some isn't. Slice headers also have a 3-byte header on them, which the Intel hardware wants you to discard or skip before handing the data over.

Once I'd fixed up that sort of thing in anv + crocus, I started getting grey I-frames decoded, with later B/P frames using the grey frames as references, so you'd see this kind of weird motion.

That was, I think, 3 days ago. I have stared at this intently for those 3 days, blaming everything from the bitstream encoding to rechecking all my packets (not enough times, though). I had someone else verify they could see grey frames.

Today, after a long discussion about possibilities, I was randomly comparing a frame from the intel-vaapi-driver and one from crocus, and I spotted a packet header that the docs say is 34 dwords long, but which intel-vaapi was only encoding as 16 dwords. I switched crocus to explicitly state a 16-dword length and started seeing my I-frames.

Now the B/P frames still have issues. I don't think I'm getting the ref frames logic right yet, but it felt like a decent win after 3 days of staring at it.

The crocus code is [1]. The anv code isn't cleaned up enough to post a pointer to yet, enterprising people might find it. Next week I'll clean it all up, and then start to ponder upstream paths and shared code for radv + anv. Then h265 maybe.

[1]https://gitlab.freedesktop.org/airlied/mesa/-/tree/crocus-media-wip

November 05, 2021

A few weeks ago I watched Victor's excellent talk on Vulkan Video. This made me question my skills in this area. I'm pretty vague on video processing hardware, I really have no understanding of H264 or any of the standards. I've been loosely following the Vulkan video group inside of Khronos, but I can't say I've understood it or been useful.

radeonsi has a gallium vaapi driver that talks to a firmware encoder/decoder on the hardware; surely copying what it programs can't be that hard. I got an mpv/vaapi setup running and tested some videos on that setup just to get comfortable, and looked at what sort of data was being pushed about.

The thing is the firmware is doing all the work here, the driver is mostly just responsible for taking semi-parsed h264 bitstream data structures and giving them in memory buffers to the fw API. Then the resulting decoded image should be magically in a buffer.

I then got the demo nvidia video decoder application mentioned in Victor's talk.

I ported the code to radv in a couple of days, but then began a long journey into the unknown. The firmware is quite expectant on exactly what it wants and when it wants it. After fixing some interactions with the video player, I started to dig.

Now vaapi and DXVA (Windows) are context-based APIs. This means they are like OpenGL, where you create a context, do a bunch of work, and tear it down; the driver does all the hardware queuing of commands internally. All the state is held in the context. Vulkan is a command-buffer-based API: the application records command buffers and then enqueues those command buffers to the hardware itself.

So the vaapi driver works like this for a video

create hw ctx, flush, decode, flush, decode, flush, decode, flush, decode, flush, destroy hw ctx, flush

However Vulkan wants things to be more like

Create Session, record command buffer with (begin, decode, end) send to hw, (begin, decode, end), send to hw, End Session

There is no way at the Create/End session time to submit things to the hardware.

After a week or two of hair removal and insightful IRC chats I stumbled over a decent enough workaround to avoid the hw dying and managed to decode an H264 video of some jellyfish.

The work is based on a bunch of other stuff, and is in no way suitable for upstreaming yet, not to mention the Vulkan specification is only beta/provisional so can't be used anywhere outside of development.

The preliminary code is in my gitlab repo here[1]. It has a start on h265 decode, but it's not working at all yet, and I think the h264 code is a bit hangy randomly.

I'm not sure where this is going yet, but it was definitely an interesting experiment.

[1]: https://gitlab.freedesktop.org/airlied/mesa/-/commits/radv-vulkan-video-prelim-decode

November 04, 2021

A basic example of the git alias function syntax looks like this.

[alias]
    shortcut = "!f() \
    {\
        echo Hello world!; \
    }; f"

This syntax defines a function f and then calls it. These aliases are executed in a sh shell, which means there's no access to Bash / Zsh specific functionality.

Every command is ended with a ; and each line is ended with a \. This is easy enough to grok. But when we try to clean up the above snippet and add some quotes to "Hello world!", we hit this obtuse error message.

}; f: 1: Syntax error: end of file unexpected (expecting "}")

This syntax error is caused by quotes needing to be escaped. The reason for this comes down to how git tokenizes and executes these functions. If you're curious …
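For reference, the fix itself is simply to escape the inner quotes with backslashes so git's config parser passes them through to sh intact. The same alias as above, adjusted:

```ini
[alias]
    shortcut = "!f() \
    {\
        echo \"Hello world!\"; \
    }; f"
```

With the `\"` escapes, the shell function ends up seeing `echo "Hello world!"` as intended.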

October 20, 2021

 In 2007, Jan Arne Petersen added a D-Bus API to what was still pretty much an import into gnome-control-center of the "acme" utility I wrote to have all the keys on my iBook working.

It switched the code away from remapping keyboard keys to "XF86Audio*", to expecting players to contact the D-Bus daemon and ask to be forwarded key events.

 

Multimedia keys circa 2003

In 2013, we added support for controlling media players using MPRIS, as another interface. Fast-forward to 2021, and MPRIS support is ubiquitous, whether in free software, proprietary applications or even browsers. So we'll be parting with the "org.gnome.SettingsDaemon.MediaKeys" D-Bus API. If your application still wants to work with older versions of GNOME, it is recommended to at least quietly handle the MediaKeys API's unavailability.

 

Multimedia keys in 2021
 

TL;DR: Remove code that relies on gnome-settings-daemon's MediaKeys API, make sure to add MPRIS support to your app.

October 01, 2021

Wim Taymans

Wim Taymans laying out the vision for the future of Linux multimedia


PipeWire has already made great strides forward in terms of improving the audio handling situation on Linux, but one of the original goals was to also bring along the video side of the house. In fact in the first few releases of Fedora Workstation where we shipped PipeWire we solely enabled it as a tool to handle screen sharing for Wayland and Flatpaks. So with PipeWire having stabilized a lot for audio now we feel the time has come to go back to the video side of PipeWire and work to improve the state of the art for video capture handling under Linux. Wim Taymans did a presentation to our team inside Red Hat on the 30th of September talking about the current state of the world and where we need to go to move forward. I thought the information and ideas in his presentation deserved wider distribution, so this blog post builds on that presentation to share it more widely and also hopefully rally the community to support us in this endeavour.

The current state of video capture handling (usually webcams) on Linux is basically the v4l2 kernel API. It has served us well for a lot of years, but we believe that just like you don’t write audio applications directly against the ALSA API anymore, you shouldn’t write video applications directly against the v4l2 kernel API anymore either. With PipeWire we can offer a lot more flexibility, security and power for video handling, just like it does for audio. The v4l2 API is an open/ioctl/mmap/read/write/close based API, meant for a single application to access at a time. There is a library called libv4l2, but nobody uses it because it causes more problems than it solves (no mmap, slow conversions, quirks). But there is no need to rely on the kernel API anymore, as there are GStreamer and PipeWire plugins for v4l2 allowing you to access it using the GStreamer or PipeWire API instead. So our goal is not to replace v4l2, just as it is not our goal to replace ALSA; v4l2 and ALSA remain the kernel driver layer for video and audio.

It is also worth considering that new cameras are getting more and more complicated, and thus configuring them is getting more complicated too. Driving this change is a new set of cameras on the way, often called MIPI cameras, as they adhere to the API standards set by the MIPI Alliance. Partly driven by this, v4l2 is in active development, with a codec API addition (stateful/stateless), DMABUF, the request API, and also a Media Controller (MC) graph with nodes, ports and links of processing blocks. This means that the threshold for an application developer to use these APIs directly is getting very high, in addition to the aforementioned issues of single-application access, the security issues of direct kernel access and so on.

libcamera logo


Libcamera is meant to be the userland library for v4l2.


Of course we are not the only ones seeing the growing complexity of cameras as a challenge for developers and thus libcamera has been developed to make interacting with these cameras easier. Libcamera provides unified API for setup and capture for cameras, it hides the complexity of modern camera devices, it is supported for ChromeOS, Android and Linux.
One way to describe libcamera is as the Mesa of cameras. Libcamera provides hooks to run (out-of-process) vendor extensions, for things like image processing or enhancement. Using libcamera is considered pretty much a requirement for embedded systems these days, and newer Intel chips will also have IPUs configurable with media controllers.

Libcamera is still under heavy development upstream and does not yet have a stable ABI, but a .so version was added very recently, which will make packaging in Fedora and elsewhere a lot simpler. In fact we have builds in Fedora ready now. Libcamera also ships with a set of GStreamer plugins, which means you should in theory be able to get for instance Cheese working through libcamera (although, as we will go into, we think this is the wrong approach).

Before I go further an important thing to be aware of here is that unlike on ALSA, where PipeWire can provide a virtual ALSA device to provide backwards compatibility with older applications using the ALSA API directly, there is no such option possible for v4l2. So application developers will need to port to something new here, be that libcamera or PipeWire. So what do we feel is the right way forward?

Ideal Linux Multimedia Stack

How we envision the Linux multimedia stack going forward


Above you see an illustration of what we believe should be how the stack looks going forward. If you made this drawing of what the current state is, then thanks to our backwards compatibility with ALSA, PulseAudio and Jack, all the applications would be pointing at PipeWire for their audio handling like they are in the illustration above, but all the video handling from most applications would be pointing directly at v4l2 in this diagram. At the same time, we don’t want applications to port to libcamera either, as it doesn’t offer a lot of the flexibility that using PipeWire will; instead what we propose is that all applications target PipeWire in combination with the video camera portal API. Be aware that the video portal is not an alternative to or an abstraction of the PipeWire API; it is just a way to set up the connection to PipeWire that has the added bonus of working if your application ships as a Flatpak or another type of desktop container. PipeWire would then be in charge of talking to libcamera or v4l2 for video, just like PipeWire is in charge of talking with ALSA on the audio side.

Having PipeWire be the central hub means we get a lot of the same advantages for video that we get for audio. For instance, as the application developer you interact with PipeWire regardless of whether what you want is a screen capture, a camera feed or a video being played back. Multiple applications can share the same camera, and at the same time there are security protections to avoid the camera being used without your knowledge to spy on you. And we can also have patchbay applications that support video pipelines and not just audio, like Carla provides for Jack applications. To be clear, this feature will not come for ‘free’ from Jack patchbays since Jack only does audio, but hopefully new PipeWire patchbays like Helvum can add video support.

So what about GStreamer, you might ask. Well, GStreamer is a great way to write multimedia applications and we strongly recommend it, but we do not recommend that your GStreamer application use the v4l2 or libcamera plugins; instead we recommend that you use the PipeWire plugins. This is of course a little different from the audio side, where PipeWire supports the PulseAudio and Jack APIs and thus you don't need to port, but by targeting the PipeWire plugins in GStreamer your GStreamer application will get the full PipeWire feature set.
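As a minimal sketch of what that porting looks like at the pipeline level (assuming the GStreamer PipeWire plugin is installed, and with any camera selection and format negotiation omitted), the classic webcam preview pipeline changes from the v4l2 source to the PipeWire source:

```shell
# Old: talk to the kernel video device directly
gst-launch-1.0 v4l2src ! videoconvert ! autovideosink

# New: let PipeWire mediate camera access (pipewiresrc ships with
# the GStreamer PipeWire plugin)
gst-launch-1.0 pipewiresrc ! videoconvert ! autovideosink
```

In application code the change is similar in spirit: swap the source element and let PipeWire (and, for sandboxed apps, the camera portal) take care of device access.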

So what is our plan of action?
We will start putting the pieces in place for this step by step in Fedora Workstation. We have already started by working on the libcamera support in PipeWire and packaging libcamera for Fedora. We will set things up so that PipeWire has the option to switch between v4l2 and libcamera, so that most users can keep using v4l2 through PipeWire for the time being, while we work with upstream and the community to mature libcamera and its PipeWire backend. We will also enable device discovery for PipeWire.

We are also working on maturing the GStreamer elements for PipeWire for the video capture use case, as we expect a lot of application developers will use GStreamer rather than targeting PipeWire directly. We will start with Cheese as our initial testbed for this work, as it is a fairly simple application, using it as a proof of concept for camera access through PipeWire. We are still trying to decide whether to have Cheese speak directly with PipeWire, or have it talk to PipeWire through the pipewiresrc GStreamer element; both approaches have their pros and cons in the context of testing and verifying this work.

We will also start working with the Chromium and Firefox projects to have them use the Camera portal and PipeWire for camera support, just like we worked with them through WebRTC on the PipeWire-based screen sharing support.

There are a few major items we are still trying to decide on in terms of the interaction between PipeWire and the Camera portal API. It would be tempting to see if we can hide the Camera portal API behind the PipeWire API, or failing that, at least hide it for people using the GStreamer plugin. That way all applications would get portal support for free when porting to GStreamer, instead of having to adopt the Camera portal API as a second step. On the other hand, you need to set up the screen sharing portal yourself, so it would probably be more consistent to leave the same to application developers for camera access too.

What do we want from the community here?
The first step is simply to help us test this as we roll it out in Fedora Workstation and Cheese. While libcamera was written with MIPI cameras as the motivation, all webcams are meant to work through it, and thus all webcams are meant to work with PipeWire using the libcamera backend. At the moment that is not the case, so community testing and feedback is critical in helping us and the libcamera community mature libcamera. We hope that by allowing you to easily configure PipeWire to use the libcamera backend (and switch back once you are done testing) we can get a lot of you to test and let us know which cameras are not working well yet.

A little further down the road, please start planning to move any application you maintain or contribute to away from the v4l2 API and towards PipeWire. If your application is a GStreamer application, the transition from the v4l2 plugins to the PipeWire plugins should be fairly simple; beyond that you should familiarize yourself with the Camera portal API and the PipeWire API for accessing cameras.

For further news and information on PipeWire, follow our @PipeWireP twitter account, and for general news and information about what we are doing in Fedora Workstation, follow me on twitter at @cfkschaller.
PipeWire

September 27, 2021
For the hw-enablement work for Bay- and Cherry-Trail devices which I do as a side project, it is sometimes useful to play with the Android which comes pre-installed on some of these devices.

Sometimes the Android-X86 boot-loader (kernelflinger) is locked and the standard "Developer-Options" -> "Enable OEM Unlock" -> "Run 'fastboot oem unlock'" sequence does not work (e.g. I got the unlock yes/no dialog, and could move between yes and no, but I could not actually confirm the choice).

Luckily there is an alternative: kernelflinger checks an "OEMLock" EFI variable to see if the device is locked or not. As with some of my previous adventures changing hidden BIOS settings, this EFI variable is hidden from the OS as soon as the OS calls ExitBootServices, but we can use the same modified grub to change it. After booting from a USB stick with the relevant grub binary installed as "EFI/BOOT/BOOTX64.EFI" (or "BOOTIA32.EFI"), entering the
following command on the grub cmdline will unlock the bootloader:

setup_var_cv OEMLock 0 1 1

Disabling dm-verity support is pretty easy on these devices because they can just boot a regular Linux distro from a USB drive. Note that booting a regular Linux distro may cause the Android "system" partition to get auto-mounted, after which dm-verity checks will fail! Once we have a regular Linux distro running, step 1 is to find out which partition is the android_boot partition. To do this, as root run:

blkid /dev/mmcblk?p#

Replace the ? with the mmcblk number of the internal eMMC, and try # from 1 to n, until one of the partitions is reported as having 'PARTLABEL="android_boot"'. Usually "mmcblk?p3" is the one you want, so you could try that first.
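The trial-and-error scan described above can also be scripted; a small sketch, assuming the internal eMMC shows up as /dev/mmcblk1 (adjust to your device):

```shell
# Scan the eMMC's partitions for the one labeled "android_boot"
for part in /dev/mmcblk1p*; do
    if blkid "$part" 2>/dev/null | grep -q 'PARTLABEL="android_boot"'; then
        echo "android_boot partition: $part"
    fi
done
```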

Now make an image of the partition by running e.g.:

dd if=/dev/mmcblk1p3 of=android_boot.img

Then copy the "android_boot.img" file to another computer. On that computer, extract the image and then the initrd like this:

abootimg -x android_boot.img
mkdir initrd
cd initrd
zcat ../initrd.img | cpio -i


Now edit the fstab file and remove "verify" from the line for the system partition. After this, update android_boot.img like this:

find . | cpio -o -H newc -R 0.0 | gzip -9 > ../initrd.img
cd ..
abootimg -u android_boot.img -r initrd.img
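If you prefer not to edit the fstab by hand, the "remove verify" step above can be scripted. This sed invocation assumes the mount flags are a comma-separated list, as in a typical Android fstab (back up the file first):

```shell
# Drop the "verify" flag from the extracted ramdisk's fstab, whether
# it appears mid-list (",verify") or at the start ("verify,")
sed -i 's/,verify\b//; s/\bverify,//' fstab
```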


The easiest way to test the new image is using fastboot. Boot the tablet into Android, connect it to the PC, then run:

adb reboot bootloader
fastboot boot android_boot.img


Then from an "adb shell" run "cat /fstab" and verify that the "verify" option is gone. After this you can (optionally) dd the new android_boot.img back to the android_boot partition to make the change permanent.

Note: if Android is not booting, you can force the bootloader to enter fastboot mode on the next boot by downloading this file and then, under regular Linux, running the following command as root:

cat LoaderEntryOneShot > /sys/firmware/efi/efivars/LoaderEntryOneShot-4a67b082-0a4c-41cf-b6c7-440b29bb8c4f
September 24, 2021

Fedora Workstation
So I have spoken about our vision for Fedora Workstation quite a few times before, but I feel it is often useful to get back to it as we progress with our overall effort. If you have read some of my blog posts about Fedora Workstation over the last 5 years, be aware that there is probably little new in here for you. If you haven't read them, however, this is hopefully a useful primer on what we are trying to achieve with Fedora Workstation.

In the first few years after we launched Fedora Workstation in 2014, we focused a lot on establishing a good culture around what we were doing with Fedora, making sure that it was a good day-to-day desktop driver for people, and not just a great place to develop the operating system itself. I think it was Fedora Project Lead Matthew Miller who phrased it very well when he said that we want to be Leading Edge, not Bleeding Edge. We also took a good look at the operating system from an overall stance, tried to map out where Linux tended to fall short as a desktop operating system, and asked ourselves what our core audience would and should be. We refocused our efforts on being a great operating system for all kinds of developers, but I think it is fair to say that we decided that was too narrow a wording, as our efforts are truly aimed at makers of all kinds, like graphics artists and musicians, in addition to coders. So I thought I would go through our key pillar efforts and talk about where they are at and where they are going.

Flatpak

Flatpak logo
One of the first things we concluded was that our story for people who wanted to deploy applications to our platform was really bad. The main challenge was that the platform was moving very fast, making it a big overhead for application developers to keep on top of the changes. In addition, since the Linux desktop is so fragmented, application developers had to deal with 20 different variants of the platform, all moving at a different pace. The way Linux applications were packaged, with each dependency packaged independently of the application, created pains on both sides: for the application developer it meant the world kept moving underneath them with limited control, and for the distributions it meant packaging pains, as different applications that all depended on the same library might work or fail with different versions of it. So we concluded we needed a system which decoupled the application from the host OS, letting application developers update their platform at a pace of their own choosing, while at the same time unifying the platform in the sense that the application should run without problems on the latest Fedora release, the latest RHEL release or the latest version of any other distribution out there. As we looked at it we realized there were some security downsides compared to the existing model, since the OS vendor would no longer be in charge of keeping all libraries up to date and secure, so sandboxing the applications became a critical requirement. At the time Alexander Larsson was working on bringing Docker to RHEL and Fedora, so we tasked him with designing the new application model.
The initial idea was to see if we could adjust Docker containers to the desktop use case, but Docker containers as they stood at that time were very unsuited to hosting desktop applications, and our experience working with the Docker upstream was that they were not very welcoming to our contributions. So in light of how major the changes we needed were, and the unlikelihood of getting them accepted upstream, Alex started on what would become Flatpak. Another major technology coincidentally being developed at the same time was OSTree, by Colin Walters. To this day I think the best description of OSTree is that it functions as a git for binaries, meaning it gives you a simple way to maintain and update your binary applications with minimally sized updates. It also provides some disk deduplication, which we felt was important due to the duplication of libraries and so on that containers bring with them. Finally, another major design decision Alex made was that the runtime/base image should be hosted outside the container, making it possible to update the runtime independently of the application with relevant security updates etc.

Today there is a thriving community around Flatpaks, with the center of activity being Flathub, the Flatpak application repository. In Fedora Workstation 35 you should start seeing Flatpaks from Flathub being offered as long as you have 3rd party repositories enabled. Also underway is an effort, led by Owen Taylor, to integrate Flatpak building into the internal tools we use at Red Hat for putting RHEL together, with the goal of switching to Flatpaks as our primary delivery method for desktop applications in RHEL and of helping us bridge the Fedora and RHEL application ecosystems.

You can follow the latest news from Flatpak through the official Flatpak twitter account.

Silverblue

So another major issue we decided needed improvement was OS upgrades (as opposed to application updates). The model pursued by Linux distros since their inception is one of shipping the OS as a large collection of independently packaged libraries. This setup is inherently fragile and requires a lot of quality engineering and testing to avoid problems, but even then things sometimes fail, especially in a fast moving OS like Fedora. A lot of configuration changes and updates have traditionally been done through scripts and similar, making rollback to an older version very challenging when there is a problem. Adventurous developers could also have made changes to their own copy of the OS that would break the upgrade later on. So thanks to all the great efforts to test and verify upgrades they usually go well for most users, but we wanted something even more sturdy. The idea came up to move to an image-based OS model, similar to what people had gotten used to on their phones. OSTree once again became the technology we chose to do this, especially considering it was being used in Red Hat's first foray into image-based operating systems for servers (an effort that later got rolled into CoreOS as part of Red Hat acquiring CoreOS). The idea is that you ship the core operating system as a single image, and to upgrade you just replace that image with a new one, greatly reducing the risk of problems. On top of that, each of those images can be tested and verified as a whole by your QE and test teams. Of course, we realized that a subset of people would still want to tweak their OS, but once again OSTree came to our rescue, as it allows developers to layer further RPMs on top of the OS image, including replacing current system libraries with, for instance, newer ones.
The great thing about OSTree layering is that once you are done testing or using the layered RPMs, you can drop them again with a very simple command and go back to the upstream image. Combined with applications being shipped as Flatpaks, this creates an OS that is a lot more sturdy, secure and simple to update, with a much lower chance of an OS update breaking any of your applications. On top of that, OSTree allows us to do easy OS rollbacks, so if the latest update somehow doesn't work for you, you can quickly roll back while waiting for the issue you are having to be fixed upstream. And hence Fedora Silverblue was born as the vehicle for us to develop and evolve an image-based desktop operating system.

You can follow our efforts around Silverblue through the official Silverblue twitter account.

Toolbx

Toolbox with RHEL

Toolbox pet container with RHEL UBI


So Flatpak helped us address a lot of the gaps in making a better desktop OS on the application side, and Silverblue was the vehicle for our vision on the OS side, but we realized that we also needed some way for all kinds of developers to easily take advantage of the great resource that is the Fedora RPM package universe and the wider tools universe out there. We needed something that provided people with a great terminal experience. We had already been working on various smaller improvements to the terminal for a while, but we realized we needed something a lot more substantial. Accessing an immutable OS like Silverblue through a terminal window tends to be quite limiting, so it is usually not what you want to do, and you also don't want to rely on OSTree layering for running all your development tools and so on, as that is going to be potentially painful when you upgrade your OS.
Luckily the container revolution happening in the Linux world pointed us to the solution here too: as containers were rolled out, the concept of 'pet containers' was also born. The idea of a pet container is that, unlike general containers (sometimes referred to as cattle containers), pet containers are containers you care about on an individual level, like your personal development environment. In fact, pet containers even improve on how we used to do things, as they allow you to very easily maintain different environments for different projects. For instance, if you have two projects, hosted in two separate pet containers, which depend on two different versions of Python, containers make that simple, as they ensure there is no risk of one of your projects 'contaminating' the other with its dependencies, while still allowing you to grab RPMs or other kinds of packages from upstream resources and install them in your container. In fact, while inside your pet container the world feels a lot like it always has on the Linux command line. Thanks to the great effort of Dan Walsh and his team, we had a growing number of easy-to-use container tools available to us, like podman. Podman is developed with the primary use case of running and deploying containers at scale, managed by OpenShift and Kubernetes, but it also gave us the foundation Debarshi Ray needed to kick off the Toolbx project, ensuring we had an easy-to-use tool for creating and managing pet containers. As a bonus, Toolbx allows us to achieve another important goal: letting Fedora Workstation users develop applications against RHEL in a simple and straightforward manner, because Toolbx makes creating a RHEL container just as easy as creating a Fedora container.

You can follow our efforts around Toolbox on the official Toolbox twitter account

Wayland

Ok, so between Flatpak, Silverblue and Toolbox we had a clear vision for how to create a robust OS with a great story for application developers to maintain and deliver applications for it, with Toolbox providing a great developer story on top of this OS. But we also looked at the technical state of the Linux desktop and realized that there were some serious deficits we needed to address. One of the first we saw was the state of graphics, where X.org had served us well for many decades, but its age was showing and adding new features was becoming more and more painful. Kristian Høgsberg had, while still at Red Hat, started work on an alternative to X called Wayland, an effort he and a team of engineers were pushing forward at Intel. There was general agreement in the wider community that Wayland was the way forward, but apart from Intel there was little serious development effort being put into moving it forward. On top of that, Canonical at the time had decided to go off on their own and develop their own alternative architecture in competition with X.org and Wayland. So, seeing a lot of things on the graphics horizon, like HiDPI, and getting requests to come up with a way to make Linux desktops more secure, we decided to team up with Intel and get Wayland into a truly usable state on the desktop. We put many of our top developers, like Olivier Fourdan, Adam Jackson and Jonas Ådahl, on maturing Wayland as quickly as possible.
As things would have it, we also ended up getting a lot of collaboration and development help from the embedded sector, where companies such as Collabora were helping to deploy systems with Wayland onto various kinds of embedded devices and contributing fixes and improvements back to Wayland (and Weston). To be honest, I have to admit we did not fully appreciate what a herculean task getting Wayland production-ready for the desktop would be, and it took us quite a few Fedora releases before we decided it was ready to go. As you might imagine, dealing with 30 years of technical debt is no easy thing to pay down, and while we kept moving forward at a steady pace there always seemed to be a new batch of issues to resolve. But we managed, not just by maturing Wayland, but also by porting major applications, such as Martin Stransky porting Firefox and Caolan McNamara porting LibreOffice over to Wayland. At the end of the day, I think what saw us through to success was the incredible collaboration happening upstream between a large host of individual contributors and companies, and having the support of the X.org community. And even when we had the whole thing put together there were still practical issues to overcome, like how we had to keep defaulting to X.org in Fedora when people installed the binary NVidia driver, because that driver did not work with XWayland, the X backwards compatibility layer in Wayland. Luckily that is now becoming a thing of the past, with the latest NVidia driver updates supporting XWayland and us working closely with NVidia to ensure the driver and windowing stack work well together.

PipeWire

Pipewire in action

Example of PipeWire running


So now we had a clear vision for the OS and a much improved and more secure graphics stack in the form of Wayland, but we realized that all the new security features brought in by Flatpak and Wayland also made certain things, like desktop capturing/remoting and webcam access, a lot harder. Security is great and critical, but just like the old joke about the most secure computer being the one that is turned off, we realized that we needed to keep these things working, just in a more secure and better manner. Thankfully we have GStreamer co-creator Wim Taymans on the team, and he thought he could come up with a PulseAudio equivalent for video that would let us offer screen capture and webcam access in a convenient and secure manner.
As Wim was prototyping what we called PulseVideo at the time, we also started discussing the state of audio on Linux. Wim had contributed a security layer to PulseAudio, to make it harder for, for instance, a rogue application to eavesdrop on you using your microphone, but since it was not part of the original design it wasn't a great solution. At the same time we talked about how our vision for Fedora Workstation was to make it the natural home for all kinds of makers, which included musicians, but how the separateness of the pro-audio community was getting in the way of that, especially due to the uneasy co-existence of PulseAudio on the consumer side and Jack on the pro-audio side. As part of his development effort, Wim came to the conclusion that he could make the core logic of his new project so fast and versatile that it should be able to deal with the low latency requirements of the pro-audio community and also serve its purpose well on the consumer audio and video side. Having audio and video in one shared system would also be an improvement for us in terms of dealing with combined audio and video sources, as guaranteeing audio/video sync, for instance, had often been a challenge in the past. So Wim's effort evolved into what we today call PipeWire, which I am going to be brave enough to say has been one of the most successful launches of a major new Linux system component we have ever done. Replacing two old sound servers while at the same time adding video support is no small feat, but Wim is working very hard on fixing bugs as quickly as they come in and ensuring users have a great experience with PipeWire. And at the same time we are very happy that PipeWire now gives us the ability to offer musicians and sound engineers a new home in Fedora Workstation.

You can follow our efforts on PipeWire on the PipeWire twitter account.

Hardware support and firmware

In parallel with everything mentioned above, we were looking at the hardware landscape surrounding desktop Linux. One of the first things we realized was horribly broken was firmware support under Linux. More and more of the hardware smarts were being found in the firmware, yet firmware access under Linux and the firmware update story were basically non-existent. As we were discussing this problem internally, Peter Jones, our representative on the UEFI standards committee, pointed out that we were probably better poised to actually do something about this problem than ever before, since UEFI was causing the firmware update process on most laptops and workstations to become standardized. So we teamed Peter up with Richard Hughes, and out of that collaboration fwupd and LVFS were born. In the years since we launched them, we have gone from having next to no firmware updates available on Linux (and the little we had only available through painful processes like burning bootable CDs) to having a lot of hardware with firmware update support, and more getting added almost on a weekly basis.
For the latest and greatest news around LVFS, the best source of information is Richard Hughes' twitter account.

In parallel to this, Adam Jackson worked on glvnd, which gave us a way to have multiple OpenGL implementations on the same system. Those who have been using Linux for a while will surely remember the pain of the NVidia driver and Mesa fighting over who provided OpenGL on your system, as it was all tied to a specific .so name. There were a lot of hacks out there to deal with that situation, of varying degrees of fragility, but with the advent of glvnd nobody has to care about that problem anymore.

We also decided that we needed a part of the team dedicated to looking at what was happening in the market and working to cover important gaps. And by gaps I mean fixing the things that keep hardware vendors from being able to properly support Linux, not writing drivers for them. Instead we have been working closely with Dell and Lenovo to ensure that their suppliers provide drivers for their hardware, and when needed we work to provide a framework for them to plug their hardware into. This has led to a series of small but important improvements, like getting the fingerprint reader stack on Linux to a state where hardware vendors can actually support it, bringing Thunderbolt support to Linux through Bolt, support for high definition and gaming mice through the libratbag project, support in the Linux kernel for the new laptop privacy screen feature, improved power management through the power profiles daemon, and now recently hiring a dedicated engineer to get HDR support fully in place in Linux.

Summary

So to summarize: we are of course not over the finish line with our vision yet. Silverblue is a fantastic project, but we are not yet ready to declare it the official version of Fedora Workstation, mostly because we want to give the community more time to embrace the Flatpak application model and give developers time to embrace the pet container model. Especially applications like IDEs, which cross the boundary between sitting in their own Flatpak sandbox while also interacting with things in your pet container and calling out to system tools like gdb, need more work. But Christian Hergert has already done great work solving this problem in GNOME Builder, while Owen Taylor has put together support for using Visual Studio Code with pet containers. So hopefully the wider universe of IDEs will follow suit; in the meantime you would need to call them from the command line from inside the pet container.

The good thing here is that Flatpaks and Toolbox also work great on traditional Fedora Workstation; you can get the full benefit of both technologies even on a traditional distribution, so we can allow for a soft and easy transition.

For anyone who made it this far, apologies for this becoming a little novel; that was not my intention when I started writing it :)

Feel free to follow my personal twitter account for more general news and updates on what we are doing around Fedora Workstation.
Christian F.K. Schaller photo

Last week we had our most loved annual conference: X.Org Developers Conference 2021. As a reminder, due to COVID-19 situation in Europe (and its respective restrictions on travel and events), we kept it virtual again this year… which is a pity as the former venue was Gdańsk, a very beautiful city (see picture below if you don’t believe me!) in Poland. Let’s see if we can finally have an XDC there!

XDC 2021

This year we had a very strong program, with talks covering all aspects of the open-source graphics stack: the kernel (including an Outreachy talk about VKMS), Mesa drivers of all kinds, input, libraries, X.org security and Wayland robustness. We had talks about testing drivers, debugging them, our infra at freedesktop.org, and even Vulkan specs (such as Vulkan Video and VK_EXT_multi_draw) and their support in the open-source graphics stack. Definitely a very complete program, interesting to all open-source developers working in this area. You can watch all the talks here or here, and the slides have already been uploaded to the program.

On behalf of the Call For Papers Committee, I would like to thank all speakers for their talks… this conference wouldn't make sense without you!

Big shout-out to the XDC 2021 organizers (Intel), represented by Radosław Szwichtenberg, Ryszard Knop and Maciej Ramotowski. They did an awesome job putting on a very smooth conference. I can tell you that they promptly fixed any issue that came up, all behind the scenes, so that attendees did not even notice most of the time! That is what good conference organizers do!

XDC 2021 Organizers Can I invite you to a drink at least? You really deserve it!

If you want to know more details about what this virtual conference entailed, just watch Ryszard’s talk at XDC (info, video) or you can reuse their materials for future conferences. That’s very useful info for future conference organizers!

Talking about our streaming platforms, the big novelty this year was the use of media.ccc.de as a privacy-friendly alternative to our traditional YouTube setup (last year we got feedback about this). Media.ccc.de is an open-source platform that respects your privacy, and we hope it worked well for all attendees. Our stats indicate that ~50% of our audience connected to it during the three days of the conference. That's awesome!

Last but not least, we couldn't make this conference without our sponsors. We are very lucky to have on board Intel as our Platinum sponsor and organizer, our Gold sponsors (Google, NVIDIA, ARM, Microsoft and AMD), our Silver sponsors (Igalia, Collabora, The Linux Foundation), our Bronze sponsors (Gitlab and Khronos Group) and our Supporters (C3VOC). Big thank you from the X.Org community!

XDC 2021 Sponsors

Feedback

We would like to hear from you and learn what worked and what needs to be improved for future editions of XDC! Please share your experience with us!

We have sent an email asking for feedback to different mailing lists (for example this). Don’t hesitate to send an email to X.Org Foundation board with all your feedback!

XDC 2022 announced!

X.Org Developers Conference 2022 has been announced! Jeremy White, from Codeweavers, gave a lightning talk presenting next year's edition! Next year XDC will not be alone… WineConf 2022 is going to be organized by Codeweavers as well and co-located with XDC!

Save the dates! October 4-5-6, 2022 in Minneapolis, Minnesota, USA.

XDC 2022: Minneapolis, Minnesota, USA Image from Wikipedia. License CC BY-SA 4.0.

XDC 2023 hosting proposals

Have you enjoyed XDC 2021? Do you think you can do it better? ;-) We are looking for organizers for XDC 2023 (most likely in Europe but we are open to other places).

We know this is a decision that takes time (triggering internal discussions, looking for volunteers, budgeting, finding a venue suitable for the event, etc.). Therefore, we encourage potentially interested parties to start the internal discussions now, so any questions they have can be answered before we open the call for proposals for XDC 2023 at some point next year. Please read what is required to organize this conference, and feel free to contact me or the X.Org Foundation board for more info if needed.

Final acknowledgment

I would like to thank Igalia for all the support I got when I decided to run for re-election this year in the X.Org Foundation board and to allow me to participate in XDC organization during my work hours. It’s amazing that our Free Software and collaboration values are still present after 20 years rocking in the free world!

Igalia 20th anniversary Igalia

September 23, 2021

After a nine year hiatus, a new version of the X Input Protocol is out. Credit for the work goes to Povilas Kanapickas, who also implemented support for XI 2.4 in the various pieces of the stack [0]. So let's have a look.

X has had touch events since XI 2.2 (2012) but those were only really useful for direct touch devices (read: touchscreens). There were accommodations for indirect touch devices like touchpads but they were never used. The synaptics driver set the required bits for a while but it was dropped in 2015 because ... it was complicated to make use of and no-one seemed to actually use it anyway. Meanwhile, the rest of the world moved on and touchpad gestures are now prevalent. They've been standard in MacOS for ages, in Windows for almost ages and - with recent GNOME releases - now feature prominently on the Linux desktop as well. They have been part of libinput and the Wayland protocol for years (and even recently gained a new set of "hold" gestures). Meanwhile, X was left behind in the dust or mud, depending on your local climate.

XI 2.4 fixes this: it adds pinch and swipe gestures to the XI2 protocol and makes those available to supporting clients [2]. Notable here is that the interpretation of gestures is left to the driver [1]. The server takes the gestures and does the required state handling but otherwise has no say in what constitutes a gesture. This is of course no different to e.g. 2-finger scrolling on a touchpad where the server just receives scroll events and passes them on accordingly.

XI 2.4 gesture events are quite similar to touch events in that they are processed as a sequence of begin/update/end with both types having their own event types. So the events you will receive are e.g. XIGesturePinchBegin or XIGestureSwipeUpdate. As with touch events, a client must select for all three (begin/update/end) on a window. Only one gesture can exist at any time, so if you are a multi-tasking octopus prepare to be disappointed.

Because gestures are tied to an indirect-touch device, the location they apply at is wherever the cursor is currently positioned. In that, they work similarly to button presses, and passive grabs apply as expected too. So long-term the window manager will likely want a passive grab on the root window for swipe gestures while applications will implement pinch-to-zoom as you'd expect.

In terms of API there are no surprises. libXi 1.8 is the version to implement the new features and there we have a new XIGestureClassInfo returned by XIQueryDevice and of course the two events: XIGesturePinchEvent and XIGestureSwipeEvent. Grabbing is done via e.g. XIGrabSwipeGestureBegin, so for those of you with XI2 experience this will all look familiar. For those of you without - it's probably no longer worth investing time into becoming an XI2 expert.

Overall, it's a nice addition to the protocol and it will help get the X server slightly closer to Wayland for a widely-used feature. Once GTK, mutter and all the other pieces in the stack are in place, it will just work for any (GTK) application that supports gestures under Wayland already. The same will be true for Qt I expect.

X server 21.1 will be out in a few weeks, xf86-input-libinput 1.2.0 is already out and so are xorgproto 2021.5 and libXi 1.8.

[0] In addition to taking on the Xorg release, so clearly there are no limits here
[1] More specifically: it's done by libinput since neither xf86-input-evdev nor xf86-input-synaptics will ever see gestures being implemented
[2] Hold gestures missed out on the various deadlines

September 22, 2021

Xorg is about to be released.

And it's a release without Xwayland.

And... wait, what?

Let's unwind this a bit, and ideally you should come away with a better understanding of Xorg vs Xwayland, and possibly even Wayland itself.

Heads up: if you are familiar with X, the below is simplified to the point it hurts. Sorry about that, but as an X developer you're probably good at coping with pain.

Let's go back to the 1980s, when fashion was weird and there were still reasons to be optimistic about the future. Because this is a thought exercise, we go back with full hindsight 20/20 vision and, ideally, the winning Lotto numbers in case we have some time for some self-indulgence.

If we were to implement an X server from scratch, we'd come away with a set of components: libxprotocol, which handles the actual protocol wire format parsing and provides a C API to access it (quite like libxcb, actually). That one will just be the protocol-to-code conversion layer.

We'd have a libxserver component which handles all the state management required for an X server to actually behave like an X server (nothing in the X protocol requires an X server to display anything). That library has a few entry points for abstract input events (pointer and keyboard, because this is the 80s after all) and a few exit points for rendered output.

libxserver uses libxprotocol but that's an implementation detail, we can ignore the protocol for the rest of the post.

Let's create a github organisation and host those two libraries. We now have: http://github.com/x/libxserver and http://github.com/x/libxprotocol [1].

Now, to actually implement a working functional X server, our new project would link against libxserver and hook into this library's API points. For input, you'd use libinput and pass those events through, for output you'd use the modesetting driver that knows how to scream at the hardware until something finally shows up. This is somewhere between outrageously simplified and unacceptably wrong but it'll do for this post.

Your X server has to handle a lot of the hardware-specifics but other than that it's a wrapper around libxserver which does the work of ... well, being an X server.

Our stack looks like this:


+------------------------+
| xserver   [libxserver] |--------[ X client ]
|                        |
| [libinput][modesetting]|
+------------------------+
|         kernel         |
+------------------------+
Hooray, we have re-implemented Xorg. Or rather, XFree86 because we're 20 years from all the pent-up frustration that caused the Xorg fork. Let's host this project on http://github.com/x/xorg

Now, let's say instead of physical display devices, we want to render into a framebuffer, and we have no input devices.


+------------------------+
| xserver   [libxserver] |--------[ X client ]
|                        |
|       [write()]        |
+------------------------+
|      some buffer       |
+------------------------+
This is basically Xvfb or, if you are writing out PostScript, Xprint. Let's host those on github too, we're accumulating quite a set of projects here.

Now, let's say those buffers are allocated elsewhere and we're just rendering to them. And those buffers are passed to us via an IPC protocol, like... Wayland!


+------------------------+
| xserver   [libxserver] |--------[ X client ]
|                        |
| input events  [render] |
+------------------------+
|                        |
+------------------------+
|   Wayland compositor   |
+------------------------+
And voila, we have Xwayland. If you swap out the protocol you can have Xquartz (X on macOS) or Xwin (X on Windows) or Xnest/Xephyr (X on X) or Xvnc (X over VNC). The principle is always the same.

Fun fact: the Wayland compositor doesn't need to run on the hardware, you can play display server matryoshka until you run out of turtles.

In our glorious revisionist past all these are distinct projects, re-using libxserver and some external libraries where needed. Depending on the project, things may be very simple or get very complex, it depends on how we render things.

But in the end, we have several independent projects all providing us with an X server process - the specific X bits are done in libxserver though. We can release Xwayland without having to release Xorg or Xvfb.

libxserver won't need a lot of releases, the behaviour is largely specified by the protocol requirements and once you're done implementing it, it'll be quite a slow-moving project.

Ok, now, fast forward to 2021, lose some hindsight, hope, and attitude and - oh, we have exactly the above structure. Except that it's not spread across multiple independent repos on github, it's all sitting in the same git directory: our Xorg, Xwayland, Xvfb, etc. are all sitting in hw/$name, and libxserver is basically the rest of the repo.

A traditional X server release was a tag in that git directory. An XWayland-only release is basically an rm -rf hw/*-but-not-xwayland followed by a tag, an Xorg-only release is basically an rm -rf hw/*-but-not-xfree86 [2].

In theory, we could've moved all these out into separate projects a while ago but the benefits are small and no-one has the time for that anyway.

So there you have it - you can have Xorg-only or XWayland-only releases without the world coming to an end.

Now, for the "Xorg is dead" claims - it's very likely that the current release will be the last Xorg release. [3] There is little interest in an X server that runs on hardware, or rather: there's little interest in the effort required to push out releases. Povilas did a great job in getting this one out but again, it's likely this is the last release. [4]

Xwayland - very different, it'll hang around for a long time because it's "just" a protocol translation layer. And of course the interest is there, so we have volunteers to do the releases.

So basically: expect Xwayland releases, be surprised (but not confused) by Xorg releases.

[1] Github of course doesn't exist yet because we're in the 80s. Time-travelling is complicated.
[2] Historical directory name, just accept it.
[3] Just like the previous release...
[4] At least until the next volunteer steps up. Turns out the problem "no-one wants to work on this" is easily fixed by "me! me! I want to work on this". A concept that is apparently quite hard to understand in the peanut gallery.

The Strange State of Authenticated Boot and Disk Encryption on Generic Linux Distributions

TL;DR: Linux has been supporting Full Disk Encryption (FDE) and technologies such as UEFI SecureBoot and TPMs for a long time. However, the way they are set up by most distributions is not as secure as they should be, and in some ways quite frankly weird. In fact, right now, your data is probably more secure if stored on current ChromeOS, Android, Windows or MacOS devices, than it is on typical Linux distributions.

Generic Linux distributions (i.e. Debian, Fedora, Ubuntu, …) adopted Full Disk Encryption (FDE) more than 15 years ago, with the LUKS/cryptsetup infrastructure. It was a big step forward to a more secure environment. Almost ten years ago the big distributions started adding UEFI SecureBoot to their boot process. Support for Trusted Platform Modules (TPMs) has been added to the distributions a long time ago as well — but even though many PCs/laptops these days have TPM chips on-board it's generally not used in the default setup of generic Linux distributions.

How these technologies currently fit together on generic Linux distributions doesn't really make too much sense to me — and falls short of what they could actually deliver. In this story I'd like to have a closer look at why I think that, and what I propose to do about it.

The Basic Technologies

Let's have a closer look what these technologies actually deliver:

  1. LUKS/dm-crypt/cryptsetup provide disk encryption, and optionally data authentication. Disk encryption means that reading the data in clear-text form is only possible if you possess a secret of some form, usually a password/passphrase. Data authentication means that no one can make changes to the data on disk unless they possess a secret of some form. Most distributions only enable the former though — the latter is a more recent addition to LUKS/cryptsetup, and is not used by default on most distributions (though it probably should be). Closely related to LUKS/dm-crypt is dm-verity (which can authenticate immutable volumes) and dm-integrity (which can authenticate writable volumes, among other things).

  2. UEFI SecureBoot provides mechanisms for authenticating boot loaders and other pre-OS binaries before they are invoked. If those boot loaders then authenticate the next step of booting in a similar fashion there's a chain of trust which can ensure that only code that has some level of trust associated with it will run on the system. Authentication of boot loaders is done via cryptographic signatures: the OS/boot loader vendors cryptographically sign their boot loader binaries. The cryptographic certificates that may be used to validate these signatures are then signed by Microsoft, and since Microsoft's certificates are basically built into all of today's PCs and laptops this will provide some basic trust chain: if you want to modify the boot loader of a system you must have access to the private key used to sign the code (or to the private keys further up the certificate chain).

  3. TPMs do many things. For this text we'll focus on one facet: they can be used to protect secrets (for example for use in disk encryption, see above) that are released only if the code that booted the host can be authenticated in some form. This works roughly like this: every component that is used during the boot process (i.e. code, certificates, configuration, …) is hashed with a cryptographic hash function before it is used. The resulting hash is written to some small volatile memory the TPM maintains that is write-only (the so-called Platform Configuration Registers, "PCRs"): each step of the boot process will write hashes of the resources needed by the next part of the boot process into these PCRs. The PCRs cannot be written freely: the hashes written are combined with what is already stored in the PCRs — also through hashing — and the result of that then replaces the previous value. Effectively this means: only if every component involved in the boot matches expectations do the hash values exposed in the TPM PCRs match the expected values too. And if you then use those values to unlock the secrets you want to protect you can guarantee that the key is only released to the OS if the expected OS and configuration is booted. The process of hashing the components of the boot process and writing that to the TPM PCRs is called "measuring". What's also important to mention is that the secrets are not only protected by these PCR values but encrypted with a "seed key" that is generated on the TPM chip itself, and cannot leave the TPM (at least so goes the theory). The idea is that you cannot read out a TPM's seed key, and thus you cannot duplicate the chip: unless you possess the original, physical chip you cannot retrieve the secret it might be able to unlock for you. 
Finally, TPMs can enforce a limit on unlock attempts per time ("anti-hammering"): this makes it hard to brute force things: if you can only execute a certain number of unlock attempts within some specific time then brute forcing will be prohibitively slow.
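The extend-only behaviour of PCRs, and the way secrets can be "sealed" against them, can be sketched in a few lines of Python. This is a toy model of the idea, not a real TPM interface; the component names and the seed value are made up for illustration:

```python
import hashlib

def extend(pcr: bytes, measurement: bytes) -> bytes:
    """TPM-style PCR extend: new = SHA256(old || SHA256(component))."""
    return hashlib.sha256(pcr + hashlib.sha256(measurement).digest()).digest()

# PCRs start zeroed at boot; each stage measures the next one before running it.
pcr = bytes(32)
for component in (b"shim", b"boot loader", b"kernel+initrd"):
    pcr = extend(pcr, component)

# "Sealing": the unlock key depends on the seed key (which never leaves the
# chip) and on the expected PCR value, so only the expected boot chain
# reproduces it. The seed here is of course a stand-in for the sketch.
seed = b"seed-key-that-never-leaves-the-chip"
unlock_key = hashlib.sha256(seed + pcr).digest()

# A tampered chain (e.g. a backdoored initrd) extends to a different PCR,
# and therefore derives a different, useless key.
evil = bytes(32)
for component in (b"shim", b"boot loader", b"backdoored kernel+initrd"):
    evil = extend(evil, component)
assert hashlib.sha256(seed + evil).digest() != unlock_key
```

Because each extend hashes the previous register contents into the new value, no later stage can "un-measure" what an earlier stage recorded — which is exactly why the PCRs are trustworthy inputs for releasing secrets.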

How Linux Distributions use these Technologies

As mentioned already, Linux distributions adopted the first two of these technologies widely, the third one not so much.

So typically, here's how the boot process of Linux distributions works these days:

  1. The UEFI firmware invokes a piece of code called "shim" (which is stored in the EFI System Partition — the "ESP" — of your system), that more or less is just a list of certificates compiled into code form. The shim is signed with the aforementioned Microsoft key, that is built into all PCs/laptops. This list of certificates then can be used to validate the next step of the boot process. The shim is measured by the firmware into the TPM. (Well, the shim can do a bit more than what I describe here, but this is outside of the focus of this article.)

  2. The shim then invokes a boot loader (often Grub) that is signed by a private key owned by the distribution vendor. The boot loader is stored in the ESP as well, plus some other places (i.e. possibly a separate boot partition). The corresponding certificate is included in the list of certificates built into the shim. The boot loader components are also measured into the TPM.

  3. The boot loader then invokes the kernel and passes it an initial RAM disk image (initrd), which contains initial userspace code. The kernel itself is signed by the distribution vendor too. It's also validated via the shim. The initrd is not validated, though (!). The kernel is measured into the TPM, the initrd sometimes too.

  4. The kernel unpacks the initrd image, and invokes what is contained in it. Typically, the initrd then asks the user for a password for the encrypted root file system. The initrd then uses that to set up the encrypted volume. No code authentication or TPM measurements take place.

  5. The initrd then transitions into the root file system. No code authentication or TPM measurements take place.

  6. When the OS itself is up the user is prompted for their user name, and their password. If correct, this will unlock the user account: the system is now ready to use. At this point no code authentication, no TPM measurements take place. Moreover, the user's password is not used to unlock any data, it's used only to allow or deny the login attempt — the user's data has already been decrypted a long time ago, by the initrd, as mentioned above.

What you'll notice here of course is that code validation happens for the shim, the boot loader and the kernel, but not for the initrd or the main OS code anymore. TPM measurements might go one step further: the initrd is measured sometimes too, if you are lucky. Moreover, you might notice that the disk encryption password and the user password are requested by code that is not validated, and is thus not safe from external manipulation. You might also notice that even though TPM measurements of boot loader/OS components are done, nothing actually ever makes use of the resulting PCRs in the typical setup.

Attack Scenarios

Of course, before determining whether the setup described above makes sense or not, one should have an idea what one actually intends to protect against.

The most basic attack scenario to focus on is probably that you want to be reasonably sure that if someone steals your laptop that contains all your data then this data remains confidential. The model described above probably delivers that to some degree: the full disk encryption when used with a reasonably strong password should make it hard for the laptop thief to access the data. The data is as secure as the password used is strong. The attacker might attempt to brute force the password, thus if the password is not chosen carefully the attacker might be successful.

Two more interesting attack scenarios go something like this:

  1. Instead of stealing your laptop the attacker takes the harddisk from your laptop while you aren't watching (e.g. while you went for a walk and left it at home or in your hotel room), makes a copy of it, and then puts it back. You'll never notice they did that. The attacker then analyzes the data in their lab, maybe trying to brute force the password. In this scenario you won't even know that your data is at risk, because for you nothing changed — unlike in the basic scenario above. If the attacker manages to break your password they have full access to the data on it, i.e. everything you have stored on it so far, but not necessarily to what you are going to store on it later. This scenario is worse than the basic one mentioned above, for the simple fact that you won't know that you might be attacked. (This scenario could be extended further: maybe the attacker has a chance to watch you type in your password or so, effectively lowering the password strength.)

  2. Instead of stealing your laptop the attacker takes the harddisk from your laptop while you aren't watching, inserts backdoor code on it, and puts it back. In this scenario you won't know your data is at risk, because physically everything is as before. What's really bad though is that the attacker gets access to anything you do on your laptop, both the data already on it, and whatever you will do in the future.

I think in particular this backdoor attack scenario is something we should be concerned about. We know for a fact that attacks like that happen all the time (Pegasus, industry espionage, …), hence we should make them hard.

Are we Safe?

So, does the scheme so far implemented by generic Linux distributions protect us against the latter two scenarios? Unfortunately not at all. Because distributions set up disk encryption the way they do, and only bind it to a user password, an attacker can easily duplicate the disk, and then attempt to brute force your password. What's worse: since code authentication ends at the kernel — and the initrd is not authenticated anymore —, backdooring is trivially easy: an attacker can change the initrd any way they want, without having to fight any kind of protections. And given that FDE unlocking is implemented in the initrd, and it's the initrd that asks for the encryption password, things are just too easy: an attacker could trivially insert some code that picks up the FDE password as you type it in and sends it wherever they want. And not just that: since once they are in they are in, they can do anything they like for the rest of the system's lifecycle, with full privileges — including installing backdoors for versions of the OS or kernel that are installed on the device in the future, so that their backdoor remains open for as long as they like.

That is sad of course. It's particularly sad given that the other popular OSes all address this much better. ChromeOS, Android, Windows and MacOS all have way better built-in protections against attacks like this. And it's why one can certainly claim that your data is probably better protected right now if you store it on those OSes than it is on generic Linux distributions.

(Yeah, I know that there are some niche distros which do this better, and some hackers hack their own. But I care about general purpose distros here, i.e. the big ones, that most people base their work on.)

Note that there are more problems with the current setup. For example, it's really weird that during boot the user is queried for an FDE password which actually protects their data, and then once the system is up they are queried again – now asking for a username, and another password. And the weird thing is that this second authentication that appears to be user-focused doesn't really protect the user's data anymore — at that moment the data is already unlocked and accessible. The username/password query is supposed to be useful in multi-user scenarios of course, but how does that make any sense, given that these multiple users would all have to know a disk encryption password that unlocks the whole thing during the FDE step, and thus they have access to every user's data anyway if they make an offline copy of the harddisk?

Can we do better?

Of course we can, and that is what this story is actually supposed to be about.

Let's first figure out what the minimal issues we should fix are (at least in my humble opinion):

  1. The initrd must be authenticated before being booted into. (And measured unconditionally.)

  2. The OS binary resources (i.e. /usr/) must be authenticated before being booted into. (But don't need to be encrypted, since everyone has the same anyway, there's nothing to hide here.)

  3. The OS configuration and state (i.e. /etc/ and /var/) must be encrypted, and authenticated before they are used. The encryption key should be bound to the TPM device; i.e. system data should be locked to a security concept belonging to the system, not the user.

  4. The user's home directory (i.e. /home/lennart/ and similar) must be encrypted and authenticated. The unlocking key should be bound to a user password or user security token (FIDO2 or PKCS#11 token); i.e. user data should be locked to a security concept belonging to the user, not the system.

Or to summarize this differently:

  1. Every single component of the boot process and OS needs to be authenticated, i.e. all of shim (done), boot loader (done), kernel (done), initrd (missing so far), OS binary resources (missing so far), OS configuration and state (missing so far), the user's home (missing so far).

  2. Encryption is necessary for the OS configuration and state (bound to TPM), and for the user's home directory (bound to a user password or user security token).

In Detail

Let's see how we can achieve the above in more detail.

How to Authenticate the initrd

At the moment initrds are generated on the installed host via scripts (dracut and similar) that try to figure out a minimal set of binaries and configuration data to build an initrd that contains just enough to be able to find and set up the root file system. What is included in the initrd hence depends highly on the individual installation and its configuration. Pretty likely no two initrds generated that way will be fully identical due to this. This model clearly has benefits: the initrds generated this way are very small and minimal, and support exactly what is necessary for the system to boot, and not less or more. It comes with serious drawbacks too though: the generation process is fragile and sometimes more akin to black magic than following clear rules: the generator script natively has to understand a myriad of storage stacks to determine what needs to be included and what not. It also means that authenticating the image is hard: given that each individual host gets a different specialized initrd, it means we cannot just sign the initrd with the vendor key like we sign the kernel. If we want to keep this design we'd have to figure out some other mechanism (e.g. a per-host signature key – that is generated locally; or by authenticating it with a message authentication code bound to the TPM). While these approaches are certainly thinkable, I am not convinced they actually are a good idea though: locally and dynamically generated per-host initrds are something we probably should move away from.

If we move away from locally generated initrds, things become a lot simpler. If the distribution vendor generates the initrds on their build systems then it can be attached to the kernel image itself, and thus be signed and measured along with the kernel image, without any further work. This simplicity is simply lovely. Besides robustness and reproducibility this gives us an easy route to authenticated initrds.

But of course, nothing is really that simple: working with vendor-generated initrds means that we can't adjust them anymore to the specifics of the individual host: if we pre-build the initrds and include them in the kernel image in immutable fashion then it becomes harder to support complex, more exotic storage or to parameterize it with local network server information, credentials, passwords, and so on. Now, for my simple laptop use-case these things don't matter, there's no need to extend/parameterize things, laptops and their setups are not that wildly different. But what to do about the cases where we want both: extensibility to cover for less common storage subsystems (iscsi, LVM, multipath, drivers for exotic hardware…) and parameterization?

Here's a proposal how to achieve that: let's build a basic initrd into the kernel as suggested, but then do two things to make this scheme both extensible and parameterizable, without compromising security.

  1. Let's define a way how the basic initrd can be extended with additional files, which are stored in separate "extension images". The basic initrd should be able to discover these extension images, authenticate them and then activate them, thus extending the initrd with additional resources on-the-fly.

  2. Let's define a way how we can safely pass additional parameters to the kernel/initrd (and actually the rest of the OS, too) in an authenticated (and possibly encrypted) fashion. Parameters in this context can be anything specific to the local installation, i.e. server information, security credentials, certificates, SSH server keys, or even just the root password that shall be able to unlock the root account in the initrd …

In such a scheme we should be able to deliver everything we are looking for:

  1. We'll have a full trust chain for the code: the boot loader will authenticate and measure the kernel and basic initrd. The initrd extension images will then be authenticated by the basic initrd image.

  2. We'll have authentication for all the parameters passed to the initrd.

This so far sounds very unspecific? Let's make it more specific by looking closer at the components I'd suggest to be used for this logic:

  1. The systemd suite has for a few months now contained a subsystem implementing system extensions (v248). System extensions are ultimately just disk images (for example a squashfs file system in a GPT envelope) that can extend an underlying OS tree. Extending in this regard means they simply add additional files and directories into the OS tree, i.e. below /usr/. For a longer explanation see systemd-sysext(8). When a system extension is activated it is simply mounted and then merged into the main /usr/ tree via a read-only overlayfs mount. What's particularly nice about them in the context we are talking about here is that the extension images may carry dm-verity authentication data, and PKCS#7 signatures (once this is merged, that is, i.e. v250).

  2. The systemd suite also contains a concept called service "credentials". These are small pieces of information passed to services in a secure way. One key feature of these credentials is that they can be encrypted and authenticated in a very simple way with a key bound to the TPM (v250). See LoadCredentialEncrypted= and systemd-creds(1) for details. They are great for safely storing SSL private keys and similar on your system, but they also come handy for parameterizing initrds: an encrypted credential is just a file that can only be decoded if the right TPM is around with the right PCR values set.

  3. The systemd suite contains a component called systemd-stub(7). It's an EFI stub, i.e. a small piece of code that is attached to a kernel image, and turns the kernel image into a regular EFI binary that can be directly executed by the firmware (or a boot loader). This stub has a number of nice features (for example, it can show a boot splash before invoking the Linux kernel itself and such). Once this work is merged (v250) the stub will support one more feature: it will automatically search for system extension image files and credential files next to the kernel image file, measure them and pass them on to the main initrd of the host.

Putting this together we have nice way to provide fully authenticated kernel images, initrd images and initrd extension images; as well as encrypted and authenticated parameters via the credentials logic.

How would a distribution actually make use of this? A distribution vendor would pre-build the basic initrd, glue it into the kernel image, and sign that as a whole. Then, for each supported extension of the basic initrd (e.g. one for iscsi support, one for LVM, one for multipath, …), the vendor would use a tool such as mkosi to build an extension image, i.e. a GPT disk image containing the files in squashfs format, a Verity partition that authenticates it, plus a PKCS#7 signature partition that validates the root hash for the dm-verity partition, and that can be checked against a key provided by the boot loader or main initrd. Then, any parameters for the initrd will be encrypted using systemd-creds encrypt -T. The resulting encrypted credentials and the initrd extension images are then simply placed next to the kernel image in the ESP (or boot partition). Done.
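The dm-verity part of this pipeline can be illustrated with a toy, single-level hash tree in Python. Real dm-verity uses a multi-level tree with an on-disk format, and the root hash is what gets signed; this sketch only shows the core verification idea:

```python
import hashlib

BLOCK = 4096  # dm-verity's default data block size

def hash_tree(image: bytes):
    """Single-level verity-style tree: one leaf hash per block, root over the leaves."""
    blocks = [image[i:i + BLOCK] for i in range(0, len(image), BLOCK)]
    leaves = [hashlib.sha256(b).digest() for b in blocks]
    return leaves, hashlib.sha256(b"".join(leaves)).digest()

def read_block(image: bytes, leaves, n: int) -> bytes:
    """Every read is verified against the tree before the data is returned."""
    data = image[n * BLOCK:(n + 1) * BLOCK]
    if hashlib.sha256(data).digest() != leaves[n]:
        raise IOError(f"verity mismatch in block {n}")
    return data

image = bytes(3 * BLOCK)                 # stand-in for an extension image
leaves, trusted_root = hash_tree(image)  # the root hash is what gets signed

# Reads of untouched data verify fine ...
assert read_block(image, leaves, 1) == bytes(BLOCK)

# ... while offline tampering is caught on the next read of that block.
tampered = b"\x01" + image[1:]
try:
    read_block(tampered, leaves, 0)
    raise AssertionError("tampering went undetected")
except IOError:
    pass
```

Because a single trusted root hash authenticates every block transitively, signing just that one value (via the PKCS#7 partition) is enough to authenticate the whole image.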

This checks all boxes: everything is authenticated and measured, the credentials also encrypted. Things remain extensible and modular, can be pre-built by the vendor, and installation is as simple as dropping in one file for each extension and/or credential.

How to Authenticate the Binary OS Resources

Let's now have a look at how to authenticate the binary OS resources, i.e. the stuff you find in /usr/, which is traditionally shipped to the user's system via RPMs or DEBs.

I think there are three relevant ways to authenticate this:

  1. Make /usr/ a dm-verity volume. dm-verity is a concept implemented in the Linux kernel that provides authenticity to read-only block devices: every read access is cryptographically verified against a top-level hash value. This top-level hash is typically a 256-bit value that you can either encode in the kernel image you are using, or cryptographically sign (which is particularly nice once this is merged). I think this is actually the best approach since it makes the /usr/ tree entirely immutable in a very simple way. However, this also means that the whole of /usr/ needs to be updated at once, i.e. the traditional rpm/apt based update logic cannot work in this mode.

  2. Make /usr/ a dm-integrity volume. dm-integrity is a concept provided by the Linux kernel that offers integrity guarantees for writable block devices, i.e. in some ways it can be considered a bit like dm-verity while permitting write access. It can be used in three ways, one of which I think is particularly relevant here. The first way is with a simple hash function in "stand-alone" mode: this is not too interesting here, it just provides greater data safety for file systems that don't hash-check their files' data on their own. The second way is in combination with dm-crypt, i.e. with disk encryption. In this case it adds authenticity to confidentiality: only if you know the right secret can you read and make changes to the data, and any attempt to make changes without knowing this secret key will be detected as an IO error on the next read by those in possession of the secret (more about this below). The third way is the one I think is most interesting here: in "stand-alone" mode, but with a keyed hash function (e.g. HMAC). What's this good for? This provides authenticity without encryption: if you make changes to the disk without knowing the secret, this will be noticed on the next read attempt and result in IO errors. This mode provides what we want (authenticity) and doesn't do what we don't need (encryption). Of course, the secret key for the HMAC must be provided somehow, I think ideally by the TPM.

  3. Make /usr/ a dm-crypt (LUKS) + dm-integrity volume. This provides both authenticity and encryption. The latter isn't typically needed for /usr/ given that it generally contains no secret data: anyone can download the binaries off the Internet anyway, and the sources too. By encrypting this you'll waste CPU cycles, but beyond that it doesn't hurt much. (Admittedly, some people might want to hide the precise set of packages they have installed, since it of course does reveal a bit of information about you: i.e. what you are working on, maybe what your job is – think: if you are a hacker you have hacking tools installed – and similar). Going this way might simplify things in some cases, as it means you don't have to distinguish "OS binary resources" (i.e /usr/) and "OS configuration and state" (i.e. /etc/ + /var/, see below), and just make it the same volume. Here too, the secret key must be provided somehow, I think ideally by the TPM.
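For the first (dm-verity) option, the low-level flow looks roughly like the following sketch, using cryptsetup's veritysetup. Device paths are illustrative, and in the scheme described in this article the open step would happen in the initrd with a signed or kernel-embedded root hash rather than one supplied by hand:

```shell
# One-time, at image build: compute the hash tree for the read-only
# /usr/ partition. veritysetup prints the root hash, which the vendor
# would then embed in the kernel image or sign.
veritysetup format /dev/disk/by-partlabel/usr /dev/disk/by-partlabel/usr-verity

# At boot: set up the verified device. Every read is then checked
# against the hash tree, and any tampering shows up as IO errors.
veritysetup open /dev/disk/by-partlabel/usr usr \
    /dev/disk/by-partlabel/usr-verity "$ROOT_HASH"
mount -o ro /dev/mapper/usr /usr
```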

All three approaches are valid. The first approach has my primary sympathies, but for distributions not willing to abandon client-side updates via RPM/dpkg it is not an option, in which case I would propose one of the other two approaches.

The LUKS encryption key (and, in the case of dm-integrity stand-alone mode, the key for the keyed hash function) should be bound to the TPM. Why the TPM for this? You could also use a user password, or a FIDO2 or PKCS#11 security token, but I think the TPM is the right choice. Why? To reduce the need for repeated authentication, i.e. so that you don't first have to provide the disk encryption password and then log in with yet another password. It should be possible for the system to boot up unattended and then require only one authentication prompt to unlock the user's data properly. The TPM provides a way to do this in a reasonably safe and fully unattended way. Also, when we stop considering just the laptop use-case for a moment: on servers interactive disk encryption prompts don't make much sense, and the fact that TPMs can provide secrets without user interaction, and thus work in entirely unattended environments, is quite desirable. Note that crypttab(5) as implemented by systemd (v248) provides native support for authentication via password, TPM2, PKCS#11 or FIDO2, so the choice is ultimately all yours.
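For illustration, a crypttab(5) entry binding a volume to the TPM2 chip might look like this (the device path is illustrative; with systemd v248 or newer the tpm2-device= option enables TPM2-based unlocking, with a passphrase prompt as fallback if that fails):

```
# /etc/crypttab
# volume  encrypted-device                  keyfile  options
root      /dev/disk/by-partlabel/root-luks  -        tpm2-device=auto
```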

How to Encrypt/Authenticate OS Configuration and State

Let's now look at the OS configuration and state, i.e. the stuff in /etc/ and /var/. It probably makes sense to not consider these two hierarchies independently but instead just consider this to be the root file system. If the OS binary resources are in a separate file system it is then mounted onto the /usr/ sub-directory of the root file system.

The OS configuration and state (or: root file system) should be both encrypted and authenticated: it might contain secret keys, user passwords, privileged logs and similar. This data matters, and plenty of it should remain confidential.

The encryption of choice here is dm-crypt (LUKS) + dm-integrity, as discussed above, again with the key bound to the TPM.

If the OS binary resources are protected the same way, it is safe to merge these two volumes and have a single partition for both (see above).

How to Encrypt/Authenticate the User's Home Directory

The data in the user's home directory should be encrypted, and bound to the user's preferred token of authentication (i.e. a password or FIDO2/PKCS#11 security token). As mentioned, in the traditional mode of operation the user's home directory is not individually encrypted, but only encrypted because FDE is in use. The encryption key for that is a system-wide key though, not a per-user key. And I think that's a problem, as mentioned (and probably not even generally understood by our users). We should correct that and ensure that the user's password is what unlocks the user's data.

In the systemd suite we provide a service systemd-homed(8) (v245) that implements this in a safe way: each user gets their own LUKS volume stored in a loopback file in /home/, and this is enough to synthesize a user account. The encryption password for this volume is the user's account password, so it's really the password provided at login time that unlocks the user's data. systemd-homed also supports other mechanisms of authentication, in particular PKCS#11/FIDO2 security tokens. It also provides support for other storage back-ends (such as fscrypt), but I'd always suggest using the LUKS back-end, since it's the only one providing the comprehensive confidentiality guarantees one wants for a UNIX-style home directory.
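For illustration, setting up such a home directory with systemd-homed might look like this (the user name is illustrative, and systemd-homed must be enabled on the system):

```shell
# Create a user whose home directory is a LUKS volume in a loopback
# file under /home/; the account password doubles as the volume key.
homectl create lennart --storage=luks

# Optionally bind unlocking to a FIDO2 security token as well:
homectl update lennart --fido2-device=auto
```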

Note that there's one special caveat here: if the user's home directory (e.g. /home/lennart/) is encrypted and authenticated, what about the file system this data is stored on, i.e. /home/ itself? If that dir is part of the root file system this would result in double encryption: first the data is encrypted with the TPM root file system key, and then again with the per-user key. Such double encryption is a waste of resources, and unnecessary. I'd thus suggest making /home/ its own dm-integrity volume with an HMAC, keyed by the TPM. This means the data stored directly in /home/ will be authenticated but not encrypted. That's good not only for performance, but also has practical benefits: it allows extracting the encrypted volumes of the various users in case the TPM key is lost, as a way to recover from dead laptops or similar.
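A stand-alone dm-integrity volume with a keyed hash can be set up with cryptsetup's integritysetup, roughly like this. The device path and key file are illustrative; in the scheme described here the key would ideally come from the TPM rather than a file on disk:

```shell
# One-time formatting of the /home/ partition with HMAC-SHA256
# integrity protection (authenticated, but not encrypted):
integritysetup format /dev/disk/by-partlabel/home \
    --integrity hmac-sha256 \
    --integrity-key-file /run/home-hmac.key --integrity-key-size 32

# At boot: open it. Reads of data modified without knowledge of the
# key fail with IO errors. Mount a file system on top as usual.
integritysetup open /dev/disk/by-partlabel/home home \
    --integrity hmac-sha256 \
    --integrity-key-file /run/home-hmac.key --integrity-key-size 32
mount /dev/mapper/home /home
```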

Why authenticate /home/, if it only contains per-user home directories that are authenticated on their own anyway? That's a valid question: it's because the kernel file system maintainers made clear that Linux file system code is not considered safe against rogue disk images, and is not tested for that; this means before you mount anything you need to establish trust in some way because otherwise there's a risk that the act of mounting might exploit your kernel.

Summary of Resources and their Protections

So, let's now put this all together. Here's a table showing the various resources we deal with, and how I think they should be protected (in my idealized world).

| Resource | Needs Authentication | Needs Encryption | Suggested Technology | Validation/Encryption Keys/Certificates acquired via | Stored where |
|---|---|---|---|---|---|
| Shim | yes | no | SecureBoot signature verification | firmware certificate database | ESP |
| Boot loader | yes | no | ditto | firmware certificate database/shim | ESP/boot partition |
| Kernel | yes | no | ditto | ditto | ditto |
| initrd | yes | no | ditto | ditto | ditto |
| initrd parameters | yes | yes | systemd TPM encrypted credentials | TPM | ditto |
| initrd extensions | yes | no | systemd-sysext with Verity+PKCS#7 signatures | firmware/initrd certificate database | ditto |
| OS binary resources | yes | no | dm-verity | root hash linked into kernel image, or firmware/initrd certificate database | top-level partition |
| OS configuration and state | yes | yes | dm-crypt (LUKS) + dm-integrity | TPM | top-level partition |
| /home/ itself | yes | no | dm-integrity with HMAC | TPM | top-level partition |
| User home directories | yes | yes | dm-crypt (LUKS) + dm-integrity in loopback files | user password/FIDO2/PKCS#11 security token | loopback file inside /home partition |

This should provide all the desired guarantees: everything is authenticated, and the individualized per-host or per-user data is also encrypted. No double encryption takes place. The encryption keys/verification certificates are stored/bound to the most appropriate infrastructure.

Does this address the three attack scenarios mentioned earlier? I think so, yes. The basic attack scenario I described is addressed by the fact that /var/, /etc/ and /home/*/ are encrypted. Brute-forcing the former two is harder than in the status quo ante model, since a high-entropy key is used instead of one derived from a user-provided password. Moreover, the "anti-hammering" logic of the TPM will make brute-forcing prohibitively slow. The home directories are protected by the user's password or, ideally, a personal FIDO2/PKCS#11 security token in this model. Of course, a password isn't better security-wise than the status quo ante. But given the FIDO2/PKCS#11 support built into systemd-homed it should be easier to lock down the home directories securely.

Binding encryption of /var/ and /etc/ to the TPM also addresses the first of the two more advanced attack scenarios: a copy of the harddisk is useless without the physical TPM chip, since the seed key is sealed into that. (And even if the attacker had the chance to watch you type in your password, it won't help unless they possess access to the TPM chip.) For the home directory this attack is not addressed as long as a plain password is used. However, since binding home directories to FIDO2/PKCS#11 tokens is built into systemd-homed things should be safe here too — provided the user actually possesses and uses such a device.

The backdoor attack scenario is addressed by the fact that every resource in play now is authenticated: it's hard to backdoor the OS if there's no component that isn't verified by signature keys or TPM secrets the attacker hopefully doesn't know.

For general purpose distributions that focus on updating the OS per RPM/dpkg the idealized model above won't work out, since (as mentioned) this implies an immutable /usr/, and thus requires updating /usr/ via an atomic update operation. For such distros a setup like the following is probably more realistic, but see above.

| Resource | Needs Authentication | Needs Encryption | Suggested Technology | Validation/Encryption Keys/Certificates acquired via | Stored where |
|---|---|---|---|---|---|
| Shim | yes | no | SecureBoot signature verification | firmware certificate database | ESP |
| Boot loader | yes | no | ditto | firmware certificate database/shim | ESP/boot partition |
| Kernel | yes | no | ditto | ditto | ditto |
| initrd | yes | no | ditto | ditto | ditto |
| initrd parameters | yes | yes | systemd TPM encrypted credentials | TPM | ditto |
| initrd extensions | yes | no | systemd-sysext with Verity+PKCS#7 signatures | firmware/initrd certificate database | ditto |
| OS binary resources, configuration and state | yes | yes | dm-crypt (LUKS) + dm-integrity | TPM | top-level partition |
| /home/ itself | yes | no | dm-integrity with HMAC | TPM | top-level partition |
| User home directories | yes | yes | dm-crypt (LUKS) + dm-integrity in loopback files | user password/FIDO2/PKCS#11 security token | loopback file inside /home partition |

This means there's only one root file system that contains all of /etc/, /var/ and /usr/.

Recovery Keys

When binding encryption to TPMs one problem that arises is what strategy to adopt if the TPM is lost, due to hardware failure: if I need the TPM to unlock my encrypted volume, what do I do if I need the data but lost the TPM?

The answer here is supporting recovery keys (this is similar to how other OSes approach this). Recovery keys are pretty much the same concept as passwords. The main difference is that they are computer-generated rather than user-chosen. Because of that they typically have much higher entropy (which makes them more annoying to type in, i.e. you want to use them only when you must, not day-to-day). Their higher entropy makes them useful in combination with TPM, FIDO2 or PKCS#11 based unlocking: unlike a combination with passwords, they do not compromise the higher strength of protection that TPM/FIDO2/PKCS#11 based unlocking is supposed to provide.

Current versions of systemd-cryptenroll(1) implement a recovery key concept in an attempt to address this problem. You may enroll any combination of TPM chips, PKCS#11 tokens, FIDO2 tokens, recovery keys and passwords on the same LUKS volume. When a recovery key is enrolled it is generated and shown on screen, both in text form and as a QR code you can scan off the screen if you like. The idea is to write down/store this recovery key in a safe place so that you can use it when you need it. Note that such recovery keys can be entered wherever a LUKS password is requested, i.e. after generation they behave pretty much the same as a regular password.
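Concretely, enrolling both a TPM2 chip and a recovery key on the same volume looks like this (the device path is illustrative):

```shell
# Enroll the local TPM2 chip for unattended unlocking:
systemd-cryptenroll --tpm2-device=auto /dev/disk/by-partlabel/root-luks

# Generate a high-entropy recovery key; it is shown once as text and
# as a QR code, and afterwards works wherever a LUKS passphrase does:
systemd-cryptenroll --recovery-key /dev/disk/by-partlabel/root-luks
```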

TPM PCR Brittleness

Locking devices to TPMs and enforcing a PCR policy with this (i.e. configuring the TPM key to be unlockable only if certain PCRs match certain values, and thus requiring the OS to be in a certain state) brings a problem with it: TPM PCR brittleness. If the key you want to unlock with the TPM requires the OS to be in a specific state (i.e. that all OS components' hashes match certain expectations or similar) then doing OS updates might have the effect of making your key inaccessible: the OS updates will cause the code to change, and thus the hashes of the code, and thus certain PCRs. (Thankfully, you enrolled a recovery key, as described above, so this doesn't mean you lost your data, right?)

To address this I'd suggest three strategies:

  1. Most importantly: don't actually use the TPM PCRs that contain code hashes. There are multiple PCRs defined, each containing measurements of different aspects of the boot process. My recommendation is to bind keys to PCR 7 only, a PCR that contains measurements of the UEFI SecureBoot certificate databases. Thus, the keys will remain accessible as long as these databases remain the same, and updates to code will not affect it (updates to the certificate databases will, and they do happen too, though hopefully much less frequently than code updates). Does this reduce security? Not much, no, because the code that's run is after all not just measured but also validated via code signatures, and those signatures are validated with the aforementioned certificate databases. Thus binding an encrypted TPM key to PCR 7 should enforce a similar level of trust in the boot/OS code as binding it to a PCR with hashes of specific versions of that code. I.e. using PCR 7 means you say "every code signed by these vendors is allowed to unlock my key", while using a PCR that contains code hashes means "only this exact version of my code may access my key".

  2. Use LUKS key management to enroll multiple versions of the TPM keys in the relevant volumes, to support multiple versions of the OS code (or multiple versions of the certificate database, as discussed above). Specifically: whenever an update is done that might result in changing the relevant PCRs, pre-calculate the new PCRs and enroll them in an additional LUKS slot on the relevant volumes. This means that the unlocking keys tied to the TPM remain accessible in both states of the system. Eventually, once rebooted after the update, remove the old slots.

  3. If these two strategies didn't work out (maybe because the OS/firmware was updated outside of OS control, or the update mechanism was aborted at the wrong time) and the TPM PCRs changed unexpectedly, and the user now needs to use their recovery key to get access to the OS back, let's handle this gracefully and automatically reenroll the current TPM PCRs at boot, after the recovery key checked out, so that for future boots everything is in order again.
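The first two strategies can be sketched with systemd-cryptenroll like this. The device path is illustrative, the slot number is hypothetical, and slot-wiping options only exist in newer systemd versions:

```shell
# Strategy 1: bind the TPM2 enrollment to PCR 7 only, i.e. to the
# SecureBoot certificate databases rather than to code hashes:
systemd-cryptenroll --tpm2-device=auto --tpm2-pcrs=7 \
    /dev/disk/by-partlabel/root-luks

# Strategy 2: around an update that changes the measured state, keep
# the volume unlockable by enrolling an additional TPM2 slot, then
# removing the stale enrollment after reboot, e.g. by slot number:
systemd-cryptenroll --tpm2-device=auto --tpm2-pcrs=7 \
    /dev/disk/by-partlabel/root-luks
systemd-cryptenroll --wipe-slot=2 /dev/disk/by-partlabel/root-luks
```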

Other approaches can work too: for example, some OSes simply remove TPM PCR policy protection of disk encryption keys altogether immediately before OS or firmware updates, and then reenable it right after. Of course, this opens a time window where the key bound to the TPM is much less protected than people might assume. I'd try to avoid such a scheme if possible.

Anything Else?

So, given that we are talking about idealized systems: I personally actually think the ideal OS would be much simpler, and thus more secure than this:

I'd try to ditch the Shim, and instead focus on enrolling the distribution vendor keys directly in the UEFI firmware certificate list. This is actually supported by all firmware, too. It has various benefits: it's no longer necessary to bind everything to Microsoft's root key; you can just enroll your own stuff and thus make sure only what you want to trust is trusted and nothing else. To make an approach like this easier, we have been working on automatic enrollment of these keys from the systemd-boot boot loader, see this work in progress for details. This way the firmware will authenticate the boot loader/kernel/initrd without any further component in place for this.

I'd also not bother with a separate boot partition, and just use the ESP for everything. The ESP is required anyway by the firmware, and is good enough for storing the few files we need.

FAQ

Can I implement all of this in my distribution today?

Probably not. While the big issues have mostly been addressed, there's a lot of integration work still missing. As you might have seen, I linked some PRs that haven't even been merged into our tree yet, let alone been released or entered the distributions.

Will this show up in Fedora/Debian/Ubuntu soon?

I don't know. I am making a proposal for how these things might work, and am working on getting various building blocks for this into shape. What the distributions do is up to them. But even if they don't follow the recommendations I make 100%, or don't want to use the building blocks I propose, I think it's important that they start thinking about this, and yes, I think they should be thinking about defaulting to setups like this.

Work for measuring/signing initrds on Fedora has been started, here's a slide deck with some information about it.

But isn't a TPM evil?

Some corners of the community have tried (unfortunately with some success) to paint TPMs/Trusted Computing/SecureBoot as generally evil technologies that stop us from using our systems the way we want. That idea is rubbish though, I think. We should focus on what this technology can deliver for us (and that's a lot, I think; see above), and appreciate the fact that we can actually use it to kick perceived evil empires out of our devices instead of being subjected to them. Yes, the way SecureBoot/TPMs are defined puts you in the driver's seat if you want, and you may enroll your own certificates to keep out everything you don't like.

What if my system doesn't have a TPM?

TPMs are becoming quite ubiquitous, in particular as the upcoming Windows versions will require them. In general I think we should focus on modern, fully equipped systems when designing all this, and then find fall-backs for more limited systems. Frankly it feels as if so far the design approach for all this was the other way round: try to make the new stuff work like the old rather than the old like the new (I mean, to me it appears this thinking is the main raison d'être for the Grub boot loader).

More specifically, on systems where we have no TPM we ultimately cannot provide the same security guarantees as on those that have one. So depending on the resource to protect we should fall back to different TPM-less mechanisms. For example, if there is no TPM then the root file system should probably be encrypted with a user-provided password, typed in at boot as before. And the boot credentials should then probably simply not be encrypted at all, and be placed in the ESP in plain text.

Effectively this means: without TPM you'll still get protection regarding the basic attack scenario, as before, but not the other two.

What if my system doesn't have UEFI?

Many of the mechanisms explained above taken individually do not require UEFI. But of course the chain of trust suggested above requires something like UEFI SecureBoot. If your system lacks UEFI it's probably best to find work-alikes to the technologies suggested above, but I doubt I'll be able to help you there.

rpm/dpkg already cryptographically validates all packages at installation time (gpg), why would I need more than that?

This type of package validation happens only once: at the moment of installation (or update) of the package, not when the installed data is actually used. Thus, if an attacker manages to modify the package data after installation and before use, they can make any change they like without this ever being noticed. Such package download validation does address certain attack scenarios (i.e. man-in-the-middle attacks on network downloads), but it doesn't protect you from attackers with physical access, as described in the attack scenarios above.

Systems such as ostree aren't better than rpm/dpkg regarding this BTW, their data is not validated on use either, but only during download or when processing tree checkouts.

The key point here is that the scheme explained above provides offline protection for the data "at rest": even someone with physical access to your device cannot easily make changes that aren't noticed on next use. rpm/dpkg/ostree provide online protection only: as long as the system remains up, all OS changes are done through the intended program code-paths, and no one has physical access, everything should be good. In today's world I am sure this is not good enough though. As mentioned, most modern OSes provide offline protection for the data at rest in one way or another. Generic Linux distributions are terribly behind on this.

This is all so desktop/laptop focused, what about servers?

I am pretty sure servers should provide similar security guarantees as outlined above. In a way servers are a much simpler case: there are no users and no interactivity. Thus the discussion of /home/ and what it contains and of user passwords doesn't matter. However, the authenticated initrd and the unattended TPM-based encryption I think are very important for servers too, in a trusted data center environment. It provides security guarantees so far not given by Linux server OSes.

I'd like to help with this, or discuss/comment on this

Submit patches or reviews through GitHub. General discussion about this is best done on the systemd mailing list.

September 16, 2021

I am back with another status update on raytracing in RADV. And the good news is that things are finally starting to come together. After ~9 months of on-and-off work we now have games working with raytracing. Working on the first try after I got all the required functionality in place was Control:

Control with raytracing on RADV

After poking for a long time at CTS and demos it is really nice to see the fruits of one’s work.

The piece that I added recently was copy/compaction and serialization of acceleration structures, which meant a bunch of shader writing, handling another type of query, and dealing with indirect dispatches. (Since of course the API doesn’t give the input size on the CPU. No idea how this API should be secured …)

What games?

I did try 5 games:

  1. Quake 2 RTX (Vulkan): works. This was working already on my previous update.
  2. Control (D3D): works. Pretty much just works. Runs at maybe 30-50% of RT performance on Windows.
  3. Metro Exodus (Vulkan): works. Needs one workaround and is very finicky in WSI but otherwise works fine. Runs at 20-25% of RT performance on Windows.
  4. Ghostrunner (D3D): Does not work. This really needs per-shader-group compilation instead of just mashing all the shaders together, as I currently get shaders with 1 million NIR instructions, which is a pain to debug.
  5. Doom Eternal (Vulkan): Does not work. The raytracing option in the menu stays grayed out and at this point I’m at a loss what is required to make the game allow enabling RT.

If anybody could tell me how to get Doom Eternal to allow RT I’d appreciate it.

What is next?

Of course the support is far from done. Some things to still make progress on:

  1. Upstreaming what I have. Samuel has been busy reviewing my MRs and I think there is a good chance that what I have now will make it into 21.3.
  2. Improve the pipeline compilation model to hopefully make ghostrunner work.
  3. Improved BVH building. The current BVH is really naive, which is likely one of the big performance factors.
  4. Improve traversal.
  5. Move on to stuff needed for DXR 1.1 like VK_KHR_ray_query.

P.S. If you haven’t seen it yet, Jason Ekstrand from Intel recently gave a talk about how Intel implements raytracing. It's a nice showcase of how you can provide a more involved hardware implementation than RDNA2 does.

It has been some time since my last update, so I felt it was time to flex my blog-writing muscles again and provide some updates on the things we are working on in Fedora in preparation for Fedora Workstation 35. This is not meant to be a comprehensive what’s-new article about Fedora Workstation 35, more a listing of some of the things we are doing as part of the Red Hat desktop team.

NVidia support for Wayland
One thing we have spent a lot of effort on for a long time now is getting full support for the NVidia binary driver under Wayland. It has been a recurring topic in our bi-weekly calls with the NVidia engineering team ever since we started looking at moving to Wayland. There has been basic binary driver support for some time, meaning you could run a native Wayland session on top of the binary driver, but the critical missing piece was that you could not get accelerated graphics when running applications through XWayland, our X.org compatibility layer. This basically meant that any application requiring 3D support which wasn’t yet a native Wayland application wouldn’t work. So over the last months we have had a great collaboration with NVidia around closing this gap, with them working closely with us on fixing issues in their driver while we have been fixing bugs and missing pieces in the rest of the stack. We have been reporting and discussing issues back and forth, allowing a very quick turnaround on issues as we find them, which of course all resulted in the NVidia 470.42.01 driver with XWayland support. I am sure we will find new corner cases that need to be resolved in the coming months, but I am equally sure we will be able to resolve them quickly due to the close collaboration we have now established with NVidia. And I know some people will wonder why we spent so much time working with NVidia around their binary driver, but the reality is that NVidia is the market leader, especially in the professional Linux workstation space, and there are a lot of people who would either end up not using Linux, or using Linux with X, without it, including a lot of Red Hat customers and Fedora users. And that is what I and my team are here for at the end of the day: to make sure Red Hat customers are able to get their job done using their Linux systems.

Lightweight kiosk mode
One of the wonderful things about open source is the constant flow of code and innovation between all the different parts of the ecosystem. For instance, one thing we on the RHEL side have often been asked about over the last few years is a lightweight and simple-to-use solution for people wanting to run single-application setups, like information boards, ATM machines, cash registers, information kiosks and so on. For many use cases people felt that running a full GNOME 3 desktop underneath their application was either too resource hungry, or created a risk that people would accidentally end up in the desktop session. At the same time, from our viewpoint as a development team we didn’t want a completely separate stack for this use case, as that would just increase our maintenance burden by forcing us to do a lot of things twice. So to solve this problem Ray Strode spent some time writing what we call GNOME Kiosk mode, which makes setting up a simple session running a single application easy, without running things like the GNOME Shell, Tracker, Evolution etc. This gives you a window manager with full support for the latest technologies such as compositing, libinput and Wayland, but coming in at about 18MB, which is about 71MB less than a minimal GNOME 3 desktop session. You can read more about the new Kiosk mode and how to use it in this great blog post from our savvy Edge Computing Product Manager Ben Breard. The kiosk mode session described in Ben’s article about RHEL will be available with Fedora Workstation 35.

High-definition mouse wheel support
A major part of what we do is making sure that Red Hat Enterprise Linux customers and Fedora users get hardware support on par with what you find on other operating systems. We try our best to work with our hardware partners, like Lenovo, to ensure that such hardware support arrives day-and-date with when those features are enabled on other systems, but some things end up taking longer, for various reasons. Support for high-definition mouse wheels was one of those. Peter Hutterer, our resident input expert, put together a great blog post explaining the history and status of high-definition mouse wheel support. As Peter points out in his blog post, the feature is not yet fully supported under Wayland, but we hope to close that gap in time for Fedora Workstation 35.

Mouse with HiRes scroll wheel

PipeWire
I feel I can’t do one of these posts without talking about the latest developments in PipeWire, our unified audio and video server. Wim Taymans keeps working with the rapidly growing PipeWire community to fix issues as they are reported and to add new features. Most recently Wim’s focus has been on implementing S/PDIF passthrough over both S/PDIF and HDMI connections. This will allow us to send undecoded data over such connections, which is critical for working well with surround sound systems and soundbars. The PipeWire community has also been working hard on further improving the Bluetooth support, with battery status support for the headset profile using Apple extensions. aptX-LL and FastStream codec support was also added. And of course a huge amount of bug fixes; it turns out that when you replace two different sound servers that have been around for close to two decades there are a lot of corner cases to cover :). Make sure to check out the two latest release notes, for 0.3.35 and 0.3.36, for details.

Screenshot of Easyeffects

EasyEffects is a great example of a cool new application built with PipeWire

Privacy screen
Another feature that we have been working on as a result of our Lenovo partnership is privacy screen support. For those not familiar with this technology, it basically allows you to reduce the readability of your screen when viewed from the side, so that if you are using your laptop at a coffee shop, for instance, a person sitting close by will have a much harder time trying to read what is on your screen. Hans de Goede has been shepherding the kernel side of this forward, working with Marco Trevisan from Canonical on the userspace part (which also makes this a nice example of cross-company collaboration), allowing you to turn this feature on or off. This feature is not likely to fully land in time for Fedora Workstation 35 though, so we are looking at whether we will bring it in as an update to Fedora Workstation 35 or make it a Fedora Workstation 36 feature.

Penny

Zink inside the penny


As most of you know, the future of 3D graphics on Linux is the Vulkan API from the Khronos Group. This doesn’t mean that OpenGL is going away anytime soon though, as there is a large host of applications out there using this API, and for certain types of 3D graphics development developers might still choose OpenGL over Vulkan. Of course for us that creates a bit of a challenge, because maintaining two 3D graphics interfaces is a lot of work, even with the great help and contributions from the hardware makers themselves. So we have been eyeing the Zink project, which aims at re-implementing OpenGL on top of Vulkan, as a potential candidate for solving our long-term need to support the OpenGL API without drowning us in work while doing so. The big advantage of Zink is that it allows us to support one shared OpenGL implementation across all hardware and then focus our hardware support efforts on the Vulkan drivers. As part of this effort Adam Jackson has been working on a project called Penny.

Zink implements OpenGL in terms of Vulkan as far as the drawing itself is concerned, but presenting that drawing to the rest of the system is currently system-specific (GLX). For hardware that already has a Mesa driver, we use GBM. On NVIDIA’s Vulkan (and probably any other binary stack on Linux, and likely also environments like WSL or macOS + MoltenVK) we download the image from the GPU back to the CPU and then use the same software upload/display path as llvmpipe, which as you can imagine is Not Fast.

Penny aims to extend Zink by replacing both of those paths, instead using the various Vulkan WSI extensions to manage presentation. Even for the GBM case this should enable higher performance, since Zink will have more information about the rendering pipeline (multisampling in particular is poorly handled at the moment). Future window system integration work can focus on Vulkan, with EGL and GLX getting features “for free” once they’re enabled in Vulkan.

3rd party software cleanup
Over time we have been working on adding more and more 3rd party software for easy consumption in Fedora Workstation. The problem we discovered though was that, due to this being done over time, with changing requirements and expectations, the functionality was not behaving in a very intuitive way, and there were also new questions that needed to be answered. So Allan Day and Owen Taylor spent some time this cycle to review all the bits and pieces of this functionality and worked to clean it up. The goal is that when you enable third-party repositories in Fedora Workstation 35 it behaves in a much more predictable and understandable way, and it also includes a lot of applications from Flathub. Yes, that is correct: you should be able to install a lot of applications from Flathub in Fedora Workstation 35 without having to first visit the Flathub website to enable it; instead they will show up once you have turned the knob for general 3rd party application support.

Power profiles
Another item we spent quite a bit of time on for Fedora Workstation 35 is making sure we integrate the power profiles work that Bastien Nocera has been doing as part of our collaboration with Lenovo. Power Profiles is basically a feature that allows your system to behave in a smarter way when it comes to power consumption and thus prolongs your battery life. For instance, when we notice you are getting low on battery, we can offer to put you into a strong power-saving mode to prolong how long you can use the system until you can recharge. There is a more in-depth explanation of power profiles in the official README.

Wayland
I have usually also ended up talking about Wayland in these posts, but I expect to be doing so less going forward, as we have now covered all the major gaps we saw between Wayland and X.org. Jonas Ådahl got the headless support merged, which was one of our big missing pieces, and as mentioned above Olivier Fourdan, Jonas and others worked with NVidia on getting the binary driver with XWayland support working with GNOME Shell. Of course, this being software, we are never truly done: there will be new issues discovered, random bugs that need to be fixed, and of course new features that need to be implemented. We already have our next big team focus in place, HDR support, which will need work from the graphics drivers, up through Mesa, into the window manager and the GUI toolkits, and in the applications themselves. We have been investigating and trying out some things for a while already, but we are now ready to make this a main focus for the team. In fact we will soon be posting a new job listing for a full-time engineer to work on HDR vertically through the stack, so keep an eye out for that if you are interested in working on this. The job will be open to candidates who wish to work remotely, so as long as Red Hat has a business presence in the country you live in, we should be able to offer you the job if you are the right candidate for us. Update: the job listing is now online for our HDR engineer.

BTW, if you want to see future updates and keep on top of other happenings from Fedora and Red Hat in the desktop space, make sure to follow me on Twitter.

September 01, 2021

I've been working on portals recently and one of the issues for me was that the documentation just didn't quite hit the sweet spot. At least the bits I found were either too high-level or too implementation-specific. So here's a set of notes on how a portal works, in the hope that this is actually correct.

First, portals are supposed to be a way for sandboxed applications (flatpaks) to trigger functionality they don't have direct access to. The prime example: opening a file without the application having access to $HOME. This is done by the applications talking to portals instead of implementing the functionality themselves.

There is really only one portal process: /usr/libexec/xdg-desktop-portal, started as a systemd user service. That process owns a DBus bus name (org.freedesktop.portal.Desktop) and an object on that name (/org/freedesktop/portal/desktop). You can see that bus name and object with D-Feet; from DBus' POV there's nothing special about it. What makes it the portal is simply that the application running inside the sandbox can talk to that DBus name and thus call the various methods. Obviously the xdg-desktop-portal needs to run outside the sandbox to do its thing.

There are multiple portal interfaces, all available on that one object. Those interfaces have names like org.freedesktop.portal.FileChooser (to open/save files). The xdg-desktop-portal implements those interfaces and thus handles any method calls on those interfaces. So where an application is sandboxed, it doesn't implement the functionality itself, it instead calls e.g. the OpenFile() method on the org.freedesktop.portal.FileChooser interface. Then it gets an fd back and can read the content of that file without needing full access to the file system.

Some interfaces are fully handled within xdg-desktop-portal. For example, the Camera portal checks a few things internally and pops up a dialog for the user to confirm access if needed [1], but otherwise there's nothing else involved with this specific method call.

Other interfaces have a backend "implementation" DBus interface. For example, the org.freedesktop.portal.FileChooser interface has a org.freedesktop.impl.portal.FileChooser (notice the "impl") counterpart. xdg-desktop-portal does not implement those impl.portals; it instead routes the DBus calls to the respective "impl.portal". Your sandboxed application calls OpenFile(), and xdg-desktop-portal now calls OpenFile() on org.freedesktop.impl.portal.FileChooser. That interface returns a value, xdg-desktop-portal extracts it and returns it back to the application in response to the original OpenFile() call.

What provides those impl.portals doesn't matter to xdg-desktop-portal, and this is where things are hot-swappable [2]. There are GTK- and Qt-specific portals with xdg-desktop-portal-gtk and xdg-desktop-portal-kde, but another one is provided by GNOME Shell directly. You can check the files in /usr/share/xdg-desktop-portal/portals/ and see which impl portal is provided on which bus name. The reason those impl.portals exist is so they can be native to the desktop environment - regardless of what application you're running, and with a generic xdg-desktop-portal, you see the native file chooser dialog for your desktop environment.

So the full call sequence is:

  • At startup, xdg-desktop-portal parses the /usr/share/xdg-desktop-portal/portals/*.portal files to know which impl.portal interface is provided on which bus name
  • The application calls OpenFile() on the org.freedesktop.portal.FileChooser interface on the object path /org/freedesktop/portal/desktop. It can do so because the bus name this object sits on is not restricted by the sandbox
  • xdg-desktop-portal receives that call. This is a portal with an impl.portal, so xdg-desktop-portal calls OpenFile() on the bus name that provides the org.freedesktop.impl.portal.FileChooser interface (as previously established by reading the *.portal files)
  • Assuming xdg-desktop-portal-gtk provides that portal at the moment, that process now pops up a GTK FileChooser dialog that runs outside the sandbox. User selects a file
  • xdg-desktop-portal-gtk sends back the fd for the file to the xdg-desktop-portal, and the impl.portal parts are done
  • xdg-desktop-portal receives that fd and sends it back as reply to the OpenFile() method in the normal portal
  • The application receives the fd and can read the file now
A few details here aren't fully correct, but it's correct enough to understand the sequence - the exact details depend on the method call anyway.
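The sequence above can be mimicked with a toy Python sketch; all class and method names here are invented stand-ins for the DBus pieces, which in reality live in separate processes:

```python
# Toy model of the call routing described above. All names are
# hypothetical; the real flow happens over DBus.

class ImplPortal:
    """Stands in for an impl.portal provider, e.g. xdg-desktop-portal-gtk."""
    def open_file(self):
        # A real impl portal pops up a file chooser dialog outside the
        # sandbox and returns an fd; we return a placeholder token.
        return "fd-for-selected-file"

class XdgDesktopPortal:
    """Stands in for the xdg-desktop-portal process."""
    def __init__(self):
        # Normally learned by parsing the *.portal files at startup.
        self.impl_portals = {
            "org.freedesktop.impl.portal.FileChooser": ImplPortal(),
        }

    def open_file(self):
        # Route the call to whichever backend provides FileChooser.
        impl = self.impl_portals["org.freedesktop.impl.portal.FileChooser"]
        return impl.open_file()

class SandboxedApp:
    """The flatpak'd application: it only sees the portal's bus name."""
    def __init__(self, portal):
        self.portal = portal

    def pick_file(self):
        return self.portal.open_file()

app = SandboxedApp(XdgDesktopPortal())
print(app.pick_file())  # the app gets an fd without $HOME access
```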

Finally: because of DBus restrictions, the various methods in the portal interfaces don't just reply with values. Instead, the xdg-desktop-portal creates a new org.freedesktop.portal.Request object and returns the object path for that. Once that's done the method is complete from DBus' POV. When the actual return value arrives (e.g. the fd), that value is passed via a signal on that Request object, which is then destroyed. This roundabout way is done for purely technical reasons, regular DBus methods would time out while the user picks a file path.
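That Request-object pattern looks roughly like this in a toy sketch (a thread and an event stand in for the DBus signal; all names are invented):

```python
import threading

class Request:
    """Toy stand-in for an org.freedesktop.portal.Request object."""
    def __init__(self):
        self.done = threading.Event()
        self.result = None

    def emit_response(self, result):
        # Stands in for the Response signal on the Request object.
        self.result = result
        self.done.set()

def open_file():
    # The method returns a Request immediately, so the DBus call
    # itself completes before the user even sees the dialog.
    request = Request()

    def user_picks_file():
        # This may take minutes in real life; a plain DBus method
        # reply would have timed out long before.
        request.emit_response("fd-for-selected-file")

    threading.Thread(target=user_picks_file).start()
    return request

req = open_file()   # returns right away
req.done.wait()     # the app waits for the "signal" instead
print(req.result)
```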

Anyway. Maybe this helps someone understanding how the portal bits fit together.

[1] it does so using another portal but let's ignore that
[2] not really hot-swappable though. You need to restart xdg-desktop-portal but not your host. So luke-warm-swappable only

Edit Sep 01: clarify that it's not GTK/Qt providing the portals, but xdg-desktop-portal-gtk and -kde

August 31, 2021

Let me talk here about how we implemented the support for performance counters in the Mesa V3D driver, the OpenGL driver used by the Raspberry Pi 4. For reference, the implementation is very similar to the one already available (not done by me, by the way) for VC4, the OpenGL driver for the Raspberry Pi 3 and prior devices, also part of Mesa. If you are already familiar with how this is implemented in VC4, then this will mostly be a refresher.

First of all, what are these performance counters? Most of the processors nowadays contain some hardware facilities to get measurements about what is happening inside the processor. And of course graphics processors aren’t different. In this case, the graphics chips used by Raspberry Pi devices (manufactured by Broadcom) can record a bunch of different graphics-related parameters: how many quads are passing or failing depth/stencil tests, how many clock cycles are spent on doing vertex/fragment shading, hits/misses in the GPU cache, and many other values. In fact, with the V3D driver it is possible to measure around 87 different parameters, and up to 32 of them simultaneously. VC4 has quite a few fewer, but still a lot.

On a hardware level, using these counters is just a matter of writing and reading some GPU registers. First, write the registers to select what we want to measure, then a few more to start the measurement, and finally read other registers containing the results. But of course, much like we don’t expect users to write GPU assembly code, we don’t expect users to write to GPU registers directly. Moreover, even Mesa drivers such as V3D can’t interact directly with the hardware; rather, this is done through the kernel, which can use the hardware directly via the DRM subsystem. In the case of V3D (and the same applies to VC4, and in general to any other driver), we have a driver in user-space (whether the OpenGL driver, V3D, or the Vulkan driver, V3DV) and a kernel driver in kernel-space, unsurprisingly also called V3D. The user-space driver is in charge of translating all the commands and options created with the OpenGL API (or another API) into batches of commands to be executed by the GPU, which are submitted to the kernel driver as DRM jobs. The kernel does the proper actions to send these to the GPU for execution, including touching the proper registers. Thus, if we want to implement support for the performance counters, we need to modify the code in two places: the kernel and the (user-space) driver.

Implementation in the kernel

Here we need to think about how to deal with the GPU and the registers to make the performance counters work, as well as the API we provide to user-space to use them. As mentioned before, the approach we are following here is the same as the one used in the VC4 driver: performance counters monitors. That is, the user-space driver creates one or more monitors, specifying for each monitor what counters it is interested in (up to 32 simultaneously, the hardware limit). The kernel returns a unique identifier for each monitor, which can be used later to do the measurement, query the results, and finally destroy it when done.

In this case, there isn’t an explicit start/stop for the measurement. Rather, every time the driver wants to measure a job, it includes the identifier of the monitor it wants to use for that job, if any. Before submitting a job to the GPU, the kernel checks if the job has a monitor identifier attached. If so, then it needs to check if the previous job executed by the GPU was also using the same monitor identifier, in which case it doesn’t need to do anything other than send the job to the GPU, as the performance counters required are already enabled. If the monitor is different, then it first needs to read the current counter values (through the proper GPU registers), add them to the outgoing monitor, stop the measurement, configure the counters for the new monitor, start the measurement again, and finally submit the new job to the GPU. In this process, if it turns out there wasn’t a monitor under execution before, then it only needs to execute the last steps.
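The per-job bookkeeping described above can be sketched like this (a toy Python model with invented names - the real logic lives in the V3D kernel driver and reads actual GPU registers):

```python
class PerfMonitorState:
    """Toy model of the kernel-side monitor bookkeeping described above."""
    def __init__(self):
        self.active_monitor = None  # monitor id of the last submitted job
        self.totals = {}            # monitor id -> accumulated counter values

    def read_hw_counters(self):
        # Placeholder for reading the GPU counter registers; the real
        # driver reads up to 32 configured counters here.
        return 1

    def submit_job(self, job_monitor):
        if job_monitor == self.active_monitor:
            return  # counters already configured, just run the job
        if self.active_monitor is not None:
            # Accumulate the outgoing monitor's values and stop measuring.
            self.totals[self.active_monitor] = (
                self.totals.get(self.active_monitor, 0)
                + self.read_hw_counters())
        if job_monitor is not None:
            # Configure the counters for the new monitor and restart
            # the measurement (register writes in the real driver).
            pass
        self.active_monitor = job_monitor

state = PerfMonitorState()
state.submit_job(1)
state.submit_job(1)  # same monitor: nothing to reconfigure
state.submit_job(2)  # monitor 1's values are accumulated before the swap
print(state.totals)
```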

The reason to do all this is that multiple applications can be executing at the same time, some using (different) performance counters, and most of them probably not using performance counters at all. But the performance counter values of one application shouldn’t affect any other application so we need to make sure we don’t mix up the counters between applications. Keeping the values in their respective monitors helps to accomplish this. There is still a small requirement in the user-space driver to help with accomplishing this, but in general, this is how we avoid the mixing.

If you want to take a look at the full implementation, it is available in a single commit.

Implementation in the driver

Once we have a way to create and manage the monitors, using them in the driver is quite easy: as mentioned before, we only need to create a monitor with the counters we are interested in and attach it to the job to be submitted to the kernel. In order to make things easier, we keep a mirror-like version of the monitor inside the driver.

This approach is adequate when you are developing the driver, as you can add code directly to it to check performance. But what about the final user, who is writing an OpenGL application and wants to check how to improve its performance, or find any bottleneck in it? We want the user to have a way to use OpenGL for this.

Fortunately, there is in fact a way to do this through OpenGL: the GL_AMD_performance_monitor extension. This OpenGL extension provides an API to query what counters the hardware supports, to create monitors, to start and stop them, and to retrieve the values. It looks very similar to what we have described so far, except for an important difference: the user needs to start and stop the monitors explicitly. We will explain later why this is necessary. But the key point here is that when we start a monitor, this means that from that moment on, until stopping it, any job created and submitted to the kernel will have the identifier of that monitor attached. This implies that only one monitor can be enabled in the application at the same time. But this isn’t a problem, as this restriction is part of the extension.

Our driver does not implement this API directly, but through “queries”, which are used then by the Gallium subsystem in Mesa to implement the extension. For reference, the V3D driver (as well as the VC4) is implemented as part of the Gallium subsystem. The Gallium part basically handles all the hardware-independent OpenGL functionality, and just requires the driver hook functions to be implemented by the driver. If the driver implements the proper functions, then Gallium exposes the right extension (in this case, the GL_AMD_performance_monitor extension).

For our case, it requires the driver to implement functions to return which counters are available, to create or destroy a query (in this case, the query is the same as the monitor), start and stop the query, and once it is finished, to get the results back.

At this point, I would like to explain a bit better what it implies to stop the monitor and get the results back. As explained earlier, stopping the monitor or query means that from that moment on, any new job submitted to the kernel (and thus to the GPU) won’t contain a performance monitor identifier attached, and hence won’t be measured. But it is important to know that the driver submits jobs to the kernel at its own pace, and these aren’t executed immediately; the GPU needs time to execute the jobs, so the kernel puts the arriving jobs in a queue, to be submitted to the GPU. This means that when the user stops the monitor, there could still be jobs in the queue that haven’t been executed yet and are thus pending to be measured.

And how do we know that the jobs have been executed by the GPU? The hook function that implements getting the query results has a “wait” parameter, which tells it whether it needs to wait for all the pending jobs to be executed and measured, or not. If it doesn’t wait but there are pending jobs, then it just returns, telling the caller that fact. This allows the caller to do other work in the meantime and query again later, instead of blocking while waiting for all the jobs to be executed. This is implemented through sync objects. Every time a job is sent to the kernel, there’s a sync object that is used to signal when the job has finished executing; this is mainly used as a way to synchronize jobs. In our case, when the user finalizes the query we save the sync object for the last submitted job, and we use it to know when this last job has been executed.
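A sketch of those “wait” semantics, with a trivial stand-in for the sync object (hypothetical names, not the actual Gallium hook signature):

```python
class Fence:
    """Toy stand-in for the sync object of the last submitted job."""
    def __init__(self):
        self.signaled = False

def get_query_result(fence, results, wait):
    # Mirrors the hook described above: if the last job hasn't been
    # executed yet, either block until it has, or report "not ready".
    if not fence.signaled:
        if not wait:
            return None  # caller can do other work and query again later
        # Pretend we blocked until the GPU signaled the sync object.
        fence.signaled = True
    return results

fence = Fence()
counters = {"cycles": 1234}
print(get_query_result(fence, counters, wait=False))  # not ready yet
print(get_query_result(fence, counters, wait=True))   # blocks, then returns
```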

There are quite a few details I’m not covering here. If you are interested though, you can take a look at the merge request.

Gallium HUD

So far we have seen how the performance counters are implemented, and how to use them. In all the cases it requires writing code to create the monitor/query, start/stop it, and querying back the results, either in the driver itself or in the application through the GL_AMD_performance_monitor extension1.

But what if we want to get some general measurements without adding code to the application or the driver? Fortunately, there is an environment variable, GALLIUM_HUD, that, when set correctly, will show some graphs on top of the application with the measured counters.

Using it is very easy; set it to “help” to learn how to use it, as well as to get a list of the available counters for the current hardware.

As an example:

$ env GALLIUM_HUD=L2T-CLE-reads,TLB-quads-passing-z-and-stencil-test,QPU-total-active-clk-cycles-vertex-coord-shading scorched3d

You will see:

Performance Counters in Scorched 3D

Bear in mind that to be able to use this you will need a kernel that supports performance counters for V3D. At the moment of writing this, no kernel has been released yet with this support. If you don’t want to wait for it, you can download the patch, apply it to your Raspberry Pi kernel (it has been tested on the 5.12 branch), and build and install it.

  1. All this is for the case of using OpenGL; if your application uses Vulkan, there are other similar extensions, which are not yet implemented in our V3DV driver at the moment of writing this post. 

Gut Ding braucht Weile - good things take time. Almost three years ago, we added high-resolution wheel scrolling to the kernel (v5.0). The desktop stack however was first lagging and eventually left behind (except for an update a year ago or so, see here). However, I'm happy to announce that thanks to José Expósito's efforts, we have now pushed it across the line. So - in a socially distanced manner and masked up to your eyebrows - gather round, children, for it is storytime.

Historical History

In the beginning, there was the wheel detent. Or rather there were 24 of them, dividing a 360 degree [1] movement of a wheel into a neat set of 15 clicks. libinput exposed those wheel clicks as part of the "pointer axis" namespace and you could get the click count with libinput_event_pointer_get_axis_discrete() (announced here). The degree value is exposed as libinput_event_pointer_get_axis_value(). Other scroll backends (finger-scrolling or button-based scrolling) expose the pixel-precise value via that same function.

In a "recent" Microsoft Windows version (Vista!), MS added the ability for wheels to trigger more than 24 clicks per rotation. The MS Windows API now treats one "traditional" wheel click as a value of 120, anything finer-grained will be a fraction thereof. You may have a mouse that triggers quarter-wheel clicks, each sending a value of 30. This makes for smoother scrolling and is supported(-ish) by a lot of mice introduced in the last 10 years [2]. Obviously, three small scrolls are nicer than one large scroll, so the UX is less bad than before.
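To make the arithmetic concrete: in this 120-based scheme a consumer accumulates the fractional values until they add up to a full detent. A toy sketch of that accumulation (plain Python, not any actual Windows or kernel API):

```python
def accumulate_clicks(v120_events):
    """Sum 120-based scroll units and count full wheel detents.

    Returns (clicks, remainder): the number of traditional clicks
    completed (negative for the opposite direction) and the leftover
    fraction still accumulating."""
    acc, clicks = 0, 0
    for value in v120_events:
        acc += value
        while abs(acc) >= 120:  # a full detent's worth arrived
            step = 120 if acc > 0 else -120
            acc -= step
            clicks += 1 if step > 0 else -1
    return clicks, acc

# Four quarter-clicks of 30 each add up to one traditional click.
print(accumulate_clicks([30, 30, 30, 30]))  # (1, 0)
```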

Now it's time for libinput to catch up with Windows Vista! For $reasons, the existing pointer axis API could not be changed to accommodate the high-res values, so a new API was added for scroll events. Read on for the details, you will believe what happens next.

Out with the old, in with the new

As of libinput 1.19, libinput has three new events: LIBINPUT_EVENT_POINTER_SCROLL_WHEEL, LIBINPUT_EVENT_POINTER_SCROLL_FINGER, and LIBINPUT_EVENT_POINTER_SCROLL_CONTINUOUS. These events reflect, perhaps unsurprisingly, scroll movements of a wheel, a finger or along a continuous axis (e.g. button scrolling). And they replace the old event LIBINPUT_EVENT_POINTER_AXIS. Those familiar with libinput will notice that the new event names encode the scroll source in the event name now. This makes them slightly more flexible and saves callers an extra call.

In terms of actual API, the new events come with two new functions. The first is libinput_event_pointer_get_scroll_value(). For the FINGER and CONTINUOUS events, the value returned is in "pixels" [3]. For the new WHEEL events, the value is in degrees. IOW, this is a drop-in replacement for the old libinput_event_pointer_get_axis_value() function. The second is libinput_event_pointer_get_scroll_value_v120() which, for WHEEL events, returns the 120-based logical units the kernel uses as well. libinput_event_pointer_has_axis() returns true if the given axis has a value, just as before. With those three calls you now get the data for the new events.

Backwards compatibility

To ensure backwards compatibility, libinput generates both old and new events so the rule for callers is: if you want to support the new events, just ignore the old ones completely. libinput also guarantees new events even on pre-5.0 kernels. This makes the old and new code easy to ifdef out, and once you get past the immediate event handling the code paths are virtually identical.

When, oh when?

These changes have been merged into the libinput main branch and will be part of libinput 1.19. Which is due to be released over the next month or so, so feel free to work backwards from that for your favourite distribution.

Having said that, libinput is merely the lowest block in the Jenga tower that is the desktop stack. José linked to the various MRs in the upstream libinput MR, so if you're on your seat's edge waiting for e.g. GTK to get this, well, there's an MR for that.

[1] That's degrees of an angle, not Fahrenheit
[2] As usual, on a significant number of those you'll need to know whatever proprietary protocol the vendor deemed to be important IP. Older MS mice stand out here because they use straight HID.
[3] libinput doesn't really have a concept of pixels, but it has a normalized pixel that movements are defined as. Most callers take that as real pixels except for the high-resolution displays where it's appropriately scaled.

August 25, 2021

A year ago, I first announced libei - a library to support emulated input. After an initial spurt of development, it was left mostly untouched until a few weeks ago. Since then, another flurry of changes has been added, including some initial integration into GNOME's mutter. So, let's see what has changed.

A Recap

First, a short recap of what libei is: it's a transport layer for emulated input events to allow for any application to control the pointer, type, etc. But, unlike the XTEST extension in X, libei allows the compositor to be in control of the clients, the devices they can emulate, and the input events as well. So it's safer than XTEST but also a lot more flexible. libei already supports touch and smooth scrolling events, something XTest doesn't have or is struggling with.

Terminology refresher: libei is the client library (used by an application wanting to emulate input), EIS is the Emulated Input Server, i.e. the part that typically runs in the compositor.

Server-side Devices

So what has changed recently: first, the whole approach has flipped on its head - now a libei client connects to the EIS implementation and "binds" to the seats the EIS implementation provides. The EIS implementation then provides input devices to the client. In the simplest case, that's just a relative pointer but we have capabilities for absolute pointers, keyboards and touch as well. Plans for the future is to add gestures and tablet support too. Possibly joysticks, but I haven't really thought about that in detail yet.

So basically, the initial conversation with an EIS implementation goes like this:

  • Client: Hello, I am $NAME
  • Server: Hello, I have "seat0" and "seat1"
  • Client: Bind to "seat0" for pointer, keyboard and touch
  • Server: Here is a pointer device
  • Server: Here is a keyboard device
  • Client: Send relative motion event 10/2 through the pointer device
Notice how the touch device is missing? The capabilities the client binds to are just what the client wants, the server doesn't need to actually give the client a device for that capability.

One of the design choices for libei is that devices are effectively static. If something changes on the EIS side, the device is removed and a new device is created with the new data. This applies for example to regions and keymaps (see below), so libei clients need to be able to re-create their internal states whenever the screen or the keymap changes.

Device Regions

Devices can now have regions attached to them, also provided by the EIS implementation. These regions define areas reachable by the device and are required for clients such as Barrier. On a dual-monitor setup you may have one device with two regions or two devices with one region (representing one monitor), it depends on the EIS implementation. But either way, as libei client you will know that there is an area and you will know how to reach any given pixel on that area. Since the EIS implementation decides the regions, it's possible to have areas that are unreachable by emulated input (though I'm struggling a bit for a real-world use-case).

So basically, the conversation with an EIS implementation goes like this:

  • Client: Hello, I am $NAME
  • Server: Hello, I have "seat0" and "seat1"
  • Client: Bind to "seat0" for absolute pointer
  • Server: Here is an abs pointer device with regions 1920x1080@0,0, 1080x1920@1920,0
  • Server: Here is an abs pointer device with regions 1920x1080@0,0
  • Server: Here is an abs pointer device with regions 1080x1920@1920,0
  • Client: Send abs position 100/100 through the second device
Notice how we have three absolute devices? A client emulating a tablet that is mapped to a screen could just use the third device. As with everything, the server decides what devices are created and the clients have to figure out what they want to do and how to do it.
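As a toy illustration of how a client might pick a device by region (find_device_for_position is an invented helper, not actual libei API):

```python
def find_device_for_position(devices, x, y):
    """Pick the first device whose region contains the given pixel.

    Each device is (name, (width, height, offset_x, offset_y)),
    matching the 1920x1080@0,0 notation used above."""
    for name, (w, h, ox, oy) in devices:
        if ox <= x < ox + w and oy <= y < oy + h:
            return name
    return None  # position not reachable by emulated input

# The dual-monitor example from the conversation above.
devices = [
    ("left-monitor", (1920, 1080, 0, 0)),
    ("right-monitor", (1080, 1920, 1920, 0)),
]
print(find_device_for_position(devices, 100, 100))   # left-monitor
print(find_device_for_position(devices, 2000, 500))  # right-monitor
```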

Perhaps unsurprisingly, the use of regions make libei clients windowing-system independent. The Barrier EI support WIP no longer has any Wayland-specific code in it. In theory, we could implement EIS in the X server and libei clients would work against that unmodified.

Keymap handling

The keymap handling has been changed so the keymap too is provided by the EIS implementation now, effectively in the same way as the Wayland compositor provides the keymap to Wayland clients. This means a client knows what keycodes to send, it can handle the state to keep track of things, etc. Using Barrier as an example again - if you want to generate an "a", you need to look up the keymap to figure out which keycode generates an A, then you can send that through libei to actually press the key.

Admittedly, this is quite messy. XKB (and specifically libxkbcommon) does not make it easy to go from a keysym to a keycode. The existing Barrier X code is full of corner cases with XKB already; I expect those to be necessary for the EI support as well.

Scrolling

Scroll events have four types: pixel-based scrolling, discrete scrolling, and scroll stop/cancel events. The first should be obvious, discrete scrolling is for mouse wheels. It uses the same 120-based API that Windows (and the kernel) use, so it's compatible with high-resolution wheel mice. The scroll stop event notifies an EIS implementation that the scroll interaction has stopped (e.g. lifting fingers off) which in turn may start kinetic scrolling - just like the libinput/Wayland scroll stop events. The scroll cancel event notifies the EIS implementation that scrolling really has stopped and no kinetic scrolling should be triggered. There's no equivalent in libinput/Wayland for this yet but it helps to get the hook in place.

Emulation "Transactions"

This has fairly little functional effect, but interactions with an EIS server are now sandwiched in a start/stop emulating pair. While this doesn't matter for one-shot tools like xdotool, it does matter for things like Barrier which can send the start emulating event when the pointer enters the local window. This again allows the EIS implementation to provide some visual feedback to the user. To correct the example from above, the sequence is actually:

  • ...
  • Server: Here is a pointer device
  • Client: Start emulating
  • Client: Send relative motion event 10/2 through the pointer device
  • Client: Send relative motion event 1/4 through the pointer device
  • Client: Stop emulating
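The bracketing can be sketched as a context manager — the event names and the callable are invented for illustration; a real libei client sends protocol messages, not strings:

```python
from contextlib import contextmanager

@contextmanager
def emulating(send_event):
    """Sandwich a batch of emulated events in a start/stop pair.

    Guarantees the stop event is sent even if emitting an event
    in between raises.
    """
    send_event("start_emulating")
    try:
        yield
    finally:
        send_event("stop_emulating")

events = []
with emulating(events.append):
    events.append("motion 10/2")
    events.append("motion 1/4")

assert events == ["start_emulating", "motion 10/2",
                  "motion 1/4", "stop_emulating"]
```

This is what allows an EIS implementation like GNOME Shell to show feedback for the whole interaction rather than per event.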

Properties

Finally, there is now a generic property API, something copied from PipeWire. Properties are simple key/value string pairs and cover the things that aren't in the immediate API. One example: the portal can set things like "ei.application.appid" to the Flatpak's appid. Properties can be locked down, and only libei itself can set properties before the initial connection. This makes them reliable enough for the EIS implementation to make decisions based on their values. Just like with PipeWire, the list of useful properties will grow over time; it's too early to tell what is really needed.
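A minimal sketch of such lockable key/value properties (illustrative only, not the libei API):

```python
class Properties:
    """Key/value string properties; locked keys can no longer change."""

    def __init__(self):
        self._props = {}
        self._locked = set()

    def set(self, key, value, lock=False):
        if key in self._locked:
            raise PermissionError(f"property {key} is locked")
        self._props[key] = value
        if lock:
            self._locked.add(key)

    def get(self, key):
        return self._props.get(key)

props = Properties()
# A trusted party (e.g. the portal) sets and locks the appid:
props.set("ei.application.appid", "org.example.App", lock=True)
# A later attempt to overwrite it fails:
try:
    props.set("ei.application.appid", "org.evil.App")
except PermissionError:
    pass
assert props.get("ei.application.appid") == "org.example.App"
```

Locking is what makes the values trustworthy enough for an EIS implementation to base policy decisions on them.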

Repositories

Now, for the actual demo bits: I've added enough support to Barrier, XWayland, Mutter and GNOME Shell that I can control a GNOME on Wayland session through Barrier (note: the controlling host still needs to run X since we don't have the ability to capture input events under Wayland yet). The keymap handling in Barrier is nasty but it's enough to show that it can work.

GNOME Shell has a rudimentary UI, again just to show what works:

The status icon shows ... if libei clients are connected, it changes to !!! while the clients are emulating events. Clients are listed by name and can be disconnected at will. I am not a designer, this is just a PoC to test the hooks.

Note how xdotool is listed in this screenshot: that tool is unmodified, it's the XWayland libei implementation that allows it to work and show up correctly.

The various repositories are in the "wip/ei" branch of:

And of course libei itself.

Where to go from here? The last weeks were driven by rapid development, so there's plenty of test cases to be written to make sure the new code actually works as intended. That's easy enough. Looking at the Flatpak integration is another big ticket item; once the portal details are sorted, all the pieces are (at least theoretically) in place. That aside, improving the integrations into the various systems above is obviously what's needed to get this working OOTB on the various distributions. Right now it's all very much in alpha stage and I could use help with all of those (unless you're happy to wait another year or so...). Do ping me if you're interested in working on any of this.

August 23, 2021

Hi all, hope you all are doing fine!

Finally today the part 5 of my Outreachy Saga came out, week 9 was on 7/19/21, and as you can see I'm a little late too( ;P )... This week had the theme: “Career opportunities/ Career Goals”

When I read the Outreachy Organizers email, I had an anxiety crisis, starting to think about what I want to do after the internship, what my career goals are, I panicked... The Imposter Syndrome hit hard, and it still haunts my thoughts and it has been very challenging (as my therapist says) to work with this feeling of not being good enough to apply for a job opening or thinking that my resume is worthless...

But week 11 arrived #SPOILERALERT with the theme: "Making connections" and talking to some people I could get to know their experiences in companies that work with free software and their contributions to it, I could feel that I am on the right path, that this is the area I want to work on. So let's get back to the topic of today's post!!

What am I looking for?

I'm looking for a job, preferably remote, which can be full or part-time!

But also some other opportunity where I can improve my CV and help me to continue working with the Linux Kernel, preferably. The end of my Outreachy internship is fast approaching (or rather, it's the day after tomorrow (o_o) ), so after August 24th, I will be available to work full or part-time. I currently live in Fundão, Portugal, and I am open to remote positions based anywhere in the world, along with the possibility of international relocation (my dream is to live in Canada!!! 😜).

What types of work would you like to contribute to?

I would like to continue working with the Linux Kernel, and I also really like Embedded Systems (I've played a little with Raspberry, Beagle Bone, and Arduino)

What tools or skills do you have that would help you with that work?

You know... I have a lot of difficulty answering this question because I always think I don't have many skills, but I'll do my best!!!

During my Outreachy internship, I created vkms_config_show(), the function which aims to print the data in drm_debugfs_create_files(), and I also started to learn how to debug the code. I have already worked a little with the Coccinelle tool. And as I mentioned earlier, I've already used Raspberry, Beagle Bone, and Arduino.

What languages do you speak, and at what school grade level?

Portuguese (Native), English (intermediate)

And reviewing now my CV and Linkedin, I could see that I've already done a lot!

I have experience in remote work with people from different parts of the world. I was the President of DAECOMP (Academic Directory of Computer Engineering - that is what we call our students' association), where I needed to interact with my classmates to find out which improvements they thought the course needed, and with the campus management and our professors to get things done to improve the course and the campus; I also organized Free Software dissemination events. I was also a flute tutor (yes, I study music!!!) and a robotics tutor, working with young people and children.

Well, I'll stop here... To see more about my experience, feel free to visit my Linkedin

Once again thank you for following me so far, my adventure with Outreachy is almost over, every day has been a lot of learning! Please feel free to comment! And stay tuned to the next chapters of this Saga!!!

Take care and have a great day!

August 11, 2021

Here I’m playing “Spelunky 2” on my laptop and simultaneously replaying the same Vulkan calls on an ARM board with Adreno GPU running the open source Turnip Vulkan driver. Hint: it’s an x64 Windows game that doesn’t run on ARM.

The bottom right is the game I’m playing on my laptop, the top left is GFXReconstruct immediately replaying Vulkan calls from the game on ARM board.

How is it done? And why would it be useful for debugging? Read below!


Debugging issues a driver faces with real-world applications requires the ability to capture and replay graphics API calls. However, for mobile GPUs it becomes even more challenging, since for a Vulkan driver the main "source" of real-world workloads are x86-64 apps that run via Wine + DXVK - mainly games made for desktop x86-64 Windows that do not run on ARM. Efforts are being made to run these apps on ARM, but that is still work-in-progress. And we want to test the drivers NOW.

The obvious solution would be to run those applications on an x86-64 machine, capturing all Vulkan calls, and then replay those calls on a second machine where we cannot run the app. This way it would be possible to test the driver even without running the application directly on it.

The main trouble is that Vulkan calls made on one GPU + driver combo are not generally compatible with another GPU + driver combo, sometimes not even between GPUs from the same vendor. There are different memory capabilities (VkPhysicalDeviceMemoryProperties), different memory requirements for buffers and images, different extensions available, and different optional features supported. It is easier with OpenGL, but there are some incompatibilities there as well.

There are two open-source vendor-agnostic tools for capturing Vulkan calls: RenderDoc (captures single frame) and GFXReconstruct (captures multiple frames). RenderDoc at the moment isn’t suitable for the task of capturing applications on desktop GPUs and replaying on mobile because it doesn’t translate memory type and requirements (see issue #814). GFXReconstruct on the other hand has the necessary features for this.

I’ll show a couple of tricks with GFXReconstruct I’m using to test things on Turnip.


Capturing with GFXReconstruct

At this point you either have the application itself or, if it doesn't use Vulkan, a trace of its calls that can be translated to Vulkan. There are detailed instructions on how to use GFXReconstruct to capture a trace on a desktop OS. However, there are no clear instructions for how to do this on Android (see issue #534); fortunately there is one in Android's documentation:

Android how-to (click me)
For Android 9 you should copy layers to the application which will be traced
For Android 10+ it's easier to copy them to com.lunarg.gfxreconstruct.replay
You should have userdebug build of Android or probably rooted Android

# Push GFXReconstruct layer to the device
adb push libVkLayer_gfxreconstruct.so /sdcard/

# Since there is no APK for the capture layer,
# copy the layer to e.g. folder of com.lunarg.gfxreconstruct.replay
adb shell run-as com.lunarg.gfxreconstruct.replay cp /sdcard/libVkLayer_gfxreconstruct.so .

# Enable layers
adb shell settings put global enable_gpu_debug_layers 1

# Specify target application
adb shell settings put global gpu_debug_app <package_name>

# Specify layer list (from top to bottom)
adb shell settings put global gpu_debug_layers VK_LAYER_LUNARG_gfxreconstruct

# Specify packages to search for layers
adb shell settings put global gpu_debug_layer_app com.lunarg.gfxreconstruct.replay

If the target application doesn't have rights to write into external storage, you should change where the capture file is created:

adb shell "setprop debug.gfxrecon.capture_file '/data/data/<target_app_folder>/files/'"


However, when trying to replay the trace you captured on another GPU - most likely it will result in an error:

[gfxrecon] FATAL - API call vkCreateDevice returned error value VK_ERROR_EXTENSION_NOT_PRESENT that does not match the result from the capture file: VK_SUCCESS.  Replay cannot continue.
Replay has encountered a fatal error and cannot continue: the specified extension does not exist

Or other errors/crashes. Fortunately we can limit the capabilities of the desktop GPU with VK_LAYER_LUNARG_device_simulation.

When simulating another GPU, VK_LAYER_LUNARG_device_simulation should be told to intersect the capabilities of both GPUs, making the capture compatible with both of them. This can be achieved with recently added environment variables:

VK_DEVSIM_MODIFY_EXTENSION_LIST=whitelist
VK_DEVSIM_MODIFY_FORMAT_LIST=whitelist
VK_DEVSIM_MODIFY_FORMAT_PROPERTIES=whitelist

The whitelist name is rather confusing because it essentially means "intersection".
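Conceptually, the "whitelist" behaviour boils down to a set intersection — the extension names below are real Vulkan extensions, but the per-device lists are made up for illustration:

```python
def intersect_caps(gpu_a, gpu_b):
    """Keep only capabilities both GPUs support ("whitelist" in
    devsim terms), preserving the order of the first list."""
    supported_by_b = set(gpu_b)
    return [cap for cap in gpu_a if cap in supported_by_b]

# Hypothetical extension lists for a desktop GPU and a Turnip device:
desktop = ["VK_KHR_swapchain", "VK_KHR_ray_tracing_pipeline",
           "VK_EXT_robustness2"]
turnip = ["VK_KHR_swapchain", "VK_EXT_robustness2"]

# A capture made against the intersection replays on both devices:
assert intersect_caps(desktop, turnip) == ["VK_KHR_swapchain",
                                           "VK_EXT_robustness2"]
```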

One also needs a JSON file describing the target GPU's capabilities; it can be produced by running:

vulkaninfo -j &> <device_name>.json

The final command to capture a trace would be:

VK_LAYER_PATH=<path/to/device-simulation-layer>:<path/to/gfxreconstruct-layer> \
VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_gfxreconstruct:VK_LAYER_LUNARG_device_simulation \
VK_DEVSIM_FILENAME=<device_name>.json \
VK_DEVSIM_MODIFY_EXTENSION_LIST=whitelist \
VK_DEVSIM_MODIFY_FORMAT_LIST=whitelist \
VK_DEVSIM_MODIFY_FORMAT_PROPERTIES=whitelist \
<the_app>

Replaying with GFXReconstruct

gfxrecon-replay -m rebind --skip-failed-allocations <trace_name>.gfxr
  • -m Enable memory translation for replay on GPUs with memory types that are not compatible with the capture GPU’s
    • rebind Change memory allocation behavior based on resource usage and replay memory properties. Resources may be bound to different allocations with different offsets.
  • --skip-failed-allocations skip vkAllocateMemory, vkAllocateCommandBuffers, and vkAllocateDescriptorSets calls that failed during capture

Without these options replay would fail.

Now you could easily test any app/game on your ARM board, if you have enough RAM =) I even successfully ran a capture of “Metro Exodus” on Turnip.

But what if I want to test something that requires interactivity?

Or maybe you don’t want to save a huge trace on disk, which could grow to tens of gigabytes if the application runs for a considerable amount of time.

During recording, GFXReconstruct just appends calls to a file; there are no additional post-processing steps. Given that, the next logical step is to skip writing to disk and send the Vulkan calls over the network!

This would allow us to interact with the application and immediately see the results on another device with different GPU. And so I hacked together a crude support of over-the-network replay.

The only difference with ordinary tracing is that now instead of file we have to specify a network address of the target device:

VK_LAYER_PATH=<path/to/device-simulation-layer>:<path/to/gfxreconstruct-layer> \
    ...
GFXRECON_CAPTURE_FILE="<ip>:<port>" \
<the_app>

And on the target device:

while true; do gfxrecon-replay -m rebind --sfa ":<port>"; done

Why while true? It is common for DXVK to call vkCreateInstance several times, leading to the creation of several traces. When replaying over the network we therefore want gfxrecon-replay to restart immediately when one trace ends, to be ready for the next.

You may want to bring the FPS down to match the capabilities of the lower-powered GPU in order to prevent constant hiccups. This can be done either with libstrangle or with mangohud:

  • stranglevk -f 10
  • MANGOHUD_CONFIG=fps_limit=10 mangohud

You have seen the result at the start of the post.

August 05, 2021

Just about a year after the original announcement, I think it's time to see the progress on power-profiles-daemon.

Note that I would still recommend you read the up-to-date project README if you have questions about why this project was necessary, and why a new project was started rather than building on an existing one.

 The project was born out of the need to make a firmware feature available to end-users for a number of lines of Lenovo laptops for them to be fully usable on Fedora. For that, I worked with Mark Pearson from Lenovo, who wrote the initial kernel support for the feature and served as our link to the Lenovo firmware team, and Hans de Goede, who worked on making the kernel interfaces more generic.

More generic, but in a good way

 With the initial kernel support written for (select) Lenovo laptops, Hans implemented a more generic interface called platform_profile. This interface is now the one that power-profiles-daemon will integrate with, and means that it also supports a number of Microsoft Surface, HP, Lenovo's own Ideapad laptops, and maybe Razer laptops soon.

 The next item to make more generic is Lenovo's "lap detection" which still relies on a custom driver interface. This should be soon transformed into a generic proximity sensor, which will mean I get to work some more on iio-sensor-proxy.

Working those interactions

 power-profiles-daemon landed in a number of distributions, sometimes enabled by default, sometimes not (sigh, the less said about that the better), which fortunately meant that we had some early feedback available.

 The goal was always to have the user in control, but we still needed to think carefully about how the UI would look and how users would interact with it when a profile was temporarily unavailable, or the system started a "power saver" mode because battery was running out.

 The latter is something that David Redondo's work on the "HoldProfile" API made possible. Software can programmatically switch to the power-saver or performance profile for the duration of a command. This is useful to switch to the Performance profile when running a compilation (eg. powerprofilesctl jhbuild --no-interact build gnome-shell), or for gnome-settings-daemon to set the power-saver profile when low on battery.

 The aforementioned David Redondo and Kai Uwe Broulik also worked on the KDE interface to power-profiles-daemon, as Florian Müllner implemented the gnome-shell equivalent.

Promised by me, delivered by somebody else :)

 I took this opportunity to update the Power panel in Settings, which shows off the temporary switch to the performance mode, and the setting to automatically switch to power-saver when low on battery.

Low-Power, everywhere

 Talking of which, while it's important for the system to know that it's targeting a power-saving behaviour, it's also pretty useful for applications to try and behave better.
 
 Maybe you've already integrated with "low memory" events using GLib, but thanks to Patrick Griffis you can be an even better ecosystem citizen and monitor whether the system is in "Power Saver" mode and adjust your application's behaviour.
 
 This feature will be available in GLib 2.70 along with documentation of useful steps to take. GNOME Software will already be using this functionality to avoid large automated downloads when energy saving is needed.

Availability

 The majority of the above features are available in the GNOME 41 development branches and should get to your favourite GNOME-friendly distribution for their next release, such as Fedora 35.
August 04, 2021

 I've been chasing a crocus misrendering bug shown in a Qt trace.


The bottom image is crocus, with 965 on top. This only happened on Gen4-5, so Ironlake and GM45 were my test machines. I burned a lot of time trying to work this out. I trimmed the traces down, dumped a stupendous number of batchbuffers, turned off UBO push constants, dumped all the index and vertex buffers, and tried some RGBx changes, but nothing jumped out at me, except that the vertex shaders produced were different.

However, they were different for many reasons, due to the optimization pipelines the mesa state tracker runs vs the 965 driver. Inputs and UBO loads were in different places, so there was a lot of noise in the shaders.

I ported the trace to a piglit GL application so I could hack on the shaders and GL more easily; with that I trimmed it down even further (even if I did burn some time on a misplaced */+ typo).

Using the ported app, I removed all uniform buffer loads and then split the vertex shader in half (it was quite large, but had two chunks). I finally then could spot the difference in the NIR shaders.

What stood out was that the 965 shader had an if which the crocus shader had converted to a bcsel. This is part of a peephole optimization that mesa/st calls, and sure enough removing that call fixed the rendering. But why? It is a valid optimization.

In a parallel thread on another part of the planet, Ian Romanick filed a MR to mesa https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/12191 fixing a bug in the gen4/5 fs backend with conditional selects. This was something he noticed while debugging elsewhere. However his fix was for the fragment shader backend, and my bug was in the vec4 vertex shader backend. I tracked down where the same changes were needed in the vec4 backend and tested a fix on top of his branch, and the misrendering disappeared.

It's a strange coincidence we both started hitting the same bug in different backends in the same week via different tests, but he's definitely saved me a lot of pain in working this out! Hopefully we can combine them and get it merged this week.

Also thanks to Angelo on the initial MR for testing crocus with some real workloads.

August 03, 2021

Hi all, hope you all are doing fine!

Finally today part 4 of my Outreachy Saga came out; the mid-point was on 5/7/21 and as you can see I'm really late. This week had the theme: “Modifying Expectations”.

But why did it take me so long to post? First, I had to internalize the topic a lot, because in my head I thought that when I reached this point, I would have achieved all the goals I had proposed at the beginning of the internship. But when the mid-point arrived, it seemed to me that I hadn't done anything and that my internship was going to end without me fulfilling expectations.

The project aimed at two tasks:

  • Clean up the debugfs support
  • Remove custom dumb_map_offset implementations

During the development of the first task, we found that it could not be carried out as intended, so it needed to be restructured and resulted in:

  • Create vkms_config_show()

    • a function which aims to print the data in drm_debugfs_create_files()
    • It has already been reviewed and approved to be part of the drm-misc tree

During the development of this function, I came across an improvement in the code:

  • Replace macro in vkms_release()

    • It has already been reviewed and approved to be part of the drm-misc tree

As part of this week's assignments, I needed to talk to my advisors about the internship progress and schedule review. During our conversation, I could see/understand that I managed to achieve one of the goals, as presented above (I thought I hadn't achieved anything!!), and we also realized that I was not going to be able to do the second task, as it was in another context and could take a long time to understand how to solve it.

Thus, for the second half of the internship, it was decided to convert vkms_config_debugfs into the structure proposed by Wambui Karuga, and that is what I'm working on now.

During this time I'm learning that the Linux Kernel development is not linear, that several things can happen (setup problem again, break the kernel, not knowing what to do...), so I realized that one of the goals of Outreachy is learning to contribute and work with the project I've chosen.

So I started to take advantage of my journey to learn how to contribute as much as I can to the Linux Kernel, and as I identified a lot with the development, maybe I can find a job to keep working on/contributing to Linux Kernel development.

Thank you for following me so far, please feel free to comment! And stay tuned to the next chapters of this Saga!!!

Take care and have a great day!

July 28, 2021

Part 1, Part 2, Part 3

After getting thoroughly nerd-sniped a few weeks back, we now have FreeBSD support through qemu in the freedesktop.org ci-templates. This is possible through the qemu image generation we have had for quite a while now. So let's see how we can easily add a FreeBSD VM (or other distributions) to our gitlab CI pipeline:


.freebsd:
  variables:
    FDO_DISTRIBUTION_VERSION: '13.0'
    FDO_DISTRIBUTION_TAG: 'freebsd.0' # some value for humans to read

build-image:
  extends:
    - .freebsd
    - .fdo.qemu-build@freebsd
  variables:
    FDO_DISTRIBUTION_PACKAGES: "curl wget"
Now, so far this may all seem quite familiar. And indeed, this is almost exactly the same process as for normal containers (see Part 1); the only difference is the .fdo.qemu-build base template. Using this template means we build an image babushka: our desired BSD image is actually a QEMU RAW image sitting inside another generic container image. That latter image only exists to start the QEMU image and set up the environment if need be; you don't need to care what distribution it runs (Fedora for now).

Because of the nesting, we need to handle this accordingly in the script: tag of the actual test job - we need to start the image and make sure our commands actually run within the VM. The templates set up an ssh alias "vm" for this, and the vmctl script helps to do things on the VM:


test-build:
  extends:
    - .freebsd
    - .fdo.distribution-image@freebsd
  script:
    # start our QEMU image
    - /app/vmctl start

    # copy our current working directory to the VM
    # (this is a yaml multiline command to work around the colon)
    - |
      scp -r $PWD vm:

    # Run the build commands on the VM and if they succeed, create a .success file
    - /app/vmctl exec "cd $CI_PROJECT_NAME; meson builddir; ninja -C builddir" && touch .success || true

    # Copy results back to our run container so we can include them in artifacts:
    - |
      scp -r vm:$CI_PROJECT_NAME/builddir .

    # kill the VM
    - /app/vmctl stop

    # Now that we have cleaned up: if our build job before
    # failed, exit with an error
    - [[ -e .success ]] || exit 1
Now, there's a bit to unpack, but with the comments above it should be fairly obvious what is happening. We start the VM, copy our working directory over, run a command on the VM, and then clean up. The reason we use touch .success is simple: it allows us to copy things out and clean up before actually failing the job.

Obviously, if you want to build for any other distribution you just swap the freebsd out for fedora or whatever - the process is the same. libinput has been using fedora qemu images for ages now.

July 27, 2021

Thanks to the work done by José Expósito, libinput 1.19 will ship with a new type of gesture: hold gestures. So far libinput supported swipe (moving multiple fingers in the same direction) and pinch (moving fingers towards or away from each other). These gestures are well-known, commonly used, and familiar to most users. For example, GNOME 40 recently increased its use of touchpad gestures to switch between workspaces, etc. But because swipe and pinch gestures require movement, it was not possible (for callers) to detect fingers on the touchpad that don't move.

This gap is now filled by hold gestures. These are triggered when a user puts fingers down on the touchpad without moving them. This allows for some new interactions, and we had two specific ones in mind: one is hold-to-click, a common interaction on older touchscreen interfaces where holding a finger in place eventually triggers the context menu. On a touchpad, a three-finger hold could zoom in, or do dictionary lookups, or kill a kitten. Whatever matches your user interface best, I guess.

The second interaction was the ability to stop kinetic scrolling. libinput does not actually provide kinetic scrolling; it merely provides the information the caller needs to do it, specifically by telling the caller when a finger was lifted off a touchpad at the end of a scroll movement. It's up to the caller (usually: the toolkit) to implement the kinetic scrolling effects. One missing piece was that while libinput provided information about lifting the fingers, it didn't provide information about putting fingers down again later - a common way to stop scrolling on other systems.

Hold gestures are intended to address this: a hold gesture triggered after a flick with two fingers can now be used by callers (read: toolkits) to stop scrolling.

Now, one important thing about hold gestures is that they will generate a lot of false positives, so be careful how you implement them. The vast majority of interactions with the touchpad will trigger some movement - once that movement hits a certain threshold, the hold gesture is cancelled and libinput sends out the movement events. Those events may be tiny (depending on touchpad sensitivity), so getting the balance right for the aforementioned hold-to-click gesture is up to the caller.
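The threshold logic above can be sketched roughly like this — the threshold value, units, and function are made up for illustration and are not libinput's actual implementation:

```python
def process_touch(deltas, threshold=1.0):
    """Cancel a hold once accumulated finger movement exceeds a threshold.

    deltas is a list of (dx, dy) motion samples. Returns ("hold", [])
    if the fingers never moved past the threshold, otherwise
    ("cancelled", deltas): the hold is cancelled and the caller
    receives the motion events instead.
    """
    moved = 0.0
    for dx, dy in deltas:
        moved += abs(dx) + abs(dy)
        if moved > threshold:
            return "cancelled", deltas
    return "hold", []

# Tiny jitter stays below the threshold: still a hold.
assert process_touch([(0.1, 0.0), (0.0, 0.1)]) == ("hold", [])
# Real movement cancels the hold and releases the motion events.
state, events = process_touch([(2.0, 0.0)])
assert state == "cancelled" and events == [(2.0, 0.0)]
```

Picking the threshold is the whole game: too low and every hold-to-click attempt is cancelled by jitter, too high and short swipes get swallowed.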

As usual, the required bits to get hold gestures into the wayland protocol are either in the works, mid-flight or merge-ready so expect this to hit the various repositories over the medium-term future.

July 26, 2021

I have not talked about raytracing in RADV for a while, but after some procrastination and being focused on some other things, I recently got back to it and achieved my next milestone.

In particular, I have been hacking away at CTS and got to a point where dEQP-VK.ray_tracing.* runs to completion without crashes or hangs. Furthermore, I got the pass rate to 90% of non-skipped tests. So we’re finally getting somewhere close to usable.

As a further sign that it is usable, my fixes for CTS also fixed the corruption issues in Quake 2 RTX (GitHub version), delivering this image:

Q2RTX on RADV

Of course not everything is perfect yet. Besides the sub-100% CTS pass rate, it has about half the Windows performance at 4k right now, and we still have some feature gaps to close to make it really usable for most games.

Why is it slow?

TL;DR: Because I haven’t optimized it yet and have implemented every shortcut imaginable.

AMD raytracing primer

Raytracing with Vulkan works with two steps:

  1. You build a giant acceleration structure that contains all your geometry. Typically this ends up being some kind of tree, usually a Bounding Volume Hierarchy (BVH).
  2. Then you trace rays using some traversal shader through the acceleration structure you just built.

With RDNA2, AMD started accelerating this by adding an instruction that does an intersection test between a ray and a single BVH node, where the BVH node can be either

  • A triangle
  • A box node specifying 4 AABB boxes

Of course this isn’t quite enough to deal with all geometry types in Vulkan so we also add two more:

  • an AABB box
  • an instance of another BVH combined with a transformation matrix

Building the BVH

With a search tree like a BVH it is very possible to build trees that are quite useless. As an example, consider a very unbalanced binary search tree. We can have similarly bad things with a BVH, including making it unbalanced or having overlapping bounding volumes.

And my implementation is the simplest thing possible: the input geometry becomes the leaves in exactly the same order, and then internal nodes are created just as you’d draw them. That is probably decently fast for building the BVH, but surely results in a terrible BVH to actually use.
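That naive build can be sketched in a few lines — 2D boxes for brevity, and the node layout is purely illustrative, not RADV's:

```python
def build_naive_bvh(leaf_boxes, fanout=4):
    """Build a BVH the naive way: leaves in input order, internal
    nodes grouping every `fanout` consecutive children, each node's
    AABB being the union of its children's. A box is a
    ((minx, miny), (maxx, maxy)) tuple.
    """
    def union(boxes):
        return (
            (min(b[0][0] for b in boxes), min(b[0][1] for b in boxes)),
            (max(b[1][0] for b in boxes), max(b[1][1] for b in boxes)),
        )

    level = [{"box": b, "children": []} for b in leaf_boxes]
    while len(level) > 1:
        level = [
            {"box": union([n["box"] for n in group]), "children": group}
            for group in (level[i:i + fanout]
                          for i in range(0, len(level), fanout))
        ]
    return level[0]

# Two far-apart leaves end up under one root whose box spans both -
# exactly the kind of loose bounding volume that hurts traversal.
root = build_naive_bvh([((0, 0), (1, 1)), ((5, 5), (6, 6))])
assert root["box"] == ((0, 0), (6, 6))
assert len(root["children"]) == 2
```

No sorting, no surface-area heuristic: spatially distant primitives that happen to be adjacent in the input share a node, which is why the resulting tree traverses so poorly.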

BVH traversal

After we built a BVH we can start tracing some rays. In rough pseudocode the current implementation is

stack = empty
insert root node into stack
while stack is not empty:

   node = pop a node from the stack

   if we left the bottom level BVH:
      reset ray origin/direction to initial origin/direction

   result = amd_intersect(ray, node)
   switch node type:
      triangle:
         if result is a hit:
            load some node data
            process hit
      box node:
         for each box hit:
            push child node on stack
      custom node 1 (instance):
         load node data
         push the root node of the bottom BVH on the stack
         apply transformation matrix to ray origin/direction
      custom node 2 (AABB geometry):
         load node data
         process hit

We already knew there were inherently going to be some difficulties:

  • We have a poor BVH so we’re going to do way more iterations than needed.
  • Calling shaders as a result of hits is going to result in some divergence.

Furthermore, this also clearly shows some difficulties with how we approached the intersection instruction. Some advantages of the intersection instruction are that it avoids divergence when computing intersections if we have different node types in a subgroup, and that it is cheaper when only a few lanes are active. (A single CU can process one ray/node intersection per cycle, modulo memory latency, while it can process an ALU instruction on 64 lanes per cycle.)

However even if it avoids the divergence in the collision computation we still introduce a ton of divergence in the processing of the results of the intersection. So we are still doing pretty bad here.

A fast GPU traversal stack needs some work too

Another thing to note is our traversal stack size. According to the Vulkan specification, a bottom-level acceleration structure should support 2^24 - 1 triangles and a top-level acceleration structure should support 2^24 - 1 bottom-level structures. Combined with a tree with 4 children in each internal node, we can end up with a tree depth of about 24 levels.

In each internal-node iteration of our loop we pop one element and push up to 4 elements, so at the deepest level of traversal we could end up with a 72-entry stack. Assuming these are 32-bit node identifiers, that ends up at 288 bytes of stack per lane, or ~18 KiB per 64-lane workgroup (the minimum that could possibly keep a CU busy with an ALU-only workload). Given that we have 64 KiB of LDS per CU (yes, I am using LDS since there is no divergent dynamic register addressing), that leaves only 3 workgroups per CU, leaving very few options for parallelism between different hardware execution units (e.g. the ALU and the texture unit that executes the ray intersections) or latency hiding of memory operations.
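The arithmetic, spelled out (all numbers taken from the paragraph above):

```python
# Back-of-the-envelope for the traversal stack.
MAX_DEPTH = 24          # ~tree depth for 2^24-1 primitives, 4-wide nodes
PUSHED_PER_LEVEL = 3    # pop 1, push up to 4 => net +3 entries per level
NODE_ID_BYTES = 4       # 32-bit node identifiers
LANES = 64              # lanes per workgroup
LDS_KIB = 64            # LDS per CU

stack_entries = MAX_DEPTH * PUSHED_PER_LEVEL
assert stack_entries == 72
bytes_per_lane = stack_entries * NODE_ID_BYTES
assert bytes_per_lane == 288
kib_per_workgroup = bytes_per_lane * LANES / 1024
assert kib_per_workgroup == 18.0
# Only 3 such workgroups fit in 64 KiB of LDS per CU:
assert LDS_KIB // 18 == 3
```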

So ideally we get this stack size down significantly.

Where do we go next?

The first step is to get CTS passing and an initial merge request into upstream Mesa. As a follow-on, I’d like to get a minimal prototype going for some DXR 1.0 games with vkd3d-proton, just to make sure we have the right feature coverage.

After that we’ll have to do all the traversal optimizations. I’ll probably implement a bunch of instrumentation so I actually have a clue what to optimize. This is where having some runnable games really helps to get the right idea about performance bottlenecks.

Finally, with some luck better shaders to build a BVH will materialize as well.