planet.freedesktop.org
March 03, 2021

One of the best decisions I made in my life was joining Igalia in 2012. Inside Igalia, I have been working on different open-source projects, most of the time related to graphics technologies, interacting with different communities, giving talks, organizing conferences and, more importantly, contributing to free software as my daily job.

Now I’m thrilled to announce that we are hiring for our Graphics team!

Igalia

Right now we have two open positions:

  • Graphics Developer.

    We are looking for candidates that would like to contribute to open-source OpenGL/Vulkan graphics drivers (Mesa), or other areas of the open-source graphics stack such as X11 or Wayland, among others. If you have experience with them, or you are very motivated to become an expert there, just send us your CV!

  • Kernel Developer.

    We are looking for candidates that either have experience with kernel development or can ramp up quickly to contribute to Linux kernel drivers. Although no specific subsystem is mentioned in the job posting, I encourage you to apply if you have DRM experience and/or ARM[64]/MIPS-related knowledge.

Graphics technologies are not your cup of tea? We have positions in other areas like browsers, compilers, multimedia… Just check out our job offers on our website!

What we offer is to work in an open-source consultancy in which you can participate equally in the management and decision-making process of the company via our democratic, consensus-based assembly structure. As all of our positions are remote-friendly, we welcome submissions from any part of the world.

Are you still a student? We have launched the 2021 edition of our Coding Experience program. Check it out!

Igalia's office

February 27, 2021

It is extremely rare that a hobby software project of mine gets completed, but now it has happened. Behold! Fourbyfour!

Have you ever had to implement a mathematical algorithm, say, matrix inversion? You want it to be fast and measuring the speed is fairly simple, right. But what about correctness? Or precision? Behavior around inputs that are on the edge? You can hand-pick a few example inputs, put those into your test suite, and verify the result is what you expect. If you do not pick only trivial inputs, this is usually enough to guarantee your algorithm does not have fundamental mistakes. But what about those almost invalid inputs, can you trust your algorithm to not go haywire on them? How close to invalid can your inputs be before things break down? Does your algorithm know when it stops working and tell you?

Inverting a square matrix requires that the inverse matrix exists to begin with. Matrices that do not mathematically have an inverse matrix are called singular. Can your matrix inversion algorithm tell you when you are trying to invert a matrix that cannot be inverted, or does it just give you a bad result pretending it is ok?

Working with computers often means working with floating-point numbers. With floating-point, the usual mathematics is not enough, it can actually break down. You calculate something and the result a computer gives you is total nonsense, like 1+2=2 in spirit. In the case of matrix inversion, it's not enough that the input matrix is not singular mathematically, it needs to be "nice enough" numerically as well. How do you test your matrix inversion algorithm with this in mind?
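
To make the "1+2=2 in spirit" point concrete, here is a tiny C snippet (purely illustrative, not part of Fourbyfour) showing 32-bit floats silently swallowing a small addend:

#include <stdio.h>

int main(void)
{
    float big = 1e8f;          /* near the edge of 32-bit float precision */
    float sum = big + 1.0f;

    /* The spacing between representable floats around 1e8 is larger than
     * 1.0, so the addition changes nothing at all. */
    printf("big + 1.0f == big ? %s\n", sum == big ? "yes" : "no");
    return 0;
}

The same kind of silent precision loss is what makes "nice enough" numerically a real requirement for matrix inversion inputs.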

These questions I tried to answer with Fourbyfour. The README has the links to the sub-pages discussing how I solved this, so I will not repeat it here. However, as the TL;DR, if there is one thing you should remember, it is this:

    Do not use the matrix determinant to test if a matrix is invertible!

Yes, the determinant is zero for a singular matrix. No, close to zero determinant does not tell you how close to singular the matrix is. There are better ways.

However, the conclusion I came to is that if you want a clear answer for a specific input matrix (is it invertible?), the only way to know for sure is to actually invert it, multiply the input matrix with the inverse you computed, and measure how far the product is from the identity matrix. Of course, you also need to set a threshold for how close to the identity matrix is close enough for your application, because with numerical algorithms you will almost never get the exact answer. Also, pick an appropriate matrix norm for the matrix difference.
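
As a concrete illustration of that verification step, here is a minimal C sketch (not taken from Fourbyfour; the row-major 4x4 layout and the tolerance are assumptions for the example) that multiplies the input by the computed inverse and compares the product to the identity matrix using an element-wise max norm:

#include <math.h>
#include <stdbool.h>

/* Returns true if 'inv' really behaves as an inverse of 'm' within
 * 'tolerance', judged by the largest element-wise deviation of
 * (m * inv) from the identity matrix. Both matrices are row-major 4x4. */
static bool
inverse_is_good(const double m[16], const double inv[16], double tolerance)
{
    double max_err = 0.0;

    for (int row = 0; row < 4; row++) {
        for (int col = 0; col < 4; col++) {
            double prod = 0.0;

            for (int k = 0; k < 4; k++)
                prod += m[row * 4 + k] * inv[k * 4 + col];

            /* Compare against the corresponding identity matrix element. */
            double err = fabs(prod - (row == col ? 1.0 : 0.0));
            if (err > max_err)
                max_err = err;
        }
    }

    return max_err <= tolerance;
}

The tolerance is the application-specific part: how far from the identity matrix is still acceptable for your use case.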

The reason for this conclusion is what one of the tools I wrote tells me about a matrix that would be typical for a display server with two full-HD monitors. The matrix is simply the pixel offset of the second monitor on the desktop. The analysis of the matrix is the example I used to demonstrate fourbyfour-analyse. If you read through it, you should be shocked. The mathematics, as far as I can understand, seems to tell us that if you use 32-bit floating-point, inverting this matrix gives us a result that leads to no correct digits at all. Obviously this is nonsense, the inverse is trivial and algorithms should give the exact correct result. However, the math does not lie (unless I did). If I did my research right, then what fourbyfour-analyse tells us is true, with an important detail: it is the upper error bound. It guarantees that we cannot get errors larger than that (heh, zero correct digits is pretty hard to make much worse). But I also read that there is no better error bound possible for a generic matrix inversion algorithm. (If you take the obvious-to-human constraints into account that those elements must be one and those must be zero, the analysis would likely be very different.) Therefore the only thing left to do is to actually go on with the matrix inversion and then verify the result.

Here is a list of the cool things the Fourbyfour project does or has:

  • Generates random matrices arbitrarily close to singular in a controlled way. If you simply generated random matrices for testing, they would almost never be close to singular. With this code, you can define how close to singular you want the matrices to be, to really torture your inversion algorithms.
  • Generates random matrices with a given determinant value. This is orthogonal to choosing how close to singular the generated matrices are. You can independently pick the determinant value and the condition number, and have the random matrices have both simultaneously.
  • Plot a graph about a matrix inversion algorithm's behavior when inputs get closer to singular, to see exactly when it breaks down.
  • A tutorial on how mathematical matrix notation works and how it relates to row- vs. column-major layouts (spoiler: it does not).
  • A comparison between Graphene and Weston matrix inversion algorithms.
In this project I also tried out several project quality assurance features:

  • Use Gitlab CI to run the test suite for the main branch, tags, and all merge requests, but not for other git branches.
  • Use Freedesktop ci-templates to easily generate the Docker image in CI under which CI testing will run.
  • Generate LCOV test code coverage report from CI.
  • Use reuse lint tool in CI to ensure every single file has a defined, machine-readable license. Using well-known licenses clearly is important if you want your code to be attractive. Fourbyfour also uses the Developer Certificate of Origin.
  • Use ci-fairy to ensure every commit has Signed-off-by and every merge request allows maintainer pushes.
  • Good CI test coverage. Even the pseudo-random number generator in the test suite is tested to check that it roughly follows the intended distribution.
  • CONTRIBUTING file. I believe that every open source project, regardless of size, needs this to set people's expectations when they see your project, whether you expect or accept contributions or not.
I'm really happy this project is now "done", well, version 1.0.0 so to say. One thing I have realized is still missing is a determinant sweep mode. The precision testing mode sweeps over condition numbers and allows plotting the inversion behavior. It should have another mode where the sweep controls the determinant value, with some fixed condition number for the random test matrices. This determinant mode could point out inversion algorithms that use the determinant value for matrix singularity testing and show how it leads to completely arbitrary results.

If you want to learn about numerical methods for matrices, I recommend the book Matrix Computations by Gene H. Golub and Charles F. van Loan, The Johns Hopkins University Press. I used the third edition, 1996, when implementing the Weston matrix inversion years ago.

February 26, 2021

Introduction

In a now pretty well established tradition on my part, I am posting on things I no longer work on!

I gave a talk on modifiers at XDC 2017 and at Linux Plumbers 2017 (audio only). It was always my goal to have a blog post accompany the work. Relatively shortly after the talks, I ended up leaving graphics, and so it dropped down the priority list.

I'm splitting this up into two posts. This post will go over the problem, and solutions. The next post will go over the implementation details.

Modifiers

Each 3d computational unit in an Intel GPU is called an Execution Unit (EU). Aside from what you might expect them to do, like execute shaders, they may be used for copy operations (itself a shader), or compute operations (also, shaders). All of these things require memory bandwidth in order to complete their task in a timely manner.

Modifiers were the chosen solution in order to allow end to end renderbuffer [de]compression to work, which is itself designed to reduce memory bandwidth needs in the GPU and display pipeline. End to end renderbuffer compression simply means that through all parts of the GPU and display pipeline, assets are read and written to in a compression scheme that is capable of reducing bandwidth (more on this later).

Modifiers are a relatively simple concept. They are modifications that are applied to a buffer's layout. Typically a buffer has a few properties: width, height, and pixel format, to name a few. Modifiers can be thought of as ancillary information that is passed along with the pixel data. It will impact how the data is processed or displayed. One such example might be to support tiling, which is a mechanism to change how pixels are stored (not sequentially) in order for operations to make better use of locality for caching and other similar reasons. Modifiers were primarily designed to help negotiate modified buffers between the GPU rendering engine and the display engine (usually by way of the compositor). In addition, other uses can crop up, such as the video decode/encode engines.
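
For a rough idea of where modifiers show up in the API, here is a hedged sketch of a KMS client registering a framebuffer whose layout is Y-tiled using drmModeAddFB2WithModifiers(); the buffer handle and pitch are placeholders for a buffer allocated elsewhere (e.g. through GBM):

#include <stdint.h>
#include <xf86drm.h>
#include <xf86drmMode.h>
#include <drm_fourcc.h>

/* Register a single-plane XRGB8888 buffer with a Y-tiled layout modifier.
 * 'fd' is the DRM device; 'bo_handle' and 'pitch' describe a buffer that
 * was allocated elsewhere. */
static int
add_y_tiled_fb(int fd, uint32_t width, uint32_t height,
               uint32_t bo_handle, uint32_t pitch, uint32_t *fb_id)
{
    uint32_t handles[4] = { bo_handle };
    uint32_t pitches[4] = { pitch };
    uint32_t offsets[4] = { 0 };
    uint64_t modifiers[4] = { I915_FORMAT_MOD_Y_TILED };

    /* The modifier rides along with the otherwise ordinary
     * width/height/format description of the buffer. */
    return drmModeAddFB2WithModifiers(fd, width, height, DRM_FORMAT_XRGB8888,
                                      handles, pitches, offsets, modifiers,
                                      fb_id, DRM_MODE_FB_MODIFIERS);
}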

A Waste of Time and Gates

My understanding is that even now, 3 years later, full modifier support isn't readily available across all corners of the graphics ecosystem. Many hardware features are being entirely unrealized. Upstreaming sweeping graphics features like this one can be very time consuming and I seriously would advise hardware designers to take that into consideration (or better yet, ask your local driver maintainer) before they spend the gates. If you can make changes that don't require software, just do it. If you need software involvement, the longer you wait, the worse it will be.

They weren't new even when I made the presentation 3.5 years ago.

commit e3eb3250d84ef97b766312345774367b6a310db8
Author: Rob Clark <robdclark@gmail.com>
Date:   6 years ago

    drm: add support for tiled/compressed/etc modifier in addfb2

I managed to land some stuff:

commit db1689aa61bd1efb5ce9b896e7aa860a85b7f1b6
Author: Ben Widawsky <ben@bwidawsk.net>
Date:   3 years, 7 months ago

    drm: Create a format/modifier blob

Admiring the Problem

A back-of-the-envelope requirement for a midrange Skylake GPU from the time can be calculated relatively easily. Four years ago, at the frequencies we ran our GPUs and given their ISA, we could expect roughly 1GBs for each of the 24 EUs.

A 4k display:

3840px × 2160rows × 4Bpp × 60Hz = 1.85GBs

24GBs + 1.85GBs = 25.85GBs

This by itself will oversaturate single channel DDR4 bandwidth (which was what was around at the time) at the fastest possible clock. As it turns out, it gets even worse with compositing. Most laptops sporting a SKL of this range wouldn't have a 4k display, but you get the idea.
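
If you want to play with these numbers yourself, here is a small stand-alone C program that reproduces the scanout part of the back-of-the-envelope math above (the 24GBs EU figure is simply the estimate from the text):

#include <stdio.h>

int main(void)
{
    /* Rough per-EU estimate from the text, times 24 EUs. */
    const double eu_gbs = 24.0 * 1.0;

    /* 4k scanout: width x height x bytes per pixel x refresh rate. */
    const double scanout_bytes = 3840.0 * 2160.0 * 4.0 * 60.0;
    const double scanout_gbs = scanout_bytes / (1024.0 * 1024.0 * 1024.0);

    printf("scanout: %.2f GBs\n", scanout_gbs);           /* ~1.85 */
    printf("total:   %.2f GBs\n", eu_gbs + scanout_gbs);  /* ~25.85 */
    return 0;
}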

The picture (click for larger SVG) is a typical "flow" for a composited desktop using direct rendering with X or a Wayland compositor using EGL. In this case, drawing a Rubik's cube looking thing into a black window.

Admiring the problem

Using this simple Rubik's cube example I'll explain each of the steps so that we can understand where our bandwidth is going and how we might mitigate that. This is just the overview, so feel free to move on to the next section. Since the example will be trivial, and the window is small (and only a singleton) it won't saturate the bandwidth, but it will demonstrate where the bandwidth is being consumed, and open up a discussion on how savings can be achieved.

Rendering and Texturing

For the example, no processing happens other than texturing. In a simple world, the processing of the shader instructions doesn't increase the memory bandwidth cost. As such, we'll omit that from the details.

The main steps on how you get this Rubik's cube displayed are

  • upload a static texture
  • read from static texture
  • write to offscreen buffer
  • copy to output frame
  • scanout from output

More details below...

Texture Upload

Getting the texture from the application, usually from disk, into main memory, is what I'm referring to as texture upload. In terms of memory bandwidth, you are using write bandwidth to write into the memory.

Assets are transferred from persistent storage to memory

Textures may either be generated by the 3d application, which would be trivial for this example, or they may be authored using a set of offline tools and baked into the application. For any consequential use, the latter is predominantly used. Certain surface types are often dynamically generated though, for example, the shadow mapping technique will generate depth maps. Those dynamically generated surfaces actually will benefit even more (more later).

This is pseudo code (but close to real) to upload the texture in OpenGL:

const unsigned height = 128;
const unsigned width = 64;
const void *data = ... // rubik's cube
GLuint tex;

glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_2D, tex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB, width, height, 0, GL_RGB, GL_UNSIGNED_BYTE, data);
glGenerateMipmap(GL_TEXTURE_2D);

I'm going to punt on explaining mipmaps, which are themselves a mechanism to conserve memory bandwidth. If you have no understanding, I'd recommend reading up on mipmaps. This wikipedia article looks decent to me.

Texture Sampling

Once the texture is bound, the graphics runtime can execute shaders which can reference those textures. When the shader requests a color value (also known as sampling) from the texture, it's possible, and even likely, that the calculated coordinate within the texture will be in between pixels. The hardware has to return a single color value for the sample point, and the way it interpolates is chosen by the graphics runtime. This is referred to as a filter; a short snippet after the list below shows how a filter is picked in GL.

Texture Fetch/Filtering
  • Nearest: Take the value of the single closest pixel. No interpolation.
  • Bilinear: Take the surrounding 4 pixels and interpolate based on the distance of the texture coordinate to each.
  • Trilinear: Bilinear, but also interpolate between the two closest mipmaps. (I skipped discussing mipmaps, but part of the texture fetch involves finding the nearest miplevel.)
  • Anisotropic: It's complicated. Let's say 16x trilinear.
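
As a point of reference, here is roughly how an application picks one of these filters in OpenGL. This is a generic snippet, not something from the original example, and it assumes a GL context with the texture already bound:

/* Nearest: cheapest, a single texel per sample. */
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);

/* Trilinear: bilinear within a miplevel, plus blending between miplevels. */
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);

/* Anisotropic: up to 16 samples per fetch (EXT_texture_filter_anisotropic). */
glTexParameterf(GL_TEXTURE_2D, GL_TEXTURE_MAX_ANISOTROPY_EXT, 16.0f);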

Here's the GLSL to fetch the texture:

#version 330

uniform sampler2D tex;
in vec2 texCoord;
out vec4 fragColor;

void main() {
    vec4 temp = texelFetch(tex, ivec2(texCoord), 0);
    fragColor = temp;
}

The above does something that's perhaps not immediately obvious: fragColor = temp; instructs the fragment shader to write that value out to a surface which is bound for output (usually a framebuffer). In other words, there are two steps here: read and filter a value from the texture, then write it back out.

The part of the overall diagram that represents this step:

Composition

In the old days of X, and even still when not using the composite extension, the graphics applications could be given a window to write the pixels directly into the resulting output. The X window manager would mediate resize and move events, letting the client update as needed. This has a lot of downsides which I'll say are out of scope here. There is one upside that is in scope though, there's no extra copy needed to create the screen composition. It just is what it is, tearing and all.

If you don't know if you're currently using a compositor, you almost certainly are using one. Wayland only composites, and the number of X window managers that don't composite is very few. So what exactly is compositing? Simply put it's a window manager that marshals frame updates from clients and is responsible for drawing them on the final output. Often the compositor may add its own effects such as the infamous wobbly windows. Those effects themselves may use up bandwidth!

Simplified compositor block diagram

Applications will write their output into what's referred to as an offscreen buffer. 👋👋 The compositor will read the output and copy it into what will become the next frame. What this means from a bandwidth consumption perspective is that the compositor will need to use both read and write bandwidth just to build the final frame. 👋👋

Display

It's the mundane part of this whole thing. Pixels are fetched from memory and pushed out onto whatever display protocol.

Display Engine

Perhaps the interesting thing about the display engine is it has fairly isochronous timing requirements and can't tolerate latency very well. As such, it will likely have a dedicated port into memory that bypasses arbitration with other agents in the system that are generating memory traffic.

Out of scope here but I'll briefly mention, this also gets a bit into tiling. Display wants to read things row by row, whereas rendering works a bit different. In short this is the difference between X-tiling (good for display), and Y-tiling (good for rendering). Until Skylake, the display engine couldn't even understand Y-tiled buffers.

Summing the Bandwidth Cost

Running through our 64x64 example...

| Operation | Color Depth | Description | Bandwidth | R/W |
| --- | --- | --- | --- | --- |
| Texture Upload | 1Bpc (RGBX8) | File to DRAM | 16KB (64 × 64 × 4) | W |
| Texel Fetch (nearest) | 1Bpc | DRAM to Sampler | 16KB (64 × 64 × 4) | R |
| FB Write | 1Bpc | GPU to DRAM | 16KB (64 × 64 × 4) | W |
| Compositing | 1Bpc | DRAM to DRAM | 32KB (64 × 64 × 4 × 2) | R+W |
| Scanout | 1Bpc | DRAM to PHY | 16KB (64 × 64 × 4) | R |

Total = (16 + 16 + 16 + 32 + 16) × 60Hz = 5.625MBs

But actually, Display Engine will always scanout the whole thing, so really with a 4k display:

Total = (16 + 16 + 16 + 32 + 32400) × 60Hz = 1.9GBs

Don't forget about those various filter modes though!

| Filter Mode | Multiplier (texel fetch) | Total Bandwidth |
| --- | --- | --- |
| Bilinear | 4x | 11.25MBs |
| Trilinear | 8x | 18.75MBs |
| Aniso 4x | 32x | 63.75MBs |
| Aniso 16x | 128x | 243.75MBs |

Proposing some solutions

Without actually doing the math, I think cache is probably the biggest win you can get. One spot where caching could help is the framebuffer write step followed by the composition step, which could avoid the trip to main memory. Another is the texture upload and fetch. Assuming you don't blow out your cache, you can avoid the main memory trip.

While caching can buy you some relief, ultimately you have to flush your caches to get the display engine to be able to read your buffer. At least as of 2017, I was unaware of an architecture that had a shared cache between display and 3d.

Also, cache sizes are limited...

Wait for DRAM to get faster

Instead of doing anything, why not just wait until memory gets higher bandwidth?

Here's a quick breakdown of the progression at the high end of the specs. For the DDR memory types, I took a swag at the number of populated channels. For a fair comparison, the expectation with DDR is that you'll have at least dual channel nowadays.

Bandwidth

Looking at the graph it seems like the memory vendors aren't hitting Moore's Law any time soon, and if they are, they're fooling me. A similar chart should be made for execution unit counts, but I'm too lazy. A Tigerlake GT2 has 96 EUs. If you go back to our back-of-the-envelope calculation we had a midrange GPU at 24 EUs, so that has quadrupled. In other words, the system architects will use all the bandwidth they can get.

Improving memory technologies is vitally important, it just isn't enough.

TOTAL SAVINGS = 0%

Hardware Composition

One obvious place we want to try to reduce bandwidth is composition. It was after all the biggest individual consumer of available memory bandwidth.

With composition as we described earlier, there was presumed to be a single plane. Software would arrange the various windows onto that plane (which, if you recall from the section on composition, added quite a bit to the bandwidth consumption), and then the display engine could display from that plane.

Hardware composition is the notion that each of those windows could have a separate display plane, directly write into that, and all the compositor would have to do is make sure those display planes occupied the right part of the overall screen. It's conceptually similar to the direct scanout we described earlier in the section on composition.
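
To make this a bit more concrete, here is a hedged sketch of how a compositor might put a client's buffer straight onto an overlay plane with the KMS drmModeSetPlane() call; the plane, CRTC and framebuffer IDs are placeholders discovered elsewhere:

#include <stdint.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

/* Scan the client's framebuffer out directly through its own plane,
 * positioned at (win_x, win_y) on the CRTC, instead of copying it into a
 * composited frame. Source coordinates are in 16.16 fixed point. */
static int
show_window_on_plane(int fd, uint32_t plane_id, uint32_t crtc_id,
                     uint32_t fb_id, int32_t win_x, int32_t win_y,
                     uint32_t width, uint32_t height)
{
    return drmModeSetPlane(fd, plane_id, crtc_id, fb_id, 0,
                           win_x, win_y, width, height,      /* CRTC rect */
                           0, 0, width << 16, height << 16); /* source rect */
}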

| Operation | Color Depth | Description | Bandwidth | R/W |
| --- | --- | --- | --- | --- |
| Compositing | 1Bpc | DRAM to DRAM | 32KB (64 × 64 × 4 × 2) | R+W |

TOTAL SAVINGS = 1.875MBs (33% savings)

Hardware Composition Verdict

33% savings is really really good, and certainly if you have hardware with this capability, the driver should enable it, but there are some problems that come along with this that make it not so appealing.

  1. Hardware has a limited number of planes.
  2. Formats. One thing I left out about the compositor earlier is that one of the things it may opt to do is convert the application's window into a format that the display hardware understands. This means some amount of negotiation has to take place so the application knows about this. Prior to this work, that wasn't in place.
  3. Doesn't reduce any other part of the process, i.e. a full-screen application wouldn't benefit at all.

Texture Compression

So far, in order to solve the not-enough-bandwidth problem, we've tried to add more bandwidth, and to reduce usage with hardware composition. The next place to go is to try to tackle the bandwidth consumed by texturing.

If you recall, we split the texturing up into two stages: texture upload and texture fetch. This third proposed solution attempts to reduce bandwidth by storing a compressed texture in memory. Texture upload will compress it while uploading, and texture sampling can understand the compression scheme and avoid doing all the lookups. Compressing the texture usually comes with some imperceptible degradation. In terms of the sampling, it's a bit handwavy to say you reduce by the compression factor, but for simplicity's sake, let's say that's what it does.
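
As an illustration of what this looks like from the application side, here is a hedged OpenGL sketch uploading a pre-compressed DXT1 texture; it assumes the EXT_texture_compression_s3tc extension and that 'data' already holds DXT1 blocks produced by an offline tool:

const unsigned width = 64, height = 64;
/* DXT1 stores each 4x4 texel block in 8 bytes: 8:1 versus 32-bit RGBA. */
const unsigned size = (width / 4) * (height / 4) * 8;
const void *data = ... // pre-compressed DXT1 blocks

GLuint tex;
glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_2D, tex);
glCompressedTexImage2D(GL_TEXTURE_2D, 0, GL_COMPRESSED_RGB_S3TC_DXT1_EXT,
                       width, height, 0, size, data);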

Some common formats at the time of the original materials were

| Format | Compression Ratio |
| --- | --- |
| DXT1 | 8:1 |
| ETC2 | 4:1 |
| ASTC | Variable, 6:1 |

Using DXT1 as an example of the savings:

| Operation | Color Depth | Bandwidth | R/W |
| --- | --- | --- | --- |
| Texture Upload | DXT1 | 2KB (64 × 64 × 4 / 8) | W |
| Texel Fetch (nearest) | DXT1 | 2KB (64 × 64 × 4 / 8) | R |
| FB Write | 1Bpc | 16KB (64 × 64 × 4) | W |
| Compositing | 1Bpc | 32KB (64 × 64 × 4 × 2) | R+W |
| Scanout | 1Bpc | 16KB (64 × 64 × 4) | R |

Here's an example with the simple DXT1 format:

Texture Compression Verdict

Texture compression solves a couple of the limitations that hardware composition left. Namely it can work for full screen applications, and if your hardware supports it, there isn't a limit to how many applications can make use of it. Furthermore, it scales a bit better because an application might use many many textures but only have 1 visible window.

There are of course some downsides.

Click for SVG

For comparison, here is the same cube scaled down with an 8:1 ratio. As you can see DXT1 does a really good job.

Scaled cube

We can't ignore the degradation though as certain rendering may get very distorted as a result.

  • Lossy (perhaps).
  • Hardware compatibility. Application developers need to be ready and able to compress to different formats and select the right things at runtime based on what the hardware supports. This takes effort both in the authoring, as well as the programming.
  • Patents.
  • Display doesn't understand this, so you need to decompress before the display engine can read from a surface that uses this compression.
  • Doesn't work well for all filtering types, like anisotropic.

*TOTAL SAVINGS (DXT1) = 1.64MBs (30% savings)

*total savings here is kind of a theoretical max

End to end lossless compression

So what if I told you there was a way to reduce your memory bandwidth consumption without having to modify your application, without being subject to hardware limits on planes, and without having to wait for new memory technologies to arrive?

End to end lossless compression attempts to provide both "end to end" and "lossless" compression transparently to software. Explanation coming up.

End to End

As mentioned in the previous section on texture compression, one of the pitfalls is that you'd have to decompress the texture in order for it to be used outside of your 3d engine. Typically this would mean the display engine scanning out from it, but you could also envision a case where perhaps you'd like to share these surfaces with the hardware video encoder. The nice thing about this "end to end" attribute is that every stage we mentioned in previous sections that required bandwidth can get the savings just by running on hardware and drivers that enable this.

Lossless

Now because this is all transparent to the running application, a lossless compression scheme has to be used so that there aren't any unexpected results. While lossless might sound great on the surface (why would you want to lose quality?), it reduces the potential savings because lossless compression algorithms are always less efficient than lossy ones, but it's still a pretty big win.

What's with the box, bro?

I want to provide an example of how this can be possible. Going back to our original image of the full picture, everything looks sort of the same. The only difference is that there is a little display engine decompression step, and all of the sampler and framebuffer write steps now have a little purple box accompanying them.

One sort of surprising aspect of this compression is it reduces bandwidth, not overall memory usage (that's also true of the Intel implementation). In order to store the compression information, hardware carves off a little bit of extra memory which is referenced for each operation on a texture (yes, that might use bandwidth too if it's not cached).

Here's a made-up implementation which tracks state in a similar way to Skylake era hardware, but the rest is entirely made up by me. It shows that even a naive implementation can get up to a lossless 2:1 compression ratio. Remember though, this comes at the cost of adding gates to the design and so you'd probably want something better performing than this.

2:1 compression

Everything is tracked as cacheline pairs. In this example we have state called "CCS". For every pair of cachelines in the image, 2b are in this state to track the current compression. When the pair of cachelines uses 12 or fewer colors (which is surprisingly often in real life), we're able to compress the data into a single cacheline (state becomes '01'). When the data is compressed, we can reassemble the image losslessly from a single cacheline; this is 2:1 compression because 1 cacheline gets us back 2 cachelines worth of pixel data.
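
Here is a hedged C sketch of the software equivalent of that made-up scheme: if a pair of 64-byte cachelines of 32-bit pixels uses 12 or fewer distinct colors, pack it as a small palette plus 4-bit indices (which happens to fit exactly in one cacheline) and mark it compressed in the CCS. The state values and layout here are purely illustrative.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CL_BYTES     64
#define PAIR_PIXELS  32   /* two cachelines of 32-bit pixels */
#define MAX_COLORS   12   /* 12*4B palette + 32*4b indices == one cacheline */

/* Try to compress a pair of cachelines (32 pixels) into a single cacheline:
 * a palette of up to 12 colors followed by packed 4-bit indices. On success
 * the CCS state for the pair is set to 01; on failure the data is left alone
 * and the CCS keeps whatever state means "uncompressed". */
static bool
compress_pair(const uint32_t pixels[PAIR_PIXELS], uint8_t out[CL_BYTES],
              uint8_t *ccs_state)
{
    uint32_t palette[MAX_COLORS];
    uint8_t indices[PAIR_PIXELS];
    unsigned colors = 0;

    for (unsigned i = 0; i < PAIR_PIXELS; i++) {
        unsigned j;
        for (j = 0; j < colors; j++)
            if (palette[j] == pixels[i])
                break;
        if (j == colors) {
            if (colors == MAX_COLORS)
                return false;   /* too many colors, store uncompressed */
            palette[colors++] = pixels[i];
        }
        indices[i] = j;
    }

    /* Layout: 48 bytes of palette, then 16 bytes of packed 4-bit indices. */
    memset(out, 0, CL_BYTES);
    memcpy(out, palette, colors * sizeof(uint32_t));
    for (unsigned i = 0; i < PAIR_PIXELS; i++)
        out[48 + i / 2] |= indices[i] << ((i & 1) * 4);

    *ccs_state = 0x1; /* 01: pair is stored compressed 2:1 */
    return true;
}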

Walking through the example we've been using of the Rubik's cube.

  1. As the texture is being uploaded, the hardware observes all the runs of the same color and stores them in this compressed manner by building the lookup table. On doing this it modifies the state bits in the CCS to be 01 for those cachelines.
  2. On texture fetch, the texture sampler checks the CCS. If the encoding is 01, then the hardware knows to use the LUT mechanism instead for all the color values.
  3. Throughout the rest of rendering, steps 1 & 2 are repeated as needed.
  4. When display is ready to scanout the next frame, it too can look at the CCS determine if there is compression, and decompress as it's doing the scanout.

The memory consumed is minimal, which also means that any bandwidth usage overhead is minimal. In the example we have a 64x128 image. In total that's 512 cachelines. At 2 bits per pair of cachelines, the whole CCS for the example fits in a single cacheline: 512 / 2 × 2 bits = 512 bits = 64B

* Unless you really want to understand how hardware might actually work, ignore the 00 encoding for clear color.

* There's a caveat here that we assume texture upload and fetch use the sampler. At the time of the original presentation, this was not usually the case and so until the FB write occurred, you didn't actually get compression.

Theoretical best savings would compress everything:

| Operation | Color Depth | Description | Bandwidth | R/W |
| --- | --- | --- | --- | --- |
| Texture Upload | 1Bpc compressed | File to DRAM | 8KB (64 × 64 × 4) / 2 | W |
| Texel Fetch (nearest) | 1Bpc compressed | DRAM to Sampler | 8KB (64 × 64 × 4) / 2 | R |
| FB Write | 1Bpc compressed | GPU to DRAM | 8KB (64 × 64 × 4) / 2 | W |
| Compositing | 1Bpc compressed | DRAM to DRAM | 16KB (64 × 64 × 4 × 2) / 2 | R+W |
| Scanout | 1Bpc compressed | DRAM to PHY | 8KB (64 × 64 × 4) / 2 | R |

TOTAL SAVINGS = 2.8125MBs (50% savings)

And if you use HW compositing in addition to this...

TOTAL SAVINGS = 3.75MBs (66% savings)

Ending Notes

Hopefully it's somewhat clear how 3d applications are consuming memory bandwidth, and how quickly the consumption grows when adding more applications, textures, screen size, and refresh rate.

End to end lossless compression isn't always going to be a huge win, but in many cases it can really chip away at the problem enough to be measurable. The challenge, as it turns out, is actually getting it hooked up in the driver and the rest of the graphics software stack. As I said earlier, just because a feature seems good doesn't necessarily mean it would be worth the software effort to implement it. End to end lossless compression is one feature that you cannot just turn on by setting a bit, and the fact that it's still not enabled anywhere, to me, is an indication that the effort and gates may have been better spent elsewhere.

However, the next post will be all about how we got it hooked up through the graphics stack.

If you've made it this far, you probably could use a drink. I know I can.

February 25, 2021

A Losing Battle

For a long time, I’ve tried very, very, very, very hard to work around problems with NIR variables when it comes to UBOs and SSBOs.

Really, I have.

But the bottom line is that, at least for gallium-based drivers, they’re unusable. They’re so unreliable that it’s only by sheer luck (and a considerable amount of it) that zink has worked at all until this point.

Don’t believe me? Here’s a list of just some of the hacks that are currently in use by zink to handle support for these descriptor types, along with the reason(s) why they’re needed:

| Hack | Reason It’s Used | Bad Because? |
| --- | --- | --- |
| iterating the list of variables backwards | this indexing vaguely matches the value used by shaders to access the descriptor | only works coincidentally for as long as nothing changes this ordering and explodes entirely with GL-SPIRV |
| skipping non-array variables with data.location > 0 | these are (usually) explicit references to components of a block BO object in a shader | sometimes they’re the only reference to the BO, and skipping them means the whole BO interface gets skipped |
| using different indexing for SSBO variables depending on whether data.explicit_binding is set | this (sometimes) works to fix indexing for SSBOs with random bindings and also atomic counters | the value is set randomly by other optimization passes and so it isn’t actually reliable |
| atomic counters are identified by using !strcmp(glsl_get_type_name(var->interface_type), "counters") | counters get converted to SSBOs, but they require different indexing in order to be accessed correctly | c’mon. |
| runtime arrays (array[]) are randomly tacked onto SPIRV SSBO variables based on the variable type | fixes atomic counter array access and SSBO length() method | not actually needed most of the time |

And then there’s this monstrosity that’s used for linking up SSBO variable indices with their instruction’s access value (comments included for posterity):

unsigned ssbo_idx = 0;
if (!is_ubo_array && var->data.explicit_binding &&
    (glsl_type_is_unsized_array(var->type) || glsl_get_length(var->interface_type) == 1)) {
    /* - block ssbos get their binding broken in gl_nir_lower_buffers,
     *   but also they're totally indistinguishable from lowered counter buffers which have valid bindings
     *
     * hopefully this is a counter or some other non-block variable, but if not then we're probably fucked
     */
    ssbo_idx = var->data.binding;
} else if (base >= 0)
   /* we're indexing into a ssbo array and already have the base index */
   ssbo_idx = base + i;
else {
   if (ctx->ssbo_mask & 1) {
      /* 0 index is used, iterate through the used blocks until we find the first unused one */
      for (unsigned j = 1; j < ctx->num_ssbos; j++)
         if (!(ctx->ssbo_mask & (1 << j))) {
            /* we're iterating forward through the blocks, so the first available one should be
             * what we're looking for
             */
            base = ssbo_idx = j;
            break;
         }
   } else
      /* we're iterating forward through the ssbos, so always assign 0 first */
      base = ssbo_idx = 0;
   assert(ssbo_idx < ctx->num_ssbos);
}
assert(!ctx->ssbos[ssbo_idx]);
ctx->ssbos[ssbo_idx] = var_id;
ctx->ssbo_mask |= 1 << ssbo_idx;
ctx->ssbo_vars[ssbo_idx] = var;

Does it work?

Amazingly, yes, it does work the majority of the time.

But is this really how we should live our lives?

A Methodology To Live By

As the great compiler-warrior Jasonus Ekstrandimus once said, “Just Delete All The Code”.

Truly this is a pivotal revelation, one that can induce many days of deep thinking, but how can it be applied to this scenario?

Today I present the latest in zink code deletion: a NIR pass that deletes all the broken variables and makes new ones.

bender.jpg

Let’s get into it.

uint32_t ssbo_used = 0;
uint32_t ubo_used = 0;
uint64_t max_ssbo_size = 0;
uint64_t max_ubo_size = 0;
bool ssbo_sizes[PIPE_MAX_SHADER_BUFFERS] = {false};

if (!shader->info.num_ssbos && !shader->info.num_ubos && !shader->num_uniforms)
   return false;
nir_function_impl *impl = nir_shader_get_entrypoint(shader);
nir_foreach_block(block, impl) {
   nir_foreach_instr(instr, block) {
      if (instr->type != nir_instr_type_intrinsic)
         continue;

      nir_intrinsic_instr *intrin = nir_instr_as_intrinsic(instr);
      switch (intrin->intrinsic) {
      case nir_intrinsic_store_ssbo:
         ssbo_used |= BITFIELD_BIT(nir_src_as_uint(intrin->src[1]));
         break;

      case nir_intrinsic_get_ssbo_size: {
         uint32_t slot = nir_src_as_uint(intrin->src[0]);
         ssbo_used |= BITFIELD_BIT(slot);
         ssbo_sizes[slot] = true;
         break;
      }
      case nir_intrinsic_ssbo_atomic_add:
      case nir_intrinsic_ssbo_atomic_imin:
      case nir_intrinsic_ssbo_atomic_umin:
      case nir_intrinsic_ssbo_atomic_imax:
      case nir_intrinsic_ssbo_atomic_umax:
      case nir_intrinsic_ssbo_atomic_and:
      case nir_intrinsic_ssbo_atomic_or:
      case nir_intrinsic_ssbo_atomic_xor:
      case nir_intrinsic_ssbo_atomic_exchange:
      case nir_intrinsic_ssbo_atomic_comp_swap:
      case nir_intrinsic_ssbo_atomic_fmin:
      case nir_intrinsic_ssbo_atomic_fmax:
      case nir_intrinsic_ssbo_atomic_fcomp_swap:
      case nir_intrinsic_load_ssbo:
         ssbo_used |= BITFIELD_BIT(nir_src_as_uint(intrin->src[0]));
         break;
      case nir_intrinsic_load_ubo:
      case nir_intrinsic_load_ubo_vec4:
         ubo_used |= BITFIELD_BIT(nir_src_as_uint(intrin->src[0]));
         break;
      default:
         break;
      }
   }
}

The start of the pass iterates over the instructions in the shader. All UBOs and SSBOs that are used get tagged into a bitfield of their index, and any SSBOs which have the length() method called are similarly tagged.


nir_foreach_variable_with_modes(var, shader, nir_var_mem_ssbo | nir_var_mem_ubo) {
   const struct glsl_type *type = glsl_without_array(var->type);
   if (type_is_counter(type))
      continue;
   unsigned size = glsl_count_attribute_slots(type, false);
   if (var->data.mode == nir_var_mem_ubo)
      max_ubo_size = MAX2(max_ubo_size, size);
   else
      max_ssbo_size = MAX2(max_ssbo_size, size);
   var->data.mode = nir_var_shader_temp;
}
nir_fixup_deref_modes(shader);
NIR_PASS_V(shader, nir_remove_dead_variables, nir_var_shader_temp, NULL);
optimize_nir(shader);

Next, the existing SSBO and UBO variables get iterated over. A maximum size is stored for each type, and then the variable mode is set to temp so it can be deleted. These variables aren’t actually used by the shader anymore, so this is definitely okay.

Boom.

if (!ssbo_used && !ubo_used)
   return false;

Early return if it turns out that there’s not actually any UBO or SSBO use in the shader, and all the variables are gone to boot.

struct glsl_struct_field *fields = rzalloc_array(shader, struct glsl_struct_field, 2);
fields[0].name = ralloc_strdup(shader, "base");
fields[1].name = ralloc_strdup(shader, "unsized");

The new variables are all going to be the same type, one which matches what’s actually used during SPIRV translation: a simple struct containing an array of uints, aka base. SSBO variables which need the length() method will get a second struct member that’s a runtime array, aka unsized.

if (ubo_used) {
   const struct glsl_type *ubo_type = glsl_array_type(glsl_uint_type(), max_ubo_size * 4, 4);
   fields[0].type = ubo_type;
   u_foreach_bit(slot, ubo_used) {
      char buf[64];
      snprintf(buf, sizeof(buf), "ubo_slot_%u", slot);
      nir_variable *var = nir_variable_create(shader, nir_var_mem_ubo, glsl_struct_type(fields, 1, "struct", false), buf);
      var->interface_type = var->type;
      var->data.driver_location = slot;
   }
}

If there’s a valid bitmask of UBOs that are used by the shader, the index slots get iterated over, and a variable is created for that slot using the same type for each one. The size is determined by the size of the biggest UBO variable that previously existed, which ensures that there won’t be any errors or weirdness with access past the boundary of the variable. All the GLSL compilation and NIR passes to this point have already handled bounds detection, so this is also fine.

if (ssbo_used) {
   const struct glsl_type *ssbo_type = glsl_array_type(glsl_uint_type(), max_ssbo_size * 4, 4);
   const struct glsl_type *unsized = glsl_array_type(glsl_uint_type(), 0, 4);
   fields[0].type = ssbo_type;
   u_foreach_bit(slot, ssbo_used) {
      char buf[64];
      snprintf(buf, sizeof(buf), "ssbo_slot_%u", slot);
      if (ssbo_sizes[slot])
         fields[1].type = unsized;
      else
         fields[1].type = NULL;
      nir_variable *var = nir_variable_create(shader, nir_var_mem_ssbo,
                                              glsl_struct_type(fields, 1 + !!ssbo_sizes[slot], "struct", false), buf);
      var->interface_type = var->type;
      var->data.driver_location = slot;
   }
}

SSBOs are almost the same, but as previously mentioned, they also get a bonus member if they need the length() method. The GLSL compiler has already pre-computed the adjustment for the value that will be returned by length(), so it doesn’t actually matter what the size of the variable is anymore.

And that’s it! The entire encyclopedia of hacks can now be removed, and I can avoid ever having to look at any of this again.

February 23, 2021

Linaro has been working together with Qualcomm to enable camera support on their platforms since 2017. The Open Source CAMSS driver was written to support the ISP IP-block with the same name that is present on Qualcomm SoCs coming from the smartphone space.

The first development board targeted by this work was the DragonBoard 410C, which was followed in 2018 by DragonBoard 820C support. Recently support for the Snapdragon 660 SoC was added to the driver, which will be part of the v5.11 Linux Kernel release. These SoCs all contain the CAMSS (Camera SubSystem) version of the ISP architecture.

Currently, support for the ISP found in the Snapdragon 845 SoC and the DragonBoard 845C is in the process of being upstreamed to the mailing lists. Having …

February 19, 2021

Quickly: ES 3.2

I’ve been getting a lot of pings over the past week or two about ES 3.2 support.

Here’s the deal.

It’s not happening soon. Probably.

Zink currently supports every 3.2 extension except for photoshop. There’s two ways to achieve support for that extension at present:

  • the nice, simple, VK_EXT_blend_operation_advanced which nobody* supports
  • the difficult, excruciating fbfetch method using shader rewrites, which also requires extensions that nobody supports

* Yes, I know that Nvidia supports advanced blend, but zink+nvidia is not currently capable of doing ES of any version, so that’s not testable.

So in short, it’s going to be a while.

But there’s not really a technical reason to rush towards full ES 3.2 anyway other than to fill out a box on mesamatrix. If you have an app that requires 3.2, chances are that it probably doesn’t really require it; few apps actually use the advanced blend extension, and so it should be possible for the app to require only 3.1 and then verify the presence of whatever 3.2-based extensions it may use in order to be more compatible.
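
For instance, instead of requiring a 3.2 context, an app could create a 3.1 context and probe for the one extension it actually cares about. Here is a hedged C sketch using the glGetStringi query that has been core since GL 3.0 / ES 3.0; it assumes a context is already current:

#include <stdbool.h>
#include <string.h>
#include <GLES3/gl3.h>

/* Returns true if the current context advertises the given extension,
 * e.g. has_extension("GL_KHR_blend_equation_advanced"). */
static bool
has_extension(const char *name)
{
    GLint count = 0;

    glGetIntegerv(GL_NUM_EXTENSIONS, &count);
    for (GLint i = 0; i < count; i++) {
        const char *ext = (const char *)glGetStringi(GL_EXTENSIONS, i);
        if (ext && strcmp(ext, name) == 0)
            return true;
    }
    return false;
}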

Of course, this is unlikely to happen. It’s far easier for app developers to just say “give me 3.2” if maybe they just want geometry shaders, and I wouldn’t expect anyone is going to be special-casing things just to run on zink.

Nonetheless, it’s really not a priority for me given the current state of the Vulkan ecosystem. As time moves on and various extensions/features become more common that may change, but for now I’m focusing on things that are going to be the most useful.

Background

After a lot of effort over short stints in the last several months, I have completed my blog migration to Lektor in the hopes that when I migrate again in the future, it won't be as painful.

Despite my efforts, many old posts might not be perfect. This is a job for the Wayback Machine.

In case you're curious I did this primarily for one reason (and lots of smaller ones). I wanted my data back. Wordpress is an open source blogging platform with huge adoption. It has a very large plugin ecosystem and is very actively updated and maintained. While security issues have come up here and there, at some point automatic updates became an option and that helped a bit. In 2010 it was the obvious choice.

If you've gained anything from my blog posts, you should thank Wordpress. Wordpress' ease of setup and relative ease of use is a big reason I was able to author things as well as I did.

So what happened - plugins

I wanted my data back. It was a self hosted instance and I had all my information stored in a SQL database. Obviously I never lost my data, but...

Plugins.

I used plugins for my tables (multiple plugins). I used plugins for code highlighting. Plugins for LaTeX. Plugins for table of contents, social media integration, post tagging, image captioning and formatting, spelling. You get the idea. The result of all this was I ended up with a blog post that was entirely useless in its text only form. Plugins storing the data in non-standard places so it can be processed and look fancy.

The WYSIWYG editor interface was a huge plus for me. I spent all day in front of a terminal breaking graphics and display (meaning I really was in front of an 80x24 terminal at times). I didn't want to have to deal with fanciful layout engines or styles. Those plugins ended up destroying the WYSIWYG editor experience and I ended up doing everything in quasi markdown anyway.

Plugins themselves introduced security issues when they weren't intentionally malicious anyway.

What was next?

These static site generators seemed really appealing as a solution to this problem. Everything in markdown. Assets stored together in the filesystem. Jekyll is obviously hugely popular. Hugo, Pelican, Gatsby, and Sphinx are all generators I considered. The number of static site generators is staggering. I wish I could remember what made me choose Lektor, but I can't - python based was my only requirement.

Python because I wanted a platform that did most of what I wanted but was extendible by me if absolutely necessary.

Migrating was definitely a lot of work. I was tempted several times to abort the effort and just rely on the Wayback Machine. Ultimately I decided that migrating the posts would be a good way to learn how well the platform would meet my needs (that being an annual blog post or so).

There are definitely some features I miss that I may or may not get to.

  1. Comments. There is disqus integration. I'm not convinced this is what I want.
  2. Post grouping. There are categories. It was too complicated for me to figure out in a short time, so I'm punting on it for now.
  3. I'd really like to not have to learn CSS and jinja2. I can scrape by a bit, but changing anything drastic takes a lot of effort for me.

Migration

I followed this. I did have to make some minor changes specific to my needs and posts did still require some touchups, in large part due to plugins and my obsessive use of SVG.

See you soon

Now that I'm back, I hope to post more often. Next up will be a recap of some of the pathfinding projects I worked on after FreeBSD enabling.

February 18, 2021

Last year I wrote about how to create a user-specific XKB layout, followed by a post explaining that this won't work in X. But there's a pandemic going on, which is presumably the only reason people haven't all switched to Wayland yet. So it was time to figure out a workaround for those still running X.

This Merge Request (scheduled for xkeyboard-config 2.33) adds a "custom" layout to the evdev.xml and base.xml files. These XML files are parsed by the various GUI tools to display the selection of available layouts. An entry in there will thus show up in the GUI tool.

Our rulesets, i.e. the files that convert a layout/variant configuration into the components to actually load, already have wildcard matching [1]. So the custom layout will resolve to the symbols/custom file in your XKB data dir - usually /usr/share/X11/xkb/symbols/custom.

This file is not provided by xkeyboard-config. It can be created by the user though and whatever configuration is in there will be the "custom" keyboard layout. Because xkeyboard-config does not supply this file, it will not get overwritten on update.

From XKB's POV it is just another layout and it thus uses the same syntax. For example, to override the +/* key on the German keyboard layout with a key that produces a/b/c/d on the various Shift/Alt combinations, use this:


default
xkb_symbols "basic" {
include "de(basic)"
key <AD12> { [ a, b, c, d ] };
};
This example includes the "basic" section from the symbols/de file (i.e. the default German layout), then overrides the 12th alphanumeric key from left in the 4th row from bottom (D) with the given symbols. I'll leave it up to the reader to come up with a less useful example.

There are a few drawbacks:

  • If the file is missing and the user selects the custom layout, the results are... undefined. For run-time configuration like GNOME it doesn't really matter - the layout compilation fails and you end up with the one the device already had (i.e. the default one built into X, usually the US layout).
  • If the file is missing and the custom layout is selected in the xorg.conf, the results are... undefined. I tested it and ended up with the US layout but that seems more by accident than design. My recommendation is to not do that.
  • No variants are available in the XML files, so the only accessible section is the one marked default.
  • If a commandline tool uses a variant of custom, the GUI will not reflect this. If the GUI goes boom, that's a bug in the GUI.

So overall, it's a hack[2]. But it's a hack that fixes real user issues and given we're talking about X, I doubt anyone notices another hack anyway.

[1] If you don't care about GUIs, setxkbmap -layout custom -variant foobar has been possible for years.
[2] Sticking with the UNIX principle, it's a hack that fixes the issue at hand, is badly integrated, and weird to configure.

What’s Next

It’s been a busy week. The CTS fixes and patch drops are coming faster and faster, and progress is swift. Here’s a quick note on some things that are on the horizon.

Features Landing Soon

Zink’s in a tough spot right now in master. GL 4.6 is available, but there’s still plenty of things that won’t work, e.g., running anything at 60fps. These are things I expect (hope) to see land in the repo in the next month or so:

  • improved barrier support, which frees up some opportunities with queue refactoring
  • removing explicit pre-fencing on every frame (and sometimes multiple times per frame)
  • descriptor caching
  • various bugfixes which weren’t feasible due to architectural issues

All told, just as an example, Unigine Heaven (which can now run in color!) should see roughly a 100% performance improvement (possibly more) once this is in, and I’d expect substantial performance gains across the board.

Will you be able to suddenly play all your favorite GL-based Steam games?

No.

I can’t even play all your favorite GL-based Steam games yet, so it’s a long ways off for everyone else.

But you’ll probably be able to get surprisingly good speed on what things you can run considering the amount of time that will pass between hitting 4.6 and these patchsets merging.

Features I’m Working On

I spent some time working on Wolfenstein over the past week, but there’s some non-zink issues in the way, so that’s on the backburner for a while. Instead, I’ve turned my attention to CTS and begun unloading a dumptruck of resulting fixes into the codebase.

There comes a time when performance is “good enough” for a while, and, after some intense optimizing since the start of the year, that time has come. So now it’s back to stabilization mode, and I’m now aiming to have a vaguely decent pass rate in the near term.

Hopefully I’ll find some time to post some of the crazy bugs I’ve been hunting, but maybe not. Time will tell.

February 11, 2021

By Now

…or in the very near future, the ol’ bumperino will have landed, putting zink at GL 4.5.

But that’s boring, so let’s check out something very slightly more interesting.

Steam Games

What are they and how do they work?

I’m not going to answer these questions, but I am going to be looking into getting them working on zink.

To that end, as I hinted at yesterday, I began with Wolfenstein: The New Order, as chosen by Daniel Schuermann, the lucky winner of the What Steam Game Should Zink Use As Its Primary Test Case And Benchmark contest that was recently held.

Early tests of this game were unimpressive. That is to say I got an immediate crash. It turns out that having the GL compatibility context restricted to 3.0 is bad for getting AAA games running, so zink-wip now enables 4.6 compat contexts.

But then I was still getting a crash without any clear error message. Suddenly, I was back in 2004 trying to figure out how to debug wine apps.

Things are much simpler now, however. PROTON_DUMP_DEBUG_COMMANDS enables dumping scripts for debugging from steam, including one which attaches a debugger to the game. This solved the problem of getting a debugger in before the almost-immediate crash, but it didn’t get me closer to a resolution.

The problem now is that I’d attached a debugger to the in-wine process, which is just a sandbox for the Windows API. What I actually wanted was to attach to the wine process itself so I could see what was going on in the driver.

gdb --pid=$(pidof WolfNewOrder_x64.exe) ended up being what I needed, but this was complicated by the fact that I had to attach before the game crashed and without triggering the steam error reporter. So in the end, I had to attach using the proton script, then, while it was paused, attach to the outer process for driver debugging. But I also had to attach to the outer process after zink was loaded, so it was a real struggle.

Then, as per usual, another problem: I had no symbols loaded because proton runs a static binary. After cluelessly asking around in the DXVK discord, @Herbert helpfully provided a gdb python script for proton in-process debugging that I was able to repurpose for my needs. The gist (haha) of the script is that it scans /proc/$pid/maps and then manually loads the required library files.

At last, I had attached to the game, I had symbols, and I could see that I was hitting a zink assert I’d added to catch int overflows. A quick one-liner to change the order of a calculation fixed that, and now I’m on to an entirely new class of bugs.

February 10, 2021

This article is part of a series on how to setup a bare-metal CI system for Linux driver development. Check out part 1 where we expose the context/high-level principles of the whole CI system, and make the machine fully controllable remotely (power on, OS to boot, keyboard/screen emulation using a serial console).

In this article, we will start demystifying the boot process, and discuss different ways to generate and boot an OS image along with a kernel for your machine. Finally, we will introduce boot2container, a project that makes running containers on bare metal a breeze!

This work is sponsored by the Valve Corporation.

Generating a kernel & rootfs for your Linux-based testing

To boot your test environment, you will need to generate the following items:

  • A kernel, providing all the necessary drivers for your test;
  • A userspace, containing all the dependencies of your test (rootfs);
  • An initramfs (optional), containing the drivers/firmwares needed to access the userspace image, along with an init script performing the early boot sequence of the machine;

The initramfs is optional because the drivers and their firmwares can be built in the kernel directly.

Let's not generate these items just yet, but instead let's look at the different ways one could generate them, depending on their experience.

The embedded way

Buildroot's logo

If you are used to dealing with embedded devices, you are already familiar with projects such as Yocto or Buildroot. They are well-suited to generating a tiny rootfs, which can be useful for netbooted systems such as the one we set up in part 1 of this series. They usually allow you to describe everything you want on your rootfs, then will configure, compile, and install all the wanted programs in the rootfs.

If you are wondering which one to use, I suggest you check out the presentation from Alexandre Belloni / Thomas Petazzoni, which will give you an overview of both projects and help you decide on what you need.

Pros:

  • Minimal size: Only what is needed is included
  • Complete: Configures and compiles the kernel for you

Cons:

  • Slow to generate: Everything is compiled from source
  • Small selection of software/libraries: Adding build recipes is however relatively easy

The Linux distribution way

Debian Logo, www.debian.org

If you are used to installing Linux distributions, your first instinct might be to install your distribution of choice in a chroot or a Virtual Machine, install the packages you want, and package the folder/virtual disk into a tarball.

Some tools such as debos, or virt-builder make this process relatively painless, although they will not be compiling an initramfs, nor a kernel for you.

Fortunately, building the kernel is relatively simple, and there are plenty of tutorials on the topic (see ArchLinux's wiki). Just make sure to compile modules and firmware in the kernel, to avoid the complication of using an initramfs. Don't forget to also compress your kernel if you decide to netboot it!

Pros:

  • Relatively fast: No compilation necessary (except for the kernel)
  • Familiar environment: Closer to what users/developers use in the wild

Cons:

  • Larger: Packages tend to bring a lot of unwanted dependencies, drastically increasing the size of the image
  • Limited choice of distros: Not all distributions are easy to install in a chroot
  • Insecure: Requires root rights to generate the image, which may accidentally trash your distro
  • Poor reproducibility: Distributions get updates continuously, leading to different outcomes when running the same command
  • No caching: all the steps to generate the rootfs are re-done every time
  • Incomplete: does not generate a kernel or initramfs for you

The refined distribution way: containers

Docker and the Docker logo are trademarks or registered trademarks of Docker, Inc.

Containers are an evolution of the old chroot trick, made secure thanks to the addition of multiple namespaces in Linux. Containers and their runtimes have addressed pretty much all the cons of the "Linux distribution way", and have become a standard way to share applications.

On top of generating a rootfs, containers also allow setting environment variables, controlling the program's command line, and come with a standardized transport mechanism which simplifies sharing images.

Finally, container images are made of cacheable layers, which can be used to share base images between containers, and also speed up the generation of a container image by only re-computing the layer that changed and the layers applied on top of it.

The biggest drawback of containers is that they are usually meant to be run on pre-configured hosts. This means that if you want to run the container directly, you will need to include an init script or install systemd in your container and set it as the container's entrypoint. It is however possible to perform these tasks before running the container, as we'll explain in the following sections.

Pros:

  • Fastest: No compilation necessary (except for the kernel), and layers cached
  • Familiar: Shared environment between developers and the test system
  • Flexible: Full choice of distro
  • Secure: No root rights needed, everything is done in a user namespace
  • Shareable: Containers come with a transport/storage mechanism (registries)
  • Reproducible: Easily run the exact same userspace on your dev and test machines

Cons:

  • Larger: Packages tend to bring a lot of dependencies, drastically increasing the size of the image
  • Incomplete: does not generate a kernel or initramfs for you

Deploying and booting a rootfs

Now we know how we could generate a rootfs, so the next step is to be able to deploy and boot it!

Challenge #1: Deploying the Kernel / Initramfs

There are multiple ways to deploy an operating system:

  • Flash and reboot: Typical on ARM boards / Android phones;
  • Netboot: Typical in big organizations that manage thousands of machines.

The former solution is great at preventing the bricking of a device that needs a working Operating System in order to be flashed again, as the deployment can be checked on the device itself before rebooting.

The latter solution enables diskless test machines, which is an effective way to reduce state (the enemy #1 of reproducible results). It also enables a faster deployment/boot time as the CI system would not have to boot the machine, flash it, then reboot. Instead, the machine simply starts up, requests an IP address through BOOTP/DHCP, downloads the kernel/initramfs, and executes the kernel. This was the solution we opted for in part 1 of this blog series.

Whatever solution you end up picking, you will now be presented with your next challenge: making sure the rootfs remains the same across reboots.

Challenge #2: Deploying the rootfs efficiently

If you have chosen the Flash and reboot deployment method, you may be prepared to re-flash the entire Operating System image every time you boot. This would make sure that the state of a previous boot won't leak into following boots.

This method can however become a big burden on your network when scaled to tens of machines, so you may be tempted to use a Network File System such as NFS to spread the load over a longer period of time. Unfortunately, using NFS brings its own set of challenges (how deep is this rabbit hole?):

  • The same rootfs directory cannot be shared across machines without duplication unless mounted read-only, as machines should not be able to influence each-other's execution;
  • The NFS server needs to remain available as long as at least one test machine is running;
  • Network congestion might influence the testing happening on the machine, which can affect functional testing, but will definitely affect performance testing.

So, instead of trying to spread the load, we could try to reduce the size of the rootfs by only sending the content that changed. For example, the rootfs could be split into the following layers:

  • The base Operating System needed for the testing;
  • The driver you want to test (if it wasn't in the kernel);
  • The test suite(s) you want to run.

Layers can be downloaded by the test machine, through a short-lived-state network protocol such as HTTP, as individual SquashFS images. Additionally, SquashFS provides compression which further reduces the storage/network bandwidth needs.

The layers can then be combined by first mounting them to separate folders in read-only mode (the only mode supported by SquashFS), then merging them using OverlayFS. OverlayFS will store all the writes done to this filesystem in its upper directory (using workdir as internal scratch space). If these directories are backed by a ramdisk (tmpfs) or a never-reused temporary directory, then no information from previous boots can impact the new boots!
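Here is a hedged sketch of what that assembly could look like on the test machine; the layer file names are hypothetical:

# Mount each SquashFS layer read-only (the only mode SquashFS supports)
mkdir -p /layers/base /layers/driver /layers/tests /rootfs
mount -t squashfs -o loop,ro base-os.squashfs     /layers/base
mount -t squashfs -o loop,ro driver.squashfs      /layers/driver
mount -t squashfs -o loop,ro test-suites.squashfs /layers/tests

# Keep all writes in a tmpfs so nothing survives a reboot
mkdir -p /tmp/overlay && mount -t tmpfs tmpfs /tmp/overlay
mkdir -p /tmp/overlay/upper /tmp/overlay/work

# Merge the layers; the leftmost lowerdir is the topmost layer
mount -t overlay overlay \
      -o lowerdir=/layers/tests:/layers/driver:/layers/base,upperdir=/tmp/overlay/upper,workdir=/tmp/overlay/work \
      /rootfs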

If you are familiar with containers, you may have recognized this approach as what is used by containers: layers + the overlay2 storage driver. The only difference is that container runtimes depend on tarballs rather than SquashFS images, probably because SquashFS is a Linux-only filesystem.

If you are anything like me, you should now be pretty tempted to simply use containers for the rootfs generation, transport, and boot! That would be a wise move, given that thousands of engineers have been working on them over the last decade or so, and whatever solution you may come up with will inevitably have even more quirks than these industry standards.

I would thus recommend using containers to generate your rootfs, as there are plenty of tools that will generate them for you, with varying degrees of complexity. Check out buildah if Docker or Podman are too high-level for your needs!
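For reference, here is a hedged buildah sketch of assembling and publishing a test image; the base image, package, entrypoint, and registry below are all hypothetical:

# Build a test container image step by step, then push it to a registry
ctr=$(buildah from docker.io/library/debian:bullseye-slim)
buildah run "$ctr" -- apt-get update
buildah run "$ctr" -- apt-get install -y --no-install-recommends piglit
buildah config --entrypoint '["piglit", "run", "sanity", "/results"]' "$ctr"
buildah commit "$ctr" registry.example.com/gfx-ci/piglit-test:latest
buildah push registry.example.com/gfx-ci/piglit-test:latest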

Let's now brace for the next challenge, deploying a container runtime!

Challenge #3: Deploying a container runtime to run the test image

In the previous challenge, we realized that a great way to deploy a rootfs efficiently was to simply use a container runtime to do everything for us, rather than re-inventing the wheel.

This would enable us to create an initramfs which would be downloaded along with the kernel through the usual netboot process, and would be responsible for initializing the machine, connecting to the network, mounting the layer cache partition, setting the time, downloading a container, then executing it. The last two steps would be performed by the container runtime of our choice.

Generating an initramfs is way easier than one might expect. Projects like dracut are meant to simplify their creation, but my favourite has been u-root, coming from the LinuxBoot project. I generated my first initramfs in less than 5 minutes, so I was incredibly hopeful to achieve the outlined goals in no time!
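For the curious, the whole thing boils down to something like this; treat it as a hedged sketch, as the u-root invocation and template names may differ between versions:

# Build a small Go-based initramfs and compress it for netbooting
go install github.com/u-root/u-root@latest
u-root -o initramfs.linux_amd64.cpio core
xz --check=crc32 -9 initramfs.linux_amd64.cpio   # the kernel's xz decompressor expects crc32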

Unfortunately, the first setback came quickly: container runtimes (Docker, Podman) are huge (~150 to 300 MB), if we are to believe the size of their respective Alpine Linux packages and dependencies! While this may not be a problem for the Flash and reboot method, it is definitely a significant issue for the Netboot method which would need to download it for every boot.

Challenge #3.5: Minifying the container runtime

After spending a significant amount of time studying container runtimes, I identified the following functions:

  • Transport / distribution: Downloading a container image from a container registry to the local storage (spec );
  • De-layer the rootfs: Unpack the layers' tarball, and use OverlayFS to merge them (default storage driver, but there are many other ways);
  • Generate the container manifest: A JSON-based config file specifying how the container should be run;
  • Executing the container

Thus started my quest to find lightweight solutions that could do all of these steps... and wonder just how deep is this rabbit hole??

The usual executor found in the likes of Podman and Docker is runc. It is written in Golang, which compiles everything statically and leads to giant binaries. In this case, runc clocks in at ~12MB. Fortunately, a knight in shining armour came to the rescue, re-implemented runc in C, and named it crun. The final binary size is ~400 KB, and it is fully compatible with runc. That's good enough for me!

To download and unpack the rootfs from the container image, I found genuinetools/img which supports that out of the box! Its size was however much bigger than expected, at ~28.5MB. Fortunately, compiling it ourselves, stripping the symbols, then compressing it using UPX led to a much more manageable ~9MB!
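The shrinking recipe itself is nothing exotic; here is a hedged sketch (the build command is an assumption, the flags are standard Go linker / UPX ones):

# Build statically, drop debug info and symbol tables, then compress the binary
go build -ldflags "-w -s" -o img .
upx --best img
ls -lh img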

What was left was to generate the container manifest according to the runtime spec. I started by hardcoding it to verify that I could indeed run the container. I was relieved to see it would work on my development machine, even though it failed in my initramfs. After spending a couple of hours diffing straces, poking at a couple of sysfs/config files, and realizing that pivot_root does not work in an initramfs, I finally managed to run the container with crun run --no-pivot!

I was over the moon, as the only thing left was to generate the container manifest by patching genuinetools/img to generate it according to the container image manifest (like docker or podman does). This is where I started losing grip: lured by the prospect of a simple initramfs solving all my problems, being so close to the goal, I started free-falling down what felt like the deepest rabbit hole of my engineering career... Fortunately, after a couple of weeks, I emerged, covered in mud but victorious! Cue the gory battle log :)

When trying to access the container image's manifest in img, I realized that it was re-creating the layers and manifest, and thus was losing information such as the entrypoint, environment variables, and other important parameters. After scouring through its source code and its 500 kLOC of dependencies, I came to the conclusion that it would be easier to start a project from scratch that would use Red Hat's image and storage libraries to download and store the container on the cache partition. I then needed to unpack the layers, generate the container manifest, and start runc. After a couple of days, ~250 lines of code, and tons of staring at straces to get it working, it finally did! Out went img, and the new runtime's size was under 10 MB \o/!

The last missing piece in the puzzle was performance-related: use OverlayFS to merge the layers, rather than unpacking them ourselves.

This is when I decided to have another look at Podman, saw that they have their own internal library for all the major functions, and decided to compile podman to try it out. The binary size was ~50 MB, but after removing some features, setting the -w -s LDFLAGS, and compressing it using upx --best, I got the final size to be ~14 MB! Of course, Podman is more than just one binary, so trying to run a container with it failed. However, after a bit of experimentation and stracing, I realized that running the container with --privileged --network=host would work using crun... provided we force-added the --no-pivot parameter to crun. My happiness was however short-lived, replaced by a MAJOR FACEPALM MOMENT:

After a couple of minutes of constant facepalming, I realized I was also relieved, as Podman is a battle-tested container runtime, and I would not need to maintain a single line of Go! Also, I now knew how deep the rabbit hole was, and we just needed to package everything nicely in an initramfs and we would be good. Success, at last!

Boot2container: Run your containers from an initramfs!

If you have managed to read through the article up to this point, congratulations! For the others who just gave up and jumped straight to this section, I forgive you for teleporting yourself to the bottom of the rabbit hole directly! In both cases, you are likely wondering: where is this breeze you were promised in the introduction?

     Boot2container enters the chat.

Boot2container is a lightweight (sub-20 MB) and fast initramfs I developed that will allow you to ignore the subtleties of operating a container runtime and focus on what matters, your test environment!

Here is an example of how to run boot2container, using SYSLINUX:

LABEL root
    MENU LABEL Run docker's hello world container, with caching disabled
    LINUX /vmlinuz-linux
    APPEND b2c.container=docker://hello-world b2c.cache_device=none b2c.ntp_peer=auto
    INITRD /initramfs.linux_amd64.cpio.xz

The hello-world container image will be run in privileged mode and with the host network, which is what you want when running the container for bare metal testing!

Make sure to check out the list of features and options before either generating the initramfs yourself or downloading it from the releases page. Try it out with your kernel, or the example one bundled in the release!
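If you would rather not dedicate hardware to a first try, something like the following QEMU invocation should do; the file names match the SYSLINUX example above, but consider this a hedged sketch:

qemu-system-x86_64 -m 1G -enable-kvm -nographic \
    -kernel vmlinuz-linux \
    -initrd initramfs.linux_amd64.cpio.xz \
    -append "console=ttyS0 b2c.container=docker://hello-world b2c.cache_device=none b2c.ntp_peer=auto"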

With this project mostly done, we have pretty much concluded the work needed to set up the test machines. The next articles in this series will focus on the infrastructure needed to support a fleet of test machines, and expose it to Gitlab/Github/...

That's all for now, thanks for reading that far!

wolfenstein.png

wolfenstein2.png

February 09, 2021

If you don’t know what traces based rendering regression testing is, read the appendix before continuing.


The Mesa community has witnessed an explosion of interest in Continuous Integration in the last two years.

In addition to checking that the project builds properly, integrating the testing of its functional correctness has become a priority. User space graphics drivers exhibit a wide variety of types of tests and test suites. One of those kinds is traces based rendering regression testing.

The public effort to add this kind of tests into Mesa’s CI started with this mail from Alexandros Frantzis.

At some point, we had support for replaying OpenGL, Vulkan and D3D11 traces using apitrace, RenderDoc and GFXReconstruct with the in-tree tool tracie. However, it was a very custom solution made to fit the needs of Mesa, so I proposed moving this codebase and integrating it into the piglit test suite. It was a natural step forward.

This is how replayer was born into piglit.

replayer

The first step to test a trace is, actually, obtaining a trace. I won’t go into the details about how to create one from scratch. The process is well documented on each of the tools listed above. However, the Mesa community has been collecting publicly distributable traces for a while and placing them in traces-db whose CI is copying them to Freedesktop.org’s MinIO instance.

To make things simple, once we have built and installed piglit, if we would like to test an apitrace-created OpenGL trace, we can download one from there with:

$ replayer.py download \
 	 --download-url https://minio-packet.freedesktop.org/mesa-tracie-public/ \
 	 --db-path ./traces-db \
 	 --force-download \
 	 glxgears/glxgears-2.trace

The parameters are self explanatory. The downloaded trace will now exist at ./traces-db/glxgears/glxgears-2.trace.

The next step will be to dump an image from the trace. Since it is a .trace file we will need to have apitrace installed in the system. If we do not specify the call(s) from which to dump the image(s), we will just get the last frame of the trace:

$ replayer.py dump ./traces-db/glxgears/glxgears-2.trace

The dumped PNG image will be at ./results/glxgears-2.trace-0000001413.png. Notice that the number suffix is the snapshot id from the trace.

Dumping from a trace may result in a range of different possible images. One example is when the trace makes use of uninitialized values, leading to undefined behaviors.

However, since the original aim was performing pre-merge rendering regression testing in Mesa’s CI, the idea is that replaying any of the provided traces should be quick and the dumped image consistent. In other words, if we dump the same frame of a trace several times with the same GFX stack, the image will always be the same.

With this precondition, we can test whether two different images are the same just by hashing their content. replayer can obtain the hash for the generated dumped image:

$ replayer.py checksum ./results/glxgears-2.trace-0000001413.png 
f8eba0fec6e3e0af9cb09844bc73bdc8

Now, if we build a different commit of Mesa, we can check the image generated at this new point against the previously generated reference image. If everything goes well, we will see something like:

$ replayer.py compare trace \
 	 --download-url https://minio-packet.freedesktop.org/mesa-tracie-public/ \
 	 --device-name gl-vmware-llvmpipe \
 	 --db-path ./traces-db \
 	 --keep-image \
 	 glxgears/glxgears-2.trace f8eba0fec6e3e0af9cb09844bc73bdc8
[dump_trace_images] Info: Dumping trace ./traces-db/glxgears/glxgears-2.trace...
[dump_trace_images] Running: apitrace dump --calls=frame ./traces-db/glxgears/glxgears-2.trace
// process.name = "/usr/bin/glxgears"
1384 glXSwapBuffers(dpy = 0x56060e921f80, drawable = 31457282)

1413 glXSwapBuffers(dpy = 0x56060e921f80, drawable = 31457282)

error: drawable failed to resize: expected 1515x843, got 300x300
[dump_trace_images] Running: eglretrace --headless --snapshot=1413 --snapshot-prefix=./results/trace/gl-vmware-llvmpipe/glxgears/glxgears-2.trace- ./blog-traces-db/glxgears/glxgears-2.trace
Wrote ./results/trace/gl-vmware-llvmpipe/glxgears/glxgears-2.trace-0000001413.png

OK
[check_image]
    actual: f8eba0fec6e3e0af9cb09844bc73bdc8
  expected: f8eba0fec6e3e0af9cb09844bc73bdc8
[check_image] Images match for:
  glxgears/glxgears-2.trace

PIGLIT: {"images": [{"image_desc": "glxgears/glxgears-2.trace", "image_ref": "f8eba0fec6e3e0af9cb09844bc73bdc8.png", "image_render": "./results/trace/gl-vmware-llvmpipe/glxgears/glxgears-2.trace-0000001413-f8eba0fec6e3e0af9cb09844bc73bdc8.png"}], "result": "pass"}

replayer‘s compare subcommand is the one that emits piglit-formatted test expectations output.

Putting everything together

We can make the whole process way simpler by passing replayer a YAML test list file. For example:

$ cat testing-traces.yml
traces-db:
  download-url: https://minio-packet.freedesktop.org/mesa-tracie-public/

traces:
  - path: gputest/triangle.trace
    expectations:
      - device: gl-vmware-llvmpipe
        checksum: c8848dec77ee0c55292417f54c0a1a49
  - path: glxgears/glxgears-2.trace
    expectations:
      - device: gl-vmware-llvmpipe
        checksum: f53ac20e17da91c0359c31f2fa3f401e
$ replayer.py compare yaml \
 	 --device-name gl-vmware-llvmpipe \
 	 --yaml-file testing-traces.yml 
[check_image] Downloading file gputest/triangle.trace took 5s.
[dump_trace_images] Info: Dumping trace ./replayer-db/gputest/triangle.trace...
[dump_trace_images] Running: apitrace dump --calls=frame ./replayer-db/gputest/triangle.trace
// process.name = "/home/anholt/GpuTest_Linux_x64_0.7.0/GpuTest"
397 glXSwapBuffers(dpy = 0x7f0ad0005a90, drawable = 56623106)

510 glXSwapBuffers(dpy = 0x7f0ad0005a90, drawable = 56623106)


/home/anholt/GpuTest_Linux_x64_0.7.0/GpuTest
[dump_trace_images] Running: eglretrace --headless --snapshot=510 --snapshot-prefix=./results/trace/gl-vmware-llvmpipe/gputest/triangle.trace- ./replayer-db/gputest/triangle.trace
Wrote ./results/trace/gl-vmware-llvmpipe/gputest/triangle.trace-0000000510.png

OK
[check_image]
    actual: c8848dec77ee0c55292417f54c0a1a49
  expected: c8848dec77ee0c55292417f54c0a1a49
[check_image] Images match for:
  gputest/triangle.trace

[check_image] Downloading file glxgears/glxgears-2.trace took 5s.
[dump_trace_images] Info: Dumping trace ./replayer-db/glxgears/glxgears-2.trace...
[dump_trace_images] Running: apitrace dump --calls=frame ./replayer-db/glxgears/glxgears-2.trace
// process.name = "/usr/bin/glxgears"
1384 glXSwapBuffers(dpy = 0x56060e921f80, drawable = 31457282)

1413 glXSwapBuffers(dpy = 0x56060e921f80, drawable = 31457282)


/usr/bin/glxgears
error: drawable failed to resize: expected 1515x843, got 300x300
[dump_trace_images] Running: eglretrace --headless --snapshot=1413 --snapshot-prefix=./results/trace/gl-vmware-llvmpipe/glxgears/glxgears-2.trace- ./replayer-db/glxgears/glxgears-2.trace
Wrote ./results/trace/gl-vmware-llvmpipe/glxgears/glxgears-2.trace-0000001413.png

OK
[check_image]
    actual: f8eba0fec6e3e0af9cb09844bc73bdc8
  expected: f8eba0fec6e3e0af9cb09844bc73bdc8
[check_image] Images match for:
  glxgears/glxgears-2.trace

replayer also features the query subcommand, which is just a helper to read the YAML files with the test configuration.

Testing the other kinds of supported 3D traces doesn’t change much from what’s shown here. Just make sure to have the needed tools installed: RenderDoc, GFXReconstruct, the VK_LAYER_LUNARG_screenshot layer, Wine and DXVK. A good reference for building, installing and configuring these tools is Mesa’s GL and VK test container building scripts.

replayer also accepts several configuration options to tweak its behavior and where to find the actual tracing tools needed for replaying the different types of traces. Make sure to check the replay section in piglit’s configuration example file.

replayer‘s README.md file is also a good read for further information.

piglit

replayer is a test runner in a similar fashion to shader_runner or glslparsertest. What we are now missing is how it integrates so we can do piglit runs which produce piglit-formatted results.

This is done through the replay test profile.

This profile needs a couple of configuration values. The easiest way is to just set the PIGLIT_REPLAY_DESCRIPTION_FILE and PIGLIT_REPLAY_DEVICE_NAME env variables. They are self explanatory, but make sure to check the documentation for this and other configuration options for this profile.

The following example features a run similar to the one done above by invoking replayer directly, but with piglit integration, providing formatted results:

$ PIGLIT_REPLAY_DESCRIPTION_FILE=testing-traces.yml PIGLIT_REPLAY_DEVICE_NAME=gl-vmware-llvmpipe piglit run replay -n replay-example replay-results
[2/2] pass: 2   
Thank you for running Piglit!
Results have been written to replay-results

We can create some summary based on the results:

# piglit summary console replay-results/
trace/gl-vmware-llvmpipe/glxgears/glxgears-2.trace: pass
trace/gl-vmware-llvmpipe/gputest/triangle.trace: pass
summary:
       name: replay-example
       ----  --------------
       pass:              2
       fail:              0
      crash:              0
       skip:              0
    timeout:              0
       warn:              0
 incomplete:              0
 dmesg-warn:              0
 dmesg-fail:              0
    changes:              0
      fixes:              0
regressions:              0
      total:              2
       time:       00:00:00

Creating an HTML summary may also be interesting, especially when finding failures!
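A hedged example, assuming the same results directory as above:

$ piglit summary html --overwrite replay-summary replay-results/
$ xdg-open replay-summary/index.html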

Wishlist

  • Through different backends, replayer supports running apitrace, RenderDoc and GFXReconstruct traces. We may want to support other tracing tools in the future. The dummy backend used for functional testing is a good starting point when writing a new backend.
  • The solution chosen for checking whether we detect a rendering regression depends on having consistent results, as said before. It’d be great if we could add a secondary testing method for when the expected rendered image is variable. Off the top of my head, using exclusion masks could be a quick single-run solution when we know which specific areas in a rendered scenario are the ones fluctuating. For more complex variations, a multi-run based solution seems to be the best option. EzBench has a great statistical approach for this!
  • The current syntax of the YAML test list files implies running the compare subcommand with the default behavior of checking against the last frame of the tested trace. This means first figuring out which call number corresponds to the last frame. It would be great to support providing the call numbers directly in the YAML files to be able to test more than just the last frame and, additionally, cut down the time taken to run the test.
  • The generated HTML summary allows us to see the reference and generated images side by side when a test fails. It’d be great to also have some easy way of checking their differences. Using Rembrandt.js could be a possible solution.

Thanks a lot to the whole Mesa community for helping with the creation of this tool. Alexandros Frantzis, Rohan Garg and Tomeu Vizoso did a lot of the initial development for the in-tree tracie tool while Dylan Baker was very patient while reviewing my patches for the piglit integration.

Finally, thanks to Igalia for allowing me to work on this.


Appendix

In 3D computer graphics we say “traces”, for short, to refer to the files generated by 3D API capturing tools, which store not only the calls to the specific 3D API but also the internal state of the 3D program during the capturing process: shaders, textures, buffers, etc.

Being able to “record” the execution of a 3D program is very useful. Usually, it allows us to replay the execution without needing the original program from which we generated the trace; it also allows in-depth analysis for debugging and performance optimization; it’s a very good solution for sharing with other developers; and, in some cases, it allows us to check how the replay behaves on different GPUs.

In this post, however, I focus on a specific usage: rendering regression testing.

When doing a regression test, we compare a specific metric obtained by replaying the trace with one version of the GFX software stack against the same metric obtained with a different version of the GFX stack. If the value of the metric changes, we have found a regression (or an improvement!).

To make things simpler, we would like to check changes happening just in one of the many elements of the software stack. The most relevant component is the user space driver. In particular, I care about the Mesa drivers and the GNU/Linux stack.

Mainly, there are two kinds of regression testing we can do with a trace: performance or rendering regression testing. In a performance test, the checked metric(s) are usually speed or memory usage. In a rendering test, we compare the rendered output at one (or many) points during the trace replay. This output, a bitmap image, is the metric we compare between two different points of the Mesa driver. If the images differ, we may have found a regression (artifacts, improper colors, etc.), or an enhancement, if the reference image is the one featuring any of these problems.

If you’re on Intel…

Your zink built from git master now has GL 4.3.

Turns out having actual hardware available when doing feature support is important, so I need to do some fixups there for stencil texturing before you can enjoy things.

February 08, 2021

Under contracting work for Valve Corporation, I have been working with Charlie Turner and Andres Gomez from Igalia to develop a CI test farm for driver testing (mostly graphics).

This is now the fifth CI system I have worked with / on, and I am growing tired of not being able to re-use components from the previous systems due to how deeply integrated their components are, and how implementation details permeate from one component to another. Additionally, such designs limit the ability of the system to grow, as updating one component would impact many others, making it difficult or even impossible to do without rewriting the system, or taking it down for multiple hours.

With this new system, I am putting emphasis on designing good interfaces between components in order to create an open source toolbox that CI systems can re-use freely and tailor to their needs, while not painting themselves into a corner.

I aim to blog about all the different components/interfaces we will be making for this test system, but in this article, I would like to start with the basics: proposing design goals, and setting up a machine to be controllable remotely by a test system.

Overall design principles

When designing a test system, it is important to keep in mind that test results need to be:

  • Stable: Re-executing the same test should yield the same result;
  • Reproducible: The test should be runnable on other machines with the same hardware, and yield the same result;

What this means is that we should use the default configuration as much as possible (no weird setup in CI). Additionally, we need to reduce the amount of state in the system to the absolute minimum. This can be achieved in the following way:

  • Power cycle the machine between each test cycle: this helps reset the hardware;
  • Go diskless if at all possible, or treat the disk as a cache that can be flushed when testing fails;
  • Pre-compute as much as possible outside of the test machine, to reduce the impact of the environment of the machine running the test.

Finally, the machine should not restrict which kernel / Operating System can be loaded for testing. An easy way to achieve this is to use netboot (PXE), which is a common BIOS feature allowing diskless machines to boot from the network.

Converting a machine for testing

Now that we have a pretty good idea about the design principles behind preparing a machine for CI, let's try to apply them to an actual machine.

Step 1: Powering up the machine remotely

In order to power up, a machine needs both power and a signal to start. The latter is usually provided by a power button, but additional ways exist (non-exhaustive):

  • Wake on LAN: An Ethernet frame sent to the network adapter triggers the boot;
  • Power on by Mouse/Keyboard: Any activity on the mouse or the keyboard will boot the computer;
  • Power on AC: Providing power to the machine will automatically turn it on;
  • Timer: Boot at a specified time.

An Intel motherboard's list of wakeup events

Unfortunately, none of these triggers can be used to also turn off the machine. The only way to guarantee that a machine will power down and reset its internal state completely is to cut its power supply for a significant amount of time. A safe way to provide/cut power is to use a remotely-switchable Power Distribution Unit (example), or simply to use a smart plug such as Ikea's TRÅDFRI. In any case, make sure you rely on as few services as possible (no cloud!), that you won't exceed the ratings (voltage, power, and cycles), and that you can read back the state to make sure the command was well received. If you opt for the industrial PDUs, make sure to check out PDU Gateway, our REST service to control the machines.

An example of a PDU

Now that we can reliably cut/provide power, we still need to control the boot signal. The difficulty here is that the signal needs to arrive after the machine has received power and initialized enough to handle the event. To make things as easy as possible, configure the BIOS to boot as soon as power is brought to the computer. This is usually called "Boot on AC". If your computer does not support this feature, you may want to try the other ones, or use a microcontroller to press the power button for you when powering up (see the HELP! My machine can't ... Boot on AC section at the end of this article).

Step 2: Net booting

Net booting is quite commonly supported on x86 and ARM bootloaders. On x86 platforms, you can generally find this option in the boot option priorities under the name PXE boot or network boot. You may also need to enable the LAN option ROM, LAN controller, or the UEFI network stack. Reboot, and check that your machine is trying to get an IP!

The next step will be to set up a machine, called Testing Gateway, that will provide a PXE service. This machine should have two network interfaces, one connected to a public network, and one connected to the test machines (through a switch). Setting up this machine will be the subject of an upcoming blog post, but if you are impatient, you may use our valve-infra container.

Step 3: Emulating your screen and keyboard using a serial console

Thanks to the previous steps, we can now boot in any Operating System we want, but we cannot interact with it...

One solution could be to run an SSH server on the Operating System, but until we can connect to it, there is no way to know what is going on. Instead, we could use an ancient technology, a serial port, to drive a console. This solution is often called a "serial console" and is supported by most Operating Systems. Serial ports come in two types:

  • UART: voltage changing between 0 and VCC (TTL signalling), more common in the System-on-Chip (SoC) and microcontrollers world;
  • RS-232: voltage changing between a positive and negative voltage, more common in the desktop and datacenter world.

In any case, I suggest you find a serial-to-USB adapter suited to the computer you are trying to connect:

On Linux, using a serial console is relatively simple: just add the following to the kernel command line to get a console on your screen AND over the /dev/ttyS0 serial port running at 9600 baud:

console=tty0 console=ttyS0,9600 earlyprintk=vga,keep

If your machine does not have a serial port but has USB ports, which is more the norm than the exception in the desktop/laptop world, you may want to connect two RS-232-to-USB adapters together, using a Null modem cable:

Test Machine <-> USB <-> RS-232 <-> NULL modem cable <-> RS-232 <-> USB Hub <-> Gateway

And the kernel command line should use ttyACM0 / ttyUSB0 instead of ttyS0.
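On the gateway side, any serial terminal program will do to check that the console works; picocom is a lightweight option (the device node below is an assumption, check dmesg after plugging in the adapter):

picocom --baud 9600 /dev/ttyUSB0    # Ctrl-a Ctrl-x to exit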

Putting it all together

Start by removing the internal battery if it has one (laptops), and any built-in wireless antenna. Then set the BIOS to boot on AC, and use netboot.

Steps for an AMD motherboard:

Steps for an Intel motherboard:

Finally, connect the test machine to the wider infrastructure in this way:

      Internet /   ------------------------------+
    Public network                               |
                                       +---------+--------+                USB
                                       |                  +-----------------------------------+
                                       |      Testing     | Private network                   |
Main power (240 V) ---------+          |      Gateway     +-----------------+                 |
                            |          +---------+--------+                 |                 |
                            |                    | Serial /                 |                 |
                            |                    | Ethernet                 |                 |
                            |                    |                          |                 |
                +-----------+--------------------+--------------+   +-------+--------+   +----+----+
                |                 Switchable PDU                |   |   RJ45 switch  |   | USB Hub |
                |  Port 0    Port 1        ...          Port N  |   |                |   |         |
                +----+------------------------------------------+   +---+------------+   +-+-------+
                     |                                                  |                  |
                Main |                                                  |                  |
                Power|                                                  |                  |
            +--------|--------+               Ethernet                  |                  |
            |                 +-----------------------------------------+   +----+----+    |
            |  Test Machine 1 |            Serial (RS-232 / TTL)            |  Serial |    |
            |                 +---------------------------------------------+  2 USB  +----+ USB
            +-----------------+                                             +---------+

If you managed to do all this, then congratulations, you are set! If you ran into issues finding the BIOS parameters, brace yourself, and check out the following section!

HELP! My machine can't ...

Net boot

It's annoying, but it is super simple to work around. What you need is to install, on a drive or USB stick, a bootloader that supports PXE. I would recommend you look into SYSLINUX, and Arch Linux's wiki page about it.

Boot on AC

Well, that's a bummer, but that's not the end of the line either if you have some experience dealing with microcontrollers, such as Arduino. Provided you can find the following 4 wires, you should be fine:

  • Ground: The easiest to find;
  • Power rail: 3.3 or 5V depending on what your controller expects;
  • Power LED: A signal that will change when the computer turns on/off;
  • Power Switch: A signal to pull-up/down to start the computer.

On desktop PCs, all these wires can be easily found in the motherboard's manual. For laptops, you'll need to scour the motherboard for these signals using a multimeter. Pay extra attention when looking for the power rail, as it needs to be able to source enough current for your microcontroller. If you are struggling to find one, look for the VCC pins of some of the chips and you'll be set.

Next, you'll just need to figure out what voltage the power LED is at when the machine is ON or OFF. Make sure to check that this voltage is compatible with your microcontroller's input rating and plug it directly into a GPIO of your microcontroller.

Let's then do the same work for the power switch, except this time we also need to check how much current will flow through it when it is activated. To do that, just use a multimeter to check how much current is flowing when you connect the two wires of the power switch. Check that this amount of current can be sourced/sunk by the microcontroller, and then connect it to a GPIO.

Finally, we need to find power for the microcontroller that will be present as soon as we plug the machine to the power. For desktop PCs, you would find this in Pin 9 of the ATX connector. For laptops, you will need to probe the motherboard until you find a pin with a voltage suitable for your microcontroller (5 or 3.3V). However, make sure it is able to source enough current without the voltage dropping below the minimum acceptable VCC of your microcontroller. The best way to check that is to connect this rail to ground through a ~100 Ohm resistor and measure the voltage at the leads of the resistor, and keep trying until you find a suitable place (took me 3 attempts). Connect your microcontroller's VCC and ground to these pads.

The last step will be to edit this Arduino code for your needs, flash it to your microcontroller, and iterate until it works!

Here is a photo summary of all the above steps:

Thanks to Arkadiusz Hiler for giving me a couple of these BluePills, as I did not have any microcontroller that would be small-enough to fit in place of a laptop speaker. If you are a novice, I would suggest you pick an Arduino nano instead.

Oh, and if you want to create a board that would be generic-enough for most motherboards, check out the schematics from my 8 year-old blog post about doing just that!

Boot without a battery

So far, I have never heard of any laptop that would completely refuse to boot when disconnecting the battery. The worst I have heard of was that the laptop would take 30s before starting to boot.

Let's be real though, your time is valuable, and I would suggest you buy/get another laptop. However, if this is the only model you can get, and you really want to test it, then it will be .... juuuuuust fine!

Your state of mind, right now!

There are multiple options here, depending on how far down the stack you want to go:

  1. Rework the Embedded Controller (EC) to drop this delay: Applicable when you have access to the EC's source code, like for chromebooks;
  2. Impersonate the battery: Replacing the battery with a microcontroller that will respond to the EC's commands just like the real battery;
  3. Reuse the battery controller, but replace the battery cells with ... capacitors: The fastest way forward, but it can be really tricky without some knowledge of dealing with Li-ion cells.

I will not explain what needs to be done in option 1, as this is highly-dependent on your platform of choice, but it is by far the safest and the least hacky.

Option 2 is the next best option if the EC is not open or easy to flash. What you will want to do is figure out which are the I2C lines in the battery's connector, and then attach a protocol analyser to them. Boot the machine, then inspect the logs and try to figure out the pattern. Re-implement as much of it as needed in a microcontroller, until the system boots reliably. Should be a good weekend project!

Option 3 is by far the hackiest and requires the most skill, even if it is the fastest to implement IF you have an oscilloscope and some super capacitors with a low discharge rate lying around (who doesn't?). What you'll need to do is open the battery, rip out the battery cells, and replace them with the super capacitors. They will simulate the battery cells well enough for most controllers, but beware that the controller might not like starting with the capacitors discharged, so you may need to force-charge them to [2, 3.6]V (the range between a fully discharged and a fully charged cell); consider using a 3.3V power rail for that. Also beware that the battery cells might be wired in series, so you should not connect their negative pole to ground, as that would short one or more cells! In my case, the controller was happy with seeing an empty battery, and it was fun to see the battery go from 50% to 100% in a second when booting :D

That's all, folks!

February 04, 2021

All In A Day’s Work

I posted briefly yesterday about zink on the nvidia blob, but what was actually necessary to get this working?

In short, there were three problem areas that needed to be worked on:

  • GLX
  • image creation
  • memory allocation

Let’s go over some of these in a bit more depth.

GLX

GLX is the layer which connects GL to X. In the context of this post and zink, this can be thought of as the method by which zink negotiates how it’s going to draw to the screen.

The way this works, at a very high level and with only the barest concern for accuracy and depth of subject matter, is that Mesa grabs a bunch of properties and features from the driver (zink) and then generates a series of configurations that can be used for X window outputs. GLX then compares these configurations against the one requested by the driver. If a match is found, the driver gets to draw. Otherwise, an error like this one I found on StackOverflow is printed:

X Error of failed request:  BadMatch (invalid parameter attributes)
Major opcode of failed request:  128 (GLX)
Minor opcode of failed request:  34 ()
Serial number of failed request:  39
Current serial number in output stream:  40

I didn’t do much digging into this. Here’s a visual representation of the process by which, with the assistance of GLX Hall of Famer Adam Jackson, I resolved the issue:

giphy.gif

In short, he found this problem some time ago and disabled some of the less relevant checks for configuration matching. All I had to do was apply the patch.

#teamwork

Image Creation

A core philosophy of Vulkan is that it’s an explicit API. This means that in the case of images, the exact format, tiling mode, target, usage, etc are specified at the time of creation. This also means that drivers have things they support and things they don’t support.

Historically, zink has just assumed that everything is supported and then either crashed horribly or succeeded because ANV is a pretty cool driver that makes a lot of stuff work when it shouldn’t.

To improve this situation, I’ve added two tiers of validation for image resource creation:

  • first, check that all the VkImageUsageFlags needed for an image are supported by the format being used
  • second, run the exact image creation params by the Vulkan driver before creation to see if it’ll work (and abort() if it doesn’t)

Roughly, these correspond to vkGetPhysicalDeviceFormatProperties (which previously had some, but not comprehensive use for this type of validation) and vkGetPhysicalDeviceImageFormatProperties (which has never been used previously in zink).

A major issue that I found along the way is that there are many cases where zink needs linear image tiling, as this is the only type of layout which enables reading the image back into a buffer. Zink had been assuming that if the format has any (e.g., non-linear) support for the image’s usage, linear is also fine. This is not the case, however, so now there are more checks which enforce a series of hoops to be jumped through when it’s necessary to do readback from images which have no support at all for linear tiling.
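Incidentally, if you want to eyeball what a driver reports without writing any code, vulkaninfo dumps the same per-format data these functions return. A hedged example, as the output layout varies between vulkaninfo versions:

vulkaninfo | grep -A 8 "B8G8R8A8_UNORM"
# compare linearTilingFeatures vs optimalTilingFeatures to see whether
# linear tiling actually supports the usage you need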

Memory Allocation

This is more or less the same as the issue that existed with image creation: zink tried to allocate memory from the wrong bucket (usually HOST_VISIBLE), and the Vulkan driver couldn’t handle that, so everything blew up.

Now this is handled by using device memory for almost all allocations, and everything works well enough.

Closing Thoughts

Nvidia GPUs should work pretty well from today on in my branch; I’m at a ~95% pass rate in piglit tests as of my last run, putting it solidly in second place on the Zink Preferred GPU List behind ANV, where I’m getting upwards of 97% of tests passing.

This didn’t make it into yesterday’s post, but everyone’s favorite benchmark also runs on zink+nvidia now:

nvidia-heaven.png

Here’s the caveat for all of the above: at present, zink on NV is unusably slow. The primary reason for this is that every frame that gets displayed has to be copied multiple times:

  • first, a GPU copy to a staging image that has HOST_VISIBLE (CPU-readable) memory
  • second, a CPU copy to another image which will then be used for displaying the frame

The first step in the process not only introduces more GPU work, it also forces an explicit fence, meaning that this is effectively right back where zink from master/release versions is at in forcing a complete stop of all work prior to each frame being finished.

The second step is also pretty bad, as it delays queuing any further work by the driver until the entire frame has once again been copied.

There’s not really any way to improve the current situation, but that’s only temporary. In the “soon” future, we’ll be landing WSI support, which will greatly improve things for all the drivers that zink supports, but probably mostly this one.

February 03, 2021

So some months have passed since our last update, when we announced that v3dv became Vulkan 1.0 conformant. The main reason for not publishing as many posts is that we saw the 1.0 checkpoint as a good moment to hold off on adding big new features, and focus on improving the codebase (refactoring, clean-ups, etc.) and the already existing features. For the latter, we did a lot of work on performance. That alone would deserve a specific blog post, so in this one I will summarize the other things we did.

New features

Even if we didn’t focus on adding new features, we were still able to add some:

  • The following optional 1.0 features were enabled: logicOp, alphaToOne, independentBlend, drawIndirectFirstInstance, and shaderStorageImageExtendedFormats.
  • Added support for timestamp queries.
  • Added implementations for the VK_KHR_maintenance1, VK_EXT_private_data, and VK_KHR_display extensions.
  • Added support for Wayland WSI.

Here I would like to highlight that we started to get feature contributions from outside the initial core of developers that created the driver. VK_KHR_display was submitted by Steven Houston, and Wayland WSI support was submitted by Ella-0. Thanks a lot for it, really appreciated! We hope this will begin a trend of having more things implemented by the rpi/mesa community as a whole.

Bugfixing and vulkan tools

Even after the driver became conformant, we kept testing it with several demos and applications, and provided fixes. As an example, we got Sascha Willems’ oit (Order Independent Transparency) demo working:

Sascha Willems’ oit demo on the rpi4

Among the applications we were testing, we can highlight renderdoc and gfxreconstruct. The former is a frame-capture based graphics debugger and the latter is a tool that allows capturing and replaying several frames. Both tools are heavily used when debugging and testing vulkan applications. We verified that they work on the rpi4 (fixing some bugs along the way), and also started to use them to help/guide the performance work we are doing.

Fosdem 2021

If you are interested in an overview of the development of the driver over the last year, we are going to present “Overview of the Open Source Vulkan Driver for Raspberry Pi 4” at FOSDEM this weekend (presentation details here).

Previous updates

Just in case you missed any of the updates of the vulkan driver so far:

Vulkan raspberry pi first triangle
Vulkan update now with added source code
v3dv status update 2020-07-01
V3DV Vulkan driver update: VkQuake1-3 now working
v3dv status update 2020-07-31
v3dv status update 2020-09-07
Vulkan update: we’re conformant!

What Is Even Happening Anymore

Thanks once again to my generous sponsors at Valve, as well as a patch from a GLX guru with the very accurate commit log “I hate everything.”, I put in a little time today and came up with this:

nvidia.png

So now zink+nvidia is a thing.

Also: snow. Why is that a thing and can it stop being a thing for the rest of the week so I can stop shoveling?

January 28, 2021

(This post was first published with Collabora on Nov 19, 2020.) (Fixed a broken link on Jan 28, 2021.)

Wayland (the protocol and architecture) is still lacking proper consideration for color management. Wayland also lacks support for high dynamic range (HDR) imagery which has been around in movie and broadcasting industry for a while now (e.g. Netflix HDR UI).

While there are well established tools and workflows for how to do color management on X11, even X11 has not gained support for HDR. There were plans for it (Alex Goins, DeepColor Visuals), but as far as I know nothing really materialized from them.  Right now, the only way to watch HDR content on a HDR monitor in Linux is to use the DRM KMS API directly, in other words, not use any window system, which means not using any desktop environment. Kodi is one of the very few applications that can do this at all.

This is a story about starting the efforts to fix the situation on Wayland.

History and People

Color management for Wayland has been talked about on and off for many years by dozens of people. To me it was obvious from the start that color management architecture on Wayland must be fundamentally different from X11. I thought the display server must be part of the color management stack instead of an untrusted, unknown entity that must be bypassed and overridden by applications that fight each other for who gets to configure the display. This opinion was wildly controversial and it took a long time to get my point across, but over the years some color management experts started to open up to new ideas and other people joined in the opinion as well.  Whether these new ideas are actually better than the ways of old remains to be seen, though. I think the promise of getting everything and more to work better is far too great to not try it out.

The discussions started several times over the years, but they always dried out mostly without any tangible progress. Color management is a wide, deep and difficult topic, and the required skills, knowledge, interest, and available time did not come together until fairly recently. People did write draft protocol extensions, but I would claim that it was not really until Sebastian Wick started building on top of them that things started moving forward. But one person cannot push such a huge effort alone even for the simple reason that there must be at least one reviewer before anything can be merged upstream. I was very lucky that since summer 2020 I have been able to work on Wayland color management and HDR for improving ChromeOS, letting me support Sebastian's efforts on a daily basis. Vitaly Prosyak joined the effort this year as well, researching how to combine the two seemingly different worlds of ICC and HDR, and how tone-mapping could be implemented.

I must also note the past efforts of Harish Krupo, who submitted a major Weston merge request, but unfortunately at the time reviewers in Weston upstream were not much available. Even before that, there were experiments by Ville Syrjälä. All these are now mostly superseded by the on-going work.

Currently the active people around the topic are me (Collabora), Vitaly Prosyak (AMD), and Naveen Kumar (Intel). Sebastian Wick (unaffiliated) is still around as well. None of us is a color management or HDR expert by trade, so we are all learning things as we go.

Design

The foundation for the color management protocol are ICC profile files for describing both output and content color spaces. The aim is for ICCv4, also allowing ICCv2, as these are known and supported well in general. Adding iccMAX support or anything else will be possible any time in the future.

As color management is all about color spaces and gamuts, and high dynamic range (HDR) is also very much about color spaces and gamuts plus extended luminance range, Sebastian and I decided that Wayland color management extension should cater for both from the beginning. Combining traditional color management and HDR is a fairly new thing as far as I know, and I'm not sure we have much prior art to base upon, so this is an interesting research journey as well. There is a lot of prior art on HDR and color management separately, but they tend to have fundamental differences that makes the combination not obvious.

To help us keep focused and explain to the community about what we actually intend with Wayland color management and HDR support, I wrote the section "Wayland Color Management and HDR Design Goals" in color.rst (draft). I very much recommend you to read it so that you get a picture what we (or I, at least) want to aim for.

Elle Stone explains in their article how color management should work on X11. As I wanted to avoid repeating the massive email threads that were had on the wayland-devel mailing list, I wrote the section "Color Pipeline Overview" in color.rst (draft) more or less as a response to her article, trying to explain in what ways Wayland will be different from X11. I think that understanding that section is paramount before anyone makes any comment on our efforts with the Wayland protocol extension.

HDR brings even more reasons to put color space conversions in the display server than just the idea that all applications should be color managed if not explicitly then implicitly.  Most of the desktop applications (well, literally all right now) are using Standard Dynamic Range (SDR).  SDR is a fuzzy concept referring to all traditional, non-HDR image content.  Therefore, your desktop is usually 100% SDR. You run your fancy new HDR monitor in SDR mode, which means it looks just like any old monitor with nothing fancy.  What if you want to watch a HDR video? The monitor won't display HDR in SDR mode.  If you simply switch the monitor to HDR mode, you will be blinded by all the over-bright SDR applications.  Switching monitor modes may also cause flicker and take a bit of time. That would be a pretty bad user experience, right?

A solution is to run your monitor in HDR mode all the time, and have the window system compositor convert all SDR application windows appropriately to the HDR luminance, so that they look normal in spite of the HDR mode. There will always be applications that will never support HDR at all, so the compositor doing the conversion is practically the only way.

For the protocol, we are currently exploring the use of relative luminance.  The reason is that people look at monitors in wildly varying viewing environments - under standard office lighting, for example. The environment and personal preferences affect what monitor brightness you want. Monitors themselves can also be wildly different in their capabilities. Most prior art on HDR uses absolute luminance, but absolute luminance has the problem that it assumes a specific viewing environment, usually a dark room, similar to a movie theatre.  If a display server showed a movie at the absolute luminance it was mastered for, in most cases it would be far too dark to see. Whether using relative luminance at the protocol level turns out to be a good idea or not, we shall see.
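
As a toy illustration of what "relative" could mean in practice (this is not part of the protocol proposal, just a sketch with made-up parameters), composing an SDR window into an HDR output boils down to choosing what luminance SDR white should land at relative to the output:

/* Illustrative sketch only: scale a linear-light SDR value into an HDR
 * composition space. 'sdr_white' is a compositor/user policy choice
 * (hypothetical here), not something the protocol dictates. */
static float
sdr_to_hdr_relative(float sdr_linear, /* linear-light SDR value, 0.0 - 1.0 */
                    float sdr_white,  /* target luminance for SDR white, e.g. 300 cd/m² */
                    float hdr_peak)   /* the output's peak luminance, e.g. 1000 cd/m² */
{
   /* SDR white ends up at sdr_white; the result is expressed relative to the
    * HDR peak, so the viewing environment can be accounted for by tuning
    * sdr_white instead of hard-coding an absolute mastering luminance. */
   return sdr_linear * (sdr_white / hdr_peak);
}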

Development

The Wayland color management and HDR protocol extension proposal is known as wayland/wayland-protocols!14 (MR14). Because it is a very long running merge request (the bar for landing a new protocol into wayland-protocols is high) and there are several people working on it, we started using sub-merge-requests to modify the proposal. You can find the sub-MRs in Sebastian's fork. If you have a change to propose, that is how to do it.

Obviously using sub-MRs also splits the review discussions into multiple places, but in this case I think it is a good thing, because the discussion threads in Gitlab are already massive.

There are several big and small open questions we haven't had the time to tackle yet even among the active group; questions to which I feel we should have some tentative answers before asking for wider community comments. There is also no set schedule, so don't hold your breath. This work is likely to take months still before there is a complete tentative protocol, and probably years until these features are available in your favourite Wayland desktop environments.

If you are an expert on the topics of color management or HDR displays and content, you are warmly welcome to join the development.

If you are an interested developer or an end user looking to try out things, sorry, there is nothing really for you yet.

January 26, 2021

Overhead Migration

The goal in this post is to migrate a truckload of code I wrote to handle sampler updating out of zink and into Gallium, thereby creating several days’ worth of rebase work for myself but also removing a costly codepath from the driver thread.

The first step in getting sampler creation to work right in zink is getting Gallium to create samplers with the correct filters in accordance with Chapter 42 of the Vulkan Spec:

VK_FORMAT_FEATURE_SAMPLED_IMAGE_FILTER_LINEAR_BIT specifies that if VK_FORMAT_FEATURE_SAMPLED_IMAGE_BIT is also set, an image view can be used with a sampler that has either of magFilter or minFilter set to VK_FILTER_LINEAR, or mipmapMode set to VK_SAMPLER_MIPMAP_MODE_LINEAR. If VK_FORMAT_FEATURE_BLIT_SRC_BIT is also set, an image can be used as the srcImage to vkCmdBlitImage2KHR and vkCmdBlitImage with a filter of VK_FILTER_LINEAR. This bit must only be exposed for formats that also support the VK_FORMAT_FEATURE_SAMPLED_IMAGE_BIT or VK_FORMAT_FEATURE_BLIT_SRC_BIT.

If the format being queried is a depth/stencil format, this bit only specifies that the depth aspect (not the stencil aspect) of an image of this format supports linear filtering, and that linear filtering of the depth aspect is supported whether depth compare is enabled in the sampler or not. If this bit is not present, linear filtering with depth compare disabled is unsupported and linear filtering with depth compare enabled is supported, but may compute the filtered value in an implementation-dependent manner which differs from the normal rules of linear filtering. The resulting value must be in the range [0,1] and should be proportional to, or a weighted average of, the number of comparison passes or failures.

Here’s the (primary) function that I’ll be modifying to get everything working:

void
st_convert_sampler(const struct st_context *st,
                   const struct gl_texture_object *texobj,
                   const struct gl_sampler_object *msamp,
                   float tex_unit_lod_bias,
                   struct pipe_sampler_state *sampler)
{
   memset(sampler, 0, sizeof(*sampler));
   sampler->wrap_s = gl_wrap_xlate(msamp->Attrib.WrapS);
   sampler->wrap_t = gl_wrap_xlate(msamp->Attrib.WrapT);
   sampler->wrap_r = gl_wrap_xlate(msamp->Attrib.WrapR);

   if (texobj->_IsIntegerFormat && st->ctx->Const.ForceIntegerTexNearest) {
      sampler->min_img_filter = gl_filter_to_img_filter(GL_NEAREST);
      sampler->mag_img_filter = gl_filter_to_img_filter(GL_NEAREST);
   } else {
      sampler->min_img_filter = gl_filter_to_img_filter(msamp->Attrib.MinFilter);
      sampler->mag_img_filter = gl_filter_to_img_filter(msamp->Attrib.MagFilter);
   }
   sampler->min_mip_filter = gl_filter_to_mip_filter(msamp->Attrib.MinFilter);

   if (texobj->Target != GL_TEXTURE_RECTANGLE_ARB)
      sampler->normalized_coords = 1;

   sampler->lod_bias = msamp->Attrib.LodBias + tex_unit_lod_bias;
   /* Reduce the number of states by allowing only the values that AMD GCN
    * can represent. Apps use lod_bias for smooth transitions to bigger mipmap
    * levels.
    */
   sampler->lod_bias = CLAMP(sampler->lod_bias, -16, 16);
   sampler->lod_bias = roundf(sampler->lod_bias * 256) / 256;

   sampler->min_lod = MAX2(msamp->Attrib.MinLod, 0.0f);
   sampler->max_lod = msamp->Attrib.MaxLod;
   if (sampler->max_lod < sampler->min_lod) {
      /* The GL spec doesn't seem to specify what to do in this case.
       * Swap the values.
       */
      float tmp = sampler->max_lod;
      sampler->max_lod = sampler->min_lod;
      sampler->min_lod = tmp;
      assert(sampler->min_lod <= sampler->max_lod);
   }

   /* Check that only wrap modes using the border color have the first bit
    * set.
    */
   STATIC_ASSERT(PIPE_TEX_WRAP_CLAMP & 0x1);
   STATIC_ASSERT(PIPE_TEX_WRAP_CLAMP_TO_BORDER & 0x1);
   STATIC_ASSERT(PIPE_TEX_WRAP_MIRROR_CLAMP & 0x1);
   STATIC_ASSERT(PIPE_TEX_WRAP_MIRROR_CLAMP_TO_BORDER & 0x1);
   STATIC_ASSERT(((PIPE_TEX_WRAP_REPEAT |
                   PIPE_TEX_WRAP_CLAMP_TO_EDGE |
                   PIPE_TEX_WRAP_MIRROR_REPEAT |
                   PIPE_TEX_WRAP_MIRROR_CLAMP_TO_EDGE) & 0x1) == 0);

   /* For non-black borders... */
   if (/* This is true if wrap modes are using the border color: */
       (sampler->wrap_s | sampler->wrap_t | sampler->wrap_r) & 0x1 &&
       (msamp->Attrib.BorderColor.ui[0] ||
        msamp->Attrib.BorderColor.ui[1] ||
        msamp->Attrib.BorderColor.ui[2] ||
        msamp->Attrib.BorderColor.ui[3])) {
      const GLboolean is_integer = texobj->_IsIntegerFormat;
      GLenum texBaseFormat = _mesa_base_tex_image(texobj)->_BaseFormat;

      if (texobj->Attrib.StencilSampling)
         texBaseFormat = GL_STENCIL_INDEX;

      if (st->apply_texture_swizzle_to_border_color) {
         const struct st_texture_object *stobj = st_texture_object_const(texobj);
         /* XXX: clean that up to not use the sampler view at all */
         const struct st_sampler_view *sv = st_texture_get_current_sampler_view(st, stobj);

         if (sv) {
            struct pipe_sampler_view *view = sv->view;
            union pipe_color_union tmp;
            const unsigned char swz[4] =
            {
               view->swizzle_r,
               view->swizzle_g,
               view->swizzle_b,
               view->swizzle_a,
            };

            st_translate_color(&msamp->Attrib.BorderColor, &tmp,
                               texBaseFormat, is_integer);

            util_format_apply_color_swizzle(&sampler->border_color,
                                            &tmp, swz, is_integer);
         } else {
            st_translate_color(&msamp->Attrib.BorderColor,
                               &sampler->border_color,
                               texBaseFormat, is_integer);
         }
      } else {
         st_translate_color(&msamp->Attrib.BorderColor,
                            &sampler->border_color,
                            texBaseFormat, is_integer);
      }
   }

   sampler->max_anisotropy = (msamp->Attrib.MaxAnisotropy == 1.0 ?
                              0 : (GLuint) msamp->Attrib.MaxAnisotropy);

   /* If sampling a depth texture and using shadow comparison */
   if (msamp->Attrib.CompareMode == GL_COMPARE_R_TO_TEXTURE) {
      GLenum texBaseFormat = _mesa_base_tex_image(texobj)->_BaseFormat;

      if (texBaseFormat == GL_DEPTH_COMPONENT ||
          (texBaseFormat == GL_DEPTH_STENCIL && !texobj->Attrib.StencilSampling)) {
         sampler->compare_mode = PIPE_TEX_COMPARE_R_TO_TEXTURE;
         sampler->compare_func = st_compare_func_to_pipe(msamp->Attrib.CompareFunc);
      }
   }

   /* Only set the seamless cube map texture parameter because the per-context
    * enable should be ignored and treated as disabled when using texture
    * handles, as specified by ARB_bindless_texture.
    */
   sampler->seamless_cube_map = msamp->Attrib.CubeMapSeamless;
}

texobj here is the texture being sampled, msamp is the GL sampler object, and sampler is the template for the driver-backed sampler object that will be created with the pipe_context::create_sampler_state hook. The first half of the function deals with setting up filtering and wrap modes. The second half is mostly for border color pre-swizzling (i.e., what the Vulkan spec claims is the way that drivers should be handling border colors).
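
For context, once this template is filled in, the state tracker hands it off to the driver. A minimal, hypothetical sketch of that hand-off (the helper name here is made up; the real path goes through the CSO cache shown further down) looks roughly like:

/* hypothetical helper: turn the converted template into a driver sampler
 * object and bind it to the first fragment-shader sampler slot */
static void
bind_converted_sampler(struct pipe_context *pipe,
                       const struct pipe_sampler_state *templ)
{
   void *cso = pipe->create_sampler_state(pipe, templ);
   pipe->bind_sampler_states(pipe, PIPE_SHADER_FRAGMENT, 0, 1, &cso);
}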

First: LINEAR Availability

First, if a driver doesn’t provide the format feature for linear filtering, linear filtering can’t be used.

I added a struct pipe_screen hook for this:

/**
 * Check if the given pipe_format and resource is supported for linear filtering
 * as a sampler view.
 * \param format The format to check.
 * \param pres The resource to check.
 */
bool (*is_linear_filtering_supported)( struct pipe_screen *,
                                       enum pipe_format format,
                                       struct pipe_resource *pres );
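
On the Vulkan driver side, this hook more or less boils down to checking the format feature bit from the spec text above. Here’s a hedged sketch of what such an implementation might look like; zink_screen(), zink_get_format(), and the pdev member are stand-ins for the driver’s real helpers rather than the actual zink code:

static bool
zink_is_linear_filtering_supported(struct pipe_screen *pscreen,
                                   enum pipe_format format,
                                   struct pipe_resource *pres)
{
   struct zink_screen *screen = zink_screen(pscreen);
   VkFormat vkformat = zink_get_format(screen, format);

   if (vkformat == VK_FORMAT_UNDEFINED)
      return false;

   /* pres could be used to special-case linearly-tiled resources; for the
    * sketch, assume sampled images use optimal tiling */
   VkFormatProperties props;
   vkGetPhysicalDeviceFormatProperties(screen->pdev, vkformat, &props);
   return props.optimalTilingFeatures &
          VK_FORMAT_FEATURE_SAMPLED_IMAGE_FILTER_LINEAR_BIT;
}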

This gets called in st_convert_sampler(), which is the path that all user-managed samplers go through:

void
st_convert_sampler(const struct st_context *st,
                   const struct gl_texture_object *texobj,
                   const struct gl_sampler_object *msamp,
                   float tex_unit_lod_bias,
                   struct pipe_sampler_state *sampler)
{
   const struct st_texture_object *stobj = NULL;
   const struct st_sampler_view *sv = NULL;

   memset(sampler, 0, sizeof(*sampler));
   sampler->wrap_s = gl_wrap_xlate(msamp->Attrib.WrapS);
   sampler->wrap_t = gl_wrap_xlate(msamp->Attrib.WrapT);
   sampler->wrap_r = gl_wrap_xlate(msamp->Attrib.WrapR);

   bool is_linear_filtering_supported = true;

   if (st->pipe->screen->is_linear_filtering_supported) {
      enum pipe_format fmt = PIPE_FORMAT_NONE;
      stobj = st_texture_object_const(texobj);
      if (stobj->surface_based)
         fmt = stobj->surface_format;
      else {
         sv = st_texture_get_current_sampler_view(st, stobj);
         if (sv)
            fmt = sv->view->format;
         else
            fmt = stobj->pt->format;
      }
      assert(fmt != PIPE_FORMAT_NONE);
      is_linear_filtering_supported =
         st->pipe->screen->is_linear_filtering_supported(st->pipe->screen, fmt, stobj->pt);
   }

   if (!is_linear_filtering_supported ||
       (texobj->_IsIntegerFormat && st->ctx->Const.ForceIntegerTexNearest)) {
      sampler->min_img_filter = gl_filter_to_img_filter(GL_NEAREST);
      sampler->mag_img_filter = gl_filter_to_img_filter(GL_NEAREST);
   } else {
      sampler->min_img_filter = gl_filter_to_img_filter(msamp->Attrib.MinFilter);
      sampler->mag_img_filter = gl_filter_to_img_filter(msamp->Attrib.MagFilter);
   }

   if (is_linear_filtering_supported)
      sampler->min_mip_filter = gl_filter_to_mip_filter(msamp->Attrib.MinFilter);
   else
      sampler->min_mip_filter = gl_filter_to_img_filter(GL_NEAREST);

The code assumes by default that linear filtering is available for all formats, then uses the new is_linear_filtering_supported hook, if the driver provides one, to override that assumption. The filtering modes are then set based on the (Vulkan) driver’s capabilities.

Easy.

Second: LINEAR Depth Filtering

The spec allows linear filtering unconditionally for formats containing a depth aspect so long as depth compare is enabled:

If this bit is not present, linear filtering with depth compare disabled is unsupported and linear filtering with depth compare enabled is supported

For this, I added PIPE_CAP_LINEAR_DEPTH_FILTERING and some interaction with the pipe_screen::is_linear_filtering_supported hook:

...
   bool is_linear_filtering_supported = true;
   bool has_depth = false;

   if (st->pipe->screen->is_linear_filtering_supported) {
      enum pipe_format fmt = PIPE_FORMAT_NONE;
      stobj = st_texture_object_const(texobj);
      if (stobj->surface_based)
         fmt = stobj->surface_format;
      else {
         sv = st_texture_get_current_sampler_view(st, stobj);
         if (sv)
            fmt = sv->view->format;
         else
            fmt = stobj->pt->format;
      }
      assert(fmt != PIPE_FORMAT_NONE);
      is_linear_filtering_supported =
         st->pipe->screen->is_linear_filtering_supported(st->pipe->screen, fmt, stobj->pt);
      if (st->linear_depth_filtering_semantics)
         has_depth = util_format_has_depth(util_format_description(fmt));
   }

   /* PIPE_CAP_LINEAR_DEPTH_FILTERING */
   if (has_depth &&
       !is_linear_filtering_supported) {
      /* this conditional has the same result as the one after it,
       * but its complexity makes splitting it more readable
       */
      sampler->min_img_filter = gl_filter_to_img_filter(GL_NEAREST);
      sampler->mag_img_filter = gl_filter_to_img_filter(GL_NEAREST);
   } else if ((!is_linear_filtering_supported && !has_depth) ||
       (texobj->_IsIntegerFormat && st->ctx->Const.ForceIntegerTexNearest)) {
      sampler->min_img_filter = gl_filter_to_img_filter(GL_NEAREST);
      sampler->mag_img_filter = gl_filter_to_img_filter(GL_NEAREST);
   } else {
      sampler->min_img_filter = gl_filter_to_img_filter(msamp->Attrib.MinFilter);
      sampler->mag_img_filter = gl_filter_to_img_filter(msamp->Attrib.MagFilter);
   }

   if (is_linear_filtering_supported || has_depth)
      sampler->min_mip_filter = gl_filter_to_mip_filter(msamp->Attrib.MinFilter);
   else
      sampler->min_mip_filter = gl_filter_to_img_filter(GL_NEAREST);
...
   /* PIPE_CAP_LINEAR_DEPTH_FILTERING */
   if (sampler->compare_mode == PIPE_TEX_COMPARE_NONE &&
       has_depth && !is_linear_filtering_supported &&
       (sampler->mag_img_filter == PIPE_TEX_FILTER_LINEAR ||
        sampler->min_img_filter == PIPE_TEX_FILTER_LINEAR ||
        sampler->min_mip_filter == PIPE_TEX_FILTER_LINEAR)) {
      sampler->compare_mode = PIPE_TEX_COMPARE_R_TO_TEXTURE;
      sampler->compare_func = PIPE_FUNC_ALWAYS;
   }

If the format has a depth component, allow linear filtering and, if necessary, enable depth compare.
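
Advertising the cap itself is equally trivial. A hedged, heavily abbreviated sketch of how it might appear in a Vulkan driver’s get_param hook (not the actual zink code):

static int
zink_get_param(struct pipe_screen *pscreen, enum pipe_cap param)
{
   switch (param) {
   case PIPE_CAP_LINEAR_DEPTH_FILTERING:
      /* Vulkan always permits linear filtering of the depth aspect when
       * depth compare is enabled, per the spec language quoted above */
      return 1;
   /* ...the rest of the driver's caps... */
   default:
      return 0; /* abbreviated for the sketch */
   }
}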

Nothing too complex here either.

Third: Handle Int-based Border Colors

Now it’s time to start getting grimy. Vulkan’s custom border color extension is, at best, functional. One component of the spec for this is that border colors can be specified as either integer or float values, and the distinction is actually significant, so ideally the driver should just pass through the same value, with the same type, that the user specified.

I made PIPE_CAP_NEED_BORDER_COLOR_TYPE for this. When specified, the border_color_is_integer member that I added to struct pipe_sampler_state will be treated as a disambiguating value for sampler states, and thus zink can use it to set the right type of border color values. It affects this glorious micro-optimization in the CSO state cache:

void
cso_single_sampler(struct cso_context *ctx, enum pipe_shader_type shader_stage,
                   unsigned idx, const struct pipe_sampler_state *templ)
{
   if (templ) {
      unsigned key_size = ctx->needs_sampler_border_color_type ?
                             sizeof(struct pipe_sampler_state) :
                             offsetof(struct pipe_sampler_state, border_color_is_integer);
      unsigned hash_key = cso_construct_key((void*)templ, key_size);

Finally: GL_CLAMP

Get ready to roll around in some mud.

I added PIPE_CAP_EMULATE_GL_CLAMP_PLZ (named by Kayden, not up for discussion) to handle this awfulness. Functionally, it consists of 3 components:

  • handling in Mesa core to flag shader and sampler state updates any time a sampler sets or unsets GL_CLAMP (or GL_MIRROR_CLAMP)
  • handling in Gallium to force the sampler to either CLAMP_TO_BORDER or CLAMP_TO_EDGE depending on linear filtering availability for a given resource
  • shader rewrites in Gallium to run nir_lower_tex with bitfield info set for the GL_CLAMP samplers (henceforth gl_clamplers) that need to be rewritten

Since I’m already going deep into this function, let’s go once more into st_convert_sampler():

...
   /* PIPE_CAP_LINEAR_DEPTH_FILTERING */
   if (has_depth &&
       !is_linear_filtering_supported &&
       (!st->emulate_gl_clamp || (
       sampler->wrap_s != PIPE_TEX_WRAP_CLAMP &&
       sampler->wrap_s != PIPE_TEX_WRAP_MIRROR_CLAMP &&
       sampler->wrap_t != PIPE_TEX_WRAP_CLAMP &&
       sampler->wrap_t != PIPE_TEX_WRAP_MIRROR_CLAMP &&
       sampler->wrap_r != PIPE_TEX_WRAP_CLAMP &&
       sampler->wrap_r != PIPE_TEX_WRAP_MIRROR_CLAMP))) {
      /* this conditional has the same result as the one after it,
       * but its complexity makes splitting it more readable
       */
      sampler->min_img_filter = gl_filter_to_img_filter(GL_NEAREST);
      sampler->mag_img_filter = gl_filter_to_img_filter(GL_NEAREST);
   } else if ((!is_linear_filtering_supported && !has_depth) ||
       (texobj->_IsIntegerFormat && st->ctx->Const.ForceIntegerTexNearest)) {
      sampler->min_img_filter = gl_filter_to_img_filter(GL_NEAREST);
      sampler->mag_img_filter = gl_filter_to_img_filter(GL_NEAREST);
   } else {
      sampler->min_img_filter = min_img_filter;
      sampler->mag_img_filter = gl_filter_to_img_filter(msamp->Attrib.MagFilter);
   }
...
   if (st->emulate_gl_clamp) {
      bool clamp_to_border = (is_linear_filtering_supported || has_depth) &&
                             min_img_filter != PIPE_TEX_FILTER_NEAREST;
      if (sampler->wrap_s == PIPE_TEX_WRAP_CLAMP)
         sampler->wrap_s = clamp_to_border ? PIPE_TEX_WRAP_CLAMP_TO_BORDER :
                                             PIPE_TEX_WRAP_CLAMP_TO_EDGE;
      else if (sampler->wrap_s == PIPE_TEX_WRAP_MIRROR_CLAMP)
         sampler->wrap_s = clamp_to_border ? PIPE_TEX_WRAP_MIRROR_CLAMP_TO_BORDER :
                                             PIPE_TEX_WRAP_MIRROR_CLAMP_TO_EDGE;

      if (sampler->wrap_t == PIPE_TEX_WRAP_CLAMP)
         sampler->wrap_t = clamp_to_border ? PIPE_TEX_WRAP_CLAMP_TO_BORDER :
                                             PIPE_TEX_WRAP_CLAMP_TO_EDGE;
      else if (sampler->wrap_t == PIPE_TEX_WRAP_MIRROR_CLAMP)
         sampler->wrap_t = clamp_to_border ? PIPE_TEX_WRAP_MIRROR_CLAMP_TO_BORDER :
                                             PIPE_TEX_WRAP_MIRROR_CLAMP_TO_EDGE;

      if (sampler->wrap_r == PIPE_TEX_WRAP_CLAMP)
         sampler->wrap_r = clamp_to_border ? PIPE_TEX_WRAP_CLAMP_TO_BORDER :
                                             PIPE_TEX_WRAP_CLAMP_TO_EDGE;
      else if (sampler->wrap_r == PIPE_TEX_WRAP_MIRROR_CLAMP)
         sampler->wrap_r = clamp_to_border ? PIPE_TEX_WRAP_MIRROR_CLAMP_TO_BORDER :
                                             PIPE_TEX_WRAP_MIRROR_CLAMP_TO_EDGE;
   }
...

The depth component codepath is a little trickier, so I’ve split it off for readability even though it could be collapsed into the conditional that follows. In short, this only allows linear filtering for unsupported depth formats when GL_CLAMP is also in use.

With the filtering and wrap modes set, the next step here is to adjust the wrap modes based on whether linear filtering is available and the min filter mode. If linear is available and the min filter is linear, GL_CLAMP becomes CLAMP_TO_BORDER, otherwise it’s CLAMP_TO_EDGE. In conjunction with the NIR pass, this ends up replicating the expected behavior.

And to get that info to the NIR pass, more awfulness is required:

static inline GLboolean
is_wrap_gl_clamp(GLint param)
{
   return param == GL_CLAMP || param == GL_MIRROR_CLAMP_EXT;
}

static void
update_gl_clamplers(struct st_context *st, struct gl_program *prog, uint32_t *gl_clamplers)
{
   if (!st->emulate_gl_clamp)
      return;

   gl_clamplers[0] = gl_clamplers[1] = gl_clamplers[2] = 0;
   GLbitfield samplers_used = prog->SamplersUsed;
   unsigned unit;
   /* same as st_atom_sampler.c */
   for (unit = 0; samplers_used; unit++, samplers_used >>= 1) {
      unsigned tex_unit = prog->SamplerUnits[unit];
      if (samplers_used & 1 &&
          (st->ctx->Texture.Unit[tex_unit]._Current->Target != GL_TEXTURE_BUFFER ||
           st->texture_buffer_sampler)) {
         const struct gl_texture_object *texobj;
         struct gl_context *ctx = st->ctx;
         const struct gl_sampler_object *msamp;

         texobj = ctx->Texture.Unit[tex_unit]._Current;
         assert(texobj);

         msamp = _mesa_get_samplerobj(ctx, tex_unit);
         if (is_wrap_gl_clamp(msamp->Attrib.WrapS))
            gl_clamplers[0] |= BITFIELD64_BIT(unit);
         if (is_wrap_gl_clamp(msamp->Attrib.WrapT))
            gl_clamplers[1] |= BITFIELD64_BIT(unit);
         if (is_wrap_gl_clamp(msamp->Attrib.WrapR))
            gl_clamplers[2] |= BITFIELD64_BIT(unit);
      }
   }
}

This function iterates over all the samplers used by a given shader (struct gl_program), checking the wrap modes for GL_CLAMP and updating the bitfields which correspond to struct nir_lower_tex_options::saturate_{s,t,r} when one is found. Each Gallium shader key is updated to include these values, though I’ve helpfully reduced the key size used for comparisons both for drivers which don’t set the pipe cap and for those which do but have yet to see a gl_clampler.
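
To tie it together, here’s a hedged sketch of how those bitfields could be fed into nir_lower_tex when building a shader variant; the helper name and the exact call site are assumptions for illustration, not the actual Gallium code:

/* hypothetical helper: lower GL_CLAMP by saturating the texture coordinates
 * of the flagged samplers in the shader itself */
static void
lower_gl_clamplers(nir_shader *nir, const uint32_t gl_clamplers[3])
{
   struct nir_lower_tex_options opts = {0};

   /* one mask per wrap axis, matching update_gl_clamplers() above */
   opts.saturate_s = gl_clamplers[0];
   opts.saturate_t = gl_clamplers[1];
   opts.saturate_r = gl_clamplers[2];

   NIR_PASS_V(nir, nir_lower_tex, &opts);
}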

The Result

By setting all these pipe caps and adding a trivial function, zink no longer needs to internally create and track sampler variants based on the above factors in order to support various sampler modes. Additionally, all Gallium-based drivers which emulate GL_CLAMP (there are several) can switch over to this and delete a bunch of code.

Hooray.

January 22, 2021

Less than a month ago, I began investigating the Apple M1 GPU in hopes of developing a free and open-source driver. This week, I’ve reached a second milestone: drawing a triangle with my own open-source code. The vertex and fragment shaders are handwritten in machine code, and I interface with the hardware via the IOKit kernel driver in an identical fashion to the system’s Metal userspace driver.

A triangle rendered on the M1 with open-source code

The bulk of the new code is responsible for constructing the various command buffers and descriptors resident in shared memory, used to control the GPU’s behaviour. Any state accessible from Metal corresponds to bits in these buffers, so understanding them will be the next major task. So far, I have focused less on the content and more on the connections between them. In particular, the structures contain pointers to one another, sometimes nested multiple layers deep. The bring-up process for the project’s triangle provides a bird’s eye view of how all these disparate pieces in memory fit together.

As an example, the application-provided vertex data are in their own buffers. An internal table in yet another buffer points to each of these vertex buffers. That internal table is passed directly as input to the vertex shader, specified in another buffer. That description of the vertex shader, including the address of the code in executable memory, is pointed to by another buffer, itself referenced from the main command buffer, which is referenced by a handle in the IOKit call to submit a command buffer. Whew!

In other words, the demo code is not yet intended to demonstrate an understanding of the fine-grained details of the command buffers, but rather to demonstrate there is “nothing missing”. Since GPU virtual addresses change from run to run, the demo validates that all of the pointers required are identified and can be relocated freely in memory using our own (trivial) allocator. As there is a bit of “magic” around memory and command buffer allocation on macOS, having this code working at an early stage gives peace of mind going forward.

I employed a piecemeal bring-up process. Since my IOKit wrapper exists in the same address space as the Metal application, the wrapper may modify command buffers just before submission to the GPU. As an early “hello world”, I identified the encoding of the render target’s clear colour in memory, and demonstrated that I could modify the colour as I pleased. Similarly, while learning about the instruction set to bring up the disassembler, I replaced shaders with handwritten equivalents and confirmed I could execute code on the GPU, provided I wrote out the machine code. But it’s not necessary to stop at these “leaf nodes” of the system; after modifying the shader code, I tried uploading shader code to a different part of the executable buffer while modifying the command buffer’s pointer to the code to compensate. After that, I could try uploading the commands for the shader myself. Iterating in this fashion, I could build up every structure needed while testing each in isolation.

Despite curveballs, this procedure worked out far better than the alternative of jumping straight to constructing buffers, perhaps via a “replay”. I had used that alternate technique to bring up Mali a few years back, but it comes with the substantial drawback of fiendishly difficult debugging. If there were a single typo in five hundred lines of magic numbers, there would be no feedback, except an error from the GPU. However, by working one bit at a time, errors could be pinpointed and fixed immediately, providing a faster turnaround time and a more pleasant bring-up experience.

But curveballs there were! My momentary elation at modifying the clear colours disappeared when I attempted to allocate a buffer for the colours. Despite encoding the same bits as before, the GPU would fail to clear correctly. Wondering if there was something wrong with the way I modified the pointer, I tried placing the colour in an unused part of memory that was already created by the Metal driver – that worked. The contents were the same, the way I modified the pointers was the same, but somehow the GPU didn’t like my memory allocation. I wondered if there was something wrong with the way I allocated memory, but the arguments I used to invoke the memory allocation IOKit call were bit-identical to those used by Metal, as confirmed by wrap. My last-ditch effort was checking if GPU memory had to be mapped explicitly via some side channel, like the mmap system call. IOKit does feature a device-independent memory map call, but no amount of fortified tracing found any evidence of side-channel system call mappings.

Trouble was brewing. Feeling delirious after so much time chasing an “impossible” bug, I wondered if there wasn’t something “magic” in the system call… but rather in the GPU memory itself. It was a silly theory since it produces a serious chicken-and-egg problem if true: if a GPU allocation has to be blessed by another GPU allocation, who blesses the first allocation?

But feeling silly and perhaps desperate, I pressed forward to test the theory by inserting a memory allocation call in the middle of the application flow, such that every subsequent allocation would be at a different address. Dumping GPU memory before and after this change and checking for differences revealed my first horror: an auxiliary buffer in GPU memory tracked all of the required allocations. In particular, I noticed values in this buffer increasing by one at a predictable offset (every 0x40 bytes), suggesting that the buffer contained an array of handles to allocations. Indeed, these values corresponded exactly to handles returned from the kernel on GPU memory allocation calls.

Putting aside the obvious problems with this theory, I tested it anyway, modifying this table to include an extra entry at the end with the handle of my new allocation, and modifying the header data structure to bump the number of entries by one. Still no dice. Discouraging as it was, that did not sink the theory entirely. In fact, I noticed something peculiar about the entries: contrary to what I thought, not all of them corresponded to valid handles. No, all but the last entry were valid. The handles from the kernel are 1-indexed, yet in each memory dump, the final handle was always 0, nonexistent. Perhaps this acts as a sentinel value, analogous to NULL-terminated strings in C. That explanation begs the question of why? If the header already contains a count of entries, a sentinel value is redundant.

I pressed on. Instead of adding on an extra entry with my handle, I copied the last entry n to the extra entry n + 1 and overwrote the (now second to last) entry n with the new handle.

Suddenly my clear colour showed up.

Is the mystery solved? I got the code working, so in some sense, the answer must be yes. But this is hardly a satisfying explanation; at every step, the unlikely solution only raises more questions. The chicken-and-egg problem is the easiest to resolve: this mapping table, along with the root command buffer, is allocated via a special IOKit selector independent from the general buffer allocation, and the handle to the mapping table is passed along with the submit command buffer selector. Further, the idea of passing required handles with command buffer submission is not unheard of; a similar mechanism is used on mainline Linux drivers. Nevertheless, the rationale for using 64-byte table entries in shared memory, as opposed to a simple CPU-side array, remains totally elusive.

Putting memory allocation woes behind me, the road ahead was not without bumps (and potholes), but with patience, I iterated until I had constructed the entirety of GPU memory myself in parallel to Metal, relying on the proprietary userspace only to initialize the device. Finally, all that remained was a leap of faith to kick off the IOKit handshake myself, and I had my first triangle.

These changes amount to around 1700 lines of code since the last blog post, available on GitHub. I’ve pieced together a simple demo animating a triangle with the GPU on-screen. The window system integration is effectively nonexistent at this point: XQuartz is required and detiling the (64x64 Morton-order interleaved) framebuffer occurs in software with naive scalar code. Nevertheless, the M1’s CPU is more than fast enough to cope.

Now that each part of the userspace driver is bootstrapped, going forward we can iterate on the instruction set and the command buffers in isolation. We can tease apart the little details and bit-by-bit transform the code from hundreds of inexplicable magic constants to a real driver. Onwards!

Passion Led Us Here

How our for-profit company became a nonprofit, to better tackle the digital divide.

Originally posted on the Endless OS Foundation blog.

An 8-year journey to a nonprofit

On the 1st of April 2020, our for-profit Endless Mobile officially became a nonprofit as the Endless OS Foundation. Our launch as a nonprofit just as the global pandemic took hold was, predictably, hardly noticed, but for us the timing was incredible: as the world collectively asked “What can we do to help others in need?”, we framed our mission statement and launched our .org with the same very important question in mind. Endless always had a social impact mission at its heart, and the challenges related to students, families, and communities falling further into the digital divide during COVID-19 brought new urgency and purpose to our team’s decision to officially step into the social welfare space.

On April 1st 2020, our for-profit Endless Mobile officially became a nonprofit as the Endless OS Foundation, focused on the #DigitalDivide.

Our updated status was a long time coming: we began our transformation to a nonprofit organization in late 2019 with the realization that the true charter and passions of our team would be greatly accelerated without the constraints of for-profit goals, investors and sales strategies standing in the way of our mission of digital access and equity for all. 

But for 8 years we made a go of it commercially, headquartered in Silicon Valley and framing ourselves as a tech startup with access to the venture capital and partnerships on our doorstep. We believed that a successful commercial channel would be the most efficient way to scale the impact of bringing computer devices and access to communities in need. We still believe this – we’ve just learned through our experience that we don’t have the funding to enter the computer and OS marketplace head-on. With the social impact goal first, and the hope of any revenue a secondary goal, we have had many successes in those 8 years bridging the digital divide throughout the world, from Brazil, to Kenya, and the USA. We’ve learned a huge amount which will go on to inform our strategy as a nonprofit.

Endless always had a social impact mission at its heart. COVID-19 brought new urgency and purpose to our team’s decision to officially step into the social welfare space.

Our unique perspective

One thing we learned as a for-profit is that the OS and technology we’ve built has some unique properties which are hugely impactful as a working solution to digital equity barriers. And our experience deploying in the field around the world for 8 years has left us uniquely informed via many iterations and incremental improvements.

Endless OS designer in discussion with prospective user

With this knowledge in hand, we’ve been refining our strategy throughout 2020 and are now starting to focus on what it really means to become an effective nonprofit and make that impact. In many ways it is liberating to abandon the goals and constraints of being a for-profit entity, and in other ways it’s been a challenging journey for me and the team to adjust our way of thinking and let these for-profit notions and models go. Previously we exclusively built and sold a product that defined our success; and any impact we achieved was a secondary consequence of that success and seen through that lens. Now our success is defined purely in terms of social impact, and through our actions, those positive impacts can be made with or without our “product”. That means that we may develop and introduce technology to solve a problem, but it is equally as valid to find another organization’s existing offering and design a way to increase that positive impact and scale.

We develop technology to solve access equity issues, but it’s equally as valid to find another organization’s offering and partner in a way that increases their positive impact.

The analogy to Free and Open Source Software is very strong – while Endless has always used and contributed to a wide variety of FOSS projects, we’ve also had a tension where we’ve been trying to hold some pieces back and capture value – such as our own application or content ecosystem, our own hardware platform – necessarily making us competitors to other organisations even though they were hoping to achieve the same things as us. As a nonprofit we can let these ideas go and just pick the best partners and technologies to help the people we’re trying to reach.

School kids writing on paper

Digital equity … 4 barriers we need to overcome

In future, our decisions around which projects to build or engage with will revolve around 4 barriers to digital equity, and how our Endless OS, Endless projects, or our partners’ offerings can help to solve them. We define these 4 equity barriers as: barriers to devices, barriers to connectivity, barriers to literacy in terms of your ability to use the technology, and barriers to engagement in terms of whether using the system is rewarding and worthwhile.

We define the 4 digital equity barriers we exist to impact as:
1. barriers to devices
2. barriers to connectivity
3. barriers to literacy
4. barriers to engagement

It doesn’t matter who makes the solutions that break these barriers; what matters is how we assist in enabling people to use technology to gain access to the education and opportunities these barriers block. Our goal therefore is to simply ensure that solutions exist – building them ourselves and with partners such as the FOSS community and other nonprofits – proving them with real-world deployments, and sharing our results as widely as possible to allow for better adoption globally.

If we define our goal purely in terms of whether people are using Endless OS, we are effectively restricting the reach and scale of our solutions to the audience we can reach directly with Endless OS downloads, installs and propagation. Conversely, partnerships that scale impact are a win-win-win for us, our partners, and the communities we all serve. 

Engineering impact

Our Endless engineering roots and capabilities feed our unique ability to build and deploy all of our solutions, and the practical experience of deploying them gives us evidence and credibility as we advocate for their use. Either activity would be weaker without the other.

Our engineering roots and capabilities feed our unique ability to build and deploy digital divide solutions.

Our partners in various engineering communities will have already seen our change in approach. Particularly, with GNOME we are working hard to invest in upstream and reconcile the long-standing differences between our experience and GNOME. If successful, many more people can benefit from our work than just users of Endless OS. We’re working with Learning Equality on Kolibri to build a better app experience for Linux desktop users and bring content publishers into our ecosystem for the first time, and we’ve also taken our very own Hack, the immersive and fun destination for kids learning to code, released it for non-Endless systems on Flathub, and made it fully open-source.

Planning tasks with sticky notes on a whiteboard

What’s next for our OS?

What then is in store for the future of Endless OS, the place where we have invested so much time and planning through years of iterations? For the immediate future, we need the capacity to deploy everything we’ve built – all at once, to our partners. We built an OS that we feel is very unique and valuable, containing a number of world-firsts: first production OS shipped with OSTree, first Flatpak-only desktop, built-in support for updating OS and apps from USBs, while still providing a great deal of reliability and convenience for deployments in offline and educational-safe environments with great apps and content loaded on every system.

However, we need to find a way to deliver this Linux-based experience in a more efficient way, and we’d love to talk if you have ideas about how we can do this, perhaps as partners. Can the idea of “Endless OS” evolve to become a spec that is provided by different platforms in the future, maybe remixes of Debian, Fedora, openSUSE or Ubuntu? 

Build, Validate, Advocate

Beyond the OS, the Endless OS Foundation has identified multiple programs to help underserved communities, and in each case we are adopting our “build, validate, advocate” strategy. This approach underpins all of our projects: can we build the technology (or assist in the making), will a community in-need validate it by adoption, and can we inspire others by telling the story and advocating for its wider use?

We are adopting a “build, validate, advocate” strategy.
1. build the technology (or assist in the making)
2. validate by community adoption
3. advocate for its wider use

As examples, we have just launched the Endless Key (link) as an offline solution for students during the COVID-19 at-home distance learning challenges. This project is also establishing a first-ever partnership of well-known online educational brands to reach an underserved offline audience with valuable learning resources. We are developing a pay-as-you-go platform and new partnerships that will allow families to own laptops via micro-payments that are built directly into the operating system, even if they cannot qualify for standard retail financing. And during the pandemic, we’ve partnered with Teach For America to focus on very practical digital equity needs in the USA’s urban and rural communities.

One part of the world-wide digital divide solution

We are one solution provider for the complex matrix of issues known collectively as the #DigitalDivide, and these issues will not disappear after the pandemic. Digital equity was an issue long before COVID-19, and we are not so naive to think it can be solved by any single institution, or by the time the pandemic recedes. It will take time and a coalition of partnerships to win. We are in for the long-haul and we are always looking for partners, especially now as we are finding our feet in the nonprofit world. We’d love to hear from you, so please feel free to reach out to me – I’m ramcq on IRC, RocketChat, Twitter, LinkedIn or rob@endlessos.org.

Your XKB keymap contains two important parts. One is the mapping from the hardware scancode to some internal representation, for example:

  <AB10> = 61;  

Which basically means Alphanumeric key in row B (from bottom), 10th key from the left. In other words: the /? key on a US keyboard.

The second part is mapping that internal representation to a keysym, for example:

  key <AB10> {        [     slash,    question        ]       }; 

This is the actual layout mapping - once in place this key really produces a slash or question mark (on level2, i.e. when Shift is down).

This two-part approach exists so either part can be swapped without affecting the other. Swap the second part to an exclamation mark and paragraph symbol and you have the French version of this key, swap it to dash/underscore and you have the German version of the key - all without having to change the keycode.
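
For illustration, the swapped variants of that one key could look like this (simplified; the real French and German symbols files define more levels than shown here):

  key <AB10> {        [    exclam,    section         ]       }; // French
  key <AB10> {        [     minus,    underscore      ]       }; // German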

Back in the golden days of everyone-does-what-they-feel-like, keyboard manufacturers (presumably happily so) changed the key codes and we needed model-specific keycodes in XKB. The XkbModel configuration is a leftover from these trying times.

The Linux kernel's evdev API has largely done away with this. It provides a standardised set of keycodes, defined in linux/input-event-codes.h, and ensures, with the help of udev [0], that all keyboards actually conform to that. An evdev XKB keycode is a simple "kernel keycode + 8" [1] and that applies to all keyboards. For example, KEY_SLASH is 53 in the kernel, so its evdev XKB keycode is 61 - the <AB10> we saw above. On top of that, the kernel uses semantic definitions for the keys as they'd be in the US layout. KEY_Q is the key that would, behold!, produce a Q. Or an A in the French layout because they just have to be different, don't they? Either way, with evdev the Xkb Model configuration largely points to nothing and only wastes a few cycles with string parsing.

The second part, the keysym mapping, uses two approaches. One is to use a named #define like the "slash", "question" outlined above (see X11/keysymdef.h for the defines). The other is to use Unicode directly, like this example from the Devanagari layout:

  key <AB10> { [ U092f, U095f, slash, question ] };

As you can see, mix and match is available too. Using Unicode code points of course makes the layouts less immediately readable but on the other hand we don't need to #define the whole of Unicode. So from a maintenance perspective it's a win.

However, there's a third type of key that we care about: functional keys. Those are the multimedia (historically: "internet") keys that most devices have these days. Volume up, touchpad on/off, cycle display connectors, etc. Those keys are special in that they don't have a Unicode representation and they are always mapped to the same fixed functionality. Even Dvorak users want their volume keys to do what it says on the key.

Because they have no Unicode code points, those keys are defined, historically, in XF86keysyms.h:

  #define XF86XK_MonBrightnessUp    0x1008FF02  /* Monitor/panel brightness */

And mapping such a key looks like this [2]:

  key <I21>   {       [ XF86Calculator        ] };

The only drawback: every key needs to be added manually. This has been done for some, but not for others. And some keys were added with different names than what the kernel uses [3].

So we're in this weird situation where we have a flexible keymap system, but the kernel already tells us what a key does anyway and we don't want to change that. Virtually all keys added in the last decade or so fall into that group, but to actually make use of them requires a #define in xorgproto and an update to the keycodes and symbols in xkeyboard-config. That again introduces discrepancies and we end up in the situation we're in right now: some keys don't work until someone files a bug, and then users still need to wait for several components to be released and for those releases to trickle into the distributions.

10 years ago would've been a good time to make this more efficient. The situation wasn't that urgent then: most of the kernel keycodes added are >255, which means they cannot be used in X anyway. [4] The second best time to do it is now. What we need is basically a pass-through from kernel keycode to keysym, and that's currently sitting in various MRs:

- xkeyboard-config can generate the keycodes/evdev file based on the list of kernel keycodes, so all kernel keycodes are mapped to internal representations by default

- xorgproto has reserved a range within the XF86 keysym reserved range for pass-through mappings, i.e. any KEY_FOO define from the kernel is mapped to XF86XK_Foo with a specific value [5]. The #define format is fixed so it can be parsed.

- xkeyboard-config parses these XF86 keysyms and sets up a keysym mapping in the default keymap.

This is semi-automatic, i.e. there are helper scripts that detect changes and notify us, hooked into the CI, but the actual work must be done manually. These keysyms immediately become set-in-stone API so we don't want some unsupervised script to go wild on them.

There's a huge backlog of keys to be added (dating to kernels pre-v3.18) and I'll go through them one-by-one over the next weeks to make sure they're correct. But eventually they'll be done, and we'll have a full keymap with all kernel keys immediately available in the XKB layout.

The last part of all of this is a calendar reminder for me to do this after every new kernel release. Let's hope this crucial part isn't the first to fail.

[0] 60-keyboard.hwdb has a mere ~1800 lines!
[1] Historical reasons, you don't want to know. *jedi wave*
[2] the XK_ part of the key name is dropped, implementation detail.
[3] This can also happen when a kernel define is renamed/aliased but we cannot easily do so for this header.
[4] X has an 8 bit keycode limit and that won't change until someone develops XKB2 with support for 32-bit keycodes, i.e. never.

[5] The actual value is an implementation detail and no client must care


Return of the Blog

I meant to blog a couple times this week, but I kept getting sidetracked by various matters. Here’s a very brief recap on what’s happened in zinkland over the past week.

Important MRs

A bunch more extensions related to GL 4.3+ are now enabled.

A new zink-wip snapshot is finally up after most of a week spent fighting regressions. Anyone updating from a previous (e.g., 20201230) snapshot will find:

  • a ton of garbage patches which haven’t been properly split or pruned in any way and are likely unreadable/unbisectable
  • a new descriptor manager which doesn’t cache and instead uses templates and incremental updates to provide comparable performance to the caching one; ZINK_CACHE_DESCRIPTORS=1 will use the caching version
  • tons of optimizations for reducing driver overhead
  • a prototype for a new Vulkan extension providing direct multidraw functionality (as well as accompanying implementations for both ANV and RADV) which very slightly improves performance
  • lots of corner case bug fixes for things much earlier in the branch

All told, zink should now be (slightly, possibly even imperceptibly) faster as well as being even less bug-prone.

I’m pretty beat after this week, so that’s all for now. Hoping to return to more normal, in-depth coverage of driver internals next week.

January 14, 2021

The open source Panfrost driver for Arm Mali Midgard and Bifrost GPUs now provides non-conformant OpenGL ES 3.0 on Bifrost and desktop OpenGL 3.1 on Midgard (Mali T760 and newer) and Bifrost, in time for Mesa’s first release of 2021. This follows the OpenGL ES 3.0 support on Midgard that landed over the summer, as well as the initial OpenGL ES 2.0 support that recently debuted for Bifrost. OpenGL ES 3.0 is now tested on Mali G52 in Mesa’s continuous integration, achieving a 99.9% pass rate on the corresponding drawElements Quality Program tests.

Architecturally, Bifrost shares most of its fixed-function data structures with Midgard, but features a brand new instruction set. Our work for bringing up OpenGL ES 3.0 on Bifrost reflects this division. Some fixed-function features, like instancing and transform feedback, worked without any Bifrost-specific changes since we already did bring-up on Midgard. Other shader features, like uniform buffer objects, required “from scratch” implementations in the Bifrost compiler, a task facilitated by the compiler’s maturing intermediate representation with first-class builder support. Yet other features like multiple render targets required some Bifrost-specific code while leveraging other code shared with Midgard. All in all, the work progressed much more quickly the second time around, a testament to the power of code sharing. But there is no need to limit sharing to just Panfrost GPUs; open source drivers can share code across vendors.

Indeed, since Mali is an embedded GPU, the proprietary driver only exposes OpenGL ES, not desktop OpenGL. However, desktop OpenGL 3.1 support comes nearly “for free” for us as an upstream Mesa driver by leveraging common infrastructure. This milestone shows the technical advantage of open source development: Compared to layered implementations of desktop GL like gl4es or Zink, Panfrost’s desktop OpenGL support is native, reducing CPU overhead. Furthermore, applications can make use of the hardware’s hidden features, like explicit primitive restart indices, alpha testing, and quadrilaterals. Although these features could be emulated, the native solutions are more efficient.

Mesa’s shared code also extends to OpenCL support via Clover. Once a driver supports compute shaders and sufficient compiler features, baseline OpenCL is just a few patches and a bug-fixing spree away. While OpenCL implementations could be layered (for example with clvk), an open source Mesa driver avoids the indirection.

I would like to thank Collaboran Boris Brezillon, who has worked tirelessly to bring OpenGL ES 3.0 support to Bifrost, as well as the prolific Icecream95, who has spearheaded OpenCL and desktop OpenGL support.

Originally posted on Collabora’s blog

A Quintessential Metric

There’s been a lot of talk about driver overhead in the Mesa community as of late, in large part begun by Marek Olšák and his daredevil stunts driving RadeonSI through flaming hoops while juggling chainsaws.

While zink isn’t quite at that level yet (and neither am I), there’s still some progress being made that I’d like to dig into a bit.

What Is Overhead?

As in all software, overhead is the performance penalty that is incurred as compared to a baseline measurement. In Mesa, a lot of people know of driver overhead as “Gallium sucks” and/or “A Gallium-based driver is slow” due to the fact that Gallium does incur some amount of overhead as compared to the old-style immediate mode DRI drivers.

While it’s true that there is an amount of performance lost by using Gallium in this sense, it’s also true that the performance gained is much greater. The reason for this is that Gallium is able to batch commands and state changes for every driver using it, allowing redundant calls to avoid triggering any work in the GPU.

It also makes for an easier time profiling and improving upon the CPU usage that’s required to handle the state changes emitted by Gallium. Instead of having a ton of core Mesa callbacks which need to be handled, each one potentially leading to a no-op that can be analyzed and deferred by the driver, Gallium provides a more cohesive API where each driver hook is a necessary change that must be handled. Because of this, the job of optimizing for those changes is simplified.

How Can Overhead Be Measured?

Other than the obvious method of running apps on a driver and checking the fps counter, piglit provides a facility for this: the drawoverhead test. This test has over a hundred subtests which perform sequences of draw operations with various state changes, each with its own result relative to a baseline, enabling a developer to easily profile and optimize a given codepath.

How Is Zink Doing Here?

To answer this, let’s look at some preliminary results from zink in master, the code which will soon be shipping in Mesa 21.0.0. All numbers here are, in contrast to my usual benchmarking, taken on an AMD 5700XT GPU. More on this later.

ZINK: MASTER

   #, Test name                                              ,    Thousands draws/s, Difference vs the 1st
   1, DrawElements ( 1 VBO| 0 UBO|  0    ) w/ no state change,                  818, 100.0%
   2, DrawElements ( 4 VBO| 0 UBO|  0    ) w/ no state change,                  686, 83.9%
   3, DrawElements (16 VBO| 0 UBO|  0    ) w/ no state change,                  411, 50.3%
   4, DrawElements ( 1 VBO| 0 UBO| 16 Tex) w/ no state change,                  232, 28.4%
   5, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ no state change,                  258, 31.5%
   6, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ shader program change,             87, 10.7%
   7, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ vertex attrib change,             162, 19.9%
   8, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 1 texture change,                 150, 18.3%
   9, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 8 textures change,                120, 14.7%
  10, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 1 TBO change,                     192, 23.5%
  11, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 8 TBOs change,                    146, 17.9%

After this point, the test aborts because shader images are not yet implemented, but it’s enough for a baseline.

These numbers are…not great. Primarily, at least to start, I’ll be focusing on the first row where zink is performing 818,000 draws per second.

Let’s check out some performance from zink-wip (20201230 snapshot), specifically with GALLIUM_THREAD=0 set to disable threaded context. This means I’m adding in descriptor caching and unlimited command buffer counts (vs forcing a stall after every submit from the 4th one onwards to reset a batch):

ZINK: WIP (CACHED, NO THREAD)

   #, Test name                                              ,    Thousands draws/s, Difference vs the 1st
   1, DrawElements ( 1 VBO| 0 UBO|  0    ) w/ no state change,                  766, 100.0%
   2, DrawElements ( 4 VBO| 0 UBO|  0    ) w/ no state change,                  633, 82.6%
   3, DrawElements (16 VBO| 0 UBO|  0    ) w/ no state change,                  407, 53.1%
   4, DrawElements ( 1 VBO| 0 UBO| 16 Tex) w/ no state change,                  500, 65.3%
   5, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ no state change,                  449, 58.6%
   6, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ shader program change,             85, 11.2%
   7, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ vertex attrib change,             235, 30.7%
   8, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 1 texture change,                 159, 20.8%
   9, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 8 textures change,                128, 16.7%
  10, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 1 TBO change,                     179, 23.4%
  11, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 8 TBOs change,                    139, 18.2%

This is actually worse for a lot of cases!

But why is that?

It turns out that in the base draw case, threaded context is really necessary when doing caching and using more command buffers. There are sizable gains in the baseline texture cases (+100% or so each) and for a vertex attribute change (+50%), but fundamentally the overhead in the driver seems higher.

What happens if threading is enabled though?

ZINK: WIP (CACHED, THREAD)

   #, Test name                                              ,    Thousands draws/s, Difference vs the 1st
   1, DrawElements ( 1 VBO| 0 UBO|  0    ) w/ no state change,                 5206, 100.0%
   2, DrawElements ( 4 VBO| 0 UBO|  0    ) w/ no state change,                 5149, 98.9%
   3, DrawElements (16 VBO| 0 UBO|  0    ) w/ no state change,                 5187, 99.6%
   4, DrawElements ( 1 VBO| 0 UBO| 16 Tex) w/ no state change,                 5210, 100.1%
   5, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ no state change,                 4684, 90.0%
   6, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ shader program change,            137, 2.6%
   7, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ vertex attrib change,             252, 4.8%
   8, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 1 texture change,                 243, 4.7%
   9, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 8 textures change,                222, 4.3%
  10, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 1 TBO change,                     213, 4.1%
  11, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 8 TBOs change,                    208, 4.0%

blink.gif

Indeed, threading yields almost a 700% performance improvement for the baseline cases. It turns out that synchronously performing expensive tasks like computing hash values for descriptor sets is bad. Who could have guessed.

State Changes

Looking at the other values, however, is a bit more pertinent for the purpose of this post. Overhead is incurred when state changes are triggered by descriptors being changed, and this is much closer to a real world scenario (i.e., gaming) than simply running draw calls with no changes. Caching yields roughly a 50% performance improvement for this case.

Further Improvements

As I’d mentioned previously, I’m doing some work now on descriptor management with an aim to further lower this overhead. Let’s see what that looks like.

ZINK: TEST (UNCACHED, THREAD)

   #, Test name                                              ,    Thousands draws/s, Difference vs the 1st
   1, DrawElements ( 1 VBO| 0 UBO|  0    ) w/ no state change,                 5426, 100.0%
   2, DrawElements ( 4 VBO| 0 UBO|  0    ) w/ no state change,                 5423, 99.9%
   3, DrawElements (16 VBO| 0 UBO|  0    ) w/ no state change,                 5432, 100.1%
   4, DrawElements ( 1 VBO| 0 UBO| 16 Tex) w/ no state change,                 5246, 96.7%
   5, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ no state change,                 5177, 95.4%
   6, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ shader program change,            153, 2.8%
   7, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ vertex attrib change,             229, 4.2%
   8, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 1 texture change,                 247, 4.6%
   9, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 8 textures change,                228, 4.2%
  10, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 1 TBO change,                     237, 4.4%
  11, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 8 TBOs change,                    223, 4.1%

While there’s a small (~4%) improvement in the baseline numbers, what’s much more interesting are the values where descriptor states are changed. They are, in fact, about as good as or even slightly better than the caching version of descriptor management.

This is huge. Specifically it’s huge because it means that I can likely port over some of the techniques used in this approach to the cached version in order to drive further reductions in overhead.

Closing Remarks

Before I go, let’s check out some numbers from a real driver. Specifically, RadeonSI: the pinnacle of Gallium-based drivers.

RADEONSI

   #, Test name                                              ,    Thousands draws/s, Difference vs the 1st
   1, DrawElements ( 1 VBO| 0 UBO|  0    ) w/ no state change,                 6221, 100.0%
   2, DrawElements ( 4 VBO| 0 UBO|  0    ) w/ no state change,                 6261, 100.7%
   3, DrawElements (16 VBO| 0 UBO|  0    ) w/ no state change,                 6236, 100.2%
   4, DrawElements ( 1 VBO| 0 UBO| 16 Tex) w/ no state change,                 6263, 100.7%
   5, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ no state change,                 6243, 100.4%
   6, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ shader program change,            217, 3.5%
   7, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ vertex attrib change,            1467, 23.6%
   8, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 1 texture change,                 374, 6.0%
   9, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 8 textures change,                218, 3.5%
  10, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 1 TBO change,                     680, 10.9%
  11, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 8 TBOs change,                    318, 5.1%

Yikes. Especially intimidating here is the relative performance for vertex attribute changes, where RadeonSI retains almost 25% of its baseline performance while zink doesn’t even manage 5%.

Hopefully these figures get closer to each other in the future, but this just shows that there’s still a long way to go.

January 13, 2021

This post explains how to parse the HID Unit Global Item as explained by the HID Specification, page 37. The table there is quite confusing and it took me a while to fully understand it (Benjamin Tissoires was really the one who cracked it). I couldn't find any better explanation online which means either I'm incredibly dense and everyone's figured it out or no-one has posted a better explanation. On the off-chance it's the latter [1], here are the instructions on how to parse this item.

We know a HID Report Descriptor consists of a number of items that describe the content of each HID Report (read: an event from a device). These Items include things like Logical Minimum/Maximum for axis ranges, etc. A HID Unit item specifies the physical unit to apply. For example, a Report Descriptor may specify that X and Y axes are in mm which can be quite useful for all the obvious reasons.

Like most HID items, a HID Unit Item consists of a one-byte item tag and 1, 2 or 4 byte payload. The Unit item in the Report Descriptor itself has the binary value 0110 01nn where the nn is either 1, 2, or 3 indicating 1, 2 or 4 bytes of payload, respectively. That's standard HID.

The payload is divided into nibbles (4-bit units) and goes from LSB to MSB. The lowest-order 4 bits (first byte & 0xf) define the unit System to apply: one of SI Linear, SI Rotation, English Linear or English Rotation (well, or None/Reserved). The rest of the nibbles are in this order: "length", "mass", "time", "temperature", "current", "luminous intensity". In something resembling code this means:


system = value & 0xf
length_exponent = (value & 0xf0) >> 4
mass_exponent = (value & 0xf00) >> 8
time_exponent = (value & 0xf000) >> 12
...
The System defines which unit is used for length (e.g. SILinear means length is in cm). The actual value of each nibble is the exponent for the unit in use [2]. In something resembling code:

switch (system)
case SILinear:
print("length is in cm^{length_exponent}");
break;
case SIRotation:
print("length is in rad^{length_exponent}");
break;
case EnglishLinear:
print("length is in in^{length_exponent}");
break;
case EnglishRotation:
print("length is in deg^{length_exponent}");
break;
case None:
case Reserved:
print("boo!");
break;

For example, the value 0x321 means "SI Linear" (0x1) so the remaining nibbles represent, in ascending nibble order: Centimeters, Grams, Seconds, Kelvin, Ampere, Candela. The length nibble has a value of 0x2 so it's square cm, the mass nibble has a value of 0x3 so it is cubic grams (well, it's just an example, so...). This means that any report containing this item comes in cm²g³. As a more realistic example: 0xF011 would be cm/s.

If we changed the lowest nibble to English Rotation (0x4), i.e. our value is now 0x324, the units represent: Degrees, Slug, Seconds, F, Ampere, Candela [3]. The length nibble 0x2 means square degrees, the mass nibble is cubic slugs. As a more realistic example, 0xF014 would be degrees/s.

Any nibble with value 0 means the unit isn't in use, so the example from the spec with value 0x00F0D121 is SI linear, units cm² g s⁻³ A⁻¹, which is... Voltage! Of course you knew that and totally didn't have to double-check with wikipedia.
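
To make the nibble math concrete, here is a small, self-contained C sketch of my own (not from the spec or from any existing tool) that decodes a 32-bit Unit value into its system nibble and per-quantity exponents, sign-extending each nibble as described in [2]:

#include <stdio.h>
#include <stdint.h>

/* Sign-extend a 4-bit two's complement nibble: 0x8..0xF map to -8..-1 */
static int nibble_exponent(uint32_t value, int nibble)
{
    int v = (value >> (nibble * 4)) & 0xf;
    return v >= 8 ? v - 16 : v;
}

int main(void)
{
    uint32_t unit = 0x00F0D121; /* the Voltage example from above */
    const char *quantities[] = {
        "length", "mass", "time", "temperature", "current", "luminous intensity"
    };

    printf("system: 0x%x\n", (unsigned)(unit & 0xf));
    for (int i = 0; i < 6; i++)
        printf("%s exponent: %d\n", quantities[i], nibble_exponent(unit, i + 1));
    return 0;
}

Running it on 0x00F0D121 prints exponents 2, 1, -3, 0, -1, 0, i.e. cm² g s⁻³ A⁻¹.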

Because bits are expensive and the base units are of course either too big or too small or otherwise not quite right, HID also provides a Unit Exponent item. The Unit Exponent item (a separate item to Unit in the Report Descriptor) then describes the exponent to be applied to the actual value in the report. For example, a Unit Exponent of -3 means 10⁻³ is to be applied to the value. If the report descriptor specifies an item of Unit 0x00F0D121 (i.e. V) and Unit Exponent -3, the value of this item is in mV (milliVolt); a Unit Exponent of 3 would make it kV (kiloVolt).

Now, in hindsight all this is pretty obvious and maybe even sensible. It'd have been nice if the spec had explained it a bit more clearly, but then I would have nothing to write about, so I guess overall I call it a draw.

[1] This whole adventure was started because there's a touchpad out there that measures touch pressure in radians, so at least one other person out there struggled with the docs...
[2] The nibble value is twos complement (i.e. it's a signed 4-bit integer). Values 0x1-0x7 are exponents 1 to 7, values 0x8-0xf are exponents -8 to -1.
[3] English Linear should've trolled everyone and used Centimetres instead of Centimeters in SI Linear.

January 12, 2021

TL;DR: It's now easy to unlock your LUKS2 volume with a FIDO2 security token (e.g. YubiKey or Nitrokey FIDO2). And TPM2 unlocking is easy now too.

Blogging is a lot of work, and a lot less fun than hacking. I mostly focus on the latter because of that, but from time to time I guess stuff is just too interesting to not be blogged about. Hence here, finally, another blog story about exciting new features in systemd.

With the upcoming systemd v248 the systemd-cryptsetup component of systemd (which is responsible for assembling encrypted volumes during boot) gained direct support for unlocking encrypted storage with three types of security hardware:

  1. Unlocking with FIDO2 security tokens (well, at least with those which implement the hmac-secret extension, most do). i.e. your YubiKeys (series 5 and above), or Nitrokey FIDO2 and such.

  2. Unlocking with TPM2 security chips (pretty ubiquitous on non-budget PCs/laptops/…)

  3. Unlocking with PKCS#11 security tokens, i.e. your smartcards and older YubiKeys (the ones that implement PIV). (Strictly speaking this was supported on older systemd already, but was a lot more "manual".)

For completeness' sake, let's keep in mind that the component also allows unlocking with these more traditional mechanisms:

  1. Unlocking interactively with a user-entered passphrase (i.e. the way most people probably already deploy it, supported since about forever)

  2. Unlocking via key file on disk (optionally on removable media plugged in at boot), supported since forever.

  3. Unlocking via a key acquired through trivial AF_UNIX/SOCK_STREAM socket IPC. (Also new in v248)

  4. Unlocking via recovery keys. These are pretty much the same thing as a regular passphrase (and in fact can be entered wherever a passphrase is requested) — the main difference being that they are always generated by the computer, and thus have guaranteed high entropy, typically higher than user-chosen passphrases. They are generated in a way they are easy to type, in many cases even if the local key map is misconfigured. (Also new in v248)

In this blog story, let's focus on the first three items, i.e. those that talk to specific types of hardware for implementing unlocking.

To make working with security tokens and TPM2 easy, a new, small tool was added to the systemd tool set: systemd-cryptenroll. Its only purpose is to make it easy to enroll your security token/chip of choice into an encrypted volume. It works with any LUKS2 volume, and embeds a tiny bit of meta-information into the LUKS2 header with parameters necessary for the unlock operation.

Unlocking with FIDO2

So, let's see how this fits together in the FIDO2 case. Most likely this is what you want to use if you have one of these fancy FIDO2 tokens (which need to implement the hmac-secret extension, as mentioned). Let's say you already have your LUKS2 volume set up, and previously unlocked it with a simple passphrase. Plug in your token, and run:

# systemd-cryptenroll --fido2-device=auto /dev/sda5

(Replace /dev/sda5 with the underlying block device of your volume).

This will enroll the key as an additional way to unlock the volume, and embeds all necessary information for it in the LUKS2 volume header. Before we can unlock the volume with this at boot, we need to allow FIDO2 unlocking via /etc/crypttab. For that, find the right entry for your volume in that file, and edit it like so:

myvolume /dev/sda5 - fido2-device=auto

Replace myvolume and /dev/sda5 with the right volume name, and underlying device of course. Key here is the fido2-device=auto option you need to add to the fourth column in the file. It tells systemd-cryptsetup to use the FIDO2 metadata now embedded in the LUKS2 header, wait for the FIDO2 token to be plugged in at boot (utilizing systemd-udevd, …) and unlock the volume with it.

And that's it already. Easy-peasy, no?

Note that all of this doesn't modify the FIDO2 token itself in any way. Moreover you can enroll the same token in as many volumes as you like. Since all enrollment information is stored in the LUKS2 header (and not on the token) there are no bounds on any of this. (OK, well, admittedly, there's a cap on LUKS2 key slots per volume, i.e. you can't enroll more than a bunch of keys per volume.)

Unlocking with PKCS#11

Let's now have a closer look how the same works with a PKCS#11 compatible security token or smartcard. For this to work, you need a device that can store an RSA key pair. I figure most security tokens/smartcards that implement PIV qualify. How you actually get the keys onto the device might differ though. Here's how you do this for any YubiKey that implements the PIV feature:

# ykman piv reset
# ykman piv generate-key -a RSA2048 9d pubkey.pem
# ykman piv generate-certificate --subject "Knobelei" 9d pubkey.pem
# rm pubkey.pem

(This chain of commands erases whatever was previously stored in the PIV feature of your token, so be careful!)

For tokens/smartcards from other vendors a different series of commands might work. Once you have a key pair on it, you can enroll it with a LUKS2 volume like so:

# systemd-cryptenroll --pkcs11-token-uri=auto /dev/sda5

Just like the same command's invocation in the FIDO2 case this enrolls the security token as an additional way to unlock the volume, any passphrases you already have enrolled remain enrolled.

For the PKCS#11 case you need to edit your /etc/crypttab entry like this:

myvolume /dev/sda5 - pkcs11-uri=auto

If you have a security token that implements both PKCS#11 PIV and FIDO2 I'd probably enroll it as FIDO2 device, given it's the more contemporary, future-proof standard. Moreover, it requires no special preparation in order to get an RSA key onto the device: FIDO2 keys typically just work.

Unlocking with TPM2

Most modern (non-budget) PC hardware (and other kinds of hardware too) nowadays comes with a TPM2 security chip. In many ways a TPM2 chip is a smartcard that is soldered onto the mainboard of your system. Unlike your usual USB-connected security tokens you thus cannot remove them from your PC, which means they address quite a different security scenario: they aren't immediately comparable to a physical key you can take with you that unlocks some door; rather, they are a key you leave at the door, one that refuses to be turned by anyone but you.

Even though this sounds a lot weaker than the FIDO2/PKCS#11 model, TPM2 still brings benefits for securing your systems: because the cryptographic key material stored in TPM2 devices cannot be extracted (at least that's the theory), binding your hard disk encryption to it means attackers cannot just copy your disk and analyze it offline — they always need access to the TPM2 chip too to have a chance to acquire the necessary cryptographic keys. Thus, they can still steal your whole PC and analyze it, but they cannot just copy the disk without you noticing and analyze the copy.

Moreover, you can bind the ability to unlock the harddisk to specific software versions: for example you could say that only your trusted Fedora Linux can unlock the device, but not any arbitrary OS some hacker might boot from a USB stick they plugged in. Thus, if you trust your OS vendor, you can entrust storage unlocking to the vendor's OS together with your TPM2 device, and thus can be reasonably sure intruders cannot decrypt your data unless they both hack your OS vendor and steal/break your TPM2 chip.

Here's how you enroll your LUKS2 volume with your TPM2 chip:

# systemd-cryptenroll --tpm2-device=auto --tpm2-pcrs=7 /dev/sda5

This looks almost as straightforward as the two earlier systemd-cryptenroll command lines — if it wasn't for the --tpm2-pcrs= part. With that option you can specify to which TPM2 PCRs you want to bind the enrollment. TPM2 PCRs are a set of (typically 24) hash values that every TPM2 equipped system at boot calculates from all the software that is invoked during the boot sequence, in a secure, unfakable way (this is called "measurement"). If you bind unlocking to a specific value of a specific PCR you thus require the system to follow the same sequence of software at boot in order to re-acquire the disk encryption key. Sounds complex? Well, that's because it is.

For now, let's see how we have to modify your /etc/crypttab to unlock via TPM2:

myvolume /dev/sda5 - tpm2-device=auto

This part is easy again: the tpm2-device= option is what tells systemd-cryptsetup to use the TPM2 metadata from the LUKS2 header and to wait for the TPM2 device to show up.

Bonus: Recovery Key Enrollment

FIDO2, PKCS#11 and TPM2 security tokens and chips pair well with recovery keys: since you don't need to type in your password every day anymore it makes sense to get rid of it, and instead enroll a high-entropy recovery key you then print out or scan off screen and store in a safe, physical location. I.e. forget about good ol' passphrase-based unlocking; go for FIDO2 plus recovery key instead! Here's how you do it:

# systemd-cryptenroll --recovery-key /dev/sda5

This will generate a key, enroll it in the LUKS2 volume, show it to you on screen and generate a QR code you may scan off screen if you like. The key has highest entropy, and can be entered wherever you can enter a passphrase. Because of that you don't have to modify /etc/crypttab to make the recovery key work.

Future

There's still plenty of room for further improvement in all of this. In particular for the TPM2 case: what the text above doesn't really mention is that binding your encrypted volume unlocking to specific software versions (i.e. kernel + initrd + OS versions) actually sucks hard: if you naively update your system to newer versions you might lose access to your TPM2 enrolled keys (which isn't terrible, after all you did enroll a recovery key — right? — which you then can use to regain access). To solve this some more integration with distributions would be necessary: whenever they upgrade the system they'd have to make sure to enroll the TPM2 again — with the PCR hashes matching the new version. And whenever they remove an old version of the system they need to remove the old TPM2 enrollment. Alternatively TPM2 also knows a concept of signed PCR hash values. In this mode the distro could just ship a set of PCR signatures which would unlock the TPM2 keys. (But quite frankly I don't really see the point: whether you drop in a signature file on each system update, or enroll a new set of PCR hashes in the LUKS2 header doesn't make much of a difference). Either way, to make TPM2 enrollment smooth some more integration work with your distribution's system update mechanisms needs to happen. And yes, because of this OS updating complexity the example above — where I referenced your trusty Fedora Linux — doesn't actually work IRL (yet? hopefully…). Nothing updates the enrollment automatically after you initially enrolled it, hence after the first kernel/initrd update you have to manually re-enroll things again, and again, and again … after every update.

The TPM2 could also be used for other kinds of key policies that we might look into adding later, too. For example, Windows uses TPM2 stuff to allow short (4 digits or so) "PINs" for unlocking the harddisk, i.e. kind of a low-entropy password you type in. The reason this is reasonably safe is that in this case the PIN is passed to the TPM2 which enforces that not more than some limited amount of unlock attempts may be made within some time frame, and that after too many attempts the PIN is invalidated altogether. Thus making dictionary attacks harder (which would normally be easier given the short length of the PINs).

Postscript

(BTW: Yubico sent me two YubiKeys for testing and Nitrokey a Nitrokey FIDO2, thank you! — That's why you see all those references to YubiKey/Nitrokey devices in the text above: it's the hardware I had to test this with. That said, I also tested the FIDO2 stuff with a SoloKey I bought, where it also worked fine. And yes, you!, other vendors!, who might be reading this, please send me your security tokens for free, too, and I might test things with them as well. No promises though. And I am not going to give them back, if you do, sorry. ;-))

January 11, 2021

A Different Strategy

As the merge window for the upcoming Mesa release looms, Erik and I have decided on a new strategy for development: we’re just going to stop merging patches.

At this point in time, we have no regressions as compared to the last release, so we’re just doing a full stop until after the branch point in order to save ourselves time potentially tracking down any issues in further feature additions.

Slowdowns

Some of you may have noticed that zink-wip has yet to update this year. This isn’t due to a lack of work, but rather due to a lack of stability. I’ve been tinkering with a new descriptor management infrastructure (yes, I’m back on the horse), and it’s… well, “capable of drawing frames” is maybe the best way to describe it. I’ve gone through probably about ten iterations on it so far based on all the ideas I’ve had.

This is hardly an exhaustive list, but here’s some of the ideas that I’ve cycled through:

  • async descriptor updating - It seems like this should be good on paper given that it’s legal to do descriptor updates in threads, but the overhead from signalling the task thread in this case ended up being, on average, about 10-20x the cost of just doing the updating synchronously.

  • all push descriptors all the time - Just for hahas along the way I jammed everything into a pushed descriptor set. Or at least I was going to try. About halfway through, I realized this was way more work to execute than it’d be worth for the hahas considering I wouldn’t ever be able to use this in reality.

  • zero iteration updates - The gist of this idea is that, looking at the descriptor updating code, there’s a ton of iterating going on. This is an extreme hotpath, so any amount of looping that can be avoided is great, and the underlying Vulkan driver has to iterate the sets anyway, so… Eventually I managed to throw a bunch of memory at the problem and do all the setup during pipeline init, giving me pre-initialized blobs of memory in the form of VkWriteDescriptorSet arrays with the associated sub-types for descriptors. With this in place, naturally I turned to…

  • templates - Descriptor templates are a way of giving the Vulkan driver the raw memory of the descriptor info as a blob and letting it huck that directly into a buffer. Since I already had the memory set up for this, it was an easy swap over, though the gains were less impressive than I’d expected.

At last I’ve settled on a model of uncached, templated descriptors with an extra push set for handling uniform data for further exploration. Initial results for real world use (e.g., graphical benchmarks) are good, but piglit’s drawoverhead test shows there’s still a lot of work to be done to catch up to caching.
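
For reference, here’s a rough sketch of what the descriptor template mechanism looks like at the Vulkan API level (illustrative only, not zink’s actual code; the blob layout, binding numbers, and function names are made up). The template records where each descriptor’s info lives inside a plain chunk of CPU memory, so an update becomes a single call that hands the driver that raw memory instead of an array of VkWriteDescriptorSet:

#include <stddef.h>
#include <vulkan/vulkan.h>

struct ubo_blob {
   VkDescriptorBufferInfo ubos[8]; /* pre-filled at pipeline init */
};

static VkDescriptorUpdateTemplate
create_ubo_template(VkDevice dev, VkDescriptorSetLayout layout)
{
   VkDescriptorUpdateTemplateEntry entry = {
      .dstBinding = 0,
      .descriptorCount = 8,
      .descriptorType = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER,
      .offset = offsetof(struct ubo_blob, ubos),
      .stride = sizeof(VkDescriptorBufferInfo),
   };
   VkDescriptorUpdateTemplateCreateInfo info = {
      .sType = VK_STRUCTURE_TYPE_DESCRIPTOR_UPDATE_TEMPLATE_CREATE_INFO,
      .descriptorUpdateEntryCount = 1,
      .pDescriptorUpdateEntries = &entry,
      .templateType = VK_DESCRIPTOR_UPDATE_TEMPLATE_TYPE_DESCRIPTOR_SET,
      .descriptorSetLayout = layout,
   };
   VkDescriptorUpdateTemplate tmpl;
   vkCreateDescriptorUpdateTemplate(dev, &info, NULL, &tmpl);
   return tmpl;
}

/* per update: vkUpdateDescriptorSetWithTemplate(dev, set, tmpl, &blob); */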

Big thanks to Hans-Kristian Arntzen, aka themaister, aka low-level graphics swashbuckler, for providing insight and consults along the way.

January 08, 2021

VkRunner is a Vulkan shader tester based on Piglit’s shader_runner (I already talked about it in my blog). This tool is very helpful for creating simple Vulkan tests without writing hundreds of lines of code. In the Graphics Team at Igalia, we use it extensively to help with open-source driver development in Mesa, such as the V3D and Turnip drivers.

As a hobby project for last Christmas holiday season, I wrote the .spec file for VkRunner and uploaded it to Fedora’s Copr and OpenSUSE Build Service (OBS) for generating the respective RPM packages.

This is the first time I have created a package, and thanks to the documentation on how to create RPM packages, the process was simpler than I initially thought. If I find the time to read the Debian New Maintainers’ Guide, I will create a DEB package as well.

Anyway, if you have installed Fedora or OpenSUSE on your computer and you want to try VkRunner, just follow these steps:

Fedora

  • Fedora:
$ sudo dnf copr enable samuelig/vkrunner
$ sudo dnf install vkrunner

OpenSUSE logo

  • OpenSUSE / SLE:
$ sudo zypper addrepo https://download.opensuse.org/repositories/home:samuelig/openSUSE_Leap_15.2/home:samuelig.repo
$ sudo zypper refresh
$ sudo zypper install vkrunner

Enjoy it!

January 07, 2021

Apple’s latest line of Macs includes their in-house “M1” system-on-chip, featuring a custom GPU. This poses a problem for those of us in the Asahi Linux project who wish to run Linux on our devices, as this custom Apple GPU has neither public documentation nor open source drivers. Some speculate it might descend from PowerVR GPUs, as used in older iPhones, while others believe the GPU to be completely custom. But rumours and speculations are no fun when we can peek under the hood ourselves!

A few weeks ago, I purchased a Mac Mini with an M1 GPU as a development target to study the instruction set and command stream, to understand the GPU’s architecture at a level not previously publicly understood, and ultimately to accelerate the development of a Mesa driver for the hardware. Today I’ve reached my first milestone: I now understand enough of the instruction set to disassemble simple shaders with a free and open-source tool chain, released on GitHub here.

The process for decoding the instruction set and command stream of the GPU parallels the same process I used for reverse-engineering Mali GPUs in the Panfrost project, originally pioneered by the Lima, Freedreno, and Nouveau free software driver projects. Typically, for Linux or Android driver reverse-engineering, a small wrapper library is written and injected into a test application via LD_PRELOAD; it hooks key system calls like ioctl and mmap in order to analyze user-kernel interactions. Once the "submit command buffer" call is issued, the library can dump all (mapped) shared memory for offline analysis.
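
As an illustration of the Linux side of this technique, a minimal LD_PRELOAD wrapper for ioctl might look like the sketch below (my own example, not the actual Panfrost or Asahi tooling; the logging is just a placeholder for real decoding and memory dumping):

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdarg.h>
#include <stdio.h>
#include <sys/ioctl.h>

int ioctl(int fd, unsigned long request, ...)
{
    static int (*real_ioctl)(int, unsigned long, void *);
    va_list ap;
    void *arg;

    if (!real_ioctl)
        real_ioctl = (int (*)(int, unsigned long, void *))dlsym(RTLD_NEXT, "ioctl");

    /* kernel graphics drivers pass a single pointer argument per request */
    va_start(ap, request);
    arg = va_arg(ap, void *);
    va_end(ap);

    fprintf(stderr, "ioctl(fd=%d, request=0x%lx)\n", fd, request);
    /* decode known requests and dump mapped GPU memory here */
    return real_ioctl(fd, request, arg);
}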

The same overall process will work for the M1, but there are some macOSisms that need to be translated. First, there is no LD_PRELOAD on macOS; the equivalent is DYLD_INSERT_LIBRARIES, which has some extra security features which are easy enough to turn off for our purposes. Second, while the standard Linux/BSD system calls do exist on macOS, they are not used for graphics drivers. Instead, Apple’s own IOKit framework is used for both kernel and userspace drivers, with the critical entry point of IOConnectCallMethod, an analogue of ioctl. These differences are easy enough to paper over, but they do add a layer of distance from the standard Linux tooling.

The bigger issue is orienting ourselves in the IOKit world. Since Linux is under a copyleft license, (legal) kernel drivers are open source, so the ioctl interface is public, albeit vendor-specific. macOS’s kernel (XNU) being under a permissive license brings no such obligations; the kernel interface is proprietary and undocumented. Even after wrapping IOConnectCallMethod, it took some elbow grease to identify the three critical calls: memory allocation, command buffer creation, and command buffer submission. Wrapping the allocation and creation calls is essential for tracking GPU-visible memory (what we are interested in studying), and wrapping the submission call is essential for timing the memory dump.
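
On the macOS side, the same idea ends up looking roughly like this sketch, using dyld’s __interpose section to rebind IOConnectCallMethod (again, only an illustration of the mechanism, with a placeholder log where a real tool would track allocations and dump memory on submit):

#include <stdio.h>
#include <IOKit/IOKitLib.h>

static kern_return_t
wrap_IOConnectCallMethod(mach_port_t connection, uint32_t selector,
                         const uint64_t *input, uint32_t inputCnt,
                         const void *inputStruct, size_t inputStructCnt,
                         uint64_t *output, uint32_t *outputCnt,
                         void *outputStruct, size_t *outputStructCnt)
{
    fprintf(stderr, "IOConnectCallMethod(selector=%u)\n", selector);
    /* if this selector is the command buffer submission, dump all tracked
       GPU-visible allocations before forwarding the call */
    return IOConnectCallMethod(connection, selector, input, inputCnt,
                               inputStruct, inputStructCnt,
                               output, outputCnt, outputStruct, outputStructCnt);
}

/* dyld rebinds calls to the original symbol in other images to the wrapper */
__attribute__((used, section("__DATA,__interpose")))
static const struct { const void *replacement; const void *original; }
interpose_IOConnectCallMethod = {
    (const void *)(unsigned long)&wrap_IOConnectCallMethod,
    (const void *)(unsigned long)&IOConnectCallMethod,
};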

With those obstacles cleared, we can finally get to the shader binaries, black boxes in themselves. However, the process from here on out is standard: start with the simplest fragment or compute shader possible, make a small change in the input source code, and compare the output binaries. Iterating on this process is tedious but will quickly reveal key structures, including opcode numbers.

The findings of the process documented in the free software disassembler confirm a number of traits of the GPU:

One, the architecture is scalar. Unlike some GPUs that are scalar for 32-bits but vectorized for 16-bits, the M1’s GPU is scalar at all bit sizes. Yet Metal optimization resources imply 16-bit arithmetic should be significantly faster, in addition to a reduction of register usage leading to higher thread count (occupancy). This suggests the hardware is superscalar, with more 16-bit ALUs than 32-bit ALUs, allowing the part to benefit from low-precision graphics shaders much more than competing chips can, while removing a great deal of complexity from the compiler.

Two, the architecture seems to handle scheduling in hardware, common among desktop GPUs but less so in the embedded space. This again makes the compiler simpler at the expense of more hardware. Instructions seem to have minimal encoding overhead, unlike other architectures which need to pad out instructions with nop’s to accommodate highly constrained instruction sets.

Three, various modifiers are supported. Floating-point ALUs can do clamps (saturate), negates, and absolute value modifiers “for free”, a common shader architecture trait. Further, most (all?) instructions can type-convert between 16-bit and 32-bit “for free” on both the destination and the sources, which allows the compiler to be much more aggressive about using 16-bit operations without risking conversion overheads. On the integer side, various bitwise complements and shifts are allowed on certain instructions for free. None of this is unique to Apple’s design, but it’s worth noting all the same.

Finally, not all ALU instructions have the same timing. Instructions like imad, used to multiply two integers and add a third, are avoided in favour of repeated iadd integer addition instructions where possible. This also suggests a superscalar architecture; software-scheduled designs like those I work on for my day job cannot exploit differences in pipeline length, inadvertently slowing down simple instructions to match the speed of complex ones.

From my prior experience working with GPUs, I expect to find some eldritch horror waiting in the instruction set, to balloon compiler complexity. Though the above work currently covers only a small surface area of the instruction set, so far everything seems sound. There are no convoluted optimization tricks, but doing away with the trickery creates a streamlined, efficient design that does one thing and does it well. Maybe Apple’s hardware engineers discovered it’s hard to beat simplicity.

Alas, a shader tool chain isn’t much use without an open source userspace driver. Next up: dissecting the command stream!

Disclaimer: This work is a hobby project conducted based on public information. Opinions expressed may not reflect those of my employer.

January 05, 2021

Some People Will Appreciate This

border_colors.png

Also zink hit GL 4.1 today.

January 04, 2021

It Happens

As long-time readers of the blog know, SGC is a safe space where making mistakes is not only accepted, it’s a way of life. So it is once again that I need to amend statements previously made regarding Xorg synchronization after Michel Dänzer, also known for anchoring the award-winning series Why Is My MR Failing CI Today?, pointed out that while I was indeed addressing the correct problem, I was addressing it from the wrong side.

Looking Closer

The issue here is that WSI synchronizes with the display server using a file descriptor for the swapchain image that the Vulkan driver manages. But what if the Vulkan driver never configures itself to be used for WSI (genuine or faked) in the first place?

Yes, this indeed appeared to be the true problem. Iago Toral Quiroga added handling for this, specific to the V3DV driver, back in October, and it’s the same mechanism: setting up a Mesa-internal struct during resource initialization.

So I extended this to the ANV codepath and…

And obviously it didn’t work.

But why was this the case?

A script-based git blame revealed that ANV has a different handling for implicit sync than other Vulkan drivers. After a well-hidden patch, ANV relies entirely on a struct attached to VkSubmitInfo which contains the swapchain image’s memory pointer in order to handle implicit sync. Thus by attaching a wsi_memory_signal_submit_info struct, everything was resolved.

Is it a great fix? No. Does it work? Yes.

Questions

If ANV wasn’t configuring itself to handle implicit sync, why was poll() working?

Luck.

Why does RADV work without any of this?

Also probably luck.

January 01, 2021

Reviving Very Old X Code

I've taken the week between Christmas and New Year's off this year. I didn't really have anything serious planned, just taking a break from the usual routine. As often happens, I got sucked into doing a project when I received this simple bug report, Debian Bug #974011:

I have been researching old terminal and X games recently, and realized
that much of the code from 'xmille' originated from the terminal game
'mille', which is part of bsdgames.

...

[The copyright and license information] has been stripped out of all
code in the xmille distribution.  Also, none of the included materials
give credit to the original author, Ken Arnold.

The reason the 'xmille' source is missing copyright and license information from the 'mille' files is that they were copied in before that information was added upstream. Xmille forked from Mille around 1987 or so. I wrote the UI parts for the system I had at the time, which was running X10R4. A very basic port to X11 was done at some point, and that's what Debian has in the archive today.

At some point in the 90s, I ported Xmille to the Athena widget set, including several custom widgets in an Xaw extension library, Xkw. It's a lot better than the version in Debian, including displaying the cards correctly (the Debian version has some pretty bad color issues).

Here's what the current Debian version looks like:

Fixing The Bug

To fix the missing copyright and license information, I imported the mille source code into the "latest" Xaw-based version. The updated mille code had a number of bug fixes and improvements, along with the copyright information.

That should have been sufficient to resolve the issue and I could have constructed a suitable source package from whatever bits were needed and uploaded that as a replacement 'xmille' package.

However, at some later point, I had actually merged xmille into a larger package, 'kgames', which also included a number of other games, including Reversi, Dominoes, Cribbage and ten Solitaire/Patience variants. (as an aside, those last ten games formed the basis for my Patience Palm Pilot application, which seems to have inspired an Android App of the same name...)

So began my yak shaving holiday.

Building Kgames in 2020

Ok, so getting this old source code running should be easy, right? It's just a bunch of C code designed in the 80s and 90s to work on VAXen and their kin. How hard could it be?

  1. Everything was a 32-bit computer back then; pointers and ints were both 32 bits, so you could cast them with wild abandon and cause no problems. Today, testing revealed segfaults in some corners of the code.

  2. It's K&R C code. Remember that the first version of ANSI-C didn't come out until 1989, and it was years later that we could reliably expect to find an ANSI compiler with a random Unix box.

  3. It's X11 code. Fortunately (?), X11 hasn't changed since these applications were written, so at least that part still works just fine. Imagine trying to build Windows or Mac OS code from the early 90's on a modern OS...

I decided to dig in and add prototypes everywhere; that found a lot of pointer/int casting issues, as well as several lurking bugs where the code was just plain broken.

After a day or so, I had things building and running and was no longer hitting crashes.

Kgames 1.0 uploaded to Debian New Queue

With that done, I decided I could at least upload the working bits to the Debian archive and close the bug reported above. kgames 1.0-2 may eventually get into unstable, presumably once the Debian FTP team realizes just how important fixing this bug is. Or something.

Here's what xmille looks like in this version:

And here's my favorite solitaire variant too:

But They Look So Old

Yeah, Xaw applications have a rustic appearance which may appeal to some, but for people with higher resolution monitors and “well seasoned” eyesight, squinting at the tiny images and text makes it difficult to enjoy these games today.

How hard could it be to update them to use larger cards and scalable fonts?

Xkw version 2.0

I decided to dig in and start hacking the code, starting by adding new widgets to the Xkw library that used cairo for drawing instead of core X calls. Fortunately, the needs of the games were pretty limited, so I only needed to implement a handful of widgets:

  • KLabel. Shows a text string. It allows the string to be left, center or right justified. And that's about it.

  • KCommand. A push button, which uses KLabel for the underlying presentation.

  • KToggle. A push-on/push-off button, which uses KCommand for most of the implementation. Also supports 'radio groups' where pushing one on makes the others in the group turn off.

  • KMenuButton. A button for bringing up a menu widget; this is some pretty simple behavior built on top of KCommand.

  • KSimpleMenu, KSmeBSB, KSmeLine. These three create pop-up menus; KSimpleMenu creates a container which can hold any number of KSmeBSB (string) and KSmeLine (separator lines) objects.

  • KTextLine. A single line text entry widget.

The other Xkw widgets all got their rendering switched to using cairo, plus using double buffering to make updates look better.

SVG Playing Cards

Looking on Wikimedia, I found a page referencing a large number of playing cards in SVG form. That led me to Adrian Kennard's playing card web site, which let me customize and download a deck of cards licensed under the CC0 Public Domain license.

With these cards, I set about rewriting the Xkw playing card widget, stripping out three different versions of bitmap playing cards and replacing them with just these new SVG versions.

SVG Xmille Cards

Ok, so getting regular playing cards was good, but the original goal was to update Xmille, and that has cards hand drawn by me. I could just use those images, import them into cairo and let it scale them to suit on the screen. I decided to experiment with inkscape's bitmap tracing code to see what it could do with them.

First, I had to get them into a format that inkscape could parse. That turned out to be a bit tricky; the original format is a set of X bitmap layers, each painting a single color. I ended up hacking the Xmille source code to generate the images using X, then fetching them with XGetImage and walking them to construct XPM format files which could then be fed into the portable bitmap tools to create PNG files that inkscape could handle.

The resulting images have a certain charm:

I did replace the text in the images to make it readable, otherwise these are untouched from what inkscape generated.

The Results

Remember that all of these are applications built using the venerable X toolkit; there are still some non-antialiased graphics visible as the shaped buttons use the X Shape extension. But, all rendering is now done with cairo, so it's all anti-aliased and all scalable.

Here's what Xmille looks like after the upgrades:

And here's spider:

Once kgames 1.0 reaches Debian unstable, I'll upload these new versions.

December 30, 2020

A Different Sort Of Optimization

There’s a number of strange hacks in zink that provide compatibility for some of the layers in mesa. One of these hacks is the NIR pass used for managing non-constant UBO/SSBO array indexing, made necessary because SPIRV operates by directly accessing variables, so a non-constant index is impossible: when generating the SPIRV there’s no way to know which variable is being accessed.

In its current state from zink-wip it looks like this:

static nir_ssa_def *
recursive_generate_bo_ssa_def(nir_builder *b, nir_intrinsic_instr *instr, nir_ssa_def *index, unsigned start, unsigned end)
{
   if (start == end - 1) {
      /* block index src is 1 for this op */
      unsigned block_idx = instr->intrinsic == nir_intrinsic_store_ssbo;
      nir_intrinsic_instr *new_instr = nir_intrinsic_instr_create(b->shader, instr->intrinsic);
      new_instr->src[block_idx] = nir_src_for_ssa(nir_imm_int(b, start));
      for (unsigned i = 0; i < nir_intrinsic_infos[instr->intrinsic].num_srcs; i++) {
         if (i != block_idx)
            nir_src_copy(&new_instr->src[i], &instr->src[i], &new_instr->instr);
      }
      if (instr->intrinsic != nir_intrinsic_load_ubo_vec4) {
         nir_intrinsic_set_align(new_instr, nir_intrinsic_align_mul(instr), nir_intrinsic_align_offset(instr));
         if (instr->intrinsic != nir_intrinsic_load_ssbo)
            nir_intrinsic_set_range(new_instr, nir_intrinsic_range(instr));
      }
      new_instr->num_components = instr->num_components;
      if (instr->intrinsic != nir_intrinsic_store_ssbo)
         nir_ssa_dest_init(&new_instr->instr, &new_instr->dest,
                           nir_dest_num_components(instr->dest),
                           nir_dest_bit_size(instr->dest), NULL);
      nir_builder_instr_insert(b, &new_instr->instr);
      return &new_instr->dest.ssa;
   }

   unsigned mid = start + (end - start) / 2;
   return nir_build_alu(b, nir_op_bcsel, nir_build_alu(b, nir_op_ilt, index, nir_imm_int(b, mid), NULL, NULL),
      recursive_generate_bo_ssa_def(b, instr, index, start, mid),
      recursive_generate_bo_ssa_def(b, instr, index, mid, end),
      NULL
   );
}

static bool
lower_dynamic_bo_access_instr(nir_intrinsic_instr *instr, nir_builder *b)
{
   if (instr->intrinsic != nir_intrinsic_load_ubo &&
       instr->intrinsic != nir_intrinsic_load_ubo_vec4 &&
       instr->intrinsic != nir_intrinsic_get_ssbo_size &&
       instr->intrinsic != nir_intrinsic_load_ssbo &&
       instr->intrinsic != nir_intrinsic_store_ssbo)
      return false;
   /* block index src is 1 for this op */
   unsigned block_idx = instr->intrinsic == nir_intrinsic_store_ssbo;
   if (nir_src_is_const(instr->src[block_idx]))
      return false;
   b->cursor = nir_after_instr(&instr->instr);
   bool ssbo_mode = instr->intrinsic != nir_intrinsic_load_ubo && instr->intrinsic != nir_intrinsic_load_ubo_vec4;
   unsigned first_idx = 0, last_idx;
   if (ssbo_mode) {
      last_idx = first_idx + b->shader->info.num_ssbos;
   } else {
      /* skip 0 index if uniform_0 is one we created previously */
      first_idx = !b->shader->info.first_ubo_is_default_ubo;
      last_idx = first_idx + b->shader->info.num_ubos;
   }

   /* now create the composite dest with a bcsel chain based on the original value */
   nir_ssa_def *new_dest = recursive_generate_bo_ssa_def(b, instr,
                                                       instr->src[block_idx].ssa,
                                                       first_idx, last_idx);

   if (instr->intrinsic != nir_intrinsic_store_ssbo)
      /* now use the composite dest in all cases where the original dest (from the dynamic index)
       * was used and remove the dynamically-indexed load_*bo instruction
       */
      nir_ssa_def_rewrite_uses_after(&instr->dest.ssa, nir_src_for_ssa(new_dest), &instr->instr);

   nir_instr_remove(&instr->instr);
   return true;
}

In brief, lower_dynamic_bo_access_instr() is used to detect UBO/SSBO instructions with a non-constant index, e.g., array_of_ubos[n] where n is a uniform. Following this, recursive_generate_bo_ssa_def() generates a chain of bcsel instructions which checks the non-constant array index against constant values and then, upon matching, uses the value loaded from that UBO.

Without going into more depth about the exact mechanics of this pass for the sake of time, I’ll instead provide a better explanation by example. Here’s a stripped down version of one of the simplest piglit shader tests for non-constant uniform indexing (fs-array-nonconst):

[require]
GLSL >= 1.50
GL_ARB_gpu_shader5

[vertex shader passthrough]

[fragment shader]
#version 150
#extension GL_ARB_gpu_shader5: require

uniform block {
	vec4 color[2];
} arr[4];

uniform int n;
uniform int m;

out vec4 color;

void main()
{
	color = arr[n].color[m];
}

[test]
clear color 0.2 0.2 0.2 0.2
clear

ubo array index 0
uniform vec4 block.color[0] 0.0 1.0 1.0 0.0
uniform vec4 block.color[1] 1.0 0.0 0.0 0.0

uniform int n 0
uniform int m 1
draw rect -1 -1 1 1

relative probe rect rgb (0.0, 0.0, 0.5, 0.5) (1.0, 0.0, 0.0)

Using two uniforms, a color is indexed from a UBO as the FS output.

In the currently shipping version of zink, the final NIR output from ANV of the fragment shader might look something like this:

shader: MESA_SHADER_FRAGMENT
inputs: 0
outputs: 0
uniforms: 8
ubos: 5
shared: 0
decl_var shader_out INTERP_MODE_NONE vec4 color (FRAG_RESULT_DATA0.xyzw, 8, 0)
decl_function main (0 params)

impl main {
	block block_0:
	/* preds: */
	vec1 32 ssa_0 = load_const (0x00000002 /* 0.000000 */)
	vec1 32 ssa_1 = load_const (0x00000001 /* 0.000000 */)
	vec1 32 ssa_2 = load_const (0x00000004 /* 0.000000 */)
	vec1 32 ssa_3 = load_const (0x00000003 /* 0.000000 */)
	vec1 32 ssa_4 = load_const (0x00000010 /* 0.000000 */)
	vec1 32 ssa_5 = intrinsic load_ubo (ssa_1, ssa_4) (0, 1073741824, 16, 0, -1) /* access=0 */ /* align_mul=1073741824 */ /* align_offset=16 */ /* range_base=0 */ /* range=-1 */
	vec1 32 ssa_6 = load_const (0x00000000 /* 0.000000 */)
	vec1 32 ssa_7 = intrinsic load_ubo (ssa_1, ssa_6) (0, 1073741824, 0, 0, -1) /* access=0 */ /* align_mul=1073741824 */ /* align_offset=0 */ /* range_base=0 */ /* range=-1 */
	vec1 32 ssa_8 = umin ssa_7, ssa_3
	vec1 32 ssa_9 = ishl ssa_5, ssa_2
	vec1 32 ssa_10 = iadd ssa_8, ssa_1
	vec1 32 ssa_11 = load_const (0xfffffffc /* -nan */)
	vec1 32 ssa_12 = iand ssa_9, ssa_11
	vec1 32 ssa_13 = load_const (0x00000005 /* 0.000000 */)
	vec4 32 ssa_14 = intrinsic load_ubo (ssa_13, ssa_12) (0, 4, 0, 0, -1) /* access=0 */ /* align_mul=4 */ /* align_offset=0 */ /* range_base=0 */ /* range=-1 */
	vec4 32 ssa_23 = intrinsic load_ubo (ssa_2, ssa_12) (0, 4, 0, 0, -1) /* access=0 */ /* align_mul=4 */ /* align_offset=0 */ /* range_base=0 */ /* range=-1 */
	vec1 32 ssa_27 = ilt32 ssa_10, ssa_2
	vec1 32 ssa_28 = b32csel ssa_27, ssa_23.x, ssa_14.x
	vec1 32 ssa_29 = b32csel ssa_27, ssa_23.y, ssa_14.y
	vec1 32 ssa_30 = b32csel ssa_27, ssa_23.z, ssa_14.z
	vec1 32 ssa_31 = b32csel ssa_27, ssa_23.w, ssa_14.w
	vec4 32 ssa_32 = intrinsic load_ubo (ssa_3, ssa_12) (0, 4, 0, 0, -1) /* access=0 */ /* align_mul=4 */ /* align_offset=0 */ /* range_base=0 */ /* range=-1 */
	vec1 32 ssa_36 = ilt32 ssa_10, ssa_3
	vec1 32 ssa_37 = b32csel ssa_36, ssa_32.x, ssa_28
	vec1 32 ssa_38 = b32csel ssa_36, ssa_32.y, ssa_29
	vec1 32 ssa_39 = b32csel ssa_36, ssa_32.z, ssa_30
	vec1 32 ssa_40 = b32csel ssa_36, ssa_32.w, ssa_31
	vec4 32 ssa_41 = intrinsic load_ubo (ssa_0, ssa_12) (0, 4, 0, 0, -1) /* access=0 */ /* align_mul=4 */ /* align_offset=0 */ /* range_base=0 */ /* range=-1 */
	vec4 32 ssa_45 = intrinsic load_ubo (ssa_1, ssa_12) (0, 4, 0, 0, -1) /* access=0 */ /* align_mul=4 */ /* align_offset=0 */ /* range_base=0 */ /* range=-1 */
	vec1 32 ssa_49 = ilt32 ssa_10, ssa_1
	vec1 32 ssa_50 = b32csel ssa_49, ssa_45.x, ssa_41.x
	vec1 32 ssa_51 = b32csel ssa_49, ssa_45.y, ssa_41.y
	vec1 32 ssa_52 = b32csel ssa_49, ssa_45.z, ssa_41.z
	vec1 32 ssa_53 = b32csel ssa_49, ssa_45.w, ssa_41.w
	vec1 32 ssa_54 = ilt32 ssa_10, ssa_0
	vec1 32 ssa_55 = b32csel ssa_54, ssa_50, ssa_37
	vec1 32 ssa_56 = b32csel ssa_54, ssa_51, ssa_38
	vec1 32 ssa_57 = b32csel ssa_54, ssa_52, ssa_39
	vec1 32 ssa_58 = b32csel ssa_54, ssa_53, ssa_40
	vec4 32 ssa_59 = vec4 ssa_55, ssa_56, ssa_57, ssa_58
	intrinsic store_output (ssa_59, ssa_6) (8, 15, 0, 160, 132) /* base=8 */ /* wrmask=xyzw */ /* component=0 */ /* src_type=float32 */ /* location=4 slots=1 */	/* color */
	/* succs: block_1 */
	block block_1:
}

All the b32csel ops are generated by the above NIR pass, with each one “checking” a non-constant index against a constant value. At the end of the shader, the store_output uses the correct values, but this is pretty gross.

And Then Inlining

Some time ago, noted Gallium professor Marek Olšák authored a series which provided a codepath for inlining uniform data directly into shaders. The process for this is two steps:

  • Detect and designate uniforms to be inlined
  • Replace shader loads of these uniforms with the actual uniforms

The purpose of this is specifically to eliminate complex conditionals resulting from uniform data, so the detection NIR pass specifically looks for conditionals which use only constants and uniform data as the sources. Something like if (uniform_variable_expression) then becomes if (constant_value_expression) which can then be optimized out, greatly simplifying the eventual shader instructions.

Looking at the above NIR, this seems like a good target for inlining as well, so I took my hatchet to the detection pass and added in support for the bcsel and fcsel ALU ops when their result sources were the results of intrinsics, e.g., loads. The results are good to say the least:

shader: MESA_SHADER_FRAGMENT
inputs: 0
outputs: 0
uniforms: 8
ubos: 5
shared: 0
decl_var shader_out INTERP_MODE_NONE vec4 color (FRAG_RESULT_DATA0.xyzw, 8, 0)
decl_function main (0 params)

impl main {
	block block_0:
	/* preds: */
	vec1 32 ssa_0 = load_const (0x00000001 /* 0.000000 */)
	vec1 32 ssa_1 = load_const (0x00000004 /* 0.000000 */)
	vec1 32 ssa_2 = load_const (0x00000010 /* 0.000000 */)
	vec1 32 ssa_3 = intrinsic load_ubo (ssa_0, ssa_2) (0, 1073741824, 16, 0, -1) /* access=0 */ /* align_mul=1073741824 */ /* align_offset=16 */ /* range_base=0 */ /* range=-1 */
	vec1 32 ssa_4 = ishl ssa_3, ssa_1
	vec1 32 ssa_5 = load_const (0x00000002 /* 0.000000 */)
	vec1 32 ssa_6 = load_const (0xfffffffc /* -nan */)
	vec1 32 ssa_7 = iand ssa_4, ssa_6
	vec1 32 ssa_8 = load_const (0x00000000 /* 0.000000 */)
	vec4 32 ssa_9 = intrinsic load_ubo (ssa_5, ssa_7) (0, 4, 0, 0, -1) /* access=0 */ /* align_mul=4 */ /* align_offset=0 */ /* range_base=0 */ /* range=-1 */
	intrinsic store_output (ssa_9, ssa_8) (8, 15, 0, 160, 132) /* base=8 */ /* wrmask=xyzw */ /* component=0 */ /* src_type=float32 */ /* location=4 slots=1 */	/* color */
	/* succs: block_1 */
	block block_1:
}

The second load_ubo here is using the inlined uniform data to determine that it needs to load the 0 index, greatly reducing the shader’s complexity.

This still needs a bit of tuning, but I’m hoping to get it finalized soonish.

December 29, 2020

A New Sync

For some time now I’ve been talking about zink’s lack of WSI and the forthcoming, near-messianic work by FPS sherpa Adam Jackson to implement it.

This is an extremely challenging project, however, and some work needs to be done in the meanwhile to ensure that zink doesn’t drive off a cliff.

Swapchain Strikes Back

Any swapchain master is already well acquainted with the mechanism by which images are displayed on the screen, but the gist of it for anyone unfamiliar is that there are N image resources that are swapped back and forth (2 for double-buffered, 3 for triple-buffered, …). An image being rendered to is a backbuffer, and an image being displayed is a frontbuffer.

Ideally, a frontbuffer shouldn’t be drawn to while it’s in the process of being presented since such an action obliterates the app’s usefulness. The knowledge of exactly when a resource is done presenting is gained through WSI. On Xorg, however, it’s a bit tricky, to say the least. DRI3 is intended to address the underlying problems there with the XPresent extension, and the Mesa DRI frontend utilizes this to determine when an image is safe to use.

All this is great, and I’m sure it works terrifically in other cases, but zink is not like other cases. Zink lacks direct WSI integration. Under Xorg, this means it relies entirely on the DRI frontend to determine when it’s safe to start rendering onto an image resource.

But what if the DRI frontend gets it wrong?

Indeed, due to quirks in the protocol/xserver, XPresent idle events can be received for a “presented” image immediately, even if it’s still in use and has not finished presenting.

scumbag-xorg.png

In apps like SuperTuxKart, this results in insane flickering due to always rendering over the current frame before it’s finished being presented.

Return Of The Poll

To solve this problem, a wise, reclusive ghostwriter took time off from being at his local pub to offer me a suggestion:

Why not just rip the implicit fence of the DMAbuf out of the image object?

It was a great idea. But what did this pub enthusiast mean?

In short, WSI handles this problem by internally poll()ing on the image resource’s underlying file descriptor. When there are no more events to poll() for, the image is safe to write to.

So now it’s back to the (semi) basics of programming. First, get the file descriptor of the image using normal Vulkan function calls:

static int
get_resource_fd(struct zink_screen *screen, struct zink_resource *res)
{
   VkMemoryGetFdInfoKHR fd_info = {};
   int fd;
   fd_info.sType = VK_STRUCTURE_TYPE_MEMORY_GET_FD_INFO_KHR;
   fd_info.memory = res->obj->mem;
   fd_info.handleType = VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD_BIT;
   VkResult result = (*screen->vk_GetMemoryFdKHR)(screen->dev, &fd_info, &fd);
   return result == VK_SUCCESS ? fd : -1;
}

This provides a file descriptor that can be used for more nefarious purposes. Any time the gallium pipe_context::flush hook is called, the flushed resource (swapchain image) must be synchronized by poll()ing as in this snippet:

static void
zink_flush(struct pipe_context *pctx,
           struct pipe_fence_handle **pfence,
           enum pipe_flush_flags flags)
{
   struct zink_context *ctx = zink_context(pctx);

   if (flags & PIPE_FLUSH_END_OF_FRAME && ctx->flush_res) {
      if (ctx->flush_res->obj->fd != -1) {
          /* FIXME: remove this garbage once we get wsi */
          struct pollfd p = {};
          p.fd = ctx->flush_res->obj->fd;
          p.events = POLLOUT;
          assert(poll(&p, 1, -1) == 1);
          assert(p.revents & POLLOUT);
      }
      ctx->flush_res = NULL;
   }

The POLLOUT event flag is used to determine when it’s safe to write. If there’s no pending usage during present then this will return immediately, otherwise it will wait until the image is safe to use.

hacks++.

December 24, 2020

ETOOMUCHREBASE

For real though, I’ve spent literal hours over the past week just rebasing stuff and managing conflicts. And then rebasing again after diffing against a reference commit when I inevitably discover that I fucked up the merge somehow.

But now the rebasing is done for a few minutes while I run more unit tests, so it’s finally time to blog.

It’s been a busy week. Nothing I’ve done has been very interesting. Lots of stabilizing and refactoring.

The blogging must continue, however, so here goes.

The Return of QBOs

Many months ago I blogged about QBOs.

Maybe.

Maybe I didn’t.

QBOs are Query Buffer Objects, where the result of a given query is stored into a buffer. This is great for performance since it doesn’t require any stalling in order to read the query result directly.

Conceptually, anyway.

At present, zink has problems making this efficient for many types of queries due to the mismatch between GL query data and Vulkan query data, and there’s a need to manually read it back and parse it with the CPU.

This is consistent with how zink manages non-QBO queries:

  • start query
  • end query
  • stall GPU
  • read query results back for user
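
That last stall-and-read step boils down to a blocking vkGetQueryPoolResults() call, roughly like this, where dev, pool, and query_id stand in for the real objects:

uint64_t result = 0;
/* WAIT_BIT blocks until the result lands, i.e., the CPU stalls on the GPU */
vkGetQueryPoolResults(dev, pool, query_id, 1,
                      sizeof(result), &result, sizeof(result),
                      VK_QUERY_RESULT_64_BIT | VK_QUERY_RESULT_WAIT_BIT);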

As I’ve said many times along the way, the goal for zink has been to get the features in place and working first and then optimize later.

It’s now later, and query bottlenecking is actually hurting performance in some apps (e.g., RPCS3).

Memory++

Some profiling was done recently by bleeding edge tester Witold Baryluk, and it turns out that zink is using slightly less GPU memory than some native drivers, though it’s also using slightly more than some other native drivers:

lowmem.png

Looking at the right side of the graph, it’s obvious that there’s still some VRAM available to be used, which means there’s some VRAM available to use in optimizations.

As such, I decided to rewrite the query internals to have every query be a QBO internally, consistent with the RadeonSI model. While it does use a tiny bit more VRAM due to needing to allocate the backing buffers, the benefit of this is that now all query result data is copied to a buffer as soon as the query stops, so from an API perspective, this means that the result becomes available as soon as the backing buffer becomes idle.

It also means that any time actual QBOs are used (which is all the time for competent apps), I’ll eventually have the ability to asynchronously post the result data from a query onto a user buffer without triggering a stall.

Functionally, this isn’t a super complex maneuver: I’ve already got a utility function that performs a vkCmdCopyQueryPoolResults for regular QBO handling, so repurposing this to be called any time a query was ended, combined with modifying the parsing function to first map the internal buffer, was sufficient.
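
For reference, the core of that utility function is a vkCmdCopyQueryPoolResults call along these lines (a simplified sketch, not the actual zink helper; cmdbuf, pool, buffer, and the offsets are placeholders):

/* copy the results for `count` queries into an internal buffer as soon as the
 * query ends; later readback then only has to wait for the buffer to be idle */
vkCmdCopyQueryPoolResults(cmdbuf, pool, first_query, count,
                          buffer, offset,
                          sizeof(uint64_t),            /* stride per query */
                          VK_QUERY_RESULT_64_BIT |
                          VK_QUERY_RESULT_WAIT_BIT);   /* let the GPU do the waiting */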

In the end, the query code is now a bit more uniform, and in the future I can use a compute shader to keep everything on GPU without needing to do any manual readback.

December 18, 2020

memcpy harder

Amidst the flurry of patches being smashed into the repo today I thought I’d talk about memcpy. Yes, I’m referring to the same function everyone knows and loves.

The final frontier of RPCS3 performance was memcpy. With ARGB emulation in place, my perf results were looking like this on RADV:

zink.png

Pushing a hard 12fps, this was a big improvement from before, but it seemed a bit low. Not having any prior experience in PS3 emulation, I started wondering whether this was the limit of things in the current year.

Meanwhile, RadeonSI was getting significantly higher performance with a graph like this:

radeonsi.png

Clearly performance is capable of being much higher, so why is zink so much worse at memcpy?

Further Exploration…

Along with the wisdom of the great sage Dave Airlie led me to checking out the resource creation code in zink, specifically the types of memory that are used for allocation. Vulkan supports a range of memory types for the discerning allocations connoisseur, but the driver was jamming everything into VK_MEMORY_PROPERTY_HOST_COHERENT_BIT. This is great for ease of use, as the contents of buffers are always synchronized between CPU and GPU with no additional work needed, but it ends up being massively slower for any kind of direct copy operations of the backing memory, e.g., anything PBO-related.

What should really be used is VK_MEMORY_PROPERTY_HOST_CACHED_BIT whenever possible. This requires some additional legwork to properly invalidate/flush the memory used by any vkMapMemory calls, but the results were well worth the effort:

zink2.png

And performance was a buttery smooth 30fps (the cap) as well:

bioshock.png

Mission accomplished.
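
For reference, the “additional legwork” mentioned above is just explicit flush/invalidate calls around any CPU mapping of the now non-coherent memory. A rough sketch, with dev and mem standing in for the real device and allocation:

VkMappedMemoryRange range = {
   .sType = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE,
   .memory = mem,
   .offset = 0,
   .size = VK_WHOLE_SIZE,
};
void *ptr;
vkMapMemory(dev, mem, 0, VK_WHOLE_SIZE, 0, &ptr);
vkInvalidateMappedMemoryRanges(dev, 1, &range);   /* before the CPU reads */
/* ...read and/or write the mapping... */
vkFlushMappedMemoryRanges(dev, 1, &range);        /* after the CPU writes */
vkUnmapMemory(dev, mem);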

December 17, 2020

The Name Of The Game

…is emulation.

With blit-based transfers working, I checked my RPCS3 flamegraph again to see the massive performance improvements that I’d no doubt be seeing:

fg.png

Except there were none.

Closer examination revealed that this was due to the app using ARGB formats for its PBOs. Referencing that against VkFormat led to a further problem: ARGB and ABGR are not explicitly supported by Vulkan.

This wasn’t exactly going to be an easy fix, but it wouldn’t prove too challenging either.

Swizzle Me Timbers

Yes, swizzles. In layman’s terms, a swizzle is mapping a given component to another component, for example in an RGBA-ordered format, using a WXYZ swizzle would result in a reordering of the 0-indexed components to 3012, or ARGB.

Gallium, when using blit-based transfers, provides a lot of opportunities to use swizzles, specifically by having a lot of blits go through a u_blitter codepath that translates blits into quad draws with a sampled image.

Thus, by applying an ARGB/ABGR emulation swizzle to each of these codepaths, I can drop in native-ish support under the hood of the driver by internally reporting ARGB as RGBA and ABGR as BGRA.

In pseudocode, the ARGB path looks something like this:

 unsigned char dst_swiz[4];

 if (src_is_argb) {
    unsigned char reverse_alpha[] = {
       PIPE_SWIZZLE_Y,
       PIPE_SWIZZLE_Z,
       PIPE_SWIZZLE_W,
       PIPE_SWIZZLE_X,
    };
    /* compose swizzle with alpha at the end */
    util_format_compose_swizzles(original_swizzle, reverse_alpha, dst_swiz);
 } else if (dst_is_argb) {
    unsigned char reverse_alpha[] = {
       PIPE_SWIZZLE_W,
       PIPE_SWIZZLE_X,
       PIPE_SWIZZLE_Y,
       PIPE_SWIZZLE_Z,
    };
    /* compose swizzle with alpha at the start */
    util_format_compose_swizzles(original_swizzle, reverse_alpha, dst_swiz);
}

The original swizzle is composed with the alpha-reversing swizzle to generate a swizzle that translates the resource’s internal ARGB data into RGBA data (or vice versa) like the Vulkan driver is expecting it to be.

From there, the only restriction is that this emulation is prohibited in texel buffers due to there not being a direct method of applying a swizzle to that codepath. Sure, I could do the swizzle in the shader as a variant, but then this leads to more shader variants and more pipeline objects, so it’s simpler to just claim no support here and let gallium figure things out using other codepaths.

Progress?

Would this be enough to finally get some frames moving?

Find out tomorrow in the conclusion to this SGC miniseries.

December 16, 2020

To Begin

This is the journey of how zink-wip went from 0 fps in RPCS3 to a bit more than that. Quite a bit more, in fact, if you’re using RADV.

As all new app tests begin, this one started with firing up the app. Since there are no homebrew games available (that I could find), I decided to pick something that I owned and was familiar with. Namely a demo of Bioshock.

It started up nicely enough:

title.png

But then I started a game and things got rough:

maximum-oof.png

Yikes.

Another Overly Technical Post

One of the fundamentals of a graphics driver is that the GPU should be handling as much work as possible. This means that, for example, any time an application is using a Pixel Buffer Object (PBO), the GPU should be used for uploading and downloading the pixel buffer.

Why are you suddenly mentioning PBOs, you might be asking.

Well, let’s check out what’s going on using a perf flamegraph:

fg.png

The driver in this case is hitting a software path for copying pixels to and from a PBO, effectively doing full-frame memcpy operations multiple times each frame. This is on the CPU, which is obviously not great for performance. As above, ideally this should be moved to the GPU.
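
For context, this is roughly what an app-side PBO download looks like in GL (a generic sketch, not RPCS3’s code; width and height are placeholders):

GLuint pbo;
glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glBufferData(GL_PIXEL_PACK_BUFFER, width * height * 4, NULL, GL_STREAM_READ);
/* with a PIXEL_PACK_BUFFER bound, the last argument is an offset into the
 * PBO, so the driver is free to do this copy entirely on the GPU */
glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, (void *)0);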

Gallium provides a pipe cap for this: PIPE_CAP_PREFER_BLIT_BASED_TEXTURE_TRANSFER

Zink doesn’t use this in master right now, which naturally led me down the path of enabling it.
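
Flipping the cap on is just a matter of returning 1 for it from the driver’s get_param hook, roughly like so (a sketch of the usual gallium pattern, not the exact zink change):

static int
zink_get_param(struct pipe_screen *pscreen, enum pipe_cap param)
{
   switch (param) {
   case PIPE_CAP_PREFER_BLIT_BASED_TEXTURE_TRANSFER:
      return 1;   /* route PBO-style transfers through blits on the GPU */
   /* ...all the other caps... */
   default:
      return 0;
   }
}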

There were problems.

Lots of problems.

The first problem was that suddenly I had an infinite number of failing unit tests. Confusing for sure. Some intensive debugging led me to this block of code in zink which is used for directly mapping a rectangular region of image resource memory:

VkResult result = vkMapMemory(screen->dev, res->mem, res->offset, res->size, 0, &ptr);
if (result != VK_SUCCESS)
   return NULL;
VkImageSubresource isr = {
   res->aspect,
   level,
   0
};
VkSubresourceLayout srl;
vkGetImageSubresourceLayout(screen->dev, res->image, &isr, &srl);
ptr = ((uint8_t *)ptr) + box->z * srl.depthPitch +
                         box->y * srl.rowPitch +
                         box->x;

Suspicious. box in this case represents the region to be mapped, yet members of VkSubresourceLayout like offset aren’t being applied to handle the level that’s intended to be loaded, nor is this taking into account the bits-per-pixel of the image. In fact, this is always assuming that each x coordinate unit equals a single byte.

The fully corrected version is more like this:

VkResult result = vkMapMemory(screen->dev, res->mem, res->offset, res->size, 0, &ptr);
if (result != VK_SUCCESS)
   return NULL;
VkImageSubresource isr = {
   res->aspect,
   level,
   0
};
VkSubresourceLayout srl;
vkGetImageSubresourceLayout(screen->dev, res->image, &isr, &srl);
const struct util_format_description *desc = util_format_description(res->base.format);
unsigned offset = srl.offset +
                  box->z * srl.depthPitch +
                  (box->y / desc->block.height) * srl.rowPitch +
                  (box->x / desc->block.width) * (desc->block.bits / 8);
ptr = ((uint8_t *)ptr) + offset;

It turns out that no unit test had previously passed a nonzero x coordinate for a mapping region or tried to map a nonzero level, so this was never exposed as being broken.

Imagine that.

December 15, 2020

In Many Forms

One of the people in the #zink IRC channel has been posing an interesting challenge for me in the form of trying to run every possible console emulator on my zink-wip branch.

This has raised a number of issues with various parts of the driver, so expect a number of posts on the topic.

Threads

First up was the citra emulator for the 3DS. This is an interesting app for a number of reasons, the least of which is because it uses a ton of threads, including a separate one for GL, which put my own work to the test.

Suffice to say that my initial implementation of u_threaded_context needed some work.

One of the main premises of the threaded context is this idea of an asynchronous fence object. The threaded context will create these in a thread and provide them to the driver in the pipe_context::flush hook, but only in some cases; at other times, the fence object provided will just be a “regular” synchronous one.

The trick here is that the driver itself has a fence for managing synchronization, and the threaded context can create N of its own fences to manage the driver’s fence, all of which must potentially work when called in a random order and from either the “main” thread or the driver-specific thread.

There’s too much code involved here to be providing any samples here, but I’ll go over the basics of it just for posterity. Initially, I had implemented this entirely on the zink side such that each zink fence had references to all the tc fences in a chain, and fence-related resources were generally managed on the last fence in the chain. I had two separate object types for this: one for zink fences and one for tc fences. The former contained all the required vulkan-specific objects while the latter contained just enough info to work with tc.

This was sort of fine, and it worked for many things, the least of which was all my benchmarking.

The problem was that a desync could occur if one of the tc fences was destroyed sufficiently later than its zink fence, leading to an eventual crash. This was never triggered by unit tests nor basic app usage, but something like citra with its many threads managed to hit it consistently and quickly.

Thus began the day-long process of rewriting the tc implementation to a much-improved 2.0 version. The primary difference in this design model is that I worked a bit closer to the original RadeonSI implementation, having only a single externally-used fence object type for both gallium as well as tc and creating them for the zink fence object without any sort of cross-referencing. This meant that rather than having 1 zink fence with references to N tc fences, I now had N tc fences each with a reference to 1 zink fence.

This simplified the code a bit in other ways after the rewrite, as the gallium/tc fence objects were now entirely independent. The one small catch was that zink fences get recycled, meaning that in theory a gallium/tc fence could have a reference to a zink fence that it no longer was managing, but this was simple enough to avoid by gating all tc fence functionality on a comparison between its stored fence id and the id of the fence that it had a reference to. If they failed to match, the gallium/tc fence had already completed.
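
As a purely illustrative sketch of that gating (these are not the real zink structs or field names):

struct zink_fence {
   uint32_t submit_id;         /* incremented every time the fence is recycled */
};

struct tc_fence {
   struct zink_fence *fence;   /* shared, recycled zink fence */
   uint32_t submit_id;         /* id of the submission this tc fence tracks */
};

static bool
tc_fence_is_current(const struct tc_fence *tcf)
{
   /* if the zink fence has been recycled for a newer submission, whatever
    * this tc fence was tracking has long since completed */
   return tcf->submit_id == tcf->fence->submit_id;
}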

Stability++

Things seem like they’re in better shape now with regards to stability. It’s become more challenging than ever to debug the driver with threading enabled, but that’s just one of the benefits that threads provide.

Next time I’ll begin a series on how to get a mesa driver from less than 1fps to native performance in RPCS3.

December 14, 2020

New Week, New Idea

I have to change things up.

Historically I’ve spent a day working on zink and then written up a post at the end. The problem with this approach is obvious: a lot of times when I get to the end of the day I’m just too mentally drained to think about anything else and want to just pass out on my couch.

So now I’m going to try inverting my schedule: as soon as I get up, it’s now going to be blog time.

I’m not even fully awake right now, so this is definitely going to be interesting.

Stencil Sampling

Today’s exploratory morning post is about sampling from stencil buffers.

What is sampling from stencil buffers, some of you might be asking.

Sampling in general is the reading of data from a resource. It’s most commonly used as an alternative to using a Copy command for transferring some amount of data from one resource to another in a specified way.

For example, extracting only stencil data from a resource which combines both depth and stencil data. In zink, this is an important operation because none of the Copy commands support multisampled resources containing both depth and stencil data, an OpenGL feature that the unit tests most certainly cover.

As with all things, zink has a tough time with this.

Sampling Basics

For the purpose of this post, I’m only going to be talking about sampling from image resources. Sampling from buffer resources is certainly possible and useful as well, but there’s just less that can go wrong for that case.

The general process of a sampling-based copy operation in Gallium-based drivers is as follows:

  • have resource src which contains some amount of data
  • have resource dst which is the intended destination for the data from src
  • bind src as a “sampler view”, which is essentially a combination of info that determines how data will be sampled from a resource
  • bind dst as an output target (e.g., a framebuffer attachment)
  • bind a fragment shader containing a sampler type that samples from the bound sampler view and writes to the output target (either by gl_FragColor output or imageStore)
  • dump some vertices into the pipeline and blammo, missionaccomplished.jpg

In the case of stencil sampling, zink has issues with the third step here.

The Code

Here’s what we’ve currently got shipping in the driver for the relevant part of creating image sampler views:

VkImageViewCreateInfo ivci = {};
ivci.sType = VK_STRUCTURE_TYPE_IMAGE_VIEW_CREATE_INFO;
ivci.image = res->image;
ivci.viewType = image_view_type(state->target);
ivci.format = zink_get_format(screen, state->format);
assert(ivci.format);
ivci.components.r = component_mapping(state->swizzle_r);
ivci.components.g = component_mapping(state->swizzle_g);
ivci.components.b = component_mapping(state->swizzle_b);
ivci.components.a = component_mapping(state->swizzle_a);

ivci.subresourceRange.aspectMask = sampler_aspect_from_format(state->format);
ivci.subresourceRange.baseMipLevel = state->u.tex.first_level;
ivci.subresourceRange.baseArrayLayer = state->u.tex.first_layer;
ivci.subresourceRange.levelCount = state->u.tex.last_level - state->u.tex.first_level + 1;
ivci.subresourceRange.layerCount = state->u.tex.last_layer - state->u.tex.first_layer + 1;

err = vkCreateImageView(screen->dev, &ivci, NULL, &sampler_view->image_view);

Filling in some gaps:

  • res is the image being sampled from
  • image_view_type() converts Gallium texture types (e.g., 1D, 2D, 2D_ARRAY, …) to the corresponding Vulkan type
  • zink_get_format() converts a Gallium image format to a usable Vulkan one
  • component_mapping() converts a Gallium swizzle to a Vulkan one (swizzles determine which channels in the sample operation are mapped from the source to the destination)
  • sampler_aspect_from_format() infers VkImageAspectFlags from a Gallium format

The Problem

Regarding sampler descriptors, the Vulkan spec states: “If imageView is created from a depth/stencil image, the aspectMask used to create the imageView must include either VK_IMAGE_ASPECT_DEPTH_BIT or VK_IMAGE_ASPECT_STENCIL_BIT but not both.”

This means that for combined depth+stencil resources, only the depth or stencil aspect can be specified but not both. As Gallium presents drivers with a format and swizzle based on the data being sampled from the image, this poses a problem since 1) the format provided will usually map to something like VK_FORMAT_D32_SFLOAT_S8_UINT and 2) the swizzle provided will be based on this format.

But if zink can only specify one of the aspects, this poses a problem.

The Solution

The format being sampled must also match the aspect type, and VK_FORMAT_D32_SFLOAT_S8_UINT is obviously not a pure stencil format. This means that any time zink infers a stencil-only aspect image format like PIPE_FORMAT_X32_S8X24_UINT, which is a two channel format where the depth channel is ignored, the format passed in VkImageViewCreateInfo has to just be the stencil format being sampled. Helpfully, this will always be VK_FORMAT_S8_UINT.

So now the code would look like this:

VkImageViewCreateInfo ivci = {};
ivci.sType = VK_STRUCTURE_TYPE_IMAGE_VIEW_CREATE_INFO;
ivci.image = res->obj->image;
ivci.viewType = image_view_type(state->target);

ivci.components.r = component_mapping(state->swizzle_r);
ivci.components.g = component_mapping(state->swizzle_g);
ivci.components.b = component_mapping(state->swizzle_b);
ivci.components.a = component_mapping(state->swizzle_a);
ivci.subresourceRange.aspectMask = sampler_aspect_from_format(state->format);
/* samplers for stencil aspects of packed formats need to always use stencil type */
if (ivci.subresourceRange.aspectMask == VK_IMAGE_ASPECT_STENCIL_BIT)
   ivci.format = VK_FORMAT_S8_UINT;
else
   ivci.format = zink_get_format(screen, state->format);

Mo’ Time, Mo’ Problems

The above code was working great for months in zink-wip.

Then “bugs” were fixed in master.

The new problem came from a merge request claiming to “fix depth/stencil blit shaders”. The short of this is that previously, the shaders generated by mesa for the purpose of doing depth+stencil sampling were always reading from the first channel of the image, which was exactly what zink was intending given that for that case the underlying Vulkan driver would only be reading one component anyway. After this change, however, samplers are now reading from the second channel of the image.

Given that a Vulkan stencil format has no second channel, this poses a problem.

Luckily, the magic of swizzles can solve this. By mapping the second channel of the sampler to the first channel of the image data, the sampler will read the stencil data again.

The fully fixed code now looks like this:

VkImageViewCreateInfo ivci = {};
ivci.sType = VK_STRUCTURE_TYPE_IMAGE_VIEW_CREATE_INFO;
ivci.image = res->obj->image;
ivci.viewType = image_view_type(state->target);

ivci.components.r = component_mapping(state->swizzle_r);
ivci.components.g = component_mapping(state->swizzle_g);
ivci.components.b = component_mapping(state->swizzle_b);
ivci.components.a = component_mapping(state->swizzle_a);
ivci.subresourceRange.aspectMask = sampler_aspect_from_format(state->format);
/* samplers for stencil aspects of packed formats need to always use stencil type */
if (ivci.subresourceRange.aspectMask == VK_IMAGE_ASPECT_STENCIL_BIT) {
   ivci.format = VK_FORMAT_S8_UINT;
   ivci.components.g = VK_COMPONENT_SWIZZLE_R;
} else
   ivci.format = zink_get_format(screen, state->format);

And now everything works.

For now.

December 08, 2020

Counting

I keep saying this, but I feel like I’m progressively getting further away from the original goal of this blog, which was to talk about actual code that I’m writing and not just create great graphics memes. So today it’s once again a return to the roots and the code that I had intended to talk about yesterday.

Gallium is a state tracker, and as part of this, it provides various features to make writing drivers easier. One of these features is that it rolls atomic counters into SSBOs, both in terms of the actual buffer resource and the changing of shader instructions to access atomic counters as though they’re uint32_t values at an offset in a buffer. On the surface, and for most drivers, this is great: the driver just has to implement handling for SSBOs, and then they get counters as a bonus.

As always, however, for zink this is A Very Bad Thing.

SPIRV

One of the challenges in zink is the ntv backend which translates the OpenGL shader into a Vulkan shader. Typically this means GLSL -> SPIRV, but there’s also the ARB_gl_spirv extension which allows GL to have SPIRV shaders as well, meaning that zink has to do SPIRV -> SPIRV. GLSL has a certain way of working that NIR handles, but SPIRV is very different, and so the information that is provided by NIR for GLSL is very different than what’s available for SPIRV.

In particular, SPIRV shaders have valid binding values for shader buffers. GLSL shaders do not. This becomes a problem when trying to determine, as zink must, exactly which descriptors are on a given resource and which descriptors need to have their bindings used vs which can be ignored. Since there’s no way to differentiate a GLSL shader from a SPIRV shader, this is a challenge. It’s further a challenge given that one of the NIR passes that changes shader instructions over from variable pointer derefs to explicit block_id and offset values happens to break bindings in such a way that it becomes impossible to accurately tell which SSBO variables are counters and which are actual SSBOs.

Enter The Dark Arts

if (!strcmp(glsl_get_type_name(var->interface_type), "counters"))

Yup. The original SSBO/counter implementation has to use strcmp to check the name of the variable’s interface in order to accurately determine whether it’s a counter-turned-SSBO.

There’s also some extremely gross code in ntv for trying to match up the SSBO to its expected block_id based on this, but SGC is a SFW blog, so I’m going to refrain from posting it.

Improvements

As always, there’s ways to improve my code. This way came some time after I’d written SSBO support in the form of a new pipe cap, PIPE_CAP_NIR_ATOMICS_AS_DEREF. What this does is allow a driver to skip the Gallium code that transforms counters into SSBOs, making them very easy to detect.

With this in my pocket, I was already 5% of the way to better atomic counter handling.

The next step was to unbreak counter location values. The location is, in ntv language, the offset of a variable inside a given buffer block using a type-based unit, e.g., location=1 would mean an offset of 4 bytes for an int type block. Here’s a NIR pass I wrote to tackle the problem:

static bool
fixup_counter_locations(nir_shader *shader)
{
   unsigned last_binding = 0;
   unsigned last_location = 0;
   if (!shader->info.num_abos)
      return false;
   nir_foreach_variable_with_modes(var, shader, nir_var_uniform) {
      if (!type_is_counter(var->type))
         continue;
      var->data.binding += shader->info.num_ssbos;
      if (var->data.binding != last_binding) {
         last_binding = var->data.binding;
         last_location = 0;
      }
      var->data.location = last_location++;
   }
   return true;
}

The premise here is that counters get merged into buffers based on their binding value, and any number of counters can exist for a given binding. Since Gallium always puts counter buffers after SSBOs, the binding used needs to be incremented by the number of real SSBOs present. With this done, all counters with matching bindings can be assumed to exist sequentially on the same buffer.

Next comes the actual SPIRV variable construction. With the knowledge that zink will be receiving some sort of NIR shader instruction like vec1 32 ssa_0 = deref_var &some_counter, where some_counter is actually a value at an offset inside a buffer, it’s important to be considering how to conveniently handle the offset. I ended up with something like this:

   if (type_is_counter(var->type)) {
      SpvId padding = var->data.offset ? get_sized_uint_array_type(ctx, var->data.offset / 4) : 0;
      SpvId types[2];
      if (padding)
         types[0] = padding, types[1] = array_type;
      else
         types[0] = array_type;
      struct_type = spirv_builder_type_struct(&ctx->builder, types, 1 + !!padding);
      if (padding)
         spirv_builder_emit_member_offset(&ctx->builder, struct_type, 1, var->data.offset);
   }

This creates a struct containing 1-2 members:

  • (optional) a padding array for the variable’s offset
  • the actual variable type, sized as an array

Converting the deref_var instruction can then be simplified into a consistent and easy to generate OpAccessChain:

if (type_is_counter(deref->var->type)) {
   SpvId dest_type = glsl_type_is_array(deref->var->type) ?
                     get_glsl_type(ctx, deref->var->type) :
                     get_dest_uvec_type(ctx, &deref->dest);
   SpvId ptr_type = spirv_builder_type_pointer(&ctx->builder,
                                               SpvStorageClassStorageBuffer,
                                               dest_type);
   SpvId idx[] = {
      emit_uint_const(ctx, 32, !!deref->var->data.offset),
   };
   result = spirv_builder_emit_access_chain(&ctx->builder, ptr_type, result, idx, ARRAY_SIZE(idx));
}

After setting up the destination type for the deref, the OpAccessChain is generated using a single index: for cases where the variable lies at a nonzero offset it selects the second member after the padding array, otherwise it selects the first member, which is the intended counter variable.

The rest of the atomic counter conversion was just a matter of supporting the specific counter-related instructions that would otherwise have been converted to regular atomic instructions.

As a result of these changes, zink has gone from a 75% pass rate in ARB_gl_spirv piglit tests all the way up to around 90%.

December 07, 2020

Alright, But Now I’m Really Back

Blogging is hard. Also, getting back on track during a major holiday week is hard.

But now things are settling, and it’s time to get down to brass tacks.

And code. Brass code.

Maybe just code.

Updates

But first, some updates.

Historically when I’ve missed my blogging window for an extended period, it’s because I’m busy. This has been the case for the past week, but I don’t have much to show for it in terms of zink enhancements. There’s some work ongoing on various MRs, but probably this is a good time to revise the bold statement I’d previously made: there’s now roughly two weeks (9 workdays) remaining in which it’s feasible to land zink patches before the end of the year, and probably hitting GL 4.6 in mainline mesa is unrealistic. I’d be pleasantly surprised if we hit 4.0 given that we’d need to be landing a minimum of 1 new MR each day.

But there are some cool things on the horizon for zink nonetheless:

  • work has begun on getting zink working with lavapipe for CI purposes
  • I fixed an annoying spec-related issue that now gives better compatibility with non-Intel drivers
  • improved gl_spirv support
  • further performance-related work

And Now For Something Vaguely Interesting

Looking at the second item in the above list, there’s a vague sense of discomfort that anyone working deeply with shader images will recognize.

Yes, I’m talking about the COHERENT qualifier.

For anyone interested in a deeper reading into this GLSL debacle, check out this stackoverflow thread.

TL;DR, COHERENT is supposed to ensure that buffer/image access across shader stages is synchronized, also known as coherency in this context. But then also due to GL spec wording it can simultaneously mean absolutely nothing, sort of like compiler warnings, so this is an area that generally one should avoid thinking about or delving into.

Naturally, zink is delving deep into this. And of course, Vulkan makes everything better, so this issue is 100% not an issue anymore, and everything is great.

Just kidding.

Vulkan has the exact same language in the parts of the spec referencing this behavior:

While GLSL (and legacy SPIR-V) applies the “coherent” decoration to variables (for historical reasons), this model treats each memory access instruction as having optional implicit availability/visibility operations.

optional implicit

Aren’t explicit specifications like Vulkan great?

What happens here is that the spec has no requirement that either the application or driver actually enforces coherency across resources, meaning that if an application optionally decides not to bother, then it’s up to the driver whether to optionally bother guaranteeing coherent access. If neither the application nor driver take any action to guarantee this behavior, the application won’t work as expected.

coherency.png

To fix this on the application (zink) side, image writes in shaders need to specify MakeTexelAvailable|NonPrivateTexel operands, and the reads need MakeTexelVisible|NonPrivateTexel.

December 04, 2020

In Part 1 I've shown you how to create your own distribution image using the freedesktop.org CI templates. In Part 2, I've shown you how to truly build nested images. In this part, I'll talk about the ci-fairy tool that is part of the same repository of ci-templates.

When you're building a CI pipeline, there are some tasks that most projects need in some way or another. The ci-fairy tool is a grab-bag of solutions for these. Some of those solutions are for a pipeline itself, others are for running locally. So let's go through the various commands available.

Using ci-fairy in a pipeline

It's as simple as including the template in your .gitlab-ci.yml file.


include:
- 'https://gitlab.freedesktop.org/freedesktop/ci-templates/-/raw/master/templates/ci-fairy.yml'

Of course, if you want to track a specific sha instead of following master, just sub that sha there. freedesktop.org projects can include ci-fairy like this:

include:
  - project: 'freedesktop/ci-templates'
    ref: master
    file: '/templates/ci-fairy.yml'

Once that's done, you have access to a .fdo.ci-fairy job template that you can extend from. This will download an image from quay.io that is capable of git, python, bash and obviously ci-fairy. This image is a fixed one and referenced by a unique sha, so even as we keep working on ci-fairy upstream you should never see a regression; updating requires you to explicitly update the sha of the included ci-fairy template. Obviously, if you're using master like above you'll always get the latest.

Due to how the ci-templates work, it's good to set the FDO_UPSTREAM_REPO variable with the upstream project name. This means ci-fairy will be able to find the equivalent origin/master branch when that's not available in the merge request. Note, this is not your personal fork but the upstream one, e.g. "freedesktop/ci-templates" if you are working on the ci-templates itself.

Checking commit messages

ci-fairy has a command to check commits for a few basic expectations in commit messages. This currently includes things like enforcing an 80 char subject line length, that there is an empty line after the subject line, that no fixup or squash commits are in the history, etc. If you have complex requirements you need to write your own, but for most projects this job ensures that there are no obvious errors in the git commit log:


check-commit:
  extends:
    - .fdo.ci-fairy
  script:
    - ci-fairy check-commits --signed-off-by
  except:
    - master@upstream/project

Since you don't ever want this to fail on an already merged commit, exclude this job on the master branch of the upstream project - the MRs should've caught this already anyway.

Checking merge requests

To rebase a contributor's merge request, the contributor must tick the checkbox to Allow commits from members who can merge to the target branch. The default value is off which is frustrating (gitlab is working on it though) and causes unnecessary delays in processing merge requests. ci-fairy has a command to check for this value on an MR and fail - contributors ideally pay attention to the pipeline and fix this accordingly.


check-merge-request:
  extends:
    - .fdo.ci-fairy
  script:
    - ci-fairy check-merge-request --require-allow-collaboration
  allow_failure: true

As a tip: run this job towards the end of the pipeline to give collaborators a chance to file an MR before this job fails.

Using ci-fairy locally

The two examples above are the most useful ones for CI pipelines, but ci-fairy also has some useful local commands. For that you'll have to install it, but that's as simple as


$ pip3 install git+http://gitlab.freedesktop.org/freedesktop/ci-templates
A big focus of ci-fairy's local commands is that they should, usually, be able to work without any specific configuration if you run them in the repository itself.

Linting

Just hacked on the CI config?


$ ci-fairy lint
and done, you get the same error back that the online linter for your project would return.

Pipeline checks

Just pushed to the repo?


$ ci-fairy wait-for-pipeline
Pipeline https://gitlab.freedesktop.org/username/project/-/pipelines/238586
status: success | 7/7 | created: 0 | pending: 0 | running: 0 | failed: 0 | success: 7 ....
The command is self-explanatory, I think.

Summary

There are a few other parts to ci-fairy including templating and even minio handling. I recommend looking at e.g. the libinput CI pipeline which uses much of ci-fairy's functionality. And check out the online documentation for ci-fairy - who knows, there may be something useful in there for you.

The useful contribution of ci-fairy is primarily that it tries to detect the settings for each project automatically, regardless of whether it's run inside an MR pipeline or just as part of a normal pipeline. So the same commands will work without custom configuration on a per-project basis. And for many things it works without API tokens, so the setup costs are just the pip install.

If you have recurring jobs, let us know, we're always looking to add more useful functionality to this little tool.

November 27, 2020

New Game+

Up until now, I’ve been relying solely on my Intel laptop’s onboard GPU for testing, and that’s been great; Intel’s drivers are robust as hell and have very few issues. On top of that, the rare occasions when I’ve found issues have led to a swift resolution.

Certainly I can’t complain at all about my experience with Intel’s hardware or software.

But now things are different and strange because I received in the mail a couple weeks ago a shiny AMD Radeon RX 5700XT.

Mostly in that it’s a new codebase with new debugging tools and such.

Unlike when I started my zink journey earlier this year, however, I’m much better equipped to dive in and Get Things Done.

Progress

This week’s been a bit hectic as I worked to build my new machines, do holiday stuff, and then also get back into the project here. Nevertheless, significant progress has been made in a couple areas:

  • I tested out and then dumped some very heavy-handed descriptor locks that had no impact on performance because they were never doing anything
  • I fixed a major issue with barrier usage

The latter of these is what I’m going to talk about today, and I’m going to zoom in on one very specific handler since it’s been a while since this blog has shown any actual code.

Barrier Performance

When using Vulkan barriers, it’s important to simultaneously:

  • use enough of them that the underlying driver has enough information to transition resources into the states desired
  • avoid using so many that your performance is garbage

Many months ago, I wrote a patch which aimed to address the second point while also not neglecting the first.

I succeeded in one of these goals.

The reason I didn’t notice that I also failed in one of these goals until now is that ANV actually has weak barrier support. By this I mean that while ANV’s barriers work fine and serve the expected purpose of changing resource layouts when necessary, it doesn’t actually do anything with the srcStageMask or dstStageMask parameters. Also, Intel drivers conveniently don’t change an image’s layout for different uses (i.e., GENERAL is the same as SHADER_READ), so screwing up these layouts doesn’t really matter.

Is this a problem?

No.

ANV is great at doing ANV stuff, and zink has thus far been great at punting GL down the pipe so that ANV can blast out some (correct) pixels.

Consider the following code:

bool
zink_resource_image_needs_barrier(struct zink_resource *res, VkImageLayout new_layout, VkAccessFlags flags, VkPipelineStageFlags pipeline)
{
   if (!pipeline)
      pipeline = pipeline_dst_stage(new_layout);
   if (!flags)
      flags = access_dst_flags(new_layout);
   return res->layout != new_layout || (res->access & flags) != flags || (res->access_stage & pipeline) != pipeline;
}

This is a function I wrote for the purpose of no-oping redundant barriers. The idea here is that a barrier is unnecessary if it’s applying the same layout with the same access (or a subset of access) for the same stages (or a subset of stages). The access and stage flags can optionally be omitted and filled in with default values for ease of use too.

Basic state tracking.

This worked great on ANV.

The problem, however, comes when trying to run this on a driver that really gets deep into putting those access and stage flags to work in order to optimize the resource’s access.

RADV is such a driver.

Consider the following barrier sequence:

* VK_IMAGE_LAYOUT_GENERAL, VK_ACCESS_READ | VK_ACCESS_WRITE, VK_PIPELINE_STAGE_TRANSFER_BIT
* VK_IMAGE_LAYOUT_GENERAL, VK_ACCESS_READ | VK_ACCESS_WRITE, VK_PIPELINE_STAGE_TRANSFER_BIT

Obviously it’s not desirable to use GENERAL as a layout, but that’s how zink works for a couple cases at the moment, so it’s a case that must be covered adequately. Going by the above filtering function, the second barrier has the same layout, the same access flags, and the same stage flags, so it gets ignored.

Conceptually, barriers in Vulkan are used for the purpose of informing the driver of dependencies between operations for both internal image layout (i.e., compressing/decompressing images for various usages) and synchronization. This means that if image A is written to in operation O1 and then read from in operation O2, the user can either stall after O1 or use one of the various synchronization methods provided by Vulkan to ensure the desired result. Given that this is GPU <-> GPU synchronization, that means either a semaphore or a pipeline barrier.
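
To make that concrete, such a write-then-read dependency between O1 and O2 is expressed with something like the following (image and cmdbuf are placeholders, and the exact stage/access flags depend on what O1 and O2 actually are):

VkImageMemoryBarrier imb = {
   .sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER,
   .srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT,       /* what O1 did */
   .dstAccessMask = VK_ACCESS_SHADER_READ_BIT,        /* what O2 will do */
   .oldLayout = VK_IMAGE_LAYOUT_GENERAL,
   .newLayout = VK_IMAGE_LAYOUT_GENERAL,              /* no layout change needed */
   .srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
   .dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
   .image = image,
   .subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 },
};
vkCmdPipelineBarrier(cmdbuf,
                     VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,   /* stage of O1 */
                     VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,   /* stage of O2 */
                     0, 0, NULL, 0, NULL, 1, &imb);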

The above scenario seems at first glance to be a redundant barrier based on the state-tracking flags, but conceptually it isn’t since it expresses a dependency between two operations which happen to use matching access.

Refinement

After crashing my system a few times trying to do full piglit runs (seriously, don’t ever try this with zink+radv at present if you’re on similar hardware to mine), I came back to the barrier issue and started to rejigger this filter a bit.

The improved version is here:

bool
zink_resource_image_needs_barrier(struct zink_resource *res, VkImageLayout new_layout, VkAccessFlags flags, VkPipelineStageFlags pipeline)
{
   if (!pipeline)
      pipeline = pipeline_dst_stage(new_layout);
   if (!flags)
      flags = access_dst_flags(new_layout);
   return res->layout != new_layout || (res->access_stage & pipeline) != pipeline ||
          (res->access & flags) != flags ||
          (zink_resource_access_is_write(flags) && util_bitcount(flags) > 1);
}

This adds an extra check for a sequence of barriers where the new barrier has at least two access flags and one of them is write access. In this sense, the barrier dependency is ignored if the resource is doing READ -> READ, but READ|WRITE -> READ|WRITE will still be emitted like it should.

This change fixes a ton of unit tests, though I don’t actually know how many since there’s still some overall instability in various tests which cause my GPU to hang.

RADVCAL

Certainly worth mentioning is that I’ve been working closely with the RADV developer community over the past few days, and they’ve been extremely helpful in both getting me started with debugging the driver and assisting with resolving some issues. In particular, keep your eyes on these MRs which also fix zink issues.

Stay tuned as always for more updates on all things Mike and zink.

November 23, 2020

I’ve Been Here For…

I guess I never left, really, since I’ve been vicariously living the life of someone who still writes zink patches through reviewing and discussing some great community efforts that are ongoing.

But now I’m back living that life of someone who writes zink patches.

Valve has generously agreed to sponsor my work on graphics-related projects.

For the time being, that work happens to be zink.

Ambition

I don’t want to just make a big post about leaving and then come back after a couple weeks like nothing happened.

It’s 2020.

We need some sort of positive energy and excitement here.

As such, I’m hereby announcing Operation Oxidize, an ambitious endeavor between me and the formidably skillful Erik Faye-Lund of Collabora.

We’re going to land 99% of zink-wip into mainline Mesa by the end of the year, bringing the driver up to basic GL 4.6 and ES 3.2 support with vastly improved performance.

Or at least, that’s the goal.

Will we succeed?

Stay tuned to find out!

November 22, 2020

Another Brief Review

This was a (relatively) quiet week in zink-world. Here’s some updates, once more in no particular order:

  • Custom border color support landed
  • Erik wrote and I reviewed a patch that enabled some blitting optimizations but also regressed a number of test cases
    • Oops
  • I wrote and Erik reviewed a series which improved some of the query code but also regressed a number of test cases
    • Oops++
  • The flurry of activity around Quake3 not working on RADV died down as it’s now been suggested that this is not a RADV bug and is instead the result of no developers fully understanding the majesty of RADV’s pipeline barrier implementation
    • Developers around the world stunned by the possibility that they don’t know everything
  • Witold Baryluk has helpfully contributed a truckload of issue tickets for my zink-wip branch after extensive testing on AMD hardware
    • I’ll get to these at some point, I promise

Stay tuned for further updates.

November 21, 2020
Recently I acquired an Acer Aspire Switch 10 E SW3-016; this device was the main reason for writing my blog post about the shim boot loop. The EFI firmware of this device is bad in a number of ways:

  1. It considers its eMMC unbootable unless its ESP contains an EFI/Microsoft/Boot/bootmgfw.efi file.

  2. But it will actually boot EFI/Boot/bootx64.efi ! (wait what? yes really)

  3. It will only boot from a USB disk connected to its micro-USB connector, not from the USB-A connector on the keyboard-dock.

  4. You must first set a BIOS admin password before you can disable secure-boot (which is necessary to boot home-built kernels without doing your own signing)

  5. Last but not least it has one more nasty "feature": it detects whether the OS being booted is Windows, Android or unknown, and it updates the ACPI DSDT based on this!

Some more details on the OS detection misfeature. The ACPI Device (SDHB) node for the MMC controller connected to the SDIO wifi module contains:

        Name (WHID, "80860F14")
        Name (AHID, "INT33BB")


Depending on what OS the BIOS thinks it is booting, it renames one of these 2 to _HID. This is weird given that it will only boot if EFI/Microsoft/Boot/bootmgfw.efi exists, but it still does this. Worse, it looks at the actual contents of EFI/Boot/bootx64.efi for this. It seems that that file must be signed, otherwise it goes into OS-unknown mode and keeps the 2 above DSDT bits as-is, so there is no _HID defined for the wifi's mmc controller and thus no wifi. I hit this issue when I replaced EFI/Boot/bootx64.efi with grubx64.efi to break the bootloop. grubx64.efi is not signed, so the DSDT as Linux saw it contained the above AML code. Using the proper workaround for the bootloop from my previous blog post, this bit of the DSDT morphs into:

        Name (_HID, "80860F14")
        Name (AHID, "INT33BB")


And the wifi works.

The Acer Aspire Switch 10 E SW3-016's firmware also triggers an actual bug / issue in Linux' ACPI implementation, causing the bluetooth to not work. This is discussed in much detail here. I have a patch series fixing this here.

And the older Acer Aspire Switch 10 SW5-012's and S1002's firmware has some similar issues:

  1. It considers its eMMC unbootable unless its ESP contains an EFI/Microsoft/Boot/bootmgfw.efi file

  2. These models will actually always boot the EFI/Microsoft/Boot/bootmgfw.efi file, so that is somewhat more sensible.

  3. On the SW5-012 you must first set a BIOS admin password before you can disable secure-boot.

  4. The SW5-012 is missing an ACPI device node for the PWM controller used for controlling the backlight brightness. I guess that the Windows i915 gfx driver just directly pokes the registers (which are in a whole other IP block), rather than relying on a separate PWM driver as Linux does. Unfortunately there is no way to fix this other than using a DSDT overlay. I have a DSDT overlay for the V1.20 BIOS (and only for the V1.20 BIOS) available for this here.

Because of 1. and 2. you need to take the following steps to get Linux to boot on the Acer Aspire Switch 10 SW5-012 or the S1002:

  1. Rename the original bootmgfw.efi (so that you can chainload it in the multi-boot case)

  2. Replace bootmgfw.efi with shimia32.efi

  3. Copy EFI/fedora/grubia32.efi to EFI/Microsoft/Boot

This assumes that you have the files from a 32 bit Windows install in your ESP already.
November 18, 2020
How to fix the Linux EFI secure-boot shim bootloop issue seen on some systems.

Quite a few Bay- and Cherry-Trail based systems have bad firmware which completely ignores any efibootmgr set boot options. They basically completely reset the boot order doing some sort of auto-detection at boot. Some of these will even give an error about their eMMC not being bootable unless the ESP has an EFI/Microsoft/Boot/bootmgfw.efi file!

Many of these end up booting EFI/Boot/bootx64.efi unconditionally every boot. This will cause a boot loop since when Linux is installed EFI/Boot/bootx64.efi is now shim. When shim is started with a path of EFI/Boot/bootx64.efi, shim will add a new efibootmgr entry pointing to EFI/fedora/shimx64.efi and then reset. The goal of this is so that the firmware's F12 bootmenu can be used to easily switch between Windows and Linux (without chainloading which breaks bitlocker). But since these bad EFI implementations ignore efibootmgr stuff, EFI/Boot/bootx64.efi shim will run again after the reset and we have a loop.

There are 2 ways to fix this loop:

1. The right way: Stop shim from trying to add a bootentry pointing to EFI/fedora/shimx64.efi:

rm EFI/Boot/fbx64.efi
cp EFI/fedora/grubx64.efi EFI/Boot


The first command will stop shim from trying to add a new efibootmgr entry (it calls fbx64.efi to do that for it); instead it will try to execute grubx64.efi from the directory from which it was executed, so we must put a grubx64.efi in the EFI/Boot dir, which the second command does. Do not use the livecd EFI/Boot/grubx64.efi file for this as I did at first, that searches for its config and env under EFI/Boot which is not what we want.

Note that upgrading shim will restore EFI/Boot/fbx64.efi. To avoid this you may want to backup EFI/Boot/bootx64.efi, then do "sudo rpm -e shim-x64" and then restore the backup.

2. The wrong way: Replace EFI/Boot/bootx64.efi with a copy of EFI/fedora/grubx64.efi

This is how I used to do this until hitting the scenario which caused me to write this blog post. There are 2 problems with this:

2a) This requires disabling secure-boot (which I could live with so far)
2b) Some firmwares change how they behave, exporting a different DSDT to the OS depending on whether EFI/Boot/bootx64.efi is signed or not (even with secure boot disabled), and their behavior is totally broken when it is not signed. I will post another rant ^W blogpost about this soon. For now let's just say that you should use workaround 1. from above since it simply is a better workaround.

Note that for better readability the above text uses bootx64, shimx64, fbx64 and grubx64 throughout. When using a 32 bit EFI (which is typical on Bay Trail systems) you should replace these with bootia32, shimia32, fbia32 and grubia32. Note that 32 bit EFI Bay Trail systems should still use a 64 bit Linux distro; the firmware being 32 bit is a weird Windows related thing.

Also note that your system may use a key other than F12 to show the firmware's bootmenu.
November 15, 2020

A Brief Review

As time/sanity permit, I’ll be trying to do roundup posts for zink happenings each week. Here’s a look back at things that happened, in no particular order:

November 13, 2020

(project was renamed from vallium to lavapipe)

I had some requirements for writing a vulkan software rasterizer within the Mesa project. I took some time to look at the options and realised that just writing a vulkan layer on top of gallium's llvmpipe would be a good answer for this problem. However in doing so I knew people would ask why this wouldn't work for a hardware driver.

tl;dr DO NOT USE LAVAPIPE OVER A GALLIUM HW DRIVER.

What is lavapipe?

The lavapipe layer is a gallium frontend. It takes the Vulkan API and roughly translates it into the gallium API.

How does it do that?

Vulkan is a lowlevel API, it allows the user to allocate memory, create resources, record command buffers amongst other things. When a hw vulkan driver is recording a command buffer, it is putting hw specific commands into it that will be run directly on the GPU. These command buffers are submitted to queues when the app wants to execute them.

Gallium is a context level API, i.e. like OpenGL/D3D10. The user has to create resources and contexts and the driver internally manages command buffers etc. The driver controls internal flushing and queuing of command buffers.
 
In order to bridge the gap, the lavapipe layer abstracts the gallium context into a separate thread of execution. When recording a vulkan command buffer it creates a CPU side command buffer containing an encoding of the Vulkan API. It passes that recorded CPU command buffer to the thread on queue submission. The thread then creates a gallium context, and replays the whole CPU recorded command buffer into the context, one command at a time.
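
A purely illustrative sketch of that record/replay split (these are not lavapipe's actual types or names):

struct pipe_context;   /* the gallium context owned by the replay thread */

enum cpu_cmd_type { CMD_BIND_PIPELINE, CMD_DRAW /* ...one per Vulkan command... */ };

struct cpu_cmd {
   enum cpu_cmd_type type;
   union {
      struct { unsigned vertex_count, instance_count; } draw;
      /* ...the arguments of whatever Vulkan command was recorded... */
   } u;
   struct cpu_cmd *next;
};

/* on queue submission the thread walks the recorded list and turns each
 * entry into the corresponding pipe_context calls, one command at a time */
static void
replay_cmdbuf(struct pipe_context *ctx, const struct cpu_cmd *cmd)
{
   for (; cmd; cmd = cmd->next) {
      switch (cmd->type) {
      case CMD_DRAW:
         /* translate into ctx->draw_vbo(...) and friends */
         break;
      default:
         break;
      }
   }
}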

That sounds horrible, isn't it slow?

Yes.

Why doesn't that matter for *software* drivers?

Software rasterizers are a very different proposition from an overhead point of view than real hardware. CPU rasterization is pretty heavy on the CPU load, so nearly always 90% of your CPU time will be in the rasterizer and fragment shader. Having some minor CPU overheads around command submission and queuing isn't going to matter in the overall profile of the user application. CPU rasterization is already slow, the Vulkan->gallium translation overhead isn't going to be the reason for making it much slower.

For real HW drivers which are meant to record their own command buffers in the GPU domain and submit them directly to the hw, adding in a CPU layer that just copies the command buffer data is a massive overhead and one that can't easily be removed from the lavapipe layer.

The lavapipe execution context is also pretty horrible, it has to connect all the state pieces like shaders etc to the gallium context, and disconnect them all at the end of each command buffer. There is only one command submission queue, one context to be used. A lot of hardware exposes more queues etc that this will never model.

I still don't want to write a vulkan driver, give me more reasons.

Pipeline barriers:

Pipeline barriers in Vulkan are essential to efficient driver hw usage. They are one of the most difficult to understand and hard to get right pieces of writing a vulkan driver. For a software rasterizer they are also mostly unneeded. When I get a barrier I just completely hardflush the gallium context because I know the sw driver behind it. For a real hardware driver this would be a horrible solution. You spend a lot of time trying to make anything optimal here.

Memory allocation:

Vulkan is built around the idea of separate memory allocation and objects binding to those allocations. Gallium is built around object allocation with the memory allocs happening implicitly. I've added some simple memory allocation objects to the gallium API for swrast. These APIs are in no way useful for hw drivers. There is no way to expose memory types or heaps from gallium usefully. The current memory allocation API works for software drivers because I know all they want is an aligned_malloc. There is no decent way to bridge this gap without writing a new gallium API that looks like Vulkan. (in which case just write a vulkan driver already).

Can this make my non-Vulkan capable hw run Vulkan?

No. If the hardware can't do virtual memory properly, or expose features for vulkan, this can't be fixed with a software layer that just introduces overhead.


There's been some recent discussion about whether the X server is abandonware. As the person arguably most responsible for its care and feeding over the last 15 years or so, I feel like I have something to say about that.

The thing about being the maintainer of a public-facing project for nearly the whole of your professional career is it's difficult to separate your own story from the project. So I'm not going to try to be dispassionate, here. I started working on X precisely because free software had given me options and capabilities that really matter, and I feel privileged to be able to give that back. I can't talk about that without caring about it.

So here's the thing: X works extremely well for what it is, but what it is is deeply flawed. There's no shame in that, it's 33 years old and still relevant, I wish more software worked so well on that kind of timeframe. But using it to drive your display hardware and multiplex your input devices is choosing to make your life worse.

It is, however, uniquely well suited to a very long life as an application compatibility layer. Though the code happens to implement an unfortunate specification, the code itself is quite well structured, easy to hack on, and not far off from being easily embeddable.

The issue, then, is how to get there. And I don't have any real desire to get there while still pretending that the xfree86 hardware-backed server code is a real thing. Sorry, I guess, but I've worked on xfree86-derived servers for very nearly as long as XFree86-the-project existed, and I am completely burnt out on that on its own merits, let alone doing that and also being release manager and reviewer of last resort. You can only apply so much thrust to the pig before you question why you're trying to make it fly at all.

So, is Xorg abandoned? To the extent that that means using it to actually control the display, and not just keep X apps running, I'd say yes. But xserver is more than xfree86. Xwayland, Xwin, Xephyr, Xvnc, Xvfb: these are projects with real value that we should not give up. A better way to say it is that we can finally abandon xfree86.

And if that sounds like a world you'd like to see, please, come talk to us, let's make it happen. I'd be absolutely thrilled to see someone take this on, and I'm happy to be your guide through the server internals.

November 12, 2020

A recent article on phoronix has some commentary about sharing code between Windows and Linux, and how this seems to be a metric that Intel likes.

I'd like to explore this idea a bit and explain why I believe it's bad for Linux based distros and our open source development models in the graphics area.

tl;dr there is a big difference between open source released and open source developed projects in terms of sustainability and community.

The Linux graphics stack from a distro vendor point of view is made up of two main projects, the Linux kernel and Mesa userspace. These two projects are developed in the open with completely open-source, vendor-agnostic practices. There is no vendor controlling either project and both projects have a goal of trying to maximise shared code and shared processes/coding standards across drivers from all vendors.

This cross-vendor synergy is very important to the functioning ecosystem that is the Linux graphics stack. The stack also relies in some places on the LLVM project, but again LLVM upstream is vendor agnostic and open source developed.

The value to distros is that they have central places to pick up driver stacks with good release cycles, and a minimal number of communities they have to deal with. Hardware vendors usually don't see the value in these external communities as much as Linux distros do. From a hardware vendor's internal point of view, there is more benefit in creating a single stack shared between their Windows and Linux drivers to maximise their return on investment, or make their orgchart prettier, or produce fewer powerpoints about why their orgchart isn't optimal.

A shared Windows/Linux stack as such is a thing the vendors want more for their own reasons than for the benefit of the Linux community.

Why is it a bad idea?

I'll start by saying it's not always a bad idea. In theory it might be possible to produce such a stack with the benefits of an open source development model; however, most vendors seem to fail at this. They see open source as a release model: they develop internally and shovel the results over the fence into a github repo every X weeks after a bunch of internal cycles. They build products containing these open source pieces, but they never spend the time building projects or communities around them.

As an example, take AMDVLK vs radv. I started radv because AMD had been promising the world an open source Vulkan driver for Linux that was shared with their Windows stack. Even when it was delivered, it was open source released but internally developed. There was no avenue for community participation in the driver development. External contributors were never on the same footing as an AMD employee; even AMD employees on different teams weren't on the same footing. Compare this to the radv project in Mesa, which allowed Valve to contribute the ACO backend compiler and provide better results than AMD's vendor-shared code could ever have done, with far less investment and manpower.

Intel have a non-Mesa compiler called the Intel Graphics Compiler, mentioned in the article. This is fully developed by Intel internally; there is little info on project direction, how to get involved, or where the community is. There doesn't seem to be much public review, and patches seem to get merged to the public repo by igcbot, which may mean they are being mirrored from some internal repo. They are not using github merge requests etc. Compare this to development of a Mesa NIR backend, where lots of changes are reviewed and maximal common code sharing is attempted so that all vendors benefit from the code.

One area where it has mostly sort of worked out is with the AMD display code in the kernel. I believe this code is shared with their Windows driver (but I'm not 100% sure). They do try to engage with community changes to the code, but the code is still pretty horrible and not really optimal on Linux. Integrating it with atomic modesetting and refactoring it was a pain. So even in the best case it's not an optimal outcome, even for the vendor: they have to work hard to make the shared code capable of supporting different OS interactions.

How would I do it?

If I had to share a Windows/Linux driver stack I'd (biased opinion) start from the most open project and bring that into the closed projects. I definitely wouldn't start with a new internal project that tries to disrupt both. For example, if I needed to create a Windows GL driver, I could:

a) write a complete GL implementation, throw it over the wall every few weeks, and make Windows/Linux use it. Linux users lose out on the shared stack, distros lose out by having to build a stack of multiple per-vendor deps instead of one dependency, Windows gains nothing really, but I'm so in control of my own destiny (communities don't matter).

b) use Mesa and upstream my driver to share with the Linux stack, adding the Windows code to the Mesa stack. I get to share the benefits of external development by other vendors, Windows gains that benefit too, and Linux retains the benefits to its ecosystem.

A warning, then, to anyone wishing for more vendor code sharing between OSes: it generally doesn't end with Linux being better off, it ends up with Linux being more fragmented, harder to support and in the long run unsustainable.


November 06, 2020

About a year ago, I got a new laptop: a late 2019 Razer Blade Stealth 13.  It sports an Intel i7-1065G7 with the best of Intel's Ice Lake graphics along with an NVIDIA GeForce GTX 1650.  Apart from needing an ACPI lid quirk and the power management issues described here, it’s been a great laptop so far and the Linux experience has been very smooth.

Unfortunately, the out-of-the-box integrated graphics performance of my new laptop was less than stellar.  My first task with the new laptop was to debug a rendering issue in the Linux port of Shadow of the Tomb Raider which turned out to be a bug in the game.  In the process, I discovered that the performance of the game’s built-in benchmark was almost half of what it was on Windows.  We’ve had some performance issues with Mesa from time to time on some games but half seemed a bit extreme.  Looking at system-level performance data with gputop revealed that the GPU clock rate was unable to get above about 60-70% of the maximum in spite of the GPU being busy the whole time.  Why?  The GPU wasn’t able to get enough power.  Once I sorted out my power management problems, the benchmark went from about 50-60% the speed of Windows to more like 104% the speed of Windows (yes, that’s more than 100%).

This blog post is intended to serve as a bit of a guide to understanding memory throughput and power management issues and configuring your system properly to get the most out of your Intel integrated GPU.  Not everything in this post will affect all laptops so you may have to do some experimentation with your system to see what does and does not matter.  I also make no claim that this post is in any way complete; there are almost certainly other configuration issues of which I'm not aware or which I've forgotten.

Update your drivers

This should go without saying but if you want the best performance out of your hardware, running the latest drivers is always recommended.  This is especially true for hardware that has just been released.  Generally, for graphics, most of the big performance improvements are going to be in Mesa but your Linux kernel version can matter as well.  In the case of Intel Ice Lake processors, some of the power management features aren’t enabled until Linux 5.4.

I’m not going to give a complete guide to updating your drivers here.  If you’re running a distro like Arch, chances are that you’re already running something fairly close to the latest available.  If you’re on Ubuntu, the padoka PPA provides versions of the userspace components (Mesa, X11, etc.) that are usually no more than about a week out-of-date but upgrading your kernel is more complicated.  Other distros may have something similar but I’ll leave that as an exercise to the reader.

This doesn’t mean that you need to be obsessive about updating kernels and drivers.  If you’re happy with the performance and stability of your system, go ahead and leave it alone.  However, if you have brand new hardware and want to make sure you have new enough drivers, it may be worth attempting an update.  Or, if you have the patience, you can just wait 6 months for the next distro release cycle and hope to pick up with a distro update.

Make sure you have dual-channel RAM

One of the big bottleneck points in 3D rendering applications is memory bandwidth.  Most standard monitors run at a resolution of 1920x1080 and a refresh rate of 60 Hz.  A 1920x1080 RGBA (32bpp) image is just shy of 8 MiB in size and, if the GPU is rendering at 60 FPS, that adds up to about 474 MiB/s of memory bandwidth to write out the image every frame.  If you're running a 4K monitor, multiply by 4 and you get about 1.8 GiB/s.  Those numbers are only for the final color image, assume we write every pixel of the image exactly once, and don't take into account any other memory access.  Even in a simple 3D scene, there are other images than just the color image being written such as depth buffers or auxiliary gbuffers, each pixel typically gets written more than once depending on app over-draw, and shading typically involves reading from uniform buffers and textures.  Modern 3D applications typically also have things such as depth pre-passes, lighting passes, and post-processing filters for depth-of-field and/or motion blur.  The result of this is that actual memory bandwidth for rendering a 3D scene can be 10-100x the bandwidth required to simply write the color image.
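
For anyone who wants to sanity-check those numbers, here is the raw color-buffer arithmetic on its own (write bandwidth only, ignoring every other source of traffic described above):

    #include <stdio.h>

    /* Back-of-the-envelope numbers for the color buffer alone. */
    int main(void)
    {
       const double bpp = 4.0;                 /* RGBA, 32 bits per pixel */
       const double fps = 60.0;
       const double MiB = 1024.0 * 1024.0;

       double fhd = 1920.0 * 1080.0 * bpp;     /* ~7.9 MiB per frame  */
       double uhd = 3840.0 * 2160.0 * bpp;     /* ~31.6 MiB per frame */

       printf("1080p: %.1f MiB/frame -> %.1f MiB/s at 60 fps\n",
              fhd / MiB, fhd * fps / MiB);
       printf("4K:    %.1f MiB/frame -> %.2f GiB/s at 60 fps\n",
              uhd / MiB, uhd * fps / (MiB * 1024.0));
       return 0;
    }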

Because of the incredible amount of bandwidth required for 3D rendering, discrete GPUs use memories which are optimized for bandwidth above all else.  These go by different names such as GDDR6 or HBM2 (current as of the writing of this post) but they all use extremely wide buses and access many bits of memory in parallel to get the highest throughput they can.  CPU memory, on the other hand, is typically DDR4 (current as of the writing of this post) which runs on a narrower 64-bit bus and so the over-all maximum memory bandwidth is lower.  However, as with anything in engineering, there is a trade-off being made here.  While narrower buses have lower over-all throughput, they are much better at random access which is necessary for good CPU memory performance when crawling complex data structures and doing other normal CPU tasks.  When 3D rendering, on the other hand, the vast majority of your memory bandwidth is consumed in reading/writing large contiguous blocks of memory and so the trade-off falls in favor of wider buses.

With integrated graphics, the GPU uses the same DDR RAM as the CPU so it can't get as much raw memory throughput as a discrete GPU.  Some of the memory bottlenecks can be mitigated via large caches inside the GPU but caching can only do so much.  At the end of the day, if you're fetching 2 GiB of memory to draw a scene, you're going to blow out your caches and load most of that from main memory.

The good news is that most motherboards support a dual-channel RAM configuration where, if your DDR units are installed in identical pairs, the memory controller will split memory access between the two DDR units in the pair.  This has similar benefits to running on a 128-bit bus but without some of the drawbacks.  The result is about a 2x improvement in over-all memory throughput.  While this may not affect your CPU performance significantly outside of some very special cases, it makes a huge difference to your integrated GPU which cares far more about total throughput than random access.  If you are unsure how your computer's RAM is configured, you can run “dmidecode -t memory” and see if you have two identical devices reported in different channels.

Power management 101

Before getting into the details of how to fix power management issues, I should explain a bit about how power management works and, more importantly, how it doesn’t.  If you don’t care to learn about power management and are just here for the system configuration tips, feel free to skip this section.

Why is power management important?  Because the clock rate (and therefore the speed) of your CPU or GPU is heavily dependent on how much power is available to the system.  If it’s unable to get enough power for some reason, it will run at a lower clock rate and you’ll see that as processes taking more time or lower frame rates in the case of graphics.  There are some things that you, as the user, cannot control such as the physical limitations of the chip or the way the OEM has configured things on your particular laptop.  However, there are some things which you can do from a system configuration perspective which can greatly affect power management and your performance.

First, we need to talk about thermal design power or TDP.  There are a lot of misunderstandings on the internet about TDP and we need to clear some of them up.  Wikipedia defines TDP as “the maximum amount of heat generated by a computer chip or component that the cooling system in a computer is designed to dissipate under any workload.”  The Intel Product Specifications site defines TDP as follows:

Thermal Design Power (TDP) represents the average power, in watts, the processor dissipates when operating at Base Frequency with all cores active under an Intel-defined, high-complexity workload. Refer to Datasheet for thermal solution requirements.

In other words, the TDP value provided on the Intel spec sheet is a pretty good design target for OEMs but doesn’t provide nearly as many guarantees as one might hope.  In particular, there are several things that the TDP value on the spec sheet is not:
  • It’s not the exact maximum power.  It’s an “average power”.
  • It may not match any particular workload.  It’s based on “an Intel-defined, high-complexity workload”.  Power consumption on any other workload is likely to be slightly different.
  • It’s not the actual maximum.  It’s based on when the processor is “operating at Base Frequency with all cores active.” Technologies such as Turbo Boost can cause the CPU to operate at a higher power for short periods of time.
If you look at the  Intel Product Specifications page for the i7-1065G7, you’ll see three TDP values: the nominal TDP of 15W, a configurable TDP-up value of 25W and a configurable TDP-down value of 12W.  The nominal TDP (simply called “TDP”) is the base TDP which is enough for the CPU to run all of its cores at the base frequency which, given sufficient cooling, it can do in the steady state.  The TDP-up and TDP-down values provide configurability that gives the OEM options when they go to make a laptop based on the i7-1065G7.  If they’re making a performance laptop like Razer and are willing to put in enough cooling, they can configure it to 25W and get more performance.  On the other hand, if they’re going for battery life, they can put the exact same chip in the laptop but configure it to run as low as 12W.  They can also configure the chip to run at 12W or 15W and then ship software with the computer which will bump it to 25W once Windows boots up.  We’ll talk more about this reconfiguration later on.

Beyond just the numbers on the spec sheet, there are other things which may affect how much power the chip can get.  One of the big ones is cooling.  The law of conservation of energy dictates that energy is never created or destroyed.  In particular, your CPU doesn’t really consume energy; it turns that electrical energy into heat.  For every Watt of electrical power that goes into the CPU, a Watt of heat has to be pumped out by the cooling system.  (Yes, a Watt is also a measure of heat flow.)  If the CPU is using more electrical energy than the cooling system can pump back out, energy gets temporarily stored in the CPU as heat and you see this as the CPU temperature rising.  Eventually, however, the CPU has to back off and let the cooling system catch up or else that built up heat may cause permanent damage to the chip.

Another thing which can affect CPU power is the actual power delivery capabilities of the motherboard itself.  In a desktop, the discrete GPU is typically powered directly by the power supply and it can draw 300W or more without affecting the amount of power available to the CPU.  In a laptop, however, you may have more power limitations.  If you have multiple components requiring significant amounts of power such as a CPU and a discrete GPU, the motherboard may not be able to provide enough power for both of them to run flat-out so it may have to limit CPU power while the discrete GPU is running.  These types of power balancing decisions can happen at a very deep firmware level and may not be visible to software.

The moral of this story is that the TDP listed on the spec sheet for the chip isn’t what matters; what matters is how the chip is configured by the OEM, how much power the motherboard is able to deliver, and how much power the cooling system is able to remove.  Just because two laptops have the same processor with the same part number doesn’t mean you should expect them to get the same performance.  This is unfortunate for laptop buyers but it’s the reality of the world we live in.  There are some things that you, as the user, cannot control such as the physical limitations of the chip or the way the OEM has configured things on your particular laptop.  However, there are some things which you can do from a system configuration perspective and that’s what we’ll talk about next.

If you want to experiment with your system and understand what’s going on with power, there are two tools which are very useful for this: powertop and turbostat.  Both are open-source and should be available through your distro package manager.  I personally prefer the turbostat interface for CPU power investigations but powertop is able to split your power usage up per-process which can be really useful as well.
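
If you just want a quick look at what your package is currently configured to without running a full tool, the kernel's RAPL powercap interface usually exposes the limits in sysfs.  The exact path varies between machines and kernels, so treat the one below as an assumption to verify on your own system:

    #include <stdio.h>

    /* Print the package long-term power limit from the RAPL powercap sysfs
     * interface.  The path is an assumption -- check what actually exists
     * under /sys/class/powercap/ on your machine. */
    int main(void)
    {
       const char *path =
          "/sys/class/powercap/intel-rapl:0/constraint_0_power_limit_uw";
       FILE *f = fopen(path, "r");
       long long uw;

       if (!f) {
          perror(path);
          return 1;
       }
       if (fscanf(f, "%lld", &uw) != 1) {
          fclose(f);
          fprintf(stderr, "couldn't parse %s\n", path);
          return 1;
       }
       fclose(f);
       printf("package long-term power limit: %.1f W\n", uw / 1e6);
       return 0;
    }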

Update GameMode to at least version 1.5

About two and a half years ago (1.0 was released in May of 2018), Feral Interactive released their GameMode daemon which is able to tweak some of your system settings when a game starts up to get maximal performance.  One of the settings that GameMode tweaks is your CPU performance governor.  By default, GameMode will set it to “performance” when a game is running.  While this seems like a good idea (“performance” is better, right?), it can actually be counterproductive on integrated GPUs and cause you to get worse over-all performance.

Why would the “performance” governor cause worse performance?  First, understand that the names “performance” and “powersave” for CPU governors are a bit misleading.  The powersave governor isn’t just for when you’re running on battery and want to use as little power as possible.  When on the powersave governor, your system will clock all the way up if it needs to and can even turbo if you have a heavy workload.  The difference between the two governors is that the powersave governor tries to give you as much performance as possible while also caring about power; it’s quite well balanced.  Intel typically recommends the powersave governor even in data centers because, even though they have piles of power and cooling available, data centers typically care about their power bill.  The performance governor, on the other hand, doesn’t care about power consumption and only cares about getting the maximum possible performance out of the CPU so it will typically burn significantly more power than needed.

So what does this have to do with GPU performance?  On an integrated GPU, the GPU and CPU typically share a power budget and every Watt of power the CPU is using is a Watt that’s unavailable to the GPU.  In some configurations, the TDP is enough to run both the GPU and CPU flat-out but that’s uncommon.  Most of the time, however, the CPU is capable of using the entire TDP if you clock it high enough.  When running with the performance governor, that extra unnecessary CPU power consumption can eat into the power available to the GPU and cause it to clock down.

This problem should be mostly fixed as of GameMode version 1.5 which adds an integrated GPU heuristic.  The heuristic detects when the integrated GPU is using significant power and puts the CPU back to using the powersave governor.  In the testing I’ve done, this pretty reliably chooses the powersave governor in the cases where the GPU is likely to be TDP limited.  The heuristic is dynamic so it will still use the performance governor if the CPU power usage way overpowers the GPU power usage such as when compiling shaders at a loading screen.

What do you need to do on your system?  First, check what version of GameMode you have installed (if any).  If it’s version 1.4 or earlier and you intend to play games on an integrated GPU, I recommend upgrading GameMode, or else disabling or uninstalling the GameMode daemon.

Use thermald

In “power management 101” I talked about how sometimes OEMs will configure a laptop to 12W or 15W in BIOS and then re-configure it to 25W in software.  This is done via the “Intel Dynamic Platform and Thermal Framework” driver on Windows.  The DPTF driver manages your over-all system thermals and keeps the system within its thermal budget.  This is especially important for fanless or ultra-thin laptops where the cooling may not be sufficient for the system to run flat-out for long periods.  One thing the DPTF driver does is dynamically adjust the TDP of your CPU.  It can adjust it both up if the laptop is running cool and you need the power or down if the laptop is running hot and needs to cool down.  Some OEMs choose to be very conservative with their TDP defaults in BIOS to prevent the laptop from overheating or constantly running hot if the Windows DPTF driver is not available.

On Linux, the equivalent to this is thermald.  When installed and enabled on your system, it reads the same OEM configuration data from ACPI as the Windows DPTF driver and is also able to scale up your package TDP threshold past the BIOS default as per the OEM configuration.  You can also write your own configuration files if you really wish but you do so at your own risk.

Most distros package thermald but it may not be enabled or working quite properly out-of-the-box.  This is because, historically, it has relied on the closed-source dptfxtract utility that's provided by Intel as a binary.  It requires dptfxtract to fetch the OEM-provided configuration data from the ACPI tables. Since most distros don't usually ship closed-source software in their main repositories and since thermald doesn't do much without that data, a lot of distros don't bother to ship or enable it by default.  You'll have to turn it on manually.

To fix this, install both thermald and dptfxtract and ensure that thermald is enabled.  On most distros, thermald is packaged normally even if it isn’t enabled by default because it is open-source.  The dptfxtract utility is usually available in your distro’s non-free repositories.  On Ubuntu, dptfxtract is available as a package in multiverse.  For Fedora, dptfxtract is available via RPM Fusion’s non-free repo.  There are also packages for Arch and likely others as well.  If no one packages it for your distro, it’s just one binary so it’s pretty easy to install manually.

Some of this may change going forward, however.  Recently, Matthew Garrett did some work to reverse-engineer the DPTF framework and provide support for fetching the DPTF data from ACPI without the need for the binary blob.  When running with a recent kernel and Matthew's fork of thermald, you should be able to get OEM-configured thermals without the need for the dptfxtract blob, at least on some hardware.  Whether or not you get the right configuration will depend on your hardware, your kernel version, your distro, and whether they ship the Intel version of thermald or Matthew's fork.  Even then, your distro may leave it uninstalled or disabled by default.  It's still disabled by default in Fedora 33, for instance.

It should be noted at this point that, if thermald and dptfxtract are doing their job, your laptop is likely to start running much hotter when under heavy load than it did before.  This is because thermald is re-configuring your processor with a higher thermal budget which means it can now run faster but it will also generate more heat and may drain your battery faster.  In theory, thermald should keep your laptop’s thermals within safe limits; just not within the more conservative limits the OEM programmed into BIOS.  If all the additional heat makes you uncomfortable, you can just disable thermald and it should go back to the BIOS defaults.

Enable NVIDIA’s dynamic power-management

On my laptop (the late 2019 Razer Blade Stealth 13), the BIOS has the CPU configured to 35W out-of-the-box.  (Yes, 35W is higher than TDP-up and I’ve never seen it burn anything close to that much power; I have no idea why it’s configured that way.)  This means that we have no need for DPTF and the cooling is good enough that I don’t really need thermald on it either.  Instead, its power management problems come from the power balancing that the motherboard does between the CPU and the discrete NVIDIA GPU.

If the NVIDIA GPU is powered on at all, the motherboard configures the CPU to the TDP-down value of 12W.  I don’t know exactly how it’s doing this but it’s at a very deep firmware level that seems completely opaque to software.  To make matters worse, it doesn’t just restrict CPU power when the discrete GPU is doing real rendering; it restricts CPU power whenever the GPU is powered on at all.  In the default configuration with the NVIDIA proprietary drivers, that’s all the time.

Fortunately, if you know where to find it, there is a configuration option available in recent drivers for Turing and later GPUs which lets the NVIDIA driver completely power down the discrete GPU when it isn’t in use.  You can find this documented in Chapter 22 of the NVIDIA driver README.  The runtime power management feature is still beta as of the writing of this post and does come with some caveats such as that it doesn’t work if you have audio or USB controllers (for USB-C video) on your GPU.  Fortunately, on many laptops with a hybrid Intel+NVIDIA graphics solution, the discrete GPU exists only for render off-loading and doesn’t have any displays connected to it.  In that case, the audio and USB-C can be disabled and don’t cause any problems.  On my laptop, as soon as I properly enabled runtime power management in the NVIDIA driver, the motherboard stopped throttling my CPU and it started running at the full TDP-up of 25W.

I believe that nouveau has some capabilities for runtime power management.  However, I don’t know for sure how good they are and whether or not they’re able to completely power down the GPU.

Look for other things which might be limiting power

In this blog post, I've covered some of the things which I've personally seen limit GPU power when playing games and running benchmarks.  However, it is by no means an exhaustive list.  If there's one thing that's true about power management, it's that every machine is a bit different.  The biggest challenge with my laptop was the NVIDIA discrete GPU draining power.  On some other laptop, it may be something else.

You can also look for background processes which may be using significant CPU cycles.  With a discrete GPU, a modest amount of background CPU work will often not hurt you unless the game is particularly CPU-hungry.  With an integrated GPU, however, it's far more likely that a background task such as a backup or software update will eat into the GPU's power budget.  Just this last week, a friend of mine was playing a game on Proton and discovered that the game launcher itself was burning enough power with the CPU to prevent the GPU from running at full power.  Once he suspended the game launcher, his GPU was able to run at full power.

Especially with laptops, you're also likely to be affected by the computer's cooling system as was mentioned earlier.  Some laptops such as my Razer are designed with high-end cooling systems that let the laptop run at full power.  Others, particularly the ultra-thin laptops, are far more thermally limited and may never be able to hit the advertised TDP for extended periods of time.

Conclusion

When trying to get the most performance possible out of a laptop, RAM configuration and power management are key.  Unfortunately, due to the issues documented above (and possibly others), the out-of-the-box experience on Linux is not what it should be.  Hopefully, we’ll see this situation improve in the coming years but for now this post will hopefully give people the tools they need to configure their machines properly and get the full performance out of their hardware.

This Is The End

…of my full-time hobby work on zink.

At least for a while.

More on that at the end of the post.

Before I get to that, let’s start with yesterday’s riddle. Anyone who chose this pic

1.png

with 51 fps as being zink, you were correct.

That’s right, zink is now at around 95% of native GL performance for this benchmark, at least on my system.

I know there’s been a lot of speculation about the capability of the driver to reach native or even remotely-close-to-native speeds, and I’m going to say definitively that it’s possible, and performance is only going to increase further from here.

A bit of a different look on things can also be found on my Fall roundup post here.

A Big Boost From Threads

I’ve long been working on zink using a single-thread architecture, and my goal has been to make it as fast as possible within that constraint. Part of my reasoning is that it’s been easier to work within the existing zink architecture than to rewrite it, but the main issue is just that threads are hard, and if you don’t have a very stable foundation to build off of when adding threading to something, it’s going to get exponentially more difficult to gain that stability afterwards.

Reaching a 97% pass rate on my piglit tests at GL 4.6 and ES 3.2 gave me a strong indicator that the driver was in good enough shape to start looking at threads more seriously. Sure, piglit tests aren’t CTS; they fail to cover a lot of areas, and they’re certainly less exhaustive about the areas that they do cover. With that said, CTS isn’t a great tool for zink at the moment due to the lack of provoking vertex compatibility support in the driver (I’m still waiting on a Vulkan extension for this, though it’s looking likely that Erik will be providing a fallback codepath for this using a geometry shader in the somewhat near future) which will fail lots of tests. Given the sheer number of CTS tests, going through the failures and determining which ones are failing due to provoking vertex issues and which are failing due to other issues isn’t a great use of my time, so I’m continuing to wait on that. The remaining piglit test failures are mostly due either to provoking vertex issues or some corner case missing features such as multisampled ZS readback which are being worked on by other people.

With all that rambling out of the way, let’s talk about threads and how I’m now using them in zink-wip.

At present, I’m using u_threaded_context, aka glthread, making zink the only non-radeon driver to implement it. The way this works is by using Gallium to write the command stream to a buffer that is then processed asynchronously, freeing up the main thread for application use and avoiding any sort of blocking from driver overhead. For systems where zink is CPU-bound in the driver thread, this massively increases performance, as seen from the ~40% fps improvement that I gained after the implementation.
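
To give a rough idea of the mechanism, here is a sketch of the general shape; this is not the real u_threaded_context API, just the "record small call tokens now, replay them on the driver thread later" idea:

    #include <stdint.h>

    /* Not the real u_threaded_context API -- just the general shape: the app
     * thread appends small "call" tokens into a batch buffer, and the driver
     * thread replays them later, so API calls return without ever touching
     * the real driver context. */
    struct tc_call {
       void (*execute)(void *driver_ctx, const void *payload);
       uint32_t payload_size;       /* bytes of payload following the header */
    };

    #define TC_ALIGN(x) (((x) + 7u) & ~7u)

    struct tc_batch {
       uint64_t storage[8 * 1024];  /* 64 KiB, kept 8-byte aligned */
       uint32_t used;               /* bytes used in storage */
    };

    /* Application thread: reserve a token plus payload space.  (A real
     * implementation would check for space and hand the batch off when full.) */
    static void *
    tc_add_call(struct tc_batch *batch,
                void (*execute)(void *, const void *),
                uint32_t payload_size)
    {
       uint8_t *buf = (uint8_t *)batch->storage;
       struct tc_call *call = (struct tc_call *)(buf + batch->used);
       call->execute = execute;
       call->payload_size = payload_size;
       batch->used += TC_ALIGN(sizeof(*call) + payload_size);
       return call + 1;             /* caller copies the payload in here */
    }

    /* Driver thread: replay the batch once it has been handed over. */
    static void
    tc_execute_batch(void *driver_ctx, struct tc_batch *batch)
    {
       uint8_t *buf = (uint8_t *)batch->storage;
       uint32_t offset = 0;

       while (offset < batch->used) {
          struct tc_call *call = (struct tc_call *)(buf + offset);
          call->execute(driver_ctx, call + 1);
          offset += TC_ALIGN(sizeof(*call) + call->payload_size);
       }
       batch->used = 0;
    }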

This transition presented a number of issues, the first of which was that u_threaded_context required buffer invalidation and rebinding. I’d had this on my list of targets for a while, so it was a good opportunity to finally hook it up.

Next up, u_threaded_context was very obviously written to work for the existing radeon driver architecture, and this was entirely incompatible with zink, specifically in how the batch/command buffer implementation is hardcoded like I talked about yesterday. Switching to monotonic, dynamically scaling command buffer usage resolved that and brought with it some other benefits.

The other big issue was, as I’m sure everyone expected, documentation.

I certainly can’t deny that there’s lots of documentation for u_threaded_context. It exists, it’s plentiful, and it’s quite detailed in some cases.

It’s also written by people who know exactly how it works with the expectation that it’s being read by other people who know exactly how it works. I had no idea going into the implementation how any of it worked other than a general knowledge of the asynchronous command stream parts that are common to all thread queue implementations, so this was a pretty huge stumbling block.

Nevertheless, I persevered, and with the help of a lot of RTFC, I managed to get it up and running. This is a more general overview post rather than a more in-depth, technical one, so I’m not going to go into any deep analysis of the (huge amounts of) code required to make it work, but here’s some key points from the process in case anyone reading this hits some of the same issues/annoyances that I did:

  • use consistent naming for all your struct subclassing, because a huge amount of the code churn is just going to be replacing driver class -> gallium class references to driver class -> u_threaded_context class -> gallium class ones; if you can sed these all at once, it simplifies the work tremendously
  • u_threaded_context works off the radeon queue/fence architecture, which allows (in some cases) multiple fences for any given queue submission, so ensure that your fences work the same way or (as I did) can effectively have sub-fences
  • obviously don’t forget your locking, but also don’t over-lock; I’m still doing some analysis to check how much locking I need for the context-based caches, and it may even be the case that I’m under-locked at the moment, but it’s important to keep in mind that your pipe_context can be in many different threads at a given time, and so, as the u_threaded_context docs repeatedly say without further explanation, don’t use it “in an unsafe way”
  • the buffer mapping rules/docs are complex, but basically it boils down to checking the TC_TRANSFER_MAP_* flags before doing the things that those flags prohibit
    • ignore threaded_resource::max_forced_staging_uploads to start with since it adds complexity
    • if you get TC_TRANSFER_MAP_THREADED_UNSYNC, you have to use threaded_context::base.stream_uploader for staging buffers, though this isn’t (currently) documented anywhere
    • watch your buffer alignments; I already fixed an issue with this, but u_threaded_context was written for radeon drivers, so there may be other cases where hardcoded values for those drivers exist
    • probably just read the radeonsi code before even attempting this anyway

All told, fixing all the regressions took much longer than the actual implementation, but that’s just par for the course with driver work.

Anyone interested in testing should take note that, as always, this has only been used on Intel hardware (and if you’re on Intel, this post is definitely worth reading), and so on systems which were not CPU-bound previously or haven’t been worked on by me, you may not yet see these kinds of gains.

But you will eventually.

And That’s It

This is a sort of bittersweet post as it marks the end of my full-time hobby work with zink. I’ve had a blast over the past ~6 months, but all things change eventually, and such is the case with this situation.

Those of you who have been following me for a long time will recall that I started hacking on zink while I was between jobs in order to improve my skills and knowledge while doing something productive along the way. I succeeded in all regards, at least by my own standards, and I got to work with some brilliant people at the same time.

But now, at last, I will once again become employed, and the course of that employment will take me far away from this project. I don’t expect that I’ll have a considerable amount of mental energy to dedicate to hobbyist Open Source projects, at least for the near term, so this is a farewell of sorts in that sense. This means (again, for at least the near term):

  • I’ll likely be blogging far less frequently
  • I don’t expect to be writing any new patches for zink/gallium/mesa

This does not mean that zink is dead, or that development on the project is stalling, or anything like that, so don't start overreaching on the meaning of this post.

I still have 450+ patches left to be merged into mainline Mesa, and I do plan to continue driving things towards that end, though I expect it’ll take a good while. I’ll also be around to do patch reviews for the driver and continue to be involved in the community.

I look forward to a time when I’ll get to write more posts here and move the zink user experience closer to where I think it can be.

This is Mike, signing off for now.

Happy rendering.

November 05, 2020

During my presentation at the X Developers Conference I stated that we had been mostly using the Khronos Vulkan Conformance Test suite (aka Vulkan CTS) to validate our Vulkan driver for Raspberry Pi 4 (aka V3DV). While the CTS is an invaluable resource for driver testing and validation, it doesn’t exactly compare to actual real world applications, and so, I made the point that we should try to do more real world testing for the driver after completing initial Vulkan 1.0 support.

To be fair, we had been doing a little bit of this already when I worked on getting the Vulkan ports of all 3 Quake game classics to work with V3DV, which allowed us to identify and fix a few driver bugs during development. The good thing about these games is that we could get the source code and compile them natively for ARM platforms, so testing and debugging was very convenient.

Unfortunately, there are not a plethora of Vulkan applications and games like these that we can easily test and debug on a Raspberry Pi as of today, which posed a problem. One way to work around this limitation that was suggested after my presentation at XDC was to use Zink, the OpenGL to Vulkan layer in Mesa. Using Zink, we can take existing OpenGL applications that are currently available for Raspberry Pi and use them to test our Vulkan implementation a bit more thoroughly, expanding our options for testing while we wait for the Vulkan ecosystem on Raspberry Pi 4 to grow.

So last week I decided to get hands-on with that. Zink requires a few things from the underlying Vulkan implementation depending on the OpenGL version targeted. Currently, Zink only targets desktop OpenGL versions, so that limits us to OpenGL 2.1, which is the maximum version of desktop OpenGL that Raspberry Pi 4 can support (we support up to OpenGL ES 3.1 though). For that desktop OpenGL version, Zink required a few optional Vulkan 1.0 features that we were missing in V3DV, namely:

  • Logic operations.
  • Alpha to one.
  • VK_KHR_maintenance1.

The first two were trivial: they were already implemented and we only had to expose them in the driver. Notably, when I was testing these features with the relevant CTS tests I found a bug in the alpha to one tests, so I proposed a fix to Khronos which is currently in review.

I also noticed that Zink was implicitly requiring support for timestamp queries, so I implemented those in V3DV and then wrote a patch for Zink to handle this requirement better.

Finally, Zink doesn’t use Vulkan swapchains, instead it creates presentable images directly, which was problematic for us because our platform needs to handle allocations for presentable images specially, so a patch for Zink was also required to address this.

As of the writing of this post, all this work has been merged in Mesa and it enables Zink to run OpenGL 2.1 applications over V3DV on Raspberry Pi 4. Here are a few screenshots of Quake3 taken with the native OpenGL driver (V3D), with the native Vulkan driver (V3DV) and with Zink (over V3DV). There is a significant performance hit with Zink at present, although that is probably not too unexpected at this stage, but otherwise it seems to be rendering correctly, which is what we were really interested to see:


Quake3 Vulkan renderer (V3DV)

Quake3 OpenGL renderer (V3D)

Quake3 OpenGL renderer (Zink + V3DV)

Note: you’ll notice that the Vulkan screenshot is darker than the OpenGL versions. As I reported in another post, that is a feature of the Vulkan port of Quake3 and is unrelated to the driver.

Going forward, we expect to use Zink to test more applications and hopefully identify driver bugs that help us make V3DV better.

It’s Time.

I’ve been busy cramming more code than ever into the repo this week in order to finish up my final project for a while by Friday. I’ll talk more about that tomorrow though. Today I’ve got two things for all of you.

First, A Riddle

Of these two screenshots, one is zink+ANV and one is IRIS. Which is which?

2.png

1.png

Second, Queue Architecture

Let’s talk a bit at a high level about how zink uses (non-compute) command buffers.

Currently in the repo zink works like this:

  • there is 1 queue
  • there are 4 command buffers used in a ring
  • after every flush (e.g., glFlush), the command buffers cycle
  • the driver flushes itself internally on pretty much every function call
  • any time an in-use command buffer is iterated to, the driver stalls until the command buffer has completed

In short, there’s a huge bottleneck around the flushing mechanism, and then there’s a lesser-reached bottleneck for cases where an application flushes repeatedly before a command buffer’s ops are completed.

Some time ago I talked about some modifications I’d done to the above architecture, and then things looked more like this:

  • there is 1 queue
  • there are 4 command buffers used in a ring
  • after every flush (e.g., glFlush), the command buffers cycle
  • the driver defers all possible flushes to try and match 1 flush to 1 frame
  • any time an in-use command buffer is iterated to, the driver stalls until the command buffer has completed

The major difference after this work was that the flushing was reduced, which then greatly reduced the impact of that bottleneck that exists when all the command buffers are submitted and the driver wants to continue recording commands.

A lot of speculation has occurred among the developers over “how many” command buffers should be used, and there’s been some talk of profiling this, but for various reasons I’ll get into tomorrow, I opted to sidestep the question entirely in favor of a more dynamic solution: monotonically-identified command buffers.

Monotony

The basic idea behind this strategy, which is used by a number of other drivers in the tree, is that there’s no need to keep a “ring” of command buffers to cycle through, as the driver can just continually allocate new command buffers on-the-fly and submit them as needed, reusing them once they’ve naturally completed instead of forcibly stalling on them. Here’s a visual comparison:

The current design:

Here’s the new version:

This way, there’s no possibility of stalling based on application flushes (or the rare driver-internal flush which does still exist in a couple places).
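
In code terms, the strategy boils down to something like the following sketch (hypothetical structures, not the actual zink code): batches carry a monotonically increasing id, completed ones go onto a free list for reuse, and when nothing has completed yet the driver just allocates another one instead of stalling.

    #include <stdint.h>
    #include <stdlib.h>

    /* Hypothetical sketch, not the actual zink implementation. */
    struct batch {
       uint64_t      id;         /* monotonically increasing identifier */
       struct batch *next;       /* free-list link                      */
    };

    struct batch_pool {
       struct batch *free_list;  /* batches whose fences have signaled  */
       uint64_t      next_id;
    };

    /* Get a batch to record into: reuse a completed one if available,
     * otherwise grow the pool -- never stall waiting for a ring slot. */
    static struct batch *
    batch_pool_get(struct batch_pool *pool)
    {
       struct batch *b = pool->free_list;

       if (b) {
          pool->free_list = b->next;
       } else {
          b = calloc(1, sizeof(*b));
          if (!b)
             return NULL;
       }
       b->id = pool->next_id++;
       b->next = NULL;
       return b;
    }

    /* Called from fence/completion handling once a batch's work is done. */
    static void
    batch_pool_recycle(struct batch_pool *pool, struct batch *b)
    {
       b->next = pool->free_list;
       pool->free_list = b;
    }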

The architectural change here had two great benefits:

  • for systems that aren’t CPU bound, more command buffers will automatically be created and used, yielding immediate performance gains (~5% on Dave Airlie’s AMD setup)
  • the driver internals get massively simplified

The latter of these is due to the way that the queue in zink is split between gfx and compute command buffers; with the hardcoded batch system, the compute queue had its own command buffer while the gfx queue had four, but they all had unique IDs which were tracked using bitfields all over the place, not to mention it was frustrating never being able to just “know” which command buffer was currently being recorded to for a given command without indexing the array.

Now it’s easy to know which command buffer is currently being recorded to, as it’ll always be the one associated with the queue (gfx or compute) for the given operation.

This had further implications, however, and I’d done this to pave the way for a bigger project, one that I’ve spent the past few days on. Check back tomorrow for that and more.

November 02, 2020

New Hotness

Quick update today, but I’ve got some very exciting news coming soon.

The biggest news of the day is that work is underway to merge some patches from Duncan Hopkins which enable zink to run on Mac OS using MoltenVK. This has significant potential to improve OpenGL support on that platform, so it’s awesome that work has been done to get the ball rolling there.

In only slightly less monumental news though, Adam Jackson is already underway with Vulkan WSI work for zink, which is going to be huge for performance.

October 30, 2020

(I just sent the email below to the mesa3d developer list.)

Just to let everyone know, a month ago I submitted the 20.2 llvmpipe
driver for OpenGL 4.5 conformance under the SPI/X.org umbrella, and it
is now official[1].

Thanks to everyone who helped me drive this forward, and to all the
contributors both to llvmpipe and the general Mesa stack that enabled
this.

Big shout out to Roland Scheidegger for helping review the mountain of
patches I produced in this effort.

My next plans involve submitting lavapipe for Vulkan 1.0; it's at 99%
or so of CTS, but there are line drawing, sampler accuracy and some snorm
blending failures I have to work out.
I also ran the OpenCL 3.0 conformance suite against clover/llvmpipe
yesterday and have some vague hopes of driving that to some sort of
completion.

(for GL 4.6 only texture anisotropy is really missing, I've got
patches for SPIR-V support, in case someone was feeling adventurous).

Dave.

[1] https://www.khronos.org/conformance/adopters/conformant-products/opengl#submission_272