July 21, 2017

I planned on writing about the Present extension this week, but I’ll postpone this since I’m currently absorbed in finding the last rough edges of a first patch I can show off. I then hope to get some feedback on it from other developers on the xorg-devel mailing list.

Another reason is that I have stalled my work on the Present extension for now and am trying to get my Xwayland code working first. My mentor Daniel recommended that to me since the approach I pursued in my work on Present might be more difficult than I first assessed. At least it is something similar to what other, far more experienced developers than myself tried in the past and weren’t able to do, according to Daniel. My idea was to make Present flip per CRTC only, but this would clash with Pixmaps being linked to the whole screen only. There are no Pixmaps only for CRTCs in X.

On the other hand, when accepting the restriction of only being able to flip one window at a time, my code already works quite well. The flipping is smooth and, at least in a short test, also improved the frame rate. But the main problem I had, and to some degree still have, is that stopping the flipping can fail. The reason seems to be that the Present extension always sets the Screen Pixmap on flips. But when I test my work with KWin, it drives Xwayland in rootless mode, i.e. without a Screen Pixmap and with only the Window Pixmaps. I’m currently looking into how to circumvent this in Xwayland. I think it’s possible, but I need to look very carefully at how to change the process in order not to forget necessary cleanups on the flipped Pixmaps. I hope though that I’m able to solve these issues this weekend and then get some feedback on the xorg-devel mailing list.

As always you can find my latest work on my working branch on GitHub.

@GodTributes took over my title, soz.

Dude, where's my maintainer?

Last year, probably as a distraction from doing anything else, or maybe because I was asked, I started reviewing bugs filed as a result of automated flaw discovery tools (from Coverity to UBSan via fuzzers) being run on gdk-pixbuf.

Apart from the security implications of a good number of those problems, there was also the annoyance of having a busted image file bring down your file manager, your desktop, or even an app that opened a file chooser either because it was broken, or because the image loader for that format didn't check for the sanity of memory allocations.

(I could have added links to Bugzilla entries for each one of the problems above, but that would just make it harder to read)

Two big things happened in gdk-pixbuf 2.36.1, which was used in GNOME 3.24:

  • the removal of GdkPixdata as a stand-alone image format loader. We really don't want to load GdkPixdata files from sources other than generated sources or embedded data structures, and removing that loader closed off those avenues. We still ended up fixing a fair number of naive assumptions in helper functions though.
  • the addition of a thumbnailer for gdk-pixbuf supported images. Images would not be special-cased any more in gnome-desktop's thumbnailing code, making the file manager, the file chooser and anything else navigating directories full of broken and huge images more reliable.
But that's just the start. gdk-pixbuf continues getting bug fixes, and we carry on checking for overflows, underflows and just flows, breaks and beats in general.

Programmatic Thumbellina portrait-maker

Picture, if you will, a website making you download garbage files from the Internet, the ROM dump of a NES cartridge that wasn't properly blown on and digital comic books that you definitely definitely paid for.

That's a nice summary of the security bugs foisted upon GNOME in the past year or so, even if, thankfully, we were ahead of the curve in terms of fixing those issues (the GStreamer NSF decoder bug was removed in 2013, the comics backend in evince was rewritten over a period of 2 years and committed in March 2017).

Still, 2 pieces of code were running on pretty much every file downloaded, on purpose or not, from the Internet: Tracker's indexers and the file manager's thumbnailers.

Tracker started protecting itself not long after the NSF vulnerability, even if recent versions of GStreamer weren't vulnerable, as we mentioned.

That left the thumbnailers. Some of those are first party, like the gdk-pixbuf one, and those offered by core applications (Evince, Videos), written by GNOME developers (yours truly for both epub/mobi and Nintendo DS).

They're all good quality code I'd vouch for (having written or maintained quite a few of them), but they can rely on third-party libraries (say GStreamer, poppler, or libarchive), have naive or insufficiently defensive code (gdk-pixbuf loaders,  GStreamer plugins) or, worst of all: THIRD-PARTY EXTENSIONS.

There are external plugins and extensions for image formats in gdk-pixbuf, for video and audio formats in GStreamer, and for thumbnailers pretty much anywhere. We can't control those, but the least we can do when they explode in a wet mess is make sure that the toilet door is closed.

Not even Nicholas Cage can handle this Alcatraz

For GNOME 3.26 (and today in git master), the thumbnailer stall will be doubly bolted by a Bubblewrap sandbox and a seccomp blacklist.

This closes a whole vector of attack for the GNOME Desktop, but doesn't mean we're completely out of the woods. We'll need to carry on maintaining and fixing security bugs in those libraries and tools we depend on, as GStreamer plugin bugs still affect Videos, gdk-pixbuf bugs still affect Photos and Eye Of Gnome, etc.

And there are limits to what those 2 changes can achieve. The sandboxing and syscall blacklisting avoids those thumbnailers writing anything but an image file in PNG format in a specific directory. There's no network, the filename of the original file is hidden and sanitised, but the thumbnailer could still create a crafted PNG file, and the sandbox doesn't work inside a sandbox! So no protection if the application running the thumbnailer is inside Flatpak.

In fine

GNOME 3.26 will have better security for thumbnailers, so you won't "need to delete GNOME Files".

But you'll probably want to be careful with desktops that forked our thumbnailing code, namely Cinnamon and MATE, which don't implement those security features.

The next step for the thumbnailers will be beefing up our protection against greedy thumbnailers (in terms of CPU and memory usage), and sharing the code better between thumbnailers.

Note for later, more images of cute animals.
When I discussed this project with my mentor before GSoC, he told me that the button mappings were going to be the most complicated piece. This week I’ve been working on precisely that and, well, let’s just say he wasn’t wrong 😉 If you’ve been following along on GitHub, you’re probably thinking that it was a slow week. Indeed, there hasn’t been as much activity this week as in previous weeks.
July 20, 2017

Since the last post a lot of work has gone into upstreaming and stabilizing the etnaviv on Android ecosystem. This has involved Android, kernel and Mesa changes, many of which are available upstream now. A How-To for getting you up and running on an iMX6 dev board is available here.


Modifiers support

Modifiers support has been accepted into Mesa, GBM and gbm_gralloc. Modifiers were mentioned in a previous post.

Etnaviv driver support for Android

Patches enabling the etnaviv Mesa driver being built for Android have now landed upstream.

Stability on Android

A number of small stability issues present while running Android on i.MX6 hardware have now been fixed, and the platform is now relatively stable.

Performance diagnostics

We have a decent understanding that the …

July 17, 2017

Video of my casync Presentation @ kinvolk

The great folks at kinvolk have uploaded a video of my casync presentation at their offices last week.

The slides are available as well.


This week’s VC5 progress primarily involved rewriting the CLIF-style debug dumps using Intel’s gen_decode.c code. Instead of code-generating a bunch of C functions that print out a struct’s contents, I now have a little bit of C code that parses a compressed version of the XML at runtime to pick apart the struct and dump it. I’ve implemented this on VC4 and VC5, and started the Android build debugging process for it.

I also finished fixing the regressions from my VC5 QIR redesign. We now operate on just QPU instructions with sideband information for register allocation, instead of a higher-level IR (that’s what NIR is for).

For Raspbian performance, I’ve been talking with keithp and others about my window-dragging performance issue. My current plan is to implement a little GL extension that gives glBlitFramebuffer() defined behavior for 1:1 overlapping copies, and then use that from X11 to avoid the temporary. That should cut the cost of the window movement in half (not counting the cost of the drawing caused by the expose events).

On the KMS front, I’ve fixed a regression in dual-display support from my previous tiling work. In the process, I’ve also written a fix for panning, which was broken even before the tiling work. I’ve pushed a fix from Boris for a warning on CRTC enable. I’ve also worked on handling the review feedback from the last DSI series, and started on review of Hans’s VC4 CEC support.

July 14, 2017

I provided in the past few weeks some general information about my project and hopefully helpful documentation for the multiple components I’m working with, but I have not yet talked about the work I’m doing on the code itself. Let’s change this today.

You can find my work branch on GitHub. It’s basically just a personal repository so I can sync my work between my devices, so be warned: The commits are messy as nothing is cleaned up and debug lines as well as temporary TODOs are all over the place. And to be honest, up until yesterday my changes didn’t amount to much. For some reason no picture was displayed in my two test applications, which are Neverball and VLC.

Then on the weekend suddenly the KWin Wayland session wouldn’t even launch anymore. Well, at least this issue I was able to fix pretty quickly. But there was still no picture. It seemed the presenting just halted after the first buffer was sent to KWin, without any further messages. Neither the Xwayland server nor the client were unresponsive though. Only today could I finally solve the problem, thanks to Daniel’s help. The reason for the halt was that I waited on a frame callback from KWin in order to present the next frame. But this never arrived, since I hadn’t set any damage in the previous frame and KWin then wouldn’t signal a new frame. I fixed it by adding a generic damage request for now. After that the picture was displayed and moving nicely.

This is definitely the first big milestone with this project. Until now all I achieved was increasing my own knowledge by reading documentation and poking into the code with debug lines. Ok, I also added some code I hoped would make sense, but besides the compilation there was no feedback through a working prototype to see if my code was going in the right direction or if it was utter bollocks. But after today I can say that my buffer flipping and committing code at least produces a picture. And when looking at the FPS counter in Neverball I would even say that the buffer flipping replacing all the buffer copies already improved the frame rate.

But to test this I first had to solve another problem: The frame rate was always limited to the 60 Hz of my display. The reason was simple: I called present_event_notify only on the frame callback, but in the Xwayland case we can call it directly after the buffer has been sent to the compositor. The only problem I see with this is that the Present extension assumes that after a new Pixmap has been flipped the old one can instantly be marked ready for reuse with new rendering content. But if the last Pixmap’s buffer is still used by the compositor in some way, this can lead to tearing.

This hints at a fundamental issue with our approach of using the Present extension in Xwayland. The extension was written with hardware in mind. It assumes a flip happens directly on a screen. There is no intermediate link like a Wayland compositor, and if a flip has happened the old buffer is not on the screen anymore. Why do we still try to leverage the Present extension support in Xwayland then? There are two important features of a Wayland compositor we want to have with Xwayland: a tear-free experience for the user and the ability to output a buffer rendered by a direct rendering client on a hardware plane without any copies in between. “Every frame is perfect” should also remain valid when using some legacy application, and that we want no unnecessary copies is simply a question of performance. This is especially important for many of the more demanding games out there, which won’t be Wayland native in the short term, and some of them maybe never. Both features need the full Present extension support in the Xwayland DDX. Without it a direct rendering application would still use the Present extension, but only with its fallback code path of copying the Pixmap’s content. And for a tear-free experience we would at least need to sync these copies to the frame events sent by the Wayland compositor, or better, directly allow multiple buffers, otherwise we would limit our frame rate. In both cases this again means increasing the Present extension support.

I plan on writing about the Present extension in detail in the next week. So if you didn’t fully understand some of the concepts I talked about in this post it could be a good idea to check back.

Originally the plan for this week was to start working on the button mappings, however together with my mentor I decided that it’s better to do the LEDs first. This is because I was sure I could finish this in a few days, and button mappings is definitely going to take much longer than that. So, this week I’ll run you through the implementation of the LED stack page, and the coming weeks I’ll be working on button mappings, profile support and a proper welcome screen, in that order.
July 13, 2017

We managed to get Fedora Workstation 26 out the door this week which I am very happy about. In some ways it was far from our most splashy release as it mostly was about us improving on already released features, like improving the Wayland support and improving the Flatpak support in GNOME Software and improving the Qt integration into GNOME through the QtGNOME platform.

One major thing that is fully functional now though and that I have been testing myself extensively is being able to easily install the NVidia binary driver. If you set up the repository from Negativo17 you should be able to go install the Nvidia driver either using dnf on the command line or by searching for NVidia in GNOME Software, and just install it without any further work thanks to all the effort we and NVidia have been putting into things like glvnd. If you have a workstation with an NVidia card I would say that you have a fully functional system at this point without any hacks or file conflicts with Mesa.

For hybrid graphics laptops this also just works, with the only caveat being that your NVidia card will be engaged at all times once you do this, which is not great for your battery life. We are working to improve this, but it will take some time as it requires both re-architecting some older parts of the stack and getting the Nvidia driver updated to support the new solution.

We do plan on listing the NVidia driver in GNOME Software soon without having to manually setup the repository, so soon we will have a very smooth experience where the Nvidia driver is just a click in the Software store away for our users.

Another item of interest here for the discerning user is that if you are on the NVidia binary driver you will be using X and not Wayland. The reason for this, as I have stated in previous blog posts too, is that we still have some major gaps on the Wayland side when it comes to dealing with the binary NVidia driver. The biggest one here is that XWayland OpenGL applications don’t work, something the team is hard at work trying to resolve. Also the general infrastructure for dealing with hybrid graphics under Wayland is not there yet, but we are working on that too. We have a top notch team looking at the issues here, including Adam Jackson, Jonas Ådahl and Olivier Fourdan, so I am sure we will close this gap as soon as technically possible.

The other big item we have for Fedora Workstation 26 is going to be the formal launch of the Fleet Commander project, with a fully functional release and proper website. We hope to get that set up for next week, so I will blog more about it then. It is a really cool piece of technology which should make deploying Fedora and RHEL in large organizations a lot simpler.

As a sidenote, we received our first HDR capable monitor in the office this week, a Dell Ultrasharp UP2718Q. We have another one already ordered and we should be bringing in more in the next months. This means we can finally seriously kick off figuring out the plumbing work and update the userland stack to have full HDR support under Linux for both media creation and consumption.

July 10, 2017
A little while back I took to wondering why one particular demo from the Sascha Willems vulkan demos was a lot slower on radv compared to amdgpu-pro. Like half the speed slow.

I internally titled this my "no fps left behind" project.

The deferred demo does an offscreen rendering to three 2048x2048 color attachments and one 2048x2048 D32S8 depth attachment. It then does a rendering using those down to a 1280x720 screen image.

Bas identified that the first cause was probably the fact we were doing clear color eliminations on the offscreen surfaces when we didn't need to. AMD GPUs have a delta-color compression feature, and with certain clear values you don't need to do the clear color elimination step. This brought me back from about 1/2 the FPS to about 3/4, however it took me quite a while to figure out where the rest of the FPS were hiding.

I took a few diversions in my testing. I pulled in some experimental patches to allow the depth buffer to be texture cache compatible, so we could bypass the depth decompression pass, however this didn't seem to budge the number too much.

I found a bunch of registers where we were setting different values from -pro, but nothing much came of these.

I found some places we were using a compute shader to fill some DCC or htile surfaces to a value, then doing a clear and overwriting the values, not much help.

I noticed the vertex descriptions and buffer attachments on amdgpu-pro were done quite differently from how radv does it. With vulkan you have vertex descriptors and bindings, and with radv we generate a set of hw descriptors from the combination of both descriptors and bindings. The pro driver uses typed buffer loads in the shader to embed the descriptor contents in the shader, then it only updates the hw descriptors for the buffer bindings. This seems like it might be more efficient, but guess what, no help. (LLVM just grew support for typed buffer loads, so we could probably move to this scheme now if we wished).

I dug out some patches that inline all the push constants and some descriptors so our shaders had less overhead (really helps our meta shaders have less impact), no help.

I noticed they export the shader results in a different order from the fragment shader, and always at the end. (no help). The vertex shader emits pos first, (no help). The vertex shader uses off exports for unused channels, (no help).

I went on holidays for a week and came back to stare at the traces again, when my brain finally noticed something I'd missed. When binding the 3 color buffers, the addresses given as the base address were unusual. A surface has a 40-bit address, normally for alignment and tiling the bottom 16 bits are 0, and we shift 8 of those off completely before writing them. This means the bottom 8 bits of the written base address should be 0, and the CIK docs from AMD say that. However the pro traces didn't have these at 0. It appears from earlier evergreen/cayman documents that these registers control some tiling offset bits. After writing a hacky patch to set the values, I managed to get back the rest of the FPS I was missing in the deferred demo. I discussed with AMD developers, and we worked out the addrlib library has an API for working out these values, and it seems that it allows better memory bandwidth utilisation. I've written a patch to try and use these values correctly and sent it out along with the DCC avoidance patch.
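To make the address arithmetic above concrete, here is a small sketch. It only models the shift described in the text; the function names and the example address are made up for illustration, and it does not attempt to show what the pro driver or addrlib actually puts into the low bits.

```python
# Sketch of the base address arithmetic described above.
# Function names and the sample address are hypothetical.

ADDRESS_BITS = 40   # a surface has a 40-bit address
ALIGN_BITS = 16     # bottom 16 bits are normally 0 for alignment and tiling
SHIFT = 8           # 8 of those bits are shifted off before writing


def base_register(address: int) -> int:
    """Return the value written as the base address (address >> 8)."""
    assert 0 <= address < (1 << ADDRESS_BITS), "addresses are 40 bits wide"
    return address >> SHIFT


def low_byte(register_value: int) -> int:
    """Return the bottom 8 bits of the written value."""
    return register_value & 0xFF


addr = 0x12_3456_0000                            # aligned: low 16 bits are 0
assert addr & ((1 << ALIGN_BITS) - 1) == 0

reg = base_register(addr)
print(hex(reg), low_byte(reg))                   # low byte is 0 for an aligned surface
```

Since the low byte is always 0 for an aligned surface, those 8 bits are free to carry something else, which is why non-zero values in the pro traces stood out.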

Now I'm not sure this will help any real apps, we may not be hitting limitations in that area, and I'm never happy with the benchmarks I run myself. I thought I saw some FPS difference with some madmax scenes, but I might be lying to myself. Once the patches land in mesa I'm sure others will run benchmarks and we can see if there is any use case where they have an effect. The AMD radeonsi OpenGL driver can also do the same tweaks so hopefully there as well there will be some benefit.

Otherwise I can just write this off as making deferred run at equality and removing at least one of the deltas that radv has compared to the pro driver. Some of the other differences I discovered along the way might also have some promise in other scenarios, so I'll keep an eye on them.

Thanks to Bas, Marek and Christian for looking into what the magic meant!
Due to lots of people telling me LJ is bad, mm'kay, I've migrated to blogspot.

New blog is/will be here:
I'm moving my blog from LJ to blogspot, because people keep telling me LJ is up to no good, like hacking DNC servers and interfering in elections.

Last week I got permission to open source my work on a Mesa-based VC5 3D driver for BCM7268. You can see the announcement here which I won’t replicate on this blog. TWIVC4 is going to be a lot of TWIVC5 from here on out!

I spent the rest of the week working on fixing performance regressions when Raspbian switches from software rendering to using the vc4 driver without a compositor enabled. The current concern is that window dragging gets slower, and in the worst case can end up with seconds of window dragging queued up behind the motion of the mouse cursor.

Past debugging of mine into how we end up with seconds of window movement queued was fruitless. I suspect it’s “each mouse position is streamed out to the window manager, and the window manager naively queues up a window move for each position it gets, rather than reading through all the position events it gets at once and sending a single move for the last one”. Instead, I worked on just seeing if we can speed up enough that we don’t care.
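As an illustration of the suspicion above, here is a toy sketch of the coalescing a window manager could do. This is not taken from any real window manager; the queue and function name are invented for the example.

```python
# Toy model of motion event coalescing. Hypothetical names throughout;
# this does not reflect the code of any actual window manager.
from collections import deque


def coalesced_move(queue: deque):
    """Drain all pending motion events and keep only the newest position.

    Returns None when nothing was pending, so the caller can skip the move.
    """
    last = None
    while queue:
        last = queue.popleft()
    return last


pending = deque([(0, 0), (5, 3), (9, 7)])  # positions streamed from the pointer
target = coalesced_move(pending)
print(target)                              # one window move to the last position
```

The naive alternative issues one window move per queued position, which is how seconds of movement could pile up behind the cursor.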

X11 opaque window dragging is a tough case, because unlike compositors, the contents of the window are stored on the screen (saving gobs of memory, which is important on the Raspberry Pi). When you drag the window, the src and dst regions usually overlap, so we have to be careful to not overwrite src pixels before they’ve been copied to the dst. In software rasterization, we just arrange the memcpy to happen in the correct order. For GL, we have no such control.
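Here is a one-dimensional sketch of why the copy order matters for overlapping regions. It is a plain Python stand-in for the memcpy ordering mentioned above, not code from the X server.

```python
# Overlapping copy within one buffer, in one dimension for brevity.
# An illustration of the ordering problem, not X server code.

def copy_overlapping(buf, src, dst, n):
    """Copy buf[src:src+n] to buf[dst:dst+n] in place, safely."""
    if dst > src:
        indices = range(n - 1, -1, -1)  # moving right: copy back to front
    else:
        indices = range(n)              # moving left: copy front to back
    for i in indices:
        buf[dst + i] = buf[src + i]


def copy_forward_only(buf, src, dst, n):
    """Always copy front to back; wrong when dst overlaps ahead of src."""
    for i in range(n):
        buf[dst + i] = buf[src + i]


good = list("abcdef__")
copy_overlapping(good, 0, 2, 6)
print("".join(good))   # source pixels arrive intact at the destination

bad = list("abcdef__")
copy_forward_only(bad, 0, 2, 6)
print("".join(bad))    # source pixels were overwritten before being read
```

With GL there is no way to pick the order like this, which is what forces a different strategy.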

What glamor does instead is make a temporary copy of the src pixels, and then copy from the temp to the dst. This creates dependencies between the screen-to-temp and temp-to-screen jobs, so we flush the rendering job at least twice per copy of the window, not counting any flushes that happen for the rendering of the exposed contents from whatever was underneath the window’s old position.

In my tracing, I found that the jobs being generated during window dragging were saying that they could modify any tile on the screen, not just the tiles being affected by the copy (so we read and write each tile on the screen for those jobs). In many paths in glamor we use glScissor() to limit our rendering to some subset of the screen, and this lets the GL keep our jobs trimmed to the appropriate size. However, copies and rectangle fills were scissoring only to the destination drawable’s area, which for LXDE was everything but the global menu bar.
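To get a feeling for how much a tighter scissor can save on a tiled renderer, here is a rough sketch that counts the tiles a rectangle overlaps. The 64x64 tile size, the screen size and the window rectangle are assumptions picked for the example, not values taken from the tracing.

```python
# Count how many tiles a scissor rectangle touches.
# Tile size, screen size and window rectangle are illustrative assumptions.

def tiles_touched(x, y, w, h, tile=64):
    """Return the number of tiles overlapped by [x, x+w) x [y, y+h)."""
    if w <= 0 or h <= 0:
        return 0
    first_col, last_col = x // tile, (x + w - 1) // tile
    first_row, last_row = y // tile, (y + h - 1) // tile
    return (last_col - first_col + 1) * (last_row - first_row + 1)


whole_screen = tiles_touched(0, 0, 1920, 1080)   # scissor covers everything
one_window = tiles_touched(100, 100, 400, 300)   # scissor clipped to the copy
print(whole_screen, one_window)
```

Every tile in a job gets read and written, so a job trimmed to the window's bounds does a small fraction of the memory traffic.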

I made a small series that looks at small drawing operations and uses glScissor() to clip to their bounds to help tiled renderers like vc4. I was careful to try to limit the impact of these changes on non-vc4 – fast desktop renderers don’t want to spend the CPU to compute the bounds of the operations when they don’t use the bounds.

It hasn’t completely fixed RPi window dragging, but things are a lot smoother. We may find more paths that need this treatment as more people switch to using vc4 for X11 drawing.

July 07, 2017

After the two-part series on the fundamentals of Xwayland, I want to briefly introduce the basic idea for my Google Summer of Code (GSoC) project for X.Org. This means I’ll talk about how Xwayland currently handles the graphic buffers of its applications, why this leads to tearing and how we plan to change that.

The project has its origin in my work on KWin. In fact there is some connection to my unsuccessful GSoC application from last year on atomic mode setting and layered compositing in KWin. You can read up on these notions and the previous application in some of my older posts, but the relevant part of it to this year’s project is in short the transfer of application graphic buffers directly onto the screen without the Wayland server compositing them into a global scene before that. This can be done by putting the buffers on some overlay planes and let the hardware do the compositing of these planes into a background provided by the compositor or in the simpler case by putting a single buffer of a full screen application directly onto the primary plane.

At the beginning of the year I was working on enabling this simpler case in KWin. In a first working prototype I was pretty sure I got the basic implementation right, but my test, a full screen video in VLC, showed massive tearing. Of course I suspected at first my own code to be the problem, but in this case it wasn’t. Only after I wrote a second test application, which was a simple QML application playing the same video in full screen and showing no tearing, I had the suspicion that the problem wasn’t my code but Xwayland, since VLC was running on Xwayland while my test application was Wayland native.

Indeed the Wayland protocol should prevent tearing overall, as long as the client respects the compositor’s messages. It works like this: After committing a newly drawn buffer to the server, the client is not allowed to touch it anymore and only after the compositor has sent the release event, the client is again allowed to repaint or delete it. If the client needs to repaint in the meantime it is supposed to allocate a different buffer object. But this is exactly, what Xwayland based applications are not doing, as Daniel Stone was quick to tell me after I asked for help from him for the tearing issues I experienced.
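The buffer ownership rule described above can be modeled in a few lines. This is a toy model with invented class and method names, not the Wayland client API: it only tracks which buffers the compositor still holds.

```python
# Toy model of the commit/release rule. Class and method names are
# hypothetical and do not correspond to libwayland functions.

class Buffer:
    def __init__(self, name):
        self.name = name
        self.busy = False  # True while the compositor may still read it


class Client:
    def __init__(self):
        self.buffers = []

    def acquire(self):
        """Return a buffer that is safe to paint into, allocating if needed."""
        for buf in self.buffers:
            if not buf.busy:
                return buf
        buf = Buffer(f"buf{len(self.buffers)}")
        self.buffers.append(buf)
        return buf

    def commit(self, buf):
        """Hand the buffer to the compositor; hands off until released."""
        buf.busy = True

    def on_release(self, buf):
        """The compositor sent the release event; the buffer is ours again."""
        buf.busy = False


client = Client()
first = client.acquire()
client.commit(first)
second = client.acquire()      # first is still busy, so a new buffer appears
client.commit(second)
client.on_release(first)
print(client.acquire().name)   # the released buffer is reused
```

A client that keeps painting into a single buffer after committing it skips the `busy` check entirely, and that is where tearing comes from.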

Under Xwayland an app only ever uses one buffer and repaints are always done into this one buffer. This means that the buffer is given to the compositor at some point, but the application doesn’t stop repainting into it. So in my case the buffer content changed while it was being presented to the user on the primary plane. The consequence is tearing. Other developers noticed that as well around the same time, as documented in this bug report.

The proposed solution is to bolster the Present extension support in Xwayland. In theory with that extension an X based application should be able to paint into more than one Pixmap, which then translate to different Wayland buffers. On the other side Xwayland notifies the app through the Present extension when it can reuse one of its Pixmaps based on the associated Wayland buffer event. The Present extension is a relatively new extension to the Xserver, but is already supported by most of the more interesting applications. It was written by Keith Packard, and you can read more about it on his blog. In theory it should only be necessary to add support for the extension to the Xwayland DDX. But there are some issues in the DIX side of the extension code, which first need to be ironed out. I plan on writing more about the Present extension in general and the limitations we encounter in our Xwayland use case in the next articles.

This week was a good week with loads of progress. I’ll just go through each item one by one, as usual. MouseMap merged Last Friday I opened the pull request to merge the MouseMap widget. After a bit of a discussion around preferences, my mentor and I decided to merge ongoing work in a rewrite branch. This keeps the master branch functional (while still receiving improvements, such as adding new functionality to the ratbagd bindings or a Gtk.
July 06, 2017
While my colleagues are working on mice that shine in all kinds of different colours, I went towards the old school.

For around 10 units of currency, you should be able to find the uDraw tablet for the PlayStation 3, the drawing tablet that brought down a company.

The device contains a large touchpad which can report one or two touches, for right-clicking (as long as the fingers aren't too close), a pen interface which will make the cheapest of the cheapest Wacom tablets feel like a professional tool from 30 years in the future, a 4-button joypad (plus Start/Select/PS) with the controls either side of the device, and an accelerometer to play Marble Madness with.

The driver landed in kernel 4.10. Note that it only supports the PlayStation 3 version of the tablet, as the Wii and XBox 360 versions require receivers that aren't part of the package. Here, a USB dongle should be provided.

Recommended for: point'n'click adventure games, set-top box menu navigation.

The second driver landed in kernel 4.12, and is a primer for more work to be done. This driver adds support for the Retrode 2's joypad adapters.

The Retrode is a USB console cartridge reader which makes Sega Mega Drive (aka Genesis) and Super Nintendo (aka Super Famicom) cartridges show up as files on a mass storage device in your computer.

It also has 4 connectors for original joypads which the aforementioned driver now splits up and labels, so you know which is which, as well as making the mouse work out of the box. I'd still recommend picking up the newer optical model of that mouse, from Hyperkin. Moving a mouse with a ball in it is like weighing a mobile phone from that same era.

I will let you inspect the add-ons for the device, like support for additional Nintendo 64 pads and cartridges, and Game Boy/GB Color/GB Advance, and Sega Master System adapters.

Recommended for: cartridge-based retro games, obviously.

Integrated firmware updates, and better integration with Games is in the plans.

I'll leave you with this video, which shows how you could combine GNOME Games, a Retrode, this driver, a SNES mouse, and a cartridge of Mario Paint. Let's get creative :)

Some time ago I promised to write a bit more about how shadow mapping works. It has taken me a while to bring myself to actually deliver on that front, but I have finally decided to put together some posts on this topic, this being the first one. However, before we cover shadow mapping itself we need to cover some lighting basics first. After all, without light there can’t be shadows, right?

This post will introduce the popular Phong reflection model as the basis for our lighting model. A lighting model provides a simplified representation of how light works in the natural world that allows us to simulate light in virtual scenes at reasonable computing costs. So let’s dive into it:

Light in the natural world

In the real world, the light that reaches an object is a combination of both direct and indirect light. Direct light is that which comes straight from a light source, while indirect light is the result of light rays hitting other surfaces in the scene, bouncing off of them and eventually reaching the object, maybe after multiple reflections from other objects. Because each time a ray of light hits a surface it loses part of its energy, indirect light reflection is less bright than direct light reflection and its color might have been altered. The contrast between surfaces that are directly hit by the light source and surfaces that only receive indirect light is what creates shadows. A shadow is nothing but the part of a scene that doesn’t receive direct light but might still receive some amount of (less intense) indirect light.

Direct vs Indirect light

Light in the digital world

Unfortunately, implementing realistic light behavior like that is too expensive, especially for real-time applications, so instead we use simplifications that can produce similar results with much lower computing requirements. The Phong reflection model in particular describes the light reflected from surfaces or emitted by light sources as the combination of 3 components: diffuse, ambient and specular. The model also requires information about the direction in which a particular surface is facing, provided via vectors called surface normals. Let’s introduce each of these concepts:

Surface normals

When we study the behavior of light, we notice that the direction in which surfaces reflect incoming light affects our perception of the surface. For example, if we light a shiny surface (such as a piece of metal) using a strong light source so that incoming light is reflected off the surface in the exact opposite direction in which we are looking at it, we will see a strong reflection in the form of highlights. If we move around so that we look at the same surface from a different angle, then we will see the reflection get dimmer and the highlights will eventually disappear. In order to model this behavior we need to know the direction in which the surfaces we render reflect incoming light. The way to do this is by associating vectors called normals with the surfaces we render so that shaders can use that information to produce lighting calculations akin to what we see in the natural world.

Usually, modeling programs can compute normal vectors for us, and even model loading libraries can do this work automatically, but sometimes, for example when we define vertex meshes programmatically, we need to define them manually. I won’t cover here how to do this in general (you can see this article from Khronos if you’re interested in specific algorithms), but I’ll point out something relevant: given a plane, we can compute normal vectors in two opposite directions. One is correct for the front face of the plane/polygon and the other one is correct for the back face, so make sure that if you compute normals manually, you use the correct direction for each face. Otherwise you won’t be reflecting light in the correct direction and results won’t be as you expect.

Light reflected using correct normal vector for the front face of the triangle

In most scenarios, we only render the front faces of the polygons (by enabling back face culling) and thus, we only care about one of the normal vectors (the one for the front face).

Another thing to notice about normal vectors is that they need to be transformed with the model to be correct for transformed models: if we rotate a model we need to rotate the normals too, since the faces they represent are now rotated and thus, their normal directions have rotated too. Scaling also affects normals, specifically if we don’t use uniform scaling, since in that case the orientation of the surfaces may change and affect the direction of the normal vector. Because normal vectors represent directions, their position in world space is irrelevant, so for the purpose of lighting calculations, a normal vector such as (1, 0, 0) defined for a surface placed at (0, 0, 0) is still valid to represent the same surface at any other position in the world; in other words, we do not need to apply translation transforms to our normal vectors.

In practice, the above means that we want to apply the rotation and scale transforms from our models to their normal vectors, but we can skip the translation transform. The matrix representing these transforms is usually called the normal matrix. We can compute the normal matrix from our model matrix by computing the transpose of the inverse of the 3×3 submatrix of the model matrix. Usually, we’d want to compute this matrix in the application and feed it to our vertex shader like we do with our model matrix, but for reference, here is how this can be achieved in the shader code itself, plus how to use this matrix to transform the original normal vectors:

mat3 NormalMatrix = transpose(inverse(mat3(ModelMatrix)));
vec3 out_normal = normalize(NormalMatrix * in_normal);

Notice that the code above normalizes the resulting normal before it is fed to the fragment shader. This is important because the rasterizer will compute normals for all fragments in the surface automatically, and for that it will interpolate between the normals for each vertex we emit. For the interpolated normals to be correct, all vertex normals we output in the vertex shader must have the same length, otherwise the larger normals will deviate the direction of the interpolated vectors towards them because their larger size will increase their weight in the interpolation computations.
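Since the normal matrix is usually computed in the application rather than in the shader, here is a minimal C sketch of that computation. The mat3 type, the row-major layout and the function name are assumptions made for this illustration, not part of any particular library. It relies on the fact that the matrix of cofactors divided by the determinant is exactly the transpose of the inverse:

```c
#include <math.h>

/* Row-major 3x3 matrix; a hypothetical helper type for this sketch. */
typedef struct { float m[3][3]; } mat3;

/* Computes transpose(inverse(a)), i.e. the normal matrix for the upper-left
 * 3x3 block of a model matrix. Returns 0 on success, -1 if a is singular. */
int normal_matrix(const mat3 *a, mat3 *out)
{
    const float (*m)[3] = a->m;
    /* Cofactor matrix of a: equal to det(a) * transpose(inverse(a)). */
    float c[3][3] = {
        { m[1][1] * m[2][2] - m[1][2] * m[2][1],
          m[1][2] * m[2][0] - m[1][0] * m[2][2],
          m[1][0] * m[2][1] - m[1][1] * m[2][0] },
        { m[0][2] * m[2][1] - m[0][1] * m[2][2],
          m[0][0] * m[2][2] - m[0][2] * m[2][0],
          m[0][1] * m[2][0] - m[0][0] * m[2][1] },
        { m[0][1] * m[1][2] - m[0][2] * m[1][1],
          m[0][2] * m[1][0] - m[0][0] * m[1][2],
          m[0][0] * m[1][1] - m[0][1] * m[1][0] },
    };
    float det = m[0][0] * c[0][0] + m[0][1] * c[0][1] + m[0][2] * c[0][2];

    if (fabsf(det) < 1e-12f)
        return -1;
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            out->m[i][j] = c[i][j] / det;
    return 0;
}
```

For a pure rotation the result is the rotation itself, while a non-uniform scale such as (2, 1, 1) turns into (0.5, 1, 1), which is what keeps normals perpendicular to the stretched surface.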

Finally, even if we emit normalized vectors in the vertex shader stage, we should note that the interpolated vectors that arrive at the fragment shader are not guaranteed to be normalized. Think for example of the normal vectors (1, 0, 0) and (0, 1, 0) being assigned to the two vertices in a line primitive. At the half-way point in between these two vertices, the interpolator will compute a normal vector of (0.5, 0.5, 0), which is not unit-sized. This means that in the general case, input normals in the fragment shader will need to be normalized again even if we have normalized vertex normals at the vertex shader stage.

Diffuse reflection

The diffuse component represents the reflection produced from direct light. It is important to notice that the intensity of the diffuse reflection is affected by the angle between the light coming from the source and the normal of the surface that receives the light. This makes a surface looking straight at the light source be the brightest, with reflection intensity dropping as the angle increases:

Diffuse light (spotlight source)

In order to compute the diffuse component for a fragment we need its normal vector (the direction in which the surface is facing), the vector from the fragment’s position to the light source, the diffuse component of the light and the diffuse reflection of the fragment’s material:

vec3 normal = normalize(surface_normal);
vec3 pos_to_light_norm = normalize(pos_to_light);
float dp_reflection = max(0.0, dot(normal, pos_to_light_norm));
vec3 diffuse = material.diffuse * light.diffuse * dp_reflection;

Basically, we multiply the diffuse component of the incoming light with the diffuse reflection of the fragment’s material to produce the diffuse component of the light reflected by the fragment. The diffuse component of the surface tells how the object absorbs and reflects incoming light. For example, a pure yellow object (diffuse material vec3(1,1,0)) would absorb the blue component and reflect 100% of the red and green components of the incoming light. If the light is a pure white light (diffuse vec3(1,1,1)), then the observer would see a yellow object. However, if we are using a red light instead (diffuse vec3(1,0,0)), then the light reflected from the surface of the object would only contain the red component (since the light isn’t emitting a green component at all) and we would see it red.

As we said before though, the intensity of the reflection depends on the angle between the incoming light and the normal of the surface. We account for this with the dot product between the normal at the fragment (surface_normal) and the direction of the light (or rather, the vector pointing from the fragment to the light source). Notice that because the vectors that we use to compute the dot product are normalized, dp_reflection is exactly the cosine of the angle between these two vectors. At an angle of 0º the surface is facing straight at the light source, and the intensity of the diffuse reflection is at its peak, since cosine(0º)=1. At an angle of 90º (or larger) the cosine will be 0 or smaller and will be clamped to 0, meaning that no light is effectively being reflected by the surface (the computed diffuse component will be 0).

Ambient reflection

Computing all possible reflections and bounces of all rays of light from each light source in a scene is way too expensive. Instead, the Phong model approximates this by making indirect reflection from a light source constant across the scene. In other words: it assumes that the amount of indirect light received by any surface in the scene is the same. This eliminates all the complexity while still producing reasonable results in most scenarios. We call this constant factor ambient light.

Ambient light

Adding ambient light to the fragment is then as simple as multiplying the light source’s ambient light by the material’s ambient reflection. The meaning of this product is exactly the same as in the case of the diffuse light, only that it affects the indirect light received by the fragment:

vec3 ambient = material.ambient * light.ambient;

Specular reflection

Very sharp, smooth surfaces such as metal are known to produce specular highlights, which are those bright spots that we can see on shiny objects. Specular reflection depends on the angle between the observer’s view direction and the direction in which the light is reflected off the surface. Specifically, the specular reflection is strongest when the observer is facing exactly in the opposite direction in which the light is reflected. Depending on the properties of the surface, the specular reflection can be more or less focused, affecting how the specular component scatters after being reflected. This property of the material is usually referred to as its shininess.

Specular light

Implementing specular reflection requires a bit more work:

vec3 specular = vec3(0);
vec3 light_dir_norm = normalize(vec3(light.direction));
if (dot(normal, -light_dir_norm) >= 0.0) {
   vec3 reflection_dir = reflect(light_dir_norm, normal);
   float shine_factor = dot(reflection_dir, normalize(in_view_dir));
   specular = * *
         pow(max(0.0, shine_factor), material.shininess.x);
}

Basically, the code above checks if there is any specular reflection at all by computing the cosine of the angle between the fragment’s normal and the direction of the light (notice that, once again, both vectors are normalized prior to using them in the call to dot()). If there is specular reflection, then we compute how shiny the reflection is perceived by the viewer based on the angle between the vector from this fragment to the observer (in_view_dir) and the direction of the light reflected off the fragment’s surface (reflection_dir). The smaller the angle, the more parallel the directions are, meaning that the camera is receiving more reflection and the specular component received is stronger. Finally, we modulate the result based on the shininess of the fragment. We can compute in_view_dir in the vertex shader using the inverse of the View matrix like this:

mat4 ViewInv = inverse(View);
out_view_dir =
   normalize(vec3(ViewInv * vec4(0.0, 0.0, 0.0, 1.0) - world_pos));

The code above takes advantage of the fact that camera transformations are an illusion created by applying the transforms to everything else we render. For example, if we want to create the illusion that the camera is moving to the right, we just apply a translation to everything we render so they show up a bit to the left. This is what our View matrix achieves. From the point of view of GL or Vulkan, the camera is always fixed at (0,0,0). Taking advantage of this, we can compute the position of the virtual observer (the camera) in world space coordinates by applying the inverse of our camera transform to its fixed location (0,0,0). This is what the code above does, where world_pos is the position of this vertex in world space and View is the camera’s view matrix.
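If the View matrix only contains a rotation and a translation (an assumption of this sketch, which excludes scaling), the inverse applied to the origin has a closed form: the camera position is minus the transposed rotation times the translation. The mat4 type, its row-major layout and the function name below are illustrative only:

```c
typedef struct { float x, y, z; } vec3;

/* Row-major 4x4 matrix; translation lives in the last column. */
typedef struct { float m[4][4]; } mat4;

/* Equivalent to vec3(inverse(View) * vec4(0, 0, 0, 1)) when View is a
 * rigid transform, because inverse(R) == transpose(R) for rotations. */
vec3 camera_world_position(const mat4 *view)
{
    const float (*v)[4] = view->m;
    vec3 t = { v[0][3], v[1][3], v[2][3] };
    vec3 p = {
        -(v[0][0] * t.x + v[1][0] * t.y + v[2][0] * t.z),
        -(v[0][1] * t.x + v[1][1] * t.y + v[2][1] * t.z),
        -(v[0][2] * t.x + v[1][2] * t.y + v[2][2] * t.z),
    };
    return p;
}
```

For example, a View matrix that moves everything 3 units to the left corresponds to a camera sitting at (3, 0, 0) in world space.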

In order to produce the final look of the scene according to the Phong reflection model, we need to compute these 3 components for each fragment and add them together:

out_color = vec4(diffuse + ambient + specular, 1.0);
Diffuse + Ambient + Specular (spotlight source)


In most scenarios, light intensity isn’t constant across the scene. Instead, it is brightest at its source and gets dimmer with distance. We can easily model this by adding an attenuation factor that depends on the distance from the fragment to the light source. Typically, the intensity of the light decreases quite fast with distance, so a linear attenuation factor alone may not produce the best results and a quadratic function is preferred:

float attenuation = 1.0 /
    (light.attenuation.constant +
     light.attenuation.linear * dist +
     light.attenuation.quadratic * dist * dist);

diffuse = diffuse * attenuation;
ambient = ambient * attenuation;
specular = specular * attenuation;

Of course, we may decide not to apply attenuation to the ambient component at all if we really want to make it look like it is constant across the scene, however, do notice that when multiple light sources are present, the ambient factors from each source will accumulate and may produce too much ambient light unless they are attenuated.
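To get a feeling for how quickly the factor drops, the attenuation formula can be evaluated in plain C. The struct below simply mirrors light.attenuation from the shader, and the coefficients in the example are made up for illustration:

```c
/* Hypothetical container mirroring light.attenuation in the shader. */
typedef struct { float constant, linear, quadratic; } attenuation_params;

/* 1 / (constant + linear * d + quadratic * d^2), as in the shader code. */
float attenuation(attenuation_params a, float dist)
{
    return 1.0f / (a.constant + a.linear * dist + a.quadratic * dist * dist);
}
```

With constant = 1, linear = 0.5 and quadratic = 0.25, the factor is 1 at the light itself, 1/3 at a distance of 2 and 1/7 at a distance of 4.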

Types of lights

When we model a light source we also need to consider the kind of light we are manipulating:

Directional lights

These are light sources that emit rays that travel along a specific direction so that all are parallel to each other. We typically use this model to represent bright, distant light sources that produce constant light across the scene. An example would be sunlight. Because the distance to the light source is so large compared to distances in the scene, the attenuation factor is irrelevant and can be discarded. Another particularity of directional light sources is that because the light rays are parallel, shadows cast from them are regular (we will talk more about this once we cover shadow mapping in future posts).

Directional light

If we had used a directional light in the scene, it would look like this:

Scene with a directional light

Notice how the brightness of the scene doesn’t lower with the distance to the light source.

Point lights

These are light sources for which light originates at a specific position and spreads outwards in all directions. Shadows cast by point lights are not regular; instead, they are projected. An example would be the light produced by a light bulb. The attenuation code I showed above would be appropriate to represent point lights.

Point light

Here is what the scene would look like with a point light:

Scene with a point light

In this case, we can see how attenuation plays a factor and brightness lowers as we walk away from the light source (which is close to the blue cuboid).


Spotlights

This is the light source I used to illustrate the diffuse, ambient and specular components. Spotlights are similar to point lights, in the sense that light originates from a specific point in space and spreads outwards. However, instead of scattering in all directions, rays scatter forming a cone with the tip at the origin of the light. The angle formed by the light’s direction and the sides of the cone is usually called the cutoff angle, because no light is cast outside its limits. Flashlights are a good example of this type of light.


In order to create spotlights we need to consider the cutoff angle of the light and make sure that no diffuse or specular component is reflected by a fragment which is beyond the cutoff threshold:

vec3 light_to_pos_norm = -pos_to_light_norm;
float dp = dot(light_to_pos_norm, light_dir_norm);
if (dp <= light.cutoff) {
   diffuse = vec3(0);
   specular = vec3(0);
}

In the code above we compute the cosine of the angle between the light’s direction and the vector from the light to the fragment (dp). Here, light.cutoff represents the cosine of the spotlight’s cutoff angle too, so when dp is smaller it means that the fragment is outside the light cone emitted by the spotlight and we remove its diffuse and specular reflections completely.

Multiple lights

Handling multiple lights is easy enough: we only need to compute the color contribution for each light separately and then add all of them together for each fragment (pseudocode):

vec3 fragColor = vec3(0);
foreach light in lights
    fragColor += compute_color_for_light(light, ...);

Of course, light attenuation plays a vital role here to limit the area of influence of each light so that scenes where we have multiple lights don’t get too bright.

An important thing to notice about the pseudocode above is that this process involves looping through costly per-fragment light computations for each light source, which can lead to significant performance hits as the number of lights in the scene increases. This shading model, as described here, is called forward rendering. It has the benefit of being very simple to implement, but its downside is that we may incur many costly lighting computations for fragments that, eventually, won’t be visible on the screen (because they are occluded by other fragments). This is particularly important when the number of lights in the scene is quite large and the scene is complex enough that many fragments end up occluded. Another technique that may be more suitable for these situations is called deferred rendering, which postpones costly shader computations to a later stage (hence the word deferred) in which we only evaluate them for fragments that are known to be visible. That is a topic for another day, though: in this series we will focus on forward rendering only.

Lights and shadows

For the purpose of shadow mapping in particular we should note that objects that are directly lit by the light source reflect all 3 of the light components, while objects in the shadow only reflect the ambient component. Because objects that only reflect ambient light are less bright, they appear shadowed, in similar fashion as they would in the real world. We will see the details of how this is done in the next post, but for the time being, keep this in mind.

Source code

The scene images in this post were obtained from a simple shadow mapping demo I wrote in Vulkan. The source code for that is available here, and it includes also the shadow mapping implementation that I’ll cover in the next post. Specifically relevant to this post are the scene vertex and fragment shaders where lighting calculations take place.


In order to represent shadows we first need a means to represent light. In this post we discussed the Phong reflection model as a simple, yet effective way to model light reflection in a scene as the addition of three separate components: diffuse, ambient and specular. Once we have a representation of light we can start discussing shadows, which are parts of the scene that only receive ambient light because other objects occlude the diffuse and specular components of the light source.

July 04, 2017

I (finally!) merged a patchset to detect palms based on pressure into libinput. This should remove a lot of issues that our users have seen with accidental pointer movement. Palm detection in libinput previously used two approaches: disable-while-typing and an edge-based approach. The former simply ignores touchpad events while keyboard events are detected, the latter ignores touches that happen in the edge zones of the touchpad where real interaction is unlikely. Both approaches have obvious disadvantages: they're timeout- and location-dependent, causing erroneous pointer movements. But their big advantage is that they work even on old touchpads where a lot of other information is unreliable. Touchpads are getting better, so it's time to make use of that.

The new feature is relatively simple: libinput looks at per-touch pressure and if that pressure hits a given threshold, the touch is regarded as palm. Once a palm, that touch will be ignored until touch up. The threshold is intended to be high enough that it cannot easily be hit. At least on the touchpads I have available for testing, I have to go through quite some effort to trigger palm detection with my finger.
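The behavior described above boils down to a tiny per-touch state machine. The C sketch below is only an illustration of that logic: the struct, the function names and the threshold value used in the example are made up, and libinput's actual implementation differs.

```c
#include <stdbool.h>

/* Hypothetical per-touch state; libinput's real structures differ. */
struct touch {
    bool is_palm;
};

/* Feed one pressure sample for an active touch. Once the threshold is
 * reached the touch stays a palm, even if the pressure drops again.
 * Returns true if this sample should be ignored. */
bool touch_update_pressure(struct touch *t, int pressure, int palm_threshold)
{
    if (pressure >= palm_threshold)
        t->is_palm = true;
    return t->is_palm;
}

/* Touch up ends the touch; the next touch starts out as a finger again. */
void touch_up(struct touch *t)
{
    t->is_palm = false;
}
```

With an assumed threshold of 130, a touch reporting pressures 40, 150 and 60 is accepted for the first sample and ignored for the other two, since the palm label sticks until touch up.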

Pressure on touchpads is unfortunately hardware-dependent and we can expect most laptops to have different pressure thresholds. For our users this means that the feature won't immediately work perfectly, it will require a lot of hwdb entries. libinput now ships a libinput measure touchpad-pressure tool to experiment with the various pressure thresholds. This makes it easy to figure out the right pressure threshold and submit a bug report (or patch) for libinput to get the pressure threshold updated. The documentation for this tool is available as part of libinput's online documentation.

TLDR: if libinput seems to misdetect touches as palms, figure out the right threshold with libinput measure touchpad-pressure and file a bug report so we can merge this into our hwdb.

July 01, 2017

DRM leasing part three (vblank)

The last couple of weeks have been consumed by getting frame sequence numbers and events handled within the leasing environment (and Vulkan) correctly.

Vulkan EXT_display_control extension

This little extension provides the bits necessary for applications to track the display of frames to the user.

VkResult
vkGetSwapchainCounterEXT(VkDevice                    device,
                         VkSwapchainKHR              swapchain,
                         VkSurfaceCounterFlagBitsEXT counter,
                         uint64_t                    *pCounterValue);

This function just retrieves the current frame count from the display associated with swapchain.

VkResult
vkRegisterDisplayEventEXT(VkDevice                    device,
                          VkDisplayKHR                display,
                          const VkDisplayEventInfoEXT *pDisplayEventInfo,
                          const VkAllocationCallbacks *pAllocator,
                          VkFence                     *pFence);

This function creates a fence that will be signaled when the specified event happens. Right now, the only event supported is when the first pixel of the next display refresh cycle leaves the display engine for the display. If you want something fancier (like two frames from now), you get to do that on your own using this basic function.


drmWaitVBlank is the existing interface for all things sequence related and has three modes (always nice to have one function do three things, I think). It can:

  1. Query the current vblank number
  2. Block until a specified vblank number
  3. Queue an event to be delivered at a specific vblank number

This interface has a few issues:

  • It has been kludged into supporting multiple CRTCs by taking bits from the 'type' parameter to hold a 'pipe' number, which is the index in the kernel into the array of CRTCs.

  • It has a random selection of 'int' and 'long' datatypes in the interface, making it need special helpers for 32-bit apps running on a 64-bit kernel.

  • Times are in microseconds, frame counts are 32 bits. Vulkan does everything in nanoseconds and wants 64-bits of frame counts.
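To illustrate the unit mismatch from the last item, here is a hedged C sketch of what a caller would have to do on its own with the old interface: convert a seconds plus microseconds timestamp to nanoseconds, and widen a 32-bit frame count to 64 bits. Both helpers are hypothetical and are not taken from the kernel, libdrm or Mesa. The widening trick assumes fewer than 2^31 frames passed since the last known value and relies on the usual two's complement conversion.

```c
#include <stdint.h>

/* Seconds + microseconds to nanoseconds. */
uint64_t timeval_to_ns(uint64_t sec, uint64_t usec)
{
    return sec * UINT64_C(1000000000) + usec * UINT64_C(1000);
}

/* Widen a 32-bit frame count using the last known 64-bit value. The
 * signed 32-bit difference handles wraparound of the narrow counter. */
uint64_t widen_sequence(uint64_t last, uint32_t now)
{
    int32_t delta = (int32_t)(now - (uint32_t)last);
    return last + (uint64_t)(int64_t)delta;
}
```

For instance, if the last known count was 0xFFFFFFF0 and the 32-bit counter now reads 0x10, the widened value is 0x100000010 instead of jumping backwards.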

For leases, figuring out the index into the kernel list of crtcs is pretty tricky -- our lease has a subset of those crtcs, so we can't actually compute the global crtc index.


int drmCrtcGetSequence(int fd, uint32_t crtcId,
                       uint64_t *sequence, uint64_t *ns);

Here's a simple new function — hand it a crtc ID and it provides the current frame sequence number and the time when that frame started (in nanoseconds).


int drmCrtcQueueSequence(int fd, uint32_t crtcId,
                         uint32_t flags, uint64_t sequence,
                         uint64_t user_data);

struct drm_event_crtc_sequence {
    struct drm_event    base;
    __u64               user_data;
    __u64               time_ns;
    __u64               sequence;
};

This will cause a CRTC_SEQUENCE event to be delivered at the start of the specified frame sequence. That event will include the frame when the event was actually generated (in case it's late), along with the time (in nanoseconds) when that frame was started. The event also includes a 64-bit user_data value, which can be used to hold a pointer to whatever data the application wants to see in the event handler.

The 'flags' argument contains a combination of:

#define DRM_CRTC_SEQUENCE_RELATIVE      0x00000001  /* sequence is relative to current */
#define DRM_CRTC_SEQUENCE_NEXT_ON_MISS      0x00000002  /* Use next sequence if we've missed */
#define DRM_CRTC_SEQUENCE_FIRST_PIXEL_OUT   0x00000004  /* Signal when first pixel is displayed */

These are similar to the values provided for the drmWaitVBlank function, except I've added a selector for when the event should be delivered to align with potential future additions to Vulkan. Right now, the only time you can ask for is first-pixel-out, which says that the event should correspond to the display of the first pixel on the screen.
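As a way to read the first two flags, the following C sketch computes which frame an event would end up targeting. It is an interpretation of the comments next to the flag definitions, not the kernel's code, and resolve_sequence is a hypothetical helper name:

```c
#include <stdint.h>

#define DRM_CRTC_SEQUENCE_RELATIVE     0x00000001
#define DRM_CRTC_SEQUENCE_NEXT_ON_MISS 0x00000002

/* current: the frame being displayed now; requested: the 'sequence'
 * argument; returns the frame the event would be queued for. */
uint64_t resolve_sequence(uint64_t current, uint64_t requested, uint32_t flags)
{
    uint64_t target = requested;

    if (flags & DRM_CRTC_SEQUENCE_RELATIVE)
        target += current;          /* count from the current frame */
    if ((flags & DRM_CRTC_SEQUENCE_NEXT_ON_MISS) && target <= current)
        target = current + 1;       /* already missed: use the next one */
    return target;
}
```

With the display at frame 100, a relative request of 5 targets frame 105, while an absolute request for frame 90 with NEXT_ON_MISS set is bumped to frame 101.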

DRM events → Vulkan fences

With the kernel able to deliver a suitable event at the next frame, all the Vulkan code needed was to create a fence and hook it up to such an event. The existing fence code only deals with rendering fences, so I added window system interface (WSI) fencing infrastructure and extended the radv driver to be able to handle both kinds of fences within that code.

Multiple waiting threads

I've now got three places which can be waiting for a DRM event to appear:

  1. Frame sequence fences.

  2. Wait for an idle image. Necessary when you want an image to draw the next frame to.

  3. Wait for the previous flip to complete. The kernel can only queue one flip at a time, so we have to make sure the previous flip is complete before queuing another one.

Vulkan allows these to be run from separate threads, so I needed to deal with multiple threads waiting for a specific DRM event at the same time.

XCB has the same problem and goes to great lengths to manage this with a set of locking and signaling primitives so that only one thread is ever doing poll or read from the socket at a time. If another thread wants to read at the same time, it will block on a condition variable which is then signaled by the original reader thread at the appropriate time. It's all very complicated, and it didn't work reliably for a number of years.

I decided to punt and just create a separate thread for processing all DRM events. It blocks using poll(2) until some events are readable, processes those and then broadcasts to a condition variable to notify any waiting threads that 'something' has happened. Each waiting thread simply checks for the desired condition and if not satisfied, blocks on that condition variable. It's all very simple looking, and seems to work just fine.
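The pattern described above can be sketched with plain pthreads. Everything in this block is illustrative: the struct and function names are invented, and the event thread just posts three fake sequence numbers where the real one would block in poll(2) and read events from the DRM file descriptor.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical shared state updated by the single event thread. */
struct event_state {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    uint64_t        last_sequence;
};

/* Event thread: record what happened and wake every waiter. */
void event_post(struct event_state *s, uint64_t sequence)
{
    pthread_mutex_lock(&s->lock);
    s->last_sequence = sequence;
    pthread_cond_broadcast(&s->cond);
    pthread_mutex_unlock(&s->lock);
}

/* Waiting thread: check the desired condition, block if not satisfied,
 * and check again after every wakeup. */
uint64_t event_wait_for(struct event_state *s, uint64_t sequence)
{
    pthread_mutex_lock(&s->lock);
    while (s->last_sequence < sequence)
        pthread_cond_wait(&s->cond, &s->lock);
    uint64_t seen = s->last_sequence;
    pthread_mutex_unlock(&s->lock);
    return seen;
}

/* Stand-in for the event thread: poll(2) and read() are replaced by a
 * loop that posts three events. */
void *event_thread(void *arg)
{
    struct event_state *s = arg;

    for (uint64_t seq = 1; seq <= 3; seq++)
        event_post(s, seq);
    return NULL;
}
```

Because waiters re-check their own condition in a loop, a broadcast that merely says 'something' happened is enough, and spurious wakeups are harmless.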

Code Complete, Validation Remains

At this point, all of the necessary pieces are in place for the VR application to take advantage of an HMD using only existing Vulkan extensions. Those will be automatically mapped into DRM leases and DRM events as appropriate.

The VR compositor application is working pretty well; tests with Dota 2 show occasional jerky behavior in complex scenes, so there's clearly more work to be done somewhere. I need to go write a pile of tests to independently verify that my code is working. I wonder if I'll need to wire up some kind of light sensor so I can actually tell when frames get displayed as it's pretty easy to get consistent-but-wrong answers in this environment.

Source Code

  • Linux. This is based off of a reasonably current drm-next branch from Dave Airlie. 965 commits past 4.12 RC3.

    git:// drm-lease-v3

  • X server (which includes xf86-video-modesetting). This is pretty close to master.

    git:// drm-lease

  • RandR protocol changes

    git:// drm-lease

  • xcb proto (no changes to libxcb sources, but it will need to be rebuilt)

    git:// drm-lease

  • DRM library. About a dozen patches behind master.

    git:// drm-lease

  • Mesa. Branched early this month (4 June), this is pretty far from master.

    git:// drm-lease

June 30, 2017

Last week in part one of this two part series about the fundamentals of Xwayland, we treated Xwayland like a black box. We stated what its purpose is and gave a rough overview on how it connects to its environment, notably its clients and the Wayland compositor. In a sense this was only a teaser, since we didn’t yet look at Xwayland’s inner workings. So welcome to part two, where we do a deep dive into its code base!

You can find the Xwayland code base here. Maybe to your surprise this is just the code of’s Xserver, which we will just refer to as the Xserver in the rest of this text. But as a reminder from part one: Xwayland is only a normal Xserver “with a special backend written to communicate with the Wayland compositor active on your system.” This backend is located in /hw/xwayland. To understand why we find this special backend here and what I mean with an Xserver backend at all, we have to first learn some Xserver fundamentals.


The hw subdirectory is the Device Dependent X (DDX) part of the Xserver. All other directories in the source tree form the Device Independent X (DIX) part. This structuring is an important abstraction in the Xserver. As the names suggest, the DIX part is supposed to be generic enough to be the same on every imaginable hardware platform. The word hardware should be understood here in an abstract way, as some sort of environment the Xserver works in and has to talk to, which could be the kernel with its DRM subsystem and hardware drivers or, as we already know, a Wayland compositor. On the other hand, all code that is potentially different depending on the environment the Xserver is compiled for is bundled into the DDX part. Since this code is by its very definition mostly responsible for establishing and maintaining the required communication channels with the environment, we can indeed call the platform specific code paths in DDX the Xserver’s backends.

I want to emphasize that the Xserver is compiled for different environments, because we are now able to understand how the Xorg and Xwayland binaries we talked about in part one and that both implement a full Xserver come into existence: Autotools, the build system of the Xserver, is told by configuration parameters before compilation what the intended target platforms are. It then will use for each enabled target platform the respective subdirectory in hw to compile a binary with this platform’s appropriate DDX plus the generic DIX from the other top level directories. For example to compile only the Xwayland binary, you can use this command from the root of the source tree:

./ --prefix=/usr --disable-docs --disable-devel-docs \
  --enable-xwayland --disable-xorg --disable-xvfb --disable-xnest \
  --disable-xquartz --disable-xwin

Coming back to the functionality, let’s look at two examples in order to better understand the DIX and DDX divide and how the two parts interact with each other. Take first the concept of regions: A region specifies a certain portion of the view displayed to the user. It is defined by values for its width, height and position in some coordinate system. How regions work is therefore completely independent of the choice of hardware the Xserver runs on. That allowed the Xserver creators to put all the region code in the DIX part of the server.
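To see why such code needs no knowledge of the platform, consider this deliberately simplified C sketch of a rectangular region and an intersection helper. The type and function are made up for illustration and are not the Xserver's region implementation, which is more elaborate:

```c
/* A rectangular region: position plus size in some coordinate system. */
typedef struct { int x, y, width, height; } rect;

/* Writes the overlap of a and b to out; returns 0 if they don't overlap. */
int rect_intersect(rect a, rect b, rect *out)
{
    int x1 = a.x > b.x ? a.x : b.x;
    int y1 = a.y > b.y ? a.y : b.y;
    int ax2 = a.x + a.width, bx2 = b.x + b.width;
    int ay2 = a.y + a.height, by2 = b.y + b.height;
    int x2 = ax2 < bx2 ? ax2 : bx2;
    int y2 = ay2 < by2 ? ay2 : by2;

    if (x2 <= x1 || y2 <= y1)
        return 0;
    out->x = x1;
    out->y = y1;
    out->width = x2 - x1;
    out->height = y2 - y1;
    return 1;
}
```

Nothing in this computation depends on whether the pixels end up on a DRM device or in a Wayland surface, which is the point of keeping it in DIX.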

Talking about regions in a view, we directly think of the screen this view is displayed on. That’s the second example. We can always assume that there is some sort of real or emulated screen, or even multiple of them, to display our view. But how these screens and their properties are retrieved depends on the environment. So there needs to be some “screen code” in DDX, but on the other hand we want to move as much logic as possible into the DIX to avoid rewriting shared functionality for different platforms.

The Xserver is equipped with tools to facilitate this dichotomy. In our example about screens DIX represents the generic part of such a screen in its _Screen struct. But the struct also features the void pointer field devPrivate, which can be set by the DDX part to some struct that then provides the device dependent information for the screen. When DIX then calls DDX to do something concerning the screen, DIX also hands over a _Screen pointer and DDX can retrieve this information through the devPrivate pointer. The private resource pointer is a tool featured in several core objects of the Xserver. For example we can also find it in the _Window struct for windows.
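The following C sketch illustrates the private pointer idea in isolation. Only the field name devPrivate is taken from the text above. The struct layouts, the fields backend_fd and depth, and the function names are hypothetical and much smaller than anything in the real server.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the generic DIX screen struct. */
struct sketch_screen {
    int width;
    int height;
    void *devPrivate;   /* opaque to DIX, set and interpreted by the DDX */
};

/* Hypothetical device dependent data that only the DDX knows about. */
struct sketch_ddx_screen {
    int backend_fd;     /* assumed: some handle to the environment */
    int depth;
};

/* DDX side: attach the device dependent data to the generic screen. */
static void sketch_ddx_attach(struct sketch_screen *screen,
                              struct sketch_ddx_screen *priv)
{
    screen->devPrivate = priv;
}

/*
 * DDX side: DIX calls this with nothing but the generic screen pointer,
 * and the DDX gets back to its own data through devPrivate.
 */
static int sketch_ddx_get_depth(const struct sketch_screen *screen)
{
    const struct sketch_ddx_screen *priv = screen->devPrivate;

    assert(priv != NULL);
    return priv->depth;
}
```

The generic struct never needs to know the layout of sketch_ddx_screen, so a different backend could attach a completely different struct through the same pointer.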

Besides this information sharing between DIX and DDX there are of course also procedures triggered in one part and reaching into the other one. And these procedures run according to the main event loop. We will learn more about them when we now finally analyze the Xwayland DDX code itself.

The Xwayland DDX

The names of the source files in the /hw/xwayland directory already indicate what they are supposed to do. Luckily there are not many of them and most of the files are rather compact. It’s quite a feat that the creators of Xwayland were able to provide X backward compatibility in a Wayland session with so few lines of code added to the generic part of a normal Xserver. This is of course only possible thanks to the abstractions described above.

But coming back to the files, here’s a table of all the files with short descriptions:

Files                                      Description
xwayland.h, xwayland.c                     Basically the entry point to everything else, define and implement the most central structs and functions of the Xwayland DDX.
xwayland-output.c                          Provides a representation of a display/output. All its data is of course received from the Wayland server.
xwayland-cvt.c                             Supports the output creation by generating a display mode calculated from available information.
xwayland-input.c                           Deals with inputs provided by mice and other input devices. As you can see by its size, it’s not the most straightforward area to work on.
xwayland-cursor.c                          Makes a cursor appear. That is in a graphic pipeline often treated as a special case to reduce repaints.
xwayland-glamor.c, xwayland-shm.c          Provide two different ways for allocating graphic buffers.
xwayland-glamor-xv.c, xwayland-vidmode.c   Support for hardware accelerated video playback and older games, which is in parts not yet fully functional.

In the following we will restrict our analysis to the xwayland.* files, in order to keep the growing length of this article in check.

Some basic structs and functions also shared with the other source files are defined in the header file xwayland.h. A good first point to remember is that all structs and functions with names starting with xwl_ are only known to the Xwayland DDX and won’t be called from anywhere else. But at the beginning of the xwayland.c file we find some methods without the prefix. They are only declared in the DIX, and the DDX has to implement them in order to be fully functional.

Scrolling down to the end of the file we see the main entry point to the DDX on server startup, the InitOutput method. If you look closely you will notice a call to AddScreen, where we also hook up an Xwayland internal screen init function as one of its arguments. But it’s only called once! So what about multiple screens? The explanation is that Xwayland uses the RandR extension for its screen management and here only asks for the creation of one screen struct as a dummy, which holds at runtime some global information about the Wayland environment. We looked at this particular screen struct in the previous chapter as an example for information sharing between DIX and DDX through void pointers that are set by the DDX.

Although it’s only a dummy, we can still follow this now live in action in the hooked up init function xwl_screen_init. Here we set with the help of some DIX methods a hash key to later identify the data field again and then set the data, which is an xwl_screen struct with static information about the Wayland environment the Xwayland server is deployed in.

In the hooked up init function the later manipulation of the function pointers RealizeWindow, UnrealizeWindow and so on is also quite interesting. I asked Daniel about it, because I didn’t understand at all the steps done here as well as similar ones later in the involved functions xwl_realize_window, xwl_unrealize_window and so on. Daniel explained the mechanism well to me and it is quite nifty indeed. Basically thanks to this trick, called wrapping, Xwayland and other DDX can intercept DIX calls to a procedure like RealizeWindow, execute their own code, and then go on with the procedure looking to the DIX like it never happened.

In the case of RealizeWindow, which is called when a window was created and is now ready to be displayed, we intercept it with xwl_realize_window, where an Xwayland internal representation of type struct xwl_window is allocated with all the Xwayland specific additional information, in particular a Wayland surface. At the end the request to create the surface is sent to the Wayland server via the Wayland protocol. You can probably imagine what UnrealizeWindow and its wrapper xwl_unrealize_window are supposed to do, and that they do it in a very similar way.
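Here is a small C sketch of how such wrapping can work. It is loosely modeled on the description above and not copied from the Xwayland sources. Only the name RealizeWindow comes from the text. All types, the saved_RealizeWindow slot and the has_surface flag are simplified assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_screen;

/* Hypothetical window with a back pointer to its screen. */
struct sketch_window {
    struct sketch_screen *screen;
    int realized;       /* set by the generic code */
    int has_surface;    /* set by the backend code */
};

typedef int (*realize_window_fn)(struct sketch_window *window);

struct sketch_screen {
    realize_window_fn RealizeWindow;        /* slot the generic code calls through */
    realize_window_fn saved_RealizeWindow;  /* original procedure, kept by the wrapper */
};

/* Stand-in for the procedure that was installed before the backend wrapped it. */
static int generic_realize_window(struct sketch_window *window)
{
    window->realized = 1;
    return 1;
}

/* The wrapper: runs the original procedure, then adds backend specific work. */
static int ddx_realize_window(struct sketch_window *window)
{
    struct sketch_screen *screen = window->screen;
    int ret;

    /* Unwrap, call the original, then wrap again. */
    screen->RealizeWindow = screen->saved_RealizeWindow;
    ret = screen->RealizeWindow(window);
    screen->saved_RealizeWindow = screen->RealizeWindow;
    screen->RealizeWindow = ddx_realize_window;

    /* Backend specific part, here reduced to setting a flag. */
    if (ret)
        window->has_surface = 1;

    return ret;
}

/* Done once at screen init: remember the original and install the wrapper. */
static void ddx_screen_init(struct sketch_screen *screen)
{
    screen->saved_RealizeWindow = screen->RealizeWindow;
    screen->RealizeWindow = ddx_realize_window;
}
```

The caller keeps invoking screen->RealizeWindow exactly as before and never has to know that a wrapper sits in between. Restoring the original pointer before calling it and wrapping again afterwards means the saved pointer is refreshed on every call, so a change the original procedure makes to the slot would not get lost.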

As a last point let’s look at the event loop and the buffer dispatch of possibly new or changed graphical content. We have block_handler, which was registered with the DIX in xwl_screen_init and gets called continuously throughout the event loop. From here we call into a global damage posting function and from there for each window into xwl_window_post_damage. If we’re lucky we get a buffer with hardware acceleration from the implementation in xwayland-glamor.c, or otherwise one without acceleration from the implementation in xwayland-shm.c, attach it to the surface and fire it away. In the next iteration of the event loop we play the same game.
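The control flow of this last step can be sketched in C as follows. This is a toy model under stated assumptions, not the actual Xwayland code: all types and function names are invented, attaching and committing a buffer is reduced to setting fields, and skipping windows without damage is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Which kind of buffer ended up attached to a window's surface. */
enum sketch_buffer_kind {
    SKETCH_BUFFER_NONE,
    SKETCH_BUFFER_ACCELERATED,  /* stands for the hardware accelerated path */
    SKETCH_BUFFER_SHM           /* stands for the shared memory fallback */
};

struct sketch_window {
    int damaged;                        /* content changed since the last post */
    enum sketch_buffer_kind attached;   /* last buffer kind that was attached */
    int commits;                        /* how often content was sent off */
};

struct sketch_screen {
    int have_acceleration;
    struct sketch_window *windows;
    size_t window_count;
};

/* Per window step: pick a buffer, attach it, send it, clear the damage. */
static void sketch_window_post_damage(const struct sketch_screen *screen,
                                      struct sketch_window *window)
{
    window->attached = screen->have_acceleration
        ? SKETCH_BUFFER_ACCELERATED
        : SKETCH_BUFFER_SHM;
    window->commits++;
    window->damaged = 0;
}

/* Called once per event loop iteration; returns the number of windows posted. */
static size_t sketch_block_handler(struct sketch_screen *screen)
{
    size_t posted = 0;

    for (size_t i = 0; i < screen->window_count; i++) {
        if (!screen->windows[i].damaged)
            continue;
        sketch_window_post_damage(screen, &screen->windows[i]);
        posted++;
    }
    return posted;
}
```

Calling sketch_block_handler a second time without any new damage posts nothing, which mirrors the idea that each loop iteration only sends off what has changed in the meantime.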

To force an end to this article: what we ignored entirely is input handling in Xwayland, and we only touched on the graphics buffers at the end. But at least the graphics buffers we’ll discuss exhaustively in the coming weeks, since my Google Summer of Code project is all about these little guys.

This week Piper saw some progress again! Today I opened the pull request for the MouseMap that I’ve been working on for the past two and a half weeks now. I’ll discuss the changes made since the last blog later; first, I want to highlight the other work I did the past week. A major milestone this week is the merging of ratbagd and libratbag. While Piper shouldn’t notice any of this (it talks to ratbagd over DBus), it’s still a highlight I want to mention.
June 27, 2017


GALLIUM_HUD is a feature that adds performance graphs to applications that describe various aspects like FPS, CPU usage, etc in realtime.

It is enabled using an environment variable, GALLIUM_HUD, that can be set for GL/EGL/etc applications. It only works for Mesa drivers that are Gallium based, which means that most drivers (with the notable exception of some Intel drivers) support GALLIUM_HUD.

See GALLIUM_HUD options:

export GALLIUM_HUD=help


If you're building Android, you can supply system-wide environment values by doing an export in the init.rc file of the device you are using, like this.

# Go to android source code checkout
cd android

# Add export to init.rc (linaro/generic is the device I use)
nano device/linaro/generic/init …


Introducing mkosi

After blogging about casync I realized I never blogged about the mkosi tool that combines nicely with it. mkosi has been around for a while already, and it's time to make it a bit better known. mkosi stands for Make Operating System Image, and is a tool for precisely that: generating an OS tree or image that can be booted.

Yes, there are many tools like mkosi, and a number of them are quite well known and popular. But mkosi has a number of features that I think make it interesting for a variety of use-cases that other tools don't cover that well.

What is mkosi?

What are those use-cases, and what precisely sets mkosi apart? mkosi is definitely a tool with a focus on developer's needs for building OS images, for testing and debugging, but also for generating production images with cryptographic protection. A typical use-case would be to add a mkosi.default file to an existing project (for example, one written in C or Python), thus making it easy to generate an OS image for it. mkosi will put together the image with development headers and tools, compile your code in it, run your test suite, then throw away the image again, and build a new one, this time without development headers and tools, and install your build artifacts in it. This final image is then "production-ready", and only contains your built program and the minimal set of packages you configured otherwise. Such an image could then be deployed with casync (or any other tool of course) to be delivered to your set of servers, or IoT devices or whatever you are building.

mkosi is supposed to be legacy-free: the focus is clearly on today's technology, not yesteryear's. Specifically this means that we'll generate GPT partition tables, not MBR/DOS ones. When you tell mkosi to generate a bootable image for you, it will make it bootable on EFI, not on legacy BIOS. The GPT images generated follow specifications such as the Discoverable Partitions Specification, so that /etc/fstab can remain unpopulated and tools such as systemd-nspawn can automatically dissect the images and boot from them.

So, let's have a look at the specific images it can generate:

  1. Raw GPT disk image, with ext4 as root
  2. Raw GPT disk image, with btrfs as root
  3. Raw GPT disk image, with a read-only squashfs as root
  4. A plain directory on disk containing the OS tree directly (this is useful for creating generic container images)
  5. A btrfs subvolume on disk, similar to the plain directory
  6. A tarball of a plain directory

When any of the GPT choices above are selected, a couple of additional options are available:

  1. A swap partition may be added in
  2. The system may be made bootable on EFI systems
  3. Separate partitions for /home and /srv may be added in
  4. The root, /home and /srv partitions may be optionally encrypted with LUKS
  5. The root partition may be protected using dm-verity, thus making offline attacks on the generated system hard
  6. If the image is made bootable, the dm-verity root hash is automatically added to the kernel command line, and the kernel together with its initial RAM disk and the kernel command line is optionally cryptographically signed for UEFI SecureBoot

Note that mkosi is distribution-agnostic. It currently can build images based on the following Linux distributions:

  1. Fedora
  2. Debian
  3. Ubuntu
  4. ArchLinux
  5. openSUSE

Note though that not all distributions are supported at the same feature level currently. Also, as mkosi is based on dnf --installroot, debootstrap, pacstrap and zypper, and those packages are not packaged universally on all distributions, you might not be able to build images for all those distributions on arbitrary host distributions.

The GPT images are put together in a way that they aren't just compatible with UEFI systems, but also with VM and container managers (that is, at least the smart ones, i.e. VM managers that know UEFI, and container managers that grok GPT disk images) to a large degree. In fact, the idea is that you can use mkosi to build a single GPT image that may be used to:

  1. Boot on bare-metal boxes
  2. Boot in a VM
  3. Boot in a systemd-nspawn container
  4. Directly run a systemd service off, using systemd's RootImage= unit file setting

Note that in all four cases the dm-verity data is automatically used if available to ensure the image is not tampered with (yes, you read that right, systemd-nspawn and systemd's RootImage= setting automatically do dm-verity these days if the image has it.)

Mode of Operation

The simplest way to use mkosi is to invoke it without parameters (as root):

# mkosi

Without any configuration this will create a GPT disk image for you, will call it image.raw and drop it in the current directory. The distribution used will be the same one as your host runs.

Of course in most cases you want more control over how the image is put together, i.e. select package sets, select the distribution, size partitions and so on. Most of that you can actually specify on the command line, but it is recommended to instead create a couple of mkosi.$SOMETHING files and directories in some directory. Then, simply change to that directory and run mkosi without any further arguments. The tool will then look in the current working directory for these files and directories and make use of them (similar to how make looks for a Makefile…). Every single file/directory is optional, but if they exist they are honored. Here's a list of the files/directories mkosi currently looks for:

  1. mkosi.default — This is the main configuration file, here you can configure what kind of image you want, which distribution, which packages and so on.

  2. mkosi.extra/ — If this directory exists, then mkosi will copy everything inside it into the images built. You can place arbitrary directory hierarchies in here, and they'll be copied over whatever is already in the image, after it was put together by the distribution's package manager. This is the best way to drop additional static files into the image, or override distribution-supplied ones.

  3. mkosi.build — This executable file is supposed to be a build script. When it exists, mkosi will build two images, one after the other in the mode already mentioned above: the first version is the build image, and may include various build-time dependencies such as a compiler or development headers. The build script is also copied into it, and then run inside it. The script should then build whatever shall be built and place the result in $DESTDIR (don't worry, popular build tools such as Automake or Meson all honor $DESTDIR anyway, so there's not much to do here explicitly). It may also run a test suite, or anything else you like. After the script has finished, the build image is removed again, and a second image (the final image) is built. This time, no development packages are included, and the build script is not copied into the image again. However, the build artifacts from the first run (i.e. those placed in $DESTDIR) are copied into the image.

  4. mkosi.postinst — If this executable script exists, it is invoked inside the image (inside a systemd-nspawn invocation) and can adjust the image as it likes at a very late point in the image preparation. If mkosi.build exists, i.e. the dual-phased development build process is used, then this script will be invoked twice: once inside the build image and once inside the final image. The first parameter passed to the script clarifies which phase it is run in.

  5. mkosi.nspawn — If this file exists, it should contain a container configuration file for systemd-nspawn (see systemd.nspawn(5) for details), which shall be shipped along with the final image and shall be included in the check-sum calculations (see below).

  6. mkosi.cache/ — If this directory exists, it is used as package cache directory for the builds. This directory is effectively bind mounted into the image at build time, in order to speed up building images. The package installers of the various distributions will place their package files here, so that subsequent runs can reuse them.

  7. mkosi.passphrase — If this file exists, it should contain a pass-phrase to use for the LUKS encryption (if that's enabled for the image built). This file should not be readable to other users.

  8. mkosi.secureboot.crt and mkosi.secureboot.key should be an X.509 key pair to use for signing the kernel and initrd for UEFI SecureBoot, if that's enabled.

How to use it

So, let's come back to our most trivial example, without any of the mkosi.$SOMETHING files around:

# mkosi

As mentioned, this will create a build file image.raw in the current directory. How do we use it? Of course, we could dd it onto some USB stick and boot it on a bare-metal device. However, it's much simpler to first run it in a container for testing:

# systemd-nspawn -bi image.raw

And there you go: the image should boot up, and just work for you.

Now, let's make things more interesting. Let's still not use any of the mkosi.$SOMETHING files around:

# mkosi -t raw_btrfs --bootable -o foobar.raw
# systemd-nspawn -bi foobar.raw

This is similar to the above, but we made three changes: it's no longer GPT + ext4, but GPT + btrfs. Moreover, the system is made bootable on UEFI systems, and finally, the output is now called foobar.raw.

Because this system is bootable on UEFI systems, we can run it in KVM:

qemu-kvm -m 512 -smp 2 -bios /usr/share/edk2/ovmf/OVMF_CODE.fd -drive format=raw,file=foobar.raw

This will look very similar to the systemd-nspawn invocation, except that this uses full VM virtualization rather than container virtualization. (Note that the way to run a UEFI qemu/kvm instance appears to change all the time and is different on the various distributions. It's quite annoying, and I can't really tell you what the right qemu command line is to make this work on your system.)

Of course, it's not all raw GPT disk images with mkosi. Let's try a plain directory image:

# mkosi -d fedora -t directory -o quux
# systemd-nspawn -bD quux

Of course, if you generate the image as plain directory you can't boot it on bare-metal just like that, nor run it in a VM.

A more complex command line is the following:

# mkosi -d fedora -t raw_squashfs --checksum --xz --package=openssh-clients --package=emacs

In this mode we explicitly pick Fedora as the distribution to use, ask mkosi to generate a compressed GPT image with a root squashfs, compress the result with xz, and generate a SHA256SUMS file with the hashes of the generated artifacts. The image will contain the SSH client as well as everybody's favorite editor.

Now, let's make use of the various mkosi.$SOMETHING files. Let's say we are working on some Automake-based project and want to make it easy to generate a disk image off the development tree with the version you are hacking on. Create a configuration file:

# cat > mkosi.default <<EOF


# The packages to appear in both the build and the final image
Packages=openssh-clients httpd
# The packages to appear in the build image, but absent from the final image
BuildPackages=make gcc libcurl-devel
EOF

And let's add a build script:

# cat > mkosi.build <<'EOF'
#!/bin/sh
./configure --prefix=/usr
make -j `nproc`
make install
EOF
# chmod +x mkosi.build

And with all that in place we can now build our project into a disk image, simply by typing:

# mkosi

Let's try it out:

# systemd-nspawn -bi image.raw

Of course, if you do this you'll notice that building an image like this can be quite slow. And slow build times are actively hurtful to your productivity as a developer. Hence let's make things a bit faster. First, let's make use of a package cache shared between runs:

# mkdir mkosi.cache

Building images now should already be substantially faster (and generate less network traffic) as the packages will now be downloaded only once and reused. However, you'll notice that unpacking all those packages and the rest of the work is still quite slow. But mkosi can help you with that. Simply use mkosi's incremental build feature. In this mode mkosi will make a copy of the build and final images immediately before dropping in your build sources or artifacts, so that building an image becomes a lot quicker: instead of always starting totally from scratch a build will now reuse everything it can reuse from a previous run, and immediately begin with building your sources rather than the build image to build your sources in. To enable the incremental build feature use -i:

# mkosi -i

Note that if you use this option, the package list is not updated anymore from your distribution's servers, as the cached copy is made after all packages are installed, and hence until you actually delete the cached copy the distribution's network servers aren't contacted again and no RPMs or DEBs are downloaded. This means the distribution you use becomes "frozen in time" this way. (Which might be a bad thing, but also a good thing, as it makes things kinda reproducible.)

Of course, if you run mkosi a couple of times you'll notice that it won't overwrite the generated image when it already exists. You can either delete the file yourself first (rm image.raw) or let mkosi do it for you right before building a new image, with mkosi -f. You can also tell mkosi to not only remove any such pre-existing images, but also remove any cached copies of the incremental feature, by using -f twice.

I wrote mkosi originally in order to test systemd, and quickly generate a disk image of various distributions with the most current systemd version from git, without all that affecting my host system. I regularly use mkosi for that today, in incremental mode. The two commands I use most in that context are:

# mkosi -if && systemd-nspawn -bi image.raw

And sometimes:

# mkosi -iff && systemd-nspawn -bi image.raw

The latter I use only if I want to regenerate everything based on the very newest set of RPMs provided by Fedora, instead of a cached snapshot of it.

BTW, the mkosi files for systemd are included in the systemd git tree: mkosi.default and mkosi.build. This way, any developer who wants to quickly test something with current systemd git, or wants to prepare a patch based on it and test it, can check out the systemd repository, simply run mkosi in it, and a few minutes later have a bootable image to test in systemd-nspawn or KVM. casync has similar files: mkosi.default, mkosi.build.

Random Interesting Features

  1. As mentioned already, mkosi will generate dm-verity enabled disk images if you ask for it. For that use the --verity switch on the command line or Verity= setting in mkosi.default. Of course, dm-verity implies that the root volume is read-only. In this mode the top-level dm-verity hash will be placed along-side the output disk image in a file named the same way, but with the .roothash suffix. If the image is to be created bootable, the root hash is also included on the kernel command line in the roothash= parameter, which current systemd versions can use to both find and activate the root partition in a dm-verity protected way. BTW: it's a good idea to combine this dm-verity mode with the raw_squashfs image mode, to generate a genuinely protected, compressed image suitable for running in your IoT device.

  2. As indicated above, mkosi can automatically create a check-sum file SHA256SUMS for you (--checksum) covering all the files it outputs (which could be the image file itself, a matching .nspawn file using the mkosi.nspawn file mentioned above, as well as the .roothash file for the dm-verity root hash.) It can then optionally sign this with gpg (--sign). Note that systemd's machinectl pull-tar and machinectl pull-raw commands can download these files and the SHA256SUMS file automatically and verify things on download. In other words: what mkosi outputs is perfectly ready for downloads using these two systemd commands.

  3. As mentioned, mkosi is big on supporting UEFI SecureBoot. To make use of that, place your X.509 key pair in two files mkosi.secureboot.crt and mkosi.secureboot.key, and set SecureBoot= or --secure-boot. If so, mkosi will sign the kernel/initrd/kernel command line combination during the build. Of course, if you use this mode, you should also use Verity=/--verity=, otherwise the setup makes only partial sense. Note that mkosi will not help you with actually enrolling the keys you use in your UEFI BIOS.

  4. mkosi has minimal support for git checkouts: when it recognizes it is run in a git checkout and you use the mkosi.build script stuff, the source tree will be copied into the build image, but with all files excluded by .gitignore removed.

  5. There's support for encryption in place. Use --encrypt= or Encrypt=. Note that the UEFI ESP is never encrypted though, and the root partition only if explicitly requested. The /home and /srv partitions are unconditionally encrypted if that's enabled.

  6. Images may be built with all documentation removed.

  7. The password for the root user and additional kernel command line arguments may be configured for the image to generate.

Minimum Requirements

Current mkosi requires Python 3.5, and has a number of dependencies, listed in the README. Most notably you need a somewhat recent systemd version to make use of its full feature set: systemd 233. Older versions are already packaged for various distributions, but much of what I describe above is only available in the most recent release mkosi 3.

The UEFI SecureBoot support requires sbsign which currently isn't available in Fedora, but there's a COPR.


It is my intention to continue turning mkosi into a tool suitable for:

  1. Testing and debugging projects
  2. Building images for secure devices
  3. Building portable service images
  4. Building images for secure VMs and containers

One of the biggest goals I have for the future is to teach mkosi and systemd/sd-boot native support for A/B IoT style partition setups. The idea is that the combination of systemd, casync and mkosi provides generic building blocks for building secure, auto-updating devices in a generic way, even though all pieces may be used individually, too.


  1. Why are you reinventing the wheel again? This is exactly like $SOMEOTHERPROJECT! — Well, to my knowledge there's no tool that integrates this nicely with your project's development tree, and can do dm-verity and UEFI SecureBoot and all that stuff for you. So nope, I don't think this is exactly like $SOMEOTHERPROJECT, thank you very much.

  2. What about creating MBR/DOS partition images? — That's really out of focus to me. This is an exercise in figuring out how generic OSes and devices in the future should be built and an attempt to commoditize OS image building. And no, the future doesn't speak MBR, sorry. That said, I'd be quite interested in adding support for booting on Raspberry Pi, possibly using a hybrid approach, i.e. using a GPT disk label, but arranging things in a way that the Raspberry Pi boot protocol (which is built around DOS partition tables), can still work.

  3. Is this portable? — Well, depends what you mean by portable. No, this tool runs on Linux only, and as it uses systemd-nspawn during the build process it doesn't run on non-systemd systems either. But then again, you should be able to create images for any architecture you like with it, but of course if you want the image bootable on bare-metal systems only systems doing UEFI are supported (but systemd-nspawn should still work fine on them).

  4. Where can I get this stuff? — Try GitHub. And some distributions carry packaged versions, but I think none of them the current v3 yet.

  5. Is this a systemd project? — Yes, it's hosted under the systemd GitHub umbrella. And yes, during run-time systemd-nspawn in a current version is required. But no, the code-bases are separate otherwise, already because systemd is a C project, and mkosi Python.

  6. Requiring systemd 233 is a pretty steep requirement, no? — Yes, but the feature we need kind of matters (systemd-nspawn's --overlay= switch), and again, this isn't supposed to be a tool for legacy systems.

  7. Can I run the resulting images in LXC or Docker? — Humm, I am not an LXC nor Docker guy. If you select directory or subvolume as image type, LXC should be able to boot the generated images just fine, but I didn't try. Last time I looked, Docker doesn't permit running proper init systems as PID 1 inside the container, as they define their own run-time without intention to emulate a proper system. Hence, no I don't think it will work, at least not with an unpatched Docker version. That said, again, don't ask me questions about Docker, it's not precisely my area of expertise, and quite frankly I am not a fan. To my knowledge neither LXC nor Docker are able to run containers directly off GPT disk images, hence the various raw_xyz image types are definitely not compatible with either. That means if you want to generate a single raw disk image that can be booted unmodified both in a container and on bare-metal, then systemd-nspawn is the container manager to go for (specifically, its -i/--image= switch).

Should you care? Is this a tool for you?

Well, that's up to you really.

If you hack on some complex project and need a quick way to compile and run your project on a specific current Linux distribution, then mkosi is an excellent way to do that. Simply drop the mkosi.default and mkosi.build files in your git tree and everything will be easy. (And of course, as indicated above: if the project you are hacking on happens to be called systemd or casync be aware that those files are already part of the git tree — you can just use them.)

If you hack on some embedded or IoT device, then mkosi is a great choice too, as it will make it reasonably easy to generate secure images that are protected against offline modification, by using dm-verity and UEFI SecureBoot.

If you are an administrator and need a nice way to build images for a VM or systemd-nspawn container, or a portable service then mkosi is an excellent choice too.

If you care about legacy computers, old distributions, non-systemd init systems, old VM managers, Docker, … then no, mkosi is not for you, but there are plenty of well-established alternatives around that cover that nicely.

And never forget: mkosi is an Open Source project. We are happy to accept your patches and other contributions.

Oh, and one unrelated last thing: don't forget to submit your talk proposal and/or buy a ticket for All Systems Go! 2017 in Berlin — the conference where things like systemd, casync and mkosi are discussed, along with a variety of other Linux userspace projects used for building systems.

June 26, 2017
Since it has been a while since the last update, I guess it is a good time to post an update on some of the progress that has been happening with freedreno and upstream support for snapdragon boards.

freedreno / mesa

While the 17.1 release included enabling reorder support by default, there have been many other interesting features landed since the 17.1 branch point (so they will be included in the future 17.2 release).  Many, but not all, are related to a5xx.  (Something that I just realized I forgot to blog about, but have demoed here and there.)

GL/GLES Compute Shaders:

So far this is only a5xx (although a4xx seems to work similarly, and would probably be not too hard to get working if someone had the right hardware and a bit of time).  SSBOs and atomics are supported, but image support (an important part of compute shaders) is still TODO (and some r/e required, although it seems to share a lot in common with SSBOs).  Adreno 3xx support for compute shaders appears to be more work (ie. less in common with a4xx/a5xx, and probably part of the reason that qualcomm never bothered adding support in android blob driver).  Patches welcome, but for now a3xx compute support is far enough down my TODO list that it might not otherwise happen.

I know there is a lot of interest in open source OpenCL support for freedreno, and hopefully that is something that will come in the future.  But there is the big challenge of how to get opencl shaders (kernels) into a form that can be consumed by freedreno's ir3 shader compiler backend.  While there is some potential to re-use spirv_to_nir at some point, there are some complicated details.  For compute kernels (ie. OpenCL) there are some restrictions lifted on SPIRV that spirv_to_nir relies on.  (Little details like lack of requirement for structured flow control.)

A5xx HW Binning Support:

Traditionally hw binning support, while a pretty big perf boost, has been kinda difficult (translation: lot of things can be done wrong to lead to difficult to debug GPU lockups), but this time around it wasn't so hard.  I guess experience on a3xx/a4xx has helped.  And everyone loves a ~30% fps boost in their favorite game!

This has brought performance roughly up to the levels as ifc6540/a420.  Which sounds bad, but remember we are comparing apples and oranges.  On ifc6540 (snapdragon 805), we don't yet have upstream kernel support so this was using a 3.10 android kernel (with bus-scaling and all the downstream tricks to optimize memory bandwidth and overall SoC performance).  But on a530 (dragonboard820c), I never had a working downstream kernel (or had to bother backporting the upstream drm/msm driver to some ancient android kernel.. hurray!).  The upshot is that any perf #'s for a5xx don't include bus-scaling, cpufreq, etc.  I expect a pretty big performance boost on a530 once we have a way to clock up memory/interconnects.  (Ie. on micro-benchmarks a530 is >2x faster than a420 on alu limited workloads, but still a bit slower than a420 on bandwidth limited workloads, despite having a higher theoretical bandwidth.)

Side note, linaro is working on an upstream solution for bus-scaling.  This is a very important improvement needed upstream for ARM SoC's, especially ones that optimize so strongly for battery life.  (Keep in mind that interconnects, which span across the SoC, and memory, are a big power consumer in a modern SoC.. so a lot of qualcomm's good performance + battery life in phones comes down to these systemwide optimizations.)  It is equivalent to slow memory clockings on some generations of nouveau, except in this case it is outside the gpu driver (ie. we aren't talking about vram on a discrete gpu), and the reason is to enable a high end phone SoC to last a couple days on battery, rather than keeping your video card from melting.

A5xx gles3.0/gl3.1 support:

Probably it would have made sense to spend time on this before compute shaders (since they are otherwise only exposed with $MESA_GL_VERSION_OVERRIDE tricks.. but hey, I was curious about how compute shaders worked).  After an assortment of small things to r/e and implement, we were just a few (~50) texture/vbo/fb formats away from gl3.1.  Nothing really exciting.  Mostly just a few weekends probing unknown format #'s and seeing which piglit format tests started passing.  The sort of thing that would have taken approximately 10 minutes with docs.. but hey, it needed to be done.

Switching to NIR by default:

This is one thing that benefits a3xx and a4xx as well as a5xx.  While freedreno has had NIR support for a while, it hasn't been enabled by default until more recently.  The issue was handling of complex dereferences (multi-dimensional arrays, arrays of structs, etc).  The problem was that freedreno's ir3 backend preferred to keep things in SSA form (since that gives the instruction scheduler more flexibility, which is pretty important in the a3xx+ instruction set architecture (ir3)).  Adding support to lower arrays to regs allowed moving the deref offset calculation to NIR so that we wouldn't regress by turning NIR on by default.  This is useful since it cuts shader compilation time, but also because tgsi_to_nir doesn't support SSBOs, atomics, and other new shiny glsl features.  (Now we only rely on tgsi_to_nir for various legacy paths and built-in blit shaders which don't need new shiny glsl features.)

A5xx HW Query Support:

Adreno 5xx changed how hw queries (ie. occlusion query and time-elapsed query, etc) work.  For the better, since now we can accumulate per-tile results on the GPU.  But it required some new support in freedreno for a different sort of query, and some r/e about how this actually worked.  And while we had previously lied about occlusion query support (mostly to expose more than gl1.4 support), that isn't a very good long term solution.  In addition, time-elapsed query is useful for performance/profiling work, so helpful for some of the following projects.
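As a rough illustration of what "accumulate per-tile results" means on a tiler: the same draws are replayed once per tile, so a query that spans them yields one partial result per tile, and the final answer is the sum.  The sketch below is a toy model in Python, not freedreno code.  It assumes a free-running sample counter that is snapshotted at the start and end of the query within each tile, and the function name is made up.

```python
def accumulate_query(per_tile_snapshots):
    """Sum per-tile counter deltas into one query result.

    per_tile_snapshots: iterable of (counter_at_start, counter_at_end),
    one pair per tile.  The counter is assumed to only ever increase.
    """
    total = 0
    for start, end in per_tile_snapshots:
        total += end - start  # samples that passed inside this tile
    return total
```

On a5xx this summation can happen on the GPU instead of being done by the driver on the CPU afterwards.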

A5xx LRZ Support:

Adreno 5xx adds another cute optimization called "LRZ".  (Presumably "low resolution Z (depth buffer)".)  I've spent some time r/e'ing this feature and implementing support for it in freedreno.  It is a neat new hw trick that a5xx has, which serves two purposes.
The basic idea is to have a per-quad depth value so that in the binning pass primitives can be rejected (per tile) based on depth (ie. reject more early). But then recycle the LRZ buffer in draw phase to function as for-free depth pre-pass (ie. reject earlier primitives based on the z value of later primitives).
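To make the two uses a bit more concrete, here is a toy model in Python.  It is only a sketch of the general idea, not how the hardware or freedreno implement it.  The assumptions: a LESS depth test with depth in [0.0, 1.0], primitives simplified to axis-aligned rectangles with inclusive pixel bounds, and a made-up block size (the real granularity isn't stated here).

```python
BLOCK = 8  # hypothetical block size in pixels, chosen only for illustration


class LowResZ:
    """One conservative 'farthest possibly visible' depth per block of pixels."""

    def __init__(self, width, height):
        self.cols = (width + BLOCK - 1) // BLOCK
        self.rows = (height + BLOCK - 1) // BLOCK
        self.far = [[1.0] * self.cols for _ in range(self.rows)]

    def _blocks(self, x0, y0, x1, y1):
        # Yield every block touched by the rectangle.
        for by in range(y0 // BLOCK, min(self.rows - 1, y1 // BLOCK) + 1):
            for bx in range(x0 // BLOCK, min(self.cols - 1, x1 // BLOCK) + 1):
                yield bx, by

    def record_cover(self, x0, y0, x1, y1, z_max):
        """Binning pass: tighten only the blocks the primitive fully covers."""
        for bx, by in self._blocks(x0, y0, x1, y1):
            fully_covered = (x0 <= bx * BLOCK and y0 <= by * BLOCK and
                             x1 >= bx * BLOCK + BLOCK - 1 and
                             y1 >= by * BLOCK + BLOCK - 1)
            if fully_covered:
                self.far[by][bx] = min(self.far[by][bx], z_max)

    def maybe_visible(self, x0, y0, x1, y1, z_min):
        """Return False only if the primitive is hidden in every block it touches."""
        return any(z_min <= self.far[by][bx]
                   for bx, by in self._blocks(x0, y0, x1, y1))
```

If the buffer is filled for all primitives first and only then consulted while drawing, a far primitive submitted early gets rejected because of a near primitive submitted later, which is the "for-free depth pre-pass" effect.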

The benefit depends on how well optimized the game is.  Ie. games that are well optimized for traditional GPU architectures (ie. sorting geometry, already doing depth pre-passes, etc) won't benefit as much.. but this helps a lot for badly written games that relied on per-pixel deferred rendering.

Overall, for things like stk/xonotic, it seems like a ~5-10% win.

edit: I forgot to mention, this isn't enabled by default as it causes some issues (which seem like a sort of z-fighting) with 0ad.  Other than that, I haven't found anything that it doesn't work with.  To enable: FD_MESA_DEBUG=lrz.   It would be nice if there were some way to have driver specific flags in driconf to control things like this.

The main remaining performance trick for a5xx is UBWC (ie. bandwidth compression) + tiled textures.  I've worked out mostly how UBWC works (in particular texture layout, at least for 2d textures + mipmap, but I think we can infer how 2d arrays, 3d, etc, work from that).  Most of the infrastructure for upload/download blits (to convert to/from linear) should be easier thanks to the reorder support.  We'll see if I actually find time to implement it before the mesa 17.2 branch point.

Standardized Embedded Nonsense Hacks

Anyone who has dealt with arm (non-server) devices should be familiar with the silly-embedded-nonsense-hacks world.  In particular the non-standard boot-chain, which makes it difficult for distros to support the plethora of arm boards (let alone phones/tablets/etc) out there without per-board support.  Which was fine in the early days, but with N boards times M distros, it really doesn't scale.

Thanks to work by Mateusz Kulikowski, we now have u-boot support for dragonboard 410c.  It's been on my TODO list to play with for a while.  But more recently I realized that u-boot, thanks to the work of many others, can provide enough of EFI runtime-services interface for grub to work.  This means that it is a path forward for standardized distros on aarch64 (like fedora and opensuse), which expect UEFI, to boot on boards which don't otherwise have UEFI firmware.

So I decided to spend a bit of time pretending to be a crack smoking firmware engineer.  (Not literally, of course.. that would be stupid!)

After fixing some linker script bugs with u-boot's db410c support vs efi_runtime section, and debugging some issues with grub finding the boot disk with the help of Peter Jones (the resident grub/EFI expert who conveniently sits near me), and a couple other misc u-boot fixes, I had a fedora 26 alpha image booting on the db410c.

The next step was figuring out display, so we could have grub boot menu on screen, like you would expect on a grown-up platform.  As it turns out, on most devices, lk (little kernel, ie. what normally loads the kernel+initrd on snapdragon android devices) already supports lighting up the display, since most/all android devices put up the initial splash-screen before the kernel is loaded.  Unfortunately this was not the case with the db410c's lk.  But Archit (qcom engineer who has contributed a whole lot of drm/msm and other drm patches) pointed me at a different lk branch (among the 100's) which had msm8916 display + adv7533 dsi->hdmi bridge (like what db410c uses).  After digging through a convoluted git history, I was able to track down the relevant gpio/i2c/adv7533 patches to port to the lk branch used on db410c.

After that, I added support for lk to populate a framebuffer node, using the simple-framebuffer bindings to pass the pre-configured scanout buffer (+dimensions) to u-boot.  This plus a new simplefb video driver for u-boot, enables u-boot to expose display support to grub via the EFI GOP protocol.  (Along the way I had to add 32bpp rgb support to lk since u-boot and grub don't understand packed 24bpp rgb.)
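For anyone wondering what the 24bpp vs 32bpp difference looks like in memory, here is a small Python sketch.  It is purely illustrative: lk configures the scanout format rather than converting pixels like this, and the byte order and the position of the padding byte are assumptions.

```python
def rgb888_to_xrgb8888(packed: bytes) -> bytes:
    """Expand packed 3-byte pixels into 4-byte pixels with one padding byte."""
    if len(packed) % 3:
        raise ValueError("input length must be a multiple of 3 bytes")
    out = bytearray()
    for i in range(0, len(packed), 3):
        out += packed[i:i + 3]  # the three colour bytes, order unchanged
        out.append(0x00)        # unused padding byte (assumed position)
    return bytes(out)
```

With the padded layout every pixel starts on a 4-byte boundary, so a row of `width` pixels occupies `width * 4` bytes instead of `width * 3`.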

All this got to the point of:

This is a fedora image, booting off of usb disk (ie. not just rootfs on usb disk, but also grub/kernel/initrd/dtb).  With graphical grub menu to select which kernel to boot, just like you would expect on a PC.  The grubaa64.efi here is vanilla distro boot-loader, and from the point of view of the distro image, lk/u-boot is just the platform's firmware which somehow provides the UEFI interface the distro media expects.  It is worth pointing out some advantages over a traditional lk->kernel boot chain:
  • booting from USB, network, etc (which lk cannot do)
  • doesn't require kernel packed in custom boot.img partition which is board specific
  • booting installer image (ie. from sd-card or network)
When the kernel starts, in early boot, it is using efifb, just like it would on a PC.  (Ie. so you can see what is going on on-screen before hw specific drm driver kernel module is loaded).

There are still a few rough edges.  The drm/msm driver and msm clk drivers are a bit surprised when some clks are already enabled when the kernel starts, and the display is already lit up.. now we have a good reason to fix some of those issues.  And right now we don't have a good way to load a newer device tree binary (dtb) after a distro kernel update (ie. without updating u-boot, aka "the firmware").  (For simple SoC's maybe a pre-baked dtb for the life of the board is sufficient... I have my doubts about that for SoCs as complex as the various snapdragon's, if for no other reason than that we haven't even figured out how to model all the features of the existing SoCs in devicetree.)  One idea is for u-boot to pass to grub the name of the board dtb file to load via EFI variables.  I've sent a very early RFC to add EFI variable support in u-boot.  We'll see how this goes, in the mean time there might be more "firmware" upgrades needed than you'd normally expect on a mature platform like x86.

For now, my lk + u-boot work is here:
and prebuilt "firmware" is here.  For now you will need to edit distro grub.cfg to add 'devicetree' commands to load appropriate dtb since what is included with u-boot.img is a very minimal fdt (ie. just enough for the drivers in u-boot).

This week I picked up my old vc4-xml branch. This rework was inspired by the Intel driver, where they wrote an XML description of the hardware packets and use that to code-generate the packet packing and debug dumping code. Given that vc4’s debug dumping has always been somewhat of a mess (and its code is duplicated between mesa and vc4-gpu-tools), it would be great to do the same thing to vc4. More importantly, the XML-generated pack code easily lets you do things like precompute part of your packed state packet at gallium CSO generation time, and then just memcpy (or OR together two copies) at draw time.
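To show what "precompute part of the packet, then OR at draw time" looks like, here is a minimal Python sketch.  The packet layout, the opcode value and the function names are all invented for illustration; they are not real vc4 packets or the generated pack code.

```python
import struct

# Hypothetical 32-bit packet layout (not a real vc4 packet):
#   bits 0-7    opcode
#   bit  8      depth test enable   (known when the state object is created)
#   bit  9      depth write enable  (known when the state object is created)
#   bits 16-31  primitive count     (only known at draw time)
OPCODE_DRAW = 0x2A  # made-up value


def pack_static(depth_test: bool, depth_write: bool) -> int:
    """Done once, at state-object creation time."""
    return OPCODE_DRAW | (int(depth_test) << 8) | (int(depth_write) << 9)


def pack_dynamic(prim_count: int) -> int:
    """Done per draw; only touches bits the static half leaves at zero."""
    if not 0 <= prim_count < (1 << 16):
        raise ValueError("prim_count must fit in 16 bits")
    return prim_count << 16


def emit_draw(static_word: int, prim_count: int) -> bytes:
    """Draw time: OR the two halves together and write one little-endian word."""
    return struct.pack("<I", static_word | pack_dynamic(prim_count))
```

The point of the split is that the draw-time path does no per-field shifting for the static state: it only combines two already-packed words.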

My problem with the branch had been that it bloated the size of the vc4_emit.c code (the draw-time path), which probably meant that it reduced performance compared to my old hand-written packing. I had spent a couple of weeks writing fast paths for things like moving a float into the unaligned CL, or packing a couple of flag bits into the bottom of a 32-bit address, but that only took the bloat from like 20% to 10%. Last week, I decided to stop using the size as a proxy for performance and just test performance, and it turns out that the difference was negligible or slightly positive! Now I need to get an Android build done, and merge.

In the process of doing this draw overhead testing, I turned on a new gallium flag that cut the CPU overhead of draw calls by 5%, which was more than any of my vc4-xml overhead ever was!

I also spent more time on the 7” panel again, trying a rework of the load order in response to review feedback. It turns out that the DSI portion of DRM isn’t built to support drivers the way that previous feedback requested, and nobody has a concrete plan for how it would work. I’ve tried one avenue of fixing it, but that ran into another mess in the DSI subsystem.

Switching Raspbian over to vc4 fkms is currently stalled on Simon rebuilding the packages with the current Mesa patchset. I’ve fixed a minor issue in the fkms overlay that requested aligned CMA areas despite my having removed that requirement a while back, so hopefully they’ll finally enable vc4 on Pi0/1 as well once they get around to updating.

I did another round of review on the piglit series for ANDROID_native_fence and they’re ready to land now.

I sent out a rework to make some VC4 NIR lowering code shareable with other drivers. Freedreno and Intel have both wanted it at some point, so hopefully I can get some review on it.

In the kernel, I polished up my BO-labeling code that gives you detailed graphics memory usage information in /debug/dri/0/bo_stats. The cleanup was to effectively eliminate the CPU overhead, unless you choose to do labeling from userspace. Adding this userspace interface required adding intel-gpu-tools testcases, so I wrote those. The Mesa side isn’t merged yet since we need kernel review first, and will probably only be enabled in debug driver builds. To make it really fancy, I should also hook up glObjectLabel() all the way to the kernel, so that /debug/dri/0/bo_stats can have things like “X11 ARGB glyph cache” instead of “resource 1024x1024@4” for that mystery 4MB buffer you’ve got.

Finally, I did some cleanup of the VC4 modesetting code, prompted by Boris’s recent cleanups. We’re much closer to matching the common DRM helpers now, with just our async pageflip code still being special. Once Gustavo’s async cursor bits land, we may be able to remove our async pageflip special case as well!

June 25, 2017
SPIR-V is the binary shader code representation used by Vulkan, and GL_ARB_gl_spirv is a recent extension that allows it to be used for OpenGL as well. Over the last weeks, I've been exploring how to add support for it in radeonsi.

As a bit of background, here's an overview of the various relevant shader representations that Mesa knows about. There are some others for really old legacy OpenGL features, but we don't care about those. On the left, you see the SPIR-V to LLVM IR path used by radv for Vulkan. On the right is the path from GLSL to LLVM IR, plus a mention of the conversion from GLSL IR to NIR that some other drivers are using (i965, freedreno, and vc4).

For GL_ARB_gl_spirv, we ultimately need to translate SPIR-V to LLVM IR. A path for this exists, but it's in the context of radv, not radeonsi. Still, the idea is to reuse this path.

Most of the differences between radv and radeonsi are in the ABI used by the shaders: the conventions by which the shaders on the GPU know where to load constants and image descriptors from, for example. The existing NIR-to-LLVM code needs to be adjusted to be compatible with radeonsi's ABI. I have mostly completed this work for simple VS-PS shader pipelines, which has the interesting side effect of allowing the GLSL-to-NIR conversion in radeonsi as well. We don't plan to use it soon, but it's nice to be able to compare.

Then there's adding SPIR-V support to the driver-independent mesa/main code.  This is non-trivial, because while GL_ARB_gl_spirv has been designed to remove a lot of the cruft of the old GLSL paths, we still need more supporting code than a Vulkan driver. This still needs to be explored a bit; the main issue is that GL_ARB_gl_spirv allows using default-block uniforms, so the whole machinery around glUniform*() calls has to work, which requires setting up all the same internal data structures that are set up for GLSL programs. Oh, and it looks like assigning locations is required, too.

My current plan is to achieve all this by re-using the GLSL linker, giving a final picture that looks like this:

So the canonical path in radeonsi for GLSL remains GLSL -> AST -> IR -> TGSI -> LLVM (with an optional deviation along the IR -> NIR -> LLVM path for testing), while the path for GL_ARB_gl_spirv is SPIR-V -> NIR -> LLVM, with NIR-based linking in between. In radv, the path remains as it is today.

Now, you may rightfully say that the GLSL linker is a huge chunk of subtle code, and quite thoroughly invested in GLSL IR. How could it possibly be used with NIR?

The answer is that huge parts of the linker don't really care that much about the code in the shaders that are being linked. They only really care about the variables: uniforms and shader inputs and outputs. True, there are a bunch of linking steps that touch code, but most of them aren't actually needed for SPIR-V. Most notably, GL_ARB_gl_spirv doesn't require intrastage linking, and it explicitly disallows the use of features that only exist in compatibility profiles.

So most of the linker functionality can be preserved simply by converting the relevant variables (shader inputs/outputs, uniforms) from NIR to IR, then performing the linking on those, and finally extracting the linker results and writing them back into NIR. This isn't too much work. Luckily, NIR reuses the GLSL IR type system.

There are still parts that might need to look at the actual shader code, but my hope is that they are few enough that they don't matter.

And by the way, some people might want to move the IR -> NIR translation to before linking, so this work would set a foundation for that as well.

Anyway, I got a ridiculously simple toy VS-PS pipeline working correctly this weekend. The real challenge now is to find actual test cases...

June 23, 2017
As mentioned in my previous blog, the X.Org Foundation now wants us to blog every week. Whilst that means shorter blogs (last week’s was a tad long), it also means that there isn’t much to blog about if I didn’t do much in a week. Such is the case for this week, sadly. It’s the last week of university, and so there are a few assignment deadlines that I needed to complete; I haven’t been able to invest as much time into my project as I would have wanted.

In this week’s article for my ongoing Google Summer of Code (GSoC) project I planned on writing about the basic idea behind the project, but I reconsidered and decided to first give an overview of how Xwayland functions on a high level and in the next week take a look at its inner workings in detail. The reason for that is that there is not much Xwayland documentation available right now. So these two articles are meant to fill this void in order to give interested beginners a helping hand. And in two weeks I’ll catch up on explaining the project’s idea.

As we go high level this week the first question is, what is Xwayland supposed to achieve at all? You may know this. It’s something in a Wayland session ensuring that applications which don’t support Wayland but only the old Xserver still function normally, i.e. it ensures backwards compatibility. But how does it do this? Before we go into this, there is one more thing to talk about, since so far I have only called Xwayland “something”. What is Xwayland exactly? How does it look to you on your Linux system? We’ll see in the next week that it’s not as easy to answer as the following simple explanation makes it appear, but for now this is enough: It’s a single binary containing an Xserver with a special backend written to communicate with the Wayland compositor active on your system, for example with KWin in a Plasma Wayland session.

To make it more tangible let’s take a look at Debian: There is a package called Xwayland and it consists of basically only the aforementioned binary file. This binary gets copied to /usr/bin/Xwayland. Compare this to the normal Xserver, which in Debian you can find in the package xserver-xorg-core. The respective binary gets put into /usr/bin/Xorg together with a symlink /usr/bin/X pointing to it.

While the latter is the central building block in an X session and therefore gets launched before anything else with graphical output, the Xserver in the Xwayland binary works differently: It is embedded in a Wayland session. And in a Wayland session the Wayland compositor is the central building block. This means in particular that the Wayland compositor also takes up the role of being the server, which talks to Wayland native applications with graphical output as its clients. They send requests to it in order to present their painted stuff on the screen. The Xserver in the Xwayland binary is only a necessary link between applications which are only able to speak to an Xserver and the Wayland compositor/server. Therefore the Xwayland binary gets launched later on by the compositor or some other process in the workspace. In Plasma it’s launched by KWin after the compositor has initialized the rendering pipeline. You find the relevant code here.

Although in this case KWin also establishes some communication channels with the newly created Xwayland process, in general the communication between Xwayland and a Wayland server is done by the normal Wayland protocol in the same way other native Wayland applications talk to the compositor/server. This means the windows requested by possibly several X based applications and provided by Xwayland acting as an Xserver are translated at the same time by Xwayland to Wayland compatible objects and, with Xwayland acting as a native Wayland client, sent to the Wayland compositor via the Wayland protocol. These windows look to the Wayland compositor just like the windows (in Wayland terminology: surfaces) of every other Wayland native application. When reading this keep in mind that an application in Wayland is not limited to using only one window/surface but can create multiple at the same time, so Xwayland as a native Wayland client can do the same for all the windows created for all of its X clients.

In the second part next week we’ll have a close look at the Xwayland code to see how Xwayland fills its role as an Xserver in regards to its X based clients and at the same time acts as a Wayland client when facing the Wayland compositor.

June 20, 2017

Felt it had been too long since I did another Fedora Workstation update. We spend a lot of time trying to figure out how we can best spend our resources to produce the best desktop possible for our users, because even though Red Hat invests more into the Linux desktop than any other company by quite a margin, our resources are still far from limitless. So we have a continuous effort of asking ourselves whether the areas we are investing in are the right ones that give our users the things they need the most, so below is a sampling of the things we are working on.

Improving integration of the NVidia binary driver
This has been ongoing for quite a while, but things have started to land now. Hans de Goede and Simone Caronni have been collaborating, building on the work by NVidia and Adam Jackson around glvnd. So if you set up Simone’s NVidia repository hosted on negativo17 you will be able to install the Nvidia driver without any conflicts with the Mesa stack, and due to Hans’s work you should be fairly sure that even if the NVidia driver stops working with a given kernel update you will smoothly transition back to the open source Nouveau driver. I have been testing it on my own Lenovo P70 system for the last week and it seems to work well under X. That said, once you install the binary NVidia driver that is what you’re running on, which is of course not the behaviour you want from a hybrid graphics system. Fixing that last issue requires further collaboration between us and NVidia.
Related to this, Adam Jackson is currently working on a project he calls glxmux. glxmux will allow you to have more than one GLX implementation on the system, so that you can switch between Mesa GLX for the Intel integrated graphics card and NVidia GLX for the binary driver. While we can make no promises, we hope to have the framework in place for Fedora Workstation 27. Having that in place should allow us to create a solution where you only use the NVidia driver when you want the extra graphics power. That will of course require significant work from Nvidia to enable it on their side, so I can’t give a definite timeline for when all the puzzle pieces are in place. Just be assured we are working on it and talking regularly to NVidia about it. I will let you know here as soon as things come together.

On the Wayland side, Jonas Ådahl is working on putting the final touches on Hybrid Graphics support.

Fleet Commander ready for take-off
Another major project we have been working on for a long time is Fleet Commander. Fleet Commander is a tool to allow you to manage Fedora and RHEL desktops centrally. This is a tool targeted at, for instance, universities or companies with tens, hundreds or thousands of workstation installations. It gives you a graphical browser based UI (accessible through Cockpit) to create configuration profiles and deploy them across your organization. Currently it allows you to control anything that has a gsetting associated with it, like enabling/disabling extensions and setting configuration settings in GTK+ and GNOME applications. It allows you to configure Network Manager settings, so if you’re updating the company VPN or proxy settings you can easily push those changes out to all users in the organization. Or quickly migrate Evolution email settings to a new email server. The tool also allows you to control recommended applications in the Software Center and set bookmarks in Firefox. There is also support for controlling settings inside LibreOffice.

All these features can be set and controlled on either a user level or a group level or organization wide due to the close integration we have with the FreeIPA suite of tools. The data is stored inside your organization’s LDAP server alongside other user information so you don’t need to have the clients connect to a new service for this, and while it is not there in this initial release we will in the future also support Active Directory.

The initial release and Fleet Commander website will be out alongside Fedora Workstation 26.

I talked about PipeWire before, when it was still called Pinos, but the scope and ambition for the project has significantly changed since then. Last time when I spoke about it the goal was just to create something that could be considered a video equivalent of pulseaudio. Wim Taymans, who you might know co-created GStreamer and who has been a major PulseAudio contributor, has since expanded the scope and PipeWire now aims at unifying linux Audio and Video. The long term goal is for PipeWire to not only provide handling of video streams, but also handle all kinds of audio. Due to this Wim has been spending a lot of time making sure PipeWire can handle audio in a way that not only addresses the PulseAudio usecases, but also the ones handled by Jack today. A big part of the motivation for this is that we want to make Fedora Workstation the best place to create content and we want the pro-audio crowd to be first class citizens of our desktop.

At the same time we don’t want to make this another painful subsystem transition, so we will need to ensure that PulseAudio applications can still be run without modification.

We expect to start shipping PipeWire with Fedora Workstation 27, but at that point only have it handle video, as we need this both to enable good video handling for Flatpak applications through a video portal and to provide an API for applications that want to do screen capture under Wayland, like web browser applications offering screen sharing. We will then bring the audio features onboard in subsequent releases as we also try to work with the Jack and PulseAudio communities to make this a joint effort. We are also working now on a proper website for PipeWire.

Red Hat developer integration
A feature we are quite excited about is the integration of support for the Red Hat developer account system into Fedora. This means that you should be able to create a Red Hat developer account through GNOME Online accounts and once you have that account set up you should be able to easily create Red Hat Enterprise Linux virtual machines or containers on your Fedora system. This is a crucial piece for the developer focus that we want the workstation to have and one that we think will make a lot of developers’ lives easier. We were originally hoping to have this ready for Fedora Workstation 26, but atm it looks more likely to hit Fedora Workstation 27. We will keep you up to date as this progresses.

Fractional scaling for HiDPI systems
Fedora Workstation has been leading the charge in supporting HiDPI on Linux and we hope to build on that with the current work to enable fractional scaling support. Since we introduced HiDPI support we have been improving it step by step, for instance last year we introduced support for dealing with different DPI levels per monitor for Wayland applications. The fractional scaling work will take this a step further. The biggest problem it will resolve is that for certain monitor sizes the current scaling options leave things either too small or too big. With the fractional scaling support we will introduce intermediate steps, so that you can scale your interface by 1.5 instead of having to go all the way to 2. The set of technologies we are developing for handling fractional scaling should also allow us to provide better scaling for XWayland applications as it provides us with methods for scaling that don’t need direct support from the windowing system or toolkit.
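To give a feel for the numbers involved, here is a small Python sketch of one possible way to get a fractional scale without toolkit support. The approach is an assumption made for illustration and is not spelled out above: applications render at the next integer scale and the compositor scales the result down. The function name and the returned keys are made up.

```python
import math


def fractional_scale_plan(width_px, height_px, scale):
    """Work out the sizes involved for a monitor and a fractional scale factor."""
    render_scale = math.ceil(scale)        # integer scale the application renders at
    logical_w = round(width_px / scale)    # size of the desktop in logical units
    logical_h = round(height_px / scale)
    return {
        "logical_size": (logical_w, logical_h),
        "render_scale": render_scale,
        "render_size": (logical_w * render_scale, logical_h * render_scale),
        "downscale_factor": scale / render_scale,
    }
```

For a 3840x2160 monitor at 1.5 this gives a 2560x1440 logical desktop, rendered at 2x (5120x2880) and scaled down by 0.75 to fit the panel.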

GNOME Shell performance
Carlos Garnacho has been doing some great work recently improving the general performance of GNOME Shell. This comes on top of his earlier performance work that was very well received. How fast or slow GNOME Shell feels is often a subjective thing, but reducing overhead where we can is never a bad thing.

Flatpak building
Owen Taylor has been working hard on putting the pieces in place to start large scale Flatpak building in Fedora. You might see a couple of test flatpaks appear in a Fedora Workstation 26 timeframe, but the goal is to have a huge Flatpak catalog ready in time for Fedora Workstation 27. Essentially what we are doing is making it very simple for a Fedora maintainer to build a Flatpak of the application they maintain through the Fedora package building infrastructure and push that Flatpak into a central Flatpak registry. And while this is mainly meant to be to the benefit of Fedora users, there is of course nothing stopping other distributions from offering these Flatpak packaged applications to their users also.

Atomic Workstation
Another effort that is marching forward is what we call Atomic Workstation. The idea here is to have an immutable OS image kinda like what you see on for instance Android devices. The advantage of this is that the core of the operating system gets tested and deployed as a unit and the chance of users ending up with broken systems decreases significantly, as we don’t need to rely on packages getting applied in the correct order or scripts executing as expected on each individual workstation out there. This effort is largely based on the Project Atomic effort, and the end goal here is to have an image based OS install and Flatpak based applications on top of it. If you are very adventurous and/or want to help out with this effort you can get the ISO image installer for Atomic Workstation here.

Firmware handling
Our Linux Firmware project is still going strong with new features being added and new vendors signing on. As Richard Hughes recently blogged, the latest vendor joining the effort is Logitech, who will now upload their firmware into the service so that you can keep your Logitech peripherals updated through it. It is worthwhile pointing out here how we worked with Logitech to make this happen, with Richard working on the special tooling needed and thus lowering the threshold for Logitech to start offering their firmware through the service. We are having similar discussions and collaborations with other vendors, so expect to see more. At this point I tend to recommend people get a Dell to run Linux, due to their strong support for efforts such as the Linux Firmware Service, but other major vendors are in the final stages of testing, so expect more of them to start pushing firmware updates soon.

High Dynamic Range
The next big thing in the display technology field is HDR (High Dynamic Range). HDR allows for deeper, more vibrant colours and is a feature seen on a lot of new TVs these days, and game consoles like the PlayStation 4 support it. Computer monitors with this feature are appearing on the market now too, for instance the Dell UP2718Q. We want to ensure Fedora and Linux are leaders here, for the benefit of video and graphics artists using Fedora and Red Hat Enterprise Linux. We are thus kicking off an effort to make sure this technology matures as quickly as possible and is fully supported. We are not the only ones interested in this, so we will hopefully be collaborating with our friends at Intel, AMD and NVidia on this. We hope to have the first monitors delivered to our office within a few weeks.

While playback these days has moved to streaming, where locally installed codecs are of less importance for the consumption use case, having a wide selection of codecs available is still important for media editing and creation use cases, so we want you to be able to load a variety of old media files into your video editor, for instance. Luckily we are at a crossroads now where a lot of widely used codecs have their essential patents expiring (mp3, ac3 and more), while at the same time the industry focus seems to have moved to royalty-free codec development going forward (Opus, VP9, Alliance for Open Media). We have been spending a lot of time with the Red Hat legal team trying to clear these codecs, which resulted in mp3 and AC3 now shipping in Fedora Workstation. We have more codecs on the way though, so this effort is in no way over. My goal is that over the course of this year the situation of software patents being a huge issue when dealing with audio and video codecs on Linux will be considered a thing of the past. I would like to thank the Red Hat legal team for their support on this issue, as they have had to spend significant time on it. A big company like Red Hat does need to do its own due diligence when it comes to these things; we can’t just trust statements from random people on the internet that these codecs are now free to ship.

Battery life
We have been looking at this for a while now and hope to be able to start sharing information with users on which laptops they should get that will have good battery life under Fedora. Christian Kellner is now our point man on battery life and he has taken up improving the Battery Bench tool that Owen Taylor wrote some time ago.

QtGNOME platform
We will have a new version of the QtGNOME platform in Fedora 26. For those of you who have not yet heard of this effort, it is a set of themes and tools to ensure that Qt applications run without any major issues under GNOME 3. With the new version the theming expands to include the accessibility and dark themes in Adwaita, meaning that if you switch to one of these themes under GNOME Shell it will also switch your Qt applications over. We are also making sure things like cut’n paste and drag and drop work well. The version in Fedora Workstation 26 is a big step forward for this effort and should hopefully make Qt applications first-class citizens under your Fedora Workstation desktop.

Wayland polish
Ever since we switched the default to Wayland we have kept the pressure up and kept fixing bugs and finding solutions for corner cases. The result should be an improved Wayland experience in Fedora Workstation 26. A big thanks to Olivier Fourdan, Jonas Ådahl and the whole Wayland community for their continued efforts here. Jonas is working on two major items. The first is improving fractional scaling, to ensure that your desktop scales to an optimal size on HiDPI displays of various sizes. What we currently have is limited to 1x or 2x, which is either too small or too big for some screens, but with this work you can also do 1.5x scaling. He is also working on preparing an API that will allow screen sharing under Wayland, so that for instance sharing your slides over video conferencing can work under Wayland.

June 19, 2017

Introducing casync

In the past months I have been working on a new project: casync. casync takes inspiration from the popular rsync file synchronization tool as well as the probably even more popular git revision control system. It combines the idea of the rsync algorithm with the idea of git-style content-addressable file systems, and creates a new system for efficiently storing and delivering file system images, optimized for high-frequency update cycles over the Internet. Its current focus is on delivering IoT, container, VM, application, portable service or OS images, but I hope to extend it later in a generic fashion to become useful for backups and home directory synchronization as well (but more about that later).

The basic technological building blocks casync is built from are neither new nor particularly innovative (at least not anymore), however the way casync combines them is different from existing tools, and that's what makes it useful for a variety of use-cases that other tools can't cover that well.


I created casync after studying how today's popular tools store and deliver file system images. To briefly name a few: Docker has a layered tarball approach, OSTree serves the individual files directly via HTTP and maintains packed deltas to speed up updates, while other systems operate on the block layer and place raw squashfs images (or other archival file systems, such as ISO9660) for download on HTTP shares (in the better cases combined with zsync data).

None of these approaches appeared fully convincing to me when used in high-frequency update cycle systems. In such systems, it is important to optimize towards a couple of goals:

  1. Most importantly, make updates cheap traffic-wise (for this most tools use image deltas of some form)
  2. Put boundaries on disk space usage on servers (keeping deltas between all version combinations clients might want to run updates between, would suggest keeping an exponentially growing amount of deltas on servers)
  3. Put boundaries on disk space usage on clients
  4. Be friendly to Content Delivery Networks (CDNs), i.e. serve neither too many small nor too many overly large files, and only require the most basic form of HTTP. Provide the repository administrator with high-level knobs to tune the average file size delivered.
  5. Simplicity to use for users, repository administrators and developers

I don't think any of the tools mentioned above are really good on more than a small subset of these points.

Specifically: Docker's layered tarball approach dumps the "delta" question onto the feet of the image creators: the best way to make your image downloads minimal is basing your work on an existing image clients might already have, and inherit its resources, maintaining full history. Here, revision control (a tool for the developer) is intermingled with update management (a concept for optimizing production delivery). As container histories grow individual deltas are likely to stay small, but on the other hand a brand-new deployment usually requires downloading the full history onto the deployment system, even though there's no use for it there, and likely requires substantially more disk space and download sizes.

OSTree's serving of individual files is unfriendly to CDNs (as many small files in file trees cause an explosion of HTTP GET requests). To counter that OSTree supports placing pre-calculated delta images between selected revisions on the delivery servers, which means a certain amount of revision management, that leaks into the clients.

Delivering direct squashfs (or other file system) images is almost beautifully simple, but of course means every update requires a full download of the newest image, which is both bad for disk usage and generated traffic. Enhancing it with zsync makes this a much better option, as it can reduce generated traffic substantially at very little cost of history/meta-data (no explicit deltas between a large number of versions need to be prepared server side). On the other hand server requirements in disk space and functionality (HTTP Range requests) are minus points for the use-case I am interested in.

(Note: all the mentioned systems have great properties, and it's not my intention to badmouth them. The only point I am trying to make is that for the use case I care about, file system image delivery with high-frequency update cycles, each system comes with certain drawbacks.)

Security & Reproducibility

Besides the issues pointed out above I wasn't happy with the security and reproducibility properties of these systems. In today's world where security breaches involving hacking and breaking into connected systems happen every day, an image delivery system that cannot make strong guarantees regarding data integrity is out of date. Specifically, the tarball format is famously nondeterministic: the very same file tree can result in any number of different valid serializations depending on the tool used, its version and the underlying OS and file system. Some tar implementations attempt to correct that by guaranteeing that each file tree maps to exactly one valid serialization, but such a property is always only specific to the tool used. I strongly believe that any good update system must guarantee on every single link of the chain that there's only one valid representation of the data to deliver, that can easily be verified.

What casync Is

So much for the background on why I created casync. Now, let's have a look at what casync actually is, and what it does. Here's the brief technical overview:

Encoding: Let's take a large linear data stream, split it into variable-sized chunks (the size of each being a function of the chunk's contents), and store these chunks in individual, compressed files in some directory, each file named after a strong hash value of its contents, so that the hash value may be used as key for retrieving the full chunk data. Let's call this directory a "chunk store". At the same time, generate a "chunk index" file that lists these chunk hash values plus their respective chunk sizes in a simple linear array. The chunking algorithm is supposed to create variable, but similarly sized chunks from the data stream, and do so in a way that the same data results in the same chunks even if placed at varying offsets. For more information see this blog story.

Decoding: Let's take the chunk index file, and reassemble the large linear data stream by concatenating the uncompressed chunks retrieved from the chunk store, keyed by the listed chunk hash values.
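
The encoding and decoding steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory dict standing in for the chunk store directory, the toy rolling-sum chunker, and all function names are made up for this example and are not casync's actual implementation or file formats.

```python
import hashlib
import lzma


def split_chunks(data, window=16, mask=0x3F, min_size=32):
    """Toy content-defined chunker (an assumption, not casync's algorithm):
    cut wherever the sum of the last `window` bytes has its low bits all set."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # drop the byte leaving the window
        if i + 1 - start >= min_size and (rolling & mask) == mask:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def encode(data, store):
    """Fill `store` (digest -> compressed chunk) and return the chunk index,
    a linear list of (digest, uncompressed size) pairs."""
    index = []
    for chunk in split_chunks(data):
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:  # identical chunks are stored only once
            store[digest] = lzma.compress(chunk)
        index.append((digest, len(chunk)))
    return index


def decode(index, store):
    """Reassemble the stream by concatenating the chunks named in the index."""
    out = bytearray()
    for digest, size in index:
        chunk = lzma.decompress(store[digest])
        if len(chunk) != size or hashlib.sha256(chunk).hexdigest() != digest:
            raise ValueError("corrupt chunk " + digest)
        out += chunk
    return bytes(out)
```

Encoding the same stream a second time into the same store adds no new entries, which is the property that makes it cheap to keep many related versions side by side.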

As an extra twist, we introduce a well-defined, reproducible, random-access serialization format for file trees (think: a more modern tar), to permit efficient, stable storage of complete file trees in the system, simply by serializing them and then passing them into the encoding step explained above.
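
To illustrate what "reproducible" means here, consider the following minimal Python sketch of a deterministic serializer. It is explicitly not the casync serialization format: the nested-dict tree model, the record layout and the function name are assumptions chosen for brevity. The only point is that entries are always emitted in sorted order, so the same tree yields the same bytes no matter how it was built.

```python
import hashlib
import struct


def serialize(tree):
    """Serialize a nested dict (name -> bytes for files, name -> dict for
    directories) depth-first, with entries in sorted name order."""
    out = bytearray()
    for name in sorted(tree):
        entry = tree[name]
        encoded_name = name.encode("utf-8")
        if isinstance(entry, dict):
            kind, body = b"D", serialize(entry)  # directory: recurse
        else:
            kind, body = b"F", bytes(entry)      # regular file: raw contents
        # Fixed-width little-endian lengths keep the framing unambiguous.
        out += kind + struct.pack("<HQ", len(encoded_name), len(body))
        out += encoded_name + body
    return bytes(out)


if __name__ == "__main__":
    first = {"etc": {"hostname": b"box\n", "motd": b"hi\n"}, "bin": {"sh": b"\x7fELF"}}
    second = {"bin": {"sh": b"\x7fELF"}, "etc": {"motd": b"hi\n", "hostname": b"box\n"}}
    # Same tree, different construction order: the digests are identical.
    print(hashlib.sha256(serialize(first)).hexdigest())
    print(hashlib.sha256(serialize(second)).hexdigest())
```

Because each entry's record is self-contained, the serialization of a directory is simply the concatenation of the records of its children, which is what lets unchanged sub-trees map to unchanged byte runs.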

Finally, let's put all this on the network: for each image you want to deliver, generate a chunk index file and place it on an HTTP server. Do the same with the chunk store, and share it between the various index files you intend to deliver.

Why bother with all of this? Streams with similar contents will result in mostly the same chunk files in the chunk store. This means it is very efficient to store many related versions of a data stream in the same chunk store, thus minimizing disk usage. Moreover, when transferring linear data streams chunks already known on the receiving side can be made use of, thus minimizing network traffic.

Why is this different from rsync or OSTree, or similar tools? Well, one major difference between casync and those tools is that we remove file boundaries before chunking things up. This means that small files are lumped together with their siblings and large files are chopped into pieces, which permits us to recognize similarities in files and directories beyond file boundaries, and makes sure our chunk sizes are pretty evenly distributed, without the file boundaries affecting them.

The "chunking" algorithm is based on the buzhash rolling hash function. SHA256 is used as strong hash function to generate digests of the chunks. xz is used to compress the individual chunks.
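
For readers unfamiliar with rolling hashes, here is a hedged Python sketch of buzhash-style content-defined chunking. The window size, the randomly generated substitution table, the cut condition and the size limits are all assumptions made for this example; casync's real parameters and table differ, so the chunks produced here will not match casync's.

```python
import hashlib
import random

WINDOW = 48                      # assumed rolling window length in bytes
MASK32 = 0xFFFFFFFF
_rng = random.Random(1234)       # fixed seed so the table is reproducible
TABLE = [_rng.getrandbits(32) for _ in range(256)]  # hypothetical table


def rotl32(value, count):
    """Rotate a 32-bit value left by `count` bits."""
    count %= 32
    return ((value << count) | (value >> (32 - count))) & MASK32


def split(data, avg_size=1024, min_size=256, max_size=4096):
    """Cut `data` where the buzhash of the last WINDOW bytes hits a fixed
    pattern, so cut points follow the content rather than the offset."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = rotl32(h, 1) ^ TABLE[byte]                       # byte entering
        if i >= WINDOW:
            h ^= rotl32(TABLE[data[i - WINDOW]], WINDOW)     # byte leaving
        size = i + 1 - start
        if size >= max_size or (size >= min_size and h % avg_size == avg_size - 1):
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


if __name__ == "__main__":
    rng = random.Random(7)
    data = bytes(rng.getrandbits(8) for _ in range(32768))
    original = {hashlib.sha256(c).hexdigest() for c in split(data)}
    shifted = {hashlib.sha256(c).hexdigest() for c in split(b"inserted prefix" + data)}
    print(len(original & shifted), "of", len(original), "chunks survive the shift")
```

With fixed-size blocks, inserting a few bytes at the front would change every block; with content-defined cut points the chunker resynchronizes shortly after the insertion and most chunks keep their digests.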

Here's a diagram that would hopefully explain a bit how the encoding process works, were it not for my crappy drawing skills:


The diagram shows the encoding process from top to bottom. It starts with a block device or a file tree, which is then serialized and chunked up into variable sized blocks. The compressed chunks are then placed in the chunk store, while a chunk index file is written listing the chunk hashes in order. (The original SVG of this graphic may be found here.)


Note that casync operates on two different layers, depending on the use-case of the user:

  1. You may use it on the block layer. In this case the raw block data on disk is taken as-is, read directly from the block device, split into chunks as described above, compressed, stored and delivered.

  2. You may use it on the file system layer. In this case, the file tree serialization format mentioned above comes into play: the file tree is serialized depth-first (much like tar would do it) and then split into chunks, compressed, stored and delivered.

The fact that it may be used on both the block and file system layer opens it up for a variety of different use-cases. In the VM and IoT ecosystems shipping images as block-level serializations is more common, while in the container and application world file-system-level serializations are more typically used.

Chunk index files referring to block-layer serializations carry the .caibx suffix, while chunk index files referring to file system serializations carry the .caidx suffix. Note that you may also use casync as direct tar replacement, i.e. without the chunking, just generating the plain linear file tree serialization. Such files carry the .catar suffix. Internally .caibx are identical to .caidx files, the only difference is semantical: .caidx files describe a .catar file, while .caibx files may describe any other blob. Finally, chunk stores are directories carrying the .castr suffix.


Here are a couple of other features casync has:

  1. When downloading a new image you may use casync's --seed= feature: each block device, file, or directory specified is processed using the same chunking logic described above, and is used as preferred source when putting together the downloaded image locally, avoiding network transfer of it. This of course is useful whenever updating an image: simply specify one or more old versions as seed and only download the chunks that truly changed since then. Note that using seeds requires no history relationship between seed and the new image to download. This has major benefits: you can even use it to speed up downloads of relatively foreign and unrelated data. For example, when downloading a container image built using Ubuntu you can use your Fedora host OS tree in /usr as seed, and casync will automatically use whatever it can from that tree, for example timezone and locale data that tends to be identical between distributions. Example: casync extract --seed=/dev/sda1 /dev/sda2. This will place the block-layer image described by the indicated URL in the /dev/sda2 partition, using the existing /dev/sda1 data as seeding source. An invocation like this could be typically used by IoT systems with an A/B partition setup. Example 2: casync extract --seed=/srv/container-v1 --seed=/srv/container-v2 /src/container-v3, is very similar but operates on the file system layer, and uses two old container versions to seed the new version.

  2. When operating on the file system level, the user has fine-grained control on the meta-data included in the serialization. This is relevant since different use-cases tend to require a different set of saved/restored meta-data. For example, when shipping OS images, file access bits/ACLs and ownership matter, while file modification times hurt. When doing personal backups OTOH file ownership matters little but file modification times are important. Moreover different backing file systems support different feature sets, and storing more information than necessary might make it impossible to validate a tree against an image if the meta-data cannot be replayed in full. Due to this, casync provides a set of --with= and --without= parameters that allow fine-grained control of the data stored in the file tree serialization, including the granularity of modification times and more. The precise set of selected meta-data features is also always part of the serialization, so that seeding can work correctly and automatically.

  3. casync tries to be as accurate as possible when storing file system meta-data. This means that besides the usual baseline of file meta-data (file ownership and access bits), and more advanced features (extended attributes, ACLs, file capabilities) a number of more exotic data is stored as well, including Linux chattr(1) file attributes, as well as FAT file attributes (you may wonder why the latter? — EFI is FAT, and /efi is part of the comprehensive serialization of any host). In the future I intend to extend this further, for example storing btrfs sub-volume information where available. Note that as described above every single type of meta-data may be turned off and on individually, hence if you don't need FAT file bits (and I figure it's pretty likely you don't), then they won't be stored.

  4. The user creating .caidx or .caibx files may control the desired average chunk length (before compression) freely, using the --chunk-size= parameter. Smaller chunks increase the number of generated files in the chunk store and increase HTTP GET load on the server, but also ensure that sharing between similar images is improved, as identical patterns in the images stored are more likely to be recognized. By default casync will use a 64K average chunk size. Tweaking this can be particularly useful when adapting the system to specific CDNs, or when delivering compressed disk images such as squashfs (see below).

  5. Emphasis is placed on making all invocations reproducible, well-defined and strictly deterministic. As mentioned above this is a requirement to reach the intended security guarantees, but is also useful for many other use-cases. For example, the casync digest command may be used to calculate a hash value identifying a specific directory in all desired detail (use --with= and --without= to pick the desired detail). Moreover the casync mtree command may be used to generate a BSD mtree(5) compatible manifest of a directory tree, .caidx or .catar file.

  6. The file system serialization format is nicely composable. By this I mean that the serialization of a file tree is the concatenation of the serializations of all files and file sub-trees located at the top of the tree, with zero meta-data references from any of these serializations into the others. This property is essential to ensure maximum reuse of chunks when similar trees are serialized.

  7. When extracting file trees or disk image files, casync will automatically create reflinks from any specified seeds if the underlying file system supports it (such as btrfs, ocfs, and future xfs). After all, instead of copying the desired data from the seed, we can just tell the file system to link up the relevant blocks. This works both when extracting .caidx and .caibx files — the latter of course only when the extracted disk image is placed in a regular raw image file on disk, rather than directly on a plain block device, as plain block devices do not know the concept of reflinks.

  8. Optionally, when extracting file trees, casync can create traditional UNIX hard-links for identical files in specified seeds (--hardlink=yes). This works on all UNIX file systems, and can save substantial amounts of disk space. However, this only works for very specific use-cases where disk images are considered read-only after extraction, as any changes made to one tree will propagate to all other trees sharing the same hard-linked files, as that's the nature of hard-links. In this mode, casync exposes OSTree-like behavior, which is built heavily around read-only hard-link trees.

  9. casync tries to be smart when choosing what to include in file system images. Implicitly, file systems such as procfs and sysfs are excluded from serialization, as they expose API objects, not real files. Moreover, the "nodump" (+d) chattr(1) flag is honored by default, permitting users to mark files to exclude from serialization.

  10. When creating and extracting file trees casync may apply an automatic or explicit UID/GID shift. This is particularly useful when transferring container images for use with Linux user name-spacing.

  11. In addition to local operation, casync currently supports HTTP, HTTPS, FTP and ssh natively for downloading chunk index files and chunks (the ssh mode requires installing casync on the remote host, though, but an sftp mode not requiring that should be easy to add). When creating index files or chunks, only ssh is supported as remote back-end.

  12. When operating on block-layer images, you may expose locally or remotely stored images as local block devices. Example: casync mkdev exposes the disk image described by the indicated URL as local block device in /dev, which you then may use the usual block device tools on, such as mount or fdisk (only read-only though). Chunks are downloaded on access with high priority, and at low priority when idle in the background. Note that in this mode, casync also plays a role similar to "dm-verity", as all blocks are validated against the strong digests in the chunk index file before passing them on to the kernel's block layer. This feature is implemented through Linux' NBD kernel facility.

  13. Similarly, when operating on file-system-layer images, you may mount locally or remotely stored images as regular file systems. Example: casync mount /srv/mytree mounts the file tree image described by the indicated URL as a local directory /srv/mytree. This feature is implemented through Linux' FUSE kernel facility. Note that special care is taken that the images exposed this way can be packed up again with casync make and are guaranteed to return the bit-by-bit exact same serialization again that it was mounted from. No data is lost or changed while passing things through FUSE (OK, strictly speaking this is a lie, we do lose ACLs, but that's hopefully just a temporary gap to be fixed soon).

  14. In IoT A/B fixed size partition setups the file systems placed in the two partitions are usually much shorter than the partition size, in order to keep some room for later, larger updates. casync is able to analyze the super-block of a number of common file systems in order to determine the actual size of a file system stored on a block device, so that writing a file system to such a partition and reading it back again will result in reproducible data. Moreover this speeds up the seeding process, as there's little point in seeding the white-space after the file system within the partition.
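
The seeding idea from item 1 of the list above can be sketched roughly as follows. This is a simplified Python model under stated assumptions: the seeds are taken as already split into chunks by the same chunking logic, the index is reduced to a list of SHA256 digests, and `assemble` and `download` are hypothetical names, not part of casync.

```python
import hashlib


def digest(chunk):
    """Strong hash naming a chunk (SHA256, as in the description above)."""
    return hashlib.sha256(chunk).hexdigest()


def assemble(index, seed_chunks, download):
    """Rebuild the stream described by `index` (a list of chunk digests),
    preferring chunks found in the seeds over calling `download(digest)`."""
    local = {digest(chunk): chunk for chunk in seed_chunks}
    out = bytearray()
    fetched = []
    for wanted in index:
        chunk = local.get(wanted)
        if chunk is None:
            chunk = download(wanted)          # only now touch the network
            if digest(chunk) != wanted:       # never trust unverified data
                raise ValueError("digest mismatch for " + wanted)
            local[wanted] = chunk
            fetched.append(wanted)
        out += chunk
    return bytes(out), fetched


if __name__ == "__main__":
    old_image = [b"kernel", b"libc-2.25", b"tzdata"]   # chunks of the seed
    new_image = [b"kernel", b"libc-2.26", b"tzdata"]   # chunks of the update
    remote = {digest(c): c for c in new_image}         # stand-in chunk store
    index = [digest(c) for c in new_image]
    image, fetched = assemble(index, old_image, remote.__getitem__)
    print(len(fetched), "chunk(s) downloaded out of", len(index))
```

Note that nothing in this lookup depends on the seed and the new image sharing any history: a chunk is reused whenever its digest matches, wherever it came from.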

Example Command Lines

Here's how to use casync, explained with a few examples:

$ casync make foobar.caidx /some/directory

This will create a chunk index file foobar.caidx in the local directory, and populate the chunk store directory default.castr located next to it with the chunks of the serialization (you can change the name for the store directory with --store= if you like). This command operates on the file-system level. A similar command operating on the block level:

$ casync make foobar.caibx /dev/sda1

This command creates a chunk index file foobar.caibx in the local directory describing the current contents of the /dev/sda1 block device, and populates default.castr in the same way as above. Note that you may as well read a raw disk image from a file instead of a block device:

$ casync make foobar.caibx myimage.raw

To reconstruct the original file tree from the .caidx file and the chunk store of the first command, use:

$ casync extract foobar.caidx /some/other/directory

And similarly for the block-layer version:

$ casync extract foobar.caibx /dev/sdb1

or, to extract the block-layer version into a raw disk image:

$ casync extract foobar.caibx myotherimage.raw

The above are the most basic commands, operating on local data only. Now let's make this more interesting, and reference remote resources:

$ casync extract /some/other/directory

This extracts the specified .caidx onto a local directory. This of course assumes that foobar.caidx was uploaded to the HTTP server in the first place, along with the chunk store. You can use any command you like to accomplish that, for example scp or rsync. Alternatively, you can let casync do this directly when generating the chunk index:

$ casync make /some/directory

This will use ssh to connect to the server, and then places the .caidx file and the chunks on it. Note that this mode of operation is "smart": this scheme will only upload chunks currently missing on the server side, and not re-transmit what already is available.

Note that you can always configure the precise path or URL of the chunk store via the --store= option. If you do not do that, then the store path is automatically derived from the path or URL: the last component of the path or URL is replaced by default.castr.
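
As a rough illustration of this derivation rule, the following Python sketch replaces the last component of a local path or an HTTP-style URL with default.castr. The function name is hypothetical, and how casync itself parses other location syntaxes (for example ssh targets) is not covered here.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def default_store(index_location, store_name="default.castr"):
    """Derive the chunk store location from an index path or URL by
    replacing its last path component with `store_name`."""
    parts = urlsplit(index_location)
    if parts.scheme and parts.netloc:
        # URL: keep scheme and host, swap the last path component.
        store_path = posixpath.join(posixpath.dirname(parts.path), store_name)
        return urlunsplit((parts.scheme, parts.netloc, store_path, "", ""))
    # Plain local path.
    return posixpath.join(posixpath.dirname(index_location), store_name)


if __name__ == "__main__":
    print(default_store("http://example.com/images/foobar.caidx"))
    print(default_store("/srv/images/foobar.caidx"))
    print(default_store("foobar.caidx"))
```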

Of course, when extracting .caidx or .caibx files from remote sources, using a local seed is advisable:

$ casync extract --seed=/some/existing/directory /some/other/directory

Or on the block layer:

$ casync extract --seed=/dev/sda1 /dev/sdb2

When creating chunk indexes on the file system layer casync will by default store meta-data as accurately as possible. Let's create a chunk index with reduced meta-data:

$ casync make foobar.caidx --with=sec-time --with=symlinks --with=read-only /some/dir

This command will create a chunk index for a file tree serialization that has three features above the absolute baseline supported: 1s granularity time-stamps, symbolic links and a single read-only bit. In this mode, all the other meta-data bits are not stored, including nanosecond time-stamps, full UNIX permission bits, file ownership or even ACLs or extended attributes.

Now let's make a .caidx file available locally as a mounted file system, without extracting it:

$ casync mount http://example.com/images/foobar.caidx /mnt/foobar

And similarly, let's make a .caibx file available locally as a block device:

$ casync mkdev http://example.com/images/foobar.caibx

This will create a block device in /dev and print the used device node path to STDOUT.

As mentioned, casync is big about reproducibility. Let's make use of that to calculate a digest identifying a very specific version of a file tree:

$ casync digest .

This digest will include all meta-data bits casync and the underlying file system know about. Usually, to make this useful you want to configure exactly what meta-data to include:

$ casync digest --with=unix .

This makes use of the --with=unix shortcut for selecting meta-data fields. Specifying --with=unix selects all meta-data that traditional UNIX file systems support. It is a shortcut for writing out: --with=16bit-uids --with=permissions --with=sec-time --with=symlinks --with=device-nodes --with=fifos --with=sockets.
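
The way such a shortcut expands can be modelled with a small Python sketch. The feature names are the ones listed above; the function and the table are hypothetical and say nothing about how casync parses its command line internally.

```python
# Shortcut name -> the individual features it stands for (from the text above).
SHORTCUTS = {
    "unix": (
        "16bit-uids", "permissions", "sec-time", "symlinks",
        "device-nodes", "fifos", "sockets",
    ),
}


def expand_with(args):
    """Turn a list of --with=NAME arguments into an ordered list of
    individual feature names, expanding known shortcuts."""
    selected = []
    for arg in args:
        option, _, value = arg.partition("=")
        if option != "--with" or not value:
            raise ValueError("expected --with=NAME, got " + repr(arg))
        for feature in SHORTCUTS.get(value, (value,)):
            if feature not in selected:   # keep each feature once
                selected.append(feature)
    return selected


if __name__ == "__main__":
    print(expand_with(["--with=unix"]))
    print(expand_with(["--with=sec-time", "--with=symlinks", "--with=read-only"]))
```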

Note that when calculating digests or creating chunk indexes you may also use the negative --without= option to remove specific features but start from the most precise:

$ casync digest --without=flag-immutable

This generates a digest with the most accurate meta-data, but leaves one feature out: chattr(1)'s immutable (+i) file flag.

To list the contents of a .caidx file use a command like the following:

$ casync list


$ casync mtree

The former command will generate a brief list of files and directories, not too different from tar t or ls -al in its output. The latter command will generate a BSD mtree(5) compatible manifest. Note that casync actually stores substantially more file meta-data than mtree files can express, though.

What casync isn't

  1. casync is not an attempt to minimize serialization and downloaded deltas to the extreme. Instead, the tool is supposed to find a good middle ground, that is good on traffic and disk space, but not at the price of convenience or requiring explicit revision control. If you care about updates that are absolutely minimal, there are binary delta systems around that might be an option for you, such as Google's Courgette.

  2. casync is not a replacement for rsync, or git or zsync or anything like that. They have very different use-cases and semantics. For example, rsync permits you to directly synchronize two file trees remotely. casync just cannot do that, and it is unlikely it ever will.

Where next?

casync is supposed to be a generic synchronization tool. Its primary focus for now is delivery of OS images, but I'd like to make it useful for a couple other use-cases, too. Specifically:

  1. To make the tool useful for backups, encryption is missing. I have pretty concrete plans how to add that. When implemented, the tool might become an alternative to restic, BorgBackup or tarsnap.

  2. Right now, if you want to deploy casync in real-life, you still need to validate the downloaded .caidx or .caibx file yourself, for example with some gpg signature. It is my intention to integrate with gpg in a minimal way so that signing and verifying chunk index files is done automatically.

  3. In the longer run, I'd like to build an automatic synchronizer for $HOME between systems from this. Each $HOME instance would be stored automatically in regular intervals in the cloud using casync, and conflicts would be resolved locally.

  4. casync is written in a shared library style, but it is not yet built as one. Specifically this means that almost all of casync's functionality is supposed to be available as C API soon, and applications can process casync files on every level. It is my intention to make this library useful enough so that it will be easy to write a module for GNOME's gvfs subsystem in order to make remote or local .caidx files directly available to applications (as an alternative to casync mount). In fact the idea is to make this all flexible enough that even the remoting back-ends can be replaced easily, for example to replace casync's default HTTP/HTTPS back-ends built on CURL with GNOME's own HTTP implementation, in order to share cookies, certificates, … There's also an alternative method to integrate with casync in place already: simply invoke casync as a sub-process. casync will inform you about a certain set of state changes using a mechanism compatible with sd_notify(3). In future it will also propagate progress data this way and more.

  5. I intend to add a new seeding back-end that sources chunks from the local network. After downloading the new .caidx file off the Internet casync would then search for the listed chunks on the local network first before retrieving them from the Internet. This should speed things up on all installations that have multiple similar systems deployed in the same network.

Further plans are listed tersely in the TODO file.
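
For the sub-process integration mentioned in item 4, the parent process has to offer a notification socket. The sketch below shows only the general sd_notify(3) convention: a Unix datagram socket whose path is exported as NOTIFY_SOCKET, carrying newline-separated KEY=VALUE pairs. Which keys casync actually sends is not stated here, so READY=1 and STATUS= are placeholders taken from the sd_notify convention, and parse_notify() and open_notify_socket() are hypothetical helper names.

```python
import os
import socket
import tempfile


def parse_notify(datagram: bytes) -> dict:
    """Split one sd_notify-style datagram into a {KEY: VALUE} dict."""
    fields = {}
    for line in datagram.decode("utf-8", errors="replace").splitlines():
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that carry no '='
            fields[key] = value
    return fields


def open_notify_socket(directory: str):
    """Bind a datagram socket and return (socket, path) for NOTIFY_SOCKET."""
    path = os.path.join(directory, "notify.sock")
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    sock.bind(path)
    return sock, path


def demo() -> dict:
    """Round-trip one message the way a child process would send it."""
    with tempfile.TemporaryDirectory() as tmp:
        server, path = open_notify_socket(tmp)
        # A real parent would export NOTIFY_SOCKET=path in the environment
        # of the child it spawns; here we play the child ourselves.
        client = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
        client.sendto(b"READY=1\nSTATUS=placeholder status text", path)
        client.close()
        message = server.recv(4096)
        server.close()
        return parse_notify(message)
```

A parent would keep calling recv() on the socket while the child runs and react to whatever keys arrive.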


  1. Is this a systemd project? – casync is hosted under the github systemd umbrella, and the projects share the same coding style. However, the code-bases are distinct and without interdependencies, and casync works fine both on systemd systems and systems without it.

  2. Is casync portable? — At the moment: no. I only run Linux and that's what I code for. That said, I am open to accepting portability patches (unlike for systemd, which doesn't really make sense on non-Linux systems), as long as they don't interfere too much with the way casync works. Specifically this means that I am not too enthusiastic about merging portability patches for OSes lacking the openat(2) family of APIs.

  3. Does casync require reflink-capable file systems to work, such as btrfs? — No it doesn't. The reflink magic in casync is employed when the file system permits it, and it's good to have it, but it's not a requirement, and casync will implicitly fall back to copying when it isn't available. Note that casync supports a number of file system features on a variety of file systems that aren't available everywhere, for example FAT's system/hidden file flags or xfs's projinherit file flag.

  4. Is casync stable? – I just tagged the first, initial release. While I have been working on it for quite some time and it is quite featureful, this is the first time I advertise it publicly, and it has hence received very little testing outside of its own test suite. I am also not fully ready to commit to the stability of the current serialization or chunk index format. I don't see any breakages coming for it though. casync is pretty light on documentation right now, and does not even have a man page. I also intend to correct that soon.

  5. Are the .caidx/.caibx and .catar file formats open and documented? – casync is Open Source, so if you want to know the precise format, have a look at the sources for now. It's definitely my intention to add comprehensive docs for both formats however. Don't forget this is just the initial version right now.

  6. casync is just like $SOMEOTHERTOOL! Why are you reinventing the wheel (again)? — Well, because casync isn't "just like" some other tool. I am pretty sure I did my homework, and that there is no tool just like casync right now. The tools coming closest are probably rsync, zsync, tarsnap, restic, but they are quite different beasts each.

  7. Why did you invent your own serialization format for file trees? Why don't you just use tar? — That's a good question, and other systems — most prominently tarsnap — do that. However, as mentioned above tar doesn't enforce reproducibility. It also doesn't really do random access: if you want to access some specific file you need to read every single byte stored before it in the tar archive to find it, which is of course very expensive. The serialization casync implements places a focus on reproducibility, random access, and meta-data control. Much like traditional tar it can still be generated and extracted in a stream fashion though.

  8. Does casync save/restore SELinux/SMACK file labels? — At the moment not. That's not because I wouldn't want it to, but simply because I am not a guru of either of these systems, and didn't want to implement something I do not fully grok nor can test. If you look at the sources you'll find that there's already some definitions in place that keep room for them though. I'd be delighted to accept a patch implementing this fully.

  9. What about delivering squashfs images? How well does chunking work on compressed serializations? – That's a very good point! Usually, if you apply a chunking algorithm to a compressed data stream (let's say a tar.gz file), then changing a single bit at the front will propagate into the entire remainder of the file, so that minimal changes will explode into major changes. Thankfully this doesn't apply that strictly to squashfs images, as it provides random access to files and directories and thus breaks up the compression streams in regular intervals to make seeking easy. This fact is beneficial for systems employing chunking, such as casync, as this means single bit changes might affect their vicinity but will not explode in an unbounded fashion. In order to achieve best results when delivering squashfs images through casync the block sizes of squashfs and the chunk sizes of casync should be matched up (using casync's --chunk-size= option). How precisely to choose both values is left as a research subject for the user, for now.

  10. What does the name casync mean? – It's a synchronizing tool, hence the -sync suffix, following rsync's naming. It makes use of the content-addressable concept of git hence the ca- prefix.

  11. Where can I get this stuff? Is it already packaged? – Check out the sources on GitHub. I just tagged the first version. Martin Pitt has packaged casync for Ubuntu. There is also an ArchLinux package. Zbigniew Jędrzejewski-Szmek has prepared a Fedora RPM that hopefully will soon be included in the distribution.
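
To make the chunking and content-addressing ideas from questions 9 and 10 concrete, here is a toy sketch. It is not casync's actual algorithm or on-disk format: the window size, the CRC-based boundary test and the use of SHA-256 as chunk address are all assumptions chosen for brevity. It cuts a byte stream wherever a checksum over the last few bytes hits a fixed pattern, so the cut points depend on content rather than on offsets, and then compares the chunk addresses before and after inserting one byte at the very front of uncompressed data.

```python
import hashlib
import random
import zlib

WINDOW = 16   # bytes the boundary test looks at (assumed value)
MASK = 0xFF   # cut when the low 8 checksum bits are zero, ~256-byte chunks


def chunk(data: bytes) -> list:
    """Cut data wherever the checksum of the preceding WINDOW bytes matches."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data)):
        if zlib.crc32(data[end - WINDOW:end]) & MASK == 0:
            chunks.append(data[start:end])
            start = end
    chunks.append(data[start:])
    return chunks


def digests(data: bytes) -> set:
    """Content addresses: one SHA-256 hex digest per chunk."""
    return {hashlib.sha256(c).hexdigest() for c in chunk(data)}


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20000))
modified = b"X" + original  # one byte inserted at the very front

old, new = digests(original), digests(modified)
print(len(old), "chunks before;", len(new - old), "chunks need to be fetched after the change")
```

Because the cut points follow the content, only the chunk or two around the inserted byte get new addresses; everything after the first unchanged boundary is reused. A gzip stream of the same data would not behave this way, which is the point the answer to question 9 makes.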

Should you care? Is this a tool for you?

Well, that's up to you really. If you are involved with projects that need to deliver IoT, VM, container, application or OS images, then maybe this is a great tool for you — but other options exist, some of which are linked above.

Note that casync is an Open Source project: if it doesn't do exactly what you need, prepare a patch that adds what you need, and we'll consider it.

If you are interested in the project and would like to talk about this in person, I'll be presenting casync soon at Kinvolk's Linux Technologies Meetup in Berlin, Germany. You are invited. I also intend to talk about it at All Systems Go!, also in Berlin.

The All Systems Go! 2017 Call for Participation is Now Open!

We’d like to invite presentation proposals for All Systems Go! 2017!

All Systems Go! is an Open Source community conference focused on the projects and technologies at the foundation of modern Linux systems — specifically low-level user-space technologies. Its goal is to provide a friendly and collaborative gathering place for individuals and communities working to push these technologies forward.

All Systems Go! 2017 takes place in Berlin, Germany on October 21st+22nd.

All Systems Go! is a 2-day event with 2-3 talks happening in parallel. Full presentation slots are 30-45 minutes in length and lightning talk slots are 5-10 minutes.

We are now accepting submissions for presentation proposals. In particular, we are looking for sessions including, but not limited to, the following topics:

  • Low-level container executors and infrastructure
  • IoT and embedded OS infrastructure
  • OS, container, IoT image delivery and updating
  • Building Linux devices and applications
  • Low-level desktop technologies
  • Networking
  • System and service management
  • Tracing and performance measuring
  • IPC and RPC systems
  • Security and Sandboxing

While our focus is definitely more on the user-space side of things, talks about kernel projects are welcome too, as long as they have a clear and direct relevance for user-space.

Please submit your proposals by September 3rd. Notification of acceptance will be sent out 1-2 weeks later.

To submit your proposal now please visit our CFP submission web site.

For further information about All Systems Go! visit our conference web site.

systemd.conf will not take place this year; All Systems Go! is held in its stead. All Systems Go! welcomes all projects that contribute to Linux user space, which, of course, includes systemd. Thus, anything you think was appropriate for submission to systemd.conf is also fitting for All Systems Go!

I just released the first release candidate for libinput 1.8. Aside from the build system switch to meson one of the more visible things is that the helper tools have switched from a "libinput-some-tool" to the "libinput some-tool" approach (note the space). This is similar to what git does so it won't take a lot of adjustment for most developers. The actual tools are now hiding in /usr/libexec/libinput. This gives us a lot more flexibility in writing testing and debugging tools and shipping them to the users without cluttering up the default PATH.
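
The git-style dispatch described above can be sketched in a few lines. This is an illustration in Python, not libinput's implementation; the assumption that each helper is installed as libinput-<tool> inside the libexec directory is mine, and resolve_tool() and run() are hypothetical names.

```python
import os
import posixpath


def resolve_tool(args, libexec_dir="/usr/libexec/libinput"):
    """Map ["some-tool", ...] to the helper binary path plus its arguments."""
    if not args or args[0].startswith("-"):
        raise ValueError("expected a tool name as first argument")
    tool, rest = args[0], list(args[1:])
    # Assumed naming scheme: helpers are installed as libinput-<tool>.
    return [posixpath.join(libexec_dir, "libinput-" + tool)] + rest


def run(args):
    """Replace the current process with the helper (does not return on success)."""
    argv = resolve_tool(args)
    os.execv(argv[0], argv)
```

With this scheme a new debugging tool only needs to be dropped into the libexec directory; nothing new appears in the default PATH.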

There are two potential breakages here. One is that the two existing tools libinput-debug-events and libinput-list-devices have been renamed too. We currently ship compatibility wrappers for those but expect those wrappers to go away with future releases. The second breakage is of lesser impact: typing "man libinput" used to bring up the man page for the xf86-input-libinput driver. Now it brings up the man page for the libinput tool, which then points to the man pages of the various features. That's probably a good thing, as it puts the documentation a bit closer to the user. For the driver, you now have to type "man 4 libinput" though.

Last week I got review on the new interface for tiling for vc4, and landed it in the kernel. I’ve submitted Mesa patches to the list, landed the kernel backport in Raspbian, and have sent Simon (the package maintainer at the Foundation) the Mesa patchset to backport all of my work from the last 6 months (unfortunately Raspbian generally doesn’t update Mesa past the Debian stable release, so graphics work for them has long delays). I also worked with Dom at the Foundation to get an interface for the firmwarekms mode to do tiled scanout, so it will be as fast as the open source driver.

I also reworked the official 7” touchscreen panel driver again. I had initially written it as a DSI panel driver, then got told it should be a bridge driver plus a panel driver, then the bridge maintainers told me it should be only one panel driver but attach to I2C. Maybe this version will stick, who knows.

My Mesa side for pl111 is now in master, so I’ve got the second platform for the vc4 driver basically done. I can’t wait to do another one.

I got a question about whether VC4 could do EGL_ANDROID_native_fence, to allow using Android’s hwcomposer (the display server that puts surfaces in KMS planes if it can). I spent a while looking into it, and found that while code had landed in Mesa for other drivers, the piglit tests had been left around untouched for the last half a year. I reviewed them and hopefully they can land soon. Unfortunately, they don’t cover the important functionality of the extension, so I can’t use them to actually write the driver code (which doesn’t look very hard) until I build some more tests.

On the non-vc4 kernel side, I prepared and submitted pull requests for 4.13. We’ve got USB OTG support for the Pi Zero, SDHOST enabled by default finally, thermal enabled by default finally, and the RPi3 DT being installed for 32-bit kernels. Three of these took months longer than they should have, because the Linux kernel has the worst software development process I’ve worked with since XFree86.

June 16, 2017

This is going to be a long one, so grab a drink and a snack and buckle up! Incidentally, the X.Org foundation has asked us, the 2017 GSoC students, to blog weekly so from now on I will do so; which will also mean smaller blogs in the future. I made a schedule to go with my proposal in which I divided the coding period into two week sprints to plan out my project.

There is a saying that persistence is the key to success. Not that I’m always following this advice, but I did luckily earlier this year when I had to decide if I wanted to apply again for a Google Summer of Code (GSoC) spot after my application in the last year was rejected by my organisation of choice back then. I talked about it in one of my last posts. Anyway, thanks to being persistent this year one of my project ideas got accepted and I now have the opportunity to work on a very interesting project concerning Xwayland, this time for the X.Org Foundation.

The project is kind of difficult though. At least it feels this way to me right now, after I’ve already spent quite some time on digging into it and getting a feel for it.

The main reason for the difficulty is that various components of the XServer are in play and documentation for them is mostly missing. If I were on my own, I could basically only work with the code and try to comprehend the steps one after the other. I’m not on my own though! With Daniel Stone I have one of the core developers of several key parts of the Linux graphics stack as my mentor, who seems to really want to help me understand the difficult stuff on a large scale and still gives concrete advice on what to do next. He even drew a diagram for me!

In future blog posts I plan to publish resources like this diagram, which would otherwise go underused, since material that helps in understanding the XServer code is hard to find on the web. Additionally I’ll try to explain the idea of my project and the general structure of the code I’m dealing with in my own words. A nasty surprise this week was learning that I’m expected to publish one blog post per week about my project. This somewhat spoils my plans for this blog, where I wanted to publish few but more substantial posts. Nevertheless, you can expect weekly posts until August. Next week I want to explain the basic idea of my project.

June 15, 2017

For the last couple of months I've been working on improving Linux support for Intel Bay and Cherry Trail based tablets and laptops as a spare-time project.

I'm happy to report that quite a bit of work on this has landed for 4.12, specifically the following fixes / improvements have landed:

  • LCD panel not working with the i915 driver on some devices

  • Battery monitoring not working on most Bay and Cherry devices (battery monitoring did not work on any device using the very popular AXP288 PMIC)

  • Backlight control not working on many devices (you need to build the pwm-lpss drivers into the kernel, or add them to your initrd, for this fix to work)

  • Volume and power buttons not working on Cherry Trail devices shipping with Win10

  • The microsd slot on the Asus Transformer T100TA not working

  • Preliminary (incomplete) support for Cherry Trail devices with a Whiskey Cove PMIC; some bits to get e.g. battery monitoring working are still missing

  • Added the rtl8723bs driver to staging, enabling ootb support for wifi on many devices

I've also prepared a patch for the Fedora kernel config to enable the necessary new drivers so this should all also work with the upcoming 4.12 kernel for Fedora.

June 09, 2017

For Fedora 27 I'm working on making Fedora a fully integrated VirtualBox guest. The plan is to have Fedora 27 Workstation ship with the VirtualBox Guest Additions installed, so that cut and paste, file sharing, etc. all work out of the box for users running Fedora 27 in a VirtualBox VM.

As a first step towards this I'm working on getting the VirtualBox guest kernel drivers merged into the mainline Linux kernel. Up until now this has never been done because of userspace ABI stability concerns surrounding the guest drivers.

I've been talking to VirtualBox upstream about mainlining the guest drivers and VirtualBox upstream has agreed to consider the userspace ABI stable and only extend it in a backwards compatible manner.

I'm about to finish cleaning up the vboxvideo drm/kms driver. When I started working on this, the files under /usr/src/vboxguest/vboxvideo as installed by VirtualBox 5.1.18 Guest Additions had a total line count of 52681 lines. I've been submitting cleanups to VirtualBox upstream and the current VirtualBox upstream svn code is now under 8000 lines.

Update: I've submitted the vboxvideo driver for inclusion into drivers/staging, see this mailing list thread.

June 03, 2017

This will be a quick blog post on the previous two weeks. During this time, the community bonding period ended and the coding phase officially began! I spent most of the first week working on university projects, so that those are as done as they can be at this point and I have little else to work on besides Piper. In the second week, I thought I would get a few days' head start on my schedule by starting early on the mockups.

June 02, 2017

Back in 2003, the pressure range value for Wacom pen tablets was set to 2048. That's a #define in the driver, but it shouldn't matter because we also advertise this range as part of the device description in the X Input protocol. Clients should be using that advertised min/max range and scale appropriately, in the same way as they should be doing this for the x/y axis ranges.

Fast-forward to 2017 and we changed the pressure range. New Wacom devices now use ~8000 levels, but we opted to just #define it in the driver to 65536 and be done with it. We now scale all models into that range, with varying granularity based on the physical hardware. It shouldn't matter because it's not tied to a reliable physical property anyway and the only thing that matters is the percentage of the max value (which is why libinput just gives you a [0, 1] range. Hindsight is a bliss.).
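
A small sketch of the difference between scaling by the advertised range and hardcoding the old one. The function names are hypothetical and the raw value 4096 is just an illustrative light touch; the ranges 0..2048 and 0..65536 are the ones mentioned above.

```python
def normalize(value, minimum, maximum):
    """Scale a raw axis value into [0.0, 1.0] using the advertised range."""
    if maximum <= minimum:
        raise ValueError("invalid axis range")
    clamped = min(max(value, minimum), maximum)
    return (clamped - minimum) / (maximum - minimum)


def hardcoded_2048(value):
    """What a client effectively does when it assumes the old 0..2048 range."""
    return min(value, 2048) / 2048


# A light touch reported by a driver that advertises 0..65536:
light_touch = 4096
print(normalize(light_touch, 0, 65536))  # uses the advertised range
print(hardcoded_2048(light_touch))       # saturates at full pressure
```

A client that reads the min/max from the device description keeps working whatever range the driver picks; the second function shows why a client with a baked-in maximum sees almost every stroke as full pressure.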

Slow-forward to 2017-and-a-bit and we received complaints that pressure handling is now broken. Turns out that some applications hardcoded the old 2048 range and now get confused, because virtually any pressure will now hit that maximum. Since those applications are largely proprietary ones and cannot be updated easily, we needed a workaround to this. Jason Gerecke from Wacom got busy and we now have a "Pressure2K" option available in the driver. If set, this option will scale everything into the 2048 range to make sure those applications still work. To get this to work, the following xorg.conf.d snippet is recommended:

Section "InputClass"
    Identifier "Wacom pressure compatibility"
    MatchDriver "wacom"
    Option "Pressure2K" "true"
EndSection

Put it in a file that sorts higher than the wacom driver itself (e.g. /etc/X11/xorg.conf.d/99-wacom-pressure2k.conf) and restart X. Check the Xorg.log/journal for a "Using 2K pressure levels" message, then verify it works by running xinput list "device name". xinput should show a range of 0 to 2048 on the third valuator.

$> xinput list "Wacom Intuos4 6x9 Pen stylus"
Wacom Intuos4 6x9 Pen stylus id=25 [slave pointer (2)]
Reporting 8 classes:
Class originated from: 25. Type: XIButtonClass
Buttons supported: 9
Button labels: None None None None None None None None None
Button state:
Class originated from: 25. Type: XIKeyClass
Keycodes supported: 248
Class originated from: 25. Type: XIValuatorClass
Detail for Valuator 0:
Label: Abs X
Range: 0.000000 - 44704.000000
Resolution: 200000 units/m
Mode: absolute
Current value: 22340.000000
Class originated from: 25. Type: XIValuatorClass
Detail for Valuator 1:
Label: Abs Y
Range: 0.000000 - 27940.000000
Resolution: 200000 units/m
Mode: absolute
Current value: 13970.000000
Class originated from: 25. Type: XIValuatorClass
Detail for Valuator 2:
Label: Abs Pressure
Range: 0.000000 - 2048.000000
Resolution: 1 units/m
Mode: absolute
Current value: 0.000000

Class originated from: 25. Type: XIValuatorClass
Detail for Valuator 3:
Label: Abs Tilt X
Range: -64.000000 - 63.000000
Resolution: 57 units/m
Mode: absolute
Current value: 0.000000
Class originated from: 25. Type: XIValuatorClass
Detail for Valuator 4:
Label: Abs Tilt Y
Range: -64.000000 - 63.000000
Resolution: 57 units/m
Mode: absolute
Current value: 0.000000
Class originated from: 25. Type: XIValuatorClass
Detail for Valuator 5:
Label: Abs Wheel
Range: -900.000000 - 899.000000
Resolution: 1 units/m
Mode: absolute
Current value: 0.000000

This is an application bug, but this workaround will make sure new versions of the driver can be used until those applications have been fixed. The option will be available in the soon-to-be-released xf86-input-wacom 0.35. The upstream commit is d958ab79d21b57141415650daac88f9369a1c861.

Edit 02/06/2017: wacom devices have 8k pressure levels now.

It’s been a busy month. Part of that is that I took a week off to go to our Burning Man regional event, which threw a serious wrench into my status updates.

Hans Verkuil now has HDMI CEC working, and is polishing the patches up for inclusion in the kernel. Unfortunately we set his work back a bit with Boris’s new HDMI power management code, and that’s getting resolved now. He also tripped over a bug in my new dma-buf reservation code that I helped track down.

Boris has the transposer working and has reviewed the core writeback code, so hopefully we can land it soon. I’m really excited to use this for testing tiled display support (see below).

My work for the last couple of weeks has been in figuring out what to do about the fact that the new Raspbian display stack (vc4 firmwarekms, glamor, compton compositor) takes way more memory than the old display stack (fbdev on the firmware, no glamor, no compositing). It’s been pretty easy for users to hit the 256MB CMA limit, and we aren’t even using 256MB of CMA on the Pi0/1 due to its only having 512MB of system memory.

One of the ideas would be to page GPU allocations out to system memory. That only gets us from 256MB to <1GB of memory available. It’s not like you want to be swapping to SD cards. If we’re hitting 256MB this easily, we need to figure out how to fix that.

The first step was to figure out where our memory was going. I wrote a hack that had the kernel give names to the BOs it allocated, and added a new UAPI that let userspace attach more descriptive names to BOs. With that, I added a little hack to Mesa to attach some descriptions. I then made /debug/dri/0/bo_stats print out the stats on count/size per name, rather than just total. Here’s the output of LXDE with compton as I was dragging a terminal around:

scanout resource 1920x1080@32/0:  48600kb BOs (6)
        tiling shadow 1920x1080:  32640kb BOs (4)
                       overflow:  16384kb BOs (1)
        resource 1920x1080@32/0:  16320kb BOs (2)
        resource 1024x1024@32/0:   4096kb BOs (1)
          tiling shadow 661x411:   2184kb BOs (2)
          resource 657x386@32/0:   1092kb BOs (1)
  scanout resource 661x411@32/0:   1068kb BOs (1)
         resource 1048576x1@8/0:   1024kb BOs (1)
         resource 1024x1024@8/0:   1024kb BOs (1)
                       resource:    888kb BOs (20)
          tiling shadow 1920x26:    240kb BOs (1)
          resource 1920x26@32/0:    240kb BOs (1)
                         shader:    236kb BOs (56)
  scanout resource 1920x26@32/0:    196kb BOs (1)
                     mesa cache:    152kb BOs (5)
           resource 659x20@32/0:     84kb BOs (1)
           resource 65536x1@8/0:     64kb BOs (1)
          resource 128x128@32/0:     64kb BOs (1)
              resource 1x1@32/0:     56kb BOs (14)
                  dumb 64x64@32:     48kb BOs (3)
            resource 659x1@32/0:     36kb BOs (3)
                          cache:     28kb BOs (4)
                            RCL:     20kb BOs (2)
             resource 26x1@32/0:     16kb BOs (4)
            resource 1x386@32/0:     16kb BOs (2)
            resource 1x361@32/0:     16kb BOs (2)
             resource 1x25@32/0:     16kb BOs (4)
            resource 609x1@32/0:     12kb BOs (1)
            resource 607x1@32/0:     12kb BOs (1)
            resource 12x18@32/0:      4kb BOs (1)
                            BCL:      4kb BOs (1)
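
Output in this shape is easy to post-process. The following sketch parses lines of the form shown above and sums them per name prefix; the regular expression is derived only from the listing, not from the kernel code, and parse_bo_stats() and total_kb() are hypothetical helper names.

```python
import re

# "<name>: <size>kb BOs (<count>)", with arbitrary leading whitespace.
LINE = re.compile(r"^\s*(?P<name>.+?):\s+(?P<kb>\d+)kb BOs \((?P<count>\d+)\)\s*$")


def parse_bo_stats(text):
    """Return a list of (name, kilobytes, bo_count) tuples."""
    rows = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            rows.append((match["name"], int(match["kb"]), int(match["count"])))
    return rows


def total_kb(rows, prefix=""):
    """Sum the sizes of all rows whose name starts with prefix."""
    return sum(kb for name, kb, _ in rows if name.startswith(prefix))


sample = """\
 scanout resource 1920x1080@32/0:  48600kb BOs (6)
    tiling shadow 1920x1080:  32640kb BOs (4)
      tiling shadow 661x411:   2184kb BOs (2)
      tiling shadow 1920x26:    240kb BOs (1)
"""
rows = parse_bo_stats(sample)
print(total_kb(rows, "tiling shadow"), "kb in tiling shadows")
```

Feeding it the whole listing and grouping by prefix ("scanout resource", "tiling shadow", "resource") gives a quick per-category view of where the CMA memory goes.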

This is horrifying. First, this is already after a fix I’ve landed that kept us from leaking an extra overflow BO, which cost 16MB at all times. Next, 6 full-screen scanout resources? I expected 3, certainly, due to compton’s pageflipping (though for a compositor, keeping 3 is bad: I should have GPU idle time built into my pipeline anyway, as presumably GPU work is being done by other clients to generate the frames I’m compositing). Our non-pageflipping scanout buffer is probably another one (can we free the screen pixmap’s contents during pageflipping? That would sure be nice). Maybe the file manager is drawing into a window rather than the root window, to make another one. There’s still one more I haven’t accounted for, though.
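
The sizes in the listing can be checked with simple arithmetic. The sketch below assumes linear 32-bit buffers with no row padding; that the non-scanout 1920x1080 resources and tiling shadows are padded to 1088 rows is my inference from the numbers, not something stated in the post.

```python
def bo_kb(width, height, bits_per_pixel, count=1):
    """Size in kb of `count` buffers of the given dimensions, without row padding."""
    return width * height * (bits_per_pixel // 8) * count // 1024


# Six full-screen scanout buffers at 1920x1080, 32 bits per pixel:
print(bo_kb(1920, 1080, 32, count=6))

# Four full-screen tiling shadows, assuming the height is padded to 1088 rows:
print(bo_kb(1920, 1088, 32, count=4))
```

Each full-screen buffer costs about 8MB, so every extra copy the stack keeps around is a noticeable slice of a 256MB CMA pool.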

More embarrassing is the tiling shadow allocations. We do this in vc4 because it can’t texture from linear BOs, so we allocate a shadow resource that we blit a tiled copy into. The cost of those tile blits is actually why we’re running a compositor in Raspbian in the first place, because window movement is too expensive otherwise.

My crazy idea was to make a new mode for glamor that just keeps all pixmaps in system memory until they need to be on the GPU (because they’re either scanned out by the GPU or they have an exposed reference to them for DRI). My previous reworks for linear vs tiled buffers in glamor already made this look easy. You end up with software fallbacks all the time (The old fbturbo driver was all software fallbacks, though), and with the new gbm_bo_map() interface we should be able to keep the cost of software fallbacks down to just the cost of the rendering performed, not a 1920x1080 readback per drawing operation.

However, 20 patches in and without having gotten to the point of gbm_bo_map() yet, this is looking like a mess. On Friday I had breakfast with keithp and talked over what I was doing, and he convinced me to go back to an older, simpler plan: Just make all of our pixmaps tiled, so we don’t have the shadow copies at all. If we don’t have the shadow copies, then we don’t need the compositor to get window-dragging performance. At least 64MB of the allocations above would go away.

The new plan means we eat the memory bandwidth cost of scanning out from tiled, but it will be easy enough to register a timer to go make a linear copy and flip to that. We’re not the only platform that would like that, I’m sure.

I’ve now written the kernel interface and debugging it is the plan for this week.

In other news, my panel-bridge v3 is now in drm-misc-next, along with most of the associated code deletion. I also helped the team trying to integrate my PL111 code with a bug in panning support, which their current 3D stack relies on. In the process of debugging the pl111+vc4 code (a full screen X11 fallback would reallocate the BO when it shouldn’t), I added debugfs register dumping for pl111, which should be useful for other developers using the code.

I have the X Server building in Travis CI, at long last. The meson build system was critical for this, as autotools was just too slow. The other key part was switching to Docker for bundling our dependencies, since 20 autotools invocations was unreasonable, and I won’t have the time to convert those 20 things to meson. Next steps on CI are to get the X Test suite running in the meson build system (the docker image already has it present).

Finally, earlier this month I submitted my Mesa register allocation series (let the driver choose the color during graph coloring), which should be good for about a 2% performance improvement to vc4 3D.

June 01, 2017

With modifier support added to Mesa and gbm_gralloc, it is now possible to boot Android on iMX6 platforms using no proprietary blobs at all. This makes iMX6 one of the very few embedded SoCs that need no blobs at all to run a full graphics stack.

Not only is that a great win for Open Source in general, but it also makes the iMX6 more attractive as a platform. A further positive point is that this lays the groundwork for the iMX8 platform, which will make supporting it much easier.

What are modifiers used for?

Modifiers are used to represent different properties of buffers. These properties can cover a range of different information about a buffer, for example compression and tiling.

For the case of …

Currently the modifiers work is in the process of being upstreamed, but in the meantime it can be found here. If you'd like to test this out yourself I maintain a how-to.

May 30, 2017

DRM leasing part three (Vulkan)

With the kernel APIs off for review, and the X RandR bits looking like they're in reasonable shape, I finally found some time to sit down and figure out how I wanted to integrate this into Vulkan.

Avoiding two DRM file descriptors

Given that a DRM lease is represented by a DRM master file descriptor, we want to use that for all of the operations in the driver, including rendering and mode setting. Using the vulkan driver render node and the lease master node together would require passing buffer objects between the kernel contexts using even more file descriptors.

The Mesa Vulkan drivers open the device nodes while enumerating devices, not when they are created. This seems a bit early to me, but it makes sure that the devices being enumerated are actually available for use, and not just present in the system. To replace the render node fd with the lease master fd means hooking into the system early enough that the enumeration code can see the lease fd. And that means creating an instance extension as the instance gets created before devices are enumerated.

The VK_KEITHP_kms_display instance extension

This simple instance extension provides the necessary hooks to get the lease information from the application down into the driver before the DRM node is opened. In the first implementation, I added a function that could be called before the devices were enumerated to save the information in the Vulkan loader. That worked, but required quite a bit of study of the Vulkan loader and its XML description of the full Vulkan API.

Mark Young suggested that a simpler plan would be to chain the information into the VkInstanceCreateInfo pNext field; with no new APIs added to Vulkan, there shouldn't be any need to change the Vulkan loader -- the device driver would advertise the new instance extension and the application could find it.

That would have worked great, except the Vulkan loader 'helpfully' elides all instance extensions it doesn't know about before returning the list to the application. I'd say this was a bug and should be fixed, but for now, I've gone ahead and added the few necessary definitions to the loader to make it work.

In the application, it's a simple matter of searching for this extension, constructing the VkKmsDisplayInfoKEITHP structure, chaining that into the VkInstanceCreateInfo pNext list and passing that in to the vkCreateInstance call.

typedef struct VkKmsDisplayInfoKEITHP {
    VkStructureType         sType;  /* VK_STRUCTURE_TYPE_KMS_DISPLAY_INFO_KEITHP */
    const void*             pNext;
    int                     fd;
    uint32_t                crtc_id;
    uint32_t                *connector_ids;
    int                     connector_count;
    drmModeModeInfoPtr      mode;
} VkKmsDisplayInfoKEITHP;

As you can see, this includes the master file descriptor along with all of the information necessary to set the desired video mode using the specified resources.

The driver just walks the pNext list from the VkInstanceCreateInfo structure looking for any provided VkKmsDisplayInfoKEITHP structure and pulls the data out.
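
The lookup itself is a plain linked-list walk keyed on sType. The following model uses Python dataclasses instead of the C structures purely for illustration; the string constants stand in for the real enum values, which are not given here, and find_in_chain() is a hypothetical name.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholders: the real VkStructureType values are not given in the text.
STRUCTURE_TYPE_INSTANCE_CREATE_INFO = "INSTANCE_CREATE_INFO"
STRUCTURE_TYPE_KMS_DISPLAY_INFO = "KMS_DISPLAY_INFO_KEITHP"


@dataclass
class Struct:
    """Common header every chained structure starts with."""
    s_type: str
    p_next: Optional["Struct"] = None


@dataclass
class KmsDisplayInfo(Struct):
    """Subset of the fields of VkKmsDisplayInfoKEITHP."""
    fd: int = -1
    crtc_id: int = 0


def find_in_chain(head, wanted):
    """Walk head.p_next until a structure of the wanted type is found."""
    node = head.p_next
    while node is not None:
        if node.s_type == wanted:
            return node
        node = node.p_next
    return None


info = KmsDisplayInfo(STRUCTURE_TYPE_KMS_DISPLAY_INFO, None, fd=7, crtc_id=42)
unrelated = Struct("SOME_OTHER_EXTENSION", info)
create_info = Struct(STRUCTURE_TYPE_INSTANCE_CREATE_INFO, unrelated)
found = find_in_chain(create_info, STRUCTURE_TYPE_KMS_DISPLAY_INFO)
```

Structures of unknown type are simply skipped, which is what lets an application chain this extension's data without the rest of the stack having to understand it.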

To avoid questions about file descriptor lifetimes, the driver dup's the provided fd. The application is expected to close their copy at a suitable time.

The VK_KHR_display extension

Vulkan already has an API for directly accessing the raw device, including code for exposing video modes and everything. As tempting as it may be to just go do something simpler, there's a lot to be said for using existing APIs.

This extension doesn't provide any direct APIs for acquiring display resources, relying on the VK_EXT_acquire_xlib_display extension for that part. And that takes a VkPhysicalDevice parameter, which is only available after the device is opened, which is why I created the VK_KEITHP_kms_display extension instead of just using the VK_EXT_acquire_xlib_display extension -- we can't increase the capabilities of the render node opened by the driver, and we don't want to keep two file descriptors around.

With the information provided by the VK_KEITHP_kms_display extension, we can implement all of the VK_KHR_display extension APIs, including enumerating planes and modes and creating the necessary display surface. Of course, there's only one plane and one mode, so some of the implementation is pretty simplistic.

The big piece of work was to create the swap chain structure and associated frame buffers.

A working example

I've taken the 'cube' example from the Vulkan loader and hacked it up to use XCB to construct a DRM lease and the VK_KEITHP_kms_display extension to pass that lease into the Vulkan driver. The existing support for the VK_KHR_display extension "just worked", which was pretty satisfying.

It's a bit of a mess

I'm not satisfied with the mesa code at this point; there's a bunch of code in the radeon driver which should be in the vulkan WSI bits, and the vulkan WSI bits should probably not have the KMS interfaces wired in. I'll ask around and see what other Mesa developers think I should do before restructuring it further; I'll probably have to rewrite it all at least one more time before it's ready to upstream.

Seeing the code

I'll be cleaning the code up a bit before sending it out for review, but it's already visible in my own repositories:

May 29, 2017

tl;dr We can create coverage-instrumented binaries, run them and aggregate the coverage data from running both the program and the unit tests.

In the Go world, unit testing is tightly integrated with the go tool chain. Write some unit tests, run go test and tell anyone that will listen that you really hope to never have to deal with a build system for the rest of your life.

Since Go 1.2 (Dec. 2013), go test has supported test coverage analysis: with the -cover option it will tell you how much of the code is being exercised by the unit tests.

So far, so good.

I've been wanting to do something slightly different for some time though. Imagine you have a command line tool. I'd like to be able to run that tool with different options and inputs, check that everything is OK (using something like bats) and gather coverage data from those runs. Even better, wouldn't it be neat to merge the coverage from the unit tests with the one from those program runs and have an aggregated view of the code paths exercised by both kinds of testing?

A word about coverage in Go

Coverage instrumentation in Go is done by rewriting the source of an application. The cover tool inserts code to increment a counter at the start of each basic block, a different counter for each basic block of course. Some metadata is kept alongside each of the counters: the location of the basic block (source file, start/end line & columns) and the size of the basic block (number of statements).

This rewriting is done automatically by go test when coverage information has been requested by the user (run go test -x to see what's happening under the hood). go test then generates an instrumented test binary and runs it.

A more detailed explanation of the cover story can be found on the Go blog.

Another interesting thing is that it's possible to ask go test to write out a file containing the coverage information with the -coverprofile option. This file starts with the coverage mode, which is how the coverage counters are incremented. This is one of set, count or atomic (see blog post for details). The rest of the file is the list of basic blocks of the program with their metadata, one block per line:

[...]oci.go:241.29,244.9 3 4

This describes one piece of code from oci.go, composed of 3 statements without branches, starting at line 241, column 29 and finishing at line 244, column 9. This block has been reached 4 times during the execution of the test binary.
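The line format described above (file, start line.column, end line.column, number of statements, hit count) is simple enough to parse by hand. Here is a minimal sketch, assuming that layout; the Block type and ParseBlock function are names made up for this illustration and are not part of the Go tool chain.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Block holds the metadata of one basic block from a coverage profile.
// The type and field names are illustrative only.
type Block struct {
	File                string
	StartLine, StartCol int
	EndLine, EndCol     int
	NumStmt, Count      int
}

// ParseBlock parses a line such as "oci.go:241.29,244.9 3 4".
func ParseBlock(line string) (Block, error) {
	// The file part may itself contain colons, so split on the last one.
	i := strings.LastIndex(line, ":")
	if i < 0 {
		return Block{}, fmt.Errorf("missing ':' in %q", line)
	}
	// The six numbers are separated by '.', ',' and ' '.
	fields := strings.FieldsFunc(line[i+1:], func(r rune) bool {
		return r == '.' || r == ',' || r == ' '
	})
	if len(fields) != 6 {
		return Block{}, fmt.Errorf("expected 6 numbers in %q", line)
	}
	var n [6]int
	for j, f := range fields {
		v, err := strconv.Atoi(f)
		if err != nil {
			return Block{}, err
		}
		n[j] = v
	}
	return Block{
		File:      line[:i],
		StartLine: n[0], StartCol: n[1],
		EndLine: n[2], EndCol: n[3],
		NumStmt: n[4], Count: n[5],
	}, nil
}

func main() {
	b, err := ParseBlock("oci.go:241.29,244.9 3 4")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s: %d statements, reached %d times\n", b.File, b.NumStmt, b.Count)
}
```

In a real profile the file part carries the full import path of the package; the short name here is only for readability.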

Generating coverage instrumented programs

Now, what I really want to do is to compile my program with the coverage instrumentation, not just the test binary. I also want to get the coverage data written to disk when the program finishes.

And that's when we have to start being creative.

We're going to use go test to generate that instrumented program. It's possible to define a custom TestMain function, an entry point of a kind, for the test package. TestMain is often used to set up the test environment before running the list of unit tests. We can hack it a bit to call our main function and jump to running our normal program instead of the tests! I ended up with something like this:
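One detail such a TestMain has to deal with is that the test binary receives both the -test.* flags and the program's own arguments. A minimal sketch of that part, assuming test flags are always given in the -test.name=value form; programArgs is a hypothetical helper, and the TestMain shown in the comment is an illustration of the idea rather than the exact code used for cc-runtime.

```go
package main

import (
	"fmt"
	"strings"
)

// programArgs returns args without the flags meant for the testing package
// (anything starting with "-test."), so the real program only sees its own
// arguments. It assumes the "-test.name=value" form; a value passed as a
// separate argument would not be removed.
func programArgs(args []string) []string {
	var out []string
	for _, a := range args {
		if strings.HasPrefix(a, "-test.") {
			continue
		}
		out = append(out, a)
	}
	return out
}

// In a _test.go file of package main, a TestMain along these lines would
// then hand control to the program instead of running the unit tests:
//
//	func TestMain(m *testing.M) {
//		flag.Parse()                   // let the testing package record its -test.* flags
//		os.Args = programArgs(os.Args) // hide those flags from the program
//		main()                         // run the real program
//	}

func main() {
	args := []string{"cc-runtime", "-test.coverprofile=list.cov", "list"}
	fmt.Println(programArgs(args))
}
```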

The current project I'm working on is called cc-runtime, an OCI runtime spawning virtual machines. It definitely deserves its own blog post, but for now, knowing the binary name is enough. Generating a coverage instrumented cc-runtime binary is just a matter of invoking go test:

$ go test -o cc-runtime -covermode count

I haven't used atomic as this binary is really a thin wrapper around a library and doesn't use many goroutines. I'm also assuming that the cost of atomic operations in every branch is "quite a bit" higher than that of the non-atomic addition. I don't care too much if the counter is off by a bit, as long as it's strictly positive.

We can run this binary just as if it were built with go build, except it's really a test binary and we have access to the same command line arguments as we would otherwise. In particular, we can ask to output the coverage profile.

$ ./cc-runtime -test.coverprofile=list.cov list
[ outputs the list of containers ]

And let's have a look at list.cov. Hang on... there's a problem, nothing was generated: we didn't get the usual "coverage: xx.x% of statements" at the end of a go test run and there's no list.cov in the current directory. What's going on?

The testing package flushes the various profiles to disk after running all the tests. The problem is that we don't run any test here, we just call main. Fortunately enough, the API to trigger a test run is semi-public: it's not covered by the go1 API guarantee and has "internal only" warnings. Not. Even. Scared. Hacking up a dummy test suite and running it is easy enough:

There is still one little detail left. We need to call this FlushProfiles function at the end of the program and that program could very well be using os.Exit anywhere. I couldn't find better than having a tiny exit package implementing the equivalent of the libc atexit() function and forbidding direct use of os.Exit in favour of exit.Exit(). It's even testable.
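The atexit() idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the exit package itself: it lives in package main to stay self-contained, and the names AtExit, Exit and osExit are chosen here for the example. Handlers run in reverse order of registration, as with libc's atexit().

```go
package main

import (
	"fmt"
	"os"
)

var (
	// handlers are the functions to run before the process terminates.
	handlers []func()
	// osExit is a variable so that tests can replace it and observe the
	// exit code without killing the test process.
	osExit = os.Exit
)

// AtExit registers f to be called by Exit.
func AtExit(f func()) {
	handlers = append(handlers, f)
}

// Exit runs the registered handlers, most recently registered first,
// then terminates the program with the given status code.
func Exit(code int) {
	for i := len(handlers) - 1; i >= 0; i-- {
		handlers[i]()
	}
	osExit(code)
}

func main() {
	// In the instrumented binary, this is where the coverage profile
	// would be flushed to disk.
	AtExit(func() { fmt.Println("flushing coverage profile") })
	Exit(0)
}
```

Keeping the real exit call behind a variable is what makes the package testable: a test swaps in a function that records the status code.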

Putting everything together

It's now time for a full example. I have a small calc program that can compute additions and subtractions.

$ calc add 4 8

The code isn't exactly challenging:
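A minimal sketch of such a program, written to be consistent with the error messages shown further down. Apart from add, the function names are assumptions, and the statement count need not match the one behind the coverage figures in this post.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

func add(a, b int) int { return a + b }

func sub(a, b int) int { return a - b }

// calc evaluates arguments of the form {operation, a, b}.
func calc(args []string) (int, error) {
	if len(args) != 3 {
		return 0, fmt.Errorf("expected 3 arguments, got %d", len(args))
	}
	a, err := strconv.Atoi(args[1])
	if err != nil {
		return 0, err
	}
	b, err := strconv.Atoi(args[2])
	if err != nil {
		return 0, err
	}
	switch args[0] {
	case "add":
		return add(a, b), nil
	case "sub":
		return sub(a, b), nil
	}
	return 0, fmt.Errorf("unknown operation: %s", args[0])
}

func main() {
	result, err := calc(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(result)
}
```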

I've written some unit-tests for the add function only. We're going to run calc itself to cover the remaining statements. But first, let's see the unit tests code with both TestAdd and our hacked up TestMain function. I've swept the hacky bits away in a cover package.

Let's run the unit-tests, asking to save a unit-tests.cov profile.

$ go test -covermode count -coverprofile unit-tests.cov
coverage: 7.1% of statements
ok 0.003s

Huh. 7.1%. Well, we're only testing the 1 statement of the add function after all. It's time for the magic. Let's compile an instrumented calc:

$ go test -o calc -covermode count

And run calc a few times to exercise more code paths. For each run, we'll produce a coverage profile.

$ ./calc -test.coverprofile=sub.cov sub 1 2
$ covertool report sub.cov
coverage: 57.1% of statements

$ ./calc -test.coverprofile=error1.cov foo
expected 3 arguments, got 1
$ covertool report error1.cov
coverage: 21.4% of statements

$ ./calc -test.coverprofile=error2.cov mul 3 4
unknown operation: mul
$ covertool report error2.cov
coverage: 50.0% of statements

We want to aggregate those profiles into one single super-profile. While there are some hints people are interested in merging profiles from several runs (that commit is in go 1.8), the cover tool doesn't seem to support this kind of thing easily so I wrote a little utility to do it: covertool

$ covertool merge -o all.cov unit-tests.cov sub.cov error1.cov error2.cov
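Conceptually, merging profiles taken in count mode boils down to summing the counters of identical basic blocks. The sketch below shows that idea only; it is not necessarily how covertool is implemented, and the Block type and Merge function are names invented for the example.

```go
package main

import "fmt"

// Block is one basic block of a profile. Pos is the position part of a
// profile line ("file:startLine.startCol,endLine.endCol") and identifies
// the block across profiles.
type Block struct {
	Pos     string
	NumStmt int
	Count   int
}

// Merge sums the counters of identical blocks across profiles, keeping the
// blocks in the order in which they are first seen. Summing suits the
// "count" and "atomic" modes; for "set" mode the counters would need to be
// clamped to 1 instead.
func Merge(profiles ...[]Block) []Block {
	index := map[string]int{} // block position -> index in merged
	var merged []Block
	for _, p := range profiles {
		for _, b := range p {
			if i, ok := index[b.Pos]; ok {
				merged[i].Count += b.Count
				continue
			}
			index[b.Pos] = len(merged)
			merged = append(merged, b)
		}
	}
	return merged
}

func main() {
	unitTests := []Block{{"calc.go:10.1,12.2", 1, 2}, {"calc.go:14.1,16.2", 1, 0}}
	subRun := []Block{{"calc.go:10.1,12.2", 1, 0}, {"calc.go:14.1,16.2", 1, 1}}
	for _, b := range Merge(unitTests, subRun) {
		fmt.Printf("%s %d %d\n", b.Pos, b.NumStmt, b.Count)
	}
}
```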

Unfortunately again, I discovered a bug in Go's cover and so we need covertool to tell us the coverage of the aggregated profile:

$ covertool report all.cov
coverage: 92.9% of statements
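The figures reported so far are consistent with a 14-statement program: 1/14 is 7.1%, 8/14 is 57.1% and 13/14 is 92.9%. Statement coverage is the number of statements in blocks hit at least once divided by the total number of statements. A minimal sketch of that computation, with a made-up Block type and block sizes picked only to add up to 14:

```go
package main

import "fmt"

// Block carries the two profile fields needed to compute coverage.
type Block struct {
	NumStmt int // number of statements in the basic block
	Count   int // number of times the block was reached
}

// Coverage returns the percentage of statements that belong to blocks
// reached at least once. An empty profile yields 0.
func Coverage(blocks []Block) float64 {
	total, covered := 0, 0
	for _, b := range blocks {
		total += b.NumStmt
		if b.Count > 0 {
			covered += b.NumStmt
		}
	}
	if total == 0 {
		return 0
	}
	return 100 * float64(covered) / float64(total)
}

func main() {
	// 14 statements in total, 13 of them in blocks that were reached.
	profile := []Block{{3, 4}, {10, 1}, {1, 0}}
	fmt.Printf("coverage: %.1f%% of statements\n", Coverage(profile))
}
```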

Not Bad!

Still not 100% though. Let's fire up the HTML coverage viewer to see what we are missing:

$ go tool cover -html=all.cov

Oh, indeed, we're missing 1 statement. We never call add from the command line so that switch case is never covered. Good. Seems like everything is working as intended.

Here be dragons

As fun as this is, it definitely feels like very few people are building this kind of instrumented binary. Everything is a bit rough around the edges. I may have missed something obvious, of course, but I'm sure the Internet will tell me if that's the case!

It'd be awesome if we could have something nicely integrated in the future.

May 23, 2017

TLDR: If you see devices like "xwayland-pointer" show up in your xinput list output, then you are running under a Wayland compositor and debugging/configuration with xinput will not work.

For many years, the xinput tool has been a useful tool to debug configuration issues (it's not a configuration UI btw). It works by listing the various devices detected by the X server. So a typical output from xinput list under X could look like this:

:: whot@jelly:~> xinput list
⎡ Virtual core pointer id=2 [master pointer (3)]
⎜ ↳ Virtual core XTEST pointer id=4 [slave pointer (2)]
⎜ ↳ SynPS/2 Synaptics TouchPad id=22 [slave pointer (2)]
⎜ ↳ TPPS/2 IBM TrackPoint id=23 [slave pointer (2)]
⎜ ↳ ELAN Touchscreen id=20 [slave pointer (2)]
⎣ Virtual core keyboard id=3 [master keyboard (2)]
↳ Virtual core XTEST keyboard id=5 [slave keyboard (3)]
↳ Power Button id=6 [slave keyboard (3)]
↳ Video Bus id=7 [slave keyboard (3)]
↳ Lid Switch id=8 [slave keyboard (3)]
↳ Sleep Button id=9 [slave keyboard (3)]
↳ ThinkPad Extra Buttons id=24 [slave keyboard (3)]
Alas, xinput is scheduled to go the way of the dodo. More and more systems are running a Wayland session instead of an X session, and xinput just doesn't work there. Here's an example output from xinput list under a Wayland session:

$ xinput list
⎡ Virtual core pointer id=2 [master pointer (3)]
⎜ ↳ Virtual core XTEST pointer id=4 [slave pointer (2)]
⎜ ↳ xwayland-pointer:13 id=6 [slave pointer (2)]
⎜ ↳ xwayland-relative-pointer:13 id=7 [slave pointer (2)]
⎣ Virtual core keyboard id=3 [master keyboard (2)]
↳ Virtual core XTEST keyboard id=5 [slave keyboard (3)]
↳ xwayland-keyboard:13 id=8 [slave keyboard (3)]
As you can see, none of the physical devices are available, the only ones visible are the virtual devices created by XWayland. On a Wayland session, the X server doesn't have access to the physical devices. Instead, it talks via the Wayland protocol to the compositor. This image from the Wayland documentation shows the architecture:
In the above graphic, devices are known to the Wayland compositor (1), but not to the X server. The Wayland protocol doesn't expose physical devices, it merely provides a 'pointer' device, a 'keyboard' device and, where available, touch and tablet tool/pad devices (2). XWayland wraps these into virtual devices and provides them via the X protocol (3), but they don't represent the physical devices.

This usually doesn't matter, but when it comes to debugging or configuring devices with xinput we run into a few issues. First, configuration via xinput usually means changing driver-specific properties but in the XWayland case there is no driver involved - it's all handled by libinput inside the compositor. Second, debugging via xinput only shows what the Wayland protocol sends to XWayland and what XWayland then passes on to the client. For low-level issues with devices, this is all but useless.

The takeaway here is that if you see devices like "xwayland-pointer" show up in your xinput list output, then you are running under a Wayland compositor and debugging with xinput will not work. If you're trying to configure a device, use the compositor's configuration system (e.g. gsettings). If you are debugging a device, use libinput-debug-events. Or compare the behaviour between the Wayland session and the X session to narrow down where the failure point is.
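That check is easy to script. A minimal sketch, assuming the xinput list output has already been captured as a string and that XWayland's virtual devices keep the "xwayland-" name prefix shown above; XwaylandDevices is a hypothetical helper written for this example.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// XwaylandDevices returns the names of the devices in the given
// "xinput list" output whose name starts with "xwayland-".
func XwaylandDevices(xinputList string) []string {
	var names []string
	sc := bufio.NewScanner(strings.NewReader(xinputList))
	for sc.Scan() {
		line := sc.Text()
		i := strings.Index(line, "xwayland-")
		if i < 0 {
			continue
		}
		// The device name ends where the "id=" column starts.
		name := line[i:]
		if j := strings.Index(name, "id="); j >= 0 {
			name = name[:j]
		}
		names = append(names, strings.TrimSpace(name))
	}
	return names
}

func main() {
	out := "⎡ Virtual core pointer id=2 [master pointer (3)]\n" +
		"⎜ ↳ xwayland-pointer:13 id=6 [slave pointer (2)]\n" +
		"⎣ Virtual core keyboard id=3 [master keyboard (2)]\n" +
		"↳ xwayland-keyboard:13 id=8 [slave keyboard (3)]\n"
	if devs := XwaylandDevices(out); len(devs) > 0 {
		fmt.Println("Wayland session, xinput only sees:", devs)
	} else {
		fmt.Println("no XWayland devices found")
	}
}
```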

May 20, 2017

Not only, but to a large extent I worked in the last few months on foundational improvements to KWin’s DRM backend, which is a central building block of KWin’s Wayland session. The idea in the beginning was to directly expand upon my past Atomic Mode Setting (AMS) work from last year. We’re talking about direct scanout of graphic buffers for fullscreen applications and later layered compositing. Indeed this was my Season of KDE project with Martin Flöser as my mentor, but in the end relative to the initial goal it was unsuccessful.

The reason for the missed goal wasn’t a lack of work or enthusiasm from my side, but the realization that I need to go back and first rework the foundations, which were in some kind of disarray, mostly because of mistakes I made when I first worked on AMS last year, partly because of changes Daniel Stone made to his work-in-progress patches for AMS support in Weston, which I used as an example throughout my work on AMS in KWin, and also because of some small flaws introduced to our DRM backend before I started working on it.

The result of this rework is three separate patches depending on each other, and all of them got merged last week. They will be part of the 5.10 release. The reason for doing three patches instead of only one was to ease the review process.

The first patch dealt with the query of important kernel display objects, which represent real hardware, the CRTCs and Connectors. KWin didn’t remember these objects in the past, although they are static while the system is running. This meant for example that KWin requeried all of them on a hot plugging event and had no prolonged knowledge about their state after a display was disconnected again. The last point made it in particular difficult to do a proper cleanup of the associated memory after a disconnect. So changing this in a way that the kernel objects are only queried once in the beginning made sense. Also from my past work I already had created a generic class for kernel objects with the necessary subclasses, which could be used in this context. But still to me this patch was the most “controversial” one of the three, which means it was the one I was most worried about being somehow “wrong”, not just in details, but in general, especially since it didn’t solve any observable specific misbehaviour, which it could be benchmarked against. Of course I did my research, but there is always the anxiety of overlooking something crucial. Too bad the other patches depended on it. But the patch was accepted and to my relief everything seems to work well on the current master and the beta branch for the upcoming release as well.

The second patch restructured the DrmBuffer class. We support KWin builds with or without Generic Buffer Manager (GBM). It therefore made sense to split off the GBM dependent part of DrmBuffer into a separate file, which only gets included when GBM is available. Martin had this idea and, although the patch is still quite large because of all the moved around code and renamed classes, the change was straightforward. I still managed to introduce a build breaking regression, which was quickly discovered and easy to fix. This patch was also meant as a preparation for the future direct scanout of buffers, which will then be done by a new subclass of DrmBuffer, also depending on GBM.

The last patch finally tackled directly all the issues I experienced when trying to use the previously rather underwhelming code path for AMS. Yes, you saw the picture on the screen, the buffer flipping worked, but basic functionality like hot plugging or display suspending was not working at all or led to unpredictable behaviour. One basically complete rewrite later, with many, many manual pluggings and unpluggings of external monitors to test the behaviour, the problems have been solved to the point that I consider the AMS code path ready for daily use. For Plasma 5.11 I therefore plan to make it the new default. That means that it will be available on Intel graphics automatically from Linux kernel 4.12 onwards, when on the kernel side the Intel driver also defaults to it. If you want to test my code on Plasma 5.10 you need to set the environment variable KWIN_DRM_AMS and on kernels older than 4.12 you need to add the boot parameter i915.nuclear_pageflip. If you use Nvidia with the open source Nouveau driver, AMS should be available to you since kernel 4.10. In this case you should only need to set the environment variable above on 5.10, if you want to test it. Since I only tested AMS with Intel graphics until now, some reports back on how it works with Nouveau would be great.

That’s it for now. But of course there is more to come. I haven’t given up on the direct scanout and at some point in the future I want to finish it. I already had a working prototype and mainly waited for my three patches to land. But for now I’ll postpone further work on direct scanout and layered compositing. Instead, over the last weeks I worked on something special for our Wayland session in 5.11. I call it Night Color, and with this name you can probably guess what it will be. And did I mention that I was accepted as a Google Summer of Code student for the foundation with my project to implement multi buffered present support in XWayland? Nah, I didn’t. Sorry for asking rhetorically in this smug way, but I’m just very happy and also a bit proud of having learned so much in basically only one year to the point of now being able to start work on a project directly. I’ll write about it in another blog post in the near future.

May 19, 2017
Two weeks have now passed since my introductory blog post, so as promised here is part two! The theme of this blog post is probably something along the lines of “preparation”, as that is what I’ve been doing mostly. The period between that of announcing the accepted student proposals and phase one is called the community bonding period. In this period students are supposed to get to know their mentors, their organizations, familiarize themselves with their projects (even more) and get everything ready to start off the coding period on May 30th.
May 14, 2017

Early last year, Oracle announced it would be shutting down the general project hosting on & at the end of April 2017.

We'd been hosting various Solaris content on there since shut down in 2013, and have worked to move most of it off now. We've requested some redirects be put in place where possible, but they only offer one per subproject, so we can't make all pages redirect to the best spot for that particular content.

If you referenced anything under or its sub projects, here's a guide to where to find that content now:

FOSS Source Code:
The mailing lists were not heavily used, so we've not set up new replacement mailing lists, but suggested discussion migrate to the forums at instead.
Information pages:

And of course, much of it was archived in the Internet Archive for historical reference.

May 09, 2017

I’ve been procrastinating on writing up status, so this update’s a big one.

Boris Brezillon has been hacking away at implementing the VC4 transposer module as a DRM writeback connector, and has things running in test apps. My immediate goal with his work is to enable IGT testing of VC4’s HVS (hardware compositor) features, so that we can implement plane tiling and formats with confidence. However, there’s also a long term goal to extend X11’s modesetting driver to do Present into DRM planes, and only collapse planes down to the primary framebuffer once an atomic check fails or the display goes idle. This is something that other compositors like Raspberry Pi’s DispmanX firmware, Android HWC2, or Weston can all do, so we should fix X11 to do it as well.

Hans Verkuil has started on a CEC implementation for vc4’s HDMI module. The hardware looks fairly simple – unlike many other HDMI CEC implementations, ours is right there in the HDMI module, so there are no complicated inter-kernel-module dependencies to implement. I’ve set him up with register definitions and answered some hardware questions, and I suspect he’ll have results soon.

On my end, I’m excited to announce that the ARM pl111 module DRM KMS driver for Broadcom’s Cygnus is now merged to drm-misc-next, along with V3D support. This has been a more or less pleasant process: I got useful review feedback from Linus Walleij (who’s been involved in the fbdev version of this driver), some cleanups on the V3D code from Florian, and I got the code landed in a reasonable amount of time. The drm-misc process may not be perfect, but it’s the best I’ve experienced in the Linux kernel yet.

Next steps on Cygnus: I need to land a reset controller driver and the associated DT (right now if you lock up the GPU, it never resets). On Raspberry Pi reset works by us turning the power domain off and on, which causes the firmware to go through the reset process, but on Cygnus we don’t have a separate power domain for V3D. I also need to finish the panel driver and submit it (I’m using a stub for now, and have a proper panel driver in progress). And I’m working on using gallium’s “renderonly” helper functions to allow you to bring up GBM with GL accelerated on vc4 on a pl111 DRM node. This should get X, glmark, and kmscube (and any other GBM applications), working on Cygnus with no userspace changes.

I’ve also been reworking Raspberry Pi’s DSI panel support. I got really discouraged on this after trying to submit to the DRM panel tree back in December. Based on that feedback, I’m moving most of the code to a DRM bridge driver, with just the panel timing in the DRM panel tree. However, writing drivers that talk to DRM panels sucks, so I’ve written the “panel-bridge” helper that cuts about 100 lines of boilerplate from most DRM drivers that talk to a panel, by wrapping the panel in a DRM bridge structure. I submitted an early version for review last week, which was met with some skepticism. I need to re-submit my updated version that converts more drivers – probably when the patch series removes 400 lines of code from DRM, it’ll get a warmer reception.

I’ve also been working on cleaning up our GLES2 conformance test results. I’ve landed 4 patches to the GLSL compiler, and one more Mesa patch is in the queue. The most interesting problem for conformance is that some of the GLES2 conformance tests fail in our register allocator: vc4 can’t spill registers, so we need to be really good at register allocation to make arbitrary programs render successfully.

My current goal for allocation is to implement Sethi-Ullman based instruction scheduling at the QIR stage to try to keep register pressure down. However, register pressure is weighted at a middle level in the QIR instruction scheduler’s heuristics, because otherwise later on we end up too constrained for the QPU instruction scheduler and introduce stalls. To fix the QPU scheduling, I wrote a patch series that gives the driver a chance to choose among available registers during register coloring. With that, the driver can prioritize the accumulators while rotating through the available registers. Initial performance results look excellent, so I should be able to bump register pressure up in priority next and do the Sethi-Ullman work. Interestingly, this callback could resolve a longstanding issue that Intel’s pre-Sandybridge driver had in register allocation as well!

In the X Server, I’ve landed the meson build system. I’m not quite dogfooding it yet, because I can successfully run my desktop with startx but not with gdm (it looks like something is going wrong in the systemd integration). I’ve received patches from a few other X developers fixing up the meson build system for themselves, so I think we’re on track toward general adoption. I’ve made some more progress on implementing Travis CI for the X Server now, and on integrating the current X Server unit tests into meson.

May 05, 2017
Yesterday, after waiting for what felt like a very long time, I finally got the results of my Google Summer of Code (GSoC) application! I can happily say that one of my proposals has been accepted: I will work with Peter Hutterer to redesign and rewrite Piper. The synopsis of my proposal is as follows: Piper is an application frontend to libratbag and ratbagd, a library and system daemon to configure gaming mice respectively.
May 04, 2017
Test run totals:
Passed: 109293/150992 (72.4%)
Failed: 0/150992 (0.0%)
Not supported: 41697/150992 (27.6%)
Warnings: 2/150992 (0.0%)

This is effectively a pass. The Not Supported stuff isn't missing features as uneducated people are quick to spout, it's more stuff the hardware doesn't support or is pointless to expose on the hardware. (lots of image formats).

These are the results from the Vulkan CTS 1.0.2 branch, against mesa master with one patch (a workaround for some InternalErrors that CTS throws up).

Do not call the driver conformant as that is against the Khronos rules as we haven't paid or filed for approval, but the driver does now effectively pass the latest conformance test suite. I'll update on things if that changes.

Thanks again to everyone involved.
April 26, 2017

Since the hardware very much matters this is going to be divided into a few parts, the common steps and the hardware specific ones.

This post is a bit of a living document and will be changed over time, and if you have any questions about it, please reach out through email (robert.foss at or irc (tomeu or robertfoss on #dri-devel on freenode).

2017-04-27:, Added -b [device] support to and
2017-05-02: Don't write to SD-card without -b option
2017-05-04: Switch git repo urls to shared repository
2017-05-09: Add compiler installation to apt-get
2017-05-09: Re-ordered some instructions
2017 …
April 25, 2017

So we have a new job available for someone interested in joining our team and working on improving the Linux graphics stack. The focus of this job will be on GPU compute related work, but you should also expect to be spending time on improving the graphics driver stack in general. We are looking for someone at the Principal Engineer level, but I do recommend that even if you don’t feel you are quite at that level yet you should apply because to be fair people with the kind of experience we are looking for are few and far between, so in the end there is a chance we will hire two more junior developers instead if we have candidates with the right profile.

We are quite flexible on working location for this job, so for the right candidate working remotely is definitely a possibility. And of course if you are interested in joining us at one of our offices that is an option too, for instance we have existing team members working out of our Boston (USA), Brno(Czech Republic), Brisbane (Australia) and Munich (Germany) offices.

GPU Compute is rapidly growing in importance and use so this is your chance to be in the middle of it and work for what I personally think is one of the best companies in the world to work for.

So be sure to submit an application through the Red Hat hiring portal.

April 23, 2017

Since the hardware very much matters this is going to be divided into a few parts, the common steps and the hardware specific ones.

Common steps

mkdir /opt/android
repo init -u -b android-7.1.1_r28
cd /opt/android/.repo
git clone git:// local_manifests -b etnaviv-android
repo sync -j75

mkdir /opt/imx6_android
cp /opt/imx6_android
git clone git:// -b imx_rdu2_v4.11-rc3

# The mkimage tool is used even if you're not
# using u-boot as a bootloader
sudo apt install u-boot-tools

# Fetch Kconfig, bootloaders and some scripts
git clone git:// .

# This will destroy all data …
Of the Wayland demo clients in the Weston repository, simple-shm is the simplest. All the related code is in that one file, and it interfaces directly with libwayland. It does not use GL or EGL, so it can be run on systems where the EGL stack does not support the Wayland platform nor extensions. However, what it renders is surprising:
The original simple-shm client on a Weston desktop.

The square with apparently garbage texture is the original simple-shm. To any graphics developer, who does not know any better, that immediately looks like something is wrong with the image stride somewhere in the graphics stack. That really is what it was supposed to look like, not a bug.

I decided to propose a different rendering, that would not look so much like a bug, and had some real diagnostic value.
The proposed appearance of simple-shm, the way it is supposed to look.
The new appearance has some vertical bars moving from left to right, some horizontal bars moving upwards, and some circles that shrink into the center. With these, you can actually see if there is a stride bug somewhere, or non-uniform scaling. There is one more diagnostic feature.
This is what the proposed simple-shm looks like when the X-channel is mistaken as alpha.
Simple-shm uses XRGB buffers. If the compositor does not properly ignore the X-channel, and uses it as alpha, you will see a cross over the image. Depending on whether the compositor repaints what is below simple-shm or not, the cross will either saturate to white or show the background through. It is best to have a bright background picture to clearly see it.

I do hope no-one gets hypnotized by the animation. ;-)
Recently I drew some diagrams of how an EGL library relates to the Wayland stack. Here I am presenting the Mesa EGL version of them with the details explained.
Mesa EGL with Wayland, and simplified X as comparison.

X11 part

The X11 part of the diagram is very much simplified. It completely ignores indirect rendering, DRI1, details of DRI2, and others. It only shows, that a direct rendering X11 EGL application uses the X11 protocol to create an X11 window, and the Mesa EGL X11 platform uses the DRI2 protocol in some way to communicate with the X server. Naturally the application also uses one of the OpenGL interfaces. The X server has hardware or platform specific drivers that are generally referred to as DDX. On the Linux DRI stack, these call into libdrm and the various driver specific sub-libraries. In the end they use the kernel DRM services, like kernel mode setting (KMS). All this in the diagram is just for comparison with a Wayland stack.

Wayland server

The Wayland server in the diagram is Weston with the DRM backend. The server does its rendering using GL ES 2, which it initialises by calling EGL. Since the server runs on "bare KMS", it uses the EGL DRM platform, which could really be called the GBM platform, since it relies on the Mesa GBM interface. Mesa GBM is an abstraction of the graphics driver specific buffer management APIs (for instance the various libdrm_* libraries), implemented internally by calling into the Mesa GPU drivers.

Mesa GBM provides graphics memory buffers to Weston. Weston then uses EGL calls to bind them into GL objects, and renders into them with GL ES 2. A rendered buffer is shown on an output (monitor) by queuing a page flip via the libdrm KMS API.

If the EGL implementation offers the extension EGL_WL_bind_wayland_display, Weston will use it to register its wl_display object (facing the clients) to EGL. In practice, the Mesa EGL then adds a new global Wayland object to the wl_display. That object (or interface) is called wl_drm, and the server will automatically advertise that to all clients. Clients will use wl_drm for DRM authentication, getting the right DRM device node, and sharing graphics buffers with the server without copying pixels.

Wayland client

A Wayland client, naturally, connects to a Wayland server, and gets the main Wayland protocol object wl_display. The client creates a window, which is a Wayland object of type wl_surface. All that follows is enabled by the Wayland platform support in Mesa EGL.

The client passes the wl_display object to eglGetDisplay() and receives an EGLDisplay to be used with EGL calls. Then comes the trick that is denoted by the double-arrowed blue line from Wayland client to Mesa EGL in the diagram. The client calls the wayland-egl API (implemented in Mesa) function wl_egl_window_create() to get the native window handle. Normally you would just use the "real" native window object wl_surface (or an X11 Window if you were using X). The native window handle is used to create the EGLSurface EGL handle. Wayland has this extra step and the wayland-egl API because a wl_surface carries no information about its size. When the EGL library allocates buffers, it needs to know the size, and the wayland-egl API is the only way to tell it.

Once the EGL Wayland platform knows the size, it can allocate a graphics buffer by calling the Mesa GPU driver. Then this graphics buffer needs to be mapped into a Wayland protocol object wl_buffer. A wl_buffer object is created by sending a request through the wl_drm interface carrying the name of the (DRM) graphics buffer. On the server side, wl_drm requests are handled in the Mesa EGL library, where the corresponding server side part of the wl_buffer object is created. In the diagram this is shown as the blue dotted arrow from EGL Wayland platform to itself. Now, whenever the wl_buffer object is referenced in the Wayland protocol, the server knows exactly what it is.

The client now has an EGLSurface ready, and renders into it by using one of the GL APIs or OpenVG offered by Mesa. Finally, the client calls eglSwapBuffers() to show the result in its Wayland window.

The buffer swap in the Mesa EGL Wayland platform uses the Wayland core protocol and sends an attach request to the wl_surface, with the wl_buffer as an argument. This is the blue dotted arrow from EGL Wayland platform to Wayland server.

Weston itself processes the attach request. It knows the buffer is not a shm buffer, so it passes the wl_buffer object to the Mesa EGL library in an eglCreateImageKHR() function call. In return Weston gets an EGLImage handle, which is then turned into a 2D texture, and used in drawing the surface (window). This operation is enabled by the EGL_WL_bind_wayland_display extension.


The important facts that should be apparent in the diagram are:
  • There are two different EGL platforms in play: one for the server, and one for the clients.
  • A Wayland server does not contain any graphics hardware or driver specific code; it is all in the generic libraries of DRM, EGL and GL (libdrm and Mesa).
  • Everything about wl_drm is an implementation detail internal to the EGL library in use.
The system dependent part of Weston is the backend, which somehow must be able to drive the outputs. The new abstractions in Mesa (GBM API) make it completely hardware agnostic on standard Linux systems. Therefore a Wayland server implementation does not need its own set of graphics drivers, unlike X.

It is also worth noting that 3D graphics on X uses very much the same drivers as Wayland. However, due to the requirements Wayland places on the EGL framework (extensions, EGL Wayland platform), proprietary driver stacks need to specifically implement Wayland support, or they need to be wrapped into a meta-EGL-library that glues Wayland support on top. Proprietary drivers also need to provide a way to use accelerated graphics without X, for a Wayland server to run without X beneath. Therefore the desktop proprietary drivers like Nvidia's have a long way to go, as currently Nvidia does not implement EGL at all, has no support for Wayland, and has no support for running without X, or even for setting a video mode without X.

Due to the way wl_drm is totally encapsulated into Mesa EGL and how the interfaces are defined for the EGL Wayland platform and the EGL extension, another EGL implementor can choose their very own way of sharing graphics buffers instead of using wl_drm.

There are already plans to change some of the architecture described in this article, so it is possible that details in the diagram become outdated fairly soon. This article also does not consider a purely software rendered Wayland stack, which certainly would lift all these requirements, but would quite likely be too slow in practice for the desktop.

See also: the authoritative description of the Wayland architecture
While contracted by Collabora, I started a Wayland R&D project in October 2011 with the primary goal of getting to know Wayland, and strengthening Wayland expertise in Collabora. During the four months I started the wl_shell_surface protocol for desktops, added screen locking, ported an X screensaver to Wayland with a new protocol, and most recently implemented surface transformations in Weston (the reference compositor, originally the wayland-demos compositor). All this was sponsored by Collabora.

The project started by getting wayland-demos running under X, and then looking into the bugs I hit. To rule out problems in the hardware GL renderer, I also got the demos running with softpipe and llvmpipe. Trying to fix segmentation faults and other obvious problems was my stepping stone into the Wayland code base.

My first real piece of work was screen locking. That included adding a special protocol for it, having a way to have privileged Wayland clients, implementing locking in the shell plugin in the compositor, and writing an unlock dialog for the desktop-shell client. Those are the obvious parts. I also had to extend the shell plugin interface and find a way to hide surfaces so they do not render while the screen is locked. And of course there was bug hunting, patch set rebasing and rewriting before screen locking landed upstream.

Next was porting an X screensaver as a regular Wayland client. Once that worked, I extended the protocol by adding a screensaver interface, and made the shell plugin automatically start the screensaver application. Handling screensavers would have been a walk in the park, except I needed shell-specific data to be attached to all surfaces. I wrote a hacky solution, but in the end, Kristian Høgsberg wanted me to add a whole new interface into the shell protocol for this. It became the wl_shell_surface interface, and all demo clients needed to adopt it. Yet that was not all. Since we are used to having per-monitor screensavers, I needed my screensaver to set up a separate instance for each monitor. Hence I had to add output event callbacks in the toytoolkit.

A cleanup phase came next: I took Valgrind and ran. I fixed a pile of memory leaks and wrote missing destructor functions all over, in the compositor, the clients and the toytoolkit, at the same time collecting a Valgrind suppressions list to ease Valgrinding in the future. This work included adding some ad hoc way of cleanly exiting demo clients.

In January there were some discussions on maximised and full-screen surfaces, what they are and how they should be implemented. Surface scaling was raised as one point. Weston already had the zoom effect, and full-screen scaling would be another surface transformation, so I decided to write a transformation matrix stack for supporting any number of simultaneous transformations. It turned out to be a three-week task.

Implementing surface transformations required changes all over Weston. First, I needed a way to invert the transformation, which is a 4-by-4 matrix. After searching in vain for an MIT-licenced C implementation, I wrote one myself, based on LU decomposition. I believe LU decomposition is more efficient on a 4x4 matrix than the cofactor method. Along with the inversion routines, I wrote a unit test application for testing the speed and precision of the inversion. Detecting and dealing with non-invertible transformations is also important.
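To give an idea of what inversion by LU decomposition looks like, here is a minimal sketch. It is not the code that went into Weston: the names (struct mat4, lu_decompose(), lu_solve(), matrix_invert()), the use of double precision, and the singularity threshold are all assumptions made for illustration. The matrix is factored once with partial pivoting, and the inverse is then built column by column by solving against the unit vectors. A matrix whose pivot is too close to zero is reported as non-invertible.

```c
#include <assert.h>
#include <stdbool.h>

#define N 4

struct mat4 {
	double d[N][N];		/* row-major: d[row][col] */
};

struct lu {
	double m[N][N];		/* L below the diagonal (unit diagonal implied), U on and above */
	int perm[N];		/* row i of the factored matrix is row perm[i] of the input */
};

static double absd(double x)
{
	return x < 0 ? -x : x;
}

/* Factor a so that P*A = L*U. Returns false if a is (numerically) singular. */
bool lu_decompose(const struct mat4 *a, struct lu *f)
{
	for (int i = 0; i < N; i++) {
		f->perm[i] = i;
		for (int j = 0; j < N; j++)
			f->m[i][j] = a->d[i][j];
	}
	for (int k = 0; k < N; k++) {
		/* partial pivoting: pick the largest entry in column k */
		int pivot = k;
		for (int i = k + 1; i < N; i++)
			if (absd(f->m[i][k]) > absd(f->m[pivot][k]))
				pivot = i;
		if (absd(f->m[pivot][k]) < 1e-12)	/* arbitrary threshold, an assumption */
			return false;
		if (pivot != k) {
			for (int j = 0; j < N; j++) {
				double tmp = f->m[k][j];
				f->m[k][j] = f->m[pivot][j];
				f->m[pivot][j] = tmp;
			}
			int p = f->perm[k];
			f->perm[k] = f->perm[pivot];
			f->perm[pivot] = p;
		}
		/* eliminate below the pivot, storing the multipliers as L */
		for (int i = k + 1; i < N; i++) {
			f->m[i][k] /= f->m[k][k];
			for (int j = k + 1; j < N; j++)
				f->m[i][j] -= f->m[i][k] * f->m[k][j];
		}
	}
	return true;
}

/* Solve A*x = b using an existing factorisation of A. */
void lu_solve(const struct lu *f, const double b[N], double x[N])
{
	/* forward substitution: L*y = P*b */
	for (int i = 0; i < N; i++) {
		x[i] = b[f->perm[i]];
		for (int j = 0; j < i; j++)
			x[i] -= f->m[i][j] * x[j];
	}
	/* back substitution: U*x = y */
	for (int i = N - 1; i >= 0; i--) {
		for (int j = i + 1; j < N; j++)
			x[i] -= f->m[i][j] * x[j];
		x[i] /= f->m[i][i];
	}
}

/* Write the inverse of a into inv. Returns false if a is not invertible. */
bool matrix_invert(const struct mat4 *a, struct mat4 *inv)
{
	struct lu f;

	assert(a && inv);
	if (!lu_decompose(a, &f))
		return false;
	for (int c = 0; c < N; c++) {
		double e[N] = { 0 }, x[N];

		e[c] = 1.0;		/* c-th unit vector gives the c-th column */
		lu_solve(&f, e, x);
		for (int r = 0; r < N; r++)
			inv->d[r][c] = x[r];
	}
	return true;
}
```

Keeping the factorisation separate from the solve step means the factored form could also be reused to transform single points without ever forming the full inverse.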

Going through the transformation stack every time you need to transform a point might be costly, so I added a cached total transform and its inverse. Implementing input redirection was a simple matter of applying the inverse total transform to pointer coordinates. Needing a way to test transformations, I added a Weston key binding for rotating surfaces, and modified an existing demo application to mark the clicked point. Adding functions for explicitly converting between display global coordinates and surface local coordinates (surface local are the only ones a client knows of) clarified some of the coordinate computations.
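The caching idea can be sketched roughly as follows. This is a simplified illustration, not Weston's implementation: it uses 2D affine transforms instead of 4x4 matrices to keep the example short, and every name in it (struct affine, struct surface, surface_push_transform(), surface_to_global(), surface_from_global()) is hypothetical. The total transform and its inverse are recomputed only when the stack has changed, and input redirection is just the inverse applied to the pointer position.

```c
#include <assert.h>
#include <stdbool.h>

/* x' = a*x + b*y + tx,  y' = c*x + d*y + ty */
struct affine {
	double a, b, c, d, tx, ty;
};

#define MAX_TRANSFORMS 8

struct surface {
	struct affine stack[MAX_TRANSFORMS];	/* applied in order, index 0 first */
	int count;
	bool dirty;			/* cached values are out of date */
	bool invertible;
	struct affine total;		/* cached product of the whole stack */
	struct affine inverse;		/* cached inverse of total */
};

static const struct affine affine_identity = { 1, 0, 0, 1, 0, 0 };

/* The transform that applies first, then second. */
static struct affine affine_compose(struct affine first, struct affine second)
{
	struct affine r;

	r.a = second.a * first.a + second.b * first.c;
	r.b = second.a * first.b + second.b * first.d;
	r.c = second.c * first.a + second.d * first.c;
	r.d = second.c * first.b + second.d * first.d;
	r.tx = second.a * first.tx + second.b * first.ty + second.tx;
	r.ty = second.c * first.tx + second.d * first.ty + second.ty;
	return r;
}

static bool affine_invert(struct affine m, struct affine *out)
{
	double det = m.a * m.d - m.b * m.c;

	if (det == 0.0)
		return false;
	out->a = m.d / det;
	out->b = -m.b / det;
	out->c = -m.c / det;
	out->d = m.a / det;
	out->tx = -(out->a * m.tx + out->b * m.ty);
	out->ty = -(out->c * m.tx + out->d * m.ty);
	return true;
}

void surface_init(struct surface *s)
{
	s->count = 0;
	s->dirty = true;
}

bool surface_push_transform(struct surface *s, struct affine t)
{
	if (s->count == MAX_TRANSFORMS)
		return false;
	s->stack[s->count++] = t;
	s->dirty = true;		/* invalidate the cache */
	return true;
}

static void surface_update_cache(struct surface *s)
{
	if (!s->dirty)
		return;
	s->total = affine_identity;
	for (int i = 0; i < s->count; i++)
		s->total = affine_compose(s->total, s->stack[i]);
	s->invertible = affine_invert(s->total, &s->inverse);
	s->dirty = false;
}

/* Surface local coordinates to display global coordinates. */
void surface_to_global(struct surface *s, double sx, double sy,
		       double *gx, double *gy)
{
	assert(gx && gy);
	surface_update_cache(s);
	*gx = s->total.a * sx + s->total.b * sy + s->total.tx;
	*gy = s->total.c * sx + s->total.d * sy + s->total.ty;
}

/* Display global to surface local, e.g. for pointer input.
 * Fails if the total transform is not invertible. */
bool surface_from_global(struct surface *s, double gx, double gy,
			 double *sx, double *sy)
{
	assert(sx && sy);
	surface_update_cache(s);
	if (!s->invertible)
		return false;
	*sx = s->inverse.a * gx + s->inverse.b * gy + s->inverse.tx;
	*sy = s->inverse.c * gx + s->inverse.d * gy + s->inverse.ty;
	return true;
}
```

For example, pushing a quarter-turn rotation followed by a translation of (100, 50) maps the surface local point (10, 0) to the global point (100, 60), and surface_from_global() maps it back.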

Surface painting and damage region tracking needed fixes, too. Previously, a zoomed surface was repainted as a whole, and it forced a full display redraw, i.e. damaging the whole display. Transformed surface repaint needed to start honoring the repaint regions, so we could avoid excessive repainting. Damage and repaint regions are tracked as global coordinate axis aligned rectangles. Whenever a transformed surface is damaged (requires repainting), we need to compute the bounding box for the damage instead of simply using the global x, y of the top-left corner and the surface width, height. Then during surface painting, we take the list of damage rectangles, and render only those. Surface local coordinates (texture coordinates) are computed via the inverse transformation. This method may result in sampling outside of a surface's buffer (texture), so those samples need to be discarded in the fragment shader.
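The bounding box computation for a transformed rectangle can be sketched like this. Again this is only an illustration under simplifying assumptions, not the Weston code: it uses a 2D affine transform, and struct affine, struct box and transformed_bounding_box() are hypothetical names. All four corners are transformed into global coordinates, and the result is the smallest axis aligned integer rectangle that contains them, rounded outwards so that no damaged pixel is missed.

```c
#include <assert.h>

/* x' = a*x + b*y + tx,  y' = c*x + d*y + ty */
struct affine {
	double a, b, c, d, tx, ty;
};

/* Axis aligned rectangle in global coordinates; x2 and y2 are exclusive. */
struct box {
	int x1, y1, x2, y2;
};

/* Round down and up without pulling in libm. */
static int floor_int(double v)
{
	int i = (int)v;

	return i > v ? i - 1 : i;
}

static int ceil_int(double v)
{
	int i = (int)v;

	return i < v ? i + 1 : i;
}

/* Bounding box, in global coordinates, of the surface local rectangle
 * (x, y, w, h) after applying the surface's total transform m. */
struct box transformed_bounding_box(const struct affine *m,
				    double x, double y, double w, double h)
{
	const double cx[4] = { x, x + w, x, x + w };
	const double cy[4] = { y, y, y + h, y + h };
	double min_x = 0, min_y = 0, max_x = 0, max_y = 0;
	struct box b;

	assert(w >= 0 && h >= 0);
	for (int i = 0; i < 4; i++) {
		double gx = m->a * cx[i] + m->b * cy[i] + m->tx;
		double gy = m->c * cx[i] + m->d * cy[i] + m->ty;

		if (i == 0 || gx < min_x)
			min_x = gx;
		if (i == 0 || gx > max_x)
			max_x = gx;
		if (i == 0 || gy < min_y)
			min_y = gy;
		if (i == 0 || gy > max_y)
			max_y = gy;
	}
	b.x1 = floor_int(min_x);
	b.y1 = floor_int(min_y);
	b.x2 = ceil_int(max_x);
	b.y2 = ceil_int(max_y);
	return b;
}
```

For a rotated surface the box is larger than the surface itself, which is why the repaint has to discard samples that fall outside the surface's texture.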

Other things that needed fixing after the surface transformations were window move and resize. Before fixing, moving a surface would not follow the pointer but move in the surface local orientation. Resize needed the same orientation fix, and another fix in relative surface motion that a client can set in the surface's attach request.

What you mostly see as the result of the surface transformations work is that you can rotate any normal window, no application support needed. The pointer position on screen, over a window, accurately corresponds to what the application receives as the local pointer location. I did not realise it at the time, but this input redirection working flawlessly became an appreciated feature. Apparently it is hard or impossible to do in X; I would not know. In Wayland, and for me, it was just another relatively easy bug to be fixed. The window rotation feature was meant purely for debugging surface transformations.

Two rotated windows and some flowers.
There are still further issues to be fixed with surface transformations. Relative surfaces, like pop-up windows and menus, are not transformed and appear at a wrong location. Pointer cursors are not transformed; you would want the text bar cursor to be aligned with the text orientation. Continuously resizing a transformed window from its (locally) top-left corner makes the window drift away. We are probably still damaging larger regions than absolutely necessary for repaints. Repaint optimisation of opaque surfaces does not work with transformations.

During all this work of four months there were also the usual bug hunts, enhancements and fixes all over. For example, decorationless EGL apps, which turned out to have been a bug in Cairo, and moving the configuration file parser into a helper library that is shared between clients and the compositor.

Now I am done with the Wayland R&D project and moving on to another project at Collabora. In the new project I will continue working on Wayland, Weston, and the demos.
I recently got a Nokia N9 phone. One of the first things I did was copy my music collection onto it. Since the player also shows album cover images, if such are stored, I started adding them -- not by embedding them into ID3v2 tags but as separate files, to avoid useless copies of images.

Usually it is as simple as putting a cover.jpg file into a directory that contains a single album. In some cases, though, that does not work. I found out that the N9's default music player is supposed to follow the Media Art Storage specification. That gave me hints.

If a directory contains more than one album, you can name the cover image files according to the album, for example 'Back in Black.jpg' and 'Flick of the Switch.jpg', as long as the names correspond to the ID3 tag album name (somehow?).

My real problem case was a directory full of songs downloaded from Nectarine. I edited them all (EasyTAG is a wonderful tool) to make the ID3 album tag "Nectarine" because I wanted to have them all under the same "album", and there are over 50 songs in that single directory. Simply adding a cover.jpg or Nectarine.jpg did not work.

There are two possible reasons that I found. First, the directory contains too many files, according to the Media Art Storage spec. Second, apparently the cover art is not taken into use unless at least one song file that would use that cover art is touched (modification date updated).

I created a new directory, moved one Nectarine song into it, and put Nectarine.jpg there, too. And it started to work, for all my Nectarine songs.

There is software called Tracker in the N9, which maintains some sort of database of all media. Album cover art is also used via Tracker. If you ssh into your phone and move your media files around, a Tracker update is not automatically triggered. You could use the command tracker-control -r to force a full rebuild the next time you launch e.g. the music player, but the rebuild will take a long time. An easy way to force a faster rebuild is to plug the N9 into a computer via USB, and then unplug it.
Now that screen locking is done in Wayland demos, it is time to go for the eye-candy: full-screen idle animations, also known as screensavers. The first step was to port an existing screensaver to Wayland. I chose glmatrix from XScreenSaver, because it is cool, and it renders with OpenGL. This way I did not have to port Xlib based rendering to Cairo (yay!).

Here is GLMatrix running as a regular, windowed application on Wayland, using the toytoolkit:
GLMatrix on the Wayland demo compositor.
On Wayland, screensavers can be reduced to pure animation applications, while the compositor handles everything about locking. Next, we need a Wayland protocol extension to actually use this idle-animation in a screensaver'y way.

GLMatrix is already in the Wayland demo repository as a client called wscreensaver, and it requires cairo-gl, just like gears does.