October 20, 2016

Bloom Filter

The bloom filter makes bright objects glow and bleed through other objects positioned between them and the camera. It is a common post-processing effect used widely in video games and animated movies. The demo supports a couple of configuration options that control the intensity and behavior of the filter; here are some screenshots with different settings:

Bloom filter Off
Bloom filter On, default settings
Bloom filter On, intensity increased

I particularly like the glow effect that this brings to the specular reflections on the water surface, although to really appreciate that you need to run the demo and see it in motion.

Cascaded Shadow Maps

I should really write a post about basic shadow mapping before going into the details of Cascaded Shadow Maps, so for now I’ll just focus on the problem they try to solve.

One of the problems with shadow mapping is rendering high resolution shadows, especially for shadows that are rendered close to the camera. Generally, basic shadow mapping provides two ways in which we can improve the resolution of the shadows we render:

1. Increase the resolution of the shadow map textures. This one is obvious but comes at a high performance (and memory) hit.

2. Reduce the distance at which we can render shadows. But this is not ideal of course.

One compromise solution is to notice that, as usual with 3D computer graphics, it is far more important to render nearby objects in high quality than distant ones.

Cascaded Shadow Maps allow us to use different levels of detail for shadows that are rendered at different distances from the camera. Instead of having a single shadow map for all the shadows, we split the viewing frustum into slices and render shadows in each slice to a different shadow map.
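
To make the idea of frustum splitting concrete, here is a minimal sketch (not the demo's actual code; the number of cascades and the lambda blend weight are assumptions for the example) that computes the split distances using the common blend of uniform and logarithmic schemes:

#include <math.h>

/* Sketch: compute the far distance of each cascade between the near and
 * far planes. lambda = 0 gives uniform splits, lambda = 1 gives purely
 * logarithmic splits; values in between are a common compromise.
 * Not the demo's actual code. */
static void
compute_cascade_splits(float near_plane, float far_plane, int num_cascades,
                       float lambda, float *splits)
{
   for (int i = 1; i <= num_cascades; i++) {
      float si = (float) i / num_cascades;
      float log_split = near_plane * powf(far_plane / near_plane, si);
      float uni_split = near_plane + (far_plane - near_plane) * si;
      splits[i - 1] = lambda * log_split + (1.0f - lambda) * uni_split;
   }
}

Each cascade level then covers the slice of the frustum between consecutive split distances, rendered into its own shadow map.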

There are two immediate benefits of this technique:

1. We have the flexibility to define the resolution of the shadow map for each level of the cascade, allowing us, for example, to increase the resolution of the levels closest to the camera and maybe reduce it for those that are further away.

2. Each level only records shadows in a slice of the viewing frustum, which increases effective shadow resolution even if each level keeps the same texture resolution we used for the original single shadow map implementation.

This approach also has some issues:

1. We need to render multiple shadow maps, which can be a serious performance hit depending on the resolutions of the shadow maps involved. This is why we usually lower the resolution of the shadow maps as distance from the camera increases.

2. As we move closer to or further from shadowed objects we can see the changes in shadow quality pop in. Of course, we can control this by avoiding drastic quality changes between consecutive levels in the cascade.

Here is an example that illustrates the second issue (in this case I have lowered the resolution of the 2nd and 3rd cascade levels to 50% and 25% respectively so that the effect is more obvious). The screenshots show the rendering of the shadows at different distances. We can see how the shadows in the close-up shot are very sharp, and as the distance increases they become blurrier due to the use of a lower resolution shadow map:

CSM level 0 (4096×4096)
CSM level 1 (2048×2048)
CSM level 2 (1024×1024)

The demo supports up to 4 shadow map levels, although the default configuration is to use 3. The resolution of each level can be configured separately too; in the default configuration I lowered the shadow resolution of the second and third levels to 75% and 50% respectively. If we configure the demo to run on a single level (with 100% texture resolution), we are back to the original shadow map implementation, so it is easy to experiment with both techniques.

I intend to cover the details behind shadow mapping and the implementation of the bloom filter in more detail in a future post, so again, stay tuned for more!

October 18, 2016
The most visible change this week is a fix to Mesa for texture upload performance.  A user had reported that selection rectangles in LXDE's file manager were really slow.  I brought up sysprof, and it showed that we were doing uncached reads from the GPU in a situation that should have been entirely write-combined writes.

The bug was that when the size of the texture wasn't aligned to utiles, we were loading the previous contents into the temporary buffer before writing the new texture data in and uploading, even if the full texture was being updated.  I've fixed it to check for when the full texture is being uploaded, and not do the initial download.  This bug was getting hit on almost any Cairo vector graphics operation with its Xlib backend, so hopefully this helps a lot of people's desktops.
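
As a rough illustration of the idea behind the fix (a hypothetical sketch with made-up types and helper names, not the actual Mesa/vc4 code):

#include <stdbool.h>

/* Hypothetical sketch, not the actual Mesa/vc4 code: only read the previous
 * contents back from the GPU when the upload does not cover the whole
 * texture, since a full upload overwrites every utile anyway. */
struct texture { unsigned width, height; };

extern void read_texture_into_temp(struct texture *tex, void *temp);   /* slow, uncached read */
extern void store_texels(void *temp, const void *data,
                         unsigned x, unsigned y, unsigned w, unsigned h);
extern void upload_temp_to_texture(struct texture *tex, void *temp);

static void
upload_subimage(struct texture *tex, void *temp, const void *data,
                unsigned x, unsigned y, unsigned width, unsigned height)
{
        bool full_upload = (x == 0 && y == 0 &&
                            width == tex->width && height == tex->height);

        if (!full_upload) {
                /* Partial update: fetch the old texels so the untouched
                 * parts of the utile-aligned temporary buffer stay valid. */
                read_texture_into_temp(tex, temp);
        }

        /* The new texels only need write-combined writes. */
        store_texels(temp, data, x, y, width, height);
        upload_temp_to_texture(tex, temp);
}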

I also worked on a cleanup of the simulator mode.  I use the closed source simulator regularly as part of my work -- it's fairly accurate to hardware behavior, and allows you to trace what's happening when things go wrong, all with the driver code running like "normal" for apps on my x86 desktop.

However, my simulator support is a little invasive to the driver, replacing vc4 ioctl calls with alternative ioctls to the i915 driver.  I started on a series to make vc4_simulator.c take plain VC4 ioctls and translate them, so that the simulator code is entirely contained.  The last step I think I have is figuring out which level to put the simulator's copying in and out of window system framebuffers at.

Last, I got DEQP's GLES2 tests up and running on Raspberry Pi.  These are approximately equivalent to the official conformance tests.  The initial results were quite good -- 18630/19098 (97.5%) passing when run in the piglit framework.  I found a couple of fixes to be made in glClear() support, one of which will affect all gallium drivers.  Of the remainder, there are a few tests that are failing due to test bugs (we expose extensions that allow features the tests don't expect), but most are failing in register allocation.  For the register allocation failures, I see a relatively quick fix that should reduce register pressure in loops.

For the last 3 years, I've been working in the OpenStack Telemetry team at eNovance, and then at Red Hat. Our mission is to maintain the OpenStack Telemetry stack, both upstream and downstream (i.e. inside Red Hat products). Besides the technical challenges, the organization of the team has always played a major role in our accomplishments.

Here, I'd like to share some of the lessons I've learned in hindsight with you, faithful readers.

Meet the team

The team I work in has changed a bit during those 3 years, but the core components have always been the same: a few software engineers, a QE engineer, a product owner, and an engineering manager. That means the team size has always been between 6 and 8 people.

I cannot emphasize enough how important team size is. Not having more than 8 people in a team fits with the two-pizza rule from Jeff Bezos, which turned out to be key in our team composition.

The group dynamic in teams no bigger than this is excellent. It offers the possibility to know and connect with everyone – each team member has only up to 7 people to talk to on a daily basis, which means only 28 communication axes between people. A team of e.g. 16 people means 120 different links in your team. Double your team size, and you roughly multiply your communication overhead by 4. My experience shows that the fewer communication axes you have in a team, the less overhead you will have and the swifter your team will be.

With all team members being remote workers, it is even more challenging to build relationships and bond. We had the opportunity to get to know each other during the OpenStack summit twice a year, and doing regular video-conferences via Google Hangouts or BlueJeans really helped.

The atmosphere you set up in your team will also shape the outcome of your team's work. Run your team with trust, peace and humor (remember, I'm on the team 🤣) and awesome things will happen. Run it with fear, pressure and finger-pointing, and nothing good will happen.

There's little chance that when a team is built, everyone will be at the same level. We were no exception: we had both more and less experienced engineers. But the most experienced engineers took the time needed to invest in and mentor the less experienced ones. That also helped build trust and communication links between members of the team. And over the long run, everyone gets more efficient: the less experienced engineers get better and the more experienced ones can delegate a lot of stuff to their fellows.

Then they can chill or work on bigger stuff. Win-win.

It's actually not that different from the way you should run an open source team, as I already claimed in a previous article on FOSS project management.

Practicing agility

I might be bad at practicing agility: contrary to many people, I don't see agility as a set of processes. I see it as a state of mind, as a team organization based on empowerment. No more, no less.

And each time I meet people and explain that our team is "agile", they start shivering, explaining how they hate sprints, daily stand-ups, scrum, and planning poker, that this is all a waste of time and energy.

Well, it turns out that you can be agile without all of that.

Planning poker

In our team, we tried at first to run 2-week sprints and used planning poker to schedule the user stories from our product backlog (i.e. our to-do list). It never worked as expected.

First, most people felt they were losing their time because they already knew exactly what they were supposed to do. If they had any doubt, they would just go and talk to the product owner or another fellow engineer.

Secondly, some stories were really specialized and only one team member was able to understand them in detail and evaluate them. So most of the time, the rest of the team members playing planning poker would just vote a random number based on the length of the storyteller's explanation. For example, if an engineer said "I just need to change that flag in the configuration file", then everyone would vote 1. If they started rambling for five minutes about how the configuration option is easy to switch, but there might be other things to change at the same time, things whose impact could be bigger than expected, and code refactoring to do, then most people would just announce a score of 13 on that story. Just because the person talked for minutes straight and everything sounded complicated and out of their scope.

That meant that the poker score had no meaning to us. We never managed to have a number of points that we knew we could accomplish during a sprint (the team velocity as they call it).

The only benefit that we identified from planning poker, in our case, is that it forces people to sit down and communicate about a user story. However, it turned out that making people communicate was not a problem we needed to solve in our team, so we decided to stop doing it. Still, it can be a pretty good tool to make people talk to each other.

Therefore, the 2-week sprint never made much sense as we were unable to schedule our work reliably. Furthermore, since we do most of our daily job in open source communities, we were unable to schedule anything. When sending patches to an upstream project, you have no clue when they will be reviewed. What you know for sure is that in order to maximize your code merge throughput with this high latency of code review, you need to parallelize your patch submissions a lot. So as soon as you receive some feedback from your reviewers, you need to (almost) drop everything, rework your code and resubmit it.

There's no need to explain why this absolutely does not work with a sprint approach. Most of the scrum framework relies on the fact that you own your workflow from top to bottom, which is far from true when working in open source communities.

Daily stand-up meetings

We used to run a daily stand-up meeting every day, then every other day. Doing that remotely kills the stand-up part, obviously, so there is less of a guarantee that the meeting will be short. Considering all team members work remotely in different time zones, with some freedom to organize their schedules, it was very difficult to synchronize those meetings. With members spread from the US to Eastern Europe, the meeting was in the middle of the afternoon for me, and I found it frustrating to have to stop my activities in the middle of every afternoon to chat with my team. We all know the cost of context switching for us humans.

So we drifted from our 10 minutes daily meeting to a one-hour weekly meeting with the whole team. It's way easier to synchronize for a large chunk of time once a week and to have this high-throughput communication channel.

Our (own) agile framework

Drifting from the original scrum implementation, we ended up running our own agility framework. It turned out to have similarities with kanban – you don't always have to invent new things!

Our main support is a Trello board that we share with the whole team. It consists of different columns, where we put cards representing small user stories or simple to-do items. Each column represents the state of a card, and we move cards from left to right:

  • Ideas: where we put things we'd like to do or dig into, but there's no urgency. It might lead to new, smaller ideas, in the "To Do" column.
  • To Do: where we put real things we need to do. We might run a grooming session with our product manager if we need help prioritizing things, but it's usually not necessary.
  • Epic: here we create a few bigger cards that regroup several To Do items. We don't move them around, we just archive them when they are fully implemented. There are only 5-6 big cards here at max, which are the long term goals we work on.
  • Doing: where we move cards from To Do when we start working on them. At this stage, we also add the people working on the task to the card, so we can see the little faces of those involved.
  • Under review: 90% of our job being done upstream, we usually move cards that are done and waiting for feedback from the community to this column. When the patches are approved and the card is complete, we move the card to Done. If a patch needs further improvement, we move the card back to Doing, work on it, and then move it back to Under review when it is resubmitted.
  • On hold / blocked: some of the tasks we work on might be blocked by external factors. We move cards there to keep track of them.
  • Done during week #XX: we create a new list every Monday to stack our done cards by week. This is just easier to display, and it allows us to see the cards that we complete each week. We archive lists older than a month from time to time. It gives great visual feedback on what has been accomplished and merged every week.

We started to automate some of our Trello workflow in a tool called Trelloha. For example, it allows us to track upstream patches sent through Gerrit or GitHub and tick the checkbox items in any card when those are merged.

We actually don't put much effort into our Trello board. It's just slightly organized chaos, as are upstream projects. We use it as a lightweight system for taking notes, organizing our thoughts and letting others know what we're doing and why we're doing it. That's where Trello is wonderful: using it has very low friction, as creating, updating and moving cards is a one-click operation.

One bias of most engineers is to overthink and over-engineer their workflow, trying to rationalize it. Most of the time, they end up over-automating, which means building processes and bureaucracy. It just slows things down and builds frustration for everyone. Just embrace chaos and spend time on what matters.

Most of the things we do are linked to external Launchpad bugs, Gerrit reviews or GitHub issues. That means the cards in Trello carry very little information, as everything happens outside, in the wild Internet of open source communities. This is very important, as we need to avoid any kind of retention of knowledge and information from contributors outside the company. It also makes sure that our internal way of running things does not leak outside and (badly) influence outside communities.


We also run a retrospective every 2 weeks, which might be the only thing we kept from the scrum practice. It's actually a good opportunity for us to share our feelings, concerns or jokes. We used to do it using the six thinking hats method, but it slowly faded away. In the end, we now use a different Trello board with those columns:

  • Good 😄
  • Hopes and Wishes 🎁
  • Puzzles and Challenges 🌊
  • To improve 😡
  • Action Items 🤘

All teammates fill the board with the cards they want, and everyone is free to add themselves to any card. We then run through each card and let the people who added their name to it talk about it. The "Action Items" column is usually filled as we speak and discover things we should do. We can then move cards created there to our regular board, in the To Do column.

Central communication

Sure, people have different roles in a team, but we dislike bottlenecks and single points of failure. Therefore, we use an internal mailing list where we ask people to send their requests and messages. If people send things related to our team's job to one of us personally, we just forward or Cc the list when replying, so everyone is aware of what one of us might be discussing with people external to the team.

This is very important, as it emphasizes that no team member should be considered special. Nobody owns more information and knowledge than the others, and anybody can jump into a conversation if they have valuable knowledge to share.

The same applies for our internal IRC channel.

We also make sure that we discuss only company-specific things on this list or on our internal IRC channel. Everything that can be public and is related to upstream is discussed on external communication media (IRC, upstream mailing lists, etc). This is very important to make sure that we are not blocking anybody outside Red Hat from joining us and contributing to the projects or ideas we work on. We also want to make sure that people working in our company are no more special than other contributors.


We're pretty happy with our set-up right now, and the team has been running pretty smoothly for a few months. We're still trying to improve, and having a general sense of trust among team members makes sure we can openly speak about whatever problem we might have.

Feel free to share your feedback and own experience of running your own teams in the comment section.

October 13, 2016
Lots of individuals and companies have made substantial contributions to Apitrace. But maintenance has always rested with me, the original author.

I would have preferred to share that responsibility with a wider team, but things haven't turned out that way, for several reasons I suppose:

  • There are many people that care about one section of the functionality (one API, one OS), but few care about all of them.
  • For all existing and potential contributors, including me, Apitrace is merely a means to an end (testing/debugging graphics drivers or applications), not an end in itself. The other stuff always gets top priority.
  • There are more polished tools for newer generation APIs like Vulkan, Metal, Direct3D 12. These newer APIs are much leaner than legacy APIs, which eliminates a lot of design constraints. And some of these tools have large teams behind them.
  • Last but not least, I failed to nurture such a community. I always kept close control, partly to avoid things becoming a hodgepodge, partly from fear of breakage, but I can't shake the feeling that if I had been more relaxed things might have turned out differently.

Apitrace has always been something I worked on in my spare time, or whenever I had an itch to scratch. That is still true, with the exception that after having a kid I have scarcely any free time left.

Furthermore, the future is not bright: I believe Apitrace will have a long life in graphics driver test automation, and perhaps whenever somebody needs to debug an old OpenGL application, but I doubt it will flourish beyond that. And this fact weighs in whenever I need to decide whether to spend some time on Apitrace versus everything else.

The end result is that I haven't been a responsive maintainer for some time (taking a long time to merge patches, provide feedback, resolve issues, etc.), and I'm afraid that will continue for the foreseeable future.

I don't feel any obligation to do more (after all, the license does say the software is provided as is), but I do want to set the right expectations to avoid frustrating users/contributors who might otherwise expect timely feedback, hence this post.

In this post I’ll discuss how I set up and render the terrain mesh in the OpenGL terrain rendering demo. Most of the relevant code for this is in the ter-terrain.cpp file.

Setting up a grid of vertices

Unless you know how to use a 3D modeling program properly, a reasonable way to create a decent mesh for a terrain consists of using a grid of vertices and elevating them according to a height map image. In order to create the grid we only need to decide how many rows and columns we want. This, in the end, determines the number of polygons and the resolution of the terrain.

We need to map these vertices to world coordinates too. We do that by defining a tile size, which is the distance between consecutive vertices in world units. Larger tile sizes increase the size of the terrain but lower the resolution by creating larger polygons.
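
A sketch of what this setup might look like (hypothetical names and types, not the demo's actual code):

/* Sketch: lay out a width x depth grid of vertices on the XZ plane, spaced
 * tile_size world units apart. The Y coordinate is filled in later from the
 * height map. Hypothetical names/types, not the demo's actual code. */
typedef struct {
   float x, y, z;
} vec3;

static void
build_grid(vec3 *vertices, int width, int depth, float tile_size)
{
   for (int vz = 0; vz < depth; vz++) {
      for (int vx = 0; vx < width; vx++) {
         vec3 *v = &vertices[vz * width + vx];
         v->x = vx * tile_size;
         v->y = 0.0f;               /* elevated later from the height map */
         v->z = vz * tile_size;
      }
   }
}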

The image below shows an 8×6 grid that defines 35 tiles. Each tile is rendered using 2 triangles:

8×6 terrain grid

The next step is to elevate these vertices so we don’t end up with a boring flat surface. We do this by sampling the height map image for each vertex in the grid. A height map is a gray scale image where the values of the pixels represent altitudes at different positions. The closer the color is to white, the more elevated it is.

Heightmap image

Adding more vertices to the grid increases the number of sampling points from the height map and reduces the sampling distances, leading to a smoother and more precise representation of the height map in the resulting terrain.

Sampling the heightmap to compute vertex heights

Of course, we still need to map the height map samples (gray scale colors) to altitudes in world units. In the demo I do this by normalizing the color values to [-1,+1] and then applying a scale factor to compute the altitude values in world space. By playing with the scaling factor we can make our terrain look more or less abrupt.

Altitude scale=6.0
Altitude scale=12.0

For reference, the height map sampling is implemented in ter_terrain_set_heights_from_texture().
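
To make the mapping concrete, here is a minimal sketch of the idea (not necessarily how ter_terrain_set_heights_from_texture() implements it):

/* Sketch: map an 8-bit gray scale sample (0..255) to a world-space
 * altitude. Not necessarily the demo's exact implementation. */
static float
altitude_from_sample(unsigned char sample, float scale)
{
   float h = (sample / 255.0f) * 2.0f - 1.0f;   /* normalize to [-1, +1] */
   return h * scale;                            /* scale to world units  */
}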

Creating the mesh

At this point we know the full position (x, y, z) in world coordinates of all the vertices in our grid. The next step is to build the actual triangle mesh that we will use to render the terrain and the normal vectors for each triangle. This process is described below and is implemented in the ter_terrain_build_mesh() function.

Computing normals

In order to get nice lighting on our terrain we need to compute normals for each vertex in the mesh. A simple way to achieve this would be to compute the normal for each face (triangle) and use that normal for each vertex in the triangle. This works, but it has 3 problems:

1. Every vertex of each triangle has the same exact normal, which leads to a rather flat result.

2. Adjacent triangles with different orientations showcase abrupt changes in the normal value, leading to significantly different lighting across the surfaces that highlight the individual triangles in the mesh.

3. Because each vertex in the mesh can have a different normal value for each triangle it participates in, we need to replicate the vertices when we render, which is not optimal.

Alternatively, we can compute the normal for each vertex considering the heights of its neighboring vertices. This solves all the problems mentioned above and produces much better results thanks to the interpolation of the normal vectors across the triangles, which gives smooth lighting transitions:

Flat normals
Smooth normals

The implementation for this is in the function calculate_normal(), which takes the column and row indices of the vertex in the grid and computes the normal by sampling the heights of the 4 neighboring vertices in the grid.
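
A common way to implement this kind of smooth per-vertex normal is to use central differences over the neighboring heights. Here is a sketch (reusing the vec3 type from the earlier snippet and assuming a hypothetical height(col, row) lookup that clamps to the grid bounds; this is not necessarily identical to calculate_normal()):

#include <math.h>

extern float height(int col, int row);   /* hypothetical clamped height lookup */

/* Sketch: per-vertex normal from the heights of the 4 neighboring vertices
 * on a regular grid with spacing tile_size. For a surface y = f(x, z) the
 * normal is proportional to (-df/dx, 1, -df/dz). */
static vec3
grid_normal(int col, int row, float tile_size)
{
   float hl = height(col - 1, row);   /* left  */
   float hr = height(col + 1, row);   /* right */
   float hd = height(col, row - 1);   /* down  */
   float hu = height(col, row + 1);   /* up    */

   vec3 n = { (hl - hr) / (2.0f * tile_size),
              1.0f,
              (hd - hu) / (2.0f * tile_size) };

   float len = sqrtf(n.x * n.x + n.y * n.y + n.z * n.z);
   n.x /= len; n.y /= len; n.z /= len;
   return n;
}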

Preparing the draw call

Now that we know the positions of all the vertices and their normal vectors we have all the information that we need to render the terrain. We still have to decide how exactly we want to render all the polygons.

The simplest way to render the terrain using a single draw call is to setup a vertex buffer with data for each triangle in the mesh (including position and normal information) and use GL_TRIANGLES for the primitive of the draw call. This, however, is not the best option from the point of view of performance.

Because the terrain will typically contain a large number of vertices and most of them participate in multiple triangles, we end up uploading a large amount of vertex data to the GPU and processing a lot of vertices in the draw call. The result is large memory requirements and suboptimal performance.

For reference, the terrain I used in the demo from my original post used a 251×251 grid. This grid represents 250×250 tiles, each one rendered as two triangles (6 vertices/tile), so we end up with 250x250x6=375,000 vertices. For each of these vertices we need to upload 24 bytes of vertex data with the position and normal, so we end up with a GPU buffer that is almost 9MB large.

One obvious way to reduce this is to render the terrain using triangle strips. The problem with this is that, in theory, we can’t render the terrain with just one strip: we would need one strip (and so one draw call) per tile column, or one strip per tile row. Fortunately, we can use degenerate triangles to link the separate strips for each column into a single draw call. With this we trim the number of vertices down to 126,000 and the size of the buffer to a bit below 3 MB. This alone produced a 15%-20% performance increase in the demo.

We can do better though. A lot of the vertices in the terrain mesh participate in various triangles across the large triangle strip in the draw call, so we can reduce memory requirements by using an index buffer to render the strip. If we do this, we trim things down to 63,000 vertices and ~1.5MB. This added another 4%-5% performance bonus over the original implementation.
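
For illustration, this is roughly what building the index list for a single strip linked with degenerate triangles looks like (a sketch assuming a width × depth vertex grid stored row-major; not the demo's exact code):

/* Sketch: build one GL_TRIANGLE_STRIP index list covering a width x depth
 * vertex grid (row-major), joining the per-row strips with degenerate
 * triangles. Returns the number of indices written. Not the demo's code. */
static unsigned
build_strip_indices(unsigned *indices, int width, int depth)
{
   unsigned n = 0;

   for (int z = 0; z < depth - 1; z++) {
      if (z > 0) {
         /* Degenerate triangles: repeat the last index of the previous
          * strip and the first index of the next one. They produce
          * zero-area triangles that the GPU discards, but let us keep
          * everything in a single draw call. */
         indices[n++] = z * width + (width - 1);
         indices[n++] = z * width;
      }
      for (int x = 0; x < width; x++) {
         indices[n++] = z * width + x;         /* vertex in row z     */
         indices[n++] = (z + 1) * width + x;   /* vertex in row z + 1 */
      }
   }
   return n;   /* (depth - 1) * 2 * width + (depth - 2) * 2 indices */
}

After uploading these indices to an element array buffer, the whole terrain can be drawn with a single glDrawElements(GL_TRIANGLE_STRIP, n, GL_UNSIGNED_INT, 0) call.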


So far we have been rendering the full mesh of the terrain in each frame, and we do this by uploading the vertex data to the GPU just once (for example in the first frame). However, depending on where the camera is located and where it is looking, just a fraction of the terrain may be visible.

Although the GPU will discard all the geometry and fragments that fall outside the viewport, it still has to process each vertex in the vertex shader stage before it can clip non-visible triangles. Because the number of triangles in the terrain is large, this is suboptimal and to address this we want to do CPU-side clipping before we render.

Doing CPU-side clipping comes with some additional complexities though: it requires that we compute the visible region of the terrain and upload new vertex data to the GPU in each frame, all while preventing GPU stalls.

In the demo, we implement the clipping by computing a quad sub-region of the terrain that includes the visible area that we need to render. Once we know the sub-region that we want to render, we compute the new indices of the vertices that participate in the region so we can render it using a single triangle strip. Finally, we upload the new index data to the index buffer for use in the follow-up draw call.

Avoiding GPU stalls

Although all the above is correct, it actually leads, as described, to much worse performance in general. The reason for this is that our uploads of vertex data in each frame lead to frequent GPU stalls. This happens in two scenarios:

1. In the same frame, because we need to upload different vertex data for the rendering of the terrain for the shadow map and the scene (the shadow map renders the terrain from the point of view of the light, so the visible region of the terrain is different). This creates stalls because the rendering of the terrain for the shadow map might not have completed before we attempt to upload new data to the index buffer in order to render the terrain for the scene.

2. Between different frames. Because the GPU might not be completely done rendering the previous frame (and thus still needs the index buffer data available) before we start preparing the next frame and attempt to upload new terrain index data for it.

In the case of the Intel Mesa driver, these GPU stalls can be easily identified by using the environment variable INTEL_DEBUG=perf. When using this, the driver will detect these situations and produce warnings informing about the stalls, the buffers affected and the regions of the buffers that generate the stall, such as:

Stalling on glBufferSubData(0, 503992) (492kb) to a busy (0-1007984)
buffer object.  Use glMapBufferRange() to avoid this.

The solution to this problem that I implemented (other than trying to put as much work as possible between read/write accesses to the index buffer) comes in two forms:

1. Circular buffers

In this case, we allocate a larger buffer than we need so that each subsequent upload of new index data happens in a separate sub-region of the allocated buffer. I set up the demo so that each circular buffer is large enough to hold the index data required for all updates of the index buffer happening in each frame (the shadow map and the scene).

2. Multi-buffering

We allocate more than one circular buffer. When we don’t have enough free space at the end of the current buffer to upload the new index buffer data, we upload it to a different circular buffer instead. When we run out of buffers we circle back to the first one (which at this point will hopefully be free to be re-used again).

So why not just use a single, very large circular buffer? Mostly because there are limits to the size of the buffers that the GPU may be able to handle correctly (or efficiently). Also, why not have many smaller independent buffers instead of circular buffers? That would work just fine, but using fewer, larger buffers reduces the number of objects we need to bind/unbind and helps prevent memory fragmentation, so that’s a plus.
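
A simplified sketch of how the circular buffers and multi-buffering can be combined (plain OpenGL calls; the buffer count and size are made up for the example, the buffers are assumed to have been created and pre-allocated with glBufferData() beforehand, and this is not the demo's exact code):

#include <epoxy/gl.h>   /* or whichever GL loader you use */

#define NUM_INDEX_BUFFERS 4
#define INDEX_BUFFER_SIZE (2 * 1024 * 1024)

static GLuint ibo[NUM_INDEX_BUFFERS];             /* pre-allocated buffers   */
static GLintptr ibo_offset[NUM_INDEX_BUFFERS];    /* write cursor per buffer */
static int cur_ibo;

/* Upload new index data without touching regions the GPU may still be
 * reading: append inside the current buffer, and move on to the next
 * buffer when there is no room left. Returns the byte offset used and
 * leaves the chosen buffer bound for the follow-up glDrawElements(). */
static GLintptr
upload_indices(const void *data, GLsizeiptr size)
{
   if (ibo_offset[cur_ibo] + size > INDEX_BUFFER_SIZE) {
      cur_ibo = (cur_ibo + 1) % NUM_INDEX_BUFFERS;
      ibo_offset[cur_ibo] = 0;   /* hopefully the GPU is done with it by now */
   }

   GLintptr offset = ibo_offset[cur_ibo];
   glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo[cur_ibo]);
   glBufferSubData(GL_ELEMENT_ARRAY_BUFFER, offset, size, data);
   ibo_offset[cur_ibo] += size;
   return offset;
}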

Final touches

We are almost done, at this point we only need to add a texture to the terrain surface, add some slight fog effect for distant pixels to create a more realistic look, add a skybox (it is important to choose the color of the fog so it matches the color of the sky!) and tweak the lighting parameters to get a nice result:

Final rendering

I hope to cover some of these aspects in future posts, so stay tuned for more!

October 11, 2016
Last week I started work on making hello_fft work with the vc4 driver loaded.

hello_fft is a demo program for doing FFTs using the QPUs (the shader core in vc4).  Instead of drawing primitives into a framebuffer like a GL pipeline does, though, it uses the User QPU pipeline, which just hands a uniform stream to a QPU shader instance.

Originally I didn't build a QPU user shader ioctl for vc4 because there's no good way to use this pipeline of the VC4.  The hardware isn't quite capable enough to be exposed as OpenCL or GL compute shaders (some research had been done into this, and the designers' conclusion was that the memory access support wasn't quite good enough to be useful).  That leaves you with writing VC4 shaders in raw assembly, which I don't recommend.

The other problem for vc4 exposing user shaders is that, since the GPU sits directly on main memory with no MMU in between, GPU shader programs could access anywhere in system memory.  For 3D, there's no need (in GL 2.1) to do general write back to system memory, so the kernel reads your shader code and rejects shaders if they try to do it.  It also makes sure that you bounds-check all reads through the uniforms or texture sampler.  For QPU user shaders, though, the expected mode of using them is to have the VPM DMA units do general loads and stores, so we'd need new validation support.

If it was just this one demo, we might be willing to lose support for it in the transition to the open driver.  However, some modes of accelerated video decode also use QPU shaders at some stage, so we have to have some sort of solution.

My plan is basically to not do validation and have root-only execution of QPU shaders for now.  The firmware communication driver captures hello_fft's request to have the firmware pass QPU shaders to the V3D hardware, and it redirects it into VC4.  VC4 then maintains the little queue of requests coming in, powers up the hardware, feeds them in, and collects the interrupts from the QPU shaders for when they're done (each job request is required to "mov interrupt, 1" at the end of its last shader to be run).

Dom is now off experimenting in what it will take to redirect video decode's QPU usage onto this code.

The other little project last week was fixing Processing's performance on vc4.  It's one of the few apps that is ported to the closed driver, so it's a likely thing to be compared on.

Unfortunately, Processing was clearing its framebuffers really inefficiently.  For non-tiled hardware it was mostly OK, with just a double-clear of the depth buffer, but for vc4 its repeated glClear()s (rather than a single glClear with all the buffers to be cleared set) were triggering flushes and reloads of the scene, and in one case a render of a full screen quad (and a full screen load before doing so) rather than fast clearing.

The solution was to improve the tracking of what buffers are cleared and whether any primitives have been drawn yet, so that I can coalesce sets of repeated or partial clears together while only updating the colors to be cleared.  There's always a tension in this sort of optimization: Should the GL driver be clever to work around apps behaving badly, or should it push that off to the app developer (more like Vulkan does)?  In this case, tiled renderer behavior is sufficiently different from non-tiled renderers, and enough apps will hit this path, that it's well worth it.  For Processing, I saw an improvement of about 10% on the demo I was looking at.

Sadly, Processing won't be a good comparison for open vs closed drivers even after these fixes.  For the closed driver, Processing uses EGL, which says that depth buffers become undefined on eglSwapBuffers.  In its default mode, though, Processing uses GLX, which says that the depth buffer is retained on glXSwapBuffers().  To avoid this extra overhead in its GLX mode, you'd need to use glDiscardFramebufferEXT() right before the swap.  Unfortunately, this isn't wired up in gallium yet, but it shouldn't be hard to do once we have an app that we can test with.
October 07, 2016

I procrastinated rather badly on this one, so instead of this going out around the previous kernel release, the v4.8 release is already out of the door. Read on for my slightly more terse catch-up report.

Since I’m this late I figured instead of the usual comprehensive list I’ll do something new and just list some of the work that landed in 4.8, but with a bit more focus on the impact and why things have been done.

Midlayers, Be Gone!

The first thing I want to highlight is the driver de-midlayering. In the Linux kernel community the midlayer mistake, or helper library design pattern (see the linked article from LWN), is a set of rules for designing subsystems and common support code for drivers. The underlying rule is that the driver itself must be in control of everything, like allocating memory and handling all requests. Common code is only shared in helper library functions, which the driver can call if they are suitable. The reason for that is that there is always some hardware which needs special treatment, and when you have such a special case and there’s a midlayer, it will get in the way.

Due to the shared history with BSD kernels, DRM originally had a full-blown midlayer, but over time this has been fixed. For example, kernel modesetting was designed from the start with the helper library pattern. The last holdout is the device structure itself, and for the Intel driver this is now fixed. This has two main benefits:

  • First we can get rid of a lot of pointer dereferencing in the compiled binaries. With the midlayer, DRM allocated a struct drm_device, and the Intel driver allocated its own, separate structure. Both were connected with pointers, and every time control transferred between driver-private functions and shared code those pointers had to be walked.

    With the helper approach the driver allocates the DRM device structure embedded into its own device structure. That way the pointer dereferencing just becomes a fixed offset adjustment of the original pointer, and fixed offsets can be baked into each access of individual member fields for free, resulting in a big reduction of compiled code (a small sketch of this embedding pattern follows after this list).

  • The other benefit is that the Intel driver is now in full control of the driver load and unload sequence. The DRM midlayer functions for loading had a few driver callbacks, but for historical reasons at the wrong spots, and fixing that is impossible without rewriting the load code for all the drivers. Without the midlayer we can have as many steps in the load sequence as we want, where we want them. The most important fix here is that the driver will now be initialized completely before any part of it is registered and visible to userspace (through the /dev node, sysfs or anywhere else).
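
To illustrate the embedding pattern described in the first point above, here is a much simplified sketch in plain C (not the actual DRM or i915 structures):

#include <stddef.h>

/* Simplified sketch, not the real DRM/i915 structures: the driver-private
 * structure embeds the shared DRM device instead of pointing at a
 * separately allocated one. */
struct drm_device {
        int id;
        /* ... shared DRM state ... */
};

struct intel_device {
        struct drm_device base;   /* embedded, not a pointer */
        int num_pipes;
        /* ... driver-private state ... */
};

/* Going from the shared structure back to the driver structure is just a
 * constant offset adjustment, like the kernel's container_of() macro: */
#define to_intel(drm) \
        ((struct intel_device *)((char *)(drm) - offsetof(struct intel_device, base)))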

Thundering Herds

GPUs process rendering asynchronously, and sometimes the CPU needs to wait for them. For this purpose there’s a wait queue in the driver. Userspace processes block on it until the interrupt handler wakes them up. The trouble is that thus far there was just one wait queue per engine, which means every time the GPU completed something all waiters had to be woken up. Then they checked whether the work they were waiting for had completed, and if not, blocked on the wait queue again until the next batch job completed. That’s all rather inefficient. On top of that, there’s only one per-engine knob to enable interrupts, which means that even if there was only one waiting process, it was woken for every completed job. And GPUs can have a lot of jobs in flight.

In summary, waiting for the GPU worked more like a frantic herd trampling all over things than something orderly. To fix this, the request and completion tracking was entirely revamped to make sure that the driver has a much better understanding of what’s going on. On top of that there’s now also an efficient search structure of all currently waiting processes. With that, the interrupt handler can quickly check whether the just-completed GPU job is of interest, and if so, which exact process should be woken up.
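
Conceptually (a much simplified sketch, not the actual i915 code), the change is from waking every waiter on every completed job to keeping the waiters ordered by the sequence number they wait on and only waking the ones whose work has actually completed:

/* Much simplified sketch of the idea, not the actual i915 implementation. */
struct waiter {
        unsigned int seqno;        /* GPU job this process waits for     */
        struct waiter *next;       /* kept sorted by seqno, lowest first */
};

static struct waiter *waiters;     /* per-engine list, protected by a lock */

extern void wake_waiter(struct waiter *w);   /* hypothetical wake-up helper */

static void
engine_irq_handler(unsigned int completed_seqno)
{
        /* Only the head of the sorted list can be affected, so the check
         * is cheap and nobody else gets woken up needlessly. */
        while (waiters && waiters->seqno <= completed_seqno) {
                struct waiter *w = waiters;
                waiters = w->next;
                wake_waiter(w);
        }
}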

But this wasn’t just done to make the driver more efficient. Better tracking of pending and completed GPU requests is an important foundation for implementing proper GPU scheduling on top. And it’s also needed to interface the completion tracking with other drivers, to finally fix tearing on multi-GPU machines. Having a thundering herd in your own backyard is unsightly, letting it loose on your neighbours is downright bad! A lot of this follow-up work already landed for the 4.9 kernel, hence I will talk more about this in a future installment of this series.

October 06, 2016

No more “which is now the index of this modem…?”

DBus object path and index

When modems are detected by ModemManager and exposed in DBus, they are assigned a unique DBus object path, with a common prefix and a unique index number, e.g.:

/org/freedesktop/ModemManager1/Modem/0
This path is the one used by the mmcli command line tool to operate on a modem, so users can identify the device by the full path or just by the index, e.g. these two calls are totally equivalent:

$ mmcli -m /org/freedesktop/ModemManager1/Modem/0
$ mmcli -m 0

This logic looks good, except for the fact that there isn’t a fixed DBus object path for each modem detected: i.e. the index given to a device is the next one available, and if the device is power cycled or unplugged and replugged, a different index will be given to it.


Systems like NetworkManager handle this index change gracefully, just by assuming that the exposed device isn’t the same one as the one exposed earlier with a different index. If settings need to be applied to a specific device, they will be stored associated with the EquipmentIdentifier property of the modem, which is the same across reboots (i.e. the IMEI for GSM/UMTS/LTE devices).

User-provided names

The 1.8 stable release of ModemManager will come with support for user-provided names assigned to devices. A use case of this new feature is for example those custom systems where the user would like to assign a name to a device based on the USB port in which it is connected (e.g. assuming the USB hardware layout doesn’t change across reboots).

The user can specify the names (UIDs, unique IDs) just by tagging, in udev, the physical device that owns all ports of a modem with the new ID_MM_PHYSDEV_UID property. This tag needs to be applied before the ID_MM_CANDIDATE properties, and therefore the rules file should be named so that it sorts before the 80-mm-candidate.rules one, for example like this:

$ cat /lib/udev/rules.d/78-mm-naming.rules

ACTION!="add|change|move", GOTO="mm_naming_rules_end"
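
The rest of the file then matches the modem's physical device and assigns the UID. The match attributes below are only an illustration (adjust them to your own USB topology), but the overall shape would be something like:

SUBSYSTEMS=="usb", ATTRS{busnum}=="1", ATTRS{devpath}=="2", ENV{ID_MM_PHYSDEV_UID}="USB4"
LABEL="mm_naming_rules_end"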

The value of the new ID_MM_PHYSDEV_UID property will be used in the Device property exposed in the DBus object, and can also be used directly in mmcli calls instead of the path or index, e.g.:

$ mmcli -m USB4
 System | device: 'USB4'
        | drivers: 'qmi_wwan, qcserial'
        | plugin: 'Sierra'
        | primary port: 'cdc-wdm2'

Given that the same property value will always be set for the modem in a specific device path, these user-provided names may unequivocally identify a specific modem even when the device is power-cycled, unplugged and replugged, or even after the whole system is rebooted.

Binding the property to the device path is just an example of what could be done. There is no restriction on the logic used to apply the ID_MM_PHYSDEV_UID property, so users may also choose other approaches.

This support is already in ModemManager git master, and as already said, will be included in the stable 1.8 release, whenever that is.

TL;DR? ModemManager now supports assigning unique names to devices that persist even across full system reboots.

October 05, 2016

I have been quite busy with various things in the last few weeks, but I have finally found some time to clean up and upload the code of the OpenGL terrain render demo to Github.

Since this was intended as a programming exercise I have not tried to be very elegant or correct during the implementation, so expect things like error handling to be a bit rough around the edges, but otherwise I think the code should be easy enough to follow.

Notice that I have only tested this on Intel GPUs. I know it works on NVIDIA too (thanks to Samuel and Chema for testing this), but there are a couple of rendering artifacts there, specifically at the edges of the skybox and some “pillars” showing up in the distance sometimes, probably because I am rendering one too many “rows” of the terrain and end up rendering garbage. I may fix these some day.

The code I uploaded to the repository includes a few new features too:

  • Model variants, which are basically color variations of the same model
  • A couple of additional models (a new tree and plant) and a different rock type
  • Collision detection, which makes navigating the terrain more pleasant

Here is a new screenshot:


In future posts I will talk a bit about some of the implementation details, so having the source code around will be useful. Enjoy!

October 04, 2016

systemd.conf 2016 is Over Now!

A few days ago systemd.conf 2016 ended, our second conference of this kind. I personally enjoyed this conference a lot: the talks, the atmosphere, the audience, the organization, the location, they all were excellent!

I'd like to take the opportunity to thank everybody involved. In particular I'd like to thank Chris, Daniel, Sandra and Henrike for organizing the conference; your work was stellar!

I'd also like to thank our sponsors, without which the conference couldn't take place like this, of course. In particular I'd like to thank our gold sponsor, Red Hat, our organizing sponsor Kinvolk, as well as our silver sponsors CoreOS and Facebook. I'd also like to thank our bronze sponsors Collabora, OpenSUSE, Pantheon, Pengutronix, our supporting sponsor Codethink and last but not least our media sponsor Linux Magazin. Thank you all!

I'd also like to thank the Video Operation Center ("VOC") for their amazing work on live-streaming the conference and making all talks available on YouTube. It's amazing how efficient the VOC is, it's simply stunning! Thank you guys!

In case you missed this year's iteration of the conference, please have a look at our YouTube Channel. You'll find all of this year's talks there, as well as the ones from last year. (For example, my welcome talk is available here). Enjoy!

We hope to see you again next year, for systemd.conf 2017 in Berlin!

Last week Adam Jackson landed my X testing series, so now at least 3 of us pushing code are testing glamor with every commit.  I used the series to test a cleanup of the Render extension related code in glamor, deleting 550 lines of junk.  I did this while starting to work on using GL sampler objects, which should help reduce our pixmap allocation and per-draw Render overhead.   That patch is failing make check so far.

The big project for the week was tackling some of the HDMI bug reports.  I pulled out my bigger monitor that actually takes HDMI input, and started working toward getting it to display all of its modes correctly.  We had an easy bug in the clock driver that broke changing to low resolutions: trying to drive the PLL too slow, instead of driving the PLL in spec and using the divider to get down to the clock we want.  We needed infoframes support to fix purple borders on the screen.  We were doing our interlaced video mode programming pretty much entirely wrong (it's not perfect yet, but 1 of the 3 interlaced modes on my monitor displays correctly and they all display something).  Finally, I added support for double-clocked CEA modes (generally the low resolution, low refresh ones in your list; these are probably particularly interesting for RPi due to all the people running emulators).

Finally, one of the more exciting things for me was that I got general agreement from gregkh to merge the VCHIQ driver through the staging tree, and he even put the initial patch together for me.  If we add in a couple more small helper drivers, we should be able to merge the V4L2 camera driver, and hopefully get closer to doing a V4L2 video decode driver for upstream.  I've now tested a messy series that at least executes "vcdbg log msg."  Those other drivers will definitely need some cleanups, though.
October 03, 2016

After a few weeks of hard work with the team, here is the new major version of Gnocchi, stamped 3.0.0. It was very challenging, as we wanted to implement a few big changes in it.

Gnocchi is now using reno to its maximum and you can read the release notes of the 3.0 branch online. Some notes might be missing as it is our first release with it, but we are making good progress at writing changelogs for most of our user facing and impacting changes.

Therefore, I'll only write here about our big major feature that made us bump the major version number.

New storage engine

And so the most interesting thing that went into the 3.0 release is the new storage engine that has been built by me and Gordon Chung over the last few months. The original approach to writing data in Gnocchi was really naive, so we have had an iterative improvement process going since version 1.0, and we're getting close to something very solid.

This new version leverages several important features which increase performance by a large factor on Ceph (using write(offset) rather than read()+write() to append new points), our recommended back-end.

read+write (Swift and file drivers) vs offset (Ceph)

To summarize, since most data points are sent sequentially and ordered, we enhanced the data format to profit from that fact and be able to be appended without reading anything. That only works on Ceph though, which provides the needed features.

We also enabled data compression on all storage drivers by enabling LZ4 compression (see my previous article and research on the subject), which obviously brings its own set of challenges when using append-only writes. The results are tremendous and decrease data usage by a huge factor:

Disk size comparison between Gnocchi 2 and Gnocchi 3

The rest of the processing pipeline also has been largely improved:

Processing time of new measures
Compression time comparison

Overall, we're delighted with the performance improvement we achieved, and we're looking forward to making even more progress. Gnocchi is now one of the best performing and most scalable time series databases out there.

Upcoming challenges

With that big change done, we're now heading toward a set of more lightweight improvements. Our bug tracker is a good place to learn what might be on our mind (check for the wishlist bugs).

Improving our API features and offering a better experience for those coming from outside the realm of OpenStack are now at the top of my priority list.

But let me know if there's anything itching you, obviously. 😎

September 30, 2016

For about a year now we’ve been running the Intel graphics driver with a new process: besides the two established maintainers, we’ve added all regular contributors as committers to the main feature branch feeding into -next. This turned out to be a tremendous success, but it did require some initial adjustments to how we run things in the first few months.

I’ve presented the new model here at Kernel Recipes in Paris, and I will also talk about it at Kernel Summit in Santa Fe. Since LWN is present at both I won’t bother with a full writeup, but leave that to much better editors. Update: LWN on kernel maintainer scalability.

Anyway, there’s a video recording and the slides. Our process is also documented - scroll down to the bottom for the more interesting bits around what’s expected of committers.

On a related note: At XDC, and a bit before, Eric Anholt started a discussion about improving our patch submission process, especially for new contributors. He used the Rust community as a great example, and presented about it at XDC. Rather interesting to hear his perspective as a first-time contributor confirm what I learned in LCA this year in Emily Dunham’s awesome talk on Life is better with Rust’s community automation.

September 27, 2016
The answer is YES!!

I fixed the last bug with instance rendering and Talos renders great on radv now.

Also, with the semi-interesting branch vkQuake renders as well. There are some upstream bugs that need fixing in spirv/nir that I'm awaiting an upstream resolution on, but I've included some preliminary fixes in semi-interesting for now; those will go away when the upstream fixes are decided on.

Here's a screenshot:

September 26, 2016
I spent last week at XDS 2016 with other graphics stack developers.

I gave a talk on X Server testing and our development model.  Since playing around in the Servo project (and a few other github-centric communities) recently, I've become increasingly convinced that github + CI + automatic merges is the right way to go about software development now.  I got a warmer reception from the other X Server developers than I expected, and I hope to keep working on building out this infrastructure.

I also spent a while with keithp and ajax (other core X developers) talking about how to fix our rendering performance.  We've got two big things hurting vc4 right now: Excessive flushing, and lack of bounds on rendering operations.

The excessive flushing one is pretty easy.  X currently does some of its drawing directly to the front buffer (I think this is bad and we should stop, but that *is* what we're doing today).  With glamor, we do our rendering with GL, so the driver accumulates a command stream over time.  In order for the drawing to the front buffer to show up on time, both for running animations, and for that last character you typed before rendering went idle, we have to periodically flush GL's command stream so that its rendering to the front buffer (if any) shows up.

Right now we just glFlush() every time we might block sending things back to a client, which is really frequent.  For a tiling architecture like vc4, each glFlush() is quite possibly a load and store of the entire screen (8MB).

The solution I started back in August is to rate-limit how often we flush.  When the server wakes up (possibly to render), I arm a timer.  When it fires, I flush and disarm it until the next wakeup.  I set the timer to 5ms as an arbitrary value that was clearly not pretending to be vsync.  The state machine was a little tricky and had a bug it seemed (or at least a bad interaction with x11perf -- it was unclear).  But ajax pointed out that I could easily do better: we have a function in glamor that gets called right before it does any GL operation, meaning that we could use that to arm the timer and not do the arming every time the server wakes up to process input.

Bounding our rendering operations is a bit trickier.  We started doing some of this with keithp's glamor rework: We glScissor() all of our rendering to the pCompositeClip (which is usually the bounds of the current window).  For front buffer rendering in X, this is a big win on vc4 because it means that when you update an uncomposited window I know that you're only loading and storing the area covered by the window, not the entire screen.  That's a big bandwidth savings.

However, the pCompositeClip isn't enough.  As I'm typing we should be bounding the GL rendering around each character (or span of them), so that I only load and store one or two tiles rather than the entire window.  It turns out, though, that you've probably already computed the bounds of the operation in the Damage extension, because you have a compositor running!  Wouldn't it be nice if we could just reuse that old damage computation and reference it?

Keith has started on making this possible: First with the idea of const-ifying all the op arguments so that we could reuse their pointers as a cheap cache key, and now with the idea of just passing along a possibly-initialized rectangle with the bounds of the operation.  If you do compute the bounds, then you pass it down to anything you end up calling.

Between these two, we should get much improved performance on general desktop apps on Raspberry Pi.

Other updates: Landed opt_peephole_sel improvement for glmark2 performance (and probably mupen64plus), added regression testing of glamor to the X Server, fixed a couple of glamor rendering bugs.
September 20, 2016

Writing a book is a big undertaking. You have to think about what you will actually write, the content, its organization, the examples you want to show, illustrations, etc.

When publishing with the help of a regular editor, your job stops there at writing – and that's already a big and hard enough task. Your editor will handle the publishing process, leaving you free of the printing task. Though they might have their own set of requirements, such as making you work with a word processing tool (think LibreOffice Writer or Microsoft Word).

The Hacker's Guide to Python on my Kindle

When you self-publish like I did with The Hacker's Guide to Python, none of that happens. You have to deal yourself with getting your work out there, released and available in a viable format for your readership.

Most of the time, you need to render your book in different formats. You will have to make sure it works correctly on different devices and that the formatting and content disposition is correct.

I knew exactly what I wanted when writing my book. I wanted the book published in at least PDF (for computer reading) and ePub (for e-readers). I also knew, as an Emacs user, that I did not want to spend hours writing a book in LibreOffice. It's not for me.

When I wrote about the making of The Hacker's Guide to Python, I briefly mentioned which tools I used to build the book and that I picked AsciiDoc as the input format. It makes it easy to write your book inside your favorite text editor, and AsciiDoc has plenty of output formats. Customizing these formats to my liking and requirements was another challenge.

It took me hours and hours of work to have all the nitty-gritty details right. Today I am happy to announce that I can save you a few hours of work if you also want to publish a book.

I've published a new project on my GitHub called asciidoc-book-toolchain. It is the actual toolchain that I use to build The Hacker's Guide to Python. It should be easy to use and is able to render any book in HTML, PDF, PDF (printable 6"×9" format), ePub and MOBI.

Conversion workflow of the asciidoc-book-toolchain

So feel free to use it, hack it, pull-request it, or whatever. You don't have any good excuse not to write a book now! 😇 And if you want to self-publish a book and need some help getting started, let me know, I would be glad to give you a few hints!

First a definition: a trackstick is also called trackpoint, pointing stick, or "that red knob between G, H, and B". I'll be using trackstick here, because why not.

This post is the continuation of libinput and the Lenovo T450 and T460 series touchpads where we focused on a stalling pointer when moving the finger really slowly. Turns out that at least the T460s, and possibly others in the *60 series, have another bug causing behaviour that is much worse, but we didn't notice it for ages because we were focusing on the high-precision cursor movement. Specifically, the pointer would just randomly stop moving for a short while (spoiler alert: 300ms), regardless of the movement speed.

libinput has built-in palm detection and one of the things it does is to disable the touchpad when the trackstick is in use. It's not uncommon to rest the hand near or on the touchpad while using the trackstick and any detected touch would cause interference with the pointer motion. So events from the touchpad are ignored whenever the trackpoint sends events. [1]
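
The logic is roughly the sketch below (not libinput's actual implementation, just the shape of it): remember when the trackstick last sent an event and drop touchpad events that arrive within a timeout of that. The 300ms value is the one that matters for the bug described next.

/* Sketch of trackstick-based touchpad disabling (not libinput's actual
 * code): touchpad events are dropped for TIMEOUT_MS after the last
 * trackstick event. */
#include <stdbool.h>
#include <stdint.h>

#define TIMEOUT_MS 300

static uint64_t last_trackstick_event_ms;

void trackstick_event_received(uint64_t now_ms)
{
    last_trackstick_event_ms = now_ms;
}

bool touchpad_event_is_ignored(uint64_t now_ms)
{
    return now_ms - last_trackstick_event_ms < TIMEOUT_MS;
}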

On (some of) the T460s the trackpoint sends spurious events. In the recording I have, there are random events at 9s, then again 3.5s later, then 14s later, then 2s later, etc. Each time, our palm detection code would assume the trackpoint was in use and disable the touchpad for 300ms. If you were using the touchpad while this was happening, the touchpad would suddenly stop moving for 300ms and then continue as normal. Depending on how often these spurious events come in and the user's current caffeination state, this was somewhere between odd, annoying and infuriating.

The good news is: this is fixed in libinput now. libinput 1.5 and the upcoming 1.4.3 releases will have a fix that ignores these spurious events and makes the touchpad stalls a footnote of history. Hooray.

[1] we still allow touchpad physical button presses, and trackpoint button clicks won't disable the touchpad

September 19, 2016

This post explains how the evdev protocol works. After reading this post you should understand what evdev is and how to interpret evdev event dumps to understand what your device is doing. The post is aimed mainly at users having to debug a device, so I will leave out or simplify some of the technical details. I'll be using the output from evemu-record as the example because that is the primary debugging tool for evdev.

What is evdev?

evdev is a Linux-only generic protocol that the kernel uses to forward information and events about input devices to userspace. It's not just for mice and keyboards but any device that has any sort of axis, key or button, including things like webcams and remote controls. Each device is represented as a device node in the form of /dev/input/event0, with the trailing number increasing as you add more devices. The node numbers are re-used after you unplug a device, so don't hardcode the device node into a script. The device nodes are also only readable by root, thus you need to run any debugging tools as root too.

evdev is the primary way to talk to input devices on Linux. All X.Org input drivers on Linux use evdev as the protocol, and so does libinput. Note that "evdev" is also the shortcut used for xf86-input-evdev, the X.Org driver to handle generic evdev devices, so watch out for context when you read "evdev" on a mailing list.

Communicating with evdev devices

Communicating with a device is simple: open the device node and read from it. Any data coming out is a struct input_event, defined in /usr/include/linux/input.h:

struct input_event {
    struct timeval time;
    __u16 type;
    __u16 code;
    __s32 value;
};

I'll describe the contents later, but you can see that it's a very simple struct.
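
As a rough illustration of how simple the wire format is, the sketch below opens a device node and dumps the raw events. /dev/input/event0 is only an example path, and as noted below you would normally use libevdev rather than raw reads.

/* Minimal sketch: read struct input_event directly from a device node.
 * Needs root, and /dev/input/event0 is only an example path. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <linux/input.h>

int main(void)
{
    struct input_event ev;
    int fd = open("/dev/input/event0", O_RDONLY);

    if (fd < 0) {
        perror("open");
        return 1;
    }

    while (read(fd, &ev, sizeof(ev)) == sizeof(ev))
        printf("type %u code %u value %d\n", ev.type, ev.code, ev.value);

    close(fd);
    return 0;
}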

Static information about the device such as its name and capabilities can be queried with a set of ioctls. Note that you should always use libevdev to interact with a device; it blunts the few sharp edges evdev has. See the libevdev documentation for usage examples.

evemu-record, our primary debugging tool for anything evdev, is very simple. It reads the static information about the device, prints it and then simply reads and prints all events as they come in. The output is in a machine-readable format but it's annotated with human-readable comments (starting with #). You can always ignore the non-comment bits. There's a second command, evemu-describe, that only prints the description and exits without waiting for events.

Relative devices and keyboards

The top part of an evemu-record output is the device description. This is a list of static properties that tells us what the device is capable of. For example, the USB mouse I have plugged in here prints:

# Input device name: "PIXART USB OPTICAL MOUSE"
# Input device ID: bus 0x03 vendor 0x93a product 0x2510 version 0x110
# Supported events:
# Event type 0 (EV_SYN)
# Event code 0 (SYN_REPORT)
# Event code 1 (SYN_CONFIG)
# Event code 2 (SYN_MT_REPORT)
# Event code 3 (SYN_DROPPED)
# Event code 4 ((null))
# Event code 5 ((null))
# Event code 6 ((null))
# Event code 7 ((null))
# Event code 8 ((null))
# Event code 9 ((null))
# Event code 10 ((null))
# Event code 11 ((null))
# Event code 12 ((null))
# Event code 13 ((null))
# Event code 14 ((null))
# Event type 1 (EV_KEY)
# Event code 272 (BTN_LEFT)
# Event code 273 (BTN_RIGHT)
# Event code 274 (BTN_MIDDLE)
# Event type 2 (EV_REL)
# Event code 0 (REL_X)
# Event code 1 (REL_Y)
# Event code 8 (REL_WHEEL)
# Event type 4 (EV_MSC)
# Event code 4 (MSC_SCAN)
# Properties:
The device name is the one (usually) set by the manufacturer and so are the vendor and product IDs. The bus is one of the "BUS_USB" and similar constants defined in /usr/include/linux/input.h. The version is often quite arbitrary, only a few devices have something meaningful here.

We also have a set of supported events, categorised by "event type" and "event code" (note how type and code are also part of the struct input_event). The type is a general category, and /usr/include/linux/input-event-codes.h defines quite a few of those. The most important types are EV_KEY (keys and buttons), EV_REL (relative axes) and EV_ABS (absolute axes). In the output above we can see that we have EV_KEY and EV_REL set.

As a subitem of each type we have the event code. The event codes for this device are self-explanatory: BTN_LEFT, BTN_RIGHT and BTN_MIDDLE are the left, right and middle button. The axes are a relative x axis, a relative y axis and a wheel axis (i.e. a mouse wheel). EV_MSC/MSC_SCAN is used for raw scancodes and you can usually ignore it. And finally we have the EV_SYN bits but let's ignore those, they are always set for all devices.

Note that an event code cannot be on its own, it must be a tuple of (type, code). For example, REL_X and ABS_X have the same numerical value and without the type you won't know which one is which.
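
libevdev's name lookup helpers take exactly that tuple. The sketch below (link with -levdev) shows how the same numerical code resolves to two different names depending on the type:

/* Sketch: the same numerical code means different things per event type,
 * which is why codes only make sense as a (type, code) tuple. */
#include <stdio.h>
#include <libevdev/libevdev.h>

int main(void)
{
    unsigned int code = 0;   /* REL_X and ABS_X share the value 0 */

    printf("EV_REL/%u -> %s\n", code,
           libevdev_event_code_get_name(EV_REL, code));   /* prints REL_X */
    printf("EV_ABS/%u -> %s\n", code,
           libevdev_event_code_get_name(EV_ABS, code));   /* prints ABS_X */
    return 0;
}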

That's pretty much it. A keyboard will have a lot of EV_KEY bits set and the EV_REL axes are obviously missing (but not always...). Instead of BTN_LEFT, a keyboard would have e.g. KEY_ESC, KEY_A, KEY_B, etc. 90% of device debugging is looking at the event codes and figuring out which ones are missing or shouldn't be there.

Exercise: You should now be able to read an evemu-record description from any mouse or keyboard device connected to your computer and understand what it means. This also applies to most special devices such as remotes - the only thing that changes are the names for the keys/buttons. Just run sudo evemu-describe and pick any device in the list.

The events from relative devices and keyboards

evdev is a serialised protocol. It sends a series of events and then a synchronisation event to notify us that the preceding events all belong together. This synchronisation event is EV_SYN SYN_REPORT; it is generated by the kernel, not the device, hence the EV_SYN codes are always available on all devices.

Let's have a look at a mouse movement. As explained above, half the line is machine-readable but we can ignore that bit and look at the human-readable output on the right.

E: 0.335996 0002 0000 0001 # EV_REL / REL_X 1
E: 0.335996 0002 0001 -002 # EV_REL / REL_Y -2
E: 0.335996 0000 0000 0000 # ------------ SYN_REPORT (0) ----------
This means that within one hardware event, we've moved 1 device unit to the right (x axis) and two device units up (y axis). Note how all events have the same timestamp (0.335996).
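
One way to picture this in code (just a sketch, not how any particular library implements it): buffer incoming events until the SYN_REPORT arrives, then process the whole batch as one frame.

/* Sketch: group events into frames delimited by EV_SYN / SYN_REPORT.
 * read_event() and process_frame() are placeholders for your own code. */
#include <linux/input.h>

#define MAX_EVENTS_PER_FRAME 64

extern int read_event(struct input_event *ev);                        /* placeholder */
extern void process_frame(const struct input_event *evs, int count);  /* placeholder */

void event_loop(void)
{
    struct input_event frame[MAX_EVENTS_PER_FRAME];
    struct input_event ev;
    int count = 0;

    while (read_event(&ev) == 0) {
        if (ev.type == EV_SYN && ev.code == SYN_REPORT) {
            process_frame(frame, count);   /* these events all belong together */
            count = 0;
        } else if (count < MAX_EVENTS_PER_FRAME) {
            frame[count++] = ev;
        }
    }
}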

Let's have a look at a button press:

E: 0.656004 0004 0004 589825 # EV_MSC / MSC_SCAN 589825
E: 0.656004 0001 0110 0001 # EV_KEY / BTN_LEFT 1
E: 0.656004 0000 0000 0000 # ------------ SYN_REPORT (0) ----------
E: 0.727002 0004 0004 589825 # EV_MSC / MSC_SCAN 589825
E: 0.727002 0001 0110 0000 # EV_KEY / BTN_LEFT 0
E: 0.727002 0000 0000 0000 # ------------ SYN_REPORT (0) ----------
For button events, the value 1 signals button pressed, button 0 signals button released.

And key events look like this:

E: 0.000000 0004 0004 458792 # EV_MSC / MSC_SCAN 458792
E: 0.000000 0001 001c 0000 # EV_KEY / KEY_ENTER 0
E: 0.000000 0000 0000 0000 # ------------ SYN_REPORT (0) ----------
E: 0.560004 0004 0004 458976 # EV_MSC / MSC_SCAN 458976
E: 0.560004 0001 001d 0001 # EV_KEY / KEY_LEFTCTRL 1
E: 0.560004 0000 0000 0000 # ------------ SYN_REPORT (0) ----------
E: 1.172732 0001 001d 0002 # EV_KEY / KEY_LEFTCTRL 2
E: 1.172732 0000 0000 0001 # ------------ SYN_REPORT (1) ----------
E: 1.200004 0004 0004 458758 # EV_MSC / MSC_SCAN 458758
E: 1.200004 0001 002e 0001 # EV_KEY / KEY_C 1
E: 1.200004 0000 0000 0000 # ------------ SYN_REPORT (0) ----------
Mostly the same as button events. But wait, there is one difference: we have a value of 2 as well. For key events, a value 2 means "key repeat". If you're on the tty, then this is what generates repeat keys for you. In X and Wayland we ignore these repeat events and instead use XKB-based key repeat.

Now look at the keyboard events again and see if you can make sense of the sequence. We have an Enter release (but no press), then ctrl down (and repeat), followed by a 'c' press - but no release. The explanation is simple - as soon as I hit enter in the terminal, evemu-record started recording so it captured the enter release too. And it stopped recording as soon as ctrl+c was down because that's when it was cancelled by the terminal. One important takeaway here: the evdev protocol is not guaranteed to be balanced. You may see a release for a key you've never seen the press for, and you may be missing a release for a key/button you've seen the press for (this happens when you stop recording). Oh, and there's one danger: if you record your keyboard and you type your password, the keys will show up in the output. Security experts generally recommend not publishing event logs with your password in them.

Exercise: You should now be able to read an evemu-record event list from any mouse or keyboard device connected to your computer and understand the event sequence. This also applies to most special devices such as remotes - the only thing that changes are the names for the keys/buttons. Just run sudo evemu-record and pick any device listed.

Absolute devices

Things get a bit more complicated when we look at absolute input devices like a touchscreen or a touchpad. Yes, touchpads are absolute devices in hardware and the conversion to relative events is done in userspace by e.g. libinput. The output of my touchpad is below. Note that I've manually removed a few bits to make it easier to grasp, they will appear later in the multitouch discussion.

# Input device name: "SynPS/2 Synaptics TouchPad"
# Input device ID: bus 0x11 vendor 0x02 product 0x07 version 0x1b1
# Supported events:
# Event type 0 (EV_SYN)
# Event code 0 (SYN_REPORT)
# Event code 1 (SYN_CONFIG)
# Event code 2 (SYN_MT_REPORT)
# Event code 3 (SYN_DROPPED)
# Event code 4 ((null))
# Event code 5 ((null))
# Event code 6 ((null))
# Event code 7 ((null))
# Event code 8 ((null))
# Event code 9 ((null))
# Event code 10 ((null))
# Event code 11 ((null))
# Event code 12 ((null))
# Event code 13 ((null))
# Event code 14 ((null))
# Event type 1 (EV_KEY)
# Event code 272 (BTN_LEFT)
# Event code 325 (BTN_TOOL_FINGER)
# Event code 328 (BTN_TOOL_QUINTTAP)
# Event code 330 (BTN_TOUCH)
# Event code 333 (BTN_TOOL_DOUBLETAP)
# Event code 334 (BTN_TOOL_TRIPLETAP)
# Event code 335 (BTN_TOOL_QUADTAP)
# Event type 3 (EV_ABS)
# Event code 0 (ABS_X)
# Value 2919
# Min 1024
# Max 5112
# Fuzz 0
# Flat 0
# Resolution 42
# Event code 1 (ABS_Y)
# Value 3711
# Min 2024
# Max 4832
# Fuzz 0
# Flat 0
# Resolution 42
# Event code 24 (ABS_PRESSURE)
# Value 0
# Min 0
# Max 255
# Fuzz 0
# Flat 0
# Resolution 0
# Event code 28 (ABS_TOOL_WIDTH)
# Value 0
# Min 0
# Max 15
# Fuzz 0
# Flat 0
# Resolution 0
# Properties:
# Property type 0 (INPUT_PROP_POINTER)
# Property type 2 (INPUT_PROP_BUTTONPAD)
We have a BTN_LEFT again and a set of other buttons that I'll explain in a second. But first we look at the EV_ABS output. We have the same naming system as above. ABS_X and ABS_Y are the x and y axis on the device, ABS_PRESSURE is an (arbitrary) ranged pressure value.

Absolute axes have a bit more state than just a simple bit. Specifically, they have a minimum and maximum (not all hardware has the top-left sensor position on 0/0, it can be an arbitrary position, specified by the minimum). Notable here is that the axis ranges are simply the ones announced by the device - there is no guarantee that the values fall within this range and indeed a lot of touchpad devices tend to send values slightly outside that range. Fuzz and flat can be safely ignored, but resolution is interesting. It is given in units per millimeter and thus tells us the size of the device. In the above case, (5112 - 1024)/42 means the device is 97mm wide. The resolution is quite commonly wrong; a lot of axis overrides exist just to change the resolution to the correct value.

The axis description also has a current value listed. The kernel only sends events when the value changes, so even if the actual hardware keeps sending events, you may never see them in the output if the value remains the same. In other words, holding a finger perfectly still on a touchpad creates plenty of hardware events, but you won't see anything coming out of the event node.

Finally, we have properties on this device. These are used to indicate general information about the device that's not otherwise obvious. In this case INPUT_PROP_POINTER tells us that we need a pointer for this device (it is a touchpad after all, a touchscreen would instead have INPUT_PROP_DIRECT set). INPUT_PROP_BUTTONPAD means that this is a so-called clickpad, it does not have separate physical buttons but instead the whole touchpad clicks. Ignore INPUT_PROP_TOPBUTTONPAD because it only applies to the Lenovo *40 series of devices.

Ok, back to the buttons: aside from BTN_LEFT, we have BTN_TOUCH. This one signals that the user is touching the surface of the touchpad (with some in-kernel defined minimum pressure value). It's not just for finger-touches, it's also used for graphics tablet stylus touches (so really, it's more "contact" than "touch" but meh).

The BTN_TOOL_FINGER event tells us that a finger is in detectable range. This gives us two bits of information: first, we have a finger (a tablet would have e.g. BTN_TOOL_PEN) and second, we may have a finger in proximity without touching. On many touchpads, BTN_TOOL_FINGER and BTN_TOUCH come in the same event, but others can detect a finger hovering over the touchpad too (in which case you'd also hope for ABS_DISTANCE being available on the touchpad).

Finally, the BTN_TOOL_DOUBLETAP up to BTN_TOOL_QUINTTAP tell us whether the device can detect 2 through to 5 fingers on the touchpad. This doesn't actually track the fingers, it merely tells you "3 fingers down" in the case of BTN_TOOL_TRIPLETAP.
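
Turning the BTN_TOOL_* codes into a finger count is a simple lookup. A sketch (the function name is made up for the example):

/* Sketch: map the BTN_TOOL_* code whose value is 1 to a finger count. */
#include <linux/input.h>

int finger_count_for_tool(unsigned int code)
{
    switch (code) {
    case BTN_TOOL_FINGER:    return 1;
    case BTN_TOOL_DOUBLETAP: return 2;
    case BTN_TOOL_TRIPLETAP: return 3;
    case BTN_TOOL_QUADTAP:   return 4;
    case BTN_TOOL_QUINTTAP:  return 5;
    default:                 return 0;
    }
}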

Exercise: Look at your touchpad's description and figure out if the size of the touchpad is correct based on the axis information [1]. Check how many fingers your touchpad can detect and whether it can do pressure or distance detection.

The events from absolute devices

Events from absolute axes are not really any different than events from relative devices which we already covered. The same type/code combination with a value and a timestamp, all framed by EV_SYN SYN_REPORT events. Here's an example of me touching the touchpad:

E: 0.000001 0001 014a 0001 # EV_KEY / BTN_TOUCH 1
E: 0.000001 0003 0000 3335 # EV_ABS / ABS_X 3335
E: 0.000001 0003 0001 3308 # EV_ABS / ABS_Y 3308
E: 0.000001 0003 0018 0069 # EV_ABS / ABS_PRESSURE 69
E: 0.000001 0001 0145 0001 # EV_KEY / BTN_TOOL_FINGER 1
E: 0.000001 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +0ms
E: 0.021751 0003 0018 0070 # EV_ABS / ABS_PRESSURE 70
E: 0.021751 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +21ms
E: 0.043908 0003 0000 3334 # EV_ABS / ABS_X 3334
E: 0.043908 0003 0001 3309 # EV_ABS / ABS_Y 3309
E: 0.043908 0003 0018 0065 # EV_ABS / ABS_PRESSURE 65
E: 0.043908 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +22ms
E: 0.052469 0001 014a 0000 # EV_KEY / BTN_TOUCH 0
E: 0.052469 0003 0018 0000 # EV_ABS / ABS_PRESSURE 0
E: 0.052469 0001 0145 0000 # EV_KEY / BTN_TOOL_FINGER 0
E: 0.052469 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +9ms
In the first event you see BTN_TOOL_FINGER and BTN_TOUCH set (this touchpad doesn't detect hovering fingers), an x/y coordinate pair and a pressure value. The pressure changes in the second event; the third event changes both pressure and location. Finally, we have BTN_TOOL_FINGER and BTN_TOUCH released on finger up, and the pressure value goes back to 0. Notice how the second event didn't contain any x/y coordinates? As I said above, the kernel only sends updates on absolute axes when the value changed.

Ok, let's look at a three-finger tap (again, minus the ABS_MT_ bits):

E: 0.000001 0001 014a 0001 # EV_KEY / BTN_TOUCH 1
E: 0.000001 0003 0000 2149 # EV_ABS / ABS_X 2149
E: 0.000001 0003 0001 3747 # EV_ABS / ABS_Y 3747
E: 0.000001 0003 0018 0066 # EV_ABS / ABS_PRESSURE 66
E: 0.000001 0001 014e 0001 # EV_KEY / BTN_TOOL_TRIPLETAP 1
E: 0.000001 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +0ms
E: 0.034209 0003 0000 2148 # EV_ABS / ABS_X 2148
E: 0.034209 0003 0018 0064 # EV_ABS / ABS_PRESSURE 64
E: 0.034209 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +34ms
E: 0.138510 0003 0000 4286 # EV_ABS / ABS_X 4286
E: 0.138510 0003 0001 3350 # EV_ABS / ABS_Y 3350
E: 0.138510 0003 0018 0055 # EV_ABS / ABS_PRESSURE 55
E: 0.138510 0001 0145 0001 # EV_KEY / BTN_TOOL_FINGER 1
E: 0.138510 0001 014e 0000 # EV_KEY / BTN_TOOL_TRIPLETAP 0
E: 0.138510 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +23ms
E: 0.147834 0003 0000 4287 # EV_ABS / ABS_X 4287
E: 0.147834 0003 0001 3351 # EV_ABS / ABS_Y 3351
E: 0.147834 0003 0018 0037 # EV_ABS / ABS_PRESSURE 37
E: 0.147834 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +9ms
E: 0.157151 0001 014a 0000 # EV_KEY / BTN_TOUCH 0
E: 0.157151 0003 0018 0000 # EV_ABS / ABS_PRESSURE 0
E: 0.157151 0001 0145 0000 # EV_KEY / BTN_TOOL_FINGER 0
E: 0.157151 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +10ms
In the first event, the touchpad detected all three fingers at the same time, so we get BTN_TOUCH, x/y/pressure and BTN_TOOL_TRIPLETAP set. Note that the various BTN_TOOL_* bits are mutually exclusive. BTN_TOOL_FINGER means "exactly 1 finger down" and you can't have exactly 1 finger down when you have three fingers down. In the second event, x and pressure update (y has no event, it stayed the same).

In the event after the break, we switch from three fingers to one finger. BTN_TOOL_TRIPLETAP is released, BTN_TOOL_FINGER is set. That's very common. Humans aren't robots, you can't release all fingers at exactly the same time, so depending on the hardware scanout rate you have intermediate states where one finger has already left while others are still down. In this case I released two fingers between scanouts, one was still down. It's not uncommon to see a full cycle from BTN_TOOL_FINGER to BTN_TOOL_DOUBLETAP to BTN_TOOL_TRIPLETAP on finger down or the reverse on finger up.

Exercise: test out the pressure values on your touchpad and see how close you can get to the actual announced range. Check how accurate the multifinger detection is by tapping with two, three, four and five fingers. (In both cases, you'll likely find that it's very much hit and miss).

Multitouch and slots

Now we're at the most complicated topic regarding evdev devices. In the case of multitouch devices, we need to send multiple touches on the same axes. So we need an additional dimension and that is called multitouch slots (there is another, older multitouch protocol that doesn't use slots but it is so rare now that you don't need to bother).

First: all axes that are multitouch-capable are repeated as an ABS_MT_foo axis. So if you have ABS_X, you also get ABS_MT_POSITION_X and both axes have the same axis ranges and resolutions. The reason here is backwards-compatibility: if a device only sends multitouch events, older programs only listening to the ABS_X etc. events won't work. Some axes may only be available for single-touch (ABS_TOOL_WIDTH in this case).

Let's have a look at my touchpad, this time without the axes removed:

# Input device name: "SynPS/2 Synaptics TouchPad"
# Input device ID: bus 0x11 vendor 0x02 product 0x07 version 0x1b1
# Supported events:
# Event type 0 (EV_SYN)
# Event code 0 (SYN_REPORT)
# Event code 1 (SYN_CONFIG)
# Event code 2 (SYN_MT_REPORT)
# Event code 3 (SYN_DROPPED)
# Event code 4 ((null))
# Event code 5 ((null))
# Event code 6 ((null))
# Event code 7 ((null))
# Event code 8 ((null))
# Event code 9 ((null))
# Event code 10 ((null))
# Event code 11 ((null))
# Event code 12 ((null))
# Event code 13 ((null))
# Event code 14 ((null))
# Event type 1 (EV_KEY)
# Event code 272 (BTN_LEFT)
# Event code 325 (BTN_TOOL_FINGER)
# Event code 328 (BTN_TOOL_QUINTTAP)
# Event code 330 (BTN_TOUCH)
# Event code 333 (BTN_TOOL_DOUBLETAP)
# Event code 334 (BTN_TOOL_TRIPLETAP)
# Event code 335 (BTN_TOOL_QUADTAP)
# Event type 3 (EV_ABS)
# Event code 0 (ABS_X)
# Value 5112
# Min 1024
# Max 5112
# Fuzz 0
# Flat 0
# Resolution 41
# Event code 1 (ABS_Y)
# Value 2930
# Min 2024
# Max 4832
# Fuzz 0
# Flat 0
# Resolution 37
# Event code 24 (ABS_PRESSURE)
# Value 0
# Min 0
# Max 255
# Fuzz 0
# Flat 0
# Resolution 0
# Event code 28 (ABS_TOOL_WIDTH)
# Value 0
# Min 0
# Max 15
# Fuzz 0
# Flat 0
# Resolution 0
# Event code 47 (ABS_MT_SLOT)
# Value 0
# Min 0
# Max 1
# Fuzz 0
# Flat 0
# Resolution 0
# Event code 53 (ABS_MT_POSITION_X)
# Value 0
# Min 1024
# Max 5112
# Fuzz 8
# Flat 0
# Resolution 41
# Event code 54 (ABS_MT_POSITION_Y)
# Value 0
# Min 2024
# Max 4832
# Fuzz 8
# Flat 0
# Resolution 37
# Event code 57 (ABS_MT_TRACKING_ID)
# Value 0
# Min 0
# Max 65535
# Fuzz 0
# Flat 0
# Resolution 0
# Event code 58 (ABS_MT_PRESSURE)
# Value 0
# Min 0
# Max 255
# Fuzz 0
# Flat 0
# Resolution 0
# Properties:
# Property type 0 (INPUT_PROP_POINTER)
# Property type 2 (INPUT_PROP_BUTTONPAD)
We have an x and y position for multitouch as well as a pressure axis. There are also two special multitouch axes that aren't really axes: ABS_MT_SLOT and ABS_MT_TRACKING_ID. The former specifies which slot is currently active, the latter is used to track touch points.

Slots are a static property of a device. My touchpad, as you can see above, only supports 2 slots (min 0, max 1) and thus can track 2 fingers at a time. Whenever the first finger is set down, its coordinates will be tracked in slot 0, the second finger will be tracked in slot 1. When the finger in slot 0 is lifted, the second finger continues to be tracked in slot 1, and if a new finger is set down, it will be tracked in slot 0. Sounds more complicated than it is, think of it as an array of possible touchpoints.

The tracking ID is an incrementing number that lets us tell touch points apart and also tells us when a touch starts and when it ends. The value is either -1 or a positive number. Any positive number means "new touch" and -1 means "touch ended". So when you put two fingers down and lift them again, you'll get a tracking ID of 1 in slot 0, a tracking ID of 2 in slot 1, then a tracking ID of -1 in both slots to signal they ended. The tracking ID value itself is meaningless, it simply increases as touches are created.
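
If it helps to see the bookkeeping as code, here is a minimal sketch of that array-of-touchpoints idea, driven by the EV_ABS events. MAX_SLOTS and the struct are made up for the example; a real consumer would size the array from the ABS_MT_SLOT maximum instead.

/* Sketch: track touches as an array indexed by slot.  Feed it the EV_ABS
 * events of one device. */
#include <stdbool.h>
#include <linux/input.h>

#define MAX_SLOTS 10

struct touch_point {
    bool down;
    int x, y;
};

static struct touch_point touches[MAX_SLOTS];
static int current_slot;

void handle_abs_event(const struct input_event *ev)
{
    switch (ev->code) {
    case ABS_MT_SLOT:
        if (ev->value >= 0 && ev->value < MAX_SLOTS)
            current_slot = ev->value;   /* following events apply to this slot */
        break;
    case ABS_MT_TRACKING_ID:
        touches[current_slot].down = (ev->value != -1);   /* -1 ends the touch */
        break;
    case ABS_MT_POSITION_X:
        touches[current_slot].x = ev->value;
        break;
    case ABS_MT_POSITION_Y:
        touches[current_slot].y = ev->value;
        break;
    }
}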

Let's look at a single tap:

E: 0.000001 0003 0039 0387 # EV_ABS / ABS_MT_TRACKING_ID 387
E: 0.000001 0003 0035 2560 # EV_ABS / ABS_MT_POSITION_X 2560
E: 0.000001 0003 0036 2905 # EV_ABS / ABS_MT_POSITION_Y 2905
E: 0.000001 0003 003a 0059 # EV_ABS / ABS_MT_PRESSURE 59
E: 0.000001 0001 014a 0001 # EV_KEY / BTN_TOUCH 1
E: 0.000001 0003 0000 2560 # EV_ABS / ABS_X 2560
E: 0.000001 0003 0001 2905 # EV_ABS / ABS_Y 2905
E: 0.000001 0003 0018 0059 # EV_ABS / ABS_PRESSURE 59
E: 0.000001 0001 0145 0001 # EV_KEY / BTN_TOOL_FINGER 1
E: 0.000001 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +0ms
E: 0.021690 0003 003a 0067 # EV_ABS / ABS_MT_PRESSURE 67
E: 0.021690 0003 0018 0067 # EV_ABS / ABS_PRESSURE 67
E: 0.021690 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +21ms
E: 0.033482 0003 003a 0068 # EV_ABS / ABS_MT_PRESSURE 68
E: 0.033482 0003 0018 0068 # EV_ABS / ABS_PRESSURE 68
E: 0.033482 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +12ms
E: 0.044268 0003 0035 2561 # EV_ABS / ABS_MT_POSITION_X 2561
E: 0.044268 0003 0000 2561 # EV_ABS / ABS_X 2561
E: 0.044268 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +11ms
E: 0.054093 0003 0035 2562 # EV_ABS / ABS_MT_POSITION_X 2562
E: 0.054093 0003 003a 0067 # EV_ABS / ABS_MT_PRESSURE 67
E: 0.054093 0003 0000 2562 # EV_ABS / ABS_X 2562
E: 0.054093 0003 0018 0067 # EV_ABS / ABS_PRESSURE 67
E: 0.054093 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +10ms
E: 0.064891 0003 0035 2569 # EV_ABS / ABS_MT_POSITION_X 2569
E: 0.064891 0003 0036 2903 # EV_ABS / ABS_MT_POSITION_Y 2903
E: 0.064891 0003 003a 0059 # EV_ABS / ABS_MT_PRESSURE 59
E: 0.064891 0003 0000 2569 # EV_ABS / ABS_X 2569
E: 0.064891 0003 0001 2903 # EV_ABS / ABS_Y 2903
E: 0.064891 0003 0018 0059 # EV_ABS / ABS_PRESSURE 59
E: 0.064891 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +10ms
E: 0.073634 0003 0039 -001 # EV_ABS / ABS_MT_TRACKING_ID -1
E: 0.073634 0001 014a 0000 # EV_KEY / BTN_TOUCH 0
E: 0.073634 0003 0018 0000 # EV_ABS / ABS_PRESSURE 0
E: 0.073634 0001 0145 0000 # EV_KEY / BTN_TOOL_FINGER 0
E: 0.073634 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +9ms
We have a tracking ID (387) signalling finger down, as well as a position plus pressure. Then some updates and eventually a tracking ID of -1 (signalling finger up). Notice how there is no ABS_MT_SLOT here - the kernel filters those too, so while you stay in the same slot (0 in this case) you don't see any events for it. Also notice how you get both single-finger as well as multitouch data in the same event stream. This is for backwards compatibility. [2]

Ok, time for a two-finger tap:

E: 0.000001 0003 0039 0496 # EV_ABS / ABS_MT_TRACKING_ID 496
E: 0.000001 0003 0035 2609 # EV_ABS / ABS_MT_POSITION_X 2609
E: 0.000001 0003 0036 3791 # EV_ABS / ABS_MT_POSITION_Y 3791
E: 0.000001 0003 003a 0054 # EV_ABS / ABS_MT_PRESSURE 54
E: 0.000001 0003 002f 0001 # EV_ABS / ABS_MT_SLOT 1
E: 0.000001 0003 0039 0497 # EV_ABS / ABS_MT_TRACKING_ID 497
E: 0.000001 0003 0035 3012 # EV_ABS / ABS_MT_POSITION_X 3012
E: 0.000001 0003 0036 3088 # EV_ABS / ABS_MT_POSITION_Y 3088
E: 0.000001 0003 003a 0056 # EV_ABS / ABS_MT_PRESSURE 56
E: 0.000001 0001 014a 0001 # EV_KEY / BTN_TOUCH 1
E: 0.000001 0003 0000 2609 # EV_ABS / ABS_X 2609
E: 0.000001 0003 0001 3791 # EV_ABS / ABS_Y 3791
E: 0.000001 0003 0018 0054 # EV_ABS / ABS_PRESSURE 54
E: 0.000001 0001 014d 0001 # EV_KEY / BTN_TOOL_DOUBLETAP 1
E: 0.000001 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +0ms
E: 0.012909 0003 002f 0000 # EV_ABS / ABS_MT_SLOT 0
E: 0.012909 0003 0039 -001 # EV_ABS / ABS_MT_TRACKING_ID -1
E: 0.012909 0003 002f 0001 # EV_ABS / ABS_MT_SLOT 1
E: 0.012909 0003 0039 -001 # EV_ABS / ABS_MT_TRACKING_ID -1
E: 0.012909 0001 014a 0000 # EV_KEY / BTN_TOUCH 0
E: 0.012909 0003 0018 0000 # EV_ABS / ABS_PRESSURE 0
E: 0.012909 0001 014d 0000 # EV_KEY / BTN_TOOL_DOUBLETAP 0
E: 0.012909 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +12ms
This was a really quick two-finger tap that illustrates the tracking IDs nicely. In the first event we get a touch down, then an ABS_MT_SLOT event. This tells us that subsequent events belong to the other slot, so it's the other finger. There too we get a tracking ID + position. In the next event we get an ABS_MT_SLOT to switch back to slot 0. Tracking ID of -1 means that touch ended, and then we see the touch in slot 1 ended too.

Time for a two-finger scroll:

E: 0.000001 0003 0039 0557 # EV_ABS / ABS_MT_TRACKING_ID 557
E: 0.000001 0003 0035 2589 # EV_ABS / ABS_MT_POSITION_X 2589
E: 0.000001 0003 0036 3363 # EV_ABS / ABS_MT_POSITION_Y 3363
E: 0.000001 0003 003a 0048 # EV_ABS / ABS_MT_PRESSURE 48
E: 0.000001 0003 002f 0001 # EV_ABS / ABS_MT_SLOT 1
E: 0.000001 0003 0039 0558 # EV_ABS / ABS_MT_TRACKING_ID 558
E: 0.000001 0003 0035 3512 # EV_ABS / ABS_MT_POSITION_X 3512
E: 0.000001 0003 0036 3028 # EV_ABS / ABS_MT_POSITION_Y 3028
E: 0.000001 0003 003a 0044 # EV_ABS / ABS_MT_PRESSURE 44
E: 0.000001 0001 014a 0001 # EV_KEY / BTN_TOUCH 1
E: 0.000001 0003 0000 2589 # EV_ABS / ABS_X 2589
E: 0.000001 0003 0001 3363 # EV_ABS / ABS_Y 3363
E: 0.000001 0003 0018 0048 # EV_ABS / ABS_PRESSURE 48
E: 0.000001 0001 014d 0001 # EV_KEY / BTN_TOOL_DOUBLETAP 1
E: 0.000001 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +0ms
E: 0.027960 0003 002f 0000 # EV_ABS / ABS_MT_SLOT 0
E: 0.027960 0003 0035 2590 # EV_ABS / ABS_MT_POSITION_X 2590
E: 0.027960 0003 0036 3395 # EV_ABS / ABS_MT_POSITION_Y 3395
E: 0.027960 0003 003a 0046 # EV_ABS / ABS_MT_PRESSURE 46
E: 0.027960 0003 002f 0001 # EV_ABS / ABS_MT_SLOT 1
E: 0.027960 0003 0035 3511 # EV_ABS / ABS_MT_POSITION_X 3511
E: 0.027960 0003 0036 3052 # EV_ABS / ABS_MT_POSITION_Y 3052
E: 0.027960 0003 0000 2590 # EV_ABS / ABS_X 2590
E: 0.027960 0003 0001 3395 # EV_ABS / ABS_Y 3395
E: 0.027960 0003 0018 0046 # EV_ABS / ABS_PRESSURE 46
E: 0.027960 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +27ms
E: 0.051720 0003 002f 0000 # EV_ABS / ABS_MT_SLOT 0
E: 0.051720 0003 0035 2609 # EV_ABS / ABS_MT_POSITION_X 2609
E: 0.051720 0003 0036 3447 # EV_ABS / ABS_MT_POSITION_Y 3447
E: 0.051720 0003 002f 0001 # EV_ABS / ABS_MT_SLOT 1
E: 0.051720 0003 0036 3080 # EV_ABS / ABS_MT_POSITION_Y 3080
E: 0.051720 0003 0000 2609 # EV_ABS / ABS_X 2609
E: 0.051720 0003 0001 3447 # EV_ABS / ABS_Y 3447
E: 0.051720 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +24ms
E: 0.272034 0003 002f 0000 # EV_ABS / ABS_MT_SLOT 0
E: 0.272034 0003 0039 -001 # EV_ABS / ABS_MT_TRACKING_ID -1
E: 0.272034 0003 002f 0001 # EV_ABS / ABS_MT_SLOT 1
E: 0.272034 0003 0039 -001 # EV_ABS / ABS_MT_TRACKING_ID -1
E: 0.272034 0001 014a 0000 # EV_KEY / BTN_TOUCH 0
E: 0.272034 0003 0018 0000 # EV_ABS / ABS_PRESSURE 0
E: 0.272034 0001 014d 0000 # EV_KEY / BTN_TOOL_DOUBLETAP 0
E: 0.272034 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +30ms
Note that "scroll" is something handled in userspace, so what you see here is just a two-finger move. Everything in there i something we've already seen, but pay attention to the two middle events: as updates come in for each finger, the ABS_MT_SLOT changes before the upates are sent. The kernel filter for identical events is still in effect, so in the third event we don't get an update for the X position on slot 1. The filtering is per-touchpoint, so in this case this means that slot 1 position x is still on 3511, just as it was in the previous event.

That's all you have to remember, really. If you think of evdev as a serialised way of sending an array of touchpoints, with the slots as the indices, then it should be fairly clear. The rest is then just about actually looking at the touch positions and making sense of them.

Exercise: do a pinch gesture on your touchpad. See if you can track the two fingers moving closer together. Then do the same but only move one finger. See how the non-moving finger gets fewer updates.

That's it. There are a few more details to evdev but much of that is just more event types and codes. The few details you really have to worry about when processing events are either documented in libevdev or abstracted away completely. The above should be enough to understand what your device does, and what goes wrong when your device isn't working. Good luck.

[1] If not, file a bug against systemd's hwdb and CC me so we can put corrections in
[2] We treat some MT-capable touchpads as single-touch devices in libinput because the MT data is garbage

While stuck on an airplane, I put together a repository for apitraces with confirmed good images for driver output.  Combined with the piglit patches I pushed, I now have regression testing of actual apps on vc4 (particularly relevant given that I'm working on optimizing one of those apps!)

The flight was to visit the Raspberry Pi Foundation, with the goal of getting something usable for their distro to switch to the open 3D stack.  There's still a giant pile of KMS work to do (HDMI audio, DSI power management, SDTV support, etc.), and waiting for all of that to be regression-free will be a long time.  The question is: what could we do that would get us 3D, even if KMS isn't ready?

So, I put together a quick branch to expose the firmware's display stack as the KMS display pipeline. It's a filthy hack, and loses us a lot of the important new features that the open stack was going to bring (changing video modes in X, vblank timestamping, power management), but it gets us much closer to the featureset of the previous stack.  Hopefully they'll be switching to it as the default in new installs soon.

In debugging while I was here, Simon found that on his HDMI display the color ramps didn't quite match between closed and open drivers.  After a bit of worrying about gamma ramp behavior, I figured out that it was actually that the monitor was using a CEA mode that requires limited range RGB input.  A patch is now on the list.

September 17, 2016

Tickets for the systemd.conf 2016 Workshop day still available!

We still have a number of tickets for the workshop day of systemd.conf 2016 available. If you are a newcomer to systemd, and would like to learn about various systemd facilities, or if you already know your way around, but would like to know more: this is the best chance to do so. The workshop day is the 28th of September, one day before the main conference, at the betahaus in Berlin, Germany. The schedule for the day is available here. There are five interesting, extensive sessions, run by the systemd hackers themselves. Who better to learn systemd from than the folks who wrote it?

Note that the workshop day and the main conference days require different tickets. (Also note: there are still a few tickets available for the main conference!).

Buy a ticket here.

See you in Berlin!

September 16, 2016

libinput's touchpad acceleration is the cause of a few bugs and some outcry from a quite vocal (maj|in)ority. A common suggestion is "make it like the synaptics driver". So I spent a few hours going through the pointer acceleration code to figure out what xf86-input-synaptics actually does (I don't think anyone knows at this point) [1].

If you just want the TLDR: synaptics doesn't use physical distances but works in device units coupled with a few magic factors, also based on device units. That pretty much tells you all that's needed.

Also a disclaimer: the last time some serious work was done on acceleration was in 2008/2009. A lot of things have changed since then, and since the server is effectively un-testable, we ended up with the mess below that seems to make little sense. It probably made sense 8 years ago and given that most or all of the patches have my signed-off-by it must've made sense to me back then. But now we live in the glorious future and holy cow it's awful and confusing.

Synaptics has three options to configure speed: MinSpeed, MaxSpeed and AccelFactor. The first two are not explained beyond "speed factor" but given how accel usually works let's assume they should all somehow work as a multiplication on the delta (so a factor of 2 on a delta of dx/dy gives you 2dx/2dy). AccelFactor is documented as "acceleration factor for normal pointer movements", so clearly the documentation isn't going to help clear any confusion.

I'll skip the fact that synaptics also has a pressure-based motion factor with four configuration options because oh my god what have we done. Also, that one is disabled by default and has no effect unless set by the user. And I'll also only handle default values here, I'm not going to get into examples with configured values.

Also note: synaptics has a device-specific acceleration profile (the only driver that does) and thus the acceleration handling is split between the server and the driver.

Ok, let's get started. MinSpeed and MaxSpeed default to 0.4 and 0.7. The MinSpeed is used to set constant acceleration (1/min_speed) so we always apply a 2.5 constant acceleration multiplier to deltas from the touchpad. Of course, if you set constant acceleration in the xorg.conf, then it overwrites the calculated one.

MinSpeed and MaxSpeed are mangled during setup so that MaxSpeed is actually MaxSpeed/MinSpeed and MinSpeed is always 1.0. I'm not 100% sure why, but the later clipping to the min/max speed range ensures that we never go below a 1.0 acceleration factor (and thus never decelerate).

The AccelFactor default is 200/diagonal-in-device-coordinates. On my T440s it's thus 0.04 (and will be roughly the same for most PS/2 Synaptics touchpads). But on a Cyapa with a different axis range it is 0.125. On a T450s it's 0.035 when booted into PS2 and 0.09 when booted into RMI4. Admittedly, the resolution halves under RMI4 so this possibly maybe makes sense. It doesn't quite make as much sense when you consider the x220t, which also has a factor of 0.04 but a touchpad only half the size of the T440s'.

There's also a magic constant "corr_mul" which is set as:

/* synaptics seems to report 80 packet/s, but dix scales for
* 100 packet/s by default. */
pVel->corr_mul = 12.5f; /*1000[ms]/80[/s] = 12.5 */
It's correct that the frequency is roughly 80Hz but I honestly don't know what the 100packet/s reference refers to. Either way, it means that we always apply a factor of 12.5, regardless of the timing of the events. Ironically, this one is hardcoded and not configurable unless you happen to know that it's the X server option VelocityScale or ExpectedRate (both of them set the same variable).

Ok, so we have three factors. 2.5 as a function of MaxSpeed, 12.5 because of 80Hz (??) and 0.04 for the diagonal.

When the synaptics driver calculates a delta, it does so in device coordinates and ignores the device resolution (because this code pre-dates devices having resolutions). That's great until you have a device with uneven resolutions like the x220t. That one has 75 and 129 units/mm for x and y, so for any physical movement you're going to get almost twice as many units for y as for x. Which means that if you move 5mm to the right you end up with a different motion vector (and thus acceleration) than when you move 5mm south.

The core X protocol actually defines how acceleration is supposed to be handled. Look up the man page for XChangePointerControl(), it sets a threshold and an accel factor:

The XChangePointerControl function defines how the pointing device moves. The acceleration, expressed as a fraction, is a multiplier for movement. For example, specifying 3/1 means the pointer moves three times as fast as normal. The fraction may be rounded arbitrarily by the X server. Acceleration only takes effect if the pointer moves more than threshold pixels at once and only applies to the amount beyond the value in the threshold argument.
Of course, "at once" is a bit of a blurry definition outside of maybe theoretical physics. Consider the definition of "at once" for a gaming mouse with 500Hz sampling rate vs. a touchpad with 80Hz (let us fondly remember the 12.5 multiplier here) and the above description quickly dissolves into ambiguity.

Anyway, moving on. Let's say the server just received a delta from the synaptics driver. The pointer accel code in the server calculates the velocity over time, basically by doing a hypot(dx, dy)/dtime-to-last-event. Time in the server is always in ms, so our velocity is thus in device-units/ms (not adjusted for device resolution).

Side-note: the velocity is calculated across several delta events so it gets more accurate. There are some checks though so we don't calculate across random movements: anything older than 300ms is discarded, anything not in the same octant of movement is discarded (so we don't get a velocity of 0 for moving back/forth). And there's two calculations to make sure we only calculate while the velocity is roughly the same and don't average between fast and slow movements. I have my doubts about these, but until I have some more concrete data let's just say this is accurate (although since the whole lot is in device units, it probably isn't).

Anyway. The velocity is multiplied with the constant acceleration (2.5, see above) and our 12.5 magic value. I'm starting to think that this is just broken and would only make sense if we used a delta of "event count" rather than milliseconds.

It is then passed to the synaptics driver for the actual acceleration profile. The first thing the driver does is remove the constant acceleration again, so our velocity is now just v * 12.5. According to the comment this brings it back into "device-coordinate based velocity" but this seems wrong or misguided since we never changed into any other coordinate system.

The driver applies the accel factor (0.04, see above) and then clips the whole lot into the MinSpeed/MaxSpeed range (which is adjusted to move MinSpeed to 1.0 and scale up MaxSpeed accordingly, remember?). After the clipping, the pressure motion factor is calculated and applied. I skipped this above but it's basically: the harder you press the higher the acceleration factor. Based on some config options. Amusingly, pressure motion has the potential to exceed the MinSpeed/MaxSpeed options. Who knows what the reason for that is...

Oh, and btw: the clipping is actually done based on the accel factor that XChangePointerControl() passes into the acceleration function here. The code is

double acc = factor from XChangePointerControl();
double factor = the magic 0.04 based on the diagonal;

accel_factor = velocity * factor;
if (accel_factor > MaxSpeed * acc)
    accel_factor = MaxSpeed * acc;
So we have a factor set by XChangePointerControl() but it's only used to determine the maximum factor we may have, and then we clip to that. I'm missing some cross-dependency here because this is what the GUI acceleration config bits hook into. Somewhere this sets things and changes the acceleration by some amount but it wasn't obvious to me.

Alrighty. We have a factor now that's returned to the server and we're back in normal pointer acceleration land (i.e. not synaptics-specific). Woohoo. That factor is averaged across 4 events using Simpson's rule to smooth out abrupt changes. Not sure this really does much, I don't think we've ever done any evaluation on that. But it looks good on paper (we have that in libinput as well).

Now the constant accel factor is applied to the deltas. So far we've added the factor, removed it (in synaptics), and now we're adding it again. Which also makes me wonder whether we're applying the factor twice to all other devices, but right now I'm past the point where I really want to find out. With all the above, our acceleration factor is, more or less:

f = units/ms * 12.5 * (200/diagonal) * (1.0/MinSpeed)
and the deltas we end up using in the server are

(dx, dy) = f * (dx, dy)
But remember, we're still in device units here (not adjusted for resolution).

Anyway. You think we're finished? Oh no, the real fun bits start now. And if you haven't headdesked in a while, now is a good time.

After acceleration, the server does some scaling because synaptics is an absolute device (with axis ranges) in relative mode [2]. Absolute devices are mapped into the whole screen by default but when they're sending relative events, you still want a 45 degree line on the device to map into 45 degree cursor movement on the screen. The server does this by adjusting dy in-line with the device-to-screen-ratio (taking device resolution into account too). On my T440s this means:

touchpad x:y is 1:1.45 (16:11)
screen is 1920:1080 is 1:1.77 (16:9)

dy scaling is thus: (16:11)/(16:9) = 9:11 -> y * 11/9
dx is left as-is. Now you have the delta that's actually applied to the cursor. Except that we're in device coordinates, so we map the current cursor position to device coordinates, then apply the delta, then map back into screen coordinates (i.e. pixels). You may have spotted the flaw here: when the screen size changes, the dy scaling changes and thus the pointer feel. Plug in another monitor, and touchpad acceleration changes. Also: the same touchpad feels different on laptops when their screen hardware differs.

Ok, let's wrap this up. Figuring out what the synaptics driver does is... "tricky". It seems much like a glorified random number scheme. I'm not planning to implement "exactly the same acceleration as synaptics" in libinput because this would be insane and despite my best efforts, I'm not that insane yet. Collecting data from synaptics users is almost meaningless, because no two devices really employ the same acceleration profile (touchpad axis ranges + screen size) and besides, there are 11 configuration options that all influence each other.

What I do plan though is collect more motion data from a variety of touchpads and see if I can augment the server enough that I can get a clear picture of how motion maps to the velocity. If nothing else, this should give us some picture on how different the various touchpads actually behave.

But regardless, please don't ask me to "just copy the synaptics code".

[1] fwiw, I had this really great idea of trying to get behind all this, with diagrams and everything. But then I was printing json data from the X server into the journal to be scooped up by sed and python script to print velocity data. And I questioned some of my life choices.
[2] why the hell do we do this? because synaptics at some point became a device that announces the axis ranges (seemed to make sense at the time, 2008) and then other things started depending on it and with all the fixes to the server to handle absolute devices in relative mode (for tablets) we painted ourselves into a corner. Synaptics should switch back to being a relative device, but last I tried it breaks pointer acceleration and that a) makes the internets upset and b) restoring the "correct" behaviour is, well, you read the article so far, right?

September 14, 2016
I spent last week working on the glmark2 performance issues.  I now have a NIR patch out for the pathological conditionals test (it's now faster than on the old driver), and a branch for job shuffling (+17% and +27% on the two desktop tests).

Here's the basic idea of job shuffling:

We're a tiled renderer, and tiled renderers get their wins from having a Clear at the start of the frame (indicating we don't need to load any previous contents into the tile buffer).  When your frame is done, we flush each tile out to memory.  If you do your clear, start rendering some primitives, and then switch to some other FBO (because you're rendering to a texture that you're planning on texturing from in your next draw to the main FBO), we have to flush out all of those tiles, start rendering to the new FBO, and flush its rendering, and then when you come back to the main FBO we have to reload your old cleared-and-a-few-draws tiles.

Job shuffling deals with this by separating the single GL command stream into separate jobs per FBO.  When you switch to your temporary FBO, we don't flush the old job, we just set it aside.  To make this work we have to add tracking for which buffers have jobs writing into them (so that if you try to read those from another job, we can go flush the job that wrote it), and which buffers have jobs reading from them (so that if you try to write to them, they can get flushed so that they don't get incorrectly updated contents).
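
To give a rough idea of the shape of that tracking (hypothetical types and names, not the actual vc4 code): each buffer remembers the unflushed job writing it and the jobs reading it, and a conflicting access flushes the relevant jobs first.

/* Sketch of job/buffer dependency tracking with made-up types; the real vc4
 * code differs, this only shows the flush-on-conflict idea. */

#define MAX_READERS 8

struct job;

struct buffer {
    struct job *writer;                 /* job with pending writes to this buffer */
    struct job *readers[MAX_READERS];   /* jobs with pending reads of this buffer */
    int num_readers;
};

extern void flush_job(struct job *job); /* placeholder: submit the job to the GPU */

/* A new job wants to read from buf: the pending writer must land first. */
void before_read(struct buffer *buf)
{
    if (buf->writer) {
        flush_job(buf->writer);
        buf->writer = NULL;
    }
}

/* A new job wants to write buf: pending readers must not see the new data. */
void before_write(struct buffer *buf, struct job *writer)
{
    for (int i = 0; i < buf->num_readers; i++)
        flush_job(buf->readers[i]);
    buf->num_readers = 0;
    buf->writer = writer;
}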

This felt like it should have been harder than it was, and there's a spot where I'm using a really bad data structure I had laying around, but that data structure has been bad news since the driver was imported and it hasn't been high in any profiles yet.  The other tests don't seem to have any problem with the possible increased CPU overhead.

The shuffling branch also unearthed a few bugs related to clearing and blitting in the multisample tests.  Some of the piglit cases involved are fixed, but some will be reporting new piglit "regressions" because the tests are now closer to working correctly (sigh, reftests).

I also started writing documentation for updating the system's X and Mesa stack on Raspbian for one of the Foundation's developers.  It's not polished, and if I was rewriting it I would use modular's instead of some of what I did there.  But it's there for reference.
September 09, 2016

A great new feature has been merged during this 1.19 X server development cycle: we're now using threads for input [1]. Previously, there were two options for how an input driver would pass on events to the X server: polling or from within the signal handler. Polling simply adds all input devices' file descriptors to a select(2) loop that is processed in the mainloop of the server. The downside here is that if the server is busy rendering something, your input is delayed until that rendering is complete. Historically, polling was primarily used by the keyboard driver because it just doesn't matter much when key strokes are delayed. Both because you need the client to render them anyway (which it can't when it's busy) and possibly also because we're just so bloody used to typing delays.

The signal handler approach circumvented the delays by installing a SIGIO handler for each input device fd and calling that when any input occurs. This effectively interrupts the process until the signal handler completes, regardless of what the server is currently busy with. A great solution to provide immediate visible cursor movement (hence it is used by evdev, synaptics, wacom, and most of the now-retired legacy drivers) but it comes with a few side effects. First of all, because the main process is interrupted, the bit where we read the events must be completely separate to the bit where we process the events. That's easy enough, we've had an input event queue in the server for as long as I've been involved with X.Org development (~2006). The drivers push events into the queue during the signal handler, in the main loop the server reads them and processes them. In a busy server that may be several seconds after the pointer motion was performed on the screen but hey, it still feels responsive.

The bigger issue with the use of a signal handler is: you can't use malloc [2]. Or anything else useful. Look at the man page for signal(7), it literally has a list of allowed functions. This leads to two weird side-effects: one is that you have to pre-allocate everything you may ever need for event processing, the other is that you need to re-implement any function that is not currently async signal safe. The server actually has its own implementation of printf for this reason (for error logging). Let's just say this is ... suboptimal. Coincidentally, libevdev is mostly async signal safe for that reason too. It also means you can't use any libraries, because no-one [3] is insane enough to make libraries async signal-safe.
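
The shape of that split, sketched with a pre-allocated ring buffer (this is not the server's actual event queue code): the signal handler only writes into storage allocated up front, and the main loop drains it whenever it gets around to it.

/* Sketch of the signal-handler split (not the X server's actual queue):
 * everything the handler touches is pre-allocated, no malloc, no locks. */
#include <linux/input.h>

#define QUEUE_SIZE 512

static struct input_event queue[QUEUE_SIZE];
static volatile unsigned int head, tail;

/* Called from the SIGIO handler: only async-signal-safe operations. */
void enqueue_event(const struct input_event *ev)
{
    unsigned int next = (head + 1) % QUEUE_SIZE;

    if (next == tail)
        return;                    /* queue full, drop the event */
    queue[head] = *ev;
    head = next;
}

/* Called from the main loop, possibly seconds later on a busy server. */
int dequeue_event(struct input_event *ev)
{
    if (tail == head)
        return 0;                  /* nothing pending */
    *ev = queue[tail];
    tail = (tail + 1) % QUEUE_SIZE;
    return 1;
}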

We were still mostly "happy" with it until libinput came along. libinput is a full input stack and expecting it to work within a signal handler is somewhere between optimistic, masochistic and sadistic. The xf86-input-libinput driver doesn't use the signal handler and the side effect of this is that a desktop with libinput didn't feel as responsive when the server was busy rendering.

Keith Packard stepped in and switched the server from the signal handler to using input threads. Or more specifically: one input thread on top of the main thread. That thread controls all the input device's file descriptors and continuously reads events off them. It otherwise provides the same functionality the signal handler did before: visible pointer movement and shoving events into the event queue for the main thread to process them later. But of course, once you switch to threads, problems have 2 you now. A signal handler is "threading light", only one code path can be interrupted and you know you continue where you left off. So synchronisation primitives are easier than in threads where both code paths continue independently. Keith replaced the previous xf86BlockSIGIO() calls with corresponding input_lock() and input_unlock() calls and all the main drivers have been switched over. But some interesting race conditions kept happening. But as of today, we think most of these are solved.

The best test we have at this point is libinput's internal test suite. It creates roughly 5000 devices within about 4 minutes and thus triggers most code paths to do with device addition and removal, especially the overlaps between devices sending events before/during/after they get added and/or removed. This is the largest source of possible errors as these are the code paths with the most amount of actual simultaneous access to the input devices by both threads. But what the test suite can't test is normal everyday use. So until we get some more code maturity, expect the occasional crash and please do file bug reports. They'll be hard to reproduce and detect, but don't expect us to run into the same race conditions by accident.

[1] Yes, your calendar is right, it is indeed 2016, not the 90s or so
[2] Historical note: we actually mostly ignored this until about 2010 or so when glibc changed the malloc implementation and the server was just randomly hanging whenever we tried to malloc from within the signal handler. Users claimed this was bad UX, but I think it's right up there with motif.
[3] yeah, yeah, I know, there's always exceptions.

September 08, 2016

When working with timestamps, one question that often arises is the precision of those timestamps. Most software is good enough with a precision up to the second, and that's easy. But in some cases, like working on metering, a finer precision is required.

I don't know exactly why¹, and it makes me suffer every day, but OpenStack is really tied to MySQL (and its clones). It hurts because MySQL is a very poor solution if you want to leverage your database to actually solve problems. But that's how life is, unfair. And in the context of the projects I work on, that boils down to the fact that we can't afford not to support MySQL.

So here we are, needing to work with MySQL and at the same time requiring timestamp with a finer precision than just seconds. And guess what: MySQL did not support that until 2011.

No microseconds in MySQL? No problem: DECIMAL!

MySQL 5.6.4 (released in 2011), a beta version of MySQL 5.6 (hello MySQL, ever heard of Semantic Versioning?), brought microsecond precision to timestamps. But the first stable version supporting that, MySQL 5.6.10, was only released in 2013. So for a long time, there was a problem without any solution.

The obvious workaround, in this case, is to reassess your choices in technologies, discover that PostgreSQL has supported microsecond precision for at least a decade, and call the problem solved.

This is not what happened in our case: in order to support MySQL, one had to find a workaround. And so they did in our Ceilometer project, using a DECIMAL type instead of DATETIME.

The DECIMAL type takes 2 arguments: the total number of digits you need to store, and how many of that total will be used for the fractional part. Knowing that MySQL's internal storage uses 1 byte for 2 digits, 2 bytes for 4 digits, 3 bytes for 6 digits and 4 bytes for 9 digits, and that the integer and fractional parts are stored independently, you want to pick numbers of digits that fit those boundaries in order to make the best use of the storage space.

This is why Ceilometer picked 14 for the integer part (9 digits on 4 bytes and 5 digits on 3 bytes) and 6 for the decimal part (3 bytes).

Wait. It's stupid because:

  • DECIMAL(20, 6) implies that you use 14 digits for the integer part, which, using epoch as a reference, lets you encode timestamps up to (10^14) - 1, i.e. year 3170843. I am certain Ceilometer won't last that far.
  • 14 digits is 9 + 5 digits in MySQL, which is 7 bytes, the same size that is used for 9 + 6 digits. So you could have had DECIMAL(21, 6) for the same storage space (and gone up to year 31690708, which is a nice bonus, right?)

Well, I guess the original author of the patch did not read the documentation entirely (DECIMAL(20, 6) appears on the MySQL documentation page as an example, so I imagine it was just copy-pasted blindly).

The best choice for this use case would have been DECIMAL(17, 6), which allows storing 11 digits for the integer part (5 bytes), supporting timestamps up to (10^11) - 1 (year 5138), and 6 digits for the decimal part (3 bytes), using only 8 bytes in total per timestamp.
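To double-check that arithmetic, here is a quick sketch (mine, not Ceilometer's) applying the storage rule described above, i.e. 9 digits packed into 4 bytes per part, with leftover digits taking 1 to 4 bytes:

def mysql_decimal_bytes(precision, scale):
    """Storage size in bytes of a MySQL DECIMAL(precision, scale) column."""
    # MySQL packs 9 decimal digits into 4 bytes and stores the integer and
    # fractional parts independently; leftover digits take 1 to 4 bytes.
    leftover_bytes = [0, 1, 1, 2, 2, 3, 3, 4, 4]

    def part_bytes(digits):
        return (digits // 9) * 4 + leftover_bytes[digits % 9]

    return part_bytes(precision - scale) + part_bytes(scale)


print(mysql_decimal_bytes(20, 6))  # 10 bytes: 7 for the 14 integer digits + 3
print(mysql_decimal_bytes(17, 6))  # 8 bytes: 5 for the 11 integer digits + 3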

Nonetheless, this workaround has been implemented using a SQLAlchemy custom type and works as expected:

import sqlalchemy
from sqlalchemy import types


class PreciseTimestamp(types.TypeDecorator):
    """Represents a timestamp precise to the microsecond."""

    impl = sqlalchemy.DateTime

    def load_dialect_impl(self, dialect):
        if dialect.name == 'mysql':
            # MySQL: store as DECIMAL(20, 6) instead of DATETIME
            return dialect.type_descriptor(
                types.DECIMAL(precision=20, scale=6, asdecimal=True))
        return dialect.type_descriptor(self.impl)

Microseconds in MySQL? Damn, migration!

As I said, MySQL 5.6.4 brought microsecond precision to the table (pun intended). Therefore, it's a great time to migrate away from this hackish format to the brand new one.

First, be aware that the default DATETIME type has no sub-second precision: you have to specify how many fractional-second digits you want as an argument. To support microseconds, you should therefore use DATETIME(6).
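With SQLAlchemy, one way to express this is a dialect-specific variant, so that MySQL gets DATETIME(6) while other back-ends keep a plain DateTime (a sketch; the column name is made up):

import sqlalchemy as sa
from sqlalchemy.dialects import mysql

# DATETIME(6) on MySQL, i.e. microsecond precision; plain DateTime elsewhere.
mytime = sa.Column(
    "mytime",
    sa.DateTime().with_variant(mysql.DATETIME(fsp=6), "mysql"),
    nullable=False,
)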

If we were using a great RDBMS, let's say, hum, PostgreSQL, we could do that very easily, see:

postgres=# CREATE TABLE foo (mytime decimal);
postgres=# \d foo
      Table "public.foo"
 Column |  Type   | Modifiers
--------+---------+-----------
 mytime | numeric |

postgres=# INSERT INTO foo (mytime) VALUES (1473254401.234);
postgres=# ALTER TABLE foo ALTER COLUMN mytime SET DATA TYPE timestamp with time zone USING to_timestamp(mytime);
postgres=# \d foo
              Table "public.foo"
 Column |           Type           | Modifiers
--------+--------------------------+-----------
 mytime | timestamp with time zone |

postgres=# select * from foo;
           mytime
----------------------------
 2016-09-07 13:20:01.234+00
(1 row)

And since this is a pretty common use case, it's even an example in the PostgreSQL documentation. The version from the documentation uses a calculation based on epoch, whereas my example here leverages the to_timestamp() function. That's my personal touch.

Obviously, doing this conversion in a single line is not possible with MySQL: it does not implement the USING keyword on ALTER TABLE … ALTER COLUMN. So what's the solution gonna be? Well, it's a 4 steps job:

  1. Create a new column of type DATETIME(6)
  2. Copy data from the old column to the new column, converting them to the new format
  3. Delete the old column
  4. Rename the new column to the old column name.

But I know what you're thinking: there are 4 steps, but that's not a problem, we'll just use a transaction and embed these operations inside it.

Well, no: MySQL does not support transactions on data definition language (DDL) statements. So if any of those steps fails, you'll be unable to roll back steps 1, 3 and 4. Who knew that using MySQL was like living on the edge, right?

Doing this in Python with our friend Alembic

I like Alembic. It's a Python library based on SQLAlchemy that handles schema migration for your favorite RDBMS.

Once you have created a new Alembic migration script using alembic revision, it's time to edit it and write something along these lines:

from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects import mysql
from sqlalchemy.sql import func


class Timestamp(sa.types.TypeDecorator):
    """Represents a timestamp precise to the microsecond."""

    impl = sa.DateTime

    def load_dialect_impl(self, dialect):
        if dialect.name == 'mysql':
            return dialect.type_descriptor(mysql.DATETIME(fsp=6))
        return self.impl


def upgrade():
    bind = op.get_bind()
    if bind and bind.engine.name == "mysql":
        existing_type = sa.types.DECIMAL(
            precision=20, scale=6, asdecimal=True)
        existing_col = sa.Column("mytime", existing_type, nullable=False)
        temp_col = sa.Column("mytime_ts", Timestamp(), nullable=False)
        # Step 1: ALTER TABLE mytable ADD COLUMN mytime_ts DATETIME(6)
        op.add_column("mytable", temp_col)
        t = sa.sql.table("mytable", existing_col, temp_col)
        # Step 2: UPDATE mytable SET mytime_ts = from_unixtime(mytime)
        op.execute(t.update().values(
            mytime_ts=func.from_unixtime(existing_col)))
        # Step 3: ALTER TABLE mytable DROP COLUMN mytime
        op.drop_column("mytable", "mytime")
        # Step 4: ALTER TABLE mytable CHANGE mytime_ts mytime DATETIME(6)
        # Note: MySQL needs to be given all the old/new column information
        # just to rename a column…
        op.alter_column("mytable",
                        "mytime_ts",
                        nullable=False,
                        type_=Timestamp(),
                        existing_nullable=False,
                        existing_type=existing_type,
                        new_column_name="mytime")

In MySQL, the function to convert a UNIX timestamp (a float) into a datetime is from_unixtime(), so the script leverages it to convert the data. As said, you'll notice we don't bother using any kind of transaction, so if anything goes wrong there's no rollback, and it won't be possible to re-run the migration without manual intervention.

Timestamp is a custom class that implements sqlalchemy.DateTime using a DATETIME(6) type for MySQL and a regular sqlalchemy.DateTime type for other back-ends. It is used by the rest of the code (e.g. the ORM model), but I've pasted it into this example for a better understanding.

Once written, you can easily test your migration using pifpaf to run a temporary database:

$ pifpaf run mysql $SHELL
$ alembic -c alembic/alembic.ini upgrade 1c98ac614015 # upgrade to the initial revision
$ mysql -S $PIFPAF_MYSQL_SOCKET pifpaf
mysql> INSERT INTO mytable (mytime) VALUES (1325419200.213000);
Query OK, 1 row affected (0.00 sec)
mysql> SELECT * FROM mytable;
+-------------------+
| mytime            |
+-------------------+
| 1325419200.213000 |
+-------------------+
1 row in set (0.00 sec)
$ alembic -c alembic/alembic.ini upgrade head
$ mysql -S $PIFPAF_MYSQL_SOCKET pifpaf
mysql> SELECT * FROM mytable;
+----------------------------+
| mytime                     |
+----------------------------+
| 2012-01-01 13:00:00.213000 |
+----------------------------+
1 row in set (0.00 sec)

And voilà, we just unsafely migrated our data to a new, fancy format. Thank you Alembic for solving a problem we would not have without MySQL. 😊

September 07, 2016
Last week I was tasked with making performance comparisons between vc4 and the closed driver possible.  I decided to take glmark2 and port it to dispmanx, and submitted a pull request upstream.  It already worked on X11 on the vc4 driver, and I fixed the drm backend to work as well (though the drm backend has performance problems specific to the glmark2 code).

Looking at glmark2, vc4 has a few bugs.  Terrain has some rendering bugs.  The driver on master took a big performance hit on one of the conditionals tests since the loops support was added, because NIR isn't aggressive enough in flattening if statements.  Some of the tests require that we shuffle rendering jobs to avoid extra frame store/loads.  Finally, we need to use the multithreaded fragment shader mode to hide texture fetching latency on a bunch of tests.  Despite the bugs, results looked good.

(Note: I won't be posting comparisons here.  Comparisons will be left up to the reader on their particular hardware and software stacks).

I'm expecting to get to do some glamor work on vc4 again soon, so I spent some of the time while I was waiting for Raspberry Pi builds working on the X Server's testing infrastructure.  I've previously talked about Travis CI, but for it to be really useful it needs to run our integration tests.  I fixed up the piglit XTS test wrapper to not spuriously fail, made the X Test suite spuriously fail less, and worked with Adam Jackson at Red Hat to fix build issues in XTS.  Finally, I wrote scripts that will, if you have an XTS tree and a piglit tree and build Xvfb, actually run the XTS rendering tests at xserver make check time.

Next steps for xserver testing are to test glamor in a similar fashion, and to integrate this testing into travis-ci and land the travis-ci branch.

Finally, I submitted pull requests to the upstream kernel.  4.8 got some fixes for VC4 3D (merged by Dave), and 4.9 got interlaced vblank timing patches from Mario Kleiner (not yet merged by Dave) and Raspberry Pi Zero support (merged by Florian).
September 06, 2016

On Fedora, if you have mate-desktop or cinnamon-desktop installed, your GNOME touchpad configuration panel won't work (see Bug 1338585). Both packages install a symlink to assign the synaptics driver to the touchpad. But GNOME's control-center does not support synaptics anymore, so no touchpad is detected. Note that the issue occurs regardless of whether you actually use MATE/Cinnamon; merely installing it is enough.

Unfortunately, there is no good solution to this issue. Long-term both MATE and Cinnamon should support libinput but someone needs to step up and implement it. We don't support run-time driver selection in the X server, so an xorg.conf.d snippet is the only way to assign a touchpad driver. And this means that you have to decide whether GNOME's or MATE/Cinnamon's panel is broken at X start-up time.

If you need the packages installed but you're not actually using MATE/Cinnamon itself, remove the following symlinks (whichever are present on your system):

# rm /etc/X11/xorg.conf.d/99-synaptics-mate.conf
# rm /etc/X11/xorg.conf.d/99-synaptics-cinnamon.conf
# rm /usr/share/X11/xorg.conf.d/99-synaptics-mate.conf
# rm /usr/share/X11/xorg.conf.d/99-synaptics-cinnamon.conf
The /usr/share paths are the old ones and have been replaced with the /etc/ symlinks in cinnamon-desktop-3.0.2-2.fc25 and mate-desktop-1.15.1-4.fc25 and their F24 equivalents.

I'm using the T450 and T460 as reference here, but this affects all laptops from the Lenovo *50 and *60 series. The Lenovo T450 and T460 have the same touchpad hardware, but unfortunately it suffers from what is probably a firmware issue: on really slow movements, the pointer has a halting motion. That effect disappears when the finger moves faster.

The observable effect is that of a pointer stalling, then jumping by 20 or so pixels. We have had a quirk for this in libinput since March 2016 (see commit a608d9) and detect this at runtime for selected models. In particular, what we do is look for a sequence of events that only update the pressure values but not the x/y position of the finger. This is a good indication that the bug triggers. While it's possible to trigger pressure changes alone, triggering several in a row without a change in the x/y coordinates is extremely unlikely. Remember that these touchpads have a resolution of ~40 units per mm - you cannot hold your finger that still while changing pressure [1]. Once we see those pressure changes only we reset the motion history we keep for each touch. The next event with an x/y coordinate will thus not calculate the delta to the previous position and not trigger a move. The event after that is handled normally again. This avoids the extreme jumps but there isn't anything we can do about the stalling - we never get the event from the kernel. [2]

Anyway. This bug popped up again elsewhere so this time I figured I'll analyse the data more closely. Specifically, I wrote a script that collected all x/y coordinates of a touchpad recording [3] and produced a black and white image of all device coordinates sent. This produces a graphic that's interesting but not overly useful:

Roughly 37000 touchpad events. You'll have to zoom in to see the actual pixels.
I modified the script to assume a white background and colour any x/y coordinate that was never hit black. So an x coordinate of 50 would now produce a vertical 1 pixel line at 50, a y coordinate of 70 a horizontal line at 70, etc. Any pixel that remains white is a coordinate that is hit at some point, anything black was unreachable. This produced more interesting results. Below is the graphic of a short, slow movement right to left.

A single short slow finger movement
You can clearly see the missing x coordinates. More specifically, there are some events, then a large gap, then events again. That gap is the stalling cursor where we didn't get any x coordinates. My first assumption was that it may be a sensor issue and that some areas on the touchpad just don't trigger. So what I did was move my finger around the whole touchpad to try to capture as many x and y coordinates as possible.
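As an aside, this kind of coverage map can be produced with a few lines of Python. This is a minimal sketch of the idea rather than the actual script used for these graphics; it assumes an evemu-record style text log and uses Pillow to write the image:

import sys
from PIL import Image  # Pillow

EV_ABS, ABS_X, ABS_Y = 0x03, 0x00, 0x01

xs, ys = set(), set()
with open(sys.argv[1]) as f:
    for line in f:
        if not line.startswith("E:"):
            continue
        _, _, etype, ecode, value = line.split()[:5]
        if int(etype, 16) != EV_ABS:
            continue
        if int(ecode, 16) == ABS_X:
            xs.add(int(value))
        elif int(ecode, 16) == ABS_Y:
            ys.add(int(value))

w, h = max(xs) + 1, max(ys) + 1
img = Image.new("1", (w, h), 1)        # white background
pixels = img.load()
for x in set(range(w)) - xs:           # vertical black line for every x never hit
    for y in range(h):
        pixels[x, y] = 0
for y in set(range(h)) - ys:           # horizontal black line for every y never hit
    for x in range(w):
        pixels[x, y] = 0
img.save("coverage.png")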

Let's have a look at the recording from a T440 first because it doesn't suffer from this issue:

Sporadic black lines indicating unused coordinates but the center is purely white, indicating every device unit was hit at some point
Ok, looks roughly ok. The black areas are irregular, on the edges and likely caused by me just not covering those areas correctly. In the center it's white almost everywhere, that's where the most events were generated. And now let's compare this to a T450:

A visible grid of unreachable device units
The difference is quite noticeable, especially if you consider that the T440 recording had under 15000 events, the T450 recording had almost 37000. The T450 has a patterned grid of unreachable positions. But why? We currently use the PS/2 protocol to talk to the device but we should be using RMI4 over SMBus instead (which is what Windows has done for a while and luckily the RMI4 patches are on track for kernel 4.9). Once we talk to the device in its native protocol we see a resolution of ~20 units/mm and it looks like the T440 output:

With RMI4, the grid disappears
Ok, so the problem is not missing coordinates in the sensor, and besides, at the resolution the touchpad has, a single 'pixel' not triggering shouldn't be much of a problem anyway.

Maybe the issue had to do with horizontal movements or something? The next approach was for me to move my finger slowly from one side to the other. That's actually hard to do consistently when you're not a robot, so the results are bound to be slightly different. On the T440:

The x coordinates are sporadic with many missing ones, but the y coordinates are all covered
You can clearly see where the finger moved left to right. The big black gaps on the x coordinates mostly reflect me moving too fast but you can see how the distance narrows, indicating slower movements. Most importantly: vertically, the strip is uniformly white, meaning that within that range I hit every y coordinate at least once. And the recording from the T450:

Only one gap in the y range, sporadic gaps in the x range
Well, still looks mostly the same, so what is happening here? Ok, last test: This time an extremely slow motion left to right. It took me 87 seconds to cover the touchpad. In theory this should render the whole strip white if all x coordinates are hit. But look at this:

An extremely slow finger movement
Ok, now we see the problem. This motion was slow enough that almost every x coordinate should have been hit at least once. But there are large gaps, and most notably: larger gaps than in the recording above, which was a faster finger movement. So what we have here is not an actual hardware sensor issue; the firmware is working against us, filtering things out. Unfortunately, that's also the worst result because while hardware issues can usually be worked around, firmware issues are a lot more subtle and less predictable. We've also verified that newer firmware versions don't fix this, and trying out some tweaks in the firmware didn't change anything either.

Windows is affected by this too and so is the synaptics driver. But it's not really noticeable on either and all reports so far were against libinput, with some even claiming that it doesn't manifest with synaptics. But each time we investigated in more detail it turns out that the issue is still there (synaptics uses the same kernel data after all) but because of different acceleration methods users just don't trigger it. So my current plan is to change the pointer acceleration to match something closer to what synaptics does on these devices. That's hard because synaptics is mostly black magic (e.g. synaptics' pointer acceleration depends on screen resolution) and hard to reproduce. Either way, until that is sorted at least this post serves as a link to point people to.

Many thanks to Andrew Duggan from Synaptics and Benjamin Tissoires for helping out with the analysis and testing of all this.

[1] Because pressing down on a touchpad flattens your finger and thus changes the shape slightly. While you can hold a finger still, you cannot control that shape
[2] Yes, predictive movement would be possible but it's very hard to get this right
[3] These are events as provided by the kernel and unaffected by anything in the userspace stack

September 05, 2016

A few weeks ago, I recorded an interview with Krishnan Raghuram about what was discussed for this development cycle for OpenStack Telemetry at the Austin summit.

It's interesting to look back at this video more than 3 months after recording it and see what actually happened in Telemetry. It turns out that some of the things I thought were going to happen have not happened yet. As the first release candidate version is approaching, it's very unlikely they will.

And on the other side, some new fancy features arrived suddenly without me having a clue about them.

As far as Ceilometer is concerned, here's the list of what really happened in terms of user features:

  • Added full support for SNMP v3 USM model
  • Added support for batch measurement in Gnocchi dispatcher
  • Set ended_at timestamp in Gnocchi dispatcher
  • Allow Swift pollster to specify regions
  • Add L3 cache usage and memory bandwidth meters
  • Split out the event code (REST API and storage) to a new Panko project

And a few other minor things. I planned none of them except Panko (which I was responsible for), and the things we did plan (documentation update, pipeline rework and polling enhancement) have not happened yet.

For Aodh, we expected to rework the documentation entirely too, and that did not happen either. What we did instead:

  • Deprecate and disable combination alarms
  • Add pagination support in REST API
  • Deprecated all non-SQL database store and provide a tool to migrate
  • Support batch notification for aodh-notifier

It's definitely a good list of new features for Aodh. It's still small, but it simplifies the project, removes technical debt and continues building momentum around it.

For Gnocchi, we really had no plan, except maybe a few small features (they're usually tracked in the Launchpad bug list). It turned out we had a fancy new idea with Gordon Chung on how to boost our storage engine, so we worked on that. It kept us busy for a few weeks in the end, but the preliminary results look tremendous – so it was definitely worth it. We also have an AWS S3 storage driver on its way.

I find this exercise interesting, as it really emphasizes how you can't really control what's happening in any open source project, where your contributors come and go and work on their own agenda.

That does not mean we're dropping the themes and ideas I laid out in that video. We're still pushing our "documentation is mandatory" policy and improving our "works by default" scenario. It's just a longer road than we expected.

September 03, 2016
As is usual in GPUs, Radeon executes shaders in waves that execute the same program for many threads or work-items simultaneously in lock-step. Given a single program counter for up to 64 items (e.g. pixels being processed by a pixel shader), branch statements must be lowered to manipulation of the exec mask (unless the compiler can prove the branch condition to be uniform across all items). The exec mask is simply a bit-field that contains a 1 for every thread that is currently active, so code like this:

if (i != 0) {
    ... some code ...
}
gets lowered to something like this:

v_cmp_ne_i32_e32 vcc, 0, v1
s_and_saveexec_b64 s[0:1], vcc
s_xor_b64 s[0:1], exec, s[0:1]

... some code ...

s_or_b64 exec, exec, s[0:1]
(The saveexec assembly instructions apply a bit-wise operation to the exec register, storing the original value of exec in their destination register. Also, we can introduce branches to skip the if-block entirely if the condition happens to be uniformly false.)

This is quite different from CPUs, and so a generic compiler framework like LLVM tends to get confused. For example, the fast register allocator in LLVM is a very simple allocator that just spills all live registers at the end of a basic block before the so-called terminators. Usually, those are just branch instructions, so in the example above it would spill registers after the s_xor_b64.

This is bad because the exec mask has already been reduced by the if-condition at that point, and so vector registers end up being spilled only partially.

Until recently, these issues were hidden by the fact that we lowered the control flow instructions into their final form only at the very end of the compilation process. However, earlier optimization passes, including register allocation, can benefit from seeing the precise shape of the GPU-style control flow. But then, some of the subtleties of the exec mask need to be taken into account by those earlier optimization passes as well.

A related problem arises with another GPU-specific specialty, the "whole quad mode". We want to be able to compute screen-space derivatives in pixel shaders - mip-mapping would not be possible without it - and the way this is done in GPUs is to always run pixel shaders on 2x2 blocks of pixels at once and approximate the derivatives by taking differences between the values for neighboring pixels. This means that the exec mask needs to be turned on for pixels that are not really covered by whatever primitive is currently being rendered. Those are called helper pixels.

However, there are times when helper pixels absolutely must be disabled in the exec mask, for example when storing to an image. A separate pass deals with the enabling and disabling of helper pixels. Ideally, this pass should run after instruction scheduling, since we want to be able to rearrange memory loads and stores freely, which can only be done before adding the corresponding exec-instructions. The instructions added by this pass look like this:

s_mov_b64 s[2:3], exec
s_wqm_b64 exec, exec

... code with helper pixels enabled goes here ...

s_and_b64 exec, exec, s[2:3]

... code with helper pixels disabled goes here ...
Naturally, adding the bit-wise AND of the exec mask must happen in a way that doesn't conflict with any of the exec manipulations for control flow. So some careful coordination needs to take place.

My suggestion is to allow arbitrary instructions at the beginning and end of basic blocks to be marked as "initiators" and "terminators", as opposed to the current situation, where there is no notion of initiators, and whether an instruction is a terminator is a property of the opcode. An alternative, that Matt Arsenault is working on, adds aliases for certain exec-instructions which act as terminators. This may well be sufficient, I'm looking forward to seeing the result.
August 31, 2016

In the X server, the input driver assignment is handled by xorg.conf.d snippets. Each driver assigns itself to the type of devices it can handle and the driver that actually loaded is simply the one that sorts last. Historically, we've had the evdev driver sort low and assign itself to everything. synaptics, wacom and the other few drivers that matter sorted higher than evdev and thus assigned themselves to the respective device.
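For reference, such a snippet is just an InputClass section matching a class of devices and naming the driver; libinput's touchpad catchall looks roughly like this (quoted from memory, so treat it as illustrative rather than the exact shipped file):

Section "InputClass"
        Identifier "libinput touchpad catchall"
        MatchIsTouchpad "on"
        MatchDevicePath "/dev/input/event*"
        Driver "libinput"
EndSection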

When xf86-input-libinput first came out 2 years ago, we used a higher sort order than all other drivers to assign it to (almost) all devices. This was of course intentional because we believe that libinput is the best input stack around, the odd bug notwithstanding. Now it has matured a fair bit and we have had a lot more exposure to various types of hardware. We've been quirking and fixing things like crazy and libinput is much better for it.

Two things were an issue with this approach though. First, overriding xf86-input-libinput required manual intervention, usually either copying or symlinking an xorg.conf.d snippet. Second, even though we were overriding the default drivers, we still had them installed everywhere. Now it's time to start properly retiring the old drivers.

The upstream approach for this is fairly simple: the xf86-input-libinput xorg.conf.d snippet will drop in sort order to sit above evdev. evdev remains as the fallback driver for miscellaneous devices where evdev's blind "forward everything" approach is sufficient. All other drivers will sort higher than xf86-input-libinput and will thus override the xf86-input-libinput assignment. The new sort order is thus:

  • evdev
  • libinput
  • synaptics, wacom, vmmouse, joystick
evdev and libinput are generic drivers, the others are for specific devices or use-cases. To use a specific driver other than xf86-input-libinput, you now only have to install it. To fall back to xf86-input-libinput, you uninstall it. No more manual xorg.conf.d snippet symlinking.

This has an impact on distributions and users. Distributions should ensure that other drivers are never installed by default unless requested by the user (or some software). And users need to be aware that having a driver other than xf86-input-libinput installed may break things. For example, recent GNOME does not support the synaptics driver anymore, installing it will cause the control panel's touchpad bits to stop working. So there'll be a messy transition period but once things are settled, the solution to most input-related driver bugs will be "install/remove driver $foo" as opposed to the current symlink/copy/write an xorg.conf.d snippet.

August 29, 2016
I spent a day or so last week cleaning up @jonasarrow's demo patch for derivatives on vc4.  It had been hanging around on the github issue waiting for a rework due to feedback, and I decided to finally just go and do it.  It unfortunately involved totally rewriting their patches (which I dislike doing, it's always more awesome to have the original submitter get credit), but we now have dFdx()/dFdy() on Mesa master.

I also landed a fix for GPU hangs with 16 vertex attributes (4 or more vec4s, aka glsl-routing in piglit).  I'd been debugging this one for a while, and finally came up with an idea ("what if this FIFO here is a bad idea to use and we should be synchronous with this external unit?"), it worked, and a hardware developer confirmed that the fix was correct.  This one got a huge explanation comment.  I also fixed discards inside of if/loop statements -- generally discards get lowered out of ifs, but if it's in a non-unrolled loop we were doing discards ignoring whether the channel was in the loop.

Thanks to review from Rhys, I landed Mesa's Travis build fixes.  Rhys then used Travis to test out a couple of fixes to i915 and r600.  This is pretty cool, but it just makes me really want to get piglit into Travis so that we can get some actual integration testing in this process.

I got xserver's Travis to the point of running the unit tests, and one of them crashes on CI but not locally.  That's interesting.

The last GPU hang I have in piglit is in glsl-vs-loops.  This week I figured out what's going on, and I hope I'll be able to write about a fix next week.

Finally, I landed Stefan Wahren's Raspberry Pi Zero devicetree for upstream.  If nothing goes wrong, the Zero should be supported in 4.9.

Lately I have been working on a simple terrain OpenGL renderer demo, mostly to have a playground where I could try some techniques like shadow mapping and water rendering in a scenario with a non trivial amount of geometry, and I thought it would be interesting to write a bit about it.

But first, here is a video of the demo running on my old Intel IvyBridge GPU:

And some screenshots too:

OpenGL Terrain screenshot 1 OpenGL Terrain screenshot 2
OpenGL Terrain screenshot 3 OpenGL Terrain screenshot 4

Note that I did not create any of the textures or 3D models featured in the video.

With that out of the way, let’s dig into some of the technical aspects:

The terrain is built as a 251×251 grid of vertices elevated with a heightmap texture, so it contains 63,000 vertices and 125,000 triangles. It uses a single 512×512 texture to color the surface.
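To make the geometry setup concrete, here is a small sketch (mine, not the demo's code, with made-up scale parameters) of how a heightmap-elevated grid like this can be generated:

import numpy as np


def build_terrain(heightmap, cell=1.0, height_scale=10.0):
    """Build grid vertices and triangle indices from a 2D heightmap array."""
    h, w = heightmap.shape                      # 251 x 251 in the demo
    xs, zs = np.meshgrid(np.arange(w), np.arange(h))
    vertices = np.stack([xs * cell,
                         heightmap * height_scale,   # elevate by the heightmap
                         zs * cell], axis=-1).reshape(-1, 3)
    indices = []
    for z in range(h - 1):                      # two triangles per grid cell
        for x in range(w - 1):
            i = z * w + x
            indices += [i, i + w, i + 1,
                        i + 1, i + w, i + w + 1]
    return vertices.astype(np.float32), np.array(indices, dtype=np.uint32)

For a 251×251 heightmap this yields 63,001 vertices and 250 × 250 × 2 = 125,000 triangles, which is where the counts above come from.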

The water is rendered in 3 passes: refraction, reflection and the final rendering. Distortion is done via a dudv map and it also uses a normal map for lighting. From a geometry perspective it is also implemented as a grid of vertices with 750 triangles.

I wrote a simple OBJ file parser so I could load some basic 3D models for the trees, the rock and the plant models. The parser is limited, but sufficient to load vertex data and simple materials. This demo features 4 models with these specs:

  • Tree A: 280 triangles, 2 materials.
  • Tree B: 380 triangles, 2 materials.
  • Rock: 192 triangles, 1 material (textured)
  • Grass: 896 triangles (yes, really!), 1 material.

The scene renders 200 instances of Tree A, another 200 instances of Tree B, 50 instances of Rock and 150 instances of Grass, so 600 objects in total.

Object locations in the terrain are randomized at start-up, but the demo prevents trees and grass from being under water (except for maybe their base section) because it would look very weird otherwise :). Rocks can be fully submerged though.

Rendered objects fade in and out smoothly via alpha blending (so there is no pop-in/pop-out effect as they reach clipping planes). This cannot be observed in the video because it uses a static camera but the demo supports moving the camera around in real-time using the keyboard.

Lighting is implemented using the traditional Phong reflection model with a single directional light.

Shadows are implemented using a 4096×4096 shadow map and Percentage Closer Filtering with a 3x3 kernel, which, I read, is (or was?) a very common technique for shadow rendering, at least in the times of the PS3 and Xbox 360.

The demo features dynamic directional lighting (that is, the sun light changes position every frame), which is rather taxing. The demo also supports static lighting, which is significantly less demanding.

There is also a slight haze that builds up progressively with the distance from the camera. This can be seen slightly in the video, but it is more obvious in some of the screenshots above.

The demo in the video was also configured to use 4-sample multisampling.

As for the rendering pipeline, it mostly has 4 stages:

  • Shadow map.
  • Water refraction.
  • Water reflection.
  • Final scene rendering.

A few notes on performance as well: the implementation supports a number of configurable parameters that affect the framerate: resolution, shadow rendering quality, clipping distances, multi-sampling, some aspects of the water rendering, N-buffering of dynamic VBO data, etc.

The video I show above runs at locked 60fps at 800×600 but it uses relatively high quality shadows and dynamic lighting, which are very expensive. Lowering some of these settings (very specially turning off dynamic lighting, multisampling and shadow quality) yields framerates around 110fps-200fps. With these settings it can also do fullscreen 1600×900 with an unlocked framerate that varies in the range of 80fps-170fps.

That’s all in the IvyBridge GPU. I also tested this on an Intel Haswell GPU for significantly better results: 160fps-400fps with the “low” settings at 800×600 and roughly 80fps-200fps with the same settings used in the video.

So that’s it for today, I had a lot of fun coding this and I hope the post was interesting to some of you. If time permits I intend to write follow-up posts that go deeper into how I implemented the various elements of the demo and I’ll probably also write some more posts about the optimization process I followed. If you are interested in any of that, stay tuned for more.

August 26, 2016
Clickbait titles for the win!

First up, massive thanks to my major co-conspirator on radv, Bas Nieuwenhuizen, for putting in so much effort on getting radv going.

So where are we at?

Well, this morning I finally found the last bug that was causing missing rendering on Dota 2: we were missing support for a compressed texture format that Dota 2 uses. So currently Dota 2 renders. I've no great performance comparison to post yet because my CPU is 5 years old and can barely get close to 30fps with GL or Vulkan. I think we know of a couple of places that could be bottlenecking us on the CPU side. The radv driver is currently missing hyper-z (90% done), fast color clears and DCC, which are all GPU-side speedups in theory. Also, running the phoronix-test-suite dota2 tests works sometimes, hangs in a thread lock sometimes, or crashes sometimes. I think we have some memory corruption somewhere that it collides with.

Other status bits: the Vulkan CTS test suite contains 114598 tests, a piglit run a few hours before I fixed dota2 was at:
[114598/114598] skip: 50388, pass: 62932, fail: 1193, timeout: 2, crash: 83 - |/-\

So that isn't too bad a showing, we know some missing features are accounting for some of fails. A lot of the crashes are an assert in CTS hitting, that I don't think is a real problem.

We render most of the Sascha Willems demos fine.

I've tested the Talos Principle as well, the texture fix renders a lot more stuff on the screen, but we are still seeing large chunks of blackness where I think there should be trees in-game, the menus etc all seem to load fine.

All this work is on the semi-interesting branch of

It has only been tested on VI AMD GPUs. Polaris worked previously but something derailed it; we should fix that once we get the finished bisect. CIK GPUs kinda work with the amdgpu kernel driver loaded. SI GPUs are nowhere yet.

Here's a screenshot:
August 22, 2016
Last week I finally plugged in the camera module I got a while ago to go take a look at what vc4 needs for displaying camera output.

The surprising answer was "nothing."  vc4 could successfully import RGB dmabufs and display them as planes, even though I had been expecting to need fixes on that front.

However, the bcm2835 v4l camera driver needs a lot of work.  First of all, it doesn't use the proper contiguous memory support in v4l (vb2-dma-contig), and instead asks the firmware to copy from the firmware's contiguous memory into vmalloced kernel memory.  This wastes memory and wastes memory bandwidth, and doesn't give us dma-buf support.

Even more, MMAL (the v4l equivalent that the firmware exposes for driving the hardware) wants to output planar buffers with specific padding.  However, instead of using the multi-plane format support in v4l to expose buffers with that padding, the bcm2835 driver asks the firmware to do another copy from the firmware's planar layout into the old no-padding V4L planar format.

As a user of the V4L api, you're also in trouble because none of these formats have any priority information that I can see: The camera driver says it's equally happy to give you RGB or planar, even though RGB costs an extra copy.  I think properly done today, the camera driver would be exposing multi-plane planar YUV, and giving you a mem2mem adapter that could use MMAL calls to turn the planar YUV into RGB.

For now, I've updated the bug report with links to the demo code and instructions.

I also spent a little bit of time last week finishing off the series to use st/nir in vc4.  I managed to get to no regressions, and landed it today.  It doesn't eliminate TGSI, but it does mean TGSI is gone from the normal GLSL path.

Finally, I got inspired to do some work on testing.  I've been doing some free time work on servo, Mozilla's Rust-based web browser, and their development environment has been a delight as a new developer.  All patch submissions, from core developers or from newbies, go through github pull requests.  When you generate a PR, Travis builds and runs the unit tests on the PR.  Then a core developer reviews the code by adding a "r" comment in the PR or provides feedback.  Once it's reviewed, a bot picks up the pull request, tries merging it to master, then runs the full integration test suite on it.  If the test suite passes, the bot merges it to master, otherwise the bot writes a comment with a link to the build/test logs.

Compare this to Mesa's development process.  You make a patch.  You file it in the issue tracker and it gets utterly ignored.  You complain, and someone tells you you got the process wrong, so you join the mailing list and send your patch (and then get a flood of email until you unsubscribe).  It gets mangled by your email client, and you get told to use git-send-email, so you screw around with that for a while before you get an email that will actually show up in people's inboxes.  Then someone reviews it (hopefully) before it scrolls off the end of their inbox, and then it doesn't get committed anyway because your name was familiar enough that the reviewer thought maybe you had commit access.  Or they do land your patch, and it turns out you hadn't run the integration tests, and then people complain at you for not testing.

So, as a first step toward making a process like Mozilla's possible, I put some time into fixing up Travis on Mesa, and building Travis support for the X Server.  If I can get Travis to run piglit and ensure that expected-pass tests don't regress, that at least gives us a documentable path for new developers in these two projects to put their code up on github and get automated testing of the branches they're proposing on the mailing lists.
August 16, 2016

Wrapping libudev using LD_PRELOAD

Peter Hutterer and I were chasing down an X server bug which was exposed when running the libinput test suite against the X server with a separate thread for input. This was crashing deep inside libudev, which led us to suspect that libudev was getting run from multiple threads at the same time.

I figured I'd be able to tell by wrapping all of the libudev calls from the server and checking to make sure we weren't ever calling it from both threads at the same time. My first attempt was a simple set of cpp macros, but that failed when I discovered that libwacom was calling libgudev, which was calling libudev.

Instead of recompiling the world with my magic macros, I created a new library which exposes all of the (public) symbols in libudev. Each of these functions does a bit of checking and then simply calls down to the 'real' function.

Finding the real symbols

Here's the snippet which finds the real symbols:

static void *udev_symbol(const char *symbol)
{
    static void *libudev;
    static pthread_mutex_t  find_lock = PTHREAD_MUTEX_INITIALIZER;

    void *sym;

    pthread_mutex_lock(&find_lock);
    if (!libudev) {
        /* The real source hard-codes the full versioned soname;
         * "libudev.so.1" is a stand-in here. */
        libudev = dlopen("libudev.so.1", RTLD_LOCAL | RTLD_NOW);
    }
    sym = dlsym(libudev, symbol);
    pthread_mutex_unlock(&find_lock);
    return sym;
}

Yeah, the libudev version is hard-coded into the source; I didn't want to accidentally load the wrong one. This could probably be improved...

Checking for re-entrancy

As mentioned above, we suspected that the bug was caused when libudev got called from two threads at the same time. So, our checks are pretty simple; we just count the number of calls into any udev function (to handle udev calling itself). If there are other calls in progress, we make sure the thread ID for those is the same as the current thread.

/* Shared tracking state (declarations reconstructed for this excerpt). */
static pthread_t   udev_thread;
static int         udev_running;
static const char *udev_func[64];

static void udev_enter(const char *func) {
    assert (udev_running == 0 || udev_thread == pthread_self());
    udev_thread = pthread_self();
    udev_func[udev_running] = func;
    udev_running++;
}

static void udev_exit(void) {
    udev_running--;
    if (udev_running == 0)
        udev_thread = 0;
    udev_func[udev_running] = 0;
}

Wrapping functions

Now, the ugly part -- libudev exposes 93 different functions, with a wide variety of parameters and return types. I constructed a hacky macro, calls for which could be constructed pretty easily from the prototypes found in libudev.h, and which would construct our stub function:

#define make_func(type, name, formals, actuals)     \
    type name formals {                             \
        type ret;                                   \
        static void *f;                             \
        if (!f)                                     \
            f = udev_symbol(__func__);              \
        udev_enter(__func__);                       \
        ret = ((typeof (&name)) f) actuals;         \
        udev_exit();                                \
        return ret;                                 \
    }

There are 93 invocations of this macro (or a variant for void functions) which look much like:

make_func(struct udev *,
          udev_ref,
          (struct udev *udev),
          (udev))

Using udevwrap

To use udevwrap, simply stick the filename of the .so in LD_PRELOAD and run your program normally:

# LD_PRELOAD=/usr/local/lib/ Xorg 

Source code

I stuck udevwrap in my git repository:;a=summary

You can clone it using

$ git clone git://
August 15, 2016

A Preliminary systemd.conf 2016 Schedule is Now Available!

We have just published a first, preliminary version of the systemd.conf 2016 schedule. There is a small number of white slots in the schedule still, because we're missing confirmation from a small number of presenters. The missing talks will be added in as soon as they are confirmed.

The schedule consists of 5 workshops by high-profile speakers during the workshop day, 22 exciting talks during the main conference days, followed by one full day of hackfests.

Please sign up for the conference soon! Only a limited number of tickets are available, hence make sure to secure yours quickly before they run out! (Last year we sold out.) Please sign up here for the conference!

Last week I mostly worked on getting the upstream work I and others have done into downstream Raspbian (most of that time unfortunately in setting up another Raspbian development environment, after yet another SD card failed).

However, the most exciting thing for most users is that with the merge of the rpi-4.4.y-dsi-stub-squash branch, the DSI display should now come up by default with the open source driver.  This is unfortunately not a full upstreamable DSI driver, because the closed-source firmware is getting in the way of Linux by stealing our interrupts and then talking to the hardware behind our backs.  To work around the firmware, I never talk to the DSI hardware, and we just replace the HVS display plane configuration on the DSI's output pipe.  This means your display backlight is always on and the DSI link is always running, but better that than no display.

I also transferred the wiki I had made for VC4 over to github.  In doing so, I was pleasantly surprised at how much documentation I wanted to write once I got off of the awful wiki software at freedesktop.  You can find more information on VC4 at my mesa and linux trees.

(Side note, wikis on github are interesting.  When you make your fork, you inherit the wiki of whoever you fork from, and you can do PRs back to their wiki similarly to how you would for the main repo.  So my linux tree has Raspberry Pi's wiki too, and I'm wondering if I want to move all of my wiki over to their tree.  I'm not sure.)

Is there anything that people think should be documented for the vc4 project that isn't there?
August 12, 2016

So we have two job openings in the Red Hat desktop team. What we are looking for is people to help us ensure that Fedora and RHEL run great on various desktop hardware, with a focus on laptops. Since these jobs require continuous access to a lot of new and different hardware, we can not accept applications from remotees this time, but require you to work out of our office in Munich, Germany. We are looking for people who are not afraid to jump into a lot of different code and who like tinkering with new hardware. The hardware enablement here might include some kernel level work, but will more likely involve improving higher level stacks. So for example, if we have a new laptop where Bluetooth doesn't work, you would need to investigate and figure out if the problem is in the kernel, in the BlueZ stack or in our Bluetooth desktop parts.

This will be quite varied work and we expect you to be part of a team which will be looking at anything from driver bugs, battery life issues, implementing new stacks, biometric login and enabling existing features in the kernel or in low level libraries in the user interface.

You can read more about the jobs at the Red Hat jobs website. That listing is for a Senior Engineer, but we also have a Principal Engineer position open with id 53653; that one is not on the website as I post this, but should hopefully be up very soon.

Also, if you happen to be in the Karlsruhe area or at GUADEC this year, I will be here until Sunday, so you could come over for a chat. Feel free to email me if you are interested in meeting up.

August 11, 2016
A couple of weeks ago, I hinted at a presentation that I wanted to do during this year's GUADEC, as a Lightning talk.

Unfortunately, I didn't get a chance to finish the work that I set out to do, encountering a couple of bugs that set me back. Hopefully this will get resolved post-GUADEC, so you can expect some announcements later on in the year.

At least one of the tasks I set to do worked out, and was promptly obsoleted by a nicer solution. Let's dive in.

How to compile for a different architecture

There are four possible solutions to compile programs for a different architecture:

  • Native compilation: get a machine of that architecture, install your development packages, and compile. This is nice when you have fast machines with plenty of RAM to compile on, usually developer boards, not so good when you target low-power devices.
  • Cross-compilation: install a version of GCC and friends that runs on your machine's architecture, but produces binaries for your target one. This is usually fast, but you won't be able to run the binaries you create, so you might end up with some data created from a different set of options, and you won't be able to run the generated test suite.
  • Virtual Machine: you'd run a virtual machine for the target architecture, install an OS, and build everything. This is slower than cross-compilation, but avoids the problems you'd see in cross-compilation.
The final option is one that's used more and more, mixing the last 2 solutions: the QEmu user-space emulator.

Using the QEMU user-space emulator

If you want to run just the one command, you'd do something like:

qemu-arm-static myarmbinary

Easy enough, but hardly something you want to try when compiling a whole application, with library dependencies. This is where binfmt support in Linux comes into play. Register the ELF format for your target with that user-space emulator, and you can run myarmbinary without any commands before it.

One thing to note, though, is that this won't work as easily if the QEmu user-space emulator and the target executable are built as dynamic executables: QEmu will need to find the libraries for your architecture, usually x86-64, to launch itself, and the emulated binary will also need to find its libraries.

To solve that first problem, there are QEmu static binaries available in a number of distributions (Fedora support is coming). For the second one, the easiest would be if we didn't have to mix native and target libraries on the filesystem, in a chroot, or container for example. Hmm, container you say.

Running QEmu user-space emulator in a container

So we have our statically compiled QEmu and a filesystem with our target binaries, and we switch into that root filesystem. Then you try to run anything, and you get a bunch of errors. The problem is that there is a single binfmt configuration for the kernel, whether you're in the normal OS or inside a container or chroot.

The Flatpak hack

This commit for Flatpak works-around the problem. The binary for the emulator needs to have the right path, so it can be found within the chroot'ed environment, and it will need to be copied there so it is accessible too, which is what this patch will do for you.

Follow the instructions in the commit, and test it out with this Flatpak script for GNU Hello.

$ TARGET=arm ./
$ ls org.gnu.hello.arm.xdgapp
918k org.gnu.hello.arm.xdgapp

Ready to install on your device!

The proper way

The above solution was built before it looked like the "proper way" was going to find its way in the upstream kernel. This should hopefully land in the upcoming 4.8 kernel.

Instead of launching a separate binary for each non-native invocation, this patchset allows the kernel to keep the binary opened, so it doesn't need to be copied to the container.

In short

With the work being done on Fedora's static QEmu user-space emulators, and the kernel feature that will land, we should be able to have a nice tickbox in Builder to build for any of the targets supported by QEmu.

Get cross-compiling!

Three years after my definitive guide on Python classic, static, class and abstract methods, it seems to be time for a new one. Here, I would like to dissect and discuss Python exceptions.

Dissecting the base exceptions

In Python, the base exception class is named BaseException. Being rarely used in any program or library, it ought to be considered an implementation detail. But to discover how it's implemented, you can go and read Objects/exceptions.c in the CPython source code. In that file, what is interesting to see is that the BaseException class defines all the basic methods and attributes of exceptions. The basic well-known Exception class is then simply defined as a subclass of BaseException, nothing more:

/*
 *    Exception extends BaseException
 */
SimpleExtendsException(PyExc_BaseException, Exception,
                       "Common base class for all non-exit exceptions.");

The only other exceptions that inherit directly from BaseException are GeneratorExit, SystemExit and KeyboardInterrupt. All the other builtin exceptions inherit from Exception. The whole hierarchy can be seen by running pydoc2 exceptions or pydoc3 builtins.
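For illustration, that split is easy to verify from the interpreter:

>>> issubclass(KeyboardInterrupt, Exception)
False
>>> issubclass(KeyboardInterrupt, BaseException)
True
>>> issubclass(ZeroDivisionError, Exception)
True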

Here are the graphs representing the builtin exception inheritance in Python 2 and Python 3 (generated using this script).

Python 2 builtin exceptions inheritance graph
Python 3 builtin exceptions inheritance graph

The BaseException.__init__ signature is actually BaseException.__init__(*args). This initialization method stores any arguments that are passed in the args attribute of the exception. This can be seen in the exceptions.c source code – and is true for both Python 2 and Python 3:

static int
BaseException_init(PyBaseExceptionObject *self, PyObject *args, PyObject *kwds)
{
    if (!_PyArg_NoKeywords(Py_TYPE(self)->tp_name, kwds))
        return -1;

    Py_XSETREF(self->args, args);
    return 0;
}

The only place where this args attribute is used is in the BaseException.__str__ method. This method uses self.args to convert an exception to a string:

static PyObject *
BaseException_str(PyBaseExceptionObject *self)
{
    switch (PyTuple_GET_SIZE(self->args)) {
    case 0:
        return PyUnicode_FromString("");
    case 1:
        return PyObject_Str(PyTuple_GET_ITEM(self->args, 0));
    default:
        return PyObject_Str(self->args);
    }
}

This can be translated in Python to:

def __str__(self):
    if len(self.args) == 0:
        return ""
    if len(self.args) == 1:
        return str(self.args[0])
    return str(self.args)

Therefore, the message to display for an exception should be passed as the first and the only argument to the BaseException.__init__ method.
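For illustration, a quick interpreter session shows the difference between passing one argument and several:

>>> str(ValueError("invalid input"))
'invalid input'
>>> str(ValueError("invalid input", 42))
"('invalid input', 42)"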

Defining your exceptions properly

As you may already know, in Python, exceptions can be raised in any part of the program. The basic exception class is called Exception and can be used anywhere in your program. In real life, however, no program nor library should ever raise Exception directly: it's not specific enough to be helpful.

Since all exceptions are expected to be derived from the base class Exception, this base class can easily be used as a catch-all:

try:
    do_something()  # placeholder for the code you want to guard
except Exception:
    # This will catch any exception!
    print("Something terrible happened")

To define your own exceptions correctly, there are a few rules and best practice that you need to follow:

  • Always inherit from (at least) Exception:
    class MyOwnError(Exception):
        pass

  • Leverage what we saw earlier about BaseException.__str__: it uses the first argument passed to BaseException.__init__ to be printed, so always call BaseException.__init__ with only one argument.

  • When building a library, define a base class inheriting from Exception. It will make it easier for consumers to catch any exception from the library:

    class ShoeError(Exception):
        """Basic exception for errors raised by shoes"""

    class UntiedShoelace(ShoeError):
        """You could fall"""

    class WrongFoot(ShoeError):
        """When you try to wear your left shoe on your right foot"""

    It then makes it easy to use except ShoeError when doing anything with that piece of code related to shoes. For example, Django does not do that for some of its exceptions, making it hard to catch "any exception raised by Django".

  • Provide details about the error. This is extremely valuable to be able to log correctly errors or take further action and try to recover:

class CarError(Exception):
    """Basic exception for errors raised by cars"""
    def __init__(self, car, msg=None):
        if msg is None:
            # Set some default useful error message
            msg = "An error occurred with car %s" % car
        super(CarError, self).__init__(msg)
        self.car = car


class CarCrashError(CarError):
    """When you drive too fast"""
    def __init__(self, car, other_car, speed):
        super(CarCrashError, self).__init__(
            car, msg="Car crashed into %s at speed %d" % (other_car, speed))
        self.speed = speed
        self.other_car = other_car

Then, any code can inspect the exception to take further action:

try:
    drive_car(car)  # placeholder for the code that may raise
except CarCrashError as e:
    # If we crash at high speed, we call emergency
    if e.speed >= 30:
        call_911()  # placeholder for whatever recovery action makes sense

For example, this is leveraged in Gnocchi to raise specific application exceptions (NoSuchArchivePolicy) on expected foreign key violations raised by SQL constraints:
try:
    with self.facade.writer() as session:
        session.add(m)  # actual body elided in this excerpt
except exception.DBReferenceError as e:
    if e.constraint == 'fk_metric_ap_name_ap_name':
        raise indexer.NoSuchArchivePolicy(archive_policy_name)

  • Inherit from builtin exception types when it makes sense. This makes it easier for programs that know nothing about your application or library to handle your errors:
    class CarError(Exception):
        """Basic exception for errors raised by cars"""

    class InvalidColor(CarError, ValueError):
        """Raised when the color for a car is invalid"""

    That allows many programs to catch errors in a more generic way without knowing about your own defined type. If a program already knows how to handle a ValueError, it won't need any specific code nor modification.


There is no limitation on where and when you can define exceptions. As they are, after all, normal classes, they can be defined in any module, function or class – even as closures.

Most libraries package their exceptions into a specific exception module: SQLAlchemy has them in sqlalchemy.exc, requests has them in requests.exceptions, Werkzeug has them in werkzeug.exceptions, etc.

That makes sense for libraries to export exceptions that way, as it makes it very easy for consumers to import their exception module and know where the exceptions are defined when writing code to handle errors.

This is not mandatory, and smaller Python modules might want to retain their exceptions into their sole module. Typically, if your module is small enough to be kept in one file, don't bother splitting your exceptions into a different file/module.

While this wisely applies to libraries, applications tend to be different beasts. Usually, they are composed of different subsystems, where each one might have its own set of exceptions. This is why I generally discourage going with only one exception module in an application, but to split them across the different parts of one's program. There might be no need of a special myapp.exceptions module.

For example, if your application is composed of an HTTP REST API defined into the module myapp.http and of a TCP server contained into myapp.tcp, it's likely they can both define different exceptions tied to their own protocol errors and cycle of life. Defining those exceptions in a myapp.exceptions module would just scatter the code for the sake of some useless consistency. If the exceptions are local to a file, just define them somewhere at the top of that file. It will simplify the maintenance of the code.

Wrapping exceptions

Wrapping exceptions is the practice by which one exception is encapsulated into another:

import requests

class MylibError(Exception):
    """Generic exception for mylib"""
    def __init__(self, msg, original_exception):
        super(MylibError, self).__init__(msg + (": %s" % original_exception))
        self.original_exception = original_exception

try:
    requests.get("http://example.com")
except requests.exceptions.ConnectionError as e:
    raise MylibError("Unable to connect", e)

This makes sense when writing a library which leverages other libraries. If a library uses requests and does not encapsulate requests exceptions into its own defined error classes, that is a layering violation. Any application using your library might receive a requests.exceptions.ConnectionError, which is a problem because:

  1. The application has no clue that the library was using requests and does not need/want to know about it.
  2. The application will have to import requests.exceptions itself and therefore will depend on requests – even if it does not use it directly.
  3. As soon as mylib changes from requests to e.g. httplib2, the application code catching requests exceptions will become irrelevant.

The Tooz library is a good example of wrapping, as it uses a driver-based approach and depends on a lot of different Python modules to talk to different backends (ZooKeeper, PostgreSQL, etcd…). Therefore, it wraps exceptions from other modules into its own set of error classes at every opportunity. Python 3 introduced the raise from form to help with that, and that's what Tooz leverages to raise its own errors.

It's also possible to encapsulate the original exception into a custom defined exception, as done above. That makes the original exception easily available for inspection.
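
For reference, here is a minimal sketch of the Python 3 raise from form mentioned above (fetch_resource is a hypothetical wrapper, not code from Tooz); the original exception is chained automatically and stays available on the __cause__ attribute:

import requests

class MylibError(Exception):
    """Generic exception for mylib"""

def fetch_resource(url):
    # Hypothetical wrapper around requests
    try:
        return requests.get(url)
    except requests.exceptions.ConnectionError as e:
        # "raise ... from e" records the original exception as __cause__
        # and includes it in the traceback output
        raise MylibError("Unable to connect to %s" % url) from e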

Catching and logging

When designing exceptions, it's important to remember that they should be targeted both at humans and computers. That's why they should include an explicit message, and embed as much information as possible. That will help to debug and write resilient programs that can pivot their behavior depending on the attributes of the exception, as seen above.

Also, silencing exceptions completely is bad practice. You should not write code like this:

try:
    do_something()  # any operation
except Exception:
    pass  # Whatever

Having no information at all about where and why an exception occurred makes a program a nightmare to debug.

If you use the logging library (and you should), you can use the exc_info parameter to log a complete traceback when an exception occurs, which can help with debugging severe and unrecoverable failures:

try:
    do_something()  # any operation
except Exception:
    logging.getLogger().error("Something bad happened", exc_info=True)
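
The logging module also provides the exception() helper, which is shorthand for error() with exc_info=True and is meant to be called from inside an except block; a minimal sketch (do_something is a placeholder):

import logging

logger = logging.getLogger(__name__)

def do_something():
    # Placeholder operation that fails
    raise RuntimeError("placeholder failure")

try:
    do_something()
except Exception:
    # Logs at ERROR level and appends the traceback of the current exception
    logger.exception("Something bad happened")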

Further reading

If you understood everything so far, congratulations, you might be ready to handle exceptions in Python! If you want to have a broader scope on exceptions and what Python misses, I encourage you to read about condition systems and discover the generalization of exceptions – that I hope we'll see in Python one day!

I hope this will help you build better libraries and applications. Feel free to shoot any question in the comment section!

August 09, 2016
At the bottom of the release notes for GNOME 3.20, you might have seen the line:
If you plug in an audio device (such as a headset, headphones or microphone) and it cannot be identified, you will now be asked what kind of device it is. This addresses an issue that prevented headsets and microphones being used on many Dell computers.
Before I start explaining what this does, as a picture is worth a thousand words:

This selection dialogue is one you will get on some laptops and desktop machines when the hardware is not able to detect whether the plugged in device is headphones, a microphone, or a combination of both, probably because it doesn't have an impedance detection circuit to figure that out.

This functionality was integrated into Unity's gnome-settings-daemon version a couple of years ago, written by David Henningsson.

The code that existed for this functionality was completely independent, not using any of the facilities available in the media-keys plugin for handling volume keys, and it could probably have been split out as an external binary with very little effort.

After a bit of to and fro, most of the sound backend functionality was merged into libgnome-volume-control, leaving just 2 entry points, one to signal that something was plugged into the jack, and another to select which type of device was plugged in, in response to the user selection. This means that the functionality should be easily implementable in other desktop environments that use libgnome-volume-control to interact with PulseAudio.

Many thanks to David Henningsson for the original code, and his help integrating the functionality into GNOME, Bednet for providing hardware to test and maintain this functionality, and Allan, Florian and Rui for working on the UI notification part of the functionality, and wiring it all up after I abandoned them to go on holidays ;)
August 08, 2016
Last week's project for vc4 was to take a look at memory usage.  Eben had expressed concern that the new driver stack would use more memory than the closed stack, and so I figured I would spend a little while working on that.

I first pulled out valgrind's massif tool on piglit's glsl-algebraic-add-add-1.shader_test.  This works as a minimum "how much memory does it take to render *anything* with this driver?" test.  We were consuming 1605k of heap at the peak, and there were some obvious fixes to be made.

First, the gallium state_tracker was allocating 659kb of space at context creation so that it could bytecode-interpret TGSI if needed for glRasterPos() and glRenderMode(GL_FEEDBACK).  Given that nobody should ever use those features, and luckily they rarely do, I delayed the allocation of the somewhat misleadingly-named "draw" context until the fallbacks were needed.

Second, Mesa was allocating the memory for the GL 1.x matrix stacks up front at context creation.  We advertise 32 matrices for modelview/projection, 10 per texture unit (32 of those), and 4 for programs.  I instead implemented a typical doubling array reallocation scheme for storing the matrices, so that only the top matrix per stack is allocated at context creation.  This saved 63kb of dirty memory per context.

722KB for these two fixes may not seem like a whole lot of memory to readers on fancy desktop hardware with 8GB of RAM, but the Raspberry Pi has only 1GB of RAM, and when you exhaust that you're swapping to an SD card.  You should also expect a desktop to have several GL contexts created: the X Server uses one to do its rendering, you have a GL-based compositor with its own context, and your web browser and LibreOffice may each have one or more.  Additionally, trying to complete our piglit testsuite on the Raspberry Pi is currently taking me 6.5 hours (when it even succeeds and doesn't see piglit's python runner get shot by the OOM killer), so I could use any help I can get in reducing context initialization time.

However, malloc()-based memory isn't all that's involved.  The GPU buffer objects that get allocated don't get counted by massif in my analysis above.  To try to approximately fix this, I added in valgrind macro calls to mark the mmap()ed space in a buffer object as being a malloc-like operation until the point that the BO is freed.  This doesn't get at allocations for things like the window-system renderbuffers or the kernel's overflow BO (valgrind requires that you have a pointer involved to report it to massif), but it does help.

Once I had massif reporting more, I noticed that glmark2 -b terrain was allocating a *lot* of memory for shader BOs.  Going through them, an obvious problem was that we were generating a lot of shaders for glGenerateMipmap().  A few weeks ago I improved performance on the benchmark by fixing glGenerateMipmap()'s fallback blits that we were doing because vc4 doesn't support the GL_TEXTURE_BASE_LEVEL that the gallium aux code uses.  I had fixed the fallback by making the shader do an explicit-LOD lookup of the base level if the GL_TEXTURE_BASE_LEVEL==GL_TEXTURE_MAX_LEVEL.  However, in the process I made the shader depend on that base level, so we would compile a new shader variant per level of the texture.  The fix was to make the base level into a uniform value that's uploaded per draw call, and with that change I dropped 572 shader variants from my shader-db results.

Reducing extra shaders was fun, so I set off on another project I had thought of before.  VC4's vertex shader to fragment shader IO system is a bit unusual in that it's just a FIFO of floats (effectively), with none of these silly "vec4"s that permeate GLSL.  Since I can take my inputs in any order, and more flexibility in the FS means avoiding register allocation failures sometimes, I have the FS compiler tell the VS what order it would like its inputs in.  However, the list of all the inputs in their arbitrary orders would be expensive to hash at draw time, so I had just been using the identity of the compiled fragment shader variant in the VS and CS's key to decide when to recompile it in case output order changed.  The trick was that, while the set of all possible orders is huge, the number that any particular application will use is quite small.  I take the FS's input order array, keep it in a set, and use the pointer to the data in the set as the key.  This cut 712 shaders from shader-db.

Also, embarrassingly, when I mentioned tracking the FS in the CS's key above?  Coordinate shaders don't output anything to the fragment shader.  Like the name says, they just generate coordinates, which get consumed by the binner.  So, by removing the FS from the CS key, I trivially cut 754 shaders from shader-db.  Between the two, piglit's gl-1.0-blend-func test now passes instead of OOMing, so we get test coverage on blending.

Relatedly, while working on fixing a kernel oops recently, I had noticed that we were still reallocating the overflow BO on every draw call.  This was old debug code from when I was first figuring out how overflow worked.  Since each client can have up to 5 outstanding jobs (limited by Mesa) and each job was allocating a 256KB BO, we could be saving a MB or so per client assuming they weren't using much of their overflow (likely true for the X Server).  The solution, now that I understand the overflow system better, was just to not reallocate and let the new job fill out the previous overflow area.

Other projects for the week that I won't expand on here: Debugging GPU hang in piglit glsl-routing (generated fixes for vc4-gpu-tools parser, tried writing a GFXH30 workaround patch, still not fixed) and working on supporting direct GLSL IR to NIR translation (lots of cleanups, a couple fixes, patches on the Mesa list).
August 05, 2016

A common issue with users typing on a laptop is that the user's palms will inadvertently get in contact with the touchpad at some point, causing the cursor to move and/or click. In the best case it's annoying, in the worst case you're now typing your password into the newly focused twitter application. While this provides some general entertainment and thus makes the world a better place for a short while, here at the libinput HQ [1] we strive to keep life as boring as possible and avoid those situations.

The best way to avoid accidental input is to detect palm touches and simply ignore them. That works ok-ish on some touchpads and fails badly on others. Lots of hardware is barely able to provide an accurate touch location, let alone enough information to decide whether a touch is a palm. libinput's palm detection largely works by using areas on the touchpad that are likely to be touched by the palms.

The second-best way to avoid accidental input is to disable the touchpad while a user is typing. The libinput marketing department [2] has decided to name this feature "disable-while-typing" (DWT) and it's been in libinput for quite a while. In this post I'll describe how exactly DWT works in libinput.

Back in the olden days of roughly two years ago we all used the synaptics X.Org driver and were happy with it [3]. Disable-while-typing was featured there through the use of a tool called syndaemon. This synaptics daemon [4] has two modes. One was to poll the keyboard state every few milliseconds and check whether a key was down. If so, syndaemon sends a command to the driver to tell it to disable itself. After a timeout when the keyboard state is neutral again syndaemon tells the driver to re-enable itself. This causes a lot of wakeups, especially during those 95% of the time when the user isn't actually typing. Or missed keys if the press + release occurs between two polls. Hence the second mode, using the RECORD extension, where syndaemon opens a second connection to the X server and checks for key events [5]. If it sees one float past, it tells the driver to disable itself, and so on and so forth. Either way, you had a separate process that did that job. syndaemon had a couple of extra options and features that I'm not going to discuss here, but we've replicated the useful ones in libinput.

libinput has no external process, DWT is integrated into the library with a couple of smart extra features. This is made easier by libinput controlling all the devices, so all keyboard events are passed around internally to the touchpad backend. That backend then decides whether it should stop sending events. And this is where the interesting bits come in.

First, we have different timeouts: if you only hit a single key, the touchpad will re-enable itself quicker than after a period of typing. So if you use the touchpad, hit a key to trigger some UI the pointer only stops moving for a very short time. But once you type, the touchpad disables itself longer. Since your hand is now in a position over the keyboard, moving back to the touchpad takes time anyway so a longer timeout doesn't hurt. And as typing is interrupted by pauses, a longer timeout bridges over those to avoid accidental movement of the cursor.

Second, any touch started while you were typing is permanently ignored, so it's safe to rest the palm on the touchpad while typing and leave it there. But we keep track of the start time of each touch so any touch started after the last key event will work normally once the DWT timeout expires. You may feel a short delay but it should be well within the acceptable range of tens of ms.

Third, libinput is smart enough to detect which keyboard to pair with. If you have an external touchpad like the Apple Magic Trackpad or a Logitech T650, DWT will never enable on those. Likewise, typing on an external keyboard won't disable the internal touchpad. And in the rare case of two internal touchpads [6], both of them will do the right thing. As of systemd v231 the information of whether a touchpad is internal or external is available in the ID_INPUT_TOUCHPAD_INTEGRATION udev tag and thus available to everyone, not just libinput.

Finally, modifier keys are ignored for DWT, so using the touchpad to do shift-clicks works unimpeded. This also goes for the F-Key row and the numpad if you have any. These keys are usually out of the range of the touchpad anyway so interference is not an issue here. As of today, modifier key combos work too. So hitting Ctrl+S to save a document won't disable the touchpad (or any other modifiers + key combination). But once you are typing DWT activates and if you now type Shift+S to type the letter 'S' the touchpad remains disabled.

So in summary: what we've gained from switching to libinput is one external process less that causes wakeups and the ability to be a lot smarter about when we disable the touchpad. Coincidentally, libinput has similar code to avoid touchpad interference when the trackpoint is in use.

[1] that would be me
[2] also me
[3] uphill, both ways, snow, etc.
[4] nope. this one wasn't my fault
[5] Yes, syndaemon is effectively a keylogger, except it doesn't do any of the "logging" bit a keylogger would be expected to do to live up to its name
[6] This currently happens on some Dell laptops using hid-i2c. We get two devices, one named "DLL0704:01 06CB:76AE Touchpad" or similar and one "SynPS/2 Synaptics TouchPad". The latter one will never send events unless hid-i2c is disabled in the kernel

August 01, 2016
This weekend I landed a patchset in mesa to add support for resource shadowing and batch re-ordering in freedreno.  What this is, will take a bit of explaining, but the tl;dr: is a nice fps boost in many games/apps.

But first, a bit of background about tiling gpu's:  the basic idea of a tiler is to render N draw calls a tile at a time, with a tile's worth of the "framebuffer state" (ie. each of the MRT color bufs + depth/stencil) resident in an internal tile buffer.  The idea is that most of your memory traffic is to/from your color and z/s buffers.  So rather than rendering each of your draw calls in its entirety, you split the screen up into tiles and repeat each of the N draws for each tile, rendering to fast internal/on-chip memory.  This avoids going back to main memory for each of the color and z/s buffer accesses, and enables a tiler to do more with less memory bandwidth.  But it means there is never a single point in the sequence at which a given draw happens, ie. draw #1 for tile #2 could happen after draw #2 for tile #1.  (Also, that is why GL_TIMESTAMP queries are bonkers for tilers.)

For purpose of discussion (and also how things are named in the code, if you look), I will define a tile-pass, ie. rendering of N draws for each tile in succession (or even if multiple tiles are rendered in parallel) as a "batch".

Unfortunately, many games/apps are not written with tilers in mind.  There are a handful of common anti-patterns which force a driver for a tiling gpu to flush the current batch.  Examples are unnecessary FBO switches, and texture or UBO uploads mid-batch.

For example, with a 1920x1080 r8g8b8a8 render target, with z24s8 depth/stencil buffer, an unnecessary batch flush costs you 16MB of write memory bandwidth, plus another 16MB of read when we later need to pull the data back into the tile buffer.  That number can easily get much bigger with games using float16 or float32 (rather than 8 bits per component) intermediate buffers, and/or multiple render targets.  Ie. two MRT's with float16 internal-format plus z24s8 z/s would be 40MB write + 40MB read per extra flush.

So, take the example of a UBO update, at a point where you are not otherwise needing to flush the batch (ie. swapbuffers or FBO switch).  A straightforward gl driver for a tiler would need to flush the current batch, so each of the draws before the UBO update would see the old state, and each of the draws after the UBO update would see the new state.

Enter resource shadowing and batch reordering.  Two reasonably big (ie. touches a lot of the code) changes in the driver which combine to avoid these extra batch flushes, as much as possible.

Resource shadowing is allocating a new backing GEM buffer object (BO) for the resource (texture/UBO/VBO/etc), and if necessary copying parts of the BO contents to the new buffer (back-blit).

So for the example of the UBO update, rather than taking the 16MB+16MB (or more) hit of a tile flush, why not just create two versions of the UBO?  It might involve copying a few KB's of UBO (ie. whatever was not overwritten by the game), but that is a lot less than 32MB.

But of course, it is not that simple.  Was the buffer or texture level mapped with GL_MAP_INVALIDATE_BUFFER_BIT or GL_MAP_INVALIDATE_RANGE_BIT?  (Or a GL API that implies the equivalent, although fortunately as a gallium driver we don't have to care so much about all the various different GL paths that amount to the same thing for the hw.)  For a texture with mipmap levels, we unfortunately don't know at the time when we need to create the new shadow BO whether the next GL calls will be glGenerateMipmap() or an upload of the remaining mipmap levels.  So there is a bit of complexity in handling all the cases properly.  There may be a few more cases we could handle without falling back to flushing the current batch, but for now we handle all the common cases.

The batch re-ordering component of this allows any potential back-blits from the shadow'd BO to the new BO (when resource shadowing kicks in), to be split out into a separate batch.  The resource/dependency tracking between batches and resources (ie. if various batches need to read from a given resource, we need to know that so they can be executed before something writes to the resource) lets us know which order to flush various in-flight batches to achieve correct results.  Note that this is partly because we use util_blitter, which turns any internally generated resource-shadowing back-blits into normal draw calls (since we don't have a dedicated blit pipe).. but this approach also handles the unnecessary FBO switch case for free.

Unfortunately, the batch re-ordering required a bit of an overhaul about how cmdstream buffers are handled, which required changes in all layers of the stack (mesa + libdrm + kernel).  The kernel changes are in drm-next for 4.8 and libdrm parts are in the latest libdrm release.  And while things will continue to work with a new userspace and old kernel, all these new optimizations will be disabled.

(And, while there is a growing number of snapdragon/adreno SBC's and phones/tablets getting upstream attention, if you are stuck on a downstream 3.10 kernel, look here.)

And for now, even with a new enough kernel, for the time being reorder support is not enabled by default.  There are a couple more piglit tests remaining to investigate, but I'll probably flip it to be enabled by default (if you have a new enough kernel) before the next mesa release branch.  Until then, use FD_MESA_DEBUG=reorder (and once the default is switched, that would be FD_MESA_DEBUG=noreorder to disable).

I'll cover the implementation and tricks to keep the CPU overhead of all this extra bookkeeping small later (probably at XDC2016), since this post is already getting rather long.  But the juicy bits: ~30% gain in supertuxkart (new render engine) and ~20% gain in manhattan are the big winners.  In general at least a few percent gain in most things I looked at, generally in the 5-10% range.

July 27, 2016

Please note that the systemd.conf 2016 Call for Participation ends on Monday, on Aug. 1st! Please send in your talk proposal by then! We’ve already got a good number of excellent submissions, but we are very interested in yours, too!

We are looking for talks on all facets of systemd: deployment, maintenance, administration, development. Regardless of whether you use it in the cloud, on embedded, on IoT, on the desktop, on mobile, in a container or on the server: we are interested in your submissions!

In addition to proposals for talks for the main conference, we are looking for proposals for workshop sessions held during our Workshop Day (the first day of the conference). The workshop format consists of a day of 2-3h training sessions, that may cover any systemd-related topic you'd like. We are both interested in submissions from the developer community as well as submissions from organizations making use of systemd! Introductory workshop sessions are particularly welcome, as the Workshop Day is intended to open up our conference to newcomers and people who aren't systemd gurus yet, but would like to become more fluent.

For further details on the submissions we are looking for and the CfP process, please consult the CfP page and submit your proposal using the provided form!

ALSO: Please sign up for the conference soon! Only a limited number of tickets are available, hence make sure to secure yours quickly before they run out! (Last year we sold out.) Please sign up here for the conference!

AND OF COURSE: We are also looking for more sponsors for systemd.conf! If you are working on systemd-related projects, or make use of it in your company, please consider becoming a sponsor of systemd.conf 2016! Without our sponsors we couldn't organize systemd.conf 2016!

Thank you very much, and see you in Berlin!

July 20, 2016

Don't panic. Of course it isn't. Stop typing that angry letter to the editor and read on. I just picked that title because it's clickbait and these days that's all that matters, right?

With the release of libinput 1.4 and the newest feature to add tablet pad mode switching, we've now finished the TODO list we had when libinput was first conceived. Let's see what we have in libinput right now:

  • keyboard support (actually quite boring)
  • touchscreen support (actually quite boring too)
  • support for mice, including middle button emulation where needed
  • support for trackballs including the ability to use them rotated and to use button-based scrolling
  • touchpad support, most notably:
    • proper multitouch support on touchpads [1]
    • two-finger scrolling and edge scrolling
    • tapping, tap-to-drag and drag-lock (all configurable)
    • pinch and swipe gestures
    • built-in palm and thumb detection
    • smart disable-while-typing without the need for an external process like syndaemon
    • more predictable touchpad behaviours because everything is based on physical units [2]
    • a proper API to allow for kinetic scrolling on a per-widget basis
  • tracksticks work with middle button scrolling and communicate with the touchpad where needed
  • tablet support, most notably:
    • each tool is a separate entity with its own capabilities
    • the pad itself is a separate entity with its own capabilities and events
    • mode switching is exported by the libinput API and should work consistently across callers
  • a way to identify if multiple kernel devices belong to the same physical device (libinput device groups)
  • a reliable test suite
  • Documentation!
The side-effect of libinput is that we are also trying to fix the rest of the stack where appropriate. Mostly this has meant pushing stuff into systemd/udev so far, with the odd kernel fix as well. Specifically, the udev bits mean we
  • know the DPI density of a mouse
  • know whether a touchpad is internal or external
  • fix up incorrect axis ranges on absolute devices (mostly touchpads)
  • try to set the trackstick sensitivity to something sensible
  • know when the wheel click is less/more than the default 15 degrees
And of course, the whole point of libinput is that it can be used from any Wayland compositor and take away most of the effort of implementing an input stack. GNOME, KDE and enlightenment already use libinput, and so does Canonical's Mir. And some distributions use libinput as the default driver in X through xf86-input-libinput (Fedora 22 was the first to do this). So overall libinput is already quite a success.

The hard work doesn't stop of course, there are still plenty of areas where we need to be better. And of course, new features come as HW manufacturers bring out new hardware. I already have touch arbitration on my todo list. But it's nice to wave at this big milestone as we pass it on the way to the glorious future of perfect, bug-free input. At this point, I'd like to extend my thanks to all our contributors: Andreas Pokorny, Benjamin Tissoires, Caibin Chen, Carlos Garnacho, Carlos Olmedo Escobar, David Herrmann, Derek Foreman, Eric Engestrom, Friedrich Schöller, Gilles Dartiguelongue, Hans de Goede, Jackie Huang, Jan Alexander Steffens (heftig), Jan Engelhardt, Jason Gerecke, Jasper St. Pierre, Jon A. Cruz, Jonas Ådahl, JoonCheol Park, Kristian Høgsberg, Krzysztof A. Sobiecki, Marek Chalupa, Olivier Blin, Olivier Fourdan, Peter Frühberger, Peter Hutterer, Peter Korsgaard, Stephen Chandler Paul, Thomas Hindoe Paaboel Andersen, Tomi Leppänen, U. Artie Eoff, Velimir Lisec.

Finally: libinput was started by Jonas Ådahl in late 2013, so it's already over 2.5 years old. And the git log shows we're approaching 2000 commits and a simple LOCC says over 60000 lines of code. I would also like to point out that the vast majority of commits were done by Red Hat employees, I've been working on it pretty much full-time since 2014 [3]. libinput is another example of Red Hat putting money, time and effort into the less press-worthy plumbing layers that keep our systems running. [4]

[1] Ironically, that's also the biggest cause of bugs because touchpads are terrible. synaptics still only does single-finger with a bit of icing and on bad touchpads that often papers over hardware issues. We now do that in libinput for affected hardware too.
[2] The synaptics driver uses absolute numbers, mostly based on the axis ranges for Synaptics touchpads making them unpredictable or at least different on other touchpads.
[3] Coincidentally, if you see someone suggesting that input is easy and you can "just do $foo", their assumptions may not match reality
[4] No, Red Hat did not require me to add this. I can pretty much write what I want in this blog and these opinions are my own anyway and don't necessarily reflect Red Hat yadi yadi ya. The fact that I felt I had to add this footnote to counteract whatever wild conspiracy comes up next is depressing enough.

July 19, 2016
(email sent to mesa-devel list).

I was waiting for an open source driver to appear when I realised I should really just write one myself. Some talking with Bas later, and we decided to see where we could get.

This is the point at which we were willing to show it to others, it's not really a vulkan driver yet, so far it's a vulkan triangle demos driver.

It renders the tri and cube demos from the vulkan loader, the triangle demo from Sascha Willems' demos, and the Vulkan CTS smoke tests (all 4 of them, one of which draws a triangle).

There is a lot of work to do, and it's at the stage where we are seeing if anyone else wants to join in at the start, before we make too many serious design decisions or take a path we really don't want to.

So far it's only been run on Tonga and Fiji chips, I think. We are hoping to support the radeon kernel driver for SI/CIK at some point, but I think we need to get things a bit further along on VI chips first.

The code is currently here:

There is a not-interesting branch which contains all the pre-history which might be useful for someone else bringing up a vulkan driver on other hardware.

The code is pretty much based on the Intel anv driver, with the winsys ported from the gallium driver, and most of the state setup from there. Bas wrote the code to connect NIR<->LLVM IR so we could reuse it in the future for SPIR-V in GL if required. It also copies AMD addrlib over (this should be shared).

Also we don't do SPIR-V->LLVM directly. We use NIR as it has the best chance for inter-shader-stage optimisations (vertex/fragment combined), which neither SPIR-V nor LLVM handles for us (nir doesn't do it yet, but it can).

If you want to submit bug reports, they will only be taken seriously if accompanied by working patches at this stage, and we've no plans to merge to master yet, but open to discussion on when we could do that and what would be required.
I will be presenting a lightning talk during this year's GUADEC, and running a contest related to what I will be presenting.


To enter the contest, you will need to create a Flatpak for a piece of software that hasn't been flatpak'ed up to now (application, runtime or extension), hosted in a public repository.

You will have to send me an email about the location of that repository.

I will choose a winner amongst the participants, on the eve of the lightning talks, depending on, but not limited to, the difficulty of packaging, the popularity of the software packaged and its redistributability potential.

You can find plenty of examples (and a list of already packaged applications and runtimes) on this Wiki page.


The prize: a piece of hardware that you can use to replicate my presentation (or to replicate my attempts at a presentation, depending ;). You will need to be present during my presentation at GUADEC to claim your prize.

Good luck to one and all!

I finally unlazied and moved my blog away from the Google mothership to something simple, fast and statically generated. It’s built on Jekyll, hosted on github. It’s not quite as fancy as the old one, but with some googling I figured out how to add pages for tags and an archive section, and that’s about all that’s really needed.

Comments are gone too, because I couldn’t be bothered, and because everything seems to add Orwellian amounts of trackers. Ping me on IRC, by mail or on twitter instead. The share buttons are also just plain links now without tracking for Twitter (because I’m there) and G+ (because all the cool kernel hackers are there, but I’m not cool enough).

And in case you wonder why I blather on for so long about this change: I need a new blog entry to double check that the generated feeds are still at the right spots for the various planets to pick them up …

July 18, 2016

Please note that the systemd.conf 2016 Call for Participation ends in less than two weeks, on Aug. 1st! Please send in your talk proposal by then! We’ve already got a good number of excellent submissions, but we are interested in yours even more!

We are looking for talks on all facets of systemd: deployment, maintenance, administration, development. Regardless of whether you use it in the cloud, on embedded, on IoT, on the desktop, on mobile, in a container or on the server: we are interested in your submissions!

In addition to proposals for talks for the main conference, we are looking for proposals for workshop sessions held during our Workshop Day (the first day of the conference). The workshop format consists of a day of 2-3h training sessions, that may cover any systemd-related topic you'd like. We are both interested in submissions from the developer community as well as submissions from organizations making use of systemd! Introductory workshop sessions are particularly welcome, as the Workshop Day is intended to open up our conference to newcomers and people who aren't systemd gurus yet, but would like to become more fluent.

For further details on the submissions we are looking for and the CfP process, please consult the CfP page and submit your proposal using the provided form!

And keep in mind:

REMINDER: Please sign up for the conference soon! Only a limited number of tickets are available, hence make sure to secure yours quickly before they run out! (Last year we sold out.) Please sign up here for the conference!

AND OF COURSE: We are also looking for more sponsors for systemd.conf! If you are working on systemd-related projects, or make use of it in your company, please consider becoming a sponsor of systemd.conf 2016! Without our sponsors we couldn't organize systemd.conf 2016!

Thank you very much, and see you in Berlin!

July 15, 2016

More and more distros are switching to libinput by default. That's a good thing but one side-effect is that the synclient tool does not work anymore [1], it just complains that "Couldn't find synaptics properties. No synaptics driver loaded?"

What is synclient? A bit of history first. Many years ago the only way to configure input devices was through xorg.conf options, there was nothing that allowed for run-time configuration. The Xorg synaptics driver found a solution to that: the driver would initialize a shared memory segment that kept the configuration options and a little tool, synclient (synaptics client), would know about that segment. Calling synclient with options would write to that SHM segment and thus toggle the various options at runtime. Driver and synclient had to be of the same version to know the layout of the segment and it's about as secure as you expect it to be. In 2008 I added input device properties to the server (X Input Extension 1.5 and it's part of 2.0 as well of course). Rather than the SHM segment we now had a generic API to talk to the driver. The API is quite simple, you effectively have two keys (device ID and property number) and you can set any value(s). Properties literally support just about anything but drivers restrict what they allow on their properties and which value maps to what. For example, to enable left-finger tap-to-click in synaptics you need to set the 5th byte of the "Synaptics Tap Action" property to 1.

xinput, a commandline tool and debug helper, has a generic API to change those properties so you can do things like xinput set-prop "device name" "property name" 1 [2]. It does a little bit under the hood but generally it's pretty stupid. You can run xinput set-prop and try to set a value that's out of range, or try to switch from int to float, or just generally do random things.

We were able to keep backwards compatibility in synclient, so where before it would use the SHM segment it would now use the property API, without the user interface changing (except the error messages are now standard Xlib errors complaining about BadValue, BadMatch or BadAccess). But synclient and xinput use the same API to talk to the server and the server can't tell the difference between the two.

Fast forward 8 years and now we have libinput, wrapped by the xf86-input-libinput driver. That driver does the same as synaptics, the config toggles are exported as properties and xinput can read and change them. Because really, you do the smart work by selecting the right property names and values and xinput just passes on the data. But synclient is broken now, simply because it requires the synaptics driver and won't work with anything else. It checks for a synaptics-specific property ("Synaptics Edges") and if that doesn't exists it complains with "Couldn't find synaptics properties. No synaptics driver loaded?". libinput doesn't initialise that property, it has its own set of properties. We did look into whether it's possible to have property-compatibility with synaptics in the libinput driver but it turned out to be a huge effort, flaky reliability at best (not all synaptics options map into libinput options and vice versa) and the benefit was quite limited. Because, as we've been saying since about 2009 - your desktop environment should take over configuration of input devices, hand-written scripts are dodo-esque.

So if you must insist on shellscripts to configure your input devices use xinput instead. synclient is like fsck.ext2, on that glorious day you switch to btrfs it won't work because it was only designed with one purpose in mind.

[1] Neither does syndaemon btw, but its functionality is built into libinput so that doesn't matter.
[2] xinput set-prop --type=int --format=32 "device name" "hey I have a banana" 1 2 3 4 5 6 and congratulations, you've just created a new property for all X clients to see. It doesn't do anything, but you could use those to attach info to devices. If anything was around to read that.

July 14, 2016

xinput is a commandline tool to change X device properties. Specifically, it's a generic interface to change X input driver configuration at run-time, used primarily in the absence of a desktop environment or just for testing things. But there's a feature of xinput that many don't appear to know: it resolves device and property names correctly. So plenty of times you see advice to run a command like this:

xinput set-prop 15 281 1
This is bad advice: it's almost impossible to figure out what this command is supposed to do, it depends on the device ID never changing (spoiler: it will) and the property number never changing (spoiler: it will too). Worst case, you may suddenly end up setting a different property on a different device and you won't even notice. Instead, just use the built-in name resolution features of xinput:

xinput set-prop "SynPS/2 Synaptics TouchPad" "libinput Tapping Enabled" 1
This command will work regardless of the device ID for the touchpad and regardless of the property number. Plus it's self-documenting. This has been possible for many many years, so please stop using the number-only approach.

July 13, 2016

In case you haven’t heard yet, with the recently announced Mesa 12.0 release, Intel gen8+ GPUs expose OpenGL 4.3, which is quite a big leap from the previous OpenGL 3.3!

OpenGL 4.3

The Mesa i965 Intel driver now exposes OpenGL 4.3 on Broadwell and later!

Although this might surprise some, the truth is that even if the i965 driver only exposed OpenGL 3.3 it had been exposing many of the OpenGL 4.x extensions for quite some time, however, there was one OpenGL 4.0 extension in particular that was still missing and preventing the driver from exposing a higher version: ARB_gpu_shader_fp64 (fp64 for short). There was a good reason for this: it is a very large feature that has been in the works by Intel first and Igalia later for quite some time. We first started to work on this as far back as November 2015 and by that time Intel had already been working on it for months.

I won’t cover here what made this such a large effort because there would be a lot of stuff to cover and I don’t feel like spending weeks writing a series of posts on the subject :). Hopefully I will get a chance to talk about all that at XDC in September, so instead I’ll focus on explaining why we only have this working in gen8+ at the moment and the status of gen7 hardware.

The plan for ARB_gpu_shader_fp64 was always to focus on gen8+ hardware (Broadwell and later) first because it has better support for the feature. I must add that it also has fewer hardware bugs too, although we only found out about that later ;). So the plan was to do gen8+ and then extend the implementation to cover the quirks required by gen7 hardware (IvyBridge, Haswell, ValleyView).

At this point I should explain that Intel GPUs have two code generation backends: scalar and vector. The main difference between both backends is that the vector backend (also known as align16) operates on vectors (surprise, right?) and has native support for things like swizzles and writemasks, while the scalar backend (known as align1) operates on scalars, which means that, for example, a vec4 GLSL operation is broken up into 4 separate instructions, each one operating on a single component. You might think that this makes the scalar backend slower, but that would not be accurate. In fact it is usually faster because it allows the GPU to exploit SIMD better than the vector backend.

The thing is that different hardware generations use one backend or the other for different shader stages. For example, gen8+ used to run Vertex, Fragment and Compute shaders through the scalar backend and Geometry and Tessellation shaders via the vector backend, whereas Haswell and IvyBridge use the vector backend also for Vertex shaders.

Because you can use 64-bit floating point in any shader stage, the original plan was to implement fp64 support on both backends. Implementing fp64 requires a lot of changes throughout the driver compiler backends, which makes the task anything but trivial, but the vector backend is particularly difficult to implement because the hardware only supports 32-bit swizzles. This restriction means that a hardware swizzle such as XYZW only selects components XY in a dvecN and therefore, there is no direct mechanism to access components ZW. As a consequence, dealing with anything bigger than a dvec2 requires more creative solutions, which then need to face some other hardware limitations and bugs, etc, which eventually makes the vector backend require a significantly larger development effort than the scalar backend.

Thankfully, gen8+ hardware supports scalar Geometry and Tessellation shaders and Intel‘s Kenneth Graunke had been working on enabling that for a while. When we realized that the vector fp64 backend was going to require much more effort than what we had initially thought, he gave a final push to the full scalar gen8+ implementation, which in turn allowed us to have a full fp64 implementation for this hardware and expose OpenGL 4.0, and soon after, OpenGL 4.3.

That does not mean that we don’t care about gen7 though. As I said above, the plan has always been to bring fp64 and OpenGL4 to gen7 as well. In fact, we have been hard at work on that since even before we started sending the gen8+ implementation for review and we have made some good progress.

Besides addressing the quirks of fp64 for IvyBridge and Haswell (yes, they have different implementation requirements) we also need to implement the full fp64 vector backend support from scratch, which as I said, is not a trivial undertaking. Because Haswell seems to require fewer changes we have started with that and I am happy to report that we have a working version already. In fact, we have already sent a small set of patches for review that implement Haswell’s requirements for the scalar backend and as I write this I am cleaning up an initial implementation of the vector backend in preparation for review (currently at about 100 patches, but I hope to trim it down a bit before we start the review process). IvyBridge and ValleyView will come next.

The initial implementation for the vector backend has room for improvement since the focus was on getting it working first so we can expose OpenGL4 in gen7 as soon as possible. The good thing is that it is more or less clear how we can improve the implementation going forward (you can see an excellent post by Curro on that topic here).

You might also be wondering about OpenGL 4.1’s ARB_vertex_attrib_64bit, after all, that kind of goes hand in hand with ARB_gpu_shader_fp64 and we implemented the extension for gen8+ too. There is good news here too, as my colleague Juan Suárez has already implemented this for Haswell and I would expect it to mostly work on IvyBridge as is or with minor tweaks. With that we should be able to expose at least OpenGL 4.2 on all gen7 hardware once we are done.

So far, implementing ARB_gpu_shader_fp64 has been quite the ride and I have learned a lot of interesting stuff about how the i965 driver and Intel GPUs operate in the process. Hopefully, I’ll get to talk about all this in more detail at XDC later this year. If you are planning to attend and you are interested in discussing this or other Mesa stuff with me, please find me there, I’ll be looking forward to it.

Finally, I’d like to thank both Intel and Igalia for supporting my work on Mesa and i965 all this time, my igalian friends Samuel Iglesias, who has been hard at work with me on the fp64 implementation all this time, Juan Suárez and Andrés Gómez, who have done a lot of work to improve the fp64 test suite in Piglit and all the friends at Intel who have been helping us in the process, very especially Connor Abbot, Francisco Jerez, Jason Ekstrand and Kenneth Graunke.

July 11, 2016

In an earlier post, I explained how we added graphics tablet pad support to libinput. Read that article first, otherwise this article here will be quite confusing.

A lot of tablet pads have mode-switching capabilities. Specifically, they have a set of LEDs and pressing one of the buttons cycles the LEDs. And software is expected to map the ring, strip or buttons to different functionality depending on the mode. A common configuration for a ring or strip would be to send scroll events in mode 1 but zoom in/out when in mode 2. On the Intuos Pro series tablets that mode switch button is the one in the center of the ring. On the Cintiq 21UX2 there are two sets of buttons, one left and one right and one mode toggle button each. The Cintiq 24HD is even more special, it has three separate buttons on each side to switch to a mode directly (rather than just cycling through the modes).

In the upcoming libinput 1.4 we will have mode switching support in libinput, though modes themselves have no real effect within libinput, it is merely extra information to be used by the caller. The important terms here are "mode" and "mode group". A mode is a logical set of button, strip and ring functions, as interpreted by the compositor or the client. How they are used is up to them as well. The Wacom control panels for OS X and Windows allow mode assignment only to the strip and rings while the buttons remain in the same mode at all times. We assign a mode to each button so a caller may provide differing functionality on each button. But that's optional, having a OS X/Windows-style configuration is easy, just ignore the button modes.

A mode group is a physical set of buttons, strips and rings that belong together. On most tablets there is only one mode group but tablets like the Cintiq 21UX2 and the 24HD have two independently controlled mode groups - one left and one right. That's all there is to mode groups, modes are a function of mode groups and can thus be independently handled. Each button, ring or strip belongs to exactly one mode group. And finally, libinput provides information about which button will toggle modes or whether a specific event has toggled the mode. Documentation and a starting point for which functions to look at is available in the libinput documentation.

Mode switching on Wacom tablets is actually software-controlled. The tablet relies on some daemon running to intercept button events and write to the right sysfs files to toggle the LEDs. In the past this was handled by e.g. a callout by gnome-settings-daemon. The first libinput draft implementation took over that functionality so we only have one process to handle the events. But there are a few issues with that approach. First, we need write access to the sysfs file that exposes the LED. Second, running multiple libinput instances would result in conflicts during LED access. Third, the sysfs interface is decidedly nonstandard and quite quirky to handle. And fourth, the most recent device, the Express Key Remote has hardware-controlled LEDs.

So instead we opted for a two-factor solution: the non-standard sysfs interface will be deprecated in favour of a proper kernel LED interface (/sys/class/leds/...) with the same contents as other LEDs. And second, the kernel will take over mode switching using LED triggers that are set up to cover the most common case - hitting a mode toggle button changes the mode. Benjamin Tissoires is currently working on those patches. Until then, libinput's backend implementation will just pretend that each tablet only has one mode group with a single mode. This allows us to get the rest of the userstack in place and then, once the kernel patches are in a released kernel, switch over to the right backend.

June 25, 2016

sign-big-150dpi-magnified-name-200x200I’m sad to say it’s the end of the road for me with Gentoo, after 13 years volunteering my time (my “anniversary” is tomorrow). My time and motivation to commit to Gentoo have steadily declined over the past couple of years and eventually stopped entirely. It was an enormous part of my life for more than a decade, and I’m very grateful to everyone I’ve worked with over the years.

My last major involvement was running our participation in the Google Summer of Code, which is now fully handed off to others. Prior to that, I was involved in many things from migrating our X11 packages through the Big Modularization and maintaining nearly 400 packages to serving 6 terms on the council and as desktop manager in the pre-council days. I spent a long time trying to change and modernize our distro and culture. Some parts worked better than others, but the inertia I had to fight along the way was enormous.

No doubt I’ve got some packages floating around that need reassignment, and my retirement bug is already in progress.

Thanks, folks. You can reach me by email using my nick at this domain, or on Twitter, if you’d like to keep in touch.

June 23, 2016


I’ve been fortunate enough lately to attend the largest virtual reality professional event/conference: SVVR. This virtual reality conference has been held each year in Silicon Valley for 3 years now. This year, it showcased more than 100 VR companies on the exhibit floor and welcomed more than 1400 VR professionals and enthusiasts from all around the world. As a VR enthusiast myself, I attended the full 3-day conference, met most of the exhibitors, and I’d like to summarize my thoughts and the things I learned below, grouped under various themes. This post is by no means exhaustive and consists of my own, personal opinions.


I realize that content creation for VR is really becoming the one area where most players will end up working. Hardware manufacturers and platform software companies are building the VR infrastructure as we speak (and it’s already comfortably usable), but as we move along and standards become more solid, I’m pretty sure we’re going to see lots and lots of new start-ups in the VR content world, creating immersive games, 360 video content, live VR events, etc… Right now, the range of deployment options for a content developer is not that broad. The vast majority of content creators are targeting the Unity3D plug-in, since it’s got built-in support for virtually every VR device on the market, like the Oculus family of headsets, HTC Vive, PlayStation VR, Samsung’s GearVR, and even generic D3D or OpenGL-based applications on PC/Mac/Linux.

2 types of content

There really are two main types of VR content out there: 3D computer-generated content, and 360 real-life captured content.


The former is what we usually refer to when thinking about VR, that is, computer-generated 3D worlds, e.g. in games, in which the VR user can wander and interact. This is usually the kind of content used in VR games, but also in VR applications, like Google’s great drawing app called TiltBrush (more info below). Click here to see a nice demo video!

The latter is everything that’s not generated but rather “captured” from real life and projected or rendered in the VR space with the use of, most commonly, spherical projections and post-processing stitching and filtering. Simply said, we’re talking about 360 videos here (both 2D and 3D). Usually, this kind of content does not let VR users interact with the VR world as “immersively” as the computer-generated 3D content. It’s rather “played back” and “replayed” just like regular online television series, for example, except for the fact that watchers can “look around”.

At SVVR2016, there were so many exhibitors doing VR content… Like InnerVision VR, Baobab Studios, SculptVR, MiddleVR, Cubicle ninjas, etc… on the computer-generated side, and Facade TV, VR Sports, Koncept VR, etc… on the 360 video production side.


Personally, I think tracking is by far the most important factor when considering the whole VR user experience. You have to actually try the HTC Vive tracking system to understand. The HTC Vive uses two “Lighthouse” camera towers placed in the room to let you track a larger space, something close to 15′ x 15′ ! I tried it a lot of times and tracking always seemed to keep pretty solid and constant. With the Vive you can literally walk in the VR space, zig-zag, leap and dodge without losing detection. On that front, I think competition is doing quite poorly. For example, Oculus’ CV1 is only tracking your movement from the front and the tracking angle  is pretty narrow… tracking was often lost when I faced away just a little… disappointing!

Talking about tracking, one of the most amazing talks was Leap Motion CTO David Holz’s demo of his brand new ‘Orion’, which is a truly impressive hand tracking camera with very powerful detection algos and very, very low latency. We could only “watch” David interact, but it looked so natural !  Check it out for yourself !


Audio is becoming increasingly crucial to the VR work flow since it adds so much to the VR experience. It is generally agreed in the VR community that awesome, well 3D-localised audio that seems “real” can add a lot of realism even to the visuals. At SVVR2016, there were a few audio-centric exhibitors like Ossic and Subpac. The former is releasing a kickstarter-funded 3D headset that lets you “pan” stereo audio content by rotating your head left-right. The latter is showcasing a complete body suit using tactile transducers and vibrotactile membranes to make you “feel” audio. The goal of this article is not to review specific technologies, but to discuss all the aspects/domains that are part of the VR experience and, when it comes to audio, I unfortunately feel we’re still at the “3D sound is enough” level, but I believe it’s not.

See, proper audio 3D localization is a must of course. You obviously do not want to play a VR game where a dog appearing on your right is barking on your left!… nor do you want to have the impression a hovercraft is approaching up ahead when it’s actually coming from the back. Fortunately, we now have pretty good audio engines that correctly render audio coming from anywhere around you with good front/back discrimination. A good example of that is 3Dception from TwoBigEars. 3D spatialization of audio channels is a must-have and yet, it’s an absolute minimum in my opinion. Try it for yourself! Most of today’s VR games have coherent sound, spatially, but most of the time, you just do not believe sound is actually “real”. Why?

Well, there are a number of reasons, ranging from limited audio diversity (a limited number of objects and details in the audio feed, like missing tiny air flows, the user’s breathing or the room’s ambient noise level) to limited sound cancellation capability (the ability to suppress high-pitched ambient sounds coming from “outside” the game), but I guess one of the most important factors is simply the way audio is recorded and rendered on our day-to-day cheap stereo headsets. A lot of promise comes with binaural recording and stereo-to-binaural conversion algorithms. Binaural recording is a technique that records audio through two tiny omni microphones placed under diaphragm structures resembling human ears, so that the audio bounces around just as it would when routed through real ears before reaching the microphones. The binaural audio experience is striking and the “stereo” feeling is magnified. It is very difficult to explain; you have to hear it for yourself.

Speaking of ear structure having a direct impact on the audio spectrum, I think one of the most promising techniques for added audio realism will be geometry-based audio modeling, where you can basically render sound as if it had actually been reflected off computer-generated 3D geometry. Using such models, a dog barking in front of a tiled metal shed will sound really different from the same dog barking near a wooden chalet. The brain does pick up those tiny details, and that’s why you find companies like Nvidia releasing their brand new “Physically Based Acoustic Simulator Engine” in VRWorks.
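As a toy illustration of what geometry-based modeling means at its simplest, here is a small Python sketch of a first-order “image source” reflection off a single wall. The material absorption values and the geometry are made up for the example, and a real engine obviously traces many reflection orders through a full 3D scene.

    import math

    SPEED_OF_SOUND = 343.0  # m/s

    # Hypothetical absorption values: the fraction of energy lost per reflection.
    MATERIALS = {"tiled_metal": 0.05, "wood": 0.35}

    def first_order_reflection(source, listener, wall_x, material):
        """Mirror the source across a wall at x = wall_x and derive the delay and
        gain of the single reflected path (a toy image-source model)."""
        image = (2 * wall_x - source[0], source[1])
        dist = math.dist(image, listener)                  # length of the reflected path
        delay_ms = dist / SPEED_OF_SOUND * 1000.0
        gain = (1.0 - MATERIALS[material]) / max(dist, 1.0)
        return delay_ms, gain

    # The same bark arrives differently off a tiled metal shed vs. a wooden chalet:
    print(first_order_reflection((0, 0), (2, 1), 3.0, "tiled_metal"))
    print(first_order_reflection((0, 0), (2, 1), 3.0, "wood"))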


Haptics is another very interesting VR domain. It consists of letting users perceive virtual objects not through the visual or aural channels, but through touch. Usually, this sense of touch in a VR experience is provided by special haptic wands that, using force feedback and other technologies, make you think you are actually pushing an object in the VR world.

You mostly find two types of haptic devices out there: wand-based and glove-based. Gloves are of course more natural to most users. It’s easy to picture yourself in a VR game trying to “feel” raindrops falling on your fingers, or in a flight simulator, pushing buttons and really feeling them. However, from talking to many exhibitors at SVVR, it seems we’ll be stuck at the “feel button pushes” level for quite some time, as we’re very far from being able to render “textures”: the spatial resolutions involved would simply be too high for any haptic technology currently available. There are some pretty cool start-ups with awesome glove-based haptic technologies, like the Kickstarter-funded Neurodigital Technologies GloveOne or Virtuix’s Hands Omni.

Now, I’m not saying wand-based haptic technologies are outdated or unpromising. In fact, I think they are more promising than gloves for any VR application that relies on “tools”, like a painting app requiring you to use a brush or a remote-surgery application requiring you to use an actual scalpel! When it comes to wands, tools and the like, the potential for haptic feedback is multiplied, because you simply have more room to fit more actuators and gyros. I once tried an arm-based 3D joystick in a CAD application and I could swear I was really hitting objects with my design tool… it was stunning!


If VR really takes off in the consumer mass market someday soon, it will most probably be social. That’s something I heard (paraphrased) at SVVR2016 in David Baszucki’s very interesting talk, “Why the future of VR is social”. In essence, just look at how technology is appropriated nowadays and acknowledge that the vast majority of applications rely on the “social” aspect. People want to “connect”, “communicate” and “share”. So when VR comes around, why would that suddenly be different? Of course, gamers will want to play really immersive VR games and workers will want to use VR in their daily tasks to boost productivity, but most users will probably want to put on their VR glasses to talk to relatives thousands of miles away as if they were sitting in the same room. See? Even the gamers and workers I referred to above will want to play or work “with other real people”. No matter how you use VR, I truly believe the social factor will be one of the most important ones to consider when building successful software. At SVVR2016, I discovered a very interesting start-up focused on the social VR experience. In mimesys‘s telepresence demo, using an HTC Vive controller, I collaborated on a painting with a “real” person hooked up to the same system, painting from his home apartment in France, some 9850 km away, and I had a pretty good sense of his “presence”. The 3D geometry and rendered textures were not perfect, but it was good enough for a true collaboration experience!


We’re only at the very beginning of this very exciting journey through Virtual Reality, and it’s really difficult to predict what VR devices will look like in only 3-5 years because things are moving so quickly. A big area I did not cover in my post, and one that will surely change a lot of parameters in the VR world moving forward, is AR – Augmented Reality :) Check out what MagicLeap is up to these days!



June 21, 2016
Early bird gets eaten by the Nyarlathotep
The more adventurous of you can use those (designed as embeddable) Lua scripts to transform your DRM-free downloads into Flatpaks.

The long-term goal would obviously be for this not to be needed: online game stores would ship ".flatpak" files with metadata, so that GNOME Software knows what they are, the right voice/subtitle language is picked up automatically, and the extra music and documents show up in the respective GNOME applications.
But in the meanwhile, and for the sake of the games already out there, there's flatpak-games. Note that lua-archive is still fiddly.
Support for a few more Humble Bundle formats (some are already handled), for grab-all RPMs and Debs, and for those old Loki games is also planned.
It's late here, I'll be off to do some testing I think :)

PS: Even though my personal collection already contains enough programs that fail to create bundles that I don't need to accept "game donations", I'm still looking for original copies of Loki games. Drop me a message if you can spare one!

This is a very exciting day for me, as two major projects I am deeply involved with are having a major launch. First of all, Fedora Workstation 24 is out, which crosses a few critical milestones for us. Maybe the most visible is that this is the first time you can use the new graphical update mechanism in GNOME Software to take you from Fedora Workstation 23 to Fedora Workstation 24. This means that when you open GNOME Software it will show you an option to do a system upgrade to Fedora Workstation 24. We have been testing and doing a lot of QA work around this feature, so my expectation is that it will provide a smooth upgrade experience for you.
Fedora System Upgrade

The second major milestone is that we feel Wayland is now in a state where the vast majority of users should be able to use it on a day-to-day basis. We have been working through the kinks and resolving many corner cases over the previous six months, with a lot of effort put into making sure that the interaction between applications running natively on Wayland and those running under XWayland is smooth. For instance, one item we crossed off the list early in this development cycle was adding middle-mouse-button cut and paste, as we know that is a crucial feature for many long-time Linux users looking to make the switch. So once you have updated, I ask all of you to try switching to the Wayland session by clicking on the little cogwheel in the login screen, so that we get as much testing of Wayland as possible during the Fedora Workstation 24 lifespan. Feedback provided by our users during the Fedora Workstation 24 lifecycle will be crucial in allowing us to make the final decision about Wayland as the default for Fedora Workstation 25. Of course, the team will be working ardently during Fedora Workstation 24 to make sure we find and address any niggling issues left.

In addition to that there is also of course a long list of usability improvements, new features and bugfixes across the desktop, both coming in from our desktop team at Red Hat and from the GNOME community in general.

There was also the formal announcement of Flatpak today (be sure to read that press release), the great new technology for shipping desktop applications. Those of you who have read my previous blog entries have probably seen me talking about this technology under its old name, xdg-app. Flatpak is an incredible piece of engineering designed by Alexander Larsson, and we developed it alongside a lot of other components.
That is because, as Matthew Garrett pointed out not long ago, unless we move away from X11 we cannot really produce a secure desktop container technology, which is why we kept such a high focus on pushing Wayland forward over the last year. It is also why we invested so much time into Pinos, which is, as I mentioned in my original announcement of the project, our video equivalent of PulseAudio (and yes, a proper Pinos website is getting close :). Wim Taymans, who created Pinos, has also been working on patches to PulseAudio to make it more suitable for use with sandboxed applications, and those patches have recently been taken over by community member Ahmed S. Darwish, who is trying to get them ready for merging into the main codebase.

We are feeling very confident about Flatpak, as it has a lot of critical features designed in from the start. First of all, it was built to be a cross-distribution solution from day one, meaning that making Flatpak run on any major Linux distribution out there should be simple. We already have Simon McVittie working on Debian support, there is Arch support, and the team has put together an Ubuntu PPA that lets you run fully featured Flatpaks on Ubuntu. And Endless Mobile has chosen Flatpak as the application delivery format for their operating system going forward.

Flatpak uses the same base technologies as Docker, such as namespaces, bind mounts and cgroups, which means that any system wanting to support Docker images would also have the necessary components to support Flatpaks. It also means we will be able to take advantage of the investment and development happening around server-side containers.

Flatpak also makes heavy use of another exciting technology, OSTree, which was originally developed by Colin Walters for GNOME. OSTree is seeing a lot of investment and development these days, as it has become the foundation for Project Atomic, Red Hat’s effort to create an enterprise-ready platform for running server-side containers. OSTree provides us with a lot of important features, like efficient storage of application images and a very efficient transport mechanism. For example, one core feature OSTree brings us is de-duplication of files, which means you don’t need to keep multiple copies of the same file on your disk: if ten Flatpak images share the same file, you only keep one copy of it on your local disk.
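To make the de-duplication idea concrete, here is a minimal, purely illustrative Python sketch of content-addressed storage: a file is stored under the hash of its contents, so identical files coming from different application images end up as a single object on disk, and deployed copies are just hard links to that object. This is only a conceptual model of the technique, not OSTree’s actual implementation (OSTree also checksums metadata, handles commits and branches, and much more), and the repository path is made up for the example.

    import hashlib
    import os
    import shutil

    OBJECT_STORE = "repo/objects"  # hypothetical local object store

    def store_file(path):
        """Store a file under the hash of its contents; identical files share one object."""
        digest = hashlib.sha256(open(path, "rb").read()).hexdigest()
        obj = os.path.join(OBJECT_STORE, digest)
        if not os.path.exists(obj):               # first copy: actually write the object
            os.makedirs(OBJECT_STORE, exist_ok=True)
            shutil.copyfile(path, obj)
        return obj                                # later copies reuse the existing object

    def checkout(obj, target):
        """A deployed file is just a hard link to the shared object, costing no extra space."""
        os.link(obj, target)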

Another critical feature of Flatpak is its runtime separation, which basically means that you can have different runtimes for different families of use cases. For instance, you can have a GNOME runtime that lets all your GNOME applications share a lot of libraries while giving you a single point for security updates to those libraries. So while we don’t want a forest of runtimes, it does allow us to create a few important ones covering certain families of applications, and thus reduce disk usage further and improve system security.

Going forward we are looking at a lot of exciting features for Flatpak. The most important of these is the thing I mentioned earlier, Portals.
In the current release of Flatpak you can choose between two options: either your application is completely sandboxed, or it is not sandboxed at all. Portals are basically the way you can sandbox your application yet still allow it to interact with the general desktop and with storage. For instance, the role of Pinos and PulseAudio for containers is to provide such portals for handling audio and video. Of course, more portals are needed, and during the GTK+ hackfest in Toronto last week a lot of time was spent mapping out the roadmap for Portals. Expect more news about Portals as they get developed.

I want to mention that we of course realize that a new technology like Flatpak should come with a high-quality developer story, which is why Christian Hergert has been spending time working on support for Flatpak in the Builder IDE. There is some support in there already, but expect to see us flesh this out significantly over the next months. We are also working on adding more documentation to the Flatpak website, covering how to integrate more build systems and the like with Flatpak.

And last, but not least, Richard Hughes has been making sure we have great Flatpak support in GNOME Software in Fedora Workstation 24, ensuring that as an end user you shouldn’t have to care whether your application is a Flatpak or an RPM.

June 20, 2016

I'm back from the GTK hackfest in Toronto, Canada and mostly recovered from jetlag, so it's time to write up my notes on what we discussed there.

Despite the hackfest's title, I was mainly there to talk about non-GUI parts of the stack, and technologies that fit more closely in what could be seen as the platform than they do in GNOME. In particular, I'm interested in Flatpak as a way to deploy self-contained "apps" in a freedesktop-based, sandboxed runtime environment layered over the Universal Operating System and its many derivatives, with both binary and source compatibility with other GNU/Linux distributions.

I'm mainly only writing about discussions I was directly involved in: lots of what sounded like good discussion about the actual graphics toolkit went over my head completely :-) More notes, mostly from Matthias Clasen, are available on the GNOME wiki.

In no particular order:

Thinking with portals

We spent some time discussing Flatpak's portals, mostly on Tuesday. These are the components that expose a subset of desktop functionality as D-Bus services that can be used by contained applications: they are part of the security boundary between a contained app and the rest of the desktop session. Android's intents are a similar concept seen elsewhere. While the portals are primarily designed for Flatpak, there's no real reason why they couldn't be used by other app-containment solutions such as Canonical's Snap.

One major topic of discussion was their overall design and layout. Most portals will consist of a UX-independent part in Flatpak itself, together with a UX-specific implementation of any user interaction the portal needs. For example, the portal for file selection has a D-Bus service in Flatpak, which interacts with some UX-specific service that will pop up a standard UX-specific "Open" dialog — for GNOME and probably other GTK environments, that dialog is in (a branch of) GTK.

A design principle that was reiterated in this discussion is that the UX-independent part should do as much as possible, with the UX-specific part only carrying out the user interactions that need to comply with a particular UX design (in the GTK case, GNOME's design). This minimizes the amount of work that needs to be redone for other desktop or embedded environments, while still ensuring that the other environments can have their chosen UX design. In particular, it's important that, as much as possible, the security- and performance-sensitive work (such as data transport and authentication) is shared between all environments.

The aim is for portals to get the user's permission to carry out actions, while keeping it as implicit as possible, avoiding an "are you sure?" step where feasible. For example, if an application asks to open a file, the user's permission is implicitly given by them selecting the file in the file-chooser dialog and pressing OK: if they do not want this application to open a file at all, they can deny permission by cancelling. Similarly, if an application asks to stream webcam data, the UX we expect is for GNOME's Cheese app (or a similar non-GNOME app) to appear, open the webcam to provide a preview window so they can see what they are about to send, but not actually start sending the stream to the requesting app until the user has pressed a "Start" button. When defining the API "contracts" to be provided by applications in that situation, we will need to be clear about whether the provider is expected to obtain confirmation like this: in most cases I would anticipate that it is.

One security trade-off here is that we have to have a small amount of trust in the providing app. For example, continuing the example of Cheese as a webcam provider, Cheese could (and perhaps should) be a contained app itself, whether via something like Flatpak, an LSM like AppArmor or both. If Cheese is compromised somehow, then whenever it is running, it would be technically possible for it to open the webcam, stream video and send it to a hostile third-party application. We concluded that this is an acceptable trade-off: each application needs to be trusted with the privileges that it needs to do its job, and we should not put up barriers that are easy to circumvent or otherwise serve no purpose.

The main (only?) portal so far is the file chooser, in which the contained application asks the wider system to show an "Open..." dialog, and if the user selects a file, it is returned to the contained application through a FUSE filesystem, the document portal. The reference implementation of the UX for this is in GTK, and is basically a GtkFileChooserDialog. The intention is that other environments such as KDE will substitute their own equivalent.
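For a feel of what the contained side of this looks like, here is a rough Python/GDBus sketch that asks the portal to show the "Open..." dialog. The bus and interface names are the ones the portal ended up using (org.freedesktop.portal.Desktop exposing org.freedesktop.portal.FileChooser), so treat this as an after-the-fact illustration rather than the exact API as it stood at the hackfest; the actual result arrives asynchronously via a Response signal on the returned Request object, which the sketch only prints.

    import gi
    gi.require_version("Gio", "2.0")
    from gi.repository import Gio, GLib

    # Connect to the session bus and talk to the portal frontend.
    bus = Gio.bus_get_sync(Gio.BusType.SESSION, None)
    proxy = Gio.DBusProxy.new_sync(
        bus, Gio.DBusProxyFlags.NONE, None,
        "org.freedesktop.portal.Desktop",      # well-known portal bus name
        "/org/freedesktop/portal/desktop",     # single object exporting all portals
        "org.freedesktop.portal.FileChooser",  # the file chooser portal interface
        None)

    # Ask the host to show an "Open" dialog; the app never touches the real filesystem.
    handle = proxy.call_sync(
        "OpenFile",
        GLib.Variant("(ssa{sv})", ("", "Open Document", {})),
        Gio.DBusCallFlags.NONE, -1, None)

    # 'handle' is the object path of a Request; the chosen file (exposed through the
    # FUSE document portal) comes back in that object's Response signal.
    print("request handle:", handle.unpack()[0])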

Other planned portals include:

  • image capture (scanner/camera)
  • opening a specified URI
    • this needs design feedback on how it should work for non-http(s)
  • sharing content, for example on social networks (like Android's Sharing menu)
  • proxying joystick/gamepad input (perhaps via Wayland or FUSE, or perhaps by modifying libraries like SDL with a new input source)
  • network proxies (GProxyResolver) and availability (GNetworkMonitor)
  • contacts/address book, probably vCard-based
  • notifications, probably based on Notifications
  • video streaming (perhaps using Pinos, analogous to PulseAudio but for video)

Environment variables

GNOME on Wayland currently has a problem with environment variables: there are some traditional ways to set environment variables for X11 sessions or login shells using shell script fragments (/etc/X11/Xsession.d, /etc/X11/xinit/xinitrc.d, /etc/profile.d), but these do not apply to Wayland, or to noninteractive login environments like cron and systemd --user. We are also keen to avoid requiring a Turing-complete shell language during session startup, because it's difficult to reason about and potentially rather inefficient.

Some uses of environment variables can be dismissed as unnecessary or even unwanted, similar to the statement in Debian Policy §9.9: "A program must not depend on environment variables to get reasonable defaults." However, there are two common situations where environment variables can be necessary for proper OS integration: search-paths like $PATH, $XDG_DATA_DIRS and $PYTHONPATH (particularly necessary for things like Flatpak), and optionally-loaded modules like $GTK_MODULES and $QT_ACCESSIBILITY where a package influences the configuration of another package.

There is a stopgap solution in GNOME's gdm display manager, /usr/share/gdm/env.d, but this is gdm-specific and insufficiently expressive to provide the functionality needed by Flatpak: "set XDG_DATA_DIRS to its specified default value if unset, then add a couple of extra paths".
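As a sketch of the kind of logic that is needed (the fragment directory and one-path-per-line format here are invented for the example, not an existing spec), setting a default when the variable is unset and then folding in .d-style fragments might look roughly like this:

    import glob
    import os

    def extend_search_path(env, name, default, fragment_dir):
        """Set `name` to `default` if unset, then append one path per line found in
        any *.conf fragment under `fragment_dir` (a hypothetical format)."""
        parts = (env.get(name) or default).split(":")
        for fragment in sorted(glob.glob(os.path.join(fragment_dir, "*.conf"))):
            with open(fragment) as f:
                for line in f:
                    path = line.strip()
                    if path and path not in parts:
                        parts.append(path)
        env[name] = ":".join(parts)
        return env[name]

    # Give XDG_DATA_DIRS its specified default, then add a package's extra paths.
    extend_search_path(os.environ, "XDG_DATA_DIRS",
                       "/usr/local/share:/usr/share",
                       "/etc/search-path.d")  # made-up directory name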

pam_env comes closer — PAM is run at every transition from "no user logged in" to "user can execute arbitrary code as themselves" — but it doesn't support .d fragments, which are required if we want distribution packages to be able to extend search paths. pam_env also turns off per-user configuration by default, citing security concerns.

I'll write more about this when I have a concrete proposal for how to solve it. I think the best solution is probably a PAM module similar to pam_env but supporting .d directories, either by modifying pam_env directly or out-of-tree, combined with clarifying what the security concerns for per-user configuration are and how they can be avoided.

Relocatable binary packages

On Windows and OS X, various GLib APIs automatically discover where the application binary is located and use search paths relative to that; for example, if C:\myprefix\bin\app.exe is running, GLib might put C:\myprefix\share into the result of g_get_system_data_dirs(), so that the application can ask to load app/data.xml from the data directories and get C:\myprefix\share\app\data.xml. We would like to be able to do the same on Linux, for example so that the apps in a Flatpak or Snap package can be constructed from RPM or dpkg packages without needing to be recompiled for a different --prefix, and so that other third-party software packages like the games on Steam can easily locate their own resources.
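As a rough illustration of the idea (and not GLib's actual implementation), a relocatable application could derive its prefix from the location of the running binary and resolve data files relative to that:

    import os
    import sys

    def relocatable_data_dir(app_name):
        """Derive <prefix>/share/<app_name> from the running binary's location,
        mirroring what GLib already does automatically on Windows and OS X."""
        exe = os.path.realpath(sys.argv[0])             # e.g. /opt/myprefix/bin/app
        prefix = os.path.dirname(os.path.dirname(exe))  # -> /opt/myprefix
        return os.path.join(prefix, "share", app_name)

    # e.g. loading app/data.xml relative to wherever the package was unpacked:
    data_file = os.path.join(relocatable_data_dir("app"), "data.xml")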

Relatedly, there are currently no well-defined semantics for what happens when a .desktop file or a D-Bus .service file has Exec=./bin/foo. The meaning of Exec=foo is well-defined (it searches $PATH) and the meaning of Exec=/opt/whatever/bin/foo is obvious. When this came up in D-Bus previously, my assertion was that the meaning should be the same as in .desktop files, whatever that is.

We agreed to propose that the meaning of a non-absolute path in a .desktop or .service file should be interpreted relative to the directory where the .desktop or .service file was found: for example, if /opt/whatever/share/applications/foo.desktop says Exec=../../bin/foo, then /opt/whatever/bin/foo would be the right thing to execute. While preparing a mail to the freedesktop and D-Bus mailing lists proposing this, I found that I had proposed the same thing almost 2 years ago... this time I hope I can actually make it happen!
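Restated as code, the proposed rule would look something like the following sketch (this is just the proposal from the paragraph above, not an implementation taken from any existing launcher): an absolute path is used as-is, a path containing a slash is resolved against the directory holding the .desktop or .service file, and a bare name keeps its $PATH lookup.

    import os
    import shutil

    def resolve_exec(desktop_file, exec_value):
        """Resolve the Exec= program according to the proposed rule."""
        program = exec_value.split()[0]
        if os.path.isabs(program):
            return program                                # Exec=/opt/whatever/bin/foo
        if "/" in program:
            # Exec=../../bin/foo: relative to the .desktop file's own directory.
            base = os.path.dirname(os.path.abspath(desktop_file))
            return os.path.normpath(os.path.join(base, program))
        return shutil.which(program)                      # Exec=foo: search $PATH

    # /opt/whatever/share/applications/foo.desktop with Exec=../../bin/foo
    # resolves to /opt/whatever/bin/foo:
    print(resolve_exec("/opt/whatever/share/applications/foo.desktop", "../../bin/foo"))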

Flatpak and OSTree bug fixing

On the way to the hackfest, and while the discussion moved to topics that I didn't have useful input on, I spent some time fixing up the Debian packaging for Flatpak and its dependencies. In particular, I did my first upload as a co-maintainer of bubblewrap, uploaded ostree to unstable (with the known limitation that the grub, dracut and systemd integration is missing for now since I haven't been able to test it yet), got most of the way through packaging Flatpak 0.6.5 (which I'll upload soon), cherry-picked the right patches to make ostree compile on Debian 8 in an effort to make backports trivial, and spent some time disentangling a flatpak test failure which was breaking the Debian package's installed-tests. I'm still looking into ostree test failures on little-endian MIPS, which I was able to reproduce on a Debian porterbox just before the end of the hackfest.

OSTree + Debian

I also had some useful conversations with developers from Endless, who recently opened up a version of their OSTree build scripts for public access. Hopefully that information brings me a bit closer to being able to publish a walkthrough for how to deploy a simple Debian derivative using OSTree (help with that is very welcome of course!).

GTK life-cycle and versioning

The life-cycle of GTK releases has already been mentioned here and elsewhere, and there are some interesting responses in the comments on my earlier blog post.

It's important to note that what we discussed at the hackfest is only a proposal: a hackfest discussion between a subset of the GTK maintainers and a small number of other GTK users (I am in the latter category) doesn't, and shouldn't, set policy for all of GTK or for all of GNOME. I believe the intention is that the GTK maintainers will discuss the proposals further at GUADEC, and make a decision after that.

As I said before, I hope that being more realistic about API and ABI guarantees can avoid GTK going too far towards either of the possible extremes: either becoming unable to advance because it's too constrained by compatibility, or breaking applications because it isn't constrained enough. The current situation, where it is meant to be compatible within the GTK 3 branch but in practice applications still sometimes break, doesn't seem ideal for anyone, and I hope we can do better in future.


Thanks to everyone involved, particularly:

  • Matthias Clasen, who organised the hackfest and took a lot of notes
  • Allison Lortie, who provided on-site cat-herding and led us to some excellent restaurants
  • Red Hat Inc., who provided the venue (a conference room in their Toronto office), snacks, a lot of coffee, and several participants
  • my employers Collabora Ltd., who sponsored my travel and accommodation
June 17, 2016

I have wanted to write this blog post since April, and even announced it in two previous posts, but never got around to actually writing it until now. And with the recent events in Snappy and Flatpak land, I cannot defer this post any longer (unless I want to answer the same questions over and over on IRC ^^).

As you know, I have been developing the Limba 3rd-party software installer since 2014 (see this LWN article, which explains the project better than I could 😉 ). It is a spiritual successor to the Listaller project, which had been in development since roughly 2008. Limba got some competition from Flatpak and Snappy, so it’s natural to ask what the project’s next steps will be.

Meeting with the competition

At last FOSDEM and at the GNOME Software sprint this year in April, I met with Alexander Larsson and we discussed the rather unfortunate situation we got into, with Flatpak and Limba being in competition.

Both Alex and I have been experimenting with 3rd-party app distribution for quite some time, with me working on Listaller and him working on Glick and Glick2, and none of those projects ever went anywhere. Around the time I started Limba, fixing the design mistakes made with Listaller, Alex started a new attempt at software distribution, this time with sandboxing added to the mix and a new OSTree-based design for the distribution mechanism. It wasn’t at all clear back then that XdgApp, later renamed to Flatpak, would get huge backing from GNOME and later Red Hat, becoming a very promising candidate for a truly cross-distro software distribution system.

The main difference between Limba and Flatpak is that Limba allows modular runtimes, with things like the toolkit, helper libraries and programs being separate modules that can be updated independently. Flatpak, on the other hand, allows just one static runtime and requires everything that is not already in the runtime to be bundled with the actual application. So while a Limba bundle might depend on multiple other individual bundles, Flatpak bundles have a single fixed dependency on one runtime. A compromise between those two concepts is not possible, and since the modular vs. static approaches in Limba and Flatpak were fundamental, conscious design decisions, merging the projects was not possible either.

Alex and I had very productive discussions, and except for the modularity issue, we were pretty much on the same page in every other aspect regarding the sandboxing and app-distribution matters.

Sometimes stepping out of the way is the best way to achieve progress

So, what to do now? Obviously, I can continue to push Limba forward, but given all the other projects I maintain, this seems to be a waste of resources (Limba eats a lot of my spare time). Now that Flatpak and Snappy are available, I am basically competing with Canonical and Red Hat, who can make much more progress much faster than I can as a single developer. Also, Flatpak’s bigger base of contributors compared to Limba is a clear sign of which project the community favors.

Furthermore, I started the Listaller and Limba projects to scratch an itch. When I was new to Linux, it was very annoying to see some applications only made available in compiled form for one distribution, and sometimes one that I didn’t use. Getting software was incredibly hard for me as a newbie, and using the package manager was also unusual back then (no software-center apps existed, only package lists). If you wanted to update one app, you usually needed to update your whole distribution, sometimes even to a development version or rolling-release channel, sacrificing stability.

So, if this issue now gets solved by someone else in a good way, there is no point in pushing my solution hard. I developed a tool to solve a problem, and it looks like another tool will fix that issue before mine does, which is fine, because this longstanding problem will finally be solved. And that’s the thing I actually care most about.

I still think Limba is the superior solution for app distribution, but it is also the most complex one and requires additional work from upstream projects to use it properly. That is something most projects don’t want, and that’s completely fine. 😉

That being said, I think Flatpak is a great project. Alex has a lot of experience in this area, and the design of Flatpak is sound. It solves many of the issues 3rd-party app development faces in a pretty elegant way, and I like it very much for that. The focus on sandboxing is great too, although that part will need more time to become really useful. (Aside from that, working with Alexander is a pleasure, and he really cares about making Flatpak a truly cross-distribution, vendor-independent project.)

Moving forward

So what I will do now is not stop Limba development completely, but keep it going as a research project. Maybe Limba bundles could be used to create Flatpak packages more easily. We also discussed having Flatpak launch applications installed by Limba, which would allow both systems to coexist and benefit from each other. Since Limba (unlike Listaller) was also explicitly designed for web applications, and therefore has a slightly wider scope than Flatpak, this could make sense.

In any case though, I will invest much less time in the Limba project. This is good news for all the people out there using the Tanglu Linux distribution, AppStream-metadata-consuming services, PackageKit on Debian, etc. – those will receive more attention 😉

An integral part of Limba is a web service called “LimbaHub” that accepts new bundles, does QA on them and publishes them in a public repository. I will likely rewrite it to be a service using Flatpak bundles, maybe even supporting Flatpak bundles and Limba bundles side by side (and, if useful, maybe also AppImageKit and Snappy). But this project is still on the drawing board.

Let’s see 🙂

P.S: If you come to Debconf in Cape Town, make sure to not miss my talks about AppStream and bundling 🙂