planet.freedesktop.org
November 29, 2016

I pushed the patch to require resolution today, expect this to hit the general public with libinput 1.6. If your graphics tablet does not provide axis resolution we will need to add a hwdb entry. Please file a bug in systemd and CC me on it (@whot).

How do you know if your device has resolution? Run sudo evemu-describe against the device node and look for the ABS_X/ABS_Y entries:


# Event code 0 (ABS_X)
# Value 2550
# Min 0
# Max 3968
# Fuzz 0
# Flat 0
# Resolution 13
# Event code 1 (ABS_Y)
# Value 1323
# Min 0
# Max 2240
# Fuzz 0
# Flat 0
# Resolution 13
If the Resolution value is 0, you'll need a hwdb entry or your tablet will stop working in libinput 1.6. You can file the bug now and we can get it fixed; that way it'll be in place once 1.6 comes out.
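To give an idea of what such a hwdb entry looks like, here's a rough sketch that only overrides the resolution. The match string and values below are made up for illustration; the real entry gets crafted as part of the bug.

# Hypothetical 60-evdev.hwdb entry. EVDEV_ABS_<code> takes
# <min>:<max>:<resolution>:<fuzz>:<flat>; empty fields are left unchanged.
evdev:input:b0003v056Ap0302*
 EVDEV_ABS_00=::13
 EVDEV_ABS_01=::13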

November 28, 2016
I missed last week's update, but with the holiday it ended up being a short week anyway.

The multithreaded fragment shaders are now in drm-next and Mesa master.  I think this was the last big win for raw GL performance and we're now down to the level of making 1-2% improvements in our commits.  That is, unless we're supposed to be using double-buffer non-MS mode and the closed driver was just missing that feature.  With the glmark2 comparisons I've done, I'm comfortable with this state, though.  I'm now working on performance comparisons for 3DMMES Taiji, which the HW team often uses as a benchmark.  I spent a day or so trying to get it ported to the closed driver and failed, but I've got it working on the open stack and have written a couple of little performance fixes with it.

The first was just a regression fix from the multithreading patch series, but it was impressive that multithreading hid a 2.9% instruction count penalty and still showed gains.

One of the new fixes I've been working on is folding ALU operations into texture coordinate writes.  This came out of frustration with the instruction selection research I had done the last couple of weeks, where all algorithms seemed like they would still need significant peephole optimization after selection.  I finally said "well, how hard would it be to just finish the big selection problems I know are easily doable with peepholes?" and it wasn't all that bad.  The win came out to about 1% of instructions, with a similar benefit to overall 3DMMES performance (it's shocking how ALU-bound 3DMMES is).

I also started on a branch to jump to the end of the program when all 16 pixels in a thread have been discarded.  This had given me a 7.9% win on GLB2.7 on Intel, so I hoped for similar wins here.  3DMMES seemed like a good candidate for testing, too, with a lot of discards that are followed by reams of code that could be skipped, including texturing.  Initial results didn't seem to show a win, but I haven't actually done any stats on it yet.  I also haven't done the usual "draw red where we were applying the optimization" hack to verify that my code is really working, either.

While I've been working on this, Jonas Pfeil (who originally wrote the multithreading support) has been working on a couple of other projects.  He's been trying to get instructions scheduled into the delay slots of thread switches and branching, which should help reduce any regressions those two features might have caused.  More exciting, he's just posted a branch for doing nearest-filtered raster textures (the primary operation in X11 compositing) using direct memory lookups instead of our shadow-copy fallback.  Hopefully I get a chance to review, test, and merge it in the next week or two.

On the kernel side, my branches for 4.10 have been pulled.  We've got ETC1 and multithread FS for 4.10, and a performance win in power management.  I've also been helping out and reviewing Boris Brezillon's work for SDTV output in vc4.  Those patches should be hitting the list this week.
November 20, 2016

The Fedora Change to retire the synaptics driver was approved by FESCO. This will apply to Fedora 26 and is part of a cleanup to, ironically, make the synaptics driver easier to install.

Since Fedora 22, xorg-x11-drv-libinput is the preferred input driver. For historical reasons, almost all users have the xorg-x11-drv-synaptics package installed. But to actually use the synaptics driver over xorg-x11-drv-libinput requires a manually dropped xorg.conf.d snippet. And that's just not ideal. Unfortunately, in DNF/RPM we cannot just say "replace the xorg-x11-drv-synaptics package with xorg-x11-drv-libinput on update but still allow users to install xorg-x11-drv-synaptics after that".

So the path taken is a package rename. Starting with Fedora 26, xorg-x11-drv-libinput's RPM will Provide/Obsolete [1] xorg-x11-drv-synaptics and thus remove the old package on update. Users that need the synaptics driver then need to install xorg-x11-drv-synaptics-legacy. This driver will then install itself correctly without extra user intervention and will take precedence over the libinput driver. Removing xorg-x11-drv-synaptics-legacy will remove the driver assignment and thus fall back to libinput for touchpads. So aside from the name change, everything else works smoother now. Both packages are now updated in Rawhide and should be available from your local mirror soon.
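In RPM spec terms, the rename boils down to something like the fragment below in xorg-x11-drv-libinput's spec file. This is only a sketch; the actual version boundary used in Fedora's spec file may differ.

# Hypothetical spec fragment -- the real version boundary may differ.
Provides:  xorg-x11-drv-synaptics = %{version}-%{release}
Obsoletes: xorg-x11-drv-synaptics < 1.9.0-1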

What does this mean for you as a user? If you are a synaptics user, after an update/install you now need to manually install xorg-x11-drv-synaptics-legacy. You can remove any xorg.conf.d snippets assigning the synaptics driver unless they also include other custom configuration.

See the Fedora Change page for details. Note that this is a Fedora-specific change only, the upstream change for this is already in place.

[1] "Provide" in RPM-speak means the package provides functionality otherwise provided by some other package even though it may not necessarily provide the code from that package. "Obsolete" means that installing this package replaces the obsoleted package.

November 16, 2016
At this point, I haven't pushed a new release tag for xf86-video-freedreno to update to the latest xserver ABI.  I'm inclined not to.  If you are using a modern xserver you probably want to be using xf86-video-modesetting + glamor.  It has more features (dri3, xv, etc.) and better performance.  And GL support on a3xx/a4xx is pretty solid.  So distros with a modern xserver might as well drop the xf86-video-freedreno package.

The one case where xf86-video-freedreno is still useful is bringing up a new generation of adreno, since it can do dri2 with pure-sw fallbacks for all the EXA ops.  But if that is what you are doing, I guess you know how to git clone and build.

The possible alternative is to push a patch that makes xf86-video-freedreno still build, but only probe (with latest xserver ABI) if some "ForceLoad" type option is given in xorg.conf, otherwise fallback to modesetting/glamor.  I can't think of a good reason to do this at the moment.  But as always, questions/comments/suggestions welcome.

November 15, 2016

As usual, it can be turned on and off at build-time and there is some configuration available as well to control how the effect works. Here are some screen-shots:

[Screenshots: Motion Blur Off; Motion Blur On at intensity 12.5%, 25% and 50%]

Motion blur is a technique used to make movement feel smoother than it actually is and is targeted at hiding the fact that things don’t move in continuous fashion, but rather, at fixed intervals dictated by the frame rate. For example, a fast moving object in a scene can “leap” many pixels between consecutive frames even if we intend for it to have a smooth animation at a fixed speed. Quick camera movement produces the same effect on pretty much every object on the scene. Our eyes can notice these gaps in the animation, which is not great. Motion blur applies a slight blur to objects across the direction in which they move, which aims at filling the animation gaps produced by our discrete frames, tricking the brain into perceiving a smoother animation as a result.
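To make the idea a bit more concrete, here is a minimal CPU-side sketch of one common approach: averaging a few samples taken along a per-pixel screen-space velocity vector. It only illustrates the general technique and is not the GPU implementation used in the demo.

/* Minimal sketch of velocity-based motion blur (not the demo's GPU code):
 * for each pixel, average a few samples taken backwards along that pixel's
 * screen-space velocity, scaled by an intensity factor. */
struct rgb { float r, g, b; };

static struct rgb sample_clamped(const struct rgb *img, int w, int h, int x, int y)
{
    if (x < 0) x = 0; else if (x >= w) x = w - 1;
    if (y < 0) y = 0; else if (y >= h) y = h - 1;
    return img[y * w + x];
}

void motion_blur(const struct rgb *src, struct rgb *dst,
                 const float *vel_x, const float *vel_y, /* pixels per frame */
                 int w, int h, float intensity /* e.g. 0.125, 0.25, 0.5 */)
{
    const int taps = 8;

    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            float vx = vel_x[y * w + x] * intensity;
            float vy = vel_y[y * w + x] * intensity;
            struct rgb acc = { 0.0f, 0.0f, 0.0f };

            /* Walk backwards along the motion vector and average. */
            for (int i = 0; i < taps; i++) {
                float t = (float)i / (float)(taps - 1);
                struct rgb s = sample_clamped(src, w, h,
                                              x - (int)(vx * t),
                                              y - (int)(vy * t));
                acc.r += s.r; acc.g += s.g; acc.b += s.b;
            }
            dst[y * w + x].r = acc.r / taps;
            dst[y * w + x].g = acc.g / taps;
            dst[y * w + x].b = acc.b / taps;
        }
    }
}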

In my demo there are no moving objects other than the skybox and the shadows, which are relatively slow anyway. However, camera movement can make objects change screen-space position quite fast (especially when we rotate the viewpoint), and the motion-blur effect helps us perceive a smoother animation in this case.

I will try to cover the actual implementation in another post, but for now I'll keep it short and let the images above showcase what the filter actually does at different configuration settings. Notice that the smoothing effect is something that can only be perceived during motion, so still images are not the best way to showcase the result of the filter from the viewer's perspective. However, still images are a good way to freeze the animation and see exactly how the filter modifies the original image to achieve the smoothing effect.

Last Friday, both a GNOME bug day and a bank holiday, a few of us got together to squash some bugs, and discuss GNOME and GNOME technologies.

Guillaume, a newcomer in our group, tested the captive portal support for NetworkManager and GNOME in Gentoo, and added instructions on how to enable it to their Wiki. He also tested a gateway related configuration problem, the patch for which I merged after a code review. Near the end of the session, he also rebuilt WebKitGTK+ to test why Google Docs was not working for him anymore in Web. And nobody believed that he could build it that quickly. Looks like opinions based on past experiences are quite hard to change.

Mathieu worked on removing jhbuild's .desktop file as nobody seems to use it, and it was creating the Sundry category for him, in gnome-shell. He also spent time looking into the tracker blocker that is Mozilla's Focus, based on disconnectme's block lists. It's not as effective as uBlock when it comes to blocking adverts, but the memory and performance improvements, and the slow churn rate, could make it a good default blocker to have in Web.

Haïkel looked into using Emeus, potentially the new GTK+ 4.0 layout manager, to implement the series properties page for Videos.

Finally, I added Bolso to jhbuild, and struggled to get gnome-online-accounts/gnome-keyring to behave correctly in my installation, as the application just did not want to log in properly to the service. I also discussed Fedora's privacy policy (inappropriate for Fedora Workstation, as it doesn't cover the services used in the default installation), a potential design for Flatpak support of joypads and removable devices in general, as well as the future design of the Network panel.
I dropped HDMI audio last week because Jonas Pfeil showed up with a pair of branches to do multithreaded fragment shaders.

Some context for multithreaded fragment shaders: Texture lookups are really slow.  I think we eyeballed them as having a latency of around 20 QPU instructions.  We can hide latency sometimes if there's some unrelated math to be done after the texture coordinate calculation but before using the texture sample.  However, in most cases, you need the texture sample results right away for the next bit of work.

To allow programs to hide that latency, there's a cooperative hyperthreading mode that a fragment shader can opt into.  The shader stays in the bottom half of register space, and before it collects the results of a texture fetch, it issues a thread switch signal, which the hardware will use to run a second fragment shader thread until that one issues its own thread switch.  For the second thread, the top bit of the register addresses gets flipped, so the two threads' states don't overlap (except for the accumulators and the flags, which the shaders have to save and restore).

I had delayed working on this because the full solution was going to be really tricky: queue up as many lookups as we can, then thread switch, then collect all those results before switching again, all while respecting the FIFO sizes.  However, Jonas's huge contribution here was in figuring out that you don't need to be perfect, you can get huge gains by thread switching between each texture lookup and reading its results.

The upshot was a 0-20% performance improvement on glmark2 and a performance hit to only one testcase.  With this we're down to 3 subtests that we're slower on than the closed source driver.  Jonas's kernel patch is out on the list, and I rewrote the Mesa series to expose threading to the register allocator and landed all but the enabling patch (which has to wait on the kernel side getting merged).  Hopefully I'll finish merging it in a week.

In the process of writing multithreading, Jonas noticed that we were scheduling our TLB_Z writes early in the program, which can cut into fragment shader parallelism between the QPUs (of which we have 12) because a TLB_Z write locks the scoreboard for that 2x2 pixel quad.  I merged a patch to Mesa that pushes the TLB_Z down to the bottom, at the cost of a single extra QPU instruction.  Some day we should flip QPU scheduling around so that we pair that TLB_Z up better, and fill our delay slots in thread switches and branching as well.

I also tracked down a major performance issue in Raspbian's desktop using the open driver.  I asked them to use xcompmgr a while back, because readback from the front buffer using the GPU is really slow (The texture unit can't read from raster textures, so we have to reformat them to be readable), and made window dragging unbearable.  However, xcompmgr doesn't unredirect fullscreen windows, so full screen GL apps emit copies (with reformatting!) instead of pageflipping.

It looks like the best choice for Raspbian is going to be using compton, an xcompmgr fork that does pageflipping (you get tear free screen updates!) and unredirection of full screen windows (you get pageflipping directly from GL apps).  I've also opened a bug report on compton with a description of how they could improve their GL drawing for a tiled renderer like the Pi, which could improve its performance for windowed updates significantly.

Simon ran into some trouble with compton, so he hasn't flipped the default yet, but I would encourage anyone running a Raspberry Pi desktop to give it a shot -- the improvement should be massive.

Other things last week: More VCHIQ patch review, merging more to the -next branches for 4.10 (Martin Sperl's thermal driver, at last!), and a shadow texturing bugfix.
November 14, 2016

I've written more extensively about this here but here's an analogy that should get the point across a bit better: Wayland is just a protocol, just like HTTP. In both cases, you have two sides with very different roles and functionality. In the HTTP case, you have the server (e.g. Apache) and the client (a browser, e.g. Firefox). The communication protocol is HTTP but both sides make a lot of decisions unrelated to the protocol. The server decides what data is sent, the client decides how the data is presented to the user. Wayland is very similar. The server, called the "compositor", decides what data is sent (also: which of the clients even gets the data). The client renders the data [1] and decides what to do with input like key strokes, etc.

Asking "Does $FEATURE work under Wayland?" is akin to asking "Does $FEATURE work under HTTP?". The only answer is: it depends on the compositor and on the client. It's the wrong question. You should ask questions related to the compositor and the client instead, e.g. "does $FEATURE work in GNOME?" or "does $FEATURE work in GTK applications?". That's a question that can be answered.

Of course, there are some cases where the fault is really the protocol itself. But often enough, it's not.

[1] though it does so by telling the compositor to display it. The analogy with HTTP only works to some extent... :)

November 10, 2016

So, in Fedora Workstation 24 we added H264 support through OpenH264. In Fedora Workstation 25 I am happy to tell you all that we are taking another step in improving our codec support by adding support for mp3 playback. I know this has been a big wishlist item for a long time for a lot of people so I am really happy that we are finally in a position to fulfil that wish. You should be able to download the mp3 plugin on day 1 through GNOME Software or through the missing codec installer in various GStreamer applications. For Fedora Workstation 26 I would not be surprised if we decide to ship it on the install media.

For the technically inclined out there, our initial enablement is through the mpg123 library and corresponding GStreamer plugin. The main reason we chose this library over all the others available out there was a combination of it using the same license as GStreamer (LGPL v2) and being a well-established library already used by a lot of different applications. There might be other mp3 decoders added in the future depending on interest in and effort by the community. So get ready to install Fedora Workstation 25 when it's released soon and play some tunes :)

P.S. To be 110% clear we will not be adding encoding support at this time.

November 08, 2016
Almost all of Collabora's customers use the Linux kernel on their products. Often they will use the exact code as delivered by the SBC vendors and we'll work with them in other parts of their software stack. But it's becoming increasingly common for our customers to adapt the kernel sources to the specific needs of their particular products.

A very big problem most of them have is that the kernel version they are based on isn't getting security updates any more because it's already several years old. And the reason why companies are shipping such old kernels is that they have been so heavily modified compared to the upstream versions that rebasing their trees on top of newer mainline releases is so expensive that it is very hard to budget and plan for.

To avoid that, we always recommend our customers to stay close to their upstreams, which implies rebasing often on top of new releases (typically LTS releases, with long term support). For the budgeting of that work to become possible, the size of the delta between mainline and downstream sources needs to be manageable, which is why we recommend contributing back any changes that aren't strictly specific to their products.

But even for those few companies that already have processes in place for upstreaming their changes and are rebasing regularly on top of new LTS releases, keeping up with mainline can be a substantial disruption of their production schedules. This is partly because the new mainline release will contain new bugs, and partly because new bugs are introduced in the downstream changes as they are reapplied on top of the new version.

Those companies that are already keeping close to their upstreams typically have advanced QA infrastructure that will detect those bugs long before production, but a long stabilization phase after every rebase can significantly slow product development.

To improve this situation and encourage more companies to keep their efforts close to upstream, we at Collabora have been working for a few years already on continuous integration of FOSS components across a diverse array of hardware. The initial work was sponsored by Bosch for one of their automotive projects, and since the start of 2016 Google has been sponsoring work on continuous integration of the mainline kernel.

One of the major efforts to continuously integrate the mainline Linux kernel codebase is kernelci.org, which builds several configurations of different trees and submits boot jobs to several labs around the world, collating the results. This is being of great help already in detecting at a very early stage any changes that either break the builds, or prevent a specific piece of hardware from completing the boot stage.

Though kernelci.org can easily detect when an update to a source code repository has introduced a bug, such updates can have several dozen new commits, and without knowing which specific commit introduced the bug, we cannot identify culprits to notify of the problem. This means that either someone needs to monitor the dashboard for problems, or email notifications are sent to the owners of the repositories who then have to manually look for suspicious commits before getting in contact with their author.

To address this limitation, Google has asked us to look into improving the existing code for automatic bisection so it can be used right away when a regression is detected, so the possible culprits are notified right away without any manual intervention.

Another area in which kernelci.org is currently lacking is the coverage of the testing. Build and boot regressions are very annoying for developers because they negatively impact everybody who works on the affected configurations and hardware, but regressions in peripheral support or other subsystems that aren't critically involved during boot can still make rebases much costlier.

At Collabora we have had a strong interest in having the DRM subsystem under continuous integration and some time ago started an R&D project for making the test suite in IGT generically useful for all the DRM drivers. IGT started out being i915-specific, but as most of the tests exercise the generic DRM ABI, they could just as well test other drivers with a moderate amount of effort. Early in 2016 Google started sponsoring this work and as of today submitters of new drivers are using it to validate their code.

Another related effort has been the addition to DRM of a generic ABI for retrieving CRCs of frames from different components in the graphics pipeline, so two frames can be compared when we know that they should match. And another one is adding support to IGT for the Chamelium board, which can simulate several display connections and hotplug events.
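To give a rough idea of how a test consumes that CRC ABI, the sketch below reads per-frame CRCs through debugfs. The paths and the "auto" source name reflect my recollection of the interface, so treat them as assumptions and check the kernel's DRM CRC documentation for the authoritative details.

#include <stdio.h>

/* Sketch of reading per-frame CRCs through DRM's debugfs CRC interface.
 * Paths and the "auto" source are assumptions; see the kernel docs. */
int main(void)
{
    const char *ctl_path  = "/sys/kernel/debug/dri/0/crtc-0/crc/control";
    const char *data_path = "/sys/kernel/debug/dri/0/crtc-0/crc/data";
    char line[256];

    FILE *ctl = fopen(ctl_path, "w");
    if (!ctl) { perror(ctl_path); return 1; }
    fputs("auto\n", ctl);   /* pick the default CRC source for this CRTC */
    fclose(ctl);

    FILE *data = fopen(data_path, "r");
    if (!data) { perror(data_path); return 1; }

    /* Each line carries a frame reference followed by the CRC value(s);
     * two frames that should be identical must report matching CRCs. */
    for (int i = 0; i < 10 && fgets(line, sizeof(line), data); i++)
        printf("%s", line);

    fclose(data);
    return 0;
}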

A side-effect of having continuous integration of changes in mainline is that when downstreams are sending back changes to reduce their delta, the risk of introducing regressions is much smaller and their contributions can be accepted faster and with less effort.

We believe that improved QA of FOSS components will expand the base of companies that can benefit from involvement in development upstream and are very excited by the changes that this will bring to the industry. If you are an engineer who cares about QA and FOSS, and would like to work with us on projects such as kernelci.org, LAVA, IGT and Chamelium, get in touch!

November 07, 2016
I spent a little bit of time last week on actual 3D work.  I finally figured out what was going wrong with glmark2's terrain demo.  The RCP (reciprocal) instruction on vc4 is a very rough approximation, and in translating GLSL code doing floating point divides (which we have to convert from a / b to a * (1 / b)) we use a Newton-Raphson step to improve our approximation.  However, we forgot to do this for the implicit divide we have to do at the end of your vertex shader, so the triangles in the distance were unstable and bounced around based on the error in the RCP output.
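For reference, a single Newton-Raphson refinement of a reciprocal estimate looks like the generic sketch below; this is just the math, not the actual vc4 compiler code.

/* One Newton-Raphson step for a reciprocal approximation: given
 * x0 ~= 1/a, the error roughly squares with each iteration of
 *   x1 = x0 * (2 - a * x0)
 * so a single step after the hardware RCP is usually enough. */
static float refined_rcp(float a, float hw_rcp_approx)
{
    float x0 = hw_rcp_approx;   /* rough result of the RCP instruction */
    return x0 * (2.0f - a * x0);
}

/* a / b is then lowered to a * refined_rcp(b, <hardware rcp of b>). */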

I also finally got around to writing kernel-side validation for ETC1 texture compression.  I should be able to merge support for this to v4.10, which is one of the feature checkboxes we were missing compared to the old driver.  Unfortunately ETC1 doesn't end up being very useful for us in open source stacks, because S3TC has been around a lot longer and supported on more hardware, so there's not much content using ETC that I know of outside of Android land.

I also spent a while fixing up various regressions that had happened in Mesa while I'd been playing in the kernel.  Some day I should work on CI infrastructure on real hardware, but unfortunately Mesa doesn't have any centralized CI infrastructure to plug into.  Intel has a test farm, but not all proposed patches get submitted to it, and it (understandably) doesn't have non-Intel hardware.

In kernel land, I sent out a patch to reduce V3D's overhead due to power management.  We were immediately shutting off the 3D engine when it went idle, even if userland might be expected to submit a new frame shortly thereafter.  This was taking up 1% of the CPU on profiles I was looking at last week, and I've seen it up to 3.5%.  Now we keep the GPU on for at least 40ms ("about two frames plus some slack").  If I'm reading right, the closed driver does immediate powerdown as well, so we may be using slightly more power now, but given that power state transitions themselves cost power, and the CPU overhead costs power, hopefully this works out fine anyway (even though we've never done power measurements for the platform).
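For the curious, a delayed power-down like this is typically expressed with the kernel's runtime-PM autosuspend helpers. The sketch below is illustrative only and not the exact vc4 patch.

#include <linux/pm_runtime.h>

/* Illustrative sketch, not the actual vc4 change: runtime-PM autosuspend
 * keeps the hardware powered for a grace period after the last user
 * drops its reference, instead of powering off immediately. */
static void v3d_power_init(struct device *dev)
{
    pm_runtime_set_autosuspend_delay(dev, 40);  /* ~two frames, in ms */
    pm_runtime_use_autosuspend(dev);
    pm_runtime_enable(dev);
}

static void v3d_job_done(struct device *dev)
{
    /* Arm the autosuspend timer rather than shutting off right away. */
    pm_runtime_mark_last_busy(dev);
    pm_runtime_put_autosuspend(dev);
}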

I also reviewed more patches from Michael Zoran for vchiq.  It should now (once all reviewed patches land) work on arm64.  We still need to get the vchiq-using drivers like camera and vcsm into staging.

Finally, I made more progress on HDMI audio.  One of the ALSA developers, Lars-Peter Clausen, pointed out that ALSA's design is that userspace would provide the IEC958 subframes by using alsalib's iec958 plugin and a bit of asoundrc configuration.  He also provided a ton of help debugging that pipeline.  I rebased and cleaned up into a branch that uses simple-audio-card as my machine driver, so the whole stack is pretty trivial.  Now aplay runs successfully, though no sound has come out.  Next step is to go test that the closed stack actually plays audio on this monitor and capture some register state for debugging.
November 02, 2016

Interested in hacking on some low-level stuff and implementing a feature that's useful to a lot of laptop owners out there? We have a feature on libinput's todo list but I'm just constantly losing my fight against the ever-growing todo list. So if you already know C and you're interested in playing around with some low-level bits of software this may be the project for you.

Specifically: within libinput, we want to disable certain devices based on a lid state. In the first instance this means that when the lid switch is toggled to closed, the touchpad and trackpoint get silently disabled to not send events anymore. [1] Since it's based on a switch state, this also means that we'll now have to listen to switch events and expose those devices to libinput users.
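To give a feel for the raw events involved, here is a standalone libevdev sketch that watches a lid switch. This is not libinput's internal code, and the device node is just an example.

#include <fcntl.h>
#include <stdio.h>
#include <libevdev/libevdev.h>

/* Standalone sketch (not libinput code): print lid switch state changes.
 * The device node below is an example; pick the right one for your laptop. */
int main(void)
{
    int fd = open("/dev/input/event3", O_RDONLY | O_NONBLOCK);
    struct libevdev *dev = NULL;

    if (fd < 0 || libevdev_new_from_fd(fd, &dev) < 0) {
        perror("open/init");
        return 1;
    }
    if (!libevdev_has_event_code(dev, EV_SW, SW_LID)) {
        fprintf(stderr, "device has no lid switch\n");
        return 1;
    }

    while (1) {
        struct input_event ev;
        int rc = libevdev_next_event(dev, LIBEVDEV_READ_FLAG_NORMAL, &ev);

        if (rc == LIBEVDEV_READ_STATUS_SUCCESS &&
            ev.type == EV_SW && ev.code == SW_LID)
            printf("lid %s\n", ev.value ? "closed" : "open");
        /* A real implementation would poll() instead of spinning and
         * also handle LIBEVDEV_READ_STATUS_SYNC. */
    }
}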

The things required to get all this working are:

  • Designing a switch interface plus the boilerplate code required (I've done most of this bit already)
  • Extending the current evdev backend to handle devices with EV_SW and exposing their events
  • Hooking up the switch devices to internal touchpads/trackpoints to disable them ad-hoc
  • Handle those devices where the lid switch is broken in the hardware (more details on this when we get to that point)

You get to dabble with libinput and a bit of udev and the kernel. Possibly Xorg stuff, but that's unlikely at this point. This project is well suited for someone with a few spare weekends ahead. It's great for someone who hasn't worked with libinput before, but it's not a project to learn C, you better know that ahead of time. I'd provide the mentoring of course (I'm in UTC+10, so expect IRC/email). If you're interested let me know. Riches and fame may happen but are not guaranteed.

[1] A number of laptops have a hw issue where either device may send random events when the lid is closed

November 01, 2016

I finally have a bit of time to look at touchpad pointer acceleration in libinput. But when I did, I found a great total of 5 bugs across freedesktop.org and Red Hat's bugzilla, despite this being the first thing anyone brings up whenever libinput is mentioned. 5 bugs - that's not much to work on. Note that over time there were also a lot of bugs where pointer acceleration was fixed once the touchpad's axis ranges were corrected, which is usually a two-liner for the udev hwdb.

Anyway, point of this post: if you're still having issues with pointer acceleration on your touchpad in libinput, please file a bug against libinput and make it block the new tracker bug 98535. The libinput documentation has instructions on how to report a touchpad bug, but amongst the various things I need is your laptop model name.

Don't complain about it on reddit, phoronix, HN, or in some random forum, because you're just wasting bytes there and it won't get fixed that way.

When we started the Fedora Workstation effort, one thing we wanted to do was to drain the proverbial swamp and make sure that running Linux on a laptop is a first-rate experience. As you can see from my last blog entry, we have been working on building a team dedicated to that task. There are many facets to this effort, but one that we kept getting asked about was sorting out hybrid graphics. Be aware that some of this has been covered in previous blog entries, but I want to be thorough here, so this post will cover the big investments of time and effort we are putting into Wayland and the X Window System, GNOME Shell and Nouveau (the open source driver for NVidia GPU hardware).

A big part of this story is also that we are trying to make it easier to run the NVidia binary driver on Fedora in general, which is actually a fairly large undertaking as there are a lot of challenges in doing that without requiring any manual configuration from the user, while making sure it works fully in the hybrid graphics use case. This is one of the most requested areas of improvement for Fedora. Many users are trying to run Fedora on their existing laptops or other hardware. Rather than allow this gap to be a reason for people not to run Fedora, we want to provide a better experience. Of course users will have the freedom to make their own choice about installing these drivers, using information provided by Software.

Hybrid graphics is the term used for when you have a laptop with two GPUs, usually one Intel and one NVidia GPU, but there are also some laptops available which come with an Intel/AMD CPU plus an AMD GPU. The purpose of hybrid graphics is to have your integrated (Intel) GPU be your ‘day to day GPU’, running your desktop without drawing too much power, but then if you want to, for instance, play a game on your system you can activate your secondary GPU, which has a lot more power but also draws more electricity.

What complicated this even more was the fact that most users who wanted to use this setup wanted to use it in combination with the binary NVidia driver, in order to get top performance from their second GPU, which is not surprising since that is the whole point of switching to it. The main challenge with this was that the Mesa and binary NVidia drivers both provided an OpenGL implementation that, due to the way things work in the X Window System, ended up overwriting each other, basically breaking the overwritten driver. The solution for this was a system called Bumblebee which employed some clever hacks to work around this issue, but Bumblebee is a solution to a problem we shouldn’t have to begin with.

Of course dealing with the OpenGL stack wasn’t the only challenge here. For instance, different display outputs ended up being connected to different GPUs, with one of the most common setups being that the internal screen is connected to your Intel GPU and your external HDMI and DisplayPort connections are connected to your NVidia or AMD GPU. On some systems there were hardware connectors allowing you to display from either GPU to any screen, but a setup we see becoming more common is that the drivers for both cards need to be initialized in all cases, allowing the GPU connected to the display to slave itself to the other GPU, so rendering can happen on one GPU and displaying on the other. Within Mesa this is fairly straightforward to handle, so if both the Intel and NVidia GPUs are rendering using Mesa things work out fairly easily. What makes this a lot more challenging is the NVidia binary driver: once you install the binary driver we need to find a way to bridge and interoperate between these two stacks.

And of course that is just the highlights; like anything this complex there is a long laundry list of items between the point where you can check the box for having a feature and it working really well and seamlessly.

What to expect in Fedora Workstation 25
Let's start with a word of caution: we are working on a tight deadline to get all the pieces lined up, so I cannot be 100% sure what will make it for the day of release. What we are confident about, though, is that we will have all the low-level plumbing sorted so that we can improve this over the course of the Fedora 25 lifecycle. As always I hope the community will join us in testing and developing this, to ensure we have even the corner cases worked out for Fedora Workstation 26.

The initial step we took, and this was quite some time ago, was that Dave Airlie worked on making sure we could handle two GPUs within the Mesa stack. As a bit of wordplay on the NVidia solution being called Optimus, the open source Mesa solution was named Prime. This work basically allowed you to use Mesa for your Intel card and Mesa (Nouveau) for your NVidia card, and launch applications on the NVidia card while the screen was connected to the Intel card. You can choose which one is used by setting the DRI_PRIME=1 environment variable.

The second step was Adam Jackson collaborating with NVidia on something called libglvnd. Libglvnd stands for GL Vendor Neutral Dispatch, and it is basically a dispatch layer that allows your OpenGL calls to be routed to more than one OpenGL implementation. NVidia created this specification and has been steadily working on supporting it in their binary driver, and Adam Jackson stepped up to help review their patches and work on supporting glvnd in the Mesa drivers. This means that in Fedora Workstation 25 you should, for the first time ever, be able to have both the binary NVidia driver and Mesa installed without any file conflicts. So the X server now has the infrastructure to route your OpenGL calls to the correct stack when that DRI_PRIME variable is set. That said, there is a major catch: due to the way things currently work, once the NVidia binary driver is installed it expects to be rendering the screen all the time. So in the short term we need to figure out a way to allow it to do that, and in the long run we need to work with NVidia to figure out how the Intel open source driver can collaborate with the NVidia driver, so that we can use only the Intel driver at times (to save power, for instance). We are still pushing hard to have the short-term fix in place, but as I write this we haven't fully nailed it down yet.

The third step is work that Ben Skeggs has been doing on dealing with the monitor handling here, which includes adding MST support to Nouveau, because a lot of external ports and docking stations have not been working with these hybrid graphics setups due to the external screens all being routed through the NVidia chip. These patches just got accepted upstream and we will be including them in Fedora Workstation 25.

The fourth step has been work that Hans de Goede has been doing around Prime, fixing the modesetting driver and fixing cursor handling with hybrid graphics. In some sense the work Hans has been doing (check his linked blog entry) is part of that laundry list I talked about, the many smaller items that need to work smoothly for this to go from ‘checkbox works’ to ‘actually works’. He is also working with the DNF team to allow us to control kernel updates if you use the binary NVidia driver, meaning that we will only bump your kernel when we have the matching binary NVidia driver module ready too. If for any reason the binary NVidia driver doesn’t work, we want a graceful fallback to Nouveau.

The fifth step has been Jonas Ådahl's work on enabling the binary NVidia driver for Wayland. He has put together a set of patches to support NVidia's EGLStreams interface, which means that starting from Fedora Workstation 25 you will be able to use Wayland also with NVidia's binary driver.
You can see his work in progress patches here. Jonas will also be looking at implementing hybrid graphics support in Wayland, so we ensure that also for this usecase Wayland is on par with X for Fedora Workstation 26.

The sixth step has been work done by Bastien Nocera to ensure we expose this functionality in the user interface. The idea is that you should be able to configure on a per application basis which GPU they are being launched on. It also means that you can now get all your GPU information in the GNOME about screen also when you have dual-GPU systems. More details in his blog post here.

Choosing discrete GPU in GNOME Shell

The seventh step is the work that Simone Caronni from Negativo17 has been doing with us on packaging the binary NVidia driver in a way that makes all this work. You can find his repo here on Negativo17, and Simone has also been working with Kalev Lember and Richard Hughes to ensure the driver shows up correctly in GNOME Software once you have the repository enabled.

NVidia driver in GNOME Software

The plan is to offer the driver in Fedora Workstation 25 as third party software, but we haven’t yet made the formal proposal due to wanting to be sure we ironed out all important bugs first, both on our side and on NVidia side.

So as you can see from all of this, there are a lot of components involved, and since we are trying to support both X and Wayland the job hasn’t exactly been easy. There are for sure some things that will not be ready in time for Fedora Workstation 25; for instance, if you want to use the binary NVidia driver we will not be able to make that work with XWayland. So native Wayland applications will be fine, including games using SDL on top of Wayland, but due to how the stack is architected we haven’t gotten around to bridging OpenGL from XWayland down to the binary NVidia driver. With Nouveau it should work, however. This corner case we will have to figure out for Fedora Workstation 26, but for now we have decided that if we detect an Optimus system we will default you to X instead of Wayland.

Get involved
As always we love for more people to join us in this effort and one good way to get started with that is to join our Hybrid graphics test day on Thursday the 3rd of November!

October 31, 2016
Last week was spent almost entirely on HDMI audio support.

I had started the VC4-side HDMI setup last week, and got the rest of the stubbed bits filled out this week.

Next was starting to get the audio side connected.  I had originally derived my VC4 side from the Mediatek driver, which makes a child device for the generic "hdmi-codec" DAI codec to attach to.  HDMI-codec is a shared implementation for HDMI encoders that take I2S input, which basically just passes audio setup params into the DRM HDMI encoder, whose job is to actually make I2S signals on wires get routed into HDMI audio packets.  They also have a CPU DAI registered through device tree that moves PCM data from memory onto wires, and a machine driver registered through devicetree that connects the CPU DAI's wires to the HDMI codec DAI's wires.  This is apparently how SoC audio is usually structured.

VC4, on the other hand, doesn't have a separate hardware block for moving PCM data from memory onto wires; it just has the MAI_DATA register in the HDMI block that you put S/PDIF (IEC958) data into using the SoC's generic DMA engine.  So I made VC4's HDMI driver register another child device with the MAI_DATA register's address passed in, and that child device gets bound by the CPU DAI driver and uses the ASoC generic DMA engine support.  That CPU DAI driver also ends up setting up the machine driver, since I didn't have any other reference to the CPU DAI anywhere.  The glue here is all based on static strings I can't rely on, and is a complete hack for now.

Last, notice how I've used "PCM" talking about a working implementation's audio data and "S/PDIF" data for the thing I have to take in?  That's the sticking point.  The MAI_DATA register wants things that look like the AES3 frames description you can find on wikipedia.  I think the HW is generating my preamble.  In the firmware's HDMI audio implementation, they go through the unpacked PCM data and set the channel status bit in each sample according to the S/PDIF channel status word (the first 2 bytes of which are also documented on wikipedia).   At the same time that they're doing that, they also set the parity bit.

My first thought was "Hey, look, ALSA has SNDRV_PCM_FORMAT_IEC958_SUBFRAME, maybe my codec can say it only takes these S/PDIF frames and everything will work out."  No.  It turns out aplay won't generate them and neither will gstreamer.  To make a usable HDMI audio driver, I'm going to need to generate my IEC958 myself.

Linux is happy to generate my channel status word in memory.  I think the hardware is happy to generate my parity bit.  That just leaves making the CPU DAI write the bits of the channel status word into the subframes as PCM samples come in to the driver, and making sure that the MSB of the PCM audio data is bit 27.  I haven't figured out how to do that quite yet, which is the project for this week.
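To make that concrete, here is how I understand the subframe packing would look for 16-bit PCM, with one channel status bit filled in per frame and the preamble/parity left to the hardware as described above. Treat the exact bit positions as my reading of the format rather than verified driver code.

#include <stdint.h>

/* Sketch of packing a 16-bit PCM sample into an IEC958/AES3 subframe:
 * audio MSB at bit 27, validity at bit 28, channel status at bit 30.
 * Preamble (bits 0-3) and parity (bit 31) are left to the hardware here. */
static uint32_t pack_iec958_subframe(int16_t pcm,
                                     const uint8_t *channel_status, /* 24 bytes */
                                     unsigned int frame) /* 0..191 in the block */
{
    uint32_t sub = 0;

    /* 16-bit sample left-justified so its MSB lands at bit 27. */
    sub |= (uint32_t)(uint16_t)pcm << 12;

    /* Bit 28 (validity) stays 0: the sample is valid linear PCM. */

    /* Bit 30: one bit of the 192-bit channel status word per frame. */
    if (channel_status[frame / 8] & (1 << (frame % 8)))
        sub |= 1u << 30;

    return sub;
}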

Work-in-progress, completely non-functional code is here.

For the last time in 2016, I flew out to the OpenStack Summit in Barcelona, where I had the chance to meet (again) a lot of my fellow OpenStack contributors.

How To Work Upstream with OpenStack

My week started by giving a talk about How To Work Upstream with OpenStack where, accompanied by Ryota and Ashiq, I explained to the audience how to contribute upstream to OpenStack. It went well and was well received by the public – you can watch the video below or download the slides.

Python 3 in telemetry projects

I've attended a few interesting cross-project sessions, which helped me prioritize my work for the next few months.

The Python 3 porting effort has been blocked for a while in Nova and Swift for various (mostly non-technical) reasons, while almost all other projects are working correctly. On the other hand, we have committed the telemetry projects to be the first ones to drop Python 2 support as soon as possible. The next steps are to make sure downstream is ready and to enable functional testing in devstack with Python 3.

Ceilometer deprecation

Gordon Chung talking about Gnocchi 3

The Ceilometer sessions were really interesting, as we mainly discussed deprecating and removing old cruft that is not or should not be used anymore. The main change will be the deprecation of the Ceilometer API. It has been clear for more than a year that Gnocchi is the way to go to store and provide access to metrics, but we failed at announcing it widely. A lot of the people I talked to during the summit were not aware that the Ceilometer API was not a good pick, and that Gnocchi is now the recommended storage backend. Bad communication from our side – but we are going to fix it as of now.

We also committed to simplifying the current architecture by removing the collector, which has now been made obsolete by the agent-based architecture that was implemented during the last development cycles.

Aodh alarm timeout

We have had a feature proposal in Aodh for a while that we postponed for too long already: having an alarm triggered after a timeout when certain events have not been seen. This seems to be functionality requested by NFV users – something we want Aodh to cover. We spent some time discussing this feature, and now that we all have a clear understanding of the use case, we'll work on having a clear path to the implementation.

I've also attended a session with the Vitrage developers in order to discuss how we could work better together, as they rely on Aodh. It seems there might be some convergence in the future, which would be very welcome. Wait'n see.

Gnocchi improvement, past and future

The Gnocchi session ran smoothly, and everyone seemed happy with the work we have done so far. We've made some impressive improvement in Gnocchi 3.0 – as I already covered previously – and Gordon Chung presented a short talk about the performance difference metered while working on this new version of Gnocchi:

The return of the InfluxDB driver is on the table, as Sam Morrison proposed a patch for that a while back. While it's not as fast and scalable as the other drivers, it offers a good alternative for people who have to use it.

Leandro Reox presented how to do capacity planning using Ceilometer and Gnocchi, presenting the projects at the same time:

It is pretty impressive to see what they achieved with this project, and I'm looking forward to being able to check how it works inside.

PTG and beyond

The next meeting is supposed to be the new OpenStack PTG in February in Atlanta, though we did not request any specific space there. While the team loves seeing each other face-to-face every few months, we have managed to follow all of the guidelines I listed recently on good open source project management, meaning we are able to work very well asynchronously and remotely. There is no need to put hard requirements on people wanting to participate in our community. Nevertheless, I expect the cross-project discussions that will happen there to still concern the OpenStack Telemetry projects.

In the end, we're all very happy with our past and future roadmaps and I'm looking forward to achieving our next big milestones with our amazing telemetry team!

You might remember my attempts at getting easy-to-use cross-compilation for ARM applications on my x86-64 desktop machine.

With Fedora 25 approaching, I'm happy to say that the necessary changes to integrate the feature have now rolled into Fedora 25.

For example, to compile the GNU Hello Flatpak for ARM, you would run:

$ flatpak install gnome org.freedesktop.Platform/arm org.freedesktop.Sdk/arm
Installing: org.freedesktop.Platform/arm/1.4 from gnome
[...]
$ sudo dnf install -y qemu-user-static
[...]
$ TARGET=arm ./build.sh

For other applications, add the --arch=arm argument to the flatpak-builder command-line.

This example also works for 64-bit ARM with the architecture name aarch64.
October 28, 2016
Recently I've been working on improving hybrid graphics support for the upcoming Fedora 25 release. Although Fedora 25 Workstation will use Wayland by default for its GNOME 3 desktop, my work has been on hybrid gfx support under X11 (Xorg), as GNOME 3 on Wayland does not yet support hybrid gfx.

So no Wayland yet, but still there are a lot of noticeable hybrid gfx improvements, and users of laptops with hybrid gfx using the open-source drivers should have a much smoother user experience than before. Here is an (incomplete) list of generic improvements:

  • Fix the discrete GPU not suspending after using an external monitor, halving laptop battery life

  • xrandr --listproviders no longer shows 3 devices instead of 2 on laptops with 2 GPUs

  • Hardware cursor support when both GPUs have active video outputs; previously X would fall back to software cursor rendering in this case, which would typically lead to a flickering or entirely invisible cursor at the top of the screen

Besides this, a lot of work has been done on fixing hybrid gfx issues in the modesetting driver. This is important since in Fedora we use the modesetting driver on Skylake and newer Intel integrated gfx, as well as on Maxwell and newer Nvidia discrete GPUs. The following issues have been fixed in the modesetting driver (and thus on laptops with a Skylake CPU and/or a Maxwell or newer GPU):

  • Hide HW cursor on init, this fixes 2 cursors showing on some setups

  • Make the modesetting driver support DRI_PRIME=1 render offloading when the secondary GPU is using the modesetting driver

  • Fix misrendering (tiled vs linear) when using DRI_PRIME=1 render offloading and the primary GPU is using the modesetting driver

  • Fix GL apps running at 1 fps when shown on a video-output of the secondary GPU and the primary GPU is using the modesetting driver

  • Fix secondary GPU video output partial screen updates (part of the screen showing a previous frame) when the discrete GPU is the secondary GPU and the primary GPU is using the modesetting driver

  • Fix secondary GPU video output updates lagging (or sometimes the last frame simply not being shown at all because no further rendering is happening) when the discrete GPU is the secondary GPU and the primary GPU is using the modesetting driver

Note: this coming Thursday (November 3rd) we're having a Fedora Better Switchable Graphics Test Day; if you have a laptop with hybrid gfx please join us to help further improve hybrid gfx support.
I have recently been occupied with a new project (and been down with a cold all this week), so I have not been very present in the Wayland community. Now I can finally say what Emilio and I have been up to: Waltham! For more information, please see our announcement.
October 26, 2016
Thanks to the work of Hans de Goede and many others, dual-GPU (aka NVidia Optimus or AMD Hybrid Graphics) support works better than ever in Fedora 25.

On my side, I picked up some work I originally did for Fedora 24, but ended up being blocked by hardware support. This brings better integration into GNOME.

The Details Settings panel now shows which video cards you have in your (most likely) laptop.

dual-GPU Graphics

The second feature is what Blender and 3D video games users have been waiting for: a contextual menu item to launch the application on the more powerful GPU in your machine.

Mooo Powaa!

This demonstration uses a slightly modified GtkGLArea example, which shows which of the GPUs is used to render the application in the title bar.

on the integrated GPU

on the discrete GPU

Behind the curtain

Behind those 2 features, we have a simple D-Bus service, which runs automatically on boot, and stays running to offer a single property (HasDualGpu) that system components can use to detect what UI to present. This requires the "switcheroo" driver to work on the machine in question.
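For illustration, reading that property from C with GDBus could look like the sketch below. The well-known name, object path and interface are my assumptions about the service, so treat them as placeholders.

#include <gio/gio.h>
#include <stdio.h>

/* Sketch: read the HasDualGpu property over D-Bus.  The bus name, object
 * path and interface below are assumed placeholders for the service. */
int main(void)
{
    GError *error = NULL;
    GDBusProxy *proxy = g_dbus_proxy_new_for_bus_sync(
        G_BUS_TYPE_SYSTEM, G_DBUS_PROXY_FLAGS_NONE, NULL,
        "net.hadess.SwitcherooControl",     /* assumed well-known name */
        "/net/hadess/SwitcherooControl",    /* assumed object path */
        "net.hadess.SwitcherooControl",     /* assumed interface */
        NULL, &error);

    if (!proxy) {
        g_printerr("proxy failed: %s\n", error->message);
        return 1;
    }

    GVariant *v = g_dbus_proxy_get_cached_property(proxy, "HasDualGpu");
    if (v) {
        printf("HasDualGpu: %s\n", g_variant_get_boolean(v) ? "yes" : "no");
        g_variant_unref(v);
    }
    g_object_unref(proxy);
    return 0;
}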

Because of the way applications are launched on the discrete GPU, we cannot currently support D-Bus activated applications, but GPU-heavy D-Bus-integrated applications are few and far between right now.

Future plans

There's plenty more to do in this area, to polish the integration. We might want applications to tell us whether they'd prefer being run on the integrated or discrete GPU, as live switching between renderers is still something that's out of the question on Linux.

Wayland dual-GPU support, as well as support for the proprietary NVidia drivers are also things that will be worked on, probably by my colleagues though, as the graphics stack really isn't my field.

And if the hardware becomes more widely available, we'll most certainly want to support hardware with hotpluggable graphics support (whether gaming laptop "power-ups" or workstation docks).

Availability

All the patches necessary to make this work are now available in GNOME git (targeted at GNOME 3.24), and backports are integrated in Fedora 25, due to be released shortly.
October 24, 2016

There are some program transformations that are obviously correct when compiling regular single-threaded or even multi-threaded code, but that cannot be used for shader code. For example:

v = texture(u_sampler, texcoord);
if (cond) {
    gl_FragColor = v;
} else {
    gl_FragColor = vec4(0.);
}

... cannot be transformed to ...

if (cond) {
    // The implicitly computed derivative of texcoord
    // may be wrong here if neighbouring pixels don't
    // take the same code path.
    gl_FragColor = texture(u_sampler, texcoord);
} else {
    gl_FragColor = vec4(0.);
}

... but the reverse transformation is allowed.

Another example is:

if (cond) {
    v = texelFetch(u_sampler[1], texcoord, 0);
} else {
    v = texelFetch(u_sampler[2], texcoord, 0);
}

... cannot be transformed to ...

v = texelFetch(u_sampler[cond ? 1 : 2], texcoord, 0);
// Incorrect, unless cond happens to be dynamically uniform.

... but the reverse transformation is allowed.

Using GL_ARB_shader_ballot, yet another example is:

bool cond = ...;
uint64_t v = ballotARB(cond);
if (other_cond) {
    use(v);
}

... cannot be transformed to ...

bool cond = ...;
if (other_cond) {
    use(ballotARB(cond));
    // Here, ballotARB returns 1-bits only for threads/work items
    // that take the if-branch.
}

... and the reverse transformation is also forbidden.

These restrictions are all related to the GPU-specific SPMD/SIMT execution model, and they need to be taught to the compiler. Unfortunately, we partially fail at that today.

Here are some types of restrictions to think about (each of these restrictions should apply on top of any other restrictions that are expressible in the usual, non-SIMT-specific ways, of course):

  1. An instruction can be moved from location A to location B only if B dominates or post-dominates A.

    This restriction applies e.g. to instructions that take derivatives (like in the first example) or that explicitly take values from neighbouring threads (like in the third example). It also applies to barrier instructions.

    This is LLVM's convergent function attribute as I understand it.

  2. An instruction can be moved from location A to location B only if A dominates or post-dominates B.

    This restriction applies to the ballot instruction above, but it is not required for derivative computations or barrier instructions.

    This is in a sense dual to LLVM's convergent attribute, so it's co-convergence? Divergence? Not sure what to call this.

  3. Something vague about not introducing additional non-uniformity in the arguments of instructions / intrinsic calls.

    This last one applies to the sampler parameter of texture intrinsics (for the second example), to the ballot instruction, and also to the texture coordinates on sampling instructions that implicitly compute derivatives.

For the last type of restriction, consider the following example:

uint idx = ...;
if (idx == 1u) {
    v = texture(u_sampler[idx], texcoord);
} else if (idx == 2u) {
    v = texture(u_sampler[idx], texcoord);
}

... cannot be transformed to ...

uint idx = ...;
if (idx == 1u || idx == 2u) {
    v = texture(u_sampler[idx], texcoord);
}

In general, whenever an operation has this mysterious restriction on its arguments, then the second restriction above must apply: we can move it from A to B only if A dominates or post-dominates B, because only then can we be certain that the move introduces no non-uniformity. (At least, this rule applies to transformations that are not SIMT-aware. A SIMT-aware transformation might be able to prove that idx is dynamically uniform even without the predication on idx == 1u or idx == 2u.)

However, the control flow rule is not enough:

 v1 = texture(u_sampler[0], texcoord);
v2 = texture(u_sampler[1], texcoord);
v = cond ? v1 : v2;

... cannot be transformed to ...

v = texture(u_sampler[cond ? 0 : 1], texcoord);

The transformation does not break any of the CFG-related rules, and it would clearly be correct for a single-threaded program (given the knowledge that texture(...) is an operation without side effects). So the CFG-based restrictions really aren't sufficient to model the real set of restrictions that apply to the texture instruction. And it gets worse:

 v1 = texelFetch(u_sampler, texcoord[0], 0);
v2 = texelFetch(u_sampler, texcoord[1], 0);
v = cond ? v1 : v2;

... is equivalent to ...

v = texelFetch(u_sampler, texcoord[cond ? 0 : 1], 0);

After all, texelFetch computes no implicit derivatives.

Calling the three kinds of restrictions 'convergent', 'co-convergent', and 'uniform', we get:

texture(uniform sampler, uniform texcoord)    convergent    (co-convergent)
texelFetch(uniform sampler, texcoord, lod)                  (co-convergent)
ballotARB(uniform cond)                       convergent    co-convergent
barrier()                                     convergent

For the texturing instructions, I put 'co-convergent' in parentheses because these instructions aren't inherently 'co-convergent'. The attribute is only there because of the 'uniform' function argument.

Actually, looking at the examples, it seems that co-convergent only appears when a function has a uniform argument. Then again, the texelFetch function can be moved freely in the CFG by a SIMT-aware pass that can prove that the move doesn't introduce non-uniformity to the sampler argument, so being able to distinguish functions that are inherently co-convergent (like ballotARB) from those that are only implicitly co-convergent (like texture and texelFetch) is still useful.

For added fun, things get muddier when you notice that in practice, AMDGPU doesn't even flag texturing intrinsics as 'convergent' today. Conceptually, the derivative-computing intrinsics need to be convergent to ensure that the texture coordinates for neighbouring pixels are preserved (as in the very first example). However, the AMDGPU backend does register allocation after the CFG has been transformed into the wave-level control-flow graph. So register allocation automatically preserves neighbouring pixels even when a texture instruction is sunk into a location with additional control-flow dependencies.

When we reach a point where vector register allocation happens with respect to the thread-level control-flow graph, then texture instructions really need to be marked as convergent for correctness. (This change would be beneficial overall, but is tricky because scalar register allocation must happen with respect to the wave-level control flow graph. LLVM currently wants to allocate all registers in one pass.)

The most interesting work I did last week was starting to look into HDMI audio.  I've been putting this off because really I'm a 3D developer, not a modesetting developer, and I'm certainly not an audio developer.  But the work needs to get done, so here I am.

HDMI audio on SOCs looks like it's not terribly hard.  VC4's HDMI driver needs to instantiate a platform device for audio when it gets loaded, and attach a function table to it for calls to be made when starting/stopping audio.  Then I write a sound driver that will bind to the node that VC4 created, accepting a parameter of which register to DMA into.  That sound driver will tell ALSA about what formats it can accept and actually drive the DMA engine to write audio frames into HDMI.  On the VC4 HDMI side, we do a bit of register setup (clocks, channel mappings, and the HDMI audio infoframe) when the sound driver tells us it wants to start.
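To make the shape of that split a bit more concrete, here is a very rough, hypothetical sketch of the kind of glue involved.  Every name in it (vc4_hdmi_audio_ops and friends) is made up for illustration; it is not the actual vc4 or ASoC interface.

/* Hypothetical kernel-side sketch only; none of these names exist in vc4. */
#include <linux/device.h>
#include <linux/types.h>

struct vc4_hdmi_audio_ops {
        /* Called by the sound driver when ALSA starts a stream: the HDMI
         * side sets up clocks, channel mappings and the audio infoframe. */
        int  (*audio_startup)(struct device *hdmi_dev,
                              unsigned int sample_rate, unsigned int channels);
        /* Called when the stream stops, to tear the HDMI audio path down. */
        void (*audio_shutdown)(struct device *hdmi_dev);
};

struct vc4_hdmi_audio_pdata {
        const struct vc4_hdmi_audio_ops *ops;
        struct device *hdmi_dev;
        /* Register address the sound driver's DMA engine writes audio
         * frames into. */
        resource_size_t fifo_reg;
};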

It all sounds pretty straightforward so far.  I'm currently at the "lots of typing" stage of development of this.

Second, I did a little poking at the DSI branch again.  I've rebased onto 4.9 and tested an idea I had for how DSI might be failing.  Still no luck.  There's code on drm-vc4-dsi-panel-restart-cleanup-4.9, and the length of that name suggests how long I've been bashing my head against DSI.  I would love for anyone out there with the inclination to do display debug to spend some time looking into it.  I know the history there is terrible, but I haven't prioritized cleanup while trying to get the driver working in the first place.

I also finished the simulator cleanup from last week, found a regression in the major functional improvement in the cleanup, and landed what worked.  Dave Airlie pointed out that he took a different approach for testing with virgl called vtest, and I took a long look at that.  Now that I've got my partial cleanup in place, the vtest-style simulator wrapper looks a lot more doable, and if it doesn't have the major failing that swrast did with visuals then it should actually make vc4 simulation more correct.

I landed a very tiny optimization to Mesa that I wrote while debugging DEQP failures.

Finally, I put together the -next branches for upstream bcm2835 now that 4.9-rc1 is out.  So far I've landed a major cleanup of pinctrl setup in the DT (started by me and finished by Gerd Hoffmann from Red Hat, to whom I'm very grateful) and a little USB setup fix.  I did a lot of review of VCHIQ patches for staging, and sort of rewrote Linus Walleij's RFC patch for getting descriptive names for the GPIOs in lsgpio.
October 20, 2016

Bloom Filter

The bloom filter makes bright objects glow and bleed through other objects positioned in between them and the camera. It is a common post-processing effect used all the time in video games and animated movies. The demo supports a couple of configuration options that control the intensity and behavior of the filter, here are some screenshots with different settings:

bloom-disabled
Bloom filter Off
bloom-default
Bloom filter On, default settings
bloom-intense
Bloom filter On, intensity increased

I particularly like the glow effect that this brings to the specular reflections on the water surface, although to really appreciate that you need to run the demo and see it in motion.

Cascaded Shadow Maps

I should really write a post about basic shadow mapping before going into the details of Cascaded Shadow Maps, so for now I’ll just focus on the problem they try to solve.

One of the problems with shadow mapping is rendering high resolution shadows, especially for shadows that are rendered close to the camera. Generally, basic shadow mapping provides two ways in which we can improve the resolution of the shadows we render:

1. Increase the resolution of the shadow map textures. This one is obvious but comes at a high performance (and memory) hit.

2. Reduce the distance at which we can render shadows. But this is not ideal of course.

One compromise solution is to notice that, as usual with 3D computer graphics, it is far more important to render nearby objects in high quality than distant ones.

Cascaded Shadow Maps allow us to use different levels of detail for shadows that are rendered at different distances from the camera. Instead of having a single shadow map for all the shadows, we split the viewing frustum into slices and render shadows in each slice to a different shadow map.

There are two immediate benefits of this technique:

1. We have flexibility to define the resolution of the shadow maps for each level of the cascade, allowing us, for example, to increase the resolution of the levels closest to the camera and maybe reduce those that are further away.

2. Each level only records shadows in a slice of the viewing frustum, which increases shadow resolution even if we keep the same texture resolution we used with the original shadow map implementation for each shadow map level.

This approach also has some issues:

1. We need to render multiple shadow maps, which can be a serious performance hit depending on the resolutions of the shadow maps involved. This is why we usually lower the resolution of the shadow maps as distance from the camera increases.

2. As we move closer to or further from shadowed objects we can see the changes in shadow quality pop-in. Of course we can control this by avoiding drastic quality changes between consecutive levels in the cascade.

Here is an example that illustrates the second issue (in this case I have lowered the resolution of the 2nd and 3rd cascade levels to 50% and 25% respectively so that the effect is more obvious). The screenshots show the rendering of the shadows at different distances. We can see how the shadows in the close-up shot are very sharp and as the distance increases they become blurrier due to the use of a lower resolution shadow map:

csm-high
CSM level 0 (4096×4096)
csm-med
CSM level 1 (2048×2048)
csm-low
CSM level 2 (1024×1024)

The demo supports up to 4 shadow map levels although the default configuration is to use 3. The resolution of each level can be configured separately too, in the default configuration I lowered the shadow resolution of the second and third levels to 75% and 50% respectively. If we configure the demo to run on a single level (with 100% texture resolution), we are back to the original shadow map implementation, so it is easy to experiment with both techniques.
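For the curious, a common way to pick the split distances for the cascade levels is to blend a uniform and a logarithmic partition of the view frustum. The sketch below shows only that generic scheme; the function name and the lambda parameter are illustrative, and the demo's actual split logic may well differ.

#include <math.h>

/* Compute the far distance of each cascade level by blending a logarithmic
 * and a uniform split of [near_plane, far_plane]. lambda = 1.0 gives purely
 * logarithmic splits, lambda = 0.0 purely uniform ones. */
static void
compute_cascade_splits(float near_plane, float far_plane, int levels,
                       float lambda, float *splits /* levels entries */)
{
   for (int i = 1; i <= levels; i++) {
      float s = (float) i / levels;
      float log_split = near_plane * powf(far_plane / near_plane, s);
      float uni_split = near_plane + (far_plane - near_plane) * s;
      splits[i - 1] = lambda * log_split + (1.0f - lambda) * uni_split;
   }
}

Each cascade level then renders its shadow map with a projection that covers only the slice between consecutive split distances.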

I intend to cover the details behind shadow mapping and the implementation of the bloom filter in more detail in a future post, so again, stay tuned for more!

October 18, 2016
The most visible change this week is a fix to Mesa for texture upload performance.  A user had reported that selection rectangles in LXDE's file manager were really slow.  I brought up sysprof, and it showed that we were doing uncached reads from the GPU in a situation that should have been entirely write-combined writes.

The bug was that when the size of the texture wasn't aligned to utiles, we were loading the previous contents into the temporary buffer before writing the new texture data in and uploading, even if the full texture was being updated.  I've fixed it to check for when the full texture is being uploaded, and not do the initial download.  This bug was getting hit on almost any Cairo vector graphics operation with its Xlib backend, so hopefully this helps a lot of people's desktops.

I also worked on a cleanup of the simulator mode.  I use the closed source simulator regularly as part of my work -- it's fairly accurate to hardware behavior, and allows you to trace what's happening when things go wrong, all with the driver code running like "normal" for apps on my x86 desktop.

However, my simulator support is a little invasive to the driver, replacing vc4 ioctl calls with alternative ioctls to the i915 driver.  I started on a series to make vc4_simulator.c take plain VC4 ioctls and translate them, so that the simulator code is entirely contained.  The last step I think I have is figuring out which level to put the simulator's copying in and out of window system framebuffers at.

Last, I got DEQP's GLES2 tests up and running on Raspberry Pi.  These are approximately equivalent to the official conformance tests.  The initial results were quite good -- 18630/19098 (97.5%) passing when run in the piglit framework.  I found a couple of fixes to be made in glClear() support, one of which will affect all gallium drivers.  Of the remainder, there are a few tests that are failing due to test bugs (we expose extensions that allow features the tests don't expect), but most are failing in register allocation.  For the register allocation failures, I see a relatively quick fix that should reduce register pressure in loops.

For the last 3 years, I've been working in the OpenStack Telemetry team at eNovance, and then at Red Hat. Our mission is to maintain the OpenStack Telemetry stack, both upstream and downstream (i.e. inside Red Hat products). Besides the technical challenges, the organization of the team has always played a major role in our accomplishments.

Here, I'd like to share some of my hindsight with you, faithful readers.

Meet the team

The team I work in has changed a bit during those 3 years, but the core components have always been the same: a few software engineers, a QE engineer, a product owner, and an engineering manager. That means the team size has always been between 6 and 8 people.

I cannot emphasize enough how important team size is. Not having more than 8 people in a team fits with the two-pizza rule from Jeff Bezos, which turned out to be key in our team composition.

The group dynamic in teams no bigger than this is excellent. It offers the possibility to know and connect with everyone – each team member has only up to 7 people to talk to on a daily basis, which means only 28 communication axes between people (n(n-1)/2 links for n people). Having a team of e.g. 16 people means 120 different links in your team: double your team size, and you multiply your communication overhead by 4. My experience shows that the fewer communication axes you have in a team, the less overhead you will have and the swifter your team will be.

With all team members being remote workers, it is even more challenging to build relationships and bond. We had the opportunity to get to know each other during the OpenStack summit twice a year, and doing regular video-conferences via Google Hangout or BlueJeans really helped.

The atmosphere you set up in your team will also forge the outcome of your team's work. Run your team with trust, peace and humor (remember, I'm on the team 🤣) and awesome things will happen. Run your team with fear, pressure, and finger-pointing, and nothing good will happen.

There's little chance that when a team is built, everyone will be on the same level. We were no exception: we had both more and less experienced engineers. But the most experienced engineers took the time needed to invest in and mentor the less experienced ones. That also helped build trust and communication links between members of the team. And over the long run, everyone gets more efficient: the less experienced engineers get better and the more experienced ones can delegate a lot of stuff to their fellows.

Then they can chill or work on bigger stuff. Win-win.

It's actually not much different from the way you should run an open source team, as I already claimed in a previous article on FOSS project management.

Practicing agility

I might be bad at practicing agility: contrary to many people, I don't see agility as a set of processes. I see it as a state of mind, as a team organization based on empowerment. No more, no less.

And each time I meet people and explain that our team is "agile", they start shivering, explaining how they hate sprints, daily stand-ups, scrum, and planning poker, that this is all a waste of time and energy.

Well, it turns out that you can be agile without all of that.

Planning poker

In our team, we tried at first to run 2-week sprints and used planning poker to schedule our user stories from our product backlog (= todo list). It never worked as expected.

First, most people had the feeling that they were losing their time because they already knew exactly what they were supposed to do. If they had any doubt, they would have just gone and talked to the product owner or another fellow engineer.

Secondly, some stories were really specialized and only one of the team members was able to understand them in detail and evaluate them. So most of the time, a lot of the team members playing planning poker would just vote a random number based on the length of the storyteller's explanation. For example, if an engineer said "I just need to change that flag in the configuration file", then everyone would vote 1. If they started rambling for a few minutes about how the configuration option is easy to switch, but there might be other things to change at the same time, things to check because the impact might be bigger than expected, and code refactoring to do, then most people would just announce a score of 13 on that story, simply because the person talked for minutes straight and everything sounded complicated and out of their scope.

That meant that the poker score had no meaning to us. We never managed to have a number of points that we knew we could accomplish during a sprint (the team velocity as they call it).

The only benefit that we identified from planning poker, in our case, is that it forces people to sit down and communicate about a user story. Though it turned out that making people communicate was not a problem we needed to solve in our team, so we decided to stop doing that. But it can be a pretty good tool to get people talking to each other.

Therefore, the 2-week sprint never made much sense, as we were unable to schedule our work reliably. Furthermore, doing most of our daily job in open source communities, we were unable to schedule anything. When sending patches to an upstream project, you have no clue when they will be reviewed. What you know for sure is that in order to maximize your code merge throughput with this high latency of code review, you need to parallelize your patch submission a lot. So as soon as you receive some feedback from your reviewers, you need to (almost) drop everything, rework your code and resubmit it.

There's no need to explain why this absolutely does not work with a sprint approach. Most of the scrum framework relies on the fact that you own your workflow from top to bottom, which is far from true when working in open source communities.

Daily stand-up meetings

We used to run a stand-up meeting every day, then every other day. Doing that remotely kills the stand-up part, obviously, so there is less guarantee the meeting will be short. Considering all team members are working remotely in different time zones, with some freedom to organize their schedule, it was very difficult to synchronize those meetings. With members spread from the US to Eastern Europe, the meeting was in the middle of the afternoon for me. I found it frustrating to have to stop my activities in the middle of every afternoon to chat with my team. We all know the cost of context switching to us, humans.

So we drifted from our 10 minutes daily meeting to a one-hour weekly meeting with the whole team. It's way easier to synchronize for a large chunk of time once a week and to have this high-throughput communication channel.

Our (own) agile framework

Drifting from the original scrum implementation, we ended up running our own agility framework. It turned out to have similarities with kanban – you don't always have to invent new things!

Our main support is a Trello board that we share with the whole team. It consists of different columns, where we put cards representing small user stories or simple to-do items. Each column is the state of the cards in it, and we move them from left to right:

  • Ideas: where we put things we'd like to do or dig into, but there's no urgency. It might lead to new, smaller ideas, in the "To Do" column.
  • To Do: where we put real things we need to do. We might run a grooming session with our product manager if we need help prioritizing things, but it's usually not necessary.
  • Epic: here we create a few bigger cards that regroup several To Do items. We don't move them around, we just archive them when they are fully implemented. There are only 5-6 big cards here at max, which are the long term goals we work on.
  • Doing: where we move cards from To Do when we start doing them. At this stage, we also add the people working on the task to the card, so we see the little faces of the people involved.
  • Under review: 90% of our job being done upstream, we usually move cards that are done and waiting for feedback from the community to this column. When the patches are approved and the card is complete, we move the card to Done. If a patch needs further improvement, we move the card back to Doing, work on it, and then move it back to Under review when resubmitted.
  • On hold / blocked: some of the tasks we work on might be blocked by external factors. We move cards there to keep track of them.
  • Done during week #XX: we create a new list every Monday to stack our done cards by week. This is just easier to display and it allows us to see the cards that we complete each week. We archive lists older than a month from time to time. It gives great visual feedback on what has been accomplished and merged every week.

We started to automate some of our Trello workflow in a tool called Trelloha. For example, it allows us to track upstream patches sent through Gerrit or GitHub and tick the checkbox items in any card when those are merged.

We actually don't put much effort into our Trello board. It's just slightly organized chaos, as are upstream projects. We use it as a lightweight system for taking notes, organizing our thoughts and letting others know what we're doing and why we're doing it. That's where Trello is wonderful, because using it has very low friction: creating, updating and moving cards is a one-click operation.

One bias of most engineers is to overthink and over-engineer their workflow, trying to rationalize it. Most of the time, they end up automating everything, which means building processes and bureaucracy. It just slows things down and builds frustration for everyone. Just embrace chaos and spend time on what matters.

Most of the things we do are linked to external Launchpad bugs, Gerrit reviews or GitHub issues. That means the cards in Trello carry very little information, as everything happens outside, in the wild Internet of open source communities. This is very important as we need to avoid any kind of retention of knowledge and information from contributors outside the company. This also makes sure that our internal way of running does not leak outside and (badly) influence outside communities.

Retrospectives

We also run a retrospective every 2 weeks, which might be the only thing we kept from the scrum practice. It's actually a good opportunity for us to share our feelings, concerns or jokes. We used to do it using the six thinking hats method, but it slowly faded away. In the end, we now use a different Trello board with those columns:

  • Good 😄
  • Hopes and Wishes 🎁
  • Puzzles and Challenges 🌊
  • To improve 😡
  • Action Items 🤘

All teammates fill the board with the cards they want, and everyone is free to add themselves to any card. We then run through each card and let the people who added their name to it talk about it. The column "Action Items" is usually filled in as we speak and discover things we should do. We can then move cards created there to our regular board, in the To Do column.

Central communication

Sure, people have different roles in a team, but we dislike bottlenecks and single points of failure. Therefore, we use an internal mailing list where we ask people to send their requests and messages. If people send things related to our team's job to one of us personally, we just forward or Cc the list when replying, so everyone is aware of what any of us might be discussing with people external to the team.

This is very important, as it emphasizes that no team member should be considered special. Nobody owns more information and knowledge than others, and anybody can jump into a conversation if they have valuable knowledge to share.

The same applies for our internal IRC channel.

We also make sure that we discuss only company-specific things on this list or on our internal IRC channel. Everything that can be public and is related to upstream is discussed on external communication media (IRC, upstream mailing lists, etc.). This is very important to make sure that we are not blocking anybody outside of Red Hat from joining us and contributing to the projects or ideas we work on. We also want to make sure that people working in our company are no more special than other contributors.

Improvement

We're pretty happy with our set-up right now, and the team has been running pretty smoothly for a few months. We're still trying to improve, and having a general sense of trust among team members makes sure we can speak openly about whatever problems we might have.

Feel free to share your feedback and own experience of running your own teams in the comment section.

October 13, 2016
Lots of individuals and companies have made substantial contributions to Apitrace. But maintenance has always rested with me, the original author.

I would have preferred to share that responsibility with a wider team, but things haven't turned out that way, for several reasons I suppose:

  • There are many people that care about one section of the functionality (one API, one OS), but few care about all of them.
  • For all existing and potential contributors, including me, Apitrace is merely a means to an end (testing/debugging graphics drivers or applications); it's not an end in itself. The other stuff always gets top priority.
  • There are more polished tools for newer generation APIs like Vulkan, Metal, Direct3D 12. These newer APIs are much leaner than legacy APIs, which eliminates a lot of design constraints. And some of these tools have large teams behind them.
  • Last but not least, I failed to nurture such a community. I always kept close control, partly to avoid things becoming a hodgepodge, partly from fear of breakage, but I can't shake the feeling that if I had been more relaxed things might have turned out differently.

Apitrace has always been something I worked on in my spare time, or whenever I had an itch to scratch. That is still true, with the exception that after having a kid I have scarcely any free time left.

Furthermore, the future is not bright: I believe Apitrace will have a long life in graphics driver test automation, and perhaps whenever somebody needs to debug an old OpenGL application, but I doubt it will flourish beyond that. And this fact weighs in whenever I need to decide whether to spend some time on Apitrace vs everything else.

The end result is that I haven't been a responsive maintainer for some time (taking a long time to merge patches, provide feedback, resolve issues, etc.), and I'm afraid that will continue for the foreseeable future.

I don't feel any obligation to do more (after all, the license does say the software is provided as is), but I do want to set the right expectations to avoid frustrating users/contributors who might otherwise expect timely feedback, hence this post.

In this post I’ll discuss how I setup and render terrain mesh in the OpenGL terrain rendering demo. Most of the relevant code for this is in the ter-terrain.cpp file.

Setting up a grid of vertices

Unless you know how to use a 3D modeling program properly, a reasonable way to create a decent mesh for a terrain consists of using a grid of vertices and elevating them according to a height map image. In order to create the grid we only need to decide how many rows and columns we want. This, in the end, determines the number of polygons and the resolution of the terrain.

We need to map these vertices to world coordinates too. We do that by defining a tile size, which is the distance between consecutive vertices in world units. Larger tile sizes increase the size of the terrain but lower the resolution by creating larger polygons.
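As a minimal sketch of that layout (assuming a flat array of positions laid out column by column; heights are filled in later, and none of this is the demo's exact code):

/* Lay out a cols x rows grid of vertices on the XZ plane, spaced by
 * tile_size world units. The Y coordinate (elevation) is set later from
 * the height map. */
static void
build_grid_positions(float *pos /* cols * rows * 3 floats */,
                     int cols, int rows, float tile_size)
{
   for (int c = 0; c < cols; c++) {
      for (int r = 0; r < rows; r++) {
         float *v = &pos[(c * rows + r) * 3];
         v[0] = c * tile_size;   /* X */
         v[1] = 0.0f;            /* Y, elevation placeholder */
         v[2] = r * tile_size;   /* Z */
      }
   }
}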

The image below shows an 8×6 grid that defines 35 tiles. Each tile is rendered using 2 triangles:

terrain-vertex-grid
8×6 terrain grid

The next step is to elevate these vertices so we don’t end up with a boring flat surface. We do this by sampling the height map image for each vertex in the grid. A height map is a gray scale image where the values of the pixels represent altitudes at different positions. The closer the color is to white, the more elevated it is.

terrain-heightmap-01
Heightmap image

Adding more vertices to the grid increases the number of sampling points from the height map and reduces the sampling distances, leading to a smoother and more precise representation of the height map in the resulting terrain.

terrain_heightmap_sampling
Sampling the heightmap to compute vertex heights

Of course, we still need to map the height map samples (gray scale colors) to altitudes in world units. In the demo I do this by normalizing the color values to [-1,+1] and then applying a scale factor to compute the altitude values in world space. By playing with the scaling factor we can make our terrain look more or less abrupt.

terrain_and_heightmap
Altitude scale=6.0
terrain_and_heightmap_abrupt
Altitude scale=12.0

For reference, the height map sampling is implemented in ter_terrain_set_heights_from_texture().
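The gist of that function, as a hedged sketch (the pixel layout, array names and parameters here are assumptions, not the demo's actual interface):

/* Sample one gray-scale pixel per grid vertex, normalize it to [-1, +1]
 * and scale it to get the vertex altitude in world units. Assumes an
 * 8-bit single-channel image and cols, rows >= 2. */
static void
set_heights_from_image(float *heights /* cols * rows entries */,
                       int cols, int rows,
                       const unsigned char *pixels, int img_w, int img_h,
                       float scale)
{
   for (int c = 0; c < cols; c++) {
      for (int r = 0; r < rows; r++) {
         int px = (int) ((float) c / (cols - 1) * (img_w - 1));
         int py = (int) ((float) r / (rows - 1) * (img_h - 1));
         float gray = pixels[py * img_w + px] / 255.0f;        /* [0, 1]  */
         heights[c * rows + r] = (2.0f * gray - 1.0f) * scale; /* world Y */
      }
   }
}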

Creating the mesh

At this point we know the full position (x, y, z) in world coordinates of all the vertices in our grid. The next step is to build the actual triangle mesh that we will use to render the terrain and the normal vectors for each triangle. This process is described below and is implemented in the ter_terrain_build_mesh() function.

Computing normals

In order to get nice lighting on our terrain we need to compute normals for each vertex in the mesh. A simple way to achieve this would be to compute the normal for each face (triangle) and use that normal for each vertex in the triangle. This works, but it has 3 problems:

1. Every vertex of each triangle has the same exact normal, which leads to a rather flat result.

2. Adjacent triangles with different orientations showcase abrupt changes in the normal value, leading to significantly different lighting across the surfaces that highlight the individual triangles in the mesh.

3. Because each vertex in the mesh can have a different normal value for each triangle it participates in, we need to replicate the vertices when we render, which is not optimal.

Alternatively, we can compute the normals for each vertex considering the heights of its neighboring vertices. This solves all the problems mentioned above and leads to much better results thanks to the interpolation of the normal vectors across the triangles, which leads to smooth lighting reflection transitions:

normals-flat
Flat normals
normals-smooth
Smooth normals

The implementation for this is in the function calculate_normal(), which takes the column and row indices of the vertex in the grid and computes the normal by sampling the heights of the 4 nearby vertices in the grid.
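A minimal sketch of that idea using central differences over the four neighbours; height_at() and tile_size are assumed helpers/parameters, and the demo's calculate_normal() may differ in details such as edge handling:

#include <math.h>

/* Assumed helper: returns the height of the vertex at (col, row), with
 * indices clamped to the grid. */
extern float height_at(int col, int row);

/* Per-vertex normal from the heights of the 4 neighbouring vertices. */
static void
vertex_normal(int col, int row, float tile_size,
              float *nx, float *ny, float *nz)
{
   float hl = height_at(col - 1, row);   /* left  */
   float hr = height_at(col + 1, row);   /* right */
   float hd = height_at(col, row - 1);   /* down  */
   float hu = height_at(col, row + 1);   /* up    */

   /* Normal of the local surface: X and Z slopes plus an up component. */
   float x = hl - hr;
   float y = 2.0f * tile_size;
   float z = hd - hu;
   float len = sqrtf(x * x + y * y + z * z);

   *nx = x / len;
   *ny = y / len;
   *nz = z / len;
}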

Preparing the draw call

Now that we know the positions of all the vertices and their normal vectors we have all the information that we need to render the terrain. We still have to decide how exactly we want to render all the polygons.

The simplest way to render the terrain using a single draw call is to setup a vertex buffer with data for each triangle in the mesh (including position and normal information) and use GL_TRIANGLES for the primitive of the draw call. This, however, is not the best option from the point of view of performance.

Because the terrain will typically contain a large number of vertices and most of them participate in multiple triangles, we end up uploading a large amount of vertex data to the GPU and processing a lot of vertices in the draw call. The result is large memory requirements and suboptimal performance.

For reference, the terrain I used in the demo from my original post used a 251×251 grid. This grid represents 250×250 tiles, each one rendered as two triangles (6 vertices/tile), so we end up with 250x250x6=375,000 vertices. For each of these vertices we need to upload 24 bytes of vertex data with the position and normal, so we end up with a GPU buffer that is almost 9MB large.

One obvious way to reduce this is to render the terrain using triangle strips. The problem with this is that in theory, we can’t render the terrain with just one strip, we would need one strip (and so, one draw call) per tile column or one strip per tile row. Fortunately, we can use degenerate triangles to link separate strips for each column into a single draw call. With this we trim down the number of vertices to 126,000 and the size of the buffer to a bit below 3 MB. This alone produced a 15%-20% performance increase in the demo.

We can do better though. A lot of the vertices in the terrain mesh participate in various triangles across the large triangle strip in the draw call, so we can reduce memory requirements by using an index buffer to render the strip. If we do this, we trim things down to 63,000 vertices and ~1.5MB. This added another 4%-5% performance bonus over the original implementation.
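To illustrate the degenerate-triangle trick, here is a sketch of building one index list that stitches all the per-column strips together. The vertex indexing scheme (vertex (c, r) at index c * rows + r) is an assumption for the example:

/* Build indices for a single GL_TRIANGLE_STRIP covering the whole grid.
 * Between consecutive columns we repeat the last and first indices, which
 * produces zero-area (degenerate) triangles that the GPU discards.
 * Returns the number of indices written. */
static unsigned
build_strip_indices(unsigned *indices, int cols, int rows)
{
   unsigned n = 0;

   for (int c = 0; c < cols - 1; c++) {
      if (c > 0)
         indices[n++] = c * rows;                     /* repeat first of strip */

      for (int r = 0; r < rows; r++) {
         indices[n++] = c * rows + r;                 /* current column */
         indices[n++] = (c + 1) * rows + r;           /* next column    */
      }

      if (c < cols - 2)
         indices[n++] = (c + 1) * rows + (rows - 1);  /* repeat last of strip */
   }

   return n;
}

For the 251×251 grid mentioned above this yields roughly 126,000 indices, in line with the vertex count quoted earlier.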

Clipping

So far we have been rendering the full mesh of the terrain in each frame and we do this by uploading the vertex data to the GPU just once (for example in the first frame). However, depending on where the camera is located and what it is looking at, just a fraction of the terrain may be visible.

Although the GPU will discard all the geometry and fragments that fall outside the viewport, it still has to process each vertex in the vertex shader stage before it can clip non-visible triangles. Because the number of triangles in the terrain is large, this is suboptimal and to address this we want to do CPU-side clipping before we render.

Doing CPU-side clipping comes with some additional complexities though: it requires that we compute the visible region of the terrain and upload new vertex data to the GPU in each frame, all while preventing GPU stalls.

In the demo, we implement the clipping by computing a quad sub-region of the terrain that includes the visible area that we need to render. Once we know the sub-region that we want to render, we compute the new indices of the vertices that participate in the region so we can render it using a single triangle strip. Finally, we upload the new index data to the index buffer for use in the follow-up draw call.

Avoiding GPU stalls

Although all the above is correct, it actually leads, as described, to much worse performance in general. The reason for this is that our uploads of vertex data in each frame lead to frequent GPU stalls. This happens in two scenarios:

1. In the same frame, because we need to upload different vertex data for the rendering of the terrain for the shadow map and the scene (the shadow map renders the terrain from the point of view of the light, so the visible region of the terrain is different). This creates stalls because the rendering of the terrain for the shadow map might not have completed before we attempt to upload new data to the index buffer in order to render the terrain for the scene.

2. Between different frames. Because the GPU might not be completely done rendering the previous frame (and thus, still needs the index buffer data available) before we start preparing the next frame and attempt to upload new terrain index data for it.

In the case of the Intel Mesa driver, these GPU stalls can be easily identified by using the environment variable INTEL_DEBUG=perf. When using this, the driver will detect these situations and produce warnings informing about the stalls, the buffers affected and the regions of the buffers that generate the stall, such as:

Stalling on glBufferSubData(0, 503992) (492kb) to a busy (0-1007984)
buffer object.  Use glMapBufferRange() to avoid this.

The solution to this problem that I implemented (other than trying to put as much work as possible between read/write accesses to the index buffer) comes in two forms:

1. Circular buffers

In this case, we allocate a larger buffer than we need so that each subsequent upload of new index data happens in a separate sub-region of the allocated buffer. I set up the demo so that each circular buffer is large enough to hold the index data required for all updates of the index buffer happening in each frame (the shadow map and the scene).

2. Multi-buffering

We allocate more than one circular buffer. When we don’t have enough free space at the end of the current buffer to upload the new index buffer data, we upload it to a different circular buffer instead. When we run out of buffers we circle back to the first one (which at this point will hopefully be free to be re-used again).

So why not just use a single, very large circular buffer? Mostly because there are limits to the size of the buffers that the GPU may be able to handle correctly (or efficiently). Also, why not have many smaller independent buffers instead of circular buffers? That would work just fine, but using fewer, larger buffers reduces the number of objects we need to bind/unbind and is better for preventing memory fragmentation, so that’s a plus.
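Putting the two ideas together, here is a hedged sketch of what the index upload might look like. Buffer count, sizes and names are illustrative; the demo's actual code will differ.

#include <string.h>
#include <epoxy/gl.h>   /* or any other GL loader */

#define NUM_IBOS  2
#define IBO_SIZE  (2 * 1024 * 1024)

static GLuint ibos[NUM_IBOS];   /* created elsewhere, each IBO_SIZE bytes */
static int cur_ibo;
static GLsizeiptr cur_offset;

/* Upload index data into a free sub-region of one of the circular buffers
 * and return which buffer (and at which offset) it landed in. */
static GLuint
upload_indices(const void *data, GLsizeiptr size, GLintptr *out_offset)
{
   /* If the current circular buffer is full, rotate to the next one. */
   if (cur_offset + size > IBO_SIZE) {
      cur_ibo = (cur_ibo + 1) % NUM_IBOS;
      cur_offset = 0;
   }

   glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibos[cur_ibo]);

   /* Unsynchronized map: we promise the GL we are not touching data the
    * GPU may still be reading, because previous uploads went to other
    * offsets or other buffers. */
   void *ptr = glMapBufferRange(GL_ELEMENT_ARRAY_BUFFER, cur_offset, size,
                                GL_MAP_WRITE_BIT |
                                GL_MAP_INVALIDATE_RANGE_BIT |
                                GL_MAP_UNSYNCHRONIZED_BIT);
   memcpy(ptr, data, size);
   glUnmapBuffer(GL_ELEMENT_ARRAY_BUFFER);

   *out_offset = cur_offset;
   cur_offset += size;
   return ibos[cur_ibo];
}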

Final touches

We are almost done, at this point we only need to add a texture to the terrain surface, add some slight fog effect for distant pixels to create a more realistic look, add a skybox (it is important to choose the color of the fog so it matches the color of the sky!) and tweak the lighting parameters to get a nice result:

terrain_final
Final rendering

I hope to cover some of these aspects in future posts, so stay tuned for more!

October 11, 2016
Last week I started work on making hello_fft work with the vc4 driver loaded.

hello_fft is a demo program for doing FFTs using the QPUs (the shader core in vc4).  Instead of drawing primitives into a framebuffer like a GL pipeline does, though, it uses the User QPU pipeline, which just hands a uniform stream to a QPU shader instance.

Originally I didn't build a QPU user shader ioctl for vc4 because there's no good way to use this pipeline of the VC4.  The hardware isn't quite capable enough to be exposed as OpenCL or GL compute shaders (some research had been done into this, and the designers' conclusion was that the memory access support wasn't quite good enough to be useful).  That leaves you with writing VC4 shaders in raw assembly, which I don't recommend.

The other problem for vc4 exposing user shaders is that, since the GPU sits directly on main memory with no MMU in between, GPU shader programs could access anywhere in system memory.  For 3D, there's no need (in GL 2.1) to do general write back to system memory, so the kernel reads your shader code and rejects shaders if they try to do it.  It also makes sure that you bounds-check all reads through the uniforms or texture sampler.  For QPU user shaders, though, the expected mode of using them is to have the VPM DMA units do general loads and stores, so we'd need new validation support.

If it was just this one demo, we might be willing to lose support for it in the transition to the open driver.  However, some modes of accelerated video decode also use QPU shaders at some stage, so we have to have some sort of solution.

My plan is basically to not do validation and have root-only execution of QPU shaders for now.  The firmware communication driver captures hello_fft's request to have the firmware pass QPU shaders to the V3D hardware, and it redirects it into VC4.  VC4 then maintains the little queue of requests coming in, powers up the hardware, feeds them in, and collects the interrupts from the QPU shaders for when they're done (each job request is required to "mov interrupt, 1" at the end of its last shader to be run).

Dom is now off experimenting in what it will take to redirect video decode's QPU usage onto this code.

The other little project last week was fixing Processing's performance on vc4.  It's one of the few apps that is ported to the closed driver, so it's a likely thing to be compared on.

Unfortunately, Processing was clearing its framebuffers really inefficiently.  For non-tiled hardware it was mostly OK, with just a double-clear of the depth buffer, but for vc4 its repeated glClear()s (rather than a single glClear with all the buffers to be cleared set) were triggering flushes and reloads of the scene, and in one case a render of a full screen quad (and a full screen load before doing so) rather than fast clearing.

The solution was to improve the tracking of what buffers are cleared and whether any primitives have been drawn yet, so that I can coalesce sets of repeated or partial clears together while only updating the colors to be cleared.  There's always a tension in this sort of optimization: Should the GL driver be clever and work around apps behaving badly, or should it push that off to the app developer (more like Vulkan does)?  In this case, tiled renderer behavior is sufficiently different from non-tiled renderers, and enough apps will hit this path, that it's well worth it.  For Processing, I saw an improvement of about 10% on the demo I was looking at.

Sadly, Processing won't be a good comparison for open vs closed drivers even after these fixes.  For the closed driver Processing uses EGL, which says that depth buffers become undefined on eglSwapBuffers.  In its default mode, though, Processing uses GLX, which says that the depth buffer is retained on glXSwapBuffers().  To avoid this extra overhead in its GLX mode, you'd need to use glDiscardFramebufferEXT() right before the swap.  Unfortunately, this isn't connected up in gallium yet, but it shouldn't be hard to do once we have an app that we can test with.
October 07, 2016

I procrastinated rather badly on this one, so instead of this coming out around the previous kernel release, the v4.8 release is already out of the door. Read on for my slightly more terse catch-up report.

Since I’m this late I figured instead of the usual comprehensive list I’ll do something new and just list some of the work that landed in 4.8, but with a bit more focus on the impact and why things have been done.

Midlayers, Be Gone!

The first thing I want to highlight is the driver de-midlayering. In the Linux kernel community the midlayer mistake, or helper library design pattern (see the linked article from LWN), is a set of rules for designing subsystems and common support code for drivers. The underlying rule is that the driver itself must be in control of everything, like allocating memory and handling all requests. Common code is only shared in helper library functions, which the driver can call if they are suitable. The reason for that is that there is always some hardware which needs special treatment, and when you have a special case and there’s a midlayer, it will get in the way.

Due to the shared history with BSD kernels DRM originally had a full-blown midlayer, but over time this has been fixed. For example kernel modesetting was designed from the start with the helper library pattern. The last holdout is the device structure itself, and for the Intel driver this is now fixed. This has two main benefits:

  • First we can get rid of a lot of pointer dereferencing in the compiled binaries. With the midlayer, DRM allocated a struct drm_device, and the Intel driver allocated its own, separate structure. Both are connected with pointers, and every time control transferred from driver-private functions to shared code those pointers had to be walked.

    With the helper approach the driver allocates the DRM device structure embedded into its own device structure. That way the pointer dereferencing just becomes a fixed adjustment offset of the original pointer. And fixed offsets can be baked into each access of individual member fields for free, resulting in a big reduction of compiled code (see the sketch after this list).

  • The other benefit is that the Intel driver is now in full control of the driver load and unload sequence. The DRM midlayer functions for loading had a few driver callbacks, but for historical reasons at the wrong spots. And fixing that is impossible without rewriting the load code for all the drivers. Without the midlayer we can have as many steps in the load sequence as we want, and where we want them. The most important fix here is that the driver will now be initialized completely before any part of it is registered and visible to userspace (through the /dev node, sysfs or anywhere else).
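As a rough illustration of the embedded-device pattern (a generic sketch, not the actual i915 code):

#include <linux/kernel.h>
#include <drm/drmP.h>

/* The driver structure embeds struct drm_device instead of pointing at a
 * separately allocated one. */
struct example_i915_device {
        struct drm_device drm;          /* embedded, not a pointer */
        /* ... driver-private state ... */
};

/* Going from the DRM device to the driver private is now just a constant
 * offset adjustment, which the compiler folds into each field access. */
static inline struct example_i915_device *
to_example_i915(struct drm_device *dev)
{
        return container_of(dev, struct example_i915_device, drm);
}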

Thundering Herds

GPUs process rendering asynchronously, and sometimes the CPU needs to wait for them. For this purpose there’s a wait queue in the driver. Userspace processes block on that until the interrupt handler wakes them up. The trouble is that thus far there was just one wait queue per engine, which means every time the GPU completed something all waiters had to be woken up. Then they checked whether the work they needed to wait for had completed, and if not, they blocked on the wait queue again until the next batch job completed. That’s all rather inefficient. On top of that there’s only one per-engine knob to enable interrupts, which means even if there was only one waiting process, it was woken for every completed job. And GPUs can have a lot of jobs in-flight.

In summary, waiting for the GPU worked more like a frantic herd trampling all over things instead of something orderly. To fix this the request and completion tracking was entirely revamped, to make sure that the driver has a much better understanding of what’s going on. On top there’s now also an efficient search structure of all current waiting processes. With that the interrupt handler can quickly check whether the just completed GPU job is of interest, and if so, which exact process should be woken up.

But this wasn’t just done to make the driver more efficient. Better tracking of pending and completed GPU requests is an important foundation for implementing proper GPU scheduling on top of it. And it’s also needed to interface the completion tracking with other drivers, to finally fix tearing on multi-GPU machines. Having a thundering herd in your own backyard is unsightly, letting it loose on your neighbours is downright bad! A lot of this follow-up work already landed for the 4.9 kernel, hence I will talk more about this in a future installment of this series.

October 06, 2016
img_20161006_115415

No more “which is now the index of this modem…?”

DBus object path and index

When modems are detected by ModemManager and exposed in DBus, they are assigned a unique DBus object path, with a common prefix and a unique index number, e.g.:

/org/freedesktop/ModemManager1/Modem/0

This path is the one used by the mmcli command line tool to operate on a modem, so users can identify the device by the full path or just by the index, e.g. these two calls are totally equivalent:

$ mmcli -m /org/freedesktop/ModemManager1/Modem/0
$ mmcli -m 0

This logic looks good, except for the fact that there isn’t a fixed DBus object path for each modem detected: i.e. the index given to a device is the next one available, and if the device is power cycled or unplugged and replugged, a different index will be given to it.

EquipmentIdentifier

Systems like NetworkManager handle this index change gracefully, just by assuming that the exposed device isn’t the same one as the one exposed earlier with a different index. If settings need to be applied to a specific device, they will be stored associated with the EquipmentIdentifier property of the modem, which is the same across reboots (i.e. the IMEI for GSM/UMTS/LTE devices).

User-provided names

The 1.8 stable release of ModemManager will come with support for user-provided names assigned to devices. A use case of this new feature is for example those custom systems where the user would like to assign a name to a device based on the USB port to which it is connected (e.g. assuming the USB hardware layout doesn’t change across reboots).

The user can specify the names (UIDs, unique IDs) just by tagging in udev the physical device that owns all ports of a modem with the new ID_MM_PHYSDEV_UID property. These tags need to be applied before the ID_MM_CANDIDATE properties, and therefore the rules file should be named so it sorts before the 80-mm-candidate.rules one, for example like this:

$ cat /lib/udev/rules.d/78-mm-naming.rules

ACTION!="add|change|move", GOTO="mm_naming_rules_end"
DEVPATH=="/devices/pci0000:00/0000:00:1d.0/usb4/4-1/4-1.5/4-1.5.5",ENV{ID_MM_PHYSDEV_UID}="USB1"
DEVPATH=="/devices/pci0000:00/0000:00:1d.0/usb4/4-1/4-1.5/4-1.5.2",ENV{ID_MM_PHYSDEV_UID}="USB2"
DEVPATH=="/devices/pci0000:00/0000:00:1d.0/usb4/4-1/4-1.5/4-1.5.3",ENV{ID_MM_PHYSDEV_UID}="USB3"
DEVPATH=="/devices/pci0000:00/0000:00:1d.0/usb4/4-1/4-1.5/4-1.5.4",ENV{ID_MM_PHYSDEV_UID}="USB4"
LABEL="mm_naming_rules_end"

The value of the new ID_MM_PHYSDEV_UID property will be used in the Device property exposed in the DBus object, and can also be used directly in mmcli calls instead of the path or index, e.g.:

$ mmcli -m USB4
...
 -------------------------
 System | device: 'USB4'
        | drivers: 'qmi_wwan, qcserial'
        | plugin: 'Sierra'
        | primary port: 'cdc-wdm2'
...

Given that the same property value will always be set for the modem in a specific device path, these user-provided names may unequivocally identify a specific modem even when the device is power-cycled, unplugged and replugged, or even when the whole system is rebooted.

Binding the property to the device path is just an example of what could be done. There is no restriction on what the logic is to apply the ID_MM_PHYSDEV_UID property, so users may also choose other different approaches.

This support is already in ModemManager git master, and as already said, will be included in the stable 1.8 release, whenever that is.


TL;DR? ModemManager now supports assigning unique names to devices that stay even across full system reboots.


October 05, 2016

I have been quite busy with various things in the last few weeks, but I have finally found some time to clean up and upload the code of the OpenGL terrain render demo to Github.

Since this was intended as a programming exercise I have not tried to be very elegant or correct during the implementation, so expect things like error handling to be a bit rough around the edges, but otherwise I think the code should be easy enough to follow.

Notice that I have only tested this on Intel GPUs. I know it works on NVIDIA too (thanks to Samuel and Chema for testing this) but there are a couple of rendering artifacts there, specifically at the edges of the skybox and some “pillars” showing up in the distance sometimes, probably because I am rendering one too many “rows” of the terrain and I end up rendering garbage. I may fix these some day.

The code I uploaded to the repository includes a few new features too:

  • Model variants, which are basically color variations of the same model
  • A couple of additional models (a new tree and plant) and a different rock type
  • Collision detection, which makes navigating the terrain more pleasant

Here is a new screenshot:

screenshot

In future posts I will talk a bit about some of the implementation details, so having the source code around will be useful. Enjoy!

October 04, 2016

systemd.conf 2016 is Over Now!

A few days ago systemd.conf 2016 ended, our second conference of this kind. I personally enjoyed this conference a lot: the talks, the atmosphere, the audience, the organization, the location, they all were excellent!

I'd like to take the opportunity to thank everybody involved. In particular I'd like to thank Chris, Daniel, Sandra and Henrike for organizing the conference, your work was stellar!

I'd also like to thank our sponsors, without which the conference couldn't take place like this, of course. In particular I'd like to thank our gold sponsor, Red Hat, our organizing sponsor Kinvolk, as well as our silver sponsors CoreOS and Facebook. I'd also like to thank our bronze sponsors Collabora, OpenSUSE, Pantheon, Pengutronix, our supporting sponsor Codethink and last but not least our media sponsor Linux Magazin. Thank you all!

I'd also like to thank the Video Operation Center ("VOC") for their amazing work on live-streaming the conference and making all talks available on YouTube. It's amazing how efficient the VOC is, it's simply stunning! Thank you guys!

In case you missed this year's iteration of the conference, please have a look at our YouTube Channel. You'll find all of this year's talks there, as well the ones from last year. (For example, my welcome talk is available here). Enjoy!

We hope to see you again next year, for systemd.conf 2017 in Berlin!

Last week Adam Jackson landed my X testing series, so now at least 3 of us pushing code are testing glamor with every commit.  I used the series to test a cleanup of the Render extension related code in glamor, deleting 550 lines of junk.  I did this while starting to work on using GL sampler objects, which should help reduce our pixmap allocation and per-draw Render overhead.   That patch is failing make check so far.

The big project for the week was tackling some of the HDMI bug reports.  I pulled out my bigger monitor that actually takes HDMI input, and started working toward getting it to display all of its modes correctly.  We had an easy bug in the clock driver that broke changing to low resolutions: trying to drive the PLL too slow, instead of driving the PLL in spec and using the divider to get down to the clock we want.  We needed infoframes support to fix purple borders on the screen.  We were doing our interlaced video mode programming pretty much entirely wrong (it's not perfect yet, but 1 of the 3 interlaced modes on my monitor displays correctly and they all display something).  Finally, I added support for double-clocked CEA modes (generally the low resolution, low refresh rate ones in your mode list; these are probably particularly interesting for RPi due to all the people running emulators).

Finally, one of the more exciting things for me was that I got general agreement from gregkh to merge the VCHIQ driver through the staging tree, and he even put the initial patch together for me.  If we add in a couple more small helper drivers, we should be able to merge the V4L2 camera driver, and hopefully get closer to doing a V4L2 video decode driver for upstream.  I've now tested a messy series that at least executes "vcdbg log msg."  Those other drivers will definitely need some cleanups, though.
October 03, 2016

After a few weeks of hard work with the team, here is the new major version of Gnocchi, stamped 3.0.0. It was very challenging, as we wanted to implement a few big changes in it.

Gnocchi is now using reno to its maximum and you can read the release notes of the 3.0 branch online. Some notes might be missing as it is our first release with it, but we are making good progress at writing changelogs for most of our user facing and impacting changes.

Therefore, I'll only write here about our big major feature that made us bump the major version number.

New storage engine

And so the most interesting thing that went in the 3.0 release, is the new storage engine that has been built by me and Gordon Chung during those last months. The original approach of writing data in Gnocchi was really naive, so we had an iterative improvement process since version 1.0, and we're getting close to something very solid.

This new version leverages several important features which increase performance by a large factor on Ceph (using write(offset) rather than read()+write() to append new points), our recommended back-end.

read+write (Swift and file drivers) vs offset (Ceph)

To summarize, since most data points are sent sequentially and ordered, we enhanced the data format to profit from that fact and be able to be appended without reading anything. That only works on Ceph though, which provides the needed features.

We also enabled data compression on all storage drivers by enabling LZ4 compression (see my previous article and research on the subject), which obviously offers its own set of challenges when using append-only write. The results are tremendous and decrease data usage by a huge factor:

Disk size comparison between Gnocchi 2 and Gnocchi 3

The rest of the processing pipeline also has been largely improved:

Processing time of new measures
Compression time comparison

Overall, we're delighted with the performance improvement we achieved, and we're looking forward to making even more progress. Gnocchi is now one of the most performant and scalable timeseries databases out there.

Upcoming challenges

With that big change done, we're now heading toward a set of more lightweight improvements. Our bug tracker is a good place to learn what might be on our mind (check for the wishlist bugs).

Improving our API features and offering a better experience for those coming from outside the realm of OpenStack are now at the top of my priority list.

But let me know if there's anything that's been itching you, obviously. 😎

September 30, 2016

For about a year now we’ve been running the Intel graphics driver with a new process: besides the two established maintainers, we’ve added all regular contributors as committers to the main feature branch feeding into -next. This turned out to be a tremendous success, but did require some initial adjustments to how we run things in the first few months.

I’ve presented the new model here at Kernel Recipes in Paris, and I will also talk about it at Kernel Summit in Santa Fe. Since LWN is present at both I won’t bother with a full writeup, but leave that to much better editors. Update: LWN on kernel maintainer scalability.

Anyway, there’s a video recording and the slides. Our process is also documented - scroll down to the bottom for the more interesting bits around what’s expected of committers.

On a related note: At XDC, and a bit before, Eric Anholt started a discussion about improving our patch submission process, especially for new contributors. He used the Rust community as a great example, and presented about it at XDC. It was rather interesting to hear his perspective as a first-time contributor confirm what I learned at LCA this year from Emily Dunham’s awesome talk on Life is better with Rust’s community automation.

September 27, 2016
The answer is YES!!

I fixed the last bug with instance rendering and Talos renders great on radv now.

Also, with the semi-interesting branch vkQuake renders too. There are some upstream bugs that need fixing in spirv/nir that I'm awaiting an upstream resolution on, but I've included some prelim fixes in semi-interesting for now that'll go away when the upstream fixes are decided on.

Here's a screenshot:

September 26, 2016
Last week I spent at XDS 2016 with other graphics stack developers.

I gave a talk on X Server testing and our development model.  Since playing around in the Servo project (and a few other github-centric communities) recently, I've become increasingly convinced that github + CI + automatic merges is the right way to go about software development now.  I got a warmer reception from the other X Server developers than I expected, and I hope to keep working on building out this infrastructure.

I also spent a while with keithp and ajax (other core X developers) talking about how to fix our rendering performance.  We've got two big things hurting vc4 right now: Excessive flushing, and lack of bounds on rendering operations.

The excessive flushing one is pretty easy.  X currently does some of its drawing directly to the front buffer (I think this is bad and we should stop, but that *is* what we're doing today).  With glamor, we do our rendering with GL, so the driver accumulates a command stream over time.  In order for the drawing to the front buffer to show up on time, both for running animations, and for that last character you typed before rendering went idle, we have to periodically flush GL's command stream so that its rendering to the front buffer (if any) shows up.

Right now we just glFlush() every time we might block sending things back to a client, which is really frequent.  For a tiling architecture like vc4, each glFlush() is quite possibly a load and store of the entire screen (8MB).

The solution I started back in August is to rate-limit how often we flush.  When the server wakes up (possibly to render), I arm a timer.  When it fires, I flush and disarm it until the next wakeup.  I set the timer to 5ms as an arbitrary value that was clearly not pretending to be vsync.  The state machine was a little tricky and seemed to have a bug (or at least a bad interaction with x11perf -- it was unclear).  But ajax pointed out that I could easily do better: we have a function in glamor that gets called right before it does any GL operation, meaning that we could use that to arm the timer and not do the arming every time the server wakes up to process input.
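For illustration, here's a minimal sketch of that rate-limiting idea, assuming a standalone deadline check in the main loop rather than the X server's real timer API; glamor_about_to_render() and maybe_flush() are made-up hooks, not actual glamor or server functions.


#include <stdbool.h>
#include <time.h>
#include <GL/gl.h>

/* Hedged sketch of the rate-limited flush described above: the first GL
 * operation arms a ~5ms deadline, and we only glFlush() once that deadline
 * has passed. */

#define FLUSH_DELAY_MS 5

static bool flush_pending;
static struct timespec flush_deadline;

static struct timespec ms_from_now(long ms)
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    t.tv_nsec += ms * 1000000L;
    t.tv_sec += t.tv_nsec / 1000000000L;
    t.tv_nsec %= 1000000000L;
    return t;
}

/* Call right before issuing any GL command. */
void glamor_about_to_render(void)
{
    if (!flush_pending) {
        flush_pending = true;
        flush_deadline = ms_from_now(FLUSH_DELAY_MS);
    }
}

/* Call from the server's main loop wakeup. */
void maybe_flush(void)
{
    struct timespec now;

    if (!flush_pending)
        return;
    clock_gettime(CLOCK_MONOTONIC, &now);
    if (now.tv_sec > flush_deadline.tv_sec ||
        (now.tv_sec == flush_deadline.tv_sec &&
         now.tv_nsec >= flush_deadline.tv_nsec)) {
        glFlush();
        flush_pending = false;
    }
}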

Bounding our rendering operations is a bit trickier.  We started doing some of this with keithp's glamor rework: We glScissor() all of our rendering to the pCompositeClip (which is usually the bounds of the current window).  For front buffer rendering in X, this is a big win on vc4 because it means that when you update an uncomposited window I know that you're only loading and storing the area covered by the window, not the entire screen.  That's a big bandwidth savings.

However, the pCompositeClip isn't enough.  As I'm typing we should be bounding the GL rendering around each character (or span of them), so that I only load and store one or two tiles rather than the entire window.  It turns out, though, that you've probably already computed the bounds of the operation in the Damage extension, because you have a compositor running!  Wouldn't it be nice if we could just reuse that old damage computation and reference it?

Keith has started on making this possible: First with the idea of const-ifying all the op arguments so that we could reuse their pointers as a cheap cache key, and now with the idea of just passing along a possibly-initialized rectangle with the bounds of the operation.  If you do compute the bounds, then you pass it down to anything you end up calling.

Between these two, we should get much improved performance on general desktop apps on Raspberry Pi.

Other updates: Landed opt_peephole_sel improvement for glmark2 performance (and probably mupen64plus), added regression testing of glamor to the X Server, fixed a couple of glamor rendering bugs.
September 20, 2016

Writing a book is a big undertaking. You have to think about what you will actually write, the content, its organization, the examples you want to show, illustrations, etc.

When publishing with the help of a regular editor, your job stops at the writing – and that's already a big and hard enough task. Your editor will handle the publishing process, sparing you the production work. They might have their own set of requirements, though, such as making you work with a word processing tool (think LibreOffice Writer or Microsoft Word).

The Hacker's Guide to Python on my Kindle

When you self-publish like I did with The Hacker's Guide to Python, none of that happens. You have to deal with getting your work out there yourself, released and available in a viable format for your readership.

Most of the time, you need to render your book in different formats. You will have to make sure it works correctly on different devices and that the formatting and layout of the content are correct.

I knew exactly what I wanted when writing my book. I wanted to have the book published in at least PDF (for computer reading) and ePub (for e-readers). I also knew, as an Emacs user, that I did not want to spend hours writing a book in LibreOffice. It's not for me.

When I wrote about the making of The Hacker's Guide to Python, I briefly mentioned which tools I used to build the book and that I picked AsciiDoc as the input format. It makes it easy to write your book inside your favorite text editor, and AsciiDoc has plenty of output formats. Customizing these formats to my liking and requirements was another challenge.

It took me hours and hours of work to have all the nitty-gritty details right. Today I am happy to announce that I can save you a few hours of work if you also want to publish a book.

I've published a new project on my GitHub called asciidoc-book-toolchain. It is the actual toolchain that I use to build The Hacker's Guide to Python. It should be easy to use and is able to render any book in HTML, PDF, PDF (printable 6"×9" format), ePub and MOBI.

Conversion workflow of the asciidoc-book-toolchain

So feel free to use it, hack it, pull-request it, or whatever. You don't have any good excuse not to write a book now! 😇 And if you want to self-publish a book and need some help getting started, let me know, I would be glad to give you a few hints!

First a definition: a trackstick is also called trackpoint, pointing stick, or "that red knob between G, H, and B". I'll be using trackstick here, because why not.

This post is the continuation of libinput and the Lenovo T450 and T460 series touchpads, where we focused on a stalling pointer when moving the finger really slowly. It turns out that the T460s at least, and possibly others in the *60 series, have another bug that causes much worse behaviour, but we didn't notice it for ages because we were focusing on the high-precision cursor movement. Specifically, the pointer would just randomly stop moving for a short while (spoiler alert: 300ms), regardless of the movement speed.

libinput has built-in palm detection and one of the things it does is to disable the touchpad when the trackstick is in use. It's not uncommon to rest the hand near or on the touchpad while using the trackstick and any detected touch would cause interference with the pointer motion. So events from the touchpad are ignored whenever the trackpoint sends events. [1]

On (some of) the T460s the trackpoint sends spurious events. In the recording I have, we see random events at 9s, then again 3.5s later, then 14s later, then 2s later, etc. Each time, our palm detection code would assume the trackpoint was in use and disable the touchpad for 300ms. If you were using the touchpad while this was happening, the touchpad would suddenly stop moving for 300ms and then continue as normal. Depending on how often these spurious events came in and the user's current caffeination state, this was somewhere between odd, annoying and infuriating.
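The logic behind the stalls is roughly this (a simplified sketch of the behaviour described above, not libinput's actual implementation; the function names are made up): every trackpoint event refreshes a timestamp, and any touchpad event that arrives within 300ms of it is thrown away.


#include <stdbool.h>
#include <stdint.h>

/* Simplified sketch of the palm-detection rule described above;
 * timestamps are in milliseconds. */
#define TRACKPOINT_TIMEOUT_MS 300

static uint64_t last_trackpoint_event_ms;

void on_trackpoint_event(uint64_t now_ms)
{
    last_trackpoint_event_ms = now_ms;
}

bool touchpad_event_is_ignored(uint64_t now_ms)
{
    /* A spurious trackpoint event re-arms this window, which is exactly
     * what caused the random 300ms touchpad stalls on the T460s. */
    return now_ms - last_trackpoint_event_ms < TRACKPOINT_TIMEOUT_MS;
}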

The good news is: this is fixed in libinput now. libinput 1.5 and the upcoming 1.4.3 releases will have a fix that ignores these spurious events and makes the touchpad stalls a footnote of history. Hooray.

[1] we still allow touchpad physical button presses, and trackpoint button clicks won't disable the touchpad

September 19, 2016

This post explains how the evdev protocol works. After reading this post you should understand what evdev is and how to interpret evdev event dumps to understand what your device is doing. The post is aimed mainly at users having to debug a device, I will thus leave out or simplify some of the technical details. I'll be using the output from evemu-record as example because that is the primary debugging tool for evdev.

What is evdev?

evdev is a Linux-only generic protocol that the kernel uses to forward information and events about input devices to userspace. It's not just for mice and keyboards but any device that has any sort of axis, key or button, including things like webcams and remote controls. Each device is represented as a device node in the form of /dev/input/event0, with the trailing number increasing as you add more devices. The node numbers are re-used after you unplug a device, so don't hardcode the device node into a script. The device nodes are also only readable by root, thus you need to run any debugging tools as root too.

evdev is the primary way to talk to input devices on Linux. All X.Org drivers on Linux use evdev as their protocol, and so does libinput. Note that "evdev" is also the shortcut used for xf86-input-evdev, the X.Org driver to handle generic evdev devices, so watch out for context when you read "evdev" on a mailing list.

Communicating with evdev devices

Communicating with a device is simple: open the device node and read from it. Any data coming out is a struct input_event, defined in /usr/include/linux/input.h:


struct input_event {
    struct timeval time;
    __u16 type;
    __u16 code;
    __s32 value;
};
I'll describe the contents later, but you can see that it's a very simple struct.
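For illustration only, here is a minimal C program that reads raw events straight off a device node and prints their type/code/value triplets. The device path is just an example; in real code you would use libevdev or the evemu tools instead.


#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <linux/input.h>

int main(void)
{
    /* Example device node; pick whatever evemu-describe lists for you.
     * This needs to run as root, see above. */
    int fd = open("/dev/input/event0", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct input_event ev;
    while (read(fd, &ev, sizeof(ev)) == sizeof(ev))
        printf("time %ld.%06ld type %u code %u value %d\n",
               (long)ev.time.tv_sec, (long)ev.time.tv_usec,
               ev.type, ev.code, ev.value);

    close(fd);
    return 0;
}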

Static information about the device such as its name and capabilities can be queried with a set of ioctls. Note that you should always use libevdev to interact with a device; it blunts the few sharp edges evdev has. See the libevdev documentation for usage examples.

evemu-record, our primary debugging tool for anything evdev, is very simple. It reads the static information about the device, prints it and then simply reads and prints all events as they come in. The output is in a machine-readable format but it's annotated with human-readable comments (starting with #). You can always ignore the non-comment bits. There's a second command, evemu-describe, that only prints the description and exits without waiting for events.

Relative devices and keyboards

The top part of an evemu-record output is the device description. This is a list of static properties that tells us what the device is capable of. For example, the USB mouse I have plugged in here prints:


# Input device name: "PIXART USB OPTICAL MOUSE"
# Input device ID: bus 0x03 vendor 0x93a product 0x2510 version 0x110
# Supported events:
# Event type 0 (EV_SYN)
# Event code 0 (SYN_REPORT)
# Event code 1 (SYN_CONFIG)
# Event code 2 (SYN_MT_REPORT)
# Event code 3 (SYN_DROPPED)
# Event code 4 ((null))
# Event code 5 ((null))
# Event code 6 ((null))
# Event code 7 ((null))
# Event code 8 ((null))
# Event code 9 ((null))
# Event code 10 ((null))
# Event code 11 ((null))
# Event code 12 ((null))
# Event code 13 ((null))
# Event code 14 ((null))
# Event type 1 (EV_KEY)
# Event code 272 (BTN_LEFT)
# Event code 273 (BTN_RIGHT)
# Event code 274 (BTN_MIDDLE)
# Event type 2 (EV_REL)
# Event code 0 (REL_X)
# Event code 1 (REL_Y)
# Event code 8 (REL_WHEEL)
# Event type 4 (EV_MSC)
# Event code 4 (MSC_SCAN)
# Properties:
The device name is the one (usually) set by the manufacturer and so are the vendor and product IDs. The bus is one of the "BUS_USB" and similar constants defined in /usr/include/linux/input.h. The version is often quite arbitrary, only a few devices have something meaningful here.

We also have a set of supported events, categorised by "event type" and "event code" (note how type and code are also part of the struct input_event). The type is a general category, and /usr/include/linux/input-event-codes.h defines quite a few of those. The most important types are EV_KEY (keys and buttons), EV_REL (relative axes) and EV_ABS (absolute axes). In the output above we can see that we have EV_KEY and EV_REL set.

As a subitem of each type we have the event code. The event codes for this device are self-explanatory: BTN_LEFT, BTN_RIGHT and BTN_MIDDLE are the left, right and middle button. The axes are a relative x axis, a relative y axis and a wheel axis (i.e. a mouse wheel). EV_MSC/MSC_SCAN is used for raw scancodes and you can usually ignore it. And finally we have the EV_SYN bits but let's ignore those, they are always set for all devices.

Note that an event code cannot be on its own, it must be a tuple of (type, code). For example, REL_X and ABS_X have the same numerical value and without the type you won't know which one is which.

That's pretty much it. A keyboard will have a lot of EV_KEY bits set and the EV_REL axes are obviously missing (but not always...). Instead of BTN_LEFT, a keyboard would have e.g. KEY_ESC, KEY_A, KEY_B, etc. 90% of device debugging is looking at the event codes and figuring out which ones are missing or shouldn't be there.

Exercise: You should now be able to read an evemu-record description from any mouse or keyboard device connected to your computer and understand what it means. This also applies to most special devices such as remotes - the only thing that changes are the names for the keys/buttons. Just run sudo evemu-describe and pick any device in the list.

The events from relative devices and keyboards

evdev is a serialised protocol. It sends a series of events and then a synchronisation event to notify us that the preceding events all belong together. This synchronisation event is EV_SYN SYN_REPORT; it is generated by the kernel, not the device, and hence the EV_SYN codes are always available on all devices.

Let's have a look at a mouse movement. As explained above, half the line is machine-readable but we can ignore that bit and look at the human-readable output on the right.


E: 0.335996 0002 0000 0001 # EV_REL / REL_X 1
E: 0.335996 0002 0001 -002 # EV_REL / REL_Y -2
E: 0.335996 0000 0000 0000 # ------------ SYN_REPORT (0) ----------
This means that within one hardware event, we've moved 1 device unit to the right (x axis) and two device units up (y axis). Note how all events have the same timestamp (0.335996).
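To make the framing concrete, here is a hedged sketch of how a consumer might accumulate relative deltas into a single motion vector and only act on it once the SYN_REPORT arrives (handle_motion() is a made-up callback):


#include <linux/input.h>

extern void handle_motion(int dx, int dy);   /* hypothetical callback */

/* Sketch: collect REL_X/REL_Y deltas and hand them off as one motion
 * per SYN_REPORT frame. */
static int frame_dx, frame_dy;

void handle_event(const struct input_event *ev)
{
    if (ev->type == EV_REL) {
        if (ev->code == REL_X)
            frame_dx += ev->value;
        else if (ev->code == REL_Y)
            frame_dy += ev->value;
    } else if (ev->type == EV_SYN && ev->code == SYN_REPORT) {
        handle_motion(frame_dx, frame_dy);
        frame_dx = frame_dy = 0;
    }
}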

Let's have a look at a button press:


E: 0.656004 0004 0004 589825 # EV_MSC / MSC_SCAN 589825
E: 0.656004 0001 0110 0001 # EV_KEY / BTN_LEFT 1
E: 0.656004 0000 0000 0000 # ------------ SYN_REPORT (0) ----------
E: 0.727002 0004 0004 589825 # EV_MSC / MSC_SCAN 589825
E: 0.727002 0001 0110 0000 # EV_KEY / BTN_LEFT 0
E: 0.727002 0000 0000 0000 # ------------ SYN_REPORT (0) ----------
For button events, a value of 1 signals button pressed, a value of 0 signals button released.

And key events look like this:


E: 0.000000 0004 0004 458792 # EV_MSC / MSC_SCAN 458792
E: 0.000000 0001 001c 0000 # EV_KEY / KEY_ENTER 0
E: 0.000000 0000 0000 0000 # ------------ SYN_REPORT (0) ----------
E: 0.560004 0004 0004 458976 # EV_MSC / MSC_SCAN 458976
E: 0.560004 0001 001d 0001 # EV_KEY / KEY_LEFTCTRL 1
E: 0.560004 0000 0000 0000 # ------------ SYN_REPORT (0) ----------
[....]
E: 1.172732 0001 001d 0002 # EV_KEY / KEY_LEFTCTRL 2
E: 1.172732 0000 0000 0001 # ------------ SYN_REPORT (1) ----------
E: 1.200004 0004 0004 458758 # EV_MSC / MSC_SCAN 458758
E: 1.200004 0001 002e 0001 # EV_KEY / KEY_C 1
E: 1.200004 0000 0000 0000 # ------------ SYN_REPORT (0) ----------
Mostly the same as button events. But wait, there is one difference: we have a value of 2 as well. For key events, a value 2 means "key repeat". If you're on the tty, then this is what generates repeat keys for you. In X and Wayland we ignore these repeat events and instead use XKB-based key repeat.

Now look at the keyboard events again and see if you can make sense of the sequence. We have an Enter release (but no press), then ctrl down (and repeat), followed by a 'c' press - but no release. The explanation is simple - as soon as I hit enter in the terminal, evemu-record started recording, so it captured the enter release too. And it stopped recording as soon as ctrl+c was down because that's when it was cancelled by the terminal. One important takeaway here: the evdev protocol is not guaranteed to be balanced. You may see a release for a key you've never seen the press for, and you may be missing a release for a key/button you've seen the press for (this happens when you stop recording). Oh, and there's one danger: if you record your keyboard and you type your password, the keys will show up in the output. Security experts generally recommend not publishing event logs with your password in them.

Exercise: You should now be able to read an evemu-record event list from any mouse or keyboard device connected to your computer and understand the event sequence. This also applies to most special devices such as remotes - the only thing that changes are the names for the keys/buttons. Just run sudo evemu-record and pick any device listed.

Absolute devices

Things get a bit more complicated when we look at absolute input devices like a touchscreen or a touchpad. Yes, touchpads are absolute devices in hardware and the conversion to relative events is done in userspace by e.g. libinput. The output of my touchpad is below. Note that I've manually removed a few bits to make it easier to grasp, they will appear later in the multitouch discussion.


# Input device name: "SynPS/2 Synaptics TouchPad"
# Input device ID: bus 0x11 vendor 0x02 product 0x07 version 0x1b1
# Supported events:
# Event type 0 (EV_SYN)
# Event code 0 (SYN_REPORT)
# Event code 1 (SYN_CONFIG)
# Event code 2 (SYN_MT_REPORT)
# Event code 3 (SYN_DROPPED)
# Event code 4 ((null))
# Event code 5 ((null))
# Event code 6 ((null))
# Event code 7 ((null))
# Event code 8 ((null))
# Event code 9 ((null))
# Event code 10 ((null))
# Event code 11 ((null))
# Event code 12 ((null))
# Event code 13 ((null))
# Event code 14 ((null))
# Event type 1 (EV_KEY)
# Event code 272 (BTN_LEFT)
# Event code 325 (BTN_TOOL_FINGER)
# Event code 328 (BTN_TOOL_QUINTTAP)
# Event code 330 (BTN_TOUCH)
# Event code 333 (BTN_TOOL_DOUBLETAP)
# Event code 334 (BTN_TOOL_TRIPLETAP)
# Event code 335 (BTN_TOOL_QUADTAP)
# Event type 3 (EV_ABS)
# Event code 0 (ABS_X)
# Value 2919
# Min 1024
# Max 5112
# Fuzz 0
# Flat 0
# Resolution 42
# Event code 1 (ABS_Y)
# Value 3711
# Min 2024
# Max 4832
# Fuzz 0
# Flat 0
# Resolution 42
# Event code 24 (ABS_PRESSURE)
# Value 0
# Min 0
# Max 255
# Fuzz 0
# Flat 0
# Resolution 0
# Event code 28 (ABS_TOOL_WIDTH)
# Value 0
# Min 0
# Max 15
# Fuzz 0
# Flat 0
# Resolution 0
# Properties:
# Property type 0 (INPUT_PROP_POINTER)
# Property type 2 (INPUT_PROP_BUTTONPAD)
# Property type 4 (INPUT_PROP_TOPBUTTONPAD)
We have a BTN_LEFT again and a set of other buttons that I'll explain in a second. But first we look at the EV_ABS output. We have the same naming system as above. ABS_X and ABS_Y are the x and y axis on the device, ABS_PRESSURE is an (arbitrary) ranged pressure value.

Absolute axes have a bit more state than just a simple bit. Specifically, they have a minimum and maximum (not all hardware has the top-left sensor position on 0/0, it can be an arbitrary position, specified by the minimum). Notable here is that the axis ranges are simply the ones announced by the device - there is no guarantee that the values fall within this range and indeed a lot of touchpad devices tend to send values slightly outside that range. Fuzz and flat can be safely ignored, but resolution is interesting. It is given in units per millimeter and thus tells us the size of the device. In the above case, (5112 - 1024)/42 means the device is 97mm wide. The resolution is quite commonly wrong, a lot of axis overrides need the resolution changed to the correct value.
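The same calculation can be done programmatically. Below is a minimal sketch that queries the axis information with the EVIOCGABS ioctl and prints the device width in mm; the device path is just an example.


#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/input.h>

int main(void)
{
    int fd = open("/dev/input/event7", O_RDONLY);   /* example node */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct input_absinfo abs;
    if (ioctl(fd, EVIOCGABS(ABS_X), &abs) == 0 && abs.resolution > 0)
        printf("width: %d mm\n",
               (abs.maximum - abs.minimum) / abs.resolution);
    else
        printf("no x resolution set, needs a hwdb entry\n");

    close(fd);
    return 0;
}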

The axis description also has a current value listed. The kernel only sends events when the value changes, so even if the actual hardware keeps sending events, you may never see them in the output if the value remains the same. In other words, holding a finger perfectly still on a touchpad creates plenty of hardware events, but you won't see anything coming out of the event node.

Finally, we have properties on this device. These are used to indicate general information about the device that's not otherwise obvious. In this case INPUT_PROP_POINTER tells us that we need a pointer for this device (it is a touchpad after all, a touchscreen would instead have INPUT_PROP_DIRECT set). INPUT_PROP_BUTTONPAD means that this is a so-called clickpad, it does not have separate physical buttons but instead the whole touchpad clicks. Ignore INPUT_PROP_TOPBUTTONPAD because it only applies to the Lenovo *40 series of devices.

Ok, back to the buttons: aside from BTN_LEFT, we have BTN_TOUCH. This one signals that the user is touching the surface of the touchpad (with some in-kernel defined minimum pressure value). It's not just for finger touches, it's also used for graphics tablet stylus touches (so really, it's more "contact" than "touch" but meh).

The BTN_TOOL_FINGER event tells us that a finger is in detectable range. This gives us two bits of information: first, we have a finger (a tablet would have e.g. BTN_TOOL_PEN) and second, we may have a finger in proximity without touching. On many touchpads, BTN_TOOL_FINGER and BTN_TOUCH come in the same event, but others can detect a finger hovering over the touchpad too (in which case you'd also hope for ABS_DISTANCE being available on the touchpad).

Finally, the BTN_TOOL_DOUBLETAP up to BTN_TOOL_QUINTTAP tell us whether the device can detect 2 through to 5 fingers on the touchpad. This doesn't actually track the fingers, it merely tells you "3 fingers down" in the case of BTN_TOOL_TRIPLETAP.
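Put differently, the mutually exclusive BTN_TOOL_* bits encode a finger count. Here's a small sketch of how one could map them when processing EV_KEY events (the helper function itself is made up for illustration):


#include <linux/input.h>

/* Sketch: map the mutually exclusive BTN_TOOL_* codes to a finger count.
 * Returns -1 for key codes that aren't finger-count bits. */
int tool_code_to_finger_count(unsigned int code)
{
    switch (code) {
    case BTN_TOOL_FINGER:    return 1;
    case BTN_TOOL_DOUBLETAP: return 2;
    case BTN_TOOL_TRIPLETAP: return 3;
    case BTN_TOOL_QUADTAP:   return 4;
    case BTN_TOOL_QUINTTAP:  return 5;
    default:                 return -1;
    }
}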

Exercise: Look at your touchpad's description and figure out if the size of the touchpad is correct based on the axis information [1]. Check how many fingers your touchpad can detect and whether it can do pressure or distance detection.

The events from absolute devices

Events from absolute axes are not really any different than events from relative devices which we already covered. The same type/code combination with a value and a timestamp, all framed by EV_SYN SYN_REPORT events. Here's an example of me touching the touchpad:


E: 0.000001 0001 014a 0001 # EV_KEY / BTN_TOUCH 1
E: 0.000001 0003 0000 3335 # EV_ABS / ABS_X 3335
E: 0.000001 0003 0001 3308 # EV_ABS / ABS_Y 3308
E: 0.000001 0003 0018 0069 # EV_ABS / ABS_PRESSURE 69
E: 0.000001 0001 0145 0001 # EV_KEY / BTN_TOOL_FINGER 1
E: 0.000001 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +0ms
E: 0.021751 0003 0018 0070 # EV_ABS / ABS_PRESSURE 70
E: 0.021751 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +21ms
E: 0.043908 0003 0000 3334 # EV_ABS / ABS_X 3334
E: 0.043908 0003 0001 3309 # EV_ABS / ABS_Y 3309
E: 0.043908 0003 0018 0065 # EV_ABS / ABS_PRESSURE 65
E: 0.043908 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +22ms
E: 0.052469 0001 014a 0000 # EV_KEY / BTN_TOUCH 0
E: 0.052469 0003 0018 0000 # EV_ABS / ABS_PRESSURE 0
E: 0.052469 0001 0145 0000 # EV_KEY / BTN_TOOL_FINGER 0
E: 0.052469 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +9ms
In the first event you see BTN_TOOL_FINGER and BTN_TOUCH set (this touchpad doesn't detect hovering fingers). An x/y coordinate pair and a pressure value. The pressure changes in the second event, the third event changes pressure and location. Finally, we have BTN_TOOL_FINGER and BTN_TOUCH released on finger up, and the pressure value goes back to 0. Notice how the second event didn't contain any x/y coordinates? As I said above, the kernel only sends updates on absolute axes when the value changed.

Ok, let's look at a three-finger tap (again, minus the ABS_MT_ bits):


E: 0.000001 0001 014a 0001 # EV_KEY / BTN_TOUCH 1
E: 0.000001 0003 0000 2149 # EV_ABS / ABS_X 2149
E: 0.000001 0003 0001 3747 # EV_ABS / ABS_Y 3747
E: 0.000001 0003 0018 0066 # EV_ABS / ABS_PRESSURE 66
E: 0.000001 0001 014e 0001 # EV_KEY / BTN_TOOL_TRIPLETAP 1
E: 0.000001 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +0ms
E: 0.034209 0003 0000 2148 # EV_ABS / ABS_X 2148
E: 0.034209 0003 0018 0064 # EV_ABS / ABS_PRESSURE 64
E: 0.034209 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +34ms
[...]
E: 0.138510 0003 0000 4286 # EV_ABS / ABS_X 4286
E: 0.138510 0003 0001 3350 # EV_ABS / ABS_Y 3350
E: 0.138510 0003 0018 0055 # EV_ABS / ABS_PRESSURE 55
E: 0.138510 0001 0145 0001 # EV_KEY / BTN_TOOL_FINGER 1
E: 0.138510 0001 014e 0000 # EV_KEY / BTN_TOOL_TRIPLETAP 0
E: 0.138510 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +23ms
E: 0.147834 0003 0000 4287 # EV_ABS / ABS_X 4287
E: 0.147834 0003 0001 3351 # EV_ABS / ABS_Y 3351
E: 0.147834 0003 0018 0037 # EV_ABS / ABS_PRESSURE 37
E: 0.147834 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +9ms
E: 0.157151 0001 014a 0000 # EV_KEY / BTN_TOUCH 0
E: 0.157151 0003 0018 0000 # EV_ABS / ABS_PRESSURE 0
E: 0.157151 0001 0145 0000 # EV_KEY / BTN_TOOL_FINGER 0
E: 0.157151 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +10ms
In the first event, the touchpad detected all three fingers at the same time, so we get BTN_TOUCH, x/y/pressure and BTN_TOOL_TRIPLETAP set. Note that the various BTN_TOOL_* bits are mutually exclusive. BTN_TOOL_FINGER means "exactly 1 finger down" and you can't have exactly 1 finger down when you have three fingers down. In the second event x and pressure update (y has no event, it stayed the same).

In the event after the break, we switch from three fingers to one finger. BTN_TOOL_TRIPLETAP is released, BTN_TOOL_FINGER is set. That's very common. Humans aren't robots, you can't release all fingers at exactly the same time, so depending on the hardware scanout rate you have intermediate states where one finger has left already, others are still down. In this case I released two fingers between scanouts, one was still down. It's not uncommon to see a full cycle from BTN_TOOL_FINGER to BTN_TOOL_DOUBLETAP to BTN_TOOL_TRIPLETAP on finger down or the reverse on finger up.

Exercise: test out the pressure values on your touchpad and see how close you can get to the actual announced range. Check how accurate the multifinger detection is by tapping with two, three, four and five fingers. (In both cases, you'll likely find that it's very much hit and miss).

Multitouch and slots

Now we're at the most complicated topic regarding evdev devices. In the case of multitouch devices, we need to send multiple touches on the same axes. So we need an additional dimension and that is called multitouch slots (there is another, older multitouch protocol that doesn't use slots but it is so rare now that you don't need to bother).

First: all axes that are multitouch-capable are repeated as ABS_MT_foo axis. So if you have ABS_X, you also get ABS_MT_POSITION_X and both axes have the same axis ranges and resolutions. The reason here is backwards-compatibility: if a device only sends multitouch events, older programs only listening to the ABS_X etc. events won't work. Some axes may only be available for single-touch (ABS_MT_TOOL_WIDTH in this case).

Let's have a look at my touchpad, this time without the axes removed:


# Input device name: "SynPS/2 Synaptics TouchPad"
# Input device ID: bus 0x11 vendor 0x02 product 0x07 version 0x1b1
# Supported events:
# Event type 0 (EV_SYN)
# Event code 0 (SYN_REPORT)
# Event code 1 (SYN_CONFIG)
# Event code 2 (SYN_MT_REPORT)
# Event code 3 (SYN_DROPPED)
# Event code 4 ((null))
# Event code 5 ((null))
# Event code 6 ((null))
# Event code 7 ((null))
# Event code 8 ((null))
# Event code 9 ((null))
# Event code 10 ((null))
# Event code 11 ((null))
# Event code 12 ((null))
# Event code 13 ((null))
# Event code 14 ((null))
# Event type 1 (EV_KEY)
# Event code 272 (BTN_LEFT)
# Event code 325 (BTN_TOOL_FINGER)
# Event code 328 (BTN_TOOL_QUINTTAP)
# Event code 330 (BTN_TOUCH)
# Event code 333 (BTN_TOOL_DOUBLETAP)
# Event code 334 (BTN_TOOL_TRIPLETAP)
# Event code 335 (BTN_TOOL_QUADTAP)
# Event type 3 (EV_ABS)
# Event code 0 (ABS_X)
# Value 5112
# Min 1024
# Max 5112
# Fuzz 0
# Flat 0
# Resolution 41
# Event code 1 (ABS_Y)
# Value 2930
# Min 2024
# Max 4832
# Fuzz 0
# Flat 0
# Resolution 37
# Event code 24 (ABS_PRESSURE)
# Value 0
# Min 0
# Max 255
# Fuzz 0
# Flat 0
# Resolution 0
# Event code 28 (ABS_TOOL_WIDTH)
# Value 0
# Min 0
# Max 15
# Fuzz 0
# Flat 0
# Resolution 0
# Event code 47 (ABS_MT_SLOT)
# Value 0
# Min 0
# Max 1
# Fuzz 0
# Flat 0
# Resolution 0
# Event code 53 (ABS_MT_POSITION_X)
# Value 0
# Min 1024
# Max 5112
# Fuzz 8
# Flat 0
# Resolution 41
# Event code 54 (ABS_MT_POSITION_Y)
# Value 0
# Min 2024
# Max 4832
# Fuzz 8
# Flat 0
# Resolution 37
# Event code 57 (ABS_MT_TRACKING_ID)
# Value 0
# Min 0
# Max 65535
# Fuzz 0
# Flat 0
# Resolution 0
# Event code 58 (ABS_MT_PRESSURE)
# Value 0
# Min 0
# Max 255
# Fuzz 0
# Flat 0
# Resolution 0
# Properties:
# Property type 0 (INPUT_PROP_POINTER)
# Property type 2 (INPUT_PROP_BUTTONPAD)
# Property type 4 (INPUT_PROP_TOPBUTTONPAD)
We have an x and y position for multitouch as well as a pressure axis. There are also two special multitouch axes that aren't really axes. ABS_MT_SLOT and ABS_MT_TRACKING_ID. The former specifies which slot is currently active, the latter is used to track touch points.

Slots are a static property of a device. My touchpad, as you can see above, only supports 2 slots (min 0, max 1) and thus can track 2 fingers at a time. Whenever the first finger is set down, its coordinates will be tracked in slot 0, the second finger will be tracked in slot 1. When the finger in slot 0 is lifted, the second finger continues to be tracked in slot 1, and if a new finger is set down, it will be tracked in slot 0. Sounds more complicated than it is, think of it as an array of possible touchpoints.

The tracking ID is an incrementing number that lets us tell touch points apart and also tells us when a touch starts and when it ends. The value is either -1 or a positive number. Any positive number means "new touch" and -1 means "touch ended". So when you put two fingers down and lift them again, you'll get a tracking ID of 1 in slot 0, a tracking ID of 2 in slot 1, then a tracking ID of -1 in both slots to signal they ended. The tracking ID value itself is meaningless, it simply increases as touches are created.
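As a mental model, here is a hedged sketch of how a consumer could track touch points across slots while reading events. Struct and function names are made up; real code would also read the initial slot state (e.g. via libevdev) instead of assuming everything starts out inactive.


#include <stdbool.h>
#include <linux/input.h>

#define MAX_SLOTS 2   /* my touchpad: ABS_MT_SLOT ranges from 0 to 1 */

struct touch_point {
    bool active;
    int tracking_id;
    int x, y;
};

static struct touch_point slots[MAX_SLOTS];
static int current_slot;   /* the kernel only sends ABS_MT_SLOT on change */

void handle_mt_event(const struct input_event *ev)
{
    if (ev->type != EV_ABS)
        return;

    switch (ev->code) {
    case ABS_MT_SLOT:
        current_slot = ev->value;       /* subsequent events use this slot */
        break;
    case ABS_MT_TRACKING_ID:
        if (ev->value == -1) {
            slots[current_slot].active = false;          /* touch ended */
        } else {
            slots[current_slot].active = true;           /* new touch */
            slots[current_slot].tracking_id = ev->value;
        }
        break;
    case ABS_MT_POSITION_X:
        slots[current_slot].x = ev->value;
        break;
    case ABS_MT_POSITION_Y:
        slots[current_slot].y = ev->value;
        break;
    }
}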

Let's look at a single tap:


E: 0.000001 0003 0039 0387 # EV_ABS / ABS_MT_TRACKING_ID 387
E: 0.000001 0003 0035 2560 # EV_ABS / ABS_MT_POSITION_X 2560
E: 0.000001 0003 0036 2905 # EV_ABS / ABS_MT_POSITION_Y 2905
E: 0.000001 0003 003a 0059 # EV_ABS / ABS_MT_PRESSURE 59
E: 0.000001 0001 014a 0001 # EV_KEY / BTN_TOUCH 1
E: 0.000001 0003 0000 2560 # EV_ABS / ABS_X 2560
E: 0.000001 0003 0001 2905 # EV_ABS / ABS_Y 2905
E: 0.000001 0003 0018 0059 # EV_ABS / ABS_PRESSURE 59
E: 0.000001 0001 0145 0001 # EV_KEY / BTN_TOOL_FINGER 1
E: 0.000001 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +0ms
E: 0.021690 0003 003a 0067 # EV_ABS / ABS_MT_PRESSURE 67
E: 0.021690 0003 0018 0067 # EV_ABS / ABS_PRESSURE 67
E: 0.021690 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +21ms
E: 0.033482 0003 003a 0068 # EV_ABS / ABS_MT_PRESSURE 68
E: 0.033482 0003 0018 0068 # EV_ABS / ABS_PRESSURE 68
E: 0.033482 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +12ms
E: 0.044268 0003 0035 2561 # EV_ABS / ABS_MT_POSITION_X 2561
E: 0.044268 0003 0000 2561 # EV_ABS / ABS_X 2561
E: 0.044268 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +11ms
E: 0.054093 0003 0035 2562 # EV_ABS / ABS_MT_POSITION_X 2562
E: 0.054093 0003 003a 0067 # EV_ABS / ABS_MT_PRESSURE 67
E: 0.054093 0003 0000 2562 # EV_ABS / ABS_X 2562
E: 0.054093 0003 0018 0067 # EV_ABS / ABS_PRESSURE 67
E: 0.054093 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +10ms
E: 0.064891 0003 0035 2569 # EV_ABS / ABS_MT_POSITION_X 2569
E: 0.064891 0003 0036 2903 # EV_ABS / ABS_MT_POSITION_Y 2903
E: 0.064891 0003 003a 0059 # EV_ABS / ABS_MT_PRESSURE 59
E: 0.064891 0003 0000 2569 # EV_ABS / ABS_X 2569
E: 0.064891 0003 0001 2903 # EV_ABS / ABS_Y 2903
E: 0.064891 0003 0018 0059 # EV_ABS / ABS_PRESSURE 59
E: 0.064891 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +10ms
E: 0.073634 0003 0039 -001 # EV_ABS / ABS_MT_TRACKING_ID -1
E: 0.073634 0001 014a 0000 # EV_KEY / BTN_TOUCH 0
E: 0.073634 0003 0018 0000 # EV_ABS / ABS_PRESSURE 0
E: 0.073634 0001 0145 0000 # EV_KEY / BTN_TOOL_FINGER 0
E: 0.073634 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +9ms
We have a tracking ID (387) signalling finger down, as well as a position plus pressure. Then some updates and eventually a tracking ID of -1 (signalling finger up). Notice how there is no ABS_MT_SLOT here - the kernel buffers those too, so while you stay in the same slot (0 in this case) you don't see any events for it. Also notice how you get both single-finger as well as multitouch events in the same event stream. This is for backwards compatibility. [2]

Ok, time for a two-finger tap:


E: 0.000001 0003 0039 0496 # EV_ABS / ABS_MT_TRACKING_ID 496
E: 0.000001 0003 0035 2609 # EV_ABS / ABS_MT_POSITION_X 2609
E: 0.000001 0003 0036 3791 # EV_ABS / ABS_MT_POSITION_Y 3791
E: 0.000001 0003 003a 0054 # EV_ABS / ABS_MT_PRESSURE 54
E: 0.000001 0003 002f 0001 # EV_ABS / ABS_MT_SLOT 1
E: 0.000001 0003 0039 0497 # EV_ABS / ABS_MT_TRACKING_ID 497
E: 0.000001 0003 0035 3012 # EV_ABS / ABS_MT_POSITION_X 3012
E: 0.000001 0003 0036 3088 # EV_ABS / ABS_MT_POSITION_Y 3088
E: 0.000001 0003 003a 0056 # EV_ABS / ABS_MT_PRESSURE 56
E: 0.000001 0001 014a 0001 # EV_KEY / BTN_TOUCH 1
E: 0.000001 0003 0000 2609 # EV_ABS / ABS_X 2609
E: 0.000001 0003 0001 3791 # EV_ABS / ABS_Y 3791
E: 0.000001 0003 0018 0054 # EV_ABS / ABS_PRESSURE 54
E: 0.000001 0001 014d 0001 # EV_KEY / BTN_TOOL_DOUBLETAP 1
E: 0.000001 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +0ms
E: 0.012909 0003 002f 0000 # EV_ABS / ABS_MT_SLOT 0
E: 0.012909 0003 0039 -001 # EV_ABS / ABS_MT_TRACKING_ID -1
E: 0.012909 0003 002f 0001 # EV_ABS / ABS_MT_SLOT 1
E: 0.012909 0003 0039 -001 # EV_ABS / ABS_MT_TRACKING_ID -1
E: 0.012909 0001 014a 0000 # EV_KEY / BTN_TOUCH 0
E: 0.012909 0003 0018 0000 # EV_ABS / ABS_PRESSURE 0
E: 0.012909 0001 014d 0000 # EV_KEY / BTN_TOOL_DOUBLETAP 0
E: 0.012909 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +12ms
This was a really quick two-finger tap that illustrates the tracking IDs nicely. In the first event we get a touch down, then an ABS_MT_SLOT event. This tells us that subsequent events belong to the other slot, so it's the other finger. There too we get a tracking ID + position. In the next event we get an ABS_MT_SLOT to switch back to slot 0. Tracking ID of -1 means that touch ended, and then we see the touch in slot 1 ended too.

Time for a two-finger scroll:


E: 0.000001 0003 0039 0557 # EV_ABS / ABS_MT_TRACKING_ID 557
E: 0.000001 0003 0035 2589 # EV_ABS / ABS_MT_POSITION_X 2589
E: 0.000001 0003 0036 3363 # EV_ABS / ABS_MT_POSITION_Y 3363
E: 0.000001 0003 003a 0048 # EV_ABS / ABS_MT_PRESSURE 48
E: 0.000001 0003 002f 0001 # EV_ABS / ABS_MT_SLOT 1
E: 0.000001 0003 0039 0558 # EV_ABS / ABS_MT_TRACKING_ID 558
E: 0.000001 0003 0035 3512 # EV_ABS / ABS_MT_POSITION_X 3512
E: 0.000001 0003 0036 3028 # EV_ABS / ABS_MT_POSITION_Y 3028
E: 0.000001 0003 003a 0044 # EV_ABS / ABS_MT_PRESSURE 44
E: 0.000001 0001 014a 0001 # EV_KEY / BTN_TOUCH 1
E: 0.000001 0003 0000 2589 # EV_ABS / ABS_X 2589
E: 0.000001 0003 0001 3363 # EV_ABS / ABS_Y 3363
E: 0.000001 0003 0018 0048 # EV_ABS / ABS_PRESSURE 48
E: 0.000001 0001 014d 0001 # EV_KEY / BTN_TOOL_DOUBLETAP 1
E: 0.000001 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +0ms
E: 0.027960 0003 002f 0000 # EV_ABS / ABS_MT_SLOT 0
E: 0.027960 0003 0035 2590 # EV_ABS / ABS_MT_POSITION_X 2590
E: 0.027960 0003 0036 3395 # EV_ABS / ABS_MT_POSITION_Y 3395
E: 0.027960 0003 003a 0046 # EV_ABS / ABS_MT_PRESSURE 46
E: 0.027960 0003 002f 0001 # EV_ABS / ABS_MT_SLOT 1
E: 0.027960 0003 0035 3511 # EV_ABS / ABS_MT_POSITION_X 3511
E: 0.027960 0003 0036 3052 # EV_ABS / ABS_MT_POSITION_Y 3052
E: 0.027960 0003 0000 2590 # EV_ABS / ABS_X 2590
E: 0.027960 0003 0001 3395 # EV_ABS / ABS_Y 3395
E: 0.027960 0003 0018 0046 # EV_ABS / ABS_PRESSURE 46
E: 0.027960 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +27ms
E: 0.051720 0003 002f 0000 # EV_ABS / ABS_MT_SLOT 0
E: 0.051720 0003 0035 2609 # EV_ABS / ABS_MT_POSITION_X 2609
E: 0.051720 0003 0036 3447 # EV_ABS / ABS_MT_POSITION_Y 3447
E: 0.051720 0003 002f 0001 # EV_ABS / ABS_MT_SLOT 1
E: 0.051720 0003 0036 3080 # EV_ABS / ABS_MT_POSITION_Y 3080
E: 0.051720 0003 0000 2609 # EV_ABS / ABS_X 2609
E: 0.051720 0003 0001 3447 # EV_ABS / ABS_Y 3447
E: 0.051720 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +24ms
[...]
E: 0.272034 0003 002f 0000 # EV_ABS / ABS_MT_SLOT 0
E: 0.272034 0003 0039 -001 # EV_ABS / ABS_MT_TRACKING_ID -1
E: 0.272034 0003 002f 0001 # EV_ABS / ABS_MT_SLOT 1
E: 0.272034 0003 0039 -001 # EV_ABS / ABS_MT_TRACKING_ID -1
E: 0.272034 0001 014a 0000 # EV_KEY / BTN_TOUCH 0
E: 0.272034 0003 0018 0000 # EV_ABS / ABS_PRESSURE 0
E: 0.272034 0001 014d 0000 # EV_KEY / BTN_TOOL_DOUBLETAP 0
E: 0.272034 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +30ms
Note that "scroll" is something handled in userspace, so what you see here is just a two-finger move. Everything in there i something we've already seen, but pay attention to the two middle events: as updates come in for each finger, the ABS_MT_SLOT changes before the upates are sent. The kernel filter for identical events is still in effect, so in the third event we don't get an update for the X position on slot 1. The filtering is per-touchpoint, so in this case this means that slot 1 position x is still on 3511, just as it was in the previous event.

That's all you have to remember, really. If you think of evdev as a serialised way of sending an array of touchpoints, with the slots as the indices, then it should be fairly clear. The rest is then just about actually looking at the touch positions and making sense of them.

Exercise: do a pinch gesture on your touchpad. See if you can track the two fingers moving closer together. Then do the same but only move one finger. See how the non-moving finger gets fewer updates.

That's it. There are a few more details to evdev but much of that is just more event types and codes. The few details you really have to worry about when processing events are either documented in libevdev or abstracted away completely. The above should be enough to understand what your device does, and what goes wrong when your device isn't working. Good luck.

[1] If not, file a bug against systemd's hwdb and CC me so we can put corrections in
[2] We treat some MT-capable touchpads as single-touch devices in libinput because the MT data is garbage

While stuck on an airplane, I put together a repository for apitraces with confirmed good images for driver output.  Combined with the piglit patches I pushed, I now have regression testing of actual apps on vc4 (particularly relevant given that I'm working on optimizing one of those apps!)

The flight was to visit the Raspberry Pi Foundation, with the goal of getting something usable for their distro to switch to the open 3D stack.  There's still a giant pile of KMS work to do (HDMI audio, DSI power management, SDTV support, etc.), and waiting for all of that to be regression-free will be a long time.  The question is: what could we do that would get us 3D, even if KMS isn't ready?

So, I put together a quick branch to expose the firmware's display stack as the KMS display pipeline. It's a filthy hack, and loses us a lot of the important new features that the open stack was going to bring (changing video modes in X, vblank timestamping, power management), but it gets us much closer to the featureset of the previous stack.  Hopefully they'll be switching to it as the default in new installs soon.

In debugging while I was here, Simon found that on his HDMI display the color ramps didn't quite match between closed and open drivers.  After a bit of worrying about gamma ramp behavior, I figured out that it was actually that the monitor was using a CEA mode that requires limited range RGB input.  A patch is now on the list.

September 17, 2016

Tickets for systemd 2016 Workshop day still available!

We still have a number of tickets for the workshop day of systemd.conf 2016 available. If you are a newcomer to systemd and would like to learn about various systemd facilities, or if you already know your way around but would like to know more: this is the best chance to do so. The workshop day is the 28th of September, one day before the main conference, at the betahaus in Berlin, Germany. The schedule for the day is available here. There are five interesting, extensive sessions, run by the systemd hackers themselves. Who better to learn systemd from than the folks who wrote it?

Note that the workshop day and the main conference days require different tickets. (Also note: there are still a few tickets available for the main conference!).

Buy a ticket here.

See you in Berlin!

September 16, 2016

libinput's touchpad acceleration is the cause for a few bugs and outcry from a quite vocal (maj|in)ority. A common suggestion is "make it like the synaptics driver". So I spent a few hours going through the pointer acceleration code to figure out what xf86-input-synaptics actually does (I don't think anyone knows at this point) [1].

If you just want the TLDR: synaptics doesn't use physical distances but works in device units coupled with a few magic factors, also based on device units. That pretty much tells you all that's needed.

Also a disclaimer: the last time some serious work was done on acceleration was in 2008/2009. A lot of things have changed since and since the server is effectively un-testable, we ended up with the mess below that seems to make little sense. It probably made sense 8 years ago and given that most or all of the patches have my signed-off-by it must've made sense to me back then. But now we live in the glorious future and holy cow it's awful and confusing.

Synaptics has three options to configure speed: MinSpeed, MaxSpeed and AccelFactor. The first two are not explained beyond "speed factor" but given how accel usually works, let's assume they all somehow should work as a multiplication on the delta (so a factor of 2 on a delta of dx/dy gives you 2dx/2dy). AccelFactor is documented as "acceleration factor for normal pointer movements", so clearly the documentation isn't going to help clear up any confusion.

I'll skip the fact that synaptics also has a pressure-based motion factor with four configuration options because oh my god what have we done. Also, that one is disabled by default and has no effect unless set by the user. And I'll also only handle default values here, I'm not going to get into examples with configured values.

Also note: synaptics has a device-specific acceleration profile (the only driver that does) and thus the acceleration handling is split between the server and the driver.

Ok, let's get started. MinSpeed and MaxSpeed default to 0.4 and 0.7. The MinSpeed is used to set constant acceleration (1/min_speed) so we always apply a 2.5 constant acceleration multiplier to deltas from the touchpad. Of course, if you set constant acceleration in the xorg.conf, then it overwrites the calculated one.

MinSpeed and MaxSpeed are mangled during setup so that MaxSpeed is actually MaxSpeed/MinSpeed and MinSpeed is always 1.0. I'm not 100% sure why, but the later clipping to the min/max speed range ensures that we never go below a 1.0 acceleration factor (and thus never decelerate).

The AccelFactor default is 200/diagonal-in-device-coordinates. On my T440s it's thus 0.04 (and will be roughly the same for most PS/2 Synaptics touchpads). But on a Cyapa with a different axis range it is 0.125. On a T450s it's 0.035 when booted into PS2 and 0.09 when booted into RMI4. Admittedly, the resolution halves under RMI4 so this possibly maybe makes sense. It doesn't quite make as much sense when you consider the x220t, which also has a factor of 0.04 even though its touchpad is only half the size of the T440s'.

There's also a magic constant "corr_mul" which is set as:


/* synaptics seems to report 80 packet/s, but dix scales for
 * 100 packet/s by default. */
pVel->corr_mul = 12.5f; /* 1000[ms]/80[/s] = 12.5 */
It's correct that the frequency is roughly 80Hz but I honestly don't know what the 100 packets/s reference refers to. Either way, it means that we always apply a factor of 12.5, regardless of the timing of the events. Ironically, this one is hardcoded and not configurable unless you happen to know that it's the X server option VelocityScale or ExpectedRate (both of them set the same variable).

Ok, so we have three factors. 2.5 as a function of MinSpeed, 12.5 because of 80Hz (??) and 0.04 for the diagonal.

When the synaptics driver calculates a delta, it does so in device coordinates and ignores the device resolution (because this code pre-dates devices having resolutions). That's great until you have a device with uneven resolutions like the x220t. That one has 75 and 129 units/mm for x and y, so for any physical movement you're going to get almost twice as many units for y than for x. Which means that if you move 5mm to the right you end up with a different motion vector (and thus acceleration) than when you move 5mm south.

The core X protocol actually defines how acceleration is supposed to be handled. Look up the man page for XChangePointerControl(), it sets a threshold and an accel factor:

The XChangePointerControl function defines how the pointing device moves. The acceleration, expressed as a fraction, is a multiplier for movement. For example, specifying 3/1 means the pointer moves three times as fast as normal. The fraction may be rounded arbitrarily by the X server. Acceleration only takes effect if the pointer moves more than threshold pixels at once and only applies to the amount beyond the value in the threshold argument.
Of course, "at once" is a bit of a blurry definition outside of maybe theoretical physics. Consider the definition of "at once" for a gaming mouse with 500Hz sampling rate vs. a touchpad with 80Hz (let us fondly remember the 12.5 multiplier here) and the above description quickly dissolves into ambiguity.

Anyway, moving on. Let's say the server just received a delta from the synaptics driver. The pointer accel code in the server calculates the velocity over time, basically by doing a hypot(dx, dy)/dtime-to-last-event. Time in the server is always in ms, so our velocity is thus in device-units/ms (not adjusted for device resolution).

Side-note: the velocity is calculated across several delta events so it gets more accurate. There are some checks though so we don't calculate across random movements: anything older than 300ms is discarded, anything not in the same octant of movement is discarded (so we don't get a velocity of 0 for moving back/forth). And there are two calculations to make sure we only calculate while the velocity is roughly the same and don't average between fast and slow movements. I have my doubts about these, but until I have some more concrete data let's just say this is accurate (although since the whole lot is in device units, it probably isn't).

Anyway. The velocity is multiplied with the constant acceleration (2.5, see above) and our 12.5 magic value. I'm starting to think that this is just broken and would only make sense if we used a delta of "event count" rather than milliseconds.

It is then passed to the synaptics driver for the actual acceleration profile. The first thing the driver does is remove the constant acceleration again, so our velocity is now just v * 12.5. According to the comment this brings it back into "device-coordinate based velocity" but this seems wrong or misguided since we never changed into any other coordinate system.

The driver applies the accel factor (0.04, see above) and then clips the whole lot into the MinSpeed/MaxSpeed range (which is adjusted to move MinSpeed to 1.0 and scale up MaxSpeed accordingly, remember?). After the clipping, the pressure motion factor is calculated and applied. I skipped this above but it's basically: the harder you press the higher the acceleration factor. Based on some config options. Amusingly, pressure motion has the potential to exceed the MinSpeed/MaxSpeed options. Who knows what the reason for that is...

Oh, and btw: the clipping is actually done based on the accel factor that XChangePointerControl passes into the acceleration function here. The code is


double acc = factor from XChangePointerControl();
double factor = the magic 0.04 based on the diagonal;

double accel_factor = velocity * factor;
if (accel_factor > MaxSpeed * acc)
    accel_factor = MaxSpeed * acc;
So we have a factor set by XChangePointerControl() but it's only used to determine the maximum factor we may have, and then we clip to that. I'm missing some cross-dependency here because this is what the GUI acceleration config bits hook into. Somewhere this sets things and changes the acceleration by some amount but it wasn't obvious to me.

Alrighty. We have a factor now that's returned to the server and we're back in normal pointer acceleration land (i.e. not synaptics-specific). Woohoo. That factor is averaged across 4 events using Simpson's rule to smooth out abrupt changes. Not sure this really does much, I don't think we've ever done any evaluation on that. But it looks good on paper (we have that in libinput as well).

Now the constant accel factor is applied to the deltas. So far we've added the factor, removed it (in synaptics), and now we're adding it again. Which also makes me wonder whether we're applying the factor twice to all other devices, but right now I'm past the point where I really want to find out. With all the above, our acceleration factor is, more or less:


f = units/ms * 12.5 * (200/diagonal) * (1.0/MinSpeed)
and the deltas we end up using in the server are

(dx, dy) = f * (dx, dy)
But remember, we're still in device units here (not adjusted for resolution).
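If you want to play with the numbers, here is the formula above as a hedged, stand-alone C helper, using the default values discussed in this post (MinSpeed 0.4, the 12.5 rate constant, and the 200/diagonal AccelFactor); the clipping against the MinSpeed/MaxSpeed range and XChangePointerControl is left out for brevity, and the function name is made up.


/* Sketch of the effective synaptics acceleration factor described above.
 * velocity is in device-units/ms, diagonal in device units. */
double synaptics_accel_factor(double velocity, double diagonal)
{
    const double min_speed = 0.4;               /* default MinSpeed */
    const double corr_mul  = 12.5;              /* the 80Hz magic constant */
    const double accel     = 200.0 / diagonal;  /* default AccelFactor */

    return velocity * corr_mul * accel * (1.0 / min_speed);
}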

Anyway. You think we're finished? Oh no, the real fun bits start now. And if you haven't headdesked in a while, now is a good time.

After acceleration, the server does some scaling because synaptics is an absolute device (with axis ranges) in relative mode [2]. Absolute devices are mapped into the whole screen by default but when they're sending relative events, you still want a 45 degree line on the device to map into 45 degree cursor movement on the screen. The server does this by adjusting dy in-line with the device-to-screen-ratio (taking device resolution into account too). On my T440s this means:


touchpad x:y is 1:1.45 (16:11)
screen is 1920:1080, i.e. 1:1.77 (16:9)

dy scaling is thus: (16:11)/(16:9) = 9:11 -> y * 11/9
dx is left as-is. Now you have the delta that's actually applied to the cursor. Except that we're in device coordinates, so we map the current cursor position to device coordinates, then apply the delta, then map back into screen coordinates (i.e. pixels). You may have spotted the flaw here: when the screen size changes, the dy scaling changes and thus the pointer feel. Plug in another monitor, and touchpad acceleration changes. Also: the same touchpad feels different on laptops when their screen hardware differs.

Ok, let's wrap this up. Figuring out what the synaptics driver does is... "tricky". It seems much like a glorified random number scheme. I'm not planning to implement "exactly the same acceleration as synaptics" in libinput because this would be insane and despite my best efforts, I'm not that yet. Collecting data from synaptics users is almost meaningless, because no two devices really employ the same acceleration profile (touchpad axis ranges + screen size) and besides, there are 11 configuration options that all influence each other.

What I do plan though is collect more motion data from a variety of touchpads and see if I can augment the server enough that I can get a clear picture of how motion maps to the velocity. If nothing else, this should give us some picture on how different the various touchpads actually behave.

But regardless, please don't ask me to "just copy the synaptics code".

[1] fwiw, I had this really great idea of trying to get behind all this, with diagrams and everything. But then I found myself printing json data from the X server into the journal, to be scooped up by sed and a python script to print velocity data. And I questioned some of my life choices.
[2] why the hell do we do this? Because synaptics at some point became a device that announces the axis ranges (seemed to make sense at the time, 2008) and then other things started depending on it, and with all the fixes to the server to handle absolute devices in relative mode (for tablets) we painted ourselves into a corner. Synaptics should switch back to being a relative device, but last I tried it breaks pointer acceleration and that a) makes the internets upset and b) restoring the "correct" behaviour is, well, you've read the article so far, right?

September 14, 2016
Last week I spent working on the glmark2 performance issues.  I now have a NIR patch out for the pathological conditionals test (it's now faster than on the old driver), and a branch for job shuffling (+17% and +27% on the two desktop tests).

Here's the basic idea of job shuffling:

We're a tiled renderer, and tiled renderers get their wins from having a Clear at the start of the frame (indicating we don't need to load any previous contents into the tile buffer).  When your frame is done, we flush each tile out to memory.  If you do your clear, start rendering some primitives, and then switch to some other FBO (because you're rendering to a texture that you're planning on texturing from in your next draw to the main FBO), we have to flush out all of those tiles, start rendering to the new FBO, and flush its rendering, and then when you come back to the main FBO we have to reload your old cleared-and-a-few-draws tiles.

Job shuffling deals with this by separating the single GL command stream into separate jobs per FBO.  When you switch to your temporary FBO, we don't flush the old job, we just set it aside.  To make this work we have to add tracking for which buffers have jobs writing into them (so that if you try to read those from another job, we can go flush the job that wrote it), and which buffers have jobs reading from them (so that if you try to write to them, they can get flushed so that they don't get incorrectly updated contents).
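To make the dependency tracking a bit more concrete, here is a rough, hypothetical sketch of the bookkeeping described above (not the actual vc4 code or data structures): before a job reads a buffer we flush any job writing it, and before a job writes a buffer we flush any jobs currently reading or writing it.


/* Hypothetical sketch of the per-FBO job tracking described above.  In a
 * real driver the resource -> job lookups would be hash tables. */
struct job;

struct resource {
    struct job *writer;       /* job with pending writes to this buffer */
    struct job *readers[8];   /* jobs with pending reads (toy fixed size) */
    int num_readers;
};

extern void flush_job(struct job *job);   /* submits the job's tiles */

static void flush_readers(struct resource *res)
{
    for (int i = 0; i < res->num_readers; i++)
        flush_job(res->readers[i]);
    res->num_readers = 0;
}

/* Called when a job is about to read (e.g. texture from) a buffer. */
void job_read_resource(struct job *job, struct resource *res)
{
    if (res->writer && res->writer != job) {
        flush_job(res->writer);           /* make the written data visible */
        res->writer = NULL;
    }
    if (res->num_readers < 8)
        res->readers[res->num_readers++] = job;
}

/* Called when a job is about to render to a buffer. */
void job_write_resource(struct job *job, struct resource *res)
{
    flush_readers(res);          /* readers must not see the new contents */
    if (res->writer && res->writer != job)
        flush_job(res->writer);
    res->writer = job;
}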

This felt like it should have been harder than it was, and there's a spot where I'm using a really bad data structure I had laying around, but that data structure has been bad news since the driver was imported and it hasn't been high in any profiles yet.  The other tests don't seem to have any problem with the possible increased CPU overhead.

The shuffling branch also unearthed a few bugs related to clearing and blitting in the multisample tests.  Some of the piglit cases involved are fixed, but some will be reporting new piglit "regressions" because the tests are now closer to working correctly (sigh, reftests).

I also started writing documentation for updating the system's X and Mesa stack on Raspbian for one of the Foundation's developers.  It's not polished, and if I was rewriting it I would use modular's build.sh instead of some of what I did there.  But it's there for reference.
September 11, 2016
So I've got an RK3188 tablet whose accelerometer calibration was completely off. I've spent quite some time figuring out how to fix this, so I wanted to share the fix. Note this will only work on some tablets. First let's see if this fix will work for your tablet; from an adb shell do:

cat /proc/acc_info

If that file exists it will contain something like this:

root@rk31board:/ # cat /proc/acc_info                                        
name:bma250
units:256
dir:7
offset:8 0 -4


If it does not exist, then this fix will not work for you. If the numbers after offset are all close to 0 like above, then you likely do not need this fix, but you can still try it.

So rockchip tablets with this particular file store accelerometer calibration data in something which rockchip calls the "sys sector" of nand, if you look in dmesg you will see "rknand_sys_storage_ioctl" messages there. The problem with this approach is that the "sys sector" data survives a factory reset so once the calibration data is off, it stays off, unless we manually force a recalibrate.

Note that the next steps need root. Make sure that the tablet is lying flat on a level surface and that it is not sleeping (otherwise it will not calibrate until you wake it up, and it will register the power-button press during the calibration);
then do:

echo 1 > /proc/acc_cal

Now run "dmesg | grep acc", you should see something like this:

<4>[ 1497.846716] acc calibrating 1
<4>[ 1497.875330] acc new offset 8 0 0


If you do, congrats: you've successfully recalibrated your tablet.
September 09, 2016

A great new feature has been merged during this 1.19 X server development cycle: we're now using threads for input [1]. Previously, there were two options for how an input driver would pass on events to the X server: polling or from within the signal handler. Polling simply adds all input devices' file descriptors to a select(2) loop that is processed in the mainloop of the server. The downside here is that if the server is busy rendering something, your input is delayed until that rendering is complete. Historically, polling was primarily used by the keyboard driver because it just doesn't matter much when key strokes are delayed. Both because you need the client to render them anyway (which it can't when it's busy) and possibly also because we're just so bloody used to typing delays.

The signal handler approach circumvented the delays by installing a SIGIO handler for each input device fd and calling that when any input occurs. This effectively interrupts the process until the signal handler completes, regardless of what the server is currently busy with. A great solution to provide immediate visible cursor movement (hence it is used by evdev, synaptics, wacom, and most of the now-retired legacy drivers) but it comes with a few side effects. First of all, because the main process is interrupted, the bit where we read the events must be completely separate to the bit where we process the events. That's easy enough, we've had an input event queue in the server for as long as I've been involved with X.Org development (~2006). The drivers push events into the queue during the signal handler, in the main loop the server reads them and processes them. In a busy server that may be several seconds after the pointer motion was performed on the screen but hey, it still feels responsive.

The bigger issue with the use of a signal handler is: you can't use malloc [2]. Or anything else useful. Look at the man page for signal(7), it literally has a list of allowed functions. This leads to two weird side-effects: one is that you have to pre-allocate everything you may ever need for event processing, the other is that you need to re-implement any function that is not currently async signal safe. The server actually has its own implementation of printf for this reason (for error logging). Let's just say this is ... suboptimal. Coincidentally, libevdev is mostly async signal safe for that reason too. It also means you can't use any libraries, because no-one [3] is insane enough to make libraries async signal-safe.

We were still mostly "happy" with it until libinput came along. libinput is a full input stack and expecting it to work within a signal handler is somewhere between optimistic, masochistic and sadistic. The xf86-input-libinput driver doesn't use the signal handler, and the side effect of this is that a desktop with libinput didn't feel as responsive when the server was busy rendering.

Keith Packard stepped in and switched the server from the signal handler to using input threads. Or more specifically: one input thread on top of the main thread. That thread controls all the input devices' file descriptors and continuously reads events off them. It otherwise provides the same functionality the signal handler did before: visible pointer movement and shoving events into the event queue for the main thread to process them later. But of course, once you switch to threads, problems have 2 you now. A signal handler is "threading light": only one code path can be interrupted, and you know you continue where you left off. So synchronisation primitives are easier than with threads, where both code paths continue independently. Keith replaced the previous xf86BlockSIGIO() calls with corresponding input_lock() and input_unlock() calls and all the main drivers have been switched over. Some interesting race conditions kept happening, but as of today we think most of these are solved.

The best test we have at this point is libinput's internal test suite. It creates roughly 5000 devices within about 4 minutes and thus triggers most code paths to do with device addition and removal, especially the overlaps between devices sending events before/during/after they get added and/or removed. This is the largest source of possible errors as these are the code paths with the most amount of actual simultaneous access to the input devices by both threads. But what the test suite can't test is normal everyday use. So until we get some more code maturity, expect the occasional crash and please do file bug reports. They'll be hard to reproduce and detect, but don't expect us to run into the same race conditions by accident.

[1] Yes, your calendar is right, it is indeed 2016, not the 90s or so
[2] Historical note: we actually mostly ignored this until about 2010 or so when glibc changed the malloc implementation and the server was just randomly hanging whenever we tried to malloc from within the signal handler. Users claimed this was bad UX, but I think it's right up there with motif.
[3] yeah, yeah, I know, there's always exceptions.

September 08, 2016

When working with timestamps, one question that often arises is the precision of those timestamps. Most software is good enough with a precision up to the second, and that's easy. But in some cases, like working on metering, a finer precision is required.

I don't know exactly why¹, and it makes me suffer every day, but OpenStack is really tied to MySQL (and its clones). It hurts because MySQL is a very poor solution if you want to leverage your database to actually solve problems. But that's how life is, unfair. And in the context of the projects I work on, that boils down to the fact that we can't afford not to support MySQL.

So here we are, needing to work with MySQL and at the same time requiring timestamp with a finer precision than just seconds. And guess what: MySQL did not support that until 2011.

No microseconds in MySQL? No problem: DECIMAL!

MySQL 5.6.4 (released in 2011), a beta version of MySQL 5.6 (hello MySQL, ever heard of Semantic Versioning?), brought microsecond precision to timestamps. But the first stable version supporting that, MySQL 5.6.10, was only released in 2013. So for a long time, there was a problem without any solution.

The obvious workaround, in this case, is to reassess your choices in technologies, discover that PostgreSQL has supported microsecond precision for at least a decade, and problem solved.

This is not what happened in our case, and in order to support MySQL, one had to find a workaround. And that's what was done in our Ceilometer project, using a DECIMAL type instead of DATETIME.

The DECIMAL type takes 2 arguments: the total number of digits you need to store, and how many of that total will be used for the fractional part. Knowing that MySQL's internal storage uses 1 byte for 2 digits, 2 bytes for 4 digits, 3 bytes for 6 digits and 4 bytes for 9 digits, and that each part is stored independently, in order to maximize your storage space you want to pick a number of digits that fits those boundaries nicely.

This is why Ceilometer picked 14 for the integer part (9 digits on 4 bytes and 5 digits on 3 bytes) and 6 for the decimal part (3 bytes).

Wait. It's stupid because:

  • DECIMAL(20, 6) implies that you use 14 digits for the integer part, which using epoch as a reference makes you able to encode timestamps up to (10^14) - 1, which is year 3170843. I am certain Ceilometer won't last that far.
  • 14 digits is 9 + 5 digits in MySQL, which is 7 bytes, the same size that is used for 9 + 6 digits. So you could have had DECIMAL(21, 6) for the same storage space (and gone up to year 31690708, which is a nice bonus, right?)

Well, I guess the original author of the patch did not read the documentation entirely (DECIMAL(20, 6) appears as an example on the MySQL documentation page, so I imagine it was just copy-pasted blindly?).

The best choice for this use case would have been DECIMAL(17, 6), which would allow storing 11 digits for the integer part (5 bytes), supporting timestamps up to (10^11)-1 (year 5138), and 6 digits for the decimal part (3 bytes), using only 8 bytes in total per timestamp.
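For the curious, here is a quick sanity check of those sizes, using the packing rule described above. This is just illustrative arithmetic, not anything from Ceilometer:

# Back-of-the-envelope check of DECIMAL storage sizes, following the packing
# rule described in the text: integer and fractional parts are stored
# independently, each group of 9 digits costs 4 bytes, and leftover digits
# cost roughly 1 byte per 2 digits (4 bytes for 7-9 leftover digits).
LEFTOVER_BYTES = {0: 0, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7: 4, 8: 4, 9: 4}

def digits_bytes(ndigits):
    groups, leftover = divmod(ndigits, 9)
    return groups * 4 + LEFTOVER_BYTES[leftover]

def decimal_bytes(precision, scale):
    # precision = total digits, scale = fractional digits.
    return digits_bytes(precision - scale) + digits_bytes(scale)

for p, s in [(20, 6), (21, 6), (17, 6)]:
    print("DECIMAL(%d, %d): %d bytes" % (p, s, decimal_bytes(p, s)))
# DECIMAL(20, 6): 10 bytes
# DECIMAL(21, 6): 10 bytes  (same cost as (20, 6))
# DECIMAL(17, 6): 8 bytes   (the sweet spot)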

Nonetheless, this workaround has been implemented using a SQLAlchemy custom type and works as expected:

import sqlalchemy


class PreciseTimestamp(sqlalchemy.types.TypeDecorator):
    """Represents a timestamp precise to the microsecond."""

    impl = sqlalchemy.DateTime

    def load_dialect_impl(self, dialect):
        if dialect.name == 'mysql':
            return dialect.type_descriptor(
                sqlalchemy.types.DECIMAL(precision=20,
                                         scale=6,
                                         asdecimal=True))
        return dialect.type_descriptor(self.impl)


Microseconds in MySQL? Damn, migration!

As I said, MySQL 5.6.4 brought microseconds precision to the table (pun intended). Therefore, it's a great time to migrate away from this hackish format to the brand new one.

First, be aware that the default DATETIME type has no microsecond precision: you have to specify how many fractional digits you want as an argument. To support microseconds, you should therefore use DATETIME(6).

If we were using a great RDBMS, let's say, hum, PostgreSQL, we could do that very easily, see:

postgres=# CREATE TABLE foo (mytime decimal);
CREATE TABLE
postgres=# \d foo
     Table "public.foo"
 Column │  Type   │ Modifiers
────────┼─────────┼───────────
 mytime │ numeric │
postgres=# INSERT INTO foo (mytime) VALUES (1473254401.234);
INSERT 0 1
postgres=# ALTER TABLE foo ALTER COLUMN mytime SET DATA TYPE timestamp with time zone USING to_timestamp(mytime);
ALTER TABLE
postgres=# \d foo
            Table "public.foo"
 Column │           Type           │ Modifiers
────────┼──────────────────────────┼───────────
 mytime │ timestamp with time zone │

postgres=# select * from foo;
           mytime
────────────────────────────
 2016-09-07 13:20:01.234+00
(1 row)


And since this is a pretty common use case, it's even an example in the PostgreSQL documentation. The version from the documentation uses a calculation based on epoch, whereas my example here leverages the to_timestamp() function. That's my personal touch.

Obviously, doing this conversion in a single line is not possible with MySQL: it does not implement the USING keyword on ALTER TABLE … ALTER COLUMN. So what's the solution going to be? Well, it's a 4-step job:

  1. Create a new column of type DATETIME(6)
  2. Copy data from the old column to the new column, converting them to the new format
  3. Delete the old column
  4. Rename the new column to the old column name.

But I know what you're thinking: there are 4 steps, but that's not a problem, we'll just use a transaction and embed these operations inside.

MySQL does not support transactions on data definition language (DDL) statements. So if any of those steps fails, you'll be unable to roll back steps 1, 3 and 4. Who knew that using MySQL was like living on the edge, right?

Doing this in Python with our friend Alembic

I like Alembic. It's a Python library based on SQLAlchemy that handles schema migration for your favorite RDBMS.

Once you've created a new alembic migration script using alembic revision, it's time to edit it and write something along these lines:

from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects import mysql
from sqlalchemy.sql import func


class Timestamp(sa.types.TypeDecorator):
    """Represents a timestamp precise to the microsecond."""

    impl = sa.DateTime

    def load_dialect_impl(self, dialect):
        if dialect.name == 'mysql':
            return dialect.type_descriptor(mysql.DATETIME(fsp=6))
        return self.impl


def upgrade():
    bind = op.get_bind()
    if bind and bind.engine.name == "mysql":
        existing_type = sa.types.DECIMAL(
            precision=20, scale=6, asdecimal=True)
        existing_col = sa.Column("mytime", existing_type, nullable=False)
        temp_col = sa.Column("mytime_ts", Timestamp(), nullable=False)
        # Step 1: ALTER TABLE mytable ADD COLUMN mytime_ts DATETIME(6)
        op.add_column("mytable", temp_col)
        t = sa.sql.table("mytable", existing_col, temp_col)
        # Step 2: UPDATE mytable SET mytime_ts=from_unixtime(mytime)
        op.execute(t.update().values(mytime_ts=func.from_unixtime(existing_col)))
        # Step 3: ALTER TABLE mytable DROP COLUMN mytime
        op.drop_column("mytable", "mytime")
        # Step 4: ALTER TABLE mytable CHANGE mytime_ts mytime
        # Note: MySQL needs to have all the old/new information to just rename a column…
        op.alter_column("mytable",
                        "mytime_ts",
                        nullable=False,
                        type_=Timestamp(),
                        existing_nullable=False,
                        existing_type=existing_type,
                        new_column_name="mytime")


In MySQL, the function to convert a float to a UNIX timestamp is from_unixtime(), so the script leverages it to convert the data. As said, you'll notice we don't bother using any kind of transaction, so if anything goes wrong there's no rollback, and it won't be possible to re-run the migration without manual intervention.

Timestamp is a custom class that implements sqlalchemy.DateTime using a DATETIME(6) type for MySQL, and a regular sqlalchemy.DateTime type for other back-ends. It is used by the rest of the code (e.g. the ORM model), but I've pasted it in this example for a better understanding.

Once written, you can easily test your migration using pifpaf to run a temporary database:

$ pifpaf run mysql $SHELL
$ alembic -c alembic/alembic.ini upgrade 1c98ac614015 # upgrade to the initial revision
$ mysql -S $PIFPAF_MYSQL_SOCKET pifpaf
mysql> INSERT INTO mytable (mytime) VALUES (1325419200.213000);
Query OK, 1 row affected (0.00 sec)
 
mysql> SELECT * FROM mytable;
+-------------------+
| mytime            |
+-------------------+
| 1325419200.213000 |
+-------------------+
1 row in set (0.00 sec)
 
$ alembic -c alembic/alembic.ini upgrade head
 
$ mysql -S $PIFPAF_MYSQL_SOCKET pifpaf
mysql> SELECT * FROM mytable;
+----------------------------+
| mytime                     |
+----------------------------+
| 2012-01-01 13:00:00.213000 |
+----------------------------+
1 row in set (0.00 sec)


And voilà, we just unsafely migrated our data to a new fancy format. Thank you Alembic for solving a problem we would not have without MySQL. 😊

September 07, 2016
Last week I was tasked with making performance comparisons between vc4 and the closed driver possible.  I decided to take glmark2 and port it to dispmanx, and submitted a pull request upstream.  It already worked on X11 on the vc4 driver, and I fixed the drm backend to work as well (though the drm backend has performance problems specific to the glmark2 code).

Looking at glmark2, vc4 has a few bugs.  Terrain has some rendering bugs.  The driver on master took a big performance hit on one of the conditionals tests since the loops support was added, because NIR isn't aggressive enough in flattening if statements.  Some of the tests require that we shuffle rendering jobs to avoid extra frame store/loads.  Finally, we need to use the multithreaded fragment shader mode to hide texture fetching latency on a bunch of tests.  Despite the bugs, results looked good.

(Note: I won't be posting comparisons here.  Comparisons will be left up to the reader on their particular hardware and software stacks).

I'm expecting to get to do some glamor work on vc4 again soon, so I spent some of the time while I was waiting for Raspberry Pi builds working on the X Server's testing infrastructure.  I've previously talked about Travis CI, but for it to be really useful it needs to run our integration tests.  I fixed up the piglit XTS test wrapper to not spuriously fail, made the X Test suite spuriously fail less, and worked with Adam Jackson at Red Hat to fix build issues in XTS.  Finally, I wrote scripts that will, if you have an XTS tree and a piglit tree and build Xvfb, actually run the XTS rendering tests at xserver make check time.

Next steps for xserver testing are to test glamor in a similar fashion, and to integrate this testing into travis-ci and land the travis-ci branch.

Finally, I submitted pull requests to the upstream kernel.  4.8 got some fixes for VC4 3D (merged by Dave), and 4.9 got interlaced vblank timing patches from Mario Kleiner (not yet merged by Dave) and Raspberry Pi Zero support (merged by Florian).
September 06, 2016

On Fedora, if you have mate-desktop or cinnamon-desktop installed, your GNOME touchpad configuration panel won't work (see Bug 1338585). Both packages install a symlink to assign the synaptics driver to the touchpad. But GNOME's control-center does not support synaptics anymore, so no touchpad is detected. Note that the issue occurs regardless of whether you use MATE/Cinnamon; merely installing either package is enough.

Unfortunately, there is no good solution to this issue. Long-term both MATE and Cinnamon should support libinput but someone needs to step up and implement it. We don't support run-time driver selection in the X server, so an xorg.conf.d snippet is the only way to assign a touchpad driver. And this means that you have to decide whether GNOME's or MATE/Cinnamon's panel is broken at X start-up time.

If you need the packages installed but you're not actually using Mate/Cinnamon itself, remove the following symlinks (whichever is present on your system):


# rm /etc/X11/xorg.conf.d/99-synaptics-mate.conf
# rm /etc/X11/xorg.conf.d/99-synaptics-cinnamon.conf
# rm /usr/share/X11/xorg.conf.d/99-synaptics-mate.conf
# rm /usr/share/X11/xorg.conf.d/99-synaptics-cinnamon.conf
The /usr/share paths are the old ones and have been replaced with the /etc/ symlinks in cinnamon-desktop-3.0.2-2.fc25 and mate-desktop-1.15.1-4.fc25 and their F24 equivalents.

I'm using T450 and T460 as reference but this affects all laptops from the Lenovo *50 and *60 series. The Lenovo T450 and T460 have the same touchpad hardware, but unfortunately it suffers from what is probably a firmware issue. On really slow movements, the pointer has a halting motion. That effect disappears when the finger moves faster.

The observable effect is that of a pointer stalling, then jumping by 20 or so pixels. We have had a quirk for this in libinput since March 2016 (see commit a608d9) and detect this at runtime for selected models. In particular, what we do is look for a sequence of events that only update the pressure values but not the x/y position of the finger. This is a good indication that the bug triggers. While it's possible to trigger pressure changes alone, triggering several in a row without a change in the x/y coordinates is extremely unlikely. Remember that these touchpads have a resolution of ~40 units per mm - you cannot hold your finger that still while changing pressure [1]. Once we see those pressure changes only we reset the motion history we keep for each touch. The next event with an x/y coordinate will thus not calculate the delta to the previous position and not trigger a move. The event after that is handled normally again. This avoids the extreme jumps but there isn't anything we can do about the stalling - we never get the event from the kernel. [2]
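The quirk's logic roughly amounts to the following sketch. This is not libinput's actual code (which is C); the class, the method names and the counter threshold are invented purely to illustrate the idea of resetting the motion history after pressure-only updates:

# Illustrative sketch only, not libinput's implementation. The threshold of
# two consecutive pressure-only updates is an assumption for this example.
PRESSURE_ONLY_THRESHOLD = 2

class Touch:
    def __init__(self):
        self.pressure_only = 0
        self.history = []          # recent (x, y) samples used for deltas

    def handle(self, x=None, y=None, pressure=None):
        """Return the (dx, dy) pointer delta for one event frame."""
        if x is None and y is None and pressure is not None:
            # Pressure changed but the position didn't: the firmware bug.
            self.pressure_only += 1
            if self.pressure_only >= PRESSURE_ONLY_THRESHOLD:
                self.history.clear()   # forget old positions, avoid the jump
            return (0, 0)
        self.pressure_only = 0
        dx = dy = 0
        if self.history:
            last_x, last_y = self.history[-1]
            dx, dy = x - last_x, y - last_y
        self.history.append((x, y))
        return (dx, dy)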

Anyway. This bug popped up again elsewhere, so this time I figured I'd analyse the data more closely. Specifically, I wrote a script that collected all x/y coordinates of a touchpad recording [3] and produced a black and white image of all device coordinates sent. This produces a graphic that's interesting but not overly useful:


Roughly 37000 touchpad events. You'll have to zoom in to see the actual pixels.
I modified the script to assume a white background and colour any x/y coordinate that was never hit black. So an x coordinate of 50 would now produce a vertical 1 pixel line at 50, a y coordinate of 70 a horizontal line at 70, etc. Any pixel that remains white is a coordinate that is hit at some point, anything black was unreachable. This produced more interesting results. Below is the graphic of a short, slow movement right to left.

A single short slow finger movement
You can clearly see the missing x coordinates. More specifically, there are some events, then a large gap, then events again. That gap is the stalling cursor where we didn't get any x coordinates. My first assumption was that it may be a sensor issue and that some areas on the touchpad just don't trigger. So what I did was move my finger around the whole touchpad to try to capture as many x and y coordinates as possible.
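For reference, the idea behind that second variant of the script is simple enough to sketch. This is not the original script; it assumes evemu-record output as input, looks only at the single-touch ABS_X/ABS_Y axes, and uses Pillow for the image:

# Sketch of the coverage image: collect every ABS_X/ABS_Y value seen in an
# evemu recording, then paint any coordinate never reported as a black line.
import sys
from PIL import Image   # Pillow

EV_ABS, ABS_X, ABS_Y = 0x03, 0x00, 0x01

xs, ys = set(), set()
with open(sys.argv[1]) as recording:            # file from evemu-record
    for line in recording:
        if not line.startswith("E:"):
            continue
        _, _, etype, ecode, value = line.split()[:5]
        if int(etype, 16) != EV_ABS:
            continue
        if int(ecode, 16) == ABS_X:
            xs.add(int(value))
        elif int(ecode, 16) == ABS_Y:
            ys.add(int(value))

width, height = max(xs) + 1, max(ys) + 1
img = Image.new("1", (width, height), 1)        # start with a white background
for x in range(width):
    for y in range(height):
        # Any coordinate that was never hit becomes a black vertical or
        # horizontal line; pixels that remain white were reachable.
        if x not in xs or y not in ys:
            img.putpixel((x, y), 0)
img.save("coverage.png")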

Let's have a look at the recording from a T440 first because it doesn't suffer from this issue:


Sporadic black lines indicating unused coordinates but the center is purely white, indicating every device unit was hit at some point
Ok, looks roughly ok. The black areas are irregular, on the edges and likely caused by me just not covering those areas correctly. In the center it's white almost everywhere, that's where the most events were generated. And now let's compare this to a T450:

A visible grid of unreachable device units
The difference is quite noticeable, especially if you consider that the T440 recording had under 15000 events, the T450 recording had almost 37000. The T450 has a patterned grid of unreachable positions. But why? We currently use the PS/2 protocol to talk to the device but we should be using RMI4 over SMBus instead (which is what Windows has done for a while and luckily the RMI4 patches are on track for kernel 4.9). Once we talk to the device in its native protocol we see a resolution of ~20 units/mm and it looks like the T440 output:

With RMI4, the grid disappears
Ok, so the problem is not missing coordinates in the sensor, and besides, at the resolution the touchpad has, a single 'pixel' not triggering shouldn't be much of a problem anyway.

Maybe the issue had to do with horizontal movements or something? The next approach was for me to move my finger slowly from one side to the other. That's actually hard to do consistently when you're not a robot, so the results are bound to be slightly different. On the T440:


The x coordinates are sporadic with many missing ones, but the y coordinates are all covered
You can clearly see where the finger moved left to right. The big black gaps on the x coordinates mostly reflect me moving too fast but you can see how the distance narrows, indicating slower movements. Most importantly: vertically, the strip is uniformly white, meaning that within that range I hit every y coordinate at least once. And the recording from the T450:

Only one gap in the y range, sporadic gaps in the x range
Well, still looks mostly the same, so what is happening here? Ok, last test: This time an extremely slow motion left to right. It took me 87 seconds to cover the touchpad. In theory this should render the whole strip white if all x coordinates are hit. But look at this:

An extremely slow finger movement
Ok, now we see the problem. This motion was slow enough that almost every x coordinate should have been hit at least once. But there are large gaps and, most notably, larger gaps than in the recording above from a faster finger movement. So what we have here is not an actual hardware sensor issue; the firmware is working against us, filtering things out. Unfortunately, that's also the worst result, because while hardware issues can usually be worked around, firmware issues are a lot more subtle and less predictable. We've also verified that newer firmware versions don't fix this, and trying out some tweaks in the firmware didn't change anything either.

Windows is affected by this too, and so is the synaptics driver. But it's not really noticeable on either, and all reports so far were against libinput, with some even claiming that it doesn't manifest with synaptics. But each time we investigated in more detail, it turned out that the issue is still there (synaptics uses the same kernel data after all), but because of different acceleration methods users just don't trigger it. So my current plan is to change the pointer acceleration to match something closer to what synaptics does on these devices. That's hard because synaptics is mostly black magic (e.g. synaptics' pointer acceleration depends on screen resolution) and hard to reproduce. Either way, until that is sorted, at least this post serves as a link to point people to.

Many thanks to Andrew Duggan from Synaptics and Benjamin Tissoires for helping out with the analysis and testing of all this.

[1] Because pressing down on a touchpad flattens your finger and thus changes the shape slightly. While you can hold a finger still, you cannot control that shape
[2] Yes, predictive movement would be possible but it's very hard to get this right
[3] These are events as provided by the kernel and unaffected by anything in the userspace stack

September 05, 2016

A few weeks ago, I recorded an interview with Krishnan Raghuram about what was discussed for this development cycle for OpenStack Telemetry at the Austin summit.

It's interesting to look back at this video more than 3 months after recording it, and see what actually happened to Telemetry. It turns out that some of the things that I thought were going to happen have not happened yet. As the first release candidate version is approaching, it's very unlikely they will happen.

And on the other side, some new fancy features arrived suddenly without me having a clue about them.

As far as Ceilometer is concerned, here's the list of what really happened in terms of user features:

  • Added full support for SNMP v3 USM model
  • Added support for batch measurement in Gnocchi dispatcher
  • Set ended_at timestamp in Gnocchi dispatcher
  • Allow Swift pollster to specify regions
  • Add L3 cache usage and memory bandwidth meters
  • Split out the event code (REST API and storage) to a new Panko project

And a few other minor things. I planned none of them except Panko (which I was responsible for), and the ones we planned (documentation update, pipeline rework and polling enhancement) did not happen yet.

For Aodh, we expected to rework the documentation entirely too, and that did not happen either. What we did instead:

  • Deprecate and disable combination alarms
  • Add pagination support in REST API
  • Deprecated all non-SQL database store and provide a tool to migrate
  • Support batch notification for aodh-notifier

It's definitely a good list of new features for Aodh: still small, but it simplifies the project, removes technical debt and continues building momentum around it.

For Gnocchi, we really had no plan, except maybe a few small features (they're usually tracked in the Launchpad bug list). It turned out we had some fancy new idea with Gordon Chung on how to boost our storage engine, so we worked on that. That kept us busy for a few weeks in the end, though the preliminary results look tremendous – so it was definitely worth it. We also have an AWS S3 storage driver on its way.

I find this exercise interesting, as it really emphasizes how you can't really control what's happening in any open source project, where your contributors come and go and work on their own agenda.

That does not mean we're dropping the themes and ideas I've laid out in that video. We're still pushing our "documentation is mandatory" policy and improving our "work by default" scenario. It's just a longer road than we expected.

September 03, 2016
Like is usual in GPUs, Radeon executes shaders in waves that execute the same program for many threads or work-items simultaneously in lock-step. Given a single program counter for up to 64 items (e.g. pixels being processed by a pixel shader), branch statements must be lowered to manipulation of the exec mask (unless the compiler can prove the branch condition to be uniform across all items). The exec mask is simply a bit-field that contains a 1 for every thread that is currently active, so code like this:

if (i != 0) {
... some code ...
}
gets lowered to something like this:

v_cmp_ne_i32_e32 vcc, 0, v1
s_and_saveexec_b64 s[0:1], vcc
s_xor_b64 s[0:1], exec, s[0:1]

if_block:
... some code ...

join:
s_or_b64 exec, exec, s[0:1]
(The saveexec assembly instructions apply a bit-wise operation to the exec register, storing the original value of exec in their destination register. Also, we can introduce branches to skip the if-block entirely if the condition happens to be uniformly false.)

This is quite different from CPUs, and so a generic compiler framework like LLVM tends to get confused. For example, the fast register allocator in LLVM is a very simple allocator that just spills all live registers at the end of a basic block before the so-called terminators. Usually, those are just branch instructions, so in the example above it would spill registers after the s_xor_b64.

This is bad because the exec mask has already been reduced by the if-condition at that point, and so vector registers end up being spilled only partially.

Until recently, these issues were hidden by the fact that we lowered the control flow instructions into their final form only at the very end of the compilation process. However, earlier optimization passes, including register allocation, can benefit from seeing the precise shape of the GPU-style control flow. But then, some of the subtleties of the exec masks need to be taken into account by those earlier optimization passes as well.

A related problem arises with another GPU-specific specialty, the "whole quad mode". We want to be able to compute screen-space derivatives in pixel shaders - mip-mapping would not be possible without it - and the way this is done in GPUs is to always run pixel shaders on 2x2 blocks of pixels at once and approximate the derivatives by taking differences between the values for neighboring pixels. This means that the exec mask needs to be turned on for pixels that are not really covered by whatever primitive is currently being rendered. Those are called helper pixels.

However, there are times when helper pixels absolutely must be disabled in the exec mask, for example when storing to an image. A separate pass deals with the enabling and disabling of helper pixels. Ideally, this pass should run after instruction scheduling, since we want to be able to rearrange memory loads and stores freely, which can only be done before adding the corresponding exec-instructions. The instructions added by this pass look like this:

s_mov_b64 s[2:3], exec
s_wqm_b64 exec, exec

... code with helper pixels enabled goes here ...

s_and_b64 exec, exec, s[2:3]

... code with helper pixels disabled goes here ...
Naturally, adding the bit-wise AND of the exec mask must happen in a way that doesn't conflict with any of the exec manipulations for control flow. So some careful coordination needs to take place.

My suggestion is to allow arbitrary instructions at the beginning and end of basic blocks to be marked as "initiators" and "terminators", as opposed to the current situation, where there is no notion of initiators and whether an instruction is a terminator is a property of the opcode. An alternative, which Matt Arsenault is working on, adds aliases for certain exec-instructions which act as terminators. This may well be sufficient; I'm looking forward to seeing the result.
August 31, 2016

In the X server, the input driver assignment is handled by xorg.conf.d snippets. Each driver assigns itself to the type of devices it can handle and the driver that actually loaded is simply the one that sorts last. Historically, we've had the evdev driver sort low and assign itself to everything. synaptics, wacom and the other few drivers that matter sorted higher than evdev and thus assigned themselves to the respective device.

When xf86-input-libinput first came out 2 years ago, we used a higher sort order than all other drivers to assign it to (almost) all devices. This was of course intentional because we believe that libinput is the best input stack around, the odd bug notwithstanding. Now it has matured a fair bit and we have had a lot more exposure to various types of hardware. We've been quirking and fixing things like crazy and libinput is much better for it.

Two things were an issue with this approach though. First, overriding xf86-input-libinput required manual intervention, usually either copying or symlinking an xorg.conf.d snippet. Second, even though we were overriding the default drivers, we still had them installed everywhere. Now it's time to start properly retiring the old drivers.

The upstream approach for this is fairly simple: the xf86-input-libinput xorg.conf.d snippet will drop in sort order to sit above evdev. evdev remains as the fallback driver for miscellaneous devices where evdev's blind "forward everything" approach is sufficient. All other drivers will sort higher than xf86-input-libinput and will thus override the xf86-input-libinput assignment. The new sort order is thus:

  • evdev
  • libinput
  • synaptics, wacom, vmmouse, joystick
evdev and libinput are generic drivers, the others are for specific devices or use-cases. To use a specific driver other than xf86-input-libinput, you now only have to install it. To fall back to xf86-input-libinput, you uninstall it. No more manual xorg.conf.d snippet symlinking.

This has an impact on distributions and users. Distributions should ensure that other drivers are never installed by default unless requested by the user (or some software). And users need to be aware that having a driver other than xf86-input-libinput installed may break things. For example, recent GNOME does not support the synaptics driver anymore, installing it will cause the control panel's touchpad bits to stop working. So there'll be a messy transition period but once things are settled, the solution to most input-related driver bugs will be "install/remove driver $foo" as opposed to the current symlink/copy/write an xorg.conf.d snippet.

August 29, 2016
I spent a day or so last week cleaning up @jonasarrow's demo patch for derivatives on vc4.  It had been hanging around on the github issue waiting for a rework due to feedback, and I decided to finally just go and do it.  It unfortunately involved totally rewriting their patches (which I dislike doing, it's always more awesome to have the original submitter get credit), but we now have dFdx()/dFdy() on Mesa master.

I also landed a fix for GPU hangs with 16 vertex attributes (4 or more vec4s, aka glsl-routing in piglit).  I'd been debugging this one for a while, and finally came up with an idea ("what if this FIFO here is a bad idea to use and we should be synchronous with this external unit?"), it worked, and a hardware developer confirmed that the fix was correct.  This one got a huge explanation comment.  I also fixed discards inside of if/loop statements -- generally discards get lowered out of ifs, but if it's in a non-unrolled loop we were doing discards ignoring whether the channel was in the loop.

Thanks to review from Rhys, I landed Mesa's Travis build fixes.  Rhys then used Travis to test out a couple of fixes to i915 and r600.  This is pretty cool, but it just makes me really want to get piglit into Travis so that we can get some actual integration testing in this process.

I got xserver's Travis to the point of running the unit tests, and one of them crashes on CI but not locally.  That's interesting.

The last GPU hang I have in piglit is in glsl-vs-loops.  This week I figured out what's going on, and I hope I'll be able to write about a fix next week.

Finally, I landed Stefan Wahren's Raspberry Pi Zero devicetree for upstream.  If nothing goes wrong, the Zero should be supported in 4.9.

Lately I have been working on a simple terrain OpenGL renderer demo, mostly to have a playground where I could try some techniques like shadow mapping and water rendering in a scenario with a non-trivial amount of geometry, and I thought it would be interesting to write a bit about it.

But first, here is a video of the demo running on my old Intel IvyBridge GPU:

And some screenshots too:

OpenGL Terrain screenshot 1 OpenGL Terrain screenshot 2
OpenGL Terrain screenshot 3 OpenGL Terrain screenshot 4

Note that I did not create any of the textures or 3D models featured in the video.

With that out of the way, let’s dig into some of the technical aspects:

The terrain is built as a 251×251 grid of vertices elevated with a heightmap texture, so it contains 63,000 vertices and 125,000 triangles. It uses a single 512×512 texture to color the surface.
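As an aside, the vertex and triangle counts fall straight out of the grid layout. Here is a minimal sketch with a dummy heightmap (the demo itself is written against OpenGL; nothing below is its actual code):

# Illustrative sketch: an N x N grid of vertices displaced by a heightmap,
# two triangles per grid cell.
def terrain_mesh(n, heightmap, cell_size=1.0):
    # heightmap(i, j) -> elevation; any callable works for this illustration.
    vertices = [(i * cell_size, heightmap(i, j), j * cell_size)
                for j in range(n) for i in range(n)]
    indices = []
    for j in range(n - 1):
        for i in range(n - 1):
            v0 = j * n + i                       # top-left corner of the cell
            v1, v2, v3 = v0 + 1, v0 + n, v0 + n + 1
            indices += [v0, v2, v1, v1, v2, v3]  # two triangles per cell
    return vertices, indices

verts, idx = terrain_mesh(251, lambda i, j: 0.0)
print(len(verts), len(idx) // 3)   # 63001 vertices, 125000 triangles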

The water is rendered in 3 passes: refraction, reflection and the final rendering. Distortion is done via a dudv map and it also uses a normal map for lighting. From a geometry perspective it is also implemented as a grid of vertices with 750 triangles.

I wrote a simple OBJ file parser so I could load some basic 3D models for the trees, the rock and the plant models. The parser is limited, but sufficient to load vertex data and simple materials. This demo features 4 models with these specs:

  • Tree A: 280 triangles, 2 materials.
  • Tree B: 380 triangles, 2 materials.
  • Rock: 192 triangles, 1 material (textured)
  • Grass: 896 triangles (yes, really!), 1 material.

The scene renders 200 instances of Tree A, another 200 instances of Tree B, 50 instances of Rock and 150 instances of Grass, so 600 objects in total.

Object locations in the terrain are randomized at start-up, but the demo prevents trees and grass from being under water (except maybe for their base section) because it would look very weird otherwise :). Rocks can be fully submerged, though.

Rendered objects fade in and out smoothly via alpha blending (so there is no pop-in/pop-out effect as they reach clipping planes). This cannot be observed in the video because it uses a static camera but the demo supports moving the camera around in real-time using the keyboard.

Lighting is implemented using the traditional Phong reflection model with a single directional light.

Shadows are implemented using a 4096×4096 shadow map and Percentage Closer Filtering with a 3x3 kernel, which, I read, is (or was?) a very common technique for shadow rendering, at least in the times of the PS3 and Xbox 360.

The demo features dynamic directional lighting (that is, the sun light changes position every frame), which is rather taxing. The demo also supports static lighting, which is significantly less demanding.

There is also a slight haze that builds up progressively with the distance from the camera. This can be seen slightly in the video, but it is more obvious in some of the screenshots above.

The demo in the video was also configured to use 4-sample multisampling.

As for the rendering pipeline, it mostly has 4 stages:

  • Shadow map.
  • Water refraction.
  • Water reflection.
  • Final scene rendering.

A few notes on performance as well: the implementation supports a number of configurable parameters that affect the framerate: resolution, shadow rendering quality, clipping distances, multi-sampling, some aspects of the water rendering, N-buffering of dynamic VBO data, etc.

The video I show above runs at a locked 60fps at 800×600, but it uses relatively high quality shadows and dynamic lighting, which are very expensive. Lowering some of these settings (especially turning off dynamic lighting, multisampling and shadow quality) yields framerates around 110fps-200fps. With these settings it can also do fullscreen 1600×900 with an unlocked framerate that varies in the range of 80fps-170fps.

That’s all in the IvyBridge GPU. I also tested this on an Intel Haswell GPU for significantly better results: 160fps-400fps with the “low” settings at 800×600 and roughly 80fps-200fps with the same settings used in the video.

So that’s it for today, I had a lot of fun coding this and I hope the post was interesting to some of you. If time permits I intend to write follow-up posts that go deeper into how I implemented the various elements of the demo and I’ll probably also write some more posts about the optimization process I followed. If you are interested in any of that, stay tuned for more.

August 26, 2016
Clickbait titles for the win!

First up, massive thanks to my major co-conspirator on radv, Bas Nieuwenhuizen, for putting in so much effort on getting radv going.

So where are we at?

Well, this morning I finally found the last bug that was causing missing rendering in Dota 2. We were missing support for a compressed texture format that Dota 2 used. So currently Dota 2 renders. I've no great performance comparison to post yet because my CPU is 5 years old and can barely get close to 30fps with GL or Vulkan. I think we know of a couple of places that could be bottlenecking us on the CPU side. The radv driver is currently missing hyper-z (90% done), fast color clears and DCC, which are all GPU-side speedups in theory. Also, running the phoronix-test-suite dota2 tests works sometimes, hangs in a thread lock sometimes, or crashes sometimes. I think we have some memory corruption somewhere that it collides with.

Other status bits: the Vulkan CTS test suite contains 114598 tests, a piglit run a few hours before I fixed dota2 was at:
[114598/114598] skip: 50388, pass: 62932, fail: 1193, timeout: 2, crash: 83 - |/-\

So that isn't too bad a showing; we know some missing features account for some of the fails. A lot of the crashes are an assert in CTS being hit, which I don't think is a real problem.

We render most of the Sascha Willems demos fine.

I've tested The Talos Principle as well; the texture fix renders a lot more stuff on the screen, but we are still seeing large chunks of blackness where I think there should be trees in-game. The menus etc. all seem to load fine.

All this work is on the semi-interesting branch of
https://github.com/airlied/mesa

It has only been tested on VI AMD GPUs. Polaris worked previously but something derailed it; we should fix it once we finish the bisect. CIK GPUs kinda work with the amdgpu kernel driver loaded. SI GPUs are nowhere yet.

Here's a screenshot:
August 22, 2016
Last week I finally plugged in the camera module I got a while ago to go take a look at what vc4 needs for displaying camera output.

The surprising answer was "nothing."  vc4 could successfully import RGB dmabufs and display them as planes, even though I had been expecting to need fixes on that front.

However, the bcm2835 v4l camera driver needs a lot of work.  First of all, it doesn't use the proper contiguous memory support in v4l (vb2-dma-contig), and instead asks the firmware to copy from the firmware's contiguous memory into vmalloced kernel memory.  This wastes memory and wastes memory bandwidth, and doesn't give us dma-buf support.

Even more, MMAL (the v4l equivalent that the firmware exposes for driving the hardware) wants to output planar buffers with specific padding.  However, instead of using the multi-plane format support in v4l to expose buffers with that padding, the bcm2835 driver asks the firmware to do another copy from the firmware's planar layout into the old no-padding V4L planar format.

As a user of the V4L api, you're also in trouble because none of these formats have any priority information that I can see: The camera driver says it's equally happy to give you RGB or planar, even though RGB costs an extra copy.  I think properly done today, the camera driver would be exposing multi-plane planar YUV, and giving you a mem2mem adapter that could use MMAL calls to turn the planar YUV into RGB.

For now, I've updated the bug report with links to the demo code and instructions.

I also spent a little bit of time last week finishing off the series to use st/nir in vc4.  I managed to get to no regressions, and landed it today.  It doesn't eliminate TGSI, but it does mean TGSI is gone from the normal GLSL path.

Finally, I got inspired to do some work on testing.  I've been doing some free time work on servo, Mozilla's Rust-based web browser, and their development environment has been a delight as a new developer.  All patch submissions, from core developers or from newbies, go through github pull requests.  When you generate a PR, Travis builds and runs the unit tests on the PR.  Then a core developer reviews the code by adding a "r" comment in the PR or provides feedback.  Once it's reviewed, a bot picks up the pull request, tries merging it to master, then runs the full integration test suite on it.  If the test suite passes, the bot merges it to master, otherwise the bot writes a comment with a link to the build/test logs.

Compare this to Mesa's development process.  You make a patch.  You file it in the issue tracker and it gets utterly ignored.  You complain, and someone tells you you got the process wrong, so you join the mailing list and send your patch (and then get a flood of email until you unsubscribe).  It gets mangled by your email client, and you get told to use git-send-email, so you screw around with that for a while before you get an email that will actually show up in people's inboxes.  Then someone reviews it (hopefully) before it scrolls off the end of their inbox, and then it doesn't get committed anyway because your name was familiar enough that the reviewer thought maybe you had commit access.  Or they do land your patch, and it turns out you hadn't run the integration tests and then people complain at you for not testing.

So, as a first step toward making a process like Mozilla's possible, I put some time into fixing up Travis on Mesa, and building Travis support for the X Server.  If I can get Travis to run piglit and ensure that expected-pass tests don't regress, that at least gives us a documentable path for new developers in these two projects to put their code up on github and get automated testing of the branches they're proposing on the mailing lists.
August 16, 2016

Wrapping libudev using LD_PRELOAD

Peter Hutterer and I were chasing down an X server bug which was exposed when running the libinput test suite against the X server with a separate thread for input. This was crashing deep inside libudev, which led us to suspect that libudev was getting run from multiple threads at the same time.

I figured I'd be able to tell by wrapping all of the libudev calls from the server and checking to make sure we weren't ever calling it from both threads at the same time. My first attempt was a simple set of cpp macros, but that failed when I discovered that libwacom was calling libgudev, which was calling libudev.

Instead of recompiling the world with my magic macros, I created a new library which exposes all of the (public) symbols in libudev. Each of these functions does a bit of checking and then simply calls down to the 'real' function.

Finding the real symbols

Here's the snippet which finds the real symbols:

static void *udev_symbol(const char *symbol)
{
    static void *libudev;
    static pthread_mutex_t  find_lock = PTHREAD_MUTEX_INITIALIZER;

    void *sym;
    pthread_mutex_lock(&find_lock);
    if (!libudev) {
        libudev = dlopen("libudev.so.1.6.4", RTLD_LOCAL | RTLD_NOW);
    }
    sym = dlsym(libudev, symbol);
    pthread_mutex_unlock(&find_lock);
    return sym;
}

Yeah, the libudev version is hard-coded into the source; I didn't want to accidentally load the wrong one. This could probably be improved...

Checking for re-entrancy

As mentioned above, we suspected that the bug was caused when libudev got called from two threads at the same time. So, our checks are pretty simple; we just count the number of calls into any udev function (to handle udev calling itself). If there are other calls in process, we make sure the thread ID for those is the same as the current thread.

static void udev_enter(const char *func) {
    pthread_mutex_lock(&check_lock);
    assert (udev_running == 0 || udev_thread == pthread_self());
    udev_thread = pthread_self();
    udev_func[udev_running] = func;
    udev_running++;
    pthread_mutex_unlock(&check_lock);
}

static void udev_exit(void) {
    pthread_mutex_lock(&check_lock);
    udev_running--;
    if (udev_running == 0)
        udev_thread = 0;
    udev_func[udev_running] = 0;
    pthread_mutex_unlock(&check_lock);
}

Wrapping functions

Now, the ugly part -- libudev exposes 93 different functions, with a wide variety of parameters and return types. I constructed a hacky macro, calls for which could be constructed pretty easily from the prototypes found in libudev.h, and which would construct our stub function:

#define make_func(type, name, formals, actuals)     \
    type name formals {                             \
        type ret;                                   \
        static void *f;                             \
        if (!f)                                     \
            f = udev_symbol(__func__);              \
        udev_enter(__func__);                       \
        ret = ((typeof (&name)) f) actuals;         \
        udev_exit();                                \
        return ret;                                 \
    }

There are 93 invocations of this macro (or a variant for void functions) which look much like:

make_func(struct udev *,
      udev_ref,
      (struct udev *udev),
      (udev))

Using udevwrap

To use udevwrap, simply stick the filename of the .so in LD_PRELOAD and run your program normally:

# LD_PRELOAD=/usr/local/lib/libudevwrap.so Xorg 

Source code

I stuck udevwrap in my git repository:

http://keithp.com/cgi-bin/gitweb.cgi?p=udevwrap;a=summary

You can clone it using

$ git clone git://keithp.com/git/udevwrap
August 15, 2016

A Preliminary systemd.conf 2016 Schedule is Now Available!

We have just published a first, preliminary version of the systemd.conf 2016 schedule. There is a small number of white slots in the schedule still, because we're missing confirmation from a small number of presenters. The missing talks will be added in as soon as they are confirmed.

The schedule consists of 5 workshops by high-profile speakers during the workshop day, 22 exciting talks during the main conference days, followed by one full day of hackfests.

Please sign up for the conference soon! Only a limited number of tickets are available, hence make sure to secure yours quickly before they run out! (Last year we sold out.) Please sign up here for the conference!

Last week I mostly worked on getting the upstream work I and others have done into downstream Raspbian (most of that time unfortunately in setting up another Raspbian development environment, after yet another SD card failed).

However, the most exciting thing for most users is that with the merge of the rpi-4.4.y-dsi-stub-squash branch, the DSI display should now come up by default with the open source driver.  This is unfortunately not a full upstreamable DSI driver, because the closed-source firmware is getting in the way of Linux by stealing our interrupts and then talking to the hardware behind our backs.  To work around the firmware, I never talk to the DSI hardware, and we just replace the HVS display plane configuration on the DSI's output pipe.  This means your display backlight is always on and the DSI link is always running, but better that than no display.

I also transferred the wiki I had made for VC4 over to github.  In doing so, I was pleasantly surprised at how much documentation I wanted to write once I got off of the awful wiki software at freedesktop.  You can find more information on VC4 at my mesa and linux trees.

(Side note, wikis on github are interesting.  When you make your fork, you inherit the wiki of whoever you fork from, and you can do PRs back to their wiki similarly to how you would for the main repo.  So my linux tree has Raspberry Pi's wiki too, and I'm wondering if I want to move all of my wiki over to their tree.  I'm not sure.)

Is there anything that people think should be documented for the vc4 project that isn't there?