planet.freedesktop.org
January 20, 2017

This is the write-up of my talk at LCA 2017 in Hobart. It’s not exactly the same, because this is a blog post and not a talk, but it covers the same content. The slides for the talk are here, and I will link to the video as soon as it is available.

Linux Kernel Maintainers

First let’s look at how the kernel community works, and how a change gets merged into Linus Torvalds’ repository. Changes are submitted as patches to a mailing list, where they get some review and eventually get applied by a maintainer to that maintainer’s git tree. Each maintainer then sends pull requests, often directly to Linus. With a few big subsystems (networking, graphics and ARM-SoC are the major ones) there’s a second or third level of sub-maintainers in between. 80% of the patches get merged this way; only 20% are committed by a maintainer directly.

Most maintainers are just that, a single person, and often responsible for a bunch of different areas in the kernel with corresponding different git branches and repositories. To my knowledge there are only three subsystems that have embraced group maintainership models of different kinds: TIP (x86 and core kernel), ARM-SoC and the graphics subsystem (DRM).

The radical change, at least for the kernel community, that we implemented over a year ago for the Intel graphics driver is to hand out commit rights to all regular contributors. Currently there are 19 people with commit rights to the drm-intel repository. Now, after a first year of ramp-up, 70% of all patches are committed directly by their authors - a big change compared to how things worked before, and to how they still work everywhere else outside of the graphics subsystem. More recently we also started to manage the drm-misc tree for subsystem-wide refactorings and core changes in the same way.

I’ve covered the details of the new process in my Kernel Recipes talk “Maintainers Don’t Scale”, and LWN has covered that, and a few other talks, in their article on linux kernel maintainer scalability. I also covered this topic at the kernel summit, and again LWN covered the group maintainership discussion. I don’t want to go into more detail here, mostly because we’re still learning too, and are not really experts on commit rights for everyone and what it takes to make that work well. If you want to see what a community that really has this all figured out looks like, watch Emily Dunham’s talk “Life is better with Rust’s community automation” from last year’s LCA.

What we are experts on is the Linux kernel’s maintainer model - we’ve run things for years with the traditional model, both as single maintainers and as small groups, and have now gained an outside perspective by switching to something completely different. Personally, I’ve come to believe that the maintainer model as implemented by the kernel community just doesn’t scale. Not in the technical sense of big-O scalability - obviously the kernel community scales to a rather big size. Much larger organizations, even entire states, are organized in a hierarchical way; the kernel maintainer hierarchy is nothing special. Besides that, git was developed specifically to support the Linux maintainer hierarchy, and git won. Clearly, the Linux maintainer model scales to big numbers of contributors. Where I think it falls short is the constant factor of how efficiently contributions are reviewed and merged, especially for non-maintainer contributors - who write 80% of all patches.

Cult of Busy

The first issue that routinely comes up when talking about maintainer topics is that everyone is overloaded. There’s a pervasive spirit in our industry (especially in the US) hailing overworked engineers as heroes, with an entire “cult of busy” around it. If you have time, you’re a slacker and probably not worth it. Of course this doesn’t help when being a maintainer, but I don’t believe it’s a cause of why the Linux maintainer model doesn’t work. This cult of busy leads to burnout, which is in my opinion a prime risk when you’re an open source person. Personally, I’ve gone through a few difficult phases until I understood my limits and learned to respect them. When you start as a maintainer for 2-3 people and it increases to a few dozen within a couple of years, getting a bit overloaded is rather natural - it’s a new job, with a different set of responsibilities, and I had no clue about a lot of things. That’s no different from suddenly being the leader of a much bigger team anywhere else. A great talk on this topic is “What part of “… for life” don’t you understand?” from Jacob Kaplan-Moss, especially since it’s by a former maintainer. It also contains a bunch of links to talks specifically about burnout. Ignoring burnout, or not knowing its early warning signs, is not healthy, and it is rampant in our communities - but for now I’ll leave it at that.

Boutique Trees and Bus Factors

The first issue I see is how maintainers usually are made: you scratch an itch somewhere, write a bit of code, suddenly a few more people find it useful, and “tag”, you’re the maintainer. On top of that, you often end up being stuck in that position “for life”. If the community keeps growing, or its maintainer becomes otherwise busy with work & life, you have your standard-issue overloaded bottleneck.

That’s the point where I think the kernel community goes wrong. When other projects reach this point they start to build up a more formal community structure, with specialized roles, boards for review and other bits and pieces. One of the oldest, and probably most notorious, is Debian with its constitution. Of course a small project doesn’t need such elaborate structures. But if the goal is world domination, or at least creating something lasting, it helps when there are solid institutions that can cope with people turnover. At first, just documenting processes and roles properly goes a long way, long before bylaws and codified decision processes are needed.

The kernel community, at least on the maintainer side, entirely lacks this.

What instead most often happens is that a new set of ad-hoc, chosen-by-default maintainers starts to crop up in a new level of the hierarchy, below your overloaded bottleneck - because becoming your own maintainer is the only way to help out and to get your own features merged. That only perpetuates the problem, since the new maintainers are just as likely to be otherwise busy, or already occupied with plenty of other kernel parts. If things go well that area becomes big, and you have another git tree with another overloaded maintainer. More often than not people move around and accumulate small bits all over under their maintainership. And then the cycle repeats.

The end result is a forest of boutique trees, each covering a tiny part of the project, maintained by a bunch of notoriously overloaded people. The resulting cross-tree coordination issues are pretty impressive - in the graphics subsystem we fairly often end up with simple drivers that somehow need prep patches in 5 different trees before you can even land that simple driver in the graphics tree.

Unfortunately that’s not even the bad part. Because these maintainers are all busy with other trees, or their work, or life in general, you’re guaranteed that at any given time one of them is not available. Worse, because their tree sees relatively little activity, since it covers only a small area, many of them pick up patches just once per kernel release, which means a built-in 3-month delay. That’s all because each tree and area has just one maintainer. In the end you don’t even need the proverbial bus to hit anyone to feel the pain of having a single point of failure in your organization - there are so many maintainer trees around that some absence is always happening somewhere, constantly.

Of course people get fed up trying to get features merged, and often the fix is to try to become a maintainer yourself. That takes a while and isn’t easy - only 20% of all patches are authored by maintainers - and after the new code has landed it makes everything worse: now there’s one more semi-absent maintainer with one more boutique tree, adding to all the existing troubles.

Checks and Balances

All patches merged into the Linux kernel are supposed to be reviewed, and rather often that review is done only by the maintainer who merges the patch. When maintainers send out pull requests, the next level of maintainers then reviews those patch piles, until they land in Linus’ tree. That’s an organization where control flows entirely top-down, with no checks and balances to rein in maintainers who are not serving their contributors well. The history of dictatorships tells us that despite best intentions, the end result tends to heavily favour the few over the many. As a crude measure of how much maintainers subject themselves to some checks and balances by their peers and contributors, I looked at how many patches authored and committed by the same person (probably a maintainer) do not also carry a reviewed or acked tag. For the Intel driver that’s less than 3%. Even within the core graphics code it’s only 5%, and that covers the time before we started to experiment with commit rights for that area. For the graphics subsystem overall the ratio is still only about 25%, and that includes a lot of drivers with essentially just one contributor, who is always volunteered as the maintainer, so it’s somewhat natural that those maintainers lack reviewers.

Outside of graphics, only roughly 25% of all patches written by maintainers are reviewed by their peers - 75% of all maintainer patches lack any kind of recorded peer review, compared to just 25% for graphics alone. Even looking at core areas like kernel/ or mm/, the ratio is only marginally better, at about 30%. In short, in the kernel at large, peer review of maintainers isn’t the norm.
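As an aside, ratios like these are reasonably easy to approximate straight from a kernel git checkout. The following one-liner is only a sketch of the method (not the actual script I used): it counts patches in kernel/ whose author and committer e-mail match and whose commit message carries no Reviewed-by: or Acked-by: tag.

$ git log --no-merges --format='%H %ae %ce' -- kernel/ |
  while read hash author committer; do
    # self-committed patches only
    [ "$author" = "$committer" ] || continue
    # keep the ones without any recorded peer review
    git show -s --format=%b "$hash" |
      grep -qiE '^(Reviewed|Acked)-by:' || echo "$hash"
  done | wc -l

Divide by the total number of self-committed patches in the same area and you get roughly the ratios quoted above.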

And there’s nothing outside of the maintainer hierarchy that could provide some checks and balances either. The only way to escalate a disagreement is by starting a revolution, and revolutions tend to be long, drawn-out struggles that are generally not worth it. Even Debian only recently learned that they lack a way to depose maintainers, and that maybe going maintainerless would be easier (again, LWN has you covered).

Of course the kernel is not the only hierarchy with no meaningful checks and balances. Professors at universities and managers at work are in a fairly similar position, with minimal options for students or employees to meaningfully appeal decisions. But there that’s a recognized problem, and it’s at least somewhat countered by providing ways to give anonymous feedback, often through regular surveys. The results tend not to be all that significant, but at least they provide some control and accountability to the wider masses of first-level dwellers in the hierarchy. In the kernel, those amount to about 80% of all contributions, but there’s no such survey. On the contrary, feedback sessions about maintainer happiness only reinforce the control structure, with e.g. the kernel summit featuring an “Is Linus happy?” session each year.

Another closely related aspect is how a project handles personal conflicts between contributors. For a very long time Linux didn’t have any formal structures in this area either; the only options available to unhappy people were to either take it or leave it. Well, or to usurp a maintainer with a small revolution, but that’s not really an option. For two years now we’ve had the “Code of Conflict”, which de facto just throws up its hands and declares conflict to be the normal outcome, essentially just encoding the status quo. Refusing to handle conflicts in a project with thousands of contributors just doesn’t work - it results in lots of frustration and ultimately in people trying to get away. Again, the lack of an empowered board to enforce a strong code of conduct, independent of the maintainer hierarchy, is in line with the kernel community’s unwillingness to accept checks and balances.

Mesh vs. Hierarchy

The last big issue I see with the Linux kernel model, featuring lots of boutique trees and overloaded maintainers, is that it seems to harm collaboration and the integration of new contributors. In the Intel graphics driver, maintainers only ever reviewed a small minority of all patches over the last few years, with the goal of fostering direct collaboration between contributors. Still, when a patch was stuck, maintainers were the first point of contact, especially, but not only, for newer contributors. No amount of explaining that the only gating factor was a lack of agreement with the reviewer could persuade people to fully collaborate on code reviews and rework the code, tests and documentation as needed - especially when they came with previous experience where code review is more of a rubber-stamp step, compared to the distributed and asynchronous pair-programming it often resembles in open source. Instead, new contributors often just ended up falling back to pinging maintainers to make a decision or merge the patches as-is.

Giving all regular contributors commit rights and fully trusting them to do the right thing entirely fixed that: if the reviewer or the author has commit rights, there’s no easy excuse anymore to involve maintainers when the two can’t reach agreement. Of course that requires a lot of work in mentoring people, making sure requirements for merging are understood and documented, and automating as much as possible to avoid screw-ups. I think maintainers who lament their lack of review bandwidth, but at the same time state that they can’t trust anyone else, aren’t really doing their jobs.

At least for me, review isn’t just about ensuring good code quality, but also about diffusing knowledge and improving understanding. At first there’s maybe one person, the author (and that’s not a given), understanding the code. After good review there should be at least two people who fully understand it, including corner cases. And that’s also why I think that group maintainership is the only way to run any project with more than one regular contributor.

On the topic of patch review and maintainers, there’s also the habit of wholesale rewrites of patches written by others. If you want others to contribute to your project, you need to accept other styles and can’t enforce your own all the time. Merging first and polishing later recognizes new contributions, and if you engage newcomers in the polish work they tend to stick around more often. And even when a patch really needs to be reworked before merging, it’s better to ask the author to do it: worst case they don’t have time; best case you’ve improved your documentation and training procedures, and maybe gained a new regular contributor on top.

A great take on the consequences of having fixed roles, instead of trying to spread responsibilities more evenly, is Alice Goldfuss’ talk “Rock Stars, Builders, and Janitors: You’re doing it wrong”. I also think that rigid roles present a higher bar for people with different backgrounds, hampering diversity efforts, and, in the spirit of Sarah Sharp’s post on what makes a good community, need to be fixed first.

Towards a Maintainer’s Manifest

I think what’s needed in the end are some guidelines and discussions about what a maintainer is and what a maintainer does. We have ready-made licenses to avoid havoc, there are codes of conduct to copy-paste and implement, handbooks for building communities, and, for all of these things, lots of conferences. A maintainer, on the other hand, is something you become by accident, as a default. And then everyone gets to learn how to do it on their own, while hopefully not burning too many bridges - at least I myself was rather lost on that journey at times. I’d like to conclude with a draft of a maintainer’s manifest.

It’s About the People

If you’re the maintainer of a project or code area with a bunch of full-time contributors (or even a lot of drive-by contributions), then you primarily deal with people. Insisting that you’re only a technical leader just means you don’t acknowledge what your true role really is.

And then, trust them to do a good job, and recognize them for the work they’re doing. The important part is to trust people just a bit more than they’re ready for, as an occasional challenge, but not so much that they’re bound to fail. In short, give them the keys and hope they don’t wreck the car too badly - but in all cases have insurance ready. And insurance for software is dirt cheap: generally a git revert, plus the maintainer profusely apologizing to everyone and taking the blame, is all it takes.

Recognize Your Power

You’re a maintainer, and you have essentially absolute power over what happens to your code. For successful projects that means you can unleash a lot of harm on people who, for better or worse, are employed to deal with you. One of the things that annoys me the most is when maintainers engage in petty status fights against subordinates, thinly veiled as technical discussions - you end up looking silly, and it just pisses everyone off. Instead, recognize your power, try to stay on the good side of the force, and make sure you share it sufficiently with the contributors of your project.

Accept Your Limits

At the beginning you’re responsible for everything, and for a one-person project that’s all fine. But eventually the project grows too much, you’ll just become a dictator, and then failure is all but assured, because we’re all human. Recognize what you don’t do well and build institutions to replace you. Recognize that the responsibility you initially took on might not be the same as the one you’ll end up with, and either accept that or move on. And do all of that before you start burning out.

Be a Steward, Not a Lord

I think one of the key advantages of open source is that people stick around for a very long time, even when they switch jobs or move around. Maybe the usual “for life” qualifier isn’t really a great choice, since it sounds more like a mandatory sentence than something done by choice. What I object to is the “dictator” part: if your goal is to grow a great community and maybe reach world domination, then you as the maintainer need to serve that community. And not the other way round.

Thanks a lot to Ben Widawsky, Daniel Stone, Eric Anholt, Jani Nikula, Karen Sandler, Kimmo Nikkanen and Laurent Pinchart for reading and commenting on drafts of this text.

January 17, 2017
In this blog post I promised I would get back to people who want to use the nvidia driver on an optimus laptop.

The set of xserver patches I blogged about last time has landed upstream and in Fedora 25 (in xorg-x11-server 1.19.0-3 and newer), allowing the nvidia driver packages to ship an xorg.conf snippet which will make the driver automatically work on optimus setups.
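For reference, the snippet in question is an OutputClass match along these lines (illustrative - the file actually shipped by the packages may differ in details):

Section "OutputClass"
    Identifier "nvidia"
    MatchDriver "nvidia-drm"
    Driver "nvidia"
EndSection

The idea is that the server automatically picks the nvidia Xorg driver for the GPU bound to the nvidia-drm kernel driver, with no manually written xorg.conf needed.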

The negativo17.org nvidia packages now are using this, so if you install these, then the nvidia driver should just work on your laptop.

Note that you should only install these drivers if you actually have a supported (new enough) nvidia GPU. These drivers replace the libGL implementation, so installing them on a system without an nvidia GPU will cause things to break. This will be fixed soon by switching to libglvnd as the official libGL provider and having both mesa and the nvidia driver provide "plugins" for libglvnd. I've actually just completed building a new enough libglvnd plus a libglvnd-enabled mesa for rawhide, so rawhide users will have libglvnd starting tomorrow.
January 16, 2017

My day-to-day activities still revolve around the Python programming language, as I continue working on the OpenStack project as part of my job at Red Hat. OpenStack is still the biggest Python project out there, and attracts a lot of Python hackers.

These last few years, however, things have taken a different turn for me, when I made the choice with my team to rework the telemetry stack architecture. We decided to make a point of making it scale way beyond what had been done in the project so far.

I started to dig into a lot of different fields around Python - topics you don't often look at when writing a simple and straightforward application. It turns out that writing scalable applications in Python is neither impossible nor that difficult. There are a few hiccups to avoid, and various tools that can help, but it really is possible – without switching to a whole other language, framework, or exotic tool set.

Working on those projects seemed like a good opportunity to share what I learned with the rest of the world. Therefore, I decided to share my most recent knowledge about distributed and scalable Python applications in a new book, entitled The Hacker's Guide to Scaling Python (or Scaling Python, for short). The book should be released in a few months – fingers crossed.

And as the book is still a work in progress, I'll be happy to hear any remarks, questions, or topic ideas you might have, or any particular angle you would like me to take in this book (reply in the comments section or shoot me an email). And if you'd like to be kept updated on the book's progress, you can subscribe via the following form or from the book homepage.

The adventure of working on my previous book, The Hacker's Guide to Python, was so tremendous, and the feedback so great, that I'm looking forward to releasing this new book later this year!

Last week I was on vacation, but the week before that I did some more work on figuring out Intel's Mesa CI system, and Mark Janes has started working on some better documentation for it.  I now understand better how the setup is going to work, but haven't made much progress on actually getting a master running yet.

More fun, though, was finally taking a look at optimizing the tiled texture load/store code.  This got started with Patrick Walton tweeting a link to a blog post on clever math for implementing texture tiling, given a couple of assumptions.

As with all GPUs these days, VC4 swizzles textures into a tiled layout so that when a cacheline is loaded from memory, it will cover nearby pixels in the Y direction as well as X, so that you are more likely to get cache hits for your neighboring pixels in drawing.  And the tiling tends to be multiple levels, so that nearby cachelines are also nearby on the screen, reducing DRAM access latency.

For small textures on VC4, we have a single level of tiling: 4x4@32bpp blocks ("utiles") of raster-order data, themselves arranged in up to a 4x4 grid, and this mode is called LT.  Once things go over that size, we go to T tiling, where we call a 4x4 LT block a subtile, arrange subtiles in either clockwise or counterclockwise order within a 2x2 block (a tile, which at this point is 1kb of data), and arrange tiles themselves left-to-right, then right-to-left, then left-to-right again.
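To make that layout a bit more concrete, here's a rough sketch of the direct (not yet clever) address math for an LT texture at 32bpp - illustrative only, with made-up names, not the actual Mesa code:

#include <stdint.h>

/* Byte offset of pixel (x, y) in an LT-tiled, 32bpp image whose width
 * is stride_in_utiles * 4 pixels.  Utiles are 4x4 pixels (64 bytes) of
 * raster-order data, and the utiles themselves are laid out in raster
 * order.
 */
static inline uint32_t
lt_pixel_offset_32bpp(uint32_t x, uint32_t y, uint32_t stride_in_utiles)
{
    uint32_t utile_x = x >> 2;   /* which utile */
    uint32_t utile_y = y >> 2;
    uint32_t sub_x = x & 3;      /* position within the utile */
    uint32_t sub_y = y & 3;

    return (utile_y * stride_in_utiles + utile_x) * 64 +
           (sub_y * 4 + sub_x) * 4;
}

The trick in the blog post mentioned above is to avoid recomputing this from scratch for every pixel by stepping the offset incrementally as x and y advance.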

The first thing I did was implement the blog post's clever math for LT textures.  One of the nice features of the math trick is that it means I can do partial utile updates finally, because it skips around in the GPU's address space based on the CPU-side pixel coordinate instead of trying to go through GPU address space to update a 4x4 utile at a time.  The downside of doing things this way is that jumping around in GPU address space means that our writes are unlikely to benefit from write combining, which is usually important for getting full write performance to GPU memory.  It turned out, though, that the math was so much more efficient than what I was doing that it was still a win.

However, I found out that the clever math made reads slower.  The problem is that, because we're write-combined, reads are uncached -- each load is a separate bus transaction to main memory.  Switching from my old utile-at-a-time load using memcpy to the new code meant that instead of doing 4 loads using NEON to grab a row at a time, we were now doing 16 loads of 32 bits at a time, for what added up to a 30% performance hit.

Reads *shouldn't* be something that people do much, except that we still have some software fallbacks in X on VC4 ("core" unantialiased text rendering, for example, which I need to write the GLSL 1.20 shaders for), which involve doing a read/modify/write cycle on a texture.  My first attempt at fixing the regression was just adding back a fast path that operates on a utile at a time if things are nicely utile-aligned (they generally are).  However, some forced inlining of functions per cpp that I was doing for the unaligned case meant that the glibc memcpy call now got inlined back to being non-NEON, and the "fast" utile code ended up not helping loads.

Relying on details of glibc's implementation (their tradeoff for when to do NEON loads) and of gcc's implementation (when to go from memcpy calls to inlined 32-bits-at-a-time code) seems like a pretty bad idea, so I decided to finally write the NEON acceleration that Eben and I have talked about several times.

My first hope was that I could load a full cacheline with NEON's VLD.  VLD1 only loads up to 1 "quadword" (16 bytes) at a time, so that doesn't seem like much help.  VLD4 can load 64 bytes like we want, but it also turns AOS data into SOA in the process, and there's no corresponding "SOA-back-to-AOS store 8 or 16 bytes at a time" like we need to do to get things back into the CPU's strided representation.  I tried VLD4+VST4 into a temporary, then doing my old untiling path on the cached temporary, but that still left me a few percent slower on loads than not doing any of this work at all.

Finally, I hit on using the VLDM instruction.  It seems to be intended for stack loads/stores, but we can also use it to get 64 bytes of data in from memory untouched into NEON registers, and then I can use 4 (32bpp) or 8 (8 or 16bpp) VST1s to store it to the CPU side.  With this, we get a 208.256% +/- 7.07029% (n=10) improvement to GetTexImage performance at 1024x1024.  Doing the same NEON code for stores gave a 41.2371% +/- 3.52799% (n=10) improvement, probably mostly due to not calling into memcpy and having it go through its size/alignment-based memcpy path choosing process.
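For flavor, here's roughly what the VLDM trick looks like - a hedged sketch with made-up names, assuming ARMv7 NEON and 32bpp, not the exact code that landed:

#include <stdint.h>

/* Pull one 64-byte utile out of write-combined memory with a single
 * VLDM, then scatter its four 16-byte rows into a raster image with
 * VST1 stores.
 */
static inline void
load_utile_32bpp_neon(void *dst, uint32_t dst_stride, const void *src)
{
    __asm__ volatile(
        "vldm %[s], {d0-d7}\n\t"              /* one 64-byte load */
        "vst1.32 {d0, d1}, [%[d]], %[st]\n\t" /* row 0, bump by stride */
        "vst1.32 {d2, d3}, [%[d]], %[st]\n\t" /* row 1 */
        "vst1.32 {d4, d5}, [%[d]], %[st]\n\t" /* row 2 */
        "vst1.32 {d6, d7}, [%[d]]\n\t"        /* row 3 */
        : [d] "+r" (dst)
        : [s] "r" (src), [st] "r" (dst_stride)
        : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "memory");
}

The store direction works the same way in reverse: VLD1 a row at a time from the raster image, then one VSTM to push the assembled utile out to GPU memory.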

I'm not yet hitting full memory bandwidth, but this should be a noticeable improvement to X, and it'll probably help my piglit test suite runtime as well.  Hopefully I'll get the code polished up and landed next week when I get back from LCA.
January 11, 2017
A while back, Debian switched to using the modesetting Xorg driver rather than the intel Xorg driver for Intel GPUs.

There are several good reasons for this; rather than repeating them, I'm just going to point to the Debian announcement.

This blog post is to let all Fedora users know that starting with Fedora-26 / rawhide as of today, we are making the same change.

Note that the xorg-x11-drv-intel package had already been carrying a Fedora patch to not bind to the GPU on Skylake or newer, even before Debian announced their change; this just makes the same change for older Intel GPUs.

For people who are using the now default GNOME3 on Wayland session, nothing changes, since Xwayland always uses glamor for X acceleration, just like the modesetting driver.

If you encounter any issues caused by this change, please file a bug in bugzilla.
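If you hit such an issue and need to temporarily switch back to the intel driver while it gets sorted out, a minimal snippet along these lines in /etc/X11/xorg.conf.d/ should do it (illustrative, at your own risk):

Section "Device"
    Identifier "Intel Graphics"
    Driver "intel"
EndSection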

The "for all recent Intel GPUs" in the subject of this blog post in practice means that we're making this change for gen4 and newer Intel GPUs.
January 09, 2017

Since the beginning of last year, our team at Igalia has been involved in enabling the ARB_gpu_shader_fp64 extension for different Intel generations: first Broadwell and later, then Haswell, and now IvyBridge (still under review); so working on supporting Vulkan’s Float64 capability in Mesa was a natural next step.


January 07, 2017

mm-openwrt

I’ve lately been working on integrating ModemManager in OpenWRT, in order to provide a single, consolidated way to configure and manage mobile broadband modems (2G, 3G, 4G, Iridium…), all working with netifd.

OpenWRT already has some support for a lot of the devices that ModemManager is able to manage (e.g. through the uqmi, umbim or wwan packages), but unlike the current solutions, ModemManager doesn’t require protocol-specific configurations or setups for the different devices; i.e. the configuration for a modem running in MBIM mode may be the same one as the configuration for a modem requiring AT commands and a PPP session.

Currently the OpenWRT package prepared is based on ModemManager git master, and therefore it supports: QMI modems (including the new MC74XX series which are raw-ip only and don’t support DMS UIM operations), MBIM modems, devices requiring QMI over MBIM operations (e.g. FCC auth), and of course generic AT+PPP based modems, Cinterion, Huawei (both AT+PPP and AT+NDISDUP), Icera, Haier, Linktop, Longcheer, Ericsson MBM, Motorola, Nokia, Novatel, Option (AT+PPP and HSO), Pantech, Samsung, Sierra Wireless (AT+PPP and DirectIP), Simtech, Telit, u-blox, Wavecom, ZTE… and even Iridium and Thuraya satellite modems. All with the same configuration.

Along with ModemManager itself, the OpenWRT feed also contains libqmi and libmbim, which provide the qmicli, mbimcli, and soon the qmi-firmware-update utilities. Note that you can also use these command line tools, even if ModemManager is running, via the qmi-proxy and mbim-proxy setups (i.e. just adding -p to the qmicli or mbimcli commands).
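For example, something like this should work even while ModemManager has the modem open (an illustrative command, assuming a QMI device exposed at /dev/cdc-wdm0):

$ qmicli -p -d /dev/cdc-wdm0 --dms-get-manufacturer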

This is not the first time I’ve tried to do this, but this time I believe it is a much more complete setup and likely ready for others to play with. You can jump to the modemmanager-openwrt bitbucket repository and follow the instructions to include it in your OpenWRT builds:

https://bitbucket.org/aleksander0m/modemmanager-openwrt

The following sections go into a bit more detail about the changes that were required to make all this work.

And of course, thanks to VeloCloud for sponsoring the development of the latest ModemManager features that made this integration possible 🙂

udev vs hotplug

One of the latest big features merged in ModemManager was the ability to run without udev support, i.e. without automatically monitoring the device additions and removals happening in the system.

Instead of using udev, the mmcli command line tool gained a new --report-kernel-event option that can be used to report device additions and removals manually, e.g.:

$ mmcli --report-kernel-event="action=add,subsystem=tty,name=ttyUSB0"
$ mmcli --report-kernel-event="action=add,subsystem=net,name=wwan0"

This new way of notifying device events made it very easy to integrate the automatic device discovery supported in ModemManager directly via tty and net hotplug scripts (see mm_report_event()).

With the integration in the hotplug scripts, ModemManager will automatically detect and probe the different ports exposed by the broadband modem devices.
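Boiled down, such a hotplug script ends up doing something like the following (a sketch only - the actual mm_report_event() helper in the package does more, and the exact path and hotplug variable names here are assumptions):

# /etc/hotplug.d/tty/25-modemmanager (illustrative path)
[ "$ACTION" = "add" ] || [ "$ACTION" = "remove" ] || exit 0
mmcli --report-kernel-event="action=$ACTION,subsystem=tty,name=$DEVICENAME"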

udev rules

ModemManager relies on udev rules for different things:

  • Blacklisting devices: E.g. we don’t want ModemManager to claim and probe the TTYs exposed by Arduinos or braille displays. The package includes a USB vid:pid based blacklist of devices that expose TTY ports and are not modems to be managed by ModemManager.
  • Blacklisting ports: There are cases where we don’t want the automatic logic selection to grab and use some specific modem ports, so the package also provides a much shorter list of ports blacklisted from actual modem devices. E.g. the QMI implementation in some ZTE devices is so poor that we decided to completely skip it and fallback to AT+PPP.
  • Greylisting USB serial adapters: The TTY ports exposed by USB serial adapters aren’t probed automatically, as we don’t know what’s connected in the serial side. If we want to have a serial modem, though, the mmcli --scan-modems operation may be executed, which will include the probing of these greylisted devices.
  • Specifying port type hints: Some devices expose multiple AT ports, but with different purposes. E.g. a modem may expose a port for AT control and another port for the actual PPP session, and choosing the wrong one will not work. ModemManager includes a list of port type hints so that the automatic selection of which port is for what purpose is done transparently.

As we’re not using udev when running in OpenWRT, ModemManager now includes a custom generic udev rules parser that uses sysfs properties to process and apply the rules.

procd based startup

The ModemManager daemon is set up to be started and controlled via procd. The init script controlling the startup also takes care of re-playing the hotplug events that had earlier triggered --report-kernel-event actions (they’re cached in /tmp), e.g. to cope with events that arrived before the daemon started, or to handle daemon restarts gracefully.
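A minimal procd init script for this looks roughly like the following sketch (the script actually shipped in the feed does more, e.g. the event replay just mentioned, and the start priority here is made up):

#!/bin/sh /etc/rc.common
USE_PROCD=1
START=70

start_service() {
    procd_open_instance
    procd_set_param command /usr/sbin/ModemManager
    procd_set_param respawn   # restart the daemon if it dies
    procd_close_instance
}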

DBus

Well, no, I didn’t port ModemManager to use ubus 🙂 If you want to run ModemManager under OpenWRT you’ll also need to have the DBus daemon running.

netifd protocol handler

When using ModemManager, the user shouldn’t need to know the peculiarities of the modem being used: all modems and protocols (QMI, MBIM, Generic AT, vendor-specific AT…) are all managed via the same single DBus interfaces. All the modem control commands are internal to ModemManager, and the only additional considerations needed are related to how to setup the network interface once the modem is connected, e.g.:

  • PPP: some modems require a PPP session over a serial port.
  • Static: some modems require static IP configuration on a network interface.
  • DHCP: some modems require dynamic IP configuration on a network interface.

The OpenWRT package for ModemManager includes a custom protocol handler that enables the modemmanager protocol to be used when configuring network interfaces. This new protocol handler takes care of configuring and bringing up the interfaces as required when the modem gets into connected state.

Example configuration

The following snippet shows an example interface configuration to set in /etc/config/network.

 config interface 'broadband'
   option device '/sys/devices/platform/soc/20980000.usb/usb1/1-1/1-1.2/1-1.2.1'
   option proto 'modemmanager'
   option apn 'ac.vodafone.es'
   option username 'vodafone'
   option password 'vodafone'
   option pincode '7423'
   option lowpower '1'

The settings currently supported are the following ones:

  • device: The full sysfs path of the broadband modem device needs to be configured. Relying on the interface names exposed by the kernel is never a good idea, as these may change e.g. across reboots or when more than one modem device is available in the system.
  • proto: As said earlier, the new modemmanager protocol needs to be configured.
  • apn: If the connection requires an APN, the APN to use.
  • username: If the access point requires authentication, the username to use.
  • password: If the access point requires authentication, the password to use.
  • pincode: If the SIM card requires a PIN, the code to use to unlock it.
  • lowpower: If enabled, this setting will request the modem to go into low-power state (i.e. IMSI detach and RF off) when the interface is disconnected.

As you can see, the configuration can be used for any kind of modem device, regardless of which control protocol it uses, which interfaces are exposed, or how the connection is established. The settings are currently IPv4 only, but adding IPv6 support shouldn’t be a big issue - patches welcome 🙂

SMS, USSD, GPS…

The main purpose of using a mobile broadband modem is of course the connectivity itself, but it also may provide many more features. ModemManager provides specific interfaces and mmcli actions for the secondary features which are also available in the OpenWRT integration, including:

  • SMS messaging (both 3GPP and 3GPP2).
  • Location information (3GPP LAC/CID, CDMA Base station, GPS…).
  • Time information (as reported by the operator).
  • 3GPP USSD operations (e.g. to query prepaid balance to the operator).
  • Extended signal quality information (RSSI, Ec/Io, LTE RSRQ and RSRP…).
  • OMA device management operations (e.g. to activate CDMA devices).
  • Voice call control.

Worth noting that not all these features are available for all modem types (e.g. SMS messaging is available for most devices, but OMA DM is only supported in QMI based modems).

TL;DR?

You can now have your 2G/3G/4G mobile broadband modems managed with ModemManager and netifd in your OpenWRT based system.


Filed under: Development, FreeDesktop Planet, GNOME Planet, Planets Tagged: libmbim, libqmi, ModemManager, openwrt
January 06, 2017

2017 starts with good news for Intel Haswell users: it has been a long time coming, but we have finally landed GL_ARB_gpu_shader_fp64 for this platform. Thanks to Matt Turner for reviewing the huge patch series!

Maybe you are not particularly excited about GL_ARB_gpu_shader_fp64, but that does not mean this is not an exciting milestone for you if you have a Haswell GPU (or even IvyBridge, read below): this extension was the last piece missing to finally bring Haswell to expose OpenGL 4.0!

If you want to give it a try but you don’t want to build the driver from the latest Mesa sources, don’t worry: the feature freeze for the Mesa 13.1 release is planned to happen in just a few days and the current plan is to have the release in early February, so if things go according to plan you won’t have to wait too long for an official release.

But that is not all: now that we have landed Fp64, we can also send the implementation of GL_ARB_vertex_attrib_64bit for review. This could be a very exciting milestone, since I believe this is the only thing missing for Haswell to have all the extensions required for OpenGL 4.5!

You might be wondering about IvyBridge too, and 2017 also starts with good news for IvyBridge users. Landing Fp64 for Haswell allowed us to send for review the IvyBridge patches we had queued up for GL_ARB_gpu_shader_fp64, which will bring IvyBridge up to OpenGL 4.0. But again, that is not all: once we land Fp64 there, we should also be able to send the patches for GL_ARB_vertex_attrib_64bit and get IvyBridge up to OpenGL 4.2, so look forward to this in the near future!

We have been working hard on Fp64 and Va64 during a good part of 2016, first for Broadwell and later platforms, and then for Haswell and IvyBridge. It has been a lot of work, so it is exciting to see all of it reach the last stages and make its way into the hands of the final users.

All this has only been possible thanks to Intel’s sponsoring and the great support and insight that our friends there have provided throughout the development and review processes, so big thanks to all of them and also to the team at Igalia that has been involved in the development with me.

January 03, 2017

This post describes the synclient tool, part of the xf86-input-synaptics package. It does not describe the various options; that's what the synclient(1) and synaptics(4) man pages are for. This post describes what synclient is, where it came from and how it works on a high level. Think of it as an anti-bus-factor post.

Maintenance status

The most important thing first: synclient is part of the synaptics X.Org driver which is in maintenance mode, and superseded by libinput and the xf86-input-libinput driver. In general, you should not be using synaptics anymore anyway, switch to libinput instead (and report bugs where the behaviour is not correct). It is unlikely that significant additional features will be added to synclient or synaptics and bugfixes are rare too.

The interface

synclient's interface is extremely simple: it's a list of key/value pairs that would all be set at the same time. For example, the following command sets two options, TapButton1 and TapButton2:


$ synclient TapButton1=1 TapButton2=2

The -l switch lists the current values in one big list:

$ synclient -l
Parameter settings:
LeftEdge = 1310
RightEdge = 4826
TopEdge = 2220
BottomEdge = 4636
FingerLow = 25
FingerHigh = 30
MaxTapTime = 180
...

The commandline interface is effectively a mapping of the various xorg.conf options. As said above, look at the synaptics(4) man page for details on each option.

History

A decade ago, the X server had no capability to change driver settings at runtime. Changing a device's configuration required rewriting an xorg.conf file and restarting the server. To avoid this, the synaptics X.Org touchpad driver exposed a shared memory (SHM) segment. Anyone with knowledge of the memory layout (an internal struct) and permission to write to that segment could change driver options at runtime. This is how synclient came to be: it was the tool that knew that memory layout. A synclient command would set the correct bits in the SHM segment, and the driver would use the newly updated options. For obvious reasons, synclient and synaptics had to be the same version to work.

8 or so years ago, the X server got support for input device properties, a generic key/value store attached to each input device. The keys are the properties, identified by an "Atom" (see the note below). The values are driver-specific. All drivers make use of this now; being able to change a property at runtime is a matter of changing a property that the driver knows of.

Atoms are 32-bit unsigned integers, created for each property name at runtime. They represent a unique string (the property name) and can be created by applications too. Property name to Atom mappings are global. Once any driver initialises a property by its name (e.g. "Synaptics Tap Actions"), that property and the corresponding Atom will exist globally until the server resets. Atoms unknown to a driver are simply ignored.

synclient was converted to use properties instead of the SHM segment, and eventually the SHM support was removed from both synclient and the driver itself. The backend of synclient is thus identical to the one used by the xinput tool or by the tools of other drivers (e.g. the xsetwacom tool). synclient's killer feature used to be that it was the only tool that knew how to configure the driver; these days it's merely a commandline-argument-to-property mapping tool. xinput, GNOME, KDE - they all do the same thing in the backend.

How synclient works

The driver has properties of a specific name, format and value range. For example, the "Synaptics Tap Action" property contains 7 8-bit values, each representing a button mapping for a specific tap action. If you change the fifth value of that property, you change the button mapping for a single-finger tap. Another property "Synaptics Off" is a single 8-bit value with an allowed range of 0, 1 or 2. The properties are described in the synaptics(4) man page. There is no functional difference between this synclient command:


$ synclient SynapticsOff=1

and this xinput command:

$ xinput set-prop "SynPS/2 Synaptics TouchPad" "Synaptics Off" 1

Both set the same property with the same calls. synclient uses XI 1.x's XChangeDeviceProperty() and xinput uses XI 2.x's XIChangeProperty() if available, but that doesn't really matter. They both fetch the property, overwrite the respective value, and send it back to the server.
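Condensed to its core, the XI 2.x variant boils down to something like this sketch (error handling omitted, and this is not the actual source of either tool):

#include <X11/Xlib.h>
#include <X11/Xatom.h>
#include <X11/extensions/XInput2.h>

/* Set the "Synaptics Off" property (a single 8-bit value) on one device. */
static void
set_synaptics_off(Display *dpy, int deviceid, unsigned char value)
{
    /* only_if_exists=True: the driver must have created the property */
    Atom prop = XInternAtom(dpy, "Synaptics Off", True);

    if (prop == None)
        return;

    XIChangeProperty(dpy, deviceid, prop, XA_INTEGER, 8,
                     XIPropModeReplace, &value, 1);
    XFlush(dpy);
}

The device id is the same one you see in xinput list, and the display comes from a plain XOpenDisplay(NULL).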

Pitfalls and quirks

synclient is a simple tool: if multiple touchpads are present, it will simply pick the first one. This is a common issue for users with an i2c touchpad, and it will become even more common once the RMI4/SMBus support is in a released kernel. In both cases, the kernel creates the i2c/SMBus device and an additional PS/2 touchpad device that never sends events. So if synclient picks that device, all the settings are changed on a device that doesn't actually send events. Which device gets picked depends on the order in which the devices were added to the X server and can vary between reboots. You can work around it by disabling or ignoring the PS/2 device.
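The workaround can be as simple as an InputClass section in an xorg.conf.d snippet - the product string below is illustrative, check the device names in your xinput list output:

Section "InputClass"
    Identifier "Ignore the mute PS/2 touchpad"
    MatchProduct "PS/2 Synaptics TouchPad"
    Option "Ignore" "on"
EndSection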

synclient is a one-shot tool; it does not monitor devices. If a device is added at runtime, the user must run the command again to change its settings. If a device is disabled and re-enabled (VT switch, suspend/resume, ...), the user must run synclient again. This is a major reason we recommend against using synclient: the desktop environment should take care of all this. synclient also conflicts with the desktop environment in that it isn't aware when something else changes things. If synclient runs before the DE's init scripts (e.g. through xinitrc), its settings may be overwritten by the DE; if it runs later, it overwrites the DE's settings.

synclient exclusively supports synaptics driver properties. It cannot change any other driver's properties, and it cannot change the properties created by the X server on each device. That's another reason we recommend against it: you have to mix multiple tools to configure all devices, instead of using e.g. the xinput tool for all property changes. Or, as above, letting the desktop environment take care of it.

The interface of synclient is IMO not significantly more obvious than setting the input properties directly. One has to look up what TapButton1 does anyway, so looking up how to set the property with the more generic xinput is the same amount of effort. A wrong value won't give the user anything more useful than the equivalent of a "this didn't work".

TL;DR

If you're TL;DR'ing an article labelled "the definitive guide to" you're kinda missing the point...

January 02, 2017
The 3DMMES test has been thoroughly instruction count limited, so any wins we can get on code generation translate pretty directly into performance gains.  Last week I decided to work on fixing up Jonas's patch to schedule instructions in the delay slots of thread switching, which can save us 3 instructions per texture sample.

Thread switching, you may recall, is a trick in the fragment shader to hide texture fetch latency by cooperatively switching to another fragment shader instance after you request a texture fetch, so that it can make some progress when you'd probably be blocked anyway.  This lets us better occupy the ALUs, at the cost of each shader needing to fit into half of the physical register file.

However, VC4 doesn't cancel instructions in the pipeline when we request a switch (same as for within-shader branching), so 3 more instructions from the current shader get executed.  For my first implementation of thread switching, I just dumped in 3 NOPs after each THRSW to fill the delay slots.  This was still a massive win over not threading, so it was good enough.

Jonas's patch tries to fill the delay slots after we schedule the thread switch, by extracting the THRSW signal from the instruction we scheduled, moving it up to 2 instructions earlier, and then only adding as many NOPs as are needed to get all 3 slots filled.  There was a little bug (it re-scheduled the thrsw instruction instead of a NOP when trying to pad out the delay slots), but it basically worked and got us a 1.55% instruction count win on shader-db.

The problem was that he was scheduling ALU operations along with the thrsw, and if the thrsw signal was alone, without an ALU operation in it, then after moving the thrsw up we'd have a NOP left in the old location.  I wrote a follow-on patch to fix that: we now only schedule thrsws on their own, without ALU operations, insert the THRSW as early as we can, and then add NOPs as necessary to fill the remaining delay slots.  That was another 0.41% instruction count win.

This isn't as good as it could be.  We only consider instructions scheduled before the thrsw for the delay slots, but instructions that we choose to schedule after it could fit in as well.  Those would be tricky, because we'd have to check that they don't write flags or the accumulators (which aren't preserved across the thrsw) or new texture coordinates.  We also don't put the THRSW at any particular point in the timeline between the sampler request and the results collection.  We might be able to get wins by trying to put the thrsw at the lowest-register-pressure point between them, so that fewer things need to be forced into physical regs instead of accumulators.

Those will be projects for later.  It's probably much more important that we figure out how to schedule 2-4 samples with a single thrsw, instead of doing a thrsw per sample like we do today.

The other project last week was starting to build up a plan for vc4 CI.  We've had a few regressions to vc4 in Mesa master because developers (including me!) don't do testing on vc4 hardware on every commit.  The Intel folks have a lovely CI system that does piglit, DEQP, and performance regression testing for them, both tracking Mesa master and doing pre-commit tests of voluntarily submitted branches.  I despaired of the work needed to build something that good, but apparently they've open sourced their configuration, so I'll be trying to replicate that in the next few weeks.  This week I worked on getting the hardware and a plan for where to host it.  I'm looking forward to a bright future of 30-minute test runs on the actual hardware and long-term tracking of performance metrics.  Unfortunately, there are no docs for it and I've never worked with Jenkins before, so this is going to be a challenge.

Other things: I submitted a patch to mm/ to shut up the CMA warning we've been suffering from (and patching away downstream) for years, and got exactly the expected response ("this dmesg spew in a common path might be useful for somebody debugging something some day, so it should stay there"), so hopefully we can just get Fedora and Debian to patch it out instead.  This is yet another data point in favor of Matthew Garrett's plan of "write kernel patches you need, don't bother submitting them upstream".  I also started testing a new patch to reduce the error return rate on vc4 memory allocations on an upstream kernel, but haven't confirmed that it's working yet.  Finally, I spent a lot of time reviewing tarceri's patches to Mesa preparing for the on-disk shader cache, which should help improve app startup times and reduce memory consumption.

Packaging Python has been a painful experience for a long time. The history of the various distribution tools that Python has offered over the years is really bumpy, and both the user and the developer experience have been pretty bad.

Fortunately, things have improved a lot in recent years, with the reconciliation of setuptools and distribute.

In the context of the OpenStack project, though, a solution on top of setuptools was already started a while back. Its usage is now spread across a whole range of software and libraries.

This project is called pbr, for Python Build Reasonableness. Don't be put off by the OpenStack theme of the documentation – it is a bad habit of OpenStack folks not to advertise their tooling in an agnostic fashion. The tool has no dependency on the cloud platform, and can be used painlessly with any package.

How it works

pbr takes inspiration from distutils2 (a now abandoned project) and uses a setup.cfg file to describe the packager's intents. This is what a setup.py using pbr looks like:

import setuptools
 
setuptools.setup(setup_requires=['pbr'], pbr=True)


Two lines of code – it's that simple. The actual metadata that the setup requires is stored in setup.cfg:

[metadata]
name = foobar
author = Dave Null
author-email = foobar@example.org
summary = Package doing nifty stuff
license = MIT
description-file =
    README.rst
home-page = http://pypi.python.org/pypi/foobar
requires-python = >=2.6
classifier =
    Development Status :: 4 - Beta
    Environment :: Console
    Intended Audience :: Developers
    Intended Audience :: Information Technology
    License :: OSI Approved :: Apache Software License
    Operating System :: OS Independent
    Programming Language :: Python

[files]
packages =
    foobar


This syntax is way easier to write and read than the standard setup.py.

pbr also offers other features such as:

  • automatic dependency installation based on requirements.txt
  • automatic documentation building and generation using Sphinx
  • automatic generation of AUTHORS and ChangeLog files based on git history
  • automatic creation of the list of files to include using git
  • version management based on git tags

All of this comes with little to no effort on your part.

Using flavors

One of the features that I use a lot is the definition of flavors. It's not particularly tied to pbr – it's actually provided by setuptools and pip themselves – but pbr's setup.cfg file makes it easy to use.

When distributing software, it's common to have different drivers for it. For example, your project could support both PostgreSQL and MySQL – but nobody is going to use both at the same time. The usual trick to make it work is to add the needed libraries to the requirements list (e.g. requirements.txt). The upside is that the software will work directly with either RDBMS; the downside is that this installs both libraries, whereas only one is needed. Using flavors, you can specify different scenarios:

[extras]
postgresql =
    psycopg2
mysql =
    pymysql


When installing your package, the user can then just pick the right flavor by using pip to install the package:

$ pip install foobar[postgresql]


This will install foobar, all its dependencies listed in requirements.txt, plus whatever dependencies are listed in the [extras] section of setup.cfg matching the flavor. You can also combine several flavors, e.g.:

$ pip install foobar[postgresql,mysql]


would install both flavors.

pbr is well-maintained and in very active development, so if you have any plans to distribute your software, you should seriously consider including pbr in those plans.

December 23, 2016
Yesterday Valve gave me a copy of DOOM for Christmas (not really for Christmas). I got the wine bits in place from Fedora, then spent today trying to get DOOM to render on radv.



Thanks to ParkerR on #radeon for taking the picture from his machine, I'm too lazy.

So it runs, kinda: it hangs the GPU a fair bit and it misrenders some colors in some scenes, but you can see most of it. I'm not sure if I'll get back to this before next year (I'll try), but I'm pretty happy to have gotten it this far in a day, though I'm sure the next few things will be much more difficult to debug.

The branch is here:
https://github.com/airlied/mesa/commits/radv-wip-doom-wine
December 22, 2016
Yesterday I gave a short talk about the Chamelium board from the ChromeOS team, and thought that the slides could be useful for others as this board gets used more and more outside of Google.

https://people.collabora.com/~tomeu/Chamelium_Overview.odp


If you are interested in how this board can help you automate the testing of your display code and hardware (and not only that!), a new mailing list has been created to discuss its uses. We at Collabora will be happy to help you integrate this board into your CI lab as well.

Thanks go to Intel for sponsoring the preparation of these slides and for allowing me to share them under an open license.

And of course, thanks to Google's ChromeOS team for releasing the hardware design with an open hardware license along with the code they are running on it and with it.

The fancy new Sphinx-based documentation landed upstream a while ago. Jani Nikula has written a nice overview on LWN (part 2), and it is getting used a lot. But judging by how often I type it in replies on the mailing list, what’s missing is a super-short howto. To build the documentation, run:

$ make DOCBOOKS="" htmldocs

The output can then be found in Documentation/output/. When writing documentation, please always check that your new text actually gets rendered. The output also contains documentation about kernel-doc and the toolchain itself. Since the build is incremental, it is recommended that you first run it before touching anything; that way you’ll only see warnings in areas you’ve touched, not all of them - the build is unfortunately somewhat noisy.

December 20, 2016
Two big successes last week.

One is that Dave Airlie has pulled Boris's VEC (SDTV output) code for 4.10.  We didn't quite make it in time to get the DT changes in for full support in 4.10, but the DT changes will be a lot easier to backport than the driver code.

The other is that I finally got the DSI panel working, after 9 months of frustrating development.  It turns out that my DSI transactions to the Toshiba chip aren't working, but if I use I2C to the undocumented Atmel microcontroller to relay to the Toshiba in one of the 4 possible orderings of the register sequence, the panel comes up.  The Toshiba docs say I need to do I2C writes at the beginning of the poweron sequence, but the closed firmware uses DSI transactions for the whole sequence, making me suspicious that the Atmel has already done some of the Toshiba register setup.

I've now submitted the DSI driver, panel, and clock patches after some massive cleanup.  It even comes with proper runtime power management for both the panel and the VC4 DSI module.

The next steps in modesetting land will be to write an input driver for the panel, do any reworks we need from review, and get those chunks merged in time for 4.11.  While I'm working on this, Boris is now looking at my HDMI audio code so that hopefully we can get that working as well.  Eben is going to poke Dom about fixing the VC4 driver interop with media decode.  I feel like we're now making rapid progress toward feature parity on the modesetting side of the VC4 driver.
December 19, 2016

This is a common source of confusion: the legacy X.Org driver for touchpads is called xf86-input-synaptics but it is not a driver written by Synaptics, Inc. (the company).

The repository goes back to 2002 and for the first couple of years Peter Osterlund was the sole contributor. Back then it was called "synaptics" and really was a "synaptics device" driver, i.e. it handled PS/2 protocol requests to initialise Synaptics, Inc. touchpads. Evdev support was added in 2003, punting the initialisation work to the kernel instead. This was the groundwork for a generic touchpad driver. In 2008 the driver was renamed to xf86-input-synaptics and relicensed from GPL to MIT to take it under the X.Org umbrella. I've been involved with it since 2008 and the official maintainer since 2011.

For many years now, the driver has been a generic touchpad driver that handles any device that the Linux kernel can handle. In fact, most bugs attributed to the synaptics driver not finding the touchpad are caused by the kernel not initialising the touchpad correctly. The synaptics driver reads the same evdev events that are also handled by libinput and the xf86-input-evdev driver, any differences in behaviour are driver-specific and not related to the hardware. The driver handles devices from Synaptics, Inc., ALPS, Elantech, Cypress, Apple and even some Wacom touch tablets. We don't care about what touchpad it is as long as the evdev events are sane.

Synaptics, Inc.'s developers are active in kernel development to help get new touchpads up and running. Once the kernel handles them, the xorg drivers and libinput will handle them too. I can't remember any significant contribution by Synaptics, Inc. to the X.org synaptics driver, so they are simply neither to credit nor to blame for the current state of the driver. The top 10 contributors since August 2008 when the first renamed version of xf86-input-synaptics was released are:


8 Simon Thum
10 Hans de Goede
10 Magnus Kessler
13 Alexandr Shadchin
15 Christoph Brill
18 Daniel Stone
18 Henrik Rydberg
39 Gaetan Nadon
50 Chase Douglas
396 Peter Hutterer
There's a long tail of other contributors but the top ten illustrate that it wasn't Synaptics, Inc. that wrote the driver. Any complaints about Synaptics, Inc. not maintaining/writing/fixing the driver are missing the point, because this driver was never a Synaptics, Inc. driver. That's not a criticism of Synaptics, Inc. btw, that's just how things are. We should have renamed the driver to just xf86-input-touchpad back in 2008 but that ship has sailed now. And synaptics is about to be superseded by libinput anyway, so it's simply not worth the effort now.

The other reason I included the commit count in the above: I'm also the main author of libinput. So "the synaptics developers" and "the libinput developers" are effectively the same person, i.e. me. Keep that in mind when you read random comments on the interwebs, it makes it easier to identify people just talking out of their behind.

A long-standing criticism of libinput is its touchpad acceleration code, oscillating somewhere between "terrible", "this is bad and you should feel bad" and "I can't complain because I keep missing the bloody send button". I finally found the time and some more laptops to sit down and figure out what's going on.

I recorded touch sequences of the following movements:

  • super-slow: a very slow movement as you would do when pixel-precision is required. I recorded this by effectively slowly rolling my finger. This is an unusual but sometimes required interaction.
  • slow: a slow movement as you would do when you need to hit a target several pixels across from a short distance away, e.g. the Firefox tab close button
  • medium: a medium-speed movement though probably closer to the slow side. This would be similar to the movement when you move 5cm across the screen.
  • medium-fast: a medium-to-fast speed movement. This would be similar to the movement when you move 5cm across the screen onto a large target, e.g. when moving between icons in the file manager.
  • fast: a fast movement. This would be similar to the movement when you move between windows some distance apart.
  • flick: a flick movement. This would be similar to the movement when you move to a corner of the screen.
Note that all these are by definition subjective and somewhat dependent on the hardware. Either way, I tried to get something of a reasonable subset.

Next, I ran this through a libinput 1.5.3 augmented with printfs in the pointer acceleration code and a script to post-process that output. Unfortunately, libinput's pointer acceleration internally uses units equivalent to a 1000dpi mouse and that's not something easy to understand. Either way, the numbers themselves don't matter too much for analysis right now and I've now switched everything to mm/s anyway.

A note ahead: the analysis relies on libinput recording an evemu replay. That relies on uinput and event timestamps are subject to a little bit of drift across recordings. Some differences in the before/after of the same recording can likely be blamed on that.

The graph I'll present for each recording is relatively simple: it shows the velocity and the matching factor. The x axis is simply the events in sequence, the y axes are the factor and the velocity (note: two different scales in one graph). And it colours in the bits that see some type of acceleration: green means "maximum factor applied", yellow means "decelerated", and the purple "adaptive" means per-velocity acceleration is applied. Anything that remains white is used as-is (aside from the constant deceleration). This isn't really different from the first graph, it just shows roughly the same data in different colours.

Interesting numbers for the factor are 0.4 and 0.8. We have a constant factor of 0.4 on touchpads, i.e. a factor of 0.4 means "don't apply acceleration"; 0.8 is the "maximum factor". The maximum factor is twice as big as the normal factor, so the pointer moves twice as fast. Anything below 0.4 means we decelerate the pointer, i.e. the pointer moves slower than the finger.

The super-slow movement shows that the factor is, aside from the beginning, always below 0.4, i.e. the sequence sees deceleration applied. The takeaway here is that acceleration appears to be doing the right thing: slow motion is decelerated, and while there may or may not be some tweaking to do, there is no smoking gun.


Super slow motion is decelerated.

The slow movement shows that the factor is almost always 0.4, aside from a few extremely slow events. This indicates that for the slow speed, the pointer movement maps exactly to the finger movement save for our constant deceleration. As above, there is no indicator that we're doing something seriously wrong.


Slow motion is largely used as-is with a few decelerations.

The medium movement gets interesting. If we look at the factor applied, it changes wildly with the velocity across the whole range between 0.4 and the maximum 0.8. There is a short spike at the beginning where it maxes out but the rest is accelerated on-demand, i.e. different finger speeds will produce different acceleration. This shows the crux of what a lot of users have been complaining about - what is a fairly slow motion still results in an accelerated pointer. And because the acceleration changes with the speed the pointer behaviour is unpredictable.


In medium-speed motion acceleration changes with the speed and even maxes out.

The medium-fast movement shows almost the whole movement maxing out on the maximum acceleration factor, i.e. the pointer moves at twice the speed of the finger. This is a problem because this is roughly the speed you'd use to hit a "mentally preselected" target, i.e. you know exactly where the pointer should end up and you're just intuitively moving it there. If the pointer moves twice as fast, you're going to overshoot, and indeed that's what I've observed during the touchpad tap analysis userstudy.


Medium-fast motion easily maxes out on acceleration.

The fast movement shows basically the same thing, almost the whole sequence maxes out on the acceleration factor so the pointer will move twice as far as intuitively guessed.


Fast motion maxes out acceleration.

So does the flick movement, but in that case we want it to go as far as possible and note that the speeds between fast and flick are virtually identical here. I'm not sure if that's me just being equally fast or the touchpad not quite picking up on the short motion.


Flick motion also maxes out acceleration.

Either way, the takeaway is simple: we accelerate too soon, and there's a fairly narrow window of adaptive acceleration, so it's very easy to top out. The simplest fix to get most touchpad movements working well is to increase the current threshold at which acceleration kicks in. Beyond that it's a bit harder to quantify, but a good idea seems to be to stretch out the acceleration function so that the factor changes at a slower rate as the velocity increases, and to raise the maximum acceleration factor so we don't top out and keep going as the finger goes faster. This would be the intuitive expectation since it resembles physics (more or less).
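To illustrate the idea, here is a toy sketch of such a stretched-out profile in C - made-up numbers for illustration only, not libinput's actual acceleration code:

#include <stdio.h>

/* Toy model: below a threshold the finger movement is used as-is;
 * above it the factor ramps up gently and tops out later and higher. */
static double accel_factor(double speed_mm_s)
{
    const double threshold = 100.0; /* made-up cutoff in mm/s */
    const double max_factor = 2.5;  /* higher ceiling, so we don't top out early */
    const double slope = 0.01;      /* gentler ramp than before */

    if (speed_mm_s < threshold)
        return 1.0; /* pointer matches the finger */

    double factor = 1.0 + (speed_mm_s - threshold) * slope;
    return factor < max_factor ? factor : max_factor;
}

int main(void)
{
    for (double v = 0.0; v <= 400.0; v += 50.0)
        printf("%3.0f mm/s -> factor %.2f\n", v, accel_factor(v));
    return 0;
}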

There's a set of patches on the list now that does exactly that. So let's see what the result of this is. Note ahead: I also switched everything to mm/s, which causes some numbers to shift slightly.

The super-slow motion is largely unchanged though the velocity scale changes quite a bit. Part of that is that the new code has a different unit which, on my T440s, isn't exactly 1000dpi. So the numbers shift and the result of that is that deceleration applies a bit more often than before.


Super-slow motion largely remains the same.

The slow motions are largely unchanged but more deceleration is now applied. Tbh, I'm not sure if that's an artefact of the evemu replay, the new accel code or the result of the not-quite-1000dpi of my touchpad.


Slow motion largely remains the same.

The medium motion is the first interesting one because that's where we had the first observable issues. In the new code, the motion is almost entirely unaccelerated, i.e. the pointer will move as the finger does. Success!


Medium-speed motion now matches the finger speed.

The same is true of the medium-fast motion. In the recording the first few events were past the new thresholds so some acceleration is applied, the rest of the motion matches finger motion.


Medium-fast motion now matches the finger speed except at the beginning where some acceleration was applied.

The fast and flick motion are largely identical in having the acceleration factor applied to almost the whole motion but the big change is that the factor now goes up to 2.3 for the fast motion and 2.5 for the flick motion, i.e. both movements would go a lot faster than before. In the graphics below you still see the blue area marked as "previously max acceleration factor" though it does not actually max out in either recording now.


Fast motion increases acceleration as speed increases.

Flick motion increases acceleration as speed increases.

In summary, what this means is that the new code accelerates later but when it does accelerate, it goes faster. I tested this on a T440s, a T450p and an Asus VivoBook with an Elantech touchpad (which is almost unusable with current libinput). They don't quite feel the same yet and I'm not happy with the actual acceleration, but for 90% of 'normal' movements the touchpad now behaves very well. So at least we go from "this is terrible" to "this needs tweaking". I'll go check if there's any champagne left.

December 15, 2016
We're about a week before Christmas, and I'm going to explain how I created a retro keyboard as a gift to my father, who introduced me to computers when he brought back a Thomson TO7 home, all the way back in 1985.

The original idea was to fit a smaller computer, such as a CHIP or Raspberry Pi, inside a Thomson computer, but the software update support would have been difficult, the use limited to the builtin programs, and it would have required a separate screen. So I restricted myself to only making a keyboard. It was a big enough task, as we'll see.

How do keyboards work?

Loads of switches, that's how. I'll point you to Michał Trybus' blog post « How to make a keyboard - the matrix » for details on how this works. You'll just need to remember that most of the keyboards present in those older computers have no support for xKRO, and that the micro-controller we'll be using already has the necessary pull-up resistors builtin.
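To give an idea of what the firmware spends its time doing, here is a toy matrix scan in C. The GPIO helpers are stubs standing in for the real AVR port accesses; this is an illustration, not the TMK code:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_COLS 8
#define NUM_ROWS 8

/* Stubs standing in for real GPIO access on the micro-controller. */
static void drive_column_low(int col) { (void)col; }
static void release_column(int col)   { (void)col; }
static bool row_reads_low(int row)    { (void)row; return false; }

/* Drive one column low at a time; a pressed key connects that column
 * to a row, which then reads low through the internal pull-up. */
static void scan_matrix(uint8_t state[NUM_COLS])
{
    for (int col = 0; col < NUM_COLS; col++) {
        drive_column_low(col);
        uint8_t bits = 0;
        for (int row = 0; row < NUM_ROWS; row++)
            if (row_reads_low(row))
                bits |= (uint8_t)(1u << row);
        state[col] = bits;
        release_column(col);
    }
}

int main(void)
{
    uint8_t state[NUM_COLS] = {0};
    scan_matrix(state);
    for (int col = 0; col < NUM_COLS; col++)
        printf("col %d: 0x%02x\n", col, state[col]);
    return 0;
}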

The keyboard hardware

I chose the smallest Thomson computer available for my project, the MO5. I could have used a stand-alone keyboard, but that would have lost all the charm of it (it just looks like a PC keyboard), and some other computers have much bigger form factors to accommodate cartridge, cassette or floppy disk readers.

The DCMoto emulator's website includes tons of documentation, including technical documentation explaining the inner workings of each one of the chipsets on the mainboard. In one of those manuals, you'll find this page:



Whoot! The keyboard matrix in detail, no need for us to discover it with a multimeter.

That needs a wash in soapy water

After opening up the computer, and giving the internals - and especially the keyboard, if it has mechanical keys - a good clean, we'll need to see how the keyboard is connected.

Finicky metal covered plastic

Those keyboards usually are membrane keyboards, with pressure pads, so we'll need to either find replacement connectors at our local electronics store, or desolder the ones on the motherboard. I chose the latter option.

Desoldered connectors

After matching the physical connectors to the rows and columns in the matrix, using a multimeter and a few key presses, we now know which connector pin corresponds to which connector on the matrix. We can start soldering.

The micro-controller

The micro-controller in my case is a Teensy 2.0, an Atmel AVR-based micro-controller with a very useful firmware that makes it very very difficult to brick. You can either press the little button on the board itself to upload new firmware, or wire it to an external momentary switch. The funny thing is that the Atmega32U4 is 16 times faster than the original CPU (yeah, we're getting old).

I chose to wire it to the "Initial. Prog" ("Reset") button on the keyboard, so as to make it easy to upload new firmware. To do this, I needed to cut a few traces coming out of the physical switch on the board, using a tile cutter, to avoid interference from components on the board. This is completely optional, and if you're only going to use firmware that you already know at least somewhat works, you can set a key combo in the firmware to go into firmware upload mode. We'll get back to that later.

As far as connecting and soldering to the pins go, we can use any I/O pins we want, except D6, which is connected to the board's LED. Note that if you deviate from the pinout used in your firmware, you'll need to make matching changes to the firmware. We'll come back to that again in a minute.

The soldering

Colorful tinning

I wanted to keep the external ports full, so it didn't look like there were holes in the case, but there was enough headroom inside the case to fit the original board, the teensy and pins on the board. That makes it easy to rewire in case of error. You could also dremel (yes, used as a verb) a hole in the board.

As always, make sure early on that things will fit, especially the cables!

The unnecessary pollution

The firmware

Fairly early on during my research, I found the TMK keyboard firmware, as well as a very well written forum post with detailed explanations on how to modify an existing firmware for your own uses.

This is what I used to modify the firmware for the gh60 keyboard for my own use. You can see here a step-by-step example, implementing the modifications in the same order as the forum post.

Once you've followed the steps, you'll need to compile the firmware. Fedora ships with the necessary packages, so it's a simple:


sudo dnf install -y avr-libc avr-binutils avr-gcc

I also compiled and installed in my $PATH the teensy_loader_cli firmware uploader, and fixed up the udev rules. And after a "make teensy" and a button press...

It worked first time! This is a good time to verify that all the keys work, and you don't see doubled-up letters because of short circuits in your setup. I had 2 wires touching, and one column that just didn't work.
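Spelled out, the full cycle was something like this (the gh60 path comes from the tmk_keyboard repository layout; your keymap and paths will differ):

$ git clone https://github.com/tmk/tmk_keyboard.git
$ cd tmk_keyboard/keyboard/gh60
$ make
$ make teensy    # waits for the reset button press, then uploads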

I also prepared a stand-alone repository, with a firmware that uses the tmk_core from the tmk firmware, instead of modifying an existing one.

Some advice

This isn't the first time I've hacked on hardware, but I'll repeat some old adages and advice, because I rarely heed those warnings, and I regret...
  • Don't forget the size, length and non-flexibility of cables in your design
  • Plan ahead when you're going to cut or otherwise modify hardware, because you might regret it later
  • Use breadboard cables and pins to connect things, if you have the room
  • Don't hotglue until you've tested and retested and are sure you're not going to make more modifications
That last one explains the slightly funny cabling of my keyboard.

Finishing touches

All Sugru'ed up

To finish things off nicely, I used Sugru to stick the USB cable coming out of the machine in place. And as earlier, it avoids having an opening onto the internals.

There are a couple more things that I'll need to finish up before delivery. First, the keymap I have chosen in the firmware only works when a US keymap is selected. I'll need to make a keymap for Linux, possibly hard-coding it. I will also need to create a Windows keymap for my father to use (yep, genealogy software on Linux isn't quite up-to-par).

Prototype and final hardware

All this will happen in the aforementioned repository. And if you ever make your own keyboard, I'm happy to merge in changes to this repository with documentation for your Speccy, C64, or Amstrad CPC hacks.

(If somebody wants to buy me a Sega keyboard, I'll gladly work on a non-destructive adapter. Get in touch :)
December 14, 2016

The collective internet troll fest had its fun recently discussing AMD's DAL. Hacker news discussed the rejection and some of the reactions, reddit had some fun and of course everyone on phoronix forums was going totally nuts. Luckily reason seems to finally prevail with LWN covering things too. I don't want to spill more bits over the topic itself (read the LWN coverage and mailing list threads for that), but I think it's worth looking at the fundamental underlying problem a bit more.

Discussing midlayers seems to be one of the recuring topics in the linux kernel. There’s the original midlayer-mistake article from Neil Brown that seems to have started it all. But LWN gained more articles in the years since, covering the iscsi driver as a study in avoiding OS abstraction layers, or a similar case in wireless with the Broadcom driver. The dismissal of midlayers and hailing of helper libraries has become so prevalent that calling your shiny new subsystem libfoo (viz. libnvdimm) seems to be a powerful trick to speed up its acceptance.

It seems common knowledge and accepted fact, but still there's a constant stream of drivers that come packaged with huge abstraction layers - mostly to abstract OS internals, but very often also just a few midlayers between different components within the driver. A major reason for this is certainly that submissions come from former proprietary teams, or that code developed internally behind closed doors generally suffers from the platform problem - again LWN has you covered. If your driver is not open and part of upstream (or even for an open source OS), then the provided services and interfaces are fixed. And there's no way to improve things; worse, when change does happen the driver team generally doesn't have any influence at all over what changes and how. Hence it makes sense to insert a big abstraction layer to isolate the driver from the outside madness.

But that does not explain why big drivers (and more so, subsystems) come with some nice abstraction layers wedged in-between different parts. I believe, without any real proof, that this is because company planners are extremely risk averse: Sure, working together and sharing code across the entire project has long-term benefits. But the more people are involved the bigger the risk for a bikeshed fest or some other delay, and you never want to be the one team that delayed a release by a few months. Adding isolation in the form of lots of fixed abstraction layers helps with that. But long term, and for really big projects, sharing code and working together has clear benefits, even if there's routinely a hiccup - the neck-breaking speed of Linux kernel development overall is testament enough for that I think.

All that is just technicalities really, because in the end upstream and open source is about collaboratively developing software. That requires shared control, and interfaces between different components need to be a lot more permeable. In my experience that core idea of handing control over development to outsiders is the really scary part of joining upstream, since followed through to its conclusions it means you need to completely rethink how products are planned and developed: The entire organisation, from individual developers, to teams and including the management chain have to change. And that freaks people out.

In summary I think code submissions with lots of midlayers are bound to stay with us, and that's good: Because new midlayers means new teams and companies start to take upstream seriously. And new midlayers getting cleaned up means new teams and new companies are going through the painful changes necessary to adjust to an upstream first model. The code itself is just the canary for the real shifts happening.

In other news: World domination is still progressing according to plan.

December 12, 2016
As blogged about already by Christian, for Fedora 25 we've been working on improving hybrid gfx support, as well as on making it easier for users who want to, to install the NVidia binary driver.

The improved hybrid gfx support using the default opensource drivers was ready in time for and is part of the Fedora 25 release. Unfortunately the NVidia driver work was not ready in time. This has led to some confusion.

Let me start with a FAQ to try and clarify things:

1)  I want to use Fedora 25 on a laptop with hybrid graphics, will this work?

Yes Fedora 25 supports hybrid graphics out of the box using the opensource drivers included with Fedora. A lot of work has been done to make this as smooth as possible. If you encounter any problems with hybrid graphics under Fedora 25 using the default opensource drivers, please file a bug in bugzilla and put me in the Cc.

2) I want to use the NVidia driver with Fedora 25, I heard it would be available in gnome-software?

It is our intent to make the NVidia driver available in gnome-software (for users who have 3rd party sources enabled), but this is not ready yet. See below for a progress report on this.

3) I've an NVidia GPU which driver is better for me?

If you want maximum performance for e.g. gaming, you will likely want the NVidia driver. But be aware that on Optimus laptops using the NVidia driver will result in much higher power consumption / shorter battery life, as all rendering then is done on the NVidia GPU, which means that it cannot be powered down while you're doing light work.

If your workload does not need maximum 3D rendering performance you're likely better off with the default opensource drivers. On Optimus laptops these offer much better battery life, and the opensource drivers do not have the risk of breaking when upgrading your kernel, or upgrading to the next Fedora release.

4) I want to have good battery life, but I also want to play the occasional game with good performance?

As part of the work on making it easy to use the NVidia driver with Fedora, I'm also planning to add a utility which will allow you to easily switch between nouveau and the NVidia driver. Note this will require a reboot each time you switch.

5) What is the status of making the NVidia driver available for easy installation in Fedora 25?

The biggest issue with the NVidia driver is that it will not work out of the box on Optimus enabled laptops: installing it on such a laptop will typically result in the laptop booting to a black screen, which is NOT the user experience we want to offer.

I've just finished a set of patches for the xserver which allow the NVidia driver rpm to install an xorg.conf snippet that makes the Xorg autoconfigure code automatically set things up the right way for the NVidia driver on Optimus enabled laptops.
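I haven't seen the final packaged snippet, but the mechanism is an OutputClass section along these lines - consider this a guess at the shape, not the actual negativo17.org contents:

Section "OutputClass"
    Identifier "nvidia"
    MatchDriver "nvidia-drm"
    Driver "nvidia"
EndSection

The MatchDriver line keys off the kernel driver bound to the GPU, so the snippet only takes effect when the nvidia kernel module actually loaded.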

Depending on the upstream review of these patches I will prepare an updated xserver package with these patches for Fedora 25 soon. Once that is in place, the negativo17.org rpms can be updated with the xorg.conf snippet.

When that is done, there still are some other issues to tackle:

  • Mesa in F25 is not yet glvnd dispatch enabled, which means that the negativo17.org rpms need to play ldconfig path tricks, which means that if the rpms get installed on a system with an unsupported GPU, and Xorg falls back to the open-source driver things go boom because apps end up loading the wrong libGL.so

  • We want to have some mechanism in place to automatically load the nouveau kernel-module before starting gdm if the nvidia kernel module did not load for some reason (e.g. it failed to compile against a new kernel)

  • The negativo17.org rpms currently use dkms or akmod, building the nvidia kernel module from source on your system, we want to switch things over to having pre-built kmods available

Note that not all of these are necessarily blockers for adding the NVidia driver to gnome-software.

6) I want to try out the nvidia driver now, is that possible?

First of all, if you've an Optimus enabled laptop, please do not do this (yet); we should have a solution ready for you in a couple of weeks.

If you have a system which is only using a *single* nvidia GPU, then you can install the nvidia driver using the following commands:

  sudo dnf update 'kernel*'
  sudo dnf config-manager --add-repo=http://negativo17.org/repos/fedora-nvidia.repo
  sudo dnf install nvidia-settings kernel-devel dkms-nvidia vulkan.i686 nvidia-driver-libs.i686

Then reboot and you should be running the nvidia driver. To get back to the opensource driver, do:

  sudo dnf remove nvidia-driver

A short while ago, I asked a bunch of people for long-term touchpad usage data (specifically: evemu recordings). I currently have 25 sets of data, the shortest of which has 9422 events, the longest of which has 987746 events. I requested that evemu-record be run in the background while people used their touchpad normally. Thus the data is quite messy: it contains taps, two-finger scrolling, edge scrolling, palm touches, etc. It's also raw data from the touchpad, not processed by libinput. Some care has to be taken with analysis, especially since it is weighted towards long recordings. In other words, the user with 987k events has a higher influence than the user with 9k events. So the data is useful for looking for patterns that can be independently verified with other data later. But it's also useful for disproving hypotheses, i.e. "we cannot do $foo because some users' events show $bla".

One of the things I've looked into was tapping. In libinput, a tap has two properties: a time threshold and a movement threshold. If the finger is held down longer than 180ms or it moves more than 3mm it is not a tap. These numbers are either taken from synaptics or just guesswork (both, probably). The need for a time-based threshold is obvious: we don't know whether the user is tapping until we see the finger up event. Only if that doesn't happen within a given time do we know the user simply put the finger down. The movement threshold is required because small movements occur while tapping, caused by the finger really moving (e.g. when tapping shortly before/after a pointer motion) or by the finger center moving (as the finger flattens under pressure, the center may move a bit). Either way, these thresholds delay real pointer movement, making the pointer less reactive than it could be. So it's in our interest to have these thresholds as low as possible for reactive pointer movement, but as high as necessary for reliable tap detection.
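Conceptually the tap decision is as simple as this little sketch (thresholds as described above; an illustration, not libinput's actual code):

#include <math.h>
#include <stdbool.h>
#include <stdio.h>

/* A touch is a tap candidate if the finger lifts within the time
 * threshold and moved less than the distance threshold. */
static bool is_tap(double dt_ms, double dx_mm, double dy_mm)
{
    const double time_threshold_ms = 180.0;
    const double move_threshold_mm = 3.0;

    return dt_ms <= time_threshold_ms &&
           hypot(dx_mm, dy_mm) <= move_threshold_mm;
}

int main(void)
{
    printf("%s\n", is_tap(120.0, 0.5, 0.3) ? "tap" : "not a tap");
    return 0;
}

Lowering those two constants is what makes the pointer more reactive, at the cost of missing slower or sloppier taps.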

General data analysis

Let's look at the (messy) data. I wrote a script to calculate the time delta and movement distance for every single-touch sequence, i.e. anything with two or more fingers down was ignored. The script used a range of 250ms and 6mm of movement, discarding any sequences outside those thresholds. I also ignored anything in the left-most or right-most 10% because it's likely that anything that looks like a tap is a palm interaction [1]. I ran the script against those files where the users reported that they use tapping (10 users) which gave me 6800 tap sequences. Note that the ranges are purposely larger than libinput's to detect if there was a significant amount of attempted taps that exceed the current thresholds and would be misdetected as non-taps.

Let's have a look at the results. First, a simple picture that merely prints the start location of each tap, normalised to the width/height of the touchpad. As you can see, taps are primarily clustered around the center but can really occur anywhere on the touchpad. This means any attempt at detecting taps by location would be unreliable.


Normalized distribution of touch sequence start points (relative to touchpad width/height)

You can easily see the empty areas in the left-most and right-most 10%, that is an artefact of the filtering.

The analysis of time is more interesting: There are spikes around the 50ms mark with quite a few outliers going towards 100ms forming what looks like a narrow normal distribution curve. The data points are overlaid with markers for the mean [2], the 50 percentile, the 90 percentile and the 95 percentile [3]. And the data says: 95% of events fall below 116ms. That's something to go on.


Times between touch down and touch up for a possible tap event.
Note that we're using a 250ms timeout here and thus even look at touches that would not have been detected as tap by libinput. If we reduce to the 180ms libinput uses, we get a 95th percentile of 98ms, i.e. "of all taps currently detected as taps, 95% are 98ms or shorter".

The analysis of distance is similar: Most of the tap sequences have little to no movement, with 50% falling below 0.2mm of movement. Again the data points are overlaid with markers for the mean, the 50 percentile, the 90 percentile and the 95 percentile. And the data says: 95% of events fall below 1.8mm. Again, something to go on.


Movement between the touch down and the touch up event for a possible tap (10 == 1mm)
Note that we're using a 6mm threshold here and thus even look at touches that would not have been detected as tap by libinput. If we reduce to the 3mm libinput uses, we get a 95th percentile of 1.2mm, i.e. "of all taps currently detected as taps, 95% move 1.2mm or less".

Now let's combine the two. Below is a graph mapping times and distances from touch sequences. In general, the longer the time, the more movement we get, but most of the data is in the bottom left. Since doing percentiles is tricky on 2 axes, I mapped the respective axes individually. The biggest rectangle is the 95th percentile for time and distance, the number below shows how many data points actually fall into this rectangle. Looks promising, we still have a vast majority of touchpoints fall into the respective 95 percentiles though the numbers are slightly lower than the individual axes suggest.


Time to distance map for all possible taps
Again, this is for the 250ms by 6mm movement. About 3.3% of the events fall into the area between 180ms/3mm and 250ms/6mm. There is a chance that some of the touches would have been short, small movements; we just can't know from the data.

So based on the above, we learned one thing: it would not be reliable to detect taps based on their location. But we also suspect two things now: we can reduce the timeout and movement threshold without sacrificing a lot of reliability.

Verification of findings

Based on the above, our hypothesis is: we can reduce the timeout to 116ms and the threshold to 1.8mm while still having a 93% detection reliability. This is the most conservative reading, based on the extended thresholds.

To verify this, we needed to collect tap data from multiple users in a standardised and reproducible way. We wrote a basic website that displays 5 circles (see the screenshot below) on a canvas and asked a bunch of co-workers in two different offices [4] to tap them. While doing so, evemu-record was running in the background to capture the touchpad interactions. The touchpad was the one from a Lenovo T450 in both cases.


Screenshot of the <canvas> that users were asked to perform the taps on.
Some users ended up clicking instead of tapping and we had to discard those recordings. The total number of useful recordings was 15 from the Paris office and 27 from the Brisbane office. In total we had 245 taps (some users missed the circle on the first go, others double-tapped).

We asked each user three questions: "do you know what tapping/tap-to-click is?", "do you have tapping enabled" and "do you use it?". The answers are listed below:

  • Do you know what tapping is? 33 yes, 12 no
  • Do you have tapping enabled? 19 yes, 26 no
  • Do you use tapping? 10 yes, 35 no

I admit I kinda screwed up the data collection here because it includes those users whose recordings we had to discard. And the questions could've been better. So I'm not going to go into too much detail. The only useful thing here though is: the majority of users had tapping disabled and/or don't use it, which should make any potential learning effect disappear [5].

Ok, let's look at the data sets, same scripts as above:


Times between touch down and touch up for tap events

Movement between the touch down and the touch up events of a tap (10 == 1mm)
95th percentile for time is 87ms. 95th percentile for distance is 1.09mm. Both are well within the numbers we saw above. The combined diagram shows that 87% of events fall within the 87ms/1.09mm box.

Time to distance map for all taps
The few outliers here are close enough to the edge that if we expand the box to 100ms/1.3mm we get more than 95%. So it appears that our hypothesis is correct: reducing the timeout to 116ms and the threshold to 1.8mm retains a 95% detection reliability. Furthermore, using the clean data it looks like we can use a lower threshold than previously assumed and still get a good detection ratio. Specifically, data collected in a controlled environment across 42 different users of varying familiarity with touchpad tapping shows that 100ms and 1.3mm gets us a 95% detection rate of taps.

What does this mean for users?

Based on the above, the libinput thresholds will be reduced to 100ms and 1.3mm. Let's see how we go with this and then we can increase it in the future if misdetection is higher than expected. Patches will be on the wayland-devel list shortly.

For users that don't have tapping enabled, this will not change anything. All users who have tapping enabled will see a more responsive cursor on small movements as the time and distance thresholds have been significantly reduced. Some users may see a drop in tap detection rate. This is hopefully a subconscious enough effect that those users learn to tap faster or with less movement. If not, we have to look at it separately and see how we can deal with that.

If you find any issues with the analysis above, please let me know.

[1] These scripts analyse raw touchpad data, they don't benefit from libinput's palm detection
[2] Note: mean != median; the median is less affected by strong outliers. Look it up, it's worth knowing
[3] X percentile means X% of events fall below this value
[4] The Brisbane and Paris offices. No separate analysis was done, so it is unknown whether close proximity to baguettes has an effect on tap behaviour
[5] i.e. the effect of users learning how to use a system that doesn't work well out-of-the-box. This may result in e.g. quicker taps from those that are familiar with the system vs those that aren't.

December 09, 2016


One of the key reasons to keep using the out-of-tree Sierra GobiNet drivers and GobiAPI was that upgrading firmware in the WWAN modules was supported out of the box, while we didn’t have any way to do so with qmi_wwan in the upstream kernel and libqmi.

I’m glad to say that this is no longer the case, as we already have a new working solution in the aleksander/qmi-firmware-update branch in the upstream libqmi git repository, which will be released in the next major libqmi release. Check it out!

The new tool is named, no surprise, qmi-firmware-update; and allows upgrading firmware for Qualcomm based Sierra Wireless devices (e.g. MC7700, MC7710, MC7304, MC7354, MC7330, MC7444…). I’ve personally not tested any other device or manufacturer yet, so won’t say we support others just yet.

This work wouldn’t have been possible without Bjørn Mork‘s swi-update program, which already contained most of the bits and pieces for the QDL download session management, we all owe him quite some virtual beers. And thanks also to Zodiac Inflight Innovations for sponsoring this work!

Sierra Wireless SWI9200 series (e.g. MC7700, MC7710…)

The upgrade process for Sierra Wireless SWI9200 devices (already flagged as EOL, but still used in thousands of places) is very straightforward:

  • Device is rebooted in boot & hold mode (what we call QDL download mode) by running AT!BOOTHOLD in the primary AT port.
  • A QDL download session is run to upload the firmware image, which is usually just one single file which contains the base system plus the carrier-specific settings.
  • Once the QDL download session is finished, the device is rebooted in normal mode.

The new qmi-firmware-update tool supports all these steps just by running one single command as follows:

$ sudo qmi-firmware-update \
     --update \
     -d 1199:68a2 \
     9999999_9999999_9200_03.05.14.00_00_generic_000.000_001_SPKG_MC.cwe

Sierra Wireless SWI9x15 series (e.g. MC7304, MC7354…)

The upgrade process for Sierra Wireless SWI9x15 devices is a bit more complex, as these devices support and require the QMI DMS Set/Get Firmware Preference commands to initiate the download. The steps would be:

  • Decide which firmware version, config version and carrier strings to use. The firmware version is the version of the system itself, the config version is the version of the carrier-specific image, and the carrier string is the identifier of the network operator.
  • Using QMI DMS Set Firmware Preference, the new desired firmware version, config version and carrier are specified. When the firmware and config version don’t match the ones currently running in the device, it will reboot itself in boot & hold mode and wait for the new downloads.
  • A QDL download session is run to upload each firmware image available. For these kinds of devices, two options are given to the users: a pair of .cwe and .nvu images containing the system and carrier images separately, or a consolidated .spk image with both. It’s advised to always use the consolidated .spk image to avoid mismatches between system and config.
  • Once the QDL download sessions are finished, the device is rebooted in normal mode.

Again, the new qmi-firmware-update tool supports all these steps just by running one single command as follows:

$ sudo qmi-firmware-update \
     --update \
     -d 1199:68c0 \
 9999999_9902574_SWI9X15C_05.05.66.00_00_GENNA-UMTS_005.028_000-field.spk

The previous command will analyze each firmware image provided and extract the firmware version, config version and carrier, so that the user doesn’t need to explicitly provide them (although there are also options to do that if needed).

Sierra Wireless SWI9x30 series (e.g. MC7455, MC7430…)

The upgrade process for Sierra Wireless SWI9x30 devices is equivalent to the one used for SWI9x15. One of the main differences, though, is that SWI9x15 devices seem to only allow one pair of modem+pri images (system+config) installed in the system, while the SWI9x30 allows the user to download multiple images and then select them using the QMI DMS List/Select/Delete Stored Image commands.

The SWI9x30 modules may also run in MBIM mode instead of QMI. In this case, the firmware upgrade process is exactly the same as with the SWI9x15 series, but using QMI over MBIM. The qmi-firmware-update program supports this operation with the --device-open-mbim command line argument:

$ sudo qmi-firmware-update \
    --update \
    -d 1199:9071 \
    --device-open-mbim \
    SWI9X30C_02.20.03.00.cwe \
    SWI9X30C_02.20.03.00_Generic_002.017_000.nvu

Notes on device selection

There are multiple ways to select which device is going to be upgraded:

  • vid:pid: If there is a single device to upgrade in the system, it usually is easiest to use the -d option to select it by vid:pid (or even just by vid). This is the way used by default in all previous examples, and really the easiest one if you just have one modem available.
  • bus:dev: If there are multiple devices to upgrade in the same system, a more restrictive device selection can be achieved with the -s option specifying the USB device bus number plus device number, which is unique per physical device.
  • /dev/cdc-wdm: A firmware upgrade operation may also be started by using the --cdc-wdm option (shorter, -w) and specifying a /dev/cdc-wdm device exposed by the module.
  • /dev/ttyUSB: If the device is already in boot & hold mode, a QDL download session may be performed directly on the tty specified by the --qdl-serial (shorter, -q) option.
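For example, with two identical modems plugged in, the second one could be selected by bus and device number - the numbers below are made up, take yours from lsusb:

$ lsusb | grep Sierra
Bus 002 Device 003: ID 1199:68c0 Sierra Wireless, Inc.
$ sudo qmi-firmware-update \
     --update \
     -s 2:3 \
 9999999_9902574_SWI9X15C_05.05.66.00_00_GENNA-UMTS_005.028_000-field.spk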

Notes on firmware images

Sierra Wireless provides firmware images for all their SWI9200, SWI9x15 and SWI9x30 modules on their website. Sometimes they do specify “for Linux” (and you would get .cwe, .nvu or .spk images) and sometimes they just provide .exe Windows OS binaries. For the latter, you can just decompress the .exe file e.g. using 7-Zip and get the firmware images that you would use with qmi-firmware-update, e.g.:

 $ 7z x SWI9200M_3.5-Release13-SWI9200X_03.05.29.03.exe
 $ ls *.{cwe,nvu,spk} 2>/dev/null
 9999999_9999999_9200_03.05.29.03_00_generic_000.000_001_SPKG_MC.cwe

[TL;DR?]

qmi-firmware-update now allows upgrading firmware in Sierra Wireless modems using qmi_wwan and libqmi.


December 07, 2016

Update: Dec 08 2016: someone's working on this project. Sorry about the late update, but feel free to pick other projects you want to work on.

Interested in hacking on some low-level stuff and implementing a feature that's useful to a lot of laptop owners out there? We have a feature on libinput's todo list but I'm just constantly losing my fight against the ever-growing todo list. So if you already know C and you're interested in playing around with some low-level bits of software this may be the project for you.

Specifically: within libinput, we want to disable certain devices based on a lid state. In the first instance this means that when the lid switch is toggled to closed, the touchpad and trackpoint get silently disabled to not send events anymore. [1] Since it's based on a switch state, this also means that we'll now have to listen to switch events and expose those devices to libinput users.

The things required to get all this working are:

  • Designing a switch interface plus the boilerplate code required (I've done most of this bit already)
  • Extending the current evdev backend to handle devices with EV_SW and exposing their events
  • Hooking up the switch devices to internal touchpads/trackpoints to disable them ad-hoc
  • Handle those devices where lid switch is broken in the hardware (more details on this when we get to this point)

You get to dabble with libinput and a bit of udev and the kernel. Possibly Xorg stuff, but that's unlikely at this point. This project is well suited for someone with a few spare weekends ahead. It's great for someone who hasn't worked with libinput before, but it's not a project to learn C, you better know that ahead of time. I'd provide the mentoring of course (I'm in UTC+10, so expect IRC/email). If you're interested let me know. Riches and fame may happen but are not guaranteed.
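If you want to poke at the hardware side before diving in, reading the lid switch state with libevdev is a good warm-up. A rough sketch - the event node is just an example, use whatever your lid switch happens to be:

#include <fcntl.h>
#include <stdio.h>
#include <libevdev/libevdev.h>

int main(void)
{
    struct libevdev *dev = NULL;
    int fd = open("/dev/input/event3", O_RDONLY | O_NONBLOCK); /* example node */

    if (fd < 0 || libevdev_new_from_fd(fd, &dev) < 0) {
        perror("open/init");
        return 1;
    }

    if (libevdev_has_event_code(dev, EV_SW, SW_LID))
        printf("lid switch state: %d\n",
               libevdev_get_event_value(dev, EV_SW, SW_LID));
    else
        printf("not a lid switch device\n");

    libevdev_free(dev);
    return 0;
}

(Compile with the flags from pkg-config --cflags --libs libevdev.)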

[1] A number of laptops have a hw issue where either device may send random events when the lid is closed

xinput is a tool to query and modify X input device properties (amongst other things). Every so often someone complains about its non-intuitive interface, but this is where users are mistaken: xinput is not a configuration UI. It is a DUI - a developer user interface [1] - intended to test things without having to write a custom (more user-friendly) tool for each new property. It is nothing but a tool to access what is effectively a key-value store. To use it you need to know not only the key name(s) but also the allowed formats, some of which are only documented in header files. It is intended to be run under user supervision; anything it does won't survive device hotplugging. Relying on xinput for configuration is the same as relying on 'echo' to toggle parameters in /sys for kernel configuration. It kinda possibly maybe works most of the time but it's not pretty. And it's not intended to be, so please don't complain to me about the arcane user interface.

[1] don't do it, things will be a bit confusing, you may not do the right thing, you can easily do damage, etc. A lot of similarities... ;)

December 06, 2016

Avoiding CVE-2016-8655 with systemd

Just a quick note: on recent versions of systemd it is relatively easy to block the vulnerability described in CVE-2016-8655 for individual services.

Since systemd release v211 there's an option RestrictAddressFamilies= for service unit files which takes away the right to create sockets of specific address families for processes of the service. In your unit file, add RestrictAddressFamilies=~AF_PACKET to the [Service] section to make AF_PACKET unavailable to it (i.e. a blacklist), which is sufficient to close the attack path. Safer of course is a whitelist of address families, which you can define by dropping the ~ character from the assignment. Here's a trivial example:

…
[Service]
ExecStart=/usr/bin/mydaemon
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
…

This restricts access to socket families, so that the service may access only AF_INET, AF_INET6 or AF_UNIX sockets, which is usually the right, minimal set for most system daemons. (AF_INET is the low-level name for the IPv4 address family, AF_INET6 for the IPv6 address family, and AF_UNIX for local UNIX socket IPC).

Starting with systemd v232 we added RestrictAddressFamilies= to all of systemd's own unit files, always with the minimal set of socket address families appropriate.

With the upcoming v233 release we'll provide a second method for blocking this vulnerability. Using RestrictNamespaces= it is possible to limit which types of Linux namespaces a service may get access to. Use RestrictNamespaces=yes to prohibit access to any kind of namespace, or set RestrictNamespaces=net ipc (or similar) to restrict access to a specific set (in this case: network and IPC namespaces). Given that user namespaces have been a major source of security vulnerabilities in the past months it's probably a good idea to block namespaces on all services which don't need them (which is probably most of them).
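For example, to take namespaces away from the same trivial service from above:

…
[Service]
ExecStart=/usr/bin/mydaemon
RestrictNamespaces=yes
…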

Of course, ideally, distributions such as Fedora, as well as upstream developers would turn on the various sandboxing settings systemd provides like these ones by default, since they know best which kind of address families or namespaces a specific daemon needs.

This post mostly affects developers of desktop environments/Wayland compositors. A systemd pull request was merged to add two new properties to some keyboards: XKB_FIXED_LAYOUT and XKB_FIXED_VARIANT. If set, the device must not be switched to a user-configured layout but rather the one set in the properties. This is required to make fake keyboard devices work correctly out-of-the-box. For example, Yubikeys emulate a keyboard and send the configured passwords as key codes matching a US keyboard layout. If a different layout is applied, then the password may get mangled by the client.
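To check whether one of your devices is tagged this way, query the udev database. If the properties are set you should see something like this (the event node number is just an example):

$ udevadm info /dev/input/event5 | grep XKB_FIXED
E: XKB_FIXED_LAYOUT=us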

Since udev and libinput are sitting below the keyboard layout there isn't much we can do in this layer. This is a job for those parts that handle keyboard layouts and layout configurations, i.e. GNOME, KDE, etc. I've filed a bug for gnome here, please do so for your desktop environment.

If you have a device that falls into this category, please submit a systemd patch/file a bug and cc me on it (@whot).

December 05, 2016

This post applies to most tools that interface with the X server and change settings in the server, including xinput, xmodmap, setxkbmap, xkbcomp, xrandr, xsetwacom and other tools that start with x. The one word to sum up the future for these tools under Wayland is: "non-functional".

An X window manager is little more than an innocent bystander when it comes to anything input-related. Short of handling global shortcuts and intercepting some mouse button presses (to bring the clicked window to the front) there is very little a window manager can do. It's a separate process to the X server and does not receive most input events and it cannot affect what events are being generated. When it comes to input device configuration, any X client can tell the server to change it - that's why general debugging tools like xinput work.

A Wayland compositor is much more: it is a window manager and the display server merged into one process. This gives the compositor a lot more power and responsibility. It handles all input events as they come out of libinput and also manages the devices' configuration. Oh, and instead of the X protocol it speaks the Wayland protocol.

The difference becomes more obvious when you consider what happens when you toggle a setting in the GNOME control center. In both Wayland and X, the control center toggles a gsettings key and waits for some other process to pick it up. In both cases, mutter gets notified about the change but what happens then is quite different. In GNOME(X), mutter tells the X server to change a device property, the server passes that on to the xf86-input-libinput driver and from there the setting is toggled in libinput. In GNOME(Wayland), mutter toggles the setting directly in libinput.
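For example, toggling tap-to-click from a terminal goes through exactly that gsettings path in both worlds:

$ gsettings set org.gnome.desktop.peripherals.touchpad tap-to-click true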

Since there is no X server in the stack, the various tools can't talk to it. So to get the tools to work they would have to talk to the compositor instead. But they only know how to speak X protocol, and no Wayland protocol extension exists for input device configuration. Such a Wayland protocol extension would most likely have to be a private one since the various compositors expose device configuration in different ways. Whether this extension will be written and added to compositors is uncertain, I'm not aware of any plans or even intentions to do so (it's a very messy problem). But either way, until it exists, the tools will merely shout into the void, without even an echo to keep them entertained. Non-functional is thus a good summary.

The Raspberry Pi Foundation recently started contracting with Free Electrons to give me some support on the display side of the stack.  Last week I got to review and release their first big piece of work: Boris Brezillon's code for SDTV support.  I had suggested that we use this as the first project because it should have been small and self contained.  It ended up that we had some clock bugs Boris had to fix, and a bug in my core VC4 CRTC code, but he got a working patch series together shockingly quickly.  He did one respin for a couple more fixes once I had tested it, and it's now out on the list waiting for devicetree maintainer review.  If nothing goes wrong, we should have composite out support in 4.11 (we're probably a week late for 4.10).

On my side, I spent some more time on HDMI audio and the DSI panel.  On the audio side, I'm now emitting the GCP packet for audio mute appropriately (I think), and with some more clocking fixes it's now accepting the audio data at the expected rate.  On the DSI front, I fixed a bit of sequencing and added debugfs for the registers like we have in our other encoders.  There's still no actual audio coming out of HDMI, and only white coming out of the panel.

The DSI situation is making me wish for someone else's panel that I could attach to the connector, so I could see if my bugs are in the Atmel bridge programming or in the DSI driver.

I did some more analysis of 3DMMES's shaders, and improved our code generation, for wins of 0.4%, 1.9%, 1.2%, 2.6%, and 1.2%.  I also experimented with changing the UBO (indirect addressed uniform array) upload path, which showed no effect.  3DMMES's uniform arrays are tiny, though, so it may be a win in some other app later.

I also got a couple of new patches from Jonas Pfeil.  I went through his thread switch delay slots patch, which is pretty close to ready.  He has a similar patch for branching delay slots, though apparently that one isn't showing wins yet in things he's tested.  Perhaps most exciting, though, is that he went and implemented an idea I had dropped on github: replacing our shadow copies of raster textures with a bunch of math in the shader and using general memory loads.  This could potentially fix X performance without a compositor, which we otherwise really don't have a workaround for other than "use a compositor."  It could also improve GL-in-a-window performance: right now all of our DRI surfaces are raster format, because we want to be able to get full screen pageflipping, but that means we do the shadow copy if they aren't fullscreen.  Hopefully this week I'll get a chance to test and review it.

I pushed the patch to require resolution today, expect this to hit the general public with libinput 1.6. If your graphics tablet does not provide axis resolution we will need to add a hwdb entry. Please file a bug in systemd and CC me on it (@whot).

How do you know if your device has resolution? Run sudo evemu-describe against the device node and look for the ABS_X/ABS_Y entries:


# Event code 0 (ABS_X)
# Value 2550
# Min 0
# Max 3968
# Fuzz 0
# Flat 0
# Resolution 13
# Event code 1 (ABS_Y)
# Value 1323
# Min 0
# Max 2240
# Fuzz 0
# Flat 0
# Resolution 13
If the Resolution value is 0 you'll need a hwdb entry or your tablet will stop working in libinput 1.6. You can file the bug now and we can get it fixed; that way it'll be in place once 1.6 comes out.
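For reference, such a hwdb entry in systemd's 60-evdev.hwdb looks roughly like this. The match string below is hypothetical, and the EVDEV_ABS_ value format is min:max:resolution:fuzz:flat, so this sets only the resolution:

evdev:input:b0003v1234p5678*
 EVDEV_ABS_00=::13
 EVDEV_ABS_01=::13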

Pastebins are useful for dumping large data sets whenever the medium of conversation doesn't make this easy or useful. IRC is one example, or audio/video conferencing. But pastebins only work when the other side looks at the pastebin before it expires, and the default expiry date for a pastebin may only be a few days.

This makes them effectively useless for bugs where it may take a while for the bug to be triaged and the assignee to respond. It may take even longer to figure out the source of the bug, and if there's a regression it can take months to figure it out. Once the content disappears we have to re-request the data from the reporter. And there is a vicious dependency too: usually, logs are more important for difficult bugs. Difficult bugs take longer to fix. Thus, with pastebins, the more difficult the bug, the more likely the logs become unavailable.

All useful bug tracking systems have an attachment facility. Use that instead, it's archived with the bug and if a year later we notice a regression, we still have access to the data.

If you got here because I pasted the link to this blog post, please do the following: download the pastebin content as raw text, then add it as attachment to the bug (don't paste it as comment). Once that's done, we can have a look at your bug again.

November 28, 2016
I missed last week's update, but with the holiday it ended up being a short week anyway.

The multithreaded fragment shaders are now in drm-next and Mesa master.  I think this was the last big win for raw GL performance and we're now down to the level of making 1-2% improvements in our commits.  That is, unless we're supposed to be using double-buffer non-MS mode and the closed driver was just missing that feature.  With the glmark2 comparisons I've done, I'm comfortable with this state, though.  I'm now working on performance comparisons for 3DMMES Taiji, which the HW team often uses as a benchmark.  I spent a day or so trying to get it ported to the closed driver and failed, but I've got it working on the open stack and have written a couple of little performance fixes with it.

The first was just a regression fix from the multithreading patch series, but it was impressive that multithreading hid a 2.9% instruction count penalty and still showed gains.

One of the new fixes I've been working on is folding ALU operations into texture coordinate writes.  This came out of frustration from the instruction selection research I had done the last couple of weeks, where all algorithms seemed like they would still need significant peephole optimization after selection.  I finally said "well, how hard would it be to just finish the big selection problems I know are easily doable with peepholes?" and it wasn't all that bad.  The win came out to about 1% of instructions, with a similar benefit to overall 3DMMES performance (it's shocking how ALU-bound 3DMMES is).

I also started on a branch to jump to the end of the program when all 16 pixels in a thread have been discarded.  This had given me a 7.9% win on GLB2.7 on Intel, so I hoped for similar wins here.  3DMMES seemed like a good candidate for testing, too, with a lot of discards that are followed by reams of code that could be skipped, including texturing.  Initial results didn't seem to show a win, but I haven't actually done any stats on it yet.  I also haven't done the usual "draw red where we were applying the optimization" hack to verify that my code is really working, either.

While I've been working on this, Jonas Pfeil (who originally wrote the multithreading support) has been working on a couple of other projects.  He's been trying to get instructions scheduled into the delay slots of thread switches and branching, which should help reduce any regressions those two features might have caused.  More exciting, he's just posted a branch for doing nearest-filtered raster textures (the primary operation in X11 compositing) using direct memory lookups instead of our shadow-copy fallback.  Hopefully I get a chance to review, test, and merge in the next week or two.

On the kernel side, my branches for 4.10 have been pulled.  We've got ETC1 and multithread FS for 4.10, and a performance win in power management.  I've also been helping out and reviewing Boris Brezillon's work for SDTV output in vc4.  Those patches should be hitting the list this week.
November 20, 2016

The Fedora Change to retire the synaptics driver was approved by FESCo. This will apply to Fedora 26 and is part of a cleanup to, ironically, make the synaptics driver easier to install.

Since Fedora 22, xorg-x11-drv-libinput is the preferred input driver. For historical reasons, almost all users have the xorg-x11-drv-synaptics package installed. But to actually use the synaptics driver over xorg-x11-drv-libinput requires a manually dropped xorg.conf.d snippet. And that's just not ideal. Unfortunately, in DNF/RPM we cannot just say "replace the xorg-x11-drv-synaptics package with xorg-x11-drv-libinput on update but still allow users to install xorg-x11-drv-synaptics after that".

So the path taken is a package rename. Starting with Fedora 26, xorg-x11-drv-libinput's RPM will Provide/Obsolete [1] xorg-x11-drv-synaptics and thus remove the old package on update. Users that need the synaptics driver then need to install xorg-x11-drv-synaptics-legacy. This driver will then install itself correctly without extra user intervention and will take precedence over the libinput driver. Removing xorg-x11-drv-synaptics-legacy will remove the driver assignment and thus fall back to libinput for touchpads. So aside from the name change, everything else works smoother now. Both packages are now updated in Rawhide and should be available from your local mirror soon.

What does this mean for you as a user? If you are a synaptics user, after an update/install you now need to manually install xorg-x11-drv-synaptics-legacy. You can remove any xorg.conf.d snippets assigning the synaptics driver unless they also include other custom configuration.

See the Fedora Change page for details. Note that this is a Fedora-specific change only, the upstream change for this is already in place.

[1] "Provide" in RPM-speak means the package provides functionality otherwise provided by some other package even though it may not necessarily provide the code from that package. "Obsolete" means that installing this package replaces the obsoleted package.

November 16, 2016
At this point, I haven't pushed a new release tag for xf86-video-freedreno to update to latest xserver ABI.  I'm inclined not to.  If you are using a modern xserver you probably want to be using xf86-video-modesetting + glamor.  It has more features (dri3, xv, etc) and better performance.  And GL support on a3xx/a4xx is pretty solid.  So distros with a modern xserver might as well drop the xf86-video-freedreno package.

The one case where xf86-video-freedreno is still useful is bringing up a new generation of adreno, since it can do dri2 with pure-sw fallbacks for all the EXA ops.  But if that is what you are doing, I guess you know how to git clone and build.

The possible alternative is to push a patch that makes xf86-video-freedreno still build, but only probe (with latest xserver ABI) if some "ForceLoad" type option is given in xorg.conf, otherwise fallback to modesetting/glamor.  I can't think of a good reason to do this at the moment.  But as always, questions/comments/suggestions welcome.

November 15, 2016

As usual, the new motion blur filter can be turned on and off at build-time, and there is some configuration available as well to control how the effect works. Here are some screenshots:

[Screenshots: Motion Blur Off; Motion Blur On at intensity 12.5%; Motion Blur On at intensity 25%; Motion Blur On at intensity 50%]

Motion blur is a technique used to make movement feel smoother than it actually is and is targeted at hiding the fact that things don’t move in continuous fashion, but rather, at fixed intervals dictated by the frame rate. For example, a fast moving object in a scene can “leap” many pixels between consecutive frames even if we intend for it to have a smooth animation at a fixed speed. Quick camera movement produces the same effect on pretty much every object on the scene. Our eyes can notice these gaps in the animation, which is not great. Motion blur applies a slight blur to objects across the direction in which they move, which aims at filling the animation gaps produced by our discrete frames, tricking the brain into perceiving a smoother animation as a result.

In my demo there are no moving objects other than the sky box or the shadows, which are relatively slow objects anyway; however, camera movement can make objects change screen-space positions quite fast (especially when we rotate the view point), and the motion blur effect helps us perceive a smoother animation in this case.

I will try to cover the actual implementation in some other post, but for now I’ll keep it short and let the images above showcase what the filter actually does at different configuration settings. Notice that the smoothing effect can only be perceived during motion, so still images are not the best way to showcase the result of the filter from the perspective of the viewer. However, still images are a good way to freeze the animation and see exactly how the filter modifies the original image to achieve the smoothing effect.
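To make the idea concrete, here is a minimal sketch of a velocity-based blur pass, not the demo's actual implementation; all the names (u_scene, u_velocity, u_intensity) are hypothetical. It averages a few samples of the scene along a per-pixel screen-space motion vector:

#version 130

uniform sampler2D u_scene;     // the rendered frame
uniform sampler2D u_velocity;  // per-pixel screen-space motion vectors
uniform float u_intensity;     // blur strength, e.g. 0.125 for 12.5%

in vec2 uv;

void main()
{
    const int NUM_SAMPLES = 8;
    vec2 vel = texture(u_velocity, uv).xy * u_intensity;
    vec4 color = vec4(0.0);
    // Average samples along the direction of motion, centered on uv.
    for (int i = 0; i < NUM_SAMPLES; i++) {
        float t = float(i) / float(NUM_SAMPLES - 1) - 0.5;
        color += texture(u_scene, uv + vel * t);
    }
    gl_FragColor = color / float(NUM_SAMPLES);
}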

Last Friday, both a GNOME bug day and a bank holiday, a few of us got together to squash some bugs, and discuss GNOME and GNOME technologies.

Guillaume, a newcomer in our group, tested the captive portal support for NetworkManager and GNOME in Gentoo, and added instructions on how to enable it to their Wiki. He also tested a gateway-related configuration problem, the patch for which I merged after a code review. Near the end of the session, he also rebuilt WebKitGTK+ to test why Google Docs was not working for him anymore in Web. And nobody believed that he could build it that quickly. Looks like opinions based on past experiences are quite hard to change.

Mathieu worked on removing jhbuild's .desktop file, as nobody seems to use it and it was creating the Sundry category for him in gnome-shell. He also spent time looking into the tracker blocker that is Mozilla's Focus, based on disconnectme's block lists. It's not as effective as uBlock when it comes to blocking adverts, but the memory and performance improvements, and the slow churn rate, could make it a good default blocker to have in Web.

Haïkel looked into using Emeus, potentially the new GTK+ 4.0 layout manager, to implement the series properties page for Videos.

Finally, I added Bolso to jhbuild, and struggled to get gnome-online-accounts/gnome-keyring to behave correctly in my installation, as the application just did not want to log in properly to the service. I also discussed Fedora's privacy policy (inappropriate for Fedora Workstation, as it doesn't cover the services used in the default installation), a potential design for Flatpak support of joypads and removable devices in general, as well as the future design of the Network panel.
I dropped HDMI audio last week because Jonas Pfeil showed up with a pair of branches to do multithreaded fragment shaders.

Some context for multithreaded fragment shaders: Texture lookups are really slow.  I think we eyeballed them as having a latency of around 20 QPU instructions.  We can hide latency sometimes if there's some unrelated math to be done after the texture coordinate calculation but before using the texture sample.  However, in most cases, you need the texture sample results right away for the next bit of work.

To allow programs to hide that latency, there's a cooperative hyperthreading mode that a fragment shader can opt into.  The shader stays in the bottom half of register space, and before it collects the results of a texture fetch, it issues a thread switch signal, which the hardware will use to run a second fragment shader thread until that one issues its own thread switch.  For the second thread, the top bit of the register addresses gets flipped, so the two threads' states don't overlap (except for the accumulators and the flags, which the shaders have to save and restore).

I had delayed working on this because the full solution was going to be really tricky: queue up as many lookups as we can, then thread switch, then collect all those results before switching again, all while respecting the FIFO sizes.  However, Jonas's huge contribution here was in figuring out that you don't need to be perfect, you can get huge gains by thread switching between each texture lookup and reading its results.

The upshot was a 0-20% performance improvement on glmark2 and a performance hit to only one testcase.  With this we're down to 3 subtests that we're slower on than the closed source driver.  Jonas's kernel patch is out on the list, and I rewrote the Mesa series to expose threading to the register allocator and landed all but the enabling patch (which has to wait on the kernel side getting merged).  Hopefully I'll finish merging it in a week.

In the process of writing multithreading, Jonas noticed that we were scheduling our TLB_Z writes early in the program, which can cut into fragment shader parallelism between the QPUs (of which we have 12) because a TLB_Z write locks the scoreboard for that 2x2 pixel quad.  I merged a patch to Mesa that pushes the TLB_Z down to the bottom, at the cost of a single extra QPU instruction.  Some day we should flip QPU scheduling around so that we pair that TLB_Z up better, and fill our delay slots in thread switches and branching as well.

I also tracked down a major performance issue in Raspbian's desktop using the open driver.  I asked them to use xcompmgr a while back because readback from the front buffer using the GPU is really slow (the texture unit can't read from raster textures, so we have to reformat them to be readable), which made window dragging unbearable.  However, xcompmgr doesn't unredirect fullscreen windows, so full screen GL apps emit copies (with reformatting!) instead of pageflipping.

It looks like the best choice for Raspbian is going to be using compton, an xcompmgr fork that does pageflipping (you get tear free screen updates!) and unredirection of full screen windows (you get pageflipping directly from GL apps).  I've also opened a bug report on compton with a description of how they could improve their GL drawing for a tiled renderer like the Pi, which could improve its performance for windowed updates significantly.

Simon ran into some trouble with compton, so he hasn't flipped the default yet, but I would encourage anyone running a Raspberry Pi desktop to give it a shot -- the improvement should be massive.

Other things last week: More VCHIQ patch review, merging more to the -next branches for 4.10 (Martin Sperl's thermal driver, at last!), and a shadow texturing bugfix.
November 14, 2016

I've written more extensively about this here but here's an analogy that should get the point across a bit better: Wayland is just a protocol, just like HTTP. In both cases, you have two sides with very different roles and functionality. In the HTTP case, you have the server (e.g. Apache) and the client (a browser, e.g. Firefox). The communication protocol is HTTP but both sides make a lot of decisions unrelated to the protocol. The server decides what data is sent, the client decides how the data is presented to the user. Wayland is very similar. The server, called the "compositor", decides what data is sent (also: which of the clients even gets the data). The client renders the data [1] and decides what to do with input like key strokes, etc.

Asking Does $FEATURE work under Wayland? is akin to asking Does $FEATURE work under HTTP?. The only answer is: it depends on the compositor and on the client. It's the wrong question. You should ask questions related to the compositor and the client instead, e.g. "does $FEATURE work in GNOME?" or "does $FEATURE work in GTK applications?". That's a question that can be answered.

Of course, there are some cases where the fault is really the protocol itself. But often enough, it's not.

[1] Though it does so by telling the compositor to display it. The analogy with HTTP only works to some extent... :)

November 10, 2016

So, in Fedora Workstation 24 we added H264 support through OpenH264. In Fedora Workstation 25 I am happy to tell you all that we are taking another step in improving our codec support by adding support for mp3 playback. I know this has been a big wishlist item for a long time for a lot of people so I am really happy that we are finally in a position to fulfil that wish. You should be able to download the mp3 plugin on day 1 through GNOME Software or through the missing codec installer in various GStreamer applications. For Fedora Workstation 26 I would not be surprised if we decide to ship it on the install media.

For the technically inclined out there, our initial enablement is through the mpg123 library and the corresponding GStreamer plugin. The main reason we chose this library over all the others available out there was a combination of it using the same license as GStreamer (LGPL v2) and being a well established library already used by a lot of different applications. There might be other mp3 decoders added in the future depending on interest in and effort by the community. So get ready to install Fedora Workstation 25 when it's released soon and play some tunes :)
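Once the plugin is installed, any GStreamer-based player should pick it up transparently, but for a quick smoke test from a terminal something like this should work (path illustrative):

$ gst-launch-1.0 playbin uri=file:///home/user/music/song.mp3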

P.S. To be 110% clear we will not be adding encoding support at this time.

November 08, 2016
Almost all of Collabora's customers use the Linux kernel on their products. Often they will use the exact code as delivered by the SBC vendors and we'll work with them on other parts of their software stack. But it's becoming increasingly common for our customers to adapt the kernel sources to the specific needs of their particular products.

A very big problem most of them have is that the kernel version they based their product on isn't getting security updates any more because it's already several years old. And the reason why companies are shipping such old kernels is that their trees have been so heavily modified compared to the upstream versions that rebasing them on top of newer mainline releases is so expensive that it is very hard to budget and plan for.

To avoid that, we always recommend that our customers stay close to their upstreams, which implies rebasing often on top of new releases (typically LTS releases, with long term support). For the budgeting of that work to become possible, the size of the delta between mainline and downstream sources needs to be manageable, which is why we recommend contributing back any changes that aren't strictly specific to their products.

But even for those few companies that already have processes in place for upstreaming their changes and are rebasing regularly on top of new LTS releases, keeping up with mainline can be a substantial disruption of their production schedules. This is in part because new bugs will be in the new mainline release, and new bugs will be in the downstream changes as they get applied to the new version.

Those companies that are already keeping close to their upstreams typically have advanced QA infrastructure that will detect those bugs long before production, but a long stabilization phase after every rebase can significantly slow product development.

To improve this situation and encourage more companies to keep their efforts close to upstream, we at Collabora have been working for a few years already on continuous integration of FOSS components across a diverse array of hardware. The initial work was sponsored by Bosch for one of their automotive projects, and since the start of 2016 Google has been sponsoring work on continuous integration of the mainline kernel.

One of the major efforts to continuously integrate the mainline Linux kernel codebase is kernelci.org, which builds several configurations of different trees and submits boot jobs to several labs around the world, collating the results. This is being of great help already in detecting at a very early stage any changes that either break the builds, or prevent a specific piece of hardware from completing the boot stage.

Though kernelci.org can easily detect when an update to a source code repository has introduced a bug, such updates can contain several dozen new commits, and without knowing which specific commit introduced the bug, we cannot identify culprits to notify of the problem. This means that either someone needs to monitor the dashboard for problems, or email notifications are sent to the owners of the repositories, who then have to manually look for suspicious commits before getting in contact with their authors.

To address this limitation, Google has asked us to look into improving the existing code for automatic bisection so it can be used right away when a regression is detected, so the possible culprits are notified right away without any manual intervention.

Another area in which kernelci.org is currently lacking is coverage of the testing. Build and boot regressions are very annoying for developers because they negatively impact everybody who works on the affected configurations and hardware, but regressions in peripheral support or other subsystems that aren't critically involved during boot can still make rebases much costlier.

At Collabora we have had a strong interest in having the DRM subsystem under continuous integration, and some time ago started an R&D project for making the test suite in IGT generically useful for all the DRM drivers. IGT started out being i915-specific, but as most of the tests exercise the generic DRM ABI, they could as well test other drivers with a moderate amount of effort. Early in 2016 Google started sponsoring this work, and as of today submitters of new drivers are using it to validate their code.

Another related effort has been the addition to DRM of a generic ABI for retrieving CRCs of frames from different components in the graphics pipeline, so two frames can be compared when we know that they should match. And another one is adding support to IGT for the Chamelium board, which can simulate several display connections and hotplug events.

A side-effect of having continuous integration of changes in mainline is that when downstreams are sending back changes to reduce their delta, the risk of introducing regressions is much smaller and their contributions can be accepted faster and with less effort.

We believe that improved QA of FOSS components will expand the base of companies that can benefit from involvement in development upstream and are very excited by the changes that this will bring to the industry. If you are an engineer who cares about QA and FOSS, and would like to work with us on projects such as kernelci.org, LAVA, IGT and Chamelium, get in touch!

November 07, 2016
I spent a little bit of time last week on actual 3D work.  I finally figured out what was going wrong with glmark2's terrain demo.  The RCP (reciprocal) instruction on vc4 is a very rough approximation, and in translating GLSL code doing floating point divides (which we have to convert from a / b to a * (1 / b)) we use a Newton-Raphson step to improve our approximation.  However, we forgot to do this for the implicit divide we have to do at the end of your vertex shader, so the triangles in the distance were unstable and bounced around based on the error in the RCP output.
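For reference, one Newton-Raphson iteration for 1/a is just a couple of multiplies and a subtract, so it is cheap to emit after each RCP. A sketch in GLSL-style code, where hw_rcp() is a made-up stand-in for the QPU's low-precision RCP result:

float refined_rcp(float a)
{
    float r0 = hw_rcp(a);        // rough hardware approximation of 1/a
    return r0 * (2.0 - a * r0);  // one Newton-Raphson step
}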

I also finally got around to writing kernel-side validation for ETC1 texture compression.  I should be able to merge support for this to v4.10, which is one of the feature checkboxes we were missing compared to the old driver.  Unfortunately ETC1 doesn't end up being very useful for us in open source stacks, because S3TC has been around a lot longer and supported on more hardware, so there's not much content using ETC that I know of outside of Android land.

I also spent a while fixing up various regressions that had happened in Mesa while I'd been playing in the kernel.  Some day I should work on CI infrastructure on real hardware, but unfortunately Mesa doesn't have any centralized CI infrastructure to plug into.  Intel has a test farm, but not all proposed patches get submitted to it, and it (understandably) doesn't have non-Intel hardware.

In kernel land, I sent out a patch to reduce V3D's overhead due to power management.  We were immediately shutting off the 3D engine when it went idle, even if userland might be expected to submit a new frame shortly thereafter.  This was taking up 1% of the CPU on profiles I was looking at last week, and I've seen it up to 3.5%.  Now we keep the GPU on for at least 40ms ("about two frames plus some slack").  If I'm reading right, the closed driver does immediate powerdown as well, so we may be using slightly more power now, but given that power state transitions themselves cost power, and the CPU overhead costs power, hopefully this works out fine anyway (even though we've never done power measurements for the platform).

I also reviewed more patches from Michael Zoran for vchiq.  It should now (once all reviewed patches land) work on arm64.  We still need to get the vchiq-using drivers like camera and vcsm into staging.

Finally, I made more progress on HDMI audio.  One of the ALSA developers, Lars-Peter Clausen, pointed out that ALSA's design is that userspace would provide the IEC958 subframes by using alsalib's iec958 plugin and a bit of asoundrc configuration.  He also provided a ton of help debugging that pipeline.  I rebased and cleaned up into a branch that uses simple-audio-card as my machine driver, so the whole stack is pretty trivial.  Now aplay runs successfully, though no sound has come out.  Next step is to go test that the closed stack actually plays audio on this monitor and capture some register state for debugging.
November 01, 2016

I finally have a bit of time to look at touchpad pointer acceleration in libinput. But when I did, I found a great total of 5 bugs across freedesktop.org and Red Hat's bugzilla, despite this being the first thing anyone brings up whenever libinput is mentioned. 5 bugs - that's not much to work on. Note that over time there were also a lot of bugs where pointer acceleration was fixed once the touchpad's axis ranges were corrected, which usually is a two-liner for the udev hwdb.

Anyway, point of this post: if you're still having issues with pointer acceleration on your touchpad in libinput, please file a bug against libinput and make it block the new tracker bug 98535. The libinput documentation has instructions on how to report a touchpad bug, but amongst the various things I need is your laptop model name.

Don't complain about it on reddit, phoronix, HN, or in some random forum, because you're just wasting bytes there and it won't get fixed that way.

When we started the Fedora Workstation effort, one thing we wanted to do was to drain the proverbial swamp and make sure that running Linux on a laptop is a first rate experience. As you can see from my last blog entry, we have been working on building a team dedicated to that task. There are many facets to this effort, but one that we kept getting asked about was sorting out hybrid graphics. Some of this has been covered in previous blog entries, but I want to be thorough here, so this blog will cover the big investments of time and effort we are putting into Wayland and X Windows, GNOME Shell and Nouveau (the open source driver for NVidia GPU hardware).

A big part of this story is also that we are trying to make it easier to run the NVidia binary driver on Fedora in general, which is actually a fairly large undertaking as there are a lot of challenges with trying to do that without requiring any manual configuration from the user, and with making sure it works fully in the hybrid graphics usecase. This is one of the most requested areas of improvement for Fedora. Many users are trying to run Fedora on their existing laptops or other hardware. Rather than allow this gap to be a reason for people to not run Fedora, we want to provide a better experience. Of course users will have the freedom to make their own choice about installing these drivers, using information provided by Software.

Hybrid graphics is the term used for when you have a laptop with two GPUs, usually one Intel and one NVidia GPU, but there are also some laptops available that come with an Intel/AMD CPU + AMD GPU. The purpose of hybrid graphics is to have your (Intel) integrated GPU be your ‘day to day’ GPU for running your desktop without drawing too much power, while letting you activate your secondary GPU, which has a lot more power but also draws more electricity, when you want to play a game for instance.

What complicated this even more was the fact that most users who wanted to use this setup wanted to use it in combination with the binary NVidia driver, in order to get top performance from their second GPU, which is not surprising since that is the whole point of switching to it. The main challenge was that the Mesa and binary NVidia drivers both provided an OpenGL implementation that, due to the way things work in the X Window System, ended up overwriting each other, basically breaking the overwritten driver. The workaround for this was a system called Bumblebee, which employed some clever hacks to work around the issue, but Bumblebee is a solution to a problem we shouldn’t have to begin with.

Of course, dealing with the OpenGL stack wasn’t the only challenge here. For instance, different display outputs can end up being connected to different GPUs, with one of the most common setups being that the internal screen is connected to your Intel GPU and your external HDMI and DisplayPort connections are connected to your NVidia or AMD GPU. On some systems there were hardware connectors allowing either GPU to drive any screen, but a setup we see becoming more common is that the drivers for both cards need to be initialized in all cases, allowing the display-connected GPU to slave itself to the other GPU, so you can render on one GPU and display on the other. Within Mesa this is fairly straightforward to handle, so if both the Intel and NVidia GPUs are rendering using Mesa things mostly just work. What makes this a lot more challenging is the NVidia binary driver: once you install the binary driver we need to find a way to bridge and interoperate between the two stacks.

And of course that is just the highlights; as with anything this complex there is a long laundry list of items between the point where you can check the box on having a feature and the point where it works really well and seamlessly.

What to expect in Fedora Workstation 25
Let’s start with a word of caution: we are working on a tight deadline to get all the pieces lined up, so I cannot be 100% sure what will make it for the day of release. What we are confident about though is that we will have all the low level plumbing sorted so that we can improve this over the course of the Fedora 25 lifecycle. As always I hope the community will join us in testing and developing this, to ensure we have even the corner cases worked out for Fedora Workstation 26.

The initial step we took, and this was quite some time ago, was Dave Airlie’s work on making sure we could handle two GPUs within the Mesa stack. As a bit of wordplay on the NVidia solution being called Optimus, the open source Mesa solution was named Prime. This work basically allowed you to use Mesa for your Intel card and Mesa (Nouveau) for your NVidia card, and launch applications on the NVidia card while the screen was connected to the Intel card. You can choose which one is used by setting the DRI_PRIME=1 environment variable.
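For example, assuming the glxinfo utility is installed, something like this should print the discrete GPU’s renderer string, while the same command without the variable reports the integrated GPU:

$ DRI_PRIME=1 glxinfo | grep "OpenGL renderer"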

The second step was Adam Jackson collaborating with NVidia on something called libglvnd. Libglvnd stands for GL Vendor Neutral Dispatch and it is basically a dispatch layer that allows your OpenGL calls to be dispatched to more than one OpenGL implementation. NVidia created this specification and have been steadily working on supporting it in their binary driver, and Adam Jackson stepped up to help review their patches and work on supporting glvnd in the Mesa drivers. This means that for Fedora Workstation 25 you should, for the first time ever, be able to have both the binary NVidia driver and Mesa installed without any file conflicts. So the X server now has the infrastructure to route your OpenGL calls to the correct stack when the DRI_PRIME variable is set. That said, there is a major catch: due to the way things currently work, once the NVidia binary driver is installed it expects to be rendering the screen all the time. So in the short term we need to figure out a way to allow it to do that, and in the long run we need to work with NVidia to figure out how the Intel open source driver can collaborate with the NVidia driver, allowing us to use only the Intel driver at times (to save power, for instance). We are still pushing hard to have the short term fix in place, but as I write this we haven’t fully nailed it down yet.

The third step is work that Ben Skeggs has been doing on dealing with the monitor handling here, which includes adding MST support to Nouveau, because a lot of external ports and docking stations have not been working with these hybrid graphics setups due to the external screens all being routed through the NVidia chip. These patches just got accepted upstream and we will be including them in Fedora Workstation 25.

The fourth step has been work that Hans de Goede has been doing around Prime, fixing the modesetting driver and fixing cursor handling with hybrid graphics. In some sense the work Hans has been doing (check his blog entry linked) is part of that laundry list I talked about, the many smaller items that need to work smoothly for this to go from ‘checkbox works’ to ‘actually works’. He is also working with the DNF team to allow us to control kernel updates if you use the binary NVidia driver, meaning that we will only bump your kernel when we have the matching binary NVidia driver module ready too. And if for any reason the binary NVidia driver doesn’t work, we want a graceful fallback to Nouveau.

Fifth step has been Jonas Ådahl’s work on enabling the binary NVidia driver for Wayland. He has put together a set of patches to support NVidia’s EGLStreams interface, which means that starting from Fedora Workstation 25 you will be able to use Wayland also with NVidia’s binary driver.
You can see his work-in-progress patches here. Jonas will also be looking at implementing hybrid graphics support in Wayland, to ensure that Wayland is on par with X for this usecase too in Fedora Workstation 26.

The sixth step has been work done by Bastien Nocera to ensure we expose this functionality in the user interface. The idea is that you should be able to configure on a per application basis which GPU they are being launched on. It also means that you can now get all your GPU information in the GNOME about screen also when you have dual-GPU systems. More details in his blog post here.

[Screenshot: choosing the discrete GPU in GNOME Shell]

The seventh step is the work that Simone Caronni from Negativo17 has been doing with us on packaging the binary NVidia driver in a way that makes all of this work. You can find his repo here on Negativo17, and Simone has also been working with Kalev Lember and Richard Hughes to ensure the driver shows up correctly in GNOME Software once you have the repository enabled.

[Screenshot: NVidia driver in GNOME Software]

The plan is to offer the driver in Fedora Workstation 25 as third party software, but we haven’t yet made the formal proposal due to wanting to be sure we ironed out all important bugs first, both on our side and on NVidia side.

So as you can see from all of this there are a lot of components involved, and since we are trying to support both X and Wayland with all of this the job hasn’t exactly been easy. There are for sure some things that will not be ready in time for Fedora Workstation 25; for instance, if you want to use the binary NVidia driver we will not be able to make that work with XWayland. So native Wayland applications will be fine, including games using SDL on top of Wayland, but due to how the stack is architected we haven’t gotten around to implementing bridging of OpenGL from XWayland down to the binary NVidia driver. With Nouveau it should work, however. This corner case we will have to figure out for Fedora Workstation 26, but for now we have decided that if we detect an Optimus system we will default you to X instead of Wayland.

Get involved
As always, we would love for more people to join us in this effort, and one good way to get started is to join our Hybrid graphics test day on Thursday the 3rd of November!

October 31, 2016
Last week was spent almost entirely on HDMI audio support.

I had started the VC4-side HDMI setup last week, and got the rest of the stubbed bits filled out this week.

Next was starting to get the audio side connected.  I had originally derived my VC4 side from the Mediatek driver, which makes a child device for the generic "hdmi-codec" DAI codec to attach to.  HDMI-codec is a shared implementation for HDMI encoders that take I2S input, which basically just passes audio setup params into the DRM HDMI encoder, whose job is to actually make I2S signals on wires get routed into HDMI audio packets.  They also have a CPU DAI registered through device tree that moves PCM data from memory onto wires, and a machine driver registered through devicetree that connects the CPU DAI's wires to the HDMI codec DAI's wires.  This is apparently how SoC audio is usually structured.

VC4, on the other hand, doesn't have a separate hardware block for moving PCM data from memory onto wires; it just has the MAI_DATA register in the HDMI block that you put S/PDIF (IEC958) data into using the SoC's generic DMA engine.  So I made VC4's HDMI driver register a child device with the MAI_DATA register's address passed in, and that child device gets bound by the CPU DAI driver and uses the ASoC generic DMA engine support.  That CPU DAI driver also ends up setting up the machine driver, since I didn't have any other reference to the CPU DAI anywhere.  The glue here is all based on static strings I can't rely on, and is a complete hack for now.

Last, notice how I've used "PCM" when talking about a working implementation's audio data, and "S/PDIF" for the data I have to take in?  That's the sticking point.  The MAI_DATA register wants things that look like the AES3 frame description you can find on Wikipedia.  I think the HW is generating my preamble.  In the firmware's HDMI audio implementation, they go through the unpacked PCM data and set the channel status bit in each sample according to the S/PDIF channel status word (the first 2 bytes of which are also documented on Wikipedia).  At the same time that they're doing that, they also set the parity bit.

My first thought was "Hey, look, ALSA has SNDRV_PCM_FORMAT_IEC958_SUBFRAME, maybe my codec can say it only takes these S/PDIF frames and everything will work out."  No.  It turns out aplay won't generate them and neither will gstreamer.  To make a usable HDMI audio driver, I'm going to need to generate my IEC958 myself.

Linux is happy to generate my channel status word in memory.  I think the hardware is happy to generate my parity bit.  That just leaves making the CPU DAI write the bits of the channel status word into the subframes as PCM samples come in to the driver, and making sure that the MSB of the PCM audio data is bit 27.  I haven't figured out how to do that quite yet, which is the project for this week.
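For reference, this is my understanding of the 32-bit AES3/IEC958 subframe layout (double-check against the spec before relying on it):

  bits 0-3    preamble/sync
  bits 4-27   audio sample word, MSB in bit 27 (bits 4-7 double as aux data for 20-bit samples)
  bit 28      validity flag
  bit 29      user data
  bit 30      channel status (one bit per frame of the 192-bit status word)
  bit 31      parity (even parity over bits 4-31)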

Work-in-progress, completely non-functional code is here.

For the last time in 2016, I flew out to the OpenStack Summit in Barcelona, where I had the chance to meet (again) a lot of my fellow OpenStack contributors there.

How To Work Upstream with OpenStack

My week started by giving a talk about How To Work Upstream with OpenStack where I explained, accompanied by Ryota and Ashiq, to the audience how to contribute upstream to OpenStack. It went well and was well received by the public – you can watch the video below or download the slides.

Python 3 in telemetry projects

I've attended a few interesting cross-project sessions, which helped me get some prioritization for my work over the next few months.

The Python 3 porting effort has been blocked for a while in Nova and Swift for various (mostly non-technical) reasons, while almost all other projects are working correctly. On the other hand, we have committed the telemetry projects to be the first ones to drop Python 2 support as soon as it is possible. The next steps are to make sure downstream is ready and to enable functional testing in devstack with Python 3.

Ceilometer deprecation

[Photo: Gordon Chung talking about Gnocchi 3]

The Ceilometer sessions were really interesting, as we mainly discussed deprecating and removing old cruft that is not, or should not be, used anymore. The main change will be the deprecation of the Ceilometer API. It has been clear for more than a year that Gnocchi is the way to go to store and provide access to metrics, but we failed at announcing it widely. A lot of the people I talked to during the summit were not aware that the Ceilometer API was not a good pick, and that Gnocchi is now the recommended storage backend. Bad communication from our side – but we are going to fix it as of now.

We also committed to simplifying the current architecture by removing the collector, which has now been made obsolete by the agent-based architecture that was implemented during the last development cycles.

Aodh alarm timeout

We had a feature proposal in Aodh that we had postponed for too long already: triggering an alarm after a timeout when certain expected events have not been seen. This seems to be a functionality requested by NFV users – something we want Aodh to cover. We spent some time discussing this feature, and now that we all have a clear understanding of the use case, we'll work on a clear path to the implementation.

I've also attended a session with the Vitrage developers in order to discuss how we could work better together, as they rely on Aodh. It seems there might be some convergence in the future, which would be very welcome. Wait'n see.

Gnocchi improvement, past and future

The Gnocchi session ran smoothly, and everyone seemed happy with the work we have done so far. We've made some impressive improvement in Gnocchi 3.0 – as I already covered previously – and Gordon Chung presented a short talk about the performance difference metered while working on this new version of Gnocchi:

The return of the InfluxDB driver is on the table, as Sam Morrison proposed a patch for that a while back. While it's not as fast and scalable as the other drivers, it offers a good alternative for people who have to use it.

Leandro Reox presented how to do capacity planning using Ceilometer and Gnocchi, presenting the projects at the same time:

It is pretty impressive to see what they achieved with this project, and I'm looking forward to being able to check how it works inside.

PTG and beyond

The next meeting is supposed to be the new OpenStack PTG in February in Atlanta, though we did not request any specific space there. While the team loves seeing each other face-to-face every few months, we have managed to follow all of the guidelines I listed recently on good open source project management, meaning we are able to work very well asynchronously and remotely. There is no need to put hard requirements on people wanting to participate in our community. Nevertheless, I expect the cross-project discussions that will happen there to still concern the OpenStack Telemetry projects.

In the end, we're all very happy with our past and future roadmaps and I'm looking forward to achieving our next big milestones with our amazing telemetry team!

You might remember my attempts at getting an easy to use cross-compilation for ARM applications on my x86-64 desktop machine.

With Fedora 25 approaching, I'm happy to say that the necessary changes to integrate the feature have now rolled into Fedora 25.

For example, to compile the GNU Hello Flatpak for ARM, you would run:

$ flatpak install gnome org.freedesktop.Platform/arm org.freedesktop.Sdk/arm
Installing: org.freedesktop.Platform/arm/1.4 from gnome
[...]
$ sudo dnf install -y qemu-user-static
[...]
$ TARGET=arm ./build.sh

For other applications, add the --arch=arm argument to the flatpak-builder command-line.

This example also works for 64-bit ARM with the architecture name aarch64.
October 28, 2016
Recently I've been working on improving hybrid graphics support for the upcoming Fedora 25 release. Although Fedora 25 Workstation will use Wayland by default for its GNOME 3 desktop, my work has been on hybrid gfx support under X11 (Xorg), as GNOME 3 on Wayland does not yet support hybrid gfx.

So no Wayland yet, but there are still a lot of noticeable hybrid gfx improvements, and users of laptops with hybrid gfx using the open-source drivers should have a much smoother user experience than before. Here is an (incomplete) list of generic improvements:

  • Fix the discrete GPU not suspending after using an external monitor, halving laptop battery life

  • xrandr --listproviders no longer shows 3 devices instead of 2 on laptops with 2 GPUs

  • Hardware cursor support when both GPUs have active video outputs; previously X would fall back to software cursor rendering in this case, which would typically lead to a flickering or entirely invisible cursor at the top of the screen

Besides this, a lot of work has been done on fixing hybrid gfx issues in the modesetting driver. This is important since in Fedora we use the modesetting driver on Skylake and newer Intel integrated gfx, as well as on Maxwell and newer Nvidia discrete GPUs. The following issues have been fixed in the modesetting driver (and thus on laptops with a Skylake CPU and/or a Maxwell or newer GPU):

  • Hide the HW cursor on init; this fixes 2 cursors showing on some setups

  • Make the modesetting driver support DRI_PRIME=1 render offloading when the secondary GPU is using the modesetting driver

  • Fix misrendering (tiled vs linear) when using DRI_PRIME=1 render offloading and the primary GPU is using the modesetting driver

  • Fix GL apps running at 1 fps when shown on a video-output of the secondary GPU and the primary GPU is using the modesetting driver

  • Fix secondary GPU video output partial screen updates (part of the screen showing a previous frame) when the discrete GPU is the secondary GPU and the primary GPU is using the modesetting driver

  • Fix secondary GPU video output updates lagging (or sometimes the last frame simply not being shown at all because no further rendering is happening) when the discrete GPU is the secondary GPU and the primary GPU is using the modesetting driver

Note that this coming Thursday (November 3rd) we're having a Fedora Better Switchable Graphics Test Day; if you have a laptop with hybrid gfx, please join us to help further improve hybrid gfx support.
I have recently been occupied with a new project (and been down with a cold all this week), so I have not been much present in the Wayland community. Now I can finally say what Emilio and I have been up to: Waltham! For more information, please see our announcement.
October 26, 2016
Thanks to the work of Hans de Goede and many others, dual-GPU (aka NVidia Optimus or AMD Hybrid Graphics) support works better than ever in Fedora 25.

On my side, I picked up some work I originally did for Fedora 24, but ended up being blocked by hardware support. This brings better integration into GNOME.

The Details Settings panel now shows which video cards you have in your (most likely) laptop.

dual-GPU Graphics

The second feature is what Blender and 3D video games users have been waiting for: a contextual menu item to launch the application on the more powerful GPU in your machine.

Mooo Powaa!

This demonstration uses a slightly modified GtkGLArea example, which shows which of the GPUs is used to render the application in the title bar.

on the integrated GPU

on the discrete GPU

Behind the curtain

Behind those 2 features, we have a simple D-Bus service, which runs automatically on boot, and stays running to offer a single property (HasDualGpu) that system components can use to detect what UI to present. This requires the "switcheroo" driver to work on the machine in question.
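If you want to poke at the service directly, the property can be read with standard D-Bus tooling. From memory (so double-check the bus and interface names against the switcheroo-control sources), something like:

$ gdbus call --system --dest net.hadess.SwitcherooControl \
        --object-path /net/hadess/SwitcherooControl \
        --method org.freedesktop.DBus.Properties.Get \
        net.hadess.SwitcherooControl HasDualGpu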

Because of the way applications are launched on the discrete GPU, we cannot currently support D-Bus activated applications, but GPU-heavy D-Bus-integrated applications are few and far between right now.

Future plans

There's plenty more to do in this area, to polish the integration. We might want applications to tell us whether they'd prefer being run on the integrated or discrete GPU, as live switching between renderers is still something that's out of the question on Linux.

Wayland dual-GPU support, as well as support for the proprietary NVidia drivers are also things that will be worked on, probably by my colleagues though, as the graphics stack really isn't my field.

And if the hardware becomes more widely available, we'll most certainly want to support hardware with hotpluggable graphics support (whether gaming laptop "power-ups" or workstation docks).

Availability

All the patches necessary to make this work are now available in GNOME git (targeted at GNOME 3.24), and backports are integrated in Fedora 25, due to be released shortly.
October 24, 2016

There are some program transformations that are obviously correct when compiling regular single-threaded or even multi-threaded code, but that cannot be used for shader code. For example:

v = texture(u_sampler, texcoord);
if (cond) {
    gl_FragColor = v;
} else {
    gl_FragColor = vec4(0.);
}

... cannot be transformed to ...

if (cond) {
    // The implicitly computed derivative of texcoord
    // may be wrong here if neighbouring pixels don't
    // take the same code path.
    gl_FragColor = texture(u_sampler, texcoord);
} else {
    gl_FragColor = vec4(0.);
}

... but the reverse transformation is allowed.

Another example is:

if (cond) {
    v = texelFetch(u_sampler[1], texcoord, 0);
} else {
    v = texelFetch(u_sampler[2], texcoord, 0);
}

... cannot be transformed to ...

v = texelFetch(u_sampler[cond ? 1 : 2], texcoord, 0);
// Incorrect, unless cond happens to be dynamically uniform.

... but the reverse transformation is allowed.

Using GL_ARB_shader_ballot, yet another example is:

bool cond = ...;
uint64_t v = ballotARB(cond);
if (other_cond) {
    use(v);
}

... cannot be transformed to ...

bool cond = ...;
if (other_cond) {
    use(ballotARB(cond));
    // Here, ballotARB returns 1-bits only for threads/work items
    // that take the if-branch.
}

... and the reverse transformation is also forbidden.

These restrictions are all related to the GPU-specific SPMD/SIMT execution model, and they need to be taught to the compiler. Unfortunately, we partially fail at that today.

Here are some types of restrictions to think about (each of these restrictions should apply on top of any other restrictions that are expressible in the usual, non-SIMT-specific ways, of course):

  1. An instruction can be moved from location A to location B only if B dominates or post-dominates A.

    This restriction applies e.g. to instructions that take derivatives (like in the first example) or that explicitly take values from neighbouring threads (like in the third example). It also applies to barrier instructions.

    This is LLVM's convergent function attribute as I understand it.

  2. An instruction can be moved from location A to location B only if A dominates or post-dominates B.

    This restriction applies to the ballot instruction above, but it is not required for derivative computations or barrier instructions.

    This is in a sense dual to LLVM's convergent attribute, so it's co-convergence? Divergence? Not sure what to call this.

  3. Something vague about not introducing additional non-uniformity in the arguments of instructions / intrinsic calls.

    This last one applies to the sampler parameter of texture intrinsics (for the second example), to the ballot instruction, and also to the texture coordinates on sampling instructions that implicitly compute derivatives.

For the last type of restriction, consider the following example:

uint idx = ...;
if (idx == 1u) {
    v = texture(u_sampler[idx], texcoord);
} else if (idx == 2u) {
    v = texture(u_sampler[idx], texcoord);
}

... cannot be transformed to ...

uint idx = ...;
if (idx == 1u || idx == 2u) {
    v = texture(u_sampler[idx], texcoord);
}

In general, whenever an operation has this mysterious restriction on its arguments, then the second restriction above must apply: we can move it from A to B only if A dominates or post-dominates B, because only then can we be certain that the move introduces no non-uniformity. (At least, this rule applies to transformations that are not SIMT-aware. A SIMT-aware transformation might be able to prove that idx is dynamically uniform even without the predication on idx == 1u or idx == 2u.)

However, the control flow rule is not enough:

 v1 = texture(u_sampler[0], texcoord);
v2 = texture(u_sampler[1], texcoord);
v = cond ? v1 : v2;

... cannot be transformed to ...

v = texture(u_sampler[cond ? 0 : 1], texcoord);

The transformation does not break any of the CFG-related rules, and it would clearly be correct for a single-threaded program (given the knowledge that texture(...) is an operation without side effects). So the CFG-based restrictions really aren't sufficient to model the real set of restrictions that apply to the texture instruction. And it gets worse:

 v1 = texelFetch(u_sampler, texcoord[0], 0);
v2 = texelFetch(u_sampler, texcoord[1], 0);
v = cond ? v1 : v2;

... is equivalent to ...

v = texelFetch(u_sampler, texcoord[cond ? 0 : 1], 0);

After all, texelFetch computes no implicit derivatives.

Calling the three kinds of restrictions 'convergent', 'co-convergent', and 'uniform', we get:

texture(uniform sampler, uniform texcoord)    convergent    (co-convergent)
texelFetch(uniform sampler, texcoord, lod)                  (co-convergent)
ballotARB(uniform cond)                       convergent    co-convergent
barrier()                                     convergent

For the texturing instructions, I put 'co-convergent' in parentheses because these instructions aren't inherently 'co-convergent'. The attribute is only there because of the 'uniform' function argument.

Actually, looking at the examples, it seems that co-convergent only appears when a function has a uniform argument. Then again, the texelFetch function can be moved freely in the CFG by a SIMT-aware pass that can prove that the move doesn't introduce non-uniformity to the sampler argument, so being able to distinguish functions that are inherently co-convergent (like ballotARB) from those that are only implicitly co-convergent (like texture and texelFetch) is still useful.

For added fun, things get muddier when you notice that in practice, AMDGPU doesn't even flag texturing intrinsics as 'convergent' today. Conceptually, the derivative-computing intrinsics need to be convergent to ensure that the texture coordinates for neighbouring pixels are preserved (as in the very first example). However, the AMDGPU backend does register allocation after the CFG has been transformed into the wave-level control-flow graph. So register allocation automatically preserves neighbouring pixels even when a texture instruction is sunk into a location with additional control-flow dependencies.

When we reach a point where vector register allocation happens with respect to the thread-level control-flow graph, then texture instructions really need to be marked as convergent for correctness. (This change would be beneficial overall, but is tricky because scalar register allocation must happen with respect to the wave-level control flow graph. LLVM currently wants to allocate all registers in one pass.)

The most interesting work I did last week was starting to look into HDMI audio.  I've been putting this off because really I'm a 3D developer, not modesetting, and I'm certainly not an audio developer.  But the work needs to get done, so here I am.

HDMI audio on SOCs looks like it's not terribly hard.  VC4's HDMI driver needs to instantiate a platform device for audio when it gets loaded, and attach a function table to it for calls to be made when starting/stopping audio.  Then I write a sound driver that will bind to the node that VC4 created, accepting a parameter of which register to DMA into.  That sound driver will tell ALSA about what formats it can accept and actually drive the DMA engine to write audio frames into HDMI.  On the VC4 HDMI side, we do a bit of register setup (clocks, channel mappings, and the HDMI audio infoframe) when the sound driver tells us it wants to start.

It all sounds pretty straightforward so far.  I'm currently at the "lots of typing" stage of development of this.

Second, I did a little poking at the DSI branch again.  I've rebased onto 4.9 and tested an idea I had for how DSI might be failing.  Still no luck.  There's code on drm-vc4-dsi-panel-restart-cleanup-4.9, and the length of that name suggests how long I've been bashing my head against DSI.  I would love for anyone out there with the inclination to do display debug to spend some time looking into it.  I know the history there is terrible, but I haven't prioritized cleanup while trying to get the driver working in the first place.

I also finished the simulator cleanup from last week, found a regression in the major functional improvement in the cleanup, and landed what worked.  Dave Airlie pointed out that he took a different approach for testing with virgl called vtest, and I took a long look at that.  Now that I've got my partial cleanup in place, the vtest-style simulator wrapper looks a lot more doable, and if it doesn't have the major failing that swrast did with visuals then it should actually make vc4 simulation more correct.

I landed a very tiny optimization to Mesa that I wrote while debugging DEQP failures.

Finally, I put together the -next branches for upstream bcm2835 now that 4.9-rc1 is out.  So far I've landed a major cleanup of pinctrl setup in the DT (started by me and finished by Gerd Hoffmann from Red Hat, to whom I'm very grateful) and a little USB setup fix.  I did a lot of review of VCHIQ patches for staging, and sort of rewrote Linus Walleij's RFC patch for getting descriptive names for the GPIOs in lsgpio.
October 20, 2016

Bloom Filter

The bloom filter makes bright objects glow and bleed through other objects positioned in between them and the camera. It is a common post-processing effect used all the time in video games and animated movies. The demo supports a couple of configuration options that control the intensity and behavior of the filter, here are some screenshots with different settings:

bloom-disabled
Bloom filter Off
bloom-default
Bloom filter On, default settings
bloom-intense
Bloom filter On, intensity increased

I particularly like the glow effect that this brings to the specular reflections on the water surface, although to really appreciate that you need to run the demo and see it in motion.

Cascaded Shadow Maps

I should really write a post about basic shadow mapping before going into the details of Cascaded Shadow Maps, so for now I’ll just focus on the problem they try to solve.

One of the problems with shadow mapping is rendering high resolution shadows, especially for shadows that are rendered close to the camera. Generally, basic shadow mapping provides two ways in which we can improve the resolution of the shadows we render:

1. Increase the resolution of the shadow map textures. This one is obvious but comes at a high performance (and memory) hit.

2. Reduce the distance at which we can render shadows. But this is not ideal of course.

One compromise solution is to notice that, as usual with 3D computer graphics, it is far more important to render nearby objects in high quality than distant ones.

Cascaded Shadow Maps allow us to use different levels of detail for shadows that are rendered at different distances from the camera. Instead of having a single shadow map for all the shadows, we split the viewing frustum into slices and render shadows in each slice to a different shadow map.
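
For illustration, here is a minimal sketch of one common way to pick the slice boundaries, the so-called practical split scheme that blends logarithmic and linear distributions. This is an assumption for illustration; the demo may compute its splits differently.

#include <math.h>

/* Sketch: compute the far distance of each cascade slice between the
 * near and far planes. 'lambda' in [0,1] blends a logarithmic split
 * (better resolution up close) with a uniform linear split. */
void compute_cascade_splits(float near_plane, float far_plane,
                            int num_cascades, float lambda, float *splits)
{
    for (int i = 1; i <= num_cascades; i++) {
        float p = (float) i / num_cascades;
        float log_split = near_plane * powf(far_plane / near_plane, p);
        float lin_split = near_plane + (far_plane - near_plane) * p;
        splits[i - 1] = lambda * log_split + (1.0f - lambda) * lin_split;
    }
}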

There are two immediate benefits of this technique:

1. We have flexibility to define the resolution of the shadow maps for each level of the cascade, allowing us, for example, to increase the resolution of the levels closest to the camera and maybe reduce those that are further away.

2. Each level only records shadows in a slice of the viewing frustum, which increases shadow resolution even if we keep the same texture resolution we used with the original shadow map implementation for each shadow map level.

This approach also has some issues:

1. We need to render multiple shadow maps, which can be a serious performance hit depending on the resolutions of the shadow maps involved. This is why we usually lower the resolution of the shadow maps as distance from the camera increases.

2. As we move closer to or further from shadowed objects we can see the changes in shadow quality pop-in. Of course we can control this by avoiding drastic quality changes between consecutive levels in the cascade.

Here is an example that illustrates the second issue (in this case I have lowered the resolution of the 2nd and 3rd cascade levels to 50% and 25% respectively to make the effect more obvious). The screenshots show the rendering of the shadows at different distances. We can see how the shadows in the close-up shot are very sharp, and as the distance increases they become blurrier due to the use of a lower resolution shadow map:

csm-high
CSM level 0 (4096×4096)
csm-med
CSM level 1 (2048×2048)
csm-low
CSM level 2 (1024×1024)

The demo supports up to 4 shadow map levels although the default configuration is to use 3. The resolution of each level can be configured separately too, in the default configuration I lowered the shadow resolution of the second and third levels to 75% and 50% respectively. If we configure the demo to run on a single level (with 100% texture resolution), we are back to the original shadow map implementation, so it is easy to experiment with both techniques.

I intend to cover the details behind shadow mapping and the implementation of the bloom filter in more detail in a future post, so again, stay tuned for more!

October 18, 2016
The most visible change this week is a fix to Mesa for texture upload performance.  A user had reported that selection rectangles in LXDE's file manager were really slow.  I brought up sysprof, and it showed that we were doing uncached reads from the GPU in a situation that should have been entirely write-combined writes.

The bug was that when the size of the texture wasn't aligned to utiles, we were loading the previous contents into the temporary buffer before writing the new texture data in and uploading, even if the full texture was being updated.  I've fixed it to check for when the full texture is being uploaded, and to skip the initial download in that case.  This bug was getting hit on almost any Cairo vector graphics operation with its Xlib backend, so hopefully this helps a lot of people's desktops.
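
The core of the fix is a simple test; here is a hedged sketch with hypothetical names (the real change lives in Mesa's vc4 code):

#include <stdbool.h>

/* Hypothetical sketch: the old contents only need to be read back into
 * the temporary (utile-aligned) buffer when the upload is partial. */
static bool upload_covers_whole_texture(int x, int y, int w, int h,
                                        int tex_w, int tex_h)
{
    return x == 0 && y == 0 && w == tex_w && h == tex_h;
}

/* Usage sketch: if (!upload_covers_whole_texture(...)) do the
 * readback of the previous contents before storing the new texels. */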

I also worked on a cleanup of the simulator mode.  I use the closed source simulator regularly as part of my work -- it's fairly accurate to hardware behavior, and allows you to trace what's happening when things go wrong, all with the driver code running like "normal" for apps on my x86 desktop.

However, my simulator support is a little invasive to the driver, replacing vc4 ioctl calls with alternative ioctls to the i915 driver.  I started on a series to make vc4_simulator.c take plain VC4 ioctls and translate them, so that the simulator code is entirely contained.  The last step, I think, is figuring out at which level to put the simulator's copying in and out of window system framebuffers.

Last, I got DEQP's GLES2 tests up and running on Raspberry Pi.  These are approximately equivalent to the official conformance tests.  The initial results were quite good -- 18630/19098 (97.5%) passing when run in the piglit framework.  I found a couple of fixes to be made in glClear() support, one of which will affect all gallium drivers.  Of the remainder, there are a few tests that are failing due to test bugs (we expose extensions that allow features the tests don't expect), but most are failing in register allocation.  For the register allocation failures, I see a relatively quick fix that should reduce register pressure in loops.

For the last 3 years, I've been working in the OpenStack Telemetry team, first at eNovance, and then at Red Hat. Our mission is to maintain the OpenStack Telemetry stack, both upstream and downstream (i.e. inside Red Hat products). Beyond the technical challenges, the organization of the team has always played a major role in our accomplishments.

Here, I'd like to share some of that hindsight with you, faithful readers.

Meet the team

The team I work in has changed a bit during those 3 years, but the core components have always been the same: a few software engineers, a QE engineer, a product owner, and an engineering manager. That means the team size has always been between 6 and 8 people.

I cannot emphasize enough how important team size is. Not having more than 8 people in a team fits the two pizzas rule from Jeff Bezos, which turned out to be key in our team composition.

The group dynamic in teams no bigger than this is excellent. It offers the possibility to know and connect with everyone – each team member has only up to 7 people to talk to on a daily basis, which means only 28 communication axes between people (8×7/2). Having a team of e.g. 16 people means 120 different links in your team: double your team size, and you roughly quadruple your communication overhead. My experience shows that the fewer communication axes you have in a team, the less overhead you will have and the swifter your team will be.

With all team members being remote workers, it is even more challenging to build relationships and bond. We had the opportunity to meet each other at the OpenStack summit twice a year, and doing regular video-conferences via Google Hangout or BlueJeans really helped.

The atmosphere you set up in your team will also forge the outcome of your team's work. Run your team with trust, peace and humor (remember I'm on the team 🤣) and awesome things will happen. Run your team with fear, pressure and finger-pointing, and nothing good will happen.

There's little chance that when a team is built, everyone will be on the same level. We were no exception: we had more and less experienced engineers. But the most experienced engineers took the time needed to invest in and mentor the less experienced ones. That also helped to build trust and communication links between members of the team. And over the long run, everyone gets more efficient: the less experienced engineers get better and the more experienced ones can delegate a lot of stuff to their fellows.

Then they can chill or work on bigger stuff. Win-win.

It's actually not much different from the way you should run an open source team, as I already claimed in a previous article on FOSS projects management.

Practicing agility

I might be bad at practicing agility: contrary to many people, I don't see agility as a set of processes. I see it as a state of mind, as a team organization based on empowerment. No more, no less.

And each time I meet people and explain that our team is "agile", they start shivering, explaining how they hate sprints, daily stand-ups, scrum, and planning poker, that this is all a waste of time and energy.

Well, it turns out that you can be agile without all of that.

Planning poker

In our team, we tried at first to run 2-week sprints and used planning poker to schedule the user stories from our product backlog (= todo list). It never worked as expected.

First, most people felt they were wasting their time because they already knew exactly what they were supposed to do. If they had any doubt, they would just have gone and talked to the product owner or another fellow engineer.

Secondly, some stories were really specialized and only one of the team members was able to understand them in detail and evaluate them. So most of the time, the team members playing planning poker would just vote a random number based on the length of the story teller's explanation. For example, if an engineer said "I just need to change that flag in the configuration file" then everyone would vote 1. If they started rambling for 5 minutes about "how the configuration option is easy to switch, but that there might be other things to change at the same time, and things to check whose impact might be bigger than expected, and code refactoring to do", then most people would just announce a score of 13 on that story. Just because the guy talked for 5 minutes straight and everything sounded complicated and out of their scope.

That meant that the poker score had no meaning to us. We never managed to have a number of points that we knew we could accomplish during a sprint (the team velocity as they call it).

The only benefit that we identified from planning poker, in our case, is that it forces people to sit down and communicate about a user story. It turned out, though, that making people communicate was not a problem we needed to solve in our team, so we decided to stop doing that. But it can be a pretty good tool to get people talking to each other.

Therefore, the 2-week sprints never made much sense as we were unable to schedule our work reliably. Furthermore, doing most of our daily job in open source communities, we were unable to schedule anything. When sending patches to an upstream project, you have no clue when they will be reviewed. What you know for sure is that in order to maximize your code merge throughput with this high latency of code review, you need to parallelize your patch submissions a lot. So as soon as you receive some feedback from your reviewers, you need to (almost) drop everything, rework your code and resubmit it.

There's no need to explain why this absolutely does not work with a sprint approach. Most of the scrum framework rests on the assumption that you own your workflow from top to bottom, which is far from true when working in open source communities.

Daily stand-up meetings

We used to run a stand-up meeting every day, then every other day. Doing that remotely kills the stand-up part, obviously, so there is less of a guarantee that the meeting will be short. Considering all team members work remotely in different time zones, with some freedom to organize their schedules, it was very difficult to synchronize those meetings. With members spread from the US to Eastern Europe, the meeting fell in the middle of the afternoon for me. I found it frustrating to have to stop my activities in the middle of every afternoon to chat with my team. We all know the cost of context switching for us humans.

So we drifted from our 10 minutes daily meeting to a one-hour weekly meeting with the whole team. It's way easier to synchronize for a large chunk of time once a week and to have this high-throughput communication channel.

Our (own) agile framework

Drifting from the original scrum implementation, we ended up running our own agility framework. It turned out to have similarities with kanban – you don't always have to invent new things!

Our main support is a Trello board that we share with the whole team. It consists of different columns, where we put cards representing small user stories or simple to-do items. Each column represents the state of the cards in it, and we move them left to right:

  • Ideas: where we put things we'd like to do or dig into, but for which there's no urgency. They might lead to new, smaller items in the "To Do" column.
  • To Do: where we put real things we need to do. We might run a grooming session with our product manager if we need help prioritizing things, but it's usually not necessary.
  • Epic: here we create a few bigger cards that regroup several To Do items. We don't move them around, we just archive them when they are fully implemented. There are only 5-6 big cards here at most, which are the long term goals we work on.
  • Doing: where we move cards from To Do when we start working on them. At this stage, we also add the people working on the task to the card, so we can see the faces of those involved.
  • Under review: 90% of our job being done upstream, we usually move cards that are done and waiting for feedback from the community to this column. When the patches are approved and the card is complete, we move the card to Done. If a patch needs further improvement, we move the card back to Doing, work on it, and then move it back to Under review when it is resubmitted.
  • On hold / blocked: some of the tasks we work on might be blocked by external factors. We move cards there to keep track of them.
  • Done during week #XX: we create a new list every Monday to stack our done cards by week. This is just easier to display, and it allows us to see the cards that we complete each week. We archive lists older than a month from time to time. It gives great visual feedback on what has been accomplished and merged every week.

We started to automate some of our Trello workflow in a tool called Trelloha. For example, it allows us to track upstream patches sent through Gerrit or GitHub and tick the checkbox items in any card when those are merged.

We actually don't put much effort into our Trello board. It's just slightly organized chaos, as are upstream projects. We use it as a lightweight system for taking notes, organizing our thoughts and letting others know what we're doing and why we're doing it. That's where Trello is wonderful, because using it has very low friction: creating, updating and moving cards is a one-click operation.

One bias of most engineers is to overthink and over-engineer their workflow, trying to rationalize it. Most of the time, they end up automating everything, which means building processes and bureaucracy. It just slows things down and builds frustration for everyone. Just embrace chaos and spend time on what matters.

Most of the things we do are linked to external Launchpad bugs, Gerrit reviews or GitHub issues. That means the cards in Trello carry very little information, as everything happens outside, in the wild Internet of open source communities. This is very important as we need to avoid any kind of retention of knowledge and information from contributors outside the company. This also makes sure that our internal way of running does not leak outside and (badly) influence outside communities.

Retrospectives

We also run a retrospective every 2 weeks, which might be the only thing we kept from the scrum practice. It's actually a good opportunity for us to share our feelings, concerns or jokes. We used to do it using the six thinking hats method, but it slowly faded away. In the end, we now use a different Trello board with those columns:

  • Good 😄
  • Hopes and Wishes 🎁
  • Puzzles and Challenges 🌊
  • To improve 😡
  • Action Items 🤘

All teammates fill the board with the cards they want, and everyone is free to add themselves to any card. We then run through each card and let the people who added their names to it talk about it. The "Action Items" column is usually filled as we speak and discover things we should do. We can then move cards created there to our regular board, in the To Do column.

Central communication

Sure, people have different roles in a team, but we dislike bottlenecks and single points of failure. Therefore, we use an internal mailing list where we ask people to send their requests and messages. If people send things related to our team's job to one of us personally, we just forward it or Cc the list when replying, so everyone is aware of what one might be discussing with people external to the team.

This is very important, as it emphasizes that no team member should be considered special. Nobody owns more information and knowledge than the others, and anybody can jump into a conversation if they have valuable knowledge to share.

The same applies for our internal IRC channel.

We also make sure that we discuss only company-specific things on this list or on our internal IRC channel. Everything that can be public and is related to upstream is discussed on external communication media (IRC, upstream mailing lists, etc). This is very important to make sure that we are not blocking anybody outside Red Hat from joining us and contributing to the projects or ideas we work on. We also want to make sure that people working in our company are no more special than other contributors.

Improvement

We're pretty happy with our set-up right now, and the team has been running pretty smoothly for a few months. We're still trying to improve, and having a general sense of trust among team members makes sure we can openly speak about whatever problem we might have.

Feel free to share your feedback and own experience of running your own teams in the comment section.

October 13, 2016
Lots of individuals and companies have made substantial contributions to Apitrace. But maintenance has always rested with me, the original author.

I would have preferred to share that responsibility with a wider team, but things haven't turned out that way. For several reasons, I suppose:

  • There are many people who care about one section of the functionality (one API, one OS), but few care about all of them.
  • For all existing and potential contributors, including me, Apitrace is merely a means to an end (testing/debugging graphics drivers or applications); it's not an end in itself. The other stuff always gets the top priority.
  • There are more polished tools for newer generation APIs like Vulkan, Metal, Direct3D 12. These newer APIs are much leaner than legacy APIs, which eliminates a lot of design constraints. And some of these tools have large teams behind them.
  • Last but not least, I failed to nurture such a community. I always kept close control, partly to avoid things becoming a hodgepodge, partly from fear of breakage, but I can't shake the feeling that if I had been more relaxed things might have turned out differently.

Apitrace has always been something I worked on in my spare time, or whenever I had an itch to scratch. That is still true, with the exception that after having a kid I have scarcely any free time left.

Furthermore, the future is not bright: I believe Apitrace will have a long life in graphics driver test automation, and perhaps whenever somebody needs to debug an old OpenGL application, but I doubt it will flourish beyond that. And this fact weighs in whenever I need to decide whether to spend time on Apitrace vs everything else.

The end result is that I haven't been a responsive maintainer for some time (taking a long time to merge patches, provide feedback, resolve issues, etc), and I'm afraid that will continue for the foreseeable future.

I don't feel any obligation to do more (after all, the license does say the software is provided as is), but I do want to set the right expectations to avoid frustrating users/contributors who might otherwise expect timely feedback, hence this post.

In this post I’ll discuss how I setup and render terrain mesh in the OpenGL terrain rendering demo. Most of the relevant code for this is in the ter-terrain.cpp file.

Setting up a grid of vertices

Unless you know how to use a 3D modeling program properly, a reasonable way to create a decent mesh for a terrain consists of using a grid of vertices and elevating them according to a height map image. In order to create the grid we only need to decide how many rows and columns we want. This, in the end, determines the number of polygons and the resolution of the terrain.

We need to map these vertices to world coordinates too. We do that by defining a tile size, which is the distance between consecutive vertices in world units. Larger tile sizes increase the size of the terrain but lower the resolution by creating larger polygons.
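
As a minimal sketch of this setup (assuming a simple vertex struct rather than the demo's actual data types):

#include <stdlib.h>

typedef struct { float x, y, z; } Vec3;

/* Sketch: build a cols x rows grid of vertices in the XZ plane.
 * 'tile_size' is the distance between consecutive vertices in world
 * units; the heights (Y) are filled in later from the height map. */
Vec3 *build_grid(int cols, int rows, float tile_size)
{
    Vec3 *verts = malloc(sizeof(Vec3) * cols * rows);
    for (int r = 0; r < rows; r++) {
        for (int c = 0; c < cols; c++) {
            Vec3 *v = &verts[r * cols + c];
            v->x = c * tile_size;
            v->y = 0.0f;  /* elevated later by height map sampling */
            v->z = r * tile_size;
        }
    }
    return verts;
}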

The image below shows an 8×6 grid that defines 35 tiles. Each tile is rendered using 2 triangles:

terrain-vertex-grid
8×6 terrain grid

The next step is to elevate these vertices so we don’t end up with a boring flat surface. We do this by sampling the height map image for each vertex in the grid. A height map is a gray scale image where the value of each pixel represents an altitude: the closer the color is to white, the higher the altitude.

terrain-heightmap-01
Heightmap image

Adding more vertices to the grid increases the number of sampling points from the height map and reduces the sampling distances, leading to a smoother and more precise representation of the height map in the resulting terrain.

terrain_heightmap_sampling
Sampling the heightmap to compute vertex heights

Of course, we still need to map the height map samples (gray scale colors) to altitudes in world units. In the demo I do this by normalizing the color values to [-1,+1] and then applying a scale factor to compute the altitude values in world space. By playing with the scaling factor we can make our terrain look more or less abrupt.
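
In code the mapping is tiny; a sketch assuming 8-bit gray scale samples:

/* Sketch: map an 8-bit gray scale sample to a world-space altitude by
 * normalizing the color to [-1, +1] and applying the scale factor. */
static float sample_to_altitude(unsigned char pixel, float scale)
{
    float normalized = (pixel / 255.0f) * 2.0f - 1.0f;
    return normalized * scale;
}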

terrain_and_heightmap
Altitude scale=6.0
terrain_and_heightmap_abrupt
Altitude scale=12.0

For reference, the height map sampling is implemented in ter_terrain_set_heights_from_texture().

Creating the mesh

At this point we know the full position (x, y, z) in world coordinates of all the vertices in our grid. The next step is to build the actual triangle mesh that we will use to render the terrain and the normal vectors for each triangle. This process is described below and is implemented in the ter_terrain_build_mesh() function.

Computing normals

In order to get nice lighting on our terrain we need to compute normals for each vertex in the mesh. A simple way to achieve this would be to compute the normal for each face (triangle) and use that normal for each vertex in the triangle. This works, but it has 3 problems:

1. Every vertex of each triangle has the same exact normal, which leads to a rather flat result.

2. Adjacent triangles with different orientations showcase abrupt changes in the normal value, leading to significantly different lighting across the surfaces that highlight the individual triangles in the mesh.

3. Because each vertex in the mesh can have a different normal value for each triangle it participates in, we need to replicate the vertices when we render, which is not optimal.

Alternatively, we can compute the normals for each vertex considering the heights of its neighboring vertices. This solves all the problems mentioned above and leads to much better results thanks to the interpolation of the normal vectors across the triangles, which leads to smooth lighting reflection transitions:

normals-flat
Flat normals
normals-smooth
Smooth normals

The implementation for this is in the function calculate_normal(), which takes the column and row indices of the vertex in the grid and computes the normal from the heights (Y coordinates) of the 4 nearby vertices in the grid.
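
A sketch of that idea using central differences over the 4 neighbouring heights (the demo's actual calculate_normal() may differ in details; border vertices are not handled here for brevity):

#include <math.h>

typedef struct { float x, y, z; } Vec3;

extern float height_at(int col, int row);  /* assumed height lookup */

/* Sketch: per-vertex normal from the heights of the 4 grid
 * neighbours (central differences over the height field). */
Vec3 vertex_normal(int col, int row, float tile_size)
{
    float hl = height_at(col - 1, row);
    float hr = height_at(col + 1, row);
    float hd = height_at(col, row - 1);
    float hu = height_at(col, row + 1);
    Vec3 n = { hl - hr, 2.0f * tile_size, hd - hu };
    float len = sqrtf(n.x * n.x + n.y * n.y + n.z * n.z);
    n.x /= len; n.y /= len; n.z /= len;
    return n;
}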

Preparing the draw call

Now that we know the positions of all the vertices and their normal vectors we have all the information that we need to render the terrain. We still have to decide how exactly we want to render all the polygons.

The simplest way to render the terrain using a single draw call is to setup a vertex buffer with data for each triangle in the mesh (including position and normal information) and use GL_TRIANGLES for the primitive of the draw call. This, however, is not the best option from the point of view of performance.

Because the terrain will typically contain a large number of vertices and most of them participate in multiple triangles, we end up uploading a large amount of vertex data to the GPU and processing a lot of vertices in the draw call. The result is large memory requirements and suboptimal performance.

For reference, the terrain I used in the demo from my original post used a 251×251 grid. This grid represents 250×250 tiles, each one rendered as two triangles (6 vertices/tile), so we end up with 250x250x6=375,000 vertices. For each of these vertices we need to upload 24 bytes of vertex data with the position and normal, so we end up with a GPU buffer that is almost 9MB large.

One obvious way to reduce this is to render the terrain using triangle strips. The problem with this is that in theory, we can’t render the terrain with just one strip, we would need one strip (and so, one draw call) per tile column or one strip per tile row. Fortunately, we can use degenerate triangles to link separate strips for each column into a single draw call. With this we trim down the number of vertices to 126,000 and the size of the buffer to a bit below 3 MB. This alone produced a 15%-20% performance increase in the demo.

We can do better though. A lot of the vertices in the terrain mesh participate in various triangles across the large triangle strip in the draw call, so we can reduce memory requirements by using an index buffer to render the strip. If we do this, we trim things down to 63,000 vertices and ~1.5MB. This added another 4%-5% performance bonus over the original implementation.
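
To make the stitching concrete, here is a hedged sketch of building such an index buffer: each tile column becomes a sub-strip, and repeating the last index of one sub-strip and the first index of the next emits the zero-area (degenerate) triangles that link them. This is an illustration of the technique, not the demo's actual code.

#include <stdlib.h>

/* Sketch: indices for one stitched triangle strip over a cols x rows
 * vertex grid stored row-major (index = row * cols + col). */
unsigned *build_strip_indices(int cols, int rows, int *out_count)
{
    int max = (cols - 1) * (rows * 2 + 2);  /* generous upper bound */
    unsigned *idx = malloc(sizeof(unsigned) * max);
    int n = 0;
    for (int c = 0; c < cols - 1; c++) {
        if (c > 0)
            idx[n++] = c;  /* repeat first index of this sub-strip */
        for (int r = 0; r < rows; r++) {
            idx[n++] = r * cols + c;
            idx[n++] = r * cols + c + 1;
        }
        if (c < cols - 2)
            idx[n++] = (rows - 1) * cols + c + 1;  /* repeat last index */
    }
    *out_count = n;
    return idx;
}

Rendering is then a single glDrawElements(GL_TRIANGLE_STRIP, ...) call over this index buffer.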

Clipping

So far we have been rendering the full mesh of the terrain in each frame, and we do this by uploading the vertex data to the GPU just once (for example in the first frame). However, depending on where the camera is located and where it is looking, just a fraction of the terrain may be visible.

Although the GPU will discard all the geometry and fragments that fall outside the viewport, it still has to process each vertex in the vertex shader stage before it can clip non-visible triangles. Because the number of triangles in the terrain is large, this is suboptimal and to address this we want to do CPU-side clipping before we render.

Doing CPU-side clipping comes with some additional complexities though: it requires that we compute the visible region of the terrain and upload new vertex data to the GPU in each frame, all while preventing GPU stalls.

In the demo, we implement the clipping by computing a quad sub-region of the terrain that includes the visible area that we need to render. Once we know the sub-region that we want to render, we compute the new indices of the vertices that participate in the region so we can render it using a single triangle strip. Finally, we upload the new index data to the index buffer for use in the follow-up draw call.

Avoiding GPU stalls

Although all the above is correct, it actually leads, as described, to much worse performance in general. The reason for this is that our uploads of vertex data in each frame lead to frequent GPU stalls. This happens in two scenarios:

1. In the same frame, because we need to upload different vertex data for the rendering of the terrain for the shadow map and the scene (the shadow map renders the terrain from the point of view of the light, so the visible region of the terrain is different). This creates stalls because the rendering of the terrain for the shadow map might not have completed before we attempt to upload new data to the index buffer in order to render the terrain for the scene.

2. Between different frames. Because the GPU might not be completely done rendering the previous frame (and thus, still needs the index buffer data available) before we start preparing the next frame and attempt to upload new terrain index data for it.

In the case of the Intel Mesa driver, these GPU stalls can be easily identified by using the environment variable INTEL_DEBUG=perf. When using this, the driver will detect these situations and produce warnings informing about the stalls, the buffers affected and the regions of the buffers that generate the stall, such as:

Stalling on glBufferSubData(0, 503992) (492kb) to a busy (0-1007984)
buffer object.  Use glMapBufferRange() to avoid this.

The solution to this problem that I implemented (other than trying to put as much work as possible between read/write accesses to the index buffer) comes in two forms:

1. Circular buffers

In this case, we allocate a larger buffer than we need so that each subsequent upload of new index data happens in a separate sub-region of the allocated buffer. I set up the demo so that each circular buffer is large enough to hold the index data required for all updates of the index buffer happening in each frame (the shadow map and the scene).

2. Multi-buffering

We allocate more than one circular buffer. When we don’t have enough free space at the end of the current buffer to upload the new index buffer data, we upload it to a different circular buffer instead. When we run out of buffers we circle back to the first one (which at this point will hopefully be free to be re-used again).

So why not just use a single, very large circular buffer? Mostly because there are limits to the size of the buffers that the GPU may be able to handle correctly (or efficiently). Also, why not have many smaller independent buffers instead of circular buffers? That would work just fine, but using fewer, larger buffers reduces the number of objects we need to bind/unbind and is better for preventing memory fragmentation, so that's a plus.
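
A hedged sketch of the combined scheme (hypothetical sizes and names; the buffers are assumed to have been created beforehand with glBufferData and a NULL data pointer, and GL 1.5+ entry points are assumed available):

#include <GL/gl.h>

/* Sketch: multi-buffered circular index buffers. Each upload lands in
 * a fresh sub-region; when the current buffer fills up we rotate to
 * the next one instead of overwriting memory the GPU may still read. */
#define NUM_BUFFERS 3
#define BUFFER_SIZE (2 * 1024 * 1024)

static GLuint buffers[NUM_BUFFERS];  /* pre-created index buffers */
static GLintptr offset;
static int current;

GLuint upload_indices(const void *data, GLsizeiptr size,
                      GLintptr *out_offset)
{
    if (offset + size > BUFFER_SIZE) {
        current = (current + 1) % NUM_BUFFERS;  /* hopefully idle now */
        offset = 0;
    }
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, buffers[current]);
    glBufferSubData(GL_ELEMENT_ARRAY_BUFFER, offset, size, data);
    *out_offset = offset;
    offset += size;
    return buffers[current];
}

Each draw call then sources its indices from the returned buffer at the returned offset.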

Final touches

We are almost done, at this point we only need to add a texture to the terrain surface, add some slight fog effect for distant pixels to create a more realistic look, add a skybox (it is important to choose the color of the fog so it matches the color of the sky!) and tweak the lighting parameters to get a nice result:

terrain_final
Final rendering

I hope to cover some of these aspects in future posts, so stay tuned for more!

October 11, 2016
Last week I started work on making hello_fft work with the vc4 driver loaded.

hello_fft is a demo program for doing FFTs using the QPUs (the shader core in vc4).  Instead of drawing primitives into a framebuffer like a GL pipeline does, though, it uses the User QPU pipeline, which just hands a uniform stream to a QPU shader instance.

Originally I didn't build a QPU user shader ioctl for vc4 because there's no good way to use this pipeline of the VC4.  The hardware isn't quite capable enough to be exposed as OpenCL or GL compute shaders (some research had been done into this, and the designers' conclusion was that the memory access support wasn't quite good enough to be useful).  That leaves you with writing VC4 shaders in raw assembly, which I don't recommend.

The other problem for vc4 exposing user shaders is that, since the GPU sits directly on main memory with no MMU in between, GPU shader programs could access anywhere in system memory.  For 3D, there's no need (in GL 2.1) to do general write back to system memory, so the kernel reads your shader code and rejects shaders if they try to do it.  It also makes sure that you bounds-check all reads through the uniforms or texture sampler.  For QPU user shaders, though, the expected mode of using them is to have the VPM DMA units do general loads and stores, so we'd need new validation support.

If it was just this one demo, we might be willing to lose support for it in the transition to the open driver.  However, some modes of accelerated video decode also use QPU shaders at some stage, so we have to have some sort of solution.

My plan is basically to not do validation and have root-only execution of QPU shaders for now.  The firmware communication driver captures hello_fft's request to have the firmware pass QPU shaders to the V3D hardware, and it redirects it into VC4.  VC4 then maintains the little queue of requests coming in, powers up the hardware, feeds them in, and collects the interrupts from the QPU shaders for when they're done (each job request is required to "mov interrupt, 1" at the end of its last shader to be run).

Dom is now off experimenting in what it will take to redirect video decode's QPU usage onto this code.

The other little project last week was fixing Processing's performance on vc4.  It's one of the few apps that is ported to the closed driver, so it's a likely thing to be compared on.

Unfortunately, Processing was clearing its framebuffers really inefficiently.  For non-tiled hardware it was mostly OK, with just a double-clear of the depth buffer, but for vc4 its repeated glClear()s (rather than a single glClear with all the buffers to be cleared set) were triggering flushes and reloads of the scene, and in one case a render of a full screen quad (and a full screen load before doing so) rather than fast clearing.

The solution was to improve the tracking of what buffers are cleared and whether any primitives have been drawn yet, so that I can coalesce sets of repeated or partial clears together while only updating the colors to be cleared.  There's always a tension in this sort of optimization: should the GL driver be clever to work around apps behaving badly, or should it push that off to the app developer (more like Vulkan does)?  In this case, tiled renderer behavior is sufficiently different from non-tiled renderers, and enough apps will hit this path, that it's well worth it.  For Processing, I saw an improvement of about 10% on the demo I was looking at.

Sadly, Processing won't be a good comparison for open vs closed drivers even after these fixes.  For the closed driver, Processing uses EGL, which says that depth buffers become undefined on eglSwapBuffers. In its default mode, though, Processing uses GLX, which says that the depth buffer is retained on glXSwapBuffers().  To avoid this extra overhead in its GLX mode, you'd need to use glDiscardFramebufferEXT() right before the swap.  Unfortunately, this isn't connected in gallium yet, but shouldn't be hard to do once we have an app that we can test with.
October 07, 2016

I procrastinated rather badly on this one, so instead of landing around the previous kernel release, this write-up comes when the v4.8 release is already out of the door. Read on for my slightly more terse catch-up report.

Since I’m this late I figured instead of the usual comprehensive list I’ll do something new and just list some of the work that landed in 4.8, but with a bit more focus on the impact and why things have been done.

Midlayers, Be Gone!

The first thing I want to highlight is the driver de-midlayering. In the Linux kernel community, the "midlayer mistake" or helper library design pattern (see the linked article from LWN) is a set of rules for designing subsystems and common support code for drivers. The underlying rule is that the driver itself must be in control of everything, like allocating memory and handling all requests. Common code is only shared in helper library functions, which the driver can call if they are suitable. The reason for that is that there is always some hardware which needs special treatment, and when you have a special case and there's a midlayer, it will get in the way.

Due to the shared history with BSD kernels, DRM originally had a full-blown midlayer, but over time this has been fixed. For example, kernel modesetting was designed from the start with the helper library pattern. The last holdout is the device structure itself, and for the Intel driver this is now fixed. This has two main benefits:

  • First we can get rid of a lot of pointer dereferencing in the compiled binaries. With the midlayer, DRM allocated a struct drm_device, and the Intel driver allocated its own, separate structure. Both are connected with pointers, and every time control transferred from driver-private functions to shared code those pointers had to be walked.

    With a helper approach the driver allocates the DRM device structure embedded into its own device structure (see the sketch after this list). That way the pointer dereferencing just becomes a fixed adjustment offset of the original pointer. And fixed offsets can be baked into each access of individual member fields for free, resulting in a big reduction of compiled code.

  • The other benefit is that the Intel driver is now in full control of the driver load and unload sequence. The DRM midlayer functions for loading had a few driver callbacks, but for historical reasons at the wrong spots. And fixing that is impossible without rewriting the load code for all the drivers. Without the midlayer we can have as many steps in the load sequence as we want, and where we want them. The most important fix here is that the driver will now be initialized completely before any part of it is registered and visible to userspace (through the /dev node, sysfs or anywhere else).
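
To make the embedding pattern concrete, here is a minimal sketch in plain C (a stand-in illustration, not the actual i915 structures):

#include <stddef.h>

/* Stand-in for the real struct drm_device from the DRM headers. */
struct drm_device { int stub; };

/* The driver's private structure embeds the DRM device directly
 * instead of pointing at a separately allocated one. */
struct my_driver_private {
    struct drm_device drm;   /* embedded, no pointer to chase */
    int driver_state;        /* ... driver-private members ... */
};

/* The kernel uses container_of() for this; spelled out, the
 * conversion back to the private structure is a constant offset the
 * compiler can fold into each field access. */
static inline struct my_driver_private *to_priv(struct drm_device *dev)
{
    return (struct my_driver_private *)
        ((char *)dev - offsetof(struct my_driver_private, drm));
}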

Thundering Herds

GPUs process rendering asynchronously, and sometimes the CPU needs to wait for them. For this purpose there’s a wait queue in the driver. Userspace processes block on that until the interrupt handler wakes them up. The trouble now is that thus far there was just one wait queue per engine, which means every time the GPU completed something all waiters had to be woken up. Then they checked whether the work they needed to wait for completed, and if not, again block on the wait queue until the next batch job completed. That’s all rather inefficient. On top there’s only one per-engine knob to enable interrupts. Which means even if there was only one waiting process, it was woken for every completed job. And GPUs can have a lot of jobs in-flight.

In summary, waiting for the GPU worked more like a frantic herd trampling all over things instead of something orderly. To fix this the request and completion tracking was entirely revamped, to make sure that the driver has a much better understanding of what’s going on. On top there’s now also an efficient search structure of all current waiting processes. With that the interrupt handler can quickly check whether the just completed GPU job is of interest, and if so, which exact process should be woken up.

But this wasn’t just done to make the driver more efficient. Better tracking of pending and completed GPU requests is an important fundation to implement proper GPU scheduling on top of. And it’s also needed to interface the completion tracking with other drivers, to finally fixing tearing for multi-GPU machines. Having a thundering herd in your own backyard is unsightly, letting it loose on your neighbours is downright bad! A lot of this follow-up work already landed for the 4.9 kernel, hence I will talk more about this in a future installement of this seris.

October 06, 2016
img_20161006_115415

No more “which is now the index of this modem…?”

DBus object path and index

When modems are detected by ModemManager and exposed in DBus, they are assigned a unique DBus object path, with a common prefix and a unique index number, e.g.:

/org/freedesktop/ModemManager1/Modem/0

This path is the one used by the mmcli command line tool to operate on a modem, so users can identify the device by the full path or just by the index, e.g. these two calls are totally equivalent:

$ mmcli -m /org/freedesktop/ModemManager1/Modem/0
$ mmcli -m 0

This logic looks good, except for the fact that there isn’t a fixed DBus object path for each modem detected: i.e. the index given to a device is the next one available, and if the device is power cycled or unplugged and replugged, a different index will be given to it.

EquipmentIdentifier

Systems like NetworkManager handle this index change gracefully, just by assuming that the exposed device isn’t the same one as the one exposed earlier with a different index. If settings need to be applied to a specific device, they will be stored associated with the EquipmentIdentifier property of the modem, which is the same across reboots (i.e. the IMEI for GSM/UMTS/LTE devices).

User-provided names

The 1.8 stable release of ModemManager will come with support for user-provided names assigned to devices. A use case of this new feature is for example those custom systems where the user would like to assign a name to a device based on the USB port in which it is connected (e.g. assuming the USB hardware layout doesn’t change across reboots).

The user can specify the names (UID, unique IDs) just by tagging in udev the physical device that owns all ports of a modem with the new ID_MM_PHYSDEV_UID property. These tags need to be applied before the ID_MM_CANDIDATE properties, and therefore the rules file should be named so that it sorts before 80-mm-candidate.rules, for example like this:

$ cat /lib/udev/rules.d/78-mm-naming.rules

ACTION!="add|change|move", GOTO="mm_naming_rules_end"
DEVPATH=="/devices/pci0000:00/0000:00:1d.0/usb4/4-1/4-1.5/4-1.5.5",ENV{ID_MM_PHYSDEV_UID}="USB1"
DEVPATH=="/devices/pci0000:00/0000:00:1d.0/usb4/4-1/4-1.5/4-1.5.2",ENV{ID_MM_PHYSDEV_UID}="USB2"
DEVPATH=="/devices/pci0000:00/0000:00:1d.0/usb4/4-1/4-1.5/4-1.5.3",ENV{ID_MM_PHYSDEV_UID}="USB3"
DEVPATH=="/devices/pci0000:00/0000:00:1d.0/usb4/4-1/4-1.5/4-1.5.4",ENV{ID_MM_PHYSDEV_UID}="USB4"
LABEL="mm_naming_rules_end"

The value of the new ID_MM_PHYSDEV_UID property will be used in the Device property exposed in the DBus object, and can also be used directly in mmcli calls instead of the path or index, e.g.:

$ mmcli -m USB4
...
 -------------------------
 System | device: 'USB4'
        | drivers: 'qmi_wwan, qcserial'
        | plugin: 'Sierra'
        | primary port: 'cdc-wdm2'
...

Given that the same property value will always be set for the modem in a specific device path, these user-provided names may unequivocally identify a specific modem, even when the device is power-cycled, unplugged and replugged, or the whole system rebooted.

Binding the property to the device path is just an example of what could be done. There is no restriction on the logic used to apply the ID_MM_PHYSDEV_UID property, so users may also choose other approaches.

This support is already in ModemManager git master, and as already said, will be included in the stable 1.8 release, whenever that is.


TL;DR? ModemManager now supports assigning unique names to devices, names that persist even across full system reboots.


October 05, 2016

I have been quite busy with various things in the last few weeks, but I have finally found some time to clean up and upload the code of the OpenGL terrain render demo to Github.

Since this was intended as a programming exercise I have not tried to be very elegant or correct during the implementation, so expect things like error handling to be a bit rough around the edges, but otherwise I think the code should be easy enough to follow.

Notice that I have only tested this on Intel GPUs. I know it works on NVIDIA too (thanks to Samuel and Chema for testing this) but there are a couple of rendering artifacts there, specifically at the edges of the skybox and some “pillars” showing up in the distance sometimes, probably because I am rendering one too many “rows” of the terrain and I end up rendering garbage. I may fix these some day.

The code I uploaded to the repository includes a few new features too:

  • Model variants, which are basically color variations of the same model
  • A couple of additional models (a new tree and plant) and a different rock type
  • Collision detection, which makes navigating the terrain more pleasant

Here is a new screenshot:

screenshot

In future posts I will talk a bit about some of the implementation details, so having the source code around will be useful. Enjoy!