planet.freedesktop.org
December 20, 2014
Just in time for the upcoming break, we have figured out how to do alpha-test, and now supertuxkart is rendering properly:



If you are wondering about the new stk beta, I have a build from a few weeks back which seems to render properly as well. It has a few rough edges, but I think that is just from using a random git commit-id for stk.  But we don't have enough gl3 features yet (on a3xx or a4xx) to be using the new rendering paths.

And gnome-shell works nicely too.  Still some rendering issues with xonotic.  And a little ways behind a3xx in piglit results, but not quite as much as I would have expected at this early stage.

Still missing are some optimizations that are important for certain use-cases (hw-binning support for games, GMEM bypass for UI/mipmap-generation/etc).  But the a420 in apq8084 (ifc6540 board) is surprisingly fast all the same.
December 17, 2014

Multi-Stream Transport 4k Monitors and X

I'm sure you've seen a 4k monitor on a friend's desk running Mac OS X or Windows and are all ready to go get one so that you can use it under Linux.

Once you've managed to acquire one, I'm afraid you'll discover that when you plug it in, you're limited to 30Hz refresh rates at the full size, unless you're running a kernel that is version 3.17 or later. And then...

Good Grief! What Is My Computer Doing!

Ok, so now you're running version 3.17 and when X starts up, it's like you're using a gigantic version of Google Cardboard. Two copies of a very tall, but very narrow screen greet you.

Welcome to MST island.

In order to drive these giant new panels at full speed, there isn't enough bandwidth in the display hardware to individually paint each pixel once during each frame. So, like all good hardware engineers, they invented a clever hack.

This clever hack paints the screen in parallel. I'm assuming that they've got two bits of display hardware, each one hooked up to half of the monitor. Now, each paints only half of the pixels, avoiding costly redesign of expensive silicon, at least that's my surmise.

In the olden days, if you did this, you'd end up running two monitor cables to your computer, and potentially even having two video cards. Today, thanks to the magic of Display Port Multi-Stream Transport, we don't need all of that; instead, MST allows us to pack multiple cables-worth of data into a single cable.

I doubt the inventors of MST intended it to be used to split a single LCD panel into multiple "monitors", but hardware engineers are clever folk and are more than capable of abusing standards like this when it serves to save a buck.

Turning Two Back Into One

We've got lots of APIs that expose monitor information in the system, and across which we might be able to wave our magic abstraction wand to fix this:

  1. The KMS API. This is the kernel interface which is used by all graphics stuff, including user-space applications and the frame buffer console. Solve the problem here and it works everywhere automatically.

  2. The libdrm API. This is just the KMS ioctls wrapped in a simple C library. Fixing things here wouldn't make fbcons work, but would at least get all of the window systems working.

  3. Every 2D X driver. (Yeah, we're trying to replace all of these with the one true X driver). Fixing the problem here would mean that all X desktops would work. However, that's a lot of code to hack, so we'll skip this.

  4. The X server RandR code. More plausible than fixing every driver, this also makes X desktops work.

  5. The RandR library. If not in the X server itself, how about over in user space in the RandR protocol library? Well, the problem here is that we've now got two of them (Xlib and xcb), and the xcb one is auto-generated from the protocol descriptions. Not plausible.

  6. The Xinerama code in the X server. Xinerama is how we did multi-monitor stuff before RandR existed. These days, RandR provides Xinerama emulation, but we've been telling people to switch to RandR directly.

  7. Some new API. Awesome. Ok, so if we haven't fixed this in any existing API we control (kernel/libdrm/X.org), then we effectively dump the problem into the laps of the desktop and application developers. Given how long it's taken them to adopt current RandR stuff, providing yet another complication in their lives won't make them very happy.

All Our APIs Suck

Dave Airlie merged MST support into the kernel for version 3.17 in the simplest possible fashion -- pushing the problem out to user space. I was initially vaguely tempted to go poke at it and try to fix things there, but he eventually convinced me that it just wasn't feasible.

It turns out that all of our fancy new modesetting APIs describe the hardware in more detail than any application actually cares about. In particular, we expose a huge array of hardware objects:

  • Subconnectors
  • Connectors
  • Outputs
  • Video modes
  • Crtcs
  • Encoders

Each of these objects exposes intimate details about the underlying hardware -- which of them can work together, and which cannot; what kinds of limits are there on data rates and formats; and pixel-level timing details about blanking periods and refresh rates.

To make things work, some piece of code needs to actually hook things up, and explain to the user why the configuration they want just isn't possible.

The sticking point we reached was that when an MST monitor gets plugged in, it needs two CRTCs to drive it. If one of those is already in use by some other output, there's just no way you can steal it for MST mode.

Another problem -- we expose EDID data and actual video mode timings. Our MST monitor has two EDID blocks, one for each half. They happen to describe how they're related, and how you should configure them, but if we want to hide that from the application, we'll have to pull those EDID blocks apart and construct a new one. The same goes for video modes; we'll have to construct ones for MST mode.

Every single one of our APIs exposes enough of this information to be dangerous.

Every one, except Xinerama. All it talks about is a list of rectangles, each of which represents a logical view into the desktop. Did I mention we've been encouraging people to stop using this? And that some of them listened to us? Foolishly?

Dave's Tiling Property

Dave hacked up the X server to parse the EDID strings and communicate the layout information to clients through an output property. Then he hacked up the gnome code to parse that property and build a RandR configuration that would work.

Then, he changed the RandR Xinerama code to also parse the TILE properties and to fix up the data seen by applications from that.

This works well enough to get a desktop running correctly, assuming that desktop uses Xinerama to fetch this data. Alas, gtk has been "fixed" to use RandR if you have RandR version 1.3 or later. No biscuit for us today.

Adding RandR Monitors

RandR doesn't have enough data types yet, so I decided that what we wanted to do was create another one; maybe that would solve this problem.

Ok, so what clients mostly want to know is which bits of the screen are going to be stuck together and should be treated as a single unit. With current RandR, that's some of the information included in a CRTC. You pull the pixel size out of the associated mode, physical size out of the associated outputs and the position from the CRTC itself.

Most of that information is available through Xinerama too; it's just missing physical sizes and any kind of labeling to help the user understand which monitor you're talking about.

The other problem with Xinerama is that it cannot be configured by clients; the existing RandR implementation constructs the Xinerama data directly from the RandR CRTC settings. Dave's Tiling property changes edit that data to reflect the union of associated monitors as a single Xinerama rectangle.

Allowing the Xinerama data to be configured by clients would fix our 4k MST monitor problem as well as solving the longstanding video wall, WiDi and VNC troubles. All of those want to create logical monitor areas within the screen under client control.

What I've done is create a new RandR datatype, the "Monitor", which defines a rectangular region of the screen. Each monitor has the following data:

  • Name. This provides some way to identify the Monitor to the user. I'm using X atoms for this as it made a bunch of things easier.

  • Primary boolean. This indicates whether the monitor is to be considered the "primary" monitor, suitable for placing toolbars and menus.

  • Pixel geometry (x, y, width, height). These locate the region within the screen and define the pixel size.

  • Physical geometry (width-in-millimeters, height-in-millimeters). These let the user know how big the pixels will appear in this region.

  • List of outputs. (I think this is the clever bit)

There are three requests to define, delete and list monitors. And that's it.

Now, we want the list of monitors to completely describe the environment, and yet we don't want existing tools to break completely. So, we need some way to automatically construct monitors from the existing RandR state while still letting the user override portions of it as needed to explain virtual or tiled outputs.

So, what I did was to let the client specify a list of outputs for each monitor. All of the CRTCs which aren't associated with an output in any client-defined monitor are then added to the list of monitors reported back to clients. That means that clients need only define monitors for things they understand, and they can leave the other bits alone and the server will do something sensible.

The second tricky bit is that if you specify an empty rectangle at 0,0 for the pixel geometry, then the server will automatically compute the geometry using the list of outputs provided. That means that if any of those outputs get disabled or reconfigured, the Monitor associated with them will appear to change as well.
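The two rules above (automatic monitors for unclaimed CRTCs, and geometry computed from the outputs when the rectangle is all zeros) can be sketched as follows. This is a minimal model, not the X server code: the `Crtc` and `Monitor` classes, the `list_monitors` function and the output names are all hypothetical, and naming an automatic monitor after its first output is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # x, y, width, height


@dataclass
class Crtc:
    rect: Rect
    outputs: List[str]  # names of the outputs this CRTC drives


@dataclass
class Monitor:
    name: str
    outputs: List[str]
    rect: Rect = (0, 0, 0, 0)  # all zeros means "track the outputs"
    automatic: bool = False


def bounding_box(rects: List[Rect]) -> Rect:
    """Smallest rectangle covering every rectangle in rects."""
    if not rects:
        return (0, 0, 0, 0)
    x1 = min(x for x, _, _, _ in rects)
    y1 = min(y for _, y, _, _ in rects)
    x2 = max(x + w for x, _, w, _ in rects)
    y2 = max(y + h for _, y, _, h in rects)
    return (x1, y1, x2 - x1, y2 - y1)


def list_monitors(crtcs: List[Crtc], defined: List[Monitor]) -> List[Monitor]:
    """Client-defined monitors first, then one automatic monitor per unclaimed CRTC."""
    result = []
    claimed = set()
    for mon in defined:
        claimed.update(mon.outputs)
        rect = mon.rect
        if rect == (0, 0, 0, 0):
            # Empty geometry: follow the CRTCs driving this monitor's outputs.
            rect = bounding_box([c.rect for c in crtcs
                                 if any(o in mon.outputs for o in c.outputs)])
        result.append(Monitor(mon.name, list(mon.outputs), rect))
    for crtc in crtcs:
        if crtc.outputs and not claimed.intersection(crtc.outputs):
            # Assumption: the automatic monitor takes the first output's name.
            result.append(Monitor(crtc.outputs[0], list(crtc.outputs),
                                  crtc.rect, automatic=True))
    return result
```

With two 1920x2160 tiles on hypothetical outputs "DP-1-1" and "DP-1-2" and a 1920x1080 panel on "eDP-1" at x=3840, defining a single monitor "4K" over the two DP outputs yields a 3840x2160 monitor at 0,0 plus an automatic monitor for the panel.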

Current Status

Gtk+ has been switched to use RandR for RandR versions 1.3 or later. Locally, I hacked libXrandr to override the RandR version through an environment variable, set that to 1.2 and Gtk+ happily reverts back to Xinerama and things work fine. I suspect the plan here will be to have it use the new Monitors when present as those provide the same info that it was pulling out of RandR's CRTCs.

KDE appears to still use Xinerama data for this, so it "just works".

Where's the code

As usual, all of the code for this is in a collection of git repositories in my home directory on fd.o:

git://people.freedesktop.org/~keithp/randrproto master
git://people.freedesktop.org/~keithp/libXrandr master
git://people.freedesktop.org/~keithp/xrandr master
git://people.freedesktop.org/~keithp/xserver randr-monitors

RandR protocol changes

Here's the new sections added to randrproto.txt

                  ❧❧❧❧❧❧❧❧❧❧❧

1.5. Introduction to version 1.5 of the extension

Version 1.5 adds monitors

 • A 'Monitor' is a rectangular subset of the screen which represents
   a coherent collection of pixels presented to the user.

 • Each Monitor is associated with a list of outputs (which may be
   empty).

 • When clients define monitors, the associated outputs are removed from
   existing Monitors. If removing the output causes the list for that
   monitor to become empty, that monitor will be deleted.

 • For active CRTCs that have no output associated with any
   client-defined Monitor, one server-defined monitor will
   automatically be defined of the first Output associated with them.

 • When defining a monitor, setting the geometry to all zeros will
   cause that monitor to dynamically track the bounding box of the
   active outputs associated with them.

This new object separates the physical configuration of the hardware
from the logical subsets of the screen that applications should
consider as single viewable areas.

1.5.1. Relationship between Monitors and Xinerama

Xinerama's information now comes from the Monitors instead of directly
from the CRTCs. The Monitor marked as Primary will be listed first.

                  ❧❧❧❧❧❧❧❧❧❧❧

5.6. Protocol Types added in version 1.5 of the extension

MONITORINFO { name: ATOM
          primary: BOOL
          automatic: BOOL
          x: INT16
          y: INT16
          width: CARD16
          height: CARD16
          width-in-millimeters: CARD32
          height-in-millimeters: CARD32
          outputs: LISTofOUTPUT }

                  ❧❧❧❧❧❧❧❧❧❧❧

7.5. Extension Requests added in version 1.5 of the extension.

┌───
    RRGetMonitors
    window : WINDOW
     ▶
    timestamp: TIMESTAMP
    monitors: LISTofMONITORINFO
└───
    Errors: Window

    Returns the list of Monitors for the screen containing
    'window'.

    'timestamp' indicates the server time when the list of
    monitors last changed.

┌───
    RRSetMonitor
    window : WINDOW
    info: MONITORINFO
└───
    Errors: Window, Output, Atom, Value

    Create a new monitor. Any existing Monitor of the same name is deleted.

    'name' must be a valid atom or an Atom error results.

    'name' must not match the name of any Output on the screen, or
    a Value error results.

    If 'info.outputs' is non-empty, and if x, y, width, height are all
    zero, then the Monitor geometry will be dynamically defined to
    be the bounding box of the geometry of the active CRTCs
    associated with them.

    If 'name' matches an existing Monitor on the screen, the
    existing one will be deleted as if RRDeleteMonitor were called.

    For each output in 'info.outputs', each one is removed from all
    pre-existing Monitors. If removing the output causes the list of
    outputs for that Monitor to become empty, then that Monitor will
    be deleted as if RRDeleteMonitor were called.

    Only one monitor per screen may be primary. If 'info.primary'
    is true, then the primary value will be set to false on all
    other monitors on the screen.

    RRSetMonitor generates a ConfigureNotify event on the root
    window of the screen.

┌───
    RRDeleteMonitor
    window : WINDOW
    name: ATOM
└───
    Errors: Window, Atom, Value

    Deletes the named Monitor.

    'name' must be a valid atom or an Atom error results.

    'name' must match the name of a Monitor on the screen, or a
    Value error results.

    RRDeleteMonitor generates a ConfigureNotify event on the root
    window of the screen.

                  ❧❧❧❧❧❧❧❧❧❧❧
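The bookkeeping that RRSetMonitor describes can be modelled in a few lines. This is only a sketch of the rules as written in the protocol text, not the server implementation: `MonitorInfo`, `set_monitor` and the output names are hypothetical, Python's `ValueError` stands in for the protocol's Value error, and geometry and events are left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class MonitorInfo:
    name: str
    outputs: List[str]
    primary: bool = False


def set_monitor(monitors: Dict[str, MonitorInfo],
                output_names: Set[str],
                info: MonitorInfo) -> None:
    """Apply the RRSetMonitor rules to a name -> monitor table, in place."""
    # 'name' must not match the name of any Output on the screen.
    if info.name in output_names:
        raise ValueError("monitor name matches an output name")

    # Any existing Monitor of the same name is deleted.
    monitors.pop(info.name, None)

    # Each listed output is removed from all pre-existing Monitors; a Monitor
    # whose output list becomes empty because of that removal is deleted.
    for name in list(monitors):
        other = monitors[name]
        had_outputs = bool(other.outputs)
        other.outputs = [o for o in other.outputs if o not in info.outputs]
        if had_outputs and not other.outputs:
            del monitors[name]

    # Only one monitor per screen may be primary.
    if info.primary:
        for other in monitors.values():
            other.primary = False

    monitors[info.name] = info
```

Defining a primary monitor "4K" over "DP-1-1" and "DP-1-2" in a table where "left" owns only "DP-1-1" and "right" owns "DP-1-2" and "HDMI-1" deletes "left", leaves "right" with just "HDMI-1", and clears any other primary flag.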
December 16, 2014
So kernel version 3.18 is out the door and it's time for our regular look at what's in the next merge window.
First looking at new hardware the big item is basic Skylake support. There are still a few small things missing, but mostly it's there now. This has been contributed by Damien, Satheeshakrishna and a lot of other folks. Looking at other platforms there have also been a lot of changes for vlv/chv: Improved backlight code, completely refactored interrupt handling to bring it in line with other platforms, rewritten panel power sequencing code, all from Ville. Rodrigo contributed PSR support for vlv/chv together with a lot of other fixes for PSR. Unfortunately it has not yet been re-enabled by default.

Moving on to Broadwell and the render side of things, Mika and Arun provided patches to improve the render workaround code and bring the set of workarounds up to date. execlist (the new command submission support on Gen8+) is also being polished with the addition of on-demand pinning of context objects with patches from Thomas Daniel and Oscar Mateo. Finally the RPS/render-turbo code has seen a lot of polish from Imre with a few fixes from Tom O'Rourke.

Otherwise not a lot of really big things happened on the GEM side: Just a few patches to fix issues in ppgtt (unfortunately still not enabled by default anywhere due to fun with context switches). And there's a bit of prep work and reorg all over for new stuff landing hopefully soon.

Looking at overall infrastructure changes the big thing certainly is the preparations for atomic display updates. The drm core/driver interface for atomic and all the helper library code to convert drivers has landed in 3.19, along with some conversions already. On the Intel side it's been just prep work under the hood thus far with patches from Ander to precompute display PLL state. The new code to use vblank evades for pageflips has also landed, which is needed for atomic plane updates. And prep patches from Gustavo Padovan started to split the low-level plane update functions into check and commit steps. Lots more patches from different people are in flight and some have been merged for 3.20 already.

Besides these driver internal changes for atomic there has been other work to improve the codebase: Imre reorganized our handlers for suspend, resume and thawing and freezing. Jani reworked the audio and eld code which is the gfx side of the puzzle needed to make audio over HDMI or DP work. Jesse provided patches to track infoframes more accurately, which is needed to correctly fastboot (i.e. without modesets if possible) on external screens.

For older machines Ville has spent a few spare cycles to make them more useful: GPU reset support for gen3/4 should mitigate some of the recent chromium crashes on mesa, and the modeset code on i830M might work correctly for the first time, ever.


And of course the usual pile of smaller fixes and improvements all over.

Not directly related to code or features is the start of documenting i915 driver internals: With this release we now have some of the interrupt handling, fifo underrun reporting, frontbuffer tracking and runtime pm support newly documented. And there's lots more in-flight, so hopefully soonish this will be fairly useful.

Those running Fedora Rawhide or GNOME 3.12 may have noticed that there is no Xorg.log file anymore. This is intentional: gdm now starts the X server so that it writes the log to the systemd journal. Update 29 Mar 2014: The X server itself has no capabilities for logging to the journal yet, but no changes to the X server were needed anyway. gdm merely starts the server with a /dev/null logfile and redirects stdout/stderr to the journal.

Thus, to get the log file use journalctl, not vim, cat, less, notepad or whatever your $PAGER was before.

This leaves us with the following commands.


journalctl -e _COMM=Xorg
Which would conveniently show something like this:

Mar 25 10:48:41 yabbi Xorg[5438]: (II) UnloadModule: "wacom"
Mar 25 10:48:41 yabbi Xorg[5438]: (II) evdev: Lenovo Optical USB Mouse: Close
Mar 25 10:48:41 yabbi Xorg[5438]: (II) UnloadModule: "evdev"
Mar 25 10:48:41 yabbi Xorg[5438]: (II) evdev: Integrated Camera: Close
Mar 25 10:48:41 yabbi Xorg[5438]: (II) UnloadModule: "evdev"
Mar 25 10:48:41 yabbi Xorg[5438]: (II) evdev: Sleep Button: Close
Mar 25 10:48:41 yabbi Xorg[5438]: (II) UnloadModule: "evdev"
Mar 25 10:48:41 yabbi Xorg[5438]: (II) evdev: Video Bus: Close
Mar 25 10:48:41 yabbi Xorg[5438]: (II) UnloadModule: "evdev"
Mar 25 10:48:41 yabbi Xorg[5438]: (II) evdev: Power Button: Close
Mar 25 10:48:41 yabbi Xorg[5438]: (II) UnloadModule: "evdev"
Mar 25 10:48:41 yabbi Xorg[5438]: (EE) Server terminated successfully (0). Closing log file.
The -e toggle jumps to the end and only shows 1000 lines, but that's usually enough. journalctl has a bunch more options described in the journalctl man page. Note the PID in square brackets though. You can easily limit the output to just that PID, which makes it ideal for attaching the log to a bug report.

journalctl _COMM=Xorg _PID=5438
Previously the server kept only a single backup log file around, so if you restarted twice after a crash, the log was gone. With the journal it's now easy to extract the log file from that crash five restarts ago. It's almost like the future is already here.

Update 16/12/2014: This post initially suggested to use journalctl /usr/bin/Xorg. Using _COMM is path-independent.

Fedora 21

Added 16/12/2014: If you recently updated to/installed Fedora 21 you'll notice that the above command won't show anything. As part of the Xorg without root rights feature Fedora ships a wrapper script as /usr/bin/Xorg. This script eventually executes /usr/libexec/Xorg.bin which is the actual X server binary. Thus, on Fedora 21 replace Xorg with Xorg.bin:


journalctl -e _COMM=Xorg.bin
journalctl _COMM=Xorg.bin _PID=5438
Note that we're looking into this so that in a few updates time we don't have a special command here.
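For scripts that collect logs for bug reports, the commands above can be assembled programmatically. A small sketch, using only the journalctl arguments shown in this post (-e, _COMM= and _PID=); the function names `xorg_journal_cmd` and `save_xorg_log` are made up for illustration.

```python
import subprocess
from typing import List, Optional


def xorg_journal_cmd(comm: str = "Xorg",
                     pid: Optional[int] = None,
                     jump_to_end: bool = False) -> List[str]:
    """Build the journalctl argument list for the X server log."""
    cmd = ["journalctl"]
    if jump_to_end:
        cmd.append("-e")
    cmd.append(f"_COMM={comm}")  # "Xorg.bin" on Fedora 21
    if pid is not None:
        cmd.append(f"_PID={pid}")  # restrict to one server instance
    return cmd


def save_xorg_log(path: str, comm: str = "Xorg",
                  pid: Optional[int] = None) -> None:
    """Write the selected log to a file, e.g. to attach to a bug report."""
    with open(path, "w") as out:
        subprocess.run(xorg_journal_cmd(comm, pid), stdout=out, check=True)
```

For example, `save_xorg_log("xorg.log", comm="Xorg.bin", pid=5438)` would write the log of that one server instance on Fedora 21, assuming journalctl is installed and the journal is readable by the calling user.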

December 13, 2014

Present and Compositors

The current Present extension is pretty unfriendly to compositing managers, causing an extra frame of latency between the application's operation and the scanout buffer. Here's how I'm fixing that.

An extra frame of lag

When an application uses PresentPixmap, that operation is generally delayed until the next vblank interval. When using X without compositing, this ensures that the operation will get started in the vblank interval, and, if the rendering operation is quick enough, you'll get the frame presented without any tearing.

When using a compositing manager, the operation is still delayed until the vblank interval. That means that the CopyArea and subsequent Damage event generation don't occur until the display has already started the next frame. The compositing manager receives the damage event and constructs a new frame, but it also wants to avoid tearing, so that frame won't get displayed immediately, instead it'll get delayed until the next frame, introducing the lag.

Copy now, complete later

While away from the keyboard this morning, I had a sudden idea -- what if we performed the CopyArea and generated Damage right when the PresentPixmap request was executed but delayed the PresentComplete event until vblank happened?

With the contents updated and damage delivered, the compositing manager can immediately start constructing a new scene for the upcoming frame. When that is complete, it can also use PresentPixmap (either directly or through OpenGL) to queue the screen update.

If it's fast enough, that will all happen before vblank and the application contents will actually appear at the desired time.

Now, at the appointed vblank time, the PresentComplete event will get delivered to the client, telling it that the operation has finished and that its contents are now on the screen. If the compositing manager was quick, this event won't even be a lie.

We'll be lying less often

Right now, the CopyArea, Damage and PresentComplete operations all happen after the vblank has passed. As the compositing manager delays the screen update until the next vblank, then every single PresentComplete event will have the wrong UST/MSC values in it.

With the CopyArea happening immediately, we've a pretty good chance that the compositing manager will get the application contents up on the screen at the target time. When this happens, the PresentComplete event will have the correct values in it.
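The difference between the two schemes can be shown with a toy timing model. This is a simplification under stated assumptions, not the server's logic: time is measured in frames with a vblank at every integer, the compositor needs a fixed `compose_time` after it receives Damage, and its own update reaches the screen at the first vblank after it finishes. All names are hypothetical.

```python
import math


def next_vblank(t: float) -> int:
    """First vblank at or after time t (vblanks fall on integer times)."""
    return math.ceil(t)


def copy_at_vblank(request_time: float, compose_time: float):
    """Current behaviour: CopyArea and Damage wait for the target vblank."""
    reported = next_vblank(request_time)              # time in PresentComplete
    displayed = next_vblank(reported + compose_time)  # compositor starts late
    return reported, displayed


def copy_now(request_time: float, compose_time: float):
    """Proposed behaviour: CopyArea and Damage happen at request time."""
    reported = next_vblank(request_time)
    displayed = next_vblank(request_time + compose_time)
    return reported, displayed


# Application presents 0.2 frames after vblank 10, compositor needs 0.3 frames.
print(copy_at_vblank(10.2, 0.3))  # (11, 12): event says 11, pixels arrive at 12
print(copy_now(10.2, 0.3))        # (11, 11): the event is accurate
print(copy_now(10.2, 0.9))        # (11, 12): a slow compositor still misses
```

In this model the reported time is only accurate when the compositor finishes before the target vblank, which is the "pretty good chance" described above rather than a guarantee.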

How can we do better?

The only way to do better is to have the PresentComplete event generated when the compositing manager displays the frame. I've talked about how that should work, but it's a bit twisty, and will require changes in the compositing manager to report the association between their PresentPixmap request and the applications' PresentPixmap requests.

Where's the code

I've got a set of three patches, two of which restructure the existing code without changing any behavior and a final patch which adds this improvement. Comments and review are encouraged, as always!

git://people.freedesktop.org/~keithp/xserver.git present-compositor
December 08, 2014
As the development window for GNOME 3.16 advances, I've been adding a few new developer features, selfishly, so I could use them in my own programs.

Connectivity support for applications

Picking up from where Dan Winship left off, we've merged support for applications to detect the network availability, especially the "connected to a network but not to the Internet" case.

In glib/gio now, watch the value of the "connectivity" property in GNetworkMonitor.

Grilo automatic network awareness

This glib/gio feature allows us to show/hide Grilo sources from applications' view if they require Internet and LAN access to work. This should be landing very soon, once we've made the new feature optional based on the presence of the new GLib.

Totem

And finally, this means we'll soon be able to show a nice placeholder when no network connection is available, and there are no channels left.

Grilo Lua resources support

A long-standing request, GResources support has landed for Grilo Lua plugins. When a script is loaded, we'll look for a separate GResource file with ".gresource" as the suffix, and automatically load it. This means you can use a local icon for sources with the URL "resource:///org/gnome/grilo/foo.png". Your favourite Lua sources will soon have icons!

Grilo Opensubtitles plugin

The developers affected by this new feature may be a group of one, but if the group is ever to expand, it's the right place to do it. This new Grilo plugin will fetch the list of available text subtitles for specific videos, given their "hashes", which are now exported by Tracker.

GDK-Pixbuf enhancements

I can point you to the NEWS file for the latest version, but the main gains are that GIF animations won't eat all your memory, DPI metadata support in JPEG, PNG and TIFF formats, and, for image viewers, you can tell whether a TIFF file is multi-page to open it in a more capable viewer.

Batched inserts, and better filters in GOM

Does what it says on the tin. This is useful for populating the database quicker than through piecemeal inserts, it also means you don't need to chain inserts when inserting multiple items.

Mathieu also worked on fixing the priority of filters when building complex queries, as well as supporting more than 2 items in a filter ("foo OR bar OR baz" for example).
December 06, 2014


Mice have an optical sensor that tells them how far they moved in "mickeys". Depending on the sensor, a mickey is anywhere between 1/100 to 1/8200 of an inch or less. The current "standard" resolution is 1000 DPI, but older mice will have 800 DPI, 400 DPI etc. Resolutions above 1200 DPI are generally reserved for gaming mice with (usually) switchable resolution and it's an arms race between manufacturers in who can advertise higher numbers.

HW manufacturers are cheap bastards so of course the mice don't advertise the sensor resolution. Which means that for the purpose of pointer acceleration there is no physical reference. That delta of 10 could be a millimeter of mouse movement or a nanometer, you just can't know. And if pointer acceleration works on input without reference, it becomes useless and unpredictable. That is partially intended, HW manufacturers advertise that a lower resolution will provide more precision while sniping and a higher resolution means faster turns while running around doing rocket jumps. I personally don't think that there's much difference between 5000 and 8000 DPI anymore, the mouse is so sensitive that if you sneeze your pointer ends up next to Philae. But then again, who am I to argue with marketing types.

For us, useless and unpredictable is bad, especially in the use-case of everyday desktops. To work around that, libinput 0.7 now incorporates the physical resolution into pointer acceleration. And to do that we need a database, which will be provided by udev as of systemd 218 (unreleased at the time of writing). This database incorporates the various devices and their physical resolution, together with their sampling rate. udev sets the resolution as the MOUSE_DPI property that we can read in libinput and use as reference point in the pointer accel code. In the simplest case, the entry lists a single resolution with a single frequency (e.g. "MOUSE_DPI=1000@125"), for switchable gaming mice it lists a list of resolutions with frequencies and marks the default with an asterisk ("MOUSE_DPI=400@50 800@50 *1000@125 1200@125"). And you can and should help us populate the database so it gets useful really quickly.
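The MOUSE_DPI value format described above is simple to parse. The sketch below is not libinput's code: `DpiEntry`, `parse_mouse_dpi`, `default_entry` and `normalize_delta` are hypothetical names, falling back to the first entry when nothing is marked with an asterisk is an assumption, and so is the choice of 1000 DPI as the reference resolution (taken from the "standard" resolution mentioned earlier).

```python
from typing import List, NamedTuple


class DpiEntry(NamedTuple):
    dpi: int
    frequency_hz: int
    is_default: bool


def parse_mouse_dpi(value: str) -> List[DpiEntry]:
    """Parse the part after 'MOUSE_DPI=', e.g. '400@50 *1000@125'."""
    entries = []
    for token in value.split():
        is_default = token.startswith("*")
        dpi, _, freq = token.lstrip("*").partition("@")
        entries.append(DpiEntry(int(dpi), int(freq), is_default))
    return entries


def default_entry(entries: List[DpiEntry]) -> DpiEntry:
    """The entry marked with '*', else (assumption) the first one listed."""
    for entry in entries:
        if entry.is_default:
            return entry
    return entries[0]


def normalize_delta(delta: float, dpi: int, reference_dpi: int = 1000) -> float:
    """Scale a raw device delta to what a reference_dpi mouse would report."""
    return delta * reference_dpi / dpi


entries = parse_mouse_dpi("400@50 800@50 *1000@125 1200@125")
print(default_entry(entries).dpi)  # 1000
print(normalize_delta(10, 400))    # 25.0
```

With a known resolution, the same physical movement produces the same normalized delta whatever the sensor, which gives the acceleration code the physical reference it was missing.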

How to add your device to the database

We use udev's hwdb for the database list. The upstream file is in /usr/lib/udev/hwdb.d/70-mouse.hwdb, the ruleset to trigger a match is in /usr/lib/udev/rules.d/70-mouse.rules. The easiest way to add a match is with the libevdev mouse-dpi-tool (version 1.3.2). Run it and follow the instructions. The output looks like this:


$ sudo ./tools/mouse-dpi-tool /dev/input/event8
Mouse Lenovo Optical USB Mouse on /dev/input/event8
Move the device along the x-axis.
Pause 3 seconds before movement to reset, Ctrl+C to exit.
Covered distance in device units: 264 at frequency 125.0Hz | |^C
Estimated sampling frequency: 125Hz
To calculate resolution, measure physical distance covered
and look up the matching resolution in the table below
16mm 0.66in 400dpi
11mm 0.44in 600dpi
8mm 0.33in 800dpi
6mm 0.26in 1000dpi
5mm 0.22in 1200dpi
4mm 0.19in 1400dpi
4mm 0.17in 1600dpi
3mm 0.15in 1800dpi
3mm 0.13in 2000dpi
3mm 0.12in 2200dpi
2mm 0.11in 2400dpi

Entry for hwdb match (replace XXX with the resolution in DPI):
mouse:usb:v17efp6019:name:Lenovo Optical USB Mouse:
MOUSE_DPI=XXX@125
Take those last two lines, add them to a local new file /etc/udev/hwdb.d/71-mouse.hwdb. Rebuild the hwdb, trigger it, and done:

$ sudo udevadm hwdb --update
$ sudo udevadm trigger /dev/input/event8
Leave out the device path if you're not on systemd 218 yet. Check if the property is set:

$ udevadm info /dev/input/event8 | grep MOUSE_DPI
E: MOUSE_DPI=1000@125
And that shows everything worked. Restart X/Wayland/whatever uses libinput and you're good to go. If it works, double-check the upstream instructions, then file a bug against systemd with those two lines and assign it to me.

Trackballs are a bit hard to measure like this, my suggestion is to check the manufacturer's website first for any resolution data.
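The lookup table printed by mouse-dpi-tool is just the covered distance in device units divided by each candidate resolution. A short sketch of that arithmetic follows; the function names are hypothetical, and the candidate list of 400 to 2400 DPI in steps of 200 simply mirrors the table in the sample output.

```python
MM_PER_INCH = 25.4


def distance_for_dpi(units: int, dpi: int):
    """Physical distance (mm, inches) that 'units' device units cover at 'dpi'."""
    inches = units / dpi
    return inches * MM_PER_INCH, inches


def estimate_dpi(units: int, distance_mm: float) -> float:
    """Resolution implied by covering distance_mm while reporting 'units'."""
    return units / (distance_mm / MM_PER_INCH)


def nearest_listed_dpi(units: int, distance_mm: float,
                       candidates=range(400, 2401, 200)) -> int:
    """Candidate resolution closest to the measured estimate."""
    estimate = estimate_dpi(units, distance_mm)
    return min(candidates, key=lambda dpi: abs(dpi - estimate))


# 264 device units, as in the sample run above.
mm, inches = distance_for_dpi(264, 400)
print(int(mm), round(inches, 2))     # 16 0.66
print(nearest_listed_dpi(264, 6.7))  # 1000
```

The differences between rows shrink quickly at higher resolutions, so a longer movement gives a more reliable reading than a short one.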

Update 2014/12/06: trackball comment added, udevadm trigger comment for pre 218

December 02, 2014

Disclaimer: Limba is still in a very early stage of development. Bugs happen, and I give no guarantees on API stability yet.

Limba is a very simple cross-distro package installer, utilizing OverlayFS found in recent Linux kernels (>= 3.18).

As an example I created a small Limba package for one of the Qt5 demo applications, and I would like to share the process of creating Limba packages – it’s quite simple, and I could use some feedback on how well the resulting packages work on multiple distributions.

I assume that you have compiled Limba and installed it – how that is done is described in its README file. So, let’s start.

1. Prepare your application

The cool thing about Limba is that you don’t really have to make many changes to your application. There are a few things to pay attention to, though:

  • Ensure the binaries and data are installed into the right places in the directory hierarchy. Binaries must go to $prefix/bin, for example.
  • Ensure that configuration can be found under /etc as well as under $prefix/etc

This needs to be done so your application will find its data at runtime. Additionally, you need to write an AppStream metadata file, and find out which stuff your application depends on.

2. Create package metadata & install software

2.1 Basics

Now you can create the metadata necessary to build a Limba package. Just run

cd /path/to/my/project
lipkgen make-template

This will create a “pkginstall” directory, containing a “control” file and a “metainfo.xml” file, which can be a symlink to the AppStream metadata, or be new metadata.

Now, configure your application with /opt/bundle as install prefix (-DCMAKE_INSTALL_PREFIX=/opt/bundle, --prefix=/opt/bundle, etc.) and install it to the pkginstall/inst_target directory.

2.2 Handling dependencies

If your software has dependencies on other packages, just get the Limba packages for these dependencies, or build new ones. Then place the resulting IPK packages in the pkginstall/repo directory. Ideally, you should be able to fetch Limba packages which contain the software components directly from their upstream developers.

Then, open the pkginstall/control file and adjust the “Requires” line. The names of the components you depend on match their AppStream-IDs (<id/> tag in the AppStream XML document). Any version-relation (>=, >>, <<, <=, <>) is supported, and specified in brackets after the component-id.

The resulting control-file might look like this:

Format-Version: 1.0

Requires: Qt5Core (>= 5.3), Qt5DBus (>= 5.3), libpng12

If the specified dependencies are in the repo/ subdirectory, these packages will be installed automatically when your application package is installed. Otherwise, Limba depends on the user to install these packages manually – there is no interaction with the distribution’s package-manager (yet?).
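To make the “Requires” syntax a bit more concrete, here is a minimal Python sketch that splits such a line into component-id, relation and version. It is purely illustrative: the function name parse_requires and the regular expression are assumptions made for this example, not Limba's own parser, and it only covers the syntax described above.

```python
import re

# Illustrative only: NOT Limba's parser, just a sketch of the
# "component-id (relation version)" syntax described in the text.
_REQ_RE = re.compile(
    r"^\s*(?P<name>[^\s()]+)\s*"
    r"(?:\(\s*(?P<rel>>=|<=|>>|<<|<>)\s*(?P<ver>[^\s()]+)\s*\))?\s*$"
)


def parse_requires(value):
    """Split a Requires value into (component_id, relation, version) tuples.

    relation and version are None when no version constraint is given.
    """
    result = []
    for item in value.split(","):
        if not item.strip():
            continue  # tolerate a trailing comma
        match = _REQ_RE.match(item)
        if match is None:
            raise ValueError("cannot parse requirement: %r" % item)
        result.append(
            (match.group("name"), match.group("rel"), match.group("ver"))
        )
    return result


if __name__ == "__main__":
    line = "Qt5Core (>= 5.3), Qt5DBus (>= 5.3), libpng12"
    for entry in parse_requires(line):
        print(entry)
```

An entry without a bracketed relation, like libpng12 here, simply means that any version of that component is accepted.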

3. Building the package

In order to build your package, make sure the content in inst_target/ is up to date, then run

lipkgen build pkginstall/

This will build your package and output it in the pkginstall/ directory.

4. Testing the package

You can now test your package. Just run

sudo lipa install package.ipk

Your software should install successfully. If you provided a .desktop file in $prefix/share/applications, you should find your application in your desktop’s application-menu. Otherwise, you can run a binary from the command-line: just append the version of your package to the binary name (bash-completion helps). Alternatively, you can use the runapp command, which lets you run any binary in your bundle/package. This is quite helpful for debugging, since the environment a Limba-installed application runs in is different from that of other applications.

Example:

runapp ${component_id}-${version}:/bin/binary-name

And that’s it! :-)

I used these steps to create a Limba package for the OpenGL Qt5 demo on Tanglu 2 (Bartholomea), and tested it on Kubuntu 15.04 (Vivid) with KDE, as well as on an up-to-date Fedora 21, with GNOME and without any Qt or KDE stuff installed:

[Screenshots: qt5demo-limba-kubuntu, qt5demo-limba-fedora]

I encountered a few obstacles when building the packages, e.g. Qt5 initially didn’t find the right QPA plugin – that has been fixed by adjusting a config file in the Qt5Gui package. Also, on Fedora, a matching libpng was missing, so I included that as well.

You can find the packages at Github, currently (but I am planning to move them to a different place soon). The biggest issue with Limba at this time is that it needs Linux 3.18, or an older kernel with OverlayFS support compiled in. Apart from that and a few bugs, the experience is quite smooth. As soon as I am sure there are no hidden fundamental issues, I can think of implementing more features, like signing packages and automatically updating them.

Have fun playing around with Limba!

December 01, 2014

A long-standing and unfixable problem in X is that we cannot send a number of keys to clients because their keycode is too high. This doesn't affect any of the normal keys for typing, but a lot of multimedia keys, especially "newly" introduced ones.

X has a maximum keycode 255, and "Keycodes lie in the inclusive range [8,255]". The reason for the offset 8 keeps escaping me but it doesn't matter anyway. Effectively, it means that we are limited to 247 keys per keyboard. Now, you may think that this would be enough and thus the limit shouldn't really affect us. And you're right. This post explains why it is a problem nonetheless.

Let's discard any ideas about actually increasing the limit from 8 bit to 32 bit. It's hardwired into so many open-coded structs that this is simply not an option. You'd be breaking every X client out there, so at this point you might as well rewrite the display server and aim for replacing X altogether. Oh wait...

So why aren't 247 keycodes enough? The reason is that large chunks of that range are unused and wasted.

In X, the keymap is an array in the form keysyms[keycode] = some keysym (that's a rather simplified view, look at the output from "xkbcomp -xkb $DISPLAY -" for details). The actual value of the keycode doesn't matter in theory, it's just an index. Of course, that theory only applies when you're looking at one keyboard at a time. We need to ship keymaps that are useful everywhere (see xkeyboard-config) and for that we need some sort of standard. In the olden days this meant every vendor had their own keycodes (see /usr/share/X11/xkb/keycodes) but these days Linux normalizes it to evdev keycodes. So we know that KEY_VOLUMEUP is always 115 and we can hook it up thus to just work out of the box. That however leaves us with huge ranges of unused keycodes because every device is different. My keyboard does not have a CD eject key, but it has volume control keys. I have a key to start a web browser but I don't have a key to start a calculator. Others' keyboards do have those keys though, and they expect those keys to work. So the default keymap needs to map the possible keycodes to the matching keysyms and suddenly our 247 keycodes per keyboard becomes 247 for all keyboards ever made. And that is simply not enough.
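The arithmetic behind this can be sketched in a few lines of Python. This assumes the usual mapping of "X keycode = evdev keycode + 8". Only KEY_VOLUMEUP = 115 comes from the text above; the other evdev codes are believed to match linux/input-event-codes.h but should be treated as assumptions and checked against your kernel headers.

```python
# Sketch, assuming the usual mapping "X keycode = evdev keycode + 8".
X_MIN_KEYCODE = 8
X_MAX_KEYCODE = 255
EVDEV_OFFSET = 8

# KEY_VOLUMEUP is given in the text; the other values are assumptions
# (believed to match linux/input-event-codes.h, verify before relying on them).
EVDEV_CODES = {
    "KEY_VOLUMEUP": 115,
    "KEY_CALC": 140,
    "KEY_WWW": 150,
    "KEY_EJECTCD": 161,
    "KEY_MICMUTE": 248,
}


def to_x_keycode(evdev_code):
    """Return the X keycode for an evdev code, or None if it does not fit."""
    keycode = evdev_code + EVDEV_OFFSET
    if X_MIN_KEYCODE <= keycode <= X_MAX_KEYCODE:
        return keycode
    return None


if __name__ == "__main__":
    for name, code in sorted(EVDEV_CODES.items(), key=lambda kv: kv[1]):
        keycode = to_x_keycode(code)
        status = keycode if keycode is not None else "cannot be sent to X clients"
        print("%-14s evdev %3d -> %s" % (name, code, status))
```

Every key in that table has to keep its fixed slot in the default keymap whether or not your keyboard has it, and anything whose evdev code plus the offset lands above 255 simply falls off the end.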

To work around this, we'd need locally hardware-adjusted keymaps generated at runtime. After loading the driver we can look at the keys that exist, remap higher keycodes into an unused range and then communicate that to move the keysyms into the newly mapped keycodes. This is...complicated. evdev doesn't know about keymaps. When gnome-settings-daemon applied your user-specific layout, evdev didn't get told about this. GNOME on the other hand has no idea that evdev previously re-mapped higher keycodes. So when g-s-d applies your setting, it may overwrite the remapped keysym with the one from the default keymaps (and evdev won't notice).
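As a rough illustration of what such a per-device remapping would involve, here is a hypothetical Python sketch. Nothing like remap_high_keycodes exists in evdev or the server; the function, its parameters and the "fill free slots from the top" policy are all assumptions made for this example.

```python
def remap_high_keycodes(device_evdev_codes, offset=8, x_min=8, x_max=255):
    """Hypothetical sketch: squeeze one device's keys into X's keycode range.

    Keys that fit keep their usual slot (evdev code + offset). Keys that
    overflow are moved into X keycodes this particular device leaves unused,
    starting from the top of the range.
    Returns a dict {evdev_code: x_keycode}.
    """
    mapping = {}
    overflow = []
    for code in sorted(set(device_evdev_codes)):
        if code + offset <= x_max:
            mapping[code] = code + offset
        else:
            overflow.append(code)

    used = set(mapping.values())
    # Free slots are specific to this device, which is exactly the problem:
    # a default keymap applied later knows nothing about them.
    free = [kc for kc in range(x_max, x_min - 1, -1) if kc not in used]
    if len(overflow) > len(free):
        raise ValueError("not enough free X keycodes on this device")
    for code, keycode in zip(overflow, free):
        mapping[code] = keycode
    return mapping


if __name__ == "__main__":
    # A device with a volume-up key (115) and one key above the limit (248).
    print(remap_high_keycodes([115, 248]))
```

The resulting table is only valid for this one device at this one moment. The keysym for the remapped key has to be moved to the new slot as well, and anything that later loads the default keymap will silently undo that.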

As usual, none of this is technically unfixable. You could figure out a protocol extension that lets drivers talk to the server and the clients to notify them of remapped keycodes. This of course needs to be added to evdev, the server, libX11, probably xkbcomp and libxkbcommon and of course to all desktop environments that set the layout. To write the patches you need a deep understanding of XKB which would definitely make your skillset a rare one, probably make you quite employable and possibly put you on the fast track for your nearest mental institution. XKB and happiness don't usually go together, but at least the jackets will keep you warm.

Because of the above, we go with the simple answer: "X can't handle keycodes over 255"

November 29, 2014
I'm not sure

but if hd0;u]; means anything to anyone from displaylink, or is the first unencrypted bytes they send, then oops.

Looks like I have some work to do next week.
November 21, 2014
So someone leaked 2011 era PowerVR SGX microcode and user space... And now everyone is pissing themselves like a bunch of overexcited puppies...

I've been fed links from several sides now, and i cannot believe how short-sighted and irresponsible people are, including a few people who should know better.

STOP TELLING PEOPLE TO LOOK AT PROPRIETARY CODE.

Having gotten that out of the way, I am writing this blog to put everyone straight and stop the nonsense, and to calmly explain why this leak is not a good thing.

Before i go any further, IANAL, but i clearly do seem to tread much more carefully on these issues than most. As always, feel free to debunk what i write here in the comments, especially you actual lawyers, especially those lawyers in the .EU.

LIBV and the PVR.

Let me just, once again, state my position towards the PowerVR.

I have worked on the Nokia N9, primarily on the SGX kernel side (which is of course GPLed), but i also touched both the microcode and userspace. So I have seen the code, worked with it, and i am very much burned on it. Unless IMG itself gives me permission to do so, i am not allowed to contribute to any open source driver for the PowerVR. I personally also include the RGX, and not just SGX, in that list, as i believe that some things do remain the same. The same is true for Rob Clark, who worked with PowerVR when at Texas Instruments.

This is, however, not why i try to keep people from REing the PowerVR.

The reason why i tell people to stay away is because of the design of the PowerVR and its driver stack: PVR is heavily microcode driven, and this microcode is loaded through the kernel from userspace. The microcode communicates directly with the kernel through some shared structs, which change depending on build options. There are sometimes extensive changes to both the microcode, kernel and userspace code depending on the revision of the SGX, customer project and build options, and sometimes the whole stack is affected, from microcode to userspace. This makes the powervr a very unstable platform: change one component, and the whole house of cards comes tumbling down. A nightmare for system integrators, but also bad news for people looking to provide a free driver for this platform. As if the murderous release cycle of mobile hardware wasn't bad enough of a moving target already.

The logic behind me attempting to keep people away from REing the PowerVR is, at one end, the attempt to focus the available decent developers on more rewarding GPUs and to keep people from burning out on something as shaky as the PowerVR. On the other hand, by getting everyone working on the other GPUs, we are slowly forcing the whole market open, singling out Imagination Technologies. At one point, IMG will be forced to either do this work itself, and/or to directly support open sourcing themselves, or to remain the black sheep forever.

None of the above means that I am against an open source driver for PVR, quite the opposite, I just find it more productive to work on the other GPUs and wait this one out.

Given their bad reputation with system integrators, their shaky driver/microcode design, and the fact that they are in a cut throat competition with ARM, Imagination Technologies actually has the most to gain from an open source driver. It would at least take some of the pain out of that shaky microcode/kernel/userspace combination, and make a lot of people's lives a lot easier.

This is not open source software.

Just because someone leaked this code, it has not magically become free software.

It is still just as proprietary as before. You cannot use this code in any open source project, or at all, the license on it applies just as strongly as before. If you download it, or distribute it, or whatever other actions forbidden in the license, you are just as accountable as the other parties in the chain.

So for all you kiddies who now think "Great, finally an open driver for PowerVR, let's go hack our way into celebrity", you couldn't be more wrong. At best, you just tainted yourself.

But the repercussions go further than that. The simple fact that this code has been leaked has cast a very dark shadow on any future open source project that might involve the powervr. So be glad that we have been pretty good at dissuading people from wasting their time on powervr, and that this leak didn't end up spoiling many man-years of work.

Why? Well, let's say that there was an advanced and active PowerVR reverse engineering project. Naturally, the contributors would not be able to look at the leaked code. But it goes further than that. Say that you are the project maintainer of such a reverse engineered driver, how do you deal with patches that come in from now on? Are you sure that they are not taken more or less directly from the leaked driver? How do you prove this?

Your fun project just turned from a relatively straightforward REing project to a project where patches absolutely need to be signed-off, and where you need to establish some severe trust into your contributors. That's going to slow you down massively.

But even if you can manage to keep your code clean, the stigma will remain. Even if lawyers do not get involved, you will spend a lot of time preparing yourself for such an eventuality. Not a fun position to be in.

The manpower issue.

I know that any clued and motivated individual can achieve anything. I also know that really clued people, who are dedicated and can work structuredly are extremely rare and that their time is unbelievably valuable.

With the exception of Rob, who is allowed to spend some of his redhat time on the freedreno driver, none of the people working on the open ARM GPU drivers have any support. Working on such a long haul project without support either limits the amount of time available for it, or severely reduces the living standard of the person doing so, or anywhere between those extremes. If you then factor in that there are only a handful of people working on a handful of drivers, you get individuals spending several man-years mostly on their own for themselves.

If you are wondering why ARM GPU drivers are not moving faster, then this is why. There are just a limited few clued individuals who are doing this, and they are on their own, and they have been at it for years by now. Think of that the next time you want to ask "Is it done yet?".

This is why I tried to keep people from REing the powerVR, what little talent and stamina there is can be better put to use on more straightforward GPUs. We have a hard enough time as it is already.

Less work? More work!

If you think that this leaked driver takes away much of the hard work of reverse engineering and makes writing an open source driver easy, you couldn't be more wrong.

This leak means that there is no other option left apart from doing a full clean room. And there need to be very visible and fully transparent processes in place in a case like this. Your one man memory dumper/bit-poker/driver writer just became at least two persons. One of them gets to spend his time ogling bad code (which proprietary code usually ends up being), trying to make sense of it, and then trying to write extensive documentation about it (without being able to test his findings much). The other gets to write code from that documentation, but also little more. Both sides are very much forbidden to go back and forth between those two positions.

As if we ARM GPU driver developers didn't have enough frustration to deal with, and the PVR stack isn't bad enough already, the whole situation just got much much worse.

So for all those who think that now the floodgates are open for PowerVR, don't hold your breath. And to those who now suddenly want to create an open source driver for the powervr, i ask: you and what army?

For all those who are rinsing out their shoes: ask yourself how many unsupported man-years you will honestly be able to dedicate to this, and whether there will be enough individuals who can honestly claim the same. Then pick your boring task, and then stick to it. Forever. And hope that the others also stick to their side of this bargain.

LOL, http://goo.gl/kbBEPX

What have we come to?

The leaked source code of a proprietary graphics driver is not something you should be spreading amongst your friends for "Lolz", especially not amongst your open source graphics driver developing friends.

I personally am not too bothered about the actual content of this one, the link names were clear about what it was, and I had seen it before. I was burned before, so i quickly delved in to verify that this was indeed SGX userspace. In some cases, with the links being posted publicly, i then quickly moved on to dissuade people from looking at it, for what limited success that could have had.

But what would i have done if this were Mali code, and the content was not clear from the link name? I got lucky here.

I am horrified about the lack of responsibility of a lot of people. These are not some cat pictures, or some nude celebrities. This is code that forbids people from writing graphics drivers.

But even if you haven't looked at this code yet, most of the damage has been done. A reverse engineered driver for powervr SGX will now probably never happen. Heck, i just got told that someone even went and posted the links to the powerVR REing mailinglist (which luckily has never seen much traffic). I wonder how that went:
Hi,
Are you the guys doing the open source driver for PowerVR SGX?
I have some proprietary code here that could help you speed things along.
Good luck!

So for the person who put this up on github: thank you so much. I hope that you at least didn't use your real name. I cannot imagine that any employer would want to hire anyone who acts this irresponsibly. Your inability to read licenses means that you cannot be trusted with either proprietary code or open source code, as you seem unable to distinguish between them. Well done.

The real culprit is of course LG, for crazily sticking the GPL on this. But because one party "accidentally" sticks a GPL on that doesn't make it GPL, and that doesn't suddenly give you the right to repeat the mistake.

Last month's ISA release.

And now for something slightly different...

Just over a month ago, there was the announcement about Imagination Technologies' new SDK. Supposedly, at least according to the phoronix article, Imagination Technologies made the ISA (instruction set architecture) of the RGX available in it.

This was not true.

What was released was the assembly language for the PowerVR shaders, which then needs to be assembled by the IMG RGX assembler to provide the actual shader binaries. This is definitely not the ISA, and I do not know whether it was Alexandru Voica (an Imagination marketing guy who suddenly became active on the phoronix forums, and who i believe to be the originator of this story) or the author of the article on Phoronix who made this error. I do not think that this was bad intent though, just that something got lost in translation.

The release of the assembly language is very nice though. It makes it relatively straightforward to match the assembly to the machine code, and takes away most of the pain of ISA REing.

Despite the botched message, this was a big step forwards for ARM GPU makers; Imagination delivered what its customers need (in this case, the ability to manually tune some shaders), and in the process it also made it easier for potential REers to create an open source driver.

Looking forward.

Between the leak, the assembly release, and the market position Imagination Technologies is in, things are looking up though.

Whereas the leak made a credible open source reverse engineering project horribly impractical and very unlikely, it did remove some of the incentive for IMG to not support an open source project themselves. I doubt that IMG will now try to bullshit us with the inane patent excuse. The (not too credible) potential damage has been done here already now.

With the assembly language release, a lot of the inner workings and the optimization of the RGX shaders was also made public. So there too the barrier has disappeared.

Given the structure of the IMG graphics driver stack, system integrators have a limited level of satisfaction with IMG. I really doubt that this has improved too much since my Nokia days. Going open source now, by actively supporting some clued open source developers and by providing extensive NDA-free documentation, should not pose much of a legal or political challenge anymore, and could massively improve the perception of Imagination Technologies, and their hardware.

So go for it, IMG. No-one else is going to do this for you, and you can only gain from it!

With OpenStack embracing the Tooz library more and more over the past year, I think it's a good start to write a bit about it.

A bit of history

A little more than a year ago, with my colleague Yassine Lamgarchal and others at eNovance, we investigated how to solve a problem often encountered inside OpenStack: synchronization of multiple distributed workers. And while many people in our ecosystem continue to drive development by adding new bells and whistles, we made a point of solving new problems with a generic solution able to address the technical debt at the same time.

Yassine wrote the first ideas of what should be the group membership service that was needed for OpenStack, identifying several projects that could make use of this. I presented this concept at the OpenStack Summit in Hong-Kong during an Oslo session. It turned out that the idea was well-received, and the week following the summit we started the tooz project on StackForge.

Goals

Tooz is a Python library that provides a coordination API. Its primary goal is to handle groups and membership of these groups in distributed systems.

Tooz also provides another useful feature which is distributed locking. This allows distributed nodes to acquire and release locks in order to synchronize themselves (for example to access a shared resource).

The architecture

If you are familiar with distributed systems, you might be thinking that there are a lot of solutions already available to solve these issues: ZooKeeper, the Raft consensus algorithm or even Redis for example.

You'll be thrilled to learn that Tooz is not the result of the NIH syndrome, but is an abstraction layer on top of all these solutions. It uses drivers to provide the real functionalities behind, and does not try to do anything fancy.

Not all drivers have the same amount of functionality or robustness, but depending on your environment, any available driver might suffice. Like most of OpenStack, we let the deployers/operators/developers choose whichever backend they want to use, informing them of the potential trade-offs they will make.

So far, Tooz provides drivers based on:

All drivers are distributed across processes. Some can be distributed across the network (ZooKeeper, memcached, redis…) and some are only available on the same host (IPC).

Also note that the Tooz API is completely asynchronous, allowing it to be more efficient, and potentially included in an event loop.

Features

Group membership

Tooz provides an API to manage group membership. The basic operations provided are: the creation of a group, the ability to join it, leave it and list its members. It's also possible to be notified as soon as a member joins or leaves a group.

Leader election

Each group can have a leader elected. Each member can decide if it wants to run for the election. If the leader disappears, another one is elected from the list of current candidates. It's possible to be notified of the election result and to retrieve the leader of a group at any moment.
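The membership and election behaviour described above can be modelled with a small in-process toy. To be clear, this is NOT the Tooz API: ToyGroup, its methods, and the "first remaining candidate wins" election policy are all assumptions made for illustration, and a real driver does this across processes or hosts.

```python
class ToyGroup:
    """In-process stand-in for a coordination group; NOT the Tooz API."""

    def __init__(self):
        self.members = []      # members in join order
        self.candidates = []   # members that chose to run for election
        self.leader = None
        self.watchers = []     # callbacks called as fn(event, member_id)

    def watch(self, callback):
        """Register a callback for 'joined', 'left' and 'elected' events."""
        self.watchers.append(callback)

    def _notify(self, event, member_id):
        for callback in self.watchers:
            callback(event, member_id)

    def _elect_if_needed(self):
        # Assumed policy: the longest-standing candidate becomes leader.
        if self.leader is None and self.candidates:
            self.leader = self.candidates[0]
            self._notify("elected", self.leader)

    def join(self, member_id, run_for_election=False):
        if member_id in self.members:
            raise ValueError("%r already joined" % (member_id,))
        self.members.append(member_id)
        self._notify("joined", member_id)
        if run_for_election:
            self.candidates.append(member_id)
            self._elect_if_needed()

    def leave(self, member_id):
        self.members.remove(member_id)
        if member_id in self.candidates:
            self.candidates.remove(member_id)
        self._notify("left", member_id)
        if self.leader == member_id:
            # The leader disappeared: pick another one from the candidates.
            self.leader = None
            self._elect_if_needed()


if __name__ == "__main__":
    group = ToyGroup()
    group.watch(lambda event, member: print(event, member))
    group.join("worker-1", run_for_election=True)
    group.join("worker-2")
    group.join("worker-3", run_for_election=True)
    group.leave("worker-1")
    print("leader is now", group.leader)
```

In this toy, a member that never ran for election (worker-2) is never chosen, even when it is the only one left alongside other candidates.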

Distributed locking

When trying to synchronize several workers in a distributed environment, you may need a way to lock access to some resources. That's what a distributed lock can help you with.

Adoption in OpenStack

Ceilometer is the first project in OpenStack to use Tooz. It has replaced part of the old alarm distribution system, where RPC was used to detect active alarm evaluator workers. The group membership feature of Tooz was leveraged by Ceilometer to coordinate between alarm evaluator workers.

Another new feature that is part of the Juno release of Ceilometer is the distribution of polling tasks of the central agent among multiple workers. There's again a group membership issue to know which nodes are online and available to receive polling tasks, so Tooz is also being used here.

The Oslo team has accepted the adoption of Tooz during this release cycle. That means that it will be maintained by more developers, and will be part of the OpenStack release process.

This opens the door to push Tooz further in OpenStack. Our next candidate would be to write a service group driver for Nova.

The complete documentation for Tooz is available online and has examples for the various features described here, go read it if you're curious and adventurous!

November 19, 2014

Debian's latest round of angry mailing list threads have been about some combination of init systems, future direction and project governance. The details aren't particularly important here, and pretty much everything worthwhile in favour of or against each position has already been said several times, but I think this bit is important enough that it bears repeating: the reason I voted "we didn't need this General Resolution" ahead of the other options is that I hope we can continue to use our normal technical and decision-making processes to make Debian 8 the best possible OS distribution for everyone. That includes people who like systemd, people who dislike systemd, people who don't care either way and just want the OS to work, and everyone in between those extremes.

I think that works best when we do things, and least well when a lot of time and energy get diverted into talking about doing things. I've been trying to do my small part of the former by fixing some release-critical bugs so we can release Debian 8. Please join in, and remember to write good unblock requests so our hard-working release team can get through them in a finite time. I realise not everyone will agree with my idea of which bugs, which features and which combinations of packages are highest-priority; that's fine, there are plenty of bugs to go round!

Regarding init systems specifically, Debian 'jessie' currently works with at least systemd-sysv or sysvinit-core as pid 1 (probably also Upstart, but I haven't tried that) and I'm confident that Debian developers won't let either of those regress before it's released as Debian 8.

I expect the freeze for Debian 'stretch' (presumably Debian 9) to be a couple of years away, so it seems premature to say anything about what will or won't be supported there; that depends on what upstream developers do, and what Debian developers do, between now and then. What I can predict is that the components that get useful bug reports, active maintenance, thorough testing, careful review, and similar help from contributors will work better than the things that don't; so if you like a component and want it to be supported in Debian, you can help by, well, supporting it.


PS. If you want the Debian 8 installer to leave you running sysvinit as pid 1 after the first reboot, here's a suitable incantation to add to the kernel command-line in the installer's bootloader. This one certainly worked when KiBi asked for testing a few days ago:

preseed/late_command="in-target apt-get install -y sysvinit-core"

I think that corresponds to this line in a preseeding file, if you use those:

d-i preseed/late_command string in-target apt-get install -y sysvinit-core

A similar apt-get command, without the in-target prefix, should work on an installed system that already has systemd-sysv. Depending on other installed software, you might need to add systemd-shim to the command line too, but when I tried it, apt-get was able to work that out for itself.

If you use aptitude instead of apt-get, double-check what it will do before saying "yes" to this particular switchover: its heuristic for resolving conflicts seems to be rather more trigger-happy about removing packages than the one in apt-get.

November 16, 2014

Apparently, people care when you, as a privileged person (white, male, long-time Debian Developer) throw in the towel because the amount of crap thrown your way just becomes too much. I guess that's good, both because it gives me a soap box for a short while, but also because if enough people talk about how poisonous the well that Debian is has become, we can fix it.

This morning, I resigned as a member of the systemd maintainer team. I then proceeded to leave the relevant IRC channels and announced this on twitter. The responses I've gotten have almost all been heartwarming. People have generally been offering hugs, saying thanks for the work put into systemd in Debian and so on. I've greatly appreciated those (and I've been getting those before I resigned too, so this isn't just a response to that). I feel bad about leaving the rest of the team, they're a great bunch: competent, caring, funny, wonderful people. On the other hand, at some point I had to draw a line and say "no further".

Debian and its various maintainer teams are a bunch of tribes (with possibly Debian itself being a supertribe). Unlike many other situations, you can be part of multiple tribes. I'm still a member of the DSA tribe for instance. Leaving pkg-systemd means leaving one of my tribes. That hurts. It hurts even more because it feels like a forced exit rather than because I've lost interest or been distracted by other shiny things for long enough that you don't really feel like part of a tribe. That happened with me with debian-installer. It was my baby for a while (with a then quite small team), then a bunch of real life things interfered and other people picked it up and ran with it and made it greater and more fantastic than before. I kinda lost touch, and while it's still dear to me, I no longer identify as part of the debian-boot tribe.

Now, how did I, standing stout and tall, get forced out of my tribe? I've been a DD for almost 14 years, I should be able to weather any storm, shouldn't I? It turns out that no, the mountain does get worn down by the rain. It's not a single hurtful comment here and there. There's a constant drum about this all being some sort of conspiracy and there are sometimes flares where people wish people involved in systemd would be run over by a bus or just accusations of incompetence.

Our code of conduct says, "assume good faith". If you ever find yourself not doing that, step back, breathe. See if there's a reasonable explanation for why somebody is saying something or behaving in a way that doesn't make sense to you. It might be as simple as your native tongue being English and theirs being something else.

If you do genuinely disagree with somebody (something which is entirely fine), try not to escalate, even if the stakes are high. Examples from the last year include talking about this as a war and talking about "increasingly bitter rear-guard battles". By using and accepting this terminology, we, as a project, poison ourselves. Sam Hartman puts this better than me:

I'm hoping that we can all take a few minutes to gain empathy for those who disagree with us. Then I'm hoping we can use that understanding to reassure them that they are valued and respected and their concerns considered even when we end up strongly disagreeing with them or valuing different things.

I'd be lying if I said I didn't ever feel the urge to demonise my opponents in discussions. That they're worse, as people, than I am. However, it is imperative to never give in to this, since doing that will diminish us as humans and make the entire project poorer. Civil disagreements with reasonable discussions lead to better technical outcomes, happier humans and a healthier project.

November 15, 2014
A couple weeks ago, qualcomm (quic) surprised some by sending kernel patches to enable the new adreno 4xx family of GPUs found in their latest SoCs.  Such as the apq8084 powering my ifc6540 board with the a420 GPU.  Note that qualcomm had already sent patches to enable display support for apq8084, merged in 3.17.  And I'm looking forward to more good things from their upstream efforts in the future.

So in the last weeks, in between various other kernel work (atomic-helper conversion and a few other misc things for 3.19) and RHEL stuff, I've managed to bang out initial gallium support for a4xx.  There are still plenty of missing things, or stuff hard-coded, etc.  But yesterday I managed to get textures working, and fix RGBA/BGRA confusion, so now enough works for gears and maybe about half of glmark2:



I've intentionally pushed it (just now) after the mesa 10.4 branch point, since it isn't quite ready to be enabled by default in distro mesa builds.  When it gets to the point of at least being able to run a desktop environment (gnome-shell / compiz / etc), I may backport to 10.4.  But there is still a lot of work to do.  The good news is that so far it seems quite fast (and that is without hw binning or XA yet even!)

November 11, 2014

Container Integration

For a while now, containers have been one of the hot topics on Linux. Container managers such as libvirt-lxc, LXC or Docker are widely known and used these days. In this blog story I want to shed some light on systemd's integration points with container managers, to allow seamless management of services across container boundaries.

We'll focus on OS containers here, i.e. the case where an init system runs inside the container, and the container hence in most ways appears like an independent system of its own. Much of what I describe here is available on pretty much any container manager that implements the logic described here, including libvirt-lxc. However, to make things easy we'll focus on systemd-nspawn, the mini-container manager that is shipped with systemd itself. systemd-nspawn uses the same kernel interfaces as the other container managers, but is less flexible, as it is designed to be a container manager that is as simple to use as possible and "just works", rather than trying to be a generic tool you can configure in every low-level detail. We use systemd-nspawn extensively when developing systemd.

Anyway, so let's get started with our run-through. Let's start by creating a Fedora container tree in a subdirectory:

# yum -y --releasever=20 --nogpg --installroot=/srv/mycontainer --disablerepo='*' --enablerepo=fedora install systemd passwd yum fedora-release vim-minimal

This downloads a minimal Fedora system and installs it in /srv/mycontainer. This command line is Fedora-specific, but most distributions provide similar functionality in one way or another. The examples section in the systemd-nspawn(1) man page contains a list of the various command lines for other distributions.

We now have the new container installed, let's set an initial root password:

# systemd-nspawn -D /srv/mycontainer
Spawning container mycontainer on /srv/mycontainer
Press ^] three times within 1s to kill container.
-bash-4.2# passwd
Changing password for user root.
New password:
Retype new password:
passwd: all authentication tokens updated successfully.
-bash-4.2# ^D
Container mycontainer exited successfully.
#

We use systemd-nspawn here to get a shell in the container, and then use passwd to set the root password. After that the initial setup is done, hence let's boot it up and log in as root with our new password:

$ systemd-nspawn -D /srv/mycontainer -b
Spawning container mycontainer on /srv/mycontainer.
Press ^] three times within 1s to kill container.
systemd 208 running in system mode. (+PAM +LIBWRAP +AUDIT +SELINUX +IMA +SYSVINIT +LIBCRYPTSETUP +GCRYPT +ACL +XZ)
Detected virtualization 'systemd-nspawn'.

Welcome to Fedora 20 (Heisenbug)!

[  OK  ] Reached target Remote File Systems.
[  OK  ] Created slice Root Slice.
[  OK  ] Created slice User and Session Slice.
[  OK  ] Created slice System Slice.
[  OK  ] Created slice system-getty.slice.
[  OK  ] Reached target Slices.
[  OK  ] Listening on Delayed Shutdown Socket.
[  OK  ] Listening on /dev/initctl Compatibility Named Pipe.
[  OK  ] Listening on Journal Socket.
         Starting Journal Service...
[  OK  ] Started Journal Service.
[  OK  ] Reached target Paths.
         Mounting Debug File System...
         Mounting Configuration File System...
         Mounting FUSE Control File System...
         Starting Create static device nodes in /dev...
         Mounting POSIX Message Queue File System...
         Mounting Huge Pages File System...
[  OK  ] Reached target Encrypted Volumes.
[  OK  ] Reached target Swap.
         Mounting Temporary Directory...
         Starting Load/Save Random Seed...
[  OK  ] Mounted Configuration File System.
[  OK  ] Mounted FUSE Control File System.
[  OK  ] Mounted Temporary Directory.
[  OK  ] Mounted POSIX Message Queue File System.
[  OK  ] Mounted Debug File System.
[  OK  ] Mounted Huge Pages File System.
[  OK  ] Started Load/Save Random Seed.
[  OK  ] Started Create static device nodes in /dev.
[  OK  ] Reached target Local File Systems (Pre).
[  OK  ] Reached target Local File Systems.
         Starting Trigger Flushing of Journal to Persistent Storage...
         Starting Recreate Volatile Files and Directories...
[  OK  ] Started Recreate Volatile Files and Directories.
         Starting Update UTMP about System Reboot/Shutdown...
[  OK  ] Started Trigger Flushing of Journal to Persistent Storage.
[  OK  ] Started Update UTMP about System Reboot/Shutdown.
[  OK  ] Reached target System Initialization.
[  OK  ] Reached target Timers.
[  OK  ] Listening on D-Bus System Message Bus Socket.
[  OK  ] Reached target Sockets.
[  OK  ] Reached target Basic System.
         Starting Login Service...
         Starting Permit User Sessions...
         Starting D-Bus System Message Bus...
[  OK  ] Started D-Bus System Message Bus.
         Starting Cleanup of Temporary Directories...
[  OK  ] Started Cleanup of Temporary Directories.
[  OK  ] Started Permit User Sessions.
         Starting Console Getty...
[  OK  ] Started Console Getty.
[  OK  ] Reached target Login Prompts.
[  OK  ] Started Login Service.
[  OK  ] Reached target Multi-User System.
[  OK  ] Reached target Graphical Interface.

Fedora release 20 (Heisenbug)
Kernel 3.18.0-0.rc4.git0.1.fc22.x86_64 on an x86_64 (console)

mycontainer login: root
Password:
-bash-4.2#

Now we have everything ready to play around with the container integration of systemd. Let's have a look at the first tool, machinectl. When run without parameters it shows a list of all locally running containers:

$ machinectl
MACHINE                          CONTAINER SERVICE
mycontainer                      container nspawn

1 machines listed.

The "status" subcommand shows details about the container:

$ machinectl status mycontainer
mycontainer:
       Since: Mi 2014-11-12 16:47:19 CET; 51s ago
      Leader: 5374 (systemd)
     Service: nspawn; class container
        Root: /srv/mycontainer
     Address: 192.168.178.38
              10.36.6.162
              fd00::523f:56ff:fe00:4994
              fe80::523f:56ff:fe00:4994
          OS: Fedora 20 (Heisenbug)
        Unit: machine-mycontainer.scope
              ├─5374 /usr/lib/systemd/systemd
              └─system.slice
                ├─dbus.service
                │ └─5414 /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-act...
                ├─systemd-journald.service
                │ └─5383 /usr/lib/systemd/systemd-journald
                ├─systemd-logind.service
                │ └─5411 /usr/lib/systemd/systemd-logind
                └─console-getty.service
                  └─5416 /sbin/agetty --noclear -s console 115200 38400 9600

With this we see some interesting information about the container, including its control group tree (with processes), IP addresses and root directory.

The "login" subcommand gets us a new login shell in the container:

# machinectl login mycontainer
Connected to container mycontainer. Press ^] three times within 1s to exit session.

Fedora release 20 (Heisenbug)
Kernel 3.18.0-0.rc4.git0.1.fc22.x86_64 on an x86_64 (pts/0)

mycontainer login:

The "reboot" subcommand reboots the container:

# machinectl reboot mycontainer

The "poweroff" subcommand powers the container off:

# machinectl poweroff mycontainer

So much about the machinectl tool. The tool knows a couple more commands; please check the man page for details. Note again that even though we use systemd-nspawn as container manager here, the concepts apply to any container manager that implements the logic described here, including libvirt-lxc for example.

machinectl is not the only tool that is useful in conjunction with containers. Many of systemd's own tools have been updated to explicitly support containers too! Let's try this (after starting the container up again first, repeating the systemd-nspawn command from above.):

# hostnamectl -M mycontainer set-hostname "wuff"

This uses hostnamectl(1) on the local container and sets its hostname.

Similarly, many other tools have been updated for connecting to local containers. Here's systemctl(1)'s -M switch in action:

# systemctl -M mycontainer
UNIT                                 LOAD   ACTIVE SUB       DESCRIPTION
-.mount                              loaded active mounted   /
dev-hugepages.mount                  loaded active mounted   Huge Pages File System
dev-mqueue.mount                     loaded active mounted   POSIX Message Queue File System
proc-sys-kernel-random-boot_id.mount loaded active mounted   /proc/sys/kernel/random/boot_id
[...]
time-sync.target                     loaded active active    System Time Synchronized
timers.target                        loaded active active    Timers
systemd-tmpfiles-clean.timer         loaded active waiting   Daily Cleanup of Temporary Directories

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

49 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.

As expected, this shows the list of active units on the specified container, not the host. (Output is shortened here, the blog story is already getting too long).

Let's use this to restart a service within our container:

# systemctl -M mycontainer restart systemd-resolved.service

systemctl has more container support though than just the -M switch. With the -r switch it shows the units running on the host, plus all units of all local, running containers:

# systemctl -r
UNIT                                        LOAD   ACTIVE SUB       DESCRIPTION
boot.automount                              loaded active waiting   EFI System Partition Automount
proc-sys-fs-binfmt_misc.automount           loaded active waiting   Arbitrary Executable File Formats File Syst
sys-devices-pci0000:00-0000:00:02.0-drm-card0-card0\x2dLVDS\x2d1-intel_backlight.device loaded active plugged   /sys/devices/pci0000:00/0000:00:02.0/drm/ca
[...]
timers.target                                                                                       loaded active active    Timers
mandb.timer                                                                                         loaded active waiting   Daily man-db cache update
systemd-tmpfiles-clean.timer                                                                        loaded active waiting   Daily Cleanup of Temporary Directories
mycontainer:-.mount                                                                                 loaded active mounted   /
mycontainer:dev-hugepages.mount                                                                     loaded active mounted   Huge Pages File System
mycontainer:dev-mqueue.mount                                                                        loaded active mounted   POSIX Message Queue File System
[...]
mycontainer:time-sync.target                                                                        loaded active active    System Time Synchronized
mycontainer:timers.target                                                                           loaded active active    Timers
mycontainer:systemd-tmpfiles-clean.timer                                                            loaded active waiting   Daily Cleanup of Temporary Directories

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

191 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.

We can see here first the units of the host, then followed by the units of the one container we have currently running. The units of the containers are prefixed with the container name, and a colon (":"). (The output is shortened again for brevity's sake.)
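When post-processing this listing in a script, note that host unit names may contain colons themselves (the PCI device unit above does), so splitting on the first colon alone is not safe. A minimal Python sketch, assuming the set of container names is already known (for example from systemctl list-machines); the helper name split_unit is made up for illustration:

```python
def split_unit(name, machines):
    """Split a `systemctl -r` unit name into (machine, unit).

    `machines` is the set of known container names. Host units are
    returned with machine set to None.
    """
    machine, sep, unit = name.partition(":")
    # Only treat the prefix as a container name if we know that container;
    # device units such as "sys-devices-pci0000:00-..." also contain colons.
    if sep and machine in machines:
        return machine, unit
    return None, name


known = {"mycontainer"}
print(split_unit("mycontainer:timers.target", known))
print(split_unit("sys-devices-pci0000:00-0000:00:02.0-drm-card0.device", known))
```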

The list-machines subcommand of systemctl shows a list of all running containers, querying the system managers within the containers about system state and health. More specifically it shows if containers are properly booted up, or if there are any failed services:

# systemctl list-machines
NAME         STATE   FAILED JOBS
delta (host) running      0    0
mycontainer  running      0    0
miau         degraded     1    0
waldi        running      0    0

4 machines listed.

To make things more interesting we have started two more containers in parallel. One of them has a failed service, which results in the machine state being degraded.
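A monitoring script could pick the unhealthy machines out of that table. The following Python sketch works on a copy of the output shown above rather than calling systemctl, and assumes the four-column layout stays as printed; the function names are made up for illustration:

```python
SAMPLE = """\
NAME         STATE   FAILED JOBS
delta (host) running      0    0
mycontainer  running      0    0
miau         degraded     1    0
waldi        running      0    0

4 machines listed.
"""


def parse_list_machines(text):
    """Parse the table printed by `systemctl list-machines`."""
    rows = []
    for line in text.splitlines()[1:]:  # skip the header row
        if not line.strip():            # a blank line ends the table
            break
        # Split from the right: the name column may contain a space,
        # as in "delta (host)".
        name, state, failed, jobs = line.rsplit(None, 3)
        rows.append({"name": name, "state": state,
                     "failed": int(failed), "jobs": int(jobs)})
    return rows


def unhealthy(rows):
    """Return the names of machines that are not running cleanly."""
    return [r["name"] for r in rows
            if r["state"] != "running" or r["failed"] > 0]


print(unhealthy(parse_list_machines(SAMPLE)))
```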

Let's have a look at journalctl(1)'s container support. It too supports -M to show the logs of a specific container:

# journalctl -M mycontainer -n 8
Nov 12 16:51:13 wuff systemd[1]: Starting Graphical Interface.
Nov 12 16:51:13 wuff systemd[1]: Reached target Graphical Interface.
Nov 12 16:51:13 wuff systemd[1]: Starting Update UTMP about System Runlevel Changes...
Nov 12 16:51:13 wuff systemd[1]: Started Stop Read-Ahead Data Collection 10s After Completed Startup.
Nov 12 16:51:13 wuff systemd[1]: Started Update UTMP about System Runlevel Changes.
Nov 12 16:51:13 wuff systemd[1]: Startup finished in 399ms.
Nov 12 16:51:13 wuff sshd[35]: Server listening on 0.0.0.0 port 24.
Nov 12 16:51:13 wuff sshd[35]: Server listening on :: port 24.

However, it also supports -m to show the combined log stream of the host and all local containers:

# journalctl -m -e

(Let's skip the output here completely, I figure you can extrapolate how this looks.)
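For scripted analysis of a container's log, journalctl's JSON output mode (-o json, one object per line) can be combined with -M. A minimal Python sketch follows; it runs on two hand-written sample entries modelled on the output above rather than on real journal data, and the helper name messages_from is made up for illustration:

```python
import json

# Hand-written sample in the shape of `journalctl -M mycontainer -o json`;
# real entries carry many more fields.
SAMPLE = "\n".join([
    '{"_HOSTNAME": "wuff", "SYSLOG_IDENTIFIER": "systemd", "_PID": "1", '
    '"MESSAGE": "Reached target Graphical Interface."}',
    '{"_HOSTNAME": "wuff", "SYSLOG_IDENTIFIER": "sshd", "_PID": "35", '
    '"MESSAGE": "Server listening on :: port 24."}',
])


def messages_from(text, identifier):
    """Return the MESSAGE field of every entry logged by `identifier`."""
    found = []
    for line in text.splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)  # one JSON object per line
        if entry.get("SYSLOG_IDENTIFIER") == identifier:
            found.append(entry["MESSAGE"])
    return found


print(messages_from(SAMPLE, "sshd"))
```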

But it's not only systemd's own tools that understand container support these days, procps sports support for it, too:

# ps -eo pid,machine,args
 PID MACHINE                         COMMAND
   1 -                               /usr/lib/systemd/systemd --switched-root --system --deserialize 20
[...]
2915 -                               emacs contents/projects/containers.md
3403 -                               [kworker/u16:7]
3415 -                               [kworker/u16:9]
4501 -                               /usr/libexec/nm-vpnc-service
4519 -                               /usr/sbin/vpnc --non-inter --no-detach --pid-file /var/run/NetworkManager/nm-vpnc-bfda8671-f025-4812-a66b-362eb12e7f13.pid -
4749 -                               /usr/libexec/dconf-service
4980 -                               /usr/lib/systemd/systemd-resolved
5006 -                               /usr/lib64/firefox/firefox
5168 -                               [kworker/u16:0]
5192 -                               [kworker/u16:4]
5193 -                               [kworker/u16:5]
5497 -                               [kworker/u16:1]
5591 -                               [kworker/u16:8]
5711 -                               sudo -s
5715 -                               /bin/bash
5749 -                               /home/lennart/projects/systemd/systemd-nspawn -D /srv/mycontainer -b
5750 mycontainer                     /usr/lib/systemd/systemd
5799 mycontainer                     /usr/lib/systemd/systemd-journald
5862 mycontainer                     /usr/lib/systemd/systemd-logind
5863 mycontainer                     /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
5868 mycontainer                     /sbin/agetty --noclear --keep-baud console 115200 38400 9600 vt102
5871 mycontainer                     /usr/sbin/sshd -D
6527 mycontainer                     /usr/lib/systemd/systemd-resolved
[...]

This shows a process list (shortened). The second column shows the container a process belongs to. All processes shown with "-" belong to the host itself.
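If you want that list grouped per container, the three columns are easy to split in a script. A small Python sketch, working on a few lines copied from the output above instead of invoking ps; the function name by_machine is made up for illustration:

```python
from collections import defaultdict

SAMPLE = """\
  PID MACHINE                         COMMAND
    1 -                               /usr/lib/systemd/systemd --switched-root --system --deserialize 20
 5749 -                               /home/lennart/projects/systemd/systemd-nspawn -D /srv/mycontainer -b
 5750 mycontainer                     /usr/lib/systemd/systemd
 5871 mycontainer                     /usr/sbin/sshd -D
"""


def by_machine(ps_output):
    """Group `ps -eo pid,machine,args` lines by the MACHINE column."""
    groups = defaultdict(list)
    for line in ps_output.splitlines()[1:]:  # skip the header row
        if not line.strip():
            continue
        # Split into at most three fields so the command line stays intact.
        pid, machine, args = line.split(None, 2)
        groups[machine].append((int(pid), args))
    return dict(groups)


for machine, procs in sorted(by_machine(SAMPLE).items()):
    print(machine, [pid for pid, _ in procs])
```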

But it doesn't stop there. The new "sd-bus" D-Bus client library we have been preparing in the systemd/kdbus context knows containers too. While you use sd_bus_open_system() to connect to your local host's system bus, sd_bus_open_system_container() may be used to connect to the system bus of any local container, so that you can execute bus methods on it.

sd-login.h and machined's bus interface provide a number of APIs to add container support to other programs too. They support enumeration of containers as well as retrieving the machine name from a PID and similar.

systemd-networkd also has support for containers. When run inside a container it will by default run a DHCP client and IPv4LL on any veth network interface named host0 (this interface is special under the logic described here). When run on the host, networkd will by default provide a DHCP server and IPv4LL on any veth network interface named ve- followed by a container name.

Let's have a look at one last facet of systemd's container integration: the hook-up with the name service switch. Recent systemd versions contain a new NSS module nss-mymachines that makes the names of all local containers resolvable via gethostbyname() and getaddrinfo(). This only applies to containers that run within their own network namespace. With the systemd-nspawn command shown above, however, the container shares the network configuration with the host; hence let's restart the container, this time with a virtual veth network link between host and container:

# machinectl poweroff mycontainer
# systemd-nspawn -D /srv/mycontainer --network-veth -b

Now, (assuming that networkd is used in the container and outside) we can already ping the container using its name, due to the simple magic of nss-mymachines:

# ping mycontainer
PING mycontainer (10.0.0.2) 56(84) bytes of data.
64 bytes from mycontainer (10.0.0.2): icmp_seq=1 ttl=64 time=0.124 ms
64 bytes from mycontainer (10.0.0.2): icmp_seq=2 ttl=64 time=0.078 ms

Of course, name resolution not only works with ping, it works with all other tools that use libc gethostbyname() or getaddrinfo() too, among them venerable ssh.
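The same holds for scripting languages whose resolver calls end up in libc. As a small Python sketch (the helper name resolve is made up for illustration), socket.getaddrinfo() goes through the name service switch, so on a host set up as above a container name should resolve just like any other host name:

```python
import socket


def resolve(name):
    """Return the sorted, de-duplicated addresses getaddrinfo() reports."""
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        return []  # name not known to any configured NSS module
    # Each entry is (family, type, proto, canonname, sockaddr);
    # the address string is the first element of sockaddr.
    return sorted({info[4][0] for info in infos})
```

Calling resolve("mycontainer") on the host would then be expected to return the container's address, and an empty list if nss-mymachines is not configured or the container is not running.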

And this is pretty much all I want to cover for now. We briefly touched a variety of integration points, and there's a lot more still if you look closely. We are working on even more container integration all the time, so expect more new features in this area with every systemd release.

Note that the whole machine concept is actually not limited to containers, but covers VMs too to a certain degree. However, the integration is not as close, as access to a VM's internals is not as easy as for containers, as it usually requires a network transport instead of allowing direct syscall access.

Anyway, I hope this is useful. For further details, please have a look at the linked man pages and other documentation.

Over the last couple of years, we've put some effort into better tooling for debugging input devices. Benjamin's hid-replay is an example of a low-level tool that's great for helping with kernel issues, while evemu is great for userspace debugging of evdev devices. evemu has recently gained better Python bindings, and today I'll explain how those make it really easy to analyse event recordings.

Requirement: evemu 2.1.0 or later

The input needed to make use of the Python bindings is either a device directly or an evemu recordings file. I find the latter a lot more interesting, it enables me to record multiple users/devices first, and then run the analysis later. So let's go with that:


$ sudo evemu-record > mouse-events.evemu
Available devices:
/dev/input/event0: Lid Switch
/dev/input/event1: Sleep Button
/dev/input/event2: Power Button
/dev/input/event3: AT Translated Set 2 keyboard
/dev/input/event4: SynPS/2 Synaptics TouchPad
/dev/input/event5: Lenovo Optical USB Mouse
Select the device event number [0-5]: 5

That writes every event from the mouse into the file until you terminate it with ctrl+c. It's just a text file, so feel free to leave it running for hours.

Now for the actual analysis. The simplest approach is to read all events from a file and print them:


#!/usr/bin/env python

import sys
import evemu

filename = sys.argv[1]
# create an evemu instance from the recording,
# create=False means don't create a uinput device from it
d = evemu.Device(filename, create=False)

for e in d.events():
    print(e)

That prints out all events, so the output should look identical to the input file's event list. The output you should see is something like:

E: 7.817877 0000 0000 0000 # ------------ SYN_REPORT (0) ----------
E: 7.821887 0002 0000 -001 # EV_REL / REL_X -1
E: 7.821903 0000 0000 0000 # ------------ SYN_REPORT (0) ----------
E: 7.825872 0002 0000 -001 # EV_REL / REL_X -1
E: 7.825879 0002 0001 -001 # EV_REL / REL_Y -1
E: 7.825883 0000 0000 0000 # ------------ SYN_REPORT (0) ----------
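Since the recording is plain text, a quick check is also possible without the bindings. The sketch below is not part of evemu; it assumes the line layout shown above (timestamp, hexadecimal type and code, decimal value) and groups events into frames ending at each SYN_REPORT. All names in it are made up for illustration:

```python
from collections import namedtuple

Event = namedtuple("Event", "time type code value")

SAMPLE = """\
E: 7.817877 0000 0000 0000 # ------------ SYN_REPORT (0) ----------
E: 7.821887 0002 0000 -001 # EV_REL / REL_X -1
E: 7.821903 0000 0000 0000 # ------------ SYN_REPORT (0) ----------
E: 7.825872 0002 0000 -001 # EV_REL / REL_X -1
E: 7.825879 0002 0001 -001 # EV_REL / REL_Y -1
E: 7.825883 0000 0000 0000 # ------------ SYN_REPORT (0) ----------
"""


def parse_events(text):
    """Parse the "E:" lines of a recording into Event tuples."""
    events = []
    for line in text.splitlines():
        if not line.startswith("E: "):
            continue  # device description or comment line
        fields = line.split("#", 1)[0].split()
        events.append(Event(float(fields[1]),    # seconds.microseconds
                            int(fields[2], 16),  # event type, hex
                            int(fields[3], 16),  # event code, hex
                            int(fields[4])))     # value, decimal
    return events


def frames(events):
    """Group events into non-empty frames ended by SYN_REPORT (0, 0)."""
    result, current = [], []
    for e in events:
        if e.type == 0 and e.code == 0:
            if current:
                result.append(current)
            current = []
        else:
            current.append(e)
    return result


for frame in frames(parse_events(SAMPLE)):
    print([(e.code, e.value) for e in frame])
```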

Each event is an evemu.InputEvent object, with the properties type, code and value, plus the timestamp as sec and usec (i.e. the fields of the underlying C struct). The most useful method of the object is InputEvent.matches(type, code), which takes both integer values and strings:


if e.matches("EV_REL"):
    print("this is a relative event of some kind")
elif e.matches("EV_ABS", "ABS_X"):
    print("absolute X movement")
# 0x03 and 0x01 are the numeric values of EV_ABS and ABS_Y
elif e.matches(0x03, 0x01):
    print("absolute Y movement")

A practical example: let's say we want to know the maximum delta value our mouse sends.



import sys
import evemu

filename = sys.argv[1]
# create an evemu instance from the recording,
# create=False means don't create a uinput device from it
d = evemu.Device(filename, create=False)

if not d.has_event("EV_REL", "REL_X") or \
   not d.has_event("EV_REL", "REL_Y"):
    print("%s isn't a mouse" % d.name)
    sys.exit(1)

deltas = []

for e in d.events():
    if e.matches("EV_REL", "REL_X") or \
       e.matches("EV_REL", "REL_Y"):
        deltas.append(e.value)

# named max_delta so that the max() builtin is not shadowed
max_delta = max([abs(x) for x in deltas])
print("Maximum delta is %d" % max_delta)

And voila, with just a few lines of code we've analysed a set of events. The rest is up to your imagination. So far I've used scripts like this to help us implement palm detection, figure out how to deal with high-DPI mice, estimate the required size for top software buttons on touchpads, etc.

Especially for printing event values, a couple of other functions come in handy here:


type = evemu.event_get_value("EV_REL")
code = evemu.event_get_value("EV_REL", "REL_X")

strtype = evemu.event_get_name(type)
strcode = evemu.event_get_name(type, code)

They do what you'd expect from them, and both functions take either strings or actual types/codes as numeric values. The same exists for input properties.

The following was debugged and discovered by Benjamin Tissoires, I'm merely playing the editor and publisher. All credit and complimentary beverages go to him please.

Wacom recently added two interesting products to its lineup: the Intuos Creative Stylus 2 and the Bamboo Stylus Fineline. Both are styli only, without the accompanying physical tablet and they are marketed towards the Apple iPad market. The basic idea here is that touch location is provided by the system, the pen augments that with buttons, pressure and whatever else. The tips of the styli are 2.9mm (Creative Stylus 2) and 1.9mm (Bamboo Fineline), so definitely smaller than your average finger, and smaller than most other touch pens. This could of course be useful for any touch-capable Linux laptop, it's a cheap way to get an artist's tablet. The official compatibility lists the iPads only, but then that hasn't stopped anyone in the past.

We enjoy a good relationship with the Linux engineers at Wacom, so naturally the first thing was to ask if they could help us out here. Unfortunately, the answer was no. Or more specifically (and heavily paraphrased): "those devices aren't really general purpose, so we wouldn't want to disclose the spec". That of course immediately prompted Benjamin to go and buy one.

From Wacom's POV not disclosing the specs makes sense and why will become more obvious below. The styli are designed for a specific use-case, if Wacom claims that they can work in any use-case they have a lot to lose - mainly from the crowd that blames the manufacturer if something doesn't work as they expect. Think of when netbooks were first introduced and people complained that they weren't full-blown laptops, despite the comparatively low price...

The first result: the stylus works on most touchscreens (and Benjamin has a few of those) but not on all of them. Specifically, the touchscreen on the Asus N550JK didn't react to it. So that's warning number 1: it may not work on your specific laptop and you probably won't know until you try.

Pairing works, provided you have a Bluetooth 4.0 chipset and your kernel supports it (tested on 3.18-rc3). Problem is: you can connect the device but you don't get anything out of it. Why? Bluetooth LE. Let's expand on that: Bluetooth LE uses the Generic Attribute Profile (GATT). The actual data is divided into Profiles, Services and Characteristics, which are clearly named by committee and stand for the general topic, subtopic/item and data point(s). So in the example here the Profile is Heart Rate Profile, the Service is Heart Rate Measurement and the Characteristic is the actual count of "lub-dub" on your ticker [1]. All are predefined. Again, why does this matter? Because what we're hoping for is the HID Service or the HID over GATT Service. In both cases we could then use the kernel's uhid module to get the stylus to work. Alas, the actual output of the device is:


[bluetooth]# info C5:37:E8:73:57:BE
Device C5:37:E8:73:57:BE
Name: Stylus1
Alias: Stylus1
Appearance: 0x0341
Paired: yes
Trusted: yes
Blocked: no
Connected: yes
LegacyPairing: no
UUID: Vendor specific (00001523-1212-efde-1523-785feabcd123)
UUID: Generic Access Profile (00001800-0000-1000-8000-00805f9b34fb)
UUID: Generic Attribute Profile (00001801-0000-1000-8000-00805f9b34fb)
UUID: Device Information (0000180a-0000-1000-8000-00805f9b34fb)
UUID: Battery Service (0000180f-0000-1000-8000-00805f9b34fb)
UUID: Vendor specific (6e400001-b5a3-f393-e0a9-e50e24dcca9e)
Modalias: usb:v056Ap0329d0001

So we can see GAP and GATT, Device Information and Battery Service (both predefined) and 2 Vendor specific profiles (i.e. "magic fairy dust"). And this is where Benjamin got stuck - each of these may have a vendor-specific handshake, protocol, etc. And it's not even certain he'll be able to set the device up so it talks to him. So warning number 2: you can see and connect the device, but it'll talk gibberish (or nothing).
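To see why the listing is disappointing, it helps to know that predefined services use 16-bit numbers embedded in the Bluetooth Base UUID (0000xxxx-0000-1000-8000-00805f9b34fb), and anything else is vendor specific. The Python sketch below classifies the UUIDs from the output above; the lookup table is only a small subset of the assigned numbers, and the function names are made up for illustration:

```python
import uuid

# Every 16-bit assigned number is embedded in the Bluetooth Base UUID.
BASE_SUFFIX = "-0000-1000-8000-00805f9b34fb"

# Small subset of the assigned 16-bit service numbers.
KNOWN = {
    0x1800: "Generic Access",
    0x1801: "Generic Attribute",
    0x180A: "Device Information",
    0x180F: "Battery Service",
    0x1812: "Human Interface Device",
}

STYLUS = [
    "00001523-1212-efde-1523-785feabcd123",
    "00001800-0000-1000-8000-00805f9b34fb",
    "00001801-0000-1000-8000-00805f9b34fb",
    "0000180a-0000-1000-8000-00805f9b34fb",
    "0000180f-0000-1000-8000-00805f9b34fb",
    "6e400001-b5a3-f393-e0a9-e50e24dcca9e",
]


def classify(uuid_str):
    """Name a service UUID, or report it as vendor specific."""
    u = str(uuid.UUID(uuid_str))  # validates and lower-cases the string
    if u.startswith("0000") and u.endswith(BASE_SUFFIX):
        short = int(u[4:8], 16)
        return KNOWN.get(short, "assigned number 0x%04x" % short)
    return "vendor specific"


def has_hid(uuids):
    """True if any of the UUIDs is the predefined HID service."""
    return any(classify(u) == "Human Interface Device" for u in uuids)


for u in STYLUS:
    print(u, classify(u))
print("HID service present:", has_hid(STYLUS))
```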

Now, it's probably possible to reverse engineer that if you have sufficient motivation. We don't. The Bluetooth spec is available though, once you work your way through that you can start working on the vendor specific protocol which we know nothing about.

Last but not least: the userspace component. The device itself is not ready-to-use, it provides pressure but you'd still have to associate it with the right touch point. That's not trivial, especially in the presence of other touch points (the outside of your hand while using the stylus for example). So we'd need to add support for this in the X drivers and libinput to make it work. Wacom and/or OS X presumably solved this for iPads, but even there it doesn't just work. The applications need to support it and "You do have to do some digging to figure out to connect the stylus to your favorite art apps -- it's a different procedure for each one, but that's common among these styluses." That's something we wouldn't do the same way on the Linux desktop. So warning number 3: if you can make the kernel work, it won't work as you expect in userspace, and getting it to work is a huge task.

Now all that pretty much boils down to: is it worthwhile? Our consensus so far was "no". I guess Wacom was right in holding back the spec after all. These devices won't work on just any tablet, and even if they did, we don't have anything in the userspace stack to actually support them properly. So in summary: don't buy the stylus if you plan to use it in Linux.

[1] lub-dub is good. ta-lub-dub is not. you don't want lub-dub-ta. wikipedia

In my previous blog post, I talked about pathologically bad Linux desktop performance with FullHD monitors on Allwinner A10 hardware.

A lot of time has passed since then. Thanks to the availability of Rockchip sources and documentation, we have learned a lot about the DRAM controller in Allwinner A10/A13/A20 SoCs. Both Allwinner and Rockchip are apparently licensing the DRAM controller IP from the same third-party vendor, and their DRAM controller hardware registers share a lot of similarities (though unfortunately this is not an exact match).

Having much better knowledge of the hardware allowed us to revisit this problem, investigate it in more detail and come up with a solution back in April 2014. The only missing part was an update in this blog, at least to make it clear that the problem has now been resolved. So here we go...

November 10, 2014

As some of you already know, since the larger restructuring in PackageKit for the 1.0 release, I am rethinking Listaller, the 3rd-party application installer for Linux systems, as well.

During the past weeks, I was playing around with a lot of different ideas and code, to make installations of 3rd-party software easily possible on Linux, but also working together with the distribution package manager. I now have come up with an experimental project, which might achieve this.

Motivation

Many of you know Lennart’s famous blogpost on how we put together Linux distributions. And he makes a lot of good and valid points there (in fact, I agree with his reasoning there). The proposed solution, however, is not something which I am very excited about, at least not for the use-case of installing a simple application[1]. Leaving things like the exclusive dependency on technology like Btrfs aside, the solution outlined by Lennart basically bypasses the distribution itself, instead of working together with it. This results in a duplication of installed libraries, making it harder to keep track of which versions of which software component are actually running on the system. There is also a risk of security holes due to libraries not being updated. The security issues are worked around by a superior sandbox, which still needs to be implemented (but will definitely come soon, maybe next year).

I wanted to explore a different approach of managing 3rd-party applications on Linux systems, which allows sharing as much code as possible between applications.

Limba – Glick2 and Listaller concepts merged

In order to allow easy creation of software packages, as well as the ability to share software between different 3rd-party applications, I took heavy inspiration from Alexander Larsson's Glick2 project, combining it with ideas from the application-directory based Listaller.

The result is Limba (named after Limba tree, not the voodoo spirit – I needed some name starting with “li” to keep the prefix used in Listaller, and for a tool like this the name didn’t really matter ;-) ).

Limba uses OverlayFS to combine an application with its dependencies before running it, as well as mount namespaces and shared subtrees. Except for OverlayFS, which just landed in the kernel recently, all other kernel features needed by Limba have been available for years (and many distributions ship OverlayFS on older kernels as well).

How does it work?

In order to achieve separation of software, each software component is located in a separate container (= package). A software component can be an application, like Kate or GEdit, but also be a single shared library (openssl) or even a full runtime (KDE Frameworks 5 parts, GNOME 3).

Each of these software components can be identified via AppStream metadata, which is just a few bits of XML. A Limba package can declare a dependency on any other software component. In case that software is available in the distribution’s repositories, the version found there can be used. Otherwise, another Limba package providing the software is required.

Limba packages can be provided from software repositories (e.g. provided by the distributor), or be nested in other packages. For example, imagine the software “Kate” requires a version of the Qt5 libraries, >= 5.2. The downloadable package for “Kate” can be bundled with that dependency, by including the “Qt5 5.2” Limba package in the “Kate” package. If another application that requires the same version of Qt is installed later, the already installed version will be used.
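The lookup order described above can be sketched roughly as follows. This is only an illustration of the intended behaviour, not Limba's actual code: the function names, the dictionaries standing in for the distribution's repositories, the installed Limba packages and the bundled packages, and the simple dotted-version comparison are all assumptions.

```python
def parse_version(text):
    """Turn a dotted version such as '5.2.1' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def resolve(name, min_version, distro, installed, bundled):
    """Pick a provider for a dependency, preferring what is already there.

    distro, installed and bundled are hypothetical dicts mapping a
    component name to the version available from that source.
    """
    wanted = parse_version(min_version)
    for source, available in (("distribution", distro),
                              ("installed", installed),
                              ("bundled", bundled)):
        version = available.get(name)
        if version is not None and parse_version(version) >= wanted:
            return source, version
    raise LookupError(f"no provider for {name} >= {min_version}")


# "Kate" needs Qt5 >= 5.2; the distribution only ships 5.1, so the copy
# bundled inside the Kate package is used.
first = resolve("Qt5", "5.2", {"Qt5": "5.1"}, {}, {"Qt5": "5.2"})

# A later application with the same requirement reuses the installed copy.
second = resolve("Qt5", "5.2", {"Qt5": "5.1"}, {"Qt5": "5.2"}, {"Qt5": "5.2"})
```

Checking the distribution first reflects the goal of reusing components the distribution already packages; the bundled copy is only the last resort.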

Since the software components are located in separate directories under /opt/software, an application will not automatically find its dependencies, or be able to locate its own files. Therefore, each application has to be run by a special tool, which merges the directory trees of the main application and its dependencies together using OverlayFS. This has the nice side effect that the main application could override files from its dependencies, if necessary. The tool also sets up a new mount namespace, so if the application is compiled with a certain prefix, it does not need to be relocatable to find its data files.

At installation time, certain files (e.g. the .desktop file) are split out of the installed directory tree, so the newly installed application achieves almost full system integration.
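To illustrate the override behaviour, here is a small model of what the merged view looks like. It only mimics the semantics of a union mount with plain dictionaries and performs no actual mount; the file names and the helper function are made up for the example.

```python
def merged_view(layers):
    """Merge directory trees the way a union mount presents them.

    layers is a list of dicts (path -> content), lowest layer first;
    a file in a later layer hides the same path in earlier layers.
    """
    view = {}
    for layer in layers:
        view.update(layer)
    return view


qt5 = {"lib/libQt5Core.so.5": "qt5 library", "share/qt5/qt.conf": "qt5 default"}
kate = {"bin/kate": "kate binary", "share/qt5/qt.conf": "kate override"}

# Dependencies go first, the main application last so it can override them.
view = merged_view([qt5, kate])
```

In this model the application sees the Qt5 library from the dependency, while its own copy of `share/qt5/qt.conf` wins over the one shipped by Qt5.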

AQNAY*

Can I use Limba now?

Limba is an experiment. I like it very much, but it might happen that I find some issues with it and kill it off again. So, if you feel adventurous, you can compile the source code and use the example “Foobar” application to play around with Limba. Before it can be used in production (if at all), some more time is needed.

I will publish documentation on how to test the project soon.

Doesn’t OverlayFS have a maximum stacking depth?

Oh yes it has! The “How does it work” explanation doesn’t tell the whole truth in that regard (mainly to keep the section small). In fact, Limba will generate a “runtime” for the newly installed software, which is a directory with links to the actual individual software components the runtime consists of. The runtime is identified by a UUID. This runtime is then mounted together with the respective applications using OverlayFS. This works pretty well, and also means that no dependency resolution has to be done immediately before an application is started.

That dependency stuff gives me a headache…

Admittedly, allowing dependencies adds a whole lot of complexity. Other approaches, like the one outlined by Lennart, work around that (and there are good reasons for doing that as well).
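A rough sketch of the runtime idea, assuming a runtime is simply a directory named after a freshly generated UUID that contains one symlink per component. The directory layout, the component names and the function name are invented for the example and are not taken from Limba.

```python
import os
import tempfile
import uuid


def create_runtime(root, component_dirs):
    """Create a runtime directory holding one symlink per component."""
    runtime_id = str(uuid.uuid4())
    runtime_dir = os.path.join(root, runtime_id)
    os.makedirs(runtime_dir)
    for component in component_dirs:
        link = os.path.join(runtime_dir, os.path.basename(component))
        os.symlink(component, link)
    return runtime_id, runtime_dir


with tempfile.TemporaryDirectory() as root:
    components = []
    for name in ("Qt5-5.2", "openssl-1.0"):
        path = os.path.join(root, "software", name)
        os.makedirs(path)
        components.append(path)
    runtime_id, runtime_dir = create_runtime(
        os.path.join(root, "runtimes"), components)
    links = sorted(os.listdir(runtime_dir))
```

Because all dependencies are collected behind a single runtime directory, the number of layers handed to OverlayFS stays small no matter how many components an application depends on.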

In my opinion, the dependency-sharing and de-duplication of software components, as well as the ability to use the components which are packaged by your Linux distribution is worth the extra effort.

Can you give an overview of future plans for Limba?

Sure, so here is the stuff which currently works:

  • Creating simple packages
  • Installing packages
  • Very basic dependency resolution (no relations (like >, <, =) are respected yet)
  • Running applications
  • Initial bits of system integration (.desktop files are registered)

These features are planned for the near future:

  • Support for removing software
  • Automatic software updates of 3rd-party software
  • Atomic updates
  • Better system integration
  • Integration with the new sandboxing features
  • GPG signing of packages
  • More documentation / bugfixes

Remember that Limba is an experiment, still ;-)

XKCD 927

Technically, I am replacing one solution with another one here, so the situation does not change at all ;-). But indeed, some duplicate work is done due to more people working in this area now on similar questions.

But I think this is a good thing, because the solutions worked on are fundamentally different approaches, and by exploring multiple ways of doing things, we will come up with something great in the end. (XKCD reference)

Doesn’t the use of OverlayFS have an impact on the performance of software running with Limba?

I ran some synthetic benchmarks and didn’t notice any problems – even the startup speed of Limba applications is only a few milliseconds slower than the startup of the “raw” native application. However, I will still have to run further tests to give a definitive answer on this.

How do you solve ABI compatibility issues?

This approach requires software to keep their ABI stable. But since software can have strict dependencies on a specific version of a software (although I’d discourage that), even people who are worried about this issue can be happy. We are getting much better at tracking unwanted ABI breaks, and larger projects offer stable API/ABI during a major release cycle. For smaller dependencies, there are, as explained above, stricter dependencies.

In summary, I don’t think ABI incompatibilities will be a problem with this approach – at least not more than they have been in general. (The libuild facilities from Listaller to minimize dependencies will still be present in Limba, of course)

You are wrong because of $X!

Please leave a comment in this case! I’d love to discuss new ideas and find the limits of the Limba concept – that’s why I am writing C code after all, since what looks great on paper might not work in reality or have issues one hasn’t thought about before. So any input is welcome!

Conclusion

Last but not least I want to thank Alexander Larsson for writing Glick2, which Limba is heavily inspired by, and for his patient replies to my emails.

If Limba turns out to be a good idea, you can expect a few more blog posts about it soon.


* Answered questions nobody asked yet

[1]: Don’t get me wrong, I would like to have these ideas implemented – they offer great value. But I think for “simple” software deployment, the solution is overkill.

November 07, 2014
okay another braindump (still nothing working).

The git repo mentioned in previous post has all the code I've hacked up so far.

I finished writing the HDCP protocol stages, and sending all the msgs and getting replies from the device.

So I've successfully reached a point where I've negotiated a HDCP session key with the device, and we are both happy about it. Unfortunately I've no idea what I'm meant to be encrypting to send to the device. The next packet the USB traces contain is 384-bytes of encrypted data.

Now HDCP v2 had a vulnerability in its key negotiation, and I've written code to try and use this fact. So I've taken a trace I made from Windows, and extracted the necessary bits, and using that I've managed to derive the master key used in that trace, and subsequently managed to derive the session key for it. So I've replayed the first encrypted packet from the trace to the device and got an encrypted response the same as in the trace.

I've tried changing a bit in the session key, riv value and data I'm sending, and doing that causes the device not to reply with the answer. This to me implies that the device is using the HDCP cipher to encode the control channel. Now HDCP does say you should only do this for video streams, but maybe DisplayLink forgot to read that bit.

Now where does this leave me, in theory I should be able to replay the full trace (haven't had time yet) and I should see the same picture on screen as I did (though I can't remember what monitor/device I used, so I might have to retrace and restage my tests before then).

However I really need to decrypt the encrypted data in the trace, and from reading the HDCP spec the only values I need to feed the AES engine are ks ^ lc128, riv, streamctr, inputctr. I'm assuming streamctr and inputctr are 0 for the first packet (I could be wrong, maybe they use some wacky streamctr to avoid messing with hdcp), riv and ks I've captured. So lc128 is possibly the crux.
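As a sketch of how those values fit together: the AES key is ks XOR lc128, and the 128-bit counter block is built from riv, streamctr and inputctr. The field widths (64-bit riv, 32-bit streamctr, 64-bit inputctr), the big-endian byte order and the placement of streamctr in the low bits of riv are my reading of the HDCP 2.x layout and should be treated as assumptions. The all-zero lc128 below is only a placeholder, and the AES step itself is left out since it needs a crypto library.

```python
def xor_bytes(a, b):
    """XOR two equally long byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def cipher_inputs(ks, lc128, riv, stream_ctr=0, input_ctr=0):
    """Return (key, counter_block) to feed into an AES-128 engine.

    Assumed layout: key = ks ^ lc128 and
    counter_block = (riv ^ stream_ctr) || input_ctr, all big-endian.
    """
    key = xor_bytes(ks, lc128)
    riv_mixed = int.from_bytes(riv, "big") ^ stream_ctr
    counter_block = riv_mixed.to_bytes(8, "big") + input_ctr.to_bytes(8, "big")
    return key, counter_block


ks = bytes(range(16))                    # dummy session key
lc128 = bytes(16)                        # placeholder: the real value is secret
riv = bytes.fromhex("0102030405060708")  # dummy riv

key, counter_block = cipher_inputs(ks, lc128, riv)
```

With streamctr and inputctr both 0, as guessed for the first packet, the counter block is just riv followed by eight zero bytes, so lc128 is the only unknown input.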

Now what is lc128? It's a secret 128-bit value in the HDCP world given only to HDCP adopters. It's normally something you'd store in hw on the GPU etc as an input to the hw cipher. But in displaylink there is no GPU encrypting the data. Now it's possible that displaylink don't use the same lc128 as the HDCP people, unlikely but possible. Maybe they cipher their streams with their own lc128, and only use the official hdcp lc128 for actual HDCP streams.

I don't think lc128 has leaked, and I'm not sure what the consequences of it leaking would be, but hey, it's just a magic number, and if displaylink are using it as an input to their AES code, it must be in RAM at some point. Now I need to figure out ways to work that out. I'm not sure how long it would take to brute force a 128-bit key space, probably impossible.
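For a sense of scale, a quick back-of-the-envelope calculation. The guess rate of 10^12 keys per second is an assumed, deliberately generous figure, not a measurement.

```python
KEYSPACE = 2 ** 128
GUESSES_PER_SECOND = 10 ** 12        # assumed, deliberately generous rate
SECONDS_PER_YEAR = 365 * 24 * 3600

years_worst_case = KEYSPACE / GUESSES_PER_SECOND / SECONDS_PER_YEAR
print(f"{years_worst_case:.2e} years to try every key")
```

Even at that rate the search would take on the order of 10^19 years, so finding the value in memory is the only realistic route.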

At any point if someone from DisplayLink wants to talk, you know where to find me :-)

I got back to working on enabling LLVM’s machine scheduler for radeonsi targets in the R600 backend after seeing a really good tutorial about how it works at this year’s LLVM Developer’s conference.

Since I last worked on this, I’ve figured out how to enable register pressure tracking in the scheduler, so now the scheduler will switch to a register pressure reduction strategy once register usage approaches the threshold where using more registers reduces the number of threads that can run in parallel.
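The threshold effect can be illustrated with a simplified occupancy model. The hardware numbers (256 vector registers per lane, at most 10 wavefronts per SIMD, allocation in blocks of 4) are assumptions about GCN-class hardware rather than figures from this post, the model ignores other limits such as scalar registers and local memory, and the `should_reduce_pressure` heuristic is hypothetical, not what the LLVM scheduler actually does.

```python
MAX_WAVES_PER_SIMD = 10   # assumed hardware limit
VGPRS_PER_SIMD = 256      # assumed vector registers available per lane
ALLOC_GRANULARITY = 4     # assumed register allocation granularity


def waves_per_simd(vgprs_used):
    """Number of wavefronts that fit for a given per-thread register count."""
    blocks = -(-max(vgprs_used, 1) // ALLOC_GRANULARITY)  # round up
    allocated = blocks * ALLOC_GRANULARITY
    return min(MAX_WAVES_PER_SIMD, VGPRS_PER_SIMD // allocated)


def should_reduce_pressure(vgprs_used, margin=2):
    """True when using `margin` more registers would cost a wavefront."""
    return waves_per_simd(vgprs_used + margin) < waves_per_simd(vgprs_used)
```

Under these assumptions a shader using 24 registers still runs 10 wavefronts per SIMD, while one using 28 drops to 9, which is the kind of boundary where switching to a pressure-reducing strategy pays off.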

So far the results look pretty good: several of the Phoronix benchmarks are faster with the scheduler enabled. However, I am still trying to track down a bug which is causing the xonotic benchmark to lock up when using the ‘ultra’ settings.

If anyone wants to test it out, I’ve pushed the code to my personal repo.

November 06, 2014

… or “Why do I not see any update notifications on my brand-new Debian Jessie installation??”

This is a short explanation of the status quo, and also explains the “no update notifications” issue in a slightly more detailed way, since I am already getting bug reports for that.

As you might know, GNOME provides GNOME-Software for installation of applications via PackageKit. In order to work properly, GNOME-Software needs AppStream metadata, which is not yet available in Debian. There was a GSoC student working on the necessary code for that, but the code is not yet ready and doesn’t produce good results yet. Therefore, I postponed AppStream integration to Jessie+1, with an option to include some metadata for GNOME and KDE to use via a normal .deb package.

Then, GNOME was updated to 3.14. GNOME 3.14 moved lots of stuff into GNOME-Software, including the support for update-notifications (which have been in g-s-d before). GNOME-Software is also the only thing which can edit the application groups in GNOME-Shell, at least currently.

So obviously, there was now a much stronger motivation to support GNOME-Software in Jessie. The appstream-glib library, which GNOME-Software uses exclusively to read AppStream metadata, didn’t support the DEP-11 metadata format, which Debian uses in place of the original AppStream XML, for a while, but does so in its current development branch. So that component had to be packaged first. Later, GNOME-Software was uploaded to the archive as well, but still lacked the required metadata. That data was provided by me as a .deb package later, locally generated using the current code by my SoC student (the data isn’t great, but better than nothing). So much for the good news.

But there are multiple issues at the moment. First of all, the appstream-data package hasn’t passed NEW so far, due to its complex copyright situation (nothing we can’t resolve, since app-install-data, which appstream-data would replace, is in Debian as well). Also, GNOME-Software currently uses offline-updates exclusively (more information also on [1] and [2]). This doesn’t always work yet, since I haven’t had the time to test it properly – and I didn’t expect it to be used in Debian Jessie as well[3].

Furthermore, the offline-updates feature requires systemd (which isn’t an issue in itself, I am quite fine with that, but people not using it will get unexpected results, unless someone does the work to implement offline-updates with sysvinit).

Since we are in freeze at the moment, and obviously this stuff is not ready yet, GNOME is currently without update notifications and without a way to change the shell application groups.

So, how can we fix this? One way would of course be to patch notification support back into g-s-d, if the new layout there allows doing that. But that would not give us the other features GNOME-Software provides, like application-group-editing.

Implementing that differently and patching it to make it work would be at least as much work as making GNOME-Software run properly. I therefore prefer getting GNOME-Software to run, at least with basic functionality. That would likely mean hiding things like the offline-update functionality, and using online-updates with GNOME-PackageKit instead.

Obviously, this approach has its own issues, like doing most of the work post-freeze, which kind of defeats the purpose of the freeze and would need some close coordination with the release-team.

So, this is the status quo at this time. It is kind of unfortunate that GNOME moved crucial functionality into a new component which requires additional integration work by the distributors so quickly, but that’s something which isn’t worth talking about. We need a way forward to bring update-notifications back, and there is currently work going on to do that. For all Debian users: Please be patient while we resolve the situation, and sorry for the inconvenience. For all developers: If you would like to help, please contact me or Laurent Bigonville, there are some tasks which could use some help.

As a small remark: If you are using KDE, you are lucky – Apper provides the notification support like it always did, and thanks to improvements in aptcc and PackageKit, it even is a bit faster now. For the Xfce and <other_desktop> people, you need to check if your desktop provides integration with PackageKit for update-checking. At least Xfce doesn’t, but after GNOME-PackageKit removed support for it (which was moved to gnome-settings-daemon and now to GNOME-Software) nobody stepped up to implement it yet (so if you want to do it – it’s not super-complicated, but knowledge of C and GTK+ is needed).

----

[3]: It looks like dpkg tries to ask a debconf question for some reason, or an external tool like apt-listchanges is interfering with the process, which must run completely unsupervised. There is some debugging needed to resolve these Debian-specific issues.

November 04, 2014

So just a quick update. I pushed out the 1.5 release of Transmageddon today. No major new features just fixing a regression in terms of dealing with files where you only have a video track or where you want to drop the audio track as part of the transcoding process. I am also having some issues with Intel Hardware encoding atm, but I think those are somewhere lower in the stack, so I hope to file a bug against either GStreamer or the libva project for that issue, but for now I recommend not having the Intel VA plugins for GStreamer installed.

As always you find the latest release on linuxrising.org.

I also submitted a Transmageddon update to Fedora 21, so if you are a Fedora user please test the build there and give it some Karma

November 03, 2014

So I've just reposted my atomic modeset helper series, and since the main goal of all that work was to ensure a smooth and simple transition for existing drivers to the promised atomic land it's time to elaborate a bit. The big problem is that the existing helper libraries and callbacks to driver backends don't really fit the new semantics, so some shuffling was required to avoid long-term pain. So if you are a driver writer and just interested in the details then read on for what needs to be done to support atomic modeset updates using these new helper libraries.

Phase 1: Reworking the Driver Backend Functions for Planes

The first phase is reworking the driver backend callbacks to fit the new world. There are two big mismatches between the new atomic semantics and legacy ioctl interfaces:

  • The primary plane is no longer tied to the CRTC. Instead it is possible to enable the CRTC without any planes (resulting in a black screen) or only overlay planes. And the primary plane can be enabled/disabled and moved without changing the mode (of course only if the hardware actually supports it). But the existing CRTC helper library used to implement modesets only provides the single crtc->mode_set driver callback which always implicitly enables the primary plane, too.
  • Atomic updates of multiple planes aren't supported at all. And worse, the code to check whether a plane update will work out is smashed into the same callback that does the actual plane update, defeating the check/commit distinction used in atomic interfaces.
Both issues are addressed by adding new driver backend callbacks. Furthermore a few transitional helper functions are provided to implement the legacy entry points in terms of these new callbacks. That way the driver backend can be reworked without the additional hassle of needing to deal with all the atomic state object handling and check/commit semantics.

The first step is to rework the ->disable/update_plane hooks using the transitional helper implementations drm_plane_helper_update/disable. These need the following new driver callbacks:
  • ->atomic_check for both CRTCs and planes. This isn't strictly required, but any state checks implemented in the current ->update_plane hook must be moved into the plane's ->atomic_check callback. The CRTC's callback will likely be empty for now.
  • ->atomic_begin and ->atomic_flush CRTC callbacks. These wrap the actual plane update and should do per-CRTC work like preparing to send out the flip completion event. Or ensure that the plane updates are actually done atomically by e.g. setting/clearing GO bits or latching the update through some other means. Or if the hardware does not provide any support for synchronized updates, use vblank evasion to ensure all updates happen on the same frame.
  • ->prepare_fb and ->cleanup_fb hooks are also optional. These are used to setup the framebuffers, e.g. pin their backing storage into memory and set up any needed hardware resources. The important part is that besides the ->atomic_check callbacks ->prepare_fb is the only new callback which is allowed to fail. This is important to make asynchronous commits of atomic state updates work. The helper library guarantees that for any successful call of ->prepare_fb it will call ->cleanup_fb - even when something else fails in the atomic update.
  • Finally there's ->atomic_update. That's the function which does all the per-plane update, like setting up the new viewport or the new base address of the framebuffer for each plane.
With this it's also easy to implement universal plane support directly, instead of with the default implementation which doesn't allow the primary plane to be disabled. Universal planes are a requirement for atomic and need to be implemented in phase 1, but testing the primary plane support is also a good preparation for the next step:

The new crtc->mode_set_nofb callback must be implemented, which just updates the CRTC timings and data in the hardware without touching the primary plane state at all. The provided helper functions drm_helper_crtc_mode_set and drm_helper_crtc_mode_set_base then implement the callbacks required by the CRTC helpers in terms of the new ->mode_set_nofb callback and the above newly implemented plane helper callbacks.
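The plane update flow built from the phase 1 callbacks can be modelled in a few lines. This is a toy model in Python rather than kernel C: the `Plane` class and `update_planes` function are stand-ins invented for the example, and calling cleanup at the end of the same update is a simplification that only illustrates the pairing guarantee, not the point in time at which the real helpers release a framebuffer.

```python
class Plane:
    """Stand-in for a driver's plane helper callbacks; every call is logged."""

    def __init__(self, name, log, check_ok=True, prepare_ok=True):
        self.name, self.log = name, log
        self.check_ok, self.prepare_ok = check_ok, prepare_ok

    def atomic_check(self):
        self.log.append(f"{self.name}.atomic_check")
        return self.check_ok

    def prepare_fb(self):
        self.log.append(f"{self.name}.prepare_fb")
        return self.prepare_ok

    def atomic_update(self):
        self.log.append(f"{self.name}.atomic_update")

    def cleanup_fb(self):
        self.log.append(f"{self.name}.cleanup_fb")


def update_planes(planes, log):
    """Check everything first, then commit; only check and prepare may fail."""
    if not all(plane.atomic_check() for plane in planes):
        return False
    prepared = []
    try:
        for plane in planes:
            if not plane.prepare_fb():
                return False
            prepared.append(plane)
        log.append("crtc.atomic_begin")
        for plane in planes:
            plane.atomic_update()
        log.append("crtc.atomic_flush")
        return True
    finally:
        # Every successful prepare_fb is paired with a cleanup_fb.
        for plane in prepared:
            plane.cleanup_fb()


ok_log = []
ok = update_planes([Plane("primary", ok_log), Plane("cursor", ok_log)], ok_log)

fail_log = []
failed = update_planes(
    [Plane("primary", fail_log), Plane("cursor", fail_log, prepare_ok=False)],
    fail_log)
```

In the failing case no plane is touched: the update stops before `atomic_begin`, and only the plane whose `prepare_fb` succeeded gets a `cleanup_fb` call.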

Phase 2: Wire up the Atomic State Object Scaffolding

With the completion of phase 1 all the driver backend functions have been adapted to the new requirements of the atomic helper library. The goal of phase 2 is to get all the state object handling needed for atomic updates into place. There are three steps to that:

  • The first is fairly simple and consists of just wiring up all the state reset, duplicate and destroy functions for planes, CRTCs and connectors. Except for really crazy cases the default implementations from the atomic helper library should be good enough, at least to get started. With this there will always be an atomic state object stored in each object's ->state pointer.
  • The second step is patching up the state objects in legacy code paths to make sure that we can partially transition to atomic updates. If your driver doesn't have any transition checks for plane updates (i.e. doesn't ever look at the old state to figure out whether a change is possible) then all you need to do is keep the framebuffer pointers and reference counts in balance with drm_atomic_set_fb_for_plane. The transitional helpers from phase 1 already do this, so usually the only place this needs to be manually added is in the ->page_flip callback.
  • Finally all ->mode_fixup callbacks need to be audited to not depend upon any state which is only set in the various CRTC helper callbacks and not tracked by the atomic state objects. This isn't required for implementing the legacy interfaces using atomic updates, but this is important to correctly implement the check/commit semantics. Especially when the commit is done asynchronously. This really is a corner-case though, but any such code must be moved into ->atomic_check hooks and rewritten in terms of the atomic state objects.
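The state object handling boils down to a duplicate, check, swap pattern, which the following toy model tries to capture. It is written in Python with invented names (`PlaneState`, `set_position`), and the validity rule in `atomic_check` is made up purely for illustration; real drivers do this in C with the reset, duplicate and destroy hooks mentioned above.

```python
import copy


class PlaneState:
    """Everything needed to describe one plane configuration."""

    def __init__(self, fb=None, x=0, y=0):
        self.fb, self.x, self.y = fb, x, y


class Plane:
    def __init__(self):
        # "reset": there is always a state object behind the state pointer.
        self.state = PlaneState()

    def duplicate_state(self):
        return copy.copy(self.state)


def atomic_check(new_state):
    """Validate a proposed state without touching the current one."""
    return new_state.x >= 0 and new_state.y >= 0   # made-up constraint


def set_position(plane, x, y):
    new_state = plane.duplicate_state()
    new_state.x, new_state.y = x, y
    if not atomic_check(new_state):
        return False           # rejected: the duplicate is simply dropped
    plane.state = new_state    # commit: swap the new state in
    return True


plane = Plane()
accepted = set_position(plane, 10, 20)
rejected = set_position(plane, -1, 0)
```

Because the check only ever looks at the duplicated state, a rejected update leaves the current state exactly as it was.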

Phase 3: Rolling out Atomic Support

With the driver backend changes from phase 1 and the state handling changes from phase 2 everything is ready for the step-by-step rollout of atomic support. Presuming nothing was missed this just consists of wiring up the ->atomic_check and ->atomic_commit implementations from the atomic helper library. And then replacing all the legacy entry points with the corresponding functions from the atomic helper library to implement them in terms of atomic.

The recommended order is to start with planes first, then test the ->set_config functionality. Page flips and properties are best done later since they likely need some additional work:
  • The atomic helper library doesn't provide any default asynchronous commit support, since driver and hardware requirements seem to be too diverse. At least until we have a few proper implementations and can use them to extract a good set of helper functions. Hence drivers must implement basic async commit support using the building blocks provided (and other drm and kernel infrastructure like flip queues, fence callbacks and work items - hopefully soonish also vblank callbacks).
  • Property values need to be moved into the relevant state object first before the corresponding implementations can be wired up. As long as the driver doesn't yet support the full atomic ioctl this can be done at leisure, but must be completed before the ioctl can be enabled. To do so drivers need to subclass the relevant state structure and reimplement all the state handling functions rolled out in phase 2.

Besides these two complications (which might require a bit of work depending upon the driver) this is all that's needed for full atomic modeset and pageflip support.

Follow-up Driver Cleanups

But there's of course quite a bit of cleanup work possible afterwards!

There are some big differences between the old CRTC helper modeset logic and the new one (using the same callbacks, but completely rewritten otherwise) in the atomic helper library:
  • The encoder/bridge/CRTC enabling/disabling sequence for a given modeset configuration is now always the same. Which means unused CRTCs are no longer disabled only after everything else is set up, but together with all the other blocks before anything from the new configuration is enabled. Also, when an output pipeline changes the helper library will always force a full modeset of the entire pipeline.
    This reduces combinatorial complexity a lot and should especially help with shared resources (like PLLs) - no longer can a modeset spuriously fail just because the old CRTC hasn't released its PLL before the new one was enabled.
  • Thanks to the atomic state tracking the helper code won't lose track of the software state of each object any more. Which means disabled functions won't be disabled more than once. So all code in the driver-backend which checks the current state and acts accordingly can be flattened and replaced by WARNings.

These are all lessons learned from the i915 modeset rewrite. The only thing missing in the atomic helpers compared to i915 is the state readout and cross-checking support - everything else is there. But even that can be easily implemented by adding hardware state readout callbacks and using them in the various state reset functions (to reconstruct matching software state) and also to cross-check state.

The other big cleanup task is to stop using all the legacy state variables and switch all the driver backend code to only look at the state object structures. The two big examples here are crtc->mode and the plane->fb pointer.

So What Now?

With all that converting drivers should be simple and can be done with a series of not-too-invasive refactorings. But my patch series doesn't yet contain the actual atomic modeset ioctl. So what's left to be done in the drm core?

  • Per-plane locking is still missing - currently all plane-related changes just lock all CRTCs. Which is a bit too much, and in cases like the cursor plane or for page flips actually a regression compared to the legacy code paths.
  • The atomic ioctl uses properties for everything, even for the standard properties inherited from the legacy ioctls. All the code for parsing properties exists already in Rob Clark's patch series, but needs to be rebased and adapted to the slightly different interfaces this latest iteration of the internal atomic interface has.
  • The fbdev emulation needs to grow proper atomic check/commit support. This is both a good kernel-internal validation of the atomic interface and would finally allow us to get multi-pipe configuration for fbcon to work correctly. But before we can do this we need a driver with multiple CRTCs, shared resource constraints and proper atomic support to be able to even test this.
  • There's still some room for more helpers, for example pretty much all drivers have some sort of vblank driver callback and work item infrastructure. That's better done as a helper in the core vblank handling code.
  • And finally we need the actual ioctl code.
So still a few things to do, besides adding atomic support to all drivers.

Update: The explanation for how to implement state readout and cross checking was a bit confused, so I reworded that.
November 02, 2014
Since Dave Airlie moved the feature cut-off of the drm-next tree roughly one month ahead it is already time for our regular look at what's ahead. Even though the 3.17 features aren't even released yet.

On the modeset side of things we now have the final pieces for plane rotation support from Sonika Jindal and Ville. The DisplayPort code has also seen lots of improvements, with updated training values in preparation of the latest eDP standard (Sonika Jindal) and support for DP training pattern 3 (Ville). DSI panels now support burst mode (Shobhit) and hdmi conformance has been improved with some fixes from Clint Taylor.

For eDP panels we also have improved panel power sequencing code, mostly to fix issues on Cherryview, from Ville. Ville has also contributed fixes to the VDD handling code, which is used to temporarily enable panel power. And the backlight code learned to handle the bl_power setting so that the backlight can be turned off completely without upsetting the panel's power sequencing, contributed by Jani.

Chris Wilson has also been fairly busy on the modeset code: 3.18 includes his patches to cache EDIDs for a single probe call - unfortunately the full caching solution to keep the EDID around between multiple probe calls isn't merged yet. And pageflips have now improved error detection and recovery logic: In case something goes wrong we shouldn't end up stuck any longer waiting for a pageflip to complete that has been lost by either the hardware or the driver.

Moving on to platform specific work there's been lots of preparations for Skylake, most of it from Damien and Sonika. The actual initial platform enabling is delayed until 3.19 though. On the other end of the timeline Ville fixed up i830M modeset support on a rainy w/e in his vacation, and 3.18 now has all that code. And there have been a lot of Cherryview fixes all over.

Cherryview also gained support for power wells and hence runtime pm (Ville). And for platform agnostic features a lot of the preparation for DRRS (dynamic refresh rate switching) is merged, hopefully the actual feature patches from Vandana Kannan will land in 3.19.

Moving on to the render side of the driver there's been a lot of patches to beat the full ppgtt support into shape. The context code has been cleaned up, lifetime handling for ppgtt address spaces is fixed and bad interactions with secure batches are now also rectified. Enabling full ppgtt missed the feature cutoff by a hair though, but it's already enabled for the following release.

Basic support for execlists command submission from Ben Widawsky, Oscar Mateo and Thomas Daniel was also merged. This is the fancy new way to submit commands available on Gen8 and subsequent platforms. It's not yet enabled by default, but since it's a requirement for a lot of cool new features keep an eye on what's going on here. There is also a lot of work going on underneath to enable all this new code in GEM, like preparing to switch away from sequence numbers to tracking gpu progress more abstractly using the driver's request structures.

And this time around there is also some cool stuff going on in the drm core worthy of a shout-out: The vblank handling code is massively revamped, hopefully plugging all the small races, inconsistencies and inefficiencies in that code. And thanks to David Herrmann it is finally possible to write a drm driver without the drm midlayer getting completely in the way of a proper driver load and unload sequence! Unfortunately i915 can't be converted right away since the legacy usermodesetting code crucially relies on this midlayer functionality. But that's now well deprecated and hopefully can be removed in one of the next releases.
November 01, 2014
Trackballs

I dusted off (literally) my Logitech Marble trackball to replace the Intuos tablet + mouse combination that I was using to cut down on the lateral movement of my right arm which led to back pains.

Not that you care about that one bit, but that meant that I needed a way to get a scroll wheel working with this scroll-wheel-less trackball. That's now implemented in gnome-settings-daemon for GNOME 3.16. You'd run:


gsettings set org.gnome.settings-daemon.peripherals.trackball scroll-wheel-emulation-button 8

With "8" being the mouse button number to use to make the trackball ball into a wheel. We plan to add an interface to configure this in the Settings.
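If you want to set this from a script, a small wrapper could look like the following sketch. The schema and key are the ones from the command above; the helper function name is made up, and actually running the command of course requires a gnome-settings-daemon version that ships this key.

```python
SCHEMA = "org.gnome.settings-daemon.peripherals.trackball"
KEY = "scroll-wheel-emulation-button"


def scroll_emulation_command(button):
    """Build the gsettings invocation for the given mouse button number."""
    return ["gsettings", "set", SCHEMA, KEY, str(int(button))]


command = scroll_emulation_command(8)
# To apply it: subprocess.run(command, check=True)
```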

Touchscreens

Touchscreens are now switched off when the screensaver is on. This means you'll usually need to use one of the hardware buttons on tablets, or a mouse or keyboard on laptops to turn the screen back on.

Note that you'll need a kernel patch to avoid surprises when the touchscreen is re-enabled.

More touchscreens

The driver for the Goodix touchscreen found in the Onda v975w is now upstream as well.
October 31, 2014

Last week, a blog post Hints for writing Unix tools by Marius Eriksen made the rounds. It presented nine suggestions on what makes a command a good citizen of the Unix command-line ecosystem, especially for fitting into pipelines and filters.

This reminded me of a longer list of guidelines I recently gathered as part of our efforts to train new hires in Solaris engineering. I polled long time engineers, trawled the Best Practices documents of our Architecture Review Committee, cross referenced to the WCAG2ICT accessibility guidelines for non-web applications recommended by Oracle’s accessibility group, and linked to our online documentation, to come up with our suggestions on writing new CLI tools for Solaris.

Since these may be useful to others writing commands, I figured I’d share some of them. I’ve left out the bits specific to complying with our internal policies or using private interfaces that aren’t documented for external use, but many of these are generally applicable. Do note that these are based in part on lessons learned from 40+ years of Unix history. That history means that many existing commands do not follow these suggestions, and in some cases can’t follow them without breaking backwards compatibility, so please don’t start calling tech support to complain about every case where our old code isn’t doing one of these things.

One of the key points of our best practices is that many commands belong to a larger family, and it’s best to fit in with that existing family. If you’re writing a Solaris-specific command, it should follow the Solaris Command Line Interface Paradigm guidelines (as listed in the Solaris Intro(1) man page), but GNU commands should instead follow the GNU Coding Standards, classic X11 commands should use the options described in the OPTIONS section of the X(7) man page, and so on.

Command names & paths

  • Most new commands should have names 3-9 letters long. Command names longer than 9 letters should be commands users rarely have to type in.
  • Follow common naming patterns, such as:
    Pattern  Usage
    *adm     Command to change current state & administer a subsystem
    *cfg     Command to make permanent configuration changes to a subsystem
    *info    Command to print information about objects managed by a subsystem
    *prop    Command to print properties of objects managed by a subsystem
    *stat    Command to print/monitor statistics on a subsystem
  • Commands run by normal users should be delivered in /usr/bin/. Commands normally only run by sysadmins should be delivered in /usr/sbin/. Commands only run by other programs, not humans, should be in an appropriate subdirectory under /usr/lib/. (Commands not delivered with the OS should instead use the appropriate subdirectory under /opt instead of /usr in the above paths.)

Options

  • Never provide an option to take a password or other sensitive data on the command line or in environment variables, as ps and the proc tools can show those to other users. (See Passing secrets to subprocesses.)
  • All commands should have --help and -? options to print recognized options/arguments/subcommands.
  • Option parsing should use one of the standard getopt() routines if at all possible. If you don’t use one, your custom parser will need to replicate a lot of things the standard routines provide for error checking & handling.
    • When reporting errors, be specific about which argument/option failed, don’t just dump usage output and make the user guess which part of the command line was wrong. (See WCAG2ICT #3.3.1. Examples of fixing this in X11 programs: bitmap, fslsfonts, mkfontscale, xgamma, xpr, xsetroot.)
    • If possible, suggest a correction: if an option is invalid, list the options that would be valid. Same for subcommands, arguments, etc. (See WCAG2ICT #3.3.3.)
  • Option flags should be similar across commands when possible.
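As a rough illustration of the option guidelines above, here is a minimal sketch built on getopt_long(). The options themselves (-o/--output, -v/--verbose) and the names tool_opts, print_usage and parse_args are made up for the example. It supports both --help and -?, names the specific option that failed, and lists the valid ones instead of just dumping usage.

```c
#include <getopt.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical options for an imaginary tool. */
struct tool_opts {
    const char *output;  /* -o FILE, --output FILE */
    int verbose;         /* -v, --verbose */
};

void print_usage(FILE *f, const char *prog)
{
    fprintf(f, "Usage: %s [-v|--verbose] [-o|--output file] [--help|-?]\n", prog);
}

/* Returns 0 on success, 1 if help was requested, -1 on a usage error. */
int parse_args(int argc, char **argv, struct tool_opts *opts, FILE *errf)
{
    static const struct option longopts[] = {
        { "output",  required_argument, NULL, 'o' },
        { "verbose", no_argument,       NULL, 'v' },
        { "help",    no_argument,       NULL, 'h' },
        { NULL, 0, NULL, 0 }
    };
    int c;

    memset(opts, 0, sizeof(*opts));
    opterr = 0;  /* suppress getopt's generic messages; we print our own */
    optind = 1;  /* start scanning at the first argument */

    /* The leading ':' makes getopt return ':' for a missing option argument. */
    while ((c = getopt_long(argc, argv, ":o:v", longopts, NULL)) != -1) {
        switch (c) {
        case 'o':
            opts->output = optarg;
            break;
        case 'v':
            opts->verbose = 1;
            break;
        case 'h':
            return 1;
        case ':':
            fprintf(errf, "%s: option -%c requires an argument\n", argv[0], optopt);
            return -1;
        case '?':
        default:
            if (optopt == '?')  /* the user literally typed -? */
                return 1;
            /* Assumption: as in glibc, optopt is 0 for an unknown long option. */
            if (optopt != 0)
                fprintf(errf, "%s: unknown option -%c\n", argv[0], optopt);
            else
                fprintf(errf, "%s: unknown option %s\n", argv[0], argv[optind - 1]);
            fprintf(errf, "%s: valid options are -v, -o file, --help\n", argv[0]);
            return -1;
        }
    }
    return 0;
}
```

A caller would print the usage text to stdout when parse_args() returns 1, and exit with a non-zero status when it returns -1.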

Subcommands

If you are writing a command that uses subcommands, a few conventions make it much easier to use.

Good examples to follow: hg, zfs, dladm

  • The help subcommand should list the other subcommands, but not overwhelm the user with pages of details on all of them. (Remember, the Solaris kernel text console has no scrollback and users with text-to-speech don’t want 10 minutes of output from it.)
    Good examples: hg, svccfg
  • The help foo or foo --help subcommands should list the options specific to that subcommand.
    Good examples: hg
  • Look at existing commands with similar subcommands and use similar names for your subcommands.

Text output

  • All functionality should be available when TERM=dumb. Use of color output, bold text, terminal positioning codes, etc. can be used to enhance output on more capable terminals, but users need to be able to use the system without it. Users may need to run different commands to get plain text interface instead of curses/terminal mode, such as ed instead of vi, or mailx vs. mutt, as long as it’s clearly documented what they need to run instead, but they must be able to get their work done in some way. (See WCAG2ICT #1.3.2, WCAG2ICT #1.4.1, & WCAG2ICT #1.4.3)
  • Text output is generally composed of messages and data. Messages are the text included in the program, such as status descriptions, error messages, and output headers; while data comes from the subsystem the command interacts with, and depends on the system in question.
    • Messages displayed to users should use gettext(3C) to allow translation & localization.
    • Errors should be printed to stderr, other output to stdout, unless specific output redirection options (such as logging errors to a file) are given.
  • Users should be able to disable any use of ASCII art, line drawing characters, figlet-style text and any other output other than plain text which a text-to-speech screen reader cannot figure out how to read, while not losing information, only formatting of it. (See WCAG2ICT #1.1.1 & WCAG2ICT #1.4.5)
  • Error messages should include the program name to help track down which program produced an error in a shell script, SMF method, etc. This is automatically done if you use the standard libc functions err, verr, errx, verrx, warn, vwarn, warnx, vwarnx (3C).
  • Parsable output should follow the design outlined in Creating Shell-Friendly Parsable Output:
    • Parsable output should require the user to specify the fields to output, via a -o or similar flag, so that new fields can be added to the command without breaking existing parsers.
    • Headers should be omitted in parsable output mode or when a flag such as -H is specified to omit them.
    • Parsable output should use a non-whitespace delimiter, such as “:” between fields.
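The parsable-output rules above can be sketched as follows. Everything named here is hypothetical: the svc_row record, its three fields and the print_parsable() helper are invented for illustration, and the field list stands in for whatever the user passed to -o. The sketch prints only the requested fields, separated by ":", with no header, and reports an unknown field by name on the error stream.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record with three printable fields. */
struct svc_row {
    const char *name;
    const char *state;
    unsigned long size;
};

/*
 * Print the fields named in the comma-separated list `fields` (the value
 * the user gave to -o) as one ':'-delimited line without a header.
 * Returns 0 on success, -1 on error (nothing is written to `out` then).
 */
int print_parsable(FILE *out, FILE *errf, const char *prog,
                   const char *fields, const struct svc_row *r)
{
    char line[256];
    size_t used = 0;
    int first = 1;
    const char *p = fields;

    line[0] = '\0';
    while (*p != '\0') {
        size_t len = strcspn(p, ",");
        size_t room = sizeof(line) - used;
        const char *sep = first ? "" : ":";
        int n;

        if (len == 4 && memcmp(p, "name", 4) == 0) {
            n = snprintf(line + used, room, "%s%s", sep, r->name);
        } else if (len == 5 && memcmp(p, "state", 5) == 0) {
            n = snprintf(line + used, room, "%s%s", sep, r->state);
        } else if (len == 4 && memcmp(p, "size", 4) == 0) {
            n = snprintf(line + used, room, "%s%lu", sep, r->size);
        } else {
            /* Be specific about what was wrong and what would be valid. */
            fprintf(errf, "%s: unknown field '%.*s'; valid fields are "
                    "name, state, size\n", prog, (int)len, p);
            return -1;
        }
        if (n < 0 || (size_t)n >= room) {
            fprintf(errf, "%s: output line too long\n", prog);
            return -1;
        }
        used += (size_t)n;
        first = 0;
        p += len;
        if (*p == ',')
            p++;
    }
    fprintf(out, "%s\n", line);
    return 0;
}
```

Because the caller must name the fields, a new field can be added later without changing what existing scripts receive. This sketch does not escape a ":" that appears inside a value; a real command would need a rule for that.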

Privileges on Solaris

User Interaction

  • If you offer an interactive command prompt mode, such as svccfg does for executing subcommands, consider using libtecla or similar support for command line editing in this mode.
  • Any operation that may permanently alter or destroy data should either have an “undo” option (such as rollback to prior snapshot) or have a mode offering the user a chance to confirm (such as the -i option to rm). (See WCAG2ICT #3.3.4)
  • Users should be able to configure timeout lengths for any operation that expects user interaction before a timeout expires. (See WCAG2ICT #2.2.1)
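For the confirmation guideline above, a minimal sketch of a prompt in the spirit of rm -i might look like this. The confirm() helper is hypothetical; it treats anything other than an answer starting with y or Y, including end of input, as "no", so a destructive operation never proceeds by accident.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Ask a yes/no question on `out` and read the answer from `in`.
 * Only an answer beginning with 'y' or 'Y' counts as yes; EOF, an empty
 * line, or anything else is treated as no.
 */
bool confirm(FILE *in, FILE *out, const char *question)
{
    char answer[16];

    fprintf(out, "%s [y/N] ", question);
    fflush(out);
    if (fgets(answer, sizeof(answer), in) == NULL)
        return false;
    return tolower((unsigned char)answer[0]) == 'y';
}
```

A localized tool would pass the question through gettext and match the answer against the locale's own idea of "yes" rather than a hard-coded 'y'.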

Implementation

References

So for some reason I decided to look at the displaylink usb3 adaptors today. (no good news).

This blog post is so I don't forget all of this when I page it out. Note: HDCP 1.0 being broken doesn't matter here; maybe HDCP 2.0 being a bit broken could be used, but I'm not sure how!

The DisplayLink USB3 protocol is based on the HDCP protocol. I've traced the first few packets, and it clearly looks like the host sends two packets

AKE_Init,
AKE_Transmitter_Info

and the device sends back
AKE_Send_Cert

at least.

AKE_Send_Cert contains a 522 byte certificate, containing a receiver id, public key, some misc bytes and a signature generated with the DCP LLC private key, that you have to verify.
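To make that layout concrete, here is a small sketch that splits the 522 bytes into their parts. The individual field sizes are an assumption based on my reading of the public HDCP 2.x specification (5-byte receiver ID, 128-byte modulus plus 3-byte exponent, 2 reserved bytes, 384-byte signature); they add up to 522 but should be checked against the spec. The struct and function names are made up, and none of this is taken from the DisplayLink traces.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
    CERT_RX_LEN        = 522,  /* total size, as seen in AKE_Send_Cert */
    CERT_RX_SIGNED_LEN = 138   /* assumed: everything before the signature */
};

/* Assumed layout; all members are byte arrays, so there is no padding. */
struct cert_rx {
    uint8_t receiver_id[5];    /* 40-bit receiver ID */
    uint8_t modulus[128];      /* 1024-bit RSA modulus of the public key */
    uint8_t exponent[3];       /* 24-bit RSA public exponent */
    uint8_t misc[2];           /* the "misc bytes" (reserved bits) */
    uint8_t signature[384];    /* 3072-bit DCP LLC signature */
};

/* Split a raw certificate into its fields. Returns 0, or -1 on bad length. */
int cert_rx_parse(const uint8_t *buf, size_t len, struct cert_rx *cert)
{
    const uint8_t *p = buf;

    if (len != CERT_RX_LEN)
        return -1;

    memcpy(cert->receiver_id, p, sizeof(cert->receiver_id));
    p += sizeof(cert->receiver_id);
    memcpy(cert->modulus, p, sizeof(cert->modulus));
    p += sizeof(cert->modulus);
    memcpy(cert->exponent, p, sizeof(cert->exponent));
    p += sizeof(cert->exponent);
    memcpy(cert->misc, p, sizeof(cert->misc));
    p += sizeof(cert->misc);
    memcpy(cert->signature, p, sizeof(cert->signature));
    return 0;
}
```

Under the same assumption, the signature would be checked over the first CERT_RX_SIGNED_LEN bytes of the raw certificate.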

So the HDCP v2.2 spec contains the DCP LLC public key, and I've written some code to verify the signature using openssl, but it totally fails to work. This is probably due to me doing something stupid, or not understanding what I'm doing. If you are openssl knowledgeable and want to look, the hack fest is
http://cgit.freedesktop.org/~airlied/dl3dev/

It might be that the DisplayLink devices use a different signing key than the DCP LLC one.

That repo contains some code to talk to the device (currently disabled) and do the initial sequence, along with an attempt to verify the cert.

Now once I get past this hurdle, the larger one seems to remain: the HDCP 2.0 spec has a global secret 128-bit value called LC128, that everyone who implements HDCP gets and hides somewhere. It's probably sitting in the DisplayLink driver in hex, but I'd hope they at least hide it better than that. It may also be supplied by the OS, Windows or OSX. (I've no clue yet). That value is used in the key negotiation.

Now it might be possible that DisplayLink allows non-HDCP encrypted data to be sent to the device, in which case win if I can find out where/how to do that, or it might be that the device requires HDCP and decrypts non-HDCP content before sending it over VGA/DVI. I've no ideas yet on that front either.

Ah well probably enough learning for today, I knew nothing about HDCP this morning, so I can't say it made my life any better learning about it :-P
October 30, 2014

Glamor Cleanup

Before I start really digging in to reworking the Render support in Glamor, I wanted to take a stab at cleaning up some cruft which has accumulated in Glamor over the years. Here's what I've done so far.

Get rid of the Intel fallback paths

I think it's my fault, and I'm sorry.

The original Intel Glamor code had Glamor implement accelerated operations using GL, and when those failed, the Intel driver would fall back to its existing code, either UXA acceleration or software. Note that it wasn't Glamor doing these fallbacks; instead the Intel driver had a complete wrapper around every rendering API, calling special Glamor entry points which would return FALSE if GL couldn't accelerate the specified operation.

The thinking was that when GL couldn't do something, it would be far faster to take advantage of the existing UXA paths than to have Glamor fall back to pulling the bits out of GL, drawing to temporary images with software, and pushing the bits back to GL.

And, that may well be true, but what we've managed to prove is that there really aren't any interesting rendering paths which GL can't do directly. For core X, the only fallbacks we have today are for operations using a weird planemask, and some CopyPlane operations. For Render, essentially everything can be accelerated with the GPU.

At this point, the old Intel Glamor implementation is a lot of ugly code in Glamor without any use. I posted patches to the Intel driver several months ago which fix the Glamor bits there, but they haven't seen any review yet and so they haven't been merged, although I've been running them since 1.16 was released...

Getting rid of this support let me eliminate all of the _nf functions exported from Glamor, along with the GLAMOR_USE_SCREEN and GLAMOR_USE_PICTURE_SCREEN parameters and the GLAMOR_SEPARATE_TEXTURE pixmap type.

Force all pixmaps to have exact allocations

Glamor has a cache of recently used textures that it uses to avoid allocating and de-allocating GL textures rapidly. For pixmaps small enough to fit in a single texture, Glamor would use a cache texture that was larger than the pixmap.

I disabled this when I rewrote the Glamor rendering code for core X; that code used texture repeat modes for tiles and stipples; if the texture wasn't the same size as the pixmap, then texturing would fail.

On the Render side, Glamor would actually reallocate pixmaps used as repeating texture sources. I could have fixed up the core rendering code to use this, but I decided instead to just simplify things and eliminate the ability to use larger textures for pixmaps everywhere.

Remove redundant pixmap and screen private pointers

Every Glamor pixmap private structure had a pointer back to the pixmap it was allocated for, along with a pointer to the Glamor screen private structure for the related screen. There's no particularly good reason for this, other than making it possible to pass just the Glamor pixmap private around a lot of places. So, I removed those pointers and fixed up the functions to take the necessary extra or replaced parameters.

Similarly, every Glamor fbo had a pointer back to the Glamor screen private too; I removed that and now pass the Glamor screen private parameter as needed.

Reducing pixmap private complexity

Glamor had three separate kinds of pixmap private structures, one for 'normal' pixmaps (those allocated by themselves in a single FBO), one for 'large' pixmaps, where the pixmap was tiled across many FBOs, and a third for 'atlas' pixmaps, which presumably would be a single FBO holding multiple pixmaps.

The 'atlas' form was never actually implemented, so it was pretty easy to get rid of that.

For large vs normal pixmaps, the solution was to move the extra data needed by large pixmaps into the same structure as that used by normal pixmaps and simply initialize those elements correctly in all cases. Now, most code can ignore the difference and simply walk the array of FBOs as necessary.

The other thing I did was to shrink the number of possible pixmap types from eight down to three. Glamor now exposes just these possible pixmap types:

  • GLAMOR_MEMORY. This is a software-only pixmap, stored in regular memory and only drawn with software. This is used for 1bpp pixmaps, shared memory pixmaps and glyph pixmaps. Most of the time, these pixmaps won't even get a Glamor pixmap private structure allocated, but if you use one of these with the existing Render acceleration code, that will end up wanting a private pointer. I'm hoping to fix the code so we can just use a NULL private to indicate this kind of pixmap.

  • GLAMOR_TEXTURE. This is a full Glamor pixmap, capable of being used via either GL or software fallbacks.

  • GLAMOR_DRM_ONLY. This is a pixmap based on an FBO which was passed from the driver, and for which Glamor couldn't get the underlying DRM object. I think this is an error, but I don't quite understand what's going on here yet...

Future Work

  • Deal with X vs GL color formats
  • Finish my new CompositeGlyphs code
  • Create pure shader-based gradients
  • Rewrite Composite to use the GPU for more computation
  • Take another stab at doing GPU-accelerated trapezoids
October 23, 2014
I've just spent another week hanging out with my Broadcom and Raspberry Pi teammates, and it's unblocked a lot of my work.

Notably, I learned some unstated rules about how loading and storing from the tilebuffer work, which has significantly improved stability on the Pi (as opposed to simulation, which only asserted about following half of these rules).

I got an intro on the debug process for GPU hangs, which ultimately just looks like "run it through simpenrose (the simulator) directly. If that doesn't catch the problem, you capture a .CLIF file of all the buffers involved and feed it into RTL simulation, at which point you can confirm for yourself that yes, it's hanging, and then you hand it to somebody who understands the RTL and they tell you what the deal is." There's also the opportunity to use JTAG to look at the GPU's perspective of memory, which might be useful for some classes of problems. I've started on .CLIF generation (currently simulation-environment-only), but I've got some bugs in my generated files because I'm using packets that the .CLIF generator wasn't prepared for.

I got an overview of the cache hierarchy, which pointed out that I wasn't flushing the ARM dcache to get my writes out into system L2 (more like an L3) so that the GPU could see it. This should also improve stability, since before we were only getting lucky that the GPU would actually see our command stream.

Most importantly, I ended up fixing a mistake in my attempt at reset using the mailbox commands, and now I've got working reset. Testing cycles for GPU hangs have dropped from about 5 minutes to 2-30 seconds. Between working reset and improved stability from loads/stores, we're at the point that X is almost stable. I can now run piglit on actual hardware! (it takes hours, though)

On the X front, the modesetting driver is now merged to the X Server with glamor-based X rendering acceleration. It also happens to support DRI3 buffer passing, but not Present's pageflipping/vblank synchronization. I've submitted a patch series for DRI2 support with vblank synchronization (again, no pageflipping), which will get us more complete GLX extension support, including things like GLX_INTEL_swap_event that gnome-shell really wants.

In other news, I've been talking to a developer at Raspberry Pi who's building the KMS support. Combined with the discussions with keithp and ajax last week about compositing inside the X Server, I think we've got a pretty solid plan for what we want our display stack to look like, so that we can get GL swaps and video presentation into HVS planes, and avoid copies on our very bandwidth-limited hardware. Baby steps first, though -- he's still working on putting giant piles of clock management code into the kernel module so we can even turn on the GPU and displays on our own without using the firmware blob.

Testing status:
- 93.8% passrate on piglit on simulation
- 86.3% passrate on piglit gpu.py on Raspberry Pi

All those opcodes I mentioned in the previous post are now completed -- sadly, I didn't get people up to speed fast enough to contribute before those projects were the biggest things holding back the passrate. I've started a page at http://dri.freedesktop.org/wiki/VC4/ for documenting the setup process and status.

And now, next steps. Now that I've got GPU reset, a high priority is switching to interrupt-based render job tracking and putting an actual command queue in the kernel so we can have multiple GPU jobs queued up by userland at the same time (the VC4 sadly has no ringbuffer like other GPUs have). Then I need to clean up user <-> kernel ABI so that I can start pushing my linux code upstream, and probably work on building userspace BO caching.
October 22, 2014

For those of you who, like me, missed this year's GStreamer Conference, the recorded talks are now available online thanks to Ubicast. Ubicast has been a tremendous partner for GStreamer over the years, making sure we have high quality talk recordings online shortly after the conference ends. So be sure to check out this year's batch of great GStreamer talks.

Btw, I also did a minor release of Transmageddon today, which mostly includes a couple of bugfixes and a few less deprecated widgets :)

For those of you reading this that didn’t know, I’ve had two months of paid vacation – one of the real perks of working for Intel. Today is the last day. It is as hard as I thought it would be.

Most of the vacation was spent vacationing. As I have access to none of the pictures at the moment, and I don’t want to make you jealous, I’ll skip over that. Toward the end though, I ended up at a coffee shop waiting for someone with nothing to do. I spent a little bit of time working on HOBos, and I thought it could be interesting to write about it.

WARNING: There is nothing novel here.

A brief history of HOBos

The HOBby operating system is an operating system project I started a while ago. I am a bit embarrassed that I started an OS. In my opinion, it’s one of the lamer tasks to take on because 1. everyone seems to do it; 2. there really isn’t a need, there are many operating systems with permissive licenses already; and 3. sites like OSDev have made much of the work trivial (I like to think that when I started there wasn’t quite as much info readily available, but that’s a lie).

Larrabee Av in Portland (not what the project was named after)


HOBos began while I was working on the Larrabee project. The team spent a lot of time optimizing the memory management, and the scheduler for the embedded software. I really wanted to work on these things full time. Unfortunately for me, having had a background in device drivers, I was often required to do other things. As a means to scratch the itch, I started HOBos after not finding anything suitable for my needs. The stuff I found was all either too advanced, or too rudimentary. When I was hired to work on i915, I decided that it was a better use of my free time. Since then, I’ve been tweaking things here or there, and I do try to make sure things still run with the latest QEMU and compilers at least once a year. The last actual feature I added was more than 1300 days ago:

	commit 1c9b5c78b22b97246989b00e807c9bf1fbc9e517
	Author: Ben Widawsky <ben@bwidawsk.net>
	Date: Sat Mar 19 21:19:57 2011 -0700

	basic backtrace

So back to the coffee shop. I tried to do something or other, got a hang, and didn’t want to fire up GDB.

Backtracing

HOBos had implemented backtraces since the original import from SVN (let’s pretend that means, since always). Obtaining a backtrace is actually pretty straightforward on x86.

The stack frame

The stack frame can be thought of as memory contents that are locally scoped to a function. Declaring a local variable will end up in the stack frame. A global variable will not. As functions call other functions, you end up with multiple frames. A stack is used because the last frames added are the first ones removed (this is not always true for things like exceptions, but nevermind that for now). The fact that a stack decrements is arbitrarily chosen, as far as I can tell. The following shows the stack when the function foo() calls the function bar().

Example Stackframe


The memory contents shown above are created as a result of two things. First is what the CPU implicitly does upon execution of the call instruction. The second is what the compiler generates. Since we’re talking about x86 here, the call instruction always pushes at least the return address. The second I’ll detail a bit more in a bit. Correlating this to the picture, the green (foo) and blue (bar) are creations of the compiler. The brown is automatically pushed on to the stack by the hardware, and is automatically popped from the stack on the ret instruction.

In the above there are two registers worth noting, RBP and RSP. RBP, the 64-bit extension of the original 8086 BP (Base Pointer) register, points to the beginning of the stack frame, i.e. it is the Frame Pointer. RSP, the extension of the 8086 SP (Stack Pointer), points to the end of the stack frame. By convention the Base Pointer doesn’t change while a function executes, and therefore it is often used as the reference for local variables stored on the stack; -100(%rbp) in the example above is one such reference.

Digging further into that disassembly above, one notices a pattern. Every function begins with:

push   %rbp       // Push the old RBP, RSP now points to this
mov    %rsp,%rbp  // Store RSP in RBP

Assuming this is the convention, it implies that at any given point during the execution of a function we can obtain the previous RBP by reading the current RBP and doing some processing. Specifically, reading RBP gives us the old Stack Pointer, which is pointing to the last RBP. As mentioned above, the x86 CPU pushed the return address immediately before the push %rbp – which means as we work backwards through the Base Pointers, we can also obtain the caller for the current stack frame. People have done really nice pictures on this – use your favorite search engine.

Here is the HOBos code (ignore the part about symbols for now):

void bt_fp(void *fp)
{
	do {
		/* A frame starts with the saved RBP; the return address
		 * pushed by call sits in the next 8-byte slot. */
		uint64_t *frame = fp;
		uint64_t prev_rbp = frame[0];
		uint64_t prev_ip = frame[1];
		struct sym_offset sym_offset = get_symbol((void *)prev_ip);
		printf("\t%s (+0x%x)\n", sym_offset.name, sym_offset.offset);
		fp = (void *)prev_rbp;
		/* Stop if rbp is not in the kernel
		 * TODO: need an upper bound too */
		if (fp <= (void *)KVADDR(DMAP_PML,0,0,0))
			break;
	} while(1);
}

As far as I know, all modern CPUs work in a similar fashion with differences sprinkled here and there. ARM for example has an LR register for the return address instead of using the stack.
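The frame-pointer walk can also be tried out safely in user space on a simulated stack. The sketch below is not the HOBos code: it uses array indices in place of addresses, and walk_frames() is a made-up name. Each frame is two 64-bit slots, the saved frame pointer followed by the return address, and the walk checks both a lower and an upper bound so that a corrupted chain terminates.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Walk a chain of frames in a simulated stack of `nslots` 64-bit slots.
 * stack[fp] holds the index of the caller's frame (the saved "RBP") and
 * stack[fp + 1] holds the return address. Collects up to `max_ips`
 * return addresses into `ips` and returns how many were found.
 */
size_t walk_frames(const uint64_t *stack, size_t nslots, size_t fp,
                   uint64_t *ips, size_t max_ips)
{
    size_t n = 0;

    while (n < max_ips && fp < nslots && nslots - fp >= 2) {
        uint64_t prev_fp = stack[fp];

        ips[n++] = stack[fp + 1];

        /*
         * The stack grows down, so a caller's frame must sit at a higher
         * index. Anything else (or an index past the end of the stack)
         * means we reached the outermost frame or the chain is corrupt.
         */
        if (prev_fp <= fp || prev_fp >= nslots)
            break;
        fp = (size_t)prev_fp;
    }
    return n;
}
```

With a stack where frame 0 points at frame 4, frame 4 points at frame 6, and frame 6 holds a saved pointer of 0, the walk yields the three return addresses in innermost-to-outermost order and then stops.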

ABI/Calling Conventions

The fact that we can work backwards this way is a byproduct of the calling convention. One example of an aspect of the calling convention is where to put the arguments to a function. Do they go on the stack, in registers, or somewhere else? In addition to these arguments, the way in which RBP and RSP are used is strictly a software construct that is part of the convention. As a result, it might not always be possible to get a backtrace if:

  1. This convention is not adhered to (for example, the code was built with -fomit-frame-pointer).
  2. The contents of RBP are destroyed.
  3. The contents of the stack are corrupted.

The rules for how arguments are passed to functions are also needed so that linkers and loaders (both static and dynamic) can correctly form an executable, or dynamically call a function. Since this isn’t really important to obtaining a backtrace, I will leave it there. Some architectures do provide a way to obtain useful backtrace information without caring about the calling convention: Intel’s Processor Trace for example.

Symbol information

The previous section will get us a reverse list of addresses for all function calls working backward from a given point during execution. But having names makes it much easier to quickly diagnose what is going wrong. There is a lot of information on the internet about this stuff. I’m simply providing all that’s relevant to my specific problem.

ELF Symbols (linking)

The ELF format provides everything we need (assuming things aren’t stripped). Glossing over the details (see this simple tool if you’re curious), we end up with two “sections” that tell us everything we need. They are conventionally named “.symtab” and “.strtab”, and are conveniently of type SHT_SYMTAB and SHT_STRTAB. The symbol table defines the information about each symbol (functions, variables, whatever). Part of the information is a name, which is an index into the string table. In the simplest case, these are provisions for inter-object linking. If I had defined foo() in foo.c, and bar() in bar.c, the compiled object files can be linked together, but the linker needs the information about the symbol bar (in this case) in order to do its job.

> readelf -S a.out
[Nr] Name Type Address Offset
[33] .symtab SYMTAB 0000000000000000 000015b8
[34] .strtab STRTAB 0000000000000000 00001c90
> readelf -S a.out | egrep "\.strtab|\.symtab" | wc -l
2
> strip a.out
> readelf -S a.out | egrep "\.strtab|\.symtab" | wc -l
0

Summing that up, if we have an entire ELF file, and the symbol and string tables haven’t been stripped, we’re good to go. However, ELF sections are not the unit in which an ELF loader decides what to load. The loader loads segments which are of type PT_LOAD. A segment is made up of 0 or more sections, plus padding. Since the Operating System is itself an ELF loaded by an ELF loader (the bootloader) we’re not in a good position. :(

> readelf -l a.out | egrep "\.strtab|\.symtab" | wc -l
0
ELF Loader


Debug Info

Note that what we want is not the same thing as debug information. If one wants to do source level debug, there needs to be some way of correlating a machine instruction to a line of source code. This is also a software abstraction, and there is usually a spec for it unless you are using some proprietary thing. It would technically be possible to include DWARF capabilities within the kernel, but I do not know of a way to get that info to the OS (see multiboot stuff for details).

From boot to symbols

The HOBos project implements a small Multiboot compliant bootloader called smallboot. When the machine starts up, boot firmware is loaded from a fixed location (this is currently done by SeaBIOS). The boot firmware then loads the smallboot bootloader. The bootloader will load the kernel (smallboot, and most modern bootloaders will do this through a text configuration file on the resident filesystem). In the HOBos case, the kernel is simply an ELF file. smallboot implements a basic ELF loader to load the kernel into memory and give execution over.

The multiboot specification is a standardized communication mechanism (for various things) from the bootloader to the Operating System (or any file really). One of these things is symbol information. Quoting the multiboot spec:

If bit 5 in the ‘flags’ word is set, then the following fields in the Multiboot information structure starting at byte 28 are valid:

             +-------------------+
     28      | num               |
     32      | size              |
     36      | addr              |
     40      | shndx             |
             +-------------------+

These indicate where the section header table from an ELF kernel is, the size of each entry, number of entries, and the string table used as the index of names. They correspond to the ‘shdr_*’ entries (‘shdr_num’, etc.) in the Executable and Linkable Format (elf) specification in the program header. All sections are loaded, and the physical address fields of the elf section header then refer to where the sections are in memory (refer to the i386 elf documentation for details as to how to read the section header(s)). Note that ‘shdr_num’ may be 0, indicating no symbols, even if bit 5 in the ‘flags’ word is set.
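On the OS side, reading those four fields could look roughly like the sketch below. It works on the raw bytes of the information structure rather than on any particular header file; the struct and function names are made up, and it assumes a little-endian machine (true for x86) with the 'flags' word at offset 0, as the spec lays it out.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MB_INFO_FLAG_ELF_SHDR (1u << 5)  /* bit 5 of the 'flags' word */
#define MB_INFO_ELF_SHDR_OFF  28         /* byte offset of 'num' */

/* The four consecutive 32-bit fields at bytes 28..43. */
struct mb_elf_shdr_table {
    uint32_t num;    /* number of section header entries */
    uint32_t size;   /* size of each entry */
    uint32_t addr;   /* physical address of the section header table */
    uint32_t shndx;  /* index of the section name string table */
};

/*
 * Copy the ELF section header info out of a raw Multiboot information
 * structure. Returns false if bit 5 is clear, the buffer is too short,
 * or num is 0 (which the spec allows and which means "no symbols").
 */
bool mb_get_elf_shdr(const uint8_t *info, size_t info_len,
                     struct mb_elf_shdr_table *out)
{
    uint32_t flags;

    if (info_len < MB_INFO_ELF_SHDR_OFF + sizeof(*out))
        return false;

    memcpy(&flags, info, sizeof(flags));  /* 'flags' is the first word */
    if (!(flags & MB_INFO_FLAG_ELF_SHDR))
        return false;

    memcpy(out, info + MB_INFO_ELF_SHDR_OFF, sizeof(*out));
    return out->num != 0;
}
```

The kernel would then treat addr as an array of num section headers, each size bytes long, and look for the symbol and string table entries among them.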

Since the beginning I had implemented these fields in the bootloader:

multiboot_info.flags |= MULTIBOOT_INFO_ELF_SHDR;
multiboot_info.u.elf_sec = *table;

Because the symbols weren’t in the ELF segments though, I was stumped as to how to get at the data once the OS is loaded. As it turned out, I didn’t actually read all 4 sentences and I had missed one very important part.

All sections are loaded, and the physical address fields of the elf section header then refer to where the sections are in memory

What the spec is dictating is that even though the sections are not in loadable segments, they shall exist within memory during the handover to the OS, and the section header information will be updated so that the OS knows where to find it. With this, the OS can copy out, or just make sure not to overwrite the info, and then get access to it.

    for (i = 0; i < shnum; i++) {
        __ElfN(Shdr) *sh = &shdr[i];
        if (sh->sh_size == 0)
            continue;

        if (sh->sh_addr) /* Already loaded */
            continue;

        ASSERT(sizeof(void *) == 4);
        *((volatile __ElfN(Addr) *)&sh->sh_addr) = sh->sh_offset + (uint32_t)addr;
    }

et cetera

The code for pulling out the symbols is quite a bit longer, but it can be found in kern/core/syms.c. With the given RBP unwinder near the top, we can easily get the IP for the caller. With that IP, we do a symbol lookup from the symbols we got via the multiboot info.
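For a feel of what that lookup involves, here is a minimal sketch. It is not the code from kern/core/syms.c: the function and struct names are hypothetical, and it assumes a Linux-style <elf.h>. It scans a symbol table for the function whose address range contains the given IP, and resolves the name through the string table.

```c
#include <elf.h>
#include <stddef.h>
#include <stdint.h>

struct sym_hit {
    const char *name;  /* points into the string table, or "??" */
    uint64_t offset;   /* distance of addr from the symbol's start */
};

/*
 * Find the function symbol covering `addr`. st_name is a byte offset
 * into the string table; st_value and st_size give the address range.
 */
struct sym_hit lookup_symbol(const Elf64_Sym *symtab, size_t nsyms,
                             const char *strtab, size_t strtab_len,
                             uint64_t addr)
{
    struct sym_hit hit = { "??", 0 };

    for (size_t i = 0; i < nsyms; i++) {
        const Elf64_Sym *s = &symtab[i];

        if (ELF64_ST_TYPE(s->st_info) != STT_FUNC)
            continue;               /* only functions are interesting */
        if (s->st_name >= strtab_len)
            continue;               /* name offset outside the string table */
        if (addr < s->st_value || addr - s->st_value >= s->st_size)
            continue;               /* addr is not inside this function */

        hit.name = strtab + s->st_name;
        hit.offset = addr - s->st_value;
        break;
    }
    return hit;
}
```

Symbols with a size of 0 (common for hand-written assembly) never match here; a fuller version would fall back to the nearest preceding symbol.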

Screenshot with backtrace


Inkscape Links

https://bwidawsk.net/blog/wp-content/uploads/2014/10/stackframe.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/10/loader.svg

October 21, 2014
GNOME has long had relationships with Linux kernel development, in that we would have some developers do our bidding, helping us solve hard problems. Features like inotify, memfd and kdbus were all originally driven by the desktop.

I've posted a wishlist of kernel features we'd like to see implemented on the GNOME Wiki, and referenced it on the kernel mailing-list.

I hope it sparks healthy discussions about alternative (and possibly existing) features, allowing us to make instant progress.
October 13, 2014
In the process of reverse engineering work for freedreno, I've cobbled together some interesting tools.  The earliest and most useful of which is cffdump.  (Named after some command-stream dumping debug code in the old kgsl android kernel driver, which originally inspired it.)  The cffdump tool knows how to parse out the "toplevel" command-stream stored as an .rd (re-dump) file, finding packets that load state memory, write registers, IB (indirect branch), etc.  The .rd file contains snapshots of gpu buffers, in order to chase gpu pointers at decode time.  It links in librnn from the nouveau envytools project for the decoding of individual registers, and a few other things.  It also calls out to the freedreno disassembler code to show inline disassembly of shaders, decodes vertex and constant (uniform) buffers, etc.  And even generates pretty color output (thanks to librnn):


A few months back, I added some basic lua scripting support to cffdump, mostly to assist in r/e work for adreno a4xx.  When invoked with the --script argument, cffdump would load the specified lua script, and call the 'draw' function it defines on each CP_DRAW_INDX opcode.  The choice of lua was mostly because it seemed fairly easy to integrate with .c code.

Since then, I've had the thought in the back of my mind that adding script bindings to integrate rnn register decode to lua would be useful for much more, such as writing a command-stream validator to check for inconsistent programming.  There are a number of places where inconsistencies between various register settings and such will result in gpu lockup.  The general adreno design philosophy appears to be to not ever dedicate transistors to making the driver writer's life easier... which for a SoC gpu is certainly the right choice, but it doesn't make things any easier for me.  Over time, I've discovered many of these rules, but they are mostly all in my head at the moment.  And from time to time, when adding new features to the gallium driver, I inadvertently break one or more of the rules and end up wasting time studying cmdstream dumps from the freedreno gallium driver to figure out what I did wrong.

So, on the way to XDC2014 I started hacking up support for register decoding from lua scripts.  It turns out that time in airports and airplanes, where I can't exactly break out an ifc6410 and hdmi monitor to do some driver work, is a good time to catch up on these sorts of projects.  Now I can do nifty things like:

-- load rnn database file for a320:
r = rnn.init("a320")

function start_cmdstream(name)
  io.write("START: " .. name .. "\n")
end

function draw(primtype, nindx)
  -- simple full register access:
  io.write("GRAS_CL_VPORT_XOFFSET: " .. r.GRAS_CL_VPORT_XOFFSET .. "\n")
  -- access boolean bitfield Z_ENABLE in RB_DEPTH_CONTROL register:
  io.write("RB_DEPTH_CONTROL.Z_ENABLE: " .. tostring(r.RB_DEPTH_CONTROL.Z_ENABLE) .. "\n")
  -- access ROP_CODE bitfield inside CONTROL register inside RB_MRT[] array:
  io.write("RB_MRT[0].CONTROL.ROP_CODE: " .. r.RB_MRT[0].CONTROL.ROP_CODE .. "\n")
end

function end_cmdstream()
  io.write("END\n")
end

function finish()
  io.write("FINISH\n")
end


which will generate output like:

[robclark@thunkpad:~/src/freedreno (master)]$ ./cffdump --script test.lua piglit.rd
Reading piglit.rd...
START: piglit.rd

GRAS_CL_VPORT_XOFFSET: 79.5
RB_DEPTH_CONTROL.Z_ENABLE: true
RB_MRT[0].CONTROL.ROP_CODE: 12

Currently it should handle all of the rnndb constructs that are used for adreno.  Ie. simple registers, arrays of simple registers, arrays of groups of registers, etc.  No support for "stripes" yet since those are not used for freedreno.

At the moment, all the script bindings are in freedreno.git/util/script.c but if there is some interest in this from nouveau or anyone else using librnn then it would be a good idea to try to refactor some of this into more generic code in librnn.  It would still need a bit of glue from the tool linking librnn to get at the actual register values.

Still needed are a few more script hooks (such as CP_LOAD_STATE) to do everything I need for a validator script.  Hopefully I find some time to work on that before the next conference ;-)

PS. I hope this post is at least a bit coherent.. I am still a bit jetlagged..
 


October 10, 2014

As you might know, due to invasive changes in PackageKit, I am currently rewriting the 3rd-party application installer Listaller. Since I am not the only one looking at the 3rd-party app-installation issue (there is a larger effort going on at GNOME, based on Lennart's ideas), it makes sense to redesign some concepts of Listaller.

Currently, dependencies and applications are installed into directories in /opt, and Listaller contains some logic to make applications find dependencies, and to talk to the package manager to install missing things. This has some drawbacks, like the need to install an application before using it, the need for applications to be relocatable, and application-installations being non-atomic.

Glick2

There is/was another 3rd-party app installer approach on the GNOME side, by Alexander Larsson, called Glick2. Glick uses application bundles (do you remember Klik from back in the days?) mounted via FUSE. This allows some neat features, like atomic installations and software upgrades, no need for relocatable apps and no need to install the application.

However, it also has disadvantages. Quoting the introduction document for Glick2:

“Bundling isn’t perfect, there are some well known disadvantages. Increased disk footprint is one, although current storage space size makes this not such a big issues. Another problem is with security (or bugfix) updates in bundled libraries. With bundled libraries its much harder to upgrade a single library, as you need to find and upgrade each app that uses it. Better tooling and upgrader support can lessen the impact of this, but not completely eliminate it.”

This is what Listaller does better, since it was designed to go to great lengths to avoid duplication of code.

Also, currently Glick doesn’t have support for updates and software-repositories, which Listaller had.

Combining Listaller and Glick ideas

So, why not combine the ideas of Listaller and Glick? In order to have Glick share resources, the system needs to know which shared resources are available. This is not possible if there is one huge Glick bundle containing all of the application’s dependencies. So I modularized Glick bundles to contain just one software component, which is e.g. GTK+ or Qt, GStreamer or could even be a larger framework (e.g. “GNOME 3.14 Platform”). These components are identified using AppStream XML metadata, which allows them to be installed from the distributor’s software repositories as well, if that is wanted.

If you now want to deploy your application, you first create a Glick bundle for it. Then, in a second step, you bundle your application bundle with its dependencies in one larger tarball, which can also be GPG signed and can contain additional metadata.

The resulting “metabundle” will look like this:

glick-libundle

This doesn’t look like we share resources yet, right? The dependencies are still bundled with the application requiring them. The trick lies in the “installation” step: While the application above can be executed right away without installing it, there will also be an option to install it. For the user, this will mean that the application shows up in GNOME-Shell’s overview or KDE’s Plasma launcher, gets properly registered with mimetypes and is (if installed for all users) available system-wide.

Technically, this will mean that the application’s main bundle is extracted and moved to a special location on the file system, as are the dependency bundles. If bundles already exist, they will not be installed again, and the new application will simply use the existing software. Since the bundles contain information about their dependencies, the system is able to determine which software is still needed and which can simply be deleted from the installation directories.
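
The "which software is needed and which can be deleted" decision is essentially a reachability check over the dependency information. The following Python sketch illustrates the idea only; it is not Listaller or Glick code, and the bundle names and the shape of the metadata (a plain mapping from bundle to dependencies) are assumptions made for the example.

```python
# Hypothetical metadata: bundle name -> names of the bundles it depends on.
INSTALLED = {
    "myapp": ["qt5", "gstreamer"],
    "qt5": [],
    "gstreamer": ["glib"],
    "glib": [],
    "oldlib": [],
}


def needed_bundles(installed, apps):
    """Return every bundle reachable from the given application bundles."""
    needed = set()
    stack = list(apps)
    while stack:
        name = stack.pop()
        if name in needed:
            continue  # already visited, also guards against dependency cycles
        needed.add(name)
        stack.extend(installed.get(name, []))
    return needed


def removable_bundles(installed, apps):
    """Bundles that no installed application depends on, directly or indirectly."""
    return set(installed) - needed_bundles(installed, apps)


if __name__ == "__main__":
    print(sorted(removable_bundles(INSTALLED, ["myapp"])))
```

Installing a second application that also depends on "qt5" would add only its own bundle to the mapping; the shared dependency stays installed until the last application that needs it is removed.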

If the application is started now, the bundles are combined and mounted, so the application can see the libraries it depends on.

Additionally, this concept allows secure updates of applications and shared resources. The bundle metadata contains a URL which points to a bundle repository. If new versions are released, the system’s auto-updater can automatically pick these up and install them. This means that, for example, the Qt bundle will receive security updates even if the developer who shipped it with his/her app didn’t think of updating it.

Conclusion

So far, no production code exists for this; I just have a proof-of-concept here. But I like the idea a lot, and I am thinking about going further in that direction, since it allows deploying applications on the Linux desktop as well as deploying software on servers in a way which plays nicely with the native package manager, and which does not duplicate much code (less risk of having not-updated libraries with security flaws around).

However, there might be issues I haven’t thought about yet. Also, it makes sense to look at GNOME to see how the whole “3rd-party app deployment” issue develops. In case I go further with Listaller-NEXT, it is highly likely that it will make use of the ideas sketched above (comments and feedback are more than welcome!).

October 09, 2014

I’ve written an article about the new graphics platform for Chromium called Ozone-GBM. I think that Ozone-GBM will play an important role in the Chromium and Linux graphics communities in general.  I hope you enjoy the read :) Please share it.

https://01.org/chromium/blogs/tiagovignatti/2014/chromium-ozone-gbm-explained


October 08, 2014

I just finished my talk at XDC 2014. The short version: UBO support in OpenGL drivers is terrible, and I have the test cases to prove it.

There are slides, a white paper, and, eventually, a video.

UPDATE: Fixed a typo in the white paper reported by Jonas Kulla on Twitter.

UPDATE: Direct link to video.

October 04, 2014
A number of people have recently asked what is new with freedreno.  It had been a while since posting an update.. and, well, not everyone watches mesa commit logs for fun, or watches #freedreno on freenode, so it seemed like time for another semi-irregular freedreno blog post.

The tl;dr version: recently it has been a lot of robustness, and bug fixes and smaller feature implementation for piglit, etc.  No one big exciting feature this time.. but lots of little things adding up to make freedreno on a3xx more complete and mature.

And an obligatory screenshot, just because:


(Yeah, webgl should probably be faster in chrome/chromium.. but not packaged for fedora, and chrome build system was invented by someone who wants to make compiling their src as difficult as possible.)

Mesa..

On the mesa/gallium driver front, the big news is that earlier this week we finally achieved a 90% pass ratio for piglit.  (In fact, 90.4%)  To put this in perspective, a little over six months ago freedreno was at just 50% pass.  Since June, we have added around 600 passing tests.  In fact in the last week, an additional ~50 tests are passing, which bumps us up to 91% pass.

For those who are not familiar with it, piglit is an open source OpenGL test suite.  Since the mesa developers are quite good about adding new test cases to piglit whenever adding a new feature/extension to mesa, it is a very comprehensive test suite.  The down side, if you could call it that, is that it has a lot more OpenGL tests compared to OpenGLES (at least for GLES < 3.0).  So getting the pass ratio up involved implementing (and in some cases emulating) a number of features that the blob ES-only driver does not support.  Fortunately enough of the registers and bitfields are known at this point that trial and error with educated guesses (and then see which guesses make piglit tests pass) has worked out reasonably well for some features.  Other features, like GL_CLAMP and two sided color, we need to emulate in the shader, which was implemented as a TGSI to TGSI pass in order to hopefully be useful for other gallium drivers for GLES class hardware.  (And, in fact both of those are things that at least some of the desktop drivers need to emulate as well.)

And big thanks to Ilia Mirkin for a lot of advice and some patches for the failing piglits.  Ilia has also started sending a lot of patches for the compiler to flesh out integer support, add new instructions (in particular texture sample instructions), and other things that will be needed for GL3/GLES3.  In fact as a result of his work, we are already at ~85% pass for GL3 despite missing some bullet-point features!

DDX..

On the xf86-video-freedreno front, over the last few months we have gained server managed fd's and OutputClass support (so that a sufficiently new xserver can auto-pick the correct driver, like we have had for a long time on desktop/pci systems).  And a hot-off-the-presses 1.3.0 release with a handful of robustness fixes.  I strongly recommend upgrading.

Kernel..

These last few kernel releases have seen a significant improvement in the state of apq8064/ifc6410 support upstream.  As of the 3.17 kernel, the main things missing to work on a pure-upstream[1] kernel are the rpm/rpm-regulators and iommu drivers.  The linaro folks have been a big help there.  In particular, their integration branch, which consists of latest upstream plus in-flight patches, is significantly easier than tracking all the relevant kernel mailing lists.

For drm/msm, the last few kernel releases have seen:  some basic gpu perf and logging debugfs features, DT support for mdp4 (display controller version in apq8064), LVDS and multi-monitor support for mdp4, and mdp5 v1.3 support from qcom for upcoming devices.  And of course bug fixes!

[1] Ie. Linus's tree... kernel-msm or AOSP is not upstream, for any android types who were confused about that.


October 02, 2014

So I am writing this blog entry using the current development snapshot of the Fedora Workstation and using Wayland as my display server. It is an important milestone for the Fedora Workstation, for Wayland and for me personally. There are many things here I am very happy about. First of all, this is a major milestone for what in some sense was the first and biggest engineering effort we kicked off under the Fedora Workstation banner, meaning it was an effort we decided to put our weight behind with the vision we have for the Fedora Workstation being the primary motivator for doing it. And it has been a big success in more ways than I expected. I think it is fair to say that the level of engagement and support from the wider community took me by surprise, and I want to state that if it wasn’t for all the incredible effort from the wider community pushing Wayland forward we would not have been able to provide something of this quality so soon.
The fact that Wayland now runs and works on non-Intel GPUs for this release, that XWayland is fully functional, that libinput is as far along as it is, are all thanks to the wider community. There are more people who have contributed than I can list, but I want to call out Adel Gadllah and Jonas Ådahl, who have contributed many crucial pieces to GNOME Shell, Wayland or libinput.

I would also like to specially say thank you to Jasper St.Pierre, because we would not be here today without his tireless effort on porting the GNOME Shell to Wayland. I think anyone who knows Jasper appreciates the amount of effort he puts in and the level of enthusiasm he brings to everything he does. So Jasper recently transferred from Red Hat to Endless Mobile and I am very happy that he will continue to contribute to both the GNOME Shell and Wayland as part of his job at Endless too, as he would be sorely missed both as a developer and as an individual otherwise.

Another person I want to call out at this point is of course Kristian Høgsberg, who created Wayland and got it to reach critical mass in terms of mindshare and functionality. Having been around linux for a long time I have seen efforts at replacing the X window system come and go so I know that achieving what Kristian has achieved here is not trivial at all. So a big thank you to Kristian for his incredible work and for his incredible level of persistence allowing Wayland to become a reality where so many other projects have failed.

Wayland in Fedora Workstation 21 is also an important milestone as it exemplifies the new development philosophy we are embarking on. Fedora has for a long time been known to be a linux distribution where a lot of new pieces become available first. The problem here is that it has also given Fedora a bit of a reputation for being not as dependable as some other distributions or operating systems, which has kept a lot of people away from Fedora that I think would be inclined to use it otherwise.

So we want to keep being a place where you do get access to new and exciting technologies first, but as you see with the Wayland effort we are now going to go the extra mile to make sure we offer these new technologies in a way that allows you to still use Fedora as your day to day working machine without worrying that these new features will hinder your work. So we will keep Wayland available as a separate non-default session until we feel very confident that our users are not going to be negatively impacted by the switch. This means that before we make Wayland the new default we want to fix and polish up the last remaining bits and pieces, make sure that performance is top notch, make sure all input hardware works flawlessly, and work with NVidia and AMD to help them make their binary drivers available for Wayland.

A crucial value for us at Red Hat and for the Fedora community is working closely with our upstreams. This means we always aim at working with our upstream communities to get the features we need or bugfixes we want included in the upstream releases which we then integrate into Fedora (and Red Hat Enterprise Linux). Working closely with the upstream communities enables us to achieve a lot more than we would be able to do on our own. In preparation for Fedora Workstation 21 we have of course done a lot of work on improving the general Fedora desktop experience which has meant a lot of work has gone into GNOME 3.14. And while most of our upstream contributions here have been about code, not all of it is code. A major part of creating a modern and polished desktop experience is making sure that the applications you run conform to a shared set of interface guidelines, both to bring a unique and polished look to the applications and to make using them easier, as things like keybindings or work patterns you learn with one application will transfer over to the next. To help accelerate that process for the Fedora Workstation we had Allan Day work with the GNOME community to create an updated set of Human Interface Guidelines for GNOME 3 and thus implicitly for the Fedora Workstation.

Another crucial improvement that you will see in Fedora Workstation 21 is on software installation. There has been a range of things in Fedora in regards to software installation that have been suboptimal. On the command line and library level there has been a piece of Fedora that I know a lot of people have disliked, many to such a strong degree that they have kept away from Fedora, namely Yum. Yum, for those who don’t know it, is the tool you used either directly or indirectly to install new software on a Fedora system. Yum used to be very slow and while it has gotten a lot better over the years it was still considered a bit of an eyesore by many. So Aleš Kozumplík and others have worked on writing a new set of tools to do the low level software handling over the last few years and I am happy to say that for Fedora Workstation 21 we will be using those tools to greatly improve the software installation and update experience. There is a new commandline tool called dnf that will work with the same command line parameters you know from yum, but will complete its task much quicker than before.

On the desktop Software installer side, Richard Hughes has been working on making the installer use the new libraries developed for dnf, called hawkey and libsolv, to provide you with a much smoother software installation experience in Fedora Workstation 21. So if you tried the preview we offered of the Software tool in Fedora 20, then I think you will find Software to be a lot more responsive in Fedora Workstation 21.
Of course a good software installer is not just about how nice the user interface looks or how quickly it can perform an installation, it is also very much a product of the quality of your installation metadata. Richard Hughes has a blog entry outlining the great progress being made on providing more and improved metadata, like application descriptions and screenshots, for Fedora Workstation 21. Ryan Lerch has been working with Richard to improve our coverage greatly, which means the quality of the software listings in Fedora Workstation 21 should be greatly improved over what you saw in Fedora 20. For more details and screenshots, Kalev Lember has a great writeup of the state of the Software installer in Fedora Workstation 21.

This also highlights one of the advantages of the new Fedora product model where we have one clear desktop product we are targeting: we can define operating system standards for things like application metadata and apply them to the system as a whole. So for Fedora 22 we expect to make appdata metadata a mandatory part of the application packaging for Fedora, ensuring that any desktop application packaged for Fedora is easily discoverable by our users. In the old ‘bucket of parts’ model these things would in practice not happen as there was no clear target that everyone was expected to aim for.

There has also been a lot of general user interface polish work happening, including on the toolkit level, with a lot of work being done by our UI designers to improve the default desktop theme called Adwaita. And since we want people to run all kinds of applications in Fedora Workstation 21 we are not only doing this for GTK+, but we also have Martin Briza working on bringing Adwaita to Qt for Fedora Workstation. We hope to get the Qt theme packaged soon, but for those interested in taking a look the Adwaita Qt code can be found here. In Fedora Workstation 21 we hope to cover Qt4 applications using the standard Adwaita theme, with wider support planned for Fedora Workstation 22, to cover more Qt versions and also make sure we have full coverage for the Adwaita Dark variant and accessibility versions. There is a chance we will miss the Fedora 21 cutoff date with this theme, but hopefully we can then get it included during the Fedora Workstation 21 lifespan.

We also worked on improving the shell animations. Things like animations might seem like they’re unimportant, but they contribute greatly to the general feeling of polish in the system. The team worked hard on improving these for Fedora Workstation 21, so in GNOME 3.14 you will for instance see that the animations in the shell overview have been greatly improved.

Last but not least I want to say that while I am very excited about what we have put together for Fedora Workstation 21 it is just the beginning. Being the first release under the new 3 product strategy, a lot of time and effort has gone into re-jigging the whole Fedora development process to cater for having 3 different products instead of one: changing the way the Fedora community organizes itself, getting contributors on board and re-aligned with the new products, and refocusing our internal development teams at Red Hat to think about their development process and goals with these 3 new products in mind. So my expectation is that as we go towards Fedora Workstation 22 the pace of innovation and progress will only pick up. So great things are ahead and I hope that once Fedora Workstation 21 is released, regardless of whether you are a long time Fedora user, a lapsed former Fedora user or someone who has never tried Fedora before, you will be willing to give it a try and hopefully become as excited about it as we are.

October 01, 2014

A few people have noticed a trend in the Oracle Solaris 11 update releases of delivering more and more Solaris commands as 64-bit binaries, so I figured it was time to write a detailed explanation to answer some of the questions and help prepare users & developers for further change, as it now becomes more critical to deliver shared objects in both 32-bit (ILP32) and 64-bit (LP64) versions.

I’d like to thank Ali, Gary, Margot, Rod, and Jeff for their feedback on this post, and most especially to Sharon for helping rework it to get to the most important bits first.

For the developers and administrators who are familiar with Solaris history, I’ll start with info about what you should be doing to make sure you’re ready for increased use of 64-bit software in Solaris. You can refer to the section that describes issues that arise when converting binaries — I cover examples from the X Window System packages for Oracle Solaris that I’ve done the conversion work for, and discuss LP64 conversion work by other Solaris engineers.

For those who need more background, see the sections about LP64 history in Solaris and the Application Binary Interface (ABI) differences between 32-bit and 64-bit binaries.

What do you need to do?

You need to do what you have always done: ensure you have both 32-bit and 64-bit versions of any shared objects. This requirement is being highlighted because the consequences of not providing both binary versions are becoming more disruptive in each subsequent Solaris release.

Development requirements

If you develop software for Solaris, the requirement is to provide both 32-bit and 64-bit versions of any shared objects you provide for other software to use, whether as libraries to link their programs against or as loadable objects for frameworks such as localized input methods, PAM service modules, custom crypt(3C) password hashes, or dozens of other shared object uses in Solaris.

Additionally, you should make sure all of your software is 64-bit clean, even if you still support it on 32-bit systems. While Solaris 10 still has more than 6 years left of active support, eventually you’ll move on to only supporting Solaris 11 and later and be able to go 64-bit only as well. The Solaris 64-bit Developer’s Guide and Solaris Studio Guide to Converting Applications for a 64-Bit Environment can help here. Oracle’s ISV support team is also looking to provide more assistance in future versions of the Oracle Solaris Preflight Applications Checker tool to help find possible issues for you.

User requirements

Administrators and users should verify that the software that they install and use provides both 32-bit & 64-bit shared objects. If they are not provided, contact the developers to provide both binary versions.
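
One rough way to check what a given shared object provides is to look at the class byte in its ELF header (the file(1) command reports the same information). The sketch below is an illustration rather than an Oracle-provided tool; the function names are made up, and it relies only on the standard ELF layout, where the fifth byte of the file (e_ident[EI_CLASS]) is 1 for 32-bit objects and 2 for 64-bit objects.

```python
import sys

ELF_MAGIC = b"\x7fELF"
EI_CLASS = 4  # index of the class byte within e_ident
ELF_CLASSES = {1: "32-bit", 2: "64-bit"}  # ELFCLASS32, ELFCLASS64


def elf_class(header):
    """Classify the leading bytes of a file; None if they are not a valid ELF header."""
    if len(header) <= EI_CLASS or header[:4] != ELF_MAGIC:
        return None
    return ELF_CLASSES.get(header[EI_CLASS])


def elf_class_of_file(path):
    """Read the start of the file at path and report its ELF class."""
    with open(path, "rb") as f:
        return elf_class(f.read(16))


if __name__ == "__main__":
    # Usage: pass the paths of the shared objects to inspect as arguments.
    for path in sys.argv[1:]:
        print(path, elf_class_of_file(path))
```

Running this over both the 32-bit and 64-bit install locations of a module shows quickly whether one of the two variants is missing.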

You should also keep an eye on the End of Features (EOF) Planned for Future Releases of Oracle Solaris page to see if something you may still need is being removed because it was determined not to be useful any more when it was reviewed for LP64 conversion. The page is updated regularly as new items make their way through the internal obsolescence review processes. If something appears there that would cause you major grief, let Solaris development know through your sales or support channels, so we can supply a better transition or replacement plan where possible.

And of course, if you find bugs in the Solaris 64-bit converted software, or find that you need a 64-bit version of a particular Solaris library or shared object that’s not already available, file bugs via Oracle Support or the Oracle Partner Network for Solaris.

Delivery of X commands as LP64 binaries

Because Solaris 11 no longer requires programs to run on a 32-bit kernel, and the minimum supported system has 64 times as much RAM as the first Ultra 1, Oracle Solaris can now ship programs directly as 64-bit binaries, and has started doing so. This better equips them to run on modern sized data sets while utilizing the full capabilities of today’s hardware.

For instance, before Solaris 11 shipped, I switched the default build flags for the X Window System programs in Solaris to 64-bit (with a few exceptions). Solaris has long shipped all the non-obsolete public libraries for X as both 32-bit and 64-bit, and the upstream open source versions of this software were made 64-bit clean long ago, originally by DEC for their Alpha workstations, and maintained since then by the BSD & Linux platforms that delivered 64-bit only distributions instead of multilib models. For the most part, this was just an implementation detail for the X programs - their functionality didn’t change, nor did their interfaces.

One case with a visible impact from becoming LP64-only was the Xorg server, which uses dlopen() to load shared object drivers for the specific hardware in use. Solaris 10 8/07 moved from Xorg 6.9 to 7.2 and started delivering Xorg as LP64 binaries for SPARC & x64, since video cards would soon have more VRAM than a 32-bit X server could access. This was also the first delivery of Xorg for SPARC, and did not need to support 32-bit SPARC platforms, so on SPARC Xorg was LP64 from the beginning. On x86, since Solaris was still supporting 32-bit platforms, the 64-bit Xorg was added alongside the existing 32-bit version. As of the 64-bit only release of Solaris 11, we could drop the 32-bit version of Xorg in Solaris, but because Solaris wasn’t the only source of the loadable driver modules (for instance, nvidia & VirtualBox both provided drivers for their graphics), we ensured that 64-bit versions were available, before announcing the end of support for 32-bit driver modules.

One less visible impact was in xdm, the old style login GUI for X, which Solaris still ships, though most Solaris systems use the more modern GNOME display manager (gdm) instead. In order to authenticate users, xdm uses the PAM framework, which allows administrators to configure a variety of login methods, such as Kerberos or SmartCards. Administrators can also install additional PAM methods to work with authentication systems Solaris doesn’t have built-in support for, or additional pluggable crypt modules to handle other password hashing methods. While PAM has supported 64-bit modules since Solaris 7, and the crypt framework has supported 64-bit modules since it was introduced in Solaris 9, most programs calling PAM and the crypt framework have been 32-bit. Installing only the 32-bit version of a crypt or PAM module thus worked most of the time. However, now that xdm is 64-bit, a 32-bit-only module will generate failure messages for users trying to login via xdm, because the system won’t find a 64-bit module to load. While xdm and PAM show that a non-existent binary can prevent system login, other less prominent 64-bit shared objects are going to be required over time, so providers of shared objects need to ensure they are installing both 32-bit & 64-bit versions of their software going forward.

Some users of custom input methods for different languages may also notice that their input methods are not available in 64-bit programs, since input methods are similarly provided as shared objects that are loaded via dlopen() and thus also have to exist in the same ABI variants as the calling programs.

Conversion of Solaris commands to LP64 binaries

This effort isn’t limited to X11 software — engineers in all areas of Solaris are evaluating the various programs and determining what needs to be done. They have two tasks - to ensure the program itself is 64-bit clean, and to ensure that the surrounding ecosystem (such as the PAM modules example above) is 64-bit ready. In some cases, they’ve found software Solaris doesn’t need to ship any longer, like the gettable tool for maintaining Internet host tables in the days before DNS, and are publishing End of Support notices for them. But for most cases, they’re working to deliver 64-bit versions into Solaris as time allows.

The results of this work are already visible — the number of LP64 programs in /usr/bin and /usr/sbin in the full Oracle Solaris package repositories has climbed each release:

Release         32-bit        64-bit       total
Solaris 11.0    1707 (92%)    144 (8%)     1851
Solaris 11.1    1723 (92%)    150 (8%)     1873
Solaris 11.2    1652 (86%)    271 (14%)    1923

(These numbers only count ELF binaries, not python programs, shell scripts, etc.)

In Solaris 11.0, X11 programs provided the bulk of the LP64 programs, but there were some from other subsystems, such as gdb, emacs, and the NSS crypto commands. Solaris 11.1 added LP64 crypto commands, including digest, decrypt, elfsign, pktool, and tpmadm; as well as other commands like top & gzip. Solaris 11.2 added a number of LP64 GNU programs, including the GNU coreutils and groff. New tools added in 11.2 were 64-bit in their first delivery, such as mlocate, the Intel GPU tools, and the jsl JavaScript Lint package. The bzip2 compression program was made LP64 in 11.2 at the request of a customer who uses it on large files and wanted the extra performance on x64.

Solaris 11.2 also adds Java 8 packages, alongside Java 7, and the now deprecated Java 6 packages. Java 8 for Solaris is 64-bit-only, dropping the 32-bit binary option found in previous Java releases. As noted under the Removal of 32-bit Solaris item in the Features Removed From JDK8 list, the Java plugin for web browsers was also removed from Java 8 for Solaris, because a 64-bit Java plugin cannot be run in the 32-bit Firefox browser.

Solaris Data Model History

Solaris 11.2 represents the latest step in a 20-year journey.

The original Solaris 2.x ABI used a data model referred to as ILP32, in which the size of the C language types “int”, “long”, and pointers are all 32-bit numbers. This data model matched the SPARC & x86 CPUs available at the time.

In 1995, Sun introduced its first SPARC v9 CPU, the UltraSPARC I, which offered 64-bit integers and addresses. Solaris 7 followed, bringing a second ABI to Solaris, using the LP64 model, in which “int” remained 32 bits wide, but “long” and pointer sizes doubled to 64 bits.
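As a rough illustration of the two data models, the following Python sketch classifies a set of type sizes as ILP32 or LP64 and reports which model the running interpreter's C ABI uses. The function names `data_model` and `current_data_model` are made up for this example; only the `ctypes` calls are real.

```python
import ctypes


def data_model(int_bytes, long_bytes, pointer_bytes):
    """Name the data model for the given C type sizes (in bytes)."""
    sizes = (int_bytes, long_bytes, pointer_bytes)
    if sizes == (4, 4, 4):
        return "ILP32"   # int, long and pointers are all 32 bits
    if sizes == (4, 8, 8):
        return "LP64"    # int stays 32 bits, long and pointers are 64 bits
    return "other"       # e.g. LLP64, which this post does not cover


def current_data_model():
    """Report the data model of the C ABI this interpreter was built for."""
    return data_model(
        ctypes.sizeof(ctypes.c_int),
        ctypes.sizeof(ctypes.c_long),
        ctypes.sizeof(ctypes.c_void_p),
    )


if __name__ == "__main__":
    print(current_data_model())
```

Run under a 32-bit and a 64-bit build of the interpreter on the same machine, this should print different answers, which is the same split that forces every library to exist in both variants.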

This affected both the kernel and user space code, and Solaris 7 delivered both 32-bit (ILP32) and 64-bit (LP64) kernels for UltraSPARC systems. The Solaris kernel implementation only allowed running 64-bit user space processes if a 64-bit kernel was loaded, but 32-bit user space software could be used with either a 32-bit or 64-bit kernel.

For user-space programs, the class (32 or 64-bit) of libraries must match the program that links to them. In order to support both 32 and 64-bit programs, it was necessary to provide both 32 and 64-bit versions of libraries. To preserve binary compatibility with existing 32-bit software, the libraries in directories such as /usr/lib were left as the 32-bit versions, and the 64-bit versions were added in a new sparcv9 subdirectory. On modern Linux systems, this approach is now called “multilib.”

Because the UltraSPARC CPUs did not impose any significant performance penalty when running existing 32-bit code on a 64-bit CPU, Sun continued to ship 32-bit user-space programs in the ILP32 ABI so that the same binary could be used on both 32-bit-only and 64-bit-capable CPUs, reducing development, testing, and support costs. It also reduced by half the memory requirements for pointers and long ints (thus more easily fitting them in the CPU caches) in programs which wouldn’t benefit from the larger sizes, an important consideration in systems with only 32MB of RAM. For the small number of programs that had to run 64-bit to be able to read 64-bit kernel structures or debug other 64-bit binaries, Solaris shipped both 32-bit & 64-bit versions, with isaexec used to execute the version matching the kernel in use.

On the x86 side, nearly a decade later AMD’s first AMD64 CPUs provided similar hardware support, and Solaris 10 introduced a matching LP64 ABI for x86 platforms in 2005, with the 64-bit libraries delivered in amd64 subdirectories. Even though 32-bit binaries did not run as fast as fully 64-bit binaries, Solaris followed the same model of providing mostly ILP32 programs to get the release to market faster; to save on development, test, and support costs; and to be consistent with Solaris on SPARC platforms.

In these transitional phases, 32-bit & 64-bit software and hardware coexisted. For SPARC, support for 32-bit-only CPUs was phased out in Solaris 10, when support for the last pre-sun4u platforms was dropped. Support for the 32-bit kernel was dropped at the same time, because all remaining supported hardware could run the 64-bit kernel. For x86, support for 32-bit-only CPUs was dropped in Solaris 11.

Therefore, as of the shipping of Oracle Solaris 11 in November 2011, all supported platforms have 64-bit kernels that can run 32-bit or 64-bit user space binaries.

Other Differences Between 32-bit & 64-bit ABIs

While the size of the types is the defining difference between the two ABI models, the opportunity to introduce a fresh new ABI after learning of the mistakes and limitations of the old ABI was hard to resist, and other changes were made as well. Significant differences include:

Additionally, since not all 64 bits in pointers are needed to address current memory sizes, unused ones can be used for additional tasks, such as some of the new features coming in the SPARC M7 CPU.

Some other platforms followed a different strategy. For example, Linux introduced “x32”, a new 32-bit ABI, and is considering a proposal for year-2038-safe ABIs. Engineers at Sun long ago debated a “large time” extension to the 32-bit ABI like the large file interfaces, but decided to concentrate efforts on LP64 instead. Because Solaris is not trying to maintain 32-bit kernel support for embedded devices, that is not a problem we have to solve as we move forward. The result should be a simpler system, which is always a benefit for developers and ISVs.

We don’t know yet when we’ll finish this journey, but hopefully we’ll get there before the industry starts converting software to run on CPUs with 128-bit addressing.


Disclaimer: The preceding is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.

Chromium (the browser) and DRI3

I got a note on IRC a week ago that Chromium was crashing with DRI3.

The Google team working on Chromium eventually sent me a link to the bug report. That's secret Google stuff, so you won't be able to follow the link, even though it's a bug in a free software application when running on free software drivers.

There's a bug report in the freedesktop bugzilla which looks the same to me.

In both cases, the recommended “fix” was to switch from DRI3 back to DRI2. That's not exactly a great plan, given that DRI3 offers better security between GPU-using applications, which seems like a pretty nice thing to have when you're running random GL applications from the web.

Chromium Sandboxing

I'm not entirely sure how it works, but Chromium creates a process separate from the main browser engine to talk to the GPU. That process has very limited access to the operating system via some fancy library adventures. Presumably, the hope is that security bugs in the GL driver would be harder to leverage into a remote system exploit.

Debugging in this environment is a bit tricky as you can't simply run chromium under gdb and expect to be able to set breakpoints in the GL driver. Instead, you have to run chromium with a magic flag which causes the GPU process to pause before loading the driver so you can connect to it with gdb and debug from there. You also need a flag that lets you see crashes within the GPU process, and the usual flag that causes chromium to ignore the GPU blacklist, which seems to always include the Intel driver for one reason or another:

$ chromium --gpu-startup-dialog --disable-gpu-watchdog --ignore-gpu-blacklist

Once Chromium starts up, it will print out a message telling you to attach gdb to the GPU process and send that process a SIGUSR1 to continue it. Now you can happily debug and get a stack trace when the crash occurs.

Locating the Bug

The bug manifested with a segfault at the first access to a DRI3-allocated buffer within the application. We've seen this problem in the past; whenever buffer allocation fails for some reason, the driver ignores the problem and attempts to de-reference through the (NULL) buffer pointer, causing a segfault. In this case, Chromium called glClear, which tried (and failed) to allocate a back buffer causing the i965 driver to subsequently segfault.

We should probably go fix the i965 driver to not segfault when buffer allocation fails, but that wouldn't provide a lot of additional information. What I have done is add some error messages in the DRI3 buffer allocation path which at least tell you why the buffer allocation failed. That patch has been merged to Mesa master, and should also get merged to the Mesa stable branch for the next stable release.

Once I had added the error messages, it was pretty easy to see what happened:

$ chromium --ignore-gpu-blacklist
[10618:10643:0930/200525:ERROR:nss_util.cc(856)] After loading Root Certs, loaded==false: NSS error code: -8018
libGL: pci id for fd 12: 8086:0a16, driver i965
libGL: OpenDriver: trying /local-miki/src/mesa/mesa/lib/i965_dri.so
libGL: Can't open configuration file /home/keithp/.drirc: Operation not permitted.
libGL: Can't open configuration file /home/keithp/.drirc: Operation not permitted.
libGL error: DRI3 Fence object allocation failure Operation not permitted

The two .drirc errors were just the sandbox preventing Mesa from using my GL configuration file. I'm not sure how that's a security problem, but it shouldn't harm the driver much.

The last error is where the problem lies. In Mesa, the DRI3 implementation uses a chunk of shared memory to hold a fence object that lets Mesa know when buffers are idle without using the X connection. That shared memory segment is allocated by creating a temporary file using the O_TMPFILE flag:

fd = open("/dev/shm", O_TMPFILE|O_RDWR|O_CLOEXEC|O_EXCL, 0666);

This call “cannot fail” as /dev/shm is used by glibc for shared memory objects, and must therefore be world writable on any glibc system. However, with the Chromium sandbox enabled, it returns EPERM.
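To make the failure mode concrete, here is a small Python sketch of the same kind of allocation that reports the error instead of carrying on with a missing buffer. It is not the Mesa code; `alloc_shm_fd` is a hypothetical helper, and the message text is only modelled on the log line above.

```python
import os


def alloc_shm_fd(directory="/dev/shm"):
    """Try to create an unnamed temporary file in `directory`.

    Returns (fd, None) on success or (None, message) on failure,
    so the caller can never mistake a failure for a usable descriptor.
    """
    o_tmpfile = getattr(os, "O_TMPFILE", None)
    if o_tmpfile is None:
        return None, "O_TMPFILE is not available on this platform"

    flags = o_tmpfile | os.O_RDWR | os.O_EXCL | getattr(os, "O_CLOEXEC", 0)
    try:
        fd = os.open(directory, flags, 0o666)
    except OSError as err:
        # Under a sandbox that rejects the call this is where EPERM shows up.
        return None, "fence object allocation failure: " + os.strerror(err.errno)
    return fd, None


if __name__ == "__main__":
    fd, error = alloc_shm_fd()
    if fd is None:
        print(error)
    else:
        print("allocated fd", fd)
        os.close(fd)
```

Returning an explicit error alongside the descriptor is one way to avoid the NULL-dereference pattern described earlier: the caller has to look at the failure before touching the buffer.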

Running Without a Sandbox

Now that the bug appears to be in the sandboxing code, we can re-test with the GPU sandbox disabled:

$ chromium --ignore-gpu-blacklist --disable-gpu-sandbox

And, indeed, without the sandbox getting in the way of allocating a shared memory segment, Chromium appears happy to use the Intel driver with DRI3.

Final Thoughts

I looked briefly at the Chromium sandbox code. It looks like it needs to know intimate details of the OpenGL implementation for every possible driver it runs on; it seems to contain a fixed list of all possible files and modes that the driver will pass to open(2). That seems incredibly fragile to me, especially when used in a general Linux desktop environment. Minor changes in how the GL driver operates can easily cause the browser to stop working.

September 30, 2014
Let's get the features in early!

If you're working on a Javascript application for GNOME, you'll be interested to know that you can now write GTK+ widget templates in gjs.

Many thanks to Giovanni for writing the original patches. And now to a small example:

const Lang = imports.lang;
const Gtk = imports.gi.Gtk;

const MyComplexGtkSubclass = new Lang.Class({
    Name: 'MyComplexGtkSubclass',
    Extends: Gtk.Grid,
    Template: 'resource:///org/gnome/myapp/widget.xml',
    Children: ['label-child'],

    _init: function(params) {
        this.parent(params);

        // Keep a reference to the child declared in the template.
        this._internalLabel = this.get_template_child(MyComplexGtkSubclass,
                                                      'label-child');
    }
});

And now you just need to create your widget:

let content = new MyComplexGtkSubclass();
content._internalLabel.set_label("My updated label");

You'll need gjs from git master to use this feature. And if you see anything that breaks, don't hesitate to file a bug against gjs in the GNOME Bugzilla.
September 24, 2014

Last November, Jonas Ådahl sent an RFC to the wayland-devel list about a common library to handle input devices in Wayland compositors called libinput. Fast-forward and we are now at libinput 0.6, with broad support for devices and features. In this post I'll give an overview of libinput and why it is necessary in the first place. Unsurprisingly I'll be focusing on Linux; for other systems please mentally add the required asterisks, footnotes and handwaving.

The input stack in X.org

The input stack as it currently works in X.org is a bit of a mess. I'm not even talking about the different protocol versions that we need to support and that are partially incompatible with each other (core, XI 1.x, XI2, XKB, etc.), I'm talking about the backend infrastructure. Let's have a look:

The graph above is a simplification of the input stack, focusing on the various high-level functionalities. The X server uses some device discovery mechanism (udev now, previously hal) and matches each device with an input driver (evdev, synaptics, wacom, ...) based on the configuration snippets (see your local /usr/share/X11/xorg.conf.d/ directory).

The idea of having multiple drivers for different hardware was great when hardware still mattered, but these days we only speak evdev and the drivers that are hardware-specific are racing the dodos to the finishing line.

The drivers can communicate with the server through the very limited xf86 DDX API, but there is no good mechanism to communicate between the drivers. It's possible, just not doable in a sane manner. Most drivers support multiple X server releases so any communication between drivers would need to take version number mixes into account. Only the server knows the drivers that are loaded, through a couple of structs. Knowledge of things like "which device is running synaptics" is obtainable, but not sensibly. Likewise, drivers can get to the drivers of other devices, but not reasonably so (the API is non-opaque, so you can get to anything if you find the matching header).

Some features are provided by the X server: pointer acceleration, disabling a device, mapping a device to monitor, etc. Most other features such as tapping, two-finger scrolling, etc. are provided by the driver. That leads to an interesting feature support matrix: synaptics and wacom both provide two-finger scrolling, but only synaptics has edge-scrolling. evdev and wacom have calibration support but they're incompatible configuration options, etc. The server in general has no idea what feature is being used, all it sees is button, motion and key events.

The general result of this separation is that of a big family gathering. It looks like a big happy family at first, but then you see that synaptics won't talk to evdev because of the tapping incident a couple of years back, mouse and keyboard have no idea what forks and knives are for, wacom is the hippy GPL cousin that doesn't even live in the same state and no-one quite knows why elographics keeps getting invited. The X server tries to keep the peace by just generally getting in the way of everyone so no-one can argue for too long. You step back, shrug apologetically and say "well, that's just how these things are, right?"

To give you one example, and I really wish this was a joke: The server is responsible for button mappings, tapping is implemented in synaptics. In order to support left-handed touchpads, gnome-settings-daemon sets the button mappings on the device in the server. Then it has to swap the tapping actions from left/right/middle for 1/2/3-finger tap to right/left/middle. That way synaptics submits right button clicks for a one-finger tap, which is then swapped back by the server to a left click.
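The double swap is easier to follow when written out. The sketch below models it with two lookup tables, using the X convention of button 1 = left, 2 = middle, 3 = right. The table and function names are invented for this illustration and do not correspond to anything in synaptics or the server.

```python
LEFT, MIDDLE, RIGHT = 1, 2, 3

# Driver side: number of fingers in a tap -> button the driver submits.
DEFAULT_TAP = {1: LEFT, 2: RIGHT, 3: MIDDLE}
SWAPPED_TAP = {1: RIGHT, 2: LEFT, 3: MIDDLE}

# Server side: left-handed button mapping applied to every button event.
LEFT_HANDED_MAP = {LEFT: RIGHT, MIDDLE: MIDDLE, RIGHT: LEFT}


def click_for_tap(fingers, tap_map, button_map):
    """Return the logical button a client sees for a tap."""
    driver_button = tap_map[fingers]      # what the driver submits
    return button_map[driver_button]      # what the server turns it into


if __name__ == "__main__":
    # Without the extra swap, a one-finger tap becomes a right click.
    print(click_for_tap(1, DEFAULT_TAP, LEFT_HANDED_MAP))   # 3
    # With the tap actions swapped, the two mappings cancel out.
    print(click_for_tap(1, SWAPPED_TAP, LEFT_HANDED_MAP))   # 1
```

Neither table knows about the other, which is the point of the example: the configuration daemon has to keep two unrelated settings in sync by hand.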

The X.org input drivers are almost impossible to test. synaptics has (quick guesstimate done with grep and wc) around 70 user-configurable options. Testing all combinations would be something around the order of 10^101 combinations, not accounting for HW differences. Testing the driver on its own is not doable, you need to fire up an X server and then run various tests against that (see XIT). But now you're not testing the driver, you're testing the whole stack. And you can only get to the driver through the X protocol, and that is 2 APIs away from the driver core. Plus, test results get hard to evaluate as different modules update separately.

So in summary, in the current stack features are distributed across modules that don't communicate with each other. The stack is impossible to test, partially thanks to the vast array of user-exposed options. These are largely technical issues, we control the xf86 DDX API and can break it when needed to, but at this point you're looking at something that resembles a rewrite anyway. And of course, don't you dare change my workflow!

The input stack in Wayland

From the input stack's POV, Wayland simply merges the X server and the input modules into one item. See the architecture diagram from the wayland docs:

evdev gets fed into the compositor and wayland comes out the other end. If life were so simple... Due to the X.org input modules being inseparable from the X server, Weston and other compositors started implementing their own input stack, separately. Let me introduce a game called feature bingo: guess which feature is currently not working in $COMPOSITOR. If you collect all five in a row, you get to shout "FFS!" and win the prize of staying up all night fixing your touchpad. As much fun as that can be, maybe let's not do that.

libinput

libinput provides a full input stack to compositors. It does device discovery over udev and event processing and simply provides the compositor with the pre-processed events. If one of the devices is a touchpad, libinput will handle tapping, two-finger scrolling, etc. All the compositor needs to worry about is moving the visible cursor, selecting the client and converting the events into wayland protocol. The stack thus looks something like this:

Almost everything has moved into libinput, including device discovery and pointer acceleration. libinput has internal backends for pointers, touchpads, tablets, etc. but they are not exposed to the compositor. More importantly, libinput knows about all devices (within a seat), so cross-device communication is possible but invisible to the compositor. The compositor still does configuration parsing, but only for user-specific options such as whether to enable tapping or not. And it doesn't handle the actual feature, it simply tells libinput to enable or disable it.

The graph above also shows another important thing: libinput provides an API to the compositor. libinput is not "wayland-y", it doesn't care about the Wayland protocol, it's simply an input stack. Which means it can be used as base for an X.org input driver or even Canonical's MIR.

libinput is very much a black box, at least compared to X input drivers (remember those 70 options in synaptics?). The X mantra of "mechanism, not policy" allows for interesting use-cases, but it makes the default 90% use-case extremely painful from a maintainer's and integrator's point of view. libinput, much like wayland itself, is a lot more restrictive in what it allows, specifically in the requirements it places on the compositor. At the same time, it aims for a better out-of-the-box experience.

To give you an example, the X.org synaptics driver lets you arrange the software buttons more-or-less freely on the touchpad. The default placement is simply a config snippet. In libinput, the software buttons are always at the bottom of the touchpad and also at the top of the touchpad on some models (Lenovo *40 series, mainly). The buttons are of a fixed size (which we decided on after analysing usage data), and you only get a left and right button. The top software buttons have left/middle/right matching the markings on the touchpad. The whole configuration is decided based on the hardware. The compositor/user don't have to enable them, they are simply there when needed.

That may sound restrictive, but we have a number of features on top that we can enable where required. Pressing both left and right software buttons simultaneously gives you a middle button click; the middle button in the top software button row provides wheel emulation for the trackstick. When the touchpad is disabled, the top buttons continue to work and they even grow larger to make them easier to hit. The touchpad can be set to auto-disable whenever an external mouse is plugged in.

And the advantage of having libinput as a generic stack also results in us having tests. So we know what the interactions are between software buttons and tapping, we don't have to wait for a user to trip over a bug to tell us what's broken.

Summary

We need a new input stack for Wayland, simply because the design of compositors in a Wayland world is different. We can't use the current modules of X.org for a number of technical reasons, and the amount of work it would require to get those up to scratch and usable is equivalent to a rewrite.

libinput now provides that input stack. It's the default in Weston at the time of this writing and used in other compositors or in the process of being used. It abstracts most of input away and more importantly makes input consistent across all compositors.

September 22, 2014
Originally in French, for a change :)

On Tuesday evening, September 23, a few of us will meet at around 6:30pm at the Smoking Dog for a few drinks, and carry on with an Indian dinner near the St-Jean metro station.

Feel free to sign up on the Wiki, whether you are GNOME users, developers, or simply friends of free software.

See you Tuesday!
September 21, 2014

Note: The purpose of this post is basically just so we have a link when this comes up in future bugreports.

Some stylus devices have two buttons on the stylus, plus the tip itself which acts as a button. In the kernel, these two are forwarded to userspace as BTN_STYLUS and BTN_STYLUS2. Userspace then usually maps those two into right and middle click, depending on your configuration. The pen itself uses BTN_TOOL_PEN when it goes into proximity.

The default stylus that comes with the Wacom Intuos Pro [1] has an eraser on the other side of the pen. If you turn the pen around it goes out of proximity and comes back in as BTN_TOOL_RUBBER. [2] In the wacom X driver we handle this accordingly, through two different devices available via the X Input Extension. For example, GIMP assigns different tools to each device. [3]

In the HID spec there are a couple of different fields (In Range, Tip Switch, Barrel Switch, Eraser and Invert) that matter here. Barrel Switch is the stylus button, In Range and Tip Switch are proximity and "touching the surface". Invert signals which side of the pen points down, and Eraser is triggered when the eraser touches the surface. In Wacom tablets, Invert is always on when the eraser touches because that's how the pens are designed.

Microsoft, in its {in}finite wisdom has decided to make the lower button an "eraser" button on its Surface 3 Pen. So what happens now is that once you press the button, In Range goes to zero, then to one in the next event, together with Invert. Eraser comes on once you touch the surface but curiously that also causes Invert to go off. Anyway, that's a low-level detail that will get handled. What matters to users is that on the press of that button, the pen goes virtually out of proximity, comes back in as eraser and then hooray, you can now use it as an eraser tool without having actually moved it. Of course, since the button controls the mode it doesn't actually work as button, you're left with the second button on the stylus only.

Now, the important thing here is: that's the behaviour you get if you have one of these devices. We could work around this in software by detecting the mode button, flipping bits here and there and trying to emulate a stylus button based on the mode switches. But we won't. The overlords have decreed and it's too much effort to hack around the intended behaviour for little gain.

[1] If marketing decides to rename products so that they need a statement like "Bamboo pen tablets are now Intuos. Intuos5 is now Intuos Pro." then you've probably screwed up.
[2] Isn't it nice to see some proper queen's English for a change? For those of you on the other side of some ocean: eraser.
[3] GIMP, rubber tool, I'm not making this up, seriously
Here is a small recap of the GNOME 3.14 features I worked on. Some are already well publicised, through blogs:
And obviously loads of bug fixes, and patch reviews. And I do mean loads :)

To look forward to

If all goes according to plan, I'll be able to merge the aforementioned automatic rotation support into systemd/udev. The kernel API is pretty bad, which makes the user-space code look bad...

The first parts of ebooks support in gnome-documents have already been written, scheduled for 3.16.

And my favourites

Note: With links that will open up like a Christmas present when GNOME 3.14 is released.

There are a lot of big, new features in GNOME 3.14. The Adwaita rewrite made it possible to polish the theme greatly. The captive portals support is very useful; the travellers among you will enjoy this (I certainly have!).

But my favourite new feature has to be the gestures support in gnome-shell. I'll make good use of that :)
September 19, 2014

This post is a summary, the full writeup with data is here (PDF). The texts are quite similar, so if you plan to read the paper you can skip this post.

Following pointer acceleration in libinput - an analysis, we decided to run an actual userstudy to gather some data on how our acceleration behaves, and - more importantly - test if a modified acceleration method is better.

We developed two new pointer acceleration methods (plus the one already in libinput). As explained previously, the pointer acceleration method is a function mapping input speed of a device into cursor speed in pixels. The faster one moves the mouse, the further the cursor moves per "mickey" (a 1-device unit movement). In a simplest example, input deltas of 1 may result in a 1 pixel movement, input deltas of 10 may result in a 30 pixel movement.
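As a toy illustration of such a mapping, the sketch below scales each input delta by a factor that depends on the device speed. This is not libinput's code and the constants (`threshold`, `slope`, `max_accel`) are invented; they are picked only so that the result reproduces the "delta 1 gives 1 pixel, delta 10 gives 30 pixels" example above.

```python
import math


def accel_factor(speed, threshold=1.0, slope=0.25, max_accel=3.0):
    """Map device speed (units/ms) to a unitless acceleration factor.

    Hypothetical linear profile: no acceleration up to `threshold`,
    then a straight ramp that is capped at `max_accel`.
    """
    if speed <= threshold:
        return 1.0
    return min(max_accel, 1.0 + (speed - threshold) * slope)


def accelerate(dx, dy, dt_ms):
    """Turn one input delta into a cursor delta in pixels."""
    speed = math.hypot(dx, dy) / dt_ms
    factor = accel_factor(speed)
    return dx * factor, dy * factor


if __name__ == "__main__":
    print(accelerate(1, 0, 1.0))    # (1.0, 0.0)
    print(accelerate(10, 0, 1.0))   # (30.0, 0.0)
```

The methods compared in the study differ mainly in the shape of the factor function, in particular in how quickly the cap is reached.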

The three pointer acceleration methods used in this study were nicknamed:

smooth:
A shortening of "smooth and simple", this method is used in libinput 0.5 as well as in the X.Org stack since ~2008.
stretched:
a modification of 'smooth' with roughly the same profile, but the maximum acceleration is applied at a higher speed. This method was developed by Hans de Goede and was very promising in personal testing.
linear:
a linear acceleration method with a roughly similar speed-to-acceleration profile as the first two. This method was developed to test if a simple function could achieve similar results as the more complex "smooth" and "stretched" methods.

The input data expected by all three methods is in units/ms. Touchpad devices are normalised to 400 dpi, other devices are left as-is. It is impossible to detect in software what resolution a generic mouse supports, so the behaviour of any acceleration method differs between devices. This is intended by the manufacturers: high-resolution devices are sold as "faster" for this reason.

The three pointer acceleration methods
As the graph shows, the base profile is roughly identical and the main difference is how quickly the maximum acceleration factor is reached.

Study description

The central component was a tool built on libinput that displays a full-screen white window with a round green target. Participants were prompted by GTK dialog boxes on the steps to take next. Otherwise the study was unsupervised and self-guided.

The task required participants to click on a round target with a radius of 15, 30 and 45 pixels. Targets were grouped, each "set" consisted of 15 targets of the same size. On a successful click within the target, a new target appeared on one out of 12 possible locations, arranged in a grid of 4x3 with grid points 300 pixels apart. The location of the target was randomly selected but was never on the same location twice in a row.


Screenshot of the study tool with the first target (size 45) visible.

Each participant was tested for two acceleration methods, each acceleration method had 6 sets of 15 targets (2 sets per target size, order randomised). The two acceleration methods were randomly selected on startup, throughout the study they were simply referred to as "first" and "second" acceleration method with no further detail provided. Acceleration changed after 6 sets (participants were informed about it), and on completion of all 12 sets participants had to fill out a questionnaire and upload the data.

Statistical concepts

A short foray into statistics to help explain the numbers below. This isn't a full statistics course, I'm just aiming to explain the various definitions used below.

The mean of a dataset is what many people call the average: all values added up divided by the number of values. As a statistical tool, the mean is easy to calculate but is greatly affected by outliers. For skewed datasets the median is more helpful: the middle value of the sorted data array (array[len/2]). The closer the mean and the median are together, the more symmetrical the distribution is.
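A short sketch with made-up click times shows how a single outlier drags the mean away while the median stays put. The numbers are invented for illustration and are not study data; `naive_median` is just the array[len/2] shorthand written out.

```python
import statistics

# Hypothetical click times in milliseconds, with one slow outlier.
click_times_ms = [310, 320, 330, 340, 2500]


def naive_median(values):
    """Middle element of the sorted values, i.e. array[len/2].

    This matches the textbook median only for an odd number of values;
    statistics.median() averages the two middle values otherwise.
    """
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


mean_ms = statistics.mean(click_times_ms)
median_ms = statistics.median(click_times_ms)

if __name__ == "__main__":
    print(mean_ms)                        # 760
    print(median_ms)                      # 330
    print(naive_median(click_times_ms))   # 330
```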

The standard deviation (SD) describes how far the data points spread from the mean. The smaller the SD, the closer together the data points are. The SD is also used to estimate causality vs randomly induced sampling errors. Generally, if the difference between two items is more than 2 standard deviations, there's a 95% confidence that this is a true effect, not just randomness (95% certainty is a widely accepted standard in this domain). That 95% directly maps to the p-value you may have seen in other studies. A p-value of less than 0.05 equals a less than 5% chance of random factors causing the data differences. That translates into "statistically significant".

The ANOVA method is a standard statistical tool for studies like ours (we're using one-way ANOVA only here; Wikipedia has an example). If multiple sets of samples differ in only a single factor (e.g. pointer acceleration method), we start with the so-called Null-Hypothesis of "the factor has no influence, all results are the same on average". Our goal is to reject that hypothesis so we can say that the factor did actually change things. If we cannot reject the Null-Hypothesis, either our factor didn't change anything or the results are caused by random influences. The tools for ANOVA compare the variation within each set to the differences between the mean values across the sets and spit out a p-value. As above, a p-value less than 0.05 means greater than 95% confidence that the Null-Hypothesis can be rejected, i.e. we can say our factor did cause those differences.

One peculiarity of ANOVA is that the sample sets have to be the same size. This affects our samples, more on that below.
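For readers who want to see what sits behind that p-value, here is a minimal sketch of the one-way ANOVA F statistic in plain Python. It is not the tooling used for the study; `one_way_anova_f` is a name made up here, and turning F into a p-value additionally needs the F distribution, which this sketch leaves out.

```python
def one_way_anova_f(*groups):
    """Return the F statistic for a one-way ANOVA over the given groups."""
    k = len(groups)                               # number of groups
    n = sum(len(g) for g in groups)               # total number of samples
    means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the samples around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within


if __name__ == "__main__":
    print(one_way_anova_f([1, 2, 3], [2, 3, 4], [3, 4, 5]))   # 3.0
```

A large F means the group means differ by more than the scatter inside the groups would explain, which is what leads to a small p-value.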

Study participants

An email was sent to three Red Hat-internal lists with a link to the study description. One list was a specific developer list, the other two lists were generic lists. As Red Hat employees, participants are expected to be familiar with Linux-based operating systems and the majority is more technical than the average user. The data collected does not make it possible to identify who took part in the study beyond the information provided in the questionnaire.

44 participants submitted results, 7 left-handed, 37 right-handed (no ambidextrous option was provided in the questionnaire). Gender distribution was 38 male, 6 female. Mean age was 33.3 years (SD 6.7) and participants had a mean 21.2 years of experience with mouse-like input devices (SD 4.9) and used those devices an average 58.1 hours per week (SD 20.0).

As all participants are familiar with Linux systems and thus exposed to the smooth acceleration method on their workstations, we expect a bias towards the smooth acceleration method.

Study data

Data was manually checked and verified, three result files were discarded for bugs or as extreme outliers, leaving us with 41 data files. The distribution of methods in these sets was: 27 for smooth, 25 for stretched and 30 for linear.

The base measurement was the so-called "Index of Difficulty" (ID), the number obtained by distance-to-target/width-of-target. This index gives an indication on how difficult it is to hit the target; a large target very close is easier to hit than a small target that is some distance away.


Illustration of the Index of Difficulty for a target.

In hindsight, the study was not ideally suited for evaluation based on ID. The targets were aligned on a grid and the ID based on the pointer position was very variable. As is visible in the graph below, there are few clear dividing lines to categorise the targets based on their ID. For the evaluation the targets were grouped into specific ID groups: ID < 4.2, ID < 8.4, ID < 12.9, ID < 16.9, ID < 25 and ID > 25. The numbers were selected simply because there are clear gaps between the ID clusters. This division results in uneven group sizes. (I ran the same calculations with different group numbers, it does not have any real impact on the results.)


ID for each target with the divider lines shown
The top ID was 36.44, corresponding to a 15px radius target 1093 pixels away, the lowest ID was 1.45, corresponding to a 45px radius target 130 pixels away.
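
As a rough check of these numbers, here is a minimal Python sketch. It assumes that "width-of-target" means the diameter (twice the radius), which reproduces the two extremes quoted above to within rounding. The function names and the reading of the group boundaries (4.2, 8.4, 12.9, 16.9, 25) are assumptions for illustration, not taken from the study's own scripts.

```python
import bisect

# Upper bounds of the ID groups as listed in the text (assumed reading).
GROUP_BOUNDS = [4.2, 8.4, 12.9, 16.9, 25]
GROUP_LABELS = ["<4.2", "<8.4", "<12.9", "<16.9", "<25", ">25"]


def index_of_difficulty(distance_px, radius_px):
    """ID = distance to target / width of target (width assumed = 2 * radius)."""
    return distance_px / (2 * radius_px)


def id_group(id_value):
    """Return the label of the ID group a value falls into."""
    return GROUP_LABELS[bisect.bisect_right(GROUP_BOUNDS, id_value)]


hardest = index_of_difficulty(1093, 15)  # 15px radius target, 1093 pixels away
easiest = index_of_difficulty(130, 45)   # 45px radius target, 130 pixels away
print(round(hardest, 2), id_group(hardest))
print(round(easiest, 2), id_group(easiest))
```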

Number of targets per ID group
As said above, ANOVA requires equal-sized sample sets. ANOVA was performed separately between the methods (i.e. smooth vs stretched, then smooth vs linear, then stretched vs linear). Before each analysis, the two data arrays were cut to be of equal length. For example, comparing smooth and stretched in the ID max group shortened the smooth dataset to 150 elements. The order of targets was randomised.
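
The cut-to-equal-length step and a pairwise comparison could look roughly like the sketch below. This is only an illustration: the simple truncation is an assumption about how the arrays were cut, the F statistic is the textbook one-way ANOVA formula rather than the study's actual script, and the sample values are made up.

```python
from statistics import mean


def cut_to_equal_length(a, b):
    """Truncate both samples to the length of the shorter one (assumed method)."""
    n = min(len(a), len(b))
    return list(a[:n]), list(b[:n])


def one_way_anova_f(*groups):
    """Textbook one-way ANOVA F statistic: between-group / within-group variance."""
    values = [v for g in groups for v in g]
    grand_mean = mean(values)
    k, n = len(groups), len(values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Made-up time-to-click samples (seconds) for two methods in one ID group.
smooth = [1.0, 2.0, 3.0, 9.9]
stretched = [2.0, 3.0, 4.0]

smooth_cut, stretched_cut = cut_to_equal_length(smooth, stretched)
f_value = one_way_anova_f(smooth_cut, stretched_cut)
print(len(smooth_cut), len(stretched_cut), f_value)
```

The F value would still need to be compared against the F distribution for the given degrees of freedom to decide on significance; that step is left out here.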

Study Results

The following factors were analysed:

  • Time to click on target
  • Movement efficiency
  • Overshoot

Time to click on target

Time to click on a target was measured as the time between displaying the target and clicking on it. This does not take reaction time into account, but there is no reliable way of measuring reaction time in this setup.


Mean time to click on target

As is visible, an increasing ID increases the time-to-click. At a quick glance, we can see that the smooth method is slower than the other two in most ID groups, with linear and stretched being fairly close together. However, the differences are only statistically significant in the following cases:

  • ID 8.4: linear is faster than smooth and stretched
  • ID 12.9: linear and stretched are faster than smooth
  • ID 25: linear and stretched are faster than smooth
In all other combinations, there is no statistically significant difference between the three methods, but overall a slight advantage for the two methods stretched and linear.

Efficiency of movement

The most efficient path from the cursor position to the target is a straight line. However, most movements do not follow that straight line for a number of reasons. One of these reasons is basic anatomy - it is really hard to move a mouse in a straight line due to the rotary action of our wrists. Other reasons may be deficiencies in the pointer acceleration method. To measure the efficiency, we calculated the distance to the target (i.e. the straight line) and compared that to all the deltas added up to the total movement. Note that the distance is to the center of the target, whereas the actual movement may be to any point in the target. So for short distances and large targets, there is a chance that a movement may be less than the distance to the target.


Straight distance to target vs. movement path shows the efficiency of movement.

The efficiency was calculated as movement-path/distance, then normalised to a percent value. A value of 10 thus means the movement path was 10% longer than the straight line to the target centre.
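
A minimal sketch of that calculation in Python, assuming the movement is available as a list of absolute pointer positions (the study itself summed up deltas; the function names and sample paths here are made up for illustration):

```python
import math


def path_length(points):
    """Sum of the distances between consecutive pointer positions."""
    return sum(
        math.hypot(x2 - x1, y2 - y1)
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


def extra_distance_percent(points, target_centre):
    """How much longer (in %) the path was than the straight line to the centre."""
    (sx, sy), (tx, ty) = points[0], target_centre
    straight = math.hypot(tx - sx, ty - sy)
    return (path_length(points) / straight - 1) * 100


# A detour: 3 right, then 4 up, to a target centred 5 units away.
detour = extra_distance_percent([(0, 0), (3, 0), (3, 4)], target_centre=(3, 4))

# Stopping at the near edge of a large target gives a negative value.
short = extra_distance_percent([(0, 0), (8, 0)], target_centre=(10, 0))
print(detour, short)
```

The second case illustrates the note above: because the distance is measured to the target centre, a movement that ends anywhere inside a large target can come out shorter than the straight line.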


Extra distance covered
Stretched seems to perform better than smooth and linear in all but one ID group, and smooth performs worse than linear in all but ID group 4.2. Looking at the actual values, however, shows that the large standard deviation prevents statistical significance. The differences are only statistically significant in the following cases:
  • ID 4.2: stretched is more efficient than smooth and linear
In all other combinations, there is no statistically significant difference between the three methods.

Overshoot

Somewhat similar to the efficiency of movement, the overshoot is the distance the pointer has moved past the target. It was calculated by drawing a line perpendicular to the direct path from the pointer position to the target's far side. If the pointer moves past this line, the user has overshot the target. The maximum distance between the line and the pointer shows how much the user has overshot the target.


Illustration of pointer overshooting the target.
The red line shows the amount the pointer has overshot the target.
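
One way to express this geometry in code is to project every pointer position onto the direct path and compare it with the position of the target's far side. The sketch below assumes circular targets and a list of absolute pointer positions; the names and sample data are illustrative, not the study's code.

```python
import math


def overshoot_px(points, target_centre, target_radius):
    """Maximum distance the pointer travelled past the target's far side."""
    sx, sy = points[0]
    tx, ty = target_centre
    distance = math.hypot(tx - sx, ty - sy)
    # Unit vector along the direct path from the start to the target centre.
    ux, uy = (tx - sx) / distance, (ty - sy) / distance
    # The perpendicular line sits at the far side of the target.
    far_side = distance + target_radius
    # Projecting a position onto the direct path tells us how far along it is.
    furthest = max((px - sx) * ux + (py - sy) * uy for px, py in points)
    return max(0.0, furthest - far_side)


# Pointer shoots 10px past the far side of a 10px-radius target, then returns.
overshot = overshoot_px([(0, 0), (50, 0), (120, 5), (100, 0)], (100, 0), 10)
# Pointer stops inside the target without passing the far side.
clean = overshoot_px([(0, 0), (95, 0)], (100, 0), 10)
print(overshot, clean)
```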
Overshoot was calculated in pixels, as % of the distance and as % of the actual path taken. Unsurprisingly, the graphs look rather the same so I'll only put one up here.

Overshoot in pixels by ID group
As the ID increases, the amount of overshooting increases too. Again the three pointer acceleration methods are largely the same, though linear seems to be slightly less affected by overshoot than smooth and stretched. The differences are only statistically significant in the following cases:
  • ID 4.2: if measured as percentage of distance, stretched has less overshoot than linear.
  • ID 8.4: if measured as percentage of movement path, linear has less overshoot than smooth.
  • ID 16.8: if measured as percentage of distance, stretched and linear have less overshoot than smooth.
  • ID 16.8: if measured as percentage of distance, linear has less overshoot than smooth.
  • ID 16.8: if measured in pixels, linear has less overshoot than smooth.
In all other combinations, there is no statistically significant difference between the three methods.

Summary

In summary, there is not a lot of difference between the three methods, though smooth has no significant advantage in any of the measurements. The race between stretched and linear is mostly undecided.

Questionnaire results

The above data was objectively measured. Equally important is the subjective feel of each acceleration method. At the end of the study, the following 14 questions were asked of each participant, with answers on a 5-point Likert scale ranging from "Strongly Disagree" to "Strongly Agree".

  1. The first acceleration method felt natural
  2. The first acceleration method allowed for precise pointer control
  3. The first acceleration method allowed for fast pointer movement
  4. The first acceleration method made it easy to hit the targets
  5. I would prefer the first acceleration method to be faster
  6. I would prefer the first acceleration method to be slower
  7. The second acceleration method felt natural
  8. The second acceleration method allowed for precise pointer control
  9. The second acceleration method allowed for fast pointer movement
  10. The second acceleration method made it easy to hit the targets
  11. I would prefer the second acceleration method to be faster
  12. I would prefer the second acceleration method to be slower
  13. The two acceleration methods felt different
  14. The first acceleration method was preferable over the second
The figure below shows that comparatively few "strongly agree" and "strongly disagree" answers were given, hinting that the differences between the methods were small.

Distribution of answers in the questionnaire
Looking at statistical significance, the questionnaire didn't really provide anything of value. Not even the question "The two acceleration methods felt different" provided any answers, and the question "The first acceleration method was preferable over the second" was likewise inconclusive. So the summary of the questionnaire is pretty much: on the whole none of the methods stood out as better or worse.

Likert frequencies for the question of which method is preferable

Summary

Subjective data was inconclusive, but the objective data goes slightly in favour of linear and stretched over the current smooth method. We didn't have enough sample sets to analyse separately for each device type, so from a maintainer's point of view the vote goes to linear. It allows replacing a rather complicated pointer acceleration method with 3 lines of code.

September 18, 2014
Prodded by Adam Williamson's fedlet work, and by my inability to get an Android phone to display anything, I bought an x86 tablet.

At first, I was more interested in buying a brand-name one, such as the Dell Venue 8 Pro Adam has, or the Lenovo Miix 2 that Benjamin Tissoires doesn't seem to get enough time to hack on. But all those tablets are around 300€ at most retailers around, and have a smaller 7 or 8-inch screen.

So I bought a "not exported out of China" tablet, the 10" Onda v975w. The prospect of getting a no-name tablet scared me a little. Would it be as "good" (read bad) as a PadMini or an Action Pad?


Vrrrroooom.


Well, the hardware's pretty decent, and feels rather solid. There's a small amount of light leakage on the side of the touchscreen, but not something too noticeable. I wish it had a button on the bezel to mimic the Windows button on some other tablets, but the edge gestures should replace it nicely.

The screen is pretty gorgeous and its high DPI triggers the eponymous mode in GNOME.

With help of various folks (Larry Finger, and the aforementioned Benjamin and Adam), I got the tablet to a state where I could use it to replace my force-obsoleted iPad 1 to read comic books.

I've put up a wiki page with the status of hardware/kernel support. It doesn't contain all my notes just yet (sound is working, touchscreen will work very very soon, and various "basic" features are being worked on).

I'll be putting up the fixed-up Wi-Fi driver and more instructions about installation on the Wiki page.

And if you want to make the jump, the tablets are available at $150 plus postage from Aliexpress.

Update: On Google+ and in comments of this blog, it was pointed out that the seller on Aliexpress was trying to scam people. All my apologies, I just selected the cheapest from this website. I personally bought it on Amazon.fr using NewTec24 FR as the vendor.
September 17, 2014
The more astute (or Wayland testing) amongst you will recognise mutter running a nested Wayland compositor. Yes, it means that Videos will work natively under Wayland.

Got to love indie films

It's not perfect, as I'm still seeing hangs within the Intel driver for a number of operations, but basic playback works, and the playback is actually within the same window and correctly hidden when in the overview ;)
September 16, 2014
We've added a few, but nonetheless interesting, features to Videos in GNOME 3.14.

Auto-rotation of videos

If you capture videos in portrait orientation on your phone, we are now able to rotate them automatically in the movie player, as well as in the thumbnails.

Better streaming

You can now seek anywhere inside streamed videos, even if we didn't download all the way to that point. That's particularly useful for long videos, or slow servers (or a combination of both).

Thumbnails generation

Finally, videos without thumbnails in your videos directory will have thumbnails automatically generated, without having to browse them in Files. This makes the first experience of videos more pleasing to the eye.

What's next?

We'll work on integrating Victor Toso's work on grilo plugins, to show information about the film or TV series on your computer, such as grouping episodes of a series together, showing genres, covers and synopsis for films.

With a bit of luck, we should also be able to provide you with more video content as well, through partners.
September 15, 2014

A Forest of X Server Changes

We've got about another month left in the X server merge window for 1.17 and I've written a small set of fixes which haven't been reviewed yet for merging. I thought I'd advertise them a bit and see if I couldn't encourage a few of you to take a look and see if they're useful, correct and complete.

All of these are in my personal X server repository:

git://people.freedesktop.org/~keithp/xserver.git

Cleaning up the X Registry

Branch: registry-fixes

I'll bet most of you don't even know about this code. It serves as a database mapping various X enumerations to strings to aid in diagnostics. For the security extensions, SECURITY and XSELinux, it holds names for all of the requests, events and errors in the core protocol and all registered extensions. For X-Resource, it has the names of the registered resource types.

The X registry gets the request, event and error data from a file, "protocol.txt", which is installed in /usr/lib/xorg/protocol.txt on my machine. It gets the resource names as a part of resource type allocation.

So, what's wrong with this? Three basic things:

  1. A simple bug -- protocol.txt is left open while the server runs. This consumes a file descriptor for no good reason.

  2. protocol.txt is read and parsed even if the security extensions aren't available. This wastes time and memory.

  3. The resource names are kept even if X-Resource isn't in use.

The fixes remove the configure options for including the registry code; these functions are only used by the above extensions, so we can tell whether to include the code based solely on whether the extensions are being built.

Getting rid of the TCP listener by default

Branch: listen-fixes

We've had the '-nolisten' option for a while now to disable inbound TCP connections. It's useful for security reasons, but we've never enabled this by default. This patch sequence provides configure options for each of the listen sockets (tcp, unix and local), leaves unix and local enabled by default and disables tcp by default.

A new option, '-listen', is added which allows the user to override the -nolisten defaults in case they actually want to use TCP connections to X.

Glamor bug fixes

Branch: glamor-fixes

This branch fixes two bugs:

  1. Scale a large pixmap down to a small pixmap. This happens when you display enormous images in a web page. Iceweasel sends the whole huge image to X and uses Render to scale it to the screen. If the image is larger than a single texture, the X server splits it up into tiles, but the code which tries to perform the merged scale is just broken. Five patches fix this.

  2. Shader-based trapezoids. This code uses area coverage to compute trapezoids. That violates the Render spec, which requires point sampling. Further, the performance of these trapezoids is lower than software (by a lot). This one patch removes the code.

Present bug fixes

Branch: present-fixes

A selection of small bug fixes:

  1. Clear pending flips at CloseScreen. This removes a reference to any pending flip pixmap, allowing it to be freed. Otherwise, we'll leak memory across server reset.

  2. Add support for PresentOptionCopy. This has been in the protocol spec for a while, and was completely trivial to implement. However, it never got done. One tiny little patch.

  3. Expose the Present API to drivers via sdksyms.sh. Until now, the present extension APIs have only been available inside the X server. This exposes them to drivers. This took a few cleanup patches first.

Use Present for Glamor XV

Branch: glamor-present-xv

Painting XV to the screen should be done at vblank time to avoid tearing. Present offers vblank synchronized operations. Hooking those two together required a few new present APIs to expose the vblank functionality outside of the present code, then a bit of glamor code to hook up that new API to the XV bits.

Switching Glamor to a GL core profile context

Branch: glamor-core-profile

This patch set is still in progress, but demonstrates how close we are. We'll be requiring OpenGL 3.3 for this so that we get texture swizzling, which is required for our single channel objects.

The changes present on the branch are:

  1. Switch single channel surfaces from GL_ALPHA to GL_RED.

  2. Use vertex array objects.

  3. Switch ephyr over to using a core 3.3 profile.

Still left to do is

  1. Switch Render code to VBOs

The core code uses VBOs everywhere, but the Render code doesn't. This means that all Render drawing fails, which makes the resulting server not very useful.

My main objective for getting this done is to reduce memory usage by about 16MB, which is the space allocated for software rendering in Mesa in case someone does something which the hardware doesn't handle, and that can only happen with some legacy OpenGL APIs.

Please help out!

All of these friendly little patches are looking for a bit of review so that they can get merged before the 1.17 window closes.

A lot of people read up on good Python practice, and there's plenty of information about that on the Internet. Many tips are included in the book I wrote this year, The Hacker's Guide to Python. Today I'd like to show a concrete case of code that I don't consider to be state of the art.

In my last article, where I talked about my new project Gnocchi, I wrote about how I tested, hacked on and then ditched whisper. Here I'm going to explain part of my thought process and a few things that raised my eyebrows when hacking on this code.

Before I start, please don't get the spirit of this article wrong. It's in no way a personal attack on the authors and contributors (whom I don't know). Furthermore, whisper is a piece of code that is in production in thousands of installations, storing metrics for years. While I can argue that I consider the code not to be following best practice, it definitely works well enough and is valuable to a lot of people.

Tests

The first thing that I noticed when trying to hack on whisper is the lack of tests. There's only one file containing tests, named test_whisper.py, and the coverage it provides is pretty low. One can check that using the coverage tool.

$ coverage run test_whisper.py
...........
----------------------------------------------------------------------
Ran 11 tests in 0.014s
 
OK
$ coverage report
Name           Stmts   Miss  Cover
----------------------------------
test_whisper     134      4    97%
whisper          584    227    61%
----------------------------------
TOTAL            718    231    67%


While one would think that 61% is "not so bad", taking a quick peek at the actual test code shows that the tests are incomplete. What I mean by incomplete is that they, for example, use the library to store values into a database, but they never check if the results can be fetched and if the fetched results are accurate. Here's a good reason one should never blindly trust the test coverage percentage as a quality metric.

When I tried to modify whisper, as the tests do not check the entire cycle of the values fed into the database, I ended up making incorrect changes while the tests still passed.

No PEP 8, no Python 3

The code doesn't respect PEP 8. A run of flake8 + hacking shows 732 errors… While it does not impact the code itself, it's more painful to hack on it than it is on most Python projects.

The hacking tool also shows that the code is not Python 3 ready, as there is usage of Python 2-only syntax.

A good way to fix that would be to set up tox and add a few targets for PEP 8 checks and Python 3 tests. Even if the test suite is not complete, starting by having flake8 run without errors and the few unit tests working with Python 3 should put the project in a better light.

Not using idiomatic Python

A lot of the code could be simplified by using idiomatic Python. Let's take a simple example:

def fetch(path,fromTime,untilTime=None,now=None):
  fh = None
  try:
    fh = open(path,'rb')
    return file_fetch(fh, fromTime, untilTime, now)
  finally:
    if fh:
      fh.close()


That piece of code could be easily rewritten as:

def fetch(path,fromTime,untilTime=None,now=None):
  with open(path, 'rb') as fh:
    return file_fetch(fh, fromTime, untilTime, now)


This way, the function actually looks so simple that one can even wonder why it should exist, but why not.

Usage of loops could also be made more Pythonic:

for i,archive in enumerate(archiveList):
  if i == len(archiveList) - 1:
    break


could be actually:

for i, archive in enumerate(itertools.islice(archiveList, len(archiveList) - 1)):


That reduces the code size and makes it easier to read through the code.
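
A quick check, on made-up archive data, that skipping the last element with itertools.islice visits the same (index, archive) pairs as the break-based loop. Note that islice() yields only the elements, so it has to be wrapped in enumerate() if the index is still needed. The sample list is hypothetical.

```python
import itertools

# Hypothetical (secondsPerPoint, points) archive definitions.
archiveList = [(60, 1440), (300, 288), (3600, 24)]

# Loop with an explicit break on the last element.
with_break = []
for i, archive in enumerate(archiveList):
    if i == len(archiveList) - 1:
        break
    with_break.append((i, archive))

# Same iteration using islice, wrapped in enumerate() to keep the index.
with_islice = list(
    enumerate(itertools.islice(archiveList, len(archiveList) - 1))
)

# A plain slice does the same for lists, at the cost of a copy.
with_slice = list(enumerate(archiveList[:-1]))

print(with_break == with_islice == with_slice)
```

One difference worth keeping in mind: with an empty list, len(archiveList) - 1 is negative and islice() raises a ValueError, whereas the break-based loop simply does nothing.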

Wrong abstraction level

Also, one thing that I noticed in whisper is that it abstracts its features at the wrong level.

Take the create() function, it's pretty obvious:

def create(path,archiveList,xFilesFactor=None,aggregationMethod=None,sparse=False,useFallocate=False):
  # Set default params
  if xFilesFactor is None:
    xFilesFactor = 0.5
  if aggregationMethod is None:
    aggregationMethod = 'average'

  #Validate archive configurations...
  validateArchiveList(archiveList)

  #Looks good, now we create the file and write the header
  if os.path.exists(path):
    raise InvalidConfiguration("File %s already exists!" % path)
  fh = None
  try:
    fh = open(path,'wb')
    if LOCK:
      fcntl.flock( fh.fileno(), fcntl.LOCK_EX )

    aggregationType = struct.pack( longFormat, aggregationMethodToType.get(aggregationMethod, 1) )
    oldest = max([secondsPerPoint * points for secondsPerPoint,points in archiveList])
    maxRetention = struct.pack( longFormat, oldest )
    xFilesFactor = struct.pack( floatFormat, float(xFilesFactor) )
    archiveCount = struct.pack(longFormat, len(archiveList))
    packedMetadata = aggregationType + maxRetention + xFilesFactor + archiveCount
    fh.write(packedMetadata)
    headerSize = metadataSize + (archiveInfoSize * len(archiveList))
    archiveOffsetPointer = headerSize

    for secondsPerPoint,points in archiveList:
      archiveInfo = struct.pack(archiveInfoFormat, archiveOffsetPointer, secondsPerPoint, points)
      fh.write(archiveInfo)
      archiveOffsetPointer += (points * pointSize)

    #If configured to use fallocate and capable of fallocate use that, else
    #attempt sparse if configure or zero pre-allocate if sparse isn't configured.
    if CAN_FALLOCATE and useFallocate:
      remaining = archiveOffsetPointer - headerSize
      fallocate(fh, headerSize, remaining)
    elif sparse:
      fh.seek(archiveOffsetPointer - 1)
      fh.write('\x00')
    else:
      remaining = archiveOffsetPointer - headerSize
      chunksize = 16384
      zeroes = '\x00' * chunksize
      while remaining > chunksize:
        fh.write(zeroes)
        remaining -= chunksize
      fh.write(zeroes[:remaining])

    if AUTOFLUSH:
      fh.flush()
      os.fsync(fh.fileno())
  finally:
    if fh:
      fh.close()


The function is doing everything: checking if the file doesn't exist already, opening it, building the structured data, writing this, building more structure, then writing that, etc.

That means that the caller has to give a file path, even if it just wants a whisper data structure to store itself elsewhere. StringIO() could be used to fake a file handler, but it will fail if the call to fcntl.flock() is not disabled, and it is inefficient anyway.

There are a lot of other functions in the code, such as setAggregationMethod(), that mix the handling of the files (even doing things like os.fsync()) with the manipulation of structured data. This is definitely not a good design, especially for a library, as it turns out that reusing the functions in a different context is near impossible.
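
To illustrate the kind of separation I have in mind, here is a minimal sketch that builds the header as plain bytes and leaves the writing to the caller. It is not whisper's code: build_header() and write_header() are hypothetical names, and the struct formats are assumptions chosen to mirror the fields packed in create() above.

```python
import io
import struct

# Assumed struct formats, mirroring the fields packed in create().
METADATA_FORMAT = "!2LfL"     # aggregation type, max retention, xFilesFactor, archive count
ARCHIVE_INFO_FORMAT = "!3L"   # offset, seconds per point, points
POINT_SIZE = struct.calcsize("!Ld")  # one (timestamp, value) point


def build_header(archive_list, x_files_factor=0.5, aggregation_type=1):
    """Return the packed header as bytes; no file handling involved."""
    max_retention = max(spp * points for spp, points in archive_list)
    header = struct.pack(METADATA_FORMAT, aggregation_type, max_retention,
                         float(x_files_factor), len(archive_list))
    # Archive data starts right after the metadata and archive info entries.
    offset = (struct.calcsize(METADATA_FORMAT)
              + struct.calcsize(ARCHIVE_INFO_FORMAT) * len(archive_list))
    for spp, points in archive_list:
        header += struct.pack(ARCHIVE_INFO_FORMAT, offset, spp, points)
        offset += points * POINT_SIZE
    return header


def write_header(fh, archive_list, **kwargs):
    """Write the header to any file-like object opened in binary mode."""
    fh.write(build_header(archive_list, **kwargs))


# Works with an in-memory buffer just as well as with a real file.
buf = io.BytesIO()
write_header(buf, [(60, 1440), (300, 288)])
data = buf.getvalue()
print(len(data), struct.unpack(METADATA_FORMAT, data[:16]))
```

With this split, locking, fsync and pre-allocation would stay in a thin file-oriented layer, and the structure-building part can be tested or reused without touching the disk.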

Race conditions

There are race conditions, for example in create() (see added comment):

if os.path.exists(path):
  raise InvalidConfiguration("File %s already exists!" % path)
fh = None
try:
  # TOO LATE I ALREADY CREATED THE FILE IN ANOTHER PROCESS YOU ARE GOING TO
  # FAIL WITHOUT GIVING ANY USEFUL INFORMATION TO THE CALLER :-(
  fh = open(path,'wb')


That code should be:

import errno

try:
  # O_EXCL makes the check and the creation a single atomic operation
  fh = os.fdopen(os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL), 'wb')
except OSError as e:
  if e.errno == errno.EEXIST:
    raise InvalidConfiguration("File %s already exists!" % path)
  raise


to avoid any race condition.
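
Here is a self-contained sketch of that pattern that can be run as-is. The InvalidConfiguration class is a stand-in for whisper's exception of the same name, and create_exclusive() and the file name are made up for the demonstration.

```python
import errno
import os
import tempfile


class InvalidConfiguration(Exception):
    """Stand-in for whisper's exception of the same name."""


def create_exclusive(path):
    """Create path and open it for writing, failing if it already exists."""
    try:
        # O_CREAT | O_EXCL: the existence check and creation happen atomically.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    except OSError as e:
        if e.errno == errno.EEXIST:
            raise InvalidConfiguration("File %s already exists!" % path)
        raise  # any other error (permissions, missing directory...) is kept
    return os.fdopen(fd, 'wb')


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "metric.wsp")
    with create_exclusive(path) as fh:
        fh.write(b"\x00")
    try:
        create_exclusive(path)
    except InvalidConfiguration:
        second_attempt = "refused"
    else:
        second_attempt = "overwrote"

print(second_attempt)
```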

Unwanted optimization

We saw earlier the fetch() function that is barely useful, so let's take a look at the file_fetch() function that it's calling.

def file_fetch(fh, fromTime, untilTime, now = None):
  header = __readHeader(fh)
  [...]


The first thing the function does is to read the header from the file handler. Let's take a look at that function:

def __readHeader(fh):
  info = __headerCache.get(fh.name)
  if info:
    return info

  originalOffset = fh.tell()
  fh.seek(0)
  packedMetadata = fh.read(metadataSize)

  try:
    (aggregationType,maxRetention,xff,archiveCount) = struct.unpack(metadataFormat,packedMetadata)
  except:
    raise CorruptWhisperFile("Unable to read header", fh.name)
  [...]


The first thing the function does is to look into a cache. Why is there a cache?

It actually caches the header, using the file path (fh.name) as the index. Except that if one decides, for example, not to use a file and to cheat using StringIO, then the object does not have any name attribute. So this code path will raise an AttributeError.

One has to set a fake name manually on the StringIO instance, and it must be unique so nobody messes with the cache:

import StringIO
 
packedMetadata = <some source>
fh = StringIO.StringIO(packedMetadata)
fh.name = "myfakename"
header = __readHeader(fh)


The cache may actually be useful when accessing files, but it's definitely useless when not using files. And it's not necessarily true that the complexity (even if small) that the cache adds is worth it. I doubt most whisper-based tools are long-running processes, so the cache that is really used when accessing the files is the one handled by the operating system kernel, and this one is going to be much more efficient anyway, and shared between processes. There's also no expiry of that cache, which could end up with tons of memory used and wasted.

Docstrings

None of the docstrings are written in a parsable syntax such as the one Sphinx uses. This means you cannot generate any documentation in a nice format that a developer using the library could read easily.

The documentation is also not up to date:

def fetch(path,fromTime,untilTime=None,now=None):
  """fetch(path,fromTime,untilTime=None)
  [...]
  """

def create(path,archiveList,xFilesFactor=None,aggregationMethod=None,sparse=False,useFallocate=False):
  """create(path,archiveList,xFilesFactor=0.5,aggregationMethod='average')
  [...]
  """


This is something that could be avoided if a proper format was picked to write the docstring. A tool could be used to get notified when there's a divergence between the actual function signature and the documented one, like a missing argument.
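
As a rough idea of what such a check could do, here is a small sketch based on inspect.signature from the standard library. It only understands docstrings whose first line repeats the signature, as in the examples above; the helper names are made up, and a real project would more likely rely on an existing documentation linter.

```python
import inspect
import re


def documented_params(func):
    """Parameter names found in a first docstring line such as 'f(a,b=None)'."""
    lines = (func.__doc__ or "").strip().splitlines()
    match = re.match(r"\w+\((.*)\)", lines[0]) if lines else None
    if not match:
        return []
    return [p.split("=")[0].strip()
            for p in match.group(1).split(",") if p.strip()]


def missing_from_docstring(func):
    """Parameters present in the real signature but absent from the docstring."""
    documented = documented_params(func)
    return [name for name in inspect.signature(func).parameters
            if name not in documented]


def fetch(path, fromTime, untilTime=None, now=None):
    """fetch(path,fromTime,untilTime=None)

    Docstring deliberately left out of date, as in the example above.
    """


missing = missing_from_docstring(fetch)
print(missing)
```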

Duplicated code

Last but not least, there's a lot of code that is duplicated around in the scripts provided by whisper in its bin directory. These scripts should be very lightweight and be using the console_scripts facility of setuptools, but they actually contain a lot of (untested) code. Furthermore, some of that code is partially duplicated from the whisper.py library, which goes against DRY.

Conclusion

There are a few more things that made me stop considering whisper, but these are part of the whisper features, not necessarily code quality. One can also point out that the code is very condensed and hard to read, and that's a more general problem about how it is organized and abstracted.

A lot of these defects are actually points that made me start writing The Hacker's Guide to Python a year ago. Running into this kind of code makes me think it was a really good idea to write a book of advice on writing better Python code!


September 14, 2014

And not the kingdom of Spain unfortunately (unfortunately because I miss it and because it's still a kingdom). In a few months (not sure about specific dates yet, probably in early 2015) I will be moving back to the United Kingdom, this time to the larger metropolis, London. Don't panic, I will still be with Red Hat, there won't be a lot of changes on that front. In the meantime I will settle back in Gran Canaria and will be flying back and forth on a monthly basis.

I must note that when I made the decision to move to Czech my plan was: "I do not have a plan", just enjoying it and trying to make the best of it without thinking about deadlines as to when to move back to Spain. Red Hat has been a very welcoming company in which I feel just like home, Brno has been a very welcoming city, and this is definitely a part of Europe that is worth experiencing. I've met terrific people during this period both inside and outside Red Hat.

There was, however, a little problem.

Something altered the mid-term plans: a few months before I moved, when the decision was already made, I met someone very special with whom I now want to share my life. After 16 months of a long-distance relationship, it was high time to find a place where we could be together. After months of planning and considering options, London presented itself as the spot to make the move, as she found a pretty good job there.

While I am going to miss sharing the office on a daily basis with awesome people, I am looking forward to this new chapter in my life.

Canary Wharf at Night | London, England, Niko Trinkhaus, (CC by-nc)

I want to note that I am deeply thankful to Christian Schaller for his tremendous amount of support during my stay in Brno and for working with me in figuring ways to balance my professional and personal life. I also wish him the best of luck with his new life in Westford, I'm certainly going to miss him.

On the other hand I guess this means I'll show up at the GNOME Beers in London more often :-)

September 11, 2014

It is time for another report on Listaller, the cross-distro 3rd-party package installer, which has now been in development for (depending on how you count) 5-6 years. This will be a longer post, so you might want to grab some coffee or tea ;-)

The original idea

The Listaller project was initially started with the goal to make application deployment on Linux distributions as simple as possible, by providing a unified package installation format and tools which make building apps for multiple distributions easier and deployment of updates simple. The key ideas were:

  • Seamless integration of all installation steps into the system – users shouldn’t care about the origin of their application, they just handle all installed apps with the same tool and update all apps with the same interface they use for updating the system.
  • Out-of-the-box sandboxing for all 3rd-party apps
  • Easy signing and key-validation for Listaller packages
  • Simple creation of updates for developers
  • Resource-sharing: It should always be clear which application uses which library, duplicates should be avoided. The distribution-provided software should take priority, since it is often well-maintained and receives security updates.

The current state

The current release of Listaller handles all of this with a plugin for PackageKit, the cross-distro package-management abstraction layer. It hooks into PackageKit and reads information passing through to the native distributor backend, and if it encounters Listaller software, it handles it appropriately. It can also inject update information. This results in all Listaller software being shown in any PackageKit frontend, and people can work with it just as if the packages were native packages. Listaller package installations are controlled by a machine policy, so the administrator can decide that e.g. only packages from a trusted source (= GPG signature in trusted database) can be installed. Dependencies can be pulled from the distributor’s repositories, or optionally from external sources, such as PyPI.

This sounds good on paper, but the current implementation has various problems.

The issues

The current Listaller approach has some problems. The biggest one lies in the future: Soon, there will be no PackageKit plugins anymore! PackageKit 1.0 will remove support for them, because they appear to be a major source of crashes; even the in-tree plugins cause problems. Also, the PackageKit service itself is currently being trimmed of unneeded features and less-used code. These changes in PackageKit are great and needed for the project (and I support these efforts), but they cause a pretty huge problem for Listaller: The project relies on the PackageKit plugin. If used without it, you lose the system-integration part, which is one of the key concepts of Listaller, and a primary goal.

But this issue is not the only one. There are more. One huge problem for Listaller is dependency-solving: It needs to know where to get software from in case it isn’t installed already. And that has to be done in a cross-distributional way. This is an incredibly complex task, and Listaller contains lots of workarounds for various quirks. It contains so many hacks for distro-specific stuff that it became really hard to understand. The Listaller dependency model also became very complex, because it tried to handle many corner-cases. This is bad, of course. But the workarounds weren’t added for fun, but because it was assumed to be easier than fixing the root cause, which would have required collaboration between distributors and some changes on the stack, which seemed unlikely to happen at the time the code was written.

The systemd effort

Another thing which affects Listaller is the latest push from the systemd team to allow cross-distro 3rd-party installations to happen. I definitely recommend reading the linked blogpost from Lennart, if you have some spare time! The identified problems are the same as for Listaller, but the solution they propose is completely different, and about three orders of magnitude more invasive than whatever the Listaller project had in mind (I make these numbers up, so don’t ask!). There are also a few issues I see with Lennart’s approach, which I will probably go into in detail in another blogpost (e.g. it requires multiple copies of a library lying around, where one version might have a security vulnerability, and another one doesn’t – it’s hard to ensure everything is up to date and secure that way, even if you have a top-notch sandbox). I have great respect for the systemd crew and especially Lennart, and I hope they succeed with their efforts. However, I also think Listaller can achieve similar things with a less-invasive solution, at least for the 3rd-party app-installations. (Listaller is one of the partial-fix solutions with strict focus, so not a direct competitor to the holistic systemd approach. Both solutions could happily live together.)

A step into the future

Some might have guessed it already: There are some bigger changes coming to Listaller! The most important one is that there will be no Listaller anymore, at least not in its old form.

Since the current code relies heavily on the PackageKit plugin, and contains some ugly workarounds, it doesn’t make much sense to continue working on it.

Instead, I started the Listaller.NEXT project, which is a rewrite of Listaller in C. There are some goals for the rewrite:

  • No stupid hacks and workarounds: We will not add any workaround. If there is a problem, we will fix it at its source, even if that might be more invasive.
  • Trimmed down project: The new incarnation of Listaller will only support installations of statically linked software at the beginning. We will start with a very small, robust core, and then add more features (like dependency-solving) gradually, but only if they are useful. There will be no feature-creep like in the previous version.
  • Faster development cycle: Releases will happen much more often, not only two or three times a year.
  • Integration: Since there is no PackageKit plugin anymore, but integration is still one of Listaller’s key concepts, we will integrate Listaller into downstream tools, ranging from Apper to GNOME-Software. Richard Hughes will help with the integration and user interfaces, so Listaller applications get displayed properly.
  • AppStream-first: AppStream is the ultimate tool for Listaller to detect dependencies. With the 0.6 release, the Listaller component-concept was merged into it, which makes it a very powerful and non-hackish solution for dependency-detection. We will advance the use of its metadata, and probably use it exclusively, which would restrict Listaller to only work properly on distributions which ship AppStream metadata.
  • No desktop-only focus: The previous Listaller was focused only on desktop GUI apps. The new version will be developed with a much larger target audience in mind, including server deployments (“Can I use it to deploy my server app” is one of the most frequently asked questions about Listaller – with the new version, the answer is yes).
  • We will continue to improve the static-linking and cross-distro development toolchain (libuild, with ligcc, lig++ and binreloc), to make building portable apps easier.

I made a last release of the 0.5.x series of Listaller, to work with PackageKit 0.9.x – the future lies in the C port.

If you are using Listaller (and I know of people who do, for example some deploy statically-linked stuff on internal test-setups with it), stay tuned. The packaging format will stay mostly compatible with the current version, so you will not see many changes there (the plan is to freeze it very soon, so no backwards-incompatible changes are made anymore). The 0.5.x series will receive critical bugfixes if necessary.

Help needed!

As always, help is needed! Writing C is not that difficult ;-) But user feedback is welcome as well, in case you have an idea. The new code will be hosted on GitHub in the new listaller-next branch (currently not that much to find there). Long-term, we will completely migrate away from Launchpad.

You can expect more blogposts about the Listaller concepts and progress in the next months (as soon as I am done with some AppStream-related things, which take priority).

September 03, 2014

I try to fairly regularly build recent git checkouts of all the upstream modules from X.Org (at least all those listed in the current build.sh) on Solaris. Normally I do this in 32-bit mode on x86 machines using the Sun compilers on the latest Solaris 11 internal development build, but I also occasionally do it in 64-bit mode, or with gcc compilers, or on a SPARC machine. This helps me catch issues that would break our builds when we integrate the new releases before those releases happen. (Ideally I'd set up a Solaris client of the X.Org tinderbox, but I've not gotten around to that.)

Anyways, recently I finally decided to track down an error that only shows up in the 64-bit builds of the xscope protocol monitor/decoder for X11 on Solaris. The builds run fine up until the final link stage, which fails with:

ld: fatal: relocation error: R_AMD64_PC32: file audio.o: symbol littleEndian: value 0x8086c355 does not fit
ld: fatal: relocation error: R_AMD64_PC32: file audio.o: symbol ServerHostName: value 0x8086b4fe does not fit
ld: fatal: relocation error: R_AMD64_PC32: file decode11.o: symbol LBXEvent: value 0x808664c3 does not fit
(and over 150 more symbols that didn't fit)

A Google search turned up some forum posts, a blog post, and an article on the AMD64 ABI support in the Sun Studio compilers. And indeed, the solutions they offered did work - building with -Kpic did allow the program to link.

But is that really the best answer? xscope is a simple program, and shouldn't be overflowing the normal memory model. Once it linked, looking at the resulting binary was a bit shocking:

% /usr/gnu/bin/size  xscope
   text	   data	    bss	    dec	    hex	filename
 416753	   5256	2155921980	2156343989	808732b5	xscope

% /usr/bin/size -f xscope

23(.interp) + 32(.SUNW_cap) + 5860(.eh_frame_hdr) + 27200(.eh_frame)
 + 2964(.SUNW_syminfo) + 5944(.hash) + 4224(.SUNW_ldynsym)
 + 17784(.dynsym) + 14703(.dynstr) + 192(.SUNW_version)
 + 1482(.SUNW_versym) + 3168(.SUNW_dynsymsort) + 96(.SUNW_reloc)
 + 1944(.rela.plt) + 1312(.plt) + 291018(.text) + 33(.init) + 33(.fini)
 + 280(.rodata) + 38461(.rodata1) + 1376(.got) + 784(.dynamic)
 + 1952(.data) + 0(.bssf) + 1144(.picdata) + 0(.tdata) + 0(.tbss)
 + 2155921980(.bss) = 2156343989

% pmap -x `pgrep xscope`
26151:	./xscope
         Address     Kbytes        RSS       Anon     Locked Mode   Mapped File
0000000000400000        408        408          -          - r-x--  xscope
0000000000476000          8          8          8          - rw---  xscope
0000000000478000    2105388       1064       1064          - rw---  xscope
0000000080C83000         52         52         52          - rw---    [ heap ]
[....]
FFFFFD7FFFDF8000         32         32         32          - rw---    [ stack ]
---------------- ---------- ---------- ---------- ----------
        total Kb    2108668       3204       1300          -

Two gigabytes of .bss space allocated!?!?! That can't be right. Looking through the output of the elfdump and nm programs, a single symbol stood out:

Symbol Table Section:  .SUNW_ldynsym
     index    value              size              type bind oth ver shndx          name
[...]
      [89]  0x00000000009ff280 0x0000000080280000  OBJT GLOB  D    1 .bss           FDinfo

[Index]   Value                Size                Type  Bind  Other Shndx   Name
[...]
[528]   |            10482304|          2150105088|OBJT |GLOB |0    |28     |FDinfo

Unfortunately, that wasn't one of the ones listed in the linker errors, since its starting address fit inside the normal memory model, but everything that came after it was out of range.
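To see why FDinfo itself escaped the error list while the symbols placed after it did not, the range check can be redone by hand. This is a rough sketch in Python (xscope itself is C), assuming only that an R_AMD64_PC32 relocation has to hold a signed 32-bit value. The helper name fits_pc32 is made up for illustration, and the numbers are copied from the ld and elfdump output above.

```python
# Signed 32-bit range that a PC-relative 32-bit relocation can hold.
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def fits_pc32(value):
    """Return True if value is representable as a signed 32-bit integer."""
    return INT32_MIN <= value <= INT32_MAX


# Start address and size of FDinfo, as reported by elfdump.
fdinfo_start = 0x00000000009FF280
fdinfo_size = 0x0000000080280000
fdinfo_end = fdinfo_start + fdinfo_size

# Relocation values that ld refused, copied from the error messages.
rejected_values = [0x8086C355, 0x8086B4FE, 0x808664C3]

print("FDinfo start below 2 GiB:", fits_pc32(fdinfo_start))
print("FDinfo end below 2 GiB:  ", fits_pc32(fdinfo_end))
for value in rejected_values:
    print(hex(value), "fits:", fits_pc32(value))
```

The start of FDinfo is comfortably inside the range, but its end is not, so anything the linker lays out after it in .bss lands beyond what a 32-bit signed offset can reach.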

So what is this giant static allocation for? It's defined in scope.h:

#define BUFFER_SIZE (1024 * 32)

struct fdinfo
{
  Boolean Server;
  long    ClientNumber;
  FD      pair;
  unsigned char   buffer[BUFFER_SIZE];
  int     bufcount;
  int     bufstart;
  int     buflimit;     /* limited writes */
  int     bufdelivered; /* total bytes delivered */
  Boolean writeblocked;
};

extern struct fdinfo   FDinfo[StaticMaxFD];

So it allocates a 32k buffer for up to StaticMaxFD file descriptors. How many is that? For that we need to look in xscope's fd.h:

/* need to change the MaxFD to allow larger number of fd's */
#define StaticMaxFD FD_SETSIZE

and from there to the Solaris system headers, which define FD_SETSIZE in <sys/select.h>:

/*
 * Select uses bit masks of file descriptors in longs.
 * These macros manipulate such bit fields.
 * FD_SETSIZE may be defined by the user, but the default here
 * should be >= NOFILE (param.h).
 */
#ifndef FD_SETSIZE
#ifdef _LP64
#define FD_SETSIZE      65536
#else
#define FD_SETSIZE      1024
#endif  /* _LP64 */

So this makes the buffer fields alone in FDinfo become 65536 * 32 * 1024 bytes, aka 2 gigabytes.
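The arithmetic is easy to double-check. Here is a small sketch in Python (used only as a calculator here, since xscope is C) that multiplies out the buffer sizes for both FD_SETSIZE defaults from the header and compares the 64-bit figure against the FDinfo symbol size that elfdump reported. The variable names are mine; the constants come from scope.h, <sys/select.h> and the elfdump output.

```python
BUFFER_SIZE = 1024 * 32          # from scope.h
FD_SETSIZE_LP64 = 65536          # 64-bit default in <sys/select.h>
FD_SETSIZE_ILP32 = 1024          # 32-bit default in <sys/select.h>
FDINFO_SYMBOL_SIZE = 2150105088  # size of FDinfo as reported by elfdump/nm

# Bytes used by the buffer fields alone.
buffers_lp64 = FD_SETSIZE_LP64 * BUFFER_SIZE
buffers_ilp32 = FD_SETSIZE_ILP32 * BUFFER_SIZE

# Size of one struct fdinfo in the 64-bit build, derived from the symbol size.
per_entry = FDINFO_SYMBOL_SIZE // FD_SETSIZE_LP64
other_fields = per_entry - BUFFER_SIZE

print("64-bit buffers:", buffers_lp64, "bytes =", buffers_lp64 // 2**30, "GiB")
print("32-bit buffers:", buffers_ilp32, "bytes =", buffers_ilp32 // 2**20, "MiB")
print("bytes per struct fdinfo (64-bit):", per_entry)
print("bytes per struct outside the buffer:", other_fields)
```

So the same source costs 32 MiB of buffers in a 32-bit build and 2 GiB in a 64-bit one, purely because of the different FD_SETSIZE default.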

Thus in this case, while compiler flags like -Kpic allow the code to link, building with -DFD_SETSIZE=256 instead produces code that's a little bit saner, fits in the normal memory model, and is less likely to fail with out of memory errors when you need it most:

% /usr/gnu/bin/size -f xscope
   text	   data	    bss	    dec	    hex	filename
 409388	   3352	8449804	8862544	 873b50	xscope

% pmap -x `pgrep xscope`
         Address     Kbytes        RSS       Anon     Locked Mode   Mapped File
0000000000400000        404        404          -          - r-x--  xscope
0000000000475000          4          4          4          - rw---  xscope
0000000000476000       8248         20         20          - rw---  xscope
0000000000C84000         52         52         52          - rw---    [ heap ]
[...]
FFFFFD7FFFDFD000         12         12         12          - rw---    [ stack ]
---------------- ---------- ---------- ---------- ----------
        total Kb      11500       2136        232          -

Of course that assumes that xscope is not going to be monitoring more than about 120 clients at a time (since it opens two file descriptors for each client, one connected to the client and one to the real X server), and still wastes many page mappings if you're only monitoring one client. The real fix being worked on for the next upstream release is to make the buffer allocation be dynamic, and allocate just enough for the number of clients we actually are monitoring.
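To illustrate the idea of allocating per-descriptor buffers only when they are needed, here is a minimal sketch. It is not the actual upstream change (which is in C and not shown in this post); it is written in Python for brevity, and the FDTable class and its method names are hypothetical.

```python
BUFFER_SIZE = 1024 * 32  # same per-descriptor buffer size as scope.h


class FDTable:
    """Hypothetical table that allocates a buffer per fd on first use."""

    def __init__(self):
        self._buffers = {}

    def buffer_for(self, fd):
        """Return the buffer for fd, allocating it the first time it is used."""
        if fd < 0:
            raise ValueError("file descriptors are non-negative")
        if fd not in self._buffers:
            self._buffers[fd] = bytearray(BUFFER_SIZE)
        return self._buffers[fd]

    def release(self, fd):
        """Drop the buffer when the descriptor is closed."""
        self._buffers.pop(fd, None)

    def allocated_bytes(self):
        """Total buffer memory currently held."""
        return len(self._buffers) * BUFFER_SIZE


# One monitored client uses two descriptors: one to the client, one to the server.
table = FDTable()
table.buffer_for(4)
table.buffer_for(5)
print(table.allocated_bytes(), "bytes for one client")
```

With this shape the memory cost scales with the number of clients actually being monitored rather than with FD_SETSIZE.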

The moral of this story? Just because you can make it build doesn't mean you've fixed it well, and sometimes it's useful to understand why the linker is giving you a hard time.

September 02, 2014
I've had a couple of questions about whether there's a way for others to contribute to the VC4 driver project.  There is!  I haven't posted about it before because things aren't as ready as I'd like for others to do development (it has a tendency to lock up, and the X implementation isn't really ready yet so you don't get to see your results), but that shouldn't actually stop anyone.

To get your environment set up, build the kernel (https://github.com/anholt/linux.git vc4 branch), Mesa (git://anongit.freedesktop.org/mesa/mesa) with --with-gallium-drivers=vc4, and piglit (git://anongit.freedesktop.org/git/piglit).  For working on the Pi, I highly recommend having a serial cable and doing NFS root so that you don't have to write things to slow, unreliable SD cards.

You can run an existing piglit test that should work, to check your environment: env PIGLIT_PLATFORM=gbm VC4_DEBUG=qir ./bin/shader_runner tests/shaders/glsl-algebraic-add-add-1.shader_test -auto -fbo -- you should see a dump of the IR for this shader, and a pass report.  The kernel will make some noise about how it's rendered a frame.

Now the actual work:  I've left some of the TGSI opcodes unfinished (SCS, DST, DPH, and XPD, for example), so the driver just aborts when a shader tries to use them.  How they work is described in src/gallium/docs/source/tgsi.rst. The TGSI-to-QIR code is in vc4_program.c (where you'll find all the opcodes that are implemented currently), and vc4_qir.h has all the opcodes that are available to you and helpers for generating them.  Once it's in QIR (which I think should have all the opcodes you need for this work), vc4_qpu_emit.c will turn the QIR into actual QPU code like you find described in the chip specs.

You can dump the shaders being generated by the driver using VC4_DEBUG=tgsi,qir,qpu in the environment (that gets you 3/4 stages of code dumped -- at times you might want some subset of that just to quiet things down).

Since we've still got a lot of GPU hangs, and I don't have reset working, you can't even complete a piglit run to find all the problems or to test whether your changes are good.  What I can offer currently is that you could run PIGLIT_PLATFORM=gbm VC4_DEBUG=norast ./piglit-run.py tests/quick.py results/vc4-norast; piglit-summary-html.py --overwrite summary/mysum results/vc4-norast, which will get you a list of all the tests (which mostly failed, since we didn't render anything), some of which will have failed an assertion.  Once you know which tests hit assertion failures from the opcode you worked on, you can run them manually, like PIGLIT_PLATFORM=gbm /home/anholt/src/piglit/bin/shader_runner /home/anholt/src/piglit/generated_tests/spec/glsl-1.10/execution/built-in-functions/vs-asin-vec4.shader_test -auto (copy-and-pasted from the results) or PIGLIT_PLATFORM=gbm PIGLIT_TEST="XPD test 2 (same src and dst arg)" ./bin/glean -o -v -v -v -t +vertProg1 --quick (also copy-and-pasted from the results, but note that you need the other env var for glean to pick out the subtest to run).

Other things you might want eventually: I do my development using cross-builds instead of on the Pi, install to a prefix in my homedir, then rsync that into my NFS root and use LD_LIBRARY_PATH/LIBGL_DRIVERS_PATH on the Pi to point my tests at the driver in the homedir prefix.  Cross-builds were a *huge* pain to set up (debian's multiarch doesn't ship the .so symlink with the library, and the -dev packages that do install them don't install simultaneously for multiple arches), but it's worth it in the end.  If you look into cross-build, what I'm using is rpi-tools/arm-bcm2708/gcc-linaro-arm-linux-gnueabihf-raspbian-x64/bin/arm-linux-gnueabihf-gcc and you'll want --enable-malloc0returnsnull if you cross-build a bunch of X-related packages.
September 01, 2014

So two years ago my family and I moved to Brno in the Czech Republic due to me starting a new job at Red Hat. It has been two roller coaster years with a lot of changes happening both inside Red Hat and with the world that the Linux desktop operates in. During those years my wife and I have gotten to love Brno, which both of us find a bit surprising as we were both quite skeptical of the city at the outset.

I think having grown up in Western Europe during the Cold War I had some preconceptions about what life was like in the former Eastern Europe, and Brno specifically is struggling a bit with being the second city in the Czech Republic after Prague, due to Prague so often being hailed internationally as a beautiful and exciting city.

But I think during these two years Brno has proven itself to us as a place that is great to live in, especially if you have a little child. Brno has a lot of beautiful outdoor areas which are great for hiking or relaxing, it is packed full of these children's cafes where you can take your kid to play while you sit down and have a coffee or a tea, and it has a vibrant expat community, affordable housing, a good range of restaurants, and a short distance to major cities like Vienna, Prague and Budapest. There are also a lot of old castles and towns to explore in the vicinity. I think Telc has to be one of our topmost favorites in that regard. And it has very little crime; my wife has been telling her friends how Brno is the first city she has ever lived in where she feels that as a woman she can walk alone through the city in the evening or at night and feel safe.

But that said the time has come for us to move on. Due to one of these changes inside Red Hat I mentioned I am getting moved to our US Engineering office in Westford, Massachusetts. For those not familiar with Westford it is close to a city you probably do know, Boston.

So tomorrow the moving company will arrive at our flat here in Brno and pack up everything for the transport to the US. The furniture will take some time to arrive there, so while our stuff is sailing across the ocean we will live with my family in Norway, while I take advantage of the Red Hat office in downtown Oslo. So by mid-October I expect us to be fully set up in the Boston area, although we are heading over there next week for a final house hunting trip so that the furniture has a place to arrive to :)

So goodbye to Brno for now, and looking forward to seeing new and old friends in Boston!

August 31, 2014

In a previous blog story I discussed Factory Reset, Stateless Systems, Reproducible Systems & Verifiable Systems. I now want to take the opportunity to explain a bit where we want to take this with systemd in the longer run, and what we want to build out of it. This is going to be a longer story, so better grab a cold bottle of Club Mate before you start reading.

Traditional Linux distributions are built around packaging systems like RPM or dpkg, and an organization model where upstream developers and downstream packagers are relatively clearly separated: an upstream developer writes code, and puts it somewhere online, in a tarball. A packager then grabs it and turns it into RPMs/DEBs. The user then grabs these RPMs/DEBs and installs them locally on the system. For a variety of uses this is a fantastic scheme: users have a large selection of readily packaged software available, in mostly uniform packaging, from a single source they can trust. In this scheme the distribution vets all software it packages, and as long as the user trusts the distribution all should be good. The distribution takes the responsibility of ensuring the software is not malicious, of fixing security problems in a timely manner and helping the user if something is wrong.

Upstream Projects

However, this scheme also has a number of problems, and doesn't fit many use-cases of our software particularly well. Let's have a look at the problems of this scheme for many upstreams:

  • Upstream software vendors are fully dependent on downstream distributions to package their stuff. It's the downstream distribution that decides on schedules, packaging details, and how to handle support. Often upstream vendors want much faster release cycles than the downstream distributions follow.

  • Realistic testing is extremely unreliable and next to impossible. Since the end-user can run a variety of different package versions together, and expects the software he runs to just work on any combination, the test matrix explodes. If upstream tests its version on distribution X release Y, then there's no guarantee that that's the precise combination of packages that the end user will eventually run. In fact, it is very unlikely that the end user will, since most distributions probably updated a number of libraries the package relies on by the time the package ends up being made available to the user. The fact that each package can be individually updated by the user, and each user can combine library versions, plug-ins and executables relatively freely, results in a high risk of something going wrong.

  • Since there are so many different distributions in so many different versions around, if upstream tries to build and test software for them it needs to do so for a large number of distributions, which is a massive effort.

  • The distributions are actually quite different in many ways. In fact, they are different in a lot of the most basic functionality. For example, the path where x86-64 libraries go is different on Fedora and Debian derived systems.

  • Developing software for a number of distributions and versions is hard: if you want to do it, you need to actually install them, each one of them, manually, and then build your software for each.

  • Since most downstream distributions have strict licensing and trademark requirements (and rightly so), any kind of closed source software (or otherwise non-free) does not fit into this scheme at all.

This all together makes it really hard for many upstreams to work nicely with the way Linux currently works. Often they try to improve the situation for themselves, for example by bundling libraries, to make their test and build matrices smaller.

System Vendors

The toolbox approach of classic Linux distributions is fantastic for people who want to put together their individual system, nicely adjusted to exactly what they need. However, this is not really how many of today's Linux systems are built, installed or updated. If you build any kind of embedded device, a server system, or even user systems, you frequently do your work based on complete system images, that are linearly versioned. You build these images somewhere, and then you replicate them atomically to a larger number of systems. On these systems, you don't install or remove packages, you get a defined set of files, and besides installing or updating the system there is no way to change the set of tools you get.

The current Linux distributions are not particularly good at providing for this major use-case of Linux. Their strict focus on individual packages, as well as on package managers as the end-user install and update tool, is incompatible with what many system vendors want.

Users

The classic Linux distribution scheme is frequently not what end users want, either. Many users are used to app markets like the ones Android, Windows or iOS/Mac have. Markets are a platform that doesn't package, build or maintain software like distributions do, but simply allows users to quickly find and download the software they need, with the app vendor responsible for keeping the app updated, secured, and all that on the vendor's release cycle. Users tend to be impatient. They want their software quickly, and the fine distinction between trusting a single distribution or a myriad of app developers individually is usually not important for them. The companies behind the marketplaces usually try to improve this trust problem by providing sand-boxing technologies: as a replacement for the distribution that audits, vets, builds and packages the software and thus allows users to trust it to a certain level, these vendors try to find technical solutions to ensure that the software they offer for download can't be malicious.

Existing Approaches To Fix These Problems

Now, all the issues pointed out above are not new, and there are sometimes quite successful attempts to do something about them. Ubuntu Apps, Docker, Software Collections, ChromeOS, CoreOS all fix part of this problem set, usually with a strict focus on one facet of Linux systems. For example, Ubuntu Apps focus strictly on end user (desktop) applications, and don't care about how we build/update/install the OS itself, or containers. Docker OTOH focuses on containers only, and doesn't care about end-user apps. Software Collections tries to focus on the development environments. ChromeOS focuses on the OS itself, but only for end-user devices. CoreOS also focuses on the OS, but only for server systems.

The approaches they find are usually good at specific things, and use a variety of different technologies, on different layers. However, none of these projects tried to fix these problems in a generic way, for all uses, right in the core components of the OS itself.

Linux has achieved tremendous success because its kernel is so generic: you can build supercomputers and tiny embedded devices out of it. It's time we come up with a basic, reusable scheme for solving the problem set described above, one that is equally generic.

What We Want

The systemd cabal (Kay Sievers, Harald Hoyer, Daniel Mack, Tom Gundersen, David Herrmann, and yours truly) recently met in Berlin about all these things, and tried to come up with a scheme that is somewhat simple, but tries to solve the issues generically, for all use-cases, as part of the systemd project. All that in a way that is somewhat compatible with the current scheme of distributions, to allow a slow, gradual adoption. Also, and that's something one cannot stress enough: the toolbox scheme of classic Linux distributions is actually a good one, and for many cases the right one. However, we need to make sure we make distributions relevant again for all use-cases, not just those of highly individualized systems.

Anyway, so let's summarize what we are trying to do:

  • We want an efficient way that allows vendors to package their software (regardless if just an app, or the whole OS) directly for the end user, and know the precise combination of libraries and packages it will operate with.

  • We want to allow end users and administrators to install these packages on their systems, regardless of which distribution they have installed on them.

  • We want a unified solution that ultimately can cover updates for full systems, OS containers, end user apps, programming ABIs, and more. These updates shall be double-buffered (at least). This is an absolute necessity if we want to prepare the ground for operating systems that manage themselves, that can update safely without administrator involvement.

  • We want our images to be trustable (i.e. signed). In fact we want a fully trustable OS, with images that can be verified by a full trust chain from the firmware (EFI SecureBoot!), through the boot loader, through the kernel, and initrd. Cryptographically secure verification of the code we execute is relevant on the desktop (like ChromeOS does), but also for apps, for embedded devices and even on servers (in a post-Snowden world, in particular).

What We Propose

So much about the set of problems, and what we are trying to do. So, now, let's discuss the technical bits we came up with:

The scheme we propose is built around a variety of concepts from btrfs and Linux file system name-spacing. btrfs at this point already has a large number of features that fit neatly in our concept, and the maintainers are busy working on a couple of others we want to eventually make use of.

As first part of our proposal we make heavy use of btrfs sub-volumes and introduce a clear naming scheme for them. We name snapshots like this:

  • usr:<vendorid>:<architecture>:<version> -- This refers to a full vendor operating system tree. It's basically a /usr tree (and no other directories), in a specific version, with everything you need to boot it up inside it. The <vendorid> field is replaced by some vendor identifier, maybe a scheme like org.fedoraproject.FedoraWorkstation. The <architecture> field specifies a CPU architecture the OS is designed for, for example x86_64. The <version> field specifies a specific OS version, for example 23.4. An example sub-volume name could hence look like this: usr:org.fedoraproject.FedoraWorkstation:x86_64:23.4

  • root:<name>:<vendorid>:<architecture> -- This refers to an instance of an operating system. It's basically a root directory, containing primarily /etc and /var (but possibly more). Sub-volumes of this type do not contain a populated /usr tree though. The <name> field refers to some instance name (maybe the host name of the instance). The other fields are defined as above. An example sub-volume name is root:revolution:org.fedoraproject.FedoraWorkstation:x86_64.

  • runtime:<vendorid>:<architecture>:<version> -- This refers to a vendor runtime. A runtime here is supposed to be a set of libraries and other resources that are needed to run apps (for the concept of apps see below), all in a /usr tree. In this regard this is very similar to the usr sub-volumes explained above, however, while a usr sub-volume is a full OS and contains everything necessary to boot, a runtime is really only a set of libraries. You cannot boot it, but you can run apps with it. An example sub-volume name is: runtime:org.gnome.GNOME3_20:x86_64:3.20.1

  • framework:<vendorid>:<architecture>:<version> -- This is very similar to a vendor runtime, as described above, it contains just a /usr tree, but goes one step further: it additionally contains all development headers, compilers and build tools, that allow developing against a specific runtime. For each runtime there should be a framework. When you develop against a specific framework in a specific architecture, then the resulting app will be compatible with the runtime of the same vendor ID and architecture. Example: framework:org.gnome.GNOME3_20:x86_64:3.20.1

  • app:<vendorid>:<runtime>:<architecture>:<version> -- This encapsulates an application bundle. It contains a tree that at runtime is mounted to /opt/<vendorid>, and contains all the application's resources. The <vendorid> could be a string like org.libreoffice.LibreOffice, the <runtime> refers to the vendor ID of one specific runtime the application is built for, for example org.gnome.GNOME3_20:3.20.1. The <architecture> and <version> refer to the architecture the application is built for, and of course its version. Example: app:org.libreoffice.LibreOffice:GNOME3_20:x86_64:133

  • home:<user>:<uid>:<gid> -- This sub-volume shall refer to the home directory of the specific user. The <user> field contains the user name, the <uid> and <gid> fields the numeric Unix UIDs and GIDs of the user. The idea here is that in the long run the list of sub-volumes is sufficient as a user database (but see below). Example: home:lennart:1000:1000.

btrfs partitions that adhere to this naming scheme should be clearly identifiable. It is our intention to introduce a new GPT partition type ID for this.
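Because every name in this scheme is just a type prefix followed by colon-separated fields, it is straightforward to handle mechanically. The following is a minimal sketch of a parser for the six sub-volume types listed above, in Python. The function name parse_subvolume and the dictionary layout are my own choices for illustration and not part of the proposal, and it assumes that individual fields never contain a colon themselves (as in the Example names above).

```python
# Field names for each sub-volume type, in the order the scheme lists them.
FIELDS = {
    "usr": ("vendorid", "architecture", "version"),
    "root": ("name", "vendorid", "architecture"),
    "runtime": ("vendorid", "architecture", "version"),
    "framework": ("vendorid", "architecture", "version"),
    "app": ("vendorid", "runtime", "architecture", "version"),
    "home": ("user", "uid", "gid"),
}


def parse_subvolume(name):
    """Split a sub-volume name into its type and named fields.

    Raises ValueError for unknown types or a wrong number of fields.
    """
    kind, *values = name.split(":")
    fields = FIELDS.get(kind)
    if fields is None:
        raise ValueError(f"unknown sub-volume type: {kind!r}")
    if len(values) != len(fields):
        raise ValueError(f"{kind!r} expects {len(fields)} fields, got {len(values)}")
    parsed = {"type": kind}
    parsed.update(zip(fields, values))
    return parsed


for example in (
    "usr:org.fedoraproject.FedoraWorkstation:x86_64:23.4",
    "app:org.libreoffice.LibreOffice:GNOME3_20:x86_64:133",
    "home:lennart:1000:1000",
):
    print(parse_subvolume(example))
```

All values are kept as strings here; a real tool would probably want to convert the uid and gid fields to integers and validate the architecture against a known list.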

How To Use It

Now that we have introduced this naming scheme, let's see what we can build with it:

  • When booting up a system we mount the root directory from one of the root sub-volumes, and then mount /usr from a matching usr sub-volume. Matching here means it carries the same <vendorid> and <architecture>. Of course, by default we should pick the matching usr sub-volume with the newest version.

  • When we boot up an OS container, we do exactly the same as when we boot up a regular system: we simply combine a usr sub-volume with a root sub-volume.

  • When we enumerate the system's users we simply go through the list of home snapshots.

  • When a user authenticates and logs in we mount his home directory from his snapshot.

  • When an app is run, we set up a new file system name-space, mount the app sub-volume to /opt/<vendorid>/, and the appropriate runtime sub-volume the app picked to /usr, as well as the user's /home/$USER to its place.

  • When a developer wants to develop against a specific runtime he installs the right framework, and then temporarily transitions into a name space where /usr is mounted from the framework sub-volume, and /home/$USER from his own home directory. In this name space he then runs his build commands. He can build in multiple name spaces at the same time, if he intends to build software for multiple runtimes or architectures at the same time.
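The boot-time matching rule from the first item of this list can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the version ordering in particular is an assumption of mine (numeric runs compare as integers, alphabetic runs as text), since the scheme as described here does not define how versions are ordered.

```python
import re


def version_key(version):
    """Assumed ordering: digit runs compare numerically, letter runs as text."""
    key = []
    for part in re.findall(r"\d+|[A-Za-z]+", version):
        if part.isdigit():
            key.append((1, int(part), ""))
        else:
            key.append((0, 0, part))
    return key


def pick_usr(root_name, subvolumes):
    """Return the newest usr sub-volume matching a root sub-volume, or None."""
    kind, _instance, vendorid, architecture = root_name.split(":")
    if kind != "root":
        raise ValueError("expected a root sub-volume name")
    candidates = []
    for name in subvolumes:
        parts = name.split(":")
        # A usr name is usr:<vendorid>:<architecture>:<version>.
        if len(parts) == 4 and parts[:3] == ["usr", vendorid, architecture]:
            candidates.append(name)
    if not candidates:
        return None
    return max(candidates, key=lambda n: version_key(n.split(":")[3]))


volumes = [
    "usr:org.archlinux.Desktop:x86_64:302.7.8",
    "usr:org.archlinux.Desktop:x86_64:302.7.9",
    "usr:org.archlinux.Desktop:x86_64:302.7.10",
    "usr:org.mageia.Client:i386:39.3",
    "root:bar:org.archlinux.Desktop:x86_64",
]
print(pick_usr("root:bar:org.archlinux.Desktop:x86_64", volumes))
```

Note that a plain string comparison would rank 302.7.9 above 302.7.10, which is why the sketch compares the numeric parts as integers.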

Instantiating a new system or OS container (which is exactly the same in this scheme) just consists of creating a new appropriately named root sub-volume. Completely naturally you can share one vendor OS copy in one specific version with a multitude of container instances.

Everything is double-buffered (or actually, n-fold-buffered), because usr, runtime, framework, app sub-volumes can exist in multiple versions. Of course, by default the execution logic should always pick the newest release of each sub-volume, but it is up to the user to keep multiple versions around, and possibly execute older versions, if he desires to do so. In fact, like on ChromeOS this could even be handled automatically: if a system fails to boot with a newer snapshot, the boot loader can automatically revert back to an older version of the OS.
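As a rough sketch of that automatic fallback, the selection logic might look like the following. How a boot loader would actually record a failed boot is not defined here, so the failed set is a purely hypothetical stand-in, as is the function name.

```python
def pick_bootable(versions_newest_first, failed):
    """Return the newest version that is not recorded as a failed boot.

    versions_newest_first: version strings, already sorted newest first.
    failed: hypothetical set of versions that did not boot successfully.
    """
    for version in versions_newest_first:
        if version not in failed:
            return version
    # Nothing usable is left; the caller has to decide what to do.
    return None
```

With three installed versions and the newest one marked as failed, the function falls back to the second newest.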

An Example

Note that as a result this allows installing not only multiple end-user applications into the same btrfs volume, but also multiple operating systems, multiple system instances, multiple runtimes, multiple frameworks. Or to spell this out in an example:

Let's say Fedora, Mageia and ArchLinux all implement this scheme, and provide ready-made end-user images. Also, the GNOME, KDE, SDL projects all define a runtime+framework to develop against. Finally, both LibreOffice and Firefox provide their stuff according to this scheme. You can now trivially install all of these into the same btrfs volume:

  • usr:org.fedoraproject.WorkStation:x86_64:24.7
  • usr:org.fedoraproject.WorkStation:x86_64:24.8
  • usr:org.fedoraproject.WorkStation:x86_64:24.9
  • usr:org.fedoraproject.WorkStation:x86_64:25beta
  • usr:org.mageia.Client:i386:39.3
  • usr:org.mageia.Client:i386:39.4
  • usr:org.mageia.Client:i386:39.6
  • usr:org.archlinux.Desktop:x86_64:302.7.8
  • usr:org.archlinux.Desktop:x86_64:302.7.9
  • usr:org.archlinux.Desktop:x86_64:302.7.10
  • root:revolution:org.fedoraproject.WorkStation:x86_64
  • root:testmachine:org.fedoraproject.WorkStation:x86_64
  • root:foo:org.mageia.Client:i386
  • root:bar:org.archlinux.Desktop:x86_64
  • runtime:org.gnome.GNOME3_20:x86_64:3.20.1
  • runtime:org.gnome.GNOME3_20:x86_64:3.20.4
  • runtime:org.gnome.GNOME3_20:x86_64:3.20.5
  • runtime:org.gnome.GNOME3_22:x86_64:3.22.0
  • runtime:org.kde.KDE5_6:x86_64:5.6.0
  • framework:org.gnome.GNOME3_22:x86_64:3.22.0
  • framework:org.kde.KDE5_6:x86_64:5.6.0
  • app:org.libreoffice.LibreOffice:GNOME3_20:x86_64:133
  • app:org.libreoffice.LibreOffice:GNOME3_22:x86_64:166
  • app:org.mozilla.Firefox:GNOME3_20:x86_64:39
  • app:org.mozilla.Firefox:GNOME3_20:x86_64:40
  • home:lennart:1000:1000
  • home:hrundivbakshi:1001:1001

In the example above, we have three vendor operating systems installed. All of them in three versions, and one even in a beta version. We have four system instances around. Two of them of Fedora, maybe one of them we usually boot from, the other we run for very specific purposes in an OS container. We also have the runtimes for two GNOME releases in multiple versions, plus one for KDE. Then, we have the development trees for one version of KDE and GNOME around, as well as two apps, that make use of two releases of the GNOME runtime. Finally, we have the home directories of two users.

Now, with the name-spacing concepts we introduced above, we can actually relatively freely mix and match apps and OSes, or develop against specific frameworks in specific versions on any operating system. It doesn't matter if you booted your ArchLinux instance, or your Fedora one, you can execute both LibreOffice and Firefox just fine, because at execution time they get matched up with the right runtime, and all of them are available from all the operating systems you installed. You get the precise runtime that the upstream vendor of Firefox/LibreOffice did their testing with. It doesn't matter anymore which distribution you run, and which distribution the vendor prefers.
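To make the "matched up with the right runtime" step a bit more concrete, here is a small Python sketch that turns an app sub-volume name into a mount plan. It rests on several assumptions that are not spelled out in the text: that the runtime field of an app name (e.g. GNOME3_20) corresponds to the last dot-separated component of the runtime's vendor id (e.g. org.gnome.GNOME3_20), that the newest runtime version is preferred, and that versions can be compared by their numeric parts. The function names are invented for the example.

```python
import re


def version_key(version):
    # Assumption: compare only the numeric runs of the version string.
    return [int(n) for n in re.findall(r"\d+", version)]


def mount_plan(app_name, user, subvolumes):
    """Map mount points to sub-volume names for running one app."""
    kind, vendor_id, runtime_id, arch, _version = app_name.split(":")
    if kind != "app":
        raise ValueError("not an app sub-volume: " + app_name)
    runtimes = []
    for sv in subvolumes:
        f = sv.split(":")
        # Assumed rule: "GNOME3_20" matches vendor id "org.gnome.GNOME3_20".
        if (len(f) == 4 and f[0] == "runtime" and f[2] == arch
                and f[1].split(".")[-1] == runtime_id):
            runtimes.append(sv)
    if not runtimes:
        raise LookupError("no runtime installed for " + app_name)
    runtime = max(runtimes, key=lambda sv: version_key(sv.split(":")[3]))
    # The user's home sub-volume, if one exists in the listing.
    home = next((sv for sv in subvolumes
                 if sv.startswith("home:" + user + ":")), None)
    return {
        "/opt/" + vendor_id: app_name,
        "/usr": runtime,
        "/home/" + user: home,
    }
```

The plan is just a dictionary; actually setting up the file system name-space and performing the mounts is left out on purpose.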

Also, given that the user database is actually encoded in the sub-volume list, it doesn't matter which system you boot, the distribution should be able to find your local users automatically, without any configuration in /etc/passwd.
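A minimal sketch of how a user list could be derived from the sub-volume names, assuming the home:<user>:<uid>:<gid> layout and assuming (this is not stated in the scheme) that the home directory ends up at /home/<user>:

```python
from collections import namedtuple

# Hypothetical record type; only the fields encoded in the name are real.
User = namedtuple("User", "name uid gid home")


def users_from_subvolumes(subvolumes):
    """Build a user list from home:<user>:<uid>:<gid> sub-volume names."""
    users = []
    for sv in subvolumes:
        fields = sv.split(":")
        if len(fields) == 4 and fields[0] == "home":
            _, name, uid, gid = fields
            users.append(User(name, int(uid), int(gid), "/home/" + name))
    return users
```

Anything /etc/passwd carries beyond name, UID and GID (shell, real name and so on) is not covered by the sub-volume name and would have to come from somewhere else.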

Building Blocks

With this naming scheme, plus the way we can combine the sub-volumes at execution time, we have already come quite far, but how do we actually get these sub-volumes onto the final machines, and how do we update them? Well, btrfs has a feature they call "send-and-receive". It basically allows you to "diff" two file system versions, and generate a binary delta. You can generate these deltas on a developer's machine and then push them into the user's system, and he'll get the exact same sub-volume too. This is how we envision installation and updating of operating systems, applications, runtimes, frameworks. At installation time, we simply deserialize an initial send-and-receive delta into our btrfs volume, and later, when a new version is released we just add in the few bits that are new, by dropping in another send-and-receive delta under a new sub-volume name. And we do it exactly the same for the OS itself, for a runtime, a framework or an app. There's no technical distinction anymore. The underlying operation for installing apps, runtimes, frameworks and vendor OSes, as well as the operation for updating them, is done the exact same way for all.
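For illustration, the send-and-receive step maps onto the existing btrfs send and btrfs receive commands. The Python sketch below only builds the command lines and does not run anything. The paths are made-up examples, and the helper names are hypothetical. Note that btrfs send operates on read-only snapshots, and that an incremental stream (-p) can only be applied on a system that already holds the parent snapshot.

```python
import shlex


def send_command(snapshot, parent=None):
    """Command that serializes a read-only snapshot.

    With a parent given, only the delta against that parent is sent.
    """
    cmd = ["btrfs", "send"]
    if parent is not None:
        cmd += ["-p", parent]
    cmd.append(snapshot)
    return cmd


def receive_command(target_dir):
    """Command that deserializes a stream into a mounted btrfs volume."""
    return ["btrfs", "receive", target_dir]


def pipeline(snapshot, target_dir, parent=None):
    """Shell pipeline string for a local send | receive."""
    def quoted(cmd):
        return " ".join(shlex.quote(arg) for arg in cmd)
    return quoted(send_command(snapshot, parent)) + " | " + quoted(receive_command(target_dir))
```

An initial installation would use pipeline() without a parent; an update would pass the previously installed version as the parent so that only the changed bits travel.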

Of course, keeping multiple full /usr trees around sounds like an awful lot of waste; after all, they will contain a lot of very similar data, since a lot of resources are shared between distributions, frameworks and runtimes. However, thankfully btrfs actually is able to de-duplicate this for us. If we add in a new app snapshot, this simply adds in the new files that changed. Moreover, different runtimes and operating systems might actually end up sharing the same tree.

Even though the example above focuses primarily on the end-user, desktop side of things, the concept is also extremely powerful in server scenarios. For example, it is easy to build your own usr trees and deliver them to your hosts using this scheme. The usr sub-volumes are supposed to be something that administrators can put together. After deserializing them into a couple of hosts, you can trivially instantiate them as OS containers there, simply by adding a new root sub-volume for each instance, referencing the usr tree you just put together. Instantiating OS containers hence becomes as easy as creating a new btrfs sub-volume. And you can still update the images nicely, get fully double-buffered updates and everything.

And of course, this scheme also applies nicely to embedded use-cases. Regardless of whether you build a TV, an IVI system or a phone: you can put together your OS versions as usr trees, and then use btrfs-send-and-receive facilities to deliver them to the systems, and update them there.

Many people when they hear the word "btrfs" instantly reply with "is it ready yet?". Thankfully, most of the functionality we really need here is strictly read-only. With the exception of the home sub-volumes (see below) all snapshots are strictly read-only, and are delivered as immutable vendor trees onto the devices. They never are changed. Even if btrfs might still be immature, for this kind of read-only logic it should be more than good enough.

Note that this scheme also enables doing fat systems: for example, an installer image could include a Fedora version compiled for x86-64, one for i386, one for ARM, all in the same btrfs volume. Due to btrfs' de-duplication they will share as much as possible, and when the image is booted up the right sub-volume is automatically picked. Something similar of course applies to the apps too!

This also allows us to implement something that we like to call Operating-System-As-A-Virus. Installing a new system is little more than:

  • Creating a new GPT partition table
  • Adding an EFI System Partition (FAT) to it
  • Adding a new btrfs volume to it
  • Deserializing a single usr sub-volume into the btrfs volume
  • Installing a boot loader into the EFI System Partition
  • Rebooting

Now, since the only real vendor data you need is the usr sub-volume, you can trivially duplicate this onto any block device you want. Let's say you are a happy Fedora user, and you want to provide a friend with his own installation of this awesome system, all on a USB stick. All you have to do for this is perform the steps above, using your installed usr tree as the source to copy. And there you go! And you don't have to be afraid that any of your personal data is copied too, as the usr sub-volume is the exact version your vendor provided you with. Or in other words: there's no distinction anymore between installer images and installed systems. It's all the same. Installation becomes replication, nothing more. Live-CDs and installed systems can be fully identical.
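The replication steps listed above could be written down as a dry-run plan along the following lines. This is a sketch under several assumptions, not a tested installer: the sgdisk, mkfs.vfat, mkfs.btrfs, mount and btrfs tools are assumed to be available, the 512M size of the EFI System Partition is an arbitrary choice, partitions are assumed to be named by appending "1" and "2" to the device name, and the boot loader installation and the reboot are left out because they depend on the boot loader in use. The function only returns the commands; it does not execute them.

```python
import shlex


def replication_plan(device, usr_subvolume, mountpoint="/mnt/target"):
    """Return the commands for replicating a usr sub-volume onto a device."""
    # Assumption: sdX-style partition naming (device + partition number).
    esp, data = device + "1", device + "2"

    def quoted(cmd):
        return " ".join(shlex.quote(arg) for arg in cmd)

    send = ["btrfs", "send", usr_subvolume]
    receive = ["btrfs", "receive", mountpoint]
    return [
        # New GPT partition table (destroys what is on the device!).
        ["sgdisk", "--zap-all", device],
        # EFI System Partition; the size is an arbitrary example value.
        ["sgdisk", "--new=1:0:+512M", "--typecode=1:ef00", device],
        # Rest of the device for the btrfs volume.
        ["sgdisk", "--new=2:0:0", "--typecode=2:8300", device],
        ["mkfs.vfat", "-F", "32", esp],
        ["mkfs.btrfs", data],
        ["mount", data, mountpoint],
        # Deserialize the single usr sub-volume into the new volume.
        ["sh", "-c", quoted(send) + " | " + quoted(receive)],
    ]
```

Printing the plan before running anything is a cheap way to double-check the target device, since the first step wipes it.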

Note that in this design apps are actually developed against a single, very specific runtime, that contains all libraries it can link against (including a specific glibc version!). Any library that is not included in the runtime the developer picked must be included in the app itself. This is similar to how apps on Android declare one very specific Android version they are developed against. This greatly simplifies application installation, as there's no dependency hell: each app pulls in one runtime, and the app is actually free to pick which one, as you can have multiple installed, though only one is used by each app.

Also note that operating systems built this way will never see "half-updated" systems, as is common when a system is updated using RPM/dpkg. When updating the system the code will either run the old or the new version, but it will never see part of the old files and part of the new files. This is the same for apps, runtimes, and frameworks, too.

Where We Are Now

We are currently working on a lot of the groundwork necessary for this. This scheme relies on the ability to monopolize the vendor OS resources in /usr, which is the key of what I described in Factory Reset, Stateless Systems, Reproducible Systems & Verifiable Systems a few weeks back. Then, of course, for the full desktop app concept we need a strong sandbox, that does more than just hiding files from the file system view. After all with an app concept like the above the primary interfacing between the executed desktop apps and the rest of the system is via IPC (which is why we work on kdbus and teach it all kinds of sand-boxing features), and the kernel itself. Harald Hoyer has started working on generating the btrfs send-and-receive images based on Fedora.

Getting to the full scheme will take a while. Currently we have many of the building blocks ready, but some major items are missing. For example, we push quite a few problems into btrfs, that other solutions try to solve in user space. One of them is actually signing/verification of images. The btrfs maintainers are working on adding this to the code base, but currently nothing exists. This functionality is essential though to come to a fully verified system where a trust chain exists all the way from the firmware to the apps. Also, to make the home sub-volume scheme fully workable we actually need encrypted sub-volumes, so that the sub-volume's pass-phrase can be used for authenticating users in PAM. This doesn't exist either.

Working towards this scheme is a gradual process. Many of the steps we require for this are useful outside of the grand scheme though, which means we can slowly work towards the goal, and our users can already benefit from what we are working on as we go.

Also, and most importantly, this is not really a departure from traditional operating systems:

Each app and each OS sees a traditional Unix hierarchy with /usr, /home, /opt, /var, /etc. It executes in an environment that is pretty much identical to how it would be run on traditional systems.

There's no need to fully move to a system that uses only btrfs and strictly follows this sub-volume scheme. For example, we intend to provide implicit support for systems that are installed on ext4 or xfs, or that are put together with traditional packaging tools such as RPM or dpkg: if the user tries to install a runtime/app/framework/os image on a system that doesn't use btrfs so far, it can just create a loop-back btrfs image in /var, and push the data into that. Even us developers will run our stuff like this for a while, after all this new scheme is not particularly useful for highly individualized systems, and we developers usually tend to run systems like that.

Also note that this is in no way a departure from packaging systems like RPM or DEB. Even if the new scheme we propose is used for installing and updating a specific system, it is RPM/DEB that is used to put together the vendor OS tree initially. Hence, even in this scheme RPM/DEB are highly relevant, though not strictly as an end-user tool anymore, but as a build tool.

So Let's Summarize Again What We Propose

  • We want a unified scheme for how we can install and update OS images, user apps, runtimes and frameworks.

  • We want a unified scheme for how you can relatively freely mix OS images, apps, runtimes and frameworks on the same system.

  • We want a fully trusted system, where cryptographic verification of all executed code can be done, all the way to the firmware, as a standard feature of the system.

  • We want to allow app vendors to write their programs against very specific frameworks, under the knowledge that they will end up being executed with the exact same set of libraries chosen.

  • We want to allow parallel installation of multiple OSes and versions of them, multiple runtimes in multiple versions, as well as multiple frameworks in multiple versions. And of course, multiple apps in multiple versions.

  • We want everything double buffered (or actually n-fold buffered), to ensure we can reliably update/rollback versions, in particular to safely do automatic updates.

  • We want a system where updating a runtime, OS, framework, or OS container is as simple as adding in a new snapshot and restarting the runtime/OS/framework/OS container.

  • We want a system where we can easily instantiate a number of OS instances from a single vendor tree, with zero difference in how this is done whether the instance is to be booted on bare metal/VM or as a container.

  • We want to enable Linux to have an open scheme that people can use to build app markets and similar schemes, not restricted to a specific vendor.

Final Words

I'll be talking about this at LinuxCon Europe in October. I originally intended to discuss this at the Linux Plumbers Conference (which I assumed was the right forum for this kind of major plumbing level improvement), and at linux.conf.au, but there was no interest in my session submissions there...

Of course this is all work in progress. These are our current ideas we are working towards. As we progress we will likely change a number of things. For example, the precise naming of the sub-volumes might look very different in the end.

Of course, we are developers of the systemd project. Implementing this scheme is not just a job for the systemd developers. This is a reinvention of how distributions work, and hence needs great support from the distributions. We really hope we can trigger some interest by publishing this proposal now, to get the distributions on board. This after all is explicitly not supposed to be a solution for one specific project and one specific vendor product, we care about making this open, and solving it for the generic case, without cutting corners.

If you have any questions about this, you know how you can reach us (IRC, mail, G+, ...).

The future is going to be awesome!

August 29, 2014
We currently have a large influx of new people contributing to i915 - for the curious just check the git logs. As part of ramping them up I've done a few trainings about upstream review, and a bunch of people I've talked with at KS in Chicago were interested in that, too. So I've cleaned up the slides a bit and dropped the very few references to Intel internal resources. No speaker notes or video recording, but I think this is useful all in itself. And of course if you have comments or see big gaps - feedback is very much welcome:

Upstream Review Training Slides