planet.freedesktop.org
 February 06, 2016

Last week-end, I was in Brussels, Belgium for FOSDEM, one of the greatest open source developer conferences. I was not sure I would go this year (I already skipped it in 2015), but it turned out I was asked to give a talk in the shared Lua & GNU Guile devroom.

As a long-time Lua user and developer, and a follower of GNU Guile for several years, I was asked by the organizer to give a talk that would be a link between the two languages.

I entitled my talk "How awesome ended up with Lua and not Guile" and gave it to a room full of interested users of the awesome window manager 🙂.

We continued with a panel discussion entitled "The future of small languages: Experience of Lua and Guile", composed of Andy Wingo, Christopher Webber, Ludovic Courtès, Etiene Dalcol, Hisham Muhammad and myself. It was a pretty interesting discussion, where both communities shared their views on the state of their languages.

It was a bit awkward to talk about Lua & Guile given that most of my knowledge was years old, but it turns out many things haven't changed. I hope I was able to provide interesting insight to both communities. Finally, it was a pretty interesting FOSDEM for me, and it had been a long time since I last gave a talk there, so I really enjoyed it. See you next year!

 February 04, 2016

The slides from my FOSDEM talk are now available.

The video of the talk will eventually be available. It takes a lot of time and hard work to prepare all of the videos. The notes in the slides are probably more coherent than I am in the video. There were a few good questions at the end, so you may want to fast forward to that.

I am planning to post the source for the partial Amiga simulator next week. When I do, I will also have another blog post. Watch this space...

 February 03, 2016

I've moved my blog; you can read my new posts here: https://siliconislandblog.wordpress.com/

 January 30, 2016

Over the last few days I've been at the GNOME Developer Experience hackfest in Brussels, looking into xdg-app and how best to use it in Debian and Debian derivatives.

xdg-app is basically a way to run "non-core" software on Linux distributions, analogous to apps on Android and iOS. It doesn't replace distributions like Debian or packaging systems, but it adds a layer above them. It's mostly aimed towards third-party apps obtained from somewhere that isn't your distribution vendor, aiming to address a few long-standing problems in that space:

• There's no single ABI that can be called "a standard Linux system" in the same way there would be for Windows or OS X or Android or whatever, apart from LSB which is rather limited. Testing that a third-party app "works on Linux", or even "works on stable Linux releases from 2015", involves a combinatorial explosion of different distributions, desktop environments and local configurations. Steam uses the Steam Runtime, a chroot environment closely resembling Ubuntu 12.04 LTS; other vendors tend to test on a vaguely recent Ubuntu LTS and leave it at that.

• There's no widely-supported mechanism for installing third-party applications as an ordinary user. gog.com used to distribute Ubuntu- and Debian-compatible .deb files, but installing a .deb involves running arbitrary vendor-supplied scripts as root, which should worry anyone who wants any sort of privilege-separation. (They have now switched to executable self-extracting installers, which involve running arbitrary vendor-supplied scripts as an ordinary user... better, but not perfect.)

• Relatedly, the third-party application itself runs with the user's full privileges: a malicious or security-buggy third-party application can do more or less anything, unless you either switch to a different uid to run third-party apps, or use a carefully-written, app-specific AppArmor profile or equivalent.

To address the first point, each application uses a specified "runtime", which is available as /usr inside its sandbox. This can be used to run application bundles with multiple, potentially incompatible sets of dependencies within the same desktop environment. A runtime can be updated within its branch - for instance, if an application uses the "GNOME 3.18" runtime (consisting of a basic Linux system, the GNOME 3.18 libraries, other related libraries like Mesa, and their recursive dependencies like libjpeg), it can expect to see minor-version updates from GNOME 3.18.x (including any security updates that might be necessary for the bundled libraries), but not a jump to GNOME 3.20.

To address the second issue, the plan is for application bundles to be available as a single file, containing metadata (such as the runtime to use), the app itself, and any dependencies that are not available in the runtime (which the app vendor is responsible for updating if necessary). However, the primary way to distribute and upgrade runtimes and applications is to package them as OSTree repositories, which provide a git-like content-addressed filesystem, with efficient updates using binary deltas. The resulting files are hard-linked into place.
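
To make "content-addressed, hard-linked into place" concrete, here is a hand-rolled sketch of the idea (an illustration of the principle only, not OSTree's actual on-disk layout): store each file once under the hash of its contents, then hard-link it wherever it needs to appear. Identical files shared between apps and runtimes then occupy disk space only once.

$ h=$(sha256sum file.bin | cut -d' ' -f1)
$ mkdir -p repo/objects
$ ln file.bin "repo/objects/$h"
$ ln "repo/objects/$h" deploy/file.bin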

To address the last point, application bundles run partially isolated from the wider system, using containerization techniques such as namespaces to prevent direct access to system resources. Resources from outside the sandbox can be accessed via "portal" services, which are responsible for access control; for example, the Documents portal (the only one, so far) displays an "Open" dialog outside the sandbox, then allows the application to access only the selected file.

## xdg-app for Debian

One thing I've been doing at this hackfest is improving the existing Debian/Ubuntu packaging for xdg-app (and its dependencies ostree and libgsystem), aiming to get it into a state where I can upload it to Debian experimental. Because xdg-app aims to be a general freedesktop project, I'm currently intending to make it part of the "Utopia" packaging team alongside projects like D-Bus and polkit, but I'm open to suggestions if people want to co-maintain it elsewhere.

In the process of updating xdg-app, I sent various patches to Alex, mostly fixing build and test issues, which are in the new 0.4.8 release.

I'd appreciate co-maintainers and further testing for this stuff, particularly ostree: ostree is primarily a whole-OS deployment technology, which isn't a use-case that I've tested, and in particular ostree-grub2 probably doesn't work yet.

Source code:

Binaries (no trust path, so only use these if you have a test VM):

• deb https://people.debian.org/~smcv/xdg-app xdg-app main

## The "Hello, World" of xdg-apps

Another thing I set out to do here was to make a runtime and an app out of Debian packages. Most of the test applications in and around GNOME use the "freedesktop" or "GNOME" runtimes, which consist of a Yocto base system and lots of RPMs, are rebuilt from first principles on-demand, and are extensive and capable enough that they make it somewhat non-obvious what's in an app or a runtime.

So, here's a step-by-step route through xdg-app, first using typical GNOME instructions, but then using the simplest GUI app I could find - xvt, a small xterm clone. I'm using a Debian testing (stretch) x86_64 virtual machine for all this. xdg-app currently requires systemd-logind to put users and apps in cgroups, either with systemd as pid 1 (systemd-sysv) or systemd-shim and cgmanager; I used the default systemd-sysv. In principle it could work with plain cgmanager, but nobody has contributed that support yet.

### Demonstrating an existing xdg-app

Debian's kernel is currently patched so that the ability of unprivileged users to create user namespaces is runtime-configurable (and disabled by default), because there have been various security issues in that feature, making it a security risk for a typical machine (and particularly a server). Hopefully unprivileged user namespaces will soon be secure enough that we can enable them by default, but for now, we have to do one of three things to let xdg-app use them:

• enable unprivileged user namespaces via sysctl:

sudo sysctl kernel.unprivileged_userns_clone=1

• make xdg-app root-privileged (it will keep CAP_SYS_ADMIN and drop the rest):

sudo dpkg-statoverride --update --add root root 04755 /usr/bin/xdg-app-helper

• make xdg-app slightly less privileged:

sudo setcap cap_sys_admin+ep /usr/bin/xdg-app-helper
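
You can check whether the Debian-specific sysctl from the first option is already enabled; it reports 1 if unprivileged user namespaces are allowed:

$ sysctl kernel.unprivileged_userns_clone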


First, we'll need a runtime. The standard xdg-app tutorial would tell you to download the "GNOME Platform" version 3.18. To do that, you'd add a remote, which is a bit like a git remote, and a bit like an apt repository:

$ wget http://sdk.gnome.org/keys/gnome-sdk.gpg
$ xdg-app remote-add --user --gpg-import=gnome-sdk.gpg gnome \
    http://sdk.gnome.org/repo/


(I'm ignoring considerations like trust paths and security here, for brevity; in real life, you'd want to obtain the signing key via https and/or have a trust path to it, just like you would for a secure-apt signing key.)

You can list what's available in a remote:

$ xdg-app remote-ls --user gnome
...
org.freedesktop.Platform
...
org.freedesktop.Platform.Locale.cy
...
org.freedesktop.Sdk
...
org.gnome.Platform
...

The Platform runtimes are what we want here: they are collections of runtime libraries with which you can run an application. The Sdk runtimes add development tools, header files, etc., to be able to compile apps that will be compatible with the Platform.

For now, all we want is the GNOME 3.18 platform:

$ xdg-app install --user gnome org.gnome.Platform 3.18


Next, we can install an app that uses it, from Alex Larsson's nightly builds of a subset of GNOME. The server they're on doesn't have a great deal of bandwidth, so be nice :-)

$ wget http://209.132.179.2/keys/nightly.gpg
$ xdg-app remote-add --user --gpg-import=nightly.gpg nightly \
    http://209.132.179.2/repo/
$ xdg-app install --user nightly org.mypaint.MypaintDevel

We now have one app, and the runtime it needs:

$ xdg-app list
org.mypaint.MypaintDevel
$ xdg-app run org.mypaint.MypaintDevel
[you see a GUI window]

### Digression: what's in a runtime?

Behind the scenes, xdg-app runtimes and apps are both OSTree trees. This means the ostree tool, from the package of the same name, can be used to inspect them.

$ sudo apt install ostree
$ ostree refs --repo ~/.local/share/xdg-app/repo
gnome:runtime/org.gnome.Platform/x86_64/3.18
nightly:app/org.mypaint.MypaintDevel/x86_64/master

A "ref" has roughly the same meaning as in git: something like a branch or a tag. ostree can list the directory tree that it represents:

$ ostree ls --repo ~/.local/share/xdg-app/repo \
runtime/org.gnome.Platform/x86_64/3.18
d00755 0 0      0 /
d00755 0 0      0 /files
$ ostree ls --repo ~/.local/share/xdg-app/repo \
    runtime/org.gnome.Platform/x86_64/3.18 /files
d00755 0 0      0 /files
l00777 0 0      0 /files/local -> ../var/usrlocal
l00777 0 0      0 /files/sbin -> bin
d00755 0 0      0 /files/bin
d00755 0 0      0 /files/cache
d00755 0 0      0 /files/etc
d00755 0 0      0 /files/games
d00755 0 0      0 /files/include
d00755 0 0      0 /files/lib
d00755 0 0      0 /files/lib64
d00755 0 0      0 /files/libexec
d00755 0 0      0 /files/share
d00755 0 0      0 /files/src

You can see that /files in a runtime is basically a copy of /usr. This is not coincidental: the runtime's /files gets mounted at /usr inside the xdg-app container. There is also some metadata, which is in the ini-like syntax seen in .desktop files:

$ ostree cat --repo ~/.local/share/xdg-app/repo \
    runtime/org.gnome.Platform/x86_64/3.18 /metadata
[Runtime]
name=org.gnome.Platform/x86_64/3.16
runtime=org.gnome.Platform/x86_64/3.16
sdk=org.gnome.Sdk/x86_64/3.16

[Extension org.freedesktop.Platform.GL]
version=1.2
directory=lib/GL

[Extension org.freedesktop.Platform.Timezones]
version=1.2
directory=share/zoneinfo

[Extension org.gnome.Platform.Locale]
directory=share/runtime/locale
subdirectories=true

[Environment]
GI_TYPELIB_PATH=/app/lib/girepository-1.0
GST_PLUGIN_PATH=/app/lib/gstreamer-1.0
LD_LIBRARY_PATH=/app/lib:/usr/lib/GL


Looking at an app, the situation is fairly similar:

$ ostree ls --repo ~/.local/share/xdg-app/repo \
    app/org.mypaint.MypaintDevel/x86_64/master
d00755 0 0      0 /
-00644 0 0    258 /metadata
d00755 0 0      0 /export
d00755 0 0      0 /files

This time, /files maps to what will become /app for the application, which was compiled with --prefix=/app:

$ ostree ls --repo ~/.local/share/xdg-app/repo \
app/org.mypaint.MypaintDevel/x86_64/master /files
d00755 0 0      0 /files
-00644 0 0   4599 /files/manifest.json
d00755 0 0      0 /files/bin
d00755 0 0      0 /files/lib
d00755 0 0      0 /files/share


There is also a /export directory, which is made visible to the host system so that the contained app can appear as a "first-class citizen" in menus:

$ ostree ls --repo ~/.local/share/xdg-app/repo \
    app/org.mypaint.MypaintDevel/x86_64/master /export
d00755 0 0      0 /export
d00755 0 0      0 /export/share

$ ostree ls --repo ~/.local/share/xdg-app/repo \
app/org.mypaint.MypaintDevel/x86_64/master /export/share
d00755 0 0      0 /export/share
d00755 0 0      0 /export/share/app-info
d00755 0 0      0 /export/share/applications
d00755 0 0      0 /export/share/icons
$ ostree ls --repo ~/.local/share/xdg-app/repo \
    app/org.mypaint.MypaintDevel/x86_64/master /export/share/applications
d00755 0 0      0 /export/share/applications
-00644 0 0    715 /export/share/applications/org.mypaint.MypaintDevel.desktop

$ ostree cat --repo ~/.local/share/xdg-app/repo \
app/org.mypaint.MypaintDevel/x86_64/master \
/export/share/applications/org.mypaint.MypaintDevel.desktop
[Desktop Entry]
Version=1.0
Name=(Nightly) MyPaint
TryExec=mypaint
Exec=mypaint %f
Comment=Painting program for digital artists
...
Comment[zh_HK]=藝術家的电脑绘画
GenericName=Raster Graphics Editor
GenericName[fr]=Éditeur d'Image Matricielle
MimeType=image/openraster;image/png;image/jpeg;
Type=Application
Icon=org.mypaint.MypaintDevel
StartupNotify=true
Categories=Graphics;GTK;2DGraphics;RasterGraphics;
Terminal=false


$ ostree cat --repo ~/.local/share/xdg-app/repo \
    app/org.mypaint.MypaintDevel/x86_64/master /metadata
[Application]
name=org.mypaint.MypaintDevel
runtime=org.gnome.Platform/x86_64/3.18
sdk=org.gnome.Sdk/x86_64/3.18
command=mypaint

[Context]
shared=ipc;
sockets=x11;pulseaudio;
filesystems=host;

[Extension org.mypaint.MypaintDevel.Debug]
directory=lib/debug

### Building a runtime, probably the wrong way

The way in which the reference/demo runtimes and containers are generated is... involved. As far as I can tell, there's a base OS built using Yocto, and the actual GNOME bits come from RPMs. However, we don't need to go that far to get a working runtime. In preparing this runtime I'm probably completely ignoring some best practices and tools - but it works, so it's good enough.

First we'll need a repository:

$ sudo install -d -o $(id -nu) /srv/xdg-apps
$ ostree init --repo /srv/xdg-apps


I'm just keeping this local for this demonstration, but you could rsync it to a web server's exported directory or something - a lot like a git repository, it's just a collection of files. We want everything in /usr because that's what xdg-app expects, hence usrmerge:

$ sudo mount -t tmpfs -o mode=0755 tmpfs /mnt
$ sudo debootstrap --arch=amd64 --include=libx11-6,usrmerge \
    --variant=minbase stretch /mnt http://192.168.122.1:3142/debian
$ sudo mkdir /mnt/runtime
$ sudo mv /mnt/usr /mnt/runtime/files


This obviously has a lot of stuff in it that we don't need - most obviously init, apt and dpkg - but it's Good Enough™.

We will also need some metadata. This is sufficient:

$ sudo sh -c 'cat > /mnt/runtime/metadata'
[Runtime]
name=org.debian.Debootstrap/x86_64/8.20160130
runtime=org.debian.Debootstrap/x86_64/8.20160130

That's a runtime. We can commit it to ostree, and generate xdg-app metadata:

$ ostree commit --repo /srv/xdg-apps \
    --branch runtime/org.debian.Debootstrap/x86_64/8.20160130 \
    /mnt/runtime
(this fails with "Operation not permitted" when not root, so:)
$ fakeroot ostree commit --repo /srv/xdg-apps \
    --branch runtime/org.debian.Debootstrap/x86_64/8.20160130 \
    /mnt/runtime
$ fakeroot xdg-app build-update-repo /srv/xdg-apps


(I'm not sure why ostree and xdg-app report "Operation not permitted" when we aren't root or fakeroot - feedback welcome.)

build-update-repo would presumably also be the right place to GPG-sign your repository, if you were doing that.
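Something like this, assuming your xdg-app's build-update-repo has the --gpg-sign option that current versions provide (the key ID here is a placeholder):

$ fakeroot xdg-app build-update-repo --gpg-sign=YOURKEYID /srv/xdg-apps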

We can add that as another xdg-app remote:

$ xdg-app remote-add --user --no-gpg-verify local file:///srv/xdg-apps
$ xdg-app remote-ls --user local
org.debian.Debootstrap


### Building an app, probably the wrong way

The right way to build an app is to build an "SDK" runtime - similar to the platform runtime, but with development files and tools - and recompile the app and any missing libraries with ./configure --prefix=/app && make && make install. I'm not going to do that, because simplicity is nice, and I'm reasonably sure xvt doesn't actually hard-code /usr into the binary:

$ install -d xvt-app/files/bin
$ sudo apt-get --download-only install xvt
$ dpkg-deb --fsys-tarfile /var/cache/apt/archives/xvt_2.1-20.1_amd64.deb \
    | tar -xvf - ./usr/bin/xvt
./usr/
./usr/bin/
./usr/bin/xvt
...
$ mv usr xvt-app/files


Again, we'll need metadata, and it's much simpler than the more production-quality GNOME nightly builds:

$ cat > xvt-app/metadata
[Application]
name=org.debian.packages.xvt
runtime=org.debian.Debootstrap/x86_64/8.20160130
command=xvt

[Context]
sockets=x11;

$ fakeroot ostree commit --repo /srv/xdg-apps \
    --branch app/org.debian.packages.xvt/x86_64/2.1-20.1 xvt-app
$ fakeroot xdg-app build-update-repo /srv/xdg-apps
Updating appstream branch
No appstream data for runtime/org.debian.Debootstrap/x86_64/8.20160130
No appstream data for app/org.debian.packages.xvt/x86_64/2.1-20.1
Updating summary
$ xdg-app remote-ls --user local
org.debian.Debootstrap
org.debian.packages.xvt


### The obligatory screenshot

OK, good, now we can install it:

$ xdg-app install --user local org.debian.Debootstrap 8.20160130
$ xdg-app install --user local org.debian.packages.xvt 2.1-20.1
$ xdg-app run --branch=2.1-20.1 org.debian.packages.xvt

and you can play around with the shell in the xvt and see what you can and can't do in the container.

I'm sure there were better ways to do most of this, but I think there's value in having such a simplistic demo to go alongside the various GNOMEish apps.

Acknowledgements: Thanks to all those who helped!

 January 25, 2016

libinput 1.1.5 has a change in how we deal with semi-mt touchpads; in particular, interpretation of touch points will cease and we will rely on the single touch position and the BTN_TOOL_* flags instead to detect multi-finger interaction. For most of you this will have little effect, even if you have a semi-mt touchpad.

As a reminder: semi-mt touchpads are those that can detect the bounding box of two-finger interactions but cannot identify which finger is which. This introduces some ambiguity: a pair of touch points at x1/y1 and x2/y2 could also be a physical pair of touches at x1/y2 and x2/y1.

More importantly, we found issues with semi-mt touchpads that go beyond the ambiguity and reduce the usability of the touch points. Some devices have an extremely low resolution when two fingers are down (see Bug 91135); the data is little better than garbage. We have had 2-finger scrolling disabled on these touchpads since before libinput 1.0. More recently, Bug 93583 showed that some semi-mt touchpads do not assign the finger positions for some fingers, especially when three fingers are down. This results in touches defaulting to position 0/0, which triggers palm detection or results in scroll jumps, neither of which is helpful. Other semi-mt touchpads assign a straightforward 0/0 as position data and don't update it until several events later (see Red Hat Bug 1295073). libinput is not particularly suited to handle this, and even if it were, the touchpad's reaction to a three-finger tap would be noticeably delayed.

In light of these problems, and since they affect all three big semi-mt touchpad manufacturers, we decided to drop back and handle semi-mt touchpads as single-finger touchpads with extra finger capability. This means we track only one touch point but detect two- and three-finger interactions. Two-finger scrolling is still possible, and so are two- and three-finger tapping and the clickfinger behaviour. What isn't possible anymore are pinch gestures, and some of the built-in palm detection is deactivated. As mentioned above, this is unlikely to affect you too much, but if you're wondering why gestures don't work on your semi-mt device: the data is garbage.
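
To illustrate what "rely on the BTN_TOOL_* flags" means at the evdev level, here is a minimal reader sketch (this is not libinput source code, just a simplified illustration of the event stream such a device produces): the kernel reports the finger count through BTN_TOOL_FINGER/DOUBLETAP/TRIPLETAP key events, while only a single ABS_X/ABS_Y position is trusted.

#include <linux/input.h>
#include <stdio.h>

static int nfingers;
static int x, y;

/* Track the finger count from BTN_TOOL_* and the position from the single
 * (semi-mt) touch point. Simplified: finger release handling is omitted. */
static void handle_event(const struct input_event *ev)
{
	if (ev->type == EV_KEY && ev->value == 1) {
		if (ev->code == BTN_TOOL_FINGER)
			nfingers = 1;
		else if (ev->code == BTN_TOOL_DOUBLETAP)
			nfingers = 2;
		else if (ev->code == BTN_TOOL_TRIPLETAP)
			nfingers = 3;
	} else if (ev->type == EV_ABS) {
		if (ev->code == ABS_X)
			x = ev->value;
		else if (ev->code == ABS_Y)
			y = ev->value;
	} else if (ev->type == EV_SYN && ev->code == SYN_REPORT) {
		printf("%d finger(s), position %d/%d\n", nfingers, x, y);
	}
}

int main(int argc, char **argv)
{
	struct input_event ev;
	FILE *f = fopen(argc > 1 ? argv[1] : "/dev/input/event0", "rb");

	if (!f)
		return 1;
	while (fread(&ev, sizeof ev, 1, f) == 1)
		handle_event(&ev);
	return 0;
}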
This question - "Is Wayland ready yet?" - turns up a lot: on the IRC channel, mailing lists, forums, your local Stammtisch and at weddings. The correct answer is: this is the wrong question, and I'll explain why in this post. Note that I'll be skipping over a couple of technical bits; if you notice those, then you're probably not the person that needs to ask the question in the first place.

On your current Linux desktop, right now, you have at least three processes running: the X server, a window manager/compositor and your web browser. The X server is responsible for rendering things to the screen and handling your input. The window manager is responsible for telling the X server where to render the web browser window. Your web browser is responsible for displaying this post. The X server and the window manager communicate over the X protocol, and so do the X server and the web browser. The browser and the window manager communicate through X properties, using the X server as a middle man; that too is done via the X protocol. Note: this is of course a very simplified view.

Wayland is a protocol and it replaces the X protocol. Under Wayland, you only need two processes: a compositor and your web browser. The compositor is effectively equivalent to the X server and window manager merged into one thing, and it communicates with the web browser over the Wayland protocol. For this to work you need the compositor and the web browser to be able to understand the Wayland protocol.

This is why the question "is Wayland ready yet" does not make a lot of sense. Wayland is the communication protocol and says very little about the implementation of the two sides that you want to communicate.

Let's assume a scenario where we all decide to switch from English to French because it sounds nicer, and English was designed in the 80s when ASCII was king, so it doesn't support those funky squiggles that the French like to put on every second character. In this scenario, you wouldn't ask "Is French ready yet?" If no-one around you speaks French yet, that's not the language not being ready; the implementation (i.e. the humans) isn't ready. Maybe you can use French in a restaurant, but not yet in the supermarket. Maybe one waiter speaks both English and French, but the other one French only. So whether you can use French depends very much on the situation. But everyone agrees that eventually we'll all speak French, even though English will hang around for ages until it finally falls out of use. And those squiggles are so cute!

Wayland is the same. The protocol is stable and has been for a while. But not every compositor and/or toolkit/application speaks Wayland yet, so it may not be sufficient for your use-case. So rather than asking "Is Wayland ready yet", you should be asking: "Can I run GNOME/KDE/Enlightenment/etc. under Wayland?" That is the right question to ask, and the answer is generally "It depends what you expect to work flawlessly." This also means "people working on Wayland" is often better stated as "people working on Wayland support in ....".

An exception to the above: Wayland as a protocol defines what you can talk about. As a young protocol (compared to X with 30 years' worth of extensions) there are things that should be defined in the protocol but aren't yet. For example, Wacom tablet support is currently missing. Those are the legitimate cases where you can say Wayland isn't ready yet and where people are "working on Wayland". Of course, once the protocol is agreed on, you fall back to the above case: both sides of the equation need to implement the new protocol before you can make use of it.

Update 25/01/16: Matthias' answer to Is GNOME on Wayland ready yet?

 January 24, 2016

I've had this post sitting in my drafts for the last 7 months. The code is stale, but the concepts are correct. I had intended to add some pictures before posting, but it's clear that won't happen now. Words are better than nothing, I suppose… Recently I pushed an intel-gpu-tools tool for modifying GPU frequencies. […]

# Preface

GPU mirroring provides a mechanism to have the CPU and the GPU use the same virtual address for the same physical (or IOMMU) page. An immediate result of this is that relocations can be eliminated. There are a few derivative benefits from the removal of the relocation mechanism, but it really all boils down to that. Other people call it other things, but I chose this name before I had heard other names. SVM would probably have been a better name had I read the OCL spec sooner.
This is not an exclusive feature restricted to OpenCL. Any GPU client will hopefully eventually have this capability provided to them.

If you're going to read any single PPGTT post of this series, I think it should not be this one. I was not sure I'd write this post when I started documenting the PPGTT (part 1, part 2, part 3). I had hoped that any of the following things would have solidified the decision by the time I completed part 3.

1. CODE: The code is not merged, not reviewed, and not tested (by anyone but me). There's no indication about the "upstreamability". What this means is that if you read my blog to understand how the i915 driver currently works, you'll be taking a crap-shoot on this one.
2. DOCS: The Broadwell public Programmer Reference Manuals are not available. I can't refer to them directly, I can only refer to the code.
3. PRODUCT: Broadwell has not yet shipped. My ulterior motive had always been to rally the masses to test the code. Without product, that isn't possible.

Concomitant with these facts, my memory of the code and interesting parts of the hardware it utilizes continues to degrade. Ultimately, I decided to write down what I can while it's still fresh (for some very warped definition of "fresh").

# Goal

GPU mirroring is the goal. Dynamic page table allocations are very valuable by themselves. Using dynamic page table allocations can dramatically conserve system memory when running with multiple address spaces (part 3 if you forgot), which is something that should become pretty common shortly. Consider for a moment a Broadwell legacy 32b system (more details later). You would require about 8MB of page tables to map one page of system memory. With dynamic page table allocations, this is reduced to 8KB.
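
A quick sanity check of those numbers, assuming the legacy 32b layout described later in this post (4 PDP registers, each pointing to a 512-entry page directory whose entries each point to a 512-entry page table, with 4KB pages): allocating everything up front needs every page directory and page table, while mapping a single page dynamically needs just one of each.

$(4 + 4 \times 512)\ \mathrm{pages} \times 4\,\mathrm{KB} = 2052 \times 4\,\mathrm{KB} \approx 8\,\mathrm{MB}$

$(1\ \mathrm{PD} + 1\ \mathrm{PT}) \times 4\,\mathrm{KB} = 8\,\mathrm{KB}$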
Dynamic page table allocations are also an indirect requirement for implementing a 64b virtual address space. Having a 64b virtual address space is a pretty unremarkable feature by itself. On current workloads [that I am aware of] it provides no real benefit. Supporting 64b did require cleaning up the infrastructure code quite a bit though, and should anything from the series get merged, I believe the result is a huge improvement in code readability.

## Current Status

I briefly mentioned dogfooding these several months ago. At that time I only had the dynamic page table allocations on GEN7 working. The fallout wasn't nearly as bad as I was expecting, but things were far from stable. There was a second posting which is much more stable and contains support for everything through Broadwell. To summarize:

Feature              Status            TODO
Dynamic page tables  Implemented       Test and fix bugs
64b address space    Implemented       Test and fix bugs
GPU mirroring        Proof of concept  Decide on interface; implement interface1

Testing has been limited to just one machine, mine, when I don't have a million other things to do. With that caveat, on top of my last PPGTT stabilization patches things look pretty stable.

## Present: Relocations

Throughout many of my previous blog posts I've gone out of my way to avoid explaining relocations. My reluctance was because explaining the mechanics is quite tedious, not because it is a difficult concept. It's impossible [and extremely unfortunate for my weekend] to make the case for why these new PPGTT features are cool without touching on relocations at least a little bit. The following picture exemplifies both the CPU and GPU mapping the same pages with the current relocation mechanism.

To get to the above state, something like the following would happen.

1. Create BOx
2. Create BOy
3. Request BOx be uncached via (IOCTL DRM_IOCTL_I915_GEM_SET_CACHING).
4. Do one of the aforementioned operations on BOx and BOy
5. Perform execbuf2.

Accesses to the BO from the CPU require having a CPU virtual address that eventually points to the pages representing the BO2. The GPU has no notion of CPU virtual addresses (unless you have a bug in your code). Inevitably, all the GPU really cares about is physical pages, and which ones. On the other hand, userspace needs to build up a set of GPU commands which sometimes need to be aware of the absolute graphics address.

Several commands do not need an absolute address. 3DSTATE_VS for instance does not need to know anything about where Scratch Space Base Offset is actually located. It needs to provide an offset to the General State Base Address. The General State Base Address does need to be known by userspace: STATE_BASE_ADDRESS.

Using the relocation mechanism gives userspace a way to inform the i915 driver about the BOs which need an absolute address. The handles, plus some information about the GPU commands that need absolute graphics addresses, are submitted at execbuf time. The kernel will make a GPU mapping for all the pages that constitute the BO, process the list of GPU commands needing updates, and finally submit the work to the GPU.

## Future: No relocations

The diagram above demonstrates the goal: symmetric mappings to a BO on both the GPU and the CPU. There are benefits to ditching relocations. One of the nice side effects of getting rid of relocations is that it allows us to drop the use of the DRM memory manager and simply rely on malloc as the address space allocator. The DRM memory allocator does not get the same amount of attention with regard to performance as malloc does. Even if it did perform as ideally as possible, it's still a superfluous CPU workload. Other people can probably explain the CPU overhead in better detail. Oh, and OpenCL 2.0 requires it.

"OpenCL 2.0 adds support for shared virtual memory (a.k.a. SVM). SVM allows the host and kernels executing on devices to directly share complex, pointer-containing data structures such as trees and linked lists. It also eliminates the need to marshal data between the host and devices. As a result, SVM substantially simplifies OpenCL programming and may improve performance."

# Makin' it Happen

## 64b

As I've already mentioned, the most obvious requirement is expanding the GPU address space to match the CPU. If you have taken any sort of Operating Systems class, or read up on Linux MM within the last 10 years or so, the above drawing should be incredibly unremarkable. If you have not, you're probably left with a big 'WTF' face. I probably can't help you if you're in the latter group, but I do sympathize.

For the other camp: Broadwell brought 4-level page tables that work exactly how you'd expect them to. Instead of the x86 CPU's CR3, GEN GPUs have PML4. When operating in legacy 32b mode, there are 4 PDP registers that each point to a page directory and therefore map 4GB of address space3. The register is just a simple logical address pointing to a page directory. The actual changes in hardware interactions are trivial on top of all the existing PPGTT work.

The keen observer will notice that there are only 256 PML4 entries. This has to do with the way in which we've come about 64b addressing in x86.
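
For what it's worth, the arithmetic behind that number: each PML4 entry spans a full page-directory-pointer's worth of address space, and 256 of them cover exactly the lower canonical half of a 48-bit address space.

$512 \times 512 \times 512 \times 4\,\mathrm{KB} = 512\,\mathrm{GB}\ \text{per PML4 entry}$

$256 \times 512\,\mathrm{GB} = 128\,\mathrm{TB} = 2^{47}\,\mathrm{bytes}$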
This Wikipedia article explains it pretty well, and has links: http://en.wikipedia.org/wiki/X86-64#Canonical_form_addresses

## "This will take one week. I can just allocate everything up front." (Dynamic Page Table Allocation)

Funny story. I was asked to estimate how long it would take me to get this GPU mirror stuff in shape for a very rough proof of concept. "One week. I can just allocate everything up front." If what I have is "done", then I was off by 10x. Where I went wrong in my estimate was math. If you consider the following, you quickly see why allocating everything up front is a terrible idea and flat out impossible on some systems:

• 1 page for the PML4
• 512 PDP pages per PML4 (512; ok, we actually use 256)
• 512 PD pages per PDP (256 × 512 pages for PDs)
• 512 PT pages per PD (256 × 512 × 512 pages for PTs)

$(256 \times 512^2 + 256 \times 512 + 256 + 1) \times \mathrm{PAGE\_SIZE} \approx 256\,\mathrm{GB} = \text{oops}$

### Dissimilarities to x86

First and foremost, there are no GPU page faults to speak of. We cannot demand-allocate anything in the traditional sense. I was naive though, and one of the first thoughts I had was: the Linux kernel [heck, just about everything that calls itself an OS] manages 4-level page tables on multiple architectures. The page table format on Broadwell is remarkably similar to x86 page tables. If I can't use the code directly, surely I can copy. Wrong.

Here is some code from the Linux kernel which demonstrates how you can get a PTE for a given address in Linux.

typedef unsigned long pteval_t;
typedef struct { pteval_t pte; } pte_t;

static inline pteval_t native_pte_val(pte_t pte)
{
	return pte.pte;
}

static inline pteval_t pte_flags(pte_t pte)
{
	return native_pte_val(pte) & PTE_FLAGS_MASK;
}

static inline int pte_present(pte_t a)
{
	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_NUMA);
}

static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
{
	return (pte_t *)pmd_page_vaddr(*pmd) + pte_index(address);
}

#define pte_offset_map(dir, address) pte_offset_kernel((dir), (address))
#define pgd_offset(mm, address) ((mm)->pgd + pgd_index((address)))

static inline pud_t *pud_offset(pgd_t *pgd, unsigned long address)
{
	return (pud_t *)pgd_page_vaddr(*pgd) + pud_index(address);
}

static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
{
	return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
}

/* My completely fabricated example of finding page presence */
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *ptep;
struct mm_struct *mm = current->mm;
unsigned long address = 0xdefeca7e;

pgd = pgd_offset(mm, address);
pud = pud_offset(pgd, address);
pmd = pmd_offset(pud, address);
ptep = pte_offset_map(pmd, address);
printk("Page is present: %s\n", pte_present(*ptep) ? "yes" : "no");

x86 page table code has two very distinct properties that do not exist here (warning, this is slightly hand-wavy):

1. The kernel knows exactly where in physical memory the page tables reside4. On x86, it need only read CR3. We don't know where our page tables reside in physical memory because of the IOMMU: when VT-d is enabled, the i915 driver only knows the DMA address of the page tables.
2. There is a strong correlation between a CPU process and an mm (set of page tables). Keeping mappings around of the page tables is easy to do if you don't want to take the hit to map them every time you need to look at a PTE.

If the Linux kernel needs to find out if a page is present or not without taking a fault, it need only look to one of those two options.
After about a week of making the IOMMU driver do things it shouldn't do, and trying to push the square block through the round hole, I gave up on reusing the x86 code.

### Why Do We Actually Need Page Table Tracking?

The IOMMU interfaces were not designed to pull a physical address from a DMA address, so pre-allocation is right out. It's difficult to try to get the instantaneous state of the page tables…

Another thought I had very early on was that tracking could be avoided if we just never tore down page tables. I knew this wasn't a good solution, but at that time I just wanted to get the thing working and didn't really care if things blew up spectacularly after running for a few minutes. There is actually a really easy set of operations that shows why this won't work. For the following, think of the four-level page tables as arrays, i.e.

• PML4[0-255], each entry points to a PDP
• PDP[0-255][0-511], each entry points to a PD
• PD[0-255][0-511][0-511], each entry points to a PT
• PT[0-255][0-511][0-511][0-511] (where PT[0][0][0][0] is the 0th PTE in the system)

1. [mesa] Create a 2M sized BO. Write to it. Submit it via execbuffer.
2. [i915] See new BO in the execbuffer list. Allocate page tables for it…
   1. [DRM] Find that address 0 is free.
   2. [i915] Allocate PDP for PML4[0]
   3. [i915] Allocate PD for PDP[0][0]
   4. [i915] Allocate PT for PD[0][0][0]
   5. [i915] (condensed) Set pointers from PML4->PDP->PD->PT
   6. [i915] Set the 512 PTEs PT[0][0][0][511-0] to point to the BO's backing pages.
3. [i915] Dispatch work to the GPU on behalf of mesa.
4. [i915] Observe the hardware has completed.
5. [mesa] Create a 4k sized BO. Write to it. Submit both BOs via execbuffer.
6. [i915] See new BO in the execbuffer list. Allocate page tables for it…
   1. [DRM] Find that address 0x200000 is free.
   2. [i915] Allocate PDP[0][0], PD[0][0][0], PT[0][0][1].
   3. Set pointers… Wait. Is PDP[0][0] allocated already? Did we already set pointers? I have no freaking idea!
   4. Abort.

### Page Table Tracking with Bitmaps

Okay. I could have used a sentinel for empty entries. It is possible to achieve this same thing by using a sentinel value (point the page table entry to the scratch page). To implement this involves reading back potentially large amounts of data from the page tables, which will be slow. It should work though. I didn't try it.

After I had determined I couldn't reuse x86 code, and that I needed some way to track which page table elements were allocated, I was pretty set on using bitmaps for tracking usage. The idea of a hash table came and went - none of the upsides of a hash table are useful here, but all of the downsides are present (space). Bitmaps were sort of the default case. Unfortunately though, I did some math at this point. Notice the LaTeX!

$\frac{2^{47}\,\mathrm{bytes}}{4096\,\mathrm{bytes/page}} = 34359738368\ \mathrm{pages}$

$34359738368\ \mathrm{pages} \times \frac{1\,\mathrm{bit}}{\mathrm{page}} = 34359738368\ \mathrm{bits}$

$34359738368\ \mathrm{bits} \times \frac{1\,\mathrm{byte}}{8\,\mathrm{bits}} = 4294967296\ \mathrm{bytes}$

That's 4GB simply to track every page. There's some more overhead because page [tables, directories, directory pointers] are also tracked.

$256 + (256 \times 512) + (256 \times 512^2) = 67240192\ \mathrm{entries}$

$67240192\ \mathrm{entries} \times \frac{1\,\mathrm{bit}}{\mathrm{entry}} \times \frac{1\,\mathrm{byte}}{8\,\mathrm{bits}} = 8405024\ \mathrm{bytes}$

$4294967296\,\mathrm{bytes} + 8405024\,\mathrm{bytes} = 4303372320\,\mathrm{bytes} \times \frac{1\,\mathrm{GB}}{1073741824\,\mathrm{bytes}} = 4.0078\,\mathrm{GB}$

I can't remember whether I had planned to statically pre-allocate the bitmaps, or whether I was so caught up in the details that I couldn't see the big picture.
I remember thinking: 4GB just for the bitmaps, that will never fly. I probably spent a week trying to figure out a better solution. When we invent time travel, I will go back and talk to my former self: 4GB of bitmap tracking if you're using 128TB of memory is inconsequential - about 0.003% of the memory used by the GPU. Hopefully you didn't fall into that trap, and I just wasted your time, but there it is anyway.

#### Sample code to walk the page tables

This code does not actually exist, but it is very similar to the real code. The following shows how one would "walk" to a specific address, allocating the necessary page tables and setting the bitmaps along the way. Teardown is a bit harder, but it is similar.

static struct i915_pagedirpo *
alloc_one_pdp(struct i915_pml4 *pml4, int entry)
{
	...
}

static struct i915_pagedir *
alloc_one_pd(struct i915_pagedirpo *pdp, int entry)
{
	...
}

static struct i915_tab *
alloc_one_pt(struct i915_pagedir *pd, int entry)
{
	...
}

/**
 * alloc_page_tables - Allocate all page tables for the given virtual address.
 *
 * This will allocate all the necessary page tables to map exactly one page at
 * @address. The page tables will not be connected, and the PTE will not point
 * to a page.
 *
 * @ppgtt: The PPGTT structure encapsulating the virtual address space.
 * @address: The virtual address for which we want page tables.
 *
 */
static void
alloc_page_tables(ppgtt, unsigned long address)
{
	struct i915_pagetab *pt;
	struct i915_pagedir *pd;
	struct i915_pagedirpo *pdp;
	struct i915_pml4 *pml4 = &ppgtt->pml4; /* Always there */

	int pml4e = (address >> GEN8_PML4E_SHIFT) & GEN8_PML4E_MASK;
	int pdpe  = (address >> GEN8_PDPE_SHIFT) & GEN8_PDPE_MASK;
	int pde   = (address >> GEN8_PDE_SHIFT) & I915_PDE_MASK;
	int pte   = (address >> GEN8_PTE_SHIFT) & GEN8_PTE_MASK;

	if (!test_bit(pml4e, pml4->used_pml4es))
		goto alloc_pdp;

	pdp = pml4->pagedirpo[pml4e];
	if (!test_bit(pdpe, pdp->used_pdpes))
		goto alloc_pd;

	pd = pdp->pagedirs[pdpe];
	if (!test_bit(pde, pd->used_pdes))
		goto alloc_pt;

	pt = pd->page_tables[pde];
	if (test_bit(pte, pt->used_ptes))
		return;
	/* All page tables already exist for this address. */
	return;

alloc_pdp:
	pdp = alloc_one_pdp(pml4, pml4e);
	set_bit(pml4e, pml4->used_pml4es);
alloc_pd:
	pd = alloc_one_pd(pdp, pdpe);
	set_bit(pdpe, pdp->used_pdpes);
alloc_pt:
	pt = alloc_one_pt(pd, pde);
	set_bit(pde, pd->used_pdes);
}

Here is a picture which shows the bitmaps for the two-allocation example above.

# The GPU mirroring interface

I really don't want to spend too much time here. In other words, no more pictures. As I've already mentioned, the interface was designed for a proof of concept which already had code using userptr. The shortest path was to simply reuse the interface.

In the patches I've submitted, 2 changes were made to the existing userptr interface (which wasn't then, but is now, merged upstream). I added a context ID, and a flag to specify you want mirroring.

struct drm_i915_gem_userptr {
	__u64 user_ptr;
	__u64 user_size;
	__u32 ctx_id;
	__u32 flags;
#define I915_USERPTR_READ_ONLY		(1<<0)
#define I915_USERPTR_GPU_MIRROR		(1<<1)
#define I915_USERPTR_UNSYNCHRONIZED	(1<<31)
	/**
	 * Returned handle for the object.
	 *
	 * Object handles are nonzero.
	 */
	__u32 handle;
	__u32 pad;
};

The context argument is to tell the i915 driver for which address space we'll be mirroring the BO. Recall from part 3 that a GPU process may have multiple contexts. The flag is simply to tell the kernel to use the value in user_ptr as the address to map the BO in the virtual address space of the GEN GPU.
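
To make the intended flow concrete, here is a hypothetical usage sketch. The struct layout and the I915_USERPTR_GPU_MIRROR flag come from the (never merged) patch series above, so none of this is mainline uAPI, and the ioctl request code is a placeholder for the same reason.

#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>

/* Hypothetical: mirrors the extended payload shown above; not mainline. */
struct i915_gem_userptr_mirror {
	uint64_t user_ptr;
	uint64_t user_size;
	uint32_t ctx_id;
	uint32_t flags;
	uint32_t handle;
	uint32_t pad;
};
#define I915_USERPTR_GPU_MIRROR (1 << 1)	/* from the patch series */
#define IOCTL_USERPTR_MIRROR 0			/* placeholder request code */

uint32_t mirror_malloc(int fd, uint32_t ctx_id, size_t sz)
{
	struct i915_gem_userptr_mirror arg = { 0 };
	void *ptr;

	/* Page alignment: the GPU maps whole pages. */
	if (posix_memalign(&ptr, 4096, sz))
		return 0;

	arg.user_ptr = (uintptr_t)ptr;
	arg.user_size = sz;
	arg.ctx_id = ctx_id;			/* which address space to mirror into */
	arg.flags = I915_USERPTR_GPU_MIRROR;	/* GPU VA == CPU VA == ptr */

	if (ioctl(fd, IOCTL_USERPTR_MIRROR, &arg))
		return 0;
	/* GPU commands may now reference ptr's value directly: no relocations. */
	return arg.handle;
}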
When using the normal userptr interface, the i915 driver will pick the GPU virtual address.

• Pros:
  • This interface is very simple.
  • Existing userptr code does the hard work for us.
• Cons:
  • You need 1 IOCTL per object. Much unneeded overhead.
  • It's subject to a lot of the problems userptr has5.
  • Userptr was already merged, so unless pad gets repurposed, we're screwed.

## What should be: soft pin

There hasn't been too much discussion here, so it's hard to say. I believe the trend of the discussion (and the author's personal preference) would be to add flags to the existing execbuf relocation mechanism. The flag would tell the kernel to not relocate it, and to use the presumed_offset field that already exists. This is sometimes called "soft pin." It is a bit of a chicken-and-egg problem, since the amount of work in userspace to make this useful is non-trivial, and the feature can't be merged until there is an open source userspace. Stay tuned. Perhaps I'll update the blog as the story unfolds.

# Wrapping it up (all 4 parts)

As usual, please report bugs or ask questions. So with the 4 parts you should understand how the GPU interacts with system memory. You should know what the Global GTT is, why it still exists, and how it works. You might recall what a PPGTT is, and the intricacies of multiple address spaces. Hopefully you remember what you just read about 64b and GPU mirror. Expect a rebased patch series from me soon with all that was discussed (quite a bit has changed around me since my original posting of the patches).

This is the last post I will be writing on how GEN hardware interfaces with system memory and how that relates to the i915 driver. Unlike the Rocky movie series, I will stop at the 4th. Like the Rocky movie series, I hope this is the best. Yes, I just went there.

Unlike the usual "buy me a beer if you liked this", I would like to buy you a beer if you read it and considered giving me feedback. So if you know me, or meet me somewhere, feel free to reclaim the voucher.

## Image links

1. The patches I posted for enabling GPU mirroring piggyback off of the existing userptr interface. Before those patches were merged I added some info to the API (a flag + context) for the purpose of testing. I needed to get this working quickly and porting from the existing userptr code was the shortest path. Since then userptr has been merged without this extra info, which makes things difficult for people trying to test things. In any case an interface needs to be agreed upon. My preference would be to do this via the existing relocation flags. One could add a new flag called "SOFT_PIN".
2. The GEM and BO terminology is a fancy-sounding wrapper for the notion that we want an interface to coherently write data which the GPU can read (input), and have the CPU observe data which the GPU has written (output).
3. The PDP registers are not PDPEs because they do not have any of the associated flags of a PDPE. Also, note that in my patch series I submitted a patch which defines the number of these to be PDPE. This is incorrect.
4. I am not sure how KVM manages page tables. At least conceptually I'd think they'd have a similar problem to the i915 driver's page table management. I should probably have looked a bit closer, as I may have been able to leverage that; but I didn't have the idea until just now… Looking at the KVM code, it does have a lot of similarities to the approach I took.
5. Let me be clear that I don't think userptr is a bad thing. It's a very hard thing to get right, and much of the trickery needed for it is *not* needed for GPU mirroring.
• EDIT1: I forgot to include a diagram I did of the software state machine for some presentation. I long ago lost the SVG, and it got kind of messed up, but it's there at the bottom.
• EDIT2: (Apologies to aggregators) Grammar fixes. Fixed some bugs in a couple of the images.
• EDIT3: (Again, apologies to aggregators) s/indirect rendering/direct rendering/. I had to fix this or else the sentence made no sense.
• EDIT4 (2017-07-13): I was under the impression we were not yet allowed to talk about preemption. But apparently we are. So the feature matrix at the bottom is updated.

The Per-Process Graphics Translation Tables provide real process isolation among the various graphics processes running within an i915-based system. When in use, the combination of the PPGTT and the Hardware Context provide the equivalent of the traditional CPU process. Most of the same capabilities can be provided, and most of the same limitations come with it. True PPGTT encompasses all of the functionality currently merged into the i915 kernel driver that supports page tables and address spaces. It's called "true" because the Aliasing PPGTT was introduced first and was often simply called "PPGTT."

The True PPGTT patches represent one of the more challenging aspects of working on a project like the Linux kernel. The feature couldn't realistically be enabled in isolation from the existing driver. When regressions occur, it's likely that the user gets no display. To say we get chided on occasion would be an understatement. Ipso facto, this feature is not enabled by default. There are quite a few patches on the mailing list that build new functionality on top of this support, and that help stabilize existing support. If one wishes to try enabling the real PPGTT, one must simply use the i915 module parameter: enable_ppgtt=2. I highly recommend that the stability patches be used, unless you're reading this in some future where the stability problems are fixed upstream.

Unlike the previous posts, where I tried to emphasize the hardware architecture for this feature, the following will go into almost no detail about how the hardware works. There won't be PRM references or hardware state machines. All of those mechanics have been described in parts 1 and 2.

# A Brief History of the i915 Graphics Process

There have been three stages in the definition of a graphics process within the i915 driver. I believe that by explaining the stages one can get a better appreciation for the capabilities. In the following pictures there is meant to be a highlighted region (yellow in the first two; yellow, orange and blue in the last) that denotes the scope of a GPU context/process with the specified feature. Incrementally, the definition of a process begins to bleed between the CPU and the GPU.

Unfortunately I have some overlap with my earlier post about Hardware Contexts. I found no good way to write this post without doing so. If you read that post, consider this a refresher.

## File Descriptors

Initially, all GPU state was shared by every GPU client. The only partition was done via the operating system: every process that does direct rendering gets a file descriptor for the device, and the file descriptor is the thing through which commands are submitted. This could be used by the i915 driver to help disambiguate "who" was doing "what." This permitted the i915 kernel driver to prevent one GPU client from directly referencing the buffers owned by a different GPU client. By making the buffer object handles per file descriptor (this is very easy to implement; it's just an idr in the kernel) there exists no mechanism to reference buffer handles from a different file descriptor. For applications which do not require saved context, are not buggy, and are not malicious, this separation is still perfectly sufficient. As an example, BO handle #1 for the X server is not the same as BO handle #1 for xonotic, since each has a different file descriptor1.
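
A small demonstration of that per-fd handle namespace, as a sketch using the real GEM_CREATE ioctl (assumes an i915 device at /dev/dri/card0 and the uapi header from libdrm/the kernel):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

int main(void)
{
	/* Two "clients": two independent opens of the same device. */
	int fd_a = open("/dev/dri/card0", O_RDWR);
	int fd_b = open("/dev/dri/card0", O_RDWR);
	struct drm_i915_gem_create a = { .size = 4096 };
	struct drm_i915_gem_create b = { .size = 4096 };

	ioctl(fd_a, DRM_IOCTL_I915_GEM_CREATE, &a);
	ioctl(fd_b, DRM_IOCTL_I915_GEM_CREATE, &b);

	/* Both will typically print 1: the handle idr is per-fd, so
	 * neither client can name the other's BO through its handle. */
	printf("client A handle %u, client B handle %u\n", a.handle, b.handle);
	return 0;
}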
Even though we had this partition at the software level, nothing was enforced by the hardware. Provided a GPU client could guess where another buffer resided, it could easily operate on that buffer. Similarly, a GPU client could not expect the GPU state it had set previously to be preserved for any amount of time.

## Hardware Contexts

The next step towards isolation was the Hardware Context2. The hardware contexts built upon the isolation provided by the original file descriptor mechanism. The hardware context was an opt-in interface, which meant that those not wishing to use the interface received the old behavior: they could purposefully or accidentally use the state from another GPU client3. There was quite a bit of discussion around this at the time the patches were in review, and there's not really any point in lamenting about how it could be better now.

The context exists within the domain of the process/file descriptor, in the same way that a BO exists in that domain. Contexts cannot be shared [intentionally]. The interface created was, and remains, extremely simple.

struct drm_i915_gem_context_create {
	/* output: id of new context*/
	__u32 ctx_id;
	__u32 pad;
};

struct drm_i915_gem_context_destroy {
	__u32 ctx_id;
	__u32 pad;
};

As you can see from the two IOCTL payloads above, I wasn't lying about the simplicity. Because there was not a great deal of variable functionality, there just wasn't a lot to add in terms of the interface. Destroy is an optional call, because we have the file descriptor and can clean up if a process does not. The primary motivation for destroy() is simply to allow very meticulous and memory-conscious GPU clients to keep things tidy.
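
These payloads are carried by the DRM_IOCTL_I915_GEM_CONTEXT_CREATE and DRM_IOCTL_I915_GEM_CONTEXT_DESTROY ioctls; a minimal sketch of the lifecycle:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

int main(void)
{
	int fd = open("/dev/dri/card0", O_RDWR);
	struct drm_i915_gem_context_create create = { 0 };
	struct drm_i915_gem_context_destroy destroy = { 0 };

	if (ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE, &create) == 0)
		printf("new context id: %u\n", create.ctx_id);

	/* Optional: the kernel also cleans up when fd is closed. */
	destroy.ctx_id = create.ctx_id;
	ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_DESTROY, &destroy);
	return 0;
}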
Earlier I had a list of 3 types of GPU clients that could survive without this separation. Considering their inverse, this takes one of those off the list:

• GPU clients that needed HW context preserved
• Buggy applications writing to random memory
• Malicious applications

The block diagram is quite similar to the above diagram, with the exception that now there are discrete blocks for the persistent state. I was a bit lazy with the separation on this drawing. Hopefully, you get the idea.

## Full PPGTT

The last piece was to provide a discrete virtual address space for each GPU client. For completeness, I will provide the diagram, but by now you should already know what to expect. If I wrote about this picture, there would be no point in continuing with an organized blog post :-). So I'll continue to explain this topic. Take my word for it that this addresses the other two types of GPU clients:

• GPU clients that needed HW context preserved
• Buggy applications writing to random memory
• Malicious applications

Since the GGTT isn't really mentioned much in this post, I'd like to point out that the GTT still exists, as you can see in this diagram. It is required for several components that were listed in my previous blog post.

# VMAs and Address Spaces (AKA VMs)

The patch series which began to implement PPGTT was actually a separate series. It was the one that introduced the Virtual Memory Area for the PPGTT, simply referred to as VMA4. You can think of a VMA in a very similar way to a GEM BO: it is an identifiable, continuous range within an address space. Conceptually there isn't much difference from a GEM BO. To try to define it in my horrible math jargon: a VMA is a logical grouping of virtual addresses representing an operand for some GPU operation within a given PPGTT domain. A VMA is uniquely identified via the tuple (BO, address space). In the likely case that I made no sense just there: a VMA is just another handle on a chunk of GPU memory used for rendering.

## Sharing VMAs

You can't (see the note at the bottom). There's not a whole lot I can say without doing another post about DMA-Buf and/or flink. Perhaps someday I will, but for now I'll keep things general and brief.

It is impossible to share a VMA. To repeat, a VMA is uniquely identifiable by the address space and a BO. It remains possible to share a BO. An address space exists for an individual GPU client's process. Therefore it makes no sense to share a VMA, since the address space cannot be shared5. As a result of using the existing sharing interfaces, a GPU will get multiple VMAs that reference the same BO. Trying to go back to the math jargon again:

1. VMA: (BO, address space) // some BO mapped by the address space
2. VMA′: (BO′, address space) // another BO mapped into the address space
3. VMA″: (BO, address space′) // the same BO as 1, mapped into a different address space

In case it's still unclear, I'll use an example (which is kind of a simplified/false demonstration). The scanout buffer is the thing which is displayed on the screen. When doing frontbuffer rendering, one directly renders to that buffer. If we remember my previous post, the Display Engine requires a GGTT mapping. Therefore we know we have VMAglobal. Jumping ahead, a GPU client cannot have a global mapping; therefore, to render to the frontbuffer it too has a VMA, VMApp. There you have two VMAs pointing to the same buffer object.

NOTE: You can actually share VMAs if you are already sharing a Context/PPGTT. I can't think of any real-world examples off the top of my head, but it is possible, and potentially a useful thing to do.

## Data Structures

Here are the relevant data structures, cropped for the sake of brevity.

struct i915_address_space {
	struct drm_mm mm;
	unsigned long start;	/* Start offset always 0 for dri2 */
	size_t total;		/* size addr space maps (ex. 2GB for ggtt) */
	struct list_head active_list;
	struct list_head inactive_list;
};

struct i915_hw_ppgtt {
	struct i915_address_space base;
	int (*switch_mm)(struct i915_hw_ppgtt *ppgtt,
			 struct intel_engine_cs *ring,
			 bool synchronous);
};

struct i915_vma {
	struct drm_mm_node node;
	struct drm_i915_gem_object *obj;
	struct i915_address_space *vm;
};

The struct i915_hw_ppgtt is a subclass of struct i915_address_space. Only two implementors of i915_address_space exist: the i915_hw_ppgtt (a PPGTT), and the i915_gtt (the GGTT). It might make some sense to create a new PPGTT subclass for GEN8+, but I've not opted to do this. I feel there is too much duplication for not enough benefit.

I've already explained in different words that a range of used address space is the VMA.
If the address space has the drm_mm, then it should make direct sense that the VMA has the drm_mm_node, because this is the used part of the address space6. In the i915_vma struct above there is a pointer to the address space for which the VMA exists, and the object the VMA is referencing. This provides the tuple that defines the VMA.

## Relation to the Hardware Context

struct intel_context {
	struct kref ref;
	int id;
	...
	struct i915_address_space *vm;
};

With the 3 elements discussed a few times already - file descriptor, context, PPGTT - we get real GPU process isolation. Since the context was historically an opt-in interface, changes needed to be made in order to keep the opt-in behavior yet provide isolation behind the scenes regardless of what the GPU client tried to do. If this were not done, then innocent GPU clients could feel the wrath. The file descriptor was already intimately connected with the direct rendering process (one cannot render without getting a file descriptor), so it made sense to hook off of that to create the contexts and PPGTTs.

## Implicit Context ("private default context")

From here on out we can consider a "context" as the 3 elements: fd, HW context, and a PPGTT. In the driver as it exists today, if a GPU client does not provide a context for rendering, it cannot rely on GPU state being preserved. A context is created for GPU clients that do not provide one, but the state of this context should be considered completely opaque to all GPU clients. I've called this the Private Default Context, as it very much resembles the default context that exists for the whole system (again, let me point you to the previous blog post on contexts). The driver will isolate the various contexts within the system from implicit contexts, and vice versa. Hardware state is undefined while using the private default context; when using the IOCTLs with a real context, hardware state is maintained from the previous render operation.

The behavior of the implicit context does result in waste when userspace uses contexts (as mesa/libgl does). There are a few solutions to this problem, and I've submitted patches for all of them (I can count 3 off the top of my head). Perhaps one day in the not-too-distant future, the above section will be false and we can just say: every process will get a context when they open the DRI file. If they want more contexts, they can use the IOCTL.

## Multi Context

A GPU client can create more than one context. The context they wish to use for a given rendering command is built into the execbuffer2 API (note that KMS is not context-savvy).

struct drm_i915_gem_execbuffer2 {
	/**
	 * List of gem_exec_object2 structs
	 */
	__u64 buffers_ptr;
	__u32 buffer_count;

	/** Offset in the batchbuffer to start execution from. */
	__u32 batch_start_offset;
	/** Bytes used in batchbuffer from batch_start_offset */
	__u32 batch_len;
	...
	__u64 flags;
	__u64 rsvd1; /* now used for context info */
	__u64 rsvd2;
};

A process may wish to create several GL contexts. The API allows this, and for reasons I don't understand, it's something some applications wish to do. If there were no mechanism to create new contexts, userspace would be forced to open a new file descriptor for each GL context, or else it would not reap the benefits of everything we've discussed for a GL context.
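
The context id rides along in rsvd1, and i915_drm.h ships a helper macro for exactly this. A sketch, assuming the exec-object array has been prepared elsewhere:

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

/* Dispatch a batch under a specific context. Assumes fd, ctx_id, and a
 * filled exec-object list (last object is the batch) from elsewhere. */
static int submit_with_context(int fd, uint32_t ctx_id,
			       struct drm_i915_gem_exec_object2 *objs,
			       uint32_t count, uint32_t batch_len)
{
	struct drm_i915_gem_execbuffer2 eb;

	memset(&eb, 0, sizeof(eb));
	eb.buffers_ptr = (uintptr_t)objs;
	eb.buffer_count = count;
	eb.batch_len = batch_len;
	i915_execbuffer2_set_context_id(eb, ctx_id);	/* stores into rsvd1 */
	return ioctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &eb);
}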
Quoting myself from one of my earlier public declarations, here:

"My long term vision is for contexts to have a 1:1 relationship with a PPGTT. Sharing objects between address spaces would work similarly to the flink/dmabuf model if needed."

My idea was to embed the PPGTT within the context structure, so that creating a context always resulted in a new PPGTT. Creating a PPGTT by itself would have been impossible. This is not what we ended up doing. The implementation allows multiple hardware contexts to share a PPGTT. I'm still unclear exactly what is needed to support share groups within OpenGL, but it has been speculated that this is a requirement for them. Fundamentally this would allow the client to create multiple GPU contexts that share an address space (it resembles what you'd get back when there were only HW contexts). The execbuffer2 IOCTL allows one to specify the context. Behaviorally, however, my proposal matches what is in use currently. I think it's a bit easier to think of things this way too.

# Conclusion

Please feel free to send me issues or questions.

Oh yeah. Here is a state machine that I did for a presentation on this. Things got rendered weird, and I lost the original SVG file, but perhaps it will be of some value to someone.

## TODO

As I alluded to earlier, there is still some work left to do in order to get this feature turned on by default. I gave the links to some patches, and the parameter to make it happen. If you feel motivated to help get this stuff moving forward: test it, report bugs, try to fix stuff, and don't yell at me when things break :-).

## Summary

That's most of it. I like to give the 10 second summary.

1. i915_vma, i915_hw_ppgtt, i915_address_space: important things.
2. The GPU has a virtual address space per DRI file descriptor.
3. There is a connection between the PPGTT and a Hardware Context.
4. VMAs are backed by BOs, which are backed by physical pages.
5. GPU clients have some flexibility in how they interact with contexts, and therefore the PPGTT.

And finally, since I compared our now well defined notion of a GPU process to the traditional CPU process, I wanted to create a quick list of what I think are some interesting data points regarding the capabilities of the processors.

Thing                   Modern X86 CPU    Modern i915 GPU
Phys Address Limit      48b?              ~40b
Process Isolation       Yes               Yes (with True PPGTT)
Virtual Address Space   Yes               Yes
64b VA Space            Yes               GEN8+ 48b only
PTE access controls     Yes               No
Page Fault Handling     Yes               No
Preemption*             Yes               *With execlists

*I am defining the word "preemption" as the ability to switch at an arbitrary point in time between contexts. On the CPU, this is easily accomplished. The GPU running the i915 driver, as of today, has no way to do this. Once a batch is running, it cannot be interrupted except for RC6.

So while True PPGTT brings the GPU closer to having all of the [what I consider to be] interesting features of a modern x86 CPU – it still has a ways to go. I would be surprised if things didn't continue going in this direction.

## SVG Links

As usual, please feel free to do something useful with the images I've created. Also as usual, they are really poorly named.
https://bwidawsk.net/blog/wp-content/uploads/2014/07/pre-context.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/post-context.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/post-ppgtt.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/vma-bo-page.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/vma.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/ppgtt-context.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/multi-context.svg

1. It's technically possible to make them be the same BO through the two buffer sharing mechanisms.
2. Around the same time Hardware Contexts were introduced, so was the Aliasing PPGTT. The Aliasing PPGTT was interesting; however, it does not contribute to any part of the GPU "process".
3. Hardware contexts use a mechanism which will inhibit the restoration of state when not opted in. This means if one GPU client does opt in and another does not, the client without contexts can reuse the state of the client with contexts. As the address space is still shared, this is actually a really dangerous thing to allow.
4. I would have preferred the reservation of a space within the address space be called a "GVMA", but that was shot down during review.
5. There's a whole section below describing how this statement could be false. For now, let's pretend address spaces can't be shared.
6. For those unfamiliar with the Direct Rendering Manager memory manager: a drm_mm is the structure for the memory manager provided by the DRM midlayer helper. It does all the things you'd expect of a memory manager, like finding free nodes, allocating nodes, freeing up nodes… A drm_mm_node is a structure representing an allocation from the memory manager. The PPGTT code relies entirely on the drm_mm and the DRM helper functions in order to actually do the address space allocations and frees.

 January 20, 2016

In light of recent general confusion between X.Org the technical project and X.Org the Foundation, here's a little overview.

## X.Org the project

X.Org is the current reference implementation of the X Window System, which has been around since the mid-80s. Its most prominent member is the X server and the related drivers, but we put a whole bunch of other things under the same umbrella, e.g. mesa, drm, and - yes - wayland. Like most free software projects it is loosely organised and very few developers are involved in everything; everybody has their niche. If you're running Linux or a BSD and you can see a desktop environment in front of you, X.Org the technical project is somewhere in that stack.

## X.Org the Foundation

The foundation is a non-profit organisation tasked with the stewardship of the X Window System, particularly the X.Org implementation. The most important thing is: the X.Org Foundation does not control the technical direction, it acts in a supporting role only. X.Org has 501(c)(3) tax status in the US, which means that donations are tax-deductible (though we haven't collected donations in years). It also means that how we can spend money is very restricted. These days the Foundation's supporting roles are largely: sponsoring the annual X Developers Conference (XDC), providing travel sponsorship to XDC attendees, and being the organisation to participate in the Google Summer of Code. Oh, and did I mention that the X.Org Foundation does not control the technical direction?

## What does it matter?
The difference matters, especially for well-nuanced and thought-out statements like "X must die" in response to articles about the X.Org Foundation. If you want the Foundation to cease to exist, you're essentially saying "XDC and X.Org's GSoC participation must die". Given that a significant percentage of those two are now Wayland-related, that may have some unintended side-effects. If you want the technical project to die, it may also be wise to consider the side-effects: Wayland isn't quite ready yet, and much of the work that is done under the umbrella of X benefits Wayland (libinput, graphics driver work, etc.).

Now if you excuse me, there's a windmill that needs tilting at. Rocinante, where are you?

 January 14, 2016

First, the title is a slight lie: this really is about compositor switching and not necessarily about using Linux VTs for that. But I hope that the title draws in the right folks and tempts them to read this. Because with atomic there's a bit of a problem if you want to switch between different compositors - maybe you have X running and hack on wayland-mutter, or kwin and mutter, or just a DE and a login manager - and expect it to not end up in a modern arts project like this.

Now the trouble with atomic modesetting and switching between different compositors is that atomic display updates are incremental, for two reasons:

• First, we need to be able to support the legacy KMS interface for existing userspace, and that interface only ever updated parts of the overall display state.

• Second, we want atomic to be extensible. Which means compositors must be able to update only the properties they understand and leave everything else at presumably reasonable default values.

But if you mix this with a bunch of different compositors which all understand different subsets of all the atomic extensions a driver supports, suddenly the assumption that unhandled values have reasonable settings becomes invalid, and partial updates become a problem.

Recently there have been discussions on mailing lists and IRC about how to solve this, which ultimately ended in the conclusion that we kernel folks don't really know what would work best for distros, desktop environments and their compositors. Just going ahead with some new kernel ABI could easily result in a mistake that we have to support for the next 10 years. Therefore this blog post covers the ideas we've come up with to tackle this, just to make it clear that kernel folks are aware of this gap. As soon as userspace people with real clue about this topic run into problems, they're more than welcome on dri-devel@lists.freedesktop.org and then we can figure out what to implement.

With that out of the way, let's look at possible solutions.

FBDEV resets to defaults

This is the cheap cop-out that is essentially implemented right now - when switching to a kernel console running on top of the FBDEV emulation KMS drivers can provide, the driver resets atomic properties of new extensions (like rotation, color management, Z-position/alpha/blending, whatever) to hopefully sane defaults. That's good enough for developers hacking around on different compositors, as long as you remember to VT-switch to a kernel console after things went south.

But FBDEV is seriously uncool, and a lot of people are working towards removing the kernel's VT subsystem from modern distros, too. This doesn't really work everywhere, but it's kind of the minimal solution, and the one we'll definitely implement.
This also has the downside that maybe you only want to restore some properties, while keeping others (since they might be crucial to your setup, for example rotating the screen).

System compositor restores boot-up state

If doing something in the kernel isn't flexible enough, then the usual approach is to do it in userspace. Most systems have some kind of master compositor that's run in-between user sessions, like a login manager. That system compositor could restore modeset state to something sensible every time it runs again, and user session compositors could then take over that sensible setup as their starting point. This has the benefit that more clever and specific stuff, like only restoring some properties, is easy to implement. This shouldn't need a whole lot of code either, since a compositor needs to be able to restore its state anyway to allow switching between compositors.

But, you're saying, how can a system compositor know about all the possible and future atomic KMS extensions? Won't this scheme break the much-heralded extensibility? Well, the neat trick is that userspace doesn't need to be able to understand properties to save and restore them - the actual property value transport between kernel and userspace is fully generic. There are a few special cases, like the need to disable outputs that have been unplugged meanwhile, and also some properties with special meaning, like framebuffers. But generic userspace can read out all the metadata, and even if future property types extend e.g. the value range, that should still work (a rough code sketch of this generic save/restore follows a bit further below).

The downside of this approach is that it depends upon the global system compositor to do this right, so if that crashes and leaves your display setup in shambles, you can't easily recover any more. This also requires that everyone who cares about switching between different compositors has such a system compositor.

Interlude: Only launching compositors matters

The above observation that atomic clients can always faithfully restore a state (even if they don't understand the semantics of all properties) means that switching compositors itself will always work. The real trouble is making sure that a compositor can start up with a sane enough configuration to be able to successfully get pixels onto the screen.

New atomic IOCTL kernel flag

What might be needed is a way to make this safe state persistent in the kernel. The next option tries that, by adding a new flag to the atomic IOCTL which asks the kernel to start out with an atomic KMS state reset to some default value. But again the trouble here is, like with the FBDEV approach, that it's monolithic and doesn't easily allow userspace to control what's being restored. And again the question is: should things get restored to the boot-up state (and which boot-up state - something equivalent to what FBDEV emulation would pick, what the firmware would have picked, or a mix), or are reset values (set everything to unrotated) better? On top of that, most often compositors don't want to reset state at all, to be able to smoothly take over the display configuration from the preceding KMS client. Users have lost pretty much all appreciation of the unsightly flickering that commonly happened in the pre-KMS world when switching compositors. Another problem with keeping the boot-up state around is that the kernel then needs to keep a copy of all such state (and some objects like gamma tables are sizeable) around, just in case there's userspace around to ask for it.
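To make the "save and restore properties without understanding them" trick from the system compositor idea above a bit more concrete, here is a rough sketch using the generic KMS property API from libdrm. It only handles a single KMS object; a real implementation would walk all CRTCs, connectors and planes, skip immutable properties, and special-case framebuffers and unplugged outputs, as discussed above. Error handling is omitted:

#include <stdint.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

/* Assumes drmSetClientCap(fd, DRM_CLIENT_CAP_ATOMIC, 1) was called. */
static void snapshot_and_restore(int fd, uint32_t obj_id, uint32_t obj_type)
{
        /* Generic save: just property ids and raw 64-bit values; no need
         * to understand what any of them mean. */
        drmModeObjectProperties *saved =
                drmModeObjectGetProperties(fd, obj_id, obj_type);

        /* ... a session compositor runs here, possibly mangling state ... */

        /* Generic restore: feed the raw values back in one atomic commit. */
        drmModeAtomicReq *req = drmModeAtomicAlloc();
        for (uint32_t i = 0; i < saved->count_props; i++)
                drmModeAtomicAddProperty(req, obj_id, saved->props[i],
                                         saved->prop_values[i]);
        drmModeAtomicCommit(fd, req, DRM_MODE_ATOMIC_ALLOW_MODESET, NULL);
        drmModeAtomicFree(req);
        drmModeFreeObjectProperties(saved);
}

obj_type here is one of DRM_MODE_OBJECT_CRTC, DRM_MODE_OBJECT_CONNECTOR or DRM_MODE_OBJECT_PLANE; the point is merely that the property transport is fully generic.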
Use SysRq to reset atomic state

A problem with adding a flag to the atomic IOCTL to reset state is that all compositors need to implement support for it, and somehow make a decision for when to employ it. But on most systems compositors shouldn't completely mess up the atomic state anyway. Most likely that happens after a crash, and then there's no userspace around anyway to fix things up. Now generally this is, or well should be, only a problem for developers, and a possible solution might be to wire up a SysRq hotkey where the kernel force-resets atomic state to defaults. This would be similar to the FBDEV-based solution, except without FBDEV and not tied to VT switching. An alternative would be to implement this in the boot-splash service, by sampling boot-up state and providing some command to force-reset to that. But that's pretty much the system compositor approach, just using the boot splash, plus a tool to restore its state from a stored location, as the system compositor.

Expose reset or boot-up state

An easy fix to give control back to userspace over what will get restored is to instead expose the boot-up values or the reset values to userspace, through an extension to the GET_PROPERTY IOCTL. But again, storing boot-up state in the kernel would be wasteful on systems that will never need it (like Android or CrOS), and exposing reset values is somewhat pointless if e.g. you really want your screen rotated, always.

Per-compositor atomic state

A slight spin on all this is to make atomic state per-compositor in the kernel. This sounds a bit like it might help, but on the other hand implementing full state restore isn't more effort when compositors need to restore their state after a VT switch anyway. This leaves the tricky question of what the inherited state should be when a new compositor starts up: normally you want the current state, so that the compositor can take over smoothly - except when that's a really bad idea, because the current state is badly mangled from a different compositor that just crashed.

Overall, lots of different approaches and ideas, but no clear winner. That's why kernel folks need distro, compositor and desktop people to run into this issue first, to make sure the solution that lands actually solves the right problem, and in a way that suits userspace. Thanks to Daniel Stone, Pekka Paalanen and Ville Syrjälä for input on this.

 January 11, 2016

Kernel version 4.4 is released, so it's time for our regular look at what's in store for the Intel graphics driver in the next release. Overall this cycle has seen lots and lots of bugfixes, and the reason for that is that we're rebuilding our CI infrastructure after it went up in a poof of smoke last summer. Really big thanks to the entire team for the effort invested! And that's why this overview is a bit different and we'll start with bugfix efforts before delving into the few feature additions:

Ville fixed up display FIFO underruns all over the place: FDI modeset fixes for Haswell/Broadwell, correctly detecting fused-off VGA on the same, and disabling FIFO underrun reporting in some places where we've learned that underruns just happen - mostly around starting up the display pipeline. Next up is improved runtime PM wakelock debugging from Imre Deak, with efforts from other folks trying to fix up various issues. Unfortunately this turned up so many little buglets that we had to disable the reporting again, at least for now.
But at least we'll now have a very clear list of things to address, and a reliable way to audit any failures, to finally be able to enable runtime PM by default, hopefully soon.

There have also been lots of fixes for PSR and FBC from Rodrigo and Paulo respectively. PSR enabled by default missed the 4.5 merge window by just a hair - it's already enabled for 4.6. FBC is also pretty close; the last bit Paulo is working on is untangling the locking issues. FBC sits between GEM and KMS, and FBC code gets called by both subsystems. And that's an easy recipe for deadlocks, which Paulo is now working to resolve. There are also fixes included from Tvrtko Ursulin, Lukas Wunner and Chris Wilson to remedy some long-standing regressions in the fbdev framebuffer setup code. In GEM, finally, Dave Gordon fixed issues with the page dirty tracking. And Chris Wilson fine-tuned the request polling logic to avoid needlessly wasting CPU cycles.

Imre, Patrik and others have done a lot of work to fix up various issues in the DMC firmware loader for Skylake and the DC5/6 support. It works well now on that platform, and we could reenable the overall display power well support again on Skylake. But there are still plenty of issues on Broxton, unfortunately.

Since bugfixes have been highly prioritized over feature work this time around, there's only very little progress on atomic modesetting and specifically atomic watermark updates. But 4.5 includes a few more prep patches from Maarten Lankhorst and Matt Roper. There have been some real features though, still: Alex Goins from nvidia implemented proper sync for page-flipping dma-buf backed framebuffers, benefiting setups where nVidia renders buffers that the Intel driver displays.

Finally there's also been the usual amount of internal refactoring to prepare the code for the future and keep it maintainable. Jani Nikula rewrote the VBT parsing code. And Ander started to rework the DP detection code as the first step of a large DP support revamp. And finally there's been a bit of enabling for Kabylake too, but it's not yet complete. And of course there have been a lot more smaller things, again mostly bugfixes.

 January 10, 2016

This summer Intel sponsored some work to improve the kerneldoc toolchain, with the aim to use all that to extend the DRM and i915 driver documentation we have. Most of it landed, but the last bit to integrate some type of text markup processing was stalled until it could be discussed at the kernel summit, see the LWN summary. Unfortunately it died in a bikeshed fest due to an alliance of people who think docs are useless and you should just read the code, and others who didn't even know how to convert the kerneldoc into something pretty.

But we still need this, since without lists, highlighting, basic tables and inserting code snippets it's really hard to write decent documentation. Luckily Dave Airlie is ok with using it for DRM kerneldoc as long as Intel maintains the support. It's purely opt-in and the only downside of not using asciidoc is that the resulting docs won't be as pretty. All the changes to the text itself to use this markup are going into upstream as normal. The only bit that's not in upstream is the tooling, which is available in a topic branch at

 git://anongit.freedesktop.org/drm-intel topic/kerneldoc

If you want to build pretty docs, just install asciidoc and base your drm documentation patches on top of drm-intel-nightly from the same repository - that tree also includes all of Dave's tree.
Alternatively, pull the above topic branch into your own personal tree. Note that asciidoc is detected automatically, so you really only need it and the tooling branch to check the rendering of your changes. For added convenience Intel also maintains an autobuilder that pushes the latest drm-intel-nightly DRM documentation builds to http://dri.freedesktop.org/docs/drm/.

Aside: If all you want to build is just the GPU DocBook instead of all of them, you can do that with

$ make DOCBOOKS="gpu.xml" htmldocs

With that, have fun reading the new & improved documentation, and if you spot anything, please submit a patch to dri-devel@lists.freedesktop.org.
As we were working on audio jack notifications, and were wondering whether the type of notification we'd pop up in this case could be reused in other cases, I encountered a feature request that could now be solved easily with the rfkill D-Bus service we added to gnome-settings-daemon for the 3.10 release.

If you have keyboard buttons on your laptop to enable or disable Bluetooth, or Airplane mode, you can now use them. Note that the "UWB" toggle key will toggle the whole airplane mode mainly because no in-kernel driver uses it, and nobody remembers what UWB is.

Note that the labels and icons used are still subject to change. In particular, as you can see, the labels are too long for lower resolutions.

 January 09, 2016
Prodded by me while I snoozed on his sofa and with his cat warming me up, a day before the Content Applications hackfest, Florian Müllner started working on fixing a long-standing gjs bug that made it impossible to use gom in GNOME/JavaScript applications. The result of that initial research came a few days later, and is now part of the latest gjs release.

This also fixes using GtkBuilder and json-glib when the libraries create new objects for the benefit of the JavaScript code.

We can finally use gom to store user data in applications like Bolso. Thanks Florian!
 January 05, 2016

One of the things that makes me really happy in terms of the public reception to the Fedora Workstation is all the people calling out how stable and solid it is, as this was and is one of our big goals from the start of the Fedora Workstation effort.

From the start we wanted to bury the old idea of Fedora being only for people who didn't mind risking a lot of instability in return for being on the so-called bleeding edge. We also wanted to bury the related idea that by using Fedora you were basically alpha testing highly unstable and unfinished software for Red Hat Enterprise Linux. Yet at the same time we did want to preserve and build upon the idea that Fedora is a great operating system if you want to experience a lot of the latest and greatest new developments as they are happening. At first glance those two goals might seem a bit contradictory, but we decided that we should be able to do both by adjusting our policies a bit and by relying more on the Fedora retrace server as our bug-fixing prioritization tool.

So in terms of policies, the division of Fedora into distinct server and workstation images, and also the clearer separation of the spins, allowed us to start making decisions without worrying so much about how they affected use cases other than our own. Because sometimes what from a user perspective seems like a bug or something being broken was really a non-workstation policy decision getting in the way of the desktop behaving as expected, for instance firewall rules hindering basic desktop functions.

Secondly, we incorporated a more careful approach into what and when we brought in new stuff, meaning we still try to keep on top of major upstream developments and be a leading-edge system, but at the same time we do a little mental exercise for each decision to make sure it's a decision that makes us 'leading edge' and not 'bleeding edge'. And if we really want something in, but it isn't 100% ready for prime time yet, we do what we have done with Wayland or the GTK3 port of LibreOffice: we make it available as an option for early adopters, but we default to the safer choice while we work out the last wrinkles. (Btw, if you are interested in progress on Wayland, Kevin Martin sent out an email with a link to a good Wayland development status update just before the holidays.)

The final piece of the puzzle is regularly checking and identifying important bugs from the Fedora retrace server. Like almost all developers, we get way more bug reports than we can realistically ever address, so having the data from the retrace server allows us to easily identify the crashes that affect the most users, and just as importantly lets us filter out the bug reports that are likely caused by users installing weird stuff on their system. When we started using retrace, various desktop modules tended to dominate the top 3 pages when sorting bugs based on count, but due to a continuous effort over the last few years, desktop modules appearing in the top crashers list are few and far between, and when they do appear we make sure to get fixes done quickly for them. So if you ever wonder whether the data collected by these kinds of systems actually helps the developers working on the software you use, I can say that it certainly does for Fedora.

That said, I thought it could be interesting to explain a bit the challenges we have with tracking our progress in this area. So let's start by looking at a graph I pulled from the retrace server.

Looking at that graph one could say that it is clear that we have made great strides in improving system stability, and I do believe that is the case; however, the graph doesn't truly prove that conclusively, it is just an indication. The reason it is not hard evidence is that there are a lot of things you need to take into consideration when reading it. First of all, it is not adjusted based on total user population, which means that if you win or lose a lot of users between releases, it can create an appearance of increased or decreased instability which is actually due to the change in user population, not in how well the system runs on an individual user's machine. From what we see through other metrics, our user population has been increasing since we launched the Fedora Workstation, which means we shouldn't be getting any 'help' in these graphs from a declining user population.

A second reason is that there are a lot of false positives being reported here. For instance, we had an issue for a long while where the Intel graphics drivers generated a ton of these crash reports without them actually being crashes as such. So while they did represent bugs that should ideally be fixed, they were not issues you would actually have noticed as a user of the system. We spent some effort between Fedora Workstation 21 and Fedora Workstation 22 to reduce the amount of noise caused by this, which was a useful effort in terms of reducing noise on our retrace server, but from a user perspective it didn't really make a tangible difference. And even with our efforts there are still a lot of kernel issues showing up here which do not impact users in a way they are likely to perceive as the system being unstable.

A third item that might skew the statistics in a given release is that we currently don't differentiate between Fedora Workstation and the spins in the statistics. This means there might be issues caused by one of the spins generating a lot of bug reports against a module, even though the underlying bug or API usage issue is not triggered by the Workstation edition; thus those items appearing or disappearing might affect the statistics, while as a user of the Fedora Workstation you would never experience it.

So keeping this in mind, the retrace server is an important tool for us, and one that at least gives us a decent indication of how we are doing with quality. But we can always do better, so we will keep reviewing the reports we get through the ABRT and retrace systems, and I also strongly recommend that any application or library maintainers out there look into what major issues are reported against their own modules.

 December 22, 2015
Now that we have a few years of experience with the Wayland protocol, I thought I would put some of my observations in writing. This, what will hopefully become a series rather than just one post, considers how to design Wayland protocol extensions the right way.

This first post considers protocol object lifespan and the related races between the compositor/server and the client. I assume that the reader is already aware of the Wayland protocol basics. If not, I suggest reading Chapter 4. Wayland Protocol and Model of Operation.

### How protocol objects are created

On a new Wayland connection, the only object that exists is the wl_display which is a specially constructed object. You always have it, and there is no wire protocol for creating it.

The only thing the client can create next is a wl_registry through the wl_display. Registry is the root of the whole interface (class) hierarchy. Wl_registry advertises the global objects by numerical name, and using the wl_registry.bind request to bind to a global is the first normal way to create a protocol object.

Binding is slightly special still, as the protocol specification in XML for wl_registry uses the new_id argument type, but does not specify the interface (class) for the new object. In the wire protocol, this special argument gets turned into three arguments: interface name (string), interface version (uint32_t), and the new object ID (uint32_t). This is unique in the Wayland core protocol.

The usual way to create a new protocol object is for the client to send a request that has a new_id type of argument. The protocol specification (XML) defines what the interface is, so there is no need to communicate the interface type over the wire. All that is needed on the wire is the new object ID. Almost all object creation happens this way.
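For illustration, here is a minimal libwayland-client sketch showing both creation paths: binding a global through wl_registry.bind, and then an ordinary new_id request creating a wl_surface. Error handling is omitted:

#include <string.h>
#include <wayland-client.h>

static struct wl_compositor *compositor;

static void handle_global(void *data, struct wl_registry *registry,
                          uint32_t name, const char *interface,
                          uint32_t version)
{
        /* wl_registry.bind: the one request where the interface name and
         * version travel over the wire alongside the new object ID. */
        if (strcmp(interface, "wl_compositor") == 0)
                compositor = wl_registry_bind(registry, name,
                                              &wl_compositor_interface, 1);
}

static void handle_global_remove(void *data, struct wl_registry *registry,
                                 uint32_t name)
{
}

static const struct wl_registry_listener registry_listener = {
        handle_global, handle_global_remove
};

int main(void)
{
        struct wl_display *display = wl_display_connect(NULL);
        struct wl_registry *registry = wl_display_get_registry(display);

        wl_registry_add_listener(registry, &registry_listener, NULL);
        wl_display_roundtrip(display);  /* wait for the globals to arrive */

        /* An ordinary new_id request: the XML fixes the interface, so only
         * the new object ID goes over the wire. Creation cannot fail. */
        struct wl_surface *surface = wl_compositor_create_surface(compositor);
        (void)surface;

        wl_display_disconnect(display);
        return 0;
}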

Although it is rare, the server may also create protocol objects for the client. This happens by having a new_id type of argument in an event. Every time the client receives this event, it receives a new protocol object.

As all requests and events are always part of some interface (like a member of a class), this creates an interface hierarchy. For example, wl_compositor objects are created from wl_registry, and wl_surface objects are created from wl_compositor.

Object creation never fails. Once the request or event is sent, the new object it creates exists, period. This keeps the protocol asynchronous, as there is no need to reply or check that the creation succeeded.

### How protocol objects are destroyed

There are two ways to destroy a protocol object. By far the most common one is to have a request in the interface that is specified to be a destructor. Most often this request is called "destroy". When the client code calls the function wl_foobar_destroy(), the request is sent to the server and the client side proxy (struct wl_proxy) for the object gets destroyed. The server then handles the destructor request at some point in the future.
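In client code the asymmetry looks like this; a tiny sketch, with wl_surface merely serving as an example of an interface that has a destroy request:

#include <wayland-client.h>

/* The proxy dies immediately on the client side; the server only
 * processes the destructor request later, in order. Using the proxy
 * after this point would be a client bug. */
static void drop_surface(struct wl_surface **surface)
{
        wl_surface_destroy(*surface);   /* sends wl_surface.destroy */
        *surface = NULL;                /* the proxy is already gone */
}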

The other way is to destroy the object by an event. In that case, the interface's protocol specification must not define a destructor, and the event must be clearly documented as destructive, since there is no automation nor safeties for this. This is for cases where the server decides when an object dies, and it requires extreme care in protocol design to work right in all cases. When a client receives such an event, all it can do is destroy the proxy. The (in)famous example of an interface like this is wl_callback.
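wl_callback also demonstrates the only really comfortable shape for such an interface: it defines no requests at all, and on receiving the destructive done event the client simply destroys the proxy. A minimal sketch using wl_display.sync:

#include <wayland-client.h>

static void sync_done(void *data, struct wl_callback *callback,
                      uint32_t callback_data)
{
        /* The done event is destructive: all the client can do is destroy
         * the proxy. Since wl_callback has no requests, there is no way
         * to race against it either. */
        wl_callback_destroy(callback);
        *(int *)data = 1;
}

static const struct wl_callback_listener sync_listener = { sync_done };

int main(void)
{
        struct wl_display *display = wl_display_connect(NULL);
        int done = 0;

        struct wl_callback *callback = wl_display_sync(display);
        wl_callback_add_listener(callback, &sync_listener, &done);
        while (!done)
                wl_display_dispatch(display);

        wl_display_disconnect(display);
        return 0;
}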

### Enter the boogeyman: races

It is very important that both the client and the server agree on which protocol objects exist. If the client sends a request on, or references as an argument, an object that does not exist in the server's opinion, the server raises a protocol error, and disconnects the client. Obviously this should never happen, nor should it happen that the server sends an event to an object that the client destroyed.

Wayland being a completely asynchronous protocol, we have no implicit guarantees. The server may send an event at the same time as the client destroys the object, and now the event targets an object the client does not know about anymore. Rather than the client shooting itself dead (that's the server's job), we have a trick in libwayland-client: it silently ignores events to destroyed objects, until the server confirms that the object is truly gone.

This works very well for interfaces where the destructor is a request. If the client first sends the destructor request and then sends another request on the destroyed object, it just shot its own head off - no race needed.

Things get tricky for the other case, destructor events. The server may send the destructor event at the same time the client is sending a request on the same object. When the server finally gets the request, the object is already gone, and the client gets taken behind the shed and shot. Therefore pretty much the only safe way to use destructor events is if the interface does not define any requests at all. Ever, not even in future extensions. Furthermore, objects with that interface should not be used as arguments anywhere, or you may hit the race. That is why destructor events are difficult to use right.

### The boogeyman's brother

There is yet another nasty race with events that create objects, i.e. server-created objects. If the client is destroying the (parent) object at the same time as the server is sending an event on that object, creating a new (child) object, the server cannot know if the client actually handled the event or not. If the client ignored the event, it will never tell the server to destroy that new object, and you leak in the server.

You could try to make your way out of that pitfall by writing in your protocol specification, that when the (parent) object is destroyed, all the child objects will be destroyed implicitly. But then the client must not send the destructor request for the child objects after it has destroyed the parent, because otherwise the server sees requests on objects it does not know about, and kicks you in the groin, hard. If the child interface defines a destructor, the client cannot destroy its proxies after destroying the parent object. If the child interface does not define a destructor, you can never free the server-side resources until the parent gets destroyed.

The client could destroy all the child objects with a defined destructor in one go, and then immediately destroy the parent object. I am not sure if that works, but it might. If it does not, you have to specify a whole tear-down protocol sequence. The client tells the server it wants to destroy the parent object, the server acks and guarantees it no longer sends any events on it, then the client actually destroys the parent object. Hey, you have a round-trip and just turned a beautiful asynchronous protocol into synchronous, congratulations!

### Concluding with recommendations

Here are my recommendations when designing Wayland protocol extensions:
• Always make sure there is a guaranteed way to destroy all objects. This may sound obvious, but we have fixed several cases in the Wayland core protocol where there was no way to destroy a created protocol object such that all resources on both server and client side could be freed. And there are still some cases not fixed.
• Always define a destructor request. If you have any doubt whether your new interface needs a destructor request, just put it there. It is more awkward to add later than normal requests. If you do not have one, the client cannot tell the server to free those protocol object resources.
• Do not use destructor events. They are hard to design right, and extending the interface later will be a bitch. The client cannot tell the server to free the resources, so objects with destructor events should be short-lived, and the destruction must be guaranteed.
• Do not use server-side created objects without a serious thought. Designing the destruction sequence such that it never leaks nor explodes is tricky.
 December 19, 2015
Having finished debugging and fixing a rather tricky GPU VM fault bug in the radeonsi driver, I thought I'd document the bug chasing process I went through (some parts are cleaned up with dead ends removed). May it help myself and others in the future.

Fortunately, the original submitter of the bug had already bisected the cause of the VM faults to a change in LLVM, so the fault was clearly due to some shader. Unfortunately, the triggering commit in LLVM was completely unrelated to Radeon, so it was very unclear what was going on. Still, the bug occurred in the publicly available Unreal Elemental Demo and was easily reproducible, so off I went.

Since some shader was to blame, the first thing to do after reproduction was to collect all shaders before and after the bad commit, using RADEON_DUMP_SHADERS=y (R600_DEBUG=ps,vs,gs also does this). This resulted in a lot of output with a large diff between the good and the bad run. Clearly, the change in LLVM subtly affected register allocation and/or instruction scheduling in the compiler in a way that affected many shaders and exposed some pre-existing, underlying bug. I needed to find the exact shader that caused problems.

The next step, to ensure even more reliable and deterministic reproduction, was to record an apitrace. This allows us to replay the exact same sequence of OpenGL calls that leads to the VM faults over and over again, to learn ever more about what's going on. (The Unreal Elemental Demo always plays the same scene, but it is affected by timing.)

Now it was time to find the exact draw call that caused the problems. The driver has some tools to help with that: the GALLIUM_DDEBUG feature is meant to detect lockups, but it conveniently causes a command stream flush after every draw call, so I used it by setting GALLIUM_DDEBUG=800. This makes the replay terribly slow (there's a reason we batch many draw calls into a single CS, also called IB in the kernel). So I implemented a GALLIUM_DDEBUG_SKIP feature in the driver that let me skip the additional flushes and lockup checks for the initial, known-good segment of the trace.

In addition, the driver comes with a debug feature that detects and aborts on VM faults, which is enabled via R600_DEBUG=check_vm. Since the fault comes from a shader, we also need a way to cross-reference the detected fault to the currently bound shader. This is achieved by dumping shaders and enabling the vm debug option, for a full command line of something like
GALLIUM_DDEBUG=800 GALLIUM_DDEBUG_SKIP=170000 \
RADEON_DUMP_SHADERS=y R600_DEBUG=vm,check_vm \
glretrace -v ElementalDemo.trace > runXXX.log 2>&1
The option -v for glretrace dumps all OpenGL calls as they are executed, which also turned out to be useful.

How to find the faulty shader from all that? Well, the check_vm logic not only detects VM faults, but also writes helpful logging dumps to a file in ~/ddebug_dumps/ (use less -R to make sense of the coloring escape codes that are written to the file). Most crucially, this dump contains a list of all buffers that were mapped, obviously including the buffers that contain the shader binaries. In one example run:
        Size    VM start page         VM end page           Usage
...
         245    -- hole --
           1    0x0000000113a9b       0x0000000113a9c       USER_SHADER
         268    -- hole --
...
         564    -- hole --
           2    0x000000016f00d       0x000000016f00f       USER_SHADER
         145    -- hole --
Remember that we enabled the vm debug option together with shader dumping? This means our log contains lots of lines of the form
VM start=0x105249000  end=0x10524A000 | Buffer 4096 bytes
(Note that the log contains byte addresses, while the check_vm dump contains page numbers. Pages are 4KB, so you just need to add or remove three 0s at the end to go from bytes to pages and vice versa.) All we need to do is grep the log file for "VM start=0x113A9B" and "VM start=0x16F00D" (mmh, I'm getting hungry...). And while those might appear multiple times if a buffer is reused or destroyed, the last occurrence will be where the shader binary was created.
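Because pages are 4KB, the byte/page conversion is just a 12-bit shift - the "three 0s" above are hex digits. A throwaway helper for cross-referencing the two logs, purely for illustration:

#include <stdint.h>
#include <stdio.h>

static uint64_t byte_to_page(uint64_t addr) { return addr >> 12; }
static uint64_t page_to_byte(uint64_t page) { return page << 12; }

int main(void)
{
        /* 0x113a9b000 is a byte address from the log; this prints
         * 0x113a9b, the page number from the check_vm dump. */
        printf("0x%llx\n", (unsigned long long)byte_to_page(0x113a9b000ULL));
        return 0;
}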

The shader dump contains three versions of the shader: the initial TGSI, which is what Gallium's state tracker provides to the hardware-dependent driver, the LLVM IR after an initial optimization pass, and the disassembly of the final shader binary as provided by LLVM. I extracted the IR of the two shaders (vertex and fragment), and compiled them with LLVM's standalone compiler, once with the "good" version of LLVM and once with the "bad" (in both cases using the command line llc -march=amdgcn -mcpu=tonga < shader.ll). It turned out that both shaders were affected by the change, so I needed to figure out whether it was the vertex or the fragment shader.

The GUI of apitrace has the wonderful ability of allowing you to edit a trace. Remember that I used the -v option of glretrace? That produces lots of lines like
2511163 @2 glDrawRangeElements(mode = GL_TRIANGLES, start = 0, end = 9212, ...
Indeed, that was the final reported draw call, i.e. the one causing the fault. So let's open the trace with qapitrace and jump to that exact call using its number. Sure enough, not long before the call we find
glUseProgram(505)
We need to find where this program is linked, which is typically much earlier in the program. So we search for glLinkProgram(program = 505) and find exactly one such call. A short bit above, we find the calls
glAttachShader(505, 69)
glAttachShader(505, 504)
where the fragment and vertex shaders are attached to the program. Finally, we search for glShaderSource(shader = 69 and 504 to find the call where the source is loaded into the shader. Now, we can edit those calls to replace the shaders by dummy versions. Be careful, though: the length of the shader source is passed as a separate argument, and you must adjust it manually or you will get surprising error messages when running the modified trace.

By doing so, I determined that the fragment shader was at fault: even minor modifications such as re-ordering statements without changing any effects removed VM faults when applied to the fragment shader, but not when applied to the vertex shader.

So... time to stare at the disassembly of the fragment shader. Unfortunately, there was nothing that caught my eye. It was a long shader with more than 700 instructions. So what next? Since even minor changes at the source level fixed the fault, no matter what kind of change, I needed to go deeper and modify the binary directly. I wrote a new feature for radeonsi that would help me do just that, by allowing me to tell the driver to replace the binary created by LLVM on the N'th compile by a binary that is supplied as an environment variable.

I would have loved to be able to edit assembly at this point. Unfortunately, the llvm-mc tool, which can theoretically act as an assembler, is not able to parse all the assembly constructs that llc generates for the AMDGPU backend. So I went with the next best option, creating an ELF object file using llc -march=amdgcn -mcpu=tonga -filetype=obj and editing the binary directly with a hex editor.

That wasn't too bad though: since VM faults are generated by memory instructions, I could just replace those memory instructions by NOPs. Since the shader dumps collected above helpfully include the binary representation of instructions in addition to the assembly, the instructions aren't too hard to find, either. I only needed to take care not to NOP out memory instructions whose output was then later used as addresses or resource descriptors for other memory instructions, or else I would have introduced new sources for VM faults!

At that point, I ran into a tough problem. My plan was to NOP out large groups of memory instructions initially, and then do a kind of binary search to isolate the bad access. Unfortunately, what happened was, roughly speaking, that when I NOP'ed out group A of instructions or group B of instructions, the VM faults remained, but when I NOP'ed out both groups at the same time, the VM faults disappeared. This can happen when there are really two underlying bugs, but unfortunately I did not see a plausible culprit in either group (in fact, the first bug which I found was actually outside both groups, but - as far as I understand it - depended subtly on timing and on how that affected the scheduling of shader waves by the hardware).

Luckily, at that point I had long suspected the problem to be in wait state handling. You see, in order to save on complicated circuitry in the hardware, there are some rarely occurring data hazards which the compiler must avoid by inserting NOPs explicitly (there is also the s_waitcnt instruction which is part of the strategy for hiding memory latency without complex out-of-order circuitry). So I read the documentation of those hazards again and noticed that the compiler in fact didn't insert enough wait states for a sequence involving register spills (i.e., something affecting only very large shaders). I fixed that, and while it didn't eliminate the VM faults, it changed their pattern sufficiently to give me new hope.

Indeed, with the additional wait states, my original idea of finding the bad instructions by binary search was successful. I narrowed the problem down to accesses to one single texture. At that point, my brain was too exhausted to see the bug that was rather obvious in hindsight, but a colleague pointed it out to me: there was a multi-word register copy which attempted to copy a resource descriptor (consisting of 8 32-bit words) between overlapping register ranges, and it was doing that copy in the wrong direction - kind of like using memcpy when you should be using memmove.

Once this second bug was found, coming up with a fix was relatively straightforward. And yes, I checked: both bug fixes were actually required to completely fix the original bug.

That was my story. Hopefully you've learned something if you've come this far, but there is not really much of a moral to it. Perhaps it is this: pray you have deterministically reproducible bugs. If you do, patiently collecting more and more information will lead you to a solution. If you don't, well, sometimes you just have to be lucky.
 December 15, 2015

This post only applies to users of the Lenovo x220 laptop experiencing issues when using the touchpad. Specifically, the touchpad is imprecise and "jumpy" after a firmware update, as outlined in Fedora bug 1264453. The cause is buggy touchpad firmware, identifiable by the string "fw: 8.1" in the dmesg output for the touchpad:

[  +0.005261] psmouse serio1: synaptics: Touchpad model: 1, fw: 8.1, id: 0x1e2b1, caps: 0xd002a3/0x940300/0x123800, board id: 1611, fw id: 1099905
If you are experiencing these touchpad issues and your dmesg shows the 8.1 firmware version, please read on for a solution. By default, the x220 shipped with version 8.0 so unless you updated the firmware as part of a Lenovo update, you are not affected by this bug.

The touchpad issues seem identical to the ones seen on the Lenovo x230 model, which has the same physical hardware and also ships with firmware version 8.1. The root cause as seen by libinput is that the touchpad only sends events once the finger moves approximately 50 device units in either direction. The touchpad advertises a resolution of 65 units/mm horizontally and 136 units/mm vertically, but the effective resolution is reduced by roughly 75% and 30%, respectively. Bugzilla attachment 1082925 shows a recording; you can easily see that while the pressure is updated with high granularity, the motion coordinates jump from one position to the next. From what we know, this was introduced by the touchpad firmware v8.1, presumably as part of a filter to reduce the jitter some x230 users saw.

libinput automatically detects the x230 and enables a custom acceleration function for just that model. That same acceleration function works for the x220 v8.1, but unfortunately we cannot automatically detect it. As of libinput 1.1.3, libinput recognises a special udev tag, LIBINPUT_MODEL_LENOVO_X220_TOUCHPAD_FW81, to mark such an updated x220 and enable a better pointer behaviour. To apply this tag, please do the following:

1. Create a new file /etc/udev/hwdb.d/90-libinput-x220-fw8.1.hwdb
2. Look for X220 in the 90-libinput-model-quirks.hwdb file, copy the match and the property assignment into the file. As of the time of writing, the two lines are as below, but make sure you take the latest from your locally installed libinput version or the link above.
libinput:name:SynPS/2 Synaptics TouchPad:dmi:*svnLENOVO:*:pvrThinkPadX220*
 LIBINPUT_MODEL_LENOVO_X220_TOUCHPAD_FW81=1
3. Update the udev hwdb with sudo udevadm hwdb --update
4. Verify the tag shows up with sudo udevadm test /sys/class/input/event4 (adjust the event node if necessary)
5. Reboot
The touchpad is now marked as requiring special treatment and libinput will apply a different pointer acceleration for this touchpad.

Note that any udev property starting with LIBINPUT_MODEL_ is private API and subject to change at any time. We will never break the meaning of the LIBINPUT_MODEL_LENOVO_X220_TOUCHPAD_FW81 property, but the exact behaviour of the property is implementation-dependent and may change at any time. Do not use it for any other purpose than marking the touchpad on a Lenovo x220 with an updated touchpad firmware version v8.1.

Big news for the VC4 project today:

commit 21de54b3c4d08d2b20e80876c6def0b421dfec2e
Merge: 870a171 2146136
Author: Dave Airlie
Date: Tue Dec 15 10:43:27 2015 +1000

Merge tag 'drm-vc4-next-2015-12-11' of http://github.com/anholt/linux into drm-next

This is the last step for getting the VC4 driver upstream: Dave's pulled my code for inclusion in kernel 4.5 (probably to be released around mid-March).  The ABI is now stable, so I'm working on getting that ABI usage into the Mesa 11.1 release.  Hopefully I'll land that in the next couple of days.

As far as using it out of the box, we're not there yet.  I've been getting my code included in some builds for the Raspberry Pi Foundation developers.  They've been working on switching to kernel 4.2, and their tree has VC4 support up to the previous ABI.  Once the Mesa 11.1 merge happens, I'll ask them to pull the new kernel ABI and rebuild userspace using Mesa 11.1-rc4.  Hopefully this then appears in the next Raspbian Jessie build they produce.  Until that release happens, there are instructions for the development environment on the DRI wiki, and I'd recommend trying out the continuous integration builds linked from there.

The Raspberry Pi folks aren't ready to swap everyone over to the vc4 driver quite yet, though.  They want to make sure we don't regress functionality, obviously, and there are some big chunks of work left to do: HDMI audio support, video overlays, and integration of the vc4 driver with the camera and video decode support come to mind.  And then there's the matter of making sure that 3D performance doesn't suffer.  That's a bit hard to test, since only a few apps work with the existing GLES2 support, while the vc4 driver gives GLX, EGL-on-X11, EGL-on-gbm, most of GL2.1, and all of GLES2, but doesn't support the EGL-on-Dispmanx interface that the previous driver used.  So, until then, they're going to have a devicetree overlay that determines whether the firmware sets itself up for Linux using the vc4 driver or the closed driver.

Part of what's taken so long to get to this point has been trying to get my dependencies merged to the kernel.  To turn on V3D, I need to turn on power, which means a power domain driver.  Turning on the power required talking to the firmware, which required resurrecting an old patchset for talking to the firmware, which got bikeshedded harder than I've ever had happen to my code before.  Programming video modes required a clock driver.  Every step of the way is misery to get the code merged, and I would give up a lot to never work on the Linux kernel again.

Until then, though, I've become a Raspberry Pi kernel maintainer, so that I can ack other people's patches and help shepherd them into the kernel.  Hopefully for 4.5 I can get the aux clock driver bikeshedding dealt with and resubmit it, at which point people can use UART1 and SPI1/2.  I have a third rework to do of my power domain driver so that, if we're lucky, we can get it merged and actually turn on the 3D core (and manage power of many other devices, too!).  Martin Sperl is doing a major rewrite of the SPI driver (an area I know basically nothing about), and his recent patch split may deal with the subsystem maintainer's concerns.  I want to pull in feedback and merge Lubomir's thermal driver.  There's also a cpufreq driver (for actually doing the overclocking you can set with config.txt) from Lubomir, which I expect to be harder to deal with the feedback on.

So, while I've been quiet on the blogging front, there's been a lot going on for vc4, and it's in pretty good shape now.  Hopefully more folks can give it a try as it becomes more upstreamed and accessible.
 December 14, 2015

Back in 2011, when the AppStream meeting in Nürnberg had just happened, I published the DEP-11 (Debian Extension Project 11) draft together with Michael Vogt and Julian Andres Klode, as an approach to implement AppStream in Debian.

Back then, the FTPMasters team rejected the suggestion to use the official XML specification, and so the DEP-11 specification was adapted to be based on YAML instead of XML. This wasn’t much of a big deal, since the initial design of DEP-11 was to be a superset of the AppStream specification, so it wasn’t meant to be exactly like AppStream anyway. AppStream back then was only designed for applications (as in “stuff that provides a .desktop file”), but with DEP-11 we aimed for much more: DEP-11 should also describe fonts, drivers, pkg-config files and other metadata, so in the end one would be able to ask the package manager meaningful questions like “is the firmware of device X installed?” or request actions such as “please install me the GIMP”, making it unnecessary to know package names at all, and making packages a mere implementation detail.

Then, GNOME-Software happened and demanded all these features. Back then, I was the de-facto maintainer of the AppStream upstream project already, but didn’t feel like being the maintainer yet, so I only curated the existing specification, without extending it much. The big push forward GNOME-Software created changed that dramatically, and with me taking control of the specification and documenting it properly, the very essence of DEP-11 became AppStream (that was around the AppStream 0.6 release). So today, DEP-11 is mainly a YAML-based version of the AppStream XML specification.

AppStream XML and DEP-11 YAML are implemented by two projects; GLib- and Qt-based libraries exist to access the metadata, and AppStream is used by the software centers of GNOME, KDE and Elementary.

Today there are two things to celebrate for me: First of all, there is the release of AppStream 0.9 (that happened last Saturday already), which brings some nice improvements to the API for developers and some micro-optimizations to speed up Xapian database queries. Yay!

The second thing is full DEP-11 support in Debian! This means that you don’t need to copy metadata around manually, or install extra packages: All you need to do is to install the appstream package, everything else is done for you, and the data is kept up to date automatically.

This is made possible by APT 1.1 (thanks to the whole APT team!), some dedicated support for it in AppStream directly, the work of our Sysadmin team at Debian, which set up infrastructure to build the metadata automatically, as well as our FTPMasters team where Joerg helped with the final steps of getting the metadata into the archive.

That AppStream data is now in the archive doesn’t mean we live in a perfect utopia yet – there are still issues to be handled, but all the major work is done now and we can now gradually improve the data generator and tools and squash the remaining bugs.

And another item from the good news department: It’s highly likely that Ubuntu will follow Debian in AppStream/DEP-11 support with the upcoming Xenial release!

### But how can I make use of the new metadata?

Just install the appstream package – everything is done for you! Another easy way is to install GNOME-Software, which already makes use of the new metadata. KDE Discover in Debian does not enable AppStream support yet; this will likely come later.

If you prefer to use the command-line, you can now use commands like

sudo appstreamcli install org.kde.kate.desktop

This will simply install the Kate text editor.
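If you would rather query the metadata than install something, appstreamcli offers further subcommands. A hedged sketch (exact subcommand names and syntax may differ between AppStream versions):

appstreamcli search kate
appstreamcli what-provides mimetype text/plain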

### Who wants some statistics?

At this time the Debian Sid/Unstable suite contains 1714 valid software components. It could be even more if the errors generated during metadata extraction were resolved. To help with that, the metadata generator has a nice statistics page, showing the amount of each hint type in the suite and the development over time of the available software components in Debian and the hint-type counts (this plot feature was added only recently, so we are still a bit low on data). For packagers and interested upstreams, the data extractor creates detailed reports for each package, explaining why data was not included and how to fix the issue (in case something is unclear, please file a bug report and/or get in contact with me).

### In summary

Thanks to everyone who helped to make this happen! This project means a lot to me; when writing this blog post, I realized that I have basically been working on it for almost 5 years (!) now (and the idea is even older). Seeing it grow into such a huge success in other distributions was a joy, but now Debian can join the game with first-class AppStream support as well, which makes me even happier. After all, Debian is the distribution I feel most at home in.

There is still lots of work to do (and already a few bugs known), but the hardest part of the journey is done – let’s walk into a bright future with AppStream!

 December 11, 2015

It is time again for another Tanglu blog post (to be honest, this article is pretty much overdue). I want to shine a spotlight on the work that's being done in Tanglu, be it ongoing or past work for the new release.

## Why is the new release taking longer than usual?

As you might have noticed, a new Tanglu release (Tanglu 4 “Dasyatis”) would usually be due this month. We decided a while ago, however, to defer the release and are now aiming for February / March 2016.

The reason for this change in schedule is the GCC 5 transition (and, more importantly, the huge amount of follow-up transitions) as well as some other major infrastructure tasks, which we would like to complete and test thoroughly before releasing. Some issues with our build system, and generally less build power than in previous releases, are a problem as well (at least the Debile build-system issues could be worked around or solved). The hard disk crash in the forum and bugtracking server also delayed the start of the Tanglu 4 development process a lot.

In all of these tasks, manpower is of course the main problem.

### Improvements on Synchrotron

Synchrotron, the software which synchronizes the Tanglu archive with the Debian archive, received a few tweaks to make it more robust and to detect the installability of packages more reliably. We also run it more often now.

### Rapidumo improvements

Rapidumo is software written to complement dak (the Debian Archive Kit) in managing the Tanglu archive. It performs automatic QA tasks on the archive and provides a collection of tools to perform various tasks (like triggering package rebuilds).

For Tanglu 4, we now sometimes drop broken packages from the archive, to speed up transitions and to remove packages which became uninstallable more quickly. Packages removed from the release suite still have a chance to enter it again, but they need to go through Tanglu’s staging area first. The removal process is currently semi-automatic, to avoid unnecessary removals and breakage.

Rapidumo could benefit from some improvements, and an interactive web interface (as opposed to its static HTML pages) would be nice. Some early work on this is done, but not completed (and it has a very low priority at the moment).

### DEP-11 integration

There will be a bigger announcement on AppStream and DEP-11 in the next few days, so I'll keep this terse: Tanglu will go for full AppStream integration with the next release, which means no more packaged metadata, but data placed directly in the archive. Work on this has started in Tanglu, but I needed to go back to the drawing board with it, to incorporate some new ideas for using synergies with Debian in generating the DEP-11 metadata.

### Phabricator

Phabricator has been integrated well into our infrastructure, but there are still some pending tasks. E.g. we need subprojects in Phabricator, and a more powerful Conduit interface. Those are upstream bugs on Phabricator, and are actively being worked on.

As soon as the missing features are available in Phabricator, we will also pursue the integration of Tanglu bug information with the Debian DistroTracker, which was discussed at DebConf this summer.

## UEFI support

UEFI support is a tricky beast. Full UEFI support is a release goal for Tanglu 4 (so we won't release without it). At present, our Live-CDs start on pure EFI systems, but there are several reported issues with the Calamares live-installer as well as with the Debian-Installer, which fails to install GRUB correctly.

Work on resolving these problems is ongoing.

## Major LiveCD rework

Tanglu 4 will ship with improved live-cds, which e.g. allow selecting the preferred locale early in the bootloader. For managing our live sessions, we switched from live-boot to casper, the same tool which is also used in Ubuntu. Casper fixed a few issues we had and brought some new ones, but overall using it was a good choice, and work on making top-notch live-cds is progressing well.

## KDE Plasma

Integration of the latest Plasma release is progressing, but its speed has slowed down since fewer people are working on it. If you want to help Tanglu’s KDE Workspace integration, please help!

For the upcoming release, the Plasma 5 packages which we created in collaboration with Kubuntu for the previous Tanglu 3 release have been merged with their Debian counterparts. Fortunately, this was possible without major problems. Now Tanglu is (mostly) using the same Plasma 5 packages as Debian again (lots of kudos go to the Kubuntu and Debian Qt/KDE packagers!).

## GNOME

The same re-merge with Debian has been done on Tanglu’s GNOME flavor (Tanglu also shipped with a more recent GNOME release than Debian in Tanglu 3). So Tanglu 4 GNOME and Debian Testing GNOME are at (almost) the same level.

Unfortunately, the GNOME team is heavily understaffed – a GNOME team member left at the beginning of the year for personal reasons, and the team now only has one semi-active member and me.

## Other

An fvwm-nightshade spin is being worked on. Apart from that, there are no teams maintaining other flavors in Tanglu.

## Security & Stable maintenance

Thanks to several awesome people, the current Tanglu Stable release (Tanglu 3 “Chromodoris”) receives regular updates, even for many packages which are not in the “fully supported” set.

## Tasks (not) done in the global Tanglu universe

## HTTPS for community sites

Thanks to our participation in the Let’s Encrypt closed beta program, Tanglu websites like the user forums and bugtracker have been fully encrypted for a while, which should make submitting data to these sites much more secure. So far we haven't encountered any issues, which means that we will likely aim to get HTTPS encryption enabled on every Tanglu website.

## Tanglu.org website

The Tanglu main website didn't receive much love at all. It could use a facelift and, more importantly, updated content about Tanglu.

So far, nobody has volunteered to update the website, so this task is still open. It is, however, a high-priority task to increase our public visibility as a project, so updating the website is definitely something which will be done soon.

## Promotion

It doesn’t hurt to think about how to sell Tanglu as “brand”: What is Tanglu? Our motivation, our goals? What is our slogan? How can we communicate all of this to new users, without having them to read long explanations? (e.g. the Fedora Project has this nicely covered on their main website, and their slogan “Freedom, Friends, Features, First” immediately communicates the project’s key values)

Those are issues engineers don’t often think about; still, it is important to present Tanglu in a way that is easy to grasp for people hearing about it for the first time. So far, nobody is working on this task specifically, although it regularly comes up on IRC.

Tanglu is – by design – not backed by any corporation. However, we still need to pay for our servers. Currently, Tanglu is supported by three sponsors, which basically provide the majority of our infrastructure. Also, developers provide additional machines to get Tanglu running and/or packages built.

Still, this dependency on very few sponsors paying the majority of the costs is very bad, since it makes Tanglu vulnerable in case one sponsor decides to end their sponsorship. We have no financial backing that would e.g. allow us to continue paying for an important server in case its sponsor decides to drop sponsorship.

Also, since Tanglu is not a legal entity, accepting donations is hard for us at this time, since a private person would need to accept the money on behalf of Tanglu. So we can’t easily make donating to Tanglu possible (at least not in the way we want donations to work).

These issues are currently being worked on; there are a couple of possible solutions on the table. I will write about details as soon as it makes sense to go public with them.

In general, I am very happy with the Tanglu community: the developer community is a pleasure to work with, and interactions with our users on the forums or via the bugtracker are highly productive and friendly. Although the development speed has slowed down a bit, Tanglu is still an active and awesome project, and I am looking forward to the cool stuff we will do in the future!

 December 09, 2015
Due to vacations, conferences and other things, I'm way later than usual, and 4.3 was released a while ago. It is more than overdue to take a look at what's in store in the next kernel release.
First, looking at overall infrastructure work on the display side, there's a lot of atomic conversion progress again. One feature that's now on solid foundations is fastboot, built on top of the atomic infrastructure with patches from Maarten. Unfortunately we had to disable it again due to some backlight issues early in 4.4-rc. The other big piece is reworking the watermark update code (Ville & Matt), which unfortunately ran into regression roadblocks already in the development cycle and had to be partially reverted. Another piece of infrastructure building on top of atomic is validating & adjusting the display clock - some ULT chips can't drive all DP screens, and the driver now detects that; it should also downclock when less bandwidth is needed. This was implemented by Mika Kahola and Ville.

Again this round has seen a lot of improvements and bug fixes to PSR code (from Rodrigo) and for FBC (from Paulo). Unfortunately we're not yet done with those, but it looks really good that at least PSR can finally be enabled for 4.5. Still on the display side of the driver there was a pile of smaller improvements all over: Prep work for Broxton DSI support (Shashank Sharma). HDMI detection finally checks the hotplug sense, after some workaround from Sonika. And tons of cleanups all over. Fixing up DMC support (for new low-power display states) was also a topic, but we've only managed to fix it up for real in 4.5.

On the GEM side the big thing for sure is support for the extended 48-bit GPU address space on Broadwell and later chips, from Michel Thierry.  And then there's the code for GuC-based command submission (Alex Dai and Dave Gordon), which is merged but not yet enabled by default.  The idea behind that is to feed all command submission through an on-chip microcontroller, which can then react much faster to changing workloads and tune power states accordingly.  It should also help long-term with better scheduling by supporting preemption.  But none of that is implemented yet, so this is just the foundation.

For existing features there are bugfixes for userptr and shrinker improvements from Chris Wilson.  And Tvrtko has extended the vma view code in preparation for NV12 rotation support.

Of course there's also been the usual enabling work for new platforms, this time around mostly consisting of workaround patches for Skylake and Broxton.  And Zhiyuan Lv submitted support for the virtualized XenGT GPU on Broadwell.

Finally, for driver internals there's the massive work from Ville to make the register access functions type-safe. This is especially a problem for writing registers, where both the register offset and the value that needs to be written are of type uint32_t. That resulted in subtle bugs fairly often. Ville encapsulated the register offset into a struct and converted all the thousands of register #defines and users over to that, and now compilation will fail if we ever get this wrong again.
 December 08, 2015
As you might already have noticed from the posts on Planet GNOME, and can find again on the hackfest's page, we spent some time in the MediaLab Prado discussing and hacking on Content Apps.

Music

Following discussions about Music's state, I did my bit trying to gather more contributors by porting it to grilo 0.3, and thus bringing it back into the default jhbuild target.

Videos

I made some progress on Videos' "series grouping" feature. Loads of backend code written, but not much in the way of UI for now. We however made some progress discussing said UI with Allan.

I also took the opportunity to fix a few low-hanging fruit^Wbugs.

Documents

This is where the majority of my energy went. After getting a new enough version of LibreOffice going on my machine (Fedora users, that lives in rawhide only right now), no thanks to COPR, I tested Pranav's LibreOfficeKit integration into gnome-documents, after Cosimo rebased it.

You can test it now by checking out the wip/lokdocview-rebase branch of gnome-documents, grabbing the above mentioned version of LibreOffice, and running:

LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib64/libreoffice/program/ gjs org.gnome.Documents

After a number of fixes, and bugs filed in the Document Foundation bugzilla, we should be able to land this so that you can preview and edit word processing documents, presentations and spreadsheets without going through the heavy PDF preview.

[A picture, which doubles the length of my blog post]

And the side-effect of this work is that we can start adding new "views" to the application without too much trouble, like, say, an epub view.

Thanks

Many thanks to the GNOME Foundation for sponsoring my travel, the MediaLab Prado for hosting us, and Allan and Florian for organising the hackfest.

 December 03, 2015

In a previous blog post I said, "If you have encountered a regression and you are building a driver from source, please provide the results of git-bisect." There is some feeling that performing a bisect is hard, time consuming, or both. Back in the bad-old-days, that was true... but git bisect run changed all that. Most of performing a bisect is mechanical and repetitious:

1. Build the project.
2. If the build fails, run git bisect skip.
3. Run the test.
4. Inspect the results.
5. Run git bisect good or git bisect bad depending on the test result.
6. While there are more steps to bisect, repeat from step 1.
7. Run git bisect reset to restore the tree to its original state.

Some years ago, someone noticed that this seems like a task a computer could do. At least as early as git 1.6.0 (around 2010), this has been possible using git bisect run. Once you get the hang of it, it's surprisingly easy to use.

## A Word of Caution

Before actually discussing automated bisects, I want to offer a word of caution. Bisecting, automated or otherwise, is a process that still requires the application of common sense. A critical step at the end of bisecting is manually testing the guilty commit and the commit immediately before the guilty commit. You should also look at the commit that git-bisect claims is guilty. Over the years I have seen many bug reports for Mesa that either point at commits that only change comments in the code or only change a driver other than the driver on which the bug was observed.

I have observed these failures to have two causes, although other causes may be possible. With a manual bisect, it is really easy to accidentally type git bisect good when you meant git bisect bad. I have done this by using up-arrow to go back through the shell command history when I'm too lazy to type the command again. It's really easy to get the wrong one doing that.

The other cause of failure I have observed occurs when multiple problems occur between the known-good commit and the known-bad commit. In this case the bisect will report a commit that was already known to be bad and was already fixed. This false information can waste a lot of time for the developer who is trying to fix the bug. They will spend time trying to determine why the commit reported by the bisect still causes problems when some other commit is the actual cause.

The remedy is proper application of common sense while performing the bisect. It's often not enough to just see that the test case fails. The mode of the failure must also be considered.

## Automated Bisect Basics

All of the magic in an automated bisect revolves around a single script that you supply. This script analyzes the commit and tells bisect what to do next. There are four things that the script can tell bisect, and each of them is communicated using the script's return code.
1. Skip this commit because it cannot be analyzed. This is identical to manually running git bisect skip. This can be used, for example, if the commit does not build. A script might contain something like:

        if ! make ; then
            exit 125
        fi

   As you might infer from the code example, a return code of 125 instructs bisect to skip the commit.

2. Accept this commit as good. This is identical to git bisect good. A return code of 0 instructs bisect to accept the commit as good.

3. Reject this commit as bad. This is identical to git bisect bad. All tests in the piglit test suite print a string in a specific format when a test passes or fails. This can be used by the script to generate the exit code. For example:

        bin/arb_clear_buffer_object-null-data -auto > /tmp/result.$$
        if grep -q 'PIGLIT: {"result": "pass" }' /tmp/result.$$; then
            rm /tmp/result.$$
            exit 0
        else
            cat /tmp/result.$$
            rm /tmp/result.$$
            exit 1
        fi

   In this bit of script, the output of the test is printed in the "bad" case. This can be very useful. Bisects of long ranges of commits may encounter failures unrelated to the failure you are trying to bisect. Seeing the output from the test may alert you to failures with unrelated causes.

   Looking for simple "pass" or "fail" output from the test may not be good enough. It may be better to look for specific failure messages from the test. As mentioned above, it is important to only report a commit as bad if the test fails due to the problem you are trying to bisect. Imagine a case where a failure in the arb_clear_buffer_object-null-data test on the master branch is being bisected. The particular failure is an incorrect error being generated, and the known-good commit is HEAD~200 when the last stable release occurred (on a different branch with a common root). However, HEAD~110..HEAD~90 contain an unrelated rendering error that was fixed in HEAD~89. Since git-bisect performs a binary search, it will test HEAD~100 first and see the wrong failure. Simply looking for test failure would incorrectly identify HEAD~110 as the problem commit. If the script instead checked for the specific incorrect error message, the correct guilty commit is more likely to be found.

   A return code with any value of 1 through 127, excluding 125, instructs bisect to reject the commit as bad.

4. Stop at this commit and wait for human interaction. This can be used when something really catastrophic happens that requires human attention. Imagine a scenario where the bisect is being performed on one system but tests are run on another. This could be used if the bisect system is unable to communicate with the test system. A return code with any value of 128 through 255 will halt the bisect.

All of this can be used to generate a variety of scripts for any sort of complex environment. To quote Alan Kay, "Simple things should be simple, complex things should be possible." For a simple make-based project and an appropriately written test case, an automated bisect script could be as simple as:

    #!/bin/bash
    if ! make ; then
        exit 125
    fi

    # Use the return code from the test case
    exec test_case

Since this is just a script that builds the project and runs a test, you can easily test the script. Testing the script is a very good idea if you plan to leave the computer while the bisect runs. It would be a shame to leave the computer for several hours only to find it stuck at the first commit due to a problem in the automated bisect script. Assuming the script is called auto_bisect.sh, testing the script can be as easy as:

    $ ./auto_bisect.sh ; echo $?
Now all of the human interaction for the entire bisect would be three commands:

    $ git bisect start bad-commit good-commit
    $ git bisect run auto_bisect.sh
    $ git bisect reset


If there are a lot of commits in good-commit..bad-commit, building the project takes a long time, or running the tests takes a long time, feel free to go have a sandwich while you wait.  Or play Quake.  Or do other work.

## Broken Builds

The bane of most software developers' existence is a broken build. Few things are more irritating. With GIT, it is possible to have transient build failures that nobody notices. It's not unheard of for a 20-patch series to break at patch 9 and be fixed at patch 11. This commonly occurs either when people move patches around in a series during development or when reviewers suggest splitting large patches into smaller patches. In either case patch 9 could add a call to a function that isn't added until patch 11, for example. If nobody builds at patch 9, the break goes unnoticed.

The break goes unnoticed until a bisect hits exactly patch 9. If the problem being bisected and the build break are unrelated (and the build break is short lived), the normal skip process is sufficient. The range of commits that don't build will be skipped. Assuming the commits immediately before and after that range are both good or both bad, the guilty commit will still be found.
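If you already know the broken range before starting, you can also hand it to git-bisect up front; git bisect skip accepts commit ranges. A sketch with placeholder commit names (broke-at is the first commit that fails to build, fixed-at the commit that repairs it):

    # Mark every commit known not to build as untestable before the run.
    # The A..B range excludes A itself, hence the ^ suffixes.
    $ git bisect skip broke-at^..fixed-at^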

Sometimes things are not quite so rosy. You are bisecting because there was a problem, after all. Why have just one problem when you can have a whole mess of them? I believe that the glass is either empty or overflowing with steaming hot fail. The failing case might look something like:

$ git bisect start HEAD HEAD~20
Bisecting: 9 revisions left to test after this (roughly 3 steps)
[2d712d35c57900fc0aa0f1455381de48cdda0073] gallium/radeon: move printing texture info into a separate function
$ git bisect run ./auto_bisect.sh
running ./auto_bisect.sh
auto_bisect.sh says skip
Bisecting: 9 revisions left to test after this (roughly 3 steps)
[622186fbdf47e4c77aadba3e38567636ecbcccf5] mesa: errors: validate the length of null terminated string
running ./auto_bisect.sh
auto_bisect.sh says good
Bisecting: 8 revisions left to test after this (roughly 3 steps)
running ./auto_bisect.sh
auto_bisect.sh says skip
Bisecting: 8 revisions left to test after this (roughly 3 steps)
[5294debfa4910e4259112ce3c6d5a8c1cd346ae9] automake: Fix typo in MSVC2008 compat flags.
running ./auto_bisect.sh
auto_bisect.sh says good
Bisecting: 6 revisions left to test after this (roughly 3 steps)
running ./auto_bisect.sh
auto_bisect.sh says skip
Bisecting: 6 revisions left to test after this (roughly 3 steps)
running ./auto_bisect.sh
auto_bisect.sh says skip
Bisecting: 6 revisions left to test after this (roughly 3 steps)
running ./auto_bisect.sh
auto_bisect.sh says skip
Bisecting: 6 revisions left to test after this (roughly 3 steps)
[bfc14796b077444011c81f544ceec5d8592c5c77] radeonsi: fix occlusion queries on Fiji
running ./auto_bisect.sh
Bisecting: 5 revisions left to test after this (roughly 3 steps)
running ./auto_bisect.sh
auto_bisect.sh says skip
Bisecting: 5 revisions left to test after this (roughly 3 steps)
running ./auto_bisect.sh
auto_bisect.sh says skip
Bisecting: 5 revisions left to test after this (roughly 3 steps)
[3a6de8c86ee8a0a6d2f2fbc8cf2c461af0b9a007] radeonsi: print framebuffer info into ddebug logs
running ./auto_bisect.sh
Bisecting: 3 revisions left to test after this (roughly 2 steps)
[a5055e2f86e698a35da850378cd2eaa128df978a] gallium/aux/util: Trivial, we already have format use it
running ./auto_bisect.sh
auto_bisect.sh says skip
There are only 'skip'ped commits left to test.
The first bad commit could be any of:
19eaceb6edc6cd3a9ae878c89f9deb79afae4dd6
2d712d35c57900fc0aa0f1455381de48cdda0073
84fbb0aff98d6e90e4759bbe701c9484e569c869
c60d49161e3496b9e64b99ecbbc7ec9a02b15a17
1cca259d9942e2f453c65e8d7f9f79fe9dc5f0a7
75d64698f0b0c906d611e69d9f8b118c35026efa
a0bfb2798d243a4685d6ea32e9a7091fcec74700
a5055e2f86e698a35da850378cd2eaa128df978a
3a6de8c86ee8a0a6d2f2fbc8cf2c461af0b9a007
We cannot bisect more!
bisect run cannot continue any more


In even more extreme cases, the range of breaks can be even longer. Six or seven is about the most that I have personally experienced.

The problem doesn't have to be a broken build. It could be anything that prevents the test case from running. On Mesa I have experienced problems where a bug that prevents one driver from being able to load or create an OpenGL context persists for a few commits. Anything that prevents the test from running (e.g., does not produce a pass or fail result) or causes additional, unrelated failures should be skipped.
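One way to script that rule is to probe the driver before running the real test and skip the commit when no usable context comes up. A minimal sketch, assuming an X session is available and that falling back to software rendering (llvmpipe) means the driver under test failed to load:

    # Hypothetical guard for auto_bisect.sh: skip commits where no GL
    # context can be created or we silently fell back to llvmpipe.
    renderer=$(glxinfo 2>/dev/null | grep "OpenGL renderer string")
    if [ -z "$renderer" ] || echo "$renderer" | grep -q -i llvmpipe ; then
        exit 125
    fi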

Usually the problem is something really trivial. If the problem was fixed, a patch for the problem may even already exist. Let's assume a patch exists in a file named fix-the-build.patch. We also know that the build broke at commit 75d6469 and was fixed at commit 3a6de8c. This means that the range 75d6469^..3a6de8c^ needs the patch applied. If you're not convinced that the ^ is necessary, observe the log output:

    $ git log --oneline 75d6469^..3a6de8c^
    a0bfb27 gallium/radeon: print more info about HTILE
    1cca259 gallium/radeon: print more info about CMASK
    84fbb0a gallium/radeon: rename fmask::pitch -> pitch_in_pixels
    19eaceb gallium/radeon: print more information about textures
    2d712d3 gallium/radeon: move printing texture info into a separate function
    c60d491 gallium/radeon: remove unused r600_texture::pitch_override
    75d6469 gallium/radeon: remove DBG_TEXMIP

Notice that the bottom commit in the list is the commit where the break is first experienced, and the top commit in the list is not the one where the break is fixed. Using this information is simple. The bisect script need only determine whether the current commit is in the list of commits that need the patch and conditionally apply the patch.

    # Get the short-form SHA of the current commit
    SHA=$(git log --oneline HEAD^.. | cut -f1 -d' ')

    # If the current commit is in the list of commits that need the patch
    # applied, do it.  If applying the patch fails, even partially, abort.
    if grep --silent "^$SHA " <(git log --oneline 75d6469^..3a6de8c^); then
        # The <( ) bit runs git-log, redirects the output to a temporary
        # file, then uses that temporary file as the input to grep.
        # Non-bash shells will probably need to do all that manually.
        if ! patch -p1 --forward --silent < fix-the-build.patch ; then
            exit 255
        fi
    fi

Before exiting, the script must return the tree to its original state. If it does not, checking out the next commit may fail or applying the patch on the next step will certainly fail. git-reset can do most of the work. It just has to be applied everywhere this script might exit. I generally do this using a wrapper function. The simple bisect script from before might look like:

    #!/bin/bash
    function report()
    {
        git reset --hard HEAD
        exit $1
    }

    # Get the short-form SHA of the current commit
    SHA=$(git log --oneline HEAD^.. | cut -f1 -d' ')

    # If the current commit is in the list of commits that need the patch
    # applied, do it.  If applying the patch fails, even partially, abort.
    if grep --silent "^$SHA " <(git log --oneline 75d6469^..3a6de8c^); then
        if ! patch -p1 --forward --silent < fix-the-build.patch ; then
            # Just exit here... so that we can see what went wrong
            exit 255
        fi
    fi

    if ! make ; then
        report 125
    fi

    # Use the return code from the test case
    test_case
    report $?

This can be extended to any number of patches to fix any number of problems. There is one other tip here. If the first bisect attempt produced inconclusive results due to skipped commits, it may not have been wasted effort. Referring back to the previous output, there were two good commits found. These commits can be given to the next invocation of git bisect start. This helps reduce the search space, from 9 revisions to 6 in this case.

    $ git bisect start HEAD HEAD~20 622186fbdf47e4c77aadba3e38567636ecbcccf5 5294debfa4910e4259112ce3c6d5a8c1cd346ae9
    Bisecting: 6 revisions left to test after this (roughly 3 steps)


Using the last bad commit can reduce the search even further.

    $ git bisect start 3a6de8c86ee8a0a6d2f2fbc8cf2c461af0b9a007 HEAD~20 622186fbdf47e4c77aadba3e38567636ecbcccf5 5294debfa4910e4259112ce3c6d5a8c1cd346ae9
    Bisecting: 4 revisions left to test after this (roughly 2 steps)
    [2d712d35c57900fc0aa0f1455381de48cdda0073] gallium/radeon: move printing texture info into a separate function

Note that git-bisect does not emit "good" or "bad" information. You have to author your bisect script to emit that information. The report function is a good place to do this.

    function report()
    {
        if [ $1 -eq 0 ]; then
            echo "    auto_bisect.sh says good"
        elif [ $1 -eq 125 ]; then
            echo "    auto_bisect.sh says skip"
        else
            echo "    auto_bisect.sh says bad"
        fi

        git reset --hard HEAD
        exit $1
    }


## Remote Test Systems

Running tests on remote systems poses additional challenges. At the very least, there are three additional steps: get the built project onto the remote system, start the test on the remote system, and retrieve the result.

For these extra steps, rsync and ssh are powerful tools. There are numerous blog posts and tutorials dedicated to using rsync and ssh in various environments, and duplicating that effort is well beyond the scope of this post. However, there is one nice feature relevant to automated bisects that is worth mentioning.

Recall that returning 255 from the script will cause the bisect to halt waiting for human intervention. It just so happens that ssh returns 255 when an error occurs. Otherwise it returns the result of the remote command. To make use of this, split the work across two scripts instead of putting all of the test in a single auto_bisect.sh script. A new local_bisect.sh contains all of the commands that run on the build / bisect system, and remote_bisect.sh contains all of the commands that run on the testing system.

remote_bisect.sh should (only) execute the test and exit with the same sort of return code as auto_bisect.sh would. local_bisect.sh should build the project, copy the build to the testing system, and start the test on the testing system. The return code from remote_bisect.sh should be directly returned from local_bisect.sh. A simple local_bisect.sh doesn't look too different from auto_bisect.sh:

    #!/bin/bash
    if ! make ; then
        exit 125
    fi

    if ! rsync build_results tester@test.system.com:build_results/; then
        exit 255
    fi

    # Use the return code from the test case
    exec ssh tester@test.system.com /path/to/test/remote_bisect.sh


Since remote_bisect.sh returns "normal" automated bisect return codes and ssh returns 255 on non-test failures, everything is taken care of.
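For completeness, remote_bisect.sh can be little more than the piglit snippet from earlier wrapped in a script; a minimal sketch, reusing the test binary and the expected output from that example:

    #!/bin/bash
    # Runs on the test system: execute the test and report pass or fail.
    bin/arb_clear_buffer_object-null-data -auto > /tmp/result.$$
    if grep -q 'PIGLIT: {"result": "pass" }' /tmp/result.$$; then
        rm /tmp/result.$$
        exit 0
    else
        cat /tmp/result.$$
        rm /tmp/result.$$
        exit 1
    fi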

## Interactive Test Cases

Automated bisecting doesn't work out too well when the test itself cannot be automated. There is still some benefit to be had from automating the rest of the process, though. Optionally applying patches, building the project, sending files to remote systems, and starting the test can all still be automated, though I think "automated" applies only very loosely here. When the test is done, the script should exit with a return code of 255. This will halt the bisect. Run git bisect good or git bisect bad as appropriate. Then, run git bisect run ./auto_bisect.sh again.
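A sketch of such a semi-automated script, with run_the_game standing in for whatever launches the interactive test:

    #!/bin/bash
    if ! make ; then
        exit 125
    fi

    # Launch the interactive test, then halt the bisect so a human can
    # judge the result with "git bisect good" or "git bisect bad".
    run_the_game
    exit 255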

It's tempting to just run auto_bisect.sh by hand and skip git bisect run. The small advantage of the latter is that skipping build failures will still be automated.

Going further requires making an interactive test case non-interactive. For developers of OpenGL drivers, it is common to need to bisect rendering errors in games. This can be really, really painful and tedious. Most of the pain comes from the task not being automatable. Just loading the game and getting to the place where the error occurs can often take several minutes. These bugs are often reported by end-users who last tested with the previous release. From the 11.0 branch point to the 11.1 branch point on Mesa, there were 2,308 commits.

    git bisect start 11.1-branchpoint 11.0-branchpoint
    Bisecting: 1153 revisions left to test after this (roughly 10 steps)
    [bf5f931aee35e8448a6560545d86deb35f0639b3] nir: make nir_instrs_equal() static

When you realize that the bisect will be 10 steps with at least 10 to 15 minutes per step, you may begin to feel your insides die. It's even worse if you accidentally type git bisect good when you meant git bisect bad along the way.

This is a common problem testing interactive applications. A variety of tools exist to remove the interactivity from interactive applications. apitrace is one such tool. Using apitrace, the OpenGL commands from the application can be recorded. This step must be done manually. The resulting trace can then be run at a known good commit, and an image can be captured from the portion of the trace that would exhibit the bug. This step must also be done manually, but the image capture is performed by a command line option to the trace replay command.

Now a script can replay the trace, collect a new image, and compare the new image with the old image. If the images match, the commit is good. Otherwise, the commit is bad. This can be error prone, so it's a good idea to keep all the images from the bisect. A human can then examine all the images after the bisect to make sure the right choices were made at each commit tested.

A full apitrace tutorial is beyond the scope of this post. The apitrace documentation has some additional information about using apitrace with automated bisects.

## What Now?

git-bisect is an amazingly powerful tool. Some of the more powerful aspects of GIT get a bad rap for being hard to use correctly: make one mistake, and you'll have to re-clone your tree. With even the more powerful aspects of git-bisect, such as automated bisects, it's actually hard to go wrong.

There are two absolutely critical final tips. First, remember that you're bisecting. If you start performing other GIT commands in the middle of the bisect, both you and your tree will get confused. Second, remember to reset your tree using git bisect reset when you're done. Without this step, you'll still be bisecting, so see the first tip.

git-bisect and automated bisects really make simple things simple and complex things possible.

 November 29, 2015

For public documentation's sake, if you want to run Steam on 64-bit Ubuntu 15.10 together with the open-source radeonsi driver (the part of Mesa that implements OpenGL for recent AMD GPUs), you'll probably run into problems that can be fixed by starting Steam as

    LD_PRELOAD=/usr/lib/i386-linux-gnu/libstdc++.so.6 steam

(You'll have to have installed certain 32-bit packages, which you should have automatically been prompted to do while installing Steam.)

The background of this issue is that both Steam and radeonsi contain C++ code and dynamically link against the C++ runtime libraries. However, Steam comes with its own copy of these libraries, which it prefers to link against. Meanwhile, Ubuntu 15.10 is built using the most recent g++, which contains a new C++ ABI. This ABI is not supported by the C++ libraries shipped with Steam, so that by default, the radeonsi driver fails to load due to linking errors.

The workaround noted above fixes the problem by forcing the use of the newer C++ library. (Note that it does so only for 32-bit applications. If you're running 64-bit games from Steam, a similar workaround may be required.)

 November 23, 2015

Long time no post!
A quick reminder for students (*) in Spain interested in participating in this year's CUSL: the deadline for the project proposals has been extended until December 1st: https://www.concursosoftwarelibre.org/1516

You're still on time to submit a proposal!

* Universidad, bachiller, ciclos de grado medio…

Filed under: FreeDesktop Planet, GNOME Planet, GNU Planet, Meetings, Planets, Uncategorized

 November 20, 2015

Yesterday I pushed an implementation of a new OpenGL extension, GL_EXT_shader_samples_identical, to Mesa. This extension will be in the Mesa 11.1 release in a few short weeks, and it will be enabled on various Intel platforms:

• GEN7 (Ivy Bridge, Baytrail, Haswell): Only currently effective in the fragment shader. More details below.
• GEN8 (Broadwell, Cherry Trail, Braswell): Only currently effective in the vertex shader and fragment shader. More details below.
• GEN9 (Skylake): Only currently effective in the vertex shader and fragment shader. More details below.

The extension hasn't yet been published in the official OpenGL extension registry, but I will take care of that before Mesa 11.1 is released.

## Background

Multisample anti-aliasing (MSAA) is a well known technique for reducing aliasing effects ("jaggies") in rendered images. The core idea is that the expensive part of generating a single pixel color happens once. The cheaper part of determining where that color exists in the pixel happens multiple times. For 2x MSAA this happens twice, for 4x MSAA this happens four times, etc. The computation cost is not increased by much, but the storage and memory bandwidth costs are increased linearly.

Some time ago, a clever person noticed that in areas where the whole pixel is covered by the triangle, all of the samples have exactly the same value. Furthermore, since most triangles are (much) bigger than a pixel, this is really common. From there it is trivial to apply some sort of simple data compression to the sample data, and all modern GPUs do this in some form. In addition to the surface that stores the data, there is a multisample control surface (MCS) that describes the compression.

On Intel GPUs, sample data is stored in n separate planes. For 4x MSAA, there are four planes. The MCS has a small table for each pixel that maps a sample to a plane. If the entry for sample 2 in the MCS is 0, then the data for sample 2 is stored in plane 0. The GPU automatically uses this to reduce bandwidth usage. When writing a pixel on the interior of a polygon (where all the samples have the same value), the MCS gets all zeros written, and the sample value is written only to plane 0.

This does add some complexity to the shader compiler. When a shader executes the texelFetch function, several things happen behind the scenes. First, an instruction is issued to read the MCS. Then a second instruction is executed to read the sample data. This second instruction uses the sample index and the result of the MCS read as inputs.
A simple shader like

    #version 150
    uniform sampler2DMS tex;
    uniform int samplePos;

    in vec2 coord;
    out vec4 frag_color;

    void main()
    {
        frag_color = texelFetch(tex, ivec2(coord), samplePos);
    }

generates this assembly

    pln(16) g8<1>F g7<0,1,0>F g2<8,8,1>F { align1 1H compacted };
    pln(16) g10<1>F g7.4<0,1,0>F g2<8,8,1>F { align1 1H compacted };
    mov(16) g12<1>F g6<0,1,0>F { align1 1H compacted };
    mov(16) g16<1>D g8<8,8,1>F { align1 1H compacted };
    mov(16) g18<1>D g10<8,8,1>F { align1 1H compacted };
    send(16) g2<1>UW g16<8,8,1>F sampler ld_mcs SIMD16 Surface = 1 Sampler = 0 mlen 4 rlen 8 { align1 1H };
    mov(16) g14<1>F g2<8,8,1>F { align1 1H compacted };
    send(16) g120<1>UW g12<8,8,1>F sampler ld2dms SIMD16 Surface = 1 Sampler = 0 mlen 8 rlen 8 { align1 1H };
    sendc(16) null<1>UW g120<8,8,1>F render RT write SIMD16 LastRT Surface = 0 mlen 8 rlen 0 { align1 1H EOT };

The ld_mcs instruction is the read from the MCS, and the ld2dms is the read from the multisample surface using the MCS data. If a shader reads multiple samples from the same location, the compiler will likely eliminate all but one of the ld_mcs instructions.

Modern GPUs also have an additional optimization. When an application clears a surface, some values are much more commonly used than others. Permutations of 0s and 1s are, by far, the most common. Bandwidth usage can further be reduced by taking advantage of this. With a single bit for each of red, green, blue, and alpha, only four bits are necessary to describe a clear color that contains only 0s and 1s. A special value could then be stored in the MCS for each sample that uses the fast-clear color. A clear operation that uses a fast-clear compatible color only has to modify the MCS.

All of this is well documented in the Programmer's Reference Manuals for Intel GPUs.

## There's More

Information from the MCS can also help users of the multisample surface reduce memory bandwidth usage. Imagine a simple, straightforward shader that performs an MSAA resolve operation:

    #version 150
    uniform sampler2DMS tex;

    #define NUM_SAMPLES 4 // generate a different shader for each sample count

    in vec2 coord;
    out vec4 frag_color;

    void main()
    {
        vec4 color = texelFetch(tex, ivec2(coord), 0);

        for (int i = 1; i < NUM_SAMPLES; i++)
            color += texelFetch(tex, ivec2(coord), i);

        frag_color = color / float(NUM_SAMPLES);
    }

The problem should be obvious. On most pixels all of the samples will have the same color, but the shader still reads every sample. It's tempting to think the compiler should be able to fix this. In very simple cases like this one, that may be possible, but such an optimization would be both challenging to implement and, likely, very easy to fool.

A better approach is to just make the data available to the shader, and that is where this extension comes in. A new function textureSamplesIdenticalEXT is added that allows the shader to detect the common case where all the samples have the same value. The new, optimized shader would be:

    #version 150
    #extension GL_EXT_shader_samples_identical: enable
    uniform sampler2DMS tex;

    #define NUM_SAMPLES 4 // generate a different shader for each sample count

    #if !defined GL_EXT_shader_samples_identical
    #define textureSamplesIdenticalEXT(t, c)  false
    #endif

    in vec2 coord;
    out vec4 frag_color;

    void main()
    {
        vec4 color = texelFetch(tex, ivec2(coord), 0);

        if (!textureSamplesIdenticalEXT(tex, ivec2(coord))) {
            for (int i = 1; i < NUM_SAMPLES; i++)
                color += texelFetch(tex, ivec2(coord), i);

            color /= float(NUM_SAMPLES);
        }

        frag_color = color;
    }

The intention is that this function be implemented by simply examining the MCS data. At least on Intel GPUs, if the MCS for a pixel is all 0s, then all the samples are the same. Since textureSamplesIdenticalEXT can reuse the MCS data read by the first texelFetch call, there are no extra reads from memory. There is just a single compare and conditional branch. These added instructions can be scheduled while waiting for the ld2dms instruction to read from memory (slow), so they are practically free.

It is also tempting to use textureSamplesIdenticalEXT in conjunction with anyInvocationsARB (from GL_ARB_shader_group_vote). Such a shader might look like:

    #version 430
    #extension GL_EXT_shader_samples_identical: require
    #extension GL_ARB_shader_group_vote: require
    uniform sampler2DMS tex;

    #define NUM_SAMPLES 4 // generate a different shader for each sample count

    in vec2 coord;
    out vec4 frag_color;

    void main()
    {
        vec4 color = texelFetch(tex, ivec2(coord), 0);

        if (anyInvocationsARB(!textureSamplesIdenticalEXT(tex, ivec2(coord)))) {
            for (int i = 1; i < NUM_SAMPLES; i++)
                color += texelFetch(tex, ivec2(coord), i);

            color /= float(NUM_SAMPLES);
        }

        frag_color = color;
    }

Whether or not using anyInvocationsARB improves performance is likely to be dependent on both the shader and the underlying GPU hardware. Currently Mesa does not support GL_ARB_shader_group_vote, so I don't have any data one way or the other.

## Caveats

The implementation of this extension that will ship with Mesa 11.1 has three main caveats. Each of these will likely be resolved to some extent in future releases.

The extension is only effective on scalar shader units. This means that on GEN7 it is effective in the fragment shader. On GEN8 and GEN9 it is only effective in the vertex shader and fragment shader. It is supported in all shader stages, but in non-scalar stages textureSamplesIdenticalEXT always returns false. The implementation for the non-scalar stages is slightly different, and, on GEN9, the exact set of instructions depends on the number of samples. I didn't think it was likely that people would want to use this feature in a vertex shader or geometry shader, so I just didn't finish the implementation. This will almost certainly be resolved in Mesa 11.2.

The current implementation also returns a false negative for texels fully set to the fast-clear color. There are two problems with the fast-clear color. It uses a different value than the "all plane 0" case, and the size of the value depends on the number of samples. For 2x MSAA, the MCS read returns 0x000000ff, but for 8x MSAA it returns 0xffffffff.

The first problem means the compiler would need to generate additional instructions to check for "all plane 0" or "all fast-clear color." This could hurt the performance of applications that either don't use a fast-clear color or, more likely, that later draw non-clear data to the entire surface. The second problem means the compiler would need to do state-based recompiles when the number of samples changes.

In the end, we decided that "all plane 0" was by far the most common case, so we have ignored the "all fast-clear color" case for the time being. We are still collecting data from applications, and we're working on several uses of this functionality inside our driver.
In future versions we may implement a heuristic to determine whether or not to check for the fast-clear color case.

As mentioned above, Mesa does not currently support GL_ARB_shader_group_vote. Applications that want to use textureSamplesIdenticalEXT on Mesa will need paths that do not use anyInvocationsARB, at least for the time being.

## Future

As stated by issue #3, the extension still needs to gain SPIR-V support. This extension would be just as useful in Vulkan and OpenCL as it is in OpenGL.

At some point there is likely to be a follow-on extension that provides more MCS data to the shader in a more raw form. As stated in issue #2 and previously in this post, there are a few problems with providing raw MCS data. The biggest problem is how the data is returned. Each sample count needs a different amount of data. Current 8x MSAA surfaces have 32 bits (returned) per pixel. Current 16x MSAA MCS surfaces have 64 bits per pixel. Future 32x MSAA, should that ever exist, would need 192 bits. Additionally, there would need to be a set of new texelFetch functions that take both a sample index and the MCS data. This, again, has problems with variable data size.

Applications would also want to query things about the MCS values. How many times is plane 0 used? Which samples use plane 2? What is the highest plane used? There could be other useful queries. I can imagine that a high quality, high performance multisample resolve filter could want all of this information. Since the data changes based on the sample count and could change on future hardware, the future extension really should not directly expose the encoding of the MCS data. How should it provide the data? I'm expecting to write some demo applications and experiment with a bunch of different things. Obviously, this is an open area of research.

 November 19, 2015

So many of you have probably seen that RHEL 7.2 is out today. There are many important updates in this release, some of them detailed in the official RHEL 7.2 press release.

One thing however which you would only discover if you start digging into the 7.2 update is that it's the first time in RHEL history that we are doing a full-scale desktop update in a point release. We shipped RHEL 7.0 with GNOME 3.8 and in RHEL 7.2 we are updating it to GNOME 3.14. This brings a lot of major new features into RHEL, like the work we did on improved HiDPI support and improved touch and gesture support; it brings GNOME Software to RHEL, the improved system status area, and so on. We plan on updating the desktop further in later RHEL 7.x point releases.

This change of policy is of course important to the many RHEL Workstation customers we have, but I also hope it will make RHEL Workstation and also the CentOS Workstation more attractive options to those in the community who have been looking for an LTS version of Fedora. This policy change gives you the rock-solid foundation of RHEL and the RHEL kernel and combines it with a very well tested yet fairly new desktop release. So if you feel Fedora is moving too quickly, yet have felt that RHEL on the other hand has been moving too slowly, we hope that with this change to RHEL we have found a sweet compromise. We will of course also keep doing regular application updates in RHEL 7.x, just like we started with in RHEL 6.x, giving you up-to-date versions of things like LibreOffice, Firefox, Thunderbird and more.
 November 18, 2015

# The Event Loop API of libsystemd

When we began working on systemd we built it around a hand-written ad-hoc event loop, wrapping Linux epoll. The more our project grew the more we realized the limitations of using raw epoll:

• As we used timerfd for our timer events, each event source cost one file descriptor and we had many of them! File descriptors are a scarce resource on UNIX, as RLIMIT_NOFILE is typically set to 1024 or similar, limiting the number of available file descriptors per process to 1021, which isn't particularly a lot.

• Ordering of event dispatching became a nightmare. In many cases, we wanted to make sure that a certain kind of event would always be dispatched before another kind of event, if both happen at the same time. For example, when the last process of a service dies, we might be notified about that via a SIGCHLD signal, via an sd_notify() "STATUS=" message, and via a control group notification. We wanted to get these events in the right order, to know when it's safe to process and subsequently release the runtime data systemd keeps about the service or process: it shouldn't be done if there are still events about it pending.

• For each program we added to the systemd project we noticed we were adding similar code, over and over again, to work with epoll's complex interfaces. For example, finding the right file descriptor and callback function to dispatch an epoll event to, without running into invalidated pointer issues, is outright difficult and requires non-trivial code.

• Integrating child process watching into our event loops was much more complex than one could hope, and even more so if child process events should be ordered against each other and unrelated kinds of events.

Eventually, we started working on sd-bus. At the same time we decided to seize the opportunity, put together a proper event loop API in C, and then not only port sd-bus on top of it, but also the rest of systemd. The result of this is sd-event. After almost two years of development we declared sd-event stable in systemd version 221, and published it as official API of libsystemd.

## Why?

sd-event.h, of course, is not the first event loop API around, and it doesn't implement any really novel concepts. When we started working on it we tried to do our homework, and checked the various existing event loop APIs, maybe looking for candidates to adopt instead of doing our own, and to learn about the strengths and weaknesses of the various implementations existing. Ultimately, we found no implementation that could deliver what we needed, or where it would be easy to add the missing bits: as usual in the systemd project, we wanted something that allows us access to all the Linux-specific bits, instead of limiting itself to the least common denominator of UNIX. We weren't looking for an abstraction API, but simply one that makes epoll usable in system code.

With this blog story I'd like to take the opportunity to introduce you to sd-event, and explain why it might be a good candidate to adopt as event loop implementation in your project, too. So, here are some features it provides:

• I/O event sources, based on epoll's file descriptor watching, including edge triggered events (EPOLLET). See sd_event_add_io(3).

• Timer event sources, based on timerfd_create(), supporting the CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_BOOTTIME clocks, as well as the CLOCK_REALTIME_ALARM and CLOCK_BOOTTIME_ALARM clocks that can resume the system from suspend.
When creating timer events a required accuracy parameter may be specified which allows coalescing of timer events to minimize power consumption. For each clock only a single timer file descriptor is kept, and all timer events are multiplexed with a priority queue. See sd_event_add_time(3).

• UNIX process signal events, based on signalfd(2), including full support for real-time signals, and queued parameters. See sd_event_add_signal(3).

• Child process state change events, based on waitid(2). See sd_event_add_child(3).

• Static event sources, of three types: defer, post and exit, for invoking calls in each event loop, after other event sources or at event loop termination. See sd_event_add_defer(3).

• Event sources may be assigned a 64bit priority value, that controls the order in which event sources are dispatched if multiple are pending simultaneously. See sd_event_source_set_priority(3).

• The event loop may automatically send watchdog notification messages to the service manager. See sd_event_set_watchdog(3).

• The event loop may be integrated into foreign event loops, such as the GLib one. The event loop API is hence composable, the same way the underlying epoll logic is. See sd_event_get_fd(3) for an example.

• The API is fully OOM safe.

• A complete set of documentation in UNIX man page format is available, with sd-event(3) as the entry page.

• It's pretty widely available, and requires no extra dependencies. Since systemd is built on it, most major distributions ship the library in their default install set.

• After two years of development, and after being used in all of systemd's components, it has received a fair share of testing already, even though we only recently decided to declare it stable and turned it into a public API.

Note that sd-event has some potential drawbacks too:

• If portability is essential to you, sd-event is not your best option. sd-event is a wrapper around Linux-specific APIs, and that's visible in the API. For example: our event callbacks receive structures defined by Linux-specific APIs such as signalfd.

• It's a low-level C API, and it doesn't isolate you from the OS underpinnings. While I like to think that it is relatively nice and easy to use from C, it doesn't compromise on exposing the low-level functionality. It just fills the gaps in what's missing between epoll, timerfd, signalfd and related concepts, and it does not hide that away.

Either way, I believe that sd-event is a great choice when looking for an event loop API, in particular if you work on system-level software and embedded, where functionality like timer coalescing or watchdog support matter.

## Getting Started

Here's a short example how to use sd-event in a simple daemon. In this example, we'll not just use sd-event.h, but also sd-daemon.h to implement a system service.
    #include <alloca.h>
    #include <endian.h>
    #include <errno.h>
    #include <netinet/in.h>
    #include <signal.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #include <systemd/sd-daemon.h>
    #include <systemd/sd-event.h>

    static int io_handler(sd_event_source *es, int fd, uint32_t revents, void *userdata) {
            void *buffer;
            ssize_t n;
            int sz;

            /* UDP enforces a somewhat reasonable maximum datagram size of 64K,
               we can just allocate the buffer on the stack */
            if (ioctl(fd, FIONREAD, &sz) < 0)
                    return -errno;
            buffer = alloca(sz);

            n = recv(fd, buffer, sz, 0);
            if (n < 0) {
                    if (errno == EAGAIN)
                            return 0;
                    return -errno;
            }

            if (n == 5 && memcmp(buffer, "EXIT\n", 5) == 0) {
                    /* Request a clean exit */
                    sd_event_exit(sd_event_source_get_event(es), 0);
                    return 0;
            }

            fwrite(buffer, 1, n, stdout);
            fflush(stdout);
            return 0;
    }

    int main(int argc, char *argv[]) {
            union {
                    struct sockaddr_in in;
                    struct sockaddr sa;
            } sa;
            sd_event_source *event_source = NULL;
            sd_event *event = NULL;
            int fd = -1, r;
            sigset_t ss;

            r = sd_event_default(&event);
            if (r < 0)
                    goto finish;

            if (sigemptyset(&ss) < 0 ||
                sigaddset(&ss, SIGTERM) < 0 ||
                sigaddset(&ss, SIGINT) < 0) {
                    r = -errno;
                    goto finish;
            }

            /* Block SIGTERM first, so that the event loop can handle it */
            if (sigprocmask(SIG_BLOCK, &ss, NULL) < 0) {
                    r = -errno;
                    goto finish;
            }

            /* Let's make use of the default handler and "floating"
               reference features of sd_event_add_signal() */
            r = sd_event_add_signal(event, NULL, SIGTERM, NULL, NULL);
            if (r < 0)
                    goto finish;
            r = sd_event_add_signal(event, NULL, SIGINT, NULL, NULL);
            if (r < 0)
                    goto finish;

            /* Enable automatic service watchdog support */
            r = sd_event_set_watchdog(event, true);
            if (r < 0)
                    goto finish;

            fd = socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0);
            if (fd < 0) {
                    r = -errno;
                    goto finish;
            }

            sa.in = (struct sockaddr_in) {
                    .sin_family = AF_INET,
                    .sin_port = htobe16(7777),
            };

            if (bind(fd, &sa.sa, sizeof(sa)) < 0) {
                    r = -errno;
                    goto finish;
            }

            r = sd_event_add_io(event, &event_source, fd, EPOLLIN, io_handler, NULL);
            if (r < 0)
                    goto finish;

            (void) sd_notifyf(false,
                              "READY=1\n"
                              "STATUS=Daemon startup completed, processing events.");

            r = sd_event_loop(event);

    finish:
            event_source = sd_event_source_unref(event_source);
            event = sd_event_unref(event);

            if (fd >= 0)
                    (void) close(fd);

            if (r < 0)
                    fprintf(stderr, "Failure: %s\n", strerror(-r));

            return r < 0 ? EXIT_FAILURE : EXIT_SUCCESS;
    }

The example above shows how to write a minimal UDP/IP server that listens on port 7777. Whenever a datagram is received it outputs its contents to STDOUT, unless it is precisely the string EXIT\n, in which case the service exits. The service will react to SIGTERM and SIGINT and do a clean exit then. It also notifies the service manager about its completed startup, if it runs under a service manager. Finally, it sends watchdog keep-alive messages to the service manager if it asked for that, and if it runs under a service manager.

When run as systemd service this service's STDOUT will be connected to the logging framework of course, which means the service can act as a minimal UDP-based remote logging service.

To compile and link this example, save it as event-example.c, then run:

    $ gcc event-example.c -o event-example $(pkg-config --cflags --libs libsystemd)


For a first test, simply run the resulting binary from the command line, and test it against the following netcat command line:

$ nc -u localhost 7777

For the sake of brevity, error checking is minimal, and a real-world application should, of course, be more comprehensive. However, it hopefully gets the idea across of how to write a daemon that reacts to external events with sd-event.

For further details on the functions used in the example above, please consult the manual pages: sd-event(3), sd_event_exit(3), sd_event_source_get_event(3), sd_event_default(3), sd_event_add_signal(3), sd_event_set_watchdog(3), sd_event_add_io(3), sd_notifyf(3), sd_event_loop(3), sd_event_source_unref(3), sd_event_unref(3).

## Conclusion

So, is this the event loop to end all other event loops? Certainly not. I actually believe in "event loop plurality". There are many reasons for that, but most importantly: sd-event is supposed to be an event loop suitable for writing a wide range of applications, but it's definitely not going to solve all event loop problems.

For example, while the priority logic is important for many use cases, it comes with drawbacks for others: if not used carefully, high-priority event sources can easily starve low-priority event sources. Also, in order to implement the priority logic, sd-event needs to linearly iterate through the event structures returned by epoll_wait(2) to sort the events by their priority, resulting in worst-case O(n*log(n)) complexity on each event loop wakeup (for n = number of file descriptors). Then, to implement priorities fully, sd-event only dispatches a single event before going back to the kernel and asking for new events. sd-event will hence not provide the theoretically best possible scalability to huge numbers of file descriptors. Of course, this could be optimized, by improving epoll and making it support how today's event loops actually work (after all, this is the problem set that all event loops implementing priorities -- including GLib's -- have to deal with), but even then: the design of sd-event is focused on running one event loop per thread, and it dispatches events strictly ordered. For many other important use cases a very different design is preferable: one where events are distributed to a set of worker threads and are dispatched out-of-order.

Hence, don't mistake sd-event for what it isn't. It's not supposed to unify everybody on a single event loop. It's just supposed to be a very good implementation of an event loop suitable for a large part of the typical use cases.

Note that our APIs, including sd-bus, integrate nicely into sd-event event loops, but do not require it, and may be integrated into other event loops too, as long as they support watching for time and I/O events.

And that's all for now. If you are considering using sd-event for your project and need help or have questions, please direct them to the systemd mailing list.

 November 16, 2015

Writing programs is fun, but making them fast can be a pain. Python programs are no exception to that, but the basic profiling toolchain is actually not that complicated to use. Here, I would like to show you how you can quickly profile and analyze your Python code to find what part of the code you should optimize.

# What's profiling?

Profiling a Python program is doing a dynamic analysis that measures the execution time of the program and everything that composes it. That means measuring the time spent in each of its functions. This will give you data about where your program is spending time, and what area might be worth optimizing. It's a very interesting exercise.
Many people focus on local optimizations, such as determining e.g. which of the Python functions range or xrange is going to be faster. It turns out that knowing which one is faster may never be an issue in your program, and that the time gained by one of the functions above might not be worth the time you spend researching that, or arguing about it with your colleague.

Trying to blindly optimize a program without measuring where it is actually spending its time is a useless exercise. Following your gut alone is not always sufficient.

There are many types of profiling, as there are many things you can measure. In this exercise, we'll focus on CPU utilization profiling, meaning the time spent by each function executing instructions. Obviously, we could do many more kinds of profiling and optimization, such as memory profiling, which would measure the memory used by each piece of code – something I talk about in The Hacker's Guide to Python.

# cProfile

Since Python 2.5, Python provides a C module called cProfile which has a reasonable overhead and offers a good enough feature set. The basic usage comes down to:

>>> import cProfile
>>> cProfile.run('2 + 2')
         2 function calls in 0.000 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

Though you can also run a script with it, which turns out to be handy:

$ python -m cProfile -s cumtime lwn2pocket.py
         72270 function calls (70640 primitive calls) in 4.481 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.004    0.004    4.481    4.481 lwn2pocket.py:2(<module>)
        1    0.001    0.001    4.296    4.296 lwn2pocket.py:51(main)
        3    0.000    0.000    4.286    1.429 api.py:17(request)
        3    0.000    0.000    4.268    1.423 sessions.py:386(request)
      4/3    0.000    0.000    3.816    1.272 sessions.py:539(send)
        4    0.000    0.000    2.965    0.741 adapters.py:323(send)
        4    0.000    0.000    2.962    0.740 connectionpool.py:421(urlopen)
        4    0.000    0.000    2.961    0.740 connectionpool.py:317(_make_request)
        2    0.000    0.000    2.675    1.338 api.py:98(post)
       30    0.000    0.000    1.621    0.054 ssl.py:727(recv)
       30    0.000    0.000    1.621    0.054 ssl.py:610(read)
       30    1.621    0.054    1.621    0.054 {method 'read' of '_ssl._SSLSocket' objects}
        1    0.000    0.000    1.611    1.611 api.py:58(get)
        4    0.000    0.000    1.572    0.393 httplib.py:1095(getresponse)
        4    0.000    0.000    1.572    0.393 httplib.py:446(begin)
       60    0.000    0.000    1.571    0.026 socket.py:410(readline)
        4    0.000    0.000    1.571    0.393 httplib.py:407(_read_status)
        1    0.000    0.000    1.462    1.462 pocket.py:44(wrapped)
        1    0.000    0.000    1.462    1.462 pocket.py:152(make_request)
        1    0.000    0.000    1.462    1.462 pocket.py:139(_make_request)
        1    0.000    0.000    1.459    1.459 pocket.py:134(_post_request)
[…]

This prints out all the functions called, with the time spent in each and the number of times they were called.

While useful, the output format is very basic and does not make it easy to gain insight into a complete program. For more advanced visualization, I leverage KCacheGrind. If you have done any C programming and profiling in the last few years, you may have used it, as it is primarily designed as a front-end for Valgrind-generated call-graphs.

In order to use it, you need to generate a cProfile result file, then convert it to the KCacheGrind format. To do that, I use pyprof2calltree.

$ python -m cProfile -o myscript.cprof myscript.py
$ pyprof2calltree -k -i myscript.cprof

And the KCacheGrind window magically appears!

# Concrete case: Carbonara optimization

I was curious about the performance of Carbonara, the small timeseries library I wrote for Gnocchi. I decided to do some basic profiling to see if there was any obvious optimization to do.

In order to profile a program, you need to run it. But running the whole program in profiling mode can generate a lot of data that you don't care about, and adds noise to what you're trying to understand. Since Gnocchi has thousands of unit tests and a few for Carbonara itself, I decided to profile the code used by these unit tests, as it's a good reflection of basic features of the library.

Note that this is a good strategy for a curious and naive first-pass profiling. There's no way to be sure that the hotspots you see in the unit tests are the actual hotspots you will encounter in production. Therefore, profiling under conditions and with a scenario that mimic what's seen in production is often a necessity if you need to push your program's optimization further and want to achieve perceivable and valuable gains.

I activated cProfile using the method described above, creating a cProfile.Profile object around my tests (I actually started to implement that in testtools). I then ran KCacheGrind as described above. Using KCacheGrind, I generated the following figures.
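Programmatically, wrapping the code you care about in a profiler looks roughly like this (a minimal sketch using only the standard library; run_my_tests() is a hypothetical stand-in for the real test-suite entry point):

import cProfile

def run_my_tests():
    # hypothetical stand-in for the code you actually want to measure
    sum(i * i for i in range(100000))

profiler = cProfile.Profile()
profiler.enable()
run_my_tests()
profiler.disable()
# Write a cProfile result file, ready for: pyprof2calltree -k -i tests.cprof
profiler.dump_stats("tests.cprof")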

The test I profiled here is called test_fetch and is pretty easy to understand: it puts data in a timeseries object, and then fetches the aggregated result. The above list shows that 88 % of the ticks are spent in set_values (44 ticks over 50). This function is used to insert values into the timeseries, not to fetch the values. That means that it's really slow to insert data, and pretty fast to actually retrieve it.

Reading the rest of the list indicates that several functions share the rest of the ticks: update, _first_block_timestamp, _truncate, _resample, etc. Some of the functions in the list are not part of Carbonara, so there's no point in looking to optimize them. The only thing that can be optimized is, sometimes, the number of times they're called.

The call graph gives me a bit more insight about what's going on here. Using my knowledge about how Carbonara works, I don't think that the whole stack on the left for _first_block_timestamp makes much sense. This function is supposed to find the first timestamp for an aggregate, e.g. with a timestamp of 13:34:45 and a period of 5 minutes, the function should return 13:30:00. The way it currently works is by calling the resample function from Pandas on a timeseries with only one element, but that seems to be very slow. Indeed, this function currently represents 25 % of the time spent by set_values (11 ticks out of 44).

Fortunately, I recently added a small function called _round_timestamp that does exactly what _first_block_timestamp needs without calling any Pandas function, so no resample. So I ended up rewriting that function this way:

 def _first_block_timestamp(self):
-        ts = self.ts[-1:].resample(self.block_size)
-        return (ts.index[-1] - (self.block_size * self.back_window))
+        rounded = self._round_timestamp(self.ts.index[-1], self.block_size)
+        return rounded - (self.block_size * self.back_window)
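For the curious, the idea behind such a rounding helper can be sketched as follows. This is a minimal illustration of rounding via integer arithmetic on the nanosecond epoch value, not the actual Carbonara code:

import pandas

def round_timestamp(ts, freq):
    # Round down to a multiple of freq by truncating the nanosecond
    # epoch value, avoiding Pandas' (slow) resample machinery entirely.
    nanos = pandas.tseries.frequencies.to_offset(freq).nanos
    return pandas.Timestamp((ts.value // nanos) * nanos)

print(round_timestamp(pandas.Timestamp("2015-11-16 13:34:45"), "5min"))
# prints: 2015-11-16 13:30:00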

And then I re-ran the exact same test to compare the output of cProfile.

The list of functions looks quite different this time: the proportion of time spent in set_values dropped from 88 % to 71 %.

The call stack for set_values shows that pretty well: we can't even see the _first_block_timestamp function anymore, as it is now so fast that it has totally disappeared from the display. The profiler now considers it insignificant.

So we just sped up the whole process of inserting values into Carbonara by a nice 25 % in a few minutes. Not bad for a first naive pass, right?

 November 15, 2015

Discworld Noir was a superb adventure game, but is also notoriously unreliable, even in Windows on real hardware; using Wine is just not going to work. After many attempts at bringing it back into working order, I've settled on an approach that seems to work: now that qemu and libvirt have made virtualization and emulation easy, I can run it in a version of Windows that was current at the time of its release. Unfortunately, Windows 98 doesn't virtualize particularly well either, so this still became a relatively extensive yak-shaving exercise.

These instructions assume that /srv/virt is a suitable place to put disk images, but you can use anywhere you want.

## The emulated PC

After some trial and error, it seems to work if I configure qemu to emulate the following:

• Fully emulated hardware instead of virtualization (qemu-system-i386 -no-kvm)
• Intel Pentium III
• Intel i440fx-based motherboard with ACPI
• Real-time clock in local time
• No HPET
• 256 MiB RAM
• IDE primary master: IDE hard disk (I used 30 GiB, which is massively overkill for this game; qemu can use sparse files so it actually ends up less than 2 GiB on the host system)
• IDE primary slave, secondary master, secondary slave: three CD-ROM drives
• PS/2 keyboard and mouse
• Realtek AC97 sound card
• Cirrus video card with 16 MiB video RAM

A modern laptop CPU is an order of magnitude faster than what Discworld Noir needs, so full emulation isn't a problem, despite being inefficient.

There is deliberately no networking, because Discworld Noir doesn't need it, and a 17-year-old operating system with no privilege separation is very much not safe to use on the modern Internet!

## Software needed

• Windows 98 installation CD-ROM as a .iso file (cp /dev/cdrom windows98.iso) - in theory you could also use a real optical drive, but my laptop doesn't usually have one of those. I used the OEM disc, version 4.10.1998 (that's the original Windows 98, not the Second Edition), which came with a long-dead PC, and didn't bother to apply any patches.
• A Windows 98 license key. Again, I used an OEM key from a past PC.
• A complete set of Discworld Noir (English) CD-ROMs as .iso files. I used the UK "Sold Out Software" budget release, on 3 CDs.
• A multi-platform Realtek AC97 audio driver.

## Windows 98 installation

It seems to be easiest to do this bit by running qemu-system-i386 manually:

qemu-img create -f qcow2 /srv/virt/discworldnoir.qcow2 30G
qemu-system-i386 -hda /srv/virt/discworldnoir.qcow2 \
-drive media=cdrom,format=raw,file=/srv/virt/windows98.iso \
-no-kvm -vga cirrus -m 256 -cpu pentium3 -localtime


Don't start the installation immediately. Instead, boot the installation CD to a DOS prompt with CD-ROM support. From here, run

fdisk


and create a single partition filling the emulated hard disk. When finished, hard-reboot the virtual machine (press Ctrl+C on the qemu-system-i386 process and run it again).

The DOS FORMAT.COM utility is on the Windows CD-ROM but not in the root directory or the default %PATH%, so you'll have to run:

d:\win98\format c:


to create the FAT filesystem. You might have to reboot again at this point.

The reason for doing this the hard way is that the Windows 98 installer doesn't detect qemu as supporting ACPI. You want ACPI support, so that Windows will issue IDLE instructions from its idle loop, instead of occupying a CPU core with a busy-loop. To get that, boot to a DOS prompt again, and use:

setup /p j /iv


/p j forces ACPI support (Thanks to "Richard S" on the Virtualbox forums for this tip.) /iv is unimportant, but it disables the annoying "billboards" during installation, which advertised once-exciting features like support for dial-up modems and JPEG wallpaper.

I used a "Typical" installation; there didn't seem to be much point in tweaking the installed package set when everything is so small by modern standards.

Windows 98 has built-in support for the Cirrus VGA card that we're emulating, so after a few reboots, it should be able to run in a semi-modern resolution and colour depth. Discworld Noir apparently prefers a 640 × 480 × 16-bit video mode, so right-click on the desktop background, choose Properties and set that up.

## Audio drivers

This is the part that took me the longest to get working. Of the sound cards that qemu can emulate, Windows 98 only supports the SoundBlaster 16 out of the box. Unfortunately, the Soundblaster 16 emulation in qemu is incomplete, and in particular version 2.1 (as shipped in Debian 8) has a tendency to make Windows lock up during boot.

I've seen advice in various places to emulate an Ensoniq ES1370 (SoundBlaster AWE 64), but that didn't work for me: one of the drivers I tried caused Windows to lock up at a black screen during boot, and the other didn't detect the emulated hardware.

The next-oldest sound card that qemu can emulate is a Realtek AC97, which was often found integrated into motherboards in the late 1990s. This one seems to work, with the "A400" driver bundle linked above. For Windows 98 first edition, you need a driver bundle that includes the old "VXD" drivers, not just the "WDM" drivers supported by Second Edition and newer.

The easiest way to get that into qemu seems to be to turn it into a CD image:

genisoimage -o /srv/virt/discworldnoir-drivers.iso WDM_A400.exe
qemu-system-i386 -hda /srv/virt/discworldnoir.qcow2 \
-drive media=cdrom,format=raw,file=/srv/virt/windows98.iso \
-drive media=cdrom,format=raw,file=/srv/virt/discworldnoir-drivers.iso \
-no-kvm -vga cirrus -m 256 -cpu pentium3 -localtime -soundhw ac97


Run the installer from E:, then reboot with the Windows 98 CD inserted, and Windows should install the driver.

## Installing Discworld Noir

Boot up the virtual machine with CD 1 in the emulated drive:

qemu-system-i386 -hda /srv/virt/discworldnoir.qcow2 \
-drive media=cdrom,format=raw,file=/srv/virt/DWN_ENG_1.iso \
-no-kvm -vga cirrus -m 256 -cpu pentium3 -localtime -soundhw ac97


You might be thinking "... why not insert all three CDs into D:, E: and F:?" but the installer expects subsequent disks to appear in the same drive where CD 1 was initially, so that won't work. Instead, when prompted for a new CD, switch to the qemu monitor with Ctrl+Alt+2 (note that this is 2, not F2). At the (qemu) prompt, use info block to see a list of emulated drives, then issue a command like

change ide0-cd1 /srv/virt/DWN_ENG_2.iso


to swap the CD. Then switch back to Windows' console with Ctrl+Alt+1 and continue installation. I used a Full installation of Discworld Noir.

## Transferring the virtual machine to GNOME Boxes

Having finished the "control freak" phase of installation, I wanted a slightly more user-friendly way to run this game, so I transferred the virtual machine to be used by libvirtd, which is the backend for both GNOME Boxes and virt-manager:

virsh create discworldnoir.xml


Here is the configuration I used. It's a mixture of automatic configuration from virt-manager, and hand-edited configuration to make it match the qemu-system-i386 command-line.

## Running the game

If all goes well, you should now see a discworldnoir virtual machine in GNOME Boxes, in which you can boot Windows 98 and play the game. Have fun!

 November 11, 2015

So yesterday, the 10th of November, was the official launch day of the Steam Machines. The hardware is meant to be a dedicated game machine for the living room, taking advantage of the Steam ecosystem to take on the Xbox One and PS4.

But for us in the Linux community these machines are more than that: they are an important part of helping us break into a broader market by paving the way for even more games, and more big-budget games, coming to our platform. Playing computer games is not just a niche, it is a mainstream activity these days, and not having access to games on our platform has cost us quite a few users and potential contributors over the years. I have for instance met a lot of computer science students who ended up not using Linux as their main operating system during their studies, simply due to the lack of games on the platform. Instead Linux got relegated to that thing in a VM, only run when you needed it for an assignment.

Steam for Linux and SteamOS can and will be important pieces of breaking through that. SteamOS and the Steam Machines are also important for the Linux community for another reason: they can help funnel more resources from hardware companies into Linux drivers and support. I know for instance that all three major GPU vendors have increased their Linux driver investments due to SteamOS.

So I want to congratulate Valve on the launch of the first Steam Machines, and I strongly recommend that everyone in the community get a Steam Machine for their home!

People who have had a good chance to test the hardware have recommended the Alienware SteamOS systems to me, so I am passing that recommendation onwards.

As a sidenote, we are also working on a few features in Fedora Workstation to make it a better host for Steam and Steam games. This includes our work on GL Dispatch and Optimus support, as covered in a previous blog post, and libratbag, our new library for handling gaming mice under Linux. And finally we are working on a few bug fixes in Fedora, related to C++ ABI issues, to make it an even better host for the Steam client.

 November 08, 2015

# systemd.conf 2015 is Over Now!

Last week our first systemd.conf conference took place at betahaus, in Berlin, Germany. With almost 100 attendees, a dense schedule of 23 high-quality talks stuffed into a single track on just two days, a productive hackfest and numerous consumed Club-Mates I believe it was quite a success!

If you couldn't attend the conference, you may watch all talks on our YouTube Channel. The slides are available online, too.

Many photos from the conference are available on the Google Events Page. Enjoy!

I'd specifically like to thank Daniel Mack, Chris Kühl and Nils Magnus for running the conference, and making sure that it worked out as smoothly as it did! Thank you very much, you did a fantastic job!

I'd also specifically like to thank the CCC Video Operation Center folks for the excellent video coverage of the conference. Not only did they implement a live-stream for the entire talks part of the conference, but also cut and uploaded videos of all talks to our YouTube Channel within the same day (in fact, within a few hours after the talks finished). That's quite an impressive feat!

The folks from LinuxTag e.V. put a lot of time and energy in the organization. It was great to see how well this all worked out! Excellent work!

(BTW, LinuxTag e.V. and the CCC Video Operation Center folks are willing to help with the organization of Free Software community events in Germany (and Europe?). Hence, if you need an entity that can do the financial work and other stuff for your Free Software project's conference, consider pinging LinuxTag, they might be willing to help. Similarly, if you are organizing such an event and are thinking about providing video coverage, consider pinging the CCC VOC folks! Both of them get our best recommendations!)

I'd also like to thank our conference sponsors! Specifically, we'd like to thank our Gold Sponsors Red Hat and CoreOS for their support. We'd also like to thank our Silver Sponsor Codethink, and our Bronze Sponsors Pengutronix, Pantheon, Collabora, Endocode, the Linux Foundation, Samsung and Travelping, as well as our Cooperation Partners LinuxTag and kinvolk.io, and our Media Partner Golem.de.

Last but not least I'd really like to thank our speakers and attendees for presenting and participating in the conference. Of course, we put the conference together specifically for you, and we really hope you had as much fun at it as we did!

Thank you all for attending, supporting, and organizing systemd.conf 2015! We are looking forward to seeing you and working with you again at systemd.conf 2016!

Thanks!

 November 06, 2015

Another major piece of engineering that I have covered that we did for Fedora Workstation 23 is the GTK3 port of LibreOffice. Those of you who follow Caolán McNamara's blog are probably aware of the details. The motivation for the port wasn't improved look-and-feel integration, as there were easier ways to achieve that, but to help LibreOffice deal well with a range of new technologies we are supporting in Fedora Workstation, namely: touch support, Wayland support and HiDPI.

That ongoing work is now available in Fedora Workstation 23 if you install the ‘libreoffice-gtk3’ package. You have to install this using a terminal and dnf, as this is an early-adopter technology, but we would love for as many of you as possible to try it and report any issues you have, either to the upstream LibreOffice bugzilla or to the Fedora bugzilla against the LibreOffice component. Testing of how it works under X and under Wayland is both more than welcome. Be aware that it is ‘tech preview’ technology, so you might want to remove the libreoffice-gtk3 package again if you find that it hinders your effective use of LibreOffice. For instance, there is a quite bad titlebar bug you would experience under Wayland that we hope to fix with an update.

If you specifically want to test out the touch support, there are two features implemented so far, both in Impress. One allows you to switch slides in Impress with a swipe gesture; the second is long-press, which brings up the Impress slide context menu so you can switch to e.g. drawing mode. We would love feedback on what gestures you would like to see supported in various LibreOffice applications, so don’t be shy about filing enhancement bug reports with your suggestions.

It has to be said that HiDPI wasn’t a primary focus of the porting effort, but we do expect the port to make further improvements to HiDPI support in LibreOffice easier. Another nice little bonus of the port is that the GTK Inspector can now be used with LibreOffice.

A big thanks to Caolán for this work.

Not that I'm really running after more gadgets, but sometimes, there is a need that could only be soothed through new hardware.

Bluetooth UE roll

Got this for my wife, to play music when staying out on the quays of the Rhône, playing music in the kitchen (from a phone or computer), or when she's at the photo lab.

It works well with iOS, Mac OS X and Linux. It's very easy to use: whether it's paired or connected is completely obvious, and charging doesn't need specific cables (USB!).

I'll need to borrow it to add battery reporting for those devices though. You can find a full review on Ars Technica.

Sugru (!)

Not a gadget per se, but I bought some, used it to fix up a bunch of cables, repair some knickknacks, and do some DIY. Highly recommended, especially given the current price of their starter packs.

It's apparently from Ckeyin, but you'll find the exact same box from other vendors. It made my old Gravis joystick work, in the hope that I can make it work with DOSBox and my 20-year-old copy of X-Wing vs. TIE Fighter.

Microsoft Surface ARC Mouse

That one was given to me, for testing, works well with Linux. Again, we'll need to do some work to report the battery. I only ever use it when travelling, as the batteries last for absolute ages.

Logitech K750 keyboard

Bought this nearly two years ago, and this is one of my best buys. My desk is close to a window, so it's wireless but I never need to change the batteries or think about charging it. GNOME also supports showing the battery status in the Power panel.

Got this one on sale (17€) to replace my Logitech trackball (one of its buttons broke...). It works great, and can even get you shell gestures when running Wayland. I'm certainly happy to have one less cable running across my desk, and it reuses the same dongle as the keyboard above.

If you use more than one device, you might be interested in this bug to make it easier to support multiple Logitech "Unifying" devices.

ClicLite charger

Got this from a design shop in Berlin. It should probably have been cheaper than what I paid for it, but it's certainly pretty useful. Charges up my phone by about 20%, it's small, and charges up at the same time as my keyboard (above).

Dell S2340T

Bought about 2 years ago, to replace the monitor I had in an all-in-one (Lenovo all-in-ones, never buy that junk).

Nowadays, the resolution would probably be considered a bit on the low side, and the touchscreen mesh would show for hardcore photography work. It's good enough for videos though and the speaker reaches my sitting position.

It's only been possible to use the USB cable for graphics for a couple of months, and it's probably not what you want if you care about CPU usage on your machine, but it works for Fedora with this RPM I made. Talk to me if you can help get it into RPMFusion.

Shame about the huge power brick, but a little bonus for the builtin Ethernet adapter.

Surface 3

This is probably the biggest ticket item. Again, I didn't pay full price for it, thanks to coupons, rewards, and all. The work to getting Linux and GNOME to play well with it is still ongoing, and rather slow.

I won't comment too much on Windows either, but rather on what the device should be like once Linux runs on it.

I really enjoy the industrial design, maybe even the slanted edges, but one has to wonder why they made the USB power adapter not sit flush with the edge when plugged in.

I've used it a couple of times (under Windows, sigh) to read Pocket as I do on my iPad 1 (yes, the first one), or stream videos to the TV using Flash, without the tablet getting hot, or too slow either. I also like the fact that there's a real USB(-A) port that's separate from the charging port. The micro SD card port is nicely placed under the kickstand, hard enough to reach to avoid it escaping the tablet when lugged around.

The keyboard, given the thickness of it, and the constraints of using it as a cover, is good enough for light use, when travelling for example, and the layout isn't as awful as on, say, a Thinkpad Carbon X1 2nd generation. The touchpad is a bit on the small side though it would have been hard to make it any bigger given the cover's dimensions.

I would however recommend getting a Surface Pro if you want things to work right now (or at least soon). The one-before-last version, the Surface Pro 3, is probably a good target.
 November 04, 2015

Finally, Gnocchi 1.3.0 is out. This is our final release, more or less matching the OpenStack six-month schedule, that concludes the Liberty development cycle.

This release was supposed to be out a few weeks earlier, but our integration tests got completely blocked for several days, just the week before the OpenStack Mitaka summit.

# New website

We built a new dedicated website for Gnocchi at gnocchi.xyz. We want to promote Gnocchi outside of the OpenStack bubble, as it is a useful timeseries database on its own that can work without the rest of the stack. We'll try to improve the documentation. If you're curious, feel free to check it out and report anything you miss!

# The speed bump

Obviously, if it had been a bug in Gnocchi that we hit, it would have been quick to fix. However, we found a nasty bug in Swift caused by the evil monkey-patching of Eventlet (once again) blended with a mixed usage of native threads and Eventlet threads in Swift. Shake all of that together, and you get yourself some pretty nasty race conditions when using the Keystone middleware authentication.

In the meantime, we disabled Swift multi-threading by using mod_wsgi instead of Eventlet in devstack.

# New features

So what's new in this new shiny release? A few interesting things:

• Metric deletion is now asynchronous. That's not the most used feature in the REST API – weirdly, people do not often delete metrics – but it's now way faster and more reliable by being asynchronous. Metricd is now in charge of cleaning things up.

• Speed improvement. We are now confident that Gnocchi is even faster than in the latest benchmarks I ran (around 1.5-2× faster), which makes Gnocchi really fast with its native storage back-ends. We profiled and optimized Carbonara and the REST API data validation.

• Improved metricd status reporting. It now reports the size of the backlog of the whole cluster, both in its log and via the REST API. Easy monitoring!

• Ceph driver enhancements. We had people testing the Ceph driver in production, so we made a few changes and fixes to make it more solid.

And that's all we did in the last couple of months. We have a lot of things on the roadmap that are pretty exciting, and I'll surely talk about them in the coming weeks.

 November 02, 2015

Last week I was in Tokyo, Japan for the OpenStack Summit, discussing the new Mitaka version that will be released in 6 months.

I attended the summit mainly to discuss and follow up on new developments in Ceilometer, Gnocchi, Aodh and Oslo. It was a pretty good week, and we were able to discuss and plan a few interesting things. Below is what I found remarkable during this summit concerning those projects.

# Distributed lock manager

I did not attend this session, but I need to write something about it.

See, when working in a distributed environment like OpenStack, it's almost obvious that sooner or later you end up needing a distributed lock mechanism. It started to be pretty obvious and a serious problem for us 2 years ago in Ceilometer. Back then, we proposed the service-sync blueprint and talked about it during the OpenStack Icehouse Design Summit in Hong Kong. The session at that time was a success, and in 20 minutes I convinced everyone that it was the right thing to do. The night following the session, we picked a name, Tooz, for this new library. It was the first time I met Joshua Harlow, who has since become one of the biggest Tooz contributors.

For the following months, we tried to get things moving in OpenStack. It was very hard to convince people that it was the solution to their problem. Most of the time, they did not seem to grasp the entirety of what was at stake.

This time, it seems that we managed to convince everyone that a DLM is indeed needed. Joshua wrote an extensive specification called Chronicle of a DLM, which ended up being discussed and somehow adopted during that session in Tokyo.

So yes, Tooz will be the weapon of choice for OpenStack. It avoids a hard requirement on any particular DLM solution. The best driver right now is the ZooKeeper one, but it'll still be possible for operators to use e.g. Redis.
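To give an idea of what this looks like for an application, here is a minimal sketch of acquiring a distributed lock with Tooz; the ZooKeeper URL and member id are placeholders, not a recommended deployment:

from tooz import coordination

# Connect to a coordination backend (placeholder URL and member id)
coordinator = coordination.get_coordinator(
    "zookeeper://127.0.0.1:2181", b"worker-1")
coordinator.start()

# Only one member across the whole cluster can hold this lock at a time
lock = coordinator.get_lock(b"my-critical-resource")
with lock:
    pass  # do the work that must not run concurrently

coordinator.stop()

Swapping the backend for Redis or another driver is then just a matter of changing the URL, which is exactly the point of not hard-requiring any single DLM.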

This is a great achievement for us, after spending years trying to fix features such as the Nova service group subsystem and seeing our proposals postponed forever.

(If you want to know more, LWN.net has a great article about that session.)

# Telemetry team name

With the new projects launched this last year, Aodh & Gnocchi, in parallel with the old Ceilometer, plus the change from programs to the Big Tent in OpenStack, the team is having an identity issue. Being referred to as the "Ceilometer team" is not really accurate, as some of us only work on Aodh or on Gnocchi. So after discussing it, I proposed to rename the team to Telemetry instead. We'll see how it goes.

# Alarms

The first session was about alarms and the Aodh project. It turns out that the project is in pretty good shape, but it probably needs some more love, which I hope I'll be able to provide in the next months.

The need for a new aodhclient based on the technologies we recently used building gnocchiclient has been reasserted, so we might end up working on that pretty soon. The Tempest support also needs some improvement, and we have a plan to enhance that.

# Data visualisation

We had David Lyle, the Project Technical Lead for Horizon, in this session. It was an interesting discussion. It used to be technically challenging to draw charts from the data Ceilometer collects, but it's now very easy with Gnocchi and its API.

While the technical side is resolved, the more political and user-experience side of what to draw and how was discussed at length. We don't want to make people think that Ceilometer and Gnocchi are a full monitoring solution, so there's some precaution to take. Other than that, it would be pretty cool to have a view of the data in Horizon.

# Rolling upgrades

It turns out that Ceilometer has an architecture that makes rolling upgrades easy. We just need to write proper documentation explaining how to do it and in which order the services should be upgraded.

# Ceilometer splitting

The split of the alarm feature of Ceilometer into its own project, Aodh, in the last cycle was a great success for the whole team. We want to split out other pieces of Ceilometer, as they make sense on their own and it makes them easier to manage. There are also some projects that want to use them without the whole stack, so it's a good idea to make that happen.

# CloudKitty & Gnocchi

I attended the 2 sessions that were allocated to CloudKitty. It was pretty interesting, as they want to simplify their architecture and leverage what Gnocchi provides. I presented my view of the project architecture and how they could leverage more of Gnocchi to retrieve and store data. They want to go in that direction, though it's a large amount of work and refactoring on their side, so it'll take time.

We also need to enhance the support for new resource extensions in Gnocchi, and that's something I hope I'll work on in the next months.

Overall, this summit was pretty good, and I got a tremendous amount of good feedback on Gnocchi. I again managed to gather enough ideas and tasks to tackle for the next 6 months. It will be really interesting to see where the whole team goes from here. Stay tuned!

 October 30, 2015
You might have heard of the C.H.I.P., the 9$ computer. After contributing to their Kickstarter, and with no intent on hacking on more kernel code than is absolutely necessary, I requested the "final" devices, when chumps like me can read loads of docs and get accessories for it easily.

Turns out that our old friend the Realtek 8723BS chip is the Wi-Fi/Bluetooth chip in the nano computer. NextThingCo got in touch, and sent me a couple of early devices (as they did to the "Kernel hacker" backers), with their plan being to upstream all the drivers and downstream hacks into the upstream kernel.

Before being able to hack on the kernel driver though, we'll need to get some software on it, and find a way to access it. The docs website has instructions on how to flash the device using Ubuntu, but we don't use that here.

You'll need a C.H.I.P., a jumper cable, and the USB cable you usually use for charging your phone/tablet/e-book reader.

First, let's install a few necessary packages:

dnf install -y sunxi-tools uboot-tools python3-pyserial moserial

You might need other things, like git and gcc, but I kind of expect you to already have those installed if you're software hacking. You will probably also need to get sunxi-tools from Koji to get a new enough version that will support the C.H.I.P.

Get your jumper cable out, and make the connection as per the NextThingCo docs. I've copied the photo from the docs to keep this guide stand-alone.

Let's install the tools, modified to work with Fedora's newer, more upstream version of the sunxi-tools.

$ git clone https://github.com/hadess/CHIP-tools.git
$ cd CHIP-tools
$ make
$ sudo ./chip-update-firmware.sh -d

If you've followed the instructions, you haven't plugged in the USB cable yet. Plug in the USB cable now, to the micro USB power supply on one end, and to your computer on the other. You should see the little "OK" after the "waiting for fel" message:

== upload the SPL to SRAM and execute it ==
waiting for fel........OK

At this point, you can unplug the jumper cable, something not mentioned in the original docs. If you don't do that, when the device reboots, it will reboot in flashing mode again, and we obviously don't want that.

At this point, you'll just need to wait a while. It will verify the installation when done, and turn off the device. Unplug, replug, and launch moserial as root. You should be able to access the C.H.I.P. through /dev/ttyACM0 with a baudrate of 115200. The root password is "chip".

Obligatory screenshot of our new computer:

Next step, testing out our cleaned up Realtek driver, Fedora on the C.H.I.P., and plenty more.

 October 28, 2015

So as we quickly approach the Fedora Workstation 23 release, I have been running Wayland exclusively for over a week now. Despite a few glitches, it now works well enough for me to not have to switch back to the X11 session anymore.

There are some major new features coming in Fedora Workstation 23 that I am quite excited about. First and foremost, it will be the first release shipping with our new firmware update system supported. This means that if your hardware supports it and your vendor uploads the needed firmware to LVFS, you can update your system firmware through GNOME Software. So no more struggling with proprietary tools or bootable DVDs. So while this is a major step forward in my opinion, it will also be one of those things ramping up slowly, as we need to bring more vendors onboard and also have them ship more UEFI 2.5-ready systems before the old ‘BIOS’ update problems are a thing of the past.
A big thanks goes to Richard Hughes and Peter Jones for their work here.

Another major feature that we spent a lot of time getting right is the new Google Drive backend for the file manager. This means that as long as you have internet access, you can manage your Google Drive files through Nautilus along with all your other local and remote file systems. I know a lot of Fedora users are either using Google Drive personally or as part of their work, so I think this is a major new feature we managed to land. A big thanks to Debarshi Ray for working on this item.

Thirdly, we now have support for ambient light sensors. This was a crucial step in our ongoing effort to improve battery life under Fedora, and it can have a very significant impact on how long your battery lasts. It is very easy to keep running the screen with more backlight than you actually need and thus drain your battery quickly, so with this enabled you might often squeeze some extra hours out of your battery. A big thanks to Bastien Nocera and Benjamin Tissoires for their work on this feature.

And finally, this is the first release where we are shipping our current xdg-app tech preview as part of Fedora, instead of just making it available in a COPR. For those of you who don't know, xdg-app is our new technology for packaging desktop applications. While still early stage, it provides a way for software developers to package their software in a way that is both usable across multiple distributions and more secure, through the use of LXC container technology. In fact, as we are trying to make this technology as usable and widely deployed as possible, Alexander Larsson is currently working with the Open Container Initiative to make xdg-apps OCI compliant. This is important for a multitude of reasons, but mainly because xdg-app fills an important gap in the container technology landscape: while Docker and Rocket are great for packaging server software, there is no real equivalent for the desktop. The few similar efforts that have been launched are usually tied to a specific distribution or operating system, while xdg-app aims to provide the same level of cross-system compatibility for desktop applications that Docker and Rocket offer for server applications.

Fedora Workstation 24

Of course, with Fedora Workstation 23 almost out the door, we have been reviewing our Fedora Workstation tasklist to make sure it reflects what we currently have developers working on and what we expect to be able to land in Fedora Workstation 24. And let me use this opportunity to remind community members that if you are working on a cool feature for Fedora Workstation 24, make sure to let the Workstation working group know on the Fedora Desktop mailing list, so that we can make sure your feature gets listed and tracked in the tasklist too.

Anyway, I sent out an email to the working group this week to outline what I see on the horizon in terms of major Fedora Workstation features lined up for 24. You can get the full details in the email, and the tasklist has also been updated with these items, but I want to go into a bit more detail on some of them.

xdg-app for world domination

As some of you might be aware, Christian Hergert, of GNOME Builder fame, recently joined our team. Christian will be doing a lot of cool stuff for us, but one thing he has already started on is working with Alexander Larsson to make sure we have a great developer story for xdg-app.
If we want developers to adopt this technology, we need to make it dead simple to create your own xdg-app packages, and Christian will make sure that GNOME Builder supports this in a way that makes transitioning your application into an xdg-app something you can do without needing to read a long howto. We hope to have the initial fruits of this labour ready for Fedora Workstation 24.

Another big part of this, of course, is the work that Richard Hughes and Kalev Lember are doing on GNOME Software to make sure we have the infrastructure in place to be able to install and upgrade xdg-apps. As we expect xdg-apps to come from a wide variety of sources, as opposed to the current model of most things being in a central repository, we need to develop good ways for new sources to be added and to help users make more informed choices about the software they are installing. Related to this, we are also looking at how we can improve labeling of the applications available, to make it easier to make your decisions based on a variety of criteria. The current star system in GNOME Software is not very obvious in what it tries to convey, so we will look at better ways to do this kind of labeling and what information we want to be able to provide through it.

Another major item, which I blogged about before, is our effort to finally make dealing with the binary graphics drivers less of a pain. To summarize: we want to make sure that if you need to install the binary drivers, this is a simple operation that doesn't conflict with your installation of Mesa, and also that if you have an Optimus-enabled laptop, it is easy and pleasant to use.

Anyway, there are some further items in the email I sent, but I will go into more detail about some of them at a later stage.

 October 18, 2015

# Second Round of systemd.conf 2015 Sponsors

We are happy to announce the second round of systemd.conf 2015 sponsors! In addition to those from the first announcement, we have:

Our second Gold sponsor is Red Hat!

What began as a better way to build software—openness, transparency, collaboration—soon shifted the balance of power in an entire industry. The revolution of choice continues. Today Red Hat® is the world's leading provider of open source solutions, using a community-powered approach to provide reliable and high-performing cloud, virtualization, storage, Linux®, and middleware technologies.

A Bronze sponsor is Samsung:

From the beginning we have established a very fast pace and are currently one of the biggest and fastest growing modern-technology R&D centers in East-Central Europe. We have started with designing subsystems for digital satellite television, however, we have quickly expanded the scope of our interest. Currently, it includes advanced systems of digital television, platform convergence, mobile systems, smart solutions, and enterprise solutions. A vital role in our activity is also played by the quality and certification center, which controls the conformity of Samsung Electronics products with the highest standards of quality and reliability.

A Bronze sponsor is Travelping:

Travelping is passionate about networks, communications and devices. We empower our customers to deploy and operate networks using our state of the art products, solutions and services. Our products and solutions are based on our industry proven physical and virtual appliance platforms.
These purpose built platforms ensure best in class performance, scalability and reliability, combined with consistent end to end management capabilities. To build these products, Travelping has developed its own embedded, cross-platform Linux distribution called CAROS.io, which incorporates the systemd service manager and tools.

A Bronze sponsor is Collabora:

Collabora has over 10 years of experience working with top tier OEMs & silicon manufacturers worldwide to develop products based on Open Source software. Through the use of Open Source technologies and methodologies, Collabora helps clients in multiple market segments gain faster time to market and save millions of dollars in licensing and maintenance costs. Collabora has already brought to market several products relying on systemd extensively.

A Bronze sponsor is Endocode:

Endocode AG. An employee-owned, software engineering company from Berlin. Open Source is our heart and soul.

A Bronze sponsor is the Linux Foundation:

The Linux Foundation advances the growth of Linux and offers its collaborative principles and practices to any endeavor.

We are cooperating with LinuxTag e.V. on the organization:

LinuxTag is Europe's leading organizer of Linux and Open Source events. Born of the community and in business for 20 years, we organize LinuxTag, an annual conference and exhibition attracting thousands of visitors. We also participate and cooperate in organizing workshops, tutorials, seminars, and other events together with and for the Open Source community. Selected events include non-profit workshops, the German Kernel Summit at FrOSCon, participation in the Open Tech Summit, and others. We take care of the organizational framework of systemd.conf 2015. LinuxTag e.V. is a non-profit organization and welcomes donations of ideas and workforce.

A Media Partner is Golem:

Golem.de is an up-to-date online publication intended for professional computer users. It provides technology insights into the IT and telecommunications industry. Golem.de offers profound and up-to-date information on significant and trending topics. Online and IT professionals, marketing managers, purchasers, and readers inspired by technology receive substantial information on product, market and branding potential through tests, interviews and market analysis.

We'd like to thank our sponsors for their support! Without sponsors our conference would not be possible!

The conference has been SOLD OUT for a few weeks. We no longer accept registrations, nor paper submissions.

For further details about systemd.conf consult the conference website.

See you in Berlin!

 October 14, 2015

In the old days, there were not a lot of users of open-source 3D graphics drivers. Aside from a few id Software games, tuxracer, and glxgears, there were basically no applications. As a result, we got almost no bug reports. It was a blessing and a curse.

Nowadays, thanks to Valve Software and Steam, there are more applications than you can shake a stick at. As a result, we get a lot of bug reports. It is a blessing and a curse.

There's quite a wide chasm between a good bug report and a bad bug report. Since we have a lot to do, the quality of the bug report can make a big difference in how quickly the bug gets fixed. A poorly written bug report may get ignored. A poorly written bug report may require several rounds of asking for more information or clarifications. A poorly written bug report may send a developer looking for the problem in the wrong place.
As someone who fields a lot of these reports for Mesa, here's what I look for...

1. Submit against the right product and component. The component will determine the person (or mailing list) that initially receives notification about the bug. If you have a bug in the Radeon driver and only people working on Nouveau get the bug e-mail, your bug probably won't get much attention.

If you have experienced a bug in 3D rendering, you want to submit your bug against Mesa.

Selecting the correct component can be tricky. It doesn't have to be 100% correct initially, but it should be close enough to get the attention of the right people. A good first approximation is to pick the component that matches the hardware you are running on. That means one of the following:

If you're not sure what GPU is in your system, the command glxinfo | grep "OpenGL renderer string" will help.

2. Correctly describe the software on your system. There is a field in bugzilla to select the version where the bug was encountered. This information is useful, but it is more important to get specific, detailed information into the bug report.

• Provide the output of glxinfo | grep "OpenGL version string". There may be two lines of output. That's fine, and you should include both.
• If you are using the drivers supplied by your distro, include the version of the distro package. On Fedora, this can be determined by rpm -q mesa-dri-drivers. Other distros use other commands. I don't know what they are, but Googling for "how do i determine the version of an installed package on " should help.
• If you built the driver from source using a Mesa release tarball, include the full name of that tarball.
• If you built the driver from source using the Mesa GIT repository, include the SHA of the commit used.
• Include the full version of the kernel using uname -r. This isn't always important, but it is sometimes helpful.

3. Correctly describe the hardware in your system. There are a lot of variations of each GPU out there. Knowing exactly which variation exhibited bad behavior can help find the problem.

• Provide the output of glxinfo | grep "OpenGL renderer string". As mentioned above, this is the hardware that Mesa thinks is in the system.
• Provide the output of lspci -d ::0300 -nn. This provides the raw PCI information about the GPU. In some cases this may be more detailed (or just plain different) than the output from glxinfo.
• Provide the output of grep "model name" /proc/cpuinfo | head -1. It is sometimes useful to know the CPU in the system, and this provides information about the CPU model.

4. Thoroughly describe the problem you observed. There are two goals. First, you want the person reading the bug report to know both what did happen and what you expected to happen. Second, you want the person reading the bug report to know how to reproduce the problem you observed. The person fielding the bug report may not have access to the application.

• If the application was run from the command line, provide the full command line used to run the application. If the application is a test that was run from piglit, provide the command line used to run the test, not the command line used to run piglit. This information is in the piglit results.json result file.
• If the application is a game or other complex application, provide any necessary information needed to reproduce the problem. Did it happen at a specific location in a level? Did it happen while performing a specific operation? Does it only happen when loading certain data files?
• Provide output from the application. If the application is a test case, provide the text output from the test. If this output is, say, less than 5 lines, include it in the report. Otherwise provide it as an attachment. Output lines that just say "Fail" are not useful; it is usually possible to run tests in verbose mode that will provide helpful output.
• If the application is a game or other complex application, try to provide a screenshot. If it is possible, providing a "good" and "bad" screenshot is even more useful. Bug #91617 is a good example of this.
• If the bug was a GPU or system hang, provide output from dmesg as an attachment to the bug. If the bug is not a GPU or system hang, this output is unlikely to be useful.

5. Add annotations to the bug summary. Annotations are parts of the bug summary at the beginning, enclosed in square brackets. The two most important ones are regression and bisected (see the next item for more information about bisected). If the bug being reported is a new failure after updating system software, it is (most likely) a regression. Adding [regression] to the beginning of the bug summary will help get it noticed! In this case you also must, at the very least, tell us what software version worked... with all the same details mentioned above.

The other annotations used by the Intel team are short names for hardware versions. Unless you're part of Intel's driver development team or QA team, you don't really need to worry about these. The ones currently in use are:

6. Bisect regressions. If you have encountered a regression and you are building a driver from source, please provide the results of git-bisect. We really just need the last part: the guilty commit. When this information is available, add bisected to the tag. This should look like [regression bisected]. Note: If you're on a QA team somewhere, supplying the bisect is a practical requirement.

7. Respond to the bug. There may be requests for more information. There may be requests to test patches. Either way, developers working on the bug need you to be responsive. I have closed a lot of bugs over the years because the original reporter didn't respond to a question for six months or more. Sometimes this is our fault because the question was asked months (or even years) after the bug was submitted.

The general practice is to change the bug status to NEEDINFO when information or testing is requested. After providing the requested information, please change the status back to either NEW or ASSIGNED or whatever the previous state was.

It is also general practice to leave bugs in the NEW state until a developer actually starts working on it. When someone starts work, they will change the "Assigned To" field and change the status to ASSIGNED. This is how we prevent multiple people from duplicating effort by working on the same bug.

If the bug was encountered in a released version and the bug persists after updating to a new major version (e.g., from 10.6.2 to 11.0.1), it is probably worth adding a comment to the bug: "Problem still occurs with ." That way we know you're alive and still care about the problem.

Here is a sample script that collects a lot of the information requested above. It does not collect information about the driver version other than the glxinfo output. Since I don't know whether you're using the distro driver or building one from source, there really isn't any way that my script could do that. You'll want to customize it for your own needs.
echo "Software versions:" echo -n " " uname -r echo -n " " glxinfo 2>&1 | grep "OpenGL version string" echo echo "GPU hardware:" echo -n " " glxinfo 2>&1 | grep "OpenGL renderer string" echo -n " " lspci -d ::0300 -nn echo echo "CPU hardware:" echo -n " " uname -m echo -n " " grep "model name" /proc/cpuinfo | head -1 | sed 's/^[^:]*: //' echo  This generates nicely formatted output that you can copy-and-paste directly into a bug report. Here is the output from my system: Software versions: 4.0.4-202.fc21.x86_64 OpenGL version string: 3.0 Mesa 11.1.0-devel GPU hardware: OpenGL renderer string: Mesa DRI Intel(R) HD Graphics 5500 (Broadwell GT2) 00:02.0 VGA compatible controller [0300]: Intel Corporation Broadwell-U Integrated Graphics [8086:1616] (rev 09) CPU hardware: x86_64 Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz   October 13, 2015 We got pretty good feedback on Gnocchi so far, even if we only had little. Recently, in order to have a better feeling of where we were at, we wanted to know how fast (or slow) Gnocchi was. The early benchmarks that some of the Mirantis engineers ran last year showed pretty good signs. But a year later, it was time to get real numbers and have a good understanding of Gnocchi capacity. ## Benchmark tools The first thing I realized when starting that process, is that we were lacking of tools to run benchmarks. Therefore I started to write some benchmark tools in python-gnocchiclient, which provides a command line tool to interrogate Gnocchi. I added a few basic commands to measure metric performance, such as: $ gnocchi benchmark metric create -w 48 -n 10000 -a low+----------------------+------------------+| Field                | Value            |+----------------------+------------------+| client workers       | 48               || create executed      | 10000            || create failures      | 0                || create failures rate | 0.00 %           || create runtime       | 8.80 seconds     || create speed         | 1136.96 create/s || delete executed      | 10000            || delete failures      | 0                || delete failures rate | 0.00 %           || delete runtime       | 39.56 seconds    || delete speed         | 252.75 delete/s  |+----------------------+------------------+

The command line tool supports the --verbose switch to get a detailed progress report on the benchmark's progression. So far it supports metric operations only, but that's the most interesting part of Gnocchi.

## Spinning up some hardware

I got a couple of bare metal servers to test Gnocchi on. I dedicated the first one to Gnocchi, and used the second one as the benchmark client, plugged on the same network. Each server is made of 2×Intel Xeon E5-2609 v3 (12 cores in total) and 32 GB of RAM. That provides a lot of CPU to handle requests in parallel.

Then I simply performed a basic RHEL 7 installation and ran devstack to spin up an installation of Gnocchi based on the master branch, disabling all of the other OpenStack components. I then tweaked the Apache httpd configuration to use the worker MPM and increased the maximum number of clients that can send requests simultaneously.

I configured Gnocchi to use the PostgreSQL indexer, as it's the recommended one, and the file storage driver, based on Carbonara (Gnocchi's own storage engine). That means files were stored locally rather than in Ceph or Swift. Using the file driver is less scalable (you have to run on only one node or use a technology like NFS to share the files), but it was good enough for this benchmark, to get some numbers and to profile the beast.

The OpenStack Keystone authentication middleware was not enabled in this setup, as it would add some delay validating the authentication token.

## Metric CRUD operations

Metric creation is pretty fast: I managed to reach 1500 metrics created per second pretty easily. Deletion is now asynchronous, which means it's faster than in Gnocchi 1.2, but it's still slower than creation: 300 metrics can be deleted per second. That does not sound like a huge issue since metric deletion is barely used in production anyway.

Retrieving metric information is also pretty fast and goes up to 800 metrics/s. It'd be easy to achieve much higher throughput for this one, as it'd be easy to cache, but we didn't feel the need to implement that so far.

Another important thing is that all of these numbers are constant and barely depend on the number of metrics already managed by Gnocchi.

| Operation     | Details                                 | Rate          |
|---------------|-----------------------------------------|---------------|
| Create metric | Created 100k metrics in 77 seconds      | 1300 metric/s |
| Delete metric | Deleted 100k metrics in 190 seconds     | 524 metric/s  |
| Show metric   | Show a metric 100k times in 149 seconds | 670 metric/s  |

## Sending and getting measures

Pushing measures into metrics is one of the hottest topics. Starting with Gnocchi 1.1, pushed measures are treated asynchronously, which makes it much faster to push new measures. Getting new numbers on that feature was pretty interesting.

The number of measures per second you can push depends on the batch size, meaning the number of actual measurements you send per call. The naive approach is to push 1 measure per call, and in that case, Gnocchi is able to handle around 600 measures/s. With a batch containing 100 measures, the number of calls per second goes down to 450, but since you push 100 measures each time, that means 45k measures per second pushed into Gnocchi!
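For readers who haven't used the API, here is a minimal sketch of what such a batched push looks like against the Gnocchi v1 REST measures endpoint. The host name and metric UUID are hypothetical, the exact URL may differ in your deployment, and authentication headers are omitted since the benchmark setup above runs without Keystone.

import datetime
import random

import requests  # any HTTP client works; requests is just an example

GNOCCHI = "http://gnocchi.example.com:8041"      # hypothetical endpoint
METRIC = "5e3fcbe2-7aab-475d-b42c-a440aa42e7ad"  # hypothetical metric UUID

# Build a batch of 100 measures; pushing many measures per call is what
# drives the throughput numbers discussed above.
now = datetime.datetime.utcnow()
batch = [
    {"timestamp": (now + datetime.timedelta(seconds=i)).isoformat(),
     "value": random.random()}
    for i in range(100)
]

resp = requests.post("%s/v1/metric/%s/measures" % (GNOCCHI, METRIC), json=batch)
resp.raise_for_status()  # Gnocchi accepts the batch and processes it asynchronously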

I've pushed the test further, inspired by the recent blog post from InfluxDB claiming to achieve 300k points per second with their new engine. I ran the same benchmark on the hardware I had, which is roughly half the size of the one they used. I managed to push Gnocchi to a little more than 120k measures per second. If I had the same hardware as they used, interpolating the results suggests almost 250k measures/s pushed. Obviously, you can't strictly compare Gnocchi and InfluxDB since they are not doing exactly the same thing, but it still looks way better than what I expected.

Even with smaller batch sizes of 1k or 2k, the throughput stays close to that level, at around 113–121k measures/s.

| Operation       | Details                                                    | Rate            |
|-----------------|------------------------------------------------------------|-----------------|
| Push metric 5k  | Push 5M measures with batch of 5k measures in 40 seconds   | 122k measures/s |
| Push metric 4k  | Push 5M measures with batch of 4k measures in 40 seconds   | 125k measures/s |
| Push metric 3k  | Push 5M measures with batch of 3k measures in 40 seconds   | 123k measures/s |
| Push metric 2k  | Push 5M measures with batch of 2k measures in 41 seconds   | 121k measures/s |
| Push metric 1k  | Push 5M measures with batch of 1k measures in 44 seconds   | 113k measures/s |
| Push metric 500 | Push 5M measures with batch of 500 measures in 51 seconds  | 98k measures/s  |
| Push metric 100 | Push 5M measures with batch of 100 measures in 112 seconds | 45k measures/s  |
| Push metric 10  | Push 5M measures with batch of 10 measures in 852 seconds  | 6k measures/s   |
| Push metric 1   | Push 500k measures with batch of 1 measure in 800 seconds  | 624 measures/s  |
| Get measures    | Get 43k measures of 1 metric                               | 260k measures/s |

What about getting measures? Well, it's actually pretty fast too. Retrieving a metric with 1 month of data at a 1-minute interval (that's 43k points) takes less than 2 seconds.

Though it's slower than I expected. The reason seems to be that the JSON payload is 2 MB and encoding it takes Python a lot of time. I'll investigate that. Another point I discovered is that, by default, Gnocchi returns all the datapoints for every granularity available for the requested period, which might double the size of the returned data for nothing if you don't need it. It'll be easy to add an option to the API to only retrieve what you need though!
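To get a rough feel for where that time goes, here is a minimal, self-contained sketch (synthetic data, not the actual Gnocchi payload) that times encoding roughly one month of 1-minute points with the standard library json module:

import json
import time
from datetime import datetime, timedelta

# Build ~43k synthetic (timestamp, value) points, one per minute for a month.
start = datetime(2015, 10, 1)
points = [
    [(start + timedelta(minutes=i)).isoformat(), float(i % 100)]
    for i in range(43200)
]

t0 = time.time()
payload = json.dumps(points)
print("encoded %d points (%.1f MB) in %.3f seconds"
      % (len(points), len(payload) / 1e6, time.time() - t0))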

Once benchmarked, that meant I was able to retrieve 6 metrics per second, which translates to around 260k measures/s.

## Metricd speed

New measures pushed into Gnocchi are processed asynchronously by the gnocchi-metricd daemon. When doing the benchmarks above, I ran into a very interesting issue: sending 10k measures on a metric would make gnocchi-metricd use up to 2 GB of RAM and 120% CPU for more than 10 minutes.

After further investigation, I found that the naive approach we used to resample datapoints in Carbonara using Pandas was the cause. I reported a bug on Pandas and the upstream author was kind enough to provide a nice workaround, which I sent as a pull request to the Pandas documentation.

I wrote a fix for Gnocchi based on that, and started using it. Computing the standard aggregation methods set (std, count, 95pct, min, max, sum, median, mean) for 10k batches of 1 measure (worst case scenario) for one metric with 10k measures now takes only 20 seconds and uses 100 MB of RAM – 45× faster. That means that in normal operations, where only a few new measures are processed, the operation of updating a metric only takes a few milliseconds. Awesome!
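To illustrate the kind of operation involved, here is a minimal Pandas sketch (synthetic data and granularity chosen for the example, not Gnocchi's actual Carbonara code) that resamples a timeseries and computes the aggregation set listed above:

import numpy as np
import pandas as pd

# Synthetic stand-in for a timeseries: 10k points at 1-second resolution.
index = pd.date_range("2015-10-01", periods=10000, freq="s")
series = pd.Series(np.random.random(10000), index=index)

# Resample to 60-second granularity and compute the standard aggregates.
resampled = series.resample("60s")
aggregates = {
    "mean": resampled.mean(),
    "median": resampled.median(),
    "std": resampled.std(),
    "sum": resampled.sum(),
    "min": resampled.min(),
    "max": resampled.max(),
    "count": resampled.count(),
    "95pct": resampled.quantile(0.95),
}
for name, ts in aggregates.items():
    print("%s: %d aggregated points" % (name, len(ts)))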

## Comparison with Ceilometer

For comparison's sake, I quickly ran some read operation benchmarks on Ceilometer. I fed it with one month of samples for 100 instances polled every minute. That represents roughly 4.3M samples injected, and that took a while – almost 1 hour, whereas it would have taken less than a minute in Gnocchi. Then I tried to retrieve some statistics in the same way that we provide them in Gnocchi, which means aggregating them over a period of 60 seconds over a month.

Operation Details Rate

Obviously, Ceilometer is very slow. It has to look at 4M samples to compute and return the result, which takes a lot of time, whereas Gnocchi just has to fetch a file and pass it over. That also means that the more samples you have (so the longer you collect data and the more resources you have), the slower Ceilometer becomes. This is not a problem with Gnocchi, as I emphasized when I started designing it.

Most Gnocchi operations are O(log R) where R is the number of metrics or resources, whereas most Ceilometer operations are O(log S) where S is the number of samples (measures). Since R is millions of times smaller than S, Gnocchi gets to be much faster.

What's even more interesting is that Gnocchi is entirely scalable horizontally. Adding more Gnocchi servers (for the API and its background processing worker, metricd) will multiply Gnocchi's performance by the number of servers added.

## Improvements

There are several things to improve in Gnocchi, such as splitting Carbonara archives to make them more efficient, especially for drivers such as Ceph and Swift. It's already on my plate, and I'm looking forward to working on that!

And if you have any questions, feel free to shoot them in the comment section. 😉

 October 05, 2015

Last week, I was invited to the OpenStack Paris meetup #16, whose subject was metrics in OpenStack. The last time I spoke at this meetup was back in 2012, during the OpenStack Paris meetup #2. A very long time ago!

I talked for half an hour about Gnocchi, the OpenStack project I've been running for 18 months now. I started by explaining the story behind the project and why we needed to build it. Ceilometer has an interesting history and has had a curious roadmap these last years, and I summarized that briefly. Then I talked about how Gnocchi works and what it offers to users and operators. The slides were full of JSON, but I imagine they offered an interesting view of what the API looks like and how easy it is to operate. This also allowed me to emphasize how many use cases are actually covered and solved, in contrast to what Ceilometer has done so far. The talk was well received and I got a few interesting questions at the end.

The video of the talk (in French) and my slides are available on my talk page and below. I hope you'll enjoy it.

 September 25, 2015

Lately I have been working on the ARB_shader_storage_buffer_object extension implementation in Mesa, along with Iago (who has already explained how we implemented it).

One of the features we implemented was adding support for buffer variables to the queries defined in ARB_program_interface_query. Those queries ended up being very useful to check another feature introduced by ARB_shader_storage_buffer_object: the new storage layout called std430.

I have developed basic tests for piglit, but Jordan Justen and Ilia Mirkin mentioned that Ian Romanick had worked on comprehensive testing for uniform buffer objects, presented at XDC 2014.

Ian developed stochastic search-based testing for uniform block layouts which generates shader_runner tests using a Mako template that defines a series of uniform blocks with different layouts. Finally, shader_runner verifies some information through glGetActiveUniforms*() queries, such as the variable type, its offset inside the block, its array size (if any), whether it is row major, etc., using the "active uniform" command.

That was the kind of testing that I was looking for, so I cloned his repository and developed a modified version for shader storage blocks and std430.

During that process, I found that shader_runner was lacking support for the queries defined in the ARB_program_interface_query extension. Yesterday, the patches were pushed to the master branch of the piglit repository.

If you want to use this new command, the format is the following:

verify program_interface_query GL_INTERFACE_TYPE_ENUM var_name GL_PROPS_ENUM integer

or, if we include the GL type enum:

verify program_interface_query GL_INTERFACE_TYPE_ENUM var_name GL_PROPS_ENUM GL_TYPE_ENUM

I have written an example to show how to use this command in a shader_runner test. This is just a snippet showing the usage of the new command:

[test]

verify program_interface_query GL_BUFFER_VARIABLE SSBO1.s5_1[0].s1_4[1].s1_3.fv3[0] GL_TYPE GL_FLOAT_VEC4
verify program_interface_query GL_BUFFER_VARIABLE SSBO1.s5_1[0].s1_4[1].s1_3.fv3[0] GL_TOP_LEVEL_ARRAY_SIZE 4
verify program_interface_query GL_BUFFER_VARIABLE SSBO1.s5_1[0].s1_4[1].s1_3.fv3[0] GL_OFFSET 128
verify program_interface_query GL_BUFFER_VARIABLE SSBO1.s5_1[0].s1_4[0].s1_3.fv3[0] GL_OFFSET 48
verify program_interface_query GL_BUFFER_VARIABLE SSBO1.s5_1[0].s1_4[0].m34_1 GL_TYPE GL_FLOAT_MAT3x4
verify program_interface_query GL_BUFFER_VARIABLE SSBO1.s5_1[0].s1_4[1].m34_1 GL_OFFSET 80
verify program_interface_query GL_BUFFER_VARIABLE SSBO1.s5_1[0].s1_4[1].m34_1 GL_ARRAY_STRIDE 0
verify program_interface_query GL_BUFFER_VARIABLE SSBO1.m43_1 GL_IS_ROW_MAJOR 1
verify program_interface_query GL_PROGRAM_INPUT piglit_vertex GL_TYPE GL_FLOAT_VEC4
verify program_interface_query GL_PROGRAM_OUTPUT piglit_fragcolor GL_TYPE GL_FLOAT_VEC4

I hope you find them useful for your tests!

I've wanted a stand-alone radio in my office for a long time. I've been using a small portable radio, but it ate batteries quickly (probably a 4-pack of AA for a bit less than a work week's worth of listening), changing stations was cumbersome (hello FM dials) and the speaker was a bit teeny.

A couple of years back, I had a Raspberry Pi-based computer on pre-order (the Kano, highly recommended for kids, and beginners) through a crowd-funding site. So I scoured « brocantes » (imagine a mix of car boot sale and antiques fair, in France, with people emptying their attics) in search of a shell for my small computer. A whole lot of nothing until my wife came back from a week-end at a friend's with this:

A Philips Octode Super 522A, from 1934, when SKUs were as superlative-laden and impenetrable as they are today.

Let's DIY

I started by removing the internal parts of the radio, without actually turning it on. When you get such old electronics, they need to be checked thoroughly before being plugged in, and as I know nothing about tube radios, I preferred not to. And FM didn't exist when this came out, so I'm not sure what I would have been able to do with it anyway.

Roomy, and dirty. The original speaker was removed, the front buttons didn't have anything holding them any more, and the nice backlit screen went away as well.

To replace the speaker, I went through quite a lot of research, looking for speakers that were meant to be embedded, rather than getting a speaker in a box that I would need to extricate from its container. Visaton make speakers that can be integrated into ceilings, vehicles, etc. That also allowed me to choose one that had a good enough range, and would fit into the one hole in my case.

To replace the screen, I settled on an OLED screen that I knew would work without too much work with the Raspberry Pi, a small AdaFruit SSD1306 one. Small amount of soldering that was up to my level of skills.

It worked, it worked!

Hey, soldering is easy. So because of the size of the speaker I selected, and the output power of the RPi, I needed an amp. The Velleman MK190 kit was cheap (€10), and should just be able to work with the 5V USB power supply I planned to use. Except that the schematics are really not good enough for an electronics beginner. I spent a couple of afternoons verifying, checking the Internet for alternate instructions, and re-doing the solder points, to no avail.

'Sup Tiga!

So, after much wasted time, I got a cheap car amp with a power supply. You can probably find cheaper.

Finally, I got another Raspberry Pi, and SD card, so that the Kano, with its super wireless keyboard, could find a better home (it went to my godson, who seemed to enjoy the early game of Pong, and being a wizard).

Putting it all together

We'll need to hold everything together. I got a bit of help from somebody with a Dremel tool for the piece of wood that will hold the speaker, and another one that will stick three stove bolts out of the front, to hold the original tuning, mode and volume buttons.

A real joiner

I fast-forwarded the machine by a couple of years with a « Philips » figure-of-8 plug at the back, so the machine's electrics would be well separated from the outside.

Screws into the side panel for the amp, blu-tack to hold the OLED screen for now, RPi on a few leftover bits of wood.

Software

My first attempt at getting something that I could control on this small computer was lcdgrilo. Unfortunately, I would have had to write a Web UI for it (remember, my buttons are just stuck on, for now at least), and probably port the SSD1306 OLED screen's driver from Python, so not a good fit.

There's no proper Fedora support for Raspberry Pis, and while one can use a nearly stock Debian with a few additional firmware files on Raspberry Pis, Fedora chose not to support that slightly older SoC at all, which is obviously disappointing for somebody working on Fedora as a day job.

Looking for other radio retrofits (and there are plenty of quality ones on the Internet) and for various connected speaker backends, I found PiMusicBox. It's a Debian variant with Mopidy built in, and a very easy to use initial setup: edit a settings file on the SD card image, boot and access the interface via a browser. Tada!

Once I had tested playback, I lowered the amp's volume to nearly zero, raised the web UI's volume to the maximum, and raised the amp's volume to the maximum bearable for the speaker. As I won't be able to access the amp's dial, we'll have this software only solution.

Wrapping up

I probably spent a longer time looking for software and hardware than actually making my connected radio, but it was an enjoyable couple of afternoons of work, and the software side isn't quite finished.

First, in terms of hardware support, I'll need to make this OLED screen work, how lazy of me. The audio setup is currently just the right speaker, as I'd like both the radios and AirPlay streams to be downmixed.

Secondly, Mopidy supports plugins to extend its sources and uses GStreamer, so it would be a good fit for Grilo, making it easier for Mopidy users to extend it through Lua.

Do note that the Raspberry Pi I used is a B+ model. For B models, it's recommended to use a separate DAC because of the bad audio quality, even if the B+ isn't that much better. Testing out the HDMI output with an HDMI to VGA+jack adapter might be a way to cut costs as well.

Possible improvements could include making the front-facing dials work (that's going to be a tough one), or adding RFID support, so I can wave items in front of it to turn it off, or play a particular radio.

In all, this radio cost me:
- 10 € for the radio case itself
- 36.50 € for the Raspberry Pi and SD card (I already had spare power supplies, and supported Wi-Fi dongle)
- 26.50 € for the OLED screen plus various cables
- 20 € for the speaker
- 18 € for the amp
- 21 € for various cables, bolts, planks of wood, etc.

I might also count the 14 € for the soldering iron, the 10 € for the Velleman amp, and about 10 € for adapters, cables, and supplies I didn't end up using.

So between 130 and 150 €, and a number of afternoons, but at the end, a very flexible piece of hardware that didn't really stretch my miniaturisation skills, and a completely unique piece of furniture.

In the future, I plan on playing with making my own 3-button keyboard, and making a remote speaker to plug in the living room's 5.1 amp with a C.H.I.P computer.

Happy hacking!
 September 24, 2015
Introduction If you were waiting for a new post to learn something new about Intel GEN graphics, I am sorry. I started moving away from i915 kernel development about one year ago. Although I had landed patches in mesa before, it has taken me some time to get to a point where I had […]
 September 23, 2015
As I'm known to do, a focus on the little things I worked on during the just released GNOME 3.18 development cycle.

Hardware support

The accelerometer support in GNOME now uses iio-sensor-proxy. This daemon also now supports ambient light sensors, which Richard used to implement the automatic brightness adjustment, and compasses, which are used in GeoClue and gnome-maps.

In kernel-land, I've fixed the detection of some Bosch accelerometers, and added support for another Kionix one, as used in some tablets.

I've also added quirks for out-of-the-box touchscreen support on some cheaper tablets using the goodix driver, and started reviewing a number of patches for that same touchscreen.

With Larry Finger, of Realtek kernel drivers fame, we've carried on cleaning up the Realtek 8723BS driver used in the majority of Windows-compatible tablets, in the Endless computer, and even in the $9 C.H.I.P. Linux computer.

Bluetooth UI changes

The Bluetooth panel now has better « empty states », explaining how to get Bluetooth working again when a hardware killswitch is used, or it's been turned off by hand. We've also made receiving files through OBEX Push easier, and built it into the Bluetooth panel, so that you won't forget to turn it off when done, and won't have trouble finding it, as is the case for settings that aren't used often.

Videos

GNOME Videos has seen some work, mostly in the stabilisation and bug fixing department; most of those fixes were also landed in the 3.16 version.

We've also been laying the groundwork in grilo for writing ever less code in C for plugin sources. Grilo Lua plugins can now use gnome-online-accounts to access keys for specific accounts, which we've used to re-implement the Pocket videos plugin, as well as the Last.fm cover art plugin. All those changes should allow implementing OwnCloud support in gnome-music in GNOME 3.20.

My favourite GNOME 3.18 features

You can call them features or bug fixes, but the overall improvements in the Wayland and touchpad/touchscreen support are pretty exciting. Do try it out when you get a GNOME 3.18 installation, and file bugs, it's coming soon!

Talking of bug fixes, this one means that I don't need to put in my password by hand when I want to access work related resources. Connect to the VPN, and I'm authenticated to Kerberos.

I've also got a particular attachment to the GeoClue GPS support through phones. This allows us to have more accurate geolocation support than any desktop environment around.

A few for later

The LibreOfficeKit support that will be coming to gnome-documents will help us get support for EPubs in gnome-books, as it will make it easier to plug in previewers other than the Evince widget.

Victor Toso has also been working through my Grilo bugs to allow us to implement a preview page when opening videos. Work has already started on that, so fingers crossed for GNOME 3.20!

 September 22, 2015

# Only 14 tickets still available!

systemd.conf 2015 is close to being sold out, there are only 14 tickets left now. If you haven't bought your ticket yet, now is the time to do it, because otherwise it will be too late and all tickets will be gone!

Why attend? At this conference you'll get to meet everybody who is involved with the systemd project and learn what they are working on, and where the project will go next. You'll hear from major users and projects working with systemd. It's the primary forum where you can make yourself heard and get first-hand access to everybody who's working on the future of the core Linux userspace!

To get an idea about the schedule, please consult our preliminary schedule.

In order to register for the conference, please visit the registration page.

We are still looking for sponsors. If you'd like to join the ranks of systemd.conf 2015 sponsors, please have a look at our Becoming a Sponsor page!

For further details about systemd.conf consult the conference website.

 September 18, 2015

I gave a talk at XDC 2015 about atomic modesetting with a focus on driver writers. Most of the talk is an overview of how an atomic modeset looks and how to implement the different parts in a driver backend. Anyway, for all those who missed it, there's a video and slides.
 September 17, 2015

A few days ago, the French equivalent of Hacker News, called "Le Journal du Hacker", interviewed me about my work on OpenStack, my job at Red Hat and my self-published book The Hacker's Guide to Python. I've spent some time translating it into English so you can read it if you don't understand French! I hope you'll enjoy it.

Hi Julien, and thanks for participating in this interview for the Journal du Hacker. For our readers who don't know you, can you introduce yourself briefly?

You're welcome! My name is Julien, I'm 31 years old, and I live in Paris. I have now been developing free software for around fifteen years. I had the pleasure to work (among other things) on Debian, Emacs and awesome these last years, and more recently on OpenStack. For a few months now, I have been working at Red Hat as a Principal Software Engineer on OpenStack. I am in charge of doing upstream development for that cloud-computing platform, mainly around the Ceilometer, Aodh and Gnocchi projects.

Being a system architect myself, I have been following your work on OpenStack for a while. It's uncommon to have the point of view of someone as involved as you are. Can you give us a summary of the state of the project, and then detail your activities in this project?

The OpenStack project has grown and changed a lot since I started 4 years ago. It started as a few projects providing the basics, like Nova (compute), Swift (object storage), Cinder (volume), Keystone (identity) or even Neutron (network), which are the basis for a cloud-computing platform, and finally became composed of a lot more projects. For a while, the inclusion of projects was the subject of a strict review from the technical committee. But in the last few months, the rules have been relaxed, and we see a lot more projects connected to cloud-computing joining us.

As far as I'm concerned, I started the Ceilometer project with a few other people in 2012, devoted to handling metrics of OpenStack platforms. Our goal is to be able to collect all the metrics and record them to analyze them later. We also have a module providing the ability to trigger actions on threshold crossing (alarms). The project grew in a monolithic way, and in a linear way in terms of the number of contributors, during the first two years. I was the PTL (Project Technical Leader) for a year. This leader position demands a lot of time for bureaucratic things and people management, so I decided to leave my spot in order to be able to spend more time solving the technical challenges that Ceilometer offered.

I started the Gnocchi project in 2014. The first stable version (1.0.0) was released a few months ago. It's a timeseries database offering a REST API and a strong ability to scale. It was a necessary development to solve the problems tied to the large amount of metrics created by a cloud-computing platform, where tens of thousands of virtual machines have to be metered as often as possible. This project works as a standalone deployment or with the rest of OpenStack.

More recently, I've started Aodh, the result of moving out the code and features of Ceilometer related to threshold action triggering (alarming). That's the logical follow-up to what we started with Gnocchi. It means Ceilometer is to be split into independent modules that can work together – with or without OpenStack. It seems to me that the features provided by Ceilometer, Aodh and Gnocchi can also be interesting for operators running more classical infrastructures.
That's why I've pushed the projects in that direction, and also towards a more service-oriented architecture (SOA).

I'd like to stop for a moment on Ceilometer. I think this solution was highly anticipated, especially by the cloud-computing providers using OpenStack for billing resources sold to their customers. I remember reading a blog post where you were talking about the high-speed construction of this component, and features that were not supposed to be there. Nowadays, with Gnocchi and Aodh, what is the quality of the Ceilometer component and the programs it relies on?

Indeed, one of the first use-cases for Ceilometer was tied to the ability to get metrics to feed a billing tool. That goal has now been reached, since we have billing tools for OpenStack using Ceilometer, such as CloudKitty.

However, other use-cases appeared rapidly, such as the ability to trigger alarms. This feature was necessary, for example, to implement the auto-scaling feature that Heat needed. At the time, for technical and political reasons, it was not possible to implement this feature in a new project, and the functionality ended up in Ceilometer, since it was using the metrics collected and stored by Ceilometer itself. Though, like I said, this feature is now in its own project, Aodh. The alarm feature has been used in production for a few cycles now, and the Aodh project brings new features to the table. It allows triggering threshold actions and is one of the few solutions able to work at high scale, with several thousand alarms. It's impossible to make Nagios run with millions of instances to fetch metrics and trigger alarms. Ceilometer and Aodh can do that easily on a few tens of nodes automatically.

On the other side, Ceilometer has for a long time been painted as slow and complicated to use, because its metrics storage system was by default using MongoDB. Clearly, the data structure model picked was not optimal for what the users were doing with the data. That's why I started Gnocchi last year, which is perfectly designed for this use case. It allows linear access time to metrics (O(1) complexity) and fast access time to the resource data via an index.

Today, with 3 projects having their own well-defined perimeter of features – and which can work together – Ceilometer, Aodh and Gnocchi have finally erased the biggest problems and defects of the initial project.

To end with OpenStack, one last question. You've been a Python developer for a long time and a fervent user of software testing and test-driven development. Several of your blog posts point out how important their usage is. Can you tell us more about the usage of tests in OpenStack, and the test prerequisites to contribute to OpenStack?

I don't know any project that is as tested on every layer as OpenStack is. At the start of the project, there was vague test coverage, made of a few unit tests. For each release, a bunch of new features were provided, and you had to keep your fingers crossed to have them working. That's already almost unacceptable. But the big issue was that there were also a lot of regressions, and things that were working stopped working. It was often corner cases that developers forgot about that stopped working.

Then the project decided to change its policy and started to refuse all patches – new features or bug fixes – that did not implement a minimal set of unit tests proving the patch would work. Quickly, regressions were history, and the number of bugs decreased significantly month after month.
Then came the functional tests, with the Tempest project, which runs a test battery on a complete OpenStack deployment. OpenStack now possesses a complete test infrastructure, with operators hired full-time to maintain it. The developers have to write the tests, and the operators maintain an architecture based on Gerrit, Zuul, and Jenkins, which runs the test battery of each project for each patch sent. Indeed, for each version of a patch sent, a full OpenStack is deployed into a virtual machine, and a battery of thousands of unit and functional tests is run to check that there are no regressions.

To contribute to OpenStack, you need to know how to write a unit test – the policy on functional tests is laxer. The tools used are standard Python tools: unittest for the framework and tox to create a virtual environment (venv) and run the tests. It's also possible to use DevStack to deploy an OpenStack platform on a virtual machine and run the functional tests. However, since the project infrastructure also does that when a patch is submitted, it's not mandatory to do that yourself locally.

The tools and tests you write for OpenStack are written in Python, a language which is very popular today. You seem to like it more than you have to, since you wrote a book about it, The Hacker's Guide to Python, that I really enjoyed. Can you explain what brought you to Python, the main strong points you attribute to this language (quickly) and how you went from developer to author?

I stumbled upon Python by chance, around 2005. I don't remember how I heard about it, but I bought a first book to discover it and started toying with that language. At that time, I didn't find any project to contribute to or to start. My first project with Python was rebuildd for Debian in 2007, a bit later. I like Python for its simplicity, its rather clean object orientation, its ease of deployment and its rich open source ecosystem. Once you get the basics, it's very easy to evolve and to use it for anything, because the ecosystem makes it easy to find libraries to solve any kind of problem.

I became an author by chance, writing blog posts from time to time about Python. I finally realized that after a few years studying Python internals (CPython), I had learned a lot of things. While writing a post about the differences between method types in Python – which is still one of the most read posts on my blog – I realized that a lot of things that seemed obvious to me were not for other developers. I wrote that initial post after thousands of hours spent doing code reviews on OpenStack. I therefore decided to note down all the developer pain points and to write a book about that: a compilation of what years of experience taught me and taught the other developers I decided to interview in the book.

I've been very interested in the publication of your book, for the subject itself, but also the process you chose. You self-published the book, which seems very relevant nowadays. Was that the choice from the start? Did you look for a publisher? Can you tell us more about that?

I've been lucky to find out about other self-published authors, such as Nathan Barry – who even wrote a book on that subject, called Authority. That's what convinced me it was possible and gave me hints for that project. I started to write in August 2013, and I ran the first interviews with other developers at that time. I started to write the table of contents and then filled the pages with what I knew and what I wanted to share.
I managed to finish the book around January 2014. The proof-reading took more time than I expected, so the book was only released in March 2014. I wrote a complete report on that on my blog, where I explain the full process in detail, from writing to launching. I did not look for publishers, though I was offered some deals. The idea of self-publishing really convinced me, so I decided to go on my own, and I have no regrets. It's true that you have to wear two hats at the same time and handle a lot more things, but with a minimal audience and some help from the Internet, anything's possible!

I've since been contacted by two publishers, a Chinese and a Korean one. I gave them the rights to translate and publish the book in their countries, so you can buy the Chinese and Korean versions of the first edition of the book out there. Seeing how successful it was, I decided to launch a second edition in May 2015, and it's likely that a third edition will be released in 2016.

Nowadays, you work for Red Hat, a company that represents the success of using Free Software as a commercial business model. This company fascinates a lot of people in our community. What can you say about your employer from your point of view?

It has only been a year since I joined Red Hat (when they bought eNovance), so my experience is quite recent. Though, Red Hat is really a special company on every level. It's hard to see from the outside how open it is, and how it works. It's really close to, and really looks like, an open source project. For more details, you should read The Open Organization, a book written by Jim Whitehurst (CEO of Red Hat), which he just published. It describes perfectly how Red Hat works. To summarize, meritocracy and the lack of organization in silos are what make Red Hat a strong organization and one of the most innovative companies. In the end, I'm lucky enough to be autonomous on the project I work on with my team around OpenStack, and I can spend 100% of my time working upstream and enhancing the Python ecosystem.

 September 15, 2015

Many modern mice have the ability to store profiles, customize button mappings and actions and switch between several hardware resolutions. A number of those mice are targeted at gamers, but the features are increasingly common in standard mice. Under Linux, support for these devices is spotty, though there are a few projects dedicated to supporting parts of the available device range. [1] [2] [3]

Benjamin Tissoires and I started a new project: libratbag. libratbag is a library to provide a generic interface to these mice, enabling desktop environments to provide configuration tools without having to worry about the device model. As of the time of this writing, we have partial support for the Logitech HID++ 1.0 (G500, G5) and HID++ 2.0 protocols (G303), the Etekcity Scroll Alpha and Roccat Kone XTD. Thomas H. P. Anderson already added the G5, G9 and the M705.

git clone http://github.com/libratbag/libratbag

The internal architecture is fairly simple: behind the library's API we have a couple of protocol-specific drivers that access the mouse. The drivers match a specific product/vendor ID combination and load the data from the device; the library then exports it to the caller as a struct ratbag_device. Each device has at least one profile, each profile has a number of buttons and at least one resolution. Where possible, the resolutions can be queried and set, and the buttons likewise can be queried and set for different functions.
If the hardware supports it, you can map buttons to other buttons, assign macros, or special functions such as DPI/profile switching. The main goal of libratbag is to unify access to the devices so a configuration application doesn't need different libraries per hardware. Especially short-term, we envision using some of the projects listed above through custom backends.

We're at version 0.1 at the moment, so the API is still subject to change. It looks like this:

#include <libratbag.h>

struct ratbag *ratbag;
struct ratbag_device *device;
struct ratbag_profile *p;
struct ratbag_button *b;
struct ratbag_resolution *r;

ratbag = ratbag_create_context(...);
device = ratbag_device_new_from_udev(ratbag, udev_device);

/* retrieve the first profile */
p = ratbag_device_get_profile(device, 0);

/* retrieve the first resolution setting of the profile */
r = ratbag_profile_get_resolution(p, 0);
printf("The first resolution is: %dpi @ %d Hz\n",
       ratbag_resolution_get_dpi(r),
       ratbag_resolution_get_report_rate(r));
ratbag_resolution_unref(r);

/* retrieve the fourth button */
b = ratbag_profile_get_button(p, 4);
if (ratbag_button_get_action_type(b) == RATBAG_BUTTON_ACTION_TYPE_SPECIAL &&
    ratbag_button_get_special(b) == RATBAG_BUTTON_ACTION_SPECIAL_RESOLUTION_UP)
    printf("button 4 selects next resolution");
ratbag_button_unref(b);
ratbag_profile_unref(p);
ratbag_device_unref(device);
ratbag_unref(ratbag);

For testing and playing around with libratbag, we have a tool called ratbag-command that exposes most of the library:

$ ratbag-command info /dev/input/event8
Device 'BTL Gaming Mouse'
Capabilities: res profile btn-key btn-macros
Number of buttons: 11
Profiles supported: 5
  Profile 0 (active)
    Resolutions:
      0: 800x800dpi @ 500Hz
      1: 800x800dpi @ 500Hz (active)
      2: 2400x2400dpi @ 500Hz
      3: 3200x3200dpi @ 500Hz
      4: 4000x4000dpi @ 500Hz
      5: 8000x8000dpi @ 500Hz
    Button: 0 type left is mapped to 'button 1'
    Button: 1 type right is mapped to 'button 2'
    Button: 2 type middle is mapped to 'button 3'
    Button: 3 type extra (forward) is mapped to 'profile up'
    Button: 4 type side (backward) is mapped to 'profile down'
    Button: 5 type resolution cycle up is mapped to 'resolution cycle up'
    Button: 6 type pinkie is mapped to 'macro "": H↓ H↑ E↓ E↑ L↓ L↑ L↓ L↑ O↓ O↑'
    Button: 7 type pinkie2 is mapped to 'macro "foo": F↓ F↑ O↓ O↑ O↓ O↑'
    Button: 8 type wheel up is mapped to 'wheel up'
    Button: 9 type wheel down is mapped to 'wheel down'
    Button: 10 type unknown is mapped to 'none'
  Profile 1
      ...
And to toggle/query the various settings on the device:
$ ratbag-command dpi set 400 /dev/input/event8
$ ratbag-command profile 1 resolution 3 dpi set 800 /dev/input/event8
$ ratbag-command profile 0 button 4 set action special doubleclick

libratbag is in a very early state of development. There are a bunch of FIXMEs in the code, the hardware support is still spotty and we'll appreciate any help we can get, especially with the hardware driver backends. There's a TODO in the repo for some things that we already know need changing. Feel free to browse the repo on github and drop us some patches.

Eventually we want this to be integrated into the desktop environments, either in the respective control panels or in a standalone application. libratbag already provides SVGs for some devices we support but we'll need some designer input for the actual application. Again, any help you want to provide here will be much appreciated.

# A Preliminary systemd.conf 2015 Schedule is Now Online!

We are happy to announce that an initial, preliminary version of the systemd.conf 2015 schedule is now online! (Please ignore that some rows in the schedule link the same session twice on that page. That's a bug in the web site CMS that we are working to fix.)

We got an overwhelming number of high-quality submissions during the CfP! Because there were so many good talks we really wanted to accept, we decided to do two full days of talks now, leaving one more day for the hackfest and BoFs. We also shortened many of the slots, to make room for more. All in all we now have a schedule packed with fantastic presentations!

The areas covered range from containers, to system provisioning, stateless systems, distributed init systems, the kdbus IPC, control groups, systemd on the desktop, systemd in embedded devices, configuration management and systemd, and systemd in downstream distributions.

We'd like to thank everybody who submitted a presentation proposal!

Also, don't forget to register for the conference! Only a limited number of registrations are available due to space constraints! Register here!

We are still looking for sponsors. If you'd like to join the ranks of systemd.conf 2015 sponsors, please have a look at our Becoming a Sponsor page!

For further details about systemd.conf consult the conference website.

 September 14, 2015

We've been working hard with the Gnocchi team these last months to store your metrics, and I guess it's time to show off a bit.

So far Gnocchi offers scalable metric storage and resource indexation, especially for OpenStack clouds – but not only, as it's generic. It's cool to store metrics, but it can be even better to have a way to visualize them!

## Prototyping

We very soon started to build a little HTML interface. Being REST-friendly guys, we enabled it on the same endpoints that were being used to retrieve information and measures about metrics, sending back text/html instead of application/json if you were requesting those pages from a Web browser.

But let's face it: we are back-end developers, and we suck at any kind of front-end development. CSS, HTML, JavaScript? Bwah! So what we built was a starting point, hoping some magical Web developer would jump in and finish the job.

Obviously it never happened.

## Ok, so what's out there?

It turns out there are back-end agnostic solutions out there, and we decided to pick Grafana. Grafana is a complete graphing dashboard solution that can be plugged on top of any back-end. It already supports timeseries databases such as Graphite, InfluxDB and OpenTSDB.

That was more than enough for my fellow developer Mehdi Abaakouk to jump in and start writing a Gnocchi plugin for Grafana! Consequently, there is now a basic but solid and working back-end for Grafana in the grafana-plugins repository.

With that plugin, you can graph anything that is stored in Gnocchi, from raw metrics to metrics tied to resources. You can use templating, but no annotation yet. The back-end supports Gnocchi with or without Keystone involved, and any type of authentication (basic auth or Keystone token). So yes, it even works if you're not running Gnocchi with the rest of OpenStack.

It also supports advanced queries, so you can search for resources based on some criterion and graph their metrics.

## I want to try it!

If you want to deploy it, all you need to do is to install Grafana and its plugins, and create a new datasource pointing to Gnocchi. It is that simple. There's some CORS middleware configuration involved if you're planning on using Keystone authentication, but it's pretty straightforward – just set the cors.allowed_origin option to the URL of your Grafana dashboard.

We added support for Grafana directly in the Gnocchi devstack plugin. If you're running DevStack you can follow the instructions – which basically amount to adding the line enable_service gnocchi-grafana.

## Moving to Grafana core

Mehdi just opened a pull request a few days ago to merge the plugin into Grafana core. It's actually one of the most unit-tested plugins in Grafana so far, so it should be on a good path to be merged in the future and to have support for Gnocchi directly in Grafana without any plugin involved.

 September 11, 2015

The Limba project's goal is not only to allow developers to deploy their applications directly on multiple Linux distributions while reducing duplication of shared resources; it should also make it easy for developers to build software for Limba.

Limba is worth nothing without good tooling to make it fun to use. That's why I am working on that too, and I want to share some ideas of how things could work in future and which services I would like to have running. I will also show what is working today already (and that's quite something!). This time I look at things from a developer's perspective (since the last posts on Limba were more end-user centric). If you read on, you will also find a nice video of the developer workflow.

## 1. Creating metadata and building the software

To make building Limba packages as simple as possible, Limba reuses already existing metadata, like AppStream metadata, to find information about the software you want to create your package for.

To ensure upstreams can build their software in a clean environment, Limba makes using one as simple as possible: The limba-build CLI tool creates a clean chroot environment quickly by using an environment created by debootstrap (or a comparable tool suitable for the Linux distribution), and then using OverlayFS to have all changes to the environment done during the build process land in a separate directory.

To define build instructions, limba-build uses the same YAML format that TravisCI uses for continuous integration. So there is a chance this data is already present as well (if not, it's trivial to write).

In case upstream projects don’t want to use these tools, e.g. because they have well-working CI already, then all commands needed to build a Limba package can be called individually as well (ideally, building a Limba package is just one call to lipkgen).

I am currently planning “DeveloperIPK” packages containing resources needed to develop against another Limba package. With that in place and integrated with the automatic build-environment creation, upstream developers can be sure the application they just built is built against the right libraries as present in the package they depend on. The build tool could even fetch the build-dependencies automatically from a central repository.

While everyone can set up their own Limba repository, and the limba-build repo command will help with that, there are lots of benefits in having a central place where upstream developers can upload their software to.

I am currently developing a service like that, called “LimbaHub”. LimbaHub will contain different repositories distributors can make available to their users by default, e.g. there will be one with only free software, and one for proprietary software. It will also later allow upstreams to create private repositories, e.g. for beta-releases.

## 3. Security in LimbaHub

Every Limba package is signed with the key of its creator anyway, so in order to get a package into LimbaHub, one needs to get their OpenPGP key accepted by the service first.

Additionally, the Hub service works with a per-package permission system. This means I can e.g. allow the Mozilla release team members to upload a package with the component-ID “org.mozilla.firefox.desktop” or even allow those user(s) to “own” the whole org.mozilla.* namespace.

This should prevent people hijacking other people’s uploads accidentally or on purpose.

## 4. QA with LimbaHub

LimbaHub should also act as guardian over ABI stability and general quality of the software. We could for example warn upstreams that they broke ABI without declaring that in the package information, or even reject the package then. We could validate .desktop files and AppStream metadata, or even check if a package was built using hardening flags.

This should help both developers to improve their software as well as users who benefit from that effort. In case something really bad gets submitted to LimbaHub, we always have the ability to remove the package from the repositories as a last resort (which might trigger Limba to issue a warning for the user that he won’t receive updates anymore).

## What works

Limba, LimbaHub and the tools around them are developing nicely; so far no big issues have been encountered.

That’s why I made a video showing how Limba and LimbaHub work together at time:

Still, there is a lot of room for improvement – Limba has not yet received enough testing, and LimbaHub is merely a proof-of-concept at the moment. Also, lots of high-priority features are not yet implemented.

## LimbaHub and Limba need help!

At the moment I am developing LimbaHub and Limba alone, with only occasional contributions from others (which are amazing and highly welcomed!). So, if you like Python and Flask, and want to help developing LimbaHub, please contact me – the LimbaHub software could benefit from a more experienced Python web developer than I am (and maybe having a designer look over the frontend later makes sense as well). If you are not afraid of C and GLib, and like to chase bugs or play with building Limba packages, consider helping with Limba development.

 September 07, 2015
Kernel 4.2 is already released and the 4.3 merge window is in full swing, so it's time to look at what's in it for the Intel graphics driver.

Biggest thing for sure is that Skylake is finally out of preliminary support and enabled by default. The reason for the long hold-up was some ABI fumble - the hardware exposes the topmost plane both through the new universal plane registers and the legacy cursor registers, and because we simply carried the legacy plane code around in the driver we ended up exposing both. This wasn't something big to take care of, but somehow it dragged on forever.

The other big thing is that legacy modesets are now done driver-internally with the new atomic modesetting code. Atomic support in i915.ko isn't fully ready for prime-time yet, but this is definitely a big step forward. Besides atomic there are also other cross-platform improvements in the modeset code: Ville fixed up the 12bpc support for HDMI, which is now used by default if the screen supports it. Mika Kahola and Ville also implemented dynamic adjustment of the cdclk, which is the main clock source for display engines on intel graphics. And there's a big difference in the clock speeds needed between e.g. a 4k screen and a 720p TV.

Continuing with power saving features, Rodrigo again spent a lot of time fixing up PSR (panel self refresh). And Paulo did the same by writing patches to improve FBC (framebuffer compression). We have some really solid testcases by now; unfortunately neither feature is ready to be enabled by default yet. PSR especially is still plagued by screen freezes on some random systems. Also there have been some fixes to DRRS (dynamic refresh rate switching) from Ramalingam. DRRS is enabled by default already, where supported. And finally some improvements to make the frontbuffer rendering tracking more accurate, which is used by all three of these display power saving features.

And of course there are also tons of improvements to platform code. Display PLL code for Skylake and Valleyview & Cherryview was tuned by Damien and Ville respectively. There's been tons of work on Broxton and DSI support by Imre, Gaurav and others.

Moving on to the rendering side, the big change is how tracking of rendering tasks is handled. In the past the driver just used raw sequence numbers emitted by the hardware, but for cross-driver synchronization and reordering tasks with an eventual gpu scheduler more abstraction is needed. A big step is converting over to the i915 request structure completely, done by John Harrison. The next step will be to switch the internal implementation for i915 requests to the cross-driver fences, but that's for future kernels. As a follow-up cleanup John also removed the OLR, which stands for outstanding lazy request. It was a neat little trick implemented years ago to simplify handling error recovery, but it caused tons of pain with subtle bugs. Making requests more explicit in the driver allowed us to finally remove this trick.

There's also been a pile of platform related features: MOCS programming for Skylake/Broxton (which is used for caching control). Resource streamer support from Abdiel, which is used to offload some of the buffer object tracking for shaders from the cpu to the gpu. And the command parser on Haswell was extended to support atomic instructions in shaders. And finally for Skylake Mika Kuoppala added code to avoid resetting the gpu - in certain cases the hardware would hard-hang the entire system trying to execute the reset. And a dead gpu is still better than a dead system.

Piwik told me that people are still sharing my post about the state of GNOME-Software and update notifications in Debian Jessie.

So I thought it might be useful to publish a small update on that matter:

• If you are using GNOME or KDE Plasma with Debian 8 (Jessie), everything is fine – you will receive update notifications through the GNOME-Shell/via g-s-d or Apper respectively. You can perform updates on GNOME with GNOME-PackageKit and with Apper on KDE Plasma.
• If you are using a desktop-environment not supporting PackageKit directly – for example Xfce, which previously relied on external tools – you might want to try pk-update-icon from jessie-backports. The small GTK+ tool will notify about updates and install them via GNOME-PackageKit, basically doing what GNOME-PackageKit did by itself before the functionality was moved into GNOME-Software.
• For Debian Stretch, the upcoming release of Debian, we will have gnome-software ready and fully working. However, one of the design decisions of upstream is to only allow offline-updates (= download updates in the background, install on request at next reboot) with GNOME-Software. In case you don’t want to use that, GNOME-PackageKit will still be available, and so are of course all the CLI tools.
• For KDE Plasma 5 on Debian 9 (Stretch), a nice AppStream based software management solution with a user-friendly updater is also planned and being developed upstream. More information on that will come when there’s something ready to show ;-).

I hope that clarifies things a little. Have fun using Debian 8!

#### UPDATE:

It appears that many people have problems with getting update notifications in GNOME on Jessie. If you are affected by this, please try the following:

1. Open dconf-editor and navigate to org.gnome.settings-daemon.plugins.updates. Check if the key active is set to true.
2. If that doesn’t help, also check if the frequency-refresh-cache value at org.gnome.settings-daemon.plugins.updates is set to a sane value (e.g. 86400).
3. Consider increasing the priority value (if it isn’t at 300 already).
4. If all of that doesn’t help: I guess some internal logic in g-s-d is preventing a cache refresh then (e.g. because it thinks it is on an expensive network connection and therefore doesn’t refresh the cache automatically, or thinks it is running on battery). This is a bug. If that still happens on Debian Stretch with GNOME-Software, please report a bug against the gnome-software package. As a workaround for Jessie you can enable unconditional cache refreshing via the APT cronjob by installing apt-config-auto-update.
 September 05, 2015
While trying to autogenerate C interfaces for XKB during the last 2 months, it sometimes felt like an attempt to "behead the Hydra": whenever I fixed a problem, a number of new issues arose (deep from the intestines of the python code generator). At the moment it seems I have reached a state where no new heads emerge, and I even managed to cut off a number of the ugliest...

The cause of most (all?) troubles is the very basic way data types are distinguished: fixed size (everything you can feed to sizeof), variable size (lists, valueparams, ...) and padding. In the X protocol, fixed size fields are normally grouped together, so e.g. in a request all fixed size fields are collected before any variable size fields.

XCB, whose purpose is to translate between X server and client, takes advantage of this ordering: several grouped fixed size fields are conveniently mapped to a C struct, so they are fairly easy to deal with. The treatment of variable size data is more difficult and involves the use of (autogenerated) accessors and iterators. Also, the specific number of elements in a variable size data type can depend on expressions that are specified in the protocol but need to be evaluated at runtime.
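To make the accessor/iterator pattern concrete, here is what walking a variable size list from the core protocol typically looks like – a small sketch using the well-known screen iterator, with error handling omitted:

#include <stdio.h>
#include <xcb/xcb.h>

int main(void)
{
	/* The connection setup data contains a variable-size list of
	 * screens; the autogenerated iterator API walks it for us. */
	xcb_connection_t *conn = xcb_connect(NULL, NULL);
	xcb_screen_iterator_t it = xcb_setup_roots_iterator(xcb_get_setup(conn));

	for (; it.rem; xcb_screen_next(&it))
		printf("screen: %ux%u\n",
		       it.data->width_in_pixels, it.data->height_in_pixels);

	xcb_disconnect(conn);
	return 0;
}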
Now XKB breaks with the "pure doctrine" of the X core protocol: Fixed and variable size fields are intermixed in several cases and new expressions are introduced. Further problematic cases include a variable size union, lists with a variable number of variable size elements, ... And finally, the keyboard extension defines a new datatype (ListOfItems), which is referred to as 'switch' in XCB.

Switch is a kind of list, but the 'items' are not necessarily of the same type, and whether an item is included or not depends on expressions that have to be evaluated for each item separately. Defining a C mapping for 'switch' was one of the main goals of my work, and it turned out that a set of new functions was needed. These new functions must either 'serialize' a switch into a buffer, depending on the concrete switch conditions, or 'unserialize' it to some C struct. As 'switch' can contain any other data type valid in the X protocol (which means especially that it also can contain other switches...), 'serialize'/'unserialize' had to be defined for all those data types.
Once I had the autogenerator spill out '_serialize()' and '_unserialize()', it turned out they could be used to deal with some other special cases as well – most notably the (above-mentioned) intermixing of fixed and variable size fields.
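To illustrate the general idea (this is a conceptual sketch, not the code the generator actually emits): a type with a fixed size part followed by a variable size list is flattened into a single buffer on the way out, and parsed back into a C struct on the way in.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* A toy type: a fixed size count followed by a variable size list. */
struct toy_reply {
	uint16_t  n_keys;	/* fixed size part */
	uint32_t *keys;		/* variable size part, n_keys elements */
};

/* Flatten into a freshly malloc'd wire buffer; returns its length. */
static size_t toy_reply_serialize(void **buffer, const struct toy_reply *r)
{
	size_t len = sizeof(r->n_keys) + r->n_keys * sizeof(*r->keys);
	char *p = malloc(len);

	memcpy(p, &r->n_keys, sizeof(r->n_keys));
	memcpy(p + sizeof(r->n_keys), r->keys, r->n_keys * sizeof(*r->keys));
	*buffer = p;
	return len;
}

/* The reverse direction: parse a wire buffer back into the struct. */
static void toy_reply_unserialize(const void *buffer, struct toy_reply *r)
{
	const char *p = buffer;

	memcpy(&r->n_keys, p, sizeof(r->n_keys));
	r->keys = malloc(r->n_keys * sizeof(*r->keys));
	memcpy(r->keys, p + sizeof(r->n_keys), r->n_keys * sizeof(*r->keys));
}

int main(void)
{
	uint32_t keys[] = { 9, 30, 38 };
	struct toy_reply in = { 3, keys }, out;
	void *wire;
	size_t len = toy_reply_serialize(&wire, &in);

	toy_reply_unserialize(wire, &out);	/* out now mirrors in */
	free(wire);
	free(out.keys);
	return len == 2 + 3 * 4 ? 0 : 1;
}

The real generated serializers additionally have to evaluate the protocol's length and switch expressions at runtime to know how many bytes each part occupies.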

But probably the most appealing feature of the serializers is that they are invisible to anyone who wants to use XCB as they get called automatically in the autogenerated helper functions.

A last note on the current status:
+ xkb.c (autogenerated from xkb.xml) compiles!
+ the request side should be finished (more tests needed however)
+ simple XKB test programs run
- on the reply side, special cases still need special treatment
- _unserialize() should be renamed to _sizeof() in some cases
- some special cases also need special accessors

The code is currently hosted on annarchy, but I will export the repos shortly.
 September 04, 2015

Continuing my post series on the tools I use these days in Python, this time I would like to talk about a library I really like, named voluptuous.

It's no secret that most of the time, when a program receives data from the outside, handling it properly is a big deal. Indeed, most of the time your program has no guarantee that the stream is valid and that it contains what is expected.

The robustness principle says you should be liberal in what you accept, though that is not always a good idea either. Whatever policy you choose, you need to process that data and implement a policy that will work – lax or not.

That means the program needs to look into the data it received, check that everything it needs is there, complete what might be missing (e.g. set some defaults), transform some of the data, and maybe reject it in the end.

## Data validation

The first step is to validate the data, which means checking that all the fields are there and all the types are right or understandable (parseable). Voluptuous provides a single interface for all of that, called a Schema.

>>> from voluptuous import Schema
>>> s = Schema({
...   'q': str,
...   'per_page': int,
...   'page': int,
... })
>>> s({"q": "hello"})
{'q': 'hello'}
>>> s({"q": "hello", "page": "world"})
voluptuous.MultipleInvalid: expected int for dictionary value @ data['page']
>>> s({"q": "hello", "unknown": "key"})
voluptuous.MultipleInvalid: extra keys not allowed @ data['unknown']

The argument to voluptuous.Schema should be the data structure that you expect. Voluptuous accepts any kind of data structure, so it could also be a simple string or an array of dicts of arrays of integers. You get the idea. Here it's a dict with a few keys that, if present, should be validated as certain types. By default, Voluptuous does not raise an error if some keys are missing. However, it is invalid to have extra keys in a dict by default. If you want to allow extra keys, it is possible to specify it.

>>> from voluptuous import Schema
>>> s = Schema({"foo": str}, extra=True)
>>> s({"bar": 2})
{"bar": 2}

It is also possible to make some keys mandatory.

>>> from voluptuous import Schema, Required
>>> s = Schema({Required("foo"): str})
>>> s({})
voluptuous.MultipleInvalid: required key not provided @ data['foo']
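And, related to the earlier point about completing missing data: Required can also carry a default value that is inserted when the key is absent – a small sketch, assuming a voluptuous version where Required accepts a default argument:

>>> from voluptuous import Schema, Required
>>> s = Schema({Required("page", default=1): int})
>>> s({})
{'page': 1}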

You can create custom data types very easily. Voluptuous data types are actually just functions that are called with one argument, the value, and that should either return the value or raise an Invalid or ValueError exception.

>>> from voluptuous import Schema, Invalid
>>> def StringWithLength5(value):
...     if isinstance(value, str) and len(value) == 5:
...             return value
...     raise Invalid("Not a string with 5 chars")
...
>>> s = Schema(StringWithLength5)
>>> s("hello")
'hello'
>>> s("hello world")
voluptuous.MultipleInvalid: Not a string with 5 chars

Most of the time though, there is no need to create your own data types. Voluptuous provides logical operators that, combined with a few other provided primitives such as voluptuous.Length or voluptuous.Range, can create a large range of validation schemes.

>>> from voluptuous import Schema, Length, All
>>> s = Schema(All(str, Length(min=3, max=5)))
>>> s("hello")
"hello"
>>> s("hello world")
voluptuous.MultipleInvalid: length of value must be at most 5
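The logical operators compose the same way; for example Any, sketched quickly here, accepts a value matching any of its arguments:

>>> from voluptuous import Schema, Any
>>> s = Schema(Any(int, None))
>>> s(42)
42
>>> s(None)  # valid too; the REPL prints nothing for None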

The voluptuous documentation has a good set of examples that you can check to have a good overview of what you can do.

## Data transformation

What's important to remember is that each data type you use is a function that is called and returns a value if the value is considered valid. That returned value is what is actually used and returned after the schema validation:

>>> import uuid
>>> from voluptuous import Schema
>>> def UUID(value):
...     return uuid.UUID(value)
...
>>> s = Schema({"foo": UUID})
>>> data_converted = s({"foo": "uuid?"})
voluptuous.MultipleInvalid: not a valid value for dictionary value @ data['foo']
>>> data_converted = s({"foo": "8B7BA51C-DFF5-45DD-B28C-6911A2317D1D"})
>>> data_converted
{'foo': UUID('8b7ba51c-dff5-45dd-b28c-6911a2317d1d')}

By defining a custom UUID function that converts a value to a UUID, the schema converts the string passed in the data to a Python UUID object – validating the format at the same time.

Note a little trick here: it's not possible to use uuid.UUID directly in the schema, otherwise Voluptuous would check that the data is actually an instance of uuid.UUID:

>>> from voluptuous import Schema
>>> s = Schema({"foo": uuid.UUID})
>>> s({"foo": "8B7BA51C-DFF5-45DD-B28C-6911A2317D1D"})
voluptuous.MultipleInvalid: expected UUID for dictionary value @ data['foo']
>>> s({"foo": uuid.uuid4()})
{'foo': UUID('60b6d6c4-e719-47a7-8e2e-b4a4a30631ed')}

And that's not what is wanted here.

That mechanism is really neat for transforming, for example, strings into timestamps.

>>> import datetime
>>> from voluptuous import Schema
>>> def Timestamp(value):
...     return datetime.datetime.strptime(value, "%Y-%m-%dT%H:%M:%S")
...
>>> s = Schema({"foo": Timestamp})
>>> s({"foo": '2015-03-03T12:12:12'})
{'foo': datetime.datetime(2015, 3, 3, 12, 12, 12)}
>>> s({"foo": '2015-03-03T12:12'})
voluptuous.MultipleInvalid: not a valid value for dictionary value @ data['foo']
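For simple conversions like this, Voluptuous also provides a Coerce helper so you don't have to write the function yourself – a quick sketch, assuming the Coerce primitive shipped with recent voluptuous releases:

>>> from voluptuous import Schema, Coerce
>>> s = Schema({"foo": Coerce(int)})
>>> s({"foo": "42"})
{'foo': 42}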

## Recursive schemas

Voluptuous has one limitation so far: the inability to have recursive schemas. The simplest way to circumvent it is to use another function as an indirection.

>>> from voluptuous import Schema, Any
>>> def _MySchema(value):
...     return MySchema(value)
...
>>> MySchema = Schema({"foo": Any("bar", _MySchema)})
>>> MySchema({"foo": {"foo": "bar"}})
{'foo': {'foo': 'bar'}}
>>> MySchema({"foo": {"foo": "baz"}})
voluptuous.MultipleInvalid: not a valid value for dictionary value @ data['foo']['foo']

## Usage in REST API

I started to use Voluptuous to validate data in the REST API provided by Gnocchi. So far it has been a really good tool, and we've been able to create a complete REST API that is very easy to validate on the server side. I would definitely recommend it for that. It blends easily with any Web framework.

One of the upsides compared to solutions like JSON Schema is the ability to create or re-use your own custom data types while converting values at validation time. It is also very Pythonic and extensible – it's pretty great to use for all of that. It's also not tied to any serialization format.

On the other hand, JSON Schema is language agnostic and is serializable itself as JSON. That makes it easy to be exported and provided to a consumer so it can understand the API and validate the data potentially on its side.

 August 29, 2015

# Overview

Pictures are the right way to start.

Conceptual view of aliasing PPGTT bind/unbind

There is exactly one thing to get from the above drawing; everything else is just there to make it as close to fact as possible.

1. The aliasing PPGTT (aliases|shadows|mimics) the global GTT.

## The wordy overview

Support for Per-process Graphics Translation Tables (PPGTT) debuted on Sandybridge (GEN6). The features provided by hardware are a superset of Aliasing PPGTT, which is entirely a software construct. The most obvious unimplemented feature is that the hardware supports multiple PPGTTs; Aliasing PPGTT is a single instance of a PPGTT. Although not entirely true, it’s easiest to think of the Aliasing PPGTT as a set of page tables that is maintained to have identical mappings to the global GTT (the picture above). There is more on this in the Summary section.

Until recently, aliasing PPGTT was the only way to make use of the hardware feature (unless you accidentally stepped into one of my personal branches). Aliasing PPGTT is implemented as a performance feature (more on this later). It was an important enabling step for us as well, since it provided a good foundation for the lower levels of the real PPGTT code.

In the following, I will be using the HSW PRMs as a reference. I’ll also assume you’ve read, or understand part 1.

## Selecting GGTT or PPGTT

Choosing between the GGTT and the Aliasing PPGTT is very straightforward. The choice is provided in several GPU commands; if there is no explicit choice, then there is some implicit behavior which is usually sensible. The most obvious command to be provided with a choice is MI_BATCH_BUFFER_START. When a batchbuffer is submitted, the driver sets a single bit that determines whether the batch will execute out of the GGTT or the Aliasing PPGTT1. Several other commands, like PIPE_CONTROL, also have a bit to direct which address space to use for the reads or writes that the GPU command will perform.
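For illustration, a simplified sketch of what a Haswell batch dispatch looks like – close in spirit to the i915 code, but abbreviated, and the driver's real flag handling is more involved:

/* One bit in MI_BATCH_BUFFER_START selects the address space. */
ret = intel_ring_begin(ring, 2);
if (ret)
	return ret;

intel_ring_emit(ring, MI_BATCH_BUFFER_START |
		      MI_BATCH_PPGTT_HSW |      /* execute out of the PPGTT */
		      MI_BATCH_NON_SECURE_HSW); /* unprivileged batch */
intel_ring_emit(ring, offset);                  /* GPU address of the batch */
intel_ring_advance(ring);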

# Architecture

The names for all the page table data structures in hardware are the same as for the IA CPU. You can see the Intel® 64 and IA-32 Architectures Software Developer Manuals for more information. (At the time of this post: page 1988, Vol. 3, 4.2 HIERARCHICAL PAGING STRUCTURES: AN OVERVIEW). I don’t want to rehash the HSW PRMs too much, and I won’t copy the diagrams. However, for the sake of having a consolidated post, I will rehash the most pertinent parts.

There is one conceptual Page Directory for a PPGTT – the docs call this a set of Page Directory Entries (PDEs); however, since they are contiguous, calling it a Page Directory makes a lot of sense to me. In fact, going back to the Ironlake docs, that seems to be the case. So there is one page directory with up to 512 entries, each pointing to a page table. There are several good diagrams in the PRMs which I won’t bother redrawing.2

Page Directory Entry

| 31:12 | 11:04 | 03:02 | 01 | 0 |
|---|---|---|---|---|
| Physical Page Address 31:12 | Physical Page Address 39:32 | Rsvd | Page size (4K/32K) | Valid |

Page Table Entry

| 31:12 | 11 | 10:04 | 03:01 | 0 |
|---|---|---|---|---|
| Physical Page Address 31:12 | Cacheability Control[3] | Physical Page Address 38:32 | Cacheability Control[2:0] | Valid |

There are some things we can get from this, for those too lazy to click on the links to the docs.

1. PPGTT page tables exist in physical memory.
2. PPGTT PTEs have the exact same layout as GGTT PTEs.
3. PDEs don’t have cache attributes (more on this later).
4. There exists support for big pages3

With the above definitions, we now can derive a lot of interesting attributes about our GPU. As already stated, the PPGTT is a two-level page table (I’ve not yet defined the size).

• A PDE is 4 bytes wide
• A PTE is 4 bytes wide
• A Page table occupies 4k of memory.
• There are 4k/4 = 1024 entries in a page table (spelled out in the sketch below).
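Spelled out as code (constant names here are mine, not the driver's macros):

#include <stdint.h>
#include <stdio.h>

#define PTE_SIZE 4u			/* bytes */
#define PT_SIZE  4096u			/* a page table is one page */
#define PTES     (PT_SIZE / PTE_SIZE)	/* 1024 PTEs per page table */
#define PDES     512u			/* one page directory */

int main(void)
{
	uint64_t per_pde = (uint64_t)PTES * 4096;	/* 4MB mapped per PDE */
	uint64_t total   = per_pde * PDES;		/* 2GB total */

	printf("each PDE covers %lluMB, the PPGTT covers %lluGB\n",
	       (unsigned long long)(per_pde >> 20),
	       (unsigned long long)(total >> 30));
	return 0;
}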

With all this information, I now present to you a slightly more accurate picture.

An object with an aliased PPGTT mapping

## Size

PP_DCLV – PPGTT Directory Cacheline Valid Register: As the spec tells us, “This register controls update of the on-chip PPGTT Directory Cache during a context restore.” This statement is directly contradicted in the very next paragraph, but the important part is the bit about the on-chip cache. This register also determines the amount of virtual address space covered by the PPGTT. The documentation for this register is pretty terrible, so a table is actually useful in this case.

PPGTT Directory Cacheline Valid Register (from the docs)

| 63:32 | 31:0 |
|---|---|
| MBZ | PPGTT Directory Cache Restore [1..32] 16 entries |

DCLV, the right way

| 31 | 30 | … | 1 | 0 |
|---|---|---|---|---|
| PDE[511:496] enable | PDE[495:480] enable | … | PDE[31:16] enable | PDE[15:0] enable |

The “why” is not important. Each bit represents a cacheline of PDEs, which is how the register gets its name4. A PDE is 4 bytes and there are 64 bytes in a cacheline, so 64/4 = 16 entries per bit. We now know how much address space we have.

512 PDEs * 1024 PTEs per PT * 4096 PAGE_SIZE = 2GB

## Location

PP_DIR_BASE: Sadly, I cannot find the definition of this in the public HSW docs. However, I did manage to find a definition in the Ironlake docs – yay me. There are several mentions in more recent docs, and it works the same way as outlined on Ironlake. Quoting the docs again: “This register contains the offset into the GGTT where the (current context’s) PPGTT page directory begins.” We learn a very important caveat about the PPGTT here – the PPGTT PDEs reside within the GGTT.

## Programming

With these two things, we now have the ability to program the location and size (and get the thing to load into the on-chip cache). Here is current i915 code which switches the address space (with simple comments added). It’s actually pretty ho-hum.

...
ret = intel_ring_begin(ring, 6);
if (ret)
	return ret;

intel_ring_emit(ring, RING_PP_DIR_DCLV(ring));
intel_ring_emit(ring, PP_DIR_DCLV_2G);       // program size
intel_ring_emit(ring, RING_PP_DIR_BASE(ring));
intel_ring_emit(ring, get_pd_offset(ppgtt)); // program location
intel_ring_emit(ring, MI_NOOP);
...


As you can see, we program the size to always be the full amount (in fact, I fixed this a long time ago, but never merged). Historically, the offset was at the top of the GGTT, but with my PPGTT series merged that is abstracted out, and the simple get_pd_offset() macro gets the offset within the GGTT. The intel_ring_emit() stuff is there because the docs recommended setting the registers via the GPU’s LOAD_REGISTER_IMMEDIATE command, though empirically it seems to be fine if we simply write the registers via MMIO (for Aliasing PPGTT). See my previous blog post if you want more info about command execution in the GPU’s ringbuffer. If it’s easier, just pretend it’s 2 MMIO writes.
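In other words, something like this (a sketch of the MMIO variant, which for the aliasing PPGTT is effectively equivalent):

/* Two register writes: size, then location of the page directory. */
I915_WRITE(RING_PP_DIR_DCLV(ring), PP_DIR_DCLV_2G);
I915_WRITE(RING_PP_DIR_BASE(ring), get_pd_offset(ppgtt));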

### Initialization

All of the resources are allocated and initialized upfront. There are 3 main steps. Note that the following comes from a relatively new kernel, and I have already submitted patches which change some of the cosmetics. However, the concepts haven’t changed for pre-gen8.

1. Allocate space in the GGTT for the PPGTT PDEs

ret = drm_mm_insert_node_in_range_generic(&dev_priv->gtt.base.mm,
					  &ppgtt->node, GEN6_PD_SIZE,
					  GEN6_PD_ALIGN, 0,
					  0, dev_priv->gtt.base.total,
					  DRM_MM_TOPDOWN);


2. Allocate the page tables

for (i = 0; i < ppgtt->num_pd_entries; i++) {
	ppgtt->pt_pages[i] = alloc_page(GFP_KERNEL);
	if (!ppgtt->pt_pages[i]) {
		gen6_ppgtt_free(ppgtt);
		return -ENOMEM;
	}
}


3. [possibly] IOMMU map the pages

for (i = 0; i < ppgtt->num_pd_entries; i++) {
	pt_addr = pci_map_page(dev->pdev, ppgtt->pt_pages[i], 0, 4096,
			       PCI_DMA_BIDIRECTIONAL);
	...
}


As the system binds and unbinds objects in the aliasing PPGTT, it simply writes the PTEs for the given object (possibly spanning multiple page tables). The PDEs do not change. PDEs are pointed at a scratch page when not used, as are the PTEs.

### IOMMU

As we saw in step 3 above, the page tables may be mapped by the IOMMU. This is one important caveat that I didn’t fully understand early on, so I wanted to recap a bit. Recall that the GGTT is allocated out of system memory during the boot firmware’s initialization. This means that as long as Linux treats that memory as special, everything will just work (just don’t look for IOMMU-implicated bugs on our bugzilla). The page tables however are special because they get allocated after Linux is already running, and the IOMMU is potentially managing the memory. In other words, we don’t want to write the physical address to the PDEs, we want to write the DMA address (deferring to Wikipedia again for the description of an IOMMU). That’s all. It tripped me up the first time I saw it because I hadn’t dealt with this kind of thing before. Our PTEs have worked the same way for a very long time when mapping the BOs, but those have somewhat hidden details because they use the scatter-gather functions.
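A sketch of what writing a single PDE looks like with that in mind (modelled on the i915 code of this era; macro names may differ between kernel versions):

/* The PDE must contain the DMA address returned by the IOMMU mapping,
 * not the page's physical address. */
dma_addr_t pt_addr = pci_map_page(dev->pdev, ppgtt->pt_pages[i], 0, 4096,
				  PCI_DMA_BIDIRECTIONAL);
u32 pd_entry = GEN6_PDE_ADDR_ENCODE(pt_addr) | GEN6_PDE_VALID;

/* pd_addr points at the PD's home in the GGTT (see PP_DIR_BASE);
 * the gen6 PDEs themselves live there. */
writel(pd_entry, pd_addr + i);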

Feel free to ask questions in the comments if you need more clarity – I’d probably need another diagram to accommodate.

## Cached page tables

Let me be clear: I favored writing a separate post for the Aliasing PPGTT because it gets a lot of the details out of the way for the post about Full PPGTT. However, the entire point of this feature is to get a [to date, unmeasured] performance win. Let me explain… Notice bits 4:3 of the ECOCHK register. Similarly in the i915 code:

ecochk = I915_READ(GAM_ECOCHK);
if (IS_HASWELL(dev)) {
	ecochk |= ECOCHK_PPGTT_WB_HSW;
} else {
	ecochk |= ECOCHK_PPGTT_LLC_IVB;
	ecochk &= ~ECOCHK_PPGTT_GFDT_IVB;
}
I915_WRITE(GAM_ECOCHK, ecochk);


What these bits do is tell the GPU whether (and how) to cache the PPGTT page tables. Following the Haswell case, the code is saying to map the PPGTT page table with write-back caching policy. Since the writes for Aliasing PPGTT are only done at initialization, the policy is really not that important.

Below is how I’ve chosen to distinguish the two. I have no evidence that this is actually what happens, but it seems about right.

Flow chart for GPU GGTT memory access. Red means slow.

Flow chart for GPU PPGTT memory access. Red means slow.

Red means slow. The point which was hopefully made clear above is that when you miss the TLB on a GGTT access, you need to fetch the entry from memory, which has a relatively high latency. When you miss the TLB on a PPGTT access, you have two caches (the special PDE cache for PPGTT, and LLC) which are backing the request. Note there is an intentional bug in the second diagram – you may miss the LLC on the PTE fetch also. I was trying to keep things simple, and show the hopeful case.

Because of this, all mappings which do not require GGTT mappings get mapped to the aliasing PPGTT.

## Distinctions from the GGTT

At this point I hope you’re asking why we need the global GTT at all. There are a few limited cases where the hardware is incapable of using a per-process address space, or where it is undesirable.

A brief description of why follows, along with all the current callers of the global pin interface.

• Display: Display actually implements its own version of the GGTT. Maintaining the logic to support multiple levels of page tables was both costly and unnecessary. Anything relating to a buffer being scanned out to the display must always be mapped into the GGTT. I expect this to be true, forever.
• i915_gem_object_pin_to_display_plane(): page flipping
• intel_setup_overlay(): overlays
• Ringbuffer: Keep in mind that the aliasing PPGTT is a special case of PPGTT. The ringbuffer must remain address-space and context agnostic. It doesn’t make any sense to connect it to the PPGTT, and therefore the logic does not support it. The ringbuffer provides direct communication with the hardware’s execution logic – which would be a nightmare to synchronize, even if we forget about the security nightmare. If you go off and think about how you would have a ringbuffer mapped by multiple address spaces, you will end up with something like execlists.
• allocate_ring_buffer()
• HW Contexts: Extremely similar to ringbuffer.
• intel_alloc_context_page(): Ironlake RC6
• i915_gem_create_context(): Create the default HW context
• i915_gem_context_reset(): Re-pin the default HW context
• do_switch(): Pin the logical context we’re switching to
• Hardware status page: The use of this, prior to execlists, is much like ringbuffers and contexts. There is a per-process status page with execlists.
• init_status_page()
• Workarounds:
• init_pipe_control(): Initialize scratch space for workarounds.
• intel_init_render_ring_buffer(): An i830 w/a I won’t bother to understand
• render_state_alloc(): Full initialization of the GPU’s 3D state from within the kernel
• Other
• i915_gem_fault(): Map an object into the aperture for gtt_mmap. More info here.
• i915_gem_pin_ioctl(): The DRI1 pin interface.

# GEN8 disambiguation

Off the top of my head, here is a list of some of the changes on GEN8 which will get more detail in a later post. These changes are all upstream from the original Broadwell integration.

• PTE size increased to 8b
• Therefore, 512 entries per table
• Format mimics the CPU PTEs
• PDEs increased to 8b (remains 512 PDEs per PD)
• Page Directories live in system memory
• GGTT no longer holds the PDEs.
• There are 4 PDPs, and therefore 4 PDs
• PDEs are cached in LLC instead of special cache (I’m guessing)
• New HW PDP (Page Directory Pointer) registers point to the PDs, for legacy 32b addressing.
• PP_DIR_BASE, and PP_DCLV are removed
• Support for 4 level page tables, up to 48b virtual address space.
• PML4[PML4E] -> PDP
• PDP[PDPE] -> PD
• PD[PDE] -> PT
• PT[PTE] -> Memory (index extraction sketched after this list)
• Big pages are now 64k instead of 32k (still not implemented)
• New caching interface via PAT like structure
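Since the format mimics the CPU's, the index extraction for that 4-level walk is the familiar 9-bits-per-level split of a 48b address – a sketch, with macro names made up for illustration:

/* 4K pages: 12 offset bits, then 9 index bits per level = 48 bits. */
#define GEN8_PML4E(va)	(((va) >> 39) & 0x1ff)
#define GEN8_PDPE(va)	(((va) >> 30) & 0x1ff)
#define GEN8_PDE(va)	(((va) >> 21) & 0x1ff)
#define GEN8_PTE(va)	(((va) >> 12) & 0x1ff)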

# Summary

There’s actually an interesting thing that you start to notice after reading Distinctions from the GGTT: just about everything mapped into the GGTT shouldn’t be mapped into the PPGTT. We already stated that we try to map everything else into the PPGTT. The set of objects mapped in the GGTT and the set of objects mapped into the PPGTT are disjoint5. The patches to make this work are not yet merged. I’d put an image here to demonstrate, but I am feeling lazy and I really want to get this post out today.

Recapping:

• The Aliasing PPGTT is a single instance of the hardware feature: PPGTT.
• Aliasing PPGTT was designed as a drop-in performance replacement for the GGTT.
• GEN8 changed a lot of architectural stuff.
• The Aliasing PPGTT shouldn’t actually alias the GGTT because the objects they map are a disjoint set.

Like last time, links to all the SVGs I’ve created. Use them as you like.