November 29, 2015
For public documentation's sake, if you want to run Steam on 64-bit Ubuntu 15.10 together with the open-source radeonsi driver (the part of Mesa that implements OpenGL for recent AMD GPUs), you'll probably run into problems that can be fixed by starting Steam as

LD_PRELOAD=/usr/lib/i386-linux-gnu/ steam
(You'll have to have installed certain 32-bit packages, which you should have automatically been prompted to do while installing Steam.)

The background of this issue is that both Steam and radeonsi contain C++ code and dynamically link against the C++ runtime libraries. However, Steam comes with its own copy of these libraries, which it prefers to link against. Meanwhile, Ubuntu 15.10 is built using the most recent g++, which contains a new C++ ABI. This ABI is not supported by the C++ libraries shipped with Steam, so that by default, the radeonsi driver fails to load due to linking errors.

The workaround noted above fixes the problem by forcing the use of the newer C++ library. (Note that it does so only for 32-bit applications. If you're running 64-bit games from Steam, a similar workaround may be required.)
November 23, 2015

Long time no post!


A quick reminder for students (*) in Spain interested in participating in this year’s CUSL: the deadline for project proposals has been extended until December 1st.

You still have time to submit a proposal!

* University, high school (bachillerato), intermediate vocational training (ciclos de grado medio)…

November 20, 2015

Yesterday I pushed an implementation of a new OpenGL extension GL_EXT_shader_samples_identical to Mesa. This extension will be in the Mesa 11.1 release in a few short weeks, and it will be enabled on various Intel platforms:

  • GEN7 (Ivy Bridge, Baytrail, Haswell): Only currently effective in the fragment shader. More details below.
  • GEN8 (Broadwell, Cherry Trail, Braswell): Only currently effective in the vertex shader and fragment shader. More details below.
  • GEN9 (Skylake): Only currently effective in the vertex shader and fragment shader. More details below.

The extension hasn't yet been published in the official OpenGL extension registry, but I will take care of that before Mesa 11.1 is released.


Multisample anti-aliasing (MSAA) is a well known technique for reducing aliasing effects ("jaggies") in rendered images. The core idea is that the expensive part of generating a single pixel color happens once. The cheaper part of determining where that color exists in the pixel happens multiple times. For 2x MSAA this happens twice, for 4x MSAA this happens four times, etc. The computation cost is not increased by much, but the storage and memory bandwidth costs are increased linearly.

Some time ago, a clever person noticed that in areas where the whole pixel is covered by the triangle, all of the samples have exactly the same value. Furthermore, since most triangles are (much) bigger than a pixel, this is really common. From there it is trivial to apply some sort of simple data compression to the sample data, and all modern GPUs do this in some form. In addition to the surface that stores the data, there is a multisample control surface (MCS) that describes the compression.

On Intel GPUs, sample data is stored in n separate planes. For 4x MSAA, there are four planes. The MCS has a small table for each pixel that maps a sample to a plane. If the entry for sample 2 in the MCS is 0, then the data for sample 2 is stored in plane 0. The GPU automatically uses this to reduce bandwidth usage. When writing a pixel on the interior of a polygon (where all the samples have the same value), the MCS gets all zeros written, and the sample value is written only to plane 0.
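As a rough illustration of that mapping, here is a tiny sketch in Python. The layout assumed here (the 4x MSAA case with an 8-bit MCS entry per pixel and 2 bits per sample) is only for illustration; the exact encodings are documented in the PRMs.

def mcs_sample_to_plane(mcs_entry, sample):
    # Assumed layout for illustration: 4x MSAA, 8-bit MCS entry, 2 bits per sample.
    return (mcs_entry >> (2 * sample)) & 0x3

# Fully compressed interior pixel: the MCS is all zeros, every sample maps to
# plane 0, so only plane 0 holds real data and needs to be read or written.
assert [mcs_sample_to_plane(0x00, s) for s in range(4)] == [0, 0, 0, 0]

# A pixel on a triangle edge might instead store sample 3 in plane 1.
assert [mcs_sample_to_plane(0x40, s) for s in range(4)] == [0, 0, 0, 1]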

This does add some complexity to the shader compiler. When a shader executes the texelFetch function, several things happen behind the scenes. First, an instruction is issued to read the MCS. Then a second instruction is executed to read the sample data. This second instruction uses the sample index and the result of the MCS read as inputs.

A simple shader like

    #version 150
    uniform sampler2DMS tex;
    uniform int samplePos;

    in vec2 coord;
    out vec4 frag_color;

    void main()
    {
        frag_color = texelFetch(tex, ivec2(coord), samplePos);
    }

generates this assembly

    pln(16)         g8<1>F          g7<0,1,0>F      g2<8,8,1>F      { align1 1H compacted };
    pln(16)         g10<1>F         g7.4<0,1,0>F    g2<8,8,1>F      { align1 1H compacted };
    mov(16)         g12<1>F         g6<0,1,0>F                      { align1 1H compacted };
    mov(16)         g16<1>D         g8<8,8,1>F                      { align1 1H compacted };
    mov(16)         g18<1>D         g10<8,8,1>F                     { align1 1H compacted };
    send(16)        g2<1>UW         g16<8,8,1>F
                                sampler ld_mcs SIMD16 Surface = 1 Sampler = 0 mlen 4 rlen 8 { align1 1H };
    mov(16)         g14<1>F         g2<8,8,1>F                      { align1 1H compacted };
    send(16)        g120<1>UW       g12<8,8,1>F
                                sampler ld2dms SIMD16 Surface = 1 Sampler = 0 mlen 8 rlen 8 { align1 1H };
    sendc(16)       null<1>UW       g120<8,8,1>F
                                render RT write SIMD16 LastRT Surface = 0 mlen 8 rlen 0 { align1 1H EOT };

The ld_mcs instruction is the read from the MCS, and the ld2dms is the read from the multisample surface using the MCS data. If a shader reads multiple samples from the same location, the compiler will likely eliminate all but one of the ld_mcs instructions.

Modern GPUs also have an additional optimization. When an application clears a surface, some values are much more commonly used than others. Permutations of 0s and 1s are, by far, the most common. Bandwidth usage can further be reduced by taking advantage of this. With a single bit for each of red, green, blue, and alpha, only four bits are necessary to describe a clear color that contains only 0s and 1s. A special value could then be stored in the MCS for each sample that uses the fast-clear color. A clear operation that uses a fast-clear compatible color only has to modify the MCS.

All of this is well documented in the Programmer's Reference Manuals for Intel GPUs.

There's More

Information from the MCS can also help users of the multisample surface reduce memory bandwidth usage. Imagine a simple, straightforward shader that performs an MSAA resolve operation:

    #version 150
    uniform sampler2DMS tex;

    #define NUM_SAMPLES 4 // generate a different shader for each sample count

    in vec2 coord;
    out vec4 frag_color;

    void main()
    {
        vec4 color = texelFetch(tex, ivec2(coord), 0);

        for (int i = 1; i < NUM_SAMPLES; i++)
            color += texelFetch(tex, ivec2(coord), i);

        frag_color = color / float(NUM_SAMPLES);
    }

The problem should be obvious. On most pixels all of the samples will have the same color, but the shader still reads every sample. It's tempting to think the compiler should be able to fix this. In a very simple case like this one, that may be possible, but such an optimization would be both challenging to implement and, likely, very easy to fool.

A better approach is to just make the data available to the shader, and that is where this extension comes in. A new function textureSamplesIdenticalEXT is added that allows the shader to detect the common case where all the samples have the same value. The new, optimized shader would be:

    #version 150
    #extension GL_EXT_shader_samples_identical: enable
    uniform sampler2DMS tex;

    #define NUM_SAMPLES 4 // generate a different shader for each sample count

    #if !defined GL_EXT_shader_samples_identical
    #define textureSamplesIdenticalEXT(t, c)  false
    #endif

    in vec2 coord;
    out vec4 frag_color;

    void main()
    {
        vec4 color = texelFetch(tex, ivec2(coord), 0);

        if (!textureSamplesIdenticalEXT(tex, ivec2(coord))) {
            for (int i = 1; i < NUM_SAMPLES; i++)
                color += texelFetch(tex, ivec2(coord), i);

            color /= float(NUM_SAMPLES);
        }

        frag_color = color;
    }

The intention is that this function be implemented by simply examining the MCS data. At least on Intel GPUs, if the MCS for a pixel is all 0s, then all the samples are the same. Since textureSamplesIdenticalEXT can reuse the MCS data read by the first texelFetch call, there are no extra reads from memory. There is just a single compare and conditional branch. These added instructions can be scheduled while waiting for the ld2dms instruction to read from memory (slow), so they are practically free.

It is also tempting to use textureSamplesIdenticalEXT in conjunction with anyInvocationsARB (from GL_ARB_shader_group_vote). Such a shader might look like:

    #version 430
    #extension GL_EXT_shader_samples_identical: require
    #extension GL_ARB_shader_group_vote: require
    uniform sampler2DMS tex;

    #define NUM_SAMPLES 4 // generate a different shader for each sample count

    in vec2 coord;
    out vec4 frag_color;

    void main()
    {
        vec4 color = texelFetch(tex, ivec2(coord), 0);

        if (anyInvocationsARB(!textureSamplesIdenticalEXT(tex, ivec2(coord)))) {
            for (int i = 1; i < NUM_SAMPLES; i++)
                color += texelFetch(tex, ivec2(coord), i);

            color /= float(NUM_SAMPLES);
        }

        frag_color = color;
    }

Whether or not using anyInvocationsARB improves performance is likely to be dependent on both the shader and the underlying GPU hardware. Currently Mesa does not support GL_ARB_shader_group_vote, so I don't have any data one way or the other.


The implementation of this extension that will ship with Mesa 11.1 has three main caveats. Each of these will likely be resolved to some extent in future releases.

The extension is only effective on scalar shader units. This means on GEN7 it is effective in fragment shaders. On GEN8 and GEN9 it is only effective in vertex shaders and fragment shaders. It is supported in all shader stages, but in non-scalar stages textureSamplesIdenticalEXT always returns false. The implementation for the non-scalar stages is slightly different, and, on GEN9, the exact set of instructions depends on the number of samples. I didn't think it was likely that people would want to use this feature in a vertex shader or geometry shader, so I just didn't finish the implementation. This will almost certainly be resolved in Mesa 11.2.

The current implementation also returns a false negative for texels fully set to the fast-clear color. There are two problems with the fast-clear color. It uses a different value than the "all plane 0" case, and the size of the value depends on the number of samples. For 2x MSAA, the MCS read returns 0x000000ff, but for 8x MSAA it returns 0xffffffff.

The first problem means the compiler would need to generate additional instructions to check for "all plane 0" or "all fast-clear color." This could hurt the performance of applications that either don't use a fast-clear color or, more likely, that later draw non-clear data to the entire surface. The second problem means the compiler would need to do state-based recompiles when the number of samples changes.

In the end, we decided that "all plane 0" was by far the most common case, so we have ignored the "all fast-clear color" case for the time being. We are still collecting data from applications, and we're working on several uses of this functionality inside our driver. In future versions we may implement a heuristic to determine whether or not to check for the fast-clear color case.

As mentioned above, Mesa does not currently support GL_ARB_shader_group_vote. Applications that want to use textureSamplesIdenticalEXT on Mesa will need paths that do not use anyInvocationsARB for at least the time being.


As stated by issue #3, the extension still needs to gain SPIR-V support. This extension would be just as useful in Vulkan and OpenCL as it is in OpenGL.

At some point there is likely to be a follow-on extension that provides more MCS data to the shader in a more raw form. As stated in issue #2 and previously in this post, there are a few problems with providing raw MCS data. The biggest problem is how the data is returned. Each sample count needs a different amount of data. Current 8x MSAA surfaces have 32 bits of MCS data per pixel. Current 16x MSAA MCS surfaces have 64 bits per pixel. Future 32x MSAA, should that ever exist, would need 192 bits. Additionally, there would need to be a set of new texelFetch functions that take both a sample index and the MCS data. This, again, has problems with variable data size.

Applications would also want to query things about the MCS values. How many times is plane 0 used? Which samples use plane 2? What is the highest plane used? There could be other useful queries. I can imagine that a high quality, high performance multisample resolve filter could want all of this information. Since the data changes based on the sample count and could change on future hardware, the future extension really should not directly expose the encoding of the MCS data. How should it provide the data? I'm expecting to write some demo applications and experiment with a bunch of different things. Obviously, this is an open area of research.

November 19, 2015

So many of you have probably seen that RHEL 7.2 is out today. There are many important updates in this release, some of them detailed in the official RHEL 7.2 press release.

One thing however which you would only discover if you start digging into the 7.2 update is that it's the first time in RHEL history that we are doing a full-scale desktop update in a point release. We shipped RHEL 7.0 with GNOME 3.8 and in RHEL 7.2 we are updating it to GNOME 3.14. This brings a lot of major new features into RHEL, like the work we did on improved HiDPI support, improved touch and gesture support, GNOME Software, the improved system status area and so on. We plan on updating the desktop further in later RHEL 7.x point releases.

This change of policy is of course important to the many RHEL Workstation customers we have, but I also hope it will make RHEL Workstation and also CentOS Workstation more attractive options to those in the community who have been looking for an LTS version of Fedora. This policy change gives you the rock solid foundation of RHEL and the RHEL kernel and combines it with a very well tested yet fairly new desktop release. So if you feel Fedora is moving too quickly, yet have felt that RHEL on the other hand has been moving too slowly, we hope that with this change to RHEL we have found a sweet compromise.

We will of course also keep doing regular application updates in RHEL 7.x, just like we started doing in RHEL 6.x, giving you up-to-date versions of things like LibreOffice, Firefox, Thunderbird and more.

November 18, 2015

The Event Loop API of libsystemd

When we began working on systemd we built it around a hand-written ad-hoc event loop, wrapping Linux epoll. The more our project grew the more we realized the limitations of using raw epoll:

  • As we used timerfd for our timer events, each event source cost one file descriptor and we had many of them! File descriptors are a scarce resource on UNIX, as RLIMIT_NOFILE is typically set to 1024 or similar, limiting the number of available file descriptors per process to 1021, which isn't particularly a lot.

  • Ordering of event dispatching became a nightmare. In many cases, we wanted to make sure that a certain kind of event would always be dispatched before another kind of event, if both happen at the same time. For example, when the last process of a service dies, we might be notified about that via a SIGCHLD signal, via an sd_notify() "STATUS=" message, and via a control group notification. We wanted to get these events in the right order, to know when it's safe to process and subsequently release the runtime data systemd keeps about the service or process: it shouldn't be done if there are still events about it pending.

  • For each program we added to the systemd project we noticed we were adding similar code, over and over again, to work with epoll's complex interfaces. For example, finding the right file descriptor and callback function to dispatch an epoll event to, without running into invalidated pointer issues is outright difficult and requires non-trivial code.

  • Integrating child process watching into our event loops was much more complex than one could hope, and even more so if child process events should be ordered against each other and unrelated kinds of events.

Eventually, we started working on sd-bus. At the same time we decided to seize the opportunity, put together a proper event loop API in C, and then not only port sd-bus on top of it, but also the rest of systemd. The result of this is sd-event. After almost two years of development we declared sd-event stable in systemd version 221, and published it as official API of libsystemd.


sd-event.h, of course, is not the first event loop API around, and it doesn't implement any really novel concepts. When we started working on it we tried to do our homework, and checked the various existing event loop APIs, maybe looking for candidates to adopt instead of doing our own, and to learn about the strengths and weaknesses of the various implementations existing. Ultimately, we found no implementation that could deliver what we needed, or where it would be easy to add the missing bits: as usual in the systemd project, we wanted something that allows us access to all the Linux-specific bits, instead of limiting itself to the least common denominator of UNIX. We weren't looking for an abstraction API, but simply one that makes epoll usable in system code.

With this blog story I'd like to take the opportunity to introduce you to sd-event, and explain why it might be a good candidate to adopt as event loop implementation in your project, too.

So, here are some features it provides:

  • I/O event sources, based on epoll's file descriptor watching, including edge triggered events (EPOLLET). See sd_event_add_io(3).

  • Timer event sources, based on timerfd_create(), supporting the CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_BOOTTIME clocks, as well as the CLOCK_REALTIME_ALARM and CLOCK_BOOTTIME_ALARM clocks that can resume the system from suspend. When creating timer events a required accuracy parameter may be specified which allows coalescing of timer events to minimize power consumption. For each clock only a single timer file descriptor is kept, and all timer events are multiplexed with a priority queue. See sd_event_add_time(3).

  • UNIX process signal events, based on signalfd(2), including full support for real-time signals, and queued parameters. See sd_event_add_signal(3).

  • Child process state change events, based on waitid(2). See sd_event_add_child(3).

  • Static event sources, of three types: defer, post and exit, for invoking calls in each event loop iteration, after other event sources, or at event loop termination. See sd_event_add_defer(3).

  • Event sources may be assigned a 64-bit priority value that controls the order in which event sources are dispatched if multiple are pending simultaneously. See sd_event_source_set_priority(3).

  • The event loop may automatically send watchdog notification messages to the service manager. See sd_event_set_watchdog(3).

  • The event loop may be integrated into foreign event loops, such as the GLib one. The event loop API is hence composable, the same way the underlying epoll logic is. See sd_event_get_fd(3) for an example.

  • The API is fully OOM safe.

  • A complete set of documentation in UNIX man page format is available, with sd-event(3) as the entry page.

  • It's pretty widely available, and requires no extra dependencies. Since systemd is built on it, most major distributions ship the library in their default install set.

  • After two years of development, and after being used in all of systemd's components, it has received a fair share of testing already, even though we only recently decided to declare it stable and turned it into a public API.

Note that sd-event has some potential drawbacks too:

  • If portability is essential to you, sd-event is not your best option. sd-event is a wrapper around Linux-specific APIs, and that's visible in the API. For example: our event callbacks receive structures defined by Linux-specific APIs such as signalfd.

  • It's a low-level C API, and it doesn't isolate you from the OS underpinnings. While I like to think that it is relatively nice and easy to use from C, it doesn't compromise on exposing the low-level functionality. It just fills the gaps in what's missing between epoll, timerfd, signalfd and related concepts, and it does not hide that away.

Either way, I believe that sd-event is a great choice when looking for an event loop API, in particular if you work on system-level or embedded software, where functionality like timer coalescing or watchdog support matters.

Getting Started

Here's a short example how to use sd-event in a simple daemon. In this example, we'll not just use sd-event.h, but also sd-daemon.h to implement a system service.

#include <alloca.h>
#include <endian.h>
#include <errno.h>
#include <netinet/in.h>
#include <signal.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

#include <systemd/sd-daemon.h>
#include <systemd/sd-event.h>

static int io_handler(sd_event_source *es, int fd, uint32_t revents, void *userdata) {
        void *buffer;
        ssize_t n;
        int sz;

        /* UDP enforces a somewhat reasonable maximum datagram size of 64K, we can just allocate the buffer on the stack */
        if (ioctl(fd, FIONREAD, &sz) < 0)
                return -errno;
        buffer = alloca(sz);

        n = recv(fd, buffer, sz, 0);
        if (n < 0) {
                if (errno == EAGAIN)
                        return 0;

                return -errno;
        }

        if (n == 5 && memcmp(buffer, "EXIT\n", 5) == 0) {
                /* Request a clean exit */
                sd_event_exit(sd_event_source_get_event(es), 0);
                return 0;
        }

        fwrite(buffer, 1, n, stdout);
        return 0;
}

int main(int argc, char *argv[]) {
        union {
                struct sockaddr_in in;
                struct sockaddr sa;
        } sa;
        sd_event_source *event_source = NULL;
        sd_event *event = NULL;
        int fd = -1, r;
        sigset_t ss;

        r = sd_event_default(&event);
        if (r < 0)
                goto finish;

        if (sigemptyset(&ss) < 0 ||
            sigaddset(&ss, SIGTERM) < 0 ||
            sigaddset(&ss, SIGINT) < 0) {
                r = -errno;
                goto finish;
        }

        /* Block SIGTERM first, so that the event loop can handle it */
        if (sigprocmask(SIG_BLOCK, &ss, NULL) < 0) {
                r = -errno;
                goto finish;
        }

        /* Let's make use of the default handler and "floating" reference features of sd_event_add_signal() */
        r = sd_event_add_signal(event, NULL, SIGTERM, NULL, NULL);
        if (r < 0)
                goto finish;
        r = sd_event_add_signal(event, NULL, SIGINT, NULL, NULL);
        if (r < 0)
                goto finish;

        /* Enable automatic service watchdog support */
        r = sd_event_set_watchdog(event, true);
        if (r < 0)
                goto finish;

        /* Create a non-blocking UDP socket to listen on */
        fd = socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0);
        if (fd < 0) {
                r = -errno;
                goto finish;
        }
 = (struct sockaddr_in) {
                .sin_family = AF_INET,
                .sin_port = htobe16(7777),
        };

        if (bind(fd, &, sizeof(sa)) < 0) {
                r = -errno;
                goto finish;
        }

        r = sd_event_add_io(event, &event_source, fd, EPOLLIN, io_handler, NULL);
        if (r < 0)
                goto finish;

        (void) sd_notifyf(false,
                          "READY=1\n"
                          "STATUS=Daemon startup completed, processing events.");

        r = sd_event_loop(event);

finish:
        event_source = sd_event_source_unref(event_source);
        event = sd_event_unref(event);

        if (fd >= 0)
                (void) close(fd);

        if (r < 0)
                fprintf(stderr, "Failure: %s\n", strerror(-r));

        return r < 0 ? EXIT_FAILURE : EXIT_SUCCESS;
}

The example above shows how to write a minimal UDP/IP server, that listens on port 7777. Whenever a datagram is received it outputs its contents to STDOUT, unless it is precisely the string EXIT\n in which case the service exits. The service will react to SIGTERM and SIGINT and do a clean exit then. It also notifies the service manager about its completed startup, if it runs under a service manager. Finally, it sends watchdog keep-alive messages to the service manager if it asked for that, and if it runs under a service manager.

When run as a systemd service this service's STDOUT will be connected to the logging framework of course, which means the service can act as a minimal UDP-based remote logging service.

To compile and link this example, save it as event-example.c, then run:

$ gcc event-example.c -o event-example `pkg-config --cflags --libs libsystemd`

For a first test, simply run the resulting binary from the command line, and test it against the following netcat command line:

$ nc -u localhost 7777

For the sake of brevity error checking is minimal, and in a real-world application should, of course, be more comprehensive. However, it hopefully gets the idea across how to write a daemon that reacts to external events with sd-event.

For further details on the functions used in the example above, please consult the manual pages: sd-event(3), sd_event_exit(3), sd_event_source_get_event(3), sd_event_default(3), sd_event_add_signal(3), sd_event_set_watchdog(3), sd_event_add_io(3), sd_notifyf(3), sd_event_loop(3), sd_event_source_unref(3), sd_event_unref(3).


So, is this the event loop to end all other event loops? Certainly not. I actually believe in "event loop plurality". There are many reasons for that, but most importantly: sd-event is supposed to be an event loop suitable for writing a wide range of applications, but it's definitely not going to solve all event loop problems. For example, while the priority logic is important for many use cases, it comes with drawbacks for others: if not used carefully, high-priority event sources can easily starve low-priority event sources. Also, in order to implement the priority logic, sd-event needs to linearly iterate through the event structures returned by epoll_wait(2) to sort the events by their priority, resulting in worst case O(n*log(n)) complexity on each event loop wakeup (for n = number of file descriptors). Then, to implement priorities fully, sd-event only dispatches a single event before going back to the kernel and asking for new events. sd-event will hence not provide the theoretically possible best scalability to huge numbers of file descriptors. Of course, this could be optimized, by improving epoll, and making it support how today's event loops actually work (after all, this is a problem all event loops that implement priorities -- including GLib's -- have to deal with), but even then: the design of sd-event is focused on running one event loop per thread, and it dispatches events strictly ordered. In many other important use cases a very different design is preferable: one where events are distributed to a set of worker threads and are dispatched out-of-order.

Hence, don't mistake sd-event for what it isn't. It's not supposed to unify everybody on a single event loop. It's just supposed to be a very good implementation of an event loop suitable for a large part of the typical use cases.

Note that our APIs, including sd-bus, integrate nicely into sd-event event loops, but do not require it, and may be integrated into other event loops too, as long as they support watching for time and I/O events.

And that's all for now. If you are considering using sd-event for your project and need help or have questions, please direct them to the systemd mailing list.

November 16, 2015

Writing programs is fun, but making them fast can be a pain. Python programs are no exception to that, but the basic profiling toolchain is actually not that complicated to use. Here, I would like to show you how you can quickly profile and analyze your Python code to find what part of the code you should optimize.

What's profiling?

Profiling a Python program means doing a dynamic analysis that measures the execution time of the program and everything that composes it. That means measuring the time spent in each of its functions. This will give you data about where your program is spending time, and what area might be worth optimizing.

It's a very interesting exercise. Many people focus on local optimizations, such as determining e.g. which of the Python functions range or xrange is going to be faster. It turns out that knowing which one is faster may never be an issue in your program, and that the time gained by one of the functions above might not be worth the time you spend researching that, or arguing about it with your colleague.

Trying to blindly optimize a program without measuring where it is actually spending its time is a useless exercise. Following your guts alone is not always sufficient.

There are many types of profiling, as there are many things you can measure. In this exercise, we'll focus on CPU utilization profiling, meaning the time spent by each function executing instructions. Obviously, we could do many more kinds of profiling and optimizations, such as memory profiling, which would measure the memory used by each piece of code – something I talk about in The Hacker's Guide to Python.


Since Python 2.5, Python provides a C module called cProfile which has a reasonable overhead and offers a good enough feature set. The basic usage goes down to:

>>> import cProfile
>>> cProfile.run('2 + 2')
2 function calls in 0.000 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}

Though you can also run a script with it, which turns out to be handy:

$ python -m cProfile -s cumtime
72270 function calls (70640 primitive calls) in 4.481 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.004 0.004 4.481 4.481<module>)
1 0.001 0.001 4.296 4.296
3 0.000 0.000 4.286 1.429
3 0.000 0.000 4.268 1.423
4/3 0.000 0.000 3.816 1.272
4 0.000 0.000 2.965 0.741
4 0.000 0.000 2.962 0.740
4 0.000 0.000 2.961 0.740
2 0.000 0.000 2.675 1.338
30 0.000 0.000 1.621 0.054
30 0.000 0.000 1.621 0.054
30 1.621 0.054 1.621 0.054 {method 'read' of '_ssl._SSLSocket' objects}
1 0.000 0.000 1.611 1.611
4 0.000 0.000 1.572 0.393
4 0.000 0.000 1.572 0.393
60 0.000 0.000 1.571 0.026
4 0.000 0.000 1.571 0.393
1 0.000 0.000 1.462 1.462
1 0.000 0.000 1.462 1.462
1 0.000 0.000 1.462 1.462
1 0.000 0.000 1.459 1.459

This prints out all the functions called, with the time spent in each and the number of times they have been called.
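If you want the same information from inside a program, cProfile can also be driven programmatically, and the standard pstats module lets you sort and trim the report. Here is a minimal sketch, where do_work is just a placeholder for the code you actually want to measure:

import cProfile
import pstats

def do_work():
    # Placeholder workload standing in for the code you actually want to profile.
    sum(i * i for i in range(100000))

profiler = cProfile.Profile()

stats = pstats.Stats(profiler)
stats.sort_stats('cumulative').print_stats(10)  # show the 10 costliest entries by cumulative time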

Advanced visualization with KCacheGrind

While useful, the output format is very basic and does not make it easy to get a good picture of complete programs. For more advanced visualization, I leverage KCacheGrind. If you have done any C programming and profiling in the last years, you may have used it, as it is primarily designed as a front-end for Valgrind-generated call-graphs.

In order to use it, you need to generate a cProfile result file, then convert it to the KCacheGrind format. To do that, I use pyprof2calltree.

$ python -m cProfile -o myscript.cprof
$ pyprof2calltree -k -i myscript.cprof

And the KCacheGrind window magically appears!
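If you would rather drive that conversion from Python instead of the shell, pyprof2calltree also ships a small Python API; as far as I remember it exposes convert() and visualize() helpers, but treat the exact names as an assumption and check the project's documentation:

import pstats
from pyprof2calltree import convert, visualize  # helper names assumed, see the pyprof2calltree docs

stats = pstats.Stats('myscript.cprof')
convert(stats, 'callgrind.out.myscript')  # write a KCacheGrind-compatible file
visualize(stats)                          # or open the result in KCacheGrind directly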

Concrete case: Carbonara optimization

I was curious about the performance of Carbonara, the small timeseries library I wrote for Gnocchi. I decided to do some basic profiling to see if there was any obvious optimization to do.

In order to profile a program, you need to run it. But running the whole program in profiling mode can generate a lot of data that you don't care about, and adds noise to what you're trying to understand. Since Gnocchi has thousands of unit tests and a few for Carbonara itself, I decided to profile the code used by these unit tests, as it's a good reflection of basic features of the library.

Note that this is a good strategy for a curious and naive first-pass profiling. There's no way that you can make sure that the hotspots you will see in the unit tests are the actual hotspots you will encounter in production. Therefore, a profiling in conditions and with a scenario that mimics what's seen in production is often a necessity if you need to push your program optimization further and want to achieve perceivable and valuable gain.

I activated cProfile using the method described above, creating a cProfile.Profile object around my tests (I actually started to implement that in testtools). I then ran KCacheGrind as described above. Using KCacheGrind, I generated the following figures.
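For reference, wrapping a single test in a cProfile.Profile object only takes a few lines; this is a minimal sketch with made-up names, not the actual testtools integration:

import cProfile

def run_profiled(test_callable, output_file="test_fetch.cprof"):
    # Profile one callable and dump the raw stats to a file that
    # pyprof2calltree -k -i can consume. Names here are illustrative only.
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        test_callable()
    finally:
        profiler.disable()
        profiler.dump_stats(output_file)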

The test I profiled here is called test_fetch and is pretty easy to understand: it puts data in a timeseries object, and then fetches the aggregated result. The above list shows that 88 % of the ticks are spent in set_values (44 ticks out of 50). This function is used to insert values into the timeseries, not to fetch the values. That means that it's really slow to insert data, and pretty fast to actually retrieve them.

Reading the rest of the list indicates that several functions share the rest of the ticks, update, _first_block_timestamp, _truncate, _resample, etc. Some of the functions in the list are not part of Carbonara, so there's no point in looking to optimize them. The only thing that can be optimized is, sometimes, the number of times they're called.

The call graph gives me a bit more insight about what's going on here. Using my knowledge about how Carbonara works, I don't think that the whole stack on the left for _first_block_timestamp makes much sense. This function is supposed to find the first timestamp for an aggregate, e.g. with a timestamp of 13:34:45 and a period of 5 minutes, the function should return 13:30:00. The way it works currently is by calling the resample function from Pandas on a timeseries with only one element, but that seems to be very slow. Indeed, currently this function represents 25 % of the time spent by set_values (11 ticks out of 44).

Fortunately, I recently added a small function called _round_timestamp that does exactly what _first_block_timestamp needs, without calling any Pandas function, so no resample. So I ended up rewriting that function this way:

def _first_block_timestamp(self):
- ts = self.ts[-1:].resample(self.block_size)
- return (ts.index[-1] - (self.block_size * self.back_window))
+ rounded = self._round_timestamp(self.ts.index[-1], self.block_size)
+ return rounded - (self.block_size * self.back_window)
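To give an idea of why this is so much cheaper, the principle behind such a rounding helper is plain integer arithmetic on the timestamp instead of a full Pandas resample. The following is only an illustration of the idea, not the actual Carbonara code:

import pandas

def round_timestamp(timestamp, freq):
    # Illustration only: round `timestamp` down to the previous multiple of `freq`
    # using integer arithmetic on nanoseconds, avoiding a pandas resample() call.
    freq_ns = pandas.Timedelta(freq).value        # block size in nanoseconds
    value_ns = pandas.Timestamp(timestamp).value  # nanoseconds since the epoch
    return pandas.Timestamp((value_ns // freq_ns) * freq_ns)

# e.g. round_timestamp("2015-11-16 13:34:45", "5min") gives Timestamp('2015-11-16 13:30:00')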

And then I re-ran the exact same test to compare the output of cProfile.

The list of functions looks quite different this time. The share of time spent in set_values dropped from 88 % to 71 %.

The call stack for set_values shows that pretty well: we can't even see the _first_block_timestamp function as it is so fast that it totally disappeared from the display. It's now being considered insignificant by the profiler.

So we just sped up the whole insertion process of values into Carbonara by a nice 25 % in a few minutes. Not that bad for a first naive pass, right?

November 15, 2015

Discworld Noir was a superb adventure game, but is also notoriously unreliable, even in Windows on real hardware; using Wine is just not going to work. After many attempts at bringing it back into working order, I've settled on an approach that seems to work: now that qemu and libvirt have made virtualization and emulation easy, I can run it in a version of Windows that was current at the time of its release. Unfortunately, Windows 98 doesn't virtualize particularly well either, so this still became a relatively extensive yak-shaving exercise.


These instructions assume that /srv/virt is a suitable place to put disk images, but you can use anywhere you want.

The emulated PC

After some trial and error, it seems to work if I configure qemu to emulate the following:

  • Fully emulated hardware instead of virtualization (qemu-system-i386 -no-kvm)
  • Intel Pentium III
  • Intel i440fx-based motherboard with ACPI
  • Real-time clock in local time
  • No HPET
  • 256 MiB RAM
  • IDE primary master: IDE hard disk (I used 30 GiB, which is massively overkill for this game; qemu can use sparse files so it actually ends up less than 2 GiB on the host system)
  • IDE primary slave, secondary master, secondary slave: three CD-ROM drives
  • PS/2 keyboard and mouse
  • Realtek AC97 sound card
  • Cirrus video card with 16 MiB video RAM

A modern laptop CPU is an order of magnitude faster than what Discworld Noir needs, so full emulation isn't a problem, despite being inefficient.

There is deliberately no networking, because Discworld Noir doesn't need it, and a 17 year old operating system with no privilege separation is very much not safe to use on the modern Internet!

Software needed

  • Windows 98 installation CD-ROM as a .iso file (cp /dev/cdrom windows98.iso) - in theory you could also use a real optical drive, but my laptop doesn't usually have one of those. I used the OEM disc, version 4.10.1998 (that's the original Windows 98, not the Second Edition), which came with a long-dead PC, and didn't bother to apply any patches.
  • A Windows 98 license key. Again, I used an OEM key from a past PC.
  • A complete set of Discworld Noir (English) CD-ROMs as .iso files. I used the UK "Sold Out Software" budget release, on 3 CDs.
  • A multi-platform Realtek AC97 audio driver.

Windows 98 installation

It seems to be easiest to do this bit by running qemu-system-i386 manually:

qemu-img create -f qcow2 /srv/virt/discworldnoir.qcow2 30G
qemu-system-i386 -hda /srv/virt/discworldnoir.qcow2 \
    -drive media=cdrom,format=raw,file=/srv/virt/windows98.iso \
    -no-kvm -vga cirrus -m 256 -cpu pentium3 -localtime

Don't start the installation immediately. Instead, boot the installation CD to a DOS prompt with CD-ROM support. From here, run

fdisk

and create a single partition filling the emulated hard disk. When finished, hard-reboot the virtual machine (press Ctrl+C on the qemu-system-i386 process and run it again).

The DOS FORMAT.COM utility is on the Windows CD-ROM but not in the root directory or the default %PATH%, so you'll have to run:

d:\win98\format c:

to create the FAT filesystem. You might have to reboot again at this point.

The reason for doing this the hard way is that the Windows 98 installer doesn't detect qemu as supporting ACPI. You want ACPI support so that Windows will halt the CPU from its idle loop, instead of occupying a CPU core with a busy-loop. To get that, boot to a DOS prompt again, and use:

setup /p j /iv

/p j forces ACPI support (Thanks to "Richard S" on the Virtualbox forums for this tip.) /iv is unimportant, but it disables the annoying "billboards" during installation, which advertised once-exciting features like support for dial-up modems and JPEG wallpaper.

I used a "Typical" installation; there didn't seem to be much point in tweaking the installed package set when everything is so small by modern standards.

Windows 98 has built-in support for the Cirrus VGA card that we're emulating, so after a few reboots, it should be able to run in a semi-modern resolution and colour depth. Discworld Noir apparently prefers a 640 × 480 × 16-bit video mode, so right-click on the desktop background, choose Properties and set that up.

Audio drivers

This is the part that took me the longest to get working. Of the sound cards that qemu can emulate, Windows 98 only supports the SoundBlaster 16 out of the box. Unfortunately, the Soundblaster 16 emulation in qemu is incomplete, and in particular version 2.1 (as shipped in Debian 8) has a tendency to make Windows lock up during boot.

I've seen advice in various places to emulate an Ensoniq ES1370 (SoundBlaster AWE 64), but that didn't work for me: one of the drivers I tried caused Windows to lock up at a black screen during boot, and the other didn't detect the emulated hardware.

The next-oldest sound card that qemu can emulate is a Realtek AC97, which was often found integrated into motherboards in the late 1990s. This one seems to work, with the "A400" driver bundle linked above. For Windows 98 first edition, you need a driver bundle that includes the old "VXD" drivers, not just the "WDM" drivers supported by Second Edition and newer.

The easiest way to get that into qemu seems to be to turn it into a CD image:

genisoimage -o /srv/virt/discworldnoir-drivers.iso WDM_A400.exe
qemu-system-i386 -hda /srv/virt/discworldnoir.qcow2 \
    -drive media=cdrom,format=raw,file=/srv/virt/windows98.iso \
    -drive media=cdrom,format=raw,file=/srv/virt/discworldnoir-drivers.iso \
    -no-kvm -vga cirrus -m 256 -cpu pentium3 -localtime -soundhw ac97

Run the installer from E:, then reboot with the Windows 98 CD inserted, and Windows should install the driver.

Installing Discworld Noir

Boot up the virtual machine with CD 1 in the emulated drive:

qemu-system-i386 -hda /srv/virt/discworldnoir.qcow2 \
    -drive media=cdrom,format=raw,file=/srv/virt/DWN_ENG_1.iso \
    -no-kvm -vga cirrus -m 256 -cpu pentium3 -localtime -soundhw ac97

You might be thinking "... why not insert all three CDs into D:, E: and F:?" but the installer expects subsequent disks to appear in the same drive where CD 1 was initially, so that won't work. Instead, when prompted for a new CD, switch to the qemu monitor with Ctrl+Alt+2 (note that this is 2, not F2). At the (qemu) prompt, use info block to see a list of emulated drives, then issue a command like

change ide0-cd1 /srv/virt/DWN_ENG_2.iso

to swap the CD. Then switch back to Windows' console with Ctrl+Alt+1 and continue installation. I used a Full installation of Discworld Noir.

Transferring the virtual machine to GNOME Boxes

Having finished the "control freak" phase of installation, I wanted a slightly more user-friendly way to run this game, so I transferred the virtual machine to be used by libvirtd, which is the backend for both GNOME Boxes and virt-manager:

virsh create discworldnoir.xml

Here is the configuration I used. It's a mixture of automatic configuration from virt-manager, and hand-edited configuration to make it match the qemu-system-i386 command-line.

Running the game

If all goes well, you should now see a discworldnoir virtual machine in GNOME Boxes, in which you can boot Windows 98 and play the game. Have fun!

November 11, 2015

So yesterday, the 10th of November, was the official launch day of the Steam Machines. These are meant to be dedicated game machines for the living room, taking advantage of the Steam ecosystem to take on the Xbox One and PS4.

But for us in the Linux community these machines are more than that, they are an important part of helping us break into a broader market by paving the way for even more games and more big budget games coming to our platform. Playing computer games is not just a niche, it is a mainstream activity these days, and not having access to games on our platform has cost us quite a few users and potential contributors over the years. I have for instance met a lot of computer science students who ended up not using Linux as their main operating system during their studies simply due to the lack of games on the platform. Instead Linux got relegated to that thing in a VM only run when you needed it for an assignment.

Steam for Linux and SteamOS can and will be important pieces of breaking through that. SteamOS and the Steam Machines are also important for the Linux community for another reason. They can help funnel more resources from hardware companies into Linux drivers and support. I know for instance that all three major GPU vendors have increased their Linux driver investments due to SteamOS.

So I want to congratulate Valve on the launch of the first Steam Machines and strongly recommend everyone in the community to get a Steam machine for their home!

People who have had a chance to test the hardware have recommended that I get one of the Alienware SteamOS systems, so I am passing that recommendation onwards.

As a sidenote we are also working on a few features in Fedora Workstation to make it a better host for Steam and Steam games. This includes our work on GL Dispatch and Optimus support, as covered in a previous blog post, and libratbag, our new library for handling gaming mice under Linux. And finally we are working on a few bug fixes in Fedora, related to C++ ABI issues, to make it an even better host for the Steam client.

November 08, 2015

systemd.conf 2015 is Over Now!

Last week our first systemd.conf conference took place at betahaus, in Berlin, Germany. With almost 100 attendees, a dense schedule of 23 high-quality talks stuffed into a single track on just two days, a productive hackfest and numerous consumed Club-Mates I believe it was quite a success!

If you couldn't attend the conference, you may watch all talks on our YouTube Channel. The slides are available online, too.

Many photos from the conference are available on the Google Events Page. Enjoy!

I'd specifically like to thank Daniel Mack, Chris Kühl and Nils Magnus for running the conference, and making sure that it worked out as smoothly as it did! Thank you very much, you did a fantastic job!

I'd also specifically like to thank the CCC Video Operation Center folks for the excellent video coverage of the conference. Not only did they implement a live-stream for the entire talks part of the conference, but also cut and uploaded videos of all talks to our YouTube Channel within the same day (in fact, within a few hours after the talks finished). That's quite an impressive feat!

The folks from LinuxTag e.V. put a lot of time and energy in the organization. It was great to see how well this all worked out! Excellent work!

(BTW, LinuxTag e.V. and the CCC Video Operation Center folks are willing to help with the organization of Free Software community events in Germany (and Europe?). Hence, if you need an entity that can do the financial work and other stuff for your Free Software project's conference, consider pinging LinuxTag, they might be willing to help. Similarly, if you are organizing such an event and are thinking about providing video coverage, consider pinging the CCC VOC folks! Both of them get our best recommendations!)

I'd also like to thank our conference sponsors! Specifically, we'd like to thank our Gold Sponsors Red Hat and CoreOS for their support. We'd also like to thank our Silver Sponsor Codethink, and our Bronze Sponsors Pengutronix, Pantheon, Collabora, Endocode, the Linux Foundation, Samsung and Travelping, as well as our Cooperation Partners LinuxTag and, and our Media Partner

Last but not least I'd really like to thank our speakers and attendees for presenting and participating in the conference. Of course, we put the conference together specifically for you, and we really hope you had as much fun at it as we did!

Thank you all for attending, supporting, and organizing systemd.conf 2015! We are looking forward to seeing you and working with you again at systemd.conf 2016!


November 06, 2015

Another major piece of engineering that I have covered that we did for Fedora Workstation 23 is the GTK3 port of LibreOffice. Those of you who follow Caolán McNamara's blog are probably aware of the details. The motivation for the port wasn't improved look and feel integration, there were easier ways to achieve that, but to help us have LibreOffice deal well with a range of new technologies we are supporting in Fedora Workstation, namely: touch support, Wayland support and HiDPI.

That ongoing work is now available in Fedora Workstation 23 if you install the ‘libreoffice-gtk3’ package. You have to install this using a terminal and dnf, as this is an early adopter technology, but we would love for as many of you as possible to try it and report any issues you have, either to the upstream LibreOffice bugzilla or to the Fedora bugzilla against the LibreOffice component. Testing of how it works under X and how it works under Wayland is both more than welcome. Be aware that it is ‘tech preview’ technology, so you might want to remove the libreoffice-gtk3 package again if you find that it hinders your effective use of LibreOffice. For instance there is a quite bad titlebar bug you would experience under Wayland that we hope to fix with an update.

If you specifically want to test out the touch support there are two features implemented so far, both in Impress. One is to allow you to switch slides in Impress with a swiping gesture, and the second is long press: you can bring up the Impress slide context menu with it and switch to e.g. drawing mode. We would love feedback on what gestures you would like to see supported in various LibreOffice applications, so don’t be shy about filing enhancement bug reports with your suggestions.

HiDPI wasn’t a primary focus of the porting effort, it has to be said, but we do expect that it should also make further improving the HiDPI support in LibreOffice easier. Another nice little bonus of the port is that the GTK Inspector can now be used with LibreOffice.

A big thanks to Caolán for this work.

Not that I'm really running after more gadgets, but sometimes, there is a need that could only be soothed through new hardware.

Bluetooth UE roll

Got this for my wife, to play music when staying out on the quays of the Rhône, playing music in the kitchen (from a phone or computer), or when she's at the photo lab.

It works well with iOS, MacOS X and Linux. It's very easy to use (whether it's paired or connected is completely obvious), and charging doesn't need specific cables (USB!).

I'll need to borrow it to add battery reporting for those devices though. You can find a full review on Ars Technica.

Sugru (!)

Not a gadget per se, but I bought some, used it to fix up a bunch of cables, repair some knickknacks, and do some DIY. Highly recommended, especially given the current price of their starter packs.

15-pin to USB Joystick adapter

It's apparently from Ckeyin, but you'll find the exact same box from other vendors. Made my old Gravis joystick work, in the hope that I can make it work with DOSBox and my 20-year old copy of X-Wing vs. Tie Fighter.

Microsoft Surface ARC Mouse

That one was given to me, for testing, works well with Linux. Again, we'll need to do some work to report the battery. I only ever use it when travelling, as the batteries last for absolute ages.

Logitech K750 keyboard

Bought this nearly two years ago, and this is one of my best buys. My desk is close to a window, so it's wireless but I never need to change the batteries or think about charging it. GNOME also supports showing the battery status in the Power panel.

Logitech T650 touchpad

Got this one in sale (17€), to replace my Logitech trackball (one of its buttons broke...). It works great, and can even get you shell gestures when run in Wayland. I'm certainly happy to have one less cable running across my desk, and reuses the same dongle as the keyboard above.

If you use more than one of these devices, you might be interested in this bug to make it easier to support multiple Logitech "Unifying" devices.

ClicLite charger

Got this from a design shop in Berlin. It should probably have been cheaper than what I paid for it, but it's certainly pretty useful. Charges up my phone by about 20%, it's small, and charges up at the same time as my keyboard (above).

Dell S2340T

Bought about 2 years ago, to replace the monitor I had in an all-in-one (Lenovo all-in-ones, never buy that junk).

Nowadays, the resolution would probably be considered a bit on the low side, and the touchscreen mesh would show for hardcore photography work. It's good enough for videos though and the speaker reaches my sitting position.

It's only been possible to use the USB cable for graphics for a couple of months, and it's probably not what you want if you're trying to keep CPU usage low on your machine, but it works for Fedora with this RPM I made. Talk to me if you can help get it into RPMFusion.

Shame about the huge power brick, but a little bonus for the builtin Ethernet adapter.

Surface 3

This is probably the biggest ticket item. Again, I didn't pay full price for it, thanks to coupons, rewards, and all. The work to getting Linux and GNOME to play well with it is still ongoing, and rather slow.

I won't comment too much on Windows either, but rather on what it should be like once Linux runs on it.

I really enjoy the industrial design, maybe even the slanted edges, but one has to wonder why they made the USB power adapter not sit flush with the edge when plugged in.

I've used it a couple of times (under Windows, sigh) to read Pocket as I do on my iPad 1 (yes, the first one), or stream videos to the TV using Flash, without the tablet getting hot, or too slow either. I also like the fact that there's a real USB(-A) port that's separate from the charging port. The micro SD card port is nicely placed under the kickstand, hard enough to reach to avoid it escaping the tablet when lugged around.

The keyboard, given the thickness of it, and the constraints of using it as a cover, is good enough for light use, when travelling for example, and the layout isn't as awful as on, say, a Thinkpad Carbon X1 2nd generation. The touchpad is a bit on the small side though it would have been hard to make it any bigger given the cover's dimensions.

I would however recommend getting a Surface Pro if you want things to work right now (or at least soon). The one-before-last version, the Surface Pro 3, is probably a good target.
November 04, 2015

Finally, Gnocchi 1.3.0 is out. This is our final release, more or less matching the OpenStack 6-month schedule, and it concludes the Liberty development cycle.

This release was supposed to be released a few weeks earlier, but our integration test got completely blocked for several days just the week before the OpenStack Mitaka summit.

New website

We built a new dedicated website for Gnocchi at We want to promote Gnocchi outside of the OpenStack bubble, as it is a useful timeseries database on its own that can work without the rest of the stack. We'll try to improve the documentation. If you're curious, feel free to check it out and report anything you miss!

The speed bump

Obviously, if it had been a bug in Gnocchi that we had hit, it would have been quick to fix. However, we found a nasty bug in Swift caused by the evil monkey-patching of Eventlet (once again) blended with a mixed usage of native threads and Eventlet threads in Swift. Shake all of that, and you've got yourself some pretty ugly race conditions when using the Keystone middleware authentication.

In the meantime, we disabled Swift multi-threading by using mod_wsgi instead of Eventlet in devstack.

New features

So what's new in this new shiny release? A few interesting things:

  • Metric deletion is now asynchronous. That's not the most used feature in the REST API – weirdly, people do not often delete metrics – but it's now way faster and more reliable by being asynchronous. Metricd is now in charge of cleaning things up.

  • Speed improvement. We are now confident that Gnocchi is even faster than in the latest benchmarks I ran (around 1.5-2× faster), which makes Gnocchi really fast with its native storage back-ends. We profiled and optimized Carbonara and the REST API data validation.

  • Improved metricd status reporting. It now reports the size of the backlog of the whole cluster, both in its log and via the REST API. Easy monitoring!

  • Ceph driver enhancements. We had people testing the Ceph driver in production, so we made a few changes and fixes to make it more solid.

And that's all we did in the last couple of months. We have a lot of things on the roadmap that are pretty exciting, and I'll sure talk about them in the next weeks.

November 02, 2015

Last week I was in Tokyo, Japan for the OpenStack Summit, discussing the new Mitaka version that will be released in 6 months.

I've attended the summit mainly to discuss and follow up on new developments in Ceilometer, Gnocchi, Aodh and Oslo. It has been a pretty good week and we were able to discuss and plan a few interesting things. Below is what I found remarkable during this summit concerning those projects.

Distributed lock manager

I did not attend this session, but I need to write something about it.

See, when working in a distributed environment like OpenStack, it's almost obvious that sooner or later you end up needing a distributed lock mechanism. It started to be pretty obvious and a serious problem for us 2 years ago in Ceilometer. Back then, we proposed the service-sync blueprint and talked about it during the OpenStack Icehouse Design Summit in Hong-Kong. The session at that time was a success, and in 20 minutes I convinced everyone it was the right thing to do. The night following the session, we picked a name, Tooz, for this new library. It was the first time I met Joshua Harlow, who has since become one of the biggest Tooz contributors.

For the following months, we tried to move things forward in OpenStack. It was very hard to convince people that it was the solution to their problems. Most of the time, they did not seem to grasp the entirety of what was at stake.

This time, it seems that we managed to convince everyone that a DLM is indeed needed. Joshua wrote an extensive specification called Chronicle of a DLM, which ended up being discussed and somehow adopted during that session in Tokyo.

So yes, Tooz will be the weapon of choice for OpenStack. It will avoid a hard requirement on any DLM solution directly. The best driver right now is the ZooKeeper one, but it'll still be possible for operators to use e.g. Redis.
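
For readers who haven't used it, here is a minimal sketch of what taking a distributed lock through Tooz looks like; the ZooKeeper URL, member id and lock name are placeholders, not values from any real deployment.

from tooz import coordination

# Minimal Tooz locking sketch; the backend URL, member id and lock name are
# placeholders. Any supported backend (ZooKeeper, Redis, ...) works here.
coordinator = coordination.get_coordinator(
    "zookeeper://127.0.0.1:2181", b"worker-1")
coordinator.start()

lock = coordinator.get_lock(b"resource-42")
if lock.acquire(blocking=True):
    try:
        # Critical section: only one member of the cluster runs this at a time.
        pass
    finally:
        lock.release()

coordinator.stop()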

This is a great achievement for us, after spending years trying to fix features such as the Nova service group subsystem and seeing our proposals postponed forever.

(If you want to know more, there's a great article about that session.)

Telemetry team name

With the new projects launched this last year, Aodh & Gnocchi, in parallel with the old Ceilometer, plus the change from programs to the Big Tent in OpenStack, the team is having an identity issue. Being referred to as the "Ceilometer team" is not really accurate, as some of us only work on Aodh or on Gnocchi. So after discussing that, I proposed renaming the team to Telemetry instead. We'll see how it goes.


The first session was about alarms and the Aodh project. It turns out that the project is in pretty good shape, but it probably needs some more love, which I hope I'll be able to provide in the coming months.

The need for a new aodhclient based on the technologies we recently used building gnocchiclient has been reasserted, so we might end up working on that pretty soon. The Tempest support also needs some improvement, and we have a plan to enhance that.

Data visualisation

We got David Lyle in this session, the Project Technical Leader for Horizon. It was an interesting discussion. It used to be technically challenging to draw charts from the data Ceilometer collects, but it's now very easy with Gnocchi and its API.

While the technical side is resolved, the more political and user-experience question of what to draw and how was discussed at length. We don't want to make people think that Ceilometer and Gnocchi are a full monitoring solution, so there are some precautions to take. Other than that, it would be pretty cool to have views of the data in Horizon.

Rolling upgrade

It turns out that Ceilometer has an architecture that makes rolling upgrades easy. We just need to write proper documentation explaining how to do it and in which order the services should be upgraded.

Ceilometer splitting

The split of the alarm feature of Ceilometer into its own project, Aodh, in the last cycle was a great success for the whole team. We want to split out other pieces of Ceilometer, as they make sense on their own and become easier to manage. There are also some projects that want to use them without the whole stack, so it's a good idea to make it happen.

CloudKitty & Gnocchi

I attended the 2 sessions that were allocated to CloudKitty. It was pretty interesting as they want to simplify their architecture and leverage what Gnocchi provides. I presented my view of the project architecture and how they could leverage more of Gnocchi to retrieve and store data. They want to go in that direction, though it's a large amount of work and refactoring on their side, so it'll take time.

We also need to enhance the support for extending Gnocchi with new resource types, and that's something I hope to work on in the coming months.

Overall, this summit was pretty good and I got a tremendous amount of good feedback on Gnocchi. I again came away with enough ideas and tasks to tackle for the next 6 months. It will be really interesting to see where the whole team goes from here. Stay tuned!

October 30, 2015
You might have heard of the C.H.I.P., the $9 computer. After contributing to their Kickstarter, and with no intention of hacking on more kernel code than is absolutely necessary, I requested the "final" devices, for when chumps like me can read loads of docs and get accessories for it easily.

Turns out that our old friend the Realtek 8723BS chip is the Wi-Fi/Bluetooth chip in the nano computer. NextThingCo got in touch, and sent me a couple of early devices (as they did for the "Kernel hacker" backers), with their plan being to get all the drivers and downstream hacks merged into the upstream kernel.

Before being able to hack on the kernel driver though, we'll need to get some software on it, and find a way to access it. The docs website has instructions on how to flash the device using Ubuntu, but we don't use that here.

You'll need a C.H.I.P., a jumper cable, and the USB cable you usually use for charging your phone/tablet/e-book reader.

First, let's install a few necessary packages:

dnf install -y sunxi-tools uboot-tools python3-pyserial moserial

You might need other things, like git and gcc, but I kind of expect you to already have that installed if you're software hacking. You will probably also need to get sunxi-tools from Koji to get a new enough version that will support the C.H.I.P.

Get your jumper cable out, and make the connection as per the NextThingCo docs. I've copied the photo from the docs to keep this guide stand-alone.

Let's install the tools, modified to work with Fedora's newer, more upstream version of sunxi-tools.

$ git clone
$ cd CHIP-tools
$ make
$ sudo ./ -d

If you've followed the instructions, you haven't plugged in the USB cable yet. Plug in the USB cable now, to the micro USB power supply on one end, and to your computer on the other.

You should see the little "OK" after the "waiting for fel" message:

== upload the SPL to SRAM and execute it ==
waiting for fel........OK

At this point, you can unplug the jumper cable, something not mentioned in the original docs. If you don't do that, when the device reboots, it will reboot in flashing mode again, and we obviously don't want that.

At this point, you'll just need to wait a while. It will verify the installation when done, and turn off the device. Unplug, replug, and launch moserial as root. You should be able to access the C.H.I.P. through /dev/ttyACM0 with a baudrate of 115200. The root password is "chip".
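
If you prefer a scripted check over moserial, the python3-pyserial package installed earlier is enough. Here is a small sketch; the device path and baud rate are the ones mentioned above, while the "poke and read a few lines" logic is just an assumption about what the console prints.

import serial

# Poke the C.H.I.P. serial console and print whatever comes back.
# /dev/ttyACM0 and 115200 baud come from the instructions above.
console = serial.Serial("/dev/ttyACM0", 115200, timeout=2)
console.write(b"\n")  # nudge the console into printing a prompt
for _ in range(5):
    line = console.readline()
    if line:
        print(line.decode(errors="replace").rstrip())
console.close()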

Obligatory screenshot of our new computer:

Next step, testing out our cleaned up Realtek driver, Fedora on the C.H.I.P., and plenty more.
October 28, 2015

So as we quickly approach the Fedora Workstation 23 release, I have been running Wayland exclusively for over a week now. Despite a few glitches it now works well enough for me not to have to switch back to the X11 session anymore.

There are some major new features coming in Fedora Workstation 23 that I am quite excited about. First and foremost, it will be the first release shipping with support for our new firmware update system. This means that if your hardware supports it and your vendor uploads the needed firmware to LVFS, you can update your system firmware through GNOME Software. So no more struggling with proprietary tools or bootable DVDs. While this is a major step forward, in my opinion it will also be one of those things that ramps up slowly, as we need to bring more vendors onboard and also have them ship more UEFI 2.5-ready systems before the old ‘BIOS’ update problems are a thing of the past. A big thanks goes to Richard Hughes and Peter Jones for their work here.

Another major feature that we spent a lot of time getting right is the new Google Drive backend for the file manager. This means that as long as you have internet access you can manage your Google Drive files through Nautilus along with all your other local and remote file systems. I know a lot of Fedora users are either using Google Drive personally or as part of their work, so I think this is a major new feature we managed to land. A big thanks to Debarshi Ray for working on this item.

Thirdly, we now have support for ambient light sensors. This was a crucial step in our ongoing effort to improve battery life under Fedora and can have a very significant impact on how long your battery lasts. It is very easy to keep running the screen with more backlight than you actually need and thus drain your battery quickly, so with this enabled you might often squeeze a few extra hours out of your battery. A big thanks to Bastien Nocera and Benjamin Tissoires for their work on this feature.

And finally this is the first release where we are shipping our current xdg-app tech preview as part of Fedora, instead of just making it available in a COPR. So for those of you who don’t know, xdg-app is our new technology for packaging desktop applications. While still early stage it provides a way for software developers to package their software in a way that is both usable across multiple distributions and with improved security through the use of the LXC container technology. In fact as we are trying to make this technology as usable and widely deployed as possible Alexander Larsson is currently trying to work with the Open Container Initiative to make xdg-apps be OCI compliant.

This is important for a multitude of reasons, but mainly xdg-app fills an important gap in the container technology landscape, because while Docker and Rocket are great for packaging server software, there is no real equivalent for the desktop. The few similar efforts that have been launched are usually tied to a specific distribution or operating system, while xdg-app aims to provide the same level of cross-system compatibility for desktop applications that Docker and Rocket offer for server applications.

Fedora Workstation 24
Of course with Fedora Workstation 23 almost out the door we have been reviewing our Fedora Workstation tasklist to make sure it reflects what we currently have developers working on and what we expect to be able to land in Fedora Workstation 24. And let me use this opportunity to remind community members that if you are working on a cool feature for Fedora Workstation 24, make sure to let the Workstation working group know on the Fedora Desktop mailing list, so that we can make sure your feature gets listed and tracked in the tasklist too.

Anyway, I sent out this email to the working group this week, to outline what I see on the horizon in terms of major Fedora Workstation features lined up for 24.

You can get the full details in the email, and the tasklist has also been updated with these items, but I want to go into a bit more details on some of them.

xdg-app for world domination

As some of you might be aware, Christian Hergert, of GNOME Builder fame, recently joined our team. Christian will be doing a lot of cool stuff for us, but one thing he has already started on is working with Alexander Larsson to make sure we have a great developer story for xdg-app. If we want developers to adopt this technology we need to make it dead simple to create your own xdg-app packages, and Christian will make sure that GNOME Builder supports this in a way that makes transitioning your application into an xdg-app something you can do without needing to read a long howto. We hope to have the initial fruits of this labour ready for Fedora Workstation 24.

Another big part of this of course is the work that Richard Hughes and Kalev Lember are doing on GNOME Software to make sure we have the infrastructure in place to be able to install and upgrade xdg-apps. As we expect xdg-apps to come from a wide variety of sources as opposed to the current model of most things being in a central repository we need to develop good ways for new sources to be added and help users make more informed choices about the software they are installing. Related to this we are also looking at how we can improve labeling of the applications available, to make it easier to make your decisions based on a variety of criteria. The current star system in GNOME Software is not very obvious in what it tries to convey, so we will try looking at better ways to do this kind of labeling and what information we want to be able to provide through it.

Another major item, which I blogged about before, is our effort to finally make dealing with the binary graphics drivers less of a pain. To summarize: we want to make sure that if you need to install the binary drivers, this is a simple operation that doesn't conflict with your installation of Mesa, and also that if you have an Optimus-enabled laptop, it is easy and pleasant to use.

Anyway, there are some further items in the email I sent, but I will go more into detail about some of them at a later stage.

October 18, 2015

Second Round of systemd.conf 2015 Sponsors

We are happy to announce the second round of systemd.conf 2015 sponsors! In addition to those from the first announcement, we have:

Our second Gold sponsor is Red Hat!

What began as a better way to build software—openness, transparency, collaboration—soon shifted the balance of power in an entire industry. The revolution of choice continues. Today Red Hat® is the world's leading provider of open source solutions, using a community-powered approach to provide reliable and high-performing cloud, virtualization, storage, Linux®, and middleware technologies.

A Bronze sponsor is Samsung:

From the beginning we have established a very fast pace and are currently one of the biggest and fastest growing modern-technology R&D centers in East-Central Europe. We started by designing subsystems for digital satellite television; however, we have quickly expanded the scope of our interest. Currently, it includes advanced digital television systems, platform convergence, mobile systems, smart solutions, and enterprise solutions. The quality and certification center, which controls the conformity of Samsung Electronics products with the highest standards of quality and reliability, also plays a vital role in our activity.

A Bronze sponsor is travelping:

Travelping is passionate about networks, communications and devices. We empower our customers to deploy and operate networks using our state of the art products, solutions and services. Our products and solutions are based on our industry-proven physical and virtual appliance platforms. These purpose-built platforms ensure best-in-class performance, scalability and reliability combined with consistent end-to-end management capabilities. To build these products, Travelping has developed its own embedded, cross-platform Linux distribution, which incorporates the systemd service manager and tools.

A Bronze sponsor is Collabora:

Collabora has over 10 years of experience working with top tier OEMs & silicon manufacturers worldwide to develop products based on Open Source software. Through the use of Open Source technologies and methodologies, Collabora helps clients in multiple market segments gain faster time to market and save millions of dollars in licensing and maintenance costs. Collabora has already brought to market several products relying on systemd extensively.

A Bronze sponsor is Endocode:

Endocode AG. An employee-owned, software engineering company from Berlin. Open Source is our heart and soul.

A Bronze sponsor is the Linux Foundation:

The Linux Foundation advances the growth of Linux and offers its collaborative principles and practices to any endeavor.

We are Cooperating with LinuxTag e.V. on the organization:

LinuxTag is Europe's leading organizer of Linux and Open Source events. Born of the community and in business for 20 years, we organize LinuxTag, an annual conference and exhibition attracting thousands of visitors. We also participate and cooperate in organizing workshops, tutorials, seminars, and other events together with and for the Open Source community. Selected events include non-profit workshops, the German Kernel Summit at FrOSCon, participation in the Open Tech Summit, and others. We take care of the organizational framework of systemd.conf 2015. LinuxTag e.V. is a non-profit organization and welcomes donations of ideas and workforce.

A Media Partner is Golem: Golem is an up-to-date online publication intended for professional computer users. It provides technology insights into the IT and telecommunications industry, and offers profound and up-to-date information on significant and trending topics. Online and IT professionals, marketing managers, purchasers, and readers inspired by technology receive substantial information on product, market and branding potential through tests, interviews and market analysis.

We'd like to thank our sponsors for their support! Without sponsors our conference would not be possible!

The conference has been SOLD OUT for a few weeks now. We no longer accept registrations, nor paper submissions.

For further details about systemd.conf consult the conference website.

See the first round of sponsor announcements!

See you in Berlin!

October 14, 2015

In the old days, there were not a lot of users of open-source 3D graphics drivers. Aside from a few id Software games, tuxracer, and glxgears, there were basically no applications. As a result, we got almost no bug reports. It was a blessing and a curse.

Nowadays, thanks to Valve Software and Steam, there are more applications than you can shake a stick at. As a result, we get a lot of bug reports. It is a blessing and a curse.

There's quite a wide chasm between a good bug report and a bad bug report. Since we have a lot to do, the quality of the bug report can make a big difference in how quickly the bug gets fixed. A poorly written bug report may get ignored. A poorly written bug report may require several rounds of asking for more information or clarification. A poorly written bug report may send a developer looking for the problem in the wrong place.

As someone who fields a lot of these reports for Mesa, here's what I look for...

  1. Submit against the right product and component. The component will determine the person (or mailing list) that initially receives notification about the bug. If you have a bug in the Radeon driver and only people working on Nouveau get the bug e-mail, your bug probably won't get much attention.

    If you have experienced a bug in 3D rendering, you want to submit your bug against Mesa.

    Selecting the correct component can be tricky. It doesn't have to be 100% correct initially, but it should be close enough to get the attention of the right people. A good first approximation is to pick the component that matches the hardware you are running on. That means one of the following:

    If you're not sure what GPU is in your system, the command glxinfo | grep "OpenGL renderer string" will help.

  2. Correctly describe the software on your system. There is a field in bugzilla to select the version where the bug was encountered. This information is useful, but it is more important to get specific, detailed information in the bug report.

    • Provide the output of glxinfo | grep "OpenGL version string". There may be two lines of output. That's fine, and you should include both.

    • If you are using the drivers supplied by your distro, include the version of the distro package. On Fedora, this can be determined by rpm -q mesa-dri-drivers. Other distros use other commands. I don't know what they are, but Googling for "how do i determine the version of an installed package on " should help.

    • If you built the driver from source using a Mesa release tarball, include the full name of that tarball.

    • If you built the driver from source using the Mesa GIT repository, include the SHA of the commit used.

    • Include the full version of the kernel using uname -r. This isn't always important, but it is sometimes helpful.

  3. Correctly describe the hardware in your system. There are a lot of variations of each GPU out there. Knowing exactly which variation exhibited bad behavior can help find the problem.

    • Provide the output of glxinfo | grep "OpenGL renderer string". As mentioned above, this is the hardware that Mesa thinks is in the system.

    • Provide the output of lspci -d ::0300 -nn. This provides the raw PCI information about the GPU. In some cases this may be more detailed (or just plain different) than the output from glxinfo.

    • Provide the output of grep "model name" /proc/cpuinfo | head -1. It is sometimes useful to know the CPU in the system, and this provides information about the CPU model.

  4. Thoroughly describe the problem you observed. There are two goals. First, you want the person reading the bug report to know both what did happen and what you expected to happen. Second, you want the person reading the bug report to know how to reproduce the problem you observed. The person fielding the bug report may not have access to the application.

    • If the application was run from the command line, provide the full command line used to run the application.

      If the application is a test that was run from piglit, provide the command line used to run the test, not the command line used to run piglit. This information is in the piglit results.json result file.

    • If the application is a game or other complex application, provide any information needed to reproduce the problem. Did it happen at a specific location in a level? Did it happen while performing a specific operation? Does it only happen when loading certain data files?

    • Provide output from the application. If the application is a test case, provide the text output from the test. If this output is, say, less than 5 lines, include it in the report. Otherwise provide it as an attachment. Output lines that just say "Fail" are not useful. It is usually possible to run tests in verbose mode that will provide helpful output.

      If the application is a game or other complex application, try to provide a screenshot. If it is possible, providing a "good" and "bad" screenshot is even more useful. Bug #91617 is a good example of this.

      If the bug was a GPU or system hang, provide output from dmesg as an attachment to the bug. If the bug is not a GPU or system hang, this output is unlikely to be useful.

  5. Add annotations to the bug summary. Annotations are parts of the bug summary at the beginning enclosed in square brackets. The two most important ones are regression and bisected (see the next item for more information about bisected). If the bug being reported is a new failure after updating system software, it is (most likely) a regression. Adding [regression] to the beginning of the bug summary will help get it noticed! In this case you also must, at the very least, tell us what software version worked... with all the same details mentioned above.

    The other annotations used by the Intel team are short names for hardware versions. Unless you're part of Intel's driver development team or QA team, you don't really need to worry about these. The ones currently in use are:

  6. Bisect regressions. If you have encountered a regression and you are building a driver from source, please provide the results of git-bisect. We really just need the last part: the guilty commit. When this information is available, add bisected to the tag. This should look like [regression bisected].

    Note: If you're on a QA team somewhere, supplying the bisect is a practical requirement.

  7. Respond to the bug. There may be requests for more information. There may be requests to test patches. Either way, developers working on the bug need you to be responsive. I have closed a lot of bugs over the years because the original reporter didn't respond to a question for six months or more. Sometimes this is our fault because the question was asked months (or even years) after the bug was submitted.

    The general practice is to change the bug status to NEEDINFO when information or testing is requested. After providing the requested information, please change the status back to either NEW or ASSIGNED or whatever the previous state was.

    It is also general practice to leave bugs in the NEW state until a developer actually starts working on it. When someone starts work, they will change the "Assigned To" field and change the status to ASSIGNED. This is how we prevent multiple people from duplicating effort by working on the same bug.

    If the bug was encountered in a released version and the bug persists after updating to a new major version (e.g., from 10.6.2 to 11.0.1), it is probably worth adding a comment to the bug "Problem still occurs with ." That way we know you're alive and still care about the problem.

Here is a sample script that collects a lot of the information requested above. It does not collect information about the driver version other than the glxinfo output. Since I don't know whether you're using the distro driver or building one from source, there really isn't any way that my script could do that. You'll want to customize it for your own needs.

echo "Software versions:"
echo -n "    "
uname -r
echo -n "    "
glxinfo 2>&1 | grep "OpenGL version string"

echo "GPU hardware:"
echo -n "    "
glxinfo 2>&1 | grep "OpenGL renderer string"
echo -n "    "
lspci -d ::0300 -nn

echo "CPU hardware:"
echo -n "    "
uname -m
echo -n "    "
grep "model name" /proc/cpuinfo | head -1 | sed 's/^[^:]*: //'

This generates nicely formatted output that you can copy-and-paste directly into a bug report. Here is the output from my system:

Software versions:
    OpenGL version string: 3.0 Mesa 11.1.0-devel

GPU hardware:
    OpenGL renderer string: Mesa DRI Intel(R) HD Graphics 5500 (Broadwell GT2) 
    00:02.0 VGA compatible controller [0300]: Intel Corporation Broadwell-U Integrated Graphics [8086:1616] (rev 09)

CPU hardware:
    Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz
October 13, 2015

We have gotten pretty good feedback on Gnocchi so far, even if we've only had a little. Recently, in order to get a better feeling of where we're at, we wanted to know how fast (or slow) Gnocchi was.

The early benchmarks that some of the Mirantis engineers ran last year showed pretty good signs. But a year later, it was time to get real numbers and have a good understanding of Gnocchi capacity.

Benchmark tools

The first thing I realized when starting that process is that we were lacking tools to run benchmarks. Therefore I started writing some benchmark tools in python-gnocchiclient, which provides a command-line tool to interrogate Gnocchi. I added a few basic commands to measure metric performance, such as:

$ gnocchi benchmark metric create -w 48 -n 10000 -a low
| Field | Value |
| client workers | 48 |
| create executed | 10000 |
| create failures | 0 |
| create failures rate | 0.00 % |
| create runtime | 8.80 seconds |
| create speed | 1136.96 create/s |
| delete executed | 10000 |
| delete failures | 0 |
| delete failures rate | 0.00 % |
| delete runtime | 39.56 seconds |
| delete speed | 252.75 delete/s |

The command-line tool supports the --verbose switch for a detailed progress report on the benchmark progression. So far it supports metric operations only, but that's the most interesting part of Gnocchi.
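
If you are curious what such a benchmark boils down to, here is a rough sketch of measuring metric-creation throughput directly against the REST API with plain HTTP calls; the endpoint, archive policy and worker count are placeholders, and authentication is assumed to be disabled, as in the setup described below. This is not the gnocchiclient benchmark code.

import time
from concurrent.futures import ThreadPoolExecutor

import requests

# Placeholders: adjust the endpoint, archive policy and worker count.
GNOCCHI = "http://localhost:8041"
COUNT = 1000
WORKERS = 48

def create_metric(_):
    r = requests.post(GNOCCHI + "/v1/metric",
                      json={"archive_policy_name": "low"})
    return r.status_code

start = time.time()
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    statuses = list(pool.map(create_metric, range(COUNT)))
elapsed = time.time() - start

failures = sum(1 for s in statuses if s >= 400)
print("%d metrics in %.2fs (%.1f create/s), %d failures"
      % (COUNT, elapsed, COUNT / elapsed, failures))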

Spinning up some hardware

I got a couple of bare metal servers to test Gnocchi on. I dedicated the first one to Gnocchi, and used the second one as the benchmark client, plugged into the same network. Each server has 2× Intel Xeon E5-2609 v3 (12 cores in total) and 32 GB of RAM. That provides a lot of CPU to handle requests in parallel.

Then I simply performed a basic RHEL 7 installation and ran devstack to spin up an installation of Gnocchi based on the master branch, disabling all of the other OpenStack components. I then tweaked the Apache httpd configuration to use the worker MPM and increased the maximum number of clients that can send requests simultaneously.

I configured Gnocchi to use the PostgreSQL indexer, as it's the recommended one, and the file storage driver, based on Carbonara (Gnocchi's own storage engine). That means files were stored locally rather than in Ceph or Swift. Using the file driver is less scalable (you have to run on only one node or use a technology like NFS to share the files), but it was good enough for this benchmark, to get some numbers, and to profile the beast.

The OpenStack Keystone authentication middleware was not enabled in this setup, as it would add some delay validating the authentication token.

Metric CRUD operations

Metric creation is pretty fast. I managed to reach 1500 metrics created per second pretty easily. Deletion is now asynchronous, which means it's faster than in Gnocchi 1.2, but it's still slower than creation: 300 metrics/s can be deleted. That does not sound like a huge issue since metric deletion is actually barely used in production.

Retrieving metric information is also pretty fast and goes up to 800 metrics/s. It'd be easy to achieve much higher throughput for this one, as it'd be easy to cache, but we haven't felt the need to implement that so far.

Another important thing is that all of these numbers are constant and barely depend on the number of metrics already managed by Gnocchi.

Operation      Details                                   Rate
Create metric  Created 100k metrics in 77 seconds        1300 metric/s
Delete metric  Deleted 100k metrics in 190 seconds       524 metric/s
Show metric    Show a metric 100k times in 149 seconds   670 metric/s

Sending and getting measures

Pushing measures into metrics is one of the hottest topics. Starting with Gnocchi 1.1, the measures pushed are treated asynchronously, which makes it much faster to push new measures. Getting new numbers on that feature was pretty interesting.

The number of measures per second you can push depends on the batch size, meaning the number of actual measurements you send per call. The naive approach is to push 1 measure per call, and in that case, Gnocchi is able to handle around 600 measures/s. With a batch containing 100 measures, the number of calls per second goes down to 450, but since you push 100 measures each time, that means 45k measures per second pushed into Gnocchi!

I pushed the test further, inspired by the recent InfluxDB blog post claiming to achieve 300k points per second with their new engine. I ran the same benchmark on the hardware I had, which is roughly half the size of what they used. I managed to push Gnocchi to a little more than 120k measures per second. If I had the same hardware as they used, extrapolating the results suggests almost 250k measures/s pushed. Obviously, you can't strictly compare Gnocchi and InfluxDB since they are not doing exactly the same thing, but it still looks way better than what I expected.

Using batch sizes between 1k and 5k keeps the throughput in the same ballpark, peaking at around 125k measures/s, as the table below shows.

Operation        Details                                                        Rate
Push metric 5k   Push 5M measures with batches of 5k measures in 40 seconds     122k measures/s
Push metric 4k   Push 5M measures with batches of 4k measures in 40 seconds     125k measures/s
Push metric 3k   Push 5M measures with batches of 3k measures in 40 seconds     123k measures/s
Push metric 2k   Push 5M measures with batches of 2k measures in 41 seconds     121k measures/s
Push metric 1k   Push 5M measures with batches of 1k measures in 44 seconds     113k measures/s
Push metric 500  Push 5M measures with batches of 500 measures in 51 seconds    98k measures/s
Push metric 100  Push 5M measures with batches of 100 measures in 112 seconds   45k measures/s
Push metric 10   Push 5M measures with batches of 10 measures in 852 seconds    6k measures/s
Push metric 1    Push 500k measures with batches of 1 measure in 800 seconds    624 measures/s
Get measures     Get 43k measures of 1 metric                                   260k measures/s
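
To make the batching concrete, here is a small sketch of what pushing one batch of measures to a single metric looks like over the REST API; the endpoint and metric id are placeholders.

import datetime

import requests

# Placeholders: adjust the endpoint and use a real metric id.
GNOCCHI = "http://localhost:8041"
METRIC_ID = "00000000-0000-0000-0000-000000000000"

now = datetime.datetime.utcnow()
batch = [{"timestamp": (now + datetime.timedelta(seconds=i)).isoformat(),
          "value": float(i)}
         for i in range(100)]  # one call carrying a batch of 100 measures

r = requests.post("%s/v1/metric/%s/measures" % (GNOCCHI, METRIC_ID),
                  json=batch)
r.raise_for_status()  # the measures are accepted and processed asynchronously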

What about getting measures? Well, it's actually pretty fast too. Retrieving a metric with 1 month of data at 1-minute intervals (that's 43k points) takes less than 2 seconds.

Though it's actually slower than I expected. The reason seems to be that the JSON is 2 MB big and encoding it takes a lot of time for Python. I'll investigate that. Another point I discovered is that by default Gnocchi returns all the datapoints for every granularity available for the requested period, which might double the size of the returned data for nothing if you don't need them. It'll be easy to add an option to the API to only retrieve what you need though!

Once benchmarked, that meant I was able to retrieve 6 metrics per second, which translates to around 260k measures/s.
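
To illustrate the point about granularities, here is a sketch of fetching a metric's measures and keeping only the one-minute datapoints on the client side; the endpoint and metric id are placeholders, and the response is assumed to be the usual list of [timestamp, granularity, value] triples.

import requests

# Placeholders: adjust the endpoint and use a real metric id.
GNOCCHI = "http://localhost:8041"
METRIC_ID = "00000000-0000-0000-0000-000000000000"

r = requests.get("%s/v1/metric/%s/measures" % (GNOCCHI, METRIC_ID))
r.raise_for_status()

# Each entry is [timestamp, granularity in seconds, value]; keep the
# one-minute granularity only and drop the rest.
minute_points = [(ts, value) for ts, granularity, value in r.json()
                 if granularity == 60.0]
print("%d one-minute datapoints" % len(minute_points))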

Metricd speed

New measures that are pushed into Gnocchi are processed asynchronously by the gnocchi-metricd daemon. When doing the benchmarks above, I ran into a very interesting issue: sending 10k measures on a metric would make gnocchi-metricd use up to 2 GB of RAM and 120% CPU for more than 10 minutes.

After further investigation, I found that the naive approach we used to resample datapoints in Carbonara using Pandas was causing that. I reported a bug on Pandas and the upstream author was kind enough to provide a nice workaround, which I then sent as a pull request to the Pandas documentation.

I wrote a fix for Gnocchi based on that, and started using it. Computing the standard aggregation methods set (std, count, 95pct, min, max, sum, median, mean) for 10k batches of 1 measure (worst case scenario) for one metric with 10k measures now takes only 20 seconds and uses 100 MB of RAM – 45× faster. That means that in normal operations, where only a few new measures are processed, the operation of updating a metric only takes a few milliseconds. Awesome!
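
For readers wondering what that aggregation amounts to, here is a toy illustration (not the Carbonara implementation) of computing a few of those aggregates per 60-second bucket:

import statistics
from collections import defaultdict

GRANULARITY = 60  # seconds

def aggregate(measures):
    """Toy aggregation of (unix_timestamp, value) pairs into 60 s buckets."""
    buckets = defaultdict(list)
    for ts, value in measures:
        buckets[int(ts // GRANULARITY) * GRANULARITY].append(value)
    return {bucket: {"count": len(values),
                     "min": min(values),
                     "max": max(values),
                     "sum": sum(values),
                     "mean": statistics.mean(values),
                     "median": statistics.median(values)}
            for bucket, values in buckets.items()}

print(aggregate([(0, 1.0), (30, 3.0), (75, 10.0)]))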

Comparison with Ceilometer

For comparison's sake, I quickly ran some read-operation benchmarks in Ceilometer. I fed it one month of samples for 100 instances polled every minute. That represents roughly 4.3M samples injected, and that took a while – almost 1 hour, whereas it would have taken less than a minute in Gnocchi. Then I tried to retrieve some statistics in the same way that we provide them in Gnocchi, which means aggregating them over a period of 60 seconds over a month.

Operation            Details                     Time
Read metric SQL      Read measures for 1 metric  2min 58s
Read metric MongoDB  Read measures for 1 metric  28s
Read metric Gnocchi  Read measures for 1 metric  2s

Obviously, Ceilometer is very slow. It has to look through 4M samples to compute and return the result, which takes a lot of time, whereas Gnocchi just has to fetch a file and pass it over. That also means that the more samples you have (so the longer you collect data and the more resources you have), the slower Ceilometer will become. This is not a problem with Gnocchi, as I emphasized when I started designing it.

Most Gnocchi operations are O(log R) where R is the number of metrics or resources, whereas most Ceilometer operations are O(log S) where S is the number of samples (measures). Since R is millions of times smaller than S, Gnocchi gets to be much faster.

And what's even more interesting is that Gnocchi is entirely horizontally scalable. Adding more Gnocchi servers (for the API and its background processing worker metricd) will multiply Gnocchi's performance by the number of servers added.


There are several things to improve in Gnocchi, such as splitting Carbonara archives to make them more efficient, especially for drivers such as Ceph and Swift. It's already on my plate, and I'm looking forward to working on that!

And if you have any questions, feel free to shoot them in the comment section. 😉

October 05, 2015

Last week, I've been invited to the OpenStack Paris meetup #16, whose subject was about metrics in OpenStack. Last time I spoke at this meetup was back in 2012, during the OpenStack Paris meetup #2. A very long time ago!

I talked for half an hour about Gnocchi, the OpenStack project I've been running for 18 months now. I started by explaining the story behind the project and why we needed to build it. Ceilometer has an interesting history and has had a curious roadmap these last years, and I summarized that briefly. Then I talked about how Gnocchi works and what it offers to users and operators. The slides were full of JSON, but I imagine they offered an interesting view of what the API looks like and how easy it is to operate. This also allowed me to emphasize how many use cases are actually covered and solved, contrary to what Ceilometer has done so far. The talk was well received and I got a few interesting questions at the end.

The video of the talk (in French) and my slides are available on my talk page and below. I hope you'll enjoy it.

September 25, 2015

Lately I have worked on the ARB_shader_storage_buffer_object extension implementation in Mesa, along with Iago (who has already explained how we implemented it).

One of the features we implemented was adding support for buffer variables to the queries defined in ARB_program_interface_query. Those queries ended up being very useful to check another feature introduced by ARB_shader_storage_buffer_object: the new storage layout called std430.
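
As a quick reminder of why std430 needs its own tests, here is a toy sketch of the array-stride difference between std140 and std430 for scalar and vector element types (simplified on purpose: structs and matrices are left out):

# Toy sketch of std140 vs std430 array strides for scalar/vector elements.
# Simplified: structs and matrices are not handled.
BASE_ALIGN = {"float": 4, "int": 4, "vec2": 8, "vec3": 16, "vec4": 16}

def array_stride(element_type, layout):
    align = BASE_ALIGN[element_type]
    if layout == "std140":
        # std140 rounds the array stride up to a vec4 (16 bytes).
        return max(align, 16)
    # std430 keeps the element's own base alignment.
    return align

for t in ("float", "vec2", "vec4"):
    print(t, "std140:", array_stride(t, "std140"),
          "std430:", array_stride(t, "std430"))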

I have developed basic tests for piglit, but Jordan Justen and Ilia Mirkin mentioned that Ian Romanick had worked on comprehensive testing for uniform buffer objects, presented at XDC 2014.

Ian developed stochastic, search-based testing for uniform block layouts, which generates shader_runner tests using a Mako template that defines a series of uniform blocks with different layouts. Finally, shader_runner verifies information obtained through glGetActiveUniforms*() queries, such as the variable type, its offset inside the block, its array size (if any), whether it is row major, etc., using the “active uniform” command.

That was the kind of testing that I was looking for, so I cloned his repository and developed a modified version for shader storage blocks and std430.

During that process, I found that shader_runner was lacking support for the queries defined in the ARB_program_interface_query extension. Yesterday, the patches were pushed to the master branch of the piglit repository.

If you want to use this new command, the format is the following:

verify program_interface_query GL_INTERFACE_TYPE_ENUM var_name GL_PROPS_ENUM integer

or, if we include the GL type enum:

verify program_interface_query GL_INTERFACE_TYPE_ENUM var_name GL_PROPS_ENUM GL_TYPE_ENUM

I have written an example to show how to use this command in a shader_runner test. This is just a snippet showing the usage of the new command:

link success

verify program_interface_query GL_BUFFER_VARIABLE SSBO1.s5_1[0].s1_4[1].s1_3.fv3[0] GL_TYPE GL_FLOAT_VEC4
verify program_interface_query GL_BUFFER_VARIABLE SSBO1.s5_1[0].s1_4[1].s1_3.fv3[0] GL_TOP_LEVEL_ARRAY_SIZE 4
verify program_interface_query GL_BUFFER_VARIABLE SSBO1.s5_1[0].s1_4[1].s1_3.fv3[0] GL_OFFSET 128
verify program_interface_query GL_BUFFER_VARIABLE SSBO1.s5_1[0].s1_4[0].s1_3.fv3[0] GL_OFFSET 48
verify program_interface_query GL_BUFFER_VARIABLE SSBO1.s5_1[0].s1_4[0].m34_1 GL_TYPE GL_FLOAT_MAT3x4
verify program_interface_query GL_BUFFER_VARIABLE SSBO1.s5_1[0].s1_4[1].m34_1 GL_OFFSET 80
verify program_interface_query GL_BUFFER_VARIABLE SSBO1.s5_1[0].s1_4[1].m34_1 GL_ARRAY_STRIDE 0
verify program_interface_query GL_BUFFER_VARIABLE SSBO1.m43_1 GL_IS_ROW_MAJOR 1
verify program_interface_query GL_PROGRAM_INPUT piglit_vertex GL_TYPE GL_FLOAT_VEC4
verify program_interface_query GL_PROGRAM_OUTPUT piglit_fragcolor GL_TYPE GL_FLOAT_VEC4

I hope you find them useful for your tests!


I've wanted a stand-alone radio in my office for a long time. I've been using a small portable radio, but it ate batteries quickly (probably a 4-pack of AAs for a bit less than a work week's worth of listening), changing stations was cumbersome (hello FM dials) and the speaker was a bit teeny.

A couple of years back, I had a Raspberry Pi-based computer on pre-order (the Kano, highly recommended for kids, and beginners) through a crowd-funding site. So I scoured « brocantes » (imagine a mix of car boot sale and antiques fair, in France, with people emptying their attics) in search of a shell for my small computer. A whole lot of nothing until my wife came back from a week-end at a friend's with this:

Photo from Radio Historia

A Philips Octode Super 522A, from 1934, when SKUs were as superlative-laden and impenetrable as they are today.

Let's DIY

I started by removing the internal parts of the radio, without actually turning it on. When you get such old electronics, they need to be checked thoroughly before being plugged in, and as I know nothing about tube radios, I preferred not to. And FM didn't exist when this radio came out, so I'm not sure what I would have been able to do with it anyway.

Roomy, and dirty. The original speaker was removed, the front buttons didn't have anything holding them any more, and the nice backlit screen went away as well.

To replace the speaker, I did quite a lot of research, looking for speakers meant to be embedded, rather than getting a speaker in a box that I would need to extricate from its container. Visaton make speakers that can be integrated into ceilings, vehicles, etc. That also allowed me to choose one that had a good enough range and would fit into the one hole in my case.

To replace the screen, I settled on an OLED screen that I knew would work without too much effort with the Raspberry Pi, a small Adafruit SSD1306 one. A small amount of soldering, which was within my level of skill.

It worked, it worked!

Hey, soldering is easy. So because of the size of the speaker I selected, and the output power of the RPi, I needed an amp. The Velleman MK190 kit was cheap (€10), and should just be able to work with the 5V USB power supply I planned to use. Except that the schematics are really not good enough for an electronics beginner. I spent a couple of afternoons verifying, checking the Internet for alternative instructions, and re-doing the solder points, to no avail.

'Sup Tiga!

So, a lot of wasted time, and I ended up getting a cheap car amp with a power supply. You can probably find cheaper.

Finally, I got another Raspberry Pi, and SD card, so that the Kano, with its super wireless keyboard, could find a better home (it went to my godson, who seemed to enjoy the early game of Pong, and being a wizard).

Putting it all together

We'll need to hold everything together. I got a bit of help from somebody with a Dremel tool for the piece of wood that holds the speaker, and another one that sticks three stove bolts out of the front, to hold the original tuning, mode and volume buttons.

A real joiner

I fast-forwarded the machine by a couple of years with a « Philips » figure-of-8 plug at the back, so the machine's electrics would be well separated from the outside.

Screws into the side panel for the amp, blu-tack to hold the OLED screen for now, RPi on a few leftover bits of wood.


My first attempt at getting something that I could control on this small computer was lcdgrilo. Unfortunately, I would have had to write a Web UI for it (remember, my buttons are just stuck on, for now at least), and probably port the SSD1306 OLED screen's driver from Python, so it was not a good fit.

There's no proper Fedora support for Raspberry Pis, and while one can use a nearly stock Debian with a few additional firmware files on Raspberry Pis, Fedora chose not to support that slightly older SoC at all, which is obviously disappointing for somebody working on Fedora as a day job.

Looking at other radio retrofits (there are plenty of quality ones on the Internet) and at various connected-speaker backends, I found PiMusicBox. It's a Debian variant with Mopidy built in, and it has a very easy initial setup: edit a settings file on the SD card image, boot, and access the interface via a browser. Tada!

Once I had tested playback, I lowered the amp's volume to nearly zero, raised the web UI's volume to the maximum, and raised the amp's volume to the maximum bearable for the speaker. As I won't be able to access the amp's dial, we'll have this software only solution.

Wrapping up

I probably spent a longer time looking for software and hardware than actually making my connected radio, but it was an enjoyable couple of afternoons of work, and the software side isn't quite finished.

First, in terms of hardware support, I'll need to make this OLED screen work; how lazy of me. The audio setup currently uses just the right speaker channel, as I'd like both the radio and AirPlay streams to be downmixed.

Secondly, Mopidy supports plugins to extend its sources and uses GStreamer, so it would be a good fit for Grilo, making it easier for Mopidy users to extend it through Lua.

Do note that the Raspberry Pi I used is a B+ model. For B models, it's recommended to use a separate DAC, because of the bad audio quality, even if the B+ isn't that much better. Testing out the HDMI output with an HDMI-to-VGA+jack adapter might be a way to cut costs as well.

Possible improvements could include making the front-facing dials work (that's going to be a tough one), or adding RFID support, so I can wave items in front of it to turn it off, or play a particular radio.

In all, this radio cost me:
- 10 € for the radio case itself
- 36.50 € for the Raspberry Pi and SD card (I already had spare power supplies, and supported Wi-Fi dongle)
- 26.50 € for the OLED screen plus various cables
- 20 € for the speaker
- 18 € for the amp
- 21 € for various cables, bolts, planks of wood, etc.

I might also count the 14 € for the soldering iron, the 10 € for the Velleman amp, and about 10 € for adapters, cables, and supplies I didn't end up using.

So between 130 and 150 €, and a number of afternoons, but at the end, a very flexible piece of hardware that didn't really stretch my miniaturisation skills, and a completely unique piece of furniture.

In the future, I plan on playing with making my own 3-button keyboard, and making a remote speaker to plug into the living room's 5.1 amp with a C.H.I.P. computer.

Happy hacking!
September 24, 2015


If you were waiting for a new post to learn something new about Intel GEN graphics, I am sorry. I started moving away from i915 kernel development about one year ago. Although I had landed patches in mesa before, it has taken me some time to get to a point where I had enough of a grasp on things to write a blog post of interest. I did start to write a post about the i965 compiler, but I was never able to complete it.1

In general I think the documentation has gotten quite good at giving the full details and explanations in a reasonable format. In terms of diagrams and prose, the GEN8+ Programmer Reference Manuals blow the previous docs out of the water. The greatest challenge I see, though, is the organization of the information, which is arguably worse in the newer documentation. Also, there is so much of it that it’s very difficult to find everything you need when you need it, and it is especially hard when you don’t really know what you’re looking for. Therefore, I thought a post like this would be valuable to people who want to understand this part of the Intel hardware and the i965 driver. More importantly, I have a terrible memory, and this will serve as good documentation for future me.

I highly recommend looking at the SVG files if you want to see the pictures in higher detail. I try very hard to strike a balance between factual accuracy and in making the information consumable. If you feel that either something is grossly inaccurate, or far too confusing, please let me know.

And finally, before I begin, I’d like to extend a special thanks to Kristian Høgsberg for the initial 90 minute discussion that seeded this post as well as helping me along the way. I believe it benefited both of us.

Whiteboarding with krh

Blurry picture of the resulting whiteboard


One of the most important components that goes into building a 3D scene is the vertex. The point represented by the vertex may simply be a point or it may be part of some amalgamation of other points to make a line, shape, mesh, etc. Depending on the situation, you might observe a set of vertices referred to as: topology, primitive, or geometry (these aren’t all the same thing, but unless you’re speaking to a pedant you can expect the terms to be interchanged pretty frequently). Most often, these vertices are parts of triangles which make up a more complex model which is part of a scene.

The first stage in the modern programmable graphics pipeline is the Vertex Shader, which unsurprisingly operates on the vertices.2 In the simplest case you might use a vertex shader to simply locate the vertex in the correct place using the model and view transformations. Don’t worry if you’re asking yourself why it is called a shader. Given this example there isn’t anything I would define as “shading” either. How clever programmers make use of the vertex shader is out of scope, thankfully – but rest assured it’s not always so simple.

Vertex processing of a gradient square

Vertex Buffers

The attributes that define a vertex are called Vertex Attributes (yay). For now you can just imagine the most obvious attribute – position. The attributes are stored in one or more buffers that are created and defined by the Graphics API. Each vertex (defined here as a collection of all the attributes for a single point) gets processed and/or modified by the vertex shader with the program that is also specified by the API.

It is the responsibility of the application to build up the vertex buffers. I already mentioned that position is a very common attribute describing a vertex; unless you’re going to be creating parts of your scene within a shader, it’s likely a required attribute. So the programmer in one way or another will define each position of every vertex for the scene. We can have other things though, and the purpose of the current programmable GPU pipelines is to allow the sky to be the limit. Colors and normals are some other common vertex attributes. With a few more magic API calls, we’ll have a nice 2d image on the screen that represents the 3D image provided via the vertices.
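
As a concrete (and entirely hypothetical) example of what building up a vertex buffer means on the application side, here is a tiny sketch packing interleaved position and color attributes into the kind of raw byte buffer a graphics API would be handed:

import struct

# Hypothetical interleaved vertex data: vec4 position followed by vec4 color.
vertices = [
    # (x, y, z, w,          r, g, b, a)
    (-1.0, -1.0, 0.0, 1.0,  0.0, 1.0, 0.0, 1.0),
    ( 1.0, -1.0, 0.0, 1.0,  0.0, 1.0, 0.0, 1.0),
    ( 1.0,  1.0, 0.0, 1.0,  0.0, 1.0, 0.0, 1.0),
    (-1.0,  1.0, 0.0, 1.0,  0.0, 1.0, 0.0, 1.0),
]

stride = 8 * 4  # 8 floats of 4 bytes per vertex
buf = b"".join(struct.pack("8f", *v) for v in vertices)
assert len(buf) == stride * len(vertices)
print("vertex buffer: %d bytes, stride %d bytes" % (len(buf), stride))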

Here are some links with much more detailed information:

Intel GPU Hardware and the i965 Driver

The GPU hardware will need to be able to operate on the vertices that are contained in the buffers created by the programmer. It will also need to be able to execute a shader program specified by the programmer. Setting this up is the responsibility of the GPU driver. When I say driver, I mean like the i965 driver in mesa. This is just a plain old shared object sitting on your disk. This is not some kernel component. In graphics, the kernel driver knows less about the underlying 3d hardware than the userspace driver.

Ultimately, the driver needs to take all the API goop and make sure it programs the hardware correctly for the given goop. There will be details on this later in the post. Getting the shader program to run takes some amount of state setup, relatively little. It’s primarily a different component within the driver which implements this. Modern graphics drivers have a surprisingly full featured compiler within them for being able to convert the API’s shading language into the native code on the GPU.

Execution Units

Programmable shading for all stages in the GPU pipeline occurs on Execution Units, which are often abbreviated to EU. All other stuff is generally called “fixed function”. The Execution Units are VLIW processors. The ISA for the EU is documented. The most important point I need to make is that each thread of each EU has its own register file. If you’re unfamiliar with the term register file, you can think of it as thread local memory. If you’re unfamiliar with that, I’m sorry.

Push vs. Pull Vertex Attributes

As it pertains to the shader, Vertex Attributes can be pushed, or pulled [or both]. Intel and most other hardware started with the push model. This is because there were no programmable shaders for geometry in the olden days. Some of the hardware vendors have moved exclusively from push to pull. Since Intel hardware can do both (as is likely the case for any HW supporting the push model), I’ll give a quick overview, mostly as background. The Intel docs do have a comparison of the two which I noticed after writing this section. I am too lazy to find the link 😛

In the pull model, none of the vertex attributes needed by the shader program are fetched before execution of the shader program begins. Assuming you can get the shader programs up and running quickly and have a way to hide the memory latency (like relatively fast thread switching), this comes with some advantages (maybe more which I haven't thought of):

  1. In programs with control flow or conditional execution, you may not need all the attributes.
  2. Fewer transistors for fixed function logic.
  3. Assuming the push model requires register space – more ability to avoid spills when there are a lot of vertex attributes.
    • More initial register space for the programs

The push model has fixed function hardware that is designed to fetch all the vertex attributes and feed them into the shader program during invocation. Unlike the pull model, all attributes that may be needed are fetched, and this can cause some overhead.

  1. It doesn’t suffer from the two requirements above though (fast shader invocation + latency hiding).
  2. Can do format conversion automatically.
  3. Hardware should also have a better view of the system and be able to more intelligently do the attribute pushing, perhaps utilizing caches better, or considering memory arbiters better, etc.

As already mentioned, usually if your hardware supports the push model, you can optionally pull things as needed, and you can easily envision how these things may work. That is especially useful on GEN in certain cases where we have hardware constraints that make the push model difficult, or even impossible to use.

Push vs. Pull

Push vs. Pull

Green Square

Before moving on, I’d like to demonstrate a real example with code. We’ll use this to see how the attributes make their way down the pipeline. The shader program will simply draw a green square. Such a square can be achieved very easily without a vertex shader, or more specifically, without passing the color through the vertex shader, but this will circumvent what I want to demonstrate. I’ll be using shader_runner from the piglit GL test suite and framework. The i965 driver has several debug capabilities built into the driver and exposed via environment variables. In order to see the contents of the commands emitted by the driver to the hardware in a batchbuffer, you can set the environment variable INTEL_DEBUG=batch. The contents of the batchbuffers will be passed to libdrm for decoding3. The shader programs get disassembled to the native code by the i965 driver4.

Here is the linked shader we’ll be using for shader_runner:

GLSL >= 1.30

[vertex shader]
varying vec4 color;
void main()
{
  color = vec4(0,1,0,1);
  gl_Position = gl_Vertex;
}

[fragment shader]
varying vec4 color;

void main()
{
  gl_FragColor = color;
}

draw rect -1 -1 2 2

From the shader program we can infer that the vertex shader is reading from exactly one input variable, gl_Vertex, and is going to “output” two things: the varying vec4 color and the built-in vec4 gl_Position. Here is the disassembled program from invoking INTEL_DEBUG=vs. Don't worry if you can't make sense of the assembly that follows. I'm only trying to show the reader that the above shader program is indeed fetching vertex attributes with the aforementioned push model.

mov(8)          g119<1>UD       g1<8,8,1>UD                     { align1 WE_all 1Q compacted };
mov(8)          g120<1>F        g2<8,8,1>F                      { align1 1Q compacted };
mov(8)          g121<1>F        g3<8,8,1>F                      { align1 1Q compacted };
mov(8)          g122<1>F        g4<8,8,1>F                      { align1 1Q compacted };
mov(8)          g123<1>F        g5<8,8,1>F                      { align1 1Q compacted };
mov(8)          g124<1>F        [0F, 0F, 0F, 0F]VF              { align1 1Q compacted };
mov(8)          g125<1>F        1F                              { align1 1Q };
mov(8)          g126<1>F        [0F, 0F, 0F, 0F]VF              { align1 1Q compacted };
mov(8)          g127<1>F        1F                              { align1 1Q };
send(8)         null            g119<8,8,1>F
                            urb 1 SIMD8 write mlen 9 rlen 0                 { align1 1Q EOT };

Ignore that first mov for now. That last send instruction indicates that the program is outputting 9 registers of data as determined by mlen. Yes, mlen is 9 because of that first mov I asked you to ignore. The rest of the information for the send instruction won’t yet make sense. Refer to Intel’s public documentation if you want to know more about the send instruction, but hopefully I’ll cover most of the important parts by the end. The constant 0, 1, 0, 1[5] values which are moved to g124-g127 should make it clear which registers correspond to color, which means the other thing in g120-g123 is gl_Position, and g2-g5 is gl_Vertex (because gl_Position = gl_Vertex;).

Either you already know how this stuff works, or that should seem weird. We have two vec4 things, color and position, and yet we’re passing 8 vec4 registers down to the next stage – what are the other 6? Well, that is the result of SIMD8 dispatch.

Programming the Hardware

Okay, fine. Vertices are important. The various graphics APIs all give ways to specify them, but unless you want to use a software renderer, the hardware needs to be made aware of them too. I guess (and I really am guessing) that most of the graphics APIs are similar enough that a single set of HW commands is sufficient to achieve the goal. It’s actually not a whole lot of commands to do this, and in fact they make a good deal of sense once you understand enough about the APIs and HW.

For the very simple case, we can reduce the number of commands needed to get vertices into the GPU pipeline to 3. There are certainly other very important parts of setting up state so things will run correctly, and the not so simple cases have more commands just for this part of the pipeline setup.

  1. Here are my vertex buffers – 3DSTATE_VERTEX_BUFFERS
  2. Here is how my vertex buffers are laid out – 3DSTATE_VERTEX_ELEMENTS
  3. Do it – 3DPRIMITIVE

There is a 4th command which deserves honorable mention, 3DSTATE_VS. We’re going to have to defer dissecting that for now.

The hardware unit responsible for fetching vertices is simply called the Vertex Fetcher (VF). It’s the fixed function hardware that was mentioned earlier in the section about push vs. pull models. The Vertex Fetcher’s job is to feed the vertex shader, and this can be split into two main functional objectives: transferring the data from the vertex buffer into a format which can be read by the vertex shader, and allocating a handle and special internal memory for the data (more later).


The API specifies a vertex buffer which contains very important things for rendering the scene. This is done with a command that describes properties of the vertex buffer. From the 3DSTATE_VERTEX_BUFFERS documentation:

This structure is used in 3DSTATE_VERTEX_BUFFERS to set the state associated with a VB. The VF function will use this state to determine how/where to extract vertex element data for all vertex elements associated with the VB.

(the actual inline data is defined here: VERTEX_BUFFER_STATE)

In my words, the command specifies a number of buffers in any order which are “bound” during the time of drawing. These binding points are needed by the hardware later when it actually begins to fetch the data for drawing the scene. That is why you may have noticed that a lot of information is actually missing from the command. Here is a diagram which shows what a vertex buffer might look like in memory. This diagram is based upon the green square vertex shader described earlier.




The API also specifies the layout of the vertex buffers. This entails where the actual components are, as well as the format of the components. Application programmers can and do use lots of kooky stuff to do amazing things, which is, again, thankfully out of scope. As an example, let’s suppose our vertices are comprised of three attributes: a 2D coordinate X, Y; a normal vector U, V; and a color R, G, B.

Packing Vertex Buffers

In the above picture, the first two have very similar programming. They have the same number of VERTEX_ELEMENTS. The third case is a bit weird even from the API perspective, so just consider the fact that it exists and move on.
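
If it helps, here is a small Python sketch with made-up values of the two common packings; I'm assuming these correspond to the first two cases in the diagram: one interleaved buffer with a per-vertex stride, versus one tightly packed buffer per attribute.

import struct

# Hypothetical vertices: position (X, Y), normal (U, V), color (R, G, B).
vertices = [((-1.0, -1.0), (0.0, 1.0), (0, 255, 0)),
            (( 1.0, -1.0), (0.0, 1.0), (0, 255, 0))]

# Packing 1: interleaved - all attributes of a vertex sit next to each other,
# and the stride is the size of a whole vertex.
interleaved = b"".join(struct.pack("2f2f3B", *pos, *nrm, *col)
                       for pos, nrm, col in vertices)

# Packing 2: separate buffers - one tightly packed buffer per attribute.
positions = b"".join(struct.pack("2f", *pos) for pos, _, _ in vertices)
normals   = b"".join(struct.pack("2f", *nrm) for _, nrm, _ in vertices)
colors    = b"".join(struct.pack("3B", *col) for _, _, col in vertices)

print(len(interleaved), len(positions) + len(normals) + len(colors))  # same total size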

Just like the vertex buffer state we initialized in hardware, here too we must initialize the vertex element state.

Vertex Elements

I don’t think I’ve posted driver code yet, so here is some. The code loops over every enabled attribute enum gl_vert_attrib (GL defines some, and some are generic) and sets up the command for transferring the data from the vertex buffer to the special memory we’ll talk about later. The hardware will have a similar looking loop (the green text in the above diagram) to read these things and act upon them.

BEGIN_BATCH(1 + nr_elements * 2);
OUT_BATCH((_3DSTATE_VERTEX_ELEMENTS << 16) | (2 * nr_elements - 1));
for (unsigned i = 0; i < brw->vb.nr_enabled; i++) {
   struct brw_vertex_element *input = brw->vb.enabled[i];
   uint32_t format = brw_get_vertex_surface_type(brw, input->glarray);
   uint32_t comp0 = BRW_VE1_COMPONENT_STORE_SRC;
   uint32_t comp1 = BRW_VE1_COMPONENT_STORE_SRC;
   uint32_t comp2 = BRW_VE1_COMPONENT_STORE_SRC;
   uint32_t comp3 = BRW_VE1_COMPONENT_STORE_SRC;

   /* Components not present in the vertex buffer get a constant 0 (or 1 for
    * the last component); note the intentional fall-through. */
   switch (input->glarray->Size) {
   case 0: comp0 = BRW_VE1_COMPONENT_STORE_0;
   case 1: comp1 = BRW_VE1_COMPONENT_STORE_0;
   case 2: comp2 = BRW_VE1_COMPONENT_STORE_0;
   case 3: comp3 = input->glarray->Integer ? BRW_VE1_COMPONENT_STORE_1_INT
                                           : BRW_VE1_COMPONENT_STORE_1_FLT;
      break;
   }

   OUT_BATCH((input->buffer << GEN6_VE0_INDEX_SHIFT) |
             GEN6_VE0_VALID |
             (format << BRW_VE0_FORMAT_SHIFT) |
             (input->offset << BRW_VE0_SRC_OFFSET_SHIFT));
   OUT_BATCH((comp0 << BRW_VE1_COMPONENT_0_SHIFT) |
             (comp1 << BRW_VE1_COMPONENT_1_SHIFT) |
             (comp2 << BRW_VE1_COMPONENT_2_SHIFT) |
             (comp3 << BRW_VE1_COMPONENT_3_SHIFT));
}
ADVANCE_BATCH();


Once everything is set up, 3DPRIMITIVE tells the hardware to start doing stuff. Again, much of it is driven by the graphics API in use. I don’t think it’s worth going into too much detail on this one, but feel free to ask questions in the comments…


Modern GEN hardware has a dedicated GPU memory which is referred to as L3 cache. I’m not sure there was a more confusing way to name it. I’ll try to make sure I call it GPU L3, which separates it from whatever else may be called L3[6]. The GPU L3 is partitioned into several functional parts, the largest of which is the special URB memory.

The Unified Return Buffer (URB) is a general purpose buffer used for sending data between different threads, and, in some cases, between threads and fixed function units (or vice versa).

The Thread Dispatcher (TD) is the main source of URB reads. As a part of spawning a thread, pipeline fixed functions provide the TD with a number of URB handles, read offsets, and lengths. The TD reads the specified data from the URB and provide that data in the thread payload pre loaded into GRF registers.

I’ll explain that quote in finer detail in the next section. For now, simply recall from the first diagram that we have some input which goes into some geometry part of the pipeline (starting with vertex shading), followed by rasterization (and other fixed function things), followed by fragment shading. The URB is what contains the data of this geometry as it makes its way through the various stages within that geometry part of the pipeline. The fixed function part of the hardware will consume the data from the URB, at which point it can be reused for the next set of geometry.

To summarize, the data going through the pipeline may exist in 3 types of memory depending on the stage and the hardware:

  • Graphics memory (GDDR on discrete cards, or system memory on integrated graphics).
  • The URB.
  • The Execution Unit’s register file

URB space must be divided among the geometry shading stages. To do this, we have state commands like 3DSTATE_URB_VS. Allocation granularity is in 512-bit chunks, which one might make an educated guess is based on the cacheline size in the system (cacheline granularity tends to be the norm for what internal clients of memory arbiters can request).

Boiling down the URB allocation in mesa for the VS, you get an admittedly difficult to understand thing.

unsigned vs_size = MAX2(brw->vs.prog_data->base.urb_entry_size, 1);
unsigned vs_entry_size_bytes = vs_size * 64;
unsigned vs_wants =
   ALIGN(brw->urb.max_vs_entries * vs_entry_size_bytes,
         chunk_size_bytes) / chunk_size_bytes - vs_chunks;

The vs_entry_size_bytes is the size of every “packet” which is delivered to the VS shader thread with the push model of execution (keep this in mind for the next section).

count = _mesa_bitcount_64(vs_prog_data->inputs_read);
unsigned vue_entries =
   MAX2(count, vs_prog_data->base.vue_map.num_slots);

vs_prog_data->base.urb_entry_size = ALIGN(vue_entries, 4) / 4;

Above, count is the number of inputs for the vertex shader, and let’s defer a bit on num_slots. Ultimately, we get a count of VUE entries.
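
As a sanity check on that arithmetic, here is a tiny Python transcription of the entry-size part of the two snippets above (this is not driver code, and the slot count is a hypothetical value – the real one comes from the compiled shader's VUE map):

def align(x, a):
    return (x + a - 1) // a * a

inputs_read = 1   # the green square VS reads only gl_Vertex
num_slots = 3     # hypothetical VUE map size (header + position + color)

vue_entries = max(inputs_read, num_slots)
urb_entry_size = max(align(vue_entries, 4) // 4, 1)   # in 64-byte units
vs_entry_size_bytes = urb_entry_size * 64

print(vue_entries, urb_entry_size, vs_entry_size_bytes)  # -> 3 1 64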

WTF is a VUE

In general, vertex data is stored in Vertex URB Entries (VUEs) in the URB, processed by CLIP threads, and only referenced by the pipeline stages indirectly via VUE handles.

The Vertex Fetcher and the Thread Dispatcher will work in tandem to get a Vertex Shader thread running with all the correct data in place. A VUE handle is created for some portion of the URB with the specified size at the time the Vertex Fetcher acts upon a 3DSTATE_VERTEX_ELEMENT. The Vertex Fetcher will populate some part of the VUE based on the contents of the vertex buffer as well as things like adding the handle it generated, and format conversion. That data is referred to as the VS thread payload, and the functionality is referred to as Input Assembly (The documentation there is really good, you should read that page). It’s probably been too long without a picture. The following picture deviates from our green square example in case you are keeping a close eye.

VF create VUEs

There are a few things you should notice there. The VF can only write a full vec4 into the URB. In the example there are 2 vec2s (a position and a normal), but the VF must populate the remaining 2 elements of each. We briefly saw some of the VERTEX_ELEMENT fields earlier but I didn’t explain them. STORE_SRC instructs the VF to copy, or convert, the source data from the vertex buffer and write it into the VUE. STORE_0 and STORE_1 do not read from the vertex buffer, but they write a 0 and a 1, respectively, into the VUE. The terminology that seems to be the most prevalent for a vec4 of data in the VUE is “row”, i.e. we’ve used 2 rows per VUE for input: row 0 is the position, row 1 is the normal. The images depict a VUE as an array of vec4 rows.
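
A toy Python sketch of what those component controls amount to (an illustration, not how the hardware is implemented): the VF builds a full vec4 row by copying source components for STORE_SRC and synthesizing constants for STORE_0/STORE_1.

STORE_SRC, STORE_0, STORE_1 = "src", "0", "1"

def build_row(src_components, controls):
    """Expand a source attribute into a full vec4 VUE row."""
    src = list(src_components)
    row = []
    for ctrl in controls:
        if ctrl == STORE_SRC:
            row.append(src.pop(0))   # copy (or format-convert) from the vertex buffer
        elif ctrl == STORE_0:
            row.append(0.0)          # constant, no vertex buffer read
        else:
            row.append(1.0)
    return row

# A vec2 position (X, Y) becomes the row (X, Y, 0, 1), matching the driver's
# Size == 2 case in the switch statement shown earlier.
print(build_row([0.25, -0.75], [STORE_SRC, STORE_SRC, STORE_0, STORE_1]))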

Here is the second half of the action where the VS actually makes the vertex data available to the EU by loading the register file.

SIMD8 VS Thread Dispatch

A VUE is allocated per vertex, and one handle is allocated for each as well. That happens once (it’s more complicated with HS, DS, and GS, but I don’t claim to understand that, so I’ll skip it for now). That means that as the vertex attributes are getting mutated along the GPU pipeline, the output is going to have to be written back into the same VUE space that the input came from. The other constraint is that we need to actually set up the VUEs so they can be used by the fixed function pipeline later. This may be deferred until a later stage if we’re using other geometry shading stages, or perhaps ignored entirely for streamout. To sum up, there are two requirements we have to meet:

  1. The possibly modified vertex data for a given VUE must be written back to the same VUE.
  2. We must build the VUE header, also in place of the old VUE[7].

#2 seems pretty straightforward once we find the definition of a VUE header. That should also clear up why we had a function doing MAX2 to determine the VUE size in the previous chapter – we need the max of the inputs, and the outputs together with a VUE header. You don’t have to bother looking at the VUE header definition if you don’t want to; for our example we just need to know that we need to put an X, Y, Z, W in DWords 4-7 of the header. But how are we going to get the data back into the VUE… And now finally we can explain that 9th register (and mov) I asked you to ignore earlier. Give yourself a pat on the back if you still remember that.

Well, we didn’t allocate the handles or the space in the URB for the VUEs. The Vertex Fetcher hardware did that, so we can’t possibly know where they are. But wait! The VS Thread Payload for SIMD8 dispatch does have some interesting things in it. Here’s my simplified version of the payload from the table in the docs:

DWord                                | GRF  | Description
3                                    | g0.3 | Scratch Space / Sampler State Pointer
4                                    | g0.4 | Binding Table Pointer
5                                    | g0.5 | Scratch Offset
6-7                                  | g0.6 | Don't Care
8-15                                 | g1.0 | Vertex 0-7 Return Handles
16-23 (optional)                     | g2.0 | Constant Data
16-23 (pushed up for constant data)  | g2.0 | Vertex Data

Oh. g1 contains the “return” handles for the VUEs – in other words, the thing to reference the location of the output. Let’s go back to the original VS example now:

mov(8)          g119<1>UD       g1<8,8,1>UD                     { align1 WE_all 1Q compacted };
mov(8)          g120<1>F        g2<8,8,1>F                      { align1 1Q compacted };
mov(8)          g121<1>F        g3<8,8,1>F                      { align1 1Q compacted };
mov(8)          g122<1>F        g4<8,8,1>F                      { align1 1Q compacted };
mov(8)          g123<1>F        g5<8,8,1>F                      { align1 1Q compacted };
mov(8)          g124<1>F        [0F, 0F, 0F, 0F]VF              { align1 1Q compacted };
mov(8)          g125<1>F        1F                              { align1 1Q };
mov(8)          g126<1>F        [0F, 0F, 0F, 0F]VF              { align1 1Q compacted };
mov(8)          g127<1>F        1F                              { align1 1Q };
send(8)         null            g119<8,8,1>F
                            urb 1 SIMD8 write mlen 9 rlen 0                 { align1 1Q EOT };

With what we now know, we should be able to figure out what this means. Recall that we had 4 vertices in the green square. The destination of the mov instruction is the first register, so our chunk of data to be sent in the URB write message is:

g119 <= URB return handles
g120 <= X X X X ? ? ? ?
g121 <= Y Y Y Y ? ? ? ?
g122 <= Z Z Z Z ? ? ? ?
g123 <= W W W W ? ? ? ?
g124 <= 0 0 0 0 0 0 0 0
g125 <= 1 1 1 1 1 1 1 1
g126 <= 0 0 0 0 0 0 0 0 
g127 <= 1 1 1 1 1 1 1 1

The send command (message), like the rest of the ISA, operates upon a channel. In other words, read the values in column order. Send outputs n (9 in this case) registers worth of data. The URB write is pretty straightforward: it just writes n-1 (8 in this case) elements from the channel, consuming the first register as the URB handle. That happens across all 8 columns because we’re using SIMD8 dispatch.
Reading down a column, each channel therefore writes: URB return handle, X, Y, Z, W, 0, 1, 0, 1 (see the small sketch below). So that’s good, but if you are paying very close attention, you are missing one critical detail, and not enough information is provided above to figure it out. I’ll give you a moment to go back in case I’ve piqued your curiosity…
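
If the column-order reading feels abstract, here is a short Python illustration (not driver code, just symbolic names) of the naive result: treat the nine registers as a 9×8 grid and read each channel top to bottom to get one vertex's URB write.

# Symbolic contents of g119..g127 for our green square (4 real vertices,
# the remaining lanes marked '?').
registers = [
    ["h0", "h1", "h2", "h3", "?", "?", "?", "?"],   # g119: URB return handles
    ["x0", "x1", "x2", "x3", "?", "?", "?", "?"],   # g120
    ["y0", "y1", "y2", "y3", "?", "?", "?", "?"],   # g121
    ["z0", "z1", "z2", "z3", "?", "?", "?", "?"],   # g122
    ["w0", "w1", "w2", "w3", "?", "?", "?", "?"],   # g123
    [0] * 8,                                        # g124
    [1] * 8,                                        # g125
    [0] * 8,                                        # g126
    [1] * 8,                                        # g127
]

for lane in range(4):
    handle, *payload = [reg[lane] for reg in registers]
    print(f"write via {handle}: {payload}")
# e.g. write via h0: ['x0', 'y0', 'z0', 'w0', 0, 1, 0, 1]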

As we already stated above, DWords 4-7 of the VUE header have to be XYZW. So if you believe what I told you about URB writes, what we have above is going to end up with XYZW in DWords 0-3, and RGBA in DWords 4-7. That means the position will be totally screwed, and the color won’t even get passed down to the next stage (because it will be VUE header data). This is solved with the global offset field in the URB write message. To be honest, I don’t quite understand how the code implements this (it’s definitely in src/mesa/drivers/dri/i965/brw_fs_visitor.cpp). I can go find the references to the URB write messages and dig around the i965 code to figure out how it works, or you can do that and I’ll do one last diagram. What’s that, a diagram? Good choice.

URB writeback from VS


For me, writing blog posts like this is like writing a big patch series. You write and correct so many times that you become blind to any errors. WordPress says I am closing in on my 300th revision (and that of course excludes the diagrams). Please do let me know if you notice any errors.

I really wanted to talk about a few other things actually. The way SGVs work for VertexID, InstanceID, and the edge flag. Unfortunately, I think I’ve reached the limit on information that should be in a single blog post. Maybe I’ll do those later or something. Here is the gradient square using VertexID which I was planning to elaborate upon (that is why the first picture had a gradient square):

[require]
GLSL >= 1.30

[vertex shader]
varying vec4 color;
void main()
{
  color = vec4(gl_VertexID == 0, gl_VertexID == 1,
               gl_VertexID == 2, 1);
  gl_Position = gl_Vertex;
}

[fragment shader]
varying vec4 color;

void main()
{
  gl_FragColor = color;
}

[test]
draw rect -1 -1 2 2


A great link about the GPU pipeline which I never actually read.


Inkscape SVGs

I think Inkscape might save some extra stuff, so try that if it looks weird in the browser or whatever.
  • Basic Pipeline
  • Push vs. Pull
  • Pushed Attributes
  • Vertex Buffers and Vertex Elements
  • Attribute Packing
  • VUE Life Cycle

Download PDF

  1. I spent a significant amount of time and effort writing a post about the GLSL compiler in Mesa with a focus on the i965 backend. If you’ve read my blog posts, hopefully you see that I try to include real spec references and code snippets (and pretty pictures). I started work on that before NIR was fully baked, and before certain people essentially rewrote the whole freaking driver. Needless to say, by the time I had a decent post going, everything had changed. It would have been great to document that part of the driver, but sadly I discarded that post. If there is any interest, maybe I’d be willing to post what I had written/drawn in its current state.

  2. I find the word “pipeline” a bit misleading here, but maybe that’s just me. It’s not specific enough. It seems to imply that a vertex is the smallest atom the GPU will operate on, and while that’s true, it ignores the massively parallel nature of the beast. There are other oddities as well, like streamout.

  3. The decoding within libdrm is quite old and crufty. Most, if not everything, in this post was hand decoded via the specifications for the commands in the public documentation.

  4. To dump the generated vertex shader and fragment shader in native code (with some amount of the IR), you similarly add fs and vs. To dump all three things at once, for example: INTEL_DEBUG=batch,fs,vs.

  5. The layout is RGBA, i.e. a completely opaque green value.

  6. Originally, and in older public docs it is sometimes referred to as Mid Level Cache (MLC), and I believe they are exactly the same thing. MLC was the level before Last Level Cache (LLC), which made sense, until Baytrail which had a GPU L3, but did not have an LLC. 

  7. You could get clever, or buggy, depending on which way you look at it, and actually write into a different VUE. The ability is there, but it’s not often (if ever) used.

September 23, 2015
As I'm known to do, a focus on the little things I worked on during the just released GNOME 3.18 development cycle.

Hardware support

The accelerometer support in GNOME now uses iio-sensor-proxy. This daemon also now supports ambient light sensors, which Richard used to implement the automatic brightness adjustment, and compasses, which are used in GeoClue and gnome-maps.

In kernel-land, I've fixed the detection of some Bosch accelerometers and added support for another Kionix one, as used in some tablets.

I've also added quirks for out-of-the-box touchscreen support on some cheaper tablets using the goodix driver, and started reviewing a number of patches for that same touchscreen.

With Larry Finger, of Realtek kernel drivers fame, we've carried on cleaning up the Realtek 8723BS driver used in the majority of Windows-compatible tablets, in the Endless computer, and even in the $9 C.H.I.P. Linux computer.

Bluetooth UI changes

The Bluetooth panel now has better "empty states", explaining how to get Bluetooth working again when a hardware killswitch is used, or when it's been turned off by hand. We've also made receiving files through OBEX Push easier and built it into the Bluetooth panel, so that you won't forget to turn it off when done, and won't have trouble finding it, as is the case for settings that aren't used often.


GNOME Videos has seen some work, mostly in the stabilisation and bug fixing department; most of those fixes also landed in the 3.16 version.

We've also been laying the groundwork in grilo for writing ever less code in C for plugin sources. Grilo Lua plugins can now use gnome-online-accounts to access keys for specific accounts, which we've used to re-implement the Pocket videos plugin, as well as the cover art plugin.

All those changes should allow implementing OwnCloud support in gnome-music in GNOME 3.20.

My favourite GNOME 3.18 features

You can call them features, or bug fixes, but the overall improvements in the Wayland and touchpad/touchscreen support are pretty exciting. Do try it out when you get a GNOME 3.18 installation, and file bugs, it's coming soon!

Talking of bug fixes, this one means that I don't need to put in my password by hand when I want to access work related resources. Connect to the VPN, and I'm authenticated to Kerberos.

I've also got a particular attachment to the GeoClue GPS support through phones. This allows us to have more accurate geolocation support than any desktop environments around.

A few for later

The LibreOfficeKit support that will be coming to gnome-documents will help us get support for EPubs in gnome-books, as it will make it easier to plug in previewers other than the Evince widget.

Victor Toso has also been working through my Grilo bugs to allow us to implement a preview page when opening videos. Work has already started on that, so fingers crossed for GNOME 3.20!
September 22, 2015

Only 14 tickets still available!

systemd.conf 2015 is close to being sold out, there are only 14 tickets left now. If you haven't bought your ticket yet, now is the time to do it, because otherwise it will be too late and all tickets will be gone!

Why attend? At this conference you'll get to meet everybody who is involved with the systemd project and learn what they are working on, and where the project will go next. You'll hear from major users and projects working with systemd. It's the primary forum where you can make yourself heard and get first hand access to everybody who's working on the future of the core Linux userspace!

To get an idea about the schedule, please consult our preliminary schedule.

In order to register for the conference, please visit the registration page.

We are still looking for sponsors. If you'd like to join the ranks of systemd.conf 2015 sponsors, please have a look at our Becoming a Sponsor page!

For further details about systemd.conf consult the conference website.

September 18, 2015
I've done a talk at XDC 2015 about atomic modesetting with a focus for driver writers. Most of the talk is an overview of how an atomic modeset looks and how to implement the different parts in a driver backend. Anyway, for all those who missed it, there's a video and slides.
September 17, 2015

A few days ago, the French equivalent of Hacker News, called "Le Journal du Hacker", interviewed me about my work on OpenStack, my job at Red Hat and my self-published book The Hacker's Guide to Python. I've spent some time translating it into English so you can read it if you don't understand French! I hope you'll enjoy it.

Hi Julien, and thanks for participating in this interview for the Journal du Hacker. For our readers who don't know you, can you introduce yourself briefly?

You're welcome! My name is Julien, I'm 31 years old, and I live in Paris. I have now been developing free software for around fifteen years. I had the pleasure to work (among other things) on Debian, Emacs and awesome these last years, and more recently on OpenStack. For a few months now, I have been working at Red Hat as a Principal Software Engineer on OpenStack. I am in charge of doing upstream development for that cloud-computing platform, mainly around the Ceilometer, Aodh and Gnocchi projects.

Being a system architect myself, I have been following your work on OpenStack for a while. It's uncommon to have the point of view of someone as involved as you are. Can you give us a summary of the state of the project, and then detail your activities in this project?

The OpenStack project has grown and changed a lot since I started 4 years ago. It started as a few projects providing the basics, like Nova (compute), Swift (object storage), Cinder (volume), Keystone (identity) or even Neutron (network), which are the basis for a cloud-computing platform, and it finally became composed of a lot more projects.

For a while, the inclusion of projects was the subject of a strict review from the technical committee. But since a few months, the rules have been relaxed, and we see a lot more projects connected to cloud-computing joining us.

As far as I'm concerned, I started the Ceilometer project in 2012 with a few other people; it is devoted to handling the metrics of OpenStack platforms. Our goal is to be able to collect all the metrics and record them to analyze them later. We also have a module providing the ability to trigger actions on threshold crossing (alarms).

The project grew in a monolithic way, and in a linear way for the number of contributors, during the first two years. I was the PTL (Project Technical Leader) for a year. This leader position asks for a lot of time for bureaucratic things and people management, so I decided to leave my spot in order to be able to spend more time solving the technical challenges that Ceilometer offered.

I've started the Gnocchi project in 2014. The first stable version (1.0.0) was released a few months ago. It's a timeseries database offering a REST API and a strong ability to scale. It was a necessary development to solve the problems tied to the large amount of metrics created by a cloud-computing platform, where tens of thousands of virtual machines have to be metered as often as possible. This project works as a standalone deployment or with the rest of OpenStack.

More recently, I've started Aodh, the result of moving out the code and features of Ceilometer related to threshold action triggering (alarming). That's the logical continuation of what we started with Gnocchi. It means Ceilometer is to be split into independent modules that can work together – with or without OpenStack. It seems to me that the features provided by Ceilometer, Aodh and Gnocchi can also be interesting for operators running more classical infrastructures. That's why I've pushed the projects in that direction, and also towards a more service-oriented architecture (SOA).

I'd like to stop for a moment on Ceilometer. I think that this component was much awaited, especially by the cloud-computing providers using OpenStack for billing the resources sold to their customers. I remember reading a blog post where you were talking about the high-speed construction of this component, and features that were not supposed to be there. Nowadays, with Gnocchi and Aodh, what is the quality of Ceilometer and the programs it relies on?

Indeed, one of the first use-cases for Ceilometer was tied to the ability to get metrics to feed a billing tool. That goal has now been reached, since we have billing tools for OpenStack using Ceilometer, such as CloudKitty.

However, other use-cases appeared rapidly, such as the ability to trigger alarms. This feature was necessary, for example, to implement the auto-scaling feature that Heat needed. At the time, for technical and political reasons, it was not possible to implement this feature in a new project, and the functionality ended up in Ceilometer, since it was using the metrics collected and stored by Ceilometer itself.

Though, like I said, this feature is now in its own project, Aodh. The alarm feature has been used in production for a few cycles, and the Aodh project brings new features to the table. It allows triggering actions on threshold crossings and is one of the few solutions able to work at high scale, with several thousands of alarms. It's impossible to make Nagios run with millions of instances to fetch metrics and trigger alarms. Ceilometer and Aodh can do that easily on a few tens of nodes automatically.

On the other hand, Ceilometer has for a long time been painted as slow and complicated to use, because its metrics storage system was using MongoDB by default. Clearly, the data structure model picked was not optimal for what the users were doing with the data.

That's why I started Gnocchi last year, which is designed precisely for this use case. It offers constant access time to metrics (O(1) complexity) and fast access time to the resource data via an index.

Today, with 3 projects having their own perimeter of features defined – and which can work together – Ceilometer, Aodh and Gnocchi finally erased the biggest problems and defects of the initial project.

To end with OpenStack, one last question. You have been a Python developer for a long time and you are a fervent user of software testing and test-driven development. Several of your blog posts point out how important their usage is. Can you tell us more about the usage of tests in OpenStack, and the test prerequisites to contribute to OpenStack?

I don't know any project that is as tested on every layer as OpenStack is. At the start of the project, there was vague test coverage, made of a few unit tests. For each release, a bunch of new features were provided, and you had to keep your fingers crossed to have them working. That's already almost unacceptable. But the big issue was that there were also a lot of regressions, and things that had been working stopped working. It was often corner cases that developers forgot about that broke.

Then the project decided to change its policy and started to refuse all patches – new features or bug fixes – that did not implement a minimal set of unit tests proving the patch would work. Quickly, regressions were history, and the number of bugs was largely reduced month after month.

Then came the functional tests, with the Tempest project, which runs a test battery on a complete OpenStack deployment.

OpenStack now possesses a complete test infrastructure, with operators hired full-time to maintain them. The developers have to write the test, and the operators maintain an architecture based on Gerrit, Zuul, and Jenkins, which runs the test battery of each project for each patch sent.

Indeed, for each version of a patch sent, a full OpenStack is deployed into a virtual machine, and a battery of thousands of unit and functional tests is run to check that no regressions are possible.

To contribute to OpenStack, you need to know how to write a unit test – the policy on functional tests is laxer. The tools used are standard Python tools: unittest for the framework and tox to create a virtual environment (venv) and run the tests.

It's also possible to use DevStack to deploy an OpenStack platform on a virtual machine and run the functional tests. However, since the project infrastructure also does that when a patch is submitted, it's not mandatory to do that yourself locally.

The tools and tests you write for OpenStack are written in Python, a language which is very popular today. You seem to like it more than you have to, since you wrote a book about it, The Hacker's Guide to Python, that I really enjoyed. Can you explain what brought you to Python, the main strong points you attribute to this language (quickly) and how you went from developer to author?

I stumbled upon Python by chance, around 2005. I don't remember how I heard about it, but I bought a first book to discover it and started toying with the language. At that time, I didn't find any project to contribute to or to start. My first project with Python was rebuildd for Debian in 2007, a bit later.

I like Python for its simplicity, its rather clean object orientation, its ease of deployment and its rich open source ecosystem. Once you get the basics, it's very easy to evolve and to use it for anything, because the ecosystem makes it easy to find libraries to solve any kind of problem.

I became an author by chance, writing blog posts from time to time about Python. I finally realized that after a few years studying Python internals (CPython), I had learned a lot of things. While writing a post about the differences between method types in Python – which is still one of the most read posts on my blog – I realized that a lot of things that seemed obvious to me were not for other developers.

I wrote that initial post after thousands of hours spent doing code reviews on OpenStack. I therefore decided to note all the developers' pain points and to write a book about that: a compilation of what years of experience taught me and taught the other developers I decided to interview in the book.

I've been very interested by the publication of your book, for the subject itself, but also for the process you chose. You self-published the book, which seems very relevant nowadays. Was that a choice from the start? Did you look for an editor? Can you tell us more about that?

I've been lucky to find out about other self-published authors, such as Nathan Barry – who even wrote a book on that subject, called Authority. That's what convinced me it was possible and gave me hints for that project.

I started to write in August 2013, and I ran the first interviews with other developers at that time. I started to write the table of contents and then filled the pages with what I knew and what I wanted to share. I managed to finish the book around January 2014. The proof-reading took more time than I expected, so the book was only released in March 2014. I wrote a complete report about that on my blog, where I explain the full process in detail, from writing to launching.

I did not look for editors, though I have been approached by some. The idea of self-publishing really convinced me, so I decided to go on my own, and I have no regrets. It's true that you have to wear two hats at the same time and handle a lot more things, but with a minimal audience and some help from the Internet, anything's possible!

I've been reached by two editors since then, a Chinese and a Korean one. I gave them the rights to translate and publish the book in their countries, so you can buy the Chinese and Korean versions of the first edition out there.

Seeing how successful it was, I decided to launch a second edition in May 2015, and it's likely that a third edition will be released in 2016.

Nowadays, you work for Red Hat, a company that represents the success of using Free Software as a commercial business model. This company fascinates a lot in our community. What can you say about your employer from your point of view?

It has only been a year since I joined Red Hat (when they bought eNovance), so my experience is quite recent.

Though, Red Hat is really a special company on every level. It's hard to see from the outside how open it is and how it works. It's really close to, and really looks like, an open source project. For more details, you should read The Open Organization, a book written by Jim Whitehurst (CEO of Red Hat), which he just published. It describes perfectly how Red Hat works. To summarize, meritocracy and the lack of organization in silos are what make Red Hat a strong organization and one of the most innovative companies.

In the end, I'm lucky enough to be autonomous in the projects I work on with my team around OpenStack, and I can spend 100% of my time working upstream and enhancing the Python ecosystem.

September 15, 2015

Many modern mice have the ability to store profiles, customize button mappings and actions, and switch between several hardware resolutions. A number of those mice are targeted at gamers, but the features are increasingly common in standard mice. Under Linux, support for these devices is spotty, though there are a few projects dedicated to supporting parts of the available device range. [1] [2] [3]

Benjamin Tissoires and I started a new project: libratbag. libratbag is a library to provide a generic interface to these mice, enabling desktop environments to provide configuration tools without having to worry about the device model. As of the time of this writing, we have partial support for the Logitech HID++ 1.0 (G500, G5) and HID++ 2.0 protocols (G303), the Etekcity Scroll Alpha and the Roccat Kone XTD. Thomas H. P. Anderson already added the G5, G9 and the M705.

git clone

The internal architecture is fairly simple, behind the library's API we have a couple of protocol-specific drivers that access the mouse. The drivers match a specific product/vendor ID combination and load the data from the device, the library then exports it to the caller as a struct ratbag_device. Each device has at least one profile, each profile has a number of buttons and at least one resolution. Where possible, the resolutions can be queried and set, the buttons likewise can be queried and set for different functions. If the hardware supports it, you can map buttons to other buttons, assign macros, or special functions such as DPI/profile switching. The main goal of libratbag is to unify access to the devices so a configuration application doesn't need different libraries per hardware. Especially short-term, we envision using some of the projects listed above through custom backends.

We're at version 0.1 at the moment, so the API is still subject to change. It looks like this:

#include <libratbag.h>

struct ratbag *ratbag;
struct ratbag_device *device;
struct ratbag_profile *p;
struct ratbag_button *b;
struct ratbag_resolution *r;

ratbag = ratbag_create_context(...);
device = ratbag_device_new_from_udev(ratbag, udev_device);

/* retrieve the first profile */
p = ratbag_device_get_profile(device, 0);

/* retrieve the first resolution setting of the profile */
r = ratbag_profile_get_resolution(p, 0);
printf("The first resolution is: %dpi @ %d Hz\n",


/* retrieve the fourth button */
b = ratbag_profile_get_button(p, 4);

if (ratbag_button_get_action_type(b) == RATBAG_BUTTON_ACTION_TYPE_SPECIAL &&
    ratbag_button_get_special(b) == RATBAG_BUTTON_ACTION_SPECIAL_RESOLUTION_UP)
    printf("button 4 selects next resolution");


For testing and playing around with libratbag, we have a tool called ratbag-command that exposes most of the library:

$ ratbag-command info /dev/input/event8
Device 'BTL Gaming Mouse'
Capabilities: res profile btn-key btn-macros
Number of buttons: 11
Profiles supported: 5
Profile 0 (active)
0: 800x800dpi @ 500Hz
1: 800x800dpi @ 500Hz (active)
2: 2400x2400dpi @ 500Hz
3: 3200x3200dpi @ 500Hz
4: 4000x4000dpi @ 500Hz
5: 8000x8000dpi @ 500Hz
Button: 0 type left is mapped to 'button 1'
Button: 1 type right is mapped to 'button 2'
Button: 2 type middle is mapped to 'button 3'
Button: 3 type extra (forward) is mapped to 'profile up'
Button: 4 type side (backward) is mapped to 'profile down'
Button: 5 type resolution cycle up is mapped to 'resolution cycle up'
Button: 6 type pinkie is mapped to 'macro "": H↓ H↑ E↓ E↑ L↓ L↑ L↓ L↑ O↓ O↑'
Button: 7 type pinkie2 is mapped to 'macro "foo": F↓ F↑ O↓ O↑ O↓ O↑'
Button: 8 type wheel up is mapped to 'wheel up'
Button: 9 type wheel down is mapped to 'wheel down'
Button: 10 type unknown is mapped to 'none'
Profile 1
And to toggle/query the various settings on the device:

$ ratbag-command dpi set 400 /dev/input/event8
$ ratbag-command profile 1 resolution 3 dpi set 800 /dev/input/event8
$ ratbag-command profile 0 button 4 set action special doubleclick

libratbag is in a very early state of development. There are a bunch of FIXMEs in the code, the hardware support is still spotty and we'll appreciate any help we can get, especially with the hardware driver backends. There's a TODO in the repo for some things that we already know needs changing. Feel free to browse the repo on github and drop us some patches.

Eventually we want this to be integrated into the desktop environments, either in the respective control panels or in a standalone application. libratbag already provides SVGs for some devices we support but we'll need some designer input for the actual application. Again, any help you want to provide here will be much appreciated.

A Preliminary systemd.conf 2015 Schedule is Now Online!

We are happy to announce that an initial, preliminary version of the systemd.conf 2015 schedule is now online! (Please ignore that some rows in the schedule link the same session twice on that page. That's a bug in the web site CMS that we are working to fix.)

We got an overwhelming number of high-quality submissions during the CfP! Because there were so many good talks we really wanted to accept, we decided to do two full days of talks now, leaving one more day for the hackfest and BoFs. We also shortened many of the slots, to make room for more. All in all we now have a schedule packed with fantastic presentations!

The areas covered range from containers, to system provisioning, stateless systems, distributed init systems, the kdbus IPC, control groups, systemd on the desktop, systemd in embedded devices, configuration management and systemd, and systemd in downstream distributions.

We'd like to thank everybody who submitted a presentation proposal!

Also, don't forget to register for the conference! Only a limited number of registrations are available due to space constraints! Register here!

We are still looking for sponsors. If you'd like to join the ranks of systemd.conf 2015 sponsors, please have a look at our Becoming a Sponsor page!

For further details about systemd.conf consult the conference website.

September 14, 2015

We've been working hard with the Gnocchi team these last months to store your metrics, and I guess it's time to show off a bit.

So far Gnocchi offers scalable metric storage and resource indexation, especially for OpenStack cloud – but not only, we're generic. It's cool to store metrics, but it can be even better to have a way to visualize them!


We very soon started to build a little HTML interface. Being REST-friendly guys, we enabled it on the same endpoints that were being used to retrieve information and measures about metrics, sending back text/html instead of application/json if you were requesting those pages from a Web browser.

But let's face it: we are back-end developers, and we suck at any kind of front-end development. CSS, HTML, JavaScript? Bwah! So what we built was a starting point, hoping some magical Web developer would jump in and finish the job.

Obviously it never happened.

Ok, so what's out there?

It turns out there are back-end agnostic solutions out there, and we decided to pick Grafana. Grafana is a complete graphing dashboard solution that can be plugged on top of any back-end. It already supports timeseries databases such as Graphite, InfluxDB and OpenTSDB.

That was more than enough for my fellow developer Mehdi Abaakouk to jump in and start writing a Gnocchi plugin for Grafana! Consequently, there is now a basic but solid and working back-end for Grafana that lies in the grafana-plugins repository.

With that plugin, you can graph anything that is stored in Gnocchi, from raw metrics to metrics tied to resources. You can use templating, but no annotation yet. The back-end supports Gnocchi with or without Keystone involved, and any type of authentication (basic auth or Keystone token). So yes, it even works if you're not running Gnocchi with the rest of OpenStack.

It also supports advanced queries, so you can search for resources based on some criteria and graph their metrics.

I want to try it!

If you want to deploy it, all you need to do is to install Grafana and its plugins, and create a new datasource pointing to Gnocchi. It is that simple. There's some CORS middleware configuration involved if you're planning on using Keystone authentication, but it's pretty straightforward – just set the cors.allowed_origin option to the URL of your Grafana dashboard.

We added support for Grafana directly in the Gnocchi DevStack plugin. If you're running DevStack you can follow the instructions – which basically means adding the line enable_service gnocchi-grafana.

Moving to Grafana core

Mehdi just opened a pull request a few days ago to merge the plugin into Grafana core. It's actually one of the most unit-tested plugins in Grafana so far, so it should be on a good path to be merged in the future, bringing support for Gnocchi directly into Grafana without any plugin involved.

September 11, 2015

The Limba project does not only have the goal of allowing developers to deploy their applications directly on multiple Linux distributions while reducing duplication of shared resources; it should also make it easy for developers to build software for Limba.

Limba is worth nothing without good tooling to make it fun to use. That’s why I am working on that too, and I want to share some ideas of how things could work in future and which services I would like to have running. I will also show what is working today already (and that’s quite something!). This time I look at things from a developer’s perspective (since the last posts on Limba were more end-user centric). If you read on, you will also find a nice video of the developer workflow 😉

1. Creating metadata and building the software

To make building Limba packages as simple as possible, Limba reuses already existing metadata, like AppStream metadata to find information about the software you want to create your package for.

To ensure upstreams can build their software in a clean environment, Limba makes using one as simple as possible: The limba-build CLI tool creates a clean chroot environment quickly by using an environment created by debootstrap (or a comparable tool suitable for the Linux distribution), and then using OverlayFS to have all changes to the environment done during the build process land in a separate directory.

To define build instructions, limba-build uses the same YAML format TravisCI uses as well for continuous integration. So there is a chance this data is already present as well (if not, it’s trivial to write).

In case upstream projects don’t want to use these tools, e.g. because they have well-working CI already, then all commands needed to build a Limba package can be called individually as well (ideally, building a Limba package is just one call to lipkgen).

I am currently planning “DeveloperIPK” packages containing resources needed to develop against another Limba package. With that in place and integrated with the automatic build-environment creation, upstream developers can be sure the application they just built is built against the right libraries as present in the package they depend on. The build tool could even fetch the build-dependencies automatically from a central repository.

2. Uploading the software to a repository

While everyone can set up their own Limba repository, and the limba-build repo command will help with that, there are lots of benefits in having a central place where upstream developers can upload their software to.

I am currently developing a service like that, called “LimbaHub”. LimbaHub will contain different repositories distributors can make available to their users by default, e.g. there will be one with only free software, and one for proprietary software. It will also later allow upstreams to create private repositories, e.g. for beta-releases.

3. Security in LimbaHub

Every Limba package is signed with the key of its creator anyway, so in order to get a package into LimbaHub, one needs to get their OpenPGP key accepted by the service first.

Additionally, the Hub service works with a per-package permission system. This means I can e.g. allow the Mozilla release team members to upload a package with the component-ID “org.mozilla.firefox.desktop” or even allow those user(s) to “own” the whole org.mozilla.* namespace.

This should prevent people hijacking other people’s uploads accidentally or on purpose.

4. QA with LimbaHub

LimbaHub should also act as guardian over ABI stability and general quality of the software. We could for example warn upstreams that they broke ABI without declaring that in the package information, or even reject the package then. We could validate .desktop files and AppStream metadata, or even check if a package was built using hardening flags.

This should help both developers to improve their software as well as users who benefit from that effort. In case something really bad gets submitted to LimbaHub, we always have the ability to remove the package from the repositories as a last resort (which might trigger Limba to issue a warning for the user that he won’t receive updates anymore).

What works

Limba, LimbaHub and the tools around it are developing nicely, so far no big issues have been encountered yet.

That’s why I made a video showing how Limba and LimbaHub work together at the moment:

Still, there is a lot of room for improvement – Limba has not yet received enough testing, and LimbaHub is merely a proof-of-concept at the moment. Also, lots of high-priority features are not yet implemented.

LimbaHub and Limba need help!

At the moment I am developing LimbaHub and Limba alone, with only occasional contributions from others (which are amazing and highly welcome!). So, if you like Python and Flask, and want to help developing LimbaHub, please contact me – the LimbaHub software could benefit from a more experienced Python web developer than I am 😉 (and maybe having a designer look over the frontend later makes sense as well). If you are not afraid of C and GLib, and like to chase bugs or play with building Limba packages, consider helping Limba development :-)

September 07, 2015
Kernel 4.2 is released already and the 4.3 merge window in full swing, time to look at what's in it for the intel graphics driver.

Biggest thing for sure is that Skylake is finally out of preliminary support and enabled by default. The reason for the long hold-up was some ABI fumble - the hardware exposes the topmost plane both through the new universal plane registers and the legacy cursor registers, and because we simply carried the legacy plane code around in the driver we ended up exposing both. This wasn't something big to take care of, but somehow it dragged on forever.

The other big thing is that legacy modesets are now done with the new atomic modesetting code driver-internally. Atomic support in i915.ko isn't fully ready for prime time yet, but this is definitely a big step forward. Besides atomic there are also other cross-platform improvements in the modeset code: Ville fixed up the 12bpc support for HDMI, which is now used by default if the screen supports it. Mika Kahola and Ville also implemented dynamic adjustment of the cdclk, which is the main clock source for display engines on Intel graphics. And there's a big difference in the clock speeds needed between e.g. a 4k screen and a 720p TV.

Continuing with power saving features Rodrigo again spent a lot of time fixing up PSR (panel self refresh). And Paulo did the same by writing patches to improve FBC (framebuffer compression). We have some really solid testcases by now, unfortunately neither feature is ready for enabling by default yet. Especially PSR is still plagued by screen freezes on some random systems. Also there's been some fixes to DRRS (dynamic refresh rate switching) from Ramalingam. DRRS is enabled by default already, where supported. And finally some improvements to make the frontbuffer rendering tracking more accurate, which is used by all three of these display power saving features.

And of course there are also tons of improvements to platform code. Display PLL code for Skylake and Valleyview&Cherryview was tuned by Damien and Ville respectively. There's been tons of work on Broxton and DSI support by Imre, Gaurav and others.

Moving on to the rendering side, the big change is how tracking of rendering tasks is handled. In the past the driver just used raw sequence numbers emitted by the hardware, but for cross-driver synchronization and reordering tasks with an eventual gpu scheduler more abstraction is needed. A big step is converting over to the i915 request structure completely, done by John Harrison. The next step will be to switch the internal implementation for i915 requests to the cross-driver fences, but that's for future kernels. As a follow-up cleanup John also removed the OLR, which stands for outstanding lazy request. It was a neat little trick implemented years ago to simplify handling error recovery, but which caused tons of pain with subtle bugs. Making requests more explicit in the driver finally allowed us to remove this trick.

There's also been a pile of platform related features: MOCS programming for Skylake/Broxton (which is used for caching control). Resource streamer support from Abdiel, which is used to offload some of the buffer object tracking for shaders from the cpu to the gpu. And the command parser on Haswell was extended to support atomic instructions in shaders. And finally for Skylake Mika Kuoppala added code to avoid resetting the gpu - in certain cases the hardware would hard-hang the entire system trying to execute the reset. And a dead gpu is still better than a dead system.

Piwik told me that people are still sharing my post about the state of GNOME-Software and update notifications in Debian Jessie.

So I thought it might be useful to publish a small update on that matter:

  • If you are using GNOME or KDE Plasma with Debian 8 (Jessie), everything is fine – you will receive update notifications through the GNOME-Shell/via g-s-d or Apper respectively. You can perform updates on GNOME with GNOME-PackageKit and with Apper on KDE Plasma.
  • If you are using a desktop environment that doesn’t support PackageKit directly – for example Xfce, which previously relied on external tools – you might want to try pk-update-icon from jessie-backports. The small GTK+ tool will notify you about updates and install them via GNOME-PackageKit, basically doing what GNOME-PackageKit did by itself before the functionality was moved into GNOME-Software.
  • For Debian Stretch, the upcoming release of Debian, we will have gnome-software ready and fully working. However, one of the design decisions of upstream is to only allow offline-updates (= download updates in the background, install on request at next reboot) with GNOME-Software. In case you don’t want to use that, GNOME-PackageKit will still be available, and so are of course all the CLI tools.
  • For KDE Plasma 5 on Debian 9 (Stretch), a nice AppStream based software management solution with a user-friendly updater is also planned and being developed upstream. More information on that will come when there’s something ready to show ;-).

I hope that clarifies things a little. Have fun using Debian 8!


It appears that many people have problems with getting update notifications in GNOME on Jessie. If you are affected by this, please try the following:

  1. Open dconf-editor and navigate to org.gnome.settings-daemon.plugins.updates. Check if the key active is set to true
  2. If that doesn’t help, also check if at org.gnome.settings-daemon.plugins.updates the frequency-refresh-cache value is set to a sane value (e.g. 86400)
  3. Consider increasing the priority value (if it isn’t at 300 already)
  4. If all of that doesn’t help: I guess some internal logic in g-s-d is preventing a cache refresh then (e.g. because it thinks it is on an expensive network connection and therefore doesn’t refresh the cache automatically, or thinks it is running on battery). This is a bug. If that still happens on Debian Stretch with GNOME-Software, please report a bug against the gnome-software package. As a workaround for Jessie you can enable unconditional cache refreshing via the APT cronjob by installing apt-config-auto-update.
September 05, 2015
While trying to autogenerate C interfaces for XKB during the last 2 months, it sometimes felt like an attempt to "behead the Hydra": whenever I fixed a problem, a number of new issues arose (deep from the intestines of the python code generator). At the moment it seems I have reached a state where no new heads emerge, and I even managed to cut off a number of the ugliest ones...

The cause of most (all?) troubles is a very basic way of distinguishing data types into fixed size (everything you can feed to sizeof), variable size data types (lists, valueparams, ...) and padding. In the X protocol fixed size fields are normally grouped together, so e.g. in a request all fixed size fields are collected before any variable size fields.

XCB, whose purpose is to translate between X server and client, takes advantage of this ordering: Several grouped fixed size fields are conveniently mapped to a C struct, so they are fairly easy to deal with. The treatment of variable size data is more difficult and involves the use of (autogenerated) accessors and iterators. Also, the specific number of elements in a variable size data type can depend on expressions that are specified in the protocol, but need to be evaluated at runtime.
Now XKB breaks with the "pure doctrine" of the X core protocol: Fixed and variable size fields are intermixed in several cases and new expressions are introduced. Further problematic cases include a variable size union, lists with a variable number of variable size elements, ... And finally, the keyboard extension defines a new datatype (ListOfItems), which is referred to as 'switch' in XCB.

Switch is a kind of list, but the 'items' are not necessarily of the same type, and whether an item is included or not depends on expressions that have to be evaluated for each item separately. Defining a C mapping for 'switch' was one of the main goals of my work, and it turned out that a set of new functions was needed. These new functions must either 'serialize' a switch into a buffer, depending on the concrete switch conditions, or 'unserialize' it to some C struct. As 'switch' can contain any other data type valid in the X protocol (which means especially that it also can contain other switches...), 'serialize'/'unserialize' had to be defined for all those data types.
Once I had the autogenerator spill out '_serialize()' and '_unserialize()', it turned out they can be used to deal with some other special cases as well - most notably the (above-mentioned) intermixing of fixed and variable size fields.

But probably the most appealing feature of the serializers is that they are invisible to anyone who wants to use XCB as they get called automatically in the autogenerated helper functions.

A last note on the current status:
+ xkb.c (autogenerated from xkb.xml) compiles!
+ the request side should be finished (more tests needed however)
+ simple XKB test programs run
- on the reply side, special cases still need special treatment
- _unserialize() should be renamed to _sizeof() in some cases
- some special cases also need special accessors

The code is currently hosted on annarchy, but I will export the repos shortly.
September 04, 2015

Continuing my post series on the tools I use these days in Python, this time I would like to talk about a library I really like, named voluptuous.

It's no secret that most of the time, when a program receives data from the outside, handling it properly is a big deal. Indeed, most of the time your program has no guarantee that the stream is valid and that it contains what is expected.

The robustness principle says you should be liberal in what you accept, though that is not always a good idea either. Whatever policy you choose – lax or not – you need to process the data and implement that policy.

That means that the program needs to look into the data received, check that it finds everything it needs, complete what might be missing (e.g. set some defaults), transform some of the data, and maybe reject the data in the end.

Data validation

The first step is to validate the data, which means checking all the fields are there and all the types are right or understandable (parseable). Voluptuous provides a single interface for all that called a Schema.

>>> from voluptuous import Schema
>>> s = Schema({
... 'q': str,
... 'per_page': int,
... 'page': int,
... })
>>> s({"q": "hello"})
{'q': 'hello'}
>>> s({"q": "hello", "page": "world"})
voluptuous.MultipleInvalid: expected int for dictionary value @ data['page']
>>> s({"q": "hello", "unknown": "key"})
voluptuous.MultipleInvalid: extra keys not allowed @ data['unknown']

The argument to voluptuous.Schema should be the data structure that you expect. Voluptuous accepts any kind of data structure, so it could also be a simple string or an array of dicts of arrays of integers. You get it. Here it's a dict with a few keys that, if present, should be validated as certain types. By default, Voluptuous does not raise an error if some keys are missing. However, it is invalid to have extra keys in a dict by default. If you want to allow extra keys, it is possible to specify it.

>>> from voluptuous import Schema
>>> s = Schema({"foo": str}, extra=True)
>>> s({"bar": 2})
{"bar": 2}

It is also possible to make some keys mandatory.

>>> from voluptuous import Schema, Required
>>> s = Schema({Required("foo"): str})
>>> s({})
voluptuous.MultipleInvalid: required key not provided @ data['foo']

You can create custom data types very easily. Voluptuous data types are actually just functions that are called with one argument, the value, and that should either return the value or raise an Invalid or ValueError exception.

>>> from voluptuous import Schema, Invalid
>>> def StringWithLength5(value):
... if isinstance(value, str) and len(value) == 5:
... return value
... raise Invalid("Not a string with 5 chars")
>>> s = Schema(StringWithLength5)
>>> s("hello")
>>> s("hello world")
voluptuous.MultipleInvalid: Not a string with 5 chars

Most of the time though, there is no need to create your own data types. Voluptuous provides logical operators that can, combined with a few other provided primitives such as voluptuous.Length or voluptuous.Range, cover a large range of validation schemes.

>>> from voluptuous import Schema, Length, All
>>> s = Schema(All(str, Length(min=3, max=5)))
>>> s("hello")
>>> s("hello world")
voluptuous.MultipleInvalid: length of value must be at most 5

The voluptuous documentation has a good set of examples that you can check to have a good overview of what you can do.

Data transformation

What's important to remember is that each data type that you use is a function that is called and returns a value, if the value is considered valid. That returned value is what is actually used and returned after the schema validation:

>>> import uuid
>>> from voluptuous import Schema
>>> def UUID(value):
... return uuid.UUID(value)
>>> s = Schema({"foo": UUID})
>>> data_converted = s({"foo": "uuid?"})
voluptuous.MultipleInvalid: not a valid value for dictionary value @ data['foo']
>>> data_converted = s({"foo": "8B7BA51C-DFF5-45DD-B28C-6911A2317D1D"})
>>> data_converted
{'foo': UUID('8b7ba51c-dff5-45dd-b28c-6911a2317d1d')}

By defining a custom UUID function that converts a value to a UUID, the schema converts the string passed in the data to a Python UUID object – validating the format at the same time.

Note a little trick here: it's not possible to use uuid.UUID directly in the schema, otherwise Voluptuous would check that the data is actually an instance of uuid.UUID:

>>> from voluptuous import Schema
>>> s = Schema({"foo": uuid.UUID})
>>> s({"foo": "8B7BA51C-DFF5-45DD-B28C-6911A2317D1D"})
voluptuous.MultipleInvalid: expected UUID for dictionary value @ data['foo']
>>> s({"foo": uuid.uuid4()})
{'foo': UUID('60b6d6c4-e719-47a7-8e2e-b4a4a30631ed')}

And that's not what is wanted here.

That mechanism is really neat to transform, for example, strings to timestamps.

>>> import datetime
>>> from voluptuous import Schema
>>> def Timestamp(value):
... return datetime.datetime.strptime(value, "%Y-%m-%dT%H:%M:%S")
>>> s = Schema({"foo": Timestamp})
>>> s({"foo": '2015-03-03T12:12:12'})
{'foo': datetime.datetime(2015, 3, 3, 12, 12, 12)}
>>> s({"foo": '2015-03-03T12:12'})
voluptuous.MultipleInvalid: not a valid value for dictionary value @ data['foo']

Recursive schemas

Voluptuous has one limitation so far: recursive schemas are not directly supported. The simplest way to circumvent that is by using another function as an indirection.

>>> from voluptuous import Schema, Any
>>> def _MySchema(value):
... return MySchema(value)
>>> MySchema = Schema({"foo": Any("bar", _MySchema)})
>>> MySchema({"foo": {"foo": "bar"}})
{'foo': {'foo': 'bar'}}
>>> MySchema({"foo": {"foo": "baz"}})
voluptuous.MultipleInvalid: not a valid value for dictionary value @ data['foo']['foo']

Usage in REST API

I started to use Voluptuous to validate data in the REST API provided by Gnocchi. So far it has been a really good tool, and we've been able to create a complete REST API that is very easy to validate on the server side. I would definitely recommend it for that. It blends with any Web framework easily.
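For illustration, here is a minimal sketch (not Gnocchi's actual code – the resource name, fields and handler are made up) of how a schema typically sits in front of a request handler:

# Minimal sketch of voluptuous guarding a JSON request body in a generic
# handler. "resource_schema", "name" and "ttl" are made-up names.
import json
from voluptuous import Schema, Required, MultipleInvalid

resource_schema = Schema({
    Required("name"): str,
    "ttl": int,
})

def create_resource(request_body):
    """Hypothetical handler: returns (status_code, payload)."""
    try:
        data = resource_schema(json.loads(request_body))
    except MultipleInvalid as e:
        # voluptuous collects all validation errors into one exception
        return 400, {"error": str(e)}
    return 201, data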

One of the upsides compared to a solution like JSON Schema is the ability to create or re-use your own custom data types while converting values at validation time. It is also very Pythonic, and extensible – it's pretty great to use for all of that. It's also not tied to any serialization format.

On the other hand, JSON Schema is language agnostic and is serializable itself as JSON. That makes it easy to be exported and provided to a consumer so it can understand the API and validate the data potentially on its side.

August 29, 2015


GPU mirroring provides a mechanism to have the CPU and the GPU use the same virtual address for the same physical (or IOMMU) page. An immediate result of this is that relocations can be eliminated. There are a few derivative benefits from the removal of the relocation mechanism, but it really all boils down to that. Other people call it other things, but I chose this name before I had heard other names. SVM would probably have been a better name had I read the OCL spec sooner. This is not an exclusive feature restricted to OpenCL. Any GPU client will hopefully eventually have this capability provided to them.

If you’re going to read any single PPGTT post of this series, I think it should not be this one. I was not sure I’d write this post when I started documenting the PPGTT (part 1, part2, part3). I had hoped that any of the following things would have solidified the decision by the time I completed part3.

  1. CODE: The code is not merged, not reviewed, and not tested (by anyone but me). There’s no indication about the “upstreamability”. What this means is that if you read my blog to understand how the i915 driver currently works, you’ll be taking a crap-shoot on this one.
  2. DOCS: The Broadwell public Programmer Reference Manuals are not available. I can’t refer to them directly, I can only refer to the code.
  3. PRODUCT: Broadwell has not yet shipped. My ulterior motive had always been to rally the masses to test the code. Without product, that isn’t possible.

Concomitant with these facts, my memory of the code and interesting parts of the hardware it utilizes continues to degrade. Ultimately, I decided to write down what I can while it’s still fresh (for some very warped definition of “fresh”).


GPU mirroring is the goal. Dynamic page table allocations are very valuable on their own. Using dynamic page table allocations can dramatically conserve system memory when running with multiple address spaces (part 3 if you forgot), which is something which should become pretty common shortly. Consider for a moment a Broadwell legacy 32b system (more details later). You would require about 8MB for page tables to map one page of system memory. With the dynamic page table allocations, this would be reduced to 8K. Dynamic page table allocations are also an indirect requirement for implementing a 64b virtual address space. Having a 64b virtual address space is a pretty unremarkable feature by itself. On current workloads [that I am aware of] it provides no real benefit. Supporting 64b did require cleaning up the infrastructure code quite a bit though, and should anything from the series get merged, I believe the result is a huge improvement in code readability.
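To make that claim concrete, here is my own back-of-the-envelope arithmetic for the legacy 32b case (4 PDP registers, 512 PDEs per page directory, 4KB per paging structure – the numbers come from the hierarchy described later in the post):

# Back-of-the-envelope check of the 8MB-vs-8K claim above (my arithmetic,
# not code from the patches).
PAGE = 4096
pds = 4            # one page directory per PDP register
pts = 4 * 512      # every PDE could point to a page table

upfront = (pds + pts) * PAGE   # allocate the whole hierarchy up front
dynamic = (1 + 1) * PAGE       # one PD + one PT to map a single page

print(upfront // 1024, "KB up front")    # ~8208 KB, i.e. roughly 8MB
print(dynamic // 1024, "KB on demand")   # 8 KB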

Current Status

I briefly mentioned dogfooding these several months ago. At that time I only had the dynamic page table allocations on GEN7 working. The fallout wasn’t nearly as bad as I was expecting, but things were far from stable. There was a second posting which is much more stable and contains support of everything through Broadwell. To summarize:

Feature              Status            TODO
Dynamic page tables  Implemented       Test and fix bugs
64b Address space    Implemented       Test and fix bugs
GPU mirroring        Proof of Concept  Decide on interface; Implement interface.1

Testing has been limited to just one machine, mine, when I don’t have a million other things to do. With that caveat, on top of my last PPGTT stabilization patches things look pretty stable.

Present: Relocations

Throughout many of my previous blog posts I’ve gone out of the way to avoid explaining relocations. My reluctance was because explaining the mechanics is quite tedious, not because it is a difficult concept. It’s impossible [and extremely unfortunate for my weekend] to make the case for why these new PPGTT features are cool without touching on relocations at least a little bit. The following picture exemplifies both the CPU and GPU mapping the same pages with the current relocation mechanism.

Current PPGTT support

To get to the above state, something like the following would happen.

  1. Create BOx
  2. Create BOy
  3. Request BOx be uncached via the DRM_IOCTL_I915_GEM_SET_CACHING IOCTL.
  4. Do one of the aforementioned operations on BOx and BOy.
  5. Perform execbuf2.

Accesses to the BO from the CPU require having a CPU virtual address that eventually points to the pages representing the BO2. The GPU has no notion of CPU virtual addresses (unless you have a bug in your code). Inevitably, all the GPU really cares about is physical pages; which ones. On the other hand, userspace needs to build up a set of GPU commands which sometimes need to be aware of the absolute graphics address.

Several commands do not need an absolute address. 3DSTATE_VS for instance does not need to know anything about where Scratch Space Base Offset is actually located. It needs to provide an offset to the General State Base Address. The General State Base Address does need to be known by userspace.

Using the relocation mechanism gives userspace a way to inform the i915 driver about the BOs which needs an absolute address. The handles plus some information about the GPU commands that need absolute graphics addresses are submitted at execbuf time. The kernel will make a GPU mapping for all the pages that constitute the BO, process the list of GPU commands needing update, and finally submit the work to the GPU.
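To make the mechanics a bit more tangible, here is a toy model of that dance – not the real ioctl structures or the i915 code, just the shape of the bookkeeping (all names are made up):

# Toy model of relocations: userspace records where in its command stream an
# absolute graphics address will be needed; the kernel patches those spots
# once it has chosen an address for each BO. Purely illustrative.
relocs = [
    # (offset into the batch, BO handle, delta within that BO)
    (0x40, "vertex_bo", 0x0),
    (0x88, "texture_bo", 0x1000),
]

def apply_relocs(batch, relocs, gpu_addresses):
    for offset, handle, delta in relocs:
        batch[offset] = gpu_addresses[handle] + delta  # patch the command dword
    return batch

print(apply_relocs({}, relocs,
                   {"vertex_bo": 0x100000, "texture_bo": 0x200000}))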

Future: No relocations

GPU Mirroring

The diagram above demonstrates the goal: symmetric mappings to a BO on both the GPU and the CPU. There are benefits to ditching relocations. One of the nice side effects of getting rid of relocations is that it allows us to drop the use of the DRM memory manager and simply rely on malloc as the address space allocator. The DRM memory allocator does not get the same amount of attention with regard to performance as malloc does. Even if it did perform as ideally as possible, it’s still a superfluous CPU workload. Other people can probably explain the CPU overhead in better detail. Oh, and OpenCL 2.0 requires it.

"OpenCL 2.0 adds support for shared virtual memory (a.k.a. SVM). SVM allows the host and 
kernels executing on devices to directly share complex, pointer-containing data structures such 
as trees and linked lists. It also eliminates the need to marshal data between the host and devices. 
As a result, SVM substantially simplifies OpenCL programming and may improve performance."

Makin’ it Happen


As I’ve already mentioned, the most obvious requirement is expanding the GPU address space to match the CPU.

Page Table Hierarchy

If you have taken any sort of Operating Systems class, or read up on Linux MM within the last 10 years or so, the above drawing should be incredibly unremarkable. If you have not, you’re probably left with a big ‘WTF’ face. I probably can’t help you if you’re in the latter group, but I do sympathize. For the other camp: Broadwell brought 4 level page tables that work exactly how you’d expect them to. Instead of the x86 CPU’s CR3, GEN GPUs have PML4. When operating in legacy 32b mode, there are 4 PDP registers that each point to a page directory and therefore map 4GB of address space3. The register is just a simple logical address pointing to a page directory. The actual changes in hardware interactions are trivial on top of all the existing PPGTT work.

The keen observer will notice that there are only 256 PML4 entries. This has to do with the way in which we've come about 64b addressing in x86. The relevant Wikipedia article explains it pretty well, and has links.
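For reference, the address decomposition the drawing implies is the standard x86-style one – 12 bits of page offset and 9 bits of index per level. A little illustration (mine, not driver code):

# Split a 48b GPU virtual address into the four table indices described
# above: 9 bits per level, 12 bits of page offset. Illustrative only.
def split_address(addr):
    pte   = (addr >> 12) & 0x1ff   # index into the page table
    pde   = (addr >> 21) & 0x1ff   # index into the page directory
    pdpe  = (addr >> 30) & 0x1ff   # index into the page directory pointer table
    pml4e = (addr >> 39) & 0x1ff   # PML4 index (only the low 256 are used here)
    return pml4e, pdpe, pde, pte

print(split_address(0xdefeca7e))   # -> (0, 3, 247, 492)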

“This will take one week. I can just allocate everything up front.” (Dynamic Page Table Allocation)

Funny story. I was asked to estimate how long it would take me to get this GPU mirror stuff in shape for a very rough proof of concept. “One week. I can just allocate everything up front.” If what I have now counts as “done”, then I was off by 10x.

Where I went wrong in my estimate was math. If you consider the above, you quickly see why allocating everything up front is a terrible idea and flat out impossible on some systems.

Page for the PML4
512 PDP pages per PML4 (512, ok we actually use 256)
512 PD pages per PDP (256 * 512 pages for PDs)
512 PT pages per PD (256 * 512 * 512 pages for PTs)
(256 * 512^2 + 256 * 512 + 256 + 1) * PAGE_SIZE = ~256G = oops
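Evaluating that expression makes the "oops" obvious (my arithmetic, same numbers as above):

# Total memory needed to pre-allocate every paging structure for a 48b
# address space with 256 PML4 entries in use.
PAGE_SIZE = 4096
pages = 256 * 512**2 + 256 * 512 + 256 + 1   # PTs + PDs + PDPs + the PML4
print(round(pages * PAGE_SIZE / 2**30, 1), "GiB")   # ~256.5 GiB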

Dissimilarities to x86

First and foremost, there are no GPU page faults to speak of. We cannot demand allocate anything in the traditional sense. I was naive though, and one of the first thoughts I had was: the Linux kernel [heck, just about everything that calls itself an OS] manages 4 level pages tables on multiple architectures. The page table format on Broadwell is remarkably similar to x86 page tables. If I can’t use the code directly, surely I can copy. Wrong.

Here is some code from the Linux kernel which demonstrates how you can get a PTE for a given address in Linux.

typedef unsigned long   pteval_t;
typedef struct { pteval_t pte; } pte_t;

static inline pteval_t native_pte_val(pte_t pte)
{
        return pte.pte;
}

static inline pteval_t pte_flags(pte_t pte)
{
        return native_pte_val(pte) & PTE_FLAGS_MASK;
}

static inline int pte_present(pte_t a)
{
        return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
                               _PAGE_NUMA);
}

static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
{
        return (pte_t *)pmd_page_vaddr(*pmd) + pte_index(address);
}
#define pte_offset_map(dir, address) pte_offset_kernel((dir), (address))

#define pgd_offset(mm, address) ((mm)->pgd + pgd_index((address)))
static inline pud_t *pud_offset(pgd_t *pgd, unsigned long address)
{
        return (pud_t *)pgd_page_vaddr(*pgd) + pud_index(address);
}
static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
{
        return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
}

/* My completely fabricated example of finding page presence */
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *ptep;
struct mm_struct *mm = current->mm;
unsigned long address = 0xdefeca7e;

pgd = pgd_offset(mm, address);
pud = pud_offset(pgd, address);
pmd = pmd_offset(pud, address);
ptep = pte_offset_map(pmd, address);
printk("Page is present: %s\n", pte_present(*ptep) ? "yes" : "no");

x86 page table code has two very distinct properties that do not exist here (warning, this is slightly hand-wavy).

  1. The kernel knows exactly where in physical memory the page tables reside4. On x86, it need only read CR3. We don’t know where our page tables reside in physical memory because of the IOMMU. When VT-d is enabled, the i915 driver only knows the DMA address of the page tables.
  2. There is a strong correlation between a CPU process and an mm (set of page tables). Keeping mappings around of the page tables is easy to do if you don’t want to take the hit to map them every time you need to look at a PTE.

If the Linux kernel needs to find if a page is present or not without taking a fault, it need only look to one of those two options. After about a week of making the IOMMU driver do things it shouldn’t do, and trying to push the square block through the round hole, I gave up on reusing the x86 code.

Why Do We Actually Need Page Table Tracking?

The IOMMU interfaces were not designed to pull a physical address from a DMA address. Pre-allocation is right out. It’s difficult to try to get the instantaneous state of the page tables…

Another thought I had very early on was that tracking could be avoided if we just never tore down page tables. I knew this wasn’t a good solution, but at that time I just wanted to get the thing working and didn’t really care if things blew up spectacularly after running for a few minutes. There is actually a really easy set of operations that show why this won’t work. For the following, think of the four level page tables as arrays. ie.

  • PML4[0-255], each point to a PDP
  • PDP[0-255][0-511], each point to a PD
  • PD[0-255][0-511][0-511], each point to a PT
  • PT[0-255][0-511][0-511][0-511] (where PT[0][0][0][0][0] is the 0th PTE in the system)
  1. [mesa] Create a 2M sized BO. Write to it. Submit it via execbuffer
  2. [i915] See new BO in the execbuffer list. Allocate page tables for it…
    1. [DRM]Find that address 0 is free.
    2. [i915]Allocate PDP for PML4[0]
    3. [i915]Allocate PD for PDP[0][0]
    4. [i915]Allocate PT for PD[0][0][0]
    5. [i915](condensed)Set pointers from PML4->PDP->PD->PT
    6. [i915]Set the 512 PTEs PT[0][0][0][0][511-0] to point to the BO’s backing page.
  3. [i915] Dispatch work to the GPU on behalf of mesa.
  4. [i915] Observe the hardware has completed
  5. [mesa] Create a 4k sized BO. Write to it. Submit both BOs via execbuffer.
  6. [i915] See new BO in the execbuffer list. Allocate page tables for it…
    1. [DRM]Find that address 0x200000 is free.
    2. [i915]Allocate PDP[0][0], PD[0][0][0], PT[0][0][0][1].
    3. Set pointers… Wait. Is PDP[0][0] allocated already? Did we already set pointers? I have no freaking idea!
    4. Abort.

Page Tables Tracking with Bitmaps

Okay, I could have used a sentinel for empty entries: point the page table entry to the scratch page and treat that as "not allocated". To implement this involves reading back potentially large amounts of data from the page tables, which will be slow. It should work though. I didn’t try it.

After I had determined I couldn’t reuse x86 code, and that I needed some way to track which page table elements were allocated, I was pretty set on using bitmaps for tracking usage. The idea of a hash table came and went – none of the upsides of a hash table are useful here, but all of the downsides are present (space). Bitmaps were sort of the default case. Unfortunately though, I did some math at this point – notice the LaTeX!
\frac{2^{47}\,\text{bytes}}{\frac{4096\,\text{bytes}}{1\,\text{page}}} = 34359738368\,\text{pages} \\ 34359738368\,\text{pages} \times \frac{1\,\text{bit}}{1\,\text{page}} = 34359738368\,\text{bits} \\ 34359738368\,\text{bits} \times \frac{1\,\text{byte}}{8\,\text{bits}} = 4294967296\,\text{bytes}
That’s 4GB simply to track every page. There’s some more overhead because page [tables, directories, directory pointers] are also tracked.
  256\,\text{entries} + (256 \times 512)\,\text{entries} + (256 \times 512^2)\,\text{entries} = 67240192\,\text{entries} \\ 67240192\,\text{entries} \times \frac{1\,\text{bit}}{1\,\text{entry}} = 67240192\,\text{bits} \\ 67240192\,\text{bits} \times \frac{1\,\text{byte}}{8\,\text{bits}} = 8405024\,\text{bytes} \\ 4294967296\,\text{bytes} + 8405024\,\text{bytes} = 4303372320\,\text{bytes} \\ 4303372320\,\text{bytes} \times \frac{1\,\text{GB}}{1073741824\,\text{bytes}} = 4.0078\,\text{GB}

I can’t remember whether I had planned to statically pre-allocate the bitmaps, or I was so caught up in the details and couldn’t see the big picture. I remember thinking, 4GB just for the bitmaps, that will never fly. I probably spent a week trying to figure out a better solution. When we invent time travel, I will go back and talk to my former self: 4GB of bitmap tracking if you’re using 128TB of memory is inconsequential. That is about 0.003% of the memory addressable by the GPU. Hopefully you didn’t fall into that trap, and I just wasted your time, but there it is anyway.

Sample code to walk the page tables

This code does not actually exist, but it is very similar to the real code. The following shows how one would “walk” to a specific address allocating the necessary page tables and setting the bitmaps along the way. Teardown is a bit harder, but it is similar.

static struct i915_pagedirpo *
alloc_one_pdp(struct i915_pml4 *pml4, int entry);

static struct i915_pagedir *
alloc_one_pd(struct i915_pagedirpo *pdp, int entry);

static struct i915_pagetab *
alloc_one_pt(struct i915_pagedir *pd, int entry);

/**
 * alloc_page_tables - Allocate all page tables for the given virtual address.
 * This will allocate all the necessary page tables to map exactly one page at
 * @address. The page tables will not be connected, and the PTE will not point
 * to a page.
 * @ppgtt:	The PPGTT structure encapsulating the virtual address space.
 * @address:	The virtual address for which we want page tables.
 */
static void
alloc_page_tables(struct i915_hw_ppgtt *ppgtt, unsigned long address)
{
	struct i915_pagetab *pt;
	struct i915_pagedir *pd;
	struct i915_pagedirpo *pdp;
	struct i915_pml4 *pml4 = &ppgtt->pml4; /* Always there */

	int pml4e = (address >> GEN8_PML4E_SHIFT) & GEN8_PML4E_MASK;
	int pdpe = (address >> GEN8_PDPE_SHIFT) & GEN8_PDPE_MASK;
	int pde = (address >> GEN8_PDE_SHIFT) & I915_PDE_MASK;
	int pte = (address >> PAGE_SHIFT) & (I915_PDES_PER_PD - 1); /* 512 PTEs per PT */

	if (!test_bit(pml4e, pml4->used_pml4es))
		goto alloc;

	pdp = pml4->pagedirpo[pml4e];
	if (!test_bit(pdpe, pdp->used_pdpes))
		goto alloc;

	pd = pdp->pagedirs[pdpe];
	if (!test_bit(pde, pd->used_pdes))
		goto alloc;

	pt = pd->page_tables[pde];
	if (test_bit(pte, pt->used_ptes))
		return; /* Everything is already allocated */

alloc:
	pdp = alloc_one_pdp(pml4, pml4e);
	set_bit(pml4e, pml4->used_pml4es);
	pd = alloc_one_pd(pdp, pdpe);
	set_bit(pdpe, pdp->used_pdpes);
	pt = alloc_one_pt(pd, pde);
	set_bit(pde, pd->used_pdes);
}

Here is a picture which shows the bitmaps for the 2 allocation example above.

Bitmaps tracking page tables

The GPU mirroring interface

I really don’t want to spend too much time here. In other words, no more pictures. As I’ve already mentioned, the interface was designed for a proof of concept which already had code using userptr. The shortest path was to simply reuse the interface.

In the patches I’ve submitted, 2 changes were made to the existing userptr interface (which wasn’t then, but is now, merged upstream). I added a context ID, and the flag to specify you want mirroring.

struct drm_i915_gem_userptr {
	__u64 user_ptr;
	__u64 user_size;
	__u32 ctx_id;
	__u32 flags;
#define I915_USERPTR_READ_ONLY          (1<<0)
#define I915_USERPTR_GPU_MIRROR         (1<<1)
#define I915_USERPTR_UNSYNCHRONIZED     (1<<31)
	/*
	 * Returned handle for the object.
	 * Object handles are nonzero.
	 */
	__u32 handle;
	__u32 pad;
};

The context argument is to tell the i915 driver for which address space we’ll be mirroring the BO. Recall from part 3 that a GPU process may have multiple contexts. The flag is simply to tell the kernel to use the value in user_ptr as the address to map the BO in the virtual address space of the GEN GPU. When using the normal userptr interface, the i915 driver will pick the GPU virtual address.

  • Pros:
    • This interface is very simple.
    • Existing userptr code does the hard work for us
  • Cons:
    • You need 1 IOCTL per object. Much unneeded overhead.
    • It’s subject to a lot of problems userptr has5
    • Userptr was already merged, so unless pad gets repurposed, we’re screwed

What should be: soft pin

There hasn’t been too much discussion here, so it’s hard to say. I believe the trend of the discussion (and the author’s personal preference) would be to add flags to the existing execbuf relocation mechanism. The flag would tell the kernel to not relocate it, and use the presumed_offset field that already exists. This is sometimes called “soft pin.” It is a bit of a chicken and egg problem since the amount of work in userspace to make this useful is non-trivial, and the feature can’t be merged until there is an open source userspace. Stay tuned. Perhaps I’ll update the blog as the story unfolds.

Wrapping it up (all 4 parts)

As usual, please report bugs or ask questions.

So with the 4 parts you should understand how the GPU interacts with system memory. You should know what the Global GTT is, why it still exists, and how it works. You might recall what a PPGTT is, and the intricacies of multiple address space. Hopefully you remember what you just read about 64b and GPU mirror. Expect a rebased patch series from me soon with all that was discussed (quite a bit has changed around me since my original posting of the patches).

This is the last post I will be writing on how GEN hardware interfaces with system memory, and how that related to the i915 driver. Unlike the Rocky movie series, I will stop at the 4th. Like the Rocky movie series, I hope this is the best. Yes, I just went there.

Unlike the usual, “buy me a beer if you liked this”, I would like to buy you a beer if you read it and considered giving me feedback. So if you know me, or meet me somewhere, feel free to reclaim the voucher.

Image links

The images I’ve created. Feel free to do with them as you please.

Download PDF

  1. The patches I posted for enabling GPU mirroring piggyback off of the existing userptr interface. Before those patches were merged I added some info to the API (a flag + context) for the point of testing. I needed to get this working quickly and porting from the existing userptr code was the shortest path. Since then userptr has been merged without this extra info which makes things difficult for people trying to test things. In any case an interface needs to be agreed upon. My preference would be to do this via the existing relocation flags. One could add a new flag called "SOFT_PIN"

  2. The GEM and BO terminology is a fancy sounding wrapper for the notion that we want an interface to coherently write data which the GPU can read (input), and have CPU observe data which the GPU has written (output)  

  3. The PDP registers are not PDPEs because they do not have any of the associated flags of a PDPE. Also, note that in my patch series I submitted a patch which defines the number of these to be PDPE. This is incorrect.

  4. I am not sure how KVM manages page tables. At least conceptually I’d think it has a similar problem to the i915 driver’s page table management. I should have probably looked a bit closer as I may have been able to leverage that; but I didn’t have the idea until just now… looking at the KVM code, it does have a lot of similarities to the approach I took.

  5. Let me be clear that I don’t think userptr is a bad thing. It’s a very hard thing to get right, and much of the trickery needed for it is *not* needed for GPU mirroring 


Pictures are the right way to start.


Conceptual view of aliasing PPGTT bind/unbind

There is exactly one thing to get from the above drawing, everything else is just to make it as close to fact as possible.

  1. The aliasing PPGTT (aliases|shadows|mimics) the global GTT.

The wordy overview

Support for Per-process Graphics Translation Tables (PPGTT) debuted on Sandybridge (GEN6). The features provided by hardware are a superset of Aliasing PPGTT, which is entirely a software construct. The most obvious unimplemented feature is that the hardware supports multiple PPGTTs. Aliasing PPGTT is a single instance of a PPGTT. Although not entirely true, it’s easiest to think of the Aliasing PPGTT as a set of page tables that is maintained to have identical mappings to the global GTT (the picture above). There is more on this in the Summary section.

Until recently, aliasing PPGTT was the only way to make use of the hardware feature (unless you accidentally stepped into one of my personal branches). Aliasing PPGTT is implemented as a performance feature (more on this later). It was an important enabling step for us as well as it provided a good foundation for the lower levels of the real PPGTT code.

In the following, I will be using the HSW PRMs as a reference. I’ll also assume you’ve read, or understand part 1.

Selecting GGTT or PPGTT

Choosing between the GGTT and the Aliasing PPGTT is very straightforward. The choice is provided in several GPU commands. If there is no explicit choice, then there is some implicit behavior which is usually sensible. The most obvious command to be provided with a choice is MI_BATCH_BUFFER_START. When a batchbuffer is submitted, the driver sets a single bit that determines whether the batch will execute out of the GGTT or an Aliasing PPGTT1. Several commands as well, like PIPE_CONTROL, have a bit to direct which to use for the reads or writes that the GPU command will perform.


The names for all the page table data structures in hardware are the same as for the IA CPU. You can see the Intel® 64 and IA-32 Architectures Software Developer Manuals for more information. (At the time of this post: page 1988 Vol3. 4.2 HIERARCHICAL PAGING STRUCTURES: AN OVERVIEW). I don’t want to rehash the HSW PRMs too much, and I won’t copy the diagrams (I'm probably not allowed to anyway). However, for the sake of having a consolidated post, I will rehash the most pertinent parts.

There is one conceptual Page Directory for a PPGTT – the docs call this a set of Page Directory Entries (PDEs), however since they are contiguous, calling it a Page Directory makes a lot of sense to me. In fact, going back to the Ironlake docs, that seems to be the case. So there is one page directory with up to 512 entries, each pointing to a page table. There are several good diagrams in the PRMs2 which I won’t bother redrawing.

Page Directory Entry
  31:12  Physical Page Address 31:12
  11:04  Physical Page Address 39:32
  03:02  Rsvd
  01     Page size (4K/32K)
  00     Valid

Page Table Entry
  31:12  Physical Page Address 31:12
  11     Cacheability Control[3]
  10:04  Physical Page Address 38:32
  03:01  Cacheability Control[2:0]
  00     Valid

There are some things we can get from this for those too lazy to click on the links to the docs.

  1. PPGTT page tables exist in physical memory.
  2. PPGTT PTEs have the exact same layout as GGTT PTEs.
  3. PDEs don’t have cache attributes (more on this later).
  4. There exists support for big pages3

With the above definitions, we now can derive a lot of interesting attributes about our GPU. As already stated, the PPGTT is a two-level page table (I’ve not yet defined the size).

  • A PDE is 4 bytes wide
  • A PTE is 4 bytes wide
  • A Page table occupies 4k of memory.
  • There are 4k/4 = 1024 entries in a page table.

With all this information, I now present you a slightly more accurate picture.


An object with an aliased PPGTT mapping


PP_DCLV – PPGTT Directory Cacheline Valid Register: As the spec tells us, “This register controls update of the on-chip PPGTT Directory Cache during a context restore.” This statement is directly contradicted in the very next paragraph, but the important part is the bit about the on-chip cache. This register also determines the amount of virtual address space covered by the PPGTT. The documentation for this register is pretty terrible, so a table is actually useful in this case.

PPGTT Directory Cacheline Valid Register (from the docs)
  63:32  MBZ
  31:0   PPGTT Directory Cache Restore [1..32] 16 entries

DCLV, the right way
  31  PDE[511:496] enable
  30  PDE[495:480] enable
  ...
  1   PDE[31:16] enable
  0   PDE[15:0] enable

The “why” is not important. Each bit represents a cacheline of PDEs, which is how the register gets its name4. A PDE is 4 bytes and a cacheline is 64 bytes, so 64/4 = 16 entries per bit. We now know how much address space we have.

512 PDEs * 1024 PTEs per PT * 4096 PAGE_SIZE = 2GB


PP_DIR_BASE: Sadly, I cannot find the definition of this in the public HSW docs. However, I did manage to find a definition in the Ironlake docs (yay me). There are several mentions in more recent docs, and it works the same way as is outlined on Ironlake. Quoting the docs again, “This register contains the offset into the GGTT where the (current context’s) PPGTT page directory begins.” We learn a very important caveat about the PPGTT here – the PPGTT PDEs reside within the GGTT.


With these two things, we now have the ability to program the location, and size (and get the thing to load into the on-chip cache). Here is current i915 code which switches the address space (with simple comments added). It’s actually pretty ho-hum.

ret = intel_ring_begin(ring, 6);
if (ret)
	return ret;

intel_ring_emit(ring, MI_LOAD_REGISTER_IMM(2));
intel_ring_emit(ring, RING_PP_DIR_DCLV(ring));
intel_ring_emit(ring, PP_DIR_DCLV_2G);       // program size
intel_ring_emit(ring, RING_PP_DIR_BASE(ring));
intel_ring_emit(ring, get_pd_offset(ppgtt)); // program location
intel_ring_emit(ring, MI_NOOP);

As you can see, we program the size to always be the full amount (in fact, I fixed this a long time ago, but never merged it). Historically, the offset was at the top of the GGTT, but with my PPGTT series merged, that is abstracted out, and the simple get_pd_offset() macro gets the offset within the GGTT. The intel_ring_emit() stuff is there because the docs recommended setting the registers via the GPU’s LOAD_REGISTER_IMMEDIATE command, though empirically it seems to be fine if we simply write the registers via MMIO (for Aliasing PPGTT). See my previous blog post if you want more info about command execution in the GPU’s ringbuffer. If it’s easier, just pretend it’s 2 MMIO writes.


All of the resources are allocated and initialized upfront. There are 3 main steps. Note that the following comes from a relatively new kernel, and I have already submitted patches which change some of the cosmetics. However, the concepts haven’t changed for pre-gen8.

1. Allocate space in the GGTT for the PPGTT PDEs

ret = drm_mm_insert_node_in_range_generic(&dev_priv->,
					  &ppgtt->node, GEN6_PD_SIZE,
					  GEN6_PD_ALIGN, 0,
					  0, dev_priv->,

2. Allocate the page tables

for (i = 0; i < ppgtt->num_pd_entries; i++) {
	ppgtt->pt_pages[i] = alloc_page(GFP_KERNEL);
	if (!ppgtt->pt_pages[i]) {
		return -ENOMEM;
	}
}

3. [possibly] IOMMU map the pages

for (i = 0; i < ppgtt->num_pd_entries; i++) {
	dma_addr_t pt_addr;

	pt_addr = pci_map_page(dev->pdev, ppgtt->pt_pages[i], 0, 4096,

As the system binds, and unbinds objects into the aliasing PPGTT, it simply writes the PTEs for the given object (possibly spanning multiple page tables). The PDEs do not change. PDEs are mapped to a scratch page when not used, as are the PTEs.
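As a rough illustration of what writing a PTE amounts to, here is a little helper that packs a physical address and cache bits according to the PTE bit-layout table earlier in this post. It is only a sketch based on that table, not the driver's actual encode helpers:

# Pack a GEN6/GEN7-style PTE from the bit layout shown earlier:
#   31:12 phys addr 31:12, 11 cacheability[3], 10:4 phys addr 38:32,
#   3:1 cacheability[2:0], 0 valid. Illustrative only.
def make_pte(phys_addr, cacheability=0, valid=True):
    pte  = phys_addr & 0xfffff000              # physical address bits 31:12
    pte |= ((phys_addr >> 32) & 0x7f) << 4     # physical address bits 38:32
    pte |= (cacheability & 0x7) << 1           # cacheability control [2:0]
    pte |= ((cacheability >> 3) & 0x1) << 11   # cacheability control [3]
    pte |= 1 if valid else 0                   # valid bit
    return pte

print(hex(make_pte(0x123456000, cacheability=0b0011)))  # 0x23456017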


As we saw in step 3 above, I mention that the page tables may be mapped by the IOMMU. This is one important caveat that I didn’t fully understand early on, so I wanted to recap a bit. Recall that the GGTT is allocated out of system memory during the boot firmware’s initialization. This means that as long as Linux treats that memory as special, everything will just work (just don’t look for IOMMU-implicated bugs on our bugzilla). The page tables however are special because they get allocated after Linux is already running, and the IOMMU is potentially managing the memory. In other words, we don’t want to write the physical address to the PDEs, we want to write the DMA address. I'll defer to Wikipedia again for the description of an IOMMU, that’s all. It tripped me up the first time I saw it because I hadn’t dealt with this kind of thing before. Our PTEs have worked the same way for a very long time when mapping the BOs, but those have somewhat hidden details because they use the scatter-gather functions.

Feel free to ask questions in the comments if you need more clarity – I’d probably need another diagram to accommodate.

Cached page tables

Let me be clear, I favored writing a separate post for the Aliasing PPGTT because it gets a lot of the details out of the way for the post about Full PPGTT. However, the entire point of this feature is to get a [to date, unmeasured] performance win. Let me explain… Notice bits 4:3 of the ECOCHK register.  Similarly in the i915 code:

ecochk = I915_READ(GAM_ECOCHK);
if (IS_HASWELL(dev)) {
} else {
I915_WRITE(GAM_ECOCHK, ecochk);

What these bits do is tell the GPU whether (and how) to cache the PPGTT page tables. Following the Haswell case, the code is saying to map the PPGTT page table with write-back caching policy. Since the writes for Aliasing PPGTT are only done at initialization, the policy is really not that important.

Below is how I’ve chosen to distinguish the two. I have no evidence that this is actually what happens, but it seems about right.


Flow chart for GPU GGTT memory access. Red means slow.

Flow chart for GPU PPGTT memory access. Red means slow.

Red means slow. The point which was hopefully made clear above is that when you miss the TLB on a GGTT access, you need to fetch the entry from memory, which has a relatively high latency. When you miss the TLB on a PPGTT access, you have two caches (the special PDE cache for PPGTT, and LLC) which are backing the request. Note there is an intentional bug in the second diagram – you may miss the LLC on the PTE fetch also. I was trying to keep things simple, and show the hopeful case.

Because of this, all mappings which do not require GGTT mappings get mapped to the aliasing PPGTT.


Distinctions from the GGTT

At this point I hope you’re asking why we need the global GTT at all. There are a few limited cases where the hardware is incapable of using a per process address space, or where it is undesirable.

A brief description of why, with all the current callers of the global pin interface.

  • Display: Display actually implements its own version of the GGTT. Maintaining the logic to support multiple levels of page tables was both costly and unnecessary. Anything relating to a buffer being scanned out to the display must always be mapped into the GGTT. I expect this to be true, forever.
    • i915_gem_object_pin_to_display_plane(): page flipping
    • intel_setup_overlay(): overlays
  • Ringbuffer: Keep in mind that the aliasing PPGTT is a special case of PPGTT. The ringbuffer must remain address space and context agnostic. It doesn’t make any sense to connect it to the PPGTT, and therefore the logic does not support it. The ringbuffer provides direct communication to the hardware’s execution logic – which would be a nightmare to synchronize if we forget about the security nightmare. If you go off and think about how you would have a ringbuffer mapped by multiple address spaces, you will end up with something like execlists.
    • allocate_ring_buffer()
  • HW Contexts: Extremely similar to ringbuffer.
    • intel_alloc_context_page(): Ironlake RC6
    • i915_gem_create_context(): Create the default HW context
    • i915_gem_context_reset(): Re-pin the default HW context
    • do_switch(): Pin the logical context we’re switching to
  • Hardware status page: The use of this, prior to execlists, is much like ringbuffers and contexts. There is a per process status page with execlists.
    • init_status_page()
  • Workarounds:
    • init_pipe_control(): Initialize scratch space for workarounds.
    • intel_init_render_ring_buffer(): An i830 w/a I won’t bother to understand
    • render_state_alloc(): Full initialization of GPUs 3d state from within the kernel
  • Other
    • i915_gem_gtt_pwrite_fast(): Handle pwrites through the aperture. More info here.
    • i915_gem_fault(): Map an object into the aperture for gtt_mmap. More info here.
    • i915_gem_pin_ioctl(): The DRI1 pin interface.

GEN8 disambiguation

Off the top of my head, the list of some of the changes on GEN8 which will get more detail in a later post. These changes are all upstream from the original Broadwell integration.

  • PTE size increased to 8b
    • Therefore, 512 entries per table
    • Format mimics the CPU PTEs
  • PDEs increased to 8b (remains 512 PDEs per PD)
    • Page Directories live in system memory
      • GGTT no longer holds the PDEs.
    • There are 4 PDPs, and therefore 4 PDs
    • PDEs are cached in LLC instead of special cache (I’m guessing)
  • New HW PDP (Page Directory Pointer) registers point to the PDs, for legacy 32b addressing.
    • PP_DIR_BASE, and PP_DCLV are removed
  • Support for 4 level page tables, up to 48b virtual address space.
    • PML4[PML4E]->PDP
    • PDP[PDPE] -> PD
    • PD[PDE] -> PT
    • PT[PTE] -> Memory
  • Big pages are now 64k instead of 32k (still not implemented)
  • New caching interface via PAT like structure


There’s actually an interesting thing that you start to notice after reading Distinctions from the GGTT. Just about everything mapped into the GGTT shouldn’t be mapped into the PPGTT. We already stated that we try to map everything else into the PPGTT. The set of objects mapped in the GGTT, and the set of objects mapped into the PPGTT are disjoint5. The patches to make this work are not yet merged. I’d put an image here to demonstrate, but I am feeling lazy and I really want to get this post out today.


  • The Aliasing PPGTT is a single instance of the hardware feature: PPGTT.
  • Aliasing PPGTT was designed as a drop in performance replacement to the GGTT.
  • GEN8 changed a lot of architectural stuff.
  • The Aliasing PPGTT shouldn’t actually alias the GGTT because the objects they map are a disjoint set.

Like last time, links to all the SVGs I’ve created. Use them as you like.

Download PDF

  1. Actually it will use whatever the current PPGTT is, but for this post, that is always the Aliasing PPGTT 

  2. Page walk, Two-Level Per-Process Virtual Memory 

  3. Big pages have the same goal as they do on the CPU – to reduce TLB pressure. To date, there has been no implementation of big pages for GEN (though a while ago I started putting something together). There has been some anecdotal evidence that there isn’t a big win to be had for many workloads we care about, and so this remains a low priority. 

  4. This register thus allows us to limit, or make a sparse address space for the PPGTT. This mechanism is not used, even in the full PPGTT patches 

  5. There actually is a case on GEN6 which requires both. Currently this need is implemented by drivers/gpu/drm/i915/i915_gem_execbuffer.c: i915_gem_execbuffer_relocate_entry( 

For those of you reading this that didn’t know, I’ve had two months of paid vacation – one of the real perks of working for Intel. Today is the last day. It is as hard as I thought it would be.

Most of the vacation was spent vacationing. As I have access to none of the pictures at the moment, and I don’t want to make you jealous, I’ll skip over that. Toward the end though, I ended up at a coffee shop waiting for someone with nothing to do. I spent a little bit of time working on HOBos, and I thought it could be interesting to write about it.

WARNING: There is nothing novel here.

A brief history of HOBos

The HOBby operating system is an operating system project I started a while ago. I am a bit embarrassed that I started an OS. In my opinion, it’s one of the lamer tasks to take on because

  1. everyone seems to do it;
  2. there really isn’t a need, there are many operating systems with permissive licenses already; and
  3. sites like OSDev have made much of the work trivial (I like to think that when I started there wasn’t quite as much info readily available, but that’s a lie).
Larrabee Av in Portland (not what the project was named after)

HOBos began while I was working on the Larrabee project. The team spent a lot of time optimizing the memory management and the scheduler for the embedded software. I really wanted to work on these things full time. Unfortunately for me, having had a background in device drivers, I was often required to do other things. As a means to scratch the itch, I started HOBos after not finding anything suitable for my needs. The stuff I found was all either too advanced or too rudimentary. When I was hired to work on i915, I decided that it was a better use of my free time. Since then, I’ve been tweaking things here or there, and I do try to make sure things still run with the latest QEMU and compilers at least once a year. The last actual feature I added was more than 1300 days ago:

commit 1c9b5c78b22b97246989b00e807c9bf1fbc9e517
	Author: Ben Widawsky 
	Date: Sat Mar 19 21:19:57 2011 -0700

	basic backtrace

So back to the coffee shop. I tried to do something or other, got a hang, and didn’t want to fire up GDB.


HOBos had implemented backtraces since the original import from SVN (let’s pretend that means, since always). Obtaining a backtrace is actually pretty straightforward on x86.

The stack frame

The stack frame can be thought of as memory contents that are locally scoped to a function. Declaring a local variable will end up in the stack frame. A global variable will not. As functions call other functions, you end up with multiple frames. A stack is used because the last frames added are the first ones removed (this is not always true for things like exceptions, but nevermind that for now). The fact that a stack decrements is arbitrarily chosen, as far as I can tell. The following shows the stack when the function foo() calls the function bar().

Example Stackframe


The memory contents shown above are created as a result of two things. First is what the CPU implicitly does upon execution of the call instruction. The second is what the compiler generates. Since we’re talking about x86 here, the call instruction always pushes at least the return address. The second I’ll detail a bit more in a bit. Correlating this to the picture, the green (foo) and blue (bar) are creations of the compiler. The brown is automatically pushed on to the stack by the hardware, and is automatically popped from the stack on the ret instruction.

In the above there are two registers worth noting, RBP and RSP. RBP, which is the 64b extension to the original 8086 BP (Base Pointer) register, marks the beginning of the stack frame, ie. the Frame Pointer. RSP, the extension to the 8086 SP (Stack Pointer), points to the end of the stack frame. By convention the Base Pointer doesn’t change throughout a function being executed and therefore it is often used as the reference to local variables stored on the stack – -100(%rbp), for example.

Digging into the disassembly of such a function, one notices a pattern. Every function begins with:

push   %rbp       // Push the old RBP, RSP now points to this
mov    %rsp,%rbp  // Store RSP in RBP

Assuming this is the convention, it implies that at any given point during the execution of a function we can obtain the previous RBP by reading the current RBP and doing some processing. Specifically, reading RBP gives us the old Stack Pointer, which is pointing to the last RBP. As mentioned above, the x86 CPU pushed the return address immediately before the push %rbp – which means as we work backwards through the Base Pointers, we can also obtain the caller for the current stack frame. People have done really nice pictures on this – use your favorite search engine.

Here is the HOBos code (ignore the part about symbols for now):

void bt_fp(void *fp)
{
	do {
		uint64_t prev_rbp = *((uint64_t *)fp);
		uint64_t prev_ip = *((uint64_t *)(fp + sizeof(prev_rbp)));
		struct sym_offset sym_offset = get_symbol((void *)prev_ip);
		printf("\t%s (+0x%x)\n", sym_offset.name, sym_offset.offset);
		fp = (void *)prev_rbp;
		/* Stop if rbp is not in the kernel
		 * TODO: need an upper bound too*/
		if (fp <= (void *)KVADDR(DMAP_PML,0,0,0))
			break;
	} while(1);
}

As far as I know, all modern CPUs work in a similar fashion with differences sprinkled here and there. ARM for example has an LR register for the return address instead of using the stack.

ABI/Calling Conventions

The fact that we can work backwards this way is a byproduct of the calling convention. One example of an aspect of the calling convention is where to put the arguments to a function. Do they go on the stack, in registers, or somewhere else? In addition to those arguments, the way in which RBP and RSP are used is strictly a software construct that is part of the convention. As a result, it might not always be possible to get a backtrace if:

  1. This convention is not adhered to (or -fomit-frame-pointer)
  2. The contents of RBP are destroyed
  3. The contents of the stack are corrupted.

How arguments are passed to functions also needs to be defined so that linkers and loaders (both static and dynamic) can correctly form an executable, or dynamically call a function. Since this isn’t really important to obtaining a backtrace, I will leave it there. Some architectures do provide a way to obtain useful backtrace information without caring about the calling convention: Intel’s Processor Trace for example.

Symbol information

The previous section will get us a reverse list of addresses for all function calls working backward from a given point during execution. But having names makes it much easier to quickly diagnose what is going wrong. There is a lot of information on the internet about this stuff. I’m simply providing all that’s relevant to my specific problem.

ELF Symbols (linking)

The ELF format provides everything we need (assuming things aren’t stripped). Glossing over the details (see this simple tool if you’re curious), we end up with two “sections” that tell us everything we need. They are conventionally named “.symtab” and “.strtab” and are conveniently of type SHT_SYMTAB and SHT_STRTAB. The symbol table defines the information about each symbol (functions, variables, whatever). Part of the information is a name, which is an index into the string table. In the simplest case, these are provisions for inter-object linking. If I had defined foo() in foo.c, and bar() in bar.c, the compiled object files can be linked together, but the linker needs the information about the symbol bar (in this case) in order to do its job.

readelf -S a.out

[Nr] Name Type Address Offset
[33] .symtab SYMTAB 0000000000000000 000015b8
[34] .strtab STRTAB 0000000000000000 00001c90

> readelf -S a.out | egrep "\.strtab|\.symtab" | wc -l
> strip a.out
> readelf -S a.out | egrep "\.strtab|\.symtab" | wc -l

Summing that up, if we have an entire ELF file, and the symbol and string tables haven’t been stripped, we’re good to go. However, ELF sections are not the unit in which an ELF loader decides what to load. The loader loads segments which are of type PT_LOAD. A segment is made up of 0 or more sections, plus padding. Since the Operating System is itself an ELF loaded by an ELF loader (the bootloader) we’re not in a good position. :(

> readelf -l a.out | egrep "\.strtab|\.symtab" | wc -l
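Outside of the kernel you can poke at the same two sections from Python; this is just a userspace illustration of the point above, assuming pyelftools is installed:

# Dump .symtab entries (value and name) from an ELF file with pyelftools.
from elftools.elf.elffile import ELFFile

with open("a.out", "rb") as f:
    elf = ELFFile(f)
    symtab = elf.get_section_by_name(".symtab")
    if symtab is None:
        print("stripped: no .symtab")
    else:
        for sym in symtab.iter_symbols():
            print(hex(sym["st_value"]), sym.name)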

ELF Loader

Debug Info

Note that what we want is not the same thing as debug information. If one wants to do source level debug, there needs to be some way of correlating a machine instruction to a line of source code. This is also a software abstraction, and there is usually a spec for it unless you are using some proprietary thing. It would technically be possible to include DWARF capabilities within the kernel, but I do not know of a way to get that info to the OS (see multiboot stuff for details).

From boot to symbols

The HOBos project implements a small Multiboot compliant bootloader called smallboot. When the machine starts up, boot firmware is loaded from a fixed location (this is currently done by SeaBIOS). The boot firmware then loads the smallboot bootloader. The bootloader will load the kernel (smallboot, and most modern bootloaders will do this through a text configuration file on the resident filesystem). In the HOBos case, the kernel is simply an ELF file. smallboot implements a basic ELF loader to load the kernel into memory and give execution over.

The multiboot specification is a standardized communication mechanism (for various things)  from the bootloader to the Operating System (or any file really). One of these things is symbol information. Quoting the multiboot spec

If bit 5 in the ‘flags’ word is set, then the following fields in the Multiboot information structure starting at byte 28 are valid:

     28      | num               |
     32      | size              |
     36      | addr              |
     40      | shndx             |

These indicate where the section header table from an ELF kernel is, the size of each entry, number of entries, and the string table used as the index of names. They correspond to the ‘shdr_*’ entries (‘shdr_num’, etc.) in the Executable and Linkable Format (elf) specification in the program header. All sections are loaded, and the physical address fields of the elf section header then refer to where the sections are in memory (refer to the i386 elf documentation for details as to how to read the section header(s)). Note that ‘shdr_num’ may be 0, indicating no symbols, even if bit 5 in the ‘flags’ word is set.
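For reference, here is a sketch of how those four fields are typically laid out in C, mirroring the multiboot.h header that accompanies the spec (this is a generic illustration, not HOBos-specific code):

#include <stdint.h>

/* ELF section header table info in the Multiboot information structure. */
struct multiboot_elf_section_header_table {
    uint32_t num;    /* number of section header entries */
    uint32_t size;   /* size of each section header entry */
    uint32_t addr;   /* physical address of the section header table */
    uint32_t shndx;  /* section index of the section name string table */
};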

Since the beginning I had implemented these fields in the bootloader:

multiboot_info.flags |= MULTIBOOT_INFO_ELF_SHDR;
multiboot_info.u.elf_sec = *table;

Because the symbols weren’t in the ELF segments though, I was stumped as to how to get at the data once the OS is loaded. As it turned out, I hadn’t actually read all 4 sentences and had missed one very important part.

All sections are loaded, and the physical address fields of the elf section header then refer to where the sections are in memory

What the spec is dictating is that even though the sections are not in loadable segments, they shall exist within memory during the handover to the OS, and the section header information will be updated so that the OS knows where to find it. With this, the OS can copy out, or just make sure not to overwrite the info, and then get access to it.

for (i = 0; i < shnum; i++) {
    __ElfN(Shdr) *sh = &shdr[i];
    if (sh->sh_size == 0)
        continue;

    if (sh->sh_addr) /* Already loaded */
        continue;

    ASSERT(sizeof(void *) == 4);
    *((volatile __ElfN(Addr) *)&sh->sh_addr) = sh->sh_offset + (uint32_t)addr;

et cetera

The code for pulling out the symbols is quite a bit longer, but it can be found in kern/core/syms.c. With the given RBP unwinder near the top, we can easily get the IP for the caller. With that IP, we do a symbol lookup from the symbols we got via the multiboot info.
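As a rough sketch of what that lookup boils down to (this is not the actual kern/core/syms.c code, and the helper name is made up), you scan the symbol table for the entry whose address range covers the instruction pointer and read its name out of the string table:

#include <elf.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: map an instruction pointer to a symbol name,
 * or return NULL if no symbol covers that address. */
static const char *addr_to_name(uintptr_t ip, const Elf32_Sym *symtab,
                                size_t nsyms, const char *strtab)
{
    size_t i;
    for (i = 0; i < nsyms; i++) {
        const Elf32_Sym *sym = &symtab[i];
        if (sym->st_size != 0 && ip >= sym->st_value &&
            ip < sym->st_value + sym->st_size)
            return strtab + sym->st_name; /* st_name indexes into .strtab */
    }
    return NULL;
}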

Screenshot with backtrace

August 27, 2015

LAST REMINDER! systemd.conf 2015 Call for Presentations ends August 31st!

Here's the last reminder that the systemd.conf 2015 CfP ends on August 31st, 11:59:59pm Central European Time (that's Monday next week)! Make sure to submit your proposals by then!

Please submit your proposals on our website!

And don't forget to register for the conference! Only a limited number of registrations are available due to space constraints! Register here!

For further details about systemd.conf consult the conference website.

August 26, 2015

Mice have an optical sensor that tells them how far they moved in "mickeys". Depending on the sensor, a mickey is anywhere between 1/100 and 1/8200 of an inch or less. The current "standard" resolution is 1000 DPI, but older mice will have 800 DPI, 400 DPI etc. Resolutions above 1200 DPI are generally reserved for gaming mice with (usually) switchable resolution and it's an arms race between manufacturers in who can advertise higher numbers.

HW manufacturers are cheap bastards so of course the mice don't advertise the sensor resolution. Which means that for the purpose of pointer acceleration there is no physical reference. That delta of 10 could be a millimeter of mouse movement or a nanometer, you just can't know. And if pointer acceleration works on input without reference, it becomes useless and unpredictable. That is partially intended, HW manufacturers advertise that a lower resolution will provide more precision while sniping and a higher resolution means faster turns while running around doing rocket jumps. I personally don't think that there's much difference between 5000 and 8000 DPI anymore, the mouse is so sensitive that if you sneeze your pointer ends up next to Philae. But then again, who am I to argue with marketing types.

For us, useless and unpredictable is bad, especially in the use-case of everyday desktops. To work around that, libinput 0.7 now incorporates the physical resolution into pointer acceleration. And to do that we need a database, which will be provided by udev as of systemd 218 (unreleased at the time of writing). This database incorporates the various devices and their physical resolution, together with their sampling rate. udev sets the resolution as the MOUSE_DPI property that we can read in libinput and use as reference point in the pointer accel code. In the simplest case, the entry lists a single resolution with a single frequency (e.g. "MOUSE_DPI=1000@125"), for switchable gaming mice it lists a list of resolutions with frequencies and marks the default with an asterisk ("MOUSE_DPI=400@50 800@50 *1000@125 1200@125"). And you can and should help us populate the database so it gets useful really quickly.

How to add your device to the database

We use udev's hwdb for the database list. The upstream file is in /usr/lib/udev/hwdb.d/70-mouse.hwdb, the ruleset to trigger a match is in /usr/lib/udev/rules.d/70-mouse.rules. The easiest way to add a match is with the libevdev mouse-dpi-tool (version 1.3.2). Run it and follow the instructions. The output looks like this:

$ sudo ./tools/mouse-dpi-tool /dev/input/event8
Mouse Lenovo Optical USB Mouse on /dev/input/event8
Move the device along the x-axis.
Pause 3 seconds before movement to reset, Ctrl+C to exit.
Covered distance in device units: 264 at frequency 125.0Hz | |^C
Estimated sampling frequency: 125Hz
To calculate resolution, measure physical distance covered
and look up the matching resolution in the table below
16mm 0.66in 400dpi
11mm 0.44in 600dpi
8mm 0.33in 800dpi
6mm 0.26in 1000dpi
5mm 0.22in 1200dpi
4mm 0.19in 1400dpi
4mm 0.17in 1600dpi
3mm 0.15in 1800dpi
3mm 0.13in 2000dpi
3mm 0.12in 2200dpi
2mm 0.11in 2400dpi

Entry for hwdb match (replace XXX with the resolution in DPI):
mouse:usb:v17efp6019:name:Lenovo Optical USB Mouse:
 MOUSE_DPI=XXX@125
Take those last two lines and add them to a new local file /etc/udev/hwdb.d/71-mouse.hwdb. Rebuild the hwdb, trigger it, and done:

$ sudo udevadm hwdb --update
$ sudo udevadm trigger /dev/input/event8
Leave out the device path if you're not on systemd 218 yet. Check if the property is set:

$ udevadm info /dev/input/event8 | grep MOUSE_DPI
E: MOUSE_DPI=1000@125
And that shows everything worked. Restart X/Wayland/whatever uses libinput and you're good to go. If it works, double-check the upstream instructions, then file a bug against systemd with those two lines and assign it to me.
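For illustration, assuming the measurement worked out to 1000 DPI (a made-up value for this example), the complete local entry in /etc/udev/hwdb.d/71-mouse.hwdb would look like this (note the leading space on the property line):

mouse:usb:v17efp6019:name:Lenovo Optical USB Mouse:
 MOUSE_DPI=1000@125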

Trackballs are a bit hard to measure like this, my suggestion is to check the manufacturer's website first for any resolution data.

Update 2014/12/06: trackball comment added, udevadm trigger comment for pre 218
Update 2015/08/26: updated link to systemd bugzilla (now on github)

August 24, 2015

First Round of systemd.conf 2015 Sponsors

We are happy to announce the first round of systemd.conf 2015 sponsors!

Our first Gold sponsor is CoreOS!

CoreOS develops software for modern infrastructure that delivers a consistent operating environment for distributed applications. CoreOS's commercial offering, Tectonic, is an enterprise-ready platform that combines Kubernetes and the CoreOS stack to run Linux containers. In addition CoreOS is the creator and maintainer of open source projects such as CoreOS Linux, etcd, fleet, flannel and rkt. The strategies and architectures that influence CoreOS allow companies like Google, Facebook and Twitter to run their services at scale with high resilience. Learn more about CoreOS here, Tectonic here, or follow CoreOS on Twitter @coreoslinux.

A Silver sponsor is Codethink:

Codethink is a software services consultancy, focusing on engineering reliable systems for long-term deployment with open source technologies.

A Bronze sponsor is Pantheon:

Pantheon is a platform for professional website development, testing, and deployment. Supporting Drupal and WordPress, Pantheon runs over 100,000 websites for the world's top brands, universities, and media organizations on top of over a million containers.

A Bronze sponsor is Pengutronix:

Pengutronix provides consulting, training and development services for Embedded Linux to customers from the industry. The Kernel Team ports Linux to customer hardware and has more than 3100 patches in the official mainline kernel. In addition to lowlevel ports, the Pengutronix Application Team is responsible for board support packages based on PTXdist or Yocto and deals with system integration (this is where systemd plays an important role). The Graphics Team works on accelerated multimedia tasks, based on the Linux kernel, GStreamer, Qt and web technologies.

We'd like to thank our sponsors for their support! Without sponsors our conference would not be possible!

We'll shortly announce our second round of sponsors, please stay tuned!

If you'd like to join the ranks of systemd.conf 2015 sponsors, please have a look at our Becoming a Sponsor page!

Reminder! The systemd.conf 2015 Call for Presentations ends on Monday, August 31st! Please make sure to submit your proposals on the CfP page by then!

Also, don't forget to register for the conference! Only a limited number of registrations are available due to space constraints! Register here!

For further details about systemd.conf consult the conference website.

August 19, 2015

So I realized I hadn’t posted a Wayland update in a while. We are still making good progress on Wayland, but the old saying that the last 10% is 90% of the work is definitely true here. There was a Wayland BOF at GUADEC this year which tried to create a TODO list for the major items remaining before Wayland is ready to replace X.

  • Proper menu positioning. Most visible user issue currently for people testing Wayland
  • Relative pointer/confinement/locking. Important for games among other things.
  • Kinetic scroll in GTK+
  • More work needed to remove all X11 dependencies (so that the desktop has no dependency on XWayland being available).
  • Minimize main thread stalls (could be texture uploads for example)
  • Tablet support. Includes gestures, on-screen keyboard and more.

A big thank you to Jonas Ådahl, Owen Taylor, Carlos Garnacho, Rui Matos, Marek Chalupa, Olivier Fourdan and more for their ongoing work on polishing
up the Wayland experience.

So as you can tell there are still a lot of details that need working out when doing something as major as switching from one display system to the other, but things are starting to look really good now.

One new feature I am particularly excited about is what we call multi-DPI support, which should be ready for Wayland sessions in Fedora 23. What this means is that if you have a HiDPI laptop screen and a standard DPI external monitor you should be able to drag windows back and forth between the two screens and have them automatically rescale to work with the different DPIs. This is an example of an issue which was relatively straightforward to resolve under Wayland, but which would have been a lot of pain to get working under X.

We will not be defaulting to Wayland in Fedora Workstation 23 though, because as I have said in earlier blog posts about the subject, we will need to have a stable and feature complete Wayland in at least one release before we switch the default. We hope Fedora Workstation 23 will be that stable and feature complete release, which means Fedora Workstation 24 is the one where we can hope to make the default switchover.

Of course porting the desktop itself to Wayland is just part of the story. While we support running X applications using XWayland, to get full benefit from Wayland we need our applications to run on top of Wayland natively. So we spent effort on getting some major applications like LibreOffice and Firefox Wayland ready recently.

Caolan McNamara has been working hard on finishing up the GTK3 port of LibreOffice, which is part of the bigger effort to bring LibreOffice natively to Wayland. The GTK3 version of LibreOffice should be available in Fedora Workstation 23 (as a non-default option) and all the necessary code will be included in LibreOffice 5, which will be released pretty soon. The GTK3 version should be the default in F24, hopefully with complete Wayland support.

For Firefox Martin Stransky has been looking into ensuring Firefox runs on Wayland now that the basic GTK3 port is done. Martin just reported today that he got Firefox running natively under Wayland, although there are still a lot of glitches and issues he needs to figure out before it can be claimed to be ready for normal use.

Another major piece we are working on which is not directly Wayland related, but which has a Wayland component too is to try to move the state of Linux forward in the context of dealing with multiple OpenGL implementations, multi-GPU systems and the need to free our 3D stack from its close ties to GLX.

This work is led by Adam Jackson, with Dave Airlie as another major contributor, and involves trying to decide and implement what is needed to have things like GL Dispatch, EGLStreams and the EGL Device proposals used across the stack. Once this work is done, the challenges around for instance using the NVidia binary driver on a Linux system or using a discrete GPU like on Optimus laptops should be a thing of the past.

So the first step of this work is getting GL Dispatch implemented. GL Dispatch basically allows you to have multiple OpenGL implementations installed and then have your system pick the right one as needed. So for instance on a system with NVidia Optimus you can use Mesa with the integrated Intel graphics card, but NVidia’s binary OpenGL implementation with the discrete Optimus GPU. Currently that is a pain to do since you can only have one OpenGL implementation in use. Bumblebee tries to hack around that requirement, but GL Dispatch will allow us to resolve this without having to ‘fight’ the assumptions of the system.

We plan to have easy to use support for both Optimus and Prime (the Nouveau equivalent of Optimus) in the desktop, allowing you to choose what GPU to use for your applications without needing to edit any text files or set environment variables.

The final step then is getting the EGL Device and EGLStreams proposals implemented so we can have something to run Wayland on top of. And while GL Dispatch is not directly related to those, we do feel that having it in place should make the setup easier to handle, as you don’t risk conflicts between the binary NVidia driver and the Mesa driver anymore at that point, which becomes even more crucial for Wayland since it runs on top of EGL.

August 18, 2015

REMINDER! systemd.conf 2015 Call for Presentations ends August 31st!

We'd like to remind you that the systemd.conf 2015 Call for Presentations ends on August 31st! Please submit your presentation proposals before that date on our website.

We are specifically interested in submissions from projects and vendors building today's and tomorrow's products, services and devices with systemd. We'd like to learn about the problems you encounter and the benefits you see! Hence, if you work for a company using systemd, please submit a presentation!

We are also specifically interested in submissions from downstream distribution maintainers of systemd! If you develop or maintain systemd packages in a distribution, please submit a presentation reporting about the state, future and the problems of systemd packaging so that we can improve downstream collaboration!

And of course, all talks regarding systemd usage in containers, in the cloud, on servers, on the desktop, in mobile and in embedded are highly welcome! Talks about systemd networking and kdbus IPC are very welcome too!

Please submit your presentations by August 31st!

And don't forget to register for the conference! Only a limited number of registrations are available due to space constraints! Register here!

Also, limited travel and entry fee sponsorship is available for community contributors. Please contact us for details!

For further details about the CfP consult the CfP page.

For further details about systemd.conf consult the conference website.

First Round of systemd.conf 2015 Sponsors

We are happy to announce the first round of systemd.conf 2015 sponsors!

Our first Silver sponsor is CoreOS!

CoreOS develops software for modern infrastructure that delivers a consistent operating environment for distributed applications. CoreOS's commercial offering, Tectonic, is an enterprise-ready platform that combines Kubernetes and the CoreOS stack to run Linux containers. In addition CoreOS is the creator and maintainer of open source projects such as CoreOS Linux, etcd, fleet, flannel and rkt. The strategies and architectures that influence CoreOS allow companies like Google, Facebook and Twitter to run their services at scale with high resilience. Learn more about CoreOS here, Tectonic here, or follow CoreOS on Twitter @coreoslinux.

A Bronze sponsor is Codethink:

Codethink is a software services consultancy, focusing on engineering reliable systems for long-term deployment with open source technologies.

A Bronze sponsor is Pantheon:

Pantheon is a platform for professional website development, testing, and deployment. Supporting Drupal and WordPress, Pantheon runs over 100,000 websites for the world's top brands, universities, and media organizations on top of over a million containers.

A Bronze sponsor is Pengutronix:

Pengutronix provides consulting, training and development services for Embedded Linux to customers from the industry. The Kernel Team ports Linux to customer hardware and has more than 3100 patches in the official mainline kernel. In addition to lowlevel ports, the Pengutronix Application Team is responsible for board support packages based on PTXdist or Yocto and deals with system integration (this is where systemd plays an important role). The Graphics Team works on accelerated multimedia tasks, based on the Linux kernel, GStreamer, Qt and web technologies.

We'd like to thank our sponsors for their support! Without sponsors our conference would not be possible!

We'll shortly announce our second round of sponsors, please stay tuned!

If you'd like to join the ranks of systemd.conf 2015 sponsors, please have a look at our Becoming a Sponsor page!

Reminder! The systemd.conf 2015 Call for Presentations ends on Monday, August 31st! Please make sure to submit your proposals on the CfP page by then!

Also, don't forget to register for the conference! Only a limited number of registrations are available due to space constraints! Register here!

For further details about systemd.conf consult the conference website.

August 17, 2015

A couple of weeks ago I visited my mother back home in Norway. She had gotten a new laptop some time ago that my brother-in-law had set up for her. As usual when I come for a visit I was asked to look at some technical issues my mother was experiencing with her computer. Anyway, one thing I discovered while looking at these issues was that my brother-in-law had installed OpenOffice on her computer. So knowing that the OpenOffice project has been all but dead upstream since IBM pulled their developers off the project almost a year ago, and that it has significantly fallen behind feature-wise, I of course installed LibreOffice on the system instead, knowing it has a strong and vibrant community standing behind it and is going from strength to strength.

And this is why I am writing this open letter. Because while a lot of us who come from technical backgrounds have already caught on to the need to migrate from OpenOffice to LibreOffice, there are still many non-technical users out there who default to installing OpenOffice when they are looking for an open source office suite, because that is the one they came across 5 years ago. And I believe that the Apache Foundation, being an organization dedicated to open source software, cares about the general quality and perception of open source software and thus would share my interest in making sure that all users of open source software get the best experience possible, even if the project in question isn’t using their code license of preference.

So I realize that the Apache Foundation took a lot of pride in, and invested a lot of effort into, trying to create an Apache-licensed office suite based on the old OpenOffice codebase, but I hope that now that it is clear this effort has failed, you would be willing to redirect people who go to the website to the LibreOffice website instead. Letting users believe that OpenOffice is still alive and evolving is only damaging the general reputation of open source office software among non-technical users, and thus I truly believe that it would be in everyone’s interest to help the remaining OpenOffice users over to LibreOffice.

And to be absolutely clear, I am only suggesting this due to the stagnant state of the OpenOffice project. If OpenOffice had managed to build a large active community beyond the resources IBM used to provide, then it would have been a very different story, but since that did not happen I don’t see any value to anyone involved in just letting users keep downloading aging releases of a stagnant codebase until the point where bit rot chases them away or they hear about LibreOffice through mainstream media or friends. And as we all know it is not just about needing a developer or two to volunteer here; maintaining and developing something as large as OpenOffice is a huge undertaking and needs a very sizeable and dedicated community to be able to succeed.

So dear Apache developers, for the sake of open source and free software, please recommend people to go and download LibreOffice, the free office suite that is being actively maintained and developed and which has the best chance of giving them a great experience using free software. OpenOffice is an important part of open source history, but that is also what it is at this point in time.

August 16, 2015
After a few years of development, the atomic display update IOCTL for drm drivers is finally ready for prime time with the 4.2 pull request from Dave Airlie. It's been a long road, with a lot of drivers already converted over to atomic, even more conversions in progress, and the atomic helper libraries and support code in the drm subsystem sufficiently polished. But one thing that has been missing is a design overview of what the overall atomic infrastructure looks like and why some decisions and details are implemented like they are.

That's now done and published on LWN: Part 1 talks about the problem space, issues with the Android atomic display framework and the basic atomic IOCTL interface. Part 2 goes into more detail about a few specific things like locking, helper library design and the exact semantics of atomic modesetting updates. Happy Reading!
August 15, 2015
So the big news for the upcoming mesa 11.0 release is gl4.x support for radeon and nouveau, which has been in the works for a long time and is a pretty tremendous milestone (and the reason that the next mesa release is 11.0 rather than 10.7).  But on the freedreno side of things, we haven't been sitting still either.  In fact, with the transform-feedback support I landed a couple weeks ago (for a3xx+a4xx), plus MRT+z32s8 support for a4xx (Ilia landed the a3xx parts of those a while back), we now support OpenGLES 3.0[1] on both adreno 3xx and 4xx!!

In addition, with the TBO support that landed a few days ago, plus a handful of other fixes in the last few days, we have the new antarctica gl3.1 render engine for supertuxkart working!

Note that you need to use MESA_GL_VERSION_OVERRIDE=3.1 and MESA_GLSL_VERSION_OVERRIDE=140, since while we support everything that stk needs, we don't yet support everything needed to advertise gl3.1.  (But hey, according to qualcomm, adreno 3xx doesn't even support higher than gles3.0.. I guess we'll have to show them ;-))
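For example, assuming the game binary is simply called supertuxkart, that means launching it as:

MESA_GL_VERSION_OVERRIDE=3.1 MESA_GLSL_VERSION_OVERRIDE=140 supertuxkart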

The nice thing to see about this working, is that it is utilizing pretty much all of the recent freedreno features (transform feedback, MRT, UBO's, TBO's, etc).

Of course, the new render engine is considerably more heavyweight compared to older versions of stk.  But I think there is some low hanging fruit on the stk engine side of things to reclaim some of those lost fps.

update: oh, and the first time around, I completely forgot to mention that qualcomm has recently published *some* gpu docs, for a3xx, for the dragonboard 410c. Not quite as extensive as what broadcom has published for vc4, but it gives us all the a3xx registers, which is quite a lot more than any other SoC vendor has done to date :-)

[1] minus MSAA.. There is a bigger task, which is on the TODO list, to teach mesa st about some extensions to support MSAA resolve on tile->mem.. such as EXT_multisampled_render_to_texture, plus perhaps driconf option to enable it for apps that are not aware, which would make MSAA much more useful on a tiling gpu.  Until then, mesa doesn't check MSAA for gles3, and if it did we could advertise PIPE_CAP_FAKE_SW_MSAA.  Plus, who really cares about MSAA on a 5" 4k screen?

August 13, 2015

I started to use Pocket a few months ago to store my backlog of things to read. It's especially useful as I can use it to read content offline, since we still don't have any Internet access in places such as airplanes or the Paris metro. It's only 2015 after all.

I have also been an LWN subscriber for years now, and I really like the articles from the weekly edition. Unfortunately, as access is restricted to subscribers, you need to log in: that makes it impossible to add these articles to Pocket directly. Sad.

Yesterday, I thought about that and decided to start hacking on it. LWN provides a feature called "Subscriber Link" that allows you to share an article with a friend. I managed to use that feature to share the articles with my friend… Pocket!

As doing that every week is tedious, I wrote a small Python program called lwn2pocket that I published on GitHub. Feel free to use it, hack it and send pull requests.

August 11, 2015

One of the things we discussed at this year’s Akademy conference is making AppStream work on Kubuntu.

On Debian-based systems, we use a YAML-based implementation of AppStream, called “DEP-11”. DEP-11 exists for historical reasons (the DEP-11 YAML format was a superset of AppStream once) and because YAML, unlike XML, is a file format accepted by the Debian FTPMasters team, who we want to have on board when adding support for AppStream.

So I’ve spent the last few days on setting up the DEP-11 generator for Kubuntu, as well as improving it greatly to produce more meaningful error messages and to generate better output. It became a bit slower in the process, but the greatly improved diagnostics data is worth it.

For example, maintainers of Debian packages will now get a more verbose explanation of issues found with metadata in their packages, making them easier to fix for people who didn’t get in contact with AppStream yet.

At the moment, we generate AppStream metadata for Tanglu, Kubuntu and Debian, but so far only Tanglu makes real use of it by shipping it in a .deb package. Shipping the data as a package is only a workaround though; for a proper implementation, the data will be downloaded by Apt. To achieve that, the data needs to reach the archive first, which is something that I can hopefully discuss and implement with the FTPMasters team of Debian at this year’s Debconf.

When this is done, the new metadata will automatically become available in tools like GNOME-Software or Muon Discover.

How can I see if there are issues with my package?

The dep11-generator tool produces HTML pages showing both the extracted metadata and any issues with it.

You can find the information for the respective distribution here:

Each issue tag will contain a detailed explanation of what went wrong. Errors generally lead to ignoring the metadata, so it will not be processed. Warnings usually concern things which might reduce the amount of metadata or make it less useful, while Info-type hints contain information on how to improve the metadata or make it more useful.

Can I use the data already?

Yes, you can. You just need to place the compressed YAML files in /var/cache/app-info/yaml and the icons in /var/cache/app-info/icons/<suite>-<component>/<size>, for example: /var/cache/app-info/icons/jessie-amd64/64x64
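A minimal sketch of what that could look like on the command line (the YAML file name and the jessie-amd64 directory are just illustrative assumptions):

# illustrative paths and file names only
sudo mkdir -p /var/cache/app-info/yaml /var/cache/app-info/icons/jessie-amd64/64x64
sudo cp Components-amd64.yml.gz /var/cache/app-info/yaml/
sudo cp -r icons/64x64/. /var/cache/app-info/icons/jessie-amd64/64x64/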

I think I found a bug in the generator…

In that case, please report the issue against the appstream-dep11 package in Debian, or file an issue on GitHub.

The only reason why I announce this feature now is to find remaining generator bugs, before officially announcing the feature on debian-devel-announce.

When will this be officially announced?

I want to give this feature a little bit more testing, and ideally have the integration into the archive ready, so people can see how the metadata looks when rendered in GNOME-Software/Discover. I also want to write a bit more documentation to help Debian developers and upstreams improve their metadata.

Ideally, I also want to incorporate some feedback at Debconf when announcing full AppStream support in Debian. So take all the stuff you’ve read above as a little sneak peek 😉

I will also give a talk at Debconf, titled “AppStream, Limba, XdgApp – Where we are going.” The aim of this talk is to give an insight into the new developments happening in the software distribution area, and what the goal of these different projects is (the talk should give an overview of what’s in the oven and how it will impact Debian). So if you are interested, please drop by :-) Maybe setting up a BOF would also be a good idea.

August 10, 2015

I am very late with this, but I still wanted to write a few words about Akademy 2015.

First of all: It was an awesome conference! Meeting all the great people involved with KDE and seeing who I am working with (as in: face to face, not via email) was awesome. We had some very interesting discussions on a wide variety of topics, and also enjoyed quite some beer together. Particularly important to me was of course talking to Daniel Vrátil who is working on XdgApp, and Aleix Pol of Muon Discover fame.

Also meeting with the other Kubuntu members was awesome – I haven’t seen some of them for about 3 years, and also met many cool people for the first time.

My talk on AppStream/Limba went well, except that I got slightly confused by the timer showing that I had only 2 minutes left, after I had just completed the first half of my talk. It turned out that the timer was wrong 😉

Another really nice aspect was being able to get an insight into areas where I am usually not involved, like visual design. It was really interesting to learn about the great work others are doing and to talk to people about their work – and I also managed to scratch an itch in the Systemsettings application, where three categories had shown the same icon. Now Systemsettings looks the way it is supposed to, finally :-)

The only thing I could maybe complain about was the weather, which was more Scotland/Wales like than Spanish – but that didn’t stop us at all, not even at the social event outside. So I actually don’t complain 😉

We also managed to discuss some new technical stuff, like AppStream for Kubuntu, and plenty of other things that I’ll write about in a separate blog post.

Generally, I got so many new impressions from this year’s Akademy that I could write a 10-page-long blogpost about it while still having to leave things out.

Kudos to the organizers of this Akademy, you did a fantastic job! I also want to thank the Ubuntu community for funding my trip, and the Kubuntu community for pushing me a little to attend :-).

KDE is a great community with people driving the development of Free Software forward. The diversity of people, projects and ideas in KDE is a pleasure, and I am very happy to be part of this community.


August 08, 2015

Wanted to let everyone know that the GStreamer Conference 2015 is happening for the 6th time this year. So if you want to attend the premier open source multimedia conference you can do so in Dublin, Ireland between the 8th and 9th of October. If you want to attend I suggest registering as early as possible using the GStreamer Conference registration webpage. Like earlier years the GStreamer Conference is co-located with other great conferences like the Embedded Linux Conference Europe so you have the chance to combine the attendance into one trip.

The GStreamer Conference has always been a great opportunity to not only learn about the latest developments in GStreamer, but also about what's happening in the general Linux multimedia stack, the latest news from the world of codec development and other related topics. I strongly recommend setting aside the 8th and the 9th of October for a trip to Dublin and the GStreamer Conference.

Also a heads up for those interested in giving a talk. The formal deadline for submitting a proposal is this Sunday, the 9th of August, so you need to hurry to send in a proposal. You find the details for how to submit a talk on the GStreamer Conference 2015 website. While talks submitted before the 9th will be prioritized, I do recommend anyone seeing this after the deadline to still send in a proposal, as there might be a chance to get on the program anyway if you get your proposal in during next week.

August 04, 2015

It's been a while since I talked about Ceilometer and its companions, so I thought I'd go ahead and write a bit about what's going on this side of OpenStack. I'm not going to cover new features and fancy stuff today, but rather a shallow overview of the new project processes we initiated.

Ceilometer growing

Ceilometer has grown a lot since we started it 3 years ago. It has evolved from a system designed to fetch and store measurements, to a more complex system, with agents, alarms, events, databases, APIs, etc.

All those features were needed and asked for by users and operators, but let's be honest, some of them should never have ended up in the Ceilometer code repository, especially not all at the same time.

The reality is that we picked a pragmatic approach due to the rigidity of the OpenStack Technical Committee with regard to new projects becoming OpenStack integrated – and, therefore, blessed – projects. Ceilometer was actually the first project to be incubated and then integrated. We had to go through the very first issues of that process.

Fortunately, now that time has passed, all those constraints have been relaxed. To me, the OpenStack Foundation is turning into something that looks like the Apache Foundation, and there's, therefore, no need to tie technical solutions to political issues.

Indeed, the Big Tent now allows much more flexibility to all of that. Back a year ago, we were afraid to bring Gnocchi into Ceilometer. Was the Technical Committee going to review the project? Was the project going to be in the scope of Ceilometer for the Technical Committee? Now we don't have to ask ourselves those questions, now that we have that freedom, it empowers us to actually do what we think is good in term of technical design without worrying too much about political issues.

Ceilometer development activity

Acknowledging Gnocchi

The first step in this new process was to continue working on Gnocchi (a time series database and resource indexer designed to overcome the historical Ceilometer storage issues) and to decide that it was not the right call to merge it into Ceilometer as some REST API v3, but that it was better to keep it standalone.

We managed to get traction for Gnocchi, getting a few contributors and users. We're even seeing talks proposed for the next Tokyo Summit where people leverage Gnocchi, such as "Service of predictive analytics on cost and performance in OpenStack", "Suveil" and "Cutting Edge NFV On OpenStack: Healing and Scaling Distributed Applications".

We are also making some progress on pushing Gnocchi outside of the OpenStack community, as it can be a self-sufficient time series and resource database that can be used without any OpenStack interaction.

Branching Aodh

Rather than continuing to grow Ceilometer, during the last summit we all decided that it was time to reorganize and split Ceilometer into the different components it is made of, leveraging a more service-oriented architecture. The alarm subsystem of Ceilometer being mostly untied to the rest of Ceilometer, we decided it was the first and perfect candidate for that. I personally took on the work and created a new repository with only the alarm code from Ceilometer, named Aodh.

Aodh is an Irish word meaning fire. A word picked so it also had some relation to Heat, and because we have some Irish influence around the project 😁.

This made sense for a lot of reasons. First, because Aodh can now work completely standalone, using either Ceilometer or Gnocchi as a backend – or any new plugin you'd write. I love the idea that OpenStack projects can work standalone – like Swift does, for example – without implying any other OpenStack component. I think it's a proof of good design. Secondly, because it allows us to reason about a smaller chunk of software – a benefit that is really under-estimated in OpenStack today. I believe that the size of your software should match a certain ratio to the size of your team.

Aodh is, therefore, a new project under the OpenStack Telemetry program (or what remains of OpenStack programs now), alongside Ceilometer and Gnocchi, forked from the original Ceilometer alarm feature. We'll deprecate the latter with the Liberty release, and we'll remove it in the Mitaka release.

Lessons learned

Actually, moving that code out of Ceilometer (in the case of Aodh), or not merging it in (in the case of Gnocchi), had a few side effects that I admit we probably under-estimated back then.

Indeed, the code size of Gnocchi or Aodh ended up being much smaller than the entire Ceilometer project – Gnocchi is 7x smaller and Aodh 5x smaller than Ceilometer – and therefore much easier to manipulate and to hack on. That allowed us to merge dozens of patches in a few weeks, cleaning up and enhancing a lot of small things in the code. Those tasks are much harder in Ceilometer, due to the bigger size of the code base and the small size of our team. Having our small team working on smaller chunks of changes – even when it meant actually doing more reviews – greatly improved our general velocity and the number of bugs fixed and features implemented.

On the more sociological side, I think it gave the team the sensation of finally owning the project. Ceilometer was huge, and it was impossible for people to know every side of it. Now, it's getting possible for people inside a team to cover a much larger portion of those smaller projects, which gives them a greater sense of ownership and caring, which ends up being good for the project quality overall.

That also means that we technically decided to have different core teams per project (Ceilometer, Gnocchi, and Aodh), as they all serve different purposes and can all be used standalone or with each other. Meaning we could have contributors completely ignoring other projects.

All of that reminds me of some discussions I heard about projects such as Glance, trying to fit in new features - some that are really orthogonal to the original purpose. It's now clear to me that having different small components interacting together, each of which can be completely owned and taken care of by a (small) team of contributors, is the way to go. People who can trust each other and easily bring new people in make a project incredibly more powerful. Having a project cover too wide a set of features makes things more difficult if you don't have enough manpower. This is clearly an issue that big projects inside OpenStack are facing now, such as Neutron or Nova.

July 28, 2015

Announcing systemd.conf 2015

We are happy to announce the inaugural systemd.conf 2015 conference of the systemd project.

The conference takes place November 5th-7th, 2015 in Berlin, Germany.

Only a limited number of tickets are available, hence make sure to sign up quickly.

For further details consult the conference website.

July 22, 2015

Below is an outline of the various types of touchpads that can be found in the wild. Touchpads aren't simply categorised into a single type, instead they have a set of properties, a combination of number of physical buttons, touch support and physical properties.

Number of buttons

Physically separate buttons

For years this was the default type of touchpads: a touchpad with a separate set of physical buttons below the touch surface. Such touchpads are still around, but most newer models are Clickpads now.

Touchpads with physical buttons usually provide two buttons, left and right. A few touchpads with three buttons exist, and Apple used to have touchpads with a single physical button back in the PPC days. Touchpads with only two buttons require the software stack to emulate a middle button. libinput does this when both buttons are pressed simultaneously.

A two-button touchpad, with a two-button pointing stick above.

Note that many Lenovo laptops provide a pointing stick above the touchpad. This pointing stick has a set of physical buttons just above the touchpad. While many users use those as substitute touchpad buttons, they logically belong to the pointing stick. The *40 and *50 series are an exception here: the former had no physical buttons on the touchpad and required the top section of the pad to emulate pointing stick buttons; the *50 series has physical buttons, but they are wired to the touchpad. The kernel re-routes those buttons through the trackstick device.


Clickpads

Clickpads are the most common type of touchpads these days. A Clickpad has no separate physical buttons, instead the touchpad itself is clickable as a whole, i.e. a user presses down on the touch area and triggers a physical click. Clickpads thus only provide a single button, everything else needs to be software-emulated.

A clickpad on a Lenovo x220t. Just above the touchpad are the three buttons associated with the pointing stick. Faint markings on the bottom of the touchpad hint at where the software buttons should be.

Right and middle clicks are generated either via software buttons or "clickfinger" behaviour. Software buttons define an area on the touchpad that is a virtual right button. If a finger is in that area when the click happens, the left button event is changed to a right button event. A middle click is either a separate area or emulated when both the left and right virtual buttons are pressed simultaneously.

When the software stack uses the clickfinger method, the number of fingers decides the type of click: a one-finger click is a left button, a two-finger click is a right button, and a three-finger click is a middle button. The location of the fingers doesn't matter, though there are usually some limits on how the fingers can be distributed (e.g. some implementations try to detect a thumb at the bottom of the touchpad to avoid accidental two-finger clicks when the user intends a thumb click).

The libinput documentation has a section on Clickpad software button behaviour with more detailed illustrations.

The touchpad on a T440s has no physical buttons for the pointing stick. The marks on the top of the touchpad hint at the software button position for the pointing stick. Note that there are no markings at the bottom of the touchpad anymore.

Clickpads are labelled by the kernel with the INPUT_PROP_BUTTONPAD input property.


Forcepads

One step further down the touchpad evolution, Forcepads are Clickpads without a physical button. They provide pressure and (at least in Apple's case) have a vibration element that is software-controlled. Instead of the satisfying click of a physical button, you get a buzz of happiness. Which apparently feels the same as a click, judging by the reviews I've read so far. A software-controlled click feel has some advantages, it can be disabled for some gestures, modified for others, etc. I suspect that over time Forcepads will become the main touchpad category, but that's a few years away.

Not much to say on the implementation here. The kernel has some ForcePad support but everything else is spotty.

Note how Apple's Clickpads have no markings whatsoever, Apple uses the clickfinger method by default.

Touch capabilities

Single-touch touchpads

In the beginning, there was the single-finger touchpad. This touchpad would simply provide x/y coordinates for a single finger and get mightily confused when more than one finger was present. These touchpads are now fighting with dodos for exhibition space in museums, few of those are still out in the wild.

Pure multi-touch touchpads

Pure multi-touch touchpads are those that can track, i.e. identify the location of all fingers on the touchpad. Apple's touchpads support 16 touches (iirc), others support 5 touches like the Synaptics touchpads when using SMBus.

Pure multi-touch touchpads are the easiest to support, we can rely on the finger locations and use them for scrolling, gestures, etc. These touchpads usually also provide extra information. In the case of the Apple touchpads we get an ellipse and the orientation of that ellipse for each touch point. Other touchpads provide a pressure value for each touch point. Though pressure is a bit of a misnomer, pressure is usually directly related to contact area. Since our puny human fingers flatten out as the pressure on the pad increases, the contact area increases and the firmware then calculates that back into a (mostly rather arbitrary) pressure reading.

Because pressure is really contact area size, we can use it to detect accidental palm contact or thumbs though it's fairly unreliable. A light palm touch or a touch at the very edge of a touchpad will have a low pressure reading simply because the palm is mostly next to the touchpad and thus the contact area itself remains small.

Partial multi-touch touchpads

The vast majority of touchpads fall into this category. It's the half-way point between single-touch and pure multi-touch. These devices can track N fingers, but detect more than N. The current Synaptics touchpads fall into that category when they're using the serial protocol. Most touchpads that fall into this category can track two fingers and detect up to four or five. So a typical three-finger interaction would give you the location of two fingers and a separate value telling you that a third finger is down.

The lack of finger location doesn't matter for some interactions (tapping, three-finger click) but it can cause issues in some cases. For example, a user may have a thumb resting on a touchpad while scrolling with two fingers. Which touch locations you get depends on the order of the fingers being set down, i.e. this may look like thumb + finger + third touch somewhere (lucky!) or two fingers scrolling + third touch somewhere (unlucky, this looks like a three-finger swipe). So far we've mostly avoided having anything complex enough that requires the exact location of more than two fingers, these pads are so prevalent that any complex feature would exclude the majority of users.

Semi-mt touchpads

A sub-class of partial multi-touch touchpads. These touchpads can technically detect two fingers but the location of both is limited to the bounding box, i.e. the first touch is always the top-left one and the second touch is the bottom-right one. Coordinates jump around as fingers move past each other. Most semi-mt touchpads also have a lower resolution for two touches than for one, so even things like two-finger scrolling can be very jumpy.

Semi-mt touchpads are labelled by the kernel with the INPUT_PROP_SEMI_MT input property.
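As a small illustration (this is not libinput code; the device node is just an example), an application can query these kernel input properties with libevdev:

#include <fcntl.h>
#include <stdio.h>
#include <libevdev/libevdev.h>

int main(void)
{
    struct libevdev *dev = NULL;
    int fd = open("/dev/input/event8", O_RDONLY | O_NONBLOCK); /* example node */

    if (fd < 0 || libevdev_new_from_fd(fd, &dev) < 0)
        return 1;

    if (libevdev_has_property(dev, INPUT_PROP_BUTTONPAD))
        printf("clickpad: no separate physical buttons\n");
    if (libevdev_has_property(dev, INPUT_PROP_SEMI_MT))
        printf("semi-mt: only a two-touch bounding box is reported\n");

    libevdev_free(dev);
    return 0;
}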

Physical properties

External touchpads

USB or Bluetooth touchpads not in a laptop chassis. Think the Apple Magic Trackpad, the Logitech T650, etc. These are usually clickpads, the biggest difference is that they can be removed or added at runtime. One interaction method that is only possible on external touchpads is a thumb resting on the very edge/immediately next to the touchpad. On the far edge, touchpads don't always detect the finger location so clicking with a thumb barely touching the edge makes it hard or impossible to figure out which software button area the finger is on.

These touchpads also don't need palm detection - since they're not located underneath the keyboard, accidental palm touches are a non-issue.

A Logitech T650 external touchpad. Note the thumb position, it is possible to click the touchpad without triggering a touch.

Circular touchpads

Yes, these used to be a thing: touchpads shaped like an ellipse or circle. Luckily for us they have gone full dodo. The X.Org synaptics driver had to be aware of these touchpads to calculate the right distance for edge scrolling - unsurprisingly an edge scroll motion on a circular touchpad isn't very straight.

Graphics tablets

Touch-capable graphics tablets are effectively external touchpads, with two differentiators: they are huge compared to normal touchpads and they have no touchpad buttons whatsoever. This means they can either work like a Forcepad, or rely on interaction methods that don't require buttons (like tap-to-click). Since the physical device is shared with the pen input, some touch arbitration is required to avoid touch input interfering when the pen is in use.

Dedicated edge scroll area

Mostly on older touchpads before two-finger scrolling became the default method. These touchpads have a marking on the touch area that designates the edge to be used for scrolling. A finger movement in that edge zone should trigger vertical motions. Some touchpads have markers for a horizontal scroll area too at the bottom of the touchpad.

A touchpad with a marked edge scroll area on the right.
July 21, 2015

I've had a lot of thought about Mont. (Sorry for the rhymes.) Mont, recall, is the set of all Monte objects. I have a couple interesting thoughts on Mont that I'd like to share, but the compelling result I hope to convince readers of is this: Mont is a simple and easy-to-think-about category once we define an appropriate sort of morphism. By "category" I mean the fundamental building block of category theory, and most of the maths I'm going to use in this post is centered around that field. In particular, "morphism" is used in the sense of categories.

I'd like to put out a little lemma from the other day, first. Let us say that the Monte == operator is defined as follows: For any two objects in Mont, x and y, x == y if and only if x is y, or for any message [verb, args] sendable to these objects, M.send(x, verb, args) == M.send(y, verb, args). In other words, x == y if it is not possible to distinguish x and y by any chain of sent messages. This turns out to relate to the category definition I give below. It also happens to correlate nicely with the idea of equivalence, in that == is an equivalence relation on Mont! The proof:

  • Reflexivity: For any object x, x == x. The first branch of the definition handles this.
  • Symmetry: For any objects x and y, x == y iff y == x. Identity is symmetric, and the second branch is covered by recursion.
  • Transitivity: For any objects x, y, and z, x == y and y == z implies x == z. Yes, if x and y can't be told apart, then if y and z also can't be told apart it immediately follows that x and z are likewise indistinguishable.

Now, obviously, since objects can do whatever computation they like, the actual implementation of == has to be conservative. We generally choose to be sound and incomplete; thus, x == y sometimes has false negatives when implemented in software. We can't really work around this without weakening the language considerably. Thus, when I talk about Mont/==, please be assured that I'm talking more about the ideal than the reality. I'll try to address spots where this matters.

Back to categories. What makes a category? Well, we need a set, some morphisms, and a couple proofs about the behavior of those morphisms. First, the set. I'll use Mont-DF for starters, but eventually we want to use Mont. Not up on my posts? Mont-DF is the subset of Mont where objects are transitively immutable; this is extremely helpful to us since we do not have to worry about mutable state nor any other side effect. (We do have to worry about identity, but most of my results are going to be stated as holding up to equivalence. I am not really concerned with whether there are two 42 objects in Mont right now.)

My first (and, spoiler alert, failed) attempt at defining a category was to use messages as morphisms; that is, to go from one object to another in Mont-DF, send a message to the first object and receive the second object. Clear, clean, direct, simple, and corresponds wonderfully to Monte's semantics. However, there's a problem. The first requirement of a category is that, for any object in the set, there exists an identity morphism, usually called 1, from that object to itself. This is a problem in Monte. We can come up with a message like that for some objects, like good old 42, which responds to ["add", [0]] with 42. (Up to equivalence, of course!) However, for some other objects, like object o as DeepFrozen {}, there are no obvious methods to use.

The answer is to add a new Miranda method which is not overrideable called _magic/0. (Yes, if this approach would have worked, I would have picked a better name.) Starting from Mont-DF, we could amend all objects to get a new set, Mont-DF+Magic, in which the identity morphism is always ["_magic", []]. This neatly wraps up the identity morphism problem.

Next, we have to figure out how to compose messages. At first blush, this is simple; if we start from x and send it some message to get y, and then send another message to y to get z, then we can obviously get from x to z. However, here's the rub: there might not be any single message directly from x to z! We're stuck here. Unlike with other composition operators, there's no hand-wavey way to compose messages like with functions. So this is bunk.

However, we can cheat gently and use the free monoid a.k.a. the humble list. A list of messages will work just fine: To compose them, simply catenate the lists, and the identity morphism is the empty list. Putting it all together, a morphism from 6 to 42 might be [["multiply", [7]]], and we could compose that with [["asString", []]] to get [["multiply", [7]], ["asString", []]], a morphism from 6 to "42". Not shabby at all!

There we go. Now Mont-DF is a category up to equivalence. The (very informally defined) set of representatives of equivalence classes via ==, which I'll call Mont-DF/==, is definitely a category here as well, since it encapsulates the equivalence question. We could alternatively insist that objects in Mont-DF are unique (or that equivalent definitions of objects are those same objects), but I'm not willing to take up that sword this time around, mostly because I don't think that it's true.

"Hold up," you might say; "you didn't prove that Mont is a category, only Mont-DF." Curses! I didn't fool you at all, did I? Yes, you're right. We can't extend this result to Mont wholesale, since objects in Mont can mutate themselves. In fact, Mont doesn't really make sense to discuss in this way, since objects in sets aren't supposed to be mutable. I'm probably going to have to extend/alter my definition of Mont in order to get anywhere with that.

July 16, 2015

In a perfect world, any device that advertises absolute x/y axes also advertises the resolution for those axes. Alas, not all of them do. For libinput, this problem is twofold: parts of the touchscreen API provide data in mm, and without knowing the resolution this is a guess at best. But it also matters for touchpads, where a lack of resolution is a lot more common (though the newest generations of touchpads from the major manufacturers tend to advertise resolutions now).

We have a number of features that rely on the touchpad resolution: from the size of the software buttons to deciding which type of palm detection we need, it is all calculated based on physical measurements. Until recently, we had code to differentiate between touchpads with and without resolution, and most of the special handling was a matter of magic numbers, usually divided by the diagonal of the touchpad in device units. This made code maintenance more difficult - without testing each device, behaviour could not be guaranteed.

With libinput 0.20, we got rid of this special handling and instead require touchpads to advertise their resolution. For devices that don't, manual intervention is required, so we're trying to fix this in multiple places, depending on the confidence of the data. We have hwdb entries for the bcm5974 (Apple) touchpads and the Chromebook Pixel. For Elantech touchpads, a kernel patch is currently waiting to be merged. For ALPS touchpads, we ship size hints with libinput's hwdb. If all that fails, we fall back to a default touchpad size of 69x55mm. [1]
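
For illustration, here is a simplified sketch in Python (not libinput's actual code; the device values below are invented) of how the physical size falls out of the axis ranges and resolutions, with the 69x55mm default as a last resort:

DEFAULT_SIZE_MM = (69, 55)   # fallback size, see [1]

def touchpad_size_mm(x_min, x_max, y_min, y_max, x_res, y_res):
    """Resolution is in units/mm, as reported by the evdev absolute axes."""
    if x_res > 0 and y_res > 0:
        return ((x_max - x_min) / x_res, (y_max - y_min) / y_res)
    return DEFAULT_SIZE_MM   # nothing advertised, no hwdb/kernel fix-up

# Hypothetical touchpad: 2940x2310 units at 42 units/mm -> 70x55mm.
print(touchpad_size_mm(0, 2940, 0, 2310, 42, 42))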

All this affects users in two ways: one is that you may notice a slightly different behaviour of your touchpad now. The software-buttons may be bigger or smaller than before, pointer acceleration may be slightly different, etc. Shouldn't be too bad, but you may just notice it. The second noticeable change is that libinput will now log when it falls back to the default size. If you notice a message like that in your log, please file a bug and attach the output of evemu-describe and the physical dimensions of your touchpad. Once we have that information, we can add it to the right place and make sure that everyone else with that touchpad gets the right settings out of the box.

[1] The default size was chosen because it's close enough to what old touchpads used to be, and those are most likely to lack resolution values. This size may change over time as we get better data.

July 15, 2015

With the release of Solaris 11.3 beta, I've gone back and made a new list of changes to the bundled software packages available in the Solaris IPS package repository, as I've done for the Solaris 11.1, Solaris 11.2 beta, and the Solaris 11.2 GA releases.

Oracle packages

Several bundled packages improve integration with other Oracle software. The Oracle Instant Client packages are now in the IPS repo for building software that connects to Oracle databases. MySQL 5.6 has also been added alongside the existing version 5.5 packages.

The Java runtime & developer kits for Java 7 & 8 were updated to new versions, while the Java 6 versions were removed as their support life winds down. The End of Feature Notices for Oracle Solaris 11 warns that Java 7 will be removed as well in a later release.

Also updated was Oracle Hardware Management Pack (HMP), a set of tools that work with the ILOM, firmware, and other components in Sun/Oracle servers to configure low-level system options. HMP 2.2 was introduced in Solaris 11.2, and Solaris 11.3 now delivers HMP 2.3 packages.

Python packages

Solaris has long included and depended on Python 2. Solaris 11.3 adds Python 3 support for the first time, with the bundling of Python 3.4 and many module packages that work with it. Python 2.7 is still included, as is 2.6 for now, but Python 2 software in Solaris is almost completely switched over to 2.7 now, and Python 2.6 will be obsoleted soon.

A side effect of these changes was a revamping of the naming pattern for Python module packages in IPS - previously most modules delivered a set of packages following the pattern:

  • library/python-2/<module name>
  • library/python-2/<module name>-<version> (one for each Python version)
For example, there were three Mako packages, library/python-2/mako, library/python-2/mako-26, library/python-2/mako-27, where the latter two installed the modules built for the named versions of Python, and the first uses IPS conditional dependencies to install the modules for any Python versions that were installed on the system.

In extending this to provide Python 3 modules, it was decided to drop the python major version from the library/python-N prefix, leaving just the version at the end of the module name. Thus in Solaris 11.3, you'll see that the mako packages are now library/python/mako, library/python/mako-26, library/python/mako-27, and library/python/mako-34.

NVIDIA graphics driver packages

NVIDIA has been providing graphics driver packages for Solaris for almost a decade now. As new families and models of graphics cards are regularly introduced, they retire support for older generations from time to time in the new drivers. Support for these models is retained in a legacy driver, but that requires uninstalling the latest version and switching to a legacy branch. Previously that meant installing NVIDIA's SVR4 package release instead of IPS, losing the ability to get updates with a simple “pkg update” command.

Now the legacy drivers are also available in IPS packages, which will continue to be updated as necessary for bug fixes and support for new Xorg releases during NVIDIA’s Support timeframes for Unix legacy GPU releases. To switch to the version 340 legacy driver on Solaris 11.3 or the later Solaris 11.2 SRUs, simply run:

  # pkg install --reject driver/graphics/nvidia driver/graphics/nvidiaR340 
and then reboot into the new BE created. For the previous version 304, change the above command to end in nvidiaR304 instead.

Other packages

There are far more changes than I've covered here - fortunately, the engineers who worked on many of these changes have written their own blog posts about them for you to check out:

One more thing... Solaris 11.2 packages

While all these are available now in the Solaris 11.3 beta, many are also available for testing and evaluation on existing Solaris 11.2 systems, when you're ready to upgrade a FOSS package, but not the rest of the OS. This is planned to be an ongoing program, so once Solaris 11.3 is officially released, the evaluation packages will keep moving forward to new versions of many packages. More details are available in a Solaris FOSS blog post and an article in the Solaris 11 OTN community space.

Not all packages are available in the evaluation program though, since some depend on OS changes not in Solaris 11.2. For instance, OpenSSH is not available for Solaris 11.2, since it depends on changes to the existing SunSSH packages that allow for the ssh package mediator to choose which ssh software to use on a given system.

Detailed list of changes

This table shows most of the changes to the bundled packages between the original Solaris 11.2.0 release, the latest Solaris 11.2 support repository update (SRU11, aka 11.2.11, released June 13, 2015), and the Solaris 11.3 beta released today. These show the versions they were released with, and not later versions that may now be available via the new FOSS Evaluation Packages for existing Solaris releases.

As with last time, some were excluded for clarity, or to reduce noise and duplication. All of the bundled packages which didn’t change the version number in their packaging info are not included, even if they had updates to fix bugs, security holes, or add support for new hardware or new features of Solaris.

Package | Upstream | 11.2.0 | 11.2 SRU11 | 11.3 Beta
cloud/openstack OpenStack0.2013.2.30.2014.2.20.2014.2.2
cloud/openstack/cinder OpenStack0.2013.2.30.2014.2.20.2014.2.2
cloud/openstack/glance OpenStack0.2013.2.30.2014.2.20.2014.2.2
cloud/openstack/heat OpenStacknot included0.2014.2.20.2014.2.2
cloud/openstack/horizon OpenStack0.2013.2.30.2014.2.20.2014.2.2
cloud/openstack/keystone OpenStack0.2013.2.30.2014.2.20.2014.2.2
cloud/openstack/neutron OpenStack0.2013.2.30.2014.2.20.2014.2.2
cloud/openstack/nova OpenStack0.2013.2.30.2014.2.20.2014.2.2
cloud/openstack/swift OpenStack1.
communication/im/pidgin Pidgin2.
compress/pigz pigznot included2.
crypto/gnupg GnuPG2.
database/mysql-56 MySQLnot included
(MySQL 5.5 in database/mysql-56)
database/sqlite-3 SQLite3.
developer/build/ant Apache Ant1.
developer/documentation-tool/help2man GNU help2mannot includednot included1.46.1
developer/documentation-tool/xmlto xmltonot includednot included0.0.25
developer/java/jdk-6 Java1.6.0.75
(Java SE 6u75)
(Java SE 6u95)
not included
developer/java/jdk-7 Java1.7.0.65
(Java SE 7u65)
(Java SE 7u80)
(Java SE 7u80)
developer/java/jdk-8 Java1.8.0.11
(Java SE 8u11)
(Java SE 8u45)
(Java SE 8u45)
developer/test/check checknot includednot included0.9.14
developer/versioning/mercurial Mercurial SCM2.
developer/versioning/subversion Apache Subversion1.
diagnostic/nicstat nicstatnot includednot included1.95
diagnostic/tcpdump tcpdump4.
diagnostic/wireshark Wireshark1.
driver/graphics/nvidia NVIDIA0.331.38.00.346.35.00.346.35.0
driver/graphics/nvidiaR304 NVIDIAnot included0.304.125.00.304.125.0
driver/graphics/nvidiaR340 NVIDIAnot included0.340.65.00.340.65.0
file/mc GNU Midnight Commander4.
library/apr-15 Apache Portable Runtimenot includednot included1.5.1
library/c++/net6 Gobby1.
library/jansson Janssonnot includednot included2.7
library/json-c JSON-C0.90.90.12
library/libee libee0.
library/libestr libestr0.
library/libgsl GNU GSLnot includednot included1.16
library/liblogging LibLoggingnot includednot included1.0.4
library/libmicrohttpd GNU Libmicrohttpdnot includednot included0.9.37
library/libmilter Sendmail8.
library/libxml2 XML C parser2.
library/neon neon0.
library/perl-5/openscap-512 OpenSCAP1.
library/perl-5/xml-libxml CPAN: XML::LibXML2.142.142.121
was library/python-2/alembic
was library/python-2/amqp
library/python/barbicanclient OpenStacknot included3.
was library/python-2/boto
library/python/ceilometerclient OpenStack1.
library/python/cinderclient OpenStack1.
was library/python-2/cliff
library/python/django Django1.
library/python/django-pyscss django-pyscssnot included1.
was library/python-2/django_compressor
was library/python-2/django_openstack_auth
was library/python-2/eventlet
library/python/futures pythonfuturesnot included2.
library/python/glance_store OpenStacknot included0.
library/python/glanceclient OpenStack0.
was library/python-2/greenlet
library/python/heatclient OpenStack0.
library/python/iniparse iniparsenot included0.40.4
library/python/ipaddr ipaddr-pynot included2.
library/python/jinja2 Jinja2.
library/python/keystoneclient OpenStack0.
library/python/keystonemiddleware OpenStack not included1.
was library/python-2/kombu
library/python/ldappool ldappoolnot included1.01.0
was library/python-2/netaddr
was library/python-2/netifaces
library/python/networkx NetworkXnot included1.
library/python/neutronclient OpenStack2.
library/python/novaclient OpenStack2.
library/python/oauthlib OAuthLibnot included0.
library/python/openscap OpenSCAP1.
library/python/oslo.config OpenStack1.
library/python/oslo.context OpenStacknot included0.
library/python/oslo.db OpenStacknot included1.
library/python/oslo.i18n OpenStacknot included1.
library/python/oslo.messaging OpenStacknot included1.
library/python/oslo.middleware OpenStacknot included0.
library/python/oslo.serialization OpenStacknot included1.
library/python/oslo.utils OpenStacknot included1.
library/python/oslo.vmware OpenStacknot included0.
library/python/osprofiler OpenStacknot included0.
was library/python-2/pep8
PyPI: pep81.
was library/python-2/pip
library/python/posix_ipc POSIX IPC for Pythonnot included0.
was library/python-2/py
library/python/pycadf OpenStacknot included0.
was library/python-2/pyflakes
library/python/pyscss pyScssnot included1.
library/python/pysendfile pysendfilenot included2.
was library/python-2/pytest
was library/python-2/python-mysql
was library/python-2/pytz
was library/python-2/requests
library/python/retrying Retryingnot included1.
library/python/rfc3986 rfc3986not included0.
library/python/saharaclient OpenStacknot included0.
was library/python-2/setuptools
PyPI: setuptools0.
library/python/simplegeneric PyPI: simplegenericnot included0.
was library/python-2/simplejson
library/python/six PyPI: six1.
was library/python-2/sqlalchemy
was library/python-2/sqlalchemy-migrate
was library/python-2/stevedore
library/python/swiftclient OpenStack2.
library/python/taskflow OpenStacknot included0.
was library/python-2/tox
library/python/troveclient OpenStack0.
was library/python-2/virtualenv
library/python/websockify Websockify0.
library/python/wsme wsmenot included0.
library/ruby/hiera Puppetnot included1.
library/security/libassuan GnuPG2.
library/security/libksba GnuPG1.
library/security/openssl OpenSSL1.0.1.8 (1.0.1h) (1.0.1m) (1.0.1o)
library/unixodbc unixODBC2.
library/zlib zlib1.
mail/mailman GNU Mailmannot includednot included2.1.18.1
network/dns/bind ISC BIND9.
network/firewall OpenBSD PFnot includednot included5.5
network/mtr MTRnot includednot included0.86
network/openssh OpenSSHnot includednot included6.5.0.1
network/rsync rsync3.
print/filter/hplip HPLIP3.
runtime/erlang erlang15.2.317.517.5
runtime/java/jre-6 Java1.6.0.75
(Java SE 6u75)
(Java SE 6u95)
not included
runtime/java/jre-7 Java1.7.0.65
(Java SE 7u65)
(Java SE 7u80)
(Java SE 7u80)
runtime/java/jre-8 Java1.8.0.11
(Java SE 8u11)
(Java SE 8u45)
(Java SE 8u45)
runtime/python-27 Python2.
runtime/python-34 Pythonnot includednot included3.4.3
runtime/ruby-21 Rubynot included
(Ruby 1.9.3 in runtime/ruby-19)
security/compliance/openscap OpenSCAP1.
security/sudo Sudo1.
service/network/dns/bind ISC BIND9.
service/network/ftp ProFTPD1. (1.3.4c)
service/network/ntp NTP4.2.7.381 (4.2.7p381) (4.2.8p2) (4.2.8p2)
service/network/samba Samba3.
service/network/smtp/postfix Postfixnot includednot included2.11.3
service/network/smtp/sendmail Sendmail8.
shell/bash GNU bash4.
shell/watch procps-ngnot includednot included3.3.10
shell/zsh Zsh5.
system/data/hardware-registry pci.ids
system/data/timezone IANA Time Zone Data0.5.11 (2014c)0.5.11 (2015d)2015.4 (2015d)
system/font/truetype/google-droid Droid Fonts0.2010.2.240.2010.2.240.2013.6.7
system/library/freetype-2 FreeType2.
system/library/hmp-libs Oracle HMP2.
system/library/i18n/libthai libthai0.
system/library/libdatrie datrie0.
system/management/biosconfig Oracle HMP2.
system/management/facter Puppet1.
system/management/fwupdate Oracle HMP2.
system/management/fwupdate/qlogic Oracle HMP1.
system/management/hmp-snmp Oracle HMP2.
system/management/hwmgmtcli Oracle HMP2.
system/management/hwmgmtd Oracle HMP2.
system/management/ocm Oracle Configuration Manager12.
system/management/puppet Puppet3.
system/management/raidconfig Oracle HMP2.
system/management/ubiosconfig Oracle HMP2.
system/rsyslog rsyslog6.
system/test/sunvts Oracle VTS7.
terminal/tmux tmux1.
text/gnu-patch GNU Patch2.
text/groff GNU troff1.
text/less Less436436458
text/text-utilities util-linuxnot includednot included2.24.2
web/browser/firefox Mozilla Firefox17.0.1131.
web/browser/links Links1.
web/curl cURL7.
web/java-servlet/tomcat Apache Tomcat6.0.416.0.436.0.43
web/java-servlet/tomcat-8 Apache Tomcatnot includednot included8.0.21
web/novnc noVNCnot included0.50.5
web/php-53 PHP5.
web/php-56 PHPnot includednot included5.6.8
web/php-56/extension/php-suhosin-extension Suhosinnot includednot included0.9.37.1
web/php-56/extension/php-xdebug Xdebugnot includednot included2.3.2
web/server/apache-22 Apache HTTPD2.
web/server/apache-22/module/apache-jk Apache Tomcat1.
web/server/apache-22/module/apache-security ModSecurity2.
web/server/apache-22/module/apache-wsgi mod_wsgi3.
web/server/apache-24 Apache HTTPDnot includednot included2.4.12
web/server/apache-24/module/apache-dtrace Apache DTrace modulenot includednot included0.3.1
web/server/apache-24/module/apache-fcgid Apache mod_fcgidnot includednot included2.3.9
web/server/apache-24/module/apache-jk Apache Tomcatnot includednot included1.2.40
web/server/apache-24/module/apache-security ModSecuritynot includednot included2.8.0
mod_wsginot includednot included4.3.0
web/wget GNU wget1.141.161.16
x11/server/xorg/driver/xorg-input-keyboard X.Org1.
x11/server/xorg/driver/xorg-input-mouse X.Org1.
x11/server/xorg/driver/xorg-input-synaptics X.Org1.
x11/server/xorg/driver/xorg-video-ast X.Org0.
x11/server/xorg/driver/xorg-video-dummy X.Org0.
x11/server/xorg/driver/xorg-video-mga X.Org1.
x11/server/xorg/driver/xorg-video-vesa X.Org2.

The libinput test suite takes somewhere around 35 minutes now for a full run. That's annoying, especially as I'm running it for every commit before pushing. I've tried optimising things, but attempts at making it parallel have mostly failed so far (almost all tests need a uinput device created) and too many tests rely on specific timeouts to check for behaviours. Containers aren't an option when you have to create uinput devices, so I started farming the test runs out to VMs.

Ideally, the test suite should run against multiple commits (on multiple VMs) at the same time while I'm working on some other branch and then accumulate the results. And that's where git notes come in. They're a bit odd to use and quite the opposite of what I expected. But in short: a git note is an object that can be associated with a commit, without changing the commit itself. Sort of like a post-it note attached to the commit. But there are plenty of limitations; for example, you can only have one note (per namespace) and merge conflicts are quite easy to trigger. Look at any git notes tutorial to find out more; there are plenty out there.

Anyway, dealing with merge conflicts is a no-go for me here. So after a bit of playing around, I found something that seems to work out well. A script to run make check and add notes to the commit, combined with a repository setup to fetch those notes and display them automatically. The core of the script is this:

make check
rc=$?
if [ $rc -eq 0 ]; then
    status="SUCCESS"
else
    status="FAIL"
fi
sha=$(git rev-parse HEAD)   # commit being tested
if [ -n "$sha" ]; then
    git notes --ref "test-$HOSTNAME" append \
        -m "$status: $HOSTNAME: make check `date`" HEAD
fi
exit $rc
Then in my main repository, I add each VM as a remote, adding a fetch path for the notes:

[remote "f22-libinput1"]
url = f22-libinput1.local:/home/whot/code/libinput
fetch = +refs/heads/*:refs/remotes/f22-libinput1/*
fetch = +refs/notes/*:refs/notes/f22-libinput1/*
Finally, in the main repository, I extended the glob that displays notes to 'everything':

$ git config notes.displayRef "*"
Now git log (and by extension tig) displays all notes attached to a commit automatically. All that's needed is a git fetch --all to fetch everything, and it's clear in the logs which commits failed and which succeeded.

:: whot@jelly:~/code/libinput (master)> git log
commit 6896bfd3f5c3791e249a0573d089b7a897c0dd9f
Author: Peter Hutterer
Date: Tue Jul 14 14:19:25 2015 +1000

test: check for fcntl() return value

Mostly to silence coverity complaints.

Signed-off-by: Peter Hutterer

Notes (f22-jelly/test-f22-jelly):
SUCCESS: f22-jelly: make check Tue Jul 14 00:20:14 EDT 2015

Whenever I look at the log now, I immediately see which commits passed the test suite and which ones didn't (or haven't had it run yet). The only annoyance is that since a note is attached to a commit, amending the commit message or rebasing makes the note "go away". I've copied notes manually after this, but it'd be nice to find a solution to that.

Everything else has been working great so far, but it's quite new so there'll be a bit of polishing happening over the next few weeks. Any suggestions to improve this are welcome.

July 09, 2015

Update June 09, 2015: edge scrolling for clickpads has been merged. It will be available in libinput 0.20. Consider the rest of this post obsolete.

libinput has supported edge scrolling since version 0.7.0. Whoops, how does the post title go with this statement? Well, libinput supports edge scrolling, but only on some devices, and chances are your touchpad won't be one of them. Bug 89381 is the reference bug here.

First, what is edge scrolling? As the libinput documentation illustrates, it is scrolling triggered by finger movement within specific regions of the touchpad - the left and bottom edges for vertical and horizontal scrolling, respectively. This is in contrast to two-finger scrolling, triggered by a two-finger movement anywhere on the touchpad. synaptics has had edge scrolling since at least 2002, the earliest commit in the repo. Back then we didn't have multitouch-capable touchpads; these days they're the default and you'd struggle to find one that doesn't support at least two fingers. But back then edge-scrolling was the default, and touchpads even had the markings for those scroll edges painted on.

libinput adds a whole bunch of features to the touchpad driver, but those features make it hard to support edge scrolling. First, libinput has quite smart software button support. Those buttons are usually on the lowest ~10mm of the touchpad. Depending on finger movement and position, libinput will send a right button click, ignore the movement, etc. You can leave one finger in the button area while using another finger on the touchpad to move the pointer. You can press both left and right areas for a middle click. And so on. On many touchpads the vertical travel/physical resistance is enough to trigger a movement every time you click the button, just by your finger's logical center moving.

libinput also has multi-direction scroll support. Traditionally we only sent one scroll event for vertical/horizontal at a time, even going as far as locking the scroll direction. libinput changes this and only requires an initial threshold to start scrolling; after that the caller will get both horizontal and vertical scroll information. The reason is simple: it's context-dependent when horizontal scrolling should be used, so a global toggle to disable it doesn't make sense. And libinput's scroll coordinates are much more fine-grained too, which is particularly useful for natural scrolling where you'd expect the content to move with your fingers.

Finally, libinput has smart palm detection. The large majority of palm touches are along the left and right edges of the touchpad and they're usually indistinguishable from finger presses (same pressure values for example). Without palm detection some laptops are unusable (e.g. the T440 series).

These features interfere heavily with edge scrolling. Software button areas are in the same region as the horizontal scroll area, and palm presses are in the same region as the vertical edge scroll area. The lower vertical edge scroll zone overlaps with software buttons - and that's where you would put your finger if you wanted to quickly scroll up in a document (or down, for natural scrolling). To support edge scrolling on those touchpads, we'd need heuristics and timeouts to guess when something is a palm, a software button click, a scroll movement, the start of a scroll movement, etc. The heuristics are unreliable, and the timeouts reduce responsiveness in the UI. So our decision was to only provide edge scrolling on touchpads where it is required, i.e. those that cannot support two-finger scrolling, those with physical buttons. All other touchpads provide only two-finger scrolling. And we are focusing on making two-finger scrolling good enough that you don't need or want to use edge scrolling (please file bugs for anything broken).

Now, before you get too agitated: if edge scrolling is that important to you, invest the time you would otherwise spend sharpening pitchforks, lighting torches and painting picket signs into developing a model that allows us to do reliable edge scrolling in light of all the above, without breaking software buttons and while maintaining palm detection. We'd be happy to consider it.

July 08, 2015

In my previous post I introduced ARB_shader_storage_buffer_object, an OpenGL 4.3 feature that is coming soon to Mesa and the Intel i965 driver. While that post focused on explaining the features introduced by the extension, in this post I’ll dive into some of the implementation aspects, for those who are curious about this kind of stuff. Be warned that some parts of this post will be specific to Intel hardware.

Following the trail of UBOs

As I explained in my previous post, SSBOs are similar to UBOs, but they are read-write. Because there is a lot of code already in place in Mesa’s GLSL compiler to deal with UBOs, it made sense to try and reuse the data structures and code we had for UBOs and specialize the behavior for SSBOs where needed; that let us build on code paths that were already working well and reuse most of the existing code.

That path, however, had some issues that bit me a bit further down the road. When it comes to representing these operations in the IR, my first idea was to follow the trail of UBO loads as well, which are represented as ir_expression nodes. There is a fundamental difference between the two though: UBO loads are constant operations because uniform buffers are read-only. This means that a UBO load operation with the same parameters will always return the same value. This has implications related to certain optimization passes that work based on the assumption that other ir_expression operations share this feature. SSBO loads are not like this: since the shader storage buffer is read-write, two identical SSBO load operations in the same shader may not return the same result if the underlying buffer storage has been altered in between by SSBO write operations within the same or other threads. This forced me to alter a number of optimization passes in Mesa to deal with this situation (mostly disabling them for the cases of SSBO loads and stores).

The situation was worse with SSBO stores. These just did not fit into ir_expression nodes: they did not return a value and had side-effects (memory writes) so we had to come up with a different way to represent them. My initial implementation created a new IR node for these, ir_ssbo_store. That worked well enough, but it left us with an implementation of loads and stores that was a bit inconsistent since both operations used very different IR constructs.

These issues were made clear during the review process, where it was suggested that we use GLSL IR intrinsics to represent load and store operations instead. This has the benefit that we can make the implementation more consistent, having both loads and stores represented with the same IR construct and following a similar treatment in both the GLSL compiler and the i965 backend. It would also remove the need to disable or alter certain optimization passes to be SSBO friendly.

Read/Write coherence

One of the issues we detected early in development was that our reads and writes did not seem to work very well together: sometimes a read after a write would fail to see the last value written to a buffer variable. The problem here also stemmed from following the implementation trail of the UBO path. In the Intel hardware, there are various interfaces to access memory, like the Sampling Engine and the Data Port. The former is a read-only interface and is used, for example, for texture and UBO reads. The Data Port allows for read-write access. Although both interfaces give access to the same memory region, there is something to consider here: if you mix reads through the Sampling Engine and writes through the Data Port you can run into cache coherence issues, because the caches in use by the Sampling Engine and the Data Port functions are different. Initially, we implemented SSBO load operations like UBO loads, so we used the Sampling Engine, and ended up running into this problem. The solution, of course, was to rewrite SSBO loads to go through the Data Port as well.

Parallel reads and writes

GPUs are highly parallel hardware and this has some implications for driver developers. Take a statement like this in a fragment shader program:

float cx = 1.0;

This is a simple assignment of the value 1.0 to the variable cx that is supposed to happen for each fragment produced. In Intel hardware running in SIMD16 mode, we process 16 fragments simultaneously in the same GPU thread, which means that this instruction is actually 16 elements wide. That is, we are doing 16 assignments of the value 1.0 simultaneously, each stored at a different offset into the GPU register used to hold the value of cx.

If cx was a buffer variable in a SSBO, it would also mean that the assignment above should translate to 16 memory writes to the same offset into the buffer. That may seem a bit absurd: why would we want to write 16 times if we are always assigning the same value? Well, because things can get more complex, like this:

float cx = gl_FragCoord.x;

Now we are no longer assigning the same value for all fragments, each of the 16 values assigned with this instruction could be different. If cx was a buffer variable inside a SSBO, then we could be potentially writing 16 different values to it. It is still a bit silly, since only one of the values (the one we write last), would prevail.

Okay, but what if we do something like this?:

int index = int(mod(gl_FragCoord.x, 8));
cx[index] = 1;

Now, depending on the value we are reading for each fragment, we are writing to a separate offset into the SSBO. We still have a single assignment in the GLSL program, but that translates to 16 different writes, and in this case the order may not be relevant, but we want all of them to happen to achieve correct behavior.

The bottom line is that when we implement SSBO load and store operations, we need to understand the parallel environment in which we are running and work with test scenarios that allow us to verify correct behavior in these situations. For example, if we only test scenarios with assignments that give the same value to all the fragments/vertices involved in the parallel instructions (i.e. assignments of values that do not depend on properties of the current fragment or vertex), we could easily overlook fundamental defects in the implementation.
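
One way to picture the SIMD16 case is to model the 16 lanes as a loop in which every lane computes its own offset and value. This is only a Python sketch of the concept, not how the hardware or the driver expresses it:

SIMD_WIDTH = 16
ssbo = [0.0] * 8                 # pretend buffer storage behind cx[]

for lane in range(SIMD_WIDTH):
    frag_x = float(lane)         # stand-in for gl_FragCoord.x, differs per lane
    index = int(frag_x % 8)      # int(mod(gl_FragCoord.x, 8))
    ssbo[index] = 1.0            # 16 writes landing on 8 distinct offsets

print(ssbo)                      # all eight slots end up written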

Dealing with helper invocations

From Section 7.1 of the GLSL spec version 4.5:

“Fragment shader helper invocations execute the same shader code
as non-helper invocations, but will not have side effects that
modify the framebuffer or other shader-accessible memory.”

To understand what this means I have to introduce the concept of helper invocations: certain operations in the fragment shader need to evaluate derivatives (explicitly or implicitly) and for that to work well we need to make sure that we compute values for adjacent fragments that may not be inside the primitive that we are rendering. The fragment shader executions for these added fragments are called helper invocations, meaning that they are only needed to help in computations for other fragments that are part of the primitive we are rendering.

How does this affect SSBOs? Because helper invocations are not part of the primitive, they cannot have side effects: once they have served their purpose it should be as if they had never been produced, so in the case of SSBOs we have to be careful not to do memory writes for helper fragments. Notice also that in a SIMD16 execution we can have both proper and helper fragments mixed in the group of 16 fragments we are handling in parallel.

Of course, the hardware knows if a fragment is part of a helper invocation or not, and it tells us about this through a pixel mask register that is delivered with all executions of a fragment shader thread; this register holds a bitmask stating which pixels are proper and which are helper. The Intel hardware also provides developers with various kinds of messages that we can use, via the Data Port interface, to write to memory. However, the tricky thing is that not all of them incorporate pixel mask information, so for use cases where you need to disable writes from helper fragments you have to be careful to select a write message that accepts this sort of information.
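
Conceptually, the pixel mask gates the per-lane writes. In a Python model of the idea (the mask value below is invented), it amounts to something like this:

SIMD_WIDTH = 16
pixel_mask = 0x0FF3              # invented bitmask: set bits mark proper fragments
ssbo = [0.0] * SIMD_WIDTH

for lane in range(SIMD_WIDTH):
    if pixel_mask & (1 << lane): # helper invocations have their bit cleared
        ssbo[lane] = 1.0         # only proper fragments write to memory

print(ssbo)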

Vector alignments

Another interesting thing we had to deal with is address alignment. UBOs work with layout std140. In this setup, elements in the UBO definition are aligned to 16-byte boundaries (the size of a vec4). It turns out that GPUs can usually optimize reads and writes to multiples of 16 bytes, so this makes sense. However, as I explained in my previous post, SSBOs also introduce a packed layout mode known as std430.
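
To see what that means for addresses, here is a small Python sketch (it only covers arrays of scalar floats, which is enough to show the difference): std140 rounds the array stride up to a vec4, while std430 packs the elements tightly.

def float_array_offsets(count, layout):
    """Byte offsets of the elements of 'float a[count]' under each layout."""
    stride = 16 if layout == "std140" else 4   # std430 drops the vec4 rounding
    return [i * stride for i in range(count)]

print(float_array_offsets(4, "std140"))   # [0, 16, 32, 48]
print(float_array_offsets(4, "std430"))   # [0, 4, 8, 12]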

Intel hardware provides a number of messages that we can use through the Data Port interface to write to memory. Each message has different characteristics that make it more suitable for certain scenarios, like the pixel mask I discussed before. For example, some of these messages have the capacity to write data in chunks of 16 bytes (that is, they write vec4 elements, or OWORDs in the language of the technical docs). One could think that these messages are great when you work with vector data types. However, they also introduce the problem of dealing with partial writes: what happens when you only write to one element of a vector? Or to a buffer variable that is smaller than the size of a vector? What if you write columns in a row_major matrix? And so on.

In these scenarios, using these messages introduces the need to mask the writes, because you need to disable the channels in the vec4 element that you don’t want to write. Of course, the hardware provides means to do this: we only need to set the writemask of the destination register of the message instruction to select the right channels. Consider this example:

struct TB {
    float a, b, c, d;
};

layout(std140, binding=0) buffer Fragments {
   TB s[3];
   int index;
};

void main()
{
   s[0].d = -1.0;
}

In this case, we could use a 16-byte write message that takes 0 as its offset (i.e. it writes at the beginning of the buffer, where s[0] is stored) and then set the writemask on that instruction to WRITEMASK_W so that only the fourth data element is actually written. This way we only write one 4-byte data element (-1) at byte offset 12 (s[0].d). Easy, right? However, how do we know, in general, the writemask that we need to use? In std140 layout mode this is easy: since each element in the SSBO is aligned to a 16-byte boundary, we take the byte offset at which we are writing modulo 16, which gives the byte offset into the 16-byte chunk we are writing into, and then divide that by 4 to get the component slot we need to write to (a number between 0 and 3).
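
In code, that rule is just a modulo and a division; this Python sketch mirrors the arithmetic (the real selection happens in the driver at compile time):

def std140_writemask_channel(byte_offset):
    """Which vec4 channel (0=x .. 3=w) a 4-byte write at byte_offset falls into."""
    within_chunk = byte_offset % 16   # byte offset inside the 16-byte chunk
    return within_chunk // 4          # 4 bytes per channel

# s[0].d lives at byte offset 12, i.e. the fourth channel (WRITEMASK_W).
assert std140_writemask_channel(12) == 3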

However, there is a restriction: we can only set the writemask of a register at compile/link time, so what happens when we have something like this?:

s[i].d = -1.0;

The problem with this is that we cannot evaluate the value of i at compile/link time, which inevitably makes our solution invalid for this case. In other words, if we cannot evaluate the actual value of the offset at which we are writing at compile/link time, we cannot use the writemask to select the channels we want to write when we don’t want to write a full vec4 worth of data, and we have to use a different type of message.

That said, in the case of std140 layout mode, since each data element in the SSBO is aligned to a 16-byte boundary, the actual value of i is irrelevant for the purpose of the modulo operation discussed above, so we can still make things work by completely ignoring it when computing the writemask. In std430 that trick won’t work at all, and even in std140 we would still have row_major matrix writes to deal with.

Also, we may need to tweak the message depending on whether we are running in the vertex shader or the fragment shader, because not all message types have appropriate SIMD modes (SIMD4x2, SIMD8, SIMD16, etc.) for both, or because different hardware generations may not have all the message types or support all the SIMD modes we need.

The point of this is that selecting the right message to use can be tricky: there are multiple things and corner cases to consider, and you do not want to end up with an implementation that requires using many different messages depending on various circumstances, because of the complexity that would add to the implementation and maintenance of the code.

Closing notes

This post did not cover all the intricacies of the implementation of ARB_shader_storage_buffer_object, I did not discuss things like the optional unsized array or the compiler details of std430 for example, but, hopefully, I managed to give an idea of the kind of problems one would have to deal with when coding driver support for this or other similar features.