August 29, 2015


GPU mirroring provides a mechanism to have the CPU and the GPU use the same virtual address for the same physical (or IOMMU) page. An immediate result of this is that relocations can be eliminated. There are a few derivative benefits from the removal of the relocation mechanism, but it really all boils down to that. Other people call it other things, but I chose this name before I had heard other names. SVM would probably have been a better name had I read the OCL spec sooner. This is not an exclusive feature restricted to OpenCL. Any GPU client will hopefully eventually have this capability provided to them.

If you’re going to read any single PPGTT post of this series, I think it should not be this one. I was not sure I’d write this post when I started documenting the PPGTT (part 1, part2, part3). I had hoped that any of the following things would have solidified the decision by the time I completed part3.

  1. CODE: The code is not not merged, not reviewed, and not tested (by anyone but me). There’s no indication about the “upstreamability”. What this means is that if you read my blog to understand how the i915 driver currently works, you’ll be taking a crap-shoot on this one.
  2. DOCS: The Broadwell public Programmer Reference Manuals are not available. I can’t refer to them directly, I can only refer to the code.
  3. PRODUCT: Broadwell has not yet shipped. My ulterior motive had always been to rally the masses to test the code. Without product, that isn’t possible.

Concomitant with these facts, my memory of the code and interesting parts of the hardware it utilizes continues to degrade. Ultimately, I decided to write down what I can while it’s still fresh (for some very warped definition of “fresh”).


GPU mirroring is the goal. Dynamic page table allocations are very valuable by itself. Using dynamic page table allocations can dramatically conserve system memory when running with multiple address spaces (part 3 if you forgot), which is something which should become pretty common shortly. Consider for a moment a Broadwell legacy 32b system (more details later). TYou would require about 8MB for page tables to map one page of system memory. With the dynamic page table allocations, this would be reduced to 8K. Dynamic page table allocations are also an indirect requirement for implementing a 64b virtual address space. Having a 64b virtual address space is a pretty unremarkable feature by itself. On current workloads [that I am aware of] it provides no real benefit. Supporting 64b did require cleaning up the infrastructure code quite a bit though and should anything from the series get merged, and I believe the result is a huge improvement in code readability.

Current Status

I briefly mentioned dogfooding these several months ago. At that time I only had the dynamic page table allocations on GEN7 working. The fallout wasn’t nearly as bad as I was expecting, but things were far from stable. There was a second posting which is much more stable and contains support of everything through Broadwell. To summarize:

Feature Status TODO
Dynamic page tables Implemented Test and fix bugs
64b Address space Implemented Test and fix bugs
GPU mirroring Proof of Concept Decide on interface; Implement interface.1

Testing has been limited to just one machine, mine, when I don’t have a million other things to do. With that caveat, on top of my last PPGTT stabilization patches things look pretty stable.

Present: Relocations

Throughout many of my previous blog posts I’ve gone out of the way to avoid explaining relocations. My reluctance was because explaining the mechanics is quite tedious, not because it is a difficult concept. It’s impossible [and extremely unfortunate for my weekend] to make the case for why these new PPGTT features are cool without touching on relocations at least a little bit. The following picture exemplifies both the CPU and GPU mapping the same pages with the current relocation mechanism.

Current PPGTT support

Current PPGTT support

To get to the above state, something like the following would happen.

  1. Create BOx
  2. Create BOy
  3. Request BOx be uncached via (IOCTL DRM_IOCTL_I915_GEM_SET_CACHING).
  4. Do one of aforementioned operations on BOx and BOy
  5. Perform execbuf2.

Accesses to the BO from the CPU require having a CPU virtual address that eventually points to the pages representing the BO2. The GPU has no notion of CPU virtual addresses (unless you have a bug in your code). Inevitably, all the GPU really cares about is physical pages; which ones. On the other hand, userspace needs to build up a set of GPU commands which sometimes need to be aware of the absolute graphics address.

Several commands do not need an absolute address. 3DSTATE_VS for instance does not need to know anything about where Scratch Space Base Offset
is actually located. It needs to provide an offset to the General State Base Address. The General State Base Address does need to be known by userspace:

Using the relocation mechanism gives userspace a way to inform the i915 driver about the BOs which needs an absolute address. The handles plus some information about the GPU commands that need absolute graphics addresses are submitted at execbuf time. The kernel will make a GPU mapping for all the pages that constitute the BO, process the list of GPU commands needing update, and finally submit the work to the GPU.

Future: No relocations

GPU Mirroring

GPU Mirroring

The diagram above demonstrates the goal. Symmetric mappings to a BO on both the GPU and the CPU. There are benefits for ditching relocations. One of the nice side effects of getting rid of relocations is it allows us to drop the use of the DRM memory manager and simply rely on malloc as the address space allocator. The DRM memory allocator does not get the same amount of attention with regard to performance as malloc does. Even if it did perform as ideally as possible, it’s still a superfluous CPU workload. Other people can probably explain the CPU overhead in better detail. Oh, and OpenCL 2.0 requires it.

"OpenCL 2.0 adds support for shared virtual memory (a.k.a. SVM). SVM allows the host and 
kernels executing on devices to directly share complex, pointer-containing data structures such 
as trees and linked lists. It also eliminates the need to marshal data between the host and devices. 
As a result, SVM substantially simplifies OpenCL programming and may improve performance."

Makin’ it Happen


As I’ve already mentioned, the most obvious requirement is expanding the GPU address space to match the CPU.

Page Table Hierarchy

Page Table Hierarchy

If you have taken any sort of Operating Systems class, or read up on Linux MM within the last 10 years or so, the above drawing should be incredibly unremarkable. If you have not, you’re probably left with a big ‘WTF’ face. I probably can’t help you if you’re in the latter group, but I do sympathize. For the other camp: Broadwell brought 4 level page tables that work exactly how you’d expect them to. Instead of the x86 CPU’s CR3, GEN GPUs have PML4. When operating in legacy 32b mode, there are 4 PDP registers that each point to a page directory and therefore map 4GB of address space3. The register is just a simple logical address pointing to a page directory. The actual changes in hardware interactions are trivial on top of all the existing PPGTT work.

The keen observer will notice that there are only 256 PML4 entries. This has to do with the way in which we've come about 64b addressing in x86. This wikipedia article explains it pretty well, and has links.

“This will take one week. I can just allocate everything up front.” (Dynamic Page Table Allocation)

Funny story. I was asked to estimate how long it would take me to get this GPU mirror stuff in shape for a very rough proof of concept. “One week. I can just allocate everything up front.” If what I have is, “done” then I was off by 10x.

Where I went wrong in my estimate was math. If you consider the above, you quickly see why allocating everything up front is a terrible idea and flat out impossible on some systems.

Page for the PML4
512 PDP pages per PML4 (512, ok we actually use 256)
512 PD pages per PDP (256 * 512 pages for PDs)
512 PT pages per PD (256 * 512 * 512 pages for PTs)
(256 * 512^2 + 256 * 512 + 256 + 1) * PAGE_SIZE = ~256G = oops

Dissimilarities to x86

First and foremost, there are no GPU page faults to speak of. We cannot demand allocate anything in the traditional sense. I was naive though, and one of the first thoughts I had was: the Linux kernel [heck, just about everything that calls itself an OS] manages 4 level pages tables on multiple architectures. The page table format on Broadwell is remarkably similar to x86 page tables. If I can’t use the code directly, surely I can copy. Wrong.

Here is some code from the Linux kernel which demonstrates how you can get a PTE for a given address in Linux.

typedef unsigned long   pteval_t;
typedef struct { pteval_t pte; } pte_t;

static inline pteval_t native_pte_val(pte_t pte)
        return pte.pte;

static inline pteval_t pte_flags(pte_t pte)
        return native_pte_val(pte) & PTE_FLAGS_MASK;

static inline int pte_present(pte_t a)
        return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
        return (pte_t *)pmd_page_vaddr(*pmd) + pte_index(address);
#define pte_offset_map(dir, address) pte_offset_kernel((dir), (address))

#define pgd_offset(mm, address) ( (mm)->pgd + pgd_index((address)))
static inline pud_t *pud_offset(pgd_t *pgd, unsigned long address)
        return (pud_t *)pgd_page_vaddr(*pgd) + pud_index(address);
static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
        return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);

/* My completely fabricated example of finding page presence */
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *ptep;
struct mm_struct *mm = current->mm;
unsigned long address = 0xdefeca7e;

pgd = pgd_offset(mm, address);
pud = pud_offset(pgd, address);
pmd = pmd_offset(pud, address);
ptep = pte_offset_map(pmd, address);
printk("Page is present: %s\n", pte_present(*ptep) ? "yes" : "no");

X86 page table code has a two very distinct property that does not exist here (warning, this is slightly hand-wavy).

  1. The kernel knows exactly where in physical memory the page tables reside4. On x86, it need only read CR3. We don’t know where our pages tables reside in physical memory because of the IOMMU. When VT-d is enabled, the i915 driver only knows the DMA address of the page tables.
  2. There is a strong correlation between a CPU process and an mm (set of page tables). Keeping mappings around of the page tables is easy to do if you don’t want to take the hit to map them every time you need to look at a PTE.

If the Linux kernel needs to find if a page is present or not without taking a fault, it need only look to one of those two options. After about of week of making the IOMMU driver do things it shouldn’t do, and trying to push the square block through the round hole, I gave up on reusing the x86 code.

Why Do We Actually Need Page Table Tracking?

The IOMMU interfaces were not designed to pull a physical address from a DMA address. Pre-allocation is right out. It’s difficult to try to get the instantaneous state of the page tables…

Another thought I had very early on was that tracking could be avoided if we just never tore down page tables. I knew this wasn’t a good solution, but at that time I just wanted to get the thing working and didn’t really care if things blew up spectacularly after running for a few minutes. There is actually a really easy set of operations that show why this won’t work. For the following, think of the four level page tables as arrays. ie.

  • PML4[0-255], each point to a PDP
  • PDP[0-255][0-511], each point to a PD
  • PD[0-255][0-511][0-511], each point to a PT
  • PT[0-255][0-511][0-511][0-511] (where PT[0][0][0][0][0] is the 0th PTE in the system)
  1. [mesa] Create a 2M sized BO. Write to it. Submit it via execbuffer
  2. [i915] See new BO in the execbuffer list. Allocate page tables for it…
    1. [DRM]Find that address 0 is free.
    2. [i915]Allocate PDP for PML4[0]
    3. [i915]Allocate PD for PDP[0][0]
    4. [i915]Allocate PT for PD[0][0][0]/li>
    5. [i915](condensed)Set pointers from PML4->PDP->PD->PT
    6. [i915]Set the 512 PTEs PT[0][0][0][0][511-0] to point to the BO’s backing page.
  3. [i915] Dispatch work to the GPU on behalf of mesa.
  4. [i915] Observe the hardware has completed
  5. [mesa] Create a 4k sized BO. Write to it. Submit both BOs via execbuffer.
  6. [i915] See new BO in the execbuffer list. Allocate page tables for it…
    1. [DRM]Find that address 0x200000 is free.
    2. [i915]Allocate PDP[0][0], PD[0][0][0], PT[0][0][0][1].
    3. Set pointers… Wait. Is PDP[0][0] allocated already? Did we already set pointers? I have no freaking idea!
    4. Abort.

Page Tables Tracking with Bitmaps

Okay. I could have used a sentinel for empty entries. It is possible to achieve this same thing by using a sentinel value (point the page table entry to the scratch page). To implement this involves reading back potentially large amounts of data from the page tables which will be slow. It should work though. I didn’t try it.

After I had determined I couldn’t reuse x86 code, and that I need some way to track which page table elements were allocated, I was pretty set on using bitmaps for tracking usage. The idea of a hash table came and went – none of the upsides of a hash table are useful here, but all of the downsides are present(space). Bitmaps was sort of the default case. Unfortunately though, I did some math at this point, notice the LaTex!.
\frac{2^{47}bytes}{\frac{4096bytes}{1 page}} = 34359738368 pages \\  34359738368 pages \times \frac{1bit}{1page} = 34359738368 bits \\  34359738368 bits \times \frac{8bits}{1byte} = 4294967296 bytes
That’s 4GB simply to track every page. There’s some more overhead because page [tables, directories, directory pointers] are also tracked.
  256entries + (256\times512)entries + (256\times512^2)entries = 67240192entries \\  67240192entries \times \frac{1bit}{1entry} = 67240192bits \\  67240192bits \times \frac{8bits}{1byte} = 8405024bytes \\  4294967296bytes + 8405024bytes = 4303372320bytes \\  4303372320bytes \times \frac{1GB}{1073741824bytes} = 4.0078G

I can’t remember whether I had planned to statically pre-allocate the bitmaps, or I was so caught up in the details and couldn’t see the big picture. I remember thinking, 4GB just for the bitmaps, that will never fly. I probably spent a week trying to figure out a better solution. When we invent time travel, I will go back and talk to my former self: 4GB of bitmap tracking if you’re using 128TB of memory is inconsequential. That is 0.3% of the memory used by the GPU. Hopefully you didn’t fall into that trap, and I just wasted your time, but there it is anyway.

Sample code to walk the page tables

This code does not actually exist, but it is very similar to the real code. The following shows how one would “walk” to a specific address allocating the necessary page tables and setting the bitmaps along the way. Teardown is a bit harder, but it is similar.

static struct i915_pagedirpo *
alloc_one_pdp(struct i915_pml4 *pml4, int entry)

static struct i915_pagedir *
alloc_one_pd(struct i915_pagedirpo *pdp, int entry)

static struct i915_tab *
alloc_one_pt(struct i915_pagedir *pd, int entry)

 * alloc_page_tables - Allocate all page tables for the given virtual address.
 * This will allocate all the necessary page tables to map exactly one page at
 * @address. The page tables will not be connected, and the PTE will not point
 * to a page.
 * @ppgtt:	The PPGTT structure encapsulating the virtual address space.
 * @address:	The virtual address for which we want page tables.
static void
alloc_page_tables(ppgtt, unsigned long address)
	struct i915_pagetab *pt;
	struct i915_pagedir *pd;
	struct i915_pagedirpo *pdp;
	struct i915_pml4 *pml4 = &ppgtt->pml4; /* Always there */

	int pml4e = (address >> GEN8_PML4E_SHIFT) & GEN8_PML4E_MASK;
	int pdpe = (address >> GEN8_PDPE_SHIFT) & GEN8_PDPE_MASK;
	int pde = (address >> GEN8_PDE_SHIFT) & I915_PDE_MASK;
	int pte = (address & I915_PDES_PER_PD);

	if (!test_bit(pml4e, pml4->used_pml4es))
		goto alloc;

	pdp = pml4->pagedirpo[pml4e];
	if (!test_bit(pdpe, pdp->used_pdpes;))
		goto alloc;

	pd = pdp->pagedirs[pdpe];
	if (!test_bit(pde, pd->used_pdes)
		goto alloc;

	pt = pd->page_tables[pde];
	if (test_bit(pte, pt->used_ptes))

	pdp = alloc_one_pdp(pml4, pml4e);
	set_bit(pml4e, pml4->used_pml4es);
	pd = alloc_one_pd(pdp, pdpe);
	set_bit(pdpe, pdp->used_pdpes);
	pt = alloc_one_pt(pd, pde);
	set_bit(pde, pd->used_pdes);

Here is a picture which shows the bitmaps for the 2 allocation example above.

Bitmaps tracking page tables

Bitmaps tracking page tables

The GPU mirroring interface

I really don’t want to spend too much time here. In other words, no more pictures. As I’ve already mentioned, the interface was designed for a proof of concept which already had code using userptr. The shortest path was to simply reuse the interface.

In the patches I’ve submitted, 2 changes were made to the existing userptr interface (which wasn’t then, but is now, merged upstream). I added a context ID, and the flag to specify you want mirroring.

struct drm_i915_gem_userptr {
	__u64 user_ptr;
	__u64 user_size;
	__u32 ctx_id;
	__u32 flags;
#define I915_USERPTR_READ_ONLY          (1<<0)
#define I915_USERPTR_GPU_MIRROR         (1<<1)
#define I915_USERPTR_UNSYNCHRONIZED     (1<<31)
	 * Returned handle for the object.
	 * Object handles are nonzero.
	__u32 handle;
	__u32 pad;

The context argument is to tell the i915 driver for which address space we’ll be mirroring the BO. Recall from part 3 that a GPU process may have multiple contexts. The flag is simply to tell the kernel to use the value in user_ptr as the address to map the BO in the virtual address space of the GEN GPU. When using the normal userptr interface, the i915 driver will pick the GPU virtual address.

  • Pros:
    • This interface is very simple.
    • Existing userptr code does the hard work for us
  • Cons:
    • You need 1 IOCTL per object. Much undeeded overhead.
    • It’s subject to a lot of problems userptr has5
    • Userptr was already merged, so unless pad get’s repruposed, we’re screwed

What should be: soft pin

There hasn’t been too much discussion here, so it’s hard to say. I believe the trends of the discussion (and the author’s personal preference) would be to add flags to the existing execbuf relocation mechanism. The flag would tell the kernel to not relocate it, and use the presumed_offset field that already exists. This is sometimes called, “soft pin.” It is a bit of a chicken and egg problem since the amount of work in userspace to make this useful is non-trivial, and the feature can’t merged until there is an open source userspace. Stay tuned. Perhaps I’ll update the blog as the story unfolds.

Wrapping it up (all 4 parts)

As usual, please report bugs or ask questions.

So with the 4 parts you should understand how the GPU interacts with system memory. You should know what the Global GTT is, why it still exists, and how it works. You might recall what a PPGTT is, and the intricacies of multiple address space. Hopefully you remember what you just read about 64b and GPU mirror. Expect a rebased patch series from me soon with all that was discussed (quite a bit has changed around me since my original posting of the patches).

This is the last post I will be writing on how GEN hardware interfaces with system memory, and how that related to the i915 driver. Unlike the Rocky movie series, I will stop at the 4th. Like the Rocky movie series, I hope this is the best. Yes, I just went there.

Unlike the usual, “buy me a beer if you liked this”, I would like to buy you a beer if you read it and considered giving me feedback. So if you know me, or meet me somewhere, feel free to reclaim the voucher.

Image links

The images I’ve created. Feel free to do with them as you please.

Download PDF

  1. The patches I posted for enabling GPU mirroring piggyback of of the existing userptr interface. Before those patches were merged I added some info to the API (a flag + context) for the point of testing. I needed to get this working quickly and porting from the existing userptr code was the shortest path. Since then userptr has been merged without this extra info which makes things difficult for people trying to test things. In any case an interface needs to be agreed upon. My preference would be to do this via the existing relocation flags. One could add a new flag called "SOFT_PIN" 

  2. The GEM and BO terminology is a fancy sounding wrapper for the notion that we want an interface to coherently write data which the GPU can read (input), and have CPU observe data which the GPU has written (output)  

  3. The PDP registers are are not PDPEs because they do not have any of the associated flags of a PDPE. Also, note that in my patch series I submitted a patch which defines the number of these to be PDPE. This is incorrect. 

  4. I am not sure how KVM works manages page tables. At least conceptually I’d think they’d have a similar problem to the i915 driver’s page table management. I should have probably looked a bit closer as I may have been able to leverage that; but I didn’t have the idea until just now… looking at the KVM code, it does have a lot of similarities to the approach I took 

  5. Let me be clear that I don’t think userptr is a bad thing. It’s a very hard thing to get right, and much of the trickery needed for it is *not* needed for GPU mirroring 


Pictures are the right way to start.


Conceptual view of aliasing PPGTT bind/unbind

There is exactly one thing to get from the above drawing, everything else is just to make it as close to fact as possible.

  1. The aliasing PPGTT (aliases|shadows|mimics) the global GTT.

The wordy overview

Support for Per-process Graphics Translation Tables (PPGTT) debuted on Sandybridge (GEN6). The features provided by hardware are a superset of Aliasing PPGTT, which is entirely a software construct. The most obvious unimplemented feature is that the hardware supports multiple PPGTTs. Aliasing PPGTT is a single instance of a PPGTT. Although not entirely true, it’s easiest to think of the Aliasing PPGTT as a set page table of page tables that is maintained to have the identical mappings as the global GTT (the picture above). There is more on this in the Summary section

Until recently, aliasing PPGTT was the only way to make use of the hardware feature (unless you accidentally stepped into one of my personal branches). Aliasing PPGTT is implemented as a performance feature (more on this later). It was an important enabling step for us as well as it provided a good foundation for the lower levels of the real PPGTT code.

In the following, I will be using the HSW PRMs as a reference. I’ll also assume you’ve read, or understand part 1.

Selecting GGTT or PPGTT

Choosing between the GGTT and the Aliasing PPGTT is very straight forward. The choice is provided in several GPU commands. If there is no explicit choice, than there is some implicit behavior which is usually sensible. The most obvious command to be provided with a choice is MI_BATCH_BUFFER_START. When a batchbuffer is submitted, the driver sets a single bit that determines whether the batch will execute out of the GGTT or a Aliasing PPGTT1. Several commands as well, like PIPE_CONTROL, have a bit to direct which to use for the reads or writes that the GPU command will perform.


The names for all the page table data structures in hardware are the same as for IA CPU. You can see the Intel® 64 and IA-32 Architectures Software Developer Manuals for more information. (At the time of this post: page 1988 Vol3. 4.2 HIERARCHICAL PAGING STRUCTURES: AN OVERVIEW). I don’t want to rehash the HSW PRMs  too much, and I am probably not allowed to won’t copy the diagrams. However, for the sake of having a consolidated post, I will rehash the most pertinent parts.

There is one conceptual Page Directory for a PPGTT – the docs call this a set of Page Directory Entries (PDEs), however since they are contiguous, calling it a Page Directory makes a lot of sense to me. In fact, going back to the Ironlake docs, that seems to be the case. So there is one page directory with up to 512 entries, each pointing to a page table.  There are several good diagrams which I won’t bother redrawing in the PRMs2

Page Directory Entry
31:12 11:04 03:02 01 0
Physical Page Address 31:12 Physical Page Address 39:32 Rsvd Page size (4K/32K) Valid
Page Table Entry
31:12 11 10:04 03:01 0
Physical Page Address 31:12 Cacheability Control[3] Physical Page Address 38:32 Cacheability Control[2:0] Valid

There’s some things we can get from this for those too lazy to click on the links to the docs.

  1. PPGTT page tables exist in physical memory.
  2. PPGTT PTEs have the exact same layout as GGTT PTEs.
  3. PDEs don’t have cache attributes (more on this later).
  4. There exists support for big pages3

With the above definitions, we now can derive a lot of interesting attributes about our GPU. As already stated, the PPGTT is a two-level page table (I’ve not yet defined the size).

  • A PDE is 4 bytes wide
  • A PTE is 4 bytes wide
  • A Page table occupies 4k of memory.
  • There are 4k/4 entries in a page table.

With all this information, I now present you a slightly more accurate picture.


An object with an aliased PPGTT mapping


PP_DCLV – PPGTT Directory Cacheline Valid Register: As the spec tells us, “This register controls update of the on-chip PPGTT Directory Cache during a context restore.” This statement is directly contradicted in the very next paragraph, but the important part is the bit about the on-chip cache. This register also determines the amount of virtual address space covered by the PPGTT. The documentation for this register is pretty terrible, so a table is actually useful in this case.

PPGTT Directory Cacheline Valid Register (from the docs)
63:32 31:0
MBZ PPGTT Directory Cache Restore [1..32] 16 entries
DCLV, the right way
31 30 1 0
PDE[511:496] enable PDE [495:480] enable PDE[31:16] enable PDE[15:0] enable

The, “why” is not important. Each bit represents a cacheline of PDEs, which is how the register gets its name4. A PDE is 4 bytes, there are 64b in a cacheline, so 64/4 = 16 entries per bit.  We now know how much address space we have.

512 PDEs * 1024 PTEs per PT * 4096 PAGE_SIZE = 2GB


PP_DIR_BASE: Sadly, I cannot find the definition to this in the public HSW docs. However, I did manage to find a definition in the Ironlake docs yay me. There are several mentions in more recent docs, and it works the same way as is outlined on Ironlake. Quoting the docs again, “This register contains the offset into the GGTT where the (current context’s) PPGTT page directory begins.” We learn a very important caveat about the PPGTT here – the PPGTT PDEs reside within the GGTT.


With these two things, we now have the ability to program the location, and size (and get the thing to load into the on-chip cache). Here is current i915 code which switches the address space (with simple comments added). It’s actually pretty ho-hum.

ret = intel_ring_begin(ring, 6);
if (ret)
	return ret;

intel_ring_emit(ring, MI_LOAD_REGISTER_IMM(2));
intel_ring_emit(ring, RING_PP_DIR_DCLV(ring));
intel_ring_emit(ring, PP_DIR_DCLV_2G);       // program size
intel_ring_emit(ring, RING_PP_DIR_BASE(ring));
intel_ring_emit(ring, get_pd_offset(ppgtt)); // program location
intel_ring_emit(ring, MI_NOOP);

As you can see, we program the size to always be the full amount (in fact, I fixed this a long time ago, but never merged). Historically, the offset was at the top of the GGTT, but with my PPGTT series merged, that is abstracted out, and the simple get_pd_offset() macro gets the offset within the GGTT. The intel_ring_emit() stuff is because the docs recommended setting the registers via the GPU’s LOAD_REGISTER_IMMEDIATE command, though empirically it seems to be fine if we simply write the registers via MMIO (for Aliasing PPGTT). See my previous blog post if you want more info about the commands execution in the GPU’s ringbuffer. If it’s easier just pretend it’s 2 MMIO writes.


All of the resources are allocated and initialized upfront. There are 3 main steps. Note that the following comes from a relatively new kernel, and I have already submitted patches which change some of the cosmetics. However, the concepts haven’t changed for pre-gen8.

1. Allocate space in the GGTT for the PPGTT PDEs

ret = drm_mm_insert_node_in_range_generic(&dev_priv->,
					  &ppgtt->node, GEN6_PD_SIZE,
					  GEN6_PD_ALIGN, 0,
					  0, dev_priv->,

2. Allocate the page tables

for (i = 0; i < ppgtt->num_pd_entries; i++) {
	ppgtt->pt_pages[i] = alloc_page(GFP_KERNEL);
	if (!ppgtt->pt_pages[i]) {
		return -ENOMEM;

3. [possibly] IOMMU map the pages

for (i = 0; i < ppgtt->num_pd_entries; i++) {
	dma_addr_t pt_addr;

	pt_addr = pci_map_page(dev->pdev, ppgtt->pt_pages[i], 0, 4096,

As the system binds, and unbinds objects into the aliasing PPGTT, it simply writes the PTEs for the given object (possibly spanning multiple page tables). The PDEs do not change. PDEs are mapped to a scratch page when not used, as are the PTEs.


As we saw in step 3 above, I mention that the page tables may be mapped by the IOMMU. This is one important caveat that I didn’t fully understand early on, so I wanted to recap a bit. Recall that the GGTT is allocated out of system memory during the boot firmware’s initialization. This means that as long as Linux treats that memory as special, everything will just work (just don’t look for IOMMU implicated bugs on our bugzilla). The page tables however are special because they get allocated after Linux is already running, and the IOMMU is potentially managing the memory. In other words, we don’t want to write the physical address to the PDEs, we want to write the dma address. Deferring to wikipedia again for the description of an IOMMU., that’s all.It tripped be up the first time I saw it because I hadn’t dealt with this kind of thing before. Our PTEs have worked the same way for a very long time when mapping the BOs, but those have somewhat hidden details because they use the scatter-gather functions.

Feel free to ask questions in the comments if you need more clarity – I’d probably need another diagram to accommodate.

Cached page tables

Let me be clear, I favored writing a separate post for the Aliasing PPGTT because it gets a lot of the details out of the way for the post about Full PPGTT. However, the entire point of this feature is to get a [to date, unmeasured] performance win. Let me explain… Notice bits 4:3 of the ECOCHK register.  Similarly in the i915 code:

ecochk = I915_READ(GAM_ECOCHK);
if (IS_HASWELL(dev)) {
} else {
I915_WRITE(GAM_ECOCHK, ecochk);

What these bits do is tell the GPU whether (and how) to cache the PPGTT page tables. Following the Haswell case, the code is saying to map the PPGTT page table with write-back caching policy. Since the writes for Aliasing PPGTT are only done at initialization, the policy is really not that important.

Below is how I’ve chosen to distinguish the two. I have no evidence that this is actually what happens, but it seems about right.


Flow chart for GPU GGTT memory access. Red means slow.

Flow chart for GPU PPGTT memory access. Red means slow.

Flow chart for GPU PPGTT memory access. Red means slow.

Red means slow. The point which was hopefully made clear above is that when you miss the TLB on a GGTT access, you need to fetch the entry from memory, which has a relatively high latency. When you miss the TLB on a PPGTT access, you have two caches (the special PDE cache for PPGTT, and LLC) which are backing the request. Note there is an intentional bug in the second diagram – you may miss the LLC on the PTE fetch also. I was trying to keep things simple, and show the hopeful case.

Because of this, all mappings which do not require GGTT mappings get mapped to the aliasing PPGTT.


Distinctions from the GGTT

At this point I hope you’re asking why we need the global GTT at all. There are a few limited cases where the hardware is incapable (or it is undesirable) of using a per process address space.

A brief description of why, with all the current callers of the global pin interface.

  • Display: Display actually implements it’s own version of the GGTT. Maintaining the logic to support multiple level page tables was both costly, and unnecessary. Anything relating to a buffer being scanned out to the display must always be mapped into the GGTT. Ie xpect this to be true, forever.
    • i915_gem_object_pin_to_display_plane(): page flipping
    • intel_setup_overlay(): overlays
  • Ringbuffer: Keep in mind that the aliasing PPGTT is a special case of PPGTT. The ringbuffer must remain address space and context agnostic. It doesn’t make any sense to connect it to the PPGTT, and therefore the logic does not support it. The ringbuffer provides direct communication to the hardware’s execution logic – which would be a nightmare to synchronize if we forget about the security nightmare. If you go off and think about how you would have a ringbuffer mapped by multiple address spaces, you will end up with something like execlists.
    • allocate_ring_buffer()
  • HW Contexts: Extremely similar to ringbuffer.
    • intel_alloc_context_page(): Ironlake RC6
    • i915_gem_create_context(): Create the default HW context
    • i915_gem_context_reset(): Re-pin the default HW context
    • do_switch(): Pin the logical context we’re switching to
  • Hardware status page: The use of this, prior to execlists, is much like rinbuffers, and contexts. There is a per process status page with execlists.
    • init_status_page()
  • Workarounds:
    • init_pipe_control(): Initialize scratch space for workarounds.
    • intel_init_render_ring_buffer(): An i830 w/a I won’t bother to understand
    • render_state_alloc(): Full initialization of GPUs 3d state from within the kernel
  • Other
    • i915_gem_gtt_pwrite_fast(): Handle pwrites through the aperture. More info here.
    • i915_gem_fault(): Map an object into the aperture for gtt_mmap. More info here.
    • i915_gem_pin_ioctl(): The DRI1 pin interface.

GEN8 disambiguation

Off the top of my head, the list of some of the changes on GEN8 which will get more detail in a later post. These changes are all upstream from the original Broadwell integration.

  • PTE size increased to 8b
    • Therefore, 512 entries per table
    • Format mimics the CPU PTEs
  • PDEs increased to 8b (remains 512 PDEs per PD)
    • Page Directories live in system memory
      • GGTT no longer holds the PDEs.
    • There are 4 PDPs, and therefore 4 PDs
    • PDEs are cached in LLC instead of special cache (I’m guessing)
  • New HW PDP (Page Directory Pointer) registers point to the PDs, for legacy 32b addressing.
    • PP_DIR_BASE, and PP_DCLV are removed
  • Support for 4 level page tables, up to 48b virtual address space.
    • PML4[PML4E]->PDP
    • PDP[PDPE] -> PD
    • PD[PDE] -> PT
    • PT{PTE] -> Memory
  • Big pages are now 64k instead of 32k (still not implemented)
  • New caching interface via PAT like structure


There’s actually an interesting thing that you start to notice after reading Distinctions from the GGTT. Just about every thing mapped into the GGTT shouldn’t be mapped into the PPGTT. We already stated that we try to map everything else into the PPGTT. The set of objects mapped in the GGTT, and the set of objects mapped into the PPGTT are disjoint5). The patches to make this work are not yet merged. I’d put an image here to demonstrate, but I am feeling lazy and I really want to get this post out today.


  • The Aliasing PPGTT is a single instance of the hardware feature: PPGTT.
  • Aliasing PPGTT was designed as a drop in performance replacement to the GGTT.
  • GEN8 changed a lot of architectural stuff.
  • The Aliasing PPGTT shouldn’t actually alias the GGTT because the objects they map are a disjoint set.

Like last time, links to all the SVGs I’ve created. Use them as you like.

Download PDF

  1. Actually it will use whatever the current PPGTT is, but for this post, that is always the Aliasing PPGTT 

  2. Page walk, Two-Level Per-Process Virtual Memory 

  3. Big pages have the same goal as they do on the CPU – to reduce TLB pressure. To date, there has been no implementation of big pages for GEN (though a while ago I started putting something together). There has been some anecdotal evidence that there isn’t a big win to be had for many workloads we care about, and so this remains a low priority. 

  4. This register thus allows us to limit, or make a sparse address space for the PPGTT. This mechanism is not used, even in the full PPGTT patches 

  5. There actually is a case on GEN6 which requires both. Currently this need is implemented by drivers/gpu/drm/i915/i915_gem_execbuffer.c: i915_gem_execbuffer_relocate_entry( 

For those of you reading this that didn’t know, I’ve had two months of paid vacation – one of the real perks of working for Intel. Today is the last day. It is as hard as I thought it would be.

Most of the vacation was spent vacationing. As I have access to none of the pictures at the moment, and I don’t want to make you jealous, I’ll skip over that. Toward the end though, I ended up at a coffee shop waiting for someone with nothing to do. I spent a little bit of time working on HOBos, and I thought it could be interesting to write about it.

WARNING: There is nothing novel here.

A brief history of HOBos

The HOBby operating system is an operating system project I started a while ago. I am a bit embarrassed that I started an OS. In my opinion, it’s one of lamer tasks to take on because

  1. everyone seems to do it;
  2. there really isn’t a need, there are many operating systems with permissive licenses already; and
  3. sites like OSDev have made much of the work trivial (I like to think that when I started there wasn’t quite as much info readily available, but that’s a lie).
Larrabee Av in Portland (not what the project was named after)

Larrabee Av in Portland
(not what the project was named after)

HOBos began while I was working on the Larrabee project. The team spent a lot of time optimizing the memory management, and the scheduler for the embedded software. I really wanted to work on these things full time. Unfortunately for me, having had a background in device drivers, I was often required to do other things. As a means to scratch the itch, I started HOBos after not finding anything suitable for my needs. The stuff I found were all either too advanced, or too rudimentary. When I was hired to work on i915, I decided that it was better use of my free time. Since then, I’ve been tweaking things here or there, and I do try to make sure things still run with the latest QEMU and compilers at least once a year. The last actual feature I added was more than 1300 days ago:

commit 1c9b5c78b22b97246989b00e807c9bf1fbc9e517
	Author: Ben Widawsky 
	Date: Sat Mar 19 21:19:57 2011 -0700

	basic backtrace

So back to the coffee shop. I tried to do something or other, got a hang, and didn’t want to fire up GDB.


HOBos had implemented backtraces since the original import from SVN (let’s pretend that means, since always). Obtaining a backtrace is actually pretty straightforward on x86.

The stack frame

The stack frame can be thought of as memory contents that are locally scoped to a function. Declaring a local variable will end up in the stack frame. A global variable will not. As functions call other functions, you end up with multiple frames. A stack is used because the last frames added are the first ones removed (this is not always true for things like exceptions, but nevermind that for now). The fact that a stack decrements is arbitrarily chosen, as far as I can tell. The following shows the stack when the function foo() calls the function bar().

Example Stackframe

Example Stackframe

void bt_fp(void *fp)
	do {
		uint64_t prev_rbp = *((uint64_t *)fp);
		uint64_t prev_ip = *((uint64_t *)(fp + sizeof(prev_rbp)));
		struct sym_offset sym_offset = get_symbol((void *)prev_ip);
		printf("\t%s (+0x%x)\n";,, sym_offset.offset);
		fp = (void *)prev_rbp;
		/* Stop if rbp is not in the kernel
		 * TODO: need an upper bound too*/
		if (fp <= (void *)KVADDR(DMAP_PML,0,0,0))
	} while(1);

The memory contents shown above are created as a result of two things. First is what the CPU implicitly does upon execution of the call instruction. The second is what the compiler generates. Since we’re talking about x86 here, the call instruction always pushes at least the return address. The second I’ll detail a bit more in a bit. Correlating this to the picture, the green (foo) and blue (bar) are creations of the compiler. The brown is automatically pushed on to the stack by the hardware, and is automatically popped from the stack on the ret instruction.

In the above there are two registers worth noting, RBP, and RSP. RBP which is the 64b extension to the original 8086 BP, Base Pointer, register is the beginning of the stack frame ie the Frame Pointer. RSP, the extension to the 8086 SP, Stack Pointer, points to the end of the stack frame. By convention the Base Pointer doesn’t change throughout a function being executed and therefore it is often used as the reference to local variables stored on the stack. -100(%rbp) as an example above.

Digging further into that disassembly above, one notices a pattern. Every function begins with:

push   %rbp       // Push the old RBP, RSP now points to this
mov    %rsp,%rbp  // Store RSP in RBP

Assuming this is the convention, it implies that at any given point during the execution of a function we can obtain the previous RBP by reading the current RBP and doing some processing. Specifically, reading RBP gives us the old Stack Pointer, which is pointing to the last RBP. As mentioned above, the x86 CPU pushed the return address immediately before the push %rbp – which means as we work backwards through the Base Pointers, we can also obtain the caller for the current stack frame. People have done really nice pictures on this – use your favorite search engine.

Here is the HOBos code (ignore the part about symbols for now):

void bt_fp(void *fp)
	do {
		uint64_t prev_rbp = *((uint64_t *)fp);
		uint64_t prev_ip = *((uint64_t *)(fp + sizeof(prev_rbp)));
		struct sym_offset sym_offset = get_symbol((void *)prev_ip);
		printf("\t%s (+0x%x)\n",, sym_offset.offset);
		fp = (void *)prev_rbp;
		/* Stop if rbp is not in the kernel
		 * TODO: need an upper bound too*/
		if (fp <= (void *)KVADDR(DMAP_PML,0,0,0))
	} while(1);

As far as I know, all modern CPUs work in a similar fashion with differences sprinkled here and there. ARM for example has an LR register for the return address instead of using the stack.

ABI/Calling Conventions

The fact that we can work backwards this way is a byproduct of the calling convention. One example of an aspect of the calling convention is where to put the arguments to a function. Do they go on the stack, in registers, or somewhere else? In addition to these arguments, the way in which RBP, and RSP are used are strictly software constructs that are part of the convention. As a result, it might not always be possible to get a backtrace if:

  1. This convention is not adhered to (or -fomit-frame-pointer)
  2. The contents of RBP are destroyed
  3. The contents of the stack are corrupted.

How arguments are passed to function are also needed to make sure linkers and loaders (both static and dynamic) can operate to correctly form an executable, or dynamically call a function. Since this isn’t really important to obtaining a backtrace, I will leave it there. Some architectures do provide a way to obtain useful backtrace information without caring about the calling convention: Intel’s Processor Trace for example.

Symbol information

The previous section will get us a reverse list of addresses for all function calls working backward from a given point during execution. But having names makes things much more easier to quickly diagnose what is going wrong. There is a lot of information on the internet about this stuff. I’m simply providing all that’s relevant to my specific problem.

ELF Symbols (linking)

The ELF format provides everything we need (assuming things aren’t stripped). Glossing over the details (see this simple tool if you’re curious), we end up with two “sections” that tell us everything we need. They are conventionally named, “.symtab”, and “.strtab” and are conveniently of type, SHT_SYMTAB, and SHT_STRTAB. The symbol table defines the information about each symbol (functions, variables, whatever). Part of the information is a name, which is an index into the string table. In this simplest case, these are provisions for inter-object linking. If I had defined foo() in foo.c, and bar() in bar.c, the compiled object files can be linked together, but the linker needs the information about the symbol bar (in this case) in order to do its job.

readelf -S a.out

[Nr] Name Type Address Offset
[33] .symtab SYMTAB 0000000000000000 000015b8
[34] .strtab STRTAB 0000000000000000 00001c90

> readelf -S a.out | egrep "\.strtab|\.symtab" | wc -l
> strip a.out
> readelf -S a.out | egrep "\.strtab|\.symtab" | wc -l

Summing that up, if we have an entire ELF file, and the symbol and string tables haven’t been stripped, we’re good to go. However, ELF sections are not the unit in which an ELF loader decides what to load. The loader loads segments which are of type PT_LOAD. A segment is made up of 0 or more sections, plus padding. Since the Operating System is itself an ELF loaded by an ELF loader (the bootloader) we’re not in a good position. :(

> readelf -l a.out | egrep "\.strtab|\.symtab" | wc -l

ELF Loader

ELF Loader

Debug Info

Note that what we want is not the same thing as debug information. If one wants to do source level debug, there needs to be some way of correlating a machine instruction to a line of source code. This is also a software abstraction, and there is usually a spec for it unless you are using some proprietary thing. It would technically be possible to include DWARF capabilities within the kernel, but I do not know of a way to get that info to the OS (see multiboot stuff for details).

From boot to symbols

The HOBos project implements a small Multiboot compliant bootloader called smallboot. When the machine starts up, boot firmware is loaded from a fixed location (this is currently done by SeaBIOS). The boot firmware then loads the smallboot bootloader. The bootloader will load the kernel (smallboot, and most modern bootloaders will do this through a text configuration file on the resident filesystem). In the HOBos case, the kernel is simply an ELF file. smallboot implements a basic ELF loader to load the kernel into memory and give execution over.

The multiboot specification is a standardized communication mechanism (for various things)  from the bootloader to the Operating System (or any file really). One of these things is symbol information. Quoting the multiboot spec

If bit 5 in the ‘flags’ word is set, then the following fields in the Multiboot information structure starting at byte 28 are valid:

     28      | num               |
     32      | size              |
     36      | addr              |
     40      | shndx             |

These indicate where the section header table from an ELF kernel is, the size of each entry, number of entries, and the string table used as the index of names. They correspond to the ‘shdr_*’ entries (‘shdr_num’, etc.) in the Executable and Linkable Format (elf) specification in the program header. All sections are loaded, and the physical address fields of the elf section header then refer to where the sections are in memory (refer to the i386 elf documentation for details as to how to read the section header(s)). Note that ‘shdr_num’ may be 0, indicating no symbols, even if bit 5 in the ‘flags’ word is set.

Since the beginning I had implemented these fields in the bootloader:

multiboot_info.flags |= MULTIBOOT_INFO_ELF_SHDR;
multiboot_info.u.elf_sec = *table;

Because of the fact that the symbols weren’t in the ELF segments though, I was stumped as to how to get at the data one the OS is loaded. As it turned out, I didn’t actually read all 4 sentences and  I had missed one very important part.

All sections are loaded, and the physical address fields of the elf section header then refer to where the sections are in memory

What the spec is dictating is that even though the sections are not in loadable segments, they shall exist within memory during the handover to the OS, and the section header information will be updated so that the OS knows where to find it. With this, the OS can copy out, or just make sure not to overwrite the info, and then get access to it.

for (i = 0; i < shnum; i++) {
    __ElfN(Shdr) *sh = &shdr[i];
    if (sh->sh_size == 0)

    if (sh->sh_addr) /* Already loaded */

    ASSERT(sizeof(void *) == 4);
    *((volatile __ElfN(Addr) *)&sh->sh_addr) = sh->sh_offset + (uint32_t)addr;

et cetera

The code for pulling out the symbols is quite a bit longer, but it can be found in kern/core/syms.c. With the given RBP unwinder near the top, we can easily get the IP for the caller. With that IP, we do a symbol lookup from the symbols we got via the multiboot info.

Screenshot with backtrace

Screenshot with backtrace

Inkscape Links

Download PDF
August 27, 2015

LAST REMINDER! systemd.conf 2015 Call for Presentations ends August 31st!

Here's the last reminder that the systemd.conf 2015 CfP ends on August 31st 11:59:59pm Central European Time (that's monday next week)! Make sure to submit your proposals until then!

Please submit your proposals on our website!

And don't forget to register for the conference! Only a limited number of registrations are available due to space constraints! Register here!.

For further details about systemd.conf consult the conference website.

August 26, 2015

click here to jump to the instructions

Mice have an optical sensor that tells them how far they moved in "mickeys". Depending on the sensor, a mickey is anywhere between 1/100 to 1/8200 of an inch or less. The current "standard" resolution is 1000 DPI, but older mice will have 800 DPI, 400 DPI etc. Resolutions above 1200 DPI are generally reserved for gaming mice with (usually) switchable resolution and it's an arms race between manufacturers in who can advertise higher numbers.

HW manufacturers are cheap bastards so of course the mice don't advertise the sensor resolution. Which means that for the purpose of pointer acceleration there is no physical reference. That delta of 10 could be a millimeter of mouse movement or a nanometer, you just can't know. And if pointer acceleration works on input without reference, it becomes useless and unpredictable. That is partially intended, HW manufacturers advertise that a lower resolution will provide more precision while sniping and a higher resolution means faster turns while running around doing rocket jumps. I personally don't think that there's much difference between 5000 and 8000 DPI anymore, the mouse is so sensitive that if you sneeze your pointer ends up next to Philae. But then again, who am I to argue with marketing types.

For us, useless and unpredictable is bad, especially in the use-case of everyday desktops. To work around that, libinput 0.7 now incorporates the physical resolution into pointer acceleration. And to do that we need a database, which will be provided by udev as of systemd 218 (unreleased at the time of writing). This database incorporates the various devices and their physical resolution, together with their sampling rate. udev sets the resolution as the MOUSE_DPI property that we can read in libinput and use as reference point in the pointer accel code. In the simplest case, the entry lists a single resolution with a single frequency (e.g. "MOUSE_DPI=1000@125"), for switchable gaming mice it lists a list of resolutions with frequencies and marks the default with an asterisk ("MOUSE_DPI=400@50 800@50 *1000@125 1200@125"). And you can and should help us populate the database so it gets useful really quickly.

How to add your device to the database

We use udev's hwdb for the database list. The upstream file is in /usr/lib/udev/hwdb.d/70-mouse.hwdb, the ruleset to trigger a match is in /usr/lib/udev/rules.d/70-mouse.rules. The easiest way to add a match is with the libevdev mouse-dpi-tool (version 1.3.2). Run it and follow the instructions. The output looks like this:

$ sudo ./tools/mouse-dpi-tool /dev/input/event8
Mouse Lenovo Optical USB Mouse on /dev/input/event8
Move the device along the x-axis.
Pause 3 seconds before movement to reset, Ctrl+C to exit.
Covered distance in device units: 264 at frequency 125.0Hz | |^C
Estimated sampling frequency: 125Hz
To calculate resolution, measure physical distance covered
and look up the matching resolution in the table below
16mm 0.66in 400dpi
11mm 0.44in 600dpi
8mm 0.33in 800dpi
6mm 0.26in 1000dpi
5mm 0.22in 1200dpi
4mm 0.19in 1400dpi
4mm 0.17in 1600dpi
3mm 0.15in 1800dpi
3mm 0.13in 2000dpi
3mm 0.12in 2200dpi
2mm 0.11in 2400dpi

Entry for hwdb match (replace XXX with the resolution in DPI):
mouse:usb:v17efp6019:name:Lenovo Optical USB Mouse:
Take those last two lines, add them to a local new file /etc/udev/hwdb.d/71-mouse.hwdb. Rebuild the hwdb, trigger it, and done:

$ sudo udevadm hwdb --update
$ sudo udevadm trigger /dev/input/event8
Leave out the device path if you're not on systemd 218 yet. Check if the property is set:

$ udevadm info /dev/input/event8 | grep MOUSE_DPI
E: MOUSE_DPI=1000@125
And that shows everything worked. Restart X/Wayland/whatever uses libinput and you're good to go. If it works, double-check the upstream instructions, then file a bug against systemd with those two lines and assign it to me.

Trackballs are a bit hard to measure like this, my suggestion is to check the manufacturer's website first for any resolution data.

Update 2014/12/06: trackball comment added, udevadm trigger comment for pre 218
Update 2015/08/26: udpated link to systemd bugzilla (now on github)

August 24, 2015

First Round of systemd.conf 2015 Sponsors

We are happy to announce the first round of systemd.conf 2015 sponsors!

Our first Silver sponsor is CoreOS!

CoreOS develops software for modern infrastructure that delivers a consistent operating environment for distributed applications. CoreOS's commercial offering, Tectonic, is an enterprise-ready platform that combines Kubernetes and the CoreOS stack to run Linux containers. In addition CoreOS is the creator and maintainer of open source projects such as CoreOS Linux, etcd, fleet, flannel and rkt. The strategies and architectures that influence CoreOS allow companies like Google, Facebook and Twitter to run their services at scale with high resilience. Learn more about CoreOS here, Tectonic here, or follow CoreOS on Twitter @coreoslinux.

A Bronze sponsor is Codethink:

Codethink is a software services consultancy, focusing on engineering reliable systems for long-term deployment with open source technologies.

A Bronze sponsor is Pantheon:

Pantheon is a platform for professional website development, testing, and deployment. Supporting Drupal and WordPress, Pantheon runs over 100,000 websites for the world's top brands, universities, and media organizations on top of over a million containers.

A Bronze sponsor is Pengutronix:

Pengutronix provides consulting, training and development services for Embedded Linux to customers from the industry. The Kernel Team ports Linux to customer hardware and has more than 3100 patches in the official mainline kernel. In addition to lowlevel ports, the Pengutronix Application Team is responsible for board support packages based on PTXdist or Yocto and deals with system integration (this is where systemd plays an important role). The Graphics Team works on accelerated multimedia tasks, based on the Linux kernel, GStreamer, Qt and web technologies.

We'd like to thank our sponsors for their support! Without sponsors our conference would not be possible!

We'll shortly announce our second round of sponsors, please stay tuned!

If you'd like to join the ranks of systemd.conf 2015 sponsors, please have a look at our Becoming a Sponsor page!

Reminder! The systemd.conf 2015 Call for Presentations ends on monday, August 31st! Please make sure to submit your proposals on the CfP page until then!

Also, don't forget to register for the conference! Only a limited number of registrations are available due to space constraints! Register here!.

For further details about systemd.conf consult the conference website.

August 19, 2015

So I realized I hadn’t posted a Wayland update in a while. So we are still making good progress on Wayland, but the old saying that the last 10% is 90% of the work is definitely true here. So there was a Wayland BOF at GUADEC this year which tried to create a TODO list for major items remaining before Wayland is ready to replace X.

  • Proper menu positioning. Most visible user issue currently for people testing Wayland
  • Relative pointer/confinement/locking. Important for games among other things.
  • Kinetic scroll in GTK+
  • More work needed to remove all X11 dependencies (so that desktop has no dependency on XWayland being available.
  • Minimize main thread stalls (could be texture uploads for example)
  • Tablet support. Includes gestures, on-screen keyboard and more.

A big thank you to Jonas Ådahl, Owen Taylor, Carlos Garnacho, Rui Matos, Marek Chalupa, Olivier Fourdan and more for their ongoing work on polishing
up the Wayland experience.

So as you can tell there is a still lot of details that needs working out when doing something as major as switching from one Display system to the other, but things are starting to looking really good now.

One new feature I am particularly excited about is what we call multi-DPI support ready for Wayland sessions in Fedora 23. What this means is that if you have a HiDPI laptop screen and a standard DPI external monitor you should be able to drag windows back and forth between the two screens and have them automatically rescale to work with the different DPIs. This is an example of an issue which was relatively straightforward to resolve under Wayland, but which would have been a lot of pain to get working under X.

We will not be defaulting to Wayland in Fedora Workstation 23 though, because as I have said in earlier blog posts about the subject, we will need to have a stable and feature complete Wayland in at least one release before we switch the default. We hope Fedora Workstation 23 will be that stable and feature complete release, which means Fedora Workstation 24 is the one where we can hope to make the default switchover.

Of course porting the desktop itself to Wayland is just part of the story. While we support running X applications using XWayland, to get full benefit from Wayland we need our applications to run on top of Wayland natively. So we spent effort on getting some major applications like LibreOffice and Firefox Wayland ready recently.

Caolan McNamara has been working hard on finishing up the GTK3 port of LibreOffice which is part of the bigger effort to bring LibreOffice nativly to Wayland. The GTK3 version of LibreOffice should be available in Fedora Workstation 23 (as non-default option) and all the necessary code will be included in LibreOffice 5 which will be released pretty soon. The GTK3 version should be default in F24, hopefully with complete Wayland support.

For Firefox Martin Stransky has been looking into ensuring Firefox runs on Wayland now that the basic GTK3 port is done. Martin just reported today that he got Firefox running natively under Wayland, although there are still a lot of glitches and issues he needs to figure out before it can be claimed to be ready for normal use.

Another major piece we are working on which is not directly Wayland related, but which has a Wayland component too is to try to move the state of Linux forward in the context of dealing with multiple OpenGL implementations, multi-GPU systems and the need to free our 3D stack from its close ties to GLX.

This work with is lead by Adam Jackson, but where also Dave Airlie is a major contributor, involves trying to decide and implement what is needed to have things like GL Dispatch, EGLstreams and EGL Device proposals used across the stack. Once this work is done the challenges around for instance using the NVidia binary driver on a linux system or using a discreet GPU like on Optimus laptops should be a thing of the past.

So the first step of this work is getting GL Dispatch implemented. GL Dispatch basically allows you to have multiple OpenGL implementations installed and then have your system pick the right one as needed. So for instance on a system with NVidia Optimus you can use Mesa with the integrated Intel graphics card, but NVidias binary OpenGL implementatioin with the discreet Optimus GPU. Currently that is a pain to do since you can only have one OpenGL implementation used. Bumblebee tries to hack around that requirement, but GL Dispatch will allow us to resolve this without having to ‘fight’ the assumptions of the system.

We plan to have easy to use support for both Optimus and Prime (the Nouveau equivalent of Optimus) in the desktop, allowing you to choose what GPU to use for your applications without needing to edit any text files or set environment variables.

The final step then is getting the EGL Device and EGLStreams proposal implemented so we can have something to run Wayland on top of. And while GL Dispatch are not directly related to those we do feel that having it in place should make the setup easier to handle as you don’t risk conflicts between the binary NVidia driver and the Mesa driver anymore at that point, which becomes even more crucial for Wayland since it runs on top of EGL.

August 18, 2015

REMINDER! systemd.conf 2015 Call for Presentations ends August 31st!

We'd like to remind you that the systemd.conf 2015 Call for Presentations ends on August 31st! Please submit your presentation proposals before that data on our website.

We are specifically interested in submissions from projects and vendors building today's and tomorrow's products, services and devices with systemd. We'd like to learn about the problems you encounter and the benefits you see! Hence, if you work for a company using systemd, please submit a presentation!

We are also specifically interested in submissions from downstream distribution maintainers of systemd! If you develop or maintain systemd packages in a distribution, please submit a presentation reporting about the state, future and the problems of systemd packaging so that we can improve downstream collaboration!

And of course, all talks regarding systemd usage in containers, in the cloud, on servers, on the desktop, in mobile and in embedded are highly welcome! Talks about systemd networking and kdbus IPC are very welcome too!

Please submit your presentations until August 31st!

And don't forget to register for the conference! Only a limited number of registrations are available due to space constraints! Register here!.

Also, limited travel and entry fee sponsorship is available for community contributors. Please contact us for details!

For further details about the CfP consult the CfP page.

For further details about systemd.conf consult the conference website.

REMINDER! systemd.conf 2015 Call for Papers ends August 31st!

We'd like to remind you that the systemd.conf 2015 Call for Presentations ends on August 31st! Please submit your presentation proposals before that data on our website.

We are specifically interested in submissions from projects and vendors building today's and tomorrow's products, services and devices with systemd. We'd like to learn about the problems you encounter and the benefits you see! Hence, if you work for a company using systemd, please submit a presentation!

We are also specifically interested in submissions from downstream distribution maintainers of systemd! If you develop or maintain systemd packages in a distribution, please submit a presentation reporting about the state, future and the problems of systemd packaging so that we can improve downstream collaboration!

And of course, all talks regarding systemd usage in containers, in the cloud, on servers, on the desktop, in mobile and in embedded are highly welcome! Talks about systemd networking and kdbus IPC are very welcome too!

Please submit your presentations until August 31st!

And don't forget to register for the conference! Only a limited number of registrations are available due to space constraints! Register here!.

Also, limited travel and entry fee sponsorship is available for community contributors. Please contact us for details!

For further details abou the CfP consult the CfP page.

For further details about systemd.conf consult the conference website.

First Round of systemd.conf 2015 Sponsors

We are happy to announce the first round of systemd.conf 2015 sponsors!

Our first Silver sponsor is CoreOS!

CoreOS develops software for modern infrastructure that delivers a consistent operating environment for distributed applications. CoreOS's commercial offering, Tectonic, is an enterprise-ready platform that combines Kubernetes and the CoreOS stack to run Linux containers. In addition CoreOS is the creator and maintainer of open source projects such as CoreOS Linux, etcd, fleet, flannel and rkt. The strategies and architectures that influence CoreOS allow companies like Google, Facebook and Twitter to run their services at scale with high resilience. Learn more about CoreOS here, Tectonic here, or follow CoreOS on Twitter @coreoslinux.

A Bronze sponsor is Codethink:

Codethink is a software services consultancy, focusing on engineering reliable systems for long-term deployment with open source technologies.

A Bronze sponsor is Pantheon:

Pantheon is a platform for professional website development, testing, and deployment. Supporting Drupal and WordPress, Pantheon runs over 100,000 websites for the world's top brands, universities, and media organizations on top of over a million containers.

A Bronze sponsor is Pengutronix:

Pengutronix provides consulting, training and development services for Embedded Linux to customers from the industry. The Kernel Team ports Linux to customer hardware and has more than 3100 patches in the official mainline kernel. In addition to lowlevel ports, the Pengutronix Application Team is responsible for board support packages based on PTXdist or Yocto and deals with system integration (this is where systemd plays an important role). The Graphics Team works on accelerated multimedia tasks, based on the Linux kernel, GStreamer, Qt and web technologies.

We'd like to thank our sponsors for their support! Without sponsors our conference would not be possible!

We'll shortly announce our second round of sponsors, please stay tuned!

If you'd like to join the ranks of systemd.conf 2015 sponsors, please have a look at our Becoming a Sponsor page!

Reminder! The systemd.conf 2015 Call for Presentations ends on monday, August 31st! Please make sure to submit your proposals on the CfP page until then!

Also, don't forget to register for the conference! Only a limited number of registrations are available due to space constraints! Register here!.

For further details about systemd.conf consult the conference website.

August 17, 2015

A couple of weeks I visited my mother back home in Norway. She had gotten a new laptop some time ago that my brother-in-law had set up for her. As usual when I come for a visit I was asked to look at some technical issues my mother was experiencing with her computer. Anyway, one thing I discovered while looking at these issues was that my brother-in-law had installed OpenOffice on her computer. So knowing that the OpenOffice project is all but dead upstream since IBM pulled their developers of the project almost a year ago and has significantly fallen behind feature wise, I of course installed LibreOffice on the system instead, knowing it has a strong and vibrant community standing behind it and is going from strength to strength.

And this is why I am writing this open letter. Because while a lot of us who comes from technical backgrounds have already caught on to the need to migrate from OpenOffice to LibreOffice, there are still many non-technical users out there who are still defaulting to installing OpenOffice when they are looking for an open source Office Suite, because that is the one they came across 5 years ago. And I believe that the Apache Foundation, being an organization dedicated to open source software, care about the general quality and perception of open source software and thus would share my interest in making sure that all users of open source software gets the best experience possible, even if the project in question isn’t using their code license of preference.

So I realize that the Apache Foundation took a lot of pride in and has invested a lot of effort trying to create an Apache Licensed Office suite based on the old OpenOffice codebase, but I hope that now that it is clear that this effort has failed that you would be willing to re-direct people who go to the website to the LibreOffice website instead. Letting users believe that OpenOffice is still alive and evolving is only damaging the general reputation of open source Office software among non-technical users and thus I truly believe that it would be in everyones interest to help the remaining OpenOffice users over to LibreOffice.

And to be absolutely clear I am only suggesting this due to the stagnant state of the OpenOffice project, if OpenOffice had managed to build a large active community beyond the resources IBM used to provide then it would have been a very different story, but since that did not happen I don’t see any value to anyone involved to just let users keep downloading an aging releases of a stagnant codebase until the point where bit rot chases them away or they hear about LibreOffice true mainstream media or friends. And as we all know it is not about just needing a developer or two to volunteer here, maintaining and developing something as large as OpenOffice is a huge undertaking and needs a very sizeable and dedicated community to be able to succeed.

So dear Apache developers, for the sake of open source and free software, please recommend people to go and download LibreOffice, the free office suite that is being actively maintained and developed and which has the best chance of giving them a great experience using free software. OpenOffice is an important part of open source history, but that is also what it is at this point in time.

August 16, 2015
After a few years of development the atomic display update IOCTL for drm drivers is finally ready for prime time with the 4.2 pull request from Dave Airlie. It's been a long road, with a lot of drivers already converted over to atomic and even more in progress, the atomic helper libraries and support code in the drm subsystem sufficiently polished. But what's really missing is a design overview of what the overall atomic infrastructure looks like and why some decisions and details are implemented like they are.

That's now done and published on LWN: Part 1 talks about the problem space, issues with the Android atomic display framework and the basic atomic IOCTL interface. Part 2 goes into more detail about a few specific things like locking, helper library design and the exact semantics of atomic modessetting updates. Happy Reading!
August 15, 2015
So the big news for the upcoming mesa 11.0 release is gl4.x support for radeon and nouveau.  Which has been in the works for a long time, and a pretty tremendous milestone (and the reason that the next mesa release is 11.0 rather than 10.7).  But on the freedreno side of things, we haven't been sitting still either.  In fact, with the transform-feedback support I landed a couple weeks ago (for a3xx+a4xx), plus MRT+z32s8 support for a4xx (Ilia landed the a3xx parts of those a while back), we now support OpenGLES 3.0[1] on both adreno 3xx and 4xx!!

In addition, with the TBO support that landed a few days ago, plus handful of other fixes in the last few days, we have the new antarctica gl3.1 render engine for supertuxkart working!

Note that you need to use MESA_GL_VERSION_OVERRIDE=3.1 and MESA_GLSL_VERSION_OVERRIDE=140, since while we support everything that stk needs, we don't yet support everything needed to advertise gl3.1.  (But hey, according to qualcomm, adreno 3xx doesn't even support higher than gles3.0.. I guess we'll have to show them ;-))

The nice thing to see about this working, is that it is utilizing pretty much all of the recent freedreno features (transform feedback, MRT, UBO's, TBO's, etc).

Of course, the new render engine is considerably more heavyweight compared to older versions of stk.  But I think there is some low hanging fruit on the stk engine side of things to reclaim some of those lost fps.

update: oh, and the first time around, I completely forgot to mention that qualcomm has recently published *some* gpu docs, for a3xx, for the dragonboard 410c. Not quite as extensive as what broadcom has published for vc4, but it gives us all the a3xx registers, which is quite a lot more than any other SoC vendor has done to date :-)

[1] minus MSAA.. There is a bigger task, which is on the TODO list, to teach mesa st about some extensions to support MSAA resolve on tile->mem.. such as EXT_multisampled_render_to_texture, plus perhaps driconf option to enable it for apps that are not aware, which would make MSAA much more useful on a tiling gpu.  Until then, mesa doesn't check MSAA for gles3, and if it did we could advertise PIPE_CAP_FAKE_SW_MSAA.  Plus, who really cares about MSAA on a 5" 4k screen?

August 13, 2015

I've started to use Pocket a few months ago to store my backlog of things to read. It's especially useful as I can use it to read content offline since we still don't have any Internet access in places such as airplanes or the Paris metro. It's only 2015 after all.

I am also a subscriber for years now, and I really like their articles from the weekly edition. Unfortunately, as the access is restricted to subscribers, you need to login: it makes it impossible to add these articles to Pocket directly. Sad.

Yesterday, I thought about that and decided to start hacking on it. LWN provides a feature called "Subscriber Link" that allows you to share an article with a friend. I managed to use that feature to share the articles with my friend… Pocket!

As doing that every week is tedious, I wrote a small Python program called lwn2pocket that I published on GitHub. Feel free to use it, hack it and send pull requests.

August 11, 2015

One of the things we discussed at this year’s Akademy conference is making AppStream work on Kubuntu.appstream-logo

On Debian-based systems, we use a YAML-based implementation of AppStream, called “DEP-11″. DEP-11 exists for historical reasons (the DEP-11 YAML format was a superset of AppStream once) and because YAML, unlike XML, is one accepted file format by the Debian FTPMasters team, who we want to have on board when adding support for AppStream.

So I’ve spent the last few days on setting up the DEP-11 generator for Kubuntu, as well as improving it greatly to produce more meaningful error messages and to generate better output. It became a bit slower in the process, but the greatly improved diagnostics data is worth it.

For example, maintainers of Debian packages will now get a more verbose explanation of issues found with metadata in their packages, making them easier to fix for people who didn’t get in contact with AppStream yet.

At time, we generate AppStream metadata for Tanglu, Kubuntu and Debian, but so far only Tanglu makes real use of it by shipping it in a .deb package. Shipping the data as package is only a workaround though, for a proper implementation, the data will be downloaded by Apt. To achieve that, the data needs to reach the archive first, which is something that I can hopefully discuss and implement with the FTPMasters team of Debian at this year’s Debconf.

When this is done, the new metadata will automatically become available in tools like GNOME-Software or Muon Discover.

How can I see if there are issues with my package?

The dep11-generator tool will return HTML pages to show both the extracted metadata, as well as issues with it.

You can find the information for the respective distribution here:

Each issue tag will contain a detailed explanation of what went wrong. Errors generally lead to ignoring the metadata, so it will not be processed. Warnings usually concern things which might reduce the amount of metadata or make it less useful, while Info-type hints contain information on how to improve the metadata or make it more useful.

Can I use the data already?

Yes, you can. You just need to place the compressed YAML files in /var/cache/app-info/yaml and the icons in /var/cache/app-info/icons/<suite>-<component>/<size>, for example: /var/cache/app-info/icons/jessie-amd64/64×64

I think I found a bug in the generator…

In that case, please report the issue against the appstream-dep11 package at Debian, or file an issue at Github..

The only reason why I announce this feature now is to find remaining generator bugs, before officially announcing the feature on debian-devel-announce.

When will this be officially announced?

I want to give this feature a little bit more testing, and ideally have the integration into the archive ready, so people can see how the metadata looks like when rendered in GNOME-Software/Discover. I also want to write a bit more documentation to help Debian developers and upstreams to improve their metadata.

Ideally, I also want to incorporate some feedback at Debconf when announcing full AppStream support in Debian. So take all the stuff you’ve read above as a little sneak peek 😉

Debconf15goingI will also give a talk at Debconf, titled “AppStream, Limba, XdgApp – Where we are going.” The aim of this talk is to give an insight i tnto the new developments happening in the software distribution area, and what the goal of these different projects is (the talk should give an overview of what’s in the oven and how it will impact Debian). So if you are interested, please drop by :-) Maybe setting up a BOF would also be a good idea.

August 10, 2015

I am very late with this, but I still wanted to write a few words about Akademy 2015.

First of all: It was an awesome conference! Meeting all the great people involved with KDE and seeing who I am working with (as in: face to face, not via email) was awesome. We had some very interesting discussions on a wide variety of topics, and also enjoyed quite some beer together. Particularly important to me was of course talking to Daniel Vrátil who is working on XdgApp, and Aleix Pol of Muon Discover fame.

Also meeting with the other Kubuntu members was awesome – I haven’t seen some of them for about 3 years, and also met many cool people for the first time.

My talk on AppStream/Limba went well, except that I got slightly confused by the timer showing that I had only 2 minutes left, after I had just completed the first half of my talk. It turned out that the timer was wrong 😉

Another really nice aspect was to be able to get an insight into areas where I am usually not involved with, like visual design. It was really interesting to learn about the great work others are doing and to talk to people about their work – and I also managed to scratch an itch in the Systemsettings application, where three categories had shown the same icon. Now Systemsettings looks like it is supposed to be, finally :-)

The only thing I could maybe complain about was the weather, which was more Scotland/Wales like than Spanish – but that didn’t stop us at all, not even at the social event outside. So I actually don’t complain 😉

We also managed to discuss some new technical stuff, like AppStream for Kubuntu, and plenty of other things that I’ll write about in a separate blog post.

Generally, I got so many new impressions from this year’s Akademy, that I could write a 10-pages long blogpost about it while still having to leave out things.

Kudos to the organizers of this Akademy, you did a fantastic job! I also want to thank the Ubuntu community for funding my flight, and the Kubuntu community for pushing me a little to attend :-).

KDE is a great community with people driving the development of Free Software forward. The diversity of people, projects and ideas in KDE is a pleasure, and I am very happy to be part of this community.


August 08, 2015

Wanted to let everyone know that the GStreamer Conference 2015 is happening for the 6th time this year. So if you want to attend the premier open source multimedia conference you can do so in Dublin, Ireland between the 8th and 9th of October. If you want to attend I suggest registering as early as possible using the GStreamer Conference registration webpage. Like earlier years the GStreamer Conference is co-located with other great conferences like the Embedded Linux Conference Europe so you have the chance to combine the attendance into one trip.

The GStreamer Conference has always been a great opportunity to not only learn about the latest developments in GStreamer, but about whats happeing in the general Linux multimedia stack, latest news from the world of codec development and other related topics. I strongly recommend setting aside the 8th and the 9th of October for a trip to Dublin and the GStreamer Conference.

Also a heads up for those interested in doing a talk. The formal deadline for submitting a proposal is this Sunday the 9th of August, so you need to hurry to send in a proposal. You find the details for how to submit a talk on the GStreamer Conference 2015 website. While talks submitted before the 9th will be prioritized I do recommend anyone seeing this after the deadline to still send in a proposal as there might be a chance to get on the program anyway if you get your proposal in during next week.

August 04, 2015

It's been a while since I talked about Ceilometer and its companions, so I thought I'd go ahead and write a bit about what's going on this side of OpenStack. I'm not going to cover new features and fancy stuff today, but rather a shallow overview of the new project processes we initiated.

Ceilometer growing

Ceilometer has grown a lot since that time when we started it 3 years ago. It has evolved from a system designed to fetch and store measurements, to a more complex system, with agents, alarms, events, databases, APIs, etc.

All those features were needed and asked for by users and operators, but let's be honest, some of them should never have ended up in the Ceilometer code repository, especially not all at the same time.

The reality is we picked a pragmatic approach due to the rigidity of the OpenStack Technical Committee in regards to new projects to become OpenStack integrated – and, therefore, blessed – projects. Ceilometer was actually the first project to be incubated and then integrated. We had to go through the very first issues of that process.

Fortunately, now that time has passed, and all those constraints have been relaxed. To me, the OpenStack Foundation is turning into something that looks like the Apache Foundation, and there's, therefore, no need to tie technical solutions to political issues.

Indeed, the Big Tent now allows much more flexibility to all of that. Back a year ago, we were afraid to bring Gnocchi into Ceilometer. Was the Technical Committee going to review the project? Was the project going to be in the scope of Ceilometer for the Technical Committee? Now we don't have to ask ourselves those questions, now that we have that freedom, it empowers us to actually do what we think is good in term of technical design without worrying too much about political issues.

Ceilometer development activity

Acknowledging Gnocchi

The first step in this new process was to continue working on Gnocchi (a timeserie database and resource indexer designed to overcome historical Ceilometer storage issue) and to decide that it was not the right call to merge it into Ceilometer as some REST API v3, but that it was better to keep it standalone.

We managed to get traction to Gnocchi, getting a few contributors and users. We're even seeing talks proposed to the next Tokyo Summit where people leverage Gnocchi, such as "Service of predictive analytics on cost and performance in OpenStack", "Suveil" and "Cutting Edge NFV On OpenStack: Healing and Scaling Distributed Applications".

We are also doing some progress on pushing Gnocchi outside of the OpenStack community, as it can be a self-sufficient timeserie and resource database that can be used without any OpenStack interaction.

Branching Aodh

Rather than continuing to grow Ceilometer, during the last summit we all decided that it was time to reorganize and split Ceilometer into the different components it is made of, leveraging a more service-oriented architecture. The alarm subsystem of Ceilometer being mostly untied to the rest of Ceilometer, we decided it was the first and perfect candidate to do that. I personally engaged into doing the work and created a new repository with only the alarm code from Ceilometer, named Aodh.

Aodh is an Irish word meaning fire. A word picked so it also had some relation to Heat, and because we have some Irish influence around the project 😁.

This made sense for a lot of reason. First because Aodh can now work completely standalone, using either Ceilometer or Gnocchi as a backend – or any new plugin you'd write. I love the idea that OpenStack projects can work standalone – like Swift does for example – without implying any other OpenStack component. I think it's a proof of good design. Secondly, because it allows us to resonate on a smaller chunk of software – a reason really under-estimated today in OpenStack. I believe that the size of your software should match a certain ratio to the size of your team.

Aodh is, therefore, a new project under the OpenStack Telemetry program (or what remains of OpenStack programs now), alongside Ceilometer and Gnocchi, forked from the original Ceilometer alarm feature. We'll deprecate the latter with the Liberty release, and we'll remove it in the Mitaka release.

Lessons learned

Actually, moving that code out of Ceilometer (in the case of Aodh), or not merging it in (in the case of Gnocchi) had a few side effects that I admit I think we probably under-estimated back then.

Indeed, the code size of Gnocchi or Aodh ended up being much smaller than the entire Ceilometer project – Gnocchi is 7× smaller and Aodh 5x smaller than Ceilometer – and therefore much more easy to manipulate and to hack on. That allowed us to merge dozens of patches in a few weeks, cleaning-up and enhancing a lot of small things in the code. Those tasks are very much harder in Ceilometer, due to the bigger size of the code base and the small size of our team. By having our small team working on smaller chunks of changes – even when it meant actually doing more reviews – greatly improved our general velocity and the number of bugs fixed and features implemented.

On the more sociological side, I think it gave the team the sensation of finally owning the project. Ceilometer was huge, and it was impossible for people to know every side of it. Now, it's getting possible for people inside a team to cover a much larger portion of those smaller project, which gives them a greater sense of ownership and caring. Which ends up being good for the project quality overall.

That also means that we technically decided to have different core teams by project (Ceilometer, Gnocchi, and Aodh) as they all serve different purposes and can all be used standalone or with each others. Meaning we could have contributors completely ignoring other projects.

All of that reminds me some discussion I heard about projects such as Glance, trying to fit new features in - some that are really orthogonal to the original purpose. It's now clear to me that having different small components interacting together that can be completely owned and taken care of by a (small) team of contributors is the way to go. People that can therefore trust each others and easily bring new people in, makes a project really incredibly more powerful. Having a project covering a too wide set of features make things more difficult if you don't have enough manpower. This is clearly an issue that big projects inside OpenStack are facing now, such as Neutron or Nova.

July 28, 2015

Announcing systemd.conf 2015

We are happy to announce the inaugural systemd.conf 2015 conference of the systemd project.

The conference takes place November 5th-7th, 2015 in Berlin, Germany.

Only a limited number of tickets are available, hence make sure to sign up quickly.

For further details consult the conference website.

July 22, 2015

Below is an outline of the various types of touchpads that can be found in the wild. Touchpads aren't simply categorised into a single type, instead they have a set of properties, a combination of number of physical buttons, touch support and physical properties.

Number of buttons

Physically separate buttons

For years this was the default type of touchpads: a touchpad with a separate set of physical buttons below the touch surface. Such touchpads are still around, but most newer models are Clickpads now.

Touchpads with physical buttons usually provide two buttons, left and right. A few touchpads with three buttons exist, and Apple used to have touchpads with a single physical buttons back in the PPC days. Touchpads with only two buttons require the software stack to emulate a middle button. libinput does this when both buttons are pressed simultaneously.

A two-button touchpad, with a two-button pointing stick above.

Note that many Lenovo laptops provide a pointing stick above the touchpad. This pointing stick has a set of physical buttons just above the touchpad. While many users use those as substitute touchpad buttons, they logically belong to the pointing stick. The *40 and *50 series are an exception here, the former had no physical buttons on the touchpad and required the top section of the pad to emulate pointing stick buttons, the *50 series has physical buttons but they are wired to the touchpads. The kernel re-routes those buttons through the trackstick device.


Clickpads are the most common type of touchpads these days. A Clickpad has no separate physical buttons, instead the touchpad itself is clickable as a whole, i.e. a user presses down on the touch area and triggers a physical click. Clickpads thus only provide a single button, everything else needs to be software-emulated.

A clickpad on a Lenovo x220t. Just above the touchpad are the three buttons associated with the pointing stick. Faint markings on the bottom of the touchpad hint at where the software buttons should be.

Right and middle clicks are generated either via software buttons or "clickfinger" behaviour. Software buttons define an area on the touchpad that is a virtual right button. If a finger is in that area when the click happens, the left button event is changed to a right button event. A middle click is either a separate area or emulated when both the left and right virtual buttons are pressed simultaneously.

When the software stack uses the clickfinger method, the number of fingers decide the type of click: a one-finger is a left button, a two-finger click is a right button, a three-finger click is a middle button. The location of the fingers doesn't matter, though there are usually some limits in how the fingers can be distributed (e.g. some implementations try to detect a thumb at the bottom of the touchpad to avoid accidental two-finger clicks when the user intends a thumb click).

The libinput documentation has a section on Clickpad software button behaviour with more detailed illustrations

The touchpad on a T440s has no physical buttons for the pointing stick. The marks on the top of the touchpad hint at the software button position for the pointing stick. Note that there are no markings at the bottom of the touchpad anymore.

Clickpads are labelled by the kernel with the INPUT_PROP_BUTTONPAD input property.


One step further down the touchpad evolution, Forcepads are Clickpads without a physical button. They provide pressure and (at least in Apple's case) have a vibration element that is software-controlled. Instead of the satisfying click of a physical button, you instead get a buzz of happiness. Which apparently feels the same as a click, judging by the reviews I've read so far. A software-controlled click feel has some advantages, it can be disabled for some gestures, modified for others, etc. I suspect that over time Forcepads will become the main touchpad category, but that's a few years away.

Not much to say on the implementation here. The kernel has some ForcePad support but everything else is spotty.

Note how Apple's Clickpads have no markings whatsoever, Apple uses the clickfinger method by default.

Touch capabilities

Single-touch touchpads

In the beginning, there was the single-finger touchpad. This touchpad would simply provide x/y coordinates for a single finger and get mightily confused when more than one finger was present. These touchpads are now fighting with dodos for exhibition space in museums, few of those are still out in the wild.

Pure multi-touch touchpads

Pure multi-touch touchpads are those that can track, i.e. identify the location of all fingers on the touchpad. Apple's touchpads support 16 touches (iirc), others support 5 touches like the Synaptics touchpads when using SMBus.

Pure multi-touch touchpads are the easiest to support, we can rely on the finger locations and use them for scrolling, gestures, etc. These touchpads usually also provide extra information. In the case of the Apple touchpads we get an ellipsis and the orientation of the ellipsis for each touch point. Other touchpads provide a pressure value for each touch point. Though pressure is a bit of a misnomer, pressure is usually directly related to contact area. Since our puny human fingers flatten out as the pressure on the pad increases, the contact area increases and the firmware then calculates that back into a (mostly rather arbitrary) pressure reading.

Because pressure is really contact area size, we can use it to detect accidental palm contact or thumbs though it's fairly unreliable. A light palm touch or a touch at the very edge of a touchpad will have a low pressure reading simply because the palm is mostly next to the touchpad and thus the contact area itself remains small.

Partial multi-touch touchpads

The vast majority of touchpads fall into this category. It's the half-way point between single-touch and pure multi-touch. These devices can track N fingers, but detect more than N. The current Synaptics touchpads fall into that category when they're using the serial protocol. Most touchpads that fall into this category can track two fingers and detect up to four or five. So a typical three-finger interaction would give you the location of two fingers and a separate value telling you that a third finger is down.

The lack of finger location doesn't matter for some interactions (tapping, three-finger click) but it can cause issues in some cases. For example, a user may have a thumb resting on a touchpad while scrolling with two fingers. Which touch locations you get depends on the order of the fingers being set down, i.e. this may look like thumb + finger + third touch somewhere (lucky!) or two fingers scrolling + third touch somewhere (unlucky, this looks like a three-finger swipe). So far we've mostly avoided having anything complex enough that requires the exact location of more than two fingers, these pads are so prevalent that any complex feature would exclude the majority of users.

Semi-mt touchpads

A sub-class of partial multi-touch touchpads. These touchpads can technically detect two fingers but the location of both is limited to the bounding box, i.e. the first touch is always the top-left one and the second touch is the bottom-right one. Coordinates jump around as fingers move past each other. Most semi-mt touchpads also have a lower resolution for two touches than for one, so even things like two-finger scrolling can be very jumpy.

Semi-mt are labelled by the kernel with the INPUT_PROP_SEMI_MT input property.

Physical properties

External touchpads

USB or Bluetooth touchpads not in a laptop chassis. Think the Apple Magic Trackpad, the Logitech T650, etc. These are usually clickpads, the biggest difference is that they can be removed or added at runtime. One interaction method that is only possible on external touchpads is a thumb resting on the very edge/immediately next to the touchpad. On the far edge, touchpads don't always detect the finger location so clicking with a thumb barely touching the edge makes it hard or impossible to figure out which software button area the finger is on.

These touchpads also don't need palm detection - since they're not located underneath the keyboard, accidental palm touches are a non-issue.

A Logitech T650 external touchpad. Note the thumb position, it is possible to click the touchpad without triggering a touch.

Circular touchpads

Yes, used to be a thing. Touchpad shaped in an ellipsis or circle. Luckily for us they have gone full dodo. The X.Org synaptics driver had to be aware of these touchpads to calculate the right distance for edge scrolling - unsurprisingly an edge scroll motion on a circular touchpad isn't very straight.

Graphics tablets

Touch-capable graphics tablets are effectively external touchpads, with two differentiators: they are huge compared to normal touchpads and they have no touchpad buttons whatsoever. This means they can either work like a Forcepad, or rely on interaction methods that don't require buttons (like tap-to-click). Since the physical device is shared with the pen input, some touch arbitration is required to avoid touch input interfering when the pen is in use.

Dedicated edge scroll area

Mostly on older touchpads before two-finger scrolling became the default method. These touchpads have a marking on the touch area that designates the edge to be used for scrolling. A finger movement in that edge zone should trigger vertical motions. Some touchpads have markers for a horizontal scroll area too at the bottom of the touchpad.

A touchpad with a marked edge scroll area on the right.
July 21, 2015

I've had a lot of thought about Mont. (Sorry for the rhymes.) Mont, recall, is the set of all Monte objects. I have a couple interesting thoughts on Mont that I'd like to share, but the compelling result I hope to convince readers of is this: Mont is a simple and easy-to-think-about category once we define an appropriate sort of morphism. By "category" I mean the fundamental building block of category theory, and most of the maths I'm going to use in this post is centered around that field. In particular, "morphism" is used in the sense of categories.

I'd like to put out a little lemma from the other day, first. Let us say that the Monte == operator is defined as follows: For any two objects in Mont, x and y, x == y if and only if x is y, or for any message [verb, args] sendable to these objects, M.send(x, verb, args) == M.send(y, verb, args). In other words, x == y if it is not possible to distinguish x and y by any chain of sent messages. This turns out to relate to the category definition I give below. It also happens to correlate nicely with the idea of equivalence, in that == is an equivalence relation on Mont! The proof:

  • Reflexivity: For any object x, x == x. The first branch of the definition handles this.
  • Symmetry: For any objects x and y, x == y iff y == x. Identity is symmetric, and the second branch is covered by recursion.
  • Transitivity: For any objects x, y, and z, x == y and y == z implies x == z. Yes, if x and y can't be told apart, then if y and z also can't be told apart it immediately follows that x and z are likewise indistinguishable.

Now, obviously, since objects can do whatever computation they like, the actual implementation of == has to be conservative. We generally choose to be sound and incomplete; thus, x == y sometimes has false negatives when implemented in software. We can't really work around this without weakening the language considerably. Thus, when I talk about Mont/==, please be assured that I'm talking more about the ideal than the reality. I'll try to address spots where this matters.

Back to categories. What makes a category? Well, we need a set, some morphisms, and a couple proofs about the behavior of those morphisms. First, the set. I'll use Mont-DF for starters, but eventually we want to use Mont. Not up on my posts? Mont-DF is the subset of Mont where objects are transitively immutable; this is extremely helpful to us since we do not have to worry about mutable state nor any other side effect. (We do have to worry about identity, but most of my results are going to be stated as holding up to equivalence. I am not really concerned with whether there are two 42 objects in Mont right now.)

My first (and, spoiler alert, failed) attempt at defining a category was to use messages as morphisms; that is, to go from one object to another in Mont-DF, send a message to the first object and receive the second object. Clear, clean, direct, simple, and corresponds wonderfully to Monte's semantics. However, there's a problem. The first requirement of a category is that, for any object in the set, there exists an identity morphism, usually called 1, from that object to itself. This is a problem in Monte. We can come up with a message like that for some objects, like good old 42, which responds to ["add", [0]] with 42. (Up to equivalence, of course!) However, for some other objects, like object o as DeepFrozen {}, there's no obvious methods to use.

The answer is to add a new Miranda method which is not overrideable called _magic/0. (Yes, if this approach would have worked, I would have picked a better name.) Starting from Mont-DF, we could amend all objects to get a new set, Mont-DF+Magic, in which the identity morphism is always ["_magic", []]. This neatly wraps up the identity morphism problem.

Next, we have to figure out how to compose messages. At first blush, this is simple; if we start from x and send it some message to get y, and then send another message to y to get z, then we obviously can get to x from z. However, here's the rub: There might not be any message directly from x to z! We're stuck here. Unlike with other composition operators, there's no hand-wavey way to compose messages like with functions. So this is bunk.

However, we can cheat gently and use the free monoid a.k.a. the humble list. A list of messages will work just fine: To compose them, simply catenate the lists, and the identity morphism is the empty list. Putting it all together, a morphism from 6 to 42 might be [["multiply", [7]]], and we could compose that with [["asString", []]] to get [["multiply", [7]], ["asString", []]], a morphism from 6 to "42". Not shabby at all!

There we go. Now Mont-DF is a category up to equivalence. The (very informally defined) set of representatives of equivalence classes via ==, which I'll call Mont-DF/==, is definitely a category here as well, since it encapsulates the equivalence question. We could alternatively insist that objects in Mont-DF are unique (or that equivalent definitions of objects are those same objects), but I'm not willing to take up that sword this time around, mostly because I don't think that it's true.

"Hold up," you might say; "you didn't prove that Mont is a category, only Mont-DF." Curses! I didn't fool you at all, did I? Yes, you're right. We can't extend this result to Mont wholesale, since objects in Mont can mutate themselves. In fact, Mont doesn't really make sense to discuss in this way, since objects in sets aren't supposed to be mutable. I'm probably going to have to extend/alter my definition of Mont in order to get anywhere with that.

July 16, 2015

In a perfect world, any device that advertises absolute x/y axes also advertises the resolution for those axes. Alas, not all of them do. For libinput, this problem is two-fold: parts of the touchscreen API provide data in mm - without knowing the resolution this is a guess at best. But it also matters for touchpads, where a lack of resolution is a lot more common (though the newest generations of major touchpad manufacturers tend to advertise resolutions now).

We have a number of features that rely on the touchpad resolution: from the size of the software button to deciding which type of palm detection we need, it all is calculated based on physical measurements. Until recently, we had code to differentiate between touchpads with resolution and most of the special handling was a matter of magic numbers, usually divided by the diagonal of the touchpad in device units. This made code maintenance more difficult - without testing each device, behaviour could not be guaranteed.

With libinput 0.20, we now got rid of this special handling and instead require the touchpads to advertise resolutions. This requires manual intervention, so we're trying to fix this in multiple places, depending on the confidence of the data. We have hwdb entries for the bcm5974 (Apple) touchpads and the Chromebook Pixel. For Elantech touchpads, a kernel patch is currently waiting for merging. For ALPS touchpads, we ship size hints with libinput's hwdb. If all that fails, we fall back to a default touchpad size of 69x55mm. [1]

All this affects users in two ways: one is that you may notice a slightly different behaviour of your touchpad now. The software-buttons may be bigger or smaller than before, pointer acceleration may be slightly different, etc. Shouldn't be too bad, but you may just notice it. The second noticeable change is that libinput will now log when it falls back to the default size. If you notice a message like that in your log, please file a bug and attach the output of evemu-describe and the physical dimensions of your touchpad. Once we have that information, we can add it to the right place and make sure that everyone else with that touchpad gets the right settings out of the box.

[1] The default size was chosen because it's close enough to what old touchpads used to be, and those are most likely to lack resolution values. This size may change over time as we get better data.

July 15, 2015

With the release of Solaris 11.3 beta, I've gone back and made a new list of changes to the bundled software packages available in the Solaris IPS package repository, as I've done for the Solaris 11.1, Solaris 11.2 beta, and the Solaris 11.2 GA releases.

Oracle packages

Several bundled packages improve integration with other Oracle software. The Oracle Instant Client packages are now in the IPS repo for building software that connects to Oracle databases. MySQL 5.6 has also been added alongside the existing version 5.5 packages.

The Java runtime & developer kits for Java 7 & 8 were updated to new versions, while the Java 6 versions were removed as its support life winds down. The End of Feature Notices for Oracle Solaris 11 warns that Java 7 will be coming out as well in a later release.

Also updated was Oracle Hardware Management Pack (HMP), a set of tools that work with the ILOM, firmware, and other components in Sun/Oracle servers to configure low-level system options. HMP 2.2 was introduced in Solaris 11.2, and Solaris 11.3 now delivers HMP 2.3 packages.

Python packages

Solaris has long included and depended on Python 2. Solaris 11.3 adds Python 3 support for the first time, with the bundling of Python 3.4 and many module packages that work with it. Python 2.7 is still included, as is 2.6 for now, but Python 2 software in Solaris is almost completely switched over to 2.7 now, and Python 2.6 will be obsoleted soon.

A side effect of these changes was a revamping of the naming pattern for Python module packages in IPS - previously most modules delivered a set of packages following the pattern:

  • library/python-2/<module name>
  • library/python-2/<module name>-<for each Python version>
For example, there were three Mako packages, library/python-2/mako, library/python-2/mako-26, library/python-2/mako-27, where the latter two installed the modules built for the named versions of Python, and the first uses IPS conditional dependencies to install the modules for any Python versions that were installed on the system.

In extending this to provide Python 3 modules, it was decided to drop the python major version from the library/python-N prefix, leaving just the version at the end of the module name. Thus in Solaris 11.3, you'll see that the mako packages are now library/python/mako, library/python/mako-26, library/python/mako-27, and library/python/mako-34.

NVIDIA graphics driver packages

NVIDIA has been providing graphics driver packages for Solaris for almost a decade now. As new families and models of graphics cards are regularly introduced, they retire support for older generations from time to time in the new drivers. Support for these models is retained in a legacy driver, but that requires uninstalling the latest version and switching to a legacy branch. Previously that meant installing NVDIA's SVR4 package release instead of IPS, losing the ability to get updates with a simple “pkg update” command.

Now the legacy drivers are also available in IPS packages, which will continue to be updated as necessary for bug fixes and support for new Xorg releases during NVIDIA’s Support timeframes for Unix legacy GPU releases. To switch to the version 340 legacy driver on Solaris 11.3 or the later Solaris 11.2 SRU’s simply run:

  # pkg install --reject driver/graphics/nvidia driver/graphics/nvidiaR340 
and then reboot into the new BE created. For the previous version 304, change the above command to end in nvidiaR304 instead.

Other packages

There are far more changes than I've covered here - fortunately, the engineers who worked on many of these changes have written their own blog posts about them for you to check out:

One more thing... Solaris 11.2 packages

While all these are available now in the Solaris 11.3 beta, many are also available for testing and evaluation on existing Solaris 11.2 systems, when you're ready to upgrade a FOSS package, but not the rest of the OS. This is planned to be an ongoing program, so once Solaris 11.3 is officially released, the evaluation packages will keep moving forward to new versions of many packages. More details are available in a Solaris FOSS blog post and an article in the Solaris 11 OTN community space.

Not all packages are available in the evaluation program though, since some depend on OS changes not in Solaris 11.2. For instance, OpenSSH is not available for Solaris 11.2, since it depends on changes to the existing SunSSH packages that allow for the ssh package mediator to choose which ssh software to use on a given system.

Detailed list of changes

This table shows most of the changes to the bundled packages between the original Solaris 11.2.0 release, the latest Solaris 11.2 support repository update (SRU11, aka 11.2.11, released June 13, 2015), and the Solaris 11.3 beta released today. These show the versions they were released with, and not later versions that may now be available via the new FOSS Evaluation Packages for existing Solaris releases.

As with last time, some were excluded for clarity, or to reduce noise and duplication. All of the bundled packages which didn’t change the version number in their packaging info are not included, even if they had updates to fix bugs, security holes, or add support for new hardware or new features of Solaris.

PackageUpstream11. Beta
cloud/openstack OpenStack0.2013.2.30.2014.2.20.2014.2.2
cloud/openstack/cinder OpenStack0.2013.2.30.2014.2.20.2014.2.2
cloud/openstack/glance OpenStack0.2013.2.30.2014.2.20.2014.2.2
cloud/openstack/heat OpenStacknot included0.2014.2.20.2014.2.2
cloud/openstack/horizon OpenStack0.2013.2.30.2014.2.20.2014.2.2
cloud/openstack/keystone OpenStack0.2013.2.30.2014.2.20.2014.2.2
cloud/openstack/neutron OpenStack0.2013.2.30.2014.2.20.2014.2.2
cloud/openstack/nova OpenStack0.2013.2.30.2014.2.20.2014.2.2
cloud/openstack/swift OpenStack1.
communication/im/pidgin Pidgin2.
compress/pigz pigznot included2.
crypto/gnupg GnuPG2.
database/mysql-56 MySQLnot included
(MySQL 5.5 in database/mysql-56)
database/sqlite-3 SQLite3.
developer/build/ant Apache Ant1.
developer/documentation-tool/help2man GNU help2mannot includednot included1.46.1
developer/documentation-tool/xmlto xmltonot includednot included0.0.25
developer/java/jdk-6 Java1.6.0.75
(Java SE 6u75)
(Java SE 6u95)
not included
developer/java/jdk-7 Java1.7.0.65
(Java SE 7u65)
(Java SE 7u80)
(Java SE 7u80)
developer/java/jdk-8 Java1.8.0.11
(Java SE 8u11)
(Java SE 8u45)
(Java SE 8u45)
developer/test/check checknot includednot included0.9.14
developer/versioning/mercurial Mercurial SCM2.
developer/versioning/subversion Apache Subversion1.
diagnostic/nicstat nicstatnot includednot included1.95
diagnostic/tcpdump tcpdump4.
diagnostic/wireshark Wireshark1.
driver/graphics/nvidia NVIDIA0.331.38.00.346.35.00.346.35.0
driver/graphics/nvidiaR304 NVIDIAnot included0.304.125.00.304.125.0
driver/graphics/nvidiaR340 NVIDIAnot included0.340.65.00.340.65.0
file/mc GNU Midnight Commander4.
library/apr-15 Apache Portable Runtimenot includednot included1.5.1
library/c++/net6 Gobby1.
library/jansson Janssonnot includednot included2.7
library/json-c JSON-C0.90.90.12
library/libee libee0.
library/libestr libestr0.
library/libgsl GNU GSLnot includednot included1.16
library/liblogging LibLoggingnot includednot included1.0.4
library/libmicrohttpd GNU Libmicrohttpdnot includednot included0.9.37
library/libmilter Sendmail8.
library/libxml2 XML C parser2.
library/neon neon0.
library/perl-5/openscap-512 OpenSCAP1.
library/perl-5/xml-libxml CPAN: XML::LibXML2.142.142.121
was library/python-2/alembic
was library/python-2/amqp
library/python/barbicanclient OpenStacknot included3.
was library/python-2/boto
library/python/ceilometerclient OpenStack1.
library/python/cinderclient OpenStack1.
was library/python-2/cliff
library/python/django Django1.
library/python/django-pyscss django-pyscssnot included1.
was library/python-2/django_compressor
was library/python-2/django_openstack_auth
was library/python-2/eventlet
library/python/futures pythonfuturesnot included2.
library/python/glance_store OpenStacknot included0.
library/python/glanceclient OpenStack0.
was library/python-2/greenlet
library/python/heatclient OpenStack0.
library/python/iniparse iniparsenot included0.40.4
library/python/ipaddr ipaddr-pynot included2.
library/python/jinja2 Jinja2.
library/python/keystoneclient OpenStack0.
library/python/keystonemiddleware OpenStack not included1.
was library/python-2/kombu
library/python/ldappool ldappoolnot included1.01.0
was library/python-2/netaddr
was library/python-2/netifaces
library/python/networkx NetworkXnot included1.
library/python/neutronclient OpenStack2.
library/python/novaclient OpenStack2.
library/python/oauthlib OAuthLibnot included0.
library/python/openscap OpenSCAP1.
library/python/oslo.config OpenStack1.
library/python/oslo.context OpenStacknot included0.
library/python/oslo.db OpenStacknot included1.
library/python/oslo.i18n OpenStacknot included1.
library/python/oslo.messaging OpenStacknot included1.
library/python/oslo.middleware OpenStacknot included0.
library/python/oslo.serialization OpenStacknot included1.
library/python/oslo.utils OpenStacknot included1.
library/python/oslo.vmware OpenStacknot included0.
library/python/osprofiler OpenStacknot included0.
was library/python-2/pep8
PyPI: pep81.
was library/python-2/pip
library/python/posix_ipc POSIX IPC for Pythonnot included0.
was library/python-2/py
library/python/pycadf OpenStacknot included0.
was library/python-2/pyflakes
library/python/pyscss pyScssnot included1.
library/python/pysendfile pysendfilenot included2.
was library/python-2/pytest
was library/python-2/python-mysql
was library/python-2/pytz
was library/python-2/requests
library/python/retrying Retryingnot included1.
library/python/rfc3986 rfc3986not included0.
library/python/saharaclient OpenStacknot included0.
was library/python-2/setuptools
PyPI: setuptools0.
library/python/simplegeneric PyPI: simplegenericnot included0.
was library/python-2/simplejson
library/python/six PyPI: six1.
was library/python-2/sqlalchemy
was library/python-2/sqlalchemy-migrate
was library/python-2/stevedore
library/python/swiftclient OpenStack2.
library/python/taskflow OpenStacknot included0.
was library/python-2/tox
library/python/troveclient OpenStack0.
was library/python-2/virtualenv
library/python/websockify Websockify0.
library/python/wsme wsmenot included0.
library/ruby/hiera Puppetnot included1.
library/security/libassuan GnuPG2.
library/security/libksba GnuPG1.
library/security/openssl OpenSSL1.0.1.8 (1.0.1h) (1.0.1m) (1.0.1o)
library/unixodbc unixODBC2.
library/zlib zlib1.
mail/mailman GNU Mailmannot includednot included2.1.18.1
network/dns/bind ISC BIND9.
network/firewall OpenBSD PFnot includednot included5.5
network/mtr MTRnot includednot included0.86
network/openssh OpenSSHnot includednot included6.5.0.1
network/rsync rsync3.
print/filter/hplip HPLIP3.
runtime/erlang erlang15.2.317.517.5
runtime/java/jre-6 Java1.6.0.75
(Java SE 6u75)
(Java SE 6u95)
not included
runtime/java/jre-7 Java1.7.0.65
(Java SE 7u65)
(Java SE 7u80)
(Java SE 7u80)
runtime/java/jre-8 Java1.8.0.11
(Java SE 8u11)
(Java SE 8u45)
(Java SE 8u45)
runtime/python-27 Python2.
runtime/python-34 Pythonnot includednot included3.4.3
runtime/ruby-21 Rubynot included
(Ruby 1.9.3 in runtime/ruby-19)
security/compliance/openscap OpenSCAP1.
security/sudo Sudo1.
service/network/dns/bind ISC BIND9.
service/network/ftp ProFTPD1. (1.3.4c)
service/network/ntp NTP4.2.7.381 (4.2.7p381) (4.2.8p2) (4.2.8p2)
service/network/samba Samba3.
service/network/smtp/postfix Postfixnot includednot included2.11.3
service/network/smtp/sendmail Sendmail8.
shell/bash GNU bash4.
shell/watch procps-ngnot includednot included3.3.10
shell/zsh Zsh5.
system/data/hardware-registry pci.ids
system/data/timezone IANA Time Zone Data0.5.11 (2014c)0.5.11 (2015d)2015.4 (2015d)
system/font/truetype/google-droid Droid Fonts0.2010.2.240.2010.2.240.2013.6.7
system/library/freetype-2 FreeType2.
system/library/hmp-libs Oracle HMP2.
system/library/i18n/libthai libthai0.
system/library/libdatrie datrie0.
system/management/biosconfig Oracle HMP2.
system/management/facter Puppet1.
system/management/fwupdate Oracle HMP2.
system/management/fwupdate/qlogic Oracle HMP1.
system/management/hmp-snmp Oracle HMP2.
system/management/hwmgmtcli Oracle HMP2.
system/management/hwmgmtd Oracle HMP2.
system/management/ocm Oracle Configuration Manager12.
system/management/puppet Puppet3.
system/management/raidconfig Oracle HMP2.
system/management/ubiosconfig Oracle HMP2.
system/rsyslog rsyslog6.
system/test/sunvts Oracle VTS7.
terminal/tmux tmux1.
text/gnu-patch GNU Patch2.
text/groff GNU troff1.
text/less Less436436458
text/text-utilities util-linuxnot includednot included2.24.2
web/browser/firefox Mozilla Firefox17.0.1131.
web/browser/links Links1.
web/curl cURL7.
web/java-servlet/tomcat Apache Tomcat6.0.416.0.436.0.43
web/java-servlet/tomcat-8 Apache Tomcatnot includednot included8.0.21
web/novnc noVNCnot included0.50.5
web/php-53 PHP5.
web/php-56 PHPnot includednot included5.6.8
web/php-56/extension/php-suhosin-extension Suhosinnot includednot included0.9.37.1
web/php-56/extension/php-xdebug Xdebugnot includednot included2.3.2
web/server/apache-22 Apache HTTPD2.
web/server/apache-22/module/apache-jk Apache Tomcat1.
web/server/apache-22/module/apache-security ModSecurity2.
web/server/apache-22/module/apache-wsgi mod_wsgi3.
web/server/apache-24 Apache HTTPDnot includednot included2.4.12
web/server/apache-24/module/apache-dtrace Apache DTrace modulenot includednot included0.3.1
web/server/apache-24/module/apache-fcgid Apache mod_fcgidnot includednot included2.3.9
web/server/apache-24/module/apache-jk Apache Tomcatnot includednot included1.2.40
web/server/apache-24/module/apache-security ModSecuritynot includednot included2.8.0
mod_wsginot includednot included4.3.0
web/wget GNU wget1.141.161.16
x11/server/xorg/driver/xorg-input-keyboard X.Org1.
x11/server/xorg/driver/xorg-input-mouse X.Org1.
x11/server/xorg/driver/xorg-input-synaptics X.Org1.
x11/server/xorg/driver/xorg-video-ast X.Org0.
x11/server/xorg/driver/xorg-video-dummy X.Org0.
x11/server/xorg/driver/xorg-video-mga X.Org1.
x11/server/xorg/driver/xorg-video-vesa X.Org2.

The libinput test suite takes somewhere around 35 minutes now for a full run. That's annoying, especially as I'm running it for every commit before pushing. I've tried optimising things, but attempts at making it parallel have mostly failed so far (almost all tests need a uinput device created) and too many tests rely on specific timeouts to check for behaviours. Containers aren't an option when you have to create uinput devices so I started out farming out into VMs.

Ideally, the test suite should run against multiple commits (on multiple VMs) at the same time while I'm working on some other branch and then accumulate the results. And that's where git notes come in. They're a bit odd to use and quite the opposite of what I expected. But in short: a git note is an object that can be associated with a commit, without changing the commit itself. Sort-of like a post-it note attached to the commit. But there are plenty of limitations, for example you can only have one note (per namespace) and merge conflicts are quite easy to trigger. Look at any git notes tutorial to find out more, there's plenty out there.

Anyway, dealing with merge conflicts is a no-go for me here. So after a bit of playing around, I found something that seems to work out well. A script to run make check and add notes to the commit, combined with a repository setup to fetch those notes and display them automatically. The core of the script is this:

make check
if [ $? -eq 0 ]; then

if [ -n "$sha" ]; then
git notes --ref "test-$HOSTNAME" append \
-m "$status: $HOSTNAME: make check `date`" HEAD
exit $rc
Then in my main repository, I add each VM as a remote, adding a fetch path for the notes:

[remote "f22-libinput1"]
url = f22-libinput1.local:/home/whot/code/libinput
fetch = +refs/heads/*:refs/remotes/f22-libinput1/*
fetch = +refs/notes/*:refs/notes/f22-libinput1/*
Finally, in the main repository, I extended the glob that displays notes to 'everything':

$ git config notes.displayRef "*"
Now git log (and by extension tig) displays all notes attached to a commit automatically. All that's needed is a git fetch --all to fetch everything and it's clear in the logs which commit fails and which one succeeded.

:: whot@jelly:~/code/libinput (master)> git log
commit 6896bfd3f5c3791e249a0573d089b7a897c0dd9f
Author: Peter Hutterer
Date: Tue Jul 14 14:19:25 2015 +1000

test: check for fcntl() return value

Mostly to silence coverity complaints.

Signed-off-by: Peter Hutterer

Notes (f22-jelly/test-f22-jelly):
SUCCESS: f22-jelly: make check Tue Jul 14 00:20:14 EDT 2015

Whenever I look at the log now, I immediately see which commits passed the test suite and which ones didn't (or haven't had it run yet). The only annoyance is that since a note is attached to a commit, amending the commit message or rebasing makes the note "go away". I've copied notes manually after this, but it'd be nice to find a solution to that.

Everything else has been working great so far, but it's quite new so there'll be a bit of polishing happening over the next few weeks. Any suggestions to improve this are welcome.

July 09, 2015

Update June 09, 2015: edge scrolling for clickpads has been merged. Will be availble in libinput 0.20. Consider the rest of this post obsolete.

libinput supports edge scrolling since version 0.7.0. Whoops, how does the post title go with this statement? Well, libinput supports edge scrolling, but only on some devices and chances are your touchpad won't be one of them. Bug 89381 is the reference bug here.

First, what is edge scrolling? As the libinput documentation illustrates, it is scrolling triggered by finger movement within specific regions of the touchpad - the left and bottom edges for vertical and horizontal scrolling, respectively. This is in contrast to two-finger scrolling, triggered by a two-finger movement, anywhere on the touchpad. synaptics had edge scrolling since at least 2002, the earliest commit in the repo. Back then we didn't have multitouch-capable touchpads, these days they're the default and you'd be struggling to find one that doesn't support at least two fingers. But back then edge-scrolling was the default, and touchpads even had the markings for those scroll edges painted on.

libinput adds a whole bunch of features to the touchpad driver, but those features make it hard to support edge scrolling. First, libinput has quite smart software button support. Those buttons are usually on the lowest ~10mm of the touchpad. Depending on finger movement and position libinput will send a right button click, movement will be ignored, etc. You can leave one finger in the button area while using another finger on the touchpad to move the pointer. You can press both left and right areas for a middle click. And so on. On many touchpads the vertical travel/physical resistance is enough to trigger a movement every time you click the button, just by your finger's logical center moving.

libinput also has multi-direction scroll support. Traditionally we only sent one scroll event for vertical/horizontal at a time, even going as far as locking the scroll direction. libinput changes this and only requires a initial threshold to start scrolling, after that the caller will get both horizontal and vertical scroll information. The reason is simple: it's context-dependent when horizontal scrolling should be used, so a global toggle to disable doesn't make sense. And libinput's scroll coordinates are much more fine-grained too, which is particularly useful for natural scrolling where you'd expect the content to move with your fingers.

Finally, libinput has smart palm detection. The large majority of palm touches are along the left and right edges of the touchpad and they're usually indistinguishable from finger presses (same pressure values for example). Without palm detection some laptops are unusable (e.g. the T440 series).

These features interfere heavily with edge scrolling. Software button areas are in the same region as the horizontal scroll area, palm presses are in the same region as the vertical edge scroll area. The lower vertical edge scroll zone overlaps with software buttons - and that's where you would put your finger if you'd want to quickly scroll up in a document (or down, for natural scrolling). To support edge scrolling on those touchpads, we'd need heuristics and timeouts to guess when something is a palm, a software button click, a scroll movement, the start of a scroll movement, etc. The heuristics are unreliable, the timeouts reduce responsiveness in the UI. So our decision was to only provide edge scrolling on touchpads where it is required, i.e. those that cannot support two-finger scrolling, those with physical buttons. All other touchpads provide only two-finger scrolling. And we are focusing on making 2 finger scrolling good enough that you don't need/want to use edge scrolling (pls file bugs for anything broken)

Now, before you get too agitated: if edge scrolling is that important to you, invest the time you would otherwise spend sharpening pitchforks, lighting torches and painting picket signs into developing a model that allows us to do reliable edge scrolling in light of all the above, without breaking software buttons, maintaining palm detection. We'd be happy to consider it.

July 08, 2015

In my previous post I introduced ARB_shader_storage_buffer, an OpenGL 4.3 feature that is coming soon to Mesa and the Intel i965 driver. While that post focused on explaining the features introduced by the extension, in this post I’ll dive into some of the implementation aspects, for those who are curious about this kind of stuff. Be warned that some parts of this post will be specific to Intel hardware.

Following the trail of UBOs

As I explained in my previous post, SSBOs are similar to UBOs, but they are read-write. Because there is a lot of code already in place in Mesa’s GLSL compiler to deal with UBOs, it made sense to try and reuse all the data structures and code we had for UBOs and specialize the behavior for SSBOs where that was needed, that allows us to build on code paths that are already working well and reuse most of the code.

That path, however, had some issues that bit me a bit further down the road. When it comes to representing these operations in the IR, my first idea was to follow the trail of UBO loads as well, which are represented as ir_expression nodes. There is a fundamental difference between the two though: UBO loads are constant operations because uniform buffers are read-only. This means that a UBO load operation with the same parameters will always return the same value. This has implications related to certain optimization passes that work based on the assumption that other ir_expression operations share this feature. SSBO loads are not like this: since the shader storage buffer is read-write, two identical SSBO load operations in the same shader may not return the same result if the underlying buffer storage has been altered in between by SSBO write operations within the same or other threads. This forced me to alter a number of optimization passes in Mesa to deal with this situation (mostly disabling them for the cases of SSBO loads and stores).

The situation was worse with SSBO stores. These just did not fit into ir_expression nodes: they did not return a value and had side-effects (memory writes) so we had to come up with a different way to represent them. My initial implementation created a new IR node for these, ir_ssbo_store. That worked well enough, but it left us with an implementation of loads and stores that was a bit inconsistent since both operations used very different IR constructs.

These issues were made clear during the review process, where it was suggested that we used GLSL IR intrinsics to represent load and store operations instead. This has the benefit that we can make the implementation more consistent, having both loads and stores represented with the same IR construct and follow a similar treatment in both the GLSL compiler and the i965 backend. It would also remove the need to disable or alter certain optimization passes to be SSBO friendly.

Read/Write coherence

One of the issues we detected early in development was that our reads and writes did not seem to work very well together: some times a read after a write would fail to see the last value written to a buffer variable. The problem here also spawned from following the implementation trail of the UBO path. In the Intel hardware, there are various interfaces to access memory, like the Sampling Engine and the Data Port. The former is a read-only interface and is used, for example, for texture and UBO reads. The Data Port allows for read-write access. Although both interfaces give access to the same memory region, there is something to consider here: if you mix reads through the Sampling Engine and writes through the Data Port you can run into cache coherence issues, this is because the caches in use by the Sampling Engine and the Data Port functions are different. Initially, we implemented SSBO load operations like UBO loads, so we used the Sampling Engine, and ended up running into this problem. The solution, of course, was to rewrite SSBO loads to go though the Data Port as well.

Parallel reads and writes

GPUs are highly parallel hardware and this has some implications for driver developers. Take a sentence like this in a fragment shader program:

float cx = 1.0;

This is a simple assignment of the value 1.0 to variable cx that is supposed to happen for each fragment produced. In Intel hardware running in SIMD16 mode, we process 16 fragments simultaneously in the same GPU thread, this means that this instruction is actually 16 elements wide. That is, we are doing 16 assignments of the value 1.0 simultaneously, each one is stored at a different offset into the GPU register used to hold the value of cx.

If cx was a buffer variable in a SSBO, it would also mean that the assignment above should translate to 16 memory writes to the same offset into the buffer. That may seem a bit absurd: why would we want to write 16 times if we are always assigning the same value? Well, because things can get more complex, like this:

float cx = gl_FragCoord.x;

Now we are no longer assigning the same value for all fragments, each of the 16 values assigned with this instruction could be different. If cx was a buffer variable inside a SSBO, then we could be potentially writing 16 different values to it. It is still a bit silly, since only one of the values (the one we write last), would prevail.

Okay, but what if we do something like this?:

int index = int(mod(gl_FragCoord.x, 8));
cx[index] = 1;

Now, depending on the value we are reading for each fragment, we are writing to a separate offset into the SSBO. We still have a single assignment in the GLSL program, but that translates to 16 different writes, and in this case the order may not be relevant, but we want all of them to happen to achieve correct behavior.

The bottom line is that when we implement SSBO load and store operations, we need to understand the parallel environment in which we are running and work with test scenarios that allow us to verify correct behavior in these situations. For example, if we only test scenarios with assignments that give the same value to all the fragments/vertices involved in the parallel instructions (i.e. assignments of values that do not depend on properties of the current fragment or vertex), we could easily overlook fundamental defects in the implementation.

Dealing with helper invocations

From Section 7.1 of the GLSL spec version 4.5:

“Fragment shader helper invocations execute the same shader code
as non-helper invocations, but will not have side effects that
modify the framebuffer or other shader-accessible memory.”

To understand what this means I have to introduce the concept of helper invocations: certain operations in the fragment shader need to evaluate derivatives (explicitly or implicitly) and for that to work well we need to make sure that we compute values for adjacent fragments that may not be inside the primitive that we are rendering. The fragment shader executions for these added fragments are called helper invocations, meaning that they are only needed to help in computations for other fragments that are part of the primitive we are rendering.

How does this affect SSBOs? Because helper invocations are not part of the primitive, they cannot have side-effects, after they had served their purpose it should be as if they had never been produced, so in the case of SSBOs we have to be careful not to do memory writes for helper fragments. Notice also, that in a SIMD16 execution, we can have both proper and helper fragments mixed in the group of 16 fragments we are handling in parallel.

Of course, the hardware knows if a fragment is part of a helper invocation or not and it tells us about this through a pixel mask register that is delivered with all executions of a fragment shader thread, this register has a bitmask stating which pixels are proper and which are helper. The Intel hardware also provides developers with various kinds of messages that we can use, via the Data Port interface, to write to memory, however, the tricky thing is that not all of them incorporate pixel mask information, so for use cases where you need to disable writes from helper fragments you need to be careful with the write message you use and select one that accepts this sort of information.

Vector alignments

Another interesting thing we had to deal with are address alignments. UBOs work with layout std140. In this setup, elements in the UBO definition are aligned to 16-byte boundaries (the size of a vec4). It turns out that GPUs can usually optimize reads and writes to multiples of 16 bytes, so this makes sense, however, as I explained in my previous post, SSBOs also introduce a packed layout mode known as std430.

Intel hardware provides a number of messages that we can use through the Data Port interface to write to memory. Each message has different characteristics that makes it more suitable for certain scenarios, like the pixel mask I discussed before. For example, some of these messages have the capacity to write data in chunks of 16-bytes (that is, they write vec4 elements, or OWORDS in the language of the technical docs). One could think that these messages are great when you work with vector data types, however, they also introduce the problem of dealing with partial writes: what happens when you only write to an element of a vector? or to a buffer variable that is smaller than the size of a vector? what if you write columns in a row_major matrix? etc

In these scenarios, using these messages introduces the need to mask the writes because you need to disable the channels in the vec4 element that you don’t want to write. Of course, the hardware provides means to do this, we only need to set the writemask of the destination register of the message instruction to select the right channels. Consider this example:

struct TB {
    float a, b, c, d;

layout(std140, binding=0) buffer Fragments {
   TB s[3];
   int index;

void main()
   s[0].d = -1.0;

In this case, we could use a 16-byte write message that takes 0 as offset (i.e writes at the beginning of the buffer, where s[0] is stored) and then set the writemask on that instruction to WRITEMASK_W so that only the fourth data element is actually written, this way we only write one data element of 4 bytes (-1) at offset 12 bytes (s[0].d). Easy, right? However, how do we know, in general, the writemask that we need to use? In std140 layout mode this is easy: since each element in the SSBO is aligned to a 16-byte boundary, we simply need to take the byte offset at which we are writing, divide it by 16 (to convert it to units of vec4) and the modulo of that operation is the byte offset into the chunk of 16-bytes that we are writing into, then we only have to divide that by 4 to get the component slot we need to write to (a number between 0 and 3).

However, there is a restriction: we can only set the writemask of a register at compile/link time, so what happens when we have something like this?:

s[i].d = -1.0;

The problem with this is that we cannot evaluate the value of i at compile/link time, which inevitably makes our solution invalid for this. In other words, if we cannot evaluate the actual value of the offset at which we are writing at compile/link time, we cannot use the writemask to select the channels we want to use when we don’t want to write a vec4 worth of data and we have to use a different type of message.

That said, in the case of std140 layout mode, since each data element in the SSBO is aligned to a 16-byte boundary you may realize that the actual value of i is irrelevant for the purpose of the modulo operation discussed above and we can still manage to make things work by completely ignoring it for the purpose of computing the writemask, but in std430 that trick won’t work at all, and even in std140 we would still have row_major matrix writes to deal with.

Also, we may need to tweak the message depending on whether we are running on the vertex shader or the fragment shader because not all message types have appropriate SIMD modes (SIMD4x2, SIMD8, SIMD16, etc) for both, or because different hardware generations may not have all the message types or support all the SIMD modes we need need, etc

The point of this is that selecting the right message to use can be tricky, there are multiple things and corner cases to consider and you do not want to end up with an implementation that requires using many different messages depending on various circumstances because of the increasing complexity that it would add to the implementation and maintenance of the code.

Closing notes

This post did not cover all the intricacies of the implementation of ARB_shader_storage_buffer_object, I did not discuss things like the optional unsized array or the compiler details of std430 for example, but, hopefully, I managed to give an idea of the kind of problems one would have to deal with when coding driver support for this or other similar features.

July 04, 2015
So, I realized it has been a while since posting about freedreno progress, so in honor of US independence day I figured it was as good an excuse as any for an update about independence from gpu blob driver for snapdragon/adreno..

Back in end of March 2015 at ELC, I gave a freedreno update presentation at ELC, listing the following major tasks left for gles3 support:
  • Uniform Buffer Objects (UBO)
  • Transform Feedback (TF)
  • Multi-Render-Target (MRT)
  • advanced flow control in shader compiler
 and additionally for gl3:
  • Multisample anti-aliasing (MSAA)
  • NV_conditional_render
  • 32b depth (z32 and z32_s8) (which I forgot to mention in the presentation)
EDIT: Ilia pointed out that 32b depth is needed for gles3 too, and gl3 additionally needs clipdist/etc (which we'll have to emulate, but hopefully can do in a generic nir pass) and rgtc (which will need sw decompression hopefully in mesa core so other drivers for gles class hw can reuse).  Original list was based on what mesa's compute_version() code was checking quite some time back.
Since then, we've gained support for UBO's (a3xx by Ilia Mirkin, and a4xx), MRT (for a3xx and core, again thanks to Ilia.. still needs to be wired up for a4xx), 32b depth (a3xx and core, again thanks to Ilia), and I've finished up shader compiler for loops/flow-control for ir3 (a3xx/a4xx).  The shader compiler work was a somewhat larger task than I expected (and I did expect it to be a lot of work), but it also involved moving over to NIR, in addition to re-writing the scheduler and register allocation passes, as well as a lot of re-org to ir3 in order to support multiple basic blocks.  The move to NIR was not strictly required, but it brings a lot of benefits in the form of shared support for conversion to SSA, scalarizing, CSE, DCE, constant folding, and algebraic optimizations.  And I figured it was less work in the long run to move to NIR first and drop the TGSI frontend, before doing all the refactoring needed to support loops and non-lowerable flow-control.  Incidentally, the compiler work should make the shader-compiler part of TF easier (since we need to generate a conditional write to TF buffer iff not overwriting past the end of the TF buffer).

In the mean time, freedreno and drm/msm have also gained support for the a306 gpu found in the new dragonboard 410c.  This board is a nice new low cost ($75) snapdragon community board based on the 64bit snapdragon 410.  And thanks to a lot of work by linaro and qualcomm, the upstream kernel situation for this board is looking pretty good.  It is shipping initially with a 4.0 based kernel (with patches on top for stuff that hadn't yet been merged for 4.0, including a lot of stuff backported from 4.1 and 4.2), including gpu/display/audio/video-codec/etc.  I believe that the 4.1 kernel was the first version where a vanilla kernel could boot on db410c with basic stuff (like serial console) working.  The kernel support for the gpu and display, other than the adv7533 hdmi bridge chip) landed in 4.2.  There is still more work to get *everything* (including audio, vidc, etc) merged upstream, but work continues in that direction, making this quite an exciting board.
Also, we have a GSoC student, Varad, working on freedreno support for android.  It is still in early stages, with some debugging still to do, but he has made a lot of progress and things are starting to work.
And since no blog post is complete without some nice screenshots...  the other day someone pointed me at a post in the dolphin forums about how dolphin was running on a420 (same device as in the ifc6540).  We all had a good laugh about the rendering issues with the blob driver.  But, since dolphin was the first gl3 game that worked with freedreno, I was curious how freedreno would do.. so I fired up the ifc6540 and replayed some dolphin fifo logs that would let me render approximately the same scenes:

Yoshi looks to be rendering pretty well.. digimon has a bit of corruption, but no where near as bad as the blob driver.  I suspect the issue with digimon is an instruction scheduling issue in the shader compiler (well, no rest for the gpu driver writers), but nice to see that it is already in pretty good shape.

Now we just need steam store or some unigine demos for arm linux :-P

In type-theoretic terms, Monte has a very boring type system. All objects expressible in Monte form a set, Mont, which has some properties, but not anything interesting from a theoretical point of view. I plan to talk about Mont later, but for now we'll just consider it to be a way for me to make existential or universal claims about Monte's object model.

Let's start with guards. Guards are one of the most important parts of writing idiomatic Monte, and they're also definitely an integral part of Monte's safety and security guarantees. They look like types, but are they actually useful as part of a type system?

Let's consider the following switch expression:

switch (x):
    match c :Char:
        "It's a character"
    match i :Int:
        "It's an integer"
    match _:
        "I don't know what it is!"

The two guards, Char and Int, perform what amounts to a type discrimination. We might have an intuition that if x were to pass Char, then it would not pass Int, and vice versa; we might also have an intuition that the order of checking Char and Int does not matter. I'm going to formalize these and show how strong they can be in Monte.

When a coercion happens, the object being coerced is called the specimen. The result of the coercion is called the prize. You've already been introduced to the guard, the object which is performing the coercion.

It happens that a specimen might override a Miranda method, _conformTo/1, in order to pass guards that it cannot normally pass. We call all such specimens conforming. All specimens that pass a guard also conform to it, but some non-passing specimens might still be able to conform by yielding a prize to the guard.

Here's an axiom of guards: For all objects in Mont, if some object specimen conforms to a guard G, and def prize := G.coerce(specimen, _), then prize passes G. This cannot be proven by any sort of runtime assertion (yet?), but any guard that does not obey this axiom is faulty. One expects that a prize returned from a coercion passes the guard that was performing the coercion; without this assumption, it would be quite foolhardy to trust any guard at all!

With that in mind, let's talk about properties of guards. One useful property is idempotence. An idempotent guard G is one that, for all objects in Mont which pass G, any such object specimen has the equality G.coerce(specimen, _) == specimen. (Monte's equality, if you're unfamiliar with it, considers two objects to be equal if they cannot be distinguished by any response to any message sent at them. I could probably craft equivalency classes out of that rule at some point in the future.)

Why is idempotency good? Well, it formalizes the intuition that objects aren't altered when coerced if they're already "of the right type of object." I expect that if I pass 42 to a function that has the pattern x :Int, I might reasonably expect that x will get 42 bound to it, and not 420 or some other wrong number.

Monte's handling of state is impure. This complicates things. Since an object's internal state can vary, its willingness to respond to messages can vary. Let's be more precise in our definition of passing coercion. An object specimen passes coercion by a guard G if, for some combination of specimen and G internal states, G.coerce(specimen, _) == specimen. If specimen passes for all possible combinations of specimen and G internal states, then we say that specimen always passes coercion by G. (And if specimen cannot pass coercion with any possible combination of states, then it never passes.)

Now we can get to retractability. A idempotent guard G is unretractable if, for all objects in Mont which pass coercion by G, those objects always pass coercion by G. The converse property, that it's possible for some object to pass but not always pass coercion, would make G retractable.

An unretractable guard provides a very comfortable improvement over an idempotent one, similar to dipping your objects in DeepFrozen. I think that most of the interesting possibilities for guards come from unretractable guards. Most of the builtin guards are unretractable, too; data guards like Double and Str are good examples.

Theorem: An unretractable guard G partitions Mont into two disjoint subsets whose members always pass or never pass coercion by G, respectively. The proof is pretty trivial. This theorem lets us formalize the notion of a guard as protecting a section of code from unacceptable values; if Char is unretractable (and it is!), then a value guarded by Char is always going to be a character and never anything else. This theorem also gives us our first stab at a type declaration, where we might say something like "An object is of type Char if it passes Char."

Now let's go back to the beginning. We want to know how Char and Int interact. So, let's define some operations analagous to set union and intersection. The union of two unretractable guards G and H is written Any[G, H] and is defined as an unretractable guard that partitions Mont into the union of the two sets of objects that always pass G or H respectively, and all other objects. A similar definition can be created for the intersection of G and H, written All[G, H] and creating a similar partition with the intersection of the always-passing sets.

Both union and intersection are semigroups on the set of unretractable guards. (I haven't picked a name for this set yet. Maybe Mont-UG?) We can add in identity elements to get monoids. For union, we can use the hypothetical guard None, which refuses to pass any object in Mont, and for intersection, the completely real guard Any can be used.

object None:
    to coerce(_, ej):
        throw(ej, "None shall pass")

It gets better. The operations are also closed over Mont-UG, and it's possible to construct an inverse of any unretractable guard which is also an unretractable guard:

def invertUG(ug):
    return object invertedUG:
        to coerce(specimen, ej):
            escape innerEj:
                ug.coerce(specimen, innerEj)
                throw(ej, "Inverted")
            catch _:
                return specimen

This means that we have groups! Two lovely groups. They're both Abelian, too. Exciting stuff. And, in the big payoff of the day, we get two rings on Mont-UG, depending on whether you want to have union or intersection as your addition or multiplication.

This empowers a programmer, informally, to intuit that if Char and Int are disjoint (and, in this case, they are), then it might not matter in which order they are placed into the switch expression.

That's all for now!

June 30, 2015

So this will be the first in a series of blogs talking about some major initiatives we are doing for Fedora Workstation. Today I want to present and talk about a thing we call Pinos.

So what is Pinos? One of the original goals of Pinos was to provide the same level of advanced hardware handling for Video that PulseAudio provides for Audio. For those of you who has been around for a while you might remember how you once upon a time could only have one application using the sound card at the same time until PulseAudio properly fixed that. Well Pinos will allow you to share your video camera between multiple applications and also provide an easy to use API to do so.

Video providers and consumers are implemented as separate processes communicating with DBUS and exchanging video frames using fd passing.

Some features of Pinos

  • Easier switching of cameras in your applications
  • It will also allow you to more easily allow applications to switch between multiple cameras or mix the content from multiple sources.

  • Multiple types of video inputs
  • Supports more than cameras. Pinos also supports other type of video sources, for instance it can support your desktop as a video source.

  • GStreamer integration
  • Pinos is built using GStreamer and also have GStreamer elements supporting it to make integrating it into GStreamer applications simple and straightforward.

  • Pinos got some audio support
  • Well it tries to solve some of the same issues for video that PulseAudio solves for audio. Namely letting you have multiple applications sharing the same camera hardware. Pinos does also include audio support in order to let you handle both.

What do we want to do with this in Fedora Workstation?

  • One thing we know is of great use and importance for many of our users, including many developers who wants to make videos demonstrating their software, is to have better screen capture support. One of the test cases we are using for Pinos is to improve the built in screen casting capabilities of GNOME 3, the goal being to reducing overhead and to allow for easy setup of picture in picture capturing. So you can easily set it up so there will be a camera capturing your face and voice and mixing that into your screen recording.
  • Video support for Desktop Sandboxes. We have been working for a while on providing technology for sandboxing your desktop applications and while we with a little work can use PulseAudio for giving the sandboxed applications audio access we needed something similar for video. Pinos provides us with such a solution.

Who is working on this?
Pinos is being designed and written by Wim Taymans who is the co-creator of the GStreamer multimedia framework and also a regular contributor to the PulseAudio project. Wim is also the working for Red Hat as a Principal Engineer, being in charge of a lot of our multimedia support in both Red Hat Enterprise Linux and Fedora. It is also worth nothing that it draws many of its ideas from an early prototype by William Manley called PulseVideo and builds upon some of the code that was merged into GStreamer due to that effort.

Where can I get the code?
The code is currently hosteed in Wim’s private repository on freedesktop. You can get it at

How can I get involved or talk to the author
You can find Wim on Freenode IRC, he uses the name wtay and hangs out in both the #gstreamer and #pulseaudio IRC channels.
Once the project is a bit further along we will get some basic web presence set up and a mailing list created.


If Pinos contains Audio support will it eventually replace PulseAudio too?
Probably not, the usecases and goals for the two systems are somewhat different and it is not clear that trying to make Pinos accommodate all the PulseAudio usescases would be worth the effort or possible withour feature loss. So while there is always a temptation to think ‘hey, wouldn’t it be nice to have one system that can handle everything’ we are at this point unconvinced that the gain outweighs the pain.

Will Pinos offer re-directing kernel APIs for video devices like PulseAudio does for Audio? In order to handle legacy applications?
No, that was possible due to the way ALSA worked, but V4L2 doesn’t have such capabilities and thus we can not take advantage of them.

Why the name Pinos?
The code name for the project was PulseVideo, but to avoid confusion with the PulseAudio project and avoid people making to many assumptions based on the name we decided to follow in the tradition of Wayland and Weston and take inspiration from local place names related to the creator. So since Wim lives in Pinos de Alhaurin close to Malaga in Spain we decided to call the project Pinos. Pinos is the word for pines in Spanish :)

June 26, 2015
The 4.1 kernel release is still a few weeks off and hence a bit early to talk about 4.2. But the drm subsystem feature cut-off already passed and I'm going on vacation for 2 weeks, so here we go.

First things first: No, i915 does not yet support atomic modesets. But a lot of progress has been made again towards enabling it. As I explained last time around the trouble is that the intel driver has grown its own almost-atomic modeset infrastructure over the past few years. And now we need to convert that to the slightly different proper atomic support infrastructure merged into the drm core, which means lots and lots of small changes all over the driver. A big part merged in this release is the removal of the ->new_config pointer by Ander, Matt & Maarten. This was the old i915-specific pointer to the staged new configuration. Removing it required switching all the CRTC code over to handling the staged configuration stored in the struct drm_atomic_state to be compatible with the atomic core. Unfortunately we still need to do the same for encoder/connector states and for plane states, so there's still lots of shuffling pending for 4.2.

There has also been other feature work going on on the modeset side: Ville cleaned&fixed up the CDCLK support in anticipation of implementing dynamic display clock frequency scaling. Unfortunately that part of his patches hasn't landed yet. Ville has also merged patches to fix up some details in the CPT modeset sequence, maybe this will finally fix the remaining "DP port stuck" issues we still seem to have.

Looking at newer platforms the interesting bit is rotation support for SKL from Sonika and Tvrtko. Compared to older platforms skl now also supports 90° and 270° rotation in the scanout engines, but only when the framebuffer uses a special tiling layout (which have been enabled in 4.0). A related feature is support for plane/CRTC scalers on SKL, provided by Chandra. Skylake has also gained support for the new low-power display states DC5/6. For Broxton basic enabling has landed, but there's nothing too interesting yet besides piles of small adjustments all over. This is because Broxton and Skylake have a common display block (similar to how the render block for atom chips was already shared since Baytrail) and hence share a lot of the infrastructure code. Unfortunately neither of these platforms has yet left the preliminary hardware support label for the i915 driver.

There's also a few minor features in the display code worth mentioning: DP compliance testing infrastructure from Todd Previte - DP compliance test devices have a special DP AUX sidechannel protocol for requesting certain test procedures and hence need a bit of driver support. Most of this will be in userspace though, with the kernel just forward requests and handing back results. Mika Kahola has optimized the DP link training, the kernel will now first try to use the current values (either from a previous modeset or set up by the firmware). PSR has also seen some more work, unfortunately it's still not yet enabled by default. And finally there's been lots of cleanups and improvements under the hood all over, as usual.

A big feature is the dynamic pagetable allocation for gen8+ from Michel Thierry and Ben Widawsky. This will greatly reduce the overhead of PPGTT and is a requirement for 48bit address space support - with that big a VM preallocating all the pagetables is just not possible any more. The gen7 cmd parser is now finally fixed up and enabled by default (thanks to Rebecca Palmer for one crucial fix), which means finally some newer GL extensions can be used without adding kernel hacks. And Chris Wilson has fine-tuned the cmd parser with a big pile of patches to reduce the overhead. And Chris has tuned the RPS boost code more, it should now no longer erratically boost the GPU's clock when it's inappropriate. He has also written a lot of patches to reduce the overhead of execlist command submission, and some of those patches have been merged into this release.

Finally two pieces of prep work: A few patches from John Harrison to prepare for removing the outstanding lazy request. We've added this years ago as a cheap way out of a memory and ringbuffer space preallocation issue and ever since then paid the price for this with added complexity leaking all over the GEM code. Unfortunately the actual removal is still pending. And then Joonas Lahtinen has implemented partial GTT mmap support. This is needed for virtual enviroments like XenGT where the GTT is cut up between the different guests and hence badly fragmented. The merged code only supports linear views and still needs support for fenced buffer objects to be actually useful.

June 25, 2015

One of the bits we are currently finalising in libinput are touchpad gestures. Gestures on a normal touchscreens are left to the compositor and, in extension, to the client applications. Touchpad gestures are notably different though, they are bound to the location of the pointer or the keyboard focus (depending on the context) and they are less context-sensitive. Two fingers moving together on a touchscreen may be two windows being moved at the same time. On a touchpad however this is always a pinch.

Touchpad gestures are a lot more hardware-sensitive than touchscreens where we can just forward the touch points directly. On a touchpad we may have to consider software buttons or just HW-limitations of the touchpad. This prevents the implementation of touchpad gestures in a higher level - only libinput is aware of the location, size, etc. of software buttons.

Hence - touchpad gestures in libinput. The tree is currently sitting here and is being rebased as we go along, but we're expecting to merge this into master soon.

The interface itself is fairly simple: any device that may send gestures will have the LIBINPUT_DEVICE_CAP_GESTURE capability set. This is currently only implemented for touchpads but there is the potential to support this on other devices too. Two gestures are supported: swipe and pinch (+rotate). Both come with a finger count and both follow a Start/Update/End cycle. Gestures have a finger count that remains the same for the gestures, so if you switch from a two-finger pinch to a three-finger pinch you will see one gesture end and the next one start. Note that how to deal with this is up to the caller - it may very well consider this the same gesture semantically.

Swipe gestures have delta coordinates (horizontally and vertically) of the logical center of the gesture, compared to the previous event. A pinch gesture has the delta coordinates too and a delta angle (clockwise, in degrees). A pinch gesture also has the notion of an absolute scale, the Begin event always has a scale of 1.0 and that changes as the fingers move towards each other further apart. A scale of 2.0 means they're now twice as far apart as originally.

Nothing overly exciting really, it's a simple API that provides a couple of basic elements of data. Once integrated into the desktop properly, it should provide for some improved navigation. OS X has had this for a log time now and it's only time we caught up.

June 20, 2015

You’re a developer and you know AF_UNIX? You used it occasionally in your code, you know how high-level IPC puts marshaling on top and generally have a confident feeling when talking about it? But you actually have no clue what this fancy new kdbus is really about? During discussions you just nod along and hope nobody notices?


This is how it should be! As long as you don’t work on IPC libraries, there’s absolutely no requirement for you to have any idea what kdbus is. But as you’re reading this, I assume you’re curious and want to know more. So lets pick you up at AF_UNIX and look at a simple example.


Imagine a handful of processes that need to talk to each other. You have two options: Either you create a separate socket-pair between each two processes, or you create just one socket per process and make sure you can address all others via this socket. The first option will cause a quadratic growth of sockets and blows up if you raise the number of processes. Hence, we choose the latter, so our socket allocation looks like this:


Simple. Now we have to make sure the socket has a name and others can find it. We choose to not pollute the file-system but rather use the managed abstract namespace. As we don’t care for the exact names right now, we just let the kernel choose one. Furthermore, we enable credential-transmission so we can recognize peers that we get messages from:

struct sockaddr_un address = { .sun_family = AF_UNIX };
int enable = 1;

setsockopt(fd, SOL_SOCKET, SO_PASSCRED, &enable, sizeof(enable));
bind(fd, (struct sockaddr*)&address, sizeof(address.sun_family));

By omitting the sun_path part of the address, we tell the kernel to pick one itself. This was easy. Now we’re ready to go so lets see how we can send a message to a peer. For simplicity, we assume we know the address of the peer and it’s stored in destination.

struct sockaddr_un destination = { .sun_family = AF_UNIX, .sun_path = "..." };

sendto(fd, "foobar", 7, MSG_NOSIGNAL, (struct sockaddr*)&destination, sizeof(destination));

…and that’s all that is needed to send our message to the selected destination. On the receiver’s side, we call into recvmsg to receive the first message from our queue. We cannot use recvfrom as we want to fetch the credentials, too. Furthermore, we also cannot know how big the message is, so we query the kernel first and allocate a suitable buffer. This could be avoided, if we knew the maximum package size. But lets be thorough and support unlimited package sizes. Also note that recvmsg will return any next queued message. We cannot know the sender beforehand, so we also pass a buffer to store the address of the sender of this message:

char control[CMSG_SPACE(sizeof(struct ucred))];
struct sockaddr_un sender = {};
struct ucred creds = {};
struct msghdr msg = {};
struct iovec iov = {};
struct cmsghdr *cmsg;
char *message;
ssize_t l;
int size;

ioctl(fd, SIOCINQ, &size);
message = malloc(size + 1);
iov.iov_base = message;
iov.iov_len = size;

msg.msg_name = (struct sockaddr*)&sender;
msg.msg_namelen = sizeof(sender);
msg.msg_iov = &iov;
msg.msg_iovlen = 1;
msg.msg_control = control;
msg.msg_controllen = sizeof(control);

l = recvmsg(fd, msg, MSG_CMSG_CLOEXEC);

for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_CREDENTIALS)
                memcpy(&creds, CMSG_DATA(cmsg), sizeof(creds));

printf("Message: %s (length: %zd uid: %u sender: %s)\n",
       message, l, creds.uid, sender.sun_path + 1);

That’s it. With this in place, we can easily send arbitrary messages between our peers. We have no length restriction, we can identify the peers reliably and we’re not limited by any marshaling. Sure, we now dropped error-handling, event-loop integration and ignored some nasty corner cases, but that can all be solved. The code stays mostly the same.

Congratulations! You now understand kdbus. In kdbus:

  • socket(AF_UNIX, SOCK_DGRAM, 0) becomes open(“/sys/fs/kdbus/….”, …)
  • bind(fd, …, …) becomes ioctl(fd, KDBUS_CMD_HELLO, …)
  • sendto(fd, …) becomes ioctl(fd, KDBUS_CMD_SEND, …)
  • recvmsg(fd, …) becomes ioctl(fd, KDBUS_CMD_RECV, …)

Granted, the code will look slightly different. However, the concept stays the same. You still transmit raw messages, no marshaling is mandated. You can transmit credentials and file-descriptors, you can specify the peer to send messages to and you got to do that all through a single file descriptor. Doesn’t sound complex, does it?

So if kdbus is actually just like AF_UNIX+SOCK_DGRAM, why use it?


The AF_UNIX setup described above has some significant flaws:

  • Messages are buffered in the receive-queue, which is relatively small. If you send a message to a peer with a full receive-queue, you will get EAGAIN. However, since your socket is not connected, you cannot wait for POLLOUT as it does not exist for unconnected sockets (which peer would you wait for?). Hence, you’re basically screwed if remote peers do not clear their queues. This is a really severe restriction which makes this model useless for such setups.
  • There is no policy regarding who can talk to whom. In the abstract namespace, everyone can talk to anyone in the same network namespace. This can be avoided by placing sockets in the file system. However, then you lose the auto-cleanup feature of the abstract namespace. Furthermore, you’re limited to file-system policies, which might not be suitable.
  • You can only have a single name per socket. If you implement multiple services that should be reached via different names, then you need multiple sockets (and thus you lose global message ordering).
  • You cannot send broadcasts. You cannot even easily enumerate peers.
  • You cannot get notifications if peers die. However, this is crucial if you provide services to them and need to clean them up after they disconnected.

kdbus solves all these issues. Some of these issues could be solved with AF_UNIX (which, btw., would look a lot like AF_NETLINK), some cannot. But more importantly, AF_UNIX was never designed as a shared bus and hence should not be used as such. kdbus, on the other hand, was designed with a shared bus model in mind and as such avoids most of these issues.

With that basic understanding of kdbus, next time a news source reports about the “crazy idea to shove DBus into the kernel”, I hope you’ll be able to judge for yourself. And if you want to know more, I recommend building the kernel documentation, or diving into the code.

June 18, 2015

With the new v221 release of systemd we are declaring the sd-bus API shipped with systemd stable. sd-bus is our minimal D-Bus IPC C library, supporting as back-ends both classic socket-based D-Bus and kdbus. The library has been been part of systemd for a while, but has only been used internally, since we wanted to have the liberty to still make API changes without affecting external consumers of the library. However, now we are confident to commit to a stable API for it, starting with v221.

In this blog story I hope to provide you with a quick overview on sd-bus, a short reiteration on D-Bus and its concepts, as well as a few simple examples how to write D-Bus clients and services with it.

What is D-Bus again?

Let's start with a quick reminder what D-Bus actually is: it's a powerful, generic IPC system for Linux and other operating systems. It knows concepts like buses, objects, interfaces, methods, signals, properties. It provides you with fine-grained access control, a rich type system, discoverability, introspection, monitoring, reliable multicasting, service activation, file descriptor passing, and more. There are bindings for numerous programming languages that are used on Linux.

D-Bus has been a core component of Linux systems since more than 10 years. It is certainly the most widely established high-level local IPC system on Linux. Since systemd's inception it has been the IPC system it exposes its interfaces on. And even before systemd, it was the IPC system Upstart used to expose its interfaces. It is used by GNOME, by KDE and by a variety of system components.

D-Bus refers to both a specification, and a reference implementation. The reference implementation provides both a bus server component, as well as a client library. While there are multiple other, popular reimplementations of the client library – for both C and other programming languages –, the only commonly used server side is the one from the reference implementation. (However, the kdbus project is working on providing an alternative to this server implementation as a kernel component.)

D-Bus is mostly used as local IPC, on top of AF_UNIX sockets. However, the protocol may be used on top of TCP/IP as well. It does not natively support encryption, hence using D-Bus directly on TCP is usually not a good idea. It is possible to combine D-Bus with a transport like ssh in order to secure it. systemd uses this to make many of its APIs accessible remotely.

A frequently asked question about D-Bus is why it exists at all, given that AF_UNIX sockets and FIFOs already exist on UNIX and have been used for a long time successfully. To answer this question let's make a comparison with popular web technology of today: what AF_UNIX/FIFOs are to D-Bus, TCP is to HTTP/REST. While AF_UNIX sockets/FIFOs only shovel raw bytes between processes, D-Bus defines actual message encoding and adds concepts like method call transactions, an object system, security mechanisms, multicasting and more.

From our 10year+ experience with D-Bus we know today that while there are some areas where we can improve things (and we are working on that, both with kdbus and sd-bus), it generally appears to be a very well designed system, that stood the test of time, aged well and is widely established. Today, if we'd sit down and design a completely new IPC system incorporating all the experience and knowledge we gained with D-Bus, I am sure the result would be very close to what D-Bus already is.

Or in short: D-Bus is great. If you hack on a Linux project and need a local IPC, it should be your first choice. Not only because D-Bus is well designed, but also because there aren't many alternatives that can cover similar functionality.

Where does sd-bus fit in?

Let's discuss why sd-bus exists, how it compares with the other existing C D-Bus libraries and why it might be a library to consider for your project.

For C, there are two established, popular D-Bus libraries: libdbus, as it is shipped in the reference implementation of D-Bus, as well as GDBus, a component of GLib, the low-level tool library of GNOME.

Of the two libdbus is the much older one, as it was written at the time the specification was put together. The library was written with a focus on being portable and to be useful as back-end for higher-level language bindings. Both of these goals required the API to be very generic, resulting in a relatively baroque, hard-to-use API that lacks the bits that make it easy and fun to use from C. It provides the building blocks, but few tools to actually make it straightforward to build a house from them. On the other hand, the library is suitable for most use-cases (for example, it is OOM-safe making it suitable for writing lowest level system software), and is portable to operating systems like Windows or more exotic UNIXes.

GDBus is a much newer implementation. It has been written after considerable experience with using a GLib/GObject wrapper around libdbus. GDBus is implemented from scratch, shares no code with libdbus. Its design differs substantially from libdbus, it contains code generators to make it specifically easy to expose GObject objects on the bus, or talking to D-Bus objects as GObject objects. It translates D-Bus data types to GVariant, which is GLib's powerful data serialization format. If you are used to GLib-style programming then you'll feel right at home, hacking D-Bus services and clients with it is a lot simpler than using libdbus.

With sd-bus we now provide a third implementation, sharing no code with either libdbus or GDBus. For us, the focus was on providing kind of a middle ground between libdbus and GDBus: a low-level C library that actually is fun to work with, that has enough syntactic sugar to make it easy to write clients and services with, but on the other hand is more low-level than GDBus/GLib/GObject/GVariant. To be able to use it in systemd's various system-level components it needed to be OOM-safe and minimal. Another major point we wanted to focus on was supporting a kdbus back-end right from the beginning, in addition to the socket transport of the original D-Bus specification ("dbus1"). In fact, we wanted to design the library closer to kdbus' semantics than to dbus1's, wherever they are different, but still cover both transports nicely. In contrast to libdbus or GDBus portability is not a priority for sd-bus, instead we try to make the best of the Linux platform and expose specific Linux concepts wherever that is beneficial. Finally, performance was also an issue (though a secondary one): neither libdbus nor GDBus will win any speed records. We wanted to improve on performance (throughput and latency) -- but simplicity and correctness are more important to us. We believe the result of our work delivers our goals quite nicely: the library is fun to use, supports kdbus and sockets as back-end, is relatively minimal, and the performance is substantially better than both libdbus and GDBus.

To decide which of the three APIs to use for you C project, here are short guidelines:

  • If you hack on a GLib/GObject project, GDBus is definitely your first choice.

  • If portability to non-Linux kernels -- including Windows, Mac OS and other UNIXes -- is important to you, use either GDBus (which more or less means buying into GLib/GObject) or libdbus (which requires a lot of manual work).

  • Otherwise, sd-bus would be my recommended choice.

(I am not covering C++ specifically here, this is all about plain C only. But do note: if you use Qt, then QtDBus is the D-Bus API of choice, being a wrapper around libdbus.)

Introduction to D-Bus Concepts

To the uninitiated D-Bus usually appears to be a relatively opaque technology. It uses lots of concepts that appear unnecessarily complex and redundant on first sight. But actually, they make a lot of sense. Let's have a look:

  • A bus is where you look for IPC services. There are usually two kinds of buses: a system bus, of which there's exactly one per system, and which is where you'd look for system services; and a user bus, of which there's one per user, and which is where you'd look for user services, like the address book service or the mail program. (Originally, the user bus was actually a session bus -- so that you get multiple of them if you log in many times as the same user --, and on most setups it still is, but we are working on moving things to a true user bus, of which there is only one per user on a system, regardless how many times that user happens to log in.)

  • A service is a program that offers some IPC API on a bus. A service is identified by a name in reverse domain name notation. Thus, the org.freedesktop.NetworkManager service on the system bus is where NetworkManager's APIs are available and org.freedesktop.login1 on the system bus is where systemd-logind's APIs are exposed.

  • A client is a program that makes use of some IPC API on a bus. It talks to a service, monitors it and generally doesn't provide any services on its own. That said, lines are blurry and many services are also clients to other services. Frequently the term peer is used as a generalization to refer to either a service or a client.

  • An object path is an identifier for an object on a specific service. In a way this is comparable to a C pointer, since that's how you generally reference a C object, if you hack object-oriented programs in C. However, C pointers are just memory addresses, and passing memory addresses around to other processes would make little sense, since they of course refer to the address space of the service, the client couldn't make sense of it. Thus, the D-Bus designers came up with the object path concept, which is just a string that looks like a file system path. Example: /org/freedesktop/login1 is the object path of the 'manager' object of the org.freedesktop.login1 service (which, as we remember from above, is still the service systemd-logind exposes). Because object paths are structured like file system paths they can be neatly arranged in a tree, so that you end up with a venerable tree of objects. For example, you'll find all user sessions systemd-logind manages below the /org/freedesktop/login1/session sub-tree, for example called /org/freedesktop/login1/session/_7, /org/freedesktop/login1/session/_55 and so on. How services precisely label their objects and arrange them in a tree is completely up to the developers of the services.

  • Each object that is identified by an object path has one or more interfaces. An interface is a collection of signals, methods, and properties (collectively called members), that belong together. The concept of a D-Bus interface is actually pretty much identical to what you know from programming languages such as Java, which also know an interface concept. Which interfaces an object implements are up the developers of the service. Interface names are in reverse domain name notation, much like service names. (Yes, that's admittedly confusing, in particular since it's pretty common for simpler services to reuse the service name string also as an interface name.) A couple of interfaces are standardized though and you'll find them available on many of the objects offered by the various services. Specifically, those are org.freedesktop.DBus.Introspectable, org.freedesktop.DBus.Peer and org.freedesktop.DBus.Properties.

  • An interface can contain methods. The word "method" is more or less just a fancy word for "function", and is a term used pretty much the same way in object-oriented languages such as Java. The most common interaction between D-Bus peers is that one peer invokes one of these methods on another peer and gets a reply. A D-Bus method takes a couple of parameters, and returns others. The parameters are transmitted in a type-safe way, and the type information is included in the introspection data you can query from each object. Usually, method names (and the other member types) follow a CamelCase syntax. For example, systemd-logind exposes an ActivateSession method on the org.freedesktop.login1.Manager interface that is available on the /org/freedesktop/login1 object of the org.freedesktop.login1 service.

  • A signature describes a set of parameters a function (or signal, property, see below) takes or returns. It's a series of characters that each encode one parameter by its type. The set of types available is pretty powerful. For example, there are simpler types like s for string, or u for 32bit integer, but also complex types such as as for an array of strings or a(sb) for an array of structures consisting of one string and one boolean each. See the D-Bus specification for the full explanation of the type system. The ActivateSession method mentioned above takes a single string as parameter (the parameter signature is hence s), and returns nothing (the return signature is hence the empty string). Of course, the signature can get a lot more complex, see below for more examples.

  • A signal is another member type that the D-Bus object system knows. Much like a method it has a signature. However, they serve different purposes. While in a method call a single client issues a request on a single service, and that service sends back a response to the client, signals are for general notification of peers. Services send them out when they want to tell one or more peers on the bus that something happened or changed. In contrast to method calls and their replies they are hence usually broadcast over a bus. While method calls/replies are used for duplex one-to-one communication, signals are usually used for simplex one-to-many communication (note however that that's not a requirement, they can also be used one-to-one). Example: systemd-logind broadcasts a SessionNew signal from its manager object each time a user logs in, and a SessionRemoved signal every time a user logs out.

  • A property is the third member type that the D-Bus object system knows. It's similar to the property concept known by languages like C#. Properties also have a signature, and are more or less just variables that an object exposes, that can be read or altered by clients. Example: systemd-logind exposes a property Docked of the signature b (a boolean). It reflects whether systemd-logind thinks the system is currently in a docking station of some form (only applies to laptops …).

So much for the various concepts D-Bus knows. Of course, all these new concepts might be overwhelming. Let's look at them from a different perspective. I assume many of the readers have an understanding of today's web technology, specifically HTTP and REST. Let's try to compare the concept of a HTTP request with the concept of a D-Bus method call:

  • A HTTP request you issue on a specific network. It could be the Internet, or it could be your local LAN, or a company VPN. Depending on which network you issue the request on, you'll be able to talk to a different set of servers. This is not unlike the "bus" concept of D-Bus.

  • On the network you then pick a specific HTTP server to talk to. That's roughly comparable to picking a service on a specific bus.

  • On the HTTP server you then ask for a specific URL. The "path" part of the URL (by which I mean everything after the host name of the server, up to the last "/") is pretty similar to a D-Bus object path.

  • The "file" part of the URL (by which I mean everything after the last slash, following the path, as described above), then defines the actual call to make. In D-Bus this could be mapped to an interface and method name.

  • Finally, the parameters of a HTTP call follow the path after the "?", they map to the signature of the D-Bus call.

Of course, comparing an HTTP request to a D-Bus method call is a bit comparing apples and oranges. However, I think it's still useful to get a bit of a feeling of what maps to what.

From the shell

So much about the concepts and the gray theory behind them. Let's make this exciting, let's actually see how this feels on a real system.

Since a while systemd has included a tool busctl that is useful to explore and interact with the D-Bus object system. When invoked without parameters, it will show you a list of all peers connected to the system bus. (Use --user to see the peers of your user bus instead):

$ busctl
NAME                                       PID PROCESS         USER             CONNECTION    UNIT                      SESSION    DESCRIPTION
:1.1                                         1 systemd         root             :1.1          -                         -          -
:1.11                                      705 NetworkManager  root             :1.11         NetworkManager.service    -          -
:1.14                                      744 gdm             root             :1.14         gdm.service               -          -
:1.4                                       708 systemd-logind  root             :1.4          systemd-logind.service    -          -
:1.7200                                  17563 busctl          lennart          :1.7200       session-1.scope           1          -
org.freedesktop.NetworkManager             705 NetworkManager  root             :1.11         NetworkManager.service    -          -
org.freedesktop.login1                     708 systemd-logind  root             :1.4          systemd-logind.service    -          -
org.freedesktop.systemd1                     1 systemd         root             :1.1          -                         -          -
org.gnome.DisplayManager                   744 gdm             root             :1.14         gdm.service               -          -

(I have shortened the output a bit, to make keep things brief).

The list begins with a list of all peers currently connected to the bus. They are identified by peer names like ":1.11". These are called unique names in D-Bus nomenclature. Basically, every peer has a unique name, and they are assigned automatically when a peer connects to the bus. They are much like an IP address if you so will. You'll notice that a couple of peers are already connected, including our little busctl tool itself as well as a number of system services. The list then shows all actual services on the bus, identified by their service names (as discussed above; to discern them from the unique names these are also called well-known names). In many ways well-known names are similar to DNS host names, i.e. they are a friendlier way to reference a peer, but on the lower level they just map to an IP address, or in this comparison the unique name. Much like you can connect to a host on the Internet by either its host name or its IP address, you can also connect to a bus peer either by its unique or its well-known name. (Note that each peer can have as many well-known names as it likes, much like an IP address can have multiple host names referring to it).

OK, that's already kinda cool. Try it for yourself, on your local machine (all you need is a recent, systemd-based distribution).

Let's now go the next step. Let's see which objects the org.freedesktop.login1 service actually offers:

$ busctl tree org.freedesktop.login1
  │ ├─/org/freedesktop/login1/seat/seat0
  │ └─/org/freedesktop/login1/seat/self
  │ ├─/org/freedesktop/login1/session/_31
  │ └─/org/freedesktop/login1/session/self

Pretty, isn't it? What's actually even nicer, and which the output does not show is that there's full command line completion available: as you press TAB the shell will auto-complete the service names for you. It's a real pleasure to explore your D-Bus objects that way!

The output shows some objects that you might recognize from the explanations above. Now, let's go further. Let's see what interfaces, methods, signals and properties one of these objects actually exposes:

$ busctl introspect org.freedesktop.login1 /org/freedesktop/login1/session/_31
NAME                                TYPE      SIGNATURE RESULT/VALUE                             FLAGS
org.freedesktop.DBus.Introspectable interface -         -                                        -
.Introspect                         method    -         s                                        -
org.freedesktop.DBus.Peer           interface -         -                                        -
.GetMachineId                       method    -         s                                        -
.Ping                               method    -         -                                        -
org.freedesktop.DBus.Properties     interface -         -                                        -
.Get                                method    ss        v                                        -
.GetAll                             method    s         a{sv}                                    -
.Set                                method    ssv       -                                        -
.PropertiesChanged                  signal    sa{sv}as  -                                        -
org.freedesktop.login1.Session      interface -         -                                        -
.Activate                           method    -         -                                        -
.Kill                               method    si        -                                        -
.Lock                               method    -         -                                        -
.PauseDeviceComplete                method    uu        -                                        -
.ReleaseControl                     method    -         -                                        -
.ReleaseDevice                      method    uu        -                                        -
.SetIdleHint                        method    b         -                                        -
.TakeControl                        method    b         -                                        -
.TakeDevice                         method    uu        hb                                       -
.Terminate                          method    -         -                                        -
.Unlock                             method    -         -                                        -
.Active                             property  b         true                                     emits-change
.Audit                              property  u         1                                        const
.Class                              property  s         "user"                                   const
.Desktop                            property  s         ""                                       const
.Display                            property  s         ""                                       const
.Id                                 property  s         "1"                                      const
.IdleHint                           property  b         true                                     emits-change
.IdleSinceHint                      property  t         1434494624206001                         emits-change
.IdleSinceHintMonotonic             property  t         0                                        emits-change
.Leader                             property  u         762                                      const
.Name                               property  s         "lennart"                                const
.Remote                             property  b         false                                    const
.RemoteHost                         property  s         ""                                       const
.RemoteUser                         property  s         ""                                       const
.Scope                              property  s         "session-1.scope"                        const
.Seat                               property  (so)      "seat0" "/org/freedesktop/login1/seat... const
.Service                            property  s         "gdm-autologin"                          const
.State                              property  s         "active"                                 -
.TTY                                property  s         "/dev/tty1"                              const
.Timestamp                          property  t         1434494630344367                         const
.TimestampMonotonic                 property  t         34814579                                 const
.Type                               property  s         "x11"                                    const
.User                               property  (uo)      1000 "/org/freedesktop/login1/user/_1... const
.VTNr                               property  u         1                                        const
.Lock                               signal    -         -                                        -
.PauseDevice                        signal    uus       -                                        -
.ResumeDevice                       signal    uuh       -                                        -
.Unlock                             signal    -         -                                        -

As before, the busctl command supports command line completion, hence both the service name and the object path used are easily put together on the shell simply by pressing TAB. The output shows the methods, properties, signals of one of the session objects that are currently made available by systemd-logind. There's a section for each interface the object knows. The second column tells you what kind of member is shown in the line. The third column shows the signature of the member. In case of method calls that's the input parameters, the fourth column shows what is returned. For properties, the fourth column encodes the current value of them.

So far, we just explored. Let's take the next step now: let's become active - let's call a method:

# busctl call org.freedesktop.login1 /org/freedesktop/login1/session/_31 org.freedesktop.login1.Session Lock

I don't think I need to mention this anymore, but anyway: again there's full command line completion available. The third argument is the interface name, the fourth the method name, both can be easily completed by pressing TAB. In this case we picked the Lock method, which activates the screen lock for the specific session. And yupp, the instant I pressed enter on this line my screen lock turned on (this only works on DEs that correctly hook into systemd-logind for this to work. GNOME works fine, and KDE should work too).

The Lock method call we picked is very simple, as it takes no parameters and returns none. Of course, it can get more complicated for some calls. Here's another example, this time using one of systemd's own bus calls, to start an arbitrary system unit:

# busctl call org.freedesktop.systemd1 /org/freedesktop/systemd1 org.freedesktop.systemd1.Manager StartUnit ss "cups.service" "replace"
o "/org/freedesktop/systemd1/job/42684"

This call takes two strings as input parameters, as we denote in the signature string that follows the method name (as usual, command line completion helps you getting this right). Following the signature the next two parameters are simply the two strings to pass. The specified signature string hence indicates what comes next. systemd's StartUnit method call takes the unit name to start as first parameter, and the mode in which to start it as second. The call returned a single object path value. It is encoded the same way as the input parameter: a signature (just o for the object path) followed by the actual value.

Of course, some method call parameters can get a ton more complex, but with busctl it's relatively easy to encode them all. See the man page for details.

busctl knows a number of other operations. For example, you can use it to monitor D-Bus traffic as it happens (including generating a .cap file for use with Wireshark!) or you can set or get specific properties. However, this blog story was supposed to be about sd-bus, not busctl, hence let's cut this short here, and let me direct you to the man page in case you want to know more about the tool.

busctl (like the rest of system) is implemented using the sd-bus API. Thus it exposes many of the features of sd-bus itself. For example, you can use to connect to remote or container buses. It understands both kdbus and classic D-Bus, and more!


But enough! Let's get back on topic, let's talk about sd-bus itself.

The sd-bus set of APIs is mostly contained in the header file sd-bus.h.

Here's a random selection of features of the library, that make it compare well with the other implementations available.

  • Supports both kdbus and dbus1 as back-end.

  • Has high-level support for connecting to remote buses via ssh, and to buses of local OS containers.

  • Powerful credential model, to implement authentication of clients in services. Currently 34 individual fields are supported, from the PID of the client to the cgroup or capability sets.

  • Support for tracking the life-cycle of peers in order to release local objects automatically when all peers referencing them disconnected.

  • The client builds an efficient decision tree to determine which handlers to deliver an incoming bus message to.

  • Automatically translates D-Bus errors into UNIX style errors and back (this is lossy though), to ensure best integration of D-Bus into low-level Linux programs.

  • Powerful but lightweight object model for exposing local objects on the bus. Automatically generates introspection as necessary.

The API is currently not fully documented, but we are working on completing the set of manual pages. For details see all pages starting with sd_bus_.

Invoking a Method, from C, with sd-bus

So much about the library in general. Here's an example for connecting to the bus and issuing a method call:

#include <stdio.h>
#include <stdlib.h>
#include <systemd/sd-bus.h>

int main(int argc, char *argv[]) {
        sd_bus_error error = SD_BUS_ERROR_NULL;
        sd_bus_message *m = NULL;
        sd_bus *bus = NULL;
        const char *path;
        int r;

        /* Connect to the system bus */
        r = sd_bus_open_system(&bus);
        if (r < 0) {
                fprintf(stderr, "Failed to connect to system bus: %s\n", strerror(-r));
                goto finish;

        /* Issue the method call and store the respons message in m */
        r = sd_bus_call_method(bus,
                               "org.freedesktop.systemd1",           /* service to contact */
                               "/org/freedesktop/systemd1",          /* object path */
                               "org.freedesktop.systemd1.Manager",   /* interface name */
                               "StartUnit",                          /* method name */
                               &error,                               /* object to return error in */
                               &m,                                   /* return message on success */
                               "ss",                                 /* input signature */
                               "cups.service",                       /* first argument */
                               "replace");                           /* second argument */
        if (r < 0) {
                fprintf(stderr, "Failed to issue method call: %s\n", error.message);
                goto finish;

        /* Parse the response message */
        r = sd_bus_message_read(m, "o", &path);
        if (r < 0) {
                fprintf(stderr, "Failed to parse response message: %s\n", strerror(-r));
                goto finish;

        printf("Queued service job as %s.\n", path);


        return r < 0 ? EXIT_FAILURE : EXIT_SUCCESS;

Save this example as bus-client.c, then build it with:

$ gcc bus-client.c -o bus-client `pkg-config --cflags --libs libsystemd`

This will generate a binary bus-client you can now run. Make sure to run it as root though, since access to the StartUnit method is privileged:

# ./bus-client
Queued service job as /org/freedesktop/systemd1/job/3586.

And that's it already, our first example. It showed how we invoked a method call on the bus. The actual function call of the method is very close to the busctl command line we used before. I hope the code excerpt needs little further explanation. It's supposed to give you a taste how to write D-Bus clients with sd-bus. For more more information please have a look at the header file, the man page or even the sd-bus sources.

Implementing a Service, in C, with sd-bus

Of course, just calling a single method is a rather simplistic example. Let's have a look on how to write a bus service. We'll write a small calculator service, that exposes a single object, which implements an interface that exposes two methods: one to multiply two 64bit signed integers, and one to divide one 64bit signed integer by another.

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <systemd/sd-bus.h>

static int method_multiply(sd_bus_message *m, void *userdata, sd_bus_error *ret_error) {
        int64_t x, y;
        int r;

        /* Read the parameters */
        r = sd_bus_message_read(m, "xx", &x, &y);
        if (r < 0) {
                fprintf(stderr, "Failed to parse parameters: %s\n", strerror(-r));
                return r;

        /* Reply with the response */
        return sd_bus_reply_method_return(m, "x", x * y);

static int method_divide(sd_bus_message *m, void *userdata, sd_bus_error *ret_error) {
        int64_t x, y;
        int r;

        /* Read the parameters */
        r = sd_bus_message_read(m, "xx", &x, &y);
        if (r < 0) {
                fprintf(stderr, "Failed to parse parameters: %s\n", strerror(-r));
                return r;

        /* Return an error on division by zero */
        if (y == 0) {
                sd_bus_error_set_const(ret_error, "net.poettering.DivisionByZero", "Sorry, can't allow division by zero.");
                return -EINVAL;

        return sd_bus_reply_method_return(m, "x", x / y);

/* The vtable of our little object, implements the net.poettering.Calculator interface */
static const sd_bus_vtable calculator_vtable[] = {
        SD_BUS_METHOD("Multiply", "xx", "x", method_multiply, SD_BUS_VTABLE_UNPRIVILEGED),
        SD_BUS_METHOD("Divide",   "xx", "x", method_divide,   SD_BUS_VTABLE_UNPRIVILEGED),

int main(int argc, char *argv[]) {
        sd_bus_slot *slot = NULL;
        sd_bus *bus = NULL;
        int r;

        /* Connect to the user bus this time */
        r = sd_bus_open_user(&bus);
        if (r < 0) {
                fprintf(stderr, "Failed to connect to system bus: %s\n", strerror(-r));
                goto finish;

        /* Install the object */
        r = sd_bus_add_object_vtable(bus,
                                     "/net/poettering/Calculator",  /* object path */
                                     "net.poettering.Calculator",   /* interface name */
        if (r < 0) {
                fprintf(stderr, "Failed to issue method call: %s\n", strerror(-r));
                goto finish;

        /* Take a well-known service name so that clients can find us */
        r = sd_bus_request_name(bus, "net.poettering.Calculator", 0);
        if (r < 0) {
                fprintf(stderr, "Failed to acquire service name: %s\n", strerror(-r));
                goto finish;

        for (;;) {
                /* Process requests */
                r = sd_bus_process(bus, NULL);
                if (r < 0) {
                        fprintf(stderr, "Failed to process bus: %s\n", strerror(-r));
                        goto finish;
                if (r > 0) /* we processed a request, try to process another one, right-away */

                /* Wait for the next request to process */
                r = sd_bus_wait(bus, (uint64_t) -1);
                if (r < 0) {
                        fprintf(stderr, "Failed to wait on bus: %s\n", strerror(-r));
                        goto finish;


        return r < 0 ? EXIT_FAILURE : EXIT_SUCCESS;

Save this example as bus-service.c, then build it with:

$ gcc bus-service.c -o bus-service `pkg-config --cflags --libs libsystemd`

Now, let's run it:

$ ./bus-service

In another terminal, let's try to talk to it. Note that this service is now on the user bus, not on the system bus as before. We do this for simplicity reasons: on the system bus access to services is tightly controlled so unprivileged clients cannot request privileged operations. On the user bus however things are simpler: as only processes of the user owning the bus can connect no further policy enforcement will complicate this example. Because the service is on the user bus, we have to pass the --user switch on the busctl command line. Let's start with looking at the service's object tree.

$ busctl --user tree net.poettering.Calculator

As we can see, there's only a single object on the service, which is not surprising, given that our code above only registered one. Let's see the interfaces and the members this object exposes:

$ busctl --user introspect net.poettering.Calculator /net/poettering/Calculator
NAME                                TYPE      SIGNATURE RESULT/VALUE FLAGS
net.poettering.Calculator           interface -         -            -
.Divide                             method    xx        x            -
.Multiply                           method    xx        x            -
org.freedesktop.DBus.Introspectable interface -         -            -
.Introspect                         method    -         s            -
org.freedesktop.DBus.Peer           interface -         -            -
.GetMachineId                       method    -         s            -
.Ping                               method    -         -            -
org.freedesktop.DBus.Properties     interface -         -            -
.Get                                method    ss        v            -
.GetAll                             method    s         a{sv}        -
.Set                                method    ssv       -            -
.PropertiesChanged                  signal    sa{sv}as  -            -

The sd-bus library automatically added a couple of generic interfaces, as mentioned above. But the first interface we see is actually the one we added! It shows our two methods, and both take "xx" (two 64bit signed integers) as input parameters, and return one "x". Great! But does it work?

$ busctl --user call net.poettering.Calculator /net/poettering/Calculator net.poettering.Calculator Multiply xx 5 7
x 35

Woohoo! We passed the two integers 5 and 7, and the service actually multiplied them for us and returned a single integer 35! Let's try the other method:

$ busctl --user call net.poettering.Calculator /net/poettering/Calculator net.poettering.Calculator Divide xx 99 17
x 5

Oh, wow! It can even do integer division! Fantastic! But let's trick it into dividing by zero:

$ busctl --user call net.poettering.Calculator /net/poettering/Calculator net.poettering.Calculator Divide xx 43 0
Sorry, can't allow division by zero.

Nice! It detected this nicely and returned a clean error about it. If you look in the source code example above you'll see how precisely we generated the error.

And that's really all I have for today. Of course, the examples I showed are short, and I don't get into detail here on what precisely each line does. However, this is supposed to be a short introduction into D-Bus and sd-bus, and it's already way too long for that …

I hope this blog story was useful to you. If you are interested in using sd-bus for your own programs, I hope this gets you started. If you have further questions, check the (incomplete) man pages, and inquire us on IRC or the systemd mailing list. If you need more examples, have a look at the systemd source tree, all of systemd's many bus services use sd-bus extensively.

June 16, 2015

Recently, I've been fighting with the never ending issue of timezones. I never thought I would have plunged into this rabbit hole, but hacking on OpenStack and Gnocchi I felt into that trap easily is, thanks to Python.

“Why you really, really, should never ever deal with timezones”

To get a glimpse of the complexity of timezones, I recommend that you watch Tom Scott's video on the subject. It's fun and it summarizes remarkably well the nightmare that timezones are and why you should stop thinking that you're smart.

The importance of timezones in applications

Once you've heard what Tom says, I think it gets pretty clear that a timestamp without any timezone attached does not give any useful information. It should be considered irrelevant and useless. Without the necessary context given by the timezone, you cannot infer what point in time your application is really referring to.

That means your application should never handle timestamps with no timezone information. It should try to guess or raises an error if no timezone is provided in any input.

Of course, you can infer that having no timezone information means UTC. This sounds very handy, but can also be dangerous in certain applications or language – such as Python, as we'll see.

Indeed, in certain applications, converting timestamps to UTC and losing the timezone information is a terrible idea. Imagine that a user create a recurring event every Wednesday at 10:00 in its local timezone, say CET. If you convert that to UTC, the event will end up being stored as every Wednesday at 09:00.

Now imagine that the CET timezone switches from UTC+01:00 to UTC+02:00: your application will compute that the event starts at 11:00 CET every Wednesday. Which is wrong, because as the user told you, the event starts at 10:00 CET, whatever the definition of CET is. Not at 11:00 CET. So CET means CET, not necessarily UTC+1.

As for endpoints like REST API, a thing I daily deal with, all timestamps should include a timezone information. It's nearly impossible to know what timezone the timestamps are in otherwise: UTC? Server local? User local? No way to know.

Python design & defect

Python comes with a timestamp object named datetime.datetime. It can store date and time precise to the microsecond, and is qualified of timezone "aware" or "unaware", whether it embeds a timezone information or not.

To build such an object based on the current time, one can use datetime.datetime.utcnow() to retrieve the date and time for the UTC timezone, and to retrieve the date and time for the current timezone, whatever it is.

>>> import datetime
>>> datetime.datetime.utcnow()
datetime.datetime(2015, 6, 15, 13, 24, 48, 27631)
datetime.datetime(2015, 6, 15, 15, 24, 52, 276161)

As you can notice, none of these results contains timezone information. Indeed, Python datetime API always returns unaware datetime objects, which is very unfortunate. Indeed, as soon as you get one of this object, there is no way to know what the timezone is, therefore these objects are pretty "useless" on their own.

Armin Ronacher proposes that an application always consider that the unaware datetime objects from Python are considered as UTC. As we just saw, that statement cannot be considered true for objects returned by, so I would not advise doing so. datetime objects with no timezone should be considered as a "bug" in the application.


My recommendation list comes down to:

  1. Always use aware datetime object, i.e. with timezone information. That makes sure you can compare them directly (aware and unaware datetime objects are not comparable) and will return them correctly to users. Leverage pytz to have timezone objects.
  2. Use ISO 8601 as input and output string format. Use datetime.datetime.isoformat() to return timestamps as string formatted using that format, which includes the timezone information.

In Python, that's equivalent to having:

>>> import datetime
>>> import pytz
>>> def utcnow():
>>> utcnow()
datetime.datetime(2015, 6, 15, 14, 45, 19, 182703, tzinfo=<UTC>)
>>> utcnow().isoformat()

If you need to parse strings containing ISO 8601 formatted timestamp, you can rely on the iso8601, which returns timestamps with correct timezone information. This makes timestamps directly comparable:

>>> import iso8601
>>> iso8601.parse_date(utcnow().isoformat())
datetime.datetime(2015, 6, 15, 14, 46, 43, 945813, tzinfo=<FixedOffset '+00:00' datetime.timedelta(0)>)
>>> iso8601.parse_date(utcnow().isoformat()) < utcnow()

If you need to store those timestamps, the same rule should apply. If you rely on MongoDB, it assumes that all the timestamp are in UTC, so be careful when storing them – you will have to normalize the timestamp to UTC.

For MySQL, nothing is assumed, it's up to the application to insert them in a timezone that makes sense to it. Obviously, if you have multiple applications accessing the same database with different data sources, this can end up being a nightmare.

PostgreSQL has a special data type that is recommended called timestamp with timezone, and which can store the timezone associated, and do all the computation for you. That's the recommended way to store them obviously. That does not mean you should not use UTC in most cases; that just means you are sure that the timestamp are stored in UTC since it's written in the database, and you check if any other application inserted timestamps with different timezone.

OpenStack status

As a side note, I've improved OpenStack situation recently by changing the oslo.utils.timeutils module to deprecate some useless and dangerous functions. I've also added support for returning timezone aware objects when using the oslo_utils.timeutils.utcnow() function. It's not possible to make it a default unfortunately for backward compatibility reason, but it's there nevertheless, and it's advised to use it. Thanks to my colleague Victor for the help!

Have a nice day, whatever your timezone is!


This weekend my work on implementing NVIDIA global performance counters has been merged in Nouveau.

With Linux 4.2, Nouveau will allow the userspace to monitor both compute and graphics (global) counters for Tesla, but only compute counters for Fermi. I need to go back to Windows for reverse engineering graphics counters with NVIDIA Perfkit. About Kepler, I have to figure out how to deal with clock gating but this is not going to be hard, so I’ll probably submit a series which adds compute counters this month.

All of these performance counters will be exposed through the Gallium’s HUD and GL_AMD_performance_monitor once I have finished writing the code in mesa.

But don’t be too excited for the moment, because we still need to implement the new nvif interface exposed by Nouveau in libdrm.

My plan is to complete all of this work before the XDC 2015.


June 12, 2015

Last post and the one before were about how to create your own piglit tests. Previously, I have written an introduction to piglit and how to launch a tailored piglit run (more details about these last two topics in my FOSDEM 2015 talk).

Now it’s time to talk about how to contribute to piglit.

How to contribute to piglit

Once you want to contribute something to piglit, you need to generate the patches and send them for review to the mailing list. They are usually created by git format-patch and sent by git send-email command (if you need help with git, there are a lot of tutorials). Remember to rebase your branch against up-to-dated master before creating the patches, so no merge conflicts will appear if the reviewers wants to apply them locally.

Whether you have some patches ready to be submitted or you have questions about piglit, subscribe to and send them there.

Most piglit developers work in other areas (such as OpenGL driver development!) which means that the review process of piglit patches could take some time, so be patient and wait.

If after some time (something like one or two weeks) there is no answer about your patches, you can send a reminder saying that a review is pending. If you have commit rights and the patch is trivial (or you are very confident that it is right), you can even push it to the repository’s master branch after that time. Piglit is not as strict as other projects in this regard, however do not abuse of this rule.

Once you have a good track of contributions or other contributors told you to do so, you can ask for piglit repository’s commit rights by following these instructions. And don’t hesitate to review patches from others!

Table of contents

This is the list of my piglit related posts:

  1. Piglit, an open-source test suite for OpenGL implementations
  2. piglit (II): How to launch a tailored piglit run
  3. piglit (III): How to write GLSL shader tests
  4. piglit (IV): How to write binary test programs
  5. piglit (V): how to contribute to piglit and table of contents

Plus my FOSDEM 2015 talk.

Thanks for following this short introduction to piglit. Happy hacking!

June 11, 2015

Last post I talked about how to develop GLSL shader tests and how to add them to piglit. This is a nice way to develop simple tests but sometimes you need to do something more complex. For that case, piglit can run binary test programs.


Binary test programs in piglit are written in C language taking advantage of the piglit framework to facilitate the test development (there is no main() function, no need to setup GLX (or WGL or EGL or whatever), no need to check manually the available extensions or call to glXSwapBuffers() or similar functions…), however you can still use all OpenGL C-language API.

Simple example

Piglit framework is mostly undocumented but easy to understand once you start reading some existing tests. So I will start with a simple example explaining how a typical binary test looks like and then show you a real example.

 * Copyright © 2015 Intel Corporation
 * Permission is hereby granted, free of charge, to any person obtaining a
 * copy of this software and associated documentation files (the "Software"),
 * to deal in the Software without restriction, including without limitation
 * the rights to use, copy, modify, merge, publish, distribute, sublicense,
 * and/or sell copies of the Software, and to permit persons to whom the
 * Software is furnished to do so, subject to the following conditions:
 * The above copyright notice and this permission notice (including the next
 * paragraph) shall be included in all copies or substantial portions of the
 * Software.

/** @file test-example.c
 * This test shows an skeleton for creating new tests.

#include "piglit-util-gl.h"

    config.window_width = 100;
    config.window_height = 100;
    config.supports_gl_compat_version = 10;
    config.supports_gl_core_version = 31;


static const char vs_pass_thru_text[] =
    "#version 330n"
    "in vec4 piglit_vertex;n"
    "void main() {n"
    "   gl_Position = piglit_vertex;n"

static const char fs_source[] =
    "#version 330n"
    "void main() {n"
    "       color = vec4(1.0, 0.0. 0.0, 1.0);n"

GLuint prog;

piglit_init(int argc, char **argv)
    bool pass = true;

    /* piglit_require_extension("GL_ARB_shader_storage_buffer_object"); */

    prog = piglit_build_simple_program(vs_pass_thru_text, fs_source);


    glClearColor(0, 0, 0, 0);

    /* <-- OpenGL commands to be done --> */

    glViewport(0, 0, piglit_width, piglit_height);

    /* piglit_draw_* commands can go into piglit_display() too */
    piglit_draw_rect(-1, -1, 2, 2);

    if (!piglit_check_gl_error(GL_NO_ERROR))
       pass = false;

    piglit_report_result(pass ? PIGLIT_PASS : PIGLIT_FAIL);

enum piglit_result piglit_display(void)
    /* <-- OpenGL drawing commands, if needed --> */

    /* UNREACHED */
    return PIGLIT_FAIL;

As you see in this example, there are four different parts:

  1. License header and description of the test
    • The license details should be included in each source file. There is one agreed by most contributors and it’s a MIT license assigning the copyright to Intel Corporation. More information in COPYING file.
    • It includes a brief description of what the test does.

     * Copyright © 2015 Intel Corporation
     * Permission is hereby granted, free of charge, to any person obtaining a
     * copy of this software and associated documentation files (the "Software"),
     * to deal in the Software without restriction, including without limitation
     * the rights to use, copy, modify, merge, publish, distribute, sublicense,
     * and/or sell copies of the Software, and to permit persons to whom the
     * Software is furnished to do so, subject to the following conditions:
     * The above copyright notice and this permission notice (including the next
     * paragraph) shall be included in all copies or substantial portions of the
     * Software.
    /** @file test-example.c
     * This test shows an skeleton for creating new tests.

  2. Piglit setup. This is needed to check if a test can be executed by a given driver (minimum supported GL version), or to create a window of a specific size, or even to define if we want double buffering.

        config.window_width = 100;
        config.window_height = 100;
        config.supports_gl_compat_version = 10;
        config.supports_gl_core_version = 31;
        config.window_visual = PIGLIT_GL_VISUAL_DOUBLE | PIGLIT_GL_VISUAL_RGBA;

  3. piglit_init(). This is the function that it is going to be called for configuring the test itself. Some tests implement all their code inside piglit_init() because it doesn’t need to draw anything (or it doesn’t need to update the drawing frame by frame). In any case, you usually put here the following code:
    • Check for needed extensions.
    • Check for limits or maximum values of different variables like GL_MAX_SHADER_STORAGE_BLOCKS, GL_UNIFORM_BLOCK_SIZE, etc.
    • Setup constant data, upload it.
    • All the initialization setup you need for drawing commands: compile and link shaders, set clear color, etc.

    piglit_init(int argc, char **argv)
        bool pass = true;
        /* piglit_require_extension("GL_ARB_shader_storage_buffer_object"); */
        prog = piglit_build_simple_program(vs_pass_thru_text, fs_source);
        glClearColor(0, 0, 0, 0);
        /* <-- OpenGL commands to be done --> */
        glViewport(0, 0, piglit_width, piglit_height);
        /* piglit_draw_* commands can go into piglit_display() too */
        piglit_draw_rect(-1, -1, 2, 2);
        if (!piglit_check_gl_error(GL_NO_ERROR))
           pass = false;
        piglit_report_result(pass ? PIGLIT_PASS : PIGLIT_FAIL);

  4. piglit_display(). This is the function that it is going to be executed periodically to update each frame of the rendered window. In some tests, you will find it almost empty (it returns PIGLIT_FAIL) because it is not needed by the test program.

    enum piglit_result piglit_display(void)
        /* <-- OpenGL drawing commands, if needed --> */
        /* UNREACHED */
        return PIGLIT_FAIL;

Notice that you are free to add any helper functions you need like any other C program but the aforementioned parts are required by piglit.

Piglit API

Piglit provides a lot of functions under its API to be used by the test program. They are usually often-used functions that substitute one or several OpenGL function calls and other code that accompany them.

The available functions are listed in file, which must be included in every binary test source code.

 * Copyright (c) The Piglit project 2007
 * Permission is hereby granted, free of charge, to any person obtaining a
 * copy of this software and associated documentation files (the "Software"),
 * to deal in the Software without restriction, including without limitation
 * on the rights to use, copy, modify, merge, publish, distribute, sub
 * license, and/or sell copies of the Software, and to permit persons to whom
 * the Software is furnished to do so, subject to the following conditions:
 * The above copyright notice and this permission notice (including the next
 * paragraph) shall be included in all copies or substantial portions of the
 * Software.

#pragma once
#ifndef __PIGLIT_UTIL_GL_H__
#define __PIGLIT_UTIL_GL_H__

#ifdef __cplusplus
extern "C" {

#include "piglit-util.h"

#include <piglit/gl_wrap.h>
#include <piglit/glut_wrap.h>

#define piglit_get_proc_address(x) piglit_dispatch_resolve_function(x)

#include "piglit-framework-gl.h"
#include "piglit-shader.h"

extern const uint8_t fdo_bitmap[];
extern const unsigned int fdo_bitmap_width;
extern const unsigned int fdo_bitmap_height;

extern bool piglit_is_core_profile;

 * Determine if the API is OpenGL ES.
bool piglit_is_gles(void);

 * \brief Get version of OpenGL or OpenGL ES API.
 * Returned version is multiplied by 10 to make it an integer.  So for
 * example, if the GL version is 2.1, the return value is 21.
int piglit_get_gl_version(void);

 * \precondition name is not null
bool piglit_is_extension_supported(const char *name);

 * reinitialize the supported extension List.
void piglit_gl_reinitialize_extensions();

 * \brief Convert a GL error to a string.
 * For example, given GL_INVALID_ENUM, return "GL_INVALID_ENUM".
 * Return "(unrecognized error)" if the enum is not recognized.
const char* piglit_get_gl_error_name(GLenum error);

 * \brief Convert a GL enum to a string.
 * For example, given GL_INVALID_ENUM, return "GL_INVALID_ENUM".
 * Return "(unrecognized enum)" if the enum is not recognized.
const char *piglit_get_gl_enum_name(GLenum param);

 * \brief Convert a string to a GL enum.
 * For example, given "GL_INVALID_ENUM", return GL_INVALID_ENUM.
 * abort() if the string is not recognized.
GLenum piglit_get_gl_enum_from_name(const char *name);

 * \brief Convert a GL primitive type enum value to a string.
 * For example, given GL_POLYGON, return "GL_POLYGON".
 * We don't use piglit_get_gl_enum_name() for this because there are
 * other enums which alias the prim type enums (ex: GL_POINTS = GL_NONE);
 * Return "(unrecognized enum)" if the enum is not recognized.
const char *piglit_get_prim_name(GLenum prim);

 * \brief Check for unexpected GL errors.
 * If glGetError() returns an error other than \c expected_error, then
 * print a diagnostic and return GL_FALSE.  Otherwise return GL_TRUE.
piglit_check_gl_error_(GLenum expected_error, const char *file, unsigned line);

#define piglit_check_gl_error(expected) \
 piglit_check_gl_error_((expected), __FILE__, __LINE__)

 * \brief Drain all GL errors.
 * Repeatly call glGetError and discard errors until it returns GL_NO_ERROR.
void piglit_reset_gl_error(void);

void piglit_require_gl_version(int required_version_times_10);
void piglit_require_extension(const char *name);
void piglit_require_not_extension(const char *name);
unsigned piglit_num_components(GLenum base_format);
bool piglit_get_luminance_intensity_bits(GLenum internalformat, int *bits);
int piglit_probe_pixel_rgb_silent(int x, int y, const float* expected, float *out_probe);
int piglit_probe_pixel_rgba_silent(int x, int y, const float* expected, float *out_probe);
int piglit_probe_pixel_rgb(int x, int y, const float* expected);
int piglit_probe_pixel_rgba(int x, int y, const float* expected);
int piglit_probe_rect_r_ubyte(int x, int y, int w, int h, GLubyte expected);
int piglit_probe_rect_rgb(int x, int y, int w, int h, const float* expected);
int piglit_probe_rect_rgb_silent(int x, int y, int w, int h, const float *expected);
int piglit_probe_rect_rgba(int x, int y, int w, int h, const float* expected);
int piglit_probe_rect_rgba_int(int x, int y, int w, int h, const int* expected);
int piglit_probe_rect_rgba_uint(int x, int y, int w, int h, const unsigned int* expected);
void piglit_compute_probe_tolerance(GLenum format, float *tolerance);
int piglit_compare_images_color(int x, int y, int w, int h, int num_components,
                const float *tolerance,
                const float *expected_image,
                const float *observed_image);
int piglit_probe_image_color(int x, int y, int w, int h, GLenum format, const float *image);
int piglit_probe_image_rgb(int x, int y, int w, int h, const float *image);
int piglit_probe_image_rgba(int x, int y, int w, int h, const float *image);
int piglit_compare_images_ubyte(int x, int y, int w, int h,
                const GLubyte *expected_image,
                const GLubyte *observed_image);
int piglit_probe_image_stencil(int x, int y, int w, int h, const GLubyte *image);
int piglit_probe_image_ubyte(int x, int y, int w, int h, GLenum format,
                 const GLubyte *image);
int piglit_probe_texel_rect_rgb(int target, int level, int x, int y,
                int w, int h, const float *expected);
int piglit_probe_texel_rgb(int target, int level, int x, int y,
               const float* expected);
int piglit_probe_texel_rect_rgba(int target, int level, int x, int y,
                 int w, int h, const float *expected);
int piglit_probe_texel_rgba(int target, int level, int x, int y,
                const float* expected);
int piglit_probe_texel_volume_rgba(int target, int level, int x, int y, int z,
                 int w, int h, int d, const float *expected);
int piglit_probe_pixel_depth(int x, int y, float expected);
int piglit_probe_rect_depth(int x, int y, int w, int h, float expected);
int piglit_probe_pixel_stencil(int x, int y, unsigned expected);
int piglit_probe_rect_stencil(int x, int y, int w, int h, unsigned expected);
int piglit_probe_rect_halves_equal_rgba(int x, int y, int w, int h);

bool piglit_probe_buffer(GLuint buf, GLenum target, const char *label,
             unsigned n, unsigned num_components,
             const float *expected);

int piglit_use_fragment_program(void);
int piglit_use_vertex_program(void);
void piglit_require_fragment_program(void);
void piglit_require_vertex_program(void);
GLuint piglit_compile_program(GLenum target, const char* text);
GLvoid piglit_draw_triangle(float x1, float y1, float x2, float y2,
                float x3, float y3);
GLvoid piglit_draw_triangle_z(float z, float x1, float y1, float x2, float y2,
                  float x3, float y3);
GLvoid piglit_draw_rect_custom(float x, float y, float w, float h,
                   bool use_patches);
GLvoid piglit_draw_rect(float x, float y, float w, float h);
GLvoid piglit_draw_rect_z(float z, float x, float y, float w, float h);
GLvoid piglit_draw_rect_tex(float x, float y, float w, float h,
                            float tx, float ty, float tw, float th);
GLvoid piglit_draw_rect_back(float x, float y, float w, float h);
void piglit_draw_rect_from_arrays(const void *verts, const void *tex,
                  bool use_patches);

unsigned short piglit_half_from_float(float val);

void piglit_escape_exit_key(unsigned char key, int x, int y);

void piglit_gen_ortho_projection(double left, double right, double bottom,
                 double top, double near_val, double far_val,
                 GLboolean push);
void piglit_ortho_projection(int w, int h, GLboolean push);
void piglit_frustum_projection(GLboolean push, double l, double r, double b,
                   double t, double n, double f);
void piglit_gen_ortho_uniform(GLint location, double left, double right,
                  double bottom, double top, double near_val,
                  double far_val);
void piglit_ortho_uniform(GLint location, int w, int h);

GLuint piglit_checkerboard_texture(GLuint tex, unsigned level,
    unsigned width, unsigned height,
    unsigned horiz_square_size, unsigned vert_square_size,
    const float *black, const float *white);
GLuint piglit_miptree_texture(void);
GLfloat *piglit_rgbw_image(GLenum internalFormat, int w, int h,
                           GLboolean alpha, GLenum basetype);
GLubyte *piglit_rgbw_image_ubyte(int w, int h, GLboolean alpha);
GLuint piglit_rgbw_texture(GLenum internalFormat, int w, int h, GLboolean mip,
            GLboolean alpha, GLenum basetype);
GLuint piglit_depth_texture(GLenum target, GLenum format, int w, int h, int d, GLboolean mip);
GLuint piglit_array_texture(GLenum target, GLenum format, int w, int h, int d, GLboolean mip);
GLuint piglit_multisample_texture(GLenum target, GLenum tex,
                  GLenum internalFormat,
                  unsigned width, unsigned height,
                  unsigned depth, unsigned samples,
                  GLenum format, GLenum type, void *data);
extern float piglit_tolerance[4];
void piglit_set_tolerance_for_bits(int rbits, int gbits, int bbits, int abits);
extern void piglit_require_transform_feedback(void);

piglit_get_compressed_block_size(GLenum format,
                 unsigned *bw, unsigned *bh, unsigned *bytes);

piglit_compressed_image_size(GLenum format, unsigned width, unsigned height);

piglit_compressed_pixel_offset(GLenum format, unsigned width,
                   unsigned x, unsigned y);

piglit_visualize_image(float *img, GLenum base_internal_format,
               int image_width, int image_height,
               int image_count, bool rhs);

float piglit_srgb_to_linear(float x);
float piglit_linear_to_srgb(float x);

extern GLfloat cube_face_texcoords[6][4][3];
extern const char *cube_face_names[6];
extern const GLenum cube_face_targets[6];

 * Common vertex program code to perform a model-view-project matrix transform
    "ATTRIB iPos = vertex.position;\n"      \
    "OUTPUT oPos = result.position;\n"      \
    "PARAM  mvp[4] = { state.matrix.mvp };\n"   \
    "DP4    oPos.x, mvp[0], iPos;\n"        \
    "DP4    oPos.y, mvp[1], iPos;\n"        \
    "DP4    oPos.z, mvp[2], iPos;\n"        \
    "DP4    oPos.w, mvp[3], iPos;\n"

 * Handle to a generic fragment program that passes the input color to output
 * \note
 * Either \c piglit_use_fragment_program or \c piglit_require_fragment_program
 * must be called before using this program handle.
extern GLint piglit_ARBfp_pass_through;

static const GLint PIGLIT_ATTRIB_POS = 0;
static const GLint PIGLIT_ATTRIB_TEX = 1;

 * Given a GLSL version number, return the lowest-numbered GL version
 * that is guaranteed to support it.
required_gl_version_from_glsl_version(unsigned glsl_version);

#ifdef __cplusplus
} /* end extern "C" */

#endif /* __PIGLIT_UTIL_GL_H__ */

Most functions are undocumented although there are a lot of examples of how to use it in other piglit tests. Furthermore, once you know which function you need, it is usually straightforward to learn how to call it.

Just to mention a few: you can request extensions (piglit_require_extension()) or a GL version (piglit_require_gl_version()), compile program (piglit_compile_program()), draw a rectangle or triangle, read pixel values and compare them with a list of expected values, check for GL errors, etc.

piglit-shader.h has all the shader-related functions: compile a shader, link a simple program, build (i.e. compile and link) a simple program with vertex and fragment shaders, require a specific GLSL version, etc.

 * Copyright (c) The Piglit project 2007
 * Permission is hereby granted, free of charge, to any person obtaining a
 * copy of this software and associated documentation files (the "Software"),
 * to deal in the Software without restriction, including without limitation
 * on the rights to use, copy, modify, merge, publish, distribute, sub
 * license, and/or sell copies of the Software, and to permit persons to whom
 * the Software is furnished to do so, subject to the following conditions:
 * The above copyright notice and this permission notice (including the next
 * paragraph) shall be included in all copies or substantial portions of the
 * Software.

#pragma once

 * Null parameters are ignored.
 * \param es Is it GLSL ES?
void piglit_get_glsl_version(bool *es, int* major, int* minor);

GLuint piglit_compile_shader(GLenum target, const char *filename);
GLuint piglit_compile_shader_text_nothrow(GLenum target, const char *text);
GLuint piglit_compile_shader_text(GLenum target, const char *text);
GLboolean piglit_link_check_status(GLint prog);
GLboolean piglit_link_check_status_quiet(GLint prog);
GLint piglit_link_simple_program(GLint vs, GLint fs);
GLint piglit_build_simple_program(const char *vs_source, const char *fs_source);
GLuint piglit_build_simple_program_unlinked(const char *vs_source,
                        const char *fs_source);
GLint piglit_link_simple_program_multiple_shaders(GLint shader1, ...);
GLint piglit_build_simple_program_unlinked_multiple_shaders_v(GLenum target1,
                                 const char*source1,
                                 va_list ap);
GLint piglit_build_simple_program_unlinked_multiple_shaders(GLenum target1,
                               const char *source1,
GLint piglit_build_simple_program_multiple_shaders(GLenum target1,
                          const char *source1,

extern GLboolean piglit_program_pipeline_check_status(GLuint pipeline);
extern GLboolean piglit_program_pipeline_check_status_quiet(GLuint pipeline);

 * Require a specific version of GLSL.
 * \param version Integer version, for example 130
extern void piglit_require_GLSL_version(int version);
/** Require any version of GLSL */
extern void piglit_require_GLSL(void);
extern void piglit_require_fragment_shader(void);
extern void piglit_require_vertex_shader(void);

There are more header files such as piglit-glx-util.h, piglit-matrix.h, piglit-util-egl.h, etc.

Usually, you only need to add piglit-util-gl.h to your source code, however I recommend you to browse through tests/util/ so you find out all the available functions that piglit provides.


A complete example of how a piglit binary test looks like is ARB_uniform_buffer_object rendering test.

 * Copyright (c) 2014 VMware, Inc.
 * Permission is hereby granted, free of charge, to any person obtaining a
 * copy of this software and associated documentation files (the "Software"),
 * to deal in the Software without restriction, including without limitation
 * the rights to use, copy, modify, merge, publish, distribute, sublicense,
 * and/or sell copies of the Software, and to permit persons to whom the
 * Software is furnished to do so, subject to the following conditions:
 * The above copyright notice and this permission notice (including the next
 * paragraph) shall be included in all copies or substantial portions of the
 * Software.

/** @file rendering.c
 * Test rendering with UBOs.  We draw four squares with different positions,
 * sizes, rotations and colors where those parameters come from UBOs.

#include "piglit-util-gl.h"


    config.supports_gl_compat_version = 20;


static const char vert_shader_text[] =
    "#extension GL_ARB_uniform_buffer_object : require\n"
    "layout(std140) uniform;\n"
    "uniform ub_pos_size { vec2 pos; float size; };\n"
    "uniform ub_rot {float rotation; };\n"
    "void main()\n"
    "   mat2 m;\n"
    "   m[0][0] = m[1][1] = cos(rotation); \n"
    "   m[0][1] = sin(rotation); \n"
    "   m[1][0] = -m[0][1]; \n"
    "   gl_Position.xy = m * gl_Vertex.xy * vec2(size) + pos;\n"
    " = vec2(0, 1);\n"

static const char frag_shader_text[] =
    "#extension GL_ARB_uniform_buffer_object : require\n"
    "layout(std140) uniform;\n"
    "uniform ub_color { vec4 color; float color_scale; };\n"
    "void main()\n"
    "   gl_FragColor = color * color_scale;\n"

#define NUM_SQUARES 4
#define NUM_UBOS 3

/* Square positions and sizes */
static const float pos_size[NUM_SQUARES][3] = {
    { -0.5, -0.5, 0.1 },
    {  0.5, -0.5, 0.2 },
    { -0.5, 0.5, 0.3 },
    {  0.5, 0.5, 0.4 }

/* Square color and color_scales */
static const float color[NUM_SQUARES][8] = {
    { 2.0, 0.0, 0.0, 1.0,   0.50, 0.0, 0.0, 0.0 },
    { 0.0, 4.0, 0.0, 1.0,   0.25, 0.0, 0.0, 0.0 },
    { 0.0, 0.0, 5.0, 1.0,   0.20, 0.0, 0.0, 0.0 },
    { 0.2, 0.2, 0.2, 0.2,   5.00, 0.0, 0.0, 0.0 }

/* Square rotations */
static const float rotation[NUM_SQUARES] = {

static GLuint prog;
static GLuint buffers[NUM_UBOS];
static GLint alignment;
static bool test_buffer_offset = false;

static void
    static const char *names[NUM_UBOS] = {
    static GLubyte zeros[1000] = {0};
    int i;

    glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, &alignment);
    printf("GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT = %d\n", alignment);

    if (test_buffer_offset) {
        printf("Testing buffer offset %d\n", alignment);
    else {
        /* we use alignment as the offset */
        alignment = 0;

    glGenBuffers(NUM_UBOS, buffers);

    for (i = 0; i < NUM_UBOS; i++) {
        GLint index, size;

        /* query UBO index */
        index = glGetUniformBlockIndex(prog, names[i]);

        /* query UBO size */
        glGetActiveUniformBlockiv(prog, index,
                      GL_UNIFORM_BLOCK_DATA_SIZE, &size);

        printf("UBO %s: index = %d, size = %d\n",
               names[i], index, size);

        /* Allocate UBO */
        /* XXX for some reason, this test doesn't work at all with
         * nvidia if we pass NULL instead of zeros here.  The UBO data
         * is set/overwritten in the piglit_display() function so this
         * really shouldn't matter.
        glBindBuffer(GL_UNIFORM_BUFFER, buffers[i]);
        glBufferData(GL_UNIFORM_BUFFER, size + alignment,
                             zeros, GL_DYNAMIC_DRAW);

        /* Attach UBO */
        glBindBufferRange(GL_UNIFORM_BUFFER, i, buffers[i],
                  alignment,  /* offset */
        glUniformBlockBinding(prog, index, i);

        if (!piglit_check_gl_error(GL_NO_ERROR))

piglit_init(int argc, char **argv)

    if (argc > 1 && strcmp(argv[1], "offset") == 0) {
        test_buffer_offset = true;

    prog = piglit_build_simple_program(vert_shader_text, frag_shader_text);


    glClearColor(0.2, 0.2, 0.2, 0.2);

static bool
probe(int x, int y, int color_index)
    float expected[4];

    /* mul color by color_scale */
    expected[0] = color[color_index][0] * color[color_index][4];
    expected[1] = color[color_index][1] * color[color_index][4];
    expected[2] = color[color_index][2] * color[color_index][4];
    expected[3] = color[color_index][3] * color[color_index][4];

    return piglit_probe_pixel_rgba(x, y, expected);

enum piglit_result
    bool pass = true;
    int x0 = piglit_width / 4;
    int x1 = piglit_width * 3 / 4;
    int y0 = piglit_height / 4;
    int y1 = piglit_height * 3 / 4;
    int i;

    glViewport(0, 0, piglit_width, piglit_height);


    for (i = 0; i < NUM_SQUARES; i++) {
        /* Load UBO data, at offset=alignment */
        glBindBuffer(GL_UNIFORM_BUFFER, buffers[0]);
        glBufferSubData(GL_UNIFORM_BUFFER, alignment, sizeof(pos_size[0]),
        glBindBuffer(GL_UNIFORM_BUFFER, buffers[1]);
        glBufferSubData(GL_UNIFORM_BUFFER, alignment, sizeof(color[0]),
        glBindBuffer(GL_UNIFORM_BUFFER, buffers[2]);
        glBufferSubData(GL_UNIFORM_BUFFER, alignment, sizeof(rotation[0]),

        if (!piglit_check_gl_error(GL_NO_ERROR))
            return PIGLIT_FAIL;

        piglit_draw_rect(-1, -1, 2, 2);

    pass = probe(x0, y0, 0) && pass;
    pass = probe(x1, y0, 1) && pass;
    pass = probe(x0, y1, 2) && pass;
    pass = probe(x1, y1, 3) && pass;


    return pass ? PIGLIT_PASS : PIGLIT_FAIL;

The source starts with the license header and describing what the test does. Following those, it’s the config setup for piglit: request doube buffer, RGBA pixel format and GL compat version 2.0.

Furthermore, this test defines GLSL shaders, the global data and helper functions as any other OpenGL C-language program. Notice that setup_ubos() include calls to OpenGL API but also calls to piglit_check_gl_error() and piglit_report_result() which are used to check for any OpenGL error and tell piglit that there was a failure, respectively.

Following the structure introduced before, piglit_init() indicates that ARB_uniform_buffer_object extension is required, it builds the program with the aforementioned shaders, setups the uniform buffer objects and sets the clear color.

Finally in piglit_display() is where the relevant content is placed. Among other things, it loads UBO’s data, draw a rectangle and check that the rendered pixels have the same values as the expected ones. Depending of the result of that check, it will report to piglit that the test was a success or a failure.

How to add a binary test to piglit

How to build it

Now that you have written the source code of your test, it’s time to learn how to build it. As we have seen in an earlier post, piglit uses cmake for generating binaries.

First you need to check if cmake tracks the directory where your source code is. If you create a new extension subdirectory under tests/spec, modify this CMakeLists.txt file and add yours.

Once you have done that, cmake needs to know that there are binaries to build and where to get its source code. For doing that, create a CMakeLists.txt file in your test’s directory with the following content:


If you have binaries inside directories pending the former, you can add them to the same CMakeLists.txt file you are creating:

add_subdirectory (compiler)
add_subdirectory (execution)
add_subdirectory (linker)

But this is not enough to compile your test as we have not specified which binary is and where to get its source code. We do that by creating a new file with a similar content than this one.


link_libraries (

piglit_add_executable (arb_shader_storage_buffer_object-minmax minmax.c)

# vim: ft=cmake:

As you see, first we declare where to find the needed headers and libraries. Then we define the binary name (arb_shader_storage_buffer_object-minmax) and which is its source code file (minmax.c).

And that’s it. If everything is fine, next time you run cmake . && make in piglit’s directory, piglit will build the test and place it inside bin/ directory.

Example in piglit repository

Let’s review a real example of how a test was added for a new extension in piglit (this commit). As you see, it added tests/spec/arb_robustness/ subdirectory to tests/spec/CMakeLists.txt, create a tests/spec/arb_robustness/CMakeLists.txt to indicate cmake to track this directory, tests/spec/arb_robustness/ to compile the binary test file and add its source code file tests/spec/arb_robustness/draw-vbo-bounds.c.

tests/spec/CMakeLists.txt | 1 +
 tests/spec/arb_robustness/ | 19 +++++++++++++++++++
 tests/spec/arb_robustness/CMakeLists.txt | 1 +
 tests/spec/arb_robustness/draw-vbo-bounds.c | 205 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 226 insertions(+)

If you run git log tests/spec/<dir> command in other directories, you will find similar commits.

How to run it with profile

Once you successfully build the binary test program you can run it standalone. However, it’s better to add it to tests/ profile so it is executed automatically by this profile.

Open tests/  and look for your extension name, then add your test by looking at  how other tests are defined and adapt it to your case. In case you are developing tests for a new extension, you have to add more lines to tests/ For example this is how was done for ARB_shader_storage_buffer_object extension:

with profile.group_manager(
        grouptools.join('spec', 'arb_shader_storage_buffer_object')) as g:
    g(['arb_shader_storage_buffer_object-minmax'], 'minmax')

These lines make several things: indicate under which extension you will find the results, which are the binaries to run and which short name will appear in the summaries for each binary.

ARB_shader_storage_buffer_object's minmax test result in HTML summaryARB_shader_storage_buffer_object’s minmax test in HTML summary


This post and last one were about how to create your own piglit tests. I hope you will find them useful and I will be glad to see new contributors because of them :-)

Next post will talk about how to send your contributions to piglit and a table of contents so you can easily jump to any post of this series.

June 05, 2015

I've recently found myself explaining polkit (formerly PolicyKit) to one of Collabora's clients, and thought that blogging about the same topic might be useful for other people who are confused by it; so, here is why udisks2 and polkit are the way they are.

As always, opinions in this blog are my own, not Collabora's.

Privileged actions

Broadly, there are two ways a process can do something: it can do it directly (i.e. ask the kernel directly), or it can use inter-process communication to ask a service to do that operation on its behalf. If it does it directly, the components that say whether it can succeed are the Linux kernel's normal permissions checks (DAC), and if configured, AppArmor, SELinux or a similar MAC layer. All very simple so far.

Unfortunately, the kernel's relatively coarse-grained checks are not sufficient to express the sorts of policies that exist on a desktop/laptop/mobile system. My favourite example for this sort of thing is mounting filesystems. If I plug in a USB stick with a FAT filesystem, it's reasonable to expect my chosen user interface to either mount it automatically, or let me press a button to mount it. Similarly, to avoid data loss, I should be able to unmount it when I'm finished with it. However, mounting and unmounting a USB stick is fundamentally the same system call as mounting and unmounting any other filesystem - and if ordinary users can do arbitrary mount system calls, they can cause all sorts of chaos, for instance by mounting a filesystem that contains setuid executables (privilege escalation), or umounting a critical OS filesystem like /usr (denial of service). Something needs to arbitrate: “you can mount filesystems, but only under certain conditions”.

The kernel developer motto for this sort of thing is “mechanism, not policy”: they are very keen to avoid encoding particular environments' policies (the sort of thing you could think of as “business rules”) in the kernel, because that makes it non-generic and hard to maintain. As a result, direct mount/unmount actions are only allowed by privileged processes, and it's up to user-space processes to arrange for a privileged process to make the desired mount syscall.

Here are some other privileged actions which laptop/desktop users can reasonably expect to “just work”, with or without requiring a sysadmin-like (root-equivalent) user:

  • reconfiguring networking (privileged because, in the general case, it's an availability and potentially integrity issue)
  • installing, upgrading or removing packages (privileged because, in the general case, it can result in arbitrary root code execution)
  • suspending or shutting down the system (privileged because you wouldn't want random people doing this on your server, but should normally be allowed on e.g. laptops for people with physical access, because they could just disconnect the power anyway)

In environments that use a MAC framework like AppArmor, actions that would normally be allowed can become privileged: for instance, in a framework for sandboxed applications, most apps shouldn't be allowed to record audio. This prevents carrying out these actions directly, again resulting in the only way to achieve them being to ask a service to carry out the action.

Ask a system service to do it

On to the next design, then: I can submit a request to a privileged process, which does some checks to make sure I'm not trying to break the system (or alternatively, that I have enough sysadmin rights that I'm allowed to break the system if I want to), and then does the privileged action for me.

You might think I'm about to start discussing D-Bus and daemons, but actually, a prominent earlier implementation of this was mount(8), which is normally setuid root:

% ls -l /bin/mount
-rwsr-xr-x 1 root root 40000 May 22 11:37 /bin/mount

If you look at it from an odd angle, this is inter-process communication across a privilege boundary: I run the setuid executable, creating a process. Because the executable has the setuid bit set, the kernel makes the process highly privileged: its effective uid is root, and it has all the necessary capabilities to mount filesystems. I submit the request by passing it in the command-line arguments. mount does some checks - specifically, it looks in /etc/fstab to see whether the filesystem I'm trying to mount has the “user” or “users” flag - then carries out the mount system call.

There are a few obvious problems with this:

  • When machines had a static set of hardware devices (and a sysadmin who knew how to configure them), it might have made sense to list them all in /etc/fstab; but this is not a useful solution if you can plug in any number of USB drives, or if you are a non-expert user with Linux on your laptop. The decision ought to be based on general attributes of devices, such as “is removable?”, and on the role of the machine.
  • Setuid executables are alarmingly easy to get wrong so it is not necessarily wise to assume that mount(8) is safe to be setuid.
  • One fact that a reasonable security policy might include is “users who are logged in remotely should have less control over physically present devices than those who are physically present” - but that sort of thing can't be checked by mount(8) without specifically teaching the mount binary about it.

Ask a system service to do it, via D-Bus or other IPC

To avoid the issues of setuid, we could use inter-process communication in the traditional sense: run a privileged daemon (on boot or on-demand), make it listen for requests, and use the IPC channel as our privilege boundary.

udisks2 is one such privileged daemon, which uses D-Bus as its IPC channel. D-Bus is a commonly-used inter-process system; one of its intended/designed uses is to let user processes and system services communicate, especially this sort of communication between a privileged daemon and its less-privileged clients.

People sometimes criticize D-Bus as not doing anything you couldn't do yourself with some AF_UNIX sockets. Well, no, of course it doesn't - the important bit of the reference implementation and the various interoperable reimplementations consists of a daemon and some AF_UNIX sockets, and the rest is a simple matter of programming. However, it's sufficient for most uses in its problem space, and is usually better than inventing your own.

The advantage of D-Bus over doing your own thing is precisely that you are not doing your own thing: good IPC design is hard, and D-Bus makes some structural decisions so that fewer application authors have to think about them. For instance, it has a central “hub” daemon (the dbus-daemon, or “message bus”) so that n communicating applications don't need O(n²) sockets; it uses the dbus-daemon to provide a total message ordering so you don't have to think about message reordering; it has a distributed naming model (which can also be used as a distributed mutex) so you don't have to design that; it has a serialization format and a type system so you don't have to design one of those; it has a framework for “activating" run-on-demand daemons so they don't have to use resources initially, implemented using a setuid helper and/or systemd; and so on.

If you have religious objections to D-Bus, you can mentally replace “D-Bus” with “AF_UNIX or something” and most of this article will still be true.

Is this OK?

In either case - exec'ing a privileged helper, or submitting a request to a privileged daemon via IPC - the privileged process has two questions that it needs to answer before it does its work:

  • what am I being asked to do?
  • should I do it?

It needs to make some sort of decision on the latter based on the information available to it. However, before we even get there, there is another layer:

  • did the request get there at all?

In the setuid model, there is a simple security check that you can apply: you can make /bin/mount only executable by a particular group, or only executable by certain AppArmor profiles, or similar. That works up to a point, but cannot distinguish between physically-present and not-physically-present users, or other facts that might be interesting to your local security policy. Similarly, in the IPC model, you can make certain communication channels impossible, for instance by using dbus-daemon's ability to decide which messages to deliver, or AF_UNIX sockets' filesystem permissions, or a MAC framework like AppArmor.

Both of these are quite “coarse-grained” checks which don't really understand the finer details of what is going on. If the answer to "is this safe?” is something of the form “maybe, it depends on...”, then they can't do the right thing: they must either let it through and let the domain-specific privileged process do the check, or deny it and lose potentially useful functionality.

For instance, in an AppArmor environment, some applications have absolutely no legitimate reason to talk to udisks2, so the AppArmor policy can just block it altogether. However, once again, this is a coarse-grained check: the kernel has mechanism, not policy, and it doesn't know what the service does or why. If the application does need to be able to talk to the service at all, then finer-grained access control (obeying some, but not all, requests) has to be the service's job.

dbus-daemon does have the ability to match messages in a relatively fine-grained way, based on the object path, interface and member in the message, as well as the routing information that it uses itself (i.e. the source and destination). However, it is not clear that this makes a great deal of sense conceptually: these are facts about the mechanics of the IPC, not facts about the domain-specific request (because the mechanics of the IPC are all that dbus-daemon understands). For instance, taking the udisks2 example again, dbus-daemon can't distinguish between an attempt to adjust mount options for a USB stick (probably fine) and an attempt to adjust mount options for /usr (not good).

To have a domain-specific security policy, we need a domain-specific component, for instance udisks2, to get involved. Unlike dbus-daemon, udisks2 knows that not all disks are equal, knows which categories make sense to distinguish, and can identify which categories a particular disk is in. So udisks2 can make a more informed decision.

So, a naive approach might be to write a function in udisks2 that looks something like this pseudocode:

may_i_mount_this_disk (user, disk, mount options) → boolean
    if (user is root || user is root-equivalent)
        return true;

    if (disk is not removable)
       return false;

    if (mount options are scary)
       return false;

    if (user is in “manipulate non-local disks” group)
        return true;

    if (user is not logged-in locally)
       return false;

    if (user is not logged-in on the same seat where the disk is
            plugged in)
       return false;

    return true;

Delegating the security policy to something central

The pseudocode security policy outlined above is reasonably complicated already, and doesn't necessarily cover everything that you might want to consider.

Meanwhile, not every system is the same. A general-purpose Linux distribution like Debian might run on server/mainframe systems with only remote users, personal laptops/desktops with one root-equivalent user, locked-down corporate laptops/desktops, mobile devices and so on; these systems should not necessarily all have the same security policy.

Another interesting factor is that for some privileged operations, you might want to carry out interactive authorization: ask the requesting user to confirm that the action (which might have come from a background process) should take place (like Windows' UAC), or to prove that the person currently at the keyboard is the same as the person who logged in by giving their password (like sudo).

We could in principle write code for all of this in udisks2, and in NetworkManager, and in systemd, ... - but that clearly doesn't scale, particularly if you want the security policy to be configurable. Enter polkit (formerly PolicyKit), a system service for applying security policies to actions.

The way polkit works is that the application does its domain-specific analysis of the request - in the case of udisks2, whether the device to be mounted is removable, whether the mount options are reasonable, etc. - and converts it into an action. The action gives polkit a way to distinguish between things that are conceptually different, without needing to know the specifics. For instance, udisks2 currently divides up filesystem-mounting into org.freedesktop.udisks2.filesystem-mount, org.freedesktop.udisks2.filesystem-mount-fstab, org.freedesktop.udisks2.filesystem-mount-system and org.freedesktop.udisks2.filesystem-mount-other-seat.

The application also finds the identity of the user making the request. Next, the application sends the action, the identity of the requesting user, and any other interesting facts to polkit. As currently implemented, polkit is a D-Bus service, so this is an IPC request via D-Bus. polkit consults its database of policies in order to choose one of several results:

  • yes, allow it
  • no, do not allow it
  • ask the user to either authenticate as themselves or as a privileged (sysadmin) user to allow it, or cancel authentication to not allow it
  • ask the user to authenticate the first time, but if they do, remember that for a while and don't ask again

So how does polkit decide this? The first thing is that it reads the machine-readable description of the actions, in /usr/share/polkit-1/actions, which specifies a default policy. Next, it evaluates a local security policy to see what that says. In the current version of polkit, the local security policy is configured by writing JavaScript in /etc/polkit-1/rules.d (local policy) and /usr/share/polkit-1/rules.d (OS-vendor defaults). In older versions such as the one currently shipped in Debian unstable, there was a plugin architecture; but in practice nobody wrote plugins for it, and instead everyone used the example local authority shipped with polkit, which was configured via files in /etc/polkit-1/localauthority and /etc/polkit-1/localauthority.d.

These policies can take into account useful facts like:

  • what is the action we're talking about?
  • is the user logged-in locally? are they active, i.e. they are not just on a non-current virtual console?
  • is the user in particular groups?

For instance, gnome-control-center on Debian installs this snippet:

polkit.addRule(function(action, subject) {
        if (( == "org.freedesktop.locale1.set-locale" ||
    == "org.freedesktop.locale1.set-keyboard" ||
    == "org.freedesktop.hostname1.set-static-hostname" ||
    == "org.freedesktop.hostname1.set-hostname" ||
    == "org.gnome.controlcenter.datetime.configure") &&
            subject.local &&
            subject.isInGroup ("sudo")) {
                    return polkit.Result.YES;

which is reasonably close to being pseudocode for “active local users in the sudo group may set the system locale, keyboard layout, hostname and time, without needing to authenticate”. A system administrator could of course override that by dropping a higher-priority policy for some or all of these actions into /etc/polkit-1/rules.d.


  • Kernel-based permission checks are not sufficiently fine-grained to be able to express some quite reasonable security policies
  • Fine-grained access control needs domain-specific understanding
  • The kernel doesn't have that information (and neither does dbus-daemon)
  • The privileged service that does the domain-specific thing can provide the domain-specific understanding to turn the request into an action
  • polkit evaluates a configurable policy to determine whether privileged services should carry out requested actions
June 04, 2015

libinput provides a number of different out-of-the-box configurations, based on capabilities. For example: middle mouse button emulation is enabled by default if a device only has left and right buttons. On devices with a physical middle button it is available but disabled by default. Likewise, whether tapping is enabled and/or available depends on hardware capabilities. But some requirements cannot be gathered purely by looking at the hardware capabilities.

libinput uses a couple of udev properties, assigned through udev's hwdb, to detect device types. We use the same mechanism to provide us with specific tags to adjust libinput-internal behaviour. The udev properties named LIBINPUT_MODEL_.... tag devices based on a set of udev rules combined with hwdb matches. For example, we tag Chromebooks with LIBINPUT_MODEL_CHROMEBOOK.

Inside libinput, we parse those tags and use them for model-specific configuration. At the time of writing this, we use the chromebook tag to automatically enable clickfinger behaviour on those touchpads (which matches the google defaults on chromebooks). We tag the Lenovo X230 touchpad to give it it's own acceleration method. This touchpad is buggy and the data it sends has a very bad resolution.

In the future these tags will likely expand and encompass more devices that need customised tweaks. But the goal is always that libinput works well out of the box, even if the hardware is quirky. Introducing these tags instead of a sleigh of configuration options has short-term drawbacks: it increases the workload on us maintainers and it may require software updates to get a device to work exactly the way it should. The long-term benefits are maintainability and testability though, as well as us being more aware of what hardware is out there and how it needs to be fixed. Plus the relief of not having to deal with configuration snippets that are years out of date, do all the wrong things but still spread across forums like an STD.

Note: the tags are considered private API and may change at any time, depending what we want or need to do with them. Do not use them for home-made configuration.

June 03, 2015

libinput uses udev tags to determine what a device is. This is a significant difference to the X.Org stack which determines how to deal with a device based on an elaborate set of rules, rules grown over time, matured, but with a slight layer of mould on top by now. In evdev's case that is understandable, it stems from a design where you could just point it at a device in your xorg.conf and it'd automagically work, well before we had even input hotplugging in X. What it leads to now though is that the server uses slightly different rules to decide what a device is (to implement MatchIsTouchscreen for example) than evdev does. So you may have, in theory, a device that responds to MatchIsTouchscreen only to set itself up as keyboard.

libinput does away with this in two ways: it punts most of the decisions on what a device is to udev and its ID_INPUT_... properties. A device marked as ID_INPUT_KEYBOARD will initialize a keyboard interface, an ID_INPUT_TOUCHPAD device will initialize a touchpad backend. The obvious advantage of this is that we only have one place where we have generic device type rules. The second advantage is that where this one place isn't precise enough, it's simple to override with custom rule sets. For example, Wacom tablets are hard to categorise just by looking at the device alone. libwacom generates a udev rule containing the VID/PID of each known device with the right ID_INPUT_TABLET etc. properties.

This is a libinput-internal behaviour. Externally, we are a lot more vague. In fact, we don't tell you at all what a device is, other than what events it will send (pointer, keyboard, or touch). We have thought about implementing some sort of device identifier and the conclusion is that we won't implement this as part of libinput's API because it will simply be wrong some of the time. And if it's wrong, it requires the caller to re-implement something on top of it. At which point the caller may as well implement all of it instead. Why do we expect it to be wrong? Because libinput doesn't know the exact context that requires a device to be labelled as a specific type.

Take a keyboard for example. There are a great many devices that send key events. To the client a keyboard may be any device that can get an XKB layout and is used for typing. But to the compositor, a keyboard may be anything that can send a few specific keycodes. A device with nothing but KEY_POWER? That's enough for the compositor to decide to shut down but that device may not otherwise work as a keyboard. libinput can't know this context. But what libinput provides is the API to query information. libinput_device_pointer_has_button() and libinput_device_keyboard_has_key() are the two candidates here to query about a specific set of buttons and/or keys.

Touchpads, trackpoints and mice all look send pointer events and there is no flag that tells you the device type and that is intentional. libinput doesn't have any intrinsic knowledge about what is a touchpad, we take the ID_INPUT_TOUCHPAD tag. At best, we refuse some devices that were clearly mislabelled but we don't init devices as touchpads that aren't labelled as such. Any device type identification would likely be wrong - for example some Wacom tablets are touchpads internally but would be considered tablets in other contexts.

So in summary, callers are encouraged to rely on the udev information and other bits they can pull from the device to group it into the semantically correct device type. libinput_device_get_udev_device() provides a udev handle for a libinput device and all configurable features are query-able (e.g. "does this device support tapping?"). libinput will not provide a device type because it would likely be wrong in the current context anyway.

June 02, 2015

TLDR: as of libinput 0.16 you can end a touchpad tap-and-drag with a final additional tap

libinput originally only supported single-tap and double-tap. With version 0.15 we now support multi-tap, so you can tap repeatedly to get a triple, quadruple, etc. click. This is quite useful in text editors where a triple click highlights a line, four clicks highlight a paragraph, and 28 clicks order a new touchpad from ebay. Multi-tap also works with drag-and drop, so a triple tap followed by a finger down and hold will send three clicks followed by a single click.

We also support continuous tap-and-drag which is something synaptics didn't support provided with the LockedDrags option: Once the user is in dragging mode (x * tap + hold finger down) they can lift the finger and set it down again without the drag being interrupted. This is quite useful when you have to move across the screen, especially on smaller touchpads or for users that prefer a slow acceleration.

Of course, this adds a timeout to the drag release since we need to wait and see whether the finger comes down again. To help accelerate this, we added a tap-to-release feature (contributed by Velimir Lisec): once in drag mode a final tap will release the button immediately. This is something that OS X has supported for years and after a bit of muscle memory retraining it becomes second nature quickly. So the new timeout-free way to tap-and-drag on a touchpad is now:

tap, finger-down, move, .... move, finger up, tap

Update 03/06/25: add synaptics LockedDrag option reference

I don't often write about tools I use when for my daily software development tasks. I recently realized that I really should start to share more often my workflows and weapons of choice.

One thing that I have a hard time enduring while doing Python code reviews, is people writing utility code that is not directly tied to the core of their business. This looks to me as wasted time maintaining code that should be reused from elsewhere.

So today I'd like to start with retrying, a Python package that you can use to… retry anything.

It's OK to fail

Often in computing, you have to deal with external resources. That means accessing resources you don't control. Resources that can fail, become flapping, unreachable or unavailable.

Most applications don't deal with that at all, and explode in flight, leaving a skeptical user in front of the computer. A lot of software engineers refuse to deal with failure, and don't bother handling this kind of scenario in their code.

In the best case, applications usually handle simply the case where the external reached system is out of order. They log something, and inform the user that it should try again later.

In this cloud computing area, we tend to design software components with service-oriented architecture in mind. That means having a lot of different services talking to each others over the network. And we all know that networks tend to fail, and distributed systems too. Writing software with failing being part of normal operation is a terrific idea.


In order to help applications with the handling of these potential failures, you need a plan. Leaving to the user the burden to "try again later" is rarely a good choice. Therefore, most of the time you want your application to retry.

Retrying an action is a full strategy on its own, with a lot of options. You can retry only on certain condition, and with the number of tries based on time (e.g. every second), based on a number of tentative (e.g. retry 3 times and abort), based on the problem encountered, or even on all of those.

For all of that, I use the retrying library that you can retrieve easily on PyPI.

retrying provides a decorator called retry that you can use on top of any function or method in Python to make it retry in case of failure. By default, retry calls your function endlessly until it returns rather than raising an error.

import random
from retrying import retry
def pick_one():
if random.randint(0, 10) != 1:
raise Exception("1 was not picked")

This will execute the function pick_one until 1 is returned by random.randint.

retry accepts a few arguments, such as the minimum and maximum delays to use, which also can be randomized. Randomizing delay is a good strategy to avoid detectable pattern or congestion. But more over, it supports exponential delay, which can be used to implement exponential backoff, a good solution for retrying tasks while really avoiding congestion. It's especially handy for background tasks.

@retry(wait_exponential_multiplier=1000, wait_exponential_max=10000)
def wait_exponential_1000():
print "Wait 2^x * 1000 milliseconds between each retry, up to 10 seconds, then 10 seconds afterwards"
raise Exception("Retry!")

You can mix that with a maximum delay, which can give you a good strategy to retry for a while, and then fail anyway:

# Stop retrying after 30 seconds anyway
>>> @retry(wait_exponential_multiplier=1000, wait_exponential_max=10000, stop_max_delay=30000)
... def wait_exponential_1000():
... print "Wait 2^x * 1000 milliseconds between each retry, up to 10 seconds, then 10 seconds afterwards"
... raise Exception("Retry!")
>>> wait_exponential_1000()
Wait 2^x * 1000 milliseconds between each retry, up to 10 seconds, then 10 seconds afterwards
Wait 2^x * 1000 milliseconds between each retry, up to 10 seconds, then 10 seconds afterwards
Wait 2^x * 1000 milliseconds between each retry, up to 10 seconds, then 10 seconds afterwards
Wait 2^x * 1000 milliseconds between each retry, up to 10 seconds, then 10 seconds afterwards
Wait 2^x * 1000 milliseconds between each retry, up to 10 seconds, then 10 seconds afterwards
Wait 2^x * 1000 milliseconds between each retry, up to 10 seconds, then 10 seconds afterwards
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/", line 49, in wrapped_f
return Retrying(*dargs, **dkw).call(f, *args, **kw)
File "/usr/local/lib/python2.7/site-packages/", line 212, in call
raise attempt.get()
File "/usr/local/lib/python2.7/site-packages/", line 247, in get
six.reraise(self.value[0], self.value[1], self.value[2])
File "/usr/local/lib/python2.7/site-packages/", line 200, in call
attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
File "<stdin>", line 4, in wait_exponential_1000
Exception: Retry!

A pattern I use very often, is the ability to retry only based on some exception type. You can specify a function to filter out exception you want to ignore or the one you want to use to retry.

def retry_on_ioerror(exc):
return isinstance(exc, IOError)
def read_file():
with open("myfile", "r") as f:

retry will call the function passed as retry_on_exception with the exception raised as first argument. It's up to the function to then return a boolean indicating if a retry should be performed or not. In the example above, this will only retry to read the file if an IOError occurs; if any other exception type is raised, no retry will be performed.

The same pattern can be implemented using the keyword argument retry_on_result, where you can provide a function that analyses the result and retry based on it.

def retry_if_file_empty(result):
return len(result) <= 0
def read_file():
with open("myfile", "r") as f:

This example will read the file until it stops being empty. If the file does not exist, an IOError is raised, and the default behavior which triggers retry on all exceptions kicks-in – the retry is therefore performed.

That's it! retry is really a good and small library that you should leverage rather than implementing your own half-baked solution!

May 29, 2015

In earlier posts I talked about how to install piglit on your system and run the first time and how to tailor your piglit run. I have given a talk in FOSDEM 2015 about how to test your OpenGL drivers using Free Software where I explained how to run piglit and dEQP. This post and the next one are going to introduce the topic of creating new tests for piglit, as this was not covered before in this post series.

Brief introduction

Before start talking about how to create a GLSL test, let me explain you a couple of things about how piglit organizes its tests.

First is that all the tests are defined inside tests/ directory.

  • The ones related to a specific spec (like OpenGL extension, GL version, GLSL version, etc) are placed inside tests/spec subdirectory.
  • Shader tests are usually defined inside  tests/shaders/ directory, but not always. If they are specific to a OpenGL extension, they will be in the its subdirectory (see first bullet).
  • Tests specific to other APIs: tests/egl, tests/cl, etc.
  • Etc.

Take your time browsing all these directories and figure out the best place for your tests or, if you are looking for a specific test, where it is likely placed.

Second thing is that binary tests should be defined inside tests/ to be executed in a full piglit run. This is not the case for GLSL tests or shader runner tests as we are going to see in this post.

The most common language to write these tests is C but there are a couple of alternatives if you are interested on writing GLSL shader tests. Let’s start with a couple of examples of GLSL shader tests as it is the easiest way to contribute new tests to piglit.

GLSL compiler test

When creating a GLSL compiler test, most of the time you can avoid to write a C code example with all those bells and whistles. Actually, it is pretty straightforward if you just want to check that a GLSL shader compiles successfully or not according to your expectations.

Usually, the rule of thumb is to keep GLSL tests simple and focused to detect failures (or success) in shaders’ compilation. For example, you want to check that defining a shader storage buffer block is only possible if ARB_shader_storage_buffer_object extension is enabled in the driver.

#version 120
#extension GL_ARB_shader_storage_buffer_object: disable

buffer ssbo {
    vec4 a;

void foo(void) {

Once you write the GLSL shader, save it to a file named like extension-disabled-shader-storage-block.frag. There are several suffixes that you can use depending of the GLSL shader type: .vert for vertex shader, .geom for geometry shader, .frag for fragment shader. Notice that the name itself is a summary of what the test does, so other people can find what it does without opening the file.

There is a piglit binary called glslparser that it is going to pick this shader and compile it checking for errors. This binary needs some parameters to know how to run this test and the expected result, so we provide them. Add at the top of the shader the following content:

// [config]
// expect_result: fail
// glsl_version: 1.20
// require_extensions: GL_ARB_shader_storage_buffer_object
// [end config]

There we write down that we expect the test to fail, which is the minimum supported GLSL version to run this test and the required extensions. At this case we need GLSL 1.20 version and GL_ARB_shader_storage_buffer_object extension.

// [config]
// expect_result: fail
// glsl_version: 1.20
// require_extensions: GL_ARB_shader_storage_buffer_object
// [end config]

#version 120
#extension GL_ARB_shader_storage_buffer_object: disable

buffer ssbo {
    vec4 a;

void foo(void) {

Once you have it, you should save it in the proper place. The directory will be the same than the extension name it is going to test, usually it is saved in a subdirectory called compiler because we are testing the GLSL shader compilation, for this example: tests/spec/arb_shader_storage_buffer_object/compiler.

And that’s it. Next time you run piglit with tests/ profile, one script will find this test, parse it and execute glslparser with the provided configuration checking if the result of your shader is the expected one or not.

You can execute it manually by running the following command:

$ bin/glslparser tests/spec/arb_shader_storage_buffer_object/compiler/extension-disabled-shader-storage-block.frag fail 1.20 GL_ARB_shader_storage_buffer_object

As you see, the last arguments come from the config we defined in the test file but we added them by hand.

It is possible to link a GLSL shader and look for errors by using check_link with true or false, depending if you expect the test to succeed or to fail at link time:

// check_link: true

Nevertheless, it only links one shader. If you want to link more than one shader, I recommend you to use another type of piglit tests (see next section).

As you see, you can easily write new GLSL shader tests to verify some bits of a specification: valid/invalid keywords, initialization, special cases explained in the spec, etc.

GLSL shader linker test

Sometimes compiling/linking only one shader is not enough, there are cases that you want to compile and link different but related shaders: link a vertex and a fragment shader or a geometry shader between them, etc. In the previous example, we saw that the GLSL shader test does not link different shaders (actually there is only one). When there are several GLSL shaders, you need to use another type of piglit tests: shader_runner tests.

As usual, you first start writing the GLSL shaders. In some cases, you can substitute one shader by its pass-through counterpart to avoid writing “useless” shaders when there is not any user provided interface dependency.

Let’s start with an example of a pass-through vertex shader:

# ARB_gpu_shader5 spec says:
#   "If an implementation supports  vertex streams, the individual
#   streams are numbered 0 through -1"
# This test verifies that a link error occurs if EmitStreamVertex()
# is called with a stream value which is negative.

GLSL >= 1.50

[vertex shader passthrough]

[geometry shader]

#extension GL_ARB_gpu_shader5 : enable

layout(points) in;
layout(points, max_vertices=3) out;

void main()
    gl_Position = vec4(1.0, 1.0, 1.0, 1.0);

[fragment shader]

out vec4 color;

void main()
  color = vec4(0.0, 1.0, 0.0, 1.0);

link error

The file starts with a comment describing what the test does and which spec (or part of it) is checking.

Next step is very similar to the previous example: specify the minimum GL and/or GLSL version required and the extensions needed for the test execution. This information is written in the [require] section.

Then the different GLSL shader definitions start: vertex shader is pass-through as it has no user-defined interface dependency to next stage (geometry shader at this case). After it, it defines the  geometry and fragment shaders and finally the test commands to run. In this case, we expect the test to fail at link time.

However you are not limited to link checking, there are other available commands to run in the test commands part:

    • Load uniform values
      • uniform <type> <name> <value>

uniform vec4 color 0.0 0.5 0.0 0.0

    • Draw rectangles
      • draw rect <coordinates>

draw rect -1 -1 2 2

    • Probe a pixel color value and compare it to an expected value
      • probe {rgb, rgba}

probe rgb 1 1 0.0 1.0 0.0

    • Probe all pixel values.
      • probe all {rgb,rgba}

probe all rgba 0.0 1.0 0.0 0.0

Or even more complex commands:

    • Load data into a vertex array object and render primitives using that vertex array data.

[vertex data]
 1.0  1.0  1.0
-1.0  1.0  1.0
-1.0 -1.0  1.0
 1.0 -1.0  1.0

draw arrays GL_TRIANGLE_FAN 0 4

    •  Relative probe pixel color

relative probe rgb (.5, .5) (0.0, 1.0, 0.0)

    • And much more:

[fragment program]
OPTION ARB_fragment_program_shadow;
TXP result.color, fragment.texcoord[0], texture[0], SHADOWRECT;

texture shadowRect 0 (32, 32)
texparameter Rect depth_mode luminance
texparameter Rect compare_func greater

Just check other tests to figure out if they are doing something like what you want to do in your test and copy the interesting bits!

Save the file with a good name describing briefly what it does (example’s filename is stream-negative-value.shader_test) and place it in the corresponding subdirectory. For this example, the proper place is under GL_ARB_gpu_shader5 test directory and linker subdirectory because this is a linker test: tests/spec/arb_gpu_shader5/linker/

In order to execute this test with tests/ profile, you don’t need to do anything else. A script will find this test file and call shader_runner binary with the filename as a parameter.

In case you want to run it by yourself, the command is easy:

$ bin/shader_runner tests/spec/arb_gpu_shader5/linker/stream-negative-value.shader_test

Next post will cover the creation of GL binary tests within the piglit framework. I hope this post and next one will help you to contribute to piglit!

May 26, 2015

So we just got the second Fedora Workstation release out the door, and I am quite happy with it, we had quite a few last minute hardware issues pop up, but due to the hard work of the team we where able to get them fixed in time for todays release.

Every release we do is of course the result of both work we do as part of the Fedora Workstation team, but we also rely on a lot of other people upstream. I would like to especially call out Laurent Pinchart, who is the upstream maintainer of the UVC driver, who fixed a bug we discovered with some built in webcams on newer laptops. So thank you Laurent! So for any users of the Toshiba z20t Portege laptop, your rear camera now works thanks to Laurent :)

Having a relatively short development cycle this release doesn’t contain huge amounts of major changes, but our team did manage to sneak in a few nice new features. As you can see from this blog entry from Allan Day the notification area re-design that he and Florian worked on landed. It is a huge improvement in my opinion and will let us continue polishing the notification behavior of applications going forward.

We have a bunch of improvements to the Nautilus file manager thanks to the work of Carlos Soriano. Recommend reading through his blog as there is a quite sizeable collection of smaller fixes and improvements he was able to push through.

Another thing we got properly resolved for Fedora Workstation 22 is installing it in Boxes. Boxes is our easy to use virtual machine manager which we are putting resources into to make a great developer companion. So while this is a smaller fix for Boxes and Fedora, we have some great Boxes features lining up for the next Fedora release, so stayed tuned for more on that in another blog post.

Wayland support is also marching forward with this release. The GDM session you get upon installing Fedora Workstation 22 will now default to Wayland, but fall back to X if there is an issue. It is a first step towards migrating the default session to Wayland. We still have some work to do there to get the Wayland session perfect, but we are closing the gap rapidly. Jonas Ådahl and Owen Taylor is pushing that effort forward.

Related to Wayland we introduce libinput as the backend for both X and Wayland in this release. While we shipped libinput in Fedora 21, when we wrote libinput we did so with Wayland as the primary target, yet at the same time we realized that we didn’t want to maintain two separate input systems going forward, so in this release also uses libinput for input. This means we have one library to work on now that will improve input in both your Wayland session and X sessions.

This is also the first release featuring the new Adwaita theme for Qt. This release supports Qt4, but we hope to support Qt5 in an upcoming Fedora release and also include a dark variant of the theme for Qt applications. Martin Briza has been leading that effort.

Another nice little feature addition this release is the notification of long running jobs in the terminal. It was a feature we wanted to do from early on in the Fedora Workstation process, but it took quite some while to figure out the fine details for how we wanted to do it. Basically it means you no longer need to check in with your open terminals to see if a job has completed, instead you are now getting a notification. So you can for instance start a compile and then not have to think about it again until you get the notification. We are still tweaking the notifications a little bit for this one, to make sure we cut down the amount of unhelpful notifications to an absolute minimum, so if you have feedback on how we can improve this feature we be happy to hear it. For example we are thinking about turning off the notification for UI applications launched from a terminal.

Anyway, we have a lot of features in the pipeline now for Fedora Workstation 23 since quite a few of the items planned for Fedora Workstation 22 didn’t get completed in time, so I am looking forward to writing a blog informing you about those soon.

You can also read about this release in Fedora Magazine.

Last week I was in Vancouver, BC for the OpenStack Summit, discussing the new Liberty version that will be released in 6 months.

I've attended the summit mainly to discuss and follow-up new developments on Ceilometer, Gnocchi and Oslo. It has been a pretty good week and we were able to discuss and plan a few interesting things.

Ops feedback

We had half a dozen Ceilometer sessions, and the first one was dedicated to getting feedbacks from operators using Ceilometer. We had a few operators present, and a few of the Ceilometer team. We had constructive discussion, and my feeling is that operators struggles with 2 things so far: scaling Ceilometer storage and having Ceilometer not killing the rest of OpenStack.

We discussed the first point as being addressed by Gnocchi, and I presented a bit Gnocchi itself, as well as how and why it will fix the storage scalability issue operators encountered so far.

Ceilometer putting down the OpenStack installation is more interesting problem. Ceilometer pollsters request information from Nova, Glance… to gather statistics. Until Kilo, Ceilometer used to do that regularly and at fixed interval, causing high pike load in OpenStack. With the introduction of jitter in Kilo, this should be less of a problem. However, Ceilometer hits various endpoints in OpenStack that are poorly designed, and hitting those endpoints of Nova or other components triggers a lot of load on the platform. Unfortunately, this makes operators blame Ceilometer rather than blaming the components being guilty of poor designs. We'd like to push forward improving these components, but it's probably going to take a long time.


When I started the Gnocchi project last year, I pretty soon realized that we would be able to split Ceilometer itself in different smaller components that could work independently, while being able to leverage each others. For example, Gnocchi can run standalone and store your metrics even if you don't use Ceilometer – nor even OpenStack itself.

My fellow developer Chris Dent had the same idea about splitting Ceilometer a few months ago and drafted a proposal. The idea is to have Ceilometer split in different parts that people could assemble together or run on their owns.

Interestingly enough, we had three 40 minutes sessions planned to talk and debate about this division of Ceilometer, though we all agreed in 5 minutes that this was the good thing to do. Five more minutes later, we agreed on which part to split. The rest of the time was allocated to discuss various details of that split, and I engaged to start doing the work with Ceilometer alarming subsystem.

I wrote a specification on the plane bringing me to Vancouver, that should be approved pretty soon now. I already started doing the implementation work. So fingers crossed, Ceilometer should have a new components in Liberty handling alarming on its own.

This would allow users for example to only deploys Gnocchi and Ceilometer alarm. They would be able to feed data to Gnocchi using their own system, and build alarms using Ceilometer alarm subsystem relying on Gnocchi's data.


We didn't have a Gnocchi dedicated slot – mainly because I indicated I didn't feel we needed one. We anyway discussed a few points around coffee, and I've been able to draw a few new ideas and changes I'd like to see in Gnocchi. Mainly changing the API contract to be more asynchronously so we can support InfluxDB more correctly, and improve Carbonara (the library we created to manipulate timeseries) based drivers to be faster.

All of those should – plus a few Oslo tasks I'd like to tackle – should keep me busy for the next cycle!

May 25, 2015

Rule the Stack

Last week during the the OpenStack Summit in Vancouver, Intel organized a Rule the Stack contest. That's the third one, after Atlanta a year ago and Paris six months ago. In case you missed earlier episodes, SUSE won the two previous contests with Dirk being pretty fast in Atlanta and Adam completing the HA challenge so we could keep the crown. So of course, we had to try again!

For this contest, the rules came with a list of penalties and bonuses which made it easier for people to participate. And indeed, there were quite a number of participants with the schedule for booking slots being nearly full. While deploying Kilo was a goal, you could go with older releases getting a 10 minutes penalty per release (so +10 minutes for Juno, +20 minutes for Icehouse, and so on). In a similar way, the organizers wanted to see some upgrade and encouraged that with a bonus that could significantly impact the results (-40 minutes) — nobody tried that, though.

And guess what? SUSE kept the crown again. But we also went ahead with a new challenge: outperforming everyone else not just once, but twice, with two totally different methods.

For the super-fast approach, Dirk built again an appliance that has everything pre-installed and that configures the software on boot. This is actually not too difficult thanks to the amazing Kiwi tool and all the knowledge we have accumulated through the years at SUSE about building appliances, and also the small scripts we use for the CI of our OpenStack packages. Still, it required some work to adapt the setup to the contest and also to make sure that our Kilo packages (that were brand new and without much testing) were fully working. The clock result was 9 minutes and 6 seconds, resulting in a negative time of minus 10 minutes and 54 seconds (yes, the text in the picture is wrong) after the bonuses. Pretty impressive.

But we also wanted to show that our product would fare well, so Adam and I started looking at this. We knew it couldn't be faster than the way Dirk picked, and from the start, we targetted the second position. For this approach, there was not much to do since this was similar to what he did in Paris, and there was work to update our SUSE OpenStack Cloud Admin appliance recently. Our first attempt failed miserably due to a nasty bug (which was actually caused by some unicode character in the ID of the USB stick we were using to install the OS... we fixed that bug later in the night). The second attempt went smoother and was actually much faster than we had anticipated: SUSE OpenStack Cloud deployed everything in 23 minutes and 17 seconds, which resulted in a final time of 10 minutes and 17 seconds after bonuses/penalties. And this was with a 10 minutes penalty due to the use of Juno (as well as a couple of minutes lost debugging some setup issue that was just mispreparation on our side). A key contributor to this result is our use of Crowbar, which we've kept improving over time, and that really makes it easy and fast to deploy OpenStack.

Wall-clock time for SUSE OpenStack Cloud

Wall-clock time for SUSE OpenStack Cloud

These two results wouldn't have been possible without the help of Tom and Ralf, but also without the whole SUSE OpenStack Cloud team that works on a daily basis on our product to improve it and to adapt it to the needs of our customers. We really have an awesome team (and btw, we're hiring)!

For reference, three other contestants succeeded in deploying OpenStack, with the fastest of them ending at 58 minutes after bonuses/penalties. And as I mentioned earlier, there were even more contestants (including some who are not vendors of an OpenStack distribution), which is really good to see. I hope we'll see even more in Tokyo!

Results of the Rule the Stack contest

Results of the Rule the Stack contest

Also thanks to Intel for organizing this; I'm sure every contestant had fun and there was quite a good mood in the area reserved for the contest.

Update: See also the summary of the contest from the organizers.

May 22, 2015
Modern (and some less modern) laptops and tablets have a lot of builtin sensors: accelerometer for screen positioning, ambient light sensors to adjust the screen brightness, compass for navigation, proximity sensors to turn off the screen when next to your ear, etc.


We've supported accelerometers in GNOME/Linux for a number of years, following work on the WeTab. The accelerometer appeared as an input device, and sent kernel events when the orientation of the screen changed.

Recent devices, especially Windows 8 compatible devices, instead export a HID device, which, under Linux, is handled through the IIO subsystem. So the first version of iio-sensor-proxy took readings from the IIO sub-system and emulated the WeTab's accelerometer: a few too many levels of indirection.

The 1.0 version of the daemon implements a D-Bus interface, which means we can support more than accelerometers. The D-Bus API, this time, is modelled after the Android and iOS APIs.


Accelerometers will work in GNOME 3.18 as well as it used to, once a few bugs have been merged[1]. If you need support for older versions of GNOME, you can try using version 0.1 of the proxy.

Orientation lock in action

As we've adding ambient light sensor support in the 1.0 release, time to put in practice best practice mentioned by Owen's post about battery usage. We already had code like that in gnome-power-manager nearly 10 years ago, but it really didn't work very well.

The major problem at the time was that ambient light sensor reading weren't in any particular unit (values had different meanings for different vendors) and the user felt that they were fighting against the computer for the control of the backlight.

Richard fixed that though, adapting work he did on the ColorHug ALS sensor, and the brightness is now completely in the user's control, and adapts to the user's tastes. This means that we can implement the simplest of UIs for its configuration.

Power saving in action

This will be available in the upcoming GNOME 3.17.2 development release.

Looking ahead

For future versions, we'll want to export the raw accelerometer readings, so that applications, including games, can make use of them, which might bring up security issues. SDL, Firefox, WebKit could all do with being adapted, in the near future.

We're also looking at adding compass support (thanks Elad!), which Geoclue will then export to applications, so that location and heading data is collected through a single API.

Richard and Benjamin Tissoires, of fixing input devices fame, are currently working on making the ColorHug-ALS compatible with Windows 8, meaning it would work out of the box with iio-sensor-proxy.


We're currently using GitHub for bug and code tracking. Releases are mirrored on, as GitHub is known to mangle filenames. API documentation is available on

[1]: gnome-settings-daemon, gnome-shell, and systemd will need patches
May 20, 2015

In the last weeks I have been working together with my colleague Samuel on bringing support for ARB_shader_storage_buffer_object, an OpenGL 4.3 feature, to Mesa and the Intel i965 driver, so I figured I would write a bit on what this brings to OpenGL/GLSL users. If you are interested, read on.

Introducing Shader Storage Buffer Objects

This extension introduces the concept of shader storage buffer objects (SSBOs), which is a new type of OpenGL buffer. SSBOs allow GL clients to create buffers that shaders can then map to variables (known as buffer variables) via interface blocks. If you are familiar with Uniform Buffer Objects (UBOs), SSBOs are pretty similar but:

  • They are read/write, unlike UBOs, which are read-only.
  • They allow a number of atomic operations on them.
  • They allow an optional unsized array at the bottom of their definitions.

Since SSBOs are read/write, they create a bidirectional channel of communication between the GPU and CPU spaces: the GL application can set the value of shader variables by writing to a regular OpenGL buffer, but the shader can also update the values stored in that buffer by assigning values to them in the shader code, making the changes visible to the GL application. This is a major difference with UBOs.

In a parallel environment such as a GPU where we can have multiple shader instances running simultaneously (processing multiple vertices or fragments from a specific rendering call) we should be careful when we use SSBOs. Since all these instances will be simultaneously accessing the same buffer there are implications to consider relative to the order of reads and writes. The spec does not make many guarantees about the order in which these take place, other than ensuring that the order of reads and writes within a specific execution of a shader is preserved. Thus, it is up to the graphics developer to ensure, for example, that each execution of a fragment or vertex shader writes to a different offset into the underlying buffer, or that writes to the same offset always write the same value. Otherwise the results would be undefined, since they would depend on the order in which writes and reads from different instances happen in a particular execution.

The spec also allows to use glMemoryBarrier() from shader code and glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT) from a GL application to add sync points. These ensure that all memory accesses to buffer variables issued before the barrier are completely executed before moving on.

Another tool for developers to deal with concurrent accesses is atomic operations. The spec introduces a number of new atomic memory functions for use with buffer variables: atomicAdd, atomicMin, atomicMax, atomicAnd, atomicOr, atomicXor, atomicExchange (atomic assignment to a buffer variable), atomicCompSwap (atomic conditional assignment to a buffer variable).

The optional unsized array at the bottom of an SSBO definition can be used to push a dynamic number of entries to the underlying buffer storage, up to the total size of the buffer allocated by the GL application.

Using shader storage buffer objects (GLSL)

Okay, so how do we use SSBOs? We will introduce this through an example: we will use a buffer to record information about the fragments processed by the fragment shader. Specifically, we will group fragments according to their X coordinate (by computing an index from the coordinate using a modulo operation). We will then record how many fragments are assigned to a particular index, the first fragment to be assigned to a given index, the last fragment assigned to a given index, the total number of fragments processed and the complete list of fragments processed.

To store all this information we will use the SSBO definition below:

layout(std140, binding=0) buffer SSBOBlock {
   vec4 first[8];     // first fragment coordinates assigned to index
   vec4 last[8];      // last fragment coordinates assigned to index
   int counter[8];    // number of fragments assigned to index
   int total;         // number of fragments processed
   vec4 fragments[];  // coordinates of all fragments processed

Notice the use of the keyword buffer to tell the compiler that this is a shader storage buffer object. Also notice that we have included an unsized array called fragments[], there can only be one of these in an SSBO definition, and in case there is one, it has to be the last field defined.

In this case we are using std140 layout mode, which imposes certain alignment rules for the buffer variables within the SSBO, like in the case of UBOs. These alignment rules may help the driver implement read/write operations more efficiently since the underlying GPU hardware can usually read and write faster from and to aligned addresses. The downside of std140 is that because of these alignment rules we also waste some memory and we need to know the alignment rules on the GL side if we want to read/write from/to the buffer. Specifically for SSBOs, the specification introduces a new layout mode: std430, which removes these alignment restrictions, allowing for a more efficient memory usage implementation, but possibly at the expense of some performance impact.

The binding keyword, just like in the case of UBOs, is used to select the buffer that we will be reading from and writing to when accessing these variables from the shader code. It is the application’s responsibility to bound the right buffer to the binding point we specify in the shader code.

So with that done, the shader can read from and write to these variables as we see fit, but we should be aware of the fact that multiple instances of the shader could be reading from and writing to them simultaneously. Let’s look at the fragment shader that stores the information we want into the SSBO:

void main() {
   int index = int(mod(gl_FragCoord.x, 8));

   int i = atomicAdd(counter[index], 1);
   if (i == 0)
      first[index] = gl_FragCoord;
      last[index] = gl_FragCoord;

   i = atomicAdd(total, 1);
   fragments[i] = gl_FragCoord;

The first line computes an index into our integer array buffer variable by using gl_FragCoord. Notice that different fragments could get the same index. Next we increase in one unit counter[index]. Since we know that different fragments can get to do this at the same time we use an atomic operation to make sure that we don’t lose any increments.

Notice that if two fragments can write to the same index, reading the value of counter[index] after the atomicAdd can lead to different results. For example, if two fragments have already executed the atomicAdd, and assuming that counter[index] is initialized to 0, then both would read counter[index] == 2, however, if only one of the fragments has executed the atomic operation by the time it reads counter[index] it would read a value of 1, while the other fragment would read a value of 2 when it reaches that point in the shader execution. Since our shader intends to record the coordinates of the first fragment that writes to counter[index], that won’t work for us. Instead, we use the return value of the atomic operation (which returns the value that the buffer variable had right before changing it) and we write to first[index] only when that value was 0. Because we use the atomic operation to read the previous value of counter[index], only one fragment will read a value of 0, and that will be the fragment that first executed the atomic operation.

If this is not the first fragment assigned to that index, we write to last[index] instead. Again, multiple fragments assigned to the same index could do this simultaneously, but that is okay here, because we only care about the the last write. Also notice that it is possible that different executions of the same rendering command produce different values of first[] and last[].

The remaining instructions unconditionally push the fragment coordinates to the unsized array. We keep the last index into the unsized array fragments[] we have written to in the buffer variable total. Each fragment will atomically increase total before writing to the unsized array. Notice that, once again, we have to be careful when reading the value of total to make sure that each fragment reads a different value and we never have two fragments write to the same entry.

Using shader storage buffer objects (GL)

On the side of the GL application, we need to create the buffer, bind it to the appropriate binding point and initialize it. We do this as usual, only that we use the new GL_SHADER_STORAGE_BUFFER target:

typedef struct {
   float first[8*4];      // vec4[8]
   float last[8*4];       // vec4[8]
   int counter[8*4];      // int[8] padded as per std140
   int total;             // int
   int pad[3];            // padding: as per std140 rules
   char fragments[1024];  // up to 1024 bytes of unsized array

SSBO data;


memset(&data, 0, sizeof(SSBO));

GLuint buf;
glGenBuffers(1, &buf);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, buf);

The code creates a buffer, binds it to binding point 0 of GL_SHADER_STORAGE_BUFFER (the same we have bound our shader to) and initializes the buffer data to 0. Notice that because we are using std140 we have to be aware of the alignment rules at work. We could have used std430 instead to avoid this.

Since we have 1024 bytes for the fragments[] unsized array and we are pushing a vec4 (16 bytes) worth of data to it with every fragment we process then we have enough room for 64 fragments. It is the developer’s responsibility to ensure that this limit is not surpassed, otherwise we would write beyond the allocated space for our buffer and the results would be undefined.

The next step is to do some rendering so we get our shaders to work. That would trigger the execution of our fragment shader for each fragment produced, which will generate writes into our buffer for each buffer variable the shader code writes to. After rendering, we can map the buffer and read its contents from the GL application as usual:

SSBO *ptr = (SSBO *) glMapNamedBuffer(buf, GL_READ_ONLY);

/* List of fragments recorded in the unsized array */
printf("%d fragments recorded:\n", ptr->total);
float *coords = (float *) ptr->fragments;
for (int i = 0; i < ptr->total; i++, coords +=4) {
   printf("Fragment %d: (%.1f, %.1f, %.1f, %.1f)\n",
          i, coords[0], coords[1], coords[2], coords[3]);

/* First fragment for each index used */
for (int i = 0; i < 8; i++) {
   if (ptr->counter[i*4] > 0)
      printf("First fragment for index %d: (%.1f, %.1f, %.1f, %.1f)\n",
             i, ptr->first[i*4], ptr->first[i*4+1],
             ptr->first[i*4+2], ptr->first[i*4+3]);

/* Last fragment for each index used */
for (int i = 0; i < 8; i++) {
   if (ptr->counter[i*4] > 1)
      printf("Last fragment for index %d: (%.1f, %.1f, %.1f, %.1f)\n",
             i, ptr->last[i*4], ptr->last[i*4+1],
             ptr->last[i*4+2], ptr->last[i*4+3]);
   else if (ptr->counter[i*4] == 1)
      printf("Last fragment for index %d: (%.1f, %.1f, %.1f, %.1f)\n",
             i, ptr->first[i*4], ptr->first[i*4+1],
             ptr->first[i*4+2], ptr->first[i*4+3]);

/* Fragment counts for each index */
for (int i = 0; i < 8; i++) {
   if (ptr->counter[i*4] > 0)
      printf("Fragment count at index %d: %d\n", i, ptr->counter[i*4]);

I get this result for an execution where I am drawing a handful of points:

4 fragments recorded:
Fragment 0: (199.5, 150.5, 0.5, 1.0)
Fragment 1: (39.5, 150.5, 0.5, 1.0)
Fragment 2: (79.5, 150.5, 0.5, 1.0)
Fragment 3: (139.5, 150.5, 0.5, 1.0)

First fragment for index 3: (139.5, 150.5, 0.5, 1.0)
First fragment for index 7: (39.5, 150.5, 0.5, 1.0)

Last fragment for index 3: (139.5, 150.5, 0.5, 1.0)
Last fragment for index 7: (79.5, 150.5, 0.5, 1.0)

Fragment count at index 3: 1
Fragment count at index 7: 3

It recorded 4 fragments that the shader mapped to indices 3 and 7. Multiple fragments where assigned to index 7 but we could handle that gracefully by using the corresponding atomic functions. Different executions of the same program will produce the same 4 fragments and map them to the same indices, but the first and last fragments recorded for index 7 can change between executions.

Also notice that the first fragment we recorded in the unsized array (fragments[0]) is not the first fragment recorded for index 7 (fragments[1]). That means that the execution of fragments[0] got first to the unsized array addition code, but the execution of fragments[1] beat it in the race to execute the code that handled the assignment to the first/last arrays, making clear that we cannot make any assumptions regarding the execution order of reads and writes coming from different instances of the same shader execution.

So that’s it, the patches are now in the mesa-dev mailing list undergoing review and will hopefully land soon, so look forward to it! Also, if you have any interesting uses for this new feature, let me know in the comments.

May 12, 2015

A couple of months ago, I was meeting colleagues of mine working on Docker and discussing about how much effort it would be to add support for it to SUSE OpenStack Cloud. It's been something that had been requested for a long time by quite a number of people and we never really had time to look into it. To find out how difficult it would be, I started looking at it on the evening; the README confirmed it shouldn't be too hard. But of course, we use Crowbar as our deployment framework, and the manual way of setting it up is not really something we'd want to recommend. Now would it be "not too hard" or just "easy"? There was only way to know that... And guess what happened next?

It took a couple of hours (and two patches) to get this working, including the time for packaging the missing dependencies and for testing. That's one of the nice things we benefit from using Crowbar: adding new features like this is relatively straight-forward, and so we can enable people to deploy a full cloud with all of these nice small features, without requiring them to learn about all the technologies and how to deploy them. Of course this was just a first pass (using the Juno code, btw).

Fast-forward a bit, and we decided to integrate this work. Since it was not a simple proof of concept anymore, we went ahead with some more serious testing. This resulted in us backporting patches for the Juno branch, but also making Nova behave a bit better since it wasn't aware of Docker as an hypervisor. This last point is a major problem if people want to use Docker as well as KVM, Xen, VMware or Hyper-V — the multi-hypervisor support is something that really matters to us, and this issue was actually the first one that got reported to us ;-) To validate all our work, we of course asked tempest to help us and the results are pretty good (we still have some failures, but they're related to missing features like volume support).

All in all, the integration went really smoothly :-)

Oh, I forgot to mention: there's also a docker plugin for heat. It's now available with our heat packages now in the Build Service as openstack-heat-plugin-heat_docker (Kilo, Juno); I haven't played with it yet, but this post should be a good start for anyone who's curious about this plugin.

May 11, 2015

I've recently been contacted by Johannes Hubertz, who is writing a new book about Python in German called "Softwaretests mit Python" which will be published by Open Source Press, Munich this summer. His book will feature some interviews, and he was kind enough to let me write a bit about software testing. This is the interview that I gave for his book. Johannes translated to German and it will be included in Johannes' book, and I decided to publish it on my blog today. Following is the original version.

How did you come to Python?

I don't recall exactly, but around ten years ago, I saw more and more people using it and decided to take a look. Back then, I was more used to Perl. I didn't really like Perl and was not getting a good grip on its object system.

As soon as I found an idea to work on – if I remember correctly that was rebuildd – I started to code in Python, learning the language at the same time.

I liked how Python worked, and how fast I was to able to develop and learn it, so I decided to keep using it for my next projects. I ended up diving into Python core for some reasons, even doing things like briefly hacking on projects like Cython at some point, and finally ended up working on OpenStack.

OpenStack is a cloud computing platform entirely written in Python. So I've been writing Python every day since working on it.

That's what pushed me to write The Hacker's Guide to Python in 2013 and then self-publish it a year later in 2014, a book where I talk about doing smart and efficient Python.

It had a great success, has even been translated in Chinese and Korean, so I'm currently working on a second edition of the book. It has been an amazing adventure!

Zen of Python: Which line is the most important for you and why?

I like the "There should be one – and preferably only one – obvious way to do it". The opposite is probably something that scared me in languages like Perl. But having one obvious way to do it is and something I tend to like in functional languages like Lisp, which are in my humble opinion, even better at that.

For a python newbie, what are the most difficult subjects in Python?

I haven't been a newbie since a while, so it's hard for me to say. I don't think the language is hard to learn. There are some subtlety in the language itself when you deeply dive into the internals, but for beginners most of the concept are pretty straight-forward. If I had to pick, in the language basics, the most difficult thing would be around the generator objects (yield).

Nowadays I think the most difficult subject for new comers is what version of Python to use, which libraries to rely on, and how to package and distribute projects. Though things get better, fortunately.

When did you start using Test Driven Development and why?

I learned unit testing and TDD at school where teachers forced me to learn Java, and I hated it. The frameworks looked complicated, and I had the impression I was losing my time. Which I actually was, since I was writing disposable programs – that's the only thing you do at school.

Years later, when I started to write real and bigger programs (e.g. rebuildd), I quickly ended up fixing bugs… I already fixed. That recalled me about unit tests and that it may be a good idea to start using them to stop fixing the same things over and over again.

For a few years, I wrote less Python and more C code and Lua (for the awesome window manager), and I didn't use any testing. I probably lost hundreds of hours testing manually and fixing regressions – that was a good lesson. Though I had good excuses at that time – it is/was way harder to do testing in C/Lua than in Python.

Since that period, I have never stopped writing "tests". When I started to hack on OpenStack, the project was adopting a "no test? no merge!" policy due to the high number of regressions it had during the first releases.

I honestly don't think I could work on any project that does not have – at least a minimal – test coverage. It's impossible to hack efficiently on a code base that you're not able to test in just a simple command. It's also a real problem for new comers in the open source world. When there are no test, you can hack something and send a patch, and get a "you broke this" in response. Nowadays, this kind of response sounds unacceptable to me: if there is no test, then I didn't break anything!

In the end, it's just too much frustration to work on non tested projects as I demonstrated in my study of whisper source code.

What do you think to be the most often seen pitfalls of TDD and how to avoid them best?

The biggest problems are when and at what rate writing tests.

On one hand, some people starts to write too precise tests way too soon. Doing that slows you down, especially when you are prototyping some idea or concept you just had. That does not mean that you should not do test at all, but you should probably start with a light coverage, until you are pretty sure that you're not going to rip every thing and start over. On the other hand, some people postpone writing tests for ever, and end up with no test all or a too thin layer of test. Which makes the project with a pretty low coverage.

Basically, your test coverage should reflect the state of your project. If it's just starting, you should build a thin layer of test so you can hack it on it easily and remodel it if needed. The more your project grow, the more you should make it sold and lay more tests.

Having too detailed tests is painful to make the project evolve at the start. Having not enough in a big project makes it painful to maintain it.

Do you think, TDD fits and scales well for the big projects like OpenStack?

Not only I think it fits and scales well, but I also think it's just impossible to not use TDD in such big projects.

When unit and functional tests coverage was weak in OpenStack – at its beginning – it was just impossible to fix a bug or write a new feature without breaking a lot of things without even noticing. We would release version N, and a ton of old bugs present in N-2 – but fixed in N-1 – were reopened.

For big projects, with a lot of different use cases, configuration options, etc, you need belt and braces. You cannot throw code in a repository thinking it's going to work ever, and you can't afford to test everything manually at each commit. That's just insane.

May 08, 2015

I’ve had this post sitting in my drafts for the last 7 months. The code is stale, but the concepts are correct. I had intended to add some pictures before posting, but it’s clear that won’t happen now. Words are better than nothing, I suppose…

Recently I pushed an intel-gpu-tool tool for modifying GPU frequencies. It’s meant to simplify using the sysfs interfaces, similar to many other such helper tools in Linux (cpupower has become one of my recent favorites). I wrote this fancy getopt example^w^w^w program to address a specific need I had, but saw its uses for the community as well. Some time after upstreaming the tool, I accidentally put the name of this tool into my favorite search engine (yes, after all these years X clipboard still confuses me). Surprisingly I was greeted by a discussion about the tool. None of it was terribly misinformed, but I figured I might set the record straight anyway.


Dynamically changing frequencies is a difficult thing to accomplish. Typically there are multiple clock domains in a chip and coordinating all the necessary ones in order to modify one probably requires a bunch of state which I’d not understand much less be able to explain. To facilitate this (on Gen6+) there is firmware which does whatever it is that firmwares do to change frequencies. When we talk about changing frequencies from a Linux kernel driver perspective, it means we’re asking the firmware for a frequency. It can, and does overrule, balk and ignore our frequency requests

A brief explanation on frequencies in i915

The term that is used within the kernel driver and docs is, “RPS” which is [I believe] short for Render P-States. They are analogous to CPU P-states in that lower numbers are faster, higher number are slower. Conveniently, we only have two numbers, 0, and 1 on GEN. Past that, I don’t know how CPU P-States work, so I’m not sure how much else is similar.

There are roughly 4 generations of RPS:

  1. IPS (not to be confused with Intermediate Pixel Storage). The implementation of this predates my knowledge of this part of the hardware, so I can’t speak much about it. It stopped being a thing after Ironlake (Gen5). I don’t care to look, but you can if you want: drivers/platform/x86/intel_ips.c
  2. RPS (Sandybridge, Ivybridge)
    There are 4 numbers of interest: RP0, RP1, RPn, “hw max”. The first 3 are directly read from a register

    rp_state_cap = I915_READ(GEN6_RP_STATE_CAP);
    dev_priv->rps.rp0_freq          = (rp_state_cap >>  0) & 0xff;
    dev_priv->rps.rp1_freq          = (rp_state_cap >>  8) & 0xff;
    dev_priv->rps.min_freq          = (rp_state_cap >> 16) & 0xff;

    RP0 is the maximum value the driver can request from the firmware. It’s the highest non-overclocked frequency supported.
    RPn is the minimum value the driver can request from the firmware. It’s the lowest frequency supported.
    RP1 is the most efficient frequency.
    hw_max is RP0 if the system does not supports overclocking.
    Otherwise, it is read through a special set of commands where we’re told by the firmware the real max. The overclocking max typically cannot be sustained for long durations due to thermal considerations, but that is transparent to software.

  3. RPS (HSW+)
    Similar to the previous RPS, except there is an RPe (for efficient). I don’t know how it differs from RP1. I just learned about this after posting the tool – but just be aware it’s different. Baytrail and Cherryview also have a concept of RPe.
  4. The Atom based stuff (Baytrail, Cherryview)
    I don’t pretend to know how this works [or doesn’t]. They use similar terms.
    They have the same set of names as the HSW+ RPS.

How the driver uses it

The driver can make requests to the firmware for a desired frequency:

if (IS_HASWELL(dev) || IS_BROADWELL(dev))
        			  GEN6_OFFSET(0) |

This register interface doesn’t provide a way to determine the request was granted other than reading back the current frequency, which is error prone, as explained below.

By default, the driver will request frequencies between the efficient frequency (RP1, or RPe), and the max frequency (RP0 or hwmax) based on the system’s busyness. The busyness can be calculated either by software, or by the hardware. For the former, the driver can periodically read a register to get busyness and make decisions based on that:

render_count = I915_READ(VLV_RENDER_C0_COUNT_REG);

In the latter case, the firmware will itself measure busyness and give the driver an interrupt when it determines that the GPU is sufficiently overworked or under worked. At each interrupt the driver would raise or lower by the smallest step size (typically 50MHz), and continue on its way. The most complex thing we did (which we still managed to screw up) was disable interrupts telling us to go up when we were already at max, and the equivalent for down.

It seems obvious that there are usual trends, if you’re incrementing the frequency you’re more likely to increment again in the near future and such. Since leaving for sabbatical and returning to work on mesa, there has been a lot of complexity added here by Chris Wilson, things which mimic concepts like the ondemand CPU frequency governor. I never looked much into those, so I can’t talk knowledgeably about it – just realize it’s not as naive as it once was, and should do a better job as a result.

The flow seems a bit ping-pongish:

The benefit is though that the firmware can do all the dirty work, and the CPU can sleep. Particularly when there’s nothing going on in the system, that should provide significant power savings.

Why you shouldn’t set –max or –min

First, what does it do? –max and –min lock the GPU frequency to min and max respectively. What this actually means is that in the driver, even if we get interrupts to throttle up, or throttle down, we ignore them (hopefully the driver will disable such useless interrupts, but I am too scared to check). I also didn’t check what this does when we’re using software to determine busyness, but it should be very similar

I should mention now that this is what I’ve inferred through careful observation^w^w random guessing. In the world of thermals and overheating, things can go from good, to bad, to broke faster than the CPU can take an interrupt and adjust things. As a result, the firmware can and will lower frequencies even if it’s previously acknowledges it can give a specific frequency.

As an example if you do:

intel_gpu_frequency --set X
assert (intel_gpu_frequency --get == X)

There is a period in the middle, and after the assert where the firmware may have done who knows what. Furthermore, if we try to lock to –max, the GPU is more likely to hit these overheating conditions and throttle you down. So –max doesn’t even necessarily get you max. It’s sort of an interesting implication there since one would think these same conditions (probably the CPU is heavily used as well) would end up clocking up all the way anyway, and we’d get into the same spot, but I’m not really sure. Perhaps the firmware won’t tell us to throttle up so aggressively when it’s near its thermal limits. Using –max can actually result in non-optimal performance, and I have some good theories why, but I’ll keep them to myself since I really have no proof.

–min on the other hand is equally stupid for a different and more obvious reason. As I said above, it’s guaranteed to provide the worst possible performance and not guaranteed to provide the most optimal power usage.

What you should do with it

The usages are primarily benchmarking and performance debug. Assuming you can sustain decent cooling, locking the GPU frequency to anything will give you the most consistent results from run to run. Presumably max, or near max will be the most optimal.

min is useful to get a measure of what the worst possible performance is to see how it might impact various optimizations. It can help you change the ratio of GPU to CPU frequency drastically (and easily). I don’t actually expect anyone to use this very much.

How not to get shot in the foot

If you are a developer trying to get performance optimizations or measurements which you think can be negatively impacted by the GPU throttling you can set this value. Again because of the thermal considerations, as you go from run torun making tweaks, I’d recommend setting something maybe 75% or so of max – that is a total ballpark figure. When you’re ready to get some absolute numbers, you can try setting –max, and the few frequencies near max to see if you have any unexpected results. Take highest value when done.

The sysfs entries require root for a reason. Any time


On the surface it would seem that the minimum frequency should always use the least amount of power. At idle, I’d assert that is always true. The corollary to that is that at minimum frequency they also finish the workload the slower. Intel GPUs don’t idle for long, they go into RC6. The efficient frequency is a blend of maintaining a low frequency and winning the race to idle. AFAIK, it’s a number selected after a whole lot of tests run – we could ignore it.

Download PDF
Upstreaming requirements for the DRM subsystem are a bit special since Dave Airlie requires a full-blown open-source implementation as a demonstration vehicle for any new interfaces added. I've figured it's better to clear this up once instead of dealing with the fallout from surprises and made a few slides for a training session. Dave reviewed and acked them, hence this should be the up-to-date rules - the old mails and blogs back from when some ARM SoC vendors tried to push drm drivers for blob userspace to upstream are a bit outdated.

Any here's the slides for my gfx kernel upstreaming requirements training.
May 07, 2015

So I composed an email today to the Fedora-desktop mailing list to summarize the feedback we got here on the blog post on
my request for reasons people where not switching to Fedora. Thought I should share it here too for easier access for the
wider community and for the commentators to see that I did take the time to go through and try to summarize the posts.
Of course feel free to comment on this blog if you think I missed something important or if there are other major items we
should be thinking about. We will be discussing this both on the mailing list, in our Workstation working group meetings and at Flock in August. Anyway, I will let you go on to read the email (you can find the thread here if interested.

A couple of weeks ago I blogged about who Fedora Workstation is an integrated system, but also asking for
feedback for why people are not migrating to Fedora Workstation, especially asking about why people would be using
GNOME 3 on another distro. So I got about 140 comments on that post so I thought I should write up a summary and
post here. There was of course a lot of things mentioned, but I will try to keep this summary to what I picked up
as the recurring topics.

So while this of course is a poll consisting of self selected commentators I still think the sample is big enough that we
should take the feedback into serious consideration for our plans going forward. Some of them I even think are already
handled by underway efforts.

Release cadence
Quite a few people mentioned this, ranging from those who wanted to switch us to a rolling release, a tick/tock
release style, to just long release cycles. Probably more people saying they thought the current 6 Month cycle
was just to harrowing than people who wanted rolling releases or tick/tock releases.

3rd Party Software
This was the single most brought up item. With people saying that they stayed on other distros due to the pain of
getting 3rd party software on Fedora. This ranged from drivers (NVidia, Wi-Fi), to media codecs to end user
applications. Width of software available in general was also brought up quite a few times. If anyone is in any doubt
that our current policy here is costing us users I think these comments clearly demonstrates otherwise.

Optimus support
Quite a few people did bring up that our Optimus support wasn’t great. Luckily I know Bastien Nocera is working on
something there based on work by Dave Arlie, so hopefully this is one we can check off soon.

Many people also pointed out that we had no UI for upgrading Fedora.

HiDPI issues
A few comments on various challenges people have with HiDPI screens, especially when dealing with non-GTK3 apps-

Multimonitor support
A few comments that our multimonitor support could be better

SELinux is a pain
A few comments about SELinux still getting in the way at times

Better Android integration
A few people asked for more/better Android device integration features

Built in backup solution
A few people requested we create some kind of integrated backup solution

Also a few concrete requests in terms of applications for Fedora:
choqok for GNOME (microblogging client)

This post is a reaction to and I could not think of a better title.

Before I get into my thoughts, I'd like to quickly recap two arguments presented by two different studiers of communication, as they are essential to my premise.

The first argument is known as "the medium is the message," from Marshall McLuhan. It is, in essence, the concept that the medium in which a message is conveyed is part of the message. "Medium" here is the singular of "media."

The key realization from McLuhan is the idea that, since the message is inalienably embedded in its medium, the process of interpreting the medium is intertwined with interpreting the message. A message restated in a different medium is necessarily a different message from the original. It also means that the ability of a consumer to understand the message is connected to their ability to understand the medium.

The second argument has no nifty name that I know of. It is Hofstadter's assertion, from GEB, that messages are decomposable into (at least) three pieces, which he calls the frame, outer message, and inner message. The frame indicates that a message is a message. The outer message explains how the inner message is structured. The inner message is freeform or formless.

What I'm taking from Hofstadter's model is the idea that, although the frame can identify a message, neither the frame nor the outer message can actually explain how the inner message is to be understood. The GEB examples are all linguistic and describe how no part of a message can explain which language was used to craft it without using the language in question. The outer message is the recognition that the inner message exists and is in a certain language.

I can unify these two ideas. I bet that you can, too. McLuhan's medium is Hofstadter's frame and outer message. A message is not just data, but also the metadata which surrounds it.

Can we talk about Lisp? Not quite. I have one more thing. Consider two people speaking English to each other. We have a medium of speech, a frame of rhythmic audio, and an outer message of English speech. They can understand each other, right? Well, not quite. They might not speak exactly the same dialect of English. They have different life experiences, which means that their internal dictionaries, idioms, figures of speech, and so forth might not line up precisely. As an example, our two speakers might disagree on whether "raspberry" refers to fruit or flatulence. Worse, they might disagree on whether "raspberry" refers to berries! These differences constantly occur as people exchange messages.

Our speakers can certainly negotiate, word by word, any contention, as long as they have some basic words in common. However, this might not be sufficient, in the case of some extremely different English dialects like American English and Scottish English. It also might not be necessary; consider the trope of two foreigners without a common language coming together to barter or trade, or the still-not-understood process of linguistic bootstrapping which an infant performs. Even if the inner message isn't decodable, the outer message and frame can still carry some of the content of the message (since the medium is the message!) and communication can still happen in some limited fashion.

I'll now point out that the idea that "normal" communication is not "limited" is illusory; this sort of impedance mismatch occurs during transmission and receipt of every message, since the precise concepts that a person carries with them are ever-changing. A person might not even be able to communicate effectively with themselves; many people have the surreal, uncanny experience of reading a note which they have written to themselves only a day or two ago, and not understanding what the note was trying to convey.

I would like to now postulate an idea. Since the medium is the message and the deciphering and grokking of messages is imperfect, communication is not just about deciphering messages, but also about discovering the meaning of the message via knowledge of the medium and messenger. To put it shortly and informally, communication is about discovering how others talk, not just about talking.

Okay! Now we're ready to tackle Lisp. What does Lisp source look like? Well, it's lots of lists, which contain other lists, and some atoms, and some numbers. Or, at least, that's what a Lisper sees. Most non-Lispers see parentheses. Lots of parentheses. Lispers like to talk about how the parentheses eventually fade away, leaving pure syntax, metasyntax, metapatterns, and so forth. The language has a simple core, and is readily extended by macros, which are also Lisp code.

I'll get to macros in a second. First, I want to deal with something from the original article. "Think about it: to represent hierarchical data you need two syntactic elements: a token separator and a block delimiter. In S expressions, whitespace is the token separator and parens are the block delimiters. That's it. You can't get more minimal than that." I am obligated to disagree. English, the language of the article, has conventions for semantic blocks, but they are often ignored, omitted, etc. I speak Lojban, which at its most formal has no delimiters nor token separators which are readable to the human eye. Placing spaces in Lojban text is a courtesy to human readers but not needed by computers. Another programming language, Forth, has no block delimiters. It also lacks blocks. Also the token separator is, again, a courtesy; Forth's lexer often reinterprets or ignores whitespace when asked.

In fact, Lisp's syntax is, well, syntax. While relatively simple when compared to many languages, Lisp is still structured and retains syntax that would be completely superfluous were one to speak in the language of digital logic which computers normally use. To use the language from earlier in this post, Lisp's syntax is part of its framing; we identify Lisp code precisely by the preponderance of parentheses. The opening parenthesis signals to both humans and computers alike that a Lisp list is starting.

The article talks about the power to interleave data and code. Code operates on data. Code which is also data can be operated on by metacode, which is also code. A strange loop forms. This is not unlike Hofstadter's suggestion that an inner message could itself contain another outer and inner message. English provides us with ample opportunities to form and observe such loops. Consider this quote: "Within these quotation marks, a new message is possible, which can extend beyond its limits and be interpreted differently from the rest of the message: 'Within these quotation marks, a new message is possible.'" I've decided to not put any nasty logic puzzles into this post, but if I had done so, this would be the spot. Mixing metalevels can be hard.

I won't talk about quoting any longer, other than to note that it's not especially interesting if a system supports quoting. There is a proof that quines are possible in any sufficiently complex (Turing complete, for example) language.

Macros are a kind of code-as-data system which involves reflection and reification, transforming code into data and operating on it before turning the data back into code. In particular, this permits code to generate code. This isn't a good or bad thing. In fact, similar systems are seen across the wider programming ecosystem. C has macros, C++ has templates, Haskell has Template Haskell, object-based systems like Python, Ruby, and JS have metaclass or metaprototype faculties.

Lispers loudly proclaim that macros benefit the expressive power of their code. By adding macros to Lisp code, the code becomes more expressive as it takes on deeper metalevels. This is not unlike the expressive power that code gains as it is factored and changed to be more generic. However, it comes at a cost; macros create dialects.

This "dialect" label could be borrowed from the first part of this post, where I used it to talk about spoken languages. Lispers use it to talk about different Lisps which have different macros, different builtin special forms, etc. Dialects create specialization, both for the code and for those reading and writing the code. This specialization is a direct result of the macros that are incorporated into the dialect.

I should stress that I am not saying that macros are bad. This metapower is neither Good nor Bad, in terms of Quality; it is merely different. Lisp is an environment where macros are accepted. Forth is another such environment; the creator of Forth anticipated that most Forth code would not be ANS FORTH but instead be customized heavily for "the task at hand." A Forth machine's dictionary should be full of words which were created by the programmer specifically for the programmer's needs and desires.

Dialects are evident in many other environments. Besides Forth and Lisp, dialects can be seen in large C++ and Java codebases, where tooling and support libraries make up non-trivial portions of applications. Haskellers are often heard complaining about lenses and Template Haskell magic. Pythonistas tell horror stories of Django, web2py, and Twisted. It's not enough to know a language to be effective in these codebases; it's often necessary to know the precise dialect being used. Without the dialect, a programmer has to infer more meaning from the message; they have to put more effort into deciphering and grokking.

"Surely macros and functions are alike in this regard," you might say, if you were my inner voice. And you would be somewhat right, in that macros and functions are both code. The difference is that a macro is metacode; it is code which operates on code-as-data. This necessarily means that usage of a macro makes every reader of the code change how they interpret the code; a reader must either internalize the macro, adding it to their dictionary, or else reinterpret every usage of the macro in terms of concepts that they already know. (I am expecting a sage nod from you, reader, as you reflect upon instances in your past when you first learned a new word or phrase!) Or, to put it in the terms of yore, the medium is the message, and macros are part of the medium of the inner message. The level of understanding of the reader is improved when they know the macros already!

How can we apply this to improve readability of code? For starters, consider writing code in such a way that quotations and macro applications are explicit, and that it is obvious which macros are being applied to quotations. To quote a great programmer, "Namespaces are one honking great idea." Anything that helps isolate and clarify metacode is good for readability.

Since I'm obligated to mention Monte, I'm going to point out that Monte has a very clever and simple way to write metacode: Monte requires metacode to be called with special quotation marks, and also to be annotated with the name of the dialect to use. This applies to regular expressions, parser generators, XML and JSON, etc.; if a new message is to be embedded inside Monte code, and it should be recognized as such, then Monte provides a metacode system for doing precisely that action. In Monte, this system is called the quasiliteral system, and it is very much like quasiliteral quoting within Lisp, with the two differences I just mentioned: Special quotation marks and a dialect annotation.

I think that that's about the end of this particular ramble. Thanks.

May 04, 2015

A year passed since the first release of The Hacker's Guide to Python in March 2014. A few hundreds copies have been distributed so far, and the feedback is wonderful!

I already wrote extensively about the making of that book last year, and I cannot emphasize enough how this adventure has been amazing so far. That's why I decided a few months ago to update the guide and add some new content.

So let's talk about what's new in this second edition of the book!

First, I obviously fixed a few things. I had some reports about small mistakes and typos which I applied as I received them. Not a lot fortunately, but it's still better to have fewer errors in a book, right?

Then, I updated some of the content. Things changed since I wrote the first chapters of that guide 18 months ago. Therefore I had to rewrite some of the sections and take into account new software or libraries that were released.

At last, I decided to enhance the book with one more interview. I've requested my fellow OpenStack developer Joshua Harlow, who is leading a few interesting Python projects, to join the long list of interviewees in the book. I hope you'll enjoy it!

If you didn't get the book yet, go check it out and use the coupon THGTP2LAUNCH to get 20% off during the next 48 hours!

April 27, 2015

Just like they did for Debian developers before, it is Valve’s way of saying thanks and giving something back to the community. This is great news for all Mesa contributors, now we can play some great Valve games for free and we can also have an easier time looking into bug reports for them, which also works great for Valve, closing a perfect circle :)

April 23, 2015

I just wanted to say that you to everyone who commented and provided Feedback on my last blog entry where I asked for feedback on what would make you switch to Fedora Workstation. We will take all that feedback and incorporate it into our Fedora Workstation 23 planning. So a big thanks to all of you!

April 22, 2015

I've had a half-broken temperature monitoring setup at home for quite some time. It started out with a Atom-based NAS, a USB-serial adapter and a passive 1-wire adapter. It sometimes worked, then stopped working, then started when poked with a stick. Later, the NAS was moved under the stairs and I put a Beaglebone Black in its old place. The temperature monitoring thereafter never really worked, but I didn't have the time to fix it. Over the last few days, I've managed to get it working again, of course by replacing nearly all the existing components.

I'm using the DS18B20 sensors. They're about USD 1 a piece on Ebay (when buying small quantities) and seems to work quite ok.

My first task was to address the reliability problems: Dropouts and really poor performance. I thought the passive adapter was problematic, in particular with the wire lengths I'm using and I therefore wanted to replace it with something else. The BBB has GPIO support, and various blog posts suggested using that. However, I'm running Debian on my BBB which doesn't have support for DTB overrides, so I needed to patch the kernel DTB. (Apparently, DTB overrides are landing upstream, but obviously not in time for Jessie.)

I've never even looked at Device Tree before, but the structure was reasonably simple and with a sample override from bonebrews it was easy enough to come up with my patch. This uses pin 11 (yes, 11, not 13, read the bonebrews article for explanation on the numbering) on the P8 block. This needs to be compiled into a .dtb. I found the easiest way was just to drop the patched .dts into an unpacked kernel tree and then running make dtbs.

Once this works, you need to compile the w1-gpio kernel module, since Debian hasn't yet enabled that. Run make menuconfig, find it under "Device drivers", "1-wire", "1-wire bus master", build it as a module. I then had to build a full kernel to get the symversions right, then build the modules. I think there is or should be an easier way to do that, but as I cross-built it on a fast AMD64 machine, I didn't investigate too much.

Insmod-ing w1-gpio then works, but for me, it failed to detect any sensors. Reading the data sheet, it looked like a pull-up resistor on the data line was needed. I had enabled the internal pull-up, but apparently that wasn't enough, so I added a 4.7kOhm resistor between pin 3 (VDD_3V3) on P9 and pin (GPIO_45) on P8. With that in place, my sensors showed up in /sys/bus/w1/devices and you can read the values using cat.

In my case, I wanted the data to go into collectd and then to graphite. I first tried using an Exec plugin, but never got it to work properly. Using a [python plugin] worked much better and my graphite installation is now showing me temperatures.

Now I just need to add more probes around the house.

The most useful references were

In addition, various searches for DS18B20 pinout and similar, of course.

April 21, 2015

presenter modeGot a talk coming up? Want it to go well? Here’s some starting points.

I give a lot of talks. Often I’m paid to give them, and I regularly get very high ratings or even awards. But every time I listen to people speaking in public for the first time, or maybe the first few times, I think of some very easy ways for them to vastly improve their talks.

Here, I wanted to share my top tips to make your life (and, selfishly, my life watching your talks) much better:

  1. Presenter mode is the greatest invention ever. Use it. If you ignore or forget everything else in this post, remember the rainbows and unicorns of presenter mode. This magical invention keeps the current slide showing on the projector while your laptop shows something different — the current slide, a small image of the next slide, and your slide notes. The last bit is the key. What I put on my notes is the main points of the current slide, followed by my transition to the next slide. Presentations look a lot more natural when you say the transition before you move to the next slide rather than after. More than anything else, presenter mode dramatically cut down on my prep time, because suddenly I no longer had to rehearse. I had seamless, invisible crib notes while I was up on stage.
  2. Plan your intro. Starting strong goes a long way, as it turns out that making a good first impression actually matters. It’s time very well spent to literally script your first few sentences. It helps you get the flow going and get comfortable, so you can really focus on what you’re saying instead of how nervous you are. Avoid jokes unless most of your friends think you’re funny almost all the time. (Hint: they don’t, and you aren’t.)
  3. No bullet points. Ever. (Unless you’re an expert, and you probably aren’t.) We’ve been trained by too many years of boring, sleep-inducing PowerPoint presentations that bullet points equal naptime. Remember presenter mode? Put the bullet points in the slide notes that only you see. If for some reason you think you’re the sole exception to this, at a minimum use visual advances/transitions. (And the only good transition is an instant appear. None of that fading crap.) That makes each point appear on-demand rather than all of them showing up at once.
  4. Avoid text-filled slides. When you put a bunch of text in slides, people inevitably read it. And they read it at a different pace than you’re reading it. Because you probably are reading it, which is incredibly boring to listen to. The two different paces mean they can’t really focus on either the words on the slide or the words coming out of your mouth, and your attendees consequently leave having learned less than either of those options alone would’ve left them with.
  5. Use lots of really large images. Each slide should be a single concept with very little text, and images are a good way to force yourself to do so. Unless there’s a very good reason, your images should be full-bleed. That means they go past the edges of the slide on all sides. My favorite place to find images is a Flickr advanced search for Creative Commons licenses. Google also has this capability within Search Tools. Sometimes images are code samples, and that’s fine as long as you remember to illustrate only one concept — highlight the important part.
  6. Look natural. Get out from behind the podium, so you don’t look like a statue or give the classic podium death-grip (one hand on each side). You’ll want to pick up a wireless slide advancer and make sure you have a wireless lavalier mic, so you can wander around the stage. Remember to work your way back regularly to check on your slide notes, unless you’re fortunate enough to have them on extra monitors around the stage. Talk to a few people in the audience beforehand, if possible, to get yourself comfortable and get a few anecdotes of why people are there and what their background is.
  7. Don’t go over time. You can go under, even a lot under, and that’s OK. One of the best talks I ever gave took 22 minutes of a 45-minute slot, and the rest filled up with Q&A. Nobody’s going to mind at all if you use up 30 minutes of that slot, but cutting into their bathroom or coffee break, on the other hand, is incredibly disrespectful to every attendee. This is what watches, and the timer in presenter mode, and clocks, are for. If you don’t have any of those, ask a friend or make a new friend in the front row.
  8. You’re the centerpiece. The slides are a prop. If people are always looking at the slides rather than you, chances are you’ve made a mistake. Remember, the focus should be on you, the speaker. If they’re only watching the slides, why didn’t you just post a link to Slideshare or Speakerdeck and call it a day?

I’ve given enough talks that I have a good feel on how long my slides will take, and I’m able to adjust on the fly. But if you aren’t sure of that, it might make sense to rehearse. I generally don’t rehearse, because after all, this is the lazy way.

If you can manage to do those 8 things, you’ve already come a long way. Good luck!

Tagged: communication, gentoo

A few months ago, I wrote a long post about what I called back then the "Gnocchi experiment". Time passed and we – me and the rest of the Gnocchi team – continued to work on that project, finalizing it.

It's with a great pleasure that we are going to release our first 1.0 version this month, roughly at the same time that the integrated OpenStack projects release their Kilo milestone. The first release candidate numbered 1.0.0rc1 has been released this morning!

The problem to solve

Before I dive into Gnocchi details, it's important to have a good view of what problems Gnocchi is trying to solve.

Most of the IT infrastructures out there consists of a set of resources. These resources have properties: some of them are simple attributes whereas others might be measurable quantities (also known as metrics).

And in this context, the cloud infrastructures make no exception. We talk about instances, volumes, networks… which are all different kind of resources. The problems that are arising with the cloud trend is the scalability of storing all this data and being able to request them later, for whatever usage.

What Gnocchi provides is a REST API that allows the user to manipulate resources (CRUD) and their attributes, while preserving the history of those resources and their attributes.

Gnocchi is fully documented and the documentation is available online. We are the first OpenStack project to require patches to integrate the documentation. We want to raise the bar, so we took a stand on that. That's part of our policy, the same way it's part of the OpenStack policy to require unit tests.

I'm not going to paraphrase the whole Gnocchi documentation, which covers things like installation (super easy), but I'll guide you through some basics of the features provided by the REST API. I will show you some example so you can have a better understanding of what you could leverage using Gnocchi!

Handling metrics

Gnocchi provides a full REST API to manipulate time-series that are called metrics. You can easily create a metric using a simple HTTP request:

POST /v1/metric HTTP/1.1
Content-Type: application/json
"archive_policy_name": "low"
HTTP/1.1 201 Created
Location: http://localhost/v1/metric/387101dc-e4b1-4602-8f40-e7be9f0ed46a
Content-Type: application/json; charset=UTF-8
"archive_policy": {
"aggregation_methods": [
"back_window": 0,
"definition": [
"granularity": "0:00:01",
"points": 3600,
"timespan": "1:00:00"
"granularity": "0:30:00",
"points": 48,
"timespan": "1 day, 0:00:00"
"name": "low"
"created_by_project_id": "e8afeeb3-4ae6-4888-96f8-2fae69d24c01",
"created_by_user_id": "c10829c6-48e2-4d14-ac2b-bfba3b17216a",
"id": "387101dc-e4b1-4602-8f40-e7be9f0ed46a",
"name": null,
"resource_id": null

The archive_policy_name parameter defines how the measures that are being sent are going to be aggregated. You can also define archive policies using the API and specify what kind of aggregation period and granularity you want. In that case , the low archive policy keeps 1 hour of data aggregated over 1 second and 1 day of data aggregated to 30 minutes. The functions used for aggregations are the mathematical functions standard deviation, minimum, maximum, … and even 95th percentile. All of that is obviously customizable and you can create your own archive policies.

If you don't want to specify the archive policy manually for each metric, you can also create archive policy rule, that will apply a specific archive policy based on the metric name, e.g. metrics matching disk.* will be high resolution metrics so they will use the high archive policy.

It's also worth noting Gnocchi is precise up to the nanosecond and is not tied to the current time. You can manipulate and inject measures that are years old and precise to the nanosecond. You can also inject points with old timestamps (i.e. old compared to the most recent one in the timeseries) with an archive policy allowing it (see back_window parameter).

It's then possible to send measures to this metric:

POST /v1/metric/387101dc-e4b1-4602-8f40-e7be9f0ed46a/measures HTTP/1.1
Content-Type: application/json
"timestamp": "2014-10-06T14:33:57",
"value": 43.1
"timestamp": "2014-10-06T14:34:12",
"value": 12
"timestamp": "2014-10-06T14:34:20",
"value": 2

HTTP/1.1 204 No Content

These measures are synchronously aggregated and stored into the configured storage backend. Our most scalable storage drivers for now are either based on Swift or Ceph which are both scalable storage objects systems.

It's then possible to retrieve these values:

GET /v1/metric/387101dc-e4b1-4602-8f40-e7be9f0ed46a/measures HTTP/1.1
HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8

As older Ceilometer users might notice here, metrics are only storing points and values, nothing fancy such as metadata anymore.

By default, values eagerly aggregated using mean are returned for all supported granularities. You can obviously specify a time range or a different aggregation function using the aggregation, start and stop query parameter.

Gnocchi also supports doing aggregation across aggregated metrics:

GET /v1/aggregation/metric?metric=65071775-52a8-4d2e-abb3-1377c2fe5c55&metric=9ccdd0d6-f56a-4bba-93dc-154980b6e69a&start=2014-10-06T14:34&aggregation=mean HTTP/1.1
HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8

This computes the mean of mean for the metric 65071775-52a8-4d2e-abb3-1377c2fe5c55 and 9ccdd0d6-f56a-4bba-93dc-154980b6e69a starting on 6th October 2014 at 14:34 UTC.

Indexing your resources

Another object and concept that Gnocchi provides is the ability to manipulate resources. There is a basic type of resource, called generic, which has very few attributes. You can extend this type to specialize it, and that's what Gnocchi does by default by providing resource types known for OpenStack such as instance, volume, network or even image.

POST /v1/resource/generic HTTP/1.1
Content-Type: application/json
"id": "75C44741-CC60-4033-804E-2D3098C7D2E9",
"project_id": "BD3A1E52-1C62-44CB-BF04-660BD88CD74D",
"user_id": "BD3A1E52-1C62-44CB-BF04-660BD88CD74D"
HTTP/1.1 201 Created
Location: http://localhost/v1/resource/generic/75c44741-cc60-4033-804e-2d3098c7d2e9
ETag: "e3acd0681d73d85bfb8d180a7ecac75fce45a0dd"
Last-Modified: Fri, 17 Apr 2015 11:18:48 GMT
Content-Type: application/json; charset=UTF-8
"created_by_project_id": "ec181da1-25dd-4a55-aa18-109b19e7df3a",
"created_by_user_id": "4543aa2a-6ebf-4edd-9ee0-f81abe6bb742",
"ended_at": null,
"id": "75c44741-cc60-4033-804e-2d3098c7d2e9",
"metrics": {},
"project_id": "bd3a1e52-1c62-44cb-bf04-660bd88cd74d",
"revision_end": null,
"revision_start": "2015-04-17T11:18:48.696288Z",
"started_at": "2015-04-17T11:18:48.696275Z",
"type": "generic",
"user_id": "bd3a1e52-1c62-44cb-bf04-660bd88cd74d"

The resource is created with the UUID provided by the user. Gnocchi handles the history of the resource, and that's what the revision_start and revision_end fields are for. They indicates the lifetime of this revision of the resource. The ETag and Last-Modified headers are also unique to this resource revision and can be used in a subsequent request using If-Match or If-Not-Match header, for example:

GET /v1/resource/generic/75c44741-cc60-4033-804e-2d3098c7d2e9 HTTP/1.1
If-Not-Match: "e3acd0681d73d85bfb8d180a7ecac75fce45a0dd"
HTTP/1.1 304 Not Modified

Which is useful to synchronize and update any view of the resources you might have in your application.

You can use the PATCH HTTP method to modify properties of the resource, which will create a new revision of the resource. The history of the resources are available via the REST API obviously.

The metrics properties of the resource allow you to link metrics to a resource. You can link existing metrics or create new ones dynamically:

POST /v1/resource/generic HTTP/1.1
Content-Type: application/json
"id": "AB68DA77-FA82-4E67-ABA9-270C5A98CBCB",
"metrics": {
"temperature": {
"archive_policy_name": "low"
"project_id": "BD3A1E52-1C62-44CB-BF04-660BD88CD74D",
"user_id": "BD3A1E52-1C62-44CB-BF04-660BD88CD74D"
HTTP/1.1 201 Created
Location: http://localhost/v1/resource/generic/ab68da77-fa82-4e67-aba9-270c5a98cbcb
ETag: "9f64c8890989565514eb50c5517ff01816d12ff6"
Last-Modified: Fri, 17 Apr 2015 14:39:22 GMT
Content-Type: application/json; charset=UTF-8
"created_by_project_id": "cfa2ebb5-bbf9-448f-8b65-2087fbecf6ad",
"created_by_user_id": "6aadfc0a-da22-4e69-b614-4e1699d9e8eb",
"ended_at": null,
"id": "ab68da77-fa82-4e67-aba9-270c5a98cbcb",
"metrics": {
"temperature": "ad53cf29-6d23-48c5-87c1-f3bf5e8bb4a0"
"project_id": "bd3a1e52-1c62-44cb-bf04-660bd88cd74d",
"revision_end": null,
"revision_start": "2015-04-17T14:39:22.181615Z",
"started_at": "2015-04-17T14:39:22.181601Z",
"type": "generic",
"user_id": "bd3a1e52-1c62-44cb-bf04-660bd88cd74d"

Haystack, needle? Find!

With such a system, it becomes very easy to index all your resources, meter them and retrieve this data. What's even more interesting is to query the system to find and list the resources you are interested in!

You can search for a resource based on any field, for example:

POST /v1/search/resource/instance HTTP/1.1
Content-Type: application/json
"=": {
"user_id": "bd3a1e52-1c62-44cb-bf04-660bd88cd74d"

That query will return a list of all resources owned by the user_id bd3a1e52-1c62-44cb-bf04-660bd88cd74d.

You can do fancier queries such as retrieving all the instances started by a user this month:

POST /v1/search/resource/instance HTTP/1.1
Content-Type: application/json
Content-Length: 113
"and": [
"=": {
"user_id": "bd3a1e52-1c62-44cb-bf04-660bd88cd74d"
">=": {
"started_at": "2015-04-01"

And you can even do fancier queries than the fancier ones (still following?). What if we wanted to retrieve all the instances that were on host foobar the 15th April and who had already 30 minutes of uptime? Let's ask Gnocchi to look in the history!

POST /v1/search/resource/instance?history=true HTTP/1.1
Content-Type: application/json
Content-Length: 113
"and": [
"=": {
"host": "foobar"
">=": {
"lifespan": "1 hour"
"<=": {
"revision_start": "2015-04-15"

I could also mention the fact that you can search for value in metrics. One feature that I will very likely include in Gnocchi 1.1 is the ability to search for resource whose specific metrics matches some value. For example, having the ability to search for instances whose CPU consumption was over 80% during a month.

Cherries on the cake

While Gnocchi is well integrated and based on common OpenStack technology, please do note that it is completely able to function without any other OpenStack component and is pretty straight-forward to deploy.

Gnocchi also implements a full RBAC system based on the OpenStack standard oslo.policy and which allows pretty fine grained control of permissions.

There is also some work ongoing to have HTML rendering when browsing the API using a Web browser. While still simple, we'd like to have a minimal Web interface served on top of the API for the same price!

Ceilometer alarm subsystem supports Gnocchi with the Kilo release, meaning you can use it to trigger actions when a metric value crosses some threshold. And OpenStack Heat also supports auto-scaling your instances based on Ceilometer+Gnocchi alarms.

And there are a few more API calls that I didn't talk about here, so don't hesitate to take a peek at the full documentation!

Towards Gnocchi 1.1!

Gnocchi is a different beast in the OpenStack community. It is under the umbrella of the Ceilometer program, but it's one of the first projects that is not part of the (old) integrated release. Therefore we decided to have a release schedule not directly linked to the OpenStack and we'll release more often that the rest of the old OpenStack components – probably once every 2 months or the like.

What's coming next is a close integration with Ceilometer (e.g. moving the dispatcher code from Gnocchi to Ceilometer) and probably more features as we have more requests from our users. We are also exploring different backends such as InfluxDB (storage) or MongoDB (indexer).

Stay tuned, and happy hacking!